João Pedro Jericó created SPARK-20294:
-----------------------------------------

             Summary: _inferSchema for RDDs fails if sample returns empty RDD
                 Key: SPARK-20294
                 URL: https://issues.apache.org/jira/browse/SPARK-20294
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.0
            Reporter: João Pedro Jericó
            Priority: Minor


Currently the _inferSchema function in 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 (line 354) fails when applied to an RDD for which the sample call returns an 
empty RDD. This can happen, for example, with a small RDD whose schema has to 
be inferred from more than one Row:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This fails with high probability because calling the .sample method on 
small_rdd returns an empty RDD most of the time. However, this is not the 
desired result, because it should be possible to sample at least 1% of the RDD.
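
To make "high probability" concrete, here is a rough back-of-the-envelope sketch. It assumes that sampling without replacement keeps each row independently with probability equal to the sampling ratio (Bernoulli sampling). The helper name `prob_empty_sample` is hypothetical and only for illustration; it is not part of PySpark.

```python
def prob_empty_sample(n_rows, sampling_ratio):
    """Probability that a sample keeps no rows at all.

    Assumption: each row is kept independently with probability
    `sampling_ratio` (Bernoulli sampling without replacement).
    """
    return (1.0 - sampling_ratio) ** n_rows


# The two-row RDD from the example above, sampled at 1%.
p = prob_empty_sample(n_rows=2, sampling_ratio=0.01)
print("P(empty sample) = %.4f" % p)  # prints "P(empty sample) = 0.9801"
```

Under that assumption, the sample for the two-row RDD is empty about 98% of the time, so schema inference has nothing to work with in almost every run.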



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

