[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965257#comment-15965257 ]
Hyukjin Kwon commented on SPARK-20294:
--------------------------------------

Just for reference:

{code}
>>> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
>>> small_rdd.toDF(sampleRatio=0.01).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/session.py", line 57, in toDF
    return sparkSession.createDataFrame(self, schema, sampleRatio)
  File ".../spark/python/pyspark/sql/session.py", line 524, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File ".../spark/python/pyspark/sql/session.py", line 364, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File ".../spark/python/pyspark/sql/session.py", line 356, in _inferSchema
    schema = rdd.map(_infer_schema).reduce(_merge_type)
  File ".../spark/python/pyspark/rdd.py", line 838, in reduce
    raise ValueError("Can not reduce() empty RDD")
ValueError: Can not reduce() empty RDD
{code}

> _inferSchema for RDDs fails if sample returns empty RDD
> -------------------------------------------------------
>
>                 Key: SPARK-20294
>                 URL: https://issues.apache.org/jira/browse/SPARK-20294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: João Pedro Jericó
>            Priority: Minor
>
> Currently the _inferSchema function in
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
> (line 354) fails if applied to an RDD for which the sample call returns an
> empty RDD. This can happen, for example, with a small RDD whose schema needs
> more than one Row to be inferred:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(sampleRatio=0.01).show()
> ```
> This will fail with high probability, because sampling small_rdd with the
> .sample method returns an empty RDD most of the time. This is not the
> desired result, because we should be able to sample at least 1% of the RDD.
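The guard the issue asks for can be sketched in plain Python. This is not PySpark's actual code: {{infer_schema_with_fallback}} and {{_merge_simple}} are hypothetical stand-ins for {{_inferSchema}} and {{_merge_type}}, and a list comprehension stands in for Bernoulli {{RDD.sample}}. The point is only the empty-sample fallback, assuming falling back to the full data is an acceptable fix:

```python
import random

def _merge_simple(a, b):
    # Toy stand-in for PySpark's _merge_type: widen to 'string' on conflict.
    return a if a == b else "string"

def infer_schema_with_fallback(rows, sampling_ratio, seed=None):
    """Sketch of the guard SPARK-20294 suggests: if Bernoulli sampling
    leaves nothing to infer from, fall back to the full data instead of
    letting reduce() fail on an empty collection."""
    rng = random.Random(seed)
    # Each row is kept independently with probability sampling_ratio,
    # mimicking RDD.sample(False, sampling_ratio).
    sampled = [r for r in rows if rng.random() < sampling_ratio]
    if not sampled:  # the proposed fix: never infer from an empty sample
        sampled = rows
    # Infer a per-column type tuple for each row, then merge pairwise.
    types = [tuple(type(v).__name__ for v in r) for r in sampled]
    merged = types[0]
    for t in types[1:]:
        merged = tuple(_merge_simple(x, y) for x, y in zip(merged, t))
    return merged

# With a 1% ratio on a 2-row dataset the sample is almost always empty;
# the fallback still yields a merged schema instead of a ValueError.
print(infer_schema_with_fallback([(1, 2), (2, 'foo')], 0.01, seed=0))
```

An alternative design would be to resample with a larger ratio until the sample is non-empty, which keeps the cost bounded on large RDDs; the full-data fallback above is only cheap when the RDD is small, which is exactly the case that triggers the bug.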
> This is probably a problem in the other Spark APIs as well, but I don't have
> the knowledge to look at the source code for the other languages.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)