[ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965568#comment-15965568
 ] 

João Pedro Jericó edited comment on SPARK-20294 at 4/12/17 8:56 AM:
--------------------------------------------------------------------

Yes, if the sampling ratio is not given it infers from the first row only... still, maybe 
I'm nitpicking, but my use case here is that I'm working with RDDs pulled from 
data that can vary from ~50 rows to 100k+, and I need to run a 
flatMap operation on them and then convert them back to a DataFrame. However, the 
schema of the first row is not necessarily the schema of the whole dataset, 
because some Rows are missing some entries. The solution I used was to set the 
samplingRatio to 0.01, which works very well except for RDDs below 100 
entries, where the sampling ratio is so small that the sample has a good chance 
of coming back empty and failing.

The workaround I came up with was to set the sampling ratio as {code:python} 
min(100., N) / N {code}, which is either 1% of the RDD or everything if the 
RDD is smaller than 100 rows, but I think this is not ideal. If we know that 
the RDD is not empty (the function tests for that before line 354), then we 
should at least use the first row as a fallback if the sampling fails.
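
A minimal sketch of that workaround (the helper name is hypothetical; in practice 
the returned value would be passed as samplingRatio to toDF / createDataFrame, 
and N would come from rdd.count()):

```python
def safe_sampling_ratio(n, target_rows=100.0):
    """Sampling ratio covering ~1% of a large RDD, but the whole RDD
    when it has fewer than `target_rows` rows.

    Hypothetical helper illustrating min(100., N) / N; `n` would be
    rdd.count() in practice."""
    return min(target_rows, float(n)) / float(n)

# For a large RDD this is 1%; for a small one it samples everything, e.g.:
#   small_rdd.toDF(samplingRatio=safe_sampling_ratio(small_rdd.count()))
```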



> _inferSchema for RDDs fails if sample returns empty RDD
> -------------------------------------------------------
>
>                 Key: SPARK-20294
>                 URL: https://issues.apache.org/jira/browse/SPARK-20294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>            Reporter: João Pedro Jericó
>            Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This can happen, for example, with a small RDD whose schema needs 
> to be inferred from more than one Row:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability, because sampling the small_rdd with 
> the .sample method will return an empty RDD most of the time. However, 
> this is not the desired result, because we should be able to sample at 
> least 1% of the RDD (i.e., at least one row).
> This is probably a problem in the other Spark APIs as well; however, I don't 
> have the knowledge to look at the source code for the other languages.
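
To illustrate why the example above fails with high probability: with a 
per-element Bernoulli sample at fraction p, the chance that all n rows are 
dropped is roughly (1 - p)^n. A quick check in plain Python (an illustrative 
approximation, not Spark's actual sampler):

```python
def p_empty_sample(n, p):
    """Approximate probability that Bernoulli sampling with fraction p
    drops every one of n rows, leaving an empty sample."""
    return (1.0 - p) ** n

# Two rows sampled at 1%: the sample is empty roughly 98% of the time,
# so toDF(samplingRatio=0.01) on a 2-row RDD almost always has no rows
# to infer a schema from.
prob = p_empty_sample(2, 0.01)
```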



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
