[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

João Pedro Jericó updated SPARK-20294:
--
Description:

Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example:

{code}
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
{code}

This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD.

This is probably a problem with the other Spark APIs however I don't have the knowledge to look at the source code for other languages.

was:

Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD.

This is probably a problem with the other Spark APIs however I don't have the knowledge to look at the source code for other languages.
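To put a number on "high probability" in the description above, here is a minimal pure-Python sketch. It assumes that sampling without replacement keeps each row independently with probability equal to the sampling ratio (Bernoulli sampling), which is my reading of how `.sample` behaves in this case and is not confirmed by the report. The helper names `prob_empty_sample` and `bernoulli_sample` are hypothetical and are not part of PySpark.

```python
import random


def prob_empty_sample(n_rows, fraction):
    """Probability that no row survives when each of n_rows rows is
    kept independently with probability `fraction` (assumed model)."""
    return (1.0 - fraction) ** n_rows


def bernoulli_sample(rows, fraction, rng):
    """Hypothetical stand-in for sampling without replacement:
    keep each row independently with probability `fraction`."""
    return [row for row in rows if rng.random() < fraction]


# The two-row dataset from the report, sampled at a ratio of 0.01.
rows = [(1, 2), (2, "foo")]
p_empty = prob_empty_sample(len(rows), 0.01)  # 0.99 ** 2 = 0.9801

# Cross-check the closed form with a seeded simulation.
rng = random.Random(0)
trials = 10_000
empty_runs = sum(1 for _ in range(trials) if not bernoulli_sample(rows, 0.01, rng))
empirical = empty_runs / trials

print(f"analytic={p_empty:.4f} empirical={empirical:.4f}")
```

Under this assumed model, the two-row example yields an empty sample about 98% of the time, which is consistent with the failure described in the report.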
> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: João Pedro Jericó
> Priority: Minor
>
> Currently the _inferSchema function on
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
> line 354 fails if applied to an RDD for which the sample call returns an
> empty RDD. This is possible for example if one has a small RDD but that needs
> the schema to be inferred by more than one Row. For example:
> {code}
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> {code}
> This will fail with high probability because when sampling the small_rdd with
> the .sample method it will return an empty RDD most of the time. However,
> this is not the desired result because we are able to sample at least 1% of
> the RDD.
> This is probably a problem with the other Spark APIs however I don't have the
> knowledge to look at the source code for other languages.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

João Pedro Jericó updated SPARK-20294:
--
Description:

Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD.

This is probably a problem with the other Spark APIs however I don't have the knowledge to look at the source code for other languages.

was:

Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD.
> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: João Pedro Jericó
> Priority: Minor
>
> Currently the _inferSchema function on
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
> line 354 fails if applied to an RDD for which the sample call returns an
> empty RDD. This is possible for example if one has a small RDD but that needs
> the schema to be inferred by more than one Row. For example:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability because when sampling the small_rdd with
> the .sample method it will return an empty RDD most of the time. However,
> this is not the desired result because we are able to sample at least 1% of
> the RDD.
> This is probably a problem with the other Spark APIs however I don't have the
> knowledge to look at the source code for other languages.
[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD
[ https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

João Pedro Jericó updated SPARK-20294:
--
Description:

Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with the .sample method it will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD.

was:

Currently the _inferSchema function on [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354) line 354 fails if applied to an RDD for which the sample call returns an empty RDD. This is possible for example if one has a small RDD but that needs the schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with the .sample method will return an empty RDD most of the time. However, this is not the desired result because we are able to sample at least 1% of the RDD.
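One way to picture a guard against the empty-sample case described above is the pure-Python sketch below. It is only an illustration under stated assumptions: it models the sample as independent per-row selection, and the helpers `sample_with_fallback` and `infer_column_types` are hypothetical names that do not exist in PySpark. It is not the fix adopted for this issue, if any.

```python
import random


def sample_with_fallback(rows, fraction, rng):
    """Keep each row independently with probability `fraction` (assumed
    sampling model); if nothing is selected, fall back to the first row
    so that later inference always has at least one row to work with."""
    if not rows:
        raise ValueError("cannot infer a schema from an empty dataset")
    sampled = [row for row in rows if rng.random() < fraction]
    return sampled if sampled else [rows[0]]


def infer_column_types(rows):
    """Toy stand-in for schema inference: collect the Python type names
    observed at each column position."""
    columns = {}
    for row in rows:
        for index, value in enumerate(row):
            columns.setdefault(index, set()).add(type(value).__name__)
    return columns


rows = [(1, 2), (2, "foo")]
sampled = sample_with_fallback(rows, 0.01, random.Random(0))
print(infer_column_types(sampled))
```

A fallback like this only avoids the failure. When it falls back to the first row it sees a single row, so in the example it would miss that the second column also holds a string. Inferring from more than one Row, as the report asks, would still need a larger guaranteed sample.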
> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: João Pedro Jericó
> Priority: Minor
>
> Currently the _inferSchema function on
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
> line 354 fails if applied to an RDD for which the sample call returns an
> empty RDD. This is possible for example if one has a small RDD but that needs
> the schema to be inferred by more than one Row. For example:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability because when sampling the small_rdd with
> the .sample method it will return an empty RDD most of the time. However,
> this is not the desired result because we are able to sample at least 1% of
> the RDD.