[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

João Pedro Jericó updated SPARK-20294:
--
Description: 
Currently the _inferSchema function in 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 (line 354) fails when applied to an RDD for which the sample call returns an empty 
RDD. This can happen with a small RDD whose schema must be inferred from more 
than one Row. For example:

{code}
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
{code}

This fails with high probability because sampling small_rdd with the .sample 
method returns an empty RDD most of the time. However, this is not the desired 
result, because it is possible to sample at least 1% of the RDD.
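
To make the failure probability concrete, here is a minimal pure-Python sketch. It assumes that sampling without replacement keeps each row independently with probability equal to the sampling ratio; the helper names `prob_empty_sample` and `sample_with_fallback` are hypothetical and are not part of PySpark. The fallback is only one possible way to avoid an empty sample, not the fix adopted by the project.

```python
import random


def prob_empty_sample(n_rows, ratio):
    """Probability that independent per-row sampling keeps no row at all."""
    return (1.0 - ratio) ** n_rows


def sample_with_fallback(rows, ratio, seed=None):
    """Hypothetical helper: sample rows, but never return an empty list
    when the input itself is non-empty."""
    rng = random.Random(seed)
    sampled = [row for row in rows if rng.random() < ratio]
    # Fall back to the first row so schema inference has something to inspect.
    return sampled if sampled else rows[:1]


rows = [(1, 2), (2, 'foo')]
print(prob_empty_sample(len(rows), 0.01))  # roughly 0.98 for two rows
print(sample_with_fallback(rows, 0.0))     # falls back to [(1, 2)]
```

Note that falling back to a single row only avoids the empty-sample failure; it does not guarantee that the inferred schema reflects every row.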

This is probably a problem in the other Spark language APIs as well; however, 
I don't have the knowledge to look at the source code for the other languages.

  was:
Currently the _inferSchema function on 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 line 354 fails if applied to an RDD for which the sample call returns an empty 
RDD. This is possible for example if one has a small RDD but that needs the 
schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with 
the .sample method it will return an empty RDD most of the time. However, this 
is not the desired result because we are able to sample at least 1% of the RDD.

This is probably a problem with the other Spark APIs however I don't have the 
knowledge to look at the source code for other languages.


> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: João Pedro Jericó
> Priority: Minor
>
> Currently the _inferSchema function in 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  (line 354) fails when applied to an RDD for which the sample call returns an 
> empty RDD. This can happen with a small RDD whose schema must be inferred 
> from more than one Row. For example:
> {code}
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> {code}
> This fails with high probability because sampling small_rdd with the .sample 
> method returns an empty RDD most of the time. However, this is not the 
> desired result, because it is possible to sample at least 1% of the RDD.
> This is probably a problem in the other Spark language APIs as well; however, 
> I don't have the knowledge to look at the source code for the other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

João Pedro Jericó updated SPARK-20294:
--
Description: 
Currently the _inferSchema function on 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 line 354 fails if applied to an RDD for which the sample call returns an empty 
RDD. This is possible for example if one has a small RDD but that needs the 
schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with 
the .sample method it will return an empty RDD most of the time. However, this 
is not the desired result because we are able to sample at least 1% of the RDD.

This is probably a problem with the other Spark APIs however I don't have the 
knowledge to look at the source code for other languages.

  was:
Currently the _inferSchema function on 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 line 354 fails if applied to an RDD for which the sample call returns an empty 
RDD. This is possible for example if one has a small RDD but that needs the 
schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with 
the .sample method it will return an empty RDD most of the time. However, this 
is not the desired result because we are able to sample at least 1% of the RDD.


> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: João Pedro Jericó
> Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This is possible for example if one has a small RDD but that needs 
> the schema to be inferred by more than one Row. For example:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability because when sampling the small_rdd with 
> the .sample method it will return an empty RDD most of the time. However, 
> this is not the desired result because we are able to sample at least 1% of 
> the RDD.
> This is probably a problem with the other Spark APIs however I don't have the 
> knowledge to look at the source code for other languages.




[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-11 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

João Pedro Jericó updated SPARK-20294:
--
Description: 
Currently the _inferSchema function on 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 line 354 fails if applied to an RDD for which the sample call returns an empty 
RDD. This is possible for example if one has a small RDD but that needs the 
schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with 
the .sample method it will return an empty RDD most of the time. However, this 
is not the desired result because we are able to sample at least 1% of the RDD.

  was:
Currently the _inferSchema function on 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 line 354 fails if applied to an RDD for which the sample call returns an empty 
RDD. This is possible for example if one has a small RDD but that needs the 
schema to be inferred by more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because when sampling the small_rdd with 
the .sample method will return an empty RDD most of the time. However, this is 
not the desired result because we are able to sample at least 1% of the RDD.


> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.0
> Reporter: João Pedro Jericó
> Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This is possible for example if one has a small RDD but that needs 
> the schema to be inferred by more than one Row. For example:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability because when sampling the small_rdd with 
> the .sample method it will return an empty RDD most of the time. However, 
> this is not the desired result because we are able to sample at least 1% of 
> the RDD.



