[ 
https://issues.apache.org/jira/browse/SPARK-27778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27778:
------------------------------------

    Assignee: David Vogelbacher

> toPandas with arrow enabled fails for DF with no partitions
> -----------------------------------------------------------
>
>                 Key: SPARK-27778
>                 URL: https://issues.apache.org/jira/browse/SPARK-27778
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: David Vogelbacher
>            Assignee: David Vogelbacher
>            Priority: Major
>
> Calling {{toPandas}} with {{spark.sql.execution.arrow.enabled: true}} fails for 
> dataframes with no partitions, raising an {{EOFError}}. With 
> {{spark.sql.execution.arrow.enabled: false}} the conversion succeeds.
> Repro (on current master branch):
> {noformat}
> >>> from pyspark.sql.types import *
> >>> schema = StructType([StructField("field1", StringType(), True)])
> >>> df = spark.createDataFrame(sc.emptyRDD(), schema)
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> >>> df.toPandas()
> /Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py:2162: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error 
> below and can not continue. Note that 
> 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on 
> failures in the middle of computation.
>   warnings.warn(msg)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2143, in toPandas
>     batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2205, in _collectAsArrow
>     results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 210, in load_stream
>     num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 810, in read_int
>     raise EOFError
> EOFError
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> >>> df.toPandas()
> Empty DataFrame
> Columns: [field1]
> Index: []
> {noformat}
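> The traceback bottoms out in {{read_int}}, which raises {{EOFError}} once the 
> socket stream is exhausted. A rough sketch of why a zero-partition dataframe 
> triggers this (the {{read_int}} here is a simplified stand-in for 
> {{pyspark.serializers.read_int}}, not the actual implementation, and the empty 
> in-memory stream is an assumption standing in for the socket):
>
> ```python
> import io
> import struct
>
> def read_int(stream):
>     # Simplified stand-in: read a 4-byte big-endian int, raising
>     # EOFError when the stream yields no bytes at all -- which is
>     # what happens when the JVM side has zero partitions and never
>     # writes any Arrow batches (or a batch count) to the socket.
>     data = stream.read(4)
>     if not data:
>         raise EOFError
>     return struct.unpack("!i", data)[0]
>
> # An empty stream stands in for the socket in the no-partition case.
> empty = io.BytesIO(b"")
> try:
>     read_int(empty)
> except EOFError:
>     print("EOFError: no batches were written to the stream")
> ```
>
> With at least one partition the stream would begin with a length/count 
> integer, so {{read_int}} would succeed; the fix presumably needs the 
> server side to write a well-formed (empty) Arrow stream even when there 
> are no partitions.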



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
