[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Darcy Shen updated SPARK-34771:
-------------------------------
    Description: 
{code:python}
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd

pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.toPandas()
{code}

With `spark.sql.execution.arrow.enabled` = false, the snippet above works fine and emits no warnings.

With `spark.sql.execution.arrow.enabled` = true, the snippet still completes, but it emits warnings: because ExamplePointUDT is an unsupported type in the conversion to Arrow, the Arrow optimization is silently turned off and both createDataFrame and toPandas fall back to the non-Arrow code path.

Detailed steps to reproduce:

{code:python}
$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> from pyspark.testing.sqlutils import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
>>>
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+

>>> df.schema
StructType(List(StructField(point,ExamplePointUDT,true)))
>>> df.toPandas()
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion to Arrow: ExamplePointUDT
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
       point
0  (0.0,0.0)
1  (0.0,0.0)
{code}
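For reference, the root cause can be reproduced without Spark at all. A minimal sketch (assuming only pandas and pyarrow are installed; ExamplePoint should be importable from pyspark.testing.sqlutils without an active SparkSession):

{code:python}
import pandas as pd
import pyarrow as pa

from pyspark.testing.sqlutils import ExamplePoint

series = pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])

try:
    # pyarrow has no converter registered for arbitrary Python objects such
    # as a UDT instance, so Arrow type inference fails here with the same
    # error that the createDataFrame fallback warning reports above.
    pa.array(series, from_pandas=True)
except pa.ArrowInvalid as e:
    print(e)
{code}

This prints the same "Could not convert (1,1) with type ExamplePoint" error as the warning above, which is why the Arrow path has to bail out before any data is transferred.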
Added a traceback:

{code:python}
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 319, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 451, in _create_from_pandas_with_arrow
    arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
  File "pyarrow/types.pxi", line 1317, in pyarrow.lib.Schema.from_pandas
  File "/Users/da/opt/miniconda3/envs/spark/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 529, in dataframe_to_types
    type_ = pa.array(c, from_pandas=True).type
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
{code}
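Until UDTs are supported on the Arrow path, one possible workaround is to serialize the UDT column by hand before calling createDataFrame, so the frame only contains types Arrow can infer. A sketch, assuming the UDT's serialized form (a list of the point's coordinates for ExamplePointUDT) is acceptable downstream, since the UDT type information is lost:

{code:python}
import pandas as pd

from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

udt = ExamplePointUDT()
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})

# Map each UDT instance to its serialized sqlType value; Arrow can infer a
# plain list type for it without any trouble.
serialized = pdf.assign(point=pdf['point'].map(udt.serialize))

df = spark.createDataFrame(serialized)  # should now take the Arrow path (sketch, untested)
{code}

The resulting column is a plain array column rather than ExamplePointUDT, so this only sidesteps the problem; proper support would need the Arrow conversion to serialize and deserialize through the UDT automatically.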

> Support UDT for Pandas with Arrow Optimization
> ----------------------------------------------
>
>                 Key: SPARK-34771
>                 URL: https://issues.apache.org/jira/browse/SPARK-34771
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.0.2, 3.1.1
>            Reporter: Darcy Shen
>            Priority: Major
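For anyone triaging this: the silent fallback is only visible through the UserWarning, so a test can assert on it. A sketch using only the standard library (the helper name is illustrative, not part of any Spark API):

{code:python}
import warnings

def collect_arrow_fallbacks(fn, *args, **kwargs):
    """Run fn and return (result, list of Arrow-fallback warnings it raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = fn(*args, **kwargs)
    fallbacks = [w for w in caught
                 if issubclass(w.category, UserWarning)
                 and "Attempting non-optimization" in str(w.message)]
    return result, fallbacks

# df, warns = collect_arrow_fallbacks(spark.createDataFrame, pdf)
# A non-empty warns means Arrow was attempted but silently turned off.
{code}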
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org