[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Darcy Shen updated SPARK-34771:
-------------------------------
    Description: 
{code:python}
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd

pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)
df.toPandas()
{code}

With `spark.sql.execution.arrow.enabled` = false, the snippet above works fine and emits no warnings.

With `spark.sql.execution.arrow.enabled` = true, the snippet still completes, but it emits warnings: because ExamplePointUDT is an unsupported type in the conversion to Arrow, the Arrow optimization is silently turned off and both createDataFrame and toPandas fall back to the non-Arrow code path.

Detailed steps to reproduce:

{code:python}
$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> from pyspark.testing.sqlutils import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
>>>
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+

>>> df.schema
StructType(List(StructField(point,ExamplePointUDT,true)))
>>> df.toPandas()
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion to Arrow: ExamplePointUDT
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
       point
0  (0.0,0.0)
1  (0.0,0.0)
{code}
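For reference, the root cause can be reproduced without Spark at all. A minimal sketch (assuming only pandas and pyarrow are installed; ExamplePoint should be importable from pyspark.testing.sqlutils without an active SparkSession):

{code:python}
import pandas as pd
import pyarrow as pa

from pyspark.testing.sqlutils import ExamplePoint

series = pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])

try:
    # pyarrow has no converter registered for arbitrary Python objects such
    # as a UDT instance, so Arrow type inference fails here with the same
    # error that the createDataFrame fallback warning reports above.
    pa.array(series, from_pandas=True)
except pa.ArrowInvalid as e:
    print(e)
{code}

This prints the same "Could not convert (1,1) with type ExamplePoint" error as the warning above, which is why the Arrow path has to bail out before any data is transferred.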
Added a traceback:

{code:python}
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 319, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 451, in _create_from_pandas_with_arrow
    arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
  File "pyarrow/types.pxi", line 1317, in pyarrow.lib.Schema.from_pandas
  File "/Users/da/opt/miniconda3/envs/spark/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 529, in dataframe_to_types
    type_ = pa.array(c, from_pandas=True).type
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
{code}
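Until UDTs are supported on the Arrow path, one possible workaround is to serialize the UDT column by hand before calling createDataFrame, so the frame only contains types Arrow can infer. A sketch, assuming the UDT's serialized form (a list of the point's coordinates for ExamplePointUDT) is acceptable downstream, since the UDT type information is lost:

{code:python}
import pandas as pd

from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

udt = ExamplePointUDT()
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})

# Map each UDT instance to its serialized sqlType value; Arrow can infer a
# plain list type for it without any trouble.
serialized = pdf.assign(point=pdf['point'].map(udt.serialize))

df = spark.createDataFrame(serialized)  # should now take the Arrow path (sketch, untested)
{code}

The resulting column is a plain array column rather than ExamplePointUDT, so this only sidesteps the problem; proper support would need the Arrow conversion to serialize and deserialize through the UDT automatically.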

> Support UDT for Pandas with Arrow Optimization
> ----------------------------------------------
>
>                 Key: SPARK-34771
>                 URL: https://issues.apache.org/jira/browse/SPARK-34771
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.0.2, 3.1.1
>            Reporter: Darcy Shen
>            Priority: Major
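For anyone triaging this: the silent fallback is only visible through the UserWarning, so a test can assert on it. A sketch using only the standard library (the helper name is illustrative, not part of any Spark API):

{code:python}
import warnings

def collect_arrow_fallbacks(fn, *args, **kwargs):
    """Run fn and return (result, list of Arrow-fallback warnings it raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = fn(*args, **kwargs)
    fallbacks = [w for w in caught
                 if issubclass(w.category, UserWarning)
                 and "Attempting non-optimization" in str(w.message)]
    return result, fallbacks

# df, warns = collect_arrow_fallbacks(spark.createDataFrame, pdf)
# A non-empty warns means Arrow was attempted but silently turned off.
{code}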
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org