[ https://issues.apache.org/jira/browse/SPARK-35211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Darcy Shen updated SPARK-35211:
-------------------------------
    Description:

{code:java}
$ pip freeze
certifi==2020.12.5
coverage==5.5
flake8==3.9.0
mccabe==0.6.1
mypy==0.812
mypy-extensions==0.4.3
numpy==1.20.1
pandas==1.2.3
pyarrow==2.0.0
pycodestyle==2.7.0
pyflakes==2.3.0
python-dateutil==2.8.1
pytz==2021.1
scipy==1.6.1
six==1.15.0
typed-ast==1.4.2
typing-extensions==3.7.4.3
xmlrunner==1.7.7
{code}

{code}
(spark) ➜ spark git:(master) bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/24 15:51:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.12:4040
Spark context available as 'sc' (master = local[*], app id = local-1619250689842).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
>>> from pyspark.testing.sqlutils import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
>>> df = spark.createDataFrame(pdf)
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+
>>> df.toPandas()
       point
0  (0.0,0.0)
1  (0.0,0.0)
{code}

The correct result should be:

{code}
       point
0  (1.0,1.0)
1  (2.0,2.0)
{code}
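Condensed, the failing session is equivalent to this standalone script. This is a sketch, not part of the original report; it assumes a local Spark 3.2.0-SNAPSHOT build with the test-only pyspark.testing.sqlutils module importable:

{code:python}
# Minimal standalone reproduction, condensed from the REPL transcript above.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.testing.sqlutils import ExamplePoint  # test-only UDT shipped with Spark

spark = SparkSession.builder.master("local[*]").appName("SPARK-35211-repro").getOrCreate()

# Force the plain (non-Arrow) pandas conversion path, where the bug shows up.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")

pdf = pd.DataFrame({"point": pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
df = spark.createDataFrame(pdf)

df.show()             # buggy: every row prints (0.0, 0.0)
print(df.toPandas())  # buggy: the round-trip also yields (0.0,0.0)

spark.stop()
{code}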
The following code snippet works fine:

{code}
(spark) ➜ spark git:(sadhen/SPARK-35211) ✗ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/24 17:08:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.12:4040
Spark context available as 'sc' (master = local[*], app id = local-1619255290637).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
>>> from pyspark.testing.sqlutils import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])})
>>> df = spark.createDataFrame(pdf)
>>> df.show()
+----------+
|     point|
+----------+
|(1.0, 1.0)|
|(2.0, 2.0)|
+----------+
{code}
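For context, ExamplePoint carries a UserDefinedType (UDT) that tells Spark how to convert between the Python object and its SQL representation; the all-zero output above is consistent with those hooks being bypassed on the non-Arrow pandas path. Below is a minimal sketch of the UDT pattern, with hypothetical Point/PointUDT names modeled on ExamplePoint; the real ExamplePointUDT additionally registers a matching Scala-side class via scalaUDT():

{code:python}
# Sketch of PySpark's UDT pattern; Point/PointUDT are illustrative names only.
from pyspark.sql.types import ArrayType, DoubleType, UserDefinedType


class PointUDT(UserDefinedType):
    @classmethod
    def sqlType(cls):
        # Storage representation: a non-nullable pair of doubles.
        return ArrayType(DoubleType(), False)

    @classmethod
    def module(cls):
        return "__main__"  # module where this UDT lives (assumption for this sketch)

    def serialize(self, obj):
        # Python object -> SQL representation.
        return [obj.x, obj.y]

    def deserialize(self, datum):
        # SQL representation -> Python object.
        return Point(datum[0], datum[1])


class Point:
    __UDT__ = PointUDT()  # attaches the UDT so Spark knows how to (de)serialize

    def __init__(self, x, y):
        self.x = float(x)
        self.y = float(y)

    def __repr__(self):
        return "(%s, %s)" % (self.x, self.y)
{code}

When these hooks are applied, each value round-trips through serialize/deserialize and prints (1.0, 1.0) and (2.0, 2.0), as in the fixed branch above.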
> Support UDT for Pandas with Arrow Disabled
> ------------------------------------------
>
>                 Key: SPARK-35211
>                 URL: https://issues.apache.org/jira/browse/SPARK-35211
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.1.1
>            Reporter: Darcy Shen
>            Priority: Major
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org