
Julien Peloton updated SPARK-38435:
-----------------------------------
    Description: 
h2. Old style pandas UDF

Let's consider a pandas UDF defined in the old style:
{code:python}
# mymod.py
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def to_upper(s):
    return s.str.upper()
{code}
I can import it and use it as:
{code:python}
# main.py
from pyspark.sql import SparkSession
from mymod import to_upper

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show() {code}
and launch it via:
{code:bash}
spark-submit main.py
spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py:392: UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to specify type hints for pandas UDF instead of specifying pandas UDF type which will be deprecated in the future releases. See SPARK-28264 for more details.
+--------------+
|to_upper(name)|
+--------------+
|      JOHN DOE|
+--------------+ {code}
Apart from the `UserWarning`, the code works as expected.
h2. New style pandas UDF: using type hints

Let's now switch to the version using type hints:
{code:python}
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper() {code}
But this time, I obtain an `AttributeError`:
{code:bash}
spark-submit main.py
Traceback (most recent call last):
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 835, in _parse_datatype_string
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 839, in _parse_datatype_string
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/main.py", line 2, in <module>
    from mymod import to_upper
  File "/Users/julien/Documents/workspace/myrepos/fink-filters/mymod.py", line 5, in <module>
    def to_upper(s: pd.Series) -> pd.Series:
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py", line 432, in _create_pandas_udf
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 43, in _create_udf
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 206, in _wrapped
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 96, in returnType
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 841, in _parse_datatype_string
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 831, in _parse_datatype_string
  File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 823, in from_ddl_schema
AttributeError: 'NoneType' object has no attribute '_jvm'
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. {code}
 

The code crashes at import time. Looking at the code, an active Spark context is required when the DDL return-type string is parsed:

[https://github.com/apache/spark/blob/8d70d5da3d74ebdd612b2cdc38201e121b88b24f/python/pyspark/sql/types.py#L819-L827]

which is not the case at the time of the import.
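
A workaround that appears to avoid the crash is to pass a concrete `DataType` instead of a DDL string, since `_parse_datatype_string` (which needs the JVM) is only invoked when the return type is given as a string. A minimal sketch, assuming the Spark 3.1 behaviour shown in the traceback above:
{code:python}
# mymod.py -- workaround sketch: pass a DataType instead of a DDL string,
# so that _parse_datatype_string (which requires a live JVM) is never
# called at decoration time.
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()
{code}
With this variant the module should import without an active Spark context, because the DDL-parsing path is skipped entirely.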
h2. Questions

First, am I doing something wrong? I do not see any mention of this in the documentation ([https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html]), and it seems it would affect many users moving from old-style to new-style pandas UDFs.

Second, is this the expected behaviour? With the old-style pandas UDF the module can be imported without problems, so the new behaviour looks like a regression. Why should users need an active Spark context just to import a module that contains a pandas UDF?
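
Until this is resolved, another pattern that keeps the module importable is to defer the decoration until a session exists. A hypothetical sketch (the `get_to_upper` helper is my own naming, not part of any API):
{code:python}
# mymod.py -- deferred-creation sketch: the module stays importable
# because pandas_udf is only applied once the caller guarantees that
# a SparkSession (and hence a SparkContext) exists.
import pandas as pd
from pyspark.sql.functions import pandas_udf

def _to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()

def get_to_upper():
    # Call this from the driver after SparkSession.builder.getOrCreate(),
    # so the DDL string "string" is parsed against a live JVM.
    return pandas_udf("string")(_to_upper)
{code}
In `main.py`, `to_upper = get_to_upper()` would then be called after the session is created.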

Third, what could we do? I see that an assert was recently added on the master branch (https://issues.apache.org/jira/browse/SPARK-37620):

[https://github.com/apache/spark/blob/e21cb62d02c85a66771822cdd49c49dbb3e44502/python/pyspark/sql/types.py#L1014-L1015]

But the assert does not solve the problem; it only makes it explicit.
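
One possible direction (a hypothetical sketch of the idea, not the actual Spark code) would be to resolve the return type lazily, so a DDL string is only parsed on first access, by which time a context normally exists:
{code:python}
# Hypothetical sketch of lazy return-type resolution; not the real
# pyspark.sql.udf implementation.
from pyspark.sql.types import DataType, _parse_datatype_string

class LazyReturnType:
    def __init__(self, return_type):
        self._raw = return_type
        self._parsed = None

    @property
    def returnType(self):
        if self._parsed is None:
            if isinstance(self._raw, DataType):
                self._parsed = self._raw
            else:
                # Requires an active SparkContext, but by the time a UDF
                # is actually used in a query, one should exist.
                self._parsed = _parse_datatype_string(self._raw)
        return self._parsed
{code}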

> Pandas UDF with type hints crashes at import
> --------------------------------------------
>
>                 Key: SPARK-38435
>                 URL: https://issues.apache.org/jira/browse/SPARK-38435
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.0
>         Environment: Spark: 3.1
> Python: 3.7
>            Reporter: Julien Peloton
>            Priority: Major
>


