[ 
https://issues.apache.org/jira/browse/SPARK-43189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Grigorev updated SPARK-43189:
------------------------------------
    Description: 
h2. Issue

Users who have mypy enabled in their IDE or CI environment face very verbose 
error messages when using the {{pandas_udf}} function in PySpark. The current 
overload-based typing of {{pandas_udf}} appears to cause these errors. As a 
workaround, the official documentation provides examples that use 
{{# type: ignore[call-overload]}}, but suppressing the checker is not an ideal 
solution.
h2. Example

Here's a code snippet taken from 
[docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs]
 that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("col1 string, col2 long")
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    s3['col2'] = s1 + s2.str.len()
    return s3
{code}
Running mypy on this code produces a long error message: the headline 
{{No overload variant of "pandas_udf" matches argument type "str"}} is followed 
by a note for each overload that failed to match, which makes it hard to see 
the actual problem and how to resolve it.
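For reference, this is how the documented workaround looks when applied to the 
snippet above; it silences mypy on the decorator line, but it also hides any 
real typing mistakes in that call:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

# The "# type: ignore[call-overload]" comment tells mypy to skip overload
# resolution for the decorator call entirely, suppressing the verbose error.
@pandas_udf("col1 string, col2 long")  # type: ignore[call-overload]
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    s3['col2'] = s1 + s2.str.len()
    return s3
{code}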
h2. Proposed Solution

We ask the PySpark team to review and improve the typing of the {{pandas_udf}} 
function so that valid calls, such as the documented example above, type-check 
cleanly. This would spare users who run mypy in their development environments 
from these verbose errors.

Furthermore, we suggest updating the official documentation to provide better 
examples that do not rely on {{# type: ignore[call-overload]}} to suppress 
these errors.
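Until the overloads are fixed, a possible stopgap that avoids the blanket 
ignore is a small user-side wrapper (a sketch only; {{typed_pandas_udf}} is a 
hypothetical helper, not a PySpark API) that presents {{pandas_udf}} to mypy 
through a single {{cast}}:
{code:python}
from typing import Any, Callable, cast

import pandas as pd
from pyspark.sql.column import Column
from pyspark.sql.functions import pandas_udf

# Hypothetical helper, not part of PySpark: expose pandas_udf to mypy as a
# plain decorator factory so the failing overload resolution never runs.
def typed_pandas_udf(
    return_type: str,
) -> Callable[[Callable[..., Any]], Callable[..., Column]]:
    return cast(
        Callable[[Callable[..., Any]], Callable[..., Column]],
        pandas_udf(return_type),
    )

@typed_pandas_udf("col1 string, col2 long")
def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
    s3["col2"] = s1 + s2.str.len()
    return s3
{code}
The cast is confined to one helper, so the rest of the codebase keeps full 
type checking; a fix in PySpark's own overloads would make it unnecessary.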
h2. Impact

Addressing this will let PySpark users who run mypy write and verify their 
code efficiently, without wading through verbose error messages, making work 
with PySpark and pandas UDFs more productive.

  was:
h2. Issue

Users who have mypy enabled in their IDE or CI environment face very verbose 
error messages when using the {{pandas_udf}} function in PySpark. The current 
overload-based typing of {{pandas_udf}} appears to cause these errors. As a 
workaround, the official documentation provides examples that use 
{{# type: ignore[call-overload]}}, but suppressing the checker is not an ideal 
solution.
h2. Example

Here's a code snippet that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("string")
def f(s: pd.Series) -> pd.Series:
    return pd.Series(["a"]*len(s), index=s.index)
{code}
Running mypy on this code produces a long error message: the headline 
{{No overload variant of "pandas_udf" matches argument type "str"}} is followed 
by a note for each overload that failed to match, which makes it hard to see 
the actual problem and how to resolve it.
h2. Proposed Solution

We ask the PySpark team to review and improve the typing of the {{pandas_udf}} 
function so that valid calls, such as the example above, type-check cleanly. 
This would spare users who run mypy in their development environments from 
these verbose errors.

Furthermore, we suggest updating the official documentation to provide better 
examples that do not rely on {{# type: ignore[call-overload]}} to suppress 
these errors.
h2. Impact

Addressing this will let PySpark users who run mypy write and verify their 
code efficiently, without wading through verbose error messages, making work 
with PySpark and pandas UDFs more productive.


> No overload variant of "pandas_udf" matches argument type "str"
> ---------------------------------------------------------------
>
>                 Key: SPARK-43189
>                 URL: https://issues.apache.org/jira/browse/SPARK-43189
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.4, 3.3.2, 3.4.0
>            Reporter: Andrew Grigorev
>            Priority: Major
>


