Andrew Grigorev created SPARK-43189:
---------------------------------------

             Summary: No overload variant of "pandas_udf" matches argument type "str"
                 Key: SPARK-43189
                 URL: https://issues.apache.org/jira/browse/SPARK-43189
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.4.0, 3.3.2, 3.2.4
            Reporter: Andrew Grigorev


h2. Issue

Users who run mypy in their IDE or CI environment get very verbose error 
messages when they use the {{pandas_udf}} function in PySpark. The current type 
annotations of {{pandas_udf}} appear to be the cause. As a workaround, the 
official documentation provides examples that use 
{{# type: ignore[call-overload]}}, but this is not an ideal solution.
h2. Example

Here's a code snippet that triggers the error when mypy is enabled:
{code:python}
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("string")
def f(s: pd.Series) -> pd.Series:
    return pd.Series(["a"]*len(s), index=s.index)
{code}
Running mypy on this code reports that no overload variant of {{pandas_udf}} 
matches argument type {{"str"}}. The error message is long, which makes it hard 
for users to understand the actual issue and how to resolve it.
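
The mechanism behind a {{call-overload}} error can be reproduced without PySpark. The sketch below is only an illustration under stated assumptions: {{toy_udf}}, its parameters, and its two overloads are hypothetical and are not PySpark's actual type stubs. It shows a decorator whose runtime implementation accepts a lone string while none of its declared overloads does, which is the kind of mismatch mypy is expected to flag unless the error code is suppressed.

```python
from typing import Callable, TypeVar, overload

F = TypeVar("F", bound=Callable[..., object])


# Hypothetical decorator factory for illustration; NOT PySpark's real stubs.
@overload
def toy_udf(f: F, return_type: str) -> F: ...
@overload
def toy_udf(f: str, *, function_type: int) -> Callable[[F], F]: ...
def toy_udf(f, return_type=None, function_type=None):
    # The runtime implementation is more permissive than the overloads above.
    if callable(f):
        f.return_type = return_type
        return f

    def decorate(func):
        func.return_type = f  # in this branch `f` is the return-type string
        return func

    return decorate


# Runs fine, but neither declared overload accepts a lone `str`, so a type
# checker is expected to reject the call unless the error is suppressed.
@toy_udf("string")  # type: ignore[call-overload]
def upper(s: str) -> str:
    return s.upper()


print(upper("a"), upper.return_type)
```

In this toy case, adding an overload that accepts a lone string would remove the need for the suppression comment. Whether the same approach applies to {{pandas_udf}} depends on its real stubs, which are not reproduced here.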
h2. Proposed Solution

We ask the PySpark development team to review and improve the type annotations 
of the {{pandas_udf}} function so that these verbose error messages no longer 
appear. This would give users who run mypy in their development environments a 
better experience when using PySpark.

Furthermore, we suggest updating the official documentation to provide better 
examples that do not rely on {{# type: ignore[call-overload]}} to suppress 
these errors.
h2. Impact

By addressing this issue, users of PySpark with mypy enabled in their 
development environment will be able to write and verify their code more 
efficiently, without being overwhelmed by verbose error messages. This will 
lead to a more enjoyable and productive experience when working with PySpark 
and pandas UDFs.



