[ 
https://issues.apache.org/jira/browse/SPARK-53672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-53672:
----------------------------------
    Description: 
We would extend @udf by allowing more input types and return types to promote 
vectorized UDFs.

 

Vectorized UDFs (based on Pandas and PyArrow) are much more performant than 
normal Python UDFs, but people are still using the normal UDFs in most cases.

 

 

Existing normal UDFs take Python objects (e.g. bool/int/float/dict) and Rows 
(for StructType) as inputs and outputs, for example:
{code:java}
@udf(returnType=IntegerType())
def calc(a: int, b: int) -> int:
    return a + 10 * b{code}
 

While in vectorized UDFs, the type hints are used to infer the evaluation types.

 

Proposal: Allow Pandas/PyArrow objects (pd.Series/pd.DataFrame/pa.Array/etc) as 
input types and return types, try to infer the corresponding evaluation type 
based on the signature and type hints. If the type hints match one of the 
patterns of supported vectorized UDFs (e.g. the Series to Series type in 
[pandas_udf|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html?highlight=pandas_udf#pyspark.sql.functions.pandas_udf]),
 then the UDF is treated as a vectorized UDF.

> Unified interface for UDF
> -------------------------
>
>                 Key: SPARK-53672
>                 URL: https://issues.apache.org/jira/browse/SPARK-53672
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 4.1.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
> We would extend @udf by allowing more input types and return types to promote 
> vectorized UDFs.
>  
> Vectorized UDFs (based on Pandas and PyArrow) are much more performant than 
> normal Python UDFs, but people are still using the normal UDFs in most cases.
>  
>  
> Existing normal UDFs take Python objects (e.g. bool/int/float/dict) and Rows 
> (for StructType) as inputs and outputs, for example:
> {code:java}
> @udf(returnType=IntegerType())
> def calc(a: int, b: int) -> int:
>     return a + 10 * b{code}
>  
> While in vectorized UDFs, the type hints are used to infer the evaluation 
> types.
>  
> Proposal: Allow Pandas/PyArrow objects (pd.Series/pd.DataFrame/pa.Array/etc) 
> as input types and return types, try to infer the corresponding evaluation 
> type based on the signature and type hints. If the type hints match one of 
> the patterns of supported vectorized UDFs (e.g. the Series to Series type in 
> [pandas_udf|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html?highlight=pandas_udf#pyspark.sql.functions.pandas_udf]),
>  then the UDF is treated as a vectorized UDF.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to