[ 
https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40307:
---------------------------------
    Description: 
Python user-defined functions (UDFs) enable users to run arbitrary code against 
PySpark columns. They use Pickle for (de)serialization and execute row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, 
the data exchange between the worker JVM and the spawned Python subprocess that 
actually executes the UDF. We should seek an alternative for the 
(de)serialization: Arrow, which is already used in the (de)serialization of 
Pandas UDFs.
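For illustration, a minimal pickled Python UDF might look as follows. This is a 
sketch, not part of the proposal; it assumes an active SparkSession bound to 
`spark`, and `plus_one` is a hypothetical example function:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# A plain Python UDF: each value is pickled, shipped to the spawned
# Python worker, evaluated row by row, and the result is pickled back.
@udf(returnType=IntegerType())
def plus_one(x):
    return x + 1

df = spark.range(3).toDF("x")     # assumes an active SparkSession `spark`
df.select(plus_one("x")).show()
```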


There should be two ways to enable/disable the Arrow optimization for Python 
UDFs:

- the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, 
disabled by default.
- the `useArrow` parameter of the `udf` function, `None` by default.

The Spark configuration takes effect only when `useArrow` is `None`. Otherwise, 
`useArrow` decides whether the user-defined function is optimized by Arrow or 
not.

We introduce these two ways to provide both convenient per-Spark-session 
control and finer-grained per-UDF control of the Arrow optimization for Python 
UDFs, as sketched below.
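A sketch of how the two controls would interact once this proposal lands, again 
assuming an active `spark` session; the function names are hypothetical:

```python
from pyspark.sql.functions import udf

# Per-session control: Arrow optimization applies to every Python UDF
# whose useArrow is left as None.
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

# Per-UDF control: useArrow overrides the session configuration.
@udf(returnType="int", useArrow=True)    # always Arrow-optimized
def plus_one_arrow(x):
    return x + 1

@udf(returnType="int", useArrow=False)   # always pickled, even with the conf on
def plus_one_pickled(x):
    return x + 1
```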


  was:
Python user-defined functions (UDFs) enable users to run arbitrary code against 
PySpark columns. They use Pickle for (de)serialization and execute row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, 
the data exchange between the worker JVM and the spawned Python subprocess that 
actually executes the UDF. We should seek an alternative for the 
(de)serialization: Arrow, which is already used in the (de)serialization of 
Pandas UDFs.


> Introduce Arrow-optimized Python UDFs
> -------------------------------------
>
>                 Key: SPARK-40307
>                 URL: https://issues.apache.org/jira/browse/SPARK-40307
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>


