[ https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-40307: ------------------------------------ Assignee: (was: Apache Spark) > Introduce Arrow-optimized Python UDFs > ------------------------------------- > > Key: SPARK-40307 > URL: https://issues.apache.org/jira/browse/SPARK-40307 > Project: Spark > Issue Type: Umbrella > Components: PySpark > Affects Versions: 3.4.0 > Reporter: Xinrong Meng > Priority: Major > > Python user-defined function (UDF) enables users to run arbitrary code > against PySpark columns. It uses Pickle for (de)serialization and executes > row by row. > One major performance bottleneck of Python UDFs is (de)serialization, that > is, the data interchanging between the worker JVM and the spawned Python > subprocess which actually executes the UDF. We should seek an alternative to > handle the (de)serialization: Arrow, which is used in the (de)serialization > of Pandas UDF already. > There should be two ways to enable/disable the Arrow optimization for Python > UDFs: > - the Spark configuration `spark.sql.execution.pythonUDF.arrow.enabled`, > disabled by default. > - the `useArrow` parameter of the `udf` function, None by default. > The Spark configuration takes effect only when `useArrow` is None. Otherwise, > `use arrow` decides whether the user-defined function is optimized by Arrow > or not. > The reason why we introduce these two ways is to provide convenient, > per-Spark-session control and finer-grained, per-UDF control of the Arrow > optimization for Python UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org