[ https://issues.apache.org/jira/browse/SPARK-40307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xinrong Meng updated SPARK-40307:
---------------------------------
    Summary: Optimize (De)Serialization of Python UDFs by Arrow  (was: Optimize (De)Serialization of Python UDF)

> Optimize (De)Serialization of Python UDFs by Arrow
> --------------------------------------------------
>
>                 Key: SPARK-40307
>                 URL: https://issues.apache.org/jira/browse/SPARK-40307
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>
> A Python user-defined function (UDF) lets users run arbitrary code
> against PySpark columns. It uses Pickle for (de)serialization and executes
> row by row.
> One major performance bottleneck of Python UDFs is (de)serialization, that
> is, the data interchange between the worker JVM and the spawned Python
> subprocess that actually executes the UDF. We should consider an alternative
> for handling the (de)serialization: Arrow, which is already used for
> (de)serialization of Pandas UDFs.
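
To illustrate the bottleneck described in the issue, here is a minimal standard-library sketch that contrasts pickling every row separately with shipping a whole column as one contiguous buffer. It does not use Spark or Arrow: the `array` module is only a stand-in for a columnar batch format, and the names `rows`, `plus_one`, `roundtrip_row_by_row`, and `roundtrip_columnar` are hypothetical, chosen for illustration.

```python
import pickle
from array import array

# Hypothetical sample data: a single integer column with 1,000 rows.
rows = [(i,) for i in range(1000)]


def plus_one(x):
    """Stand-in for a user-defined function applied to one value."""
    return x + 1


def roundtrip_row_by_row(rows):
    """Pickle and unpickle each row separately, then apply the function."""
    out = []
    calls = 0
    for row in rows:
        payload = pickle.dumps(row)        # one serialization call per row
        calls += 1
        (value,) = pickle.loads(payload)   # one deserialization call per row
        out.append(plus_one(value))
    return out, calls


def roundtrip_columnar(rows):
    """Pack the whole column into one buffer and transfer it in one step."""
    column = array("q", (row[0] for row in rows))  # 64-bit signed integers
    payload = column.tobytes()                     # one buffer for the batch
    received = array("q")
    received.frombytes(payload)                    # one decode for the batch
    return [plus_one(v) for v in received], 1


row_result, row_calls = roundtrip_row_by_row(rows)
col_result, col_calls = roundtrip_columnar(rows)
print(row_result == col_result, row_calls, col_calls)
```

Both paths compute the same values; the difference is that the first pays serialization overhead once per row while the second pays it once per batch, which is the kind of saving a columnar format such as Arrow is meant to provide.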