Xinrong Meng created SPARK-40307:
------------------------------------

             Summary: Optimize (De)Serialization of Python UDF
                 Key: SPARK-40307
                 URL: https://issues.apache.org/jira/browse/SPARK-40307
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng

A Python user-defined function (UDF) lets users run arbitrary code against PySpark columns. It uses Pickle for (de)serialization and executes row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, the data exchange between the worker JVM and the spawned Python subprocess that actually executes the UDF.

We should look for an alternative to handle the (de)serialization: Arrow, which is already used for (de)serialization in Pandas UDFs.
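To illustrate why row-by-row (de)serialization is costly, here is a minimal sketch that uses only the standard-library `pickle` module. It is not PySpark's actual wire protocol and does not use Arrow. The sample data and the helper names `roundtrip_per_row` and `roundtrip_columnar` are hypothetical, and the column-wise batch is only a rough stand-in for a columnar format.

```python
import pickle

# Hypothetical sample data: 1,000 rows of (id, value). Not taken from the issue.
rows = [(i, i * 0.5) for i in range(1000)]


def roundtrip_per_row(rows):
    """Pickle and unpickle each row on its own, counting the bytes produced."""
    total_bytes = 0
    restored = []
    for row in rows:
        payload = pickle.dumps(row)   # one serialization call per row
        total_bytes += len(payload)
        restored.append(pickle.loads(payload))
    return restored, total_bytes


def roundtrip_columnar(rows):
    """Transpose the rows into columns and pickle the whole batch once."""
    columns = [list(col) for col in zip(*rows)]
    payload = pickle.dumps(columns)   # a single serialization call per batch
    restored_columns = pickle.loads(payload)
    # Transpose back so the result is comparable with the input rows.
    return list(zip(*restored_columns)), len(payload)


per_row_result, per_row_bytes = roundtrip_per_row(rows)
columnar_result, columnar_bytes = roundtrip_columnar(rows)

print("per-row bytes: ", per_row_bytes)
print("columnar bytes:", columnar_bytes)
```

Both paths return the same rows, but the per-row path pays the fixed pickling overhead once for every row, while the batched path pays it once for the whole batch. An Arrow-based path would presumably go further by exchanging typed columnar buffers instead of pickled Python objects.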