Xinrong Meng created SPARK-40307:
------------------------------------

             Summary: Optimize (De)Serialization of Python UDF
                 Key: SPARK-40307
                 URL: https://issues.apache.org/jira/browse/SPARK-40307
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng


A Python user-defined function (UDF) lets users run arbitrary code against 
PySpark columns. It uses Pickle for (de)serialization and executes row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, 
the data interchange between the worker JVM and the spawned Python subprocess 
that actually executes the UDF. We should look for an alternative way to handle 
(de)serialization. Arrow is a natural candidate, since it is already used for 
(de)serialization of Pandas UDFs.
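
To make the overhead concrete, here is a minimal, standard-library-only sketch. 
It does not use PySpark or Arrow and does not reproduce PySpark's actual 
serializers. It only contrasts pickling each value on its own with packing the 
whole column into one contiguous buffer, which is the general idea behind a 
columnar format such as Arrow. The helper names (serialize_row_by_row, 
serialize_columnar, deserialize_columnar) and the use of array("d") as a 
stand-in for a columnar buffer are assumptions made for illustration.

```python
import pickle
from array import array


def serialize_row_by_row(values):
    """Pickle every value separately (a simplified per-row interchange)."""
    return [pickle.dumps(v) for v in values]


def serialize_columnar(values):
    """Pack the whole column into one contiguous buffer of C doubles.

    Stand-in for a columnar batch; this is NOT the Arrow wire format.
    """
    return array("d", values).tobytes()


def deserialize_columnar(payload):
    """Rebuild the list of floats from the contiguous buffer."""
    column = array("d")
    column.frombytes(payload)
    return column.tolist()


values = [float(i) for i in range(10_000)]

row_payloads = serialize_row_by_row(values)
columnar_payload = serialize_columnar(values)

# Total bytes that would cross the JVM/Python boundary in each scheme.
row_bytes = sum(len(p) for p in row_payloads)
columnar_bytes = len(columnar_payload)

# Both schemes round-trip to the same data.
restored_rows = [pickle.loads(p) for p in row_payloads]
restored_column = deserialize_columnar(columnar_payload)

print(f"row-by-row pickle: {len(row_payloads)} payloads, {row_bytes} bytes")
print(f"columnar buffer:   1 payload, {columnar_bytes} bytes")
```

The per-value payloads each carry their own pickle framing, so the row-by-row 
total is larger than the single buffer, and the receiver has to decode many 
small objects instead of one block. Actual gains in Spark would depend on data 
types, batch sizes, and the UDF itself, so this is only an intuition aid and 
not a benchmark.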



