Weichen Xu created SPARK-27870:
----------------------------------

             Summary: Flush each batch for pandas UDF
                 Key: SPARK-27870
                 URL: https://issues.apache.org/jira/browse/SPARK-27870
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.0.0
            Reporter: Weichen Xu


Flush each batch for pandas UDF.

This could improve performance when multiple pandas UDF plans are pipelined.

When each batch is flushed promptly, downstream pandas UDFs can start working 
as soon as possible, and the pipelining helps hide the downstream UDFs' 
computation time. For example:

When the first UDF starts computing on batch-3, the second pipelined UDF can 
start computing on batch-2, and the third pipelined UDF can start computing on 
batch-1.
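
The staggered schedule in this example can be sketched in a few lines of 
Python. This is only an illustrative model, not Spark code: the function name 
pipeline_schedule is hypothetical, and it assumes every UDF takes one time 
step per batch and that each batch is forwarded as soon as it is finished.

```python
def pipeline_schedule(num_udfs, num_batches):
    """Return, for each time step, the batch number each UDF is working on.

    Assumption (not from Spark): every UDF needs exactly one time step per
    batch, and a finished batch is handed downstream immediately.
    None means the UDF is idle at that step.
    """
    steps = []
    for t in range(1, num_batches + num_udfs):
        row = []
        for k in range(num_udfs):
            # UDF k sees batch b one step after UDF k-1 finished it.
            b = t - k
            row.append(b if 1 <= b <= num_batches else None)
        steps.append(row)
    return steps


schedule = pipeline_schedule(num_udfs=3, num_batches=3)
for step, row in enumerate(schedule, start=1):
    print(step, row)
```

At the third step in this model the three UDFs are working on batch-3, 
batch-2 and batch-1 respectively, which is the situation described above.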

If we do not flush each batch in time, the downstream UDFs will lag too far 
behind in the pipeline, which may increase the total processing time.
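
As a rough illustration of why the flush matters, the sketch below writes 
batches through a buffered stream and records how many bytes have actually 
reached the downstream side after each batch. It uses only the Python standard 
library and is not the PySpark implementation; RecordingSink and send_batches 
are hypothetical names, and the 64 KiB buffer size is an assumption chosen for 
the demonstration.

```python
import io


class RecordingSink(io.RawIOBase):
    """Raw stream that records every chunk that really leaves the buffer."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, b):
        self.chunks.append(bytes(b))
        return len(b)


def send_batches(batches, flush_each_batch, buffer_size=65536):
    """Write batches through a buffer; return bytes visible downstream
    after each batch has been written."""
    sink = RecordingSink()
    out = io.BufferedWriter(sink, buffer_size=buffer_size)
    visible = []
    for batch in batches:
        out.write(batch)
        if flush_each_batch:
            out.flush()  # push this batch downstream right away
        visible.append(sum(len(c) for c in sink.chunks))
    out.flush()  # final flush so nothing is left in the buffer
    return visible


batches = [b"x" * 1000] * 3
print(send_batches(batches, flush_each_batch=False))
print(send_batches(batches, flush_each_batch=True))
```

Without the per-batch flush, small batches sit in the buffer and the 
downstream consumer has nothing to work on until the buffer fills or the 
stream is flushed at the end.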

