[jira] [Updated] (SPARK-27870) Flush each batch for python UDF

Weichen Xu (JIRA) Tue, 28 May 2019 20:35:48 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Weichen Xu updated SPARK-27870:
-------------------------------
    Description: 
Flush each batch for python UDF.

This could improve performance when multiple python UDF plans are pipelined.

When each batch is flushed in time, downstream python UDFs get pipelined as 
soon as possible, and the pipelining helps hide the downstream UDFs' 
computation time. For example:

When the first UDF starts computing on batch-3, the second pipelined UDF can 
start computing on batch-2, and the third pipelined UDF can start computing on 
batch-1.

If we do not flush each batch in time, the downstream UDFs' pipeline will lag 
too far behind, which may increase the total processing time.
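
The effect described above can be illustrated with a small toy model. This is 
only a sketch under simplifying assumptions, not Spark's implementation: every 
UDF stage is assumed to take the same fixed time per batch, and the function 
name and its parameters (pipeline_total_time, flush_every, and so on) are 
hypothetical names chosen for this illustration. A batch becomes visible to 
the next stage only once the flush group it belongs to has been completed.

```python
def pipeline_total_time(num_batches, num_stages, cost_per_batch, flush_every):
    """Toy model: total time for chained UDF stages (hypothetical helper).

    Each stage processes batches in order. Output is handed to the next
    stage only after every `flush_every` batches (or at end of input).
    """
    # Time at which each batch becomes visible to the current stage.
    # The first stage is assumed to have all of its input available at t=0.
    visible = [0.0] * num_batches
    for _ in range(num_stages):
        finish = []
        prev_done = 0.0
        for b in range(num_batches):
            # A stage starts a batch once it is free and the batch has arrived.
            start = max(prev_done, visible[b])
            prev_done = start + cost_per_batch
            finish.append(prev_done)
        # A batch reaches the next stage when the last batch of its flush
        # group has been computed by this stage.
        visible = []
        for b in range(num_batches):
            group_end = min(num_batches - 1,
                            (b // flush_every + 1) * flush_every - 1)
            visible.append(finish[group_end])
    return visible[-1]


if __name__ == "__main__":
    for k in (1, 3, 6):
        total = pipeline_total_time(6, 3, 1.0, flush_every=k)
        print(f"flush_every={k}: total time = {total}")
```

In this model, with 6 batches, 3 stages and unit cost per batch, flushing 
every batch finishes in 8 time units, while flushing only once at the end 
takes 18, because each downstream stage then waits for the whole upstream 
stage to finish.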

 

  was:
Flush each batch for pandas UDF.

This could improve performance when multiple pandas UDF plans are pipelined.

When each batch is flushed in time, downstream pandas UDFs get pipelined as 
soon as possible, and the pipelining helps hide the downstream UDFs' 
computation time. For example:

When the first UDF starts computing on batch-3, the second pipelined UDF can 
start computing on batch-2, and the third pipelined UDF can start computing on 
batch-1.

If we do not flush each batch in time, the downstream UDFs' pipeline will lag 
too far behind, which may increase the total processing time.

 


> Flush each batch for python UDF
> -------------------------------
>
>                 Key: SPARK-27870
>                 URL: https://issues.apache.org/jira/browse/SPARK-27870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Weichen Xu
>            Priority: Major
>
> Flush each batch for python UDF.
> This could improve performance when multiple python UDF plans are pipelined.
> When each batch is flushed in time, downstream python UDFs get pipelined 
> as soon as possible, and the pipelining helps hide the downstream UDFs' 
> computation time. For example:
> When the first UDF starts computing on batch-3, the second pipelined UDF can 
> start computing on batch-2, and the third pipelined UDF can start computing 
> on batch-1.
> If we do not flush each batch in time, the downstream UDFs' pipeline will lag 
> too far behind, which may increase the total processing time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

