[ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu updated SPARK-27870:
-------------------------------
    Summary: Flush each batch for pandas UDF (for improving pandas UDFs pipeline)  (was: Flush each batch for python UDF)

> Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
> --------------------------------------------------------------------
>
>                 Key: SPARK-27870
>                 URL: https://issues.apache.org/jira/browse/SPARK-27870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Weichen Xu
>            Priority: Major
>
> Flush each batch for Python UDFs.
> This could improve performance when multiple Python UDF plans are pipelined.
> When each batch is flushed promptly, downstream Python UDFs can start work
> as soon as possible, and the pipelining helps hide the downstream UDFs'
> computation time. For example:
> When the first UDF starts computing on batch-3, the second pipelined UDF can
> start computing on batch-2, and the third pipelined UDF can start computing
> on batch-1.
> If batches are not flushed promptly, the downstream UDFs lag too far behind
> the upstream one, which may increase the total processing time.
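
The batch-1/batch-2/batch-3 example in the description can be illustrated with a toy timing model. This is only a sketch and not Spark code: the function `pipeline_finish_time` and its parameters are hypothetical, every UDF stage is assumed to take the same fixed time per batch, and `flush_every` stands in for how many batches a stage buffers before its output becomes visible downstream.

```python
def pipeline_finish_time(num_batches, num_stages, cost_per_batch, flush_every):
    """Return the time at which the last stage finishes the last batch.

    Toy model (assumption, not Spark's implementation): each stage processes
    batches in order, and a stage's output for a batch reaches the next stage
    only once the last batch of its flush group has been processed.
    """
    # available[i]: time at which batch i becomes visible to the current stage.
    available = [0] * num_batches
    finish = []
    for _ in range(num_stages):
        finish = []
        clock = 0
        for i in range(num_batches):
            # A stage starts a batch once it is idle and the batch has arrived.
            clock = max(clock, available[i]) + cost_per_batch
            finish.append(clock)
        # Batch i is handed downstream when the final batch of its flush
        # group (or the final batch overall) has been processed.
        available = [
            finish[min((i // flush_every + 1) * flush_every, num_batches) - 1]
            for i in range(num_batches)
        ]
    return finish[-1]


if __name__ == "__main__":
    # Three chained UDFs, six batches, one time unit per batch.
    print("flush every batch:", pipeline_finish_time(6, 3, 1, 1))
    print("flush only at end:", pipeline_finish_time(6, 3, 1, 6))
```

Under these assumptions, flushing after every batch finishes in 8 time units, while flushing only once at the end takes 18, because each downstream UDF sits idle until the upstream one has processed everything.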