[ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27870:
------------------------------------

    Assignee: Apache Spark

> Flush each batch for pandas UDF
> -------------------------------
>
>                 Key: SPARK-27870
>                 URL: https://issues.apache.org/jira/browse/SPARK-27870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Weichen Xu
>            Assignee: Apache Spark
>            Priority: Major
>
> Flush each batch for pandas UDF.
> This could improve performance when multiple pandas UDF plans are pipelined.
> When each batch is flushed promptly, downstream pandas UDFs can start
> processing as soon as possible, and the pipelining helps hide the
> downstream UDFs' computation time. For example:
> When the first UDF starts computing on batch-3, the second pipelined UDF can
> start computing on batch-2, and the third pipelined UDF can start computing
> on batch-1.
> If we do not flush each batch in time, the downstream UDFs' pipeline will lag
> too far behind, which may increase the total processing time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
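
The pipelining argument in the issue description can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions, not Spark's implementation: every UDF is modeled as a stage with a fixed per-batch cost, each stage handles one batch at a time in order, and a stage's output becomes visible downstream only when it is flushed. The function name `pipeline_finish_time` and the `flush_every` parameter are hypothetical and do not correspond to any Spark API or configuration.

```python
def pipeline_finish_time(num_batches, stage_costs, flush_every):
    """Toy model: time at which the last stage finishes the last batch.

    Assumptions (illustrative only, not Spark's behavior):
      * stage_costs[s] is the fixed time stage s spends on one batch;
      * each stage processes batches in order, one at a time;
      * a stage's output is released downstream in groups of
        `flush_every` batches (the final, possibly smaller, group is
        released when the stage finishes its last batch).
    """
    # available[i] = time at which batch i is visible to the current stage.
    available = [0.0] * num_batches
    for cost in stage_costs:
        # Time at which this stage finishes computing each batch.
        done = []
        clock = 0.0
        for i in range(num_batches):
            clock = max(clock, available[i]) + cost
            done.append(clock)
        # A batch is visible downstream once its whole flush group is done.
        flushed = []
        for i in range(num_batches):
            group_end = min((i // flush_every + 1) * flush_every, num_batches)
            flushed.append(done[group_end - 1])
        available = flushed
    return available[-1]


if __name__ == "__main__":
    costs = [1, 1, 1]  # three pipelined UDFs, one time unit per batch
    for flush_every in (1, 3, 6):
        total = pipeline_finish_time(6, costs, flush_every)
        print(f"flush every {flush_every} batch(es): finished at t={total}")
```

In this model, three stages and six batches finish at t=8 when every batch is flushed, at t=12 when output is released in groups of three, and at t=18 when nothing is released until a stage has finished all of its batches. The last case is fully serialized, which is the lag the description warns about.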