[ https://issues.apache.org/jira/browse/SPARK-28128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-28128.
----------------------------------
    Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24926
[https://github.com/apache/spark/pull/24926]

> Pandas Grouped UDFs should skip over empty partitions
> -----------------------------------------------------
>
>                 Key: SPARK-28128
>                 URL: https://issues.apache.org/jira/browse/SPARK-28128
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 2.4.3
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>             Fix For: 3.0.0
>
>
> When running FlatMapGroupsInPandasExec or AggregateInPandasExec the shuffle
> uses a default number of partitions of 200 in "spark.sql.shuffle.partitions".
> If the data is small, e.g. in testing, many of the partitions will be empty
> but are treated just the same. For example, ArrowPythonRunner.compute is
> called and starts a number of threads that do nothing since there is no
> iteration. These computations could be skipped for empty partitions, which
> will save time overall.
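
The idea in the description can be illustrated without Spark. The following is a minimal sketch in plain Python, not the code from the pull request: it peeks at each partition's iterator and only invokes the per-partition runner when at least one row is present. The names compute_skipping_empty and fake_runner are hypothetical and stand in for the per-partition logic and for an expensive runner such as ArrowPythonRunner.compute; the 200-partition layout mirrors the default mentioned above.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def compute_skipping_empty(
    partition: Iterable[T],
    compute: Callable[[Iterator[T]], Iterator[R]],
) -> Iterator[R]:
    """Run `compute` on a partition only if it holds at least one row."""
    rows = iter(partition)
    try:
        first = next(rows)
    except StopIteration:
        # Empty partition: return right away and never start the runner.
        return iter(())
    # Non-empty partition: put the peeked row back in front and run as usual.
    return compute(chain([first], rows))


# Hypothetical stand-in for the expensive per-partition runner.
runner_calls = 0


def fake_runner(rows: Iterator[int]) -> Iterator[int]:
    global runner_calls
    runner_calls += 1  # counts how often the "expensive" setup would happen
    return (r * 2 for r in rows)


# 200 shuffle partitions, only three of which hold data (as in a small test).
partitions = [[] for _ in range(200)]
partitions[3] = [1, 2]
partitions[42] = [3]
partitions[199] = [4, 5, 6]

results = [list(compute_skipping_empty(p, fake_runner)) for p in partitions]
non_empty = [r for r in results if r]

print(runner_calls)  # 3
print(non_empty)     # [[2, 4], [6], [8, 10, 12]]
```

In this sketch the runner is started 3 times instead of 200, which is where the time saving described in the issue would come from when most partitions are empty.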