[ https://issues.apache.org/jira/browse/SPARK-27070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27070:
------------------------------------

    Assignee:     (was: Apache Spark)

> DefaultPartitionCoalescer can lock up driver for hours
> ------------------------------------------------------
>
>                 Key: SPARK-27070
>                 URL: https://issues.apache.org/jira/browse/SPARK-27070
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.1, 2.3.2, 2.4.0
>            Reporter: Yuli Fiterman
>            Priority: Major
>
> We're running Spark on EMR reading large datasets from S3. When trying to
> coalesce a UnionRDD of two large FileScanRDDs (each with a few million
> partitions) into around 8k partitions, the driver can stall for over an hour.
>
> Profiler shows that over 90% of the time is spent in TimSort, which is invoked
> by `pickBin`. This seems like a very inefficient way to find the least
> occupied PartitionGroup. IMO a better way would be to just use the `min` method
> on the ArrayBuffer of `PartitionGroup`s
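
The report's suggestion can be illustrated with a minimal sketch. This is not Spark's actual code: `Group` is a hypothetical stand-in for `PartitionGroup`, and `LeastLoaded`, `bySort`, and `byMin` are illustrative names. It only contrasts a sort-based lookup of the least-occupied group with a single linear scan, which is the kind of change the reporter proposes.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for PartitionGroup: only the partition count matters here.
final case class Group(id: Int, numPartitions: Int)

object LeastLoaded {
  // Sort-based lookup: O(n log n) per call. Repeating it once per input
  // partition becomes expensive when there are millions of partitions.
  def bySort(groups: ArrayBuffer[Group]): Option[Group] =
    groups.sortBy(_.numPartitions).headOption

  // Linear scan: O(n) per call, with no intermediate sorted copy.
  // Ties are not guaranteed to be broken the same way as in bySort.
  def byMin(groups: ArrayBuffer[Group]): Option[Group] =
    if (groups.isEmpty) None else Some(groups.minBy(_.numPartitions))
}
```

With `ArrayBuffer(Group(0, 5), Group(1, 2), Group(2, 9))`, both methods select `Group(1, 2)`; only the cost per call differs.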