[ https://issues.apache.org/jira/browse/SPARK-23496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell resolved SPARK-23496. --------------------------------------- Resolution: Fixed Assignee: Ala Luszczak Fix Version/s: 2.4.0 > Locality of coalesced partitions can be severely skewed by the order of input > partitions > ---------------------------------------------------------------------------------------- > > Key: SPARK-23496 > URL: https://issues.apache.org/jira/browse/SPARK-23496 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Ala Luszczak > Assignee: Ala Luszczak > Priority: Major > Fix For: 2.4.0 > > > Example: > Consider RDD "R" with 100 partitions, half of which have locality preference > "hostA" and half have "hostB". > * Assume odd-numbered input partitions of R prefer "hostA" and even-numbered > prefer "hostB". Then R.coalesce(50) will have 25 partitions with preference > "hostA" and 25 with "hostB" (even distribution). > * Assume partitions with index 0-49 of R prefer "hostA" and partitions with > index 50-99 prefer "hostB". Then R.coalesce(50) will have 49 partitions with > "hostA" and 1 with "hostB" (extremely skewed distribution). > > The algorithm in {{DefaultPartitionCoalescer.setupGroups}} is responsible for > picking preferred locations for coalesced partitions. It analyzes the > preferred locations of input partitions. It starts by trying to create one > partition for each unique location in the input. However, if the the > requested number of coalesced partitions is higher that the number of unique > locations, it has to pick duplicate locations. > Currently, the duplicate locations are picked by iterating over the input > partitions in order, and copying their preferred locations to coalesced > partitions. If the input partitions are clustered by location, this can > result in severe skew. > Instead of iterating over the list of input partitions in order, we should > pick them at random. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org