[ https://issues.apache.org/jira/browse/SPARK-23496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374439#comment-16374439 ]
Marco Gaido commented on SPARK-23496: ------------------------------------- I read that the proposed solution is to use random numbers instead of iterating. This seems to me not as a solution but as a workaround, which doesn't solve the problem, but it makes it unlikely. What about a solutions which does solve the problem, ie. we enforce an even distribution according to the incoming data distribution? I mean, what about creating a sort of reversed index with the preferred location as key and the partition as values and picking from each value list a ratio corresponding to the coalescing ratio? > Locality of coalesced partitions can be severely skewed by the order of input > partitions > ---------------------------------------------------------------------------------------- > > Key: SPARK-23496 > URL: https://issues.apache.org/jira/browse/SPARK-23496 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Ala Luszczak > Priority: Major > > Example: > Consider RDD "R" with 100 partitions, half of which have locality preference > "hostA" and half have "hostB". > * Assume odd-numbered input partitions of R prefer "hostA" and even-numbered > prefer "hostB". Then R.coalesce(50) will have 25 partitions with preference > "hostA" and 25 with "hostB" (even distribution). > * Assume partitions with index 0-49 of R prefer "hostA" and partitions with > index 50-99 prefer "hostB". Then R.coalesce(50) will have 49 partitions with > "hostA" and 1 with "hostB" (extremely skewed distribution). > > The algorithm in {{DefaultPartitionCoalescer.setupGroups}} is responsible for > picking preferred locations for coalesced partitions. It analyzes the > preferred locations of input partitions. It starts by trying to create one > partition for each unique location in the input. However, if the the > requested number of coalesced partitions is higher that the number of unique > locations, it has to pick duplicate locations. > Currently, the duplicate locations are picked by iterating over the input > partitions in order, and copying their preferred locations to coalesced > partitions. If the input partitions are clustered by location, this can > result in severe skew. > Instead of iterating over the list of input partitions in order, we should > pick them at random. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org