[
https://issues.apache.org/jira/browse/SPARK-23496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Herman van Hovell resolved SPARK-23496.
---------------------------------------
Resolution: Fixed
Assignee: Ala Luszczak
Fix Version/s: 2.4.0
> Locality of coalesced partitions can be severely skewed by the order of input
> partitions
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-23496
> URL: https://issues.apache.org/jira/browse/SPARK-23496
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Ala Luszczak
> Assignee: Ala Luszczak
> Priority: Major
> Fix For: 2.4.0
>
>
> Example:
> Consider RDD "R" with 100 partitions, half of which have locality preference
> "hostA" and half have "hostB".
> * Assume odd-numbered input partitions of R prefer "hostA" and even-numbered
> prefer "hostB". Then R.coalesce(50) will have 25 partitions with preference
> "hostA" and 25 with "hostB" (even distribution).
> * Assume partitions with index 0-49 of R prefer "hostA" and partitions with
> index 50-99 prefer "hostB". Then R.coalesce(50) will have 49 partitions with
> "hostA" and 1 with "hostB" (extremely skewed distribution).
>
> The algorithm in {{DefaultPartitionCoalescer.setupGroups}} is responsible for
> picking preferred locations for coalesced partitions. It analyzes the
> preferred locations of input partitions. It starts by trying to create one
> partition for each unique location in the input. However, if the
> requested number of coalesced partitions is higher than the number of unique
> locations, it has to pick duplicate locations.
> Currently, the duplicate locations are picked by iterating over the input
> partitions in order, and copying their preferred locations to coalesced
> partitions. If the input partitions are clustered by location, this can
> result in severe skew.
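
The two cases from the example can be reproduced with a small stand-alone model of the in-order behaviour described above. This is only a sketch under stated assumptions: it is not the code of DefaultPartitionCoalescer.setupGroups, the function name setup_group_locations is hypothetical, and each input partition is assumed to have exactly one preferred host.

```python
from collections import Counter

def setup_group_locations(preferred, target_len):
    """Simplified, hypothetical model of in-order location picking.

    preferred:  preferred host of each input partition, in index order
    target_len: requested number of coalesced partitions
    """
    # Phase 1: one coalesced partition per unique location (first-seen order).
    groups = list(dict.fromkeys(preferred))[:target_len]
    # Phase 2: fill the remaining slots by copying the preferred location
    # of the input partitions, visited in index order.
    i = 0
    while len(groups) < target_len:
        groups.append(preferred[i % len(preferred)])
        i += 1
    return groups

# Case 1: odd-numbered partitions prefer hostA, even-numbered prefer hostB.
interleaved = ["hostA" if i % 2 == 1 else "hostB" for i in range(100)]
# Case 2: partitions 0-49 prefer hostA, partitions 50-99 prefer hostB.
clustered = ["hostA"] * 50 + ["hostB"] * 50

interleaved_counts = Counter(setup_group_locations(interleaved, 50))
clustered_counts = Counter(setup_group_locations(clustered, 50))
print(interleaved_counts)
print(clustered_counts)
```

In this model the interleaved input yields 25 coalesced partitions per host, while the clustered input yields 49 for "hostA" and 1 for "hostB", because all 48 duplicate slots are filled from the first 48 input partitions.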
> Instead of iterating over the list of input partitions in order, we should
> pick them at random.
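
A minimal sketch of the proposed change, using a simplified stand-alone model rather than the actual Spark code. The function name setup_group_locations_random, the seeded generator, and sampling with replacement are all assumptions made for illustration.

```python
import random
from collections import Counter

def setup_group_locations_random(preferred, target_len, rng):
    """Simplified, hypothetical model of random location picking.

    preferred:  preferred host of each input partition, in index order
    target_len: requested number of coalesced partitions
    rng:        a random.Random instance (seeded here for repeatability)
    """
    # Phase 1: one coalesced partition per unique location (first-seen order).
    groups = list(dict.fromkeys(preferred))[:target_len]
    # Phase 2: fill the remaining slots from input partitions chosen
    # uniformly at random (with replacement) instead of in index order.
    while len(groups) < target_len:
        groups.append(preferred[rng.randrange(len(preferred))])
    return groups

# Clustered input: partitions 0-49 prefer hostA, partitions 50-99 prefer hostB.
clustered = ["hostA"] * 50 + ["hostB"] * 50

rng = random.Random(42)  # fixed seed, assumption for a repeatable run
random_counts = Counter(setup_group_locations_random(clustered, 50, rng))
print(random_counts)
```

Because every duplicate slot is drawn from the whole input, the expected split follows the proportion of each host among the input partitions regardless of how the input is ordered. Individual runs still vary around that proportion.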