[jira] [Resolved] (SPARK-23496) Locality of coalesced partitions can be severely skewed by the order of input partitions

Herman van Hovell (JIRA) Mon, 05 Mar 2018 05:34:37 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-23496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Herman van Hovell resolved SPARK-23496.
---------------------------------------
       Resolution: Fixed
         Assignee: Ala Luszczak
    Fix Version/s: 2.4.0

> Locality of coalesced partitions can be severely skewed by the order of input 
> partitions
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-23496
>                 URL: https://issues.apache.org/jira/browse/SPARK-23496
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Ala Luszczak
>            Assignee: Ala Luszczak
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Example:
> Consider RDD "R" with 100 partitions, half of which have locality preference 
> "hostA" and half have "hostB".
>  * Assume odd-numbered input partitions of R prefer "hostA" and even-numbered 
> prefer "hostB". Then R.coalesce(50) will have 25 partitions with preference 
> "hostA" and 25 with "hostB" (even distribution).
>  * Assume partitions with index 0-49 of R prefer "hostA" and partitions with 
> index 50-99 prefer "hostB". Then R.coalesce(50) will have 49 partitions with 
> "hostA" and 1 with "hostB" (extremely skewed distribution).
>  
> The algorithm in {{DefaultPartitionCoalescer.setupGroups}} is responsible for 
> picking preferred locations for coalesced partitions. It analyzes the 
> preferred locations of input partitions. It starts by trying to create one 
> partition for each unique location in the input. However, if the the 
> requested number of coalesced partitions is higher that the number of unique 
> locations, it has to pick duplicate locations.
> Currently, the duplicate locations are picked by iterating over the input 
> partitions in order, and copying their preferred locations to coalesced 
> partitions. If the input partitions are clustered by location, this can 
> result in severe skew.
> Instead of iterating over the list of input partitions in order, we should 
> pick them at random.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-23496) Locality of coalesced partitions can be severely skewed by the order of input partitions

Reply via email to