[jira] [Commented] (SPARK-23496) Locality of coalesced partitions can be severely skewed by the order of input partitions

Marco Gaido (JIRA) Fri, 23 Feb 2018 06:54:11 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-23496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374439#comment-16374439
 ]


Marco Gaido commented on SPARK-23496:
-------------------------------------

I read that the proposed solution is to use random numbers instead of 
iterating. This seems to me not as a solution but as a workaround, which 
doesn't solve the problem, but it makes it unlikely.

What about a solutions which does solve the problem, ie. we enforce an even 
distribution according to the incoming data distribution? I mean, what about 
creating a sort of reversed index with the preferred location as key and the 
partition as values and picking from each value list a ratio corresponding to 
the coalescing ratio?

> Locality of coalesced partitions can be severely skewed by the order of input 
> partitions
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-23496
>                 URL: https://issues.apache.org/jira/browse/SPARK-23496
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Ala Luszczak
>            Priority: Major
>
> Example:
> Consider RDD "R" with 100 partitions, half of which have locality preference 
> "hostA" and half have "hostB".
>  * Assume odd-numbered input partitions of R prefer "hostA" and even-numbered 
> prefer "hostB". Then R.coalesce(50) will have 25 partitions with preference 
> "hostA" and 25 with "hostB" (even distribution).
>  * Assume partitions with index 0-49 of R prefer "hostA" and partitions with 
> index 50-99 prefer "hostB". Then R.coalesce(50) will have 49 partitions with 
> "hostA" and 1 with "hostB" (extremely skewed distribution).
>  
> The algorithm in {{DefaultPartitionCoalescer.setupGroups}} is responsible for 
> picking preferred locations for coalesced partitions. It analyzes the 
> preferred locations of input partitions. It starts by trying to create one 
> partition for each unique location in the input. However, if the the 
> requested number of coalesced partitions is higher that the number of unique 
> locations, it has to pick duplicate locations.
> Currently, the duplicate locations are picked by iterating over the input 
> partitions in order, and copying their preferred locations to coalesced 
> partitions. If the input partitions are clustered by location, this can 
> result in severe skew.
> Instead of iterating over the list of input partitions in order, we should 
> pick them at random.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23496) Locality of coalesced partitions can be severely skewed by the order of input partitions

Reply via email to