[jira] [Commented] (SPARK-4644) Implement skewed join

Aaron Davidson (JIRA) Sat, 29 Nov 2014 08:51:27 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228839#comment-14228839
 ]


Aaron Davidson commented on SPARK-4644:
---------------------------------------

[~zsxwing] I believe that this problem is related more fundamentally to the 
problem that Spark currently requires that all values for the same key remain 
in memory. Your solution aims to fix this for the specific case of joins, but I 
wonder if we generalize it, if we could solve this for things like groupBy as 
well.

I don't have a fully fleshed out idea yet, but I was considering a model where 
there are 2 types of shuffles: aggregation-based and rearrangement-based. 
Aggregation-based shuffles use partial aggregation and combiners to form and 
merge (K, C) pairs. Rearrangement-based shuffles do not expect a decrease in 
the amount of total data, however, and so my thought is that this model does 
not make sense.

Instead, we could provide an interface similar to ExternalAppendOnlyMap but 
which returns an Iterator[(K, Iterable[V])] pairs, with some extra semantics 
related to the Iterable[V]s (such as having a .chunkedIterator() method which 
enables block nested loops join).

In this model, join could be implemented by mapping the left side's key to (K, 
1) and the right side to (K, 2) and having logic which reads from two adjacent 
value-iterables simultaneously -- e.g., 

val ((k, 1), left: Iterable[V]) = map.next()
val ((k, 2), right: Iterable[V]) = map.next()
// perform merge using the left and right iterators.


> Implement skewed join
> ---------------------
>
>                 Key: SPARK-4644
>                 URL: https://issues.apache.org/jira/browse/SPARK-4644
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Shixiong Zhu
>         Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise a OutOfMemory error, if some book has too many users 
> which cannot be fit into memory. To solve it, we propose a skewed join 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-4644) Implement skewed join

Reply via email to