[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228839#comment-14228839 ]
Aaron Davidson commented on SPARK-4644: --------------------------------------- [~zsxwing] I believe that this problem is related more fundamentally to the problem that Spark currently requires that all values for the same key remain in memory. Your solution aims to fix this for the specific case of joins, but I wonder if we generalize it, if we could solve this for things like groupBy as well. I don't have a fully fleshed out idea yet, but I was considering a model where there are 2 types of shuffles: aggregation-based and rearrangement-based. Aggregation-based shuffles use partial aggregation and combiners to form and merge (K, C) pairs. Rearrangement-based shuffles do not expect a decrease in the amount of total data, however, and so my thought is that this model does not make sense. Instead, we could provide an interface similar to ExternalAppendOnlyMap but which returns an Iterator[(K, Iterable[V])] pairs, with some extra semantics related to the Iterable[V]s (such as having a .chunkedIterator() method which enables block nested loops join). In this model, join could be implemented by mapping the left side's key to (K, 1) and the right side to (K, 2) and having logic which reads from two adjacent value-iterables simultaneously -- e.g., val ((k, 1), left: Iterable[V]) = map.next() val ((k, 2), right: Iterable[V]) = map.next() // perform merge using the left and right iterators. > Implement skewed join > --------------------- > > Key: SPARK-4644 > URL: https://issues.apache.org/jira/browse/SPARK-4644 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Reporter: Shixiong Zhu > Attachments: Skewed Join Design Doc.pdf > > > Skewed data is not rare. For example, a book recommendation site may have > several books which are liked by most of the users. Running ALS on such > skewed data will raise a OutOfMemory error, if some book has too many users > which cannot be fit into memory. To solve it, we propose a skewed join > implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org