Glenn Strycker created SPARK-11387:
--------------------------------------

             Summary: minimize shuffles during joins by using existing partitions and bundling messages
                 Key: SPARK-11387
                 URL: https://issues.apache.org/jira/browse/SPARK-11387
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Glenn Strycker
Currently an RDD join in Spark requires repartitioning both sides by the join key (for large RDDs that cannot use broadcast). This is very bad for highly skewed data, since every row containing a particular key ends up on a single node. Additionally, repartitioning is expensive, and the existing partitioning scheme may already have been optimized to minimize message passing. For example, suppose an RDD is an edge list for a graph, and the user has already partitioned this data by community structure or connected components, ensuring that related edges live on the same partition. Performing message passing via a join still requires repartitioning the edge list by the first or second vertex of each edge as the key.

Instead of repartitioning and shuffling, could messages destined for other partitions be "bundled" together and passed once, almost like a broadcast operation? Essentially the request here is to treat ALL RDDs of any size as broadcast-capable: each partition would be broadcast one at a time and the results aggregated. It would be up to the user to optimize the partitioning to minimize the between-partition message-passing volume.
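To make the proposal concrete, here is a minimal sketch in plain Python (no Spark; the function name `bundle_join` and the list-of-partitions representation are hypothetical stand-ins for RDD partitions). It simulates joining a pre-partitioned "left" dataset against a "right" dataset by bundling one right-side partition at a time into a lookup table and joining it against every left partition in place, rather than shuffling both sides by the join key:

```python
def bundle_join(left_parts, right_parts):
    """Simulated partition-at-a-time broadcast join.

    left_parts, right_parts: lists of partitions, each a list of (key, value)
    pairs. The left partitioning is preserved; each right partition is
    "broadcast" once as a bundle instead of shuffling either side.
    Returns a flat list of (key, (left_value, right_value)) pairs.
    """
    joined = []
    for bundle in right_parts:          # broadcast one partition at a time
        lookup = {}
        for k, v in bundle:             # build the broadcast lookup table
            lookup.setdefault(k, []).append(v)
        for part in left_parts:         # each left partition joins in place
            for k, lv in part:
                for rv in lookup.get(k, []):
                    joined.append((k, (lv, rv)))
    return joined

# Hypothetical example: left is already partitioned (e.g. by community),
# right supplies the messages to bundle and pass across.
left = [[(1, "a"), (2, "b")], [(2, "c"), (3, "d")]]
right = [[(2, "x")], [(3, "y"), (4, "z")]]
print(sorted(bundle_join(left, right)))
# -> [(2, ('b', 'x')), (2, ('c', 'x')), (3, ('d', 'y'))]
```

Note the trade-off this illustrates: the left side is never moved, so skewed keys stay spread across their existing partitions, but every right-side bundle must visit every left partition, so the user's partitioning must keep the cross-partition message volume small for this to win over a shuffle.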