[ https://issues.apache.org/jira/browse/SPARK-11387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979070#comment-14979070 ]
Glenn Strycker commented on SPARK-11387:
----------------------------------------

This ticket may be a particular implementation idea that would address the request for general skewed joins.

> minimize shuffles during joins by using existing partitions and bundling
> messages
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-11387
>                 URL: https://issues.apache.org/jira/browse/SPARK-11387
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Glenn Strycker
>
> Currently, an RDD join in Spark requires repartitioning by the join key (for
> large RDDs that cannot use broadcast). This is very bad for highly skewed
> data, as every row containing a particular key will end up on one node.
>
> Additionally, repartitioning is expensive, and the existing partitioning
> scheme may have been optimized to minimize message passing. For example,
> perhaps an RDD is an edge list for a graph, but a user has already
> partitioned this data by a community structure or connected components,
> ensuring that similar edges are on the same partition. Using a join
> operation to perform message passing would require repartitioning the edge
> list by the first or second vertex in the edge as a key.
>
> Instead of repartitioning and shuffling, could messages across partitions be
> "bundled" together and passed once, almost like a broadcast operation?
> Essentially, the request here is to treat ALL RDDs of any size as
> broadcast-capable: each partition would be broadcast one at a time, and the
> results aggregated. It would be up to the user to optimize the partitioning
> to minimize the between-partition message-passing volume.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
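A minimal sketch of the bundled-broadcast join described above, in plain Python rather than Spark (the function name and data layout are illustrative, not any Spark API): each partition of one side is "broadcast" in turn and probed against every partition of the other side in place, so neither side is ever repartitioned by key and a skewed key never forces all its rows onto one node.

```python
from collections import defaultdict

def partitioned_join(left_partitions, right_partitions):
    """Join two pre-partitioned lists of (key, value) pairs without
    repartitioning either side by the join key."""
    results = []
    for right_part in right_partitions:
        # One "broadcast round": build a small hash map for this single
        # partition, analogous to shipping one partition to all nodes.
        lookup = defaultdict(list)
        for k, v in right_part:
            lookup[k].append(v)
        # Every left partition probes the broadcast partition where it sits;
        # rows for a skewed key stay spread across their original partitions.
        for left_part in left_partitions:
            for k, v in left_part:
                for w in lookup.get(k, []):
                    results.append((k, (v, w)))
    return results

# Skewed toy data: key "a" is spread over two left partitions.
left = [[("a", 1), ("b", 2)], [("a", 3)]]
right = [[("a", 10)], [("b", 20)]]
print(sorted(partitioned_join(left, right)))
# → [('a', (1, 10)), ('a', (3, 10)), ('b', (2, 20))]
```

This is only the correctness skeleton; the cost the ticket targets is in where each round's hash map travels, and it makes the trade-off visible: total work is one broadcast round per partition of the smaller side, rather than a full shuffle of both sides.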