Hi, I am quite new to Spark and have some questions on joins and co-partitioning.
Are the following assumptions correct?

When a join takes place and one of the RDDs has been partitioned, does Spark make a best effort to execute the join for a specific partition on the node where that partition's data already resides locally, i.e. in memory? Does the order in which I write the join make any difference in this case? For example:

rdd1 - hash partitioned
rdd2 - not partitioned

rdd1.join(rdd2);
rdd2.join(rdd1);

----------

If we have two RDDs that have both been partitioned using the same partitioner, and both have been evaluated into memory on the cluster, how does Spark decide which node to run the join on for each partition? Is the following order correct?

1. Spark will try to run the join for a partition on a node where the partitions of both RDDs are co-located.
2. If both RDDs have already been evaluated into memory on the cluster but the partitions are not co-located, it will try to run on the node that has local access to the larger of the two partitions.
3. Continuing on from 2, it will fall back to the node with local access to the smaller RDD's partition.
4. Run on any node.

In the above scenario, does the order in which I write the join have any effect on how Spark decides where to run it?

rdd1.join(rdd2);
rdd2.join(rdd1);

----------

A follow-on from the above: if there are two partitioned RDDs but only one of them has been evaluated into memory on the cluster, is the following order correct?

1. Spark will try to run the join on the node where that RDD's partition has been evaluated into memory.
2. Run on any node.

Again, does the order I write the join in make any difference?

rdd1.join(rdd2);
rdd2.join(rdd1);

Thank you in advance for any help on this topic.

Dave.
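P.S. In case it helps to make the scenarios concrete, here is a minimal Scala sketch of the kind of setup I mean. The data, partition counts, and names are just examples, and local[*] is only for illustration:

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object CoPartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("co-partitioned-join").setMaster("local[*]"))

    // Scenario 1: only rdd1 is partitioned (hash), rdd2 is not.
    val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
      .partitionBy(new HashPartitioner(4))   // rdd1 - hash partitioned
      .persist()                              // cache the partitioned data
    rdd1.count()                              // action to force evaluation into memory

    val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y")))  // rdd2 - not partitioned

    // The join written both ways, as in my question.
    val joinedA = rdd1.join(rdd2)
    val joinedB = rdd2.join(rdd1)

    // Scenarios 2 and 3: both RDDs partitioned with the same partitioner,
    // both (or only one) evaluated into memory on the cluster.
    val rdd3 = sc.parallelize(Seq((1, "p"), (3, "q")))
      .partitionBy(new HashPartitioner(4))
      .persist()
    rdd3.count()

    val coPartitionedJoin = rdd1.join(rdd3)

    println(joinedA.collect().toSeq)
    println(joinedB.collect().toSeq)
    println(coPartitionedJoin.collect().toSeq)

    sc.stop()
  }
}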