Hi,

I am quite new to Spark and have some questions on joins and
co-partitioning. 

Are the following assumptions correct?

When a join takes place and one of the RDDs has been partitioned, does
Spark make a best effort to execute the join for each partition on the node
where that partition's data already resides locally, i.e. in memory?
Does the order in which I write the join make any difference in this case?
For example:

rdd1 - hash partitioned 
rdd2 - not partitioned

rdd1.join(rdd2);
rdd2.join(rdd1);
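
For concreteness, here is roughly the shape of this first case as I would
type it into the spark-shell (sc is the predefined SparkContext; the data
and the partition count are just placeholders):

import org.apache.spark.HashPartitioner

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(new HashPartitioner(4))   // hash partitioned
  .cache()
rdd1.count()                             // force evaluation into memory

val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y")))   // no partitioner

val joined = rdd1.join(rdd2)
// joined.partitioner reports Some(HashPartitioner), which I read as rdd2
// being shuffled to rdd1's layout rather than the other way around, but I
// would like to confirm that, and whether rdd2.join(rdd1) behaves the same.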

----------
If we have two RDDs that have both been partitioned using the same
partitioner, and both have been evaluated into memory on the cluster, how
does Spark decide which node to run the join on for each partition? Is the
following order correct?

1. Spark will try to run the join for a partition on a node where both
RDDs' corresponding partitions are co-located.
2. If both RDDs have been evaluated into memory but the two partitions are
not co-located, it will try to run on the node that has local access to the
larger of the two partitions.
3. Failing that, it will fall back to the node with local access to the
smaller partition.
4. Otherwise, it will run on any node.

In the above scenario, does the order in which I write the join have any
effect on where Spark decides to run it?

rdd1.join(rdd2);
rdd2.join(rdd1);
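
As a sketch of this second case (again placeholder data; the real RDDs are
much larger):

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()
rdd1.count()   // force both sides into memory
rdd2.count()

val joined = rdd1.join(rdd2)
// joined.toDebugString shows no extra shuffle stage here, so I believe the
// join itself is shuffle-free; my question is only about which node each
// pair of co-partitioned partitions gets scheduled on.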

---------------
A follow-on from the above: if there are two partitioned RDDs but only one
of them has been evaluated into memory on the cluster, is the following
order correct?

1. Spark will try to run the join on the node where the evaluated RDD's
partition is held in memory.
2. Otherwise, it will run on any node.

Again, does the order in which I write the join make any difference?

rdd1.join(rdd2);
rdd2.join(rdd1);
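
And a sketch of this third case, where only rdd1 has been materialised
(same placeholder data as above):

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(8)
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
rdd1.count()   // only rdd1 is evaluated into memory

// rdd2 uses the same partitioner but is neither cached nor evaluated yet
val rdd2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part)

val joined1 = rdd1.join(rdd2)
val joined2 = rdd2.join(rdd1)   // does this get scheduled any differently?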

Thank you in advance for any help on this topic. 
Dave. 



