Thanks for your answer.

But the same problem appears if you start from one common RDD:

    val partitioner = new HashPartitioner(10)
    val dummyJob = sc.parallelize(0 until 10).map(x => (x, x))
    dummyJob.partitionBy(partitioner).foreach { case (ind, x) =>
      println("Dummy1 -> Id = " + ind)
    }
    dummyJob.partitionBy(partitioner).foreach { case (ind, x) =>
      println("Dummy2 -> Id = " + ind)
    }

The output on one node is again similar (under Spark v1.0.2):
    Dummy1 -> Id = 2
    Dummy2 -> Id = 7 
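(As an aside, the println in the snippet shows the key, not the partition ID. A sketch for inspecting where partitions actually land, assuming a live SparkContext `sc` and a running cluster, might look like this:)

```scala
import java.net.InetAddress
import org.apache.spark.HashPartitioner

// Sketch: report which partition ID ended up on which host.
// Assumes a live SparkContext `sc`; only runs on a cluster.
val placed = sc.parallelize(0 until 10)
  .map(x => (x, x))
  .partitionBy(new HashPartitioner(10))
  .mapPartitionsWithIndex { (pid, iter) =>
    val host = InetAddress.getLocalHost.getHostName
    iter.map { case (k, _) => s"partition $pid (key $k) on $host" }
  }
placed.collect().foreach(println)
```

Running this once per `partitionBy` call would show directly whether partitions with the same ID land on the same host across the two jobs.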

In this example, if Spark doesn't guarantee that two partitions sharing the
same ID will be collocated on the same node, then what's the point of the
RDD.zip function? To zip two randomly picked partitions? That doesn't make
sense...

To be more concrete, say I have two RDDs representing two relations and I want
to do a hash join between them. I partition both on the join key and do local
joins on each node using the zip function. Obviously, corresponding partitions
of the two relations have to be collocated on the same node. Partitioning in
Spark 0.9.2 worked exactly like that. With the new release that's no longer
the case, and the join result is consequently incorrect.
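To illustrate the behavior I'm relying on, here is a local sketch in plain Scala (no Spark; the relation data and helper names are made up for illustration): if both relations are hash-partitioned by the same function, partition i of one relation can only match keys in partition i of the other, so pairing partitions by ID and joining locally gives the correct result.

```scala
// Local simulation (no Spark) of a co-partitioned hash join.
// numPartitions and the sample relations are made up for illustration.
object CoPartitionedJoin {
  val numPartitions = 10

  // Same rule HashPartitioner uses: non-negative key hash modulo #partitions.
  def partitionOf(key: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Split a relation into its hash partitions, keyed by partition ID.
  def partitionRelation(rel: Seq[(Int, String)]): Map[Int, Seq[(Int, String)]] =
    rel.groupBy { case (k, _) => partitionOf(k) }.withDefaultValue(Seq.empty)

  // Pair up partitions with the same ID (what zip assumes) and join locally.
  def joinCoPartitioned(r: Seq[(Int, String)],
                        s: Seq[(Int, String)]): Seq[(Int, String, String)] = {
    val rParts = partitionRelation(r)
    val sParts = partitionRelation(s)
    (0 until numPartitions).flatMap { pid =>
      val sByKey = sParts(pid).groupBy(_._1)
      rParts(pid).flatMap { case (k, rv) =>
        sByKey.getOrElse(k, Seq.empty).map { case (_, sv) => (k, rv, sv) }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val r = (0 until 10).map(k => (k, "r" + k))
    val s = (0 until 10).map(k => (k, "s" + k))
    // Every key matches, because partition i of r lines up with partition i of s.
    joinCoPartitioned(r, s).sortBy(_._1).foreach(println)
  }
}
```

This only produces the correct join because partition i of r and partition i of s are built with the same partitioner, which is exactly the collocation guarantee the zip-based approach needs from Spark.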

Thanks,
Milos



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Partitioning-Where-do-my-partitions-go-tp11635p11772.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
