Thanks for your answer. But the same problem appears if you start from a single common RDD:
  val partitioner = new HashPartitioner(10)
  val dummyJob = sc.parallelize(0 until 10).map(x => (x, x))

  dummyJob.partitionBy(partitioner).foreach { case (ind, x) =>
    println("Dummy1 -> Id = " + ind)
  }
  dummyJob.partitionBy(partitioner).foreach { case (ind, x) =>
    println("Dummy2 -> Id = " + ind)
  }

The output on one node is again similar (under Spark v1.0.2):

  Dummy1 -> Id = 2
  Dummy2 -> Id = 7

In this example, if Spark doesn't guarantee that two partitions sharing the same ID will be collocated on the same node, then what's the point of having the RDD.zip function? To zip two randomly picked partitions? That doesn't make sense.

To be more concrete: say I have two RDDs representing two relations, and I want to do a hash join between them. I want to partition both on the join key and perform local joins on each node using the zip function. Obviously, corresponding partitions of the two relations have to be collocated on the same node. Partitioning in Spark 0.9.2 worked exactly like that. With the new release that is no longer the case, and consequently the join result is incorrect.

Thanks,
Milos
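P.S. To make the intended pattern concrete, here is a minimal sketch of the local hash join I am describing. The relation contents and names are made up, and I use zipPartitions (the per-partition cousin of zip) rather than element-wise zip, since the two relations generally won't have equal element counts in each partition:

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._ // pair-RDD functions in 1.x

  // Sketch only: two toy relations keyed on the join key.
  val r = sc.parallelize(Seq((1, "r1"), (2, "r2"), (2, "r3")))
  val s = sc.parallelize(Seq((2, "s1"), (3, "s2")))

  // Partition both sides with the same partitioner.
  val p = new HashPartitioner(10)
  val rPart = r.partitionBy(p)
  val sPart = s.partitionBy(p)

  // zipPartitions computes matching partitions of both parents in the
  // same task, so the build and probe sides should meet locally.
  val joined = rPart.zipPartitions(sPart) { (rIter, sIter) =>
    val table = rIter.toSeq.groupBy(_._1) // build side
    sIter.flatMap { case (k, sv) =>
      table.getOrElse(k, Nil).map { case (_, rv) => (k, (rv, sv)) }
    }
  }

  joined.collect().foreach(println) // expect (2,(r2,s1)) and (2,(r3,s1))

This is exactly the kind of code that breaks if corresponding partition IDs of the two relations are not computed together on the same node.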