Hi all, I’ve noticed a change in partitioning behavior under Spark 1.0.x. I’d appreciate it if someone could explain what has changed in the meantime.
Here is a small example. I want to create two RDD[(K, V)] objects and then collocate partitions with the same K on one node. When the same partitioner is used for both RDDs, partitions with the same K end up on different nodes.

// Let's say I have 10 nodes
val partitioner = new HashPartitioner(10)

// Create the RDD
val rdd = sc.parallelize(0 until 10).map(k => (k, computeValue(k)))

// Partition twice using the same partitioner
rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy1 -> k = " + k) }
rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy2 -> k = " + k) }

The output on one node is:

Dummy1 -> k = 2
Dummy2 -> k = 7

I was expecting to see the same keys on each node. That was the behavior under Spark 0.9.2, but it is not under Spark 1.0.x. Does anyone have an idea what has changed in the meantime, or how to get corresponding partitions onto one node?

Thanks in advance,
Milos

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
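To illustrate the expectation in the example above: a hash partitioner maps a key to a partition index deterministically, so partitioning the same RDD twice with the same partitioner must produce identical key-to-partition-index assignments. What follows is a minimal sketch, not Spark code and not part of the original mail; it mimics the key-to-index arithmetic of a HashPartitioner, with Python's built-in `hash` standing in (as an assumption) for the JVM `hashCode`. If the indices agree yet the keys observed on a node differ between the two runs, the difference lies in which node hosts each partition, not in the partitioning itself.

```python
def partition_index(key, num_partitions):
    # Hash-partitioning sketch: a non-negative modulus of the key's hash.
    # (Python's % on ints is already non-negative for a positive divisor.)
    return hash(key) % num_partitions

keys = range(10)

# Assign every key twice with the "same partitioner" (same function, same
# partition count), mirroring the two partitionBy calls in the mail.
first = [partition_index(k, 10) for k in keys]
second = [partition_index(k, 10) for k in keys]

# The assignments are necessarily identical: same key, same index.
assert first == second
print(first)
```

The deterministic index is what makes shuffle-free joins of co-partitioned RDDs possible; where those partitions physically run is a scheduling decision, separate from the partitioner.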