Hi all, I’ve noticed a change in partitioning behavior under Spark 1.0.x. I’d appreciate it if someone could explain what has changed in the meantime.
Here is a small example. I want to create two RDD[(K, V)] objects and then collocate partitions with the same K on one node. When the same partitioner is used for both RDDs, partitions with the same K end up on different nodes.

// Let's say I have 10 nodes
val partitioner = new HashPartitioner(10)

// Create the RDD
val rdd = sc.parallelize(0 until 10).map(k => (k, computeValue(k)))

// Partition twice using the same partitioner
rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy1 -> k = " + k) }
rdd.partitionBy(partitioner).foreach { case (k, v) => println("Dummy2 -> k = " + k) }

The output on one node is:

Dummy1 -> k = 2
Dummy2 -> k = 7

I was expecting to see the same keys on each node. That was the behavior under Spark 0.9.2, but it is not under Spark 1.0.x. Does anyone have an idea what has changed in the meantime, or how to get corresponding partitions onto one node?

Thanks in advance,
Milos

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
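To illustrate the expectation in the example above: a hash partitioner maps a key to a partition index deterministically, so partitioning the same RDD twice with the same partitioner must produce identical key-to-partition-index assignments. What follows is a minimal sketch, not Spark code and not part of the original mail; it mimics the key-to-index arithmetic of a HashPartitioner, with Python's built-in `hash` standing in (as an assumption) for the JVM `hashCode`. If the indices agree yet the keys observed on a node differ between the two runs, the difference lies in which node hosts each partition, not in the partitioning itself.

```python
def partition_index(key, num_partitions):
    # Hash-partitioning sketch: a non-negative modulus of the key's hash.
    # (Python's % on ints is already non-negative for a positive divisor.)
    return hash(key) % num_partitions

keys = range(10)

# Assign every key twice with the "same partitioner" (same function, same
# partition count), mirroring the two partitionBy calls in the mail.
first = [partition_index(k, 10) for k in keys]
second = [partition_index(k, 10) for k in keys]

# The assignments are necessarily identical: same key, same index.
assert first == second
print(first)
```

The deterministic index is what makes shuffle-free joins of co-partitioned RDDs possible; where those partitions physically run is a scheduling decision, separate from the partitioner.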