I have made some progress. The partitioning is very uneven: everything goes to one partition. I see that Spark partitions by key, so I tried this:
// partitioning is done like partitionIdx = f(key) % numPartitions
// we use random keys to get even partitioning
val uniform = other_stream.transform(rdd => {
  rdd.map({ kv =>
    val v = kv._2
    (UUID.randomUUID().toString, v)
  })
})

uniform.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    ...

As you can see, I'm using random keys. Even so, when running with 2 nodes, I verified that one partition is completely empty and the other contains all the records. What is going wrong with the partitioning here?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Imbalanced-shuffle-read-tp18648p18790.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
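For anyone hitting the same thing, here is a sketch of what seems to be missing (hedged: `otherStream` and `numPartitions` are placeholder names, not from the post above). Changing the keys inside a plain `map()` does not move any records by itself; the `f(key) % numPartitions` assignment only happens at a shuffle boundary. An explicit `partitionBy` with a `HashPartitioner` forces that shuffle, so the random keys actually take effect:

    import java.util.UUID
    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.dstream.DStream

    // Re-key each record with a random UUID, then force a shuffle so the
    // new keys are actually used to redistribute records across partitions.
    def rebalance(otherStream: DStream[(String, String)],
                  numPartitions: Int): DStream[(String, String)] =
      otherStream.transform { rdd =>
        rdd
          .map { case (_, v) => (UUID.randomUUID().toString, v) }
          // Without this partitionBy, the re-keyed records stay exactly
          // where they were; map() alone never triggers a shuffle.
          .partitionBy(new HashPartitioner(numPartitions))
      }

With UUID keys hashed across `numPartitions` partitions, the distribution should be close to uniform; without the `partitionBy`, the skew of the upstream partitioning is simply carried through unchanged.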