Hi, I have a Spark Streaming application running on a single node that consists mainly of map operations. I use repartitioning to control the number of CPU cores I want to use. The code goes like this:
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val distFile = ssc.textFileStream("/home/myuser/spark-example/dump")

    // repartition so the work is spread over the desired number of cores
    val words = distFile.repartition(cores.toInt)
      .flatMap(_.split(" "))
      .filter(_.length > 3)

    // average character value of each word
    val wordCharValues = words.map { word =>
      var sum = 0
      word.toCharArray.foreach(c => sum += c.toInt)
      sum.toDouble / word.length.toDouble
    }

    wordCharValues.foreachRDD(rdd => println("MEAN: " + rdd.mean()))

I have two questions:

1) How can I use coalesce in this code instead of repartition? (My guess is at the end of this post.)

2) Why does the mean I obtain vary with the number of partitions, even on the same dataset (a small file processed within a single batch)? If I don't call repartition, the result is the same on every execution, as it should be. But repartitioning into, for instance, 2 partitions gives a different mean than using 8 partitions. I really don't understand why, given that my code is deterministic. Can someone enlighten me on this? Thanks.
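For question 1, this is only a sketch of what I have in mind, based on my (possibly wrong) understanding that DStream has no coalesce method of its own, so it would have to go through transform():

    // Sketch only: apply coalesce to each batch RDD via transform().
    // My understanding is that coalesce without shuffle = true can only
    // reduce the partition count, so it may not spread a single-partition
    // input file over several cores the way repartition does.
    val words = distFile
      .transform(rdd => rdd.coalesce(cores.toInt))
      .flatMap(_.split(" "))
      .filter(_.length > 3)

Is that the right approach?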
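And to make question 2 concrete, here is what I mean by deterministic: the same per-word computation in plain Scala (on a made-up word list, no Spark involved) yields one fixed mean no matter how the input is ordered or grouped:

    // Plain-Scala version of the per-word computation on a hypothetical list;
    // the mean of these values is a single fixed number.
    val sample = Seq("some", "sample", "words", "here")
    val perWord = sample.map(w => w.map(_.toInt).sum.toDouble / w.length)
    println("MEAN: " + perWord.sum / perWord.size)

So I would expect the mean printed by the streaming job to be the same regardless of the partition count.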