Hi, I have a Spark Streaming application running on a single node that consists mainly of map operations. I use repartitioning to control the number of CPU cores I want to use. The code goes like this:
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val distFile = ssc.textFileStream("/home/myuser/spark-example/dump")

    // repartition so the work is spread over the desired number of cores
    val words = distFile.repartition(cores.toInt)
      .flatMap(_.split(" "))
      .filter(_.length > 3)

    // average character value of each word
    val wordCharValues = words.map { word =>
      var sum = 0
      word.toCharArray.foreach(c => sum += c.toInt)
      sum.toDouble / word.length.toDouble
    }

    wordCharValues.foreachRDD(rdd => println("MEAN: " + rdd.mean()))

I have two questions:

1) How can I use coalesce in this code instead of repartition? (My guess is at the end of this post.)

2) Why does the mean I obtain vary with the number of partitions, even on the same dataset (a small file processed within a single batch)? If I don't call repartition, the result is the same on every execution, as it should be. But repartitioning into, for instance, 2 partitions gives a different mean than using 8 partitions. I really don't understand why, given that my code is deterministic. Can someone enlighten me on this? Thanks.
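For question 1, this is only a sketch of what I have in mind, based on my (possibly wrong) understanding that DStream has no coalesce method of its own, so it would have to go through transform():

    // Sketch only: apply coalesce to each batch RDD via transform().
    // My understanding is that coalesce without shuffle = true can only
    // reduce the partition count, so it may not spread a single-partition
    // input file over several cores the way repartition does.
    val words = distFile
      .transform(rdd => rdd.coalesce(cores.toInt))
      .flatMap(_.split(" "))
      .filter(_.length > 3)

Is that the right approach?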
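And to make question 2 concrete, here is what I mean by deterministic: the same per-word computation in plain Scala (on a made-up word list, no Spark involved) yields one fixed mean no matter how the input is ordered or grouped:

    // Plain-Scala version of the per-word computation on a hypothetical list;
    // the mean of these values is a single fixed number.
    val sample = Seq("some", "sample", "words", "here")
    val perWord = sample.map(w => w.map(_.toInt).sum.toDouble / w.length)
    println("MEAN: " + perWord.sum / perWord.size)

So I would expect the mean printed by the streaming job to be the same regardless of the partition count.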