Hello Spark users and developers. I read file and want to ensure that it has exact number of partitions, for example 128.
In documentation I found: def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] But argument here is minimal number of partitions, so I use coalesce to ensure desired number of partitions: def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T] //Return a new RDD that is reduced into numPartitions partitions. So I combine them and get number of partitions lower than expected: scala> sc.textFile("perf_test1.csv", minPartitions=128).coalesce(128).getNumPartitions res14: Int = 126 Is this expected behaviour? File contains 100000 lines, size of partitions before and after coalesce: scala> sc.textFile("perf_test1.csv", minPartitions=128).mapPartitions{rows => Iterator(rows.length)}.collect() res16: Array[Int] = Array(782, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781) scala> sc.textFile("perf_test1.csv", minPartitions=128).coalesce(128).mapPartitions{rows => Iterator(rows.length)}.collect() res15: Array[Int] = Array(1563, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 1563, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782) So two partitions are double the size. Is this expected behaviour or is it some kind of bug? Thanks, Maciej Sokołowski