Hello Spark users and developers.

I read file and want to ensure that it has exact number of partitions, for
example 128.

In documentation I found:

def textFile(path: String, minPartitions: Int = defaultMinPartitions):
RDD[String]

But argument here is minimal number of partitions, so I use coalesce to
ensure desired number of partitions:

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord:
Ordering[T] = null): RDD[T]
//Return a new RDD that is reduced into numPartitions partitions.

So I combine them and get number of partitions lower than expected:

scala> sc.textFile("perf_test1.csv",
minPartitions=128).coalesce(128).getNumPartitions
res14: Int = 126

Is this expected behaviour? File contains 100000 lines, size of partitions
before and after coalesce:

scala> sc.textFile("perf_test1.csv", minPartitions=128).mapPartitions{rows
=> Iterator(rows.length)}.collect()
res16: Array[Int] = Array(782, 781, 782, 781, 781, 782, 781, 781, 781, 781,
782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781,
781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781,
781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781,
781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782,
781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781,
782, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781,
781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781,
782, 781, 781, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781)

scala> sc.textFile("perf_test1.csv",
minPartitions=128).coalesce(128).mapPartitions{rows =>
Iterator(rows.length)}.collect()
res15: Array[Int] = Array(1563, 781, 781, 781, 782, 781, 781, 781, 781,
782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781,
782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781,
781, 782, 781, 781, 781, 781, 1563, 782, 781, 781, 782, 781, 781, 781, 781,
782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781,
781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781,
781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781,
781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782,
781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782)

So two partitions are double the size. Is this expected behaviour or is it
some kind of bug?

Thanks,
Maciej Sokołowski

Reply via email to