Hi,

consider the following code:

import org.apache.spark.{SparkContext, SparkConf}
object ParallelismBug extends App {
  val sConf = new SparkConf()
    .setMaster("spark://hostName:7077") // .setMaster("local[4]")
    .set("spark.default.parallelism", "7") // or without it
  val sc = new SparkContext(sConf)
  val rdd = sc.textFile("input/100") // or: sc.parallelize(Array.range(1, 100))
  val rdd2 = rdd.intersection(rdd)
  println("rdd: " + rdd.partitions.size + " rdd2: " + rdd2.partitions.size)
}

Suppose that input/100 contains 100 files. With the configuration above the
output is rdd: 100 rdd2: 7, which seems OK. When spark.default.parallelism is
not set, the output is rdd: 100 rdd2: 100, but according to
https://spark.apache.org/docs/latest/configuration.html#execution-behavior
it should be rdd: 100 rdd2: 2 (on my 1-core machine).
However, when rdd is defined using sc.parallelize, the results seem OK:
rdd: 2 rdd2: 2.
Moreover, when the master is local[4] and parallelism is set, the result is
rdd: 100 rdd2: 4 instead of the expected rdd: 100 rdd2: 7. And when
parallelism is not set, it behaves the same as with master
spark://hostName:7077.
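
For what it's worth, here is a minimal sketch of how I compare what Spark
reports as its default parallelism with the partition counts intersection
produces, both with and without an explicit numPartitions; the master URL
and the input path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object ParallelismCheck extends App {
  val conf = new SparkConf()
    .setMaster("local[4]") // placeholder; swap for spark://hostName:7077
  // .set("spark.default.parallelism", "7") // toggle to compare both cases
  val sc = new SparkContext(conf)

  // What Spark itself reports as the default level of parallelism
  println("defaultParallelism: " + sc.defaultParallelism)

  val rdd = sc.textFile("input/100") // placeholder path with many input files

  // intersection without an argument lets Spark pick the partition count;
  // the overload taking numPartitions forces it explicitly
  val auto  = rdd.intersection(rdd)
  val fixed = rdd.intersection(rdd, 7)

  println("rdd: " + rdd.partitions.size +
    " auto: " + auto.partitions.size +
    " fixed: " + fixed.partitions.size)

  sc.stop()
}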

Am I misunderstanding something, or is this a bug?

Thanks,
Grzegorz
