Hi, consider the following code:
import org.apache.spark.{SparkContext, SparkConf}

object ParallelismBug extends App {
  val sConf = new SparkConf()
    .setMaster("spark://hostName:7077") // .setMaster("local[4]")
    .set("spark.default.parallelism", "7") // or without it
  val sc = new SparkContext(sConf)
  val rdd = sc.textFile("input/100")
  // val rdd = sc.parallelize(Array.range(1, 100))
  val rdd2 = rdd.intersection(rdd)
  println("rdd: " + rdd.partitions.size + " rdd2: " + rdd2.partitions.size)
}

Suppose that input/100 contains 100 files. With the above configuration the output is "rdd: 100 rdd2: 7", which seems OK. When we don't set the parallelism, the output is "rdd: 100 rdd2: 100", but according to https://spark.apache.org/docs/latest/configuration.html#execution-behavior it should be "rdd: 100 rdd2: 2" (on my 1-core machine). However, when rdd is defined using sc.parallelize, the result seems OK: "rdd: 2 rdd2: 2". Moreover, when the master is local[4] and we set the parallelism, the result is "rdd: 100 rdd2: 4" instead of "rdd: 100 rdd2: 7". And when we don't set the parallelism, it behaves the same as with master spark://hostName:7077.

Am I misunderstanding something, or is this a bug?

Thanks,
Grzegorz
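For comparison, one way to take spark.default.parallelism out of the picture is to pass the partition count to the shuffle operation itself. Below is a minimal sketch, assuming the same input/100 directory and master URL as above (ExplicitParallelism is just an illustrative name), using the intersection(other, numPartitions) overload of the RDD API:

import org.apache.spark.{SparkContext, SparkConf}

object ExplicitParallelism extends App {
  val sc = new SparkContext(
    new SparkConf()
      .setAppName("ExplicitParallelism")
      .setMaster("spark://hostName:7077")) // assumed master URL, as above

  val rdd = sc.textFile("input/100")

  // Passing numPartitions explicitly means the shuffle does not fall back
  // on spark.default.parallelism or the parent RDD's partition count.
  val rdd2 = rdd.intersection(rdd, 7)

  println("rdd: " + rdd.partitions.size + " rdd2: " + rdd2.partitions.size)
  // Should print "rdd: 100 rdd2: 7" whether or not the
  // spark.default.parallelism property is set.
}

Note that the minPartitions argument of sc.textFile(path, minPartitions) is only a lower bound on the number of input partitions, so with 100 input files it cannot bring the count below 100; coalesce or repartition would be needed for that.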