With the following pseudo-code, val rdd1 = sc.sequenceFile(...) // has > 100 partitions val rdd2 = rdd1.coalesce(100) val rdd3 = rdd2 map { ... } val rdd4 = rdd3.coalesce(2) val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
I would expect the parallelism of the map() operation to be 100 concurrent tasks, and the parallelism of the save() operation to be 2. However, it appears the parallelism of the entire chain is 2 -- I only see two tasks created for the save() operation and those tasks appear to execute the map() operation as well. Assuming what I'm seeing is as-specified (meaning, how things are meant to be), what's the recommended way to force a parallelism of 100 on the map() operation? thanks!