We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework we've been developing that connects various RDDs together based on some predefined business cases. After updating to 1.2.0, the way stages within jobs are executed concurrently has changed quite significantly from what we expected.
Given three RDDs:

    RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
    RDD2 = RDD1.outputToFile
    RDD3 = RDD1.groupBy().outputToFile

In Spark 1.1.0, we expected RDD1 to be scheduled based on whichever stage was encountered first (RDD2's outputToFile or RDD3's groupBy()), and then for RDD2 and RDD3 to both block until RDD1 had completed and been cached, at which point both would use the cached result to finish their own work.

Spark 1.2.0 instead seems to schedule two (albeit concurrently running) stages for each of RDD1's stages: inputFromHDFS, groupBy(), sortBy(), and etc() each get run twice. There does not appear to be any sharing of results between the two jobs.

Are we doing something wrong? Is there a setting that I'm not understanding somewhere?
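To make the shape of this concrete, here is a minimal Scala sketch of the pattern. The paths, the parsing, and the concrete transformations are placeholders, and it assumes the two output jobs are submitted concurrently from separate threads (which is how our framework produces the "blocking on the cached RDD" expectation described above), so treat it as an illustration rather than our actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on 1.x)
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    object CachedRddFanOut {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cached-rdd-fan-out"))

        // "RDD1": read from HDFS, group, sort, then cache so the two
        // downstream jobs can (in theory) share one computed copy.
        val rdd1 = sc.textFile("hdfs:///data/input")        // hypothetical path
          .map(line => (line.split(",")(0), line))           // key by first field
          .groupByKey()
          .sortByKey()
          .cache()

        // "RDD2" and "RDD3": two output jobs submitted concurrently,
        // both depending on the cached rdd1.
        val job2 = Future {
          rdd1.saveAsTextFile("hdfs:///data/output-rdd2")
        }
        val job3 = Future {
          rdd1.mapValues(_.size)
            .saveAsTextFile("hdfs:///data/output-rdd3")
        }

        Await.result(Future.sequence(Seq(job2, job3)), Duration.Inf)
        sc.stop()
      }
    }

In this setup, the question is whether both jobs recompute rdd1's stages or whether the second one waits for and reuses the cached partitions produced by the first.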