We just updated from Spark 1.1.0 to Spark 1.2.0. We have a small framework
we've been developing that connects various RDDs together based on some
predefined business cases. After updating to 1.2.0, the way stages within
concurrently submitted jobs are scheduled and executed seems to have changed
quite significantly from what we expected.

Given 3 RDDs:

RDD1 = inputFromHDFS().groupBy().sortBy().etc().cache()
RDD2 = RDD1.outputToFile()
RDD3 = RDD1.groupBy().outputToFile()
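
In case it helps, here is a stripped-down, hypothetical version of what our
framework ends up doing. The paths, key extraction, and transformations are
placeholders rather than our real business logic; the point is only the shape:
one cached lineage fanned out to two jobs submitted from separate threads at
roughly the same time.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object CachedFanOut {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cached-fan-out"))

    // "RDD1": an expensive lineage that ends in cache()
    val rdd1 = sc.textFile("hdfs:///tmp/fanout/input")   // stand-in for inputFromHDFS()
      .map(line => (line.take(1), line))                 // stand-in key extraction
      .groupByKey()
      .sortBy(_._1)
      .cache()

    // "RDD2" and "RDD3": two downstream jobs kicked off concurrently
    val job2 = Future { rdd1.saveAsTextFile("hdfs:///tmp/fanout/out2") }
    val job3 = Future {
      rdd1.map { case (key, values) => (values.size, key) }
        .groupByKey()
        .saveAsTextFile("hdfs:///tmp/fanout/out3")
    }

    Await.result(Future.sequence(Seq(job2, job3)), Duration.Inf)
    sc.stop()
  }
}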

In Spark 1.1.0, we expected RDD1 to be scheduled based on the first stage
encountered (RDD2's outputToFile() or RDD3's groupBy()), and then for RDD2 and
RDD3 to both block until RDD1 had completed and been cached, at which point
RDD2 and RDD3 would both use the cached version to finish their work.
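
To make that expectation concrete, continuing the hypothetical sketch above:
if we force rdd1 to materialize with an action such as count() before
submitting the two output jobs, both jobs should find cached partitions
(assuming the cached data fits in memory) rather than recomputing the lineage.

// Hypothetical sketch, continuing the example above: materialize rdd1
// first so both downstream jobs read the cached partitions instead of
// recomputing the expensive lineage.
rdd1.count()   // one pass over the lineage populates the cache

val job2 = Future { rdd1.saveAsTextFile("hdfs:///tmp/fanout/out2") }
val job3 = Future {
  rdd1.map { case (key, values) => (values.size, key) }
    .groupByKey()
    .saveAsTextFile("hdfs:///tmp/fanout/out3")
}
Await.result(Future.sequence(Seq(job2, job3)), Duration.Inf)

That is roughly the behaviour we expected to get without the explicit count().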

Spark 1.2.0 instead seems to schedule each of RDD1's stages twice, once per
job (inputFromHDFS(), groupBy(), sortBy(), and etc() each get run twice,
albeit concurrently). It does not look like there is any sharing of results
between the two jobs.

Are we doing something wrong? Is there a setting that I'm not understanding
somewhere?
