"I'm not sure, but I wonder if because you are using the Spark REPL that it may not be representing what a normal runtime execution would look like and is possibly eagerly running a partial DAG once you define an operation that would cause a shuffle.
What happens if you set up your same set of commands [a-e] in a file and use the Spark REPL's `load` or `paste` command to load them all at once?"
From Richard

I have also packaged it in a jar file (without [e], the debug string), and I still see the extra stage before the two stages that I would expect. Even when I remove [d], the action, stage 0 is still executed (and stages 1 and 2 are not).

Again, a shortened log of stage 0:

INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at sortByKey, which has no missing parents
INFO DAGScheduler: ResultStage 0 (sortByKey) finished in 0.192 s

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Extra-stage-that-executes-before-triggering-computation-with-an-action-tp22707p22713.html