Hi,

I am trying to see exactly what happens under the hood of Spark when performing a simple sortByKey. So far I have already discovered the fetch files and both the temp-shuffle and shuffle files being written to disk, but there is still an extra stage that keeps puzzling me.
This is the code that I execute line by line in the spark-shell (I will refer to the lines as [a] through [d]):

    val input = sc.textFile(path, 2)                                  [a]
    val inputRDD = input.map(<some lambda to create a key-value pair>)  [b]
    val result = inputRDD.sortByKey(true, 2)                          [c]
    result.saveAsTextFile                                             [d]

As there is only one shuffle, I expect to see two stages. This is confirmed by result.toDebugString:

    (2) ShuffledRDD[5] at sortByKey at <console>:25 []                                           -> [c]
     +-(2) MapPartitionsRDD[2] at map at <console>:23 []                                         -> [b]
        |  ./Sort/input-10-records-2-parts/ MapPartitionsRDD[1] at textFile at <console>:21 []   -> [a]
        |  ./Sort/input-10-records-2-parts/ HadoopRDD[0] at textFile at <console>:21 []          -> [a]

Since there is one indentation (one shuffle boundary), there should be two stages. There is an extra RDD (MapPartitionsRDD[6]) that is created by [d], but it is not a parent of my result RDD, so it is not listed in this trace.

Now, when I run these commands one by one in the spark-shell, I see the following execution:

[a] [b] [c]  (no action performed yet)

    INFO SparkContext: Starting job: sortByKey at <console>:25
    INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at sortByKey at <console>:25), which has no missing parents
    INFO DAGScheduler: ResultStage 0 (sortByKey at <console>:25) finished in 0.109 s
    INFO DAGScheduler: Job 0 finished: sortByKey at <console>:25

[d]  (here I trigger the computation with an actual action)

    INFO SparkContext: Starting job: saveAsTextFile at <console>:28
    INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[2] at map at <console>:23), which has no missing parents
    INFO DAGScheduler: ShuffleMapStage 1 (map at <console>:23) finished
    INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[6] at saveAsTextFile at <console>:28)
    INFO DAGScheduler: ResultStage 2 (saveAsTextFile at <console>:28) finished
    INFO DAGScheduler: Job 1 finished: saveAsTextFile at <console>:28

Job 1, with stages 1 and 2, seems logical to me: it computes everything before and after the shuffle (the wide dependency), respectively.

What I find interesting and puzzling is Job 0 with stage 0. It executes and finishes before I perform an action (in [d]), and with a larger input set it can also take a noticeable amount of time. Does anybody have any idea what is running in this Job/stage 0?

Thanks,

Tom Hubregtsen
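P.S. One guess I would like to sanity-check: sortByKey constructs its RangePartitioner at the moment it is called, and as far as I can tell that partitioner samples the input to pick partition boundaries, which would explain an eager job named after sortByKey. Below is a minimal experiment sketch to try to isolate this (the key-value lambda and the output paths are placeholders of mine, not the ones from my real run):

    import org.apache.spark.HashPartitioner

    val input = sc.textFile(path, 2)
    // Placeholder lambda: any function producing a (key, value) pair will do.
    val inputRDD = input.map(line => (line, 1))

    // Variant 1: sortByKey builds a RangePartitioner up front. If the guess is
    // right, this line alone triggers the sampling job (the mysterious Job 0).
    val ranged = inputRDD.sortByKey(true, 2)

    // Variant 2: supply an explicit partitioner, so no range boundaries need to
    // be sampled. Note: this only sorts within each partition, so the output is
    // not globally ordered the way sortByKey's is.
    val hashed = inputRDD.repartitionAndSortWithinPartitions(new HashPartitioner(2))

    ranged.saveAsTextFile("/tmp/out-ranged")   // placeholder output paths
    hashed.saveAsTextFile("/tmp/out-hashed")

If Variant 2 shows only the expected two stages while Variant 1 still shows the extra Job 0, that would point at the partitioner setup rather than the shuffle itself.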