Hi,

I am trying to see exactly what happens under the hood of Spark when performing a simple sortByKey. So far I have already discovered the fetch files and both the temp-shuffle and shuffle files being written to disk, but there is still an extra stage that keeps puzzling me.
This is the code that I execute line by line in the spark-shell (I will refer to the lines as [a] through [d]):

    val input = sc.textFile(path, 2)                                  [a]
    val inputRDD = input.map(<some lambda to create a key-value pair>)  [b]
    val result = inputRDD.sortByKey(true, 2)                          [c]
    result.saveAsTextFile                                             [d]

As there is only one shuffle, I expect to see two stages. This is confirmed by result.toDebugString:

    (2) ShuffledRDD[5] at sortByKey at <console>:25 []                                           -> [c]
     +-(2) MapPartitionsRDD[2] at map at <console>:23 []                                         -> [b]
        |  ./Sort/input-10-records-2-parts/ MapPartitionsRDD[1] at textFile at <console>:21 []   -> [a]
        |  ./Sort/input-10-records-2-parts/ HadoopRDD[0] at textFile at <console>:21 []          -> [a]

Since there is one indentation (one shuffle boundary), there should be two stages. There is an extra RDD (MapPartitionsRDD[6]) that is created by [d], but it is not a parent of my result RDD, so it is not listed in this trace.

Now, when I run these commands one by one in the spark-shell, I see the following execution:

[a] [b] [c]  (no action performed yet)

    INFO SparkContext: Starting job: sortByKey at <console>:25
    INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at sortByKey at <console>:25), which has no missing parents
    INFO DAGScheduler: ResultStage 0 (sortByKey at <console>:25) finished in 0.109 s
    INFO DAGScheduler: Job 0 finished: sortByKey at <console>:25

[d]  (here I trigger the computation with an actual action)

    INFO SparkContext: Starting job: saveAsTextFile at <console>:28
    INFO DAGScheduler: Submitting ShuffleMapStage 1 (MapPartitionsRDD[2] at map at <console>:23), which has no missing parents
    INFO DAGScheduler: ShuffleMapStage 1 (map at <console>:23) finished
    INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[6] at saveAsTextFile at <console>:28)
    INFO DAGScheduler: ResultStage 2 (saveAsTextFile at <console>:28) finished
    INFO DAGScheduler: Job 1 finished: saveAsTextFile at <console>:28

Job 1, with stages 1 and 2, seems logical to me: it computes everything before and after the shuffle (the wide dependency), respectively.

What I find interesting and puzzling is Job 0 with stage 0. It executes and finishes before I perform an action (in [d]), and with a larger input set it can also take a noticeable amount of time. Does anybody have any idea what is running in this Job/stage 0?

Thanks,

Tom Hubregtsen
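P.S. One guess I would like to sanity-check: sortByKey constructs its RangePartitioner at the moment it is called, and as far as I can tell that partitioner samples the input to pick partition boundaries, which would explain an eager job named after sortByKey. Below is a minimal experiment sketch to try to isolate this (the key-value lambda and the output paths are placeholders of mine, not the ones from my real run):

    import org.apache.spark.HashPartitioner

    val input = sc.textFile(path, 2)
    // Placeholder lambda: any function producing a (key, value) pair will do.
    val inputRDD = input.map(line => (line, 1))

    // Variant 1: sortByKey builds a RangePartitioner up front. If the guess is
    // right, this line alone triggers the sampling job (the mysterious Job 0).
    val ranged = inputRDD.sortByKey(true, 2)

    // Variant 2: supply an explicit partitioner, so no range boundaries need to
    // be sampled. Note: this only sorts within each partition, so the output is
    // not globally ordered the way sortByKey's is.
    val hashed = inputRDD.repartitionAndSortWithinPartitions(new HashPartitioner(2))

    ranged.saveAsTextFile("/tmp/out-ranged")   // placeholder output paths
    hashed.saveAsTextFile("/tmp/out-hashed")

If Variant 2 shows only the expected two stages while Variant 1 still shows the extra Job 0, that would point at the partitioner setup rather than the shuffle itself.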