Hi,

I have the classic word count example:
> file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + 
> _).collect()

From the Job UI, I can only see 2 stages: 0-collect and 1-map.

What happened to ShuffledRDD in reduceByKey? And both flatMap and map 
operations is collapsed into a single stage?

14/11/25 16:02:35 INFO SparkContext: Starting job: collect at <console>:15
14/11/25 16:02:35 INFO DAGScheduler: Registering RDD 6 (map at <console>:15)
14/11/25 16:02:35 INFO DAGScheduler: Got job 0 (collect at <console>:15) with 2 
output partitions (allowLocal=false)
14/11/25 16:02:35 INFO DAGScheduler: Final stage: Stage 0(collect at 
<console>:15)
14/11/25 16:02:35 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/11/25 16:02:35 INFO DAGScheduler: Missing parents: List(Stage 1)
14/11/25 16:02:35 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[6] at map at 
<console>:15), which has no missing parents
14/11/25 16:02:35 INFO MemoryStore: ensureFreeSpace(3464) called with 
curMem=163705, maxMem=278302556
14/11/25 16:02:35 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 3.4 KB, free 265.3 MB)
14/11/25 16:02:35 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 
(MappedRDD[6] at map at <console>:15)
14/11/25 16:02:35 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/11/25 16:02:35 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, 
localhost, PROCESS_LOCAL, 1208 bytes)
14/11/25 16:02:35 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, 
localhost, PROCESS_LOCAL, 1208 bytes)
14/11/25 16:02:35 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
14/11/25 16:02:35 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
14/11/25 16:02:35 INFO HadoopRDD: Input split: 
file:/Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md:0+2405
14/11/25 16:02:35 INFO HadoopRDD: Input split: 
file:/Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md:2405+2406
14/11/25 16:02:35 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
14/11/25 16:02:35 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
14/11/25 16:02:35 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
14/11/25 16:02:35 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
14/11/25 16:02:35 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
14/11/25 16:02:36 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 1869 
bytes result sent to driver
14/11/25 16:02:36 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 1869 
bytes result sent to driver
14/11/25 16:02:36 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) 
in 536 ms on localhost (1/2)
14/11/25 16:02:36 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) 
in 529 ms on localhost (2/2)
14/11/25 16:02:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have 
all completed, from pool 
14/11/25 16:02:36 INFO DAGScheduler: Stage 1 (map at <console>:15) finished in 
0.562 s
14/11/25 16:02:36 INFO DAGScheduler: looking for newly runnable stages
14/11/25 16:02:36 INFO DAGScheduler: running: Set()
14/11/25 16:02:36 INFO DAGScheduler: waiting: Set(Stage 0)
14/11/25 16:02:36 INFO DAGScheduler: failed: Set()
14/11/25 16:02:36 INFO DAGScheduler: Missing parents for Stage 0: List()
14/11/25 16:02:36 INFO DAGScheduler: Submitting Stage 0 (ShuffledRDD[7] at 
reduceByKey at <console>:15), which is now runnable
14/11/25 16:02:36 INFO MemoryStore: ensureFreeSpace(2112) called with 
curMem=167169, maxMem=278302556
14/11/25 16:02:36 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 2.1 KB, free 265.2 MB)
14/11/25 16:02:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
(ShuffledRDD[7] at reduceByKey at <console>:15)
14/11/25 16:02:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/25 16:02:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, 
localhost, PROCESS_LOCAL, 948 bytes)
14/11/25 16:02:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 3, 
localhost, PROCESS_LOCAL, 948 bytes)
14/11/25 16:02:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 2)
14/11/25 16:02:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 3)
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: 
maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 
2 non-empty blocks out of 2 blocks
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 5 ms
14/11/25 16:02:36 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 
0 remote fetches in 5 ms
14/11/25 16:02:36 INFO Executor: Finished task 0.0 in stage 0.0 (TID 2). 4602 
bytes result sent to driver
14/11/25 16:02:36 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) 
in 135 ms on localhost (1/2)
14/11/25 16:02:36 INFO Executor: Finished task 1.0 in stage 0.0 (TID 3). 5051 
bytes result sent to driver
14/11/25 16:02:36 INFO DAGScheduler: Stage 0 (collect at <console>:15) finished 
in 0.168 s
14/11/25 16:02:36 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 3) 
in 160 ms on localhost (2/2)
14/11/25 16:02:36 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
14/11/25 16:02:36 INFO SparkContext: Job finished: collect at <console>:15, 
took 1.046888 s

Thanks!


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Reply via email to