Hello! I'm newbie in spark I would like to understand some basic mechanism on how it works behind the scenes. I have attached the lineage of my RDD and I have the following questions: 1. Why do I have 8 stages instead of 5? From the book Learning from Spark (Chapter 8 -http://bit.ly/1E0Hah7), I could understand that "RDDs that exist at the same level of indentation as their parents will be pipelined [into same physical stage] during physical execution". Since I have 5 parents, I'm expected to have 5 stages. Still the Spark UI stages view, shows 8 stages. Also what represents the (8) represented in the debug string? Is any bug in this function? 2. At the stage level, what is the execution order among the tasks? They can be executed all of them in parallel (for example: test4spark.csv HadoopRDD[0] || test4spark.csv MappedRDD[1] || MapPartitionsRDD[4] || ZippedWithIndexRDD[6]) or they are waiting each task upon the other to complete ( test4spark.csv HadoopRDD[0]=>completed=> test4spark.csv MappedRDD[1]=>completed=>etc) 3. Between stages, the order is given by the execution plan, so each stage is waiting till the ones before it will be completed. Is this a correct assumption?
I look forward for your answers. Regards, Florin (8) MappedRDD[21] at map at WAChunkSepvgFilterNewModel.scala:298 [] | MappedRDD[20] at map at WAChunkSepvgFilterNewModel.scala:182 [] | ShuffledRDD[19] at sortByKey at WAChunkSepvgFilterNewModel.scala:182 [] +-(8) ShuffledRDD[16] at aggregateByKey at WAChunkSepvgFilterNewModel.scala:182 [] +-(8) FlatMappedRDD[15] at flatMap at WAChunkSepvgFilterNewModel.scala:174 [] | ZippedWithIndexRDD[14] at zipWithIndex at WAChunkSepvgFilterNewModel.scala:174 [] | MappedRDD[13] at map at WAChunkSepvgFilterNewModel.scala:272 [] | MappedRDD[12] at map at WAChunkSepvgFilterNewModel.scala:161 [] | ShuffledRDD[11] at sortByKey at WAChunkSepvgFilterNewModel.scala:161 [] +-(8) ShuffledRDD[8] at aggregateByKey at WAChunkSepvgFilterNewModel.scala:161 [] +-(8) FlatMappedRDD[7] at flatMap at WAChunkSepvgFilterNewModel.scala:153 [] | ZippedWithIndexRDD[6] at zipWithIndex at WAChunkSepvgFilterNewModel.scala:153 [] | MappedRDD[5] at map at WAChunkSepvgFilterNewModel.scala:248 [] | MapPartitionsRDD[4] at mapPartitionsWithIndex at WAChunkSepvgFilterNewModel.scala:114 [] | test4spark.csv MappedRDD[1] at textFile at WAChunkSepvgFilterNewModel.scala:215 [] | test4spark.csv HadoopRDD[0] at textFile at WAChunkSepvgFilterNewModel.scala:215 [] [image: Inline image 1] Excerpt from the book: "The lineage output shown in Example 8-8 uses indentation levels to show where RDDs are going to be pipelined together into physical stages. RDDs that exist at the same level of indentation as their parents will be pipelined during physical execution "