Hi,

When looking at the application UI (in Amazon EMR), I see one job for a particular line of code, for example:

64 Running count at MySparkJob.scala:540

When I click into the job and go to the stages view, I can see over 100 stages running the same line of code (stages are active, pending, or completed):

190 Pending count at MySparkJob.scala:540
...
162 Active count at MySparkJob.scala:540
...
108 Completed count at MySparkJob.scala:540
...

I'm not sure what that means. I thought a stage was a logical operation boundary and that a job could have only one stage (unless you executed the same dataset + action many times on purpose), with tasks being the units that are replicated across partitions. But here I'm seeing many stages, each showing the same line of code. My code does not re-process the same set of data many times; all intermediate datasets are persisted.

Is the EMR UI display wrong, or are Spark stages not what I thought they were?

Thanks,
Joe