Why is Spark running multiple stages with the same code line?

Joe Thu, 21 Apr 2022 07:10:05 -0700

Hi,
When looking at the application UI (in Amazon EMR), I see one job for a
particular line of code, for example:
64 Running count at MySparkJob.scala:540


When I click into the job and go to its stages, I can see over 100
stages running the same line of code (the stages are active, pending,
or completed):
190 Pending count at MySparkJob.scala:540
...
162 Active count at MySparkJob.scala:540
...
108 Completed count at MySparkJob.scala:540
...

I'm not sure what that means. I thought a stage was a logical operation
boundary, that a job could have only one stage (unless you executed the
same dataset and action many times on purpose), and that tasks were the
units replicated across partitions. But here I'm seeing many stages
running, each attributed to the same line of code.
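
To make the observation concrete, here is a toy model in plain Scala. It
is only a sketch and does not use Spark: the case classes and the task
counts are hypothetical, and reading 64 as a job ID and 190, 162 and 108
as stage IDs is my assumption about the UI listing above.

```scala
// Toy model only: plain Scala, no Spark dependency, not Spark's API.
// numTasks values are made up for illustration; the IDs and statuses
// are copied from the UI listing (assumed to be job and stage IDs).
final case class Stage(id: Int, status: String, callSite: String, numTasks: Int)
final case class Job(id: Int, status: String, callSite: String, stages: Seq[Stage])

object StageModel {
  val callSite = "count at MySparkJob.scala:540"

  // What the UI appears to show: one job, many stages, one call site.
  val observed: Job = Job(64, "Running", callSite, Seq(
    Stage(190, "Pending", callSite, numTasks = 8),
    Stage(162, "Active", callSite, numTasks = 8),
    Stage(108, "Completed", callSite, numTasks = 8)
  ))

  // Count how many stages of a job are attributed to each call site.
  def stagesPerCallSite(job: Job): Map[String, Int] =
    job.stages.groupBy(_.callSite).map { case (site, ss) => site -> ss.size }

  def main(args: Array[String]): Unit =
    stagesPerCallSite(observed).foreach { case (site, n) =>
      println(s"$site -> $n stage(s)")
    }
}
```

I expected a single stage per call site, with only the tasks multiplied
across partitions; the listing instead corresponds to many stages under
one call site, as in the model above.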

I don't have a situation where my code re-processes the same set of
data many times; all intermediate datasets are persisted.
I'm not sure whether the EMR UI display is wrong or Spark stages are
not what I thought they were.
Thanks,

Joe


