Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Sean Owen
The line of code triggers a job, and the job triggers stages. You should see that they are different operations, all supporting execution of the action on that line.

On Thu, Apr 21, 2022 at 9:24 AM Joe wrote:
> Hi Sean,
> Thanks for replying but my question was about multiple stages running
> the same

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Russell Spitzer
There are a few things going on here.
1. Spark is lazy, so nothing happens until a result is collected back to the driver or data is written to a sink. So the one line you see is most likely just that trigger. Once triggered, all of the work required to produce that final result runs. If
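To illustrate the laziness described above, here is a minimal sketch, not the original poster's code. It assumes a local master and uses the RDD API purely for illustration; the object name `LazyJobSketch` and the helper `countDistinctKeys` are hypothetical. The two transformations only record lineage, and the single `count()` action is what triggers the job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyJobSketch {
  // Hypothetical helper: builds a lazy pipeline, then runs one action on it.
  def countDistinctKeys(sc: SparkContext): Long = {
    // Transformation: nothing is computed yet, only lineage is recorded.
    val pairs = sc.parallelize(1 to 1000).map(x => (x % 10, 1))
    // Transformation: still lazy, but it needs a shuffle, which marks a stage boundary.
    val summed = pairs.reduceByKey(_ + _)
    // Action: this line triggers the job, and all of the work above runs now.
    summed.count()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("lazy-job-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(countDistinctKeys(sc))
    } finally {
      sc.stop()
    }
  }
}
```

In the UI, this one job should show two stages, split at the shuffle introduced by `reduceByKey`, even though only the `count()` line triggered any work.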

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Joe
Hi Sean,
Thanks for replying, but my question was about multiple stages running the same line of code, not about multiple stages in general. Yes, a single job can have multiple stages, but as far as I know they should not be repeated if you're caching/persisting your intermediate outputs. My

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Sean Owen
A job can have multiple stages for sure. One action triggers a job. This seems normal.

On Thu, Apr 21, 2022, 9:10 AM Joe wrote:
> Hi,
> When looking at application UI (in Amazon EMR) I'm seeing one job for
> my particular line of code, for example:
> 64 Running count at MySparkJob.scala:540
>

Why is spark running multiple stages with the same code line?

2022-04-21 Thread Joe
Hi,
When looking at the application UI (in Amazon EMR) I'm seeing one job for my particular line of code, for example:
64 Running count at MySparkJob.scala:540
When I click into the job and go to stages I can see over 100 stages running the same line of code (stages are active, pending or