Re: Why is spark running multiple stages with the same code line?

Sean Owen Thu, 21 Apr 2022 07:29:57 -0700

The line of code triggers a job, the job triggers stages. You should see
they are different operations, all supporting execution of the action on
that line.
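
As a rough illustration of the job-to-stages relationship described above, here is a toy model in plain Scala. It is not Spark's scheduler and makes no Spark calls; the names Step, splitIntoStages, and the lineage itself are hypothetical. It assumes only the general rule that a new stage begins wherever a shuffle is required, so one action on a lineage with several shuffles yields several stages.

```scala
// Toy model only: this is NOT Spark's DAGScheduler, just an illustration
// of how one action can fan out into several stages.
object StageModel {
  // One step in a linear lineage. needsShuffle = true marks a step whose
  // input has to be shuffled, which forces a stage boundary before it.
  final case class Step(name: String, needsShuffle: Boolean)

  // Split a linear lineage into stages at every shuffle boundary.
  def splitIntoStages(lineage: List[Step]): List[List[Step]] =
    lineage
      .foldLeft(List(List.empty[Step])) { (stages, step) =>
        if (step.needsShuffle) stages :+ List(step)  // open a new stage
        else stages.init :+ (stages.last :+ step)    // extend current stage
      }
      .filter(_.nonEmpty)

  // Hypothetical lineage sitting behind a single count() call.
  val lineage: List[Step] = List(
    Step("read", needsShuffle = false),
    Step("filter", needsShuffle = false),
    Step("groupBy", needsShuffle = true),
    Step("join", needsShuffle = true),
    Step("count", needsShuffle = true)
  )

  def main(args: Array[String]): Unit =
    splitIntoStages(lineage).zipWithIndex.foreach { case (steps, i) =>
      println(s"stage $i: ${steps.map(_.name).mkString(" -> ")}")
    }
}
```

In this model every stage exists only because the final count was requested, which is the sense in which all of them support the action on that one line.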


On Thu, Apr 21, 2022 at 9:24 AM Joe <j...@net2020.org> wrote:

> Hi Sean,
> Thanks for replying, but my question was about multiple stages running
> the same line of code, not about multiple stages in general. Yes, a
> single job can have multiple stages, but as far as I know they should
> not be repeated if you're caching/persisting your intermediate outputs.
>
> My question is: why am I seeing multiple stages running the same line
> of code? As I understand it, a stage is a grouping of operations that
> can be executed without shuffling data or invoking a new action. Stages
> are divided into tasks, and tasks are the ones that are executed in
> parallel, so the same line of code can be running on different
> executors. Or is this assumption wrong?
> Thanks,
>
> Joe
>
>
> On Thu, 2022-04-21 at 09:14 -0500, Sean Owen wrote:
> > A job can have multiple stages for sure. One action triggers a job.
> > This seems normal.
> >
> > On Thu, Apr 21, 2022, 9:10 AM Joe <j...@net2020.org> wrote:
> > > Hi,
> > > When looking at the application UI (in Amazon EMR), I'm seeing one
> > > job for my particular line of code, for example:
> > > 64 Running count at MySparkJob.scala:540
> > >
> > > When I click into the job and go to stages, I can see over 100
> > > stages running the same line of code (stages are active, pending,
> > > or completed):
> > > 190 Pending count at MySparkJob.scala:540
> > > ...
> > > 162 Active count at MySparkJob.scala:540
> > > ...
> > > 108 Completed count at MySparkJob.scala:540
> > > ...
> > >
> > > I'm not sure what that means. I thought that a stage was a logical
> > > operation boundary and that you could have only one stage in the
> > > job (unless you executed the same dataset+action many times on
> > > purpose), and that tasks were the ones replicated across
> > > partitions. But here I'm seeing many stages running, each with the
> > > same line of code?
> > >
> > > I don't have a situation where my code is re-processing the same
> > > set of data many times; all intermediate sets are persisted.
> > > I'm not sure if the EMR UI display is wrong or if Spark stages are
> > > not what I thought they were?
> > > Thanks,
> > >
> > > Joe
> > >
> > >
> > >
> > >
>
>
>
