Re: Question on stage rerun in Celeborn 0.5.1

rexxiong Thu, 17 Oct 2024 02:28:43 -0700

Hi Sungwoo,

Thank you for your message. I'm glad to hear that you've successfully
implemented stage rerun in Hive-MR3! It's great that the integration with
Celeborn was smooth.
If you have any further questions, feel free to reach out.


Thanks,
Jiashu Xiong

Sungwoo Park <[email protected]> wrote on Wed, 16 Oct 2024 at 22:37:

> Hi Jiashu,
>
> Thank you for your reply. Based on what you advised, we have fully
> implemented stage rerun in Hive-MR3, where a read failure triggers
> re-execution of the entire parent stage. There were not many changes to the
> code for interfacing with Celeborn (which makes me think again that the
> Celeborn API is quite elegant).
>
> Thanks a lot,
>
> --- Sungwoo Park
>
> On Tue, 8 Oct 2024, rexxiong wrote:
>
> > Hi Sungwoo,
> >
> > Glad to see that the integration of MR3 with Celeborn works well and that
> > the upgrade from version 0.3 to 0.5 went smoothly.
> > Celeborn now supports Spark stage reruns, which are already in production.
> > As for your questions:
> >
> > 1. Suppose that a reducer fails to read the output of a certain mapper. In
> > such a case, should we re-execute all the mappers in the previous stage?
> > Or, is it okay to re-execute only the mapper whose output is lost?
> > In our previous implementation, MR3-Celeborn does not fully support task
> > rerun (similar to stage rerun) because Celeborn does not return the
> > identity of mapper tasks whose output has been lost.
> >
> > In my opinion, MR3 should use a new shuffle ID to re-execute all mappers
> > and generate fresh outputs for that reducer. This is necessary because the
> > Celeborn Worker merges all mapper outputs for the same partition ID into
> > the same files. If the reducer is unable to read these files, the data is
> > lost, so the mappers that wrote to those files (probably all of them)
> > should be re-executed. Rerunning only the mappers associated with the lost
> > data is possible, but I think it is not easy in Celeborn and may require a
> > lot of changes.
> >
> > 2. When a reducer tries to read the output of mappers, when is it okay to
> > use the application shuffle ID?
> > If MR3 does not support stage rerun, as before, it can still use the
> > application shuffle ID. Additionally, for Spark applications, we can
> > disable stage rerun support by setting
> > celeborn.client.spark.fetch.throwsFetchFailure to false. Spark will then
> > continue using the application shuffle ID.
> >
> > 3. Along the same lines as question 2, should we always get Celeborn
> > shuffle IDs when trying to read the output of mappers? Considering the
> > fact that the current code of MR3-Celeborn works fine, it seems like this
> > is not always necessary.
> >
> > If MR3 does not require support for stage reruns, we can continue to use
> > the application shuffle ID as before. However, if support for stage reruns
> > is needed in MR3, we will need to generate a new shuffle ID for each stage
> > attempt. This will allow us to distinguish between the map outputs for the
> > reducers across different attempts of the same stage.
> >
> > Thanks,
> > Jiashu Xiong
> >
> > Sungwoo Park <[email protected]> wrote on Mon, 7 Oct 2024 at 14:20:
> >       (I left this message in the Celeborn Slack channel, but perhaps
> >       this mailing list is the right place.)
> >
> >       Hello,
> >
> >       Previously we implemented an extension of MR3 (an execution engine)
> >       to support Celeborn 0.3.1. For a short introduction, please see:
> >       https://mr3docs.datamonad.com/docs/mr3/features/celeborn/
> >
> >       Now we are upgrading Celeborn to 0.5.1 and working on supporting
> >       stage rerun, much like Spark-Celeborn.
> >
> >       To my (pleasant) surprise, upgrading Celeborn from 0.3.1 to 0.5.1
> >       was quite smooth. After recompiling with Celeborn 0.5.1,
> >       MR3-Celeborn just worked fine. I was surprised because the current
> >       code does not obtain Celeborn shuffle IDs at all (because there was
> >       no notion of Celeborn shuffle IDs back in 0.3.1) and we use only
> >       application shuffle IDs, which are generated by MR3 (similarly to
> >       Spark shuffle IDs).
> >
> >       I have a few questions.
> >
> >       1. Suppose that a reducer fails to read the output of a certain
> >       mapper. In such a case, should we re-execute all the mappers in the
> >       previous stage? Or, is it okay to re-execute only the mapper whose
> >       output is lost?
> >       In our previous implementation, MR3-Celeborn does not fully support
> >       task rerun (similar to stage rerun) because Celeborn does not return
> >       the identity of mapper tasks whose output has been lost.
> >
> >       2. When a reducer tries to read the output of mappers, when is it
> >       okay to use the application shuffle ID?
> >
> >       3. Along the same lines as question 2, should we always get
> >       Celeborn shuffle IDs when trying to read the output of mappers?
> >       Considering the fact that the current code of MR3-Celeborn works
> >       fine, it seems like this is not always necessary.
> >
> >       Thank you.
> >
> >       --- Sungwoo Park
> >
> >
> >
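
For readers following the thread, the shuffle ID advice above (one fresh
shuffle ID per stage attempt, or the plain application shuffle ID when stage
rerun is not needed) can be sketched roughly as follows. This is only an
illustration under stated assumptions: ShuffleIdTracker and its method names
are hypothetical and are not part of the Celeborn or MR3 APIs, and a real
integration would obtain shuffle IDs through the Celeborn client instead of
a local counter.

```python
class ShuffleIdTracker:
    """Hypothetical helper (not a Celeborn or MR3 API).

    Maps (application shuffle ID, stage attempt) to a distinct shuffle ID so
    that map outputs from different attempts of a stage never mix.
    """

    def __init__(self, stage_rerun_enabled=True):
        self.stage_rerun_enabled = stage_rerun_enabled
        self._next_id = 0          # local counter standing in for real ID allocation
        self._ids = {}             # (app_shuffle_id, stage_attempt) -> shuffle ID
        self._latest_attempt = {}  # app_shuffle_id -> highest attempt seen

    def id_for_write(self, app_shuffle_id, stage_attempt):
        """Shuffle ID that mappers of this stage attempt should write to."""
        if not self.stage_rerun_enabled:
            # Without stage rerun, the application shuffle ID is used directly.
            return app_shuffle_id
        key = (app_shuffle_id, stage_attempt)
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        previous = self._latest_attempt.get(app_shuffle_id, -1)
        self._latest_attempt[app_shuffle_id] = max(previous, stage_attempt)
        return self._ids[key]

    def id_for_read(self, app_shuffle_id):
        """Shuffle ID that reducers should read: the latest attempt's output."""
        if not self.stage_rerun_enabled:
            return app_shuffle_id
        attempt = self._latest_attempt[app_shuffle_id]
        return self._ids[(app_shuffle_id, attempt)]

    def next_attempt_after_read_failure(self, app_shuffle_id):
        """Attempt number to use when rerunning all mappers of the parent stage."""
        return self._latest_attempt[app_shuffle_id] + 1


# Usage: a read failure leads to a new attempt with a fresh shuffle ID.
tracker = ShuffleIdTracker()
first_id = tracker.id_for_write(app_shuffle_id=7, stage_attempt=0)
retry_attempt = tracker.next_attempt_after_read_failure(7)
second_id = tracker.id_for_write(app_shuffle_id=7, stage_attempt=retry_attempt)
```

After the rerun, reducers that ask id_for_read(7) see only the second
attempt's shuffle ID, so partially lost files from the first attempt are
never read.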
