Hi Sungwoo,

Thanks for your reply. Regarding the two features you mentioned as required, here is my understanding.
1. Currently the Celeborn Worker tracks the map indices for each partition split file when `celeborn.client.shuffle.rangeReadFilter.enabled` is enabled. The original purpose of this config is to filter out unnecessary partition splits during mapper range reading. The map indices of each partition split are tracked and returned to the LifecycleManager through `CommitFilesResponse#committedMapIdBitMap`, so as long as `CommitFiles` succeeds, the LifecycleManager knows the mapper ids for each partition split. However, if `CommitFiles` fails, we have no way to get this information. This mechanism is also reused in the memory storage design [1].

2. Currently the LifecycleManager does not have a mechanism to change a shuffle from 'complete' back to 'incomplete', because the succeeded attempts for each map task may already have been propagated to executors and used for reading. The Shuffle Client filters out data from non-successful attempts to ensure correctness, so it is hard to dynamically change the set of succeeded attempts. In Erik's proposal, the whole upstream stage is rerun when data is lost. (A rough sketch of the bookkeeping behind these two points follows below.)
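To make the bookkeeping behind these two points concrete, here is a minimal, self-contained Java sketch. It does not use any real Celeborn classes; `ShuffleStateTracker`, `recordCommittedMapIds`, and the other names are hypothetical, invented only for illustration. It just shows the partition-split -> mapper-id mapping that `committedMapIdBitMap` would give the LifecycleManager at `CommitFiles` time, and a naive 'complete' -> 'incomplete' flip when a split's data is lost:

import java.util.*;

// Hypothetical sketch only; these classes/methods do not exist in Celeborn.
// (1) records the partitionSplit -> mapIds mapping reported at CommitFiles time,
// (2) when a split is lost, flips the shuffle back to 'incomplete' and reports
//     which map tasks would have to be re-run.
public class ShuffleStateTracker {

  // shuffleId -> (partitionSplitId -> set of map task ids that wrote to it)
  private final Map<Integer, Map<String, Set<Integer>>> committedMapIds = new HashMap<>();
  // shuffleId -> whether all mappers have called mapperEnd()
  private final Map<Integer, Boolean> shuffleComplete = new HashMap<>();

  // Called when a successful CommitFiles response carries the map-id bitmap
  // for one partition split (modeled here as a plain Set<Integer>).
  public void recordCommittedMapIds(int shuffleId, String splitId, Set<Integer> mapIds) {
    committedMapIds
        .computeIfAbsent(shuffleId, k -> new HashMap<>())
        .computeIfAbsent(splitId, k -> new HashSet<>())
        .addAll(mapIds);
  }

  public void markComplete(int shuffleId) {
    shuffleComplete.put(shuffleId, true);
  }

  // When the data of one split is lost, return the mappers that must re-run
  // and mark the shuffle 'incomplete' so mapperEnd() can be called again.
  public Set<Integer> onSplitLost(int shuffleId, String splitId) {
    shuffleComplete.put(shuffleId, false);
    return committedMapIds
        .getOrDefault(shuffleId, Collections.emptyMap())
        .getOrDefault(splitId, Collections.emptySet());
  }

  public boolean isComplete(int shuffleId) {
    return shuffleComplete.getOrDefault(shuffleId, false);
  }

  public static void main(String[] args) {
    ShuffleStateTracker tracker = new ShuffleStateTracker();
    tracker.recordCommittedMapIds(0, "partition-1-split-0", new HashSet<>(Arrays.asList(3, 7)));
    tracker.markComplete(0);
    // Losing this split means mappers 3 and 7 would need to be re-executed.
    System.out.println("re-run mappers: " + tracker.onSplitLost(0, "partition-1-split-0"));
    System.out.println("shuffle 0 complete? " + tracker.isComplete(0));
  }
}

Of course, in the real system the flip back to 'incomplete' is much harder than this sketch suggests, precisely because the succeeded attempts have already been propagated to executors for reading; that is why Erik's proposal reruns the whole upstream stage instead.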
We are really glad to hear about your efforts and progress in integrating Celeborn with MR3. I think it is a good example showing that Celeborn is a general-purpose remote shuffle service for various big data compute engines. In fact, we have already mentioned MR3 on Celeborn's website [2], so when your blog is ready, I'm happy to reference it on our website as well.

[1] https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit#heading=h.fudf3s3zacpr
[2] https://celeborn.apache.org/docs/latest/developers/overview/#compute-engine-integration

Thanks,
Keyong Zhou

Keyong Zhou <o...@pl.postech.ac.kr> wrote on Thu, Sep 28, 2023 at 11:12:

> Hello,
>
> As we are developing the MR3 extension for Celeborn, I would like to add my comments on stage re-run in the context of using Celeborn for MR3. I don't know the internal details of Spark stage re-run very well, so my apologies if my comments are irrelevant to the proposal in the design document.
>
> For Celeborn-MR3, we only need the following two features:
>
> 1. When mapper output is lost and read errors occur, CelebornIOException from ShuffleClientImpl includes the task index of the mapper (or a set of task indexes) whose output has been lost.
>
> 2. ShuffleClientImpl notifies LifecycleManager so that ShuffleClient.mapperEnd(shuffleId, mapper_task_index, ...) can be called again. In other words, LifecycleManager marks shuffleId from 'complete' back to 'incomplete'.
>
> Then, the task re-execution mechanism of MR3 can take care of the rest, by re-executing the mapper and calling ShuffleClient.mapperEnd() again.
>
> From the proposal (if I understood it correctly), however, it seems that 1 is not easy to implement in the current architecture of Celeborn (???):
>
> Celeborn doesn't know which mapper tasks need to be recomputed, unless the mapping of partitionId -> List<mapId> is recorded and reported to LifecycleManager at committing time.
>
> By the way, we have finished the initial implementation of Hive-MR3-Celeborn, and it works very reliably when tested with TPC-DS 10TB; the performance is also good. A release candidate is currently being tested in production by a third party. It could take a bit of time to learn to use Hive-MR3-Celeborn, but Hive-MR3-Celeborn could be another way to run stress tests on Celeborn. For example, we produced the EOFException error when running stress tests by using speculative execution a lot and intentionally giving heavy memory pressure.
>
> (We have quick start guides for Hadoop, K8s, and standalone mode, so it should take no more than a couple of hours to learn to run Hive-MR3-Celeborn.) If you are interested, please let me know. Thank you.
>
> Best,
>
> -- Sungwoo
>
> On Fri, 22 Sep 2023, Erik fang wrote:
>
> > Hi folks,
> >
> > I have a proposal to implement Spark stage resubmission to handle shuffle fetch failure in Celeborn:
> >
> > https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8
> >
> > please have a look and let me know what you think
> >
> > Regards,
> > Erik