Hi Sungwoo,

Thanks for your reply. Regarding the two features you mentioned as required, here is my understanding.
1. Currently the Celeborn Worker tracks the map indices for each partition split file when `celeborn.client.shuffle.rangeReadFilter.enabled` is enabled. The original purpose of this config is to filter out unnecessary partition splits during mapper range reading. The map indices of each partition split are tracked and returned to the LifecycleManager through `CommitFilesResponse#committedMapIdBitMap`, so as long as `CommitFiles` succeeds, the LifecycleManager knows the mapper ids for each partition split. However, if `CommitFiles` fails, we have no way to get this information. This mechanism is also reused in the memory storage design [1].

2. Currently the LifecycleManager does not have a mechanism to change a shuffle from 'complete' back to 'incomplete', because the succeeded attempts for each map task may already have been propagated to executors and used for reading. The Shuffle Client filters out data from non-successful attempts to ensure correctness, so it is hard to dynamically change the set of succeeded attempts. In Erik's proposal, the whole upstream stage is rerun when data is lost. (A rough sketch of the bookkeeping behind these two points follows below.)
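To make the bookkeeping behind these two points concrete, here is a minimal, self-contained Java sketch. It does not use any real Celeborn classes; `ShuffleStateTracker`, `recordCommittedMapIds`, and the other names are hypothetical, invented only for illustration. It just shows the partition-split -> mapper-id mapping that `committedMapIdBitMap` would give the LifecycleManager at `CommitFiles` time, and a naive 'complete' -> 'incomplete' flip when a split's data is lost:

import java.util.*;

// Hypothetical sketch only; these classes/methods do not exist in Celeborn.
// (1) records the partitionSplit -> mapIds mapping reported at CommitFiles time,
// (2) when a split is lost, flips the shuffle back to 'incomplete' and reports
//     which map tasks would have to be re-run.
public class ShuffleStateTracker {

  // shuffleId -> (partitionSplitId -> set of map task ids that wrote to it)
  private final Map<Integer, Map<String, Set<Integer>>> committedMapIds = new HashMap<>();
  // shuffleId -> whether all mappers have called mapperEnd()
  private final Map<Integer, Boolean> shuffleComplete = new HashMap<>();

  // Called when a successful CommitFiles response carries the map-id bitmap
  // for one partition split (modeled here as a plain Set<Integer>).
  public void recordCommittedMapIds(int shuffleId, String splitId, Set<Integer> mapIds) {
    committedMapIds
        .computeIfAbsent(shuffleId, k -> new HashMap<>())
        .computeIfAbsent(splitId, k -> new HashSet<>())
        .addAll(mapIds);
  }

  public void markComplete(int shuffleId) {
    shuffleComplete.put(shuffleId, true);
  }

  // When the data of one split is lost, return the mappers that must re-run
  // and mark the shuffle 'incomplete' so mapperEnd() can be called again.
  public Set<Integer> onSplitLost(int shuffleId, String splitId) {
    shuffleComplete.put(shuffleId, false);
    return committedMapIds
        .getOrDefault(shuffleId, Collections.emptyMap())
        .getOrDefault(splitId, Collections.emptySet());
  }

  public boolean isComplete(int shuffleId) {
    return shuffleComplete.getOrDefault(shuffleId, false);
  }

  public static void main(String[] args) {
    ShuffleStateTracker tracker = new ShuffleStateTracker();
    tracker.recordCommittedMapIds(0, "partition-1-split-0", new HashSet<>(Arrays.asList(3, 7)));
    tracker.markComplete(0);
    // Losing this split means mappers 3 and 7 would need to be re-executed.
    System.out.println("re-run mappers: " + tracker.onSplitLost(0, "partition-1-split-0"));
    System.out.println("shuffle 0 complete? " + tracker.isComplete(0));
  }
}

Of course, in the real system the flip back to 'incomplete' is much harder than this sketch suggests, precisely because the succeeded attempts have already been propagated to executors for reading; that is why Erik's proposal reruns the whole upstream stage instead.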
We are really glad to hear about your efforts and progress in integrating Celeborn with MR3. I think it is a good example showing that Celeborn is a general-purpose remote shuffle service for various big data compute engines. In fact, we have already mentioned MR3 on Celeborn's website [2], so when your blog is ready, I'm happy to reference it on our website as well.

[1] https://docs.google.com/document/d/1SM-oOM0JHEIoRHTYhE9PYH60_1D3NMxDR50LZIM7uW0/edit#heading=h.fudf3s3zacpr
[2] https://celeborn.apache.org/docs/latest/developers/overview/#compute-engine-integration

Thanks,
Keyong Zhou

Keyong Zhou <o...@pl.postech.ac.kr> wrote on Thu, Sep 28, 2023 at 11:12:

> Hello,
>
> As we are developing the MR3 extension for Celeborn, I would like to add my comments on stage re-run in the context of using Celeborn for MR3. I don't know the internal details of Spark stage re-run very well, so my apologies if my comments are irrelevant to the proposal in the design document.
>
> For Celeborn-MR3, we only need the following two features:
>
> 1. When mapper output is lost and read errors occur, CelebornIOException from ShuffleClientImpl includes the task index of the mapper (or a set of task indexes) whose output has been lost.
>
> 2. ShuffleClientImpl notifies LifecycleManager so that ShuffleClient.mapperEnd(shuffleId, mapper_task_index, ...) can be called again. In other words, LifecycleManager marks shuffleId from 'complete' back to 'incomplete'.
>
> Then, the task re-execution mechanism of MR3 can take care of the rest, by re-executing the mapper and calling ShuffleClient.mapperEnd() again.
>
> From the proposal (if I understood it correctly), however, it seems that 1 is not easy to implement in the current architecture of Celeborn (???):
>
> Celeborn doesn't know which mapper tasks need to be recomputed, unless the mapping of partitionId -> List<mapId> is recorded and reported to LifecycleManager at committing time.
>
> By the way, we have finished the initial implementation of Hive-MR3-Celeborn, and it works very reliably when tested with TPC-DS 10TB; the performance is also good. A release candidate is currently being tested in production by a third party. It could take a bit of time to learn to use Hive-MR3-Celeborn, but Hive-MR3-Celeborn could be another way to run stress tests on Celeborn. For example, we produced the EOFException error when running stress tests by using speculative execution a lot and intentionally giving heavy memory pressure.
>
> (We have quick start guides for Hadoop, K8s, and standalone mode, so it should take no more than a couple of hours to learn to run Hive-MR3-Celeborn.) If you are interested, please let me know. Thank you.
>
> Best,
>
> -- Sungwoo
>
> On Fri, 22 Sep 2023, Erik fang wrote:
>
> > Hi folks,
> >
> > I have a proposal to implement Spark stage resubmission to handle shuffle fetch failure in Celeborn:
> >
> > https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8
> >
> > please have a look and let me know what you think
> >
> > Regards,
> > Erik