Hi Jiashu,

Thanks for joining the discussion.

> My only concern is whether the change for the Celeborn worker supports a
> graceful shutdown/decommission. Could you provide more details on that?

Since graceful shutdown/decommission is achieved by changing the state of the
worker and the master, the proposal doesn't change that logic. It can leverage
the existing management framework as before, so I think the feature will
continue to operate as intended. We have also conducted some simple tests to
validate the process, and the results show that it works as expected.

Best,
Yuxin


rexxiong <[email protected]> wrote on Mon, Jun 3, 2024 at 12:18:

> Hi Yuxin,
>
> Thanks a lot for the CIP, +1
>
> For me the whole design and implementation appear clear and have no
> compatibility issues. My only concern is whether the change for the
> Celeborn worker supports a graceful shutdown/decommission. Could you
> provide more details on that?
>
> Thanks,
> Jiashu Xiong
>
> Xintong Song <[email protected]> wrote on Wed, May 29, 2024 at 09:39:
>
> > +1 for this proposal.
> >
> > Greetings to the Apache Celeborn community~! Yuxin and I are from the
> > Apache Flink community, and have been working on the shuffle related
> > components for years. We are both excited about making our first
> > contribution to the Apache Celeborn community.
> >
> > Hybrid Shuffle is a new shuffle architecture that the Flink community has
> > been working on for ~2 years. We are planning to make it the default (and
> > eventually the only) batch shuffle in the Flink 2.0 release (end of this
> > year). The architecture is flexible and extensible so that it can support
> > all the capabilities of existing shuffle modes, while providing new
> > advantages on task scheduling, resource efficiency and usability. To
> > achieve this, we abstract storages (memory, local disk, remote storage /
> > service) into Tiers, and hide details such as assembling records to
> > buffers, dynamic switching between Tiers and memory management from the
> > Tiers.
> >
> > We believe it is important that Flink and Celeborn can be integrated under
> > the new architecture, in addition to the existing integration based on the
> > shuffle-service interfaces.
> >
> > Looking forward to your feedback.
> >
> > Best,
> >
> > Xintong
> >
> >
> > On Tue, May 28, 2024 at 8:52 PM Yuxin Tan <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I would like to start a discussion on CIP-6 Support Flink hybrid shuffle
> > > integration with Apache Celeborn[1]. Celeborn provides a stable,
> > > performant, scalable remote shuffle service. Concurrently, Flink hybrid
> > > shuffle supports transitions between memory, disk, and remote storage to
> > > improve performance and job stability. This integration proposal is to
> > > harness the benefits from both Celeborn and hybrid shuffle
> > > simultaneously.
> > >
> > > Note that this proposal has two parts.
> > > 1. The Celeborn-side changes are in CIP-6[1].
> > > 2. The Flink-side modifications are in FLIP-459[2].
> > >
> > > Looking forward to everyone's feedback and suggestions. Thank you!
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/CELEBORN/CIP-6+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > > [2]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > >
> > > Best,
> > > Yuxin
