Re: [DISCUSS] CIP-10: Introduce Celeborn Chaos Testing Framework

Nicholas Jiang Mon, 15 Jul 2024 20:40:13 -0700

Hey Keyong,

Thanks for your feedback.


In my opinion, validating data quality by checking "Shuffle Records
Written" and "Records Read" satisfies the correctness check. Meanwhile, the
code of the chaos testing framework is kept in a separate "verifier" module.
WDYT?
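
To make the correctness check concrete, here is a minimal sketch of the comparison I have in mind. It assumes the totals of the two metrics for one shuffle are expected to be equal. The names ShuffleMetrics and find_mismatches are hypothetical and are not existing Celeborn or Spark APIs, and how the metrics are collected is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuffleMetrics:
    """Totals for one shuffle (hypothetical container, not a Celeborn class)."""
    shuffle_id: int
    records_written: int  # "Shuffle Records Written", summed over map tasks
    records_read: int     # "Records Read", summed over reduce tasks


def find_mismatches(metrics):
    """Return the entries whose written and read record counts differ."""
    return [m for m in metrics if m.records_written != m.records_read]


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    observed = [
        ShuffleMetrics(shuffle_id=0, records_written=1000, records_read=1000),
        ShuffleMetrics(shuffle_id=1, records_written=500, records_read=498),
    ]
    for m in find_mismatches(observed):
        print(f"shuffle {m.shuffle_id}: wrote {m.records_written}, "
              f"read {m.records_read}")
```

A chaos run would count as passing the correctness check only if find_mismatches returns an empty list after the job finishes.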

Regards,
Nicholas Jiang

On 2024/07/14 17:10:31 Keyong Zhou wrote:
> Thanks for the proposal!
> 
> The chaos framework is very useful for Celeborn. There are two points I
> think are important:
> 1. We need to add a correctness check to the framework; correctness is the
> No. 1 priority.
> 2. The framework should not intrude into the common code.
> 
> Regards,
> Keyong Zhou
> 
> Nicholas Jiang <nicholasji...@apache.org> wrote on Fri, Jul 12, 2024 at 14:29:
> 
> > Hey Mridul,
> >
> > Thanks for your feedback. The ability to reproduce problematic cases by
> > capturing a log of the events that were triggered can maximize the value
> > of the chaos testing framework. Celeborn chaos testing needs not only to
> > verify the reliability of the service while various abnormal events are
> > simulated, but also to reproduce problem cases so that the root cause of
> > Celeborn problems can be found. I would like to take this reproduction
> > feature into consideration for this CIP.
> >
> > Best Regards,
> > Nicholas Jiang
> >
> > On 2024/07/10 09:35:52 Mridul Muralidharan wrote:
> > > Hi,
> > >
> > >   This is a great idea - it would go a long way in flushing out bugs and
> > > issues, and in improving the overall robustness of Celeborn!
> > > It would also be good to have:
> > > a) A (replay) log capturing all events which were triggered.
> > > b) The ability to 'replay' the log and deterministically reach the same
> > > state.
> > >
> > > This will allow us to identify failure cases with the testing framework -
> > > while allowing developers to deterministically reproduce the identified
> > > state.
> > >
> > > (Hopefully I did not miss this in the proposal).
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > On Wed, Jul 10, 2024 at 4:07 AM Nicholas Jiang <nicholasji...@apache.org>
> > > wrote:
> > >
> > > > Hello community,
> > > >
> > > > It's been a while since the discussion on the Celeborn chaos testing
> > > > framework. The main process of Celeborn chaos testing includes:
> > > >
> > > > 1. Defining a test plan to describe the types of events, the order in
> > > > which events are triggered, and their duration. Event types include
> > > > node anomalies, disk anomalies, IO anomalies, CPU overload, etc.
> > > > 2. The client submits the plan to the scheduler.
> > > > 3. The scheduler sends operations to each node's runner according to
> > > > the plan description.
> > > > 4. The runner is responsible for executing the operations and reporting
> > > > the current status of the node.
> > > > 5. Before triggering an operation, the scheduler predicts the result of
> > > > the event. If the event would make it impossible to meet the minimum
> > > > runnable environment for RSS, it is rejected.
> > > >
> > > > Do you have any thoughts or questions about this chaos testing
> > > > framework? Feedback is welcome to further ensure the reliability of
> > > > Celeborn through chaos testing.
> > > >
> > > > Regards,
> > > > Nicholas Jiang
> > > >
> > > > At 2024-07-03 05:20:57, "Nicholas Jiang" <nicholasji...@apache.org>
> > > > wrote:
> > > > >Hi all,
> > > > >
> > > > >I would like to start a discussion on CIP-10: Introduce Celeborn Chaos
> > > > >Testing Framework[1].
> > > > >
> > > > >A chaos testing framework is designed to simulate unpredictable and
> > > > >adverse conditions in distributed systems to validate their robustness
> > > > >and resilience. This proposal aims to simulate various anomalies and
> > > > >test the stability of Celeborn in distributed environments via chaos
> > > > >testing.
> > > > >
> > > > >Looking forward to everyone's feedback and suggestions. Thank you!
> > > > >
> > > > >[1] https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
> > > > >
> > > > >Regards,
> > > > >Nicholas Jiang
> > > >
> > >
> >
>
