Thanks for the proposal!

The chaos framework will be very useful for Celeborn. There are two points I
think are important:
1. We need to add correctness checks to the framework; correctness is the
most important thing.
2. The framework should not intrude into the common code.

Regards,
Keyong Zhou

On Fri, Jul 12, 2024 at 14:29, Nicholas Jiang <nicholasji...@apache.org> wrote:

> Hey Mridul,
>
> Thanks for your feedback. The ability to reproduce problematic cases by
> capturing a log of the events that have been triggered would maximize the
> value of the chaos testing framework. Celeborn chaos testing needs not only
> to verify the reliability of the service while simulating various abnormal
> events, but also to reproduce problem cases so that the root cause of
> Celeborn problems can be found. I would like to take this reproduction
> feature into consideration for this CIP.
>
> Best Regards,
> Nicholas Jiang
>
> On 2024/07/10 09:35:52 Mridul Muralidharan wrote:
> > Hi,
> >
> >   This is a great idea - and would go a long way in flushing out bugs and
> > issues - and improving the overall robustness of Celeborn!
> > It would also be good to have:
> > a) Capture a (replay) log of all events which were triggered.
> > b) Ability to 'replay' the log and deterministically reach the same state.
> >
> > This will allow us to identify failure cases with the testing framework -
> > while allowing developers to deterministically reproduce the identified
> > state.
> >
> > (Hopefully I did not miss this in the proposal).
> >
> > Regards,
> > Mridul
> >
> >
> > On Wed, Jul 10, 2024 at 4:07 AM Nicholas Jiang <nicholasji...@apache.org> wrote:
> >
> > > Hello community,
> > >
> > > It's been a while since the discussion on the Celeborn chaos testing
> > > framework. The main process of Celeborn chaos testing includes:
> > >
> > > 1. Defining a test plan to describe the types of events, the order in
> > > which events are triggered, and their duration. Event types include node
> > > anomalies, disk anomalies, IO anomalies, CPU overload, etc.
> > > 2. The client submits the plan to the scheduler.
> > > 3. The scheduler sends operations to each node's runner according to the
> > > plan description.
> > > 4. The runner is responsible for executing the operations and reporting
> > > the current status of the node.
> > > 5. Before triggering an operation, the scheduler deduces the result of
> > > this event. If it leads to the inability to meet the minimum runnable
> > > environment for RSS, the event is rejected.
> > >
> > > Do you have any thoughts or questions about this chaos testing framework?
> > > Feedback is welcome to further ensure the reliability of Celeborn through
> > > chaos testing.
> > >
> > > Regards,
> > > Nicholas Jiang
> > >
> > > At 2024-07-03 05:20:57, "Nicholas Jiang" <nicholasji...@apache.org> wrote:
> > > >Hi all,
> > > >
> > > >I would like to start a discussion on CIP-10: Introduce Celeborn Chaos
> > > >Testing Framework[1].
> > > >
> > > >A chaos testing framework is designed to simulate unpredictable and
> > > >adverse conditions in distributed systems to validate their robustness and
> > > >resilience. This proposal aims to simulate various anomalies and test the
> > > >stability of Celeborn in distributed environments via chaos testing.
> > > >
> > > >Looking forward to everyone's feedback and suggestions. Thank you!
> > > >
> > > >[1] https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
> > > >
> > > >Regards,
> > > >Nicholas Jiang
> > >
> >
>
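
The scheduler pre-check from step 5 of the quoted proposal, together with the
replay log suggested by Mridul, can be illustrated with a rough sketch. This is
not taken from CIP-10: the names Event, Scheduler, replay and the
MIN_ALIVE_WORKERS threshold are all hypothetical, and "minimum runnable
environment" is assumed here to mean a minimum number of alive workers.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: the CIP does not define what the minimum
# runnable environment is; a worker count is assumed for illustration.
MIN_ALIVE_WORKERS = 2


@dataclass(frozen=True)
class Event:
    kind: str  # assumed event kinds: "node_down" or "node_up"
    node: str  # id of the node the event targets


@dataclass
class Scheduler:
    alive: set
    # Accepted events in trigger order; this doubles as the replay log.
    log: list = field(default_factory=list)

    def predict_alive(self, event):
        """Deduce which nodes would be alive if the event were applied."""
        if event.kind == "node_down":
            return self.alive - {event.node}
        if event.kind == "node_up":
            return self.alive | {event.node}
        return set(self.alive)

    def submit(self, event):
        """Apply the event only if the cluster stays runnable.

        Returns True if the event was accepted, False if it was rejected.
        """
        predicted = self.predict_alive(event)
        if len(predicted) < MIN_ALIVE_WORKERS:
            return False  # rejected: would break the minimum environment
        self.alive = predicted
        self.log.append(event)
        return True


def replay(initial_alive, log):
    """Re-apply a recorded log from the same initial state."""
    scheduler = Scheduler(alive=set(initial_alive))
    for event in log:
        scheduler.submit(event)
    return scheduler.alive
```

Because only accepted events are logged and the state transition is a pure
function of the current state and the event, replaying the log from the same
initial node set reaches the same state. A real framework would also have to
record timing and runner-side results, which this sketch leaves out.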
