Hi,

  This is a great idea - it would go a long way toward flushing out bugs and
issues, and toward improving the overall robustness of Celeborn!
It would also be good to have:
a) A (replay) log that captures all events which were triggered.
b) The ability to 'replay' that log and deterministically reach the same state.

This would allow us to identify failure cases with the testing framework,
while also allowing developers to deterministically reproduce the state in
which a failure was identified.
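
To make (a) and (b) concrete, here is a rough sketch of the idea in Python.
It is not taken from the proposal: the ChaosEvent fields, the node names, and
the toy apply_event state transition are all hypothetical and only illustrate
that replaying a recorded log should lead to the same state as the original
run.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ChaosEvent:
    """Hypothetical event record; the field names are assumptions."""
    seq: int         # position in the trigger order
    node: str        # node the event was sent to
    kind: str        # e.g. "disk_anomaly", "cpu_overload"
    duration_s: int  # how long the anomaly lasts


def apply_event(state, event):
    """Toy state transition: remember the last anomaly seen on each node."""
    new_state = dict(state)
    new_state[event.node] = event.kind
    return new_state


def run_and_record(seed, nodes, kinds, steps):
    """Trigger pseudo-random events and append each one to a replay log."""
    rng = random.Random(seed)  # seeded so the run itself is repeatable
    state, log = {}, []
    for seq in range(steps):
        event = ChaosEvent(seq, rng.choice(nodes), rng.choice(kinds),
                           rng.randint(1, 30))
        log.append(json.dumps(asdict(event)))  # one JSON line per event
        state = apply_event(state, event)
    return state, log


def replay(log):
    """Rebuild the state from the log alone, without any randomness."""
    state = {}
    for line in log:
        state = apply_event(state, ChaosEvent(**json.loads(line)))
    return state
```

With something like this, a failing run only needs to ship its log: calling
replay(log) on a developer machine should give the same state that
run_and_record(...) reached when the failure was observed.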

(Hopefully I did not miss this in the proposal).

Regards,
Mridul


On Wed, Jul 10, 2024 at 4:07 AM Nicholas Jiang <nicholasji...@apache.org>
wrote:

> Hello community,
>
> It's been a while since the discussion on the Celeborn chaos testing
> framework. The main process of Celeborn chaos testing includes:
>
> 1. Defining a test plan to describe the types of events, the order in
> which events are triggered, and their duration. Event types include node
> anomalies, disk anomalies, IO anomalies, CPU overload, etc.
> 2. The client submits the plan to the scheduler.
> 3. The scheduler sends operations to each node's runner according to the
> plan description.
> 4. The runner is responsible for executing the operations and reporting
> the current status of the node.
> 5. Before triggering an operation, the scheduler deduces the result of
> this event. If it leads to the inability to meet the minimum runnable
> environment for RSS, the event is rejected.
>
> Do you have any thoughts or questions about this chaos testing framework?
> Feedback is welcome, to help further ensure the reliability of Celeborn
> through chaos testing.
>
> Regards,
> Nicholas Jiang
>
> At 2024-07-03 05:20:57, "Nicholas Jiang" <nicholasji...@apache.org> wrote:
> >Hi all,
> >
> >I would like to start a discussion on CIP-10: Introduce Celeborn Chaos
> Testing Framework[1].
> >
> >A chaos testing framework is designed to simulate unpredictable and
> adverse conditions in distributed systems to validate their robustness and
> resilience. This proposal aims to simulate various anomalies and test the
> stability of Celeborn in distributed environments via chaos testing.
> >
> >Looking forward to everyone's feedback and suggestions. Thank you!
> >
> >[1]
> https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
> >
> >Regards,
> >Nicholas Jiang
>
