Thanks for the proposal! The chaos framework is very useful for Celeborn. There are two points I think are important:

1. We need to add correctness checks to the framework; correctness is the top priority.
2. The framework should not intrude into the common code.

Regards,
Keyong Zhou

Nicholas Jiang <nicholasji...@apache.org> wrote on Fri, Jul 12, 2024 at 14:29:

> Hey Mridul,
>
> Thanks for your feedback. The ability to reproduce problematic cases by
> capturing logs of events that have been triggered can maximize the value of
> chaos testing framework. Celeborn chaos testing not only needs to verify
> the reliability of the service under the background of simulating various
> abnormal events, but also reproduces problem cases to troubleshoot the root
> cause of Celeborn problems. I would like to take this reproduction feature
> into consideration for this CIP.
>
> Best Regards,
> Nicholas Jiang
>
> On 2024/07/10 09:35:52 Mridul Muralidharan wrote:
> > Hi,
> >
> > This is a great idea - and would go a long way in flushing out bugs and
> > issues - and improving the overall robustness of Celeborn !
> > It would also be good to have:
> > a) Capture a (replay) log of all events which were triggered.
> > b) Ability to 'replay' the log and deterministically reach the same state.
> >
> > This will allow us to identify failure cases with the testing framework -
> > while allowing developers to deterministically reproduce the identified
> > state.
> >
> > (Hopefully I did not miss this in the proposal).
> >
> > Regards,
> > Mridul
> >
> >
> > On Wed, Jul 10, 2024 at 4:07 AM Nicholas Jiang <nicholasji...@apache.org>
> > wrote:
> >
> > > Hello community,
> > >
> > > It's been a while since the discussion on the Celeborn chaos testing
> > > framework. The main process of Celeborn chaos testing includes:
> > >
> > > 1. Defining a test plan to describe the types of events, the order in
> > > which events are triggered, and their duration. Event types include node
> > > anomalies, disk anomalies, IO anomalies, CPU overload, etc.
> > > 2. The client submits the plan to the scheduler.
> > > 3. The scheduler sends operations to each node's runner according to the
> > > plan description.
> > > 4. The runner is responsible for executing the operations and reporting
> > > the current status of the node.
> > > 5. Before triggering an operation, the scheduler deduces the result of
> > > this event. If it leads to the inability to meet the minimum runnable
> > > environment for RSS, the event is rejected.
> > >
> > > Do you have any thoughts or questions about this chaos testing framework?
> > > Welcome feedback to further ensure the reliability of Celeborn through
> > > chaos testing.
> > >
> > > Regards,
> > > Nicholas Jiang
> > >
> > > At 2024-07-03 05:20:57, "Nicholas Jiang" <nicholasji...@apache.org> wrote:
> > > >Hi all,
> > > >
> > > >I would like to start a discussion on CIP-10: Introduce Celeborn Chaos
> > > >Testing Framework[1].
> > > >
> > > >A chaos testing framework is designed to simulate unpredictable and
> > > >adverse conditions in distributed systems to validate their robustness and
> > > >resilience. This proposal aims to simulate various anomalies and test the
> > > >stability of Celeborn in distributed environments via chaos testing.
> > > >
> > > >Looking forward to everyone's feedback and suggestions. Thank you!
> > > >
> > > >[1]
> > > >https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
> > > >
> > > >Regards,
> > > >Nicholas Jiang
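
For readers following the thread, here is a rough sketch of how step 5 of the quoted process (the scheduler deducing an event's result and rejecting it if the minimum runnable environment would be lost) could fit together with the replay log Mridul suggested. This is not taken from CIP-10: the names `Event`, `Scheduler`, `min_healthy` and `run_plan` are hypothetical, "minimum runnable environment" is assumed to mean a minimum count of healthy nodes, and every non-recovery event is assumed to make its target node unhealthy.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Event:
    """One step of a test plan (hypothetical shape, not from CIP-10)."""
    kind: str        # e.g. "node_down", "disk_fault", "node_up"
    node: str        # name of the target node
    duration_s: int  # how long the anomaly should last


@dataclass
class Scheduler:
    nodes: Set[str]
    min_healthy: int  # assumed meaning of "minimum runnable environment"
    unhealthy: Set[str] = field(default_factory=set)
    replay_log: List[Event] = field(default_factory=list)

    def _after(self, event: Event) -> Set[str]:
        """Deduce the set of unhealthy nodes if the event were applied."""
        unhealthy = set(self.unhealthy)
        if event.kind == "node_up":
            unhealthy.discard(event.node)
        else:
            # Simplifying assumption: any anomaly makes the node unhealthy.
            unhealthy.add(event.node)
        return unhealthy

    def submit(self, event: Event) -> bool:
        """Apply the event if the cluster stays runnable, else reject it."""
        if event.node not in self.nodes:
            return False
        unhealthy = self._after(event)
        if len(self.nodes - unhealthy) < self.min_healthy:
            return False  # rejected: too few healthy nodes would remain
        self.unhealthy = unhealthy
        self.replay_log.append(event)  # accepted events only, in trigger order
        return True


def run_plan(scheduler: Scheduler, plan: List[Event]) -> List[bool]:
    """Submit every event of a plan in order; report accept/reject per event."""
    return [scheduler.submit(event) for event in plan]


plan = [
    Event("node_down", "worker-1", 60),
    Event("disk_fault", "worker-2", 30),
    Event("node_down", "worker-3", 60),
    Event("node_up", "worker-1", 0),
]
scheduler = Scheduler(nodes={"worker-1", "worker-2", "worker-3"}, min_healthy=2)
results = run_plan(scheduler, plan)
print(results)
```

With three workers and `min_healthy=2`, the second and third events are rejected because each would leave only one healthy node, so `results` is `[True, False, False, True]`. Feeding `scheduler.replay_log` into a fresh `Scheduler` reaches the same state, which is the deterministic replay property discussed above, at least for this toy model where events have no timing or randomness.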