[ 
https://issues.apache.org/jira/browse/HDDS-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089166#comment-18089166
 ] 

Chun-Hung Tseng commented on HDDS-15501:
----------------------------------------

Hi [~ivanandika] , I have discussed with [~chungen] and will let him drive the 
communication here :)!

> Distributed System Testing in Ozone
> -----------------------------------
>
>                 Key: HDDS-15501
>                 URL: https://issues.apache.org/jira/browse/HDDS-15501
>             Project: Apache Ozone
>          Issue Type: Test
>          Components: test
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> Currently, we only test Ozone using the traditional UT, IT, Acceptance Tests. 
> We had a MiniOzoneChaosCluster (fault injection testing), but it seems 
> unused. I propose to introduce a distributed system testing and proofs system 
> so that we can have the Ozone spec as the shared mental model. Some of the 
> regressions for issues like breaking majority commit contract (HDDS-15052) is 
> not detected since we don't have the spec as the source of truth. 
> Additionally sometimes simply we use our intuitions to guide our 
> implementation and fixes which can cause regressions (for example, a lot of 
> ReplicationManager fixes are only done only when there is an issue in 
> productions).
> This is a parent task for the effort to introduce distributed system testing 
> and proofs to test the correctness of Ozone implementation (e.g. partial 
> write commit, container state transitions, replication manager, container 
> replica management (i.e. how to appease eventually consistent heartbeat and 
> strongly consistent Ratis in SCM), quasi closed, block deletion orphan issue, 
> etc).
> Distributed system testing tools:
>  - Jepsen, Ellen, Maelstorm
>  - Fray
>  - Hypothesis (Hegel)
>  - Antithesis (paid)
> Distributed system proofs:
>  - TLA+
>  - Lean4
>  - P framework
> Real systems
>  - 3FS ([https://github.com/deepseek-ai/3FS/tree/main/specs]) - uses P 
> framework
>  - AWS S3 
> ([https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/]
>  and [https://p-org.github.io/P/casestudies/#case-studies])
>  - etcd robustness test 
> ([https://github.com/etcd-io/etcd/tree/main/tests/robustness]) - Uses 
> antithesis (among other things)
> I prefer if we can start with P framework since some storage systems already 
> used it.
> In the future, we can support Deterministic Simulation Testing 
> ([https://antithesis.com/docs/resources/deterministic_simulation_testing]). 
> However, this requires a lot of efforts and very invasive since we need to 
> make all Ozone implementations to be testable by the simulation framework so 
> it's not going to happen anytime soon. Some of the deterministic simulation 
> testing framework is Madsim (Rust), Turmoil (Rust), BUGGIFY (C++, 
> FoundationDB), Tickloom (Java, [https://github.com/unmeshjoshi/tickloom]), 
> Vortex (Zig, used by TigerBeetle). We can start taking look at Tickloom for 
> testing in Java.
> Having a real industry-recognized spec helps to instil confidence in Ozone 
> robustness. More importantly, distributed system testing allows us to have 
> confidence that our changes will not introduce critical issues (as long as 
> the system is covered by the test).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to