[ 
https://issues.apache.org/jira/browse/HDDS-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087606#comment-18087606
 ] 

Wei-Chiu Chuang commented on HDDS-15501:
----------------------------------------

Reference: HBASE-29975 HBase TLA+ specification for AM.

I've not tried myself but I suspect we can leverage AI for TLA+.

> Distributed System Testing in Ozone
> -----------------------------------
>
>                 Key: HDDS-15501
>                 URL: https://issues.apache.org/jira/browse/HDDS-15501
>             Project: Apache Ozone
>          Issue Type: Test
>          Components: test
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> Currently, we only test Ozone using the traditional UT, IT, Acceptance Tests. 
> We had a MiniOzoneChaosCluster (fault injection testing), but it seems 
> unused. I propose to introduce a distributed system testing and proofs system 
> so that we can have the Ozone spec as the shared mental model. Some of the 
> regressions for issues like breaking majority commit contract (HDDS-15052) is 
> not detected since we don't have the spec as the source of truth. 
> Additionally  sometimes simply we use our intuitions to guide our 
> implementation and fixes.
> This is a parent task for the effort to introduce distributed system testing 
> and proofs to test the correctness of Ozone implementation (e.g. partial 
> write commit, container state transitions, quasi closed, etc).
> Distributed system testing tools:
> - Jepsen, Ellen, Maelstorm
> - Fray
> - Hypothesis (Hegel)
> - Antithesis (paid) 
> Distributed system proofs:
> - TLA+
> - Lean4
> - P framework
> Real systems
> - 3FS (https://github.com/deepseek-ai/3FS/tree/main/specs) - uses P framework
> - AWS S3 
> (https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
>  and https://p-org.github.io/P/casestudies/#case-studies)
> - etcd robustness test 
> (https://github.com/etcd-io/etcd/tree/main/tests/robustness) - Uses 
> antithesis (among other things) 
> I prefer if we can start with P framework since some storage systems already 
> used it.
> Having a real industry-recognized spec helps to instil confidence in Ozone 
> robustness. More importantly, distributed system testing allows us to have 
> confidence that our changes will not introduce critical issues (as long as 
> the system is covered by the test).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to