[ 
https://issues.apache.org/jira/browse/HDDS-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Andika updated HDDS-15501:
-------------------------------
    Description: 
Currently, we only test Ozone using the traditional UT, IT, Acceptance Tests. 
We had a MiniOzoneChaosCluster, but it seems unused. I propose to introduce a 
distributed system testing and proofs system so that we can have the Ozone spec 
as the shared mental model. Some of the regressions for issues like breaking 
majority commit contract (HDDS-15052) is not detected since we don't have the 
spec as the source of truth and sometimes simply uses intuitions to guide our 
implementation.

This is a parent task for the effort to introduce distributed system testing 
and proofs to test the correctness of Ozone implementation (e.g. partial write 
commit, container state transitions, quasi closed, etc).

Distributed system testing tools:

- Jepsen, Ellen, Maelstorm
- Fray
- Hypothesis (Hegel)
- Antithesis (paid)

Distributed system proofs:
- TLA+
- Lean4
- P framework

Real systems
- 3FS (https://github.com/deepseek-ai/3FS/tree/main/specs) - uses P framework
- AWS S3 
(https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
 and https://p-org.github.io/P/casestudies/#case-studies)

I prefer if we can start with P framework since some storage systems already 
used it.

  was:
Currently, we only test Ozone using the traditional UT, IT, Acceptance Tests. 
We had a MiniOzoneChaosCluster, but it seems unused. I propose to introduce a 
distributed system testing and proofs system so that we can have the Ozone spec 
as the shared mental model. Some of the regressions for issues like breaking 
majority commit contract (HDDS-15052) is not detected since we don't have the 
spec as the source of truth and sometimes uses intuition to guide our design.

This is a parent task for the effort to introduce distributed system testing 
and proofs to test the correctness of Ozone implementation (e.g. partial write 
commit, container state transitions, quasi closed, etc).

Distributed system testing tools:

- Jepsen, Ellen, Maelstorm
- Fray
- Hypothesis (Hegel)
- Antithesis (paid)

Distributed system proofs:
- TLA+
- Lean4
- P framework

Real systems
- 3FS (https://github.com/deepseek-ai/3FS/tree/main/specs) - uses P framework
- AWS S3 
(https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
 and https://p-org.github.io/P/casestudies/#case-studies)

I prefer if we can start with P framework since some storage systems already 
used it.


> Distributed System Testing in Ozone
> -----------------------------------
>
>                 Key: HDDS-15501
>                 URL: https://issues.apache.org/jira/browse/HDDS-15501
>             Project: Apache Ozone
>          Issue Type: New Feature
>          Components: test
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> Currently, we only test Ozone using the traditional UT, IT, Acceptance Tests. 
> We had a MiniOzoneChaosCluster, but it seems unused. I propose to introduce a 
> distributed system testing and proofs system so that we can have the Ozone 
> spec as the shared mental model. Some of the regressions for issues like 
> breaking majority commit contract (HDDS-15052) is not detected since we don't 
> have the spec as the source of truth and sometimes simply uses intuitions to 
> guide our implementation.
> This is a parent task for the effort to introduce distributed system testing 
> and proofs to test the correctness of Ozone implementation (e.g. partial 
> write commit, container state transitions, quasi closed, etc).
> Distributed system testing tools:
> - Jepsen, Ellen, Maelstorm
> - Fray
> - Hypothesis (Hegel)
> - Antithesis (paid)
> Distributed system proofs:
> - TLA+
> - Lean4
> - P framework
> Real systems
> - 3FS (https://github.com/deepseek-ai/3FS/tree/main/specs) - uses P framework
> - AWS S3 
> (https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/
>  and https://p-org.github.io/P/casestudies/#case-studies)
> I prefer if we can start with P framework since some storage systems already 
> used it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to