[
https://issues.apache.org/jira/browse/HDDS-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089090#comment-18089090
]
Chung-En Lee commented on HDDS-15501:
-------------------------------------
I completely agree that DST requires a lot of effort and is highly invasive,
making it less practical for us to adopt right away.
Instead, I propose that we start by introducing Porcupine (or a similar
*linearizability* checker). Since it operates as a post-facto history checker,
we can fully leverage our existing testing cluster or the
{{MiniOzoneChaosCluster}} without making any intrusive changes to Ozone's core
codebase. This will allow us to quickly detect whether the current
implementation violates linearizability and catch hidden regressions like
{{HDDS-15052}}.
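To make the idea of a post-facto history check concrete, here is a minimal sketch in Python. It is not Porcupine's API and not Ozone code: the names {{Op}} and {{is_linearizable}}, the single-register model, and the integer timestamps are all assumptions made for illustration. It takes a recorded history of reads and writes, each with an invocation and a response timestamp, and searches by brute force for a sequential order that respects real-time ordering and register semantics.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Op:
    """One completed client operation recorded by the test harness (hypothetical shape)."""
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    call: int              # invocation timestamp
    ret: int               # response timestamp


def is_linearizable(history: Sequence[Op], initial: Optional[int] = None) -> bool:
    """Brute-force check of a single-register history.

    Returns True if some total order of the operations (a) keeps every pair
    of non-overlapping operations in their real-time order and (b) makes
    every read return the most recently written value.
    """
    def search(remaining: tuple, state: Optional[int]) -> bool:
        if not remaining:
            return True
        # An operation may be linearized next only if no other pending
        # operation had already returned before it was invoked.
        earliest_ret = min(op.ret for op in remaining)
        for i, op in enumerate(remaining):
            if op.call > earliest_ret:
                continue
            if op.kind == "write":
                next_state = op.value
            elif op.value == state:
                next_state = state
            else:
                continue  # this read could not have observed the current state
            if search(remaining[:i] + remaining[i + 1:], next_state):
                return True
        return False

    return search(tuple(history), initial)


# A read that overlaps a write may legally observe the new value.
ok_history = [
    Op("write", 1, call=0, ret=10),
    Op("read", 1, call=2, ret=5),
]

# A read that starts after write(2) has completed must not return 1.
stale_history = [
    Op("write", 1, call=0, ret=1),
    Op("write", 2, call=2, ret=3),
    Op("read", 1, call=4, ret=5),
]

print(is_linearizable(ok_history))     # overlapping read is accepted
print(is_linearizable(stale_history))  # stale read is rejected
```

The search is exponential in the worst case, so this is only suitable for short histories. A real checker such as Porcupine would be used for histories collected from an actual cluster.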
More importantly, even when we move toward DST in the future, we will still
absolutely need this kind of checker to validate execution histories. While DST
excels at generating high-density concurrent interleavings and chaos, it
remains blind to semantic correctness without an oracle. By preparing
*Porcupine (or a similar checker)* now, we are building the exact foundational
infrastructure needed for future simulation frameworks.
To kick this off, I created HDDS-15561 to introduce a linearizability checker.
Let's change this ticket to an Epic.
> Distributed System Testing in Ozone
> -----------------------------------
>
> Key: HDDS-15501
> URL: https://issues.apache.org/jira/browse/HDDS-15501
> Project: Apache Ozone
> Issue Type: Test
> Components: test
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> Currently, we only test Ozone using traditional unit tests (UT), integration
> tests (IT), and acceptance tests. We have a MiniOzoneChaosCluster (fault
> injection testing), but it seems unused. I propose introducing distributed
> system testing and proofs so that we can have the Ozone spec as the shared
> mental model. Regressions such as breaking the majority commit contract
> (HDDS-15052) go undetected because we don't have a spec as the source of
> truth.
> Additionally, we sometimes rely on intuition alone to guide our
> implementation and fixes, which can cause regressions (for example, many
> ReplicationManager fixes are made only after an issue shows up in
> production).
> This is a parent task for the effort to introduce distributed system testing
> and proofs to test the correctness of the Ozone implementation (e.g. partial
> write commit, container state transitions, replication manager, container
> replica management (i.e. how to reconcile eventually consistent heartbeats
> with strongly consistent Ratis in SCM), quasi-closed, block deletion orphan
> issue, etc.).
> Distributed system testing tools:
> - Jepsen, Elle, Maelstrom
> - Fray
> - Hypothesis (Hegel)
> - Antithesis (paid)
> Distributed system proofs:
> - TLA+
> - Lean4
> - P framework
> Real systems:
> - 3FS ([https://github.com/deepseek-ai/3FS/tree/main/specs]) - uses P
> framework
> - AWS S3
> ([https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/]
> and [https://p-org.github.io/P/casestudies/#case-studies])
> - etcd robustness test
> ([https://github.com/etcd-io/etcd/tree/main/tests/robustness]) - uses
> Antithesis (among other things)
> I would prefer to start with the P framework, since some storage systems
> already use it.
> In the future, we can support Deterministic Simulation Testing (DST)
> ([https://antithesis.com/docs/resources/deterministic_simulation_testing]).
> However, this requires a lot of effort and is very invasive, since we need to
> make all Ozone implementations testable by the simulation framework, so
> it's not going to happen anytime soon. Deterministic simulation testing
> frameworks include Madsim (Rust), Turmoil (Rust), BUGGIFY (C++,
> FoundationDB), Tickloom (Java, [https://github.com/unmeshjoshi/tickloom]),
> and Vortex (Zig, used by TigerBeetle). We can start by taking a look at
> Tickloom for testing in Java.
> Having a real, industry-recognized spec helps to instil confidence in Ozone's
> robustness. More importantly, distributed system testing gives us confidence
> that our changes will not introduce critical issues (as long as the affected
> part of the system is covered by the tests).