[
https://issues.apache.org/jira/browse/HDDS-15501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089108#comment-18089108
]
Ivan Andika commented on HDDS-15501:
------------------------------------
Thanks [~chungen] for the interest. A linearizability checker is a good idea, and
Porcupine ( [https://github.com/anishathalye/porcupine] ) is a well-known
one (used in TiDB and the MIT Distributed Systems course). Another
is [https://github.com/jepsen-io/knossos] . However, from what I see,
Porcupine can only be used from Go programs, so I am not sure whether we can
test Ozone with it. Additionally, for a linearizability checker, we can try to
test OM and SCM first, since they map quite cleanly to a key-value store API, or
we can try the checker on Ratis first (to check that the underlying
Raft implementation is linearizable).
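To make the idea concrete, here is a minimal sketch of what such a checker does,
assuming a single register (one key) and a history in which every operation has
completed. It is a brute-force search, not how Porcupine or Knossos are
implemented, and every name in it (LinearizabilitySketch, Op, isLinearizable) is
hypothetical and not part of Ozone or Ratis.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: brute-force linearizability check for one register.
final class LinearizabilitySketch {

    /** One completed client operation with its invocation and return times. */
    static final class Op {
        final boolean isWrite;
        final int value;        // value written, or value the read returned
        final long invokedAt;
        final long returnedAt;  // assumed to be >= invokedAt

        Op(boolean isWrite, int value, long invokedAt, long returnedAt) {
            this.isWrite = isWrite;
            this.value = value;
            this.invokedAt = invokedAt;
            this.returnedAt = returnedAt;
        }

        static Op write(int value, long invokedAt, long returnedAt) {
            return new Op(true, value, invokedAt, returnedAt);
        }

        static Op read(int value, long invokedAt, long returnedAt) {
            return new Op(false, value, invokedAt, returnedAt);
        }
    }

    /** True if some sequential order respecting real time explains the history. */
    static boolean isLinearizable(List<Op> history, int initialValue) {
        return search(new ArrayList<>(history), initialValue);
    }

    // Depth-first search: pick an operation that may take effect first,
    // apply it to the register, and recurse on the remaining operations.
    private static boolean search(List<Op> pending, int current) {
        if (pending.isEmpty()) {
            return true;
        }
        for (int i = 0; i < pending.size(); i++) {
            Op candidate = pending.get(i);
            if (!mayGoFirst(candidate, pending)) {
                continue;
            }
            if (!candidate.isWrite && candidate.value != current) {
                continue; // a read must return the current register value
            }
            int next = candidate.isWrite ? candidate.value : current;
            pending.remove(i);
            boolean ok = search(pending, next);
            pending.add(i, candidate); // backtrack
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // An operation may go first only if no other pending operation
    // returned before this one was invoked (real-time order).
    private static boolean mayGoFirst(Op candidate, List<Op> pending) {
        for (Op other : pending) {
            if (other.returnedAt < candidate.invokedAt) {
                return false;
            }
        }
        return true;
    }
}
```

With an initial value of 0, a read that returns 0 after write(1) has already
returned is rejected, while the same read overlapping the write is accepted. The
search is exponential in the worst case, so for real histories we would still
want an existing checker.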
Yes, DST is definitely not worth the effort right now. I think we can start
with the low-hanging fruit that can be picked without changing any Ozone
implementation. However, it's good for the community to be aware of DST and
probably experiment with DST frameworks.
> Distributed System Testing in Ozone
> -----------------------------------
>
> Key: HDDS-15501
> URL: https://issues.apache.org/jira/browse/HDDS-15501
> Project: Apache Ozone
> Issue Type: Test
> Components: test
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> Currently, we only test Ozone using the traditional UTs, ITs, and acceptance
> tests. We had a MiniOzoneChaosCluster (fault injection testing), but it seems
> unused. I propose to introduce distributed system testing and proofs
> so that we can have the Ozone spec as the shared mental model. Some
> regressions, such as breaking the majority commit contract (HDDS-15052), are
> not detected since we don't have a spec as the source of truth.
> Additionally, we sometimes simply use our intuition to guide our
> implementation and fixes, which can cause regressions (for example, a lot of
> ReplicationManager fixes are only done when there is an issue in
> production).
> This is a parent task for the effort to introduce distributed system testing
> and proofs to test the correctness of the Ozone implementation (e.g. partial
> write commit, container state transitions, replication manager, container
> replica management (i.e. how to reconcile eventually consistent heartbeats
> with strongly consistent Ratis in SCM), quasi-closed, block deletion orphan
> issue, etc.).
> Distributed system testing tools:
> - Jepsen, Elle, Maelstrom
> - Fray
> - Hypothesis (Hegel)
> - Antithesis (paid)
> Distributed system proofs:
> - TLA+
> - Lean4
> - P framework
> Real systems
> - 3FS ([https://github.com/deepseek-ai/3FS/tree/main/specs]) - uses P
> framework
> - AWS S3
> ([https://cacm.acm.org/practice/systems-correctness-practices-at-amazon-web-services/]
> and [https://p-org.github.io/P/casestudies/#case-studies])
> - etcd robustness test
> ([https://github.com/etcd-io/etcd/tree/main/tests/robustness]) - uses
> Antithesis (among other things)
> I would prefer to start with the P framework, since some storage systems
> already use it.
> In the future, we can support Deterministic Simulation Testing
> ([https://antithesis.com/docs/resources/deterministic_simulation_testing]).
> However, this requires a lot of effort and is very invasive, since we need to
> make all Ozone implementations testable by the simulation framework, so
> it's not going to happen anytime soon. Some deterministic simulation
> testing frameworks are Madsim (Rust), Turmoil (Rust), BUGGIFY (C++,
> FoundationDB), Tickloom (Java, [https://github.com/unmeshjoshi/tickloom]),
> and Vortex (Zig, used by TigerBeetle). We can start by taking a look at
> Tickloom for testing in Java.
> Having a real industry-recognized spec helps to instil confidence in Ozone's
> robustness. More importantly, distributed system testing gives us
> confidence that our changes will not introduce critical issues (as long as
> the affected part of the system is covered by the tests).