[ https://issues.apache.org/jira/browse/CASSANDRA-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200093#comment-17200093 ]
Alex Petrov commented on CASSANDRA-15348:
-----------------------------------------

Committed in [1d7f66e2d5b39702ff218cd36e0b9043b0d47cf1|https://github.com/apache/cassandra-harry/commit/1d7f66e2d5b39702ff218cd36e0b9043b0d47cf1]

> Harry: generator library and extensible framework for fuzz testing Apache
> Cassandra
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15348
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15348
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> h2. Description:
> This ticket introduces Harry, a component for fuzz testing and verification
> of Apache Cassandra clusters at scale.
> h2. Motivation:
> Current testing tooling largely tests for common and edge cases, and most of
> the tests use predefined datasets. Property-based tests can help explore a
> broader range of states, but often require either a complex model or a large
> state to test against.
> h2. What problems Harry solves:
> Harry makes it possible to run tests that validate the state of both dense
> nodes (to test the local read-write path) and large clusters (to test the
> distributed read-write path), and to do so efficiently. Its main goals, which
> also set it apart from other testing tools, are:
> * The state required for verification should remain as compact as possible.
> * The verification process itself should be as performant as possible.
> * Ideally, we'd want a way to verify database state while _continuing_
> to run state-changing queries against it.
> h2. What Harry does:
> To achieve this, Harry defines a model that holds the state of the database,
> generators that produce reproducible, pseudo-random schemas, mutations, and
> queries, and a validator that asserts the correctness of the model following
> execution of generated traffic.
> h2. Harry consists of multiple reusable components:
> * Generator library: how to create a library of invertible, order-preserving
> generators for simple and composite data types.
> * Model and checker: how to use the properties of generators to validate the
> output of an eventually-consistent database in linear time.
> * Runner library: how to create a scheme for reproducible runs, despite the
> concurrent nature of both the database and the fuzzer itself.
> h2. Short and somewhat approximate description of how Harry achieves this:
> Generation and validation define strict mathematical relations between the
> generated values and the pseudorandom numbers they were generated from. Using
> these properties, we can store minimal state and check if these properties
> hold during validation.
> Since Cassandra stores data in rows, we should be able to "inflate" data to
> insert a row into the database from a single number we call a _descriptor_.
> Each value in the row read from the database can be "deflated" back to the
> descriptor it was generated from. This way, to precisely verify the state of
> the row, we only need to know the descriptor it was generated from and the
> timestamp at which it was inserted.
> Similarly, keys for the inserted row can be "inflated" from a single 64-bit
> integer, and then "deflated" back to it. To efficiently search for keys,
> while allowing range scans, our generation scheme preserves the order of the
> original 64-bit integer. Every pair of keys generated from two 64-bit
> integers would sort the same way as these integers.
> This way, to validate the state of a range of rows queried from the
> database, it is sufficient to "deflate" their keys and data values, use the
> deflated 64-bit key representations to find all descriptors these rows were
> generated from, and ensure that the given sequence of descriptors could have
> resulted in the state that the database responded with.
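
To make the inflate/deflate idea above concrete, here is a minimal Python sketch. It is not Harry's actual generator code: the function names, the hex-string key encoding, and the multiplier constant are all assumptions chosen for illustration, and it tracks a single descriptor per key rather than a sequence of descriptors with timestamps. It shows only the two properties the description relies on: keys that sort like the 64-bit integers they came from, and column values that can be mapped back to their descriptor.

```python
from typing import Dict, List

SIGN_BIT = 1 << 63
MASK = (1 << 64) - 1

# Hypothetical odd multiplier; any odd constant is invertible modulo 2**64.
MULTIPLIER = 0x9E3779B97F4A7C15
MULTIPLIER_INV = pow(MULTIPLIER, -1, 1 << 64)  # modular inverse, Python 3.8+


def inflate_key(n: int) -> str:
    """Map a signed 64-bit integer to a fixed-width hex string key.

    Adding 2**63 shifts the signed range onto 0..2**64-1, and fixed-width
    lowercase hex strings sort lexicographically in numeric order.
    """
    if not -SIGN_BIT <= n < SIGN_BIT:
        raise ValueError("expected a signed 64-bit integer")
    return format(n + SIGN_BIT, "016x")


def deflate_key(key: str) -> int:
    """Recover the signed 64-bit integer a key was inflated from."""
    return int(key, 16) - SIGN_BIT


def inflate_value(descriptor: int, column: int) -> int:
    """Derive an unsigned 64-bit column value from a row descriptor."""
    return ((descriptor & MASK) * MULTIPLIER + column) & MASK


def deflate_value(value: int, column: int) -> int:
    """Invert inflate_value, returning the unsigned 64-bit descriptor."""
    return ((value - column) * MULTIPLIER_INV) & MASK


def row_matches(key: str, values: List[int], written: Dict[int, int]) -> bool:
    """Check a row read back against the descriptor recorded for its key.

    `written` maps the 64-bit key integer to the descriptor last written,
    which is the only per-row state this sketch needs to keep.
    """
    expected = written.get(deflate_key(key))
    return expected is not None and all(
        deflate_value(v, c) == expected for c, v in enumerate(values)
    )


if __name__ == "__main__":
    ids = [-5, -1, 0, 7, 2**40]
    keys = [inflate_key(n) for n in ids]
    print(keys == sorted(keys))  # keys sort the same way as the integers

    written = {7: 12345}
    row = [inflate_value(12345, c) for c in range(3)]
    print(row_matches(inflate_key(7), row, written))
```

Because every column value deflates to the same descriptor, the checker in this sketch stores two integers per row (key and descriptor) and never needs to keep the inflated values around.
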
> Using this scheme, we keep the minimum possible amount of data per row, can
> efficiently generate the data, and backtrack values to the numbers they were
> generated from. Most of the time, we operate on 64-bit integer values and
> only use "inflated" objects when running queries against the database state,
> minimizing the amount of required memory.
> h2. Name:
> Harry (verb).
> According to Merriam-Webster:
> * to torment by or as if by constant attack
> * persistently carry out attacks on (an enemy or an enemy's territory)