[ 
https://issues.apache.org/jira/browse/CASSANDRA-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh McKenzie updated CASSANDRA-15348:
--------------------------------------
    Fix Version/s: 4.0-beta

> Harry: generator library and extensible framework for fuzz testing Apache 
> Cassandra
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15348
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15348
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> h2. Description:
> This ticket introduces Harry, a component for fuzz testing and verification 
> of the Apache Cassandra clusters at scale. 
> h2. Motivation: 
> Current testing tooling largely tests for common- and edge-cases, and most of 
> the tests use predefined datasets. Property-based tests can help explore a 
> broader range of states, but often require either a complex model or a large 
> state to test against.
> h2. What problems Harry solves:
> Harry allows to run tests that are able to validate state of both dense nodes 
> (to test local read-write path) and large clusters (to test distributed 
> read-write path), and do it efficiently. Main goals, and what sets it apart 
> from the other testing tools is:
>  * The state required for verification should remain as compact as possible.
>  * The verification process itself should be as performant as possible.
>  * Ideally, we'd want a way to verify database state while _continuing_ 
> running state change queries against it.
> h2. What Harry does: 
> To achieve this, Harry defines a model that holds the state of the database, 
> generators that produce reproducible, pseudo-random schemas, mutations, and 
> queries, and a validator that asserts the correctness of the model following 
> execution of generated traffic.
> h2. Harry consists of multiple reusable components:
>  * Generator library: how to create a library of invertible, order-preserving 
> generators for simple and composite data types.
>  * Model and checker: how to use the properties of generators to validate the 
> output of an eventually-consistent database in a linear time.
>  * Runner library: how to create a scheme for reproducible runs, despite the 
> concurrent nature of database and fuzzer itself.
> h2. Short and somewhat approximate description of how Harry achieves this:
> Generation and validation define strict mathematical relations between the 
> generated values and pseudorandom numbers they were generated from. Using 
> these properties, we can store minimal state and check if these properties 
> hold during validation.
> Since Cassandra stores data in rows, we should be able to "inflate" data to 
> insert a row into the database from a single number we call _descriptor_. 
> Each value in the row read from the database can be "deflated" back to the 
> descriptor it was generated from. This way, to precisely verify the state of 
> the row, we only need to know the descriptor it was generated from and a 
> timestamp at which it was inserted.
> Similarly, keys for the inserted row can be "inflated" from a single 64-bit 
> integer, and then "deflated" back to it. To efficiently search for keys, 
> while allowing range scans, our generation scheme preserves the order of the 
> original 64-bit integer. Every pair of keys generated from two 64-bit 
> integers would sort the same way as these integers.
> This way, in order to validate a state of the range of rows queried from the 
> database, it is sufficient to "deflate" its key and data values, use deflated 
> 64-bit key representation to find all descriptors these rows were generated 
> from, and ensure that the given sequence of descriptors could have resulted 
> in the state that database has responded with.
> Using this scheme, we keep a minimum possible amount of data per row, can 
> efficiently generate the data, and backtrack values to the numbers they were 
> generated from. Most of the time, we operate on 64-bit integer values and 
> only use "inflated" objects when running queries against database state, 
> minimizing the amount of required memory.
> h2. Name: 
> Harry (verb). 
> According to Marriam-Webster: 
>   * to torment by or as if by constant attack
>   * persistently carry out attacks on (an enemy or an enemy's territory)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to