[ https://issues.apache.org/jira/browse/CASSANDRA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brandon Williams updated CASSANDRA-16710:
-----------------------------------------
    Severity: Critical  (was: Normal)

Raising priority on this as we should either decide to restore the older behavior or document the new one.

> Read repairs can break row isolation
> ------------------------------------
>
>                 Key: CASSANDRA-16710
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16710
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Samuel Klock
>            Priority: Urgent
>             Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> This issue essentially revives CASSANDRA-8287, which was resolved "Later" in
> 2015. While it was possible in principle at that time for read repair to
> break row isolation, that couldn't happen in practice because Cassandra
> always pulled all of the columns for each row in response to regular reads,
> so read repairs would never partially resolve a row. CASSANDRA-10657
> modified Cassandra to pull only the requested columns for reads, which
> enabled read repair to break row isolation in practice.
> Note also that this is distinct from CASSANDRA-14593 (read repair breaking
> partition-level isolation): that issue (as we understand it) covers
> isolation being broken across multiple rows within an update to a
> partition, while this issue covers isolation being broken across multiple
> columns within an update to a single row.
> This behavior is easy to reproduce under affected versions using {{ccm}}:
> {code:bash}
> ccm create -n 3 -v $VERSION rrtest
> ccm updateconf -y 'hinted_handoff_enabled: false
> max_hint_window_in_ms: 0'
> ccm start
> (cat <<EOF
> CREATE KEYSPACE IF NOT EXISTS rrtest WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '3'};
> CREATE TABLE IF NOT EXISTS rrtest.kv (key TEXT PRIMARY KEY, col1 TEXT, col2 INT);
> CONSISTENCY ALL;
> INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'a', 1);
> EOF
> ) | ccm node1 cqlsh
> ccm node3 stop
> (cat <<EOF
> CONSISTENCY QUORUM;
> INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'b', 2);
> EOF
> ) | ccm node1 cqlsh
> ccm node3 start
> ccm node2 stop
> (cat <<EOF
> CONSISTENCY QUORUM;
> SELECT key, col1 FROM rrtest.kv WHERE key = 'key';
> EOF
> ) | ccm node1 cqlsh
> ccm node1 stop
> (cat <<EOF
> CONSISTENCY ONE;
> SELECT * FROM rrtest.kv WHERE key = 'key';
> EOF
> ) | ccm node3 cqlsh
> {code}
> This snippet creates a three-node cluster with an RF=3 keyspace containing a
> table with three columns: a partition key and two value columns. (Hinted
> handoff can mask the problem if the repro steps are executed in quick
> succession, so the snippet disables it for this exercise.) Then:
> # It adds a full row to the table with values ('a', 1), ensuring it's
> replicated to all three nodes.
> # It stops a node, then replaces the initial row with new values ('b', 2) in
> a single update, ensuring that it's replicated to both available nodes.
> # It starts the node that was down, then stops one of the other nodes and
> performs a quorum read of just the text column ({{col1}}). The read observes
> 'b'.
> # Finally, it stops the other node that observed the second update, then
> performs a CL=ONE read of the entire row on the node that was down for that
> update.
> If read repair respects row isolation, then the final read should observe
> ('b', 2).
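The failure mode in the repro above can be modeled without a cluster. The sketch below is plain Python, not Cassandra code: replicas are modeled as {column: (value, write_timestamp)} maps, reconciliation is latest-timestamp-wins per column, and read repair (mirroring the post-CASSANDRA-10657 behavior) only touches the columns the client requested. All names are hypothetical.

```python
# Toy model (not Cassandra code) of per-column, timestamp-based
# reconciliation with read repair limited to the requested columns.

def newest(cells):
    """Resolve one column across replicas: latest write timestamp wins."""
    return max(cells, key=lambda c: c[1])

def quorum_read_with_repair(replicas, columns):
    """Read `columns` from `replicas`, repairing stale replicas.

    Only the *requested* columns are read, so only those columns can be
    repaired -- the behavior this issue is about.
    """
    result = {}
    for col in columns:
        winner = newest([r[col] for r in replicas])
        result[col] = winner[0]
        for r in replicas:
            if r[col][1] < winner[1]:
                r[col] = winner  # read repair: requested column only
    return result

# t=1: full row ('a', 1) replicated to all three nodes.
node1 = {"col1": ("a", 1), "col2": (1, 1)}
node2 = {"col1": ("a", 1), "col2": (1, 1)}
node3 = {"col1": ("a", 1), "col2": (1, 1)}

# t=2: node3 is down; the replacement row ('b', 2) lands on node1/node2.
for r in (node1, node2):
    r["col1"] = ("b", 2)
    r["col2"] = (2, 2)

# node3 back up, node2 down: quorum read of col1 only (node1 + node3).
row = quorum_read_with_repair([node1, node3], ["col1"])
assert row == {"col1": "b"}

# node1 down: a CL=ONE read of the whole row from node3 now observes a
# row that was never written: col1 from the second update, col2 from the
# first.
final = {col: node3[col][0] for col in ("col1", "col2")}
print(final)  # {'col1': 'b', 'col2': 1}
```

Under this model the stale replica ends up with the mixed row ('b', 1), matching the observed behavior on 3.11.10 and 4.0-rc1.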
> (('a', 1) would also be acceptable if we're willing to sacrifice
> monotonicity.)
> * With {{VERSION=3.0.24}}, the final read observes ('b', 2), as expected.
> * With {{VERSION=3.11.10}} and {{VERSION=4.0-rc1}}, the final read instead
> observes ('b', 1). The same is true for 3.0.24 if CASSANDRA-10657 is
> backported to it.
> The scenario above is somewhat contrived in that it supposes multiple read
> workflows consulting different sets of columns at different consistency
> levels. Under 3.11, however, asynchronous read repair makes this scenario
> possible even using just CL=ONE -- and, with speculative retry, even if
> {{read_repair_chance}} and {{dclocal_read_repair_chance}} are both zeroed.
> We haven't looked closely at 4.0, but even though (as we understand it) it
> lacks async read repair, scenarios like CL=ONE writes or failed,
> partially-committed CL>ONE writes create some surface area for this
> behavior, even without mixed consistency/column reads.
> Given the importance of paging to reads from wide partitions, it makes some
> intuitive sense that applications shouldn't rely on isolation at the
> partition level. Being unable to rely on row isolation is much more
> surprising, especially given that (modulo the possibility of other
> atomicity bugs) Cassandra did preserve it before 3.11. Cassandra should
> either fix this in code (e.g., when performing a read repair, always
> operate over all of the columns of the table, regardless of which were
> originally requested by the read) or at least update its documentation to
> include appropriate caveats about update isolation.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
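The code-side remedy the issue suggests (repairing over all of the table's columns, regardless of which were requested) can be sketched in the same toy replica model used above. Again, this is illustrative Python with hypothetical names, not a proposal for the actual Cassandra implementation.

```python
# Toy sketch (not Cassandra code) of the suggested fix: when read repair
# fires, reconcile *all* of the table's columns, not just the columns the
# client requested, then answer with the requested subset.

def newest(cells):
    return max(cells, key=lambda c: c[1])  # latest write timestamp wins

def read_with_full_row_repair(replicas, requested):
    all_columns = set().union(*(r.keys() for r in replicas))
    for col in all_columns:                # repair over every column...
        winner = newest([r[col] for r in replicas])
        for r in replicas:
            if r[col][1] < winner[1]:
                r[col] = winner
    # ...but return only what the client asked for.
    return {c: newest([r[c] for r in replicas])[0] for c in requested}

fresh = {"col1": ("b", 2), "col2": (2, 2)}   # saw the second update
stale = {"col1": ("a", 1), "col2": (1, 1)}   # was down for it

row = read_with_full_row_repair([fresh, stale], ["col1"])
assert row == {"col1": "b"}

# The stale replica now holds the complete second update, so a later
# CL=ONE read of the whole row there observes ('b', 2): row isolation
# is preserved.
print(stale)  # {'col1': ('b', 2), 'col2': (2, 2)}
```

The trade-off, which the issue leaves open, is that every repair then reads and ships full rows, which is more expensive for wide rows than repairing only the requested columns.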