[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291211#comment-15291211 ]
Stefan Podkowinski edited comment on CASSANDRA-11349 at 5/20/16 8:26 AM:
-------------------------------------------------------------------------

I've been debugging the latest mentioned error case using the following cql/ccm statements and a local 2 node cluster.

{code}
create keyspace ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
use ks;
CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 text,
    c4 float,
    PRIMARY KEY (c1, c2, c3)
) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};

DELETE FROM table1 USING TIMESTAMP 1463656272791 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'c';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272792 WHERE c1 = 'a' AND c2 = 'b';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272793 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'd';
ccm node1 flush
{code}

Timestamps have been added for easier tracking of the specific tombstone in the debugger. ColumnIndex.Builder.buildForCompaction() will add tombstones to the tracker in the following order:

*Node1*

{{1463656272792: c1 = 'a' AND c2 = 'b'}}
First RT, added to unwritten + opened tombstones.

{{1463656272791: c1 = 'a' AND c2 = 'b' AND c3 = 'c'}}
Overshadowed by the RT added before, while also being older. Will not be added and is simply ignored.

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}}
Overshadowed by the first and only RT added to opened so far, but newer, and will thus be added to unwritten + opened.

We end up with 2 unwritten tombstones (..92 + ..93) passed to the serializer for the message digest.

*Node2*

{{1463656272792: c1 = 'a' AND c2 = 'b'}} (EOC.START)
First RT, added to unwritten + opened tombstones.

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}} (EOC.END)
Comparison of the EOC flag (Tracker:251) of the previously added RT will cause it to be removed from the opened list (Tracker:258). Afterwards the current RT will be added to unwritten + opened.
{{1463656272792: c1 = 'a' AND c2 = 'b'}} ({color:red}again!{color})
Gets compared with the previously added RT, which supersedes the current one and thus stays in the list. The current RT will again be added to the unwritten + opened list.

We end up with 3 unwritten RTs, including 1463656272792 twice. -I still haven't been able to exactly pinpoint why the reducer will be called twice with the same TS, but since [~blambov] explicitly mentioned that possibility, I guess it's intended behavior (but why? :)).-

Running sstable2json makes it more obvious how node2 flushes the RTs:

{noformat}
[
  {"key": "a",
   "cells": [["b:_","b:d:_",1463656272792,"t",1463731877],
             ["b:d:_","b:d:!",1463656272793,"t",1463731886],
             ["b:d:!","b:!",1463656272792,"t",1463731877]]}
]
{noformat}
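To see why the two nodes then disagree during repair, here is a minimal, hypothetical sketch (a plain {{MessageDigest}} over timestamps, not Cassandra's actual serializer or Merkle tree code) that hashes the tombstone sequence each node passes to the serializer, as traced above: node1 emits (..92, ..93) while node2 emits (..92, ..93, ..92) with the duplicate.

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Hypothetical sketch: hash the tombstone timestamps each node hands to the
// serializer. The inputs mirror the walkthrough above: node1 emits two RTs,
// node2 emits three because 1463656272792 lands in the unwritten list twice.
public class RtDigestSketch {

    static byte[] digest(long... timestamps) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (long ts : timestamps)
            for (int i = 7; i >= 0; i--)
                md.update((byte) (ts >>> (8 * i)));
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] node1 = digest(1463656272792L, 1463656272793L);
        byte[] node2 = digest(1463656272792L, 1463656272793L, 1463656272792L);
        // Logically identical deletions, but different serializer input:
        // the digests differ, so the Merkle trees disagree and repair
        // flags the partition as out of sync.
        System.out.println(Arrays.equals(node1, node2)); // false
    }
}
```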
> MerkleTree mismatch when multiple range tombstones exists for the same
> partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 11349-2.1-v2.patch, 11349-2.1-v3.patch, 11349-2.1.patch
>
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.
> After investigation, it appears that, if two range tombstones exist for a partition for the same range/interval, they're both included in the merkle tree computation.
> But, if for some reason, on another node, the two range tombstones were already compacted into a single range tombstone, this will result in a merkle tree difference.
> Currently, this is clearly bad because MerkleTree differences are dependent on compactions (and if a partition is deleted and created multiple times, the only way to ensure that repair "works correctly"/"doesn't overstream data" is to major compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair, accumulating many small SSTables (up to thousands for a rather short period of time when using VNodes, the time for compaction to absorb those small files), but also an increased size on disk.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
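The root problem the issue describes is that the digest depends on how many physical tombstone fragments a node happens to hold, i.e. on its compaction state. A small, hypothetical sketch (not a proposed patch, and the range labels and timestamps are made up for illustration) of the invariant repair needs: if each node collapsed its tombstones to at most one per range, keeping the newest timestamp (which is what compaction eventually does anyway), both sides would hash the same data regardless of compaction history.

```java
import java.util.*;

// Hypothetical sketch: normalize tombstones to one entry per range, keeping
// only the newest deletion timestamp, before anything is fed to a digest.
// An uncompacted node and an already-compacted node then agree.
public class RtNormalizeSketch {

    // range label -> newest deletion timestamp
    static Map<String, Long> normalize(List<String> ranges, List<Long> timestamps) {
        Map<String, Long> out = new TreeMap<>();
        for (int i = 0; i < ranges.size(); i++)
            out.merge(ranges.get(i), timestamps.get(i), Math::max);
        return out;
    }

    public static void main(String[] args) {
        // node A: two tombstones for the same range, not yet compacted
        Map<String, Long> nodeA = normalize(
            Arrays.asList("a:b", "a:b"), Arrays.asList(100L, 200L));
        // node B: already compacted down to the single newest tombstone
        Map<String, Long> nodeB = normalize(
            Arrays.asList("a:b"), Arrays.asList(200L));
        // Identical normalized views -> identical digests, no false mismatch.
        System.out.println(nodeA.equals(nodeB)); // true
    }
}
```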