[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291211#comment-15291211 ]
Stefan Podkowinski edited comment on CASSANDRA-11349 at 5/20/16 8:26 AM:
-------------------------------------------------------------------------

I've been debugging the latest mentioned error case using the following cql/ccm statements and a local 2 node cluster.

{code}
create keyspace ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
use ks;
CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 text,
    c4 float,
    PRIMARY KEY (c1, c2, c3)
) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};

DELETE FROM table1 USING TIMESTAMP 1463656272791 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'c';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272792 WHERE c1 = 'a' AND c2 = 'b';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272793 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'd';
ccm node1 flush
{code}

Timestamps have been added for easier tracking of the specific tombstone in the debugger. ColumnIndex.Builder.buildForCompaction() will add tombstones to the tracker in the following order:

*Node1*

{{1463656272792: c1 = 'a' AND c2 = 'b'}}
First RT, added to unwritten + opened tombstones.

{{1463656272791: c1 = 'a' AND c2 = 'b' AND c3 = 'c'}}
Overshadowed by the RT added before, while also being older. Will not be added and is simply ignored.

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}}
Overshadowed by the first and only RT added to opened so far, but newer, and will thus be added to unwritten + opened.

We end up with 2 unwritten tombstones (..92 + ..93) passed to the serializer for the message digest.

*Node2*

{{1463656272792: c1 = 'a' AND c2 = 'b'}} (EOC.START)
First RT, added to unwritten + opened tombstones.

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}} (EOC.END)
Comparison of the EOC flag (Tracker:251) of the previously added RT will cause it to be removed from the opened list (Tracker:258). Afterwards the current RT will be added to unwritten + opened.
{{1463656272792: c1 = 'a' AND c2 = 'b'}} ({color:red}again!{color})
Gets compared with the previously added RT, which supersedes the current one and thus stays in the list. The current RT will again be added to the unwritten + opened list.

We end up with 3 unwritten RTs, including 1463656272792 twice. -I still haven't been able to exactly pinpoint why the reducer will be called twice with the same TS, but since [~blambov] explicitly mentioned that possibility, I guess it's intended behavior (but why? :)).-

Running sstable2json makes it more obvious how node2 flushes the RTs:

{noformat}
[
  {"key": "a",
   "cells": [["b:_","b:d:_",1463656272792,"t",1463731877],
             ["b:d:_","b:d:!",1463656272793,"t",1463731886],
             ["b:d:!","b:!",1463656272792,"t",1463731877]]}
]
{noformat}
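To see why the two nodes then disagree during repair, here is a minimal, hypothetical sketch (a plain {{MessageDigest}} over timestamps, not Cassandra's actual serializer or Merkle tree code) that hashes the tombstone sequence each node passes to the serializer, as traced above: node1 emits (..92, ..93) while node2 emits (..92, ..93, ..92) with the duplicate.

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Hypothetical sketch: hash the tombstone timestamps each node hands to the
// serializer. The inputs mirror the walkthrough above: node1 emits two RTs,
// node2 emits three because 1463656272792 lands in the unwritten list twice.
public class RtDigestSketch {

    static byte[] digest(long... timestamps) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (long ts : timestamps)
            for (int i = 7; i >= 0; i--)
                md.update((byte) (ts >>> (8 * i)));
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] node1 = digest(1463656272792L, 1463656272793L);
        byte[] node2 = digest(1463656272792L, 1463656272793L, 1463656272792L);
        // Logically identical deletions, but different serializer input:
        // the digests differ, so the Merkle trees disagree and repair
        // flags the partition as out of sync.
        System.out.println(Arrays.equals(node1, node2)); // false
    }
}
```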
> MerkleTree mismatch when multiple range tombstones exists for the same
> partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 11349-2.1-v2.patch, 11349-2.1-v3.patch, 11349-2.1.patch
>
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.
> After investigation, it appears that, if two range tombstones exist for a partition for the same range/interval, they're both included in the merkle tree computation.
> But, if for some reason, on another node, the two range tombstones were already compacted into a single range tombstone, this will result in a merkle tree difference.
> Currently, this is clearly bad because MerkleTree differences are dependent on compactions (and if a partition is deleted and created multiple times, the only way to ensure that repair "works correctly"/"doesn't overstream data" is to major compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair, accumulating many small SSTables (up to thousands for a rather short period of time when using VNodes, the time for compaction to absorb those small files), but also an increased size on disk.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
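The root problem the issue describes is that the digest depends on how many physical tombstone fragments a node happens to hold, i.e. on its compaction state. A small, hypothetical sketch (not a proposed patch, and the range labels and timestamps are made up for illustration) of the invariant repair needs: if each node collapsed its tombstones to at most one per range, keeping the newest timestamp (which is what compaction eventually does anyway), both sides would hash the same data regardless of compaction history.

```java
import java.util.*;

// Hypothetical sketch: normalize tombstones to one entry per range, keeping
// only the newest deletion timestamp, before anything is fed to a digest.
// An uncompacted node and an already-compacted node then agree.
public class RtNormalizeSketch {

    // range label -> newest deletion timestamp
    static Map<String, Long> normalize(List<String> ranges, List<Long> timestamps) {
        Map<String, Long> out = new TreeMap<>();
        for (int i = 0; i < ranges.size(); i++)
            out.merge(ranges.get(i), timestamps.get(i), Math::max);
        return out;
    }

    public static void main(String[] args) {
        // node A: two tombstones for the same range, not yet compacted
        Map<String, Long> nodeA = normalize(
            Arrays.asList("a:b", "a:b"), Arrays.asList(100L, 200L));
        // node B: already compacted down to the single newest tombstone
        Map<String, Long> nodeB = normalize(
            Arrays.asList("a:b"), Arrays.asList(200L));
        // Identical normalized views -> identical digests, no false mismatch.
        System.out.println(nodeA.equals(nodeB)); // true
    }
}
```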