[ https://issues.apache.org/jira/browse/CASSANDRA-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Berenguer Blasi updated CASSANDRA-18118:
----------------------------------------
    Description: 
This [Epoch|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/EncodingStats.java#L48] can [leak|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/Memtable.java#L392], affecting all timestamp-based logic. It has been observed in a production environment that it can, for example, prevent proper sstable and tombstone cleanup.

To reproduce, create the following table:
{noformat}
drop keyspace test;
create keyspace test WITH replication = {'class':'SimpleStrategy', 
'replication_factor' : 1};
CREATE TABLE test.test (
    key text PRIMARY KEY,
    id text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32', 'min_threshold': '2', 'tombstone_compaction_interval': 
'3000', 'tombstone_threshold': '0.1', 'unchecked_tombstone_compaction': 'true'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 10
    AND gc_grace_seconds = 10
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

CREATE INDEX id_idx ON test.test (id);
{noformat}

And stress-load it with:
{noformat}
insert into test.test (key,id) values('$RANDOM_UUID $RANDOM_UUID', 
'eaca36a1-45f1-469c-a3f6-3ba54220363f') USING TTL 10
{noformat}
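
The {{$RANDOM_UUID}} placeholders simply stand for fresh random values on every statement; any client that issues this INSERT in a tight loop will do. For illustration, here is a minimal load-generator sketch (not part of the original report) assuming the DataStax Java driver 3.x, a single local node, and an arbitrary 2-minute run; the class and variable names are made up:
{noformat}
// Minimal load-generator sketch; any client issuing the same INSERT in a loop works just as well.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class TtlLoad
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try
        {
            Session session = cluster.connect();
            // Same statement as above: every row carries a 10s TTL.
            PreparedStatement insert = session.prepare(
                "INSERT INTO test.test (key, id) VALUES (?, 'eaca36a1-45f1-469c-a3f6-3ba54220363f') USING TTL 10");

            // Run the load for a couple of minutes, as in the repro steps below.
            long end = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(2);
            while (System.currentTimeMillis() < end)
            {
                // Random keys stand in for the $RANDOM_UUID placeholders.
                session.execute(insert.bind(UUID.randomUUID() + " " + UUID.randomUUID()));
            }
        }
        finally
        {
            cluster.close();
        }
    }
}
{noformat}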

Notice that all inserts carry a 10s TTL, and that the table's default TTL and gc_grace_seconds are also set to 10s. This is only to speed up the repro:
- Run the load for a couple of minutes and track sstable disk usage. You will see it only ever increases: nothing gets cleaned up and it does not stop growing (note this is well past the 10s gc_grace and TTL).
- Running a flush and a compaction against the keyspace, table or index while under load does not solve the issue (see the example nodetool invocations after this list).
- Stopping the load and running a compaction does not solve the issue either; flushing does, though.
- In the original observation, where the TTL was around 600s and gc_grace around 1800s, GBs of sstables accumulated that were not cleaned up or compacted away after hours of work.
- The problem can also be reproduced on plain sstables by repeatedly inserting/deleting/overwriting the same values, without 2i indices or TTLs being involved.
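
For reference, the flush and compaction runs mentioned above were plain nodetool invocations along these lines (keyspace and table names taken from the schema above; adjust as needed):
{noformat}
nodetool flush test test
nodetool compact test test
{noformat}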

The problem seems to be that [EncodingStats|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/EncodingStats.java#L48] uses a synthetic Epoch in 2015, chosen so that it plays nicely with vint serialization. Unfortunately, {{Memtable}} uses that value to track its {{minTimestamp}}, which can leak the 2015 Epoch. This confuses any logic consuming that timestamp; in this particular case, purging and fully expired sstables were not properly detected.
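
To make the failure mode concrete, below is a simplified, self-contained sketch (not the actual Cassandra code) of how a leaked 2015 {{minTimestamp}} can defeat fully expired sstable detection. It assumes the check is roughly "an expired sstable may only be dropped if its newest data is older than the oldest data in every overlapping source"; the {{fullyExpired}} helper and the epoch constant are illustrative approximations:
{noformat}
// Simplified illustration only (not the actual Cassandra code). If an
// effectively empty memtable reports the 2015 EncodingStats epoch as its
// minTimestamp, an expired sstable never looks "older than everything that
// overlaps it" and is therefore never dropped.
import java.util.concurrent.TimeUnit;

public class EpochLeakSketch
{
    // Roughly the synthetic 2015 epoch, in microseconds (see the EncodingStats
    // link above for the exact constant).
    static final long SYNTHETIC_EPOCH_MICROS = 1442880000000000L;

    // Illustrative stand-in for the fully-expired check: a candidate sstable is
    // droppable only if its newest cell is older than everything overlapping it.
    static boolean fullyExpired(long candidateMaxTimestamp, long overlappingMinTimestamp)
    {
        return candidateMaxTimestamp < overlappingMinTimestamp;
    }

    public static void main(String[] args)
    {
        long nowMicros = TimeUnit.MILLISECONDS.toMicros(System.currentTimeMillis());

        long candidateMax = nowMicros - TimeUnit.SECONDS.toMicros(60); // data written a minute ago, already expired
        long realMemtableMin = nowMicros;                              // what the memtable really contains
        long leakedMemtableMin = SYNTHETIC_EPOCH_MICROS;               // what it reports when the epoch leaks

        System.out.println(fullyExpired(candidateMax, realMemtableMin));   // true: sstable can be dropped
        System.out.println(fullyExpired(candidateMax, leakedMemtableMin)); // false: kept forever
    }
}
{noformat}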


  was:
This Epoch can leak affecting all the timestamps logic.  It has been observed 
in a production env it can i.e. prevent proper sstable and tombstone cleanup.




> Do not leak 2015 memtable synthetic Epoch
> -----------------------------------------
>
>                 Key: CASSANDRA-18118
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18118
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Memtable
>            Reporter: Berenguer Blasi
>            Assignee: Berenguer Blasi
>            Priority: Normal
>



