[ https://issues.apache.org/jira/browse/CASSANDRA-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656976#comment-17656976 ]
Caleb Rackliffe commented on CASSANDRA-18118:
---------------------------------------------

3.11 PR looks good, modulo one minor nit.

> Do not leak 2015 memtable synthetic Epoch
> -----------------------------------------
>
>                 Key: CASSANDRA-18118
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18118
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Memtable
>            Reporter: Berenguer Blasi
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x
>
>
> This
> [Epoch|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/EncodingStats.java#L48]
> can
> [leak|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/Memtable.java#L392],
> affecting all timestamp logic. It has been observed in a production
> environment that it can, e.g., prevent proper sstable and tombstone cleanup.
> To reproduce, create the following table:
> {noformat}
> drop keyspace if exists test;
> create keyspace test WITH replication = {'class':'SimpleStrategy',
>     'replication_factor' : 1};
> CREATE TABLE test.test (
>     key text PRIMARY KEY,
>     id text
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class':
>         'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>         'max_threshold': '32', 'min_threshold': '2',
>         'tombstone_compaction_interval': '3000', 'tombstone_threshold': '0.1',
>         'unchecked_tombstone_compaction': 'true'}
>     AND compression = {'chunk_length_in_kb': '64', 'class':
>         'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 10
>     AND gc_grace_seconds = 10
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
> CREATE INDEX id_idx ON test.test (id);
> {noformat}
> And stress load it with:
> {noformat}
> insert into test.test (key,id) values('$RANDOM_UUID $RANDOM_UUID',
>     'eaca36a1-45f1-469c-a3f6-3ba54220363f') USING TTL 10
> {noformat}
> Notice that all inserts have a 10s TTL, the table's default TTL is 10s, and
> gc_grace is also 10s. This is to speed up the repro:
> - Run the load for a couple of minutes and track sstable disk usage. You
> will see it only increases: nothing gets cleaned up and it doesn't stop
> growing (notice all this is well past the 10s gc_grace and TTL).
> - Running a flush and a compaction while under load against the keyspace,
> table or index doesn't solve the issue.
> - Stopping the load and running a compaction doesn't solve the issue.
> Flushing does, though.
> - In the original observation, where TTL was around 600s and gc_grace around
> 1800s, we could get GBs of sstables that weren't cleaned up or compacted
> away after hours of work.
> - Reproduction can also happen on plain sstables by repeatedly
> inserting/deleting/overwriting the same values over and over again, without
> secondary (2i) indexes or TTL being involved.
> The problem seems to be
> [EncodingStats|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/EncodingStats.java#L48]
> using a synthetic Epoch in 2015, which plays nicely with vint serialization.
> Unfortunately, {{Memtable}} is using that to keep track of the
> {{minTimestamp}}, which can leak the 2015 Epoch. This confuses any logic
> consuming that timestamp. In this particular case, purgeable data and fully
> expired sstables weren't properly detected.
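
For readers skimming the thread, the failure mode described in the issue can be sketched in a few lines of Java. This is a simplified illustration, not the Cassandra code and not the patch under review: the class EpochLeakSketch and all of its methods are hypothetical, the exact epoch date is an assumption (the issue only says it falls in 2015), and the expiry rule is an assumed stand-in for "any logic consuming that timestamp".

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical, simplified model of the leak; not taken from the Cassandra code base.
final class EpochLeakSketch {
    // Stand-in for the synthetic 2015 epoch, in microseconds.
    // The exact date is an assumption made for illustration.
    static final long SYNTHETIC_EPOCH_MICROS = micros(2015, 9, 22);

    // Midnight UTC of the given date, expressed in microseconds since 1970.
    static long micros(int year, int month, int day) {
        return LocalDate.of(year, month, day)
                        .atStartOfDay(ZoneOffset.UTC)
                        .toInstant()
                        .toEpochMilli() * 1000L;
    }

    // Naive tracking: every reported minimum is folded in, including the
    // "no stats yet" placeholder, so the 2015 value wins over any later write.
    static long naiveMin(long current, long reported) {
        return Math.min(current, reported);
    }

    // Guarded tracking: the placeholder is skipped, so only real timestamps count.
    static long guardedMin(long current, long reported) {
        return reported == SYNTHETIC_EPOCH_MICROS ? current : Math.min(current, reported);
    }

    // Assumed expiry rule: an sstable can be dropped only if its newest write
    // is older than the oldest write still held in the memtable.
    static boolean looksFullyExpired(long sstableMaxTimestamp, long memtableMinTimestamp) {
        return sstableMaxTimestamp < memtableMinTimestamp;
    }

    public static void main(String[] args) {
        long sstableMax = micros(2023, 1, 9);     // newest write in an old sstable
        long memtableWrite = micros(2023, 1, 10); // only real write in the memtable

        long naive = naiveMin(naiveMin(Long.MAX_VALUE, SYNTHETIC_EPOCH_MICROS), memtableWrite);
        long guarded = guardedMin(guardedMin(Long.MAX_VALUE, SYNTHETIC_EPOCH_MICROS), memtableWrite);

        System.out.println("naive:   expired=" + looksFullyExpired(sstableMax, naive));
        System.out.println("guarded: expired=" + looksFullyExpired(sstableMax, guarded));
    }
}
```

With naive tracking the memtable appears to hold data from 2015, so under the assumed rule no sstable ever looks old enough to drop. That would be consistent with the observation above that flushing (which discards the memtable and its tracked minimum) makes cleanup work again. One design caveat of the guard as sketched: a real write carrying exactly the placeholder timestamp would also be skipped.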