[ https://issues.apache.org/jira/browse/CASSANDRA-9572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antti Nissinen updated CASSANDRA-9572:
--------------------------------------
    Comment: was deleted

(was: I tried to make a simplified example of the issue:

z = expired data
x = valid data

1 zzzzzz
2 zzzzzz
3 zzzzzz
4 zzzzzz
5 zzzxxx
6 xxxxxx
7 xxxxxx
8 xxxxxx

The first round of getFullyExpiredSSTables selects files 1-3. File 4 is discarded because it overlaps with one of the non-fully expired files (file 5).
Let's assume that the most interesting bucket is files 7-8.
Files 1-3 and 7-8 are fed to the CompactionTask.
In the second call to getFullyExpiredSSTables (w/o arguments) the overlapping files include file 4. File 3 is dropped because it overlaps with file 4.
getFullyExpiredSSTables returns files 1,2.
The variable actuallyCompact is the difference between set 1,2,3,7,8 and set 1,2.
Compaction is done for files 3,7,8, and in the end all files (1,2,3,7,8) are deleted from the disk.

The problem is that file 3 ends up in the compaction (it should be dropped) and the start time of the new file (number 9) is not the start time of file 7. Currently the start time of the new SSTable is the same as the start time of file 4.
)

> DateTieredCompactionStrategy fails to combine SSTables correctly when TTL is
> used.
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9572
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9572
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Antti Nissinen
>            Assignee: Marcus Eriksson
>              Labels: dtcs
>             Fix For: 2.1.x
>
>         Attachments: cassandra_sstable_metadata_reader.py,
> cassandra_sstable_timespan_graph.py, compaction_stage_test01_jira.log,
> compaction_stage_test02_jira.log, datagen.py, explanation_jira.txt,
> motivation_jira.txt, src_2.1.5_with_debug.zip
>
>
> DateTieredCompaction works correctly when data is dumped for a certain time
> period into short SSTables in time order and then compacted together.
> However, if TTL is applied to the data columns, DTCS fails to compact files
> correctly in a timely manner. In our opinion the problem is caused by two
> issues:
> A) During the DateTieredCompaction process, getFullyExpiredSSTables is
> called twice: first from the DateTieredCompactionStrategy class and a second
> time from the CompactionTask class. The first call aims to find
> fully expired SSTables that do not overlap with any non-fully
> expired SSTables. That works correctly. When getFullyExpiredSSTables is
> called the second time, from the CompactionTask class, the selection of fully
> expired SSTables differs from the first selection.
> B) The minimum timestamp of the new SSTable created by combining
> fully expired SSTables and files from the most interesting bucket is not
> correct.
> These two issues together cause problems for the DTCS process when it
> combines SSTables that overlap in time and have a TTL on the column.
> This is demonstrated by first generating test data without compactions and
> showing the distribution of files over time. When compaction is enabled,
> DTCS combines files together, but the end result is not what would be
> expected.
> This is demonstrated in the file motivation_jira.txt
> Attachments contain the following material:
> - motivation_jira.txt: practical examples of how DTCS behaves with TTL
> - explanation_jira.txt: gives more details, explains test cases and
> demonstrates the problems in the compaction process
> - Log file for the compactions in the first test case
> (compaction_stage_test01_jira.log)
> - Log file for the compactions in the second test case
> (compaction_stage_test02_jira.log)
> - Source code zip file for version 2.1.5 with additional comment statements
> (src_2.1.5_with_debug.zip)
> - Python script to generate test data (datagen.py)
> - Python script to read metadata from SSTables
> (cassandra_sstable_metadata_reader.py)
> - Python script to generate a timeline representation of SSTables
> (cassandra_sstable_timespan_graph.py)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
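
The two selection rounds from the simplified example in the deleted comment (files 1-8) can be sketched in Python as a toy model. This is not Cassandra's implementation: the `SSTable` class, the `droppable` helper, the time ranges, and the simple interval-overlap rule are all assumptions made for illustration. In particular, treating file 4 as a blocker in the second round follows the comment's description and has not been checked against the Cassandra source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    # Hypothetical stand-in for an SSTable's metadata.
    number: int
    min_ts: int
    max_ts: int
    fully_expired: bool


def overlaps(a, b):
    """True if the two time ranges share at least one timestamp."""
    return a.min_ts <= b.max_ts and b.min_ts <= a.max_ts


def droppable(candidates, blockers):
    """Fully expired candidates that overlap none of the blockers."""
    return {
        c.number
        for c in candidates
        if c.fully_expired and not any(overlaps(c, b) for b in blockers)
    }


# Assumed layout: file n covers [10n, 10n + 12], so it overlaps only its
# direct neighbours. Files 1-4 are treated as fully expired (all "z").
tables = {n: SSTable(n, 10 * n, 10 * n + 12, n <= 4) for n in range(1, 9)}

# Round 1 (strategy): blockers are all non-fully expired files.
live = [t for t in tables.values() if not t.fully_expired]
first_round = droppable(tables.values(), live)

# The most interesting bucket is assumed to be files 7-8.
task_input = first_round | {7, 8}

# Round 2 (task): blockers are files outside the task that overlap it.
compacting = [tables[n] for n in task_input]
outside = [
    t
    for t in tables.values()
    if t.number not in task_input and any(overlaps(t, c) for c in compacting)
]
second_round = droppable(compacting, outside)

# Files that are rewritten instead of simply being dropped.
actually_compact = task_input - second_round

print(sorted(first_round))       # [1, 2, 3]
print(sorted(second_round))      # [1, 2]
print(sorted(actually_compact))  # [3, 7, 8]
```

Under these assumptions the model reproduces the outcome described in the comment: file 3 is selected as droppable in the first round but ends up in the compaction together with files 7 and 8.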