[ https://issues.apache.org/jira/browse/CASSANDRA-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904458#comment-14904458 ]

Antti Nissinen commented on CASSANDRA-10280:
--------------------------------------------

I am also in favor of discarding max_sstable_age_days and limiting the 
compaction window size in DTCS. If DTCS is going to receive major 
modifications, then adopting some of the ideas from TWCS would be beneficial, 
as would taking into account the practical viewpoints presented in several 
Jira items:

- limiting the window size in DTCS (this item, 
[CASSANDRA-10280|https://issues.apache.org/jira/browse/CASSANDRA-10280])

- using STCS in the newest window, or whenever the number of files exceeds 
max_threshold 
([CASSANDRA-10276|https://issues.apache.org/jira/browse/CASSANDRA-10276], 
[CASSANDRA-9666|https://issues.apache.org/jira/browse/CASSANDRA-9666])

- when compacting a large number of files, start with the small ones and 
progress towards larger ones (especially in the case of small SSTables 
originating from repair operations) 
[CASSANDRA-9597|https://issues.apache.org/jira/browse/CASSANDRA-9597]

- setting a limit on the number of files compacted in one shot based on the 
sum of their sizes (to avoid compacting several large files at once and 
running out of disk space during the operation) 
[CASSANDRA-10195|https://issues.apache.org/jira/browse/CASSANDRA-10195]

- a round-robin approach for selecting the compaction window in which the next 
compaction will be executed, the goal being to get rid of small files as soon 
as possible; at the moment TWCS and DTCS work on the newest window and 
progress towards history once finished with the current one (see the sketch 
after this list) 
[CASSANDRA-10195|https://issues.apache.org/jira/browse/CASSANDRA-10195]
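
To make the smallest-first, size-capped and round-robin ideas above a bit more 
concrete, here is a rough sketch in plain Java. The types and thresholds 
(SSTable, MAX_THRESHOLD, MAX_COMPACTION_BYTES) are hypothetical, not 
Cassandra's actual compaction API; it only illustrates how the three rules 
could combine in a single candidate-selection pass:

{code:java}
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch, not Cassandra's real compaction API. "SSTable" here is
// just a (name, window, size) tuple for illustrating candidate selection.
class CandidateSelectionSketch {
    record SSTable(String name, long window, long sizeBytes) {}

    static final int MAX_THRESHOLD = 32;                 // max files per compaction
    static final long MAX_COMPACTION_BYTES = 50L << 30;  // assumed 50 GiB size cap

    private int cursor = 0;                              // round-robin position

    List<SSTable> nextCandidates(Collection<SSTable> live) {
        // Group live sstables by their time window.
        Map<Long, List<SSTable>> byWindow = live.stream()
                .collect(Collectors.groupingBy(SSTable::window));
        List<Long> windows = new ArrayList<>(byWindow.keySet());
        Collections.sort(windows);

        // Visit windows round-robin instead of always starting from the newest,
        // so small files in older windows are also cleaned up promptly.
        for (int i = 0; i < windows.size(); i++) {
            long w = windows.get((cursor + i) % windows.size());
            List<SSTable> inWindow = new ArrayList<>(byWindow.get(w));
            if (inWindow.size() < 2) continue;           // nothing to compact here

            // Start from the smallest files and progress towards larger ones.
            inWindow.sort(Comparator.comparingLong(SSTable::sizeBytes));

            // Cap both the file count and the summed size so that we never pick
            // several large files at once and run out of disk mid-compaction.
            List<SSTable> picked = new ArrayList<>();
            long total = 0;
            for (SSTable t : inWindow) {
                if (picked.size() == MAX_THRESHOLD) break;
                if (!picked.isEmpty() && total + t.sizeBytes() > MAX_COMPACTION_BYTES) break;
                picked.add(t);
                total += t.sizeBytes();
            }
            if (picked.size() >= 2) {
                cursor = (cursor + i + 1) % windows.size();  // continue from the next window
                return picked;
            }
        }
        return List.of();                                // nothing eligible right now
    }
}
{code}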

Should we actually create a Jira item where we could collect the ideas for the 
"ultimate time series compaction strategy" for more detailed discussion? At 
the moment these ideas are scattered across different items, and the above 
list is probably missing many relevant points.

Another important goal (our wish) for a time series database is to be able to 
wipe out data efficiently so that disk space is released as soon as possible. 
I tried to describe those ideas in 
[CASSANDRA-10306|https://issues.apache.org/jira/browse/CASSANDRA-10306], but 
there are no comments on that item yet. The main idea was to have the 
possibility to split SSTables along a certain timeline on all nodes so that 
SSTables could be dropped (as with TTL in DTCS and TWCS) or archived on 
different media, from which they can be dug up some day if really needed. 
Deleting data efficiently on demand is presently one of the biggest obstacles 
to using C* in closed environments with fairly limited hardware resources for 
time series data collection. TTL is a working solution when you can predict 
data collection demands well beforehand and have additional resources 
available if the predictions don't match reality. 

What are the biggest obstacles in the present architecture to the scenario 
below?
- Decide on a timestamp for the data deletion / archiving.
- All existing SSTables on each node would be split into two files along that 
timeline if the SSTable covers data on both sides of it (a rough sketch of 
this splitting step follows the list).
- SSTables falling behind the timeline would be deactivated from the SSTable 
set (no longer participating in compactions or serving data for queries).
- You can then decide whether to copy those files somewhere else or simply 
delete them.
- This tool could be used through nodetool with an external script.
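
To illustrate the splitting step of that scenario, here is a deliberately 
simplified sketch (hypothetical types again; real SSTables are immutable 
on-disk structures that would have to be rewritten with Cassandra's internal 
writers, so this only models the routing decision):

{code:java}
import java.util.*;
import java.util.stream.*;

// Conceptual sketch only: models an sstable as an in-memory list of cells so
// the "split along a timeline" idea can be shown without on-disk details.
class SplitAlongTimelineSketch {
    record Cell(String partitionKey, long writeTimeMicros, byte[] value) {}
    record SplitResult(List<Cell> behindCutoff, List<Cell> fromCutoffOn) {}

    // SSTables that lie entirely behind the cutoff need no splitting; they can
    // be deactivated and then archived or deleted as whole files.
    static boolean entirelyBehindCutoff(List<Cell> cells, long cutoffMicros) {
        return cells.stream().allMatch(c -> c.writeTimeMicros() < cutoffMicros);
    }

    // For an sstable that straddles the cutoff, route everything written before
    // the cutoff into one output (to archive or drop) and keep the rest live.
    static SplitResult split(List<Cell> cells, long cutoffMicros) {
        Map<Boolean, List<Cell>> parts = cells.stream()
                .collect(Collectors.partitioningBy(
                        c -> c.writeTimeMicros() < cutoffMicros));
        return new SplitResult(parts.get(true), parts.get(false));
    }
}
{code}

The operationally interesting part is the deactivation step: once split, the 
behind-cutoff files never participate in compactions or reads again, so they 
can be moved off the data directory or deleted without touching the live set.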

> Make DTCS work well with old data
> ---------------------------------
>
>                 Key: CASSANDRA-10280
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10280
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>             Fix For: 3.x, 2.1.x, 2.2.x
>
>
> Operational tasks become incredibly expensive if you keep around a long 
> timespan of data with DTCS - with default settings and 1 year of data, the 
> oldest window covers about 180 days. Bootstrapping a node with vnodes with 
> this data layout will force Cassandra to compact a very large number of 
> sstables in this window.
> We should probably put a cap on how big the biggest windows can get. We 
> could probably default this to something sane based on max_sstable_age 
> (i.e., say we can reasonably handle 1000 sstables per node, then we can 
> calculate how big the windows should be to allow that).
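
For a feel of the numbers in that last paragraph, here is my back-of-the-envelope 
arithmetic (assuming a 1-hour base window and min_threshold = 4; DTCS's real 
window placement is more involved, this only shows the geometric growth and 
the effect of a cap):

{code:java}
import java.util.concurrent.TimeUnit;

// Back-of-the-envelope arithmetic with assumed parameters (1-hour base window,
// tiers growing by min_threshold = 4); not Cassandra code.
class DtcsWindowArithmetic {
    public static void main(String[] args) {
        long base = 1;                                  // base window, in hours
        int factor = 4;                                 // growth per tier
        long dataSpan = TimeUnit.DAYS.toHours(365);     // one year of data

        // Tier sizes: 1h, 4h, 16h, ... the last tier that fits fully is the
        // oldest "regular" window for a year of data.
        long window = base, covered = 0, oldest = base;
        while (covered + window <= dataSpan) {
            oldest = window;
            covered += window;
            window *= factor;
        }
        System.out.printf("oldest full window: %d h (~%d days)%n",
                oldest, oldest / 24);                   // 4096 h, ~170 days

        // With a cap on the window size (say 30 days) the oldest windows stay
        // bounded, at the cost of a larger but predictable number of windows.
        long cap = TimeUnit.DAYS.toHours(30);
        long w = base, c = 0;
        int count = 0;
        while (c < dataSpan) {
            w = Math.min(w, cap);
            c += w;
            count++;
            if (w < cap) w *= factor;
        }
        System.out.printf("with a 30-day cap: %d windows cover the year%n", count);
    }
}
{code}

This reproduces the roughly 180-day figure from the description, and with a 
cap the number of windows grows linearly with retention instead of the oldest 
window growing without bound.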



