Scott Carey created CASSANDRA-16769:
---------------------------------------

             Summary: Add an option to nodetool garbagecollect that collects 
only a fraction of the data
                 Key: CASSANDRA-16769
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16769
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Scott Carey
            Assignee: Scott Carey


{{'nodetool garbagecollect' can currently only run across an entire table.}}

{{For a very large table, in many use cases, the SSTables most likely to be 
full of 'garbage' are the oldest ones.  With both LCS and STCS, the SSTables 
with the lowest generation number are, under normal operation, going to hold 
the majority of the data that is masked by a tombstone or overwritten.}}

{{In order to make 'nodetool garbagecollect' more useful for such large tables, 
I propose that we add an option `--oldest-fraction` that takes a floating point 
value between 0.00 and 1.00, and only runs 'garbagecollect' over the oldest 
SSTables that cover at least that fraction of data.}}
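
{{A minimal sketch of the selection rule described above, not an actual implementation. It assumes "oldest" means lowest generation number and that "fraction of data" is measured by on-disk size. The names SSTableInfo and select_oldest_fraction are hypothetical and do not exist in Cassandra.}}

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SSTableInfo:
    """Hypothetical stand-in for an SSTable's metadata."""
    generation: int   # assumption: a lower generation number means an older SSTable
    size_bytes: int   # assumption: "fraction of data" is measured by on-disk size


def select_oldest_fraction(sstables: List[SSTableInfo], fraction: float) -> List[SSTableInfo]:
    """Return the oldest SSTables whose combined size covers at least `fraction` of the total."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")

    total = sum(s.size_bytes for s in sstables)
    target = fraction * total

    selected: List[SSTableInfo] = []
    covered = 0
    # Walk from oldest to newest and stop once the target is covered.
    for sstable in sorted(sstables, key=lambda s: s.generation):
        if covered >= target:
            break
        selected.append(sstable)
        covered += sstable.size_bytes
    return selected


if __name__ == "__main__":
    tables = [SSTableInfo(5, 100), SSTableInfo(1, 300), SSTableInfo(3, 200), SSTableInfo(9, 400)]
    print([s.generation for s in select_oldest_fraction(tables, 0.5)])
```

{{Because whole SSTables are selected, the covered fraction can overshoot the requested one; that matches the "at least that fraction" wording.}}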

{{This would mean, for instance, that if you ran this with `--oldest-fraction 
0.1` every week, no SSTable would be more than 10 weeks old, and no data that 
was originally written more than 10 weeks ago and has since been overwritten, 
TTL'd, or deleted would remain.}}
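
{{The 10-week bound can be checked with a toy simulation. This is a rough sketch under strong simplifying assumptions: all SSTables are the same size, no new data arrives, normal compaction is ignored, and an SSTable rewritten by garbagecollect counts as newly written. The function simulate_weekly_runs is hypothetical.}}

```python
import math
from typing import List


def simulate_weekly_runs(num_sstables: int, fraction: float, weeks: int) -> List[int]:
    """Return the age in weeks of the oldest SSTable just before each weekly run.

    Toy model: equal-size SSTables, no new writes, no regular compaction.
    """
    last_written = [0] * num_sstables  # week in which each SSTable was last rewritten
    # Number of SSTables needed to cover at least `fraction` of the (equal-size) data.
    # The small epsilon guards against floating-point noise in the product.
    per_run = math.ceil(fraction * num_sstables - 1e-9)

    oldest_age_before_run = []
    for week in range(1, weeks + 1):
        oldest_age_before_run.append(week - min(last_written))
        # Rewrite the oldest SSTables; in this model they become the newest.
        oldest_first = sorted(range(num_sstables), key=lambda i: last_written[i])
        for i in oldest_first[:per_run]:
            last_written[i] = week
    return oldest_age_before_run


if __name__ == "__main__":
    ages = simulate_weekly_runs(num_sstables=20, fraction=0.1, weeks=30)
    print(max(ages))
```

{{In this model the oldest SSTable is never more than 10 weeks old when a run starts. A real cluster with ongoing writes and uneven SSTable sizes would only approximate this.}}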

{{In my use case, the oldest LCS SSTable is about 20 months old if the table 
operates in steady-state on Cassandra 3.11.x, but only 5% of the data in 
SSTables of that age has not been overwritten.  This breaks some of the 
performance promise of LCS – if your last level is 50% filled with overwritten 
data, then your chance of finding data _only_ in that level is significantly 
less than advertised.}}

{{'nodetool compact' is extremely expensive, and is not currently conducive to 
any sort of incremental operation.  But 'nodetool garbagecollect' run on a 
fraction of the oldest data would be.}}



