Does anyone have a good explanation or pointers to docs for understanding
how Cassandra decides to remove SSTables from disk?

After performing a large set of deletes on our cluster, a few hundred
gigabytes' worth (essentially cleaning out nearly all of the old data), we
noticed that nodetool reported about the same load as before.

My understanding was that running a repair would trigger compactions
between SSTable files, and that reference counting on the subsequent restart
of Cassandra on a node would clear out the old files, but this did not
appear to happen. The load did not start going down until we were writing
to the cluster again.
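
For what it's worth, rather than relying on nodetool's load figure, I've been
eyeballing the SSTable files on disk directly with a rough script along these
lines (the data directory path and keyspace/CF names are just placeholders
for ours), to see which files actually disappear across a
repair/compaction/restart cycle:

    import os
    import time

    # Rough sketch: list the *-Data.db files for one column family so I can
    # see which SSTables survive a repair/compaction/restart cycle.
    # The path below is only a placeholder for our real data directory.
    DATA_DIR = '/var/lib/cassandra/data/my_keyspace/my_cf'

    for name in sorted(os.listdir(DATA_DIR)):
        if name.endswith('-Data.db'):
            st = os.stat(os.path.join(DATA_DIR, name))
            mtime = time.strftime('%Y-%m-%d %H:%M', time.localtime(st.st_mtime))
            print('%s  %12d bytes  last modified %s' % (name, st.st_size, mtime))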

I suspect that a few values are still hanging around in the old SSTables, so
the references stayed intact. Can anyone confirm this?

The fact that the SSTables aren't removed isn't an issue in itself; I'd just
like to understand the process better.

What's a bit more important for me is being able to accurately report the
size of the "active" data set, since nodetool's load figure doesn't seem to
be useful for this. I use counters to report some of it, but is there a
single source of truth, especially since counters do occasionally miss
updates?
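
The closest thing I've come up with is scraping the live vs. total space
figures out of nodetool cfstats, roughly like the sketch below (the parsing
is naive, and the labels are just what I see in the 1.2 output). Is that a
reasonable source of truth, or is there something better?

    import subprocess

    # Rough sketch: scrape `nodetool cfstats` and report live vs. total disk
    # space per column family, to get a feel for how much dead SSTable data
    # is still hanging around. Labels match the 1.2 cfstats output here.
    out = subprocess.check_output(['nodetool', 'cfstats']).decode()

    keyspace = cf = live = None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith('Keyspace:'):
            keyspace = line.split(':', 1)[1].strip()
        elif line.startswith('Column Family:'):
            cf = line.split(':', 1)[1].strip()
        elif line.startswith('Space used (live):'):
            live = int(line.split(':', 1)[1])
        elif line.startswith('Space used (total):'):
            total = int(line.split(':', 1)[1])
            print('%s.%s: live=%d total=%d dead=%d'
                  % (keyspace, cf, live, total, total - live))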


Cassandra setup:
1.2 w/ vnodes, using LeveledCompactionStrategy with 128 MB SSTables.

Thanks,
Jasdeep
