On Fri, Oct 25, 2013 at 1:10 PM, Jasdeep Hundal <dsjas...@gmail.com> wrote:

>
> After performing a large set of deletes on our cluster, a few hundred
> gigabytes' worth (essentially cleaning out nearly all old data), we noticed
> that nodetool reported about the same load as before.
>

Tombstones are purgeable only after gc_grace_seconds has elapsed, and only
if all SSTables that contain fragments of the deleted row are involved in
the same compaction.
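
If the goal is to reclaim that space sooner, here's a minimal sketch of one
way to do it; the keyspace "ks" and column family "cf" are placeholders, and
lowering gc_grace_seconds is only safe if you run repair more often than the
new value:

    # Optionally shorten gc_grace_seconds so tombstones become purgeable
    # sooner (placeholder keyspace/table; only safe if repair runs more
    # frequently than the new value, or deleted data can resurrect).
    echo "ALTER TABLE ks.cf WITH gc_grace_seconds = 3600;" | cqlsh

    # Force a major compaction so every SSTable holding fragments of the
    # deleted rows participates in the same compaction. With size-tiered
    # compaction this produces one large SSTable, which has its own costs.
    nodetool compact ks cf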


> My understanding was that running a repair should have triggered compactions
> between SSTable files, and that reference counting on the subsequent restart
> of Cassandra on a node should have cleared the old files, but this did not
> appear to happen. The load did not start going down until we were writing
> to the cluster again.
>

Repair is unrelated to minor compaction, except in that it creates new
SSTables via streaming, which may trigger minor compaction.
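
If you want to watch that relationship directly, these are the sorts of
commands I'd run (keyspace name is a placeholder):

    # Repair a hypothetical keyspace "ks" on this node.
    nodetool repair ks

    # Streaming activity from repair; the streams are what create new SSTables.
    nodetool netstats

    # Any minor compactions triggered by the newly streamed SSTables.
    nodetool compactionstats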


> I suspect that there are a few values hanging around in the old tables, so
> the references stayed intact; can anyone confirm this?
>

Stop suspecting and measure with checksstablegarbage:
https://github.com/cloudian/support-tools
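
To find the on-disk SSTable data files to point it at, something like the
following should work; the path assumes the packaged default data directory
and placeholder keyspace/column family names:

    # Default data directory from the Debian/RPM packages; adjust to match
    # the data_file_directories setting in your cassandra.yaml.
    find /var/lib/cassandra/data/ks/cf -name '*-Data.db' -exec ls -lh {} \;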


> What's a bit more important for me is being able to accurately report the
> size of the "active" data set, since nodetool doesn't seem to be useful for
> this. I use counters for reporting some of this, but is there a single
> source of truth, especially since counters do occasionally miss updates?
>

Very new versions of Cassandra track, and expose metrics for, the percentage
of data in an SSTable that is expired.
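
Until you're on one of those versions, a rough sketch of what I'd look at;
I'm going from memory on exactly which releases ship sstablemetadata, so
treat the second command (and its path) as an assumption:

    # Live vs. total space per column family, as Cassandra itself accounts it.
    nodetool cfstats | grep 'Space used'

    # Per-SSTable estimate of droppable tombstones; sstablemetadata ships in
    # the tools/bin directory of recent releases (availability and the
    # keyspace/table path here are assumptions on my part).
    sstablemetadata /var/lib/cassandra/data/ks/cf/*-Data.db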

=Rob
