Thanks Rob.

Will check out the tool you linked to. In our case it's definitely not
tombstones hanging around, since we write entire rows at once and the
amount of data in a row is far, far greater than the space a tombstone
takes.

Jasdeep


On Fri, Oct 25, 2013 at 1:14 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Fri, Oct 25, 2013 at 1:10 PM, Jasdeep Hundal <dsjas...@gmail.com> wrote:
>
>>
>> After performing a large set of deletes on our cluster, a few hundred
>> gigabytes' worth (essentially cleaning out nearly all old data), we noticed
>> that nodetool reported about the same load as before.
>>
>
> Tombstones are purgeable only after gc_grace_seconds has elapsed, and only
> if all SSTables which contain fragments of that row are involved in the
> same compaction.
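>
> If you want to force the issue, one approach (a sketch; "ks" and "cf" are
> placeholder names for your keyspace and column family) is to lower
> gc_grace_seconds and then run a major compaction so that every SSTable
> participates:
>
>     -- in cqlsh: shorten the tombstone grace period to one day
>     ALTER TABLE ks.cf WITH gc_grace_seconds = 86400;
>
>     # then, from a shell: force a major compaction of that column family
>     nodetool compact ks cf
>
> The usual caveat applies: gc_grace_seconds must stay longer than your
> repair interval, or deleted data can be resurrected.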
>
>
>> My understanding was that running a repair would trigger compactions
>> between SSTable files, and that reference counting on the subsequent
>> restart of Cassandra on a node would clear the old files, but this did not
>> appear to happen. The load did not start going down until we were writing
>> to the cluster again.
>>
>
> Repair is unrelated to minor compaction, except in that it creates new
> SSTables via streaming, which may trigger minor compaction.
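>
> You can verify what is actually happening during and after a repair with
> the standard nodetool commands, e.g.:
>
>     nodetool netstats          # shows SSTables streaming in from repair
>     nodetool compactionstats   # shows pending and active compactions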
>
>
>> I suspect that there are a few values hanging around in the old tables, so
>> the references stayed intact. Can anyone confirm this?
>>
>
> Stop suspecting and measure with checksstablegarbage:
> https://github.com/cloudian/support-tools
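>
> IIRC you just point it at the -Data.db files for the column family in
> question (the path below is illustrative; check the README for exact
> usage):
>
>     checksstablegarbage /var/lib/cassandra/data/ks/cf/ks-cf-ic-1-Data.db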
>
>
>> What's a bit more important for me is being able to accurately report the
>> size of the "active" data set, since nodetool doesn't seem to be useful for
>> this. I use counters for reporting some of this, but is there a single
>> source of truth, especially since counters do occasionally miss updates?
>>
>
> In very new versions of Cassandra, there are metrics tracking what
> percentage of the data in an SSTable is expired.
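>
> You can also inspect a single SSTable offline with the sstablemetadata
> tool shipped in tools/bin (if your version includes it); its output has
> an "Estimated droppable tombstones" line, e.g. (illustrative file name):
>
>     tools/bin/sstablemetadata ks-cf-ic-1-Data.db | grep -i droppable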
>
> =Rob
>
>
