I'm still somewhat in the middle of the process, but it's far enough along
to report back.

1.) I changed GCGraceSeconds of the CF to 0 using cassandra-cli (the exact
commands are sketched below this list)
2.) I ran nodetool compact on a single node of the nine (I'll call it
"1").  It took 5-7 hours, and reduced the CF from ~450 to ~75GB (*).
3.) I ran nodetool compact on nodes 2, 3, ... while watching write/read
latency averages in OpsCenter.  I got all of the way to 9 without any ill
effect.
4.) 2->9 all completed with similar results
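
For reference, the commands were along these lines (the keyspace/CF names
here are placeholders for mine):

In cassandra-cli:
  use my_keyspace;
  update column family my_cf with gc_grace = 0;

Then, one node at a time:
  nodetool -h <node> compact my_keyspace my_cf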

(*) So, I left out one detail that changed the math (I said above I
expected to clear down to at most 50GB).  I found a small bug in my delete
code mid-last week.  Basically, it deleted all of the rows I wanted, but
due to a race condition, there was a chance I'd delete rows in the middle
of doing new inserts.  Luckily, even in this case, it wasn't "end of the
world", but I stopped the cleanup anyway and added a time check (as all of
the rows I wanted to delete were older than 30 days).  I *thought* I'd be
restarting the cleanup threads against a much smaller dataset due to all of
the deletes, but instead I saw millions & millions of empty rows (the
tombstones).  Thus the start of this "clear the tombstones" subtask to the
original goal, and the reason I didn't see a 90%+ reduction in size.

In any case, now I'm running the cleanup process again, which will be
followed by ANOTHER round of compactions, and then I'll finally set
GCGraceSeconds back to its normal value.
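
(That last step is just step 1 in reverse, e.g. in cassandra-cli, with my
CF name swapped in for the placeholder:
  update column family my_cf with gc_grace = 864000;
864000 being the 10 day default.)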

On the read/write production side, you'd never know anything happened.
 Good job on the distributed system! :-)

Thanks again,

will


On Fri, Apr 11, 2014 at 1:02 PM, Mark Reddy <mark.re...@boxever.com> wrote:

> Thats great Will, if you could update the thread with the actions you
> decide to take and the results that would be great.
>
>
> Mark
>
>
> On Fri, Apr 11, 2014 at 5:53 PM, William Oberman <ober...@civicscience.com> wrote:
>
>> I've learned a *lot* from this thread.  My thanks to all of the
>> contributors!
>>
>> Paulo: Good luck with LCS.  I wish I could help there, but all of my CFs
>> are SizeTiered (mostly as I'm on the same schema/same settings since 0.7...)
>>
>> will
>>
>>
>>
>> On Fri, Apr 11, 2014 at 12:14 PM, Mina Naguib <mina.nag...@adgear.com> wrote:
>>
>>>
>>> Levelled Compaction is a wholly different beast when it comes to
>>> tombstones.
>>>
>>> The tombstones are inserted, like any other write really, at the lower
>>> levels in the leveldb hierarchy.
>>>
>>> They are only removed after they have had the chance to "naturally"
>>> migrate upwards in the leveldb hierarchy to the highest level in your data
>>> store.  How long that takes depends on:
>>>  1. The amount of data in your store and the number of levels your LCS
>>> strategy has
>>> 2. The amount of new writes entering the bottom funnel of your leveldb,
>>> forcing upwards compaction and combining
>>>
>>> To give you an idea, I had a similar scenario and ran a (slow,
>>> throttled) delete job on my cluster around December-January.  Here's a
>>> graph of the disk space usage on one node.  Notice the still-declining
>>> usage long after the cleanup job has finished (sometime in January).  I
>>> tend to think of tombstones in LCS as little bombs that get to explode much
>>> later in time:
>>>
>>> http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg
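>>>
>>> (If you want to see how far things have migrated, on 1.2 LCS keeps its
>>> level layout in a per-CF json manifest in the data directory. The path is
>>> from memory, so adjust for your install:
>>>   cat /var/lib/cassandra/data/<keyspace>/<cf>/<cf>.json
>>> It lists which sstable generations currently sit at which level.)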
>>>
>>>
>>>
>>> On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes <
>>> paulo.mo...@chaordicsystems.com> wrote:
>>>
>>> I have a similar problem here, I deleted about 30% of a very large CF
>>> using LCS (about 80GB per node), but still my data hasn't shrunk, even
>>> though I used 1 day for gc_grace_seconds. Would nodetool scrub help? Does
>>> nodetool scrub force a minor compaction?
>>>
>>> Cheers,
>>>
>>> Paulo
>>>
>>>
>>> On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy <mark.re...@boxever.com> wrote:
>>>
>>>> Yes, running nodetool compact (major compaction) creates one large
>>>> SSTable. This will mess up the heuristics of the SizeTiered strategy (is
>>>> this the compaction strategy you are using?) leading to multiple 'small'
>>>> SSTables alongside the single large SSTable, which results in increased
>>>> read latency. You will incur the operational overhead of having to manage
>>>> compactions if you wish to compact these smaller SSTables. For all these
>>>> reasons it is generally advised to stay away from running compactions
>>>> manually.
>>>>
>>>> Assuming that this is a production environment and you want to keep
>>>> everything running as smoothly as possible I would reduce the gc_grace on
>>>> the CF, allow automatic minor compactions to kick in and then increase the
>>>> gc_grace once again after the tombstones have been removed.
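>>>>
>>>> Roughly, via cassandra-cli (keyspace/CF names and the 3600 below are just
>>>> example values):
>>>>   use my_keyspace;
>>>>   update column family my_cf with gc_grace = 3600;
>>>> then let the minor compactions run (nodetool compactionstats shows what's
>>>> in flight), and once the tombstones are gone:
>>>>   update column family my_cf with gc_grace = 864000;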
>>>>
>>>>
>>>> On Fri, Apr 11, 2014 at 3:44 PM, William Oberman <
>>>> ober...@civicscience.com> wrote:
>>>>
>>>>> So, if I was impatient and just "wanted to make this happen now", I
>>>>> could:
>>>>>
>>>>> 1.) Change GCGraceSeconds of the CF to 0
>>>>> 2.) run nodetool compact (*)
>>>>> 3.) Change GCGraceSeconds of the CF back to 10 days
>>>>>
>>>>> Since I have ~900M tombstones, even if I miss a few due to impatience,
>>>>> I don't care *that* much as I could re-run my cleanup tool against the
>>>>> now much smaller CF.
>>>>>
>>>>> (*) A long long time ago I seem to recall reading advice about "don't
>>>>> ever run nodetool compact", but I can't remember why.  Is there any bad
>>>>> long term consequence?  Short term there are several:
>>>>> -a heavy operation
>>>>> -temporary 2x disk space
>>>>> -one big SSTable afterwards
>>>>> But moving forward, everything is ok right?
>>>>> CommitLog/MemTable->SSTables, minor compactions that merge SSTables,
>>>>> etc...  The only flaw I can think of is it will take forever until the
>>>>> SSTable minor compactions build up enough to consider including the big
>>>>> SSTable in a compaction, making it likely I'll have to self manage
>>>>> compactions.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy 
>>>>> <mark.re...@boxever.com> wrote:
>>>>>
>>>>>> Correct, a tombstone will only be removed after the gc_grace period has
>>>>>> elapsed. The default value is set to 10 days which allows a great deal of
>>>>>> time for consistency to be achieved prior to deletion. If you are
>>>>>> operationally confident that you can achieve consistency via anti-entropy
>>>>>> repairs within a shorter period you can always reduce that 10 day 
>>>>>> interval.
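>>>>>>
>>>>>> (i.e. make sure a repair along the lines of "nodetool repair -pr
>>>>>> my_keyspace my_cf" completes on every node within whatever gc_grace you
>>>>>> choose; the keyspace/CF names are placeholders.)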
>>>>>>
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 11, 2014 at 3:16 PM, William Oberman <
>>>>>> ober...@civicscience.com> wrote:
>>>>>>
>>>>>>> I'm seeing a lot of articles about a dependency between removing
>>>>>>> tombstones and GCGraceSeconds, which might be my problem (I just 
>>>>>>> checked,
>>>>>>> and this CF has GCGraceSeconds of 10 days).
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli <
>>>>>>> tbarbu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> compaction should take care of it; for me it never worked so I ran
>>>>>>>> nodetool compact on every node; that does it.
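>>>>>>>>
>>>>>>>> e.g. something along these lines, with placeholder hostnames/names:
>>>>>>>>   for h in node1 node2 node3; do nodetool -h $h compact my_keyspace my_cf; done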
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-04-11 16:05 GMT+02:00 William Oberman <
>>>>>>>> ober...@civicscience.com>:
>>>>>>>>
>>>>>>>> I'm wondering what will clear tombstoned rows?  nodetool cleanup,
>>>>>>>>> nodetool repair, or time (as in just wait)?
>>>>>>>>>
>>>>>>>>> I had a CF that was more or less storing session information.
>>>>>>>>>  After some time, we decided that one piece of this information was
>>>>>>>>> pointless to track (and was 90%+ of the columns, and in 99% of those 
>>>>>>>>> cases
>>>>>>>>> was ALL columns for a row).   I wrote a process to remove all of those
>>>>>>>>> columns (which again in a vast majority of cases had the effect of 
>>>>>>>>> removing
>>>>>>>>> the whole row).
>>>>>>>>>
>>>>>>>>> This CF had ~1 billion rows, so I expect to be left with ~100m
>>>>>>>>> rows.  After I did this mass delete, everything was the same size on 
>>>>>>>>> disk
>>>>>>>>> (which I expected, knowing how tombstoning works).  It wasn't 100% 
>>>>>>>>> clear to
>>>>>>>>> me what to poke to cause compactions to clear the tombstones.  First I
>>>>>>>>> tried nodetool cleanup on a candidate node.  But, afterwards the disk 
>>>>>>>>> usage
>>>>>>>>> was the same.  Then I tried nodetool repair on that same node.  But 
>>>>>>>>> again,
>>>>>>>>> disk usage is still the same.  The CF has no snapshots.
>>>>>>>>>
>>>>>>>>> So, am I misunderstanding something?  Is there another operation
>>>>>>>>> to try?  Do I have to "just wait"?  I've only done cleanup/repair on 
>>>>>>>>> one
>>>>>>>>> node.  Do I have to run one or the other over all nodes to clear
>>>>>>>>> tombstones?
>>>>>>>>>
>>>>>>>>> Cassandra 1.2.15 if it matters,
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> will
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Paulo Motta*
>>>
>>> Chaordic | *Platform*
>>> *www.chaordic.com.br*
>>> +55 48 3232.3200
>>>
>>>
>>>
>>
>>
>>
>
