> > ctrl-c will not stop the repair.
> Ok, so that's why I've been seeing logs of repairs on other CFs

That's probably the 2280 issue (CASSANDRA-2280): data from all CFs is streamed over.

> Ah, I get it now. Thanks
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
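(A quick way to check whether a repair is still running on a node, sketched with the 0.8-era tools discussed in this thread; the log path is an assumption:)

    # validation compactions and repair streams show up here while a repair is active
    nodetool -h localhost compactionstats
    nodetool -h localhost netstats
    # AntiEntropy activity in the system log is another sign a repair is still in flight
    grep -i antientropy /var/log/cassandra/system.log | tail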
> On 17/08/2011, at 10:09 AM, Philippe wrote:
>
> One last thought : what happens when you ctrl-c a nodetool repair ? Does it
> stop the repair on the server ? If not, then I think I have multiple repairs
> still running. Is there any way to check this ?
>
> Thanks
>
> 2011/8/16 Philippe <watche...@gmail.com>
>
>> Even more interesting behavior : a repair on a CF has consequences on
>> other CFs. I didn't expect that.
>>
>> There are no writes being issued to the cluster, yet the logs indicate
>> that
>>
>> - SSTableReader has opened dozens and dozens of files, most of them
>>   unrelated to the CF being repaired
>> - compactions are taking place continuously on CFs other than the one
>>   being repaired, even CFs in other keyspaces
>> - I see "Sending AEService tree" messages for CFs not being repaired.
>>
>> After a very long time, I got some AES messages indicating that streaming
>> from node C had finished, and then many minutes after that, node B. And yet
>> the pending stream count on node B hasn't changed.
>>
>> The *-data.db files for the CF being repaired are about 70MB on-disk.
>>
>> Maybe when a stream is fully received on node B, netstats indicates that
>> no streams are pending, but since they are not acknowledged, node A doesn't ?
>>
>> 2011/8/16 Philippe <watche...@gmail.com>
>>
>>> I'm still trying different stuff. Here are my latest findings, maybe
>>> someone will find them useful:
>>>
>>> - I have been able to repair some small column families by issuing a
>>>   repair [KS] [CF]. When testing on the ring with no writes at all, it
>>>   still takes about 2 repairs to get "consistent" logs for all AES requests.
>>> - Launching a repair on the smallest CF of the biggest KS has triggered
>>>   a flurry of compactions and streams. Some of those streams are for
>>>   other CFs in that keyspace !?
>>> - During repairs (one at a time cluster-wide), I get 25-50% I/O waits
>>>   & 35-50% CPU usage on a 6-core SATA-disk setup
>>>
>>> What is surprising to me (bug?) is that netstats shows me streams going
>>> from node A to node B at 0% progress. But netstats on node B doesn't show
>>> me any streams coming in. I'm thinking that repairs may be never-ending
>>> and that may be messing up my compactions, hence the huge pile-up of
>>> compactions until the disk fills up.
>>> I know there's an issue related to failed streams & repairs, could I be
>>> hitting it ?
>>>
>>> Thanks
>>>
>>> 2011/8/14 Philippe <watche...@gmail.com>
>>>
>>>> @Teijo : thanks for the procedure, I hope I won't have to do that
>>>>
>>>> Peter, I'll answer inline. Thanks for the detailed answer.
>>>>
>>>>> > the number of SSTables for some keyspaces goes dramatically up (from
>>>>> > 3 or 4 to several dozens).
>>>>>
>>>>> Typically with a long-running compaction, such as that triggered by
>>>>> repair, that's what happens as flushed memtables accumulate. In
>>>>> particular for memtables with frequent flushes.
>>>>>
>>>>> Are you running with concurrent compaction enabled?
>>>>
>>>> Yes, it is enabled. On my 0.8 cluster, cassandra.yaml has this (it's
>>>> commented out). BTW, I have 6 cores on each server.
>>>> #concurrent_compactors: 1
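(For context: as I understand the 0.8 defaults, leaving that line commented lets compaction run roughly one thread per core, i.e. up to 6 concurrent compactors on this box. A sketch of capping it; the value 2 is purely illustrative, not something tested in this thread:)

    # cassandra.yaml
    #concurrent_compactors: 1    # shipped default: commented out, ~one compactor per core
    concurrent_compactors: 2     # illustrative cap to reduce compaction I/O contention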
>>>>> > the commit log keeps increasing in size, I'm at 4.3G now, it went up
>>>>> > to 40G when the compaction was throttled at 16MB/s. On the other nodes
>>>>> > it's around 1GB at most
>>>>>
>>>>> Hmmmm. The Commit Log should not be retained longer than what is
>>>>> required for memtables to be flushed. Is it possible you have had an
>>>>> out-of-disk condition and flushing has stalled? Are you seeing flushes
>>>>> happening in the log?
>>>>
>>>> No, I don't believe there was ever an out-of-disk condition. Yes, it is
>>>> flushing for the first couple of hours.
>>>> Then, when repair seems locked up, my log is mostly filled with lines
>>>> such as this:
>>>> INFO [ScheduledTasks:1] 2011-08-14 23:15:47,267 StatusLogger.java (line
>>>> 88) [My_Keyspace].[My_Columnfamily] 45,105541 50/50 20/20
>>>> Why is that ?
>>>>
>>>>> > the data directory is bigger than on the other nodes. I've seen it go
>>>>> > up to 480GB when the compaction was throttled at 16MB/s
>>>>>
>>>>> How much data are you writing? Is it at all plausible that the huge
>>>>> spike is a reflection of lots of overwriting writes that aren't being
>>>>> compacted?
>>>>
>>>> No, there's no bulk loading going on at the moment and I'm pretty sure
>>>> there wasn't when it spiked up to that level.
>>>> I've never measured the write load because it's a mix of counter
>>>> increments and new counters all the time. It's not that much, though.
>>>>
>>>>> Normally when disk space spikes with repair it's due to other nodes
>>>>> streaming huge amounts (maybe all of their data) to the node, leading
>>>>> to a temporary spike. But if your "real" size is expected to be 60,
>>>>> 480 sounds excessive. Are you sure other nodes aren't running repairs
>>>>> at the same time and magnifying each other's data load spikes?
>>>>
>>>> Yes, the two other nodes were running repairs. I had them scheduled at
>>>> 8-hour intervals but they must have overlapped.
>>>> When data is streamed from one node to another, does that data go into
>>>> the commit log as a regular write ?
>>>> How much of a negative impact can that have on the repair going on on
>>>> this node ?
>>>>
>>>>> > What's even weirder is that currently I have 9 compactions running
>>>>> > but CPU is throttled at 1/number of cores half the time (while > 80%
>>>>> > the rest of the time). Could this be because other repairs are
>>>>> > happening in the ring ?
>>>>>
>>>>> You mean compaction is taking less CPU than it "should"?
>>>>
>>>> Yes
>>>>
>>>>> No, this should not be due to other nodes repairing. However it sounds
>>>>> to me like you are bottlenecking on I/O and the repairs and
>>>>
>>>> Yes, I/O is really high on the node right now. Around 50% I/O waits.
>>>>
>>>>> compactions are probably proceeding extremely slowly, probably being
>>>>> completely drowned out by live traffic (which is probably having an
>>>>> abnormally high performance impact due to the data size spike).
>>>>
>>>> Yes, live traffic is 3 to 10x slower during repair. Ouch... I hope I
>>>> won't have to do this too often while in production !
>>>>
>>>>> What's your read concurrency configured on the node? What does "iostat
>>>>> -x -k 1" show in the average queue size column?
>>>>
>>>> Average queue size on the disk (RAID-1 + separate LVM volumes for data,
>>>> commit log, caches, logs) varies between 2 and 90. I'd say the average
>>>> is around 30-40. Very high variation.
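(Concretely, that means watching the avgqu-sz column of iostat -x; this assumes the sysstat package is installed:)

    # extended per-device stats in kB, refreshed every second
    iostat -x -k 1
    # avgqu-sz is the average request queue length; sustained values of 30-40
    # on a RAID-1 pair mean the disks are saturated and repair/compaction I/O
    # is queueing behind live reads. %util near 100 tells the same story.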
>>>>> Is "nodetool -h localhost tpstats" showing that ReadStage is usually
>>>>> "full" (@ your limit)?
>>>>
>>>> No backlog at all in tpstats.
>>>>
>>>> I've figured out how AES is logging its actions, and it looks like it
>>>> really is going through every CF in every keyspace and doing a tree
>>>> request for every token range.
>>>> So it really looks like it's just taking forever to compact stuff as
>>>> it's repairing.
>>>> I saw in another email that repair was taking 2-3 min/GB... it looks
>>>> like a lot more for my ring. Anybody else have numbers ?
>>>>
>>>> Thanks
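(A sketch of the two commands behind those observations; keyspace/CF names and the log path are placeholders. The first is the single-CF "repair [KS] [CF]" form mentioned earlier in the thread, the second a rough way to watch AES tree requests go by:)

    # repair a single column family rather than the whole keyspace
    nodetool -h localhost repair MyKeyspace MyColumnFamily
    # AES logs one "Sending AEService tree" message per CF per token range,
    # so counting them gives a crude progress indicator
    grep -c 'Sending AEService tree' /var/log/cassandra/system.log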