I remember discussions somewhere about issues with the Merkle tree range 
splitting, or some such, that resulted in repair always thinking a little bit of 
data was out of sync. 

If you want to get a better idea of what's been transferred, turn the logging 
up to DEBUG, or turn it up just for org.apache.cassandra.streaming.StreamOut, and 
look for the DEBUG message that says "Ranges are…". Or watch nodetool netstats.
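
For example, something like this in conf/log4j-server.properties (assuming the 
stock log4j setup that ships with 0.8, so adjust the path for your install), 
followed by a restart, will turn on the per-class DEBUG output:

    log4j.logger.org.apache.cassandra.streaming.StreamOut=DEBUG

and you can keep an eye on the streams with:

    nodetool -h localhost netstats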

I may be wrong, but I *suspect* a small amount of data will be transferred, and 
that Sylvain will be able to explain why if you grab some evidence.
 
Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 18/08/2011, at 9:48 AM, Philippe wrote:

> I have a smallish keyspace on my 3-node, RF=3 cluster. The cluster has no 
> read/write traffic while I am testing repairs. I am running the 0.8.4 Debian 
> packages on Ubuntu.
> 
> I've now run 7 repairs in a row on this keyspace and every single one has 
> finished successfully but performed streams between all nodes. This keyspace 
> was written to over the course of several weeks, sometimes with CL.write=ALL, 
> CL.read=ONE, but lately at QUORUM.
> So either I have faulty hardware, a faulty network, or something is wrong. But 
> because repairs on a freshly created 40GB keyspace come up with consistent 
> ranges, I'm guessing it's neither the hardware nor the network.
> 
> I could provide the data directories privately to a committer if that helps... 
> I assume an eighth repair would also stream stuff around. The data 
> directories are: 8.3GB, 3.3GB and 3.1GB.
>  
> 
> Thanks
> 
> 2011/8/17 Philippe <watche...@gmail.com>
> ctrl-c will not stop the repair.
> Ok, so that's why I've been seeing logs of repairs on other CFs.
> 
> That's probably the 2280 issue. Data from all CFs is streamed over.
> Ah, I get it now.
> 
> Thanks
> 
> 
>  
>  
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 17/08/2011, at 10:09 AM, Philippe wrote:
> 
>> One last thought: what happens when you ctrl-c a nodetool repair? Does it 
>> stop the repair on the server? If not, then I think I have multiple repairs 
>> still running. Is there any way to check this?
>> 
>> Thanks
>> 
>> 2011/8/16 Philippe <watche...@gmail.com>
>> Even more interesting behavior: a repair on a CF has consequences on other 
>> CFs. I didn't expect that.
>> 
>> - There are no writes being issued to the cluster, yet the logs indicate that 
>>   SSTableReader has opened dozens and dozens of files, most of them unrelated 
>>   to the CF being repaired.
>> - Compactions are taking place continuously on CFs other than the one being 
>>   repaired, even on CFs in other keyspaces.
>> - I see "Sending AEService tree" messages for CFs not being repaired.
>> 
>> After a very long time, I got some AES messages indicating that streaming 
>> from node C had finished, and then, many minutes later, from node B. And yet 
>> the pending stream count on node B hasn't changed.
>> 
>> The *-data.db files for the CF being repaired are about 70MB on disk.
>> 
>> Maybe when a stream is fully received on node B, netstats on B indicates that 
>> no streams are pending, but since the streams are not acknowledged, node A 
>> still shows them as pending?
>> 
>> 
>> 2011/8/16 Philippe <watche...@gmail.com>
>> I'm still trying different stuff. Here are my latest findings, maybe someone 
>> will find them useful:
>> 
>> - I have been able to repair some small column families by issuing a repair 
>>   [KS] [CF] (see the example command after this list). When testing on the 
>>   ring with no writes at all, it still takes about 2 repairs to get 
>>   "consistent" logs for all AES requests.
>> - Launching a repair on the smallest CF of the biggest KS has triggered a 
>>   flurry of compactions and streams. Some of those streams are for other CFs 
>>   in that keyspace!?
>> - During repairs (one at a time cluster-wide), I get 25-50% I/O waits and 
>>   35-50% CPU usage on a 6-core, SATA-disk setup.
>> - What is surprising to me (bug?) is that netstats shows streams going from 
>>   node A to node B at 0% progress, but netstats on node B doesn't show any 
>>   streams coming in. I'm thinking that repairs may be never-ending, and that 
>>   may be messing up my compactions, hence the huge pile-up of compactions 
>>   until the disk fills up.
>> - I know there's an issue related to failed streams & repairs, could I be 
>>   hitting it?
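>> 
>> For reference, the per-CF repair I've been running looks something like this 
>> (just a sketch; the host, keyspace and column family names are placeholders):
>> 
>>     nodetool -h localhost repair MyKeyspace MyColumnFamily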
>> 
>> Thanks
>> 
>> 2011/8/14 Philippe <watche...@gmail.com>
>> @Teijo: thanks for the procedure, I hope I won't have to do that.
>> 
>> Peter, I'll answer inline. Thanks for the detailed answer.
>>  
>> > the number of SSTables for some keyspaces goes dramatically up (from 3 or 4
>> > to several dozens).
>> 
>> Typically with a long running compaction, such as that triggered by
>> repair, that's what happens as flushed memtables accumulate. In
>> particular for memtables with frequent flushes.
>> 
>> Are you running with concurrent compaction enabled?
>> Yes, it is enabled. On my 0.8 cluster, cassandra.yaml has this line (it's 
>> commented out). BTW, I have 6 cores on each server.
>> #concurrent_compactors: 1
>> 
>> > the commit log keeps increasing in size, I'm at 4.3G now, it went up to 40G
>> > when the compaction was throttled at 16MB/s. On the other nodes it's around
>> > 1GB at most
>> Hmmmm. The Commit Log should not be retained longer than what is
>> required for memtables to be flushed. Is it possible you have had an
>> out-of-disk condition and flushing has stalled? Are you seeing flushes
>> happening in the log?
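>> For example, something like this should show the recent flush activity (the 
>> log path assumes the Debian package default, so adjust if yours differs):
>> 
>>     grep -i "flush" /var/log/cassandra/system.log | tail -20
>> 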
>> No, I don't believe there was ever an out-of-disk condition. Yes, it is 
>> flushing for the first couple of hours.
>> Then, when the repair seems locked up, my log is mostly filled with lines 
>> such as this:
>> 
>>     INFO [ScheduledTasks:1] 2011-08-14 23:15:47,267 StatusLogger.java (line 88) [My_Keyspace].[My_Columnfamily]  45,105541  50/50  20/20
>> 
>> Why is that?
>> 
>> > the data directory is bigger than on the other nodes. I've seen it go up to
>> > 480GB when the compaction was throttled at 16MB/s
>> How much data are you writing? Is it at all plausible that the huge
>> spike is a reflection of lots of overwriting writes that aren't being
>> compacted?
>> No, there's no bulk loading going on at the moment and I'm pretty sure there 
>> wasn't when it spiked up to that load.
>> I've never measured the load because it's a mix of counter increments and 
>> new counters all the time. It's not that much though.
>>  
>> Normally when disk space spikes with repair it's due to other nodes
>> streaming huge amounts (maybe all of their data) to the node, leading
>> to a temporary spike. But if your "real" size is expected to be 60GB,
>> 480GB sounds excessive. Are you sure other nodes aren't running repairs
>> at the same time and magnifying each other's data load spikes?
>> Yes, the two other nodes were running repairs. I had them scheduled at 
>> 8-hour intervals, but they must have overlapped.
>> When data is streamed from one node to another, does that data go into the 
>> commit log as a regular write?
>> How much of a negative impact can that have on the repair going on on this 
>> node?
>> 
>> > What's even weirder is that currently I have 9 compactions running but CPU
>> > is throttled at 1/number of cores half the time (while > 80% the rest of 
>> > the
>> > time). Could this be because other repairs are happening in the ring?
>> You mean compaction is taking less CPU than it "should"?
>> Yes
>>  
>> No, this should not be due to other nodes repairing. However it sounds
>> to me like you are bottlenecking on I/O and the repairs and
>> Yes, I/O is really high on the node right now. Around 50% I/O waits.
>>  
>> compactions are probably proceeding extremely slowly, probably being
>> completely drowned out by live traffic (which is probably having an
>> abnormally high performance impact due to data size spike).
>> Yes, the live traffic is 3 to 10 times slower during repair. Ouch... I hope 
>> I won't have to do this too often while in production!
>>  
>> 
>> What's your read concurrency configured on the node? What does "iostat
>> -x -k 1" show in the average queue size column?
>> Average queue size on the disk (RAID-1 + separate LVM volumes for data, 
>> commit log, caches, logs) varies between 2 and 90. I'd say the average is 
>> around 30-40. Very high variation.
>>  
>> Is "nodetool -h
>> localhost tpstats" showing that ReadStage is usually "full" (@ your
>> limit)?
>> No backlog at all in tpstats
>>  
>> I've figured out how AES is logging its actions, and it looks like it really 
>> is going through every CF in every keyspace and doing a tree request for 
>> every token range.
>> So it really looks like it's just taking forever to compact stuff as it's 
>> repairing.
>> I saw in another email that repairing was taking 2-3 min/GB... it looks like 
>> a lot more for my ring. Anybody else have numbers?
>> 
>> Thanks
>> 
>> 
>> 
> 
> 
> 
