Re: Unable to repair a node

2011-08-19 Thread Peter Schuller
> Somewhere I remember discussions about issues with the merkle tree range > splitting or some such that resulted in repair always thinking a little bit > of data was out of sync. https://issues.apache.org/jira/browse/CASSANDRA-2324 - fixed for early 0.8. I don't *think* there's a known open bug t

Re: Unable to repair a node

2011-08-19 Thread Peter Schuller
> I've now run 7 repairs in a row on this keyspace and every single one has > finished successfully but performed streams between all nodes. This keyspace > was written to over the course of several weeks, sometimes with How much data is streamed, do you know? Mainly interesting is if there is a

Re: Unable to repair a node

2011-08-18 Thread aaron morton
Somewhere I remember discussions about issues with the merkle tree range splitting or some such that resulted in repair always thinking a little bit of data was out of sync. If you want to get a better idea about what's been transferred, turn the logging up to DEBUG or turn it up just for org.ap

Re: Unable to repair a node

2011-08-17 Thread Philippe
I have a smallish keyspace on my 3 node, RF=3 cluster. My cluster has no read/write traffic while I am testing repairs. I am running 0.8.4 of Debian packages on Ubuntu. I've now run 7 repairs in a row on this keyspace and every single one has finished successfully but performed streams between al

Re: Unable to repair a node

2011-08-16 Thread Philippe
> > ctrl-c will not stop the repair. > Ok, so that's why I've been seeing logs of repairs on other CFs That's probably the 2280 issue. Data from all CF's is streamed over > Ah, I get it now. Thanks > > Cheers > > - > Aaron Morton > Freelance Cassandra Developer > @aaronmorto

Re: Unable to repair a node

2011-08-16 Thread aaron morton
ctrl-c will not stop the repair. You can kind of check things by looking at netstats and compactionstats; that will just tell you if there are compactions backing up, not necessarily whether they are validation compactions used during repairs. You can trawl the logs to look for messages from the AntiEntropy

Re: Unable to repair a node

2011-08-16 Thread Philippe
One last thought: what happens when you ctrl-c a nodetool repair? Does it stop the repair on the server? If not, then I think I have multiple repairs still running. Is there any way to check this? Thanks 2011/8/16 Philippe > Even more interesting behavior: a repair on a CF has consequences

Re: Unable to repair a node

2011-08-16 Thread Philippe
Thanks for the pointers, responses inline. On Tue, Aug 16, 2011 at 3:48 PM, Philippe wrote: > > I have been able to repair some small column families by issuing a repair > > [KS] [CF]. When testing on the ring with no writes at all, it still takes > > about 2 repairs to get "consistent" logs for

Re: Unable to repair a node

2011-08-16 Thread Philippe
Even more interesting behavior: a repair on a CF has consequences on other CFs. I didn't expect that. There are no writes being issued to the cluster, yet the logs indicate that - SSTableReader has opened dozens and dozens of files, most of them unrelated to the CF being repaired - compa

Re: Unable to repair a node

2011-08-16 Thread Jonathan Ellis
On Tue, Aug 16, 2011 at 3:48 PM, Philippe wrote: > I have been able to repair some small column families by issuing a repair > [KS] [CF]. When testing on the ring with no writes at all, it still takes > about 2 repairs to get "consistent" logs for all AES requests. I think I linked these in anoth

Re: Unable to repair a node

2011-08-16 Thread Philippe
I'm still trying different stuff. Here are my latest findings, maybe someone will find them useful: - I have been able to repair some small column families by issuing a repair [KS] [CF]. When testing on the ring with no writes at all, it still takes about 2 repairs to get "consistent" log

Re: Unable to repair a node

2011-08-14 Thread Philippe
@Teijo: thanks for the procedure, I hope I won't have to do that. Peter, I'll answer inline. Thanks for the detailed answer. > > the number of SSTables for some keyspaces goes dramatically up (from 3 or > 4 > > to several dozens). > > Typically with a long running compaction, such as that trigge

Re: Unable to repair a node

2011-08-14 Thread Teijo Holzer
Forgot to mention, you want to check the following in cassandra.yaml on the node that you bootstrap before you initiate the bootstrap: * Ensure that the initial_token is set to the correct value (see nodetool) * Ensure that the seeds list doesn't contain the IP of the node you are trying to boo

Re: Unable to repair a node

2011-08-14 Thread Peter Schuller
> oh i know you can run rf 3 on a 3 node cluster. more i thought that if you > have one fail you have less nodes than the rf, so the cluster is at less > than rf, and writes might be disabled or something like that, while at 4 you > still have met the rf... A node failing is independent of RF. *De

Re: Unable to repair a node

2011-08-14 Thread Peter Schuller
Sorry about the lack of response to your actual issue. I'm afraid I don't have an exhaustive analysis, but some quick notes: > balanced ring but the other nodes are at 60GB. Each repair basically > generates thousands of pending compactions of various types (SSTable build, > minor, major & validat

Re: Unable to repair a node

2011-08-14 Thread Teijo Holzer
Hi, I took the following steps to get a node that refused to repair back under control. WARNING: This resulted in some data loss for us, YMMV with your replication factor * Turn off all row & key caches via cassandra-cli * Set "disk_access_mode: standard" in cassandra.yaml * Kill Cassandra on

Re: Unable to repair a node

2011-08-14 Thread Philippe
No, it depends on the consistency level. It's different: for example, QUORUM = 2 for RF=3. Anyway, does anyone have an answer to my real issue? Thanks 2011/8/14 Stephen Connolly > oh i know you can run rf 3 on a 3 node cluster. more i thought that if you > have one fail you have less nodes than the
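The QUORUM arithmetic mentioned in this message can be sketched as follows. This is only an illustration, assuming the usual definition of quorum as floor(RF / 2) + 1; the function names are hypothetical and are not part of Cassandra or Hector.

```python
def quorum(rf: int) -> int:
    """Number of replicas that must respond at QUORUM: floor(rf / 2) + 1."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return rf // 2 + 1


def tolerated_replica_failures(rf: int) -> int:
    """Replicas of a row that can be down while QUORUM can still be met."""
    return rf - quorum(rf)


if __name__ == "__main__":
    # Print the quorum size and tolerated failures for a few replication factors.
    for rf in (1, 2, 3, 5):
        print(f"RF={rf}: quorum={quorum(rf)}, "
              f"tolerated failures={tolerated_replica_failures(rf)}")
```

Under that assumption, RF=3 gives a quorum of 2, so QUORUM reads and writes can still succeed with one replica down, even when the cluster has exactly 3 nodes.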

Re: Unable to repair a node

2011-08-14 Thread Stephen Connolly
oh i know you can run rf 3 on a 3 node cluster. more i thought that if you have one fail you have less nodes than the rf, so the cluster is at less than rf, and writes might be disabled or something like that, while at 4 you still have met the rf... - Stephen --- Sent from my Android phone, so ra

Re: Unable to repair a node

2011-08-14 Thread Philippe
5 hours later, the number of pending compactions shot up to 8k as usual, the number of SSTables for another keyspace shot up to 160 (from 4). At 4pm, a daily cron job that runs repair starts on that same node and all of a sudden, the number of pending compactions went down to 4k and the number of

Re: Unable to repair a node

2011-08-14 Thread Peter Schuller
> i am always wondering why people run clusters with number of nodes == rf > > i thought you needed to have number of nodes > rf to have any sensible > behaviour... but i am no expert at all No. The only requirement is that the number of nodes be >= RF, since clearly in a cluster with fewer nodes

Re: Unable to repair a node

2011-08-14 Thread Stephen Connolly
i am always wondering why people run clusters with number of nodes == rf i thought you needed to have number of nodes > rf to have any sensible behaviour... but i am no expert at all - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense a

Unable to repair a node

2011-08-14 Thread Philippe
Hello, I've been fighting with my cluster for a couple days now... Running 0.8.1.3, using Hector and load balancing requests across all nodes. My question is: how do I get my node back under control so that it runs like the other two nodes? It's a 3 node, RF=3 cluster with reads & writes at CL=QUO