Hi,

To verify that the repair was successful, you can look for messages like these in the log:

INFO [AntiEntropyStage:1] 2012-05-19 00:57:52,351 AntiEntropyService.java (line 762) [repair #e46a0a90-a13c-11e1-0000-596f3d333ab7] UsersCF is fully synced (3 remaining column family to sync for this session)
...
INFO [AntiEntropyStage:1] 2012-05-19 00:59:25,348 AntiEntropyService.java (line 762) [repair #e46a0a90-a13c-11e1-0000-596f3d333ab7] MyOtherCF is fully synced (2 remaining column family to sync for this session)
...
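(If you want to pull those lines out of a busy log, here is a minimal sketch in Python; it assumes the default log location /var/log/cassandra/system.log and keys only on the "fully synced" wording shown above.)

# Minimal sketch: list the repair progress lines from system.log.
# The log path is an assumption; adjust it for your install.
LOG = '/var/log/cassandra/system.log'

with open(LOG) as f:
    for line in f:
        if 'repair #' in line and 'fully synced' in line:
            print(line.rstrip())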
To verify that one node "really" has the data it is supposed to have, well, you could isolate it from the rest of the cluster and query the data (with thrift) at CL ONE.

Regards,
Samuel

From: Luke Hospadaruk <luke.hospada...@ithaka.org>
Date: 05/06/2012 20:53
Reply-To: user@cassandra.apache.org
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Nodes not picking up data on repair, disk loaded unevenly

I have a 4-node cluster with one keyspace (aside from the system keyspace) with the replication factor set to 4. The disk usage between the nodes is pretty wildly different and I'm wondering why. It's becoming a problem because one node is getting to the point where it sometimes fails to compact because it doesn't have enough space.

I've been doing a lot of experimenting with the schema, adding/dropping things, changing settings around (not ideal, I realize, but we're still in development). In an ideal world, I'd launch another cluster (this is all hosted in Amazon), copy all the data to it, and just get rid of my current cluster, but the current cluster is in use by some other parties, so rebuilding everything is impractical (although possible if it's the only reliable solution).

$ nodetool -h localhost ring
Address     DC         Rack   Status  State   Load       Owns    Token
1.xx.xx.xx  Cassandra  rack1  Up      Normal  837.8 GB   25.00%  0
2.xx.xx.xx  Cassandra  rack1  Up      Normal  1.17 TB    25.00%  42535295865117307932921825928971026432
3.xx.xx.xx  Cassandra  rack1  Up      Normal  977.23 GB  25.00%  85070591730234615865843651857942052864
4.xx.xx.xx  Cassandra  rack1  Up      Normal  291.2 GB   25.00%  127605887595351923798765477786913079296

-Problems I'm having:
Nodes are running out of space and are apparently unable to perform compactions because of it. These machines have 1.7T of total space each. The logs for node #2 have a lot of warnings about insufficient space for compaction. Node #4 was so extremely out of space (Cassandra was failing to start because of it) that I removed all the SSTables for one of the less essential column families just to bring it back online.

I have (since I started noticing these issues) enabled compression for all my column families. On node #1 I was able to successfully run a scrub and major compaction, so I suspect that the disk usage for node #1 is about where all the other nodes should be. At ~840GB I'm probably running close to the maximum load I should have on a node, so I may need to launch more nodes into the cluster, but I'd like to get things straightened out before I introduce more potential issues (token moving, etc.).

Node #4 seems not to be picking up all the data it should have (since the replication factor equals the number of nodes, the load should be roughly the same on every node?). I've run repairs on that node to seemingly no avail: after the repair finishes, it still has about the same disk usage, which is much too low.

-What I think the solution should be:
One node at a time:
1) nodetool drain the node
2) shut down cassandra on the node
3) wipe out all the data in my keyspace on the node
4) bring cassandra back up
5) nodetool repair

-My concern:
This is basically what I did with node #4 (although I didn't drain, and I didn't wipe the entire keyspace), and it doesn't seem to have regained all the data it's supposed to have after the repair. The column family should have at least 200-300GB of data, and the SSTables in the data directory only total about 11GB; am I missing something? Is there a way to verify that a node _really_ has all the data it's supposed to have?
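(As a rough illustration of the CL ONE spot check Samuel suggests above: isolate the node, then read a few rows you know should exist over thrift at consistency level ONE. The sketch below assumes pycassa as the client; the keyspace, column family, host, and row keys are placeholders.)

# Rough sketch: CL ONE spot check against a single (isolated) node via pycassa.
# Keyspace, column family, host, and row keys are placeholders.
import pycassa
from pycassa import ConsistencyLevel, NotFoundException

pool = pycassa.ConnectionPool('MyKeyspace', server_list=['4.xx.xx.xx:9160'])
cf = pycassa.ColumnFamily(pool, 'UsersCF')

for key in ['known_row_1', 'known_row_2']:
    try:
        row = cf.get(key, read_consistency_level=ConsistencyLevel.ONE)
        print('%s: %d columns' % (key, len(row)))
    except NotFoundException:
        print('%s: MISSING on this node' % key)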
I don't want to go through this process on each node and discover at the end of it that I've lost a ton of data. Is there something I should be looking for in the logs to verify that the repair was successful? When I run 'nodetool netstats' during the repair, I don't see any streams going in or out of node #4.

Thanks,
Luke
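(For the netstats check mentioned above, a minimal sketch that simply polls nodetool netstats on the node being repaired so that any streams that do start are visible; the host and polling interval are placeholders.)

# Minimal sketch: poll `nodetool netstats` on the node being repaired.
# The host and polling interval are placeholders.
import subprocess
import time

HOST = '4.xx.xx.xx'  # node being repaired

while True:
    print(time.strftime('%H:%M:%S'))
    print(subprocess.check_output(['nodetool', '-h', HOST, 'netstats']).decode())
    time.sleep(30)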