Re: Nodes not picking up data on repair, disk loaded unevenly

Luke Hospadaruk Fri, 08 Jun 2012 06:49:21 -0700

Follow-up:
After adding the EBS volumes, I successfully compacted. The node that had ~1.3T 
is now down to about 400-500GB (some of that is compression savings).  You're 
right about the load – lots of overwrites.


I'm going to get things back off the EBS and add a couple more nodes (I've got 
4 right now, maybe move up to 6 or 8 for the time being).

I also plan on copying all my CFs to new ones to un-do the major compaction.  
I've got some fairly minor schema changes in mind, so it's a good time to copy 
over my data anyways.

Thanks for all the help, it's been very informative

Luke

From: aaron morton <aa...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Nodes not picking up data on repair, disk loaded unevenly

 I am now running major compactions on those nodes (and all is well so far).
Major compaction in this situation will make things worse. When you end up with 
one big file, you will need that much space again to compact / upgrade / 
re-write it.

back down to a normal size, can I move all the data back off the ebs volumes?
something along the lines of:
Yup.

Then add some more nodes to the cluster to keep this from happening in the 
future.
Yerp. Once everything is settled and repair is running, it should be a simple 
operation.

I assume all the files stored in any of the data directories are all uniquely 
named and cassandra won't really care where they are as long as everything it 
wants is in its data directories.
Unique on each node.

So it looks like I never got the tree from node #2 (the node which has 
particularly out of control disk usage).
If you look at the logs for node 2, you will probably find an error.
Or it may still be running; check nodetool compactionstats.

-Is there any way to force replay of hints to empty this out – just a full 
cluster restart when everything is working again maybe?
Normally I would say stop the nodes and delete the hints CF's. As you have 
deleted CF's from one of the nodes there is a risk of losing data though.

If you have been working at CL QUORUM and have not been getting 
TimedOutException, you can still delete the hints, as the writes they contain 
should be on at least one other node and will be fixed by repair.

 I have a high replication factor and all my writes have been at cl=ONE (so all 
the data in the hints should actually exist in a CF somewhere right?).
There is a chance that a write was only applied locally on the node that you 
delete the data from, and it recorded hints to send to the other nodes. It's a 
remote chance but still there.

 how much working space does this need?  Problem is that node #2 is so full I'm 
not sure any major rebuild or compaction will be successful.  The other nodes 
seem to be handling things ok although they are still heavily loaded.
upgradesstables processes one SSTable at a time, so it only needs enough space 
to re-write that SSTable.

This is why major compaction hurts in these situations. If you have 1.5T of 
small files, you may have enough free space to re-write all the files. If you 
have a single 1.5T file you don't.
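
The working-space argument above can be put as a quick check. This is only a sketch: it assumes files are rewritten one at a time, that each rewrite needs roughly as much free space as the file being rewritten, and that the old copy is gone before the next rewrite starts. The function name and the sizes are made up for illustration and are not part of Cassandra.

```python
def can_rewrite_one_at_a_time(sstable_sizes_gb, free_gb):
    """Return True if every SSTable could be rewritten on its own.

    Assumption (not from Cassandra itself): a rewrite needs about as much
    free space as the file it rewrites, and space is reclaimed between files.
    """
    return all(size <= free_gb for size in sstable_sizes_gb)


# 1.5T spread over many small files: 200 GB of headroom is enough.
many_small = [10] * 150
# The same 1.5T after a major compaction: one huge file.
one_big = [1500]

print(can_rewrite_one_at_a_time(many_small, free_gb=200))  # True
print(can_rewrite_one_at_a_time(one_big, free_gb=200))     # False
```

Compression savings are ignored here, so a real rewrite may need somewhat less space than this suggests.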

This cluster has a super high write load currently since I'm still building it 
out.  I frequently update every row in my CFs
 Sounds like a lot of overwrites. When you get compaction running it may purge 
a lot of data.


Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 7/06/2012, at 2:51 AM, Luke Hospadaruk wrote:

Thanks for the tips

Some things I found looking around:

grepping the logs for a specific repair I ran yesterday:

/var/log/cassandra# grep df14e460-af48-11e1-0000-e9014560c7bd system.log
INFO [AntiEntropySessions:13] 2012-06-05 19:58:51,303 AntiEntropyService.java 
(line 658) [repair #df14e460-af48-11e1-0000-e9014560c7bd] new session: will 
sync /4.xx.xx.xx, /1.xx.xx.xx, /3.xx.xx.xx, /2.xx.xx.xx on range 
(85070591730234615865843651857942052864,127605887595351923798765477786913079296]
 for content.[article2]
INFO [AntiEntropySessions:13] 2012-06-05 19:58:51,304 AntiEntropyService.java 
(line 837) [repair #df14e460-af48-11e1-0000-e9014560c7bd] requests for merkle 
tree sent for article2 (to [ /4.xx.xx.xx, /1.xx.xx.xx, /3.xx.xx.xx, 
/2.xx.xx.xx])
INFO [AntiEntropyStage:1] 2012-06-05 20:07:01,169 AntiEntropyService.java (line 
190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for 
article2 from /4.xx.xx.xx
INFO [AntiEntropyStage:1] 2012-06-06 04:12:30,633 AntiEntropyService.java (line 
190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for 
article2 from /3.xx.xx.xx
INFO [AntiEntropyStage:1] 2012-06-06 07:02:51,497 AntiEntropyService.java (line 
190) [repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for 
article2 from /1.xx.xx.xx

So it looks like I never got the tree from node #2 (the node which has 
particularly out of control disk usage).
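
The same comparison (who was asked for a merkle tree versus who answered) can be scripted. A rough sketch, assuming the grep output for a single repair session in the 1.0-era log format quoted above; the function name is hypothetical and the sample is trimmed from the lines above with timestamps dropped.

```python
import re

REQUESTED = re.compile(r"requests for merkle tree sent for \S+ \(to \[([^\]]*)\]\)")
RECEIVED = re.compile(r"Received merkle tree for \S+ from (\S+)")


def missing_merkle_trees(log_text):
    """Return endpoints that were asked for a merkle tree but never replied.

    Assumption: log_text covers exactly one repair session. Whitespace is
    normalised first so lines wrapped by a mail client still match.
    """
    text = " ".join(log_text.split())
    requested = set()
    for match in REQUESTED.finditer(text):
        requested.update(ep.strip() for ep in match.group(1).split(",") if ep.strip())
    received = set(RECEIVED.findall(text))
    return sorted(requested - received)


sample = """
[repair #df14e460-af48-11e1-0000-e9014560c7bd] requests for merkle
tree sent for article2 (to [ /4.xx.xx.xx, /1.xx.xx.xx, /3.xx.xx.xx,
/2.xx.xx.xx])
[repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for
article2 from /4.xx.xx.xx
[repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for
article2 from /3.xx.xx.xx
[repair #df14e460-af48-11e1-0000-e9014560c7bd] Received merkle tree for
article2 from /1.xx.xx.xx
"""

print(missing_merkle_trees(sample))  # ['/2.xx.xx.xx']
```

An empty result only says that every tree arrived; it does not by itself show that the streaming that follows completed.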

These are running on amazon m1.xlarge instances with all the EBS volumes raided 
together for a total of 1.7TB.

What version are you using?
1.0

Have there been times when nodes were down?
Yes, but mostly just restarts, and mostly just one node at a time

Clear as much space as possible from the disk. Check for snapshots in all KS's.
Already done.

What KS's (including the system KS) are taking up the most space ? Are there a 
lot of hints in the system KS (they are not replicated)?
-There's just one KS that I'm actually using, which is taking up anywhere from 
about 650GB on the node I was able to scrub and compact (that sounds like the 
right size to me), and 1.3T on the node that is hugely bloated.
-There are pretty huge hints CFs on all but one node (the node I deleted 
data from, although I did not delete any hints from there). They're between 
175GB and 250GB depending on the node.
-Is there any way to force replay of hints to empty this out – just a full 
cluster restart when everything is working again maybe?
-Could I just disable hinted handoff and wipe out those tables?  I realize I'll 
lose those hints, but that doesn't bother me terribly.  I have a high 
replication factor and all my writes have been at cl=ONE (so all the data in 
the hints should actually exist in a CF somewhere right?).  Perhaps more 
importantly if some data has been stalled in a hints table for a week I won't 
really miss it since it basically doesn't exist right now.  I can re-write any 
data that got lost (although that's not ideal).

Try to get a feel for what CF's are taking up the space or not as the case may 
be. Look in nodetool cfstats to see how big the rows are.
The hints table and my tables are the only thing taking up any significant 
space on the system

If you have enabled compression, run nodetool upgradesstables to compress them.
how much working space does this need?  Problem is that node #2 is so full I'm 
not sure any major rebuild or compaction will be successful.  The other nodes 
seem to be handling things ok although they are still heavily loaded.

In general, try to get free space on the nodes by using compaction, moving 
files to a new mount etc so that you can get repair to run.
-I'll try adding an EBS volume or two to the bloated node and see if that 
allows me to successfully compact/repair.
-If I add another volume to that node, then run some compactions and such to 
the point where everything fits on the main volume again, I may just replace 
that node with a new one.  Can I move things off of the EBS volume and then 
kill it?

Other thoughts/notes:
This cluster has a super high write load currently since I'm still building it 
out.  I frequently update every row in my CFs
I almost certainly need to add more capacity (more nodes).  The general plan is 
to get everything sort of working first though, since repairs and such are 
currently failing it seems like a bad time to add more nodes.

Thanks,
Luke

From: aaron morton <aa...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Nodes not picking up data on repair, disk loaded unevenly

You are basically in trouble. If you can nuke it and start again it would be 
easier. If you want to figure out how to get out of it keep the cluster up and 
have a play.


-What I think the solution should be:
You want to get repair to work before you start deleting data.

At ~840GB I'm probably running close
to the max load I should have on a node,
roughly 300GB to 400GB is the max load

On node #1 I was able to successfully run a scrub and
major compaction,
In this situation running a major compaction is not what you want. It creates a 
huge file that can only be compacted if there is enough space for another huge 
file. Smaller files only need a small amount of space to be compacted.

Is there something I should be looking for in the logs to verify that the
repair was successful?
grep for "repair command"

The shortcut on EC2 is to add an EBS volume, tell cassandra it can store stuff 
there (in the yaml) and buy some breathing room.


What version are you using?

Have there been times when nodes were down?

Clear as much space as possible from the disk. Check for snapshots in all KS's.

What KS's (including the system KS) are taking up the most space ? Are there a 
lot of hints in the system KS (they are not replicated)?

Try to get a feel for what CF's are taking up the space or not as the case may 
be. Look in nodetool cfstats to see how big the rows are.

If you have enabled compression, run nodetool upgradesstables to compress them.


In general, try to get free space on the nodes by using compaction, moving 
files to a new mount etc so that you can get repair to run.

Cheers



-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/06/2012, at 6:53 AM, Luke Hospadaruk wrote:

I have a 4-node cluster with one keyspace (aside from the system keyspace)
with the replication factor set to 4.  The disk usage between the nodes is
pretty wildly different and I'm wondering why.  It's becoming a problem
because one node is getting to the point where it sometimes fails to
compact because it doesn't have enough space.

I've been doing a lot of experimenting with the schema, adding/dropping
things, changing settings around (not ideal I realize, but we're still in
development).

In an ideal world, I'd launch another cluster (this is all hosted in
amazon), copy all the data to that, and just get rid of my current
cluster, but the current cluster is in use by some other parties so
rebuilding everything is impractical (although possible if it's the only
reliable solution).

$ nodetool -h localhost ring
Address     DC         Rack   Status  State   Load       Owns    Token
1.xx.xx.xx  Cassandra  rack1  Up      Normal  837.8 GB   25.00%  0
2.xx.xx.xx  Cassandra  rack1  Up      Normal  1.17 TB    25.00%  42535295865117307932921825928971026432
3.xx.xx.xx  Cassandra  rack1  Up      Normal  977.23 GB  25.00%  85070591730234615865843651857942052864
4.xx.xx.xx  Cassandra  rack1  Up      Normal  291.2 GB   25.00%  127605887595351923798765477786913079296
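
The four tokens in the ring output are evenly spaced, which is why each node owns 25.00%. A small sketch of the usual spacing formula, assuming RandomPartitioner with its 0 to 2**127 token range (the thread does not name the partitioner, but the tokens shown match this formula); the function name is illustrative.

```python
def balanced_tokens(node_count):
    """Evenly spaced initial tokens, assuming RandomPartitioner (0 .. 2**127)."""
    return [i * (2 ** 127) // node_count for i in range(node_count)]


for token in balanced_tokens(4):
    print(token)
# 0
# 42535295865117307932921825928971026432
# 85070591730234615865843651857942052864
# 127605887595351923798765477786913079296
```

Under the same assumption, doubling to 8 nodes keeps all four existing tokens and places the new nodes at the midpoints, while a 6-node ring shares only two tokens with the current one, so some existing nodes would have to move.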

-Problems I'm having:
Nodes are running out of space and are apparently unable to perform
compactions because of it.  These machines have 1.7T total space each.

The logs for node #2 have a lot of warnings about insufficient space for
compaction.  Node number 4 was so extremely out of space (cassandra was
failing to start because of it) that I removed all the SSTables for one of
the less essential column families just to bring it back online.


I have (since I started noticing these issues) enabled compression for all
my column families.  On node #1 I was able to successfully run a scrub and
major compaction, so I suspect that the disk usage for node #1 is about
where all the other nodes should be.  At ~840GB I'm probably running close
to the max load I should have on a node, so I may need to launch more
nodes into the cluster, but I'd like to get things straightened out before
I introduce more potential issues (token moving, etc).

Node #4 seems not to be picking up all the data it should have (since
replication factor == number of nodes, the load should be roughly the
same?).  I've run repairs on that node to seemingly no avail (after repair
finishes, it still has about the same disk usage, which is much too low).
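
The "load should be roughly the same" reasoning can be written as simple arithmetic. A back-of-the-envelope sketch, assuming evenly spaced tokens and fully compacted, fully repaired data; the function name is made up, and 840 GB is just the node #1 figure from this message used as the size of one full copy.

```python
def expected_load_per_node_gb(full_copy_gb, replication_factor, node_count):
    """Average load per node: RF copies of the data spread over all nodes.

    Assumes evenly spaced tokens; ignores hints, snapshots and uncompacted
    overwrites, all of which inflate real disk usage.
    """
    return full_copy_gb * replication_factor / node_count


print(expected_load_per_node_gb(840, replication_factor=4, node_count=4))  # 840.0
print(expected_load_per_node_gb(840, replication_factor=4, node_count=6))  # 560.0
print(expected_load_per_node_gb(840, replication_factor=4, node_count=8))  # 420.0
```

With the replication factor equal to the node count every node holds a full copy, so a node sitting far below that figure after repair is missing data rather than owning a smaller share.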


-What I think the solution should be:
One node at a time:
1) nodetool drain the node
2) shut down cassandra on the node
3) wipe out all the data in my keyspace on the node
4) bring cassandra back up
5) nodetool repair

-My concern:
This is basically what I did with node #4 (although I didn't drain, and I
didn't wipe the entire keyspace), and it doesn't seem to have regained all
the data it's supposed to have after the repair. The column family should
have at least 200-300GB of data, and the SSTables in the data directory
only total about 11GB, am I missing something?

Is there a way to verify that a node _really_ has all the data it's
supposed to have?

I don't want to do this process to each node and discover at the end of it
that I've lost a ton of data.

Is there something I should be looking for in the logs to verify that the
repair was successful?  If I do a 'nodetool netstats' during the repair I
don't see any streams going in or out of node #4.

Thanks,
Luke
