I have spent a lot of time working with single-node, RF=1 clusters during 
development. Before I deploy a cluster to our live environment, I’ve been 
spending some time learning how to work with a multi-node cluster with RF=3. 
There were some surprises, and I’m wondering if people here can enlighten me. 
I don’t exactly have that warm, fuzzy feeling.

I created a three-node cluster with RF=3. I then wrote to the cluster heavily 
enough to cause some dropped mutation messages. The dropped messages didn’t 
trickle in; they came in a burst. I suspect full GC pauses are the culprit, 
but I don’t really know. Anyway, I ended up with 17197 dropped mutation 
messages on node 1, 6422 on node 2, and none on node 3. In order to learn 
about repair, I waited for compaction to finish doing its thing, recorded the 
size and estimated number of keys for each table, started repair 
(nodetool repair <keyspace>) on all three nodes, and waited for them to 
complete before doing anything else (even reads). When repair and compaction 
were done, I checked the size and estimated number of keys for each table 
again. All tables on all nodes grew in both size and estimated number of 
keys. The estimated number of keys grew by 65k, 272k and 247k (0.2%, 0.7% and 
0.6%) for nodes 1, 2 and 3 respectively. I expected some growth, but that’s 
significantly more new keys than I had dropped mutation messages. I also 
expected the most new data on node 1 and none on node 3, which isn’t close to 
what actually happened. Perhaps a mutation message contains more than one 
record? Perhaps the dropped mutation counter is incremented on the 
coordinator rather than on the node that was overloaded?
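
For reference, this is roughly how I collected those numbers (the exact 
command names and output labels may vary a bit by version):

    # dropped mutation messages, per node: the MUTATION row in the
    # "Message type / Dropped" section at the end of the output
    nodetool tpstats

    # per-table size and estimated key count: the "Space used (live)"
    # and "Number of keys (estimate)" lines
    # (nodetool tablestats on newer versions)
    nodetool cfstats <keyspace>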

I repeated repair, and the second time around the tables remained unchanged, as 
expected. I would hope that repair wouldn’t do anything to the tables if they 
were in sync. 

Just to be clear, I’m not overly concerned about the unexpected increase in 
the number of keys. I’m pretty sure that repair did what it needed to and 
brought the nodes into sync. The unexpected results more likely indicate that 
I’m ignorant, and it really bothers me when I don’t understand something. If 
you have any insights, I’d appreciate them.

One of the dismaying things about repair was that the first time around it took 
about 4 hours, with a completely idle cluster (except for repairs, of course), 
and only 6 GB of data on each node. I can bootstrap a node with 6 GB of data in 
a couple of minutes. That makes repair something like 50 to 100 times more 
expensive than bootstrapping. I know I should run repair on one node at a time, 
but even if you divide by three, that’s still a horrifically long time for such 
a small amount of data. The second time around, repair took only 30 minutes. 
That’s much better, but even the best case is still about 10x longer than 
bootstrapping.
Should repair really be taking this long? When I have 300 GB of data, is a 
best-case repair going to take 25 hours, and a repair with a modest amount of 
work more than 100 hours? My records are quite small. Those 6 GB contain almost 
40 million partitions. 
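
In case it matters, these were plain full repairs with no primary-range 
option. My understanding, which may be wrong, is that something along the 
lines of

    # run on each node in turn, repairing only that node's primary range
    nodetool repair -pr <keyspace>

would avoid repairing every range RF times, which is roughly the factor of 
three I allowed for above.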

Following my repair experiment, I added a fourth node, and then tried killing 
a node and importing a bunch of data while that node was down. As far as 
repair is concerned, this seems to work fine (although, again, glacially). 
However, I noticed that hinted handoff doesn’t seem to be working. I added 
several million records (with consistency=ONE), and nothing appeared in 
system.hints (du -hs showed only a few dozen KB), nor did I see any pending 
Hinted Handoff tasks in the thread pool stats. When I started the down node 
back up (less than 3 hours later), the missed data didn’t appear to get sent 
to it: the tables didn’t grow, no compactions were scheduled, and there 
wasn’t any appreciable CPU utilization anywhere in the cluster. With millions 
of records missed while it was down, I should have noticed something if hints 
were actually being replayed. Is there some magic setting to turn on hinted 
handoff? Were there too many hints, so they simply got discarded? My 
assumption is that if hinted handoff is working, my need for repair should be 
much smaller, which, given my experience so far, would be a really good thing.
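
For what it’s worth, these are the settings I believe are relevant, with what 
I think the defaults are (if I’ve misread them, that may well be my whole 
problem):

    # cassandra.yaml
    hinted_handoff_enabled: true
    max_hint_window_in_ms: 10800000    # 3 hours; no hints are written for a
                                       # node that stays down longer than this

and this is roughly how I looked for hint activity:

    # hint data files (the path is a guess at the usual layout; yours may differ)
    du -hs /var/lib/cassandra/data/system/hints*

    # HintedHandoff thread pool (pending/active tasks)
    nodetool tpstats | grep -i hint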

Given the horrifically long time it takes to repair a node, and hinted handoff 
apparently not working, if a node goes down, is it better to bootstrap a new 
one than to repair the node that went down? I would expect that even if I chose 
to bootstrap a new node, it would need to be repaired anyway, since it would 
probably miss writes while bootstrapping.
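
To be concrete, by “bootstrap a new one” I mean something like the 
replace-address route, as I understand it (the property name may vary by 
version):

    # on the replacement node, before its first start
    # (e.g. appended to cassandra-env.sh)
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<ip_of_dead_node>"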

Thanks in advance

Robert
