Hi good people.

I underestimated load during peak times and now I'm stuck with our production 
cluster. 
Right now it's 3 nodes with RF 3, so everything is everywhere. We have ~300GB 
data load, ~10MB/sec incoming traffic, and ~50 reads/sec (peak) to the cluster.

The problem derives from our quorum reads / writes: at peak hours one of the 
machines (which one is random) falls behind because it's a little slower than 
the others, and then shortly after that it drops most read requests. So right 
now the only way to survive is to take one machine down, making every read / 
write effectively an ALL operation. It's necessary to take one machine down 
because otherwise users will wait for timeouts from that overwhelmed machine 
when the client lib chooses it. Since we are a real-time oriented thing, that's 
a killer.
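
To spell out the arithmetic behind that, here is a minimal sketch in Python. 
It assumes the usual quorum rule of floor(RF / 2) + 1 replicas; the function 
names are just mine for illustration and not part of any client lib.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must answer a QUORUM read or write."""
    return replication_factor // 2 + 1


def spare_replicas(replication_factor: int, live_replicas: int) -> int:
    """How many live replicas may be slow or silent before QUORUM fails."""
    return live_replicas - quorum(replication_factor)


RF = 3  # our replication factor

# Compare the full cluster (3 live) with one machine taken down (2 live).
for live in (3, 2):
    print(f"{live} live replicas: QUORUM needs {quorum(RF)}, "
          f"can tolerate {spare_replicas(RF, live)} slow or missing")
```

With 2 live replicas there is no spare left, so both remaining machines have 
to answer every request, which is what I mean by ALL.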

So now we tried to add 2 more nodes. The problem is that anticompaction takes 
too long, meaning it is not done when peak hour arrives and the machine that 
would stream the data to the new node must be taken down. We tried to block 
ports 7000 and 9160 to that machine because we hoped that would stop traffic 
and let the machine finish anticompaction. But that did not work because we 
could not cut the already existing connections to the other nodes.

Currently I am copying all data files (that's all existing data) from one node 
to the new nodes in the hope that I can then manually assign them their new 
token range (nodetool move) and do a cleanup.
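
For the token ranges, this is roughly how I would work out the values to 
assign. It is only a sketch and assumes RandomPartitioner (token space 0 to 
2**127) and a final ring of 5 evenly loaded nodes; with a different 
partitioner or node count the numbers would not apply. The names RING_SIZE 
and balanced_tokens are made up for the example.

```python
# Assumption: RandomPartitioner, whose tokens range from 0 to 2**127.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count: int) -> list[int]:
    """Evenly spaced initial tokens for a ring of node_count nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


# Assumption: 3 existing nodes plus the 2 new ones.
for index, token in enumerate(balanced_tokens(5)):
    print(f"node {index}: {token}")
```

Each printed value would be the token given to nodetool move on the 
corresponding node, followed by a cleanup there.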

Obviously I will try this tomorrow (it's been a long day) on a test system but 
any advice would be highly appreciated.

Sighs and thanks.
Daniel

smeet.com
Berlin
