Hi good people. I underestimated the load during peak times and now I'm stuck with our production cluster. Right now it's 3 nodes with rf 3, so everything is everywhere. We have ~300GB of data load, ~10MB/sec of incoming traffic and ~50 reads/sec (peak) to the cluster.
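To put numbers on the setup: with rf 3 a quorum is 2 replicas, so as soon as one of the three nodes is down, every quorum read or write needs both remaining nodes. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not Cassandra APIs.

```python
def quorum(rf: int) -> int:
    """Number of replicas a QUORUM operation needs: a strict majority of rf."""
    return rf // 2 + 1


def spare_replicas(rf: int, live_replicas: int) -> int:
    """How many live replicas may be slow or unresponsive before QUORUM fails.

    Raises ValueError if there are not enough live replicas for a quorum at all.
    """
    needed = quorum(rf)
    if live_replicas < needed:
        raise ValueError("not enough live replicas for QUORUM")
    return live_replicas - needed


if __name__ == "__main__":
    rf = 3
    print("quorum for rf 3:", quorum(rf))
    # All three nodes up: one replica can lag without failing the request.
    print("spare replicas with 3 live:", spare_replicas(rf, 3))
    # One node taken down: both remaining nodes must answer every request,
    # so QUORUM behaves like ALL over the live nodes.
    print("spare replicas with 2 live:", spare_replicas(rf, 2))
```

With zero spare replicas there is no slack left: if either of the two remaining machines falls behind, requests time out.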
The problem derives from our quorum reads / writes: at peak hours one of the machines (which one is random) falls behind because it's a little slower than the others, and then shortly after that it drops most read requests. So right now the only way to survive is to take one machine down, making every read / write effectively an ALL operation. It's necessary to take one machine down because otherwise users will wait for timeouts from that overwhelmed machine whenever the client lib chooses it. Since we are a real-time oriented thing, that's a killer.

So now we tried to add 2 more nodes. The problem is that anticompaction takes too long, meaning it is not done when peak hour arrives and the machine that would stream the data to the new node must be taken down. We tried to block ports 7000 and 9160 to that machine because we hoped that would stop traffic and let the machine finish anticompaction. But that did not work because we could not cut the already existing connections to the other nodes.

Currently I am copying all data files (that's all existing data) from one node to the new nodes in the hope that I can then manually assign them their new token ranges (nodetool move) and do cleanup.

Obviously I will try this tomorrow (it's been a long day) on a test system, but any advice would be highly appreciated.

Sighs and thanks.
Daniel
smeet.com
Berlin
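For the manual token assignment mentioned above, here is a rough sketch of how evenly spaced tokens for the 5-node ring could be computed. It assumes RandomPartitioner with its token space of 0 to 2**127, which the post does not actually state; with a different partitioner these numbers do not apply. The function name is hypothetical.

```python
# Assumption: RandomPartitioner, whose tokens fall in the range 0 .. 2**127.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count: int) -> list[int]:
    """Return one evenly spaced token per node for a ring of node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Target ring after adding 2 nodes to the existing 3.
    for index, token in enumerate(balanced_tokens(5)):
        print(f"node {index}: {token}")
```

Each printed value would be the token to hand to nodetool move for the corresponding node, followed by a cleanup on the nodes whose ranges shrank. This is something to verify on the test system first, not a tested procedure.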