5 hours later, the number of pending compactions shot up to 8k as usual, and the number of SSTables for another keyspace shot up to 160 (from 4). At 4pm, a daily cron job that runs repair started on that same node and, all of a sudden, the number of pending compactions went down to 4k and the number of SSTables went back to normal for the other keyspace.
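For reference, the daily repair job is just a plain crontab entry along these lines (the nodetool path and log file below are illustrative, not the exact entry):

# illustrative crontab entry: kick off anti-entropy repair on this node every day at 16:00
0 16 * * * /opt/cassandra/bin/nodetool -h localhost repair >> /var/log/cassandra/repair-cron.log 2>&1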
The first repair is still not over.

2011/8/14 Philippe <watche...@gmail.com>
> Hello,
> I've been fighting with my cluster for a couple of days now... Running
> 0.8.1.3, using Hector and load-balancing requests across all nodes.
> My question is: how do I get my node back under control so that it runs
> like the other two nodes?
>
> It's a 3-node, RF=3 cluster with reads & writes at CL=QUORUM; I only have
> counter columns inside super columns. There are 6 keyspaces, each with about
> 10 column families. I'm using the BOP. Before the sequence of events
> described below, I was writing at CL=ALL and reading at CL=ONE. I've
> launched repairs multiple times and they have failed for various reasons,
> one of them being hitting the limit on the number of open files. I've raised
> it to 32768 now. I've probably launched repairs when a repair was already
> running on the node. At some point compactions were throttled to 16 MB/s;
> I've removed this limit.
>
> The problem is that one of my nodes is now impossible to repair (no such
> problem with the two others). The load is about 90GB; it should be a
> balanced ring, but the other nodes are at 60GB. Each repair basically
> generates thousands of pending compactions of various types (SSTable build,
> minor, major & validation): it spikes up to 4000, levels off, then spikes
> up to 8000. Previously, I hit Linux limits and had to restart the node,
> but it doesn't look like the repairs have been improving anything time
> after time.
> At the same time,
>
> - the number of SSTables for some keyspaces goes dramatically up (from
>   3 or 4 to several dozen).
> - the commit log keeps increasing in size; I'm at 4.3GB now, and it went
>   up to 40GB when compaction was throttled at 16 MB/s. On the other nodes
>   it's around 1GB at most.
> - the data directory is bigger than on the other nodes. I've seen it go
>   up to 480GB when compaction was throttled at 16 MB/s.
>
> Compaction stats:
> pending tasks: 5954
> compaction type   keyspace               column family       bytes compacted   bytes total   progress
> Validation        ROLLUP_WIFI_COVERAGE   PUBLIC_MONTHLY_17         569432689     596621002     95.44%
> Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_20        2751906910    5806164726     47.40%
> Validation        ROLLUP_WIFI_COVERAGE   PUBLIC_MONTHLY_20        2570106876    2776508919     92.57%
> Validation        ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_19        3010471905    6517183774     46.19%
> Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_15              4132     303015882      0.00%
> Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_18          36302803     595278385      6.10%
> Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_17          24671866      70959088     34.77%
> Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_20          15515781     692029872      2.24%
> Minor             ROLLUP_CDMA_COVERAGE   PUBLIC_MONTHLY_20        1607953684    6606774662     24.34%
> Validation        ROLLUP_WIFI_COVERAGE   PUBLIC_MONTHLY_20         895043380    2776306015     32.24%
>
> My current lsof count for the cassandra user is
> root@xxx:/logs/cassandra# lsof -u cassandra | wc -l
> 13191
>
> What's even weirder is that I currently have 9 compactions running, but CPU
> is throttled at 1/number of cores half the time (while it is >80% the rest
> of the time). Could this be because other repairs are happening in the ring?
> Example (vmstat 2)
>  7  2  0 177632 1596 13868416    0    0  9060    61  5963  5968 40  7 53  0
>  7  0  0 165376 1600 13880012    0    0 41422    28 14027  4608 81 17  1  0
>  8  0  0 159820 1592 13880036    0    0 26830    22 10161 10398 76 19  4  1
>  6  0  0 161792 1592 13882312    0    0 20046    42  7272  4599 81 17  2  0
>  2  0  0 164960 1564 13879108    0    0 17404 26559  6172  3638 79 18  2  0
>  2  0  0 162344 1564 13867888    0    0     6     0  2014  2150 40  2 58  0
>  1  1  0 159864 1572 13867952    0    0     0 41668   958   581 27  0 72  1
>  1  0  0 161972 1572 13867952    0    0     0    89   661   443 17  0 82  1
>  1  0  0 162128 1572 13867952    0    0     0    20   482   398 17  0 83  0
>  2  0  0 162276 1572 13867952    0    0     0   788   485   395 18  0 82  0
>  1  0  0 173896 1572 13867952    0    0     0    29   547   461 17  0 83  0
>  1  0  0 163052 1572 13867920    0    0     0     0   741   620 18  1 81  0
>  1  0  0 162588 1580 13867948    0    0     0    32   523   387 17  0 82  0
> 13  0  0 168272 1580 13877140    0    0 12872   269  8056  6725 56  9 34  0
> 44  1  0 202536 1612 13835956    0    0 26606   530  7946  3887 79 19  2  0
> 48  1  0 406640 1612 13631740    0    0 22006   310  8605  3705 80 18  2  0
>  9  1  0 340300 1620 13697560    0    0 19530   103  8101  3984 84 14  1  0
>  2  0  0 297768 1620 13738036    0    0 12438    10  4115  2628 57  9 34  0
>
> Thanks
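For reference, the open-file limit and the compaction throttle mentioned above were changed roughly as follows; the file paths and the "cassandra" user name are from my setup and may differ on yours:

# /etc/security/limits.conf -- raise the per-process open-file limit for the cassandra user
# (the service has to be restarted for this to take effect)
cassandra  soft  nofile  32768
cassandra  hard  nofile  32768

# cassandra.yaml -- setting this to 0 disables compaction throttling (it was 16 MB/s before)
compaction_throughput_mb_per_sec: 0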