5 hours later, the number of pending compactions shot up to 8k as usual, and the
number of SSTables for another keyspace shot up to 160 (from 4).
At 4pm, a daily cron job that runs repair started on that same node and, all
of a sudden, the number of pending compactions went down to 4k and the number
of SSTables went back to normal for the other keyspace.
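
For reference, a cron entry for a daily repair like the one described might look like the sketch below; the 16:00 start time, log path, and /etc/cron.d location are assumptions for illustration, not taken from the actual crontab:

  # /etc/cron.d/cassandra-repair -- hypothetical daily repair entry
  0 16 * * * cassandra /usr/bin/nodetool -h localhost repair >> /var/log/cassandra/repair.log 2>&1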

The first repair is still not over.

2011/8/14 Philippe <watche...@gmail.com>

> Hello, I've been fighting with my cluster for a couple of days now... Running
> 0.8.1.3, using Hector and load-balancing requests across all nodes.
> My question is: how do I get my node back under control so that it runs
> like the other two nodes?
>
>
> It's a 3-node, RF=3 cluster with reads & writes at CL=QUORUM; I only have
> counter columns inside super columns. There are 6 keyspaces, each with about
> 10 column families. I'm using the ByteOrderedPartitioner (BOP). Before the
> sequence of events described below, I was writing at CL=ALL and reading at
> CL=ONE. I've launched repairs multiple times and they have failed for various
> reasons, one of them being hitting the limit on the number of open files. I've
> raised it to 32768 now. I've probably launched repairs while a repair was
> already running on the node. At some point compactions were throttled to
> 16MB/s; I've removed this limit.
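>
> For what it's worth, a minimal sketch of those two changes (the limits.conf
> path is standard Linux; whether nodetool in this 0.8.x build exposes
> setcompactionthroughput is an assumption, the cassandra.yaml setting is the
> safer bet):
>
> # Raise the open-file limit for the cassandra user (takes effect on restart)
> echo "cassandra - nofile 32768" >> /etc/security/limits.conf
>
> # Disable compaction throttling (0 = unthrottled). If this nodetool command
> # isn't available in this version, set compaction_throughput_mb_per_sec: 0
> # in cassandra.yaml and restart the node instead.
> nodetool -h localhost setcompactionthroughput 0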
>
> The problem is that one of my nodes is now impossible to repair (no such
> problem with the two others). The load is about 90GB; it should be a
> balanced ring, but the other nodes are at 60GB. Each repair basically
> generates thousands of pending compactions of various types (SSTable build,
> minor, major & validation): it spikes up to 4000, levels off, then spikes up
> to 8000. Previously, I hit Linux limits and had to restart the node, but it
> doesn't look like the repairs have been improving anything time after time.
> At the same time:
>
>    - the number of SSTables for some keyspaces goes up dramatically (from
>    3 or 4 to several dozen).
>    - the commit log keeps increasing in size: I'm at 4.3G now, and it went up
>    to 40G when compaction was throttled at 16MB/s. On the other nodes it's
>    around 1GB at most.
>    - the data directory is bigger than on the other nodes. I've seen it go
>    up to 480GB when compaction was throttled at 16MB/s.
>
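> (For the record, a rough way to watch those numbers; the directory paths below
> are the stock defaults and are an assumption, adjust them to the
> data_file_directories and commitlog_directory set in cassandra.yaml.)
>
> # Per-column-family SSTable counts (look at the "SSTable count" lines)
> nodetool -h localhost cfstats | grep -E 'Column Family|SSTable count'
>
> # Commit log and data directory sizes
> du -sh /var/lib/cassandra/commitlog /var/lib/cassandra/data
>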
>
> Compaction stats:
> pending tasks: 5954
>   compaction type   keyspace / column family                bytes compacted   bytes total   progress
>   Validation        ROLLUP_WIFI_COVERAGEPUBLIC_MONTHLY_17         569432689     596621002     95.44%
>   Minor             ROLLUP_CDMA_COVERAGEPUBLIC_MONTHLY_20        2751906910    5806164726     47.40%
>   Validation        ROLLUP_WIFI_COVERAGEPUBLIC_MONTHLY_20        2570106876    2776508919     92.57%
>   Validation        ROLLUP_CDMA_COVERAGEPUBLIC_MONTHLY_19        3010471905    6517183774     46.19%
>   Minor             ROLLUP_CDMA_COVERAGEPUBLIC_MONTHLY_15              4132      303015882      0.00%
>   Minor             ROLLUP_CDMA_COVERAGEPUBLIC_MONTHLY_18          36302803      595278385      6.10%
>   Minor             ROLLUP_CDMA_COVERAGEPUBLIC_MONTHLY_17          24671866       70959088     34.77%
>   Minor             ROLLUP_CDMA_COVERAGEPUBLIC_MONTHLY_20          15515781      692029872      2.24%
>   Minor             ROLLUP_CDMA_COVERAGEPUBLIC_MONTHLY_20        1607953684    6606774662     24.34%
>   Validation        ROLLUP_WIFI_COVERAGEPUBLIC_MONTHLY_20         895043380    2776306015     32.24%
>
> My current lsof count for the cassandra user is
> root@xxx:/logs/cassandra# lsof -u cassandra| wc -l
> 13191
>
> What's even weirder is that I currently have 9 compactions running, but CPU
> usage sits at 1/(number of cores) half the time (while above 80% the rest of
> the time). Could this be because other repairs are happening in the ring?
> Example (vmstat 2):
>  r  b   swpd   free   buff    cache   si   so    bi    bo   in   cs us sy id wa
>  7  2      0 177632   1596 13868416    0    0  9060    61 5963 5968 40  7 53  0
>  7  0      0 165376   1600 13880012    0    0 41422    28 14027 4608 81 17  1  0
>  8  0      0 159820   1592 13880036    0    0 26830    22 10161 10398 76 19  4  1
>  6  0      0 161792   1592 13882312    0    0 20046    42 7272 4599 81 17  2  0
>  2  0      0 164960   1564 13879108    0    0 17404 26559 6172 3638 79 18  2  0
>  2  0      0 162344   1564 13867888    0    0     6     0 2014 2150 40  2 58  0
>  1  1      0 159864   1572 13867952    0    0     0 41668  958  581 27  0 72  1
>  1  0      0 161972   1572 13867952    0    0     0    89  661  443 17  0 82  1
>  1  0      0 162128   1572 13867952    0    0     0    20  482  398 17  0 83  0
>  2  0      0 162276   1572 13867952    0    0     0   788  485  395 18  0 82  0
>  1  0      0 173896   1572 13867952    0    0     0    29  547  461 17  0 83  0
>  1  0      0 163052   1572 13867920    0    0     0     0  741  620 18  1 81  0
>  1  0      0 162588   1580 13867948    0    0     0    32  523  387 17  0 82  0
> 13  0      0 168272   1580 13877140    0    0 12872   269 8056 6725 56  9 34  0
> 44  1      0 202536   1612 13835956    0    0 26606   530 7946 3887 79 19  2  0
> 48  1      0 406640   1612 13631740    0    0 22006   310 8605 3705 80 18  2  0
>  9  1      0 340300   1620 13697560    0    0 19530   103 8101 3984 84 14  1  0
>  2  0      0 297768   1620 13738036    0    0 12438    10 4115 2628 57  9 34  0
>
> Thanks
>
