[ 
https://issues.apache.org/jira/browse/CASSANDRA-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15151180#comment-15151180
 ] 

Dominik Keil commented on CASSANDRA-10389:
------------------------------------------

I think we're seeing this issue as well, running Cassandra 2.2.5. We haven't 
tried restarting all nodes yet, but will do that now.

We're running incremental repairs (the default now, eh?). While testing this 
before putting it into production, we already found that repairing a whole 
keyspace creates a massive number of open file handles and "anti-compacted" 
sstables, even though the repair still only processes one CF at a time 
internally. This caused some problems, so we now run repairs one CF at a time 
and on only one node at a time.
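The "one CF at a time, one node at a time" workaround can be sketched as a small wrapper script. This is an assumption-laden sketch, not our exact tooling: the keyspace/table names are illustrative, and it dry-runs unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch of the "one CF at a time, one node at a time" workaround
# described above. Keyspace/table names are illustrative, not from a
# real schema. Dry run (commands only printed) unless RUN=1 is set.
KEYSPACE=perspectiv
TABLES="stock_increment_agg receipt_agg_total"
ISSUED=""

for TABLE in $TABLES; do
    # In Cassandra 2.2, nodetool repair is incremental by default,
    # so no extra flag is needed here.
    CMD="nodetool repair $KEYSPACE $TABLE"
    ISSUED="$ISSUED$CMD
"
    if [ "${RUN:-0}" = "1" ]; then
        $CMD || exit 1          # stop at the first failed repair
    else
        echo "DRY RUN: $CMD"
    fi
done
```

Run the script on one node, wait for it to finish, then move to the next node, so at most one repair session is active in the cluster at any time.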

We did not hit this issue in our testing, but we are seeing it in production 
now nevertheless. What's interesting is that at some point the node on which 
the repair runs suddenly thrashes its heap (i.e. full heap usage, 65%-85% of 
time spent in GC!) while at the same time issuing huge numbers of tiny 
concurrent reads, leading to really bad read latency from disk and a lot of 
I/O wait.

The bad thing is: this (Cassandra) node becomes so unresponsive that it 
significantly impacts the performance of the whole cluster (9 machines in 
total; RF 5 with QUORUM for most reads/writes, RF 2 with ONE for less 
important bulk data). Neither the Java driver nor the other nodes, when 
acting as coordinator, manage to just leave this node alone for a while. As 
soon as I disable gossip on this node, the rest of the cluster is fine again.
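Taking the misbehaving node out of the request path without stopping Cassandra can be sketched as below. The subcommands exist in 2.2-era nodetool; the script itself is a hypothetical convenience wrapper and dry-runs unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: isolate a misbehaving node without stopping Cassandra by
# disabling gossip and the client protocols on it. Subcommands are
# from 2.2-era nodetool. Dry run unless RUN=1 is set.
SUBCOMMANDS="disablegossip disablebinary disablethrift"
ISSUED2=""

for SUB in $SUBCOMMANDS; do
    ISSUED2="$ISSUED2 nodetool $SUB"
    if [ "${RUN:-0}" = "1" ]; then
        nodetool "$SUB"
    else
        echo "DRY RUN: nodetool $SUB"
    fi
done
# Undo later with: nodetool enablegossip / enablebinary / enablethrift
```

Disabling only gossip already stops other nodes from routing requests to this node; disabling the binary (CQL) and Thrift ports additionally cuts off clients connected to it directly.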

[~slebresne]: I applaud you for your very useful comment.

> Repair session exception Validation failed
> ------------------------------------------
>
>                 Key: CASSANDRA-10389
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10389
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 8, Java 1.8.0_60, Cassandra 2.2.1 (datastax 
> compilation)
>            Reporter: Jędrzej Sieracki
>             Fix For: 2.2.x
>
>
> I'm running a repair on a ring of nodes that was recently extended from 3 to 
> 13 nodes. The extension was done two days ago; the repair was attempted 
> yesterday.
> {quote}
> [2015-09-22 11:55:55,266] Starting repair command #9, repairing keyspace 
> perspectiv with repair options (parallelism: parallel, primary range: false, 
> incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], 
> hosts: [], # of ranges: 517)
> [2015-09-22 11:55:58,043] Repair session 1f7c50c0-6110-11e5-b992-9f13fa8664c8 
> for range (-5927186132136652665,-5917344746039874798] failed with error 
> [repair #1f7c50c0-6110-11e5-b992-9f13fa8664c8 on 
> perspectiv/stock_increment_agg, (-5927186132136652665,-5917344746039874798]] 
> Validation failed in cblade1.XXX/XXX (progress: 0%)
> {quote}
> BTW, I am ignoring the LEAK errors for now; they are outside the scope of 
> the main issue:
> {quote}
> ERROR [Reference-Reaper:1] 2015-09-22 11:58:27,843 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@4d25ad8f) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@896826067:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-73-big
>  was not released before the reference was garbage collected
> {quote}
> I scrubbed the sstable with failed validation on cblade1 with nodetool scrub 
> perspectiv stock_increment_agg:
> {quote}
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:31,615 OutputHandler.java:42 
> - Scrubbing 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
>  (345466609 bytes)
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:31,615 OutputHandler.java:42 
> - Scrubbing 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
>  (60496378 bytes)
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@4ca8951e) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@114161559:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-48-big
>  was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@eeb6383) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1612685364:/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big
>  was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@1de90543) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@2058626950:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-49-big
>  was not released before the reference was garbage collected
> ERROR [Reference-Reaper:1] 2015-09-22 12:05:31,676 Ref.java:187 - LEAK 
> DETECTED: a reference 
> (org.apache.cassandra.utils.concurrent.Ref$State@15616385) to class 
> org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1386628428:/var/lib/cassandra/data/perspectiv/receipt_agg_total-76abb0625de711e59f6e0b7d98a25b6e/la-47-big
>  was not released before the reference was garbage collected
> INFO  [CompactionExecutor:1703] 2015-09-22 12:05:35,098 OutputHandler.java:42 
> - Scrub of 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-82-big-Data.db')
>  complete: 51397 rows in new sstable and 0 empty (tombstoned) rows dropped
> INFO  [CompactionExecutor:1704] 2015-09-22 12:05:47,605 OutputHandler.java:42 
> - Scrub of 
> BigTableReader(path='/var/lib/cassandra/data/perspectiv/stock_increment_agg-840cad405de711e5b9929f13fa8664c8/la-83-big-Data.db')
>  complete: 292600 rows in new sstable and 0 empty (tombstoned) rows dropped
> {quote}
> Now, after scrubbing, another repair was attempted. It did finish, but with 
> lots of errors from other nodes:
> {quote}
> [2015-09-22 12:01:18,020] Repair session db476b51-6110-11e5-b992-9f13fa8664c8 
> for range (5019296454787813261,5021512586040808168] failed with error [repair 
> #db476b51-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, 
> (5019296454787813261,5021512586040808168]] Validation failed in /10.YYY 
> (progress: 91%)
> [2015-09-22 12:01:18,079] Repair session db482ea1-6110-11e5-b992-9f13fa8664c8 
> for range (-3660233266780784242,-3638577078894365342] failed with error 
> [repair #db482ea1-6110-11e5-b992-9f13fa8664c8 on 
> perspectiv/stock_increment_agg, (-3660233266780784242,-3638577078894365342]] 
> Validation failed in /10.XXX (progress: 92%)
> [2015-09-22 12:01:18,276] Repair session db4a0361-6110-11e5-b992-9f13fa8664c8 
> for range (9158857758535272856,9167427882441871745] failed with error [repair 
> #db4a0361-6110-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, 
> (9158857758535272856,9167427882441871745]] Validation failed in /10.YYY 
> (progress: 95%)
> {quote}
> After scrubbing stock_increment_agg on all nodes, just to be sure, the repair 
> still failed, this time with the following exception:
> {quote}
> INFO  [Repair#16:50] 2015-09-22 12:08:47,471 RepairJob.java:181 - [repair 
> #ea123bf3-6111-11e5-b992-9f13fa8664c8] Requesting merkle trees for 
> stock_increment_agg (to [/10.60.77.202, cblade1.XXX/XXX])
> ERROR [RepairJobTask:1] 2015-09-22 12:08:47,471 RepairSession.java:290 - 
> [repair #ea123bf0-6111-11e5-b992-9f13fa8664c8] Session completed with the 
> following error
> org.apache.cassandra.exceptions.RepairException: [repair 
> #ea123bf0-6111-11e5-b992-9f13fa8664c8 on perspectiv/stock_increment_agg, 
> (355657753119264326,366309649129068298]] Validation failed in cblade1.
>         at 
> org.apache.cassandra.repair.ValidationTask.treeReceived(ValidationTask.java:64)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:183)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:399)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:158)
>  ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) 
> ~[apache-cassandra-2.2.1.jar:2.2.1]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_60]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_60]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
