[ https://issues.apache.org/jira/browse/CASSANDRA-11723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefano Ortolani updated CASSANDRA-11723:
-----------------------------------------
    Description:
The upgrade itself seems fine, but any restart of the node might lead to a situation where the node just dies after 30 seconds to 1 minute.
Nothing appears in the logs besides many "FailureDetector.java:456 - Ignoring interval time of 3000892567 for /10.12.a.x" messages, output every second (against all other nodes) in debug.log, plus some spurious GraphiteErrors/ReadRepair notifications:

{code:xml}
DEBUG [GossipStage:1] 2016-05-05 22:29:03,921 FailureDetector.java:456 - Ignoring interval time of 2373187360 for /10.12.a.x
DEBUG [GossipStage:1] 2016-05-05 22:29:03,921 FailureDetector.java:456 - Ignoring interval time of 2000276196 for /10.12.a.y
DEBUG [ReadRepairStage:24] 2016-05-05 22:29:03,990 ReadCallback.java:234 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-152946356843306763, e859fdd2f264485f42030ce261e4e12e) (d6e617ece3b7bec6138b52b8974b8cab vs 31becca666a62b3c4b2fc0bab9902718)
	at org.apache.cassandra.service.DigestResolver.resolve(DigestResolver.java:85) ~[apache-cassandra-3.0.5.jar:3.0.5]
	at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:225) ~[apache-cassandra-3.0.5.jar:3.0.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
DEBUG [GossipStage:1] 2016-05-05 22:29:04,841 FailureDetector.java:456 - Ignoring interval time of 3000299340 for /10.12.33.5
ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-05 22:29:05,692 ScheduledReporter.java:119 - RuntimeException thrown from GraphiteReporter#report. Exception was suppressed.
java.lang.IllegalStateException: Unable to compute ceiling for max when histogram overflowed
	at org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:231) ~[apache-cassandra-3.0.5.jar:3.0.5]
	at org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103) ~[apache-cassandra-3.0.5.jar:3.0.5]
	at com.codahale.metrics.graphite.GraphiteReporter.reportHistogram(GraphiteReporter.java:252) ~[metrics-graphite-3.1.0.jar:3.1.0]
	at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:166) ~[metrics-graphite-3.1.0.jar:3.1.0]
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~[metrics-core-3.1.0.jar:3.1.0]
	at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) ~[metrics-core-3.1.0.jar:3.1.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_60]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_60]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_60]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
{code}

I know this is not much, but nothing else gets to dmesg or to any other log. Any suggestions on how to debug this further?
I upgraded two nodes so far, and it happened on both nodes.

  was: the same description, except that the log excerpt was delimited by {{ }} rather than {code:xml} ... {code}.
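
As one possible way to look further into the FailureDetector messages, the following Python sketch tallies the "Ignoring interval time" lines per endpoint and reports the largest interval seen for each. It is only a sketch under assumptions: the function name `summarize_intervals` is hypothetical, the message format is taken from the excerpt in this ticket, and the reported values are assumed to be nanoseconds (the figures around 2000000000 to 3000000000 would then be roughly 2 to 3 seconds).

```python
import re
from collections import defaultdict

# Message format as it appears in the debug.log excerpt in this ticket.
PATTERN = re.compile(r"Ignoring interval time of (\d+) for (/\S+)")


def summarize_intervals(lines):
    """Return {endpoint: (message_count, max_interval_seconds)}.

    Assumption: the logged interval is in nanoseconds.
    """
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a FailureDetector "Ignoring interval" line
        nanos = int(match.group(1))
        endpoint = match.group(2)
        entry = stats[endpoint]
        entry[0] += 1
        entry[1] = max(entry[1], nanos / 1e9)
    return {endpoint: (count, worst) for endpoint, (count, worst) in stats.items()}


if __name__ == "__main__":
    # Lines copied from the excerpt in this ticket.
    sample = [
        "DEBUG [GossipStage:1] 2016-05-05 22:29:03,921 FailureDetector.java:456 - "
        "Ignoring interval time of 2373187360 for /10.12.a.x",
        "DEBUG [GossipStage:1] 2016-05-05 22:29:03,921 FailureDetector.java:456 - "
        "Ignoring interval time of 2000276196 for /10.12.a.y",
        "DEBUG [GossipStage:1] 2016-05-05 22:29:04,841 FailureDetector.java:456 - "
        "Ignoring interval time of 3000299340 for /10.12.33.5",
    ]
    for endpoint, (count, worst) in sorted(summarize_intervals(sample).items()):
        print(f"{endpoint}: {count} message(s), max interval {worst:.3f} s")
```

To run it against a real log, pass an open file object (for example `open("debug.log")`) to `summarize_intervals`, since the function accepts any iterable of lines. If a single endpoint dominates the counts, that may point at a problem with one peer rather than with the restarted node itself.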
> Cassandra upgrade from 2.1.11 to 3.0.5 leads to unstable nodes
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-11723
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11723
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Stefano Ortolani
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)