[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417 ]
Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:34 PM:
---------------------------------------------------------------------

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting, but with different behavior. I captured the whole episode with thread dumps and tpstats every two minutes, and embedded the pending counts in the filenames to make it easy to see where the backup starts. *-node1.tar.gz is the only archive I uploaded since the files were so large, but note in Dashboard.jpg that all three nodes break the limit at about the same time; I can upload the others if useful. This case seems different from the previous one, where a large number of L0 files caused thread blocking, but even here MemtablePostFlush appears to be blocked on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during the capture period; here is the first occurrence. Note that the pid changes partway through because our monitoring detected the backup and restarted the node.

{code}
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
tpstats_20151023-00:57:21_pid_48196_postpend_0.txt
{code}
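To illustrate the suspected failure mode, here is a minimal, hypothetical Java sketch (invented names, not Cassandra's actual code) of how a single-threaded post-flush executor wedges behind a CountDownLatch: the head-of-queue task awaits a latch that only an earlier, stalled flush would count down, so every later task piles up as pending, the same shape as the growing postpend numbers above.

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PostFlushStallSketch {
    public static void main(String[] args) throws Exception {
        // Single-threaded, ordered executor, analogous to MemtablePostFlush.
        ExecutorService postFlush = Executors.newSingleThreadExecutor();

        // Released only when some prior work (e.g. an earlier memtable flush)
        // finishes. If that work stalls, the latch never opens.
        CountDownLatch priorFlushDone = new CountDownLatch(1);

        // The head task parks in await() -- the state the thread dumps in the
        // attached tarball suggest for MemtablePostFlush...
        postFlush.submit(() -> {
            try {
                priorFlushDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // ...so none of these ever run; they just accumulate as "pending",
        // and the commit log segments they would release stay on disk.
        for (int i = 0; i < 10; i++) {
            final int n = i;
            postFlush.submit(() -> System.out.println("post-flush task " + n));
        }

        TimeUnit.SECONDS.sleep(2);
        System.out.println("latch still closed; pending tasks never drain");
        postFlush.shutdownNow();
    }
}
{code}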
> Commit logs back up with move to 2.1.10
> ---------------------------------------
>
>                 Key: CASSANDRA-10515
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: redhat 6.5, cassandra 2.1.10
>            Reporter: Jeff Griffith
>            Assignee: Branimir Lambov
>            Priority: Critical
>              Labels: commitlog, triage
>         Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from Cassandra 2.0.x to 2.1.10, we began seeing problems where some nodes break the 12G commit log maximum we configured and grow to 65G or more before the node restarts. Once the commit log exceeds 12G, "nodetool compactionstats" hangs. Eventually C* restarts without errors (we are not yet sure whether it is crashing; I'm checking into it), cleanup runs, and the commit logs shrink back down. Here is the nodetool compactionstats output immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>    compaction type   keyspace   table          completed          total   unit   progress
>         Compaction   SyncCore   *cf1*        61251208033   170643574558   bytes     35.89%
>         Compaction   SyncCore   *cf2*        19262483904    19266079916   bytes     99.98%
>         Compaction   SyncCore   *cf3*         6592197093     6592316682   bytes    100.00%
>         Compaction   SyncCore   *cf4*         3411039555     3411039557   bytes    100.00%
>         Compaction   SyncCore   *cf5*         2879241009     2879487621   bytes     99.99%
>         Compaction   SyncCore   *cf6*        21252493623    21252635196   bytes    100.00%
>         Compaction   SyncCore   *cf7*        81009853587    81009854438   bytes    100.00%
>         Compaction   SyncCore   *cf8*         3005734580     3005768582   bytes    100.00%
> Active compaction remaining time : n/a
> {code}
> I was also running "nodetool tpstats" periodically; the command itself kept working, but the StatusLogger thread did not log to system.log again until compaction started working again.
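For anyone trying to reproduce the monitoring that caught this, here is a rough, hypothetical watchdog sketch (the directory path, cap, and class name are my assumptions, not taken from the attachments): it sums the commit log directory and flags the node once the total exceeds the configured cap, 12G here, presumably set via commitlog_total_space_in_mb (12288) in the attached cassandra.yaml.

{code}
import java.io.File;

public class CommitLogWatchdog {
    // Assumed path: the stock commitlog_directory; adjust to match cassandra.yaml.
    static final File COMMITLOG_DIR = new File("/var/lib/cassandra/commitlog");
    // 12G cap, matching commitlog_total_space_in_mb: 12288.
    static final long CAP_BYTES = 12L * 1024 * 1024 * 1024;

    public static void main(String[] args) {
        long total = 0;
        File[] segments = COMMITLOG_DIR.listFiles();
        if (segments != null)
            for (File segment : segments)
                total += segment.length();

        System.out.printf("commit log size: %.2f GiB%n",
                          total / (1024.0 * 1024 * 1024));
        if (total > CAP_BYTES) {
            // The monitoring described above restarts the node at this point;
            // capturing tpstats and a thread dump first is what produced the
            // postpend-numbered files shown earlier.
            System.err.println("commit log exceeds configured cap");
            System.exit(1);
        }
    }
}
{code}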