[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002051#comment-15002051 ] Marcus Eriksson edited comment on CASSANDRA-10515 at 11/12/15 1:23 PM:
---
it explains the huge amount of sstables in L0. it is not committed or released yet

was (Author: krummas):
it is not committed or released yet

> Commit logs back up with move to 2.1.10
> ----------------------------------------
>
>                 Key: CASSANDRA-10515
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: redhat 6.5, cassandra 2.1.10
>            Reporter: Jeff Griffith
>            Assignee: Branimir Lambov
>              Labels: commitlog, triage
>             Fix For: 3.1, 2.1.x, 2.2.x
>
>         Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, CommitLogProblem.jpg, CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems where some nodes break the 12G commit log max we configured and go as high as 65G or more before C* restarts. Once a node reaches the state of more than 12G of commit log files, "nodetool compactionstats" hangs. Eventually C* restarts without errors (not sure yet whether it is crashing, but I'm checking into it), the cleanup occurs, and the commit logs shrink back down again. Here is the nodetool compactionstats output immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>    compaction type   keyspace   table         completed           total   unit   progress
>         Compaction   SyncCore   *cf1*       61251208033    170643574558   bytes    35.89%
>         Compaction   SyncCore   *cf2*       19262483904     19266079916   bytes    99.98%
>         Compaction   SyncCore   *cf3*        6592197093      6592316682   bytes   100.00%
>         Compaction   SyncCore   *cf4*        3411039555      3411039557   bytes   100.00%
>         Compaction   SyncCore   *cf5*        2879241009      2879487621   bytes    99.99%
>         Compaction   SyncCore   *cf6*       21252493623     21252635196   bytes   100.00%
>         Compaction   SyncCore   *cf7*       81009853587     81009854438   bytes   100.00%
>         Compaction   SyncCore   *cf8*        3005734580      3005768582   bytes   100.00%
> Active compaction remaining time : n/a
> {code}
> I was also doing periodic "nodetool tpstats", which were working but not being logged in system.log on the StatusLogger thread until after the compaction started working again.
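A quick way to confirm an L0 pile-up like the one referenced here is to count sstables per level with the sstablemetadata tool shipped in tools/bin. A minimal sketch; the data path and the keyspace/table names (borrowed from the description) are placeholders, not the reporter's actual layout:

{code}
# Count sstables per LCS level for one table on the affected node.
# sstablemetadata prints an "SSTable Level:" line for each data file.
for f in /var/lib/cassandra/data/SyncCore/cf1/*-Data.db; do
  tools/bin/sstablemetadata "$f" | grep "SSTable Level"
done | sort | uniq -c | sort -rn
{code}

A healthy LCS table keeps L0 small; tens of thousands of files reported at level 0 would match the compaction starvation discussed in this thread.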
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980375#comment-14980375 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/29/15 12:58 PM:
---
[~blambov] [~benedict] See image: Killed two birds with one stone here, it seems. Looking at the logs before the commit log growth, it looks like the IndexOutOfBounds exceptions affected all nodes in this small cluster of 3 at the same time; with RF=3 that probably makes sense, doesn't it?
https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg

was (Author: jeffery.griffith):
[~blambov] [~benedict] Killed two birds with one stone here it seems. Looking at he logs before the commit log growth, it looks like the IndexOutOfBounds exceptions affected all nodes in this small cluster of 3 at the same time, with with RF=3 that probably makes sense, doesn't it?
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980375#comment-14980375 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/29/15 12:59 PM:
---
[~blambov] [~benedict] See image: Killed two birds with one stone here, it seems. Looking at the logs before the commit log growth, it looks like the IndexOutOfBounds exceptions affected all nodes in this small cluster of 3 at the same time; with RF=3 that probably makes sense, doesn't it?
https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg

was (Author: jeffery.griffith):
[~blambov] [~benedict] See image: Killed two birds with one stone here it seems. Looking at he logs before the commit log growth, it looks like the IndexOutOfBounds exceptions affected all nodes in this small cluster of 3 at the same time, with with RF=3 that probably makes sense, doesn't it?
https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978368#comment-14978368 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/28/15 1:00 PM:
---
Great info, thanks [~benedict]! I guess my "all roads lead to Rome" analogy was a good one :-) [~blambov] i have the test running now. normally happens a couple of times a day so i should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it before the problem corrected itself. I believe that our 3 remaining problematic nodes gradually took longer to reduce the # of L0 files than did the rest of our clusters. It took weeks rather than days, and coincidentally I got involved near the end. From what I saw, though, the symptoms seemed to match exactly what [~krummas] described.

was (Author: jeffery.griffith):
Great info, thanks [~benedict] ! I guess my "all roads lead to Rome" analogy was a good one :-) [~blambov] i have the test running now. normally happens a couple of times a day so i should know by this evening. On that first form of growth, I unfortunately did not get a chance to try it before the problem corrected itself. I believe that gradually our 3 remaining problematic nodes too longer to reduce the # of L0 files than did the rest of our clusters. It took weeks rather than days and coincidentally I got involved near the end. From what saw, though, the symptoms seemed to match exactly what [~krummas] described.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978368#comment-14978368 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/28/15 12:58 PM:
---
Great info, thanks [~benedict]! I guess my "all roads lead to Rome" analogy was a good one :-) [~blambov] i have the test running now. normally happens a couple of times a day so i should know by this evening. On that first form of growth, I unfortunately did not get a chance to try it before the problem corrected itself. I believe that gradually our 3 remaining problematic nodes took longer to reduce the # of L0 files than did the rest of our clusters. It took weeks rather than days, and coincidentally I got involved near the end. From what I saw, though, the symptoms seemed to match exactly what [~krummas] described.

was (Author: jeffery.griffith):
Great info, thanks [~benedict] ! I guess my "all roads lead to Rome" analogy was a good one :-) [~blambov] i have the rest running now. normally happens a couple of times a day so i should know by this evening. On that first form of growth, I unfortunately did not get a chance to try it before the problem corrected itself. I believe that gradually our 3 remaining problematic nodes too longer to reduce the # of L0 files than did the rest of our clusters. It took weeks rather than days and coincidentally I got involved near the end. From what saw, though, the symptoms seemed to match exactly what [~krummas] described.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977052#comment-14977052 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/27/15 7:59 PM:
---
[~krummas] [~tjake] something interesting on this second form of commit log growth where all nodes had uncontrolled commit log growth, unlike the first example (many files in L0) where it was isolated nodes. for this latter case, I think i'm able to relate this to a separate problem with an index out of bounds exception. working with [~benedict] it seems like we have that one solved. i'm hopeful that patch will solve this growing commit log problem as well. it seems like all roads lead to rome where rome is commit log growth :-) here is the other JIRA identifying an integer overflow in AbstractNativeCell.java: https://issues.apache.org/jira/browse/CASSANDRA-10579

Still uncertain how to proceed with the first form that seems to be starvation as you have described.

was (Author: jeffery.griffith):
[~krummas] [~tjake] something interesting on this second form of commit log growth where all nodes had uncontrolled commit log growth unless the first example (many files in L0) where it was isolated nodes. for this latter case, I think i'm able to relate this to a separate problem with an index out of bounds exception. working with [~benedict] it seems like we have that one solved. i'm hopeful that patch will solve this growing commit log problem as well. it seems like all roads lead to rome where rome is commit log growth :-) here is the other JIRA identifying an integer overflow in AbstractNativeCell.java https://issues.apache.org/jira/browse/CASSANDRA-10579 Still uncertain how to proceed with the first form that seems to be starvation as you have described.
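When correlating the two problems, it can help to pull the overflow stack traces out of the server log first. A hedged sketch; the system.log location is an assumption and depends on how logging is configured:

{code}
# List each IndexOutOfBoundsException in system.log with a few stack
# frames of context, prefixed with line numbers for cross-referencing
# against the commit log growth timestamps.
grep -n -A 8 "IndexOutOfBoundsException" /var/log/cassandra/system.log
{code}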
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:34 PM:
---
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded pending numbers in the filenames for your convenience to make it easy to see where the backup starts. *-node1.tar.gz is the only one i uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous case where there were lots of L0 files causing thread blocking, but even here it seems like the MemtablePostFlush is stopping on a countdownlatch.
https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note the pid changed because our monitoring detected and restarted the node.
{code}
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
tpstats_20151023-00:57:21_pid_48196_postpend_0.txt
{code}

was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded pending numbers in the filenames for your convenience to make it easy to see where the backup starts. *-node1.tar.gz is the only one i uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous case where there were lots of L0 files causing thread blocking, but even here it seems like the MemtablePostFlush is stopping on a countdownlatch.
https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note the pid changed because our monitoring detected and restarted the node.
{code}
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:16 tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:18 tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:20 tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:22 tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:24 tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:26 tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:28 tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:30 tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:32 tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:34 tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:36 tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:38 tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:40 tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:43 tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:45 tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:47 tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:49 tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:51 tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:53 tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:55 tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:57 tpstats_20151023-00:57:21_pid_48196_postpend_0.txt
{code}
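For reference, a loop along these lines could produce captures named this way. This is a hypothetical reconstruction, not the script actually used; the pgrep pattern, output directory, and the assumption that the Pending value is the third whitespace-separated field of the MemtablePostFlush row in 2.1 tpstats output are all mine:

{code}
#!/usr/bin/env bash
# Every two minutes: record tpstats and a thread dump, embedding the pid
# and the MemtablePostFlush pending count in the tpstats filename.
outdir=/var/tmp/capture
mkdir -p "$outdir"
while true; do
  pid=$(pgrep -f CassandraDaemon | head -n1)
  ts=$(date -u +%Y%m%d-%H:%M:%S)
  tmp="$outdir/tpstats_${ts}_pid_${pid}.txt"
  nodetool tpstats > "$tmp"
  # Pool rows read: name, active, pending, ... so $3 is Pending.
  pending=$(awk '/^MemtablePostFlush/ {print $3}' "$tmp")
  mv "$tmp" "$outdir/tpstats_${ts}_pid_${pid}_postpend_${pending}.txt"
  jstack "$pid" > "$outdir/jstack_${ts}_pid_${pid}.txt" 2>/dev/null
  sleep 120
done
{code}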
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:33 PM:
---
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded pending numbers in the filenames for your convenience to make it easy to see where the backup starts. *-node1.tar.gz is the only one i uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous case where there were lots of L0 files causing thread blocking, but even here it seems like the MemtablePostFlush is stopping on a countdownlatch.
https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note the pid changed because our monitoring detected and restarted the node.
{code}
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:16 tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:18 tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:20 tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:22 tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:24 tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:26 tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:28 tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:30 tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:32 tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:34 tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:36 tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:38 tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:40 tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:43 tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:45 tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:47 tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:49 tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:51 tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2002 Oct 22 20:53 tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:55 tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
-rw-r--r-- 1 jgriffith Y\Domain Users 2180 Oct 22 20:57 tpstats_20151023-00:57:21_pid_48196_postpend_0.txt
{code}

was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded pending numbers in the filenames for your convenience to make it easy to see where the backup starts. *-node1.tar.gz is the only one i uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous case where there were lots of L0 files causing thread blocking, but even here it seems like the MemtablePostFlush is stopping on a countdownlatch.
https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:31 PM:
---
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded pending numbers in the filenames for your convenience to make it easy to see where the backup starts. *-node1.tar.gz is the only one i uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous case where there were lots of L0 files causing thread blocking, but even here it seems like the MemtablePostFlush is stopping on a countdownlatch.
https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded pending numbers in the filenames for your convenience to make it easy to see where the backup starts. *-node1.tar.gz is the only one i uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous case where there were lots of L0 files causing thread blocking, but even here it seems like the MemtablePostFlush is stopping on a countdownlatch.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960769#comment-14960769 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:20 PM:
---
ok, that's good news then, although the weird synchronization of that last example concerns me. we'll apply tools/bin/sstableofflinerelevel to everything above 1K or so?

was (Author: jeffery.griffith):
ok, that's good news then. we'll apply tools/bin/sstableofflinerelevel to everything above 1K or so?
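For anyone following along, the releveling tool runs offline against one table at a time. A sketch of the procedure, assuming the usual `sstableofflinerelevel [--dry-run] <keyspace> <table>` invocation with illustrative keyspace/table names and init-script commands; check the tool's usage output on your build before relying on the flag:

{code}
# Releveling must run with Cassandra stopped on the node.
nodetool drain
sudo service cassandra stop
# Preview the proposed new leveling first, then apply it.
tools/bin/sstableofflinerelevel --dry-run SyncCore cf2
tools/bin/sstableofflinerelevel SyncCore cf2
sudo service cassandra start
{code}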
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960763#comment-14960763 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:17 PM:
---
No, this is only the beginning (300M users and a petabyte to go :) ). We were kind of pausing here since the bigger clusters carry so many more users, but we can definitely do that if we move forward with the rollout. We'll apply the releveling where we can and see how it behaves. is the 5K sstable count enough to be concerned about? i'll do some more analysis on these and compare to clusters that have not yet been upgraded.

was (Author: jeffery.griffith):
No, this is only the beginning (300M users and a petabyte to go :) ) . We were kind of pausing here since the bigger clusters carry so many more users but we can definitely do that if we move forward with the rollout. We'll apply the releveling where can and see how it behaviors. is the 5K sstable count enough to be concerned about? i'll do some more analysis on these and compare to clusters that have not yet been upgraded.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960755#comment-14960755 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:07 PM:
---
[~krummas] I checked a different cluster for sstable counts (see C5commitLogIncrease.jpg). Here they all decided to break the limit at the same time. The largest sstable count in each is about 5K.

was (Author: jeffery.griffith):
[~krummas] I checked a different cluster for sstable counts. (C5commitLogIncrease.jpg) Here they all decided to break the limit at the same time. The largest sstable count in each is about 5K.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960697#comment-14960697 ] Marcus Eriksson edited comment on CASSANDRA-10515 at 10/16/15 1:35 PM:
---
[~tjake] yes, we should probably remove the {{synchronized}} on every method in the compaction strategies and optimize the LCS overlap check like we did with the global check in CASSANDRA-9882 (if possible, have not looked at that code in a while)

was (Author: krummas):
[~tjake] yes, we should probably the {{synchronized}} on every method in the compaction strategies and optimize the LCS overlap check like we did with the global check in CASSANDRA-9882 (if possible, have not looked at that code in a while)
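Until a change like that lands, one crude way to check whether strategy-level synchronization is the choke point on a stuck node is to look at what blocked threads are waiting on in a dump. A sketch, assuming only that jstack can reach the JVM:

{code}
# Show which monitors blocked threads are queued on; compaction-strategy
# classes appearing here repeatedly would point at the synchronized methods.
pid=$(pgrep -f CassandraDaemon | head -n1)
jstack "$pid" | grep -A 3 "java.lang.Thread.State: BLOCKED" | grep "waiting to lock"
{code}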
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960663#comment-14960663 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:17 PM:
---
yes, we just upgraded from 2.0. would that explain the 50k+? i will dig deeper on that CF. i have checked and that is NOT the case on the other nodes. yes, they are balanced in terms of data (40 core machines with lots of memory). in this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only THREE nodes of the thirty are now exhibiting this behavior. for the first few days, several others did; however, they seem to have self-corrected. these three still have not. I will go back and check for large sstable counts to see if that explains all of them. after this first stage, we'll be rolling out to the larger 24-node clusters, but we are pausing here on the small clusters until we figure this out.

was (Author: jeffery.griffith):
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and that is NOT the case on the other nodes. yes, they are balanced in terms of data (40 core machines with lots of memory). in this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only THREE nodes of the thirty are now exhibiting this behavior. for the first few days, several others did however they seem to have self corrected. these three still have not. I will go back and check for large sstable counts to see if that explains all of them. after this first stage, we'll be rolling out to the larger 24-node clusters but we are pausing here on the small clusters until we figure this out.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960663#comment-14960663 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:16 PM:
---
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and that is NOT the case on the other nodes. yes, they are balanced in terms of data (40 core machines with lots of memory). in this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only THREE nodes of the thirty are now exhibiting this behavior. for the first few days, several others did however they seem to have self corrected. these three still have not. I will go back and check for large sstable counts to see if that explains all of them. after this first stage, we'll be rolling out to the larger 24-node clusters but we are pausing here on the small clusters until we figure this out.

was (Author: jeffery.griffith):
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and that is NOT the case on the other nodes. yes, they are balanced in terms of data (40 core machines with lots of memory). in this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only THREE nodes of the thirty are now exhibiting this behavior. for the first few days, several others did however they seem to have self corrected. I will go back and check for large sstable counts to see if that explains all of them. after this first stage, we'll be rolling out to the larger 24-node clusters but we are pausing here on the small clusters until we figure this out.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960663#comment-14960663 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:14 PM:
---
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and that is NOT the case on the other nodes. yes, they are balanced in terms of data (40 core machines with lots of memory). in this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only THREE nodes of the thirty are now exhibiting this behavior. for the first few days, several others did however they seem to have self corrected. I will go back and check for large sstable counts to see if that explains all of them. after this first stage, we'll be rolling out to the larger 24-node clusters but we are pausing here on the small clusters until we figure this out.

was (Author: jeffery.griffith):
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and that is NOT the case on the other nodes. yes, they are balanced in terms of data (40 core machines with lots of memory). in this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only THREE nodes of the thirty are now exhibiting this behavior. for the first few days, several others did however they seem to have self corrected. I will go back and check for large sstable counts to see if that explains all of them. after this first stage, we'll be rolling out to the larger 24-node customers but we are pausing here on the small clusters until we figure this out.
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960606#comment-14960606 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:59 PM:
---
thanks [~krummas] see cfstats-clean.txt which i obfuscated and uploaded. we didn't actually name them CF001 ;-) For your convenience I grabbed the sstable counts > 500:
SSTable count: 3454
SSTable count: 55392   <--- indeed this is NOT the case on other nodes
SSTable count: 687
Also, I've attached our cassandra.yaml

was (Author: jeffery.griffith):
thanks [~krummas] see cfstats-clean.txt which i obfuscated and uploaded. we didn't actually name them CF001 ;-) For your convenience I grabbed the sstable counts > 500:
SSTable count: 3454
SSTable count: 55392   <---
SSTable count: 687
Also, I've attached our cassandra.yaml
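Counts like these can be pulled straight from cfstats; one way to do it, with the 500 threshold mirroring the comment above:

{code}
# Print only the per-table sstable counts above 500 (2.1 uses cfstats).
nodetool cfstats | awk -F: '/SSTable count/ && $2 + 0 > 500 {print "SSTable count:" $2}'
{code}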
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959885#comment-14959885 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:14 AM: -- [~tjake] I monitored live for a few hours to capture the behavior. See RUN3tpstats.jpg in the attachments. Overview: monitoring threads began to block before the memtable flushing did. Memtable flushing seemed to be progressing slowly, and then post-flush operations began to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask
Those three blocked and never came unblocked, so I assume (?) the lock holder never completed or was very, very slow. Eventually a second MemtableFlushWriter thread blocks. I believe that if I let it continue to run, all or many of them would.
{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
        at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
        at org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
        at org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
        at org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
        at org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
        - locked <0x0004a8bc5038> (a org.apache.cassandra.db.compaction.LeveledManifest)
        at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
        at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
        - locked <0x0004a8af17d0> (a org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
        at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
        - locked <0x0004a894df10> (a org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
I see one thread for MemtablePostFlush and this is it:
{code}
"MemtablePostFlush:8" #4866 daemon prio=5 os_prio=0 tid=0x7fd91c0c5800 nid=0x2d93 waiting on condition [0x7fda4b46c000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0005838ba468> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
I followed it for a while longer after this, and it really looks like the post flush stays blocked on that latch forever:
{code}
00:01 MemtableFlushWriter    2     2  2024  0  0
      MemtablePostFlush      1 47159  4277  0  0
      MemtableReclaimMemory  0     0  2024  0  0
00:03 MemtableFlushWriter    3     3  2075  0  0
      MemtableP
{code}
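The timestamped samples at the end were collected by polling tpstats by hand; a loop like the following sketch reproduces them, assuming nodetool is on the PATH and the standard tpstats layout of pool name followed by active/pending/completed/blocked columns:
{code}
#!/usr/bin/env python
# Sketch: sample nodetool tpstats periodically and log the flush-related pools,
# the way the 00:01/00:03 samples above were collected.
import subprocess
import time

POOLS = ("MemtableFlushWriter", "MemtablePostFlush", "MemtableReclaimMemory")

while True:
    out = subprocess.check_output(["nodetool", "tpstats"]).decode("utf-8", "replace")
    stamp = time.strftime("%H:%M")
    for line in out.splitlines():
        if line.split() and line.split()[0] in POOLS:
            print("%s %s" % (stamp, line.rstrip()))
    time.sleep(120)
{code}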
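To picture why MemtablePostFlush can park on that CountDownLatch forever: post-flush tasks run in order, each waiting for an earlier flush to signal completion, so one flush that never signals backs up everything queued behind it. A toy illustration of that pattern in Python, with threading.Event standing in for the CountDownLatch; this is an analogy of the blocking shape seen in the trace, not Cassandra's actual code:
{code}
#!/usr/bin/env python
# Toy illustration of the MemtablePostFlush trace above: a post-flush step waits
# on a signal the flush writer is supposed to fire. If the writer errors out or
# stalls before signalling, the waiter parks forever and everything queued behind
# it piles up -- matching the growing "pending" column in tpstats.
import threading

flush_done = threading.Event()  # stands in for the CountDownLatch in the trace

def flush_writer(fail):
    # The real flush writer would write the sstable, then signal completion.
    # On failure it returns without signalling: the latch never counts down.
    if not fail:
        flush_done.set()

def post_flush():
    flush_done.wait()  # parks here, like CountDownLatch.await() in the trace
    print("commit log segments can now be recycled")

w = threading.Thread(target=flush_writer, args=(True,))  # simulate a failed flush
p = threading.Thread(target=post_flush)
p.daemon = True  # lets the demo exit while the waiter is still parked
w.start(); p.start()
w.join()
print("post-flush still waiting: %s" % p.is_alive())  # True: blocked forever
{code}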
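The RUNNABLE CompactionExecutor trace is also consistent with spinning rather than deadlock: with the 55k+ L0 sstable count reported earlier in this thread, a candidate scan that compares sstable ranges pairwise in overlapping() is roughly quadratic in the sstable count. A back-of-envelope estimate (the comparison rate is an assumption; only the order of magnitude matters):
{code}
#!/usr/bin/env python
# Back-of-envelope for why the RUNNABLE CompactionExecutor can look hung while
# holding the strategy lock. The exact constant depends on the code path; this
# only shows the order of magnitude.
sstables = 55392             # the L0 count reported earlier in this thread
checks = sstables * sstables # ~pairwise Bounds.intersects comparisons
rate = 50e6                  # assumption: ~50M cheap comparisons per second
print("%.1e checks, ~%.0f seconds per candidate pass" % (checks, checks / rate))
# => 3.1e+09 checks, roughly a minute per pass, repeated every time the strategy
#    looks for a new background task.
{code}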
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959525#comment-14959525 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 8:06 PM: - Yeah, it doesn't look like the locking thread is deadlocked at all. I know this is a stretch, but considering we just migrated from 2.0.x, could there be something data-specific that is confusing the compaction? Not sure where to check for slow flushes. Should I just watch tpstats? was (Author: jeffery.griffith): Yeah doesn't look blocked. How can i check for the slow flushes? > Commit logs back up with move to 2.1.10 > --- > > Key: CASSANDRA-10515 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10515 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: redhat 6.5, cassandra 2.1.10 >Reporter: Jeff Griffith >Assignee: Branimir Lambov >Priority: Critical > Labels: commitlog, triage > Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, > system.log.clean > > > After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems > where some nodes break the 12G commit log max we configured and go as high as > 65G or more before it restarts. Once it reaches the state of more than 12G > commit log files, "nodetool compactionstats" hangs. Eventually C* restarts > without errors (not sure yet whether it is crashing but I'm checking into it) > and the cleanup occurs and the commit logs shrink back down again. Here is > the nodetool compactionstats immediately after restart. > {code} > jgriffith@prod1xc1.c2.bf1:~$ ndc > pending tasks: 2185 >compaction type keyspace table completed > totalunit progress > Compaction SyncCore *cf1* 61251208033 > 170643574558 bytes 35.89% > Compaction SyncCore *cf2* 19262483904 > 19266079916 bytes 99.98% > Compaction SyncCore *cf3*6592197093 > 6592316682 bytes100.00% > Compaction SyncCore *cf4*3411039555 > 3411039557 bytes100.00% > Compaction SyncCore *cf5*2879241009 > 2879487621 bytes 99.99% > Compaction SyncCore *cf6* 21252493623 > 21252635196 bytes100.00% > Compaction SyncCore *cf7* 81009853587 > 81009854438 bytes100.00% > Compaction SyncCore *cf8*3005734580 > 3005768582 bytes100.00% > Active compaction remaining time :n/a > {code} > I was also doing periodic "nodetool tpstats" which were working but not being > logged in system.log on the StatusLogger thread until after the compaction > started working again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
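On where to check for slow flushes: besides tpstats, the flush lifecycle is visible in system.log. A sketch that pairs each flush writer's start and completion lines and flags slow flushes; it assumes 2.1-style "Writing Memtable-..." and "Completed flushing ..." messages and the log line layout quoted elsewhere in this ticket, so adjust the regex to your appender pattern:
{code}
#!/usr/bin/env python
# Sketch: measure flush durations from system.log by pairing a flush writer's
# "Writing Memtable-..." line with its "Completed flushing" line, e.g.
#   INFO [MemtableFlushWriter:1166] 2015-10-14 11:21:22,940 Memtable.java:... - Writing Memtable-cf@...
# Both message texts and the layout are assumptions about the 2.1 log format.
import re
import sys
from datetime import datetime

LINE = re.compile(r"\[(MemtableFlushWriter:\d+)\] "
                  r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \S+ - (.*)")
started = {}

for raw in open(sys.argv[1]):
    m = LINE.search(raw)
    if not m:
        continue
    thread, stamp, msg = m.groups()
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    if msg.startswith("Writing Memtable"):
        started[thread] = (ts, msg.split("@")[0])
    elif msg.startswith("Completed flushing") and thread in started:
        t0, what = started.pop(thread)
        secs = (ts - t0).total_seconds()
        if secs > 10:  # assumption: flag anything slower than 10s as "slow"
            print("%s took %.1fs (finished %s)" % (what, secs, stamp))
{code}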
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959338#comment-14959338 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 7:19 PM: - Yeah, I saw all the blocked threads behind it. Checking to see which monitoring tools are not waiting for the previous instance to finish. But this is just an ugly side effect, isn't it? (A side effect of the lock?) I will disable all monitoring and restart to be sure. (UPDATE: looks like a cron job piled those up after things got stuck. I disabled it to be sure.) was (Author: jeffery.griffith): Yeah, i saw all the blocked threads behind it. checking to see what monitoring tools are not checking for the previous instance to finish. but this is just an ugly side effect, isn't it? (a side effect of lock?) i will disable all monitoring and restart to be sure. > Commit logs back up with move to 2.1.10 > --- > > Key: CASSANDRA-10515 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10515 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: redhat 6.5, cassandra 2.1.10 >Reporter: Jeff Griffith >Assignee: Branimir Lambov >Priority: Critical > Labels: commitlog, triage > Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, > system.log.clean > > > After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems > where some nodes break the 12G commit log max we configured and go as high as > 65G or more before it restarts. Once it reaches the state of more than 12G > commit log files, "nodetool compactionstats" hangs. Eventually C* restarts > without errors (not sure yet whether it is crashing but I'm checking into it) > and the cleanup occurs and the commit logs shrink back down again. Here is > the nodetool compactionstats immediately after restart. > {code} > jgriffith@prod1xc1.c2.bf1:~$ ndc > pending tasks: 2185 >compaction type keyspace table completed > totalunit progress > Compaction SyncCore *cf1* 61251208033 > 170643574558 bytes 35.89% > Compaction SyncCore *cf2* 19262483904 > 19266079916 bytes 99.98% > Compaction SyncCore *cf3*6592197093 > 6592316682 bytes100.00% > Compaction SyncCore *cf4*3411039555 > 3411039557 bytes100.00% > Compaction SyncCore *cf5*2879241009 > 2879487621 bytes 99.99% > Compaction SyncCore *cf6* 21252493623 > 21252635196 bytes100.00% > Compaction SyncCore *cf7* 81009853587 > 81009854438 bytes100.00% > Compaction SyncCore *cf8*3005734580 > 3005768582 bytes100.00% > Active compaction remaining time :n/a > {code} > I was also doing periodic "nodetool tpstats" which were working but not being > logged in system.log on the StatusLogger thread until after the compaction > started working again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
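On the cron pile-up: the usual guard is to make the monitoring wrapper refuse to start while a previous invocation is still running. A sketch using an advisory file lock; the lock path and the wrapped nodetool command are placeholders:
{code}
#!/usr/bin/env python
# Sketch: wrap a cron-driven nodetool check in an advisory file lock so that a
# stuck invocation (e.g. compactionstats hanging, as above) cannot pile up new
# ones behind it.
import fcntl
import subprocess
import sys

LOCKFILE = "/var/tmp/nodetool-monitor.lock"  # placeholder path

lock = open(LOCKFILE, "w")
try:
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit("previous monitoring run still active; skipping this one")

# The wrapped check; a hung nodetool here no longer stacks up cron invocations.
subprocess.call(["nodetool", "tpstats"])
{code}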
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959313#comment-14959313 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 6:08 PM: - Oh look, the memtable flusher is blocked on the same lock:
{code}
"MemtableFlushWriter:1166" #18316 daemon prio=5 os_prio=0 tid=0x7f33ac5f8800 nid=0xb649 waiting for monitor entry [0x7f31c5acc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.handleNotification(WrappingCompactionStrategy.java:250)
        - waiting to lock <0x000498151af8> (a org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
        at org.apache.cassandra.db.DataTracker.notifyAdded(DataTracker.java:518)
{code}
I don't know how hot that particular code is, but every stack trace showed the lock at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77) or deeper. was (Author: jeffery.griffith): Oh look, the memtable flusher is blocked on the same lock: {code} "MemtableFlushWriter:1166" #18316 daemon prio=5 os_prio=0 tid=0x7f33ac5f8800 nid=0xb649 waiting for monitor entry [0x7f31c5acc000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.handleNotification(WrappingCompactionStrategy.java:250) - waiting to lock <0x000498151af8> (a org.apache.cassandra.db.compaction.WrappingCompactionStrategy) at org.apache.cassandra.db.DataTracker.notifyAdded(DataTracker.java:518) {code} > Commit logs back up with move to 2.1.10 > --- > > Key: CASSANDRA-10515 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10515 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: redhat 6.5, cassandra 2.1.10 >Reporter: Jeff Griffith >Assignee: Branimir Lambov >Priority: Critical > Labels: commitlog, triage > Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, > system.log.clean > > > After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems > where some nodes break the 12G commit log max we configured and go as high as > 65G or more before it restarts. Once it reaches the state of more than 12G > commit log files, "nodetool compactionstats" hangs. Eventually C* restarts > without errors (not sure yet whether it is crashing but I'm checking into it) > and the cleanup occurs and the commit logs shrink back down again. Here is > the nodetool compactionstats immediately after restart. > {code} > jgriffith@prod1xc1.c2.bf1:~$ ndc > pending tasks: 2185 >compaction type keyspace table completed > totalunit progress > Compaction SyncCore *cf1* 61251208033 > 170643574558 bytes 35.89% > Compaction SyncCore *cf2* 19262483904 > 19266079916 bytes 99.98% > Compaction SyncCore *cf3*6592197093 > 6592316682 bytes100.00% > Compaction SyncCore *cf4*3411039555 > 3411039557 bytes100.00% > Compaction SyncCore *cf5*2879241009 > 2879487621 bytes 99.99% > Compaction SyncCore *cf6* 21252493623 > 21252635196 bytes100.00% > Compaction SyncCore *cf7* 81009853587 > 81009854438 bytes100.00% > Compaction SyncCore *cf8*3005734580 > 3005768582 bytes100.00% > Active compaction remaining time :n/a > {code} > I was also doing periodic "nodetool tpstats" which were working but not being > logged in system.log on the StatusLogger thread until after the compaction > started working again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
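For anyone reproducing this, counting how many threads are parked behind a monitor makes the contention obvious from a single thread dump. A sketch that runs jstack against the Cassandra PID and tallies "waiting to lock" entries per monitor, assuming the dump format quoted above:
{code}
#!/usr/bin/env python
# Sketch: given one jstack of the Cassandra JVM, report how many threads are
# "waiting to lock" each monitor and which thread holds ("- locked") it.
# Usage: ./hot_monitors.py <cassandra-pid>
import re
import subprocess
import sys
from collections import defaultdict

dump = subprocess.check_output(["jstack", sys.argv[1]]).decode("utf-8", "replace")

waiters = defaultdict(int)
holders = {}
thread = None
for line in dump.splitlines():
    if line.startswith('"'):
        thread = line.split('"')[1]  # thread name from the header line
    m = re.search(r"- waiting to lock <(\S+)> \(a (\S+)\)", line)
    if m:
        waiters["%s (a %s)" % (m.group(1), m.group(2))] += 1
    m = re.search(r"- locked <(\S+)> \(a (\S+)\)", line)
    if m:
        holders["%s (a %s)" % (m.group(1), m.group(2))] = thread

for key, count in sorted(waiters.items(), key=lambda kv: -kv[1]):
    print("%2d threads waiting on %s, held by %s" % (count, key, holders.get(key, "?")))
{code}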
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956970#comment-14956970 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/14/15 1:49 PM: - CommitLogSize.jpg matches the period in the log file system.log.clean was (Author: jeffery.griffith): This matches the period in the log file system.log.clean > Commit logs back up with move to 2.1.10 > --- > > Key: CASSANDRA-10515 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10515 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: redhat 6.5, cassandra 2.1.10 >Reporter: Jeff Griffith >Assignee: Branimir Lambov >Priority: Critical > Labels: commitlog, triage > Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean > > > After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems > where some nodes break the 12G commit log max we configured and go as high as > 65G or more before it restarts. Once it reaches the state of more than 12G > commit log files, "nodetool compactionstats" hangs. Eventually C* restarts > without errors (not sure yet whether it is crashing but I'm checking into it) > and the cleanup occurs and the commit logs shrink back down again. Here is > the nodetool compactionstats immediately after restart. > {code} > jgriffith@prod1xc1.c2.bf1:~$ ndc > pending tasks: 2185 >compaction type keyspace table completed > totalunit progress > Compaction SyncCore *cf1* 61251208033 > 170643574558 bytes 35.89% > Compaction SyncCore *cf2* 19262483904 > 19266079916 bytes 99.98% > Compaction SyncCore *cf3*6592197093 > 6592316682 bytes100.00% > Compaction SyncCore *cf4*3411039555 > 3411039557 bytes100.00% > Compaction SyncCore *cf5*2879241009 > 2879487621 bytes 99.99% > Compaction SyncCore *cf6* 21252493623 > 21252635196 bytes100.00% > Compaction SyncCore *cf7* 81009853587 > 81009854438 bytes100.00% > Compaction SyncCore *cf8*3005734580 > 3005768582 bytes100.00% > Active compaction remaining time :n/a > {code} > I was also doing periodic "nodetool tpstats" which were working but not being > logged in system.log on the StatusLogger thread until after the compaction > started working again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10
[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956966#comment-14956966 ] Jeff Griffith edited comment on CASSANDRA-10515 at 10/14/15 1:49 PM: - I cleaned up a log and attached it (system.log.clean) for a period that began already in trouble but suddenly started (flushing?) and compacting again. At 11:25 there was a huge compaction. Things seemed normal for a while, but the commit logs began to grow again. Note that I have cleaned out a bunch of warnings like this because we see them everywhere: WARN [SharedPool-Worker-4] 2015-10-14 11:21:22,940 BatchStatement.java:252 - Batch of prepared statements for [xx] is of size 5253, exceeding specified threshold of 5120 by 133. was (Author: jeffery.griffith): I cleaned up a log and attached it for a period that begin already in trouble, but suddenly started (flushing?) and compacting again. At 11:25 there was a huge compaction. Things seemed normal for a while but the commit logs began to grow again. Note that I have cleaned out a bunch of warnings like this because we see them everywhere: WARN [SharedPool-Worker-4] 2015-10-14 11:21:22,940 BatchStatement.java:252 - Batch of prepared statements for [xx] is of size 5253, exceeding specified threshold of 5120 by 133. > Commit logs back up with move to 2.1.10 > --- > > Key: CASSANDRA-10515 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10515 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: redhat 6.5, cassandra 2.1.10 >Reporter: Jeff Griffith >Assignee: Branimir Lambov >Priority: Critical > Labels: commitlog, triage > Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean > > > After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems > where some nodes break the 12G commit log max we configured and go as high as > 65G or more before it restarts. Once it reaches the state of more than 12G > commit log files, "nodetool compactionstats" hangs. Eventually C* restarts > without errors (not sure yet whether it is crashing but I'm checking into it) > and the cleanup occurs and the commit logs shrink back down again. Here is > the nodetool compactionstats immediately after restart. > {code} > jgriffith@prod1xc1.c2.bf1:~$ ndc > pending tasks: 2185 >compaction type keyspace table completed > totalunit progress > Compaction SyncCore *cf1* 61251208033 > 170643574558 bytes 35.89% > Compaction SyncCore *cf2* 19262483904 > 19266079916 bytes 99.98% > Compaction SyncCore *cf3*6592197093 > 6592316682 bytes100.00% > Compaction SyncCore *cf4*3411039555 > 3411039557 bytes100.00% > Compaction SyncCore *cf5*2879241009 > 2879487621 bytes 99.99% > Compaction SyncCore *cf6* 21252493623 > 21252635196 bytes100.00% > Compaction SyncCore *cf7* 81009853587 > 81009854438 bytes100.00% > Compaction SyncCore *cf8*3005734580 > 3005768582 bytes100.00% > Active compaction remaining time :n/a > {code} > I was also doing periodic "nodetool tpstats" which were working but not being > logged in system.log on the StatusLogger thread until after the compaction > started working again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
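For anyone preparing a similar attachment, the scrubbing described above is easy to script. A sketch that drops the noisy batch-size warnings while keeping every other line; the filter string matches the WARN line quoted above:
{code}
#!/usr/bin/env python
# Sketch: produce a system.log.clean by dropping the repetitive batch-size
# warnings quoted above, keeping all other lines intact.
# Usage: ./scrub_log.py system.log   (writes system.log.clean alongside it)
import sys

NOISE = "Batch of prepared statements for"  # matches the WARN lines shown above

with open(sys.argv[1]) as src, open(sys.argv[1] + ".clean", "w") as dst:
    for line in src:
        if NOISE not in line:
            dst.write(line)
{code}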