[ https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417 ]

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:34 PM:
---------------------------------------------------------------------

[~krummas] [~tjake] Here is a separate instance of the commit logs breaking our 12G 
setting, but with different behavior. I captured the whole thing with thread dumps 
and tpstats every two minutes, and I've embedded the pending numbers in the 
filenames so it is easy to see where the backup starts. *-node1.tar.gz is the only 
one I uploaded since the files were so large, but note in the Dashboard.jpg file 
that all three nodes break the limit at about the same time; I can upload the 
others if that would be useful. This case seems different from the previous one, 
where lots of L0 files were causing thread blocking, but even here 
MemtablePostFlush appears to be stuck waiting on a CountDownLatch.
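
For reference, here is a minimal sketch of the kind of two-minute capture loop that 
would produce files named like the ones listed below. The actual script is not 
attached, so the commands, paths, and the use of the MemtablePostFlush Pending 
column are assumptions:

{code}
#!/bin/bash
# Hypothetical reconstruction of the capture loop (not the actual script from
# this ticket): snapshot tpstats and a thread dump every two minutes, embedding
# the MemtablePostFlush "Pending" count in the tpstats filename.
while true; do
  ts=$(date -u +%Y%m%d-%H:%M:%S)
  pid=$(pgrep -f CassandraDaemon | head -1)
  # In "nodetool tpstats", column 3 is the Pending count for each pool.
  pending=$(nodetool tpstats | awk '/^MemtablePostFlush/ {print $3}')
  nodetool tpstats > "tpstats_${ts}_pid_${pid}_postpend_${pending}.txt"
  jstack "$pid" > "threads_${ts}_pid_${pid}.txt"
  sleep 120
done
{code}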

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period; here is the first occurrence. Note that the 
pid changes partway through because our monitoring detected the problem and 
restarted the node.

{code}
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}
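
As a quick pointer for anyone digging into the tarball: the CountDownLatch wait 
mentioned above should be visible directly in the captured thread dumps. Something 
along these lines would pull out the relevant stack (the thread-dump filenames are 
assumptions; adjust to whatever is actually in the archive):

{code}
# Hypothetical check, not from the ticket: print the MemtablePostFlush thread's
# stack from each captured thread dump; a stuck flush should show a frame in
# java.util.concurrent.CountDownLatch.await(). Filenames are assumed.
grep -A 12 '"MemtablePostFlush' threads_*.txt
{code}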



was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of the commit logs breaking our 12G 
setting, but with different behavior. I captured the whole thing with thread dumps 
and tpstats every two minutes, and I've embedded the pending numbers in the 
filenames so it is easy to see where the backup starts. *-node1.tar.gz is the only 
one I uploaded since the files were so large, but note in the Dashboard.jpg file 
that all three nodes break the limit at about the same time; I can upload the 
others if that would be useful. This case seems different from the previous one, 
where lots of L0 files were causing thread blocking, but even here 
MemtablePostFlush appears to be stuck waiting on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period; here is the first occurrence. Note that the 
pid changes partway through because our monitoring detected the problem and 
restarted the node.

{code}
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:16 tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:18 tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:20 tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:22 tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:24 tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:26 tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:28 tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:30 tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:32 tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:34 tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:36 tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:38 tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:40 tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:43 tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:45 tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:47 tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:49 tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:51 tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:53 tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:55 tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:57 tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}


> Commit logs back up with move to 2.1.10
> ---------------------------------------
>
>                 Key: CASSANDRA-10515
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: redhat 6.5, cassandra 2.1.10
>            Reporter: Jeff Griffith
>            Assignee: Branimir Lambov
>            Priority: Critical
>              Labels: commitlog, triage
>         Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from Cassandra 2.0.x to 2.1.10, we began seeing problems where 
> some nodes break the 12G commit log max we configured and grow as high as 65G or 
> more before the node restarts. Once the commit logs exceed 12G, "nodetool 
> compactionstats" hangs. Eventually C* restarts without errors (not sure yet 
> whether it is crashing, but I'm checking into it), the cleanup occurs, and the 
> commit logs shrink back down again. Here is the nodetool compactionstats output 
> immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>    compaction type   keyspace                          table     completed          total    unit   progress
>         Compaction   SyncCore                          *cf1*   61251208033   170643574558   bytes     35.89%
>         Compaction   SyncCore                          *cf2*   19262483904    19266079916   bytes     99.98%
>         Compaction   SyncCore                          *cf3*    6592197093     6592316682   bytes    100.00%
>         Compaction   SyncCore                          *cf4*    3411039555     3411039557   bytes    100.00%
>         Compaction   SyncCore                          *cf5*    2879241009     2879487621   bytes     99.99%
>         Compaction   SyncCore                          *cf6*   21252493623    21252635196   bytes    100.00%
>         Compaction   SyncCore                          *cf7*   81009853587    81009854438   bytes    100.00%
>         Compaction   SyncCore                          *cf8*    3005734580     3005768582   bytes    100.00%
> Active compaction remaining time :        n/a
> {code}
> I was also running periodic "nodetool tpstats" commands, which worked, but the 
> equivalent StatusLogger output was not appearing in system.log until after 
> compaction started working again.
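
As a side note on the 12G cap mentioned in the description: a simple external check 
along these lines is one way to catch the condition before a node gets to 65G. The 
limit value and commit log path below are assumptions; the attached cassandra.yaml 
has the actual settings.

{code}
#!/bin/bash
# Hypothetical watchdog (not part of this ticket): flag a node whose on-disk
# commit log has grown past the configured cap. The 12288 MB limit and the
# commit log directory are assumptions; see the attached cassandra.yaml.
LIMIT_MB=12288
CL_DIR=/var/lib/cassandra/commitlog
used_mb=$(du -sm "$CL_DIR" | awk '{print $1}')
if [ "$used_mb" -gt "$LIMIT_MB" ]; then
  echo "WARNING: commit log at ${used_mb} MB exceeds the ${LIMIT_MB} MB cap"
fi
{code}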



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
