[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-11-12 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002051#comment-15002051
 ] 

Marcus Eriksson edited comment on CASSANDRA-10515 at 11/12/15 1:23 PM:
---

It explains the huge number of sstables in L0.

It is not committed or released yet.


was (Author: krummas):
it is not committed or released yet

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>  Labels: commitlog, triage
> Fix For: 3.1, 2.1.x, 2.2.x
>
> Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, 
> CommitLogProblem.jpg, CommitLogSize.jpg, 
> MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, 
> cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and grow as high 
> as 65G or more before the node restarts. Once the commit log files exceed 
> 12G, "nodetool compactionstats" hangs. Eventually C* restarts without errors 
> (not sure yet whether it is crashing; I'm checking into it), the cleanup 
> occurs, and the commit logs shrink back down again. Here is the nodetool 
> compactionstats output immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>    compaction type   keyspace   table      completed          total    unit   progress
>         Compaction   SyncCore   *cf1*    61251208033   170643574558   bytes     35.89%
>         Compaction   SyncCore   *cf2*    19262483904    19266079916   bytes     99.98%
>         Compaction   SyncCore   *cf3*     6592197093     6592316682   bytes    100.00%
>         Compaction   SyncCore   *cf4*     3411039555     3411039557   bytes    100.00%
>         Compaction   SyncCore   *cf5*     2879241009     2879487621   bytes     99.99%
>         Compaction   SyncCore   *cf6*    21252493623    21252635196   bytes    100.00%
>         Compaction   SyncCore   *cf7*    81009853587    81009854438   bytes    100.00%
>         Compaction   SyncCore   *cf8*     3005734580     3005768582   bytes    100.00%
> Active compaction remaining time :        n/a
> {code}
> I was also running "nodetool tpstats" periodically; the commands worked, but 
> nothing was logged to system.log by the StatusLogger thread until after 
> compaction started working again.
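
A minimal sketch of the kind of commit-log-size check being described (plain 
Java; the directory path and the 12G threshold are assumptions mirroring this 
report, and the actual cap is whatever commitlog_total_space_in_mb is set to in 
cassandra.yaml):

{code}
import java.io.File;

// Minimal sketch only: sums commit log segment sizes and compares them to a cap.
// The default path and the 12G threshold are assumptions mirroring this report.
public class CommitLogSizeCheck
{
    public static void main(String[] args)
    {
        File dir = new File(args.length > 0 ? args[0] : "/var/lib/cassandra/commitlog");
        long capBytes = 12L * 1024 * 1024 * 1024; // ~12G, cf. commitlog_total_space_in_mb
        long total = 0;
        File[] segments = dir.listFiles();
        if (segments != null)
            for (File f : segments)
                if (f.isFile())
                    total += f.length();
        System.out.printf("commit log size: %.1f GB%n", total / (1024.0 * 1024 * 1024));
        if (total > capBytes)
            System.out.println("WARNING: commit log exceeds the configured cap");
    }
}
{code}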



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-29 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980375#comment-14980375
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/29/15 12:58 PM:
--

[~blambov] [~benedict]

See image: Killed two birds with one stone here, it seems. Looking at the logs 
before the commit log growth, it looks like the IndexOutOfBounds exceptions 
affected all nodes in this small cluster of 3 at the same time; with RF=3 that 
probably makes sense, doesn't it?

https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg


was (Author: jeffery.griffith):
[~blambov] [~benedict]

Killed two birds with one stone here it seems. Looking at he logs before the 
commit log growth, it looks like the IndexOutOfBounds exceptions affected all 
nodes in this small cluster of 3 at the same time, with with RF=3 that probably 
makes sense, doesn't it?




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-29 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980375#comment-14980375
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/29/15 12:59 PM:
--

[~blambov] [~benedict]

See image: Killed two birds with one stone here it seems. Looking at the logs 
before the commit log growth, it looks like the IndexOutOfBounds exceptions 
affected all nodes in this small cluster of 3 at the same time, with with RF=3 
that probably makes sense, doesn't it?

https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg


was (Author: jeffery.griffith):
[~blambov] [~benedict]

See image: Killed two birds with one stone here it seems. Looking at he logs 
before the commit log growth, it looks like the IndexOutOfBounds exceptions 
affected all nodes in this small cluster of 3 at the same time, with with RF=3 
that probably makes sense, doesn't it?

https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-28 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978368#comment-14978368
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/28/15 1:00 PM:
-

Great info, thanks [~benedict]! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] I have the test running now. It normally happens a couple of times a 
day, so I should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe our 3 remaining problematic 
nodes simply took longer to reduce the # of L0 files than did the rest of our 
clusters. It took weeks rather than days, and coincidentally I got involved 
near the end. From what I saw, though, the symptoms seemed to match exactly 
what [~krummas] described.


was (Author: jeffery.griffith):
Great info, thanks [~benedict] ! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] i have the test running now. normally happens a couple of times a 
day so i should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe that gradually our 3 remaining 
problematic nodes too longer to reduce the # of L0 files than did the rest of 
our clusters. It took weeks rather than days and coincidentally I got involved 
near the end. From what saw, though, the symptoms seemed to match exactly what 
[~krummas] described.



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-28 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978368#comment-14978368
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/28/15 12:58 PM:
--

Great info, thanks [~benedict]! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] I have the test running now. It normally happens a couple of times a 
day, so I should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe our 3 remaining problematic 
nodes simply took longer to reduce the # of L0 files than did the rest of our 
clusters. It took weeks rather than days, and coincidentally I got involved 
near the end. From what I saw, though, the symptoms seemed to match exactly 
what [~krummas] described.


was (Author: jeffery.griffith):
Great info, thanks [~benedict] ! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] i have the rest running now. normally happens a couple of times a 
day so i should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe that gradually our 3 remaining 
problematic nodes too longer to reduce the # of L0 files than did the rest of 
our clusters. It took weeks rather than days and coincidentally I got involved 
near the end. From what saw, though, the symptoms seemed to match exactly what 
[~krummas] described.



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977052#comment-14977052
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/27/15 7:59 PM:
-

[~krummas] [~tjake] Something interesting on this second form of commit log 
growth, where all nodes had uncontrolled commit log growth, unlike the first 
example (many files in L0) where it was isolated nodes: for this latter case, I 
think I'm able to relate it to a separate problem with an IndexOutOfBounds 
exception. Working with [~benedict], it seems like we have that one solved. I'm 
hopeful that patch will solve this growing commit log problem as well. It seems 
like all roads lead to Rome, where Rome is commit log growth :-)

Here is the other JIRA identifying an integer overflow in 
AbstractNativeCell.java:
https://issues.apache.org/jira/browse/CASSANDRA-10579

Still uncertain how to proceed with the first form, which seems to be the 
starvation you have described.
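
For readers who don't follow the link, the bug class is easy to summarize with 
a hedged sketch (illustrative only, not the actual AbstractNativeCell code): an 
int-typed offset/size computation silently wraps negative once its operands are 
large enough, and doing the arithmetic in long (or failing fast with 
Math.multiplyExact) exposes it.

{code}
// Illustrative sketch of the int-overflow bug class; NOT the AbstractNativeCell
// code itself (see CASSANDRA-10579 for the real fix).
public class OverflowSketch
{
    // Buggy: the multiplication happens in int and can wrap negative.
    static int cellOffsetBuggy(int cellIndex, int cellSize)
    {
        return cellIndex * cellSize;
    }

    // Safer: widen to long first (or use Math.multiplyExact to fail fast).
    static long cellOffsetSafe(int cellIndex, int cellSize)
    {
        return (long) cellIndex * cellSize;
    }

    public static void main(String[] args)
    {
        int index = 70_000, size = 40_000;
        System.out.println("buggy: " + cellOffsetBuggy(index, size)); // negative
        System.out.println("safe:  " + cellOffsetSafe(index, size));  // 2800000000
    }
}
{code}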



was (Author: jeffery.griffith):
[~krummas] [~tjake] something interesting on this second form of commit log 
growth where all nodes had uncontrolled commit log growth unless the first 
example (many files in L0) where it was isolated nodes. for this latter case, I 
think i'm able to relate this to a separate problem with an index out of bounds 
exception. working with [~benedict] it seems like we have that one solved. i'm 
hopeful that patch will solve this growing commit log problem as well. it seems 
like all roads lead to rome where rome is commit log growth :-)

here is the other JIRA identifying an integer overflow in 
AbstractNativeCell.java
https://issues.apache.org/jira/browse/CASSANDRA-10579

Still uncertain how to proceed with the first form that seems to be starvation 
as you have described.




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-23 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:34 PM:
-

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting, but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded the pending numbers 
in the filenames for your convenience, to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one I uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case, where there were lots of L0 files 
causing thread blocking, but even here it seems like MemtablePostFlush is 
blocking on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note that the 
pid changed because our monitoring detected the problem and restarted the node.

{code}
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}
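
To make the MemtablePostFlush observation concrete, here is a minimal, 
self-contained sketch (illustrative only, not Cassandra's actual flush path): a 
post-flush-style task awaits a CountDownLatch that is only released by an 
upstream step, so if that step stalls, everything queued behind it on a 
single-threaded executor backs up, much like the rising pending counts embedded 
in the filenames above.

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: a task blocked on a CountDownLatch backs up everything
// queued behind it on a single-threaded executor.
public class PostFlushStallSketch
{
    public static void main(String[] args) throws Exception
    {
        CountDownLatch upstreamDone = new CountDownLatch(1); // never counted down here
        ExecutorService postFlush = Executors.newSingleThreadExecutor();

        // The "post flush" style task waits for the latch before proceeding.
        postFlush.submit(() -> {
            try
            {
                upstreamDone.await(); // blocks forever if the upstream step stalls
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });

        // Subsequent tasks pile up in the queue, like the rising pending counts above.
        for (int i = 0; i < 5; i++)
        {
            int n = i;
            postFlush.submit(() -> System.out.println("task " + n + " ran"));
        }

        TimeUnit.SECONDS.sleep(2);
        System.out.println("queued tasks never ran; shutting down");
        postFlush.shutdownNow();
    }
}
{code}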



was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded pending numbers in 
the filenames for your convenience to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one i uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case where there were lots of L0 files 
causing thread blocking, but even here it seems like the MemtablePostFlush is 
stopping on a countdownlatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note the pid 
changed because our monitoring detected and restarted the node.

{code}
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:16 
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:18 
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:20 
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:22 
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:24 
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:26 
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:28 
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:30 
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:32 
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:34 
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:36 
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:38 
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:40 
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:43 
tpstats_20151023-00:42:59_pid_37996_postpend_7137.tx

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-23 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:33 PM:
-

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting, but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded the pending numbers 
in the filenames for your convenience, to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one I uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case, where there were lots of L0 files 
causing thread blocking, but even here it seems like MemtablePostFlush is 
blocking on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note that the 
pid changed because our monitoring detected the problem and restarted the node.

{code}
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:16 tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:18 tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:20 tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:22 tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:24 tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:26 tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:28 tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:30 tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:32 tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:34 tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:36 tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:38 tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:40 tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:43 tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:45 tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:47 tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:49 tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:51 tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:53 tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:55 tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:57 tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}



was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded pending numbers in 
the filenames for your convenience to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one i uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case where there were lots of L0 files 
causing thread blocking, but even here it seems like the MemtablePostFlush is 
stopping on a countdownlatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-23 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971417#comment-14971417
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:31 PM:
-

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting, but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded the pending numbers 
in the filenames for your convenience, to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one I uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case, where there were lots of L0 files 
causing thread blocking, but even here it seems like MemtablePostFlush is 
blocking on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz


was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded pending numbers in 
the filenames for your convenience to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one i uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case where there were lots of L0 files 
causing thread blocking, but even here it seems like the MemtablePostFlush is 
stopping on a countdownlatch.



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960769#comment-14960769
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:20 PM:
-

OK, that's good news then, although the weird synchronization of that last 
example concerns me. We'll apply tools/bin/sstableofflinerelevel to everything 
above 1K or so?


was (Author: jeffery.griffith):
ok, that's good news then. we'll apply tools/bin/sstableofflinerelevel to 
everything above 1K or so?



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960763#comment-14960763
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:17 PM:
-

No, this is only the beginning (300M users and a petabyte to go :)). We were 
kind of pausing here since the bigger clusters carry so many more users, but we 
can definitely do that if we move forward with the rollout. We'll apply the 
releveling where we can and see how it behaves. Is the 5K sstable count enough 
to be concerned about? I'll do some more analysis on these and compare to 
clusters that have not yet been upgraded.



was (Author: jeffery.griffith):
No, this is only the beginning (300M users and a petabyte to go :) ) . We were 
kind of pausing here since the bigger clusters carry so many more users but we 
can definitely do that if we move forward with the rollout. We'll apply the 
releveling where can and see how it behaviors. is the 5K sstable count enough 
to be concerned about? i'll do some more analysis on these and compare to 
clusters that have not yet been upgraded.




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960755#comment-14960755
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:07 PM:
-

[~krummas] I checked a different cluster for sstable counts (see 
C5commitLogIncrease.jpg). Here they all decided to break the limit at the same 
time. The largest sstable count in each is about 5K.


was (Author: jeffery.griffith):
[~krummas] I checked a different cluster for sstable counts. 
(C5commitLogIncrease.jpg) Here they all decided to break the limit at the same 
time. The largest sstable count in each is about 5K.



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960697#comment-14960697
 ] 

Marcus Eriksson edited comment on CASSANDRA-10515 at 10/16/15 1:35 PM:
---

[~tjake] Yes, we should probably remove the {{synchronized}} on every method in 
the compaction strategies and optimize the LCS overlap check like we did with 
the global check in CASSANDRA-9882 (if possible; I have not looked at that code 
in a while).
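
For context, a hedged sketch of the kind of change being suggested (hypothetical 
classes, not the actual compaction-strategy code): method-level 
{{synchronized}} serializes every caller, including slow read-only work such as 
an overlap check, whereas narrowing the critical section (here via a concurrent 
collection) lets that work proceed without blocking everything else.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch, not Cassandra's compaction-strategy code: contrasts
// method-level synchronized with a narrower critical section.
public class StrategyLockingSketch
{
    static class CoarseGrained
    {
        private final List<String> sstables = new ArrayList<>();

        // Every caller serializes here, even slow read-only work.
        synchronized void add(String sstable) { sstables.add(sstable); }

        synchronized boolean overlaps(String candidate)
        {
            // Imagine an expensive overlap check holding the lock the whole time.
            return sstables.contains(candidate);
        }
    }

    static class FinerGrained
    {
        // A concurrent collection lets the check run without blocking writers.
        private final List<String> sstables = new CopyOnWriteArrayList<>();

        void add(String sstable) { sstables.add(sstable); }

        boolean overlaps(String candidate) { return sstables.contains(candidate); }
    }

    public static void main(String[] args)
    {
        FinerGrained strategy = new FinerGrained();
        strategy.add("sstable-1");
        System.out.println("overlaps sstable-1? " + strategy.overlaps("sstable-1"));
    }
}
{code}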


was (Author: krummas):
[~tjake] yes, we should probably the {{synchronized}} on every method in the 
compaction strategies and optimize the LCS overlap check like we did with the 
global check in CASSANDRA-9882 (if possible, have not looked at that code in a 
while)



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960663#comment-14960663
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:17 PM:
-

Yes, we just upgraded from 2.0. Would that explain the 50k+? I will dig deeper 
on that CF. I have checked, and that is NOT the case on the other nodes. Yes, 
they are balanced in terms of data (40-core machines with lots of memory). In 
this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes 
each (30 nodes total). Only THREE nodes of the thirty are now exhibiting this 
behavior. For the first few days several others did as well; however, they seem 
to have self-corrected. These three still have not. I will go back and check 
for large sstable counts to see if that explains all of them. After this first 
stage we'll be rolling out to the larger 24-node clusters, but we are pausing 
here on the small clusters until we figure this out.



was (Author: jeffery.griffith):
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and 
that is NOT the case on the other nodes. yes, they are balanced in terms of 
data (40 core machines with lots of memory). in this stage of our rollout to 
2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only 
THREE nodes of the thirty are now exhibiting this behavior. for the first few 
days, several others did however they seem to have self corrected. these three 
still have not. I will go back and check for large sstable counts to see if 
that explains all of them. after this first stage, we'll be rolling out to the 
larger 24-node clusters but we are pausing here on the small clusters until we 
figure this out.




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960663#comment-14960663
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:16 PM:
-

Yes, we just upgraded from 2.0. Would that explain the 50k+? I have checked, 
and that is NOT the case on the other nodes. Yes, they are balanced in terms of 
data (40-core machines with lots of memory). In this stage of our rollout to 
2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). Only 
THREE nodes of the thirty are now exhibiting this behavior. For the first few 
days several others did as well; however, they seem to have self-corrected. 
These three still have not. I will go back and check for large sstable counts 
to see if that explains all of them. After this first stage we'll be rolling 
out to the larger 24-node clusters, but we are pausing here on the small 
clusters until we figure this out.



was (Author: jeffery.griffith):
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and 
that is NOT the case on the other nodes. yes, they are balanced in terms of 
data (40 core machines with lots of memory). in this stage of our rollout to 
2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only 
THREE nodes of the thirty are now exhibiting this behavior. for the first few 
days, several others did however they seem to have self corrected. I will go 
back and check for large sstable counts to see if that explains all of them. 
after this first stage, we'll be rolling out to the larger 24-node clusters but 
we are pausing here on the small clusters until we figure this out.




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960663#comment-14960663
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:14 PM:
-

Yes, we just upgraded from 2.0. Would that explain the 50k+? I have checked, 
and that is NOT the case on the other nodes. Yes, they are balanced in terms of 
data (40-core machines with lots of memory). In this stage of our rollout to 
2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). Only 
THREE nodes of the thirty are now exhibiting this behavior. For the first few 
days several others did as well; however, they seem to have self-corrected. I 
will go back and check for large sstable counts to see if that explains all of 
them. After this first stage we'll be rolling out to the larger 24-node 
clusters, but we are pausing here on the small clusters until we figure this 
out.



was (Author: jeffery.griffith):
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and 
that is NOT the case on the other nodes. yes, they are balanced in terms of 
data (40 core machines with lots of memory). in this stage of our rollout to 
2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only 
THREE nodes of the thirty are now exhibiting this behavior. for the first few 
days, several others did however they seem to have self corrected. I will go 
back and check for large sstable counts to see if that explains all of them. 
after this first stage, we'll be rolling out to the larger 24-node customers 
but we are pausing here on the small clusters until we figure this out.




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960606#comment-14960606
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:59 PM:
--

Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We 
didn't actually name them CF001 ;-)

For your convenience I grabbed the sstable counts > 500:
SSTable count: 3454
SSTable count: 55392 <--- indeed this is NOT the case on other nodes
SSTable count: 687

Also, I've attached our cassandra.yaml.
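
In case it's useful, this is roughly how I pulled those out -- a throwaway 
filter over the cfstats output (the "Table:" label and the exact layout are 
assumptions about the 2.1 cfstats format; adjust if yours differs):

{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Throwaway sketch: pipe "nodetool cfstats" output into stdin and print any
// table whose "SSTable count:" value exceeds a threshold (default 500).
// The "Table:" label is an assumption about the 2.1 output format.
public class SSTableCountFilter
{
    public static void main(String[] args) throws Exception
    {
        int threshold = args.length > 0 ? Integer.parseInt(args[0]) : 500;
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String table = "?";
        String line;
        while ((line = in.readLine()) != null)
        {
            String t = line.trim();
            if (t.startsWith("Table:"))
            {
                table = t.substring("Table:".length()).trim();
            }
            else if (t.startsWith("SSTable count:"))
            {
                long count = Long.parseLong(t.substring("SSTable count:".length()).trim());
                if (count > threshold)
                    System.out.println(table + ": SSTable count: " + count);
            }
        }
    }
}
{code}

Run as "nodetool cfstats | java SSTableCountFilter 500".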


was (Author: jeffery.griffith):
thanks [~krummas] see cfstats-clean.txt which i obfuscated and uploaded. we 
didn't actually name them CF001 ;-)

For your convenience I grabbed the sstable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687

Also, I've attached our cassandra.yaml

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960606#comment-14960606
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:51 PM:
--

Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We 
didn't actually name them CF001 ;-)

For your convenience I grabbed the sstable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687

Also, I've attached our cassandra.yaml.


was (Author: jeffery.griffith):
thanks [~krummas] see cfstats-clean.txt which i obfuscated and uploaded. we 
didn't actually name them CF001 ;-)

For your convenience I grabbed the sstable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960606#comment-14960606
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:47 PM:
--

Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We 
didn't actually name them CF001 ;-)

For your convenience I grabbed the sstable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687



was (Author: jeffery.griffith):
thanks [~krummas] see cfstats-clean.txt which i obfuscated and uploaded. we 
didn't actually name them CF001 ;-)

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:14 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments.

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the lock holder 
never completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks. I believe that if I let it continue to run, all or many of them 
will. 

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
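
To get a feel for why that RUNNABLE frame can sit there so long with ~55k 
sstables in L0, here is a minimal sketch (not Cassandra's actual code; the 
Range type and the sizes are stand-ins) of the pairwise overlap scan those 
LeveledManifest.overlapping / Bounds.intersects frames point at:

{code}
import java.util.ArrayList;
import java.util.List;

// Sketch only -- not Cassandra source. Range stands in for a token Bounds.
public class OverlapSketch
{
    static final class Range
    {
        final long left, right;
        Range(long left, long right) { this.left = left; this.right = right; }
        boolean intersects(Range o) { return left <= o.right && o.left <= right; }
    }

    // One candidate search: scan every other L0 sstable for overlap -> O(N).
    static List<Range> overlapping(Range candidate, List<Range> level0)
    {
        List<Range> out = new ArrayList<>();
        for (Range other : level0)
            if (candidate.intersects(other))
                out.add(other);
        return out;
    }

    public static void main(String[] args)
    {
        int n = 5_000;                                // small enough to run quickly
        List<Range> level0 = new ArrayList<>();
        for (long i = 0; i < n; i++)
            level0.add(new Range(i, i + 10));

        long start = System.nanoTime();
        long overlaps = 0;
        for (Range candidate : level0)                // repeated per candidate -> N * N intersects() calls
            overlaps += overlapping(candidate, level0).size();
        long ms = (System.nanoTime() - start) / 1_000_000;

        System.out.println(n + " sstables -> " + ((long) n * n) + " checks, "
                           + overlaps + " overlaps, " + ms + " ms");
        System.out.println("at 55,392 sstables: ~" + (55_392L * 55_392L) + " checks per pass");
    }
}
{code}

55,392 squared is roughly 3 billion intersects() calls for a single pass, and 
per the locked frames above that pass runs while holding the compaction 
strategy monitor, so even a cheap per-check cost would keep that lock held for 
a very long time.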


I see one thread for MemtablePostFlush and this is it:

{code}
"MemtablePostFlush:8" #4866 daemon prio=5 os_prio=0 tid=0x7fd91c0c5800 
nid=0x2d93 waiting on condition [0x7fda4b46c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005838ba468> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
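
And for what it's worth, a stripped-down sketch (again, not Cassandra's actual 
code; the thread and field names are just stand-ins mirroring the traces) of 
the hand-off these two dumps suggest: post-flush parks on a latch that the 
flush writer only counts down after it has made it through the strategy's 
synchronized notification path, so whoever holds that monitor for a long time 
keeps the latch from ever being released:

{code}
import java.util.concurrent.CountDownLatch;

// Sketch only -- not Cassandra source. Names mirror the thread names in the traces.
public class PostFlushSketch
{
    static final Object strategyMonitor = new Object();     // stand-in for the compaction strategy
    static final CountDownLatch flushDone = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException
    {
        Thread compactionExecutor = new Thread(() -> {
            synchronized (strategyMonitor)                   // long candidate search holds the monitor
            {
                sleepQuietly(5_000);
            }
        }, "CompactionExecutor");

        Thread flushWriter = new Thread(() -> {
            synchronized (strategyMonitor)                   // notification path is synchronized
            {
                // tell the strategy about the newly flushed sstable
            }
            flushDone.countDown();                           // only now can post-flush proceed
        }, "MemtableFlushWriter");

        Thread postFlush = new Thread(() -> {
            awaitQuietly(flushDone);                         // parks on the latch, like the trace above
            System.out.println("post-flush ran; commit log segments can now be recycled");
        }, "MemtablePostFlush");

        compactionExecutor.start();
        Thread.sleep(100);                                   // let the executor grab the monitor first
        flushWriter.start();
        postFlush.start();
        postFlush.join();                                    // ~5 s here; effectively forever if the scan never ends
    }

    static void sleepQuietly(long ms)
    {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    static void awaitQuietly(CountDownLatch latch)
    {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}

As far as I understand it, commit log segments only get recycled once that 
post-flush side runs, which would line up with the commit logs blowing past 
the 12G cap.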

I followed it for a while longer after this and it really looks like the post 
flush stays blocked on that latch forever:

{code}
00:01
MemtableFlushWriter   2 2   2024 0  
   0
MemtablePostFlush 1 47159   4277 0  
   0
MemtableReclaimMemory 0 0   2024 0  
   0


00:03
MemtableFlushWriter   3 3   2075 0  
   0
MemtableP

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:13 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments.

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the lock holder 
never completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks. I believe that if I let it continue to run, all or many of them 
will. 

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


I see one thread for MemtablePostFlush and this is it:

{code}
"MemtablePostFlush:8" #4866 daemon prio=5 os_prio=0 tid=0x7fd91c0c5800 
nid=0x2d93 waiting on condition [0x7fda4b46c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005838ba468> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

I followed it for a while longer after this and it really looks like the post 
flush stays blocked on that latch forever:

{code}
00:01
MemtableFlushWriter   2 2   2024 0  
   0
MemtablePostFlush 1 47159   4277 0  
   0
MemtableReclaimMemory 0 0   2024 0  
   0


00:03
MemtableFlushWriter   3 3   2075 0  
   0
Memtable

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:05 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments.

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the lock holder 
never completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks. I believe that if I let it continue to run, all or many of them 
will. 

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


I see one thread for MemtablePostFlush and this is it:

{code}
"MemtablePostFlush:8" #4866 daemon prio=5 os_prio=0 tid=0x7fd91c0c5800 
nid=0x2d93 waiting on condition [0x7fda4b46c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005838ba468> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: jeffery.griffith):
[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview is:
Monitoring threads began to block before the memtable flushing did.
Memtable flushing seemed to be progressing slowly and then post flushing 
operations began to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked so a

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:00 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments.

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the lock holder 
never completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks. I believe that if I let it continue to run, all or many of them 
will. 

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: jeffery.griffith):
[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview is:
Monitoring threads began to block before the memtable flushing did.
Memtable flushing seemed to be progressing slowly and then post flushing 
operations began to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked so assume (?) the locker never 
completed or was very, very slow:

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 11:57 PM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments.

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the lock holder 
never completed or was very, very slow:

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: jeffery.griffith):
[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview is:
Monitoring threads began to block before the memtable flushing did.
Memtable flushing seemed to be progressing slowly and then post flushing 
operations began to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked so assume (?) the locker never 
completed:

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurren

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959525#comment-14959525
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 8:06 PM:
-

Yeah, it doesn't look like the locking thread is deadlocked at all. I know this 
is a stretch, but considering we just migrated from 2.0.x, could there be 
something data-specific that is confusing the compaction? I'm not sure where to 
check for slow flushes. Should I just watch tpstats?
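
In the meantime, this is roughly what I plan to watch for -- a throwaway sketch 
that reads periodic "nodetool tpstats" output from stdin and flags any 
Memtable* pool with a large Pending column. The column order (Pool Name, 
Active, Pending, Completed, Blocked, ...) and the threshold are assumptions 
based on the 2.1 output pasted elsewhere in this ticket:

{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch only: pipe "nodetool tpstats" output into stdin and flag any Memtable*
// pool whose Pending column looks like a backlog. Column order is assumed to be
// Pool Name, Active, Pending, Completed, Blocked, All time blocked (2.1 output).
public class TpstatsWatch
{
    public static void main(String[] args) throws Exception
    {
        long threshold = args.length > 0 ? Long.parseLong(args[0]) : 100;
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null)
        {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3 || !cols[0].startsWith("Memtable"))
                continue;                               // only the flush-related pools
            try
            {
                long pending = Long.parseLong(cols[2]); // Pending column
                if (pending > threshold)
                    System.out.println("backlog: " + cols[0] + " pending=" + pending);
            }
            catch (NumberFormatException ignore)
            {
                // header or wrapped line -- skip it
            }
        }
    }
}
{code}

Something like "nodetool tpstats | java TpstatsWatch 1000" from a (well-behaved) 
cron job should catch a MemtablePostFlush backlog long before it reaches the 
tens of thousands.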


was (Author: jeffery.griffith):
Yeah doesn't look blocked. How can i check for the slow flushes?

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959338#comment-14959338
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 7:19 PM:
-

Yeah, I saw all the blocked threads behind it. I'm checking to see which 
monitoring tools are not waiting for the previous instance to finish, but this 
is just an ugly side effect, isn't it? (a side effect of the lock?) I will 
disable all monitoring and restart to be sure. (UPDATE: it looks like a cron 
job piled those up after things got stuck. I disabled it to be sure.)


was (Author: jeffery.griffith):
Yeah, i saw all the blocked threads behind it. checking to see what monitoring 
tools are not checking for the previous instance to finish. but this is just an 
ugly side effect, isn't it? (a side effect of lock?) i will disable all 
monitoring and restart to be sure.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959338#comment-14959338
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 6:18 PM:
-

Yeah, I saw all the blocked threads behind it. I'm checking to see which 
monitoring tools are not waiting for the previous instance to finish, but this 
is just an ugly side effect, isn't it? (a side effect of the lock?) I will 
disable all monitoring and restart to be sure.


was (Author: jeffery.griffith):
Yeah, i saw all the blocked threads behind it. checking to see what monitoring 
tools are not checking for the previous instance to finish. but this is just an 
ugly side effect, isn't it? (a side effect of lock?) 

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959313#comment-14959313
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 6:08 PM:
-

Oh look, the memtable flusher is blocked on the same lock:

{code}
"MemtableFlushWriter:1166" #18316 daemon prio=5 os_prio=0 
tid=0x7f33ac5f8800 nid=0xb649 waiting for monitor entry [0x7f31c5acc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.handleNotification(WrappingCompactionStrategy.java:250)
- waiting to lock <0x000498151af8> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at org.apache.cassandra.db.DataTracker.notifyAdded(DataTracker.java:518)

{code}

I don't know how hot that particular code is, but every stack trace showed the 
lock holder at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77) or deeper.


was (Author: jeffery.griffith):
Oh look, the memtable flusher is blocked on the same lock:

{code}
"MemtableFlushWriter:1166" #18316 daemon prio=5 os_prio=0 
tid=0x7f33ac5f8800 nid=0xb649 waiting for monitor entry [0x7f31c5acc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.handleNotification(WrappingCompactionStrategy.java:250)
- waiting to lock <0x000498151af8> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at org.apache.cassandra.db.DataTracker.notifyAdded(DataTracker.java:518)

{code}

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-14 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956970#comment-14956970
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/14/15 1:49 PM:
-

CommitLogSize.jpg matches the period in the log file system.log.clean


was (Author: jeffery.griffith):
This matches the period in the log file system.log.clean

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-14 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956966#comment-14956966
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/14/15 1:49 PM:
-

I cleaned up a log and attached it (system.log.clean) for a period that began 
already in trouble but suddenly started (flushing?) and compacting again. At 
11:25 there was a huge compaction. Things seemed normal for a while, but the 
commit logs began to grow again. Note that I have cleaned out a bunch of 
warnings like this, because we see them everywhere:

WARN  [SharedPool-Worker-4] 2015-10-14 11:21:22,940 BatchStatement.java:252 - 
Batch of prepared statements for [xx] is of size 5253, exceeding specified 
threshold of 5120 by 133.


was (Author: jeffery.griffith):
I cleaned up a log and attached it for a period that begin already in trouble, 
but suddenly started (flushing?) and compacting again. At 11:25 there was a 
huge compaction. Things seemed normal for a while but the commit logs began to 
grow again. Note that I have cleaned out a bunch of warnings like this because 
we see them everywhere:

WARN  [SharedPool-Worker-4] 2015-10-14 11:21:22,940 BatchStatement.java:252 - 
Batch of prepared statements for [xx] is of size 5253, exceeding specified 
threshold of 5120 by 133.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)