[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap after long GC pause

2015-10-22 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969125#comment-14969125
 ] 

Robbie Strickland edited comment on CASSANDRA-10449 at 10/22/15 1:08 PM:
-

I decided to try upgrading to 2.1.11 to see if the issue was resolved by 
CASSANDRA-9681.  The node has been joining for over 24 hours, even though it 
appears to have finished streaming after about 6 hours:

{noformat}
ubuntu@eventcass4x087:~$ nodetool netstats | grep -v 100%
Mode: JOINING
Bootstrap 7047c510-7732-11e5-a7e7-63f53bbd2778
Receiving 171 files, 95313491312 bytes total. Already received 171 files, 95313491312 bytes total
Receiving 165 files, 78860134041 bytes total. Already received 165 files, 78860134041 bytes total
Receiving 158 files, 77709354374 bytes total. Already received 158 files, 77709354374 bytes total
Receiving 184 files, 106710570690 bytes total. Already received 184 files, 106710570690 bytes total
Receiving 136 files, 35699286217 bytes total. Already received 136 files, 35699286217 bytes total
Receiving 169 files, 53498180215 bytes total. Already received 169 files, 53498180215 bytes total
Receiving 197 files, 129020987979 bytes total. Already received 197 files, 129020987979 bytes total
Receiving 196 files, 113904035360 bytes total. Already received 196 files, 113904035360 bytes total
Receiving 172 files, 47685647028 bytes total. Already received 172 files, 47685647028 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         1              0
Responses                       n/a         0       83743675
{noformat}

It doesn't appear to still be building indexes either:

{noformat}
ubuntu@eventcass4x087:~$ nodetool compactionstats
pending tasks: 2
   compaction type                keyspace      table     completed        total    unit   progress
        Compaction   prod_analytics_events   wuevents     163704673    201033961   bytes     81.43%
Active compaction remaining time :        n/a
{noformat}

So I'm not sure why it's still joining.  Any thoughts?
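
Not an answer, but a rough checklist for seeing what the joining node is still waiting on. This is only a sketch: the log path assumes a default package install, and the exact message wording is an assumption that may differ slightly in 2.1.11.

{noformat}
# still in JOINING, and is anything actually still streaming?
nodetool netstats | head -5

# pending flushes / compactions / index builds
nodetool compactionstats
nodetool tpstats

# did streaming and the bootstrap itself log completion?
# (log path and message wording are assumptions for a default 2.1 install)
grep -i 'bootstrap' /var/log/cassandra/system.log | tail -20
grep -i 'StreamResultFuture' /var/log/cassandra/system.log | tail -20
{noformat}

If netstats and compactionstats both look idle, the most recent bootstrap/JOINING lines in the log are usually the best hint as to which phase it never got past.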


was (Author: rstrickland):
I decided to try upgrading to 2.1.11 to see if the issue was resolved by 
CASSANDRA-9681.  The node has been joining for over 24 hours, even though it 
appears to have finished streaming after about 6 hours:

{{noformat}}
ubuntu@eventcass4x087:~$ nodetool netstats | grep -v 100%
Mode: JOINING
Bootstrap 7047c510-7732-11e5-a7e7-63f53bbd2778
Receiving 171 files, 95313491312 bytes total. Already received 171 files, 95313491312 bytes total
Receiving 165 files, 78860134041 bytes total. Already received 165 files, 78860134041 bytes total
Receiving 158 files, 77709354374 bytes total. Already received 158 files, 77709354374 bytes total
Receiving 184 files, 106710570690 bytes total. Already received 184 files, 106710570690 bytes total
Receiving 136 files, 35699286217 bytes total. Already received 136 files, 35699286217 bytes total
Receiving 169 files, 53498180215 bytes total. Already received 169 files, 53498180215 bytes total
Receiving 197 files, 129020987979 bytes total. Already received 197 files, 129020987979 bytes total
Receiving 196 files, 113904035360 bytes total. Already received 196 files, 113904035360 bytes total
Receiving 172 files, 47685647028 bytes total. Already received 172 files, 47685647028 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         1              0
Responses                       n/a         0       83743675
{{noformat}}

It doesn't appear to still be building indexes either:

{{noformat}}
ubuntu@eventcass4x087:~$ nodetool compactionstats
pending tasks: 2
   compaction type                keyspace      table     completed        total    unit   progress
        Compaction   prod_analytics_events   wuevents     163704673    201033961   bytes     81.43%
Active compaction remaining time :        n/a
{{noformat}}

So I'm not sure why it's still joining.  Any thoughts?

> OOM on bootstrap after long GC pause
> 
>
> Key: CASSANDRA-10449
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10449
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Ubuntu 14.04, AWS
>Reporter: Robbie Strickland
>  Labels: gc
> Fix For: 2.1.x
>
> Attachments: GCpath.txt, heap_dump.png, system.log.10-05, 
> thread_dump.log, threads.txt
>
>
> I have a 20-node cluster (i2.4xlarge) with vnodes (default of 256) and 
> 500-700GB per node.

[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap after long GC pause

2015-10-22 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946624#comment-14946624
 ] 

Robbie Strickland edited comment on CASSANDRA-10449 at 10/22/15 1:10 PM:
-

I increased max heap to 96GB and tried again.  Now doing netstats shows 
progress ground to a halt:

9pm:

{noformat}
ubuntu@eventcass4x024:~$ nodetool netstats | grep -v 100%
Mode: JOINING
Bootstrap 45d8dec0-6c12-11e5-90ef-f7a8e02e59c0
Receiving 139 files, 36548040412 bytes total. Already received 139 files, 36548040412 bytes total
Receiving 171 files, 6431853 bytes total. Already received 171 files, 6431853 bytes total
Receiving 147 files, 78458709168 bytes total. Already received 79 files, 55003961646 bytes total
/var/lib/cassandra/xvdd/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-295-Data.db 955162267/4105438496 bytes(23%) received from idx:0/x.x.x.x
Receiving 141 files, 36700837768 bytes total. Already received 141 files, 36700837768 bytes total
Receiving 176 files, 79676288976 bytes total. Already received 98 files, 55932809644 bytes total
/var/lib/cassandra/xvdb/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-329-Data.db 174070078/7326235809 bytes(2%) received from idx:0/x.x.x.x
Receiving 170 files, 85920995638 bytes total. Already received 94 files, 54985226700 bytes total
/var/lib/cassandra/xvdd/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-265-Data.db 4875660361/22821083384 bytes(21%) received from idx:0/x.x.x.x
Receiving 174 files, 87064163973 bytes total. Already received 91 files, 53930233899 bytes total
/var/lib/cassandra/xvdb/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-157-Data.db 17064156850/25823860172 bytes(66%) received from idx:0/x.x.x.x
Receiving 164 files, 46351636573 bytes total. Already received 164 files, 46351636573 bytes total
Receiving 158 files, 62899520151 bytes total. Already received 158 files, 62899520151 bytes total
Receiving 164 files, 48771232182 bytes total. Already received 164 files, 48771232182 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a        19             56
Responses                       n/a         0       35515795
{noformat}

6am:

{noformat}
ubuntu@eventcass4x024:~$ nodetool netstats | grep -v 100%
Mode: JOINING
Bootstrap 45d8dec0-6c12-11e5-90ef-f7a8e02e59c0
Receiving 139 files, 36548040412 bytes total. Already received 139 files, 36548040412 bytes total
Receiving 171 files, 6431853 bytes total. Already received 171 files, 6431853 bytes total
Receiving 147 files, 78458709168 bytes total. Already received 79 files, 55003961646 bytes total
/var/lib/cassandra/xvdd/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-295-Data.db 955162267/4105438496 bytes(23%) received from idx:0/x.x.x.x
Receiving 141 files, 36700837768 bytes total. Already received 141 files, 36700837768 bytes total
Receiving 176 files, 79676288976 bytes total. Already received 98 files, 55932809644 bytes total
/var/lib/cassandra/xvdb/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-329-Data.db 174070078/7326235809 bytes(2%) received from idx:0/x.x.x.x
Receiving 170 files, 85920995638 bytes total. Already received 94 files, 54985226700 bytes total
/var/lib/cassandra/xvdd/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-265-Data.db 4875660361/22821083384 bytes(21%) received from idx:0/x.x.x.x
Receiving 174 files, 87064163973 bytes total. Already received 91 files, 53930233899 bytes total
/var/lib/cassandra/xvdb/data/prod_analytics_events/wuevents-ffa99ad05af911e596f05987bbaaffad/prod_analytics_events-wuevents-tmp-ka-157-Data.db 17064156850/25823860172 bytes(66%) received from idx:0/x.x.x.x
Receiving 164 files, 46351636573 bytes total. Already received 164 files, 46351636573 bytes total
Receiving 158 files, 62899520151 bytes total. Already received 158 files, 62899520151 bytes total
Receiving 164 files, 48771232182 bytes total. Already received 164 files, 48771232182 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands 

[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap after long GC pause

2015-10-16 Thread Mikhail Stepura (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14961726#comment-14961726
 ] 

Mikhail Stepura edited comment on CASSANDRA-10449 at 10/17/15 4:50 AM:
---

So, there are 2 ColumnFamilyStores with 6G of memtables each (5797 'live' 
memtables per instance) in the heap, but for some reason they are not being 
flushed. All 32 MemtableFlushWriters (and 1 post-flush) are waiting on 
{{writeBarrier}}s.
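
For anyone reproducing this, a rough way to confirm from a live node that the flush writers are parked on the write barrier (a sketch only; the pid lookup and the frame names in the grep are assumptions and depend on how the daemon shows up and on the exact 2.1 source):

{noformat}
# jstack ships with the JDK; pgrep match on the daemon class is an assumption
jstack $(pgrep -f CassandraDaemon) > threads.txt

# count flush-writer threads and show what each one is blocked in
grep -c 'MemtableFlushWriter' threads.txt
grep -A 10 'MemtableFlushWriter' threads.txt | egrep 'Barrier|await|WAITING'
{noformat}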


was (Author: mishail):
So, there are 2 6G memtables in the heap, but for some reason they are not 
being flushed. All 32 MemtableFlushWriters (and 1 post-flush) are waiting on 
{{writeBarrier}}s.

> OOM on bootstrap after long GC pause
> 
>
> Key: CASSANDRA-10449
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10449
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Ubuntu 14.04, AWS
>Reporter: Robbie Strickland
>  Labels: gc
> Fix For: 2.1.x
>
> Attachments: GCpath.txt, heap_dump.png, system.log.10-05, 
> thread_dump.log, threads.txt
>
>
> I have a 20-node cluster (i2.4xlarge) with vnodes (default of 256) and 
> 500-700GB per node.  SSTable counts are <10 per table.  I am attempting to 
> provision additional nodes, but bootstrapping OOMs every time after about 10 
> hours with a sudden long GC pause:
> {noformat}
> INFO  [Service Thread] 2015-10-05 23:33:33,373 GCInspector.java:252 - G1 Old 
> Generation GC in 1586126ms.  G1 Old Gen: 49213756976 -> 49072277176;
> ...
> ERROR [MemtableFlushWriter:454] 2015-10-05 23:33:33,380 
> CassandraDaemon.java:223 - Exception in thread 
> Thread[MemtableFlushWriter:454,5,main]
> java.lang.OutOfMemoryError: Java heap space
> {noformat}
> I have tried increasing max heap to 48G just to get through the bootstrap, to 
> no avail.
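
For what it's worth, chasing an OOM like this is much easier with a heap dump and GC log from the failed bootstrap attempt. A minimal sketch, assuming a default package install where JVM options are set via JVM_OPTS in /etc/cassandra/cassandra-env.sh (the path and file are assumptions; the flags themselves are standard HotSpot options on Java 7/8):

{noformat}
# appended to cassandra-env.sh before retrying the bootstrap
JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=/var/lib/cassandra/heapdump"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"
{noformat}

With a dump in hand, a dominator-tree view in MAT or a similar tool is usually enough to tell which pool is pinning the memtables when the old-gen collection stalls.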



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)