These are brand new boxes running only Cassandra.  Yes, it is the kernel that
is killing the JVM, and this does appear to be a memory leak in Cassandra:
nothing else is running on them, aside from the basic services needed for
Amazon Linux to run.
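
As a rough sanity check on the native-memory theory, the sketch below
compares the RSS that ps reports with the configured heap ceiling.  The two
figures are copied from the ps aux line quoted below; the function name is
only illustrative, and the result is a lower bound because the heap may not
be fully resident.

```python
KIB_PER_MIB = 1024


def non_heap_resident_mib(rss_kib: int, heap_max_mib: int) -> float:
    """Resident memory (MiB) that a completely full Java heap cannot explain.

    This is a lower bound: if part of the heap is not resident, even more
    of the RSS lives outside the heap.
    """
    return rss_kib / KIB_PER_MIB - heap_max_mib


# Figures copied from the quoted ps aux output.
rss_kib = 27642988     # RSS column of ps aux, reported in KiB
heap_max_mib = 7536    # from -Xmx7536M

extra_mib = non_heap_resident_mib(rss_kib, heap_max_mib)
print(
    f"resident: {rss_kib / KIB_PER_MIB:.0f} MiB, "
    f"heap cap: {heap_max_mib} MiB, "
    f"outside the heap: at least {extra_mib:.0f} MiB"
)
```

On these numbers roughly 19 GiB of resident memory sits outside the Java
heap, which is consistent with a native (off-heap) problem rather than a
heap one.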

On Fri, Mar 11, 2016 at 11:17 AM, Sebastian Estevez <
sebastian.este...@datastax.com> wrote:

> "Sacrifice child" in dmesg is your OS killing the process using the most RAM.
> That means you're actually running out of memory at the Linux level, outside
> of the JVM.
>
> Are you running anything other than Cassandra on this box?
>
> If so, does it have a memory leak?
>
> all the best,
>
> Sebastián
> On Mar 11, 2016 11:14 AM, "Adam Plumb" <apl...@fiksu.com> wrote:
>
>> I've got a new cluster of 18 nodes running Cassandra 3.4 that I just
>> launched and loaded data into yesterday (roughly 2TB of total storage) and
>> am seeing runaway memory usage.  These nodes are EC2 c3.4xlarges with 30GB
>> RAM and the heap size is set to 8G with a new heap size of 1.6G.
>>
>> Last night I finished loading the data, then ran an incremental repair
>> on one of the nodes just to ensure that everything was working (nodetool
>> repair).  Overnight, all 18 nodes ran out of memory and were killed by the
>> OOM killer.  I restarted them this morning and they all came up fine, but
>> then started churning through memory and got killed again.  I restarted
>> them again and they're doing the same thing.  I'm not getting any errors in
>> the system log, since the process is getting killed abruptly (which makes
>> me think this is a native memory issue, not a heap issue).
>>
>> Obviously this behavior isn't the best.  I'm willing to provide any data
>> people need to help debug this; these nodes are still up and running.  I'm
>> also on IRC if anyone wants to jump on there.
>>
>> Here is the output of ps aux:
>>
>> 497       64351  108 89.5 187156072 27642988 ?  SLl  15:13  62:15 java
>>> -ea -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
>>> -XX:ThreadPriorityPolicy=42 -Xms7536M -Xmx7536M -Xmn1600M
>>> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003
>>> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>>> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
>>> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:+UseTLAB -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45
>>> -XX:-ParallelRefProcEnabled -XX:-AlwaysPreTouch -XX:+UseBiasedLocking
>>> -XX:+UseTLAB -XX:+ResizeTLAB -Djava.net.preferIPv4Stack=true
>>> -Dcom.sun.management.jmxremote.port=7199
>>> -Dcom.sun.management.jmxremote.rmi.port=7199
>>> -Dcom.sun.management.jmxremote.ssl=false
>>> -Dcom.sun.management.jmxremote.authenticate=false
>>> -XX:+CMSClassUnloadingEnabled -Dlogback.configurationFile=logback.xml
>>> -Dcassandra.logdir=/usr/local/cassandra/logs
>>> -Dcassandra.storagedir=/usr/local/cassandra/data
>>> -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
>>> -cp /usr/local/cassandra/conf:
>>> /usr/local/cassandra/build/classes/main:
>>> /usr/local/cassandra/build/classes/thrift:
>>> /usr/local/cassandra/lib/airline-0.6.jar:
>>> /usr/local/cassandra/lib/antlr-runtime-3.5.2.jar:
>>> /usr/local/cassandra/lib/apache-cassandra-3.4.jar:
>>> /usr/local/cassandra/lib/apache-cassandra-clientutil-3.4.jar:
>>> /usr/local/cassandra/lib/apache-cassandra-thrift-3.4.jar:
>>> /usr/local/cassandra/lib/asm-5.0.4.jar:
>>> /usr/local/cassandra/lib/cassandra-driver-core-3.0.0-shaded.jar:
>>> /usr/local/cassandra/lib/commons-cli-1.1.jar:
>>> /usr/local/cassandra/lib/commons-codec-1.2.jar:
>>> /usr/local/cassandra/lib/commons-lang3-3.1.jar:
>>> /usr/local/cassandra/lib/commons-math3-3.2.jar:
>>> /usr/local/cassandra/lib/compress-lzf-0.8.4.jar:
>>> /usr/local/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:
>>> /usr/local/cassandra/lib/concurrent-trees-2.4.0.jar:
>>> /usr/local/cassandra/lib/disruptor-3.0.1.jar:
>>> /usr/local/cassandra/lib/ecj-4.4.2.jar:
>>> /usr/local/cassandra/lib/guava-18.0.jar:
>>> /usr/local/cassandra/lib/high-scale-lib-1.0.6.jar:
>>> /usr/local/cassandra/lib/hppc-0.5.4.jar:
>>> /usr/local/cassandra/lib/jackson-core-asl-1.9.2.jar:
>>> /usr/local/cassandra/lib/jackson-mapper-asl-1.9.2.jar:
>>> /usr/local/cassandra/lib/jamm-0.3.0.jar:
>>> /usr/local/cassandra/lib/javax.inject.jar:
>>> /usr/local/cassandra/lib/jbcrypt-0.3m.jar:
>>> /usr/local/cassandra/lib/jcl-over-slf4j-1.7.7.jar:
>>> /usr/local/cassandra/lib/jflex-1.6.0.jar:
>>> /usr/local/cassandra/lib/jna-4.0.0.jar:
>>> /usr/local/cassandra/lib/joda-time-2.4.jar:
>>> /usr/local/cassandra/lib/json-simple-1.1.jar:
>>> /usr/local/cassandra/lib/libthrift-0.9.2.jar:
>>> /usr/local/cassandra/lib/log4j-over-slf4j-1.7.7.jar:
>>> /usr/local/cassandra/lib/logback-classic-1.1.3.jar:
>>> /usr/local/cassandra/lib/logback-core-1.1.3.jar:
>>> /usr/local/cassandra/lib/lz4-1.3.0.jar:
>>> /usr/local/cassandra/lib/metrics-core-3.1.0.jar:
>>> /usr/local/cassandra/lib/metrics-logback-3.1.0.jar:
>>> /usr/local/cassandra/lib/netty-all-4.0.23.Final.jar:
>>> /usr/local/cassandra/lib/ohc-core-0.4.2.jar:
>>> /usr/local/cassandra/lib/ohc-core-j8-0.4.2.jar:
>>> /usr/local/cassandra/lib/primitive-1.0.jar:
>>> /usr/local/cassandra/lib/reporter-config3-3.0.0.jar:
>>> /usr/local/cassandra/lib/reporter-config-base-3.0.0.jar:
>>> /usr/local/cassandra/lib/sigar-1.6.4.jar:
>>> /usr/local/cassandra/lib/slf4j-api-1.7.7.jar:
>>> /usr/local/cassandra/lib/snakeyaml-1.11.jar:
>>> /usr/local/cassandra/lib/snappy-java-1.1.1.7.jar:
>>> /usr/local/cassandra/lib/snowball-stemmer-1.3.0.581.1.jar:
>>> /usr/local/cassandra/lib/ST4-4.0.8.jar:
>>> /usr/local/cassandra/lib/stream-2.5.2.jar:
>>> /usr/local/cassandra/lib/thrift-server-0.3.7.jar:
>>> /usr/local/cassandra/lib/jsr223/*/*.jar
>>> org.apache.cassandra.service.CassandraDaemon
>>
>>
>>  Here is some dmesg output:
>>
>> [40003.010117] java invoked oom-killer: gfp_mask=0x280da, order=0,
>> oom_score_adj=0
>> [40003.013042] java cpuset=/ mems_allowed=0
>> [40003.014789] CPU: 3 PID: 37757 Comm: java Tainted: G            E
>> 4.1.7-15.23.amzn1.x86_64 #1
>> [40003.017852] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
>> [40003.020066]  0000000000000000 ffff8800ebaaba18 ffffffff814da12c
>> 0000000000000000
>> [40003.022870]  ffff880763594c80 ffff8800ebaabac8 ffffffff814d7939
>> ffff8800ebaaba78
>> [40003.025674]  ffffffff811bf8f7 ffff880770679c00 ffff88077001c190
>> 0000000000000080
>> [40003.028660] Call Trace:
>> [40003.029613]  [<ffffffff814da12c>] dump_stack+0x45/0x57
>> [40003.031486]  [<ffffffff814d7939>] dump_header+0x7f/0x1fe
>> [40003.033390]  [<ffffffff811bf8f7>] ? mem_cgroup_iter+0x137/0x3d0
>> [40003.035475]  [<ffffffff8107f496>] ? __queue_work+0x136/0x320
>> [40003.037594]  [<ffffffff8115d11c>] oom_kill_process+0x1cc/0x3b0
>> [40003.039825]  [<ffffffff8115d67e>] __out_of_memory+0x31e/0x530
>> [40003.041938]  [<ffffffff8115da2b>] out_of_memory+0x5b/0x80
>> [40003.043857]  [<ffffffff81162a79>] __alloc_pages_nodemask+0x8a9/0x8d0
>> [40003.046105]  [<ffffffff811a48fa>] alloc_page_interleave+0x3a/0x90
>> [40003.048419]  [<ffffffff811a79c3>] alloc_pages_vma+0x143/0x200
>> [40003.050582]  [<ffffffff81188035>] handle_mm_fault+0x1355/0x1770
>> [40003.052674]  [<ffffffff8118e4c5>] ? do_mmap_pgoff+0x2f5/0x3c0
>> [40003.054737]  [<ffffffff8105dafc>] __do_page_fault+0x17c/0x420
>> [40003.056858]  [<ffffffff8118c976>] ? SyS_mmap_pgoff+0x116/0x270
>> [40003.059082]  [<ffffffff8105ddc2>] do_page_fault+0x22/0x30
>> [40003.061084]  [<ffffffff814e2ad8>] page_fault+0x28/0x30
>> [40003.062938] Mem-Info:
>> [40003.063762] active_anon:5437903 inactive_anon:1025 isolated_anon:0
>>  active_file:51 inactive_file:8 isolated_file:0
>>  unevictable:2088582 dirty:0 writeback:0 unstable:0
>>  slab_reclaimable:82028 slab_unreclaimable:12209
>>  mapped:31065 shmem:20 pagetables:37089 bounce:0
>>  free:35830 free_pcp:3141 free_cma:0
>> [40003.075549] Node 0 DMA free:15872kB min:8kB low:8kB high:12kB
>> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB
>> managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
>> slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:0kB
>> pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
>> free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
>> [40003.090267] lowmem_reserve[]: 0 3746 30128 30128
>> [40003.092182] Node 0 DMA32 free:108236kB min:2756kB low:3444kB
>> high:4132kB active_anon:2400616kB inactive_anon:4060kB active_file:0kB
>> inactive_file:0kB unevictable:1049732kB isolated(anon):0kB
>> isolated(file):0kB present:3915776kB managed:3840296kB mlocked:1049732kB
>> dirty:4kB writeback:0kB mapped:16564kB shmem:12kB slab_reclaimable:243852kB
>> slab_unreclaimable:8832kB kernel_stack:1152kB pagetables:16532kB
>> unstable:0kB bounce:0kB free_pcp:5716kB local_pcp:220kB free_cma:0kB
>> writeback_tmp:0kB pages_scanned:5408 all_unreclaimable? yes
>> [40003.108802] lowmem_reserve[]: 0 0 26382 26382
>> [40003.110578] Node 0 Normal free:19212kB min:19412kB low:24264kB
>> high:29116kB active_anon:19350996kB inactive_anon:40kB active_file:212kB
>> inactive_file:80kB unevictable:7304596kB isolated(anon):0kB
>> isolated(file):0kB present:27525120kB managed:27015196kB mlocked:7304596kB
>> dirty:0kB writeback:0kB mapped:107696kB shmem:68kB slab_reclaimable:84228kB
>> slab_unreclaimable:40004kB kernel_stack:10000kB pagetables:131824kB
>> unstable:0kB bounce:0kB free_pcp:6848kB local_pcp:692kB free_cma:0kB
>> writeback_tmp:0kB pages_scanned:38332 all_unreclaimable? yes
>> [40003.128300] lowmem_reserve[]: 0 0 0 0
>> [40003.129844] Node 0 DMA: 0*4kB 0*8kB 0*16kB 2*32kB (UE) 3*64kB (UE)
>> 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ER) 2*4096kB
>> (M) = 15872kB
>> [40003.135917] Node 0 DMA32: 193*4kB (UEM) 254*8kB (UEM) 714*16kB (UE)
>> 1344*32kB (UEMR) 249*64kB (UEMR) 120*128kB (UER) 53*256kB (ER) 10*512kB
>> (ER) 1*1024kB (E) 0*2048kB 0*4096kB = 108244kB
>> [40003.142956] Node 0 Normal: 3956*4kB (UE) 0*8kB 1*16kB (R) 8*32kB (R)
>> 3*64kB (R) 2*128kB (R) 3*256kB (R) 0*512kB 0*1024kB 1*2048kB (R) 0*4096kB =
>> 19360kB
>> [40003.148749] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
>> hugepages_size=2048kB
>> [40003.151777] 31304 total pagecache pages
>> [40003.153288] 0 pages in swap cache
>> [40003.154528] Swap cache stats: add 0, delete 0, find 0/0
>> [40003.156377] Free swap  = 0kB
>> [40003.157423] Total swap = 0kB
>> [40003.158465] 7864221 pages RAM
>> [40003.159522] 0 pages HighMem/MovableOnly
>> [40003.160984] 146372 pages reserved
>> [40003.162244] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
>> [40003.165398] [ 2560]     0  2560     2804      181      11       3        0         -1000 udevd
>> [40003.168638] [ 3976]     0  3976     2334      123       9       3        0             0 dhclient
>> [40003.171895] [ 4017]     0  4017    11626       89      23       4        0         -1000 auditd
>> [40003.175080] [ 4035]     0  4035    61861       99      23       3        0             0 rsyslogd
>> [40003.178198] [ 4046]     0  4046     3462       98      10       3        0             0 irqbalance
>> [40003.181559] [ 4052]     0  4052     1096       22       7       3        0             0 rngd
>> [40003.184683] [ 4067]    32  4067     8815       99      22       3        0             0 rpcbind
>> [40003.187772] [ 4084]    29  4084     9957      201      24       3        0             0 rpc.statd
>> [40003.191099] [ 4115]    81  4115     5442       60      15       3        0             0 dbus-daemon
>> [40003.194438] [ 4333]     0  4333    19452      522      40       3        0         -1000 sshd
>> [40003.197432] [ 4361]    38  4361     7321      562      19       3        0             0 ntpd
>> [40003.200609] [ 4376]     0  4376    22238      720      46       3        0             0 sendmail
>> [40003.203868] [ 4384]    51  4384    20103      674      41       3        0             0 sendmail
>> [40003.206963] [ 4515]     0  4515     4267       38      13       3        0             0 atd
>> [40003.210100] [ 6730]     0  6730    29888      547      13       3        0             0 crond
>> [40003.213267] [13533]   497 13533 47235415  7455314   36074     167        0             0 java
>> [40003.216364] [13674]   498 13674    49154     3168      51       3        0             0 supervisord
>> [40003.219721] [13680]   498 13680    51046     5350      69       3        0             0 python
>> [40003.222908] [13682]   498 13682    36172     5602      75       3        0             0 python
>> [40003.225952] [13683]   498 13683    32633     5319      68       3        0             0 python
>> [40003.229108] [13684]   498 13684    29577     5003      63       3        0             0 python
>> [40003.232263] [13719]   498 13719  1035920    41287     234       8        0             0 java
>> [40003.235287] [13753]   498 13753    34605     5645      70       3        0             0 python
>> [40003.238322] [14143]     0 14143     1615      420       9       3        0             0 agetty
>> [40003.241582] [14145]     0 14145     1078      377       8       3        0             0 mingetty
>> [40003.244752] [14147]     0 14147     1078      354       8       3        0             0 mingetty
>> [40003.247833] [14149]     0 14149     1078      373       8       3        0             0 mingetty
>> [40003.251193] [14151]     0 14151     1078      367       7       3        0             0 mingetty
>> [40003.254342] [14153]     0 14153     1078      348       8       3        0             0 mingetty
>> [40003.257443] [14154]     0 14154     2803      182      10       3        0         -1000 udevd
>> [40003.260688] [14155]     0 14155     2803      182      10       3        0         -1000 udevd
>> [40003.263782] [14157]     0 14157     1078      369       8       3        0             0 mingetty
>> [40003.266895] Out of memory: Kill process 13533 (java) score 970 or
>> sacrifice child
>> [40003.269702] Killed process 13533 (java) total-vm:188941660kB,
>> anon-rss:29710828kB, file-rss:110428kB
>>
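
The dmesg figures above can be cross-checked against each other.  The
sketch below is only an illustration: it assumes 4 KiB pages (the usual
x86_64 default), the regular expression is written for the exact "Killed
process" wording shown in this log, and the helper name is made up.  It
confirms that the rss and total_vm columns of the process table (counted in
pages) agree with the kB figures on the "Killed process" line.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, the usual x86_64 default

# Pattern written against the "Killed process" line in this particular log.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_killed(line: str) -> dict:
    """Extract pid, name and memory figures (kB) from a 'Killed process' line."""
    match = KILLED_RE.search(line)
    if match is None:
        raise ValueError("not a 'Killed process' line")
    return {
        key: (value if key == "name" else int(value))
        for key, value in match.groupdict().items()
    }


# The two wrapped dmesg lines, rejoined into one.
killed = parse_killed(
    "[40003.269702] Killed process 13533 (java) total-vm:188941660kB, "
    "anon-rss:29710828kB, file-rss:110428kB"
)

# Columns for pid 13533 in the process table, counted in pages.
table_total_vm_pages = 47235415
table_rss_pages = 7455314

rss_kib = killed["anon_rss"] + killed["file_rss"]
assert rss_kib == table_rss_pages * PAGE_KIB
assert killed["total_vm"] == table_total_vm_pages * PAGE_KIB

ram_kib = 7864221 * PAGE_KIB  # from "7864221 pages RAM"
print(
    f"java resident: {rss_kib / 1024**2:.1f} GiB "
    f"of {ram_kib / 1024**2:.1f} GiB RAM"
)
```

Both checks pass on these numbers, so the java process really was holding
about 28.4 GiB resident on a 30 GiB box when it was killed, far more than
the configured heap.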
>>
