You’re running DSE, so the OSS list may not be much help; DataStax may have more insight.
In open source, the only off-heap structures that vary significantly are bloom filters and compression offsets. Both scale with disk space, and both grow during compaction, so a large STCS compaction can cause pretty meaningful allocations for them. Also, if you have an unusually small compression chunk size or a very low bloom filter false-positive (FP) ratio, those structures will be larger.

-- Jeff Jirsa

> On Jan 26, 2019, at 12:11 PM, Ayub M <hia...@gmail.com> wrote:
>
> Cassandra node went down due to OOM, and checking /var/log/messages I see the below.
>
> ```
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
> ....
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap  = 0kB
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ]   uid  tgid  total_vm       rss nr_ptes swapents oom_score_adj name
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634]     0  2634     41614       326      82        0             0 systemd-journal
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690]     0  2690     29793       541      27        0             0 lvmetad
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710]     0  2710     11892       762      25        0         -1000 systemd-udevd
> .....
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774]     0 13774    459778     97729     429        0             0 Scan Factory
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506]     0 14506     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586]     0 14586     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588]     0 14588     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589]     0 14589     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598]     0 14598     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599]     0 14599     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600]     0 14600     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601]     0 14601     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679]     0 19679     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680]     0 19680     21628      5340      24        0             0 macompatsvc
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084]  1007  9084   2822449    260291     810        0             0 java
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509]  1007  8509  17223585  14908485   32510        0             0 java
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877]     0 21877    461828     97716     318        0             0 ScanAction Mgr
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884]     0 21884    496653     98605     340        0             0 OAS Manager
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718]    89 31718     25474       486      48        0             0 pickup
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891]  1007  4891     26999       191       9        0             0 iostat
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957]  1007  4957     26999       192      10        0             0 iostat
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
> Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
> ```
>
> Nothing else runs on this host except DSE Cassandra (with Search) and monitoring agents. Max heap size is set to 31 GB, and the Cassandra java process seems to have been using ~57 GB (RAM is 62 GB) at the time of the error.
> So I am guessing the JVM started using lots of memory and that triggered the OOM kill. Is my understanding correct, that this was a Linux-triggered kill of the JVM because it was consuming more than the available memory?
>
> In that case the JVM was using a max of 31 GB of heap, and the remaining ~26 GB it was using is non-heap memory. Normally this process takes around 42 GB, and the fact that it was consuming 57 GB at the moment of the OOM makes me suspect the java process is the culprit rather than the victim.
>
> No heap dump was taken at the time of the issue; I have configured that now. But even if a heap dump had been taken, would it have helped figure out what was consuming the memory? A heap dump only covers the heap area, so what should be used to dump non-heap memory? Native memory tracking is one thing I came across.
> Any way to have native memory dumped when an OOM occurs?
> What's the best way to monitor JVM memory to diagnose OOM errors?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
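[Editor's note] On the native-memory questions at the end: the JVM's Native Memory Tracking (NMT) can attribute most non-heap usage, but it must be enabled before the fact, and nothing dumps automatically when the kernel oom-killer fires (the process is killed outright), so only periodic snapshots help. A hedged config-fragment sketch; the flag and `jcmd` subcommands are standard HotSpot, but verify against your JVM version:

```
# In jvm.options (restart required; NMT adds a few percent of overhead,
# and it tracks JVM-internal allocations only, not arbitrary native mallocs):
-XX:NativeMemoryTracking=summary

# Then snapshot the running process periodically, e.g. for pid 8509:
#   jcmd 8509 VM.native_memory summary
#   jcmd 8509 VM.native_memory baseline
#   jcmd 8509 VM.native_memory summary.diff   # growth since the baseline
```

At the Cassandra level, `nodetool tablestats` reports the per-table off-heap consumers (lines such as "Bloom filter off heap memory used" and "Compression metadata off heap memory used"), which correspond to the structures Jeff describes above.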
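[Editor's note] As a sanity check on the numbers in the thread: `total_vm` and `rss` in the kernel's task dump are counted in 4 KiB pages, so the figures in the question can be reproduced directly. An illustrative sketch using shell arithmetic, not part of the original thread:

```shell
# oom-killer task-dump columns for pid 8509 (the Cassandra JVM), in 4 KiB pages
rss_pages=14908485
vm_pages=17223585
ram_pages=16394647            # "16394647 pages RAM" for the whole host

echo "rss: $(( rss_pages * 4 / 1024 / 1024 )) GiB"   # prints 56 (GiB resident)
echo "vm:  $(( vm_pages  * 4 / 1024 / 1024 )) GiB"   # prints 65 (GiB mapped)
echo "ram: $(( ram_pages * 4 / 1024 / 1024 )) GiB"   # prints 62 (GiB physical)

# Cross-check: rss in the table equals anon-rss + file-rss on the kill line
echo $(( (59496344 + 137596) / 4 ))                  # prints 14908485 pages
```

Roughly 57 GiB resident out of ~62 GiB physical RAM with no swap configured; with a 31 GB heap, about 26 GiB of the resident set was native (non-heap) memory. That matches the poster's arithmetic and supports the reading that this was the kernel oom-killer reacting to the process's size, not a Java-level OutOfMemoryError.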