This is my cluster setup: 5 x m3.2xlarge instances, each with a 240 GB general-purpose SSD-backed EBS volume. I have allocated 22 GB of each instance's 30 GB of RAM to Elasticsearch (with the mlockall option set). Initially I had 5 x m3.xlarge instances, but they kept crashing with OOM errors, so I ended up doubling the RAM. The cluster holds 225 million documents occupying 170 GB (replicas not taken into account).
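For reference, the commonly cited Elasticsearch sizing guideline is to give the JVM heap no more than about half of physical RAM, so that the rest is left for the OS page cache (which Lucene relies on heavily) and the JVM's off-heap allocations (direct buffers, metaspace, thread stacks). A quick back-of-the-envelope check in plain Python, using only the 30 GB / 22 GB figures from the setup above:

```python
# Rough memory budget for one m3.2xlarge node (30 GB RAM) with a 22 GB
# heap locked via mlockall. Everything outside the heap -- OS, page
# cache, and the JVM's own off-heap memory -- must fit in what remains.
total_ram_gb = 30
heap_gb = 22  # the -Xmx value described in the post

headroom_gb = total_ram_gb - heap_gb
print(headroom_gb)  # 8 GB left for OS + page cache + off-heap

# The usual guideline: heap at most ~50% of physical RAM.
recommended_heap_gb = total_ram_gb // 2
print(recommended_heap_gb)  # 15 GB, well below the 22 GB in use
```

With only 8 GB of headroom, any significant off-heap growth (direct buffers, merges, thread stacks) pushes the box toward the oom-killer, which is consistent with the `top` output below showing 28 GB resident for a 22 GB heap.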
Indexing-wise, the cluster receives 500 to 1,000 documents per minute, and I do not do any bulk processing. Search-wise, there are some batch jobs that mostly issue "timestamp between" range queries, plus some multigets. What I see is that kopf/ElasticHQ indicate the ES node is using 13-14 GB of the 22 GB I allocated to the process, but when I log in to the box, the process is taking 98.3% of RAM. This is the output of top:

  PID USER     PR NI  VIRT RES  SHR S %CPU %MEM     TIME+ COMMAND
 1903 elastics 20  0 53.8g 28g 121m S 19.3 98.3 563:14.74 java

Eventually the GC kicks in and starts its job, but at some point the kernel's oom-killer is invoked and kills the node abruptly. This is the message in /var/log/messages:

Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.903747] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.908324] java cpuset=/ mems_allowed=0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.910666] CPU: 7 PID: 3031 Comm: java Not tainted 3.10.42-52.145.amzn1.x86_64 #1
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.915092] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/23/2014
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.918807] 0000000000000000 ffff880037565978 ffffffff8144c2b9 ffff8800375659e8
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.923221] ffffffff814497df ffff88078ffbfb38 00000000000000a9 ffff880037565a50
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.926556] ffff8800375659b0 ffff8807523fdec0 0000000000000000 ffffffff81a4ffe0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.930586] Call Trace:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.931799] [<ffffffff8144c2b9>] dump_stack+0x19/0x1b
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.933982] [<ffffffff814497df>] dump_header+0x7f/0x1c2
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.936321] [<ffffffff811180e9>] oom_kill_process+0x1a9/0x310
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.939246] [<ffffffff81207575>] ? security_capable_noaudit+0x15/0x20
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.942265] [<ffffffff81118839>] out_of_memory+0x429/0x460
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.944663] [<ffffffff8111dd37>] __alloc_pages_nodemask+0x947/0x9e0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.947900] [<ffffffff81157ce9>] alloc_pages_current+0xa9/0x170
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.950776] [<ffffffff811150a7>] __page_cache_alloc+0x87/0xb0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.953393] [<ffffffff81116f85>] filemap_fault+0x185/0x430
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.955758] [<ffffffff8113801f>] __do_fault+0x6f/0x4f0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.958038] [<ffffffff8144e43b>] ? __wait_on_bit_lock+0xab/0xc0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.961099] [<ffffffff8113b203>] handle_pte_fault+0x93/0xa10
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.963773] [<ffffffff81116518>] ? generic_file_aio_read+0x588/0x700
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.966555] [<ffffffff8113c939>] handle_mm_fault+0x299/0x690
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.969017] [<ffffffff81455430>] __do_page_fault+0x150/0x4f0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.972042] [<ffffffff814557de>] do_page_fault+0xe/0x10
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.974363] [<ffffffff81451e58>] page_fault+0x28/0x30
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.976827] Mem-Info:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.977849] Node 0 DMA per-cpu:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.980128] CPU 0: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.982184] CPU 1: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.984273] CPU 2: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.986381] CPU 3: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.988446] CPU 4: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.990510] CPU 5: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.992796] CPU 6: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.994810] CPU 7: hi: 0, btch: 1 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.997023] Node 0 DMA32 per-cpu:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.999044] CPU 0: hi: 186, btch: 31 usd: 5
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.001285] CPU 1: hi: 186, btch: 31 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.003415] CPU 2: hi: 186, btch: 31 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.005766] CPU 3: hi: 186, btch: 31 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.008445] CPU 4: hi: 186, btch: 31 usd: 1
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.010760] CPU 5: hi: 186, btch: 31 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.032100] CPU 6: hi: 186, btch: 31 usd: 1
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.034094] CPU 7: hi: 186, btch: 31 usd: 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] active_anon:1546271 inactive_anon:12 isolated_anon:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] active_file:290 inactive_file:0 isolated_file:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] unevictable:5989349 dirty:3 writeback:3 unstable:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] free:47166 slab_reclaimable:3187 slab_unreclaimable:4496
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] mapped:30828 shmem:14 pagetables:17021 bounce:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] free_cma:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.051002] Node 0 DMA free:15904kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.067258] lowmem_reserve[]: 0 3726 30123 30123
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.069508] Node 0 DMA32 free:113760kB min:8352kB low:10440kB high:12528kB active_anon:3688028kB inactive_anon:0kB active_file:860kB inactive_file:0kB unevictable:18312kB isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3815484kB mlocked:18312kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:2168kB slab_unreclaimable:3032kB kernel_stack:240kB pagetables:10768kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:3680 all_unreclaimable? yes
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.087556] lowmem_reserve[]: 0 0 26397 26397
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.089748] Node 0 Normal free:59000kB min:59192kB low:73988kB high:88788kB active_anon:2497056kB inactive_anon:48kB active_file:300kB inactive_file:0kB unevictable:23939084kB isolated(anon):0kB isolated(file):0kB present:27525120kB managed:27030916kB mlocked:23939084kB dirty:12kB writeback:12kB mapped:123516kB shmem:56kB slab_reclaimable:10580kB slab_unreclaimable:14952kB kernel_stack:1976kB pagetables:57316kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1232 all_unreclaimable? yes
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.110370] lowmem_reserve[]: 0 0 0 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.112315] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15904kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.119563] Node 0 DMA32: 536*4kB (UEM) 178*8kB (UEM) 120*16kB (UEM) 83*32kB (UEM) 105*64kB (UEM) 92*128kB (UEM) 66*256kB (UE) 41*512kB (UEM) 25*1024kB (E) 6*2048kB (UER) 3*4096kB (UER) = 114704kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.129555] Node 0 Normal: 389*4kB (UEM) 415*8kB (UEM) 612*16kB (UEM) 221*32kB (UEM) 79*64kB (UEM) 33*128kB (UEM) 25*256kB (UEM) 20*512kB (UEM) 6*1024kB (E) 1*2048kB (E) 1*4096kB (R) = 59948kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.138500] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.142208] 30865 total pagecache pages
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.144094] 0 pages in swap cache
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.145569] Swap cache stats: add 0, delete 0, find 0/0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.147876] Free swap = 0kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.149141] Total swap = 0kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.195659] 7864319 pages RAM
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.197108] 142330 pages reserved
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.198497] 824561 pages shared
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.199895] 7642193 pages non-shared
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.201430] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.204804] [ 1205]   0  1205     2696     251    10  0 -1000 udevd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.208207] [ 1672]   0  1672     2307     165     8  0     0 dhclient
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.211790] [ 1712]   0  1712    27951     171    22  0 -1000 auditd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.215640] [ 1730]   0  1730    62347     208    23  0     0 rsyslogd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.219266] [ 1741]   0  1741     3391     155    10  0     0 irqbalance
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.223430] [ 1752]  81  1752     5407      85    15  0     0 dbus-daemon
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.227648] [ 1778]   0  1778     1049     145     7  0     0 acpid
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.231099] [ 1859]   0  1859    19495     282    39  0 -1000 sshd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.234819] [ 1885]  38  1885     7827     363    20  0     0 ntpd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.238520] [ 1903] 498  1903 13999213 7531530 16667  0     0 java
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.241863] [ 1919]   0  1919    22328     569    44  0     0 sendmail
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.245955] [ 1926]  51  1926    20191     454    39  0     0 sendmail
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.249406] [ 2008]   0  2008    29840     285    14  0     0 crond
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.252867] [ 2021]   0  2021     4227      98    11  0     0 atd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.257122] [ 2064]   0  2064     1576     188     9  0     0 agetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.260702] [ 2065]   0  2065     1045     140     8  0     0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.264387] [ 2069]   0  2069     1045     140     8  0     0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.267835] [ 2072]   0  2072     1045     141     8  0     0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.271166] [ 2074]   0  2074     1045     140     8  0     0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.274760] [ 2076]   0  2076     1045     141     8  0     0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.278119] [ 2078]   0  2078     1045     139     8  0     0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.281620] [ 2079]   0  2079     2695     157     9  0 -1000 udevd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.284961] [ 2080]   0  2080     2695     157     9  0 -1000 udevd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.288251] [ 9770]   0  9770   193442     324    42  0     0 collectd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.291632] Out of memory: Kill process 1903 (java) score 985 or sacrifice child
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.295293] Killed process 1903 (java) total-vm:55996852kB, anon-rss:30006572kB, file-rss:119548kB

I have also enabled the fielddata circuit breaker, but this still happened afterwards. It probably has something to do with segment merges, but I am not certain. On the earlier, smaller-RAM (16 GB) boxes, nodes in the cluster were killed by the oom-killer just because a multiget was being executed in an infinite loop.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/00cdf05c-e86f-4205-974e-d3eac076431b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
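Since the indexing described above sends documents one at a time, batching them through the `_bulk` endpoint would cut per-request overhead at 500-1,000 documents per minute. A minimal sketch of assembling an NDJSON bulk body (the index name, type, and documents here are hypothetical, chosen only for illustration; the action-line/source-line format is the standard `_bulk` request body):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk request body: for each document, one
    action line ({"index": ...}) followed by one source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical documents resembling the timestamped data in the post.
docs = [{"timestamp": "2014-10-14T22:00:%02d" % i, "value": i} for i in range(3)]
body = build_bulk_body("events", "event", docs)
print(body.count("\n"))  # 6 newlines: one action + one source per document
```

The resulting string would be POSTed to the cluster's `/_bulk` endpoint; batching a minute's worth of documents into one request is usually far cheaper than a thousand individual index calls.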