This is the cluster setup: I have 5 x m3.2xlarge instances (each with a 
240 GB general purpose SSD-backed EBS volume). I have allocated 22 GB of 
the 30 GB of RAM to Elasticsearch (with the mlockall option set). Initially 
I had 5 x m3.xlarge instances, but they kept crashing because of OOM, so I 
ended up doubling the RAM. The cluster holds 225 million documents occupying 
170 GB (replicas not taken into account).

Indexing-wise, the cluster receives 500-1000 documents per minute, and I do 
not do any bulk processing. Search-wise, there are some batch jobs which 
mostly issue "timestamp in between" queries, followed by some multigets. 
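
For reference, a "timestamp in between" query of the kind mentioned above 
could look roughly like the following sketch. The field name "timestamp" 
and the date values are placeholders, not the actual mapping or queries 
used on this cluster; the body is a plain Elasticsearch range query.

```python
import json


def timestamp_between_query(field, start, end):
    """Build a search body matching documents whose `field` is in [start, end]."""
    return {"query": {"range": {field: {"gte": start, "lte": end}}}}


# Hypothetical field name and date range, for illustration only.
body = timestamp_between_query(
    "timestamp", "2014-10-14T00:00:00", "2014-10-15T00:00:00"
)
print(json.dumps(body, indent=2))
```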

What I see is that kopf/ElasticHQ indicate that the ES node uses 13-14 GB 
of the 22 GB I allocated to the process, but when I log in to the box, the 
process is taking 98.3% of RAM. 

This is the output of the top command:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 1903 elastics  20   0 53.8g  28g 121m S 19.3 98.3 563:14.74 java

Eventually, somewhere down the line, the GC kicks in and starts its job, but 
the oom-killer is invoked and the kernel abruptly kills the node. This is 
the message in /var/log/messages:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.903747] java invoked 
oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.908324] java cpuset=/ 
mems_allowed=0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.910666] CPU: 7 PID: 3031 
Comm: java Not tainted 3.10.42-52.145.amzn1.x86_64 #1
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.915092] Hardware name: 
Xen HVM domU, BIOS 4.2.amazon 05/23/2014
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.918807]  0000000000000000 
ffff880037565978 ffffffff8144c2b9 ffff8800375659e8
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.923221]  ffffffff814497df 
ffff88078ffbfb38 00000000000000a9 ffff880037565a50
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.926556]  ffff8800375659b0 
ffff8807523fdec0 0000000000000000 ffffffff81a4ffe0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.930586] Call Trace:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.931799] 
 [<ffffffff8144c2b9>] dump_stack+0x19/0x1b
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.933982] 
 [<ffffffff814497df>] dump_header+0x7f/0x1c2
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.936321] 
 [<ffffffff811180e9>] oom_kill_process+0x1a9/0x310
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.939246] 
 [<ffffffff81207575>] ? security_capable_noaudit+0x15/0x20
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.942265] 
 [<ffffffff81118839>] out_of_memory+0x429/0x460
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.944663] 
 [<ffffffff8111dd37>] __alloc_pages_nodemask+0x947/0x9e0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.947900] 
 [<ffffffff81157ce9>] alloc_pages_current+0xa9/0x170
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.950776] 
 [<ffffffff811150a7>] __page_cache_alloc+0x87/0xb0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.953393] 
 [<ffffffff81116f85>] filemap_fault+0x185/0x430
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.955758] 
 [<ffffffff8113801f>] __do_fault+0x6f/0x4f0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.958038] 
 [<ffffffff8144e43b>] ? __wait_on_bit_lock+0xab/0xc0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.961099] 
 [<ffffffff8113b203>] handle_pte_fault+0x93/0xa10
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.963773] 
 [<ffffffff81116518>] ? generic_file_aio_read+0x588/0x700
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.966555] 
 [<ffffffff8113c939>] handle_mm_fault+0x299/0x690
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.969017] 
 [<ffffffff81455430>] __do_page_fault+0x150/0x4f0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.972042] 
 [<ffffffff814557de>] do_page_fault+0xe/0x10
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.974363] 
 [<ffffffff81451e58>] page_fault+0x28/0x30
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.976827] Mem-Info:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.977849] Node 0 DMA 
per-cpu:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.980128] CPU    0: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.982184] CPU    1: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.984273] CPU    2: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.986381] CPU    3: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.988446] CPU    4: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.990510] CPU    5: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.992796] CPU    6: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.994810] CPU    7: hi:   
 0, btch:   1 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.997023] Node 0 DMA32 
per-cpu:
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125758.999044] CPU    0: hi: 
 186, btch:  31 usd:   5
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.001285] CPU    1: hi: 
 186, btch:  31 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.003415] CPU    2: hi: 
 186, btch:  31 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.005766] CPU    3: hi: 
 186, btch:  31 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.008445] CPU    4: hi: 
 186, btch:  31 usd:   1
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.010760] CPU    5: hi: 
 186, btch:  31 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.032100] CPU    6: hi: 
 186, btch:  31 usd:   1
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.034094] CPU    7: hi: 
 186, btch:  31 usd:   0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] 
active_anon:1546271 inactive_anon:12 isolated_anon:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149]  active_file:290 
inactive_file:0 isolated_file:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149] 
 unevictable:5989349 dirty:3 writeback:3 unstable:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149]  free:47166 
slab_reclaimable:3187 slab_unreclaimable:4496
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149]  mapped:30828 
shmem:14 pagetables:17021 bounce:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.036149]  free_cma:0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.051002] Node 0 DMA 
free:15904kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB 
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 
all_unreclaimable? yes
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.067258] lowmem_reserve[]: 
0 3726 30123 30123
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.069508] Node 0 DMA32 
free:113760kB min:8352kB low:10440kB high:12528kB active_anon:3688028kB 
inactive_anon:0kB active_file:860kB inactive_file:0kB unevictable:18312kB 
isolated(anon):0kB isolated(file):0kB present:3915776kB managed:3815484kB 
mlocked:18312kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB 
slab_reclaimable:2168kB slab_unreclaimable:3032kB kernel_stack:240kB 
pagetables:10768kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:3680 all_unreclaimable? yes
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.087556] lowmem_reserve[]: 
0 0 26397 26397
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.089748] Node 0 Normal 
free:59000kB min:59192kB low:73988kB high:88788kB active_anon:2497056kB 
inactive_anon:48kB active_file:300kB inactive_file:0kB 
unevictable:23939084kB isolated(anon):0kB isolated(file):0kB 
present:27525120kB managed:27030916kB mlocked:23939084kB dirty:12kB 
writeback:12kB mapped:123516kB shmem:56kB slab_reclaimable:10580kB 
slab_unreclaimable:14952kB kernel_stack:1976kB pagetables:57316kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1232 
all_unreclaimable? yes
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.110370] lowmem_reserve[]: 
0 0 0 0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.112315] Node 0 DMA: 0*4kB 
0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB 
(U) 1*2048kB (R) 3*4096kB (M) = 15904kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.119563] Node 0 DMA32: 
536*4kB (UEM) 178*8kB (UEM) 120*16kB (UEM) 83*32kB (UEM) 105*64kB (UEM) 
92*128kB (UEM) 66*256kB (UE) 41*512kB (UEM) 25*1024kB (E) 6*2048kB (UER) 
3*4096kB (UER) = 114704kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.129555] Node 0 Normal: 
389*4kB (UEM) 415*8kB (UEM) 612*16kB (UEM) 221*32kB (UEM) 79*64kB (UEM) 
33*128kB (UEM) 25*256kB (UEM) 20*512kB (UEM) 6*1024kB (E) 1*2048kB (E) 
1*4096kB (R) = 59948kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.138500] Node 0 
hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.142208] 30865 total 
pagecache pages
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.144094] 0 pages in swap 
cache
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.145569] Swap cache stats: 
add 0, delete 0, find 0/0
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.147876] Free swap  = 0kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.149141] Total swap = 0kB
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.195659] 7864319 pages RAM
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.197108] 142330 pages 
reserved
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.198497] 824561 pages 
shared
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.199895] 7642193 pages 
non-shared
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.201430] [ pid ]   uid 
 tgid total_vm      rss nr_ptes swapents oom_score_adj name
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.204804] [ 1205]     0 
 1205     2696      251      10        0         -1000 udevd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.208207] [ 1672]     0 
 1672     2307      165       8        0             0 dhclient
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.211790] [ 1712]     0 
 1712    27951      171      22        0         -1000 auditd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.215640] [ 1730]     0 
 1730    62347      208      23        0             0 rsyslogd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.219266] [ 1741]     0 
 1741     3391      155      10        0             0 irqbalance
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.223430] [ 1752]    81 
 1752     5407       85      15        0             0 dbus-daemon
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.227648] [ 1778]     0 
 1778     1049      145       7        0             0 acpid
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.231099] [ 1859]     0 
 1859    19495      282      39        0         -1000 sshd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.234819] [ 1885]    38 
 1885     7827      363      20        0             0 ntpd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.238520] [ 1903]   498 
 1903 13999213  7531530   16667        0             0 java
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.241863] [ 1919]     0 
 1919    22328      569      44        0             0 sendmail
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.245955] [ 1926]    51 
 1926    20191      454      39        0             0 sendmail
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.249406] [ 2008]     0 
 2008    29840      285      14        0             0 crond
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.252867] [ 2021]     0 
 2021     4227       98      11        0             0 atd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.257122] [ 2064]     0 
 2064     1576      188       9        0             0 agetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.260702] [ 2065]     0 
 2065     1045      140       8        0             0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.264387] [ 2069]     0 
 2069     1045      140       8        0             0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.267835] [ 2072]     0 
 2072     1045      141       8        0             0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.271166] [ 2074]     0 
 2074     1045      140       8        0             0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.274760] [ 2076]     0 
 2076     1045      141       8        0             0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.278119] [ 2078]     0 
 2078     1045      139       8        0             0 mingetty
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.281620] [ 2079]     0 
 2079     2695      157       9        0         -1000 udevd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.284961] [ 2080]     0 
 2080     2695      157       9        0         -1000 udevd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.288251] [ 9770]     0 
 9770   193442      324      42        0             0 collectd
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.291632] Out of memory: 
Kill process 1903 (java) score 985 or sacrifice child
Oct 14 22:06:10 ip-10-213-155-189 kernel: [125759.295293] Killed process 
1903 (java) total-vm:55996852kB, anon-rss:30006572kB, file-rss:119548kB
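
To make the page counts in the report above easier to compare with the 
heap setting and the top output, here is a small conversion sketch. It 
assumes the standard 4 KiB page size on x86_64; the page counts are copied 
from the log, and the helper name is just illustrative.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages (the usual x86_64 default)


def pages_to_gib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a kernel page count into GiB."""
    return pages * page_size / 2**30


# Page counts taken from the oom-killer report above.
oom_report_pages = {
    "unevictable": 5989349,
    "active_anon": 1546271,
    "java rss (pid 1903)": 7531530,
    "total RAM": 7864319,
}

for name, pages in oom_report_pages.items():
    print(f"{name:>20}: {pages_to_gib(pages):6.2f} GiB")
```

Under that assumption, the unevictable (mlocked) memory comes to roughly 
22.8 GiB, which is close to the 22 GB heap locked via mlockall, and the 
java RSS comes to roughly 28.7 GiB, in line with the 28g RES shown by top. 
The remaining ~5.9 GiB of active_anon appears to be memory held by the 
process outside the locked region.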

I have also added the field cache circuit breaker, but the crash still 
happened after that. This probably has something to do with merges, but I 
am not certain. 
The smaller (16 GB RAM) boxes, and the cluster with them, got killed by the 
oom-killer just because a multiget was being executed in an infinite loop.
