We have a 4-node cluster in AWS that looks like this:

1 x m4.2xlarge - runs all Graylog roles and processes incoming messages
3 x m4.xlarge  - run the "backend" roles: ES, graylog-server, etcd, mongo

All nodes have a 2.4TB EBS-backed data volume. We store about 4TB of data 
(2.5 billion messages across 1800 indices, roughly 2GB each). We run the 
provided AMIs with the 1.2.1 omnibus package - more precisely, we started 
from the provided 1.1.4 image and upgraded each instance through 1.1.6 and 
1.2.0 to 1.2.1.

When we restart even a single ES node, after a few minutes we end up with 
JVM GC warnings in the ES log, usually on the ES master node:

2015-10-02_21:43:28.48251 [2015-10-02 21:43:28,482][WARN ][monitor.jvm              ] [Ms. MODOK] [gc][old][2628][41] duration [19.3s], collections [1]/[20s], total [19.3s]/[2m], memory [9.1gb]->[9.1gb]/[9.3gb], all_pools {[young] [35.7mb]->[36.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:55.40431 [2015-10-02 21:43:55,404][WARN ][monitor.jvm              ] [Ms. MODOK] [gc][old][2630][42] duration [25.7s], collections [1]/[25.8s], total [25.7s]/[2.5m], memory [9.3gb]->[9gb]/[9.3gb], all_pools {[young] [266.2mb]->[21.1mb]/[266.2mb]}{[survivor] [23.1mb]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:56.30787 [2015-10-02 21:43:56,307][WARN ][cluster.service          ] [Ms. MODOK] cluster state update task [shard-started ([graylog_3873][0], node[5r9lBQcNQRCxPRwovXtzyg], [R], s[INITIALIZING], unassigned_info[[reason=NODE_LEFT], at[2015-10-02T21:19:44.518Z], details[node_left[CZIGj-wJQFqOYP0ZWdWHdg]]]), reason [after recovery (replica) from node [[Ms. MODOK][QvZjbJ9kR12F41NZwNygTg][example.com][inet[/x.x.x.x:9300]]]]] took 1.2m above the warn threshold of 30s
2015-10-02_21:44:16.95477 [2015-10-02 21:44:16,954][WARN ][monitor.jvm              ] [Ms. MODOK] [gc][old][2632][43] duration [19.6s], collections [1]/[20.5s], total [19.6s]/[2.8m], memory [9.3gb]->[9.1gb]/[9.3gb], all_pools {[young] [254.3mb]->[49.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}

At that point, ES becomes slow to respond to almost every request, even 
something as simple as /_cluster/health.
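
For reference, these are the kinds of requests I mean, and where I can see 
the old gen pinned against its ceiling (assuming the default HTTP port on 
localhost):

  curl -s 'localhost:9200/_cluster/health?pretty'
  curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used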

It's very tricky to get ES back to a fully initialized state. Eventually 
unassigned_shards stops decreasing altogether. I usually have to restart ES 
on the machine that is throwing the JVM warnings and hope it does better on 
the next round.
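
To be concrete about "stops decreasing": roughly the check I'm watching is 
something like this (again assuming localhost:9200), and the count just 
sits at the same number for long stretches:

  # shards still waiting to be assigned
  curl -s 'localhost:9200/_cat/shards' | grep -c UNASSIGNED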

In my elasticsearch.yml I have "bootstrap.mlockall: true", and as far as 
the API can see it is in effect on all nodes.
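
For completeness, this is the API check I'm referring to (default port 
assumed); every node reports true:

  curl -s 'localhost:9200/_nodes/process?pretty' | grep mlockall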

Any suggestions out there?

Thanks!
