We have a 4-node cluster in AWS that looks like this:

1 x m4.2xlarge - runs all Graylog roles, processes incoming messages
3 x m4.xlarge - runs the "backend" roles - ES, graylog-server, etcd, mongo
All nodes have a 2.4TB EBS-backed data volume, and we store about 4TB of data (2.5 billion messages across 1,800 indices, roughly 2GB per index). We use the provided AMIs with the 1.2.1 omnibus package - or rather, we started from the provided 1.1.4 image and upgraded each instance to 1.1.6, then 1.2.0, and now 1.2.1.

When restarting even a single ES node, after a few minutes we end up with JVM warnings in the ES log, usually on the ES master node:

2015-10-02_21:43:28.48251 [2015-10-02 21:43:28,482][WARN ][monitor.jvm ] [Ms. MODOK] [gc][old][2628][41] duration [19.3s], collections [1]/[20s], total [19.3s]/[2m], memory [9.1gb]->[9.1gb]/[9.3gb], all_pools {[young] [35.7mb]->[36.5mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:55.40431 [2015-10-02 21:43:55,404][WARN ][monitor.jvm ] [Ms. MODOK] [gc][old][2630][42] duration [25.7s], collections [1]/[25.8s], total [25.7s]/[2.5m], memory [9.3gb]->[9gb]/[9.3gb], all_pools {[young] [266.2mb]->[21.1mb]/[266.2mb]}{[survivor] [23.1mb]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:56.30787 [2015-10-02 21:43:56,307][WARN ][cluster.service ] [Ms. MODOK] cluster state update task [shard-started ([graylog_3873][0], node[5r9lBQcNQRCxPRwovXtzyg], [R], s[INITIALIZING], unassigned_info[[reason=NODE_LEFT], at[2015-10-02T21:19:44.518Z], details[node_left[CZIGj-wJQFqOYP0ZWdWHdg]]]), reason [after recovery (replica) from node [[Ms. MODOK][QvZjbJ9kR12F41NZwNygTg][example.com][inet[/x.x.x.x:9300]]]]] took 1.2m above the warn threshold of 30s
2015-10-02_21:44:16.95477 [2015-10-02 21:44:16,954][WARN ][monitor.jvm ] [Ms. MODOK] [gc][old][2632][43] duration [19.6s], collections [1]/[20.5s], total [19.6s]/[2.8m], memory [9.3gb]->[9.1gb]/[9.3gb], all_pools {[young] [254.3mb]->[49.7mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}

At that point, ES becomes slow to respond to almost every request, including /_cluster/health. Getting ES back to a fully initialized state is very tricky: eventually unassigned_shards stops progressing, and I usually have to restart ES on the node throwing the JVM warnings and hope it does better on the next round.

In my elasticsearch.yml I have "bootstrap.mlockall: true", and as far as the API can see it is in effect on all nodes. The exact checks I'm running are at the bottom of this post.

Any suggestions out there? Thanks!
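A few more data points, in case they help. To watch heap per node while this is happening, I'm hitting the cat and stats APIs - a minimal sketch, assuming the stock HTTP port 9200 on localhost:

    # Per-node heap at a glance (heap.percent / heap.max are _cat/nodes column names)
    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,master'

    # Detailed JVM stats: pool sizes (young/survivor/old) plus GC counts and durations
    curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

The old-gen numbers in the warnings above ([old] [9gb]->[9gb]/[9gb]) suggest the old generation is completely full, so the collector runs back-to-back without reclaiming anything.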
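To tell whether shard recovery is stalled or just slow, I watch cluster health alongside the recovery API:

    # unassigned_shards here is the number that stops progressing
    curl -s 'localhost:9200/_cluster/health?pretty'

    # Per-shard recovery progress: source/target node, files and bytes percent done
    curl -s 'localhost:9200/_cat/recovery?v'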
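And this is how I confirmed mlockall took effect - the process info API reports the flag for each node:

    # Each node should report "mlockall" : true
    curl -s 'localhost:9200/_nodes/process?pretty' | grep -i mlockall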
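One thing I'm unsure about: the unassigned_info[[reason=NODE_LEFT]] entries above mean the restart itself kicks off replica recoveries as soon as the node leaves. If I understand the usual rolling-restart procedure for ES 1.x correctly, allocation can be toggled off around the restart window so the cluster doesn't start rebuilding replicas immediately - a sketch of that sequence (cluster.routing.allocation.enable is a dynamic setting):

    # Before stopping the node: stop the cluster from reassigning its shards
    curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient" : { "cluster.routing.allocation.enable" : "none" }
    }'

    # ...restart the node and wait for it to rejoin the cluster...

    # Then re-enable allocation so the shards recover in place
    curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient" : { "cluster.routing.allocation.enable" : "all" }
    }'

If anyone can say whether that helps with the GC pressure during recovery, or whether a ~9.3GB heap is simply undersized for 1,800 indices, I'd appreciate it.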