Hi all,

We're running a three-node Elasticsearch cluster (two data nodes, one dataless node) and using it to store data from Logstash.
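(By "dataless" I mean a node that holds no shards and just acts as a coordinator. Roughly, the relevant elasticsearch.yml settings look like the following; this is reconstructed from memory rather than copied off the boxes:

    # es-prod-1 and es-prod-2 (data nodes)
    node.master: true
    node.data: true

    # third node (dataless)
    node.master: true
    node.data: false
)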
Every week or two, we see messages like the following in the Elasticsearch logs (the first entry is truncated as we captured it):

[24.8gb]->[24.5gb]/[24.8gb], all_pools {[young] [865.3mb]->[586mb]/[865.3mb]}{[survivor] [102.5mb]->[0b]/[108.1mb]}{[old] [23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:15,066][WARN ][monitor.jvm ] [es-prod-2] [gc][old][1189982][81480] duration [14.9s], collections [1]/[15.7s], total [14.9s]/[16.1h], memory [24.5gb]->[24.5gb]/[24.8gb], all_pools {[young] [586mb]->[592.5mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:30,715][WARN ][monitor.jvm ] [es-prod-2] [gc][old][1189983][81481] duration [14.6s], collections [1]/[15.6s], total [14.6s]/[16.1h], memory [24.5gb]->[24.5gb]/[24.8gb], all_pools {[young] [592.5mb]->[589.1mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:26:46,705][WARN ][monitor.jvm ] [es-prod-2] [gc][old][1189984][81482] duration [15.2s], collections [1]/[15.9s], total [15.2s]/[16.1h], memory [24.5gb]->[24.3gb]/[24.8gb], all_pools {[young] [589.1mb]->[445.2mb]/[865.3mb]}{[survivor] [0b]->[0b]/[108.1mb]}{[old] [23.9gb]->[23.9gb]/[23.9gb]}
[2014-11-17 15:27:03,630][WARN ][monitor.jvm ] [es-prod-2] [gc][old][1189986][81483] duration [15.8s], collections [1]/[15.9s], total [15.8s]/[16.1h], memory [24.8gb]->[24.3gb]/[24.8gb], all_pools {[young] [865.3mb]->[461.7mb]/[865.3mb]}{[survivor] [91.8mb]->[0b]/[108.1mb]}{[old] [23.9gb]->[23.9gb]/[23.9gb]}

When this occurs, search performance becomes very slow: even a simple `$ curl http://es-prod-2:9200` can take around ten seconds. The daily indexes created by Logstash vary between 5M and 80M documents, and between 1.5GiB and 25GiB on disk. The data nodes have ES_HEAP_SIZE=25G (we saw OOM errors with 15G, and going over 30GiB is not recommended, I believe because the JVM loses compressed object pointers above that point). I suspect this occurs when users try to query over a large number of indexes in Kibana.

My questions are:

1: How should I tune our cluster to handle these queries? Is our dataset simply too big?

2: When this happens, I restart the bad node like so:

curl -XPUT "http://$HOST:$PORT/_cluster/settings?pretty" -d '{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'

(stop the node, then start it again)

curl -XPUT "http://$HOST:$PORT/_cluster/settings?pretty" -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'

It's then an hour or two before the cluster is green again, as the shards are assigned and then initialized. Is this the best way to restart a bad node?

3: Can I stop users from making such intensive requests from Kibana (via either a Kibana setting or an ES setting)?

Thanks,
Wilfred
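P.S. For context on questions 1 and 3: my understanding is that a Kibana histogram panel pointed at a multi-week time range issues something roughly like the following against every matching daily index. This is a hand-written approximation rather than a request captured off the wire, and the facet name is made up:

curl -XPOST "http://es-prod-2:9200/logstash-2014.11.*/_search?pretty" -d '{
  "size": 0,
  "facets": {
    "events_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "30m"
      }
    }
  }
}'

Run against a month of our larger indexes, that is a facet over hundreds of millions of documents, and as I understand it facets load field data onto the heap, which would fit the old-gen pressure we're seeing.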
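P.P.S. In case it's relevant to question 2: beyond the monitor.jvm warnings above, the way we confirm a node is GC-bound before restarting it is the node stats API, along the lines of:

curl 'http://es-prod-2:9200/_nodes/es-prod-2/stats/jvm?pretty'

checking jvm.mem.heap_used_percent and the old-generation collection counts and times under jvm.gc.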