Hi there

I just did a simple search on 30 days of data and managed to trigger the 
following ES error:

[2016-06-01 00:12:53,525][WARN ][indices.breaker.fielddata] [fielddata] New 
used memory 11273780309 [10.4gb] for data of [message] would be larger than 
configured breaker: 10857952051 [10.1gb], breaking


According to what I can google, this means that to satisfy the search ES 
would have had to load more fielddata (for the "message" field) into the 
heap than the configured breaker limit allows, and that condition somehow 
triggers an epic fail: either ES becomes unresponsive or graylog-server 
does - I can't tell which. All I know is that right now I have messages 
going into graylog and nothing coming out.
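For what it's worth, the two knobs that the googling turns up (assuming ES 
2.x defaults; the percentages below are purely illustrative, and 
localhost:9200 is just my single-node setup) are the fielddata breaker 
limit, which can be changed at runtime, and the fielddata cache size, which 
is static:

# lower the fielddata breaker (defaults to 60% of heap) so requests get 
# rejected earlier:
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
{ "persistent" : { "indices.breaker.fielddata.limit" : "40%" } }'

# the fielddata cache itself is unbounded by default; capping it needs a 
# static entry in elasticsearch.yml plus a restart:
#   indices.fielddata.cache.size: 30%

As far as I can tell, the breaker only rejects the offending request; it's 
the unbounded fielddata cache that lets the heap creep towards the 
OutOfMemoryError that followed.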

Within a minute, things went from bad to worse: suddenly I was getting 
shard errors (the first shard errors in ages, so definitely related):

[2016-06-01 00:21:32,860][WARN ][indices.cluster          ] [fantail] 
[[graylog_488][0]] marking and sending shard failed due to [engine failure, 
reason [already closed by tragic event on the index writer]]
[graylog_488][[graylog_488][0]] ShardNotFoundException[no such shard]
at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
[2016-06-01 00:21:32,962][WARN ][cluster.action.shard     ] [fantail] 
[graylog_488][0] received shard failed for target shard [[graylog_488][0], 
node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED], 
a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message 
[engine failure, reason [already closed by tragic event on the index 
writer]], failure [OutOfMemoryError[Java heap space]]
[2016-06-01 00:21:32,974][WARN ][cluster.action.shard     ] [fantail] 
[graylog_488][0] received shard failed for target shard [[graylog_488][0], 
node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED], 
a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message 
[master {fantail}{Tjzmk9cFRuCke6JEuomb4g}{127.0.0.1}{127.0.0.1:9300} marked 
shard as started, but shard has previous failed. resending shard failure.]
[2016-06-01 00:21:33,182][INFO ][cluster.routing.allocation] [fantail] 
Cluster health status changed from [GREEN] to [RED] (reason: [shards failed 
[[graylog_488][0], [graylog_488][0]] ...]).



Restarting graylog-server and ES (and cleaning up...) will fix this - but 
that's lame. Graylog is an end-user tool that *by design* has people 
running queries which will, on occasion, be beyond what the backend can 
handle: there has to be some way this could be handled better. The ES 
people seem to think this is a case of "you're doing it wrong", but graylog 
isn't some tightly programmed frontend where every ES call is carefully 
managed - it's meant to let people "play" with the data. All I did was take 
a previous search that worked and re-run it with an hourly graph instead of 
a daily one - that was enough to tip it over the edge. This will happen 
time and time again, so is a service outage really an acceptable outcome?

How are others dealing with this? Could graylog catch the ES error and 
mitigate it somehow? At the very least, I would have liked to shut 
everything down before that "breaker" error turned into the "shard" error.

This is graylog-server-2.0.2/elasticsearch-2.3.3 under CentOS-7

Thanks

Jason
