Hi there,

I've got a graylog-1.1.3 instance (web/server + elasticsearch) running on CentOS-7 that I haven't changed INPUTs on for some months (i.e. I have one incoming syslog feed and 'n' GELF feeds). From what I know, graylog-server takes that data and pushes it into elasticsearch according to the sharding/etc settings, with old data auto-expiring according to the retention settings.
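By "settings" I mean the index rotation/retention knobs in graylog-server's server.conf - roughly the following (example values rather than my actual ones, and option names from memory, so double-check against the stock config):

  elasticsearch_shards = 4
  elasticsearch_replicas = 0
  elasticsearch_max_docs_per_index = 20000000
  elasticsearch_max_number_of_indices = 20
  rotation_strategy = count
  retention_strategy = delete

i.e. graylog-server rotates to a new index once the current one fills up, and deletes the oldest index once it goes over the index cap.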
As such, I would expect it to reach a "steady state" where the fundamental OS characteristics are fairly stable - i.e. it would use about "this much" RAM, "this many" open files, and so on. Anyhow, two days ago it went down completely after running out of open file descriptors. It ended up corrupting over 9000 indexes before I noticed - a real mess. I increased the nofile limit, rebooted, and then used the very nice script referred to below to re-absorb the borked indexes:

https://github.com/elastic/elasticsearch/issues/4206

So the thing I don't understand is why this happened (or why it didn't happen sooner). In a steady-state environment, why would the number of open files increase over time? Only one index is open for writes at any moment, and indexes are only opened for reads during searches, so why would this grow? More importantly, if this growth is meant to happen, doesn't that imply that running out of file descriptors is inevitable?

The other thing is why graylog-server didn't exit when this situation occurred. It seems to me that when elasticsearch started erroring it should have exited (I mean, you don't recover from running out of file descriptors), and since it didn't, why didn't graylog-server? Under what circumstances is it better to end up with 9000 corrupt indexes rather than a total outage? I'm still waiting for elasticsearch to finish re-assigning the unassigned_shards created by the above recovery process - it's working, but it's been 8 hours so far and it's still plodding along (so it's a two-day outage for me so far). If graylog-server figured out elasticsearch was status "RED", why not shut down entirely so as to not make the situation any worse, and cause an outage that's easier to notice?

Also, there's a bug with the elasticsearch rpm's. /etc/sysconfig/elasticsearch says not to set MAX_OPEN_FILES when using systemd (which you are on CentOS-7) and to instead set LimitNOFILE in /usr/lib/systemd/system/elasticsearch.service. However, /usr/lib/systemd/system/elasticsearch.service is replaced every time you upgrade elasticsearch. So either their documentation is wrong and /etc/sysconfig/elasticsearch is what "wins", or their rpm installer is broken (a systemd drop-in might be a cleaner workaround - see the PS below). I'll open a bug report with them (not a graylog issue - just an FYI for others).

--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
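PS: for anyone else hitting the LimitNOFILE/rpm-upgrade issue above, one workaround that should survive upgrades is a systemd drop-in, since it lives outside the packaged unit file - a rough sketch (file name and limit value are just examples):

  # /etc/systemd/system/elasticsearch.service.d/limits.conf
  [Service]
  LimitNOFILE=65535

then:

  systemctl daemon-reload
  systemctl restart elasticsearch

systemd merges the drop-in over the unit the rpm installs, so an upgrade replacing /usr/lib/systemd/system/elasticsearch.service shouldn't clobber it (untested on my box, so treat it as a sketch).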