Hey Jelle,

Looks like you've got the same symptom with a different problem :-(. I can say for certain the symptom in my case was caused by the double livestatus loading, so we know you're running into a different thing.
Best,
Terence

On Tue, May 16, 2017 at 8:12 AM, jesm <[email protected]> wrote:
> Hi all,
>
> We experience exactly the same problem, except that we don't have the
> situation of loading Livestatus twice.
> We're not using Docker and we're using the latest stable... We could try
> the nightly build but unfortunately we cannot reproduce the problem ...
> It comes and goes and we have no idea why ...
>
> The symptoms we are seeing:
>
> - All Naemon related threads are consuming 100% cpu of every core.
> - Thruk is not able to connect to the Unix domain socket and therefore
>   each incoming request starts a fcgi process, exhausting the pool in no time.
> - Weirdly enough, during this state it's possible to manually query
>   livestatus using unixcat or socat.
> - Restarting Naemon does not help
> - Rebooting the server does not help
> - Removing retention.dat solves the problem
> - Restoring the previously removed retention.dat from during the
>   outage does NOT invoke the problem again.
> - Stracing the threads shows a continuous barrage of entries like: (I
>   have no more detailed extraction of this output)
>
>   <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>
> After the retention.dat file was deleted and Naemon restarted we were not
> able to trigger the same problem.
>
> Any ideas?
>
> Cheers,
>
> Jelle
>
> May 12, 2017 4:27 AM, "Terence Kent" <[email protected]> wrote:
>
> Hey Sven,
> Thanks for getting back to me so quickly, this was particularly
> challenging to chase down. Using strace and livestatus debugging didn't
> actually give me more information on this one. I also confirmed I had the
> issue with the nightly build as well as 1.0.6.
> Anyway, I found the cause of the issue. It's configuration related and
> pretty subtle. If you uncomment the following directive in
> /etc/naemon/naemon.cfg file...
>
> broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live
>
> ...then the livestatus socket gets initialized twice during naemon's
> startup, causing the issue I described earlier. The reason for this
> duplicate initialization is /etc/naemon/module-conf.d/livestatus.cfg,
> which also includes the same directive. There's a hint of the duplicate
> initialization in the naemon log, due to multiple log messages for
> livestatus initialization, but that's it.
> It seems the only issue here is that the configuration is very confusing
> (/etc/naemon/naemon.cfg gives you an example of how to use livestatus,
> making you think you should just be able to uncomment it) and that
> repeating a configuration directive doesn't produce an obvious error.
> Would you like me to file an issue for this? While it's easy to resolve,
> it's really hard to chase down.
> Thanks!
> Terence
>
> On Tue, May 9, 2017 at 12:12 AM, Sven Nierlein <[email protected]> wrote:
>
> Hi Terence,
>
> Could you try the latest nightly build, just to be sure to not hunt
> already fixed bugs. If that doesn't help, you could increase the
> livestatus loglevel as well. Naemon has a debug log which could be
> enabled, and of course strace often gives a good idea on what's
> happening as well.
>
> Cheers,
> Sven
>
> On 09.05.2017 02:05, Terence Kent wrote:
> > Hello!
> >
> > We're trying to update our naemon docker image to 1.0.6 and we're
> > running into a fairly difficult-to-debug issue. Here's the issue we're
> > seeing:
> >
> > 1. Naemon + Apache start as expected and will run indefinitely, if Thruk
> >    is not accessed.
> > 2. Upon signin to Thruk, the Naemon process's CPU consumption jumps to
> >    100% and will stay there indefinitely.
> >
> > We've been trying to get at some logging messages to see if we can
> > diagnose the behavior, but that's been a bit more trouble than we expected.
> > So far, we've just done the obvious thing of increasing the debugging levels
> > found in /etc/naemon/naemon.cfg. However, this seems to produce no additional
> > information when the issue is hit.
> >
> > Anyway, here's some information about the container environment:
> >
> > *Base image:* phusion 0.9.21 (Which is Ubuntu 16.04)
> > *Naemon primary log file entries:* These always look like this. Not much
> > to go off of.
> > ––––
> > [1494286706] Naemon 1.0.6-pkg starting... (PID=51)
> > [1494286706] Local time is Mon May 08 23:38:26 UTC 2017
> > [1494286706] LOG VERSION: 2.0
> > [1494286706] qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
> > [1494286706] nerd: Channel hostchecks registered successfully
> > [1494286706] nerd: Channel servicechecks registered successfully
> > [1494286706] nerd: Fully initialized and ready to rock!
> > [1494286706] wproc: Successfully registered manager as @wproc with query handler
> > [1494286706] wproc: Registry request: name=Core Worker 55;pid=55
> > [1494286706] wproc: Registry request: name=Core Worker 57;pid=57
> > [1494286706] wproc: Registry request: name=Core Worker 59;pid=59
> > [1494286706] wproc: Registry request: name=Core Worker 61;pid=61
> > [1494286706] wproc: Registry request: name=Core Worker 58;pid=58
> > [1494286706] wproc: Registry request: name=Core Worker 60;pid=60
> > ––––
> > *Naemon livestatus log:* (Blank)
> > *Thruk Logs:* Nothing comes out here, until I kill the naemon service,
> > then it's just:
> > ––––––––
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] No Backend available
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] on page: http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] Naemon: ERROR: failed to connect - Connection refused. (/var/cache/naemon/live)
> > –––––––––
> >
> > From tracing around, we're pretty confident the issue is when Thruk
> > attempts to connect to the naemon live socket. However, pinning down the
> > cause has been tough; we know the fs permissions are correct, we
> > believe the socket is working from the log messages, and Thruk works as
> > expected when we stop naemon (it shows its interfaces and errors that it
> > cannot connect to naemon). We can keep at this, of course, but I was hoping
> > we could get pointed in the right direction.
> >
> > Thanks!
> >
> > Terence
>
> --
> Sven Nierlein [email protected]
> ConSol* GmbH http://www.consol.de
> Franziskanerstrasse 38 Tel.:089/45841-439
> 81669 Muenchen Fax.:089/45841-111
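
Since the duplicate broker_module directive described in this thread produces no obvious error, here is a minimal sketch of how it could be spotted before starting Naemon. The function name find_duplicate_broker_modules is hypothetical and not part of Naemon or Livestatus. The two paths in the usage part are the ones quoted in the thread; which files your installation actually reads is an assumption to check against your own naemon.cfg.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches an uncommented "broker_module=<path> [args]" line and captures
# only the module path (the first whitespace-delimited token).
BROKER_RE = re.compile(r"^\s*broker_module\s*=\s*(\S+)")


def find_duplicate_broker_modules(config_files):
    """Return {module_path: [(file, line_no), ...]} for every module that
    is named by more than one active broker_module line.

    Hypothetical helper for illustration; it only reads the files it is
    given and does not follow include directives on its own.
    """
    seen = defaultdict(list)
    for cfg in config_files:
        text = Path(cfg).read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            match = BROKER_RE.match(line)
            if match:
                seen[match.group(1)].append((str(cfg), line_no))
    return {mod: locs for mod, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    # Paths taken from the thread; adjust for your installation.
    main_cfg = Path("/etc/naemon/naemon.cfg")
    module_cfgs = sorted(Path("/etc/naemon/module-conf.d").glob("*.cfg"))
    files = [p for p in [main_cfg, *module_cfgs] if p.is_file()]
    for module, locations in find_duplicate_broker_modules(files).items():
        print(f"{module} is loaded {len(locations)} times:")
        for path, line_no in locations:
            print(f"  {path}:{line_no}")
```

Commented-out lines (those starting with `#`) are ignored, so the shipped example in naemon.cfg is only reported once it has been uncommented.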
