Hi,

Could you try this patch? https://github.com/naemon/naemon-livestatus/pull/21

We noticed that livestatus hangs upon reload when the livestatus module gets deinitialized. Due to a fuckup in the global counters, this leads to naemon hanging. Not sure if it's related to this issue.

Another patch worth looking into is this one: https://github.com/naemon/naemon-livestatus/pull/19
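Since the duplicate-initialization case Terence describes further down is easy to miss, here is a minimal sketch of a check for it. It assumes that naemon reads /etc/naemon/naemon.cfg plus every *.cfg file in /etc/naemon/module-conf.d, and that comments start with "#". The function name and the parsing rules are illustrative only and are not part of naemon.

```python
import glob
import os
from collections import defaultdict


def find_duplicate_broker_modules(main_cfg="/etc/naemon/naemon.cfg",
                                  conf_dir="/etc/naemon/module-conf.d"):
    """Return {module_path: [(file, line_no), ...]} for modules listed more than once.

    Assumption: every *.cfg file in conf_dir is loaded in addition to main_cfg.
    """
    seen = defaultdict(list)
    files = [main_cfg] + sorted(glob.glob(os.path.join(conf_dir, "*.cfg")))
    for path in files:
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                stripped = line.strip()
                # Skip commented-out directives and lines that are not key=value.
                if stripped.startswith("#") or "=" not in stripped:
                    continue
                key, _, value = stripped.partition("=")
                if key.strip() != "broker_module":
                    continue
                # The first token is the module path; the rest are its arguments.
                parts = value.split()
                if parts:
                    seen[parts[0]].append((path, line_no))
    return {mod: locs for mod, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    for module, locations in find_duplicate_broker_modules().items():
        print(f"{module} is loaded {len(locations)} times:")
        for path, line_no in locations:
            print(f"  {path}:{line_no}")
```

If this prints livestatus.so with two locations, removing one of the two directives should be enough to rule that cause out.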
Cheers,
Sven

On 16.05.2017 17:12, jesm wrote:
> Hi all,
>
> We experience exactly the same problem, except that we don't have the
> situation of loading Livestatus twice.
> We're not using Docker and we're using the latest stable... We could try the
> nightly build but unfortunately we cannot reproduce the problem ...
> It comes and goes and we have no idea why ...
>
> The symptoms we are seeing:
>
> * All Naemon related threads are consuming 100% cpu of every core.
> * Thruk is not able to connect to the Unix domain socket and therefore each
>   incoming request starts a fcgi process, exhausting the pool in no time.
> * Weirdly enough, during this state it's possible to manually query
>   livestatus using unixcat or socat.
> * Restarting Naemon does not help
> * Rebooting the server does not help
> * Removing retention.dat solves the problem
> * Restoring the previously removed retention.dat from during the outage
>   does NOT invoke the problem again.
> * Stracing the threads shows a continuous barrage of entries like the
>   following (I have no more detailed extract of this output):
>
>     <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>
> After the retention.dat file was deleted and Naemon restarted we were not
> able to trigger the same problem.
>
> Any ideas?
>
> Cheers,
>
> Jelle
>
> May 12, 2017 4:27 AM, "Terence Kent" <[email protected]> wrote:
>
> Hey Sven,
> Thanks for getting back to me so quickly, this was particularly
> challenging to chase down. Using strace and livestatus debugging didn't
> actually give me more information on this one. I also confirmed I had the
> issue with the nightly build as well as 1.0.6.
> Anyway, I found the cause of the issue. It's configuration related and
> pretty subtle. If you uncomment the following directive in the
> /etc/naemon/naemon.cfg file...
>
>   broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live
>
> ...then the livestatus socket gets initialized twice during naemon's
> startup, causing the issue I described earlier. The reason for this
> duplicate initialization is /etc/naemon/module-conf.d/livestatus.cfg, which
> also includes the same directive. There's a hint of the duplicate
> initialization in the naemon log, in the form of multiple log messages for
> livestatus initialization, but that's it.
> It seems the only issue here is that the configuration is very confusing
> (/etc/naemon/naemon.cfg gives you an example of how to use livestatus, making
> you think you should just be able to uncomment it) and that repeating a
> configuration directive doesn't produce an obvious error.
> Would you like me to file an issue for this? While it's easy to resolve,
> it's really hard to chase down.
> Thanks!
> Terence
>
> On Tue, May 9, 2017 at 12:12 AM, Sven Nierlein <[email protected]> wrote:
>
> Hi Terence,
>
> Could you try the latest nightly build, just to be sure to not hunt
> already fixed bugs. If that doesn't help, you could increase the
> livestatus loglevel as well. Naemon has a debug log which could be
> enabled, and of course strace often gives a good idea of what's
> happening as well.
>
> Cheers,
> Sven
>
>
> On 09.05.2017 02:05, Terence Kent wrote:
> > Hello!
> >
> > We're trying to update our naemon docker image to 1.0.6 and we're
> > running into a fairly difficult-to-debug issue. Here's the issue we're
> > seeing:
> >
> > 1. Naemon + Apache start as expected and will run indefinitely, if
> >    Thruk is not accessed.
> > 2. Upon signin to Thruk, the Naemon process's CPU consumption jumps
> >    to 100% and will stay there indefinitely.
> >
> > We've been trying to get at some logging messages to see if we can
> > diagnose the behavior, but that's been a bit more trouble than we
> > expected. So far, we've just done the obvious thing of increasing the
> > debugging levels found in /etc/naemon/naemon.cfg. However, this seems to
> > produce no additional information when the issue is hit.
> >
> > Anyway, here's some information about the container environment:
> >
> > *Base image:* phusion 0.9.21 (Which is Ubuntu 16.04)
> > *Naemon primary log file entries:* These always look like this. Not
> > much to go off of.
> > ––––
> >
> > [1494286706] Naemon 1.0.6-pkg starting... (PID=51)
> >
> > [1494286706] Local time is Mon May 08 23:38:26 UTC 2017
> >
> > [1494286706] LOG VERSION: 2.0
> >
> > [1494286706] qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
> >
> > [1494286706] nerd: Channel hostchecks registered successfully
> >
> > [1494286706] nerd: Channel servicechecks registered successfully
> >
> > [1494286706] nerd: Fully initialized and ready to rock!
> >
> > [1494286706] wproc: Successfully registered manager as @wproc with query handler
> >
> > [1494286706] wproc: Registry request: name=Core Worker 55;pid=55
> >
> > [1494286706] wproc: Registry request: name=Core Worker 57;pid=57
> >
> > [1494286706] wproc: Registry request: name=Core Worker 59;pid=59
> >
> > [1494286706] wproc: Registry request: name=Core Worker 61;pid=61
> >
> > [1494286706] wproc: Registry request: name=Core Worker 58;pid=58
> >
> > [1494286706] wproc: Registry request: name=Core Worker 60;pid=60
> >
> > ––––
> > *Naemon livestatus log:* (Blank)
> > *Thruk Logs:* Nothing comes out here, until I kill the naemon
> > service, then it's just:
> > ––––––––
> >
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] No Backend available
> >
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] on page: http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931
> >
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] Naemon: ERROR: failed to connect - Connection refused. (/var/cache/naemon/live)
> >
> > –––––––––
> >
> > From tracing around, we're pretty confident the issue occurs when Thruk
> > attempts to connect to the naemon live socket. However, finding the
> > cause has been tough; we know the fs permissions are correct, we believe
> > the socket is working from the log messages, and Thruk works as expected
> > when we stop naemon (it shows its interfaces and errors that it cannot
> > connect to naemon). We can keep at this, of course, but I was hoping we
> > could get pointed in the right direction.
> >
> > Thanks!
> >
> > Terence
>
> --
> Sven Nierlein [email protected]
> ConSol* GmbH http://www.consol.de
> Franziskanerstrasse 38 Tel.:089/45841-439
> 81669 Muenchen Fax.:089/45841-111

--
Sven Nierlein [email protected]
ConSol* GmbH http://www.consol.de
Franziskanerstrasse 38 Tel.:089/45841-439
81669 Muenchen Fax.:089/45841-111
