Re: [naemon-dev] CPU spike + hang when thruk attempts to connect using 1.0.6 and 1.0.7 in a docker container

Sven Nierlein Tue, 23 May 2017 01:28:45 -0700

Hi,

Could you try this patch?
https://github.com/naemon/naemon-livestatus/pull/21
We noticed that livestatus hangs upon reload when the
livestatus module gets deinitialized. Due to a bug in
the global counters, this leads to naemon hanging.
Not sure if it's related to this issue.
Another patch worth looking into is this one:
https://github.com/naemon/naemon-livestatus/pull/19


Cheers,
 Sven


On 16.05.2017 17:12, jesm wrote:
> Hi all,
> 
> We experience exactly the same problem only that we don't have the situation 
> of loading Livestatus twice.
> We're not using Docker and we're using the latest stable... We could try the 
> nightly build but unfortunately we cannot reproduce the problem ...
> It comes and goes and we have no idea why ...
> 
> The symptoms we are seeing:
> 
>   * All Naemon related threads are consuming 100% cpu of every core.
>   * Thruk is not able to connect to the Unix domain socket and therefore each 
> incoming request starts an fcgi process, exhausting the pool in no time.
>   * Weirdly enough during this state, it's possible to manually query 
> livestatus using unixcat or socat.
>   * Restarting Naemon does not help
>   * Rebooting the server does not help
>   * Removing retention.dat solves the problem
>   * Restoring the retention.dat that was removed during the outage 
> does NOT trigger the problem again.
>   * Stracing the threads shows a continuous barrage of entries like the 
> following (I have no more detailed extract of this output):
> 
>         <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
> 
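A minimal sketch of the manual livestatus query mentioned in the list above, done from Python instead of unixcat or socat. It assumes the usual livestatus text protocol (a query terminated by an empty line, reply read until the peer closes). The function name `query_livestatus` and the timeout value are illustrative and not part of naemon or Thruk.

```python
import socket


def query_livestatus(socket_path, query, timeout=5.0):
    """Send one livestatus query over a Unix domain socket and return the raw reply."""
    # A query is a block of lines terminated by an empty line.
    if not query.endswith("\n\n"):
        query = query.rstrip("\n") + "\n\n"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        # Assumed timeout: fail fast instead of hanging along with the daemon.
        sock.settimeout(timeout)
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        # Close the write side to signal that no further queries follow.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

With the socket path from the Thruk error message in this thread, a call would look like `query_livestatus("/var/cache/naemon/live", "GET status")`. A timeout or a "Connection refused" here would point at the socket itself rather than at Thruk.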
> 
> After the retention.dat file was deleted and Naemon restarted we were not 
> able to trigger the same problem.
> 
> 
> Any ideas?
> 
> Cheers,
> 
> Jelle
> 
> May 12, 2017 4:27 AM, "Terence Kent" <[email protected]> wrote:
> 
>     Hey Sven,
>     Thanks for getting back to me so quickly, this was particularly 
> challenging to chase down. Using strace and livestatus debugging didn't 
> actually give me more information on this one. I also confirmed I had the 
> issue with the nightly build as well as 1.0.6.
>     Anyway, I found the cause of the issue. It's configuration related and 
> pretty subtle. If you uncomment the following directive in 
> /etc/naemon/naemon.cfg file...
> 
>         broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so 
> /var/cache/naemon/live
> 
>     ...then the livestatus socket gets initialized twice during naemon's 
> startup, causing the issue I described earlier. The reason for this duplicate 
> initialization is that /etc/naemon/module-conf.d/livestatus.cfg also 
> includes the same directive. There's a hint of the duplicate 
> initialization in the naemon log, in the form of multiple log messages for 
> livestatus initialization, but that's it.
>     It seems the only issue here is that the configuration is very confusing 
> (/etc/naemon/naemon.cfg gives you an example of how to use livestatus, making 
> you think you should just be able to uncomment it) and that repeating a 
> configuration directive doesn't produce an obvious error.
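Since a repeated directive produces no obvious error, a small check can flag it before startup. The following is a rough sketch, not part of naemon: it scans the main config and the module config directory for `broker_module` lines that load the same module path more than once. The function names are hypothetical, the default paths are the ones mentioned in this thread, and the assumption that only `*.cfg` files in the module directory are read may not match every setup.

```python
import glob
import os
from collections import defaultdict


def find_duplicate_broker_modules(config_files):
    """Map each module path loaded more than once to the (file, line) places loading it."""
    seen = defaultdict(list)
    for cfg in config_files:
        with open(cfg, encoding="utf-8", errors="replace") as handle:
            for lineno, raw in enumerate(handle, start=1):
                line = raw.strip()
                # Skip blanks and comments; a commented-out example is harmless.
                if not line or line.startswith("#"):
                    continue
                key, sep, value = line.partition("=")
                if not sep or key.strip() != "broker_module":
                    continue
                parts = value.split()
                if parts:
                    # First token is the module path; the rest are its arguments.
                    seen[parts[0]].append((cfg, lineno))
    return {module: places for module, places in seen.items() if len(places) > 1}


def default_config_files(main_cfg="/etc/naemon/naemon.cfg",
                         module_dir="/etc/naemon/module-conf.d"):
    """Main config plus every .cfg file in the module directory (assumed layout)."""
    return [main_cfg] + sorted(glob.glob(os.path.join(module_dir, "*.cfg")))
```

Calling `find_duplicate_broker_modules(default_config_files())` returns an empty dict when each module is loaded once. In the situation described above it would list livestatus.so with one location in naemon.cfg and one in livestatus.cfg.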
>     Would you like me to file an issue for this? While it's easy to resolve, 
> it's really hard to chase down.
>     Thanks!
>     Terence
>     On Tue, May 9, 2017 at 12:12 AM, Sven Nierlein <[email protected]> wrote:
> 
>         Hi Terence,
> 
>         Could you try the latest nightly build, just to be sure we're not hunting 
> already fixed bugs? If that doesn't help, you could increase the
>         livestatus loglevel as well. Naemon has a debug log which could be 
> enabled, and of course strace often gives a good idea of what's
>         happening as well.
> 
>         Cheers,
>         Sven
> 
> 
>         On 09.05.2017 02:05, Terence Kent wrote:
>         > Hello!
>         >
>         > We're trying to update our naemon docker image to 1.0.6 and we're 
> running into a fairly difficult-to-debug issue. Here's the issue we're seeing:
>         >
>         > 1. Naemon + Apache start as expected and will run indefinitely, if 
> Thruk is not accessed.
>         > 2. Upon signin to Thruk, the Naemon process's CPU consumption jumps 
> to 100% and will stay there indefinitely.
>         >
>         > We've been trying to get at some logging messages to see if we can 
> diagnose the behavior, but that's been a bit more trouble than we expected. 
> So far, we've just done the obvious thing of increasing the debugging levels 
> found in /etc/naemon/naemon.cfg. However, this seems to produce no additional 
> information when the issue is hit.
>         >
>         > Anyway, here's some information about the container environment:
>         >
>         > *Base image:* phusion 0.9.21 (Which is Ubuntu 16.04)
>         > *Naemon primary log file entries: *These always look like this. Not 
> much to go off of.
>         > ––––
>         >
>         > [1494286706] Naemon 1.0.6-pkg starting... (PID=51)
>         >
>         > [1494286706] Local time is Mon May 08 23:38:26 UTC 2017
>         >
>         > [1494286706] LOG VERSION: 2.0
>         >
>         > [1494286706] qh: Socket '/var/lib/naemon/naemon.qh' successfully 
> initialized
>         >
>         > [1494286706] nerd: Channel hostchecks registered successfully
>         >
>         > [1494286706] nerd: Channel servicechecks registered successfully
>         >
>         > [1494286706] nerd: Fully initialized and ready to rock!
>         >
>         > [1494286706] wproc: Successfully registered manager as @wproc with 
> query handler
>         >
>         > [1494286706] wproc: Registry request: name=Core Worker 55;pid=55
>         >
>         > [1494286706] wproc: Registry request: name=Core Worker 57;pid=57
>         >
>         > [1494286706] wproc: Registry request: name=Core Worker 59;pid=59
>         >
>         > [1494286706] wproc: Registry request: name=Core Worker 61;pid=61
>         >
>         > [1494286706] wproc: Registry request: name=Core Worker 58;pid=58
>         >
>         > [1494286706] wproc: Registry request: name=Core Worker 60;pid=60
>         >
>         > ––––
>         > *Naemon livestatus log: *(Blank)
>         > *Thruk Logs: *Nothing comes out here, until I kill the naemon 
> service, then it's just:
>         > ––––––––
>         >
>         > [2017/05/08 19:34:00][nameon][ERROR][Thruk] No Backend available
>         >
>         > [2017/05/08 19:34:00][nameon][ERROR][Thruk] on page: 
> http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931
>         >
>         > [2017/05/08 19:34:00][nameon][ERROR][Thruk] Naemon: ERROR: failed 
> to connect - Connection refused. (/var/cache/naemon/live)
>         >
>         > –––––––––
>         >
>         >
>         >
>         > From tracing around, we're pretty confident the issue occurs when Thruk 
> attempts to connect to the naemon live socket. However, pinning down the cause 
> has been tough; we know the fs permissions are correct, we believe 
> the socket is working from the log messages, and Thruk works as expected when 
> we stop naemon (it shows its interface and reports that it cannot connect to 
> naemon). We can keep at this, of course, but I was hoping we could get 
> pointed in the right direction.
>         >
>         >
>         > Thanks!
>         >
>         > Terence
>         >
>         >
> 
>         --
>         Sven Nierlein [email protected]
>         ConSol* GmbH http://www.consol.de
>         Franziskanerstrasse 38 Tel.:089/45841-439
>         81669 Muenchen Fax.:089/45841-111 
> 
> 
> 


-- 
Sven Nierlein             [email protected]
ConSol* GmbH              http://www.consol.de
Franziskanerstrasse 38    Tel.:089/45841-439
81669 Muenchen            Fax.:089/45841-111
