Hi --

Haven't heard anything on this, so thought I'd try to post it into a bug 
tracking system. However, I can't seem to figure out where that'd be -- 
https://www.varnish-cache.org/trac ?

Thanks,
Jeff

On Jun 5, 2013, at 9:16 PM, Jeff Potter <[email protected]> wrote:

> 
> Hi Varnish Folk,
> 
> We had an OOM condition today on a dev box that I thought was worth passing 
> along, as it could've happened in production or may happen to others.
> 
> This is on Varnish 3.0.3 / Ubuntu 10.04.
> 
> It looks like the parent varnish process is not handling unexpected child 
> exits correctly while inside a signal, leading to a situation where the 
> parent process locks up because of a race condition where the signal handlers 
> stomp on the global values of child_state and child_pid.
> 
> Here's a snippet of log -- the OOM killer kills the child processes each time 
> that varnish attempts to run them (28004, 15403, 15640/-1).
> 
> [7270495.691960] Out of memory: kill process 27086 (varnishd) score 9908734 
> or a child
> [7270495.691971] Killed process 28004 (varnishd)
> 
> 1370466543.3  Child (28004) died signal=9
> 1370466543.3  Child cleanup complete
> 1370466543.4  child (15403) Started
> 1370466545.4  Child (15403) said Not running as root, no priv-sep
> 1370466545.4  Child (15403) said Child starts
> 1370466545.4  Child (15403) said SMF.s0 mmap'ed 268435456 bytes of 268435456
> 
> [7270511.916123] Out of memory: kill process 27086 (varnishd) score 9724109 
> or a child
> [7270511.916134] Killed process 15403 (varnishd)
> 
> 1370466558.8  Child (15403) not responding to CLI, killing it.
> 1370466559.6  Child (15403) died signal=9
> 1370466559.6  Child cleanup complete
> 1370466559.6  child (15640) Started
> 
> [7270520.798539] Out of memory: kill process 27086 (varnishd) score 6705406 
> or a child
> [7270520.798551] Killed process 15640 (varnishd)
> 
> 1370466568.1  Pushing vcls failed:
> 1370466568.1  CLI communication error (hdr)
> 1370466568.1  Stopping Child
> 1370466568.1  Child (15640) died signal=9
> 
> # (The -1 value is "interesting" given the check for >0; child_pid is a 
> global and it's being set to -1 elsewhere -- so presumably start_child is 
> being called from somewhere else; i.e. start_child is running, mgt_sigchild 
> gets triggered, then flow returns to start_child right after and assigns -1 
> to child_pid)
> 
> 1370466568.1  Child (-1) said Not running as root, no priv-sep
> 1370466568.1  Child (-1) said Child starts
> 1370466568.1  Child (-1) said SMF.s0 mmap'ed 268435456 bytes of 268435456
> 1370466568.1  Child cleanup complete
> 
> 
> Reading through mgt_child.c, here's what I think is going on:
> 
> 1) The signal handler is getting invoked when the child process is killed by 
> the OOM killer
> 2) Then the parent process, inside the sig handler (mgt_sigchld), calls fork 
> (inside mgt_sigchld's call to start_child), which causes a new child process 
> to start
> 3) The OOM killer kills that second child process right away, before the sig 
> handler has a chance to return
> 
> Does it maybe make sense to set the signal to be blocked for sigchild so that 
> the events can be processed synchronously without losing them?
> 
> -Jeff
> 
> 
> _______________________________________________
> varnish-misc mailing list
> [email protected]
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc


_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc

Reply via email to