Hi -- Haven't heard anything on this, so thought I'd try to post it into a bug tracking system. However, I can't seem to figure out where that'd be -- https://www.varnish-cache.org/trac ?
Thanks, Jeff On Jun 5, 2013, at 9:16 PM, Jeff Potter <[email protected]> wrote: > > Hi Varnish Folk, > > We had an OOM condition today on a dev box that I thought was worth passing > along, as it could've happened in production or may happen to others. > > This is on Varnish 3.0.3 / Ubuntu 10.04. > > It looks like the parent varnish process is not handling unexpected child > exits correctly while inside a signal, leading to a situation where the > parent process locks up because of a race condition where the signal handlers > stomp on the global values of child_state and child_pid. > > Here's a snippet of log -- the OOM killer kills the child processes each time > that varnish attempts to run them (28004, 15403, 15640/-1). > > [7270495.691960] Out of memory: kill process 27086 (varnishd) score 9908734 > or a child > [7270495.691971] Killed process 28004 (varnishd) > > 1370466543.3 Child (28004) died signal=9 > 1370466543.3 Child cleanup complete > 1370466543.4 child (15403) Started > 1370466545.4 Child (15403) said Not running as root, no priv-sep > 1370466545.4 Child (15403) said Child starts > 1370466545.4 Child (15403) said SMF.s0 mmap'ed 268435456 bytes of 268435456 > > [7270511.916123] Out of memory: kill process 27086 (varnishd) score 9724109 > or a child > [7270511.916134] Killed process 15403 (varnishd) > > 1370466558.8 Child (15403) not responding to CLI, killing it. > 1370466559.6 Child (15403) died signal=9 > 1370466559.6 Child cleanup complete > 1370466559.6 child (15640) Started > > [7270520.798539] Out of memory: kill process 27086 (varnishd) score 6705406 > or a child > [7270520.798551] Killed process 15640 (varnishd) > > 1370466568.1 Pushing vcls failed: > 1370466568.1 CLI communication error (hdr) > 1370466568.1 Stopping Child > 1370466568.1 Child (15640) died signal=9 > > # (The -1 value is "interesting" given the check for >0; child_pid is a > global and it's being set to -1 elsewhere -- so presumably start_child is > being called from somewhere else; i.e. start_child is running, mgt_sigchild > gets triggered, then flow returns to start_child right after and assigns -1 > to child_pid) > > 1370466568.1 Child (-1) said Not running as root, no priv-sep > 1370466568.1 Child (-1) said Child starts > 1370466568.1 Child (-1) said SMF.s0 mmap'ed 268435456 bytes of 268435456 > 1370466568.1 Child cleanup complete > > > Reading through mgt_child.c, here's what I think is going on: > > 1) The signal handler is getting invoked when the child process is killed by > the OOM killer > 2) Then the parent process, inside the sig handler (mgt_sigchld), calls fork > (inside mgt_sigchld's call to start_child), which causes a new child process > to start > 3) The OOM killer kills that second child process right away, before the sig > handler has a chance to return > > Does it maybe make sense to set the signal to be blocked for sigchild so that > the events can be processed synchronously without losing them? > > -Jeff > > > _______________________________________________ > varnish-misc mailing list > [email protected] > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc _______________________________________________ varnish-misc mailing list [email protected] https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
