You are not using sig_child() as intended.  When used as intended, sig_child() 
will prevent shutdown until the child process has exited and has been reaped.  
The timing issues you're worried about should not exist.
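
For illustration, here is a minimal sketch of that usage. It is not taken from your code: the forked child is a stand-in for nginx, and the event name child_reaped and the %reaped hash are hypothetical names chosen for the example. The point is that sig_child() is called with a PID and an event name, in the same handler that forks, so the session stays alive until that child has been reaped.

```perl
use strict;
use warnings;
use POSIX ();
use POE;    # loads POE::Kernel and POE::Session, exports KERNEL, ARG1, ...

our %reaped;    # hypothetical: records what the CHLD handler saw

POE::Session->create(
    inline_states => {
        _start => sub {
            my $kernel = $_[KERNEL];

            # Stand-in for the real child process (nginx in this thread).
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: leave immediately without running POE cleanup.
                POSIX::_exit(3);
            }

            # Parent: watch this specific PID before the handler returns.
            # The watcher keeps the session alive until the child is reaped.
            $kernel->sig_child($pid, 'child_reaped');
        },

        child_reaped => sub {
            # ARG0 is 'CHLD', ARG1 is the PID, ARG2 is the wait status ($?).
            my ($pid, $status) = @_[ARG1, ARG2];
            %reaped = (pid => $pid, exit_code => $status >> 8);
            print "Reaped child $pid (exit code $reaped{exit_code})\n";
        },
    },
);

# Returns only after the child has been reaped and the session has stopped.
POE::Kernel->run();
```

Calling sig_child() with no arguments, as in your _stop handler, does not register a watcher for any process, so it cannot hold the session open.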

-- 
Rocco Caputo <rcap...@pobox.com>

On Mar 24, 2014, at 11:44, albertocurro <albertocu...@zoho.com> wrote:

> Hi Rocco,
> 
> Many thanks for your quick answer! Unfortunately, the suggested solution only
> works partially. There are still some cases where the "fork bomb" message
> appears :(
> 
> One of the cases is this: under some configurations, our product writes an
> nginx configuration file and then starts an nginx instance pointing to that
> file. But if the configuration file cannot be written (the directory does
> not exist, etc.), the error is raised, and I have not found any way to
> handle it:
> 
> DEBUG - Created nginx temporary directory /opt/tmp/pull/instance1
> DEBUG - Created nginx configuration directory /opt/etc/pull/instance1
> DEBUG - Created nginx log directory /opt/log/pull/instance1
> DEBUG - creating nginx configfile for instance 1 in /opt/etc/pull/instance1
> === 13991 === !!! Kernel has 1 child process(es).
> === 13991 === !!! At least one child process is still running when 
> POE::Kernel->run() is ready to return.
> === 13991 === !!! Be sure to use sig_child() to reap child processes.
> === 13991 === !!! In extreme cases, failure to reap child processes has
> === 13991 === !!! resulted in a slow 'fork bomb' that has halted systems.
> Could not open file: No such file or directory
> 
> I've added a DIE handler in the main session to try to handle this:
> 
> $sig_session = POE::Session->create(
>    inline_states => {
>        _start => sub {
>            $_[HEAP]{RELOADED} = 0;
>            $_[KERNEL]->sig(TERM => '_sigterm');
>            $_[KERNEL]->sig(INT => '_sigterm');
>            $_[KERNEL]->sig(DIE => '_sigdie');
>            $_[KERNEL]->sig(nginx_reload => '_sig_nginx_reload');
>            $_[KERNEL]->alias_set('sighandler');
>        },
>        _sigdie => sub {
>            print "Handling exception, calling stop";
>            POE::Kernel->call($sig_session, '_stop');
>        },
>        _stop => sub {
>            # Reap any existing pid (# 1825119)
>            print "Handling stop";
>            POE::Kernel->sig_child();
>            use POSIX ":sys_wait_h";
>            1 while waitpid(-1, WNOHANG) > 0;
> 
>            # Clear signal handlers...
>            $_[KERNEL]->sig('TERM');
> 
> But, as said above, it's not working. Checking POE's code, I can see that
> the message lines are generated in Resources/Signals.pm, in the
> _data_sig_finalize() method (where POE already does what you recommended:
> waiting for the pid).
> 
> But _data_sig_finalize() is called in Kernel.pm just after all the signals
> have been unregistered (Kernel.pm => _finalize_kernel):
> 
> my $self = shift;
> 
>  # Disable signal watching since there's now no place for them to go.
>  foreach ($self->_data_sig_get_safe_signals()) {
>    $self->loop_ignore_signal($_);
>  }
> 
>  # Remove the kernel session's signal watcher.
>  $self->_data_sig_remove($self->ID, "IDLE");
> 
>  # The main loop is done, no matter which event library ran it.
>  # sig before loop so that it clears the signal_pipe file handler
>  $self->_data_sig_finalize();
>  $self->loop_finalize();
> 
> Once here, none of the signal handlers in my main session would work, as
> the signals have already been unregistered. So how can I handle an
> exception (die) raised during POE::Kernel->run()?
> 
> Thanks a lot
> Alberto
> 
> 
> 
> 
> ---- On Mon, 24 Mar 2014 13:45:45 +0100 Rocco Caputo wrote ----
> 
>> Hi, Alberto. 
>> 
>> At program end time, POE runs a quick waitpid() check for child processes 
>> that may have leaked. This check was added after a bug report where POE 
>> locked up a server after several days of running. It turned out to be the 
>> reporter's application, but it was hard to debug. 
>> 
>> Your program seems to have created two processes that it didn't reap: PIDs 
>> 5373 and 5374. The ideal solution is to reap those processes before exiting. 
>> Your program can do this using POE::Kernel's sig_child() method. 
>> 
>> In some cases, a third-party library will create processes and not properly 
>> clean them up. It can be impossible to solve this case without modifying 
>> other people's code. 
>> 
>> If you just want to ignore the problem, this might do the trick. Put these 
>> lines in your last _stop handler. They should reap the processes you've 
>> leaked before POE's check: 
>> 
>> use POSIX ":sys_wait_h"; 
>> 1 while waitpid(-1, WNOHANG) > 0; 
>> 
>> It's a bit of a pain, but I think it's better to explicitly ignore the 
>> problem than for it to go unnoticed by default. 
>> 
>> Please let me know whether that resolves your problem. It may not. For 
>> example, the processes may still be open until an object is destroyed at 
>> global destruction time. 
>> 
>> -- 
>> Rocco Caputo  
>> 
>> On Mar 24, 2014, at 05:46, albertocurro  wrote: 
>> 
>>> Guys, 
>>> 
>>> We have a product developed using POE as a base framework, with some other 
>>> tool libraries such as log4perl. Basically it is a forward proxy composed 
>>> of several modules, each of them comprising a POE::Session; all of them 
>>> share an internal queue of tasks to be performed. Each module performs 
>>> several tasks on initialization, and if anything goes wrong, croak() is 
>>> called to stop the service -> this is considered ok, since croak() is only 
>>> called during initialization, when validation is being performed. 
>>> 
>>> The product is stable and works really fine, but recently I updated POE to 
>>> the latest version, and since then we can see this message in the logs: 
>>> 
>>> registering pdu failed: 263! 
>>> === 5267 === 5 -> on_handle (from Handler/StoreRemote.pm at 87) 
>>> === 5267 === 5 -> on_retry (from Handler/StoreRemote.pm at 141) 
>>> === 5267 === 9 -> on_handle (from Handler/StoreRemote.pm at 87) 
>>> === 5267 === 9 -> on_retry (from Handler/StoreRemote.pm at 141) 
>>> === 5267 === !!! Kernel has child processes. 
>>> === 5267 === !!! Stopped child process (PID 5373) reaped when 
>>> POE::Kernel->run() is ready to return. 
>>> === 5267 === !!! Stopped child process (PID 5374) reaped when 
>>> POE::Kernel->run() is ready to return. 
>>> === 5267 === !!! At least one child process is still running when 
>>> POE::Kernel->run() is ready to return. 
>>> === 5267 === !!! Be sure to use sig_child() to reap child processes. 
>>> === 5267 === !!! In extreme cases, failure to reap child processes has 
>>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems. 
>>> mkdir /mnt/nfs99: Permission denied at Handler/Store.pm line 147 
>>> 
>>> The first lines and the last line above are the errors themselves, but 
>>> this part is new since the upgrade: 
>>> 
>>> === 5267 === !!! Kernel has child processes. 
>>> === 5267 === !!! Stopped child process (PID 5373) reaped when 
>>> POE::Kernel->run() is ready to return. 
>>> === 5267 === !!! Stopped child process (PID 5374) reaped when 
>>> POE::Kernel->run() is ready to return. 
>>> === 5267 === !!! At least one child process is still running when 
>>> POE::Kernel->run() is ready to return. 
>>> === 5267 === !!! Be sure to use sig_child() to reap child processes. 
>>> === 5267 === !!! In extreme cases, failure to reap child processes has 
>>> === 5267 === !!! resulted in a slow 'fork bomb' that has halted systems. 
>>> 
>>> I can see it every time the service is stopped because of an unhandled 
>>> condition, even when POE's event loop has already been running for hours. 
>>> It was not visible before, and I can't get rid of it in any way. I've 
>>> tried different ways to avoid it with no luck. 
>>> 
>>> Any advice or alternative approach on this? 
>>> 
>>> Many thanks 
>>> Alberto
