Hi again,

I fixed the problem from my last email, and I want to share what the  
root problem was and the solution I found, in case someone else runs  
into something similar. I also have a suggestion for a change to the  
BOINC client.

I didn't think this was relevant, so I didn't mention it in my last
email, but my apps link against MPICH. I have it set up in a one-node
configuration in my app because the code I'm modifying depends on
it and, for the time being, will still be used on a grid as well as via
BOINC.

When you call MPI_Init, if the process is not attached to a terminal  
(this is why it only showed up when the "--background" flag was set),  
then MPICH calls setsid() to create a new session. This creates a new  
process group as well for that session, thus removing the app process  
from the BOINC process's group.

In app_control.cpp:169 and app_control.cpp:510, when BOINC calls
waitpid() to check the status of the processes, it passes pid=0. With
pid=0, waitpid() only reaps children in the caller's own process group;
mine weren't in it anymore, and so they turned into zombies. I fixed
this problem by setting the environment variable
MPICH_PROCESS_GROUP=no, which prevents MPICH from creating a new
process group.
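
For anyone who wants to see the effect in isolation, here is a minimal
sketch. It is not BOINC or MPICH code: the names ReapResult and
reap_after_setsid are made up for illustration, and the child simply
calls setsid() itself to stand in for what MPICH does. The parent waits
on a pipe until the child has left its process group, then tries
waitpid() with pid=0 and with pid=-1.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper for illustration only (not part of BOINC).
struct ReapResult {
    pid_t child;          // pid of the forked child, or -1 on failure
    pid_t by_group;       // return value of waitpid(0, ...)
    int   by_group_errno; // errno left by that call
    pid_t by_any;         // return value of waitpid(-1, ...)
};

ReapResult reap_after_setsid() {
    ReapResult r = {-1, -1, 0, -1};
    int fds[2];
    if (pipe(fds) != 0) return r;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return r;
    }
    if (child == 0) {
        // Child: start a new session, which also creates a new process group.
        close(fds[0]);
        if (setsid() < 0) _exit(1);
        assert(getpgrp() == getpid()); // the child now leads its own group
        char done = 1;
        if (write(fds[1], &done, 1) != 1) _exit(1); // tell the parent
        _exit(0);
    }

    // Parent: block until the child has moved out of our process group.
    r.child = child;
    close(fds[1]);
    char buf = 0;
    ssize_t n = read(fds[0], &buf, 1);
    (void)n;
    close(fds[0]);

    int status = 0;
    errno = 0;
    // pid=0 only considers children in the caller's own process group.
    r.by_group = waitpid(0, &status, WNOHANG);
    r.by_group_errno = errno;
    // pid=-1 considers every child, whatever its process group.
    r.by_any = waitpid(-1, &status, 0);
    return r;
}
```

If the calling process has no other children, the pid=0 call should
fail with ECHILD because no child is left in its group, while the
pid=-1 call returns the child's pid.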

However, I think BOINC's behaviour should be changed as well so it
will wait for any child process, not just those in its own process
group (i.e., pass pid=-1 when calling waitpid()). Is there a reason for
the current behaviour, or has this just never bitten anyone before?
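
To make the suggestion concrete, here is a rough sketch of a
non-blocking reap loop that uses pid=-1. This is not the code in
app_control.cpp; reap_exited_children is a hypothetical name, and a
real client would still have to match each returned pid against its
own list of running tasks.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper for illustration only (not BOINC's actual code).
// Collects every child that has already exited, regardless of which
// process group the child ended up in.
std::vector<pid_t> reap_exited_children() {
    std::vector<pid_t> reaped;
    int status = 0;
    pid_t pid;
    // waitpid() returns 0 when children exist but none has exited yet,
    // and -1 (ECHILD) when there are no children left at all.
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        reaped.push_back(pid);
    }
    return reaped;
}
```

Because of WNOHANG, the loop never blocks, so it could be called from a
periodic polling routine.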


Thanks,

Michael Melanson


On 23-Oct-09, at 16:46 , Michael Melanson wrote:

> Hi everyone,
>
> I'm having a problem with my work units. When I run them on my test
> box (client version 6.9.0), they run correctly and exit, but then
> remain in the "Running" state as zombies:
>
> r...@tomtest1:~/boinc_project# ps -Af | grep boinc
> melanson 10519     1  0 Oct22 ?        00:05:25 /usr/bin/boincmgr
> boinc    11646     1  0 16:32 ?        00:00:00 /usr/bin/boinc --
> check_all_logins --redirectio --dir /var/lib/boinc-client
> boinc    11654 11646  1 16:33 ?        00:00:04 [evaluator_0.150]
> <defunct>
> boinc    11655 11646  1 16:33 ?        00:00:04 [evaluator_0.150]
> <defunct>
> root     11943  4667  0 16:38 pts/2    00:00:00 grep boinc
>
> Apparently the client is not calling waitpid() on them, but I have no
> idea why not.
>
> The processes terminate correctly when run stand-alone. They also
> terminate correctly when I start the client such that it remains
> attached to the terminal. This problem only occurs when it is started
> as a daemon process, such as by the init.d process. That is, if the
> client is started by running the following command as root, it leaves
> zombies behind:
>
> # start-stop-daemon --start --background --pidfile /var/run/boinc.pid
> --make-pidfile -quiet --user boinc --chuid boinc --chdir /var/lib/
> boinc-client --exec /usr/bin/boinc -- --check_all_logins --redirectio
> --dir /var/lib/boinc-client
>
> However, if you get rid of the '--background' flag from that line, the
> app processes terminate correctly. I'm at a loss as to why this should
> make any difference, and would appreciate any help and insight you can
> provide.
>
> Thank you in advance.
>
>
> Michael

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev