I checked in this change; thanks.

-- David

Michael Melanson wrote:
> Hi again,
>
> I fixed the problem from my last email, and I want to share what the
> root problem was and the solution I found, in case someone else runs
> into something similar. I also have a suggestion for a change to the
> BOINC client.
>
> I didn't think this was relevant, so I didn't mention it in my last
> email, but my apps link against MPICH. I have it set up in a one-node
> configuration in my app because the code I'm modifying depends on
> it and will still be used on a grid for the time being as well as via
> BOINC.
>
> When you call MPI_Init, if the process is not attached to a terminal
> (this is why it only showed up when the "--background" flag was set),
> then MPICH calls setsid() to create a new session. This creates a new
> process group as well for that session, thus removing the app process
> from the BOINC process's group.
>
> In app_control.cpp:169 and app_control.cpp:510, when BOINC calls
> waitpid() to check the status of the processes, it gives pid=0. This
> makes it only reap processes in its same group; mine weren't anymore,
> and so they turned into zombies. I fixed this problem by setting the
> environment variable MPICH_PROCESS_GROUP=no, which prevents MPICH from
> creating a new process group.
>
> However, I think BOINC's behaviour should be changed as well so it
> will wait for any child process, not just those in its same group
> (i.e., set pid=-1 when calling waitpid()). Is there a reason for the
> current behaviour, or has this just never bitten anyone before?
>
>
> Thanks,
>
> Michael Melanson
>
>
> On 23-Oct-09, at 16:46, Michael Melanson wrote:
>
>> Hi everyone,
>>
>> I'm having a problem with my work units. When I run them on my test
>> box (client version 6.9.0), they run correctly and exit, but then
>> remain in the "Running" state as zombies:
>>
>> r...@tomtest1:~/boinc_project# ps -Af | grep boinc
>> melanson 10519     1  0 Oct22 ?        00:05:25 /usr/bin/boincmgr
>> boinc    11646     1  0 16:32 ?        00:00:00 /usr/bin/boinc --check_all_logins --redirectio --dir /var/lib/boinc-client
>> boinc    11654 11646  1 16:33 ?        00:00:04 [evaluator_0.150] <defunct>
>> boinc    11655 11646  1 16:33 ?        00:00:04 [evaluator_0.150] <defunct>
>> root     11943  4667  0 16:38 pts/2    00:00:00 grep boinc
>>
>> Apparently the client is not calling waitpid() on them, but I have no
>> idea why not.
>>
>> The processes terminate correctly when run stand-alone. They also
>> terminate correctly when I start the client such that it remains
>> attached to the terminal. This problem only occurs when it is started
>> as a daemon process, such as by the init.d process. That is, if the
>> client is started by running the following command as root, it leaves
>> zombies behind:
>>
>> # start-stop-daemon --start --background --pidfile /var/run/boinc.pid
>> --make-pidfile -quiet --user boinc --chuid boinc
>> --chdir /var/lib/boinc-client --exec /usr/bin/boinc --
>> --check_all_logins --redirectio --dir /var/lib/boinc-client
>>
>> However, if you get rid of the '--background' flag from that line, the
>> app processes terminate correctly. I'm at a loss as to why this should
>> make any difference, and would appreciate any help and insight you can
>> provide.
>>
>> Thank you in advance.
>>
>>
>> Michael
>
> _______________________________________________
> boinc_dev mailing list
> [email protected]
> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
> To unsubscribe, visit the above URL and
> (near bottom of page) enter your email address.
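
For anyone who wants to reproduce the effect described above without MPICH or BOINC, here is a minimal sketch in Python (POSIX only). It is not the code in app_control.cpp; the helper name reaped_by and the pipe used for synchronisation are assumptions made purely for illustration. It forks a child that optionally calls setsid(), then checks whether waitpid() with pid=0 or pid=-1 reaps it.

```python
import os


def reaped_by(pid_arg: int, child_calls_setsid: bool) -> bool:
    """Fork a short-lived child and report whether waitpid(pid_arg) reaps it.

    Hypothetical helper for illustration only; not part of BOINC or MPICH.
    """
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # Child process.
        os.close(read_fd)
        if child_calls_setsid:
            # New session, hence a new process group: the child leaves
            # the parent's group, as the email says MPICH does.
            os.setsid()
        # Exiting closes write_fd, which the parent observes as EOF.
        os._exit(0)

    # Parent process: wait until the child is past setsid() and exiting.
    os.close(write_fd)
    os.read(read_fd, 1)  # returns b"" at EOF
    os.close(read_fd)

    try:
        reaped_pid, _status = os.waitpid(pid_arg, 0)
        reaped = reaped_pid == child
    except ChildProcessError:
        # ECHILD: no child of this process matches pid_arg.
        reaped = False

    if not reaped:
        # Clean up explicitly so the demo itself leaves no zombie behind.
        os.waitpid(child, 0)
    return reaped


if __name__ == "__main__":
    print("pid=0,  same group:", reaped_by(0, child_calls_setsid=False))
    print("pid=0,  after setsid():", reaped_by(0, child_calls_setsid=True))
    print("pid=-1, after setsid():", reaped_by(-1, child_calls_setsid=True))
```

On a POSIX system, only the middle case should fail to reap the child, which matches the zombies seen when the client was started with --background. Waiting with pid=-1 picks up the child regardless of its process group.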
