Hi again, I fixed the problem from my last email, and I want to share what the root problem was and the solution I found, in case someone else runs into something similar. I also have a suggestion for a change to the BOINC client.
I didn't think this was relevant, so I didn't mention it in my last email, but my apps link against MPICH. I have it set up in a one-node configuration in my app because the code I'm modifying depends on it and, for the time being, will still be used on a grid as well as via BOINC.

When you call MPI_Init, if the process is not attached to a terminal (this is why the problem only showed up when the "--background" flag was set), MPICH calls setsid() to create a new session. This also creates a new process group for that session, which removes the app process from the BOINC process's group. In app_control.cpp:169 and app_control.cpp:510, when BOINC calls waitpid() to check the status of the processes, it passes pid=0. This makes it reap only child processes in its own process group; mine were no longer in that group, so they turned into zombies.

I fixed this problem by setting the environment variable MPICH_PROCESS_GROUP=no, which prevents MPICH from creating a new process group. However, I think BOINC's behaviour should be changed as well so that it waits for any child process, not just those in its own group (i.e., pass pid=-1 when calling waitpid()). Is there a reason for the current behaviour, or has this just never bitten anyone before?

Thanks,
Michael Melanson

On 23-Oct-09, at 16:46, Michael Melanson wrote:
> Hi everyone,
>
> I'm having a problem with my work units. When I run them on my test
> box (client version 6.9.0), they run correctly and exit, but then
> remain in the "Running" state as zombies:
>
> r...@tomtest1:~/boinc_project# ps -Af | grep boinc
> melanson 10519 1 0 Oct22 ? 00:05:25 /usr/bin/boincmgr
> boinc 11646 1 0 16:32 ? 00:00:00 /usr/bin/boinc --
> check_all_logins --redirectio --dir /var/lib/boinc-client
> boinc 11654 11646 1 16:33 ? 00:00:04 [evaluator_0.150]
> <defunct>
> boinc 11655 11646 1 16:33 ? 00:00:04 [evaluator_0.150]
> <defunct>
> root 11943 4667 0 16:38 pts/2 00:00:00 grep boinc
>
> Apparently the client is not calling waitpid() on them, but I have no
> idea why not.
>
> The processes terminate correctly when run stand-alone. They also
> terminate correctly when I start the client such that it remains
> attached to the terminal. This problem only occurs when it is started
> as a daemon process, such as by the init.d process. That is, if the
> client is started by running the following command as root, it leaves
> zombies behind:
>
> # start-stop-daemon --start --background --pidfile /var/run/boinc.pid
> --make-pidfile -quiet --user boinc --chuid boinc --chdir /var/lib/
> boinc-client --exec /usr/bin/boinc -- --check_all_logins --redirectio
> --dir /var/lib/boinc-client
>
> However, if you get rid of the '--background' flag from that line, the
> app processes terminate correctly. I'm at a loss as to why this should
> make any difference, and would appreciate any help and insight you can
> provide.
>
> Thank you in advance.
>
>
> Michael

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
