There is a known problem with Leopard and all versions of Open MPI. We haven't had time to chase it down yet, and that is probably still a few weeks away.
Ralph

On 2/11/08 1:39 PM, "Greg Watson" <g.wat...@computer.org> wrote:

> Hi,
>
> Since I upgraded to MacOS X 10.5.1, I've been having problems running
> MPI programs (using both 1.2.4 and 1.2.5). The symptoms are
> intermittent (i.e., sometimes the application runs fine) and appear as
> follows:
>
> 1. One or more of the application processes die (I've seen both one and
> two processes die).
>
> 2. It appears that the orteds associated with these application
> processes then spin continually.
>
> Here is what I see when I run "mpirun -np 4 ./mpitest":
>
> 12467 ?? Rs 1:26.52 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename node0 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12468 ?? Rs 1:26.63 orted --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0 --nodename node1 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12469 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0 --nodename node2 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12470 ?? Ss 0:00.04 orted --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node3 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12471 ?? S 0:00.05 ./mpitest
> 12472 ?? S 0:00.05 ./mpitest
>
> Killing the mpirun results in:
>
> $ mpirun -np 4 ./mpitest
> ^Cmpirun: killing job...
>
> ^C
> --------------------------------------------------------------------------
> WARNING: mpirun is in the process of killing a job, but has detected an
> interruption (probably control-C).
>
> It is dangerous to interrupt mpirun while it is killing a job (proper
> termination may not be guaranteed). Hit control-C again within 1
> second if you really want to kill mpirun immediately.
> --------------------------------------------------------------------------
> ^Cmpirun: forcibly killing job...
> --------------------------------------------------------------------------
> WARNING: mpirun has exited before it received notification that all
> started processes had terminated. You should double check and ensure
> that there are no runaway processes still executing.
> --------------------------------------------------------------------------
>
> At this point, the two spinning orteds are left running, and the only
> way to kill them is with kill -9.
>
> Is anyone else seeing this problem?
>
> Greg
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel