OK. Sorry for the delay. I needed to read through this thread a few times and try some experiments. Let me reply to a few of these pieces, and then I'll talk about those experiments.
On 1/31/12 9:26 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:

>>> I never used spawn_multiple, but isn't it necessary to start it with
>>> mpiexec too and call MPI_Init?
>>>
>>> $ mpiexec ./mpitest -np 1
>>
>> I don't think so.
>
> In the book "Using MPI-2" by William Gropp et al. they use it in chapter
> 7.2.2/page 235 this way, although it's indeed stated in the MPI-2.2
> standard on page 329 to create a singleton MPI environment if the
> application could find the necessary information (i.e. wasn't started by
> mpiexec).
>
> Maybe it's a side effect of a tight integration that it would start on
> the correct nodes (but I face an incorrect allocation of slots and an
> error message at the end if started without mpiexec), as in this case it
> has no command line option for the hostfile. How to get the requested
> nodes if started from the command line?

OK, I misunderstood you. I thought you were saying that spawn_multiple had
to call mpiexec for each spawned process. If you just meant that mpi.sh
should launch the initial process with mpiexec, that seems reasonable. I
tried it with and without, and I definitely get better results when using
mpiexec.

>> In any case, when I restrict the SGE grid to run all of my orte parallel
>> environment jobs on one machine, the application runs fine. I only have
>> problems if one or more of the spawned children gets scheduled to
>> another node.
>
>>> to override the detected slots by the tight integration into SGE.
>>> Otherwise it might be running only as a serial one. The additional 4
>>> spawned processes can then be added inside your application.
>>>
>>> The line to initialize MPI:
>>>
>>> if( MPI::Init( MPI::THREAD_MULTIPLE ) != MPI::THREAD_MULTIPLE )
>>> ...
>>>
>>> I replaced the complete if... by a plain MPI::Init(); and get a
>>> suitable output (see attached, qsub -pe openmpi 4 and changed _nProc
>>> to 3) in a tight integration into SGE.
>
> Okay, typo - the _thread is missing.

I have not tried that change yet. If I need MPI_THREAD_MULTIPLE, and Open
MPI is compiled with thread support, it's not clear to me whether
MPI::Init_thread() and MPI::Init_thread(MPI::THREAD_MULTIPLE) would give me
the same behavior from Open MPI.

>>> NB: What is MPI::Init( MPI::THREAD_MULTIPLE ) supposed to do, output a
>>> feature of MPI?

From the man page: "MPI_Init_thread, as compared to MPI_Init, has a
provision to request a certain level of thread support in required. ...
The level of thread support available to the program is set in provided,
except in C++, where it is the return value of the function."

> For me it's not hanging. Did you try the alternative startup using
> mpiexec?
>
> Aha - BTW: I use 1.4.4

Right, I'm on 1.5.4. Yes, I did try starting with mpiexec. That helps, but
I still don't know whether I understand all of the results.

For each experiment, I've attached the output of:

    qstat -f
    qstat -g t
    pstree -Aalp <pid of sge_execd>
    the mpitest parent and children (mpi.sh.o<job>)

I ran each test with two different SGE queue configurations. In one case,
the queue with the orte pe is set to include all 5 exec hosts in my grid.
In the "single" case, the queue with the orte pe is set to use only a
single host. (The queue configuration isn't shown here, but I changed the
queue's hostlist to use either a single host or a host group that includes
all of my machines.) I run qsub on node 17. The grid machines available
for this run are 3, 4, 10, 11, and 16.
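In case it helps to see the shape of what we're discussing, the
startup/spawn sequence in mpitest is essentially the following. This is a
condensed sketch, not the literal source; the real program does its own
work where the barriers are, and the child count of 4 matches the 5-slot
requests below.

    #include <mpi.h>
    #include <cstdio>

    int main( int argc, char* argv[] )
    {
        // Request full thread support; the C++ binding returns the level
        // actually provided instead of filling in an output argument.
        int provided = MPI::Init_thread( argc, argv, MPI::THREAD_MULTIPLE );
        if ( provided != MPI::THREAD_MULTIPLE )
            std::printf( "warning: provided thread level = %d\n", provided );

        MPI::Intercomm parent = MPI::Comm::Get_parent();
        if ( parent == MPI::COMM_NULL )
        {
            // Parent (the single MASTER task): spawn 4 workers to fill the
            // remaining SGE slots.
            const char*     cmds[]     = { "mpitest" };
            const int       maxprocs[] = { 4 };
            const MPI::Info infos[]    = { MPI::INFO_NULL };

            MPI::Intercomm children =
                MPI::COMM_WORLD.Spawn_multiple( 1, cmds, MPI::ARGVS_NULL,
                                                maxprocs, infos, 0 );
            children.Barrier();   // stand-in for the real parent-side work
        }
        else
        {
            parent.Barrier();     // spawned child: sync with the parent
        }

        MPI::Finalize();
        return 0;
    }

(Checking the return value matters because, per the man page quoted above,
in C++ the provided thread level comes back as the return value of
Init_thread.)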
Some observations:

1. I'm still surprised that the SGE behavior is so different when I
configure my SGE queue differently. See test "a" in the .tgz. When I just
run mpitest in mpi.sh and ask for exactly 5 slots (-pe orte 5-5), it works
if the queue is configured to use a single host: I see 1 MASTER and 4
SLAVEs in qstat -g t, and I get the correct output. If the queue is set to
use multiple hosts, the jobs hang in spawn/init, and I get errors:

   [grid-03.cisco.com][[19159,2],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)
   [grid-10.cisco.com:05327] [[19159,0],3] routed:binomial: Connection to lifeline [[19159,0],0] lost
   [grid-16.cisco.com:25196] [[19159,0],1] routed:binomial: Connection to lifeline [[19159,0],0] lost
   [grid-11.cisco.com:63890] [[19159,0],2] routed:binomial: Connection to lifeline [[19159,0],0] lost

So I'll just assume that mpiexec does some magic that is needed in the
multi-machine scenario but not in the single-machine scenario.

2. I guess I'm not sure how SGE is supposed to behave. Experiments "a" and
"b" were identical except that I changed -pe orte 5-5 to -pe orte 5-. The
single case works like before, and the multiple-exec-host case fails as
before. The difference is that qstat -g t shows additional SLAVEs that
don't seem to correspond to any jobs on the exec hosts. Are these SLAVEs
just slots that are reserved for my job but that I'm not using? If my job
will only use 5 slots, then I should set the SGE qsub job to ask for
exactly 5 with "-pe orte 5-5", right?

3. Experiment "d" was similar to "b", but mpi.sh uses "mpiexec -np 1
mpitest" instead of running mpitest directly. Now both the single-machine
queue and the multiple-machine queue work, so mpiexec seems to make my
multi-machine configuration happier. In this case I'm still using
"-pe orte 5-", and I'm still seeing the extra SLAVE slots granted in
qstat -g t.

4. Based on "d", I thought that I could follow the approach in "a". That
is, for experiment "e", I used mpiexec -np 1, but I also used -pe orte 5-5.
I thought that this would make the multi-machine queue reserve only the 5
slots that I needed. The single-machine queue works correctly, but now the
multi-machine case hangs with no errors. The output from qstat and pstree
is what I'd expect, but it seems to hang in Spawn_multiple and Init_thread.
I really expected this to work.

I'm really confused by experiment "e" with multiple machines in the queue.
Based on "a" and "d", I thought that mpiexec -np 1 would permit the
multi-machine scheduling to work with MPI while "-pe orte 5-5" would limit
the slots to exactly the number needed to run. (The submissions for each
experiment are sketched below.)
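To be concrete, here is the job script with the submission lines, trimmed
to what matters (paths and any environment setup omitted):

    #!/bin/sh
    # mpi.sh, abbreviated.
    #
    # Submitted with:  qsub -pe orte 5-5 mpi.sh    (experiments "a" and "e")
    #            or:   qsub -pe orte 5-  mpi.sh    (experiments "b" and "d")
    #
    # In experiments "a" and "b" the line below is just:  ./mpitest
    mpiexec -np 1 ./mpitest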
---Tom

Attachment: mpiExperiments.tgz