Ralph wrote: "I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility."
Ralph, we wrote a launcher for mvapich that uses srun to launch but keeps tight control of where processes are started. The way we did it was to force srun to launch a single process on a particular node. The launcher calls many of these:

  srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS

Hope this helps (and we are looking forward to a tighter orterun/slurm integration, as you know).

Regards,
Federico

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, March 20, 2008 6:41 PM
To: Open MPI Users <us...@open-mpi.org>
Cc: Ralph Castain
Subject: Re: [OMPI users] SLURM and OpenMPI

Hi there

I am no slurm expert. However, it is our understanding that SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the number of tasks to be executed on each node. So the 4(x2) tells us that we have 4 slots on each of two nodes to work with. You got 4 slots on each node because you used the -N option, which told slurm to assign all slots on that node to this job - I assume you have 4 processors on your nodes.

OpenMPI parses that string to get the allocation, then maps the number of specified processes against it. It is possible that the interpretation of SLURM_TASKS_PER_NODE is different when used to allocate as opposed to directly launch processes. Our typical usage is for someone to do:

  srun -N 2 -A
  mpirun -np 2 helloworld

In other words, we use srun to create an allocation, and then run mpirun separately within it. I am therefore unsure what the "-n 2" will do here. If I believe the documentation, it would seem to imply that srun will attempt to launch two copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support that interpretation. It would appear that the "-n 2" is being ignored and only one copy of mpirun is being launched. I'm no slurm expert, so perhaps that interpretation is incorrect.
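The per-process technique Federico describes above can be sketched as a small shell loop that issues one `srun -N 1 -n 1 -w <host>` call per rank, so the caller, not srun's load balancer, decides the rank-to-node mapping. This is only an illustration: the job id, host list, and the dry-run `echo` (printing the commands instead of executing them, since this sketch is not running under SLURM) are all assumptions, not part of his actual launcher.

```shell
#!/bin/sh
# Sketch of a per-rank launcher in the style Federico describes:
# one srun call per process, pinned to an explicit host with -w.
# JOBID, HOSTS, and the dry-run echo are illustrative assumptions.
JOBID=${JOBID:-55641}
HOSTS="host005 host005 host006 host006"   # rank i runs on the i-th host
CMD="helloworld"

rank=0
for host in $HOSTS; do
    # A real launcher would execute (and background) this srun command;
    # here we only print the command line that would be issued.
    line="srun --jobid $JOBID -N 1 -n 1 -w $host $CMD"
    echo "rank $rank: $line"
    rank=$((rank + 1))
done
```

With this shape, placing two ranks per node (as above) or any other mapping is just a matter of editing the host list, which is exactly the flexibility srun's own balancing takes away.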
Assuming that the -n 2 is ignored in this situation, your command line:

> srun -N 2 -n 2 -b mpirun -np 2 helloworld

will cause mpirun to launch two processes, mapped by slot against the slurm allocation of two nodes, each having 4 slots. Thus, both processes will be launched on the first node, which is what you observed.

Similarly, the command line

> srun -N 2 -n 2 -b mpirun helloworld

doesn't specify the #procs to mpirun. In that case, mpirun will launch a process on every available slot in the allocation. Given this command, that means 4 procs will be launched on each of the 2 nodes, for a total of 8 procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second. Again, this is what you observed.

I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility.

Using the SLURM-defined mapping would require launching without our mpirun. This capability is still under development, and there are issues with doing that in slurm environments which need to be addressed. It is at a lower priority than providing such support for TM right now, so I wouldn't expect it to become available for several months at least.

Alternatively, it may be possible for mpirun to get the SLURM-defined mapping and use it to launch the processes. If we can get it somehow, there is no problem launching it as specified - the problem is how to get the map! Unfortunately, slurm's licensing prevents us from using its internal APIs, so obtaining the map is not an easy thing to do. Anyone who wants to help accelerate that timetable is welcome to contact me.
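The "by slot" placement Ralph explains above can be modelled with a few lines of shell. This is only a sketch of the mapping arithmetic, not mpirun's actual code; the node names and the 2-node, 4-slot allocation are taken from this thread.

```shell
#!/bin/sh
# Model of Open MPI's default "by slot" mapping: fill every slot on the
# first node before moving to the next. Allocation as in this thread:
# 2 nodes (xc3n13, xc3n14) with 4 slots each. NP is the requested #procs.
NODES="xc3n13 xc3n14"
SLOTS=4
NP=8

rank=0
placement=""
for node in $NODES; do
    s=0
    while [ "$s" -lt "$SLOTS" ] && [ "$rank" -lt "$NP" ]; do
        placement="$placement rank$rank:$node"
        rank=$((rank + 1))
        s=$((s + 1))
    done
done
echo "Map:$placement"
```

With NP=8 this model puts ranks 0-3 on xc3n13 and ranks 4-7 on xc3n14; with NP=2 it puts both ranks on xc3n13 — the two behaviors Werner observed.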
We know the technical issues - this is mostly a problem of (a) priorities versus my available time, and (b) similar considerations on the part of the slurm folks to do the work themselves.

Ralph

On 3/20/08 3:48 PM, "Tim Prins" <tpr...@open-mpi.org> wrote:

> Hi Werner,
>
> Open MPI does things a little bit differently than other MPIs when it
> comes to supporting SLURM. See
> http://www.open-mpi.org/faq/?category=slurm
> for general information about running with Open MPI on SLURM.
>
> After trying the commands you sent, I am actually a bit surprised by the
> results. I would have expected this mode of operation to work. But
> looking at the environment variables that SLURM is setting for us, I can
> see why it doesn't.
>
> On a cluster with 4 cores/node, I ran:
> [tprins@odin ~]$ cat mprun.sh
> #!/bin/sh
> printenv
> [tprins@odin ~]$ srun -N 2 -n 2 -b mprun.sh
> srun: jobid 55641 submitted
> [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
> SLURM_TASKS_PER_NODE=4(x2)
> [tprins@odin ~]$
>
> Which seems to be wrong, since the srun man page says that
> SLURM_TASKS_PER_NODE is the "Number of tasks to be initiated on each
> node". This seems to imply that the value should be "1(x2)". So maybe
> this is a SLURM problem? If this value were correctly reported, Open MPI
> should work fine for what you wanted to do.
>
> Two other things:
> 1. You should probably use the command line option '--npernode' for
> mpirun instead of setting rmaps_base_n_pernode directly.
> 2. In regards to your second example below, Open MPI by default maps 'by
> slot'. That is, it will fill all available slots on the first node
> before moving to the second. You can change this, see:
> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
>
> I have copied Ralph on this mail to see if he has a better response.
>
> Tim
>
> Werner Augustin wrote:
>> Hi,
>>
>> At our site here at the University of Karlsruhe we are running two
>> large clusters with SLURM and HP-MPI.
>> For our new cluster we want to keep SLURM and switch to OpenMPI.
>> While testing I got the following problem:
>>
>> with HP-MPI I do something like
>>
>> srun -N 2 -n 2 -b mpirun -srun helloworld
>>
>> and get
>>
>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.
>>
>> when I try the same with OpenMPI (version 1.2.4)
>>
>> srun -N 2 -n 2 -b mpirun helloworld
>>
>> I get
>>
>> Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
>> Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
>> Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
>> Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
>> Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
>> Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
>> Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
>> Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.
>>
>> and with
>>
>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>
>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.
>>
>> which is still wrong, because it uses only one of the two allocated
>> nodes.
>>
>> OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
>> variables, uses slurm to start one orted per node, and then starts
>> tasks up to the maximum number of slots on every node. So basically it
>> also does some 'resource management' and interferes with slurm. OK, I
>> can fix that with an mpirun wrapper script which calls mpirun with the
>> right -np and the right rmaps_base_n_pernode setting, but it gets
>> worse. We want to allocate computing power on a per-CPU basis instead
>> of per node, i.e. different users might share a node. In addition
>> slurm allows scheduling according to memory usage. Therefore it is
>> important that on every node there is exactly the number of tasks
>> running that slurm wants.
>> The only solution I came up with is to generate for every job a
>> detailed hostfile and call mpirun --hostfile. Any suggestions for
>> improvement?
>>
>> I've found a discussion thread "slurm and all-srun orterun" in the
>> mailinglist archive concerning the same problem, where Ralph Castain
>> announced that he is working on two new launch methods which would fix
>> my problems. Unfortunately his email address is deleted from the
>> archive, so it would be really nice if the friendly elf mentioned there
>> is still around and could forward my mail to him.
>>
>> Thanks in advance,
>> Werner Augustin
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
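The per-job hostfile workaround Werner mentions could be sketched roughly as below. Assumptions, not part of anything in this thread: the node list is taken pre-expanded to plain hostnames (real SLURM_NODELIST uses a compressed form like xc3n[13-14], which this sketch does not parse), the sample values match Tim's observed SLURM_TASKS_PER_NODE=4(x2), and the output uses the "slots=" hostfile syntax that Open MPI's mpirun accepts.

```shell
#!/bin/sh
# Sketch: expand SLURM_TASKS_PER_NODE (e.g. "4(x2)" or "4(x2),2") against
# an already-expanded node list into a hostfile for mpirun --hostfile.
# Default values mirror the 2-node, 4-slot allocation in this thread.
NODELIST=${NODELIST:-"xc3n13 xc3n14"}
TASKS=${TASKS:-"4(x2)"}
OUT=${OUT:-hostfile.txt}

set -- $NODELIST        # positional params = one hostname per node
: > "$OUT"
for spec in $(printf '%s' "$TASKS" | tr ',' ' '); do
    count=${spec%%\(*}                      # task count for this group
    case $spec in
        *x*) repeat=${spec#*x}; repeat=${repeat%\)} ;;  # "(xN)" suffix
        *)   repeat=1 ;;                                # bare count
    esac
    i=0
    while [ "$i" -lt "$repeat" ]; do
        echo "$1 slots=$count" >> "$OUT"    # consume one hostname
        shift
        i=$((i + 1))
    done
done
cat "$OUT"
```

The file could then be passed as `mpirun --hostfile hostfile.txt -np <total> helloworld`; note that this still only helps if SLURM_TASKS_PER_NODE is reported correctly, which is exactly the question raised earlier in the thread.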