Re: [OMPI users] SLURM and OpenMPI
On Thu, 20 Mar 2008 16:40:41 -0600 Ralph Castain wrote:

> I am no slurm expert. However, it is our understanding that
> SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
> not the number of tasks to be executed on each node. So the 4(x2)
> tells us that we have 4 slots on each of two nodes to work with. You
> got 4 slots on each node because you used the -N option, which told
> slurm to assign all slots on that node to this job - I assume you
> have 4 processors on your nodes. OpenMPI parses that string to get
> the allocation, then maps the number of specified processes against
> it.

That was also my interpretation, and I was absolutely sure I had read it
a couple of days ago in the srun man page. In the meantime I have changed
my opinion, because now it says "Number of tasks to be initiated on each
node", as Tim quoted. I have no idea how Tim managed to change the man
page on my computer ;-)

There is also another variable documented:

    SLURM_CPUS_ON_NODE
        Count of processors available to the job on this node. Note the
        select/linear plugin allocates entire nodes to jobs, so the value
        indicates the total count of CPUs on the node. The select/cons_res
        plugin allocates individual processors to jobs, so this number
        indicates the number of processors on this node allocated to the
        job.

Anyway, back to reality: I made some further tests, and the only way to
change the value of SLURM_TASKS_PER_NODE was to tell slurm in slurm.conf
that node x has only y cpus. The variable SLURM_CPUS_ON_NODE, although
documented in both 1.0.15 and 1.2.22, does not seem to exist in either
version. In 1.2.22 there is SLURM_JOB_CPUS_PER_NODE, which has the same
value as SLURM_TASKS_PER_NODE.

In a couple of days I'll try the other allocator plugin, which allocates
per cpu instead of per node. After that it would probably be a good idea
for somebody (me?) to sum up our thread and ask the slurm developers for
their opinion.
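As an aside, the "4(x2)" notation above is SLURM's compressed form for
SLURM_TASKS_PER_NODE: a comma-separated list of "count" or "count(xreps)"
items. A minimal Python sketch of expanding such a string into a per-node
list (the function name is mine, not from any SLURM tool):

```python
import re

def expand_tasks_per_node(value):
    """Expand a compressed SLURM_TASKS_PER_NODE string such as "4(x2)"
    or "2,3(x2)" into a flat per-node list, e.g. [4, 4] or [2, 3, 3]."""
    counts = []
    for item in value.split(","):
        # Each item is either "N" or "N(xR)": N tasks on each of R nodes.
        m = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item)
        if m is None:
            raise ValueError("unrecognized item: %r" % item)
        tasks = int(m.group(1))
        repeat = int(m.group(2)) if m.group(2) else 1
        counts.extend([tasks] * repeat)
    return counts

print(expand_tasks_per_node("4(x2)"))   # the value discussed above -> [4, 4]
```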
> It is possible that the interpretation of SLURM_TASKS_PER_NODE is
> different when used to allocate as opposed to directly launch
> processes. Our typical usage is for someone to do:
>
>     srun -N 2 -A
>     mpirun -np 2 helloworld
>
> In other words, we use srun to create an allocation, and then run
> mpirun separately within it.
>
> I am therefore unsure what the "-n 2" will do here. If I believe the
> documentation, it would seem to imply that srun will attempt to
> launch two copies of "mpirun -np 2 helloworld", yet your output
> doesn't seem to support that interpretation. It would appear that the
> "-n 2" is being ignored and only one copy of mpirun is being
> launched. I'm no slurm expert, so perhaps that interpretation is
> incorrect.

That is indeed what happens when you call "srun -N 2 mpirun -np 2
helloworld", but "srun -N 2 -b mpirun -np 2 helloworld" submits it as a
batch job, i.e. "mpirun -np 2 helloworld" is executed only once, on one
of the allocated nodes, and the environment variables are set
appropriately -- or at least should be set appropriately -- so that a
subsequent srun, or an mpirun inside the command, does the right thing.

Werner
[OMPI users] SLURM and OpenMPI
Hi,

At our site here at the University of Karlsruhe we are running two large
clusters with SLURM and HP-MPI. For our new cluster we want to keep SLURM
and switch to OpenMPI. While testing I ran into the following problem:
with HP-MPI I do something like

    srun -N 2 -n 2 -b mpirun -srun helloworld

and get

    Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
    Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

When I try the same with OpenMPI (version 1.2.4)

    srun -N 2 -n 2 -b mpirun helloworld

I get

    Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
    Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
    Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
    Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
    Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with

    srun -N 2 -n 2 -b mpirun -np 2 helloworld

    Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
    Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI reads the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses slurm to start one orted per node, and launches tasks up
to the maximum number of slots on every node. So basically it also does
some 'resource management' of its own and interferes with slurm. OK, I
could fix that with an mpirun wrapper script which calls mpirun with the
right -np and the right rmaps_base_n_pernode setting, but it gets worse:
we want to allocate computing power per cpu instead of per node, i.e.
different users may share a node. In addition, slurm allows scheduling
according to memory usage. It is therefore important that on every node
exactly the number of tasks is running that slurm wants.
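For what it's worth, SLURM_NODELIST is itself stored in a compressed
hostlist form, e.g. "xc3n[13-14]" for the two nodes above (the general
case is best left to "scontrol show hostnames"). A minimal Python sketch
covering the simple single-bracket case:

```python
import re

def expand_nodelist(nodelist):
    """Expand a simple SLURM hostlist such as "xc3n[13-14,16]" into
    individual hostnames. Handles only one bracket group; real hostlists
    can be more complex (use `scontrol show hostnames` in general)."""
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\](.*)", nodelist)
    if m is None:
        return nodelist.split(",")       # plain comma-separated names
    prefix, ranges, suffix = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)              # preserve zero padding, e.g. n01-n03
            hosts.extend("%s%0*d%s" % (prefix, width, i, suffix)
                         for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part + suffix)
    return hosts

print(expand_nodelist("xc3n[13-14]"))   # -> ['xc3n13', 'xc3n14']
```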
The only solution I have come up with is to generate a detailed hostfile
for every job and call mpirun --hostfile. Any suggestions for
improvement?

I found a discussion thread "slurm and all-srun orterun" in the mailing
list archive concerning the same problem, in which Ralph Castain
announced that he was working on two new launch methods that would fix my
problems. Unfortunately his email address is deleted from the archive, so
it would be really nice if the friendly elf mentioned there is still
around and could forward my mail to him.

Thanks in advance,
Werner Augustin
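The hostfile approach above could be sketched roughly as follows: pair
each allocated host with the number of slots slurm assigned to it and
write them out in OpenMPI's "<host> slots=<n>" hostfile format. The
function name and the hard-coded example values are mine; in a real
wrapper the two lists would come from SLURM_NODELIST and
SLURM_TASKS_PER_NODE:

```python
import os
import tempfile

def write_hostfile(hosts, tasks_per_node):
    """Write an OpenMPI hostfile pairing each allocated host with the
    number of slots SLURM assigned to it; return the file's path."""
    if len(hosts) != len(tasks_per_node):
        raise ValueError("host list and task counts differ in length")
    fd, path = tempfile.mkstemp(prefix="ompi_hosts_")
    with os.fdopen(fd, "w") as f:
        for host, slots in zip(hosts, tasks_per_node):
            f.write("%s slots=%d\n" % (host, slots))
    return path

# A wrapper would then exec something like:
#   mpirun --hostfile <path> -np <sum of slots> helloworld
path = write_hostfile(["xc3n13", "xc3n14"], [1, 1])
print(open(path).read(), end="")
```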