Re: [OMPI users] SLURM and OpenMPI

2008-03-27 Thread Werner Augustin
On Thu, 20 Mar 2008 16:40:41 -0600
Ralph Castain wrote:

> I am no slurm expert. However, it is our understanding that
> SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
> not the number of tasks to be executed on each node. So the 4(x2)
> tells us that we have 4 slots on each of two nodes to work with. You
> got 4 slots on each node because you used the -N option, which told
> slurm to assign all slots on that node to this job - I assume you
> have 4 processors on your nodes. OpenMPI parses that string to get
> the allocation, then maps the number of specified processes against
> it.

That was also my interpretation, and I was absolutely sure I had read
it a couple of days ago in the srun man-page. In the meantime I have
changed my opinion, because now it says "Number of tasks to be initiated
on each node", as Tim has quoted. I've no idea how Tim managed to change
the man-page on my computer ;-)

And there is another variable documented:

   SLURM_CPUS_ON_NODE
          Count of processors available to the job on this node.
          Note the select/linear plugin allocates entire nodes to
          jobs, so the value indicates the total count of CPUs on
          the node. The select/cons_res plugin allocates individual
          processors to jobs, so this number indicates the number
          of processors on this node allocated to the job.


Anyway, back to reality: I made some further tests, and the only way to
change the value of SLURM_TASKS_PER_NODE was to tell slurm in slurm.conf
that node x has only y cpus. The variable SLURM_CPUS_ON_NODE, although
documented in both 1.0.15 and 1.2.22, doesn't seem to exist in either
version. In 1.2.22 there seems to be a SLURM_JOB_CPUS_PER_NODE which has
the same value as SLURM_TASKS_PER_NODE. In a couple of days I'll try the
other allocation plugin (select/cons_res), which allocates per cpu
instead of per node. And after that it would probably be a good idea
for somebody (me?) to sum up this thread and ask the slurm guys for
their opinion.
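
For reference, a quick way to see what slurm actually exports is to
create an allocation and look at the environment, e.g.

srun -N 2 -A          # allocate two nodes and drop into a subshell
env | grep ^SLURM_    # list the SLURM_* variables of that allocation

(the -A allocate mode is the same one Ralph uses below).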

> It is possible that the interpretation of SLURM_TASKS_PER_NODE is
> different when used to allocate as opposed to directly launch
> processes. Our typical usage is for someone to do:
> 
> srun -N 2 -A
> mpirun -np 2 helloworld
> 
> In other words, we use srun to create an allocation, and then run
> mpirun separately within it.
> 
> 
> I am therefore unsure what the "-n 2" will do here. If I believe the
> documentation, it would seem to imply that srun will attempt to
> launch two copies of "mpirun -np 2 helloworld", yet your output
> doesn't seem to support that interpretation. It would appear that the
> "-n 2" is being ignored and only one copy of mpirun is being
> launched. I'm no slurm expert, so perhaps that interpretation is
> incorrect.

That is indeed what happens when you call "srun -N 2 mpirun -np 2
helloworld", but "srun -N 2 -b mpirun -np 2 helloworld" submits it as a
batch job, i.e. "mpirun -np 2 helloworld" is executed only once, on one
of the allocated nodes, and the environment variables are set
appropriately -- or at least should be set appropriately -- so that a
subsequent srun or mpirun inside the command does the right thing.
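
For completeness, the batch pattern looks roughly like this (jobscript.sh
is just a placeholder name for whatever wrapper one writes):

srun -N 2 -b jobscript.sh

with jobscript.sh being something along the lines of

#!/bin/sh
# executed only once, on one of the allocated nodes; the SLURM_* variables
# describing the whole allocation are already set in the environment here
mpirun -np 2 helloworld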

Werner 


[OMPI users] SLURM and OpenMPI

2008-03-20 Thread Werner Augustin
Hi,

At our site here at the University of Karlsruhe we are running two
large clusters with SLURM and HP-MPI. For our new cluster we want to
keep SLURM and switch to OpenMPI. While testing I ran into the
following problem:

with HP-MPI I do something like

srun -N 2 -n 2 -b mpirun -srun helloworld

and get 

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

when I try the same with OpenMPI (version 1.2.4)

srun -N 2 -n 2 -b mpirun helloworld

I get

Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with 

srun -N 2 -n 2 -b mpirun -np 2 helloworld

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI reads the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses slurm to start one orted per node, and then launches
tasks up to the maximum number of slots on every node. So basically it
also does some 'resource management' of its own and interferes with
slurm. OK, I can fix that with an mpirun wrapper script which calls
mpirun with the right -np and the right rmaps_base_n_pernode setting,
but it gets worse. We want to allocate computing power on a per-cpu
basis instead of per node, i.e. different users might share a node. In
addition, slurm allows scheduling according to memory usage. It is
therefore important that every node runs exactly the number of tasks
slurm wants. The only solution I came up with is to generate a detailed
hostfile for every job and call mpirun --hostfile; a rough sketch of
what I mean is below. Any suggestions for improvement?
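
To make that concrete, the wrapper I have in mind would be roughly along
these lines -- a completely untested sketch; the script name is just a
placeholder, and it assumes that SLURM_TASKS_PER_NODE has the "4(x2)" /
"2,4" format discussed above and that "scontrol show hostnames" is
available:

#!/bin/sh
# ompi-hostfile-wrapper.sh (placeholder name): build an OpenMPI hostfile
# from the slurm allocation and start mpirun with it, so that every node
# gets exactly the number of tasks slurm assigned to it.
hostfile=hostfile.$$
: > $hostfile
# expand the compressed nodelist, e.g. "xc3n[13-14]" -> one name per line
nodes=`scontrol show hostnames "$SLURM_NODELIST"`
# expand SLURM_TASKS_PER_NODE, e.g. "4(x2)" -> "4 4", "2,4" -> "2 4"
counts=`echo "$SLURM_TASKS_PER_NODE" | sed 's/,/ /g' | \
        awk '{ for (i = 1; i <= NF; i++) {
                 c = $i; n = 1
                 if (split($i, a, "\\(x") == 2) { c = a[1]; n = a[2] + 0 }
                 for (j = 0; j < n; j++) printf "%s ", c } }'`
ntasks=0
i=1
for node in $nodes; do
    c=`echo $counts | cut -d' ' -f$i`
    echo "$node slots=$c" >> $hostfile
    ntasks=`expr $ntasks + $c`
    i=`expr $i + 1`
done
exec mpirun --hostfile $hostfile -np $ntasks "$@"

One would then submit e.g. "srun -N 2 -n 2 -b ompi-hostfile-wrapper.sh
helloworld" and let the wrapper work out -np and the placement from the
allocation.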

I've found a discussion thread "slurm and all-srun orterun" in the
mailing list archive concerning the same problem, where Ralph Castain
announced that he was working on two new launch methods which would fix
my problems. Unfortunately his email address has been removed from the
archive, so it would be really nice if the friendly elf mentioned there
is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin