Mark Dixon <m.c.di...@leeds.ac.uk> writes:

> Hi there,
>
> We've started looking at moving to the openmpi 1.8 branch from 1.6 on
> our CentOS6/Son of Grid Engine cluster and noticed an unexpected
> difference when binding multiple cores to each rank.
>
> Has openmpi's definition 'slot' changed between 1.6 and 1.8?

You wouldn't expect it to be documented if so, of course :-(, but it
doesn't look like it has.

> It used to mean ranks, but now it appears to mean processing elements
> (see Details, below).

I'm fairly confused by this.  Bizarrely, it happens that I was about to
ask whether anyone had a patch or workaround for the problem we see with
1.6.  [I notice there was a previous thread about MPI+OpenMP which I
didn't catch at the time and which looked pretty confused.  I suppose I
should follow it up for the archives.]

> Thanks,
>
> Mark
>
> PS Also, the man page for 1.8.3 reports that '--bysocket' is
> deprecated, but it doesn't seem to exist when we try to use it:
>
>   mpirun: Error: unknown option "-bysocket"
>   Type 'mpirun --help' for usage.

[Yes, it's gone here too, per mpirun --help.]

> ====== Details ======
>
> On 1.6.5, we launch with the following core binding options:
>
>   mpirun --bind-to-core --cpus-per-proc <n> <program>

That just doesn't work here on multiple nodes (and you forgot the
--np to override $NSLOTS).  It tries to over-allocate the first host.
The workaround is to use --loadbalance in this case, but it fails in the
normal case if you try to make it the default, sigh.  So the
recommendation for MPI+OpenMP jobs, until I fix it, is a script like

  #$ -l exclusive                     # SGE: ask for exclusive use of the nodes
  export OMP_NUM_THREADS=2
  # one rank per OMP_NUM_THREADS slots, binding that many cores to each rank
  exec mpirun --loadbalance --cpus-per-proc $OMP_NUM_THREADS \
       --np $(($NSLOTS/$OMP_NUM_THREADS)) ...

assuming OMP_NUM_THREADS divides the number of cores per socket on the
relevant nodes sensibly, and eliding issues with per-rank OpenMP affinity.
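
To make the arithmetic concrete, here's a minimal sketch with made-up
numbers (two 8-core sockets per node, 16 slots per node, two nodes;
none of those figures come from your setup):

  # hypothetical allocation: 2 nodes x 16 slots => NSLOTS=32
  export OMP_NUM_THREADS=2            # divides the 8 cores/socket evenly
  echo $(($NSLOTS/$OMP_NUM_THREADS))  # 16, i.e. --np 16, 2 cores bound per rank

i.e. sixteen two-thread ranks, eight per node, filling both nodes.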

>   mpirun --bind-to-core --bysocket --cpus-per-proc <n> <program>

Similarly in that case.  (I assume that trying to keep consecutive ranks
adjacent is a good default.)

>   where <n> is calculated to maximise the number of cores available to
>   use - I guess effectively
>   max(1, int(number of cores per node / slots per node requested)).
>
>   openmpi reads the file $PE_HOSTFILE and launches a rank for each slot
>   defined in it, binding <n> cores per rank.

That's why you need the --np, or is this with a fiddled host file?
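
For reference, a tightly-integrated PE's $PE_HOSTFILE has one line per
host, with the granted slot count in the second column, roughly like
this (hypothetical host and queue names):

  node101 16 par.q@node101 UNDEFINED
  node102 16 par.q@node102 UNDEFINED

and 1.6.5 will start one rank per listed slot unless --np says
otherwise.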

> On 1.8.3, we've tried launching with the following core binding
> options (which we hoped were equivalent):
>
>   mpirun -map-by node:PE=<n> <program>
>   mpirun -map-by socket:PE=<n> <program>

With 1.8.3 here, replacing "--loadbalance --cpus-per-proc" with
"--map-by slot:PE=2" works.

I assume you use --report-bindings to check what's going on (which gave
me the hint about --loadbalance).  I've never seen it lie about the
binding the processes actually get.
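
In other words, a sketch of the 1.8.3 version of the script above,
under the same assumptions, with --report-bindings added so you can see
what you actually got:

  #$ -l exclusive
  export OMP_NUM_THREADS=2
  exec mpirun --report-bindings --map-by slot:PE=$OMP_NUM_THREADS \
       --np $(($NSLOTS/$OMP_NUM_THREADS)) ...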

>   openmpi reads the file $PE_HOSTFILE and launches a factor of <n> fewer
>   ranks than under 1.6.5. We also notice that, where we wanted a single
>   rank on the box and <n> is the number of cores available, openmpi
>   refuses to launch and we get the message:
>
>   "There are not enough slots available in the system to satisfy the 1
>   slots that were requested by the application"
>
>   I think that error message needs a little work :)
