Good to hear! Thanks Ralph
On Mon, Jun 22, 2015 at 7:50 AM, Paul van der Mark <pvanderm...@fsu.edu> wrote:
>
> Hello Ralph,
>
> A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
> done the trick. Since we only recently started to prepare for a switch
> to Slurm, I can't confirm whether this already existed in 1.8.4. Our
> Slurm version is 14.11.7.
>
> Best,
> Paul
>
>
> On 06/18/2015 10:45 AM, Ralph Castain wrote:
> >
> > Please let us know - FWIW, we aren't seeing any such reports on the OMPI
> > mailing lists, and we run our test harness against Slurm (and other RMs)
> > every night.
> >
> > Also, please tell us what version of Slurm you are using. We do
> > sometimes see regressions against newer versions as they appear, and
> > that may be the case here.
> >
> >
> >> On Jun 18, 2015, at 7:32 AM, Paul van der Mark <pvanderm...@fsu.edu> wrote:
> >>
> >>
> >> Hello John,
> >>
> >> We tried a number of combinations of flags; some work and some don't:
> >> 1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
> >> 2. salloc -n 9 srun ./mympiprog
> >> (test cluster with 8 cores per node)
> >>
> >> Case 1: works flawlessly (for every combination).
> >> Case 2: works sometimes; warnings in some cases, segmentation faults
> >> in some cases (for example -n 10) in opal_memory_ptmalloc2_int_malloc.
> >>
> >> mpirun instead of srun works all the time.
> >>
> >> We are going to look into OpenMPI 1.8.6 now. We would like to have
> >> -n X work, since that is what most of our users use anyway.
> >>
> >> Best,
> >> Paul
> >>
> >>
> >> On 06/05/2015 08:19 AM, John Desantis wrote:
> >>>
> >>> Paul,
> >>>
> >>> How are you invoking srun with the application in question?
> >>>
> >>> It seems strange that the messages would manifest only when the job
> >>> runs on more than one node. Have you tried passing the flags "-N" and
> >>> "--ntasks-per-node" for testing? What about using "-w hostfile"?
> >>> Those would be the options that I'd immediately try to begin
> >>> troubleshooting the issue.
> >>>
> >>> John DeSantis
> >>>
> >>> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <pvanderm...@fsu.edu>:
> >>>>
> >>>> All,
> >>>>
> >>>> We are preparing for a switch from our current job scheduler to
> >>>> Slurm, and I am running into a strange issue. I compiled OpenMPI
> >>>> with Slurm support, and when I start a job with sbatch and use
> >>>> mpirun, everything works fine. However, when I use srun instead of
> >>>> mpirun and the job does not fit on a single node, I either receive
> >>>> the following OpenMPI warning a number of times:
> >>>> --------------------------------------------------------------------------
> >>>> WARNING: Missing locality information required for sm initialization.
> >>>> Continuing without shared memory support.
> >>>> --------------------------------------------------------------------------
> >>>> or a segmentation fault in an OpenMPI library (address not mapped),
> >>>> or both.
> >>>>
> >>>> I only observe this with MPI programs compiled with OpenMPI and run
> >>>> by srun when the job does not fit on a single node. The same program
> >>>> started by OpenMPI's mpirun runs fine. The same source compiled with
> >>>> MVAPICH2 works fine with srun.
> >>>>
> >>>> Some version info:
> >>>> slurm 14.11.7
> >>>> openmpi 1.8.5
> >>>> hwloc 1.10.1 (used for both slurm and openmpi)
> >>>> OS: RHEL 7.1
> >>>>
> >>>> Has anyone seen that warning before, and what would be a good place
> >>>> to start troubleshooting?
> >>>>
> >>>>
> >>>> Thank you,
> >>>> Paul
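For reference, the launch patterns discussed in the thread can be summarized as below. This is a sketch, not a runnable test: it assumes a Slurm cluster with OpenMPI's Slurm support, and `./mympiprog` is the placeholder binary name from Paul's emails.

```shell
# Case 1: fix node count and tasks per node -- reported to work
# on every combination tried:
salloc -N 3 --ntasks-per-node 3 srun ./mympiprog

# Case 2: request only a total task count, letting Slurm place the
# tasks -- triggered the "Missing locality information" warning and
# occasional segfaults with OpenMPI 1.8.5 when the job spanned nodes:
salloc -n 9 srun ./mympiprog

# Workaround reported in the thread: launch via mpirun inside the
# allocation instead of srun (worked in all cases):
salloc -n 9 mpirun ./mympiprog
```

Per the final update, upgrading from OpenMPI 1.8.5 to 1.8.6 (with Slurm 14.11.7) made Case 2 work with srun as well.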