Please let us know - FWIW, we aren’t seeing any such reports on the OMPI 
mailing lists, and we run our test harness against Slurm (and other RMs) every 
night.

Also, please tell us what version of Slurm you are using. We do sometimes see 
regressions against newer versions as they appear, and that may be the case 
here.
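In the meantime, a quick sanity check worth running on your end (a sketch, assuming `srun` and `ompi_info` are on your PATH on the cluster) is to confirm that both sides were actually built with matching launch support:

```shell
# List the MPI/PMI plugin types this Slurm build offers to srun
srun --mpi=list

# Check whether this Open MPI build was configured with Slurm support
# (look for "plm: slurm" and "ras: slurm" components)
ompi_info | grep -i slurm
```

If `pmi2` shows up in the `--mpi=list` output and Open MPI was built with PMI support, explicitly selecting it (e.g. `srun --mpi=pmi2 ...`) is sometimes needed for direct-launch jobs; these commands are only a diagnostic starting point, not a confirmed fix for the warning you're seeing.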


> On Jun 18, 2015, at 7:32 AM, Paul van der Mark <pvanderm...@fsu.edu> wrote:
> 
> 
> Hello John,
> 
> We tried a number of combinations of flags; some work and some don't.
> 1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
> 2. salloc -n 9 srun ./mympiprog
> (test cluster with 8 cores per node)
> 
> Case 1: works flawlessly (for every combination)
> Case 2: works sometimes; in some cases we get warnings, and in some
> cases (for example -n 10) segmentation faults in
> opal_memory_ptmalloc2_int_malloc.
> 
> Using mpirun instead of srun works every time.
> 
> We are going to look into openmpi 1.8.6 now. We would like to have -n X
> work, since that is what most of our users use anyway.
> 
> Best,
> Paul
> 
> 
> 
> 
> On 06/05/2015 08:19 AM, John Desantis wrote:
>> 
>> Paul,
>> 
>> How are you invoking srun with the application in question?
>> 
>> It seems strange that the messages would only manifest when the job
>> runs on more than one node.  Have you tried passing the flags "-N" and
>> "--ntasks-per-node" for testing?  What about using "-w hostfile"?
>> Those would be the options I'd try first to begin troubleshooting the
>> issue.
>> 
>> John DeSantis
>> 
>> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <pvanderm...@fsu.edu>:
>>> 
>>> All,
>>> 
>>> We are preparing for a switch from our current job scheduler to slurm
>>> and I am running into a strange issue. I compiled openmpi with slurm
>>> support and when I start a job with sbatch and use mpirun everything
>>> works fine. However, when I use srun instead of mpirun and the job does
>>> not fit on a single node, I either receive the following openmpi warning
>>> a number of times:
>>> --------------------------------------------------------------------------
>>> WARNING: Missing locality information required for sm initialization.
>>> Continuing without shared memory support.
>>> --------------------------------------------------------------------------
>>> or a segmentation fault in an openmpi library (address not mapped) or
>>> both.
>>> 
>>> I only observe this with MPI programs compiled with openmpi and run by
>>> srun when the job does not fit on a single node. The same program
>>> started by openmpi's mpirun runs fine. The same source compiled with
>>> mvapich2 works fine with srun.
>>> 
>>> Some version info:
>>> slurm 14.11.7
>>> openmpi 1.8.5
>>> hwloc 1.10.1 (used for both slurm and openmpi)
>>> os: RHEL 7.1
>>> 
>>> Has anyone seen that warning before and what would be a good place to
>>> start troubleshooting?
>>> 
>>> 
>>> Thank you,
>>> Paul
