Hello Ralph,

A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
done the trick. Since we only recently started preparing for a switch
to Slurm, I can't confirm whether this issue already existed in 1.8.4.
Our Slurm version is 14.11.7.

Best,
Paul


On 06/18/2015 10:45 AM, Ralph Castain wrote:
> 
> Please let us know - FWIW, we aren’t seeing any such reports on the OMPI 
> mailing lists, and we run our test harness against Slurm (and other RMs) 
> every night.
> 
> Also, please tell us what version of Slurm you are using. We do sometimes see 
> regressions against newer versions as they appear, and that may be the case 
> here.
> 
> 
>> On Jun 18, 2015, at 7:32 AM, Paul van der Mark <pvanderm...@fsu.edu> wrote:
>>
>>
>> Hello John,
>>
>> We tried a number of combinations of flags; some work and some don't.
>> 1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
>> 2. salloc -n 9 srun ./mympiprog
>> (test cluster with 8 cores per node)
>>
>> Case 1: works flawlessly (for every combination)
>> Case 2: works sometimes; in other cases we get the warnings, and in some
>> cases (for example -n 10) a segmentation fault in
>> opal_memory_ptmalloc2_int_malloc.
>>
>> Using mpirun instead of srun works every time.
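>>
>> For illustration (a sketch only; our actual wrapper scripts vary), the
>> working equivalent of case 2 is simply:
>>
>> salloc -n 9 mpirun ./mympiprog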
>>
>> We are going to look into OpenMPI 1.8.6 now. We would like -n X to
>> work, since that is what most of our users use anyway.
>>
>> Best,
>> Paul
>>
>>
>>
>>
>> On 06/05/2015 08:19 AM, John Desantis wrote:
>>>
>>> Paul,
>>>
>>> How are you invoking srun with the application in question?
>>>
>>> It seems strange that the messages only manifest when the job runs
>>> on more than one node.  Have you tried passing the flags "-N" and
>>> "--ntasks-per-node" for testing?  What about using "-w hostfile"?
>>> Those are the options I'd try first when troubleshooting the issue.
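>>>
>>> For example (illustrative invocations only; the binary name and the
>>> hostfile path are placeholders, substitute your own):
>>>
>>> srun -N 2 --ntasks-per-node=4 ./a.out
>>> srun -w ./my_hostfile ./a.out
>>>
>>> That should show whether the problem follows the task distribution or
>>> the specific nodes involved.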
>>>
>>> John DeSantis
>>>
>>> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <pvanderm...@fsu.edu>:
>>>>
>>>> All,
>>>>
>>>> We are preparing for a switch from our current job scheduler to Slurm,
>>>> and I am running into a strange issue. I compiled OpenMPI with Slurm
>>>> support, and when I start a job with sbatch and use mpirun, everything
>>>> works fine. However, when I use srun instead of mpirun and the job does
>>>> not fit on a single node, I either receive the following OpenMPI warning
>>>> a number of times:
>>>> --------------------------------------------------------------------------
>>>> WARNING: Missing locality information required for sm initialization.
>>>> Continuing without shared memory support.
>>>> --------------------------------------------------------------------------
>>>> or a segmentation fault in an OpenMPI library (address not mapped), or
>>>> both.
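>>>>
>>>> For reference, a minimal sketch of the kind of job script involved
>>>> (the program name is just a placeholder):
>>>>
>>>> #!/bin/bash
>>>> #SBATCH -N 2
>>>> #SBATCH --ntasks-per-node=8
>>>> mpirun ./a.out    # works fine
>>>> srun ./a.out      # prints the warning above and/or segfaults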
>>>>
>>>> I only observe this with MPI programs compiled with OpenMPI and run by
>>>> srun when the job does not fit on a single node. The same program
>>>> started by OpenMPI's mpirun runs fine. The same source compiled with
>>>> MVAPICH2 works fine with srun.
>>>>
>>>> Some version info:
>>>> slurm 14.11.7
>>>> openmpi 1.8.5
>>>> hwloc 1.10.1 (used for both slurm and openmpi)
>>>> os: RHEL 7.1
>>>>
>>>> Has anyone seen that warning before and what would be a good place to
>>>> start troubleshooting?
>>>>
>>>>
>>>> Thank you,
>>>> Paul
