Hello Ralph,

A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have done the trick. Since we only recently started preparing for a switch to Slurm, I can't confirm whether this problem already existed in 1.8.4. Our Slurm version is 14.11.7.
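For anyone hitting the same issue, a quick way to confirm which versions are actually in use and whether OpenMPI was built with Slurm support (standard diagnostic commands; exact output wording varies by release):

```shell
# Confirm the runtime versions in use
mpirun --version      # should report Open MPI 1.8.6
srun --version        # should report slurm 14.11.7

# Check that Open MPI was built with Slurm support:
# the slurm components should appear in the MCA listing
ompi_info | grep -i slurm
```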
Best,
Paul

On 06/18/2015 10:45 AM, Ralph Castain wrote:
>
> Please let us know - FWIW, we aren't seeing any such reports on the OMPI
> mailing lists, and we run our test harness against Slurm (and other RMs)
> every night.
>
> Also, please tell us what version of Slurm you are using. We do sometimes see
> regressions against newer versions as they appear, and that may be the case
> here.
>
>
>> On Jun 18, 2015, at 7:32 AM, Paul van der Mark <pvanderm...@fsu.edu> wrote:
>>
>>
>> Hello John,
>>
>> We tried a number of combinations of flags; some work and some don't:
>> 1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
>> 2. salloc -n 9 srun ./mympiprog
>> (test cluster with 8 cores per node)
>>
>> Case 1: works flawlessly (for every combination).
>> Case 2: works sometimes; warnings in some cases, segmentation faults in
>> some cases (for example -n 10) in opal_memory_ptmalloc2_int_malloc.
>>
>> mpirun instead of srun works all the time.
>>
>> We are going to look into OpenMPI 1.8.6 now. We would like to have -n X
>> work, since that is what most of our users use anyway.
>>
>> Best,
>> Paul
>>
>>
>> On 06/05/2015 08:19 AM, John Desantis wrote:
>>>
>>> Paul,
>>>
>>> How are you invoking srun with the application in question?
>>>
>>> It seems strange that the messages would only manifest when the job runs
>>> on more than one node. Have you tried passing the flags "-N" and
>>> "--ntasks-per-node" for testing? What about using "-w hostfile"?
>>> Those would be the options I'd immediately try to begin
>>> troubleshooting the issue.
>>>
>>> John DeSantis
>>>
>>> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <pvanderm...@fsu.edu>:
>>>>
>>>> All,
>>>>
>>>> We are preparing for a switch from our current job scheduler to Slurm,
>>>> and I am running into a strange issue. I compiled OpenMPI with Slurm
>>>> support, and when I start a job with sbatch and use mpirun, everything
>>>> works fine.
>>>> However, when I use srun instead of mpirun and the job does
>>>> not fit on a single node, I either receive the following OpenMPI warning
>>>> a number of times:
>>>> --------------------------------------------------------------------------
>>>> WARNING: Missing locality information required for sm initialization.
>>>> Continuing without shared memory support.
>>>> --------------------------------------------------------------------------
>>>> or a segmentation fault in an OpenMPI library (address not mapped), or
>>>> both.
>>>>
>>>> I only observe this with MPI programs compiled with OpenMPI and run by
>>>> srun when the job does not fit on a single node. The same program
>>>> started by OpenMPI's mpirun runs fine. The same source compiled with
>>>> MVAPICH2 works fine with srun.
>>>>
>>>> Some version info:
>>>> slurm 14.11.7
>>>> openmpi 1.8.5
>>>> hwloc 1.10.1 (used for both slurm and openmpi)
>>>> OS: RHEL 7.1
>>>>
>>>> Has anyone seen that warning before, and what would be a good place to
>>>> start troubleshooting?
>>>>
>>>>
>>>> Thank you,
>>>> Paul
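The setup described in the thread can be sketched as a minimal batch script (the program name mympiprog is the thread's own placeholder; the task count is chosen so the job spans the 8-core nodes mentioned above):

```shell
#!/bin/bash
# Hypothetical Slurm batch script reproducing the two launch modes
# discussed in the thread.
#SBATCH --job-name=ompi-srun-test
#SBATCH --ntasks=10            # deliberately does not fit on one 8-core node
#SBATCH --time=00:05:00

# Launch via Open MPI's mpirun: reported to work in all cases
mpirun ./mympiprog

# Launch via srun directly: triggers the sm locality warning and/or
# segfault with OpenMPI 1.8.5 when the job spans multiple nodes
srun ./mympiprog
```

Submitting with `sbatch` and toggling between the two launch lines reproduces the working and failing cases.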