Good to hear! Thanks Ralph
On Mon, Jun 22, 2015 at 7:50 AM, Paul van der Mark <pvanderm...@fsu.edu> wrote:
>
> Hello Ralph,
>
> A quick update: our upgrade to OpenMPI 1.8.6 (from 1.8.5) seems to have
> done the trick. Since we only recently started to prepare for a switch
> to Slurm, I can't confirm whether this already existed in 1.8.4. Our
> Slurm version is 14.11.7.
>
> Best,
> Paul
>
>
> On 06/18/2015 10:45 AM, Ralph Castain wrote:
> >
> > Please let us know - FWIW, we aren't seeing any such reports on the OMPI
> > mailing lists, and we run our test harness against Slurm (and other RMs)
> > every night.
> >
> > Also, please tell us what version of Slurm you are using. We do
> > sometimes see regressions against newer versions as they appear, and
> > that may be the case here.
> >
> >
> >> On Jun 18, 2015, at 7:32 AM, Paul van der Mark <pvanderm...@fsu.edu> wrote:
> >>
> >>
> >> Hello John,
> >>
> >> We tried a number of combinations of flags; some work and some don't:
> >> 1. salloc -N 3 --ntasks-per-node 3 srun ./mympiprog
> >> 2. salloc -n 9 srun ./mympiprog
> >> (test cluster with 8 cores per node)
> >>
> >> Case 1: works flawlessly (for every combination).
> >> Case 2: works sometimes; warnings in some cases, segmentation faults
> >> in some cases (for example -n 10) in opal_memory_ptmalloc2_int_malloc.
> >>
> >> mpirun instead of srun works all the time.
> >>
> >> We are going to look into OpenMPI 1.8.6 now. We would like to have
> >> -n X work, since that is what most of our users use anyway.
> >>
> >> Best,
> >> Paul
> >>
> >>
> >> On 06/05/2015 08:19 AM, John Desantis wrote:
> >>>
> >>> Paul,
> >>>
> >>> How are you invoking srun with the application in question?
> >>>
> >>> It seems strange that the messages would manifest only when the job
> >>> runs on more than one node. Have you tried passing the flags "-N" and
> >>> "--ntasks-per-node" for testing? What about using "-w hostfile"?
> >>> Those would be the options that I'd immediately try to begin
> >>> troubleshooting the issue.
> >>>
> >>> John DeSantis
> >>>
> >>> 2015-06-02 14:19 GMT-04:00 Paul van der Mark <pvanderm...@fsu.edu>:
> >>>>
> >>>> All,
> >>>>
> >>>> We are preparing for a switch from our current job scheduler to
> >>>> Slurm, and I am running into a strange issue. I compiled OpenMPI
> >>>> with Slurm support, and when I start a job with sbatch and use
> >>>> mpirun, everything works fine. However, when I use srun instead of
> >>>> mpirun and the job does not fit on a single node, I either receive
> >>>> the following OpenMPI warning a number of times:
> >>>> --------------------------------------------------------------------------
> >>>> WARNING: Missing locality information required for sm initialization.
> >>>> Continuing without shared memory support.
> >>>> --------------------------------------------------------------------------
> >>>> or a segmentation fault in an OpenMPI library (address not mapped),
> >>>> or both.
> >>>>
> >>>> I only observe this with MPI programs compiled with OpenMPI and run
> >>>> by srun when the job does not fit on a single node. The same program
> >>>> started by OpenMPI's mpirun runs fine. The same source compiled with
> >>>> MVAPICH2 works fine with srun.
> >>>>
> >>>> Some version info:
> >>>> slurm 14.11.7
> >>>> openmpi 1.8.5
> >>>> hwloc 1.10.1 (used for both slurm and openmpi)
> >>>> OS: RHEL 7.1
> >>>>
> >>>> Has anyone seen that warning before, and what would be a good place
> >>>> to start troubleshooting?
> >>>>
> >>>>
> >>>> Thank you,
> >>>> Paul
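For reference, the launch patterns discussed in the thread can be summarized as below. This is a sketch, not a runnable test: it assumes a Slurm cluster with OpenMPI's Slurm support, and `./mympiprog` is the placeholder binary name from Paul's emails.

```shell
# Case 1: fix node count and tasks per node -- reported to work
# on every combination tried:
salloc -N 3 --ntasks-per-node 3 srun ./mympiprog

# Case 2: request only a total task count, letting Slurm place the
# tasks -- triggered the "Missing locality information" warning and
# occasional segfaults with OpenMPI 1.8.5 when the job spanned nodes:
salloc -n 9 srun ./mympiprog

# Workaround reported in the thread: launch via mpirun inside the
# allocation instead of srun (worked in all cases):
salloc -n 9 mpirun ./mympiprog
```

Per the final update, upgrading from OpenMPI 1.8.5 to 1.8.6 (with Slurm 14.11.7) made Case 2 work with srun as well.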