Thanks Chris, these are useful. Our setup is like yours: we maintain separate gcc and Intel Composer build chains for compatibility. Right now I'm focused solely on the Intel Composer (ic) builds; gcc will follow at a later date ... then I get to start again with MVAPICH2 (yay!). Our users' most common applications rely on ic builds and OpenMPI, so I started there.
My build parameters are a lot like yours. The only one I wonder about is --without-scif. We have a mixture of Phi and non-Phi nodes, so I wasn't sure how to set this one; right now I take the default (which I gather includes SCIF). Do you have any insight into your choice here? For giggles I tried again just now, targeting various nodes (both with and without Phi), and the results are all the same (segfault).

We also use --disable-vt and --disable-pty-support on our OpenMPI build, but I don't think either of these would cause the problem I'm seeing. Any disagreement with that?

As to Uwe's suggestion about the PMI plugin: I've built a number of different ways, including with the PMI plugin. The libs are built and present, and I can run as root without setting resv-ports. When I build without PMI, I set resv-ports, and it still doesn't work. So I don't think PMI is the issue, but I appreciate the suggestion.

The fact that I can run as root, and that I can run without openib (that is, using --mca btl ^openib on the mpirun call), suggests to me that there's some kind of permissions / resource-access problem with the IB. But I can't understand why this would work fine outside of Slurm yet be a problem under Slurm.

Someone at SSERCA suggested setting PropagateResourceLimits=NONE in the slurm.conf file and opening up more than just the memlock limits in the /etc/sysconfig/slurm file. I did all that, but none of it solved anything.

I'm stumped.

Paul.

> On Jun 17, 2015, at 20:03, Christopher Samuel <[email protected]> wrote:
>
> On 18/06/15 00:38, Wiegand, Paul wrote:
>
>> We have just started experimenting with Slurm, and I'm having trouble
>> running OpenMPI jobs over Slurm.
> In case it helps, Slurm here is configured with:
>
> ./configure --prefix=/usr/local/slurm/${slurm_ver} \
>     --sysconfdir=/usr/local/slurm/etc
>
> Open-MPI (1.6.x) is configured with:
>
> ./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib \
>     --enable-static --enable-shared
>
> Our test build of 1.8.4 (using a different build strategy
> to separate out GCC and Intel builds, to avoid the annoying
> incompatibility of Fortran MOD files for our one user who
> ran into it) is configured with:
>
> ./configure --prefix=/usr/local/openmpi-${COMPILER}/${VERSION} --with-slurm \
>     --with-verbs --enable-static --enable-shared --without-scif \
>     --with-pmi=/usr/local/slurm/latest
>
> Note that /usr/local/slurm/latest is a symlink to whatever
> /usr/local/slurm/${slurm_ver} is the current version we're
> running (currently 14.03.11).
>
> You will need to fix up your resource-limit settings for
> maximum lockable memory too, but that shouldn't cause the
> issue you're seeing.
>
> Best of luck!
> Chris
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: [email protected]      Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci
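For anyone following the thread, the checks discussed above can be sketched as a short diagnostic script. This is a sketch under assumptions, not a fix: the test binary name (./hello_mpi) is illustrative, and the slurm.conf suggestion at the end is a commonly recommended alternative to PropagateResourceLimits=NONE, not something confirmed in this thread.

```shell
#!/bin/sh
# Diagnostic sketch for the IB-under-Slurm segfault discussed above.
# Assumptions (not from the thread): a test binary ./hello_mpi exists,
# and slurmd is the parent process of the launched tasks.

# 1. Check the lockable-memory limit in the current shell, then compare
#    against the same check inside a Slurm allocation:
#      srun --pty bash -c 'ulimit -l'
#    A small value under Slurm but "unlimited" outside points at limit
#    propagation, which the openib BTL needs for registering memory.
ulimit -l

# 2. If the default run segfaults but excluding the openib BTL works,
#    the fault is in IB access, not in the MPI build itself:
#      mpirun --mca btl ^openib ./hello_mpi   # TCP / shared-memory only
#      mpirun ./hello_mpi                     # default, includes openib

# 3. Slurm side: instead of PropagateResourceLimits=NONE, one common
#    approach is to propagate everything except memlock and raise it in
#    slurmd's own environment:
#      # slurm.conf:
#      PropagateResourceLimitsExcept=MEMLOCK
#      # /etc/sysconfig/slurm:
#      ulimit -l unlimited
```

The idea behind step 3 is that slurmd inherits its limits from its init environment, and child tasks inherit from slurmd, so raising memlock there and blocking propagation of the (possibly small) submitting shell's limit covers both paths.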
