Thanks Chris, these are useful.  Our setup is like yours:  we maintain separate 
gcc and Intel Composer build chains for compatibility.  Right now, I'm solely 
focused on the Intel Composer (ic) builds, with gcc to follow at a later date ... 
then I get to start again with MVAPICH2 (yay!).  The applications our users run 
most commonly rely on ic builds and OpenMPI, so I started there.

My build parameters are a lot like yours.  The only one I wonder about is 
--without-scif.  We have a mixture of Phi and non-Phi nodes, so I wasn't sure 
how to set this one, and right now I take the default (which I gather includes 
SCIF).  Do you have any insight as to your choice on this?

For giggles I tried again just now, focusing on various nodes (both with Phi 
and without), and the results are all the same (segfault).

We also pass --disable-vt and --disable-pty-support on our OpenMPI build, but I 
don't think these would cause the problem I'm seeing.  Any disagreement with 
that?
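For completeness, our configure line is essentially yours plus those two flags; a sketch (prefix paths are illustrative, not our exact install locations):

```shell
# Our Open MPI configure, as described above; paths are placeholders:
./configure --prefix=/usr/local/openmpi-intel/${VERSION} \
    --with-slurm --with-verbs --with-pmi=/usr/local/slurm/latest \
    --enable-static --enable-shared \
    --disable-vt --disable-pty-support
```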

As to Uwe's suggestion about the PMI plugin, I've built it a number of different 
ways, including with the PMI plugin.  The libs are built and present, and I can 
run as root without setting the resv-ports.  When I build without PMI, I set 
the resv-ports, and it still doesn't work.  So I don't think PMI is the issue, 
but I appreciate the suggestion.
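To be clear about what I mean by "resv-ports", this is the slurm.conf MpiParams setting (the port range here is just an example, not our actual range):

```shell
# slurm.conf fragment enabling reserved ports for MPI:
MpiParams=ports=12000-12999

# Sanity check from inside a job step that ports were reserved:
srun -n 2 env | grep SLURM_STEP_RESV_PORTS
```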

The fact that I can run as root, and that I can run without openib (that is, 
using --mca btl ^openib on the mpirun call), suggests to me that there's some 
kind of permissions or resource-access problem reaching the IB hardware.  But I 
can't understand why this would work fine outside of slurm but be a problem 
under slurm.
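Concretely, the failing versus working invocations look like this (binary name is a placeholder):

```shell
# Segfaults under slurm with the openib BTL enabled:
mpirun ./mpi_hello

# Works when the openib BTL is excluded (falls back to TCP/sm):
mpirun --mca btl ^openib ./mpi_hello
```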

Someone at SSERCA suggested setting PropagateResourceLimits=NONE in the 
slurm.conf file and opening up more than just memlock limits in the 
/etc/sysconfig/slurm file.  I did all of that, but it didn't solve anything.
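Roughly what I changed, in case anyone spots something off (the specific ulimit values here are illustrative, not our exact settings):

```shell
# slurm.conf:
PropagateResourceLimits=NONE

# /etc/sysconfig/slurm (sourced by the slurmd init script), raising
# more than just memlock; values below are illustrative:
ulimit -l unlimited
ulimit -s unlimited
ulimit -n 16384
```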

I'm stumped.

Paul.




> On Jun 17, 2015, at 20:03, Christopher Samuel <[email protected]> wrote:
> 
> 
> On 18/06/15 00:38, Wiegand, Paul wrote:
> 
>> We have just started experimenting with Slurm, and I'm having trouble
>> running OpenMPI jobs over Slurm.
> 
> In case it helps Slurm here is configured with:
> 
> ./configure --prefix=/usr/local/slurm/${slurm_ver} 
> --sysconfdir=/usr/local/slurm/etc
> 
> Open-MPI (1.6.x) is configured with:
> 
> ./configure --prefix=/usr/local/${BASE} --with-slurm --with-openib 
> --enable-static  --enable-shared
> 
> Our test build of 1.8.4 (using a different build strategy
> to separate out GCC and Intel builds to avoid the annoying
> incompatibility of Fortran MOD files for our one user who
> ran into it) is configured with:
> 
> configure --prefix=/usr/local/openmpi-${COMPILER}/${VERSION} --with-slurm 
> --with-verbs --enable-static  --enable-shared --without-scif 
> --with-pmi=/usr/local/slurm/latest
> 
> Note that /usr/local/slurm/latest is a symlink to whatever
> /usr/local/slurm/${slurm_ver} is the current version we're
> running (currently 14.03.11).
> 
> You will need to fix up your resource limit settings for
> maximum lockable memory too, but that shouldn't cause the
> issue you're seeing.
> 
> Best of luck!
> Chris
> -- 
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: [email protected] Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci
