Hi Paul,
On 17/06/2015 16:38, Wiegand, Paul wrote:
[...]
$ salloc -n 1
salloc: Granted job allocation 192
$ ulimit -l
unlimited
$ mpirun ./simple
[evc5:19184] *** Process received signal ***
[evc5:19184] Signal: Segmentation fault (11)
[evc5:19184] Signal code: Address not mapped (1)
[evc5:19184] Failing at address: 0x30
[evc5:19184] [ 0] /lib64/libpthread.so.0(+0xf130)[0x2b401b32c130]
[evc5:19184] [ 1] /apps/openmpi/openmpi-1.8.3-ic-2015-slurm-14.11/lib/openmpi/mca_btl_openib.so(+0x1fdd8)[0x2b4020735dd8]
The stack trace ends inside the openib BTL, which makes me think the
problem lies in the Open MPI <-> IB integration rather than in
Slurm <-> Open MPI.
Did you check the permissions of your IB devices in /dev?
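
A quick way to look at those permissions on the compute node, as a
sketch only: it assumes the device nodes sit under /dev/infiniband (the
usual place on Linux), and `list_ib_devices` is just a helper name made
up here.

```shell
# List the IB device nodes with their owner, group and mode.
# Assumption: they live under /dev/infiniband unless another
# directory is passed as the first argument.
list_ib_devices() {
    dir="${1:-/dev/infiniband}"
    if [ -d "$dir" ]; then
        ls -l "$dir"
    else
        echo "no $dir directory"
    fi
}

list_ib_devices
```

Run it on evc5 as the same user that launches the job, and compare
with a node where the program works.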
It could work without problems with `mpirun -host` because MCA-related
environment variables may be set in your module and not propagated by
mpirun through SSH, whereas Slurm propagates basically everything.
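
To see whether your environment carries such variables, you could list
everything that starts with OMPI_MCA_, the prefix Open MPI uses for MCA
parameters set through the environment. A minimal sketch;
`show_mca_env` is a hypothetical helper name, not an Open MPI command.

```shell
# Print every MCA parameter currently set through the environment,
# one NAME=value pair per line, sorted by name.
show_mca_env() {
    env | grep '^OMPI_MCA_' | sort
}

show_mca_env
```

Running it once in the shell where you load the module and once inside
the salloc session should show whether the two environments differ.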
You can also check whether it is related to IB by disabling it
explicitly in the Open MPI BTL framework through the mpirun parameters.
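
For example, something along these lines, where the leading "^" tells
Open MPI to use every available BTL except the ones listed. The wrapper
name `run_without_openib` is made up for illustration; the mpirun
options themselves are the standard `--mca btl` selection syntax.

```shell
# Launch a program with the openib BTL excluded.
# '^openib' is quoted so the shell never tries to interpret the "^".
run_without_openib() {
    mpirun --mca btl '^openib' "$@"
}
```

Inside the salloc session, `run_without_openib ./simple` would then run
your test program without openib. If the segfault goes away, that
points at the IB side rather than at Slurm.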
rémi