Rémi,

This got me a bit farther, thanks.

> The stack trace stuck in BTL openib makes me think it's more related to Open
> MPI <-> IB integration than to Slurm <-> Open MPI.

I agree that it seems like an MPI/IB thing; however, I can run via SSH under Torque/Moab, so there's some difference here that I'm not understanding, I think.

> Did you check the permissions of your IB devices in /dev?

Good point. I believe these are correct. We're not having a problem with any other IB-based applications, including other MPI/IB stacks. But I checked, and they look right to me.

> It could work w/o problem using `mpirun -host` because MCA-related
> environment variables may be set in your module and not propagated by mpirun
> through SSH, where Slurm basically propagates everything.

I ran the following command both in my normal shell and in a shell obtained from salloc, then diffed the results (no difference). What else can I check?

    ompi_info --all | grep -i btl

> You can also check whether it is related to IB by disabling it explicitly in the Open
> MPI BTL framework in the parameters of mpirun.

This was a good idea. The following ran correctly in an salloc shell, which confirms that it's happening at the IB integration level:

    mpirun --mca btl ^openib ./simple

So the question is: why aren't the MCA parameters propagating? Or: what did I misconfigure so that they don't? Torque uses ssh when it deploys, and we've had no problems with any of our MPI setups via Torque. Is there some Slurm-ism that my Torque-shaped assumptions are keeping me from understanding?

Thanks,
Paul.

P.S. To Andy Reibs: Thanks for your suggestion. My current build does use PMI and explicitly points to the Slurm PMI. I tried your /etc/sysconfig/slurm suggestion, but no dice.
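For anyone following along, here is a sketch of the environment comparison I have in mind, plus a possible workaround. It assumes Open MPI's usual convention that any environment variable named `OMPI_MCA_<param>` is read as an MCA parameter; the file paths are hypothetical scratch locations. Since Slurm propagates the submitting environment to the job, forcing the parameter through `OMPI_MCA_btl` may sidestep whatever is eating the `mpirun` command-line setting:

```shell
# Capture MCA-related environment variables in the login shell
# (OMPI_MCA_* is how Open MPI reads MCA parameters from the environment):
env | grep '^OMPI_MCA_' | sort > /tmp/mca_login.txt

# Repeat inside a shell obtained from salloc (or `srun --pty bash`),
# writing to /tmp/mca_salloc.txt, then compare:
env | grep '^OMPI_MCA_' | sort > /tmp/mca_salloc.txt
diff /tmp/mca_login.txt /tmp/mca_salloc.txt

# Possible workaround: set the parameter in the environment so Slurm
# carries it to the remote ranks, equivalent to `mpirun --mca btl ^openib`:
export OMPI_MCA_btl='^openib'
echo "OMPI_MCA_btl=$OMPI_MCA_btl"
```

The same effect can also be made permanent in an MCA parameter file (e.g. `$HOME/.openmpi/mca-params.conf` with a line `btl = ^openib`), though that only masks the propagation problem rather than explaining it.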
