when i try to run an openmpi job with >128 ranks (16 ranks per node)
using MPI_Alltoall or MPI_Alltoallv, i get an error that the process was
unable to get a queue pair.

i've checked the max locked memory settings across my machines:

ulimit -l both inside and outside of mpirun reports unlimited on every node
the pam configuration has pam_limits.so loaded and working
/etc/security/limits.conf sets soft/hard memlock to unlimited
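for reference, the ulimit check i ran was roughly this (the host names
are placeholders for my actual machines):

```shell
# locked-memory limit in the current shell
ulimit -l

# the same check on every node, launched through mpirun itself so it
# sees the limits the mpi processes actually get (commented out here
# since it needs the cluster; node01,node02 are placeholders):
# mpirun --host node01,node02 -npernode 1 bash -c 'echo "$(hostname): $(ulimit -l)"'
```

running it through mpirun matters because a limit set in an interactive
shell doesn't always propagate to processes started by a remote daemon.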

i tried a couple of quick mpi config settings i could think of:

-mca mtl ^psm            no effect
-mca btl_openib_flags 1  no effect
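spelled out as full command lines, those two attempts looked something
like this (rank counts, hostfile path, and application name are just
placeholders from my setup):

```shell
# sketch of the full invocations; shown commented out since they need
# the actual cluster to run
# mpirun -np 256 -npernode 16 -hostfile ./hosts -mca mtl ^psm ./my_app
# mpirun -np 256 -npernode 16 -hostfile ./hosts -mca btl_openib_flags 1 ./my_app
```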

the openmpi faq says to tweak some mtt values in /sys, but since i'm
not on mellanox that doesn't apply to me

the machines are rhel 6.7, kernel 2.6.32-573.12.1 (with bundled ofed),
running qlogic single-port infiniband cards; psm is enabled

other collectives seem to run okay; it seems to be only the alltoall
comms that fail, and only at scale
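to help isolate it, a minimal reproducer along these lines might be
useful (the file name and rank counts are mine, not from the failing
application; it assumes mpicc/mpirun come from the same open mpi install
that shows the failure):

```shell
# write out a minimal MPI_Alltoall test program
cat > a2a.c <<'EOF'
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* one int to and from every rank -- the smallest alltoall that
       still forces a connection to every peer */
    int *sendbuf = calloc(size, sizeof(int));
    int *recvbuf = calloc(size, sizeof(int));
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
EOF

# compile only if an mpi toolchain is on the path
if command -v mpicc >/dev/null; then mpicc a2a.c -o a2a; fi

# then scale it up until it breaks, e.g.:
# mpirun -np 256 -npernode 16 ./a2a
```

since connection counts grow with the number of peers, stepping the rank
count up until the reproducer fails would at least pin down the threshold.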

i believe (but can't prove) that this worked at one point, but i can't
recall when i last tested it.  so it's reasonable to assume that some
change to the system is preventing this.

the question is, where should i start poking to find it?
