On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico <mdidomeni...@gmail.com> wrote:
> when i try to run an openmpi job with >128 ranks (16 ranks per node)
> using alltoall or alltoallv, i'm getting an error that the process was
> unable to get a queue pair.
>
> i've checked the max locked memory settings across my machines:
>
> ulimit -l inside and outside of mpirun reports unlimited everywhere
> pam modules checked to ensure pam_limits.so is loaded and working
> /etc/security/limits.conf sets soft/hard memlock to unlimited
>
> i tried a couple of quick mpi config settings i could think of:
>
> -mca mtl ^psm            no effect
> -mca btl_openib_flags 1  no effect
>
> the openmpi faq says to tweak some mtt values in /sys, but since i'm
> not on mellanox that doesn't apply to me
>
> the machines are rhel 6.7, kernel 2.6.32-573.12.1 (with bundled ofed),
> running on qlogic single-port infiniband cards, psm is enabled
>
> other collectives seem to run okay; it seems to be only alltoall comms
> that fail, and only at scale
>
> i believe (but can't prove) that this worked at one point, but i can't
> recall when i last tested it, so it's reasonable to assume that some
> change to the system is preventing this.
>
> the question is, where should i start poking to find it?
bump?
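in case it helps anyone reproduce: this is roughly how i'm checking the locked-memory limit as the ranks themselves would see it (a sketch only; the node count and the mpirun line are placeholders for my actual launch, not the real job):

```shell
# probe the max locked memory limit from inside a fresh shell, the same
# way an mpi rank launched on that node would see it
probe='echo "$(hostname): max locked memory = $(ulimit -l)"'

# run it locally first
bash -c "$probe"

# then under mpirun, one rank per node (adjust -np to your node count);
# commented out here since it only makes sense on the cluster
# mpirun -np 8 --map-by node bash -c "$probe"
```

another cheap test (assuming i have the mca syntax right for this openmpi version) would be forcing everything over tcp with `-mca pml ob1 -mca btl tcp,self,sm` to see whether alltoall still dies when the fabric is out of the picture, though at 128+ ranks that will obviously be slow.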