Hi Michael,

I may be missing some context, but if you are using the QLogic cards you will 
always want to use the PSM MTL (-mca pml cm -mca mtl psm) and not the openib BTL. 
As Tom suggested, confirm the limits are set up on every node: could it be that 
the alltoall is reaching a node that the other collectives are not? Please share 
the command line and the error message.
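
One way to confirm the limits on every node is to have each node report its own locked-memory limit from inside the job, since the limit a process sees under mpirun can differ from the one in an interactive shell. Below is a minimal sketch, not a definitive tool. The helper names format_limit and memlock_report are made up for illustration, and it assumes a Python interpreter is available on the compute nodes.

```python
import resource
import socket


def format_limit(value):
    """Render an rlimit value roughly the way `ulimit -l` describes it."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit reports bytes; ulimit -l reports 1024-byte units.
    return "%d KiB" % (value // 1024)


def memlock_report():
    """Return one line with this host's soft and hard locked-memory limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return "%s: soft=%s hard=%s" % (
        socket.gethostname(), format_limit(soft), format_limit(hard))


if __name__ == "__main__":
    print(memlock_report())
```

If this is saved under a hypothetical name such as memlock_check.py and launched through mpirun with one process per node, any host that does not print "unlimited" for both values would be the first place to look.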

Thanks, 

_MAC

>> Begin forwarded message:
>> 
>> From: Michael Di Domenico <mdidomeni...@gmail.com>
>> Subject: Re: [OMPI users] locked memory and queue pairs
>> Date: March 16, 2016 at 11:32:01 AM EDT
>> To: Open MPI Users <us...@open-mpi.org>
>> Reply-To: Open MPI Users <us...@open-mpi.org>
>> 
>> On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico 
>> <mdidomeni...@gmail.com> wrote:
>>> when i try to run an openmpi job with >128 ranks (16 ranks per node) 
>>> using alltoall or alltoallv, i'm getting an error that the process 
>>> was unable to get a queue pair.
>>> 
>>> i've checked the max locked memory settings across my machines;
>>> 
>>> using ulimit -l in and outside of mpirun and they're all set to unlimited
>>> pam modules to ensure pam_limits.so is loaded and working
>>> the /etc/security/limits.conf is set for soft/hard mem to unlimited
>>> 
>>> i tried a couple of quick mpi config settings i could think of;
>>> 
>>> -mca mtl ^psm  (no effect)
>>> -mca btl_openib_flags 1  (no effect)
>>> 
>>> the openmpi faq says to tweak some mtt values in /sys, but since i'm 
>>> not on mellanox that doesn't apply to me
>>> 
>>> the machines are rhel 6.7, kernel 2.6.32-573.12.1 (with bundled 
>>> ofed), running on qlogic single-port infiniband cards, psm is 
>>> enabled
>>> 
>>> other collectives seem to run okay, it seems to only be alltoall 
>>> comms that fail and only at scale
>>> 
>>> i believe (but can't prove) that this worked at one point, but i 
>>> can't recall when i last tested it.  so it's reasonable to assume 
>>> that some change to the system is preventing this.
>>> 
>>> the question is, where should i start poking to find it?
>> 
>> bump?
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2016/03/28713.php
>
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to: 
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
