I didn't go into the code to see what is actually emitting this error message, 
but I suspect it may be a generic "out of memory" kind of error and not 
specific to the queue pair. To confirm, please add  -mca pml_base_verbose 100  
and  -mca mtl_base_verbose 100  to see what is being selected.
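
For example (a sketch; rank count and binary path as in your test):

mpirun -n 512 -mca pml_base_verbose 100 -mca mtl_base_verbose 100 ./IMB-MPI1 alltoallv

The verbose output should show which pml/mtl components are opened and which 
one is finally selected, so we can tell whether the job is really running over 
PSM or falling back to openib.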

I'm trying to remember some details of IMB and alltoallv to see if it is 
indeed requiring more resources than the other micro-benchmarks.
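
As a rough back-of-envelope (my assumption about the transport, not something 
I checked in the code): alltoallv talks to every peer, so with a 
reliable-connected transport each rank needs a connection, and at least one 
QP, per peer. At 512 ranks that is ~511 QPs per rank, and with, say, 16 ranks 
per node something like 16 x 511 ≈ 8000 QPs per HCA. Benchmarks with sparser 
communication patterns never approach that, which would explain why only 
alltoallv trips the limit.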

BTW, did you confirm the limits setup? Also, do all the nodes have the same 
amount of memory?
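
Something like the following on each node (a sketch; the limits file location 
can vary by distro):

ulimit -l
# should report "unlimited" or a suitably large value

grep memlock /etc/security/limits.conf
# looking for entries along the lines of:
#   * soft memlock unlimited
#   * hard memlock unlimited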

_MAC


-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Michael Di Domenico
Sent: Wednesday, March 16, 2016 1:25 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] locked memory and queue pairs

On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A <matias.a.cab...@intel.com> 
wrote:
> Hi Michael,
>
> I may be missing some context: if you are using the QLogic cards you will 
> always want to use the psm mtl (-mca pml cm -mca mtl psm) and not the openib btl. 
> As Tom suggests, confirm the limits are set up on every node: could it be the 
> alltoall is reaching a node that the "others" are not? Please share the command 
> line and the error message.



Yes, under normal circumstances I use PSM. I only disabled it to see whether 
doing so changed anything.
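
For reference, the two variants being compared would look roughly like this (a 
sketch; the btl list is my assumption):

mpirun -n 512 -mca pml cm -mca mtl psm ./IMB-MPI1 alltoallv
mpirun -n 512 -mca pml ob1 -mca btl openib,self ./IMB-MPI1 alltoallv

The first forces PSM (the normal case on QLogic cards); the second forces the 
openib btl, which is the path that produces the queue-pair error below.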

The test I'm running is:

mpirun -n 512 ./IMB-MPI1 alltoallv

When the run gets to 128 ranks, it freezes and then errors out with:

---

A process failed to create a queue pair. This usually means either the device 
has run out of queue pairs (too many connections) or there are insufficient 
resources available to allocate a queue pair (out of memory). The latter can 
happen if either 1) insufficient memory is available, or 2) no more physical 
memory can be registered with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:             node001
Local device:           qib0
Queue pair type:        Reliable connected (RC)

---
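
One thing worth checking (a sketch; hostnames are placeholders): the memlock 
limit that matters is the one seen by the processes mpirun launches, and 
daemons started via ssh or a resource manager don't always inherit the 
interactive ulimit. Running a non-MPI command under mpirun shows what the 
remote ranks actually get:

mpirun -n 2 --host node001,node002 bash -c 'ulimit -l'

If that prints a small value (e.g. 64) instead of unlimited, QP creation can 
fail even though interactive logins on the same nodes look fine.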

I've also tried various nodes across the cluster (200+). I think I've ruled 
out errant switch problems (QLogic single 12800-120), bad cables, and bad 
nodes. That's not to say they may not be present; I've just not been able to 
find them.
