I didn't go into the code to see who is actually calling this error message, but I suspect it may be a generic "out of memory" kind of error and not specific to the queue pair. To confirm, please add -mca pml_base_verbose 100 and -mca mtl_base_verbose 100 to see what is being selected.
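For reference, a sketch of the suggested debug invocation (the rank count and benchmark binary are taken from the run described below in the thread; this assumes mpirun is on the PATH of the cluster in question):

```shell
# The two verbose MCA parameters make Open MPI log which PML and MTL
# components are considered and which one is finally selected.
mpirun -n 512 \
       -mca pml_base_verbose 100 \
       -mca mtl_base_verbose 100 \
       ./IMB-MPI1 alltoallv
```

Grepping the resulting output for "select" should show whether the cm/psm path or the ob1/openib path is actually in use.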
I'm trying to remember some details of IMB and alltoallv to see if it is indeed requiring more resources than the other micro benchmarks. BTW, did you confirm the limits setup? Also, do the nodes all have the same amount of memory?

_MAC

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Michael Di Domenico
Sent: Wednesday, March 16, 2016 1:25 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] locked memory and queue pairs

On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A
<matias.a.cab...@intel.com> wrote:
> Hi Michael,
>
> I may be missing some context. If you are using the qlogic cards you will
> always want to use the psm mtl (-mca pml cm -mca mtl psm) and not the
> openib btl. As Tom suggests, confirm the limits are set up on every node:
> could it be the alltoall is reaching a node that "others" are not? Please
> share the command line and the error message.

Yes, under normal circumstances I use PSM. I only disabled it to see if it effected any kind of change.

The test I'm running is

mpirun -n 512 ./IMB-MPI1 alltoallv

When the system gets to 128 ranks, it freezes and errors out with

---
A process failed to create a queue pair. This usually means either the
device has run out of queue pairs (too many connections) or there are
insufficient resources available to allocate a queue pair (out of memory).
The latter can happen if either 1) insufficient memory is available, or
2) no more physical memory can be registered with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:      node001
Local device:    qib0
Queue pair type: Reliable connected (RC)
---

I've also tried various nodes across the cluster (200+). I think I ruled out errant switch (qlogic single 12800-120) problems, bad cables, and bad nodes.
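One way to see why alltoallv is harder on resources than the other micro benchmarks: with the openib BTL and reliable-connected (RC) queue pairs, an alltoallv across all ranks forces a fully connected mesh, so queue pair demand grows with the total rank count. A rough sketch of the arithmetic, plus the locked-memory check the FAQ link refers to (the ranks-per-node value is a made-up placeholder, and the commented-out per-node loop assumes passwordless ssh and a hypothetical hostfile):

```shell
#!/bin/sh
# Back-of-envelope QP demand for an RC-connected alltoallv.
# N comes from the failing run in the thread; PPN is a placeholder.
N=512      # total MPI ranks
PPN=16     # hypothetical ranks per node
QPS_PER_HCA=$(( PPN * (N - 1) ))   # each local rank connects to every other rank
echo "approx queue pairs needed per HCA: $QPS_PER_HCA"

# Verify the locked-memory limit for the MPI user; it should report
# "unlimited" (or a very large value) on every node, not just the head node.
ulimit -l
# for h in $(cat hostfile); do ssh "$h" 'hostname; ulimit -l'; done
```

If that per-HCA figure approaches the device's supported QP count, switching to shared receive queues or the PSM path (which does not use RC queue pairs) avoids the exhaustion.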
That's not to say they may not be present; I've just not been able to find them.

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/03/28717.php