Brian,

As Pasha said:
The maximum number of supported QPs is reported by ibv_devinfo.

However, you'll probably need "-v":

{hargrove@cvrsvc05 ~}$ ibv_devinfo | grep max_qp:
{hargrove@cvrsvc05 ~}$ ibv_devinfo -v | grep max_qp:
        max_qp:                         261056

If you really are running out of QPs due to the "fatness" of the node, then you should definitely look at enabling XRC, if your HCA and libibverbs version support it. ibv_devinfo can query the HCA capability:

{hargrove@cvrsvc05 ~}$ ibv_devinfo -v | grep port_cap_flags:
                        port_cap_flags:         0x02510868

and look for bit 0x00100000 (== 1<<20).
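
(For completeness, the same two values can be read programmatically through libibverbs. Below is a minimal sketch, assuming the first device and port 1, with the 1<<20 bit position taken from above rather than from a header. Note that in the sample output above, 0x02510868 does have the 0x00100000 bit set, so that port advertises the capability.)

/* qp_caps.c -- print max_qp and check the port capability bit.
 * Build with something like: gcc qp_caps.c -o qp_caps -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);    /* first HCA only */
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    struct ibv_device_attr dev_attr;
    if (ibv_query_device(ctx, &dev_attr) == 0)
        printf("max_qp: %d\n", dev_attr.max_qp);

    struct ibv_port_attr port_attr;
    if (ibv_query_port(ctx, 1, &port_attr) == 0)            /* port 1 assumed */
        printf("port_cap_flags: 0x%08x (bit 0x00100000 %s)\n",
               port_attr.port_cap_flags,
               (port_attr.port_cap_flags & (1u << 20)) ? "set" : "not set");

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}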

-Paul



On 1/27/2011 5:09 PM, Barrett, Brian W wrote:
Pasha -

Is there a way to tell which of the two happened, or to check the number of QPs
available per node? The app likely does talk to a large number of peers from each
process, and the nodes are fairly "fat": they are quad-socket, quad-core, running
16 MPI ranks per node.

Brian

On Jan 27, 2011, at 6:17 PM, Shamis, Pavel wrote:

Unfortunately, verbose error reports are not so friendly... anyway, I can think
of two possible issues:

1. You are trying to open too many QPs. By default, IB devices support a fairly
large number of QPs and it is quite hard to hit that limit, but it can happen if
your job is really huge, or if, for example, you share the compute nodes with
other processes that create a lot of QPs. The maximum number of supported QPs is
reported by ibv_devinfo.

2. The limit on registered (locked) memory is too low, and as a result the driver
fails to allocate and register memory for the QP. This scenario is the most
common; it just happened to me recently when the system folks pushed some crap
into limits.conf.
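
(As a quick check for issue 2, look at the locked-memory limit the job actually sees: "ulimit -l" from inside a job script works, or the small sketch below, which just prints RLIMIT_MEMLOCK. Run it under the resource manager, since limits.conf settings often differ between interactive logins and batch-launched processes.)

/* memlock_check.c -- print the locked-memory limit visible to this process. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) { perror("getrlimit"); return 1; }

    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_MEMLOCK: unlimited\n");
    else
        printf("RLIMIT_MEMLOCK: %llu bytes\n", (unsigned long long) rl.rlim_cur);
    return 0;
}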

Regards,

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Jan 27, 2011, at 5:56 PM, Barrett, Brian W wrote:

All -

On one of our clusters, we're seeing the following from one of our applications,
which I believe is using Open MPI 1.4.3:

[xxx:27545] *** An error occurred in MPI_Scatterv
[xxx:27545] *** on communicator MPI COMMUNICATOR 5 DUP FROM 4
[xxx:27545] *** MPI_ERR_OTHER: known error not in list
[xxx:27545] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[xxx][[31806,1],0][connect/btl_openib_connect_oob.c:857:qp_create_one] error 
creating qp errno says Resource temporarily unavailable
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27545 on
node rs1891 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


The problem goes away if we modify the eager protocol message sizes so that only
two QPs are necessary instead of the default four. Is there a way to bump up the
number of QPs that can be created on a node, assuming the issue is just running
out of available QPs? If not, any other thoughts on working around the problem?
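
(For context, the per-connection QP layout in the openib BTL is controlled by the
btl_openib_receive_queues MCA parameter; the spec below is illustrative only, not
our exact settings, and the build's default can be checked with ompi_info:)

$ ompi_info --param btl openib | grep receive_queues
$ mpirun --mca btl_openib_receive_queues P,128,256,192,128:S,65536,256,128,32 ...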

Thanks,

Brian

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories






--
   Brian W. Barrett
   Dept. 1423: Scalable System Software
   Sandia National Laboratories






--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
