Pasha -

Is there a way to tell which of the two happened, or to check the number of QPs 
available per node?  The app likely does talk to a large number of peers from 
each process, and the nodes are fairly "fat" - quad-socket, quad-core, running 
16 MPI ranks per node.
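
For reference, the per-device limit is the max_qp field that "ibv_devinfo -v" 
prints; the same value can be read programmatically with ibv_query_device(). 
A minimal sketch, assuming libibverbs is installed (build with 
"gcc qp_limit.c -libverbs"):

    /* qp_limit.c - print the QP limit of the first IB device found */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no IB devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_device_attr attr;
        if (ctx && ibv_query_device(ctx, &attr) == 0)
            printf("%s: max_qp = %d\n",
                   ibv_get_device_name(devs[0]), attr.max_qp);

        if (ctx)
            ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }

That only shows the limit, though, not how many QPs are currently in use on 
the node.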

Brian

On Jan 27, 2011, at 6:17 PM, Shamis, Pavel wrote:

> Unfortunately the verbose error reports are not so friendly... Anyway, I can 
> think of two possible issues:
> 
> 1. You are trying to open too many QPs. By default IB devices support a 
> fairly large number of QPs and it is quite hard to hit this limit, but if 
> your job is really huge it may be the case - or, for example, if you share 
> the compute nodes with other processes that create a lot of QPs. You can see 
> the maximum number of supported QPs in ibv_devinfo.
> 
> 2. The limit on registered memory is too low, so the driver fails to allocate 
> and register memory for the QP. This scenario is the most common one - it 
> just happened to me recently when the system folks pushed some crap into 
> limits.conf.
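> 
> A quick way to check which limit the processes actually got (a minimal 
> sketch using getrlimit(); "ulimit -l" in the job's shell reports the same 
> RLIMIT_MEMLOCK value):
> 
>     /* memlock.c - print the locked-memory limit this process was given */
>     #include <stdio.h>
>     #include <sys/resource.h>
> 
>     int main(void)
>     {
>         struct rlimit rl;
>         if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
>             perror("getrlimit");
>             return 1;
>         }
>         if (rl.rlim_cur == RLIM_INFINITY)
>             printf("RLIMIT_MEMLOCK: unlimited\n");
>         else
>             printf("RLIMIT_MEMLOCK: %llu bytes\n",
>                    (unsigned long long) rl.rlim_cur);
>         return 0;
>     }
> 
> Note that it should be run the same way the MPI processes are started; 
> daemons launched outside a login shell often do not see limits.conf.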
> 
> Regards,
> 
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> 
> 
> 
> 
> 
> 
> On Jan 27, 2011, at 5:56 PM, Barrett, Brian W wrote:
> 
>> All -
>> 
>> On one of our clusters, we're seeing the following error from one of our 
>> applications, I believe using Open MPI 1.4.3:
>> 
>> [xxx:27545] *** An error occurred in MPI_Scatterv
>> [xxx:27545] *** on communicator MPI COMMUNICATOR 5 DUP FROM 4
>> [xxx:27545] *** MPI_ERR_OTHER: known error not in list
>> [xxx:27545] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [xxx][[31806,1],0][connect/btl_openib_connect_oob.c:857:qp_create_one] error 
>> creating qp errno says Resource temporarily unavailable
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 27545 on
>> node rs1891 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> 
>> 
>> The problem goes away if we modify the eager protocol message sizes so that 
>> only two QPs are necessary instead of the default four.  Is there a way to 
>> bump up the number of QPs that can be created on a node, assuming the issue 
>> is just running out of available QPs?  If not, any other thoughts on 
>> working around the problem?
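>> 
>> For the archives, the change was along the lines of overriding the 
>> btl_openib_receive_queues MCA parameter so that only a per-peer queue and 
>> one shared receive queue are defined (the buffer counts below are 
>> illustrative, not the exact values we used):
>> 
>>     mpirun --mca btl_openib_receive_queues \
>>         "P,128,256,192,128:S,65536,256,128,32" ...
>> 
>> Each colon-separated entry defines one QP per peer connection, so trimming 
>> the list from four entries to two is what got us down to two QPs.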
>> 
>> Thanks,
>> 
>> Brian
>> 
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 

--
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories




