Pasha - Is there a way to tell which of the two happened, or to check the number of QPs available per node? The app likely does talk to a large number of peers from each process, and the nodes are fairly "fat" - quad-socket, quad-core, running 16 MPI ranks per node.
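If I have the accounting right for the first case: at the default of four QPs per openib connection, 16 ranks per node each holding connections to P peers works out to roughly 16 * 4 * P QPs charged against the HCA on that node, which adds up quickly on a fat node like this.

As for inspecting the limit itself, ibv_devinfo -v prints max_qp among the device attributes, and the same number can be read programmatically. A minimal sketch using the standard libibverbs calls (it only looks at the first device it finds; compile with -libverbs):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Print the max_qp device attribute for the first HCA found. */
int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }

    struct ibv_device_attr attr;
    if (ibv_query_device(ctx, &attr) == 0)
        printf("%s: max_qp = %d\n", ibv_get_device_name(devs[0]), attr.max_qp);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

Note that max_qp is the device's ceiling, not a count of QPs currently allocated, so it only tells you the limit you may be running into, not how close you are to it.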
Brian

On Jan 27, 2011, at 6:17 PM, Shamis, Pavel wrote:

> Unfortunately the verbose error reports are not so friendly... Anyway, I can
> think of two issues:
>
> 1. You are trying to open too many QPs. By default IB devices support a
> fairly large number of QPs and it is quite hard to push it into this corner,
> but if your job is really huge it may be the case - or, for example, if you
> share the compute nodes with other processes that create a lot of QPs. The
> maximum number of supported QPs is reported by ibv_devinfo.
>
> 2. The memory limit for registered memory is too low, so the driver fails to
> allocate and register memory for the QP. This scenario is the most common
> one; it just happened to me recently, when the system folks pushed some crap
> into limits.conf.
>
> Regards,
>
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
> On Jan 27, 2011, at 5:56 PM, Barrett, Brian W wrote:
>
>> All -
>>
>> On one of our clusters, we're seeing the following from one of our
>> applications, I believe using Open MPI 1.4.3:
>>
>> [xxx:27545] *** An error occurred in MPI_Scatterv
>> [xxx:27545] *** on communicator MPI COMMUNICATOR 5 DUP FROM 4
>> [xxx:27545] *** MPI_ERR_OTHER: known error not in list
>> [xxx:27545] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [xxx][[31806,1],0][connect/btl_openib_connect_oob.c:857:qp_create_one] error
>> creating qp errno says Resource temporarily unavailable
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 27545 on
>> node rs1891 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>>
>> The problem goes away if we modify the eager protocol message sizes so that
>> only two QPs are necessary instead of the default four. Is there a way to
>> bump up the number of QPs that can be created on a node, assuming the issue
>> is just running out of available QPs? If not, any other thoughts on working
>> around the problem?
>>
>> Thanks,
>>
>> Brian
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories
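P.S. For Pasha's second scenario, the quickest check is "ulimit -l" in the environment the ranks actually inherit, which is not always what a login shell shows if the daemons are started by the resource manager. A small sketch a rank could run itself to confirm what it really got - plain POSIX getrlimit, nothing Open MPI specific:

#include <stdio.h>
#include <sys/resource.h>

/* Print the soft memlock limit as seen by this process. */
int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_MEMLOCK (soft): unlimited\n");
    else
        printf("RLIMIT_MEMLOCK (soft): %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}

If that comes back as something small (32 KB or 64 KB) on the compute nodes rather than unlimited, the limits.conf theory is probably the right one.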