All -

On one of our clusters, one of our applications is failing with the error 
below. I believe it is running Open MPI 1.4.3:

[xxx:27545] *** An error occurred in MPI_Scatterv
[xxx:27545] *** on communicator MPI COMMUNICATOR 5 DUP FROM 4
[xxx:27545] *** MPI_ERR_OTHER: known error not in list
[xxx:27545] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[xxx][[31806,1],0][connect/btl_openib_connect_oob.c:857:qp_create_one] error creating qp errno says Resource temporarily unavailable
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27545 on
node rs1891 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


The problem goes away if we modify the eager protocol message sizes so that 
only two queue pairs (QPs) are necessary instead of the default 4.  Is there a 
way to bump up the number of QPs that can be created on a node, assuming the 
issue is just running out of available QPs?  If not, any other thoughts on 
working around the problem?
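
For a rough sense of scale, here is a back-of-the-envelope sketch of why going 
from 4 queues to 2 could matter. It assumes one QP per queue per connected 
remote peer, a fully connected job, and on-node peers that do not use openib. 
The queue strings are written in the style of the btl_openib_receive_queues 
MCA parameter, but the specific values, the helper names, and the 16 x 1024 
job shape are illustrative assumptions, not numbers taken from our cluster:

```python
# Rough estimate of QP demand on one node (a sketch, not a measurement).
# Assumptions: one QP per queue spec per remote peer, every process talks to
# every off-node process, and on-node peers do not consume openib QPs.

def count_queues(receive_queues: str) -> int:
    """Count the colon-separated queue specs in a receive_queues-style string."""
    return len([spec for spec in receive_queues.split(":") if spec.strip()])


def worst_case_qps_per_node(procs_per_node: int, total_procs: int,
                            queues_per_peer: int) -> int:
    """Upper bound on QPs opened by all processes on a single node."""
    remote_peers = total_procs - procs_per_node
    return procs_per_node * remote_peers * queues_per_peer


# Illustrative queue specs (hypothetical values): four queues vs. two queues.
FOUR_QUEUES = ("P,128,256,192,128:S,2048,256,128,32:"
               "S,12288,256,128,32:S,65536,256,128,32")
TWO_QUEUES = "P,128,256,192,128:S,65536,256,128,32"

if __name__ == "__main__":
    # Hypothetical job shape: 16 processes per node, 1024 processes in total.
    for label, spec in (("4 queues", FOUR_QUEUES), ("2 queues", TWO_QUEUES)):
        n = count_queues(spec)
        print(label, worst_case_qps_per_node(16, 1024, n))
```

If that model is roughly right, halving the number of queues halves the 
worst-case QP count per node, which would be consistent with what we observe.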

Thanks,

Brian

--
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories




