All -

On one of our clusters, we're seeing the following error from one of our applications, which I believe is using Open MPI 1.4.3:

[xxx:27545] *** An error occurred in MPI_Scatterv
[xxx:27545] *** on communicator MPI COMMUNICATOR 5 DUP FROM 4
[xxx:27545] *** MPI_ERR_OTHER: known error not in list
[xxx:27545] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[xxx][[31806,1],0][connect/btl_openib_connect_oob.c:857:qp_create_one] error creating qp errno says Resource temporarily unavailable
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 27545 on
node rs1891 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

The problem goes away if we modify the eager protocol msg sizes so that there are only two QPs necessary instead of the default 4. Is there a way to bump up the number of QPs that can be created on a node, assuming the issue is just running out of available QPs? If not, any other thoughts on working around the problem?

Thanks,

Brian

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories