I'm trying to figure out what the limit is on the number of pending nonblocking operations, as it does not seem to be specified anywhere. I apologize if this is better suited to the user list, but this seemed like information more likely to be available on the dev list.
As part of a toy assignment involving multiplying triangular square matrices, one of the solutions being compared sends each row and column individually. On matrices of 100 and 1000 rows the program works fine. With 5000 rows, however, the results depend on how the processes are placed:

- 8 processes: works when spread across 4 or 2 nodes, but not on a single node.
- 4 processes: works on 2 nodes, but not on 1.
- 2 processes: fails on 1 node.

The failure appears to be that some number of receives (at least 2500) never complete, so an MPI_Waitany never returns. No errors are returned by the MPI_Isend calls, the MPI_Irecv calls, or the MPI_Waitany.

Since the program works on multiple nodes but not on one, it seems reasonable to believe that the problem is too many nonblocking operations in progress: there are around 18000 pending operations in total at once if all the processes run on one node.

The standard says the following, but I can't find a definition of what Open MPI considers pathological, and I would appreciate pointers on where to find this:

"If the call causes some system resource to be exhausted, then it will fail and return an error code. Quality implementations of MPI should ensure that this happens only in 'pathological' cases. That is, an MPI implementation should be able to support a large number of pending nonblocking operations."

I've attached the output of ompi_info --all in case it is of any use.

Sincerely,
Christian Csar
Attachment: ompi_info.gz (GNU Zip compressed data)