In osu_bw, process 0 pumps lots of Isends at process 1, and process 1
in turn posts lots of matching Irecvs, so many messages are in flight at once.
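
For concreteness, the pattern looks roughly like this (not the actual
osu_bw source; the window and message sizes are made-up illustrative
values, with MSGSIZE meant to be large enough that ob1 uses the
rendezvous protocol):

    #include <mpi.h>
    #include <stdlib.h>

    #define WINDOW  64                /* messages in flight at once */
    #define MSGSIZE (1 << 20)         /* 1 MiB per message */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc((size_t)WINDOW * MSGSIZE);
        MPI_Request req[WINDOW];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {              /* sender: flood of Isends */
            for (int i = 0; i < WINDOW; i++)
                MPI_Isend(buf + (size_t)i * MSGSIZE, MSGSIZE, MPI_CHAR,
                          1, 100, MPI_COMM_WORLD, &req[i]);
        } else if (1 == rank) {       /* receiver: matching Irecvs */
            for (int i = 0; i < WINDOW; i++)
                MPI_Irecv(buf + (size_t)i * MSGSIZE, MSGSIZE, MPI_CHAR,
                          0, 100, MPI_COMM_WORLD, &req[i]);
        }
        if (rank < 2)
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        free(buf);
        return 0;
    }
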
The question is what happens when resources are exhausted and OMPI
cannot handle so much in-flight traffic. Let's specifically consider
the case of long, rendezvous messages. There are at least two situations.
1) When the sender no longer has any fragments (and cannot grow its free
list any further), it queues the send up with add_request_to_send_pending()
and somehow life is good: the PML seems to handle this case "correctly".
2) When the receiver -- specifically
mca_pml_ob1_recv_request_ack_send_btl() -- no longer has any fragments
with which to send the ACK confirming readiness for the rendezvous, the
resource-exhaustion error travels up the call stack to
mca_pml_ob1_recv_request_ack_send(), which does a
MCA_PML_OB1_ADD_ACK_TO_PENDING(). In short, the PML adds the ACK to
pckt_pending. Somehow, this code path doesn't work (see the sketch
after this list).
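
To make the "park it and retry later" pattern concrete, here is a toy,
self-contained model of it. This is NOT the ob1 source; every name in
it is invented for illustration (the real code tracks per-BTL state,
free-list growth, and much more). It only shows the shape of the
logic: fail to get a fragment, park the operation on a pending list,
and retry when a fragment comes back.

    #include <stdio.h>
    #include <stdbool.h>

    #define FRAG_POOL_SIZE 2          /* tiny on purpose, to force exhaustion */
    #define NUM_ACKS       5

    static int frags_free = FRAG_POOL_SIZE;  /* stand-in for the free list */
    static int pending[NUM_ACKS];            /* stand-in for pckt_pending */
    static int num_pending = 0;

    /* Stand-in for mca_pml_ob1_recv_request_ack_send_btl(): grab a
     * fragment and send the ACK, or report resource exhaustion. */
    static bool ack_send_btl(int ack_id)
    {
        if (0 == frags_free)
            return false;
        frags_free--;
        printf("sent ACK %d\n", ack_id);
        return true;
    }

    /* Stand-in for mca_pml_ob1_recv_request_ack_send(): on failure,
     * park the ACK (the MCA_PML_OB1_ADD_ACK_TO_PENDING() step). */
    static void ack_send(int ack_id)
    {
        if (!ack_send_btl(ack_id))
            pending[num_pending++] = ack_id;
    }

    /* The crucial other half: when a fragment is returned, the pending
     * list must be drained, or the parked ACKs -- and the rendezvous
     * sends waiting on them -- never make progress. */
    static void fragment_returned(void)
    {
        frags_free++;
        if (num_pending > 0)
            ack_send(pending[--num_pending]);
    }

    int main(void)
    {
        for (int i = 0; i < NUM_ACKS; i++)
            ack_send(i);              /* ACKs 2-4 get parked */
        while (num_pending > 0)
            fragment_returned();      /* retry path drains the queue */
        return 0;
    }

Situation 1 has the same shape, just with whole send requests parked
via add_request_to_send_pending() instead of ACKs.
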
The reason we see the problem now is that I added "autosizing" of the
shared-memory area. We used to mmap *WAY* too much shared memory for
small-np jobs. (Yes, that's a subjective statement.) Meanwhile, at
large-np, we didn't mmap enough and jobs wouldn't start. (Objective
statement there.) So, I added heuristics to size the shared area
"appropriately". The heuristics basically targetted the needs of
MPI_Init(). If you want fragment free lists to grow on demand after
MPI_Init(), you now basically have to bump mpool_sm_min_size up explicitly.
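
For example, something along these lines (the value here is just an
illustration):

    mpirun --mca mpool_sm_min_size 134217728 -np 2 ./osu_bw
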
I'd like feedback on a fix. Here are two options:
A) Someone (could be I) increases the default resources. E.g., we could
start with a larger eager free list. Or, I could change those
"heuristics" to allow some amount of headroom for free lists to grow on
demand. Either way, I'd appreciate feedback on how big to set these things.
B) Someone (not I, since I don't know how) fixes the ob1 PML to handle
scenario 2 above correctly.