Re: [OMPI devel] trac 1857: SM btl hangs when msg >=4k

Eugene Loh Fri, 3 Apr 2009 18:47:10 -0400

What's next on this ticket? It's supposed to be a blocker. Again, the issue is that osu_bw deluges a receiver with rendezvous messages, but the receiver does not have enough eager frags to acknowledge them all. We see this now that the sizing of the mmap file has changed and there's less headroom to grow the free lists. Possible fixes are:

A) Just make the mmap file default size larger (though less overkill than we used to have).
B) Fix the PML code that is supposed to deal with cases like this. (At least I think the PML has code that's intended for this purpose.)


Eugene Loh wrote:

In osu_bw, process 0 pumps lots of Isends to process 1, and process 1 in turn sets up lots of matching Irecvs. Many messages are in flight. The question is what happens when resources are exhausted and OMPI cannot handle so much in-flight traffic. Let's specifically consider the case of long, rendezvous messages. There are at least two situations.
1) When the sender no longer has any fragments (nor can grow its free list any more), it queues a send up with add_request_to_send_pending() and somehow life is good. The PML seems to handle this case "correctly".
2) When the receiver -- specifically mca_pml_ob1_recv_request_ack_send_btl() -- no longer has any fragments to send ACKs back to confirm readiness for rendezvous, the resource-exhaustion signal travels up the call stack to mca_pml_ob1_recv_request_ack_send(), which does a MCA_PML_OB1_ADD_ACK_TO_PENDING(). In short, the PML adds the ACK to pckt_pending. Somehow, this code path doesn't work.
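Scenario 2 can be illustrated with a toy model. This is only a sketch, not the ob1 code: every name in it (toy_receiver, toy_ack, toy_frag_returned, toy_run) is hypothetical, and it assumes the intended design is that deferred ACKs are retried as fragments come back to a fixed-size free list. It shows one way such a path could fail: if nothing drains the pending queue, the deferred ACKs are never sent, which the sender would see as a hang. Whether that is the actual defect in the PML is not established here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in Open MPI. */
enum { MAX_PENDING = 64 };

typedef struct {
    int total_frags;          /* fixed size of the free list (it cannot grow) */
    int free_frags;           /* fragments currently available */
    int pending[MAX_PENDING]; /* FIFO of message ids whose ACK was deferred */
    size_t head, count;       /* ring-buffer state for the pending queue */
    int acks_sent;            /* ACKs that actually went out */
} toy_receiver;

static void toy_init(toy_receiver *r, int frags)
{
    r->total_frags = frags;
    r->free_frags = frags;
    r->head = 0;
    r->count = 0;
    r->acks_sent = 0;
}

/* Try to ACK a rendezvous message. Returns 1 if the ACK was sent now,
 * 0 if it was deferred to the pending queue, -1 if the queue is full. */
static int toy_ack(toy_receiver *r, int msg_id)
{
    if (r->free_frags > 0) {
        r->free_frags--;
        r->acks_sent++;
        return 1;
    }
    if (r->count == MAX_PENDING) {
        return -1;
    }
    r->pending[(r->head + r->count) % MAX_PENDING] = msg_id;
    r->count++;
    return 0;
}

/* A fragment returns to the free list. If retry_pending is nonzero, it is
 * reused immediately for the oldest deferred ACK (assumed intended design). */
static void toy_frag_returned(toy_receiver *r, int retry_pending)
{
    r->free_frags++;
    if (retry_pending && r->count > 0) {
        r->head = (r->head + 1) % MAX_PENDING;
        r->count--;
        r->free_frags--;
        r->acks_sent++;
    }
}

/* Post n_msgs rendezvous messages against `frags` fragments, then let every
 * in-flight fragment complete. Returns the number of ACKs never sent. */
int toy_run(int n_msgs, int frags, int retry_pending)
{
    toy_receiver r;
    toy_init(&r, frags);
    for (int i = 0; i < n_msgs; i++) {
        (void)toy_ack(&r, i);
    }
    while (r.free_frags < r.total_frags) {
        toy_frag_returned(&r, retry_pending);
    }
    return n_msgs - r.acks_sent;
}
```

Calling toy_run(10, 4, 1) models a receiver that retries deferred ACKs, and toy_run(10, 4, 0) models one that never does; in the second case the six deferred ACKs remain unsent even after all four fragments are free again.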
The reason we see the problem now is that I added "autosizing" of the shared-memory area. We used to mmap *WAY* too much shared memory for small-np jobs. (Yes, that's a subjective statement.) Meanwhile, at large-np, we didn't mmap enough and jobs wouldn't start. (Objective statement there.) So, I added heuristics to size the shared area "appropriately". The heuristics basically targeted the needs of MPI_Init(). If you want fragment free lists to grow on demand after MPI_Init(), you now basically have to bump mpool_sm_min_size up explicitly.
I'd like feedback on a fix. Here are two options:
A) Someone (could be I) increases the default resources. E.g., we could start with a larger eager free list. Or, I could change those "heuristics" to allow some amount of headroom for free lists to grow on demand. Either way, I'd appreciate feedback on how big to set these things.
B) Someone (not I, since I don't know how) fixes the ob1 PML to handle scenario 2 above correctly.
