Re: [OMPI devel] collectives / #1944 progress

Eugene Loh Wed, 1 Jul 2009 22:23:56 -0400

Jeff Squyres wrote:

It looks like Eugene's and George's fixes on coll sm resolve all the known hangs. We still have flow control issues, but that is temporarily being solved by the coll sync component. To be clear: running with coll_sync_barrier_before 1000 seems to resolve all known hangs, and we think that this is good enough for v1.3.3. We should CMR whatever is necessary to the v1.3 branch.
==> We should also default coll_sync_barrier_before to 1000 for v1.3.3 (i.e., ensure sync activates itself).
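
For readers unfamiliar with the workaround, here is a rough sketch of the idea behind a "barrier before every Nth collective" setting. It is not the coll sync source: sync_state, sync_before_op, and the barrier callback are hypothetical names, and the callback merely stands in for a real barrier on the communicator. The assumption is that a value of 1000 means one extra barrier is injected per 1000 non-synchronizing collectives, which bounds how far a fast sender can run ahead of slow receivers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; an illustration, not Open MPI code. */
typedef void (*barrier_fn)(void *ctx);

typedef struct {
    int barrier_before;   /* e.g. 1000; 0 disables the extra barriers */
    int ops_since_sync;   /* collectives issued since the last barrier */
    barrier_fn barrier;   /* stand-in for a barrier on the communicator */
    void *barrier_ctx;
} sync_state;

/* Call before each non-synchronizing collective (e.g. a bcast).
 * Returns 1 if a barrier was injected, 0 otherwise. */
static int sync_before_op(sync_state *s)
{
    if (s->barrier_before <= 0) {
        return 0;                       /* feature switched off */
    }
    if (++s->ops_since_sync < s->barrier_before) {
        return 0;                       /* not yet at the Nth operation */
    }
    s->ops_since_sync = 0;              /* restart the count */
    s->barrier(s->barrier_ctx);         /* let everyone catch up */
    return 1;
}

/* Test helper: counts how many barriers were injected. */
static void count_barrier(void *ctx)
{
    ++*(int *)ctx;
}
```

With barrier_before set to 1000, a loop of 2500 calls to sync_before_op would inject a barrier on the 1000th and 2000th calls and leave 500 operations counted toward the next one.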
For the future, we have a two-pronged plan:

I suspect the standard procedure is that we all look quickly at this e-mail message, file appropriately, and then resume our normal lives. Yes? Or, is such a plan put somehow into place?

1. Clean up the sm btl:
   1a. Remove all dead code.

What do you mean here? (Possibly you mean getting rid of sm pending sends if we implement 1b properly, but I'm not sure.)

   1b. Resize free_list_max and fifo_size MCA params to effect good enough flow control.
   1c. Possibly: convert from FIFOs to linked lists (for future maintenance purposes, not necessarily to fix problems).

Another idea is to have two kinds of FIFOs. One is just for returning fragments. The other is for in-coming message fragments. It would even seem as though one would no longer need "free lists", but just use the ack FIFO to manage fragments. (ALL of this is complicated by the fact that we have two kinds of fragments, eager and max, but fortunately those details can be pushed onto the sorry fool who'll be implementing all this. I wonder who that'll be.)
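
To make the two-FIFO idea a bit more concrete, here is a minimal single-process sketch. Everything in it is an assumption for illustration: frag, fifo, channel, NFRAGS, and the function names are hypothetical, there is no shared memory or locking, and only one fragment size is modeled (the eager/max split is ignored). The point it tries to show is that if a sender may only transmit using a fragment it pops from its ack FIFO, the fixed pool itself bounds the number of in-flight fragments, so no separate free list is needed.

```c
#include <assert.h>
#include <stddef.h>

#define NFRAGS 4   /* hypothetical pool size; bounds in-flight fragments */

typedef struct { int payload; } frag;

/* Ring buffer holding up to NFRAGS fragment pointers. */
typedef struct {
    frag *slot[NFRAGS];
    size_t head, count;
} fifo;

/* Returns 1 on success, 0 if the FIFO is full. */
static int fifo_push(fifo *f, frag *x)
{
    if (f->count == NFRAGS) return 0;
    f->slot[(f->head + f->count) % NFRAGS] = x;
    f->count++;
    return 1;
}

/* Returns the oldest entry, or NULL if the FIFO is empty. */
static frag *fifo_pop(fifo *f)
{
    if (f->count == 0) return NULL;
    frag *x = f->slot[f->head];
    f->head = (f->head + 1) % NFRAGS;
    f->count--;
    return x;
}

typedef struct {
    frag pool[NFRAGS];
    fifo incoming;  /* sender -> receiver: fragments carrying data */
    fifo ack;       /* receiver -> sender: fragments being returned */
} channel;

/* All fragments start out on the ack FIFO, i.e. available to the sender. */
static void channel_init(channel *c)
{
    c->incoming.head = c->incoming.count = 0;
    c->ack.head = c->ack.count = 0;
    for (size_t i = 0; i < NFRAGS; i++) fifo_push(&c->ack, &c->pool[i]);
}

/* Sender side: returns 0 when no fragment has come back yet (back-pressure). */
static int channel_send(channel *c, int payload)
{
    frag *x = fifo_pop(&c->ack);
    if (x == NULL) return 0;
    x->payload = payload;
    return fifo_push(&c->incoming, x);
}

/* Receiver side: returns 0 if nothing is pending; otherwise copies out the
 * payload and hands the fragment back to the sender via the ack FIFO. */
static int channel_recv(channel *c, int *payload)
{
    frag *x = fifo_pop(&c->incoming);
    if (x == NULL) return 0;
    *payload = x->payload;
    return fifo_push(&c->ack, x);
}
```

In this sketch, after NFRAGS sends with no receives, channel_send returns 0 until channel_recv returns a fragment, which is the flow control being asked for in 1b.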

2. Test, enable, and continue to develop the coll sm module. Using this module will avoid the p2p unexpected message queue explosion that we're seeing (at least for collectives with short messages). It nominally has broadcast, barrier, reduce, and allreduce implemented. We really only need to a) test the heck outta them, and b) add gather, scatter, scan, and exscan to the list. All the other collective operations have implicit synchronization and won't run into the unbounded unexpected queue issues. The bcast loop reproducer seemed to work fine for me with the coll sm, but it segv'ed immediately for Ralph. So clearly some work needs to be done.
We think that these two items should be the main features for 1.3.4.
