On Apr 16, 2009, at 9:12 AM, Ralph Castain wrote:

Sounds fine, though note that we don't want ob1 itself to do this as
it inevitably adds overhead that translates into latency. Instead, we
want that functionality to be in a separate component for those people
who want to use it.


To drive this point home: in an MPI implementation, latency and bandwidth benchmarks are [unfortunately] king. There should be zero (not "close to zero") performance impact from such changes for those who do not want to use them. That's why all such work to date, including failover and retransmission, has been done in "cloned" ob1 components (note that retransmission implies a lot of tracking of pending requests that ob1 does not currently do -- the overhead for that is definitely going to be non-zero).
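
For anyone following along, here is a rough sketch of why a separate component keeps the default path untouched. It is only an illustration: every name in it (pml_module_t, plain_send, tracked_send, pml_select, MAX_PENDING) is made up for this mail and is not the actual PML interface. The implementation is picked once at startup, so the per-message path never checks an "is tracking enabled?" flag, and only the opt-in variant pays for recording pending requests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the Open MPI PML interface. */

#define MAX_PENDING 64

typedef struct pml_module {
    const char *name;
    /* Returns 0 on success, -1 on failure. */
    int (*send)(struct pml_module *self, const void *buf, size_t len);
    size_t bytes_sent;                 /* stand-in for handing data to the network */
    size_t pending_count;              /* used only by the tracking variant */
    const void *pending[MAX_PENDING];  /* used only by the tracking variant */
} pml_module_t;

/* Baseline path: no bookkeeping beyond the send itself. */
static int plain_send(pml_module_t *self, const void *buf, size_t len)
{
    (void)buf;
    self->bytes_sent += len;
    return 0;
}

/* Opt-in path: remembers each buffer so it could be retransmitted later. */
static int tracked_send(pml_module_t *self, const void *buf, size_t len)
{
    if (self->pending_count >= MAX_PENDING) {
        return -1;  /* tracking table full */
    }
    self->pending[self->pending_count++] = buf;
    self->bytes_sent += len;
    return 0;
}

/* Chosen once at startup; the per-message path never tests a feature flag. */
static int pml_select(pml_module_t *m, const char *name)
{
    assert(m != NULL && name != NULL);
    *m = (pml_module_t){0};
    if (strcmp(name, "plain") == 0) {
        m->name = "plain";
        m->send = plain_send;
        return 0;
    }
    if (strcmp(name, "tracked") == 0) {
        m->name = "tracked";
        m->send = tracked_send;
        return 0;
    }
    return -1;  /* unknown component name */
}
```

A caller would run pml_select(&m, "plain") once and then call m.send(&m, buf, len) per message. The catch is that each variant carries its own copy of the send logic, and those copies have to be kept in sync by hand.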

We did talk on a telecon earlier this week about the need to refactor
the PML so that all these various PML components don't have to keep
tracking what is done in ob1 - bit of a pain. Nothing has been done
yet, but hopefully at some point we'll address this issue.


Yes; talking to Sun is probably the next logical step to see a) the details of what Rolf has been doing, and b) if we can make a more general framework for these kinds of things without having to clone ob1 every time (this was the death of dr, for example -- dr hasn't been updated with all the new changes to ob1 over the past year or two; I already see Nysal making heroic efforts to keep csum up to date with ob1). It just seems like there should be a better way... although I don't know offhand what that is, because all the options we've talked about so far have added overhead :-\

--
Jeff Squyres
Cisco Systems
