On Apr 29, 2009, at 4:45 PM, Barrett, Brian W wrote:

If you think this sounds like a hassle, think about what it looks like from the point of view of the MPI implementer (or any other developer writing libraries that sit between user data and OFED, like GASNet).


If you don't care about what pain MPI implementers have to go through (and you probably don't ;-) ) -- consider that this is a major roadblock to almost *anyone* who wants to write to user verbs.

<banging the same old drum>

I heard lots of variations of "Why isn't OFED more popular?" in Sonoma this year. This is at least one big reason why: no normal (non-superhuman) programmer can write verbs code (IMHO). MPI implementations *have* to support OpenFabrics -- HPC customers demand it. But non-HPC customers have a clear alternative: they'll just write sockets code. And the price/performance for using sockets over IB/iWARP may or may not be attractive depending on the customer's buying capacity. Hence -- they just buy gigE (or 10gigE, when the price drops low enough).

Doesn't OpenFabrics want to grow beyond MPI? Woody said that verbs is designed to support a billion different things -- outside of MPI and a few storage protocols (none of which are widely adopted), how much is OFED used?

</banging the same old drum>

Jeff and I talked for a while today, and we're pretty sure that as long as the byte set by the kernel notifier is written before the pages are returned into the unallocated list, there isn't actually a race condition. [snip]

However, the notifier concept still has the problem of how the kernel tells the user which pages were given back to the kernel. It has to pass a (potentially very large) amount of data back to the user, so the memory ownership issues between kernel and user space are interesting. It also has to somewhat atomically prepare the list and unset the notifier byte, which is also problematic. But probably workable.
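
For concreteness, here is a rough single-threaded sketch in C of how a user-space registration cache might consume such a notifier. Everything in it is hypothetical: `struct notifier`, `fake_kernel_release()`, and the fixed-size range list are stand-ins for whatever the kernel interface would actually look like, and no such OFED API exists. It only illustrates the ordering (range recorded, then byte set, before the pages are reused) and deliberately sidesteps the atomicity and memory-ownership problems described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdatomic.h>

#define MAX_RANGES 16   /* hypothetical bound on the released-range list */
#define MAX_CACHED 16   /* hypothetical bound on cached registrations */

struct range { uintptr_t start; size_t len; };

/* Hypothetical stand-in for state shared between kernel and user space. */
struct notifier {
    atomic_uchar dirty;                  /* the "notifier byte" */
    struct range released[MAX_RANGES];   /* pages given back to the kernel */
    size_t nreleased;
};

/* User-space registration cache: one slot per cached registration. */
struct reg_cache {
    struct range entries[MAX_CACHED];
    int valid[MAX_CACHED];
};

int overlaps(struct range a, struct range b) {
    return a.start < b.start + b.len && b.start < a.start + a.len;
}

/* Simulated kernel side: record the range first, then set the byte.
 * Both must happen before the pages go back on the unallocated list. */
void fake_kernel_release(struct notifier *n, uintptr_t start, size_t len) {
    assert(n->nreleased < MAX_RANGES);
    n->released[n->nreleased++] = (struct range){ start, len };
    atomic_store(&n->dirty, 1);
}

void cache_insert(struct reg_cache *c, uintptr_t start, size_t len) {
    for (int i = 0; i < MAX_CACHED; i++) {
        if (!c->valid[i]) {
            c->entries[i] = (struct range){ start, len };
            c->valid[i] = 1;
            return;
        }
    }
    assert(!"registration cache full");
}

/* User side: returns 1 if a still-valid cached registration covers
 * [start, start+len). Checks the notifier byte before trusting the cache.
 * Single-threaded only: clearing the byte and draining the list are not
 * atomic with respect to a concurrent release in this sketch. */
int cache_lookup(struct reg_cache *c, struct notifier *n,
                 uintptr_t start, size_t len) {
    if (atomic_exchange(&n->dirty, 0)) {
        for (size_t r = 0; r < n->nreleased; r++)
            for (int i = 0; i < MAX_CACHED; i++)
                if (c->valid[i] && overlaps(c->entries[i], n->released[r]))
                    c->valid[i] = 0;     /* evict stale registration */
        n->nreleased = 0;
    }
    for (int i = 0; i < MAX_CACHED; i++) {
        struct range e = c->entries[i];
        if (c->valid[i] && e.start <= start &&
            start + len <= e.start + e.len)
            return 1;
    }
    return 0;
}
```

Note that the common case (byte not set) costs one atomic read-modify-write per lookup; the expensive walk over the released list only happens after the kernel has actually taken pages back.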



I feel compelled to amend this: this notifier concept *may be workable*, but it's still quite complex for the reasons Brian cited. The goal here is to *reduce* complexity, especially for applications/ULPs using the verbs stack.

If we put the registration cache in the network stack, application/ULP complexity will be reduced significantly. My $0.02 is that using a notifier solution is still fairly complex and introduces a new set of problems.

FWIW: Putting the registration cache in the userspace verbs stack means that verbs will now have to do the horrid malloc/mmap/etc. intercept tricks that MPI implementations currently do. Take it from us -- this is not a business you want to be in. Such intercepts break tools like valgrind and other memory-checking debuggers. Even the best intercept hooks available today can still be subverted. Open MPI (and MX!) have to insert a pre-main hook to set up these intercepts, and then check later to ensure that no one else subverted our hooks. Yuck.
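
To illustrate the "install pre-main, then check later" dance, here is a toy sketch in C. It is not Open MPI's actual code: `free_hook`, `hooked_free()`, and `hooks_intact()` are hypothetical names, the hook point is just a global function pointer, and the pre-main step assumes the GCC/Clang `constructor` attribute. Real intercepts have to interpose on the allocator symbols themselves, which is where most of the fragility comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical hook point: a global function pointer that anyone in the
 * process can overwrite, which is exactly the weakness being shown. */
typedef void (*free_hook_fn)(void *ptr);
free_hook_fn free_hook = NULL;

static int invalidations = 0;   /* counts cache invalidations (illustration) */

/* Our hook: a real one would drop any cached registration covering ptr. */
static void regcache_free_hook(void *ptr) {
    (void)ptr;
    invalidations++;
}

/* Pre-main installation (GCC/Clang extension, assumed available). */
__attribute__((constructor))
static void install_hooks(void) {
    assert(free_hook == NULL);   /* nobody got there before us */
    free_hook = regcache_free_hook;
}

/* Later sanity check: is our hook still the one installed? */
int hooks_intact(void) {
    return free_hook == regcache_free_hook;
}

/* Hypothetical free() wrapper that runs whatever hook is installed. */
void hooked_free(void *ptr) {
    if (free_hook)
        free_hook(ptr);
    free(ptr);
}

int invalidation_count(void) {
    return invalidations;
}
```

If any other library overwrites `free_hook` after startup, `hooks_intact()` starts returning 0 and frees silently stop invalidating the cache, so the check can only detect the problem, not prevent it.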

It's memory management.  And that belongs in the kernel.

--
Jeff Squyres
Cisco Systems

