On Apr 29, 2009, at 4:45 PM, Barrett, Brian W wrote:
> If you think this sounds like a hassle, think about what it looks like
> from the point of view of the MPI implementer (or any other developer
> writing libraries which sit between user data and OFED, like GASNet).
If you don't care about what pain MPI implementors have to go through
(and you probably don't ;-) ) -- consider that this is a major
roadblock to most *anyone* who wants to write to user verbs.
<banging the same old drum>
I heard lots of variations of "Why isn't OFED more popular?" in Sonoma
this year. This is at least one big reason why: no normal
(non-superhuman) programmers can write verbs code (IMHO). MPIs *have* to
support OpenFabrics -- HPC customers demand it. But non-HPC customers
have a clear alternative: they'll just write sockets code. And the
price/performance for using sockets over IB/iWARP may or may not be
attractive depending on the customer's buying capacity. Hence -- they
just buy gigE (10gigE, when the price drops low enough).
Doesn't OpenFabrics want to grow beyond MPI? Woody said that verbs is
designed to support a billion different things -- outside of MPI and a
few storage protocols (none of which are widely adopted), how much is
OFED used?
</banging the same old drum>
> Jeff and I talked for a while today, and we're pretty sure that as long
> as the byte set by the kernel notifier is written before the pages are
> returned into the unallocated list, there isn't actually a race condition.
[snip]
> However, there's still the problem with the notifier concept of how the
> kernel reports which pages were given back to it. It has to pass a
> (potentially very large) amount of data back to the user, so the memory
> ownership issues with kernel/user space are interesting. It also has to
> somewhat atomically prepare the list and unset the notifier byte, which
> is also problematic. But probably workable.
I feel compelled to amend this: this notifier concept *may be
workable*, but it's still quite complex for the reasons Brian cited.
The goal here is to *reduce* complexity, especially for applications/
ULPs using the verbs stack.
If we put the registration cache in the network stack, application/ULP
complexity will be reduced significantly. My $0.02 is that using a
notifier solution is still fairly complex and introduces a new set of
problems.
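
To make the moving parts concrete, here is a toy model of the notifier
scheme discussed above, in Python for brevity. It is only a sketch under
stated assumptions: the class, its methods, and its fields are all
hypothetical names, the "kernel" side is simulated in the same process,
and nothing here corresponds to a real OFED or kernel interface. It only
shows the ordering (publish the page list and set the byte before the
pages are reused) and the user-side check on every cache lookup.

```python
class RegistrationCache:
    """Toy user-space registration cache guarded by a notifier byte.

    All names are hypothetical; this models the idea, not a real API.
    """

    def __init__(self):
        self.notifier = 0        # stands in for the byte the kernel would set
        self.invalidated = []    # stands in for the kernel-provided page list
        self.cache = {}          # page number -> fake registration handle
        self.next_handle = 1

    # "Kernel" side (simulated).
    def kernel_release_pages(self, pages):
        # Ordering matters: publish the list and set the byte *before*
        # the pages would go back on the unallocated list.
        self.invalidated.extend(pages)
        self.notifier = 1

    # User side.
    def _drain(self):
        # In a real design, taking the list and clearing the byte would
        # have to be atomic with respect to the kernel; here it is
        # single-threaded, so a plain swap is enough.
        pages, self.invalidated = self.invalidated, []
        self.notifier = 0
        for page in pages:
            self.cache.pop(page, None)   # drop the stale registration

    def lookup(self, page):
        # Check the notifier before trusting any cached registration.
        if self.notifier:
            self._drain()
        if page not in self.cache:
            # Stand-in for an actual (expensive) memory registration.
            self.cache[page] = self.next_handle
            self.next_handle += 1
        return self.cache[page]


if __name__ == "__main__":
    rc = RegistrationCache()
    first = rc.lookup(7)
    rc.kernel_release_pages([7])
    second = rc.lookup(7)
    print("re-registered after release:", first != second)
```

Even in this toy form, the user side has to check the byte on every
lookup, and the list hand-off is where the atomicity and memory
ownership questions show up.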
FWIW: Putting the registration cache in the userspace verbs stack
means that verbs will now have to do the horrid malloc/mmap/etc.
intercept tricks that MPI implementations currently do. Take it from
us -- this is not a business you want to be in. Such intercepts
break tools like valgrind and other memory-checking debuggers. Even
the best intercept hooks available today can still be subverted. Open
MPI (and MX!) has to insert a pre-main hook to set up these intercepts,
and then check later to ensure that no one else subverted our hooks.
Yuck.
It's memory management. And that belongs in the kernel.
--
Jeff Squyres
Cisco Systems