On Apr 30, 2009, at 6:01 PM, Woodruff, Robert J wrote:

> To me, all this sounds like a lot of whining....
> Why can't the OS fix all my problems.

Absolutely not. As Brian stated, we have cited some real-world problems that we cannot fix (and we have tried many, many different workarounds over the past few years to fix them).

It sounds like your main objection to fixing them is "it's too much work." :-(

There are applications at both Sandia and Los Alamos that cause problems for our linker tricks, leading to such things as (proven) silent data corruption.


There are other apps that have also been reported over the years. C++ apps with their own allocators are especially problematic. Abaqus had to change its memory allocation model several years ago to work around these issues. These memory models also break valgrind, purify, and other memory-checking debuggers.

> Have you tried these applications with any MPI other than OpenMPI? i.e., does this corruption happen with Intel MPI and other MPIs as well?


We have been trying to say that this is a general problem for which there is currently no guaranteed fix. There's always a way to break the MPI workarounds for verbs' broken memory management model because there's no way to guarantee the memory allocation hooks.

There are two main reasons to fix these issues:

1. Business: to attract network programmers to verbs (and therefore to attract applications and therefore increase market share), it has to be simpler and within reach of today's commodity sockets-level programmers. Forcing them to have registration caches and to do memory allocation hooking significantly raises the bar. To date, this has been shunned by all network programmers except HPC and a handful of storage protocols.

2. Technical: if OFED says "to get good performance with verbs, you have to do malloc/mmap/etc. hooks and have a registration cache," this unnecessarily *significantly* raises the education and code-complexity barrier to entry for verbs programmers. It's also unscalable -- if this is something you *have* to do for good performance, why doesn't the network stack do it? It seems weird that you would effectively force all ULPs/MPIs/applications to implement the same functionality. The memory allocation hooking model also fails if more than one verbs-based middleware is used in the same application (because only one will be able to use the memory hooks per process).

Here's a story that encompasses both reasons:

We had Open MPI *not* use the registration cache by default for a long time because of the danger it posed to applications. Users could activate the registration cache with a simple command line parameter. But nobody would do that -- they wanted to run with top performance right out of the box (which is not unreasonable). It also led to OMPI's competitors -- ahem, *YOU* at Sonoma 2009 (!) -- citing "look, Open MPI's performance is bad! Our MPI's performance is GREAT!" Open MPI therefore was forced to change its defaults in the 1.3 series to activate the [dangerous] memory registration cache by default.

You mentioned that doing this stuff is a choice; the choice that MPIs/ULPs/applications therefore have is:

- don't use registration caches/memory allocation hooking, and have terrible performance
- use registration caches/memory allocation hooking, and have good performance

Which is no choice at all. If customers pay top dollar for these networks, they want to see benchmarks run out of the box that show that they're getting every flop/byte-per-second that they can. The fact that the programming model is needlessly complicated (and dangerous) to get that performance is something that the MPIs have tolerated because we had to, for competition's sake.

This is not something that non-HPC customers will accept.

> Of the solutions that have been presented so far,
> I think the kernel notifier approach would be a better solution.


Note that Jason G. said in this thread: "Notifiers are going to be very troublesome, every time any sort of synchronous to user space notifier has been proposed or implemented in the kernel it has been a disaster."

--
Jeff Squyres
Cisco Systems

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
