On Apr 30, 2009, at 6:01 PM, Woodruff, Robert J wrote:

> To me, all this sounds like a lot of whining....
> Why can't the OS fix all my problems.

Absolutely not. As Brian stated, we have cited some real-world problems that we cannot fix (and we have tried many, many different workarounds over the past few years to fix them).

It sounds like your main objection to fixing them is "it's too much work." :-(

There are applications at both Sandia and Los Alamos that cause problems for our linker tricks, leading to such things as (proven) silent data corruption.


There are other apps that have also been reported over the years. C++ apps with their own allocators are especially problematic. Abaqus had to change its memory allocation model several years ago to work around these issues. These memory models also break valgrind, purify, and other memory-checking debuggers.

> Have you tried these applications with any MPI other than OpenMPI? i.e., does this corruption happen with Intel MPI and other MPIs as well?


We have been trying to say that this is a general problem for which there is currently no guaranteed fix. There's always a way to break the MPI workarounds for verbs' broken memory management model because there's no way to guarantee the memory allocation hooks.

There are two main reasons to fix these issues:

1. Business: to attract network programmers to verbs (and therefore to attract applications and therefore increase market share), it has to be simpler and within reach of today's commodity sockets-level programmers. Forcing them to have registration caches and to do memory allocation hooking significantly raises the bar. To date, this has been shunned by all network programmers except HPC and a handful of storage protocols.

2. Technical: if OFED says "to get good performance with verbs, you have to do malloc/mmap/etc. hooks and have a registration cache," this unnecessarily *significantly* raises the education and code-complexity barrier to entry for verbs programmers. It's also unscalable -- if this is something you *have* to do for good performance, why doesn't the network stack do it? It seems weird that you would effectively force all ULPs/MPIs/applications to implement the same functionality. The memory allocation hooking model also fails if more than one verbs-based middleware is used in the same application (because only one will be able to use the memory hooks per process).

Here's a story that encompasses both reasons:

We had Open MPI *not* use the registration cache by default for a long time because of the danger it posed to applications. Users could activate the registration cache with a simple command line parameter. But nobody would do that -- they wanted to run with top performance right out of the box (which is not unreasonable). It also led to OMPI's competitors -- ahem, *YOU* at Sonoma 2009 (!) -- citing "look, Open MPI's performance is bad! Our MPI's performance is GREAT!" Open MPI therefore was forced to change its defaults in the 1.3 series to activate the [dangerous] memory registration cache by default.

You mentioned that doing this stuff is a choice; the choice that MPIs/ULPs/applications therefore have is:

- don't use registration caches/memory allocation hooking, and have terrible performance
- use registration caches/memory allocation hooking, and have good performance

Which is no choice at all. If customers pay top dollar for these networks, they want to see benchmarks run out of the box that show that they're getting every flop/byte-per-second that they can. The fact that the programming model is needlessly complicated (and dangerous) to get that performance is something that the MPIs have tolerated because we had to, for competition's sake.

This is not something that non-HPC customers will accept.

> Of the solutions that have been presented so far,
> I think the kernel notifier approach would be a better solution.


Note that Jason G. said in this thread: "Notifiers are going to be very troublesome, every time any sort of synchronous to user space notifier has been proposed or implemented in the kernel it has been a disaster."

--
Jeff Squyres
Cisco Systems

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
