Re: [O-MPI devel] [PATCH] ompi_info doesn't show use_mem_hooks flag
On Tue, Dec 06, 2005 at 11:07:44AM -0500, Brian Barrett wrote:
> On Dec 6, 2005, at 10:53 AM, Gleb Natapov wrote:
> > On Tue, Dec 06, 2005 at 08:33:32AM -0700, Tim S. Woodall wrote:
> > > > Also memfree hooks decrease cache efficiency; the better
> > > > solution would be to catch brk() system calls and remove memory
> > > > from the cache only then, but there is no way to do that for
> > > > now.
> > >
> > > We are looking at other options, including catching brk/munmap
> > > system calls, and will be experimenting with these on the trunk.
> > >
> > This will be really interesting. How are you going to catch
> > brk/munmap without kernel help? Last time I checked, preload tricks
> > don't work if the syscall is done from inside libc itself.
>
> All of the tricks we are looking at assume that nothing in libc calls
> munmap.

glibc does call mmap/munmap internally for big allocations, as an
strace of this program shows:

    #include <stdlib.h>

    int main(void)
    {
        void *p = malloc(1024 * 1024);
        free(p);
        return 0;
    }

> We can successfully catch free() calls from inside libc without any
> problems. The LAM/MPI team and Myricom (with MPICH-gm) have been
> doing this for many years without any problems. On the small
> percentage of MPI applications that require some linker tricks (some
> of the commercial apps are this way), we won't be able to intercept
> any free/munmap calls, so we're going to fall back to our RDMA
> pipeline algorithm.

Yes, but catching free is not good enough. That way we sometimes evict
cache entries that could safely remain in the cache. Ideally we should
be able to catch the events that return memory to the OS (munmap/brk)
and remove the memory from the cache only then.

--
			Gleb.
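For reference, the preload trick under discussion is an interposer
library loaded via LD_PRELOAD. Below is a minimal sketch; cache_evict
is a hypothetical stand-in for the real registration-cache callback
(build with "gcc -shared -fPIC shim.c -o shim.so -ldl"). Calls that
resolve through the PLT hit the interposer first; direct syscalls made
from inside libc bypass it entirely, which is exactly the limitation
Gleb describes:

    /* shim.c: intercept munmap() via LD_PRELOAD */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* hypothetical: tell the registration cache this range is gone */
    static void cache_evict(void *addr, size_t len)
    {
        fprintf(stderr, "evict [%p, +%zu)\n", addr, len);
    }

    int munmap(void *addr, size_t len)
    {
        static int (*real_munmap)(void *, size_t);

        if (real_munmap == NULL)
            real_munmap =
                (int (*)(void *, size_t)) dlsym(RTLD_NEXT, "munmap");

        cache_evict(addr, len);   /* before the mapping disappears */
        return real_munmap(addr, len);
    }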
Re: [O-MPI devel] [PATCH] ompi_info doesn't show use_mem_hooks flag
On Dec 7, 2005, at 9:44 AM, Gleb Natapov wrote:
> On Tue, Dec 06, 2005 at 11:07:44AM -0500, Brian Barrett wrote:
> > On Dec 6, 2005, at 10:53 AM, Gleb Natapov wrote:
> > > On Tue, Dec 06, 2005 at 08:33:32AM -0700, Tim S. Woodall wrote:
> > > > > Also memfree hooks decrease cache efficiency; the better
> > > > > solution would be to catch brk() system calls and remove
> > > > > memory from the cache only then, but there is no way to do
> > > > > that for now.
> > > >
> > > > We are looking at other options, including catching brk/munmap
> > > > system calls, and will be experimenting with these on the
> > > > trunk.
> > > >
> > > This will be really interesting. How are you going to catch
> > > brk/munmap without kernel help? Last time I checked, preload
> > > tricks don't work if the syscall is done from inside libc itself.
> >
> > All of the tricks we are looking at assume that nothing in libc
> > calls munmap.
>
> glibc does call mmap/munmap internally for big allocations, as an
> strace of this program shows:
>
>     int main(void) { void *p = malloc(1024 * 1024); free(p); }

Ah, yes, I wasn't clear. On Linux, we actually ship our own version of
ptmalloc2 (the allocator used by glibc on Linux). We use the standard
linker search-order tricks to have the linker choose our versions of
malloc, calloc, realloc, valloc, and free, which come from ptmalloc2.
We've modified our version of ptmalloc2 such that any time it calls
mmap or sbrk with a positive argument, it immediately lets the cache
know about the allocation. Any time it is about to call munmap or sbrk
with a negative argument, it informs the cache code before giving the
memory back to the OS.

We also catch mmap and munmap so that we can track when the user calls
mmap / munmap. Note that we arrange ptmalloc2's code such that it
calls our mmap (which either uses the syscall interface directly or
calls __mmap, depending on what the system supports), so we don't
intercept that call to mmap twice or anything like that.

This works pretty well (like I said, it's worked fine for LAM and
MPICH-gm for years), but it has the problem of requiring the user to
either use the wrapper compilers or add -lmpi -lorte -lopal to the
link line (i.e., they can't use shared library dependencies to load in
libopal.so); otherwise our ptmalloc2 / mmap / munmap isn't used. We
can detect that this has happened pretty easily, and then we fall back
to the pipelined RDMA code, which doesn't offer the same performance
but also doesn't have a pinning problem.

> > We can successfully catch free() calls from inside libc without
> > any problems. The LAM/MPI team and Myricom (with MPICH-gm) have
> > been doing this for many years without any problems. On the small
> > percentage of MPI applications that require some linker tricks
> > (some of the commercial apps are this way), we won't be able to
> > intercept any free/munmap calls, so we're going to fall back to
> > our RDMA pipeline algorithm.
>
> Yes, but catching free is not good enough. That way we sometimes
> evict cache entries that could safely remain in the cache. Ideally
> we should be able to catch the events that return memory to the OS
> (munmap/brk) and remove the memory from the cache only then.

This is essentially what we do on Linux - we only tell the rcache code
about allocations / deallocations when we are talking about getting
memory from, or giving memory back to, the operating system.

On Mac OS X / Darwin, due to its two-level namespaces, we can't
replace malloc / free with a customized version of the Darwin
allocator the way we could with ptmalloc2.
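In outline, the hooks Brian describes amount to routing the
allocator's OS-level calls through thin wrappers. The sketch below is
illustrative only: the hook names (mem_hook_alloc / mem_hook_release)
are hypothetical, the real code lives in Open MPI's modified ptmalloc2
and memory-hooks layer, and the sbrk bookkeeping is simplified
(single-threaded):

    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* hypothetical callbacks into the registration cache */
    void mem_hook_alloc(void *addr, size_t len);
    void mem_hook_release(void *addr, size_t len);

    /* what the modified allocator calls instead of mmap(2) */
    static void *allocator_mmap(void *addr, size_t len, int prot,
                                int flags, int fd, off_t off)
    {
        void *p = mmap(addr, len, prot, flags, fd, off);
        if (p != MAP_FAILED)
            mem_hook_alloc(p, len);   /* memory came from the OS */
        return p;
    }

    /* what it calls instead of munmap(2) */
    static int allocator_munmap(void *addr, size_t len)
    {
        mem_hook_release(addr, len);  /* inform cache BEFORE unmapping */
        return munmap(addr, len);
    }

    /* what it calls instead of sbrk(2) */
    static void *allocator_sbrk(intptr_t incr)
    {
        if (incr < 0) {
            /* heap shrinks: the top |incr| bytes go back to the OS */
            void *top = (char *) sbrk(0) + incr;
            mem_hook_release(top, (size_t) -incr);
        }
        void *p = sbrk(incr);
        if (incr > 0 && p != (void *) -1)
            mem_hook_alloc(p, (size_t) incr);
        return p;
    }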
There are some things you can do to simulate such behavior, but it
requires linking in a flat namespace and doing some other things that
nearly caused the Darwin engineers to pass out when I was talking to
them about said tricks. So instead, we use the Darwin hooks for
catching malloc / free / etc. It's not optimal, but it's the best we
can do in the situation. And it doesn't force us to link all OMPI
applications in a flat namespace, which is always nice. Of course, we
still intercept mmap / munmap in the traditional linker-tricks style.
But again, there are very few function calls in libSystem.dylib that
call mmap that we care about (malloc / free are already taken care of
by the standard hooks), so this doesn't cause a problem.

Hopefully this made some sense. If not, on to the next round of
e-mails :).

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/
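The Darwin hooks Brian mentions can be pictured as installing a
replacement function pointer in the default malloc zone (the zone
structure is declared in <malloc/malloc.h>). A minimal,
era-appropriate sketch, with cache_evict again a hypothetical
stand-in; note it fires on every free(), not only when memory is
returned to the OS, which is exactly the inefficiency discussed above
(newer versions of Mac OS X write-protect the zone structure, so treat
this as historical):

    #include <malloc/malloc.h>
    #include <stddef.h>

    static void (*real_zone_free)(malloc_zone_t *, void *);

    /* hypothetical: tell the registration cache about the free */
    static void cache_evict(void *addr)
    {
        (void) addr;
    }

    static void hooked_free(malloc_zone_t *zone, void *ptr)
    {
        if (ptr != NULL)
            cache_evict(ptr);
        real_zone_free(zone, ptr);   /* hand off to the real allocator */
    }

    void install_free_hook(void)
    {
        malloc_zone_t *zone = malloc_default_zone();
        real_zone_free = zone->free;
        zone->free = hooked_free;
    }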
[O-MPI devel] Fwd: (j3.2005) Re: Derived types according to MPI2
Begin forwarded message:

From: Aleksandar Donev
Date: November 21, 2005 9:30:18 AM MST
To: J3
Subject: (j3.2005) Re: Derived types according to MPI2

Hello,

Malcolm Cohen wrote:
> Which just goes to show that the authors of MPI2 didn't understand
> Fortran, since that is completely and utterly false in every sense
> that matters.
Yes, but the interesting thing is that neither Van nor I was aware of
what the standard actually allows in terms of derived types and the
storage for the components, and presumably we know Fortran better. Can
storage for the components be separated from the scalar derived type
itself? This probably makes no visible difference for scalars, but for
arrays it does. Again, I am asking what STORAGE_SIZE for derived types
should mean.

Dan Nagle wrote:
> Please be aware that the "external world" of the MPI standard is
> really the virtual machine of the C standard.
Yes, of course; I am certainly not proposing binding to hardware.

> When defining a programming language, the "needless abstraction"
I should have qualified that with "some needless abstractions". Of
course abstractions are good, especially when it does not matter to
the user how something is done, as long as it is done well. But if you
want to pass an array of derived types to a parallel IO routine that
is not compiled by your super-smart Fortran compiler that chooses to
scatter the components across virtual-address space (yes, I mean
virtual), then you do NOT want that abstraction. It is about choice.

Leave preaching to the preachers. Programming is a profession for a
reason---programmers are experienced and educated, understand the
issues, and don't need lectures on abstractions.

Aleks
[O-MPI devel] Fwd: (j3.2005) Re: Derived types according to MPI2
Begin forwarded message:

From: Bill Long
Date: November 21, 2005 11:03:46 AM MST
To: Malcolm Cohen
Cc: J3
Subject: (j3.2005) Re: Derived types according to MPI2
Reply-To: lo...@cray.com

Malcolm Cohen wrote:
> Aleksandar Donev said:
> > (like MPI standard).
> Gak. Just because MPI is a load of dingo's kidneys doesn't mean
> everyone else should make a horrible mess.

MPI is a bad idea that spun out of control. It is mainly useful as an
example of what to avoid. It certainly is diametrically opposed to the
programmer productivity goals being pushed by DARPA. One hopes that
the combination of Fortran 2008 and UPC will finally let users abandon
this archaic monstrosity.

Cheers,
Bill

--
Bill Long                                  lo...@cray.com
Fortran Technical Support  &               voice: 651-605-9024
Bioinformatics Software Development        fax:   651-605-9142
Cray Inc., 1340 Mendota Heights Rd., Mendota Heights, MN, 55120
[O-MPI devel] Fwd: (j3.2005) Re: Derived types according to MPI2
Begin forwarded message:

From: Malcolm Cohen
Date: November 21, 2005 11:23:59 AM MST
To: Aleksandar Donev
Cc: J3
Subject: (j3.2005) Re: Derived types according to MPI2

Aleksandar Donev said:
> Yes, but the interesting thing is that neither Van nor I was aware
> of what the standard actually allows in terms of derived types and
> the storage for the components, and presumably we know Fortran
> better.
One might have hoped so.

> Can storage for the components be separated from the scalar derived
> type itself?
Hey, when *I* am the Fortran processor there's no contiguous storage,
or for that matter addressable storage! Don't take too limited a view
of current "hard" ware.

> how something is done as long as it is done well. But if you want to
> pass an array of derived types to a parallel IO routine that is not
> compiled by your super-smart Fortran compiler that chooses to
> scatter the components across virtual-address space (yes, I mean
> virtual), then you do NOT want that abstraction.
You cannot be serious. You do realise that there is no requirement for
any array, even of intrinsic data types, to contain the "actual data".
Is that a problem in practice? No, of course not.

The Fortran standard doesn't mention virtual addressing, physical
addressing, or any of these things. Is that a problem? No. What the
standard should do (and usually does) is specify the behaviour of the
Fortran "virtual machine", i.e. the meaning of the program. How that
program gets mapped to hardware is way outside the scope of the
standard.

> It is about choice. Leave preaching to the preachers. Programming is
> a profession for a reason---programmers are experienced and educated
> and understand the issues and don't need lectures on abstractions.
Apparently not.

Cheers,
--
...Malcolm Cohen, NAG Ltd., Oxford, U.K. (malc...@nag.co.uk)