Re: [OMPI users] SM btl slows down bandwidth?

Terry Dontje Sat, 16 Aug 2008 09:09:01 -0400

Date: Sat, 16 Aug 2008 08:18:47 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] SM btl slows down bandwidth?
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <1197bce6-a7e3-499e-8b05-b85f7598d...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Aug 15, 2008, at 3:32 PM, Gus Correa wrote:
> Just like Daniel and many others, I have seen some disappointing
> performance of MPI code on multicore machines,
> in code that scales fine in networked environments and single core
> CPUs,
> particularly in memory-intensive programs.
> The bad performance has been variously ascribed to memory
> bandwidth / contention,
> to setting processor and memory affinity versus letting the kernel
> scheduler do its thing,
> to poor performance of memcpy, and so on.
I'd suspect that all of these play a role -- not necessarily any single one of them.
- It is my belief (contrary to several kernel developers' beliefs) that explicitly setting processor affinity is a Good Thing for MPI applications. Not only does MPI have more knowledge than the OS about a parallel job spanning multiple processes, but each MPI process is also allocating resources that may be spatially / temporally relevant. For example, say that an MPI process allocates some memory during MPI_INIT in a NUMA system. This memory will likely be "near" in a NUMA sense. If the OS later decides to move that process, then the memory would be "far" in a NUMA sense. Similarly, OMPI decides what I/O resources to use during MPI_INIT -- and may specifically choose some "near" resources (and exclude "far" resources). If the OS moves the process after MPI_INIT, these "near" and "far" determinations could become stale/incorrect, and performance would go down the tubes.

I've been in the discussion above for many years on the same side as Jeff; however, I think that is more due to pragmatic reasoning than because MPI is the right level for binding processes. The Solaris kernel developers I've talked with believe the right way to do the above is for MPI or the runtime to give hints to the OS about the locality binding of processes and have the OS try to maintain that locality. The reason is that there might be other processes the OS is dealing with that MPI or its runtime do not know about. Having MPI or its runtime force binding really messes up an OS's ability to balance the workload on a system. Now mind you, on a machine with a small number of cores (<8) this probably isn't as big of an issue. But once you start dealing with large SMPs with 100s of cores, there is definitely a good chance that more than one MPI job is running on a machine.

However, until MPI and OS implementors come up with a way to pass such hints, it does become a necessity for MPI to do the binding, for the reasons Jeff supplies above. Note that Jeff, another member, and I have talked about such hints but have not come up with anything definitive.
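
To make "MPI doing the binding" concrete, here is a minimal sketch of what explicit binding amounts to on Linux. It is not how Open MPI implements it; it only uses Python's os.sched_getaffinity / os.sched_setaffinity, which are Linux-only, and the pin_to_core helper and its local_rank argument are hypothetical names made up for illustration. Note that it picks only from the cores the OS already allows, which is about as close as a process can get today to respecting the OS's view.

```python
import os

def pin_to_core(local_rank):
    """Bind the calling process to one core from its currently allowed set.

    local_rank is a hypothetical per-node process index (0, 1, 2, ...).
    Returns the id of the core that was chosen.
    """
    allowed = sorted(os.sched_getaffinity(0))   # cores the OS permits right now
    core = allowed[local_rank % len(allowed)]   # wrap if ranks outnumber cores
    os.sched_setaffinity(0, {core})             # pid 0 means "this process"
    return core

if __name__ == "__main__":
    before = sorted(os.sched_getaffinity(0))
    core = pin_to_core(local_rank=0)
    print("allowed before:", before)
    print("pinned to core:", core)
    print("allowed after: ", sorted(os.sched_getaffinity(0)))
```

Once the process is pinned, memory it touches afterwards tends to stay "near" in the NUMA sense, but the scheduler also loses the freedom to move it, which is exactly the trade-off described above.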

- Unoptimized memcpy implementations are definitely a factor, mainly for large message transfers through shared memory. Since most (all?) MPI implementations use some form of shared memory for on-host communication, memcpy can play a big part in their performance for large messages. Using hardware (such as IB HCAs) for on-host communication can effectively avoid unoptimized memcpy's, but then you're just shifting the problem to the hardware -- you're now dependent upon the hardware's DMA engine (which is *usually* pretty good). But then other issues can arise, such as the asynchronicity of the transfer, potentially causing collisions and/or extra memory bus traversals that might be avoided with memcpy (it depends on the topology inside your server -- e.g., if 2 processes are "far" from the IB HCA, then the transfer will have to traverse QPI/HT/whatever twice, whereas a memcpy would presumably stay local). As Ron pointed out in this thread, non-temporal memcpy's can be quite helpful for benchmarks that don't touch the resulting message at the receiver (because the non-temporal memcpy doesn't bother to take the time to load the cache).

In addition to the above, you may run into platform-specific memory architecture issues. For example, should the SM BTL be laying out the FIFOs in a specific way to get the best performance? The problem is that what is great for one platform may suck eggs for another.
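
For anyone who wants to see how much the copies alone cost on their own box, here is a rough sketch. It assumes the typical two-copy scheme (sender copies into a shared segment, receiver copies out of it) and simulates both sides in a single process, so it measures only the memcpy-style cost and none of the FIFO or synchronization overhead of a real SM BTL. The function name two_copy_transfer is hypothetical; the shared segment comes from Python's multiprocessing.shared_memory (Python 3.8+).

```python
import time
from multiprocessing import shared_memory

def two_copy_transfer(payload, repeats=5):
    """Copy payload into a shared segment and back out again.

    Returns (received_bytes, gigabytes_per_second) over all copies made.
    """
    nbytes = len(payload)
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    received = bytearray(nbytes)
    try:
        with memoryview(received) as dst:
            start = time.perf_counter()
            for _ in range(repeats):
                shm.buf[:nbytes] = payload      # "sender" copy-in
                dst[:] = shm.buf[:nbytes]       # "receiver" copy-out
            elapsed = time.perf_counter() - start
    finally:
        shm.close()     # drop this process's mapping
        shm.unlink()    # remove the segment itself
    copied = 2 * nbytes * repeats               # two copies per transfer
    return received, copied / max(elapsed, 1e-9) / 1e9

if __name__ == "__main__":
    for size in (64 * 1024, 1024 * 1024, 4 * 1024 * 1024):
        _, gbps = two_copy_transfer(bytes(size))
        print(f"{size:>8} bytes: {gbps:.2f} GB/s")
```

The numbers will vary a lot between platforms and message sizes, which is rather the point of the paragraph above; treat them as a sanity check, not a benchmark.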

- Using different compilers is a highly religious topic, and IMHO, may tend to be application specific. Compilers are large, complex software systems (just like MPI); different compiler authors have chosen to implement different optimizations that work well in different applications. So yes, you may well see different run-time performance with different compilers depending on your application and/or MPI implementations. Some compilers may have better memcpy's.
My $0.02: I think there are a *lot* of factors involved here.

I agree, and we have probably just scratched the surface here.

--td

Terry Dontje
Sun Microsystems, Inc.
