Date: Sat, 16 Aug 2008 08:18:47 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] SM btl slows down bandwidth?
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <1197bce6-a7e3-499e-8b05-b85f7598d...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Aug 15, 2008, at 3:32 PM, Gus Correa wrote:
> Just like Daniel and many others, I have seen some disappointing
> performance of MPI code on multicore machines, in code that scales
> fine in networked environments and single-core CPUs, particularly
> in memory-intensive programs.
> The bad performance has been variously ascribed to memory bandwidth /
> contention, to setting processor and memory affinity versus letting
> the kernel scheduler do its thing, to poor performance of memcpy,
> and so on.

I'd suspect that all of these play a role -- not necessarily any single one of them.

- It is my belief (contrary to several kernel developers' beliefs) that explicitly setting processor affinity is a Good Thing for MPI applications. Not only does MPI have more knowledge than the OS about a parallel job spanning multiple processes, but each MPI process also allocates resources that may be spatially / temporally relevant. For example, say that an MPI process allocates some memory during MPI_INIT on a NUMA system. This memory will likely be "near" in a NUMA sense. If the OS later decides to move that process, then the memory would be "far" in a NUMA sense. Similarly, OMPI decides what I/O resources to use during MPI_INIT -- and may specifically choose some "near" resources (and exclude "far" resources). If the OS moves the process after MPI_INIT, these "near" and "far" determinations could become stale/incorrect, and performance would go down the tubes.
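To make the binding point concrete, here is a minimal sketch of what explicit binding looks like on Linux with the sched_setaffinity(2) interface. The rank-to-core mapping is hypothetical, and this is not how Open MPI does it internally; it only illustrates the mechanism:

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_setaffinity(), CPU_ZERO, CPU_SET */
    #include <stdio.h>

    /* Pin the calling process to one core.  "core" would come from
       whatever rank-to-core mapping the launcher chose (hypothetical). */
    static int bind_to_core(int core)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);
        if (sched_setaffinity(0 /* calling process */, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

Once a process is pinned, the memory it touches first (under the usual first-touch policy) stays on the local NUMA node -- which is exactly the property that is lost if the scheduler later migrates the process.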

I've been in the above discussion for many years, on the same side as Jeff; however, I think that is more due to pragmatic reasoning than because MPI is the right level at which to bind processes. The Solaris kernel developers I've talked with believe the right way to do the above is for MPI or its runtime to give hints to the OS about the desired locality of processes and have the OS try to maintain that locality. The reason is that there may be other processes the OS is dealing with that MPI or its runtime does not know about. Having MPI or its runtime force binding really messes up the OS's ability to balance the workload on the system. Now, mind you, on a machine with a small number of cores (<8) this probably isn't as big an issue. But once you start dealing with large SMPs with hundreds of cores, there is definitely a good chance that more than one MPI job is running on the machine.

However, until MPI and OS implementors come up with a way to pass such hints, it does become necessary for MPI to do the binding, for the reasons Jeff supplies above. Note that Jeff, another member, and I have talked about such hints but have not come up with anything definitive.

- Unoptimized memcpy implementations are definitely a factor, mainly for large message transfers through shared memory. Since most (all?) MPI implementations use some form of shared memory for on-host communication, memcpy can play a big part in its performance for large messages. Using hardware (such as IB HCAs) for on-host communication can effectively avoid unoptimized memcpy's, but then you're just shifting the problem to the hardware -- you're now dependent upon the hardware's DMA engine (which is *usually* pretty good). But then other issues can arise, such as the asynchronicity of the transfer, potentially causing collisions and/or extra memory bus traversals that a memcpy might avoid (it depends on the topology inside your server -- e.g., if 2 processes are "far" from the IB HCA, then the transfer will have to traverse QPI/HT/whatever twice, whereas a memcpy would presumably stay local). As Ron pointed out in this thread, non-temporal memcpy's can be quite helpful for benchmarks that don't touch the resulting message at the receiver (because the non-temporal memcpy doesn't bother to take the time to load the cache).
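For the curious, here is a rough sketch of what a non-temporal copy looks like with SSE2 intrinsics. It assumes 16-byte-aligned buffers and a length that is a multiple of 16; a production memcpy also has to handle unaligned heads/tails and pick a size threshold, so treat this as illustrative only, not as what any particular MPI implementation does:

    #include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */
    #include <stddef.h>

    /* Copy len bytes without pulling the destination into the cache.
       Assumes src and dst are 16-byte aligned and len is a multiple of 16. */
    static void nt_memcpy(void *dst, const void *src, size_t len)
    {
        __m128i       *d = (__m128i *) dst;
        const __m128i *s = (const __m128i *) src;
        size_t         i;

        for (i = 0; i < len / 16; ++i) {
            _mm_stream_si128(d + i, _mm_load_si128(s + i));  /* non-temporal store */
        }
        _mm_sfence();   /* make the streaming stores visible to other cores */
    }

The payoff shows up exactly in the benchmark case described above: the receiver's cache is not flooded with data it never reads. If the receiver does touch the message right away, a regular (cache-filling) memcpy can win.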

In addition to the above, you may run into platform-specific memory architecture issues -- for example, should the SM BTL be laying out its FIFOs in a specific way to get the best performance? The problem is that what is great for one platform may suck eggs for another.
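One concrete example of such a layout decision (a toy sketch, not the actual SM BTL data structure) is keeping the producer-side and consumer-side indices of a shared-memory FIFO on separate cache lines, so the two processes don't ping-pong the same line back and forth. Even the cache-line size used for the padding (64 bytes here) is a platform-dependent assumption:

    #define CACHE_LINE 64   /* assumption; differs across platforms */

    /* Single-producer / single-consumer FIFO indices, padded so the
       sender's head and the receiver's tail never share a cache line. */
    struct sm_fifo {
        volatile unsigned long head;                    /* written by the sender   */
        char pad1[CACHE_LINE - sizeof(unsigned long)];
        volatile unsigned long tail;                    /* written by the receiver */
        char pad2[CACHE_LINE - sizeof(unsigned long)];
        unsigned long slots[256];                       /* offsets into the shared segment */
    };

Tune the padding for one machine and it may be wasted (or insufficient) on another -- which is exactly the "great for one platform, sucks eggs for another" problem.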

- Using different compilers is a highly religious topic and, IMHO, tends to be application specific. Compilers are large, complex software systems (just like MPI); different compiler authors have chosen to implement different optimizations that work well in different applications. So yes, you may well see different run-time performance with different compilers depending on your application and/or MPI implementation. Some compilers may have better memcpy's.

My $0.02: I think there are a *lot* of factors involved here.

I agree, and we have probably just scratched the surface here.

--td

Terry Dontje
Sun Microsystems, Inc.
