Re: [OMPI users] How to justify the use of MPI codes on multicore systems/PCs?
There is a project called "MVAPICH2-GPU", developed by D. K. Panda's research
group at Ohio State University. You will find lots of references on Google; I
just briefly went through the slides of "MVAPICH2-GPU: Optimized GPU to GPU
Communication for InfiniBand Clusters":

http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2011/hao-isc11-slides.pdf

It takes advantage of CUDA 4.0's Unified Virtual Addressing (UVA) to pipeline
and optimize cudaMemcpyAsync() and RDMA transfers. (MVAPICH2 1.8a1p1 also
supports device-device, device-host, and host-device transfers.)

Open MPI also supports similar functionality, but as Open MPI is not an
academic project, there are fewer academic papers documenting the internals
of the latest developments. (That is not to say it's bad - many products are
not academic in nature and thus have fewer published papers.)

Rayson

=
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/


On Mon, Dec 12, 2011 at 11:40 AM, Durga Choudhury wrote:
> I think this is a *great* topic for discussion, so let me throw some fuel
> on the fire: the mechanism described in the blog (which makes perfect
> sense) is fine for (N)UMA shared memory architectures. But will it work
> for asymmetric architectures such as the Cell BE or discrete GPUs, where
> the data between the compute nodes have to be explicitly DMA'd in? Is
> there a middleware layer that makes it transparent to the upper layer
> software?
>
> Best regards
> Durga
>
> On Mon, Dec 12, 2011 at 11:00 AM, Rayson Ho wrote:
>> On Sat, Dec 10, 2011 at 3:21 PM, amjad ali wrote:
>>> (2) The latest MPI implementations are intelligent enough to use an
>>> efficient mechanism when executing MPI-based codes on shared memory
>>> (multicore) machines. (Please tell me of any reference to quote for
>>> this fact.)
>>
>> Not an academic paper, but from a real MPI library developer/architect:
>>
>> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport/
>> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/
>>
>> Open MPI is used by Japan's K computer (the current #1 TOP500 machine)
>> and LANL's RoadRunner (#1 from Jun 08 to Nov 09) -- "10^16 Flops Can't
>> Be Wrong" and "10^15 Flops Can't Be Wrong":
>>
>> http://www.open-mpi.org/papers/sc-2008/jsquyres-cisco-booth-talk-2up.pdf
>>
>> Rayson
>>
>> =
>> Grid Engine / Open Grid Scheduler
>> http://gridscheduler.sourceforge.net/
>>
>> Scalable Grid Engine Support Program
>> http://www.scalablelogic.com/
>>
>>> Please help me formally justify this, and comment on/modify the two
>>> justifications above. Better still if you can suggest some reference
>>> from a suitable publication in this regard.
>>>
>>> best regards,
>>> Amjad Ali
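For anyone curious what the application side of this looks like, here is a
minimal sketch of sending a GPU-resident buffer straight through MPI. It
assumes a CUDA-aware library (e.g. MVAPICH2 1.8 or a CUDA-enabled Open MPI
build) on CUDA 4.0+, so UVA lets the library detect that the pointer is
device memory and stage/pipeline the transfer itself; the buffer name and
size are made up for illustration.

/* Sketch: sending a GPU-resident buffer directly through MPI.
 * Assumes a CUDA-aware MPI library; run with two ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    if (rank == 0) {
        /* No explicit cudaMemcpy to a host staging buffer: the
         * device pointer goes straight into MPI_Send, and the
         * library pipelines the copy/transfer internally. */
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}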
Re: [OMPI users] How to justify the use of MPI codes on multicore systems/PCs?
On Dec 12, 2011, at 8:42 AM, amjad ali wrote:
> Thank you all very much for the replies.
>
> I would request some references for what Tim Prince and Andreas said.
>
> Tim said that Open MPI has had effective shared memory message passing.
> Does that have anything to do with the --enable-mpi-threads switch when
> installing Open MPI?
>
> regards,
> AA

Hi Amjad

I think this is just the 'sm' [shared memory] 'btl' [byte transfer layer] of
Open MPI, which uses shared memory inside a node to pass messages [unless you
turn it off]. If I remember right, the Open MPI sm btl is built by default on
an SMP computer [like yours], and used by default when two processes live on
the same computer/node.

As a practical matter, if you plan to run your program on larger problems,
say ones that do not fit in the memory of a single node, it is wise to use
MPI to begin with, because your programming effort is preserved: you can
pretty much run, on the large problem across multiple nodes, the very same
code that you developed for the small problem on a single computer. You
cannot do this with OpenMP, which requires shared memory to start with.

Given the many answers from the Open MPI pros so far, it is clear that you
provoked an interesting discussion!

I wonder if it is fair at all to make comparisons between MPI and OpenMP.
They are quite different programming models, and assume different hardware
and memory layouts. The techniques used to design algorithms in each case
are quite different as well. Both have pros and cons, but I can hardly
imagine a fair comparison between them on real world problems.

For instance, suppose one has a PDE to solve, say the wave equation, in 1D,
2D, or 3D. The typical approach in OpenMP is to parallelize the inner
loop[s]. The typical approach in MPI is to use domain decomposition. The
typical approach in hybrid mode [MPI + OpenMP] is to do both. Could somebody
tell me how these things can be fairly compared to each other, if at all?

Thank you,
Gus Correa
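To make that contrast concrete, here is a rough sketch of one time step of a
1D explicit stencil update done both ways. The update rule is a
heat-equation-like placeholder (a real wave-equation step would carry two
previous time levels), and the names are made up; drop the functions into a
driver of your own.

/* Sketch: one time step of a 1D explicit stencil, both models. */
#include <mpi.h>

/* OpenMP style: parallelize the inner loop over one shared array. */
void step_openmp(double *u_new, const double *u, int n, double c)
{
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        u_new[i] = u[i] + c * (u[i-1] - 2.0 * u[i] + u[i+1]);
}

/* MPI style (domain decomposition): each rank owns a subdomain
 * u[1..local_n] plus ghost cells u[0] and u[local_n+1]. `left' and
 * `right' are the neighbor ranks (MPI_PROC_NULL at the domain ends,
 * which turns the corresponding transfer into a no-op). */
void step_mpi(double *u_new, double *u, int local_n, double c,
              int left, int right, MPI_Comm comm)
{
    /* Exchange halos with both neighbors... */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 comm, MPI_STATUS_IGNORE);

    /* ...then apply exactly the same update rule locally. */
    for (int i = 1; i <= local_n; i++)
        u_new[i] = u[i] + c * (u[i-1] - 2.0 * u[i] + u[i+1]);
}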
Re: [OMPI users] How to justify the use of MPI codes on multicore systems/PCs?
Thank you all very much for the replies.

I would request some references for what Tim Prince and Andreas said.

Tim said that Open MPI has had effective shared memory message passing. Does
that have anything to do with the --enable-mpi-threads switch when installing
Open MPI?

regards,
AA
Re: [OMPI users] How to justify the use of MPI codes on multicore systems/PCs?
On 12/11/2011 12:16 PM, Andreas Schäfer wrote:
> Hey,
> on an SMP box threaded codes CAN always be faster than their MPI
> equivalents. One reason why MPI sometimes turns out to be faster is that
> with MPI every process actually initializes its own data, so the data
> ends up in the NUMA domain to which the core running that process
> belongs. A lot of threaded codes are not NUMA aware. For instance, the
> initialization may be done sequentially (because it does not take a lot
> of time), and Linux's first-touch policy then makes all memory pages
> belong to a single domain. In essence, such codes use just a single
> memory controller (and its bandwidth).

Many applications require significant additional RAM and message passing
communication per MPI rank. Where those are not adverse issues, MPI is
likely to out-perform pure OpenMP (Andreas just quoted some of the reasons),
and OpenMP is likely to be favored only where it is an easier development
model. An OpenMP code should also implement first-touch placement, but that
is very difficult to carry out fully in legacy applications.

Open MPI has had effective shared memory message passing from the beginning,
as did its predecessor (LAM) and all current commercial MPI implementations
I have seen, so you shouldn't have to beat on an issue which was dealt with
10 years ago. If you haven't been watching this mailing list, you've missed
some impressive reports of new support features for effective pinning by
CPU, cache, etc.

When you get to hundreds of nodes, depending on your application and
interconnect performance, you may need to consider "hybrid" (OpenMP as the
threading model for MPI_THREAD_FUNNELED mode) if you are running a single
application across the entire cluster. The biggest cluster in my
neighborhood, which ranked #54 on the recent Top500, gave its best
performance in pure MPI mode for that ranking. It uses FDR InfiniBand and
ran 16 ranks per node on 646 nodes, with DGEMM running 4-wide vector
parallel. Hybrid was tested as well, with each multi-threaded rank pinned to
a single L3 cache. All three MPI implementations that were tested (Open MPI
and two commercial MPIs) have full shared memory message passing and pinning
to local cache within each node.

--
Tim Prince
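A minimal sketch of the first-touch pattern Andreas describes (the array name
and workload are placeholders): on Linux a page is placed on the NUMA node of
the thread that first writes it, so initializing in parallel with the same
schedule as the compute loop spreads pages across memory controllers, whereas
a serial init loop would place every page on one domain.

/* Sketch of NUMA-aware first-touch initialization with OpenMP.
 * Compile with OpenMP enabled (e.g. -fopenmp). */
#include <stdlib.h>

void numa_aware_example(int n)
{
    double *a = malloc((size_t)n * sizeof(double));

    /* NUMA-unfriendly: a serial init loop would first-touch every
     * page from one thread, placing all pages on one domain:
     *     for (int i = 0; i < n; i++) a[i] = 0.0;            */

    /* NUMA-friendly: first-touch in parallel, using the same static
     * schedule as the compute loop below, so each page lands near
     * the core that will later work on it. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 0.0;

    /* Compute loop with the matching schedule (placeholder work). */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
}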
Re: [OMPI users] How to justify the use of MPI codes on multicore systems/PCs?
I guess that on a multicore machine, OpenMP/pthreads code will always run
faster than MPI code on the same box, even if the MPI implementation is
efficient and uses a shared memory mechanism whereby the data is actually
shared across the different processes, though in a different way than it is
shared across the threads within one process. I'd be curious to see some
timing comparisons.

MM

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of amjad ali
Sent: 10 December 2011 20:22
To: Open MPI Users
Subject: [OMPI users] How to justify the use of MPI codes on multicore systems/PCs?

Hello All,

I developed my MPI-based parallel code for clusters, but now I use it on
multicore/manycore computers (PCs) as well. How can I justify (in a
thesis/publication) the use of a distributed memory code (in MPI) on a
shared memory (multicore) machine? I plan to give two reasons:

(1) The plan is to use several hundred processes in the future, so something
like MPI is necessary. To maintain code uniformity and to save the cost/time
of developing a shared memory solution (using OpenMP, pthreads, etc.), I use
the same MPI code on shared memory systems (like multicore PCs). MPI-based
codes give reasonable performance on multicore PCs, if not the best.

(2) The latest MPI implementations are intelligent enough to use an
efficient mechanism when executing MPI-based codes on shared memory
(multicore) machines. (Please tell me of any reference to quote for this
fact.)

Please help me formally justify this, and comment on/modify the two
justifications above. Better still if you can suggest some reference from a
suitable publication in this regard.

best regards,
Amjad Ali
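As a starting point for the timing comparisons MM asks about, here is a
minimal intra-node ping-pong sketch; the message size and iteration count
are arbitrary choices. Run it with two ranks on one box, e.g.
mpirun -np 2 ./pingpong, to exercise the shared-memory transport.

/* Sketch: a minimal intra-node ping-pong to time the shared-memory
 * transport between two MPI ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;               /* 8 MB of doubles */
    const int iters = 100;
    double *buf = calloc(n, sizeof(double));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++) {
        if (rank == 0) {
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg round trip: %g s\n", (t1 - t0) / iters);

    free(buf);
    MPI_Finalize();
    return 0;
}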
[OMPI users] How to justify the use of MPI codes on multicore systems/PCs?
Hello All,

I developed my MPI-based parallel code for clusters, but now I use it on
multicore/manycore computers (PCs) as well. How can I justify (in a
thesis/publication) the use of a distributed memory code (in MPI) on a
shared memory (multicore) machine? I plan to give two reasons:

(1) The plan is to use several hundred processes in the future, so something
like MPI is necessary. To maintain code uniformity and to save the cost/time
of developing a shared memory solution (using OpenMP, pthreads, etc.), I use
the same MPI code on shared memory systems (like multicore PCs). MPI-based
codes give reasonable performance on multicore PCs, if not the best.

(2) The latest MPI implementations are intelligent enough to use an
efficient mechanism when executing MPI-based codes on shared memory
(multicore) machines. (Please tell me of any reference to quote for this
fact.)

Please help me formally justify this, and comment on/modify the two
justifications above. Better still if you can suggest some reference from a
suitable publication in this regard.

best regards,
Amjad Ali