Re: [OMPI users] How to justify the use MPI codes on multicore systems/PCs?

2011-12-14 Thread Rayson Ho
There is a project called "MVAPICH2-GPU", which is developed by D. K.
Panda's research group at Ohio State University. You will find lots of
references on Google... and I have just briefly gone through the slides of
"MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand
Clusters":

http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2011/hao-isc11-slides.pdf

It takes advantage of CUDA 4.0's Unified Virtual Addressing (UVA) to
pipeline & optimize cudaMemcpyAsync() & RDMA transfers. (MVAPICH2
1.8a1p1 also supports Device-Device, Device-Host, and Host-Device
transfers.)
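
For what it's worth, usage-wise this boils down to passing a device
pointer straight to MPI. A minimal sketch, assuming two ranks and a
CUDA-aware MPI build that recognizes device memory via UVA (not any
particular library's exact behavior):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *d_buf;                      /* device memory, not host */
        cudaMalloc((void **)&d_buf, n * sizeof(double));

        /* UVA lets the library detect that d_buf is device memory and
           pipeline the staging/RDMA transfers internally. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }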

Open MPI also supports similar functionality, but as Open MPI is not an
academic project, there are fewer academic papers documenting the
internals of the latest developments (not saying that it's bad - many
products are not academic in nature and thus have fewer published
papers...)

Rayson

=
Grid Engine / Open Grid Scheduler
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/


On Mon, Dec 12, 2011 at 11:40 AM, Durga Choudhury  wrote:
> I think this is a *great* topic for discussion, so let me throw some
> fuel on the fire: the mechanism described in the blog (which makes
> perfect sense) is fine for (N)UMA shared memory architectures. But
> will it work for asymmetric architectures such as the Cell BE or
> discrete GPUs, where the data has to be explicitly DMA'd in between
> compute nodes? Is there a middleware layer that makes it
> transparent to the upper-layer software?
>
> Best regards
> Durga
>
> On Mon, Dec 12, 2011 at 11:00 AM, Rayson Ho  wrote:
>> On Sat, Dec 10, 2011 at 3:21 PM, amjad ali  wrote:
>>> (2) The latest MPI implementations are intelligent enough to use an
>>> efficient mechanism when executing MPI-based codes on shared-memory
>>> (multicore) machines. (Please point me to a reference I can cite for this fact.)
>>
>> Not an academic paper, but from a real MPI library developer/architect:
>>
>> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport/
>> http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/
>>
>> Open MPI is used by Japan's K computer (the current #1 TOP500 computer)
>> and LANL's RoadRunner (#1 from Jun 08 to Nov 09) - as the talk below
>> puts it, "10^16 Flops Can't Be Wrong" and "10^15 Flops Can't Be Wrong":
>>
>> http://www.open-mpi.org/papers/sc-2008/jsquyres-cisco-booth-talk-2up.pdf
>>
>> Rayson
>>
>>>
>>>
>>> Please help me formally justify this, and comment on/modify the above two
>>> justifications. Better still if you can suggest a suitable publication
>>> that I can cite in this regard.
>>>
>>> best regards,
>>> Amjad Ali
>>>
>>>



-- 
Rayson

==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/



Re: [OMPI users] How to justify the use MPI codes on multicore systems/PCs?

2011-12-12 Thread Gustavo Correa

On Dec 12, 2011, at 8:42 AM, amjad ali wrote:

> Thank you all very much for the replies.
>
> I would request some references for what Tim Prince & Andreas have
> said.
>
> Tim said that OpenMPI has had effective shared memory message passing. Does
> that have anything to do with the --enable-mpi-threads switch when
> installing Open MPI?
>  
> regards,
> AA 
> 

Hi Amjad

I think this is just the 'sm' [shared memory] 'btl' [byte transfer layer] of
OpenMPI, which uses shared memory inside a node to pass messages [unless
you turn it off].
If I remember right, the OpenMPI sm btl is built by default on an SMP
computer [like yours], and used by default whenever two processes live on
the same computer/node.
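
You can check this from the mpirun command line. For example (a sketch,
assuming a stock Open MPI installation; ./my_mpi_code stands in for your
own executable):

    # explicitly request the shared-memory BTL (plus self) inside the node:
    mpirun --mca btl sm,self -np 4 ./my_mpi_code

    # or exclude it, to see the difference when intra-node messages
    # fall back to another transport such as tcp:
    mpirun --mca btl ^sm -np 4 ./my_mpi_code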

As a practical matter, if you plan to run your program on larger problems,
say, ones that do not fit in the memory of a single node, it is wise to use
MPI to begin with, because your programming effort is preserved: you can
pretty much use, for the large problem on multiple nodes, the very same
code that you developed for the small problem on a single computer.
You cannot do this with OpenMP, which requires shared memory to start with.

Given the many answers from the OpenMPI pros so far, it is clear that you
provoked an interesting discussion!

I wonder if it is fair at all to make comparisons between MPI and OpenMP.
They are quite different programming models, and assume different hardware 
and memory layouts.
The techniques used to design algorithms in each case are quite different as 
well.
Both have pros and cons, but I can hardly imagine a fair comparison between 
them 
in real world problems.
For instance, suppose one has a PDE to solve, say the wave equation in 1D,
2D, or 3D.
The typical approach in OpenMP is to parallelize the inner loop[s].
The typical approach in MPI is to use domain decomposition.
The typical approach in hybrid mode [MPI + OpenMP] is to do both.
Could somebody tell me how these things can be fairly compared to each other, 
if at all?
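
To make the contrast concrete, here is a minimal 1D sketch of the two
styles (fragments only, not a complete solver; n, N, c, u, uold, unew,
left, right are illustrative names):

    /* OpenMP style: parallelize the update loop over the whole domain */
    #pragma omp parallel for
    for (int i = 1; i < N - 1; i++)
        unew[i] = 2.0*u[i] - uold[i] + c*(u[i+1] - 2.0*u[i] + u[i-1]);

    /* MPI style: each rank owns n interior points plus 2 ghost cells;
       exchange boundary values with the neighbour ranks (MPI_PROC_NULL
       at the domain ends), then update locally. */
    MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  0,
                 &u[n+1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[n],   1, MPI_DOUBLE, right, 1,
                 &u[0],   1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int i = 1; i <= n; i++)
        unew[i] = 2.0*u[i] - uold[i] + c*(u[i+1] - 2.0*u[i] + u[i-1]);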

Thank you,
Gus Correa






Re: [OMPI users] How to justify the use MPI codes on multicore systems/PCs?

2011-12-12 Thread amjad ali
Thank you all very much for the replies.

I would request some references for what Tim Prince & Andreas have said.

Tim said that OpenMPI has had effective shared memory message passing. Does
that have anything to do with the --enable-mpi-threads switch when
installing Open MPI?

regards,
AA


Re: [OMPI users] How to justify the use MPI codes on multicore systems/PCs?

2011-12-11 Thread Tim Prince

On 12/11/2011 12:16 PM, Andreas Schäfer wrote:

Hey,

on an SMP box threaded codes CAN always be faster than their MPI
equivalents. One reason why MPI sometimes turns out to be faster is
that with MPI every process actually initializes its own
data. Therefore it'll end up in the NUMA domain to which the core
running that process belongs. A lot of threaded codes are not NUMA
aware. So if, for instance, the initialization is done sequentially
(because it may not take a lot of time), Linux's first-touch policy
makes all memory pages belong to a single domain. In essence, those
codes will use just a single memory controller (and its bandwidth).
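
A minimal sketch of what NUMA-aware first touch looks like, assuming
Linux and a static schedule so the init and compute loops see the same
iteration-to-thread mapping:

    #include <stdlib.h>

    #define N (1L << 24)

    int main(void)
    {
        double *a = malloc(N * sizeof(double));

        /* First touch: each thread initializes (and thereby places, under
           Linux's first-touch policy) the pages it will later work on. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Compute loops with the same static schedule now hit memory
           that is local to each thread's NUMA domain. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * a[i] + 1.0;

        free(a);
        return 0;
    }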



Many applications require significant additional RAM and message-passing 
communication per MPI rank. Where those are not adverse issues, MPI is 
likely to outperform pure OpenMP (Andreas just cited some of the 
reasons), and OpenMP is likely to be favored only where it is an easier 
development model. The OpenMP library should also implement a 
first-touch policy, but it's very difficult to carry out fully in legacy 
applications.
OpenMPI has had effective shared-memory message passing from the 
beginning, as did its predecessor (LAM) and all current commercial MPI 
implementations I have seen, so you shouldn't have to beat on an issue 
that was dealt with 10 years ago.  If you haven't been watching this 
mailing list, you've missed some impressive reporting of new support 
features for effective pinning by CPU, cache, etc.
When you get to hundreds of nodes, depending on your application and 
interconnect performance, you may need to consider "hybrid" (OpenMP as 
the threading model for MPI_THREAD_FUNNELED mode), if you are running a 
single application across the entire cluster.
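
The hybrid entry point is just the threaded MPI init. A minimal sketch,
assuming one multi-threaded rank per node (or per L3 cache) with OpenMP
underneath:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* FUNNELED: only the thread that called MPI_Init_thread makes
           MPI calls; the OpenMP threads do the compute in each rank. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED)
            MPI_Abort(MPI_COMM_WORLD, 1);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* compute phase: all threads work on this rank's sub-domain */
            printf("rank %d, thread %d\n", rank, omp_get_thread_num());
        }   /* implicit barrier; the master thread then communicates */

        MPI_Finalize();
        return 0;
    }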
The biggest cluster in my neighborhood, which ranked #54 on the recent 
Top500, gave best performance in pure MPI mode for that ranking.  It 
uses FDR infiniband, and ran 16 ranks per node, for 646 nodes, with 
DGEMM running in 4-wide vector parallel.  Hybrid was tested as well, 
with each multiple-thread rank pinned to a single L3 cache.
All 3 MPI implementations which were tested have full shared memory 
message passing and pinning to local cache within each node (OpenMPI and 
2 commercial MPIs).



--
Tim Prince


Re: [OMPI users] How to justify the use MPI codes on multicore systems/PCs?

2011-12-11 Thread MM
I guess, on a multicore machine, OpenMP/pthread code will always run faster
than MPI code on the same box, even if the MPI implementation is efficient
and uses a shared memory mechanism whereby the data is actually shared across
the different processes, though in a different way than it is shared across
the threads in the same process.



I'd be curious to see some timing comparisons.
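
As a starting point, here is a crude intra-node ping-pong timer (a sketch;
the buffer size and repetition count are arbitrary) that could be timed
against a pthread/OpenMP version of the same exchange:

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Run with 2 ranks on one box, e.g. mpirun -np 2 ./pingpong */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { N = 1 << 20, REPS = 100 };
        static char buf[N];
        memset(buf, 0, N);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {
                MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("avg round trip: %g us\n", 1e6 * (t1 - t0) / REPS);

        MPI_Finalize();
        return 0;
    }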

MM



From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of amjad ali
Sent: 10 December 2011 20:22
To: Open MPI Users
Subject: [OMPI users] How to justify the use MPI codes on multicore
systems/PCs?



Hello All,



I developed my MPI-based parallel code for clusters, but now I use it on
multicore/manycore computers (PCs) as well. How does one justify (in a
thesis/publication) the use of a distributed memory code (in MPI) on a
shared memory (multicore) machine? I guess I should explain two reasons:



(1) The plan is to use several hundred processes in the future, so something
like MPI is necessary. To maintain code uniformity and save the cost/time of
developing a shared memory solution (using OpenMP, pthreads, etc.), I use the
same MPI code on shared memory systems (like multicore PCs). MPI-based codes
give reasonable performance on multicore PCs, if not the best.



(2) The latest MPI implementations are intelligent enough to use an
efficient mechanism when executing MPI-based codes on shared memory
(multicore) machines. (Please point me to a reference I can cite for this fact.)





Please help me formally justify this, and comment on/modify the above two
justifications. Better still if you can suggest a suitable publication that
I can cite in this regard.



best regards,

Amjad Ali





[OMPI users] How to justify the use MPI codes on multicore systems/PCs?

2011-12-10 Thread amjad ali
 Hello All,

I developed my MPI-based parallel code for clusters, but now I use it on
multicore/manycore computers (PCs) as well. How does one justify (in a
thesis/publication) the use of a distributed memory code (in MPI) on a
shared memory (multicore) machine? I guess I should explain two reasons:

(1) The plan is to use several hundred processes in the future, so something
like MPI is necessary. To maintain code uniformity and save the cost/time of
developing a shared memory solution (using OpenMP, pthreads, etc.), I use the
same MPI code on shared memory systems (like multicore PCs). MPI-based codes
give reasonable performance on multicore PCs, if not the best.

(2) The latest MPI implementations are intelligent enough to use an
efficient mechanism when executing MPI-based codes on shared memory
(multicore) machines. (Please point me to a reference I can cite for this fact.)


Please help me formally justify this, and comment on/modify the above two
justifications. Better still if you can suggest a suitable publication that
I can cite in this regard.

best regards,
Amjad Ali