Re: [OMPI users] OpenMPI binding all tasks to cpu0, leaving cpu1 idle. (2-cpu system)

2007-10-03 Thread Miguel Figueiredo Mascarenhas Sousa Filipe
Hi,

On 10/3/07, jody  wrote:
> Hi Miguel
> I don't know if it's a typo - but actually it should be
>  mpiexec -np 2 ./mpi-app config.ini
> and not
> > mpiexec -n 2 ./mpi-app config.ini

thanks for the remark, you're right, but the man page says -n is a
synonym for -np

Kind regards,

-- 
Miguel Sousa Filipe


[OMPI users] OpenMPI binding all tasks to cpu0, leaving cpu1 idle. (2-cpu system)

2007-10-03 Thread Miguel Figueiredo Mascarenhas Sousa Filipe
Hi there,

I have a 2-cpu system (linux/x86-64), running openmpi-1.1. I do not
specify a hostfile.
Lately I'm having performance problems when running my mpi-app this way:

mpiexec -n 2 ./mpi-app config.ini

Both mpi-app processes are running on cpu0, leaving cpu1 idle.

After reading the mpirun manpage, it seems that openmpi binds tasks to
cpus in a round-robin way, meaning that this should not happen.
But given my problem, I assume that it's not detecting that this is a 2-way
SMP system (it assumes a UP system instead) and is binding both tasks to cpu0.

Is this correct?
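(For what it's worth, I believe there is an mpi_paffinity_alone MCA parameter
that asks Open MPI to bind each process to its own processor, i.e. something like
mpirun -mca mpi_paffinity_alone 1 -n 2 ./mpi-app config.ini
but I'm not sure whether it is available, or working, in 1.1 on my setup,
which is partly why I'm asking.)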

The openmpi-default-hostfile says I should not specify localhost in
there, and should instead let the job dispatcher/rca "detect" the single-node setup.

Where should I define/configure, system wide, that this is a
single-node, 2-slot system?
I would like to avoid obliging the system's users to pass a
hostfile to mpirun/mpiexec. I simply want mpiexec -n N ./mpi-task to
do the proper job of _really_ spreading the processes evenly across
all the system's CPUs.
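(For concreteness: assuming the usual "hostname slots=N" hostfile syntax, what I
would expect to be able to put in etc/openmpi-default-hostfile is a single line like
localhost slots=2
but, as said above, the comments in that file advise against listing localhost there.)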

Best regards, waiting for your answer.



P.S.: should I upgrade to the latest openMPI to have my problem
"automagically" solved?

-- 
Miguel Sousa Filipe


Re: [OMPI users] efficient memory to memory transfer

2006-11-08 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

On 11/8/06, Miguel Figueiredo Mascarenhas Sousa Filipe wrote:

Hi,

On 11/8/06, Greg Lindahl  wrote:
> On Tue, Nov 07, 2006 at 05:02:54PM +0000, Miguel Figueiredo Mascarenhas Sousa 
Filipe wrote:
>
> > if your application is on one given node, sharing data is better than
> > copying data.
>
> Unless sharing data repeatedly leads you to false sharing and a loss
> in performance.


What does that mean? I did not understand that.

>
> > the MPI model assumes you don't have a "shared memory" system..
> > therefore it is "message passing" oriented, and not designed to
> > perform optimally on shared memory systems (like SMPs, or numa-CCs).
>
> For many programs with both MPI and shared memory implementations, the
> MPI version runs faster on SMPs and numa-CCs. Why? See the previous
> paragraph...


I misunderstood what you said.

The reasons I see for that are:
1) the application design was MPI-oriented to begin with,
and that design was adapted to a shared memory implementation afterwards
(using an MPI programming model in a shared memory application);
2) cases where the problem space is better solved by an
MPI-oriented programming model.

Shared memory, or multi-threaded, program development requires a
different programming model.
The MPI model can be better suited to some tasks than the
multi-threading approach.
And vice-versa.




But, for instance, try to benchmark real applications that have both MPI and
POSIX threads implementations on the same NUMA-CC or big SMP machine;
my bet is that the POSIX threads implementation is going to be faster.
There are always exceptions, like having a very well designed MPI
application but a terrible POSIX threads one, or a design that's just
not that adaptable to a POSIX threads programming model (or to an MPI
model).


--
Miguel Sousa Filipe




--
Miguel Sousa Filipe


Re: [OMPI users] efficient memory to memory transfer

2006-11-08 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hi,

On 11/8/06, Greg Lindahl  wrote:

On Tue, Nov 07, 2006 at 05:02:54PM +0000, Miguel Figueiredo Mascarenhas Sousa
Filipe wrote:

> if your application is on one given node, sharing data is better than
> copying data.

Unless sharing data repeatedly leads you to false sharing and a loss
in performance.



What does that mean? I did not understand that.



> the MPI model assumes you don't have a "shared memory" system..
> therefore it is "message passing" oriented, and not designed to
> perform optimally on shared memory systems (like SMPs, or numa-CCs).

For many programs with both MPI and shared memory implementations, the
MPI version runs faster on SMPs and numa-CCs. Why? See the previous
paragraph...


Of course it does: it's faster to copy data in main memory than it is
to move it through any kind of network interface. You can optimize your
message passing implementation down to a couple of memory-to-memory copies
when ranks are on the same node. In the worst case, even if using
local IP addresses to communicate between peers/ranks (on the same
node), the operating system doesn't even touch the interface; it
will just copy data from a TCP sender buffer to a TCP receiver
buffer. In the end, that's always faster than going through a
physical network link.
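(If I'm not mistaken, on a single node you can even tell Open MPI to use only its
shared-memory path with something like "mpirun --mca btl self,sm -np 2 ./app",
so no TCP buffers are involved at all; but take the exact parameter with a grain
of salt, I'm quoting it from memory.)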



But you still have a message passing API that is doing memory-to-memory
copies; it's a worse framework for doing memory copies than an API
designed just for that.
One could argue that MPI is more than a message passing API, since it
also provides APIs to apply operators to the data.
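As a toy illustration of what I mean by "applying operators to the data" (a
minimal sketch written just for this mail, not taken from any benchmark):

/* Each rank contributes one integer; MPI_Reduce applies the MPI_SUM
 * operator across all ranks and leaves the result on rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;                      /* per-rank contribution */
    MPI_Reduce(&local, &total, 1, MPI_INT,
               MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over all ranks: %d\n", total);

    MPI_Finalize();
    return 0;
}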


But, for instance, try to benchmark real applications that have both MPI and
POSIX threads implementations on the same NUMA-CC or big SMP machine;
my bet is that the POSIX threads implementation is going to be faster.
There are always exceptions, like having a very well designed MPI
application but a terrible POSIX threads one, or a design that's just
not that adaptable to a POSIX threads programming model (or to an MPI
model).


--
Miguel Sousa Filipe


Re: [OMPI users] efficient memory to memory transfer

2006-11-07 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hi,

On 11/7/06, Chevchenkovic Chevchenkovic  wrote:

Hi,
  I had the following setup:
 Rank 0 process on node 1 wants to send an array of a particular size to the Rank
1 process on the same node.
1. What optimisations can be done/invoked while running mpirun
to perform this memory-to-memory transfer efficiently?
2. Is there any performance gain if 2 processes that are exchanging data
arrays are kept on the same node rather than on different nodes connected by
InfiniBand?


If your application is on one given node, sharing data is better than
copying data.
You can do this with the Unix shared memory API, or with the POSIX threads API.
If applications share the same address space, and if a copy is necessary,
memcpy() is probably the fastest way (while also ensuring that data is aligned
in memory).
However, this by definition does not work on multi-computer
applications/systems.
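Something along these lines is what I have in mind; a rough, untested sketch of
the aligned-buffer copy, just to make the point concrete:

#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t n = 1 << 20;                /* 1 MiB of data to move */
    void *src, *dst;

    /* page-aligned buffers, so the copy works on aligned data */
    if (posix_memalign(&src, 4096, n) != 0 ||
        posix_memalign(&dst, 4096, n) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    memset(src, 0xab, n);              /* pretend this is the data to share */
    memcpy(dst, src, n);               /* one in-memory copy, no network */

    printf("copied %zu bytes\n", n);
    free(src);
    free(dst);
    return 0;
}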

If you can have one application per node, with several threads per node,
consider using MPI only between applications, and set up your MPI
framework to launch one application per node.

Program your application to use #threads per rank (i.e. per node), use the
POSIX threading model for parallel execution within each node (for instance,
with #threads == NCPUS), and use MPI for communicating between nodes.
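A bare-bones sketch of that layout (one MPI rank per node, NTHREADS POSIX
threads inside each rank; the names and the thread count are just placeholders,
adjust to your NCPUS):

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2                        /* e.g. number of CPUs per node */

struct targ { int rank; int tid; };

static void *worker(void *p)
{
    struct targ *a = p;
    /* node-local, shared-memory work goes here */
    printf("rank %d, thread %d working\n", a->rank, a->tid);
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, i;
    pthread_t th[NTHREADS];
    struct targ args[NTHREADS];

    MPI_Init(&argc, &argv);               /* one rank per node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < NTHREADS; i++) {      /* spread the work over local CPUs */
        args[i].rank = rank;
        args[i].tid  = i;
        pthread_create(&th[i], NULL, worker, &args[i]);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);

    MPI_Barrier(MPI_COMM_WORLD);          /* MPI only between nodes/ranks */
    MPI_Finalize();
    return 0;
}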

The MPI model assumes you don't have a "shared memory" system;
therefore it is "message passing" oriented, and not designed to
perform optimally on shared memory systems (like SMPs or NUMA-CCs).


best regards,

--
Miguel Sousa Filipe


Re: [OMPI users] dual Gigabit ethernet support

2006-10-23 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hello,



On 10/23/06, Jayanta Roy  wrote:


Hi,

Some time ago I posted doubts about fully using dual gigabit support.
See, I get a ~140MB/s full duplex transfer rate in each of the following
runs.




That's impressive, since it's _more_ than the theoretical limit of 1Gb
Ethernet:
140 MB/s = 140 x 8 Mbit/s = 1120 Megabit/s > 1 Gigabit/s...



mpirun --mca btl_tcp_if_include eth0 -n 4 -bynode -hostfile host a.out


mpirun --mca btl_tcp_if_include eth1 -n 4 -bynode -hostfile host a.out

How can I combine these two ports, or use a proper routing table in place of
the host file? I am using openmpi-1.1.

-Jayanta
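As for combining the two ports: I haven't tried it myself, but if I remember the
docs correctly the tcp btl accepts a comma-separated list of interfaces, so
something like

mpirun --mca btl_tcp_if_include eth0,eth1 -n 4 -bynode -hostfile host a.out

might let Open MPI use both at once. Someone from the Open MPI team can probably
confirm whether it really stripes traffic across them.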




--
Miguel Sousa Filipe


Re: [OMPI users] using both 64 and 32 bit mpi

2006-09-29 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hi,



On 9/28/06, Glenn Johnson  wrote:

I have an 8-way AMD64 system. I built a 64 bit open-mpi-1.1
implementation and then compiled software to use it. That all works
fine.

In addition, I have a 32 bit binary program (Schrodinger Jaguar) that I
would like to run on this machine with mpi. Schrodinger provides source
code to build an mpi compatibility layer. This compatibility layer
allows jaguar to use a different mpi implementation than that which the
software was compiled with. I do not want to give up the 64 bit open-mpi
that I already have and am using.

So my questions are:
 1. Can I build/install a 32 bit version of open-mpi even though I
already have a 64 bit version installed?
 2. What "tricks" might I need to do to make sure a program calls
the correct version of mpi (32 or 64 bit)?
 3. Would I do better considering running jaguar in a 32 bit chroot
environment?



I'm a simple user who is using openmpi in 32-bit and 64-bit modes on an amd64
Linux machine, i.e. on a 64-bit system. But I also have openmpi and mpich2 in a
32-bit chroot.

I've installed the 32-bit openmpi in /opt/openmpi-1.1.1-32bits (the amd64 one
is installed in /opt/openmpi-1.1.1-amd64), and it's easy to change
between the two, either with symlinks, PATHs or scripts, to use one
or the other.
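For example, this is roughly what my switching script does (adjust the prefixes
to taste):

export PATH=/opt/openmpi-1.1.1-32bits/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.1.1-32bits/lib:$LD_LIBRARY_PATH

and the same with the -amd64 prefix when I want the 64-bit one.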

The chroot was a lot more trouble, and I only did it because I wasn't
able to compile openmpi (1.0 at the time) in 32-bit mode on my 64-bit
system.


Thanks.

--
Glenn Johnson 
USDA, ARS, SRRC




--
Miguel Sousa Filipe


Re: [OMPI users] Does current Intel dual processor support MPI?

2006-09-07 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hi all,

Well, I just wanted to say that, from a software engineering (and also
computer science) point of view, OMP, MPI and threads are completely different
models for parallel computation/concurrent programming.

I do not believe that any capable engineer (or good programmer, for all
I know) can know in advance which is the "best" one to use without knowing
the problem space, the design requirements, etc.

It's not just about "portability" or code "readability/maintainability".

Deciding which one to use will depend on (and in turn influence) the
application architecture.

Should I use OMP on a web server such as Apache or Tomcat, thereby providing
better portability and code readability?

Should I use OMP or threading for a massively parallel system such
as Blue Gene/L? What about an SGI Altix 3000?

Should I use threading on a 2-CPU, shared memory system for a
sequential application where I just need to speed up
some highly vectorizable loops?

For instance, my thesis dealt with parallelizing a seismic simulation
application; I did a threaded version and an MPI version.
The threaded version, since "tasks" could share massive amounts of
data with very little lock contention, could work on bigger data sets
than the MPI version (given the same total amount of RAM). But the MPI
version could run on clusters, while with threading I needed a single
system image.
OpenMP was inadequate since it would have had a much bigger sequential
execution time, providing inadequate speedup for an algorithm which
was very parallel.

Speedups measured for the threaded version and the MPI version were about
1.99 in 2-CPU mode (<1% of the computation is sequential). With MPI on 16
CPUs (one gigabit link for 8 x 2-CPU nodes), the measured speedup was
14.8.

My threaded version would never achieve a 14.8 speedup, even on an
"SSI" 8-node cluster.
The effort applied to make the MPI version so scalable was _much_
bigger than for the threaded one (designing a new concurrent, distributed
algorithm to replace a sequential one that, in the sequential application,
amounted to 1% of the computation time). It uses more RAM
per process, but can scale up to 64/128 nodes, depending on the
problem size, and it doesn't require a shared memory system.
My threaded version, on a shared memory system with lots of CPUs,
will scale quite a lot, but probably the aggregate bandwidth will be
inferior to that of a cluster with the same amount of CPUs/RAM (normally, big
SMP or NUMA systems have higher RAM latency and bandwidth that does not
scale proportionally).
Basically, I can't predict which performs better.


So, I hope it's understandable that choosing the right parallel
computing model isn't just a matter of "taste".




On 9/6/06, George Bosilca  wrote:

From my perspective, some (let's say #1 and #2) of the most important
features of an application that has to last for a while are
readability and portability. And OMP code is far more readable than
pthread code. The loops look like loops, the critical sections are
obvious and the sequential meaning of the program is preserved.

On Sep 5, 2006, at 7:52 PM, Durga Choudhury wrote:

> My opinion would be to use pthreads, for a couple of reasons:
>
> 1. You don't need an OMP aware compiler; any old compiler would do.

Compilers can be downloaded for free these days, and most of them
now have OMP support, on all operating systems (i.e. even the
free Microsoft compiler now has OMP support, and Windows was
definitely not the platform I expected to use for my OMP tasks).

> 2. The pthread library is more well adapted and hence might be more
> optimized than the code emitted from an OMP compiler.

The pthread library adds a huge overhead to all operations. At this
level of granularity you quite often need atomic locks and operations,
not critical sections protected by mutexes. Unfortunately, there is
no portable library that gives you a common interface to atomic
operations (there was a BSD one at one point). Moreover, using
threads instead of OMP directives moves the burden onto the programmer.
Most people just cannot afford a one-year student who has to
first understand the code and then add the correct pthread calls.
And for what result... you don't even know that you will get the
fastest version. On the other side, OMP compilers are getting smarter
and smarter every day. Today the results are quite impressive; just
imagine what will happen in a few years.

>
> If your operating system is Linux, you may use the clone() system
> call directly; this would add further optimization at the expense
> of portability.

It's always a trade-off between performance and portability. What do
you want to lose in order to get the 1% performance gain... And in
this case the only performance gain you will get is when you start
the threads; otherwise you will not improve anything. Generally,
people prefer to use thread pools in order to avoid the overhead of
creating and destroying threads all the time.

   george.

>
> Durga
>
>
> On 9/5/06, George Bosilca  wrote:
> On Sep 5, 2006,

Re: [OMPI users] OpenMPI error in the simplest configuration...

2006-08-28 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hi there,

On 8/27/06, Durga Choudhury  wrote:


Hi all

I am getting an error (details follow) in the simplest of the possible
test scenarios:

Two identical regular Dell PCs connected back-to-back via an ethernet
switch on the 10/100 ethernet. Both run Fedora Core 4. An identical version
(1.1) of Open MPI is compiled and installed on both of them *without* a
--prefix option (i.e. installed in the default location of /usr/local).

The hostfile on both the machine is the same:

cat ~/hostfile

192.168.22.29
192.168.22.103

I can run openMPI on either of these two machines by forking two
processes:

mpirun -np 2 osu_acc_latency  <-- This runs fine on either of the two
machines.

However, when I try to launch the same program across the two machines, I
get an error:

mpirun --hostfile ~/hostfile -np 2 /home/durga/openmpi-1.1/osu_benchmarks/osu_acc_latency

durga@192.168.22.29's  password: foobar
/home/durga/openmpi-1.1/osu_benchmarks/osu_acc_latency: error while
loading shared libraries: libmpi.so.0: cannot open shared object file: No
such file or directory.

However, the file *does exist* in /usr/local/lib:

ls -l /usr/local/lib/libmpi.so.0
libmpi.so.0 -> libmpi.so.0.0.0

I have also tried adding /usr/local/lib to my LD_LIBRARY_PATH on *both*
machines, to no avail.




First off: I'm not from the Open MPI team.
Where did you add your LD_LIBRARY_PATH?
Make sure it is in one of the profile files for your shell, such as
.bash_profile, and _not_ .bashrc.

That's a typical error in this kind of configuration.
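I.e. something like this line in ~/.bash_profile on both machines (assuming bash):

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH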

best regards

Any help is greatly appreciated.


Thanks

Durga






--
Miguel Sousa Filipe


Re: [OMPI users] OpenMPI-1.1 virtual memory overhead

2006-08-25 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hi,

On 8/25/06, Sven Stork  wrote:


Hello Miguel,

this is caused by the shared memory mempool. By default this shared memory
mapping has a size of 512 MB. You can use the "mpool_sm_size" parameter to
reduce the size, e.g.

mpirun -mca mpool_sm_size  ...




Is using
mpirun -mca mpool_sm_size 0
acceptable?
What will it fall back to? Sockets? Pipes? TCP? Smoke signals?

Thank you very much for the fast answer.

Thanks,

Sven

On Friday 25 August 2006 15:04, Miguel Figueiredo Mascarenhas Sousa Filipe
wrote:
> Hi there,
> I'm using openmpi-1.1 on a linux-amd64 machine and also a linux-32bit x86
> chroot environment on that same machine.
> (distro is gentoo, compilers: gcc-4.1.1 and gcc-3.4.6)
>
> In both cases openmpi-1.1 shows a +/-400MB overhead in virtual memory usage
> (virtual address space usage) for each MPI process.
>
> In my case this is quite troublesome because my application in 32bit mode is
> counting on using the whole 4GB address space for the problem set size and
> associated data.
> This means that I have a reduction in the size of the problems which it can
> solve.
> (my application isn't 64bit safe yet, so I need to run in 32bit mode, and use
> effectively the 4GB address space)
>
>
> Is there a way to tweak this overhead, by configuring openmpi to use smaller
> buffers, or anything else ?
>
> I do not see this with mpich2.
>
> Best regards,
>
> --
> Miguel Sousa Filipe
>





--
Miguel Sousa Filipe


[OMPI users] OpenMPI-1.1 virtual memory overhead

2006-08-25 Thread Miguel Figueiredo Mascarenhas Sousa Filipe

Hi there,
I'm using openmpi-1.1 on a linux-amd64 machine and also a linux-32bit x86
chroot environment on that same machine.
(distro is gentoo, compilers: gcc-4.1.1 and gcc-3.4.6)

In both cases openmpi-1.1 shows a +/-400MB overhead in virtual memory usage
(virtual address space usage) for each MPI process.

In my case this is quite troublesome because my application in 32bit mode is
counting on using the whole 4GB address space for the problem set size and
associated data.
This means that I have a reduction in the size of the problems it can solve.
(my application isn't 64bit safe yet, so I need to run in 32bit mode, and use
the 4GB address space effectively)


Is there a way to tweak this overhead, by configuring openmpi to use smaller
buffers, or anything else ?

I do not see this with mpich2.

Best regards,

--
Miguel Sousa Filipe