Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Samuel K. Gutierrez


On Apr 22, 2010, at 10:08 AM, Rainer Keller wrote:

> Hello Oliver,
> thanks for the update.
>
> Just my $0.02: the upcoming Open MPI v1.5 will warn users if their session
> directory is on NFS (or Lustre).

... or panfs :-)

Samuel K. Gutierrez

> Best regards,
> Rainer
>
>
> On Thursday 22 April 2010 11:37:48 am Oliver Geisler wrote:
>> To sum up and give an update:
>>
>> The extended communication times seen with shared memory communication
>> of openmpi processes are caused by the openmpi session directory residing
>> on the network via NFS.
>>
>> The problem is resolved by setting up a ramdisk or mounting a tmpfs on
>> each diskless node. By setting the MCA parameter orte_tmpdir_base to
>> point to that mountpoint, shared memory communication and its files are
>> kept local, which reduces the communication times by orders of magnitude.
>>
>> The relation of the problem to the kernel version is not really
>> resolved, but maybe it is not "the problem" in this respect.
>> My benchmark is now running fine on a single node with 4 CPUs, kernel
>> 2.6.33.1 and openmpi 1.4.1.
>> Running on multiple nodes I still see higher (TCP) communication times
>> than I would expect, but that requires some deeper investigation (e.g.
>> collisions on the network) and should probably be posted to a new thread.
>>
>> Thank you guys for your help.
>>
>> oli

> --
>
> Rainer Keller, PhD  Tel: +1 (865) 241-6293
> Oak Ridge National Lab  Fax: +1 (865) 241-4811
> PO Box 2008 MS 6164   Email: kel...@ornl.gov
> Oak Ridge, TN 37831-2008AIM/Skype: rusraink





Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Rainer Keller
Hello Oliver,
thanks for the update.

Just my $0.02: the upcoming Open MPI v1.5 will warn users if their session
directory is on NFS (or Lustre).

Best regards,
Rainer


On Thursday 22 April 2010 11:37:48 am Oliver Geisler wrote:
> To sum up and give an update:
> 
> The extended communication times seen with shared memory communication
> of openmpi processes are caused by the openmpi session directory residing
> on the network via NFS.
> 
> The problem is resolved by setting up a ramdisk or mounting a tmpfs on
> each diskless node. By setting the MCA parameter orte_tmpdir_base to
> point to that mountpoint, shared memory communication and its files are
> kept local, which reduces the communication times by orders of magnitude.
> 
> The relation of the problem to the kernel version is not really
> resolved, but maybe it is not "the problem" in this respect.
> My benchmark is now running fine on a single node with 4 CPUs, kernel
> 2.6.33.1 and openmpi 1.4.1.
> Running on multiple nodes I still see higher (TCP) communication times
> than I would expect, but that requires some deeper investigation (e.g.
> collisions on the network) and should probably be posted to a new thread.
> 
> Thank you guys for your help.
> 
> oli
> 
> 

-- 

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink



Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Kenneth A. Lloyd
Oliver,

Thank you for this summary insight.  This substantially affects the
structural design of software implementations, which points to a new
analysis "opportunity" in our software.

Ken Lloyd

-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
Behalf Of Oliver Geisler
Sent: Thursday, April 22, 2010 9:38 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

To sum up and give an update:

The extended communication times seen with shared memory communication of
openmpi processes are caused by the openmpi session directory residing on the
network via NFS.

The problem is resolved by setting up a ramdisk or mounting a tmpfs on each
diskless node. By setting the MCA parameter orte_tmpdir_base to point to that
mountpoint, shared memory communication and its files are kept local, which
reduces the communication times by orders of magnitude.

The relation of the problem to the kernel version is not really resolved, but
maybe it is not "the problem" in this respect.
My benchmark is now running fine on a single node with 4 CPUs, kernel 2.6.33.1
and openmpi 1.4.1.
Running on multiple nodes I still see higher (TCP) communication times than I
would expect, but that requires some deeper investigation (e.g. collisions on
the network) and should probably be posted to a new thread.

Thank you guys for your help.

oli




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Oliver Geisler
To sum up and give an update:

The extended communication times seen with shared memory communication
of openmpi processes are caused by the openmpi session directory residing
on the network via NFS.

The problem is resolved by setting up a ramdisk or mounting a tmpfs on
each diskless node. By setting the MCA parameter orte_tmpdir_base to
point to that mountpoint, shared memory communication and its files are
kept local, which reduces the communication times by orders of magnitude.
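
(A concrete sketch of that setup -- the mount point /mnt/ompi-tmp and the
128m size are placeholder values, not from the measurements above:

  # on each diskless node, as root
  mkdir -p /mnt/ompi-tmp
  mount -t tmpfs -o size=128m tmpfs /mnt/ompi-tmp

  # point the Open MPI session directory there at launch time
  mpirun --mca orte_tmpdir_base /mnt/ompi-tmp -np 4 ./my_app

The sm backing file then lives in RAM instead of going out over NFS.)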

The relation of the problem to the kernel version is not really
resolved, but maybe it is not "the problem" in this respect.
My benchmark is now running fine on a single node with 4 CPUs, kernel
2.6.33.1 and openmpi 1.4.1.
Running on multiple nodes I still see higher (TCP) communication times
than I would expect, but that requires some deeper investigation (e.g.
collisions on the network) and should probably be posted to a new thread.

Thank you guys for your help.

oli




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-19 Thread Peter Kjellstrom
On Monday 19 April 2010, Oliver Geisler wrote:
> > Ah, that could do it.  Open MPI's shared memory files are under /tmp.  So
> > if /tmp is NFS, you could get extremely high latencies because of dirty
> > page writes out through NFS.
> >
> > You don't necessarily have to make /tmp disk-full -- if you just make
> > OMPI's session directories go into a ramdisk instead of to NFS, that
> > should also be sufficient.
>
> I just browsed the FAQ and "ompi_info --param all all", but didn't find the
> answer:
> How do I set the OMPI session directory to point to a ramdisk?
>
> And another question:
> What would be a good size for the ramdisk? The FAQ suggests a general
> value of 128MB, but what is your experience?
> (maybe a large topic by itself, so I have to try it out, I guess)

I just wanted to add that space not used on a tmpfs (mount -t tmpfs ...) is 
not wasted. You can have an 8G tmpfs mounted but if you only use 100M that's 
how much RAM it uses.
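
(A quick way to see that behaviour -- mount point and size are purely
illustrative:

  mount -t tmpfs -o size=8g tmpfs /mnt/scratch
  df -h /mnt/scratch      # reports an 8G filesystem
  free -m                 # RAM usage only grows as files are actually written

tmpfs allocates pages lazily, so size= is just an upper bound.)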

/Peter




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-19 Thread Eugene Loh




Ralph Castain wrote:
> On Apr 19, 2010, at 9:12 AM, Oliver Geisler wrote:
>
>>> Ah, that could do it.  Open MPI's shared memory files are under /tmp.  So
>>> if /tmp is NFS, you could get extremely high latencies because of dirty
>>> page writes out through NFS.
>>>
>>> You don't necessarily have to make /tmp disk-full -- if you just make
>>> OMPI's session directories go into a ramdisk instead of to NFS, that
>>> should also be sufficient.
>>
>> I just browsed the FAQ and "ompi_info --param all all", but didn't find
>> the answer:
>> How do I set the OMPI session directory to point to a ramdisk?
>
> Set the MCA param orte_tmpdir_base to point at the ramdisk using any of the
> MCA parameter methods (cmd line, envar, default mca param file).

I'll add that to http://www.open-mpi.org/faq/?category=sm#where-sm-file

>> And another question:
>> What would be a good size for the ramdisk? The FAQ suggests a general
>> value of 128MB, but what is your experience?
>> (maybe a large topic by itself, so I have to try it out, I guess)
>
> I don't recall the default shared memory size per process, but you can get
> that value from ompi_info --param btl sm. Take that value, multiply by your
> expected ppn, and then give yourself a fudge factor.

Sizing proportionately to the number of processes was a poor heuristic
and starting in 1.3.2 we don't employ it any more.  In all likelihood,
the default size of the shared-memory backing file will be set by
mpool_sm_min_size... 64 Mbytes.  Try "ompi_info --param mpool sm".
There's some other stuff in addition to this backing file... so you need
a little fudge factor.  Probably 128 MB will be enough for the
shared-memory stuff.




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-19 Thread Ralph Castain
On Apr 19, 2010, at 9:12 AM, Oliver Geisler wrote:

> 
> 
>> Ah, that could do it.  Open MPI's shared memory files are under /tmp.  So if 
>> /tmp is NFS, you could get extremely high latencies because of dirty page 
>> writes out through NFS.
>> 
>> You don't necessarily have to make /tmp disk-full -- if you just make OMPI's 
>> session directories go into a ramdisk instead of to NFS, that should also be 
>> sufficient.
>> 
> 
> I just browsed the FAQ and "ompi_info --param all all", but didn't find the
> answer:
> How do I set the OMPI session directory to point to a ramdisk?
> 

Set the MCA param orte_tmpdir_base to point at the ramdisk using any of the MCA 
parameter methods (cmd line, envar, default mca param file).
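
(Concretely, any of the three should work; /mnt/ramdisk is just a placeholder
for wherever the ramdisk or tmpfs is mounted:

  # command line
  mpirun --mca orte_tmpdir_base /mnt/ramdisk -np 4 ./app

  # environment variable
  export OMPI_MCA_orte_tmpdir_base=/mnt/ramdisk

  # default MCA parameter file
  echo "orte_tmpdir_base = /mnt/ramdisk" >> $HOME/.openmpi/mca-params.conf
)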

> And another question:
> What would be a good size for the ramdisk? The FAQ suggests a general
> value of 128MB, but what is your experience?
> (maybe a large topic by itself, so I have to try it out, I guess)

I don't recall the default shared memory size per process, but you can get that 
value from ompi_info --param btl sm. Take that value, multiply by your expected 
ppn, and then give yourself a fudge factor.
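
(Following that heuristic with made-up numbers, just to illustrate the
arithmetic: if ompi_info reported roughly 32 MB of shared-memory backing per
process and you expect 8 processes per node, that suggests 8 x 32 MB = 256 MB,
plus some headroom -- say 384 or 512 MB for the ramdisk. Note that the sizing
model changed in later Open MPI releases; see Eugene's follow-up above.)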


> 
> Thanks a lot.
> 
> Oli




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-19 Thread Oliver Geisler


> Ah, that could do it.  Open MPI's shared memory files are under /tmp.  So if 
> /tmp is NFS, you could get extremely high latencies because of dirty page 
> writes out through NFS.
> 
> You don't necessarily have to make /tmp disk-full -- if you just make OMPI's 
> session directories go into a ramdisk instead of to NFS, that should also be 
> sufficient.
> 

I just browsed the FAQ and "ompi_info --param all all", but didn't find the
answer:
How do I set the OMPI session directory to point to a ramdisk?

And another question:
What would be a good size for the ramdisk? The FAQ suggests a general
value of 128MB, but what is your experience?
(maybe a large topic by itself, so I have to try it out, I guess)

Thanks a lot.

Oli





Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-12 Thread Jeff Squyres
On Apr 12, 2010, at 11:10 AM, Oliver Geisler wrote:

> > Is the /tmp filesystem on NFS by any chance?
> 
> Yes, /tmp is on NFS .. those are diskless nodes, with no local disks and
> no swap space mounted.

Ah, that could do it.  Open MPI's shared memory files are under /tmp.  So if 
/tmp is NFS, you could get extremely high latencies because of dirty page 
writes out through NFS.

You don't necessarily have to make /tmp disk-full -- if you just make OMPI's 
session directories go into a ramdisk instead of to NFS, that should also be 
sufficient.
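
(If it is not obvious what backs /tmp on a given node, something like

  df -hT /tmp

or "mount | grep ' /tmp '" shows the filesystem type -- nfs vs. tmpfs/ext3 --
which is the quickest way to spot this situation.)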

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-12 Thread Ralph Castain
In  that scenario, you need to set the session directories to point somewhere 
other than /tmp. I believe you will find that in our FAQs as this has been a 
recurring problem. The shared memory backing file resides in the session 
directory tree, so if that is NFS mounted, your performance will stink.

People with that setup generally point the session dir at a ramdisk area, but 
anywhere in ram will do.
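
(To confirm where the session directory currently lands: the tree appears
under the tmpdir while a job is running. The exact naming varies by version,
but something along the lines of

  ls -d /tmp/openmpi-sessions-*

on a compute node during a run should show it; the sm backing file lives
inside that tree.)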

On Apr 12, 2010, at 9:10 AM, Oliver Geisler wrote:

> 
> Quoting Ashley Pittman :
> 
>> 
>> On 10 Apr 2010, at 04:51, Eugene Loh wrote:
>> 
>>> Why is shared-memory performance about four orders of magnitude slower than 
>>> it should be?  The processes are communicating via memory that's shared by 
>>> having the processes all mmap the same file into their address spaces.  Is 
>>> it possible that with the newer kernels, operations to that shared file are 
>>> going all the way out to disk?  Maybe you don't know the answer, but 
>>> hopefully someone on this mail list can provide some insight.
>> 
>> Is the /tmp filesystem on NFS by any chance?
>> 
> 
> Yes, /tmp is on NFS .. those are diskless nodes, with no local disks and no
> swap space mounted.
> 
> Maybe I should set up one of the nodes with a disk, so I could try the
> difference.
> 
> (Sorry, but I may return results next week, since I am out of office right
> now)
> 
> Thanks
> oli
> 
> 
> 
>> Ashley,
>> 
>> --
>> 
>> Ashley Pittman, Bath, UK.
>> 
>> Padb - A parallel job inspection tool for cluster computing
>> http://padb.pittman.org.uk




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-12 Thread Oliver Geisler


Quoting Ashley Pittman :

> On 10 Apr 2010, at 04:51, Eugene Loh wrote:
>
>> Why is shared-memory performance about four orders of magnitude slower
>> than it should be?  The processes are communicating via memory that's
>> shared by having the processes all mmap the same file into their address
>> spaces.  Is it possible that with the newer kernels, operations to that
>> shared file are going all the way out to disk?  Maybe you don't know the
>> answer, but hopefully someone on this mail list can provide some insight.
>
> Is the /tmp filesystem on NFS by any chance?

Yes, /tmp is on NFS .. those are diskless nodes, with no local disks and
no swap space mounted.

Maybe I should set up one of the nodes with a disk, so I could try the
difference.

(Sorry, but I may return results next week, since I am out of office
right now)

Thanks
oli

> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk





Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-11 Thread Chris Samuel

On 10/04/10 06:59, Oliver Geisler wrote:
> These are the results of skampi pt2pt, first with shared memory allowed,
> second with shared memory excluded.

For what it's worth I can't replicate those results on an AMD Shanghai
cluster running a 2.6.32 kernel and Open-MPI 1.4.1.

Here is what I see (run under Torque, selecting 2 cores on the same
node, so no need to specify -np):

$ mpirun --mca btl self,sm,tcp  ./skampi -i ski/skampi_pt2pt.ski

# begin result "Pingpong_Send_Recv"
count=      1       4       2.0      0.0      16       2.0       1.8
count=      2       8       2.1      0.0      16       2.1       1.8
count=      3      12       2.1      0.1       8       2.0       2.0
count=      4      16       2.1      0.1       8       2.0       2.0
count=      6      24       2.0      0.0      16       2.0       1.8
count=      8      32       2.9      0.0      16       2.7       2.4
count=     11      44       2.3      0.1      16       2.2       2.0
count=     16      64       2.2      0.1      16       2.1       2.0
count=     23      92       2.7      0.2      16       2.6       2.1
count=     32     128       2.5      0.1      16       2.5       2.1
count=     45     180       3.0      0.0      16       2.8       2.6
count=     64     256       3.1      0.0       8       3.0       2.5
count=     91     364       3.1      0.0       8       3.0       3.0
count=    128     512       3.4      0.2      16       3.3       3.0
count=    181     724       4.1      0.0      16       4.0       4.1
count=    256    1024       5.0      0.0       8       4.5       4.5
count=    362    1448       6.0      0.0      16       5.8       5.7
count=    512    2048       7.7      0.1      16       7.3       7.6
count=    724    2896      10.0      0.0       8      10.0       9.8
count=   1024    4096      12.3      0.1      16      12.1      12.0
count=   1448    5792      13.8      0.2       8      13.5      13.4
count=   2048    8192      18.1      0.0      16      17.9      18.1
count=   2896   11584      25.0      0.0      16      24.9      25.0
count=   4096   16384      34.2      0.1      16      34.0      34.2
# end result "Pingpong_Send_Recv"
# duration = 0.00 sec

mpirun --mca btl tcp,self  ./skampi -i ski/skampi_pt2pt.ski

# begin result "Pingpong_Send_Recv"
count=      1       4      21.2      1.0      16      20.1      17.8
count=      2       8      20.8      1.0      16      20.6      16.7
count=      3      12      20.2      0.9      16      19.0      17.1
count=      4      16      19.9      1.0      16      19.0      17.0
count=      6      24      21.1      1.1      16      20.6      17.0
count=      8      32      20.0      1.0      16      18.8      17.1
count=     11      44      20.9      0.8      16      20.0      17.1
count=     16      64      21.7      1.1      16      20.5      17.6
count=     23      92      21.7      1.0      16      20.0      18.5
count=     32     128      21.6      1.0      16      20.5      18.5
count=     45     180      22.0      1.0      16      20.9      19.0
count=     64     256      21.8      0.7      16      20.5      20.2
count=     91     364      20.5      0.3      16      19.8      19.1
count=    128     512      18.5      0.3       8      17.5      18.1
count=    181     724      19.3      0.2       8      19.1      19.0
count=    256    1024      20.3      0.3      16      19.7      20.0
count=    362    1448      22.1      0.3      16      21.2      21.4
count=    512    2048      24.2      0.3      16      23.7      23.2
count=    724    2896      24.8      0.5       8      24.0      24.0
count=   1024    4096      26.8      0.2      16      26.1      26.3
count=   1448    5792      31.6      0.3      16      30.4      31.5
count=   2048    8192      38.0      0.6      16      37.3      37.1
count=   2896   11584      52.1      1.4      16      49.1      50.8
count=   4096   16384      93.8      1.1      16      81.1      91.5
# end result "Pingpong_Send_Recv"
# duration = 0.02 sec

cheers,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-11 Thread Chris Samuel

On 10/04/10 15:12, Bogdan Costescu wrote:
> Have there been any process scheduler changes in the newer kernels ?

Are there ever kernels where that doesn't get tweaked ? ;-)

> I'm not sure that they could explain four orders of magnitude
> differences though...


One idea that comes to mind would be to run the child processes
under strace -c, as that will monitor all the system calls and
report how long is spent in each.  Running that comparison on
2.6.23 and on 2.6.24 might then point you to which syscall(s)
are taking longer.

Alternatively, if you want to get fancy, you could do a git
bisection between 2.6.23 and 2.6.24 to track down the commit
that introduces the regression.

To be honest, it would be interesting to see whether the issue still
manifests on a recent kernel; if so, perhaps we can get the kernel
developers interested (though they will likely ask for a bisection too).
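
(A sketch of both suggestions; ./my_benchmark is a placeholder for the actual
program:

  # per-rank syscall summaries, printed to each rank's stderr on exit
  mpirun -np 4 strace -c ./my_benchmark

  # bisecting the kernel between the two releases (run in a kernel git tree)
  git bisect start
  git bisect bad  v2.6.24
  git bisect good v2.6.23
  # build, boot and test the suggested commit, then mark it:
  git bisect good     # or: git bisect bad
)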

cheers!
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-10 Thread Ashley Pittman

On 10 Apr 2010, at 04:51, Eugene Loh wrote:

> Why is shared-memory performance about four orders of magnitude slower than 
> it should be?  The processes are communicating via memory that's shared by 
> having the processes all mmap the same file into their address spaces.  Is it 
> possible that with the newer kernels, operations to that shared file are 
> going all the way out to disk?  Maybe you don't know the answer, but 
> hopefully someone on this mail list can provide some insight.

Is the /tmp filesystem on NFS by any chance?

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-10 Thread Eugene Loh




Bogdan Costescu wrote:
> On Sat, Apr 10, 2010 at 5:51 AM, Eugene Loh wrote:
>> Why is shared-memory performance about four orders of magnitude slower than
>> it should be?
>
> Have there been any process scheduler changes in the newer kernels ?
> I'm not sure that they could explain four orders of magnitude
> differences though...

Plus, the TCP numbers seem okay.  So, it doesn't feel like that or
process-binding issues.




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-10 Thread Bogdan Costescu
On Sat, Apr 10, 2010 at 5:51 AM, Eugene Loh  wrote:
> Why is shared-memory performance about four orders of magnitude slower than
> it should be?

Have there been any process scheduler changes in the newer kernels ?
I'm not sure that they could explain four orders of magnitude
differences though...

Cheers,
Bogdan


Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-09 Thread Eugene Loh

Oliver Geisler wrote:
> These are the results of skampi pt2pt, first with shared memory allowed,
> second with shared memory excluded.

Thanks for the data.  The TCP results are not very interesting... they
look reasonable.

The shared-memory data is rather straightforward: results are just
plain ridiculously bad.  The results for "eager" messages (messages
shorter than 4 Kbytes) are around 12 millisec.  The results for
"rendezvous" messages (longer than 4 Kbytes: signal the receiver, wait
for an acknowledgement, then send the message) are about 30 millisec.
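
(The 4-Kbyte eager/rendezvous switchover is itself an MCA parameter, so --
assuming the parameter name used by the 1.3/1.4 series -- it can be checked
on a given build with:

  ompi_info --param btl sm | grep eager_limit

which should show btl_sm_eager_limit and its current value.)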


I was also curious about "long-message bandwidth", but since SKaMPI is 
only going up to 16 Kbyte messages, we can't really tell.


But maybe all that is irrelevant.

Why is shared-memory performance about four orders of magnitude slower 
than it should be?  The processes are communicating via memory that's 
shared by having the processes all mmap the same file into their address 
spaces.  Is it possible that with the newer kernels, operations to that 
shared file are going all the way out to disk?  Maybe you don't know the 
answer, but hopefully someone on this mail list can provide some insight.


Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-09 Thread Oliver Geisler
Sorry for replying late. Unfortunately I am not a "full time
administrator", and I am going to a conference next week, so please
be patient with my replies.

On 4/7/2010 6:56 PM, Eugene Loh wrote:
> Oliver Geisler wrote:
> 
>> Using netpipe and comparing tcp and mpi communication I get the
>> following results:
>>
>> TCP is much faster than MPI, approx. by factor 12
>>  
>>
> Faster?  12x?  I don't understand the following:
> 
>> e.g a packet size of 4096 bytes deliveres in
>> 97.11 usec with NPtcp and
>> 15338.98 usec with NPmpi
>>  
>>
> This implies NPtcp is 160x faster than NPmpi.
> 

The ratio NPtcp/NPmpi has a mean value of about a factor of 60 for small
packet sizes (<4kB), with a maximum of 160 at 4kB (so that was a bad value
to pick out in the first place); it then drops to about 40 for packet sizes
around 16kB and falls below a factor of 20 for packets larger than 100kB.


>> or
>> packet size 262kb
>> 0.05268801 sec NPtcp
>> 0.00254560 sec NPmpi
>>  
>>
> This implies NPtcp is 20x slower than NPmpi.
> 

Sorry, my fault ... vice versa, should read:
packet size 262kb
0.00254560 sec NPtcp
0.05268801 sec NPmpi


>> Further our benchmark started with "--mca btl tcp,self" runs with short
>> communication times, even using kernel 2.6.33.1
>>
>> Is there a way to see what type of communication is actually selected?
>>
>> Can anybody imagine why shared memory leads to these problems?
>>  
>>
> Okay, so it's a shared-memory performance problem since:
> 
> 1) You get better performance when you exclude sm explicitly with "--mca
> btl tcp,self".
> 2) You get better performance when you exclude sm by distributing one
> process per node (an observation you made relatively early in this thread).
> 3) TCP is faster than MPI (which is presumably using sm).
> 
> Can you run a pingpong test as a function of message length for two
> processes in a way that demonstrates the problem?  For example, if
> you're comfortable with SKaMPI, just look at Pingpong_Send_Recv and
> let's see what performance looks like as a function of message length. 
> Maybe this is a short-message-latency problem.

These are the results of skampi pt2pt, first with shared memory allowed,
second with shared memory excluded.
It doesn't look to me as if the long message times are related to short
messages.
Including hosts over ethernet results in higher communication times,
which are equal to those when I ping the host (a hundred+ milliseconds).

mpirun --mca btl self,sm,tcp -np 2 ./skampi -i ski/skampi_pt2pt.ski

# begin result "Pingpong_Send_Recv"
count=      1       4     12756.0    307.4    16     11555.3     11011.2
count=      2       8      9902.8    629.0    16      9615.4      8601.0
count=      3      12     12547.5    881.0    16     12233.1     11229.2
count=      4      16     12087.2    829.6    16     11610.6     10478.6
count=      6      24     13634.4    352.1    16     11247.8     12621.9
count=      8      32     13835.8    282.2    16     11091.7     12944.6
count=     11      44     13328.9    864.6    16     12095.6     11977.0
count=     16      64     13195.2    432.3    16     11460.4     10051.9
count=     23      92     13849.3    532.5    16     12476.9     12998.1
count=     32     128     14202.2    436.4    16     11923.8     12977.4
count=     45     180     14026.3    637.7    16     13042.5     12767.8
count=     64     256     13475.8    466.7    16     11720.4     12521.3
count=     91     364     14015.0    406.1    16     13300.4     12881.6
count=    128     512     13481.3    870.6    16     11187.7     12070.6
count=    181     724     10697.1     98.4    16     10697.1      9520.1
count=    256    1024     14120.8    602.1    16     13988.2     11349.9
count=    362    1448     15718.2    582.3    16     14468.2     12535.2
count=    512    2048     11214.9    749.1    16     11155.0      9928.5
count=    724    2896     15127.3    186.1    16     15127.3     10974.9
count=   1024    4096     34045.0    692.2    16     32963.6     31728.1
count=   1448    5792     29965.9    788.1    16     27997.8     27404.4
count=   2048    8192     30082.1    785.3    16     28023.9     29538.5
count=   2896   11584     32556.0    219.4    16     29312.2     32290.4
count=   4096   16384     24999.8    839.6    16     23422.0     23644.6
# end result "Pingpong_Send_Recv"
# duration = 10.15 sec

mpirun --mca btl tcp,self -np 2 ./skampi -i ski/skampi_pt2pt.ski

# begin result "Pingpong_Send_Recv"
count=      1       4      14.5      0.3      16      13.5      13.2
count=      2       8      13.5      0.2       8      12.9      12.4
count=      3      12      13.1      0.4      16      12.7      11.3
count=      4      16      13.9      0.4      16      12.7      13.0
count=      6      24      13.8      0.4      16      12.5      12.8
count=      8      32      13.8      0.4      16      12.7      13.0
count=     11      44      14.0      0.3      16      12.8      13.0
count=     16      64      13.5      0.5      16      12.3      12.4
count=     23      92      13.9      0.4      16      13.1      12.7
count=     32     128      14.8      0.1      16

Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-07 Thread Eugene Loh

Oliver Geisler wrote:
> Using netpipe and comparing tcp and mpi communication I get the
> following results:
>
> TCP is much faster than MPI, approx. by factor 12

Faster?  12x?  I don't understand the following:

> e.g a packet size of 4096 bytes deliveres in
> 97.11 usec with NPtcp and
> 15338.98 usec with NPmpi

This implies NPtcp is 160x faster than NPmpi.

> or
> packet size 262kb
> 0.05268801 sec NPtcp
> 0.00254560 sec NPmpi

This implies NPtcp is 20x slower than NPmpi.

> Further our benchmark started with "--mca btl tcp,self" runs with short
> communication times, even using kernel 2.6.33.1
>
> Is there a way to see what type of communication is actually selected?
>
> Can anybody imagine why shared memory leads to these problems?

Okay, so it's a shared-memory performance problem since:

1) You get better performance when you exclude sm explicitly with "--mca
btl tcp,self".
2) You get better performance when you exclude sm by distributing one
process per node (an observation you made relatively early in this thread).
3) TCP is faster than MPI (which is presumably using sm).

Can you run a pingpong test as a function of message length for two
processes in a way that demonstrates the problem?  For example, if
you're comfortable with SKaMPI, just look at Pingpong_Send_Recv and
let's see what performance looks like as a function of message length.
Maybe this is a short-message-latency problem.


Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-07 Thread Oliver Geisler
On 4/6/2010 5:09 PM, Jeff Squyres wrote:
> On Apr 6, 2010, at 6:04 PM, Oliver Geisler wrote:
> 
>> Further our benchmark started with "--mca btl tcp,self" runs with short
>> communication times, even using kernel 2.6.33.1
> 
> I'm not sure what this statement means (^^).  Can you explain?
> 
We first noticed the problem when upgrading our hardware, which forced us
to upgrade the running kernel version in order to get the network cards
working.
I used a typical application that we use on the cluster (an in-house
development) to benchmark old vs. new hardware. There I saw a performance
drop instead of the expected increase.
Searching for the loss of performance, we figured out that the pure
computation time on each data packet shows the expected improvement from
the faster hardware, but the communication times between the master and
the slave processes increased greatly.
Furthermore, we narrowed the problem down to kernel versions newer than
2.6.23 (2.6.23 itself we could not use, because the network cards aren't
supported there yet).
Now that I run the program with the mpirun option "--mca btl tcp,self", I
get shortened communication times (and overall completion times as
expected), even running on a new node with kernel version 2.6.33.1.

>> Is there a way to see what type of communication is actually selected?
> 
> If you "--mca btl tcp,self" is used, then TCP sockets are used for non-self 
> communications (i.e., communications with peer MPI processes, regardless of 
> location).
> 
>> Can anybody imagine why shared memory leads to these problems?
> 
> I'm not sure I understand this -- if "--mca btl tcp,self", shared memory is 
> not used...?
> 
When I use "--mca btl sm,self", I get the issue, so my guess is it has
something to do with shared memory?

> re-reading your email, I'm wondering: did you run the NPmpi process with 
> "--mca btl tcp,sm,self" (or no --mca btl param)?  That might explain some of 
> my confusion, above.
> 
I ran NPmpi without an explicit mca-btl option, which should default to
the setting in /etc/openmpi/openmpi-mca-params.conf:
btl = self,sm,tcp



-- 
-

Oliver Geisler

TERRASYS Geophysics
3100 Wilcrest Drive  www.terrasysgeo.com
Suite 325
 Tel: +1-713-893-3630
Houston, TX 77042Fax: +1-713-893-3631
United States
 e-mail: geis...@terrasysgeo.com

-

TERRASYS Geophysics USA Inc. UBI#: 602 171 274
15131 Carter Loop SE FEIN: 52-726308
Yelm, WA 98597
-





Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Jeff Squyres
On Apr 6, 2010, at 6:04 PM, Oliver Geisler wrote:

> Using netpipe and comparing tcp and mpi communication I get the
> following results:
> 
> TCP is much faster than MPI, approx. by factor 12
> e.g a packet size of 4096 bytes deliveres in
> 97.11 usec with NPtcp and
> 15338.98 usec with NPmpi
> or
> packet size 262kb
> 0.05268801 sec NPtcp
> 0.00254560 sec NPmpi

Well that's not good (for us).  :-\

> Further our benchmark started with "--mca btl tcp,self" runs with short
> communication times, even using kernel 2.6.33.1

I'm not sure what this statement means (^^).  Can you explain?

> Is there a way to see what type of communication is actually selected?

If you "--mca btl tcp,self" is used, then TCP sockets are used for non-self 
communications (i.e., communications with peer MPI processes, regardless of 
location).

> Can anybody imagine why shared memory leads to these problems?

I'm not sure I understand this -- if "--mca btl tcp,self", shared memory is not 
used...?

re-reading your email, I'm wondering: did you run the NPmpi process with 
"--mca btl tcp,sm,self" (or no --mca btl param)?  That might explain some of my 
confusion, above.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Oliver Geisler
On 4/6/2010 2:54 PM, Jeff Squyres wrote:
> Sorry for the delay -- I just replied on the user list -- I think the first 
> thing to do is to establish baseline networking performance and see if that 
> is out of whack.  If the underlying network is bad, then MPI performance will 
> also be bad.
> 
> 

Using netpipe and comparing tcp and mpi communication I get the
following results:

TCP is much faster than MPI, approx. by factor 12
e.g a packet size of 4096 bytes deliveres in
97.11 usec with NPtcp and
15338.98 usec with NPmpi
or
packet size 262kb
0.05268801 sec NPtcp
0.00254560 sec NPmpi

Further our benchmark started with "--mca btl tcp,self" runs with short
communication times, even using kernel 2.6.33.1

Is there a way to see what type of communication is actually selected?

Can anybody imagine why shared memory leads to these problems?
Kernel configuration?


Thanks, Jeff, for insisting upon testing network performance.
Thanks all others, too ;-)

oli


> On Apr 6, 2010, at 11:51 AM, Oliver Geisler wrote:
> 
>> On 4/6/2010 10:11 AM, Rainer Keller wrote:
>>> Hello Oliver,
>>> Hmm, this is really a teaser...
>>> I haven't seen such a drastic behavior, and haven't read of any on the list.
>>>
>>> One thing however, that might interfere is process binding.
>>> Could You make sure, that processes are not bound to cores (default in 
>>> 1.4.1):
>>> with mpirun --bind-to-none
>>>
>>
>> I have tried version 1.4.1. Using default settings and watched processes
>> switching from core to core in "top" (with "f" + "j"). Then I tried
>> --bind-to-core and explicitly --bind-to-none. All with the same result:
>> ~20% cpu wait and lot longer over-all computation times.
>>
>> Thanks for the idea ...
>> Every input is helpful.
>>
>> Oli
>>
>>
>>> Just an idea...
>>>
>>> Regards,
>>> Rainer
>>>
>>> On Tuesday 06 April 2010 10:07:35 am Oliver Geisler wrote:
 Hello Devel-List,

 I am a little bit helpless about this matter. I already posted in the
 user list. In case you don't read the users list, I post in here.

 This is the original posting:

 http://www.open-mpi.org/community/lists/users/2010/03/12474.php

 Short:
 Switching from kernel 2.6.23 to 2.6.24 (and up), using openmpi 1.2.7-rc2
 (I know outdated, but in debian stable, and same results with 1.4.1)
 increases communication times between processes (essentially between one
 master and several slave processes). This is regardless of whether the
 processes are local only or communication is over ethernet.

 Did anybody witness such a behavior?

 Ideas what should I test for?

 What additional information should I provide for you?

 Thanks for your time

 oli

>>>
>>
>>



Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Jeff Squyres
On Apr 6, 2010, at 4:29 PM, Oliver Geisler wrote:

> > Sorry for the delay -- I just replied on the user list -- I think the first 
> > thing to do is to establish baseline networking performance and see if that 
> > is out of whack.  If the underlying network is bad, then MPI performance 
> > will also be bad.
> 
> Could make sense. With kernel 2.6.24 it seems a major change in the
> modules for Intel PCI-Express network cards was introduced.
> Does openmpi use TCP communication, even if all processes are on the
> same local node?

It depends.  :-)

The "--mca btl sm,self,tcp" option to mpirun tells Open MPI to use shared 
memory, tcp, and process-loopback for MPI point-to-point communications.  Open 
MPI computes a reachability / priority map and uses the highest priority plugin 
that is reachable for each peer MPI process.

Meaning that on each node, if you allow "sm" to be used, "sm" should be used
for all on-node communications.  If you had only said "--mca btl tcp,self",
then you're only allowing Open MPI to use TCP for all non-self MPI
point-to-point communications.

The default -- if you don't specify "--mca btl" at all -- is for Open MPI
to figure it out automatically and use whatever networks it can find.  In your
case, I'm guessing that it's pretty much identical to specifying "--mca btl
tcp,sm,self".
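
(On the earlier question of how to see which transport is actually selected,
one way -- assuming btl_base_verbose behaves here as it does on other 1.3/1.4
builds -- is to raise the BTL verbosity, e.g.:

  mpirun --mca btl_base_verbose 30 --mca btl tcp,sm,self -np 2 ./NPmpi

Each process then logs which BTL components it opened, selected or excluded
during startup.)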

Another good raw TCP performance program that network wonks are familiar with 
is netperf.  NetPipe is nice because it allows an apples-to-apples comparison 
of TCP and MPI (i.e., it's the same benchmark app that can use either TCP or 
MPI [or several other] transports underneath).  But netperf might be a bit more 
familiar to those outside the HPC community.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Oliver Geisler
On 4/6/2010 2:54 PM, Jeff Squyres wrote:
> Sorry for the delay -- I just replied on the user list -- I think the first 
> thing to do is to establish baseline networking performance and see if that 
> is out of whack.  If the underlying network is bad, then MPI performance will 
> also be bad.
> 

Could make sense. With kernel 2.6.24 it seems a major change in the
modules for Intel PCI-Express network cards was introduced.
Does openmpi use TCP communication, even if all processes are on the
same local node?


> 
> On Apr 6, 2010, at 11:51 AM, Oliver Geisler wrote:
> 
>> On 4/6/2010 10:11 AM, Rainer Keller wrote:
>>> Hello Oliver,
>>> Hmm, this is really a teaser...
>>> I haven't seen such a drastic behavior, and haven't read of any on the list.
>>>
>>> One thing however, that might interfere is process binding.
>>> Could You make sure, that processes are not bound to cores (default in 
>>> 1.4.1):
>>> with mpirun --bind-to-none
>>>
>>
>> I have tried version 1.4.1. Using default settings and watched processes
>> switching from core to core in "top" (with "f" + "j"). Then I tried
>> --bind-to-core and explicitly --bind-to-none. All with the same result:
>> ~20% cpu wait and lot longer over-all computation times.
>>
>> Thanks for the idea ...
>> Every input is helpful.
>>
>> Oli
>>
>>
>>> Just an idea...
>>>
>>> Regards,
>>> Rainer
>>>
>>> On Tuesday 06 April 2010 10:07:35 am Oliver Geisler wrote:
 Hello Devel-List,

 I am a little bit helpless about this matter. I already posted in the
 user list. In case you don't read the users list, I post in here.

 This is the original posting:

 http://www.open-mpi.org/community/lists/users/2010/03/12474.php

 Short:
 Switching from kernel 2.6.23 to 2.6.24 (and up), using openmpi 1.2.7-rc2
 (I know outdated, but in debian stable, and same results with 1.4.1)
 increases communication times between processes (essentially between one
 master and several slave processes). This is regardless of whether the
 processes are local only or communication is over ethernet.

 Did anybody witness such a behavior?

 Ideas what should I test for?

 What additional information should I provide for you?

 Thanks for your time

 oli

>>>
>>
>>



Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Jeff Squyres
Sorry for the delay -- I just replied on the user list -- I think the first 
thing to do is to establish baseline networking performance and see if that is 
out of whack.  If the underlying network is bad, then MPI performance will also 
be bad.


On Apr 6, 2010, at 11:51 AM, Oliver Geisler wrote:

> On 4/6/2010 10:11 AM, Rainer Keller wrote:
> > Hello Oliver,
> > Hmm, this is really a teaser...
> > I haven't seen such a drastic behavior, and haven't read of any on the list.
> >
> > One thing however, that might interfere is process binding.
> > Could You make sure, that processes are not bound to cores (default in 
> > 1.4.1):
> > with mpirun --bind-to-none
> >
> 
> I have tried version 1.4.1. Using default settings and watched processes
> switching from core to core in "top" (with "f" + "j"). Then I tried
> --bind-to-core and explicitly --bind-to-none. All with the same result:
> ~20% cpu wait and lot longer over-all computation times.
> 
> Thanks for the idea ...
> Every input is helpful.
> 
> Oli
> 
> 
> > Just an idea...
> >
> > Regards,
> > Rainer
> >
> > On Tuesday 06 April 2010 10:07:35 am Oliver Geisler wrote:
> >> Hello Devel-List,
> >>
> >> I am a little bit helpless about this matter. I already posted in the
> >> user list. In case you don't read the users list, I post in here.
> >>
> >> This is the original posting:
> >>
> >> http://www.open-mpi.org/community/lists/users/2010/03/12474.php
> >>
> >> Short:
> >> Switching from kernel 2.6.23 to 2.6.24 (and up), using openmpi 1.2.7-rc2
> >> (I know outdated, but in debian stable, and same results with 1.4.1)
> >> increases communication times between processes (essentially between one
> >> master and several slave processes). This is regardless of whether the
> >> processes are local only or communication is over ethernet.
> >>
> >> Did anybody witness such a behavior?
> >>
> >> Ideas what should I test for?
> >>
> >> What additional information should I provide for you?
> >>
> >> Thanks for your time
> >>
> >> oli
> >>
> >
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Oliver Geisler
On 4/6/2010 10:11 AM, Rainer Keller wrote:
> Hello Oliver,
> Hmm, this is really a teaser...
> I haven't seen such a drastic behavior, and haven't read of any on the list.
> 
> One thing however, that might interfere is process binding.
> Could You make sure, that processes are not bound to cores (default in 1.4.1):
> with mpirun --bind-to-none 
> 

I have tried version 1.4.1. Using default settings and watched processes
switching from core to core in "top" (with "f" + "j"). Then I tried
--bind-to-core and explicitly --bind-to-none. All with the same result:
~20% cpu wait and lot longer over-all computation times.

Thanks for the idea ...
Every input is helpful.

Oli


> Just an idea...
> 
> Regards,
> Rainer
> 
> On Tuesday 06 April 2010 10:07:35 am Oliver Geisler wrote:
>> Hello Devel-List,
>>
>> I am a little bit helpless about this matter. I already posted in the
>> user list. In case you don't read the users list, I post in here.
>>
>> This is the original posting:
>>
>> http://www.open-mpi.org/community/lists/users/2010/03/12474.php
>>
>> Short:
>> Switching from kernel 2.6.23 to 2.6.24 (and up), using openmpi 1.2.7-rc2
>> (I know outdated, but in debian stable, and same results with 1.4.1)
>> increases communication times between processes (essentially between one
>> master and several slave processes). This is regardless of whether the
>> processes are local only or communication is over ethernet.
>>
>> Did anybody witness such a behavior?
>>
>> Ideas what should I test for?
>>
>> What additional information should I provide for you?
>>
>> Thanks for your time
>>
>> oli
>>
> 





Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-06 Thread Rainer Keller
Hello Oliver,
Hmm, this is really a teaser...
I haven't seen such a drastic behavior, and haven't read of any on the list.

One thing however, that might interfere is process binding.
Could You make sure, that processes are not bound to cores (default in 1.4.1):
with mpirun --bind-to-none 

Just an idea...

Regards,
Rainer

On Tuesday 06 April 2010 10:07:35 am Oliver Geisler wrote:
> Hello Devel-List,
> 
> I am a little bit helpless about this matter. I already posted in the
> user list. In case you don't read the users list, I post in here.
> 
> This is the original posting:
> 
> http://www.open-mpi.org/community/lists/users/2010/03/12474.php
> 
> Short:
> Switching from kernel 2.6.23 to 2.6.24 (and up), using openmpi 1.2.7-rc2
> (I know outdated, but in debian stable, and same results with 1.4.1)
> increases communication times between processes (essentially between one
> master and several slave processes). This is regardless of whether the
> processes are local only or communication is over ethernet.
> 
> Did anybody witness such a behavior?
> 
> Ideas what should I test for?
> 
> What additional information should I provide for you?
> 
> Thanks for your time
> 
> oli
> 

-- 

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink