Re: [OMPI devel] New OMPI MPI extension

2010-04-22 Thread Rayson Ho
Hi Jeff,

There's a typo in trunk/README:

-> 1175 ...unrelated to wach other

I guess you mean "unrelated to each other".

Rayson



On Wed, Apr 21, 2010 at 12:35 PM, Jeff Squyres  wrote:
> Per the telecon Tuesday, I committed a new OMPI MPI extension to the trunk:
>
>    https://svn.open-mpi.org/trac/ompi/changeset/23018
>
> Please read the commit message and let me know what you think.  Suggestions 
> are welcome.
>
> If everyone is ok with it, I'd like to see this functionality hit the 1.5 
> series at some point.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



[OMPI devel] sendrecv_replace: long time to allocate/free memory

2010-04-22 Thread Pascal Deveze

Hi all,

The sendrecv_replace implementation in Open MPI seems to allocate/free memory
with MPI_Alloc_mem()/MPI_Free_mem().


I measured the time to allocate/free a buffer of 1MB.
MPI_Alloc_mem/MPI_Free_mem take 350us while malloc/free only take 8us.
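
A measurement of this kind can be reproduced with a small loop along these
lines (just a sketch: the buffer size and iteration count are arbitrary, and
the numbers will depend on the memory hooks the MPI library installs):

/* Sketch: compare MPI_Alloc_mem/MPI_Free_mem with malloc/free for 1 MB. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const size_t len = 1 << 20;   /* 1 MB */
    const int iters = 100;
    double t0, t_mpi, t_malloc;
    void *buf;
    int i;

    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        MPI_Alloc_mem((MPI_Aint)len, MPI_INFO_NULL, &buf);
        ((char *)buf)[0] = 0;     /* touch the buffer so the pair is not optimized away */
        MPI_Free_mem(buf);
    }
    t_mpi = (MPI_Wtime() - t0) / iters;

    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        buf = malloc(len);
        ((char *)buf)[0] = 0;
        free(buf);
    }
    t_malloc = (MPI_Wtime() - t0) / iters;

    printf("MPI_Alloc_mem/MPI_Free_mem: %.1f us   malloc/free: %.1f us\n",
           t_mpi * 1e6, t_malloc * 1e6);

    MPI_Finalize();
    return 0;
}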

malloc/free in ompi/mpi/c/sendrecv_replace.c was replaced by
MPI_Alloc_mem/MPI_Free_mem in this commit:


user: twoodall
date: Thu Sep 22 16:43:17 2005 +
summary: use MPI_Alloc_mem/MPI_Free_mem for internally allocated buffers


Is there a real reason to use these functions, or can we move back to
malloc/free?
Is there a problem in my configuration that explains such slow performance
with MPI_Alloc_mem?
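
For context, here is a much-simplified sketch of the pattern in question
(this is NOT the actual Open MPI implementation, which uses convertors and
handles non-contiguous datatypes; it only shows why a temporary buffer is
allocated and freed on every call):

/* Simplified sendrecv_replace pattern: the same user buffer is both sent
 * and received into, so the outgoing data is stashed in a temporary. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int sendrecv_replace_sketch(void *buf, int count, MPI_Datatype dt,
                                   int dest, int sendtag,
                                   int source, int recvtag,
                                   MPI_Comm comm, MPI_Status *status)
{
    MPI_Aint lb, extent;
    MPI_Request req;
    void *tmp;
    int rc;

    MPI_Type_get_extent(dt, &lb, &extent);
    tmp = malloc((size_t)count * extent);      /* or MPI_Alloc_mem() */
    if (tmp == NULL)
        return MPI_ERR_NO_MEM;

    memcpy(tmp, buf, (size_t)count * extent);  /* stash the outgoing data */
    rc = MPI_Irecv(buf, count, dt, source, recvtag, comm, &req);
    if (rc == MPI_SUCCESS)
        rc = MPI_Send(tmp, count, dt, dest, sendtag, comm);
    if (rc == MPI_SUCCESS)
        rc = MPI_Wait(&req, status);

    free(tmp);                                 /* or MPI_Free_mem() */
    return rc;
}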


Pascal


[OMPI devel] Segmentation fault on x86_64 on heterogeneous environment

2010-04-22 Thread Timur Magomedov
Hello, list.

I am seeing a strange segmentation fault on an x86_64 machine running
together with an x86 machine.
I am running the attached program, which sends some bytes from process 0 to
process 1. My configuration is:
Machine #1: (process 0)
  arch: x86
  hostname: magomedov-desktop
  linux distro: Ubuntu 9.10
  Open MPI: v1.4 configured with --enable-heterogeneous --enable-debug
Machine #2: (process 1)
  arch: x86_64
  hostname: linuxtche
  linux distro: Fedora 12
  Open MPI: v1.4 configured with --enable-heterogeneous
--prefix=/home/magomedov/openmpi/ --enable-debug

They are connected by ethernet.
My user environment on second (x86_64) machine is set up to use Open MPI
from /home/magomedov/openmpi/.

Then I compile the attached program on both machines (at the same path) and
run it. Process 0 on the x86 machine should send data to process 1 on the
x86_64 machine.

First, let's send 65530 bytes:

mpirun -host timur,linuxtche -np
2 /home/magomedov/workspace/mpi-test/mpi-send-test 65530
magomedov@linuxtche's password: 
*** processor magomedov-desktop, comm size is 2, my rank is 0, pid 21875
***
*** processor linuxtche, comm size is 2, my rank is 1, pid 11357 ***
Received 65530 bytes

It's OK. Then let's send 65537 bytes:

magomedov@magomedov-desktop:~/workspace/mpi-test$ mpirun -host
timur,linuxtche -np 2 /home/magomedov/workspace/mpi-test/mpi-send-test
65537
magomedov@linuxtche's password: 
*** processor magomedov-desktop, comm size is 2, my rank is 0, pid 9205
***
*** processor linuxtche, comm size is 2, my rank is 1, pid 28858 ***
[linuxtche:28858] *** Process received signal ***
[linuxtche:28858] Signal: Segmentation fault (11)
[linuxtche:28858] Signal code: Address not mapped (1)
[linuxtche:28858] Failing at address: 0x201143bf8
[linuxtche:28858] [ 0] /lib64/libpthread.so.0() [0x3600c0f0f0]
[linuxtche:28858]
[ 1] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0xfc27)
[0x7f5e94076c27]
[linuxtche:28858]
[ 2] /home/magomedov/openmpi/lib/openmpi/mca_btl_tcp.so(+0xadac)
[0x7f5e935c3dac]
[linuxtche:28858]
[ 3] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27611)
[0x7f5e96575611]
[linuxtche:28858]
[ 4] /home/magomedov/openmpi/lib/libopen-pal.so.0(+0x27c57)
[0x7f5e96575c57]
[linuxtche:28858]
[ 5] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f)
[0x7f5e96575848]
[linuxtche:28858]
[ 6] /home/magomedov/openmpi/lib/libopen-pal.so.0(opal_progress+0x89)
[0x7f5e965648dd]
[linuxtche:28858]
[ 7] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x762f)
[0x7f5e9406e62f]
[linuxtche:28858]
[ 8] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x777d)
[0x7f5e9406e77d]
[linuxtche:28858]
[ 9] /home/magomedov/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8246)
[0x7f5e9406f246]
[linuxtche:28858] [10] /home/magomedov/openmpi/lib/libmpi.so.0(MPI_Recv
+0x2d2) [0x7f5e96af832c]
[linuxtche:28858]
[11] /home/magomedov/workspace/mpi-test/mpi-send-test(main+0x1e4)
[0x400ee8]
[linuxtche:28858] [12] /lib64/libc.so.6(__libc_start_main+0xfd)
[0x360001eb1d]
[linuxtche:28858]
[13] /home/magomedov/workspace/mpi-test/mpi-send-test() [0x400c49]
[linuxtche:28858] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 28858 on node linuxtche
exited on signal 11 (Segmentation fault).
--

If I try to send >= 65537 bytes from x86, I always get a segfault on
x86_64.

I did some investigation and found that the "bad" pointer always has a
valid pointer in its lower 32-bit word and "2" or "1" in its upper word.
The program segfaults in pml_ob1_recvfrag.c, in
mca_pml_ob1_recv_frag_callback_fin(), where the rdma pointer is broken. I
inserted the line
rdma = (mca_btl_base_descriptor_t*)((unsigned long)rdma & 0xffffffff);
which I believe truncates the 64-bit pointer to 32 bits, and the segfaults
disappeared. However, this is not a real solution.

After some investigation with gdb, it looks to me like this pointer was
sent to the x86 machine and came back broken, but I don't understand what
is going on well enough to fix it...
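
Just to illustrate the pattern (a toy example only, NOT the actual ob1
fragment header; it only shows how a 32/64-bit disagreement about the width
of a descriptor field would produce a value of this shape):

/* Toy illustration: a 64-bit descriptor pointer echoed through a peer that
 * treats the field as 32 bits wide comes back with a valid low word and a
 * small integer (here the next header field) in the high word. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct hdr64 {                /* layout assumed on the x86_64 side */
    uint64_t descriptor;
    uint32_t tag;
};

struct hdr32 {                /* layout if a peer used a 32-bit pointer field */
    uint32_t descriptor;
    uint32_t tag;
};

int main(void)
{
    uint64_t descriptor = 0x00007f5e01143bf8ULL;   /* made-up 64-bit address */

    /* The 32-bit peer keeps only the low word and builds its reply with
     * its own, narrower idea of the header layout: */
    struct hdr32 reply32 = { (uint32_t)descriptor, 2 };

    /* The x86_64 side reads those bytes back as the start of ITS layout,
     * i.e. as one 64-bit descriptor field: */
    struct hdr64 received;
    memset(&received, 0, sizeof(received));
    memcpy(&received, &reply32, sizeof(reply32));

    printf("original descriptor: 0x%016" PRIx64 "\n", descriptor);
    printf("received descriptor: 0x%016" PRIx64 "\n", received.descriptor);
    /* On little-endian this prints 0x0000000201143bf8: a valid low word
     * with "2" in the upper word, the same shape as the faulting address. */
    return 0;
}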

Can anyone reproduce it?
I got the same results on openmpi-1.4.2rc1 too.

It looks like the same problem was described in the ompi-users list here:
http://www.open-mpi.org/community/lists/users/2010/02/12182.php

-- 
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int ret;
	int size;
	int rank;
	int name_len;
	char name[MPI_MAX_PROCESSOR_NAME];
	int len;
	int sender = 0;
	int receiver = 1;

	uint8_t *val;
	MPI_Status stat;

	MPI_Init(&argc, &argv);
	MPI_Get_processor_name(name, &name_len);
	MPI_Comm_size(MPI_COMM_WORLD, &size);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	printf("*** processor %s, comm size is %d, my rank is %d, pid %u ***\n", name, size, rank, getpid());
	if (argc != 2) {
		printf("Usage: %s message_length\n", argv[0]);
		exi

Re: [OMPI devel] New OMPI MPI extension

2010-04-22 Thread Jeff Squyres
Fixed -- thanks!

On Apr 22, 2010, at 12:35 AM, Rayson Ho wrote:

> Hi Jeff,
> 
> There's a typo in trunk/README:
> 
> -> 1175 ...unrelated to wach other
> 
> I guess you mean "unrelated to each other".
> 
> Rayson
> 
> 
> 
> On Wed, Apr 21, 2010 at 12:35 PM, Jeff Squyres  wrote:
> > Per the telecon Tuesday, I committed a new OMPI MPI extension to the trunk:
> >
> >https://svn.open-mpi.org/trac/ompi/changeset/23018
> >
> > Please read the commit message and let me know what you think.  Suggestions 
> > are welcome.
> >
> > If everyone is ok with it, I'd like to see this functionality hit the 1.5 
> > series at some point.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Oliver Geisler
To sum up and give an update:

The extended communication times seen with shared memory communication
between Open MPI processes are caused by the Open MPI session directory
living on the network via NFS.

The problem is resolved by setting up a ramdisk or mounting a tmpfs on each
diskless node. By setting the MCA parameter orte_tmpdir_base to point to
that mountpoint, shared memory communication and its backing files are kept
local, which decreases the communication times by orders of magnitude.
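
For example (mountpoint, size and command line are only illustrative, any
local tmpfs will do):

  mkdir -p /tmp/openmpi-session
  mount -t tmpfs -o size=64m tmpfs /tmp/openmpi-session
  mpirun --mca orte_tmpdir_base /tmp/openmpi-session -np 4 ./benchmark

The parameter can also be set in an MCA parameter file (e.g.
~/.openmpi/mca-params.conf) instead of on the command line.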

Whether the problem is really related to the kernel version is not fully
resolved, but the kernel is probably not "the problem" here.
My benchmark now runs fine on a single node with 4 CPUs, kernel 2.6.33.1
and Open MPI 1.4.1.
Running on multiple nodes I still see higher (TCP) communication times than
I would expect, but that requires some deeper investigation on my part
(e.g. collisions on the network) and should probably be posted to a new
thread.

Thank you guys for your help.

oli




Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Kenneth A. Lloyd
Oliver,

Thank you for this summary insight.  This substantially affects the
structural design of software implementations, which points to a new
analysis "opportunity" in our software.

Ken Lloyd

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
Behalf Of Oliver Geisler
Sent: Thursday, April 22, 2010 9:38 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

To sum up and give an update:

The extended communication times seen with shared memory communication
between Open MPI processes are caused by the Open MPI session directory
living on the network via NFS.

The problem is resolved by setting up a ramdisk or mounting a tmpfs on each
diskless node. By setting the MCA parameter orte_tmpdir_base to point to
that mountpoint, shared memory communication and its backing files are kept
local, which decreases the communication times by orders of magnitude.

Whether the problem is really related to the kernel version is not fully
resolved, but the kernel is probably not "the problem" here.
My benchmark now runs fine on a single node with 4 CPUs, kernel 2.6.33.1
and Open MPI 1.4.1.
Running on multiple nodes I still see higher (TCP) communication times than
I would expect, but that requires some deeper investigation on my part
(e.g. collisions on the network) and should probably be posted to a new
thread.

Thank you guys for your help.

oli


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Rainer Keller
Hello Oliver,
thanks for the update.

Just my $0.02: the upcoming Open MPI v1.5 will warn users if their session
directory is on NFS (or Lustre).

Best regards,
Rainer


On Thursday 22 April 2010 11:37:48 am Oliver Geisler wrote:
> To sum up and give an update:
> 
> The extended communication times seen with shared memory communication
> between Open MPI processes are caused by the Open MPI session directory
> living on the network via NFS.
> 
> The problem is resolved by setting up a ramdisk or mounting a tmpfs on
> each diskless node. By setting the MCA parameter orte_tmpdir_base to
> point to that mountpoint, shared memory communication and its backing
> files are kept local, which decreases the communication times by orders
> of magnitude.
> 
> Whether the problem is really related to the kernel version is not fully
> resolved, but the kernel is probably not "the problem" here.
> My benchmark now runs fine on a single node with 4 CPUs, kernel
> 2.6.33.1 and Open MPI 1.4.1.
> Running on multiple nodes I still see higher (TCP) communication times
> than I would expect, but that requires some deeper investigation on my
> part (e.g. collisions on the network) and should probably be posted to
> a new thread.
> 
> Thank you guys for your help.
> 
> oli
> 

-- 

Rainer Keller, PhD          Tel: +1 (865) 241-6293
Oak Ridge National Lab      Fax: +1 (865) 241-4811
PO Box 2008 MS 6164         Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008    AIM/Skype: rusraink



Re: [OMPI devel] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-04-22 Thread Samuel K. Gutierrez


On Apr 22, 2010, at 10:08 AM, Rainer Keller wrote:


> Hello Oliver,
> thanks for the update.
>
> Just my $0.02: the upcoming Open MPI v1.5 will warn users if their
> session directory is on NFS (or Lustre).

... or panfs :-)

Samuel K. Gutierrez

> Best regards,
> Rainer


> On Thursday 22 April 2010 11:37:48 am Oliver Geisler wrote:
> > To sum up and give an update:
> >
> > The extended communication times seen with shared memory communication
> > between Open MPI processes are caused by the Open MPI session directory
> > living on the network via NFS.
> >
> > The problem is resolved by setting up a ramdisk or mounting a tmpfs on
> > each diskless node. By setting the MCA parameter orte_tmpdir_base to
> > point to that mountpoint, shared memory communication and its backing
> > files are kept local, which decreases the communication times by orders
> > of magnitude.
> >
> > Whether the problem is really related to the kernel version is not
> > fully resolved, but the kernel is probably not "the problem" here.
> > My benchmark now runs fine on a single node with 4 CPUs, kernel
> > 2.6.33.1 and Open MPI 1.4.1.
> > Running on multiple nodes I still see higher (TCP) communication times
> > than I would expect, but that requires some deeper investigation on my
> > part (e.g. collisions on the network) and should probably be posted to
> > a new thread.
> >
> > Thank you guys for your help.
> >
> > oli



> --
>
> Rainer Keller, PhD          Tel: +1 (865) 241-6293
> Oak Ridge National Lab      Fax: +1 (865) 241-4811
> PO Box 2008 MS 6164         Email: kel...@ornl.gov
> Oak Ridge, TN 37831-2008    AIM/Skype: rusraink
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] New OMPI MPI extension

2010-04-22 Thread Rayson Ho
Jeff,

It seems like OMPI_Affinity_str()'s finest granularity is at the core
level. However, in SGE (Sun Grid Engine) we also offer thread-level
(SMT) binding:

http://wikis.sun.com/display/gridengine62u5/Using+Job+to+Core+Binding

Will Open MPI support thread-level binding in the future?


BTW, another 2 typos in README:

1193  subdirectory off <- should be "of"

1199  thse extensions <- should be "these" extensions

Rayson


On Thu, Apr 22, 2010 at 10:35 AM, Jeff Squyres  wrote:
> Fixed -- thanks!
>
> On Apr 22, 2010, at 12:35 AM, Rayson Ho wrote:
>
>> Hi Jeff,
>>
>> There's a typo in trunk/README:
>>
>> -> 1175 ...unrelated to wach other
>>
>> I guess you mean "unrelated to each other".
>>
>> Rayson
>>
>>
>>
>> On Wed, Apr 21, 2010 at 12:35 PM, Jeff Squyres  wrote:
>> > Per the telecon Tuesday, I committed a new OMPI MPI extension to the trunk:
>> >
>> >    https://svn.open-mpi.org/trac/ompi/changeset/23018
>> >
>> > Please read the commit message and let me know what you think.  
>> > Suggestions are welcome.
>> >
>> > If everyone is ok with it, I'd like to see this functionality hit the 1.5 
>> > series at some point.
>> >
>> > --
>> > Jeff Squyres
>> > jsquy...@cisco.com
>> > For corporate legal information go to:
>> > http://www.cisco.com/web/about/doing_business/legal/cri/
>> >
>> >
>> > ___
>> > devel mailing list
>> > de...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



Re: [OMPI devel] New OMPI MPI extension

2010-04-22 Thread Jeff Squyres
On Apr 22, 2010, at 12:34 PM, Rayson Ho wrote:

> It seems like OMPI_Affinity_str()'s finest granularity is at the core
> level. However, in SGE (Sun Grid Engine) we also offer thread-level
> (SMT) binding:
> 
> http://wikis.sun.com/display/gridengine62u5/Using+Job+to+Core+Binding
> 
> Will Open MPI support thread-level binding in the future?

Yes, but two things have to happen first:

1. Successfully import hwloc.  I tried importing hwloc 1.0rc1 earlier this week 
and ran into some problems; I unfortunately got side-tracked before I could fix 
them.  I need to fix those and get hwloc 1.0 out the door (it isn't clear to me 
yet if the problem was in OMPI or hwloc; but I want to resolve it before hwloc 
hits v1.0).

2. Update our internal handling inside OMPI to understand hardware threads (and 
possibly boards).  Our current internal APIs were written before hardware 
threads really mattered to HPC, so we need to do some updates.  It probably 
won't be too hard to do, but it does touch a bunch of places in OPAL and ORTE.

This likely puts OMPI hardware thread support in the 1.5.1 or 1.5.2 timeframe.

> BTW, another 2 typos in README:
> 
> 1193  subdirectory off <- should be "of"
> 
> 1199  thse extensions <- should be "these" extensions

Awesome; thanks!  I had apparently enabled "typo-mode" in emacs when I wrote 
this stuff.  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/