Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-07-18 Thread Philippe
Ralph,

thanks for investigating.

I've applied the two patches you mentioned earlier and ran with
ompi-server. Although I was able to run our standalone test, when I
integrated the changes into our code, the processes entered a runaway loop
and allocated all the available memory when calling MPI_Comm_connect.
I was not able to identify why it works standalone but not when integrated
with our code. If I find out why, I'll let you know.

Looking forward to your findings. We'd be happy to test any patches
if you have some!

p.

On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain  wrote:
> Okay, I can reproduce this problem. Frankly, I don't think this ever worked 
> with OMPI, and I'm not sure how the choice of BTL makes a difference.
>
> The program is crashing in the communicator definition, which involves a 
> communication over our internal out-of-band messaging system. That system has 
> zero connection to any BTL, so it should crash either way.
>
> Regardless, I will play with this a little as time allows. Thanks for the 
> reproducer!
>
>
> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>
>> Hi,
>>
>> I'm trying to run a test program which consists of a server creating a
>> port using MPI_Open_port and N clients using MPI_Comm_connect to
>> connect to the server.
>>
>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
>> clients, I get the following error message:
>>
>>   [node003:32274] [[37084,0],0]:route_callback tried routing message
>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>
>> This only happens with the openib BTL. With the tcp BTL it works
>> perfectly fine (ofud also works, as a matter of fact...). This has been
>> tested on two completely different clusters, with identical results.
>> In either case, the IB fabric works normally.
>>
>> Any help would be greatly appreciated! Several people in my team
>> looked at the problem. Google and the mailing list archive did not
>> provide any clue. I believe that from an MPI standpoint, my test
>> program is valid (and it works with TCP, which makes me feel better
>> about the sequence of MPI calls).
>>
>> Regards,
>> Philippe.
>>
>>
>>
>> Background:
>>
>> I intend to use Open MPI to transport data inside a much larger
>> application. Because of that, I cannot use mpiexec. Each process is
>> started by our own "job management" and uses a name server to find
>> the others. Once all the clients are connected, I would like the
>> server to do MPI_Recv to get the data from all the clients. I don't
>> care about the order or about which client is sending data, as long
>> as I can receive it with one call. To do that, the clients and the
>> server go through a series of MPI_Comm_accept/MPI_Comm_connect/
>> MPI_Intercomm_merge calls so that at the end, all the clients and the
>> server are inside the same intracomm.
>>
>> Steps:
>>
>> I have a sample program that shows the issue. I tried to make it as
>> short as possible. It needs to be executed on a shared file system
>> like NFS because the server writes the port info to a file that the
>> clients will read. To reproduce the issue, perform the following
>> steps:
>>
>> 0. compile the test with "mpicc -o ben12 ben12.c"
>> 1. ssh to the machine that will be the server
>> 2. run ./ben12 3 1
>> 3. ssh to the machine that will be the client #1
>> 4. run ./ben12 3 0
>> 5. repeat steps 3-4 for clients #2 and #3
>>
>> The server accepts the connection from client #1 and merges it into a new
>> intracomm. It then accepts the connection from client #2 and merges it.
>> When client #3 arrives, the server accepts the connection, but that
>> causes clients #1 and #2 to die with the error above (see the complete
>> trace in the tarball).
>>
>> The exact steps are (a minimal code sketch follows the list):
>>
>>     - server open port
>>     - server does accept
>>     - client #1 does connect
>>     - server and client #1 do merge
>>     - server does accept
>>     - client #2 does connect
>>     - server, client #1 and client #2 do merge
>>     - server does accept
>>     - client #3 does connect
>>     - server, client #1, client #2 and client #3 do merge
>>
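>> For reference, here is a minimal sketch of this accept/connect/merge
>> pattern in C. It is NOT the actual ben12.c from the tarball; the
>> port-file name, the interpretation of the two command-line arguments
>> (client count and server flag, as the run commands above suggest) and
>> the loop bookkeeping are assumptions made purely for illustration, and
>> all error checking is omitted:
>>
>>   /* sketch only -- see ben12.c in the tarball for the real test */
>>   #include <stdio.h>
>>   #include <stdlib.h>
>>   #include <mpi.h>
>>
>>   int main(int argc, char **argv)
>>   {
>>       char port[MPI_MAX_PORT_NAME];
>>       MPI_Comm everyone = MPI_COMM_SELF;    /* grows as clients join  */
>>       MPI_Comm inter, merged;
>>       int nclients  = atoi(argv[1]);        /* e.g. 3                 */
>>       int is_server = atoi(argv[2]);        /* 1 = server, 0 = client */
>>       int joined = 0, size;
>>       FILE *f;
>>
>>       MPI_Init(&argc, &argv);
>>
>>       if (is_server) {
>>           MPI_Open_port(MPI_INFO_NULL, port);
>>           f = fopen("port.txt", "w");       /* shared FS, e.g. NFS    */
>>           fprintf(f, "%s\n", port);
>>           fclose(f);
>>       } else {
>>           f = fopen("port.txt", "r");       /* wait/retry omitted     */
>>           fscanf(f, "%s", port);
>>           fclose(f);
>>           /* connect as a singleton, then merge; high=1 puts the new
>>            * client after the already-merged ranks                    */
>>           MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>           MPI_Intercomm_merge(inter, 1, &everyone);
>>           MPI_Comm_free(&inter);
>>           MPI_Comm_size(everyone, &size);
>>           joined = size - 1;                /* clients merged so far  */
>>       }
>>
>>       /* the server and every already-merged client collectively accept
>>        * each remaining client and merge it into the growing intracomm */
>>       while (joined < nclients) {
>>           MPI_Comm_accept(port, MPI_INFO_NULL, 0, everyone, &inter);
>>           MPI_Intercomm_merge(inter, 0, &merged);
>>           MPI_Comm_free(&inter);
>>           if (everyone != MPI_COMM_SELF)
>>               MPI_Comm_free(&everyone);
>>           everyone = merged;
>>           joined++;
>>       }
>>
>>       if (is_server)
>>           MPI_Close_port(port);
>>       MPI_Finalize();
>>       return 0;
>>   }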
>>
>> My InfiniBand network works normally with other test programs and
>> applications (MPI-based or others, such as raw Verbs).
>>
>> Info about my setup:
>>
>>    Open MPI version = 1.4.1 (I also tried 1.4.2 and nightly snapshots of
>> 1.4.3 and 1.5; all show the same error)
>>    config.log in the tarball
>>    "ompi_info --all" in the tarball
>>    OFED version = 1.3 installed from RHEL 5.3
>>    Distro = Red Hat Enterprise Linux 5.3
>>    Kernel = 2.6.18-128.4.1.el5 x86_64
>>    subnet manager = built-in SM from the Cisco/Topspin switch
>>    output of ibv_devinfo included in the tarball (there are no "bad" nodes)
>>    "ulimit -l" says "unlimited"
>>
>> The tarball contains:
>>
>>   - ben12.c: my test program showing the behavior
>>   - config.log / config.out / make.out / make-install.out /
>> ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>>   - trace-tcp.txt: output of the server and 

[OMPI users] MPICH2 is working, Open MPI is not

2010-07-18 Thread Bibrak Qamar
Hello,

I have developed a code which I tested with MPICH2, and it works fine.

But when I compile and run it with Open MPI, it does not work.

The output of the program, with the errors reported by Open MPI, is below:

--


bibrak@barq:~/XXX> mpirun -np 4 ./exec 98


warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
Send count -- >> 25
Send count -- >> 25
Send count -- >> 24
Send count -- >> 24
Dis -- >> 0
Dis -- >> 25
Dis -- >> 50
Dis -- >> 74




 0 d[0] = -14.025975
 1 d[0] = -14.025975
-- 1 --
 2 d[0] = -14.025975
-- 2 --
-- 0 --
 3 d[0] = -14.025975
 --3 --
[barq:27118] *** Process received signal ***
[barq:27118] Signal: Segmentation fault (11)
[barq:27118] Signal code: Address not mapped (1)
[barq:27118] Failing at address: 0x51681f96
[barq:27121] *** Process received signal ***
[barq:27121] Signal: Segmentation fault (11)
[barq:27121] Signal code: Address not mapped (1)
[barq:27121] Failing at address: 0x77b5685
[barq:27118] [ 0] [0xe410]
[barq:27118] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7d20f3c]
[barq:27118] [ 2] ./exec(main+0x2214) [0x804ad8d]
[barq:27118] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7cc9705]
[barq:27121] [ 0] [0xe410]
[barq:27121] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7d0ef3c]
[barq:27121] [ 2] ./exec(main+0x2214) [0x804ad8d]
[barq:27121] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7cb7705]
[barq:27121] [ 4] ./exec [0x8048b01]
[barq:27121] *** End of error message ***
[barq:27118] [ 4] ./exec [0x8048b01]
[barq:27118] *** End of error message ***
--
mpirun noticed that process rank 3 with PID 27121 on node barq exited on
signal 11 (Segmentation fault).
--
[barq:27120] *** Process received signal ***
[barq:27120] Signal: Segmentation fault (11)
[barq:27120] Signal code: Address not mapped (1)
[barq:27120] Failing at address: 0x4bd1ca3e
[barq:27120] [ 0] [0xe410]
[barq:27120] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7c97f3c]
[barq:27120] [ 2] ./exec(main+0x2214) [0x804ad8d]
[barq:27120] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7c40705]
[barq:27120] [ 4] ./exec [0x8048b01]
[barq:27120] *** End of error message ***
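
For reference, the "Send count" and "Dis" values above look like a block
decomposition of 98 elements over 4 ranks (counts 25, 25, 24, 24 at
displacements 0, 25, 50, 74). Since the original source is not shown, here
is only a minimal sketch of how such a decomposition is typically set up,
assuming MPI_Scatterv (which the real code may or may not use):

  /* hypothetical sketch -- not the original program */
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, i, off, n = atoi(argv[1]);         /* e.g. 98     */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int *counts = malloc(size * sizeof(int));
      int *displs = malloc(size * sizeof(int));
      for (i = 0, off = 0; i < size; i++) {
          counts[i] = n / size + (i < n % size ? 1 : 0); /* 25,25,24,24 */
          displs[i] = off;                               /* 0,25,50,74  */
          off += counts[i];
      }

      /* the receive buffer must hold at least counts[rank] elements;
       * overrunning it corrupts the heap, which is one common way to
       * get a later segfault inside free()/cfree as in the backtraces */
      double *local = malloc(counts[rank] * sizeof(double));
      double *full  = (rank == 0) ? malloc(n * sizeof(double)) : NULL;
      /* a real program would fill full[] at the root here */

      MPI_Scatterv(full, counts, displs, MPI_DOUBLE,
                   local, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

      free(local); free(full); free(counts); free(displs);
      MPI_Finalize();
      return 0;
  }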




Because of the "regcache incompatible with malloc" warning I did

  bibrak@barq:~/XXX> export MX_RCACHE=2

and then ignored the warning, but the error still remains.
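
(Side note: a variable exported only in the local shell may not reach
ranks launched on other nodes; with Open MPI's mpirun it can be forwarded
explicitly, e.g.

  mpirun -x MX_RCACHE=2 -np 4 ./exec 98

In this run all ranks are on barq, so the plain export should already
apply.)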

I would appreciate any help.

Bibrak Qamar
NUST-SEECS


Re: [OMPI users] is loop unrolling safe for MPI logic?

2010-07-18 Thread Anton Shterenlikht
On Sat, Jul 17, 2010 at 09:14:11AM -0700, Eugene Loh wrote:
> Jeff Squyres wrote:
> 
> >On Jul 17, 2010, at 4:22 AM, Anton Shterenlikht wrote:
> >  
> >
> >>Is loop vectorisation/unrolling safe for MPI logic?
> >>I presume it is, but are there situations where
> >>loop vectorisation could e.g. violate the order
> >>of execution of MPI calls?
> >>
> >>
> >I *assume* that the intel compiler will not unroll loops that contain MPI 
> >function calls.  That's obviously an assumption, but I would think that 
> >unless you put some pragmas in there that tell the compiler that it's safe 
> >to unroll, the compiler will be somewhat conservative about what it 
> >automatically unrolls.
> >  
> >
> More generally, a Fortran compiler that optimizes aggressively could 
> "break" MPI code.
> 
> http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node241
> 
> That said, you may not need to worry about this in your particular case.

This is a very important point, many thanks Eugene.
Fortran MPI programmers definitely need to pay attention to this.

MPI-2.2 provides a slightly updated version of this guide:

http://www.mpi-forum.org/docs/mpi22-report/node343.htm#Node348

many thanks
anton

-- 
Anton Shterenlikht
Room 2.6, Queen's Building
Mech Eng Dept
Bristol University
University Walk, Bristol BS8 1TR, UK
Tel: +44 (0)117 331 5944
Fax: +44 (0)117 929 4423


Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!

2010-07-18 Thread Gustavo Correa
Check PETSc:
http://www.mcs.anl.gov/petsc/petsc-as/

On Jul 18, 2010, at 12:37 AM, Damien wrote:

> You should check out the MUMPS parallel linear solver.
> 
> Damien
> Sent from my iPhone
> 
> On 2010-07-17, at 5:16 PM, Daniel Janzon  wrote:
> 
>> Dear OpenMPI Users,
>> 
>> I successfully installed OpenMPI on some FreeBSD machines and I can
>> run MPI programs on the cluster. Yippie!
>> 
>> But I'm not patient enough to write my own MPI-based routines. So I
>> thought maybe I could ask here for suggestions. I am primarily
>> interested in general linear algebra routines. The best would be, for
>> instance, to start Octave and just use it as normal, except that all
>> matrix operations would run on the cluster. Has anyone done that? The
>> octave-parallel package seems to be something different.
>> 
>> I installed ScaLAPACK and the test files ran successfully with mpirun
>> (except a few of them). But the source code examples of ScaLAPACK
>> look terrible. Is there no higher-level library that provides an API
>> with matrix operations, with all the MPI parallelism handled for you
>> in the background? Certainly a smart piece of software can decide
>> better than I can how to chunk up a matrix and pass it out to the
>> available processes.
>> 
>> All the best,
>> Daniel



Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!

2010-07-18 Thread Damien

You should check out the MUMPS parallel linear solver.

Damien
Sent from my iPhone

On 2010-07-17, at 5:16 PM, Daniel Janzon  wrote:


Dear OpenMPI Users,

I successfully installed OpenMPI on some FreeBSD machines and I can
run MPI programs on the cluster. Yippie!

But I'm not patient enough to write my own MPI-based routines. So I
thought maybe I could ask here for suggestions. I am primarily
interested in general linear algebra routines. The best would be, for
instance, to start Octave and just use it as normal, except that all
matrix operations would run on the cluster. Has anyone done that? The
octave-parallel package seems to be something different.

I installed ScaLAPACK and the test files ran successfully with mpirun
(except a few of them). But the source code examples of ScaLAPACK
look terrible. Is there no higher-level library that provides an API
with matrix operations, with all the MPI parallelism handled for you
in the background? Certainly a smart piece of software can decide
better than I can how to chunk up a matrix and pass it out to the
available processes.

All the best,
Daniel