[OMPI users] MPI_Irecv segmentation fault

2009-09-21 Thread Everette Clemmer
Hey all,

I'm getting a segmentation fault when I attempt to receive a single
character via MPI_Irecv. Code follows:

void recv_func() {
    if( !MASTER ) {
        char buffer[ 1 ];
        int flag;
        MPI_Request request;
        MPI_Status  status;

        MPI_Irecv( buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &request );
        MPI_Test( &request, &flag, &status );

        if( flag ) {
            // do work
        }
    }
}


void send_func( unsigned char c ) {
    if( MASTER ) {
        int totalNodes, node;
        unsigned char buffer[] = { c };
        MPI_Request request;
        MPI_Status  status;

        MPI_Comm_size( MPI_COMM_WORLD, &totalNodes );

        for ( node = 1; node < totalNodes; node++ ) {
            MPI_Isend( buffer, 1, MPI_CHAR, node, 0, MPI_COMM_WORLD,
                       &request );
        }
    }
}

The segfault disappears if I comment out the MPI_Irecv call in
recv_func, so I'm assuming that there's something wrong with the
parameters that I'm passing to it. Thoughts?
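
For reference, here is a minimal self-contained sketch of the pattern
being attempted (the rank-based MASTER check and the closing MPI_Wait
are illustrative assumptions, not the original code):

#include <mpi.h>
#include <stdio.h>

/* Sketch: rank 0 sends one char to every other rank; the others post a
 * non-blocking receive and wait for it to complete. */
int main( int argc, char **argv ) {
    int rank, size, node;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if( rank == 0 ) {                       /* the "MASTER" rank */
        char c = 'x';
        for ( node = 1; node < size; node++ )
            MPI_Send( &c, 1, MPI_CHAR, node, 0, MPI_COMM_WORLD );
    } else {
        char buffer[ 1 ];
        MPI_Request request;
        MPI_Status  status;
        MPI_Irecv( buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG,
                   MPI_COMM_WORLD, &request );
        /* buffer and request must stay valid until the request completes,
         * e.g. via MPI_Wait or a successful MPI_Test. */
        MPI_Wait( &request, &status );
        printf( "rank %d received '%c'\n", rank, buffer[0] );
    }

    MPI_Finalize();
    return 0;
}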

Thanks,
Everette


Re: [OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-21 Thread Pallab Datta
The following is the error dump:

fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4
36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl
tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H
localhost,10.11.14.205 /tmp/hello
[fuji.local:01316] mca: base: components_open: Looking for btl components
[fuji.local:01316] mca: base: components_open: opening btl components
[fuji.local:01316] mca: base: components_open: found loaded component self
[fuji.local:01316] mca: base: components_open: component self has no
register function
[fuji.local:01316] mca: base: components_open: component self open
function successful
[fuji.local:01316] mca: base: components_open: found loaded component tcp
[fuji.local:01316] mca: base: components_open: component tcp has no
register function
[fuji.local:01316] mca: base: components_open: component tcp open function
successful
[fuji.local:01316] select: initializing btl component self
[fuji.local:01316] select: init of component self returned success
[fuji.local:01316] select: initializing btl component tcp
[fuji.local:01316] select: init of component tcp returned success
[apex-backpack:04753] mca: base: components_open: Looking for btl components
[apex-backpack:04753] mca: base: components_open: opening btl components
[apex-backpack:04753] mca: base: components_open: found loaded component self
[apex-backpack:04753] mca: base: components_open: component self has no
register function
[apex-backpack:04753] mca: base: components_open: component self open
function successful
[apex-backpack:04753] mca: base: components_open: found loaded component tcp
[apex-backpack:04753] mca: base: components_open: component tcp has no
register function
[apex-backpack:04753] mca: base: components_open: component tcp open
function successful
[apex-backpack:04753] select: initializing btl component self
[apex-backpack:04753] select: init of component self returned success
[apex-backpack:04753] select: initializing btl component tcp
[apex-backpack:04753] select: init of component tcp returned success
Process 0 on fuji.local out of 2
Process 1 on apex-backpack out of 2
[apex-backpack:04753] btl: tcp: attempting to connect() to address
10.11.14.203 on port 9360




> Hi
>
> I am trying to run Open MPI 1.3.3 between a Linux box running Ubuntu
> Server 9.04 and a Macintosh. I have configured Open MPI with the
> following options:
> ./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared
> --enable-static
>
> When both machines are connected to the network via Ethernet cables,
> Open MPI works fine.
>
> But when I switch the Linux box to a wireless adapter, I can reach
> (ping) the Macintosh, but Open MPI hangs on a hello world program.
>
> I ran:
>
> /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
> btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
> OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205
> /tmp/back
>
> It hangs on a send/receive call between the two ends. All my firewalls
> are turned off at the Macintosh end. Please help ASAP.
> regards,
> pallab



[OMPI users] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-21 Thread Pallab Datta
Hi

I am trying to run Open MPI 1.3.3 between a Linux box running Ubuntu
Server 9.04 and a Macintosh. I have configured Open MPI with the
following options:
./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared
--enable-static

When both machines are connected to the network via Ethernet cables,
Open MPI works fine.

But when I switch the Linux box to a wireless adapter, I can reach
(ping) the Macintosh, but Open MPI hangs on a hello world program.

I ran:

/usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca
btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca
OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205
/tmp/back

It hangs on a send/receive call between the two ends. All my firewalls
are turned off at the Macintosh end. Please help ASAP.
regards,
pallab


Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-21 Thread Jonathan Dursi

Continuing the conversation with myself:

Google pointed me to Trac ticket #1944, which spoke of deadlocks in
looped collective operations. There is no collective operation anywhere
in this sample code, but trying one of the suggested workarounds/clues,
namely setting btl_sm_num_fifos to at least (np-1), seems to make
things work quite reliably for both OpenMPI 1.3.2 and 1.3.3; that is,
while this


mpirun -np 6 -mca btl sm,self ./diffusion-mpi

invariably hangs (at random-seeming numbers of iterations) with OpenMPI 
1.3.2 and sometimes hangs (maybe 10% of the time, again seemingly 
randomly) with 1.3.3,


mpirun -np 6 -mca btl tcp,self ./diffusion-mpi

or

mpirun -np 6 -mca btl_sm_num_fifos 5 -mca btl sm,self ./diffusion-mpi

always succeeds, with (as one might guess) the second being much faster...
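
For what it's worth, the same workaround can presumably also be set
through the environment instead of on every command line (a sketch,
assuming bash):

export OMPI_MCA_btl_sm_num_fifos=5
mpirun -np 6 -mca btl sm,self ./diffusion-mpi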

Jonathan

--
Jonathan Dursi 


Re: [OMPI users] cartofile

2009-09-21 Thread Eugene Loh




Thank you, but I don't understand who is consuming this information for
what.  E.g., the mpirun man page describes the carto file, but doesn't
give users any indication whether they should be worrying about this.

Lenny Verkhovsky wrote:
> Hi Eugene,
> A carto file is a file with a static graph topology of your node.
> In opal/mca/carto/file/carto_file.h you can see an example.
> (Yes, I know, it should be in the help/man pages :) )
> Basically it describes a map of your node and its internal interconnect.
> Hopefully it will be discovered automatically someday,
> but for now you can describe your node manually.
>
> Best regards,
> Lenny.
>
> On Thu, Sep 17, 2009 at 12:38 AM, Eugene Loh wrote:
>> I feel like I should know, but what's a cartofile? I guess you supply
>> "topological" information about a host, but I can't tell how this
>> information is used by, say, mpirun.





Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-21 Thread Jonathan Dursi
I hate to repost, but I'm still stuck with the problem that, on a 
completely standard install with a standard gcc compiler, we're getting 
random hangs with a trivial test program when using the sm btl, and we 
still have no clues as to how to track down the problem.


Using a completely standard build:

./configure --prefix=/scinet/gpc/mpi/openmpi/1.3.2-gcc-v4.4.0-ofed 
--with-openib

make
make install

with the config.log attached; the output of ompi_info --all for the
resulting installation is also attached.


The very trivial attached program, which just does a series of SENDRECVs
rightwards through MPI_COMM_WORLD, hangs extremely reliably when run
like so on an 8 core box:

mpirun -np 6 -mca btl self,sm ./diffusion-mpi

The hang seems to always occur within the first 500 or so iterations,
sometimes between the 10th and 20th and sometimes not until the late
400s. The hang occurs both on a new dual-socket quad-core Nehalem box
and on an older Harpertown machine.
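
For reference, the attached program is roughly of this shape (a sketch
only, not the attached code; the 500-iteration loop and MPI_DOUBLE
payload are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Rightward SENDRECV ring: each rank sends to rank+1 and receives from
 * rank-1 (wrapping around), repeated for many iterations. */
int main(int argc, char **argv) {
    int rank, size, iter, left, right;
    double sendbuf, recvbuf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    right = (rank + 1) % size;          /* neighbour we send to      */
    left  = (rank + size - 1) % size;   /* neighbour we receive from */
    for (iter = 0; iter < 500; iter++) {
        sendbuf = (double)(rank + iter);
        MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, right, 0,
                     &recvbuf, 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank == 0 && iter % 100 == 0)
            printf("iteration %d\n", iter);
    }
    MPI_Finalize();
    return 0;
}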

Running without sm, however, seems to work fine:

mpirun -np 6 -mca btl self,tcp ./diffusion-mpi

never gives any problems.

Running with OpenMPI 1.3.3, built in the same way, gives the hangs 
significantly less frequently - it hangs one time out of every ten or 
so.  But obviously this is still far too often to deploy in a production 
environment.


Where should we be looking to track down this problem?

  - Jonathan
--
Jonathan Dursi 


config.log.gz
Description: GNU Zip compressed data
 Package: Open MPI root@gpc-f101n001 Distribution
Open MPI: 1.3.2
   Open MPI SVN revision: r21054
   Open MPI release date: Apr 21, 2009
Open RTE: 1.3.2
   Open RTE SVN revision: r21054
   Open RTE release date: Apr 21, 2009
OPAL: 1.3.2
   OPAL SVN revision: r21054
   OPAL release date: Apr 21, 2009
Ident string: 1.3.2
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.2)
  MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.2)
   MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.2)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.2)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.3.2)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.2)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.2)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.2)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.2)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.2)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.2)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.2)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: self (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.2)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.2)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.2)
   MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.2)
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.2)
   MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.2)
 MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.2)
 MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.2)
 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.2)
 MCA pml: v (MCA v2.0, API v2.0, Component v1.3.2)
 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.2)
  MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.2)
 MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3.2)
 MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.2)
 MCA btl: self (MCA v2.0, API v2.0, Component v1.3.2)
 MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.2)
 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.2)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.2)
 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.2)
 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.2)
 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.2)
 MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.2)
 MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.2)
 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.2)
MCA odls: default (MCA v2.0, API v2.0, Component v1.3.2)
 MCA ras: slurm (MCA 

Re: [OMPI users] cartofile

2009-09-21 Thread Lenny Verkhovsky
Hi Eugene,
A carto file is a file with a static graph topology of your node.
In opal/mca/carto/file/carto_file.h you can see an example.
(Yes, I know, it should be in the help/man pages :) )
Basically it describes a map of your node and its internal interconnect.
Hopefully it will be discovered automatically someday,
but for now you can describe your node manually.
Best regards
Lenny.

On Thu, Sep 17, 2009 at 12:38 AM, Eugene Loh  wrote:

> I feel like I should know, but what's a cartofile?  I guess you supply
> "topological" information about a host, but I can't tell how this
> information is used by, say, mpirun.


Re: [OMPI users] Program hangs when run in the remote host ...

2009-09-21 Thread souvik bhattacherjee
As Ralph suggested, I *reversed the order of my PATH settings*:

This is what it shows:

$ echo $PATH
/usr/local/openmpi-1.3.3/bin/:/usr/bin:/bin:/usr/local/bin:/usr/X11R6/bin/:/usr/games:/usr/lib/qt4/bin:/usr/bin:/opt/kde3/bin

$ echo $LD_LIBRARY_PATH
/usr/local/openmpi-1.3.3/lib/

Moreover, I checked that there were *NO* system-supplied versions of OMPI
previously installed. (I did install MPICH2 earlier, but I had removed the
binaries and the related files.) This is because

$ locate mpicc
/home/souvik/software/openmpi-1.3.3/build/ompi/contrib/vt/wrappers/mpicc-vt-wrapper-data.txt
/home/souvik/software/openmpi-1.3.3/build/ompi/tools/wrappers/mpicc-wrapper-data.txt
/home/souvik/software/openmpi-1.3.3/build/ompi/tools/wrappers/mpicc.1
/home/souvik/software/openmpi-1.3.3/contrib/platform/win32/ConfigFiles/mpicc-wrapper-data.txt.cmake
/home/souvik/software/openmpi-1.3.3/ompi/contrib/vt/wrappers/mpicc-vt-wrapper-data.txt
/home/souvik/software/openmpi-1.3.3/ompi/contrib/vt/wrappers/
mpicc-vt-wrapper-data.txt.in
/home/souvik/software/openmpi-1.3.3/ompi/tools/wrappers/mpicc-wrapper-data.txt
/home/souvik/software/openmpi-1.3.3/ompi/tools/wrappers/
mpicc-wrapper-data.txt.in
/usr/local/openmpi-1.3.3/bin/mpicc
/usr/local/openmpi-1.3.3/bin/mpicc-vt
/usr/local/openmpi-1.3.3/share/man/man1/mpicc.1
/usr/local/openmpi-1.3.3/share/openmpi/mpicc-vt-wrapper-data.txt
/usr/local/openmpi-1.3.3/share/openmpi/mpicc-wrapper-data.txt

does not show the occurrence of mpicc in any directory related to MPICH2.

The results are the same with mpirun:

$ locate mpirun
/home/souvik/software/openmpi-1.3.3/build/ompi/tools/ortetools/mpirun.1
/home/souvik/software/openmpi-1.3.3/ompi/runtime/mpiruntime.h
/usr/local/openmpi-1.3.3/bin/mpirun
/usr/local/openmpi-1.3.3/share/man/man1/mpirun.1

*These tests were done both on ict1 and ict2*.

I performed another test which probably proves that the executable finds the
required files on the remote host. The program was run from ict2.

$ cd /home/souvik/software/openmpi-1.3.3/examples/

$ mpirun -np 4 --host ict2,ict1 hello_c
bash: orted: command not found
--
A daemon (pid 28023) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished

$ mpirun --prefix /usr/local/openmpi-1.3.3/ -np 4 --host ict2,ict1 hello_c

*This command-line statement as usual does not produce any output. On
pressing Ctrl+C, the following output occurs:*

^Cmpirun: killing job...

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
ict1 - daemon did not report back when launched

$

Also, doing *top* does not show any *mpirun* or *hello_c* processes running
on either host. However, running hello_c on a single host, say ict2, does
show *mpirun* and *hello_c* in the process list.
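
For reference, the usual way to make orted and its libraries visible to
the non-interactive shells that ssh starts on the remote host is to
export the paths in the remote ~/.bashrc, before any early exit for
non-interactive shells (a sketch, using the install paths shown above):

# in ~/.bashrc on both ict1 and ict2
export PATH=/usr/local/openmpi-1.3.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-1.3.3/lib:$LD_LIBRARY_PATH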





On Sat, Sep 19, 2009 at 8:13 PM, Ralph Castain  wrote:

> One thing that flags my attention. In your PATH definition, you put $PATH
> ahead of your OMPI 1.3.3 installation. Thus, if there are any system
> supplied versions of OMPI hanging around (and there often are), they will be
> executed instead of your new installation.
> You might try reversing that order.
>
> On Sep 19, 2009, at 7:33 AM, souvik bhattacherjee wrote:
>
> Hi Gus (and all OpenMPI users),
>
> Thanks for your interest in my problem. However, it seems to me that I had
> already taken care of the points you raised earlier in your mails. I have
> listed them below point by point. Your comments are rewritten in *RED* and
> my replies in *BLACK*.
>
> 1) As you have mentioned: "*I would guess you only installed OpenMPI only
> on ict1, not on ict2*". However, I had mentioned initially: "*I had
> installed openmpi-1.3.3 separately on two of my machines ict1 and ict2*".
>

Re: [OMPI users] running open mpi on ubuntu 9.04

2009-09-21 Thread Pavel Shamis (Pasha)
You will not need the trick if you configure Open MPI with the
following flag:

--enable-mpirun-prefix-by-default
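
For example (the prefix is just illustrative):

./configure --prefix=/usr/local --enable-mpirun-prefix-by-default
make
make install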

Pasha.



Hodgess, Erin wrote:


the LD_LIBRARY_PATH did the trick;
thanks so much!

Sincerely,
Erin


Erin M. Hodgess, PhD
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: hodge...@uhd.edu



-Original Message-
From: users-boun...@open-mpi.org on behalf of Marce
Sent: Sat 9/19/2009 3:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] running open mpi on ubuntu 9.04

2009/9/18 Hodgess, Erin :
> There is no hosts file there originally
> I put in
>
>  cat hosts
> 127.0.0.1   localhost
>
>
> but still get the same thing
>
> thanks,
> Erin
>
> erin@erin-laptop:~$
> Erin M. Hodgess, PhD
> Associate Professor
> Department of Computer and Mathematical Sciences
> University of Houston - Downtown
> mailto: hodge...@uhd.edu
>
>
>
> -Original Message-
> From: users-boun...@open-mpi.org on behalf of Whit Armstrong
> Sent: Fri 9/18/2009 7:36 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] running open mpi on ubuntu 9.04
>
> yes, I had this issue before (we are on 9.04 as well).
> it has to do with the hosts file.
>
> Erin, can you send your hosts file?
>
> I think you want to make this the first line of your host file:
> 127.0.0.1   localhost
>
> Which Ubuntu, if memory serves, defaults to the name of the machine
> instead of localhost.
>
> -Whit
>
>
> On Fri, Sep 18, 2009 at 8:31 AM, Ralph Castain  wrote:
>
>> It doesn't matter - 1.3 isn't going to launch another daemon on the
>> local node.
>> The problem here is that OMPI isn't recognizing your local host as
>> being "local" - i.e., it thinks that the host mpirun is executing on
>> is somehow not the local host. This has come up before with ubuntu -
>> you might search the user mailing list for "ubuntu" to see earlier
>> threads on this issue.
>>
>> I forget the final solution, but those earlier threads will explain
>> what needs to be done. I'm afraid this is something quite specific to
>> ubuntu.
>>
>>
>> On Sep 18, 2009, at 6:23 AM, Whit Armstrong wrote:
>>
>> can you "ssh localhost" without a password?
>> -Whit
>>
>>
>> On Thu, Sep 17, 2009 at 11:50 PM, Hodgess, Erin wrote:
>>
>>> It's 1.3, please.
>>>
>>> Thanks,
>>>
>>> Erin
>>>
>>>
>>> Erin M. Hodgess, PhD
>>> Associate Professor
>>> Department of Computer and Mathematical Sciences
>>> University of Houston - Downtown
>>> mailto: hodge...@uhd.edu
>>>
>>>
>>>
>>> -Original Message-
>>> From: users-boun...@open-mpi.org on behalf of Ralph Castain
>>> Sent: Thu 9/17/2009 10:39 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] running open mpi on ubuntu 9.04
>>>
>>> I gather you must be running a version of the old 1.2 series? Or are
>>> you running 1.3?
>>>
>>> It does make a difference as to the nature of the problem, and the
>>> recommended solution.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 17, 2009, at 8:51 PM, Hodgess, Erin wrote:
>>>
>>> > Dear Open MPI people:
>>> >
>>> > I'm trying to run a simple "hello world" program on Ubuntu 9.04
>>> >
>>> > It's on a dual core laptop; no other machines.
>>> >
>>> > Here is the output:
>>> > erin@erin-laptop:~$ mpirun -np 2 a.out
>>> > ssh: connect to host erin-laptop port 22: Connection refused
>>> >
>>>
>>> 
--

>>> > A daemon (pid 11854) died unexpectedly with status 255 while
>>> > attempting
>>> > to launch so we are aborting.
>>> >
>>> > There may be more information reported by the environment (see 
above).

>>> >
>>> > This may be because the daemon was unable to find all the needed
>>> > shared
>>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>> > have the
>>> > location of the shared libraries on the remote nodes and this will
>>> > automatically be forwarded to the remote nodes.
>>> >
>>>
>>> 
--

>>> >
>>>
>>> 
--
>>> > mpirun noticed that the job aborted, but has no info as to the 
process

>>> > that caused that situation.
>>> >
>>>
>>> 
--

>>> > mpirun: clean termination accomplished
>>> >
>>> > erin@erin-laptop:~$
>>> >
>>> > Any help would be much appreciated.
>>> >
>>> > Sincerely,
>>> > Erin
>>> >
>>> >
>>> > Erin M. Hodgess, PhD
>>> > Associate Professor
>>> > Department of Computer and Mathematical Sciences
>>> > University of Houston - Downtown
>>> > mailto: hodge...@uhd.edu
>>> >
>>> >
>>>
>>>
>>>
>>>

Re: [OMPI users] Job fails after hours of running on a specific node

2009-09-21 Thread Pavel Shamis (Pasha)

Sangamesh,

The IB tunings that you added to your command line only delay the
problem but do not resolve it. The node node-0-2.local gets the
asynchronous event "IBV_EVENT_PORT_ERROR", and as a result the
processes fail to deliver packets to some remote hosts and you see a
bunch of IB errors.


The IBV_EVENT_PORT_ERROR error means that the IB port went from the
ACTIVE state to the DOWN state. In other words, you have a problem with
your IB network that causes all these network errors. The root cause of
such an issue may be a bad cable or a problematic port on the switch.


For IB network debugging I propose you use ibdiagnet, an open source
IB network diagnostic tool:

http://linux.die.net/man/1/ibdiagnet
The tool is part of the OFED distribution.
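
As a first check, the standard OFED command-line tools can confirm the
port state and scan the fabric (exact options vary by OFED version):

# on node-0-2: the HCA port should report "State: Active"
ibstat

# from any host with access to the fabric: run the default diagnostics
ibdiagnet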

Pasha.


Sangamesh B wrote:

Dear all,
 
 The CPMD application, which is compiled with OpenMPI-1.3 (Intel
10.1 compilers) on CentOS-4.5, fails only when a specific node, i.e.
node-0-2, is involved, but runs well on other nodes.

 Initially the job failed after 5-10 mins (on node-0-2 + some other
nodes). After googling the error, I added the options "-mca
btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20" to the
mpirun command in the SGE script:
 
$ cat cpmdrun.sh

#!/bin/bash
#$ -N cpmd-acw
#$ -S /bin/bash
#$ -cwd
#$ -e err.$JOB_ID.$JOB_NAME
#$ -o out.$JOB_ID.$JOB_NAME
#$ -pe ib 32
unset SGE_ROOT  
PP_LIBRARY=/home/user1/cpmdrun/wac/prod/PP

CPMD=/opt/apps/cpmd/3.11/ompi/SOURCE/cpmd311-ompi-mkl.x
MPIRUN=/opt/mpi/openmpi/1.3/intel/bin/mpirun
$MPIRUN -np $NSLOTS -hostfile $TMPDIR/machines -mca 
btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20 $CPMD 
wac_md26.in   $PP_LIBRARY > wac_md26.out
After adding these options, the job executed for 24+ hours and then
failed with the same error as earlier. The error is:
 
$ cat err.6186.cpmd-acw

--
The OpenFabrics stack has reported a network error event.  Open MPI
will try to continue, but your job may end up failing.
  Local host:node-0-2.local
  MPI process PID:   11840
  Error number:  10 (IBV_EVENT_PORT_ERR)
This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] Set MCA parameter "orte_base_help_aggregate" to 
0 to see all help / error messages
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 7 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 1 more process has sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 15 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 16 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[node-0-2.local:11836] 16 more processes have sent help message 
help-mpi-btl-openib.txt / of error event
[[718,1],20][btl_openib_component.c:2902:handle_wc] from 
node-0-22.local to: node-0-2 
--

The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:
 4.096 microseconds * (2^btl_openib_ib_timeout)
  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
Below is some 

Re: [OMPI users] Question about OpenMPI performance vs. MVAPICH2

2009-09-21 Thread Brian Powell

Jed Brown wrote:

Are you saying the output of mpicc/mpif90 -show has the same
optimization flags? MPICH2 usually puts its own optimization flags
into the wrappers.


Jed, thank you for your reply. Yes, mpif90 shows (other than differing  
libraries) identical flags.


Ralph Castain wrote:

Did you set -mca mpi_paffinity_alone 1? This will bind the processes
to cores and (usually) significantly improve performance.


Ralph, thank you for the suggestion. I had focussed on RDMA, and this  
made a significant difference. I have only had time to re-run an  
ensemble of one configuration (rather than the suite I had been  
running) and it improved the OpenMPI performance by 19.5%. So, it  
would appear this was the primary cause.


I will read through the documentation to find how to make this the  
default.
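
For reference, one common way to make an MCA parameter like this the
default is a per-user MCA parameters file (a sketch, assuming the
standard location):

# $HOME/.openmpi/mca-params.conf
mpi_paffinity_alone = 1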


Thank you for your assistance! I look forward to the 1.3.4 improvements.

Cheers,
Brian