Re: [OMPI users] Newbie doubt.

2008-09-17 Thread Davi Vercillo C. Garcia (デビッド)
Hi,

> You must close the file using
> MPI_File_close(MPI_File *fh)
> before calling MPI_Finalize.

Newbie question... newbie problem ! UHiauhiauh... Thanks !!!
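
For reference, a minimal sketch of the pattern being discussed -- every file opened with MPI_File_open has to be closed with MPI_File_close before MPI_Finalize (the file name below is just a placeholder):

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_File fh;

    MPI_Init(&argc, &argv);

    /* Open (or create) a file collectively on MPI_COMM_WORLD. */
    MPI_File_open(MPI_COMM_WORLD, "output.bz2",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* ... MPI_File_write / MPI_File_write_at calls would go here ... */

    /* Close the file handle *before* shutting MPI down. */
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}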

> By the way, I think you shouldn't do
>  strcat(argv[1], ".bz2");
> This would overwrite any following arguments.

I know... I was just trying ! =D
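
For completeness, a small sketch of the safer alternative hinted at above -- build the ".bz2" name in a separate buffer instead of appending to argv[1] (the buffer size here is arbitrary):

#include <stdio.h>

int main(int argc, char *argv[])
{
    char outname[4096];

    if (argc < 2) {
        fprintf(stderr, "usage: %s <input-file>\n", argv[0]);
        return 1;
    }

    /* Copy argv[1] into our own storage and append the suffix there,
       so the following argv[] strings are left untouched. */
    snprintf(outname, sizeof(outname), "%s.bz2", argv[1]);

    printf("output file: %s\n", outname);
    return 0;
}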

-- 
Davi Vercillo Carneiro Garcia
http://davivercillo.blogspot.com/

Universidade Federal do Rio de Janeiro
Departamento de Ciência da Computação
DCC-IM/UFRJ - http://www.dcc.ufrj.br

Grupo de Usuários GNU/Linux da UFRJ (GUL-UFRJ)
http://www.dcc.ufrj.br/~gul

Linux User: #388711
http://counter.li.org/

"Good things come to those who... wait." - Debian Project

"A computer is like air conditioning: it becomes useless when you open
windows." - Linus Torvalds



Re: [OMPI users] errors returned from openmpi-1.2.7 source code

2008-09-17 Thread Shafagh Jafer
OK, I looked at the errors closely, and it looks like the problem comes from the
"namespace MPI{.." in line 136 of "mpicxx.h" and everywhere that this
namespace (MPI) is used. Here are the errors:


In file included from /opt/openmpi/1.2.7/include/mpi.h:1795,
                 from stdaload.cpp:33:
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/mpicxx.h:136: parse error before `1'
In file included from /opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/mpicxx.h:168,
                 from /opt/openmpi/1.2.7/include/mpi.h:1795,
                 from stdaload.cpp:33:
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/functions.h:143: parse error before `1'
In file included from /opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/mpicxx.h:195,
                 from /opt/openmpi/1.2.7/include/mpi.h:1795,
                 from stdaload.cpp:33:
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/status.h:26: parse error before `::'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/status.h:27: parse error before `::'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/status.h:28: parse error before `::'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/status.h:102: parse error before `1'
In file included from /opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/mpicxx.h:196,
                 from /opt/openmpi/1.2.7/include/mpi.h:1795,
                 from stdaload.cpp:33:

---

--- On Wed, 9/17/08, Jeff Squyres  wrote:
From: Jeff Squyres 
Subject: Re: [OMPI users] errors returned from openmpi-1.2.7 source code
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Wednesday, September 17, 2008, 11:21 AM

You shouldn't need to add any -I's or -L's or -l's for Open MPI.  Just
use mpic++ and mpicc (per my first note, notice that "mpicc" (lower
case) is the C compiler -- mpiCC is a synonym for the C++ compiler --
this could be your problem).  Those wrappers add all the compiler /
linker flags that you need.


On Sep 17, 2008, at 2:16 PM, Shafagh Jafer wrote:

> The openmpi is installed in the following path: /opt/openmpi/1.2.7
> so should I replace what you told me about /usr/lib with
> /opt/openmpi/1.2.7 ??
>
> --- On Wed, 9/17/08, Jeff Squyres  wrote:
> From: Jeff Squyres 
> Subject: Re: [OMPI users] errors returned from openmpi-1.2.7 source  
> code
> To: "Open MPI Users" 
> Date: Wednesday, September 17, 2008, 9:22 AM
>
> I don't quite understand the format of this file, but at first glance,
> you shouldn't need the following lines:
>
> export LIBMPI = -lmpi
>
> export MPIDIR=/nfs/sjafer/phd/openMPI/installed
> export LDFLAGS +=-L$(MPIDIR)/lib
> export INCLUDES_CPP += -I$(MPIDIR)/include
>
> It also doesn't seem like the last 2 arguments of this line are a good
> idea (the linker should automatically put /usr/lib and /lib in your
> search path, if appropriate):
>
> export LDFLAGS+=-L. -L$/usr/lib -L$/lib
>
> I also notice:
>
> export CPP=mpic++
> export CC=mpiCC
>
> I think you want "mpicc" for CC (note the lower case) -- mpiCC is the
> C++ compiler (mpic++ and mpiCC are synonyms).
>
> This might solve your problem.
>
>
>
> On Sep 15, 2008, at 4:56 PM, Shafagh Jafer wrote:
>
> > I am sending you my simulator's Makefile.common, which points to
> > openmpi. Please take a look at it. Thanks a lot.
> >
> > --- On Mon, 9/15/08, Jeff Squyres  wrote:
> > From: Jeff Squyres 
> > Subject: Re: [OMPI users] errors returned from openmpi-1.2.7 source
> > code
> > To: "Open MPI Users" 
> > Date: Monday, September 15, 2008, 11:21 AM
> >
> > On Sep 14, 2008, at 1:24 PM, Shafagh Jafer wrote:
> >
> > > I installed openmpi-1.2.7 and tested the hello_c and ring_c examples
> > > on single and multiple nodes and they worked fine. However, when I use
> > > openmpi with my simulator (by replacing the old mpich path with the
> > > new openmpi) I get many errors reported from
> > > "/openMPI/openmpi-1.2.7/include/openmpi/ompi/mpi/cxx/*.h". Please see the
> > > following snapshots:
> > >
> >
> > It's not clear exactly what you did here.  Did you just replace
> > MPICH's "mpiCC" with OMPI's "mpiCC"?  FWIW, this is almost always the
> > easiest way to compile MPI applications: use that implementation's
> > "wrapper" compiler (I'm assuming you have a C++ code in this case).
> >
> > These errors should not normally happen; they look kinda like you're
> > somehow inadvertently mixing Open MPI and MPICH.
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >

Re: [OMPI users] errors returned from openmpi-1.2.7 source code

2008-09-17 Thread Shafagh Jafer
OK, but the problem is that I have another type of MPI from Scali, and when I
put "mpicc" and "mpic++" in my makefile it goes and uses the Scali MPI's
compilers, which have exactly the same names ("mpicc" and "mpic++")... So it did not
give me any error, but I felt that it used the Scali stuff and not the
openmpi's. So I modified my Makefile as follows:
==
export CPP=/opt/openmpi/1.2.7/bin/mpic++ #/usr/local/bin/g++
export CC=/opt/openmpi/1.2.7/bin/mpicc #/usr/local/bin/gcc
export AR=ar
export YACPP=yacc

#export DEFINES_CPP += -DNEWCOORDIN
#===
#PCD++ Directory Details

#jacky: change the following line to reflect your pcd code directory //:~

export MAINDIR=/nfs/sjafer/phd/openMPI/openmpi_cd++_timewarp
export INCLUDES_CPP +=-I$(MAINDIR)/include

#If running parallel simulation, uncomment the following lines
export DEFINES_CPP += -DMPI
#export LIBMPI = -lmpi

#===

#===
#MPI Directory Details
##export MPIDIR=/opt/openmpi/1.2.7/
##export LDFLAGS +=-L$(MPIDIR)/lib
##export INCLUDES_CPP += -I$(MPIDIR)/include

###export LDFLAGS+=-L. -L$/opt/openmpi/1.2.7/lib

==
and I am still getting the following errors:
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h: At top level:
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h:84: parse error before `protected'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h:96: base class `Comm_Null' has incomplete type
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h: In method `Comm::Comm(const Comm &)':
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h:153: `class Comm' has no member named `mpi_comm'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h:153: type `Comm_Null' is not an immediate basetype for `Comm'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h: In method `Comm::Comm(ompi_communicator_t *)':
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/comm.h:155: type `Comm_Null' is not an immediate basetype for `Comm'
In file included from /opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/mpicxx.h:199,
                 from /opt/openmpi/1.2.7/include/mpi.h:1795,
                 from stdaload.cpp:33:
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/win.h: At top level:
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/win.h:27: parse error before `::'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/win.h:28: parse error before `::'
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/win.h:93: `static' can only be specified for objects and functions
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/win.h:93: ANSI C++ forbids declaration `' with no type
/opt/openmpi/1.2.7/include/openmpi/ompi/mpi/cxx/win.h:93: confused by earlier errors, bailing out
make: *** [stdaload.o] Error 1


--- On Wed, 9/17/08, Jeff Squyres  wrote:
From: Jeff Squyres 
Subject: Re: [OMPI users] errors returned from openmpi-1.2.7 source code
To: "Open MPI Users" 
List-Post: users@lists.open-mpi.org
Date: Wednesday, September 17, 2008, 11:21 AM

You shouldn't need to add any -I's or -L's or -l's for Open MPI.  Just
use mpic++ and mpicc (per my first note, notice that "mpicc" (lower
case) is the C compiler -- mpiCC is a synonym for the C++ compiler --
this could be your problem).  Those wrappers add all the compiler /
linker flags that you need.


On Sep 17, 2008, at 2:16 PM, Shafagh Jafer wrote:

> The openmpi is installed in the following path: /opt/openmpi/1.2.7
> so should I replace what you told me about /usr/lib with
> /opt/openmpi/1.2.7 ??
>
> --- On Wed, 9/17/08, Jeff Squyres  wrote:
> From: Jeff Squyres 
> Subject: Re: [OMPI users] errors returned from openmpi-1.2.7 source  
> code
> To: "Open MPI Users" 
> Date: Wednesday, September 17, 2008, 9:22 AM
>
> I don't quite understand the format of this file, but at first glance,
> you shouldn't need the following lines:
>
> export LIBMPI = -lmpi
>
> export MPIDIR=/nfs/sjafer/phd/openMPI/installed
> export LDFLAGS +=-L$(MPIDIR)/lib
> export INCLUDES_CPP += -I$(MPIDIR)/include
>
> It also doesn't seem like the last 2 arguments of this line are a good
> idea (the linker should automatically put /usr/lib and /lib in your
> search path, if appropriate):
>
> export LDFLAGS+=-L. -L$/usr/lib -L$/lib
>
> I also notice:
>
> export CPP=mpic++
> export CC=mpiCC
>
> I think you want "mpicc" for CC (note the lower case) -- mpiCC is the
> C++ compiler (mpic++ and mpiCC are synonyms).
>
> This might solve your problem.
>
>
>
> On Sep 15, 2008, at 4:56 PM, Shafagh Jafer wrote:
>
> > I am sending you my simulator's Makefile.common, which points to
> > openmpi, please take 

Re: [OMPI users] errors returned from openmpi-1.2.7 source code

2008-09-17 Thread Jeff Squyres
You shouldn't need to add any -I's or -L's or -l's for Open MPI.  Just  
use mpic++ and mpicc (per my first note, notice that "mpicc" (lower  
case) is the C compiler -- mpiCC is a synonym for the C++ compiler --  
this could be your problem).  Those wrappers add all the compiler /  
linker flags that you need.



On Sep 17, 2008, at 2:16 PM, Shafagh Jafer wrote:

The openmpi is installed in the following path: /opt/openmpi/1.2.7
so should I replace what you told me about /usr/lib with
/opt/openmpi/1.2.7 ??


--- On Wed, 9/17/08, Jeff Squyres  wrote:
From: Jeff Squyres 
Subject: Re: [OMPI users] errors returned from openmpi-1.2.7 source  
code

To: "Open MPI Users" 
Date: Wednesday, September 17, 2008, 9:22 AM

I don't quite understand the format of this file, but at first glance,
you shouldn't need the following lines:

export LIBMPI = -lmpi

export MPIDIR=/nfs/sjafer/phd/openMPI/installed
export LDFLAGS +=-L$(MPIDIR)/lib
export INCLUDES_CPP += -I$(MPIDIR)/include

It also doesn't seem like the last 2 arguments of this line are a good
idea (the linker should automatically put /usr/lib and /lib in your
search path, if appropriate):

export LDFLAGS+=-L. -L$/usr/lib -L$/lib

I also notice:

export CPP=mpic++
export CC=mpiCC

I think you want "mpicc" for CC (note the lower case) -- mpiCC is the
C++ compiler (mpic++ and mpiCC are synonyms).

This might solve your problem.



On Sep 15, 2008, at 4:56 PM, Shafagh Jafer wrote:

> I am sending you my simulator's Makefile.common, which points to
> openmpi. Please take a look at it. Thanks a lot.
>
> --- On Mon, 9/15/08, Jeff Squyres  wrote:
> From: Jeff Squyres 
> Subject: Re: [OMPI users] errors returned from openmpi-1.2.7 source
> code
> To: "Open MPI Users" 
> Date: Monday, September 15, 2008, 11:21 AM
>
> On Sep 14, 2008, at 1:24 PM, Shafagh Jafer wrote:
>
> > I installed openmpi-1.2.7 and tested the hello_c and ring_c examples
> > on single and multiple nodes and they worked fine. However, when I use
> > openmpi with my simulator (by replacing the old mpich path with the
> > new openmpi) I get many errors reported from
> > "/openMPI/openmpi-1.2.7/include/openmpi/ompi/mpi/cxx/*.h". Please see the
> > following snapshots:
> >
>
> It's not clear exactly what you did here.  Did you just replace
> MPICH's "mpiCC" with OMPI's "mpiCC"?  FWIW, this is almost always the
> easiest way to compile MPI applications: use that implementation's
> "wrapper" compiler (I'm assuming you have a C++ code in this case).
>
> These errors should not normally happen; they look kinda like you're
> somehow inadvertently mixing Open MPI and MPICH.
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Odd MPI_Bcast behavior

2008-09-17 Thread Jeff Squyres

The patch is in 1.2.6 and beyond.

It's not really a serialization issue -- it's an "early completion"  
optimization, meaning that as soon as the underlying network stack  
indicates that the buffer has been copied, OMPI marks the request as  
complete and returns.  But the data may not actually have been pushed  
out on the network wire yet (so to speak) -- it may still require  
additional API-driven progress before the message actually departs for  
the peer.  While it may sound counterintuitive, this is actually an  
acceptable compromise/optimization for MPI applications that dip into  
the MPI layer frequently -- they'll naturally progress anything that  
has been queued up but not fully sent yet.  Disabling early completion  
means that OMPI won't mark the request as complete until the message  
requires no further progression from OMPI for it to be transited to  
the peer (e.g., the network hardware can completely take over the  
progression).


Hence, in your case, it looks serialized because you put in a big  
sleep().  If you called other MPI functions instead of sleep, it  
wouldn't appear as serialized.


Make sense?  (yes, I know it's a fine line distinction ;-) )

OMPI v1.3 internally differentiates between "early completion" and  
"out on the wire" so that it can automatically tell the difference  
(i.e., we changed our message progression engine to recognize the  
difference).  This change was seen as too big to port back to the  
v1.2 series, so the compromise was to put the "disable early
completion" flag in the v1.2 series.



On Sep 17, 2008, at 12:31 PM, Gregory D Abram wrote:


Wow. I am indeed on IB.

So a program that calls an MPI_Bcast, then does a bunch of setup  
work that should be done in parallel before re-synchronizing, in  
fact serializes the setup work? I see it's not quite that bad - if I
run my little program on 5 nodes, I get 0 immediately, 1,2 and 4  
after 5 seconds and 3 after 10, revealing, I guess, the tree  
distribution.


Ticket 1224 isn't terribly clear - is this patch already in 1.2.6 or  
1.2.7, or do I have to download source, patch and build?


Greg


Jeff Squyres
Sent by: users-boun...@open-mpi.org
09/17/08 11:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] Odd MPI_Bcast behavior



Are you using IB, perchance?

We have an "early completion" optimization in the 1.2 series that can
cause this kind of behavior.  For apps that dip into the MPI layer
frequently, it doesn't matter.  But for those that do not dip into the
MPI layer frequently, it can cause delays like this.  See 
http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
 for a few more details.

If you're not using IB, let us know.


On Sep 17, 2008, at 10:34 AM, Gregory D Abram wrote:

> I have a little program which initializes, calls MPI_Bcast, prints a
> message, waits five seconds, and finalizes. I sure thought that each
> participating process would print the message immediately, then all
> would wait and exit - that's what happens with mvapich 1.0.0.  On
> OpenMPI 1.2.5, though, I get the message immediately from proc 0,
> then 5 seconds later, from proc 1, and then 5 seconds later, it
> exits - as if MPI_Finalize on proc 0 flushed the MPI_Bcast. If I add
> a MPI_Barrier after the MPI_Bcast, it works as I'd expect. Is this
> behavior correct? If so, I have a bunch of code to change in
> order to work correctly on OpenMPI.
>
> Greg
>
> Here's the code:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
>
> main(int argc, char *argv[])
> {
> char hostname[256]; int r, s;
> MPI_Init(&argc, &argv);
>
> gethostname(hostname, sizeof(hostname));
>
> MPI_Comm_rank(MPI_COMM_WORLD, &r);
> MPI_Comm_size(MPI_COMM_WORLD, &s);
>
> fprintf(stderr, "%d of %d: %s\n", r, s, hostname);
>
> int i = 9;
> MPI_Bcast(&i, sizeof(i), MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
> // MPI_Barrier(MPI_COMM_WORLD);
>
> fprintf(stderr, "%d: got it\n", r);
>
> sleep(5);
>
> MPI_Finalize();
> }
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Odd MPI_Bcast behavior

2008-09-17 Thread Eugene Loh




I guess this must depend on what BTL you're using.  If I run all
processes on the same node, I get the behavior you expect.  So, are you
running processes on the same node, or different nodes and, if
different, via TCP or IB?

Gregory D Abram wrote:

  I have a little program which initializes, calls MPI_Bcast, prints
a message, waits five seconds, and finalizes. I sure thought that each
participating process would print the message immediately, then all
would wait and exit - that's what happens with mvapich 1.0.0. On OpenMPI
1.2.5, though, I get the message immediately from proc 0, then 5
seconds later, from proc 1, and then 5 seconds later, it exits - as if
MPI_Finalize on proc 0 flushed the MPI_Bcast. If I add a MPI_Barrier
after the MPI_Bcast, it works as I'd expect. Is this behavior correct?
If so, I have a bunch of code to change in order to work correctly
on OpenMPI.

  #include <stdio.h>
  #include <unistd.h>
  #include <mpi.h>

  main(int argc, char *argv[])
  {
    char hostname[256]; int r, s;
    MPI_Init(&argc, &argv);

    gethostname(hostname, sizeof(hostname));

    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    MPI_Comm_size(MPI_COMM_WORLD, &s);

    fprintf(stderr, "%d of %d: %s\n", r, s, hostname);

    int i = 9;
    MPI_Bcast(&i, sizeof(i), MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    // MPI_Barrier(MPI_COMM_WORLD);

    fprintf(stderr, "%d: got it\n", r);

    sleep(5);

    MPI_Finalize();
  }
  





Re: [OMPI users] Problem with MPI_Send and MPI_Recv

2008-09-17 Thread Jeff Squyres
Additionally, since you technically have a heterogeneous situation  
(different OS versions on each node), you might want to:


- compile and install OMPI separately on each node (preferably in the  
same filesystem location, though)
- compile and install your MPI app separately on each node (preferably  
in the same filesystem location)


You *could* be seeing differences between libc on each node, etc.



On Sep 17, 2008, at 11:52 AM, Terry Dontje wrote:




Date: Wed, 17 Sep 2008 16:23:59 +0200
From: "Sofia Aparicio Secanellas" 
Subject: Re: [OMPI users] Problem with MPI_Send and MPI_Recv
To: "Open MPI Users" 
Message-ID: <0625EEFB84E04647A1930A963A8DF7E3@aparicio1>
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
reply-type=response

Hello Terry,

Thank you very much for your help.



> Sofia,
>
> I took your program and actually ran it successfully on my systems using
> Open MPI r19400.  A couple questions:
>
> 1.  Have you tried to run the program on a single node?
> mpirun -np 2 --host 10.4.5.123 --prefix /usr/local
> ./PruebaSumaParalela.out

>



Yes. In this case, the program works perfectly.


> 2.  Can you try and run the code the following way and is the output
> different?
> mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --mca mpi_preconnect_all
> 1 --prefix /usr/local ./PruebaSumaParalela.out

>



The program also hangs but the output is different. In both  
computers I get the following:


Inicio
Inicio
totalnodes:2
mynode:0
Inicio Recv



Ok, so it looks like rank 1 is not getting out of MPI_Init.
> 3.  When the program hangs can you attach a debugger to one of the
> processes and print out a stack?

>



I do not know how to do that.



With Solaris, I usually do the following:
% dbx - 
dbx>  where


> 4.  What version of Open MPI are you using, on what type of machine,
> using which OS?

>



Openmpi-1.2.2 in both computers

In 10.4.5.123 I have:
Ubuntu Linux pichurra 2.6.22-15-generic #1 SMP Tue Jun 10 09:21:34  
UTC 2008 i686 GNU/Linux


In edu@10.4.5.126 I have:
K-Ubuntu Linux hp1-Linux 2.6.20-16-generic #2 SMP Sun Sep 23  
19:50:39 UTC 2007 i686 GNU/Linux



Sorry for the bonehead question but is edu@10.4.5.126 the actual  
machine name?  Is its IP address really 10.4.5.126?  Can you try  
that instead?  I would guess the issue is that the tcp btl is  
somehow not matching the two nodes as being connected to each other.


--td

Sofia


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Why compiling in global paths (only) for configuration files?

2008-09-17 Thread Jeff Squyres

On Sep 17, 2008, at 9:49 AM, Paul Kapinos wrote:

If we add a " -x OPAL_PREFIX " flag, and thereby force explicit
forwarding of this environment variable, the error does not occur.
So we conclude that this variable needs to be exported across *all*
systems in the cluster.


It seems the variable OPAL_PREFIX will *NOT* be automatically
exported to new processes on the local and remote nodes.


Maybe the FAQ at http://www.open-mpi.org/faq/?category=building#installdirs
should be extended to mention this?



Hmm.  I don't know why it's not exporting for you -- it *is*  
automatically exporting OPAL_PREFIX for me (i.e., I don't need to  
specify it on the mpirun/-x command line).


Is there any chance that your wrapper is overriding this variable, or  
erasing the environment, or somesuch?


Ah -- here's another important point (after looking in the code):  
OPAL_PREFIX is only automatically propagated *by Open MPI* when using  
the rsh/ssh launcher.  Is that what you are using?  If not, OMPI  
assumes that the resource manager propagates mpirun's environment out  
to the back-end nodes.  If this does not happen, then you'll need to
-x OPAL_PREFIX.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Problem with MPI_Send and MPI_Recv

2008-09-17 Thread Terry Dontje



Date: Wed, 17 Sep 2008 16:23:59 +0200
From: "Sofia Aparicio Secanellas" 
Subject: Re: [OMPI users] Problem with MPI_Send and MPI_Recv
To: "Open MPI Users" 
Message-ID: <0625EEFB84E04647A1930A963A8DF7E3@aparicio1>
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
reply-type=response

Hello Terry,

Thank you very much for your help.

  

> Sofia,
>
> I took your program and actually ran it successfully on my systems using 
> Open MPI r19400.  A couple questions:

>
> 1.  Have you tried to run the program on a single node?
> mpirun -np 2 --host 10.4.5.123 --prefix /usr/local 
> ./PruebaSumaParalela.out

>



Yes. In this case, the program works perfectly.

  
> 2.  Can you try and run the code the following way and is the output 
> different?
> mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --mca mpi_preconnect_all 
> 1 --prefix /usr/local ./PruebaSumaParalela.out

>



The program also hangs but the output is different. In both computers I get 
the following:


Inicio
Inicio
totalnodes:2
mynode:0
Inicio Recv

  

Ok, so it looks like rank 1 is not getting out of MPI_Init
> 3.  When the program hangs can you attach a debugger to one of the 
> processes and print out a stack?

>



I do not know how to do that.

  

With Solaris, I usually do the following:
% dbx - 
dbx>  where


> 4.  What version of Open MPI are you using, on what type of machine, using 
> which OS?

>



Openmpi-1.2.2 in both computers

In 10.4.5.123 I have:
Ubuntu Linux pichurra 2.6.22-15-generic #1 SMP Tue Jun 10 09:21:34 UTC 2008 
i686 GNU/Linux


In edu@10.4.5.126 I have:
K-Ubuntu Linux hp1-Linux 2.6.20-16-generic #2 SMP Sun Sep 23 19:50:39 UTC 
2007 i686 GNU/Linux


  
Sorry for the bonehead question but is edu@10.4.5.126 the actual machine 
name?  Is its IP address really 10.4.5.126?  Can you try that instead?  
I would guess the issue is that the tcp btl is somehow not matching the 
two nodes as being connected to each other.


--td

Sofia




Re: [OMPI users] Problem with MPI_Send and MPI_Recv

2008-09-17 Thread Sofia Aparicio Secanellas

Hello Terry,

I was trying to debug. I set all the debugging parameters for
the MPI layer. Only with one parameter do I obtain something different. I enclose
the result of the following command:


mpirun --mca mpi_show_mca_params 1 -np 2 --host 
10.4.5.123,edu@10.4.5.126 --prefix /usr/local ./PruebaSumaParalela.out



Thanks again.

Sofia

- Original Message - 
From: "Terry Dontje" 

To: 
Sent: Wednesday, September 17, 2008 1:24 PM
Subject: Re: [OMPI users] Problem with MPI_Send and MPI_Recv



Sofia,

I took your program and actually ran it successfully on my systems using 
Open MPI r19400.  A couple questions:


1.  Have you tried to run the program on a single node?
mpirun -np 2 --host 10.4.5.123 --prefix /usr/local 
./PruebaSumaParalela.out


2.  Can you try and run the code the following way and is the output 
different?
mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --mca mpi_preconnect_all 
1 --prefix /usr/local ./PruebaSumaParalela.out


3.  When the program hangs can you attach a debugger to one of the 
processes and print out a stack?


4.  What version of Open MPI are you using, on what type of machine, using 
which OS?


--td

Date: Tue, 16 Sep 2008 18:15:59 +0200
From: "Sofia Aparicio Secanellas" 
Subject: [OMPI users] Problem with MPI_Send and MPI_Recv
To: 
Message-ID: 
Content-Type: text/plain; charset="iso-8859-1"

Hello,

I am new to MPI. I want to run a simple program (I enclose the
program) on 2 different computers. I have installed MPI on both
computers. I have compiled the program using:


mpiCC -o PruebaSumaParalela.out PruebaSumaParalela.cpp

I have copied the executable PruebaSumaParalela.out to my /home directory
on both computers. Then I run:


mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --prefix /usr/local 
./PruebaSumaParalela.out

The 10.4.5.123 computer prints:

Inicio
Inicio
totalnodes:2
mynode:0
Inicio Recv
totalnodes:2
mynode:1
Inicio Send
sum:375250

The edu@10.4.5.126 computer prints:

Inicio
Inicio
totalnodes:2
mynode:1
Inicio Send
sum:375250
totalnodes:2
mynode:0
Inicio Recv

But the program does not finish on either computer. It seems that the Send
and Recv do not work. The master computer is waiting to receive something
that the slave does not send.

Do you know what the problem could be ?

Thank you very much.

Sofia

-- next part --
HTML attachment scrubbed and removed
-- next part --
An embedded and charset-unspecified text was scrubbed...
Name: PruebaSumaParalela.cpp
URL: 



--



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users








debugPrueba.abw
Description: Binary data


Re: [OMPI users] Why compiling in global paths (only) for configuration files?

2008-09-17 Thread Paul Kapinos

Hi Rolf,

Rolf vandeVaart wrote:

I don't know -- this sounds like an issue with the Sun CT 8 build 
process.  It could also be a by-product of using the combined 32/64 
feature...?  I haven't used that in forever and I don't remember the 
restrictions.  Terry/Rolf -- can you comment?



I will write an separate eMail to ct-feedb...@sun.com


Hi Paul:
Yes, there are Sun people on this list!  We originally put those 
hardcoded paths in to make everything work correctly out of the box and 
our install process ensured that everything would be at 
/opt/SUNWhpc/HPC8.0.  However, let us take a look at everything that was 
just discussed here and see what we can do.  We will get back to you 
shortly.




I've just sent an eMail to ct-feedb...@sun.com with some explanation of 
our troubles...


The main trouble: we want to have *both* versions of CT 8.0 (for the Studio
and for the GNU compiler) installed on the same systems. The RPMs are not
relocatable, have the same name, and install everything into the same
directories... yes, it works out of the box, but only if just *one* version is
installed. So I started to move installations around, asking on this
mailing list, setting envvars, and parsing configuration files.


I think installing everything to hard-coded paths is somewhat inflexible.
Maybe you could provide relocatable RPMs sometime in the future?


But as mentioned above, our main goal is to have both versions of CT on
the same system working.


Best regards,

Paul Kapinos
<>

smime.p7s
Description: S/MIME Cryptographic Signature


[OMPI users] Odd MPI_Bcast behavior

2008-09-17 Thread Gregory D Abram


I have a little program which initializes, calls MPI_Bcast, prints a
message, waits five seconds, and finalizes.  I sure thought that each
participating process would print the message immediately, then all would
wait and exit - that's what happens with mvapich 1.0.0.  On OpenMPI 1.2.5,
though, I get the message immediately from proc 0, then 5 seconds later,
from proc 1, and then 5 seconds later, it exits - as if MPI_Finalize on proc
0 flushed the MPI_Bcast.  If I add a MPI_Barrier after the MPI_Bcast, it
works as I'd expect.  Is this behavior correct?  If so, I have a bunch
of code to change in order to work correctly on OpenMPI.

Greg

Here's the code:

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

main(int argc, char *argv[])
{
    char hostname[256]; int r, s;
    MPI_Init(&argc, &argv);

    gethostname(hostname, sizeof(hostname));

    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    MPI_Comm_size(MPI_COMM_WORLD, &s);

    fprintf(stderr, "%d of %d: %s\n", r, s, hostname);

    int i = 9;
    MPI_Bcast(&i, sizeof(i), MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    // MPI_Barrier(MPI_COMM_WORLD);

    fprintf(stderr, "%d: got it\n", r);

    sleep(5);

    MPI_Finalize();
}


Re: [OMPI users] Problem with MPI_Send and MPI_Recv

2008-09-17 Thread Sofia Aparicio Secanellas

Hello Terry,

Thank you very much for your help.


Sofia,

I took your program and actually ran it successfully on my systems using 
Open MPI r19400.  A couple questions:


1.  Have you tried to run the program on a single node?
mpirun -np 2 --host 10.4.5.123 --prefix /usr/local 
./PruebaSumaParalela.out




Yes. In this case, the program works perfectly.

2.  Can you try and run the code the following way and is the output 
different?
mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --mca mpi_preconnect_all 
1 --prefix /usr/local ./PruebaSumaParalela.out




The program also hangs but the output is different. In both computers I get 
the following:


Inicio
Inicio
totalnodes:2
mynode:0
Inicio Recv

3.  When the program hangs can you attach a debugger to one of the 
processes and print out a stack?




I do not know how to do that.

4.  What version of Open MPI are you using, on what type of machine, using 
which OS?




Openmpi-1.2.2 in both computers

In 10.4.5.123 I have:
Ubuntu Linux pichurra 2.6.22-15-generic #1 SMP Tue Jun 10 09:21:34 UTC 2008 
i686 GNU/Linux


In edu@10.4.5.126 I have:
K-Ubuntu Linux hp1-Linux 2.6.20-16-generic #2 SMP Sun Sep 23 19:50:39 UTC 
2007 i686 GNU/Linux



Sofia


- Original Message - 
From: "Terry Dontje" 

To: 
Sent: Wednesday, September 17, 2008 1:24 PM
Subject: Re: [OMPI users] Problem with MPI_Send and MPI_Recv



Sofia,

I took your program and actually ran it successfully on my systems using 
Open MPI r19400.  A couple questions:


1.  Have you tried to run the program on a single node?
mpirun -np 2 --host 10.4.5.123 --prefix /usr/local 
./PruebaSumaParalela.out


2.  Can you try and run the code the following way and is the output 
different?
mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --mca mpi_preconnect_all 
1 --prefix /usr/local ./PruebaSumaParalela.out


3.  When the program hangs can you attach a debugger to one of the 
processes and print out a stack?


4.  What version of Open MPI are you using, on what type of machine, using 
which OS?


--td

Date: Tue, 16 Sep 2008 18:15:59 +0200
From: "Sofia Aparicio Secanellas" 
Subject: [OMPI users] Problem with MPI_Send and MPI_Recv
To: 
Message-ID: 
Content-Type: text/plain; charset="iso-8859-1"

Hello,

I am new to MPI. I want to run a simple program (I enclose the
program) on 2 different computers. I have installed MPI on both
computers. I have compiled the program using:


mpiCC -o PruebaSumaParalela.out PruebaSumaParalela.cpp

I have copied the executable PruebaSumaParalela.out to my /home directory
on both computers. Then I run:


mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --prefix /usr/local 
./PruebaSumaParalela.out

The 10.4.5.123 computer prints:

Inicio
Inicio
totalnodes:2
mynode:0
Inicio Recv
totalnodes:2
mynode:1
Inicio Send
sum:375250

The edu@10.4.5.126 computer prints:

Inicio
Inicio
totalnodes:2
mynode:1
Inicio Send
sum:375250
totalnodes:2
mynode:0
Inicio Recv

But the program does not finish on either computer. It seems that the Send
and Recv do not work. The master computer is waiting to receive something
that the slave does not send.

Do you know what the problem could be ?

Thank you very much.

Sofia

-- next part --
HTML attachment scrubbed and removed
-- next part --
An embedded and charset-unspecified text was scrubbed...
Name: PruebaSumaParalela.cpp
URL: 



--



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users









Re: [OMPI users] Why compiling in global paths (only) for configuration files?

2008-09-17 Thread Rolf vandeVaart

Paul Kapinos wrote:

Hi Jeff again!


(update) it works with "truly" OpenMPI, but it does *not* work with SUN
Cluster Tools 8.0 (which is also an OpenMPI). So it seems to be a SUN
problem and not a general problem of openMPI. Sorry for falsely attributing
the problem.


Ah, gotcha.  I guess my Sun colleagues on this list will need to 
address that.  ;-)


I hope!






The only trouble we have now are the error messages like

-- 


Sorry!  You were supposed to get help about:
   no hca params found
from the file:
   help-mpi-btl-openib.txt
But I couldn't find any file matching that name.  Sorry!
-- 



(the job still runs without problems! :o)

if running openmpi from the new location after the old location has been
removed. (If the old location is still present there is no
error, so it seems to be an attempt to access a file on the old path.)


Doh; that's weird.

Maybe we have to explicitly pass the OPAL_PREFIX environment 
variable to all processes?


Hmm.  I don't need to do this in my 1.2.7 installation.  I do 
something like this (I assume you're using rsh/ssh as a launcher?):


We use zsh as login shell, ssh as communication protocol and a
wrapper around mpiexec which produces a command somewhat like



/opt/MPI/openmpi-1.2.7/linux64/intel/bin/mpiexec -x LD_LIBRARY_PATH -x 
PATH -x MPI_NAME --hostfile 
/tmp/pk224850/26654@linuxhtc01/hostfile3564 -n 2 MPI_FastTest.exe



(hostfiles are generated temporarily by our wrapper due to load
balancing, and /opt/MPI/openmpi-1.2.7/linux64/intel/ is the path to
our local installation of OpenMPI... )



You see that we also explicitly order OpenMPI to export environment 
variables PATH and LD_LIBRARY_PATH.


If we add a " -x OPAL_PREFIX " flag, and thereby force explicit
forwarding of this environment variable, the error does not occur. So
we conclude that this variable needs to be exported across *all*
systems in the cluster.


It seems the variable OPAL_PREFIX will *NOT* be automatically
exported to new processes on the local and remote nodes.


Maybe the FAQ at
http://www.open-mpi.org/faq/?category=building#installdirs should be
extended to mention this?





Did you (or anyone reading this message) have any contact with SUN
developers to point out this circumstance? *Why* do they use
hard-coded paths? :o)


I don't know -- this sounds like an issue with the Sun CT 8 build 
process.  It could also be a by-product of using the combined 32/64 
feature...?  I haven't used that in forever and I don't remember the 
restrictions.  Terry/Rolf -- can you comment?



I will write an separate eMail to ct-feedb...@sun.com


Hi Paul:
Yes, there are Sun people on this list!  We originally put those 
hardcoded paths in to make everything work correctly out of the box and 
our install process ensured that everything would be at 
/opt/SUNWhpc/HPC8.0.  However, let us take a look at everything that was 
just discussed here and see what we can do.  We will get back to you 
shortly.


Rolf


Re: [OMPI users] mpirun hang

2008-09-17 Thread Christophe Spaggiari
Thank you very much. That was it. I didn't know that a firewall was running
by default on the default Yellow Dog Linux installation, since nothing
was asked about this issue during the installation.
You really saved my day George.
Regards,
Chris


On Wed, Sep 17, 2008 at 2:24 PM, George Bosilca wrote:

> Christophe,
>
> Looks like a firewall problem. Please check the mailing list archives for
> the proper fix.
>
>  Thanks,
>george.
>
>
> On Sep 17, 2008, at 6:53 AM, Christophe Spaggiari wrote:
>
>  Hi,
>>
>> I am new to MPI and am trying to get my Open MPI environment up and running. I
>> have two machines, Alpha and Beta, on which I have successfully installed
>> Open MPI in /usr/local/openmpi. I have made the ssh settings so I do not have to
>> enter a password manually (using rsa keys), and I have modified the .rc files
>> to get the right PATH and right LD_LIBRARY_PATH when logging in via ssh on both
>> machines.
>>
>> In order to check if my installation was working I have started "mpirun
>> hostname" on Alpha and it is working just fine.
>> I have tested as well "mpirun hostname" on Beta and it is working fine
>> too.
>>
>> I have tested "ssh beta env" to check that my setting are correct and it
>> is working as expected.
>>
>> BUT when I am running "mpirun -host beta hostname" from Alpha nothing
>> happens. After several minutes I have to kill the "mpirun" process with
>> Ctrl-C (two times). Have any of you run into a similar problem and could you tell
>> me what I am doing wrong? It seems that each local installation is working
>> fine but I cannot start tasks on other nodes.
>>
>> The interesting point is that when I run a "ps" on Beta I can see that a
>> "orted" process is started (and stay in process list) for each of my try to
>> run "mpirun" command from Alpha. So somehow Beta gets the command to start
>> orted and does it but then, nothing happens ...
>>
>> I have been browsing the users list for similar issues and I found one guy
>> describing exactly the same problem, but there was no answer to his post.
>>
>> Not sure if this is relevant but Alpha and Beta are Sony PS3 machines
>> running Yellow Dog Linux 6.1
>>
>> Thanks in advance for your help.
>>
>> Regards,
>> Chris
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] BLCR not found

2008-09-17 Thread Josh Hursey
It looks like the configure script is picking up the wrong lib directory
(/home/osa/blcr/lib64 instead of /home/osa/blcr/lib):
  gcc -o conftest -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread \
    -I/home/osa/blcr/include -L/home/osa/blcr/lib64 \
    conftest.c -lcr -lnsl -lutil -lm
  /usr/bin/ld: cannot find -lcr

I'll try to reproduce and work up a patch. In the meantime you may
check to make sure that the BLCR path is set correctly in your $PATH
and $LD_LIBRARY_PATH.


Josh


On Sep 17, 2008, at 7:44 AM, Santolo Felaco wrote:


This is the zipped file config.log

2008/9/17 Josh Hursey 
Can you send me a zip'ed up version of the config.log from your  
build? That will help in determining what went wrong with configure.


Cheers,
Josh


On Sep 17, 2008, at 6:09 AM, Santolo Felaco wrote:

Hi, I want to install openmpi-1.3. I have invoked ./configure --with-ft=cr
--enable-ft-thread --enable-mpi-threads --with-blcr=/home/osa/blcr/
--enable-progress-threads

This is the error message that shows:
 BLCR support requested but not found.  Perhaps you need to specify  
the location of the BLCR libraries.


I have installed blcr in /home/osa/blcr, I report the list of  
directory blcr:

.:
bin  include  lib  libexec  man

./bin:
cr_checkpoint  cr_restart  cr_run

./include:
blcr_common.h  blcr_errcodes.h  blcr_ioctl.h  blcr_proc.h  libcr.h

./lib:
blcr  libcr_omit.la  libcr_omit.so.0  libcr_run.la   
libcr_run.so.0  libcr.solibcr.so.0.4.1
libcr.la  libcr_omit.so  libcr_omit.so.0.4.1  libcr_run.so   
libcr_run.so.0.4.1  libcr.so.0


./lib/blcr:
2.6.24-16-generic

./lib/blcr/2.6.24-16-generic:
blcr_imports.ko  blcr.ko  blcr_vmadump.ko

./libexec:

./man:
man1

./man/man1:
cr_checkpoint.1  cr_restart.1  cr_run.1


Help me, please

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Why compiling in global paths (only) for configuration files?

2008-09-17 Thread Jeff Squyres

On Sep 17, 2008, at 5:49 AM, Paul Kapinos wrote:

But setting the environment variable OPAL_PREFIX to an
appropriate value (assuming PATH and LD_LIBRARY_PATH are set
too) is not enough to let OpenMPI rock from the new
location.

Hmm.  It should be.


(update) it works with "truly" OpenMPI, but it does *not* work with SUN
Cluster Tools 8.0 (which is also an OpenMPI). So it seems to be a SUN
problem and not a general problem of openMPI. Sorry for falsely attributing
the problem.


Ah, gotcha.  I guess my Sun colleagues on this list will need to  
address that.  ;-)



The only trouble we have now are the error messages like

--
Sorry!  You were supposed to get help about:
   no hca params found
from the file:
   help-mpi-btl-openib.txt
But I couldn't find any file matching that name.  Sorry!
--

(the job still runs without problems! :o)

if running openmpi from the new location after the old location has been
removed. (If the old location is still present there is no
error, so it seems to be an attempt to access a file on the old path.)


Doh; that's weird.

Maybe we have to explicitly pass the OPAL_PREFIX environment  
variable to all processes?


Hmm.  I don't need to do this in my 1.2.7 installation.  I do  
something like this (I assume you're using rsh/ssh as a launcher?):


# OMPI installed to /home/jsquyres/bogus, then mv'ed to /home/jsquyres/bogus/foo

tcsh% set path = (/home/jsquyres/bogus/foo/bin $path)
tcsh% setenv LD_LIBRARY_PATH /home/jsquyres/bogus/foo/lib:$LD_LIBRARY_PATH

tcsh% setenv OPAL_PREFIX /home/jsquyres/bogus/foo
tcsh% mpirun --hostfile whatever hostname
...works fine
tcsh% mpicc ring.c -o ring
tcsh% mpirun --hostfile whatever --mca btl openib,self ring
...works fine

Is this different for you?

Note one of the configure files contained in Sun ClusterMPI 8.0 (see
attached file). The paths are really hard-coded in instead of
using variables; this makes the package really not relocatable
without parsing the configure files.


Did you (or anyone reading this message) have any contact with SUN
developers to point out this circumstance? *Why* do they use
hard-coded paths? :o)


I don't know -- this sounds like an issue with the Sun CT 8 build  
process.  It could also be a by-product of using the combined 32/64  
feature...?  I haven't used that in forever and I don't remember the  
restrictions.  Terry/Rolf -- can you comment?


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] BLCR not found

2008-09-17 Thread Santolo Felaco
This is the zipped file config.log

2008/9/17 Josh Hursey 

> Can you send me a zip'ed up version of the config.log from your build? That
> will help in determining what went wrong with configure.
>
> Cheers,
> Josh
>
>
> On Sep 17, 2008, at 6:09 AM, Santolo Felaco wrote:
>
>  Hi, I want to install openmpi-1.3. I have invoked ./configure --with-ft=cr
>> --enable-ft-thread --enable-mpi-threads --with-blcr=/home/osa/blcr/
>> --enable-progress-threads
>> This is the error message that shows:
>>  BLCR support requested but not found.  Perhaps you need to specify the
>> location of the BLCR libraries.
>>
>> I have installed blcr in /home/osa/blcr, I report the list of directory
>> blcr:
>> .:
>> bin  include  lib  libexec  man
>>
>> ./bin:
>> cr_checkpoint  cr_restart  cr_run
>>
>> ./include:
>> blcr_common.h  blcr_errcodes.h  blcr_ioctl.h  blcr_proc.h  libcr.h
>>
>> ./lib:
>> blcr  libcr_omit.la  libcr_omit.so.0  libcr_run.la libcr_run.so.0
>>   libcr.solibcr.so.0.4.1
>> libcr.la  libcr_omit.so  libcr_omit.so.0.4.1  libcr_run.so
>>  libcr_run.so.0.4.1  libcr.so.0
>>
>> ./lib/blcr:
>> 2.6.24-16-generic
>>
>> ./lib/blcr/2.6.24-16-generic:
>> blcr_imports.ko  blcr.ko  blcr_vmadump.ko
>>
>> ./libexec:
>>
>> ./man:
>> man1
>>
>> ./man/man1:
>> cr_checkpoint.1  cr_restart.1  cr_run.1
>>
>>
>> Help me, please
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


config.tar
Description: Unix tar archive


Re: [OMPI users] BLCR not found

2008-09-17 Thread Josh Hursey
Can you send me a zip'ed up version of the config.log from your  
build? That will help in determining what went wrong with configure.


Cheers,
Josh

On Sep 17, 2008, at 6:09 AM, Santolo Felaco wrote:

Hi, I want to install openmpi-1.3. I have invoked ./configure --with-ft=cr
--enable-ft-thread --enable-mpi-threads --with-blcr=/home/osa/blcr/
--enable-progress-threads

This is the error message that shows:
 BLCR support requested but not found.  Perhaps you need to specify  
the location of the BLCR libraries.


I have installed blcr in /home/osa/blcr, I report the list of  
directory blcr:

.:
bin  include  lib  libexec  man

./bin:
cr_checkpoint  cr_restart  cr_run

./include:
blcr_common.h  blcr_errcodes.h  blcr_ioctl.h  blcr_proc.h  libcr.h

./lib:
blcr  libcr_omit.la  libcr_omit.so.0  libcr_run.la   
libcr_run.so.0  libcr.solibcr.so.0.4.1
libcr.la  libcr_omit.so  libcr_omit.so.0.4.1  libcr_run.so   
libcr_run.so.0.4.1  libcr.so.0


./lib/blcr:
2.6.24-16-generic

./lib/blcr/2.6.24-16-generic:
blcr_imports.ko  blcr.ko  blcr_vmadump.ko

./libexec:

./man:
man1

./man/man1:
cr_checkpoint.1  cr_restart.1  cr_run.1


Help me, please

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Checkpointing a restarted app fails

2008-09-17 Thread Josh Hursey


On Sep 16, 2008, at 11:18 PM, Matthias Hovestadt wrote:


Hi!

Since I am interested in fault tolerance, checkpointing and
restart of OMPI is an interesting feature for me. So I installed
BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
I followed the instructions in the "Fault Tolerance Guide"
in the OMPI wiki:

./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s

This gave me an OMPI version with checkpointing support, so I
started testing. The good news is: I am able to checkpoint and
restart applications. The bad news is: checkpointing a restarted
application fails.

In detail:

1) Starting the application

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml  
yafaray.xml


This starts my MPI-enabled application without any problems.


2) Checkpointing the application

First I queried the PID of the mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  13897  0.4  0.2  63992  2704 pts/0S+   04:59   0:00  
mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml


Then I checkpointed the job, terminating it directly:

ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.:   0 ompi_global_snapshot_13897.ckpt
ccs@grid-demo-1:~$

The application indeed terminated:
-- 

mpirun noticed that process rank 0 with PID 13898 on node grid- 
demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
-- 


2 total processes killed (some possibly by mpirun during cleanup)

The checkpoint command generated a checkpoint dataset
of 367MB size:

ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367Mompi_global_snapshot_13897.ckpt/
ccs@grid-demo-1:~$



3) Restarting the application

For restarting the application, I first executed ompi-clean,
then restarting the job with preloading all files:

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ ompi-restart --preload  
ompi_global_snapshot_13897.ckpt/


Restarting works pretty fine. The jobs restarts from the
checkpointed state and continues to execute. If not interrupted,
it continues until its end, returning a correct result.

However, I observed one weird thing: restarting the application
seems to have changed the checkpoint dataset. Moreover, two new
directories have been created at restart time:

  4 drwx--  3 ccs  ccs4096 Sep 17 05:09  
ompi_global_snapshot_13897.ckpt

  4 drwx--  2 ccs  ccs4096 Sep 17 05:09 opal_snapshot_0.ckpt
  4 drwx--  2 ccs  ccs4096 Sep 17 05:09 opal_snapshot_1.ckpt




The ('opal_snapshot_*.ckpt') directories are an artifact of the
--preload option. This option will copy the individual checkpoint to
the remote machine before executing.




4) Checkpointing again

Again I first looked for the PID of the running mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  14005  0.0  0.2  63992  2736 pts/1S+   05:09   0:00  
mpirun -am ft-enable-cr --app /home/ccs/ 
ompi_global_snapshot_13897.ckpt/restart-appfile



Then I checkpointed it:

ccs@grid-demo-1:~$ ompi-checkpoint 14005


When executing this checkpoint command, the running application
directly aborts, even though I did not specify the "--term" option:

-- 

mpirun noticed that process rank 1 with PID 14050 on node grid- 
demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
-- 


ccs@grid-demo-1:~$


Interesting. This looks like a bug with the restart mechanism in Open  
MPI. This was working fine, but something must have changed in the  
trunk to break it.


A useful piece of debugging information for me would be a stack trace
from the failed process. You should be able to get this from a core
file it left, or if you set the following MCA variable in
$HOME/.openmpi/mca-params.conf:

  opal_cr_debug_sigpipe=1
This will cause the Open MPI app to wait in a sleep loop when it  
detects a Broken Pipe signal. Then you should be able to attach a  
debugger and retrieve a stack trace.





The "ompi-checkpoint 14005" command however does not return.



Is anybody here using checkpoint/restart capabilities of OMPI?
Did anybody encounter similar problems? Or is there something
wrong about my way of using ompi-checkpoint/ompi-restart?


I work with the checkpoint/restart functionality on a daily basis,  
but I must admit that I haven't worked on the trunk in a few weeks.   
I'll take a look and let you know what I find. I suspect that Open  
MPI is not resetting properly after a checkpoint.





Any hint is greatly appreciated! :-)



Best,
Matthias
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Problem with MPI_Send and MPI_Recv

2008-09-17 Thread Terry Dontje

Sofia,

I took your program and actually ran it successfully on my systems using 
Open MPI r19400.  A couple questions:


1.  Have you tried to run the program on a single node?
mpirun -np 2 --host 10.4.5.123 --prefix /usr/local 
./PruebaSumaParalela.out


2.  Can you try and run the code the following way and is the output 
different?
mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --mca 
mpi_preconnect_all 1 --prefix /usr/local ./PruebaSumaParalela.out


3.  When the program hangs can you attach a debugger to one of the 
processes and print out a stack?


4.  What version of Open MPI are you using, on what type of machine, 
using which OS?


--td

Date: Tue, 16 Sep 2008 18:15:59 +0200
From: "Sofia Aparicio Secanellas" 
Subject: [OMPI users] Problem with MPI_Send and MPI_Recv
To: 
Message-ID: 
Content-Type: text/plain; charset="iso-8859-1"

Hello,

I am new to MPI. I want to run a simple program (I enclose the program) on 2
different computers. I have installed MPI on both computers. I have compiled
the program using:

mpiCC -o PruebaSumaParalela.out PruebaSumaParalela.cpp

I have copied the executable PruebaSumaParalela.out to my /home directory on
both computers. Then I run:

mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --prefix /usr/local ./PruebaSumaParalela.out 


The 10.4.5.123 computer prints:

Inicio
Inicio
totalnodes:2
mynode:0
Inicio Recv
totalnodes:2
mynode:1
Inicio Send
sum:375250

The edu@10.4.5.126 computer prints:

Inicio
Inicio
totalnodes:2
mynode:1
Inicio Send
sum:375250
totalnodes:2
mynode:0
Inicio Recv

But the program does not finish on either computer. It seems that the Send and
Recv do not work. The master computer is waiting to receive something that the
slave does not send.
Do you know what the problem could be ?

Thank you very much.

Sofia

-- next part --
HTML attachment scrubbed and removed
-- next part --
An embedded and charset-unspecified text was scrubbed...
Name: PruebaSumaParalela.cpp
URL: 


--
  




[OMPI users] mpirun hang

2008-09-17 Thread Christophe Spaggiari
Hi,
I am new to MPI and am trying to get my Open MPI environment up and running. I
have two machines, Alpha and Beta, on which I have successfully installed
Open MPI in /usr/local/openmpi. I have made the ssh settings so I do not have to
enter a password manually (using rsa keys), and I have modified the .rc files
to get the right PATH and right LD_LIBRARY_PATH when logging in via ssh on both
machines.

In order to check if my installation was working I have started "mpirun
hostname" on Alpha and it is working just fine.
I have tested as well "mpirun hostname" on Beta and it is working fine too.

I have tested "ssh beta env" to check that my setting are correct and it is
working as expected.

BUT when I am running "mpirun -host beta hostname" from Alpha nothing
happens. After several minutes I have to kill the "mpirun" process with
Ctrl-C (two times). Have any of you run into a similar problem and could you tell
me what I am doing wrong? It seems that each local installation is working
fine but I cannot start tasks on other nodes.

The interesting point is that when I run a "ps" on Beta I can see that a
"orted" process is started (and stay in process list) for each of my try to
run "mpirun" command from Alpha. So somehow Beta gets the command to start
orted and does it but then, nothing happens ...

I have been browsing the users list for similar issues and I found one guy
describing exactly the same problem, but there was no answer to his post.

Not sure if this is relevant but Alpha and Beta are Sony PS3 machines
running Yellow Dog Linux 6.1

Thanks in advance for your help.

Regards,
Chris


[OMPI users] BLCR not found

2008-09-17 Thread Santolo Felaco
Hi, I want to install openmpi-1.3. I have invoked ./configure --with-ft=cr
--enable-ft-thread --enable-mpi-threads --with-blcr=/home/osa/blcr/
--enable-progress-threads
This is the error message that is shown:
 BLCR support requested but not found.  Perhaps you need to specify the
location of the BLCR libraries.

I have installed BLCR in /home/osa/blcr; here is the directory listing of the
BLCR install:
.:
bin  include  lib  libexec  man

./bin:
cr_checkpoint  cr_restart  cr_run

./include:
blcr_common.h  blcr_errcodes.h  blcr_ioctl.h  blcr_proc.h  libcr.h

./lib:
blcr  libcr_omit.la  libcr_omit.so.0  libcr_run.la
libcr_run.so.0  libcr.solibcr.so.0.4.1
libcr.la  libcr_omit.so  libcr_omit.so.0.4.1  libcr_run.so
libcr_run.so.0.4.1  libcr.so.0

./lib/blcr:
2.6.24-16-generic

./lib/blcr/2.6.24-16-generic:
blcr_imports.ko  blcr.ko  blcr_vmadump.ko

./libexec:

./man:
man1

./man/man1:
cr_checkpoint.1  cr_restart.1  cr_run.1


Help me, please
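
(A hedged sketch of pointing configure more explicitly at the BLCR libraries,
assuming the 1.3 configure accepts the separate libdir option; the paths are
the ones from the listing above, and config.log will show which BLCR test
failed if it still is not found:)

   ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
               --enable-progress-threads \
               --with-blcr=/home/osa/blcr \
               --with-blcr-libdir=/home/osa/blcr/lib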


Re: [OMPI users] Why compilig in global paths (only) for configuretion files?

2008-09-17 Thread Paul Kapinos

Hi Jeff again!

But setting the environment variable OPAL_PREFIX to an 
appropriate value (assuming PATH and LD_LIBRARY_PATH are set too) 
is not enough to let Open MPI run from the new location.


Hmm.  It should be.


(update) it works with a "plain" Open MPI build, but it does *not* work with 
Sun Cluster Tools 8.0 (which is also an Open MPI). So it seems to be a Sun 
problem and not a general problem of Open MPI. Sorry for wrongly attributing 
the problem.



The only trouble we have now is an error message like

--
Sorry!  You were supposed to get help about:
no hca params found
from the file:
help-mpi-btl-openib.txt
But I couldn't find any file matching that name.  Sorry!
--

(the job still runs without problems! :o)

when running Open MPI from the new location after the old location has been 
removed. (If the old location is still present there is no error, so it seems 
to be an attempt to access a file under the old path.)


Maybe we have to explicitly pass the OPAL_PREFIX environment variable to 
all processes?
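
(A hedged sketch of one way to do that for a relocated install; the prefix
path is only an example, and -x is mpirun's standard option for exporting an
environment variable to the launched processes:)

   export OPAL_PREFIX=/new/location/of/openmpi
   export PATH=$OPAL_PREFIX/bin:$PATH
   export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH
   mpirun -x OPAL_PREFIX -x LD_LIBRARY_PATH -np 4 ./a.out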




This is because all the files containing settings for 
opal_wrapper, which are located in share/openmpi/ and called e.g. 
mpif77-wrapper-data.txt, also contain hard-coded paths (defined by the 
installation --prefix).


Hmm; they should not.  In my 1.2.7 install, I see the following:

-
[11:14] svbu-mpi:/home/jsquyres/bogus/share/openmpi % cat 
mpif77-wrapper-data.txt

# There can be multiple blocks of configuration data, chosen by
# compiler flags (using the compiler_args key to chose which block
# should be activated.  This can be useful for multilib builds.  See the
# multilib page at:
#https://svn.open-mpi.org/trac/ompi/wiki/compilerwrapper3264
# for more information.

project=Open MPI
project_short=OMPI
version=1.2.7rc6r19546
language=Fortran 77
compiler_env=F77
compiler_flags_env=FFLAGS
compiler=gfortran
extra_includes=
preprocessor_flags=
compiler_flags=
linker_flags=
libs=-lmpi_f77 -lmpi -lopen-rte -lopen-pal   -ldl   -Wl,--export-dynamic 
-lnsl -lutil -lm -ldl

required_file=not supported
includedir=${includedir}
libdir=${libdir}
[11:14] svbu-mpi:/home/jsquyres/bogus/share/openmpi %
-

Note the "includedir" and "libdir" lines -- they're expressed in terms 
of ${foo}, which we can replace when OPAL_PREFIX (or related) is used.


What version of OMPI are you using?



Note one of the configuration files contained in Sun Cluster Tools 8.0 (see 
the attached file). The paths really are hard-coded instead of using 
variables; this makes the package not relocatable without parsing the 
configuration files.


Did you (or anyone reading this message) have any contact with the Sun 
developers to point out this circumstance? *Why* do they use hard-coded 
paths? :o)


best regards,

Paul Kapinos
#
# Default word-size (used when -m flag is supplied to wrapper compiler)
#
compiler_args=

project=Open MPI
project_short=OMPI
version=r19400-ct8.0-b31c-r29

language=Fortran 90
compiler_env=FC
compiler_flags_env=FCFLAGS
compiler=f90
module_option=-M
extra_includes=
preprocessor_flags=
compiler_flags=
libs=-lmpi -lopen-rte -lopen-pal -lnsl -lrt -lm -ldl -lutil -lpthread -lmpi_f77 
-lmpi_f90
linker_flags=-R/opt/mx/lib/lib64 -R/opt/SUNWhpc/HPC8.0/lib/lib64 
required_file=
includedir=/opt/SUNWhpc/HPC8.0/include/64
libdir=/opt/SUNWhpc/HPC8.0/lib/lib64

#
# Alternative word-size (used when -m flag is not supplied to wrapper compiler)
#
compiler_args=-m32

project=Open MPI
project_short=OMPI
version=r19400-ct8.0-b31c-r29

language=Fortran 90
compiler_env=FC
compiler_flags_env=FCFLAGS
compiler=f90
module_option=-M
extra_includes=
preprocessor_flags=
compiler_flags=-m32
libs=-lmpi -lopen-rte -lopen-pal -lnsl -lrt -lm -ldl -lutil -lpthread -lmpi_f77 
-lmpi_f90
linker_flags=-R/opt/mx/lib -R/opt/SUNWhpc/HPC8.0/lib 
required_file=
includedir=/opt/SUNWhpc/HPC8.0/include
libdir=/opt/SUNWhpc/HPC8.0/lib


Re: [OMPI users] Where is ompi-chekpoint?

2008-09-17 Thread Matthias Hovestadt

Hi!


Hi, I have installed openmpi-1.2.7 with the following commands:
./configure --with-ft=cr --enable-ft-enable-thread --enable-mpi-thread
--with-blcr=$HOME/blcr --prefix=$HOME/openmpi
make all install
In the bin directory under $HOME/openmpi there is no ompi-checkpoint or
ompi-restart.


As far as I know, checkpointing support is not available
in OMPI 1.2.7. You have to use the devel version (1.3) of
OMPI, e.g. by checking out the source from SVN.


Best,
Matthias



Re: [OMPI users] Problem with MPI_Send and MPI_Recv

2008-09-17 Thread Sofia Aparicio Secanellas

Hello Gus,

Thank you very much for your answer, but I do not think that this is the 
problem. I have rewritten everything as a C program and I obtain the same 
result.


Does anyone have any idea about the problem?

Sofia

- Original Message - 
From: "Gus Correa" 

To: "Open MPI Users" 
Sent: Tuesday, September 16, 2008 9:42 PM
Subject: Re: [OMPI users] Problem with MPI_Send and MPI_Recv



Hello Sofia and list

I am not a C++ person, I must say.
However, I noticed that you wrote the program in C++,
compiled it with the mpiCC (C++) compiler wrapper,
but your MPI calls are written with the MPI C binding syntax,
not the MPI C++ binding syntax.

E.g. :

MPI_Send(&sum,1,MPI_INT,0,1,MPI_COMM_WORLD);

instead of something like this:

comm.Send(&sum,1,MPI::INT,0,1);

I wonder if this mixed C++ / C environment may have caused some of the 
trouble, although I am not sure about that.

Since the specific C++ commands that you use are basically for
printing messages, it may be easier to transform the program into
a C program, by replacing the appropriate include files
and the C++ specific I/O commands by C commands,
and then compile the program again with mpicc.
An alternative is to write the MPI function calls in the C++ binding 
syntax.
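
For example, here is a hedged, untested sketch of the same sum program written
against the MPI C++ bindings that mpiCC provides (it only mirrors the program
quoted below; it is not the original code):

#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI::Init(argc, argv);
    int totalnodes = MPI::COMM_WORLD.Get_size();
    int mynode     = MPI::COMM_WORLD.Get_rank();

    // each rank sums its share of 1..1000
    int sum = 0;
    for (int i = 1000*mynode/totalnodes + 1; i <= 1000*(mynode+1)/totalnodes; i++)
        sum += i;

    if (mynode != 0) {
        // workers send their partial sum to rank 0 with tag 1
        MPI::COMM_WORLD.Send(&sum, 1, MPI::INT, 0, 1);
    } else {
        // rank 0 collects and accumulates the partial sums
        for (int j = 1; j < totalnodes; j++) {
            int accum = 0;
            MPI::COMM_WORLD.Recv(&accum, 1, MPI::INT, j, 1);
            sum += accum;
        }
        std::printf("sum: %d\n", sum);
    }
    MPI::Finalize();
    return 0;
}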


I hope this helps.

Gus Correa

--
-
Gustavo J. Ponce Correa, PhD - Email: g...@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
-


Sofia Aparicio Secanellas wrote:


Hello,
 I am new to MPI. I want to run a simple program (I enclose the 
program) on 2 different computers. I have installed MPI on both 
computers. I have compiled the program using:

 mpiCC -o PruebaSumaParalela.out PruebaSumaParalela.cpp
 I have copied the executable PruebaSumaParalela.out to my /home 
directory on both computers. Then I run:
 mpirun -np 2 --host 10.4.5.123,edu@10.4.5.126 --prefix /usr/local 
./PruebaSumaParalela.out

 The 10.4.5.123 computer prints:
 Inicio
Inicio
totalnodes:2
mynode:0
Inicio Recv
totalnodes:2
mynode:1
Inicio Send
sum:375250
 The edu@10.4.5.126  computer prints:
 Inicio
Inicio
totalnodes:2
mynode:1
Inicio Send
sum:375250
totalnodes:2
mynode:0
Inicio Recv
 But the program does not finish on either computer. It seems that the Send 
and Recv do not work: the master computer is waiting to receive something 
that the slave does not send.

Do you know what the problem could be ?
 Thank you very much.
 Sofia





/* PruebaSumaParalela.cpp -- the archive stripped the '&' characters, the header
   names, and the tail of the file; the receive loop below is reconstructed from
   the declared variables (accum, status) and the printed messages */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char ** argv){
    int mynode, totalnodes;
    int sum,startval,endval,accum;
    printf("Inicio\n");
    MPI_Status status;
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
    printf("totalnodes: %d\n",totalnodes);
    printf("mynode: %d\n",mynode);
    sum = 0;
    startval = 1000*mynode/totalnodes+1;
    endval = 1000*(mynode+1)/totalnodes;
    for(int i=startval;i<=endval;i=i+1)
        sum = sum + i;
    if(mynode!=0){
        printf("Inicio Send\n");
        printf("sum: %d\n",sum);
        MPI_Send(&sum,1,MPI_INT,0,1,MPI_COMM_WORLD);
        printf("Send sum\n");
    }
    else {
        printf("Inicio Recv\n");
        for(int j=1;j<totalnodes;j=j+1){
            MPI_Recv(&accum,1,MPI_INT,j,1,MPI_COMM_WORLD,&status);
            sum = sum + accum;
        }
    }
    MPI_Finalize();
    return 0;
}

Re: [OMPI users] Newbie doubt.

2008-09-17 Thread jody
Hi

You must close the files using
MPI_File_close(MPI_File *fh)
before calling MPI_Finalize.

By the way, I think you shouldn't do
  strcat(argv[1], ".bz2");
this would overwrite any following arguments.
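
A hedged, untested sketch of both fixes, dropping into the main() you quote
below (variable names are yours):

  /* build the output name in a separate buffer instead of modifying argv[1] */
  char nomeSaida[4096];
  snprintf(nomeSaida, sizeof(nomeSaida), "%s.bz2", argv[1]);

  MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY,
                MPI_INFO_NULL, &arquivoEntrada);
  MPI_File_open(MPI_COMM_WORLD, nomeSaida, MPI_MODE_RDWR | MPI_MODE_CREATE,
                MPI_INFO_NULL, &arquivoSaida);

  /* ... do the I/O ... */

  /* close both files before finalizing */
  MPI_File_close(&arquivoEntrada);
  MPI_File_close(&arquivoSaida);
  MPI_Finalize();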

Jody


On Wed, Sep 17, 2008 at 5:13 AM, Davi Vercillo C. Garcia (デビッド)
 wrote:
> Hi,
>
> I'm starting to use OpenMPI and I'm having some trouble. I wrote a
> simple program that tries to open files using the function
> MPI_File_open(), like below:
>
> /* header names and '&' characters were stripped by the archive; these are
> the headers the program needs */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> #include <mpi.h>
>
> int processoPrincipal(void);
> int processosEscravos(void);
>
> int main(int argc, char** argv) {
>    int meuRank, numeroProcessos;
>    MPI_File arquivoEntrada, arquivoSaida;
>
>    MPI_Init(&argc, &argv);
>    MPI_Comm_rank(MPI_COMM_WORLD, &meuRank);
>    MPI_Comm_size(MPI_COMM_WORLD, &numeroProcessos);
>
>    MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY,
> MPI_INFO_NULL, &arquivoEntrada);
>    strcat(argv[1], ".bz2");
>    MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDWR |
> MPI_MODE_CREATE, MPI_INFO_NULL, &arquivoSaida);
>
>    if (meuRank != 0) {
>        processoPrincipal();
>    } else {
>        processosEscravos();
>    }
>
>    MPI_Finalize();
>    return 0;
> }
>
> But I'm getting an error message like:
>
> *** An error occurred in MPI_Barrier
> *** An error occurred in MPI_Barrier
> *** after MPI was finalized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> *** after MPI was finalized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> What is this ?
>
> --
> Davi Vercillo Carneiro Garcia
> http://davivercillo.blogspot.com/
>
> Universidade Federal do Rio de Janeiro
> Departamento de Ciência da Computação
> DCC-IM/UFRJ - http://www.dcc.ufrj.br
>
> Grupo de Usuários GNU/Linux da UFRJ (GUL-UFRJ)
> http://www.dcc.ufrj.br/~gul
>
> Linux User: #388711
> http://counter.li.org/
>
> "Good things come to those who... wait." - Debian Project
>
> "A computer is like air conditioning: it becomes useless when you open
> windows." - Linus Torvalds
>



[MTT users] mtt IBM reduce_scatter_in_place test failure

2008-09-17 Thread Lenny Verkhovsky
I am running the MTT tests on our cluster and I found an error in the
IBM reduce_scatter_in_place test for np>8:

/home/USERS/lenny/OMPI_1_3_TRUNK/bin/mpirun -np 10 -H witch2
./reduce_scatter_in_place

[**WARNING**]: MPI_COMM_WORLD rank 4, file reduce_scatter_in_place.c:80:
bad answer (0) at index 0 of 1000 (should be 4)
[**WARNING**]: MPI_COMM_WORLD rank 3, file reduce_scatter_in_place.c:80:
[**WARNING**]: MPI_COMM_WORLD rank 2, file reduce_scatter_in_place.c:80:
bad answer (20916) at index 0 of 1000 (should be 2)
bad answer (0) at index 0 of 1000 (should be 3)
[**WARNING**]: MPI_COMM_WORLD rank 5, file reduce_scatter_in_place.c:80:
bad answer (0) at index 0 of 1000 (should be 5)
[**WARNING**]: MPI_COMM_WORLD rank 6, file reduce_scatter_in_place.c:80:
bad answer (0) at index 0 of 1000 (should be 6)
[**WARNING**]: MPI_COMM_WORLD rank 7, file reduce_scatter_in_place.c:80:
[**WARNING**]: MPI_COMM_WORLD rank 8, file reduce_scatter_in_place.c:80:
bad answer (0) at index 0 of 1000 (should be 8)
bad answer (0) at index 0 of 1000 (should be 7)
[**WARNING**]: MPI_COMM_WORLD rank 9, file reduce_scatter_in_place.c:80:
bad answer (0) at index 0 of 1000 (should be 9)
[**WARNING**]: MPI_COMM_WORLD rank 0, file reduce_scatter_in_place.c:80:
bad answer (-516024720) at index 0 of 1000 (should be 0)
[**WARNING**]: MPI_COMM_WORLD rank 1, file reduce_scatter_in_place.c:80:
bad answer (28112) at index 0 of 1000 (should be 1)

I think that the error is in the test itself.

--- sources/test_get__ibm/ibm/collective/reduce_scatter_in_place.c
2005-09-28 18:11:37.0 +0300
+++ installs/LKcC/tests/ibm/ibm/collective/reduce_scatter_in_place.c
2008-09-16 19:32:48.0 +0300
@@ -64,7 +64,7 @@ int main(int argc, char **argv)
  ompitest_error(__FILE__, __LINE__, "Doh! Rank %d was not able to allocate
enough memory. MPI test aborted!\n", myself);
  }

- for (j = 1; j <= MAXLEN; j *= 10) {
+ for (j = 1; j < MAXLEN; j *= 10) {
  for (i = 0; i < tasks; i++) {
  recvcounts[i] = j;
  }

I am not sure whether this is the right fix, or who could review/commit it to
the test trunk.
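
For reference, a hedged, untested sketch of the plain (not in-place)
MPI_Reduce_scatter semantics the test exercises: every rank contributes
"tasks" blocks of j elements, the blocks are summed element-wise across
ranks, and rank i receives the reduced i-th block:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int tasks, me, i;
    int j = 1000;
    int *recvcounts, *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &tasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    recvcounts = malloc(tasks * sizeof(int));
    sendbuf    = malloc(tasks * j * sizeof(int));
    recvbuf    = malloc(j * sizeof(int));

    for (i = 0; i < tasks; i++) recvcounts[i] = j;      /* every rank gets j elements */
    for (i = 0; i < tasks * j; i++) sendbuf[i] = i / j; /* block b is filled with value b */

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* after the sum over "tasks" ranks, every element of rank me's block is me*tasks */
    printf("rank %d: recvbuf[0] = %d (expected %d)\n", me, recvbuf[0], me * tasks);

    free(recvcounts); free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}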

Best regards

Lenny.


[OMPI users] Checkpointing a restarted app fails

2008-09-17 Thread Matthias Hovestadt

Hi!

Since I am interested in fault tolerance, checkpointing and
restart of OMPI is an interesting feature for me. So I installed
BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
I followed the instructions in the "Fault Tolerance Guide"
in the OMPI wiki:

./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s

This gave me an OMPI version with checkpointing support, so I
started testing. The good news is: I am able to checkpoint and
restart applications. The bad news is: checkpointing a restarted
application fails.

In detail:

1) Starting the application

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml

This starts my MPI-enabled application without any problems.


2) Checkpointing the application

First I queried the PID of the mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  13897  0.4  0.2  63992  2704 pts/0S+   04:59   0:00 mpirun 
-np 2 -am ft-enable-cr yafaray-xml yafaray.xml


Then I checkpointed the job, terminating it directly:

ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.:   0 ompi_global_snapshot_13897.ckpt
ccs@grid-demo-1:~$

The application indeed terminated:
--
mpirun noticed that process rank 0 with PID 13898 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).

--
2 total processes killed (some possibly by mpirun during cleanup)

The checkpoint command generated a checkpoint dataset
of 367MB size:

ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367Mompi_global_snapshot_13897.ckpt/
ccs@grid-demo-1:~$



3) Restarting the application

For restarting the application, I first executed ompi-clean,
then restarting the job with preloading all files:

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/

Restarting works pretty fine. The jobs restarts from the
checkpointed state and continues to execute. If not interrupted,
it continues until its end, returning a correct result.

However, I observed one weird thing: restarting the application
seemed to have changed the checkpoint dataset. Moreover, two new
directories were created at restart time:

  4 drwx--  3 ccs  ccs4096 Sep 17 05:09 
ompi_global_snapshot_13897.ckpt

  4 drwx--  2 ccs  ccs4096 Sep 17 05:09 opal_snapshot_0.ckpt
  4 drwx--  2 ccs  ccs4096 Sep 17 05:09 opal_snapshot_1.ckpt



4) Checkpointing again

Again I first looked for the PID of the running mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  14005  0.0  0.2  63992  2736 pts/1S+   05:09   0:00 mpirun 
-am ft-enable-cr --app 
/home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile



Then I checkpointed it:

ccs@grid-demo-1:~$ ompi-checkpoint 14005


When executing this checkpoint command, the running application
immediately aborts, even though I did not specify the "--term" option:

--
mpirun noticed that process rank 1 with PID 14050 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).

--
ccs@grid-demo-1:~$


The "ompi-checkpoint 14005" command however does not return.



Is anybody here using checkpoint/restart capabilities of OMPI?
Did anybody encounter similar problems? Or is there something
wrong about my way of using ompi-checkpoint/ompi-restart?


Any hint is greatly appreciated! :-)



Best,
Matthias


[OMPI users] Newbie doubt.

2008-09-17 Thread Davi Vercillo C. Garcia (デビッド)
Hi,

I'm starting to use OpenMPI and I'm having some trouble. I wrote a
simple program that tries to open files using the function
MPI_File_open(), like below:

/* header names and '&' characters were stripped by the archive; these are the
   headers the program needs */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <mpi.h>

int processoPrincipal(void);
int processosEscravos(void);

int main(int argc, char** argv) {
    int meuRank, numeroProcessos;
    MPI_File arquivoEntrada, arquivoSaida;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &meuRank);
    MPI_Comm_size(MPI_COMM_WORLD, &numeroProcessos);

    MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &arquivoEntrada);
    strcat(argv[1], ".bz2");
    MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDWR |
                  MPI_MODE_CREATE, MPI_INFO_NULL, &arquivoSaida);

    if (meuRank != 0) {
        processoPrincipal();
    } else {
        processosEscravos();
    }

    MPI_Finalize();
    return 0;
}

But I'm getting an error message like:

*** An error occurred in MPI_Barrier
*** An error occurred in MPI_Barrier
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (goodbye)
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (goodbye)

What is this ?

-- 
Davi Vercillo Carneiro Garcia
http://davivercillo.blogspot.com/

Universidade Federal do Rio de Janeiro
Departamento de Ciência da Computação
DCC-IM/UFRJ - http://www.dcc.ufrj.br

Grupo de Usuários GNU/Linux da UFRJ (GUL-UFRJ)
http://www.dcc.ufrj.br/~gul

Linux User: #388711
http://counter.li.org/

"Good things come to those who... wait." - Debian Project

"A computer is like air conditioning: it becomes useless when you open
windows." - Linus Torvalds