Re: [OMPI devel] [patch] Verifying the message queue DLL build

2009-12-09 Thread Ashley Pittman
On Tue, 2009-12-08 at 09:36 -0500, Terry Dontje wrote:

> You can get it from the svn branch repo:
> https://svn.open-mpi.org/svn/ompi/branches/v1.5
> You might as well also try 1.4 which should also be clean:
> https://svn.open-mpi.org/svn/ompi/branches/v1.4

I can confirm that for both branches the patch applies cleanly, the test
is run, and the test passes.  For v1.4 I did an in-tree build; for v1.5
I did a VPATH build.

There was an error in the v1.4 tree, though: after my check had passed,
the check command went on to fail with the error below.  This is a fresh
checkout, r22287M, with the only configure option specified being
--prefix.

The checks did pass if I ran "make install" before running "make check";
the v1.5 tree didn't need this, however.  I guess that means this is a
build issue rather than a problem with the actual code.

Ashley,

/bin/sh ../../libtool --tag=CC   --mode=link gcc  -g -Wall -Wundef
-Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes
-Wcomment -pedantic -Wno-long-double
-Werror-implicit-function-declaration -finline-functions
-fno-strict-aliasing -pthread -fvisibility=hidden  -export-dynamic   -o
ddt_pack ddt_pack.o ../../ompi/libmpi.la -lnsl -lutil  -lm 
libtool: link: gcc -g -Wall -Wundef -Wno-long-long -Wsign-compare
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
-Wno-long-double -Werror-implicit-function-declaration
-finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden
-o .libs/ddt_pack ddt_pack.o
-Wl,--export-dynamic  ../../ompi/.libs/libmpi.so -lnsl -lutil -lm
-pthread -Wl,-rpath -Wl,/tmp/v1.4/lib
make[3]: Leaving directory
`/mnt/home/debian/ashley/code/tmp/v1.4/test/datatype'
make  check-TESTS
make[3]: Entering directory
`/mnt/home/debian/ashley/code/tmp/v1.4/test/datatype'
/mnt/home/debian/ashley/code/tmp/v1.4/test/datatype/.libs/lt-checksum:
error while loading shared libraries: libopen-pal.so.0: cannot open
shared object file: No such file or directory
FAIL: checksum
/mnt/home/debian/ashley/code/tmp/v1.4/test/datatype/.libs/lt-position:
error while loading shared libraries: libopen-pal.so.0: cannot open
shared object file: No such file or directory
FAIL: position

2 of 2 tests failed
Please report to http://www.open-mpi.org/community/help/

make[3]: Leaving directory
`/mnt/home/debian/ashley/code/tmp/v1.4/test/datatype'
make[3]: *** [check-TESTS] Error 1
make[2]: *** [check-am] Error 2
make[2]: Leaving directory
`/mnt/home/debian/ashley/code/tmp/v1.4/test/datatype'
make[1]: Leaving directory `/mnt/home/debian/ashley/code/tmp/v1.4/test'
make[1]: *** [check-recursive] Error 1
make: *** [check-recursive] Error 1
ashley@alpha:~/code/tmp/v1.4$ 

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] possible bugs and unexpected values in returned errors classes

2009-12-09 Thread Lisandro Dalcin
It seems that this issue got lost.

On Thu, Feb 12, 2009 at 9:02 PM, Jeff Squyres  wrote:
> On Feb 11, 2009, at 8:24 AM, Lisandro Dalcin wrote:
>
>> Below a list of stuff that I've got by running mpi4py testsuite.
>>
>> 4)  When passing MPI_WIN_NULL, MPI_Win_get_errhandler() and
>> MPI_Win_set_errhandler()  DO NOT fail.
>
> I was a little more dubious here; the param checking code was specifically
> checking for MPI_WIN_NULL and not classifying it as an error.  Digging to
> find out why we did that, the best that I can come up with is that it is
> *not* an error to call MPI_File_set|get_errhandler on MPI_FILE_NULL (to set
> behavior for what happens when FILE_OPEN fails); I'm *guessing* that we
> simply copied the _File_ code to the _Win_ code and forgot to remove that
> extra check.
>
> I can't find anything in MPI-2.1 that says it is legal to call set|get
> errhandler on MPI_WIN_NULL.  I checked LAM as well; LAM errors in this case.
>  So I made this now be an error in OMPI as well.
>
> Do you need these in the 1.3 series?  Or are you ok waiting for 1.4
> (assuming 1.4 takes significantly less time to release than 1.3 :-) ).
>

In short:

When passing MPI_WIN_NULL, MPI_Win_get_errhandler() and
MPI_Win_set_errhandler() DO NOT fail.
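
For reference, here is a minimal C sketch of the behaviour in question
(hypothetical code, not taken from the mpi4py test suite): per Jeff's
reading of MPI-2.1, the MPI_WIN_NULL calls should fail, while the
MPI_FILE_NULL variants are legal because they control what happens when
MPI_File_open itself fails.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rc;

    MPI_Init(&argc, &argv);

    /* Return error codes instead of aborting, so the calls below can
       be inspected. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Should be an error: MPI_WIN_NULL is not a valid window handle. */
    rc = MPI_Win_set_errhandler(MPI_WIN_NULL, MPI_ERRORS_RETURN);
    printf("MPI_Win_set_errhandler(MPI_WIN_NULL): %s\n",
           rc == MPI_SUCCESS ? "no error (the bug)" : "error (expected)");

    /* Legal: sets the handler used if MPI_File_open itself fails. */
    rc = MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);
    printf("MPI_File_set_errhandler(MPI_FILE_NULL): %s\n",
           rc == MPI_SUCCESS ? "no error (expected)" : "error");

    MPI_Finalize();
    return 0;
}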

Jeff, you promised this for 1.4 ;-). Any chance for 1.4.1?

-- 
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594



Re: [OMPI devel] OPEN-MPI Fault-Tolerance for GASNet

2009-12-09 Thread Josh Hursey


On Dec 3, 2009, at 2:01 PM, Chang IL Yoon wrote:


Dear Josh and Paul.

First of all, thank you very much for your interest in my problem.

1) I tested it again with MPIRUN_CMD as 'mpirun -am ft-enable-cr -np %N %P'

   But the checkpoint did not work.


Is it giving the same error?

Can you send me information on how you configured Open MPI on your  
system?




2) Here is more information on my MPI configuration.
 - What version of Open MPI are you using?
   >> I am using Open MPI 1.3.3 with BLCR 0.8.2

 - How did you configure Open MPI?
   >> ./configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
      --with-blcr={BLCR_DIR} --with-blcr-libdir={BLCR_LIBDIR}
      --prefix={OPENMPI_DIR}


 - What arguments are being passed to 'mpirun' when running with  
GASNet?
   >> mpirun -am ft-enable-cr --machinefile ./machinefile -np 1 ./personal


The '-np 1' argument is a bit puzzling to me; don't you want this to
be >1 normally? GASNet does not use any MPI dynamic process management
interfaces (e.g., MPI_Comm_spawn), does it?



   >> personal is the same program as my-app.c, except that it uses
gasnet_init() and gasnet_exit() instead of MPI_Init() and MPI_Finalize().
   >> my-app.c is from http://osl.iu.edu/research/ft/ompi-cr/examples.php
   >> gasnet_init() and gasnet_exit() call MPI_Init() and MPI_Finalize().


So you are using the program from the SELF checkpoint example? If Open
MPI detects that the application has the appropriate function callbacks
to use the SELF CRS (which this example does), then it will *not* use
the BLCR component, but will instead select the SELF component.


Try using a simple counting program instead of that particular example.
You could also just remove the opal_crs_self_user_* and my_personal_*
functions from the example program to reduce it to one.
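
As an illustration, a counting program of the sort I mean might look
like the sketch below (hypothetical code, not from the ompi-cr examples
page); since it defines none of the SELF callback functions, Open MPI
should fall back to the transparent BLCR component:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* No opal_crs_self_user_* / my_personal_* callbacks are defined,
       so the SELF CRS component should not be selected. */
    for (i = 0; i < 300; ++i) {
        if (rank == 0)
            printf("iteration %d\n", i);
        MPI_Barrier(MPI_COMM_WORLD);   /* enter the MPI library */
        sleep(1);
    }

    MPI_Finalize();
    return 0;
}

Launched with 'mpirun -am ft-enable-cr -np 2 ./counter', it should be
checkpointable with ompi-checkpoint while it counts.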


I'm not sure why the checkpoint would not work even with the SELF CRS.  
I'll have to check on that.




 - Do you have any environment variables/MCA parameters set for Open  
MPI?

   >> yes
   $HOME/.openmpi/mca-params.conf
   # Local snapshot directory (not used in this scenario)
   crs_base_snapshot_dir=${HOME}/temp

   # Remote snapshot directory (globally mounted file system)
   snapc_base_global_snapshot_dir=${HOME}/checkpoints

 - My network interconnect is InfiniBand/OpenIB (IP over IB).


These all look fine to me.



3) If there is anything I should do to solve this problem, please let
me know without any hesitation.


Thank you again for reading.

Sincerely


On Tue, Dec 1, 2009 at 1:49 PM, Paul H. Hargrove wrote:

Thomas,

In connection with Josh's question about mpirun arguments, I suggest
you try setting

   MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
in your environment before launching the GASNet application.  This  
will instruct GASNet's wrapper around mpirun to include the flag  
Josh mentioned.


-Paul


Josh Hursey wrote:
Thomas,

I have not tried to use the checkpoint/restart feature with GASNet  
over MPI, so I cannot comment directly on how they interact.  
However, the combination should work as long as the proper arguments  
(-am ft-enable-cr) are passed along to the mpirun command, and Open  
MPI is configured properly.


The error message that you copied seems to indicate that the local  
daemon on one of the nodes failed to start a checkpoint of the  
target application. Often this is caused by one of two things:
 - Open MPI was not configured with the fault tolerance thread, and
the application is waiting for a long time in a computation loop
(not entering the MPI library); see the sketch after this list.
 - The '-am ft-enable-cr' flag was not provided to the mpirun  
process, so the MPI application did not activate the C/R specific  
code paths and is therefore denying the request to checkpoint.
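
As a hypothetical illustration of the first point (not code from the
application in question): if Open MPI was built without
--enable-ft-thread, dropping a cheap MPI call into the compute loop
gives a pending checkpoint request a chance to be serviced.

#include <mpi.h>

/* Placeholder for the application's real work. */
static void compute_step(void) { /* ... pure computation, no MPI ... */ }

int main(int argc, char *argv[])
{
    int flag, i;

    MPI_Init(&argc, &argv);

    for (i = 0; i < 1000000; ++i) {
        compute_step();
        /* Periodic entry into the MPI library so an outstanding
           'ompi-checkpoint' request is not stuck waiting. */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}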


Can you send me a bit more information:
 - What version of Open MPI are you using?
 - How did you configure Open MPI?
 - What arguments are being passed to 'mpirun' when running with  
GASNet?
 - Do you have any environment variables/MCA parameters set for Open  
MPI?


-- Josh

On Nov 22, 2009, at 7:13 PM, Thomas CI Yoon wrote:

Dear all.

Thanks to the developers of Open MPI for fault tolerance, I can use the
checkpoint/restart function very well for my MPI applications.
But checkpointing does not work for my GASNet applications, which
use the MPI conduit.

Is there anyone here who can help me?
I wrote some code with the GASNet API (Global-Address Space Networking:
http://gasnet.cs.berkeley.edu/) and used the MPI conduit for my GASNet
application, and my program ran well with Open MPI's mpirun. So I
thought that I could also use the transparent checkpoint/restart
function supported by BLCR in Open MPI. Contrary to my expectation, it
does not work and shows the following error message.

--
Error: The process with PID 13896 is not checkpointable.
  This could be due to one of the following:
   - An application with this PID