On Dec 3, 2009, at 2:01 PM, Chang IL Yoon wrote:
Dear Josh and Paul.
First of all, thank you very much for your interest in my problem.
1) I tested it again with MPIRUN_CMD as 'mpirun -am ft-enable-cr -np
%N %P'
But the checkpoint did not work.
Is it giving the same error?
Can you send me information on how you configured Open MPI on your
system?
2) Here is more information on my MPI configuration.
- What version of Open MPI are you using?
>> I am using Open-MPI ver 1.3.3 with BLCR ver 0.8.2
- How did you configure Open MPI?
>> ./configure --enable-ft-thread --with-ft=cr --enable-mpi-threads --with-blcr={BLCR_DIR} --with-blcr-libdir={BLCR_LIBDIR} --prefix={OPENMPI_DIR}
- What arguments are being passed to 'mpirun' when running with
GASNet?
>> mpirun -am ft-enable-cr --machinefile ./machinefile -np 1 ./personal
The '-np 1' argument is a bit puzzling to me; don't you normally want
this to be >1? GASNet does not use any MPI dynamic process management
interfaces (e.g., MPI_Comm_spawn), does it?
>> personal is the same program as my-app.c, except that it uses
gasnet_init() and gasnet_exit() instead of MPI_Init() and
MPI_Finalize().
>> my-app.c is at http://osl.iu.edu/research/ft/ompi-cr/examples.php
>> gasnet_init() and gasnet_exit() use MPI_Init() and
MPI_Finalize().
So you are using the program from the SELF checkpoint example? If Open
MPI detects that the application has the appropriate function
callbacks to use the SELF CRS (which this example does) then it will
-not- use the BLCR component, but instead select the SELF component.
Try using a simple counting program instead of that particular
example. You could also just remove the opal_crs_self_user_* and
my_personal_* functions from the example program to reduce it to one.
I'm not sure why the checkpoint would not work even with the SELF CRS.
I'll have to check on that.
- Do you have any environment variables/MCA parameters set for Open
MPI?
>> yes
$HOME/.openmpi/mca-params.conf
# Local snapshot directory (not used in this scenario)
crs_base_snapshot_dir=${HOME}/temp
# Remote snapshot directory (globally mounted file system)
snapc_base_global_snapshot_dir=${HOME}/checkpoints
- My network interconnect is InfiniBand/OpenIB (IP over IB).
These all look fine to me.
3) If there is anything I can do to help solve this problem, please
let me know without any hesitation.
Thank you again for reading.
Sincerely
On Tue, Dec 1, 2009 at 1:49 PM, Paul H. Hargrove
<phhargr...@lbl.gov> wrote:
Thomas,
In connection with Josh's question about mpirun arguments, I suggest
you try setting
MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
in your environment before launching the GASNet application. This
will instruct GASNet's wrapper around mpirun to include the flag
Josh mentioned.
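To make the placeholders in that template concrete, here is a minimal sketch of how such a template could be expanded. It assumes %N stands for the process count, %P for the program, and %A for the program's arguments. The function name expand_mpirun_cmd and the argument 'arg1' are hypothetical; the function only imitates the substitution and is not GASNet's actual wrapper code.

```shell
# Hypothetical helper (not part of GASNet): imitates the substitution that
# GASNet's mpirun wrapper is assumed to apply to MPIRUN_CMD.
#   $1 = template, $2 = process count (%N), $3 = program (%P),
#   $4 = program arguments (%A)
# Caveat: values containing '|', '&' or backslashes would need escaping
# before being handed to sed.
expand_mpirun_cmd() {
    printf '%s\n' "$1" | sed -e "s|%N|$2|g" -e "s|%P|$3|g" -e "s|%A|$4|g"
}

# The template suggested above; 'arg1' is a made-up program argument.
MPIRUN_CMD='mpirun -am ft-enable-cr -np %N %P %A'
export MPIRUN_CMD
expand_mpirun_cmd "$MPIRUN_CMD" 2 ./personal arg1
```

Printing the expanded command this way is a quick check that '-am ft-enable-cr' really ends up on the mpirun command line before a checkpoint is attempted.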
-Paul
Josh Hursey wrote:
Thomas,
I have not tried to use the checkpoint/restart feature with GASNet
over MPI, so I cannot comment directly on how they interact.
However, the combination should work as long as the proper arguments
(-am ft-enable-cr) are passed along to the mpirun command, and Open
MPI is configured properly.
The error message that you copied seems to indicate that the local
daemon on one of the nodes failed to start a checkpoint of the
target application. Often this is caused by one of two things:
- Open MPI was not configured with the fault tolerance thread, and
the application is waiting for a long time in a computation loop
(not entering the MPI library).
- The '-am ft-enable-cr' flag was not provided to the mpirun
process, so the MPI application did not activate the C/R specific
code paths and is therefore denying the request to checkpoint.
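Both causes can be ruled out by looking at the command lines involved. The following is a minimal sketch of such a pre-flight check; the function names has_ft_enable_cr and has_ft_thread are hypothetical, and the check only inspects strings that you pass in, it does not query Open MPI itself.

```shell
# Hypothetical pre-flight checks; they only examine the given strings.

# Succeeds if the mpirun command line carries '-am ft-enable-cr'.
has_ft_enable_cr() {
    case " $1 " in
        *" -am ft-enable-cr "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds if the configure line requested the fault tolerance thread.
has_ft_thread() {
    case " $1 " in
        *" --enable-ft-thread "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example inputs (assumed values, adjust to your own setup).
launch='mpirun -am ft-enable-cr -np 2 ./personal'
config='./configure --enable-ft-thread --with-ft=cr'

if has_ft_enable_cr "$launch"; then
    echo "mpirun: C/R flag present"
else
    echo "mpirun: C/R flag missing"
fi
if has_ft_thread "$config"; then
    echo "configure: ft thread enabled"
else
    echo "configure: ft thread not enabled"
fi
```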
Can you send me a bit more information:
- What version of Open MPI are you using?
- How did you configure Open MPI?
- What arguments are being passed to 'mpirun' when running with
GASNet?
- Do you have any environment variables/MCA parameters set for Open
MPI?
-- Josh
On Nov 22, 2009, at 7:13 PM, Thomas CI Yoon wrote:
Dear all.
Thanks to the Open MPI developers' fault tolerance work, I can use the
checkpoint/restart function very well for my MPI applications.
But checkpointing does not work for my GASNet applications, which
use the MPI conduit.
Is there anyone who can help me?
I wrote some code with the GASNet API (Global-Address Space Networking: http://gasnet.cs.berkeley.edu/)
and used the MPI conduit for my GASNet application, so my program
ran well with Open MPI's mpirun. Thus I thought that I could also use the
transparent checkpoint/restart function supported by BLCR in Open MPI.
Contrary to my expectation, it does not work and shows the following
error message.
--------------------------------------------------------------------------
Error: The process with PID 13896 is not checkpointable.
This could be due to one of the following:
- An application with this PID doesn't currently exist
- The application with this PID isn't checkpointable
- The application with this PID isn't an OPAL application.
We were looking for the named files:
/tmp/opal_cr_prog_write.13896
/tmp/opal_cr_prog_read.13896
--------------------------------------------------------------------------
1 more process has sent help message help-opal-checkpoint.txt
Set MCA parameter "orte_base_help_aggregate" to 0 to see all help
0] 13896) Step 53
0] 15100) Step 53
0] 13896) Step 54
0] 15100) Step 54
0] 13896) Step 55
In my application, MPI_Initialized() reports that MPI is initialized.
Thank you for reading, and have a great day.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Paul H. Hargrove phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory