Hi again,

I found a C program to test ompi-checkpoint/restart and it works fine. The program was written by Alan Woodland and shared on the following mailing list: debian-bugs-d...@lists.debian.org. It starts a countdown from 10 to 0 and, when the countdown reaches 6, takes a checkpoint, kills the process, and restarts it.

However, I still have the problem when I try to checkpoint by hand directly on a node.

Any ideas? :-(

Best regards
Sergio



Sergio Díaz wrote:
Hello,

I managed to checkpoint a simple program without SGE. Now I'm trying to integrate Open MPI with SGE, but I'm running into problems. When I try to checkpoint the mpirun PID, I get an error similar to the one I get when the PID doesn't exist. See the example below.

Any ideas?
Does anybody have a script to do this automatically with SGE? For example, I have one that takes a checkpoint every X seconds with BLCR for non-MPI jobs. SGE launches it if the queue and the ckpt environment are configured.
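To make the "checkpoint every X seconds" idea concrete, here is a minimal sketch of such a loop for an mpirun process. It is not the BLCR script mentioned above and it does nothing about the SGE integration itself: the function name periodic_checkpoint, the CKPT_CMD variable and the optional limit argument are all made up for illustration. The only real command it relies on is the plain "ompi-checkpoint PID_OF_MPIRUN" form.

```shell
#!/bin/sh
# Hypothetical helper (name and arguments are assumptions, not part of
# Open MPI or SGE): checkpoint the mpirun with the given PID every
# INTERVAL seconds until the process is gone.
#
# usage: periodic_checkpoint PID INTERVAL [MAX_CHECKPOINTS]
periodic_checkpoint() {
    pid=$1
    interval=$2
    max=${3:-0}                            # 0 means "no limit"
    # CKPT_CMD is an assumed override hook, handy for dry runs.
    ckpt_cmd=${CKPT_CMD:-ompi-checkpoint}
    count=0

    # kill -0 sends no signal; it only tests that the PID exists.
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # The job may have finished while we were sleeping.
        kill -0 "$pid" 2>/dev/null || break
        $ckpt_cmd "$pid" || echo "checkpoint of $pid failed" >&2
        count=$((count + 1))
        if [ "$max" -gt 0 ] && [ "$count" -ge "$max" ]; then
            break
        fi
    done
}
```

For a dry run, something like "CKPT_CMD=echo periodic_checkpoint 20112 60" only prints the PID each minute instead of calling ompi-checkpoint. Note that the loop would hit the same "HNP with PID ... Not found!" error as the manual attempts below until that problem is solved.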

Is it possible to choose the name of the ckpt folder when running ompi-checkpoint? I can't find an option for it.


Regards,
Sergio


--------------------------------

[sdiaz@compute-3-17 ~]$ ps auxf
....
root 20044 0.0 0.0 4468 1224 ? S 13:28 0:00 \_ sge_shepherd-2645150 -bg
sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00 \_ pi3


[sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
[compute-3-17.local:20124] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
[compute-3-17.local:20135] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
[compute-3-17.local:20136] HNP with PID 20112 Not found!

[sdiaz@compute-3-17 ~]$ ompi-checkpoint --hnp-pid 20112
--------------------------------------------------------------------------
ompi-checkpoint PID_OF_MPIRUN
  Open MPI Checkpoint Tool

   -am <arg0>            Aggregate MCA parameter set file list
   -gmca|--gmca <arg0> <arg1>
                         Pass global MCA parameters that are applicable to
                         all contexts (arg0 is the parameter name; arg1 is
                         the parameter value)
   -h|--help             This help message
   --hnp-jobid <arg0>    This should be the jobid of the HNP whose
                         applications you wish to checkpoint.
   --hnp-pid <arg0>      This should be the pid of the mpirun whose
                         applications you wish to checkpoint.
   -mca|--mca <arg0> <arg1>
                         Pass context-specific MCA parameters; they are
                         considered global if --gmca is not used and only
                         one context is specified (arg0 is the parameter
                         name; arg1 is the parameter value)
   -s|--status           Display status messages describing the progression
                         of the checkpoint
   --term                Terminate the application after checkpoint
   -v|--verbose          Be Verbose
   -w|--nowait           Do not wait for the application to finish
                         checkpointing before returning

--------------------------------------------------------------------------
[sdiaz@compute-3-17 ~]$ exit
logout
Connection to c3-17 closed.
[sdiaz@svgd mpi_test]$ ssh c3-18
Last login: Wed Oct 28 13:24:12 2009 from svgd.local
-bash-3.00$ ps auxf |grep sdiaz

sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28 0:00 \_ pi3





--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/

------------------------------------------------
------------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

