Checkpointing and restarting Open MPI applications does not work for me.
I have a Red Hat 5U6 system with BLCR version 0.8.4
and Open MPI version 1.6.3.
I have a simple parallel application that I want to checkpoint and restart.
lsmod shows that the BLCR kernel modules are loaded.
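
A minimal sketch of that module check, for anyone who wants to repeat it. It assumes the BLCR kernel modules are named "blcr" and "blcr_imports" (an assumption, please verify against your own install), and the function name check_blcr_modules is only illustrative.

```shell
#!/bin/sh
# Report whether the BLCR kernel modules appear in lsmod-style output.
# Assumption: the modules are named "blcr" and "blcr_imports".
check_blcr_modules() {
    # Reads lsmod-style text on stdin; the module name is the first column.
    awk '
        $1 == "blcr"         { core = 1 }
        $1 == "blcr_imports" { imports = 1 }
        END {
            if (core && imports) { print "blcr modules loaded"; exit 0 }
            print "blcr modules missing"; exit 1
        }
    '
}

# Usage on a live system:
#   lsmod | check_blcr_modules
```

The function reads from stdin, so it can also be run against saved lsmod output from each node in the hostfile.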
I run:
mpirun -np 1 -hostfile hostfile -am ft-enable-cr EXECUTABLE
ompi-checkpoint -v -s <PID of mpirun>
Then I kill mpirun and run:
ompi-restart -v ompi_global_snapshot_<PID>.ckpt
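
The sequence above, written out as a small script sketch. It uses only the commands from this report. The run helper, the DRY_RUN switch, and the use of a plain kill (SIGTERM) are assumptions added for illustration; by default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the checkpoint/restart sequence described above.
# DRY_RUN=1 (the default here) only prints each command;
# set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

checkpoint_and_restart() {
    pid=$1   # PID of the running mpirun process
    run ompi-checkpoint -v -s "$pid"
    run kill "$pid"   # assumption: plain kill (SIGTERM) is how mpirun is stopped
    run ompi-restart -v "ompi_global_snapshot_${pid}.ckpt"
}

# Usage (prints the three commands for mpirun PID 1234):
#   checkpoint_and_restart 1234
```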
Here are my results:
Error: Unable to obtain the proper restart command to restart from the
checkpoint file (opal_snapshot_0.ckpt). Returned -1.
Check the installation of the none checkpoint/restart service
on all of the machines in your system.
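
Since the message names the "none" checkpoint/restart service, it may help to list which checkpoint/restart (crs) components each node's Open MPI build reports. A rough sketch, assuming ompi_info prints one line per component in the form "MCA crs: NAME (...)" (the exact layout is an assumption); list_crs_components is a hypothetical helper name.

```shell
#!/bin/sh
# Print the names of the crs components found in ompi_info-style output.
# Assumption: each component appears on a line like
#   "MCA crs: blcr (MCA v2.0, ...)"
list_crs_components() {
    # Reads text on stdin; prints the third field of each "MCA crs:" line.
    awk '$1 == "MCA" && $2 == "crs:" { print $3 }'
}

# Usage on a live system:
#   ompi_info | list_crs_components
```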
If I use the BLCR utilities directly (cr_run, cr_checkpoint, cr_restart), it works
on the local machine, but not on more than one machine.
Please help me with this.
Thank you.
With Blessings, always,
Jerry Mersel
System Administrator
IT Infrastructure Branch | Division of Information Systems
Weizmann Institute of Science
Rehovot 76100, Israel
Tel: +972-8-9342363