Re: [OMPI users] Changing location where checkpoints are saved

2009-12-09 Thread Josh Hursey
I took a look at the checkpoint staging and preload functionality. It  
seems that the combination of the two is broken on the v1.3 and v1.4  
branches. I filed a bug about it so that it would not get lost:

  https://svn.open-mpi.org/trac/ompi/ticket/2139

I also attached a patch to partially fix the problem, but the actual
fix is much more involved. I don't know when I'll get around to
finishing this bug fix for that branch. :(


However, the current development trunk and v1.5 are known to have a
working version of this feature. Can you try the trunk or v1.5 and see
if this fixes the problem?


-- Josh

P.S. If you are interested, we have a slightly better version of the  
documentation, hosted at the link below:

  http://osl.iu.edu/research/ft/ompi-cr/

On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:


Josh Hursey wrote:

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:


Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and  
1 values ?
- in priority options (e.g.: crs_blcr_priority) do lower numbers  
indicate higher priority ?


By searching in the archives of the mailing list I found two  
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php
(for restarting)


Following indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3

--
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054


This is a warning about creating the global snapshot directory  
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0).  
It seems to indicate that the directory existed when the file  
gather started.


A couple things to check:
- Did you clean out the /tmp on all of the nodes with any files  
starting with "opal" or "ompi"?
- Does the error go away when you set  
(snapc_base_global_snapshot_dir=$HOME)?
- Could you try running against a v1.3 release? (I wonder if this  
feature has been broken on the trunk)


Let me know what you find. In the next couple days, I'll try to  
test the trunk again with this feature to make sure that it is  
still working on my test machines.


-- Josh

Hello Josh,

I have switched to v1.3 and re-run with  
snapc_base_global_snapshot_dir=/tmp or $HOME

with a clean /tmp.

In both cases I get the same error as before :-(

I don't know if 

Re: [OMPI users] Changing location where checkpoints are saved

2009-11-18 Thread Constantinos Makassikis

Josh Hursey wrote:

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:


Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and 1 
values ?
- in priority options (e.g.: crs_blcr_priority) do lower numbers 
indicate higher priority ?


By searching in the archives of the mailing list I found two 
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php 
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php 
(for restarting)


Following indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
-- 


WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

-- 


[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file 
../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054


This is a warning about creating the global snapshot directory 
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It 
seems to indicate that the directory existed when the file gather 
started.


A couple things to check:
 - Did you clean out the /tmp on all of the nodes with any files 
starting with "opal" or "ompi"?
 - Does the error go away when you set 
(snapc_base_global_snapshot_dir=$HOME)?
 - Could you try running against a v1.3 release? (I wonder if this 
feature has been broken on the trunk)


Let me know what you find. In the next couple days, I'll try to test 
the trunk again with this feature to make sure that it is still 
working on my test machines.


-- Josh

Hello Josh,

I have switched to v1.3 and re-run with 
snapc_base_global_snapshot_dir=/tmp or $HOME

with a clean /tmp.

In both cases I get the same error as before :-(

I don't know if the following can be of any help, but after
ompi-checkpoint returns, the global snapshot directory contains only a
copy of the checkpoint of the rank 0 process:

$(snapc_base_global_snapshot_dir)/ompi_global_snapshot_.ckpt/0

So I guess the error occurs during the remote copy phase.

--
Constantinos







Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage 
below:

 https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in 
the past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS 
account. By default,
it seems that 

Re: [OMPI users] Changing location where checkpoints are saved

2009-11-06 Thread Josh Hursey

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:


Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and 1  
values ?
- in priority options (e.g.: crs_blcr_priority) do lower numbers  
indicate higher priority ?


By searching in the archives of the mailing list I found two  
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php  
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php  
(for restarting)


Following indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3

--
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file
../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054


This is a warning about creating the global snapshot directory  
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It  
seems to indicate that the directory existed when the file gather  
started.


A couple things to check:
 - Did you clean out the /tmp on all of the nodes with any files  
starting with "opal" or "ompi"?
 - Does the error go away when you set  
(snapc_base_global_snapshot_dir=$HOME)?
 - Could you try running against a v1.3 release? (I wonder if this  
feature has been broken on the trunk)
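
A minimal sketch of the first check, assuming a POSIX shell and a find
that supports -mindepth/-maxdepth (GNU and BSD find do). The function
name list_stale_ckpt_files is hypothetical and not part of Open MPI; it
only lists entries whose names start with "opal" or "ompi" and deletes
nothing:

```shell
#!/bin/sh
# List leftover entries whose names start with "opal" or "ompi" directly
# inside a directory (default: /tmp). Nothing is removed; review the
# output before cleaning up by hand.
list_stale_ckpt_files() {
    dir="${1:-/tmp}"
    find "$dir" -mindepth 1 -maxdepth 1 \
        \( -name 'opal*' -o -name 'ompi*' \) -print | sort
}

list_stale_ckpt_files /tmp
```

Since each node has its own /tmp, this would have to be run on every
host listed in the machinefile (for example over ssh).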


Let me know what you find. In the next couple days, I'll try to test  
the trunk again with this feature to make sure that it is still  
working on my test machines.


-- Josh






Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage  
below:

 https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in  
the past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By
default, it seems that checkpoints are saved in $HOME. However, I would
prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI  
saves checkpoints?



Best regards,

--
Constantinos
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Changing location where checkpoints are saved

2009-09-30 Thread Constantinos Makassikis

Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and 1 values ?
- in priority options (e.g.: crs_blcr_priority) do lower numbers 
indicate higher priority ?


By searching in the archives of the mailing list I found two 
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php 
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php 
(for restarting)


Following indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid
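
For reference, the same three parameters from the mca-params.conf
excerpt above can presumably also be given on the mpirun command line
with -mca. A minimal sketch, assuming the values shown in that excerpt;
the helper name ckpt_mpirun_cmd is hypothetical and only prints the
command so that it can be inspected before being run:

```shell
#!/bin/sh
# Print (do not run) an mpirun command line that sets the checkpoint
# directories with -mca instead of mca-params.conf.
#   $1 = node-local directory for per-process snapshots
#        (crs_base_snapshot_dir), default /tmp
#   $2 = directory where the global snapshot is gathered
#        (snapc_base_global_snapshot_dir), default $HOME
ckpt_mpirun_cmd() {
    local_dir="${1:-/tmp}"
    global_dir="${2:-$HOME}"
    printf '%s\n' "mpirun -n 2 -machinefile machines -am ft-enable-cr \
-mca snapc_base_store_in_place 0 \
-mca snapc_base_global_snapshot_dir $global_dir \
-mca crs_base_snapshot_dir $local_dir a.out"
}

ckpt_mpirun_cmd /tmp "$HOME"
```

This changes only how the parameters are supplied, not the behavior
reported below.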



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file 
../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054




Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:

This is described in the C/R User's Guide attached to the webpage below:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in the 
past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By
default, it seems that checkpoints are saved in $HOME. However, I would
prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves 
checkpoints?



Best regards,

--
Constantinos





Re: [OMPI users] Changing location where checkpoints are saved

2009-09-23 Thread Josh Hursey

This is described in the C/R User's Guide attached to the webpage below:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in the  
past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By
default, it seems that checkpoints are saved in $HOME. However, I would
prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves  
checkpoints?



Best regards,

--
Constantinos




[OMPI users] Changing location where checkpoints are saved

2009-09-18 Thread Constantinos Makassikis

Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By
default, it seems that checkpoints are saved in $HOME. However, I would
prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves 
checkpoints?



Best regards,

--
Constantinos