Re: [OMPI users] Changing location where checkpoints are saved
I took a look at the checkpoint staging and preload functionality. It seems that the combination of the two is broken on the v1.3 and v1.4 branches. I filed a bug about it so that it would not get lost:
https://svn.open-mpi.org/trac/ompi/ticket/2139

I also attached a patch to partially fix the problem, but the actual fix is much more involved. I don't know when I'll get around to finishing this bug fix for that branch. :(

However, the current development trunk and v1.5 are known to have a working version of this feature. Can you try the trunk or v1.5 and see if this fixes the problem?

-- Josh

P.S. If you are interested, we have a slightly better version of the documentation, hosted at the link below:
http://osl.iu.edu/research/ft/ompi-cr/

On Nov 18, 2009, at 1:27 PM, Constantinos Makassikis wrote:

Hello Josh,

I have switched to v1.3 and re-run with snapc_base_global_snapshot_dir=/tmp or $HOME with a clean /tmp. In both cases I get the same error as before :-(
Re: [OMPI users] Changing location where checkpoints are saved
Josh Hursey wrote:

This is a warning about creating the global snapshot directory (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It seems to indicate that the directory existed when the file gather started.

A couple things to check:
- Did you clean out the /tmp on all of the nodes with any files starting with "opal" or "ompi"?
- Does the error go away when you set (snapc_base_global_snapshot_dir=$HOME)?
- Could you try running against a v1.3 release? (I wonder if this feature has been broken on the trunk)

Let me know what you find. In the next couple days, I'll try to test the trunk again with this feature to make sure that it is still working on my test machines.

-- Josh

Hello Josh,

I have switched to v1.3 and re-run with snapc_base_global_snapshot_dir=/tmp or $HOME with a clean /tmp. In both cases I get the same error as before :-(

I don't know if the following can be of any help, but after ompi-checkpoint returns, the global snapshot directory contains only a copy of the checkpoint of the rank 0 process:

$(snapc_base_global_snapshot_dir)/ompi_global_snapshot_.ckpt/0

So I guess the error occurs during the remote copy phase.

-- Constantinos
Re: [OMPI users] Changing location where checkpoints are saved
(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:

OUTPUT of MPIRUN
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--
WARNING: Could not preload specified file: File already exists.
Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85
Will continue attempting to launch the process.
--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054

This is a warning about creating the global snapshot directory (ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It seems to indicate that the directory existed when the file gather started.

A couple things to check:
- Did you clean out the /tmp on all of the nodes with any files starting with "opal" or "ompi"?
- Does the error go away when you set (snapc_base_global_snapshot_dir=$HOME)?
- Could you try running against a v1.3 release? (I wonder if this feature has been broken on the trunk)

Let me know what you find. In the next couple days, I'll try to test the trunk again with this feature to make sure that it is still working on my test machines.

-- Josh

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
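The first check in the message above (leftover entries in /tmp whose names start with "opal" or "ompi") could be scripted roughly as follows. This is only a sketch: the helper names list_stale_ckpt_files and hosts_in are made up for illustration and are not Open MPI tools, and it assumes a find that supports -mindepth/-maxdepth and a machinefile in which the first field of each line is a host name.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Open MPI.

# Print entries directly under a directory (default: /tmp) whose names
# start with "opal" or "ompi". Nothing is deleted.
list_stale_ckpt_files() {
    dir=${1:-/tmp}
    find "$dir" -mindepth 1 -maxdepth 1 \( -name 'opal*' -o -name 'ompi*' \) -print
}

# Print the first field of every non-blank, non-comment line of a
# machinefile (assumed here to be the host name).
hosts_in() {
    awk '!/^[ \t]*(#|$)/ { print $1 }' "$1"
}
```

To look at every node, one could loop over `hosts_in machines` and run `ls -d /tmp/opal* /tmp/ompi*` on each host through ssh, assuming passwordless ssh between the nodes. Listing first and deleting by hand afterwards avoids removing a snapshot that is still needed.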
Re: [OMPI users] Changing location where checkpoints are saved
Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and 1 values?
- in priority options (e.g.: crs_blcr_priority) do lower numbers indicate higher priority?

By searching in the archives of the mailing list I found two interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php (for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php (for restarting)

Following indications given in [1], I tried to make each process checkpoint itself in its local /tmp and centralize the resulting checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid

OUTPUT of ompi-checkpoint -v 16753
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: ompi_global_snapshot_17036.ckpt

OUTPUT of MPIRUN
[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--
WARNING: Could not preload specified file: File already exists.
Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85
Will continue attempting to launch the process.
--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054

Does anyone have an idea about what is wrong?

Best regards,
-- Constantinos
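The verbose ompi-checkpoint output in this message interleaves the status lines with receiver messages. A small filter can reduce it to the sequence of states, which makes it easier to see where the checkpoint stopped. The following is a sketch that assumes the line format shown in the log above, with one log entry per line; ckpt_states is a hypothetical helper name, not an Open MPI tool.

```shell
#!/bin/sh
# Sketch only: ckpt_states is a hypothetical helper, not an Open MPI tool.
# Reads ompi-checkpoint -v output on stdin and prints one line per status
# update in the form "<state> -> <global snapshot reference>".
# Lines that are not status updates are dropped.
ckpt_states() {
    sed -n 's/^\[[^]]*\] *\(.*\) - Global Snapshot Reference: \(.*\)$/\1 -> \2/p'
}
```

It could be used as `ompi-checkpoint -v mpirun_pid 2>&1 | ckpt_states`. For the log above, the last two lines printed would be `File Transfer -> (null)` and `Error -> ompi_global_snapshot_17036.ckpt`, i.e. the failure shows up right after the file transfer state.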
Re: [OMPI users] Changing location where checkpoints are saved
This is described in the C/R User's Guide attached to the webpage below:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in the past, so searching around will likely turn up some examples.

-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:

Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By default, it seems that checkpoints are saved in $HOME. However, I would prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves checkpoints?

Best regards,
-- Constantinos
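For readers who want a starting point before opening the guide: elsewhere in this thread, three MCA parameters are used to control where snapshots go (snapc_base_store_in_place, snapc_base_global_snapshot_dir and crs_base_snapshot_dir). The sketch below appends them to a parameter file. The helper name write_ckpt_params and the example values are assumptions made for illustration, and whether the combination works depends on the Open MPI version, since the thread reports it failing on v1.3.

```shell
#!/bin/sh
# Sketch only: write_ckpt_params is a hypothetical helper name.
# Appends the three checkpoint-location parameters quoted in this thread
# to an MCA parameter file, creating its directory if needed.
write_ckpt_params() {
    conf_file=$1    # e.g. "$HOME/.openmpi/mca-params.conf"
    global_dir=$2   # where the global snapshot is gathered
    local_dir=$3    # node-local directory for per-process snapshots
    mkdir -p "$(dirname "$conf_file")"
    cat >> "$conf_file" <<EOF
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=$global_dir
crs_base_snapshot_dir=$local_dir
EOF
}
```

For example, `write_ckpt_params "$HOME/.openmpi/mca-params.conf" "$HOME" /tmp` would keep per-process snapshots in /tmp on each node and gather the global snapshot in $HOME. The same parameters can also be passed on the mpirun command line with `-mca name value`.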
[OMPI users] Changing location where checkpoints are saved
Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By default, it seems that checkpoints are saved in $HOME. However, I would prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves checkpoints?

Best regards,
-- Constantinos