Re: [gridengine users] Trying to get checkpointing to work

Reuti Tue, 05 Aug 2014 05:07:24 -0700

Hi,

On 05.08.2014 at 13:09, Wouter Verhelst wrote:


> I'm currently replacing an ageing N1 gridengine 6.0u4 with Son of Gridengine 
> 8.1.7. Since the machine that's running the N1 qmaster is due to be replaced 
> in the coming months, I'm taking this opportunity to rethink the way our 
> setup is done.
> 
> One thing I'm trying to do is to get checkpointing (with BLCR) to work so 
> that jobs can be migrated to different machines if necessary. Our cluster 
> does not consist of machines with the exact same specs; some machines have 
> more memory than others.

But do they all have the same type of CPU? And have you tested, outside of SGE, 
that the checkpointing facility works reliably with your applications?


> When we're running a lot of small jobs that don't require a lot of resources 
> each, it doesn't matter which machine they run on and we want to spread the 
> load over as many machines as possible so that the jobs are finished as 
> quickly as possible; however, if such jobs are indeed running when a user 
> wants to submit a job that does require a lot of resources, I want gridengine 
> to checkpoint jobs on the high-memory machine to make room available for the 
> new high-resources job, so that the high-resources job doesn't need to wait 
> for large amounts of small-resources jobs to finish (which may take a long 
> time).
> 
> I've done the following so far:
> 
> qconf -sq all.q|grep starter
> starter_method     /usr/local/bin/blcr_start_job
> 
> this script checks if we have a $RESTARTED environment variable and if a 
> checkpoint file exists. If so, it execs cr_restart; else, it execs cr_run.
> 
> qconf -sckpt BLCR
> ckpt_name          BLCR
> interface          APPLICATION-LEVEL
> ckpt_command       /usr/local/bin/blcr_checkpoint
> migr_command       /usr/local/bin/blcr_migrate
> restart_command    NONE
> clean_command      /usr/local/bin/blcr_clean
> ckpt_dir           /opt/sge/default/common/ckpoint
> signal             NONE
> when               xsr
> 
> these scripts are based on the BLCR HOWTO by Peng and Ng from 2004, modified 
> to account for the fact that BLCR does support checkpointing an entire 
> process tree these days.
> 
> Finally, there's also this bit:
> 
> qconf -sq hiprio|grep -E 'subordinate|starter'
> starter_method        /usr/local/bin/blcr_start_job
> subordinate_list      slots=12(all.q:0:sr)
> 
> When I submit a job in the hiprio queue, it does suspend jobs in all.q, but I 
> don't see it checkpointing the jobs.

This is the intended behavior (please have a look at the state diagrams in the 
mentioned document). The migration itself will only migrate the jobs (i.e. 
kill their processes); it does not take a checkpoint on its own. Your migrate 
procedure will have to take a checkpoint first (by calling the "ckpt_command" 
itself); otherwise whatever checkpoint already exists (maybe one taken at a 
periodic interval before) will be used.
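
As a rough illustration of "checkpoint first, then migrate", a migrate 
procedure could be structured like the sketch below. This is not the script 
from the HOWTO. The function name migrate_job, the CKPT_CMD variable, and 
passing the job PID as the first argument are all assumptions made for the 
example; only the default path is taken from the ckpt_command setting quoted 
above. Whether your checkpoint script accepts a PID argument depends on how 
you wrote it.

```shell
#!/bin/sh
# Hypothetical migrate helper: take a checkpoint first, and only
# terminate the job if that checkpoint succeeded.
#
# Assumptions (not from the original setup):
#   - the job's PID is passed as the first argument
#   - CKPT_CMD names the checkpoint script and accepts the PID as argument

migrate_job() {
    job_pid=$1
    ckpt_cmd=${CKPT_CMD:-/usr/local/bin/blcr_checkpoint}

    # Take a fresh checkpoint; leave the job untouched if this fails,
    # so no work is lost to a broken checkpoint.
    if ! "$ckpt_cmd" "$job_pid"; then
        echo "checkpoint of PID $job_pid failed, not migrating" >&2
        return 1
    fi

    # The checkpoint exists now, so the process can be terminated and
    # the job rescheduled elsewhere.
    kill -TERM "$job_pid"
}
```

A wrapper used as migr_command would then call something like 
migrate_job "$1". The order is the point: if the checkpoint fails, the job 
keeps running instead of being killed with only an older checkpoint (or none) 
to restart from.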

-- Reuti
