Hi,

On 05.08.2014 at 13:09, Wouter Verhelst wrote:
> I'm currently replacing an ageing N1 gridengine 6.0u4 with Son of Gridengine
> 8.1.7. Since the machine that's running the N1 qmaster is due to be replaced
> in the coming months, I'm taking this opportunity to rethink the way our
> setup is done.
>
> One thing I'm trying to do is to get checkpointing (with BLCR) to work so
> that jobs can be migrated to different machines if necessary. Our cluster
> does not consist of machines with the exact same specs; some machines have
> more memory than others.

But do they all have the same type of CPU? And have you tested, outside of SGE, that the checkpointing facility works reliably with your applications?

> When we're running a lot of small jobs that don't require a lot of resources
> each, it doesn't matter which machine they run on and we want to spread the
> load over as many machines as possible so that the jobs are finished as
> quickly as possible; however, if such jobs are indeed running when a user
> wants to submit a job that does require a lot of resources, I want gridengine
> to checkpoint jobs on the high-memory machine to make room available for the
> new high-resources job, so that the high-resources job doesn't need to wait
> for large amounts of small-resources jobs to finish (which may take a long
> time).
>
> I've done the following so far:
>
>   qconf -sq all.q|grep starter
>   starter_method        /usr/local/bin/blcr_start_job
>
> this script checks if we have a $RESTARTED environment variable and if a
> checkpoint file exists. If so, it execs cr_restart; else, it execs cr_run.
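
For reference, a minimal sketch of what a starter method like the one described could look like. This is not the actual /usr/local/bin/blcr_start_job: the context file name (context.$JOB_ID under $SGE_CKPT_DIR) and the helper choose_launcher are assumptions made for illustration only.

```sh
#!/bin/sh
# Sketch of an SGE starter_method for BLCR (illustrative, not the poster's script).
# Assumption: the checkpoint script writes its context file to
# $SGE_CKPT_DIR/context.$JOB_ID (hypothetical naming scheme).

# Print "restart" if the job was restarted AND a context file exists,
# otherwise print "run".
choose_launcher() {
    restarted=$1
    context_file=$2
    if [ "$restarted" = "1" ] && [ -f "$context_file" ]; then
        echo restart
    else
        echo run
    fi
}

# SGE calls the starter method with the job script and its arguments.
# Nothing is launched when the file is run without arguments.
if [ "$#" -gt 0 ]; then
    context="${SGE_CKPT_DIR:-.}/context.${JOB_ID:-unknown}"
    case $(choose_launcher "${RESTARTED:-0}" "$context") in
        restart) exec cr_restart "$context" ;;  # resume from the saved context
        run)     exec cr_run "$@" ;;            # first start, under BLCR control
    esac
fi
```

Keeping the decision in a small function makes it possible to check the restart logic on a machine without BLCR installed.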
>
>   qconf -sckpt BLCR
>   ckpt_name          BLCR
>   interface          APPLICATION-LEVEL
>   ckpt_command       /usr/local/bin/blcr_checkpoint
>   migr_command       /usr/local/bin/blcr_migrate
>   restart_command    NONE
>   clean_command      /usr/local/bin/blcr_clean
>   ckpt_dir           /opt/sge/default/common/ckpoint
>   signal             NONE
>   when               xsr
>
> these scripts are based on the BLCR HOWTO by Peng and Ng from 2004, modified
> to account for the fact that BLCR does support checkpointing an entire
> process tree these days.
>
> Finally, there's also this bit:
>
>   qconf -sq hiprio|grep -E 'subordinate|starter'
>   starter_method        /usr/local/bin/blcr_start_job
>   subordinate_list      slots=12(all.q:0:sr)
>
> When I submit a job in the hiprio queue, it does suspend jobs in all.q, but I
> don't see it checkpointing the jobs.

This is the intended behavior (please have a look at the state diagrams in the document you mentioned). The migrate command by itself only migrates the jobs, i.e. it kills their processes. Your migrate procedure has to take a checkpoint first (by calling the "ckpt_command" on its own); otherwise whatever checkpoint was taken earlier (maybe at a periodic interval) will be used for the restart.

-- Reuti
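
A rough sketch of a migrate script that follows this order: take a fresh checkpoint first, and only kill the job once that has succeeded. It is an illustration, not the existing /usr/local/bin/blcr_migrate. The function name migrate_job, the CKPT_CMD variable and the argument convention (job PID first, all arguments passed on to the checkpoint script) are assumptions; how the PID reaches the script depends on how migr_command is configured.

```sh
#!/bin/sh
# Sketch of a migr_command wrapper (illustrative only).
# Assumption: the checkpoint script accepts the job's PID as its first argument.
CKPT_CMD=${CKPT_CMD:-/usr/local/bin/blcr_checkpoint}

migrate_job() {
    job_pid=$1
    # Step 1: take a fresh checkpoint. If it fails, leave the job running
    # rather than killing it and falling back to an older checkpoint.
    if ! "$CKPT_CMD" "$@"; then
        echo "checkpoint of PID $job_pid failed, not migrating" >&2
        return 1
    fi
    # Step 2: only now terminate the job so SGE can reschedule it elsewhere.
    # Note: this kills a single process; a job with child processes would
    # need its whole process tree terminated.
    kill -KILL "$job_pid"
}

# Act only when called with arguments, e.g.: blcr_migrate <job_pid>
if [ "$#" -gt 0 ]; then
    migrate_job "$@"
fi
```

Refusing to kill the job after a failed checkpoint is a design choice of this sketch; it avoids restarting from a stale context file at the cost of the migration request not being honoured.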
