Hi folks,

I'm currently replacing an ageing N1 gridengine 6.0u4 with Son of Gridengine 8.1.7. Since the machine that's running the N1 qmaster is due to be replaced in the coming months, I'm taking this opportunity to rethink the way our setup is done.

One thing I'm trying to do is to get checkpointing (with BLCR) to work so that jobs can be migrated to different machines if necessary. Our cluster does not consist of machines with the exact same specs; some machines have more memory than others.

When we're running a lot of small jobs that don't require many resources each, it doesn't matter which machine they run on, and we want to spread the load over as many machines as possible so that the jobs finish as quickly as possible. However, if such jobs are running when a user submits a job that does require a lot of resources, I want gridengine to checkpoint the jobs on the high-memory machine to make room for the new high-resources job, so that it doesn't need to wait for a large number of small jobs to finish (which may take a long time).

I've done the following so far:

    qconf -sq all.q|grep starter
    starter_method        /usr/local/bin/blcr_start_job

This script checks if we have a $RESTARTED environment variable and if a checkpoint file exists. If so, it execs cr_restart; else, it execs cr_run.

    qconf -sckpt BLCR
    ckpt_name          BLCR
    interface          APPLICATION-LEVEL
    ckpt_command       /usr/local/bin/blcr_checkpoint
    migr_command       /usr/local/bin/blcr_migrate
    restart_command    NONE
    clean_command      /usr/local/bin/blcr_clean
    ckpt_dir           /opt/sge/default/common/ckpoint
    signal             NONE
    when               xsr

These scripts are based on the BLCR HOWTO by Peng and Ng from 2004, modified to account for the fact that BLCR does support checkpointing an entire process tree these days.

Finally, there's also this bit:

    qconf -sq hiprio|grep -E 'subordinate|starter'
    starter_method        /usr/local/bin/blcr_start_job
    subordinate_list      slots=12(all.q:0:sr)

When I submit a job in the hiprio queue, it does suspend jobs in all.q, but I don't see it checkpointing the jobs.

Any hints as to what I'm missing?
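
For reference, the starter method logic described above can be sketched roughly as follows. This is a simplified illustration, not the actual /usr/local/bin/blcr_start_job: the context file name (context.$JOB_ID under $SGE_CKPT_DIR) and the helper function names are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of a BLCR-aware starter method.
# ASSUMPTION: the checkpoint context is stored as
# $SGE_CKPT_DIR/context.$JOB_ID; the real scripts may use another name.

# Print the path where this job's checkpoint context is expected.
context_file() {
    printf '%s/context.%s\n' "${SGE_CKPT_DIR:-/tmp}" "${JOB_ID:-unknown}"
}

# Print "restart" if the job was restarted and a context file exists,
# otherwise print "run".
pick_mode() {
    if [ "${RESTARTED:-0}" = "1" ] && [ -f "$(context_file)" ]; then
        echo restart
    else
        echo run
    fi
}

# Replace this process with either cr_restart (resume from the saved
# context) or cr_run (start the job with checkpoint support loaded).
main() {
    if [ "$(pick_mode)" = "restart" ]; then
        exec cr_restart "$(context_file)"
    else
        exec cr_run "$@"
    fi
}

# The starter method receives the job script and its arguments; do
# nothing when invoked without arguments.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Keeping the decision in pick_mode separate from the exec makes it possible to check the restart logic by hand (for example by setting RESTARTED=1 and creating an empty context file) without BLCR installed.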
