> Hi,
>
> On 05.08.2014 at 13:09, Wouter Verhelst wrote:
>
> > I'm currently replacing an ageing N1 gridengine 6.0u4 with Son of
> > Gridengine 8.1.7. Since the machine that's running the N1 qmaster is
> > due to be replaced in the coming months, I'm taking this opportunity to
> > rethink the way our setup is done.
> >
> > One thing I'm trying to do is to get checkpointing (with BLCR) to
> > work so that jobs can be migrated to different machines if necessary.
> > Our cluster does not consist of machines with the exact same specs;
> > some machines have more memory than others.
>
> But they have the same type of CPU

About half of them do, yes. The others are various types of older
machines; these are also the ones with the lowest amount of available
resources.

If that turns out to be a problem, I can probably make it so that jobs
can only be rescheduled to one of the newer machines with the same
hardware specs (these differ in amount of available memory _only_).
This shouldn't have an effect on the usefulness of this solution.

> you tested your applications that
> the checkpointing facility is working reliably outside of SGE?

Not yet, but yes, I am planning on doing that -- once I figure out how
the whole checkpointing thing works ;-)

[...]
> > qconf -sckpt BLCR
> > ckpt_name          BLCR
> > interface          APPLICATION-LEVEL
> > ckpt_command       /usr/local/bin/blcr_checkpoint
> > migr_command       /usr/local/bin/blcr_migrate
> > restart_command    NONE
> > clean_command      /usr/local/bin/blcr_clean
> > ckpt_dir           /opt/sge/default/common/ckpoint
> > signal             NONE
> > when               xsr
[...]
> > When I submit a job in the hiprio queue, it does suspend jobs in
> > all.q, but I don't see it checkpointing the jobs.
>
> This is the intended behavior (please have a look at the state diagrams
> in the mentioned document). The migrate command itself will only
> migrate the jobs (i.e. kill their processes). Your migrate procedure
> will have to do a checkpoint first (by calling the "ckpt_command" on
> its own), or any other checkpoint (maybe taken in a periodic interval
> before) will be used.

Yes, I did catch that, and I do believe that that is what I'm doing,
but I'll readily admit that I don't understand the full details yet,
and that I might be misunderstanding things (hence my question ;-)

However, I've since noticed that there was actually an error in my
script, which caused it to fail. This would probably explain why it
wasn't doing any checkpoints...

Sorry for the noise ;-)
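
For the archives, here is a rough sketch of the "checkpoint first, then
migrate" ordering described in the quoted advice. It is only an
illustration, not my actual blcr_migrate script: the function name
migrate(), its arguments, and the injectable kill hook are all made up
for this example, and the checkpoint command is passed in as an opaque
argument list rather than assuming any particular BLCR invocation.

```python
import os
import signal
import subprocess
import sys


def migrate(checkpoint_cmd, pid, kill=os.kill):
    """Run the checkpoint command, then terminate the job process.

    checkpoint_cmd: argument list for the site's checkpoint script
                    (hypothetical; whatever ckpt_command points at).
    pid:            process ID of the job to terminate.
    kill:           injectable for testing; defaults to os.kill.

    Returns True if the checkpoint succeeded and the signal was sent,
    False if the checkpoint failed (in which case nothing is killed).
    """
    result = subprocess.run(checkpoint_cmd)
    if result.returncode != 0:
        # Do not kill a job whose checkpoint failed; report the failure
        # loudly instead of letting it pass silently.
        print("checkpoint failed with exit code %d" % result.returncode,
              file=sys.stderr)
        return False
    kill(pid, signal.SIGTERM)
    return True
```

The point of checking the exit code is the failure mode I ran into: if
the checkpoint step errors out and nobody notices, the job gets
migrated without a fresh checkpoint.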
