> Hi,
> 
> On 05.08.2014 at 13:09, Wouter Verhelst wrote:
> 
> > I'm currently replacing an ageing N1 gridengine 6.0u4 with Son of
> > Gridengine 8.1.7. Since the machine that's running the N1 qmaster is
> > due to be replaced in the coming months, I'm taking this opportunity to
> > rethink the way our setup is done.
> >
> > One thing I'm trying to do is to get checkpointing (with BLCR) to
> > work so that jobs can be migrated to different machines if necessary.
> > Our cluster does not consist of machines with the exact same specs;
> > some machines have more memory than others.
> 
> But they have the same type of CPU?

About half of them do, yes. The others are various types of older machines; 
these are also the ones with the lowest amount of available resources.

If that turns out to be a problem, I can probably restrict rescheduling to the
newer machines, which share the same hardware specs (they differ _only_ in the
amount of available memory). That restriction shouldn't reduce the usefulness
of this solution.

> Have you tested with your applications that
> the checkpointing facility works reliably outside of SGE?

Not yet, but yes I am planning on doing that -- once I figure out how the whole 
checkpointing thing works ;-)

[...]
> > qconf -sckpt BLCR
> > ckpt_name          BLCR
> > interface          APPLICATION-LEVEL
> > ckpt_command       /usr/local/bin/blcr_checkpoint
> > migr_command       /usr/local/bin/blcr_migrate
> > restart_command    NONE
> > clean_command      /usr/local/bin/blcr_clean
> > ckpt_dir           /opt/sge/default/common/ckpoint
> > signal             NONE
> > when               xsr
[...]
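
For reference, the letters in the "when" setting above decide which events
involve the checkpointing environment at all. Below is a minimal Python sketch
of how "xsr" reads. The meanings are paraphrased from sge_checkpoint(5) and
should be checked against the installed man page; the names WHEN_FLAGS and
describe_when are made up for illustration and are not part of SGE.

```python
# Meanings paraphrased from sge_checkpoint(5); verify against your installation.
WHEN_FLAGS = {
    "s": "checkpoint, abort and migrate when the execd on the job's host shuts down",
    "m": "checkpoint periodically, at the queue's min_cpu_interval",
    "x": "checkpoint, abort and migrate when the job gets suspended",
    "r": "reschedule (no checkpoint) when the host goes into unknown state",
}


def describe_when(value):
    """Return one description per letter of a 'when' setting such as 'xsr'."""
    unknown = [c for c in value if c not in WHEN_FLAGS]
    if unknown:
        raise ValueError("unknown 'when' letters: %s" % "".join(unknown))
    return [WHEN_FLAGS[c] for c in value]


# Print the meaning of each letter in the setting shown above.
for line in describe_when("xsr"):
    print(line)
```

If that reading is right, the "x" in "xsr" is what ties a suspend (such as the
one caused by the hiprio queue) to the migrate path.
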
> > When I submit a job in the hiprio queue, it does suspend jobs in
> > all.q, but I don't see it checkpointing the jobs.
> 
> This is the intended behavior (please have a look at the state diagrams
> in the mentioned document). The migrate command itself will only
> migrate the jobs (i.e. kill their processes). Your migrate procedure
> will have to do a checkpoint first (by calling the "ckpt_command" on
> its own), or any other checkpoint (maybe taken in a periodic interval
> before) will be used.

Yes, I did catch that, and I do believe that that is what I'm doing, but I'll 
readily admit that I don't understand the full details yet, and that I might be 
misunderstanding things (hence my question ;-)

However, I've since noticed that there was actually an error in my script, 
which caused it to fail. This would probably explain why it wasn't doing any 
checkpoints...
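
To make the order of operations from the quoted explanation concrete
(checkpoint first, only then let the processes go away), here is a minimal
Python sketch of a migrate step. It is an illustration, not the
/usr/local/bin/blcr_migrate script from the configuration above. It assumes
that the migr_command line passes $job_pid, $ckpt_dir and $job_id as
arguments, that BLCR's cr_checkpoint is on the PATH, and that it accepts the
--term, -f and --tree options (check the cr_checkpoint man page). The
"context.<job_id>" file name is a made-up convention.

```python
import os
import subprocess
import sys


def context_file(ckpt_dir, job_id):
    """Path of the checkpoint file for one job (naming scheme is an assumption)."""
    return os.path.join(ckpt_dir, "context.%s" % job_id)


def checkpoint_command(job_pid, ckpt_dir, job_id):
    """Build the cr_checkpoint call: save the process tree, then SIGTERM it."""
    return [
        "cr_checkpoint",
        "--term",                           # terminate only after the checkpoint
        "-f", context_file(ckpt_dir, job_id),
        "--tree", str(job_pid),             # include all child processes
    ]


def main(argv):
    """Take the checkpoint; return a non-zero status if anything goes wrong."""
    if len(argv) != 4 or not argv[1].isdigit():
        print("usage: %s JOB_PID CKPT_DIR JOB_ID" % argv[0], file=sys.stderr)
        return 2
    cmd = checkpoint_command(int(argv[1]), argv[2], argv[3])
    try:
        status = subprocess.run(cmd).returncode
    except OSError as err:
        # e.g. cr_checkpoint is not installed or not executable
        print("cannot run %s: %s" % (cmd[0], err), file=sys.stderr)
        return 127
    if status != 0:
        # Report the failure loudly so a broken script does not go unnoticed
        print("checkpoint failed with status %d" % status, file=sys.stderr)
    return status
```

A real script would end with sys.exit(main(sys.argv)) so that a failed
checkpoint shows up as a non-zero exit status instead of failing silently.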

Sorry for the noise ;-)
