Hi all,

I have been messing around a little bit with task checkpoint/restart.

I am using BLCR to checkpoint a fairly small application with
slurm_checkpoint_vacate, which should take several seconds. However, when I
try to restart it with slurm_checkpoint_restart, the process is very slow.

Looking at the output of slurmctld, what I get is

----
slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
slurmctld: attempt re-use active job_id 2110
slurmctld: _slurm_rpc_checkpoint restart 2110: Duplicate job id
----

If I keep making the same call, the output stays identical for some
time, until Slurm cleans up its internal structures (or something like that),
writing in the log

----
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: purge_old_job: purged 1 old job records
slurmctld: debug:  sched: Running job scheduler
slurmctld: debug:  backfill: beginning
slurmctld: debug:  backfill: no jobs to backfill
----

Then the next call to slurm_checkpoint_restart succeeds, with

----
slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
slurmctld: debug2: found 9 usable nodes from config containing
slurm-compute[1-9]
slurmctld: debug2: sched: JobId=2110 allocated resources: NodeList=(null)
slurmctld: _slurm_rpc_checkpoint restart for 2110 usec=909
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug:  backfill: beginning
slurmctld: debug2: backfill: entering _try_sched for job 2110.
slurmctld: debug2: found 2 usable nodes from config containing
slurm-compute[1-9]
slurmctld: backfill: Started JobId=2110 on slurm-compute2
----


I am wondering why all of this is necessary. Why can't the "vacate" call
delete everything related to the job, so that it can be restarted immediately?
If there is a particular reason that makes that impossible, why can't the
Slurm structures be cleaned (purged or whatever) every 10 seconds or
so, instead of once every 5-10 minutes? Would that cause significant
overhead or a scalability issue? Alternatively, is there any API call
that can be used to trigger that purge?
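For reference, my current workaround is simply to retry the restart call until the duplicate-job-id error goes away. The retry logic boils down to something like the sketch below (generic Python; `fake_restart`, the `ESLURM_DUPLICATE_JOB_ID` placeholder, and the wrapper names are stand-ins I made up for illustration, not the real slurm_checkpoint_restart binding or its actual error constant):

```python
import time

def retry_until_ok(call, is_retryable, interval=1.0, timeout=600.0):
    """Retry `call` until it succeeds, fails with an unexpected error,
    or `timeout` seconds have elapsed.

    `call` returns an error code (0 = success); `is_retryable` decides
    whether a nonzero code is worth retrying.
    """
    deadline = time.monotonic() + timeout
    while True:
        rc = call()
        if rc == 0:
            return True
        if not is_retryable(rc) or time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# In the real code, `call` would wrap slurm_checkpoint_restart()
# (e.g. via a ctypes binding) and `is_retryable` would test for the
# "Duplicate job id" error. Here we simulate a job record that is
# only purged after a few attempts:
attempts = {"n": 0}
ESLURM_DUPLICATE_JOB_ID = -1  # placeholder, not the real constant

def fake_restart():
    attempts["n"] += 1
    return 0 if attempts["n"] >= 4 else ESLURM_DUPLICATE_JOB_ID

ok = retry_until_ok(fake_restart,
                    lambda rc: rc == ESLURM_DUPLICATE_JOB_ID,
                    interval=0.01)
print(ok, attempts["n"])  # -> True 4
```

This works, but it means a restart can stall for minutes waiting for the purge cycle, which is what prompts the questions above.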


Thanks for your help,


Manuel

-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
