Hi all, I have been messing around a little bit with task checkpoint/restart.
I am employing BLCR to checkpoint a fairly small application with slurm_checkpoint_vacate, which should take several seconds. However, when I try to restart it with slurm_checkpoint_restart, the process is very slow. Looking at the output of slurmctld, what I get is

----
slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
slurmctld: attempt re-use active job_id 2110
slurmctld: _slurm_rpc_checkpoint restart 2110: Duplicate job id
----

If I keep performing the same call, the output is identical for some time, until Slurm cleans its internal structures (or something like that), writing in the log

----
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug2: Performing purge of old job records
slurmctld: debug2: purge_old_job: purged 1 old job records
slurmctld: debug: sched: Running job scheduler
slurmctld: debug: backfill: beginning
slurmctld: debug: backfill: no jobs to backfill
----

Then the next call to slurm_checkpoint_restart succeeds, with

----
slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
slurmctld: debug2: found 9 usable nodes from config containing slurm-compute[1-9]
slurmctld: debug2: sched: JobId=2110 allocated resources: NodeList=(null)
slurmctld: _slurm_rpc_checkpoint restart for 2110 usec=909
slurmctld: debug2: Testing job time limits and checkpoints
slurmctld: debug: backfill: beginning
slurmctld: debug2: backfill: entering _try_sched for job 2110.
slurmctld: debug2: found 2 usable nodes from config containing slurm-compute[1-9]
slurmctld: backfill: Started JobId=2110 on slurm-compute2
----

I am wondering why all this is necessary. Why can't the "vacate" call delete everything related to the job, so it can be restarted immediately? If there is some particular reason that makes that impossible, why can't the Slurm structures be cleaned (purged or whatever) every 10 seconds or so, instead of once every 5-10 minutes? Does that cause a significant overhead or scalability issue?
Or, as an alternative, is there any API call that can be used to trigger that purge?

Thanks for your help,

Manuel

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
