Hi all,
I have been performing some more tests, trying to understand the Slurm internals and to reduce the checkpoint/restart time. Looking at the job status with slurm_print_job_info, I have observed that the job remains in the "RUNNING" state for about 5 minutes after a "slurm_checkpoint_vacate":

----
JobId=2133 JobName=variableSizeTester.sh
UserId=slurm(500) GroupId=slurm(1000)
Priority=4294901754 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:05:39 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2015-06-02T06:43:16 EligibleTime=2015-06-02T06:43:16
StartTime=2015-06-02T06:43:17 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=debug AllocNode:Sid=slurm-master:2951
(...)
----

So when calling "slurm_checkpoint_restart", slurmctld complains with

----
attempt re-use active job_id 2133
_slurm_rpc_checkpoint restart 2133: Duplicate job id
----

and the same error is returned until the aforementioned 5 minute limit, when the job record is released and cleaned:

----
slurmctld: debug2: Purging old records
slurmctld: debug2: purge_old_job: purged 1 old job records
----

The checkpoint can then be restarted.

I have tried calling purge_old_job() to reduce this time, but it does not work, so I assume that the problem is that the job really is considered to be running, and not stale information in slurmctld. Also, there is no query from slurmctld to the compute node, so this seems to be some kind of internal timeout or something like that. Am I right?

My question is then: can this time be reduced somehow? Is there any particular reason why the job is considered active by slurmctld for about 5 minutes after its checkpoint and cancellation?

Thanks for your attention. Best regards,

Manuel

2015-05-29 18:00 GMT+02:00 Manuel Rodríguez Pascual <[email protected]>:

> Hi all,
>
> I have been messing around a little bit with task checkpoint/restart.
>
> I am employing BLCR to checkpoint a fairly small application with
> slurm_checkpoint_vacate, which should take several seconds. However, when I
> try to restart it with slurm_checkpoint_restart, the process is very slow.
>
> Looking at the output of slurmctld, what I get is
>
> ----
> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
> slurmctld: attempt re-use active job_id 2110
> slurmctld: _slurm_rpc_checkpoint restart 2110: Duplicate job id
> ----
>
> If I keep performing the same call, the output is identical for some
> time, until Slurm cleans its internal structures (or something like that),
> writing in the log
>
> ----
> slurmctld: debug2: Testing job time limits and checkpoints
> slurmctld: debug2: Performing purge of old job records
> slurmctld: debug2: purge_old_job: purged 1 old job records
> slurmctld: debug: sched: Running job scheduler
> slurmctld: debug: backfill: beginning
> slurmctld: debug: backfill: no jobs to backfill
> ----
>
> Then, the next call to slurm_checkpoint_restart succeeds, with
>
> ----
> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
> slurmctld: debug2: found 9 usable nodes from config containing
> slurm-compute[1-9]
> slurmctld: debug2: sched: JobId=2110 allocated resources: NodeList=(null)
> slurmctld: _slurm_rpc_checkpoint restart for 2110 usec=909
> slurmctld: debug2: Testing job time limits and checkpoints
> slurmctld: debug: backfill: beginning
> slurmctld: debug2: backfill: entering _try_sched for job 2110.
> slurmctld: debug2: found 2 usable nodes from config containing
> slurm-compute[1-9]
> slurmctld: backfill: Started JobId=2110 on slurm-compute2
> ----
>
> I am wondering why all this is necessary. Why can't the "vacate" call
> delete everything related to the job, so that it can be restarted immediately?
> If there is any particular reason that makes that impossible, why can't
> the Slurm structures be cleaned (purged or whatever) every 10 seconds or
> so, instead of once every 5-10 minutes? Does that cause a significant
> overhead or scalability issue? Or, as an alternative, is there any API
> call that can be employed to trigger that purge?
>
> Thanks for your help,
>
> Manuel
>
> --
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
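PS: Since repeating the same slurm_checkpoint_restart call eventually succeeds once the old record is purged, the interim workaround amounts to a retry loop. Below is a minimal sketch of that logic, in Python rather than the C client code, with the real slurm_checkpoint_restart() call replaced by a stub; the helper name, the 5-second interval, and the return-code convention are my own assumptions, not part of the Slurm API:

```python
import time

def retry_restart(restart_fn, max_wait=600, interval=5):
    """Call restart_fn() until it returns 0 (success) or max_wait
    seconds have elapsed.  restart_fn stands in for the real
    slurm_checkpoint_restart() call, which keeps failing with
    "Duplicate job id" until slurmctld purges the old job record.
    Returns the total time slept before the call succeeded."""
    waited = 0
    while True:
        rc = restart_fn()
        if rc == 0:               # restart finally accepted by slurmctld
            return waited
        if waited >= max_wait:    # give up: the record was never purged
            raise TimeoutError("job record never purged")
        time.sleep(interval)
        waited += interval

# Stub mimicking the observed behavior: the first three attempts fail
# with a nonzero code ("Duplicate job id"), the fourth one succeeds.
calls = {"n": 0}
def stub():
    calls["n"] += 1
    return 0 if calls["n"] > 3 else -1

elapsed = retry_restart(stub, max_wait=600, interval=0.01)
```

In the real client the stub would wrap the actual restart call and treat only the duplicate-job-id error as "retry". Of course, I would much rather shorten the purge delay itself than burn up to 5 minutes in this loop, hence my question.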
