That's exactly what I was looking for, thanks very much.
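[Editor's note, for anyone landing on this thread later: MinJobAge is the number of seconds slurmctld keeps the record of a finished job before purging it, and its default of 300 seconds matches the ~5-minute delay described below. A sketch of the change in slurm.conf; the value 10 is only an illustration, not a recommendation:]

```ini
# slurm.conf -- keep finished job records for only 10 seconds before
# purging, so a vacated job id can be reused sooner (default: 300 s)
MinJobAge=10
```

[The new value has to be picked up by slurmctld after editing slurm.conf, e.g. via a daemon restart or `scontrol reconfigure`.]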
2015-06-02 16:30 GMT+02:00 Moe Jette <[email protected]>:

> See the MinJobAge configuration option:
> http://slurm.schedmd.com/slurm.conf.html
>
> Quoting Manuel Rodríguez Pascual <[email protected]>:
>
>> Hi all,
>>
>> I have been performing some more tests trying to understand the Slurm
>> internals and to reduce the checkpoint/restart time.
>>
>> Looking into the job status with slurm_print_job_info, I have observed
>> that it remains in the "RUNNING" state for about 5 minutes after a
>> "slurm_checkpoint_vacate":
>>
>> JobId=2133 JobName=variableSizeTester.sh
>>    UserId=slurm(500) GroupId=slurm(1000)
>>    Priority=4294901754 Nice=0 Account=(null) QOS=(null)
>>    JobState=RUNNING Reason=None Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>    RunTime=00:05:39 TimeLimit=UNLIMITED TimeMin=N/A
>>    SubmitTime=2015-06-02T06:43:16 EligibleTime=2015-06-02T06:43:16
>>    StartTime=2015-06-02T06:43:17 EndTime=Unknown
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    Partition=debug AllocNode:Sid=slurm-master:2951
>>    (...)
>>
>> So when calling "slurm_checkpoint_restart", slurmctld complains with
>>
>> attempt re-use active job_id 2133
>> _slurm_rpc_checkpoint restart 2133: Duplicate job id
>>
>> and the same error is returned until the aforementioned 5-minute limit,
>> when the job record is released and cleaned:
>>
>> slurmctld: debug2: Purging old records
>> slurmctld: debug2: purge_old_job: purged 1 old job records
>>
>> and the checkpoint can then be restarted.
>>
>> I have tried calling purge_old_job() to reduce this time, but it does
>> not work, so I assume that the problem is that the job really is
>> considered to be running, rather than misinformation on slurmctld's
>> side. Also, there is no query from slurmctld to the compute node, so
>> this seems to be some kind of internal timeout or something like that.
>> Am I right?
>>
>> My question then is: can this time be reduced somehow?
>> Is there any particular reason why the job is considered active by
>> slurmctld for about 5 minutes after its checkpoint and cancellation?
>>
>> Thanks for your attention.
>>
>> Best regards,
>>
>> Manuel
>>
>> 2015-05-29 18:00 GMT+02:00 Manuel Rodríguez Pascual <
>> [email protected]>:
>>
>>> Hi all,
>>>
>>> I have been messing around a little bit with task checkpoint/restart.
>>>
>>> I am employing BLCR to checkpoint a fairly small application with
>>> slurm_checkpoint_vacate, which should take several seconds. However,
>>> when I try to restart it with slurm_checkpoint_restart, the process is
>>> very slow.
>>>
>>> Looking at the output of slurmctld, what I get is
>>>
>>> ----
>>> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
>>> slurmctld: attempt re-use active job_id 2110
>>> slurmctld: _slurm_rpc_checkpoint restart 2110: Duplicate job id
>>> ----
>>>
>>> If I keep making the same call, the output is identical for some time,
>>> until Slurm cleans its internal structures (or something like that),
>>> writing in the log
>>>
>>> ----
>>> slurmctld: debug2: Testing job time limits and checkpoints
>>> slurmctld: debug2: Performing purge of old job records
>>> slurmctld: debug2: purge_old_job: purged 1 old job records
>>> slurmctld: debug: sched: Running job scheduler
>>> slurmctld: debug: backfill: beginning
>>> slurmctld: debug: backfill: no jobs to backfill
>>> ----
>>>
>>> Then the next call to slurm_checkpoint_restart succeeds, with
>>>
>>> ----
>>> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
>>> slurmctld: debug2: found 9 usable nodes from config containing
>>> slurm-compute[1-9]
>>> slurmctld: debug2: sched: JobId=2110 allocated resources: NodeList=(null)
>>> slurmctld: _slurm_rpc_checkpoint restart for 2110 usec=909
>>> slurmctld: debug2: Testing job time limits and checkpoints
>>> slurmctld: debug: backfill: beginning
>>> slurmctld: debug2: backfill: entering _try_sched for job 2110.
>>> slurmctld: debug2: found 2 usable nodes from config containing
>>> slurm-compute[1-9]
>>> slurmctld: backfill: Started JobId=2110 on slurm-compute2
>>> ----
>>>
>>> I am wondering why all of this is necessary. Why can't the "vacate"
>>> call delete everything related to the job, so that it can be restarted
>>> immediately? If there is a particular reason that makes that
>>> impossible, why can't the Slurm structures be cleaned (purged or
>>> whatever) every 10 seconds or so, instead of once every 5-10 minutes?
>>> Does that cause a significant overhead or scalability issue? Or, as an
>>> alternative, is there any API call that can be employed to trigger
>>> that purge?
>>>
>>> Thanks for your help,
>>>
>>> Manuel
>>>
>
> --
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
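[Editor's note: until the purge runs, every restart attempt fails with "Duplicate job id", which suggests a simple client-side workaround: retry the restart until slurmctld accepts it. A minimal sketch; the 5-second interval is arbitrary, and using `scontrol checkpoint restart` in place of the slurm_checkpoint_restart() API call is an assumption about the site's tooling:]

```shell
#!/bin/sh
# Retry a command until it succeeds or a timeout (in seconds) expires.
# Usage: retry_until <timeout_seconds> <command...>
retry_until() {
    timeout=$1; shift
    elapsed=0
    until "$@"; do
        sleep 5
        elapsed=$((elapsed + 5))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "retry_until: gave up after ${timeout}s" >&2
            return 1
        fi
    done
}

# Example (job id 2133 taken from the log above; commented out because
# it needs a live slurmctld):
# retry_until 600 scontrol checkpoint restart 2133
```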
