Hi all,
I have been performing some more tests, trying to understand the Slurm internals and to reduce the checkpoint/restart time. Looking at the job status with slurm_print_job_info, I have observed that the job remains in the "RUNNING" state for about 5 minutes after a "slurm_checkpoint_vacate":

----
JobId=2133 JobName=variableSizeTester.sh
UserId=slurm(500) GroupId=slurm(1000)
Priority=4294901754 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:05:39 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2015-06-02T06:43:16 EligibleTime=2015-06-02T06:43:16
StartTime=2015-06-02T06:43:17 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=debug AllocNode:Sid=slurm-master:2951
(...)
----

So when calling "slurm_checkpoint_restart", slurmctld complains with

----
attempt re-use active job_id 2133
_slurm_rpc_checkpoint restart 2133: Duplicate job id
----

and the same error is returned until the aforementioned 5 minute limit, when the job record is released and cleaned:

----
slurmctld: debug2: Purging old records
slurmctld: debug2: purge_old_job: purged 1 old job records
----

The checkpoint can then be restarted.

I have tried calling purge_old_job() to reduce this time, but it does not work, so I assume that the problem is that the job really is considered to be running, and not stale information in slurmctld. Also, there is no query from slurmctld to the compute node, so this seems to be some kind of internal timeout or something like that. Am I right?

My question is then: can this time be reduced somehow? Is there any particular reason why the job is considered active by slurmctld for about 5 minutes after its checkpoint and cancellation?

Thanks for your attention. Best regards,

Manuel

2015-05-29 18:00 GMT+02:00 Manuel Rodríguez Pascual <[email protected]>:

> Hi all,
>
> I have been messing around a little bit with task checkpoint/restart.
>
> I am employing BLCR to checkpoint a fairly small application with
> slurm_checkpoint_vacate, which should take several seconds. However, when I
> try to restart it with slurm_checkpoint_restart, the process is very slow.
>
> Looking at the output of slurmctld, what I get is
>
> ----
> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
> slurmctld: attempt re-use active job_id 2110
> slurmctld: _slurm_rpc_checkpoint restart 2110: Duplicate job id
> ----
>
> If I keep performing the same call, the output is identical for some
> time, until Slurm cleans its internal structures (or something like that),
> writing in the log
>
> ----
> slurmctld: debug2: Testing job time limits and checkpoints
> slurmctld: debug2: Performing purge of old job records
> slurmctld: debug2: purge_old_job: purged 1 old job records
> slurmctld: debug: sched: Running job scheduler
> slurmctld: debug: backfill: beginning
> slurmctld: debug: backfill: no jobs to backfill
> ----
>
> Then, the next call to slurm_checkpoint_restart succeeds, with
>
> ----
> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
> slurmctld: debug2: found 9 usable nodes from config containing
> slurm-compute[1-9]
> slurmctld: debug2: sched: JobId=2110 allocated resources: NodeList=(null)
> slurmctld: _slurm_rpc_checkpoint restart for 2110 usec=909
> slurmctld: debug2: Testing job time limits and checkpoints
> slurmctld: debug: backfill: beginning
> slurmctld: debug2: backfill: entering _try_sched for job 2110.
> slurmctld: debug2: found 2 usable nodes from config containing
> slurm-compute[1-9]
> slurmctld: backfill: Started JobId=2110 on slurm-compute2
> ----
>
> I am wondering why all this is necessary. Why can't the "vacate" call
> delete everything related to the job, so that it can be restarted immediately?
> If there is any particular reason that makes that impossible, why can't
> the Slurm structures be cleaned (purged or whatever) every 10 seconds or
> so, instead of once every 5-10 minutes? Does that cause a significant
> overhead or scalability issue? Or, as an alternative, is there any API
> call that can be employed to trigger that purge?
>
> Thanks for your help,
>
> Manuel
>
> --
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
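PS: Since repeating the same slurm_checkpoint_restart call eventually succeeds once the old record is purged, the interim workaround amounts to a retry loop. Below is a minimal sketch of that logic, in Python rather than the C client code, with the real slurm_checkpoint_restart() call replaced by a stub; the helper name, the 5-second interval, and the return-code convention are my own assumptions, not part of the Slurm API:

```python
import time

def retry_restart(restart_fn, max_wait=600, interval=5):
    """Call restart_fn() until it returns 0 (success) or max_wait
    seconds have elapsed.  restart_fn stands in for the real
    slurm_checkpoint_restart() call, which keeps failing with
    "Duplicate job id" until slurmctld purges the old job record.
    Returns the total time slept before the call succeeded."""
    waited = 0
    while True:
        rc = restart_fn()
        if rc == 0:               # restart finally accepted by slurmctld
            return waited
        if waited >= max_wait:    # give up: the record was never purged
            raise TimeoutError("job record never purged")
        time.sleep(interval)
        waited += interval

# Stub mimicking the observed behavior: the first three attempts fail
# with a nonzero code ("Duplicate job id"), the fourth one succeeds.
calls = {"n": 0}
def stub():
    calls["n"] += 1
    return 0 if calls["n"] > 3 else -1

elapsed = retry_restart(stub, max_wait=600, interval=0.01)
```

In the real client the stub would wrap the actual restart call and treat only the duplicate-job-id error as "retry". Of course, I would much rather shorten the purge delay itself than burn up to 5 minutes in this loop, hence my question.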
