That's exactly what I was looking for, thanks very much.
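[Editor's note, for anyone landing on this thread later: MinJobAge is the number of seconds slurmctld keeps the record of a finished job before purging it, and its default of 300 seconds matches the ~5-minute delay described below. A sketch of the change in slurm.conf; the value 10 is only an illustration, not a recommendation:]

```ini
# slurm.conf -- keep finished job records for only 10 seconds before
# purging, so a vacated job id can be reused sooner (default: 300 s)
MinJobAge=10
```

[The new value has to be picked up by slurmctld after editing slurm.conf, e.g. via a daemon restart or `scontrol reconfigure`.]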
2015-06-02 16:30 GMT+02:00 Moe Jette <[email protected]>:

> See the MinJobAge configuration option:
> http://slurm.schedmd.com/slurm.conf.html
>
> Quoting Manuel Rodríguez Pascual <[email protected]>:
>
>> Hi all,
>>
>> I have been performing some more tests trying to understand the Slurm
>> internals and to reduce the checkpoint/restart time.
>>
>> Looking into the job status with slurm_print_job_info, I have observed
>> that it remains in the "RUNNING" state for about 5 minutes after a
>> "slurm_checkpoint_vacate":
>>
>> JobId=2133 JobName=variableSizeTester.sh
>>    UserId=slurm(500) GroupId=slurm(1000)
>>    Priority=4294901754 Nice=0 Account=(null) QOS=(null)
>>    JobState=RUNNING Reason=None Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>    RunTime=00:05:39 TimeLimit=UNLIMITED TimeMin=N/A
>>    SubmitTime=2015-06-02T06:43:16 EligibleTime=2015-06-02T06:43:16
>>    StartTime=2015-06-02T06:43:17 EndTime=Unknown
>>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>>    Partition=debug AllocNode:Sid=slurm-master:2951
>>    (...)
>>
>> So when calling "slurm_checkpoint_restart", slurmctld complains with
>>
>> attempt re-use active job_id 2133
>> _slurm_rpc_checkpoint restart 2133: Duplicate job id
>>
>> and the same error is returned until the aforementioned 5-minute limit,
>> when the job record is released and cleaned:
>>
>> slurmctld: debug2: Purging old records
>> slurmctld: debug2: purge_old_job: purged 1 old job records
>>
>> and the checkpoint can then be restarted.
>>
>> I have tried calling purge_old_job() to reduce this time, but it does
>> not work, so I assume that the problem is that the job really is
>> considered to be running, rather than misinformation on slurmctld's
>> side. Also, there is no query from slurmctld to the compute node, so
>> this seems to be some kind of internal timeout or something like that.
>> Am I right?
>>
>> My question then is: can this time be reduced somehow?
>> Is there any particular reason why the job is considered active by
>> slurmctld for about 5 minutes after its checkpoint and cancellation?
>>
>> Thanks for your attention.
>>
>> Best regards,
>>
>> Manuel
>>
>> 2015-05-29 18:00 GMT+02:00 Manuel Rodríguez Pascual <
>> [email protected]>:
>>
>>> Hi all,
>>>
>>> I have been messing around a little bit with task checkpoint/restart.
>>>
>>> I am employing BLCR to checkpoint a fairly small application with
>>> slurm_checkpoint_vacate, which should take several seconds. However,
>>> when I try to restart it with slurm_checkpoint_restart, the process is
>>> very slow.
>>>
>>> Looking at the output of slurmctld, what I get is
>>>
>>> ----
>>> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
>>> slurmctld: attempt re-use active job_id 2110
>>> slurmctld: _slurm_rpc_checkpoint restart 2110: Duplicate job id
>>> ----
>>>
>>> If I keep making the same call, the output is identical for some time,
>>> until Slurm cleans its internal structures (or something like that),
>>> writing in the log
>>>
>>> ----
>>> slurmctld: debug2: Testing job time limits and checkpoints
>>> slurmctld: debug2: Performing purge of old job records
>>> slurmctld: debug2: purge_old_job: purged 1 old job records
>>> slurmctld: debug: sched: Running job scheduler
>>> slurmctld: debug: backfill: beginning
>>> slurmctld: debug: backfill: no jobs to backfill
>>> ----
>>>
>>> Then the next call to slurm_checkpoint_restart succeeds, with
>>>
>>> ----
>>> slurmctld: debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=0
>>> slurmctld: debug2: found 9 usable nodes from config containing
>>> slurm-compute[1-9]
>>> slurmctld: debug2: sched: JobId=2110 allocated resources: NodeList=(null)
>>> slurmctld: _slurm_rpc_checkpoint restart for 2110 usec=909
>>> slurmctld: debug2: Testing job time limits and checkpoints
>>> slurmctld: debug: backfill: beginning
>>> slurmctld: debug2: backfill: entering _try_sched for job 2110.
>>> slurmctld: debug2: found 2 usable nodes from config containing
>>> slurm-compute[1-9]
>>> slurmctld: backfill: Started JobId=2110 on slurm-compute2
>>> ----
>>>
>>> I am wondering why all of this is necessary. Why can't the "vacate"
>>> call delete everything related to the job, so that it can be restarted
>>> immediately? If there is a particular reason that makes that
>>> impossible, why can't the Slurm structures be cleaned (purged or
>>> whatever) every 10 seconds or so, instead of once every 5-10 minutes?
>>> Does that cause a significant overhead or scalability issue? Or, as an
>>> alternative, is there any API call that can be employed to trigger
>>> that purge?
>>>
>>> Thanks for your help,
>>>
>>> Manuel
>>>
>
> --
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
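[Editor's note: until the purge runs, every restart attempt fails with "Duplicate job id", which suggests a simple client-side workaround: retry the restart until slurmctld accepts it. A minimal sketch; the 5-second interval is arbitrary, and using `scontrol checkpoint restart` in place of the slurm_checkpoint_restart() API call is an assumption about the site's tooling:]

```shell
#!/bin/sh
# Retry a command until it succeeds or a timeout (in seconds) expires.
# Usage: retry_until <timeout_seconds> <command...>
retry_until() {
    timeout=$1; shift
    elapsed=0
    until "$@"; do
        sleep 5
        elapsed=$((elapsed + 5))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "retry_until: gave up after ${timeout}s" >&2
            return 1
        fi
    done
}

# Example (job id 2133 taken from the log above; commented out because
# it needs a live slurmctld):
# retry_until 600 scontrol checkpoint restart 2133
```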
