A user noticed that their job was cancelled earlier than expected. The requested
time limit was not honored. The partition does have a default time limit of
30:00; could that have been enforced instead?
We are running Slurm 17.02.5. I dug up an old ticket describing the same issue
while we were running 15.08: that job asked for 12:00:00 and was killed at
1:00:00. I don't recall a change to the default time limit (it should have been
30 minutes then as well), so I may be wrong in assuming the partition default is
being enforced.
The job had a time limit of 5:00:00, yet it was cancelled after 30:00 due to
time limit. The partition has a default time limit of 30:00. The job is recorded
in the db as COMPLETED, not TIMEOUT.
Any idea on what would cause this?
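For reference, the limits involved can be compared in seconds. The helper below is a hypothetical sketch (not part of Slurm) that parses Slurm time-limit strings of the form [days-]hours:minutes:seconds, minutes:seconds, or minutes, assuming the formats documented for sbatch --time:

```python
def slurm_time_to_seconds(t):
    """Convert a Slurm time string to seconds.

    Accepted forms (per the sbatch --time syntax):
      "MM", "MM:SS", "HH:MM:SS", "DD-HH", "DD-HH:MM", "DD-HH:MM:SS".
    """
    days, rest = (t.split("-", 1) if "-" in t else ("0", t))
    parts = [int(p) for p in rest.split(":")]
    if "-" in t:
        # With a day field, remaining fields are HH[:MM[:SS]].
        parts += [0] * (3 - len(parts))
        h, m, s = parts
    elif len(parts) == 3:
        h, m, s = parts
    elif len(parts) == 2:
        # Without a day field, two fields mean MM:SS, not HH:MM.
        h, m, s = 0, parts[0], parts[1]
    else:
        h, m, s = 0, parts[0], 0
    return int(days) * 86400 + h * 3600 + m * 60 + s

print(slurm_time_to_seconds("5:00:00"))   # requested limit: 18000 s (5 h)
print(slurm_time_to_seconds("30:00"))     # partition default: 1800 s (30 min)
print(slurm_time_to_seconds("00:30:36"))  # actual elapsed: 1836 s
```

The elapsed time of 00:30:36 is just past the 30-minute partition default, which is consistent with the 30-minute limit being the one enforced rather than the requested 5 hours.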
$ sacct -S 071417 -X -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151
JobID State Timelimit Elapsed ExitCode
-------------------- -------------------- ---------- ---------- --------
1695151 COMPLETED 05:00:00 00:30:36 0:0
slurmctld log
2017-08-29T09:45:30.868553-04:00 host1 slurmctld 15006 - -
_slurm_rpc_submit_batch_job JobId=1695151 usec=1945
2017-08-29T09:46:52.059115-04:00 host1 slurmctld 15006 - - email msg to
x...@gmail.com: SLURM Job_id=1695151 Name=job1 Began, Queued time 00:01:22
2017-08-29T09:46:52.059478-04:00 host1 slurmctld 15006 - - sched: Allocate
JobID=1695151 NodeList=node1 #CPUs=80 Partition=default
2017-08-29T09:46:52.110696-04:00 host1 slurmctld 15006 - -
prolog_running_decr: Configuration for job 169515 is complete
2017-08-29T09:47:03.624387-04:00 host1 slurmctld 15006 - -
_slurm_rpc_update_job complete JobId=1695151 uid=1 usec=1469
2017-08-29T10:17:28.079554-04:00 host1 slurmctld 15006 - -
check_job_step_time_limit: job 1695151 step 0 has timed out (30)
2017-08-29T10:17:28.441852-04:00 host1 slurmctld 15006 - - job_complete:
JobID=1695151 State=0x1 NodeCnt=1 WEXITSTATUS 0
2017-08-29T10:17:28.442031-04:00 host1 slurmctld 15006 - - email msg to
x...@gmail.com: SLURM Job_id=1695151 Name=job1 Ended, Run time 00:30:36,
COMPLETED, ExitCode 0
2017-08-29T10:17:28.442308-04:00 host1 slurmctld 15006 - - job_complete:
JobID=1695151 State=0x8003 NodeCnt=1 done
compute node slurmd log
[2017-08-29T09:47:03.398] _run_prolog: prolog with lock for job 1695151 ran for
11 seconds
[2017-08-29T09:47:03.398] Launching batch job 1695151 for UID 1
[2017-08-29T09:47:03.538] [1695151] task/cgroup: /slurm/uid_1/job_1695151:
alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.538] [1695151] task/cgroup:
/slurm/uid_1/job_1695151/step_batch: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup:
/slurm/uid_1/job_1695151: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup:
/slurm/uid_1/job_1695151/step_extern: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T09:47:03.704] launch task 1695151.0 request from 1.15885@localhost
(port 45283)
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup: /slurm/uid_1/job_1695151:
alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup:
/slurm/uid_1/job_1695151/step_0: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T10:17:28.188] [1695151.0] error: *** STEP 1695151.0 ON l020
CANCELLED AT 2017-08-29T10:17:28 DUE TO TIME LIMIT ***
[2017-08-29T10:17:28.413] [1695151.0] done with job
[2017-08-29T10:17:28.439] [1695151] sending REQUEST_COMPLETE_BATCH_SCRIPT,
error:0 status 0
[2017-08-29T10:17:28.442] [1695151] done with job
[2017-08-29T10:17:28.477] [1695151.4294967295] done with job