A user noticed that their job was cancelled earlier than expected. The requested
time limit was not honored. The partition does have a default time limit of
30:00; could that have been enforced instead?
We are running Slurm 17.02.5. I dug up an old ticket describing the same issue
while we were running 15.08: that job asked for 12:00:00 and was killed at
1:00:00. I don't recall a change to the default time limit (it should have been
30 minutes then as well), so I may be wrong in assuming the partition default is
being enforced.
The job had a time limit of 5:00:00, yet it was cancelled after 30:00 due to
time limit. The partition has a default time limit of 30:00. The job is recorded
in the db as COMPLETED, not TIMEOUT.
Any idea on what would cause this?
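For reference, the limits involved can be compared in seconds. The helper below is a hypothetical sketch (not part of Slurm) that parses Slurm time-limit strings of the form [days-]hours:minutes:seconds, minutes:seconds, or minutes, assuming the formats documented for sbatch --time:

```python
def slurm_time_to_seconds(t):
    """Convert a Slurm time string to seconds.

    Accepted forms (per the sbatch --time syntax):
      "MM", "MM:SS", "HH:MM:SS", "DD-HH", "DD-HH:MM", "DD-HH:MM:SS".
    """
    days, rest = (t.split("-", 1) if "-" in t else ("0", t))
    parts = [int(p) for p in rest.split(":")]
    if "-" in t:
        # With a day field, remaining fields are HH[:MM[:SS]].
        parts += [0] * (3 - len(parts))
        h, m, s = parts
    elif len(parts) == 3:
        h, m, s = parts
    elif len(parts) == 2:
        # Without a day field, two fields mean MM:SS, not HH:MM.
        h, m, s = 0, parts[0], parts[1]
    else:
        h, m, s = 0, parts[0], 0
    return int(days) * 86400 + h * 3600 + m * 60 + s

print(slurm_time_to_seconds("5:00:00"))   # requested limit: 18000 s (5 h)
print(slurm_time_to_seconds("30:00"))     # partition default: 1800 s (30 min)
print(slurm_time_to_seconds("00:30:36"))  # actual elapsed: 1836 s
```

The elapsed time of 00:30:36 is just past the 30-minute partition default, which is consistent with the 30-minute limit being the one enforced rather than the requested 5 hours.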
$ sacct -S 071417 -X -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151
JobID State Timelimit Elapsed ExitCode
-------------------- -------------------- ---------- ---------- --------
1695151 COMPLETED 05:00:00 00:30:36 0:0
slurmctld log
2017-08-29T09:45:30.868553-04:00 host1 slurmctld 15006 - -
_slurm_rpc_submit_batch_job JobId=1695151 usec=1945
2017-08-29T09:46:52.059115-04:00 host1 slurmctld 15006 - - email msg to
x...@gmail.com: SLURM Job_id=1695151 Name=job1 Began, Queued time 00:01:22
2017-08-29T09:46:52.059478-04:00 host1 slurmctld 15006 - - sched: Allocate
JobID=1695151 NodeList=node1 #CPUs=80 Partition=default
2017-08-29T09:46:52.110696-04:00 host1 slurmctld 15006 - -
prolog_running_decr: Configuration for job 169515 is complete
2017-08-29T09:47:03.624387-04:00 host1 slurmctld 15006 - -
_slurm_rpc_update_job complete JobId=1695151 uid=1 usec=1469
2017-08-29T10:17:28.079554-04:00 host1 slurmctld 15006 - -
check_job_step_time_limit: job 1695151 step 0 has timed out (30)
2017-08-29T10:17:28.441852-04:00 host1 slurmctld 15006 - - job_complete:
JobID=1695151 State=0x1 NodeCnt=1 WEXITSTATUS 0
2017-08-29T10:17:28.442031-04:00 host1 slurmctld 15006 - - email msg to
x...@gmail.com: SLURM Job_id=1695151 Name=job1 Ended, Run time 00:30:36,
COMPLETED, ExitCode 0
2017-08-29T10:17:28.442308-04:00 host1 slurmctld 15006 - - job_complete:
JobID=1695151 State=0x8003 NodeCnt=1 done
compute node slurmd log
[2017-08-29T09:47:03.398] _run_prolog: prolog with lock for job 1695151 ran for
11 seconds
[2017-08-29T09:47:03.398] Launching batch job 1695151 for UID 1
[2017-08-29T09:47:03.538] [1695151] task/cgroup: /slurm/uid_1/job_1695151:
alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.538] [1695151] task/cgroup:
/slurm/uid_1/job_1695151/step_batch: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup:
/slurm/uid_1/job_1695151: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup:
/slurm/uid_1/job_1695151/step_extern: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T09:47:03.704] launch task 1695151.0 request from 1.15885@localhost
(port 45283)
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup: /slurm/uid_1/job_1695151:
alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup:
/slurm/uid_1/job_1695151/step_0: alloc=3072000MB mem.limit=3034095MB
memsw.limit=unlimited
[2017-08-29T10:17:28.188] [1695151.0] error: *** STEP 1695151.0 ON l020
CANCELLED AT 2017-08-29T10:17:28 DUE TO TIME LIMIT ***
[2017-08-29T10:17:28.413] [1695151.0] done with job
[2017-08-29T10:17:28.439] [1695151] sending REQUEST_COMPLETE_BATCH_SCRIPT,
error:0 status 0
[2017-08-29T10:17:28.442] [1695151] done with job
[2017-08-29T10:17:28.477] [1695151.4294967295] done with job