A user noticed that their job was cancelled earlier than expected; the requested time limit was not honored. The partition does have a default time limit of 30:00, so could that have been enforced instead?
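
For reference, the partition is the one named "default", and the limit can be confirmed like this (the output below is a sketch of what ours shows, not a verbatim paste):

$ scontrol show partition default | grep -io 'DefaultTime=[^ ]*'
DefaultTime=00:30:00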

We are running Slurm 17.02.5. I dug up an old ticket with the same issue from when we were running 15.08: that job asked for 12:00:00 and was killed at 1:00:00. I don't recall the default time limit ever changing (it should have been 30 minutes then as well), so I may be wrong in assuming it's the partition default being enforced.

The job had a time limit of 5:00:00, but it was cancelled after 30:00 due to time limit. The partition's default time limit is 30:00. The job is recorded in the database as COMPLETED, not TIMEOUT.

Any ideas on what would cause this?


$ sacct -S 071417 -X -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151
               JobID                State  Timelimit    Elapsed ExitCode
-------------------- -------------------- ---------- ---------- --------
             1695151            COMPLETED   05:00:00   00:30:36      0:0
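
If the step records would help, the same query without -X should show them (batch, extern, and step 0) — happy to rerun:

$ sacct -S 071417 -a --format JobID%20,State%20,timelimit,Elapsed,ExitCode -j 1695151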


slurmctld log

2017-08-29T09:45:30.868553-04:00 host1 slurmctld 15006 - - _slurm_rpc_submit_batch_job JobId=1695151 usec=1945
2017-08-29T09:46:52.059115-04:00 host1 slurmctld 15006 - - email msg to x...@gmail.com: SLURM Job_id=1695151 Name=job1 Began, Queued time 00:01:22
2017-08-29T09:46:52.059478-04:00 host1 slurmctld 15006 - - sched: Allocate JobID=1695151 NodeList=node1 #CPUs=80 Partition=default
2017-08-29T09:46:52.110696-04:00 host1 slurmctld 15006 - - prolog_running_decr: Configuration for job 169515 is complete
2017-08-29T09:47:03.624387-04:00 host1 slurmctld 15006 - - _slurm_rpc_update_job complete JobId=1695151 uid=1 usec=1469
2017-08-29T10:17:28.079554-04:00 host1 slurmctld 15006 - - check_job_step_time_limit: job 1695151 step 0 has timed out (30)
2017-08-29T10:17:28.441852-04:00 host1 slurmctld 15006 - - job_complete: JobID=1695151 State=0x1 NodeCnt=1 WEXITSTATUS 0
2017-08-29T10:17:28.442031-04:00 host1 slurmctld 15006 - - email msg to x...@gmail.com: SLURM Job_id=1695151 Name=job1 Ended, Run time 00:30:36, COMPLETED, ExitCode 0
2017-08-29T10:17:28.442308-04:00 host1 slurmctld 15006 - - job_complete: JobID=1695151 State=0x8003 NodeCnt=1 done
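
The check_job_step_time_limit line reads to me like the 30-minute limit was attached to step 0 itself, not to the job. I don't have the user's batch script, so this is only a hypothetical sketch of the kind of script that would get killed exactly like this (srun's --time is in minutes here and sets a step-level limit):

#!/bin/bash
#SBATCH --time=5:00:00   # job-level limit, matches the 05:00:00 sacct shows
srun --time=30 ./app     # hypothetical: a 30-minute step-level limit

If the script didn't pass --time to srun, though, I don't see where the step limit would come from, unless the partition default is somehow being applied to the step. There is also a _slurm_rpc_update_job right after launch; I don't know what that update changed.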


compute node slurmd log

[2017-08-29T09:47:03.398] _run_prolog: prolog with lock for job 1695151 ran for 11 seconds
[2017-08-29T09:47:03.398] Launching batch job 1695151 for UID 1
[2017-08-29T09:47:03.538] [1695151] task/cgroup: /slurm/uid_1/job_1695151: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.538] [1695151] task/cgroup: /slurm/uid_1/job_1695151/step_batch: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup: /slurm/uid_1/job_1695151: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.540] [1695151.4294967295] task/cgroup: /slurm/uid_1/job_1695151/step_extern: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.704] launch task 1695151.0 request from 1.15885@localhost (port 45283)
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup: /slurm/uid_1/job_1695151: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T09:47:03.802] [1695151.0] task/cgroup: /slurm/uid_1/job_1695151/step_0: alloc=3072000MB mem.limit=3034095MB memsw.limit=unlimited
[2017-08-29T10:17:28.188] [1695151.0] error: *** STEP 1695151.0 ON l020 CANCELLED AT 2017-08-29T10:17:28 DUE TO TIME LIMIT ***
[2017-08-29T10:17:28.413] [1695151.0] done with job
[2017-08-29T10:17:28.439] [1695151] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2017-08-29T10:17:28.442] [1695151] done with job
[2017-08-29T10:17:28.477] [1695151.4294967295] done with job
