I have a job whose workload finished yesterday (successfully, no issues, output files are good), but the SLURM job is still accumulating time. I've just suspended it, but I'd like to understand how it's managing to bill so many extra hours.
The other 26 jobs in this batch completed normally. This job's script finished its work on June 7th at 7:53:19:
JobName     State      Start                Elapsed     CPUTime
----------  ---------  -------------------  ----------  -----------
PBMC_5c_0+  COMPLETED  2017-06-06T16:39:51    02:53:56     23:11:28
batch       COMPLETED  2017-06-06T16:39:51    02:53:56     23:11:28
PBMC_6a_0+  COMPLETED  2017-06-06T16:39:51    04:54:06   1-15:12:48
batch       COMPLETED  2017-06-06T16:39:51    04:54:06   1-15:12:48
PBMC_6b_0+  COMPLETED  2017-06-06T16:39:51    03:04:41   1-00:37:28
batch       COMPLETED  2017-06-06T16:39:51    03:04:41   1-00:37:28
PBMC_6c_0+  SUSPENDED  2017-06-06T16:39:51  1-21:12:55  15-01:43:20
That 15 days of CPUTime is just Elapsed times the 8 allocated CPUs, but the Elapsed itself shouldn't be anywhere near 45 hours: the script finished its work after about 15, and the clock kept running.
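For what it's worth, the CPUTime figure is internally consistent with the Elapsed and CPU count shown below; a quick check, assuming sacct computes CPUTime as Elapsed × NumCPUs:

```python
from datetime import timedelta

# Values taken from the sacct/scontrol output in this message
elapsed = timedelta(days=1, hours=21, minutes=12, seconds=55)  # Elapsed for PBMC_6c_0+
ncpus = 8                                                      # NumCPUs=8

cputime = elapsed * ncpus
print(cputime)  # 15 days, 1:43:20 -- matches sacct's 15-01:43:20
```

So the accounting math checks out; the anomaly is the wall clock, not the multiplication.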
sstat has nothing to say, and scontrol shows me nothing out of the ordinary:
[root@udc-ba34-37:~] scontrol show jobid -dd 665155
JobId=665155 JobName=PBMC_6c_020917_ATACseq.py
JobState=SUSPENDED Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=1-21:12:39 TimeLimit=3-00:00:00 TimeMin=N/A
SubmitTime=2017-06-06T16:39:49 EligibleTime=2017-06-06T16:39:49
StartTime=2017-06-06T16:39:51 EndTime=2017-06-09T16:39:51
PreemptTime=None SuspendTime=2017-06-08T13:52:30 SecsPreSuspend=162759
Partition=serial AllocNode:Sid=udc-ba34-37:199211
ReqNodeList=(null) ExcNodeList=(null)
NodeList=udc-ba33-28c
BatchHost=udc-ba33-28c
NumNodes=1 NumCPUs=8 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
Nodes=udc-ba33-28c CPU_IDs=2-9 Mem=32000
MinCPUsNode=8 MinMemoryNode=32000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.sub
WorkDir=/sfs/lustre/allocations/shefflab/processed/cphg_atac/results_pipeline
StdErr=/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.log
StdIn=/dev/null
StdOut=/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.log
BatchScript=
#!/bin/bash
#SBATCH --job-name='PBMC_6c_020917_ATACseq.py'
#SBATCH --output='/sfs/lustre/allocations/shefflab/processed/cphg_atac/submission/PBMC_6c_020917_ATACseq.log'
#SBATCH --mem='32000'
#SBATCH --cpus-per-task='8'
#SBATCH --time='3-00:00:00'
#SBATCH --partition='serial'
#SBATCH -m block
#SBATCH --ntasks=1
echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`
/home/ns5bc/code/ATACseq/pipelines/ATACseq.py --input2
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L001_R2_001.fastq.gz
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L002_R2_001.fastq.gz
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L003_R2_001.fastq.gz
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L004_R2_001.fastq.gz
--genome hg38 --single-or-paired paired --sample-name PBMC_6c_020917 --input
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L001_R1_001.fastq.gz
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L002_R1_001.fastq.gz
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L003_R1_001.fastq.gz
/sfs/lustre/allocations/shefflab/data/gsl/PBMC-6c-020917_S1_L004_R1_001.fastq.gz
--prealignments rCRSd --genome-size hs -D --frip-ref-peaks
/home/ns5bc/code/cphg_atac/metadata/CD4_hotSpot_liftedhg19tohg38.bed -O
/sfs/lustre/allocations/shefflab/processed/cphg_atac/results_pipeline -P 8 -M
32000
Thanks!
———————
Alden Stradling
Research Computing Infrastructure
University of Virginia
[email protected] <mailto:[email protected]>
