Scott Yockel <syoc...@g.harvard.edu> writes:

> Slurm-Dev,
>
> Is there anything in the works to add the capacity of TotalCPU to also
> track the child process user and system time?  I see that currently
> TotalCPU is defined: "provides a measure of the task's parent process
> and does not include CPU time of child processes."

In my experience, that description might not be accurate.  Child
processes also seem to be included, as long as the job doesn't time
out.  Here is an email I wrote about it last year:

From: Bjørn-Helge Mevik <b.h.me...@usit.uio.no>
Subject: [slurm-dev] UserCPU etc. for subprocesses not registered when a job times out.
To: slurm-dev <slurm-dev@schedmd.com>
Date: Fri, 12 Sep 2014 06:25:16 -0700
Reply-To: slurm-dev <slurm-dev@schedmd.com>

We would like to use the UserCPU, SystemCPU and TotalCPU values from
sacct to assess the efficiency of jobs.

When a job exits normally, these values are reported for the batch
script step and include the time spent by sub-processes.  However, if
the job times out, these values only include the CPU time spent by the
batch script process itself, not its sub-processes.  See below for an
illustration.

Is this intended behaviour?  If so, is there any other way to gather CPU
times from jobs, even when they don't exit normally?

Illustration:

407 (1) $ cat shell-loop.sh
#!/bin/bash

echo Starting loop
## Loop that only uses shell builtins:
while true; do echo -n ; done
echo This is the end...

408 (1) $ cat timeout-in-subprocess.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-subprocess-%j.out

## Execute shell-loop.sh in subprocess:
./shell-loop.sh

409 (1) $ cat timeout-in-shell.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-shell-%j.out

## Run shell-loop.sh in this shell:
source shell-loop.sh

410 (1) $ sbatch timeout-in-subprocess.sm
Submitted batch job 40
411 (1) $ sbatch timeout-in-shell.sm
Submitted batch job 41

[... after a couple of minutes ...]

412 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 40,41
       JobID      State    Elapsed    UserCPU  SystemCPU   TotalCPU
------------ ---------- ---------- ---------- ---------- ----------
40              TIMEOUT   00:02:27  00:00.001  00:00.001  00:00.003
40.batch      CANCELLED   00:02:27  00:00.001  00:00.001  00:00.003
41              TIMEOUT   00:02:24  02:23.415   00:00:00  02:23.416
41.batch      CANCELLED   00:02:24  02:23.415   00:00:00  02:23.416

i.e., time spent in subprocesses is not reported.

Doing the same thing, but now with loops that terminate so the jobs
don't time out, we get:

416 (1) $ sbatch work-in-subprocess.sm
Submitted batch job 42
417 (1) $ sbatch work-in-shell.sm
Submitted batch job 43

[... after a couple of minutes ...]

418 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 42,43
       JobID      State    Elapsed    UserCPU  SystemCPU   TotalCPU
------------ ---------- ---------- ---------- ---------- ----------
42            COMPLETED   00:01:07  01:03.980  00:02.207  01:06.187
42.batch      COMPLETED   00:01:07  01:03.980  00:02.207  01:06.187
43            COMPLETED   00:01:08  01:05.230  00:02.173  01:07.403
43.batch      COMPLETED   00:01:08  01:05.230  00:02.173  01:07.403

i.e., time spent in subprocesses is reported.

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

--
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
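
The stated goal above is to use the TotalCPU value from sacct to assess
job efficiency.  A minimal sketch of that calculation follows.  It
assumes that efficiency means TotalCPU divided by (Elapsed times the
number of allocated CPUs), and that each of the jobs above ran on one
CPU, which the thread does not state.  The helper names to_seconds and
cpu_efficiency are hypothetical and not part of Slurm.

```shell
#!/bin/bash

# Convert a Slurm time string to seconds.
# Handles the forms seen in the sacct output above (MM:SS.mmm and
# HH:MM:SS) and an optional leading day count (D-HH:MM:SS).
to_seconds() {
    LC_ALL=C awk -v t="$1" 'BEGIN {
        days = 0
        # Split off an optional "D-" day prefix.
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, f, ":")
        secs = 0
        # Fold the fields left to right; each colon is a factor of 60.
        for (i = 1; i <= n; i++) secs = secs * 60 + f[i]
        printf "%.3f\n", days * 86400 + secs
    }'
}

# Percentage of the allocated CPU time that was used.
# Arguments: TotalCPU, Elapsed, number of allocated CPUs
# (assumed to be 1 when omitted).
cpu_efficiency() {
    local used wall ncpus=${3:-1}
    used=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    LC_ALL=C awk -v u="$used" -v w="$wall" -v n="$ncpus" 'BEGIN {
        if (w * n == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * u / (w * n)
    }'
}

# TotalCPU and Elapsed copied from the sacct output above.
cpu_efficiency 00:00.003 00:02:27   # job 40, timed out: prints 0.0
cpu_efficiency 01:06.187 00:01:07   # job 42, completed: prints 98.8
```

With these figures, job 40 looks idle although its subprocess was busy
for the whole run, so an efficiency report built this way would be
misleading for any job that times out.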