I have been writing my own 'jobinfo' tool so users can see useful,
readable info on a job in any state.  I am still new to Slurm and
trying to wrap my head around the database info and the effects
of arrays and such.

A completed job output looks like this:

# jobinfo 357300
--------------------------------------------------
          JobID : 356847_361          | 356847_361.batch
        JobName : batch_compile_loraks.sh
           User : gr879
        Account : syhdiff
      Partition : basic
        ReqTRES : billing=1,cpu=1,mem=40G,node=1
      AllocTRES : billing=1,cpu=1,mem=40G,node=1
       NodeList : r440-19
         Submit : 2021-06-13T22:07:07
          Start : 2021-06-14T01:47:55 | 2021-06-14T01:47:55
            End : 2021-06-14T05:22:00 | 2021-06-14T05:22:00
      Timelimit : 2-00:00:00
        Elapsed : 03:34:05            | 03:34:05
        CPUTime : 03:34:05            | 03:34:05
      SystemCPU : 05:57.056           | 05:57.056
        UserCPU : 03:27:07            | 03:27:07
       TotalCPU : 03:33:04            | 03:33:04
    MaxDiskRead :                     | 109.25M
   MaxDiskWrite :                     | 1.08M
         MaxRSS :                     | 32529204K
      MaxVMSize :                     | 61834112K
          State : COMPLETED           | COMPLETED
       ExitCode : 0:0                 | 0:0
        WorkDir : /autofs/homes/002/gr879/matlab/ex_vivo/batch_code

and a typical RUNNING job looks like this:

# jobinfo 357304
--------------------------------------------------
          JobID : 357199_21           | 357199_21.batch
        JobName : batch_compile_multi_shell.sh
           User : gr879
        Account : syhdiff
      Partition : basic
        ReqTRES : billing=1,cpu=1,mem=12G,node=1
      AllocTRES : billing=1,cpu=1,mem=12G,node=1
       NodeList : r440-17
         Submit : 2021-06-14T00:31:11
          Start : 2021-06-14T01:47:55 | 2021-06-14T01:47:55
            End : Unknown             | Unknown
      Timelimit : 1-00:00:00
        Elapsed : 12:04:35            | 12:04:35
        CPUTime : 12:04:35            | 12:04:35
      SystemCPU : 00:00:00            | 00:00:00
        UserCPU : 00:00:00            | 00:00:00
       TotalCPU : 00:00:00            | 12:01:46
    MaxDiskRead :                     | 101176763
   MaxDiskWrite :                     | 1259187
         MaxRSS :                     | 5455M
      MaxVMSize :                     | 10823600K
          State : RUNNING             | RUNNING
       ExitCode : 0:0                 | 0:0
        WorkDir : /autofs/homes/002/gr879/matlab/ex_vivo/batch_code

Unfortunately I have to show zeros for certain info I cannot get
yet.  My current issue is with the TotalCPU row on running jobs.
I actually take that value from AveCPU as reported by sstat, and
in the case above it looks right.  But in others it is way off:

# jobinfo 357305
--------------------------------------------------
          JobID : 357305              | 357305.batch
        JobName : sjob_185
           User : mjk2
        Account : circgp
      Partition : basic
        ReqTRES : billing=27,cpu=20,mem=370G,node=1
      AllocTRES : billing=27,cpu=20,mem=370G,node=1
       NodeList : r440-05
         Submit : 2021-06-14T01:44:56
          Start : 2021-06-14T05:02:10 | 2021-06-14T05:02:10
            End : Unknown             | Unknown
      Timelimit : 7-00:00:00
        Elapsed : 08:50:17            | 08:50:17
        CPUTime : 7-08:45:40          | 7-08:45:40
      SystemCPU : 00:00:00            | 00:00:00
        UserCPU : 00:00:00            | 00:00:00
       TotalCPU : 00:00:00            | 11:33.000
    MaxDiskRead :                     | 79699046
   MaxDiskWrite :                     | 17983
         MaxRSS :                     | 81357340K
      MaxVMSize :                     | 104992372K
          State : RUNNING             | RUNNING
       ExitCode : 0:0                 | 0:0
        WorkDir : /autofs/homes/002/mjk2

In this job the user asked for 20 cores, but I can see his job is
only using one core on the actual node, so this is a big waste.
But that core is constantly at 100%, so I would expect AveCPU to
be close to the Elapsed time, but it is way less (11 minutes
instead of nearly 9 hours).
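
To put numbers on that gap, here is a minimal Python sketch that converts
the duration strings from the output above into seconds and compares them.
The helper name slurm_duration_to_seconds is made up for illustration, and
it only assumes the formats that appear in this post (D-HH:MM:SS, HH:MM:SS
and MM:SS.mmm); other Slurm time formats are not handled.

```python
def slurm_duration_to_seconds(text):
    """Convert a duration such as '7-08:45:40' or '11:33.000' to seconds.

    Assumption: only the formats seen in this thread are supported
    (D-HH:MM:SS, HH:MM:SS, MM:SS.mmm).
    """
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [float(f) for f in text.split(":")]
    # Pad missing leading fields so MM:SS.mmm becomes [0, MM, SS.mmm].
    while len(fields) < 3:
        fields.insert(0, 0.0)
    hours, minutes, seconds = fields
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# Values copied from the jobinfo output for job 357305.
elapsed = slurm_duration_to_seconds("08:50:17")
cpu_time = slurm_duration_to_seconds("7-08:45:40")
ave_cpu = slurm_duration_to_seconds("11:33.000")
alloc_cpus = 20

# CPUTime is Elapsed multiplied by the allocated CPUs.
print("Elapsed x CPUs matches CPUTime:", elapsed * alloc_cpus == cpu_time)
# One core busy the whole time should give a ratio near 1.0 here.
print(f"AveCPU / Elapsed = {ave_cpu / elapsed:.3f}")
```

The first line printed confirms that CPUTime is just Elapsed times the 20
allocated CPUs, so the second ratio is the number that looks wrong for a
job keeping one core fully busy.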

# /usr/bin/sstat -p -a --job=357305 --format=JobID,AveCPU
JobID|AveCPU|
357305.extern|213503982334-14:25:51|
357305.batch|11:33.000|

Any idea why this is?  Also, what is that crazy number for
AveCPU on 357305.extern?
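
Until the cause is known, a tool like this could at least refuse to
display values that cannot be right.  The Python sketch below parses the
pipe-delimited text that "sstat -p" printed above and flags any step whose
AveCPU exceeds Elapsed times the allocated CPUs.  The function names and
the plausibility rule are assumptions made for illustration, not anything
Slurm provides, and the sketch does not explain where the .extern value
comes from.

```python
def to_seconds(text):
    """Seconds in a duration like '11:33.000' or '7-08:45:40'."""
    days, _, clock = text.rpartition("-")
    seconds = 0.0
    for field in clock.split(":"):
        seconds = seconds * 60 + float(field)
    return int(days or 0) * 86400 + seconds


def parse_sstat(output):
    """Turn 'sstat -p' text into a list of dicts keyed by the header row."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    # With -p every line ends in '|', so the last split field is empty.
    header = lines[0].split("|")[:-1]
    return [dict(zip(header, ln.split("|")[:-1])) for ln in lines[1:]]


def plausible_ave_cpu(row, elapsed_seconds, alloc_cpus):
    """Hypothetical check: AveCPU cannot exceed Elapsed x allocated CPUs."""
    return to_seconds(row["AveCPU"]) <= elapsed_seconds * alloc_cpus


# Output copied from the sstat command shown above.
SAMPLE = """JobID|AveCPU|
357305.extern|213503982334-14:25:51|
357305.batch|11:33.000|
"""

for row in parse_sstat(SAMPLE):
    # 31817 s is the Elapsed value 08:50:17 of job 357305.
    ok = plausible_ave_cpu(row, elapsed_seconds=31817, alloc_cpus=20)
    print(row["JobID"], row["AveCPU"], "ok" if ok else "implausible")
```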

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Mon, 14 Jun 2021 2:45am, Ole Holm Nielsen wrote:

On 6/14/21 8:26 AM, Gestió Servidors wrote:
 How can I get all information about a finished job in the same way as
 “scontrol show jobid=” when job is pending or running?

Some minutes after job completion, you can only get the information which is stored in the Slurm database.

My script "showjob" in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs shows all available information for jobs in the queue as well as in the database.

/Ole


