" Looking back through the mailing list, it seems that from 2015 onwards the
recommendation from Danny was to use 'jobacct_gather/linux' instead of
'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
the cgroup version."
Ahh, hmm I need to dig up that recommendation as I didn't see that myself.
We'll look into this.
Thanks Paddy!
Best,
Chris
—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
On 1/8/19, 8:04 AM, "slurm-users on behalf of Paddy Doyle"
<[email protected] on behalf of [email protected]> wrote:
A small addition: I forgot to mention our JobAcct params:
JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/cgroup
I've done a small bit of playing around on a test cluster. Changing to
'JobAcctGatherFrequency=0' (i.e. only gather accounting data at job end) then
seems to give correct values for the job via sacct/seff.
Alternatively, setting the following also works:
JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/linux
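For reference, here's a sketch of how the two workarounds above would look in
slurm.conf (assuming everything else in your config stays as-is; pick one
option, not both):

```
# Option 1: keep the cgroup plugin, but only gather accounting at job end.
#JobAcctGatherFrequency=0
#JobAcctGatherType=jobacct_gather/cgroup

# Option 2: keep 30-second sampling, switch to the linux plugin.
JobAcctGatherFrequency=task=30
JobAcctGatherType=jobacct_gather/linux
```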
Looking back through the mailing list, it seems that from 2015 onwards the
recommendation from Danny was to use 'jobacct_gather/linux' instead of
'jobacct_gather/cgroup'. I didn't pick up on that properly, so we kept with
the cgroup version.
Is anyone else still using jobacct_gather/cgroup and are you seeing this
same issue?
Just to note: there's a big warning in the man page not to adjust the
value of JobAcctGatherType while there are any running job steps. I'm not
sure if that means just on that node, or across the whole cluster. Probably
safest to schedule a downtime to change it.
Paddy
On Fri, Jan 04, 2019 at 10:43:54PM +0000, Christopher Benjamin Coffey wrote:
> Actually we double checked and are seeing it in normal jobs too.
>
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
> On 1/4/19, 9:24 AM, "slurm-users on behalf of Paddy Doyle"
> <[email protected] on behalf of [email protected]> wrote:
>
> Hi Chris,
>
> We're seeing it on 18.08.3, so I was hoping that it was fixed in 18.08.4
> (recently upgraded from 17.02 to 18.08.3). Note that we're seeing it in
> regular jobs (haven't tested job arrays).
>
> I think it's cgroups-related; there's a similar bug here:
>
>
> https://bugs.schedmd.com/show_bug.cgi?id=6095
>
> I was hoping that this note in the 18.08.4 NEWS might have been related:
>
> -- Fix jobacct_gather/cgroup to work correctly when more than one task is
>    started on a node.
>
> Thanks,
> Paddy
>
> On Fri, Jan 04, 2019 at 03:19:18PM +0000, Christopher Benjamin Coffey wrote:
>
> > I'm surprised no one else is seeing this issue. If you have 18.08, can you
> > take a moment and run jobeff on a job in one of your users' job arrays? I'm
> > guessing jobeff will show the same issue we are seeing: UserCPU is
> > incorrect, off by roughly two orders of magnitude.
> >
> > Best,
> > Chris
> >
> > —
> > Christopher Coffey
> > High-Performance Computing
> > Northern Arizona University
> > 928-523-1167
> >
> >
> > On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey"
> > <[email protected]> wrote:
> >
> > So this issue is occurring only with job arrays.
> >
> > —
> > Christopher Coffey
> > High-Performance Computing
> > Northern Arizona University
> > 928-523-1167
> >
> >
> > On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson"
> > <[email protected] on behalf of [email protected]> wrote:
> >
> > Hi folks,
> >
> >
> > Calling sacct with the usercpu flag enabled seems to report CPU times far
> > above expected values for job array indices. This is also reported by seff.
> > For example, executing the following job script:
> > ________________________________________________________
> >
> >
> > #!/bin/bash
> > #SBATCH --job-name=array_test
> > #SBATCH --workdir=/scratch/cbn35/bigdata
> > #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
> > #SBATCH --time=20:00
> > #SBATCH --array=1-5
> > #SBATCH -c2
> >
> >
> > srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s
> >
> >
> >
> > ________________________________________________________
> >
> >
> > ...results in the following stats:
> > ________________________________________________________
> >
> >
> >
> > JobID ReqCPUS UserCPU Timelimit Elapsed
> > ------------ -------- ---------- ---------- ----------
> > 15730924_5 2 02:30:14 00:20:00 00:01:08
> > 15730924_5.+ 2 00:00.004 00:01:08
> > 15730924_5.+ 2 00:00:00 00:01:09
> > 15730924_5.0 2 02:30:14 00:01:05
> > 15730924_1 2 02:30:48 00:20:00 00:01:08
> > 15730924_1.+ 2 00:00.013 00:01:08
> > 15730924_1.+ 2 00:00:00 00:01:09
> > 15730924_1.0 2 02:30:48 00:01:05
> > 15730924_2 2 02:15:52 00:20:00 00:01:07
> > 15730924_2.+ 2 00:00.007 00:01:07
> > 15730924_2.+ 2 00:00:00 00:01:07
> > 15730924_2.0 2 02:15:52 00:01:06
> > 15730924_3 2 02:30:20 00:20:00 00:01:08
> > 15730924_3.+ 2 00:00.010 00:01:08
> > 15730924_3.+ 2 00:00:00 00:01:09
> > 15730924_3.0 2 02:30:20 00:01:05
> > 15730924_4 2 02:30:26 00:20:00 00:01:08
> > 15730924_4.+ 2 00:00.006 00:01:08
> > 15730924_4.+ 2 00:00:00 00:01:09
> > 15730924_4.0 2 02:30:25 00:01:05
> >
> >
> >
> > ________________________________________________________
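As a rough sanity check on those numbers (a small sketch, independent of Slurm
itself): with ReqCPUS=2 and an elapsed time of about 68 s, UserCPU should be
bounded by roughly 2 × 68 s, yet sacct reports 02:30:14 for step 15730924_5.0.

```python
# Sanity-check sketch: compare sacct's reported UserCPU for 15730924_5
# against the theoretical maximum (ReqCPUS * Elapsed). The times are
# copied from the sacct output above.

def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

reported_usercpu = hms_to_seconds("02:30:14")  # 9014 s
max_possible = 2 * hms_to_seconds("00:01:08")  # 2 CPUs * 68 s = 136 s

inflation = reported_usercpu / max_possible
print(f"reported: {reported_usercpu} s, max possible: {max_possible} s, "
      f"inflated ~{inflation:.0f}x")  # roughly 66x too high
```

So the reported figure is about 66 times larger than is physically possible for
the step, which points to a sampling/aggregation bug rather than measurement
noise.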
> >
> >
> > This is also reported by seff, with several errors to boot:
> > ________________________________________________________
> >
> >
> >
> > Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff
> > line 130, <DATA> line 624.
> > Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff
> > line 130, <DATA> line 624.
> > Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff
> > line 130, <DATA> line 624.
> > Job ID: 15730924
> > Array Job ID: 15730924_5
> > Cluster: monsoon
> > User/Group: cbn35/clusterstu
> > State: COMPLETED (exit code 0)
> > Nodes: 1
> > Cores per node: 2
> > CPU Utilized: 03:19:15
> > CPU Efficiency: 8790.44% of 00:02:16 core-walltime
> > Job Wall-clock time: 00:01:08
> > Memory Utilized: 0.00 MB (estimated maximum)
> > Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)
> >
> >
> >
> > ________________________________________________________
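The seff percentage is at least internally consistent with its own inputs (a
quick check, using the values printed above):

```python
# Consistency check of seff's arithmetic: CPU Efficiency should equal
# CPU Utilized / core-walltime. Values are copied from the seff output above.

def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

cpu_utilized = hms_to_seconds("03:19:15")   # 11955 s
core_walltime = hms_to_seconds("00:02:16")  # 2 cores * 68 s = 136 s

efficiency = 100 * cpu_utilized / core_walltime
print(f"{efficiency:.2f}%")  # matches seff's reported 8790.44%
```

In other words, seff is faithfully reporting whatever CPU time the accounting
layer handed it; the inflated figure comes from the gather plugin, not from
seff itself.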
> >
> >
> >
> >
> >
> > As far as I can tell, a two-core job with an elapsed time of around one
> > minute shouldn't have a CPU time of two and a half hours. Could this be a
> > configuration issue, or is it a possible bug?
> >
> >
> > More info is available on request, and any help is appreciated!
>
> --
> Paddy Doyle
> Trinity Centre for High Performance Computing,
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> Phone: +353-1-896-3725
>
> http://www.tchpc.tcd.ie/
>
>
>
--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/