Andreas made a good suggestion: look at the user's TRESRunMin from sshare in order to answer Jeff's question about the AssocGrpCPUMinutesLimit reason for a job. However, getting at this information is quite involved in practice, and I doubt that any ordinary user will bother to look it up.

Due to this complexity, I have added some new functionality to my "showjob" script available from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs.

The "showjob" tool now tries to extract the information by combining the sshare, squeue, and sacctmgr commands. The job reasons AssocGrpCPUMinutesLimit as well as AssocGrpCpuLimit are treated.

An example output for a job is:

$ showjob  1347368
Job 1347368 of user xxx in account yyy has a jobstate=PENDING with reason=AssocGrpCpuLimit

Information about GrpCpuLimit:
User GrpTRES limit is:     cpu=1600
Current user TRES is:      cpu=1360
This job requires TRES:    cpu=960
...
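
(Reading the numbers above: the user's running jobs already take up cpu=1360, and adding the cpu=960 requested by this job would give 2320, which exceeds the GrpTRES limit of cpu=1600, hence the AssocGrpCpuLimit reason.)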

I think some end users might find this information useful.

Could I ask interested sites to test the "showjob" tool and check whether the logic also works in their environment? Please send me feedback so that I can improve the tool.

Best regards,
Ole


On 09-08-2019 08:00, Henkel, Andreas wrote:
Users may call sshare -l and have a look at the TRESRunMin. There the
number of TRES-minutes allocated by jobs currently running against
the account is listed. With a little math (cpu*timelimit) for the job
in question, the users should be able to figure this out. At least they
wouldn't need the debug level increased or a log file.
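
As a worked example using the numbers from Jeff's log below: the job requests 80 CPUs, and a request of 1440000 TRES-minutes corresponds to a timelimit of 1440000/80 = 18000 minutes (12.5 days). Since only 1436396 TRES-minutes are still available under the 30000000 limit, a timelimit of at most 1436396/80, i.e. roughly 17954 minutes, would have let the job start, which matches the "lower the timelimit a little" fix that Jeff describes.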

Best,

Andreas

On 8/7/19 8:47 PM, Sarlo, Jeffrey S wrote:
We had a job queued waiting for resources and when we changed the debug level, we were able to get the following in the slurmctld.log file.

[2019-08-02T10:03:47.347] debug2: JobId=804633 being held, the job is at or exceeds assoc 50(jeff/(null)/(null)) group max tres(cpu) minutes of 30000000 of which 1436396 are still available but request is for 1440000 (plus 0 already in use) tres minutes (request tres count 80)

We were then able to see that we just needed to lower the timelimit for the job a little.

Is there a way a user can get this same type of information for a job, without having to change the Slurm debug level and then look in a log file?
