Has anyone been working with the Lustre jobstats feature and Slurm? We
have been, and it's OK. But now that I'm working on systems that run a
lot of array jobs on a fairly recent Slurm version, we've found some ugly
stuff.
Array jobs report their SLURM_JOBID as a variable, and it's unique
for every job. But they also use other IDs that appear only for array jobs.
http://slurm.schedmd.com/job_array.html
However, as far as I can tell, that unique SLURM_JOBID is only truly
exposed in the command line tools via 'scontrol' - which is only valid while
the job is running. If you want to look at older jobs, with sacct for
example, things get troublesome.
Here's what my coworker and I have figured out:
- You submit a (non-array) job that gets jobid 100.
- The next job gets jobid 101.
- Then submit a 10 task array job. That gets jobid 102. The sub tasks
get 9 more job ids. If nothing else is happening with the system, that
means you use jobid 102 to 111.
If things were that orderly, you could cope with using SLURM_JOB_ID in
Lustre jobstats pretty easily. Use sacct and you see job 102_2 - you
know that is jobid 103 in Lustre jobstats.
But, if other jobs get submitted during set up (as of course they do),
they can take jobid 103. So, you've got problems.
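To make the failure concrete, here is a small sketch of the guesswork described above. It is only a toy model: it assumes job IDs are handed out consecutively in arrival order and that array task indices start at 1, as in the example. The function names and job labels are made up for illustration and are not part of Slurm or Lustre.

```python
def naive_jobid(array_job_id, task_index, first_index=1):
    """Guess an array task's job ID, assuming its IDs are consecutive."""
    return array_job_id + (task_index - first_index)


def allocate(arrivals, next_id=100):
    """Toy allocator: hand out consecutive job IDs in arrival order."""
    ids = {}
    for name in arrivals:
        ids[name] = next_id
        next_id += 1
    return ids


# Quiet system: nothing sneaks in between the array tasks.
quiet = allocate(["job_a", "job_b", "array_1", "array_2", "array_3"])

# Busy system: an unrelated job arrives while the array is being set up.
busy = allocate(["job_a", "job_b", "array_1", "other", "array_2", "array_3"])

guess = naive_jobid(quiet["array_1"], 2)   # what we would infer for 102_2
print(guess == quiet["array_2"])           # guess matches on the quiet system
print(guess == busy["array_2"])            # guess is wrong on the busy system
print(guess == busy["other"])              # it points at the unrelated job
```

On the busy system the guess lands on the unrelated job, so jobstats entries would be attributed to the wrong job.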
I think we may try to set a magic variable in the Slurm prolog and use
that for jobid_var, but who knows.
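A rough sketch of what that prolog idea might look like, untested on a real cluster. It assumes the script runs as a Slurm TaskProlog, where lines printed as "export NAME=value" are added to the task's environment, and that the Lustre clients would then be pointed at the new variable through the jobstats variable setting (jobid_var). The variable name LUSTRE_JOBSTATS_ID is hypothetical; SLURM_JOB_ID, SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID are the standard Slurm variables.

```python
import os

# Hypothetical variable name; Lustre would need to be told to read it.
VAR_NAME = "LUSTRE_JOBSTATS_ID"


def jobstats_id(env):
    """Build an ID that matches what sacct shows (e.g. 102_2 for array tasks)."""
    array_job = env.get("SLURM_ARRAY_JOB_ID")
    array_task = env.get("SLURM_ARRAY_TASK_ID")
    if array_job and array_task:
        return f"{array_job}_{array_task}"
    # Non-array jobs: fall back to the plain job ID.
    return env.get("SLURM_JOB_ID", "")


def export_line(env):
    """Format the line a TaskProlog would print to set the variable."""
    return f"export {VAR_NAME}={jobstats_id(env)}"


if __name__ == "__main__":
    print(export_line(os.environ))
```

The point of building the ID from the array job ID and task index is that it can be matched against sacct output directly, without guessing which raw job ID each task received.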
Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org