Folks,
A bit more information on this. I changed the way I was looking at the
log, and I have tons of these messages. I would not be at all surprised
if I am seeing one for every job that has been launched since Slurm
started (yesterday sometime) on this system. It seems like there might
be some kind of bookkeeping error preventing jobs from recognizing when
the last active step goes away. My concern is that old, long-gone jobs
are still being tracked by Slurm, and that this costs slurmctld a
non-zero amount of work. Eventually this will presumably cause
slurmctld to bog down and become slow or unresponsive.
I may or may not have already witnessed that slowdown: the reason
Slurm was restarted yesterday is that it was not responding to srun,
although, oddly, sinfo and squeue seemed to be working fine.
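
For anyone who wants to check their own logs, here is a rough sketch of
how the messages could be tallied per job. It assumes the log line format
is exactly the one shown in my original message quoted below; the function
names and the regular expression are my own and not part of Slurm, so
adjust them if your lines look different.

```python
import re
from collections import Counter

# Pattern guessed from the single sample line in this thread
# (an assumption, not an official slurmctld log format definition).
ACTIVE_STEPS_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+debug:\s+Job (?P<job_id>\d+) "
    r"still has (?P<steps>\d+) active steps"
)


def count_active_step_messages(lines):
    """Return a Counter mapping job id -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = ACTIVE_STEPS_RE.match(line.strip())
        if match:
            counts[int(match.group("job_id"))] += 1
    return counts


def count_in_file(path):
    """Convenience wrapper: tally matching lines in a log file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_active_step_messages(log)
```

Calling count_in_file() on the slurmctld log and comparing the resulting
job ids against what squeue reports should show whether every completed
job is affected or only some of them.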
Eric
On 11/12/15 8:08 AM, Eric Lund wrote:
Folks,
I am seeing an odd behavior and I am wondering if others have seen this
and might be able to explain it.
The behavior is the following message:
[2015-11-12T08:00:14.415] debug: Job 28447 still has 1 active steps
showing up in the slurmctld logs long (many minutes) after job 28447 has
completed and is no longer found in the squeue output. There is no
trace of the original job on either of the nodes it was originally run
on, and those nodes are idle, but the debug message keeps appearing.
I am running a locally modified Slurm 15.08.1. My local modifications
should not have anything to do with this, but I am open to the
possibility that they do.
Mostly I am curious whether this is a known behavior, and, if so,
whether there is a workaround or fix for it.
Thanks!
Eric