Folks,

A bit more information on this. I changed the way I was looking at the log, and it turns out I have tons of these messages. I would not be at all surprised if I am seeing one for every job that has been launched since Slurm was started on this system (sometime yesterday). It looks like some kind of bookkeeping error is preventing jobs from recognizing when their last active step goes away. My concern is that old, long-gone jobs are still being accounted for by Slurm, which costs slurmctld a non-zero amount of effort. Presumably this will eventually cause slurmctld to bog down and become slow or unresponsive.
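
For anyone who wants to check how widespread this is in their own logs, here is a minimal sketch that counts these messages per job. It assumes the log lines look exactly like the one quoted in my original message below; the function name and the regular expression are my own and are not part of Slurm.

```python
import re
from collections import Counter

# Matches the debug line format quoted in the original report, e.g.
# [2015-11-12T08:00:14.415] debug:  Job 28447 still has 1 active steps
# (assumption: other Slurm versions may word this message differently)
ACTIVE_STEPS_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+debug:\s+Job (?P<job>\d+) "
    r"still has (?P<steps>\d+) active steps"
)


def count_active_step_messages(lines):
    """Map each job id to the number of 'still has N active steps' lines."""
    counts = Counter()
    for line in lines:
        match = ACTIVE_STEPS_RE.match(line.strip())
        if match:
            counts[int(match.group("job"))] += 1
    return counts
```

Passing an open file handle for the slurmctld log to count_active_step_messages() gives one entry per affected job, so len() of the result is the number of distinct jobs still being reported. Comparing those job ids against squeue output would show which of them are really gone.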

I may or may not have already witnessed that slowdown: the reason Slurm was restarted yesterday is that it was not responding to srun, although, oddly, sinfo and squeue seemed to be working fine.

Eric

On 11/12/15 8:08 AM, Eric Lund wrote:
Folks,

I am seeing an odd behavior and I am wondering if others have seen this
and might be able to explain it.

The behavior is the following message:

     [2015-11-12T08:00:14.415] debug:  Job 28447 still has 1 active steps

showing up in the slurmctld logs long (many minutes) after job 28447 has
completed and is no longer found in the squeue output.  There is no
trace of the original job on either of the nodes it was originally run
on, and those nodes are idle, but the debug message keeps appearing.

I am running a locally modified 15.08.1.  My local modifications should
not have anything to do with this, but I am open to the possibility.
Mostly I am curious whether this is a known behavior, and, if so,
whether there is a workaround or fix for it.

Thanks!

Eric
