Given the extreme amount of output that will be generated for potentially a 
couple hundred job runs, I was hoping that someone would say “Seen it, here’s 
how to fix it.” Guess I’ll have to go with the “high output” route.

Thanks Doug!

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Doug Meyer
Sent: Thursday, January 31, 2019 8:46 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

Perhaps run srun with -vvv to get maximum verbose messages as srun works
through the job.
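For reference, a maximally verbose run might look something like this (the program name and task count are placeholders, not the actual test):

```shell
# -vvv: maximum srun verbosity; capture stderr for later inspection.
# Program name and --ntasks value are hypothetical placeholders.
srun -vvv --ntasks=256 ./my_test 2> srun-debug.log
```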

Doug

On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs 
<andy.ri...@hpe.com> wrote:
Hi All,

Just checking to see if this sounds familiar to anyone.

Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)

We typically run about 100 tests/night, selected from a handful of favorites. 
For roughly 1 in 300 test runs, we see one of two mysterious failures:

1. The 5 minute cancellation

A job will be rolling along, generating its expected output, and then this 
message appears:
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 
***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
sacct reports:
       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED
When these failures occur, they consistently happen just about 5 minutes into 
the run.

2. The random cancellation

As above, a job will be generating the expected output, and then we see
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 
***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
But this time, sacct reports:
       JobID               Start                 End ExitCode      State
------------ ------------------- ------------------- -------- ----------
3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED
I think we've seen these cancellations pop up as soon as a minute or two into 
the test run, and as late as perhaps 20 minutes in.
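For the record, the sacct output above can be regenerated with something along these lines (job ID taken from the example; the --format list matches the columns shown):

```shell
# Show the job and step records with the fields from the tables above.
sacct -j 3531 --format=JobID,Start,End,ExitCode,State
```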

The only thing slightly unusual in our job submissions is that we use srun's 
"--immediate=120" so that the scripts can respond appropriately if a node goes 
down.
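For context, the submission looks roughly like this (script name and other options are placeholders; only --immediate=120 is the real detail):

```shell
# --immediate=120: give up if resources aren't allocated within 120 s,
# so the wrapper script can detect a downed node and react.
srun --immediate=120 --ntasks=256 ./my_test || handle_node_outage
```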

With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in the 
slurmctld or slurmd logs.
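For completeness, those debug levels are set in slurm.conf (values as stated above):

```shell
# slurm.conf fragment
SlurmctldDebug=debug2
SlurmdDebug=debug5
```

The controller's level can also be raised at runtime with "scontrol setdebug" if restarting the daemons is inconvenient.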

Any thoughts on what might be happening, or what I might try next?

Andy



--

Andy Riebs

andy.ri...@hpe.com

Hewlett-Packard Enterprise

High Performance Computing Software Engineering

+1 404 648 9024

My opinions are not necessarily those of HPE

    May the source be with you!
