Given the extreme amount of output that will be generated for potentially a couple hundred job runs, I was hoping that someone would say “Seen it, here’s how to fix it.” Guess I’ll have to go with the “high output” route.
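In case it helps anyone following along, the per-run wrapper I have in mind looks roughly like this; the script body and log location are placeholders, but -vvv and --immediate=120 reflect what we actually pass:

    #!/bin/bash
    # Hypothetical per-run wrapper: keep srun's maximally verbose
    # messages (they go to stderr) in a per-run file, so that a few
    # hundred runs' worth of output stays manageable.
    LOGDIR=/shared/srun-debug            # placeholder path
    mkdir -p "$LOGDIR"
    run_id=$(date +%Y%m%dT%H%M%S)-$$     # unique-ish id per invocation
    # Note: the test's own stderr lands in the same file as srun's chatter.
    srun -vvv --immediate=120 "$@" 2> "$LOGDIR/srun-$run_id.err"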
Thanks, Doug!

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Doug Meyer
Sent: Thursday, January 31, 2019 8:46 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

Perhaps fire the job from srun with -vvv to get maximally verbose messages as srun works through the job.

Doug

On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs <andy.ri...@hpe.com> wrote:

Hi All,

Just checking to see if this sounds familiar to anyone.

Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)

We typically run about 100 tests/night, selected from a handful of favorites. For roughly 1 in 300 test runs, we see one of two mysterious failures.

1. The 5-minute cancellation

A job will be rolling along, generating its expected output, and then this message appears:

    srun: forcing job termination
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
    srun: error: nodename: task 250: Terminated
    srun: Terminating job step 3531.0

sacct reports:

    JobID        Start               End                 ExitCode State
    ------------ ------------------- ------------------- -------- ----------
    3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9 FAILED

When these failures happen, they consistently happen just about 5 minutes into the run.

2. The random cancellation

As above, a job will be generating the expected output, and then we see:

    srun: forcing job termination
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
    srun: error: nodename: task 250: Terminated
    srun: Terminating job step 3531.0

But this time, sacct reports:

    JobID        Start               End                 ExitCode State
    ------------ ------------------- ------------------- -------- ----------
    3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0 COMPLETED
    3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15 CANCELLED

I think we've seen these cancellations pop up as soon as a minute or two into the test run, and as late as perhaps 20 minutes in.

The only thing slightly unusual in our job submissions is that we use srun's "--immediate=120" so that the scripts can respond appropriately if a node goes down.

With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in the slurmctld or slurmd logs.

Any thoughts on what might be happening, or what I might try next?

Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
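For reference, the "--immediate=120" pattern described above boils down to something like the following sketch; the test command and node count are invented, and only the --immediate option comes from the original message:

    # If resources don't become available within 120 seconds (e.g. a
    # node is down), srun gives up and exits nonzero instead of waiting
    # indefinitely, so the test script can react and move on.
    if ! srun --immediate=120 -N 2 ./some_test; then
        echo "run never started or failed; skipping to the next test" >&2
    fi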