Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-21 Thread Ole Holm Nielsen
Hi Paul, Interesting observation on the execution time and the pipe! How do you ensure that you have enough disk space for the uncompressed database dump? Maybe using /dev/shm? The lbzip2 mentioned in the link below is significantly faster than bzip2. Best regards, Ole On 9/21/22

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-21 Thread Paul Raines
Almost all of the 5+ minutes was spent in the bzip2. The mysqldump by itself took about 16 seconds. So I moved the bzip2 to its own separate line so the tables are only locked for those ~16 seconds. -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:
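
A minimal sketch of the change described above: dump to a file first, then compress in a separate command, so that mysqldump has already exited before the slow bzip2 step starts. The function name backup_slurm_db, the default directory /var/backups/slurm, and the file name slurm_db.sql are assumptions for illustration, not taken from the thread; only the mysqldump options and the database name come from the original messages.

```shell
# Hypothetical helper; function name, directory, and file name are assumptions.
backup_slurm_db() {
    backup_dir="${1:-/var/backups/slurm}"
    dump_file="$backup_dir/slurm_db.sql"

    # Step 1: write the uncompressed dump and let mysqldump finish.
    mysqldump -R --single-transaction -B slurm_db > "$dump_file" || return 1

    # Step 2: compress afterwards as a separate command.
    # -f overwrites an existing .bz2 from a previous run.
    bzip2 -f "$dump_file"
}
```

It could be called as backup_slurm_db /some/backup/dir from a cron job or a logrotate script. Note that the uncompressed dump needs enough free disk space in that directory until bzip2 has finished.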

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-21 Thread Ole Holm Nielsen
Hi Paul, IMHO, using logrotate is the most convenient method for making daily database backup dumps and keeping a number of backup versions; see the notes in https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate Using --single-transaction is recommended by SchedMD to

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-20 Thread Paul Raines
Further investigation found that I had set up logrotate to handle a mysql dump: mysqldump -R --single-transaction -B slurm_db | bzip2 which is what is taking 5 minutes. I think this is locking tables for that whole time, most likely hanging calls to slurmdbd and causing the issue. I will need to

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Chris Samuel
On 19/9/22 05:46, Paul Raines wrote: In slurm.conf I had InactiveLimit=60 which I guess is what is happening but my reading of the docs on this setting was it only affects the starting of a job with srun/salloc and not a job that has been running for days. Is it InactiveLimit that leads to the

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Brian Andrus
Paul, You are likely spot on with the InactiveLimit change. It may also be that the TMOUT environment variable (under bash) is set. Brian Andrus On 9/19/2022 5:46 AM, Paul Raines wrote: I have had two nights where right at 3:35am a bunch of jobs were killed early with TIMEOUT way before their

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Reed Dier
I’m not sure if this might be helpful, but my logrotate.d for slurm looks a bit different: instead of a systemctl reload, I am sending a specific SIGUSR2 signal, which is supposedly for the specific purpose of log rotation in slurm. > postrotate > pkill -x --signal

[slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Paul Raines
I have had two nights where right at 3:35am a bunch of jobs were killed early with TIMEOUT way before their normal TimeLimit. The slurmctld log has lots of lines at 3:35am like [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922 with jobs running