Hi Paul,

Interesting observation about the execution time and the pipe! How do you ensure that you have enough disk space for the uncompressed database dump? Maybe by using /dev/shm?
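
For example, one could check the available tmpfs capacity (typically half of RAM by default) before dumping there:

    df -h /dev/shm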

The lbzip2 mentioned in the link below is significantly faster than bzip2.
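
As a rough sketch, lbzip2 should work as a drop-in replacement in such a pipeline (it compresses in parallel across all cores), e.g.:

    mysqldump -R --single-transaction -B slurm_db | lbzip2 > slurm_db.sql.bz2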

Best regards,
Ole

On 9/21/22 14:38, Paul Raines wrote:

Almost all of the 5 min+ time was in the bzip2.  The mysqldump by itself was about 16 seconds.  So I moved the bzip2 to its own separate step so that
the tables are only locked for those ~16 seconds.
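
A minimal sketch of that split (the output path is illustrative):

    # Table locks are only held while mysqldump itself runs (~16 s here)
    mysqldump -R --single-transaction -B slurm_db > /backup/slurm_db.sql
    # Compression runs afterwards, with no locks held
    bzip2 /backup/slurm_db.sql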

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote:

Hi Paul,

IMHO, using logrotate is the most convenient method for making daily database backup dumps and keeping a number of backup versions; see the notes in https://wiki.fysik.dtu.dk/niflheim/Slurm_database#backup-script-with-logrotate
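
A minimal sketch of the idea (path, file name, and rotation count are illustrative; the wiki page has the complete recipe):

    # /etc/logrotate.d/slurm_db_backup -- bootstrap once with:
    #   mkdir -p /root/mysql_backup && touch /root/mysql_backup/slurm_db.sql.bz2
    /root/mysql_backup/slurm_db.sql.bz2 {
        daily
        rotate 14
        nocompress
        missingok
        postrotate
            mysqldump -R --single-transaction -B slurm_db | bzip2 > /root/mysql_backup/slurm_db.sql.bz2
        endscript
    }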

Using --single-transaction is recommended by SchedMD to avoid race conditions when slurmdbd is running while the MySQL dump is being taken, see
https://bugs.schedmd.com/show_bug.cgi?id=10295#c18

/Ole

On 9/20/22 15:17, Paul Raines wrote:

 Further investigation found that I had set up logrotate to handle a mysql dump

    mysqldump -R --single-transaction -B slurm_db | bzip2

 which is what is taking 5 minutes.  I think this is most likely locking
 tables during that time, hanging calls to slurmdbd, and causing the issue.
 I will need to rework it.

 -- Paul Raines (http://help.nmr.mgh.harvard.edu)



 On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:

 I’m not sure if this might be helpful, but my logrotate.d for slurm looks
 a bit different: instead of a systemctl reload, I am sending the specific
 SIGUSR2 signal, which is supposedly intended specifically for log
 rotation in slurm.

     postrotate
             # SIGUSR2 makes the Slurm daemons re-open their log files
             pkill -x --signal SIGUSR2 slurmctld
             pkill -x --signal SIGUSR2 slurmd
             pkill -x --signal SIGUSR2 slurmdbd
             exit 0
     endscript

 I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ

 Reed

 On Sep 19, 2022, at 7:46 AM, Paul Raines <rai...@nmr.mgh.harvard.edu>
 wrote:


 I have had two nights where right at 3:35am a bunch of jobs were
 killed early with TIMEOUT, way before their normal TimeLimit.
 The slurmctld log has lots of lines like the following at 3:35am:
 [2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached
 for JobId=1636922

 with jobs running on several different nodes.

 The one curious thing is that right around this time, log rotation is
 happening via cron on the slurmctld master node:

 Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting
 logrotate
 Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished
 logrotate

 The 5 minute runtime here is a big anomaly.  On other machines, like
 nodes just running slurmd or my web servers, this only takes a couple of
 seconds.

 In /etc/logrotate.d/slurmctl I have

   postrotate
     systemctl reload slurmdbd >/dev/null 2>/dev/null || true
     /bin/sleep 1
     systemctl reload slurmctld >/dev/null 2>/dev/null || true
   endscript

 Does it make sense that this could be causing the issue?

 In slurm.conf I had InactiveLimit=60, which I guess is what is happening,
 but my reading of the docs on this setting was that it only affects the
 starting of a job with srun/salloc, not a job that has been running
 for days.  Is it InactiveLimit that leads to the "inactivity time limit
 reached" message?

 Anyway, I have changed it to InactiveLimit=600 to see if that helps.
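
 For reference, the relevant slurm.conf line (the value is in seconds; per
 the slurm.conf docs, 0 disables the inactivity check):

    InactiveLimit=600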
