Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Chris Samuel

On 19/9/22 05:46, Paul Raines wrote:


> In slurm.conf I had InactiveLimit=60 which I guess is what is happening
> but my reading of the docs on this setting was it only affects the
> starting of a job with srun/salloc and not a job that has been running
> for days.  Is it InactiveLimit that leads to the "inactivity time limit
> reached" message?


I believe so, but remember that this governs timeouts around 
communications between slurmctld and the srun/salloc commands, and not 
things like shell inactivity timeouts which are quite different.


See:

https://slurm.schedmd.com/faq.html#purge

# A job is considered inactive if it has no active job steps or
# if the srun command creating the job is not responding.
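
To confirm which value is actually configured, one option is to read it straight from the configuration file. This is only a minimal sketch: the helper name inactive_limit and the path /etc/slurm/slurm.conf are assumptions, not something from this thread. On a live cluster, `scontrol show config` reports the value the running slurmctld is using.

```shell
#!/bin/sh
# Hypothetical helper: print the InactiveLimit value (in seconds) found in a
# slurm.conf-style file. Prints 0 when the parameter is absent; per the
# slurm.conf man page, zero means no inactivity limit.
inactive_limit() {
    conf_file="$1"
    # Match "InactiveLimit=<digits>" case-insensitively, allowing leading
    # whitespace. If the parameter appears more than once, this sketch
    # reports the last occurrence (an arbitrary choice).
    value=$(grep -i '^[[:space:]]*InactiveLimit[[:space:]]*=' "$conf_file" \
        | tail -n 1 \
        | sed 's/^[^=]*=[[:space:]]*\([0-9]*\).*/\1/')
    echo "${value:-0}"
}

# Example (the path is an assumption; adjust for your site):
# inactive_limit /etc/slurm/slurm.conf
```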

Hope this helps!

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] admin users without a database

2022-09-19 Thread Chris Samuel
On 19/9/22 06:14, Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) 
wrote:


> Is it possible to make a user an admin without slurmdbd? The docs I've
> found indicate that I need to set the user's admin level with sacctmgr,
> but that command always says


I don't believe so; as far as I know that's all stored in slurmdbd (and
sacctmgr is a command to communicate with slurmdbd).


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Brian Andrus

Paul,

You are likely spot on with the InactiveLimit change. It may also be that
the TMOUT environment variable (under bash) is set.
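
A quick way to rule TMOUT in or out is to check it in the shell that launched the job. In bash, a positive TMOUT makes an interactive shell exit after that many seconds without input. A minimal sketch (the function name tmout_status is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: report whether the TMOUT idle timeout is set to a
# non-empty value in the current shell.
tmout_status() {
    if [ -n "${TMOUT:-}" ]; then
        echo "TMOUT=${TMOUT}"
    else
        # Covers both "unset" and "set but empty".
        echo "TMOUT is not set"
    fi
}

tmout_status
```

If it is set, a profile script such as one under /etc/profile.d is a likely place to look, though where it comes from depends on the site.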


Brian Andrus

On 9/19/2022 5:46 AM, Paul Raines wrote:


I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before their normal TimeLimit.
The slurmctld log has lots of lines like this at 3:35am:

[2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922


with jobs running on several different nodes.

The one curious thing is that right about this time log rotation is
happening in cron on the slurmctld master node:

Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate


The 5 minute runtime here is a big anomaly.  On other machines, like
nodes just running slurmd or my web servers, this only takes a couple
of seconds.


In /etc/logrotate.d/slurmctl I have

    postrotate
        systemctl reload slurmdbd >/dev/null 2>/dev/null || true
        /bin/sleep 1
        systemctl reload slurmctld >/dev/null 2>/dev/null || true
    endscript

Does it make sense that this could be causing the issue?

In slurm.conf I had InactiveLimit=60 which I guess is what is happening
but my reading of the docs on this setting was it only affects the
starting of a job with srun/salloc and not a job that has been running
for days.  Is it InactiveLimit that leads to the "inactivity time 
limit reached" message?


Anyway, I have changed InactiveLimit=600 to see if that helps.


---
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129    USA



The information in this e-mail is intended only for the person to whom 
it is addressed.  If you believe this e-mail was sent to you in error 
and the e-mail contains patient information, please contact the Mass 
General Brigham Compliance HelpLine at
https://www.massgeneralbrigham.org/complianceline .
Please note that this e-mail is not secure (encrypted).  If you do not 
wish to continue communication over unencrypted e-mail, please notify 
the sender of this message immediately.  Continuing to send or respond 
to e-mail after receiving this message means you understand and accept 
this risk and wish to continue to communicate over unencrypted e-mail.






Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Reed Dier
I’m not sure if this might be helpful, but my logrotate.d for slurm looks a bit
different: instead of a systemctl reload, I am sending a SIGUSR2 signal, which
is supposedly intended specifically for log rotation in slurm.

> postrotate
> pkill -x --signal SIGUSR2 slurmctld
> pkill -x --signal SIGUSR2 slurmd
> pkill -x --signal SIGUSR2 slurmdbd
> exit 0
> endscript

I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ 
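
For context, the postrotate lines above would sit inside a complete logrotate stanza. The sketch below is one plausible shape, not a tested configuration: the log path /var/log/slurm/*.log, the rotation count, and the size threshold are all assumptions to adapt per site. As far as I understand it, the relevant difference from `systemctl reload` is that SIGUSR2 only asks the daemons to reopen their log files, whereas a reload sends SIGHUP and triggers a reconfigure.

```
# Assumed log location; match it to SlurmctldLogFile / SlurmdLogFile.
/var/log/slurm/*.log {
    missingok
    notifempty
    compress
    # Assumed retention and size threshold.
    rotate 5
    size 5M
    # Run postrotate once for all matched logs, not once per file.
    sharedscripts
    postrotate
        pkill -x --signal SIGUSR2 slurmctld
        pkill -x --signal SIGUSR2 slurmd
        pkill -x --signal SIGUSR2 slurmdbd
        exit 0
    endscript
}
```

The trailing `exit 0` keeps logrotate from reporting an error on hosts where one of the daemons is not running and pkill finds nothing to signal.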


Reed




[slurm-users] admin users without a database

2022-09-19 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Is it possible to make a user an admin without slurmdbd? The docs I've found
indicate that I need to set the user's admin level with sacctmgr, but that
command always says

    You are not running a supported accounting_storage plugin
    Only 'accounting_storage/slurmdbd' is supported.

I don't especially want any accounting; I just want to make one user an admin.

Noam


[slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Paul Raines



I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT way before their normal TimeLimit.
The slurmctld log has lots of lines like this at 3:35am:

[2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922


with jobs running on several different nodes.

The one curious thing is that right about this time log rotation is
happening in cron on the slurmctld master node:

Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate


The 5 minute runtime here is a big anomaly.  On other machines, like
nodes just running slurmd or my web servers, this only takes a couple of
seconds.


In /etc/logrotate.d/slurmctl I have

    postrotate
        systemctl reload slurmdbd >/dev/null 2>/dev/null || true
        /bin/sleep 1
        systemctl reload slurmctld >/dev/null 2>/dev/null || true
    endscript

Does it make sense that this could be causing the issue?

In slurm.conf I had InactiveLimit=60 which I guess is what is happening
but my reading of the docs on this setting was it only affects the
starting of a job with srun/salloc and not a job that has been running
for days.  Is it InactiveLimit that leads to the "inactivity time limit 
reached" message?


Anyway, I have changed InactiveLimit=600 to see if that helps.


---
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129    USA


