Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-07 Thread Chris Samuel
On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:

> i was wondering why a node is drained when killing of a task fails and how can
> i disable it? (i use cgroups) moreover, how can the killing of a task fail?
> (this is on slurm 19.05)

Slurm has tried to kill processes, but they refuse to go away. Usually this 
means they're stuck in a device or I/O wait for some reason, so look for 
processes that are in a "D" state on the node.
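
A minimal sketch of one way to look for such processes, assuming the procps
`ps` found on most Linux nodes. The function names are made up for
illustration and are not part of Slurm:

```shell
# Keep the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). Reads `ps`-style output on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# List D-state processes together with the kernel function they are
# sleeping in (WCHAN), which hints at where they are stuck.
list_d_state() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Running `list_d_state` on the draining node should print only the header when
nothing is stuck; any other rows are the candidates to investigate.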

As others have said they can be stuck writing out large files and waiting for 
the kernel to complete that before they exit.  This can also happen if you're 
using GPUs and something has gone wrong in the driver and the process is stuck 
in the kernel somewhere.

You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the 
kernel reports tasks stuck and where they are stuck.

If there are tasks stuck in that state then often the only recourse is to 
reboot the node back into health.

You can tell Slurm to run a program on the node should it find itself in this 
state, see:

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
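
As a rough sketch of what such a program could look like: the script path, the
log path and the timeout value below are assumptions for illustration, not
values from this thread. It only records which processes are in "D" state at
the moment Slurm gives up on the step:

```shell
#!/bin/sh
# Hypothetical hook script. Assumed slurm.conf entries (path and timeout
# value are examples, not taken from the thread):
#   UnkillableStepProgram=/usr/local/sbin/unkillable-report.sh
#   UnkillableStepTimeout=120

# Append a timestamped list of D-state processes to the given log file.
report_unkillable() {
    logfile=$1
    {
        printf '=== %s %s ===\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$(uname -n)"
        ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
    } >> "$logfile"
}

# A real hook would end with a call such as:
#   report_unkillable /var/log/slurm/unkillable.log
```

Keeping the hook to plain logging means it cannot itself hang on the same
stuck filesystem or driver, provided the log file lives on local disk.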

Best of luck,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] 19.05->20.11 update:: slurmdbd failure - SOLVED

2021-08-07 Thread Adrian Sevcenco

On 8/7/21 9:50 PM, Adrian Sevcenco wrote:

> Hi! I just upgraded slurm from 19.05 to 20.11 (all services stopped before)
> and now, after checking the configuration, slurmdbd does not start anymore:
>
> [2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
> [2021-08-07T21:42:01.896] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'iss-alice_job_table'
> alter table "iss-alice_job_table" change pack_job_id het_job_id int unsigned not null, change pack_job_offset het_job_offset int unsigned not null;
> [2021-08-07T21:42:01.896] error: _convert_job_table_pre: Can't convert iss-alice_job_table info: Unknown error 1054

actually i managed to solve this .. it seems that either i restarted
slurmdbd when i saw that it was not coming up, or something else happened,
and pack_job_id had already been converted to het_job_id but the conversion
was not acknowledged ..
so i changed the columns back to their pack_* names and let the conversion
process run its course when slurmdbd starts
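
For reference, a hedged sketch of what "changing the columns back" could look
like. It simply inverts the ALTER TABLE statement shown in the log, and it
assumes that both columns had already been renamed. The helper name is
hypothetical, and the database name slurm_acct_db is only the usual default,
not something stated in this thread:

```shell
# Print the inverse of the ALTER TABLE statement from the slurmdbd log, i.e.
# rename het_job_* back to pack_job_* for one cluster's job table.
# $1 is the cluster name as used in the table prefix (e.g. iss-alice).
revert_het_columns_sql() {
    printf 'ALTER TABLE `%s_job_table` CHANGE het_job_id pack_job_id INT UNSIGNED NOT NULL, CHANGE het_job_offset pack_job_offset INT UNSIGNED NOT NULL;\n' "$1"
}

# Example (not run here; back up first, and substitute your own database
# name if StorageLoc in slurmdbd.conf differs from the assumed default):
#   mysqldump slurm_acct_db > slurm_acct_db.backup.sql
#   revert_het_columns_sql iss-alice | mysql slurm_acct_db
```

Printing the statement rather than executing it leaves room to inspect it, and
the current column names, before anything touches the accounting database.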

Adrian


> [2021-08-07T21:42:01.896] error: issue converting tables before create
> [2021-08-07T21:42:01.896] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
> [2021-08-07T21:42:01.897] error: cannot create accounting_storage context for accounting_storage/mysql
> [2021-08-07T21:42:01.897] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
>
> Any idea what is going on? i checked the configuration content and it seems
> that nothing changed...
>
> Thanks a lot!
> Adrian





Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-07 Thread Adrian Sevcenco

On 8/6/21 6:06 PM, Willy Markuske wrote:

> Adrian and Diego,

Hi!

> Are you using AMD Epyc processors when viewing this issue? I've been having
> the same issue but only on dual AMD Epyc

i do have some epyc nodes, but the cpu split is 50%/50% with broadwell
cores, and i do not see any correlation between the problem and the epyc
nodes

> systems. I haven't tried changing the core file location from an NFS mount
> though so perhaps there's an issue writing it out in time.
>
> How did you disable core files?

to tell the truth i do not know at this moment :)) i have to search in the
conf files, but i see that:
[root@alien ~]# ulimit -a | grep core
core file size  (blocks, -c) 0

you can either add a file to /etc/security/limits.d/ containing:
* hard core 0

and/or run:
ulimit -S -c 0
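
To check which limit a running process actually got, a small sketch that reads
a /proc/<pid>/limits file may help. The helper name is made up here. This can
be worth checking because Slurm may propagate limits from the submitting shell
(see PropagateResourceLimits in slurm.conf), so a job's limit is not
necessarily what the node's login shell reports:

```shell
# Print the soft and hard "Max core file size" limits from a
# /proc/<pid>/limits-style file given as $1.
core_limits() {
    awk '/^Max core file size/ { print $5, $6 }' "$1"
}

# Example: limits of the current shell
#   core_limits /proc/$$/limits
```

A soft limit of 0 means the process will not write a core file, whatever the
hard limit is.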

HTH,
Adrian




Regards,

Willy Markuske

HPC Systems Engineer



Research Data Services

P: (619) 519-4435

On 8/6/21 6:16 AM, Adrian Sevcenco wrote:

On 8/6/21 3:19 PM, Diego Zuccato wrote:

IIRC we increased SlurmdTimeout to 7200.

Thanks a lot!

Adrian



Il 06/08/2021 13:33, Adrian Sevcenco ha scritto:

On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm 
timeouts

oh, i see.. well, in principle i should not have core files, and i do not find 
any...

to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I 
disabled core files and restored the timeouts.

and how much did you increase them? i have
SlurmctldTimeout=300
SlurmdTimeout=300
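
A small sketch for reading the effective values from the running controller
rather than from the config file. The helper name is hypothetical.
UnkillableStepTimeout is not mentioned in this message; it is included on the
assumption that it is the setting most directly tied to the "Kill task failed"
reason, since it is documented alongside UnkillableStepProgram:

```shell
# Keep only the timeout settings discussed in this thread from
# `scontrol show config`-style output on stdin.
filter_timeouts() {
    grep -E '^(SlurmctldTimeout|SlurmdTimeout|UnkillableStepTimeout)[[:space:]]*='
}

# Example, on a node with Slurm installed:
#   scontrol show config | filter_timeouts
```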

Thank you!
Adrian




Il 06/08/2021 12:46, Adrian Sevcenco ha scritto:

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!


Might it be due to a timeout (maybe the killed job is creating a core file, or 
caused heavy swap usage)?

i will have to search for the culprit ..
but the question is why the node would be put in drain because killing a
task failed, and how can i control/disable this?

Thank you!
Adrian




BYtE,
  Diego

Il 06/08/2021 09:02, Adrian Sevcenco ha scritto:

Having just implemented some triggers i just noticed this:

NODELIST    NODES  PARTITION  STATE     CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
alien-0-47  1      alien*     draining  48    48:1:1  193324  214030    1       rack-0,4  Kill task failed
alien-0-56  1      alien*     drained   48    48:1:1  193324  214030    1       rack-0,4  Kill task failed
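
A hedged sketch for pulling out just the affected node names, assuming `sinfo`
is asked for a "node|reason" format. The helper name is made up for
illustration:

```shell
# Print the names of nodes whose drain reason is exactly "Kill task failed".
# Expects "node|reason" lines on stdin.
kill_failed_nodes() {
    awk -F'|' '$2 == "Kill task failed" { print $1 }'
}

# Example, on a cluster (not run here):
#   sinfo -h -N -o '%N|%E' | kill_failed_nodes
# and, once a node has been rebooted and checked:
#   scontrol update NodeName=alien-0-56 State=RESUME
```

A node that belongs to several partitions may be listed more than once by
`sinfo -N`, so piping the result through `sort -u` can be useful.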

i was wondering why a node is drained when killing of a task fails and how can i
disable it? (i use cgroups)
moreover, how can the killing of a task fail? (this is on slurm 19.05)

Thank you!
Adrian







--
--
Adrian Sevcenco, Ph.D.   |
Institute of Space Science - ISS, Romania|
adrian.sevcenco at {cern.ch,spacescience.ro} |
--




[slurm-users] 19.05->20.11 update:: slurmdbd failure

2021-08-07 Thread Adrian Sevcenco

Hi! I just upgraded slurm from 19.05 to 20.11 (all services stopped before)
and now, after checking the configuration, slurmdbd does not start anymore:

[2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
[2021-08-07T21:42:01.896] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'iss-alice_job_table'
alter table "iss-alice_job_table" change pack_job_id het_job_id int unsigned not null, change pack_job_offset het_job_offset int unsigned not null;
[2021-08-07T21:42:01.896] error: _convert_job_table_pre: Can't convert iss-alice_job_table info: Unknown error 1054
[2021-08-07T21:42:01.896] error: issue converting tables before create
[2021-08-07T21:42:01.896] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2021-08-07T21:42:01.897] error: cannot create accounting_storage context for accounting_storage/mysql
[2021-08-07T21:42:01.897] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

Any idea what is going on? i checked the configuration content and it seems 
that nothing changed...

Thanks a lot!
Adrian