Re: [slurm-users] draining nodes due to failed killing of task?
On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:

> i was wondering why a node is drained when killing of task fails and how can
> i disable it? (i use cgroups) moreover, how can the killing of task fails?
> (this is on slurm 19.05)

Slurm has tried to kill processes, but they refuse to go away. Usually this means they're stuck in a device or I/O wait for some reason, so look for processes that are in a "D" state on the node.

As others have said they can be stuck writing out large files and waiting for the kernel to complete that before they exit.

This can also happen if you're using GPUs and something has gone wrong in the driver and the process is stuck in the kernel somewhere.

You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the kernel reports tasks stuck and where they are stuck.

If there are tasks stuck in that state then often the only recourse is to reboot the node back into health.

You can tell Slurm to run a program on the node should it find itself in this state, see:

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
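As a rough illustration of the "look for processes in a D state" advice, here is a minimal sketch. The function name filter_d_state is made up for this example, and it assumes `ps`-style output on stdin with the state in the second column (as produced by the procps `ps` on Linux).

```shell
#!/bin/sh
# Sketch only: filter_d_state is a hypothetical helper name.
# Keeps the header line plus every process whose STAT column (column 2)
# starts with "D", i.e. uninterruptible sleep.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

On a suspect node it could be fed with something like `ps -eo pid,stat,wchan:32,comm | filter_d_state`; the wchan column gives a hint about where in the kernel the task is waiting. A script built around this is one candidate for the program referenced by UnkillableStepProgram, so the stuck processes are recorded before the node is rebooted.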
Re: [slurm-users] 19.05->20.11 update:: slurmdbd failure - SOLVED
On 8/7/21 9:50 PM, Adrian Sevcenco wrote:

> Hi! I just upgraded slurm from 19.05 to 20.11 (all services stopped before) and now, after checking the configuration, slurmdbd does not start anymore:
>
> [2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
> [2021-08-07T21:42:01.896] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'iss-alice_job_table'
> alter table "iss-alice_job_table" change pack_job_id het_job_id int unsigned not null, change pack_job_offset het_job_offset int unsigned not null;
> [2021-08-07T21:42:01.896] error: _convert_job_table_pre: Can't convert iss-alice_job_table info: Unknown error 1054

actually i managed to solve this .. it seems that either i restarted slurmdbd when i saw that it was not coming up, or something else happened, and pack_job_id was converted to het_job_id but the conversion was not acknowledged .. so i changed the columns back to their pack_* names and let the conversion process run its course at the start of slurmdbd

Adrian

> [2021-08-07T21:42:01.896] error: issue converting tables before create
> [2021-08-07T21:42:01.896] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
> [2021-08-07T21:42:01.897] error: cannot create accounting_storage context for accounting_storage/mysql
> [2021-08-07T21:42:01.897] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
>
> Any idea what is going on? i checked the configuration content and it seems that nothing changed...
>
> Thanks a lot!
> Adrian
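For anyone hitting the same half-converted state, the fix described above might look roughly like the sketch below. This is a guess at the statements involved, not what was actually typed: the two helper names are hypothetical, the column types are copied from the alter statement in the log, and the helpers only print SQL so nothing touches the database until the output is reviewed. Take a dump of the accounting database first.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical. They print SQL and do not
# connect to the database themselves.

# Print a query showing which of the pack_*/het_* columns a job table has.
# $1 = accounting database name (assumption: the value of StorageLoc)
# $2 = job table name, e.g. iss-alice_job_table
job_columns_sql() {
    printf "SELECT column_name FROM information_schema.columns WHERE table_schema='%s' AND table_name='%s' AND column_name IN ('pack_job_id','pack_job_offset','het_job_id','het_job_offset');\n" "$1" "$2"
}

# Print the reverse of the rename that slurmdbd attempts at startup,
# so that its own conversion can run again from the old column names.
# $1 = job table name
revert_rename_sql() {
    printf 'ALTER TABLE `%s` CHANGE het_job_id pack_job_id INT UNSIGNED NOT NULL, CHANGE het_job_offset pack_job_offset INT UNSIGNED NOT NULL;\n' "$1"
}
```

The output can be piped into the mysql client, for example `job_columns_sql slurm_acct_db iss-alice_job_table | mysql -N`, where slurm_acct_db is only an assumed database name. If the query reports het_job_id and het_job_offset while slurmdbd still complains about pack_job_id, the table is in the state described in this message.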
Re: [slurm-users] draining nodes due to failed killing of task?
On 8/6/21 6:06 PM, Willy Markuske wrote:

> Adrian and Diego,

Hi!

> Are you using AMD Epyc processors when viewing this issue? I've been having the same issue but only on dual AMD Epyc

i do have some epyc nodes, but the cpu proportion is 50%/50% with broadwell cores .. and i do not see a correlation/preference of the problem for the epyc ones

> systems. I haven't tried changing the core file location from an NFS mount though so perhaps there's an issue writing it out in time. How did you disable core files?

to tell the truth i do not know at this moment :)) i have to search in conf files, but i see that:

[root@alien ~]# ulimit -a | grep core
core file size          (blocks, -c) 0

you can either add to /etc/security/limits.d/ a file with:

* hard core 0

and/or:

ulimit -S -c 0

HTH,
Adrian

> Regards,
> Willy Markuske
> HPC Systems Engineer
> Research Data Services
> P: (619) 519-4435
>
> On 8/6/21 6:16 AM, Adrian Sevcenco wrote:
>> On 8/6/21 3:19 PM, Diego Zuccato wrote:
>>> IIRC we increased SlurmdTimeout to 7200 .
>> Thanks a lot!
>> Adrian
>>
>>> On 06/08/2021 13:33, Adrian Sevcenco wrote:
>>>> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>>>>> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts
>>>> oh, i see.. well, in principle i should not have core files, and i do not find any...
>>>>> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.
>>>> and how much did you increase them? i have
>>>> SlurmctldTimeout=300
>>>> SlurmdTimeout=300
>>>>
>>>> Thank you!
>>>> Adrian
>>>>
>>>>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>>>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>>>>> Hi.
>>>>>> Hi!
>>>>>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>>>>>> i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?
>>>>>>
>>>>>> Thank you!
>>>>>> Adrian
>>>>>>
>>>>>>> BYtE,
>>>>>>> Diego
>>>>>>>
>>>>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>>>>> Having just implemented some triggers i just noticed this:
>>>>>>>>
>>>>>>>> NODELIST    NODES  PARTITION  STATE     CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
>>>>>>>> alien-0-47  1      alien*     draining  48    48:1:1  193324  214030    1       rack-0,4  Kill task failed
>>>>>>>> alien-0-56  1      alien*     drained   48    48:1:1  193324  214030    1       rack-0,4  Kill task failed
>>>>>>>>
>>>>>>>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of task fails? (this is on slurm 19.05)
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>> Adrian

--
Adrian Sevcenco, Ph.D. | Institute of Space Science - ISS, Romania | adrian.sevcenco at {cern.ch,spacescience.ro}
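To keep an eye on how often nodes end up in the state shown at the start of this thread, a small filter like the following may help. It is only a sketch: kill_failed_nodes is a made-up name, and it assumes "node|state|reason" lines on stdin, one node per line, with the reason text matching the "Kill task failed" string from the sinfo output above.

```shell
#!/bin/sh
# Sketch only: kill_failed_nodes is a hypothetical helper name.
# Reads "node|state|reason" lines on stdin and prints the names of nodes
# whose reason starts with "Kill task failed".
kill_failed_nodes() {
    awk -F'|' '$3 ~ /^Kill task failed/ { print $1 }'
}
```

Input in that shape can be produced with `sinfo -h -N -o '%N|%T|%E' | kill_failed_nodes`. Once a listed node has been checked (no leftover processes, or a reboot), it can be returned to service with `scontrol update NodeName=<node> State=RESUME`.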
[slurm-users] 19.05->20.11 update:: slurmdbd failure
Hi! I just upgraded slurm from 19.05 to 20.11 (all services stopped before) and now, after checking the configuration, slurmdbd does not start anymore:

[2021-08-07T21:42:01.890] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
[2021-08-07T21:42:01.896] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'iss-alice_job_table'
alter table "iss-alice_job_table" change pack_job_id het_job_id int unsigned not null, change pack_job_offset het_job_offset int unsigned not null;
[2021-08-07T21:42:01.896] error: _convert_job_table_pre: Can't convert iss-alice_job_table info: Unknown error 1054
[2021-08-07T21:42:01.896] error: issue converting tables before create
[2021-08-07T21:42:01.896] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2021-08-07T21:42:01.897] error: cannot create accounting_storage context for accounting_storage/mysql
[2021-08-07T21:42:01.897] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

Any idea what is going on? i checked the configuration content and it seems that nothing changed...

Thanks a lot!
Adrian