Re: [slurm-users] Curious performance results

2021-02-25 Thread Angelos Ching
I think it's related to the job step launch semantics change introduced in 20.11.0, which has been reverted as of 20.11.3; see https://www.schedmd.com/news.php for details. Cheers, Angelos (Sent from mobile, please pardon me for typos and cursoriness.) > On 26/2/2021 9:07, Volker Blum wrote: > > H

Re: [slurm-users] Insert separating characters into sacct formated output

2021-02-09 Thread Angelos Ching
Hi Jianwen, I guess the -p or -P flag does what you want? Best regards, Angelos (Sent from mobile, please pardon me for typos and cursoriness.) > On 9/2/2021 21:46, SJTU wrote: > > Hi, > > I am using SLURM 19.05.7. Is it possible to insert user-defined > separating characters like "|" or ","
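(A minimal sketch of consuming sacct's pipe-delimited -P output follows; the field list and job ID are illustrative only, not taken from the thread.)

import subprocess

# Ask sacct for pipe-delimited output; -P (--parsable2) omits the trailing "|".
cmd = ["sacct", "-P", "--format=JobID,JobName,State,Elapsed", "-j", "12345"]
out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# The first line is the header, the rest are records; split each line on "|".
rows = [line.split("|") for line in out.strip().splitlines()]
header, records = rows[0], rows[1:]
for rec in records:
    print(dict(zip(header, rec)))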

Re: [slurm-users] [slurm 20.02.3] don't suspend nodes in down state

2020-08-24 Thread Angelos Ching
I have some logic in SuspendProgram and its helper programs that makes sure the node to be acted on is in the idle state before any power action is performed. Best regards, Angelos (Sent from mobile, please pardon me for typos and cursoriness.) > On 2020/08/24 17:42, Jacek Budzowski wrote: > >  > Dear
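(A rough sketch of that kind of guard follows, assuming the node list arrives as the first argument in Slurm hostlist form; the power-off command is a placeholder, and the exact checks in the original helper are not shown in the thread.)

#!/usr/bin/env python3
# Hypothetical SuspendProgram wrapper: only power off nodes that are idle.
import subprocess
import sys

def expand(hostlist):
    # Expand e.g. "node[01-03]" into individual node names.
    out = subprocess.run(["scontrol", "show", "hostnames", hostlist],
                         capture_output=True, text=True, check=True).stdout
    return out.split()

def node_state(node):
    # %T prints the node state (idle, allocated, down, drained, ...).
    out = subprocess.run(["sinfo", "-h", "-n", node, "-o", "%T"],
                         capture_output=True, text=True, check=True).stdout
    return out.strip()

for node in expand(sys.argv[1]):
    state = node_state(node)
    if state.startswith("idle"):
        # Placeholder: replace with the site's real power-off mechanism.
        subprocess.run(["site-power-off", node], check=False)
    else:
        print(f"skipping {node}: state is {state}", file=sys.stderr)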

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-22 Thread Angelos Ching
Agreed. You may also want to write a script that gathers the list of programs in "D state" (kernel wait) and prints their stacks, and configure it as UnkillableStepProgram, so that you can capture the program and the relevant system calls that caused the job to become unkillable / time out while exiting, for f
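(A rough sketch of such an UnkillableStepProgram follows, assuming it runs as root on a kernel that exposes /proc/<pid>/stack; the log path is illustrative.)

#!/usr/bin/env python3
# Hypothetical UnkillableStepProgram: record every process in D state
# (uninterruptible sleep) together with its kernel stack for later analysis.
import glob

LOG = "/var/log/slurm/unkillable.log"  # illustrative path

with open(LOG, "a") as log:
    for status_path in glob.glob("/proc/[0-9]*/status"):
        pid = status_path.split("/")[2]
        try:
            status = open(status_path).read()
            if "State:\tD" not in status:
                continue  # not in uninterruptible sleep
            comm = open(f"/proc/{pid}/comm").read().strip()
            log.write(f"=== PID {pid} ({comm}) in D state ===\n")
            log.write(open(f"/proc/{pid}/stack").read())  # kernel stack trace
        except OSError:
            continue  # process exited (or stack unreadable) while scanning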

Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread Angelos Ching
If it's an Ethernet problem, there should be a kernel message (dmesg) showing either a link/carrier change or a driver reset. The OP's problem could have been caused by excessive paging; check the -M flag of slurmd? https://slurm.schedmd.com/slurmd.html Regards, Angelos (Sent from mobile, please pardon me fo

Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Angelos Ching
Hi Timo, We have faced a similar problem, and our solution was to run an hourly cron job that sets a random node weight for each node. It works pretty well for us. Best regards, Angelos (Sent from mobile, please pardon me for typos and cursoriness.) > On 2020/07/03 2:24, Timo Rothenpieler wrote: > > Hel
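(A minimal sketch of such an hourly job follows, assuming it runs as a user allowed to call scontrol update; the weight range is arbitrary, and the change does not persist across a slurmctld restart.)

#!/usr/bin/env python3
# Hypothetical hourly cron job: give each node a random scheduling weight so
# the scheduler does not always pick the same lowest-weight nodes first.
import random
import subprocess

# List every node name, one per line, without the sinfo header.
out = subprocess.run(["sinfo", "-h", "-N", "-o", "%N"],
                     capture_output=True, text=True, check=True).stdout

for node in sorted(set(out.split())):
    weight = random.randint(1, 1000)
    subprocess.run(["scontrol", "update", f"NodeName={node}", f"Weight={weight}"],
                   check=True)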

Re: [slurm-users] salloc not working in configless setup on login machine

2020-03-04 Thread Angelos Ching
Hi Gizo, I noticed SLURM_CONF was set to a broken socket when inside salloc, which is why sinfo was confused. I've found a workaround: if I "unset SLURM_CONF" before running sinfo, then sinfo works. Maybe a bug needs to be reported for this. Best regards, Angelos On 3/4/20 2:07 AM, nan...@luis.un

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-27 Thread Angelos Ching
Hi all, It looks like --config-server is limited to one config server, if I'm not mistaken? Specifying multiple --config-server options causes slurmd to consider only the last one (a quick glance at the source seems to agree). Any plan on accepting a second server via command-line options? Thanks & r