Hello Slurm user community,

I would like to share my experience with the updates I did following the
security fixes published last month (May 4th), as it may help other users (and
hopefully attract the attention of the responsible developers).

As I explained in my previous message, we had been running Slurm 20.11.8 since
July 2021 without any problems.

After the security patches were published we decided at the same time to update
to the latest version available at that point, 21.08.8-2. We then ran into
problems with jobs composed of a large number of steps (e.g., 283). We tried
modifying the configuration, without success. We then decided to go back to a
previous, non-vulnerable version, 20.11.9, with a conservative configuration,
and it worked! We then modified the configuration to add the parameter
recommended for security (CommunicationParameters=block_null_hash) and it kept
working.

Here is a summary of what's working and what's not working:

20.11.8 (of course without CommunicationParameters=block_null_hash): working
21.08.8-2 with CommunicationParameters=block_null_hash: intermittent problems
21.08.8-2 without CommunicationParameters=block_null_hash: intermittent problems
20.11.9 without CommunicationParameters=block_null_hash: working
20.11.9 with CommunicationParameters=block_null_hash: working

So from the symptoms I described in my previous message, the intermittent
problems really look like buggy behavior (a race condition or a communication
problem). And from the results above, this bug looks like a *regression* from
20.11.[8-9] to 21.08.8-2.

To Slurm developers:

I know we are not paying for maintenance. I would like to see my organisation
support a useful, well-made open source project such as Slurm instead of buying
yet another Windows-centered commercial product. But with our small 24-node
cluster the support fees would be too high for our operational budget. If I
were you I would still try to investigate this problem, as it is of general
interest: it may lead to a real regression bug affecting many users (both with
and without a support contract).

Thanks,

Martin Audet

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Audet, Martin
Sent: May 20, 2022 8:53
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] slurmctld loses connection with slurmd for no reason 
after upgrading from 20.11.8 to 21.08.8-2

Hello Slurm community,

It seems that since we upgraded Slurm from 20.11.8 to 21.08.8-2, and since a
user started a special type of job with many short steps (e.g., 1000), each
lasting from one to five minutes, slurmctld loses contact after many steps (283
in this case) with the slurmd daemons running on the compute nodes for this
job. It wrongly believes the nodes have resumed and exceeded their
ResumeTimeout of 300 s, which is not possible, since the job had been running
for almost 24 hours on the same four nodes and no node had "resumed". Slurm
then decides to mark the nodes as "down" and "power_save", and just afterwards
it notices that the four nodes used by the job are "now responding".

But too late! The decision has been taken: the job will soon be rescheduled on
another set of nodes, and the old nodes are marked "down~".
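
In case it can save someone a few minutes: here is a rough sketch of the
commands one can use afterwards to see why the nodes were marked down and to
put them back in service (the node list is of course specific to our cluster):

  # show the reason and time each node was marked down
  sinfo -R

  # detailed state of the nodes that were running the job
  scontrol show node cn[9-12]

  # once satisfied the nodes are actually healthy, return them to service
  scontrol update NodeName=cn[9-12] State=RESUME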

Note that on the compute nodes themselves everything appears normal. The only
message I see is "credential for job 47647 revoked". The slurmd daemon is
running fine according to systemctl, and I have seen nothing in
/var/log/messages indicating an error (e.g., a communication error) on the head
or compute nodes. Communication between slurmctld and slurmd is done over IPoIB
interfaces (rather than GbE) and our InfiniBand network seems fine.
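
For completeness, this is roughly what I mean by those checks (a sketch only;
the grep pattern and the address are just examples):

  # on a compute node: slurmd status and recent log entries
  systemctl status slurmd
  journalctl -u slurmd --since "1 hour ago"

  # on the head node: confirm the timeouts actually in effect
  scontrol show config | grep -Ei 'SlurmdTimeout|ResumeTimeout|SuspendTime'

  # basic reachability over the IPoIB interface (address is an example)
  ping -c 3 10.2.1.109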

In the last three days this problem happened three times on our small 24-node
cluster.

Am I the only one with this problem?

In my opinion, this intermittent problem really looks like a race condition or
a communication bug.

Note that slurm.conf contains:

  CommunicationParameters=block_null_hash
  MaxStepCount=100000000
  SlurmctldDebug=debug
  SlurmdDebug=debug
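
If it can help anyone debugging the same thing, verbosity can also be raised at
run time without restarting the daemons, and the power saving logic has its own
debug flag. The following is only a sketch of what one might try, not something
I have confirmed pinpoints the problem:

  # raise slurmctld verbosity temporarily (revert with "scontrol setdebug info")
  scontrol setdebug debug2

  # extra logging for the power save / resume logic
  scontrol setdebugflags +Power

  # or persistently in slurm.conf
  DebugFlags=Power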

Also, the update was done cleanly by stopping all jobs using a maintenance
reservation. The daemons were stopped and disabled. The RPMs were built from
source. The old ones (20.11.8) were removed and the new ones (21.08.8-2)
installed. Finally the daemons were started and enabled.
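
For anyone following the same upgrade path, here is the procedure sketched in
shell form (the reservation name, the tarball and the package paths are
examples, not a verbatim record of what I typed):

  # drain the cluster with a maintenance reservation before the upgrade
  scontrol create reservation reservationname=maint starttime=now \
      duration=infinite nodes=ALL users=root flags=maint,ignore_jobs

  # head node: stop and disable slurmctld; compute nodes: the same for slurmd
  systemctl stop slurmctld && systemctl disable slurmctld
  systemctl stop slurmd && systemctl disable slurmd

  # build the RPMs from the source tarball, then swap old for new
  rpmbuild -ta slurm-21.08.8-2.tar.bz2
  dnf remove 'slurm*'
  dnf install ~/rpmbuild/RPMS/x86_64/slurm-*.rpm

  # re-enable and start the daemons (slurmd on the nodes, slurmctld on the head)
  systemctl enable --now slurmd
  systemctl enable --now slurmctld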

Job 47647 was allocated four nodes: cn[9-12]. Each step is short and uses all
the allocated CPUs, and the steps run one after the other.
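
To make the pattern concrete, here is a minimal sketch of a batch script with
the same shape (the step count, walltime, task layout and the dummy program are
invented; only the structure matches our user's job):

  #!/bin/bash
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=24
  #SBATCH --time=36:00:00

  # ~1000 short steps, each using the whole allocation, run one after the other
  for i in $(seq 1 1000); do
      srun ./short_step "$i"    # hypothetical program lasting 1 to 5 minutes
  done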

Thanks in advance,

Martin Audet

Here are the messages I had from slurmctld:


   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node 
cn9 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
requeue job JobId=47647 due to failure of node cn9
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
Requeuing JobId=47647
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node 
cn10 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node 
cn11 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: node 
cn12 not resumed by ResumeTimeout(300) - marking down and power_save
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
sched: Updated reservation=limite users=root nodes=cn[1-24] cores=576 
licenses=(null) tres=cpu=1152 watts=4294967294 start=2022-06-30T23:59:00 
end=2022-07-01T00:00:00 MaxStartDelay=
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node 
cn10 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node 
cn11 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node 
cn12 now responding
   May 19 11:13:08 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: Node 
cn9 now responding
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
_job_complete: JobId=47647 WTERMSIG 15
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
_job_complete: JobId=47647 cancelled by interactive user
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:09 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
job_step_signal: JobId=47647 is in state PENDING, cannot signal steps
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
step_partial_comp: JobId=47647 pending
   May 19 11:13:10 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
step_partial_comp: JobId=47647 pending
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn12
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn10
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn11
   May 19 11:13:11 hn.galerkin.res.nrc.gc.ca slurmctld[39512]: slurmctld: 
debug:  job_epilog_complete: JobId=47647 complete response from DOWN node cn9


Here are the messages I had from slurmd on cn9 (the other three nodes show
similar messages):

   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: launch task 
StepId=47647.283 request from UID:171971265 GID:171971265 HOST:10.2.1.109 
PORT:43544
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
Checking credential with 572 bytes of sig data
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: 
task/affinity: lllp_distribution: JobId=47647 binding: cores, dist 2
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: 
task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: 
task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [47647]: 
mask_cpu, 
0x000001000001,0x001000001000,0x000002000002,0x002000002000,0x000004000004,0x004000004000,0x000008000008,0x008000008000,0x000010000010,0x010000010000,0x000020000020,0x020000020000,0x000040000040,0x040000040000,0x000080000080,0x080000080000,0x000100000100,0x100000100000,0x000200000200,0x200000200000,0x000400000400,0x400000400000,0x000800000800,0x800000800000
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
Waiting for job 47647's prolog to complete
   May 19 11:08:48 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
Finished wait for job 47647's prolog to complete
   May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
_rpc_terminate_job: uid = 1001 JobId=47647
   May 19 11:13:08 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
credential for job 47647 revoked
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
Waiting for job 47647's prolog to complete
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
Finished wait for job 47647's prolog to complete
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
completed epilog for jobid 47647
   May 19 11:13:11 cn9.galerkin.res.nrc.gc.ca slurmd[2537]: slurmd: debug:  
JobId=47647: sent epilog complete msg: rc = 0


