On Tuesday, July 16, 2024 5:45:52 AM Central European Standard Time Jason Ellul 
via slurm-users wrote:
> Hi all,
> 
> I am hoping someone can help with our problem. Every hour after restarting
> slurmctld the controller becomes unresponsive to commands for 1 sec,
> reporting errors such as:
> 
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error

With Slurm 25.11.1 I noticed similar errors with array jobs.
I solved it on my small cluster with:

SlurmctldParameters=conmgr_max_connections=4096
SlurmdParameters=conmgr_max_connections=512

It seems the default values were lowered and no longer match the ones stated in
the documentation.
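
To confirm which limits are actually set, the config file can be inspected directly. The following is only a minimal sketch: the helper name conmgr_limits and the default path /etc/slurm/slurm.conf are assumptions, not part of Slurm, and the match is case-sensitive and ignores Include files.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print any conmgr_max_connections
# values found in a slurm.conf-style file.
# The default path is an assumption; pass your site's path as the first argument.
conmgr_limits() {
    conf="${1:-/etc/slurm/slurm.conf}"
    # Keep only uncommented SlurmctldParameters/SlurmdParameters lines, then
    # print "<parameter name> <limit>" for each line that sets the option.
    grep -E '^[[:space:]]*(SlurmctldParameters|SlurmdParameters)=' "$conf" |
        sed -n -E 's/^[[:space:]]*([A-Za-z]+)=.*conmgr_max_connections=([0-9]+).*/\1 \2/p'
}
```

Calling conmgr_limits with no argument reads the assumed default path. On a running cluster, `scontrol show config` also lists the SlurmctldParameters and SlurmdParameters values the controller is using, which shows whether an edit to the file has taken effect.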


regards
Markus Köberl


