On Tuesday, July 16, 2024 5:45:52 AM Central European Standard Time Jason Ellul via slurm-users wrote:
> Hi all,
>
> I am hoping someone can help with our problem. Every hour after restarting
> slurmctld the controller becomes unresponsive to commands for 1 sec,
> reporting errors such as:
>
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error

With Slurm 25.11.1 I noticed similar errors with array jobs. I solved it on my small cluster with:

SlurmctldParameters=conmgr_max_connections=4096
SlurmdParameters=conmgr_max_connections=512

It seems the default values were lowered and no longer match the ones stated in the documentation.

Regards,
Markus Köberl
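
To check whether raising conmgr_max_connections actually reduces these errors, one option is to count them per hour in the slurmctld log before and after the change. The following is only a minimal sketch, assuming the log lines look exactly like the ones quoted above; the function name count_send_failures is illustrative and not part of Slurm.

```python
import re
from collections import Counter

# Assumption: log lines follow the format quoted in the original post, e.g.
# [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]]
#   slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
PATTERN = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}\.\d+\] error: slurm_send_node_msg: .*"
    r"slurm_bufs_sendto\(msg_type=(\w+)\) failed: Unexpected missing socket error"
)


def count_send_failures(lines):
    """Count 'Unexpected missing socket error' lines per (hour, msg_type)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # group(1) is the timestamp truncated to the hour, group(2) the msg_type
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

It can be fed an open log file directly, for example count_send_failures(open(path)), where path is wherever SlurmctldLogFile points on the controller. If the hourly bursts disappear after the change, the connection limit was a likely cause.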
