You can increase the number of available ephemeral ports. You can also
enable TIME-WAIT socket reuse (net.ipv4.tcp_tw_reuse) or recycling
(net.ipv4.tcp_tw_recycle); recycle is the dangerous one - it breaks
clients behind NAT and was removed entirely in Linux 4.12.
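
For example, a minimal sysctl sketch (the filename and values are
illustrative, not a recommendation):

  # /etc/sysctl.d/90-tcp-ports.conf (hypothetical filename)
  # Widen the ephemeral port range so more concurrent connections fit.
  net.ipv4.ip_local_port_range = 15000 60999
  # Reuse TIME-WAIT sockets for new outgoing connections (generally safe).
  net.ipv4.tcp_tw_reuse = 1
  # Do NOT set net.ipv4.tcp_tw_recycle: it broke clients behind NAT and
  # no longer exists in kernels >= 4.12.

Apply with sysctl --system.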

I have a war story related to TCP connections; I might tell it later.

On Mon, Jan 12, 2026, 2:15 PM Emyr James via slurm-users <
[email protected]> wrote:

> Hi,
>
> We had the same error message. We see it when users submit job arrays
> with lots of short jobs, which produces a storm of job starts and job
> completions within a short space of time.
>
> For every job start and job end, multiple messages pass between the
> compute node, the Slurm controller, the database daemon and the database
> itself, and multiple rows are inserted and updated in the database. When
> job turnover is high over a short period, Slurm cannot keep up and the
> load effectively becomes a denial-of-service attack against it.
>
> This command let me see which users had a lot of jobs completing within
> the last day or so (you can add start and end time filters for more
> precision):
>
> sacct --allusers -o jobid%20,user%20,jobname,state,exitcode,elapsedraw |
>   grep -v batch | grep -v extern | grep -v RUNNING |
>   awk '{print $2}' | sort | uniq -c | sort -nr | head
>
> (Note: because we use pam_slurm_adopt we get three rows per job, so I
> filter out the batch and extern rows.)
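>
> (A possibly shorter variant, if your sacct supports the -X/--allocations
> flag, which skips the batch and extern steps natively; the exact filters
> here are just an illustration:
>
> sacct -X --allusers -S now-1days -o user%20 --noheader |
>   sort | uniq -c | sort -nr | head
>
> This avoids the greps and counts one row per job.)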
>
> For the users showing up here, I then ran
>
> sacct --allusers -o jobid%20,user%20,jobname,state,exitcode,elapsedraw |
>   grep -v batch | grep -v extern | grep -v RUNNING | grep <username> |
>   awk '{print $6}' | sort -n | uniq -c
>
> substituting in the usernames. This shows the number of jobs with an
> elapsed time of 0, 1, 2, 3 etc. seconds. If you see multiple users with
> high counts in the sub-30-second range, this could be the reason:
> submitting lots of short jobs (<30 s runtime) is an anti-pattern.
>
> Instead of submitting, e.g., a 1000-element job array of 5-second jobs,
> the user could repackage this into a 10-element array in which each task
> loops over 100 of the individual work units, giving 10 jobs of roughly
> 500 seconds each. This avoids the message storms. You could even request
> multiple cores and use GNU parallel to spread the 100 units across them,
> as in the sketch below. That would make the individual jobs finish faster
> than the expected 500 seconds, and you may get much higher CPU efficiency
> if these jobs are bottlenecked on IO.
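>
> A minimal sketch of the repackaged submission (the script name
> process_unit.sh and the numbers are hypothetical):
>
> #!/bin/bash
> #SBATCH --array=0-9            # 10 array tasks instead of 1000
> #SBATCH --cpus-per-task=4      # give GNU parallel a few cores
>
> # Each array task handles 100 of the original 1000 work units.
> start=$(( SLURM_ARRAY_TASK_ID * 100 ))
> seq "$start" $(( start + 99 )) | \
>     parallel -j "$SLURM_CPUS_PER_TASK" ./process_unit.sh {}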
>
> Emyr James
> Head of Scientific IT
> CRG - Centre for Genomic Regulation
>
> ------------------------------
> *From:* Jason Ellul via slurm-users <[email protected]>
> *Sent:* 30 July 2024 01:46
> *To:* Patryk Bełzak <[email protected]>; Jason Ellul via
> slurm-users <[email protected]>
> *Subject:* [slurm-users] Re: slurmctld hourly: Unexpected missing socket
> error
>
>
> Thanks again, Patryk, for your insights. We have implemented many of the
> same things, but the socket errors are still occurring regularly.
>
> If we find a solution that works, I will be sure to add it to this thread.
>
> Many thanks
>
> Jason
>
> Jason Ellul
> Head - Research Computing Facility
> Office of Cancer Research
>
> My onsite days are Mon, alt Wed and Friday.
>
>
> Phone +61 3 8559 6546
> Email [email protected]
>
> 305 Grattan Street
> Melbourne, Victoria
> 3000 Australia
>
> www.petermac.org
>
>
>
>
>
> *From: *Patryk Bełzak via slurm-users <[email protected]>
> *Date: *Wednesday, 24 July 2024 at 8:03 PM
> *To: *Jason Ellul via slurm-users <[email protected]>
> *Subject: *[slurm-users] Re: slurmctld hourly: Unexpected missing socket
> error
>
> Hi,
>
> we're on 389 Directory Server (aka 389ds), which is a pretty large
> instance. One optimization was to create proper ACIs on the server side,
> which significantly improved lookup times on the Slurm controller and
> worker nodes. The second was to move the sssd cache to tmpfs, following
> Red Hat's instructions:
> https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/tuning_performance_in_identity_management/assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments_tuning-performance-in-idm#mounting-the-sssd-cache-in-tmpfs_assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments
> The whole of chapter 9 may be helpful.
>
> I also remembered that I recently adjusted the kernel's local port range
> to match the SlurmdPortRange from slurm.conf (60000-63001) by creating
> /etc/sysctl.d/91-slurm.conf with the following content:
>
> # set the ipv4 port range according to SlurmdPortRange in slurm.conf
> net.ipv4.ip_local_port_range = 32768    63001
>
> Unfortunately it hasn't stopped the error from occurring.
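>
> For what it's worth, a file like that can be applied without a reboot,
> assuming the usual procps sysctl tooling:
>
> sysctl --system                        # reload all sysctl.d files
> sysctl net.ipv4.ip_local_port_range    # verify the active range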
>
> Best regards,
> Patryk.
>
> On 24/07/23 12:08, Jason Ellul via slurm-users wrote:
> > Hi Patryk,
> >
> > Thanks so much for your email.
> >
> > There are a couple of things you list that we have not tried yet, so we
> > will definitely look at them. You mention optimizing SSSD, which has me
> > curious: are you using Red Hat Identity Management (FreeIPA)? We are, and
> > after going through our logs it appears the errors became more consistent
> > after upgrading our instance and replica to RHEL 9.
> >
> > May I ask what optimizations you put in place for SSSD?
> >
> > Many thanks
> >
> > Jason
> >
> >
> > Jason Ellul
> > Head - Research Computing Facility
> > Office of Cancer Research
> > My onsite days are Mon, alt Wed and Friday.
> >
> >
> >
> > Phone +61 3 8559 6546
> >
> > Email [email protected]
> > 305 Grattan Street
> > Melbourne, Victoria
> > 3000 Australia
> >
> > www.petermac.org
> >
> > From: Patryk Bełzak via slurm-users <[email protected]>
> > Date: Monday, 22 July 2024 at 6:03 PM
> > To: Jason Ellul via slurm-users <[email protected]>
> > Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket
> error
> >
> > Hi,
> > we've been facing the same issue for some time. At the beginning the
> > missing socket error happened every 20 minutes, later once per hour;
> > now it happens a few times a day.
> > The only downside was that the controller was unresponsive for those
> > few seconds (up to 60, if I remember correctly).
> > We tried to debug it in many ways, but we've found no straightforward
> > solution or source of the problem.
> >
> > Things we've changed since the problem came up:
> > * per-user RPC rate limiting:
`SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384`
> > * made sure the VM that Slurm runs on uses the "network-latency"
> > profile in `tuned`, with the same profile on the worker nodes
> > * implemented some of the recommendations from
> > https://slurm.schedmd.com/high_throughput.html on the controllers
> > * largely optimized slurmdb by some housekeeping and cleaning up
> inactive accounts, associations etc.
> > * optimized SSSD configuration (this one I believe had the biggest
> impact) both on controllers and on worker nodes
> > plus plenty of other (probably unrelated) changes.
> >
> > I'm not really sure whether any of the above helped us significantly in
> > that matter.
> >
> > Best regards,
> > Patryk Belzak.
> >
> > On 24/07/16 03:45, Jason Ellul via slurm-users wrote:
> > > Hi all,
> > >
> > > I am hoping someone can help with our problem. Every hour after
> > > restarting slurmctld, the controller becomes unresponsive to commands
> > > for about 1 second, reporting errors such as:
> > >
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > >
> > > It occurs consistently at around the hour mark, but generally not at
> > > other times, unless we run a reconfigure or restart the controller. We
> > > don't see any issues in slurmdbd.log, and the errors are always of msg
> > > type RESPONSE. We have tried building a new server on different
> > > infrastructure, but the problem has persisted. Yesterday we even tried
> > > updating Slurm to v24.05.1 in the hope that it might provide a fix.
> > > During our troubleshooting we have set:
> > >
> > > * SchedulerParameters = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
> > > * SlurmctldPort = 6808-6817
> > >
> > > Although the stats in sdiag have improved, we still see the errors.
> > >
> > > Our monitoring software also shows a drop in network and disk
> > > activity during this 1 second, always approximately 1 hour after
> > > restarting the controller.
> > >
> > > Many Thanks in advance
> > >
> > > Jason
> > >
> > > Jason Ellul
> > > Head - Research Computing Facility
> > > Office of Cancer Research
> > > Peter MacCallum Cancer Centre
> >
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
