You can increase the number of possible sockets. You can also enable socket reuse or recycle (net.ipv4.tcp_tw_reuse / net.ipv4.tcp_tw_recycle). Recycle is the dangerous one: it misbehaves with clients behind NAT and was removed entirely in Linux 4.12, so stick to reuse.
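For reference, a minimal sketch of the relevant sysctls (file name and values are illustrative, not a tuned recommendation for any particular site):

# /etc/sysctl.d/90-tcp-sockets.conf
# Widen the ephemeral port range so more concurrent outbound sockets fit.
net.ipv4.ip_local_port_range = 32768 60999
# Allow reuse of sockets stuck in TIME-WAIT for new outbound connections.
net.ipv4.tcp_tw_reuse = 1
# Deliberately NOT setting net.ipv4.tcp_tw_recycle: it breaks clients
# behind NAT and was removed in Linux 4.12.

Load it with: sysctl --system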
I have a war story related to tcp connections. Might tell it later.

On Mon, Jan 12, 2026, 2:15 PM Emyr James via slurm-users <[email protected]> wrote:

> Hi,
>
> We had the same error message. We see this happen when users submit job
> arrays with lots of short jobs, producing a storm of job starts and job
> completions within a short space of time.
>
> For every job start or job end, multiple messages go back and forth
> between the compute node, the slurm controller, the database daemon and
> the database itself, and multiple rows are inserted and updated in the
> database. When there is a high enough turnover of jobs in a short amount
> of time, the slurm system cannot keep up and the workload effectively
> becomes a denial-of-service attack against slurm.
>
> This command allowed me to see users that had a lot of jobs completing
> within the last day or so (you can add start and end filters for more
> precision):
>
> sacct --allusers -o jobid%20,user%20,jobname,state,exitcode,elapsedraw | grep -v batch | grep -v extern | grep -v RUNNING | awk '{print $2}' | sort | uniq -c | sort -nr | head
>
> (Because we are using pam_slurm_adopt we get three rows per job, so I
> filter out the batch and extern rows.)
>
> For each user showing up there I then ran
>
> sacct --allusers -o jobid%20,user%20,jobname,state,exitcode,elapsedraw | grep -v batch | grep -v extern | grep -v RUNNING | grep <username> | awk '{print $6}' | sort -n | uniq -c
>
> substituting in the username. This shows the number of jobs with an
> elapsed time of 0, 1, 2, 3, etc. seconds. If you see multiple users with
> high counts in the sub-30-second range, this could be the reason:
> submitting lots of short jobs (<30 s runtime) is an anti-pattern.
>
> Instead of submitting, say, a 1000-element job array of 5-second jobs,
> the user could repackage it as a 10-element array in which each task
> for-loops over 100 of the individual work items, giving 10 jobs of
> roughly 500 seconds each. This avoids the message storms. They could
> even request multiple cores and use GNU parallel to run the 100 items
> across a few cores; the individual jobs then finish quicker than the
> expected 500 seconds, and you may get much higher cpu efficiency if
> these jobs are bottlenecked on IO.
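> As a rough sketch of that repackaging (run_one_task.sh stands in for
> whatever one original array element did; array bounds and core counts
> are illustrative):
>
> #!/bin/bash
> #SBATCH --array=0-9
> #SBATCH --cpus-per-task=4
> #SBATCH --time=00:20:00
> # Each array task now covers 100 of the original 1000 work items,
> # fanned out across the allocated cores with GNU parallel.
> start=$(( SLURM_ARRAY_TASK_ID * 100 ))
> seq "$start" $(( start + 99 )) | parallel -j "$SLURM_CPUS_PER_TASK" ./run_one_task.sh {}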
> Emyr James
> Head of Scientific IT
> CRG - Centre for Genomic Regulation
>
> From: Jason Ellul via slurm-users <[email protected]>
> Sent: 30 July 2024 01:46
> To: Patryk Bełzak <[email protected]>; Jason Ellul via slurm-users <[email protected]>
> Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket error
>
> Thanks again Patryk,
>
> For your insights: we have implemented many of the same things, but the
> socket errors are still occurring regularly. If we find a solution that
> works I will be sure to add it to this thread.
>
> Many thanks
>
> Jason
>
> Jason Ellul
> Head - Research Computing Facility
> Office of Cancer Research
> My onsite days are Mon, alt Wed and Friday.
> Phone +61 3 8559 6546 | Email [email protected]
> 305 Grattan Street, Melbourne, Victoria 3000 Australia
> www.petermac.org
>
> From: Patryk Bełzak via slurm-users <[email protected]>
> Date: Wednesday, 24 July 2024 at 8:03 PM
> To: Jason Ellul via slurm-users <[email protected]>
> Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket error
>
> Hi,
>
> We're on 389 Directory Server (aka 389ds), which is a pretty large
> instance. One optimization was to create proper ACIs on the server side,
> which significantly improved lookup times on the slurm controller and
> worker nodes. The second was to move the sssd cache to tmpfs, following
> the RedHat instructions:
> https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/tuning_performance_in_identity_management/assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments_tuning-performance-in-idm#mounting-the-sssd-cache-in-tmpfs_assembly_tuning-sssd-performance-for-large-idm-ad-trust-deployments
> The whole of chapter 9 may be helpful.
>
> I also remembered that I recently modified the kernel to match the
> slurmd port range from slurm.conf (60000-63001) by creating
> /etc/sysctl.d/91-slurm.conf with the following content:
>
> # set ipv4 port range accordingly to slurmdPortRange in slurm.conf
> net.ipv4.ip_local_port_range = 32768 63001
>
> Unfortunately it hasn't stopped the error from occurring.
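> The tmpfs move from that chapter boils down to an fstab entry plus a
> remount. A sketch (the size and SELinux context here are illustrative;
> the linked doc has the exact line for each release):
>
> tmpfs /var/lib/sss/db tmpfs size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
>
> systemctl stop sssd && mount /var/lib/sss/db && systemctl start sssd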
> Best regards,
> Patryk.
>
> On 24/07/23 12:08, Jason Ellul via slurm-users wrote:
> > Hi Patryk,
> >
> > Thanks so much for your email.
> >
> > There are a couple of things you list that we have not tried yet, so
> > we will definitely look at them. You mention optimizing SSSD, which
> > has me curious: are you using RedHat Identity Management (FreeIPA)?
> > We are, and after going through our logs it appears the errors became
> > more consistent after upgrading our instance and replica to RHEL9.
> >
> > May I please ask what optimizations you put in place for SSSD?
> >
> > Many thanks
> >
> > Jason
> >
> > Jason Ellul
> > Head - Research Computing Facility
> > Office of Cancer Research
> >
> > From: Patryk Bełzak via slurm-users <[email protected]>
> > Date: Monday, 22 July 2024 at 6:03 PM
> > To: Jason Ellul via slurm-users <[email protected]>
> > Subject: [slurm-users] Re: slurmctld hourly: Unexpected missing socket error
> >
> > Hi,
> >
> > We've been facing the same issue for some time. At the beginning the
> > missing socket error happened every 20 minutes, later once per hour;
> > now it happens a few times a day. The only downside was that the
> > controller was unresponsive for those couple of seconds (up to 60, if
> > I remember well). We tried to debug it in many ways, but we've found
> > no straightforward solution or source of the problem.
> >
> > Things we've changed since the problem came up:
> >
> > * RPC user rate limiting:
> >   SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384
> > * made sure that the VM slurm runs on has the "network-latency"
> >   profile in tuned, and the same profile on worker nodes (commands
> >   below)
> > * implemented some of the recommendations from
> >   https://slurm.schedmd.com/high_throughput.html on the controllers
> > * largely optimized slurmdb with some housekeeping: cleaning up
> >   inactive accounts, associations etc.
> > * optimized the SSSD configuration (this one I believe had the
> >   biggest impact), both on controllers and on worker nodes
> >
> > plus plenty of other (not related, I guess) changes. I'm not really
> > sure if any of the above helped us significantly in that matter.
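> > For the tuned part, checking and switching profiles is just (tuned-adm
> > ships with the tuned package):
> >
> > tuned-adm active
> > tuned-adm profile network-latency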
> > Best regards,
> > Patryk Belzak.
>
> > On 24/07/16 03:45, Jason Ellul via slurm-users wrote:
> > > Hi all,
> > >
> > > I am hoping someone can help with our problem. Every hour after
> > > restarting slurmctld, the controller becomes unresponsive to
> > > commands for about 1 second, reporting errors such as:
> > >
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > > [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> > >
> > > It occurs consistently at around the hour mark, but generally not
> > > at other times, unless we run a reconfigure or restart the
> > > controller. We don't see any issues in slurmdbd.log, and the
> > > failing messages are always RESPONSE types. We have tried building
> > > a new server on different infrastructure, but the problem has
> > > persisted. Yesterday we even updated slurm to v24.05.1 in the hope
> > > that it may provide a fix. During our troubleshooting we have set:
> > >
> > > * SchedulerParameters = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
> > > * SlurmctldPort = 6808-6817
> > >
> > > Although the stats in sdiag have improved, we still see the errors.
> > > On our monitoring software we also see a drop in network and disk
> > > activity during this 1 second, always approx. 1 hour after
> > > restarting the controller.
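> > > A quick way to confirm the hourly clustering is to bucket the
> > > errors by hour straight from the log (assuming it lives at
> > > /var/log/slurm/slurmctld.log; adjust to your SlurmctldLogFile):
> > >
> > > # count 'missing socket' errors per YYYY-MM-DDTHH bucket
> > > grep 'Unexpected missing socket error' /var/log/slurm/slurmctld.log | cut -c2-14 | sort | uniq -c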
> > > Many thanks in advance
> > >
> > > Jason
> > >
> > > Jason Ellul
> > > Head - Research Computing Facility
> > > Office of Cancer Research
> > > Peter MacCallum Cancer Centre

--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
