Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:
> Hi
>
> Since mid-March 2019 we have been having a strange problem with Slurm.
> Sometimes the command "sbatch" fails:
>
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

I've seen such an error message from the underlying file system. Is there anything special (e.g. non-NFS) in your setup that may have changed in the past few months? Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script, which generates a second script that
> is submitted to Slurm. In our case, the submission script is called "42.job1".
>
> The problem we have is that sometimes the "sbatch" command fails with the
> message above. We couldn't find any hint in the logs. Hardware and software
> logs are clean. I increased the debug level of Slurm:
>
> # scontrol show config
> (...)
> SlurmctldDebug = info
>
> But still no clue about what is happening. Maybe the next thing to try is to
> use "sdiag" to inspect the server. Another complication is that the problem
> is random, so should we put "sdiag" in a cronjob? Is there a better way to
> run "sdiag" periodically?
>
> Thanks for your attention.
>
> Best Regards
>
> mg.

- S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
I had similar problems in the past. The two most common issues were:

1. Controller load - if slurmctld was under heavy use, it sometimes didn't respond in a timely manner, exceeding the timeout limit.
2. Topology, message forwarding, and aggregation.

For 2: it would seem the nodes designated for forwarding are statically assigned based on topology. I could be wrong, but that's my observation, as I would get the socket timeout error when they had issues, even though other nodes in the same topology 'zone' were OK and could have been used instead. It took debug3 to observe this in the logs, I think.

HTH,
--Dani_L.

On 6/11/19 5:27 PM, Steffen Grunewald wrote:
> I've seen such an error message from the underlying file system. Is there
> anything special (e.g. non-NFS) in your setup that may have changed in the
> past few months? [...]
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Hi Steffen

We are using Lustre as the underlying file system:

[root@teta2 ~]# cat /proc/fs/lustre/version
lustre: 2.7.19.11

Nothing has changed. I think this has been happening for a long time, but before it was very sporadic, and only recently became more frequent.

Best Regards

mg.

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Steffen Grunewald
Sent: Tuesday, 11 June 2019 16:28
To: Slurm User Community List
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

> I've seen such an error message from the underlying file system. Is there
> anything special (e.g. non-NFS) in your setup that may have changed in the
> past few months? Just a shot in the dark, of course... [...]
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Another possible cause (we currently see it on one of our clusters): delays in LDAP lookups. We have sssd on the machines, and occasionally, when sssd contacts the LDAP server, it takes 5 or 10 seconds (or even 15) before it gets an answer. If that happens while slurmctld is trying to look up some user or group, client commands depending on it will hang. The default message timeout is 10 seconds, so if the delay is longer than that, you get the timeout error.

We don't know why the delays are happening, but while we are debugging it, we've increased the MessageTimeout, which seems to have reduced the problem a bit. We're also experimenting with GroupUpdateForce and GroupUpdateTime to reduce the number of times slurmctld needs to ask about groups, but I'm unsure how much that helps.

-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
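[Editor's note] The settings mentioned above are slurm.conf parameters. A hedged sketch of what one might try while debugging - the numbers below are illustrative, not values reported in this thread:

```shell
# slurm.conf fragment (illustrative values, not recommendations from
# the thread -- tune against your own sdiag numbers):
MessageTimeout=30     # client/daemon RPC timeout; the default is 10 seconds
GroupUpdateForce=1    # refresh AllowGroups membership even if /etc/group is unchanged
GroupUpdateTime=600   # seconds between group membership cache refreshes
```

After editing, `scontrol reconfigure` (or a slurmctld restart) is typically needed for the change to take effect.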
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Hi, you may want to look into increasing the sssd cache length on the nodes, and improving the network connectivity to your LDAP directory. I recall when playing with sssd in the past that it wasn't actually caching. Verify with tcpdump and an "ls -l" through a directory. Once the uid/gid is resolved, it shouldn't be hitting the directory anymore until the cache expires.

Do the nodes NAT through the head node?

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 6/12/19, 1:56 AM, "slurm-users on behalf of Bjørn-Helge Mevik" wrote:

    Another possible cause (we currently see it on one of our clusters):
    delays in ldap lookups. [...]
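[Editor's note] A quick way to check whether lookups are being served from the sssd cache, without tcpdump, is simply to time them: a cached answer normally comes back in a millisecond or less, while a round trip to the LDAP server takes far longer. A rough sketch (getent is standard; which backend it hits depends on /etc/nsswitch.conf, and the GNU `date +%s%N` nanosecond format is assumed):

```shell
#!/bin/sh
# Measure how long a single passwd lookup takes, in milliseconds.
# On an sssd system, a warm cache should answer much faster than a
# fresh trip to the LDAP server.
lookup_ms() {
    start=$(date +%s%N)                  # nanoseconds since epoch (GNU date)
    getent passwd "$1" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))  # convert ns -> ms
}

# Time the same lookup twice; the second call should normally be
# served from cache and come back noticeably faster.
first=$(lookup_ms root)
second=$(lookup_ms root)
echo "first=${first}ms second=${second}ms"
```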
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Hi,

we hit the same issue, up to 30,000 entries per day in the slurmctld log. When we first used SL6 (Scientific Linux), we had massive problems with sssd, which often crashed. We therefore decided to get rid of sssd and manually fill /etc/passwd and /etc/group via cronjob. So, yes, we have LDAP, but it can't be the issue in our case, since user and group lookups are done locally.

Best
Marcus

On 6/12/19 3:36 PM, Christopher Benjamin Coffey wrote:
> Hi, you may want to look into increasing the sssd cache length on the
> nodes, and improving the network connectivity to your ldap directory. [...]

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Hi,

My group is struggling with this also.

The worst part, which no one has brought up yet, is that the sbatch command does not necessarily fail to submit the job in this situation. In fact, most of the time (for us), it succeeds. There appears to be some sort of race condition or something else going on. The job is often (maybe most of the time?) submitted just fine, but sbatch returns a non-zero status (meaning the submission failed) and reports the error message.

From a workflow management perspective this is an absolute disaster that leads to workflow corruption and messes that are difficult to clean up. Workflow management systems rely on the exit status of sbatch to tell the truth about whether a job submission succeeded or not. If submission fails, the workflow manager will resubmit the job; if it succeeds, it expects a job ID to be returned. Because sbatch falsely reports failure when these events happen, workflow management systems think the submission failed and resubmit the job. This causes two copies of the same job to run at the same time, each trampling over the other and causing a cascade of other failures that become difficult to deal with.

The problem is that the job submission request has already been received by the time sbatch dies with that error. So, the timeout happens after the job request has already been made. I don't know how one would solve this problem. In my experience interfacing various batch schedulers to workflow management systems, I've learned that attempting to time out qsub/sbatch/bsub/etc. commands always leads to a race condition. You can't time it out (barring ridiculously long timeouts to catch truly pathological scenarios), because the request has already been sent and received; it's the response that never makes it back to you.

Because of the race condition there is probably no way to guarantee that failure really means failure and success really means success while also using a timeout. The best option I know of is to never (meaning a finite, but very long, time) time out a job submission command; just wait for the response. That's the only way to get the correct answer.

One way I'm working around this is to inject a long random string into the --comment option. Then, if I see the socket timeout, I use squeue to look for that job and retrieve its ID. It's not ideal, but it can work.

Chris
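[Editor's note] A minimal sketch of the "--comment token" workaround described above. The function names, the token scheme, and the 5-second grace period are illustrative; it assumes `sbatch --parsable` (prints only the job ID) and squeue's `%k` format field (the job comment), both of which exist in recent Slurm versions:

```shell
#!/bin/sh
# Submit a job with a unique comment token so that a "Socket timed out"
# error from sbatch can be disambiguated after the fact.

# Generate a token unique enough to identify one submission attempt.
gen_token() {
    echo "wfm-$$-$(date +%s%N)"
}

submit_with_verify() {
    script="$1"
    token=$(gen_token)

    if jobid=$(sbatch --parsable --comment="$token" "$script"); then
        echo "$jobid"
        return 0
    fi

    # sbatch reported failure -- but the job may have been accepted
    # anyway (the race described above). Look for our token in the
    # queue before trusting the non-zero exit status.
    sleep 5
    jobid=$(squeue -h -u "$USER" -o "%i %k" | awk -v t="$token" '$2 == t {print $1}')
    if [ -n "$jobid" ]; then
        echo "$jobid"            # the submission actually succeeded
        return 0
    fi
    return 1                     # genuinely not submitted; safe to retry
}
```

Usage would be `jobid=$(submit_with_verify 50.job1) || resubmit`, leaving the workflow manager with an exit status it can actually trust.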
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT, which is only ever raised by slurm_send_timeout() and slurm_recv_timeout(). Those functions raise that error when a generic socket-based send/receive operation exceeds an arbitrary time limit imposed by the caller. The functions use gettimeofday() to grab an initial timestamp, and on each iteration of the poll() loop they call gettimeofday() again, calculating a delta from the initial and current values and subtracting it from the timeout period.

Do you have any reason to suspect that your local times are fluctuating on the cluster? That use of gettimeofday() to calculate actual time deltas is not recommended, for that very reason:

    NOTES
        The time returned by gettimeofday() is affected by discontinuous
        jumps in the system time (e.g., if the system administrator manually
        changes the system time). If you need a monotonically increasing
        clock, see clock_gettime(2).

> On Jun 13, 2019, at 10:47 AM, Christopher Harrop - NOAA Affiliate wrote:
>
> My group is struggling with this also.
>
> The worst part of this, which no one has brought up yet, is that the sbatch
> command does not necessarily fail to submit the job in this situation. [...]

::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
> ...
> One way I'm using to work around this is to inject a long random string
> into the --comment option. Then, if I see the socket timeout, I use squeue
> to look for that job and retrieve its ID. It's not ideal, but it can work.

I would have expected a different approach: use a unique string for the jobname, and always verify after submission. After all, squeue provides a --name parameter for this (efficient query by logical job "identity").

regards, mark hahn.
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
I agree with Christopher Coffey - look at the sssd caching. I have had experience with sssd and can help a bit. Also, if you are seeing long waits, could you have nested groups? sssd is notorious for not handling these well, and there are settings in the configuration file which you can experiment with.

On Thu, 13 Jun 2019 at 16:52, Mark Hahn wrote:
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission. After all, squeue provides
> a --name parameter for this (efficient query by logical job "identity"). [...]
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
> ...
>> One way I'm using to work around this is to inject a long random string
>> into the --comment option. Then, if I see the socket timeout, I use squeue
>> to look for that job and retrieve its ID. It's not ideal, but it can work.
>
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission. After all, squeue provides
> a --name parameter for this (efficient query by logical job "identity").

The job name is already in use, and it is not unique, because there may be many copies of a workflow running at the same time by the same user. There is essentially no difference between verifying a match with the job name and a match with the comment; it's just a different field of the output you're looking at, which you can control with format options.
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Christopher Benjamin Coffey writes:

> Hi, you may want to look into increasing the sssd cache length on the
> nodes,

We have thought about that, but it will not solve the problem, only make it less frequent, I think.

> and improving the network connectivity to your ldap directory.

That is something we are investigating, yes.

> I recall when playing with sssd in the past that it wasn't actually
> caching. Verify with tcpdump, and "ls -l" through a directory. Once the
> uid/gid is resolved, it shouldn't be hitting the directory anymore till
> the cache expires.

We turned up the logging of the AD backend, and the logs indicate that the caching works in our case: the first time you look up a user/group in a while, the backend gets the request, but subsequent lookups never reach the backend (at least not according to the logs), which should mean that sssd has cached the info.

> Do the nodes NAT through the head node?

We do, but we see the sssd delays on the head node as well, and on other nodes outside the cluster that use the same LDAP/AD servers. But we _do_ have a quite complicated network setup due to security, so there might be something there. I'm currently trying to get my hands on the logs from the servers themselves to see whether they actually get the requests at the time the sssd backend claims to make them.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
Hi Chris

You are right in pointing out that the job actually runs, despite the error from sbatch. The customer mentions that:

=== start ===
Problem had usual scenario - job script was submitted and executed, but sbatch command returned non-zero exit status to ecflow, which thus assumed job to be dead.
=== end ===

Which version of Slurm are you using? I'm using "17.02.4-1", and we are wondering about the possibility of upgrading to a newer version; that is, I hope that there was a bug and SchedMD fixed the problem.

Best Regards

mg.

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Christopher Harrop - NOAA Affiliate
Sent: Thursday, 13 June 2019 16:47
To: Slurm User Community List
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

> Hi,
>
> My group is struggling with this also.
>
> The worst part of this, which no one has brought up yet, is that the sbatch
> command does not necessarily fail to submit the job in this situation. [...]
Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"
> Hi Chris
>
> You are right in pointing that the job actually runs, despite of the error
> in the sbatch. [...]
>
> Which version of slurm are you using? I'm using "17.02.4-1", and we are
> wondering about the possibility of upgrading to a newer version, that is, I
> hope that there was a bug and Schedmd fixed the problem.

Sorry I missed that. I am not the admin of the system, but I believe we are using 18.08.7. I believe we have a ticket open with SchedMD, and our admin team is working with them on it. I believe the approach being taken is to capture statistics with sdiag and use that information to tune configuration parameters. It is my understanding that they view the problem as a configuration issue rather than a bug in the scheduler. What this means to me is that the timeouts can only be minimized, not eliminated. And because workflow corruption is such a disastrous event, I have built in attempts to work around it even though occurrences are "rare".

Chris
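[Editor's note] Several posts in this thread suggest capturing sdiag output periodically to correlate timeouts with controller load. A minimal sketch of how that might be done from cron (the log path and install location are assumptions, not from the thread):

```shell
#!/bin/sh
# Append a timestamped sdiag snapshot to a log, so that a later
# "Socket timed out" error can be matched against controller load
# at that moment. sdiag itself is a standard Slurm client command.
snapshot() {
    printf '=== %s ===\n' "$(date -u +%FT%TZ)"
    sdiag 2>&1
}

# Example crontab entry (every 5 minutes; paths are illustrative):
#   */5 * * * *  /usr/local/sbin/sdiag-snapshot.sh >> /var/log/slurm/sdiag.log 2>&1
```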