Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Steffen Grunewald
On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:
> Hi 
> 
> Since mid-March 2019 we are having a strange problem with slurm. Sometimes, 
> the command "sbatch" fails:
> 
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p 
> operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv 
> operation

I've seen such an error message from the underlying file system.
Is there anything special (e.g. non-NFS) in your setup that may have changed
in the past few months?

Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script which generates a second script that 
> is submitted to slurm. In our case, the submission script is called 
> "42.job1". 
> 
> The problem we have is that sometimes, the "sbatch" command fails with the 
> message above. We couldn't find any hint in the logs. Hardware and software 
> logs are clean. I increased the debug level of slurm, to 
> # scontrol show config
> (...)
> SlurmctldDebug  = info
> 
> But still no clue about what is happening. Maybe the next thing to try is to 
> use "sdiag" to inspect the server. Another complication is that the problem 
> is random, so should we put "sdiag" in a cronjob? Is there a better way to run 
> "sdiag" periodically?
> 
> Thanks for your attention.
> 
> Best Regards
> 
> mg.
> 

- S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~



Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Daniel Letai
I had similar problems in the past.
The two most common issues were:
1. Controller load - if slurmctld was under heavy use, it sometimes didn't
   respond in a timely manner, exceeding the timeout limit.
2. Topology, message forwarding, and aggregation.

For 2 - it would seem the nodes designated for forwarding are statically
assigned based on topology. I could be wrong, but that's my observation: I
would get the socket timeout error when those nodes had issues, even though
other nodes in the same topology 'zone' were OK and could have been used
instead.

It took debug3 to observe this in the logs, I think.

HTH
--Dani_L.



On 6/11/19 5:27 PM, Steffen Grunewald wrote:

> On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:
> > Hi 
> > 
> > Since mid-March 2019 we are having a strange problem with slurm. Sometimes, the command "sbatch" fails:
> > 
> > + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> > sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
> 
> I've seen such an error message from the underlying file system.
> Is there anything special (e.g. non-NFS) in your setup that may have changed
> in the past few months?
> 
> Just a shot in the dark, of course...
> 
> > Ecflow runs preprocessing on the script which generates a second script that is submitted to slurm. In our case, the submission script is called "42.job1". 
> > 
> > The problem we have is that sometimes, the "sbatch" command fails with the message above. We couldn't find any hint in the logs. Hardware and software logs are clean. I increased the debug level of slurm, to 
> > # scontrol show config
> > (...)
> > SlurmctldDebug  = info
> > 
> > But still no clue about what is happening. Maybe the next thing to try is to use "sdiag" to inspect the server. Another complication is that the problem is random, so should we put "sdiag" in a cronjob? Is there a better way to run "sdiag" periodically?
> > 
> > Thanks for your attention.
> > 
> > Best Regards
> > 
> > mg.
> 
> - S




  




Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Marcelo Garcia
Hi Steffen

We are using Lustre as underlying file system:
[root@teta2 ~]# cat /proc/fs/lustre/version
lustre: 2.7.19.11

Nothing has changed. I think this has been happening for a long time, but it 
was very sporadic before; only recently has it become more frequent. 

Best Regards

mg.


-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Steffen Grunewald
Sent: Tuesday, 11 June 2019 16:28
To: Slurm User Community List 
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on 
send/recv operation"

On Tue, 2019-06-11 at 13:56:34 +, Marcelo Garcia wrote:
> Hi 
> 
> Since mid-March 2019 we are having a strange problem with slurm. Sometimes, 
> the command "sbatch" fails:
> 
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p 
> operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv 
> operation

I've seen such an error message from the underlying file system.
Is there anything special (e.g. non-NFS) in your setup that may have changed
in the past few months?

Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script which generates a second script that 
> is submitted to slurm. In our case, the submission script is called 
> "42.job1". 
> 
> The problem we have is that sometimes, the "sbatch" command fails with the 
> message above. We couldn't find any hint in the logs. Hardware and software 
> logs are clean. I increased the debug level of slurm, to 
> # scontrol show config
> (...)
> SlurmctldDebug  = info
> 
> But still no clue about what is happening. Maybe the next thing to try is to 
> use "sdiag" to inspect the server. Another complication is that the problem 
> is random, so should we put "sdiag" in a cronjob? Is there a better way to run 
> "sdiag" periodically?
> 
> Thanks for your attention.
> 
> Best Regards
> 
> mg.
> 

- S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~






Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Bjørn-Helge Mevik
Another possible cause (we currently see it on one of our clusters):
delays in ldap lookups.

We have sssd on the machines, and occasionally, when sssd contacts the
ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
answer.  If that happens because slurmctld is trying to look up some
user or group, etc, client commands depending on it will hang.  The
default message timeout is 10 seconds, so if the delay is more than
that, you get the timeout error.

We don't know why the delays are happening, but while we are debugging
it, we've increased the MessageTimeout, which seems to have reduced the
problem a bit.  We're also experimenting with GroupUpdateForce and
GroupUpdateTime to reduce the number of times slurmctld needs to ask
about groups, but I'm unsure how much that helps.
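
For reference, a quick way to check the knobs mentioned here, plus illustrative
slurm.conf lines. The values are examples only, not recommendations; apply them
with care and reconfigure or restart the daemons as appropriate for your site.

  # Show the current settings on the controller
  scontrol show config | grep -E 'MessageTimeout|GroupUpdateForce|GroupUpdateTime'

  # Example slurm.conf entries (placeholder values):
  #   MessageTimeout=30
  #   GroupUpdateForce=1
  #   GroupUpdateTime=600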

-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Christopher Benjamin Coffey
Hi, you may want to look into increasing the sssd cache length on the nodes, 
and improving the network connectivity to your ldap directory. I recall when 
playing with sssd in the past that it wasn't actually caching. Verify with 
tcpdump, and "ls -l" through a directory. Once the uid/gid is resolved, it 
shouldn't be hitting the directory anymore till the cache expires.
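
A rough sketch of that check; the interface name, the directory path and the
LDAP/LDAPS ports are assumptions to adjust for your site.

  tcpdump -ni eth0 'tcp port 389 or tcp port 636' & tcpdump_pid=$!
  ls -l /home2/mma002 > /dev/null    # first run may query the directory
  ls -l /home2/mma002 > /dev/null    # with a warm sssd cache: ideally no new LDAP packets
  kill "$tcpdump_pid"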

Do the nodes NAT through the head node?

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/12/19, 1:56 AM, "slurm-users on behalf of Bjørn-Helge Mevik"  wrote:

Another possible cause (we currently see it on one of our clusters):
delays in ldap lookups.

We have sssd on the machines, and occasionally, when sssd contacts the
ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
answer.  If that happens because slurmctld is trying to look up some
user or group, etc, client commands depending on it will hang.  The
default message timeout is 10 seconds, so if the delay is more than
that, you get the timeout error.

We don't know why the delays are happening, but while we are debugging
it, we've increased the MessageTimeout, which seems to have reduced the
problem a bit.  We're also experimenting with GroupUpdateForce and
GroupUpdateTime to reduce the number of times slurmctld needs to ask
about groups, but I'm unsure how much that helps.

-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Marcus Wagner

Hi,

we hit the same issue, with up to 30,000 entries per day in the slurmctld log.

When we first used SL6 (Scientific Linux), we had massive problems with 
sssd, which crashed often.
We therefore decided to get rid of sssd and fill /etc/passwd and 
/etc/group manually via a cronjob.


So, yes, we do have ldap, but it can't be the issue in our case, since 
user and group lookups are done locally.
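
A deliberately simplified sketch of that kind of cron-driven sync, assuming NSS
enumeration still works on the host; it only dumps the directory view to
temporary files, since a real script must merge carefully with the local system
accounts and handle failures.

  # e.g. /etc/cron.hourly/dump-directory-accounts (illustrative name)
  getent passwd > /var/tmp/passwd.directory
  getent group  > /var/tmp/group.directory
  # ...then merge these entries into /etc/passwd and /etc/group as appropriate...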


Best
Marcus

On 6/12/19 3:36 PM, Christopher Benjamin Coffey wrote:

Hi, you may want to look into increasing the sssd cache length on the nodes, and 
improving the network connectivity to your ldap directory. I recall when playing with 
sssd in the past that it wasn't actually caching. Verify with tcpdump, and "ls -l" 
through a directory. Once the uid/gid is resolved, it shouldn't be hitting the 
directory anymore till the cache expires.

Do the nodes NAT through the head node?

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
  


On 6/12/19, 1:56 AM, "slurm-users on behalf of Bjørn-Helge Mevik"  wrote:

 Another possible cause (we currently see it on one of our clusters):
 delays in ldap lookups.
 
 We have sssd on the machines, and occasionally, when sssd contacts the
 ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
 answer.  If that happens because slurmctld is trying to look up some
 user or group, etc, client commands depending on it will hang.  The
 default message timeout is 10 seconds, so if the delay is more than
 that, you get the timeout error.
 
 We don't know why the delays are happening, but while we are debugging
 it, we've increased the MessageTimeout, which seems to have reduced the
 problem a bit.  We're also experimenting with GroupUpdateForce and
 GroupUpdateTime to reduce the number of times slurmctld needs to ask
 about groups, but I'm unsure how much that helps.
 
 --
 Bjørn-Helge Mevik, dr. scient,
 Department for Research Computing, University of Oslo
 



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de




Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Christopher Harrop - NOAA Affiliate
Hi,

My group is struggling with this also.  

The worst part of this, which no one has brought up yet, is that the sbatch 
command does not necessarily fail to submit the job in this situation.  In 
fact, most of the time (for us), it succeeds.  There appears to be some sort of 
race condition or something else going on.  The job is often (maybe most of the 
time?) submitted just fine, but sbatch returns a non-zero status (meaning the 
submission failed) and reports the error message.  

From a workflow management perspective this is an absolute disaster that leads 
to workflow corruption and messes that are difficult to clean up.  Workflow 
management systems rely on the status for sbatch to tell the truth about 
whether a job submission succeeded or not.  If submission fails the workflow 
manager will resubmit the job, and if it succeeds it expects a jobid to be 
returned.  Because sbatch usually lies about the failure of job submission when 
these events happen, workflow management systems think the submission failed 
and then resubmit the job.  This causes two copies of the same job to be 
running at the same time, each trampling over the other and causing a cascade 
of other failures that become difficult to deal with.

The problem is that the job submission request has already been received by the 
time sbatch dies with that error.  So, the timeout happens after the job 
request has already been made.  I don’t know how one would solve this problem.  
In my experience in interfacing various batch schedulers to workflow management 
systems I’ve learned that attempting to time out qsub/sbatch/bsub/etc commands 
always leads to a race condition. You can’t time it out (barring ridiculously 
long timeouts to catch truly pathological scenarios) because the request has 
already been sent and received; it’s the response that never makes it back to 
you.  Because of the race condition there is probably no way to guarantee that 
failure really means failure and success really means success and use a timeout 
that guarantees failure.  The best option that I know of is to never (this 
means a finite, but long, time) time out a job submission command; just wait 
for the response.  That’s the only way to get the correct response.

One way I’m using to work around this is to inject a long random string into 
the --comment option.  Then, if I see the socket timeout, I use squeue to look 
for that job and retrieve its ID.  It’s not ideal, but it can work.
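
A minimal sketch of that workaround; the use of uuidgen, the job script name and
the 30-second wait are placeholders. sbatch --parsable prints just the job id,
and squeue's "%i %k" format prints job id and comment.

  tag=$(uuidgen)
  jobid=$(sbatch --parsable --comment="$tag" job.sh)
  if [ $? -ne 0 ] || [ -z "$jobid" ]; then
      # sbatch reported failure, but the job may have been accepted anyway:
      # look for our tag in the comment field before resubmitting.
      sleep 30
      jobid=$(squeue -h -u "$USER" -o '%i %k' | awk -v t="$tag" '$2 == t {print $1}')
  fi
  echo "job id: ${jobid:-unknown}"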

Chris



Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Jeffrey Frey
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT, 
which is only ever raised by slurm_send_timeout() and slurm_recv_timeout().  
Those functions raise that error when a generic socket-based send/receive 
operation exceeds an arbitrary time limit imposed by the caller.  The functions 
use gettimeofday() to grab an initial timestamp and on each iteration of the 
poll() loop call gettimeofday() again, calculating a delta from the initial and 
current values returned by that function and subtracting from the timeout 
period.


Do you have any reason to suspect that your local times are fluctuating on the 
cluster?  That use of gettimeofday() to calculate actual time deltas is not 
recommended for that very reason:


NOTES
       The time returned by gettimeofday() is affected by discontinuous jumps
       in the system time (e.g., if the system administrator manually changes
       the system time).  If you need a monotonically increasing clock, see
       clock_gettime(2).
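
If clock jumps are a suspect, a couple of quick, site-dependent sanity checks on
the controller; which NTP client is present is an assumption, use whatever your
site actually runs.

  timedatectl status | grep -i synchronized    # systemd hosts
  chronyc tracking 2>/dev/null | head -n 5     # if chrony is the NTP client
  ntpq -p 2>/dev/null                          # if classic ntpd is in use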







> On Jun 13, 2019, at 10:47 AM, Christopher Harrop - NOAA Affiliate 
>  wrote:
> 
> Hi,
> 
> My group is struggling with this also.  
> 
> The worst part of this, which no one has brought up yet, is that the sbatch 
> command does not necessarily fail to submit the job in this situation.  In 
> fact, most of the time (for us), it succeeds.  There appears to be some sort 
> of race condition or something else going on.  The job is often (maybe most 
> of the time?) submitted just fine, but sbatch returns a non-zero status 
> (meaning the submission failed) and reports the error message.  
> 
> From a workflow management perspective this is an absolute disaster that 
> leads to workflow corruption and messes that are difficult to clean up.  
> Workflow management systems rely on the status for sbatch to tell the truth 
> about whether a job submission succeeded or not.  If submission fails the 
> workflow manager will resubmit the job, and if it succeeds it expects a jobid 
> to be returned.  Because sbatch usually lies about the failure of job 
> submission when these events happen, workflow management systems think the 
> submission failed and then resubmit the job.  This causes two copies of the 
> same job to be running at the same time, each trampling over the other and 
> causing a cascade of other failures that become difficult to deal with.
> 
> The problem is that the job submission request has already been received by 
> the time sbatch dies with that error.  So, the timeout happens after the job 
> request has already been made.  I don’t know how one would solve this 
> problem.  In my experience in interfacing various batch schedulers to 
> workflow management systems I’ve learned that attempting to time out 
> qsub/sbatch/bsub/etc commands always leads to a race condition. You can’t 
> time it out (barring ridiculously long timeouts to catch truly pathological 
> scenarios) because the request has already been sent and received; it’s the 
> response that never makes it back to you.  Because of the race condition 
> there is probably no way to guarantee that failure really means failure and 
> success really means success and use a timeout that guarantees failure.  The 
> best option that I know of is to never (this means a finite, but long, time) 
> time out a job submission command; just wait for the response.  That’s the 
> only way to get the correct response.
> 
> One way I’m using to work around this is to inject a long random string into 
> the --comment option.  Then, if I see the socket timeout, I use squeue to look 
> for that job and retrieve its ID.  It’s not ideal, but it can work.
> 
> Chris
> 


::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::






Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Mark Hahn

On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
...

> One way I’m using to work around this is to inject a long random string
> into the --comment option.  Then, if I see the socket timeout, I use squeue
> to look for that job and retrieve its ID.  It’s not ideal, but it can work.


I would have expected a different approach: use a unique string for the
jobname, and always verify after submission.  After all, squeue provides
a --name parameter for this (an efficient query by logical job "identity").
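
Roughly, and purely as an illustration (the name prefix, wait time and job
script are placeholders):

  name="wf-$(uuidgen)"
  if ! sbatch --job-name="$name" job.sh; then
      sleep 30
      squeue -h --name="$name" -o '%i %T'   # did it land anyway?
  fi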

regards, mark hahn.



Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread John Hearns
I agree with Christopher Coffey - look at the sssd caching.
I have had experience with sssd and can help a bit.
Also if you are seeing long waits could you have nested groups?
sssd is notorious for not handling these well, and there are settings in
the configuration file which you can experiment with.

On Thu, 13 Jun 2019 at 16:52, Mark Hahn  wrote:

> On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote:
> ...
> > One way I’m using to work around this is to inject a long random string
> > into the --comment option.  Then, if I see the socket timeout, I use squeue
> > to look for that job and retrieve its ID.  It’s not ideal, but it can
> > work.
>
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission.  after all, squeue provides
> a --name parameter for this (efficient query by logical job "identity").
>
> regards, mark hahn.
>
>


Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Christopher W. Harrop
> ...
>> One way I’m using to work around this is to inject a long random string
>> into the --comment option.  Then, if I see the socket timeout, I use squeue
>> to look for that job and retrieve its ID.  It’s not ideal, but it can work.
> 
> I would have expected a different approach: use a unique string for the
> jobname, and always verify after submission.  After all, squeue provides
> a --name parameter for this (an efficient query by logical job "identity").

The job name is already in use, and it is not unique because there may be many 
copies of a workflow running at the same time by the same user.   There is 
essentially no difference between verifying a match with jobname and a match 
with the comment; it’s just a different field of the output you’re looking at, 
which you can control with format options.




Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Bjørn-Helge Mevik
Christopher Benjamin Coffey  writes:

> Hi, you may want to look into increasing the sssd cache length on the
> nodes,

We have thought about that, but it will not solve the problem, only make
it less frequent, I think.

> and improving the network connectivity to your ldap
> directory.

That is something we are investigating, yes.

> I recall when playing with sssd in the past that it wasn't
> actually caching. Verify with tcpdump, and "ls -l" through a
> directory. Once the uid/gid is resolved, it shouldn't be hitting the
> directory anymore till the cache expires.

We turned up the logging of the AD backend, and the logs indicate that
the caching works in our case: the first time you look up a user/group in a
while, the backend gets the request, but subsequent lookups never reach
the backend (at least not according to the logs), which should mean that
sssd has cached the info.

> Do the nodes NAT through the head node?

We do, but we see the sssd delays on the head node as well, and on other
nodes outside the cluster that use the same ldap/AD servers.  But we
_do_ have a quite complicated network setup due to security, so there
might be something there.  I'm currently trying to get my hands on the
logs from the servers themselves to see whether they actually get the
requests at the time when the sssd backend claims to make them.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Marcelo Garcia
Hi Chris

You are right in pointing out that the job actually runs, despite the error 
from sbatch. The customer mentions that:
=== start ===
Problem had the usual scenario - the job script was submitted and executed, but the 
sbatch command returned a non-zero exit status to ecflow, which thus assumed the 
job to be dead.
=== end ===

Which version of Slurm are you using? I'm using "17.02.4-1", and we are 
wondering about the possibility of upgrading to a newer version; that is, I 
hope that there was a bug and SchedMD has since fixed the problem.

Best Regards

mg.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Christopher Harrop - NOAA Affiliate
Sent: Thursday, 13 June 2019 16:47
To: Slurm User Community List 
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on 
send/recv operation"

Hi,

My group is struggling with this also.  

The worst part of this, which no one has brought up yet, is that the sbatch 
command does not necessarily fail to submit the job in this situation.  In 
fact, most of the time (for us), it succeeds.  There appears to be some sort of 
race condition or something else going on.  The job is often (maybe most of the 
time?) submitted just fine, but sbatch returns a non-zero status (meaning the 
submission failed) and reports the error message.  

From a workflow management perspective this is an absolute disaster that leads 
to workflow corruption and messes that are difficult to clean up.  Workflow 
management systems rely on the status for sbatch to tell the truth about 
whether a job submission succeeded or not.  If submission fails the workflow 
manager will resubmit the job, and if it succeeds it expects a jobid to be 
returned.  Because sbatch usually lies about the failure of job submission when 
these events happen, workflow management systems think the submission failed 
and then resubmit the job.  This causes two copies of the same job to be 
running at the same time, each trampling over the other and causing a cascade 
of other failures that become difficult to deal with.

The problem is that the job submission request has already been received by the 
time sbatch dies with that error.  So, the timeout happens after the job 
request has already been made.  I don’t know how one would solve this problem.  
In my experience in interfacing various batch schedulers to workflow management 
systems I’ve learned that attempting to time out qsub/sbatch/bsub/etc commands 
always leads to a race condition. You can’t time it out (barring ridiculously 
long timeouts to catch truly pathological scenarios) because the request has 
already been sent and received; it’s the response that never makes it back to 
you.  Because of the race condition there is probably no way to guarantee that 
failure really means failure and success really means success and use a timeout 
that guarantees failure.  The best option that I know of is to never (this 
means a finite, but long, time) time out a job submission command; just wait 
for the response.  That’s the only way to get the correct response.

One way I’m using to work around this is to inject a long random string into 
the --comment option.  Then, if I see the socket timeout, I use squeue to look 
for that job and retrieve its ID.  It’s not ideal, but it can work.

Chris





Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Christopher Harrop - NOAA Affiliate
> Hi Chris
> 
> You are right in pointing out that the job actually runs, despite the error 
> from sbatch. The customer mentions that:
> === start ===
> Problem had the usual scenario - the job script was submitted and executed, but the 
> sbatch command returned a non-zero exit status to ecflow, which thus assumed the 
> job to be dead.
> === end ===
> 
> Which version of Slurm are you using? I'm using "17.02.4-1", and we are 
> wondering about the possibility of upgrading to a newer version; that is, I 
> hope that there was a bug and SchedMD has since fixed the problem.

Sorry I missed that.  I am not the admin of the system, but I believe we are 
using 18.08.7.  I believe we have a ticket open with SchedMD and our admin team 
is working with them.  And I believe the approach being taken is to capture 
statistics with sdiag and use that info to tune configuration parameters.  It 
is my understanding that they view the problem as a configuration issue rather 
than a bug in the scheduler.  What this means to me is that the timeouts can 
only be minimized, not eliminated.  And because workflow corruption is such a 
disastrous event, I have built in attempts to try to work around it even though 
occurrences are “rare”.  
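
One simple way to capture sdiag statistics over time, as a sketch only; the
path and interval are placeholders, and note that % signs must be escaped in
crontab entries.

  # /etc/cron.d/slurm-sdiag (illustrative)
  */10 * * * * root /usr/bin/sdiag >> /var/log/slurm/sdiag-$(date +\%F).log 2>&1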

Chris