Eric Wong wrote:
> Bob Proulx wrote:
> > The hardware problems have been isolated and corrected. All of the
> > systems are back online and operating normally.
>
> Thanks, curious if you could provide any post-mortem on what
> went wrong, how it was fixed, how to avoid it in the future,
> etc...
Since all of the work was done by the FSF admins I was vague, because I
only have arm's-length knowledge of it all. I am one of the Savannah
admins but I have no access to the hardware. In this I was just "The
Middleman". (An obscure reference to a TV series.) Ruben focused on
debugging and solving the problem. I was simply communicating.

Here is the environment as I understand it. Savannah and lists
(lists.gnu.org) were moved on Friday the 6th to a four-SSD RAID10 set
of temporary storage being used to host us. It is officially temporary
because a new storage array is planned. The previous storage was
starting to show various problems and so we were migrated. Three of
the SSDs in the RAID10 were Intel and one was a non-Intel SSD. The
non-Intel SSD started to return corrupt data. This was silent data
corruption: there were no errors from the device, just incorrect data.
(I personally have only had good experiences with Intel SSDs and
consider them the gold standard.)

I analyzed some of the corrupted files and a typical corruption was a
16-byte strip that was munged. Here is an example from one file (a
note on reproducing this kind of comparison is further below).

-02ae80 ca 39 6b 0c 47 ed ca 70 67 5c cb b4 25 21 46 ce  >.9k.G..pg\..%!F.<
+02ae80 e9 1a a7 9f c4 cb b7 1e 06 01 33 6c 35 0c a7 f4  >..........3l5...<

All other bytes were identical. Your guess is as good as mine as to
the underlying cause, but for the device to corrupt data silently like
this it is definitely bad SSD firmware.

This was the source of widespread problems. Git checkouts failed their
hash checks. Downloads of tar.gz files failed verification against
their associated signatures. This really underscores the need to
verify signatures.

I imagine this is a good data point for paranoid file systems that do
more data-integrity checking in the file system instead of trusting
that the hardware is sane. Traditionally spinning media will detect
data corruption and report an error, and the file system can trust it.
The new age of software-heavy SSD firmware means more possibilities of
software bugs in the firmware. A new SSD vendor may not be experienced
with this type of problem. Good quality SSDs will not suffer from it.
It is mostly a software quality issue.

After Ruben isolated the problem to the bad SSD he removed it from the
array. That removed the source of the data corruption. The remaining
devices in the array were okay. Data had been written correctly to
both sides of the mirrored pair; it was only the bad device that was
returning corrupted data when read. Removing it restored sanity to the
system. That was fixed this last Friday. On Monday the device was
backfilled with a new SSD and redundancy was once again available in
case of another failure.

> I noticed your message[1] in bug-bash stating you guys hit the
> xinetd limit for git-daemon due to networking issues:
>
> http://mid.gmane.org/[email protected]

Just to be clear, that was a completely independent problem. That was
at the FSF router and affected everything. It was not related to the
SSD failure. When it rains, it pours.

> Was this with or without SO_KEEPALIVE on the sockets?

I can see now that it was without, but at the time I didn't know one
way or the other. I reduced the kernel tcp keepalive times. That
seemed to help, which implied that git-daemon had keepalives enabled.
But obviously it did not. A lot of stuff was happening at the same
time, and with so much happening simultaneously it wasn't really
possible to tell whether the change helped or not.
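(Side note on the corruption analysis above: the comparison was between
a corrupted copy of a file and a known-good copy. Something along
these lines will reproduce that style of output; the file names here
are only placeholders:

  diff -u <(od -A x -t x1z good-copy.tar.gz) <(od -A x -t x1z corrupt-copy.tar.gz)

The "z" suffix on the od output type adds the >printable< column, and
diff -u marks the 16-byte lines that differ with - and +, as above.)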
> By default, SO_KEEPALIVE still takes around 2 hours to detect a
> failed connection, so it's probably worth tweaking the
> /proc/sys/net/ipv4/tcp_keepalive_* knobs, too.

Yes. The defaults are very long. To start the tweaking I began with
these values.

  net.ipv4.tcp_keepalive_time = 600
  net.ipv4.tcp_keepalive_intvl = 60
  net.ipv4.tcp_keepalive_probes = 3

I may have made these too short for some networks. Comments?

> I use xinetd myself for git-daemon, and have always had
> SO_KEEPALIVE enabled in my /etc/xinetd.d/git via:
>
>   flags = KEEPALIVE

Ah... I was unaware of that! We have not had that enabled there.
Thank you for this hint. I think it is a very worthy addition. I will
enable it in our flags too and credit you for the suggestion.

> But I guess git-daemon should be doing setsockopt to enable
> SO_KEEPALIVE itself... I'll test + post patches to the git
> mailing list soonish. I added SO_KEEPALIVE to the client-side
> of git years ago, but forgot the daemon :x.

The strange thing was that the network issue tickled this problem in
many git-daemon processes, and the other vcs daemons didn't close
their connections either. However git is so popular that it is the
majority of all activity. Then, when the network problem was resolved,
the problem of hanging daemons went away completely. So regardless of
the keepalive issue something else was going on with the connection
state. Unfortunately I don't know what the root cause was. And maybe
it doesn't matter, since it was a cascade problem after the network
problem. It would probably all make sense if we knew it.

> I forgot git-daemon has --timeout/--init-timeout options which I guess
> weren't used? Yikes :x

It looks like they are not used. There is a global wrapper using the
command line 'timeout' however. It reaps processes that run for too
long. Note that the timeouts must accommodate clients on slow
connections pulling large projects.

> Anyways, I just proposed a patch over on [email protected] to enable
> SO_KEEPALIVE as well:
>
> http://mid.gmane.org/[email protected]

I think that is a very good idea. As to the patch, I myself wouldn't
have used a function for what is basically a one-and-a-half-statement
addition. I don't in my own code. I would rather see what the code is
doing right there than have to look up the contents of that very small
function. Obviously that is just a matter of style. The result is the
same.

Bob
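P.S. On the timing question: a dead peer is declared after roughly
tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
seconds of idleness. With the kernel defaults of 7200, 9 and 75 that
is 7200 + 9 * 75 = 7875 seconds, the roughly two hours Eric mentioned.
With the values above it is 600 + 3 * 60 = 780 seconds, about 13
minutes. Once KEEPALIVE is enabled in xinetd, something like this
should confirm it is taking effect on the git connections (9418 is the
git port; a suggestion, not something we run already):

  ss -o state established '( sport = :9418 )'

Connections with SO_KEEPALIVE set show a timer:(keepalive,...) field
in that output.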
