Paul Smith wrote:
> Bob Proulx wrote:
> > https://hostux.social/@fsfstatus
>
> Bob, note that the last update (7h ago) says things are still down:
>
> > Savannah services git, svn, hg, bzr, download and audio-video.gnu.org
> > are unavailable.
>
> I'm not sure if that's a typo, or if someone forgot to add the latest
> updates, or...?
I don't have access to make hostux.social updates there. That's the FSF sysadmin staff only. I'm just a volunteer out in the field. :-) I asked Michael to update it, and he did so during the time I have taken to write this email. It's updated now. I see that other systems were also affected last night.

The problem was that this was happening at midnight in my timezone, which is 2am in the US/Eastern timezone of the FSF sysadmin staff. Michael, the FSF admin staff person on call, was paged in to get the crashed kvmhost3 system back online. He did that, and that got the guest VMs online. Which was great! Thanks Michael!

But then we had this network storage server problem. That's a part of the infrastructure which I happen to have full admin access to, and I was also the person who set up that side of things, making me the logical person to work through the problem. (Me looks around the room shyly, since I am the one who set that up and it is a problem there. But I blame the specific version of the Trisquel system and Linux kernel running on that server, because none of the other servers I admin ever exhibit this problem.) It being after 2am there, I told Michael to get some sleep, because if I couldn't fix it by morning I would need him to be awake and thinking straight to tag back into the problem.

I continued to look at the problem. Initially it looked to me like a network problem, because I could see mount requests going from client to server but the server was not responding to those requests. That was a red herring and distracted me for a bit. At that moment I had not realized that the crash had also affected the nfs1 server, because nfs1 runs on a different host system, kvmhost2. But seemingly kvmhost2 also rebooted, which triggered the nfs1 reboot. And nfs1 has this bimodal boot behavior, sometimes coming up working and sometimes not; we just need to retire that entire system and retire that problem with it.
Just for the record though, it's been about three years (pre-pandemic!) since the last time an unscheduled crash affecting nfs1 caught us at an unexpected time with this problem. (Other crashes on other systems have happened, but kvmhost2 has been super reliable, rock solid.) Usually I am explicitly rebooting nfs1 myself, knowing I need to ensure it has booted into an okay state. Honestly I was slow to realize that this was the problem this time. And when I did realize it, I took a moment to try to debug it again, since I can only do that when the system is down, and it was down already. Unfortunately I wasn't able to diagnose it, and with it being after midnight for me and people starting to email in with problem reports, I simply rebooted it, retrying until it booted up in the okay working state. It took two more reboots before that happened. And then of course it was okay. It's always okay and reliable from then forward once it boots into the okay working mode.

Just because I am talking here, I will say that we already have an nfs2 system set up and running. It's in production, being used by other systems, and working great. If nfs1's problem shifted to where rebooting still did not resolve it, then we would move the SAN block devices from nfs1 over to nfs2 and shift to serving the data from there. We will do that eventually regardless. But it isn't a transparent move, and there would be a mad scramble to update symlinks and mount points. I would rather work through that during a scheduled daylight downtime than after midnight my time following an unexpected crash.

Just to show how everything is connected: the reason I posted here at all was because after clearing this problem I ran an anti-spam review of the mail queues, saw the message from Tongliang Liao to the list, approved it through, and then, because I had happened to see it, wrote a response.
There were also a few messages to the Savannah lists, which was expected, and I responded there. I am sure there were messages to other lists that I did not see, and so I didn't make random postings anywhere else. :-)

Bob