Paul Smith wrote:
> Bob Proulx wrote:
> > https://hostux.social/@fsfstatus
>
> Bob, note that the last update (7h ago) says things are still down:
>
> > Savannah services git, svn, hg, bzr, download and audio-video.gnu.org
> > are unavailable.
>
> I'm not sure if that's a typo, or if someone forgot to add the latest
> updates, or...?
I don't have access to make hostux.social updates there. That's the FSF sysadmin staff only. I'm just a volunteer out in the field. :-) I asked Michael to update it, and he did so during the time I have taken to write this email. It's updated now. I see that other systems were also affected last night.

The problem was that this was happening at midnight in my timezone, which is 2am in the US/Eastern timezone of the FSF sysadmin staff. Michael, the FSF admin staff person on call, was paged in to get the crashed kvmhost3 system back online. He did that, and that got the guest VMs online. Which was great! Thanks Michael!

But then we had this network storage server problem. That's a part of the infrastructure which I happen to have full admin access to, and I was also the person who set up that side of things, making me the logical person to work through the problem. (Me looks around the room shyly, since I am the one who set that up and it is a problem there. But I blame the specific version of the Trisquel system and Linux kernel running on that server, because none of the other servers I admin ever exhibit this problem.) It being after 2am there, I told Michael to get some sleep, because if I couldn't fix it by morning I would need him to be awake and thinking straight to tag back into the problem.

I continued to look at the problem. Initially it looked to me like a network problem, because I could see mount requests going from client to server but the server was not responding to those requests. That was a red herring and distracted me for a bit. At that moment I had not realized that the crash had also affected the nfs1 server, because nfs1 runs on a different host system, kvmhost2. But seemingly kvmhost2 also rebooted, which triggered the nfs1 reboot. And nfs1 has this bimodal boot behavior, sometimes coming up working and sometimes not; we just need to retire that entire system and retire that problem with it.
Just for the record though, it's been about three years (pre-pandemic!) since the last time an unscheduled crash affecting nfs1 caught us at an unexpected time with this problem. (Other crashes on other systems have happened, but kvmhost2 has been super reliable, rock solid.) Usually I am explicitly rebooting nfs1 myself, knowing I need to ensure it has booted into an okay state. Honestly I was slow to realize that this was the problem this time. And when I did realize it, I took a moment to try to debug it again, since I can only do that when the system is down, and it was down already. Unfortunately I wasn't able to diagnose it, and with it being after midnight for me and people starting to email in with problem reports, I simply rebooted it, retrying until it booted up in the okay working state. It took two more reboots before that happened. And then of course it was okay. It's always okay and reliable from then forward once it boots into the okay working mode.

Just because I am talking here, I will say that we already have an nfs2 system set up and running. It's in production, being used by other systems, and working great. If nfs1's problem shifted to where rebooting still did not resolve it, then we would move the SAN block devices from nfs1 over to nfs2 and shift to serving the data from there. We will do that eventually regardless. But it isn't a transparent move, and there would be a mad scramble to update symlinks and mount points. I would rather work through that during a scheduled daylight downtime than after midnight my time following an unexpected crash.

Just to show how everything is connected: the reason I posted here at all was because after clearing this problem I ran an anti-spam review of the mail queues, saw the message from Tongliang Liao to the list, approved it through, and then, because I had happened to see it, wrote a response.
There were also a few messages to the Savannah lists, which was expected, and I responded there. I am sure there were messages to other lists that I did not see, and so I didn't make random postings anywhere else. :-)

Bob