Eric Wong wrote:
> Bob Proulx wrote:
> > The hardware problems have been isolated and corrected. All of the
> > systems are back online and operating normally.
>
> Thanks, curious if you could provide any post-mortem on what
> went wrong, how it was fixed, how to avoid it in the future,
> etc...
Since all of the work was done by the FSF admins I was vague, because I
only have arm's-length knowledge of it all. I am one of the Savannah
admins but I have no access to the hardware. In this I was just "The
Middleman". (An obscure reference to a TV series.) Ruben focused on
debugging and solving the problem. I was simply communicating.

Here is the environment as I understand it. Savannah and lists
(lists.gnu.org) were moved on Friday the 6th to a four-SSD RAID10 set
of temporary storage being used to host us. It is officially temporary
because a new storage array is planned. The previous storage was
starting to show various problems and so we were migrated. Three of
the SSDs in the RAID10 were Intel and one was a non-Intel SSD. The
non-Intel SSD started to return corrupt data. This was silent data
corruption: there were no errors from the device, just incorrect data.
(I personally have only had good experiences with Intel SSDs and
consider them the gold standard.)

I analyzed some of the corrupted files and a typical corruption was a
16-byte strip that was munged. Here is an example from one file (a
note on reproducing this kind of comparison is further below).

-02ae80 ca 39 6b 0c 47 ed ca 70 67 5c cb b4 25 21 46 ce  >.9k.G..pg\..%!F.<
+02ae80 e9 1a a7 9f c4 cb b7 1e 06 01 33 6c 35 0c a7 f4  >..........3l5...<

All other bytes were identical. Your guess is as good as mine as to
the underlying cause, but for the device to corrupt data silently like
this it is definitely bad SSD firmware.

This was the source of widespread problems. Git checkouts failed their
hash checks. Downloads of tar.gz files failed verification against
their associated signatures. This really underscores the need to
verify signatures.

I imagine this is a good data point for paranoid file systems that do
more data-integrity checking in the file system instead of trusting
that the hardware is sane. Traditionally spinning media will detect
data corruption and report an error, and the file system can trust it.
The new age of software-heavy SSD firmware means more possibilities of
software bugs in the firmware. A new SSD vendor may not be experienced
with this type of problem. Good quality SSDs will not suffer from it.
It is mostly a software quality issue.

After Ruben isolated the problem to the bad SSD he removed it from the
array. That removed the source of the data corruption. The remaining
devices in the array were okay. Data had been written correctly to
both sides of the mirrored pair; it was only the bad device that was
returning corrupted data when read. Removing it restored sanity to the
system. That was fixed this last Friday. On Monday the device was
backfilled with a new SSD and redundancy was once again available in
case of another failure.

> I noticed your message[1] in bug-bash stating you guys hit the
> xinetd limit for git-daemon due to networking issues:
>
> http://mid.gmane.org/[email protected]

Just to be clear, that was a completely independent problem. That was
at the FSF router and affected everything. It was not related to the
SSD failure. When it rains, it pours.

> Was this with or without SO_KEEPALIVE on the sockets?

I can see now that it was without, but at the time I didn't know one
way or the other. I reduced the kernel tcp keepalive times. That
seemed to help, which implied that git-daemon had keepalives enabled.
But obviously it did not. A lot of stuff was happening at the same
time, and with so much happening simultaneously it wasn't really
possible to tell whether the change helped or not.
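(Side note on the corruption analysis above: the comparison was between
a corrupted copy of a file and a known-good copy. Something along
these lines will reproduce that style of output; the file names here
are only placeholders:

  diff -u <(od -A x -t x1z good-copy.tar.gz) <(od -A x -t x1z corrupt-copy.tar.gz)

The "z" suffix on the od output type adds the >printable< column, and
diff -u marks the 16-byte lines that differ with - and +, as above.)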
> By default, SO_KEEPALIVE still takes around 2 hours to detect a
> failed connection, so it's probably worth tweaking the
> /proc/sys/net/ipv4/tcp_keepalive_* knobs, too.

Yes. The defaults are very long. To start the tweaking I began with
these values.

  net.ipv4.tcp_keepalive_time = 600
  net.ipv4.tcp_keepalive_intvl = 60
  net.ipv4.tcp_keepalive_probes = 3

I may have made these too short for some networks. Comments?

> I use xinetd myself for git-daemon, and have always had
> SO_KEEPALIVE enabled in my /etc/xinetd.d/git via:
>
>   flags = KEEPALIVE

Ah... I was unaware of that! We have not had that enabled there.
Thank you for this hint. I think it is a very worthy addition. I will
enable it in our flags too and credit you for the suggestion.

> But I guess git-daemon should be doing setsockopt to enable
> SO_KEEPALIVE itself... I'll test + post patches to the git
> mailing list soonish. I added SO_KEEPALIVE to the client-side
> of git years ago, but forgot the daemon :x.

The strange thing was that the network issue tickled this problem in
many git-daemon processes, and the other vcs daemons didn't close
their connections either. However git is so popular that it is the
majority of all activity. Then, when the network problem was resolved,
the problem of hanging daemons went away completely. So regardless of
the keepalive issue something else was going on with the connection
state. Unfortunately I don't know what the root cause was. And maybe
it doesn't matter, since it was a cascade problem after the network
problem. It would probably all make sense if we knew it.

> I forgot git-daemon has --timeout/--init-timeout options which I guess
> weren't used? Yikes :x

It looks like they are not used. There is a global wrapper using the
command line 'timeout' however. It reaps processes that run for too
long. Note that the timeouts must accommodate clients on slow
connections pulling large projects.

> Anyways, I just proposed a patch over on [email protected] to enable
> SO_KEEPALIVE as well:
>
> http://mid.gmane.org/[email protected]

I think that is a very good idea. As to the patch, I myself wouldn't
have used a function for what is basically a one-and-a-half-statement
addition. I don't in my own code. I would rather see what the code is
doing right there than have to look up the contents of that very small
function. Obviously that is just a matter of style. The result is the
same.

Bob
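P.S. On the timing question: a dead peer is declared after roughly
tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
seconds of idleness. With the kernel defaults of 7200, 9 and 75 that
is 7200 + 9 * 75 = 7875 seconds, the roughly two hours Eric mentioned.
With the values above it is 600 + 3 * 60 = 780 seconds, about 13
minutes. Once KEEPALIVE is enabled in xinetd, something like this
should confirm it is taking effect on the git connections (9418 is the
git port; a suggestion, not something we run already):

  ss -o state established '( sport = :9418 )'

Connections with SO_KEEPALIVE set show a timer:(keepalive,...) field
in that output.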
