On Thu, 2007-05-31 at 13:27 +0100, Darren J Moffat wrote:
> > What errors and error rates have you seen?
>
> I have seen switches flip bits in NFS traffic such that the TCP checksum
> still match yet the data was corrupted. One of the ways we saw this was
> when files were being checked out of SCCS, the SCCS checksum failed.
> Other ways we saw it was the compiler failing to compile untouched code.
To be specific, we found that an Ethernet switch in one of our development labs had a tendency to toggle a particular bit in packets going through it. The problem was originally suspected to be a data corruption problem within Solaris itself and got a lot of attention as a result.

In the cases I examined (corrupted source file after SCCS checkout) there were complementary changes (0->1 and 1->0) in the same bit in bytes which were 256, 512, or 1024 bytes apart in the source file. Because of the mathematics of the 16-bit ones-complement checksum used by TCP, the packet checksummed to the same value after the switch made these two offsetting changes. (I believe that the switch was either inserting or removing a VLAN tag, so the Ethernet CRC had to be recomputed by the switch and would not have caught the change.)

Once we realized that this was going on we went back, looked at the output of netstat -s, and noticed that the systems in this lab had been dropping an abnormally high number of packets due to bad TCP checksums; only a few of the broken packets were making it through, but there were enough of them to disrupt things in the lab.

The problem went away when the suspect switch was taken out of service.

- Bill
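
For anyone who wants to see the checksum arithmetic described above, here is a minimal sketch in Python. It is an illustration only, not anything we ran in the lab: the payload contents, the offsets, and the helper names (ones_complement_sum16, internet_checksum, flip_bit) are all made up for the example, and it checksums just a data buffer rather than a full TCP segment with its pseudo-header. It assumes both flipped bytes land in the same segment; since they are an even number of bytes apart, they then fall in the same half of their 16-bit words, so a 0->1 flip adds exactly what the 1->0 flip subtracts.

```python
def ones_complement_sum16(data: bytes) -> int:
    """16-bit ones-complement sum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries out of the low 16 bits back in (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Complement of the ones-complement sum, as in the TCP checksum field."""
    return ~ones_complement_sum16(data) & 0xFFFF


def flip_bit(buf: bytearray, offset: int, bit: int) -> None:
    """Toggle one bit of one byte in place."""
    buf[offset] ^= 1 << bit


# Hypothetical payload: two bytes 256 apart that differ in bit 3.
payload = bytearray(2048)
payload[100] = 0x41        # bit 3 is 0
payload[100 + 256] = 0x49  # bit 3 is 1

# Simulate the offsetting corruption: 0->1 in one byte, 1->0 in the other.
corrupted = bytearray(payload)
flip_bit(corrupted, 100, 3)
flip_bit(corrupted, 100 + 256, 3)

# A single flip on its own, for comparison.
single = bytearray(payload)
flip_bit(single, 100, 3)

print("data changed:           ", corrupted != payload)
print("checksum still matches: ",
      internet_checksum(bytes(corrupted)) == internet_checksum(bytes(payload)))
print("single flip is detected:",
      internet_checksum(bytes(single)) != internet_checksum(bytes(payload)))
```

A lone flipped bit changes the sum and the packet is dropped, which fits the high count of bad-checksum drops in netstat -s; only the packets that happened to pick up a matching pair of flips got through.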