On Thu, 2007-05-31 at 13:27 +0100, Darren J Moffat wrote:

> > What errors and error rates have you seen?
> 
> I have seen switches flip bits in NFS traffic such that the TCP checksum
> still matched yet the data was corrupted.  One of the ways we saw this was
> when files were being checked out of SCCS, the SCCS checksum failed.
> Another way we saw it was the compiler failing to compile untouched code.

To be specific, we found that an Ethernet switch in one of our
development labs had a tendency to toggle a particular bit in packets
going through it.  The problem was originally suspected to be a data
corruption problem within Solaris itself and got a lot of attention as a
result.

In the cases I examined (corrupted source file after SCCS checkout)
there were complementary changes (0->1 and 1->0) in the same bit in
bytes which were 256, 512, or 1024 bytes apart in the source file.
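
A comparison of that kind can be sketched in a few lines of Python.
This is only an illustration, not the tool we used: the function name
bit_flips, the buffer contents, and the offsets 300 and 812 are all
made up for the example.

```python
def bit_flips(good: bytes, bad: bytes):
    """List (offset, bit, direction) for every bit that differs
    between two equal-length buffers."""
    if len(good) != len(bad):
        raise ValueError("buffers must be the same length")
    flips = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        diff = g ^ b
        for bit in range(8):
            if diff & (1 << bit):
                # Direction is read from the corrupted copy.
                direction = "0->1" if b & (1 << bit) else "1->0"
                flips.append((offset, bit, direction))
    return flips


# Hypothetical data: a known-good buffer and a copy with bit 2
# toggled in two bytes that are 512 bytes apart.
good = bytearray(b"x" * 2048)
good[300] = 0x78   # bit 2 clear
good[812] = 0x7C   # bit 2 set
bad = bytearray(good)
bad[300] ^= 0x04   # 0 -> 1
bad[812] ^= 0x04   # 1 -> 0

flips = bit_flips(bytes(good), bytes(bad))
for offset, bit, direction in flips:
    print(offset, bit, direction)
print("distance:", flips[1][0] - flips[0][0])
```

Run against a good and a corrupted copy of the same file, a listing like
this makes the pattern (same bit, opposite directions, power-of-two
distance) easy to spot.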

Because of the mathematics of the 16-bit ones-complement checksum used
by TCP, the packet checksummed to the same value after the switch made
these two offsetting changes.  (I believe that the switch was either
inserting or removing a VLAN tag, so the Ethernet CRC had to be
recomputed by the switch, which would explain why the CRC did not
catch the corruption.)
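
To make the checksum arithmetic concrete, here is a small Python sketch.
It assumes an RFC 1071 style sum over the payload bytes only (no TCP
header or pseudo-header); the function name internet_checksum, the
buffer contents, and the offsets 100 and 612 are invented for the
illustration. The checksum adds the data up as 16-bit words, so two
bytes an even distance apart sit in the same half of their words. Setting
bit k in one adds exactly what clearing bit k in the other subtracts, and
the total does not change.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload with bit 4 clear at offset 100 and set at 612.
original = bytearray(b"A" * 2048)
original[100] = 0x41
original[612] = 0x51

# Two complementary flips of bit 4, 512 bytes apart.
corrupted = bytearray(original)
corrupted[100] ^= 0x10           # 0 -> 1
corrupted[612] ^= 0x10           # 1 -> 0

# A single flip, for contrast.
single = bytearray(original)
single[100] ^= 0x10

print(original != corrupted)
print(internet_checksum(bytes(original)) == internet_checksum(bytes(corrupted)))
print(internet_checksum(bytes(original)) == internet_checksum(bytes(single)))
```

The paired flips leave the checksum unchanged even though the data
differs, while the single flip is detected. That is consistent with most
damaged packets being dropped and only a few slipping through.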

Once we realized that this was going on we went back, looked at the
output of netstat -s, and noticed that the systems in this lab had been
dropping an abnormally high number of packets due to bad TCP checksums;
only a few of the broken packets were making it through, but there were
enough of them to disrupt things in the lab.

The problem went away when the suspect switch was taken out of service.

                                                - Bill

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
