This makes sense. I do see RX dropped packets. I first want to figure this out.
Also, I had bonding mode enabled, but it seems I need to go to independent
modes. Does Lustre do bonding at the filesystem level, or is it preferred to
do it in the OS?

TIA

On Wed, Nov 12, 2008 at 4:32 PM, Andreas Dilger <[EMAIL PROTECTED]> wrote:
> On Nov 12, 2008 08:10 -0500, Brian J. Murrell wrote:
>> On Wed, 2008-11-12 at 07:17 -0500, Mag Gam wrote:
>> > We noticed.
>> > LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before
>> > arrival at OST: from [EMAIL PROTECTED] inum (somenumber)/(somenumber)
>> > object (some number)/0 extend [0-4095]
>> >
>> > Its actually coming from 2 particular hosts (1 OSS) another from 1
>> > particular client.
>> >
>> > I also see @@@ redo for unrecoverable error [EMAIL PROTECTED]
>> >
>> > Any thoughts how can I get rid of these messages?
>>
>> Assuming it's not a bug in Lustre, fix whatever is mangling the data
>> before it arrives at the OST. Do you have errors on your networking
>> fabric, or on the interfaces of the hosts on either end of the
>> transaction?
>
> Note that a similar error can also happen in the case of an application
> doing mmap IO, which the Linux kernel does not prevent from modifying
> the page even while it is being RDMA'd over the network, so it is hard
> for Lustre to provide a checksum for.
>
> The client would have printed a message like the following in that case:
>
> "BAD WRITE CHECKSUM: changed in transit AND doesn't match the
> original - likely false positive due to mmap IO (bug 11742)"
>
> If the client's copy of the data has not changed, and the checksum
> is still correct, then it points to data corruption on the network
> (probably in the NIC itself if it is specific to one node).
>
> Note that since the NIC is doing the TCP checksumming itself, this kind
> of error won't be caught by TCP packet checksums because the data is
> already corrupted in the NIC memory before the TCP checksum is computed.
>
> This specific problem was actually hit by a customer and is one of the
> reasons why Lustre does its own data checksum, instead of depending on
> the TCP layer to deliver the data without any errors.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
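
To make Andreas's point about NIC checksum offload concrete, here is a
rough sketch in Python. It is only an illustration, not Lustre code: the
function names are made up, CRC32 stands in for whatever checksum the
filesystem actually uses, and the "faulty NIC" is simulated by flipping one
bit in a 4096-byte page after the application checksum has been taken. The
TCP-style checksum is the standard 16-bit ones' complement sum, computed over
the already-corrupted buffer, as it would be with offload.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum, the kind used by TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the lowest bit of one byte flipped."""
    buf = bytearray(data)
    buf[index] ^= 0x01
    return bytes(buf)


def send_via_faulty_nic(page: bytes):
    """Hypothetical sender: the application checksum is computed first,
    then the NIC corrupts the buffer and checksums the corrupted copy."""
    app_checksum = zlib.crc32(page)              # taken before the NIC sees the data
    corrupted = flip_bit(page, 100)              # simulated corruption in NIC memory
    tcp_checksum = internet_checksum(corrupted)  # offloaded, so computed too late
    return corrupted, tcp_checksum, app_checksum


def receive(data: bytes, tcp_checksum: int, app_checksum: int):
    """Hypothetical receiver: verify both checksums against what arrived."""
    tcp_ok = internet_checksum(data) == tcp_checksum
    app_ok = zlib.crc32(data) == app_checksum
    return tcp_ok, app_ok


if __name__ == "__main__":
    page = bytes(range(256)) * 16  # one 4096-byte page of sample data
    tcp_ok, app_ok = receive(*send_via_faulty_nic(page))
    print("TCP checksum passes:", tcp_ok)
    print("Application checksum passes:", app_ok)
```

In this sketch the TCP-level check passes because it was computed over the
damaged data, while the end-to-end check fails, which is the situation the
BAD WRITE CHECKSUM message reports.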
