Kevin: Thanks for the response.
What do I need to change using ethtool? BTW, I am using ethernet bonding to increase bandwidth. I suspect this could be causing the problem...

I am not sure if my applications are using mmap(). I am not aware of an easy way to determine if they are.

On Wed, Dec 31, 2008 at 12:34 PM, Kevin Van Maren <[email protected]> wrote:
> I have previously observed cases where the RX checksum offload NIC would
> pass packets up to Linux as "good" if the Ethernet CRC was valid, even
> though the UDP checksum failed (for some reason it appeared that something
> (the sender?) was corrupting a byte in the payload after calculating the
> UDP csum, but before the Ethernet CRC was calculated).
>
> So disable any NIC offloading on both sides (ethtool) and see if the Lustre
> csum errors go away.
>
> Also note that if you are using mmap files, it is _expected_ that the csum
> might not match, as the page can be modified between when the csum is
> calculated by Lustre, and the page is actually transmitted.
>
> Kevin
>
> Mag Gam wrote:
>>
>> I have done the tuning but still occasionally get a CSUM error. About
>> 200 per day. Considering we probably transfer close to 500G to 1TB
>> of data a day, that is not that bad.
>>
>> I did the tuning on the e1000 card but I am not sure what else to do.
>> The network guys have found nothing wrong with their switch and the
>> cables are fine (we even got them replaced).
>>
>> Since Lustre has its own checksumming, I suppose I am in good shape...
>>
>> On Sat, Nov 15, 2008 at 10:59 AM, Mag Gam <[email protected]> wrote:
>>
>>> Brian. Thanks for getting back to me.
>>>
>>> Yes. The contents matched but getting the RX drop which is kind of
>>> scary. I am using the same machine when doing the test.
>>>
>>> I have already looked at the Lnet tests
>>>
>>> http://manual.lustre.org/manual/LustreManual16_HTML/LustreIOKit.html#50642990_pgfId-1290255
>>>
>>> For some reason, "lst add_group servers ipaddrs_of_OSS_and_MDS" gets
>>> me an RPC error but it seems my 5 servers get added. Weird. Is there
>>> better documentation or perhaps an example for the lnet tests? I am
>>> curious to try it.
>>>
>>> BTW, I am very happy to see this
>>>
>>> http://manual.lustre.org/manual/LustreManual16_HTML/LustreTuning.html#50642992_24952
>>> (Last section regarding CRC). Where can I read more about this?
>>>
>>> Keep in mind, I am using e1000 NICs, and I think there is some tuning
>>> I should be doing (but I am not certain if I am doing the right
>>> tuning)
>>>
>>> TIA
>>>
>>> On Fri, Nov 14, 2008 at 7:11 AM, Brian J. Murrell <[email protected]>
>>> wrote:
>>>
>>>> On Thu, 2008-11-13 at 21:32 -0500, Mag Gam wrote:
>>>>
>>>>> OK.
>>>>>
>>>>> It seems Lustre FS is dropping the packets.
>>>>
>>>> No. Nobody said anything about packets being dropped. They are failing
>>>> checksum.
>>>>
>>>>> I did multiple FTPs and
>>>>> they were very large files (10GB each), and no packet drops
>>>>
>>>> Did you verify the contents of what you ftp'd matched the original? Are
>>>> you using the same machines in your ftp tests that are reporting
>>>> checksum failures with Lustre?
>>>>
>>>> You might want to look in our test suite and see if there is a checksum
>>>> unit test. I'd be surprised if there is not. Maybe run that and see
>>>> what the results are. I'm afraid I don't have a lustre source tree very
>>>> handy at the moment to check for you.
>>>>
>>>> b.
>>>>
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> [email protected]
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
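
P.S. On the ethtool question: since the suggestion was to disable NIC offloading on both sides, and bonding means there is more than one physical NIC per host, here is a rough sketch of what I think would need to run on each node. It only builds and prints the `ethtool -K` commands, it does not execute them. The bond name `bond0`, the fallback `eth0`, and the list of offload keywords are my assumptions; which keywords a given e1000 driver and kernel accept will vary, so check `ethtool -k <iface>` first.

```python
from pathlib import Path

# Offload features to switch off, using `ethtool -K` keywords.
# Assumption: the driver accepts all of these; trim the list if it does not.
OFFLOADS = ("rx", "tx", "sg", "tso")


def bond_slaves(bond="bond0", sysfs=Path("/sys/class/net")):
    """Return the slave interface names of a bonding device, or [] if absent."""
    slaves_file = sysfs / bond / "bonding" / "slaves"
    if not slaves_file.exists():
        return []
    return slaves_file.read_text().split()


def offload_commands(interfaces, offloads=OFFLOADS):
    """Build one `ethtool -K` command per interface. Nothing is executed."""
    commands = []
    for iface in interfaces:
        cmd = ["ethtool", "-K", iface]
        for feature in offloads:
            cmd += [feature, "off"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical names: fall back to eth0 when no bond0 is present.
    ifaces = bond_slaves("bond0") or ["eth0"]
    for cmd in offload_commands(ifaces):
        print(" ".join(cmd))
```

The settings would apply to the slave interfaces rather than to the bond itself, and they do not survive a reboot, so this is only meant for a test run to see whether the csum errors go away.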

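
On the mmap() question: one way that might work is to look at `/proc/<pid>/maps` for each running application and see whether any file under the Lustre mount point shows up as a shared, writable mapping. The sketch below does that for one PID. The mount point `/mnt/lustre` is a placeholder, and limiting the check to shared writable mappings is my assumption about which mappings could modify a page between the csum and the transmit.

```python
import sys


def shared_file_mappings(maps_text, mount_prefix):
    """Return paths under mount_prefix mapped shared and writable.

    maps_text is the content of /proc/<pid>/maps, where each line is
    'address perms offset dev inode pathname'.
    """
    prefix = mount_prefix.rstrip("/") + "/"
    hits = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        perms, path = fields[1], fields[5]
        # perms looks like 'rw-s': 4th character is 's' (shared) or 'p' (private)
        if "w" in perms and perms.endswith("s") and path.startswith(prefix):
            hits.append(path)
    return hits


if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    # Placeholder mount point; replace with the real Lustre client mount.
    mount = sys.argv[2] if len(sys.argv) > 2 else "/mnt/lustre"
    with open(f"/proc/{pid}/maps") as f:
        for path in shared_file_mappings(f.read(), mount):
            print(path)
```

This only catches mappings that exist at the moment the file is read, so a short-lived mmap() could be missed.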