Hi Andreas,

 

I have seen very similar errors in our 2.10.1 environment: the same errors from
different clients to different OSS servers and OSTs. Our network is OPA and we
are using the latest driver and firmware for all HFIs and switches (10.6).

 

Thanks,

Lixin Liu
Compute Canada

 

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
"Dilger, Andreas" <andreas.dil...@intel.com>
Date: Saturday, December 9, 2017 at 9:07 PM
To: Hans Henrik Happe <ha...@nbi.dk>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] BAD CHECKSUM

 

Based on the messages on the client, this isn’t related to mmap() or writes 
done by the client, since the data has the same checksum from before it was 
sent and after it got the checksum error returned from the server. That means 
the pages did not change on the client. 
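
The reasoning above can be sketched as a small decision procedure. This is only an illustration, not Lustre's actual code: the function names are hypothetical, and zlib.adler32 is an assumed stand-in for whichever checksum algorithm the client and server are really using.

```python
import zlib


def checksum(pages: bytes) -> int:
    # Stand-in for the bulk data checksum (assumption: Adler-32 via zlib).
    return zlib.adler32(pages) & 0xFFFFFFFF


def classify_mismatch(sent_cksum: int, pages_now: bytes) -> str:
    """Classify a checksum error that the server has already reported.

    sent_cksum is what the client computed before sending;
    pages_now is the client's current copy of the same pages.
    """
    if checksum(pages_now) != sent_cksum:
        # The pages were modified after the checksum was taken (e.g. mmap IO).
        return "changed on client"
    # The client's data is stable, so the damage happened after it left the
    # client's memory: NIC, fabric, driver, or the server side.
    return "changed in transit or on server"


data = b"x" * 4096
before = checksum(data)
print(classify_mismatch(before, data))                # pages unchanged
print(classify_mismatch(before, b"y" + data[1:]))     # pages modified
```

The first call corresponds to the case in the attached logs: the client's checksum is the same before and after the error, so the client pages are not the source of the mismatch.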

 

Possible causes include the client network card, server network card, memory, 
or possibly the OFED driver?  It could of course be something in Lustre/LNet, 
though we haven’t had any reports of anything similar. 

 

When the checksum code was first written, it was motivated by a faulty Ethernet 
NIC that had TCP checksum offload but a bad onboard cache. The data was 
corrupted when it was copied onto the NIC, but the TCP checksum was computed on 
the bad data, so the checksum was “correct” when received by the server and it 
didn’t cause TCP resends. 
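
A toy model of that failure mode, to show why a checksum computed after the corruption cannot catch it while an end-to-end checksum can. Everything here is hypothetical: nic_send is an invented model of the faulty NIC, and zlib.crc32 stands in for both the TCP checksum and the Lustre bulk checksum.

```python
import zlib


def nic_send(payload: bytes, corrupt: bool):
    """Hypothetical NIC model: copy the payload into onboard memory,
    then compute the offloaded checksum over that copy."""
    onboard = bytearray(payload)
    if corrupt:
        onboard[0] ^= 0x01  # bad onboard cache flips a bit during the copy
    wire = bytes(onboard)
    return wire, zlib.crc32(wire)  # checksum covers the already-bad data


original = b"lustre bulk write payload"
end_to_end = zlib.crc32(original)  # computed by the client before the NIC

wire, offload_cksum = nic_send(original, corrupt=True)

# Receiver's transport-level check: passes, so no resend is triggered.
transport_ok = zlib.crc32(wire) == offload_cksum
# Server's end-to-end check against the client's checksum: fails.
end_to_end_ok = zlib.crc32(wire) == end_to_end
print(transport_ok, end_to_end_ok)  # prints: True False
```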

 

Are you seeing this on multiple servers?  The client log only shows one server, 
while the server log shows multiple clients.  If it is only happening on one 
server it might point to hardware. 

 

Did you also upgrade the kernel and OFED at the same time as Lustre? You could 
try building Lustre 2.10.1 on the old 2.9.0 kernel and OFED to see if that 
works properly. 

Cheers, Andreas


On Dec 9, 2017, at 11:09, Hans Henrik Happe <ha...@nbi.dk> wrote:



On 09-12-2017 18:57, Hans Henrik Happe wrote:


On 07-12-2017 21:36, Dilger, Andreas wrote:

On Dec 7, 2017, at 10:37, Hans Henrik Happe <ha...@nbi.dk> wrote:

Hi,

 

Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
overwriting memory while it is being DMA'ed to the network?

After upgrading to 2.10.1 on the server side we started seeing this from
a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
errors. We have not yet established whether the application is doing
things correctly.

If applications are using mmap IO it is possible for the page to become 
inconsistent after the checksum has been computed.  However, mmap IO is
normally detected by the client and no message should be printed.

 

There isn't anything that the application needs to do, since the client will 
resend the data if there is a checksum error, but the resends do slow down the 
IO.  If the inconsistency is on the client, there is no cause for concern 
(though it would be good to figure out the root cause).

 

It would be interesting to see what the exact error message is, since that will 
say whether the data became inconsistent on the client, or over the network.  
If the inconsistency is over the network or on the server, then that may point 
to hardware issues.

I've attached logs from a server and a client.


There was a cut n' paste error in the first set of files. This should be
better.

Looks like something goes wrong over the network.

Cheers,
Hans Henrik

<client.log>

<server.log>

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
