Re: [lustre-discuss] BAD CHECKSUM

2017-12-11 Thread Hans Henrik Happe
On 10-12-2017 06:07, Dilger, Andreas wrote:
> Based on the messages on the client, this isn’t related to mmap() or
> writes done by the client, since the data has the same checksum from
> before it was sent and after it got the checksum error returned from
> the server. That means the pages did not change on the client.
>
> Possible causes include the client network card, server network card,
> memory, or possibly the OFED driver?  It could of course be something
> in Lustre/LNet, though we haven’t had any reports of anything similar. 
>
> When the checksum code was first written, it was motivated by a faulty
> Ethernet NIC that had TCP checksum offload, but bad onboard cache, and
> the data was corrupted when copied onto the NIC but the TCP checksum
> was computed on the bad data and the checksum was “correct” when
> received by the server, so it didn’t cause TCP resends. 
>
> Are you seeing this on multiple servers?  The client log only shows
> one server, while the server log shows multiple clients.  If it is
> only happening on one server it might point to hardware. 
Yes, we are seeing it on all servers.
> Did you also upgrade the kernel and OFED at the same time as Lustre?
> You could try building Lustre 2.10.1 on the old 2.9.0 kernel and OFED
> to see if that works properly.
We upgraded to CentOS 7.4 and are using the included OFED on the
servers. We also upgraded the firmware on the server IB cards. We will
check further whether this combination has compatibility issues.

Cheers,
Hans Henrik
>
> Cheers, Andreas
>
> On Dec 9, 2017, at 11:09, Hans Henrik Happe wrote:
>
>>
>>
>> On 09-12-2017 18:57, Hans Henrik Happe wrote:
>>> On 07-12-2017 21:36, Dilger, Andreas wrote:
>>>> On Dec 7, 2017, at 10:37, Hans Henrik Happe wrote:
>>>>> Hi,
>>>>>
>>>>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>>>>> overwriting memory while being DMA'ed to network?
>>>>>
>>>>> After upgrading to 2.10.1 on the server side we started seeing this from
>>>>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>>>>> errors. We have not yet established whether the application is doing
>>>>> things correctly.
>>>> If applications are using mmap IO it is possible for the page to
>>>> become inconsistent after the checksum has been computed.  However,
>>>> mmap IO is normally detected by the client and no message should be printed.
>>>>
>>>> There isn't anything that the application needs to do, since the
>>>> client will resend the data if there is a checksum error, but the
>>>> resends do slow down the IO.  If the inconsistency is on the
>>>> client, there is no cause for concern (though it would be good to
>>>> figure out the root cause).
>>>>
>>>> It would be interesting to see what the exact error message is,
>>>> since that will say whether the data became inconsistent on the
>>>> client, or over the network.  If the inconsistency is over the
>>>> network or on the server, then that may point to hardware issues.
>>> I've attached logs from a server and a client.
>>
>> There was a cut n' paste error in the first set of files. This should be
>> better.
>>
>> Looks like something goes wrong over the network.
>>
>> Cheers,
>> Hans Henrik
>>
>> 
>> 




Re: [lustre-discuss] BAD CHECKSUM

2017-12-09 Thread Lixin Liu
Hi Andreas,

 

I have seen very similar errors in our 2.10.1 environment. Same errors from
different clients to different OSS servers and OSTs. Our network is OPA and we
are using the latest driver and firmware for all HFIs and switches (10.6).

 

Thanks,

 

Lixin Liu

Compute Canada

 

From: lustre-discuss on behalf of "Dilger, Andreas"
Date: Saturday, December 9, 2017 at 9:07 PM
To: Hans Henrik Happe
Cc: "lustre-discuss@lists.lustre.org"
Subject: Re: [lustre-discuss] BAD CHECKSUM

 

Based on the messages on the client, this isn’t related to mmap() or writes 
done by the client, since the data has the same checksum from before it was 
sent and after it got the checksum error returned from the server. That means 
the pages did not change on the client. 

 

Possible causes include the client network card, server network card, memory, 
or possibly the OFED driver?  It could of course be something in Lustre/LNet, 
though we haven’t had any reports of anything similar. 

 

When the checksum code was first written, it was motivated by a faulty Ethernet 
NIC that had TCP checksum offload, but bad onboard cache, and the data was 
corrupted when copied onto the NIC but the TCP checksum was computed on the bad 
data and the checksum was “correct” when received by the server, so it didn’t 
cause TCP resends. 

 

Are you seeing this on multiple servers?  The client log only shows one server, 
while the server log shows multiple clients.  If it is only happening on one 
server it might point to hardware. 

 

Did you also upgrade the kernel and OFED at the same time as Lustre? You could 
try building Lustre 2.10.1 on the old 2.9.0 kernel and OFED to see if that 
works properly. 

Cheers, Andreas


On Dec 9, 2017, at 11:09, Hans Henrik Happe  wrote:



On 09-12-2017 18:57, Hans Henrik Happe wrote:


On 07-12-2017 21:36, Dilger, Andreas wrote:

On Dec 7, 2017, at 10:37, Hans Henrik Happe  wrote:

Hi,

 

Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
overwriting memory while being DMA'ed to network?

After upgrading to 2.10.1 on the server side we started seeing this from
a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
errors. We have not yet established whether the application is doing
things correctly.

If applications are using mmap IO it is possible for the page to become
inconsistent after the checksum has been computed.  However, mmap IO is
normally detected by the client and no message should be printed.

 

There isn't anything that the application needs to do, since the client will 
resend the data if there is a checksum error, but the resends do slow down the 
IO.  If the inconsistency is on the client, there is no cause for concern 
(though it would be good to figure out the root cause).

 

It would be interesting to see what the exact error message is, since that will 
say whether the data became inconsistent on the client, or over the network.  
If the inconsistency is over the network or on the server, then that may point 
to hardware issues.

I've attached logs from a server and a client.


There was a cut n' paste error in the first set of files. This should be
better.

Looks like something goes wrong over the network.

Cheers,
Hans Henrik








Re: [lustre-discuss] BAD CHECKSUM

2017-12-09 Thread Dilger, Andreas
Based on the messages on the client, this isn’t related to mmap() or writes 
done by the client, since the data has the same checksum from before it was 
sent and after it got the checksum error returned from the server. That means 
the pages did not change on the client.
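For illustration, the client-side check described above amounts to checksumming the pages just before the bulk send and again after the server reports a mismatch; if the two local values agree, the client copy did not change and the corruption happened in transit or on the server. A minimal sketch of that comparison, using zlib's crc32 as a stand-in for the actual OST checksum type (this is not Lustre's osc code):

/* Sketch only: illustrates the before/after comparison described above.
 * zlib crc32 stands in for the real OST checksum algorithm.
 * Build with: cc csum_demo.c -lz */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <zlib.h>

static uint32_t page_checksum(const void *buf, size_t len)
{
    return (uint32_t)crc32(0L, (const unsigned char *)buf, (unsigned int)len);
}

int main(void)
{
    char pages[4096];
    memset(pages, 0xab, sizeof(pages));

    /* checksum taken just before the bulk write is sent */
    uint32_t csum_before = page_checksum(pages, sizeof(pages));

    /* ... bulk RPC sent; server replies with a checksum mismatch ... */

    /* recompute after the error reply to see if the client copy changed */
    uint32_t csum_after = page_checksum(pages, sizeof(pages));

    if (csum_before == csum_after)
        printf("client pages unchanged (0x%08x): data changed in transit or on the server\n",
               csum_before);
    else
        printf("client pages changed under the write (e.g. mmap or buffer reuse)\n");
    return 0;
}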

Possible causes include the client network card, server network card, memory, 
or possibly the OFED driver?  It could of course be something in Lustre/LNet, 
though we haven’t had any reports of anything similar.

When the checksum code was first written, it was motivated by a faulty Ethernet 
NIC that had TCP checksum offload, but bad onboard cache, and the data was 
corrupted when copied onto the NIC but the TCP checksum was computed on the bad 
data and the checksum was “correct” when received by the server, so it didn’t 
cause TCP resends.

Are you seeing this on multiple servers?  The client log only shows one server, 
while the server log shows multiple clients.  If it is only happening on one 
server it might point to hardware.

Did you also upgrade the kernel and OFED at the same time as Lustre? You could 
try building Lustre 2.10.1 on the old 2.9.0 kernel and OFED to see if that 
works properly.

Cheers, Andreas

On Dec 9, 2017, at 11:09, Hans Henrik Happe <ha...@nbi.dk> wrote:



On 09-12-2017 18:57, Hans Henrik Happe wrote:
On 07-12-2017 21:36, Dilger, Andreas wrote:
On Dec 7, 2017, at 10:37, Hans Henrik Happe <ha...@nbi.dk> wrote:
Hi,

Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
overwriting memory while being DMA'ed to network?

After upgrading to 2.10.1 on the server side we started seeing this from
a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
errors. We have not yet established whether the application is doing
things correctly.
If applications are using mmap IO it is possible for the page to become 
inconsistent after the checksum has been computed.  However, mmap IO is
normally detected by the client and no message should be printed.

There isn't anything that the application needs to do, since the client will 
resend the data if there is a checksum error, but the resends do slow down the 
IO.  If the inconsistency is on the client, there is no cause for concern 
(though it would be good to figure out the root cause).

It would be interesting to see what the exact error message is, since that will 
say whether the data became inconsistent on the client, or over the network.  
If the inconsistency is over the network or on the server, then that may point 
to hardware issues.
I've attached logs from a server and a client.

There was a cut n' paste error in the first set of files. This should be
better.

Looks like something goes wrong over the network.

Cheers,
Hans Henrik





Re: [lustre-discuss] BAD CHECKSUM

2017-12-09 Thread Hans Henrik Happe


On 09-12-2017 18:57, Hans Henrik Happe wrote:
> On 07-12-2017 21:36, Dilger, Andreas wrote:
>> On Dec 7, 2017, at 10:37, Hans Henrik Happe  wrote:
>>> Hi,
>>>
>>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>>> overwriting memory while being DMA'ed to network?
>>>
>>> After upgrading to 2.10.1 on the server side we started seeing this from
>>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>>> errors. We have not yet established whether the application is doing
>>> things correctly.
>> If applications are using mmap IO it is possible for the page to become 
>> inconsistent after the checksum has been computed.  However, mmap IO is
>> normally detected by the client and no message should be printed.
>>
>> There isn't anything that the application needs to do, since the client will 
>> resend the data if there is a checksum error, but the resends do slow down 
>> the IO.  If the inconsistency is on the client, there is no cause for 
>> concern (though it would be good to figure out the root cause).
>>
>> It would be interesting to see what the exact error message is, since that 
>> will say whether the data became inconsistent on the client, or over the 
>> network.  If the inconsistency is over the network or on the server, then 
>> that may point to hardware issues.
> I've attached logs from a server and a client.

There was a cut n' paste error in the first set of files. This should be
better.

Looks like something goes wrong over the network.

Cheers,
Hans Henrik

Dec  7 13:53:02 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927], original client csum 7505b09c (type 4), server csum 602d88a8 (type 4), client csum now 7505b09c
Dec  7 13:53:02 node830 kernel: LustreError: Skipped 1 previous similar message
Dec  7 13:53:02 node830 kernel: LustreError: 14576:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@880b57ce7080 x1585957881290896/t30079750448(30079750448) o4->astro-OST-osc-881072dbf400@10.21.10.114@o2ib:6/4 lens 608/416 e 0 to 0 dl 1512651188 ref 2 fl Interpret:RM/0/0 rc 0/0
Dec  7 13:53:19 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [316818-3169226751], original client csum d0df79c4 (type 4), server csum 1f1a7bf (type 4), client csum now d0df79c4
Dec  7 13:53:19 node830 kernel: LustreError: Skipped 25 previous similar messages
Dec  7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@88106fc51c80 x1585957881291856/t30079752084(30079752084) o4->astro-OST-osc-881072dbf400@10.21.10.114@o2ib:6/4 lens 608/416 e 0 to 0 dl 1512651243 ref 2 fl Interpret:RM/0/0 rc 0/0
Dec  7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 25 previous similar messages
Dec  7 13:53:59 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927], original client csum 7505b09c (type 4), server csum 120df09a (type 4), client csum now 7505b09c
Dec  7 13:53:59 node830 kernel: LustreError: Skipped 23 previous similar messages
Dec  7 13:53:59 node830 kernel: LustreError: 14569:0:(osc_request.c:1735:brw_interpret()) astro-OST-osc-881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@880b57ce7080 x1585957881292880/t30079752266(30079752266) o4->astro-OST-osc-881072dbf400@10.21.10.114@o2ib:6/4 lens 608/416 e 0 to 0 dl 1512651284 ref 2 fl Interpret:RM/0/0 rc 0/0
Dec  7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 23 previous similar messages
Dec  7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) astro-OST-osc-881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) Skipped 3 previous similar messages
Dec  7 13:54:58 node830 kernel: LustreError: 14561:0:(osc_request.c:1735:brw_interpret()) astro-OST-osc-881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:55:03 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 ex

Re: [lustre-discuss] BAD CHECKSUM

2017-12-09 Thread Hans Henrik Happe
On 07-12-2017 21:36, Dilger, Andreas wrote:
> On Dec 7, 2017, at 10:37, Hans Henrik Happe  wrote:
>> Hi,
>>
>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>> overwriting memory while being DMA'ed to network?
>>
>> After upgrading to 2.10.1 on the server side we started seeing this from
>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>> errors. We have not yet established whether the application is doing
>> things correctly.
> If applications are using mmap IO it is possible for the page to become 
> inconsistent after the checksum has been computed.  However, mmap IO is
> normally detected by the client and no message should be printed.
>
> There isn't anything that the application needs to do, since the client will 
> resend the data if there is a checksum error, but the resends do slow down 
> the IO.  If the inconsistency is on the client, there is no cause for concern 
> (though it would be good to figure out the root cause).
>
> It would be interesting to see what the exact error message is, since that 
> will say whether the data became inconsistent on the client, or over the 
> network.  If the inconsistency is over the network or on the server, then 
> that may point to hardware issues.
I've attached logs from a server and a client.

Cheers,
Hans Henrik
Dec  7 13:53:02 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927 c
Dec  7 13:53:02 node830 kernel: LustreError: Skipped 1 previous similar message
Dec  7 13:53:02 node830 kernel: LustreError: 14576:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@880b57ce7080 x1585957881290896/t30079750448(30079750448) o4->astro-OST-osc-881072dbf400@10.21.10.114@o2:0
Dec  7 13:53:19 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [316818-3169226751 4
Dec  7 13:53:19 node830 kernel: LustreError: Skipped 25 previous similar messages
Dec  7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@88106fc51c80 x1585957881291856/t30079752084(30079752084) o4->astro-OST-osc-881072dbf400@10.21.10.114@o2:0
Dec  7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 25 previous similar messages
Dec  7 13:53:59 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927 c
Dec  7 13:53:59 node830 kernel: LustreError: Skipped 23 previous similar messages
Dec  7 13:53:59 node830 kernel: LustreError: 14569:0:(osc_request.c:1735:brw_interpret()) astro-OST-osc-881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@880b57ce7080 x1585957881292880/t30079752266(30079752266) o4->astro-OST-osc-881072dbf400@10.21.10.114@o2:0
Dec  7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 23 previous similar messages
Dec  7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) astro-OST-osc-881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) Skipped 3 previous similar messages
Dec  7 13:54:58 node830 kernel: LustreError: 14561:0:(osc_request.c:1735:brw_interpret()) astro-OST-osc-881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:55:03 node830 kernel: LustreError: 132-0: astro-OST-osc-881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3146517280-3147563007 a
Dec  7 13:55:03 node830 kernel: LustreError: Skipped 46 previous similar messages
Dec  7 13:55:05 node830 kernel: LustreError: 14559:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@8808b43210c0 x1585957881295616/t30079754077(30079754077) o4->astro-OST-osc-881072dbf400@10.21.10.114@o2:0
Dec  7 13:55:05 node830 kernel: LustreError: 14559:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 41 previous similar messages
Dec  7 13:55:56 node830 kernel: LustreError: 14560:0:(osc_request.c:1735:brw_interpret()) astro-OST-osc-881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:55:56 node830 kernel: 

Re: [lustre-discuss] BAD CHECKSUM

2017-12-07 Thread Patrick Farrell
I would think it's possible if the application is doing direct I/O. This
should be impossible for buffered I/O, since the checksums are all
calculated after the copies into kernel memory (the page cache) are
complete, so it doesn't matter what userspace does to its memory (at
least, it doesn't matter for the checksums).

And I'm not 100% sure it's possible for direct I/O.  I would think it is.
Someone else might be able to weigh in there - but it's definitely not
possible for buffered I/O.


It would be good, as Andreas said, to see the exact message.

One other thought: While the Lustre client might resend correctly, I would
think it extremely likely that unintentionally messing with memory being used
for I/O represents a serious application bug, likely to lead to incorrect
operation.

Regards,
- Patrick

On 12/7/17, 2:36 PM, "lustre-discuss on behalf of Dilger, Andreas" wrote:

>On Dec 7, 2017, at 10:37, Hans Henrik Happe  wrote:
>> 
>> Hi,
>> 
>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>> overwriting memory while being DMA'ed to network?
>> 
>> After upgrading to 2.10.1 on the server side we started seeing this from
>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>> errors. We have not yet established whether the application is doing
>> things correctly.
>
>If applications are using mmap IO it is possible for the page to become
>inconsistent after the checksum has been computed.  However, mmap IO is
>normally detected by the client and no message should be printed.
>
>There isn't anything that the application needs to do, since the client
>will resend the data if there is a checksum error, but the resends do
>slow down the IO.  If the inconsistency is on the client, there is no
>cause for concern (though it would be good to figure out the root cause).
>
>It would be interesting to see what the exact error message is, since
>that will say whether the data became inconsistent on the client, or over
>the network.  If the inconsistency is over the network or on the server,
>then that may point to hardware issues.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Lustre Principal Architect
>Intel Corporation
>
>
>
>
>
>
>



Re: [lustre-discuss] BAD CHECKSUM

2017-12-07 Thread Dilger, Andreas
On Dec 7, 2017, at 10:37, Hans Henrik Happe  wrote:
> 
> Hi,
> 
> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
> overwriting memory while being DMA'ed to network?
> 
> After upgrading to 2.10.1 on the server side we started seeing this from
> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
> errors. We have not yet established whether the application is doing
> things correctly.

If applications are using mmap IO it is possible for the page to become 
inconsistent after the checksum has been computed.  However, mmap IO is
normally detected by the client and no message should be printed.
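For illustration, a contrived sketch of the mmap case mentioned above: a mapped page is dirtied, queued for writeback, and then modified again through the mapping, so its contents can differ between the time any checksum is taken and the time the page goes out. The path is a placeholder and this is not Lustre-specific code.

/* Contrived sketch: dirty a shared mapping, start asynchronous writeback,
 * then modify the page again.  /mnt/lustre/mapped.bin is a placeholder. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/lustre/mapped.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    if (ftruncate(fd, 4096) < 0)
        return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memset(p, 'A', 4096);               /* dirty the page */
    msync(p, 4096, MS_ASYNC);           /* request writeback, don't wait for it */
    memset(p, 'B', 4096);               /* page may change after any checksum
                                         * was computed over the 'A' contents */

    munmap(p, 4096);
    close(fd);
    return 0;
}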

There isn't anything that the application needs to do, since the client will 
resend the data if there is a checksum error, but the resends do slow down the 
IO.  If the inconsistency is on the client, there is no cause for concern 
(though it would be good to figure out the root cause).

It would be interesting to see what the exact error message is, since that will 
say whether the data became inconsistent on the client, or over the network.  
If the inconsistency is over the network or on the server, then that may point 
to hardware issues.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation









[lustre-discuss] BAD CHECKSUM

2017-12-07 Thread Hans Henrik Happe
Hi,

Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
overwriting memory while being DMA'ed to network?

After upgrading to 2.10.1 on the server side we started seeing this from
a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
errors. We have not yet established whether the application is doing
things correctly.

Cheers,
Hans Henrik
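For illustration only, here is a hypothetical example of the kind of application behaviour being asked about: overwriting a buffer while a nonblocking MPI-IO write on it is still in flight, which the MPI standard forbids. This is not a claim about the user's actual code; the file name and sizes are placeholders.

/* Hypothetical misuse of nonblocking MPI-IO: the buffer is modified before
 * the write request completes.  Build with mpicc, run with mpirun. */
#include <mpi.h>
#include <string.h>

#define BUF_SIZE (1 << 20)

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Request req;
    static char buf[BUF_SIZE];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    memset(buf, 'A', BUF_SIZE);
    MPI_File_iwrite_at(fh, (MPI_Offset)rank * BUF_SIZE, buf, BUF_SIZE,
                       MPI_BYTE, &req);

    /* BUG: the buffer must not be touched until the request completes */
    memset(buf, 'B', BUF_SIZE);

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}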