On 07-12-2017 21:36, Dilger, Andreas wrote:
> On Dec 7, 2017, at 10:37, Hans Henrik Happe <ha...@nbi.dk> wrote:
>> Hi,
>>
>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>> overwriting memory while it is being DMA'ed to the network?
>>
>> After upgrading to 2.10.1 on the server side we started seeing this from
>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>> errors. We have not yet established whether the application is doing
>> things correctly.
> If applications are using mmap IO it is possible for the page to become 
> inconsistent after the checksum has been computed.  However, mmap IO is
> normally detected by the client and no message should be printed.
>
> There isn't anything that the application needs to do, since the client will 
> resend the data if there is a checksum error, but the resends do slow down 
> the IO.  If the inconsistency is on the client, there is no cause for concern 
> (though it would be good to figure out the root cause).
>
> It would be interesting to see what the exact error message is, since that 
> will say whether the data became inconsistent on the client, or over the 
> network.  If the inconsistency is over the network or on the server, then 
> that may point to hardware issues.
I've attached logs from a server and a client.

Cheers,
Hans Henrik
Dec  7 13:53:02 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927 c
Dec  7 13:53:02 node830 kernel: LustreError: Skipped 1 previous similar message
Dec  7 13:53:02 node830 kernel: LustreError: 14576:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880b57ce7080 x1585957881290896/t30079750448(30079750448) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec  7 13:53:19 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3168180000-3169226751 4
Dec  7 13:53:19 node830 kernel: LustreError: Skipped 25 previous similar messages
Dec  7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff88106fc51c80 x1585957881291856/t30079752084(30079752084) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec  7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 25 previous similar messages
Dec  7 13:53:59 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927 c
Dec  7 13:53:59 node830 kernel: LustreError: Skipped 23 previous similar messages
Dec  7 13:53:59 node830 kernel: LustreError: 14569:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff880b57ce7080 x1585957881292880/t30079752266(30079752266) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec  7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 23 previous similar messages
Dec  7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) Skipped 3 previous similar messages
Dec  7 13:54:58 node830 kernel: LustreError: 14561:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:55:03 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3146517280-3147563007 a
Dec  7 13:55:03 node830 kernel: LustreError: Skipped 46 previous similar messages
Dec  7 13:55:05 node830 kernel: LustreError: 14559:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11  req@ffff8808b43210c0 x1585957881295616/t30079754077(30079754077) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec  7 13:55:05 node830 kernel: LustreError: 14559:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 41 previous similar messages
Dec  7 13:55:56 node830 kernel: LustreError: 14560:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec  7 13:55:56 node830 kernel: LustreError: 14560:0:(osc_request.c:1735:brw_interpret()) Skipped 2 previous similar messages
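As noted above, the client-side message states where the data is believed to have changed. To tally that across a long client log, a minimal Python sketch could look like the following. The regular expression is derived only from the 132-0 lines shown here, and `tally_client_diagnoses` is a hypothetical helper name, not part of any Lustre tooling. Because the kernel rate-limits these messages ("Skipped N previous similar messages"), the counts are a lower bound.

```python
import re
from collections import Counter

# Assumption: pattern based solely on the client-side 132-0 lines in this
# thread; other Lustre versions may format the message differently.
CLIENT_RE = re.compile(
    r"LustreError: 132-0: (?P<osc>\S+): BAD WRITE CHECKSUM: "
    r"(?P<diagnosis>.+?): from (?P<nid>\S+)"
)


def tally_client_diagnoses(lines):
    """Count (diagnosis text, peer NID) pairs found in client log lines."""
    counts = Counter()
    for line in lines:
        match = CLIENT_RE.search(line)
        if match:
            counts[(match.group("diagnosis"), match.group("nid"))] += 1
    return counts


# Two lines copied from the client log above (the first is truncated in
# the original and is kept truncated here).
sample = [
    "Dec  7 13:53:02 node830 kernel: LustreError: 132-0: "
    "astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in "
    "transit before arrival at OST: from 10.21.10.114@o2ib inode "
    "[0x2000135a4:0x169:0x0] object 0x0:55153652 extent "
    "[3157348640-3158396927 c",
    "Dec  7 13:53:02 node830 kernel: LustreError: Skipped 1 previous "
    "similar message",
]

if __name__ == "__main__":
    for (diagnosis, nid), n in tally_client_diagnoses(sample).items():
        print(f"{n} x '{diagnosis}' (peer {nid})")
```

To run it on a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.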

oss04: Dec  7 13:53:01 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.30@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3238583840-3239628799]: client csum 88708d4, server csum 3716de4c
oss04: Dec  7 13:53:01 oss04 kernel: LustreError: Skipped 282 previous similar messages
oss04: Dec  7 13:53:17 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.27@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [2875733280-2876780543]: client csum cd40c13, server csum 8be5fc56
oss04: Dec  7 13:53:17 oss04 kernel: LustreError: Skipped 131 previous similar messages
oss04: Dec  7 13:53:50 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.24@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [2583286560-2584334335]: client csum 59d854d2, server csum f93b0459
oss04: Dec  7 13:53:50 oss04 kernel: LustreError: Skipped 116 previous similar messages
oss04: Dec  7 13:54:56 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.22@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [2339580960-2340626431]: client csum 4ced3838, server csum c199f60f
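The server-side lines above come from four different client NIDs, all writing to the same object. To check whether the mismatches are spread over many clients or concentrated on one, the 168-f lines can be grouped by NID. This is a rough Python sketch under the assumption that all server messages follow the format shown above; `group_by_client` is a hypothetical helper name, not a Lustre tool.

```python
import re
from collections import defaultdict

# Assumption: pattern based solely on the server-side 168-f lines in this
# thread; the field layout may differ in other Lustre releases.
SERVER_RE = re.compile(
    r"LustreError: 168-f: (?P<ost>\S+): BAD WRITE CHECKSUM: "
    r"from (?P<nid>\S+) inode (?P<fid>\[[^\]]+\]) object (?P<obj>\S+) "
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]: "
    r"client csum (?P<client>[0-9a-f]+), server csum (?P<server>[0-9a-f]+)"
)


def group_by_client(lines):
    """Map client NID -> list of (start, end, client csum, server csum)."""
    by_nid = defaultdict(list)
    for line in lines:
        match = SERVER_RE.search(line)
        if match:
            by_nid[match.group("nid")].append(
                (
                    int(match.group("start")),
                    int(match.group("end")),
                    match.group("client"),
                    match.group("server"),
                )
            )
    return dict(by_nid)


# One line copied from the server log above, plus a rate-limit notice
# that the pattern is expected to ignore.
sample = [
    "oss04: Dec  7 13:53:01 oss04 kernel: LustreError: 168-f: "
    "astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.30@o2ib "
    "inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent "
    "[3238583840-3239628799]: client csum 88708d4, server csum 3716de4c",
    "oss04: Dec  7 13:53:01 oss04 kernel: LustreError: Skipped 282 "
    "previous similar messages",
]

if __name__ == "__main__":
    for nid, events in group_by_client(sample).items():
        print(nid, len(events), "mismatch(es)")
```

As with the client log, skipped messages are not counted, so a NID that appears only once may still be responsible for many mismatches.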

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org