On 07-12-2017 21:36, Dilger, Andreas wrote:
> On Dec 7, 2017, at 10:37, Hans Henrik Happe <ha...@nbi.dk> wrote:
>> Hi,
>>
>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>> overwriting memory while being DMA'ed to network?
>>
>> After upgrading to 2.10.1 on the server side we started seeing this from
>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>> errors. We have not yet established whether the application is doing
>> things correctly.
> If applications are using mmap IO it is possible for the page to become
> inconsistent after the checksum has been computed. However, mmap IO is
> normally detected by the client and no message should be printed.
>
> There isn't anything that the application needs to do, since the client will
> resend the data if there is a checksum error, but the resends do slow down
> the IO. If the inconsistency is on the client, there is no cause for concern
> (though it would be good to figure out the root cause).
>
> It would be interesting to see what the exact error message is, since that
> will say whether the data became inconsistent on the client, or over the
> network. If the inconsistency is over the network or on the server, then
> that may point to hardware issues.

I've attached logs from a server and a client.

Cheers,
Hans Henrik

Dec 7 13:53:02 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927 c
Dec 7 13:53:02 node830 kernel: LustreError: Skipped 1 previous similar message
Dec 7 13:53:02 node830 kernel: LustreError: 14576:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880b57ce7080 x1585957881290896/t30079750448(30079750448) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec 7 13:53:19 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3168180000-3169226751 4
Dec 7 13:53:19 node830 kernel: LustreError: Skipped 25 previous similar messages
Dec 7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff88106fc51c80 x1585957881291856/t30079752084(30079752084) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec 7 13:53:19 node830 kernel: LustreError: 14565:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 25 previous similar messages
Dec 7 13:53:59 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3157348640-3158396927 c
Dec 7 13:53:59 node830 kernel: LustreError: Skipped 23 previous similar messages
Dec 7 13:53:59 node830 kernel: LustreError: 14569:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec 7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff880b57ce7080 x1585957881292880/t30079752266(30079752266) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec 7 13:54:00 node830 kernel: LustreError: 14561:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 23 previous similar messages
Dec 7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec 7 13:54:01 node830 kernel: LustreError: 14573:0:(osc_request.c:1735:brw_interpret()) Skipped 3 previous similar messages
Dec 7 13:54:58 node830 kernel: LustreError: 14561:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec 7 13:55:03 node830 kernel: LustreError: 132-0: astro-OST0000-osc-ffff881072dbf400: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 10.21.10.114@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3146517280-3147563007 a
Dec 7 13:55:03 node830 kernel: LustreError: Skipped 46 previous similar messages
Dec 7 13:55:05 node830 kernel: LustreError: 14559:0:(osc_request.c:1611:osc_brw_redo_request()) @@@ redo for recoverable error -11 req@ffff8808b43210c0 x1585957881295616/t30079754077(30079754077) o4->astro-OST0000-osc-ffff881072dbf400@10.21.10.114@o2:0
Dec 7 13:55:05 node830 kernel: LustreError: 14559:0:(osc_request.c:1611:osc_brw_redo_request()) Skipped 41 previous similar messages
Dec 7 13:55:56 node830 kernel: LustreError: 14560:0:(osc_request.c:1735:brw_interpret()) astro-OST0000-osc-ffff881072dbf400: too many resent retries for object: 0:55153652, rc = -11.
Dec 7 13:55:56 node830 kernel: LustreError: 14560:0:(osc_request.c:1735:brw_interpret()) Skipped 2 previous similar messages

oss04: Dec 7 13:53:01 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.30@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [3238583840-3239628799]: client csum 88708d4, server csum 3716de4c
oss04: Dec 7 13:53:01 oss04 kernel: LustreError: Skipped 282 previous similar messages
oss04: Dec 7 13:53:17 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.27@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [2875733280-2876780543]: client csum cd40c13, server csum 8be5fc56
oss04: Dec 7 13:53:17 oss04 kernel: LustreError: Skipped 131 previous similar messages
oss04: Dec 7 13:53:50 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.24@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [2583286560-2584334335]: client csum 59d854d2, server csum f93b0459
oss04: Dec 7 13:53:50 oss04 kernel: LustreError: Skipped 116 previous similar messages
oss04: Dec 7 13:54:56 oss04 kernel: LustreError: 168-f: astro-OST0000: BAD WRITE CHECKSUM: from 12345-10.21.208.22@o2ib inode [0x2000135a4:0x169:0x0] object 0x0:55153652 extent [2339580960-2340626431]: client csum 4ced3838, server csum c199f60f
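
To see which clients and extents are affected, the server-side messages above can be tallied with a short script. The following is only a minimal sketch: the regular expression is derived from the 168-f lines exactly as they appear in this log excerpt (not from any Lustre specification), and the function and field names (`parse_server_line`, `errors_per_client`, `nid`, `length`) are hypothetical labels chosen for illustration. Because the kernel rate-limits these messages ("Skipped N previous similar messages"), the counts are a lower bound on the real number of errors.

```python
import re
from collections import Counter

# Assumption: this pattern matches the OSS-side "BAD WRITE CHECKSUM" message
# format as seen in the log excerpt above; other Lustre versions may differ.
SERVER_RE = re.compile(
    r"BAD WRITE CHECKSUM: from\s+(?P<nid>\S+)\s+"
    r"inode \[(?P<fid>[^\]]+)\]\s+object\s+(?P<obj>\S+)\s+"
    r"extent \[(?P<start>\d+)-(?P<end>\d+)\]:\s+"
    r"client csum (?P<client_csum>[0-9a-f]+), "
    r"server csum (?P<server_csum>[0-9a-f]+)"
)


def parse_server_line(line):
    """Return a dict describing one server-side checksum error, or None."""
    match = SERVER_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The extent is an inclusive byte range, so add 1 for its length.
    record["length"] = record["end"] - record["start"] + 1
    return record


def errors_per_client(lines):
    """Count logged checksum errors per client NID (rate-limited lines excluded)."""
    counts = Counter()
    for line in lines:
        record = parse_server_line(line)
        if record is not None:
            counts[record["nid"]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python csum_tally.py < oss04.log
    for nid, count in errors_per_client(sys.stdin).most_common():
        print(f"{nid}\t{count}")
```

Run against the excerpt above, this would show the errors spread over several client NIDs but confined to a single object, which is the kind of pattern worth checking before suspecting hardware.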

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org