No. On the OSS we found that only the client that reported "dirty page discard" was being evicted. We hit this again last night, and on the OSS we can see logs like:
"
[Tue Aug 25 23:40:12 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.10.3.223@o2ib ns: filter-public1-OST0000_UUID lock: ffff9f1f91cba880/0x3fcc67dad1c65842 lrc: 3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->270335) flags: 0x60000400020020 nid: 10.10.3.223@o2ib remote: 0xd713b7b417045252 expref: 7081 pid: 25923 timeout: 21386699 lvb_type: 0
[Tue Aug 25 23:40:12 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
[Tue Aug 25 23:40:14 2020] LustreError: 26000:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9f13259a6300 x1653628454261296/t0(0) o106->public1-OST0000@10.10.3.223@o2ib:15/16 lens 296/280 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[Tue Aug 25 23:40:14 2020] LustreError: 26000:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 14 previous similar messages
[Tue Aug 25 23:40:26 2020] LustreError: 25917:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9f1339a5c800 x1653628454263632/t0(0) o106->public1-OST0002@10.10.3.223@o2ib:15/16 lens 296/280 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[Tue Aug 25 23:40:26 2020] LustreError: 25917:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
[Tue Aug 25 23:44:59 2020] LustreError: 32485:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0000: cli 3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 28672 GRANT, real grant 0
[Tue Aug 25 23:44:59 2020] LustreError: 32485:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 5755 previous similar messages
[Tue Aug 25 23:49:18 2020] Lustre: public1-OST0002: Connection restored to 87ca2182-98a3-25dd-7d30-989d822381c6 (at 10.10.5.6@o2ib)
[Tue Aug 25 23:49:18 2020] Lustre: Skipped 102 previous similar messages
[Tue Aug 25 23:55:00 2020] LustreError: 32485:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0004: cli 3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 577536 GRANT, real grant 0
[Tue Aug 25 23:55:00 2020] LustreError: 32485:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 1121 previous similar messages
[Tue Aug 25 23:59:25 2020] Lustre: public1-OST0000: Connection restored to d45ad9f4-8903-7c80-7b35-bd32037de660 (at 10.10.7.131@o2ib)
[Tue Aug 25 23:59:25 2020] Lustre: Skipped 50 previous similar messages
[Tue Aug 25 23:59:49 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 156s: evicting client at 10.10.3.223@o2ib ns: filter-public1-OST0000_UUID lock: ffff9f130863a880/0x3fcc67dad1cff1d5 lrc: 3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 3911680->4173823) flags: 0x60000000020020 nid: 10.10.3.223@o2ib remote: 0xd713b7b417354237 expref: 11891 pid: 26099 timeout: 21387847 lvb_type: 0
[Tue Aug 25 23:59:49 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar messages
[Wed Aug 26 00:00:40 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.10.3.223@o2ib ns: filter-public1-OST0004_UUID lock: ffff9f2df4a10d80/0x3fcc67dad1d50925 lrc: 3/0,0 mode: PR/PR res: [0xdc95179:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->266239) flags: 0x60000400000020 nid: 10.10.3.223@o2ib remote: 0xd713b7b417549c43 expref: 14594 pid: 26181 timeout: 21387927 lvb_type: 0
[Wed Aug 26 00:00:40 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar message
[Wed Aug 26 00:02:37 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.10.3.223@o2ib ns: filter-public1-OST0000_UUID lock: ffff9f1359e94a40/0x3fcc67dad1dacd8b lrc: 3/0,0 mode: PR/PR res: [0xde609f1:0x0:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 1941504->2097151) flags: 0x60000400020020 nid: 10.10.3.223@o2ib remote: 0xd713b7b417780209 expref: 5626 pid: 26134 timeout: 21388044 lvb_type: 0
[Wed Aug 26 00:02:37 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar message
[Wed Aug 26 00:05:00 2020] LustreError: 26199:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0004: cli 3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 28672 GRANT, real grant 0
[Wed Aug 26 00:05:00 2020] LustreError: 26199:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 14028 previous similar messages
[Wed Aug 26 00:09:30 2020] Lustre: public1-OST0000: Connection restored to 956559c4-4e7c-e6a5-3867-83ab85699688 (at 10.10.6.91@o2ib)
[Wed Aug 26 00:09:30 2020] Lustre: Skipped 39 previous similar messages
[Wed Aug 26 00:10:27 2020] LustreError: 14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 147s: evicting client at 10.10.3.223@o2ib ns: filter-public1-OST0002_UUID lock: ffff9f16e6f95c40/0x3fcc67dad1dea822 lrc: 3/0,0 mode: PR/PR res: [0xdd5d4bb:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->24575) flags: 0x60000400020020 nid: 10.10.3.223@o2ib remote: 0xd713b7b417900639 expref: 8633 pid: 25993 timeout: 21388514 lvb_type: 0
"
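
As a rough way to see which client is being evicted, and from which OSTs, the eviction lines in the OSS log above can be tallied with a short script. This is only a sketch and not something we actually ran: the function name count_evictions and the regular expression are illustrative, written against the message format shown above. Since the kernel rate-limits these messages ("Skipped N previous similar messages"), the numbers should be read as a lower bound.

```python
import re
from collections import Counter

# Pattern for the "lock callback timer expired" eviction lines shown above.
# It is derived from this log excerpt only; other Lustre versions may differ.
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s+ns: filter-(?P<ost>\S+?)_UUID"
)


def count_evictions(log_text):
    """Count logged evictions per (client NID, OST name) pair."""
    counts = Counter()
    for match in EVICT_RE.finditer(log_text):
        counts[(match.group("nid"), match.group("ost"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): dmesg -T | python3 count_evictions.py
    for (nid, ost), n in sorted(count_evictions(sys.stdin.read()).items()):
        print(f"{nid} {ost} {n}")
```

Run against the excerpt above, every eviction line names the same client NID, 10.10.3.223@o2ib, spread over OST0000, OST0002 and OST0004.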
Also, we ran lfsck on all servers; the result is:
"
layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 1
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 0
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 8
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 2253861
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 0
"

On Wed, Aug 26, 2020 at 12:17 AM Colin Faber <cfa...@gmail.com> wrote:

> The I/O was not fully committed after close() from the client. Are you
> experiencing high numbers of evictions?
>
> On Tue, Aug 25, 2020 at 9:12 AM 肖正刚 <guru.nov...@gmail.com> wrote:
>
>> Hi, all
>>
>> We found that some clients' dmesg filled up with messages like
>> "
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13565:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x1680f:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13547:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x14246:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13545:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12018:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13567:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12c86:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13566:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12c76:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13550:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12c8e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13568:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12c66:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13569:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12c7e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13548:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12c6e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13570:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12ca6:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13549:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12cbe:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13571:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12cb6:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13551:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12cae:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13572:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12cce:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13573:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12cc6:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13574:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12d56:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13575:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x12d36:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre: 13576:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page discard: 10.10.2.11@o2ib:10.10.2.12@o2ib:/public1/fid: [0x200007a82:0x1429e:0x0]/ may get corrupted (rc -108)
>> "
>> Then we checked the disk array, SAS links, and multipath, but found no errors.
>> Has anyone else run into the same problem?
>> Any suggestions would help!
>>
>> Regards.
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
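
The "dirty page discard" warnings quoted above each name the FID of a file that may have lost data, together with a return code (rc -108, which on Linux is ESHUTDOWN, "Cannot send after transport endpoint shutdown"). A minimal Python sketch for collecting the affected FIDs follows. It was not part of the original thread: discarded_fids and the regular expression are illustrative names, and the sketch assumes each warning has been unwrapped onto a single line with the mail quote markers removed.

```python
import re

# Pattern for the ll_dirty_page_discard_warn() messages quoted above.
# Assumes one complete message per line (no mail wrapping or ">>" prefixes).
DISCARD_RE = re.compile(
    r"dirty page discard: \S+/fid: "
    r"(?P<fid>\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\])/ "
    r"may get corrupted \(rc (?P<rc>-?\d+)\)"
)


def discarded_fids(log_text):
    """Return a sorted list of unique (fid, rc) pairs found in log_text."""
    found = {
        (match.group("fid"), int(match.group("rc")))
        for match in DISCARD_RE.finditer(log_text)
    }
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): dmesg | python3 discarded_fids.py
    for fid, rc in discarded_fids(sys.stdin.read()):
        print(fid, rc)
```

Each FID printed this way could then be mapped back to a pathname on a client with lfs fid2path, so the files flagged as possibly corrupted can be checked or regenerated.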