[66494.575431] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Failed to map mr 1/8 elements
[66494.575446] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Can't map 32768 bytes (8/8)s: -22

These errors originate from a call to ib_map_mr_sg(), which is part of the kernel verbs API:
    n = ib_map_mr_sg(mr, tx->tx_frags, rd->rd_nfrags, NULL, PAGE_SIZE);
    if (unlikely(n != rd->rd_nfrags)) {
            CERROR("Failed to map mr %d/%d elements\n", n, rd->rd_nfrags);
            return n < 0 ? n : -EINVAL;
    }

Your errors mean that we wanted to map 8 fragments into the memory region, but were only able to map one of them. As a first step, I would recommend ensuring that you have the latest firmware for your network cards, and, if you're using an external driver distribution (like mlnx-ofa_kernel), upgrading to the latest version. There could be a bug in the o2iblnd driver code, but it is best to first rule out any issue with firmware or drivers.

Chris Horn

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Stepan Nassyr via lustre-discuss <lustre-discuss@lists.lustre.org>
Date: Tuesday, August 16, 2022 at 8:26 AM
To: Peter Jones <pjo...@whamcloud.com>, lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] network error on bulk WRITE/bad log

Hello Peter,

Thank you for the reply. I have upgraded Lustre to 2.15.1. The errors persist, however; now I am also seeing a new error on io02:

[ 1749.396942] LustreError: 9216:0:(mdt_handler.c:7499:mdt_iocontrol()) storage-MDT0001: Not supported cmd = 1074292357, rc = -95

I'm not entirely sure how to look up the cmd code, and rc -95 seems to just be EOPNOTSUPP, so there is no additional information there. Is there a way to look up what the cmd value means?

On 15.08.22 14:50, Peter Jones wrote:

Stepan,

2.14.56 is not a release version of Lustre – it is an interim dev build.
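On the question of looking up the cmd value: it is a standard Linux ioctl request number, so it can be unpacked with the generic bit layout from include/uapi/asm-generic/ioctl.h. A minimal sketch (the field layout is the generic Linux one; mapping the resulting type/nr back to a specific Lustre ioctl would still need to be done against the matching Lustre source tree):

```python
# Decode a Linux ioctl request number into its fields, using the layout
# from include/uapi/asm-generic/ioctl.h:
#   bits 0-7 = nr, bits 8-15 = type, bits 16-29 = size, bits 30-31 = direction.
def decode_ioctl(cmd):
    nr = cmd & 0xff
    typ = (cmd >> 8) & 0xff
    size = (cmd >> 16) & 0x3fff
    direction = {0: "NONE", 1: "WRITE", 2: "READ", 3: "READ|WRITE"}[(cmd >> 30) & 0x3]
    return direction, chr(typ), nr, size

# The cmd from the mdt_iocontrol() message above:
print(decode_ioctl(1074292357))  # -> ('WRITE', 'f', 133, 8)
```

So 1074292357 is _IOW('f', 0x85) with an 8-byte argument. The 'f' type is consistent with Lustre's OBD ioctl namespace; which ioctl nr 0x85 (133) corresponds to depends on the Lustre version and is best checked against lustre_ioctl.h in the source tree you are running.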
Even if it does not resolve this specific issue, I would strongly recommend switching to the recently released Lustre 2.15.1.

Peter

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Stepan Nassyr via lustre-discuss <lustre-discuss@lists.lustre.org>
Reply-To: Stepan Nassyr <s.nas...@fz-juelich.de>
Date: Monday, August 15, 2022 at 1:35 AM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] network error on bulk WRITE/bad log

Hi all,

In May I had a failure on a small cluster and asked about it here (http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-May/018073.html). Due to time constraints I just recreated the filesystem back then. Now the failure has happened again; this time I have more time to investigate and haven't done anything destructive yet.

I use the following versions:

1. Lustre 2.14.56
2. ZFS 2.0.7 (previously used 2.1.2, but was told that 2.1.x is not well tested with Lustre)
3. Nodes are running Rocky Linux 8.6
4. uname -r: 4.18.0-372.19.1.el8_6.aarch64

There are 2 IO nodes (io01 and io02); both of them are MDS and OSS, and one of them is MGS.
Here are the devices:

[snassyr@io02 ~]$ sudo lctl dl
  0 UP osd-zfs storage-MDT0001-osd storage-MDT0001-osd_UUID 8
  1 UP mgc MGC10.31.7.61@o2ib a087e05e-d57c-4561-ad75-6827d4428f54 4
  2 UP mds MDS MDS_uuid 2
  3 UP lod storage-MDT0001-mdtlov storage-MDT0001-mdtlov_UUID 3
  4 UP mdt storage-MDT0001 storage-MDT0001_UUID 8
  5 UP mdd storage-MDD0001 storage-MDD0001_UUID 3
  6 UP osp storage-MDT0000-osp-MDT0001 storage-MDT0001-mdtlov_UUID 4
  7 UP osp storage-OST0000-osc-MDT0001 storage-MDT0001-mdtlov_UUID 4
  8 UP osp storage-OST0001-osc-MDT0001 storage-MDT0001-mdtlov_UUID 4
  9 UP lwp storage-MDT0000-lwp-MDT0001 storage-MDT0000-lwp-MDT0001_UUID 4
 10 UP osd-zfs storage-OST0001-osd storage-OST0001-osd_UUID 4
 11 UP ost OSS OSS_uuid 2
 12 UP obdfilter storage-OST0001 storage-OST0001_UUID 6
 13 UP lwp storage-MDT0000-lwp-OST0001 storage-MDT0000-lwp-OST0001_UUID 4
 14 UP lwp storage-MDT0001-lwp-OST0001 storage-MDT0001-lwp-OST0001_UUID 4

[snassyr@io01 ~]$ sudo lctl dl
  0 UP osd-zfs MGS-osd MGS-osd_UUID 4
  1 UP mgs MGS MGS 6
  2 UP mgc MGC10.31.7.61@o2ib 9f351a51-0232-4306-a66d-cecee8629329 4
  3 UP osd-zfs storage-MDT0000-osd storage-MDT0000-osd_UUID 9
  4 UP mds MDS MDS_uuid 2
  5 UP lod storage-MDT0000-mdtlov storage-MDT0000-mdtlov_UUID 3
  6 UP mdt storage-MDT0000 storage-MDT0000_UUID 12
  7 UP mdd storage-MDD0000 storage-MDD0000_UUID 3
  8 UP qmt storage-QMT0000 storage-QMT0000_UUID 3
  9 UP osp storage-MDT0001-osp-MDT0000 storage-MDT0000-mdtlov_UUID 4
 10 UP osp storage-OST0000-osc-MDT0000 storage-MDT0000-mdtlov_UUID 4
 11 UP osp storage-OST0001-osc-MDT0000 storage-MDT0000-mdtlov_UUID 4
 12 UP lwp storage-MDT0000-lwp-MDT0000 storage-MDT0000-lwp-MDT0000_UUID 4
 13 UP osd-zfs storage-OST0000-osd storage-OST0000-osd_UUID 4
 14 UP ost OSS OSS_uuid 2
 15 UP obdfilter storage-OST0000 storage-OST0000_UUID 6
 16 UP lwp storage-MDT0000-lwp-OST0000 storage-MDT0000-lwp-OST0000_UUID 4
 17 UP lwp storage-MDT0001-lwp-OST0000 storage-MDT0001-lwp-OST0000_UUID 4

On io01 I see repeating errors mentioning a network error:
[65922.582578] LustreError: 20017:0:(ldlm_lib.c:3540:target_bulk_io()) Skipped 11 previous similar messages
[66494.575431] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Failed to map mr 1/8 elements
[66494.575442] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Skipped 11 previous similar messages
[66494.575446] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Can't map 32768 bytes (8/8)s: -22
[66494.575448] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Skipped 11 previous similar messages
[66494.575452] LNetError: 20017:0:(o2iblnd_cb.c:1725:kiblnd_send()) Can't setup PUT src for 10.31.7.62@o2ib: -22
[66494.575454] LNetError: 20017:0:(o2iblnd_cb.c:1725:kiblnd_send()) Skipped 11 previous similar messages
[66494.575458] LustreError: 20017:0:(events.c:477:server_bulk_callback()) event type 5, status -5, desc 00000000cdd4e797
[66494.575460] LustreError: 20017:0:(events.c:477:server_bulk_callback()) Skipped 11 previous similar messages
[66546.574314] LustreError: 20017:0:(ldlm_lib.c:3540:target_bulk_io()) @@@ network error on bulk WRITE req@0000000070b8f1ab x1740960836990720/t0(0) o1000->storage-MDT0001-mdtlov_UUID@10.31.7.62@o2ib:522/0 lens 336/33016 e 0 to 0 dl 1660376137 ref 1 fl Interpret:/0/0 rc 0/0 job:''

On io02 I see repeating errors mentioning a bad log:

[66582.856444] LustreError: 14905:0:(llog_osd.c:264:llog_osd_read_header()) storage-MDT0000-osp-MDT0001: bad log [0x200000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
[66582.856450] LustreError: 14905:0:(llog_osd.c:264:llog_osd_read_header()) Skipped 11 previous similar messages

I can't make sense of these error messages. How can I recover? (I have the full dmesg/lctl dk logs, but they are too big to attach; is it ok to upload them somewhere and put a link in a reply?)
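As an aside on reading these messages: the negative rc values that Lustre and LNet print (rc -22, status -5, rc -95 in this thread) are negated Linux errno codes, so they can be looked up from any Python shell. A small sketch:

```python
import errno
import os

# rc values in Lustre/LNet console messages are negated Linux errno codes.
for rc in (-22, -95, -5):
    err = -rc
    name = errno.errorcode.get(err, "unknown")
    print(f"rc = {rc}: {name} ({os.strerror(err)})")
```

On Linux this resolves -22 to EINVAL (the ib_map_mr_sg() failure bubbling up through kiblnd_fmr_map_tx() and kiblnd_send()), -95 to EOPNOTSUPP, and -5 to EIO as reported by server_bulk_callback().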
Thank you and best regards,
Stepan

------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the Amtsgericht Dueren, no. HR B 3498
Chairman of the Supervisory Board: MinDir Volker Rieke
Management Board: Prof. Dr.-Ing. Wolfgang Marquardt (Chairman), Karsten Beneke (Deputy Chairman), Prof. Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior
------------------------------------------------------------------------------------------------
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org