Re: [lustre-discuss] 2.12.6 freeze
Hi,

It turns out there is a problem with the zpool, which we think was corrupted by a stonith event that fired when a disk on another pool started to report a predicted failure.

A zpool scrub has been done, and there are 5 files with permanent errors (zpool status -v):

    errors: Permanent errors have been detected in the following files:

        cos8-ost6/ost6:<0xe>
        cos8-ost6/ost6:<0x1a>
        cos8-ost6/ost6:<0x1c>
        cos8-ost6/ost6:/
        cos8-ost6/ost6:<0x193>

The fact that / is corrupted worries me! If we set the canmount=on property and mount the zpool, then an ls of the mount point gives an Input/output error.

Does anyone have experience with how to repair this? There is no hardware problem: all 12 disks within this z2 pool are fine. We think the stonith must have caused it, though I thought zfs was supposed to be immune to that!

Thanks...

On Tue, 30 Nov 2021, Tommi Tervo wrote:

> > Upon attempting to mount a zfs OST, we are getting:
> > Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
> > kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini())
> > ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
> >
> > Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
> > kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG
>
> Hi,
>
> Looks like LU-12675, time to upgrade 2.12.7?
>
> https://jira.whamcloud.com/browse/LU-12675
>
> HTH,
> -Tommi

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
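For anyone triaging a similar listing, here is a minimal Python sketch that splits the entries reported by zpool status -v into those shown as a bare object number (the <0x...> form, which as far as I understand is what ZFS prints when it cannot resolve a path for the object) and those shown as a path. The sample text is copied from the message above; the regular expression and the classify() helper are my own assumptions for illustration and are not part of any ZFS tool. It also assumes the dataset name contains no ':' character, which holds here but is not guaranteed in general.

```python
import re

# Entries copied from the `zpool status -v` output quoted in the message above.
STATUS_ERRORS = """\
errors: Permanent errors have been detected in the following files:

        cos8-ost6/ost6:<0xe>
        cos8-ost6/ost6:<0x1a>
        cos8-ost6/ost6:<0x1c>
        cos8-ost6/ost6:/
        cos8-ost6/ost6:<0x193>
"""

# Assumed layout: "<dataset>:<0xHEX>" or "<dataset>:/path".
# Assumption: the dataset name itself contains no ':'.
ENTRY = re.compile(r"^\s*(?P<dataset>[^\s:]+):(?P<target><0x[0-9a-fA-F]+>|/.*)$")


def classify(text):
    """Return (objects, paths): object-number entries and path entries."""
    objects, paths = [], []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue  # header line or blank line
        target = m.group("target")
        if target.startswith("<"):
            # Strip the angle brackets and parse the hex object number.
            objects.append((m.group("dataset"), int(target[1:-1], 16)))
        else:
            paths.append((m.group("dataset"), target))
    return objects, paths


objects, paths = classify(STATUS_ERRORS)
print("object numbers:", [hex(num) for _, num in objects])
print("paths:", [path for _, path in paths])
```

On the listing above this gives four object-number entries and one path entry, the dataset root "/", which is the entry that makes the mount point unreadable.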
Re: [lustre-discuss] 2.12.6 freeze
> Upon attempting to mount a zfs OST, we are getting:
> Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
> kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini())
> ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
>
> Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
> kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG

Hi,

Looks like LU-12675, time to upgrade to 2.12.7?

https://jira.whamcloud.com/browse/LU-12675

HTH,
-Tommi
Re: [lustre-discuss] 2.12.6 freeze
Additional info: exporting the pool, importing it on another (HA) server and attempting to mount it there gives the same problem, i.e. a kernel panic with the same trace as in my previous message.

A writeconf does not help.
Re: [lustre-discuss] 2.12.6 freeze
Some more information. This is repeatable... (previously the file system has been fine - it's an established file system).

To get this, we boot the node, and then do:

    zpool import -o cachefile=none pool1

zpool status shows all is well.

    mount -t lustre pool1/pool1 /mnt/lustre/pool1

And the kernel panics.

Some additional logs in /var/log/messages:

    Nov 29 18:37:54 c8oss01 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 128, npartitions: 2
    Nov 29 18:37:54 c8oss01 kernel: alg: No test for adler32 (adler32-zlib)
    Nov 29 18:37:55 c8oss01 kernel: Lustre: Lustre: Build Version: 2.12.6
    Nov 29 18:37:55 c8oss01 kernel: LNet: 40260:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface em2: it's down
    Nov 29 18:37:55 c8oss01 kernel: LNet: Using FastReg for registration
    Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.18.185.5@o2ib [32/512/0/100]
    Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.17.185.5@tcp [8/256/0/180]
    Nov 29 18:37:55 c8oss01 kernel: LNet: Accept secure, port 988
    Nov 29 18:37:55 c8oss01 zed: eid=85 class=data pool_guid=0x07C7BF473C816BCB
    Nov 29 18:37:55 c8oss01 kernel: LustreError: 40228:0:(lu_object.c:1267:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
    Nov 29 18:37:55 c8oss01 kernel: LustreError: 40228:0:(lu_object.c:1267:lu_device_fini()) LBUG
    Nov 29 18:37:55 c8oss01 kernel: Pid: 40228, comm: mount.lustre 3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
    Nov 29 18:37:55 c8oss01 zed: eid=86 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk
    Nov 29 18:37:55 c8oss01 kernel: Call Trace:
    Nov 29 18:37:56 c8oss01 kernel: [] libcfs_call_trace+0x8c/0xc0 [libcfs]
    Nov 29 18:37:56 c8oss01 zed: eid=87 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpatheg
    Nov 29 18:37:56 c8oss01 kernel: [] lbug_with_loc+0x4c/0xa0 [libcfs]
    Nov 29 18:37:56 c8oss01 kernel: [] lu_device_fini+0xbb/0xc0 [obdclass]
    Nov 29 18:37:56 c8oss01 zed: eid=88 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathbj
    Nov 29 18:37:56 c8oss01 kernel: [] dt_device_fini+0xe/0x10 [obdclass]
    Nov 29 18:37:56 c8oss01 kernel: [] osd_device_alloc+0x278/0x3b0 [osd_zfs]
    Nov 29 18:37:56 c8oss01 zed: eid=89 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathag
    Nov 29 18:37:56 c8oss01 kernel: [] obd_setup+0x119/0x280 [obdclass]
    Nov 29 18:37:56 c8oss01 zed: eid=90 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathaf
    Nov 29 18:37:56 c8oss01 kernel: [] class_setup+0x2a8/0x840 [obdclass]
    Nov 29 18:37:56 c8oss01 kernel: [] class_process_config+0x1726/0x2830 [obdclass]
    Nov 29 18:37:56 c8oss01 zed: eid=91 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathep
    Nov 29 18:37:56 c8oss01 kernel: [] do_lcfg+0x258/0x500 [obdclass]
    Nov 29 18:37:56 c8oss01 kernel: [] lustre_start_simple+0x88/0x210 [obdclass]
    Nov 29 18:37:56 c8oss01 zed: eid=92 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk
    Nov 29 18:37:56 c8oss01 kernel: [] server_fill_super+0xf55/0x1890 [obdclass]
    Nov 29 18:37:56 c8oss01 zed: eid=93 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpatheg
    Nov 29 18:37:56 c8oss01 kernel: [] lustre_fill_super+0x468/0x960 [obdclass]
    Nov 29 18:37:56 c8oss01 kernel: [] mount_nodev+0x4f/0xb0
    Nov 29 18:37:56 c8oss01 zed: eid=94 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathbj
    Nov 29 18:37:56 c8oss01 kernel: [] lustre_mount+0x38/0x60 [obdclass]
    Nov 29 18:37:56 c8oss01 kernel: [] mount_fs+0x3e/0x1b0
    Nov 29 18:37:56 c8oss01 zed: eid=95 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathag
    Nov 29 18:37:56 c8oss01 kernel: [] vfs_kern_mount+0x67/0x110
    Nov 29 18:37:56 c8oss01 kernel: [] do_mount+0x1ef/0xd00
    Nov 29 18:37:56 c8oss01 zed: eid=96 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathaf
    Nov 29 18:37:56 c8oss01 kernel: [] SyS_mount+0x83/0xd0
    Nov 29 18:37:56 c8oss01 kernel: [] system_call_fastpath+0x25/0x2a
    Nov 29 18:37:56 c8oss01 zed: eid=97 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathep
    Nov 29 18:37:56 c8oss01 kernel: [] 0x

We suspect corruption on the OST caused by a stonith event, but could be wrong. Any tips on how to solve this manually would be great...

Thanks,
Alastair.

On Mon, 29 Nov 2021, Alastair Basden wrote:

> Hi all,
>
> Upon attempting to mount a zfs OST, we are getting:
>
> Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
> kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
>
> Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
> kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG
>
> Followed by a system freeze.
>
> Has anyone else seen this? Any ideas?
>
> Thanks,
> Alastair.
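Because the zed events are interleaved with the kernel call trace in the log above, it can help to pull them out and count checksum events per device. The following is a rough Python sketch of that, run on a handful of lines copied from the log; the regular expression and the zed_events() helper are assumptions made for illustration and only handle the "key=value" layout seen in these particular lines.

```python
import re
from collections import Counter

# A few lines copied from the /var/log/messages excerpt above.
LOG = """\
Nov 29 18:37:55 c8oss01 zed: eid=85 class=data pool_guid=0x07C7BF473C816BCB
Nov 29 18:37:55 c8oss01 kernel: LustreError: 40228:0:(lu_object.c:1267:lu_device_fini()) LBUG
Nov 29 18:37:55 c8oss01 zed: eid=86 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk
Nov 29 18:37:56 c8oss01 zed: eid=87 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 zed: eid=92 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk
"""

# Assumed layout: "... zed: eid=N key=value key=value ..."
ZED = re.compile(r" zed: (?P<fields>eid=\d+ .*)$")


def zed_events(text):
    """Return one dict of key=value fields per zed line; skip other lines."""
    events = []
    for line in text.splitlines():
        m = ZED.search(line)
        if m:
            fields = m.group("fields").split()
            events.append(dict(f.split("=", 1) for f in fields))
    return events


events = zed_events(LOG)
per_vdev = Counter(
    e["vdev_path"] for e in events if e.get("class") == "checksum"
)
for vdev, count in sorted(per_vdev.items()):
    print(vdev, count)
```

Applied to the full excerpt, the same approach would show eid=86 through eid=97 as checksum events spread over six multipath devices (mpathk, mpatheg, mpathbj, mpathag, mpathaf, mpathep), two events each, preceded by a single class=data event (eid=85).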
[lustre-discuss] 2.12.6 freeze
Hi all,

Upon attempting to mount a zfs OST, we are getting:

    Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
    kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

    Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
    kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG

Followed by a system freeze.

Has anyone else seen this? Any ideas?

Thanks,
Alastair.
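When searching a bug tracker for a matching report, the source location in the LustreError line is usually the most useful key. Below is a small Python sketch that extracts it from the LBUG line quoted above. The regular expression is an assumption based only on the layout of the lines in this thread ("pid:number:(file:line:function())"), and parse_lustre_error() is a hypothetical helper, not something shipped with Lustre.

```python
import re

# LBUG line copied from the message above.
LINE = "kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG"

# Assumed layout: "LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>"
LOCATION = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[^:()]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)


def parse_lustre_error(line):
    """Return pid, source file, line number, function and message, or None."""
    m = LOCATION.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "file": m.group("file"),
        "line": int(m.group("line")),
        "func": m.group("func"),
        "msg": m.group("msg"),
    }


info = parse_lustre_error(LINE)
print(info)
```

Searching for the function name together with the assertion text (here lu_device_fini and "Refcount is 1") tends to be more robust than searching for the line number, which shifts between releases.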