Re: [lustre-discuss] 2.12.6 freeze

2021-12-01 Thread Alastair Basden

Hi,

It turns out there is a problem with the zpool, which we think was corrupted
by a stonith event triggered when a disk in another pool started reporting a
predicted failure.


A zpool scrub has been done, and there are 5 files with permanent errors 
(zpool status -v):

errors: Permanent errors have been detected in the following files:

cos8-ost6/ost6:<0xe>
cos8-ost6/ost6:<0x1a>
cos8-ost6/ost6:<0x1c>
cos8-ost6/ost6:/
cos8-ost6/ost6:<0x193>

The fact that / is corrupted worries me!
If we set the canmount=on property and mount the zpool, then an ls of the 
mount point gives an Input/output error.
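
For reference, here is a minimal Python sketch that splits the listing above
into dataset, kind and value. It is an illustration rather than a tested
tool: the parse_permanent_errors helper and its regex are hypothetical, and
the sample text is copied from the zpool status -v output above. As far as I
understand, entries in angle brackets are hexadecimal object numbers that ZFS
could not resolve to a path, while / is the dataset's root directory.

```python
import re

# Sample copied from the `zpool status -v` output quoted in this message.
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

cos8-ost6/ost6:<0xe>
cos8-ost6/ost6:<0x1a>
cos8-ost6/ost6:<0x1c>
cos8-ost6/ost6:/
cos8-ost6/ost6:<0x193>
"""

# Hypothetical pattern: "<dataset>:<0xHEX>" or "<dataset>:/some/path".
ENTRY = re.compile(
    r"^\s*(?P<dataset>[^\s:]+):(?P<target><0x[0-9a-fA-F]+>|/.*)$"
)


def parse_permanent_errors(text):
    """Return (dataset, kind, value) tuples for each error entry."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # header line, blank line, or unrelated output
        target = match.group("target")
        if target.startswith("<"):
            # Unresolved object: keep the object number as an integer.
            entries.append((match.group("dataset"), "object", int(target[1:-1], 16)))
        else:
            entries.append((match.group("dataset"), "path", target))
    return entries


if __name__ == "__main__":
    for entry in parse_permanent_errors(SAMPLE):
        print(entry)
```

On the sample this yields four unresolved objects and one path entry for the
dataset root.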


Does anyone have experience with how to repair this?

There is no hardware problem: all 12 disks within this raidz2 pool are fine.
We think the stonith must have caused it, though I thought zfs was supposed
to be immune to that!


Thanks...


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Tommi Tervo
> Upon attempting to mount a zfs OST, we are getting:
> Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
>  kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini())
> ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
> 
> Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
>  kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG

Hi,

Looks like LU-12675, time to upgrade to 2.12.7?

https://jira.whamcloud.com/browse/LU-12675

HTH, 
-Tommi


Re: [lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden
Additional info: exporting the pool, importing it on another (HA) server and
attempting to mount it there hits the same problem, i.e. a kernel panic with
the trace shown below.


A writeconf does not help.

Re: [lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden
Some more information.  This is repeatable... (previously the file system 
has been fine - it's an established file system).


To get this, we boot the node, and then do:
zpool import -o cachefile=none  pool1
zpool status shows all is well.

mount -t lustre pool1/pool1 /mnt/lustre/pool1

And then the kernel panics.


Some additional logs in /var/log/messages:
Nov 29 18:37:54 c8oss01 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 128, 
npartitions: 2
Nov 29 18:37:54 c8oss01 kernel: alg: No test for adler32 (adler32-zlib)
Nov 29 18:37:55 c8oss01 kernel: Lustre: Lustre: Build Version: 2.12.6
Nov 29 18:37:55 c8oss01 kernel: LNet: 
40260:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface em2: 
it's down
Nov 29 18:37:55 c8oss01 kernel: LNet: Using FastReg for registration
Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.18.185.5@o2ib [32/512/0/100]
Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.17.185.5@tcp [8/256/0/180]
Nov 29 18:37:55 c8oss01 kernel: LNet: Accept secure, port 988
Nov 29 18:37:55 c8oss01 zed: eid=85 class=data pool_guid=0x07C7BF473C816BCB
Nov 29 18:37:55 c8oss01 kernel: LustreError:
40228:0:(lu_object.c:1267:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref)
== 0 ) failed: Refcount is 1
Nov 29 18:37:55 c8oss01 kernel: LustreError: 
40228:0:(lu_object.c:1267:lu_device_fini()) LBUG
Nov 29 18:37:55 c8oss01 kernel: Pid: 40228, comm: mount.lustre 
3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
Nov 29 18:37:55 c8oss01 zed: eid=86 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathk
Nov 29 18:37:55 c8oss01 kernel: Call Trace:
Nov 29 18:37:56 c8oss01 kernel: [] 
libcfs_call_trace+0x8c/0xc0 [libcfs]
Nov 29 18:37:56 c8oss01 zed: eid=87 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 kernel: [] lbug_with_loc+0x4c/0xa0 
[libcfs]
Nov 29 18:37:56 c8oss01 kernel: [] lu_device_fini+0xbb/0xc0 
[obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=88 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 kernel: [] dt_device_fini+0xe/0x10 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
osd_device_alloc+0x278/0x3b0 [osd_zfs]
Nov 29 18:37:56 c8oss01 zed: eid=89 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 kernel: [] obd_setup+0x119/0x280 
[obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=90 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathaf
Nov 29 18:37:56 c8oss01 kernel: [] class_setup+0x2a8/0x840 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
class_process_config+0x1726/0x2830 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=91 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathep
Nov 29 18:37:56 c8oss01 kernel: [] do_lcfg+0x258/0x500 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
lustre_start_simple+0x88/0x210 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=92 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathk
Nov 29 18:37:56 c8oss01 kernel: [] 
server_fill_super+0xf55/0x1890 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=93 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 kernel: [] 
lustre_fill_super+0x468/0x960 [obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] mount_nodev+0x4f/0xb0
Nov 29 18:37:56 c8oss01 zed: eid=94 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 kernel: [] lustre_mount+0x38/0x60 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] mount_fs+0x3e/0x1b0
Nov 29 18:37:56 c8oss01 zed: eid=95 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 kernel: [] vfs_kern_mount+0x67/0x110
Nov 29 18:37:56 c8oss01 kernel: [] do_mount+0x1ef/0xd00
Nov 29 18:37:56 c8oss01 zed: eid=96 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathaf
Nov 29 18:37:56 c8oss01 kernel: [] SyS_mount+0x83/0xd0
Nov 29 18:37:56 c8oss01 kernel: [] 
system_call_fastpath+0x25/0x2a
Nov 29 18:37:56 c8oss01 zed: eid=97 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathep
Nov 29 18:37:56 c8oss01 kernel: [] 0x

We suspect corruption on the OST caused by a stonith event, but could be
wrong.  Any tips on how to fix this manually would be great...
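
The zed lines interleaved with the trace can be tallied per device to see how
widely the checksum events are spread. Below is a rough Python sketch,
assuming each zed event sits on a single line as it would in
/var/log/messages (the list archive wraps them). The checksum_events_by_vdev
helper and its regex are hypothetical, and the sample lines are copied from
the log above with the wrapping undone.

```python
import re
from collections import Counter

# zed events copied from the log in this message, one event per line.
SAMPLE = """\
Nov 29 18:37:55 c8oss01 zed: eid=85 class=data pool_guid=0x07C7BF473C816BCB
Nov 29 18:37:55 c8oss01 zed: eid=86 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk
Nov 29 18:37:56 c8oss01 zed: eid=87 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 zed: eid=88 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 zed: eid=89 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 zed: eid=90 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathaf
Nov 29 18:37:56 c8oss01 zed: eid=91 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathep
Nov 29 18:37:56 c8oss01 zed: eid=92 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk
Nov 29 18:37:56 c8oss01 zed: eid=93 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 zed: eid=94 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 zed: eid=95 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 zed: eid=96 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathaf
Nov 29 18:37:56 c8oss01 zed: eid=97 class=checksum pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathep
"""

# Hypothetical pattern matching the zed syslog lines shown above;
# vdev_path is optional because the class=data event does not carry one.
ZED_EVENT = re.compile(
    r"zed: eid=(?P<eid>\d+) class=(?P<cls>\S+) pool_guid=(?P<guid>\S+)"
    r"(?: vdev_path=(?P<vdev>\S+))?"
)


def checksum_events_by_vdev(text):
    """Count class=checksum zed events per vdev_path."""
    counts = Counter()
    for line in text.splitlines():
        match = ZED_EVENT.search(line)
        if match and match.group("cls") == "checksum" and match.group("vdev"):
            counts[match.group("vdev")] += 1
    return counts


if __name__ == "__main__":
    for vdev, count in sorted(checksum_events_by_vdev(SAMPLE).items()):
        print(vdev, count)
```

On the sample this reports two checksum events on each of six multipath
devices, so the events are not confined to a single disk.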


Thanks,
Alastair.


[lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden

Hi all,

Upon attempting to mount a zfs OST, we are getting:
Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
 kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) 
ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1


Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
 kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG


Followed by a system freeze.

Has anyone else seen this?  Any ideas?

Thanks,
Alastair.