The first crash was dramatic, because the machine was unable to boot. Checking the logs in single user mode, I saw this:
"""
+ dumpadm -y -d /dev/zvol/dsk/zones/dump
dumpadm: dump device /dev/zvol/dsk/zones/dump is too small to hold a system dump
dump size 2121340928 bytes, device size 1220542464 bytes
+ fatal 'failed to configure dump device'
+ echo 'Error: failed to configure dump device'
Error: failed to configure dump device
+ exit 95
"""
After googling around, I increased "/dev/zvol/dsk/zones/dump" size to 30GB (overkill, but I don't want this to happen ever again).
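To put the two numbers from the error message side by side, here is a small sketch of the arithmetic. The values are copied from the dumpadm output; the variable names are only illustrative, and the 30G figure is taken as 30 GiB, which is an assumption about how the volume size is expressed.

```python
# Values copied from the dumpadm error message above.
dump_size = 2_121_340_928      # bytes dumpadm estimated for a system dump
device_size = 1_220_542_464    # bytes available on the dump device at the time

GIB = 1024 ** 3

shortfall = dump_size - device_size
print(f"needed:    {dump_size / GIB:.2f} GiB")
print(f"available: {device_size / GIB:.2f} GiB")
print(f"shortfall: {shortfall / GIB:.2f} GiB")

# Assumption: the new 30G volume size means 30 GiB.
new_size = 30 * GIB
ratio = new_size / dump_size
print(f"30G volume is about {ratio:.1f}x the estimated dump size")
```

The device was short by less than 1 GiB, so the 30G volume leaves a very wide margin.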
Not being able to boot the machine in this situation should be considered a bug. Please, fix it.
After that, I hoped to get crash dumps somewhere. My configuration is:
"""
[root@srvzfs3 /var/crash/volatile]# dumpadm
Dump content: kernel pages
Dump device: /dev/zvol/dsk/zones/dump (dedicated)
Savecore directory: /var/crash/volatile
Savecore enabled: yes
Save compressed: on
Dump encrypted: no
[root@srvzfs3 /var/crash/volatile]# zfs get volsize zones/dump
NAME PROPERTY VALUE SOURCE
zones/dump volsize 30G local
"""
But there is nothing in "/var/crash/volatile", it is empty (there was a
dump there from a 2017 crash that I deleted). Nevertheless, the boot
takes forever. Doing a "savecore" manually I got this:
"""
[root@srvzfs3 /var/crash/volatile]# savecore -v
savecore: bad magic number e16aa54a
savecore: bad summary magic bdec9c78
"""
During the first three crashes the machine was resilvering after a hard disk replacement (the hardware replacement window was used to upgrade the platform to joyent_20201217T173522Z), but this morning the machine crashed again and the resilvering was already done.
The third crash showed some messages on the screen. I transcribe the (bad quality) photo that the operator sent me; I hope there are no typos:
"""
srvfs3 wcons login: 2020-12-25T18:53:24.439856+00:00 srvzfs3 savecore: [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
2020-12-25T18:53:35.688665+00:00 srvzf3 savecore: [ID 570001 auth.error] reboot after panic: mutex_enter: bad mutex, lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0
"""
After that, the machine hangs. There is no automatic reboot; it needs a hard reset. (Talking with the operator, this picture was sent yesterday after the server crash, but it shows errors from the 25th, so maybe it refers to a PREVIOUS crash.)
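In case it helps whoever looks at this, here is a minimal sketch that pulls the three addresses out of the panic string and compares them. The string is copied from the transcription above (which may itself contain typos from the photo), and the comparison is only an observation, not a diagnosis.

```python
import re

# Panic text as transcribed above (timestamp and hostname prefix omitted).
line = ("reboot after panic: mutex_enter: bad mutex, "
        "lp=ffffff07174ffc08 owner=ffffff0703fbbbc0 thread=ffffff0703fbbbc0")

# Collect every name=hexvalue pair into a dictionary.
fields = dict(re.findall(r"(\w+)=([0-9a-f]+)", line))
for name, value in fields.items():
    print(f"{name:>6} = 0x{value}")

# Observation only: in this transcription the owner and thread values match.
same = fields["owner"] == fields["thread"]
print("owner == thread:", same)
```

Both log lines carry the same three values, so they appear to describe the same panic.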
I am quite surprised about the "auth.error" messages. This machine is an NFS server not connected to the Internet. I don't know if it is relevant.
Checking the "zpool history", the replacement was done the right way:
"""
[root@srvzfs3 /var/crash/volatile]# zpool history
[...]
2020-12-19.23:46:48 zpool replace zones 8990018183995816436 c1t0d0
[...]
[root@srvzfs3 /var/crash/volatile]# zpool status
pool: zones
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: resilvered 2.14T in 5 days 09:59:58 with 0 errors on Sun Dec 27 10:12:52 2020
config:
NAME STATE READ WRITE CKSUM
zones ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c1t3d0 ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
c1t4d0s0 ONLINE 0 0 0
c1t5d0s0 ONLINE 0 0 0
cache
c1t4d0s1 ONLINE 0 0 0
c1t5d0s1 ONLINE 0 0 0
errors: No known data errors
"""
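For reference, the "scan:" line above implies a rather low average resilver rate. A rough sketch of the calculation follows; it assumes that the "2.14T" printed by zpool is in binary units (TiB), and the variable names are only illustrative.

```python
from datetime import timedelta

# Figures from the "scan:" line of zpool status above.
resilvered_tib = 2.14   # assumption: "T" means TiB (2**40 bytes)
elapsed = timedelta(days=5, hours=9, minutes=59, seconds=58)

TIB = 1024 ** 4
MIB = 1024 ** 2

seconds = elapsed.total_seconds()
rate = resilvered_tib * TIB / seconds
print(f"elapsed: {seconds:.0f} s")
print(f"average: {rate / MIB:.1f} MiB/s")
```

That works out to a few MiB/s on average over the whole resilver.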
"zdb" shows this (the "ashift" is "9" because this is a quite old ZPOOL):
"""
zones:
version: 5000
name: 'zones'
state: 0
txg: 37335958
pool_guid: 2807429990997653683
errata: 0
hostid: 542799372
hostname: ''
com.delphix:has_per_vdev_zaps
vdev_children: 3
vdev_tree:
type: 'root'
id: 0
guid: 2807429990997653683
children[0]:
type: 'mirror'
id: 0
guid: 8841657624222278566
metaslab_array: 39
metaslab_shift: 34
ashift: 9
asize: 2999985635328
is_log: 0
create_txg: 4
com.delphix:vdev_zap_top: 55
children[0]:
type: 'disk'
id: 0
guid: 8956384447561865843
path: '/dev/dsk/c1t0d0s0'
devid: 'id1,sd@n60030480008a7d2027714acd11cdc60e/a'
phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@0,0:a'
whole_disk: 1
DTL: 1235
create_txg: 4
com.delphix:vdev_zap_leaf: 773
children[1]:
type: 'disk'
id: 1
guid: 301314384901939396
path: '/dev/dsk/c1t1d0s0'
devid: 'id1,sd@n60030480008a7d201ea8de8a42098cb9/a'
phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@1,0:a'
whole_disk: 1
DTL: 8621
create_txg: 4
com.delphix:vdev_zap_leaf: 107
children[1]:
type: 'mirror'
id: 1
guid: 4227076483237831215
metaslab_array: 36
metaslab_shift: 34
ashift: 9
asize: 2999985635328
is_log: 0
create_txg: 4
com.delphix:vdev_zap_top: 108
children[0]:
type: 'disk'
id: 0
guid: 6021145974097762388
path: '/dev/dsk/c1t2d0s0'
devid: 'id1,sd@n60030480008a7d201ea8de8c42214e32/a'
phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@2,0:a'
whole_disk: 1
DTL: 8620
create_txg: 4
com.delphix:vdev_zap_leaf: 109
children[1]:
type: 'disk'
id: 1
guid: 9695570681430649539
path: '/dev/dsk/c1t3d0s0'
devid: 'id1,sd@n60030480008a7d201ea8de8d4239f176/a'
phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@3,0:a'
whole_disk: 1
DTL: 8619
create_txg: 4
com.delphix:vdev_zap_leaf: 110
children[2]:
type: 'mirror'
id: 2
guid: 1877341096729848291
metaslab_array: 120
metaslab_shift: 24
ashift: 9
asize: 2150105088
is_log: 1
create_txg: 12527708
com.delphix:vdev_zap_top: 117
children[0]:
type: 'disk'
id: 0
guid: 6693173462782706499
path: '/dev/dsk/c1t4d0s0'
devid: 'id1,sd@n60030480008a7d2022260a60148b2236/a'
phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@4,0:a'
whole_disk: 0
DTL: 1238
create_txg: 12527708
com.delphix:vdev_zap_leaf: 118
children[1]:
type: 'disk'
id: 1
guid: 87265357747160889
path: '/dev/dsk/c1t5d0s0'
devid: 'id1,sd@n60030480008a7d2022260a60148b6e95/a'
phys_path: '/pci@0,0/pci8086,340c@5/pci15d9,700@0/sd@5,0:a'
whole_disk: 0
DTL: 1237
create_txg: 12527708
com.delphix:vdev_zap_leaf: 119
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
"""
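As a quick reading aid for the zdb output, "ashift" and "metaslab_shift" are base-2 exponents. The sketch below converts the values shown for mirror-0 into sizes; the numbers are copied from the output above and the variable names are only illustrative.

```python
# Values from the zdb output above (top-level vdev mirror-0).
ashift = 9
metaslab_shift = 34
asize = 2_999_985_635_328   # bytes

GIB = 1024 ** 3
TIB = 1024 ** 4

sector = 1 << ashift             # smallest allocation unit, in bytes
metaslab = 1 << metaslab_shift   # size of one metaslab, in bytes
whole_metaslabs = asize // metaslab

print(f"allocation unit: {sector} bytes")
print(f"metaslab size:   {metaslab // GIB} GiB")
print(f"whole metaslabs: {whole_metaslabs}")
print(f"vdev size:       {asize / TIB:.2f} TiB")
```

So "ashift: 9" means 512-byte allocation units, which matches the age of the pool.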
I hope this is useful to somebody. Please let me know how to dig
deeper into debugging this.
Thanks!

-- 
Jesús Cea Avión
[email protected] - https://www.jcea.es/
Twitter: @jcea
jabber / xmpp:[email protected]
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz
