[zfs-discuss] multiple crashes upon boot after upgrading build 134 to 138, 139 or 140

Steve Gonczi Tue, 25 May 2010 12:04:37 -0700

Greetings,

I see repeatable crashes on some systems after upgrading. The signature is
always the same:


operating system: 5.11 snv_139 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff00175f88c0 addr=0 
occurred in module "genunix" due to a NULL pointer dereference

list_remove+0x1b(ffffff03e19339f0, ffffff03e0814640)
zfs_acl_release_nodes+0x34(ffffff03e19339c0)
zfs_acl_free+0x16(ffffff03e19339c0)
zfs_znode_free+0x5e(ffffff03e17fa600)
zfs_zinactive+0x9b(ffffff03e17fa600)
zfs_inactive+0x11c(ffffff03e17f8500, ffffff03ee867528, 0)
fop_inactive+0xaf(ffffff03e17f8500, ffffff03ee867528, 0)
vn_rele_dnlc+0x6c(ffffff03e17f8500)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff00175f8c38)
nfssys+0x1e1(12, 8047dd8)

The stack always looks like the above; the vnode involved is sometimes a file,
sometimes a directory.

E.g., I have seen the /boot/acpi directory and the
/kernel/drv/amd64/acpi_driver file in the vnode's path field.
 
Looking at the data, I notice that z_acl.list_head indicates a single
member in the list (I presume so because list_prev and list_next point to
the same address):

(ffffff03e19339c0)::print zfs_acl_t
{
    z_acl_count = 0x6
    z_acl_bytes = 0x30
    z_version = 0x1
    z_next_ace = 0xffffff03e171d210
    z_hints = 0
    z_curr_node = 0xffffff03e0814640
    z_acl = {
        list_size = 0x40
        list_offset = 0
        list_head = {
            list_next = 0xffffff03e0814640
            list_prev = 0xffffff03e0814640
        }
    }
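
The "single member" reading of list_head can be written down as a small check. This is only a userland sketch: the struct layout is an assumption modelled on the fields visible in the ::print output above, and the function names (list_is_empty_model, list_has_single_member) are made up for illustration, not kernel APIs.

```c
#include <stddef.h>

/* Simplified userland stand-ins for the kernel list types (assumed layout). */
typedef struct list_node {
    struct list_node *list_next;
    struct list_node *list_prev;
} list_node_t;

typedef struct list {
    size_t      list_size;    /* size of each element, 0x40 in the dump */
    size_t      list_offset;  /* offset of the link node inside an element */
    list_node_t list_head;    /* sentinel node embedded in the list itself */
} list_t;

/* Empty circular list: the sentinel points at itself. */
static int list_is_empty_model(const list_t *l)
{
    return l->list_head.list_next == &l->list_head;
}

/* One member: first and last element are the same non-sentinel node. */
static int list_has_single_member(const list_t *l)
{
    return !list_is_empty_model(l) &&
        l->list_head.list_next == l->list_head.list_prev;
}
```

In the dump, list_next and list_prev are both 0xffffff03e0814640, which is not the sentinel's own address, so a check along these lines would report one member.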

This member's z_next pointers are bad (sometimes zero, sometimes a low number,
e.g. 0x10).
The null pointer crash happens trying to follow the list_prev pointer:

 0xffffff03e0814640::print zfs_acl_node_t
{
    z_next = {
        list_next = 0
        list_prev = 0
    }
    z_acldata = 0xffffff03e10b6230
    z_allocdata = 0xffffff03e171d200
    z_allocsize = 0x30
    z_size = 0x30
    z_ace_count = 0x6
    z_ace_idx = 0x2
}
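
For comparison, here is a rough sketch of what an intact lone member would be expected to look like. It assumes the list is circular with a sentinel embedded in the list header, and that the header is laid out as three 64-bit words before/including the sentinel as modelled below. Both are assumptions on my part, and the type and function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed 64-bit layout of the list header. Fixed-width fields keep the
 * arithmetic independent of the machine this is compiled on. */
typedef struct list_node_model {
    uint64_t list_next;
    uint64_t list_prev;
} list_node_model_t;

typedef struct list_model {
    uint64_t          list_size;
    uint64_t          list_offset;
    list_node_model_t list_head;
} list_model_t;

/* Address of the sentinel, given the list address passed to list_remove. */
static uint64_t expected_head_addr(uint64_t list_addr)
{
    return list_addr + offsetof(list_model_t, list_head);
}

/* A lone member is intact only if both links point back at the sentinel. */
static int single_member_intact(uint64_t list_addr, const list_node_model_t *n)
{
    uint64_t head = expected_head_addr(list_addr);
    return n->list_next == head && n->list_prev == head;
}
```

With the list address from the first stack (ffffff03e19339f0, the first argument to list_remove), this model puts the sentinel at 0xffffff03e1933a00, so an intact node would carry that value in both list_next and list_prev rather than 0.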


This is a repeating pattern: it seems there is always a single zfs_acl_node in
the list, with null or garbage list_next and list_prev pointers.
E.g., in another instance of this crash, the zfs_acl_node looks like this:

::stack
list_remove+0x1b(ffffff03e10d24f0, ffffff03e0fc9a00)
zfs_acl_release_nodes+0x34(ffffff03e10d24c0)
zfs_acl_free+0x16(ffffff03e10d24c0)
zfs_znode_free+0x5e(ffffff03e10cc200)
zfs_zinactive+0x9b(ffffff03e10cc200)
zfs_inactive+0x11c(ffffff03e1281840, ffffff03ea5c7010, 0)
fop_inactive+0xaf(ffffff03e1281840, ffffff03ea5c7010, 0)
vn_rele_dnlc+0x6c(ffffff03e1281840)
dnlc_purge+0x175()
nfs_idmap_args+0x5e(ffffff001811ac38)
nfssys+0x1e1(12, 8047dd8)
_sys_sysenter_post_swapgs+0x149()
> ::status
...
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff001811a8c0 addr=10 
occurred in module "genunix" due to a NULL pointer dereference

>  ffffff03e0fc9a00::print zfs_acl_node_t
{
    z_next = {
        list_next = 0xffffff03e10e1cd9
        list_prev = 0x10
    }
    z_acldata = 0
    z_allocdata = 0xffffff03e10cb5d0
    z_allocsize = 0x30
    z_size = 0x30
    z_ace_count = 0x6
    z_ace_idx = 0x2
}

It looks to me like the crash here is the same, and list_next / list_prev are
garbage.
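
One way to sanity-check that both panics come from the same place is to predict the faulting address from the node contents. The sketch below assumes the unlink stores through list_prev first (that is, node->list_prev->list_next = node->list_next). I have not confirmed that ordering against the source for these builds, so treat it as a hypothesis; the helper name is made up.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the link node with fixed-width fields (assumed 64-bit layout). */
typedef struct list_node_model {
    uint64_t list_next;
    uint64_t list_prev;
} list_node_model_t;

/* Assumed unlink order: the first store outside the node goes through
 * list_prev, i.e. node->list_prev->list_next = node->list_next.
 * Returns the address that store would touch. */
static uint64_t first_unlink_store_addr(const list_node_model_t *n)
{
    return n->list_prev + offsetof(list_node_model_t, list_next);
}
```

Under that assumption, the first node (list_prev = 0) gives address 0 and the second (list_prev = 0x10) gives 0x10, which lines up with addr=0 and addr=10 in the two panic messages.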

Has anybody seen this?
Am I skipping too many versions when I am image-updating?
I am hoping someone who knows this code will chime in.

Steve
-- 
This message posted from opensolaris.org