Hi All - I need to use the MSOSUG hive mind!

I've got a newly built system (OpenSolaris 2009.06 / snv_111b) which is
panicking unpredictably, and I want to narrow down why this is
happening.
I've mainly seen it happen when the system is under higher IO load
(e.g. when doing a "zpool scrub rpool").
I have quite a few (15!) crashdumps, and have looked at the
function stacks, but there doesn't seem to be any consistent pattern.
Sun (God bless 'em!) have said that they've seen a single-bit flip in
one of the crashdumps, and are wondering if it's a hardware-related
issue.
(The memory is non-ECC.)
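
For anyone wanting to check the single-bit-flip idea against values pulled
from a dump, here is a minimal sketch in Python. It XORs an expected value
with the observed one and lists the bit positions that differ. The function
names and the example values are made up for illustration; they are not
taken from any of the crashdumps.

```python
def flipped_bits(expected, observed):
    """Return the positions of the bits that differ between two integers."""
    diff = expected ^ observed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def is_single_bit_flip(expected, observed):
    """True when the two values differ in exactly one bit."""
    return len(flipped_bits(expected, observed)) == 1


if __name__ == "__main__":
    # Hypothetical values: a kernel-style pointer and a copy with one bit toggled.
    good = 0xffffff02da10eec8
    bad = good ^ (1 << 21)
    print(flipped_bits(good, bad), is_single_bit_flip(good, bad))
```

A difference of exactly one bit is consistent with a memory or bus fault, but
it does not rule out a software bug, so I'd treat it as a hint only.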

I've already run memtest86 (which completed 13 iterations without
finding a fault), and am wondering about the next steps.

I was wondering how to subdivide the problem - my initial thoughts are to:

1) Remove the hard disks, and boot from the LiveCD - and then run some
memory and CPU stress tests - Can anyone suggest a suitable stress
test that could be run from the LiveCD (ideally in text / single-user
mode)?
2) Exchange the hard disks with some spares, and reinstall OS2009.06
(or another OS) from scratch.
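
As a starting point for option 1, here is a rough sketch of the kind of
user-space memory pattern test I have in mind, assuming a Python 3
interpreter is available in the boot environment (I have not checked that
it is). It fills a buffer with a few byte patterns and reads them back.
The function names, patterns and sizes are my own choices, and a clean run
would only cover the memory the process happens to be given, so it proves
far less than memtest86 does.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def pattern_pass(buf, pattern):
    """Fill buf with one byte pattern, read it back, return mismatch offsets."""
    fill = bytes([pattern]) * len(buf)
    buf[:] = fill                      # write the pattern in place
    if buf == fill:                    # fast path: everything read back intact
        return []
    return [i for i, b in enumerate(buf) if b != pattern]


def run(size_mb=64, rounds=1):
    """Run every pattern over a size_mb buffer; return the total mismatch count."""
    buf = bytearray(size_mb * 1024 * 1024)
    errors = 0
    for _ in range(rounds):
        for pattern in PATTERNS:
            errors += len(pattern_pass(buf, pattern))
    return errors


if __name__ == "__main__":
    print("mismatches:", run(size_mb=16, rounds=2))
```

Running several copies at once, with larger buffers and more rounds, would
load the CPUs and memory together, which seems closer to the conditions in
which the panics show up.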


Cheers-- Chris

PS - For those who might be interested, kmdb's ::msgbuf gives this output
on the latest crash:

panic[cpu2]/thread=ffffff02e172b020:
BAD TRAP: type=e (#pf Page fault) rp=ffffff000f89d870 addr=0 occurred in module "zfs" due to a NULL pointer dereference


zfs:
#pf Page fault
Bad kernel fault at addr=0x0
pid=6260, pc=0xfffffffff78a2fdb, sp=0xffffff000f89d960, eflags=0x10286
cr0: 80050033<pg,wp,ne,et,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0
cr3: 1e8640000
cr8: c

        rdi:                0 rsi: ffffff8000000000 rdx: ffffff02e172b020
        rcx:                1  r8: fffffffffbd09010  r9: ffffff02d78853d8
        rax:              200 rbx: ffffff02da10eec8 rbp: ffffff000f89d990
        r10:                0 r11:                0 r12: 48cc25d36ba3d0f4
        r13: ffffff02da10f480 r14:                0 r15:                0
        fsb:                0 gsb: ffffff02d8981a80  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                0 rip: fffffffff78a2fdb
         cs:               30 rfl:            10286 rsp: ffffff000f89d960
         ss:               38

ffffff000f89d750 unix:die+dd ()
ffffff000f89d860 unix:trap+1752 ()
ffffff000f89d870 unix:cmntrap+e9 ()
ffffff000f89d990 zfs:arc_buf_clone+1b ()
ffffff000f89da30 zfs:arc_read_nolock+264 ()
ffffff000f89daf0 zfs:dmu_objset_open_impl+e2 ()
ffffff000f89db50 zfs:dmu_objset_open_ds_os+69 ()
ffffff000f89dbc0 zfs:dmu_objset_open+af ()
ffffff000f89dc00 zfs:zfs_ioc_objset_stats+33 ()
ffffff000f89dc40 zfs:zfs_ioc_snapshot_list_next+d6 ()
ffffff000f89dcc0 zfs:zfsdev_ioctl+10b ()
ffffff000f89dd00 genunix:cdev_ioctl+45 ()
ffffff000f89dd40 specfs:spec_ioctl+83 ()
ffffff000f89ddc0 genunix:fop_ioctl+7b ()
ffffff000f89dec0 genunix:ioctl+18e ()
ffffff000f89df10 unix:brand_sys_syscall32+197 ()

syncing file systems...
 done
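
To compare the 15 dumps more systematically, something like the following
Python sketch could pull the first non-trap frame out of each saved msgbuf
text and count how often each one appears. The regular expression is based
only on the frame layout shown above, and the set of frames to skip is an
assumption drawn from this one trace; panics that arrive by a different path
will start with other frames.

```python
import re
from collections import Counter

# Matches lines such as "ffffff000f89d990 zfs:arc_buf_clone+1b ()".
FRAME_RE = re.compile(
    r"^[0-9a-f]{16}[ \t]+(\w+):(\w+)\+[0-9a-f]+ \(\)[ \t]*$", re.M)

# Generic trap-handling frames seen in this trace (assumed, not exhaustive).
TRAP_FRAMES = {"die", "trap", "cmntrap"}


def faulting_frame(msgbuf_text):
    """Return 'module:function' for the first frame below the trap handlers."""
    for module, func in FRAME_RE.findall(msgbuf_text):
        if module == "unix" and func in TRAP_FRAMES:
            continue
        return module + ":" + func
    return None


def tally(msgbuf_texts):
    """Count the faulting frame across several msgbuf dumps."""
    return Counter(faulting_frame(text) for text in msgbuf_texts)


if __name__ == "__main__":
    sample = (
        "ffffff000f89d750 unix:die+dd ()\n"
        "ffffff000f89d860 unix:trap+1752 ()\n"
        "ffffff000f89d870 unix:cmntrap+e9 ()\n"
        "ffffff000f89d990 zfs:arc_buf_clone+1b ()\n"
    )
    print(tally([sample]))
```

If the counts are spread across unrelated modules, that would fit the
hardware theory better than a single ZFS bug would.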


-- 

Regards,

Chris
