Hi, I'd suggest first identifying whether it's a software/compatibility issue or a hardware issue. Given the single bit flip you mention, and my intuition that you've done no real tinkering, I'd guess it more likely indicates hardware. Have you attempted to replicate the issue by testing the LiveCD on a different but similar motherboard (with the same chipset)?
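On the bit flip: if you want something quick to run from the LiveCD alongside memtest86, a userland pattern test is easy to knock up. Below is a minimal sketch in Python 3, assuming an interpreter is available on whatever you boot. The names find_bit_flips and pattern_test are my own, not from any existing tool. It only touches memory the OS hands to the process and goes through the CPU caches, so a clean run proves very little; a reported flip, on the other hand, would point fairly strongly at hardware.

```python
# Crude userland memory pattern test (illustrative sketch, not a memtest86 replacement).
# Assumption: Python 3; function names are hypothetical.

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def find_bit_flips(buf, pattern):
    """Return (byte_offset, bit_index) for every bit in buf that differs from pattern."""
    flips = []
    expected = bytes([pattern]) * len(buf)
    if buf == expected:  # fast path: buffer still holds the pattern
        return flips
    for offset, value in enumerate(buf):
        diff = value ^ pattern  # set bits mark the positions that changed
        bit = 0
        while diff:
            if diff & 1:
                flips.append((offset, bit))
            diff >>= 1
            bit += 1
    return flips


def pattern_test(size_bytes, rounds=1):
    """Fill a buffer with each pattern, read it back, and collect any flipped bits."""
    errors = []
    for _ in range(rounds):
        for pattern in PATTERNS:
            buf = bytearray([pattern]) * size_bytes
            for offset, bit in find_bit_flips(buf, pattern):
                errors.append((pattern, offset, bit))
    return errors
```

Something like pattern_test(256 * 1024 * 1024, rounds=10) while the disks are idle, and again during a scrub, would at least show whether flips track IO load. Note that the comparison allocates a second buffer of the same size, so keep size_bytes well under half of free RAM.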
As already mentioned, but just to be sure: a barebones setup (board, CPU, one stick of RAM, graphics) is best. I don't know if your board has onboard video, but you might like to revert to that for testing. On the other hand, disable all south bridge functions (NIC, sound, etc.). There's often a 'load failsafe defaults' option in the BIOS setup.

As mentioned, check the cable. Actually, don't bother checking it, just replace it.

What are your temps like (CPU/case)? Is cooling adequate? Does the ambient environment seem to affect the frequency of crashes?

Adequate power supply to the components is also important. Are you using a nice, weighty, high-wattage PSU? You mention that high IO load brings on the issue, so maybe there isn't enough juice?

You might also like to try a non-Solaris-based test suite as well as the Seagate SeaTools suite (it will pass, though, I'm sure). The Ultimate Boot CD (UBCD) has a few other disk testing utilities included.

Good luck,
Peter Davidoff.

Jeremy Stagg wrote:
> Chris
>
> Have you manipulated the ZFS ARC setting re memory consumption ?
>
> Regards
>
> Jeremy.
>
> >>> On 3/06/2009 at 9:37 am, in message
> <206bef920906021637u65d451a0y7c1ad2c46bfd57e7 at mail.gmail.com>, Chris
> Wells <chris.unix.dude at gmail.com> wrote:
> Hi All - I need to use the MSOSUG hive mind!
>
> I've got a newly built system (OpenSolaris 2009.06 / nv111b) which is
> unpredictably panicking, and I want to narrow down why this is
> happening.
> I've seen it mainly happening when the system is under higher IO load
> (eg when doing a "zfs scrub rpool").
> I have got quite a few (15!) crashdumps, and have looked at the
> function stacks, and there doesn't seem to be any consistent pattern.
> Sun (God bless em!) have said that they've seen a single-bit flip in
> one of the crashdumps, and are wondering if it's hardware related
> issue.
> (The memory is non-ECC).
>
> I've already run memtest86 (which completed 13 iterations without
> finding fault), and am wondering on the next steps.
>
> I was wondering how to subdivide the problem - my initial thoughts are to:
>
> 1) Remove the harddisks, and boot from the LiveCD - and then run some
> memory and CPU stress tests - Can anyone suggest a suitable stress
> test that could be run from the LiveCD (ideally in text / singleuser
> mode)?
> 2) Exchange the harddisks with some spares, and reinstall OS2009.06
> (or another OS) from scratch.
>
>
> Cheers-- Chris
>
> PS -For those which might be interested kmdb msgbuf gives this output
> on the latest crash:
>
> panic[cpu2]/thread=ffffff02e172b020:
> BAD TRAP: type=e (#pf Page fault) rp=ffffff000f89d870 addr=0 occurred in
> module
> "zfs" due to a NULL pointer dereference
>
>
> zfs:
> #pf Page fault
> Bad kernel fault at addr=0x0
> pid=6260, pc=0xfffffffff78a2fdb, sp=0xffffff000f89d960, eflags=0x10286
> cr0: 80050033<pg,wp,ne,et,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
> cr2: 0
> cr3: 1e8640000
> cr8: c
>
> rdi: 0 rsi: ffffff8000000000 rdx: ffffff02e172b020
> rcx: 1 r8: fffffffffbd09010 r9: ffffff02d78853d8
> rax: 200 rbx: ffffff02da10eec8 rbp: ffffff000f89d990
> r10: 0 r11: 0 r12: 48cc25d36ba3d0f4
> r13: ffffff02da10f480 r14: 0 r15: 0
> fsb: 0 gsb: ffffff02d8981a80 ds: 4b
> es: 4b fs: 0 gs: 1c3
> trp: e err: 0 rip: fffffffff78a2fdb
> cs: 30 rfl: 10286 rsp: ffffff000f89d960
> ss: 38
>
> ffffff000f89d750 unix:die+dd ()
> ffffff000f89d860 unix:trap+1752 ()
> ffffff000f89d870 unix:cmntrap+e9 ()
> ffffff000f89d990 zfs:arc_buf_clone+1b ()
> ffffff000f89da30 zfs:arc_read_nolock+264 ()
> ffffff000f89daf0 zfs:dmu_objset_open_impl+e2 ()
> ffffff000f89db50 zfs:dmu_objset_open_ds_os+69 ()
> ffffff000f89dbc0 zfs:dmu_objset_open+af ()
> ffffff000f89dc00 zfs:zfs_ioc_objset_stats+33 ()
> ffffff000f89dc40 zfs:zfs_ioc_snapshot_list_next+d6 ()
> ffffff000f89dcc0 zfs:zfsdev_ioctl+10b ()
> ffffff000f89dd00 genunix:cdev_ioctl+45 ()
> ffffff000f89dd40 specfs:spec_ioctl+83 ()
> ffffff000f89ddc0 genunix:fop_ioctl+7b ()
> ffffff000f89dec0 genunix:ioctl+18e ()
> ffffff000f89df10 unix:brand_sys_syscall32+197 ()
>
> syncing file systems...
> done
>
>
> --
>
> Regards,
>
> Chris
> _______________________________________________
> ug-msosug mailing list
> ug-msosug at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/ug-msosug
>
> *Network Help Pty Ltd*
> *Phone:* +61-3-9459-2122
> *Facsimile: *+61-3-9459-5322
> * Website:* http://www.networkhelp.com.au
