> On May 18, 2015, at 11:25 AM, Jeff Stockett <jstock...@molalla.com> wrote:
> 
> A drive failed in one of our supermicro 5048R-E1CR36L servers running omnios 
> r151012 last night, and somewhat unexpectedly, the whole system seems to have 
> panicked.
>  
> May 18 04:43:08 zfs01 scsi: [ID 365881 kern.info] 
> /pci@0,0/pci8086,2f02@1/pci15d9,808@0 (mpt_sas0):
> May 18 04:43:08 zfs01         Log info 0x31140000 received for target 29 
> w50000c0f01f1bf06.
> May 18 04:43:08 zfs01         scsi_status=0x0, ioc_status=0x8048, 
> scsi_state=0xc

[forward reference -- see note below about these mpt_sas messages]

> May 18 04:44:36 zfs01 genunix: [ID 843051 kern.info] NOTICE: SUNW-MSG-ID: 
> SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
> May 18 04:44:36 zfs01 unix: [ID 836849 kern.notice]
> May 18 04:44:36 zfs01 ^Mpanic[cpu0]/thread=ffffff00f3ecbc40:
> May 18 04:44:36 zfs01 genunix: [ID 918906 kern.notice] I/O to pool 'dpool' 
> appears to be hung.
> May 18 04:44:36 zfs01 unix: [ID 100000 kern.notice]
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecba20 
> zfs:vdev_deadman+10b ()

Bugs notwithstanding, the ZFS deadman timer fires when a ZFS I/O does not complete
within 10,000 seconds (by default). The problem likely lies below ZFS -- that is
exactly why the deadman timer was invented. Don't blame ZFS for a problem below ZFS.
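
For what it's worth, you can check how the deadman is configured on a running
kernel with mdb. The variable names below are my assumption for reasonably
current illumos bits (older releases used zfs_deadman_synctime, in seconds,
rather than zfs_deadman_synctime_ms in milliseconds), so verify against your
kernel:

    # print the deadman tunables from the live kernel (run as root)
    echo "zfs_deadman_enabled/D" | mdb -k
    echo "zfs_deadman_synctime_ms/E" | mdb -k

If you ever need to change them persistently, the usual route is an /etc/system
entry such as "set zfs:zfs_deadman_enabled = 0" plus a reboot, but that only
hides the symptom -- the hung I/O underneath is still there.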

> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecba70 
> zfs:vdev_deadman+4a ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecbac0 
> zfs:vdev_deadman+4a ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecbaf0 
> zfs:spa_deadman+ad ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecbb90 
> genunix:cyclic_softint+fd ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecbba0 
> unix:cbe_low_level+14 ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecbbf0 
> unix:av_dispatch_softvect+78 ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3ecbc20 
> apix:apix_dispatch_softint+35 ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05990 
> unix:switch_sp_and_call+13 ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e059e0 
> apix:apix_do_softint+6c ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05a40 
> apix:apix_do_interrupt+34a ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05a50 
> unix:cmnint+ba ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05bc0 
> unix:acpi_cpu_cstate+11b ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05bf0 
> unix:cpu_acpi_idle+8d ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05c00 
> unix:cpu_idle_adaptive+13 ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05c20 
> unix:idle+a7 ()
> May 18 04:44:36 zfs01 genunix: [ID 655072 kern.notice] ffffff00f3e05c30 
> unix:thread_start+8 ()
> May 18 04:44:36 zfs01 unix: [ID 100000 kern.notice]
> May 18 04:44:36 zfs01 genunix: [ID 672855 kern.notice] syncing file systems...
> May 18 04:44:38 zfs01 genunix: [ID 904073 kern.notice]  done
> May 18 04:44:39 zfs01 genunix: [ID 111219 kern.notice] dumping to 
> /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
> May 18 04:44:39 zfs01 ahci: [ID 405573 kern.info] NOTICE: ahci0: 
> ahci_tran_reset_dport port 1 reset port
> May 18 05:17:56 zfs01 genunix: [ID 100000 kern.notice]
> May 18 05:17:56 zfs01 genunix: [ID 665016 kern.notice] ^M100% done: 8607621 
> pages dumped,
> May 18 05:17:56 zfs01 genunix: [ID 851671 kern.notice] dump succeeded
>  
> The disks are all 4TB WD40001FYYG enterprise SAS drives.
> 

I've had such bad luck with that model that, IMNSHO, I recommend replacing them
with anything else :-(

That said, I don't think it is the root cause of this panic. To get the trail of
tears, you'll need to look at the FMA ereports for the 10,000 seconds prior to the
panic; fmdump has a -t option you'll find useful (see the example below). The
[forward reference] above is the result of a SCSI reset of the target, LUN, or HBA.
Those resets occur either when the sd driver has not had a reply and issues one of
those types of resets itself, *or* when the device or something else in the data
path resets.
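
As a sketch only (check fmdump(1M) for the exact time formats your release
accepts), something like this will show the error telemetry and any fault
diagnoses from the hours leading up to the 04:44 panic:

    # error reports (ereports) in the window before the panic, verbose
    fmdump -eV -t "05/18/15 01:00:00" -T "05/18/15 04:45:00"

    # fault management diagnoses over the same window
    fmdump -v -t "05/18/15 01:00:00" -T "05/18/15 04:45:00"

The -e output is where the resets, timeouts, and transport errors from sd and
mpt_sas will show up, and that sequence usually tells you which device or path
to suspect.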

HTH,
 -- richard

>   Googling seems to indicate it is a known problem with the way the various 
> subsystems sometimes interact. Is there any way to fix/workaround this issue?

_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
