[osol-discuss] Possible ZFS corruption?

Joel Robison Thu, 11 Jun 2009 16:46:13 -0700

Hi everyone,

This is my first post to these forums, but I must first say that there is a lot 
of very useful data here, and I am very glad to see such a large community of 
contributors. I have been working with Solaris 10 for about 3 years now; coming 
from the Linux world, I see no reason to look back.


My problem is this (no, no transition material, sorry):

For a while now I have been using Solaris 10 to store terabytes of data on ZFS. This 
file system has been great at handling large datasets (no running crazy tools 
like diskpart to get ext* to recognize a disk > 1.8T and then worrying about 
losing it because the journal gets corrupted). However, I recently ran into an 
issue with some storage enclosures that I was starting to verify before putting 
them into production.

They are Supermicro storage enclosures that hold 24 1TB disks (Seagate ES2 
SATA), connected via an LSI Logic 1068-based card. To test the new enclosure, I 
had been copying data from other sources (lots of large files) to it. The 
enclosure was about 80% full when the machine panicked. It then proceeded to 
reboot and got stuck in a reboot loop, each time giving me core dump 
information:

{code}
storage01# mdb unix.0 vmcore.0                                       
Loading modules: [ unix krtld genunix specfs dtrace cpu.generic uppc pcplusmp 
zfs ip hook neti sctp arp usba uhci fcp fctl md lofs mpt fcip random crypto 
logindmux ptm ufs nfs ]

> ::status
debugging crash dump vmcore.0 (64-bit) from storage01
operating system: 5.10 Generic_139556-08 (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe80010fe7a0 addr=0 
occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only
> 

> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000001 fffffffffbc25800 sched
R      3      0      0      0      0 0x00020001 ffffffff82b14a78 fsflush
R      2      0      0      0      0 0x00020001 ffffffff82b156e0 pageout
R      1      0      0      0      0 0x4a004000 ffffffff82b16348 init
R    945      1    945    945      0 0x42000000 ffffffff8b2a1c88 rcm_daemon
R    846      1    846    846     25 0x52010000 ffffffff8b2a7360 sendmail
R    847      1    847    847      0 0x52010000 ffffffff8b2a66f8 sendmail
R    841      1    841    841      0 0x42000000 ffffffff8866ce18 fmd
R    836      1    836    836      0 0x42000000 ffffffff8b2a4e28 syslogd
R    816      1    816    816      0 0x42000000 ffffffff88669010 automountd
R    818    816    816    816      0 0x42000000 ffffffff89e94c80 automountd
R    666      1    666    666      0 0x42000000 ffffffff89e94018 inetd
R    651      1    651    651      0 0x42000000 ffffffff89e96550 utmpd
R    631      1    631    631      1 0x42000000 ffffffff89e97e20 lockd
R    619      1    616    616      1 0x42000000 ffffffff8866e6e8 nfs4cbd
R    617      1    617    617      1 0x42000000 ffffffff89e996f0 statd
R    618      1    618    618      1 0x52000000 ffffffff89e98a88 nfsmapid
R    610      1    610    610      1 0x42000000 ffffffff82b118d8 rpcbind
R    593      1    593    593      0 0x42010000 ffffffff82b13e10 cron
R    525      1    525    525      0 0x42000000 ffffffff8866da80 nscd
R    504      1    504    504      0 0x42000000 ffffffff89e9a358 picld
R    487      1    487    487      1 0x42000000 ffffffff8866b548 kcfd
R    464      1    464    464      0 0x42000000 ffffffff82b10008 syseventd
R     69      1     69     69      0 0x42000000 ffffffff82b10c70 devfsadm
R      9      1      9      9      0 0x42000000 ffffffff82b12540 svc.configd
R      7      1      7      7      0 0x42000000 ffffffff82b131a8 svc.startd
R    661      7    661    661      0 0x4a004000 ffffffff8866c1b0 sh
R    954    661    661    661      0 0x4a004000 ffffffff89e971b8 zpool
R    645      7    645    645      0 0x4a014000 ffffffff8866a8e0 sac
R    648    645    645    645      0 0x4a014000 ffffffff89e958e8 ttymon


::msgbuf
MESSAGE                                                               
sd15 at mpt0: target d lun 0
sd15 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@d,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@d,0 (sd15) online
sd16 at mpt0: target e lun 0
sd16 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@e,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@e,0 (sd16) online
sd17 at mpt0: target f lun 0
sd17 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@f,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@f,0 (sd17) online
sd18 at mpt0: target 10 lun 0
sd18 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@10,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@10,0 (sd18) online
sd19 at mpt0: target 11 lun 0
sd19 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@11,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@11,0 (sd19) online
sd20 at mpt0: target 12 lun 0
sd20 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@12,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@12,0 (sd20) online
sd21 at mpt0: target 13 lun 0
sd21 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@13,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@13,0 (sd21) online
sd22 at mpt0: target 14 lun 0
sd22 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@14,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@14,0 (sd22) online
sd23 at mpt0: target 15 lun 0
sd23 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@15,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@15,0 (sd23) online
sd24 at mpt0: target 16 lun 0
sd24 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@16,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@16,0 (sd24) online
sd25 at mpt0: target 17 lun 0
sd25 is /p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@17,0
/p...@0,0/pci8086,2...@1/pci1000,3...@0/s...@17,0 (sd25) online
pcplusmp: pciex8086,109a (e1000g) instance #1 vector 0x35 ioapic 0xff intin 
0xff is bound to cpu 1
        ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
        PIO mode 4 selected
        ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
        PIO mode 4 selected
        ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
        PIO mode 4 selected
        ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
        PIO mode 4 selected
NOTICE: e1000g1 registered
Intel(R) PRO/1000 Network Connection, Driver Ver. 5.2.13.1
        UltraDMA mode 5 selected
        UltraDMA mode 5 selected
pcplusmp: asy (asy) instance 0 vector 0x4 ioapic 0x2 intin 0x4 is bound to cpu 1
ISA-device: asy0
asy0 is /isa/a...@1,3f8
pcplusmp: asy (asy) instance #1 vector 0x3 ioapic 0x2 intin 0x3 is bound to cpu 
1
ISA-device: asy1
asy1 is /isa/a...@1,2f8
pcplusmp: lp (ecpp) instance 0 vector 0x7 ioapic 0x2 intin 0x7 is bound to cpu 0
ISA-device: ecpp0                     
ecpp0 is /isa/l...@1,378                
fd0 at fdc0                           
fd0 is /isa/f...@1,3f0/f...@0,0          
pseudo-device: ramdisk1024            

> ::stack
mutex_enter+0xb()
metaslab_free+0x68()
zio_dva_free+0x1f()
zio_execute+0x60()
zio_nowait+9()
arc_free+0x10a()
dsl_dataset_block_kill+0x26b()
dmu_objset_sync+0x1b2()
dsl_pool_sync+0x13a()
spa_sync+0x158()
txg_sync_thread+0x1cf()
thread_start+8()

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     113284               442   11%
Anon                         9650                37    1%
Exec and libs                2148                 8    0%
Page cache                    134                 0    0%
Free (cachelist)             2903                11    0%
Free (freelist)            918022              3586   88%

Total                     1046141              4086

{code}
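
As a sanity check on the ::memstat figures above, here is a minimal sketch that recomputes the MB and %Tot columns from the page counts. The 4 KB page size is my assumption (it is not printed in the dump), and the names PAGES, to_mb and summarize are just illustrative. If the assumption holds, it shows that almost all memory was on the free list at the time of the dump.

```python
# Page counts copied from the ::memstat output above.
PAGES = {
    "Kernel": 113284,
    "Anon": 9650,
    "Exec and libs": 2148,
    "Page cache": 134,
    "Free (cachelist)": 2903,
    "Free (freelist)": 918022,
}
REPORTED_TOTAL = 1046141  # "Total" row from ::memstat
PAGE_SIZE = 4096          # assumption: 4 KB base page size on this i86pc box


def to_mb(pages):
    """Convert a page count to whole megabytes (truncated, not rounded)."""
    return pages * PAGE_SIZE // (1024 * 1024)


def summarize(pages):
    """Return {category: (megabytes, percent_of_total)} for each category."""
    total = sum(pages.values())
    return {
        name: (to_mb(count), round(100 * count / total))
        for name, count in pages.items()
    }


total_pages = sum(PAGES.values())
summary = summarize(PAGES)

if __name__ == "__main__":
    print("pages add up to reported total:", total_pages == REPORTED_TOTAL)
    for name, (mb, pct) in summary.items():
        print(f"{name:<18} {mb:>6} MB {pct:>3}%")
```

The per-category page counts should add up to the reported total, and the recomputed columns should match the table if the 4 KB assumption is right.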

I have also tried attaching this storage array to a completely different system 
and importing the pool using zpool import -f. This resulted in a panic as well, 
with the same type of data from the core dump.

My questions at this point:

What went wrong?
  Obviously a vague question that I don't expect anyone to be able to answer 
directly, but maybe someone can aid me with some pointers.

How do I get this data back? 
  That is, what would I do if this were production and not just a test system, 
since it will panic upon import/reboot?

Is there any way to configure this machine to NOT reboot upon dumps?
  Occasionally it does not do a crash dump, and the panic flashes on screen too 
quickly to see what is going on.

What are my next steps?

I do realize that this is a Solaris 10 machine and not OpenSolaris, but I was 
hoping someone would be able to point me in the right direction.

Thanks in advance!
-- 
This message posted from opensolaris.org
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org
