I would like to get some help diagnosing permanent errors on my files. The 
machine in question has twelve 1 TB disks connected to an Areca RAID card. I 
installed OpenSolaris build 134 and, according to zpool history, created the 
pool with:

zpool create bigraid raidz2 c4t0d0 c4t0d1 c4t0d2 c4t0d3 c4t0d4 c4t0d5 c4t0d6 
c4t0d7 c4t1d0 c4t1d1 c4t1d2 c4t1d3

I then backed up 806 GB of files to the machine and had the backup program 
verify the files. The verification failed. The check is still running, but so 
far it has found 4 files whose checksums don't match the checksums of the 
originals. zpool status also shows problems:

 $ sudo zpool status -v
  pool: bigraid
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        bigraid     DEGRADED     0     0   536
          raidz2-0  DEGRADED     0     0 3.14K
            c4t0d0  ONLINE       0     0     0
            c4t0d1  ONLINE       0     0     0
            c4t0d2  ONLINE       0     0     0
            c4t0d3  ONLINE       0     0     0
            c4t0d4  ONLINE       0     0     0
            c4t0d5  ONLINE       0     0     0
            c4t0d6  ONLINE       0     0     0
            c4t0d7  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t1d1  ONLINE       0     0     0
            c4t1d2  ONLINE       0     0     0
            c4t1d3  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x18>
        <metadata>:<0x3a>

So it appears that one of the disks is bad, but if only one disk failed, how 
would a raidz2 pool develop permanent errors? The numbers in the CKSUM column 
also keep growing; is that just because the backup verification is tickling 
the errors as it runs?
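
(For what it's worth, my assumption is that the next step once the 
verification finishes is a scrub, so that ZFS rechecks everything itself, 
roughly:

 $ sudo zpool scrub bigraid
 $ sudo zpool status -v bigraid

I haven't run that yet, and I'm not sure whether I should run 'zpool clear' 
first as the action message suggests, or whether that would throw away useful 
diagnostic state.)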

Previous postings about permanent errors said to look at fmdump -eV, but that 
output runs to 437543 lines and I don't really know how to interpret what I 
see. I did check the vdev_path entries with 
"fmdump -eV | grep vdev_path | sort | uniq -c" to see whether only certain 
disks were implicated, but every disk in the array shows up, albeit with 
different frequencies (a further query I could run is sketched below the 
counts):

2189    vdev_path = /dev/dsk/c4t0d0s0
1077    vdev_path = /dev/dsk/c4t0d1s0
1077    vdev_path = /dev/dsk/c4t0d2s0
1097    vdev_path = /dev/dsk/c4t0d3s0
  25    vdev_path = /dev/dsk/c4t0d4s0
  25    vdev_path = /dev/dsk/c4t0d5s0
  20    vdev_path = /dev/dsk/c4t0d6s0
1072    vdev_path = /dev/dsk/c4t0d7s0
1092    vdev_path = /dev/dsk/c4t1d0s0
2222    vdev_path = /dev/dsk/c4t1d1s0
2221    vdev_path = /dev/dsk/c4t1d2s0
1149    vdev_path = /dev/dsk/c4t1d3s0
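
(If it helps, I suppose I could also summarize the error reports by class 
rather than by disk with something along the lines of

 $ fmdump -eV | grep "class =" | sort | uniq -c

but I'm not sure whether that is a meaningful way to slice the data.)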

What should I make of the vdev_path counts? Are all the disks bad? That seems 
unlikely. I found another thread

http://opensolaris.org/jive/thread.jspa?messageID=399988

where the problem finally came down to bad memory, so I'll test that next. Any 
other suggestions?