Hi all!

I have a serious problem with a server, and I'm hoping someone could help me understand what's wrong. Basically, I have a server with a pool of 6 disks, and after a zpool scrub I got this message:

errors: Permanent errors have been detected in the following files:

      <metadata>:<0x0>
      <metadata>:<0x15>

The OpenSolaris version is 5.11 snv_101b (yes, I know, quite old). This server has been up and running for more than 4 months, with weekly zpool scrubs, and only now did I get this message.
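
Before posting I was also planning to dump the FMA error reports, to get more detail than the scrub summary gives (assuming fmdump is the right tool for that on this build; I haven't gone through its output yet):

    # verbose dump of the error reports (ereports) collected by FMA
    fmdump -eV | more

    # and the fault log itself, in case FMA has already diagnosed something
    fmdump -V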

Here are some extra details about the system:

1 - I can still access data in the pool, but I don't know whether all of it is accessible or whether some of it is corrupted
2 - nothing has changed in the hardware
3 - all the disks are ST31000340NS-SN06, Seagate 1 TB 7200 rpm "enterprise class", firmware SN06
4 - all the disks are connected to an LSI Logic SAS1068E attached to a JBOD chassis (Supermicro)
5 - the server is a Sun X2200 dual-core
6 - using lsiutil and querying "Display phy counters", I see:
    Expander (Handle 0009) Phy 21:  Link Up
      Invalid DWord Count                                       1,171
      Running Disparity Error Count                               937
      Loss of DWord Synch Count                                     0
      Phy Reset Problem Count                                       0

    Expander (Handle 0009) Phy 22:  Link Up
      Invalid DWord Count                                   2,110,435
      Running Disparity Error Count                           855,781
      Loss of DWord Synch Count                                     3
      Phy Reset Problem Count                                       0

    Expander (Handle 0009) Phy 23:  Link Up
      Invalid DWord Count                                     740,029
      Running Disparity Error Count                           716,196
      Loss of DWord Synch Count                                     1
      Phy Reset Problem Count                                       0

    Expander (Handle 0009) Phy 24:  Link Up
      Invalid DWord Count                                     705,870
      Running Disparity Error Count                           692,280
      Loss of DWord Synch Count                                     1
      Phy Reset Problem Count                                       0

    Expander (Handle 0009) Phy 25:  Link Up
      Invalid DWord Count                                     698,935
      Running Disparity Error Count                           667,148
      Loss of DWord Synch Count                                     1
      Phy Reset Problem Count                                       0
7 - /var/log/messages shows entries like:
        SCSI transport failed: reason 'reset': retrying command
        SCSI transport failed: reason 'reset': giving up
    (the extra commands I'm thinking of running to dig into these are just below this list)
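
To see whether the individual disks are also accumulating transport/soft/hard errors, and to catch new resets as they happen, this is roughly what I intend to run next (just a sketch of the commands, I don't have their output yet):

    # per-device error counters as seen by the disk driver
    iostat -En

    # watch for further SCSI transport resets in real time
    tail -f /var/log/messages | grep -i scsi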

Maybe I'm wrong, but it seems like the disks have started to report errors?
The reason I don't know whether all the data is accessible is that the pool is quite big, as you can see:

NAME         SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
POOL01  2.72T  1.71T  1.01T    62%  ONLINE  -
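
To actually test whether everything is still readable, I was considering simply reading every file back and discarding the data, something along these lines (assuming the pool is mounted at /POOL01, which is just a guess at the mountpoint, and knowing it will take a long while over 1.71T):

    # read every file once; anything unreadable should throw an I/O error
    tar cf - /POOL01 > /dev/null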


It may be that I have been suffering from this problem for some time, but the LSI HBA had never reported any errors, and I assumed that ZFS was built to deal with exactly this kind of problem: silent data corruption. I would like to understand whether the problems started because high load on the LSI HBA led to timeouts and therefore disk errors, or whether the LSI HBA OpenSolaris driver was overloaded, resulting in both the disk errors and the HBA errors...
Any clue as to what led to what?

Even more important: have I lost data, or is ZFS reporting errors caused by disk/driver problems while the data that already exists is okay and only newly written data might be affected? Is the zpool metadata recoverable? My biggest concern is to know whether my pool is corrupted and, if so, how I can fix the zpool metadata problem.
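
If the consensus is that this was transport noise (cabling/expander/HBA) rather than bad disks, would something like the following be the sane way to confirm it once the hardware side is sorted out? This is only a sketch of what I'm considering, not something I have run yet:

    zpool clear POOL01       # reset the error counters on the pool
    zpool scrub POOL01       # re-read and verify every block
    zpool status -v POOL01   # see whether the metadata errors come back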

Thanks for all your time,

Bruno

r...@server01:/# zpool status -vx
pool: POOL01
state: ONLINE
status: One or more devices has experienced an error resulting in data
      corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
      entire pool from backup.
 see: http://www.sun.com/msg/ZFS-8000-8A
scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        POOL01       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t9d0   ONLINE       0     0     0
            c5t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t11d0  ONLINE       0     0     0
            c5t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c5t13d0  ONLINE       0     0     0
            c5t14d0  ONLINE       0     0     0
errors: Permanent errors have been detected in the following files:

      <metadata>:<0x0>
      <metadata>:<0x15>
