Re: [lustre-discuss] Error on a zpool underlying an OST

Kevin Abbey Fri, 15 Jul 2016 08:59:21 -0700

Hi Bob,

Thank you for the notes. I began examining the zpool before obtaining the new LSI card, but I was unable to start lustre without the new card. Once I installed the replacement and re-examined the zpools, the resilvered pool was re-scrubbed, exported and reimported, and to my surprise repaired. As a further test, I removed the spare disk that replaced the "apparent" bad disk and re-added the disk that was removed. The zpool resilvered ok and scrubbed clean. Lustre mounted and cleaned a few orphaned blocks but appeared fully functional from the client side. However, without a "snapshot" (file list, md5sums - though zfs does internal checksums) of the prior status I cannot be sure whether a data file was lost. This is something I'll need to address. Maybe Robinhood can help with this?
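
One possible way to keep the kind of "snapshot" (file list plus md5sums) mentioned above is sketched here. It is only an illustration, not something used in this thread: the function names build_manifest and compare_manifests are hypothetical, and walking a large Lustre filesystem this way from a client would be slow.

```python
import hashlib
import os


def build_manifest(root):
    """Return {relative_path: md5 hex digest} for every regular file under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            digest = hashlib.md5()
            with open(full, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def compare_manifests(before, after):
    """Return (missing, changed): sorted paths lost or altered since 'before'."""
    missing = sorted(p for p in before if p not in after)
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    return missing, changed
```

Taking one manifest before hardware testing and another afterwards, then comparing the two, would show which files disappeared or changed content in between.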

Thanks again for the notes. They will likely be useful in a similar scenario.


Kevin


On 07/12/2016 09:10 AM, Bob Ball wrote:

The answer came offline, and I guess I never replied back to the original posting. This is what I learned. It deals with only a single file, not 1000's. --bob
-------------------------------

On Mon, 14 Mar 2016, Bob Ball wrote:
OK, it would seem the affected user has already deleted this file, as the "lfs fid2path" returns:
[root@umt3int01 ~]# lfs fid2path /lustre/umt3 [0x200002582:0xb5c0:0x0]
fid2path: error on FID [0x200002582:0xb5c0:0x0]: No such file or directory
I verified I could do it back and forth using a different file.
I am making one last check, with the OST re-activated (I had set it inactive on our MDT/MGS to keep new files off while figuring this out).
Nope, gone.  Time to do the clear and remove the snapshot.

Thanks for your help on this.

bob

On 3/14/2016 10:45 AM, Don Holmgren wrote:

 No, no downside.  The snapshot really is just used so that I can do this
 sort of repair live.
 Once you've found the Lustre OID with "find", for ll_decode_filter_fid to
 work you'll have to then umount the OST and remount as type lustre.

 Good luck!

 Don

Thank you!  This is very helpful.
I have no space to make a snapshot, so I will just umount this OST for a bit and remount it zfs. Our users can take some off-time if we are not busy just then.
It will be an interesting process. I'm all set to drain and remake though, should this method not work. I was putting off starting that until later today as I've other issues just now. Since it would take me 2-3 days total to drain, remake and refill, your detailed method is far more likeable for me.
Just to be certain, other than the temporary unavailability of the Lustre file system, do you see any downside to not working from a snapshot?
bob


On 3/14/2016 10:21 AM, Don Holmgren wrote:

 Hi Bob -
 I only get the lustre-discuss digest, so am not sure how to reply to that
 whole list.  But I can reply directly to you regarding your posting
 (copied at the bottom).

 In the ZFS error message

    errors: Permanent errors have been detected in the following files:
          ost-007/ost0030:<0x2c90f>
 0x2c90f is the ZFS inode number of the damaged item.  To turn this into a
 Lustre filename, do the following:

 1. First, you have to use "find" using that inode number to get the
    corresponding Lustre object ID.  I do this via a ZFS snapshot,
    something like:

    zfs snapshot ost-007/ost0030@mar14
    mount -t zfs ost-007/ost0030@mar14 /mnt/snapshot
    find /mnt/snapshot/O -inum 182543

 (note 0x2c90f = 182543 decimal).  This may return something like

    /mnt/snapshot/O/0/d22/54

 if indeed the damaged item is a file object.
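
The hex-to-decimal step above can be sketched as follows. This is a minimal illustration that assumes the error entry has the "dataset:<0x...>" shape shown in the ZFS message; the helper name parse_error_entry is hypothetical, and the printed find command simply reuses the /mnt/snapshot mount point from the example above.

```python
import re


def parse_error_entry(entry):
    """Split 'pool/dataset:<0xNNN>' into (dataset, object number as int)."""
    match = re.fullmatch(r"\s*(\S+):<0x([0-9a-fA-F]+)>\s*", entry)
    if match is None:
        raise ValueError(f"not a dataset:<0x...> entry: {entry!r}")
    # int(..., 16) performs the hex-to-decimal conversion that find -inum needs.
    return match.group(1), int(match.group(2), 16)


dataset, inum = parse_error_entry("ost-007/ost0030:<0x2c90f>")
print(dataset, inum)  # ost-007/ost0030 182543
# Mount point below is the one used in the example above.
print(f"find /mnt/snapshot/O -inum {inum}")
```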


 2. OK, assuming the "find" did return a file object like above (in this
    case the Lustre OID of the object is 54) you need to find the parent
    "FID" of that OID.  Do this as follows on the OSS where you've mounted
    the snapshot:

    [root@lustrenew3 ~]# ll_decode_filter_fid /mnt/snapshot/O/0/d22/54
    /mnt/snapshot/O/0/d22/54: parent=[0x20000040000010a:0x0:0x0] stripe=0


 3. That string "0x20000040000010a:0x0:0x0" is related to the Lustre FID.
    You can use "lfs fid2path" to convert this to a filename.  "lfs fid2path"
    must be executed on a client of your Lustre filesystem.  And, on our
    Lustre, the return string must be slightly altered (chopped up
    differently):

     [root@client ~]# lfs fid2path /djhzlus [0x200000400:0x10a:0x0]
     /djhzlus/test/copy1/l6496f21b7075m00155m031/gauge/Coulomb/l6496f21b7075m00155m031-Coul_002

    Here /djhzlus was where the Lustre filesystem was mounted on my client
    (client).  fid2path takes three numbers; in my case the first was
    the first 9 hex digits of the return from ll_decode_filter_fid, the
    second was the last 5 hex digits (I suppressed the leading zeros), and
    the third was 0x0 (not sure whether this was the 2nd or 3rd field from
    ll_decode_filter_fid).

    You can always use "lfs path2fid" on your Lustre client against another
    file in your filesystem to find the pattern for your FID.

    To check that you've indeed found the correct file, you can do
    "lfs getstripe" to confirm that the objid matches the Lustre OID you
    got with the find.
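
The "chopping up" in step 3 can be sketched like this. It is only a sketch under the assumptions stated in the message: the first 9 hex digits of the first field are taken as the sequence, whatever digits remain are taken as the object ID with leading zeros dropped, and the third number is fixed at 0x0 as in the example. The function name decode_parent_fid is hypothetical; check the pattern against "lfs path2fid" on your own filesystem before relying on it.

```python
def decode_parent_fid(parent, seq_digits=9):
    """Reformat '[0x20000040000010a:0x0:0x0]' into '[0x200000400:0x10a:0x0]'.

    Assumption (from the example in this thread, not a general rule): the
    first seq_digits hex digits of the first field are the FID sequence and
    the remaining digits are the object ID.
    """
    fields = parent.strip().strip("[]").split(":")
    if len(fields) != 3:
        raise ValueError(f"expected three ':'-separated fields: {parent!r}")
    packed = fields[0].lower()
    if packed.startswith("0x"):
        packed = packed[2:]
    if len(packed) <= seq_digits:
        raise ValueError(f"first field too short to split: {parent!r}")
    seq = int(packed[:seq_digits], 16)
    oid = int(packed[seq_digits:], 16)  # int() drops the leading zeros
    # Third number hard-coded to 0x0, matching the example above.
    return f"[{seq:#x}:{oid:#x}:0x0]"


print(decode_parent_fid("[0x20000040000010a:0x0:0x0]"))
# [0x200000400:0x10a:0x0]
```

The result is the argument form passed to "lfs fid2path" in the example above.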
 Once you figure out the bad file, you can delete it from Lustre, and then
 use "zpool clear ost-007" to clear the reporting of
          ost-007/ost0030:<0x2c90f>
 Don't forget to umount and delete your ZFS snapshot of the OST with the
 bad file.


 I should mention that I found a Python script ("zfsobj2fid") somewhere
 that directly returns the FID using the ZFS debugger ("zdb") directly
 against the mounted OST.  You can probably google for zfsobj2fid; if you
 can't find it let me know and I'll dig around to see if I still have a
 copy.  Here's how I used it to get the FID for "lfs fid2path":

     [root@lustrenew3 ~]# ./zfsobj2fid zp2/ost2 0x113
     [0x20000040000010a:0x0:0x0]

 (my OID was 0x113, my pool was "zp2" and the ZFS OST was "ost2"). But,
 note that the FID returned still needs to be manipulated as above. I
 found this note in one of my write-ups about this manipulation:

 "Evidentally, the 'trusted.fid' xattr kept in ZFS for the OID file
contains both the first and second sections of the FID (according tosomeslide decks I found, the FID is [sequence:objectID:version], so thexattr
 has the sequence and the objectID."


 Cheers -

 Don Holmgren
 Fermilab

On 7/12/2016 12:02 AM, Kevin Abbey wrote:
Hi,
Can anyone advise how to clean up 1000s of zfs level permanent errors and the lustre level too?
A similar question was presented on the list but I did not see an answer.
https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12454.html
As I was testing new hardware I discovered an LSI HBA was bad. On a single combined MDS/OSS there were 8 OSTs split across 2 jbods and 2 LSI HBAs. The mdt was on a 3rd jbod downlinked from the jbod connected to the bad controller. The zpools connected to the good HBA were scrubbed clean after unmounting and stopping lustre. The zpools on the bad controller continued to have errors while connected to the bad controller. One of these OSTs reported a disk failure during the scrub and began resilvering, yet autoreplace was off. This is a very bad event considering the card was causing all of the errors. Neither a scrub nor a resilver would ever complete. I stopped the scrub on the 3 other osts and detached the spare from the ost that was resilvering. After narrowing down the bad HBA (initially it was not clear if cables or jbod backplanes were bad), I used the good HBA to scrub jbod 1 again, then shut down and disconnected jbod 1. I then connected jbod 2 to the good controller to scrub the jbod 2 zpools, which had previously been attached to the bad LSI controller. The 3 zpools whose scrubs had been stopped previously did complete successfully. The one which had begun resilvering began to resilver again after I initiated a replace of the failed disk with the spare. The resilver completed but many permanent errors were discovered on the zpool. Since this is a test pool I was interested to know if zfs would recover. In a real scenario with HW problems I'll shut down and disconnect the data drives prior to HW testing.
The status listed below shows a new scrub in process after the resilver completed. The cache drive is missing because the 3rd jbod is disconnected temporarily.
===================================

ZFS:   v0.6.5.7-1
lustre 2.8.55
kernel 2.6.32_642.1.1.el6.x86_64.x86_64
Centos 6.8


===================================
  ~]# zpool status -v test-ost4
  pool: test-ost4
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Jul 11 22:29:09 2016
    689G scanned out of 12.4T at 711M/s, 4h49m to go
    40K repaired, 5.41% done
config:

    NAME                                       STATE     READ WRITE CKSUM
    test-ost4                                  ONLINE       0     0   180
      raidz2-0                                 ONLINE       0     0   360
        ata-ST4000NM0033-9ZM170_Z1Z7GYXY       ONLINE       0     0     2  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7KKPQ       ONLINE       0     0     3  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7L5E7       ONLINE       0     0     3  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7KGQT       ONLINE       0     0     0  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7LA8K       ONLINE       0     0     4  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7KB0X       ONLINE       0     0     3  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7JSMN       ONLINE       0     0     2  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7KXRA       ONLINE       0     0     2  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7MLSN       ONLINE       0     0     2  (repairing)
        ata-ST4000NM0033-9ZM170_Z1Z7L4DT       ONLINE       0     0     7  (repairing)
    cache
      ata-D2CSTK251M20-0240_A19CV011227000092  UNAVAIL      0     0     0

errors: Permanent errors have been detected in the following files:

        test-ost4/test-ost4:<0xe00>
        test-ost4/test-ost4:<0xe01>
        test-ost4/test-ost4:<0xe02>
        test-ost4/test-ost4:<0xe03>
        test-ost4/test-ost4:<0xe04>
        test-ost4/test-ost4:<0xe05>
        test-ost4/test-ost4:<0xe06>
    .......
    .......
    .......continues......
    .......
    .......
        test-ost4/test-ost4:<0xdfe>
        test-ost4/test-ost4:<0xdff>
===================================
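
With thousands of entries like those above, converting each object number by hand is impractical. A rough sketch of collecting them from saved "zpool status -v" output follows. It assumes every damaged item appears as a "dataset:<0x...>" line, as in the listing above; entries that zpool reports as plain file paths are ignored, and the function name collect_damaged_objects is hypothetical.

```python
import re

# Matches lines such as "        test-ost4/test-ost4:<0xe00>".
ERROR_LINE = re.compile(r"^\s*(\S+):<0x([0-9a-fA-F]+)>")


def collect_damaged_objects(status_text):
    """Map each dataset to a sorted list of damaged object numbers (decimal)."""
    damaged = {}
    for line in status_text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            number = int(match.group(2), 16)
            damaged.setdefault(match.group(1), set()).add(number)
    return {dataset: sorted(numbers) for dataset, numbers in damaged.items()}


sample = """\
errors: Permanent errors have been detected in the following files:

        test-ost4/test-ost4:<0xe00>
        test-ost4/test-ost4:<0xe01>
        test-ost4/test-ost4:<0xdff>
"""
print(collect_damaged_objects(sample))
# {'test-ost4/test-ost4': [3583, 3584, 3585]}
```

Each decimal number can then be looked up on the mounted OST dataset with "find ... -inum", one object at a time.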

Follow up questions,
Is it better to not have a spare attached to the pool to prevent resilvering in this scenario? (bad HBA, disk failed during scrub, resilver began, yet autoreplace was off. The spare was assigned to the zpool.)
In a dual path to the jbod would the bad HBA card be disabled automatically to prevent IO errors reaching the disk? The current setup is single path only.
Thank you for any notes in advance,
Kevin


--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/

Rutgers University - Science Building

315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.ab...@rutgers.edu

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
