Re: [gentoo-user] zfs repair needed (due to fingers being faster than brain)

Grant Taylor Mon, 01 Mar 2021 18:30:47 -0800

On 3/1/21 3:25 PM, John Blinka wrote:

HI, Gentooers!

Hi,

So, I typed dd if=/dev/zero of=/dev/sd<wrong letter>, and despite hitting ctrl-c quite quickly, zeroed out some portion of the initial part of a disk. Which did this to my zfs raidz3 array:


OOPS!!!

     NAME                                         STATE     READ WRITE CKSUM
     zfs                                          DEGRADED     0     0     0
       raidz3-0                                   DEGRADED     0     0     0
         ata-HGST_HUS724030ALE640_PK1234P8JJJVKP  ONLINE       0     0     0
         ata-HGST_HUS724030ALE640_PK1234P8JJP3AP  ONLINE       0     0     0
         ata-ST4000NM0033-9ZM170_Z1Z80P4C         ONLINE       0     0     0
         ata-ST4000NM0033-9ZM170_Z1ZAZ8F1         ONLINE       0     0     0
         14296253848142792483                     UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000NM0033-9ZM170_Z1ZAZDJ0-part1
         ata-ST4000NM0033-9ZM170_Z1Z80KG0         ONLINE       0     0     0

Okay. So the pool is online and the data is accessible. That's actually better than I originally thought. -- I thought you had accidentally damaged part of the ZFS partition that existed on a single disk. -- I've been able to repair this with minimal data loss (zeros) with Oracle's help on Solaris in the past.

Aside: My understanding is that ZFS stores multiple copies of its metadata on the disk (assuming single disk) and that it is possible to recover a pool if any one (or maybe two for consistency checks) are viable. Though doing so is further into the weeds than you normally want to be.

Could have been worse. I do have backups, and it is raid3, so all I've injured is my pride, but I do want to fix things. I'd appreciate some guidance before I attempt doing this - I have no experience at it myself.

First, your pool / its raidz3 is only 'DEGRADED', which means that the data is still accessible. 'OFFLINE' would be more problematic.

The steps I envision are

1) zpool offline zfs 14296253848142792483 (What's that number?)

I'm guessing it's an internal ZFS serial number. You will probably need to reference it.


I see no reason to take the pool offline.
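Since that number will need to be typed into a later command, here is a minimal sketch for pulling it out of the status output instead of retyping it. As far as I know the number is the GUID that ZFS prints in place of a device name it can no longer open, but treat that as my assumption. The function name unhealthy_vdevs is made up for illustration, and the list of states it looks for is my reading of what a broken leaf device can show.

```shell
#!/bin/sh
# unhealthy_vdevs: read `zpool status` output on stdin and print the
# first column (device name or numeric ID) of every row whose STATE
# column is UNAVAIL, FAULTED, OFFLINE or REMOVED.
# Rows like "state: DEGRADED" are skipped because their first field
# ends in a colon. DEGRADED rows (the pool and the raidz3 group) are
# ignored on purpose: they only reflect the broken member below them.
unhealthy_vdevs() {
    awk '$1 !~ /:$/ &&
         ($2 == "UNAVAIL" || $2 == "FAULTED" ||
          $2 == "OFFLINE" || $2 == "REMOVED") { print $1 }'
}

# Usage (not run here):
#   zpool status zfs | unhealthy_vdevs
```

Copying and pasting the printed value avoids one more chance for fingers to outrun brain.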

2) do something to repair the damaged disk


I don't think you need to do anything at the individual disk level yet.

3) zpool online zfs <repaired disk>


I think you can fix this with the pool online.

Right now, the device name for the damaged disk is /dev/sda. Gdisk says this about it:
Caution: invalid main GPT header,


This is to be expected.

but valid backup; regenerating main header from backup!


This looks promising.

Warning: Invalid CRC on main header data; loaded backup partition table.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.

I'm assuming that the main partition table is at the start of the disk and that it's what got wiped out.

So I'd think that you can look at the 'c' and 'e' options on the recovery & transformation menu for options to repair the main partition table.

Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!


I know.  Thank you for using the backup partition table.

Warning! One or more CRCs don't match. You should repair the disk!

I'm guessing that this is a direct result of the dd oops. I would want more evidence to support it being a larger problem.

The CRC may be calculated over a partially zeroed chunk of disk. (Chunk because I don't know what term is best here and I want to avoid implying anything specific or incorrectly.)

Main header: ERROR
Backup header: OK
Main partition table: ERROR
Backup partition table: OK

ACK

Partition table scan:
   MBR: not present
   BSD: not present
   APM: not present
   GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
  1 - Use current GPT
  2 - Create blank GPT

Your answer: ( I haven't given one yet)


I'd assume #1, Use current GPT.

I'm not exactly sure what this is telling me.  But I'm guessing it
means that the main partition table is gone, but there's a good
backup.


That's my interpretation too.

It jibes with the description of what happened.

In addition, some, but not all disk id info is gone:
1) /dev/disk/by-id still shows ata-ST4000NM0033-9ZM170_Z1ZAZDJ0 (the damaged disk) but none of its former partitions

The disk ID still being there may be a symptom / side effect of when udev creates the links. I would expect it to not be there post-reboot.


Well, maybe.  The disk serial number is independent of any data on the disk.

Partitions by ID would probably be gone post reboot (or eject and re-insertion).

2) /dev/disk/by-partlabel shows entries for the undamaged disks in the pool, but not the damaged one

Okay. That means that udev is recognizing the change faster than I would have expected.


That probably means that the ID in #1 has survived any such update.

3) /dev/disk/by-partuuid similar to /dev/disk/by-partlabel


Given #2, I'm not surprised at #3.

4) /dev/disk/by-uuid does not show the damaged disk


Hum.

This particular disk is from a batch of 4 I bought with the same make
and specification and very similar ids (/dev/disk/by-id).  Can I
repair this disk by copying something off one of those other disks
onto this one?


Maybe.  But I would not bother.  (See below.)

Is repair just repartitioning - as in the Gentoo handbook? Is it as simple as running gdisk and typing 1 to accept gdisk's attempt at recovering the gpt? Is running gdisk's recovery and transformation facilities the way to go (the b option looks like it's made for exactly this situation)?

gdisk will address the partition problem. But that doesn't do anything for ZFS.

Anybody experienced at this and willing to guide me?

I've not dealt with this particular problem. But I have dealt with a few different things.


My course of action would be:

0) Copy the entire disk to another disk if possible and if you are sufficiently paranoid.

1) Let gdisk repair the main partition table using the data from the backup partition table.

2) Leverage ZFS's RAIDZ functionality to recover the ZFS data.
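For step 0, a minimal sketch of a dd wrapper that is harder to aim at the wrong place than a bare dd. It assumes you are copying to an image file on some other filesystem with enough free space, rather than straight onto a second disk. The function name image_disk and the example image path are hypothetical.

```shell
#!/bin/sh
# image_disk SRC DST: copy SRC to a new file DST.
# Refuses to run if SRC is unreadable or if DST already exists, so a
# mistyped destination cannot clobber an existing file or device node.
image_disk() {
    src=$1
    dst=$2
    if [ ! -r "$src" ]; then
        echo "cannot read source: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing destination: $dst" >&2
        return 1
    fi
    # 1 MiB blocks; dd reports its transfer statistics on stderr.
    dd if="$src" of="$dst" bs=1048576
}

# Usage (not run here; the image path is a made-up example):
#   image_disk /dev/sda /mnt/spare/sda-before-repair.img
```

The existence check on the destination is the whole point: every /dev/sdX node already exists, so this wrapper can only ever write to a fresh file.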

I /think/ that #2 can be done with one command. Do your homework to understand, check, and validate this. You are responsible for your own actions, despite what some random on the Internet says. ;-)


   # zpool replace zfs 14296253848142792483 sda

Assuming that /dev/sda is the corrupted disk.

This will cause ZFS to remove the 14296253848142792483 disk from the pool and rebuild onto the (/dev/)sda disk. -- ZFS doesn't care that they are the same disk.
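Device letters can move around between boots, so it may be worth checking that /dev/sda really is the disk with serial Z1ZAZDJ0 before running the replace. A minimal sketch, assuming the by-id link for the whole disk still exists as described earlier in this thread. The function name same_device is made up. Note that zpool replace wants the pool name (zfs here) ahead of the old device. Handing it the by-id path as the new device, rather than sda, is my own suggestion.

```shell
#!/bin/sh
# same_device A B: succeed only if both paths exist and resolve
# (following symlinks) to the same target.
same_device() {
    [ -e "$1" ] && [ -e "$2" ] &&
        [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}

# Usage (not run here):
#   by_id=/dev/disk/by-id/ata-ST4000NM0033-9ZM170_Z1ZAZDJ0
#   if same_device "$by_id" /dev/sda; then
#       zpool replace zfs 14296253848142792483 "$by_id"
#   else
#       echo "/dev/sda is not the damaged disk, stopping" >&2
#   fi
```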


You can keep track of the resilver with something like the following:

   # while true; do zpool status zfs; sleep 60; done
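If you would rather have the loop stop by itself, here is a minimal sketch of a variant that exits once the resilver is no longer reported as running. It assumes that zpool status prints the phrase "resilver in progress" on its scan line while the rebuild is active; check that against what your version actually prints. The function names are made up for illustration.

```shell
#!/bin/sh
# resilver_running: read `zpool status` output on stdin and succeed
# while it reports an active resilver.
resilver_running() {
    grep -q 'resilver in progress'
}

# wait_for_resilver POOL [SECONDS]: poll until the pool no longer
# reports an active resilver, then print the final status once.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}
    while zpool status "$pool" | resilver_running; do
        sleep "$interval"
    done
    zpool status "$pool"
}

# Usage (not run here):
#   wait_for_resilver zfs 60
```

Read the final status it prints: the loop also ends if the resilver never started or if zpool itself fails, so a finished loop is not proof of a finished rebuild.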

Since your pool is only 'DEGRADED', you are probably in an okay position. It's just a matter of not making things worse while trying to make them better.

Given that you have a RAIDZ3 and all of the other disks are ONLINE, your data should currently be safe.




--
Grant. . . .
unix || die
