Re: [RESOLVED] [gentoo-user] zfs repair needed (due to fingers being faster than brain)

2021-03-02 Thread Dale
John Blinka wrote:
> To all who replied to my distress signal,
>
> The repair turned out to be pretty painless.  In two ways:
>
> First, getting quality advice from all of you sans the roasting I
> deserved ;), and
>
> Second, gdisk fixed the gpt header and partition table easily (details
> below).  After that, I rebooted, zfs recognized the disk, and then it
> started a resilver automatically.  It was done a few minutes later,
> and now everything’s back to normal.
>
> Gdisk noted that both the main gpt header and main partition table
> were damaged, but the backups were ok.  I bypassed gdisk’s offer to
> use either the current gpt or to create a blank gpt, because I didn’t
> understand exactly what “current” or “blank” meant.
>
> Instead, I invoked the recovery & transformation menu with “r”.  Then
> I used “b” to rebuild the damaged main gpt header with the good
> backup, followed by “c” to restore the partition table from the good
> backup.  I then printed the partition table.  It looked exactly like
> the partition tables on the other disks of the same make and model in
> the zfs pool (modulo what looked like a unique zfs partition name). 
> That made me comfortable, so I wrote the changes to disk, rebooted,
> and found everything back to normal after the resilver.
>
> Appreciate all the help.  Thanks!
>
> John


I think we all do things we need "roasting" for at some point.  I once
ran rm -rfv, missed a few keys, and tab completion didn't beep (it
likely shouldn't have, either).  Luckily, I had enough left to run
emerge -ek world and get it all back, and it didn't reach /home.  I
also cleaned my keyboard with my portable air tank after that.  ;-)

I followed this thread in the hopes I might learn something.  I think I
did.  It seems that the normal routine stuff is done in the main menu
for Gdisk but recovery is done in another menu that is less obvious. 
This is a good thing to know.  While we hope none of us ever run into
this sort of thing, it is good to know just in case. 

It's amazing how well some of the newer file systems can recover from
such things.  Between the awesome file systems and Raid and maybe other
tools, most data losses can be avoided. 

Neat thread.  I don't use ZFS at this point but I learned something
about Gdisk. 

Dale

:-)  :-) 



Re: [RESOLVED] [gentoo-user] zfs repair needed (due to fingers being faster than brain)

2021-03-02 Thread John Blinka
To all who replied to my distress signal,

The repair turned out to be pretty painless.  In two ways:

First, getting quality advice from all of you sans the roasting I deserved
;), and

Second, gdisk fixed the gpt header and partition table easily (details
below).  After that, I rebooted, zfs recognized the disk, and then it
started a resilver automatically.  It was done a few minutes later, and now
everything’s back to normal.

Gdisk noted that both the main gpt header and main partition table were
damaged, but the backups were ok.  I bypassed gdisk’s offer to use either
the current gpt or to create a blank gpt, because I didn’t understand
exactly what “current” or “blank” meant.

Instead, I invoked the recovery & transformation menu with “r”.  Then I
used “b” to rebuild the damaged main gpt header with the good backup,
followed by “c” to restore the partition table from the good backup.  I
then printed the partition table.  It looked exactly like the partition
tables on the other disks of the same make and model in the zfs pool
(modulo what looked like a unique zfs partition name).  That made me
comfortable, so I wrote the changes to disk, rebooted, and found everything
back to normal after the resilver.
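
Gdisk's verdict that the main gpt header was damaged while the backup was ok comes down to a checksum test on each copy of the header. The sketch below runs that kind of test on a toy in-memory image, not a real disk. The field layout follows the UEFI GPT header format as I understand it. The helper names (build_header, make_image, header_ok), the 128-sector image size and the 16 zeroed sectors are all made up for illustration, and this is not gdisk's actual code.

```python
import struct
import zlib

SECTOR = 512
HEADER_FMT = "<8sIIIIQQQQ16sQIII"          # 92-byte GPT header, little-endian
HEADER_LEN = struct.calcsize(HEADER_FMT)   # 92


def build_header(current_lba, other_lba, entries_lba, last_lba):
    """Return one sector holding a GPT header whose CRC32 field is valid."""
    def pack(crc):
        return struct.pack(
            HEADER_FMT,
            b"EFI PART",          # signature
            0x00010000,           # revision 1.0
            HEADER_LEN,           # header size
            crc,                  # header CRC32 (computed with this field = 0)
            0,                    # reserved
            current_lba,          # where this copy lives
            other_lba,            # where the other copy lives
            34,                   # first usable LBA
            last_lba - 34,        # last usable LBA
            bytes(16),            # disk GUID (all zero in this toy image)
            entries_lba,          # start of the partition entry array
            128,                  # number of partition entries
            128,                  # size of one entry in bytes
            zlib.crc32(bytes(128 * 128)),  # CRC32 of the (empty) entry array
        )
    return pack(zlib.crc32(pack(0))).ljust(SECTOR, b"\0")


def make_image(total_sectors=128):
    """Toy disk image: main header at LBA 1, backup header at the last LBA."""
    last = total_sectors - 1
    img = bytearray(total_sectors * SECTOR)
    img[1 * SECTOR:2 * SECTOR] = build_header(1, last, 2, last)
    img[last * SECTOR:(last + 1) * SECTOR] = build_header(last, 1, last - 32, last)
    return img


def header_ok(image, lba):
    """True if the header at this LBA has the signature and a matching CRC32."""
    raw = bytes(image[lba * SECTOR:lba * SECTOR + HEADER_LEN])
    fields = struct.unpack(HEADER_FMT, raw)
    if fields[0] != b"EFI PART":
        return False
    stored_crc = fields[3]
    zeroed = raw[:16] + bytes(4) + raw[20:]   # CRC field blanked before checking
    return zlib.crc32(zeroed) == stored_crc


if __name__ == "__main__":
    img = make_image()
    img[0:16 * SECTOR] = bytes(16 * SECTOR)   # simulate dd zeroing the disk's start
    print("main header ok:  ", header_ok(img, 1))
    print("backup header ok:", header_ok(img, len(img) // SECTOR - 1))
```

Because the backup header sits in the last sector of the disk, an interrupted write from the start of the disk leaves it intact, which is what the "b" option in the recovery & transformation menu relies on.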

Appreciate all the help.  Thanks!

John


Re: [gentoo-user] zfs repair needed (due to fingers being faster than brain)

2021-03-01 Thread Grant Taylor

On 3/1/21 3:25 PM, John Blinka wrote:

HI, Gentooers!


Hi,

So, I typed dd if=/dev/zero of=/dev/sd, and despite 
hitting ctrl-c quite quickly, zeroed out some portion of the initial 
part of a disk.  Which did this to my zfs raidz3 array:


OOPS!!!


 NAME                                         STATE     READ WRITE CKSUM
 zfs                                          DEGRADED     0     0     0
   raidz3-0                                   DEGRADED     0     0     0
     ata-HGST_HUS724030ALE640_PK1234P8JJJVKP  ONLINE       0     0     0
     ata-HGST_HUS724030ALE640_PK1234P8JJP3AP  ONLINE       0     0     0
     ata-ST4000NM0033-9ZM170_Z1Z80P4C         ONLINE       0     0     0
     ata-ST4000NM0033-9ZM170_Z1ZAZ8F1         ONLINE       0     0     0
     14296253848142792483                     UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000NM0033-9ZM170_Z1ZAZDJ0-part1
     ata-ST4000NM0033-9ZM170_Z1Z80KG0         ONLINE       0     0     0


Okay.  So the pool is online and the data is accessible.  That's 
actually better than I originally thought.  --  I thought you had 
accidentally damaged part of the ZFS partition that existed on a single 
disk.  --  I've been able to repair this with minimal data loss (zeros) 
with Oracle's help on Solaris in the past.


Aside:  My understanding is that ZFS stores multiple copies of its 
metadata on the disk (assuming single disk) and that it is possible to 
recover a pool if any one (or maybe two for consistency checks) is 
viable.  Though doing so is further into the weeds than you normally 
want to be.
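
To make the "multiple copies" point concrete: as far as I know, the ZFS on-disk format keeps four 256 KiB labels per vdev, two at the front and two at the end. The sketch below computes where they would sit and which ones lie beyond a zeroed region at the start. The function names and the example sizes are made up for illustration, and the offsets are relative to the start of the vdev (the ZFS partition here, not the whole disk). How much of the disk the dd actually zeroed is not known from this thread.

```python
LABEL_SIZE = 256 * 1024   # assumed size of one vdev label (256 KiB)


def label_offsets(vdev_bytes):
    """Byte offsets of labels L0..L3 on a vdev of the given size."""
    usable = vdev_bytes - vdev_bytes % LABEL_SIZE   # align down to label size
    front = [0, LABEL_SIZE]                          # L0, L1
    back = [usable - 2 * LABEL_SIZE, usable - LABEL_SIZE]  # L2, L3
    return front + back


def surviving_labels(vdev_bytes, zeroed_bytes):
    """Indexes of labels that start at or beyond the zeroed region."""
    return [i for i, off in enumerate(label_offsets(vdev_bytes))
            if off >= zeroed_bytes]


if __name__ == "__main__":
    size = 10 * 1024 * 1024          # hypothetical 10 MiB vdev
    print(label_offsets(size))
    print(surviving_labels(size, 1024 * 1024))   # first 1 MiB zeroed
```

With the two trailing labels far from the start of the device, a short accidental write to the front would be expected to leave them untouched.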


Could have been worse.  I do have backups, and it is raidz3, so all I've 
injured is my pride, but I do want to fix things.  I'd appreciate 
some guidance before I attempt doing this - I have no experience at 
it myself.


First, your pool / its raidz3 is only 'DEGRADED', which means that the 
data is still accessible.  'OFFLINE' would be more problematic.



The steps I envision are

1) zpool offline zfs 14296253848142792483 (What's that number?)


I'm guessing it's an internal ZFS serial number.  You will probably need 
to reference it.


I see no reason to take the pool offline.


2) do something to repair the damaged disk


I don't think you need to do anything at the individual disk level yet.


3) zpool online zfs 


I think you can fix this with the pool online.

Right now, the device name for the damaged disk is /dev/sda. 
Gdisk says this about it:


Caution: invalid main GPT header,


This is to be expected.


but valid backup; regenerating main header from backup!


This looks promising.


Warning: Invalid CRC on main header data; loaded backup partition table.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.


I'm assuming that the main partition table is at the start of the disk 
and that it's what got wiped out.


So I'd think that you can look at the 'c' and 'e' options on the 
recovery & transformation menu for options to repair the main partition 
table.



Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!


I know.  Thank you for using the backup partition table.


Warning! One or more CRCs don't match. You should repair the disk!


I'm guessing that this is a direct result of the dd oops.  I would want 
more evidence to support it being a larger problem.


The CRC may be calculated over a partially zeroed chunk of disk.  (Chunk 
because I don't know what term is best here and I want to avoid implying 
anything specific or incorrect.)



Main header: ERROR
Backup header: OK
Main partition table: ERROR
Backup partition table: OK


ACK


Partition table scan:
   MBR: not present
   BSD: not present
   APM: not present
   GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
  1 - Use current GPT
  2 - Create blank GPT

Your answer: ( I haven't given one yet)


I'd assume #1, Use current GPT.


I'm not exactly sure what this is telling me.  But I'm guessing it
means that the main partition table is gone, but there's a good
backup.


That's my interpretation too.

It jibes with the description of what happened.


In addition, some, but not all disk id info is gone:
1) /dev/disk/by-id still shows ata-ST4000NM0033-9ZM170_Z1ZAZDJ0 
(the damaged disk) but none of its former partitions


The disk ID still being there may be a symptom / side effect of when 
udev creates the links.  I would expect it to not be there post-reboot.


Well, maybe.  The disk serial number is independent of any data on the disk.

Partitions by ID would probably be gone post reboot (or eject and 
re-insertion).


2) /dev/disk/by-partlabel shows entries for the undamaged disks in 
the pool, but not the damaged one


Okay.  That means that udev is recognizing the change faster than I 
would have expected.


That 

Re: [gentoo-user] zfs repair needed (due to fingers being faster than brain)

2021-03-01 Thread antlists
Firstly, I'll say I'm not experienced, but I do know a fair bit about raid 
and recovering corrupted arrays ...


On 01/03/2021 22:25, John Blinka wrote:

HI, Gentooers!

So, I typed dd if=/dev/zero of=/dev/sd, and despite
hitting ctrl-c quite quickly, zeroed out some portion of the initial
part of a disk.  Which did this to my zfs raidz3 array:

 NAME                                         STATE     READ WRITE CKSUM
 zfs                                          DEGRADED     0     0     0
   raidz3-0                                   DEGRADED     0     0     0
     ata-HGST_HUS724030ALE640_PK1234P8JJJVKP  ONLINE       0     0     0
     ata-HGST_HUS724030ALE640_PK1234P8JJP3AP  ONLINE       0     0     0
     ata-ST4000NM0033-9ZM170_Z1Z80P4C         ONLINE       0     0     0
     ata-ST4000NM0033-9ZM170_Z1ZAZ8F1         ONLINE       0     0     0
     14296253848142792483                     UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000NM0033-9ZM170_Z1ZAZDJ0-part1
     ata-ST4000NM0033-9ZM170_Z1Z80KG0         ONLINE       0     0     0

Could have been worse.  I do have backups, and it is raidz3, so all
I've injured is my pride, but I do want to fix things.  I'd
appreciate some guidance before I attempt doing this - I have no
experience at it myself.

The steps I envision are

1) zpool offline zfs 14296253848142792483 (What's that number?)
2) do something to repair the damaged disk
3) zpool online zfs 

Right now, the device name for the damaged disk is /dev/sda.  Gdisk
says this about it:

Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!


The GPT table is stored at least twice; this is telling you the primary 
copy is trashed, but the backup seems okay ...


Warning: Invalid CRC on main header data; loaded backup partition table.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.

Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!

Warning! One or more CRCs don't match. You should repair the disk!
Main header: ERROR
Backup header: OK
Main partition table: ERROR
Backup partition table: OK

Partition table scan:
   MBR: not present
   BSD: not present
   APM: not present
   GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
  1 - Use current GPT
  2 - Create blank GPT

Your answer: ( I haven't given one yet)

I'm not exactly sure what this is telling me.  But I'm guessing it
means that the main partition table is gone, but there's a good
backup.


Yup. I don't understand that prompt, but I THINK it's saying that if you 
do choose choice 1, it will recover your partition table for you.



 In addition, some, but not all disk id info is gone:
1) /dev/disk/by-id still shows ata-ST4000NM0033-9ZM170_Z1ZAZDJ0 (the
damaged disk) but none of its former partitions


Because this entry is for the disk itself; you've only damaged the 
contents, so it is completely unaffected.



2) /dev/disk/by-partlabel shows entries for the undamaged disks in the
pool, but not the damaged one
3) /dev/disk/by-partuuid similar to /dev/disk/by-partlabel


For both of these, "part" is short for partition, and you've just 
trashed them ...



4) /dev/disk/by-uuid does not show the damaged disk


Because the uuid is part of the partition table.


This particular disk is from a batch of 4 I bought with the same make
and specification and very similar ids (/dev/disk/by-id).  Can I
repair this disk by copying something off one of those other disks
onto this one? 


GOD NO! You'll start copying uuids, so they'll no longer be unique, and 
things really will be broken!



Is repair just repartitioning - as in the Gentoo
handbook?  Is it as simple as running gdisk and typing 1 to accept
gdisk's attempt at recovering the gpt?  Is running gdisk's recovery
and transformation facilities the way to go (the b option looks like
it's made for exactly this situation)?

Anybody experienced at this and willing to guide me?

Make sure that option 1 really does recover the GPT, then use it. Of 
course, the question then becomes what further damage will rear its head.


You need to make sure that your raid 3 array can recover from a corrupt 
disk. THIS IS IMPORTANT. If you tried to recover an md-raid-5 array from 
this situation you'd almost certainly trash it completely.



Actually, if your setup is raid, I'd just blow out the trashed disk 
completely. Take it out of your system, replace it, and let zfs repair 
itself onto the new disk.


You can then zero out the old disk and it's now a spare.

Just be careful here, because I don't know what zfs does, but btrfs by 
default mirrors metadata but not data, so with that you'd think a 
mirrored filesystem could repair itself but it can't ... if you want to 
repair the filesystem without rebuilding from scratch, you need 

[gentoo-user] zfs repair needed (due to fingers being faster than brain)

2021-03-01 Thread John Blinka
HI, Gentooers!

So, I typed dd if=/dev/zero of=/dev/sd, and despite
hitting ctrl-c quite quickly, zeroed out some portion of the initial
part of a disk.  Which did this to my zfs raidz3 array:

NAME                                         STATE     READ WRITE CKSUM
zfs                                          DEGRADED     0     0     0
  raidz3-0                                   DEGRADED     0     0     0
    ata-HGST_HUS724030ALE640_PK1234P8JJJVKP  ONLINE       0     0     0
    ata-HGST_HUS724030ALE640_PK1234P8JJP3AP  ONLINE       0     0     0
    ata-ST4000NM0033-9ZM170_Z1Z80P4C         ONLINE       0     0     0
    ata-ST4000NM0033-9ZM170_Z1ZAZ8F1         ONLINE       0     0     0
    14296253848142792483                     UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST4000NM0033-9ZM170_Z1ZAZDJ0-part1
    ata-ST4000NM0033-9ZM170_Z1Z80KG0         ONLINE       0     0     0
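
For anyone who wants to pick the unhealthy members out of a status listing like this one programmatically, here is a minimal Python sketch. It assumes the whitespace-separated NAME / STATE / READ / WRITE / CKSUM columns shown above, with any trailing note such as "was ..." on the same line as its row. The function name and the embedded sample text are only for illustration.

```python
STATUS = """\
NAME STATE READ WRITE CKSUM
zfs DEGRADED 0 0 0
raidz3-0 DEGRADED 0 0 0
ata-HGST_HUS724030ALE640_PK1234P8JJJVKP ONLINE 0 0 0
ata-HGST_HUS724030ALE640_PK1234P8JJP3AP ONLINE 0 0 0
ata-ST4000NM0033-9ZM170_Z1Z80P4C ONLINE 0 0 0
ata-ST4000NM0033-9ZM170_Z1ZAZ8F1 ONLINE 0 0 0
14296253848142792483 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST4000NM0033-9ZM170_Z1ZAZDJ0-part1
ata-ST4000NM0033-9ZM170_Z1Z80KG0 ONLINE 0 0 0
"""


def unhealthy(status_text):
    """Return (name, state, note) for every row whose state is not ONLINE."""
    rows = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] == "NAME":
            continue                      # skip the header and short lines
        name, state = parts[0], parts[1]
        if state != "ONLINE":
            rows.append((name, state, " ".join(parts[5:])))
    return rows


if __name__ == "__main__":
    for name, state, note in unhealthy(STATUS):
        print(name, state, note)
```

The long number on the UNAVAIL row is how the pool refers to the member it can no longer find; the "was ..." note records the device path it used to have.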

Could have been worse.  I do have backups, and it is raidz3, so all
I've injured is my pride, but I do want to fix things.  I'd
appreciate some guidance before I attempt doing this - I have no
experience at it myself.

The steps I envision are

1) zpool offline zfs 14296253848142792483 (What's that number?)
2) do something to repair the damaged disk
3) zpool online zfs 

Right now, the device name for the damaged disk is /dev/sda.  Gdisk
says this about it:

Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Warning: Invalid CRC on main header data; loaded backup partition table.
Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.

Warning! Main partition table CRC mismatch! Loaded backup partition table
instead of main partition table!

Warning! One or more CRCs don't match. You should repair the disk!
Main header: ERROR
Backup header: OK
Main partition table: ERROR
Backup partition table: OK

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: damaged

Found invalid MBR and corrupt GPT. What do you want to do? (Using the
GPT MAY permit recovery of GPT data.)
 1 - Use current GPT
 2 - Create blank GPT

Your answer: ( I haven't given one yet)

I'm not exactly sure what this is telling me.  But I'm guessing it
means that the main partition table is gone, but there's a good
backup.  In addition, some, but not all disk id info is gone:
1) /dev/disk/by-id still shows ata-ST4000NM0033-9ZM170_Z1ZAZDJ0 (the
damaged disk) but none of its former partitions
2) /dev/disk/by-partlabel shows entries for the undamaged disks in the
pool, but not the damaged one
3) /dev/disk/by-partuuid similar to /dev/disk/by-partlabel
4) /dev/disk/by-uuid does not show the damaged disk
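
The four observations above can be gathered in one pass by walking the by-* directories and keeping the links that resolve to the disk or one of its partitions. A minimal Python sketch follows. The function name is made up, and the partition test assumes sdX-style names where a partition is the disk name plus digits (so it would need adjusting for nvme-style names).

```python
import os


def links_to(device, base="/dev/disk"):
    """Map each by-* directory to the link names that point at device
    or at one of its numbered partitions."""
    device = os.path.realpath(device)
    found = {}
    if not os.path.isdir(base):
        return found
    for sub in sorted(os.listdir(base)):
        subdir = os.path.join(base, sub)
        if not os.path.isdir(subdir):
            continue
        names = []
        for name in sorted(os.listdir(subdir)):
            target = os.path.realpath(os.path.join(subdir, name))
            if not target.startswith(device):
                continue
            suffix = target[len(device):]
            if suffix == "" or suffix.isdigit():   # the disk itself or sdXN
                names.append(name)
        if names:
            found[sub] = names
    return found


if __name__ == "__main__":
    # /dev/sda is the damaged disk's device name in this thread.
    for directory, names in links_to("/dev/sda").items():
        print(directory, names)
```

Running it before and after the repair would show the partition-level links coming back once the partition table is readable again.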

This particular disk is from a batch of 4 I bought with the same make
and specification and very similar ids (/dev/disk/by-id).  Can I
repair this disk by copying something off one of those other disks
onto this one?  Is repair just repartitioning - as in the Gentoo
handbook?  Is it as simple as running gdisk and typing 1 to accept
gdisk's attempt at recovering the gpt?  Is running gdisk's recovery
and transformation facilities the way to go (the b option looks like
it's made for exactly this situation)?

Anybody experienced at this and willing to guide me?

Thanks,

John Blinka