Re: reinstallation and restore after catastrophic mistake or failure; was: 1 Currently unreadable (pending) sectors How worried should I be?

2024-01-06 Thread David Christensen

On 1/6/24 04:36, Michael Kjörling wrote:

On 6 Jan 2024 00:37 -0800, from dpchr...@holgerdanske.com (David Christensen):

I suggest taking an image (backup) with dd(1), Clonezilla, etc., when you're
done.  This will allow you to restore the image later -- to roll back a
change you do not like, to recover from a disaster, to clone the image to
another device, to facilitate experiments (such as doing a secure erase to
see if it resolves the SSD pending sector issue), etc.

If you also keep your system configuration files in a version control
system, restoring an image is faster than wipe/ fresh install/ configure/
restore data.


I would go even further. Backups should be designed such that
recovering from a catastrophic storage failure, such as getting hit by
ransomware, unintentionally doing a destructive badblocks write test,
or the sudden failure of a storage device, is possible with at most
something very similar to:

* Boot some kind of live environment



I wanted more tools than what the Debian installer rescue shell provides 
(e.g. BusyBox) and I am too lazy to learn yet another live system (e.g. 
Knoppix), so I installed Debian with Xfce onto two USB drives -- one 
with BIOS/MBR and the other with UEFI/GPT and Secure Boot.  They are both 
complete installs, so they are familiar and I can add whatever I want.




* Set up file systems on the storage device to be restored onto
   (partitioning, setting up LUKS containers, formatting, whatever else
   might be called for)
* Within the live environment, install and configure the software
   needed to access the backup (if any) (this may include things like
   cryptographic keys, access passphrases, and the like)
* Perform the restoration from the most recent backup (this is the
   part that likely will take a significant amount of time)



I keep my Debian instances small, simple, and self-contained (1 GB ext4 
boot, 1 GB dm-crypt swap, and 12 GB LUKS ext4 root on one 16+ GB 2.5" 
SATA SSD).  dd(1) meets all of my imaging needs.  It's fast and requires 
minimal storage -- less than 10 minutes using an old-school USB 2.0 HDD; 
each 100 GB holds 6+ images.  (`apt-get autoremove`, `apt-get 
autoclean`, fstrim(8), and/or gzip(1) can reduce time and storage 
requirements.)
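The image-and-restore workflow above can be sketched as follows.  This is a
hedged illustration using scratch files in place of the real SSD and backup
HDD; all paths here are made up, and in practice the source would be
something like /dev/sdX with the image written to the backup drive.

```shell
# Hedged sketch of the dd(1) imaging workflow, run against scratch files
# instead of real devices.  Paths are hypothetical stand-ins.
set -eu
disk=$(mktemp)                     # stand-in for the SSD
image=$(mktemp)                    # stand-in for the image on the backup HDD
dd if=/dev/urandom of="$disk" bs=1M count=4 2>/dev/null  # fake disk contents
dd if="$disk" of="$image" bs=1M 2>/dev/null              # take the image
dd if=/dev/zero of="$disk" bs=1M count=4 2>/dev/null     # simulate a botched change
dd if="$image" of="$disk" bs=1M conv=notrunc 2>/dev/null # roll back from the image
cmp -s "$disk" "$image" && ok=yes                        # disk matches the image again
rm -f "$disk" "$image"
```

A real invocation would pipe through gzip(1) (`dd if=/dev/sdX bs=1M | gzip >
disk.img.gz`) to get the storage savings mentioned above.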



If my OS instances were larger, more complex, shared disk space, etc. -- 
e.g. multi-boot Windows, Debian, etc., with a shared data partition -- 
e.g. what the OP likely had -- I would think about a tool such as 
Clonezilla.  Then I would get a big USB 3.0+ HDD/RAID, boot one of my 
Debian USB instances, look at the partition table, and take dd(1) images 
in chunks -- block 0 to the last block before ESP, the ESP, then each 
partition or contiguous span of related partitions, and finally the 
secondary GPT header.
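The chunked approach can be illustrated with dd(1)'s skip=, seek=, and count=
operands.  This is a sketch against a scratch file with invented offsets; in
real use the start and length of each chunk would come from the partition
table (e.g. `fdisk -l` or `sgdisk -p`).

```shell
# Hedged sketch of "chunked" dd(1) imaging: skip= picks the start block on the
# source, count= the length, and seek= the start block on the restore target.
# A scratch file stands in for the disk; the offsets are invented.
set -eu
disk=$(mktemp); chunk=$(mktemp); target=$(mktemp)
dd if=/dev/urandom of="$disk" bs=1M count=8 2>/dev/null   # fake 8 MiB disk
# image a fake "partition": 3 MiB starting at offset 2 MiB
dd if="$disk" of="$chunk" bs=1M skip=2 count=3 2>/dev/null
# restore that span into a fresh 8 MiB target at the same offset
dd if=/dev/zero of="$target" bs=1M count=8 2>/dev/null
dd if="$chunk" of="$target" bs=1M seek=2 conv=notrunc 2>/dev/null
# the restored span must match the source span byte for byte
a=$(dd if="$disk" bs=1M skip=2 count=3 2>/dev/null | md5sum)
b=$(dd if="$target" bs=1M skip=2 count=3 2>/dev/null | md5sum)
[ "$a" = "$b" ] && ok=yes
rm -f "$disk" "$chunk" "$target"
```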




* Update the restored copies of /etc/fstab, /etc/crypttab and any
   other files that directly reference the partitions or file systems
   by some kind of ID (UUID, /dev/disk/by-*/*, ...)
* Reinstall the boot loader



When I take a dd(1) image of an MBR disk, I copy from block 0 through 
the end of the root partition.  So:


1.  UUIDs are preserved.

2.  All boot loader stages are preserved.


When I take a dd(1) image of a GPT disk with lots of zeros (fresh wipe 
and install), I copy the whole thing.  Again, UUIDs and boot loader 
stages are preserved.
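As an aside, this is also why gzip(1) helps so much with such images: the
zero-filled regions compress to almost nothing.  A small hedged
illustration, with a zero-filled scratch file standing in for the image:

```shell
# Hedged illustration: a dd(1) image that is mostly zeros compresses to a
# tiny fraction of its raw size, so `dd ... | gzip > disk.img.gz` is cheap
# for a freshly wiped-and-installed disk.
set -eu
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=32 2>/dev/null  # 32 MiB of zeros
gzip -k "$img"                                        # keeps $img, writes $img.gz
raw=$(stat -c %s "$img")
gz=$(stat -c %s "$img.gz")
[ "$gz" -lt $((raw / 100)) ] && ok=yes                # well under 1% of raw size
rm -f "$img" "$img.gz"
```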



Using live media for UUID and/or boot loader surgery is non-trivial, as 
discussed in more than a few posts to this list.  But such surgery may be 
required after restoring an image onto a different disk and/or hardware 
arrangement.
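For the UUID part of that surgery, the usual pattern is to read the new UUID
with blkid(8) and substitute it into the restored /etc/fstab and
/etc/crypttab.  A hedged sketch against a temporary file, with both UUIDs
invented for the example:

```shell
# Hedged sketch of the UUID fix-up after restoring onto a different disk:
# the old UUID comes from the restored /etc/fstab, the new one from
# something like `blkid -s UUID -o value /dev/sdXn`.  Both UUIDs below are
# invented, and a temporary file stands in for /etc/fstab.
set -eu
fstab=$(mktemp)
old_uuid=11111111-2222-3333-4444-555555555555   # hypothetical old root UUID
new_uuid=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee   # hypothetical new-disk UUID
printf 'UUID=%s / ext4 errors=remount-ro 0 1\n' "$old_uuid" > "$fstab"
sed -i "s/$old_uuid/$new_uuid/g" "$fstab"       # same edit applies to /etc/crypttab
grep -q "UUID=$new_uuid" "$fstab" && ok=yes
rm -f "$fstab"
```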




* Reboot
* Reinstall the boot loader again from within the restored environment
   to ensure that everything relating to it is in sync



For the simple case of restoring an image onto the exact same hardware, 
a restored MBR image just works.  Same for GPT.  If a GPT disk was 
zeroed or secure erased, the secondary GPT header will need to be 
rewritten.  I believe GRUB, Linux, or something on Debian did this 
automagically for me the last time I tried.




Such recovery should _not_ need to involve significant reconfiguration
of anything. Any such requirements will massively increase your time
to recovery, as I think we're seeing an example of here. And yes,
pretty much all of this could be scripted, but I strongly suspect that
few people need to do a bare-metal restore of their most recent backup
often enough for _that_ to be worth the effort to create and maintain.



AIUI the OP accidentally zeroed a Windows/ Debian multi-boot disk in a 
relatively new computer.  Rebuilding from scratch is going to involve 
more than twice the effort of rebuilding one OS from scratch, but 
hopefully there was no live data lost.



I have a half dozen computers in my SOHO network.  I trash my daily 
driver at least once a year and my workhorse more often than that.



I started with disaster preparedness/ recovery using 
lowest-common-denominator tools -- tar(1), gzip(1), rsync(1), dd(1), 
etc.  I am a code

Re: reinstallation and restore after catastrophic mistake or failure; was: 1 Currently unreadable (pending) sectors How worried should I be?

2024-01-06 Thread Michael Kjörling
On 6 Jan 2024 00:37 -0800, from dpchr...@holgerdanske.com (David Christensen):
> I suggest taking an image (backup) with dd(1), Clonezilla, etc., when you're
> done.  This will allow you to restore the image later -- to roll back a
> change you do not like, to recover from a disaster, to clone the image to
> another device, to facilitate experiments (such as doing a secure erase to
> see if it resolves the SSD pending sector issue), etc.
> 
> If you also keep your system configuration files in a version control
> system, restoring an image is faster than wipe/ fresh install/ configure/
> restore data.

I would go even further. Backups should be designed such that
recovering from a catastrophic storage failure, such as getting hit by
ransomware, unintentionally doing a destructive badblocks write test,
or the sudden failure of a storage device, is possible with at most
something very similar to:

* Boot some kind of live environment
* Set up file systems on the storage device to be restored onto
  (partitioning, setting up LUKS containers, formatting, whatever else
  might be called for)
* Within the live environment, install and configure the software
  needed to access the backup (if any) (this may include things like
  cryptographic keys, access passphrases, and the like)
* Perform the restoration from the most recent backup (this is the
  part that likely will take a significant amount of time)
* Update the restored copies of /etc/fstab, /etc/crypttab and any
  other files that directly reference the partitions or file systems
  by some kind of ID (UUID, /dev/disk/by-*/*, ...)
* Reinstall the boot loader
* Reboot
* Reinstall the boot loader again from within the restored environment
  to ensure that everything relating to it is in sync

Such recovery should _not_ need to involve significant reconfiguration
of anything. Any such requirements will massively increase your time
to recovery, as I think we're seeing an example of here. And yes,
pretty much all of this could be scripted, but I strongly suspect that
few people need to do a bare-metal restore of their most recent backup
often enough for _that_ to be worth the effort to create and maintain.

Which is not to say that keeping configuration files
version-controlled cannot provide benefits anyway; but given a proper,
frequent backup regime, the benefits even of that are reduced.

-- 
Michael Kjörling 🔗 https://michael.kjorling.se
“Remember when, on the Internet, nobody cared that you were a dog?”