Re: File corruption on SSD disk

2024-01-10 Thread Jan Stary
On Jan 09 16:40:48, obsd.t...@randy.pensive.org wrote:
> I'm running OpenBSD on a Protectli box as a router/firewall. The disk is an
> SSD. Every now and then I reboot it ("sudo shutdown -r now") just to make
> sure it comes back up. Several times it hung on disk errors that the auto
> 'fsck' can't fix. I was able to manually run 'fsck' and answer its prompts
> to clean up the problems, which sometimes were unreferenced inodes or
> similar things. It deleted some files in /var. The system runs OK, so
> perhaps the files aren't used in my minimal setup.
> 
> I have two questions:
> 
> (1) In "/etc/rc" I changed [fsck -p "$@"] to [fsck -f "$@"] in an attempt to
> get it to force fix problems, so the system could recover without someone
> manually doing it. That didn't work (it still stopped startup with the disk
> errors), so I tried making it [do_fsck -f -y] but that didn't work either.
> How does one make the system recover (e.g., how would an unstaffed/dark
> computer operations center do it)?
> 
> (2) Why would the system develop disk problems? Might the SSD be failing?

Of course.

> Should I proactively replace it?

There's hardly anything proactive about it,
if it's showing unrecoverable fsck errors already.

> If I do replace it, should I start fresh
> with a clean install versus cloning the current disk?

Definitely a clean install on another disk.

> By the way, the SSD is a Samsung SSD 870 EVO 500GB (only using a tiny bit of
> it). Micromat's Lifespan says it has 100% life left, and their Tech Tools
> Pro found no bad blocks.

Boot from the new clean install
and read the entire old disk with dd if=/dev/sdXc 

Jan
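
The dd command in the message above is cut off in the archive. A minimal sketch of a whole-disk read test follows, assuming the intent was to read every block and discard the data. The `of=/dev/null` target, the block size, and the `read_test` function name are assumptions, not part of the original message; `sdXc` stays a placeholder for the old disk.

```shell
# Read every block of a disk (or file) and throw the data away.
# Only the success or failure of the reads matters here.
read_test() {
    src=$1
    # of=/dev/null and bs=65536 are assumptions; dd stops at the
    # first read error and exits non-zero.
    if dd if="$src" of=/dev/null bs=65536; then
        echo "read OK: $src"
    else
        echo "read FAILED: $src"
        return 1
    fi
}
```

Usage would be along the lines of `read_test /dev/sdXc`, run as root from the fresh install. A clean pass only shows that every block was readable, not that the contents are correct.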



Re: File corruption on SSD disk

2024-01-10 Thread Stuart Henderson
On 2024-01-10, Randall Gellens wrote:
> I'm running OpenBSD on a Protectli box as a router/firewall. The disk is 
> an SSD. Every now and then I reboot it ("sudo shutdown -r now") just to 
> make sure it comes back up. Several times it hung on disk errors that 
> the auto 'fsck' can't fix. I was able to manually run 'fsck' and answer 
> its prompts to clean up the problems, which sometimes were unreferenced 
> inodes or similar things. It deleted some files in /var. The system runs 
> OK, so perhaps the files aren't used in my minimal setup.
>
> I have two questions:
>
> (1) In "/etc/rc" I changed [fsck -p "$@"] to [fsck -f "$@"] in an 
> attempt to get it to force fix problems, so the system could recover 
> without someone manually doing it. That didn't work (it still stopped 
> startup with the disk errors), so I tried making it [do_fsck -f -y] but 
> that didn't work either. How does one make the system recover (e.g., how 
> would an unstaffed/dark computer operations center do it)?

fsck -y is all you can do there.
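
For context: `-f` only forces a check of a filesystem that is marked clean and does not answer any repair prompts, while `-y` answers yes to all of them. A rough sketch of an unattended check might look like the following. The `fsck_unattended` name and the `FSCK` override variable are hypothetical, and this is not the actual logic in /etc/rc.

```shell
# FSCK is a hypothetical hook so the logic can be tried without a real disk.
: "${FSCK:=fsck}"

fsck_unattended() {
    # First pass: preen mode, which only makes repairs considered safe.
    if $FSCK -p "$@"; then
        return 0
    fi
    echo "preen failed; retrying with -y (this may delete data)" >&2
    # Second pass: assume "yes" for every repair question.
    $FSCK -y "$@"
}
```

Answering yes to everything can throw away files, as happened in /var here, so this trades data for availability.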

> (2) Why would the system develop disk problems? Might the SSD be 
> failing? Should I proactively replace it? If I do replace it, should I 
> start fresh with a clean install versus cloning the current disk?

possibly. SSDs aren't exactly permanent storage either, even if not
failing (read about "bit rot"). the same applies to magnetic HDDs, though
they're usually considered to have a bit more longevity in that respect.

are temperatures in a safe range?

are your cables good and properly connected?

if replacing, you don't want to start from a clone of a suspicious drive.
you don't know if the data you're reading is good or not. I'd go for a
clean install, move config across, and review those config files.
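
One way to do that review, sketched under the assumption that the old disk's /etc is mounted somewhere readable: list which files differ from the fresh install before copying anything. The `config_diff` name and the example paths are hypothetical.

```shell
# List files that differ between the old config tree and the new one.
# Example arguments (hypothetical): /mnt/old/etc and /etc
config_diff() {
    old=$1
    new=$2
    # -r recurses into subdirectories; -q names differing files only.
    # diff exits 1 when it finds differences, which is not an error here;
    # any other non-zero status is passed on as failure.
    diff -rq "$old" "$new" || [ $? -eq 1 ]
}
```

Each file it reports can then be read with `diff -u` and moved across by hand rather than copied wholesale from the suspect disk.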

> By the way, the SSD is a Samsung SSD 870 EVO 500GB (only using a tiny 
> bit of it). Micromat's Lifespan says it has 100% life left, and their 
> Tech Tools Pro found no bad blocks.

that will most likely be based on erase/write cycles and ignore other
possible effects.


-- 
Please keep replies on the mailing list.



File corruption on SSD disk

2024-01-09 Thread Randall Gellens
I'm running OpenBSD on a Protectli box as a router/firewall. The disk is 
an SSD. Every now and then I reboot it ("sudo shutdown -r now") just to 
make sure it comes back up. Several times it hung on disk errors that 
the auto 'fsck' can't fix. I was able to manually run 'fsck' and answer 
its prompts to clean up the problems, which sometimes were unreferenced 
inodes or similar things. It deleted some files in /var. The system runs 
OK, so perhaps the files aren't used in my minimal setup.


I have two questions:

(1) In "/etc/rc" I changed [fsck -p "$@"] to [fsck -f "$@"] in an 
attempt to get it to force fix problems, so the system could recover 
without someone manually doing it. That didn't work (it still stopped 
startup with the disk errors), so I tried making it [do_fsck -f -y] but 
that didn't work either. How does one make the system recover (e.g., how 
would an unstaffed/dark computer operations center do it)?


(2) Why would the system develop disk problems? Might the SSD be 
failing? Should I proactively replace it? If I do replace it, should I 
start fresh with a clean install versus cloning the current disk?


By the way, the SSD is a Samsung SSD 870 EVO 500GB (only using a tiny 
bit of it). Micromat's Lifespan says it has 100% life left, and their 
Tech Tools Pro found no bad blocks.


--Randall