Re: File corruption on SSD disk
On Jan 09 16:40:48, obsd.t...@randy.pensive.org wrote:
> I'm running OpenBSD on a Protectli box as a router/firewall. The disk is an
> SSD. Every now and then I reboot it ("sudo shutdown -r now") just to make
> sure it comes back up. Several times it hung on disk errors that the auto
> 'fsck' can't fix. I was able to manually run 'fsck' and answer its prompts
> to clean up the problems, which sometimes were unreferenced inodes or
> similar things. It deleted some files in /var. The system runs OK, so
> perhaps the files aren't used in my minimal setup.
>
> I have two questions:
>
> (1) In "/etc/rc" I changed [fsck -p "$@"] to [fsck -f "$@"] in an attempt to
> get it to force fix problems, so the system could recover without someone
> manually doing it. That didn't work (it still stopped startup with the disk
> errors), so I tried making it [do_fsck -f -y] but that didn't work either.
> How does one make the system recover (e.g., how would an unstaffed/dark
> computer operations center do it)?
>
> (2) Why would the system develop disk problems? Might the SSD be failing?

Of course.

> Should I proactively replace it?

There's hardly anything proactive about it, if it's showing unrecoverable
fsck errors already.

> If I do replace it, should I start fresh
> with a clean install versus cloning the current disk?

Definitely a clean install on another disk.

> By the way, the SSD is a Samsung SSD 870 EVO 500GB (only using a tiny bit of
> it). Micromat's Lifespan says it has 100% life left, and their Tech Tools
> Pro found no bad blocks.

Boot from the new clean install and read the entire old disk with
dd if=/dev/sdXc

Jan
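A minimal sketch of the full-disk read suggested above, not a definitive procedure. It assumes the goal is simply to make the drive return every block and to notice any I/O error; `read_whole` is a hypothetical helper name, the output target /dev/null and the block size are assumptions, and `sdX` stays a placeholder for the old disk.

```shell
#!/bin/sh
# read_whole SRC
# Read every block of SRC (a device or a file) and discard the data.
# A drive with unreadable sectors typically shows up here as an I/O
# error from dd and a non-zero exit status.
# read_whole is a hypothetical helper name, not an OpenBSD tool.
read_whole() {
    src=$1
    # Block size is given in bytes so the line works with BSD and GNU dd.
    if dd if="$src" of=/dev/null bs=1048576; then
        echo "read OK: $src"
    else
        echo "read FAILED: $src" >&2
        return 1
    fi
}

# Example (as root), with sdX replaced by the old disk:
# read_whole /dev/sdXc
```

A clean pass only shows that every block could be read at that moment; it says nothing about whether the data in those blocks is still correct.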
Re: File corruption on SSD disk
On 2024-01-10, Randall Gellens wrote:
> I'm running OpenBSD on a Protectli box as a router/firewall. The disk is
> an SSD. Every now and then I reboot it ("sudo shutdown -r now") just to
> make sure it comes back up. Several times it hung on disk errors that
> the auto 'fsck' can't fix. I was able to manually run 'fsck' and answer
> its prompts to clean up the problems, which sometimes were unreferenced
> inodes or similar things. It deleted some files in /var. The system runs
> OK, so perhaps the files aren't used in my minimal setup.
>
> I have two questions:
>
> (1) In "/etc/rc" I changed [fsck -p "$@"] to [fsck -f "$@"] in an
> attempt to get it to force fix problems, so the system could recover
> without someone manually doing it. That didn't work (it still stopped
> startup with the disk errors), so I tried making it [do_fsck -f -y] but
> that didn't work either. How does one make the system recover (e.g., how
> would an unstaffed/dark computer operations center do it)?

fsck -y is all you can do there.

> (2) Why would the system develop disk problems? Might the SSD be
> failing? Should I proactively replace it? If I do replace it, should I
> start fresh with a clean install versus cloning the current disk?

possibly. SSDs aren't exactly permanent storage either, even if not
failing (read about "bit rot") - magnetic HDDs too, though they're
usually considered to have a bit more longevity than SSDs in that
respect.

are temperatures in a safe range? are your cables good and properly
connected?

if replacing, you don't want to start from a clone of a suspicious
drive. you don't know if the data you're reading is good or not. I'd go
for a clean install, move config across, and review those config files.

> By the way, the SSD is a Samsung SSD 870 EVO 500GB (only using a tiny
> bit of it). Micromat's Lifespan says it has 100% life left, and their
> Tech Tools Pro found no bad blocks.

that will most likely be based on erase/write cycles and ignore other
possible effects.

--
Please keep replies on the mailing list.
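One possible way to approach the "review those config files" step, offered as a rough sketch and not as the poster's method. It assumes the old disk's filesystem is mounted read-only somewhere such as /mnt/old (a hypothetical mount point) and lists which files differ from the fresh install, so each one can be inspected by hand before being copied across. `list_changed` is a hypothetical helper name.

```shell
#!/bin/sh
# list_changed OLD_DIR NEW_DIR
# Print one line per file that differs between the two trees, or that
# exists in only one of them. diff exits with status 1 when differences
# are found, so a non-zero status here is not by itself an error.
# list_changed is a hypothetical helper name.
list_changed() {
    diff -rq "$1" "$2"
}

# Example, assuming the old root filesystem is mounted at /mnt/old:
# list_changed /mnt/old/etc /etc
```

Each reported file can then be read in full (for example with `diff -u`) before it is copied, since a file from the suspect disk may itself be damaged.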
File corruption on SSD disk
I'm running OpenBSD on a Protectli box as a router/firewall. The disk is an SSD. Every now and then I reboot it ("sudo shutdown -r now") just to make sure it comes back up. Several times it hung on disk errors that the auto 'fsck' can't fix. I was able to manually run 'fsck' and answer its prompts to clean up the problems, which sometimes were unreferenced inodes or similar things. It deleted some files in /var. The system runs OK, so perhaps the files aren't used in my minimal setup.

I have two questions:

(1) In "/etc/rc" I changed [fsck -p "$@"] to [fsck -f "$@"] in an attempt to get it to force fix problems, so the system could recover without someone manually doing it. That didn't work (it still stopped startup with the disk errors), so I tried making it [do_fsck -f -y] but that didn't work either. How does one make the system recover (e.g., how would an unstaffed/dark computer operations center do it)?

(2) Why would the system develop disk problems? Might the SSD be failing? Should I proactively replace it? If I do replace it, should I start fresh with a clean install versus cloning the current disk?

By the way, the SSD is a Samsung SSD 870 EVO 500GB (only using a tiny bit of it). Micromat's Lifespan says it has 100% life left, and their Tech Tools Pro found no bad blocks.

--Randall