> Feedback-ID: 6798:1650:null:purelymail
> Date: Tue, 6 Jan 2026 10:00:08 +1000
> From: Jonathan Matthew <[email protected]>
>
> On Mon, Jan 05, 2026 at 11:44:13PM +0100, Mark Kettenis wrote:
> > > Date: Mon, 5 Jan 2026 14:24:54 +1000
> > > From: Jonathan Matthew <[email protected]>
> > >
> > > On Sun, Jan 04, 2026 at 10:34:55AM -0700, Theo de Raadt wrote:
> > > > Mark Kettenis <[email protected]> wrote:
> > > >
> > > > > I'm 100% sure that I am booting the correct kernel. The checksum
> > > > > calculated by that code above is the same. But for some reason the
> > > > > checksum that we read back from the hibernation info on disk is
> > > > > all-zeroes. So something is going wrong. Will dig deeper when I have
> > > > > time.
> > > >
> > > > Is it just the checksum field --- or has the signature sector not
> > > > actually made it onto disk?
> > > >
> > > > There is this messy thing in subr_hibernate.c around 1954
> > > >
> > > > /* Allow the disk to settle */
> > > > delay(500000);
> > > >
> > > > Few days ago I asked Mike about this again. Apparently this was a
> > > > workaround
> > > > for an old system, and we should not do it anymore. That was probably
> > > > ahci.
> > > > But why did we need it back then?
> > >
> > > This was added well before ahci hibernate was working at all, so
> > > it must have been for wdc.
> > >
> > > >
> > > > These new systems are nvme. Do we have a situation where the last
> > > > hibernate
> > > > write operation gets skipped in subr_hibernate.c, or do we have
> > > > low-level
> > > > side-effect-free io functions which don't do their job. Is
> > > > nvme_hibernate_io()
> > > > failing the last write to disk?
> > >
> > > Looking at the nvme shutdown code again, I realise we're not deleting
> > > the hibernate i/o queue, which we're supposed to do as part of the
> > > normal shutdown procedure. Perhaps without that the controller isn't
> > > flushing all the data out to non-volatile storage. We don't issue a
> > > flush command after the last hibernate write, but we shouldn't have
> > > to.
> > >
> > > Maybe this will help? (only compile tested)
> >
> > Sadly, it doesn't seem to help.
> >
> > It also prints a "unable to delete hib q, disabling" message during a
> > normal suspend. I suppose that is because the hib q wasn't created in
> > that case?
>
> Right, it would be better to do that in HIB_DONE.
>
> If you look at the unsafe shutdown counter on the device using
> smartctl -A /dev/sd0c, does it seem like it matches the number
> of unsuccessful hibernates?

# smartctl -A /dev/sd0c | grep ^Unsafe
Unsafe Shutdowns: 35

The number doesn't increase with an unsuccessful (un)hibernate. Sounds
plausible for the number of times I've had to long-press the power
button to reset a hung machine while debugging stuff.

To me, it really feels like the issue is a kernel layout thing. Some
kernels consistently work, some consistently don't. Even if the only
difference is a relink.
