> Date: Tue, 6 Jan 2026 10:00:08 +1000
> From: Jonathan Matthew <[email protected]>
> 
> On Mon, Jan 05, 2026 at 11:44:13PM +0100, Mark Kettenis wrote:
> > > Date: Mon, 5 Jan 2026 14:24:54 +1000
> > > From: Jonathan Matthew <[email protected]>
> > > 
> > > On Sun, Jan 04, 2026 at 10:34:55AM -0700, Theo de Raadt wrote:
> > > > Mark Kettenis <[email protected]> wrote:
> > > > 
> > > > > I'm 100% sure that I am booting the correct kernel.  The checksum
> > > > > calculated by that code above is the same.  But for some reason the
> > > > > checksum that we read back from the hibernation info on disk is
> > > > > all-zeroes.  So something is going wrong.  Will dig deeper when I have
> > > > > time.
> > > > 
> > > > Is it just the checksum field --- or has the signature sector not
> > > > actually made it onto disk?
> > > > 
> > > > There is this messy thing in subr_hibernate.c around 1954
> > > > 
> > > >         /* Allow the disk to settle */
> > > >         delay(500000);
> > > > 
> > > > Few days ago I asked Mike about this again.  Apparently this was a
> > > > workaround for an old system, and we should not do it anymore.  That
> > > > was probably ahci.  But why did we need it back then?
> > > 
> > > This was added well before ahci hibernate was working at all, so
> > > it must have been for wdc.
> > > 
> > > > 
> > > > These new systems are nvme.  Do we have a situation where the last
> > > > hibernate write operation gets skipped in subr_hibernate.c, or do we
> > > > have low-level side-effect-free io functions which don't do their job.
> > > > Is nvme_hibernate_io() failing the last write to disk?
> > > 
> > > Looking at the nvme shutdown code again, I realise we're not deleting
> > > the hibernate i/o queue, which we're supposed to do as part of the
> > > normal shutdown procedure. Perhaps without that the controller isn't
> > > flushing all the data out to non-volatile storage. We don't issue a
> > > flush command after the last hibernate write, but we shouldn't have
> > > to.
> > > 
> > > Maybe this will help? (only compile tested)
> > 
> > Sadly, it doesn't seem to help.
> > 
> > It also prints an "unable to delete hib q, disabling" message during a
> > normal suspend.  I suppose that is because the hib q wasn't created in
> > that case?
> 
> Right, it would be better to do that in HIB_DONE.
> 
> If you look at the unsafe shutdown counter on the device using
> smartctl -A /dev/sd0c, does it seem like it matches the number
> of unsuccessful hibernates?

# smartctl -A /dev/sd0c | grep ^Unsafe
Unsafe Shutdowns:                   35

The number doesn't increase with an unsuccessful (un)hibernate.

Sounds plausible for the number of times I've had to long-press the
power button to reset a hung machine while debugging stuff.
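
For anyone who wants to repeat the check, something like the sketch
below might do.  The unsafe_count helper is a made-up name, not an
existing tool; it assumes the "Unsafe Shutdowns:" line format shown
above and strips any thousands separators.

```shell
#!/bin/sh
# Hypothetical helper: print the "Unsafe Shutdowns" value found in
# smartctl -A output supplied on stdin.
unsafe_count() {
	awk -F: '/^Unsafe Shutdowns/ { gsub(/[ ,]/, "", $2); print $2 }'
}

# Usage sketch (as root; device name taken from the command above):
#   before=$(smartctl -A /dev/sd0c | unsafe_count)
#   ... hibernate, attempt to resume, boot back up ...
#   after=$(smartctl -A /dev/sd0c | unsafe_count)
#   [ "$after" -gt "$before" ] && echo "unsafe shutdown recorded"
```

If the counter is unchanged after a failed unhibernate, the controller
at least considers the shutdown itself to have been clean.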

To me, it really feels like the issue is a kernel layout thing.  Some
kernels consistently work, some consistently don't.  Even if the only
difference is a relink.
