bug#76516: [shepherd] Timer not executed

Tomas Volf Thu, 27 Feb 2025 17:30:30 -0800

Ludovic Courtès <[email protected]> writes:

> Hi,
>
> Tomas Volf <[email protected]> skribis:
>
>> I have no idea how Shepherd works internally (and much less how Fibers
>> work), so maybe this comment is completely off, but this seems
>> suspicious.  Should this lambda not get the wake up time as an argument,
>> instead of calling get-internal-real-time to get the "now"?
>
> Yes, it would probably be nicer, but it wouldn’t make much of a
> difference here (and it’s not related to the bug: the bug shows that we
> sleep longer than asked for).


I am not sure this is correct.  What the bug shows is that the callback
is called later than expected.  We do not know how long the sleep was.
Am I missing something?
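
To make the distinction concrete, here is a rough sketch in plain Guile
(no Fibers, and not Shepherd code; `sleep-then-call' and `report' are
names I made up for illustration).  It hands the wake-up time to the
callback, so the length of the sleep and the extra delay before the
callback runs can be told apart:

```scheme
;; Hypothetical illustration only: this is not how Shepherd or Fibers
;; schedule timers.  It separates "how long did we sleep" from "how
;; long after waking up did the callback actually run".

(define (ticks->seconds ticks)
  ;; Convert internal time units to (inexact) seconds.
  (exact->inexact (/ ticks internal-time-units-per-second)))

(define (sleep-then-call delay-usec callback)
  ;; Sleep for DELAY-USEC microseconds, then call CALLBACK with the
  ;; start time and the wake-up time instead of letting it ask for "now".
  (let ((start (get-internal-real-time)))
    (usleep delay-usec)
    (let ((woke (get-internal-real-time)))
      (callback start woke))))

(define (report start woke)
  ;; Return a list: (sleep duration, delay between wake-up and callback),
  ;; both in seconds.
  (let ((now (get-internal-real-time)))
    (list (ticks->seconds (- woke start))
          (ticks->seconds (- now woke)))))

;; Example: (sleep-then-call 10000 report)
```

If only the second number were large, the sleep itself would be fine
and the delay would come from whatever runs between the wake-up and the
callback.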

>
>> Is there a way to enable logging of the events?  So we would know when
>> fibers decided the timer is up, and when the lambda was called?
>
> There’s no logging at the Fibers level; all we have is logging by
> shepherd itself.
>
>> PS: Looking into timer.scm, I see this comment
>>
>> ;; Reached when resuming from sleep state: we slept
>> ;; significantly more than the requested number of seconds.  To
>> ;; avoid triggering every timer when resuming from sleep state,
>> ;; sleep again to remain in sync.
>>
>> Not sure I would call 2 (or even the 10) a "significantly more". :) If I
>> expect the cron to sleep for 86400 seconds, 10 more seems... minor.
>>
>> Maybe (I did not put too much thought into this and the numbers are
>> completely thumb-sucked), the "overslept" could be if the sleep was
>> longer by more than 10% of the timer period, clipped to be at least 2,
>> and at most 30 minutes?
>
> Yeah, though there’s no reason for sleeps to drift this much, it’s a
> pretty fundamental assumption.

Does not seem to hold in this particular case (at least for the lower
bound).  ¯\_(ツ)_/¯

> Maybe this:
>
>   (define max-delay
>     ;; Time after which we consider that we missed the deadline.  Tolerate a
>     ;; slight drift, which can happen occasionally.
>     (max (min (/ seconds 10.) 120) 2))

That should work, yeah.  At least as a temporary measure. :)
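
For reference, this is what the formula gives for a few periods.  The
sketch below wraps the quoted expression in a procedure taking
`seconds' as an argument (my assumption about how it would be used),
and adds the 30-minute cap I suggested earlier under a made-up name
for comparison:

```scheme
;; The quoted expression, wrapped in a procedure for experimentation.
(define (max-delay seconds)
  ;; 10% of the period, at most 120 seconds, at least 2 seconds.
  (max (min (/ seconds 10.) 120) 2))

;; Hypothetical variant with the 30-minute (1800 s) upper bound.
(define (max-delay/30-minutes seconds)
  (max (min (/ seconds 10.) 1800) 2))

(max-delay 10)                  ; => 2.0   (lower bound applies)
(max-delay 300)                 ; => 30.0  (10% of the period)
(max-delay 86400)               ; => 120.0 (upper bound applies)
(max-delay/30-minutes 86400)    ; => 1800.0
```

So with the 120-second cap a daily timer tolerates two minutes of
drift, and anything with a period of 20 seconds or less stays at the
current 2 seconds.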

A few additional data points: the timers I had scheduled for almost 24h
in the future fired exactly on time.  As for the kerberos-log-in-refresh
timer, twice it fired within the 2-second window (12:00:01), and once
outside it (12:00:02).

I was thinking about this some more, and the right solution here is
probably to use netlink to listen for ACPI events, the same way acpid
does.  That should provide reliable information about the suspend and
resume events.

Tomas

-- 
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.
