Nice cleanup!
I think I see a buglet in your level-5 cascading. Suppose a timer is requested far in the future for a time that is an exact multiple of 32768 jiffies. collect_expired_timers() scans level 5 after all the previous ones, and will cascade it to level 0, in a level-0 bucket which has already been scanned, and won't be scanned again for 64 jiffies. I agree that 64 jiffies is well within your allowed rounding accuracy, and order of timer firing is not guaranteed when they're for the same time, but it is a bit odd when a timer fires 32 jiffies *before* another timer scheduled for 32 jiffies later. That's the sort of peculiarity that could lead to a subtle bug. While I like the cleanup of just limiting long-term resolution, if it turns out to be necessary, it's not too hard to add exact timers back in if a need is found in future. All you need is a second __internal_add_timer function that rounds down rather than up, and to teach expire_timers() to cascade in the unlikely situation that a timer does have an expiry time in the future. (It also gets rid of the special case for level 5.) Other, mostly minor, code comments: > +/* Level offsets in the wheel */ > +#define LVL0_OFFS (0) > +#define LVL1_OFFS (LVL_SIZE) > +#define LVL2_OFFS (LVL1_OFFS + LVL_SIZE) > +#define LVL3_OFFS (LVL2_OFFS + LVL_SIZE) > +#define LVL4_OFFS (LVL3_OFFS + LVL_SIZE) > +#define LVL5_OFFS (LVL4_OFFS + LVL_SIZE) > + > +/* Clock divisor for the next level */ > +#define LVL_CLK_SHIFT 3 > +#define LVL_CLK_DIV (1 << LVL_CLK_SHIFT) > +#define LVL_CLK_MASK (LVL_CLK_DIV - 1) > + > +/* The shift constants for selecting the bucket at the levels */ > +#define LVL1_SHIFT (1 * LVL_CLK_SHIFT) > +#define LVL2_SHIFT (2 * LVL_CLK_SHIFT) > +#define LVL3_SHIFT (3 * LVL_CLK_SHIFT) > +#define LVL4_SHIFT (4 * LVL_CLK_SHIFT) > +#define LVL5_SHIFT (5 * LVL_CLK_SHIFT) > + > +/* The granularity of each level */ > +#define LVL0_GRAN 0x00000001 > +#define LVL1_GRAN (LVL0_GRAN << LVL_CLK_SHIFT) > +#define LVL2_GRAN (LVL1_GRAN << LVL_CLK_SHIFT) > +#define LVL3_GRAN (LVL2_GRAN << LVL_CLK_SHIFT) > +#define LVL4_GRAN (LVL3_GRAN << LVL_CLK_SHIFT) > +#define LVL5_GRAN (LVL4_GRAN << LVL_CLK_SHIFT) Wouldn't this all be so much simpler as #define LVL_BITS 6 /* Renamed previous LVL_SHIFT */ #define LVL_SIZE (1 << LVL_BITS) #define LVL_MASK (LVL_BITS - 1) #define LVL_OFFS(n) ((n) * LVL_SIZE) #define LVL_SHIFT(n) ((n) * LVL_CLK_SHIFT) #define LVL_GRAN(n) (1 << LVL_SHIFT(n)) Then you could do +static inline unsigned calc_index(unsigned expires, unsigned level), +{ + /* Round up to next bin bin */ + expires = ((expires - 1) >> LVL_SHIFT(level)) + 1; + return LVL_OFFS(level) + (expires & LVL_MASK); +} > +#define LVL1_TSTART (LVL_SIZE - 1) Er... isn't that LVL_SIZE, as documented in the table above? Then it could be #define LVL_TSTART(n) (LVL_SIZE << LVL_SHIFT(n)) Ideally, you'd like all of that + if (delta < LVL1_TSTART) { + idx = (expires + LVL0_GRAN) & LVL_MASK; + } else if (delta < LVL2_TSTART) { + idx = calc_index(expires, LVL1_GRAN, LVL1_SHIFT, LVL1_OFFS); + } else if (delta < LVL3_TSTART) { + idx = calc_index(expires, LVL2_GRAN, LVL2_SHIFT, LVL2_OFFS); + } else if (delta < LVL4_TSTART) { + idx = calc_index(expires, LVL3_GRAN, LVL3_SHIFT, LVL3_OFFS); + } else if (delta < LVL5_TSTART) { + idx = calc_index(expires, LVL4_GRAN, LVL4_SHIFT, LVL4_OFFS); to be replaced with __builtin_clz or similar: level = __fls(delta | LVL_MASK); if (level < LVL_BITS + LVL_SHIFT(LVL_DEPTH-1)) { /* or LVL_DEPTH-2, no difference */ level = (level + LVL_CLK_SHIFT - LVL_BITS) / LVL_CLK_SHIFT; } else if ((long)delta < 0) { expires = base->clk; level = 0; } else { level = LVL_DEPTH - 1; } index = calc_index(expires, level); > +static inline void detach_expired_timer(struct timer_list *timer) > { > detach_timer(timer, true); > - if (!(timer->flags & TIMER_DEFERRABLE)) > - base->active_timers--; > - base->all_timers--; > } Is there even a reason to have this wrapper any more? Why not just replace all calls to it in the source? > + timer = hlist_entry(head->first, struct timer_list, entry); > + fn = timer->function; > + data = timer->data; > + > + timer_stats_account_timer(timer); > + > + base->running_timer = timer; > + detach_expired_timer(timer); Is there some non-obvious reason that you have to fetch fn and data so early? It seems like a register pressure pessimization, if the compiler can't figure out that timer_stats code can't change them. The cache line containing this timer was already prefetched when you updated its entry.pprev as part of removing the previous entry from the list. I see why you want to fetch them with the lock held in case there's some freaky race, but I'd do it all after detach_timer().