On Tue, 20 Jan 2026 08:37:43 -0800
Dave Hansen <[email protected]> wrote:
> On 1/20/26 08:32, Ryan Roberts wrote:
> > I don't think this question was really addressed to me, but I'll give my
> > opinion anyway; I agree it's pretty binary - it will either work or it
> > will explode. I've tested on arm64 and x86_64 so I have high confidence
> > that it works. If you get it into -next ASAP it has 3 weeks to soak
> > before the merge window opens right? (Linus said he would do an -rc8
> > this cycle). That feels like enough time to me. But it's your tree 😉
>
> First of all, thank you for testing it on x86! Having that one data
> point where it helped performance is super valuable.
>
> I'm more worried that it's going to regress performance somewhere and
> then it's going to be a pain to back out. I'm not super worried about
> functional regressions.

A performance regression is unlikely. On x86 'rdtsc' is ~20 clocks on
Intel CPUs and even slower on AMD (according to Agner Fog's tables).
(That figure is for one rdtsc serialised against another rdtsc, rather
than against other instructions.)
Whereas the four TAUSWORTHE() steps are independent, so they can execute
in parallel.
IIRC each is a memory read and 5 ALU instructions - not much at all.
The slow bit will be the cache miss on the per-cpu data.
You lose a clock at the end because gcc will compile a | b | c | d
as (((a | b) | c) | d), not as ((a | b) | (c | d)).
I think someone reported the 'new' version being faster on x86;
the parallelism above might be why.
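
To make the dependency-chain point concrete, here is a minimal userspace
sketch, not the kernel code itself. The shift/mask constants and the seed
minimums are written from memory of lib/random32.c and should be treated as
assumptions; 'struct taus_state' and the taus_*() names are hypothetical.
The sketch combines the four words with XOR (which is what I believe
random32.c does); the grouping argument is the same for '|'. Whether a
given compiler reassociates the chained form is not something this
demonstrates.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a single Tausworthe component: a mask, three shifts and
 * two XORs on one 32-bit word. Constants below are assumed, not verified. */
#define TAUSWORTHE(s, a, b, c, d) \
	((((s) & (c)) << (d)) ^ ((((s) << (a)) ^ (s)) >> (b)))

/* Hypothetical stand-in for the per-cpu generator state. */
struct taus_state {
	uint32_t s1, s2, s3, s4;
};

/* Seeds below these minimums would leave a component stuck at zero
 * (assumed limits: s1 > 1, s2 > 7, s3 > 15, s4 > 127). */
static void taus_check_seed(const struct taus_state *st)
{
	assert(st->s1 > 1U && st->s2 > 7U && st->s3 > 15U && st->s4 > 127U);
}

/* The four updates read and write disjoint words, so none of them
 * depends on another and a superscalar CPU can overlap them. */
static void taus_step(struct taus_state *st)
{
	st->s1 = TAUSWORTHE(st->s1,  6U, 13U, 4294967294U, 18U);
	st->s2 = TAUSWORTHE(st->s2,  2U, 27U, 4294967288U,  2U);
	st->s3 = TAUSWORTHE(st->s3, 13U, 21U, 4294967280U,  7U);
	st->s4 = TAUSWORTHE(st->s4,  3U, 12U, 4294967168U, 13U);
}

/* Left-to-right grouping: three dependent combine operations. */
static uint32_t taus_next_chain(struct taus_state *st)
{
	taus_step(st);
	return ((st->s1 ^ st->s2) ^ st->s3) ^ st->s4;
}

/* Balanced grouping: two independent combines, then one more,
 * so the critical path is two operations instead of three. */
static uint32_t taus_next_tree(struct taus_state *st)
{
	taus_step(st);
	return (st->s1 ^ st->s2) ^ (st->s3 ^ st->s4);
}
```

Both functions return the same value for the same state; only the length
of the final dependency chain differs (three combines versus two).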

	David