On 27/11/2025 12:28, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <[email protected]>
>>>
>>> Ryan reports that get_random_u16() is dominant in the performance
>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>
>>> This is the reason many architectures rely on a counter instead, and
>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>> is gathered and recorded in a per-CPU variable.
>>>
>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>> making the logic at syscall exit redundant.
>>
>> I ran the same set of syscall benchmarks for this series as I've done
>> for my series.
>>
>
> Thanks!
>
>
>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm
>> showing the performance cost of turning it on without any changes to the
>> implementation, then the reduced performance cost of turning it on with
>> my changes applied, and finally the cost of turning it on with Ard's
>> changes applied:
>>
>> arm64 (AWS Graviton3):
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | Benchmark       | Result Class | v6.18-rc5   | per-task-prng | fast-get-random |
>> |                 |              | rndstack-on |               |                 |
>> +=================+==============+=============+===============+=================+
>> | syscall/getpid  | mean (ns)    | (R) 15.62%  | (R) 3.43%     | (R) 11.93%      |
>> |                 | p99 (ns)     | (R) 155.01% | (R) 3.20%     | (R) 11.00%      |
>> |                 | p99.9 (ns)   | (R) 156.71% | (R) 2.93%     | (R) 11.39%      |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/getppid | mean (ns)    | (R) 14.09%  | (R) 2.12%     | (R) 10.44%      |
>> |                 | p99 (ns)     | (R) 152.81% | 1.55%         | (R) 9.94%       |
>> |                 | p99.9 (ns)   | (R) 153.67% | 1.77%         | (R) 9.83%       |
>> +-----------------+--------------+-------------+---------------+-----------------+
>> | syscall/invalid | mean (ns)    | (R) 13.89%  | (R) 3.32%     | (R) 10.39%      |
>> |                 | p99 (ns)     | (R) 165.82% | (R) 3.51%     | (R) 10.72%      |
>> |                 | p99.9 (ns)   | (R) 168.83% | (R) 3.77%     | (R) 11.03%      |
>> +-----------------+--------------+-------------+---------------+-----------------+
>>
>
> What does the (R) mean?
(R) is a statistically significant regression
(I) is a statistically significant improvement

where "statistically significant" means that the 95% confidence intervals of
the baseline and the comparison do not overlap.

>
>> So this fixes the tail problem. I guess get_random_u8() only takes the
>> slow path every 768 calls, whereas get_random_u16() took it every 384
>> calls. I'm not sure that fully explains it though.
>>
>> But it's still a 10% cost on average.
>>
>> Personally I think 10% syscall cost is too much to pay for 6 bits of
>> stack randomisation. 3% is better, but still higher than we would all
>> prefer, I'm sure.
>>
>
> Interesting!
>
> So the only thing that get_random_u8() does that could explain the
> delta is calling into the scheduler on preempt_enable(), given that it
> does very little beyond that.
>
> Would you mind repeating this experiment after changing the
> put_cpu_var() to preempt_enable_no_resched(), to test this theory?

Yep, it's queued.
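For the archives, what I've queued boils down to something like the below at
the end of the get_random_u8() fast path. The surrounding context and the
batch variable name are my guess at how your series structures it, so treat
this as a sketch rather than the actual diff:

	 	ret = batch->entropy[batch->position++];
	-	put_cpu_var(batched_entropy_u8);	/* i.e. preempt_enable() */
	+	preempt_enable_no_resched();	/* re-enable preemption, but skip the resched check */
	 	return ret;

If the remaining delta really is the resched check in preempt_enable(), this
should show up clearly in the mean numbers.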
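And to expand on the (R)/(I) classification above: it's just an
interval-overlap test on the two confidence intervals, roughly the below.
This is illustrative only, not the actual harness code (which lives well
outside the kernel):

	enum result_class { SAME, REGRESSION /* (R) */, IMPROVEMENT /* (I) */ };

	/*
	 * lo/hi are the bounds of the 95% confidence interval for each kernel;
	 * the results are latencies, so higher means worse.
	 */
	static enum result_class classify(double base_lo, double base_hi,
					  double test_lo, double test_hi)
	{
		if (test_lo > base_hi)
			return REGRESSION;	/* entire CI above the baseline's */
		if (test_hi < base_lo)
			return IMPROVEMENT;	/* entire CI below the baseline's */
		return SAME;			/* CIs overlap: not significant */
	}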
