On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <[email protected]> wrote:
>
> On 27/11/2025 09:22, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <[email protected]>
> >
> > Ryan reports that get_random_u16() is dominant in the performance
> > profiling of syscall entry when kstack randomization is enabled [0].
> >
> > This is the reason many architectures rely on a counter instead, and
> > that, in turn, is the reason for the convoluted way the (pseudo-)entropy
> > is gathered and recorded in a per-CPU variable.
> >
> > Let's try to make the get_random_uXX() fast path faster, and switch to
> > get_random_u8() so that we'll hit the slow path 2x less often. Then,
> > wire it up in the syscall entry path, replacing the per-CPU variable,
> > making the logic at syscall exit redundant.
>
> I ran the same set of syscall benchmarks for this series as I've done
> for my series.

Thanks!
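To recap the proposal for anyone joining the thread: on the entry side,
this amounts to roughly the shape below, reusing the existing helpers
from include/linux/randomize_kstack.h. This is a simplified sketch, not
the literal patch:

  /*
   * Sketch: draw the pseudo-entropy directly at syscall entry instead
   * of reading a per-CPU value recorded at the previous syscall exit.
   */
  #define add_random_kstack_offset() do {                                  \
          if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,  \
                                  &randomize_kstack_offset)) {             \
                  u32 offset = get_random_u8(); /* was: raw_cpu_read() */  \
                  u8 *ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));    \
                  /* Keep the allocation alive after "ptr" loses scope. */ \
                  asm volatile("" :: "r"(ptr) : "memory");                 \
          }                                                                \
  } while (0)

With the value drawn at entry, the per-CPU variable and the
choose_random_kstack_offset() logic at syscall exit that used to refill
it become redundant.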
> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm
> showing the performance cost of turning it on without any changes to
> the implementation, then the reduced performance cost of turning it on
> with my changes applied, and finally the cost of turning it on with
> Ard's changes applied:
>
> arm64 (AWS Graviton3):
> +-----------------+--------------+-------------+---------------+-----------------+
> | Benchmark       | Result Class | v6.18-rc5   | per-task-prng | fast-get-random |
> |                 |              | rndstack-on |               |                 |
> +=================+==============+=============+===============+=================+
> | syscall/getpid  | mean (ns)    | (R) 15.62%  | (R) 3.43%     | (R) 11.93%      |
> |                 | p99 (ns)     | (R) 155.01% | (R) 3.20%     | (R) 11.00%      |
> |                 | p99.9 (ns)   | (R) 156.71% | (R) 2.93%     | (R) 11.39%      |
> +-----------------+--------------+-------------+---------------+-----------------+
> | syscall/getppid | mean (ns)    | (R) 14.09%  | (R) 2.12%     | (R) 10.44%      |
> |                 | p99 (ns)     | (R) 152.81% | 1.55%         | (R) 9.94%       |
> |                 | p99.9 (ns)   | (R) 153.67% | 1.77%         | (R) 9.83%       |
> +-----------------+--------------+-------------+---------------+-----------------+
> | syscall/invalid | mean (ns)    | (R) 13.89%  | (R) 3.32%     | (R) 10.39%      |
> |                 | p99 (ns)     | (R) 165.82% | (R) 3.51%     | (R) 10.72%      |
> |                 | p99.9 (ns)   | (R) 168.83% | (R) 3.77%     | (R) 11.03%      |
> +-----------------+--------------+-------------+---------------+-----------------+

What does the (R) mean?

> So this fixes the tail problem. I guess get_random_u8() only takes the
> slow path every 768 calls, whereas get_random_u16() took it every 384
> calls. I'm not sure that fully explains it though.
>
> But it's still a 10% cost on average.
>
> Personally I think 10% syscall cost is too much to pay for 6 bits of
> stack randomisation. 3% is better, but still higher than we would all
> prefer, I'm sure.

Interesting!

So the only thing that get_random_u8() does that could explain the
delta is calling into the scheduler on preempt_enable(), given that it
does very little beyond that.

Would you mind repeating this experiment after changing the
put_cpu_var() to preempt_enable_no_resched(), to test this theory?
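To make the experiment concrete: assuming the fast path has roughly the
shape below (batch_u8, batched_entropy_u8 and refill_batch() are
illustrative names here, not necessarily what the series uses), the
test is to swap the final put_cpu_var() for preempt_enable_no_resched():

  u8 get_random_u8(void)
  {
          /* get_cpu_var() disables preemption while we touch the batch */
          struct batch_u8 *batch = &get_cpu_var(batched_entropy_u8);
          u8 ret;

          if (unlikely(batch->position >= ARRAY_SIZE(batch->entropy)))
                  refill_batch(batch); /* slow path: fresh ChaCha output,
                                        * resets batch->position to 0 */

          ret = batch->entropy[batch->position++];

          /*
           * was: put_cpu_var(batched_entropy_u8), which expands to
           * preempt_enable() and may call into the scheduler if
           * TIF_NEED_RESCHED was set while preemption was disabled.
           */
          preempt_enable_no_resched();
          return ret;
  }

preempt_enable_no_resched() only drops the preempt count without the
need_resched check, so if the delta shrinks accordingly, the reschedule
check is where the remaining cost is going.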
