On 28/11/2025 10:07, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <[email protected]> wrote:
>>
>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
>>> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <[email protected]> wrote:
>>>>
>>>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
>>>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <[email protected]> wrote:
>>>>>>
>>>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>>>>>> From: Ard Biesheuvel <[email protected]>
>>>>>>>
>>>>>>> Ryan reports that get_random_u16() is dominant in the performance
>>>>>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>>>>>
>>>>>>> This is the reason many architectures rely on a counter instead, and
>>>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>>>>>> is gathered and recorded in a per-CPU variable.
>>>>>>>
>>>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>>>>>> making the logic at syscall exit redundant.
>>>>>>
>>>>>> I ran the same set of syscall benchmarks for this series as I've done
>>>>>> for my series.
>>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm
>>>>>> showing the performance cost of turning it on without any changes to the
>>>>>> implementation, then the reduced performance cost of turning it on with
>>>>>> my changes applied, and finally the cost of turning it on with Ard's
>>>>>> changes applied:
>>>>>>
>>>>>> arm64 (AWS Graviton3):
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | Benchmark       | Result Class | v6.18-rc5   | per-task-prng | fast-get-random |
>>>>>> |                 |              | rndstack-on |               |                 |
>>>>>> +=================+==============+=============+===============+=================+
>>>>>> | syscall/getpid  | mean (ns)    | (R) 15.62%  | (R) 3.43%     | (R) 11.93%      |
>>>>>> |                 | p99 (ns)     | (R) 155.01% | (R) 3.20%     | (R) 11.00%      |
>>>>>> |                 | p99.9 (ns)   | (R) 156.71% | (R) 2.93%     | (R) 11.39%      |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/getppid | mean (ns)    | (R) 14.09%  | (R) 2.12%     | (R) 10.44%      |
>>>>>> |                 | p99 (ns)     | (R) 152.81% | 1.55%         | (R) 9.94%       |
>>>>>> |                 | p99.9 (ns)   | (R) 153.67% | 1.77%         | (R) 9.83%       |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/invalid | mean (ns)    | (R) 13.89%  | (R) 3.32%     | (R) 10.39%      |
>>>>>> |                 | p99 (ns)     | (R) 165.82% | (R) 3.51%     | (R) 10.72%      |
>>>>>> |                 | p99.9 (ns)   | (R) 168.83% | (R) 3.77%     | (R) 11.03%      |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>>
>>>>>
>>>>> What does the (R) mean?
>>>>>
>>>>>> So this fixes the tail problem. I guess get_random_u8() only takes the
>>>>>> slow path every 768 calls, whereas get_random_u16() took it every 384
>>>>>> calls. I'm not sure that fully explains it though.
>>>>>>
>>>>>> But it's still a 10% cost on average.
>>>>>>
>>>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of
>>>>>> stack randomisation. 3% is better, but still higher than we would all
>>>>>> prefer, I'm sure.
>>>>>>
>>>>>
>>>>> Interesting!
>>>>>
>>>>> So the only thing that get_random_u8() does that could explain the
>>>>> delta is calling into the scheduler on preempt_enable(), given that it
>>>>> does very little beyond that.
>>>>>
>>>>> Would you mind repeating this experiment after changing the
>>>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
>>>>
>>>> This has no impact on performance.
>>>>
>>>
>>> Thanks. But this is really rather surprising: what else could be
>>> taking up that time, given that on the fast path, there are only some
>>> loads and stores to the buffer, and a cmpxchg64_local(). Could it be
>>> the latter that is causing so much latency? I suppose the local
>>> cmpxchg() semantics don't really exist on arm64, and this uses the
>>> exact same LSE instruction that would be used for an ordinary
>>> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
>>>
>>> In any case, there is no debate that your code is faster on arm64.
>>
>> The results I have for x86 show it's faster than the rdtsc too, although
>> that's also somewhat surprising. I'll run your series on x86 to get the
>> equivalent data.
>>
>
> OK, brown paper bag time ...
>
> I swapped the order of the 'old' and 'new' cmpxchg64_local()
> arguments, resulting in some very odd behavior. I think this explains
> why the tail latency was eliminated entirely, which is bizarre.
>
> The speedup is also more modest now (~2x), which may still be
> worthwhile, but likely insufficient for the kstack randomization case.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2
Ahh, so we were never taking the slow path. That would definitely explain it.

I had a go at running this on x86 but couldn't even get the kernel to boot on
my AWS Sapphire Rapids instance. Unfortunately I don't have access to the
serial console, so I can't tell why it failed. But I used the exact same
procedure and baseline as for the other runs, so the only difference is your
change. I wonder if this issue somehow breaks the boot on that platform?

Anyway, I'll chuck this updated version at the benchmarks, but it probably
won't be until next week...
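Just to check I've understood the fast path correctly, I think it boils down
to something like the sketch below. To be clear, the struct layout, names,
batch size and the elided refill are all my own stand-ins, not taken from your
branch; the only point is the cmpxchg64_local() argument order:

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
#include <linux/types.h>

/* Illustrative only: layout, names and batch size are made up. */
struct u8_batch {
	u8	bytes[192];
	u64	pos;		/* next unconsumed index */
};

static u8 batch_take_u8(struct u8_batch *b)
{
	u64 pos;

	do {
		pos = READ_ONCE(b->pos);
		if (pos >= ARRAY_SIZE(b->bytes)) {
			/* slow path: refill from the CRNG (elided here) */
			return 0;
		}
		/*
		 * cmpxchg64_local(ptr, old, new): store 'new' only if *ptr
		 * still equals 'old', returning the value observed. In this
		 * sketch, swapping the 'old' and 'new' arguments makes the
		 * compare fail, the loop exit anyway, and pos never advance,
		 * so the refill branch above is never reached -- which would
		 * line up with the missing tail latency.
		 */
	} while (cmpxchg64_local(&b->pos, pos, pos + 1) != pos);

	return b->bytes[pos];
}

If that's roughly the shape of it, then on arm64 the retry loop plus the full
CAS really is the only interesting work on the fast path.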

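PS: for the record, my reading of the earlier experiment ("change the
put_cpu_var() to preempt_enable_no_resched()") amounts to roughly the shape
below. Again, all names here are hypothetical stand-ins rather than your
actual code; it's only meant to show where the preempt_enable() happens:

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/types.h>

/* Hypothetical per-CPU batch, only to show where the change sits. */
struct pcpu_u8_batch {
	u8	bytes[192];
	u64	pos;
};
static DEFINE_PER_CPU(struct pcpu_u8_batch, demo_batch);

static u8 demo_get_random_u8(void)
{
	/* get_cpu_var() disables preemption before handing back the per-CPU copy */
	struct pcpu_u8_batch *b = &get_cpu_var(demo_batch);
	u8 ret = b->bytes[0];	/* real batch indexing and refill elided */

	/*
	 * put_cpu_var(demo_batch) would end in preempt_enable(), which may
	 * call into the scheduler; the experiment replaced it with
	 * preempt_enable_no_resched(), which skips the resched check. As
	 * noted above, that made no measurable difference.
	 */
	preempt_enable_no_resched();
	return ret;
}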