On Wed, 14 Jan 2026 15:04:58 +0800
Feng Jiang <[email protected]> wrote:
> On 2026/1/14 14:14, Feng Jiang wrote:
> > On 2026/1/13 16:46, Andy Shevchenko wrote:
> >> On Tue, Jan 13, 2026 at 04:27:42PM +0800, Feng Jiang wrote:
> >>> Introduce a benchmark to compare the architecture-optimized strlen()
> >>> implementation against the generic C version (__generic_strlen).
> >>>
> >>> The benchmark uses a table-driven approach to evaluate performance
> >>> across different string lengths (short, medium, and long). It employs
> >>> ktime_get() for timing and get_random_bytes() followed by null-byte
> >>> filtering to generate test data that prevents early termination.
> >>>
> >>> This helps in quantifying the performance gains of architecture-specific
> >>> optimizations on various platforms.
...
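For reference, the data-generation step described above might look roughly
like the following; the names are illustrative and not taken from the
actual patch:

#include <linux/random.h>
#include <linux/string.h>

/*
 * Sketch only: fill the buffer with random bytes, then replace any
 * embedded NUL so strlen() cannot terminate before the intended
 * length.  'buf' must have room for len + 1 bytes.
 */
static void fill_test_string(char *buf, size_t len)
{
	size_t i;

	get_random_bytes(buf, len);
	for (i = 0; i < len; i++)
		if (buf[i] == '\0')
			buf[i] = 'a';
	buf[len] = '\0';	/* terminator at the intended length */
}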
> Preliminary results with this change look much more reasonable:
>
> ok 4 string_test_strlen
> # string_test_strlen_bench: strlen performance (short, len: 8, iters: 100000):
> # string_test_strlen_bench: arch-optimized: 4767500 ns
> # string_test_strlen_bench: generic C: 5815800 ns
> # string_test_strlen_bench: speedup: 1.21x
> # string_test_strlen_bench: strlen performance (medium, len: 64, iters: 100000):
> # string_test_strlen_bench: arch-optimized: 6573600 ns
> # string_test_strlen_bench: generic C: 16342500 ns
> # string_test_strlen_bench: speedup: 2.48x
> # string_test_strlen_bench: strlen performance (long, len: 2048, iters: 10000):
> # string_test_strlen_bench: arch-optimized: 7931000 ns
> # string_test_strlen_bench: generic C: 35347300 ns
That is far too long.
In 35ms you are including a lot of timer interrupts.
You are also just testing the 'hot cache' case.
The kernel runs 'cold cache' a lot of the time - especially for instructions.
To time short loops (or even single passes) you need a data dependency
between the 'start time' and the code being tested (easy enough: just add
(time & non_compile_time_zero) to a parameter), and another between the
result of the code and the 'end time', which is somewhat harder (doable
on x86 if you use the PMC cycle counter).
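A rough sketch of that data-dependency idea (illustrative only, not from
the patch; the end-time dependency is still the weak point):

#include <linux/compiler.h>
#include <linux/ktime.h>
#include <linux/string.h>

/*
 * 'force_zero' is a value the compiler cannot prove to be zero at
 * compile time, e.g. a volatile initialised to 0 at runtime.
 */
static volatile unsigned long force_zero;

static u64 time_one_strlen(const char *buf)
{
	u64 t0, t1;
	size_t len;

	t0 = ktime_get_ns();
	/* start-time -> code: the address now depends on t0 */
	len = strlen(buf + (t0 & force_zero));
	/*
	 * code -> end-time: barrier_data() only stops the compiler from
	 * reordering; a true data dependency on the end timestamp (e.g.
	 * feeding the result into an rdpmc read on x86) is the harder
	 * part mentioned above.
	 */
	barrier_data(&len);
	t1 = ktime_get_ns();

	return t1 - t0;
}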
David
> # string_test_strlen_bench: speedup: 4.45x
> ok 5 string_test_strlen_bench
>
> I will adopt this pattern in v3, along with cache warm-up and
> preempt_disable(), to stay consistent with existing kernel benchmarks
> and ensure robust measurements.
>
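For completeness, that warm-up plus preempt_disable() pattern might look
roughly like this (illustrative names, not the actual v3 code):

#include <linux/compiler.h>
#include <linux/ktime.h>
#include <linux/preempt.h>
#include <linux/string.h>

static u64 bench_strlen_ns(const char *buf, unsigned int iters)
{
	unsigned int i;
	u64 t0, t1;

	/* Warm up caches and branch predictors before timing. */
	for (i = 0; i < 16; i++) {
		size_t len = strlen(buf);

		barrier_data(&len);
	}

	preempt_disable();
	t0 = ktime_get_ns();
	for (i = 0; i < iters; i++) {
		size_t len = strlen(buf);

		/* Keep the result live so the call is not optimised away. */
		barrier_data(&len);
	}
	t1 = ktime_get_ns();
	preempt_enable();

	return t1 - t0;
}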