On 2026/1/14 18:21, David Laight wrote:
> On Wed, 14 Jan 2026 15:04:58 +0800
> Feng Jiang <[email protected]> wrote:
> 
>> On 2026/1/14 14:14, Feng Jiang wrote:
>>> On 2026/1/13 16:46, Andy Shevchenko wrote:  
>>>> On Tue, Jan 13, 2026 at 04:27:42PM +0800, Feng Jiang wrote:  
>>>>> Introduce a benchmark to compare the architecture-optimized strlen()
>>>>> implementation against the generic C version (__generic_strlen).
>>>>>
>>>>> The benchmark uses a table-driven approach to evaluate performance
>>>>> across different string lengths (short, medium, and long). It employs
>>>>> ktime_get() for timing and get_random_bytes() followed by null-byte
>>>>> filtering to generate test data that prevents early termination.
>>>>>
>>>>> This helps in quantifying the performance gains of architecture-specific
>>>>> optimizations on various platforms.  
> ...
>> Preliminary results with this change look much more reasonable:
>>
>>     ok 4 string_test_strlen
>>     # string_test_strlen_bench: strlen performance (short, len: 8, iters: 100000):
>>     # string_test_strlen_bench:   arch-optimized: 4767500 ns
>>     # string_test_strlen_bench:   generic C:      5815800 ns
>>     # string_test_strlen_bench:   speedup:        1.21x
>>     # string_test_strlen_bench: strlen performance (medium, len: 64, iters: 100000):
>>     # string_test_strlen_bench:   arch-optimized: 6573600 ns
>>     # string_test_strlen_bench:   generic C:      16342500 ns
>>     # string_test_strlen_bench:   speedup:        2.48x
>>     # string_test_strlen_bench: strlen performance (long, len: 2048, iters: 10000):
>>     # string_test_strlen_bench:   arch-optimized: 7931000 ns
>>     # string_test_strlen_bench:   generic C:      35347300 ns
> 
> That is far too long.
> In 35ms you are including a lot of timer interrupts.
> You are also just testing the 'hot cache' case.
> The kernel runs 'cold cache' a lot of the time - especially for instructions.
> 
> To time short loops (or even single passes) you need a data dependency
> between the 'start time' and the code being tested (easy enough, just add
> (time & non_compile_time_zero) to a parameter), and between the result of
> the code and the 'end time' - somewhat harder (doable in x86 if you use
> the pmc cycle counter).

Hi David,

I appreciate the feedback! You're right that 35ms is far too long; that run was
measured under TCG emulation, and on real hardware (ARM64 KVM) the same test is
about an order of magnitude faster. I'll tighten the iteration counts in v3 so
each measurement stays short enough to limit timer-interrupt noise.

As for the more advanced suggestions, such as cold-cache measurement and data
dependencies, I can see how they would make the benchmark much more rigorous. My
plan is to follow the pattern of crc_benchmark() to refine the logic, since that
approach is simple, easy to maintain, and provides a good enough baseline for
our needs.

While I understand that simulating a cold cache would be more precise, I'm
concerned it might introduce significant complexity at this stage. I hope the
current focus on hot-path throughput is a reasonable starting point for a
general KUnit test.
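
For the archive, my reading of the start-side dependency you describe is
roughly the sketch below (again a userspace analogue with clock_gettime()
standing in for the cycle counter; "runtime_zero" is a hypothetical name, and
this only covers the dependency from the start timestamp into the measured
code, not the harder result-to-end-time side you mention):

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Zero at run time, but the compiler cannot prove it, so any value
 * masked with it still creates a data dependency on the timestamp. */
static volatile uint64_t runtime_zero;	/* never written: stays 0 */

/* Time a single strlen() pass. Because the buffer pointer depends on
 * 'start', the measured code cannot be reordered above the first
 * timestamp read. */
static uint64_t timed_strlen(const char *buf, size_t *out_len)
{
	struct timespec t0, t1;
	uint64_t start, end;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	start = (uint64_t)t0.tv_sec * 1000000000ull + t0.tv_nsec;

	/* (start & runtime_zero) is 0, but only at run time */
	*out_len = strlen(buf + (start & runtime_zero));

	clock_gettime(CLOCK_MONOTONIC, &t1);
	end = (uint64_t)t1.tv_sec * 1000000000ull + t1.tv_nsec;
	return end - start;
}
```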

-- 
With Best Regards,
Feng Jiang
