On 2026/1/14 15:21, Andy Shevchenko wrote:
> On Wed, Jan 14, 2026 at 03:04:58PM +0800, Feng Jiang wrote:
>> On 2026/1/14 14:14, Feng Jiang wrote:
>>> On 2026/1/13 16:46, Andy Shevchenko wrote:
>
> ...
>
>>> Thank you for the catch. You are absolutely correct: the 2500x figure is
>>> heavily distorted and does not reflect real-world performance.
>>>
>>> I've found that by using a volatile function pointer to call the implementations
>>> (instead of direct calls), the results returned to a realistic range. It appears
>>> the previous benchmark logic allowed the compiler to over-optimize the test loop
>>> in ways that skewed the data.
>>>
>>> I will refactor the benchmark logic in v3, specifically referencing the crc32
>>> KUnit implementation (e.g., using warm-up loops and adding preempt_disable()
>>> to eliminate context-switch interference) to ensure the data is robust and
>>> accurate.
>>>
>>
>> Just a quick follow-up: I've also verified that using a volatile variable to store
>> the return value (as seen in crc_benchmark()) is equally effective at preventing
>> the optimization.
>>
>> The core change is as follows:
>>
>> volatile size_t len;
>> ...
>> for (unsigned int j = 0; j < iters; j++) {
>>         OPTIMIZER_HIDE_VAR(buf);
>>         len = strlen(buf);
>
> But please, check for sure this is Linux kernel generic implementation (before)
> and not __builtin_strlen() from GCC. (OTOH, it would be nice to benchmark that
> one as well, although I think that __builtin_strlen() in general maybe slightly
> better choice than Linux kernel generic implementation.) I.o.w. be sure *what*
> you test.
>
Thanks for the reminder. I actually verified this with objdump and gdb before
submitting the patch: the calls are indeed hitting the intended arch-specific
strlen symbols, not the compiler's __builtin_strlen(). I missed mentioning this
detail in my previous email.
I also just performed an additional test by explicitly calling the exported
arch-specific __pi_strlen() symbol, and the results remained consistent.
Results with riscv __pi_strlen():
ok 4 string_test_strlen
# string_test_strlen_bench: strlen performance (short, len: 8, iters: 100000):
# string_test_strlen_bench: arch-optimized: 4650500 ns
# string_test_strlen_bench: generic C: 5776000 ns
# string_test_strlen_bench: speedup: 1.24x
# string_test_strlen_bench: strlen performance (medium, len: 64, iters: 100000):
# string_test_strlen_bench: arch-optimized: 6895000 ns
# string_test_strlen_bench: generic C: 16343400 ns
# string_test_strlen_bench: speedup: 2.37x
# string_test_strlen_bench: strlen performance (long, len: 2048, iters: 10000):
# string_test_strlen_bench: arch-optimized: 8052800 ns
# string_test_strlen_bench: generic C: 35290700 ns
# string_test_strlen_bench: speedup: 4.38x
ok 5 string_test_strlen_bench
>> }
>
> Or using WRITE_ONCE() :-) But that one will probably be confusing as it usually
> should be paired with READ_ONCE() somewhere else in the code. So, I agree on
> crc_benchmark() approach taken.
>
Thanks for the guidance. I'll stick with the crc_benchmark() pattern to avoid any
potential confusion regarding concurrency that WRITE_ONCE() might imply.
I'm still learning the most idiomatic practices in the kernel, so I appreciate
the tip.
>> Preliminary results with this change look much more reasonable:
>>
>> ok 4 string_test_strlen
>> # string_test_strlen_bench: strlen performance (short, len: 8, iters: 100000):
>> # string_test_strlen_bench: arch-optimized: 4767500 ns
>> # string_test_strlen_bench: generic C: 5815800 ns
>> # string_test_strlen_bench: speedup: 1.21x
>> # string_test_strlen_bench: strlen performance (medium, len: 64, iters: 100000):
>> # string_test_strlen_bench: arch-optimized: 6573600 ns
>> # string_test_strlen_bench: generic C: 16342500 ns
>> # string_test_strlen_bench: speedup: 2.48x
>> # string_test_strlen_bench: strlen performance (long, len: 2048, iters: 10000):
>> # string_test_strlen_bench: arch-optimized: 7931000 ns
>> # string_test_strlen_bench: generic C: 35347300 ns
>> # string_test_strlen_bench: speedup: 4.45x
>> ok 5 string_test_strlen_bench
>>
>> I will adopt this pattern in v3, along with cache warm-up and preempt_disable(),
>> to stay consistent with existing kernel benchmarks and ensure robust
>> measurements.
>
--
With Best Regards,
Feng Jiang