On Fri, 27 Mar 2026 17:29:21 -0700 Linus Torvalds <[email protected]> wrote:
> On Fri, 27 Mar 2026 at 15:49, David Laight <[email protected]> wrote:
> >
> > I've not measured strnlen(), but it wouldn't surprise me if argv[]
> > processing wouldn't be faster with something like the strlen() in this
> > patch.
> > After all arguments are usually relatively short.
>
> No, we used to do those a byte at a time. It was not great. execve()
> strings are often actually long because of filenames and environment
> variables.
>
> The trivial cases don't even matter, because all the cost of execve()
> are elsewhere for those cases.
>
> But the cases where the strings *do* matter, they are many and long.

Is that the strncpy_from_user() path?

> Again - where did you actually see costs in real profiles? We really
> don't tend to have strings in the kernel. The strings we do have are
> from user space, and by the time they are in the kernel they typically
> aren't C strings any more (or, with pathnames, have other termination
> than just the NUL character).

I started looking at this because someone was trying to write the
'bit-masking' version for (possibly) RISC-V, and I decided that they
weren't making a good job of it and that it probably wasn't worthwhile
(since x86-64 just uses the byte code).
So I did some measurements - mostly to prove it was a waste of time.

The slight unroll of the C versions was an 'I wonder what effect it
would have' change; I was surprised it doubled throughput.
For such a small change it seemed worthwhile.

I've just run the tests on an old Haswell box (I don't have any recent
Intel cpu).
The 'byte' loop (and the once-unrolled one) executes at 1 byte/clock.
The 'masking longs' loop runs at 4 bytes/clock, but only matches the
byte loop for 16 bytes (mostly because the byte loop is slower there).

The 'elephant in the room' is the glibc code. That must be using AVX
and manages to beat 32 bytes/clock on my Zen 5 (slightly under
32 bytes/clock on the Haswell).
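For reference, here is a minimal user-space sketch of what I mean by the
'masking longs' loop - my own illustration, not the patch's code. It reads
the string one unsigned long at a time and uses the classic has-zero-byte
test ((x - 0x01..01) & ~x & 0x80..80), which has no false positives. It
byte-steps to an aligned boundary first, so it never reads past the word
containing the NUL:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static size_t strlen_words(const char *s)
{
	const unsigned long ones  = (unsigned long)-1 / 0xff; /* 0x0101..01 */
	const unsigned long highs = ones << 7;                /* 0x8080..80 */
	const char *p = s;
	unsigned long v;

	/* Byte loop until p is word-aligned. */
	while ((uintptr_t)p % sizeof(v)) {
		if (!*p)
			return p - s;
		p++;
	}

	/* Word loop: stop at the first word containing a zero byte. */
	for (;;) {
		memcpy(&v, p, sizeof(v));     /* aligned word load */
		if ((v - ones) & ~v & highs)  /* any byte == 0 ? */
			break;
		p += sizeof(v);
	}

	/* Locate the NUL within that final word. */
	while (*p)
		p++;
	return p - s;
}
```

The tail could instead count trailing/leading zero bits of the mask, but
the byte scan keeps the sketch endian-neutral.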
That makes me think about the exec path: the fp registers need to be
trashed there, which should make them usable in the kernel (even
without disabling preemption).
You might need them reloaded after being preempted - I'm guessing that
is done in the 'return to user' path these days, rather than trapping
on the first access?

From what I remember kernel_fpu_end() got changed so that it doesn't
restore the registers - so it pretty much just enables preemption.
Which means that subsequent kernel_fpu_begin() calls are also cheap.
But code can't request 'kernel_fpu_begin(if_cheap)' and use the
fallback slow path if the registers would have to be saved.
That could speed up some code for 'medium length' buffers.

	David

>
>             Linus

