> Secondly - when we get to the end of the shorter string; we can either
> keep comparing to the last char or \0; or we go ‘modulo’ to the start of
> the string. Now modulo is perhaps not ideal; and seems to affect the
> pipeline on the XEON cpu (something I confess not to quite understand; and
> I cannot see/replicate on ARM).

Repeatedly comparing against the same byte is easily optimized out.

What about looping over the same 16-byte micro-page we ended with, to avoid
fetching much earlier pages?  That is, truncate the address by clearing the
low four pointer bits (mask 0x0f).
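
A minimal sketch of that idea in C, under several assumptions. The names
tail_block_start and padded_eq are hypothetical and not taken from any
existing code. The pointer masking goes through uintptr_t, which assumes a
flat address space where the integer value tracks the address. The block
start is clamped to the start of the string so that nothing outside the
buffer is read. The empty-string shortcut and the loop count both depend on
the lengths, so this is only an illustration of the wrap-around, not an
audited constant-time routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the first byte of the 16-byte aligned block that holds
 * s[len - 1], clamped so that it never lands before s[0]. */
static size_t tail_block_start(const unsigned char *s, size_t len)
{
    assert(len > 0);
    uintptr_t first = (uintptr_t)s;
    uintptr_t last  = (uintptr_t)(s + len - 1);
    uintptr_t base  = last & ~(uintptr_t)0x0f;   /* clear pointer bits 0x0f */

    if (base < first)
        base = first;
    return (size_t)(base - first);
}

/* Compare a and b over max(alen, blen) steps. Once a string runs out,
 * its index wraps back to the start of its own tail block instead of
 * to offset 0, so only the block we ended in is re-read. */
static int padded_eq(const unsigned char *a, size_t alen,
                     const unsigned char *b, size_t blen)
{
    if (alen == 0 || blen == 0)
        return alen == blen;

    size_t n  = alen > blen ? alen : blen;
    size_t a0 = tail_block_start(a, alen);
    size_t b0 = tail_block_start(b, blen);
    size_t ai = 0, bi = 0;
    /* A length mismatch alone already makes the strings unequal. */
    unsigned char diff = (unsigned char)(alen != blen);

    for (size_t i = 0; i < n; i++) {
        diff |= (unsigned char)(a[ai] ^ b[bi]);
        if (++ai == alen)
            ai = a0;
        if (++bi == blen)
            bi = b0;
    }
    return diff == 0;
}
```

The wrap uses a compare-and-reset on the index rather than a modulo, so no
division sits in the loop; whether that helps the pipeline behaviour seen
on the Xeon would have to be measured.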
