> Secondly - when we get to the end of the shorter string; we can either keep comparing to the last char or \0; or we go ‘modulo’ to the start of the string. Now modulo is perhaps not ideal; and seems to affect the pipeline on the XEON cpu (something I confess not to quite understand; and I cannot see/replicate on ARM).
Comparing against the same byte over and over is easily optimized out. What about looping over the same 16-byte aligned block ("micro-page") that the shorter string ended in, so that wrapping never fetches much earlier pages? The address could be truncated by nulling the low pointer bits (mask 0x0f).
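
To make the idea concrete, here is a rough sketch of the address arithmetic only, not a tuned or audited constant-time routine. The helper name `block_wrap` and its parameters are made up for illustration. It assumes the whole 16-byte aligned block around the last byte is readable; a 16-byte aligned block never straddles a page boundary, but touching bytes outside the string object is not something ISO C guarantees, and the integer-to-pointer round trip is implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given a pointer to the last valid byte of the
 * shorter string and a running index i, return an address inside the
 * 16-byte aligned block that contains that last byte.
 *
 * Assumption: every byte of that aligned block may be read.
 */
static const unsigned char *block_wrap(const unsigned char *last, size_t i)
{
    /* Null pointer bits 0x0f to get the start of the aligned block. */
    uintptr_t base = (uintptr_t)last & ~(uintptr_t)0x0f;

    /* The low four bits of i select a byte in the block; no division. */
    return (const unsigned char *)(base | (uintptr_t)(i & 0x0f));
}
```

Because the wrap is a mask rather than a modulo by the string length, it needs no divide, and every access past the end of the shorter string stays in the block that was touched last. Whether that actually helps on the Xeon would still have to be measured.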