On 01/15/2016 05:58 PM, Bernd Schmidt wrote:
> One question Richard posed in the comments: why aren't we optimizing small constant size memcmps other than size 1 to *s == *q? The reason is the return value of memcmp, which implies byte-sized operation (incidentally, the use of SImode in the cmpmem/cmpstr patterns is really odd). It's possible to work around this, but expansion becomes a little more tricky (subtract after bswap, maybe).
When I did this (big-endian conversion, wide subtract, sign) to the tail difference check in glibc's x86_64 memcmp, it was actually a bit faster than isolating the differing byte and returning its difference, even for non-random data such as encountered during qsort.
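For illustration, a minimal sketch of that kind of tail check (not the actual glibc code) might look like the following. It assumes a little-endian target with cheap unaligned loads, such as x86_64, and the GCC/Clang __builtin_bswap64 builtin; the helper name and the fixed 8-byte window are invented for the example:

  #include <stdint.h>
  #include <string.h>

  /* Compare the last 8 bytes of two buffers (assumes len >= 8) that
     are known to differ somewhere in that window, using wide loads,
     a byte swap and a sign computation instead of isolating the
     differing byte.  */
  static int
  tail_diff (const unsigned char *s, const unsigned char *q, size_t len)
  {
    uint64_t a, b;
    memcpy (&a, s + len - 8, 8);   /* unaligned 8-byte loads */
    memcpy (&b, q + len - 8, 8);
    a = __builtin_bswap64 (a);     /* big-endian order makes the numeric */
    b = __builtin_bswap64 (b);     /* comparison match the byte-wise one */
    /* The full 64-bit difference does not fit in int, so return the
       sign of the comparison rather than the difference itself.  */
    return (a > b) - (a < b);
  }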
> * Expand __memcmp_eq for small constant sizes with loads and comparison, fall back to a memcmp call.
Should we export such a function from glibc? I expect the equality-only case is fairly common. Computing the tail difference costs a few cycles.
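As a hedged illustration of why the equality-only case is cheaper, an expansion for a constant size of 8 can boil down to a single wide comparison, with no byte swap and no tail-difference computation (the function name here is invented; this is not the actual GCC expansion):

  #include <stdint.h>
  #include <string.h>

  /* Equivalent to memcmp (s, q, 8) == 0: only equality is needed, so
     no byte ordering or difference has to be recovered.  */
  static int
  equal8 (const void *s, const void *q)
  {
    uint64_t a, b;
    memcpy (&a, s, 8);             /* unaligned 8-byte loads */
    memcpy (&b, q, 8);
    return a == b;
  }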
It may also make sense to call a streamlined implementation if you have interesting alignment information (for x86_64, that would be at least 16 on one or both inputs, so it's perhaps not easy to come by).
Florian