So, finally looping back to the httpd vs. apr implementations of these functions...
apr was testing for equality and then testing both *str1 and *str2 against the NUL character - the second test is redundant, since once the two characters compare equal, a NUL in *str1 implies a NUL in *str2. It also performed unnecessary final increments of str1 and str2, which I've corrected on apr-2 trunk. And I'm using a short [] rather than an unsigned char [] for the lookup table, which solves some sign conversion, presuming most architectures are able to grab an aligned short and promote it to int more efficiently than an unaligned byte (testing backed this up on Intel, and I measured no net benefit from widening the table to int).

With these changes, apr slightly outperforms httpd's implementation in most instances. The one small penalty is looking ahead to a NUL that turns out to be there, but we should only be scanning one of the two strings for a NUL, and the difference between XOR ax, ax vs. loading both values into ax, cx and subtracting to obtain a zero value is very nominal, once per string comparison. The penalty applies only in the case that str1 equals str2 for the full NUL-terminated length, and it is a net win in all other cases.

I've just synched to apr 2.0 trunk, and will now mass-rename so that the backported functions all correspond to their trunk counterparts, using an ap_cstr_-decorated name that maps our namespace onto apr's function family group (the "C"/POSIX string functions). I think we can use that straight back to httpd 2.4 and then worry about migrating trunk/2.6 consumers to the apr 1.6/2.0 names with some helpful #define's.

On Mon, Nov 23, 2015 at 11:10 PM, Mikhail T. <mi+t...@aldan.algebra.com> wrote:

> On 23.11.2015 23:14, William A Rowe Jr wrote:
>
> > L1 cache and other direct effects of cpu internal optimization.
>
> Just what I was thinking. Attached is the same program with one more pair
> of functions added (and an easy way to add more "candidates" to the
> main-driver). I changed the FOR-loop define to obtain repeatable results:
>
> # Test 1 -- equal strings:
> foreach m ( 0 1 2 )
> foreach? ./strncasecmp $m 100000000 aaaaaaaaa AAAAAAAAA 7
> foreach? end
> string.h (nb=100000000, len=7)
>     time = 6.975845 : res = 0
> optimized (nb=100000000, len=7)
>     time = 1.492197 : res = 0
> 'A' - 'a' (nb=100000000, len=7)
>     time = 1.787807 : res = 0
>
> # Test 2 -- immediately-different strings
> foreach m ( 0 1 2 )
> foreach? ./strncasecmp $m 100000000 aaaaaaaaa xAAAAAAAA 7
> foreach? end
> string.h (nb=100000000, len=7)
>     time = 2.527727 : res = -23
> optimized (nb=100000000, len=7)
>     time = 0.406867 : res = -23
> 'A' - 'a' (nb=100000000, len=7)
>     time = 0.440320 : res = -23
>
> # Test 3 -- strings different at the very end
> foreach m ( 0 1 2 )
> foreach? ./strncasecmp $m 100000000 aaaaaaaaa AAAAAAAAx 0
> foreach? end
> string.h (nb=100000000, len=0)
>     time = 9.629660 : res = -23
> optimized (nb=100000000, len=0)
>     time = 1.387208 : res = -23
> 'A' - 'a' (nb=100000000, len=0)
>     time = 1.754683 : res = -23
>
> The new pair (method 2) does not use the static table, which is likely to
> benefit from CPU-cache unfairly in repetitive benchmarks. It is slower
> than the table-using method 1 functions. But the two pairs might be
> comparable -- or even faster -- in real life.
>
> -mi