So, finally looping back to the httpd vs. apr implementations of these
functions...

apr was testing for equality and then for both *str1 null and *str2 null -
the second test is redundant, since once the characters compare equal,
*str2 is null whenever *str1 is. It also performed final unnecessary
increments of str1 and str2, which I've corrected on apr-2 trunk.  And
I'm using a short [] rather than unsigned char [] table, which avoids
some sign-conversion issues, presuming most architectures can more
efficiently fetch an aligned short and promote it to int than an
unaligned byte (testing backed this up on Intel, and I measured no net
benefit from aligning to int instead).
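For illustration, the table-driven loop described above might look like
the following minimal sketch (my own rendering, not the actual apr code;
the function and table names here are made up). Note that only *str1 is
tested against null: if the folded characters compare equal and *str1 is
null, *str2 must be null as well.

```c
#include <assert.h>

/* 256-entry case-folding table; a short [] rather than unsigned char []
 * avoids sign-conversion issues and lets the CPU fetch an aligned short
 * and promote it to int.  (Illustrative names, not apr's.) */
static short casemap[256];

static void init_casemap(void)
{
    for (int i = 0; i < 256; i++)
        casemap[i] = (i >= 'A' && i <= 'Z') ? i + ('a' - 'A') : i;
}

int my_casecmp(const char *str1, const char *str2)
{
    const unsigned char *u1 = (const unsigned char *)str1;
    const unsigned char *u2 = (const unsigned char *)str2;
    for (;;) {
        const int c1 = casemap[*u1++];
        const int c2 = casemap[*u2++];
        if (c1 != c2)
            return c1 - c2;     /* first difference decides the result */
        if (c1 == 0)
            return 0;           /* c1 == c2 == 0: both strings ended */
    }
}
```

The early-exit on c1 == 0 is the "looking ahead to a null" cost discussed
below; it is paid only once per comparison, on the equal-strings path.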

With these changes, apr slightly outperforms httpd's implementation
in most cases. The one small penalty is looking ahead to a null that
turns out to be there, but we only need to scan one of the two strings
for a null, and the difference between an XOR ax, ax versus loading
both values into ax and cx and subtracting to obtain the zero result
is minimal, paid once per string comparison.  The penalty applies only
when str1 equals str2 for the full null-terminated length, and it is
a net win in all other cases.

Just synch'ed to apr 2.0 trunk, and will now mass-rename so that the
backported functions all correspond to their trunk counterparts, using
an ap_cstr_-prefixed name that matches both our namespace and apr's
function family group (the "C"/POSIX string functions).

I think we can use that straight back to httpd 2.4 and then worry
about migrating trunk/2.6 consumers to the apr 1.6/2.0 names with
some helpful #define's.
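Those compatibility #define's might be sketched roughly like this (names
and version guard assumed from the httpd/apr trunk conventions; verify
against the actual headers before relying on them):

```c
/* Hypothetical shim: once built against apr 1.6+/2.0, forward the
 * backported httpd names to the apr function family.  Existing
 * callers of the ap_cstr_ names compile unchanged. */
#include "apr_version.h"

#if APR_VERSION_AT_LEAST(1,6,0)
#include "apr_cstr.h"
#define ap_cstr_casecmp(s1, s2)      apr_cstr_casecmp((s1), (s2))
#define ap_cstr_casecmpn(s1, s2, n)  apr_cstr_casecmpn((s1), (s2), (n))
#endif
```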

On Mon, Nov 23, 2015 at 11:10 PM, Mikhail T. <mi+t...@aldan.algebra.com>
wrote:

> On 23.11.2015 23:14, William A Rowe Jr wrote:
>
> L1 cache and other direct effects of cpu internal optimization.
>
> Just what I was thinking. Attached is the same program with one more pair
> of functions added (and an easy way to add more "candidates" to the
> main-driver). I changed the FOR-loop define to obtain repeatable results:
>
> # Test 1 -- equal strings:
> foreach m ( 0 1 2 )
> foreach? ./strncasecmp $m 100000000 aaaaaaaaa AAAAAAAAA 7
> foreach? end
> string.h (nb=100000000, len=7)
> time = 6.975845 : res = 0
> optimized (nb=100000000, len=7)
> time = 1.492197 : res = 0
> 'A' - 'a' (nb=100000000, len=7)
> time = 1.787807 : res = 0
>
> # Test 2 -- immediately-different strings
> foreach m ( 0 1 2 )
> foreach? ./strncasecmp $m 100000000 aaaaaaaaa xAAAAAAAA 7
> foreach? end
> string.h (nb=100000000, len=7)
> time = 2.527727 : res = -23
> optimized (nb=100000000, len=7)
> time = 0.406867 : res = -23
> 'A' - 'a' (nb=100000000, len=7)
> time = 0.440320 : res = -23
>
> # Test 3 -- strings different at the very end
> foreach m ( 0 1 2 )
> foreach? ./strncasecmp $m 100000000 aaaaaaaaa AAAAAAAAx 0
> foreach? end
> string.h (nb=100000000, len=0)
> time = 9.629660 : res = -23
> optimized (nb=100000000, len=0)
> time = 1.387208 : res = -23
> 'A' - 'a' (nb=100000000, len=0)
> time = 1.754683 : res = -23
>
> The new pair (method 2) does not use the static table, which is likely to
> benefit from CPU-cache unfairly in repetitive benchmarks.  It is slower
> than the table-using method 1 functions. But the two pairs might be
> comparable -- or even faster -- in real life.
>
> -mi
>
>
