Re: Performance improvement to strnlen().

Václav Zeman Sat, 25 May 2013 23:30:50 -0700

On 05/25/2013 10:27 PM, Lee Thomas wrote:
> Hello FreeBSD devs,
> 
> I have found a performance improvement to libc's strnlen().
> lib/libc/string/strnlen.c is a trivial byte-by-byte implementation,
> where strlen.c has a smarter word-by-word implementation. I have
> implemented strnlen similarly to strlen. It runs about 4x as fast, at
> the cost of a binary codesize increase from 30 bytes to 221.
> 
> In writing this I needed a test, and there isn't one in
> tools/regression/lib/libc/string, so I wrote a unit test for strnlen.
> This really only makes sense for a word-by-word implementation, as it
> tests each combination of string and limit length, overread characters,
> and starting alignment.
> 
> Could someone please review these patches? I am not very experienced in
> C, and am even less experienced with FreeBSD's style guidelines, so they
> likely have a bunch of style and idiom issues. Even if one or both of
> them prove not worth committing, it would still be useful for my learning.
> 
> If/When these patches prove worthy of submitting, what is the next step
> after that? Should I submit a PR, or is there some other process? This
> code is quite similar to the existing strlen.c, and doesn't do anything
> fancy with e.g. endianness, but I haven't tested it for correctness or
> performance on anything other than x86...
> 
> And finally, there is some other low-hanging fruit in the other strn*
> functions. Would it be worth it for me to give those the same treatment?
> 
> Thanks,
> Lee Thomas
> 
> Test platform:
>     $uname -a
>         FreeBSD LeeDesktop 9.1-STABLE FreeBSD 9.1-STABLE #15 r250831:
> Mon May 20 17:31:28 EDT 2013
>             lthomas@LeeDesktop:/usr/obj/usr/src/sys/NOSOUND  amd64
>     $dmesg | grep CPU:
>         CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
> (2666.81-MHz K8-class CPU)
>     $clang --version
>         FreeBSD clang version 3.2 (tags/RELEASE_32/final 170710) 20121221
>         Target: x86_64-unknown-freebsd9.1
>         Thread model: posix
> 
> My timing test was a file of 10000 strings, of uniformly random length,
> 50% between 0 and 20 bytes long, and 50% between 21 and 1000 bytes long.
> Each was filled with random generated bytes in the range [1, 255]. The
> test executables run their strlen on each string in the file 1000 times,
> and a shell script ran the test programs alternately 100 times. Here are
> the clang results; gcc results are roughly the same. I will share the
> actual timing code if someone wants it, but it is pretty ugly C++ and
> shell and I don't guarantee it working :-).
> 
> Results:
> 
> x ./times_bsd_strnlen.txt
> + ./times_my_strnlen.txt
> +--------------------------------------------------------------------------+
> 
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |+                                                                       x|
> |A                                                                       A|
> +--------------------------------------------------------------------------+
> 
>     N           Min           Max        Median           Avg        Stddev
> x 101     1.8685486     1.8693889     1.8687506     1.8687894  0.0001484903
> + 101    0.42704683    0.42779486    0.42712813    0.42715597 0.00010681494
> Difference at 95.0% confidence
>     -1.44163 +/- 3.56739e-05
>     -77.1426% +/- 0.00190893%
>     (Student's t, pooled s = 0.000129342)
> 
> Note: I manually shortened the histogram, as it was 100 lines long of
> exactly the same.
> 
> 
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
>


> Index: strnlen.c
> ===================================================================
> diff --git a/head/lib/libc/string/strnlen.c b/head/lib/libc/string/strnlen.c
> --- a/head/lib/libc/string/strnlen.c  (revision 250951)
> +++ b/head/lib/libc/string/strnlen.c  (working copy)
> @@ -1,5 +1,6 @@
>  /*-
> - * Copyright (c) 2009 David Schultz <d...@freebsd.org>
> + * Copyright (c) 2009, 2010 Xin LI <delp...@freebsd.org>
> + * Copyright (c) 2013 Lee Thomas <lee_tho...@aslantools.com>
>   * All rights reserved.
>   *
>   * Redistribution and use in source and binary forms, with or without
> @@ -27,16 +28,91 @@
>  #include <sys/cdefs.h>
>  __FBSDID("$FreeBSD$");
>  
> +#include <sys/limits.h>
> +#include <sys/types.h>
>  #include <string.h>
>  
> +/*
> + * Portable strnlen() for 32-bit and 64-bit systems.
> + *
> + * Rationale: it is generally much more efficient to do word length
> + * operations and avoid branches on modern computer systems, as
> + * compared to byte-length operations with a lot of branches.
> + *
> + * The expression:
> + *
> + *   ((x - 0x01....01) & ~x & 0x80....80)
> + *
> + * would evaluate to a non-zero value iff any of the bytes in the
> + * original word is zero.
> + *
> + * On multi-issue processors, we can divide the above expression into:
> + *   a)  (x - 0x01....01)
> + *   b) (~x & 0x80....80)
> + *   c) a & b
> + *
> + * Where, a) and b) can be partially computed in parallel.
> + *
> + * The algorithm above is found on "Hacker's Delight" by
> + * Henry S. Warren, Jr.
> + */
> +
> +/* Magic numbers for the algorithm */
> +#if LONG_BIT == 32
> +static const unsigned long mask01 = 0x01010101;
> +static const unsigned long mask80 = 0x80808080;
> +#elif LONG_BIT == 64
> +static const unsigned long mask01 = 0x0101010101010101;
> +static const unsigned long mask80 = 0x8080808080808080;
> +#else
> +#error Unsupported word size
> +#endif
> +
> +#define      LONGPTR_MASK (sizeof(long) - 1)
> +
>  size_t
> -strnlen(const char *s, size_t maxlen)
> +strnlen(const char *str, size_t maxlen)
>  {
> -     size_t len;
> +     const char *stop, *short_stop;
> +     const char *p;
> +     const unsigned long *lp;
> +     long va, vb;
>  
> -     for (len = 0; len < maxlen; len++, s++) {
> -             if (!*s)
> -                     break;
> +     if (maxlen==0) return 0;
> +
> +     stop=str+maxlen;
> +     /*
> +      * Before trying the hard (unaligned byte-by-byte access) way
> +      * to figure out whether there is a nul character, try to see
> +      * if there is a nul character is within this accessible word
> +      * first.
> +      *
> +      * p and (p & ~LONGPTR_MASK) must be equally accessible since
> +      * they always fall in the same memory page, as long as page
> +      * boundaries is integral multiple of word size.
> +      */
> +     lp = (const unsigned long *)((uintptr_t)str & ~LONGPTR_MASK);
> +     va = (*lp - mask01);
> +     vb = ((~*lp) & mask80);
I do not think that this correct C. This is type punning violating the
rules of the language.

One way you can rewrite this is to use memcpy() into a temporary. Do not
be afraid of the memcpy(), modern GCC, and I hope even Clang, can
optimize it into a single move instruction:

{
  long tmp;
  memcpy(&tmp, lp, sizeof(tmp));
  va = (tmp - mask01);
  vb = ((~tmp) & mask80);
}

Another way could be using the union trick.

> +     lp++;
> +     if (va & vb) {
> +             /* Check if we have \0 in the first part */
> +             short_stop=(const char *)lp;
> +             if (stop<short_stop) short_stop=stop;
> +             for (p=str; p != short_stop; p++)
> +                     if (*p == '\0')
> +                             return (p-str);
>       }
> -     return (len);
> +     /* Scan the rest of the string using word sized operation */
> +     for (; (const char *)lp < stop; lp++) {
> +             va = (*lp - mask01);
> +             vb = ((~*lp) & mask80);
Ditto.

> +             if (va & vb) {
> +                     for (p=(const char *)lp; p != stop; p++)
> +                             if (*p == '\0')
> +                                     break;
> +                     return (p-str);
> +             }
> +     }
> +     return (maxlen);
>  }

-- 
VZ

signature.asc
Description: OpenPGP digital signature

Re: Performance improvement to strnlen().

Reply via email to