Hi Paul,

> (Ooo!  Ooo!  Performance measurements!  I love this stuff!)

Me too :-)

> It depends on the data.  In the typical case, "sort" is applied to
> text data, which does not contain NUL bytes.  The data in that
> benchmark contained many NUL bytes.  If you take the same benchmark
> and uniformly replace "\0" with "\t" in compare.c, then the situation
> is much different: coreutils memxfrm is about 3 times faster than
> gnulib memxfrm on the larger test cases

Indeed. By changing the compare.c program to try
  - first, a small string, 3 times,
  - then, a 40 KB string with \t separators, 3 times,
  - then, the same string with \0 separators, 3 times,
I confirm your timings:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036003
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 10,9647
Time for coreutils_memxfrm: 3,54422
Time for gnulib_memxfrm: 10,4247
Time for coreutils_memxfrm: 3,54422
Time for gnulib_memxfrm: 10,4407
Time for coreutils_memxfrm: 3,54422
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,42021
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,41621

The reason is that gnulib_memxfrm duplicates the allocated memory size, from
4 KB to 8 KB to 16 KB etc., ignoring the expected result size of 108 KB that
strxfrm is returned. After this is fixed, I get better timings.

> I expect that this performance glitch in gnulib memxfrm could be
> improved, as it shouldn't simply double buffer sizes when they're too
> small, as at that point it already knows what the final buffer size
> should be.  Doing this should bring up gnulib memxfrm to be as fast as
> coreutils xmemxfrm for this benchmark.

Yes:

Without strdup:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036003
Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 4,29627
Time for coreutils_memxfrm: 3,55222
Time for gnulib_memxfrm: 3,54422
Time for coreutils_memxfrm: 3,55222
Time for gnulib_memxfrm: 3,54022
Time for coreutils_memxfrm: 3,55222
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,42421
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,42421
Time for gnulib_memxfrm: 1,98812
Time for coreutils_memxfrm: 3,42021

With strdup:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036003
Time for coreutils_memxfrm: 0,032002
Time for gnulib_memxfrm: 4,39227
Time for coreutils_memxfrm: 3,54822
Time for gnulib_memxfrm: 3,62823
Time for coreutils_memxfrm: 3,54822
Time for gnulib_memxfrm: 3,64023
Time for coreutils_memxfrm: 3,54822
Time for gnulib_memxfrm: 1,98412
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,42021
Time for gnulib_memxfrm: 1,98012
Time for coreutils_memxfrm: 3,41621

So, this means that my estimation of the overhead of strdup was incorrect. It
is not unnoticeable. In the second of the three cases, it is about 2.5% of the
3.55 seconds. In other words, a malloc + memmove costs about 5% of one
strxfrm call.

But this means that reducing the average numbers of strxfrm calls from 2 to 1,
at the cost of some more malloc, will be a speed-up. This is what I did in
the patch below, and now gnulib_memxfrm is consistently the winner:

Time for gnulib_memxfrm: 0,036002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,032002
Time for coreutils_memxfrm: 0,036002
Time for gnulib_memxfrm: 0,036003
Time for coreutils_memxfrm: 0,032002
Time for gnulib_memxfrm: 2,90418
Time for coreutils_memxfrm: 3,54022
Time for gnulib_memxfrm: 1,97212
Time for coreutils_memxfrm: 3,54022
Time for gnulib_memxfrm: 1,99212
Time for coreutils_memxfrm: 3,54022
Time for gnulib_memxfrm: 1,84011
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,82811
Time for coreutils_memxfrm: 3,41621
Time for gnulib_memxfrm: 1,83611
Time for coreutils_memxfrm: 3,41621

> Still, gnulib memxfrm is
> problematic, because it insists on managing memory itself.

But by doing so it is now 44% faster than the coreutils_memxfrm on the
case of long strings, with or without NULs.

Starting from a certain complexity, doing malloc as part of the API is
a win. Only people who do some kind of embedded systems programming
require to be able to do memory allocation statically. For example,
GNU libiconv now has a function iconv_open_into, that is like iconv_open
without memory allocation. But very very few people need that.

Even the most basic functions in libunistring (unistr.in.h: u8_to_u16,
u8_to_u32, ..., uniconv.in.h: u8_conv_from_encoding, u8_conv_to_encoding,
etc.) rely on implicit dynamic memory allocation. No one has complained
about this.

> The point
> remains, though, that it's confusing that gnulib memxfrm's name begins
> with "mem", as the mem* functions don't allocate memory.  Would you
> consider a patch that renames gnulib memxfrm to amemxfrm, or to some
> other such name?

No, this is not good. The variant which never allocates memory by itself
would be more complex to use and slower on average that gnulib's function.
Also, functions like 'strdup', 'putenv', 'setenv', 'scandir', 'fts', all
do dynamic memory allocation without having an 'a' in their name to
indicate this.


2010-08-08  Bruno Haible  <br...@clisp.org>

        memxfrm: Speed up.
        * lib/memxfrm.c (memxfrm): Allocate enough memory ahead of time, so
        that usually only one call to strxfrm is necessary for each string
        part.
        Reported by Paul Eggert <egg...@cs.ucla.edu>.

--- lib/memxfrm.c.orig  Sun Aug  8 11:56:19 2010
+++ lib/memxfrm.c       Sun Aug  8 11:55:46 2010
@@ -64,12 +64,40 @@
     for (;;)
       {
         /* Search next NUL byte.  */
-        const char *q = p + strlen (p);
+        size_t l = strlen (p);
 
         for (;;)
           {
             size_t k;
 
+            /* A call to strxfrm costs about 20 times more than a call to
+               strdup of the result.  Therefore it is worth to try to avoid
+               calling strxfrm more than once on a given string, by making
+               enough room before calling strxfrm.
+               The size of the strxfrm result, k, is likely to be between
+               l and 3 * l.  */
+            if (3 * l >= allocated - length)
+              {
+                /* Grow the result buffer.  */
+                size_t new_allocated;
+                char *new_result;
+
+                new_allocated = length + 3 * l + 1;
+                if (new_allocated < 2 * allocated)
+                  new_allocated = 2 * allocated;
+                if (new_allocated < 64)
+                  new_allocated = 64;
+                if (result == resultbuf)
+                  new_result = (char *) malloc (new_allocated);
+                else
+                  new_result = (char *) realloc (result, new_allocated);
+                if (new_result != NULL)
+                  {
+                    allocated = new_allocated;
+                    result = new_result;
+                  }
+              }
+
             errno = 0;
             k = strxfrm (result + length, p, allocated - length);
             if (errno != 0)
@@ -77,17 +105,21 @@
             if (k >= allocated - length)
               {
                 /* Grow the result buffer.  */
+                size_t new_allocated;
                 char *new_result;
 
-                allocated = 2 * allocated;
-                if (allocated < 64)
-                  allocated = 64;
+                new_allocated = length + k + 1;
+                if (new_allocated < 2 * allocated)
+                  new_allocated = 2 * allocated;
+                if (new_allocated < 64)
+                  new_allocated = 64;
                 if (result == resultbuf)
-                  new_result = (char *) malloc (allocated);
+                  new_result = (char *) malloc (new_allocated);
                 else
-                  new_result = (char *) realloc (result, allocated);
+                  new_result = (char *) realloc (result, new_allocated);
                 if (new_result == NULL)
                   goto out_of_memory_1;
+                allocated = new_allocated;
                 result = new_result;
               }
             else
@@ -97,7 +129,7 @@
               }
           }
 
-        p = q + 1;
+        p = p + l + 1;
         if (p == p_end)
           break;
         result[length] = '\0';
@@ -105,12 +137,23 @@
       }
   }
 
-  /* Shrink the allocated memory if possible.  */
-  if (result != resultbuf && (length > 0 ? length : 1) < allocated)
+  /* Shrink the allocated memory if possible.
+     It is not worth calling realloc when length + 1 == allocated; it would
+     save just one byte.  */
+  if (result != resultbuf && length + 1 < allocated)
     {
-      char *memory = (char *) realloc (result, length > 0 ? length : 1);
-      if (memory != NULL)
-        result = memory;
+      if ((length > 0 ? length : 1) <= *lengthp)
+        {
+          memcpy (resultbuf, result, length);
+          free (result);
+          result = resultbuf;
+        }
+      else
+        {
+          char *memory = (char *) realloc (result, length > 0 ? length : 1);
+          if (memory != NULL)
+            result = memory;
+        }
     }
 
   s[n] = orig_sentinel;



Reply via email to