[PATCH 0/2] strbuf: improve API

2016-05-30 Thread William Duclot
This patch series implements an improvement of the strbuf API, allowing
strbuf to use preallocated memory. This makes strbuf fit for use
in performance-critical operations.

The first patch is simply preparatory work, adding tests for the
existing strbuf implementation.
Most of the work is done in the second patch: handle pre-allocated
memory, extend the API, document it and test it.

As mentioned in the second commit, the idea comes from Michael Haggerty.

William Duclot (2):
strbuf: add tests
strbuf: allow to use preallocated memory

Makefile               |   1 +
strbuf.c               |  68 +++-
strbuf.h               |  31 +--
t/helper/test-strbuf.c | 111 +++
t/t0082-strbuf.sh      |  47 +++
5 files changed, 251 insertions(+), 7 deletions(-)
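
A sketch of the kind of call site such an API aims at. The function name
strbuf_wrap_preallocated() and its exact signature below are assumptions
for illustration only, not necessarily what the series implements.

  #include "git-compat-util.h"
  #include "strbuf.h"

  static void report_path(const char *dir, const char *name)
  {
          char stack_buf[256];
          struct strbuf path = STRBUF_INIT;

          /* Hypothetical: let the strbuf write into stack_buf and fall
           * back to the heap only if it outgrows it. */
          strbuf_wrap_preallocated(&path, stack_buf, 0, sizeof(stack_buf));
          strbuf_addf(&path, "%s/%s", dir, name);
          puts(path.buf);
          strbuf_release(&path); /* frees only if the strbuf moved to the heap */
  }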



Re: [PATCH 0/2] strbuf: improve API

2016-05-30 Thread Remi Galan Alfonso
William Duclot  writes:
> This patch series implements an improvement of the strbuf API, allowing
> strbuf to use preallocated memory. This makes strbuf fit for use
> in performance-critical operations.
> 
> The first patch is simply preparatory work, adding tests for the
> existing strbuf implementation.
> Most of the work is done in the second patch: handle pre-allocated
> memory, extend the API, document it and test it.

Seems interesting; however, do you have any test/example that would
show the performance difference between using these optimizations
and not using them?

Such values would make a nice addition to help convince people that
your series is interesting to have and use.

Thanks,
RĂ©mi


Re: [PATCH 0/2] strbuf: improve API

2016-06-01 Thread Jeff King
On Mon, May 30, 2016 at 01:32:08PM +0200, Remi Galan Alfonso wrote:

> William Duclot  writes:
> > This patch series implements an improvement of the strbuf API, allowing
> > strbuf to use preallocated memory. This makes strbuf fit for use
> > in performance-critical operations.
> > 
> > The first patch is simply preparatory work, adding tests for the
> > existing strbuf implementation.
> > Most of the work is done in the second patch: handle pre-allocated
> > memory, extend the API, document it and test it.
> 
> Seems interesting; however, do you have any test/example that would
> show the performance difference between using these optimizations
> and not using them?
> 
> Such values would make a nice addition to help convince people that
> your series is interesting to have and use.

I'll second the request for actual numbers. I'm a little dubious that
malloc overhead is actually a significant place we are spending time;
I wonder if there is simply a superstitious avoidance of using strbufs.
A huge number of strbufs are used for filenames, where we're about to
make a syscall anyway. If handing out a 4K buffer is not competitive
with a context switch, I suspect the best solution is to get a new
allocator.

So I wonder if we have some less-invasive alternatives:

  1. Ship a faster allocator library with git, and use its malloc by
 default.

  2. Do caching tricks for strbufs used in tight loops. For example,
 have strbuf_release() throw its buffer into a last-used cache, and
 let the next strbuf_grow() use that cache entry. This cuts malloc()
 out of the loop.

 You'd probably want to protect the cache with a mutex, though. Most
 of git isn't thread-safe, but a few parts are, and strbufs are
     low-level enough that they might get called from threaded code.
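
A minimal sketch of what idea 2 above could look like: a one-entry cache
guarded by a mutex. The helpers strbuf_cache_get()/strbuf_cache_put() are
hypothetical names used only for illustration; they are not part of the
existing strbuf API.

  #include <pthread.h>
  #include <stdlib.h>

  /* One-slot cache for recycling the backing buffer of released strbufs. */
  static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
  static char *cached_buf;
  static size_t cached_alloc;

  /* strbuf_grow() would try this before falling back to malloc(). */
  static char *strbuf_cache_get(size_t *alloc)
  {
          char *buf = NULL;
          pthread_mutex_lock(&cache_lock);
          if (cached_buf) {
                  buf = cached_buf;
                  *alloc = cached_alloc;
                  cached_buf = NULL;
          }
          pthread_mutex_unlock(&cache_lock);
          return buf;
  }

  /* strbuf_release() would call this instead of free(). */
  static void strbuf_cache_put(char *buf, size_t alloc)
  {
          pthread_mutex_lock(&cache_lock);
          if (!cached_buf) {
                  cached_buf = buf;
                  cached_alloc = alloc;
                  buf = NULL;
          }
          pthread_mutex_unlock(&cache_lock);
          free(buf); /* slot was already occupied; drop the extra buffer */
  }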

I have no idea if those ideas would work. But I wouldn't want to start
looking into either of them without some idea of how much time we're
actually spending on strbuf mallocs (or how much time we would spend if
strbufs were used in some proposed sites).

-Peff


Re: [PATCH 0/2] strbuf: improve API

2016-06-01 Thread David Turner
On Wed, 2016-06-01 at 03:42 -0400, Jeff King wrote:
>   2. Do caching tricks for strbufs used in tight loops. For example,
>  have strbuf_release() throw its buffer into a last-used cache, and
>  let the next strbuf_grow() use that cache entry. This cuts malloc()
>  out of the loop.
> 
>  You'd probably want to protect the cache with a mutex, though.


... or make the last-used cache be thread-local.
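
A sketch of that variant, assuming GCC/Clang's __thread storage class
(C11 spells it _Thread_local): with one cache slot per thread, no lock is
needed. As with the earlier sketch, the helper names are hypothetical.

  #include <stdlib.h>

  static __thread char *cached_buf;
  static __thread size_t cached_alloc;

  static char *strbuf_cache_get(size_t *alloc)
  {
          char *buf = cached_buf;
          if (buf) {
                  *alloc = cached_alloc;
                  cached_buf = NULL;
          }
          return buf;
  }

  static void strbuf_cache_put(char *buf, size_t alloc)
  {
          if (cached_buf) {
                  free(buf); /* this thread's slot is taken; drop the buffer */
          } else {
                  cached_buf = buf;
                  cached_alloc = alloc;
          }
  }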



Re: [PATCH 0/2] strbuf: improve API

2016-06-01 Thread Jeff King
On Wed, Jun 01, 2016 at 03:50:29PM -0400, David Turner wrote:

> On Wed, 2016-06-01 at 03:42 -0400, Jeff King wrote:
> >   2. Do caching tricks for strbufs used in tight loops. For example,
> >  have strbuf_release() throw its buffer into a last-used cache, and
> >  let the next strbuf_grow() use that cache entry. This cuts malloc()
> >  out of the loop.
> > 
> >  You'd probably want to protect the cache with a mutex, though.
> 
> 
> ... or make the last-used cache be thread-local.

Good idea.

I almost didn't mention threading at all, because I'd be surprised if
malloc lock contention is a serious bottleneck for us anywhere.

It's hard to really say much concrete because it's not clear to me where
people think strbuf allocation _is_ a bottleneck (or again, _would be_
if we were to use it more).

If we imagine a loop like this:

  for (i = 0; i < n; i++) {
struct strbuf buf = STRBUF_INIT;
do_something_with(&buf);
strbuf_release(&buf);
  }

then yes, we'll call malloc() and free() a lot of times. But unless your
libc malloc is terrible, those calls are all probably just taking a
mutex and reusing the same block from a free list. The strbuf case is
actually really friendly to allocators because our blocks tend to be
predictable sizes (they're sized by ALLOC_GROW, which has a
deterministic set of block sizes).
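
For reference, that growth policy is roughly the following (paraphrased
from git-compat-util.h at the time; details such as the realloc wrapper
may differ):

  #define alloc_nr(x) (((x)+16)*3/2)

  #define ALLOC_GROW(x, nr, alloc) \
          do { \
                  if ((nr) > alloc) { \
                          if (alloc_nr(alloc) < (nr)) \
                                  alloc = (nr); \
                          else \
                                  alloc = alloc_nr(alloc); \
                          (x) = xrealloc((x), (alloc) * sizeof(*(x))); \
                  } \
          } while (0)

so for a strbuf that grows by small appends, the allocation sizes follow
the fixed sequence 24, 60, 114, 195, ... regardless of the exact input.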

In practice, I'd imagine loops get more complicated than that. They call
sub-functions which allocate a strbuf, and sometimes don't release it
until other strbufs have been allocated, etc. The proposal in this
thread is basically using the call stack to mirror the allocation
patterns. So when the pattern of allocations matches an actual stack,
that's good, but I'd expect a reasonably smart allocator to perform
similarly. And when it's not a true stack, the stack-based strbufs are
going to end up with wasted stack space hanging around even after we've
free()'d the memory and could reuse it. And I'd still expect an
allocator based off a linked list or something to handle such cases
pretty well.

There are also other non-big-O factors at play in modern systems, like
CPU cache footprints. Is heap memory cheaper or more expensive to access
than stack? I can imagine stack is kept in the cache better, but if
you're reusing the same heap block over and over, it probably stays in
cache, too. But if you have unused stack buffers that _could_ be reused
(but won't be because you'd have to manually feed them to a new strbuf),
that hurts your footprint. And on top of that, the stack proposals I've
seen generally involve over-sizing the stack buffers out of a fear of
actually calling malloc.

And then on top of that, there is the question of whether anything
involving a strbuf allocation is actually a tight enough loop to even
_care_ about cache dynamics.

So again, I'm slightly negative on anything that makes the code even
slightly more complex, especially the call sites, if we can't show
actual measurable improvement numbers.

Sorry, this turned into a mini-rant, and it's not really directed at
you, David. Your email was just what spurred me to put some of these
thoughts into words. :)

-Peff


Re: [PATCH 0/2] strbuf: improve API

2016-06-01 Thread David Turner
On Wed, 2016-06-01 at 16:09 -0400, Jeff King wrote:
> On Wed, Jun 01, 2016 at 03:50:29PM -0400, David Turner wrote:
> 
> > On Wed, 2016-06-01 at 03:42 -0400, Jeff King wrote:
> > >   2. Do caching tricks for strbufs used in tight loops. For example,
> > >  have strbuf_release() throw its buffer into a last-used cache, and
> > >  let the next strbuf_grow() use that cache entry. This cuts malloc()
> > >  out of the loop.
> > > 
> > >  You'd probably want to protect the cache with a mutex, though.
> > 
> > 
> > ... or make the last-used cache be thread-local.
> 
> Good idea.
> 
> I almost didn't mention threading at all, because I'd be surprised if
> malloc lock contention is a serious bottleneck for us anywhere.
> 
> It's hard to really say much concrete because it's not clear to me where
> people think strbuf allocation _is_ a bottleneck (or again, _would be_
> if we were to use it more).
> 
> If we imagine a loop like this:
> 
>   for (i = 0; i < n; i++) {
>   struct strbuf buf = STRBUF_INIT;
>   do_something_with(&buf);
>   strbuf_release(&buf);
>   }
> 
> then yes, we'll call malloc() and free() a lot of times. But unless your
> libc malloc is terrible, those calls are all probably just taking a
> mutex and reusing the same block from a free list. The strbuf case is
> actually really friendly to allocators because our blocks tend to be
> predictable sizes (they're sized by ALLOC_GROW, which has a
> deterministic set of block sizes).
> 
> In practice, I'd imagine loops get more complicated than that. They call
> sub-functions which allocate a strbuf, and sometimes don't release it
> until other strbufs have been allocated, etc. The proposal in this
> thread is basically using the call stack to mirror the allocation
> patterns. So when the pattern of allocations matches an actual stack,
> that's good, but I'd expect a reasonably smart allocator to perform
> similarly. And when it's not a true stack, the stack-based strbufs are
> going to end up with wasted stack space hanging around even after we've
> free()'d the memory and could reuse it. And I'd still expect an
> allocator based off a linked list or something to handle such cases
> pretty well.
> 
> There are also other non-big-O factors at play in modern systems, like
> CPU cache footprints. Is heap memory cheaper or more expensive to access
> than stack? I can imagine stack is kept in the cache better, but if
> you're reusing the same heap block over and over, it probably stays in
> cache, too. But if you have unused stack buffers that _could_ be reused
> (but won't be because you'd have to manually feed them to a new strbuf),
> that hurts your footprint. And on top of that, the stack proposals I've
> seen generally involve over-sizing the stack buffers out of a fear of
> actually calling malloc.
> 
> And then on top of that, there is the question of whether anything
> involving a strbuf allocation is actually a tight enough loop to even
> _care_ about cache dynamics.
> 
> So again, I'm slightly negative on anything that makes the code even
> slightly more complex, especially the call sites, if we can't show
> actual measurable improvement numbers.
> 
> Sorry, this turned into a mini-rant, and it's not really directed at
> you, David. Your email was just what spurred me to put some of these
> thoughts into words. :)

I agree with all this stuff anyway.


Re: [PATCH 0/2] strbuf: improve API

2016-06-01 Thread Jeff King
On Wed, Jun 01, 2016 at 03:42:18AM -0400, Jeff King wrote:

> I have no idea if those ideas would work. But I wouldn't want to start
> looking into either of them without some idea of how much time we're
> actually spending on strbuf mallocs (or how much time we would spend if
> strbufs were used in some proposed sites).

So I tried to come up with some numbers.

Here's an utterly silly use of strbufs, but one that I think should
over-emphasize the effect of any improvements we make:

diff --git a/Makefile b/Makefile
index 7a0551a..72b968a 100644
--- a/Makefile
+++ b/Makefile
@@ -579,6 +579,7 @@ PROGRAM_OBJS += shell.o
 PROGRAM_OBJS += show-index.o
 PROGRAM_OBJS += upload-pack.o
 PROGRAM_OBJS += remote-testsvn.o
+PROGRAM_OBJS += foo.o
 
 # Binary suffix, set to .exe for Windows builds
 X =
diff --git a/foo.c b/foo.c
index e69de29..b62dd97 100644
--- a/foo.c
+++ b/foo.c
@@ -0,0 +1,18 @@
+#include "git-compat-util.h"
+#include "strbuf.h"
+
+int main(void)
+{
+   const char *str = "this is a string that we'll repeatedly insert";
+   size_t len = strlen(str);
+
+   int i;
+   for (i = 0; i < 100; i++) {
+   struct strbuf buf = STRBUF_INIT;
+   int j;
+   for (j = 0; j < 500; j++)
+   strbuf_add(&buf, str, len);
+   strbuf_release(&buf);
+   }
+   return 0;
+}

That takes about 3.5 seconds to run git-foo on my machine.  Here's where
perf says the time goes:

# Children  Self  Command  Shared Object  Symbol
#     ...  .  ..
#
35.62%34.58%  git-foo  git-foo[.] strbuf_add
22.70%22.14%  git-foo  git-foo[.] strbuf_grow   
19.85%19.04%  git-foo  libc-2.22.so   [.] __memcpy_avx_unaligned
 9.47% 9.17%  git-foo  git-foo[.] main  
 4.88% 4.68%  git-foo  git-foo[.] memcpy@plt
 3.21% 3.12%  git-foo  libc-2.22.so   [.] realloc   
 1.75% 1.71%  git-foo  libc-2.22.so   [.] _int_realloc  
 1.42% 0.00%  git-foo  [unknown]  [.] 0x676e697274732061
 0.82% 0.79%  git-foo  git-foo[.] xrealloc  
 0.61% 0.59%  git-foo  git-foo[.] memory_limit_check
 0.45% 0.00%  git-foo  [unknown]  [.]   
 0.32% 0.00%  git-foo  [unknown]  [.] 0x0fff
 0.32% 0.32%  git-foo  [unknown]  [k] 0x7f591b44a2f0
 0.31% 0.31%  git-foo  [unknown]  [k] 0x00404719
 0.30% 0.00%  git-foo  [unknown]  [.] 0x1ffe
 0.30% 0.30%  git-foo  libc-2.22.so   [.] _int_free 
 0.30% 0.28%  git-foo  libc-2.22.so   [.] _int_malloc   

So malloc and free are pretty low in the profile. It looks like realloc
costs more, probably because of the memcpys it has to do. I don't think that would
be much better in a stack-based system (because we'd realloc there,
too). What would help is simply using a larger initial size (for _this_
made-up benchmark; I'm not convinced it would be all that good in the
real world).

But either way, most of the time is actually spent in the strbuf
functions themselves (interestingly, if you replace strbuf_add with
strbuf_addf, we spend much more time dealing with the formatted input).
We can probably micro-optimize them some.

Here's a short and hacky patch to inline strbuf_grow, and to use
compiler intrinsics to do the overflow checks (which obviously isn't
portable, but we can easily hide it behind a macro and fall back to the
existing scheme):

diff --git a/strbuf.c b/strbuf.c
index 1ba600b..4f163cb 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -55,19 +55,25 @@ void strbuf_attach(struct strbuf *sb, void *buf, size_t len, size_t alloc)
sb->buf[sb->len] = '\0';
 }
 
-void strbuf_grow(struct strbuf *sb, size_t extra)
+static inline void strbuf_grow_inline(struct strbuf *sb, size_t extra)
 {
int new_buf = !sb->alloc;
-   if (unsigned_add_overflows(extra, 1) ||
-   unsigned_add_overflows(sb->len, extra + 1))
+   if (__builtin_add_overflow(extra, sb->len, &extra) ||
+   __builtin_add_overflow(extra, 1, &extra))
die("you want to use way too much memory");
if (new_buf)
sb->buf = NULL;
-   ALLOC_GROW(sb->buf, sb->len + extra + 1, sb->alloc);
+   ALLOC_GROW(sb->buf, extra, sb->alloc);
if (new_buf)
sb->buf[0] = '\0';
 }
 
+void strbuf_grow(struct strbuf *sb, size_t extra)
+{
+   strbuf_grow_inline(sb, extra);
+}
+#define strbuf_grow strbuf_grow_inline
+
 void strbuf_trim(struct strbuf *sb)
 {
strbuf_rtrim(sb);
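
Hiding the intrinsic behind a portability macro, as suggested above, might
look something like this sketch. It assumes GCC 5 or newer (Clang would
need a __has_builtin check), the fallback reuses git's existing
unsigned_add_overflows() from git-compat-util.h, and the name
size_add_overflows() is made up for illustration:

  #if defined(__GNUC__) && __GNUC__ >= 5
  #define size_add_overflows(a, b, res) __builtin_add_overflow((a), (b), (res))
  #else
  static inline int size_add_overflows_fn(size_t a, size_t b, size_t *res)
  {
          if (unsigned_add_overflows(a, b))
                  return 1;
          *res = a + b;
          return 0;
  }
  #define size_add_overflows(a, b, res) size_add_overflows_fn((a), (b), (res))
  #endif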

That drops my best-of-five for git-foo from:

  real0m3.377s
  user0m3.376s
  sys 0m0.000s

to:

  real0m3.107s
  user 

Re: [PATCH 0/2] strbuf: improve API

2016-06-02 Thread Michael Haggerty
On 06/01/2016 11:07 PM, Jeff King wrote:
> On Wed, Jun 01, 2016 at 03:42:18AM -0400, Jeff King wrote:
> 
>> I have no idea if those ideas would work. But I wouldn't want to start
>> looking into either of them without some idea of how much time we're
>> actually spending on strbuf mallocs (or how much time we would spend if
>> strbufs were used in some proposed sites).
> 
> So I tried to come up with some numbers.
> 
> Here's an utterly silly use of strbufs, but one that I think should
> over-emphasize the effect of any improvements we make:
> 
> diff --git a/Makefile b/Makefile
> index 7a0551a..72b968a 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -579,6 +579,7 @@ PROGRAM_OBJS += shell.o
>  PROGRAM_OBJS += show-index.o
>  PROGRAM_OBJS += upload-pack.o
>  PROGRAM_OBJS += remote-testsvn.o
> +PROGRAM_OBJS += foo.o
>  
>  # Binary suffix, set to .exe for Windows builds
>  X =
> diff --git a/foo.c b/foo.c
> index e69de29..b62dd97 100644
> --- a/foo.c
> +++ b/foo.c
> @@ -0,0 +1,18 @@
> +#include "git-compat-util.h"
> +#include "strbuf.h"
> +
> +int main(void)
> +{
> + const char *str = "this is a string that we'll repeatedly insert";
> + size_t len = strlen(str);
> +
> + int i;
> + for (i = 0; i < 100; i++) {
> + struct strbuf buf = STRBUF_INIT;
> + int j;
> + for (j = 0; j < 500; j++)
> + strbuf_add(&buf, str, len);
> + strbuf_release(&buf);
> + }
> + return 0;
> +}

Thanks for generating actual data.

Your benchmark could do two things to stress malloc()/free()
even more. I'm not claiming that this makes the benchmark more typical
of real-world use, but it maybe gets us closer to the theoretical upper
limit on improvement.

1. Since strbuf_grow() allocates new space in geometrically increasing
sizes, the number of mallocs needed to do N strbuf_add()s increases only
like log(N). So decreasing the "500" would increase the fraction of the
time spent on allocations.

2. Since the size of the string being appended increases the time spent
copying bytes, without appreciably changing the number of allocations
(also because of geometric growth), decreasing the length of the string
would also increase the fraction of the time spent on allocations.

I put those together as options, programmed several variants of the
string-concatenating loop, and added a perf script to run them; you can
see the full patch here:


https://github.com/mhagger/git/commit/b417935a4425e0f2bf62e59a924dc652bb2eae0c

The guts look like this:

>   int j;
>   if (variant == 0) {
>   /* Use buffer allocated a single time */
>   char *buf = big_constant_lifetime_buf;
> 
>   for (j = 0; j < reps; j++)
>   strcpy(buf + j * len, str);
>   } else if (variant == 1) {
>   /* One correct-sized buffer malloc per iteration */
>   char *buf = xmalloc(reps * len + 1);
> 
>   for (j = 0; j < reps; j++)
>   strcpy(buf + j * len, str);
> 
>   free(buf);
>   } else if (variant == 2) {
>   /* Conventional use of strbuf */
>   struct strbuf buf = STRBUF_INIT;
> 
>   for (j = 0; j < reps; j++)
>   strbuf_add(&buf, str, len);
> 
>   strbuf_release(&buf);
>   } else if (variant == 3) {
>   /* strbuf initialized to correct size */
>   struct strbuf buf;
>   strbuf_init(&buf, reps * len);
> 
>   for (j = 0; j < reps; j++)
>   strbuf_add(&buf, str, len);
> 
>   strbuf_release(&buf);
>   } else if (variant == 4) {
>   /*
>* Simulated fixed strbuf with correct size.
>* This code only works because we know how
>* strbuf works internally, namely that it
>* will never realloc() or free() the buffer
>* that we attach to it.
>*/
>   struct strbuf buf = STRBUF_INIT;
>   strbuf_attach(&buf, big_constant_lifetime_buf, 0, reps * len + 1);
> 
>   for (j = 0; j < reps; j++)
>   strbuf_add(&buf, str, len);
> 
>   /* No strbuf_release() here! */
>   }

I ran this for a short string ("short") and a long string ("this is a
string that we will repeatedly insert"), and also concatenating the
string 5, 20, or 500 times. The number of loops around the whole program
is normalized to make the total number of concatenations approximately
constant. Here are the full results:


                     time (s)
Test                    0      1      2      3      4

5 short strings      1.64   3.37   8.72   6.08   3.65
20 short strings     1.72   2.12   5.43   4.01   3.39
500 short strings    1.62   1.61   3.36   3.26   3.10
5 long strings       2.08   6.64  13.09   8.50   3.79
20 long strings      2.16   3.33   7.03   4.72   3.55
500 long strings     2.04   2.10   3.61   3.33   3.26

Re: [PATCH 0/2] strbuf: improve API

2016-06-02 Thread Matthieu Moy
Michael Haggerty  writes:

> 1. The amount of added code complexity is small and quite
>encapsulated.

Actually, STRBUF_OWNS_MEMORY can even be seen as a simplification if
done properly: we already have the case where the strbuf does not own
the memory with strbuf_slopbuf. I already pointed out places in
strbuf_grow() which could be simplified after the patch. Re-reading the
code, it seems at least the call to strbuf_grow(sb, 0) in strbuf_detach()
becomes useless. The same call in strbuf_attach() probably is, too.
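
(For readers following along: strbuf_slopbuf is the shared, always-empty
one-byte buffer that an unallocated strbuf points at, so sb->buf is always
safe to dereference even though the strbuf owns no memory. Roughly, from
strbuf.[ch]:)

  char strbuf_slopbuf[1];

  #define STRBUF_INIT { 0, 0, strbuf_slopbuf }  /* alloc = 0, len = 0 */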

So, the final strbuf.[ch] code might not be "worse" than the previous one.

I'm unsure about the complexity of the future code using the new API. I
don't foresee cases where using the new API would lead to a high
maintenance cost, but I don't claim to have considered all possible uses.

> 2. The ability to use strbufs without having to allocate memory might
>make enough *psychological* difference that it encourages some
>devs to use strbufs where they would otherwise have done manual
>memory management. I think this would be a *big* win in terms of
>potential bugs and security vulnerabilities avoided.

Note that this can also be seen as a counter-argument, since it
may psychologically encourage people to micro-optimize code and spend
contributors'/reviewers' neurons on "shall this be on-stack
or malloced?".

I think we already have a tendency to micro-optimize non-critical code
too much in Git's codebase, so it's not necessarily a step in the right
direction.

In conclusion, I don't have a conclusion, sorry ;-).

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/


Re: [PATCH 0/2] strbuf: improve API

2016-06-02 Thread William Duclot
On Thu, Jun 02, 2016 at 02:58:22PM +0200, Matthieu Moy wrote:
> Michael Haggerty  writes:
> 
>> 1. The amount of added code complexity is small and quite
>>encapsulated.
> 
> Actually, STRBUF_OWNS_MEMORY can even be seen as a simplification if
> done properly: we already have the case where the strbuf does not own
> the memory with strbuf_slopbuf. I already pointed out places in
> strbuf_grow() which could be simplified after the patch. Re-reading the
> code, it seems at least the call to strbuf_grow(sb, 0) in strbuf_detach()
> becomes useless. The same call in strbuf_attach() probably is, too.
> 
> So, the final strbuf.[ch] code might not be "worse" than the previous one.
> 
> I'm unsure about the complexity of the future code using the new API. I
> don't foresee cases where using the new API would lead to a high
> maintenance cost, but I don't claim to have considered all possible uses.
> 
>> 2. The ability to use strbufs without having to allocate memory might
>>make enough *psychological* difference that it encourages some
>>devs to use strbufs where they would otherwise have done manual
>>memory management. I think this would be a *big* win in terms of
>>potential bugs and security vulnerabilities avoided.
> 
> Note that this can also be seen as a counter-argument, since it
> may psychologically encourage people to micro-optimize code and spend
> contributors'/reviewers' neurons on "shall this be on-stack
> or malloced?".
> 
> I think we already have a tendency to micro-optimize non-critical code
> too much in Git's codebase, so it's not necessarily a step in the right
> direction.
> 
> In conclusion, I don't have a conclusion, sorry ;-).

Thank you all for your input and your tests, those are very valuable!
Simon and I have to make a decision, as this contribution is part of a
school project that is coming to an end. We won't have the time to create
tests that are representative of the use of strbuf in the Git codebase
(partly because we lack knowledge of the codebase) to settle on whether
this API improvement is useful or not.

Jeff made very good points, and the tests we ran ourselves seem to
agree with his. That being said, the more detailed tests made by Michael
suggest that there may be room for an improvement of the strbuf API
(even though that still has to be confronted with the reality of the
codebase).

Having little time, we will refactor the patch and send it as a V2: this
way, if it appears one day that improving the API with stack-allocated
memory is indeed useful, the patch will be ready to merge :)


Re: [PATCH 0/2] strbuf: improve API

2016-06-24 Thread Jeff King
On Thu, Jun 02, 2016 at 01:11:56PM +0200, Michael Haggerty wrote:

> On 06/01/2016 11:07 PM, Jeff King wrote:
> > On Wed, Jun 01, 2016 at 03:42:18AM -0400, Jeff King wrote:
> > 
> >> I have no idea if those ideas would work. But I wouldn't want to start
> >> looking into either of them without some idea of how much time we're
> >> actually spending on strbuf mallocs (or how much time we would spend if
> >> strbufs were used in some proposed sites).
> > 
> > So I tried to come up with some numbers.
> > 
> > Here's an utterly silly use of strbufs, but one that I think should
> > over-emphasize the effect of any improvements we make:
> [...]
> 
> Thanks for generating actual data.

Thanks for following up on this, and sorry for my slow response. It got
put in my "to think about and respond to later" pile (never a good place
to be :) ).

> Your benchmark could do two things to stress malloc()/free()
> even more. I'm not claiming that this makes the benchmark more typical
> of real-world use, but it maybe gets us closer to the theoretical upper
> limit on improvement.

Yeah, you're right. I added more writes to try to actually stimulate the
strbuf to grow, but if the point is to call malloc in a tight loop, it
works against that (I agree that malloc-in-a-tight-loop does not
necessarily represent a real workload, but I think the point of this
exercise is to make sure we can get a useful speedup even under
theoretical conditions).

> I ran this for a short string ("short") and a long string ("this is a
> string that we will repeatedly insert"), and also concatenating the
> string 5, 20, or 500 times. The number of loops around the whole program
> is normalized to make the total number of concatenations approximately
> constant. Here are the full results:

I suspect "5 short" and "5 long" are the only really interesting cases.
For the cases we're talking about, the author clearly expects most input
to fit into a single small-ish allocation (or else they would not bother
with the fixed strbuf in the first place). So "500 long" is really just
exercising memcpy.

> Test                    0      1      2      3      4
> 
> 5 short strings      1.64   3.37   8.72   6.08   3.65
> 20 short strings     1.72   2.12   5.43   4.01   3.39
> 500 short strings    1.62   1.61   3.36   3.26   3.10
> 5 long strings       2.08   6.64  13.09   8.50   3.79
> 20 long strings      2.16   3.33   7.03   4.72   3.55
> 500 long strings     2.04   2.10   3.61   3.33   3.26
> 
> 
> Column 0 is approximately the "bare metal" approach, with a
> pre-allocated buffer and no strbuf overhead.
> 
> Column 1 is like column 0, except allocating a correctly-sized buffer
> each time through the loop. This increases the runtime by as much as 219%.

Makes sense. malloc() isn't free (no pun intended).

And I think this shows why the 20/500 cases aren't all that interesting,
as we really just end up measuring memcpy.

> Column 2 is a naive use of strbuf, where each time through the loop the
> strbuf is reinitialized to STRBUF_INIT, and managing the space is
> entirely left to strbuf.

I'd think this would be about the same as column 1 for our interesting
cases, modulo some strbuf overhead. But it's way higher. That implies
that either:

  1. Strbuf overhead is high (and I think in a really tight loop like
 this that things like our overflow checks might start to matter; we
 could be doing those with intrinsics, for example).

  2. We're doing multiple mallocs per strbuf, which implies our initial
 sizes could be a bit more aggressive (it's not like the 30-byte
 "short" case is that huge).

> Column 3 is like column 2, except that it initializes the strbuf to the
> correct size right away using strbuf_init(). This reduces the runtime
> relative to column 2 by as much as 35%.

That should show us the effects of multiple-allocations (my (2) above).
And indeed, it counts for some of it but not all.

> Column 4 uses a simulated "fixed strbuf", where the fixed-size buffer is
> big enough for the full string (thus there are no calls to
> malloc()/realloc()/free()).
> 
> The comparison between columns 0 and 4 shows that using a strbuf costs
> as much as 123% more than using a simple char array, even if the strbuf
> doesn't have to do any memory allocations.

Yeah, I can believe there is overhead in all of the "are we big enough"
checks (though I'd hope that for the no-allocation cases we simply rely
on snprintf to tell us "yes, you fit").

> The comparison between columns 3 and 4 shows a reduction in
> runtime of up to 55% from using a "fixed strbuf" rather than a pre-sized
> conventional strbuf. I think this is the comparison that is most
> relevant to the current discussion.

I'm including a patch at the end of this mail that implements a "variant
5", which essentially keeps a single-buffer cache of the last-released
strbuf, and reuses it. Other than that, it is identical to 2 (no tricks,
not