On Tue, Jun 07, 2016 at 05:36:59PM -0700, Junio C Hamano wrote:
> Jeff King <[email protected]> writes:
>
> > An alternative to this would be implement something like:
> >
> > struct tree *tp, tp_fallback[2];
> > if (nparent <= ARRAY_SIZE(tp_fallback))
> > tp = tp_fallback;
> > else
> > ALLOC_ARRAY(tp, nparent);
> > ...
> > if (tp != tp_fallback)
> > free(tp);
> >
> > That would let us drop our xalloca() portability code
> > entirely. But in my measurements, this seemed to perform
> > slightly worse than the xalloca solution.
>
> It indeed is curious why this "obvious" alternative performed
> worse.
Yeah. I'd be happy if somebody wanted to prove me wrong here. It's
entirely possible I just did something stupid. I had wrapped up the
above in the FAST_ARRAY_ALLOC macro. It would waste the stack space when
we _did_ have to call malloc, but that should almost never happen in my
benchmark. It also allocates an extra slot on the stack for non-merge
commits. So I guess the wasted 24 bytes or whatever could have in
impact. It's also possible that the extra variable simply tickles the
optimizer in some way; I didn't look at the generated asm.
FWIW, here are the timings I had come up with for running "git log --raw
--no-abbrev --no-renames" on linux.git (the same benchmark from the
commit that originally added the alloca). The "time" output at the end
is best-of-five, with the "attempt" lines showing the wall-clock time
for each:
[original, with xalloca]
Attempt 1: 30.669
Attempt 2: 30.782
Attempt 3: 31.807
Attempt 4: 31.152
Attempt 5: 30.243
real 0m30.243s
user 0m30.112s
sys 0m0.128s
[xmalloc/free instead of xalloca]
Attempt 1: 31.306
Attempt 2: 31.814
Attempt 3: 31.138
Attempt 4: 31.787
Attempt 5: 32.23
real 0m31.138s
user 0m30.976s
sys 0m0.160s
[local var with fallback to xmalloc]
Attempt 1: 30.926
Attempt 2: 31.466
Attempt 3: 31.865
Attempt 4: 31.345
Attempt 5: 33.159
real 0m30.926s
user 0m30.804s
sys 0m0.120s
So the improvement here over even a naive malloc/free is _really_ small.
You'll note that the best-of-five for xmalloc is actually smaller than
the worst-case with xalloca. But the mean is still 2% better. The mean
for the "fallback" case aren't really any better (which makes me wonder
if I was somehow accidentally calling malloc each time).
So I dunno. For 2%, I was tempted to simply say "screw it, let's just
forget this micro-optimization and call xmalloc". This patch is the
conservative choice in that it has the same performance profile as the
original. Or so I thought. I didn't record the old timings, but they
looked similar to the original. I just recreated them while writing this
email and got:
[xalloca with fallback to xmalloc]
Attempt 1: 31.356
Attempt 2: 31.725
Attempt 3: 31.454
Attempt 4: 30.898
Attempt 5: 31.937
real 0m30.898s
user 0m30.752s
sys 0m0.144s
which really isn't much better than the local-var case. I wonder if just
having the conditional stack/heap is what kills us (either itself, or
because it affects the optimizer).
I dunno.
> > +#define FAST_ARRAY_ALLOC(x, nr) do { \
> > + if ((nr) <= 2) \
> > + (x) = xalloca((nr) * sizeof(*(x))); \
> > + else \
> > + ALLOC_ARRAY((x), nr); \
> > +} while(0)
> > +#define FAST_ARRAY_FREE(x, nr) do { \
> > + if ((nr) > 2) \
> > + free((x)); \
> > +} while(0)
>
> An obvious and clean implementation of the idea.
>
> The only slight worry I have is that nr must stay constant until the
> time the caller calls FREE(), but for the only three callsite pairs
> it is clear nparent will stay constant and this is local to the
> file.
Yep, I had the same worry. I think it's OK because the damage is limited
to tree-diff.c. I'm not sure I would promote this macro for general use.
-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html