Hi David,

On Wed, Jan 06, 2016 at 01:23:52PM +0000, David CARLIER wrote:
> Sure, it is mainly gcc 5.2/5.3, sometimes clang 3.6 depending the
> machine I was working on.

I'm back (late) on this patch series.

So at this point I'm seeing that the memmem() and ebmb_lookup() functions
are significantly affected by the change, even more so on low-register-count
machines such as 32-bit x86, where it adds extra stack manipulation because
the compiler has no way to know that the pointers are the same.

In the case of memmem(), the extra register usage changed the code to the
point of adding 4 local variables to the stack, causing pointers to be
read from the stack, modified, and stored back onto the stack, as you can
see in the "diff -y" output below showing the beginning of the function
(the patched one is on the left and the original one on the right). Stack
offsets 0x40, 0x44, 0x48 and 0x4c, which were previously unused, are now
used to store copies of general-purpose registers, which will definitely
hurt the function's efficiency:

000022e0 <my_memmem>:                                           000022e0 <my_memmem>:
    22e0:       55                      push   %ebp                 22e0:       55                      push   %ebp
    22e1:       57                      push   %edi                 22e1:       57                      push   %edi
    22e2:       56                      push   %esi                 22e2:       56                      push   %esi
    22e3:       53                      push   %ebx                 22e3:       53                      push   %ebx
    22e4:       83 ec 2c                sub    $0x2c,%esp     |     22e4:       83 ec 1c                sub    $0x1c,%esp
    22e7:       8b 4c 24 48             mov    0x48(%esp),%ec |     22e7:       8b 7c 24 38             mov    0x38(%esp),%ed
    22eb:       8b 7c 24 4c             mov    0x4c(%esp),%ed |     22eb:       8b 74 24 3c             mov    0x3c(%esp),%es
    22ef:       85 c9                   test   %ecx,%ecx      |     22ef:       85 ff                   test   %edi,%edi
    22f1:       74 68                   je     235b <my_memme |     22f1:       74 75                   je     2368 <my_memme
    22f3:       8b 54 24 40             mov    0x40(%esp),%ed |     22f3:       8b 4c 24 30             mov    0x30(%esp),%ec
    22f7:       85 d2                   test   %edx,%edx      |     22f7:       85 c9                   test   %ecx,%ecx
    22f9:       74 60                   je     235b <my_memme |     22f9:       74 6d                   je     2368 <my_memme
    22fb:       39 7c 24 44             cmp    %edi,0x44(%esp |     22fb:       39 74 24 34             cmp    %esi,0x34(%esp
    22ff:       72 5a                   jb     235b <my_memme |     22ff:       72 67                   jb     2368 <my_memme
    2301:       8b 44 24 48             mov    0x48(%esp),%ea |     2301:       8a 07                   mov    (%edi),%al
    2305:       31 d2                   xor    %edx,%edx      |     2303:       8b 54 24 30             mov    0x30(%esp),%ed
    2307:       8b 5c 24 40             mov    0x40(%esp),%eb |     2307:       25 ff 00 00 00          and    $0xff,%eax
    230b:       31 ed                   xor    %ebp,%ebp      |     230c:       89 c5                   mov    %eax,%ebp
    230d:       8a 10                   mov    (%eax),%dl     |     230e:       eb 27                   jmp    2337 <my_memme
    230f:       89 54 24 1c             mov    %edx,0x1c(%esp |     2310:       8b 44 24 30             mov    0x30(%esp),%ea
    2313:       eb 20                   jmp    2335 <my_memme |     2314:       8b 54 24 34             mov    0x34(%esp),%ed
    2315:       8d 76 00                lea    0x0(%esi),%esi |     2318:       29 d8                   sub    %ebx,%eax
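For context, the function under discussion implements a memmem()-style search
(find the first occurrence of one byte sequence inside another). Here's a
naive sketch of that contract; it is purely illustrative and not HAProxy's
actual my_memmem() implementation:

```c
#include <stddef.h>
#include <string.h>

/* Naive illustration of a memmem()-style search: find the first
 * occurrence of <needle> (needlelen bytes) inside <hay> (haylen bytes).
 * NOT HAProxy's my_memmem(); zero lengths and oversized needles simply
 * return NULL here, mirroring the early-exit tests at the top of the
 * disassembly above.
 */
static const void *naive_memmem(const void *hay, size_t haylen,
                                const void *needle, size_t needlelen)
{
	size_t i;

	if (!haylen || !needlelen || haylen < needlelen)
		return NULL;

	for (i = 0; i + needlelen <= haylen; i++)
		if (memcmp((const char *)hay + i, needle, needlelen) == 0)
			return (const char *)hay + i;
	return NULL;
}
```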

Regarding ebmb_lookup(), it's even worse: there is now one load and one
store on the stack in the fast path for each bit of the tree descent,
whereas it used to rely exclusively on registers even on such a machine.
I have not run any benchmarks, but it's possible that the change is even
measurable on geolocation maps and stick-tables.

So I'll see what I can come up with by using only explicit type casts
where relevant. Normally it should not matter; at worst it would disable
strict aliasing, but we already disable that anyway.
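To illustrate the direction, here's a hypothetical sketch (not the actual
patch, and the function names are made up) of the two styles: going through
extra typed local pointers versus casting at the point of use. Each extra
live pointer is one more value competing for registers; whether the compiler
actually spills depends on the target and optimization level.

```c
#include <stddef.h>

/* Hypothetical sketch, not actual HAProxy code. Both functions compare
 * <len> bytes of two opaque buffers. */

/* Style 1: intermediate typed pointers. Each named pointer is an extra
 * live value for the compiler to track, which can force spills on
 * register-starved targets such as 32-bit x86 when it cannot prove the
 * copies equal the originals. */
static int cmp_with_locals(const void *a, const void *b, size_t len)
{
	const unsigned char *pa = a;
	const unsigned char *pb = b;
	size_t i;

	for (i = 0; i < len; i++)
		if (pa[i] != pb[i])
			return pa[i] - pb[i];
	return 0;
}

/* Style 2: explicit casts at the point of use. No new named pointers
 * are introduced, so only the original arguments need to stay alive. */
static int cmp_with_casts(const void *a, const void *b, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char ca = ((const unsigned char *)a)[i];
		unsigned char cb = ((const unsigned char *)b)[i];
		if (ca != cb)
			return ca - cb;
	}
	return 0;
}
```

Both variants have identical semantics; the point is only that the
cast-based form doesn't hand the optimizer additional pointer variables
whose relationship to the originals it would otherwise have to reason about.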

I'll keep you informed.

Best regards,
Willy

