On 28.08.2012 20:30, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakan...@enterprisedb.com>  writes:
>> Drilling into the profile, I came up with three little optimizations:
>>
>> 1. Within spgdoinsert, a significant portion of the CPU time is spent on
>> line 2033 in spgdoinsert.c:
>>
>> memset(&out, 0, sizeof(out));
>>
>> That zeroes out a small struct allocated in the stack. Replacing that
>> with MemSet() makes it faster, reducing the time spent on zeroing that
>> struct from 10% to 1.5% of the time spent in spgdoinsert(). That's not
>> very much in the big scheme of things, but it's a trivial change so
>> seems worth it.
>
> Fascinating.  I'd been of the opinion that modern compilers would inline
> memset() for themselves and MemSet was probably not better than what the
> compiler could do these days.  What platform are you testing on?

x64, gcc 4.7.1, running Debian.
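
For context, MemSet is the macro in c.h that, for small long-aligned regions being set to zero, writes the zeros with a plain word-at-a-time loop instead of calling memset(). Below is a simplified sketch of that idea, not the real macro: demo_zero, DemoOut and DEMO_LOOP_LIMIT are made-up names for illustration, and the cutoff value is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the small struct being zeroed: six 8-byte words. */
typedef struct DemoOut
{
    uint64_t    words[6];
} DemoOut;

/* Assumed cutoff above which the loop is not used; illustrative only. */
#define DEMO_LOOP_LIMIT 1024

/*
 * Simplified MemSet-style zeroing: if the start address and the length are
 * both multiples of sizeof(long) and the length is small, store zeros one
 * long at a time; otherwise fall back to memset().
 */
static inline void
demo_zero(void *start, size_t len)
{
    const size_t mask = sizeof(long) - 1;

    if ((((uintptr_t) start) & mask) == 0 &&
        (len & mask) == 0 &&
        len <= DEMO_LOOP_LIMIT)
    {
        long       *p = (long *) start;
        long       *stop = (long *) ((char *) start + len);

        while (p < stop)
            *p++ = 0;
    }
    else
        memset(start, 0, len);
}
```

When the length is a compile-time constant such as sizeof(out), the compiler can unroll a loop like this into individual stores, which is the shape of the MemSet output shown next.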

The assembly generated for the MemSet is:

        .loc 1 2033 0 discriminator 3
        movq    $0, -432(%rbp)
.LVL166:
        movq    $0, -424(%rbp)
.LVL167:
        movq    $0, -416(%rbp)
.LVL168:
        movq    $0, -408(%rbp)
.LVL169:
        movq    $0, -400(%rbp)
.LVL170:
        movq    $0, -392(%rbp)

while the corresponding memset code is:

        .loc 1 2040 0 discriminator 6
        xorl    %eax, %eax
        .loc 1 2042 0 discriminator 6
        cmpb    $0, -669(%rbp)
        .loc 1 2040 0 discriminator 6
        movq    -584(%rbp), %rdi
        movl    $6, %ecx
        rep stosq

In fact, with -mstringop-strategy=unrolled_loop, I can coerce gcc to produce code similar to the MemSet version:

        movq    %rax, -440(%rbp)
        .loc 1 2040 0 discriminator 6
        xorl    %eax, %eax
.L254:
        movl    %eax, %edx
        addl    $32, %eax
        cmpl    $32, %eax
        movq    $0, -432(%rbp,%rdx)
        movq    $0, -424(%rbp,%rdx)
        movq    $0, -416(%rbp,%rdx)
        movq    $0, -408(%rbp,%rdx)
        jb      .L254
        leaq    -432(%rbp), %r9
        addq    %r9, %rax
        .loc 1 2042 0 discriminator 6
        cmpb    $0, -665(%rbp)
        .loc 1 2040 0 discriminator 6
        movq    $0, (%rax)
        movq    $0, 8(%rax)

I'm not sure why gcc doesn't choose that by default. Perhaps which variant is faster is CPU-specific; I was quite surprised that MemSet was such a clear win on my laptop. Or maybe it's a speed-space tradeoff and gcc chooses the more compact version, although using -O3 instead of -O2 made no difference.
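
If anyone wants to compare the code generation on their own compiler, something like the following should do as a starting point. It is only a sketch: ProbeOut is a hypothetical 48-byte struct (six quadwords, matching the "movl $6, %ecx" count above), not the real struct from spgdoinsert.c, and the function names are made up. The output will vary with compiler version and tuning options, and newer compilers may well turn both functions into the same code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Compile to assembly and compare the two functions, for example:
 *   gcc -O2 -S probe.c
 *   gcc -O2 -mstringop-strategy=unrolled_loop -S probe.c
 */

/* Hypothetical 48-byte struct standing in for the one being zeroed. */
typedef struct ProbeOut
{
    uint64_t    words[6];
} ProbeOut;

/* Variant 1: let the compiler expand the library call as it sees fit. */
void
zero_with_memset(ProbeOut *out)
{
    memset(out, 0, sizeof(*out));
}

/* Variant 2: explicit word-sized stores, in the spirit of MemSet. */
void
zero_with_stores(ProbeOut *out)
{
    size_t      i;

    for (i = 0; i < sizeof(out->words) / sizeof(out->words[0]); i++)
        out->words[i] = 0;
}
```

Both functions have external linkage, so each keeps its own body in the generated assembly and the two can be read side by side.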

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

