Hi,

On 2026-02-25 13:59:53 +1300, David Rowley wrote:
> On Wed, 25 Feb 2026 at 03:39, Andres Freund <[email protected]> wrote:
> > ISTM we should just merge 0004. In my testing it's a very clear win,
> > without, afaict, any downsides.
>
> I'd like to get them in in sequence as I believe 0004 buys back some
> extra overheads such as the Min()s in slot_deform_heap_tuple(). If I
> were to do 0004 first, then wait a while, it might look more like I'm
> introducing a small regression.

I'd rather get more stuff merged out of the way. We can handle a small
regression by comparing to 18 instead.  I think we tend to be too worried
about intra-master regressions, and not worried enough about inter-branch
regressions.

But FWIW, I don't actually see any regressions due to the Min() on my hardware
as the patches stand.


I do actually see a small regression in some cases due to 0004, but that's
easy to fix:

        /* Fetch any missing attrs and raise an error if reqnatts is invalid */
        if (unlikely(attnum < reqnatts))
                slot_getmissingattrs(slot, attnum, reqnatts);
done:

        /* Save current offset for next execution */
        *offp = off;


This forces pushing a bunch of stack state before the call to
slot_getmissingattrs, doing the call, returning, restoring state from the
stack, just to then jump to the end of the function to set *offp, which then
again does a bunch of stack restoring and returns.


If you instead make it something like:
        /* Fetch any missing attrs and raise an error if reqnatts is invalid */
        if (unlikely(attnum < reqnatts))
        {
                *offp = off;
                slot_getmissingattrs(slot, attnum, reqnatts);
                return;
        }
(or just duplicate the *offp = off in the goto done; case)

the overhead seems to be removed.



At least gcc is doing some truly weird shit in the
firstNonGuaranteed/firstNonCachedOffsetAttr loop "header" (i.e. just before
the first entrance to the loop), which leads to the register pressure being
high, which in turn leads to spilling on the stack, making the few-tuples
case slower:


r9 is tts_nvalid, r11 is firstNonGuaranteedAttr, r15 is
slot->tts_tupleDescriptor, rbx is tts_values, -0x10(%rsp) is tts_isnull, rax
is tup.


            if (attnum < firstNonGuaranteedAttr)
            cmp    %r9d,%r11d
            jle    <tts_buffer_heap_getsomeattrs+0x3c0>
            lea    0x0(,%r9,8),%r12

    multiply r9/tts_nvalid by 8, store in r12

            sub    %r9d,%r11d

    r11=firstNonGuaranteedAttr-tts_nvalid

            xor    %ecx,%ecx

    set ecx to 0

            lea    0x20(%r15,%r12,1),%rdx

    compute (char *) tupleDesc->compact_attrs + r12, i.e. the address of
    the first needed compact attribute

            add    %rbx,%r12

    compute (char*) tts_values + tts_nvalid * sizeof(Datum)

            mov    -0x10(%rsp),%rbx

    restore tts_isnull from stack

            lea    (%rbx,%r9,1),%rsi

    compute tts_isnull[tts_nvalid]

            jmp    <tts_buffer_heap_getsomeattrs+0xee>


Then the loop body:
    ...
            movb          $0x0,(%rsi,%rcx,1)
    set (tts_isnull + tts_nvalid)[i] = 0

            mov           %rdx,%r13
            movswl        (%rdx),%edi

    load cattr->attcacheoff into rdi/edi

            movzwl        0x2(%rdx),%r10d

    load attlen into r10

            mov           %rdi,%rbx
            add           %rax,%rdi

    rdi = tup + attcacheoff

    [bunch of attlen related branches omitted]

            movslq        (%rdi),%rdi
    store *(tup + off) in rdi

            mov           %rdi,(%r12,%rcx,8)
    store rdi in tts_values[i]

            lea           0x1(%rcx),%rdi

    increment i by one, in a weird way, store in rdi

            add           $0x8,%rdx

    increment cattr by 8

            cmp           %rdi,%r11

    check if i < (firstNonGuaranteedAttr-tts_nvalid)


I.e. the compiler creates an offset version of tts_values[tts_nvalid],
tts_isnull[tts_nvalid], which then creates register allocation pressure,
because later the original tts_values/tts_isnull etc. are accessed again and
thus the underlying registers are preserved.  And this is all for zero gain,
from what I can tell, because the accesses are still done with indexed
addressing (like mov %rdi,(%r12,%rcx,8)), which would work just as well if
rcx were indexed based on attnum, not zero-indexed within the loop.

I see about a 10% improvement if I dissuade the compiler from doing that by
adding
  __asm__ volatile ("" : "+r"(attnum) : :);
in the loop body.


I'm getting to the point where I'd like to just hand write the assembler for
this stupid function. Gah.



> > > I'm not getting great results from benchmarking the 0005 patch. I
> > > verified that gcc does access the array without calculating the
> > > element address from scratch each time and calculates it once, then
> > > increments the pointer by sizeof(CompactAttribute). See the attached
> > > .csv for the results on the 3 machines I tested on.
> >
> > FWIW, where I had seen that be rather beneficial is the
> > TupleDescCompactAttr() at the start of the various loops, where the
> > compiler has little choice but to compute the address of the
> > tupdesc->compact_attrs[firstNeededCol].  That matters only when only
> > deforming a small number of columns, of course.
>
> oh ok. I wasn't aware that LEA's scaling factor can only be 1,2 4 or
> 8. With the 8-byte struct, the compiler should be able to do the shift
> and add as one operation, whereas with the 16-byte struct would
> require a separate shift and add.
>
> Looking at the generated code, with 0004, I see:
>
>     1c79: 48 c1 e2 04          shl    rdx,0x4
>     1c7d: 48 8d 4c 15 20        lea    rcx,[rbp+rdx*1+0x20]
>
> whereas with 0005 I see:
>
>     1c6b: 4a 8d 1c dd 00 00 00 lea    rbx,[r11*8+0x0]
>
> Is that what you meant?

Yep.



> > > I've also resequenced the patches to make the deform_bench test module
> > > part of the 0001 patch. This makes it easier to test the performance
> > > of master.
> >
> > What are your thoughts about merging the deform_bench tooling?  I wonder
> > if we should have src/test/modules/benchmark_tools or such, so we can add
> > a few more micro-benchmarky tools over time?
>
> I'd like to see us give these tools a proper home. It helps lower the
> bar for anyone else who'd like to experiment at some future date, and
> also allows people to more easily test for performance regressions if
> they're forced to change related code. I've also got a tool that
> benchmarks the MemoryContext code which I keep in some local repo that
> I dig out from time to time. Given that, it's probably unlikely
> deform_bench would be the only extension in there if we did make a
> directory for these.

I'd probably lean towards one extension with different functions for
different purposes, but that's boring details.


> On the other hand, it does add some maintenance overhead, but IMO,
> helping to ensure various key routines are optimal is a worthy enough
> cause to make the maintenance overhead worthwhile.

Agreed.

Greetings,

Andres Freund
