Fwd: [Lazarus] Faster than popcnt

J. Gareth Moreton via fpc-devel Mon, 03 Jan 2022 16:07:08 -0800

Prepare for a lot of technical rambling!

This is just an analysis of the compilation of utf8lentest.lpr, not any of the System units. Notably, POPCNT isn't called directly, but instead goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x and 3.3.1. A future study of "fpc_popcnt_qword" might yield some interesting information.

From what I've seen so far, "for i := 1 to cnt do" is implemented differently, although not in a way that makes it slower, and there are a few extra MOV instructions that shouldn't cause a speed decrease thanks to parallel execution. Similarly, alignment hints now take the form of ".p2align 4,,10; .p2align 3" rather than ".balign 8,0x90", which aligns code to a 16-byte boundary rather than an 8-byte boundary and uses multi-byte NOP instructions rather than a large number of single-byte NOP instructions (machine code 0x90), although there's every chance I configured things incorrectly.
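As a rough illustration of the new alignment hints, here is a minimal Python sketch of how much padding the GNU assembler inserts for ".p2align 4,,10; .p2align 3". The third argument of .p2align is the maximum number of bytes to skip: if reaching the boundary would need more, that directive does nothing. The function names are made up for this sketch and are not part of FPC or binutils.

```python
def p2align_padding(addr, power, max_skip=None):
    """Padding bytes emitted at addr for `.p2align power,,max_skip`."""
    pad = -addr % (1 << power)      # bytes needed to reach the boundary
    if max_skip is not None and pad > max_skip:
        return 0                    # too much padding: directive is skipped
    return pad


def combined_padding(addr):
    """Padding for `.p2align 4,,10` followed by `.p2align 3` (hypothetical helper)."""
    first = p2align_padding(addr, 4, 10)
    second = p2align_padding(addr + first, 3)
    return first + second


if __name__ == "__main__":
    for addr in (0x1000, 0x1005, 0x1006):
        print(hex(addr), "->", hex(addr + combined_padding(addr)))
```

So the pair of directives gives 16-byte alignment when it costs at most 10 bytes of padding, and falls back to 8-byte alignment otherwise.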

The difficult part is seeing through the different registers, since 3.3.1 is assigning registers differently to 3.2.x. I've attached the two dumps if you want to compare them yourself (built under -O4).

From what I can observe, the performance loss seems to be due to changes in code generation causing more pipeline stalls rather than any particular peephole optimisation. There is room for improvement though; for example, both compilers have snippets of code akin to the following:


    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r11
    andq    %r11,%r10
    movq    %r12,%r11
    movq    $71777214294589695,%r13
    andq    %r13,%r11
    leaq    (%r10,%r11),%r12

In this situation, one can replace %r11 with %r13 in the 3rd and 4th instructions so the massive constant isn't assigned twice:


    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r13
    andq    %r13,%r10
    movq    %r12,%r11
    andq    %r13,%r11
    leaq    (%r10,%r11),%r12
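To make the rewrite above concrete, here is a toy Python sketch of such a peephole rule, applied to the snippet as a list of (opcode, source, destination) tuples. It is only an illustration under strong assumptions: registers are matched as plain strings, sub-register aliasing and flags are ignored, and the function name is invented; it is not how the FPC peephole optimizer is written.

```python
def share_constant_load(code):
    """Merge two movq loads of the same immediate into one, if safe here."""
    for i, (op, imm, first) in enumerate(code):
        if op != "movq" or not imm.startswith("$"):
            continue
        for j in range(i + 1, len(code)):
            if code[j][:2] != ("movq", imm) or code[j][2] == first:
                continue
            second = code[j][2]
            between = code[i + 1:j]
            # The second register must be untouched between the two loads.
            if any(second in s or second in d for _, s, d in between):
                continue
            # The first register must be overwritten by a plain movq before
            # the second load, and only be read as a source until then.
            k = next((n for n, (o, s, d) in enumerate(between)
                      if o == "movq" and d == first and first not in s), None)
            if k is None or any(first in d for _, _, d in between[:k]):
                continue
            renamed = [(o, s.replace(first, second), d)
                       for o, s, d in between[:k]]
            return (code[:i] + [(op, imm, second)] + renamed
                    + between[k:] + code[j + 1:])
    return list(code)


ORIGINAL = [
    ("movq", "%r12", "%r10"),
    ("shrq", "$8", "%r10"),
    ("movq", "$71777214294589695", "%r11"),
    ("andq", "%r11", "%r10"),
    ("movq", "%r12", "%r11"),
    ("movq", "$71777214294589695", "%r13"),
    ("andq", "%r13", "%r11"),
    ("leaq", "(%r10,%r11)", "%r12"),
]

if __name__ == "__main__":
    for insn in share_constant_load(ORIGINAL):
        print("    %s    %s,%s" % insn)
```

Run on ORIGINAL, the pass moves the constant load into %r13, renames the one read of %r11 that followed it, and drops the second load, giving the seven-instruction sequence shown above.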

And if %r13 isn't used again in its current form (in this particular snippet, it isn't... it gets assigned a new constant a few instructions later), one can do some clever rearranging:


    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r11
    andq    %r11,%r10
    andq    %r12,%r11
    leaq    (%r10,%r11),%r12
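As a sanity check on the rearrangement, here is a small Python model of the original eight-instruction sequence and the six-instruction one, with each variable standing in for the register of the same name. The constant 71777214294589695 is 0x00FF00FF00FF00FF, so the code adds each odd byte to its even neighbour and leaves the sums in the four 16-bit lanes. The function names are purely illustrative.

```python
MASK64 = (1 << 64) - 1
BYTE_MASK = 71777214294589695      # 0x00FF00FF00FF00FF


def original_sequence(r12):
    r10 = r12                      # movq %r12,%r10
    r10 >>= 8                      # shrq $8,%r10
    r11 = BYTE_MASK                # movq $71777214294589695,%r11
    r10 &= r11                     # andq %r11,%r10
    r11 = r12                      # movq %r12,%r11
    r13 = BYTE_MASK                # movq $71777214294589695,%r13
    r11 &= r13                     # andq %r13,%r11
    return (r10 + r11) & MASK64    # leaq (%r10,%r11),%r12


def rearranged_sequence(r12):
    r10 = r12                      # movq %r12,%r10
    r10 >>= 8                      # shrq $8,%r10
    r11 = BYTE_MASK                # movq $71777214294589695,%r11
    r10 &= r11                     # andq %r11,%r10
    r11 &= r12                     # andq %r12,%r11
    return (r10 + r11) & MASK64    # leaq (%r10,%r11),%r12


if __name__ == "__main__":
    value = 0x0102030405060708
    print(hex(original_sequence(value)), hex(rearranged_sequence(value)))
```

The rearranged form works because AND is commutative: masking the constant in %r11 with %r12 gives the same result as copying %r12 into %r11 and masking it with the constant.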

Granted this is finding new peephole optimisations rather than fixing the pipeline stalls, but I can see quite a few places where a rearrangement of registers can produce smaller and more efficient code by removing redundant transfers.

There is a slight inefficiency in a nested function prologue - under 3.2.x, it's this:


.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"x"
    .balign 16,0x90
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
    pushq    %rbp
.seh_pushreg %rbp
    movq    %rcx,%rbp
    leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
    leaq    -8(%rbp),%rcx
    call    fpc_ansistr_decr_ref
    leaq    -24(%rbp),%rcx
    call    fpc_ansistr_decr_ref

... whereas under 3.3.1...

.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"ax"
    .balign 16,0x90
.globl    P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
    pushq    %rbp
.seh_pushreg %rbp
    movq    %rcx,%rbp
    leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
    subq    $24,%rcx
    call    fpc_ansistr_decr_ref
    leaq    -8(%rbp),%rcx
    call    fpc_ansistr_decr_ref

In this case, the one in 3.2.x is slightly better and could be improved by removing "movq %rcx,%rbp" completely, since the value is never used and is not recorded in the SEH directives (also, the string objects are decremented in the opposite order, with 3.2.x doing -8(%rbp) first, but 3.3.1 doing -24(%rbp) first instead). I think this may be due to me putting in code in the peephole optimizer that avoids playing around with the function prologue. (3.2.x doesn't have the .globl symbol though)

For actual reduction of pipeline stalls, I'm not quite sure how smart the instruction scheduling in Intel CPUs is and whether rearranging instructions would actually improve throughput or not. For example, on 3.2.x, there are these four instructions that configure parameters and preserve %rax:


    movq    %rax,%rsi
    movl    -12(%rbp),%edx
    movl    $2,%r8d
    leaq    -24(%rbp),%rcx

All's okay so far, but on 3.3.1, the ordering is different:

    movq    %rax,%rsi
    movl    -12(%rbp),%edx
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d

If one assumes the instructions are executed in-order, there's a greater risk of a pipeline stall if there's only one AGU (Address Generation Unit) available, in which case, rearranging the instructions to the following might produce better results with regard to port allocation:


    movl    -12(%rbp),%edx
    movq    %rax,%rsi
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d

Other times, instruction rearrangement can aid the processor when it's waiting for a register to be ready. For example:


    movslq    -12(%rbp),%rax
    subq    %rax,%rdx
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d

Making the "subq %rax,%rdx" as late as possible will minimise the pipeline stall:


    movslq    -12(%rbp),%rax
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d
    subq    %rax,%rdx

Then one can do port rebalancing again like before:

    movslq    -12(%rbp),%rax
    movl    $2,%r8d
    leaq    -24(%rbp),%rcx
    subq    %rax,%rdx
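The benefit of delaying the dependent "subq" can be illustrated with a toy in-order, single-issue model in Python. The latencies (4 cycles for the load, 1 cycle for everything else) are assumptions chosen for the example, not measured figures, and real Intel cores execute out of order, so this only shows the trend under the in-order assumption made above. Port pressure is not modelled at all.

```python
LOAD_LATENCY = 4   # assumed latency of a load from memory
ALU_LATENCY = 1    # assumed latency of register-only instructions


def cycles_in_order(code):
    """Cycle at which the last result is ready, issuing one insn per cycle."""
    ready = {}     # register name -> cycle at which its value is available
    cycle = 0      # earliest cycle at which the next instruction may issue
    finished = 0
    for sources, dest, latency in code:
        # An instruction waits until all of its source registers are ready.
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        ready[dest] = issue + latency
        finished = max(finished, ready[dest])
        cycle = issue + 1
    return finished


MOVSLQ = (("rbp",), "rax", LOAD_LATENCY)       # movslq -12(%rbp),%rax
SUBQ = (("rax", "rdx"), "rdx", ALU_LATENCY)    # subq %rax,%rdx
LEAQ = (("rbp",), "rcx", ALU_LATENCY)          # leaq -24(%rbp),%rcx
MOVL = ((), "r8", ALU_LATENCY)                 # movl $2,%r8d

if __name__ == "__main__":
    print("subq early:", cycles_in_order([MOVSLQ, SUBQ, LEAQ, MOVL]))
    print("subq late: ", cycles_in_order([MOVSLQ, LEAQ, MOVL, SUBQ]))
```

In this model the independent LEA and MOV fill the cycles during which SUBQ would otherwise be waiting for %rax, so the sequence finishes earlier when SUBQ comes last.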

---

TL;DR: The code itself doesn't appear to be any worse than before, but different register assignments seem to create more redundant MOV instructions. This one will be interesting to experiment on. I may have to develop a register renaming system for the peephole optimizer and a means of deallocating registers when things get moved around (generally, deallocating registers is a little unsafe, so care has to be taken). Also, POPCNT is handled in the System unit and is not inlined, mainly because of how an input of zero is handled (the function has to return 255 in the result, whereas Intel's POPCNT instruction sets the zero flag and leaves the result undefined).


Gareth aka. Kit



<<attachment: utf8lentest-comparison.zip>>
