Prepare for a lot of technical rambling!

This is just an analysis of the compilation of utf8lentest.lpr, not any of the System units.  Notably, POPCNT isn't called directly, but instead goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x and 3.3.1.  A future study of "fpc_popcnt_qword" might yield some interesting information.

From what I've seen so far, "for i := 1 to cnt do" is implemented differently, although not in a way that makes it slower, and there are a few extra MOV instructions that shouldn't cause a speed decrease thanks to parallel execution.  Similarly, alignment hints now take the form of ".p2align 4,,10; .p2align 3" rather than ".balign 8,0x90".  The new form aligns code to a 16-byte boundary when that needs no more than 10 bytes of padding and to an 8-byte boundary otherwise, and it pads with multi-byte NOP instructions rather than a run of single-byte NOP instructions (machine code 0x90), although there's every chance I configured things incorrectly.
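
To make the difference between the two alignment styles concrete, here is a minimal Python sketch of the padding arithmetic.  It assumes the GNU assembler meaning of ".p2align" (first argument is the power of two, third argument is the maximum number of padding bytes, beyond which that directive does nothing).  The function names are made up for illustration and nothing here is taken from the compiler itself.

```python
def p2align(addr, power, max_skip=None):
    """Model one .p2align directive: return the address after padding."""
    boundary = 1 << power
    pad = -addr % boundary          # bytes needed to reach the next boundary
    if max_skip is not None and pad > max_skip:
        return addr                 # too much padding needed: directive is skipped
    return addr + pad


def new_style(addr):
    """.p2align 4,,10 followed by .p2align 3 (as seen in the 3.3.1 dump)."""
    return p2align(p2align(addr, 4, 10), 3)


def old_style(addr):
    """.balign 8,0x90 (as seen in the 3.2.x dump): always align to 8 bytes."""
    return p2align(addr, 3)


if __name__ == "__main__":
    for addr in (0x1003, 0x1006, 0x1009):
        print(hex(addr), hex(old_style(addr)), hex(new_style(addr)))
```

Under these assumptions, an address that is 13 bytes short of a 16-byte boundary only gets aligned to 8 bytes, while one that is 10 bytes short gets the full 16-byte alignment.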

The difficult part is seeing through the different registers, since 3.3.1 is assigning registers differently to 3.2.x.  I've attached the two dumps if you want to compare them yourself (built under -O4).

From what I can observe, the performance loss seems to be due to changes in code generation causing more pipeline stalls rather than any particular peephole optimisation.  There is room for improvement though - for example, both compilers have snippets of code akin to the following:

    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r11
    andq    %r11,%r10
    movq    %r12,%r11
    movq    $71777214294589695,%r13
    andq    %r13,%r11
    leaq    (%r10,%r11),%r12

In this situation, one can replace %r11 with %r13 in the 3rd and 4th instructions and drop the second load, so the massive constant isn't assigned twice:

    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r13
    andq    %r13,%r10
    movq    %r12,%r11
    andq    %r13,%r11
    leaq    (%r10,%r11),%r12

And if %r13 isn't used again in its current form (in this particular snippet, it isn't... it gets assigned a new constant a few instructions later), one can do some clever rearranging:

    movq    %r12,%r10
    shrq    $8,%r10
    movq    $71777214294589695,%r11
    andq    %r11,%r10
    andq    %r12,%r11
    leaq    (%r10,%r11),%r12

Granted, this is finding new peephole optimisations rather than fixing the pipeline stalls, but I can see quite a few places where a rearrangement of registers can produce smaller and more efficient code by removing redundant transfers.
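
As a sanity check on the three snippets above, the following Python sketch emulates each one with 64-bit wraparound and confirms that they leave the same value in %r12.  The constant 71777214294589695 is 0x00FF00FF00FF00FF, so the sequence adds each odd-numbered byte to its even-numbered neighbour in 16-bit lanes.  The function names are hypothetical, and the variables simply mirror the register names in the listings.

```python
MASK64 = (1 << 64) - 1
M = 71777214294589695  # 0x00FF00FF00FF00FF


def original(r12):
    """Sequence as emitted: the constant is loaded twice."""
    r10 = r12
    r10 >>= 8
    r11 = M
    r10 &= r11
    r11 = r12
    r13 = M
    r11 &= r13
    return (r10 + r11) & MASK64      # leaq (%r10,%r11),%r12


def shared_constant(r12):
    """Constant loaded once into %r13 and reused."""
    r10 = r12
    r10 >>= 8
    r13 = M
    r10 &= r13
    r11 = r12
    r11 &= r13
    return (r10 + r11) & MASK64


def folded(r12):
    """Constant register reused as the second operand's destination."""
    r10 = r12
    r10 >>= 8
    r11 = M
    r10 &= r11
    r11 &= r12                       # andq %r12,%r11
    return (r10 + r11) & MASK64


if __name__ == "__main__":
    for x in (0, 1, 0x0123456789ABCDEF, MASK64):
        assert original(x) == shared_constant(x) == folded(x)
    print("all three sequences agree")
```

This only checks the arithmetic.  Whether %r13 and %r11 are really free to be reused at that point still has to come from the compiler's register liveness information.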

There is a slight inefficiency in a nested function prologue - under 3.2.x, it's this:

.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"x"
    .balign 16,0x90
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
    pushq    %rbp
.seh_pushreg %rbp
    movq    %rcx,%rbp
    leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
    leaq    -8(%rbp),%rcx
    call    fpc_ansistr_decr_ref
    leaq    -24(%rbp),%rcx
    call    fpc_ansistr_decr_ref

... whereas under 3.3.1...

.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"ax"
    .balign 16,0x90
.globl    P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008:
.seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008
    pushq    %rbp
.seh_pushreg %rbp
    movq    %rcx,%rbp
    leaq    -32(%rsp),%rsp
.seh_stackalloc 32
.seh_endprologue
    subq    $24,%rcx
    call    fpc_ansistr_decr_ref
    leaq    -8(%rbp),%rcx
    call    fpc_ansistr_decr_ref

In this case, the one in 3.2.x is slightly better and could be improved by removing "movq %rcx,%rbp" completely, since the value is never used and is not recorded in the SEH directives.  I think this may be due to code I put in the peephole optimizer that avoids playing around with the function prologue.  Two other differences: the string references are decremented in the opposite order (3.2.x does -8(%rbp) first, whereas 3.3.1 does -24(%rbp) first), and 3.2.x doesn't emit the .globl symbol.

For actual reduction of pipeline stalls, I'm not quite sure how smart the instruction scheduling inside Intel CPUs is, or whether rearranging instructions would actually improve throughput.  For example, on 3.2.x, there are these four instructions that configure parameters and preserve %rax:

    movq    %rax,%rsi
    movl    -12(%rbp),%edx
    movl    $2,%r8d
    leaq    -24(%rbp),%rcx

All's okay so far, but on 3.3.1, the ordering is different:

    movq    %rax,%rsi
    movl    -12(%rbp),%edx
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d

If one assumes the instructions are executed in order, there's a greater risk of a pipeline stall if there's only one AGU (Address Generation Unit) available.  In that case, rearranging the instructions as follows might produce better results with regard to port allocation:

    movl    -12(%rbp),%edx
    movq    %rax,%rsi
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d

Other times, instruction rearrangement can aid the processor when it's waiting for a register to be ready.  For example:

    movslq    -12(%rbp),%rax
    subq    %rax,%rdx
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d

Moving "subq %rax,%rdx" as late as possible minimises the pipeline stall, because the load into %rax gets more time to complete before its result is needed:

    movslq    -12(%rbp),%rax
    leaq    -24(%rbp),%rcx
    movl    $2,%r8d
    subq    %rax,%rdx

Then one can do port rebalancing again like before:

    movslq    -12(%rbp),%rax
    movl    $2,%r8d
    leaq    -24(%rbp),%rcx
    subq    %rax,%rdx
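
To illustrate why delaying the dependent instruction helps, here is a toy Python model of an in-order, single-issue pipeline that counts the cycles spent waiting for a source register.  The latencies (4 cycles for a load, 1 cycle for everything else) are assumptions chosen for illustration and do not describe any particular CPU.  A real out-of-order core would likely hide much of this.

```python
# Assumed latencies in cycles; purely illustrative.
LATENCY = {"load": 4, "alu": 1, "lea": 1}


def count_stalls(program):
    """Count stall cycles on a toy in-order, single-issue core.

    Each instruction is (kind, destination, sources).  An instruction
    cannot issue until all of its source registers are ready.
    """
    ready = {}    # register -> cycle at which its value becomes available
    cycle = 0     # earliest cycle at which the next instruction may issue
    stalls = 0
    for kind, dest, sources in program:
        issue = max([cycle] + [ready.get(reg, 0) for reg in sources])
        stalls += issue - cycle
        ready[dest] = issue + LATENCY[kind]
        cycle = issue + 1
    return stalls


# movslq -12(%rbp),%rax ; subq %rax,%rdx ; leaq -24(%rbp),%rcx ; movl $2,%r8d
as_emitted = [
    ("load", "rax", ["rbp"]),
    ("alu", "rdx", ["rax", "rdx"]),
    ("lea", "rcx", ["rbp"]),
    ("alu", "r8", []),
]

# Same instructions with the subq moved to the end.
subq_last = [
    ("load", "rax", ["rbp"]),
    ("lea", "rcx", ["rbp"]),
    ("alu", "r8", []),
    ("alu", "rdx", ["rax", "rdx"]),
]

if __name__ == "__main__":
    print("as emitted:", count_stalls(as_emitted), "stall cycles")
    print("subq last: ", count_stalls(subq_last), "stall cycles")
```

With these assumed latencies the emitted order waits three cycles on %rax and the reordered version waits one, because the two independent instructions fill part of the load's latency.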

---

TL;DR: The code itself doesn't appear to be any worse than before, but different register assignments seem to create more redundant MOV instructions.  This one will be interesting to experiment on.  I may have to develop a register renaming system for the peephole optimizer and a means of deallocating registers when things get moved around (generally, deallocating registers is a little unsafe, so care has to be taken).  Also, POPCNT is handled in the System unit and is not inlined.  (The zero-input special case, where the function has to return 255 because the instruction sets the zero flag and leaves the result undefined, applies to BSF and BSR rather than POPCNT, which simply returns 0 for an input of zero.)

Gareth aka. Kit



<<attachment: utf8lentest-comparison.zip>>

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
