Prepare for a lot of technical rambling!This is just an analysis of the compilation of utf8lentest.lpr, not any of the System units. Notably, POPCNT isn't called directly, but instead goes through the System unit via "call fpc_popcnt_qword" on both 3.2.x and 3.3.1. A future study of "fpc_popcnt_qword" might yield some interesting information.
From what I've seen so far, "for i := 1 to cnt do" is implemented differently, although not in a way that makes it slower, and there are a few extra MOV instructions that shouldn't cause a speed decrease thanks to parallel execution. Similarly, alignment hints now take the form of ".p2align 4,,10; .p2align 3" rather than ".balign 8,0x90", which aligns code to a 16-byte boundary rather than an 8-byte boundary and uses multi-byte NOP instructions rather than a large number of single-byte NOP instructions (machine code 0x90), although there's every chance I configured things incorrectly.
The difficult part is seeing through the different registers, since 3.3.1 is assigning registers differently to 3.2.x. I've attached the two dumps if you want to compare them yourself (built under -O4).
From what I can observe, the performance loss seems to be due to changes in code generation causing more pipeline stalls rather than any particular peephole optimisation. There is room for improvement though - for example, both compilers have snippets of code akin to the following:
movq %r12,%r10 shrq $8,%r10 movq $71777214294589695,%r11 andq %r11,%r10 movq %r12,%r11 movq $71777214294589695,%r13 andq %r13,%r11 leaq (%r10,%r11),%r12In this situation, one can replace %r11 with %r13 in the 3rd and 4th instructions so the massive constant isn't assigned twice:
movq %r12,%r10 shrq $8,%r10 movq $71777214294589695,%r13 andq %r13,%r10 movq %r12,%r11 andq %r13,%r11 leaq (%r10,%r11),%r12And if %r13 isn't used again in its current form (in this particular snippet, it isn't... it gets assigned a new constant a few instructions later), one can do some clever rearranging:
movq %r12,%r10 shrq $8,%r10 movq $71777214294589695,%r11 andq %r11,%r10 andq %r12,%r11 leaq (%r10,%r11),%r12Granted this is finding new peephole optimisations rather than fixing the pipeline stalls, but I can see quite a few places where a rearrangement of registers can produce smaller and more efficient code by removing redundant transfers.
There is a slight inefficiency in a nested function prologue - under 3.2.x, it's this:
.section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"x" .balign 16,0x90 P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008: .seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008 pushq %rbp .seh_pushreg %rbp movq %rcx,%rbp leaq -32(%rsp),%rsp .seh_stackalloc 32 .seh_endprologue leaq -8(%rbp),%rcx call fpc_ansistr_decr_ref leaq -24(%rbp),%rcx call fpc_ansistr_decr_ref ... whereas under 3.3.1... .section .text.n_p$program$_$testasmutf8length_$$_fin$00000008,"ax" .balign 16,0x90 .globl P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008 P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008: .seh_proc P$PROGRAM$_$TESTASMUTF8LENGTH_$$_fin$00000008 pushq %rbp .seh_pushreg %rbp movq %rcx,%rbp leaq -32(%rsp),%rsp .seh_stackalloc 32 .seh_endprologue subq $24,%rcx call fpc_ansistr_decr_ref leaq -8(%rbp),%rcx call fpc_ansistr_decr_refIn this case, the one in 3.2.x is slightly better and could be improved by removing "movq %rcx,%rbp" completely, since the value is never used and is not recorded in the SEH directives (also, the string objects are decremented in the opposite order, with 3.2.x doing -8(%rbp) first, but 3.3.1 doing -24(%rbp) first instead). I think this may be due to me putting in code in the peephole optimizer that avoids playing around with the function prologue. (3.2.x doesn't have the .globl symbol though)
For actual reduction of pipeline stalls, I'm not quite sure how smart the Intel CPU firmware is and if rearranging instructions would actually improve throughput or not. For example, on 3.2.x, there are these four instructions that configure parameters and preserve %rax:
movq %rax,%rsi movl -12(%rbp),%edx movl $2,%r8d leaq -24(%rbp),%rcx All's okay so far, but on 3.3.1, the ordering is different: movq %rax,%rsi movl -12(%rbp),%edx leaq -24(%rbp),%rcx movl $2,%r8dIf one assumes the instructions are executed in-order, there's a greater risk of a pipeline stall if there's only one AGU (Address Generation Unit) available, in which case, rearranging the instructions to the following might produce better results in regards to port allocation:
movl -12(%rbp),%edx movq %rax,%rsi leaq -24(%rbp),%rcx movl $2,%r8dOther times, instruction rearrangement can aid the processor when it's waiting for a register to be ready. For example:
movslq -12(%rbp),%rax subq %rax,%rdx leaq -24(%rbp),%rcx movl $2,%r8dMaking the "subq %rax,%rdx" as late as possible will minimise the pipeline stall:
movslq -12(%rbp),%rax leaq -24(%rbp),%rcx movl $2,%r8d subq %rax,%rdx Then one can do port rebalancing again like before: movslq -12(%rbp),%rax movl $2,%r8d leaq -24(%rbp),%rcx subq %rax,%rdx ---TL;DR: The code itself doesn't appear to be any worse than before, but different register assignments seem to create more redundant MOV instructions. This one will be interesting to experiment on. I may have to develop a register renaming system for the peephole optimizer and a means of deallocating registers when things get moved around (generally, deallocating registers is a little unsafe, so care has to be taken). Also, POPCNT is handled in the System unit and is not inlined, mainly because of how an input of zero is handled (the function has to return 255 in the result, whereas Intel's POPCNT instruction sets the zero flag and leaves the result undefined).
Gareth aka. Kit -- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus
<<attachment: utf8lentest-comparison.zip>>
_______________________________________________ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel