I did it!

After a good week of work and getting things wrong, I finally found a solution that works nicely and is extensible, at least for x86.  A bit of refactoring and it can be ported to other platforms.  I'm just running the test suites to see if I can break things now.  Honestly the hard part was making sure all the registers were tracked properly.

https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window

Please feel free to test it and break it!

It will likely undergo adjustments over time to refactor bits and pieces, add extra instructions (especially as it doesn't support SHRX and SHLX yet, for example) and see if I can make the sliding window more efficient - I increased it in size from 8 to 16 and then 32 entries, since even at 16, some optimisations were missed in the RTL, but this depends on a number of factors.

It's gotten some pretty good optimisations.  On x86_64-win64 under -O4, the three lazarus binaries... before:

lazarus.exe:

313,990,206
313,873,982

lazbuild.exe

59,704,391
59,725,895

startlazarus.exe

27,471,147
27,461,419

And the compiler itself, ppcx64.exe:

3,777,536
3,766,272

Remember that though I call the component a "sliding window", it's only because it's very similar to how LZ77 finds start points for run-length encoding, and the comments and variable names mention run-length encoding as it scans sequences of identical instructions.  However, at no point is it actually performing data compression, and the end-result is common subexpression elimination.  All the space savings are from the removal of redundant sequences of instructions, giving both a space and a speed boost.

Almost every source file in the compiler and the RTL shows some kind of improvement.  A lot of them are just redundant pointer deallocations, so this will help with cache misses and the like - that aside though, here are a couple of my favourites... one from dbgdwarf - TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:

Before:

    ...
.Lj682:
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    imulq    %r13,%rax
    movq    %rax,%r12
    addq    56(%rbp),%r12
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    movq    %rdx,%rsi
    cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

After:

    ...
.Lj682:
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    imulq    %r13,%rax
    movq    %rax,%r12
    addq    56(%rbp),%r12
    movq    %rdx,%rsi
    cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

This one has successfully removed an IDIV instruction because the algorithm was able to detect that the subroutine wanted the remainder in %edx, and it was still available from the first IDIV call because it hadn't been overwritten, and neither %r13 nor %rdi had changed values so the references are the same.

This one is from SysUtils' IsLeapYear function and is one I have personally wanted to optimise further ever since I first saw its disassembly after I implemented the fast "x mod const = 0" algorithm.

Before:

    ...
    imulw    $23593,%cx,%ax
    rorw    $2,%ax
    cmpw    $655,%ax
    ja    .Lj6979
    imulw    $23593,%cx,%ax
    rorw    $4,%ax
    cmpw    $163,%ax
    setbeb    %al
    ret
.Lj6979:
    ...

After:

    ...
    imulw    $23593,%cx,%ax
    rorw    $2,%ax
    cmpw    $655,%ax
    ja    .Lj6979
    rorw    $2,%ax
    cmpw    $163,%ax
    setbeb    %al
    ret
.Lj6979:
    ...

In this case, the RLE doesn't produce an exact match since the end result of %ax is different, but the important point to note is that as long as %cx doesn't change value (which it doesn't). the first sequence can be transformed into the second sequence via a "rorw $2,%ax" instruction, so the end result is that "imulw $23593,%cx,%ax; rorw $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of %ax, thus removing a multiplication.

Because this has been a mammoth undertaking and is quite a major addition to the compiler, I'm going to hold off making a merge request until I write a document on it, like I've done in the past with some of the other larger optimisations I've made.  The branch is available at the link at the top of this e-mail though if anyone wants to take a look at it.

One final note is that this optimisation is rather slow and specialist, so much so that I've added a new optimizer switch named 'cs_opt_asmcse' ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled by default with -O3, and requires the peephole optimizer to also be enabled in order to function.

Gareth aka. Kit


--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel

Reply via email to