Re: [fpc-devel] Prototype optimisation... Sliding Window

J. Gareth Moreton via fpc-devel Thu, 24 Feb 2022 21:09:05 -0800

I did it!

After a good week of work and getting things wrong, I finally found a solution that works nicely and is extensible, at least for x86. A bit of refactoring and it can be ported to other platforms. I'm just running the test suites to see if I can break things now. Honestly the hard part was making sure all the registers were tracked properly.


https://gitlab.com/CuriousKit/optimisations/-/commits/sliding-window

Please feel free to test it and break it!

It will likely undergo adjustments over time to refactor bits and pieces, add extra instructions (especially as it doesn't support SHRX and SHLX yet, for example) and see if I can make the sliding window more efficient - I increased it in size from 8 to 16 and then 32 entries, since even at 16, some optimisations were missed in the RTL, but this depends on a number of factors.

It's gotten some pretty good optimisations. On x86_64-win64 under -O4, the three lazarus binaries... before:


lazarus.exe:

313,990,206
313,873,982

lazbuild.exe

59,704,391
59,725,895

startlazarus.exe

27,471,147
27,461,419

And the compiler itself, ppcx64.exe:

3,777,536
3,766,272

Remember that though I call the component a "sliding window", it's only because it's very similar to how LZ77 finds start points for run-length encoding, and the comments and variable names mention run-length encoding as it scans sequences of identical instructions. However, at no point is it actually performing data compression, and the end result is common subexpression elimination. All the space savings are from the removal of redundant sequences of instructions, giving both a space and a speed boost.
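To make the LZ77 comparison concrete, here is a minimal Python sketch of a sliding-window search for repeated instruction runs. It is a toy under stated assumptions, not the code in the branch: instructions are plain strings, `find_repeats` and its `min_len` threshold are hypothetical names, and it does no register tracking at all, so it only reports candidate runs that a real pass would still have to prove redundant.

```python
def find_repeats(instrs, window=32, min_len=2):
    """Toy LZ77-style search: for each position, look back at most
    `window` entries for the longest earlier run of identical
    instructions. Returns (start, earlier_start, length) tuples.
    Hypothetical helper; it ignores register and memory state."""
    matches = []
    i = 0
    while i < len(instrs):
        best_len, best_src = 0, -1
        for src in range(max(0, i - window), i):
            length = 0
            # Extend the run while both copies agree and do not overlap.
            while (i + length < len(instrs) and src + length < i
                   and instrs[src + length] == instrs[i + length]):
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len >= min_len:
            matches.append((i, best_src, best_len))
            i += best_len
        else:
            i += 1
    return matches


# Instruction strings taken from the dbgdwarf "Before" listing in this e-mail.
BEFORE = [
    "leaq (,%r13,8),%rcx",
    "movq 120(%rdi),%rax",
    "cqto",
    "idivq %rcx",
    "imulq %r13,%rax",
    "movq %rax,%r12",
    "addq 56(%rbp),%r12",
    "leaq (,%r13,8),%rcx",
    "movq 120(%rdi),%rax",
    "cqto",
    "idivq %rcx",
    "movq %rdx,%rsi",
    "cmpb $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)",
]

print(find_repeats(BEFORE))            # → [(7, 0, 4)]
print(find_repeats(BEFORE, window=4))  # → []
```

The second call shows why the window size matters: with only 4 entries of history, the repeated run starting 7 instructions earlier is out of reach and nothing is found.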

Almost every source file in the compiler and the RTL shows some kind of improvement. A lot of them are just redundant pointer deallocations, so this will help with cache misses and the like - that aside though, here are a couple of my favourites... one from dbgdwarf - TDebugInfoDwarf.appendsym_fieldvar_with_name_offset:


Before:

    ...
.Lj682:
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    imulq    %r13,%rax
    movq    %rax,%r12
    addq    56(%rbp),%r12
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    movq    %rdx,%rsi
    cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

After:

    ...
.Lj682:
    leaq    (,%r13,8),%rcx
    movq    120(%rdi),%rax
    cqto
    idivq    %rcx
    imulq    %r13,%rax
    movq    %rax,%r12
    addq    56(%rbp),%r12
    movq    %rdx,%rsi
    cmpb    $0,U_$SYSTEMS_$$_TARGET_INFO+276(%rip)

This one has successfully removed an IDIV instruction because the algorithm was able to detect that the subroutine wanted the remainder in %rdx, and it was still available from the first IDIV because it hadn't been overwritten, and neither %r13 nor %rdi had changed values so the references are the same.
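The IDIV removal above can be checked with a small Python model of the two listings. This is only a sketch: `mem120` and `mem56` are hypothetical stand-ins for the values at 120(%rdi) and 56(%rbp), memory is assumed unchanged between the two loads, and 64-bit wraparound is ignored.

```python
def idiv(dividend, divisor):
    """Truncating signed division like x86 IDIV: (quotient, remainder)."""
    q = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        q = -q
    return q, dividend - q * divisor


def before(r13, mem120, mem56):
    rcx = r13 * 8                  # leaq (,%r13,8),%rcx
    rax, rdx = idiv(mem120, rcx)   # movq / cqto / idivq
    rax = rax * r13                # imulq %r13,%rax
    r12 = rax + mem56              # movq %rax,%r12 ; addq 56(%rbp),%r12
    rcx = r13 * 8                  # second, redundant leaq
    rax, rdx = idiv(mem120, rcx)   # second, redundant movq / cqto / idivq
    rsi = rdx                      # movq %rdx,%rsi
    return r12, rsi


def after(r13, mem120, mem56):
    rcx = r13 * 8
    rax, rdx = idiv(mem120, rcx)
    rax = rax * r13                # overwrites %rax but leaves %rdx alone
    r12 = rax + mem56
    rsi = rdx                      # remainder from the first IDIV is reused
    return r12, rsi


print(before(4, 1000, 7), after(4, 1000, 7))  # → (131, 8) (131, 8)
```

The model makes the key condition visible: the two-operand IMULQ writes only %rax, so the remainder left in %rdx by the first division survives until it is needed.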

This one is from SysUtils' IsLeapYear function and is one I have personally wanted to optimise further ever since I first saw its disassembly after I implemented the fast "x mod const = 0" algorithm.


Before:

    ...
    imulw    $23593,%cx,%ax
    rorw    $2,%ax
    cmpw    $655,%ax
    ja    .Lj6979
    imulw    $23593,%cx,%ax
    rorw    $4,%ax
    cmpw    $163,%ax
    setbeb    %al
    ret
.Lj6979:
    ...

After:

    ...
    imulw    $23593,%cx,%ax
    rorw    $2,%ax
    cmpw    $655,%ax
    ja    .Lj6979
    rorw    $2,%ax
    cmpw    $163,%ax
    setbeb    %al
    ret
.Lj6979:
    ...

In this case, the RLE doesn't produce an exact match since the end result of %ax is different, but the important point to note is that as long as %cx doesn't change value (which it doesn't), the result of the first sequence can be transformed into the result of the second sequence via a "rorw $2,%ax" instruction, so the end result is that "imulw $23593,%cx,%ax; rorw $4,%ax" is transmuted into "rorw $2,%ax" based on the previous value of %ax, thus removing a multiplication.
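The equivalence is easy to confirm exhaustively for 16-bit values. The Python sketch below models both listings; the function names are hypothetical, `None` stands for the branch to .Lj6979, and the only facts relied on are that 23593 is the multiplicative inverse of 25 modulo 2^16 (25 * 23593 = 589825 = 9 * 65536 + 1) and that two rotations by 2 compose into one rotation by 4.

```python
def ror16(value, count):
    """Rotate a 16-bit value right by `count` bits, like RORW."""
    value &= 0xFFFF
    count %= 16
    return ((value >> count) | (value << (16 - count))) & 0xFFFF


INV25 = 23593  # 25 * 23593 == 1 (mod 2**16)


def check_before(year):
    ax = ror16(year * INV25, 2)    # imulw $23593,%cx,%ax ; rorw $2,%ax
    if ax > 655:                   # cmpw $655,%ax ; ja .Lj6979
        return None                # not a multiple of 100
    ax = ror16(year * INV25, 4)    # imulw again ; rorw $4,%ax
    return ax <= 163               # cmpw $163,%ax ; setbeb %al


def check_after(year):
    ax = ror16(year * INV25, 2)
    if ax > 655:                   # CMPW does not modify %ax
        return None
    ax = ror16(ax, 2)              # rorw $2,%ax on the previous value
    return ax <= 163


print(check_after(2000), check_after(1900), check_after(2024))  # → True False None
```

On the fall-through path the result is True exactly when the year is a multiple of 400, which is the same answer the original pair of multiplications gave.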

Because this has been a mammoth undertaking and is quite a major addition to the compiler, I'm going to hold off making a merge request until I write a document on it, like I've done in the past with some of the other larger optimisations I've made. The branch is available at the link at the top of this e-mail though if anyone wants to take a look at it.

One final note is that this optimisation is rather slow and specialist, so much so that I've added a new optimizer switch named 'cs_opt_asmcse' ("Assembly CSE", to differentiate it from "Node CSE"); this is enabled by default with -O3, and requires the peephole optimizer to also be enabled in order to function.


Gareth aka. Kit

