[fpc-devel] Optimisation question

J. Gareth Moreton via fpc-devel Mon, 30 Oct 2023 14:11:05 -0700

Hi everyone,

I'm still exploring optimisations in generated x86 code, something which has become my speciality, and I found one new potential optimisation sequence that aims to reduce unnecessary calls to CMP and TEST when the result is already known. However, there are some situations where I'm not sure if the end result is better or not. For example (taken from the cgiprotocol unit):


.Lj5:
    subl    $1,%esi
.Lj6:
    testl    %esi,%esi
    jng    .Lj9
    ...
    testl    %eax,%eax
    jne    .Lj5
.Lj9:
    movl    $-1,%eax
    testl    %esi,%esi
    cmovlel    %eax,%esi
    movl    %esi,%eax

In this instance, if the first TEST instruction results in "jng .Lj9" branching, it soon calls another TEST instruction with the same register, which will not have changed value, so CMOVLE will definitely set %esi to %eax (equal to -1) because "le" is equal to "ng" ("less than or equal" versus "not greater"), and then %esi is written back to %eax. All in all, this is a dependency chain of length 3. When my new optimisation takes place, the following is generated instead:


.Lj5:
    subl    $1,%esi
.Lj6:
    testl    %esi,%esi
    jng    .Lj9
    ...
    testl    %eax,%eax
    jne    .Lj5
    testl    %esi,%esi
    jnle    .Lj12
.Lj9:
    movl    $-1,%esi
.Lj12:
    movl    %esi,%eax

Here, a number of other optimisations don't take place (hence why the CMOV instruction no longer exists), but now if "jng .Lj9" branches, it sets %esi to -1 directly before writing the result to %eax... a dependency chain of just 2. However, there is also an extra conditional jump, which can no longer be optimised into CMOV because of the position of the .Lj9 label.
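As a rough sanity check that the two tails compute the same value, here is a small Python model of them. It is only a sketch: plain Python integers stand in for the signed 32-bit value in %esi, the elided "..." loop body is not modelled, and the function names and the via_jng flag (which entry path was taken) are made up for illustration.

```python
def tail_before(esi):
    """Original tail at .Lj9, reached by either path."""
    eax = -1            # movl    $-1,%eax
    if esi <= 0:        # testl   %esi,%esi / cmovlel %eax,%esi
        esi = eax
    eax = esi           # movl    %esi,%eax
    return eax


def tail_after(esi, via_jng):
    """New tail. via_jng is True when "jng .Lj9" branched (so esi <= 0),
    False when control fell through after "jne .Lj5" was not taken."""
    if not via_jng:
        # testl %esi,%esi / jnle .Lj12 (only on the fall-through path)
        if esi > 0:
            return esi  # .Lj12: movl %esi,%eax
    esi = -1            # .Lj9: movl $-1,%esi
    return esi          # .Lj12: movl %esi,%eax


def tails_agree(values):
    """Compare both tails for every value and every reachable entry path."""
    for esi in values:
        for via_jng in (True, False):
            if via_jng and esi > 0:
                continue  # "jng" cannot branch when esi is positive
            if tail_before(esi) != tail_after(esi, via_jng):
                return False
    return True
```

Calling tails_agree(range(-5, 6)) compares the two versions over a handful of negative, zero and positive inputs. It says nothing about which one is faster, only that they return the same result.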

The question is... is the new code better, about the same or worse? On most modern processors, TEST/Jcc can be macro-fused, but jumps always cause some penalty.

There are cases where the optimisation gives clear benefits - for example, in the cclasses unit - before:


    movq    %rcx,%rdx
    testq    %rcx,%rcx
    je    .Lj382
    movq    -8(%rcx),%rdx
.Lj382:
    testq    %rcx,%rcx
    jne    .Lj383
    leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
    call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

After:

    movq    %rcx,%rdx
    testq    %rcx,%rcx
    je    .Lj382
    movq    -8(%rcx),%rdx
    jmp    .Lj383
.Lj382:
    leaq    FPC_EMPTYCHAR(%rip),%rcx
.Lj383:
    call    CCLASSES_$$_FPHASH$PCHAR$LONGINT$$LONGWORD

In this case, if "je .Lj382" branches, then %rcx is definitely zero and so control flow can safely skip straight to the LEA instruction, since the "jne .Lj383" instruction will definitely not branch. On the other hand, if "je .Lj382" does not branch, then %rcx is definitely non-zero and so the second "testq %rcx,%rcx" is once again deterministic, meaning "jne .Lj383" will definitely branch, therefore the second TEST instruction can be removed completely and the conditional jump turned into an unconditional jump. The end result is that the number of jumps hasn't changed (exactly one jump will be taken regardless of the value of %rcx), but the instruction count has been reduced (note that %rdx can't be optimised out because its value is used as an actual parameter for the CALL).
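The jump and instruction counts described above can be tallied with a small Python model of the two sequences, up to but not including the CALL. This is a sketch under assumptions: EMPTYCHAR is a made-up address standing in for FPC_EMPTYCHAR, and load_at is a hypothetical stub for the memory read at -8(%rcx).

```python
EMPTYCHAR = 0x1000  # hypothetical address standing in for FPC_EMPTYCHAR


def load_at(addr):
    """Hypothetical stub for the read at -8(%rcx); returns a dummy value."""
    return 42


def before(rcx):
    """Original sequence. Returns (rcx, rdx, jumps_taken, instructions_executed)."""
    rdx = rcx                       # movq  %rcx,%rdx
    executed = 3                    # movq + testq + je
    taken = 0
    if rcx == 0:
        taken += 1                  # je .Lj382 branches
    else:
        rdx = load_at(rcx - 8)      # movq  -8(%rcx),%rdx
        executed += 1
    executed += 2                   # second testq + jne
    if rcx != 0:
        taken += 1                  # jne .Lj383 branches
    else:
        rcx = EMPTYCHAR             # leaq  FPC_EMPTYCHAR(%rip),%rcx
        executed += 1
    return rcx, rdx, taken, executed


def after(rcx):
    """Optimised sequence. Returns the same tuple as before()."""
    rdx = rcx                       # movq  %rcx,%rdx
    executed = 3                    # movq + testq + je
    taken = 0
    if rcx == 0:
        taken += 1                  # je .Lj382 branches
        rcx = EMPTYCHAR             # leaq  FPC_EMPTYCHAR(%rip),%rcx
        executed += 1
    else:
        rdx = load_at(rcx - 8)      # movq  -8(%rcx),%rdx
        taken += 1                  # jmp .Lj383
        executed += 2               # movq + jmp
    return rcx, rdx, taken, executed
```

For both a zero and a non-zero %rcx, the two functions return the same register values and the same number of taken jumps, while after() executes fewer instructions. The counts only tally instructions, so they do not capture macro-fusion or branch prediction effects.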


Kit

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
