Re: [fpc-devel] Producing assembly with less branches?
It might still be possible to detect some patterns cheaply - I'l see what I can come up with. Also, any successful reduction in the number of passes in the peephole optimizer will result in a speed-up that may offset any expensive checks (and the most expensive ones can be reserved for -O3 and -O4). Gareth aka. Kit On 20/07/2020 19:03, Florian Klämpfl wrote: Am 19.07.20 um 23:37 schrieb Stefan Glienke: Still kinda disappointing compared to what it could be - while this is some simple code a modern compiler should try to eliminate conditional jumps even with the incredibly powerful branch predictors nowadays. clang and gcc emit this - I would guess they detect quite some common patterns like this. The price for this are huge compilation times. Having a design to able to detect such patterns and actually detecting them, simply takes time. FPC compiling itself with the LLVM backend is approx. 10 times slower than FPC with it's native backend. On average code however, the code generated by the LLVM backend is only 10 % to 20 % faster. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Producing assembly with less branches?
Am 19.07.20 um 23:37 schrieb Stefan Glienke: Still kinda disappointing compared to what it could be - while this is some simple code a modern compiler should try to eliminate conditional jumps even with the incredibly powerful branch predictors nowadays. clang and gcc emit this - I would guess they detect quite some common patterns like this. The price for this are huge compilation times. Having a design to able to detect such patterns and actually detecting them, simply takes time. FPC compiling itself with the LLVM backend is approx. 10 times slower than FPC with it's native backend. On average code however, the code generated by the LLVM backend is only 10 % to 20 % faster. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Producing assembly with less branches?
Am 20.07.2020 um 02:37 schrieb J. Gareth Moreton: On 19/07/2020 22:37, Stefan Glienke wrote: clang and gcc emit this - I would guess they detect quite some common patterns like this. ... cmp eax, edx mov edx, -1 setg al movzx eax, al cmovl eax, edx ret I think I can make improvements to that already! (Note the sequence above and below are in Intel notation) CMP EAX, EDX MOV EAX, 0 ; Note: don't use XOR EAX, EAX because this scrambles the FLAGS register MOV EDX, -1 SETG AL CMOVL EAX, EDX RET I believe that executes one cycle faster (20% faster for the entire sequence) on modern processors because it shortens the dependency chain that exists between "SETG AL; MOVZX EAX, AL; CMOVL EAX, EDX". It might require some testing though to be sure. That is what clang does (the first snippet I posted) by using ecx for the 0 and it does so with the shorter xor before the cmp which results in 16bytes of code - gcc is 17, yours 19. Anyhow they all don't differ in execution speed but are 2.5 times faster than the double cmp and cond jump galore ;) -- Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Producing assembly with less branches?
On 19/07/2020 22:37, Stefan Glienke wrote: clang and gcc emit this - I would guess they detect quite some common patterns like this. ... cmp eax, edx mov edx, -1 setg al movzx eax, al cmovl eax, edx ret I think I can make improvements to that already! (Note the sequence above and below are in Intel notation) CMP EAX, EDX MOV EAX, 0 ; Note: don't use XOR EAX, EAX because this scrambles the FLAGS register MOV EDX, -1 SETG AL CMOVL EAX, EDX RET I believe that executes one cycle faster (20% faster for the entire sequence) on modern processors because it shortens the dependency chain that exists between "SETG AL; MOVZX EAX, AL; CMOVL EAX, EDX". It might require some testing though to be sure. The difficulties with CMOV is that it can only write to registers (and not 8-bit ones) and can read from memory addresses, but not write to them. If there are registers free at that point in the code though, one could potentially write the constants to temporary registers beforehand, and then assign them to the registers that matter via CMOV (e.g. as shown above with the -1 value). I'm all for improving the generated assembly language where I can. There are some traps that one has to be careful of though, usually involving false dependencies. For example, when setting registers to -1, some compilers would use "OR EAX, -1" instead of "MOV EAX, -1" on account of it taking fewer bytes to encode. Both Visual C++ and GCC did this at one point, but this causes a false dependency with the previous value of EAX so would incur a performance penalty. The final thing to remember is that, by default, i386 will produce code that will run on the oldest 80386 processors. CMOV was only introduced with the Intel Pentium Pro in 1995. If compiling for x86_64, or if you specify compiler parameters to set the minimum processor support, then CMOV will be used. (It also just made me realise that Pass 2 of the peephole optimiser would not work with virtual registers because of CMOV's restriction in that it can't write to memory addresses, including the stack) Gareth aka. Kit ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Producing assembly with less branches?
- Reply to message - Subject: [fpc-devel] Producing assembly with less branches? From: Stefan Glienke To: Hi, not sure if anything significantly changed in trunk compared to 3.2 wrt to optimized code being generated but I am quite disappointed that fpc (checked win64 with -O3 and -O4) does not use cmovxx instructions and alike for the most basic things and produces terrible code like this: unit1.pas:49 if left < right then 00010002E3C0 39ca cmp edx,ecx 00010002E3C2 7e06 jle 0x10002e3ca unit1.pas:50 Result := -1 00010002E3C4 b8 mov eax,0x 00010002E3C9 c3 ret unit1.pas:51 else if left > right then 00010002E3CA 39ca cmp edx,ecx 00010002E3CC 7d06 jge 0x10002e3d4 unit1.pas:52 Result := 1 00010002E3CE b80100c3 mov eax,0x1 unit1.pas:54 Result := 0; 00010002E3D4 31c0 xor eax,eax unit1.pas:55 end; 00010002E3D6 c3 ret Similar for even simpler things: unit1.pas:43 if i < 0 then 00010002E3A1 85c0 test eax,eax 00010002E3A3 7d03 jge 0x10002e3a8 unit1.pas:44 i := 0; 00010002E3A5 31c0 xor eax,eax 00010002E3A7 90 nop Imo someone should work at that and make the compiler produce less branches. Not sure if that is on your list but it should be looked at. it's already done in trunk (sadly not in 3.2.0) to get cmov instruction emitted, has to meet two conditions 1) if statement without else part 2) assign value of variable (not constant). your code has to look like to benefit from cmov function cmov2(left, right : longint):longint; var l1,lf: longint; r : longint; begin l1:=1; lf:=-1; r:=0; if left > right then begin r:=lf; end;// else if left < right then begin r:=l1; end;// else r:=0; cmov2:=r; end; 00400370b9 01 00 00 00mov ecx,0001h 00400375ba ff ff ff ffmov edx,0h 0040037a31 c0 xor eax,eax 0040037c39 fe cmp esi,edi 0040037e0f 4c c2 cmovl eax,edx 0040038139 fe cmp esi,edi 004003830f 4f c1 cmovnle eax,ecx 00400386c3ret Still kinda disappointing compared to what it could be - while this is some simple code a modern compiler should try to eliminate conditional jumps even with the incredibly powerful branch predictors nowadays. clang and gcc emit this - I would guess they detect quite some common patterns like this. xor ecx, ecx cmp eax, edx mov eax, -1 setg cl cmovge eax, ecx ret cmp eax, edx mov edx, -1 setg al movzx eax, al cmovl eax, edx ret -- Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Producing assembly with less branches?
Am 19.07.20 um 19:24 schrieb Stefan Glienke: Hi, not sure if anything significantly changed in trunk compared to 3.2 wrt to optimized code being generated but I am quite disappointed that fpc (checked win64 with -O3 and -O4) does not use cmovxx instructions and alike for the most basic things and produces terrible code like this: cmov does not support constant operands, so FPC does not use it. unit1.pas:49 if left < right then 00010002E3C0 39ca cmp edx,ecx 00010002E3C2 7e06 jle 0x10002e3ca unit1.pas:50 Result := -1 00010002E3C4 b8 mov eax,0x 00010002E3C9 c3 ret unit1.pas:51 else if left > right then 00010002E3CA 39ca cmp edx,ecx 00010002E3CC 7d06 jge 0x10002e3d4 unit1.pas:52 Result := 1 00010002E3CE b80100c3 mov eax,0x1 unit1.pas:54 Result := 0; 00010002E3D4 31c0 xor eax,eax unit1.pas:55 end; 00010002E3D6 c3 ret Similar for even simpler things: unit1.pas:43 if i < 0 then 00010002E3A1 85c0 test eax,eax 00010002E3A3 7d03 jge 0x10002e3a8 unit1.pas:44 i := 0; 00010002E3A5 31c0 xor eax,eax 00010002E3A7 90 nop Imo someone should work at that and make the compiler produce less branches. Not sure if that is on your list but it should be looked at. It might be in the list which contains enough work for >10 years :) ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Producing assembly with less branches?
- Reply to message - Subject: [fpc-devel] Producing assembly with less branches? From: Stefan Glienke To: > Hi, > not sure if anything significantly changed in trunk compared to 3.2 wrt > to optimized code being generated but I am quite disappointed that fpc > (checked win64 with -O3 and -O4) does not use cmovxx instructions and > alike for the most basic things and produces terrible code like this: > unit1.pas:49 if left < right then > 00010002E3C0 39ca cmp edx,ecx > 00010002E3C2 7e06 jle 0x10002e3ca > unit1.pas:50 Result := -1 > 00010002E3C4 b8 mov eax,0x > 00010002E3C9 c3 ret > unit1.pas:51 else if left > right then > 00010002E3CA 39ca cmp edx,ecx > 00010002E3CC 7d06 jge 0x10002e3d4 > unit1.pas:52 Result := 1 > 00010002E3CE b80100c3 mov eax,0x1 > unit1.pas:54 Result := 0; > 00010002E3D4 31c0 xor eax,eax > unit1.pas:55 end; > 00010002E3D6 c3 ret > Similar for even simpler things: > unit1.pas:43 if i < 0 then > 00010002E3A1 85c0 test eax,eax > 00010002E3A3 7d03 jge 0x10002e3a8 > > unit1.pas:44 i := 0; > 00010002E3A5 31c0 xor eax,eax > 00010002E3A7 90 nop > Imo someone should work at that and make the compiler produce less > branches. Not sure if that is on your list but it should be looked at. it's already done in trunk (sadly not in 3.2.0) to get cmov instruction emitted, has to meet two conditions 1) if statement without else part 2) assign value of variable (not constant). your code has to look like to benefit from cmov function cmov2(left, right : longint):longint; var l1,lf: longint; r : longint; begin l1:=1; lf:=-1; r:=0; if left > right then begin r:=lf; end;// else if left < right then begin r:=l1; end;// else r:=0; cmov2:=r; end; 00400370b9 01 00 00 00mov ecx,0001h 00400375ba ff ff ff ffmov edx,0h 0040037a31 c0 xor eax,eax 0040037c39 fe cmp esi,edi 0040037e0f 4c c2 cmovl eax,edx 0040038139 fe cmp esi,edi 004003830f 4f c1 cmovnle eax,ecx 00400386c3ret ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
[fpc-devel] Producing assembly with less branches?
Hi, not sure if anything significantly changed in trunk compared to 3.2 wrt to optimized code being generated but I am quite disappointed that fpc (checked win64 with -O3 and -O4) does not use cmovxx instructions and alike for the most basic things and produces terrible code like this: unit1.pas:49 if left < right then 00010002E3C0 39ca cmp edx,ecx 00010002E3C2 7e06 jle 0x10002e3ca unit1.pas:50 Result := -1 00010002E3C4 b8 mov eax,0x 00010002E3C9 c3 ret unit1.pas:51 else if left > right then 00010002E3CA 39ca cmp edx,ecx 00010002E3CC 7d06 jge 0x10002e3d4 unit1.pas:52 Result := 1 00010002E3CE b80100c3 mov eax,0x1 unit1.pas:54 Result := 0; 00010002E3D4 31c0 xor eax,eax unit1.pas:55 end; 00010002E3D6 c3 ret Similar for even simpler things: unit1.pas:43 if i < 0 then 00010002E3A1 85c0 test eax,eax 00010002E3A3 7d03 jge 0x10002e3a8 unit1.pas:44 i := 0; 00010002E3A5 31c0 xor eax,eax 00010002E3A7 90 nop Imo someone should work at that and make the compiler produce less branches. Not sure if that is on your list but it should be looked at. -- Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft. https://www.avast.com/antivirus ___ fpc-devel maillist - fpc-devel@lists.freepascal.org https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel