Re: [fpc-devel] Producing assembly with less branches?

2020-07-21 Thread J. Gareth Moreton
It might still be possible to detect some patterns cheaply - I'l see 
what I can come up with.


Also, any successful reduction in the number of passes in the peephole 
optimizer will result in a speed-up that may offset any expensive checks 
(and the most expensive ones can be reserved for -O3 and -O4).


Gareth aka. Kit

On 20/07/2020 19:03, Florian Klämpfl wrote:

Am 19.07.20 um 23:37 schrieb Stefan Glienke:
Still kinda disappointing compared to what it could be - while this 
is some simple code a modern compiler should try to eliminate 
conditional jumps even with the incredibly powerful branch predictors 
nowadays.


clang and gcc emit this - I would guess they detect quite some common 
patterns like this.


The price for this are huge compilation times. Having a design to able 
to detect such patterns and actually detecting them, simply takes time.


FPC compiling itself with the LLVM backend is approx. 10 times slower 
than FPC with it's native backend. On average code however, the code 
generated by the LLVM backend is only 10 % to 20 % faster.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Producing assembly with less branches?

2020-07-20 Thread Florian Klämpfl

Am 19.07.20 um 23:37 schrieb Stefan Glienke:
Still kinda disappointing compared to what it could be - while this is 
some simple code a modern compiler should try to eliminate conditional 
jumps even with the incredibly powerful branch predictors nowadays.


clang and gcc emit this - I would guess they detect quite some common 
patterns like this.


The price for this are huge compilation times. Having a design to able 
to detect such patterns and actually detecting them, simply takes time.


FPC compiling itself with the LLVM backend is approx. 10 times slower 
than FPC with it's native backend. On average code however, the code 
generated by the LLVM backend is only 10 % to 20 % faster.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Producing assembly with less branches?

2020-07-19 Thread Stefan Glienke

Am 20.07.2020 um 02:37 schrieb J. Gareth Moreton:


On 19/07/2020 22:37, Stefan Glienke wrote:
clang and gcc emit this - I would guess they detect quite some common 
patterns like this.


 ...
  cmp eax, edx
  mov edx, -1
  setg    al
  movzx   eax, al
  cmovl   eax, edx
  ret


I think I can make improvements to that already! (Note the sequence 
above and below are in Intel notation)


CMP   EAX, EDX
MOV   EAX, 0 ; Note: don't use XOR EAX, EAX because this scrambles the 
FLAGS register

MOV   EDX, -1
SETG   AL
CMOVL EAX, EDX
RET

I believe that executes one cycle faster (20% faster for the entire 
sequence) on modern processors because it shortens the dependency 
chain that exists between "SETG AL; MOVZX EAX, AL; CMOVL EAX, EDX". It 
might require some testing though to be sure.


That is what clang does (the first snippet I posted) by using ecx for 
the 0 and it does so with the shorter xor before the cmp which results 
in 16bytes of code - gcc is 17, yours 19.


Anyhow they all don't differ in execution speed but are 2.5 times faster 
than the double cmp and cond jump galore ;)



--
Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Producing assembly with less branches?

2020-07-19 Thread J. Gareth Moreton


On 19/07/2020 22:37, Stefan Glienke wrote:
clang and gcc emit this - I would guess they detect quite some common 
patterns like this.


 ...
  cmp eax, edx
  mov edx, -1
  setg    al
  movzx   eax, al
  cmovl   eax, edx
  ret


I think I can make improvements to that already! (Note the sequence 
above and below are in Intel notation)


CMP   EAX, EDX
MOV   EAX, 0 ; Note: don't use XOR EAX, EAX because this scrambles the 
FLAGS register

MOV   EDX, -1
SETG   AL
CMOVL EAX, EDX
RET

I believe that executes one cycle faster (20% faster for the entire 
sequence) on modern processors because it shortens the dependency chain 
that exists between "SETG AL; MOVZX EAX, AL; CMOVL EAX, EDX". It might 
require some testing though to be sure.


The difficulties with CMOV is that it can only write to registers (and 
not 8-bit ones) and can read from memory addresses, but not write to 
them.  If there are registers free at that point in the code though, one 
could potentially write the constants to temporary registers beforehand, 
and then assign them to the registers that matter via CMOV (e.g. as 
shown above with the -1 value).


I'm all for improving the generated assembly language where I can.  
There are some traps that one has to be careful of though, usually 
involving false dependencies.  For example, when setting registers to 
-1, some compilers would use "OR EAX, -1" instead of "MOV EAX, -1" on 
account of it taking fewer bytes to encode.  Both Visual C++ and GCC did 
this at one point, but this causes a false dependency with the previous 
value of EAX so would incur a performance penalty.


The final thing to remember is that, by default, i386 will produce code 
that will run on the oldest 80386 processors.  CMOV was only introduced 
with the Intel Pentium Pro in 1995.  If compiling for x86_64, or if you 
specify compiler parameters to set the minimum processor support, then 
CMOV will be used.


(It also just made me realise that Pass 2 of the peephole optimiser 
would not work with virtual registers because of CMOV's restriction in 
that it can't write to memory addresses, including the stack)


Gareth aka. Kit

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Producing assembly with less branches?

2020-07-19 Thread Stefan Glienke

- Reply to message -

Subject: [fpc-devel] Producing assembly with less branches?
From:  Stefan Glienke 
To:  

Hi,
not sure if anything significantly changed in trunk compared to 3.2 wrt
to optimized code being generated but I am quite disappointed that fpc
(checked win64 with -O3 and -O4) does not use cmovxx instructions and
alike for the most basic things and produces terrible code like this:
unit1.pas:49  if left < right then
00010002E3C0 39ca cmp    edx,ecx
00010002E3C2 7e06 jle    0x10002e3ca 
unit1.pas:50  Result := -1
00010002E3C4 b8   mov    eax,0x
00010002E3C9 c3   ret
unit1.pas:51  else if left > right then
00010002E3CA 39ca cmp    edx,ecx
00010002E3CC 7d06 jge    0x10002e3d4 
unit1.pas:52  Result := 1
00010002E3CE b80100c3 mov    eax,0x1
unit1.pas:54  Result := 0;
00010002E3D4 31c0 xor    eax,eax
unit1.pas:55  end;
00010002E3D6 c3   ret
Similar for even simpler things:
unit1.pas:43  if i < 0 then
00010002E3A1 85c0 test   eax,eax
00010002E3A3 7d03 jge    0x10002e3a8

unit1.pas:44  i := 0;
00010002E3A5 31c0 xor    eax,eax
00010002E3A7 90   nop
Imo someone should work at that and make the compiler produce less
branches. Not sure if that is on your list but it should be looked at.

it's already done in trunk (sadly not in 3.2.0)
to get cmov instruction emitted, has to meet two conditions
1) if statement without else part
2) assign value of variable (not constant).

your code has to look like to benefit from cmov

function cmov2(left, right : longint):longint;
var l1,lf: longint;
  r : longint;
begin
  l1:=1;
  lf:=-1;
  r:=0;
  if left > right then
  begin
   r:=lf;
  end;// else
  if left < right then
  begin
   r:=l1;
  end;// else  r:=0;
  cmov2:=r;
end;
   


00400370b9 01 00 00 00mov ecx,0001h
00400375ba ff ff ff ffmov edx,0h
0040037a31 c0 xor eax,eax
0040037c39 fe cmp esi,edi
0040037e0f 4c c2  cmovl eax,edx
0040038139 fe cmp esi,edi
004003830f 4f c1  cmovnle eax,ecx
00400386c3ret


Still kinda disappointing compared to what it could be - while this is 
some simple code a modern compiler should try to eliminate conditional 
jumps even with the incredibly powerful branch predictors nowadays.


clang and gcc emit this - I would guess they detect quite some common 
patterns like this.


  xor ecx, ecx
  cmp eax, edx
  mov eax, -1
  setg cl
  cmovge eax, ecx
  ret

  cmp eax, edx
  mov edx, -1
  setg    al
  movzx   eax, al
  cmovl   eax, edx
  ret


--
Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Producing assembly with less branches?

2020-07-19 Thread Florian Klämpfl

Am 19.07.20 um 19:24 schrieb Stefan Glienke:

Hi,

not sure if anything significantly changed in trunk compared to 3.2 wrt 
to optimized code being generated but I am quite disappointed that fpc 
(checked win64 with -O3 and -O4) does not use cmovxx instructions and 
alike for the most basic things and produces terrible code like this:


cmov does not support constant operands, so FPC does not use it.




unit1.pas:49  if left < right then
00010002E3C0 39ca cmp    edx,ecx
00010002E3C2 7e06 jle    0x10002e3ca 


unit1.pas:50  Result := -1
00010002E3C4 b8   mov    eax,0x
00010002E3C9 c3   ret
unit1.pas:51  else if left > right then
00010002E3CA 39ca cmp    edx,ecx
00010002E3CC 7d06 jge    0x10002e3d4 


unit1.pas:52  Result := 1
00010002E3CE b80100c3 mov    eax,0x1
unit1.pas:54  Result := 0;
00010002E3D4 31c0 xor    eax,eax
unit1.pas:55  end;
00010002E3D6 c3   ret

Similar for even simpler things:

unit1.pas:43  if i < 0 then
00010002E3A1 85c0 test   eax,eax
00010002E3A3 7d03 jge    0x10002e3a8 


unit1.pas:44  i := 0;
00010002E3A5 31c0 xor    eax,eax
00010002E3A7 90   nop

Imo someone should work at that and make the compiler produce less 
branches. Not sure if that is on your list but it should be looked at.


It might be in the list which contains enough work for >10 years :)
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Producing assembly with less branches?

2020-07-19 Thread Marģers . via fpc-devel
- Reply to message -
Subject: [fpc-devel] Producing assembly with less branches?
From:  Stefan Glienke 
To:  
> Hi,

> not sure if anything significantly changed in trunk compared to 3.2 wrt
> to optimized code being generated but I am quite disappointed that fpc
> (checked win64 with -O3 and -O4) does not use cmovxx instructions and
> alike for the most basic things and produces terrible code like this:

> unit1.pas:49  if left < right then
> 00010002E3C0 39ca cmp    edx,ecx
> 00010002E3C2 7e06 jle    0x10002e3ca 
> unit1.pas:50  Result := -1
> 00010002E3C4 b8   mov    eax,0x
> 00010002E3C9 c3   ret
> unit1.pas:51  else if left > right then
> 00010002E3CA 39ca cmp    edx,ecx
> 00010002E3CC 7d06 jge    0x10002e3d4 
> unit1.pas:52  Result := 1
> 00010002E3CE b80100c3 mov    eax,0x1
> unit1.pas:54  Result := 0;
> 00010002E3D4 31c0 xor    eax,eax
> unit1.pas:55  end;
> 00010002E3D6 c3   ret

> Similar for even simpler things:

> unit1.pas:43  if i < 0 then
> 00010002E3A1 85c0 test   eax,eax
> 00010002E3A3 7d03 jge    0x10002e3a8
> 
> unit1.pas:44  i := 0;
> 00010002E3A5 31c0 xor    eax,eax
> 00010002E3A7 90   nop

> Imo someone should work at that and make the compiler produce less
> branches. Not sure if that is on your list but it should be looked at.

it's already done in trunk (sadly not in 3.2.0)
to get cmov instruction emitted, has to meet two conditions
1) if statement without else part
2) assign value of variable (not constant).

your code has to look like to benefit from cmov

function cmov2(left, right : longint):longint;
var l1,lf: longint;
 r : longint;
begin
 l1:=1;
 lf:=-1;
 r:=0;
 if left > right then
 begin
  r:=lf;
 end;// else
 if left < right then
 begin
  r:=l1;
 end;// else  r:=0;
 cmov2:=r;
end;
  

00400370b9 01 00 00 00mov ecx,0001h
00400375ba ff ff ff ffmov edx,0h
0040037a31 c0 xor eax,eax
0040037c39 fe cmp esi,edi
0040037e0f 4c c2  cmovl eax,edx
0040038139 fe cmp esi,edi
004003830f 4f c1  cmovnle eax,ecx
00400386c3ret

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


[fpc-devel] Producing assembly with less branches?

2020-07-19 Thread Stefan Glienke

Hi,

not sure if anything significantly changed in trunk compared to 3.2 wrt 
to optimized code being generated but I am quite disappointed that fpc 
(checked win64 with -O3 and -O4) does not use cmovxx instructions and 
alike for the most basic things and produces terrible code like this:


unit1.pas:49  if left < right then
00010002E3C0 39ca cmp    edx,ecx
00010002E3C2 7e06 jle    0x10002e3ca 
unit1.pas:50  Result := -1
00010002E3C4 b8   mov    eax,0x
00010002E3C9 c3   ret
unit1.pas:51  else if left > right then
00010002E3CA 39ca cmp    edx,ecx
00010002E3CC 7d06 jge    0x10002e3d4 
unit1.pas:52  Result := 1
00010002E3CE b80100c3 mov    eax,0x1
unit1.pas:54  Result := 0;
00010002E3D4 31c0 xor    eax,eax
unit1.pas:55  end;
00010002E3D6 c3   ret

Similar for even simpler things:

unit1.pas:43  if i < 0 then
00010002E3A1 85c0 test   eax,eax
00010002E3A3 7d03 jge    0x10002e3a8 


unit1.pas:44  i := 0;
00010002E3A5 31c0 xor    eax,eax
00010002E3A7 90   nop

Imo someone should work at that and make the compiler produce less 
branches. Not sure if that is on your list but it should be looked at.



--
Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
https://www.avast.com/antivirus

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel