Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-16 Thread J. Gareth Moreton
 At the moment, I'm experimenting with overhauling the x86_64 optimizer to
see if I can reduce the number of passes through a block of code. My hope
is to greatly increase the speed of the compiler without sacrificing the
optimisations performed under -O1 and -O2.  At present, I have tried not to
modify i386 because I want to use it as a control case (i.e. to check
whether my changes break other platforms).

 It's probably not worthy of the bounty, but I'm enjoying the challenge of
seeing if I can improve the overall speed in places.
 Gareth aka. Kit

 On Fri 16/11/18 22:58 , "Florian Klämpfl" flor...@freepascal.org sent:
 On 16.11.2018 at 23:41, Florian Klämpfl wrote:
 > On 16.11.2018 at 23:36, Jonas Maebe wrote:
 >> On 16/11/18 22:44, Florian Klämpfl wrote:
 >>> With some compiler tuning and a few tricks (two changes to the code
 >>> and hand-simulated peephole optimizations, but I think the compiler
 >>> can do these tricks as well):
 >>
 >> You can improve performance further by devirtualising all method calls
 >> using wpo. First compile it with -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS
 >> and next with -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my
 >> machine it gives a small boost, and also makes the results more stable).
 >>
 >> Since I only have a preliminary llvm version (with Dwarf EH) running on
 >> macOS, I can't provide a direct Kylix comparison. The versions below are
 >> both x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way
 >> off.
 >>
 >> * FPC 3.0.4 -MDelphi -O2 -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS:
 >>
 >> $ time ./vipribenchmemcache_nodeps
 >> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0,
 >> NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
 >> .
 >> Time: 5016ms = 9669059 pkts/s = 14680 MB/s
 >>
 >> real    0m5.137s
 >> user    0m5.042s
 >> sys     0m0.017s
 >>
 >> FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm
 >> IR) and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no LLVM link-time
 >> optimization):
 >>
 >> $ time ./vipribenchmemcache_nodeps_llvm
 >> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0,
 >> NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
 >> .
 >> Time: 5018ms = 11259466 pkts/s = 17094 MB/s
 >>
 >> real    0m5.161s
 >> user    0m5.060s
 >> sys     0m0.017s
 >>
 >
 > Can you test with FPC 3.1.1 native, -O4 and the following patch:
 >
 >  compiler/nmem.pas | 2 +-
 >  1 file changed, 1 insertion(+), 1 deletion(-)
 >
 > diff --git a/compiler/nmem.pas b/compiler/nmem.pas
 > index d5c1d85e8f..52add1fd81 100644
 > --- a/compiler/nmem.pas
 > +++ b/compiler/nmem.pas
 > @@ -1176,7 +1176,7 @@ implementation
 >        begin
 >          include(flags,nf_write);
 >          { see comment in tsubscriptnode.mark_write }
 > -        if not(is_implicit_pointer_object_type(left.resultdef)) then
 > +        if not(is_implicit_array_pointer(left.resultdef)) then
 >            left.mark_write;
 >        end;
 >
 > ?

 Hmmm, it needs a few more of my changes to make it work, though it should
 work if used only with the benchmark.

 ___ 
 fpc-devel maillist - fpc-devel@lists.freepascal.org 
 http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-16 Thread Florian Klämpfl
On 16.11.2018 at 23:41, Florian Klämpfl wrote:
> On 16.11.2018 at 23:36, Jonas Maebe wrote:
>> On 16/11/18 22:44, Florian Klämpfl wrote:
>>> With some compiler tuning and a few tricks (two changes to the code and
>>> hand-simulated peephole optimizations, but I think the compiler can do
>>> these tricks as well):
>>
>> You can improve performance further by devirtualising all method calls using
>> wpo. First compile it with -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS and next with
>> -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it gives a small
>> boost, and also makes the results more stable).
>>
>> Since I only have a preliminary llvm version (with Dwarf EH) running on 
>> macOS, I can't provide a direct Kylix
>> comparison. The versions below are both x86-64. As mentioned before, a 32 
>> bit FPC/LLVM is still quite a way off.
>>
>> * FPC 3.0.4 -MDelphi -O2 -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS:
>>
>> $ time ./vipribenchmemcache_nodeps
>> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
>> NumberOfChannels=6, BufferPackets=5000,
>> NumberOfSynchroThreads=4
>> .
>> Time: 5016ms = 9669059 pkts/s = 14680 MB/s
>>
>> real    0m5.137s
>> user    0m5.042s
>> sys    0m0.017s
>>
>> FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm IR) 
>> and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no
>> LLVM link-time optimization):
>>
>> $ time ./vipribenchmemcache_nodeps_llvm
>> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
>> NumberOfChannels=6, BufferPackets=5000,
>> NumberOfSynchroThreads=4
>> .
>> Time: 5018ms = 11259466 pkts/s = 17094 MB/s
>>
>> real    0m5.161s
>> user    0m5.060s
>> sys    0m0.017s
>>
> 
> Can you test with FPC 3.1.1 native, -O4 and the following patch:
> 
>  compiler/nmem.pas | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/compiler/nmem.pas b/compiler/nmem.pas
> index d5c1d85e8f..52add1fd81 100644
> --- a/compiler/nmem.pas
> +++ b/compiler/nmem.pas
> @@ -1176,7 +1176,7 @@ implementation
>        begin
>          include(flags,nf_write);
>          { see comment in tsubscriptnode.mark_write }
> -        if not(is_implicit_pointer_object_type(left.resultdef)) then
> +        if not(is_implicit_array_pointer(left.resultdef)) then
>            left.mark_write;
>        end;
> 
> ?

Hmmm, it needs a few more of my changes to make it work, though it should
work if used only with the benchmark.



Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-16 Thread Florian Klämpfl
On 16.11.2018 at 23:36, Jonas Maebe wrote:
> On 16/11/18 22:44, Florian Klämpfl wrote:
>> With some compiler tuning and a few tricks (two changes to the code and
>> hand-simulated peephole optimizations, but I think the compiler can do
>> these tricks as well):
>
> You can improve performance further by devirtualising all method calls using
> wpo. First compile it with -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS and next with
> -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it gives a small
> boost, and also makes the results more stable).
> 
> Since I only have a preliminary llvm version (with Dwarf EH) running on 
> macOS, I can't provide a direct Kylix
> comparison. The versions below are both x86-64. As mentioned before, a 32 bit 
> FPC/LLVM is still quite a way off.
> 
> * FPC 3.0.4 -MDelphi -O2 -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS:
> 
> $ time ./vipribenchmemcache_nodeps
> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
> NumberOfChannels=6, BufferPackets=5000,
> NumberOfSynchroThreads=4
> .
> Time: 5016ms = 9669059 pkts/s = 14680 MB/s
> 
> real    0m5.137s
> user    0m5.042s
> sys    0m0.017s
> 
> FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm IR) 
> and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no
> LLVM link-time optimization):
> 
> $ time ./vipribenchmemcache_nodeps_llvm
> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
> NumberOfChannels=6, BufferPackets=5000,
> NumberOfSynchroThreads=4
> .
> Time: 5018ms = 11259466 pkts/s = 17094 MB/s
> 
> real    0m5.161s
> user    0m5.060s
> sys    0m0.017s
> 

Can you test with FPC 3.1.1 native, -O4 and the following patch:

 compiler/nmem.pas | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/compiler/nmem.pas b/compiler/nmem.pas
index d5c1d85e8f..52add1fd81 100644
--- a/compiler/nmem.pas
+++ b/compiler/nmem.pas
@@ -1176,7 +1176,7 @@ implementation
       begin
         include(flags,nf_write);
         { see comment in tsubscriptnode.mark_write }
-        if not(is_implicit_pointer_object_type(left.resultdef)) then
+        if not(is_implicit_array_pointer(left.resultdef)) then
           left.mark_write;
       end;

?


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-16 Thread Jonas Maebe

On 16/11/18 22:44, Florian Klämpfl wrote:

> With some compiler tuning and a few tricks (two changes to the code and
> hand-simulated peephole optimizations, but I think the compiler can do
> these tricks as well):


You can improve performance further by devirtualising all method calls 
using wpo. First compile it with -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS and 
next with -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it 
gives a small boost, and also makes the results more stable).
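
For reference, the two-pass build described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the source file name `vipribench.pas` and the helper `wpo_commands` are hypothetical, and passing `-MDelphi -O2` on both passes is an assumption as well. The WPO flags themselves are the ones quoted above: `-FW`/`-OW` store the feedback file and generate the analysis on the first pass, `-Fw`/`-Ow` load the file and apply the optimizations on the second.

```python
import shlex
from typing import List, Sequence


def wpo_commands(source: str,
                 feedback: str = "vipri.wpo",
                 opts: str = "DEVIRTCALLS,OPTVMTS",
                 extra: Sequence[str] = ("-MDelphi", "-O2")) -> List[List[str]]:
    """Build the two fpc command lines of a whole-program-optimization build.

    Nothing is executed here; the function only assembles argument lists.
    """
    base = ["fpc", *extra]
    # Pass 1: compile and store WPO feedback (-FW) for the listed
    # optimizations (-OW).
    collect = base + [f"-FW{feedback}", f"-OW{opts}", source]
    # Pass 2: recompile, loading the stored feedback (-Fw) and applying
    # the same optimizations (-Ow).
    use = base + [f"-Fw{feedback}", f"-Ow{opts}", source]
    return [collect, use]


if __name__ == "__main__":
    # "vipribench.pas" is a hypothetical file name used for illustration.
    for cmd in wpo_commands("vipribench.pas"):
        print(shlex.join(cmd))
```

Returning argument lists instead of running the compiler keeps the sketch usable as a dry run; the lists could be handed to `subprocess.run` unchanged.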


Since I only have a preliminary llvm version (with Dwarf EH) running on 
macOS, I can't provide a direct Kylix comparison. The versions below are 
both x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way 
off.


* FPC 3.0.4 -MDelphi -O2 -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS:

$ time ./vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4

.
Time: 5016ms = 9669059 pkts/s = 14680 MB/s

real    0m5.137s
user    0m5.042s
sys     0m0.017s

FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm 
IR) and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no LLVM link-time 
optimization):


$ time ./vipribenchmemcache_nodeps_llvm
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4

.
Time: 5018ms = 11259466 pkts/s = 17094 MB/s

real    0m5.161s
user    0m5.060s
sys     0m0.017s
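
To put the two runs side by side, the `Time:` lines can be parsed and compared. The snippet below is a small sketch, not part of the benchmark: the regular expression and the helper `parse_time_line` are assumptions based only on the output format shown above.

```python
import re

# Matches lines such as "Time: 5016ms = 9669059 pkts/s = 14680 MB/s".
TIME_LINE = re.compile(
    r"Time:\s*(\d+)ms\s*=\s*(\d+)\s*pkts/s\s*=\s*(\d+)\s*MB/s"
)


def parse_time_line(line: str) -> dict:
    """Extract run time, packet rate and data rate from one result line."""
    match = TIME_LINE.search(line)
    if match is None:
        raise ValueError(f"not a benchmark result line: {line!r}")
    ms, pkts, mb = (int(group) for group in match.groups())
    return {"ms": ms, "pkts_per_s": pkts, "mb_per_s": mb}


if __name__ == "__main__":
    native = parse_time_line("Time: 5016ms = 9669059 pkts/s = 14680 MB/s")
    llvm = parse_time_line("Time: 5018ms = 11259466 pkts/s = 17094 MB/s")
    ratio = llvm["pkts_per_s"] / native["pkts_per_s"]
    print(f"LLVM / native packet rate: {ratio:.3f}")
```

With the figures above, the LLVM build handles roughly 16% more packets per second than the FPC 3.0.4 build on this machine.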


Jonas


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-16 Thread Florian Klämpfl
On 16.11.2018 at 20:22, Simon Kissel wrote:
> Hi guys,
> 
> it turns out that in our real-life scenario there sadly aren't big
> improvements yet. This might be due to the exception handling, but
> we haven't profiled it yet. As I said, we have seen better improvements
> in simpler benchmark code, but this benchmark here is what
> really matters for us.
> 
> Please find the benchmark here - the ZIP includes a Kylix-built
> binary.
> 
> https://share.nerdherrschaft.net/f/2ac772f0327e4840a533/?dl=1
> 
> Here are some results from a dual-core i7 (2 cores, 4 hyper-threads),
> 32 bit:
> 
> Kylix:
> Time: 5015ms = 9770688 pkts/s = 14610 MB/s
> ./vipribenchmemcache_nodeps_kylix  5.06s user 0.01s system 99% cpu 5.119 total
> 
> FPC 3.0.4:
> Time: 5052ms = 8016627 pkts/s = 11987 MB/s
> ./vipribenchmemcache  5.07s user 0.01s system 97% cpu 5.206 total
> 
> FPC 3.3.1 trunk (SVN Rev 40300):
> Time: 5040ms = 8035714 pkts/s = 12016 MB/s
> ./vipribenchmemcache_nodeps  5.07s user 0.02s system 97% cpu 5.207 total
> 
> Benchmark results for ARM will follow.

With some compiler tuning and a few tricks (two changes to the code and 
hand-simulated peephole optimizations, but I think the compiler can do
these tricks as well):

florian@ubuntu32:~$ ./vipribench
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
..
Time: 5005ms = 9390609 pkts/s = 14042 MB/s
florian@ubuntu32:~$ ./vipribenchmemcache_nodeps_kylix
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, 
NumberOfChannels=6, BufferPackets=5000,
NumberOfSynchroThreads=4
.
Time: 5018ms = 9266640 pkts/s = 13856 MB/s

;)



Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-16 Thread Simon Kissel
Hi guys,

it turns out that in our real-life scenario there sadly aren't big
improvements yet. This might be due to the exception handling, but
we haven't profiled it yet. As I said, we have seen better improvements
in simpler benchmark code, but this benchmark here is what
really matters for us.

Please find the benchmark here - the ZIP includes a Kylix-built
binary.

https://share.nerdherrschaft.net/f/2ac772f0327e4840a533/?dl=1

Here are some results from a dual-core i7 (2 cores, 4 hyper-threads),
32 bit:

Kylix:
Time: 5015ms = 9770688 pkts/s = 14610 MB/s
./vipribenchmemcache_nodeps_kylix  5.06s user 0.01s system 99% cpu 5.119 total

FPC 3.0.4:
Time: 5052ms = 8016627 pkts/s = 11987 MB/s
./vipribenchmemcache  5.07s user 0.01s system 97% cpu 5.206 total

FPC 3.3.1 trunk (SVN Rev 40300):
Time: 5040ms = 8035714 pkts/s = 12016 MB/s
./vipribenchmemcache_nodeps  5.07s user 0.02s system 97% cpu 5.207 total
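
As a quick way to read these three results against each other, the packet rates can be expressed relative to the Kylix build. This is just an illustrative sketch: the helper name `relative_to` is made up, and the numbers are copied from the output above.

```python
from typing import Dict

# Packet rates (pkts/s) taken from the three runs listed above.
PKTS_PER_S: Dict[str, int] = {
    "Kylix": 9770688,
    "FPC 3.0.4": 8016627,
    "FPC 3.3.1 trunk (r40300)": 8035714,
}


def relative_to(baseline: str, rates: Dict[str, int]) -> Dict[str, float]:
    """Return every rate as a fraction of the baseline entry's rate."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items()}


if __name__ == "__main__":
    for name, fraction in relative_to("Kylix", PKTS_PER_S).items():
        print(f"{name:26s} {fraction:6.1%} of Kylix")
```

Both FPC builds come out at roughly 82% of the Kylix packet rate, and trunk is only about 0.2 percentage points ahead of 3.0.4, which matches the "no big improvements yet" observation.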

Benchmark results for ARM will follow.

Cheers,

Simon


Thursday, November 15, 2018, 10:31:55 PM, you wrote:

> On 14.11.2018 at 14:46, Simon Kissel wrote:
>> 
>> We have not yet tested this on ARM (does it work on ARM?).
>>

> After r40321, arm-linux works as well.



Best regards,

Simon Kissel

-- 
Nerdherrschaft GmbH
Mainzer Str. 40
55411 Bingen am Rhein
Germany

Phone:+49-6721-9492994
Fax:  +49-6721-9492996

simon.kis...@nerdherrschaft.com
http://www.nerdherrschaft.com

Registered office/Sitz der Gesellschaft: Bingen am Rhein, Germany
CEO/Geschäftsführer: Simon Kissel
Commercial register/Handelsregister: Amtsgericht Mainz HRB43337
