Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
At the moment, I'm experimenting with overhauling the x86_64 optimizer to see if I can reduce the number of passes through a block of code. My hope is to greatly increase the speed of the compiler without sacrificing the optimisations performed under -O1 and -O2. At present, I've tried not to modify i386 because I wish to use it as a control case (i.e. do my changes break other platforms?).

It's probably not worthy of the bounty, but I'm enjoying the challenge of seeing if I can improve the overall speed in places.

Gareth aka. Kit

On Fri 16/11/18 22:58, "Florian Klämpfl" flor...@freepascal.org sent:

> On 16.11.2018 at 23:41, Florian Klämpfl wrote:
>> Can you test with FPC 3.1.1 native, -O4 and the following patch:
>>
>> compiler/nmem.pas | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/compiler/nmem.pas b/compiler/nmem.pas
>> index d5c1d85e8f..52add1fd81 100644
>> --- a/compiler/nmem.pas
>> +++ b/compiler/nmem.pas
>> @@ -1176,7 +1176,7 @@ implementation
>>        begin
>>          include(flags,nf_write);
>>          { see comment in tsubscriptnode.mark_write }
>> -        if not(is_implicit_pointer_object_type(left.resultdef)) then
>> +        if not(is_implicit_array_pointer(left.resultdef)) then
>>            left.mark_write;
>>        end;
>>
>> ?
>
> Hmmm, needs a few more of my changes to make it work, though it should work if used only with the benchmark.

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On 16.11.2018 at 23:41, Florian Klämpfl wrote:
> Can you test with FPC 3.1.1 native, -O4 and the following patch:
>
> compiler/nmem.pas | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/compiler/nmem.pas b/compiler/nmem.pas
> index d5c1d85e8f..52add1fd81 100644
> --- a/compiler/nmem.pas
> +++ b/compiler/nmem.pas
> @@ -1176,7 +1176,7 @@ implementation
>        begin
>          include(flags,nf_write);
>          { see comment in tsubscriptnode.mark_write }
> -        if not(is_implicit_pointer_object_type(left.resultdef)) then
> +        if not(is_implicit_array_pointer(left.resultdef)) then
>            left.mark_write;
>        end;
>
> ?

Hmmm, needs a few more of my changes to make it work, though it should work if used only with the benchmark.
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On 16.11.2018 at 23:36, Jonas Maebe wrote:
> On 16/11/18 22:44, Florian Klämpfl wrote:
>> With some compiler tuning and a few tricks (two changes to the code and
>> hand-simulated peephole optimizations, but I think the compiler can do
>> these tricks as well):
>
> You can improve performance further by devirtualising all method calls using
> wpo. First compile it with -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS and next with
> -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it gives a small
> boost, and also makes the results more stable).
>
> Since I only have a preliminary llvm version (with Dwarf EH) running on
> macOS, I can't provide a direct Kylix comparison. The versions below are both
> x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way off.
>
> * FPC 3.0.4 -MDelphi -O2 -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS:
>
> $ time ./vipribenchmemcache_nodeps
> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0,
> NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
> .
> Time: 5016ms = 9669059 pkts/s = 14680 MB/s
>
> real 0m5.137s
> user 0m5.042s
> sys  0m0.017s
>
> * FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm IR)
>   and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no LLVM link-time optimization):
>
> $ time ./vipribenchmemcache_nodeps_llvm
> VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0,
> NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
> .
> Time: 5018ms = 11259466 pkts/s = 17094 MB/s
>
> real 0m5.161s
> user 0m5.060s
> sys  0m0.017s

Can you test with FPC 3.1.1 native, -O4 and the following patch:

compiler/nmem.pas | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/compiler/nmem.pas b/compiler/nmem.pas
index d5c1d85e8f..52add1fd81 100644
--- a/compiler/nmem.pas
+++ b/compiler/nmem.pas
@@ -1176,7 +1176,7 @@ implementation
       begin
         include(flags,nf_write);
         { see comment in tsubscriptnode.mark_write }
-        if not(is_implicit_pointer_object_type(left.resultdef)) then
+        if not(is_implicit_array_pointer(left.resultdef)) then
           left.mark_write;
       end;

?
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On 16/11/18 22:44, Florian Klämpfl wrote:
> With some compiler tuning and a few tricks (two changes to the code and
> hand-simulated peephole optimizations, but I think the compiler can do
> these tricks as well):

You can improve performance further by devirtualising all method calls using wpo. First compile it with -FWvipri.wpo -OWDEVIRTCALLS,OPTVMTS and next with -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (at least on my machine it gives a small boost, and also makes the results more stable).

Since I only have a preliminary llvm version (with Dwarf EH) running on macOS, I can't provide a direct Kylix comparison. The versions below are both x86-64. As mentioned before, a 32 bit FPC/LLVM is still quite a way off.

* FPC 3.0.4 -MDelphi -O2 -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS:

$ time ./vipribenchmemcache_nodeps
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
.
Time: 5016ms = 9669059 pkts/s = 14680 MB/s

real 0m5.137s
user 0m5.042s
sys  0m0.017s

* FPC 3.3.1 + llvm (clang from Xcode 10.1 with -O3 on FPC-generated llvm IR) and -Fwvipri.wpo -OwDEVIRTCALLS,OPTVMTS (no LLVM link-time optimization):

$ time ./vipribenchmemcache_nodeps_llvm
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
.
Time: 5018ms = 11259466 pkts/s = 17094 MB/s

real 0m5.161s
user 0m5.060s
sys  0m0.017s

Jonas
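To put the two x86-64 runs above side by side, here is a minimal sketch that works out the relative throughput gain from the reported pkts/s figures. The numbers are copied from the message; the function and constant names are made up for this illustration, and a single run per build says nothing about run-to-run variance.

```python
# Throughput figures (packets per second) copied from the two runs quoted above.
NATIVE_PKTS_PER_S = 9_669_059   # FPC 3.0.4, -O2 with WPO
LLVM_PKTS_PER_S = 11_259_466    # FPC 3.3.1 + llvm, clang -O3 with WPO


def relative_gain(baseline: float, candidate: float) -> float:
    """Return the fractional throughput gain of candidate over baseline."""
    return candidate / baseline - 1.0


gain = relative_gain(NATIVE_PKTS_PER_S, LLVM_PKTS_PER_S)
print(f"LLVM build throughput gain over the native build: {gain:.1%}")
```

On these figures the LLVM build comes out roughly 16% ahead of the native 3.0.4 build, with both builds using the same WPO settings.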
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On 16.11.2018 at 20:22, Simon Kissel wrote:
> Hi guys,
>
> turns out that in our real-life scenario there sadly aren't big
> improvements yet. This might be due to the exception handling, but
> we haven't profiled it yet. As said, we have seen better improvements
> in simpler benchmark code - but this benchmark here is what
> really matters for us.
>
> Please find the benchmark here - the ZIP includes a Kylix-built
> binary.
>
> https://share.nerdherrschaft.net/f/2ac772f0327e4840a533/?dl=1
>
> Here are some results from a dual-core i7 (2 cores, 4 hyperthreads),
> 32 bit:
>
> Kylix:
> Time: 5015ms = 9770688 pkts/s = 14610 MB/s
> ./vipribenchmemcache_nodeps_kylix 5.06s user 0.01s system 99% cpu 5.119 total
>
> FPC 3.0.4:
> Time: 5052ms = 8016627 pkts/s = 11987 MB/s
> ./vipribenchmemcache 5.07s user 0.01s system 97% cpu 5.206 total
>
> FPC 3.3.1 trunk (SVN Rev 40300):
> Time: 5040ms = 8035714 pkts/s = 12016 MB/s
> ./vipribenchmemcache_nodeps 5.07s user 0.02s system 97% cpu 5.207 total
>
> Benchmark results for ARM will follow.

With some compiler tuning and a few tricks (two changes to the code and hand-simulated peephole optimizations, but I think the compiler can do these tricks as well):

florian@ubuntu32:~$ ./vipribench
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
..
Time: 5005ms = 9390609 pkts/s = 14042 MB/s
florian@ubuntu32:~$ ./vipribenchmemcache_nodeps_kylix
VipriBenchThreaded - RunningTimeSeconds=5, TestCount=100, StartSeq=0, NumberOfChannels=6, BufferPackets=5000, NumberOfSynchroThreads=4
.
Time: 5018ms = 9266640 pkts/s = 13856 MB/s

;)
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Hi guys,

turns out that in our real-life scenario there sadly aren't big improvements yet. This might be due to the exception handling, but we haven't profiled it yet. As said, we have seen better improvements in simpler benchmark code - but this benchmark here is what really matters for us.

Please find the benchmark here - the ZIP includes a Kylix-built binary.

https://share.nerdherrschaft.net/f/2ac772f0327e4840a533/?dl=1

Here are some results from a dual-core i7 (2 cores, 4 hyperthreads), 32 bit:

Kylix:
Time: 5015ms = 9770688 pkts/s = 14610 MB/s
./vipribenchmemcache_nodeps_kylix 5.06s user 0.01s system 99% cpu 5.119 total

FPC 3.0.4:
Time: 5052ms = 8016627 pkts/s = 11987 MB/s
./vipribenchmemcache 5.07s user 0.01s system 97% cpu 5.206 total

FPC 3.3.1 trunk (SVN Rev 40300):
Time: 5040ms = 8035714 pkts/s = 12016 MB/s
./vipribenchmemcache_nodeps 5.07s user 0.02s system 97% cpu 5.207 total

Benchmark results for ARM will follow.

Cheers,
Simon

Thursday, November 15, 2018, 10:31:55 PM, you wrote:

> On 14.11.2018 at 14:46, Simon Kissel wrote:
>>
>> We have not yet tested this on ARM (does it work on ARM?).
>>
> After r40321, arm-linux works as well.

Best regards,
Simon Kissel

--
Nerdherrschaft GmbH
Mainzer Str. 40
55411 Bingen am Rhein
Germany

Phone: +49-6721-9492994
Fax: +49-6721-9492996
simon.kis...@nerdherrschaft.com
http://www.nerdherrschaft.com

Registered office/Sitz der Gesellschaft: Bingen am Rhein, Germany
CEO/Geschäftsführer: Simon Kissel
Commercial register/Handelsregister: Amtsgericht Mainz HRB43337
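For anyone wanting to compare the three 32-bit results above without doing the division by hand, here is a small sketch that parses the "Time: ..." lines and expresses each build's throughput relative to Kylix. The regular expression is inferred from the output lines in this message, not taken from the benchmark's source, and all names in the snippet are hypothetical.

```python
import re

# Result lines copied from the message above; the keys are display labels only.
RESULTS = {
    "Kylix": "Time: 5015ms = 9770688 pkts/s = 14610 MB/s",
    "FPC 3.0.4": "Time: 5052ms = 8016627 pkts/s = 11987 MB/s",
    "FPC 3.3.1 trunk": "Time: 5040ms = 8035714 pkts/s = 12016 MB/s",
}

# Pattern inferred from the lines above (an assumption about the output format).
LINE_RE = re.compile(r"Time: (\d+)ms = (\d+) pkts/s = (\d+) MB/s")


def parse_result(line: str) -> dict:
    """Extract run time, packet rate and data rate from one result line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised result line: {line!r}")
    ms, pkts, mb = (int(group) for group in match.groups())
    return {"ms": ms, "pkts_per_s": pkts, "mb_per_s": mb}


parsed = {name: parse_result(line) for name, line in RESULTS.items()}

# Express every build's packet rate as a fraction of the Kylix packet rate.
baseline = parsed["Kylix"]["pkts_per_s"]
relative = {name: r["pkts_per_s"] / baseline for name, r in parsed.items()}

for name, ratio in relative.items():
    print(f"{name:16s} {ratio:6.1%} of Kylix throughput")
```

On these numbers both FPC builds land at roughly 82% of the Kylix packet rate, and trunk is only a fraction of a percent ahead of 3.0.4, which matches the "no big improvements yet" observation.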