Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On 23.11.2018 at 21:07, Simon Kissel wrote:
> own code, it won't get much better than what it is today,
> and that Kylix producing faster code does not compensate it

Well, to be fair, there is a lot of code out there where FPC is faster. Nevertheless, FPC's code can still be improved.

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Hi Adriaan,

In case you aren't just trolling and the subject really is of interest to you, I would recommend reading the discussion thread in full. That works much better than treating this like a write-only system.

> You didn't answer any of my questions. The goal is to get the
> code faster, isn't it.

No, the goal is not to get any specific code faster. The goal is to have the compiler and/or RTL improved so that all compiled code benefits, and that execution speed in general gets on par with the 15-year-old Kylix/Delphi 7 compilers.

And yes, of course we have been profiling our code for years, and we know what we are doing and talking about. Our code sadly does not have any bottlenecks in the sense of a small number of functions eating most of the CPU; the load is pretty evenly distributed across all of the functions. This means that the problem is distributed all across the code.

However, there is something sticking out, being at the very top of pretty much all multi-threaded code we compile: fpc_pushexceptaddr & CRelocateThreadVar.

Besides this, not everything can be uncovered by profiling, and that part is nothing that FPC can change: on one of the ARM platforms we use, every context switch results in a CPU cache flush, so simply by having more threads *all* of them will become slower.

The benchmark code, like our real-life code, is able to utilize ~99% of the CPU, so no, it's also not a matter of thread synchronization (we aren't spinlocking).

The commercial reason behind putting out a 15k bounty is that no matter how much more money I invest into optimizing my own code, it won't get much better than what it is today, and that Kylix producing faster code does not compensate for it not supporting any of the nice-to-have language features that FPC has today.

Simon
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Hi Florian,

> Actually, most of the improvements so far are not related to
> threading. In particular r40339 helped a lot, it was a bug
> fix: the compiler assumed that a certain sub expression was written
> while it was not and this prevented CSE.

Even better, that means there is still gold to be uncovered :)

In our case the bottleneck very clearly appears to be that every call to fpc_pushexceptaddr/fpc_popaddrstack causes a call to CRelocateThreadVar, which causes a call to pthread_getspecific.

We do create our ARM production builds with {$IMPLICITEXCEPTIONS OFF} to get acceptable speed, else it would be completely unbearable.

BR,
Simon
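The call chain described above can be modelled in a few lines of C. This is only a sketch under stated assumptions, not FPC's RTL code: `thread_block`, `relocate_thread_block`, `push_frame`, `pop_frame` and `protected_work` are hypothetical names standing in for the per-thread data, CRelocateThreadVar, fpc_pushexceptaddr, fpc_popaddrstack and a routine with an implicit exception frame. It illustrates why the cost is spread across all of the code: every protected routine pays two key lookups per call, whether or not an exception is ever raised.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical model of the per-thread data block. */
typedef struct {
    long except_depth;            /* stands in for the exception frame chain */
} thread_block;

static pthread_key_t block_key;
static pthread_once_t block_once = PTHREAD_ONCE_INIT;
static long lookup_calls;         /* demo counter, not thread-safe */

static void make_key(void) {
    int rc = pthread_key_create(&block_key, free);
    assert(rc == 0);
}

/* Models a relocate-style helper: every access performs a key lookup. */
static thread_block *relocate_thread_block(void) {
    pthread_once(&block_once, make_key);
    lookup_calls++;
    thread_block *b = pthread_getspecific(block_key);
    if (b == NULL) {              /* first access on this thread */
        b = calloc(1, sizeof *b);
        assert(b != NULL);
        pthread_setspecific(block_key, b);
    }
    return b;
}

/* Model pushing and popping an exception frame; each pays one lookup. */
static void push_frame(void) { relocate_thread_block()->except_depth++; }
static void pop_frame(void)  { relocate_thread_block()->except_depth--; }

/* A routine with an implicit protected region: two lookups per call. */
long protected_work(long x) {
    push_frame();
    long r = x * 2 + 1;
    pop_frame();
    return r;
}

/* Returns how many key lookups n calls to protected_work caused. */
long lookups_for_calls(long n) {
    long before = lookup_calls;
    for (long i = 0; i < n; i++)
        (void)protected_work(i);
    return lookup_calls - before;
}
```

In this model, 1000 calls to `protected_work` cost 2000 lookups, so the overhead grows with the number of protected calls rather than concentrating in one hot function.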
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On 23.11.2018 at 14:36, Simon Kissel wrote:
> Hi Adriaan,
>
>> I find the phrase "FPC's terrible multi-threading performance"
>> unjust.
>
> Well, see the complete thread to better understand what this
> is about, and what progress is being made. So far a 20%
> improvement has been made, which kinda is like a proof that
> there was something to improve ;)
>
>> When I do multi-threading
>> with FPC, I get a near N speed improvement (on i386 and x86_64) where N is
>> the number of cores,
>> including hyper-threaded cores
>
> This isn't about FPC's code not scaling with N cores, it does.
> It is about it being slow as soon as threads are used *at all*,
> due to TLS stuff and exception handling. It's slow in a linear
> fashion, so to say...
>

Actually, most of the improvements so far are not related to threading. In particular, r40339 helped a lot. It was a bug fix: the compiler assumed that a certain subexpression was written while it was not, and this prevented CSE (common subexpression elimination).
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Simon Kissel wrote:
> This isn't about FPC's code not scaling with N cores, it does.
> It is about it being slow as soon as threads are used *at all*,

N cores being near N times faster than "not using threads at all".

> due to TLS stuff and exception handling. It's slow in a linear
> fashion, so to say...

You didn't answer any of my questions. The goal is to get the code faster, isn't it? Or are you writing an academic thesis on compilers?

Regards,

Adriaan van Os
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Hi Adriaan,

> I find the phrase "FPC's terrible multi-threading performance"
> unjust.

Well, see the complete thread to better understand what this is about, and what progress is being made. So far a 20% improvement has been made, which kinda is like a proof that there was something to improve ;)

> When I do multi-threading
> with FPC, I get a near N speed improvement (on i386 and x86_64) where N is
> the number of cores,
> including hyper-threaded cores

This isn't about FPC's code not scaling with N cores, it does. It is about it being slow as soon as threads are used *at all*, due to TLS stuff and exception handling. It's slow in a linear fashion, so to say...

Best regards,
Simon
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On Fri, 23 Nov 2018, 12:15, Adriaan van Os wrote:
> Simon Kissel wrote:
>
> > We know about a couple of bottlenecks (fpc_pushexceptaddr /
> > RelocateThreadVar etc) which explain FPC's terrible multi-threading
> > performance, but in general, FPC's code generator really is quite
> > a mess, which we learned the hard way a couple of years ago when we
> > did optimization work on the ARM target.
>
> I find the phrase "FPC's terrible multi-threading performance" unjust.
> When I do multi-threading
> with FPC, I get a near N speed improvement (on i386 and x86_64) where N is
> the number of cores,
> including hyper-threaded cores
>
> What about taking another way, having a precise look at the source code?
> Did you profile it? What
> sort of work does the code do? How are the threads synchronized? What
> data structures are used?
>
> I don't take "the compiler is so bad" without an answer to these questions.
>

Simon wrote that the same code performs better when compiled with Kylix, so there definitely are things that FPC can do better, and as Florian's work on TLS variables showed, they indeed *do* make FPC perform better. I expect a similar improvement from DWARF exceptions, as the setjmp/longjmp-based approach *is* more expensive in the case where no exception occurs than the approach of marking protected code in the metadata, as DWARF and SEH64 do.

Regards,
Sven
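To make the cost argument concrete, here is a minimal C sketch of a setjmp/longjmp-based protected region. It is an illustration under stated assumptions, not FPC's implementation: `frame`, `guarded_div`, `raise_error` and `setjmp_calls` are hypothetical names, and the frame chain is a plain global where a real runtime would keep it per thread. The point is that the frame is linked in and setjmp runs on every entry, including the common case where nothing is raised. A table-based scheme records the protected range in metadata at compile time and does no such work on the success path.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* One entry in the chain of active protected regions. */
typedef struct frame {
    jmp_buf env;
    struct frame *prev;
} frame;

static frame *top;            /* per-thread in a real runtime; global here */
static long setjmp_count;     /* counts protected-region entries */

/* Transfers control to the innermost active protected region. */
static void raise_error(void) {
    assert(top != NULL);
    longjmp(top->env, 1);
}

/* Returns 0 and stores a / b in *out, or -1 if an error was raised. */
int guarded_div(int a, int b, int *out) {
    frame f;
    f.prev = top;             /* setup cost paid on every call ... */
    top = &f;
    setjmp_count++;
    if (setjmp(f.env) == 0) { /* ... including the register save here */
        if (b == 0)
            raise_error();
        *out = a / b;
        top = f.prev;         /* unlink on the success path */
        return 0;
    }
    top = f.prev;             /* unlink after an error was caught */
    return -1;
}

/* Number of times a protected region has been entered so far. */
long setjmp_calls(void) {
    return setjmp_count;
}
```

Calling `guarded_div(10, 2, &r)` never raises, yet it still performs the full frame setup and the setjmp call, which is the overhead the paragraph above refers to.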
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Simon Kissel wrote:
> We know about a couple of bottlenecks (fpc_pushexceptaddr /
> RelocateThreadVar etc) which explain FPC's terrible multi-threading
> performance, but in general, FPC's code generator really is quite
> a mess, which we learned the hard way a couple of years ago when we
> did optimization work on the ARM target.

I find the phrase "FPC's terrible multi-threading performance" unjust. When I do multi-threading with FPC, I get a near N speed improvement (on i386 and x86_64) where N is the number of cores, including hyper-threaded cores.

What about taking another way, having a precise look at the source code? Did you profile it? What sort of work does the code do? How are the threads synchronized? What data structures are used?

I don't take "the compiler is so bad" without an answer to these questions.

Regards,

Adriaan van Os