Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On Wed, 5 Dec 2018, 00:06, J. Gareth Moreton wrote:

> The more you learn! What is TLS, curiously? Given that I do a lot of my
> work at the assembler level, I figure this is something I should know!

Thread Local Storage. For ELF, TLS accesses are written as pseudo instructions that the assembler needs to expand to the correct instructions for the target (and TLS model), which is why the internal assembler does not handle them yet.

Regards,
Sven

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
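As a loose illustration of the concept named in the reply above (thread-local storage as such, not the ELF pseudo instructions or anything FPC emits), here is a minimal Python sketch using the standard library's `threading.local`. The names `tls`, `worker` and `run_demo` are made up for this example.

```python
import threading

# One object, but every thread that touches it gets its own private attributes.
tls = threading.local()


def worker(name, seen_before, seen_after):
    # A fresh thread does not see the value the main thread stored.
    seen_before[name] = hasattr(tls, "value")
    # Writing here only affects this thread's copy.
    tls.value = name
    seen_after[name] = tls.value


def run_demo():
    tls.value = "main"  # stored in the main thread's copy only
    seen_before, seen_after = {}, {}
    threads = [
        threading.Thread(target=worker, args=(f"t{i}", seen_before, seen_after))
        for i in range(3)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The workers' writes never touched the main thread's copy.
    return seen_before, seen_after, tls.value
```

Each worker starts without a `value` attribute, stores its own, and the main thread still reads back "main" afterwards. A compiled language gets the same per-thread isolation through the target's TLS model rather than through a runtime object like this.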
Re: [fpc-devel] Optimization theory
Not sure if this was intended for the group mailing list or not, but it was only e-mailed to me.

One question though... in your example, what optimisation level are you compiling under? I thought the peephole optimizer already changes things like "lea (%ecx,%eax),%eax" to "add %ecx,%eax". Nevertheless, my Deep Optimizer is able to spot the unnecessary assignments to %eax. I did submit a patch, but because it was a prototype and tied into the post peephole optimizer, it hasn't received that much attention. It's something I'm still working on though, and something I'd like to submit as an -O4 option.

Gareth aka Kit

- Original Message -
From: David Pethes pub...@satd.sk
To: gar...@moreton-family.com
Sent: Tue 04/12/18 19:45
Subject: Fwd: Re: [fpc-devel] Optimization theory

Hi Gareth,

nice to see that you are still working on compiler improvements. If I may, can I ask you to look at the assembly output of this routine (it's a link to compiler explorer):

https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:0.89579519,j:1,lang:pascal,source:'unit+output%3B%0A%0Ainterface%0A%0Aprocedure+core_4x4(block:+psmallint)%3B%0A%0Aimplementation%0A%0Atype%0A++matrix_t+%3D+array%5B0..3,+0..3%5D+of+smallint%3B+%0A%0Aprocedure+core_4x4(block:+psmallint)%3B%0Avar%0A++m:+matrix_t%3B%0A++e,+f,+g,+h:+array%5B0..3%5D+of+smallint%3B%0A%0Avar%0A++i:+longword%3B%0Abegin%0A++move(block%5E,+m,+16*2)%3B%0A%0A++for+i+:%3D+0+to+3+do+begin%0A++e%5Bi%5D+:%3D+m%5Bi%5D%5B0%5D+%2B+m%5Bi%5D%5B3%5D%3B%0A++f%5Bi%5D+:%3D+m%5Bi%5D%5B0%5D+-+m%5Bi%5D%5B3%5D%3B%0A++g%5Bi%5D+:%3D+m%5Bi%5D%5B1%5D+%2B+m%5Bi%5D%5B2%5D%3B%0A++h%5Bi%5D+:%3D+m%5Bi%5D%5B1%5D+-+m%5Bi%5D%5B2%5D%3B%0A++end%3B%0A%0Aend%3B%0A+%0Aend.%0A'),l:'5',n:'0',o:'Pascal+source+%231',t:'0')),k:51.379872662717624,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:fpc304,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',trim:'1'),lang:pascal,libs:!(),options:'-O4',source:1),l:'5',n:'0',o:'x86-64+fpc+3.0.4+(Editor+%231,+Compiler+%231)+Pascal',t:'0')),header:(),k:48.6201273372824,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4

As you can see, there are a lot of duplicated index/load calculations, yet there are only a couple of registers used, so the results could be reused. I wonder if this is somehow covered in your optimization passes as well, or if you have some idea where I would start if I wanted to optimize the code generation for this?

Thanks again for your efforts,
David

On 19. 6. 2018 1:51, J. Gareth Moreton wrote:
> Hah, thanks. In a strange way, I find it quite fun. The culmination of
> my efforts has been the first stage of a "Deep Optimizer", which
> searches around MOV commands to see if it can minimise pipeline stalls
> and sometimes can remove MOV commands entirely if it determines that the
> two registers are identical in value at that point.
>
> An example where inlining helped (actually, just copying and pasting
> the function's contents virtually verbatim because everything was in
> assembly language) is the new Int and Frac functions on x86-64, where I
> inserted Int's contents directly into Frac. While Int had a branch and
> a few safety checks to prevent an integer overflow, it was much faster
> when inlined compared to fiddling around with register moves and doing a
> function call, especially as the rest of Frac was very fast already, and
> it also allowed the entirety of Frac to have no stack frame.
>
> It's probably a bit insulting to Free Pascal, but my initial drive was
> to reduce the size of the compiled binaries, without sacrificing speed.
> Of course, improving speed is a major focus as well!
>
> Gareth
>
> On Mon 18/06/18 19:49, David Pethes pub...@satd.sk sent:
>
> Hi, one less thing to worry about then :) Unless in a quite tight loop,
> it shouldn't make much difference - or any at all, due to OoO
> execution.
> By the way, this Intel code analyzer could be interesting to you - I
> only found it recently:
> https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
>
> Good luck with your optimizations!
>
> David
>
> On 17. 6. 2018 19:52, J. Gareth Moreton wrote:
> > Aah, I was in error. I used agner as my
> > source, but even that says 5 cycles for a
> > near call and 17-33 for a far call, so I'm
> > not sure where I got 50 from. My
> > apologies. I have probably been avoiding
> > function calls more than necessary!
> >
> > Gareth aka. Kit
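To make the reuse David asks about in the forwarded message concrete: the routine encoded in his Compiler Explorer link computes, for each row i of a 4x4 smallint matrix, the sums and differences of m[i][0] with m[i][3] and of m[i][1] with m[i][2]. The sketch below is a Python transcription of that idea, not the Pascal original and not what FPC generates. The function names are made up, the results are returned so they can be compared (the original discards them), and 16-bit wraparound is not modelled.

```python
def core_4x4_naive(block):
    """Direct transcription: every m[i][k] is indexed again at each use."""
    m = [list(block[4 * i:4 * i + 4]) for i in range(4)]
    e, f, g, h = [0] * 4, [0] * 4, [0] * 4, [0] * 4
    for i in range(4):
        e[i] = m[i][0] + m[i][3]
        f[i] = m[i][0] - m[i][3]
        g[i] = m[i][1] + m[i][2]
        h[i] = m[i][1] - m[i][2]
    return e, f, g, h


def core_4x4_reuse(block):
    """Same results, but each row element is fetched once and then reused."""
    m = [list(block[4 * i:4 * i + 4]) for i in range(4)]
    e, f, g, h = [0] * 4, [0] * 4, [0] * 4, [0] * 4
    for i in range(4):
        a, b, c, d = m[i]  # one row lookup, four element loads
        e[i] = a + d
        f[i] = a - d
        g[i] = b + c
        h[i] = b - c
    return e, f, g, h
```

The second version is what a common-subexpression pass would aim for at the machine level: the address of row i and the four loaded values are kept in registers across the four assignments instead of being recomputed for each one.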
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
The more you learn! What is TLS, curiously? Given that I do a lot of my work at the assembler level, I figure this is something I should know!

Gareth

On Tue 04/12/18 22:48, Simon Kissel simon.kis...@nerdherrschaft.com sent:

Hi Florian,

> Do you compile with -Aas? The internal assemblers do not support TLS yet, this is WIP.

Ah wow! -Aas does indeed help. Both the assembler errors and the internal error are gone, both in Linux i386 and ARM. And the created binaries even work. Nice!

Thank you!

Cheers,
Simon
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Hi Florian,

> Do you compile with -Aas? The internal assemblers do not support TLS yet,
> this is WIP.

Ah wow! -Aas does indeed help. Both the assembler errors and the internal error are gone, both in Linux i386 and ARM. And the created binaries even work. Nice!

Thank you!

Cheers,
Simon
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
On 04.12.2018 at 02:16, Simon Kissel wrote:
> Hi Florian,
>
> we are currently trying to do some real-life benchmarks with our
> products, however with rev. 40346 compilation fails with the two following
> showstoppers:

Do you compile with -Aas? The internal assemblers do not support TLS yet; this is WIP.
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
At the moment I'm trying to fix some Linux bugs. When I compile Lazarus with it, I get about a 5% speed increase for -O1 and a 15% speed increase for -O3, but someone else reported a 2% slowdown for -O2 on their own test project. Either way, I hope my fundamental theory is sound... reducing the number of code passes... and I can convince Florian that the possible increase in maintenance difficulty is worth the performance gain. If that's successful, then I'll port the relevant code over to i386. When it's totally stable, I'll definitely run your own project through it to see what we get.

There are other little speed savings that can be made here and there that add to the running total. Most of it is simple refactoring.

For your bug report, I had a look at the Intel® 64 and IA-32 Architectures Software Developer’s Manual to check on some of the assembler routines, and the ones where LEA contains an immediate operand are actually invalid. My guess though is that the immediate was being converted into a reference that doesn't contain any registers. I'm not sure if and how an assembler should support that though, even if it's possible to represent the arrangement in raw machine code. Something for investigation, that's for sure.

Gareth

On Tue 04/12/18 19:29, Simon Kissel simon.kis...@nerdherrschaft.com sent:

Hi Gareth,

> A regression like this is quite serious. I'd recommend opening a
> bug report with a reproducible case so we can investigate and hopefully fix it within the day.

I created a test project and opened two tickets:

https://bugs.freepascal.org/view.php?id=34646
https://bugs.freepascal.org/view.php?id=34647

> At the moment I'm experimenting with increasing the speed of the
> optimizer for x86_64, and then porting to i386 when it's proven
> successful. Having teething problems though!

Sounds great. I wish more of our products had 64-bit CPUs... What's the speed-up you are seeing on my test project so far?

Best regards,
Simon
Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM
Hi Gareth,

> A regression like this is quite serious. I'd recommend opening a
> bug report with a reproducible case so we can investigate and hopefully fix
> it within the day.

I created a test project and opened two tickets:

https://bugs.freepascal.org/view.php?id=34646
https://bugs.freepascal.org/view.php?id=34647

> At the moment I'm experimenting with increasing the speed of the
> optimizer for x86_64, and then porting to i386 when it's proven
> successful. Having teething problems though!

Sounds great. I wish more of our products had 64-bit CPUs... What's the speed-up you are seeing on my test project so far?

Best regards,
Simon
Re: [fpc-devel] wrong step-over with fpc debug info / how to do objcopy on Mac, or strip .debug_frame
On 03/12/18 13:58, Martin wrote:
> Posts by "bigDan":
> http://forum.lazarus-ide.org/index.php/topic,42869.msg303599.html#msg303599
>
> The log he provided shows that
> - lldb got a "thread step-over"
> - lldb believed to have stopped at the end of step-over (not any other
>   reason): "stop reason = step over"
> - the active thread remained the same. So stepping was done in the
>   correct thread
> - the called subroutine is located at a different address (not inlined)
>   pc before stepping (in calling code) 0x000100059d8e
>   pc after stepping (in subroutine) 0x0001002718e8
> - the stackpointer was reduced by 8, and the stackframe register was NOT
>   yet modified. So the stop was in the prologue ("begin" line of function)
>
> On Windows one way to solve the problem was to get rid of .debug_frame
> info. At best it hides an apparent bug in FPC's generation of Dwarf CFI.
>
> It would be interesting to test if that happens on Mac too.
> Does anyone know how to strip that info from the app bundle?

FPC does not generate any information that gets stored in .debug_frame on Darwin. The section that is there probably comes from crt1.o or so, and does not cover FPC-generated code.

Jonas
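The inference in Martin's log analysis (stack pointer down by 8, frame register unchanged, therefore stopped in the callee's prologue) can be written out as a small check. This is only a sketch of that reasoning in Python: the function name, the return strings and the register values in the usage lines are hypothetical, and the 8-byte word size assumes a 64-bit target where the call pushes only a return address.

```python
def classify_stop(sp_before, sp_after, fp_before, fp_after, word_size=8):
    """Guess where a step landed from the registers before and after it."""
    return_address_pushed = (sp_before - sp_after) == word_size
    frame_untouched = fp_before == fp_after

    if return_address_pushed and frame_untouched:
        # A call happened, but the callee has not set up its frame yet.
        return "callee prologue"
    if sp_before == sp_after and frame_untouched:
        # Registers look like the caller's frame again.
        return "same frame"
    return "unknown"


# Hypothetical register values, shaped like the case described in the log:
stepped_into_callee = classify_stop(0x7000_1000, 0x7000_1000 - 8,
                                    0x7000_1040, 0x7000_1040)
normal_step_over = classify_stop(0x7000_1000, 0x7000_1000,
                                 0x7000_1040, 0x7000_1040)
```

With these inputs the first call reports a stop in the callee prologue, which is the situation the log shows, and the second reports the caller's own frame, which is what a step-over is expected to produce.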