Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-12-04 Thread Sven Barth via fpc-devel
On Wed, 5 Dec 2018 at 00:06, J. Gareth Moreton wrote:

> The more you learn!  What is TLS, curiously? Given that I do a lot of my
> work at the assembler level, I figure this is something I should know!
>

Thread Local Storage. On ELF targets, TLS accesses are expressed as pseudo
instructions that the assembler needs to expand to the correct instructions
for the target (and TLS model), which is why the internal assembler does
not handle them yet.
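
To make the idea concrete: a thread-local variable is one of which every
thread sees its own independent copy. The sketch below only illustrates
that concept, using Python's threading.local rather than Pascal's threadvar
or the ELF pseudo instructions themselves. The names tls, worker and
run_workers are made up for the example.

```python
import threading

# One shared object; any attribute set on it is private to the thread that set it.
tls = threading.local()


def worker(name, barrier, results):
    # Each thread stores its own value in the thread-local slot ...
    tls.value = name
    # ... waits until every other thread has stored its value too ...
    barrier.wait()
    # ... and still reads back exactly what it wrote itself.
    results[name] = tls.value


def run_workers(names):
    """Start one thread per name and collect what each thread read back."""
    results = {}
    barrier = threading.Barrier(len(names))
    threads = [
        threading.Thread(target=worker, args=(n, barrier, results))
        for n in names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_workers(["a", "b", "c"]))
    # The main thread never assigned tls.value, so it has no such attribute.
    print(hasattr(tls, "value"))
```

With an ordinary global instead of tls, the threads would overwrite each
other's value before reading it back. Keeping the copies apart is the job
that the expanded instructions do at the machine level.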

Regards,
Sven

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] Optimization theory

2018-12-04 Thread J. Gareth Moreton
 Not sure if this was intended for the group mailing list or not, but it was
only e-mailed to me.

 One question though... in your example, what optimisation level are you
compiling under? I thought the peephole optimizer already changes things
like "lea (%ecx,%eax),%eax" to "add %ecx,%eax".  Nevertheless, my Deep
Optimizer is able to spot the unnecessary assignments to %eax.  I did
submit a patch, but because it was a prototype and tied into the
post-peephole optimizer, it hasn't received that much attention.  It's
something I'm still working on though, and something I'd like to submit as
an -O4 option.
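
 For anyone following along, the kind of rewrite meant here can be sketched
as a toy pass over a list of instructions. This is only an illustration in
Python, not FPC's peephole optimizer: the tuple representation and the name
peephole_lea_to_add are invented for the example. A real pass would also
have to check that the flags are not needed afterwards, because ADD changes
them and LEA does not.

```python
def peephole_lea_to_add(instrs):
    """Rewrite `lea (%a,%b),%b` as `add %a,%b` in a toy instruction list.

    Hypothetical instruction format (AT&T operand order):
      ("lea", (base, index), dest)  ->  lea (%base,%index),%dest
      ("add", src, dest)            ->  add %src,%dest
    Anything else is passed through untouched.
    Assumption: the flags are dead after each LEA; this sketch does not check.
    """
    out = []
    for ins in instrs:
        if ins[0] == "lea":
            _, (base, index), dest = ins
            if dest == index:
                # dest = base + dest  is the same as  dest += base
                out.append(("add", base, dest))
                continue
            if dest == base:
                # dest = dest + index  is the same as  dest += index
                out.append(("add", index, dest))
                continue
        # Destination is a third register, or not a LEA at all: keep as-is.
        out.append(ins)
    return out


if __name__ == "__main__":
    code = [
        ("lea", ("ecx", "eax"), "eax"),  # candidate for the rewrite
        ("lea", ("ecx", "eax"), "edx"),  # three-operand form, must stay a LEA
        ("mov", "eax", "ebx"),
    ]
    for ins in peephole_lea_to_add(code):
        print(ins)
```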
 Gareth aka Kit

 - Original Message -
 From: David Pethes pub...@satd.sk
 To: gar...@moreton-family.com
 Sent: Tue 04/12/18 19:45
 Subject: Fwd: Re: [fpc-devel] Optimization theory
 Hi Gareth, 
 nice to see that you are still working on compiler improvements. If I 
 may, can I ask you to look at the assembly output of this routine (it's 
 a link to compiler explorer): 

 
https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:0.89579519,j:1,lang:pascal,source:'unit+output%3B%0A%0Ainterface%0A%0Aprocedure+core_4x4(block:+psmallint)%3B%0A%0Aimplementation%0A%0Atype%0A++matrix_t+%3D+array%5B0..3,+0..3%5D+of+smallint%3B+%0A%0Aprocedure+core_4x4(block:+psmallint)%3B%0Avar%0A++m:+matrix_t%3B%0A++e,+f,+g,+h:+array%5B0..3%5D+of+smallint%3B%0A%0Avar%0A++i:+longword%3B%0Abegin%0A++move(block%5E,+m,+16*2)%3B%0A%0A++for+i+:%3D+0+to+3+do+begin%0A++e%5Bi%5D+:%3D+m%5Bi%5D%5B0%5D+%2B+m%5Bi%5D%5B3%5D%3B%0A++f%5Bi%5D+:%3D+m%5Bi%5D%5B0%5D+-+m%5Bi%5D%5B3%5D%3B%0A++g%5Bi%5D+:%3D+m%5Bi%5D%5B1%5D+%2B+m%5Bi%5D%5B2%5D%3B%0A++h%5Bi%5D+:%3D+m%5Bi%5D%5B1%5D+-+m%5Bi%5D%5B2%5D%3B%0A++end%3B%0A%0Aend%3B%0A+%0Aend.%0A'),l:'5',n:'0',o:'Pascal+source+%231',t:'0')),k:51.379872662717624,l:'4',m:100,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:fpc304,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',trim:'1'),lang:pascal,libs:!(),options:'-O4',source:1),l:'5',n:'0',o:'x86-64+fpc+3.0.4+(Editor+%231,+Compiler+%231)+Pascal',t:'0')),header:(),k:48.6201273372824,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',m:100,n:'0',o:'',t:'0')),version:4


 As you can see, there are a lot of duplicated index/load calculations, 
 yet there are only a couple of registers used, so the results could be 
 reused. I wonder if this is somehow covered in your optimization passes 
 as well, or if you have some idea where I would start if I wanted to 
 optimize the code generation for this? 
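
 To show what "reusing the results" could mean, here is the same loop
written twice, with the reuse done by hand at source level. It is only a
sketch in Python rather than Pascal: it says nothing about what FPC emits
or how an optimization pass would do it, and the function names are made
up. Python integers also do not wrap like smallint does.

```python
def core_4x4_naive(block):
    """Index m[i][k] afresh in every expression, as in the linked source."""
    m = [list(block[r * 4:r * 4 + 4]) for r in range(4)]
    e, f, g, h = [0] * 4, [0] * 4, [0] * 4, [0] * 4
    for i in range(4):
        e[i] = m[i][0] + m[i][3]
        f[i] = m[i][0] - m[i][3]
        g[i] = m[i][1] + m[i][2]
        h[i] = m[i][1] - m[i][2]
    return e, f, g, h


def core_4x4_reused(block):
    """Same result, but each row is indexed once and each element loaded once."""
    m = [list(block[r * 4:r * 4 + 4]) for r in range(4)]
    e, f, g, h = [0] * 4, [0] * 4, [0] * 4, [0] * 4
    for i in range(4):
        a, b, c, d = m[i]  # one row lookup, four loads, kept in locals
        e[i] = a + d
        f[i] = a - d
        g[i] = b + c
        h[i] = b - c
    return e, f, g, h


if __name__ == "__main__":
    block = list(range(16))
    print(core_4x4_naive(block))
    print(core_4x4_reused(block) == core_4x4_naive(block))
```

 In the first version every m[i][k] stands for a separate address
calculation and load. In the second, the four values live in locals, which
is roughly what keeping them in registers would amount to.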
 Thanks again for your efforts, 

 David 

 On 19. 6. 2018 1:51, J. Gareth Moreton wrote: 
 > Hah, thanks.  In a strange way, I find it quite fun.  The culmination of
 > my efforts has been the first stage of a "Deep Optimizer", which
 > searches around MOV commands to see if it can minimise pipeline stalls
 > and sometimes can remove MOV commands entirely if it determines that the
 > two registers are identical in value at that point.
 >
 > An example where inlining helped (actually, just copying and pasting
 > the function's contents virtually verbatim because everything was in
 > assembly language) is the new Int and Frac functions on x86-64, where I
 > inserted Int's contents directly into Frac.  While Int had a branch and
 > a few safety checks to prevent an integer overflow, it was much faster
 > when inlined compared to fiddling around with register moves and doing a
 > function call, especially as the rest of Frac was very fast already, and
 > it also allowed the entirety of Frac to have no stack frame.
 >
 > It's probably a bit insulting to Free Pascal, but my initial drive was
 > to reduce the size of the compiled binaries, without sacrificing speed.
 > Of course, improving speed is a major focus as well!
 > 
 > Gareth 
 > 
 > 
 > On Mon 18/06/18 19:49 , David Pethes pub...@satd.sk [1] sent: 
 > 
 > 
 > Hi, one less thing to worry about then :) Unless in a quite tight loop, 
 > it shouldn't make much difference - or any at all, due to OoO 
 > execution. 
 > By the way, this Intel code analyzer could be interesting to you - I 
 > only found it recently: 
 > https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/ [2]
 > 
 > 
 > Good luck with your optimizations! 
 > 
 > David 
 > 
 > 
 > On 17. 6. 2018 19:52, J. Gareth Moreton wrote: 
 > > Aah, I was in error. I used Agner as my 
 > > source, but even that says 5 cycles for a 
 > > near call and 17-33 for a far call, so I'm 
 > > not sure where I got 50 from. My 
 > > apologies. I have probably been avoiding 
 > > function calls more than necessary! 
 > > 
 > > Gareth aka. Kit 
 > 
 > 

 

Links:
--
[1] mailto:pub...@satd.sk
[2] https://software.intel.com/en-us/articles/intel-architecture-code-analyzer/


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-12-04 Thread J. Gareth Moreton
 The more you learn!  What is TLS, curiously? Given that I do a lot of my
work at the assembler level, I figure this is something I should know!

 Gareth

 On Tue 04/12/18 22:48 , Simon Kissel simon.kis...@nerdherrschaft.com sent:
 Hi Florian, 

 > Do you compile with -Aas? The internal assemblers do not support TLS
 > yet, this is WIP. 

 Ah wow! -Aas does indeed help. Both the assembler errors and 
 the internal error are gone, both in Linux i386 and ARM. 

 And the created binaries even work. Nice! Thank you! 

 Cheers, 

 Simon 



Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-12-04 Thread Simon Kissel
Hi Florian,

> Do you compile with -Aas? The internal assemblers do not support TLS yet, 
> this is WIP.

Ah wow! -Aas does indeed help. Both the assembler errors and
the internal error are gone, both in Linux i386 and ARM.

And the created binaries even work. Nice! Thank you!

Cheers,

Simon



Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-12-04 Thread Florian Klämpfl
On 04.12.2018 at 02:16, Simon Kissel wrote:
> Hi Florian,
>
>
> 
> we are currently trying to do some real-life benchmarks with our
> products, however with rev. 40346 compilation fails with the two following
> showstoppers:

Do you compile with -Aas? The internal assemblers do not support TLS yet, this 
is WIP.


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-12-04 Thread J. Gareth Moreton
 At the moment I'm trying to fix some Linux bugs.  When I compile Lazarus
with it, I get about a 5% speed increase for -O1 and a 15% speed increase
for -O3, but someone else reported a 2% slowdown for -O2 on their own test
project.  Either way, I hope my fundamental theory is sound... reducing
the number of code passes... and I can convince Florian that the possible
increase in maintenance difficulty is worth the performance gain.  If
that's successful, then I'll port the relevant code over to i386.

 When it's totally stable, I'll definitely run your own project through it
to see what we get.  There are other little speed savings that can be made
here and there that add to the running total.  Most of it is simple
refactoring.

 For your bug report, I had a look at the Intel® 64 and IA-32
Architectures Software Developer’s Manual to check on some of the
assembler routines, and the ones where LEA contains an immediate operand
are actually invalid.  My guess though is that the immediate was being
converted into a reference that doesn't contain any registers.  I'm not
sure if and how an assembler should support that though, even if it's
possible to represent the arrangement in raw machine code.  Something for
investigation, that's for sure.

 Gareth

 On Tue 04/12/18 19:29 , Simon Kissel simon.kis...@nerdherrschaft.com sent:
 Hi Gareth, 

 > A regression like this is quite serious. I'd recommend opening a 
 > bug report with a reproducible case so we can investigate and hopefully
 > fix it within the day. 

 created a test project, and opened two tickets: 

 https://bugs.freepascal.org/view.php?id=34646 [1] 
 https://bugs.freepascal.org/view.php?id=34647 [2] 

 > At the moment I'm experimenting with increasing the speed of the 
 > optimizer for x86_64, and then porting to i386 when it's proven 
 > successful. Having teething problems though! 

 Sounds great. I wish more of our products had 64bit CPUs... 

 What's the speed-up you are seeing on my test project so far? 

 Best regards, 

 Simon 

Links:
--
[1] https://bugs.freepascal.org/view.php?id=34646
[2] https://bugs.freepascal.org/view.php?id=34647


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-12-04 Thread Simon Kissel
Hi Gareth,

> A regression like this is quite serious.  I'd recommend opening a
> bug report with a reproducible case so we can investigate and hopefully fix 
> it within the day.

created a test project, and opened two tickets:

https://bugs.freepascal.org/view.php?id=34646
https://bugs.freepascal.org/view.php?id=34647

> At the moment I'm experimenting with increasing the speed of the
> optimizer for x86_64, and then porting to i386 when it's proven
> successful.  Having teething problems though!

Sounds great. I wish more of our products had 64bit CPUs...

What's the speed-up you are seeing on my test project so far?

Best regards,

Simon



Re: [fpc-devel] wrong step-over with fpc debug info / how to do objcopy on Mac, or strip .debug_frame

2018-12-04 Thread Jonas Maebe

On 03/12/18 13:58, Martin wrote:
> Posts by "bigDan":
> http://forum.lazarus-ide.org/index.php/topic,42869.msg303599.html#msg303599
>
> The log he provided shows that
> - lldb got a "thread step-over"
> - lldb believed it had stopped at the end of the step-over (not for any
>   other reason): "stop reason = step over"
> - the active thread remained the same. So stepping was done in the
>   correct thread
>
> - the called subroutine is located at a different address (not inlined)
>    pc before stepping (in calling code) 0x000100059d8e
>    pc after stepping (in subroutine) 0x0001002718e8
> - the stackpointer was reduced by 8, and the stackframe register was NOT
>   yet modified.
>
>    So the stop was in the prologue ("begin" line of function)
>
> On Windows one way to solve the problem was to get rid of .debug_frame
> info.

At best it hides an apparent bug in FPC's generation of DWARF CFI.

> It would be interesting to test if that happens on Mac too.
> Does anyone know how to strip that info from the app bundle?

FPC does not generate any information that gets stored in .debug_frame
on Darwin. The section that is there probably comes from crt1.o or so,
and does not cover FPC-generated code.



Jonas