Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Florian Klämpfl
On 23.11.2018 at 21:07, Simon Kissel wrote:
> own code, it won't get much better than what it is today,
> and that Kylix producing faster code does not compensate it

Well, to be fair, there is a lot of code out there where FPC is faster.
Nevertheless, FPC's code can still be improved.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Simon Kissel
Hi Adriaan,

In case you aren't just trolling and the subject really is of
interest to you, I would recommend reading the discussion
thread in full. That works much better than treating this
like a write-only system.

> You didn't answer any of my questions. The goal is to get the
> code faster, isn't it?

No, the goal is not to get any specific code faster. The goal
is to have the compiler and/or RTL improved so that all code
compiled benefits, and that execution speed in general gets on
par with the 15-year-old Kylix/Delphi 7 compilers.

And yes, of course we have been profiling our code for years, and we
know what we are doing and talking about. Our code sadly does
not have any bottlenecks in the sense of a small number of
functions eating most of the CPU, the load is pretty evenly
distributed across all of the functions. This means that the
problem is distributed all across the code. However, there
is something sticking out, being at the very top of pretty
much all multi-threaded code we compile:

fpc_pushexceptaddr & CRelocateThreadVar.

Besides this, not everything can be uncovered by profiling,
and that part is nothing that FPC can change: On one of
the ARM platforms we use every context switch results in a
CPU cache flush, so simply by having more threads *all* of
them will become slower.

The benchmark code, like our real-life code, is able to utilize
~99% of the CPU, so no, it's also not a matter of thread
synchronization (we aren't spinlocking).

The commercial reason behind putting out a 15k bounty is that
no matter how much more money I invest into optimizing my
own code, it won't get much better than what it is today,
and that Kylix producing faster code does not compensate for it
not supporting any of the nice-to-have language features that
FPC has today.

Simon






Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Simon Kissel
Hi Florian,

> Actually, most of the improvements so far are not related to
> threading. In particular r40339 helped a lot, it was a bug
> fix: the compiler assumed that a certain sub expression was written
> while it was not and this prevented CSE.

Even better, that means there is still gold to be uncovered :)

In our case the bottleneck very clearly appears to be that
every call to fpc_pushexceptaddr/fpc_popaddrstack causes a
call to CRelocateThreadVar, which causes a call to
pthread_getspecific.
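
To make the cost concrete, here is a minimal C sketch (not FPC's RTL code; all names such as tls_key, bump_via_key and bump_native are hypothetical). It contrasts a thread variable that is looked up through pthread_getspecific on every access with a compiler-native thread-local variable, which on common Linux targets typically needs no library call per access.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration; this is not FPC's RTL. */
static pthread_key_t tls_key;
static pthread_once_t tls_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    /* free() is run on each thread's block when that thread exits. */
    pthread_key_create(&tls_key, free);
}

/* Variant A: every access goes through library calls
   (pthread_once + pthread_getspecific) before the value is touched. */
static long *slot_via_key(void) {
    pthread_once(&tls_once, make_key);
    long *p = pthread_getspecific(tls_key);
    if (p == NULL) {
        /* First access on this thread: allocate a zeroed slot. */
        p = calloc(1, sizeof *p);
        if (p == NULL) abort();
        pthread_setspecific(tls_key, p);
    }
    return p;
}

long bump_via_key(void) {
    return ++*slot_via_key();
}

/* Variant B: compiler-native TLS (C11 _Thread_local). */
static _Thread_local long native_counter;

long bump_native(void) {
    return ++native_counter;
}
```

Both functions return the same per-thread sequence 1, 2, 3, ...; the difference is only in how much work each call does before it reaches the variable. If a lookup like variant A sits on the path of every exception frame push and pop, it is paid in every protected routine.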

We do create our ARM production builds with {$IMPLICITEXCEPTIONS OFF}
to get acceptable speed, else it would be completely unbearable.

BR,

Simon



Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Florian Klämpfl
On 23.11.2018 at 14:36, Simon Kissel wrote:
> Hi Adriaan,
> 
>> I find the phrase "FPC's terrible multi-threading performance"
>> unjust.
> 
> Well, see the complete thread to better understand what this
> is about, and what progress is being made. So far a 20%
> improvement has been made, which kinda is like a proof that
> there was something to improve ;)
> 
>> When I do multi-threading with FPC, I get a near N speed improvement
>> (on i386 and x86_64) where N is the number of cores,
>> including hyper-threaded cores.
> 
> This isn't about FPC's code not scaling with N cores, it does.
> It is about it being slow as soon as threads are used *at all*,
> due to TLS stuff and exception handling. It's slow in a linear
> fashion, so to say...
> 

Actually, most of the improvements so far are not related to threading. In
particular, r40339 helped a lot. It was a bug fix: the compiler assumed that a
certain subexpression was written while it was not, and this prevented CSE
(common subexpression elimination).
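
As a rough illustration of what is lost when CSE is blocked, here is a minimal C sketch. It is not the code affected by r40339, and the function names are hypothetical. Both functions return the same value; the second is what the optimizer can produce from the first once it can prove that neither operand was written between the two uses.

```c
/* Hypothetical example; not the code affected by r40339. */

/* As written by the programmer: a[i] * scale appears twice. */
int sum_naive(const int *a, int i, int scale) {
    int x = a[i] * scale + 1;
    int y = a[i] * scale - 1;
    return x + y;
}

/* What CSE effectively produces: the product is computed once.
   This is only valid if nothing writes a[i] or scale in between;
   a compiler that wrongly assumes such a write must recompute. */
int sum_cse(const int *a, int i, int scale) {
    int t = a[i] * scale;
    return (t + 1) + (t - 1);
}
```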


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Adriaan van Os

Simon Kissel wrote:

> This isn't about FPC's code not scaling with N cores, it does.
> It is about it being slow as soon as threads are used *at all*,

N cores being near N times faster than "not using threads at all".

> due to TLS stuff and exception handling. It's slow in a linear
> fashion, so to say...

You didn't answer any of my questions. The goal is to get the code faster, isn't it? Or are you
writing an academic thesis on compilers?


Regards,

Adriaan van Os


Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Simon Kissel
Hi Adriaan,

> I find the phrase "FPC's terrible multi-threading performance"
> unjust.

Well, see the complete thread to better understand what this
is about, and what progress is being made. So far a 20%
improvement has been made, which kinda is like a proof that
there was something to improve ;)

> When I do multi-threading with FPC, I get a near N speed improvement
> (on i386 and x86_64) where N is the number of cores,
> including hyper-threaded cores.

This isn't about FPC's code not scaling with N cores, it does.
It is about it being slow as soon as threads are used *at all*,
due to TLS stuff and exception handling. It's slow in a linear
fashion, so to say...

Best regards,

Simon



Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Sven Barth via fpc-devel
On Fri, 23 Nov 2018, 12:15, Adriaan van Os wrote:

> Simon Kissel wrote:
>
> > We know about a couple of bottlenecks (fpc_pushexceptaddr /
> > RelocateThreadVar etc) which explain FPC's terrible multi-threading
> > performance, but in general, FPC's code generator really is quite
> a mess, which we learned the hard way a couple of years ago when we
> > did optimization work on the ARM target.
>
> I find the phrase "FPC's terrible multi-threading performance" unjust.
> When I do multi-threading with FPC, I get a near N speed improvement
> (on i386 and x86_64) where N is the number of cores,
> including hyper-threaded cores.
>
> What about taking another way, having a precise look at the source code?
> Did you profile it? What sort of work does the code do? How are the
> threads synchronized? What data structures are used?
>
> I don't take "the compiler is so bad" without an answer to these questions.
>

Simon wrote that the same code performs better when compiled with Kylix, so
there definitely are things that FPC can do better, and as Florian's
work on TLS variables showed, they indeed *do* make FPC perform better. I suspect
a similar improvement with DWARF exceptions, as the setjmp/longjmp based
approach *is* more expensive for the case when no exception occurs
compared to marking protected code in the metadata, as DWARF
and SEH64 do.
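
A minimal C sketch of the setjmp/longjmp style, under the assumption that each protected block links a frame and saves its context on entry (the names frame, top, raise_error and protected_divide are hypothetical; this is not FPC's implementation). It shows where the cost falls when no exception occurs: the frame push, the setjmp call and the frame pop all run on every call. A table-based scheme such as DWARF instead records the protected ranges in metadata and does that work only when something is actually raised.

```c
#include <setjmp.h>
#include <stdlib.h>

/* Hypothetical sketch of setjmp/longjmp-style exception frames. */
struct frame {
    jmp_buf env;
    struct frame *prev;
};

static _Thread_local struct frame *top;  /* per-thread frame stack */
static _Thread_local int pending_code;   /* code of the raised error */

/* Raise: jump back to the innermost protected block. */
static void raise_error(int code) {
    if (top == NULL) abort();            /* no handler installed */
    pending_code = code;
    longjmp(top->env, 1);
}

static int divide(int a, int b) {
    if (b == 0) raise_error(42);
    return a / b;
}

/* Returns the quotient, or -code if an error was raised. */
int protected_divide(int a, int b) {
    struct frame f;
    volatile int result = 0;   /* volatile: read after longjmp */

    f.prev = top;
    top = &f;                  /* push: paid on every call */
    if (setjmp(f.env) == 0) {  /* context save: paid on every call */
        result = divide(a, b);
        top = f.prev;          /* pop: paid on every call */
    } else {
        top = f.prev;          /* pop after the jump back */
        result = -pending_code;
    }
    return result;
}
```

Note that the frame stack itself has to be thread-local, so in a scheme like this every push and pop also pays whatever a thread-variable access costs on the platform.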

Regards,
Sven



Re: [fpc-devel] The 15k bounty: Optimizing executable speed for Linux x86 / LLVM

2018-11-23 Thread Adriaan van Os

Simon Kissel wrote:

> We know about a couple of bottlenecks (fpc_pushexceptaddr /
> RelocateThreadVar etc) which explain FPC's terrible multi-threading
> performance, but in general, FPC's code generator really is quite
> a mess, which we learned the hard way a couple of years ago when we
> did optimization work on the ARM target.

I find the phrase "FPC's terrible multi-threading performance" unjust. When I do multi-threading
with FPC, I get a near N speed improvement (on i386 and x86_64) where N is the number of cores,
including hyper-threaded cores.

What about taking another way, having a precise look at the source code? Did you profile it? What
sort of work does the code do? How are the threads synchronized? What data structures are used?

I don't take "the compiler is so bad" without an answer to these questions.

Regards,

Adriaan van Os
