Re: Quantitative analysis of -Os vs -O3

2017-08-27 Thread Andi Kleen
Allan Sandfeld Jensen  writes:
>
> Yeah. That is just more problematic in practice, though I do believe we have
> support for it. It is good to know it will automatically upgrade
> optimizations like that. I just wish there were a way to distribute
> pre-generated arch-independent training data.

autofdo supports that in principle (but it would probably need some
improvements in the tools to make it really easy to use, especially
with shared libraries).
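The autofdo flow Andi refers to could look roughly like the following (an illustrative sketch, not runnable everywhere: it needs LBR-capable hardware for `perf record -b`, and `myapp` is a placeholder binary name):

```shell
# Sample branches while running a training workload (requires perf + LBR).
perf record -b -o perf.data ./myapp training-input

# Convert the perf profile into GCC's gcov-like AutoFDO format.
create_gcov --binary=./myapp --profile=perf.data --gcov=myapp.afdo

# Rebuild feeding the profile back in; the .afdo file is keyed by source
# positions, which is what makes sharing it across builds conceivable.
gcc -O2 -fauto-profile=myapp.afdo -o myapp myapp.c
```

The profile is tied to source line numbers rather than to a particular binary layout, which is the property that would make distributing pre-generated training data plausible in principle.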

-Andi


Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Michael Clark
FYI - I’ve updated the stats to include -O2 in addition to -O3 and -Os:

- https://rv8.io/bench#optimisation

There are 57 plots and 31 tables. It’s quite a bit of data. It will be quite 
interesting to run these on new gcc releases to monitor changes.

The Geomean for -O2 is 0.98 of -O3 on x86-64. I probably need to add some 
tables that show file sizes per architecture side by side, versus the current 
grouping by optimisation level, to allow comparisons between architectures. If 
I pivot the data, we can add file size ratios by optimisation level per 
architecture. Note: these are relatively small benchmark programs; however, 
the stats are still interesting. I’m most interested in RISC-V register 
allocation at present.

-O2 does pretty well on file size compared to -O3, on all architectures. At a 
glance, the -O2 file sizes are slightly larger than the -Os file sizes but the 
performance increase is considerably more. I could perhaps show ratios of 
performance vs size between -O2 and -Os.
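For reference, the Geomean figures here are geometric means of per-benchmark ratios, i.e. exp of the mean of the logs. A quick sketch of the computation (the ratios below are made up for illustration, not the actual benchmark data):

```shell
# Geometric mean of per-benchmark runtime ratios (hypothetical values).
# exp(mean(log(r_i))) is the standard way to average ratios, since it
# treats a 2x speedup and a 2x slowdown symmetrically.
printf '%s\n' 0.95 1.02 0.97 |
  awk '{ s += log($1) } END { printf "%.3f\n", exp(s / NR) }'
# prints 0.980
```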

> On 26 Aug 2017, at 10:05 PM, Michael Clark  wrote:
> 
>> 
>> On 26 Aug 2017, at 8:39 PM, Andrew Pinski  wrote:
>> 
>> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark  wrote:
>>> Dear GCC folk,
>>> I have to say that GCC’s -Os caught me by surprise after several years
>>> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year
>>> and a half I have been working on RISC-V development and have been
>>> exclusively using GCC for RISC-V builds, and initially I was using -Os.
>>> After performing a qualitative/quantitative assessment I don’t believe
>>> GCC’s current -Os is particularly useful, at least for my needs, as it
>>> doesn’t provide a commensurate saving in size given the sometimes quite
>>> huge drop in performance.
>>> 
>>> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
>>> frustration thread, as I think Apple’s documentation which presumably 
>>> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps 
>>> using -O2 as a starting point) with the idea that the current -Os is 
>>> renamed to -Oz.
>>> 
>>>   -Oz
>>>      (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>>>      enables the same optimization flags that -Os uses, but -Oz also
>>>      enables other optimizations intended solely to reduce code size.
>>>      In particular, instructions that encode into fewer bytes are
>>>      preferred over longer instructions that execute in fewer cycles.
>>>      -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>>>      -Oz employs the same inlining limits and avoids string instructions
>>>      just like -Os.
>>> 
>>>   -Os
>>>      Optimize for size, but not at the expense of speed. -Os enables all
>>>      -O2 optimizations that do not typically increase code size.
>>>      However, instructions are chosen for best performance, regardless
>>>      of size. To optimize solely for size on Darwin, use -Oz (APPLE
>>>      ONLY).
>>> 
>>> I have recently been working on a benchmark suite to test a RISC-V JIT 
>>> engine. I have performed all testing using GCC 7.1 as the baseline 
>>> compiler, and during the process I have collected several performance 
>>> metrics, some that are neutral to the JIT runtime environment. In 
>>> particular I have made performance comparisons between -Os and -O3 on x86, 
>>> along with capturing executable file sizes, dynamic retired instruction and 
>>> micro-op counts for x86, dynamic retired instruction counts for RISC-V as 
>>> well as dynamic register and instruction usage histograms for RISC-V, for 
>>> both -Os and -O3.
>>> 
>>> See the Optimisation section for a charted performance comparison between 
>>> -O3 and -Os. There are dozens of other plots that show the differences 
>>> between -Os and -O3.
>>> 
>>>   - https://rv8.io/bench
>>> 
>>> The Geomean shows a 19% performance hit for -Os vs -O3 on x86. The Geomean
>>> of course smooths over some pathological cases where -Os performance is
>>> severely degraded versus -O3 without significant, or commensurate, savings
>>> in size.
>> 
>> 
>> First let me put -Os usage and its history into some perspective:
>> 1) -Os is not useful for non-embedded users
>> 2) the embedded folks really need the smallest code possible and
>> usually will be willing to afford the performance hit
>> 3) -Os was a mistake for Apple to use in the first place; they used it
>> and then GCC got better for PowerPC to use the string instructions
>> which is why -Oz was added :)
>> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
>> 
>> Comparing -O3 to -Os is not totally fair on x86 due to the many
>> different instructions and encodings.
>> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
>> big issue.

Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Allan Sandfeld Jensen
On Samstag, 26. August 2017 12:59:06 CEST Markus Trippelsdorf wrote:
> On 2017.08.26 at 12:40 +0200, Allan Sandfeld Jensen wrote:
> > On Samstag, 26. August 2017 10:56:16 CEST Markus Trippelsdorf wrote:
> > > On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> > > > First let me put -Os usage and its history into some perspective:
> > > > 1) -Os is not useful for non-embedded users
> > > > 2) the embedded folks really need the smallest code possible and
> > > > usually will be willing to afford the performance hit
> > > > 3) -Os was a mistake for Apple to use in the first place; they used it
> > > > and then GCC got better for PowerPC to use the string instructions
> > > > which is why -Oz was added :)
> > > > 4) -Os is used heavily by the arm/thumb2 folks in bare metal
> > > > applications.
> > > > 
> > > > Comparing -O3 to -Os is not totally fair on x86 due to the many
> > > > different instructions and encodings.
> > > > Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> > > > big issue.
> > > > I soon have a need to keep overall (bare-metal) application size down
> > > > to just 256k.
> > > > Micro-controllers are places where -Os matters the most.
> > > > 
> > > > This comment does not help my application usage.  It rather hurts it
> > > > and goes against what -Os is really about.  It is not about reducing
> > > > icache pressure but overall application code size.  I really need the
> > > > code to fit into a specific size.
> > > 
> > > For many applications using -flto does reduce code size more than just
> > > going from -O2 to -Os.
> > 
> > I added the option to optimize with -Os in Qt, and it gives an average 15%
> > reduction in binary size, sometimes as high as 25%. Using LTO gives almost
> > the same (slightly less), but the two options combine perfectly and using
> > both can reduce binary size by 20 to 40%. And that is on a shared
> > library, not even a statically linked binary.
> > 
> > The only real minus is that some of the libraries, especially QtGui, would
> > benefit from auto-vectorization, so it would be nice if there existed
> > an -O3s version which vectorized the most obvious vectorizable functions;
> > a few hundred extra bytes for an additional version here and there would
> > do good. Fortunately it doesn't do too much damage, as we have manually
> > vectorized routines in order to have good performance on MSVC as well; if
> > we relied more on auto-vectorization it would be worse.
> 
> In that case using profile guided optimizations will help. It will
> optimize cold functions with -Os and hot functions with -O3 (when using
> e.g.: "-flto -O3 -fprofile-use"). Of course you will have to compile
> twice and also collect training data from your library in between.

Yeah. That is just more problematic in practice, though I do believe we have 
support for it. It is good to know it will automatically upgrade optimizations 
like that. I just wish there were a way to distribute pre-generated 
arch-independent training data.

`Allan 



Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Markus Trippelsdorf
On 2017.08.26 at 12:40 +0200, Allan Sandfeld Jensen wrote:
> On Samstag, 26. August 2017 10:56:16 CEST Markus Trippelsdorf wrote:
> > On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> > > First let me put -Os usage and its history into some perspective:
> > > 1) -Os is not useful for non-embedded users
> > > 2) the embedded folks really need the smallest code possible and
> > > usually will be willing to afford the performance hit
> > > 3) -Os was a mistake for Apple to use in the first place; they used it
> > > and then GCC got better for PowerPC to use the string instructions
> > > which is why -Oz was added :)
> > > 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
> > > 
> > > Comparing -O3 to -Os is not totally fair on x86 due to the many
> > > different instructions and encodings.
> > > Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> > > big issue.
> > > I soon have a need to keep overall (bare-metal) application size down
> > > to just 256k.
> > > Micro-controllers are places where -Os matters the most.
> > > 
> > > This comment does not help my application usage.  It rather hurts it
> > > and goes against what -Os is really about.  It is not about reducing
> > > icache pressure but overall application code size.  I really need the
> > > code to fit into a specific size.
> > 
> > For many applications using -flto does reduce code size more than just
> > going from -O2 to -Os.
> 
> I added the option to optimize with -Os in Qt, and it gives an average 15%
> reduction in binary size, sometimes as high as 25%. Using LTO gives almost
> the same (slightly less), but the two options combine perfectly and using
> both can reduce binary size by 20 to 40%. And that is on a shared library,
> not even a statically linked binary.
> 
> The only real minus is that some of the libraries, especially QtGui, would
> benefit from auto-vectorization, so it would be nice if there existed an
> -O3s version which vectorized the most obvious vectorizable functions; a
> few hundred extra bytes for an additional version here and there would do
> good. Fortunately it doesn't do too much damage, as we have manually
> vectorized routines in order to have good performance on MSVC as well; if
> we relied more on auto-vectorization it would be worse.

In that case using profile guided optimizations will help. It will
optimize cold functions with -Os and hot functions with -O3 (when using
e.g.: "-flto -O3 -fprofile-use"). Of course you will have to compile
twice and also collect training data from your library in between.

-- 
Markus


Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Allan Sandfeld Jensen
On Samstag, 26. August 2017 10:56:16 CEST Markus Trippelsdorf wrote:
> On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> > First let me put -Os usage and its history into some perspective:
> > 1) -Os is not useful for non-embedded users
> > 2) the embedded folks really need the smallest code possible and
> > usually will be willing to afford the performance hit
> > 3) -Os was a mistake for Apple to use in the first place; they used it
> > and then GCC got better for PowerPC to use the string instructions
> > which is why -Oz was added :)
> > 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
> > 
> > Comparing -O3 to -Os is not totally fair on x86 due to the many
> > different instructions and encodings.
> > Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> > big issue.
> > I soon have a need to keep overall (bare-metal) application size down
> > to just 256k.
> > Micro-controllers are places where -Os matters the most.
> > 
> > This comment does not help my application usage.  It rather hurts it
> > and goes against what -Os is really about.  It is not about reducing
> > icache pressure but overall application code size.  I really need the
> > code to fit into a specific size.
> 
> For many applications using -flto does reduce code size more than just
> going from -O2 to -Os.

I added the option to optimize with -Os in Qt, and it gives an average 15% 
reduction in binary size, sometimes as high as 25%. Using LTO gives almost the 
same (slightly less), but the two options combine perfectly and using both can 
reduce binary size by 20 to 40%. And that is on a shared library, not even a 
statically linked binary.

The only real minus is that some of the libraries, especially QtGui, would 
benefit from auto-vectorization, so it would be nice if there existed an -O3s 
version which vectorized the most obvious vectorizable functions; a few 
hundred extra bytes for an additional version here and there would do good. 
Fortunately it doesn't do too much damage, as we have manually vectorized 
routines in order to have good performance on MSVC as well; if we relied more 
on auto-vectorization it would be worse.

`Allan



Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Michael Clark

> On 26 Aug 2017, at 8:39 PM, Andrew Pinski  wrote:
> 
> On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark  wrote:
>> Dear GCC folk,
>> I have to say that GCC’s -Os caught me by surprise after several years
>> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year
>> and a half I have been working on RISC-V development and have been
>> exclusively using GCC for RISC-V builds, and initially I was using -Os.
>> After performing a qualitative/quantitative assessment I don’t believe
>> GCC’s current -Os is particularly useful, at least for my needs, as it
>> doesn’t provide a commensurate saving in size given the sometimes quite
>> huge drop in performance.
>> 
>> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
>> frustration thread, as I think Apple’s documentation which presumably 
>> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps 
>> using -O2 as a starting point) with the idea that the current -Os is renamed 
>> to -Oz.
>> 
>>    -Oz
>>       (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>>       enables the same optimization flags that -Os uses, but -Oz also
>>       enables other optimizations intended solely to reduce code size.
>>       In particular, instructions that encode into fewer bytes are
>>       preferred over longer instructions that execute in fewer cycles.
>>       -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>>       -Oz employs the same inlining limits and avoids string instructions
>>       just like -Os.
>> 
>>    -Os
>>       Optimize for size, but not at the expense of speed. -Os enables all
>>       -O2 optimizations that do not typically increase code size.
>>       However, instructions are chosen for best performance, regardless
>>       of size. To optimize solely for size on Darwin, use -Oz (APPLE
>>       ONLY).
>> 
>> I have recently been working on a benchmark suite to test a RISC-V JIT 
>> engine. I have performed all testing using GCC 7.1 as the baseline compiler, 
>> and during the process I have collected several performance metrics, some 
>> that are neutral to the JIT runtime environment. In particular I have made 
>> performance comparisons between -Os and -O3 on x86, along with capturing 
>> executable file sizes, dynamic retired instruction and micro-op counts for 
>> x86, dynamic retired instruction counts for RISC-V as well as dynamic 
>> register and instruction usage histograms for RISC-V, for both -Os and -O3.
>> 
>> See the Optimisation section for a charted performance comparison between 
>> -O3 and -Os. There are dozens of other plots that show the differences 
>> between -Os and -O3.
>> 
>>- https://rv8.io/bench
>> 
>> The Geomean shows a 19% performance hit for -Os vs -O3 on x86. The Geomean
>> of course smooths over some pathological cases where -Os performance is
>> severely degraded versus -O3 without significant, or commensurate, savings
>> in size.
> 
> 
> First let me put -Os usage and its history into some perspective:
> 1) -Os is not useful for non-embedded users
> 2) the embedded folks really need the smallest code possible and
> usually will be willing to afford the performance hit
> 3) -Os was a mistake for Apple to use in the first place; they used it
> and then GCC got better for PowerPC to use the string instructions
> which is why -Oz was added :)
> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
> 
> Comparing -O3 to -Os is not totally fair on x86 due to the many
> different instructions and encodings.
> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> big issue.
> I soon have a need to keep overall (bare-metal) application size down
> to just 256k.
> Micro-controllers are places where -Os matters the most.

Fair points.

- Size at all costs is useful for the embedded case where there is a 
restricted footprint.
- It’s fair to compare on RISC-V, which has the RVC compressed ISA extension, 
conceptually similar to Thumb-2.
- I understand that renaming -Os to -Oz would cause a few downstream issues 
for those who expect size at all costs.
- There is an achievable use-case for good RVC compression and good 
performance on RISC-V.

However the question remains: what options does one choose for size, but not 
size at the expense of speed? -O2 and an -mtune?

I’m probably interested in an -O2 with an -mtune that can favour register 
allocations that result in better RVC compression for RISC-V. Ideally the 
dominant register set can be assigned to x8 through x15 using loop frequency 
information, and this would result in better compression and also reduce 
dynamic icache pressure. I think I should look more closely at LRA and see how 
it uses register_priority.

There is a use 

RE: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Shi, Steven
> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
Also by x86 in bare-metal firmware, e.g. http://www.uefi.org/ 

> For many applications using -flto does reduce code size more than just
> going from -O2 to -Os.
Yes. -flto is a must-have, but -Os is still necessary. E.g. UEFI firmware 
uses both (-flto -Os) when building with GCC. Only -flto + -Os makes the UEFI 
firmware GCC build competitive with MSVS in terms of code size.


Steven
Thanks



Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Markus Trippelsdorf
On 2017.08.26 at 01:39 -0700, Andrew Pinski wrote:
> 
> First let me put -Os usage and its history into some perspective:
> 1) -Os is not useful for non-embedded users
> 2) the embedded folks really need the smallest code possible and
> usually will be willing to afford the performance hit
> 3) -Os was a mistake for Apple to use in the first place; they used it
> and then GCC got better for PowerPC to use the string instructions
> which is why -Oz was added :)
> 4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.
> 
> Comparing -O3 to -Os is not totally fair on x86 due to the many
> different instructions and encodings.
> Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
> big issue.
> I soon have a need to keep overall (bare-metal) application size down
> to just 256k.
> Micro-controllers are places where -Os matters the most.
> 
> This comment does not help my application usage.  It rather hurts it
> and goes against what -Os is really about.  It is not about reducing
> icache pressure but overall application code size.  I really need the
> code to fit into a specific size.

For many applications using -flto does reduce code size more than just
going from -O2 to -Os.

-- 
Markus


Re: Quantitative analysis of -Os vs -O3

2017-08-26 Thread Andrew Pinski
On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark  wrote:
> Dear GCC folk,
> I have to say that GCC’s -Os caught me by surprise after several years
> using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year
> and a half I have been working on RISC-V development and have been
> exclusively using GCC for RISC-V builds, and initially I was using -Os.
> After performing a qualitative/quantitative assessment I don’t believe
> GCC’s current -Os is particularly useful, at least for my needs, as it
> doesn’t provide a commensurate saving in size given the sometimes quite
> huge drop in performance.
>
> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
> frustration thread, as I think Apple’s documentation which presumably 
> documents Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps 
> using -O2 as a starting point) with the idea that the current -Os is renamed 
> to -Oz.
>
> -Oz
>    (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>    enables the same optimization flags that -Os uses, but -Oz also
>    enables other optimizations intended solely to reduce code size.
>    In particular, instructions that encode into fewer bytes are
>    preferred over longer instructions that execute in fewer cycles.
>    -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>    -Oz employs the same inlining limits and avoids string instructions
>    just like -Os.
>
> -Os
>    Optimize for size, but not at the expense of speed. -Os enables all
>    -O2 optimizations that do not typically increase code size.
>    However, instructions are chosen for best performance, regardless
>    of size. To optimize solely for size on Darwin, use -Oz (APPLE
>    ONLY).
>
> I have recently been working on a benchmark suite to test a RISC-V JIT 
> engine. I have performed all testing using GCC 7.1 as the baseline compiler, 
> and during the process I have collected several performance metrics, some 
> that are neutral to the JIT runtime environment. In particular I have made 
> performance comparisons between -Os and -O3 on x86, along with capturing 
> executable file sizes, dynamic retired instruction and micro-op counts for 
> x86, dynamic retired instruction counts for RISC-V as well as dynamic 
> register and instruction usage histograms for RISC-V, for both -Os and -O3.
>
> See the Optimisation section for a charted performance comparison between -O3 
> and -Os. There are dozens of other plots that show the differences between 
> -Os and -O3.
>
> - https://rv8.io/bench
>
> The Geomean shows a 19% performance hit for -Os vs -O3 on x86. The Geomean
> of course smooths over some pathological cases where -Os performance is
> severely degraded versus -O3 without significant, or commensurate, savings
> in size.


First let me put -Os usage and its history into some perspective:
1) -Os is not useful for non-embedded users
2) the embedded folks really need the smallest code possible and
usually will be willing to afford the performance hit
3) -Os was a mistake for Apple to use in the first place; they used it
and then GCC got better for PowerPC to use the string instructions
which is why -Oz was added :)
4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.

Comparing -O3 to -Os is not totally fair on x86 due to the many
different instructions and encodings.
Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
big issue.
I soon have a need to keep overall (bare-metal) application size down
to just 256k.
Micro-controllers are places where -Os matters the most.

>
> I don’t currently have -O2 in my results; however, it seems like I should 
> -O2 to the benchmark suite. If you take a look at the web page you’ll see 
> that there is already a huge amount of data given we have captured dynamic 
> register frequencies and dynamic instruction frequencies for -Os and -O3. The 
> tables and charts are all generated by scripts so if there is interest I 
> could add -O2. I can also pretty easily perform runs with new compiler 
> versions as everything is completely automated. The biggest factor is that it 
> currently takes 4 hours for a full run as we run all of the benchmarks in a 
> simulator to capture dynamic register usage and dynamic instruction usage.
>
> After looking at the results, one has to question the utility of -Os in its
> present form, and indeed question how it is actually used in practice, given
> the proportion of savings in executable size. After my assessment I would
> not recommend anyone use -Os, because its savings in size are not
> proportionate to the loss in performance. I feel discouraged from using it
> after looking at the results. I really don’t believe -Os makes the right
> trades; e.g. reducing icache pressure can indeed lead to better performance
> due to reduced code size.

Quantitative analysis of -Os vs -O3

2017-08-26 Thread Michael Clark
Dear GCC folk,

I have to say that GCC’s -Os caught me by surprise after several years using 
Apple GCC and more recently LLVM/Clang in Xcode. Over the last year and a half 
I have been working on RISC-V development and have been exclusively using GCC 
for RISC-V builds, and initially I was using -Os. After performing a 
qualitative/quantitative assessment I don’t believe GCC’s current -Os is 
particularly useful, at least for my needs, as it doesn’t provide a 
commensurate saving in size given the sometimes quite huge drop in performance.

I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC 
frustration thread, as I think Apple’s documentation which presumably documents 
Clang/LLVM -Os policy is what I would call an ideal -Os (perhaps using -O2 as a 
starting point) with the idea that the current -Os is renamed to -Oz.

-Oz
   (APPLE ONLY) Optimize for size, regardless of performance. -Oz
   enables the same optimization flags that -Os uses, but -Oz also
   enables other optimizations intended solely to reduce code size.
   In particular, instructions that encode into fewer bytes are
   preferred over longer instructions that execute in fewer cycles.
   -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
   -Oz employs the same inlining limits and avoids string instructions
   just like -Os.

-Os
   Optimize for size, but not at the expense of speed. -Os enables all
   -O2 optimizations that do not typically increase code size.
   However, instructions are chosen for best performance, regardless
   of size. To optimize solely for size on Darwin, use -Oz (APPLE
   ONLY).

I have recently been working on a benchmark suite to test a RISC-V JIT engine. 
I have performed all testing using GCC 7.1 as the baseline compiler, and during 
the process I have collected several performance metrics, some that are neutral 
to the JIT runtime environment. In particular I have made performance 
comparisons between -Os and -O3 on x86, along with capturing executable file 
sizes, dynamic retired instruction and micro-op counts for x86, dynamic retired 
instruction counts for RISC-V as well as dynamic register and instruction usage 
histograms for RISC-V, for both -Os and -O3.

See the Optimisation section for a charted performance comparison between -O3 
and -Os. There are dozens of other plots that show the differences between -Os 
and -O3.

- https://rv8.io/bench

The Geomean shows a 19% performance hit for -Os vs -O3 on x86. The Geomean of 
course smooths over some pathological cases where -Os performance is severely 
degraded versus -O3 without significant, or commensurate, savings in size.

I don’t currently have -O2 in my results; however, it seems like I should add -O2 
to the benchmark suite. If you take a look at the web page you’ll see that 
there is already a huge amount of data given we have captured dynamic register 
frequencies and dynamic instruction frequencies for -Os and -O3. The tables and 
charts are all generated by scripts so if there is interest I could add -O2. I 
can also pretty easily perform runs with new compiler versions as everything is 
completely automated. The biggest factor is that it currently takes 4 hours for 
a full run as we run all of the benchmarks in a simulator to capture dynamic 
register usage and dynamic instruction usage.

After looking at the results, one has to question the utility of -Os in its 
present form, and indeed question how it is actually used in practice, given 
the proportion of savings in executable size. After my assessment I would not 
recommend anyone use -Os, because its savings in size are not proportionate 
to the loss in performance. I feel discouraged from using it after looking at 
the results. I really don’t believe -Os makes the right trades; e.g. reducing 
icache pressure can indeed lead to better performance due to reduced code size.

I also wonder whether -O2 level optimisations may be a good starting point for 
a more useful -Os and how one would proceed towards selecting optimisations to 
add back to -Os to increase its usability, or rename the current -Os to -Oz and 
make -Os an alias for -O2. A similar profile to -O2 would probably produce less 
shock for anyone who does quantitative performance analysis of -Os.

In fact there are some interesting issues for the RISC-V backend, given that 
the assembler performs RVC compression and GCC doesn’t really see the size of 
emitted instructions. It would be an interesting backend in which to 
investigate improving -Os, presuming that a backend can opt in to various 
optimisations for a given optimisation level. RISC-V would gain most of its 
size and runtime icache pressure reduction improvements by getting the 
highest-frequency registers allocated within the 8-register set that is