[patch 0/5] nios2 address mode improvements

2017-10-19 Thread Sandra Loosemore

This is the set of nios2 optimization patches that I've previously
mentioned in these threads:

https://gcc.gnu.org/ml/gcc/2017-10/msg00016.html
https://gcc.gnu.org/ml/gcc-patches/2017-10/msg00957.html

To give an overview of what this is for

The nios2 backend currently generates quite bad code for memory
accesses with addresses involving symbolic constants.  Like a typical
RISC machine, nios2 requires splitting such 32-bit constants into
HIGH/LO_SUM pairs.  Currently this happens in expand, and address
expressions involving such constants are always converted to use a
register indirect form.

One part of the problem is that the backend currently doesn't
recognize that LO_SUM is a legitimate address form (it's register
indirect with a constant offset using the %lo relocation).  That's
fixed in these patches.

A harder problem is that doing the high/lo_sum splitting in expand
inhibits subsequent optimizations.  One such problem arises when you
have accesses to multiple fields in a static structure object.  Expand
sees this as many (symbol + offset) expressions involving the same
symbol with different constant offsets.  What we should be doing in
that case is CSE'ing the symbol address computation rather than
splitting every such expression individually.

This patch series attacks that problem by deferring splitting to the
split1 pass, which happens after cse and fwprop optimizations.
Deferring the splitting also requires that TARGET_LEGITIMATE_ADDRESS_P
accept these symbolic constant expressions until the splitting takes
place, and that code that might generate 32-bit constants in other
places (e.g., the movsi expander) must not do so after they are
supposed to have been split.

This patch series also includes general improvements to the cost model
to get better CSE results -- in particular, the nios2 backend has been 
completely missing an implementation for TARGET_ADDRESS_COST.  I also 
found that making TARGET_LEGITIMIZE_ADDRESS smarter resulted in better

address cost modeling by the ivopts pass.

All together, this resulted in about a 7% code size improvement on the
customer-provided test case I was using for tuning purposes.

Patches in this set are broken down as follows:

1: Switch to LRA.
2: Detect when splitting has been completed.
3: Add splitters and recognize the new address modes.
4: Cost model improvements.
5: Test cases.

Part 2 is the piece that relates to the discussion linked above.  As
implemented, it works fine, but it's maybe not the best design.  I'll
hold off on committing the entire set for at least a few days in case
somebody wants to suggest a better solution.

-Sandra



Re: [patch 0/5] nios2 address mode improvements

2017-10-20 Thread Richard Biener
On Fri, Oct 20, 2017 at 4:03 AM, Sandra Loosemore
 wrote:
> This is the set of nios2 optimization patches that I've previously
> mentioned in these threads:
>
> https://gcc.gnu.org/ml/gcc/2017-10/msg00016.html
> https://gcc.gnu.org/ml/gcc-patches/2017-10/msg00957.html
>
> To give an overview of what this is for
>
> The nios2 backend currently generates quite bad code for memory
> accesses with addresses involving symbolic constants.  Like a typical
> RISC machine, nios2 requires splitting such 32-bit constants into
> HIGH/LO_SUM pairs.  Currently this happens in expand, and address
> expressions involving such constants are always converted to use a
> register indirect form.
>
> One part of the problem is that the backend currently doesn't
> recognize that LO_SUM is a legitimate address form (it's register
> indirect with a constant offset using the %lo relocation).  That's
> fixed in these patches.
>
> A harder problem is that doing the high/lo_sum splitting in expand
> inhibits subsequent optimizations.  One such problem arises when you
> have accesses to multiple fields in a static structure object.  Expand
> sees this as many (symbol + offset) expressions involving the same
> symbol with different constant offsets.  What we should be doing in
> that case is CSE'ing the symbol address computation rather than
> splitting every such expression individually.
>
> This patch series attacks that problem by deferring splitting to the
> split1 pass, which happens after cse and fwprop optimizations.
> Deferring the splitting also requires that TARGET_LEGITIMATE_ADDRESS_P
> accept these symbolic constant expressions until the splitting takes
> place, and that code that might generate 32-bit constants in other
> places (e.g., the movsi expander) must not do so after they are
> supposed to have been split.

How do other targets handle this situation?  Naiively I'd have handled
the splitting at reload/LRA time ... (which would make the flag
to test reload_completed)

There are quite a number of targets using lo_sum but I'm not sure they
share the issue with symbolic constants.

Otherwise defering the splitting of course looks like the correct thing to do.

Richard.

> This patch series also includes general improvements to the cost model
> to get better CSE results -- in particular, the nios2 backend has been
> completely missing an implementation for TARGET_ADDRESS_COST.  I also found
> that making TARGET_LEGITIMIZE_ADDRESS smarter resulted in better
> address cost modeling by the ivopts pass.
>
> All together, this resulted in about a 7% code size improvement on the
> customer-provided test case I was using for tuning purposes.
>
> Patches in this set are broken down as follows:
>
> 1: Switch to LRA.
> 2: Detect when splitting has been completed.
> 3: Add splitters and recognize the new address modes.
> 4: Cost model improvements.
> 5: Test cases.
>
> Part 2 is the piece that relates to the discussion linked above.  As
> implemented, it works fine, but it's maybe not the best design.  I'll
> hold off on committing the entire set for at least a few days in case
> somebody wants to suggest a better solution.
>
> -Sandra
>


Re: [patch 0/5] nios2 address mode improvements

2017-10-20 Thread Richard Biener
On Fri, Oct 20, 2017 at 10:12 AM, Richard Biener
 wrote:
> On Fri, Oct 20, 2017 at 4:03 AM, Sandra Loosemore
>  wrote:
>> This is the set of nios2 optimization patches that I've previously
>> mentioned in these threads:
>>
>> https://gcc.gnu.org/ml/gcc/2017-10/msg00016.html
>> https://gcc.gnu.org/ml/gcc-patches/2017-10/msg00957.html
>>
>> To give an overview of what this is for
>>
>> The nios2 backend currently generates quite bad code for memory
>> accesses with addresses involving symbolic constants.  Like a typical
>> RISC machine, nios2 requires splitting such 32-bit constants into
>> HIGH/LO_SUM pairs.  Currently this happens in expand, and address
>> expressions involving such constants are always converted to use a
>> register indirect form.
>>
>> One part of the problem is that the backend currently doesn't
>> recognize that LO_SUM is a legitimate address form (it's register
>> indirect with a constant offset using the %lo relocation).  That's
>> fixed in these patches.
>>
>> A harder problem is that doing the high/lo_sum splitting in expand
>> inhibits subsequent optimizations.  One such problem arises when you
>> have accesses to multiple fields in a static structure object.  Expand
>> sees this as many (symbol + offset) expressions involving the same
>> symbol with different constant offsets.  What we should be doing in
>> that case is CSE'ing the symbol address computation rather than
>> splitting every such expression individually.
>>
>> This patch series attacks that problem by deferring splitting to the
>> split1 pass, which happens after cse and fwprop optimizations.
>> Deferring the splitting also requires that TARGET_LEGITIMATE_ADDRESS_P
>> accept these symbolic constant expressions until the splitting takes
>> place, and that code that might generate 32-bit constants in other
>> places (e.g., the movsi expander) must not do so after they are
>> supposed to have been split.
>
> How do other targets handle this situation?  Naiively I'd have handled
> the splitting at reload/LRA time ... (which would make the flag
> to test reload_completed)
>
> There are quite a number of targets using lo_sum but I'm not sure they
> share the issue with symbolic constants.

sparc for example has in sparc_legitimate_address_p:

  /* During reload, accept the HIGH+LO_SUM construct generated by
 sparc_legitimize_reload_address.  */
  if (reload_in_progress
  && GET_CODE (rs1) == HIGH
  && XEXP (rs1, 0) == imm1)
return 1;

and it seems that sparc_legitimize_reload_address performs the splitting
(sparc uses LRA now so this part looks dead code to me -- maybe LRA
can do this magically somehow but I see nios2 still uses reload so the
code maybe a recipie to follow).

Richard.

> Otherwise defering the splitting of course looks like the correct thing to do.
>
> Richard.
>
>> This patch series also includes general improvements to the cost model
>> to get better CSE results -- in particular, the nios2 backend has been
>> completely missing an implementation for TARGET_ADDRESS_COST.  I also found
>> that making TARGET_LEGITIMIZE_ADDRESS smarter resulted in better
>> address cost modeling by the ivopts pass.
>>
>> All together, this resulted in about a 7% code size improvement on the
>> customer-provided test case I was using for tuning purposes.
>>
>> Patches in this set are broken down as follows:
>>
>> 1: Switch to LRA.
>> 2: Detect when splitting has been completed.
>> 3: Add splitters and recognize the new address modes.
>> 4: Cost model improvements.
>> 5: Test cases.
>>
>> Part 2 is the piece that relates to the discussion linked above.  As
>> implemented, it works fine, but it's maybe not the best design.  I'll
>> hold off on committing the entire set for at least a few days in case
>> somebody wants to suggest a better solution.
>>
>> -Sandra
>>


Re: [patch 0/5] nios2 address mode improvements

2017-10-20 Thread Ramana Radhakrishnan
On Fri, Oct 20, 2017 at 9:18 AM, Richard Biener
 wrote:
> On Fri, Oct 20, 2017 at 10:12 AM, Richard Biener
>> How do other targets handle this situation?  Naiively I'd have handled
>> the splitting at reload/LRA time ... (which would make the flag
>> to test reload_completed)
>>
>> There are quite a number of targets using lo_sum but I'm not sure they
>> share the issue with symbolic constants.
>
> sparc for example has in sparc_legitimate_address_p:
>
>   /* During reload, accept the HIGH+LO_SUM construct generated by
>  sparc_legitimize_reload_address.  */
>   if (reload_in_progress
>   && GET_CODE (rs1) == HIGH
>   && XEXP (rs1, 0) == imm1)
> return 1;
>
> and it seems that sparc_legitimize_reload_address performs the splitting
> (sparc uses LRA now so this part looks dead code to me -- maybe LRA
> can do this magically somehow but I see nios2 still uses reload so the
> code maybe a recipie to follow).

I remember a patch to LRA to get this done with high / lo_sum
targeting MIPS go in during the gcc 7 time frame which seemed to help.

We've had similar issues with high and lo_sums in the Arm and AArch64
ports. Delaying splitting has helped but we've not found a case to go
as fine grained as what Sandra is attempting here but that maybe
something to further investigate on these ports.

Ramana

>
> Richard.
>
>> Otherwise defering the splitting of course looks like the correct thing to 
>> do.
>>
>> Richard.
>>
>>> This patch series also includes general improvements to the cost model
>>> to get better CSE results -- in particular, the nios2 backend has been
>>> completely missing an implementation for TARGET_ADDRESS_COST.  I also found
>>> that making TARGET_LEGITIMIZE_ADDRESS smarter resulted in better
>>> address cost modeling by the ivopts pass.
>>>
>>> All together, this resulted in about a 7% code size improvement on the
>>> customer-provided test case I was using for tuning purposes.
>>>
>>> Patches in this set are broken down as follows:
>>>
>>> 1: Switch to LRA.
>>> 2: Detect when splitting has been completed.
>>> 3: Add splitters and recognize the new address modes.
>>> 4: Cost model improvements.
>>> 5: Test cases.
>>>
>>> Part 2 is the piece that relates to the discussion linked above.  As
>>> implemented, it works fine, but it's maybe not the best design.  I'll
>>> hold off on committing the entire set for at least a few days in case
>>> somebody wants to suggest a better solution.
>>>
>>> -Sandra
>>>


Re: [patch 0/5] nios2 address mode improvements

2017-10-20 Thread Jakub Jelinek
On Thu, Oct 19, 2017 at 08:03:45PM -0600, Sandra Loosemore wrote:
> A harder problem is that doing the high/lo_sum splitting in expand
> inhibits subsequent optimizations.  One such problem arises when you
> have accesses to multiple fields in a static structure object.  Expand
> sees this as many (symbol + offset) expressions involving the same
> symbol with different constant offsets.  What we should be doing in
> that case is CSE'ing the symbol address computation rather than
> splitting every such expression individually.

Do you have the needed relocations for that though?
If not, then you need to do:
  tmp = high (symbol);
  tmp |= lo_sum (symbol); // or +
  a = [tmp + 0];
  b = [tmp + 4];
  c = [tmp + 8];
if you do (like e.g. sparc64 has the %olo relocation), then you can do
  tmp = high (symbol);
  a = [tmp + lo_sum (symbol) + 0];
  b = [tmp + lo_sum (symbol) + 4];
  c = [tmp + lo_sum (symbol) + 8];
If you tried to do:
  tmp = high (symbol);
  a = [tmp + lo_sum (symbol)];
  b = [tmp + lo_sum (symbol + 4)];
  c = [tmp + lo_sum (symbol + 8)];
then this would break if lo_sum (symbol + 4) or lo_sum (symbol + 8)
is < 4.

Jakub


Re: [patch 0/5] nios2 address mode improvements

2017-10-20 Thread Sandra Loosemore

On 10/20/2017 02:12 AM, Richard Biener wrote:

On Fri, Oct 20, 2017 at 4:03 AM, Sandra Loosemore
 wrote:

This is the set of nios2 optimization patches that I've previously
mentioned in these threads:

https://gcc.gnu.org/ml/gcc/2017-10/msg00016.html
https://gcc.gnu.org/ml/gcc-patches/2017-10/msg00957.html

To give an overview of what this is for

The nios2 backend currently generates quite bad code for memory
accesses with addresses involving symbolic constants.  Like a typical
RISC machine, nios2 requires splitting such 32-bit constants into
HIGH/LO_SUM pairs.  Currently this happens in expand, and address
expressions involving such constants are always converted to use a
register indirect form.

One part of the problem is that the backend currently doesn't
recognize that LO_SUM is a legitimate address form (it's register
indirect with a constant offset using the %lo relocation).  That's
fixed in these patches.

A harder problem is that doing the high/lo_sum splitting in expand
inhibits subsequent optimizations.  One such problem arises when you
have accesses to multiple fields in a static structure object.  Expand
sees this as many (symbol + offset) expressions involving the same
symbol with different constant offsets.  What we should be doing in
that case is CSE'ing the symbol address computation rather than
splitting every such expression individually.

This patch series attacks that problem by deferring splitting to the
split1 pass, which happens after cse and fwprop optimizations.
Deferring the splitting also requires that TARGET_LEGITIMATE_ADDRESS_P
accept these symbolic constant expressions until the splitting takes
place, and that code that might generate 32-bit constants in other
places (e.g., the movsi expander) must not do so after they are
supposed to have been split.


How do other targets handle this situation?  Naiively I'd have handled
the splitting at reload/LRA time ... (which would make the flag
to test reload_completed)


The problem with this is that the HIGH/LO_SUM split requires a scratch 
register, so it has to be done before register allocation; 
reload_completed is too late.  Early on when I started working on this 
problem, I did various experiments with this, including fiddling with 
the insn patterns and constraints, trying to use TARGET_SECONDARY_RELOAD 
to manage the scratch register, etc, but I could never get it to work 
without completely stomping on all the unsplit symbols before reload.


I also considered a separate target-specific pass to do the splitting, 
but once I got it to work with the regular split mechanism I was quite 
happy with the design except for the "have we split yet?" hook.  :-P



There are quite a number of targets using lo_sum but I'm not sure they
share the issue with symbolic constants.


Nios II resembles a simplified MIPS-ish architecture so that was where I 
looked first.  It does some complicated stuff with trying to match up 
HIGH and LO_SUM pairs after the fact, and I didn't really want to go 
there.  I looked at some other backends as well but didn't see anything 
I could directly copy.


-Sandra



Re: [patch 0/5] nios2 address mode improvements

2017-10-20 Thread Sandra Loosemore

On 10/20/2017 02:56 AM, Jakub Jelinek wrote:

On Thu, Oct 19, 2017 at 08:03:45PM -0600, Sandra Loosemore wrote:

A harder problem is that doing the high/lo_sum splitting in expand
inhibits subsequent optimizations.  One such problem arises when you
have accesses to multiple fields in a static structure object.  Expand
sees this as many (symbol + offset) expressions involving the same
symbol with different constant offsets.  What we should be doing in
that case is CSE'ing the symbol address computation rather than
splitting every such expression individually.


Do you have the needed relocations for that though?
If not, then you need to do:
   tmp = high (symbol);
   tmp |= lo_sum (symbol); // or +
   a = [tmp + 0];
   b = [tmp + 4];
   c = [tmp + 8];
if you do (like e.g. sparc64 has the %olo relocation), then you can do
   tmp = high (symbol);
   a = [tmp + lo_sum (symbol) + 0];
   b = [tmp + lo_sum (symbol) + 4];
   c = [tmp + lo_sum (symbol) + 8];
If you tried to do:
   tmp = high (symbol);
   a = [tmp + lo_sum (symbol)];
   b = [tmp + lo_sum (symbol + 4)];
   c = [tmp + lo_sum (symbol + 8)];
then this would break if lo_sum (symbol + 4) or lo_sum (symbol + 8)
is < 4.


No, nothing that sophisticated -- I'm aiming to produce the first of 
your three choices.  I just want to avoid producing 3 different 
high/lo_sum pairs here, which is what the unpatched nios2 backend does.  :-(


-Sandra



Re: [patch 0/5] nios2 address mode improvements

2017-10-20 Thread Andrew Pinski
On Thu, Oct 19, 2017 at 7:03 PM, Sandra Loosemore
 wrote:
> This is the set of nios2 optimization patches that I've previously
> mentioned in these threads:
>
> https://gcc.gnu.org/ml/gcc/2017-10/msg00016.html
> https://gcc.gnu.org/ml/gcc-patches/2017-10/msg00957.html
>
> To give an overview of what this is for
>
> The nios2 backend currently generates quite bad code for memory
> accesses with addresses involving symbolic constants.  Like a typical
> RISC machine, nios2 requires splitting such 32-bit constants into
> HIGH/LO_SUM pairs.  Currently this happens in expand, and address
> expressions involving such constants are always converted to use a
> register indirect form.
>
> One part of the problem is that the backend currently doesn't
> recognize that LO_SUM is a legitimate address form (it's register
> indirect with a constant offset using the %lo relocation).  That's
> fixed in these patches.
>
> A harder problem is that doing the high/lo_sum splitting in expand
> inhibits subsequent optimizations.  One such problem arises when you
> have accesses to multiple fields in a static structure object.  Expand
> sees this as many (symbol + offset) expressions involving the same
> symbol with different constant offsets.  What we should be doing in
> that case is CSE'ing the symbol address computation rather than
> splitting every such expression individually.
>
> This patch series attacks that problem by deferring splitting to the
> split1 pass, which happens after cse and fwprop optimizations.
> Deferring the splitting also requires that TARGET_LEGITIMATE_ADDRESS_P
> accept these symbolic constant expressions until the splitting takes
> place, and that code that might generate 32-bit constants in other
> places (e.g., the movsi expander) must not do so after they are
> supposed to have been split.
>
> This patch series also includes general improvements to the cost model
> to get better CSE results -- in particular, the nios2 backend has been
> completely missing an implementation for TARGET_ADDRESS_COST.  I also found
> that making TARGET_LEGITIMIZE_ADDRESS smarter resulted in better
> address cost modeling by the ivopts pass.
>
> All together, this resulted in about a 7% code size improvement on the
> customer-provided test case I was using for tuning purposes.

I remember the Sony version of the SPU Back-end doing something
similar and getting similar improvements.
But I don't remember the exact details either.  It might have been
because the SPU only had 128bit loads so expanding the loads too soon
was missing optimizations.  I do remember not upstreaming that code
and I was always disappointed it was not.

Thanks,
Andrew

>
> Patches in this set are broken down as follows:
>
> 1: Switch to LRA.
> 2: Detect when splitting has been completed.
> 3: Add splitters and recognize the new address modes.
> 4: Cost model improvements.
> 5: Test cases.
>
> Part 2 is the piece that relates to the discussion linked above.  As
> implemented, it works fine, but it's maybe not the best design.  I'll
> hold off on committing the entire set for at least a few days in case
> somebody wants to suggest a better solution.
>
> -Sandra
>


Re: [patch 0/5] nios2 address mode improvements

2017-10-24 Thread Jeff Law
On 10/19/2017 08:03 PM, Sandra Loosemore wrote:
> This is the set of nios2 optimization patches that I've previously
> mentioned in these threads:
> 
> https://gcc.gnu.org/ml/gcc/2017-10/msg00016.html
> https://gcc.gnu.org/ml/gcc-patches/2017-10/msg00957.html
> 
> To give an overview of what this is for
> 
> The nios2 backend currently generates quite bad code for memory
> accesses with addresses involving symbolic constants.  Like a typical
> RISC machine, nios2 requires splitting such 32-bit constants into
> HIGH/LO_SUM pairs.  Currently this happens in expand, and address
> expressions involving such constants are always converted to use a
> register indirect form.
> 
> One part of the problem is that the backend currently doesn't
> recognize that LO_SUM is a legitimate address form (it's register
> indirect with a constant offset using the %lo relocation).  That's
> fixed in these patches.
Right.  Handling this is pretty standard stuff.

> 
> A harder problem is that doing the high/lo_sum splitting in expand
> inhibits subsequent optimizations.  One such problem arises when you
> have accesses to multiple fields in a static structure object.  Expand
> sees this as many (symbol + offset) expressions involving the same
> symbol with different constant offsets.  What we should be doing in
> that case is CSE'ing the symbol address computation rather than
> splitting every such expression individually.
My memory is fuzzy, but you might want to look at the PA.

My recollection is that we had support for CSE-ing the high part of a
series of nearby accesses.  The trick was to mask the constant
displacement in the HIGH expression.  So to access x, x+1 and x+2 would
look like

high (x)
lo_sum (x)
high (x)
lo_sum (x+1)
high (x)
lo_sum (x+2)

We'd then naturally cse the HIGH expressions.

The trick here is you also need some assembler/linker support.  That's
why there's rrsel and lrsel field selectors in the PA port.

We split things as early as possible to expose the CSE opportunities for
the HIGH expressions and to encourage LICM.

Jeff



Re: [patch 0/5] nios2 address mode improvements

2017-10-24 Thread Jeff Law
On 10/20/2017 02:56 AM, Jakub Jelinek wrote:
> On Thu, Oct 19, 2017 at 08:03:45PM -0600, Sandra Loosemore wrote:
>> A harder problem is that doing the high/lo_sum splitting in expand
>> inhibits subsequent optimizations.  One such problem arises when you
>> have accesses to multiple fields in a static structure object.  Expand
>> sees this as many (symbol + offset) expressions involving the same
>> symbol with different constant offsets.  What we should be doing in
>> that case is CSE'ing the symbol address computation rather than
>> splitting every such expression individually.
> 
> Do you have the needed relocations for that though?
We certainly did on the PA.

jeff