[PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-01-27 Thread Richard Biener
The following avoids repeatedly turning VALUE RTXen into
sth useful and re-applying a constant offset through get_addr
via DSE check_mem_read_rtx.  Instead perform this once for
all stores to be visited in check_mem_read_rtx.  This avoids
allocating 1.6GB of garbage PLUS RTXen on the PR80960
testcase, fixing the memory usage regression from old GCC.

Bootstrap and regtest running on x86_64-unknown-linux-gnu, OK?

Thanks,
Richard.

2021-01-27  Richard Biener  

PR rtl-optimization/80960
* dse.c (check_mem_read_rtx): Call get_addr on the
offsetted address.
---
 gcc/dse.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/gcc/dse.c b/gcc/dse.c
index c88587e7d94..da0df54a2dd 100644
--- a/gcc/dse.c
+++ b/gcc/dse.c
@@ -2219,6 +2219,11 @@ check_mem_read_rtx (rtx *loc, bb_info_t bb_info)
 }
   if (maybe_ne (offset, 0))
 mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
+  /* Avoid passing VALUE RTXen as mem_addr to canon_true_dependence
+ which will over and over re-create proper RTL and re-apply the
+ offset above.  See PR80960 where we almost allocate 1.6GB of PLUS
+ RTXen that way.  */
+  mem_addr = get_addr (mem_addr);
 
   if (group_id >= 0)
 {
-- 
2.26.2


Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-01-27 Thread Jakub Jelinek via Gcc-patches
On Wed, Jan 27, 2021 at 03:40:38PM +0100, Richard Biener wrote:
> The following avoids repeatedly turning VALUE RTXen into
> sth useful and re-applying a constant offset through get_addr
> via DSE check_mem_read_rtx.  Instead perform this once for
> all stores to be visited in check_mem_read_rtx.  This avoids
> allocating 1.6GB of garbage PLUS RTXen on the PR80960
> testcase, fixing the memory usage regression from old GCC.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, OK?
> 
> Thanks,
> Richard.
> 
> 2021-01-27  Richard Biener  
> 
>   PR rtl-optimization/80960
>   * dse.c (check_mem_read_rtx): Call get_addr on the
>   offsetted address.
> ---
>  gcc/dse.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/gcc/dse.c b/gcc/dse.c
> index c88587e7d94..da0df54a2dd 100644
> --- a/gcc/dse.c
> +++ b/gcc/dse.c
> @@ -2219,6 +2219,11 @@ check_mem_read_rtx (rtx *loc, bb_info_t bb_info)
>  }
>if (maybe_ne (offset, 0))
>  mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
> +  /* Avoid passing VALUE RTXen as mem_addr to canon_true_dependence
> + which will over and over re-create proper RTL and re-apply the
> + offset above.  See PR80960 where we almost allocate 1.6GB of PLUS
> + RTXen that way.  */
> +  mem_addr = get_addr (mem_addr);
>  
>if (group_id >= 0)
>  {

Does that result in any changes on how much does DSE optimize?
I mean, if you do 2 bootstraps/regtests, one with this patch and another one
without it, and at the end of rest_of_handle_dse dump
locally_deleted, globally_deleted
for each CU/function, do you get the same counts except perhaps for dse.c?

Jakub



Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-01-27 Thread Richard Biener
On Wed, 27 Jan 2021, Jakub Jelinek wrote:

> On Wed, Jan 27, 2021 at 03:40:38PM +0100, Richard Biener wrote:
> > The following avoids repeatedly turning VALUE RTXen into
> > sth useful and re-applying a constant offset through get_addr
> > via DSE check_mem_read_rtx.  Instead perform this once for
> > all stores to be visited in check_mem_read_rtx.  This avoids
> > allocating 1.6GB of garbage PLUS RTXen on the PR80960
> > testcase, fixing the memory usage regression from old GCC.
> > 
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu, OK?
> > 
> > Thanks,
> > Richard.
> > 
> > 2021-01-27  Richard Biener  
> > 
> > PR rtl-optimization/80960
> > * dse.c (check_mem_read_rtx): Call get_addr on the
> > offsetted address.
> > ---
> >  gcc/dse.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/gcc/dse.c b/gcc/dse.c
> > index c88587e7d94..da0df54a2dd 100644
> > --- a/gcc/dse.c
> > +++ b/gcc/dse.c
> > @@ -2219,6 +2219,11 @@ check_mem_read_rtx (rtx *loc, bb_info_t bb_info)
> >  }
> >if (maybe_ne (offset, 0))
> >  mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
> > +  /* Avoid passing VALUE RTXen as mem_addr to canon_true_dependence
> > + which will over and over re-create proper RTL and re-apply the
> > + offset above.  See PR80960 where we almost allocate 1.6GB of PLUS
> > + RTXen that way.  */
> > +  mem_addr = get_addr (mem_addr);
> >  
> >if (group_id >= 0)
> >  {
> 
> Does that result in any changes on how much does DSE optimize?
> I mean, if you do 2 bootstraps/regtests, one with this patch and another one
> without it, and at the end of rest_of_handle_dse dump
> locally_deleted, globally_deleted
> for each CU/function, do you get the same counts except perhaps for dse.c?

I can check but all immediate first uses of mem_addr are in
true_dependece_1 which does x_addr = get_addr (x_addr); as the
first thing on it.  So the concern would be that
get_addr (get_addr (x_addr)) != get_addr (x_addr) which I think
shouldn't be the case (no semantic difference at least).

I'll see to do the test tomorrow.

Thanks,
Richard.


Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-01-27 Thread Jakub Jelinek via Gcc-patches
On Wed, Jan 27, 2021 at 04:16:22PM +0100, Richard Biener wrote:
> I can check but all immediate first uses of mem_addr are in
> true_dependece_1 which does x_addr = get_addr (x_addr); as the
> first thing on it.  So the concern would be that
> get_addr (get_addr (x_addr)) != get_addr (x_addr) which I think
> shouldn't be the case (no semantic difference at least).

I guess you're right, indeed I can't find other uses of mem_addr
and it isn't saved, but just used in that one function, though
perhaps against more than one addr.
And in get_addr we most likely get into the:

  if (GET_CODE (x) != VALUE)
{
  if ((GET_CODE (x) == PLUS || GET_CODE (x) == MINUS)
  && GET_CODE (XEXP (x, 0)) == VALUE
  && CONST_SCALAR_INT_P (XEXP (x, 1)))
{
  rtx op0 = get_addr (XEXP (x, 0));
  if (op0 != XEXP (x, 0))
{
  poly_int64 c;
  if (GET_CODE (x) == PLUS
  && poly_int_rtx_p (XEXP (x, 1), &c))
return plus_constant (GET_MODE (x), op0, c);
  return simplify_gen_binary (GET_CODE (x), GET_MODE (x),
  op0, XEXP (x, 1));
}
}
  return x;
}

and if get_addr already is the VALUE it should be then it will not create
any further PLUS/MINUS.

One question is how often we do that
  if (maybe_ne (offset, 0))
mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
and with your patch also
  mem_addr = get_addr (mem_addr);
if there are no active local stores (or just no canon_true_dependence
calls for that read).
If it would happen often, we could further optimize compile time and memory
by changing mem_addr above this plus_constant into mem_addr_base,
add mem_addr var initially set to NULL and before each of those 3
canon_true_dependence calls do:
  if (mem_addr == NULL_RTX)
{
  mem_addr = mem_addr_base;
  if (maybe_ne (offset, 0))
mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
  mem_addr = get_addr (mem_addr);
}

Jakub



Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-01-27 Thread Richard Biener
On Wed, 27 Jan 2021, Jakub Jelinek wrote:

> On Wed, Jan 27, 2021 at 04:16:22PM +0100, Richard Biener wrote:
> > I can check but all immediate first uses of mem_addr are in
> > true_dependece_1 which does x_addr = get_addr (x_addr); as the
> > first thing on it.  So the concern would be that
> > get_addr (get_addr (x_addr)) != get_addr (x_addr) which I think
> > shouldn't be the case (no semantic difference at least).
> 
> I guess you're right, indeed I can't find other uses of mem_addr
> and it isn't saved, but just used in that one function, though
> perhaps against more than one addr.
> And in get_addr we most likely get into the:
> 
>   if (GET_CODE (x) != VALUE)
> {
>   if ((GET_CODE (x) == PLUS || GET_CODE (x) == MINUS)
>   && GET_CODE (XEXP (x, 0)) == VALUE
>   && CONST_SCALAR_INT_P (XEXP (x, 1)))
> {
>   rtx op0 = get_addr (XEXP (x, 0));
>   if (op0 != XEXP (x, 0))
> {
>   poly_int64 c;
>   if (GET_CODE (x) == PLUS
>   && poly_int_rtx_p (XEXP (x, 1), &c))
> return plus_constant (GET_MODE (x), op0, c);
>   return simplify_gen_binary (GET_CODE (x), GET_MODE (x),
>   op0, XEXP (x, 1));
> }
> }
>   return x;
> }
> 
> and if get_addr already is the VALUE it should be then it will not create
> any further PLUS/MINUS.
> 
> One question is how often we do that
>   if (maybe_ne (offset, 0))
> mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
> and with your patch also
>   mem_addr = get_addr (mem_addr);
> if there are no active local stores (or just no canon_true_dependence
> calls for that read).
> If it would happen often, we could further optimize compile time and memory
> by changing mem_addr above this plus_constant into mem_addr_base,
> add mem_addr var initially set to NULL and before each of those 3
> canon_true_dependence calls do:
>   if (mem_addr == NULL_RTX)
> {
>   mem_addr = mem_addr_base;
>   if (maybe_ne (offset, 0))
>   mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
>   mem_addr = get_addr (mem_addr);
> }

Sure, more micro-optimizing is possible, including passing a flag
to canon_true_dependence whether the addr RTX already had get_addr
called on it.  And pass in the offset as poly-rtx-int and make
get_addr apply it if not zero.  But I've mostly tried to address
the non-linearity here, after the patch the number of get_addr
and plus_constant calls should be linear in the number of loads
rather than O (#loads * #stores).

I've also tried to find the most minimalistic change at this point
(so it could be eventually backported).

Richard.


Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-01-27 Thread Jakub Jelinek via Gcc-patches
On Wed, Jan 27, 2021 at 05:37:54PM +0100, Richard Biener wrote:
> Sure, more micro-optimizing is possible, including passing a flag
> to canon_true_dependence whether the addr RTX already had get_addr
> called on it.  And pass in the offset as poly-rtx-int and make
> get_addr apply it if not zero.  But I've mostly tried to address
> the non-linearity here, after the patch the number of get_addr
> and plus_constant calls should be linear in the number of loads
> rather than O (#loads * #stores).
> 
> I've also tried to find the most minimalistic change at this point
> (so it could be eventually backported).

Ok.  I'll gather stats incrementally and see if it is worth to do something
further later.

Jakub



Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-01-28 Thread Richard Biener
On Wed, 27 Jan 2021, Jakub Jelinek wrote:

> On Wed, Jan 27, 2021 at 03:40:38PM +0100, Richard Biener wrote:
> > The following avoids repeatedly turning VALUE RTXen into
> > sth useful and re-applying a constant offset through get_addr
> > via DSE check_mem_read_rtx.  Instead perform this once for
> > all stores to be visited in check_mem_read_rtx.  This avoids
> > allocating 1.6GB of garbage PLUS RTXen on the PR80960
> > testcase, fixing the memory usage regression from old GCC.
> > 
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu, OK?
> > 
> > Thanks,
> > Richard.
> > 
> > 2021-01-27  Richard Biener  
> > 
> > PR rtl-optimization/80960
> > * dse.c (check_mem_read_rtx): Call get_addr on the
> > offsetted address.
> > ---
> >  gcc/dse.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/gcc/dse.c b/gcc/dse.c
> > index c88587e7d94..da0df54a2dd 100644
> > --- a/gcc/dse.c
> > +++ b/gcc/dse.c
> > @@ -2219,6 +2219,11 @@ check_mem_read_rtx (rtx *loc, bb_info_t bb_info)
> >  }
> >if (maybe_ne (offset, 0))
> >  mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
> > +  /* Avoid passing VALUE RTXen as mem_addr to canon_true_dependence
> > + which will over and over re-create proper RTL and re-apply the
> > + offset above.  See PR80960 where we almost allocate 1.6GB of PLUS
> > + RTXen that way.  */
> > +  mem_addr = get_addr (mem_addr);
> >  
> >if (group_id >= 0)
> >  {
> 
> Does that result in any changes on how much does DSE optimize?
> I mean, if you do 2 bootstraps/regtests, one with this patch and another one
> without it, and at the end of rest_of_handle_dse dump
> locally_deleted, globally_deleted
> for each CU/function, do you get the same counts except perhaps for dse.c?

So I see no difference for stage2-gcc/*.o dse1/dse2 with/without the
patch but counts are _extremely_ small.  Statistics:

  70148 dse: local deletions = 0, global deletions = 0
 32 dse: local deletions = 0, global deletions = 1
  9 dse: local deletions = 0, global deletions = 2
  7 dse: local deletions = 0, global deletions = 3
  2 dse: local deletions = 0, global deletions = 4
  2 dse: local deletions = 0, global deletions = 5
  3 dse: local deletions = 0, global deletions = 7
 67 dse: local deletions = 1, global deletions = 0
  1 dse: local deletions = 1, global deletions = 2
 12 dse: local deletions = 2, global deletions = 0
  1 dse: local deletions = 24, global deletions = 1
  2 dse: local deletions = 3, global deletions = 0
  4 dse: local deletions = 4, global deletions = 0
  4 dse: local deletions = 6, global deletions = 0
  1 dse: local deletions = 7, global deletions = 0
  1 dse: local deletions = 8, global deletions = 0

so not sure how much confidence this brings over the analytical
reasoning that it shouldn't make a difference ...

stats on just dse2 are even more depressing (given it's cost)

  35123 dse: local deletions = 0, global deletions = 0
  2 dse: local deletions = 0, global deletions = 1
 20 dse: local deletions = 1, global deletions = 0
  1 dse: local deletions = 2, global deletions = 0
  1 dse: local deletions = 3, global deletions = 0
  1 dse: local deletions = 4, global deletions = 0

Richard.


Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-02-01 Thread Jeff Law via Gcc-patches



On 1/28/21 1:09 AM, Richard Biener wrote:
> On Wed, 27 Jan 2021, Jakub Jelinek wrote:
>
>> On Wed, Jan 27, 2021 at 03:40:38PM +0100, Richard Biener wrote:
>>> The following avoids repeatedly turning VALUE RTXen into
>>> sth useful and re-applying a constant offset through get_addr
>>> via DSE check_mem_read_rtx.  Instead perform this once for
>>> all stores to be visited in check_mem_read_rtx.  This avoids
>>> allocating 1.6GB of garbage PLUS RTXen on the PR80960
>>> testcase, fixing the memory usage regression from old GCC.
>>>
>>> Bootstrap and regtest running on x86_64-unknown-linux-gnu, OK?
>>>
>>> Thanks,
>>> Richard.
>>>
>>> 2021-01-27  Richard Biener  
>>>
>>> PR rtl-optimization/80960
>>> * dse.c (check_mem_read_rtx): Call get_addr on the
>>> offsetted address.
>>> ---
>>>  gcc/dse.c | 5 +
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/gcc/dse.c b/gcc/dse.c
>>> index c88587e7d94..da0df54a2dd 100644
>>> --- a/gcc/dse.c
>>> +++ b/gcc/dse.c
>>> @@ -2219,6 +2219,11 @@ check_mem_read_rtx (rtx *loc, bb_info_t bb_info)
>>>  }
>>>if (maybe_ne (offset, 0))
>>>  mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
>>> +  /* Avoid passing VALUE RTXen as mem_addr to canon_true_dependence
>>> + which will over and over re-create proper RTL and re-apply the
>>> + offset above.  See PR80960 where we almost allocate 1.6GB of PLUS
>>> + RTXen that way.  */
>>> +  mem_addr = get_addr (mem_addr);
>>>  
>>>if (group_id >= 0)
>>>  {
>> Does that result in any changes on how much does DSE optimize?
>> I mean, if you do 2 bootstraps/regtests, one with this patch and another one
>> without it, and at the end of rest_of_handle_dse dump
>> locally_deleted, globally_deleted
>> for each CU/function, do you get the same counts except perhaps for dse.c?
> So I see no difference for stage2-gcc/*.o dse1/dse2 with/without the
> patch but counts are _extremely_ small.  Statistics:
>
>   70148 dse: local deletions = 0, global deletions = 0
>  32 dse: local deletions = 0, global deletions = 1
>   9 dse: local deletions = 0, global deletions = 2
>   7 dse: local deletions = 0, global deletions = 3
>   2 dse: local deletions = 0, global deletions = 4
>   2 dse: local deletions = 0, global deletions = 5
>   3 dse: local deletions = 0, global deletions = 7
>  67 dse: local deletions = 1, global deletions = 0
>   1 dse: local deletions = 1, global deletions = 2
>  12 dse: local deletions = 2, global deletions = 0
>   1 dse: local deletions = 24, global deletions = 1
>   2 dse: local deletions = 3, global deletions = 0
>   4 dse: local deletions = 4, global deletions = 0
>   4 dse: local deletions = 6, global deletions = 0
>   1 dse: local deletions = 7, global deletions = 0
>   1 dse: local deletions = 8, global deletions = 0
>
> so not sure how much confidence this brings over the analytical
> reasoning that it shouldn't make a difference ...
>
> stats on just dse2 are even more depressing (given it's cost)
>
>   35123 dse: local deletions = 0, global deletions = 0
>   2 dse: local deletions = 0, global deletions = 1
>  20 dse: local deletions = 1, global deletions = 0
>   1 dse: local deletions = 2, global deletions = 0
>   1 dse: local deletions = 3, global deletions = 0
>   1 dse: local deletions = 4, global deletions = 0
Based on that, I'd argue that DSE2 should go away and DSE1 should be
evaluated for the chopping block.  While RTL DSE was marginally
important in 1999 when it was first submitted, the tree-ssa pipeline as
a whole has probably made RTL DSE largely pointless.

Jeff



Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-02-01 Thread Richard Biener
On February 1, 2021 8:34:35 PM GMT+01:00, Jeff Law  wrote:
>
>
>On 1/28/21 1:09 AM, Richard Biener wrote:
>> On Wed, 27 Jan 2021, Jakub Jelinek wrote:
>>
>>> On Wed, Jan 27, 2021 at 03:40:38PM +0100, Richard Biener wrote:
 The following avoids repeatedly turning VALUE RTXen into
 sth useful and re-applying a constant offset through get_addr
 via DSE check_mem_read_rtx.  Instead perform this once for
 all stores to be visited in check_mem_read_rtx.  This avoids
 allocating 1.6GB of garbage PLUS RTXen on the PR80960
 testcase, fixing the memory usage regression from old GCC.

 Bootstrap and regtest running on x86_64-unknown-linux-gnu, OK?

 Thanks,
 Richard.

 2021-01-27  Richard Biener  

PR rtl-optimization/80960
* dse.c (check_mem_read_rtx): Call get_addr on the
offsetted address.
 ---
  gcc/dse.c | 5 +
  1 file changed, 5 insertions(+)

 diff --git a/gcc/dse.c b/gcc/dse.c
 index c88587e7d94..da0df54a2dd 100644
 --- a/gcc/dse.c
 +++ b/gcc/dse.c
 @@ -2219,6 +2219,11 @@ check_mem_read_rtx (rtx *loc, bb_info_t
>bb_info)
  }
if (maybe_ne (offset, 0))
  mem_addr = plus_constant (get_address_mode (mem), mem_addr,
>offset);
 +  /* Avoid passing VALUE RTXen as mem_addr to
>canon_true_dependence
 + which will over and over re-create proper RTL and re-apply
>the
 + offset above.  See PR80960 where we almost allocate 1.6GB of
>PLUS
 + RTXen that way.  */
 +  mem_addr = get_addr (mem_addr);
  
if (group_id >= 0)
  {
>>> Does that result in any changes on how much does DSE optimize?
>>> I mean, if you do 2 bootstraps/regtests, one with this patch and
>another one
>>> without it, and at the end of rest_of_handle_dse dump
>>> locally_deleted, globally_deleted
>>> for each CU/function, do you get the same counts except perhaps for
>dse.c?
>> So I see no difference for stage2-gcc/*.o dse1/dse2 with/without the
>> patch but counts are _extremely_ small.  Statistics:
>>
>>   70148 dse: local deletions = 0, global deletions = 0
>>  32 dse: local deletions = 0, global deletions = 1
>>   9 dse: local deletions = 0, global deletions = 2
>>   7 dse: local deletions = 0, global deletions = 3
>>   2 dse: local deletions = 0, global deletions = 4
>>   2 dse: local deletions = 0, global deletions = 5
>>   3 dse: local deletions = 0, global deletions = 7
>>  67 dse: local deletions = 1, global deletions = 0
>>   1 dse: local deletions = 1, global deletions = 2
>>  12 dse: local deletions = 2, global deletions = 0
>>   1 dse: local deletions = 24, global deletions = 1
>>   2 dse: local deletions = 3, global deletions = 0
>>   4 dse: local deletions = 4, global deletions = 0
>>   4 dse: local deletions = 6, global deletions = 0
>>   1 dse: local deletions = 7, global deletions = 0
>>   1 dse: local deletions = 8, global deletions = 0
>>
>> so not sure how much confidence this brings over the analytical
>> reasoning that it shouldn't make a difference ...
>>
>> stats on just dse2 are even more depressing (given it's cost)
>>
>>   35123 dse: local deletions = 0, global deletions = 0
>>   2 dse: local deletions = 0, global deletions = 1
>>  20 dse: local deletions = 1, global deletions = 0
>>   1 dse: local deletions = 2, global deletions = 0
>>   1 dse: local deletions = 3, global deletions = 0
>>   1 dse: local deletions = 4, global deletions = 0
>Based on that, I'd argue that DSE2 should go away and DSE1 should be
>evaluated for the chopping block.  While RTL DSE was marginally
>important in 1999 when it was first submitted, the tree-ssa pipeline as
>a whole has probably made RTL DSE largely pointless.

True. Though I'd argue that DSE2 might be the conceptually more useful pass 
since it sees spill slots. 

Richard. 

>Jeff



Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-02-01 Thread Jeff Law via Gcc-patches



On 2/1/21 12:47 PM, Richard Biener wrote:
> On February 1, 2021 8:34:35 PM GMT+01:00, Jeff Law  wrote:
>>
>> On 1/28/21 1:09 AM, Richard Biener wrote:
>>> On Wed, 27 Jan 2021, Jakub Jelinek wrote:
>>>
 On Wed, Jan 27, 2021 at 03:40:38PM +0100, Richard Biener wrote:
> The following avoids repeatedly turning VALUE RTXen into
> sth useful and re-applying a constant offset through get_addr
> via DSE check_mem_read_rtx.  Instead perform this once for
> all stores to be visited in check_mem_read_rtx.  This avoids
> allocating 1.6GB of garbage PLUS RTXen on the PR80960
> testcase, fixing the memory usage regression from old GCC.
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu, OK?
>
> Thanks,
> Richard.
>
> 2021-01-27  Richard Biener  
>
>   PR rtl-optimization/80960
>   * dse.c (check_mem_read_rtx): Call get_addr on the
>   offsetted address.
> ---
>  gcc/dse.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/gcc/dse.c b/gcc/dse.c
> index c88587e7d94..da0df54a2dd 100644
> --- a/gcc/dse.c
> +++ b/gcc/dse.c
> @@ -2219,6 +2219,11 @@ check_mem_read_rtx (rtx *loc, bb_info_t
>> bb_info)
>  }
>if (maybe_ne (offset, 0))
>  mem_addr = plus_constant (get_address_mode (mem), mem_addr,
>> offset);
> +  /* Avoid passing VALUE RTXen as mem_addr to
>> canon_true_dependence
> + which will over and over re-create proper RTL and re-apply
>> the
> + offset above.  See PR80960 where we almost allocate 1.6GB of
>> PLUS
> + RTXen that way.  */
> +  mem_addr = get_addr (mem_addr);
>  
>if (group_id >= 0)
>  {
 Does that result in any changes on how much does DSE optimize?
 I mean, if you do 2 bootstraps/regtests, one with this patch and
>> another one
 without it, and at the end of rest_of_handle_dse dump
 locally_deleted, globally_deleted
 for each CU/function, do you get the same counts except perhaps for
>> dse.c?
>>> So I see no difference for stage2-gcc/*.o dse1/dse2 with/without the
>>> patch but counts are _extremely_ small.  Statistics:
>>>
>>>   70148 dse: local deletions = 0, global deletions = 0
>>>  32 dse: local deletions = 0, global deletions = 1
>>>   9 dse: local deletions = 0, global deletions = 2
>>>   7 dse: local deletions = 0, global deletions = 3
>>>   2 dse: local deletions = 0, global deletions = 4
>>>   2 dse: local deletions = 0, global deletions = 5
>>>   3 dse: local deletions = 0, global deletions = 7
>>>  67 dse: local deletions = 1, global deletions = 0
>>>   1 dse: local deletions = 1, global deletions = 2
>>>  12 dse: local deletions = 2, global deletions = 0
>>>   1 dse: local deletions = 24, global deletions = 1
>>>   2 dse: local deletions = 3, global deletions = 0
>>>   4 dse: local deletions = 4, global deletions = 0
>>>   4 dse: local deletions = 6, global deletions = 0
>>>   1 dse: local deletions = 7, global deletions = 0
>>>   1 dse: local deletions = 8, global deletions = 0
>>>
>>> so not sure how much confidence this brings over the analytical
>>> reasoning that it shouldn't make a difference ...
>>>
>>> stats on just dse2 are even more depressing (given it's cost)
>>>
>>>   35123 dse: local deletions = 0, global deletions = 0
>>>   2 dse: local deletions = 0, global deletions = 1
>>>  20 dse: local deletions = 1, global deletions = 0
>>>   1 dse: local deletions = 2, global deletions = 0
>>>   1 dse: local deletions = 3, global deletions = 0
>>>   1 dse: local deletions = 4, global deletions = 0
>> Based on that, I'd argue that DSE2 should go away and DSE1 should be
>> evaluated for the chopping block.  While RTL DSE was marginally
>> important in 1999 when it was first submitted, the tree-ssa pipeline as
>> a whole has probably made RTL DSE largely pointless.
> True. Though I'd argue that DSE2 might be the conceptually more useful pass 
> since it sees spill slots. 
True in concept, but I bet that the SSA pipeline has made this much less
common in RTL DSE than it was 20+ years ago.  Our allocator and reloader
are much improved as well which would further decrease the number of
opportunities.

I'd hazard a guess that what's left are locals that need to be
addressable and some optimization in the RTL pipeline exposed a dead
store that wasn't otherwise visible in the SSA pipeline.  BUt the only
way to be sure would be to dig into them.

jeff



Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-02-01 Thread Jakub Jelinek via Gcc-patches
On Mon, Feb 01, 2021 at 12:54:50PM -0700, Jeff Law wrote:
> >>> So I see no difference for stage2-gcc/*.o dse1/dse2 with/without the
> >>> patch but counts are _extremely_ small.  Statistics:
> >>>
> >>>   70148 dse: local deletions = 0, global deletions = 0
> >>>  32 dse: local deletions = 0, global deletions = 1
> >>>   9 dse: local deletions = 0, global deletions = 2
> >>>   7 dse: local deletions = 0, global deletions = 3
> >>>   2 dse: local deletions = 0, global deletions = 4
> >>>   2 dse: local deletions = 0, global deletions = 5
> >>>   3 dse: local deletions = 0, global deletions = 7
> >>>  67 dse: local deletions = 1, global deletions = 0
> >>>   1 dse: local deletions = 1, global deletions = 2
> >>>  12 dse: local deletions = 2, global deletions = 0
> >>>   1 dse: local deletions = 24, global deletions = 1
> >>>   2 dse: local deletions = 3, global deletions = 0
> >>>   4 dse: local deletions = 4, global deletions = 0
> >>>   4 dse: local deletions = 6, global deletions = 0
> >>>   1 dse: local deletions = 7, global deletions = 0
> >>>   1 dse: local deletions = 8, global deletions = 0
> >>>
> >>> so not sure how much confidence this brings over the analytical
> >>> reasoning that it shouldn't make a difference ...
> >>>
> >>> stats on just dse2 are even more depressing (given it's cost)
> >>>
> >>>   35123 dse: local deletions = 0, global deletions = 0
> >>>   2 dse: local deletions = 0, global deletions = 1
> >>>  20 dse: local deletions = 1, global deletions = 0
> >>>   1 dse: local deletions = 2, global deletions = 0
> >>>   1 dse: local deletions = 3, global deletions = 0
> >>>   1 dse: local deletions = 4, global deletions = 0
> >> Based on that, I'd argue that DSE2 should go away and DSE1 should be
> >> evaluated for the chopping block.  While RTL DSE was marginally
> >> important in 1999 when it was first submitted, the tree-ssa pipeline as
> >> a whole has probably made RTL DSE largely pointless.
> > True. Though I'd argue that DSE2 might be the conceptually more useful pass 
> > since it sees spill slots. 
> True in concept, but I bet that the SSA pipeline has made this much less
> common in RTL DSE than it was 20+ years ago.  Our allocator and reloader
> are much improved as well which would further decrease the number of
> opportunities.
> 
> I'd hazard a guess that what's left are locals that need to be
> addressable and some optimization in the RTL pipeline exposed a dead
> store that wasn't otherwise visible in the SSA pipeline.  BUt the only
> way to be sure would be to dig into them.

Shouldn't we gather statistics from larger codebase first and perhaps
compare against tree-ssa-dse statistics?  I mean, in many functions there
are no DSE opportunities at all.

Jakub



Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-02-01 Thread Richard Biener
On Mon, 1 Feb 2021, Jakub Jelinek wrote:

> On Mon, Feb 01, 2021 at 12:54:50PM -0700, Jeff Law wrote:
> > >>> So I see no difference for stage2-gcc/*.o dse1/dse2 with/without the
> > >>> patch but counts are _extremely_ small.  Statistics:
> > >>>
> > >>>   70148 dse: local deletions = 0, global deletions = 0
> > >>>  32 dse: local deletions = 0, global deletions = 1
> > >>>   9 dse: local deletions = 0, global deletions = 2
> > >>>   7 dse: local deletions = 0, global deletions = 3
> > >>>   2 dse: local deletions = 0, global deletions = 4
> > >>>   2 dse: local deletions = 0, global deletions = 5
> > >>>   3 dse: local deletions = 0, global deletions = 7
> > >>>  67 dse: local deletions = 1, global deletions = 0
> > >>>   1 dse: local deletions = 1, global deletions = 2
> > >>>  12 dse: local deletions = 2, global deletions = 0
> > >>>   1 dse: local deletions = 24, global deletions = 1
> > >>>   2 dse: local deletions = 3, global deletions = 0
> > >>>   4 dse: local deletions = 4, global deletions = 0
> > >>>   4 dse: local deletions = 6, global deletions = 0
> > >>>   1 dse: local deletions = 7, global deletions = 0
> > >>>   1 dse: local deletions = 8, global deletions = 0
> > >>>
> > >>> so not sure how much confidence this brings over the analytical
> > >>> reasoning that it shouldn't make a difference ...
> > >>>
> > >>> stats on just dse2 are even more depressing (given it's cost)
> > >>>
> > >>>   35123 dse: local deletions = 0, global deletions = 0
> > >>>   2 dse: local deletions = 0, global deletions = 1
> > >>>  20 dse: local deletions = 1, global deletions = 0
> > >>>   1 dse: local deletions = 2, global deletions = 0
> > >>>   1 dse: local deletions = 3, global deletions = 0
> > >>>   1 dse: local deletions = 4, global deletions = 0
> > >> Based on that, I'd argue that DSE2 should go away and DSE1 should be
> > >> evaluated for the chopping block.  While RTL DSE was marginally
> > >> important in 1999 when it was first submitted, the tree-ssa pipeline as
> > >> a whole has probably made RTL DSE largely pointless.
> > > True. Though I'd argue that DSE2 might be the conceptually more useful 
> > > pass since it sees spill slots. 
> > True in concept, but I bet that the SSA pipeline has made this much less
> > common in RTL DSE than it was 20+ years ago.  Our allocator and reloader
> > are much improved as well which would further decrease the number of
> > opportunities.
> > 
> > I'd hazard a guess that what's left are locals that need to be
> > addressable and some optimization in the RTL pipeline exposed a dead
> > store that wasn't otherwise visible in the SSA pipeline.  BUt the only
> > way to be sure would be to dig into them.
> 
> Shouldn't we gather statistics from larger codebase first and perhaps
> compare against tree-ssa-dse statistics?  I mean, in many functions there
> are no DSE opportunities at all.

Of course.  Some DSE will definitely be required because we expose
ABI details only on RTL and expand sometimes is quite stupid.  ISTR
either DCE or CSE performs some limited amount of DSE as well?

The most needed and interesting work will be to disentangle RTL expansion
into the "complex" bits to be done on (lowered) GIMPLE and the
mechanical detail of GIMPLE to RTL one instruction at a time.  I guess
only during this work we'll learn what we need in lowered GIMPLE.

Richard.


Re: [PATCH] rtl-optimization/80960 - avoid creating garbage RTL in DSE

2021-02-02 Thread Jakub Jelinek via Gcc-patches
On Tue, Feb 02, 2021 at 08:51:07AM +0100, Richard Biener wrote:
> > Shouldn't we gather statistics from larger codebase first and perhaps
> > compare against tree-ssa-dse statistics?  I mean, in many functions there
> > are no DSE opportunities at all.
> 
> Of course.  Some DSE will definitely be required because we expose
> ABI details only on RTL and expand sometimes is quite stupid.  ISTR
> either DCE or CSE performs some limited amount of DSE as well?
> 
> The most needed and interesting work will be to disentangle RTL expansion
> into the "complex" bits to be done on (lowered) GIMPLE and the
> mechanical detail of GIMPLE to RTL one instruction at a time.  I guess
> only during this work we'll learn what we need in lowered GIMPLE.

BTW, more complete RTL DSE statistics (from x86_64-linux and i686-linux
bootstraps/regtests together) are below (columns are number of occurences of
specific local and global deletion pairs, local deletions, global
deletions), so it is true RTL DSE only removed something in 1.6% of
functions during those.  But RTL DSE is also doing the SCCVN-like
optimization of optimizing memory reads into constants if those constants
were stored before and that isn't tracked in this.

5908970 0 0
  20144 1 0
  14162 0 3
  11413 0 1
  10862 2 0
   5268 0 2
   5256 3 0
   4924 0 6
   3506 4 0
   2838 0 4
   1734 0 9
   1621 5 0
   1449 0 5
   1014 0 8
856 0 12
754 0 7
729 6 0
487 7 0
464 8 0
422 14 0
412 9 0
381 0 15
373 0 10
297 1 1
259 0 18
248 2 2
245 0 11
237 3 3
230 12 0
220 10 0
168 0 16
144 11 0
138 0 14
133 4 6
133 0 13
123 0 24
115 15 0
114 0 21
 94 0 20
 91 0 17
 88 1 4
 84 1 2
 75 16 0
 74 17 0
 71 0 27
 69 0 30
 55 2 1
 50 4 1
 50 1 9
 50 1 3
 46 0 36
 44 0 33
 44 0 25
 42 2 3
 41 18 0
 40 5 3
 39 34 0
 36 13 0
 35 2 18
 35 0 19
 34 3 1
 33 1 7
 31 1 5
 31 0 48
 31 0 23
 30 4 3
 30 1 6
 30 0 22
 28 0 29
 28 0 26
 26 0 28
 25 24 0
 24 1 18
 22 2 6
 21 0 32
 20 4 4
 20 3 2
 19 4 2
 18 0 40
 17 39 0
 17 22 0
 16 5 9
 16 5 5
 15 129 0
 15 0 66
 15 0 31
 14 20 0
 14 19 0
 13 6 3
 12 1 8
 12 0 42
 11 32 0
 11 2 4
 10 4 9
 10 42 0
 10 0 44
  9 3 4
  9 30 0
  9 259 0
  9 0 57
  9 0 39
  8 4 8
  8 46 0
  8 31 0
  8 25 35
  8 1 15
  8 0 56
  8 0 54
  8 0 45
  8 0 43
  8 0 38
  7 3 8
  7 35 0
  7 33 0
  7 130 0
  7 1 24
  7 12 1
  7 1 14
  7 0 60
  6 9 3
  6 5 6
  6 5 4
  6 5 1
  6 4 49
  6 3 13
  6 15 9
  6 12 6
  6 0 84
  6 0 41
  5 6 12
  5 44 0
  5 27 0
  5 2 7
  5 25 0
  5 23 0
  5 14 12
  5 12 2
  5 1 21
  5 10 2
  5 0 35
  5 0 150
  4 86 0
  4 7 6
  4 69 0
  4 6 8
  4 56 0
  4 45 0
  4 4 16
  4 29 0
  4 266 0
  4 263 0
  4 258 0
  4 2 22
  4 12 4
  4 1 12
  4 0 83
  4 0 61
  4 0 51
  4 0 34
  3 8 9
  3 8 6
  3 6 1
  3 55 0
  3 5 10
  3 49 0
  3 4 77
  3 4 5
  3 4 18
  3 4 13
  3 2 9
  3 28 0
  3 2 38
  3 2 12
  3 14 6
  3 1 13
  3 0 88
  3 0 86
  3 0 75
  3 0 69
  3 0 198
  3 0 100
  2 90 0
  2 6 39
  2 6 21
  2 53 0
  2 5 2
  2 512 0
  2 5 11
  2 50 0
  2 4 7
  2 4 33
  2 4 10
  2 40 0
  2 3 9
  2 3 6
  2 33 1
  2 2 8
  2 262 0
  2 26 0
  2 252 0
  2 24 12
  2 24 1
  2 23 29
  2 22 22
  2 21 5
  2 21 0
  2 16 5
  2 14 18
  2 128 0
  2 12 5
  2 1 20
  2 11 40
  2 1 11
  2 0 91
  2 0 37
  2 0 158
  2 0 136
  2 0 128
  2 0 104
  2 0 103
  1 9 6
  1 9 10
  1 8 7
  1 8 2
  1 8 16
  1 8 15
  1 81 0
  1 8 1
  1 7 5
  1 74 0
  1 73 6
  1 7 12
  1 7 1
  1 6 7
  1 54 0
  1 52 0
  1 48 0
  1 36 0
  1 3 5
  1 31 3
  1 2 5
  1 2 21
  1 2 13
  1 21 18
  1 2 10
  1 19 20
  1 19 2
  1 18 12
  1 17 10
  1 16 1
  1 14 9
  1 145 0
  1 14 15
  1 13 1
  1 12 8
  1 11 11
  1 1 10
  1 10 7
  1 10 11
  1 10 1
  1 0 89
  1 0 80
  1 0 70
  1 0 67
  1 0 59
  1 0 55
  1 0 49
  1 0 244
  1 0 148
  1 0 140

Jakub