On Wed, Mar 9, 2022 at 6:43 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> Hi Richard,
> Many thanks.  Yes, your proposed ao_ref_alignment is exactly what I was
> looking for.  Here's the second revision of my patch for PR
> tree-optimization/98335, which uses/introduces ao_ref_alignment and more
> intelligently aligns/trims both head and tail, for example handling the
> case discussed by Richard and Jeff Law of a 16-byte-aligned object where
> we wish to avoid trimming (just) the last three bytes.  It uses the useful
> property that writing N consecutive bytes typically requires popcount(N)
> store instructions, so if we can we align the trim points such that we
> begin/end with a store of N' bytes where popcount(N') is one, if that
> isn't already the case.
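
[Illustration only, not part of the patch: the popcount property above can
be seen with a trivial standalone program, using GCC's __builtin_popcount.]

/* Illustration only: writing N consecutive bytes with power-of-two
   sized stores takes popcount(N) stores, e.g. 7 bytes = 4 + 2 + 1.  */
#include <cstdio>

static int
stores_needed (unsigned int nbytes)
{
  return __builtin_popcount (nbytes);
}

int
main ()
{
  printf ("8 bytes -> %d store(s)\n", stores_needed (8)); /* 1 */
  printf ("7 bytes -> %d store(s)\n", stores_needed (7)); /* 3 */
  return 0;
}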
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap and
> make -k check with no new failures.  Is this revised version ok for mainline?

OK.

I wonder if we can also handle bitpos != 0: if we have an access
3 bytes into an 8-byte-aligned object and want to trim 2 bytes at the
start, then it would be good to trim only 1 byte so the access is
4-byte aligned (4 bytes into the 8-byte-aligned object).  Similarly,
trimming 6 bytes should be reduced to trimming 5 bytes, so the access
becomes 8-byte aligned.
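
A minimal sketch of that adjustment (illustrative only; the helper name and
interface are hypothetical, this is not GCC code):

/* BYTEPOS is the byte offset of the access within an ALIGN-byte-aligned
   object, TRIM the proposed number of dead leading bytes to drop.
   Return a (possibly smaller) trim so that BYTEPOS + trim lands on the
   largest power-of-two boundary reachable without un-trimming below the
   original start.  Assumes ALIGN is a power of two.  */

unsigned int
adjust_head_trim (unsigned int bytepos, unsigned int trim, unsigned int align)
{
  for (unsigned int a = align; a > 1; a >>= 1)
    {
      unsigned int aligned_start = (bytepos + trim) & ~(a - 1);
      if (aligned_start >= bytepos)
        return aligned_start - bytepos;
    }
  return trim;
}

/* adjust_head_trim (3, 2, 8) == 1 (new start at offset 4),
   adjust_head_trim (3, 6, 8) == 5 (new start at offset 8).  */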

Richard.

> 2022-03-09  Roger Sayle  <ro...@nextmovesoftware.com>
>             Richard Biener  <rguent...@suse.de>
>
> gcc/ChangeLog
>         PR tree-optimization/98335
>         * builtins.cc (get_object_alignment_2): Export.
>         * builtins.h (get_object_alignment_2): Likewise.
>         * tree-ssa-alias.cc (ao_ref_alignment): New.
>         * tree-ssa-alias.h (ao_ref_alignment): Declare.
>
>         * tree-ssa-dse.cc (compute_trims): Improve logic deciding whether
>         to align head/tail, writing more bytes but using fewer store insns.
>         (maybe_trim_memstar_call): Silence compiler warnings by using
>         memset to initialize lendata.
>
> gcc/testsuite/ChangeLog
>         PR tree-optimization/98335
>         * g++.dg/pr98335.C: New test case.
>         * gcc.dg/pr86010.c: New test case.
>         * gcc.dg/pr86010-2.c: New test case.
>
> Thanks again for your help.
> Roger
> --
>
> > -----Original Message-----
> > From: Richard Biener <richard.guent...@gmail.com>
> > Sent: 08 March 2022 10:44
> > To: Roger Sayle <ro...@nextmovesoftware.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>
> > Subject: Re: [PATCH] PR tree-optimization/98335: Improvements to DSE's
> > compute_trims.
> >
> > On Tue, Mar 8, 2022 at 11:10 AM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Mon, Mar 7, 2022 at 11:04 AM Roger Sayle
> > <ro...@nextmovesoftware.com> wrote:
> > > >
> > > >
> > > > This patch is the main middle-end piece of a fix for PR
> > > > tree-opt/98335, which is a code-quality regression affecting
> > > > mainline.  The issue occurs in DSE's (dead store elimination's)
> > > > compute_trims function, which determines where a store to memory can
> > > > be trimmed.  In the testcase given in the PR, this function notices
> > > > that the first byte of a DImode store is dead, and replaces the
> > > > 8-byte store at (aligned) offset zero with a 7-byte store at
> > > > (unaligned) offset one.  Most architectures can store a power-of-two
> > > > number of bytes (up to a maximum) in a single instruction, so writing
> > > > 7 bytes requires more instructions than writing 8 bytes.  This patch
> > > > follows Jakub Jelinek's suggestion in comment 5 that compute_trims
> > > > needs improved heuristics.
> > > >
> > > > In this patch, the decision of whether and how to align trim_head is
> > > > based on the number of bytes being written, the alignment of the
> > > > start of the object, and where within the object the first byte is
> > > > written.  The first tests check that we're not already writing to the
> > > > start of the object and that we're writing three or more bytes.  If
> > > > we're only writing one or two bytes, there's no benefit from providing
> > > > additional alignment.
> > > > Then we determine the alignment of the object, which is either 1, 2,
> > > > 4, 8 or 16 byte aligned (capping at 16 guarantees that we never
> > > > write more than 7 bytes beyond the minimum required).  If the buffer
> > > > is only 1 or 2 byte aligned, there's no benefit from additional
> > > > alignment.  For the remaining cases, the alignment of trim_head is
> > > > based upon where within each aligned block (word) the first byte is
> > > > written.  For example, storing the last byte (or last half-word) of
> > > > a word can be performed with a single insn.
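
[For illustration, one way the trim_head decision described above could be
expressed; a hypothetical helper, not the exact patch logic.  It assumes
align is a power of two and that trim_head is measured from the aligned
start of the object.]

/* If the bytes from the first live byte up to the next ALIGN boundary
   are not a power of two, the leading partial store would itself split
   into several stores, so round the trim back down to the boundary.  */

void
maybe_align_trim_head (int *trim_head, unsigned int align)
{
  unsigned int pos = *trim_head & (align - 1);
  unsigned int head_bytes = align - pos;
  if (pos != 0 && (head_bytes & (head_bytes - 1)) != 0)
    *trim_head &= ~(align - 1);
}

/* E.g. with align == 8: trim_head 1 (7 head bytes, three stores) is
   reset to 0, while trim_head 6 (2 head bytes, one store) is kept.  */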
> > > >
> > > > On x86_64-pc-linux-gnu with -O2 the new test case in the PR goes from:
> > > >
> > > >         movl    $0, -24(%rsp)
> > > >         movabsq $72057594037927935, %rdx
> > > >         movl    $0, -21(%rsp)
> > > >         andq    -24(%rsp), %rdx
> > > >         movq    %rdx, %rax
> > > >         salq    $8, %rax
> > > >         movb    c(%rip), %al
> > > >         ret
> > > >
> > > > to
> > > >
> > > >         xorl    %eax, %eax
> > > >         movb    c(%rip), %al
> > > >         ret
> > > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check with no new failures.  I've also added
> > > > new testcases for the original motivating PR tree-optimization/86010,
> > > > to ensure that those remain optimized (in future).
> > > > Ok for mainline?
> > >
> > > diff --git a/gcc/tree-ssa-dse.cc b/gcc/tree-ssa-dse.cc
> > > index 2b22a61..080e406 100644
> > > --- a/gcc/tree-ssa-dse.cc
> > > +++ b/gcc/tree-ssa-dse.cc
> > > @@ -405,10 +405,36 @@ compute_trims (ao_ref *ref, sbitmap live, int *trim_head, int *trim_tail,
> > >    int first_live = bitmap_first_set_bit (live);
> > >    *trim_head = first_live - first_orig;
> > >
> > > -  /* If more than a word remains, then make sure to keep the
> > > -     starting point at least word aligned.  */
> > > -  if (last_live - first_live > UNITS_PER_WORD)
> > > -    *trim_head &= ~(UNITS_PER_WORD - 1);
> > > +  /* If REF is aligned, try to maintain this alignment if it reduces
> > > +     the number of (power-of-two sized aligned) writes to memory.
> > > +     First check that we're writing >= 3 bytes at a non-zero offset.  */
> > > +  if (first_live
> > > +      && last_live - first_live >= 2)
> > > +    {
> > > +      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (ref->base));
> > >
> > > You can't simply use TYPE_ALIGN_* on ref->base.  You can use
> > > get_object_alignment on ref->ref, but ref->ref can be NULL in case the
> > > ref was initialized from a builtin call like memcpy.
> > >
> > > Also, ref->base is offset by ref->offset, which you don't seem to
> > > account for.  In theory one could export get_object_alignment_2 and, if
> > > ref->ref is NULL, use that on ref->base, passing addr_p = true, and
> > > then adjust the resulting bitpos by ref->offset and fix align
> > > accordingly (trimming might also align an access if the original
> > > access was offset from known alignment).
> > >
> > > That said, a helper like ao_ref_alignment () might be useful here.
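
[For illustration, a rough sketch of the shape such a helper could take,
following the outline above; the attached version may differ, and the
get_object_alignment_1 call and the poly_int handling are assumptions.]

/* Sketch only: return the alignment of the access REF in *ALIGN (in
   bits) and its offset from that alignment in *BITPOS, or false if
   nothing useful is known.  Assumes get_object_alignment_2 has been
   exported from builtins.cc.  */

bool
ao_ref_alignment (ao_ref *ref, unsigned int *align,
		  unsigned HOST_WIDE_INT *bitpos)
{
  if (ref->ref)
    return get_object_alignment_1 (ref->ref, align, bitpos);

  /* Only ref->base is known: take the alignment of the base address
     and fold in the (bit) offset of the access within it.  */
  HOST_WIDE_INT offset;
  if (!ref->offset.is_constant (&offset)
      || !get_object_alignment_2 (ref->base, align, bitpos, true))
    return false;
  *bitpos += (unsigned HOST_WIDE_INT) offset;
  *bitpos = *bitpos & (*align - 1);
  return true;
}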
> >
> > Like the attached - feel free to use that.
> >
> > Richard.
> >
> > >
> > > I wonder if we can apply good heuristics to compute_trims without
> > > taking context into account: maybe_trim_complex_store is already
> > > limiting itself to useful subsets, and the constructor and memstar
> > > cases will only benefit if they end up being expanded inline via
> > > *_by_pieces, not if expanded as a call.
> > >
> > > You don't seem to adjust *trim_tail at all; if an aligned 16 byte
> > > region is trimmed there by 3, that will result in two extra stores as
> > > well (13 = 8 + 4 + 1 bytes instead of a single 16-byte store), no?
> > >
>
>
