Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory

Richard Biener via Gcc-patches Thu, 20 May 2021 00:51:41 -0700

On Wed, May 19, 2021 at 3:22 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Wed, May 19, 2021 at 2:33 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Tue, May 18, 2021 at 9:16 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > >
> > > When expanding a constant constructor, don't call expand_constructor if
> > > it is more efficient to load the data from the memory via move by pieces.
> > >
> > > gcc/
> > >
> > >         PR middle-end/90773
> > >         * expr.c (expand_expr_real_1): Don't call expand_constructor if
> > >         it is more efficient to load the data from the memory.
> > >
> > > gcc/testsuite/
> > >
> > >         PR middle-end/90773
> > >         * gcc.target/i386/pr90773-24.c: New test.
> > >         * gcc.target/i386/pr90773-25.c: Likewise.
> > > ---
> > >  gcc/expr.c                                 | 10 ++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr90773-24.c | 22 ++++++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr90773-25.c | 20 ++++++++++++++++++++
> > >  3 files changed, 52 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-24.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr90773-25.c
> > >
> > > diff --git a/gcc/expr.c b/gcc/expr.c
> > > index d09ee42e262..80e01ea1cbe 100644
> > > --- a/gcc/expr.c
> > > +++ b/gcc/expr.c
> > > @@ -10886,6 +10886,16 @@ expand_expr_real_1 (tree exp, rtx target, 
> > > machine_mode tmode,
> > >                 unsigned HOST_WIDE_INT ix;
> > >                 tree field, value;
> > >
> > > +               /* Check if it is more efficient to load the data from
> > > +                  the memory directly.  FIXME: How many stores do we
> > > +                  need here if not moved by pieces?  */
> > > +               unsigned HOST_WIDE_INT bytes
> > > +                 = tree_to_uhwi (TYPE_SIZE_UNIT (type));
> >
> > that's prone to fail - it could be a VLA.
>
> What do you mean by fail?  Is it ICE or missed optimization?
> Do you have a testcase?
>
> >
> > > +               if ((bytes / UNITS_PER_WORD) > 2
> > > +                   && MOVE_MAX_PIECES > UNITS_PER_WORD
> > > +                   && can_move_by_pieces (bytes, TYPE_ALIGN (type)))
> > > +                 goto normal_inner_ref;
> > > +
> >
> > It looks like you're concerned about aggregate copies but this also handles
> > non-aggregates (which on GIMPLE might already be optimized of course).
>
> Here I check if we copy more than 2 words and we can move more than
> a word in a single instruction.
>
> > Also you say "if it's cheaper" but I see no cost considerations.  How do
> > we generally handle immed const vs. load from constant pool costs?
>
> This trades 2 (update to 8) stores with one load plus one store.  Is there
> a way to check which one is faster?


I'm not sure - it depends on whether the target can do stores from immediates
at all or what restrictions apply, what the immediate value actually is
(zero or all-ones should be way cheaper than sth arbitrary) and how the
pressure on the load unit is.  can_move_by_pieces (bytes, TYPE_ALIGN (type))
also does not guarantee it will actually move pieces larger than UNITS_PER_WORD,
that might depend on alignment.  There's by_pieces_ninsns that might provide
some hint here.

I'm sure it works well for x86.

I wonder if the existing code is in the appropriate place and we
shouldn't instead
handle this somewhere upthread where we ask to copy 'exp' into some other
memory location.  For your testcase that's expand_assignment but I can
imagine passing array[0] by value to a function resulting in similar copying.
Testing that shows we get

        pushq   array+56(%rip)
        .cfi_def_cfa_offset 24
        pushq   array+48(%rip)
        .cfi_def_cfa_offset 32
        pushq   array+40(%rip)
        .cfi_def_cfa_offset 40
        pushq   array+32(%rip)
        .cfi_def_cfa_offset 48
        pushq   array+24(%rip)
        .cfi_def_cfa_offset 56
        pushq   array+16(%rip)
        .cfi_def_cfa_offset 64
        pushq   array+8(%rip)
        .cfi_def_cfa_offset 72
        pushq   array(%rip)
        .cfi_def_cfa_offset 80
        call    bar

for that.  We do have the by-pieces infrastructure to generally do this kind of
copying but in both of these cases we do not seem to use it.  I also wonder
if the by-pieces infrastructure can pick up constant initializers automagically
(we could native_encode the initializer part and feed the by-pieces
infrastructure with an array of bytes).  There for example might be easy to
immediate-store byte parts and difficult ones where we could decide on a
case-by-case basis whether to load+store or immediate-store them.

For example if I change your testcase to have the array[] initializer
all-zero we currently emit

        pxor    %xmm0, %xmm0
        movups  %xmm0, (%rdi)
        movups  %xmm0, 16(%rdi)
        movups  %xmm0, 32(%rdi)
        movups  %xmm0, 48(%rdi)
        ret

will your patch cause us to emit 4 loads?  OTHO if I do

const struct S array[] = {
  { 0, 0, 0, 7241, 124764, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
};

we get

        movq    $0, (%rdi)
        movl    $0, 8(%rdi)
        movl    $0, 12(%rdi)
        movl    $7241, 16(%rdi)
...

ideally we'd have sth like

    pxor %xmm0, %xmm0
    movups  %xmm0, (%rdi)
    movaps array+16(%rip), %xmm0
    movups %xmm0, 16(%rdi)
...

thus have the zeros written as immediates and the remaining pieces
with load+stores.

The by-pieces infrastructure eventually get's to see

(mem/u/c:BLK (symbol_ref:DI ("array") [flags 0x2] <var_decl
0x7ffff7ff5b40 array>) [1 array+0 S64 A256])

where the MEM_EXPR should provide a way to access the constant initializer.

That said I do agree the current code is a bit premature optimization
- but maybe
it should be fend off in expand_constructor which has the cheap clear_storage
first and which already does check can_move_by_pieces with some heuristics,
but that seems to be guarded by

           || (tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
               && (! can_move_by_pieces
                   (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
                    TYPE_ALIGN (type)))
               && ! mostly_zeros_p (exp))))

which is odd (we _can_ move by pieces, but how does this apply to
TREE_CONSTANT CTORs and avoid_temp_mem?).

That said, I wonder if we want to elide expand_constructor when the
CTOR is TREE_STATIC && TREE_CONSTANT and !mostly_zeros_p
and we can_move_by_pieces.

So sth like

diff --git a/gcc/expr.c b/gcc/expr.c
index 7139545d543..76b3bdf0c01 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -8504,6 +8504,12 @@ expand_constructor (tree exp, rtx target, enum
expand_modifier modifier,
               && (! can_move_by_pieces
                   (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
                    TYPE_ALIGN (type)))
+              && ! mostly_zeros_p (exp))
+          || (TREE_CONSTANT (exp)
+              && tree_fits_uhwi_p (TYPE_SIZE_UNIT (type))
+              && (can_move_by_pieces
+                  (tree_to_uhwi (TYPE_SIZE_UNIT (type)),
+                   TYPE_ALIGN (type)))
               && ! mostly_zeros_p (exp))))
       || ((modifier == EXPAND_INITIALIZER || modifier == EXPAND_CONST_ADDRESS)
          && TREE_CONSTANT (exp)))

which handles your initializer and the all-zero one optimal?

Richard.

> > >                 FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (init), ix,
> > >                                           field, value)
> > >                   if (tree_int_cst_equal (field, index))
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr90773-24.c 
> > > b/gcc/testsuite/gcc.target/i386/pr90773-24.c
> > > new file mode 100644
> > > index 00000000000..4a4b62533dc
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr90773-24.c
> > > @@ -0,0 +1,22 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=x86-64" } */
> > > +
> > > +struct S
> > > +{
> > > +  long long s1 __attribute__ ((aligned (8)));
> > > +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
> > > +};
> > > +
> > > +const struct S array[] = {
> > > +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> > > +};
> > > +
> > > +void
> > > +foo (struct S *x)
> > > +{
> > > +  x[0] = array[0];
> > > +}
> > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 
> > > \\(%\[\^,\]+\\)" 1 } } */
> > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 
> > > 16\\(%\[\^,\]+\\)" 1 } } */
> > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 
> > > 32\\(%\[\^,\]+\\)" 1 } } */
> > > +/* { dg-final { scan-assembler-times "movups\[\\t \]%xmm\[0-9\]+, 
> > > 48\\(%\[\^,\]+\\)" 1 } } */
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr90773-25.c 
> > > b/gcc/testsuite/gcc.target/i386/pr90773-25.c
> > > new file mode 100644
> > > index 00000000000..2520b670989
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr90773-25.c
> > > @@ -0,0 +1,20 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -march=skylake" } */
> > > +
> > > +struct S
> > > +{
> > > +  long long s1 __attribute__ ((aligned (8)));
> > > +  unsigned s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14;
> > > +};
> > > +
> > > +const struct S array[] = {
> > > +  { 0, 60, 640, 2112543726, 39682, 48, 16, 33, 10, 96, 2, 0, 0, 4 }
> > > +};
> > > +
> > > +void
> > > +foo (struct S *x)
> > > +{
> > > +  x[0] = array[0];
> > > +}
> > > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 
> > > \\(%\[\^,\]+\\)" 1 } } */
> > > +/* { dg-final { scan-assembler-times "vmovdqu\[\\t \]%ymm\[0-9\]+, 
> > > 32\\(%\[\^,\]+\\)" 1 } } */
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> H.J.

Re: [PATCH v4 12/12] constructor: Check if it is faster to load constant from memory

Reply via email to