https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99912

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot 
gnu.org
   Last reconfirmed|                            |2021-04-07
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |ASSIGNED

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Erik Schnetter from comment #5)
> As you suggested, the problem is probably not caused by register spills, but
> by stores into a struct that are not optimized away. In this case, the
> respective struct elements are unused in the code.
> 
> I traced the results of the first __builtin_ia32_maskloadpd256:
> 
>   _63940 = __builtin_ia32_maskloadpd256 (_63955, prephitmp_86203);
>   MEM <const vector(4) double> [(struct mat3 *)&vars + 992B] = _63940;
>   _178613 = .FMA (_63940, _64752, _178609);
>   MEM <const vector(4) double> [(struct mat3 *)&vars + 1312B] = _63940;
> 
> The respective struct locations (+ 992B, + 1312B) are indeed not used
> anywhere else.
> 
> The struct is of type z4c_vars. It (and its parent) are defined in lines
> 279837 to 280818. It is large.
> 
> Is there e.g. a parameter I could set to make GCC try harder avoid
> unnecessary stores?

Yes, there's --param dse-max-alias-queries-per-store=N where setting N
to 10000 seems to help quite a bit (it's default is 256 and it's used to
limit the quadratic complexity of DSE to constant-time).

There are still some other temporaries left, notably we have aggregate
copies like

  MEM[(struct vect *)&D.662088 + 96B clique 23 base 1].elts._M_elems[0] =
MEM[(const struct simd &)&D.662080 clique 23 base 0];

and later used as

  SR.3537_174107 = MEM <simd_vector> [(const struct vec3 &)&D.662088];

the aggregate copying keeps the stores into D.662080 live (instead of
directly storing into D.662088).  At SRA time we still take the address
of both so they are not considered for decomposition.  That's from
dead stores to some std::initializer_list.  Thus there's a pass ordering
issue and SRA would benefit from another DSE pass before it.

For the fun I did

diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..0c8a50e7a07 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -221,6 +221,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_tail_recursion);
       NEXT_PASS (pass_ch);
       NEXT_PASS (pass_lower_complex);
+      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_sra);
       /* The dom pass will also resolve all __builtin_constant_p calls
          that are still there to 0.  This has to be done after some
@@ -236,7 +237,6 @@ along with GCC; see the file COPYING3.  If not see
       /* Identify paths that should never be executed in a conforming
         program and isolate those paths.  */
       NEXT_PASS (pass_isolate_erroneous_paths);
-      NEXT_PASS (pass_dse);
       NEXT_PASS (pass_reassoc, true /* insert_powi_p */);
       NEXT_PASS (pass_dce);
       NEXT_PASS (pass_forwprop);

which helps SRA in this case.  It does need quite some magic to remove all
the C++ abstraction in this code.  There's also lots of stray CLOBBERS
of otherwise unused variables in the code which skews the DSE alias
query parameter limit, we should look at doing TODO_remove_unused_locals
more often (for example after DSE) - that alone is enough to get almost
all the improvements without increasing any of the DSE walking limits.

Reply via email to