https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70912

            Bug ID: 70912
           Summary: reassociation width needs to be aware of FMA, width of
                    expression, and other architectural details
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acsawdey at gcc dot gnu.org
  Target Milestone: ---

Created attachment 38395
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38395&action=edit
test case

This is based on looking at the performance of trunk r235605 for ppc64le on a
power8 system.

If you have an expression of the form

x = a*b+c*d+e*f+g*h ...

With reassociation width 1, you get a single multiply in front followed by a
series of dependent multiply-adds. A natural consequence of doing reassociation
with width > 1 is that you need to peel additional multiplies off of the front
and do additional adds at the end to bring together the final result. But if
you have too few overall terms, the cost of that additional serialization,
compared with the plain fused multiply-add chain, eats up the gain you might
get from the parallelism in the middle.
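
As an illustration only (this is not the attached test case), the two shapes
can be written out by hand in C. The function names and the array-based
interface are assumptions made for this sketch, and n counts products, so
n = 4 corresponds to 8 terms in the counting used in this report. Each
"acc + p[i] * q[i]" step is a multiply-add that a compiler may fuse on a
target with FMA when floating-point contraction is enabled.

```c
#include <assert.h>

/* Width 1: one leading multiply, then a single chain of dependent
   multiply-adds.  Hypothetical helper written for illustration. */
static double sum_width1(const double *p, const double *q, int n)
{
    assert(n >= 1);
    double acc = p[0] * q[0];
    for (int i = 1; i < n; i++)
        acc = acc + p[i] * q[i];   /* candidate for one fused multiply-add */
    return acc;
}

/* Width 2: two independent chains.  Compared with width 1 this peels one
   extra plain multiply off the front and needs one extra plain add at the
   end to combine the two partial sums. */
static double sum_width2(const double *p, const double *q, int n)
{
    assert(n >= 2);
    double acc0 = p[0] * q[0];     /* leading multiply, chain 0 */
    double acc1 = p[1] * q[1];     /* extra leading multiply, chain 1 */
    int i;
    for (i = 2; i + 1 < n; i += 2) {
        acc0 = acc0 + p[i] * q[i];
        acc1 = acc1 + p[i + 1] * q[i + 1];
    }
    if (i < n)                     /* odd product count: one leftover product */
        acc0 = acc0 + p[i] * q[i];
    return acc0 + acc1;            /* extra add that joins the chains */
}
```

With only a handful of products, the extra multiply and the final add are a
large share of the total work, which is the effect described above.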

Some real numbers for this compiler and power8, using --param
tree-reassoc-width=N to set the max width. Rows are widths 1, 2, and 4; columns
are increasing numbers of terms, where "a*b+c*d" counts as 4 terms. Table
values are the reduction in runtime relative to width 1, so a negative value
is a slowdown.

width   8       12      16      32      (number of terms)
1       0.00%   0.00%   0.00%   0.00%
2       -3.37%  4.34%   8.62%   22.83%
4                       14.53%  31.47%

So for this arch we do not want to reassociate with width > 1 at all for the
FMA case unless we have at least 10 or 12 total terms. Looking at the reassoc
pass output showed it did not try more than width=2 for 8 or 12 terms, which
is why the width 4 row has no entries in those columns.

I ran into this when putting together a reassociation_width function for the
rs6000 config. I couldn't see a way to avoid this behavior.
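
To make the desired behavior concrete, here is a minimal sketch of a width
choice that takes the number of terms into account. It is not GCC code and not
the existing target hook: the function name, its parameters, and the exact
cutoffs (12 and 16 terms) are assumptions read off the table above for this
one machine.

```c
/* Hypothetical heuristic, for illustration only.
   n_terms   - total operand count, e.g. "a*b+c*d" is 4 terms
   max_width - upper bound, e.g. the value of --param tree-reassoc-width
   Returns the reassociation width to use for a multiply-add chain. */
static int pick_fma_reassoc_width(int n_terms, int max_width)
{
    int width;

    if (n_terms >= 16)
        width = 4;      /* width 4 was measured as a win at 16 and 32 terms */
    else if (n_terms >= 12)
        width = 2;      /* width 2 was a small win at 12 terms */
    else
        width = 1;      /* width 2 was a loss at 8 terms: keep the FMA chain */

    if (max_width < 1)
        max_width = 1;  /* guard against a nonsensical bound */
    return width < max_width ? width : max_width;
}
```

For example, pick_fma_reassoc_width(8, 4) returns 1, while
pick_fma_reassoc_width(32, 4) returns 4. The cutoffs would have to be tuned
per target and should not be read as general recommendations.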

Another issue is that this is another place where we might want to modify
behavior based on local register pressure. We don't want to introduce a bunch
of new temps to do the parallel reassociation only to end up being unable to
allocate them.
