https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70912
Bug ID: 70912
Summary: reassociation width needs to be aware of FMA, width of expression, and other architectural details
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: acsawdey at gcc dot gnu.org
Target Milestone: ---

Created attachment 38395
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38395&action=edit
test case

This is based on looking at trunk r235605 performance for ppc64le on a power8 system. If you have an expression of the form

  x = a*b + c*d + e*f + g*h + ...

then with reassociation width 1 you get a single multiply in front followed by a series of dependent multiply-adds. A natural consequence of doing reassociation with width > 1 is that you need to peel additional multiplies off of the front and do additional adds at the end to bring together the final result. But if there are too few terms overall, the cost of that additional serialization versus the fused multiply-add eats up the gain you might get from the parallelism in the middle.

Some real numbers for this compiler and power8, using --param tree-reassoc-width=N to set the max width. Rows are width 1, 2, 4 and columns are increasing numbers of terms, i.e. "a*b+c*d" would be 4 terms. Table values are reduction in runtime:

            8       12      16      32
  1     0.00%   0.00%   0.00%   0.00%
  2    -3.37%   4.34%   8.62%  22.83%
  4                    14.53%  31.47%

So for this arch we do not want to do this at all for the FMA case unless we have at least 10 or 12 total terms. Looking at the reassoc pass output showed it did not try more than width=2 for 8 or 12 terms, which is why the width=4 row has no entries in those columns. I ran into this when putting together a reassociation_width function for the rs6000 config; I couldn't see a way to avoid this behavior.

Another issue is that this is another place where we might want to modify behavior based on local register pressure. We don't want to introduce a bunch of new temps to do the parallel reassociation only to end up being unable to allocate them.