On Wed, Aug 4, 2021 at 2:07 AM Aaron Sawdey <acsaw...@linux.ibm.com> wrote:
>
> Richard,
>
> So, I’m noticing that in get_reassociation_width() we know how many ops
> (ops_num) are in the expression being considered for parallel reassociation,
> but this is not passed to the target hook. In my testing this seems like it
> might be useful to have. If you determine the maximum width that gives
> additional speedup for a large number of terms, and then use that as the
> width from the target hook, get_reassociation_width() is more aggressive than
> you would like for small expressions with maybe 4-16 terms and produces code
> that is slower than optimal. For example in many cases you want to continue
> using a width of 1 until you get to 16 terms or so. My testing shows this to
> be the case for power8, power9, and power10 processors.
>
> So, I’m wondering how it might be received if I posted a patch that adds this
> to the reassociation_width target hook (and of course fixes all uses of that
> target hook)?
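
For readers following along, the proposal quoted above can be sketched roughly as follows. This is a standalone illustration only, not GCC's actual hook: the function name, its signature, and the `min_ops_for_parallel` parameter are hypothetical. The threshold of "16 terms or so" comes from the quoted message, and the maximum width is a placeholder value.

```c
#include <assert.h>

/* HYPOTHETICAL sketch, not GCC's real target hook.  It only models the
   idea from the message: let the width decision see the number of
   operands (ops_num) so that small expressions stay serial.

   ops_num               number of operands in the expression
   max_width             widest reassociation that pays off for very
                         large expressions (assumed tuned per target)
   min_ops_for_parallel  assumed threshold below which width 1 is kept  */
static unsigned
sketch_reassociation_width (unsigned ops_num, unsigned max_width,
                            unsigned min_ops_for_parallel)
{
  assert (max_width >= 1);

  /* Small expressions: keep a single dependency chain.  */
  if (ops_num < min_ops_for_parallel)
    return 1;

  /* Large expressions: allow the full tuned width.  */
  return max_width;
}
```

With an assumed threshold of 16 and an assumed maximum width of 4, a 4-term expression would keep width 1 while a 32-term expression would get width 4. Whether a hard cutoff like this is the right shape for the policy is exactly what the reply below asks about.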
You probably saw that get_reassociation_width already tries to optimize
things. So what exactly would you change, and why is it slower for 4-16 terms
but not for 17+ ones?

I suppose "is slower" is --param mining on some benchmarks on your side and
eventually you manage to pick the best threshold to not run into register
pressure issues (by luck) for those benchmarks?

That said, I question whether you can explain why it is slower, right?

Richard.

> Thanks!
> Aaron
>
>
> Aaron Sawdey, Ph.D. saw...@linux.ibm.com
> IBM Linux on POWER Toolchain
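
As background for the width and register-pressure question, here is a minimal source-level sketch of what "width" means for an 8-term sum. It is an illustration under assumptions, not the exact tree shape the reassociation pass produces: the function names are made up, and integer operands are used so that every grouping gives the same result.

```c
#include <assert.h>

/* Width 1: one serial chain.  Seven dependent adds, so the critical
   path is 7 operations, but only one partial sum is live at a time.  */
static long
sum8_width1 (const long t[8])
{
  assert (t != 0);
  return ((((((t[0] + t[1]) + t[2]) + t[3]) + t[4]) + t[5]) + t[6]) + t[7];
}

/* Width 2: two independent chains joined at the end.  Still seven
   adds in total, but the critical path drops to 4 operations.  The
   cost is that two partial sums are live at once, which is where
   extra register pressure can come from.  */
static long
sum8_width2 (const long t[8])
{
  assert (t != 0);
  long chain0 = ((t[0] + t[1]) + t[2]) + t[3];
  long chain1 = ((t[4] + t[5]) + t[6]) + t[7];
  return chain0 + chain1;
}
```

Both functions return the same value for any input; only the dependency structure differs. For floating-point operands the regrouping can change the result, so compilers only do it when flags permit it.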