https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686

--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I think we have always maintained that this can definitely be a per-uarch
default but shouldn't be a generic default.

> I don't see any reason why this wouldn't be the case for the vast majority of
> implementations, especially high performance ones would benefit from having
> more work to saturate the execution units with, since a larger LMUL works
> quite similar to loop unrolling.

One argument is reduced freedom for renaming and the out of order machinery. 
It's much easier to shuffle individual registers around than large blocks. 
Also lower-latency insns are easier to schedule than longer-latency ones and
faults, rejects, aborts etc. get proportionally more expensive.
I was under the impression that unrolling doesn't help a whole lot (sometimes
even slows things down a bit) on modern cores and certainly is not
unconditionally helpful.  Granted, I haven't seen a lot of data on it recently.
An exception is of course breaking dependency chains.
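
To put rough numbers on the trade-off, here is a small sketch (not GCC code).
It only uses the RVV relations VLMAX = LMUL * VLEN / SEW and 32 architectural
vector registers grouped LMUL at a time.  The helper names and the example
values (VLEN=128, SEW=32, n=1024) are assumptions chosen for illustration, and
it says nothing about how a real core renames or schedules register groups.

```python
import math

NUM_VECTOR_REGS = 32  # RVV provides 32 architectural vector registers


def vlmax(vlen_bits, sew_bits, lmul):
    """Elements processed by one vector instruction: LMUL * VLEN / SEW."""
    return lmul * vlen_bits // sew_bits


def register_groups(lmul):
    """Number of independent register groups left at a given integer LMUL."""
    return NUM_VECTOR_REGS // lmul


def strip_mine_iterations(n, vlen_bits, sew_bits, lmul):
    """Iterations of a strip-mined loop over n elements."""
    return math.ceil(n / vlmax(vlen_bits, sew_bits, lmul))


def summarize(n=1024, vlen_bits=128, sew_bits=32):
    """One row per LMUL: (lmul, elements per insn, register groups, iterations)."""
    return [
        (
            lmul,
            vlmax(vlen_bits, sew_bits, lmul),
            register_groups(lmul),
            strip_mine_iterations(n, vlen_bits, sew_bits, lmul),
        )
        for lmul in (1, 2, 4, 8)
    ]


for lmul, elems, groups, iters in summarize():
    print(f"LMUL={lmul}: {elems:2d} elements/insn, "
          f"{groups:2d} register groups, {iters:3d} iterations")
```

Going from LMUL=1 to LMUL=8 cuts the iteration count by 8x, which is the
unrolling-like effect, but also shrinks the number of independently
allocatable register groups from 32 to 4, which is the "large blocks" concern
above.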

In general nothing stands in the way of having a particular tune target use
dynamic LMUL by default even now but nobody went ahead and posted a patch for
theirs.  One could maybe argue that it should be the default for in-order
uarchs?

Should it become obvious in the future that LMUL > 1 is indeed,
unconditionally, a "better unrolling" because of its favorable icache footprint
and other properties (which I doubt - happy to be proved wrong) then we will
surely re-evaluate the decision or rather have a different consensus.

The data we publicly have so far is all in-order cores and my expectation is
that the picture will change once out-of-order cores hit the scene.
