https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686
--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
I think we have always maintained that this can definitely be a per-uarch
default but shouldn't be a generic default.

> I don't see any reason why this wouldn't be the case for the vast majority of
> implementations, especially high performance ones would benefit from having
> more work to saturate the execution units with, since a larger LMUL works
> quite similar to loop unrolling.

One argument is reduced freedom for renaming and the out-of-order machinery.
It's much easier to shuffle individual registers around than large blocks.
Also, lower-latency insns are easier to schedule than longer-latency ones, and
faults, rejects, aborts etc. get proportionally more expensive.

I was under the impression that unrolling doesn't help a whole lot (sometimes
even slows things down a bit) on modern cores and certainly is not
unconditionally helpful. Granted, I haven't seen a lot of data on it recently.
An exception is of course breaking dependency chains.

In general, nothing stands in the way of having a particular tune target use
dynamic LMUL by default even now, but nobody went ahead and posted a patch for
theirs. One could maybe argue that it should be the default for in-order
uarchs?

Should it become obvious in the future that LMUL > 1 is indeed,
unconditionally, a "better unrolling" because of its favorable icache
footprint and other properties (which I doubt - happy to be proved wrong),
then we will surely re-evaluate the decision, or rather reach a different
consensus. The data we publicly have so far is all from in-order cores, and my
expectation is that the picture will change once out-of-order cores hit the
scene.
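
For illustration only, the trade-off between "LMUL as unrolling" and renaming freedom can be put into numbers with a rough sketch. It relies on two facts from the RVV specification: VLMAX = LMUL * VLEN / SEW, and the 32 architectural vector registers are grouped LMUL at a time. The function names `vlmax` and `strip_mine` and the values VLEN=128 and SEW=32 are assumptions made up for this example. It covers integer LMUL only and says nothing about how any real core behaves.

```python
NUM_VREGS = 32  # architectural vector registers in RVV


def vlmax(vlen_bits, sew_bits, lmul):
    """Elements handled by one vector instruction: LMUL * VLEN / SEW."""
    return vlen_bits // sew_bits * lmul


def strip_mine(n, vlen_bits, sew_bits, lmul):
    """Return (loop iterations, register groups) for a strip-mined loop.

    Hypothetical helper: iterations is ceil(n / VLMAX); register groups is
    how many distinct LMUL-sized register names remain for allocation.
    """
    per_iter = vlmax(vlen_bits, sew_bits, lmul)
    iterations = -(-n // per_iter)  # integer ceiling division
    groups = NUM_VREGS // lmul
    return iterations, groups


if __name__ == "__main__":
    # Assumed example: 1024 32-bit elements on a VLEN=128 implementation.
    for lmul in (1, 2, 4, 8):
        its, groups = strip_mine(1024, 128, 32, lmul)
        print(f"LMUL={lmul}: {its} iterations, {groups} register groups")
```

Going from LMUL=1 to LMUL=8 cuts the iteration count by 8x, much like unrolling would, but also leaves only 4 register groups instead of 32 names to allocate and rename. Whether that trade pays off is the per-uarch question discussed above.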