https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114686
Bug ID: 114686
Summary: Feature request: Dynamic LMUL should be the default for the RISC-V Vector extension
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: camel-cdr at protonmail dot com
Target Milestone: ---

Currently, the default value of -mrvv-max-lmul is "m1"; it should be "dynamic" instead, for the following reasons:

All currently available RVV implementations benefit from using the largest possible LMUL (C906, C908, C920, ara, bobcat; see also this comment about the SiFive cores: https://gcc.gnu.org/pipermail/gcc-patches/2024-February/644676.html). Some benefit to the degree that using LMUL=1 instead of LMUL=2 or above basically always wastes 50% of the performance, as you can see here for the C908: https://camel-cdr.github.io/rvv-bench-results/canmv_k230/index.html

I see no reason why this wouldn't be the case for the vast majority of implementations. High-performance ones in particular would benefit from having more work available to saturate their execution units, since a larger LMUL works much like loop unrolling.

Also consider that using a lower LMUL than possible makes mask instructions more expensive, because they execute more frequently. For any LMUL/SEW combination the mask fits into a single LMUL=1 vector register and can thus (usually) execute in the same number of cycles regardless of LMUL. So in a loop with LMUL=4 the mask operations are four times as fast per element as with LMUL=1, because they occur a quarter as often.

Notes: The vrgather.vv instruction should be exempt from this, because an LMUL=8 vrgather.vv is far more powerful than eight LMUL=1 vrgather.vv instructions, and thus disproportionately complex to implement. When you don't need to cross lanes, it's possible to unroll LMUL=1 vrgathers manually instead of choosing a higher LMUL.
Here are throughput measurements on some existing implementations:

          VLEN   e8m1   e8m2   e8m4   e8m8
c906       128      4     16     64    256
c908       128      4     16   64.9  261.1
c920       128    0.5    2.4    8.0   32.0
bobcat*    256     68    132    260    516
x280*      512     65    129    257    513

*bobcat: Note that it was explicitly stated that they didn't optimize the permutation instructions.
*x280: The numbers are from llvm-mca, but I was told they match reality. There is also supposed to be a vrgather fast path for vl<=256. I don't think there was much incentive to make this fast, as the x280 mostly targets AI.

vcompress.vm doesn't scale linearly with LMUL on the XuanTie chips either, but a better implementation is conceivable, because the work can be better distributed/subdivided. GCC currently doesn't seem to generate vcompress.vm via auto-vectorization anyway: https://godbolt.org/z/Mb5Kba865