We used to apply -mrvv-max-lmul= to limit VLS code gen, the auto-vectorizer,
and builtin string function expansion. But I think the VLS code gen part doesn't
need this limit, since it only happens when the user explicitly writes vector
types.

For example, an int32x8_t under -mrvv-max-lmul=m1 with VLEN=128 would be split
into two int32x4_t operations, which generates more instructions and runs slower.
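For concreteness, here is a minimal sketch (not taken from the patch) of the
kind of explicit fixed-size vector code this refers to, using the GNU
vector_size extension; the typedef name is chosen for this example only:

    /* Hypothetical example of a user-written 256-bit VLS vector type.  */
    typedef int int32x8_t __attribute__ ((vector_size (32)));

    int32x8_t
    add8 (int32x8_t a, int32x8_t b)
    {
      /* With VLEN=128, a 256-bit value fits one LMUL=2 register group;
         capping the LMUL at m1 forces a split into two halves.  */
      return a + b;
    }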

In this patch, I changed -mrvv-max-lmul= to only affect auto-vectorization and
builtin string function expansion. The option's help text already says it only
controls the LMUL used by auto-vectorization, so I believe this change makes
sense :)
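As a hedged illustration (again not from the patch itself), these are the two
kinds of code that would still be limited by -mrvv-max-lmul= after the change:
an auto-vectorizable loop and an inline builtin memcpy expansion. Function
names are made up for the example:

    #include <string.h>

    /* Auto-vectorization candidate: the vectorizer's LMUL choice remains
       capped by -mrvv-max-lmul=.  */
    void
    saxpy (float *restrict y, const float *restrict x, float a, int n)
    {
      for (int i = 0; i < n; i++)
        y[i] += a * x[i];
    }

    /* Builtin string function expansion (inline memcpy) is likewise still
       subject to the limit.  */
    void
    copy64 (char *dst, const char *src)
    {
      memcpy (dst, src, 64);
    }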

This might have been discussed while I was away, so I haven't complained yet :)
To me the -mrvv-max-lmul option has always covered "everything", and IMHO the maximum LMUL should generally be tied to a microarchitecture.

Many of the higher-end cores won't favor LMUL > 1 and I'd find it surprising if we started emitting LMUL8 even for fixed vector sizes.

To play devil's advocate: if LMUL8 (or 4, 2) is faster, why don't we enable it unconditionally? Not that I think it's generally faster, but what's special about such a VLS example that doesn't hold for auto-vectorization?

Is the code for this example particularly bad for LMUL1 or is it optimal and LMUL8 is just faster on your uarchs?

--
Regards
Robin
