The main reason is that I’m working on the fixed-length-vector calling
convention [1]. For that, I need all these VLS types to be available so
that arguments can be passed correctly.

I know LMUL choice is very u-arch specific, so I agree the option makes
sense for the vectorizer. But when people use fixed-length vectors in
their code, I think it’s a bit different. My assumption is that if
someone writes code with fixed-length vectors, they usually expect it to
map directly to hardware operations, not to be split into smaller ones.

I understand there are several aspects to balance here. Other architectures like x86 allow vectors larger than the physical vector size and there the expectation is that they are split into appropriately sized chunks. That's IMHO the advantage of those GCC vectors over real VLS vectors - being able to use large vectors and still get pretty good/optimal code.
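
To make the "larger than the physical vector size" case concrete, here is a minimal sketch of a 1024-bit GCC generic vector. The type name v32si and the helper add_v32si are made up for illustration and are not from the patch. On a target whose registers are narrower than 1024 bits, the compiler is free to lower the single addition into several smaller operations, and which chunk size it picks is exactly the tuning question.

```c
#include <stdint.h>

/* Hypothetical 1024-bit generic vector: 32 x int32_t (128 bytes).  */
typedef int32_t v32si __attribute__ ((vector_size (128)));

/* Element-wise add.  Operands are passed by pointer so the sketch
   does not depend on any particular vector calling convention.  */
void
add_v32si (v32si *dst, const v32si *a, const v32si *b)
{
  *dst = *a + *b;
}
```

The source only states that one vector is being added; how many hardware instructions result is left to the backend.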

IMHO (again), which chunk size to use for splitting is mainly a performance question. Consider a 1024-bit vector: should it be split into 2x LMUL8, 4x LMUL4, ...? We'd lose a degree of freedom if we always went with LMUL8 and, if a particular uarch performs badly at LMUL8, would have to accept worse performance. Shouldn't a user who explicitly wants a certain LMUL rather use RVV intrinsics?
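
For reference, the arithmetic behind the splitting question can be sketched as below, assuming each chunk fills one whole register group of LMUL registers with VLEN bits each. The helper chunks_needed is hypothetical and not part of GCC. Under that assumption, the 2x LMUL8 / 4x LMUL4 figures for a 1024-bit vector correspond to VLEN = 64.

```c
#include <assert.h>

/* Number of register-group-sized chunks needed to cover a vector of
   vec_bits bits, when each chunk is one register group of lmul
   registers with vlen_bits bits each.  Rounds up.  */
static unsigned
chunks_needed (unsigned vec_bits, unsigned vlen_bits, unsigned lmul)
{
  assert (vlen_bits > 0 && lmul > 0);
  unsigned group_bits = vlen_bits * lmul;
  return (vec_bits + group_bits - 1) / group_bits;
}
```

With a larger VLEN the same vector needs fewer chunks at a given LMUL, which is another reason the best choice is uarch-dependent.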

Also, the option doesn’t really match the meaning of explicit vector
types in code. For example, with -mrvv-max-lmul=dynamic, the current
implementation basically acts the same as -mrvv-max-lmul=m8 for VLS
types.

Yeah, the "dynamic" name is certainly a bit misleading for GCC vectors.
It's LMUL8 plus a spilling analysis, and that analysis doesn't happen without autovec.

--
Regards
Robin
