On 15/02/2024 10:21, Richard Biener wrote:
[snip]
I suppose if RDNA really only has 32 lane vectors (it sounds like it,
even if it can "simulate" 64 lane ones?) then it might make sense to
vectorize for 32 lanes?  That said, with variable-length it likely
doesn't matter but I'd not expose fixed-size modes with 64 lanes then?

For most operations, wavefrontsize=64 works just fine; the GPU runs each
instruction twice and presents a pair of hardware registers as a logical
64-lane register. This breaks down for permutations and reductions, and is
obviously inefficient when the vectors are not fully utilized, but is
otherwise compatible with the GCN/CDNA compiler.
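
A minimal sketch in C of the two-pass behaviour described above, assuming 32 physical lanes per pass. The names (hw_add_pass, logical_add64, HW_LANES) are made up for illustration and model the idea only, not the actual hardware or the GCC back end:

```c
#include <assert.h>
#include <stdint.h>

#define HW_LANES      32  /* physical lanes per pass (assumption) */
#define LOGICAL_LANES 64  /* lanes seen with wavefrontsize=64 */

/* One hardware pass: add 32 lanes, honouring a 32-bit slice of the mask. */
static void hw_add_pass(int32_t *dst, const int32_t *a, const int32_t *b,
                        uint32_t exec_half)
{
    for (int lane = 0; lane < HW_LANES; lane++)
        if ((exec_half >> lane) & 1u)
            dst[lane] = a[lane] + b[lane];
}

/* Logical 64-lane add: the same operation issued twice, once per half.
   The second pass is issued even when the top half of exec is zero. */
void logical_add64(int32_t *dst, const int32_t *a, const int32_t *b,
                   uint64_t exec)
{
    assert(dst && a && b);
    hw_add_pass(dst, a, b, (uint32_t)exec);
    hw_add_pass(dst + HW_LANES, a + HW_LANES, b + HW_LANES,
                (uint32_t)(exec >> 32));
}
```

Each pass only touches its own half, which is fine for lane-wise operations like this one. A permutation that needs, say, lane 40's value in lane 3 crosses the two halves, so it cannot be expressed as two independent passes in this model.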

I didn't want to invest all the effort it would take to support
wavefrontsize=32, which would be the natural mode for these devices; the
number of places that have "64" hard-coded is just too big. Not only that, but
the EXEC and VCC registers change from DImode to SImode and that's going to
break a lot of stuff. (And we have no paying customer for this.)

I'm open to patch submissions. :)

OK, I see ;)  As said for fully masked that's a good answer.  I'd
probably still not expose V64mode modes in the RTL expanders for the
vect_* patterns?  Or, what happens if you change
gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?

Changing the preferred mode probably would fix permute.

Does that possibly leave performance on the table? (not sure if there
are any documents about choosing wavefrontsize=64 vs 32 with regard to
performance)

Note it wouldn't entirely forbid the vectorizer from using larger modes,
it just makes it prefer the smaller ones.  OTOH if you then run
wavefrontsize=64 on top of it it's probably wasting the 2nd instruction
by always masking it?

Right, the GPU will continue to process the "top half" of the vector as an additional step, regardless of whether you put anything useful there or not.

So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
allowing 32 there ...

I think the DImode to SImode change is the most difficult fix. Unless you know of a cunning trick, that's going to mean a lot of changes to a lot of the machine description; substitutions, duplications, iterators, indirections, etc., etc., etc.

The "64" substitution would be tedious but less hairy. I did a lot of those when I created the fake vector sizes.
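
To illustrate the kind of substitution meant here, a small sketch in C, not taken from the GCN back end. The target_cfg struct and the function names are hypothetical; the point is only that the lane count becomes a parameter and the mask width follows from it:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a target setting; not a real GCC variable. */
typedef struct { unsigned wavefrontsize; } target_cfg;

/* Before: the lane count is baked in as a 64-bit all-ones mask. */
uint64_t all_lanes_mask_hardcoded(void)
{
    return ~UINT64_C(0);
}

/* After: the lane count comes from the configuration. */
uint64_t all_lanes_mask(const target_cfg *cfg)
{
    assert(cfg->wavefrontsize == 32 || cfg->wavefrontsize == 64);
    /* Shifting a 64-bit value by 64 is undefined in C, so treat it apart. */
    if (cfg->wavefrontsize == 64)
        return ~UINT64_C(0);
    return (UINT64_C(1) << cfg->wavefrontsize) - 1;
}

/* Bytes needed for the mask: 8 (DImode-sized) or 4 (SImode-sized). */
unsigned mask_bytes(const target_cfg *cfg)
{
    return cfg->wavefrontsize / 8;
}
```

The second helper hints at why the EXEC/VCC change is the harder part: once the mask can be 4 bytes wide, everything that assumed an 8-byte register has to be revisited, not just the constant.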

Anyway, the fix works, so that's the most important thing ;)

:)

Andrew
