There are two levels of dysfunction here:
1. Why spill & fill through the stack? Why not extract scalars from vregs
directly into scalar regs?
2. Why involve scalar registers at all? Why not vslide or even vrgather,
using temporary vregs as necessary?
That's how expmed does it. If vec_extract and friends or subregs don't work, we
need to go via memory as a last resort.
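
For reference, here is a minimal scalar model in C of what a vslidedown or
vrgather based extraction would compute. It is only a sketch of the
instruction semantics with vl assumed equal to VLMAX; the arrays stand in for
vector registers, and the function names are hypothetical, not anything in
the backend.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vslidedown: dst[i] = src[i + offset], or 0 when the
   source index would be at or past vlmax.  Assumes vl == vlmax.  */
void
model_vslidedown (uint64_t *dst, const uint64_t *src, size_t vlmax,
                  size_t offset)
{
  for (size_t i = 0; i < vlmax; i++)
    dst[i] = (offset < vlmax && i < vlmax - offset) ? src[i + offset] : 0;
}

/* Scalar model of vrgather.vv: dst[i] = src[idx[i]], or 0 when the index
   is out of range.  dst must not alias src or idx.  */
void
model_vrgather (uint64_t *dst, const uint64_t *src, const uint64_t *idx,
                size_t vlmax)
{
  for (size_t i = 0; i < vlmax; i++)
    dst[i] = idx[i] < vlmax ? src[idx[i]] : 0;
}

/* Extract elements 2..3 of a 4-element vector: slide down by two and
   read the low half of the temporary.  No scalar register or stack slot
   is needed in the modelled instruction sequence.  */
void
model_extract_high_half (uint64_t out[2], const uint64_t in[4])
{
  uint64_t tmp[4];
  model_vslidedown (tmp, in, 4, 2);
  out[0] = tmp[0];
  out[1] = tmp[1];
}
```
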
The fatal deficiency seems to be that the backend lacks vec_extractNM patterns
for mode M bigger than ELEN. Here are some ideas:
1. Define scalar modes M larger than DI mode. AArch64 defines TI, OI, and XI
modes for 128, 256, and 512-bit integers (all of which are wider than the
hardware supports).
2. Define vector modes M that are half, quarter, eighth, ... width of vector
mode N. That can be done with mode iterators. We already have VLS_HALF and
VLS_QUARTER, but there are no such iterators for the VLA modes. Note: there
are no fractional LMUL modes defined for SEW=64, i.e., no RVVMF[248]DI.
Yeah, vec_extract with vector modes is generally the way to go, I'd say.
That's generally a "VLS" line of thinking, though.
We cannot have RVVMF2DI and smaller when the minimum vector length is 64 bits.
Increasing the minimum vector length helps but then we're not fully "VLA" any
more.
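
To make the arithmetic explicit, here is a small sketch using the VLMAX
formula from the RVV specification, VLMAX = LMUL * VLEN / SEW. The helper
name is hypothetical and LMUL is passed as a fraction; a result of 0 means
not even one element fits, which is the RVVMF2DI case at VLEN=64.

```c
/* VLMAX = VLEN * LMUL / SEW, with LMUL given as lmul_num / lmul_den.
   Integer division truncates, so the result is 0 when a single element
   of width SEW does not fit into the register group.  */
unsigned
model_vlmax (unsigned vlen, unsigned sew, unsigned lmul_num,
             unsigned lmul_den)
{
  return (vlen * lmul_num) / (sew * lmul_den);
}
```

With a minimum VLEN of 64, model_vlmax (64, 64, 1, 2) is 0, so an MF2 mode
with DI elements cannot hold anything. Raising the minimum VLEN to 128 gives
model_vlmax (128, 64, 1, 2) == 1, at the cost of no longer covering
VLEN=64 implementations.
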
How does aarch64 do it? Do the larger scalar modes help for your problem?
They have those trn instructions, I guess, but doesn't their approach involve
BIT_FIELD_REFs?
What is your approach, i.e. what code do you write? Do you start with C code or
is this an autovec expansion? Couldn't you use vrgathers etc. right away?
--
Regards
Robin