https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
FWIW, Peter Cordes provides an overview of available approaches for extraction
depending on vector length and ISA extensions (up to AVX2, not including
AVX-512) in this StackOverflow answer:
https://stackoverflow.com/a/51414330/4755075

TL;DR: generally through store+load; possible alternatives:

128b:

  SSSE3: pshufb (1-byte elements)
  SSSE3: imul+add+pshufb (any element size)
  AVX: vpermilp[sd] (4 or 8-byte elements)

256b:

  AVX2: vpermps (4-byte elements)

In all cases a (v)movd is needed to move the index to a vector register, and
potentially another (v)movd if the result is needed in a general register.

The basic store+load tactic may look worse latency-wise, but can be better
throughput-wise (especially with multiple extractions from the same vector, as
then the store needs to be done just once, as Peter mentioned).

Why is it important in RTL to do this without referencing the stack?
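
For reference, the store+load tactic mentioned above can be sketched in C
roughly as follows. This is only an illustration, not what GCC emits or what
the linked answer contains: the type v4si and the helpers spill_v4si and
extract_v4si are hypothetical names, and the code assumes the GNU C
vector_size extension (GCC or Clang) with 4-byte int elements.

```c
#include <string.h>

/* GNU C vector extension (assumed: GCC or Clang): four 32-bit ints. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical helper: store the whole vector to memory once, so that
   several variable-index extractions can share that single store. */
static void spill_v4si(v4si v, int out[4])
{
    memcpy(out, &v, sizeof v);
}

/* Hypothetical helper: each extraction is then a plain scalar load. */
static int extract_v4si(const int spilled[4], unsigned idx)
{
    return spilled[idx & 3]; /* mask keeps the index within the 4 lanes */
}
```

With this shape, extracting several elements from the same vector costs one
call to spill_v4si followed by one scalar load per index, which is the
throughput argument made in the comment.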