On Fri, May 31, 2024 at 4:59 AM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
> I've recently been trying to hand-write code to trigger automatic
> vectorization optimizations in GCC on Intel x86 machines (without
> using the interfaces in immintrin.h), but I'm running into a problem
> where I can't seem to get the concise `vpmovzxbd` or similar
> instructions.
>
> My requirement is to convert 8 `uint8_t` elements to `int32_t` type
> and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
> immintrin.h, the code is as follows:
>
> int immintrin () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> After compiling with -mavx2 -O3, you can get concise and efficient
> instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)
>
> But if I do not use this interface and instead use a for-loop or the
> `__builtin_convertvector` interface provided by GCC, I cannot achieve
> the above effect. The code is as follows:
>
> typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
> int forloop () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = {};
>     for (int i = 0; i < 8; i++) {
>         b[i] = (a + offset)[i];
>     }
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> int builtin_cvt () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = __builtin_convertvector(av, __v8si);
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }

Ideally both should work. The loop case does work when the loop
vectorizer is disabled: with -O3 -fno-tree-loop-vectorize it produces

        vpmovzxbd       3(%rax), %ymm0
        vmovdqa %ymm0, (%rsp)

The loop vectorizer is constrained to using the same vector size
throughout, and thus makes a mess of it by unpacking the 8-element char
vector two times, ending up with four 2-element int vectors. I do have
plans to address this, but I'm not sure whether they can materialize for
GCC 15.

> The instructions generated by both functions are redundant and
> complex, and are quite difficult to read compared to calling
> `_mm256_cvtepu8_epi32` directly. (You can see it here as well:
> https://godbolt.org/z/8ojzdav47)
>
> What I want to ask is: How should I write the source code to get
> assembly instructions similar to directly calling
> _mm256_cvtepu8_epi32?
>
> Or would it be easier if I modified the GIMPLE directly? But it seems
> that there is no relevant expression or interface directly
> corresponding to `vpmovzxbd` in GIMPLE.
>
> Thanks
> Hanke Zhang
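
For reference, here is a minimal, self-contained sketch of the two
intrinsic-free variants discussed above. The type and function names
(u8x8, i32x8, widen_u8x8, widen_loop) are made up for illustration, and
memcpy is swapped in for the `*(v8qiu *)(a + offset)` pointer cast so the
unaligned load does not depend on alignment or aliasing assumptions.
Whether either function collapses to a single `vpmovzxbd` depends on the
GCC version and flags; the only combination confirmed in this thread is
the loop form with -mavx2 -O3 -fno-tree-loop-vectorize, so the generated
assembly should be checked rather than assumed.

```c
#include <stdint.h>
#include <string.h>

/* Generic GCC vector types; these names are illustrative only. */
typedef uint8_t u8x8 __attribute__ ((vector_size (8)));
typedef int32_t i32x8 __attribute__ ((vector_size (32)));

/* Zero-extend the 8 bytes at src (any alignment) into dst[0..7]
   using GCC's __builtin_convertvector. */
static inline void widen_u8x8 (const uint8_t *src, int32_t *dst)
{
    u8x8 in;
    memcpy (&in, src, sizeof in);      /* unaligned 8-byte load */
    i32x8 out = __builtin_convertvector (in, i32x8);
    memcpy (dst, &out, sizeof out);    /* unaligned 32-byte store */
}

/* The same conversion written as a plain scalar loop, which the
   compiler may or may not turn into a single widening instruction. */
static inline void widen_loop (const uint8_t *src, int32_t *dst)
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[i];
}
```

Both functions compute the same result, so comparing their assembly with
and without -fno-tree-loop-vectorize shows directly where the loop
vectorizer's extra unpacking comes from.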