On Fri, May 31, 2024 at 4:59 AM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
> I've recently been trying to hand-write code to trigger automatic
> vectorization optimizations in GCC on Intel x86 machines (without
> using the interfaces in immintrin.h), but I'm running into a problem
> where I can't seem to get the concise `vpmovzxbd` or similar
> instructions.
>
> My requirement is to convert 8 `uint8_t` elements to `int32_t` type
> and print the output. If I use the interface (_mm256_cvtepu8_epi32) in
> immintrin.h, the code is as follows:
>
> int immintrin () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     __v8si b = (__v8si)_mm256_cvtepu8_epi32(*(__m128i *)(a + offset));
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> After compiling with -mavx2 -O3, you can get concise and efficient
> instructions. (You can see it here: https://godbolt.org/z/8ojzdav47)
>
> But if I do not use this interface and instead use a for-loop or the
> `__builtin_convertvector` interface provided by GCC, I cannot achieve
> the above effect. The code is as follows:
>
> typedef uint8_t v8qiu __attribute__ ((__vector_size__ (8)));
> int forloop () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = {};
>     for (int i = 0; i < 8; i++) {
>         b[i] = (a + offset)[i];
>     }
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }
>
> int builtin_cvt () {
>     int size = 10000, offset = 3;
>     uint8_t* a = malloc(sizeof(char) * size);
>
>     v8qiu av = *(v8qiu *)(a + offset);
>     __v8si b = __builtin_convertvector(av, __v8si);
>
>     for (int i = 0; i < 8; i++) {
>         printf("%d\n", b[i]);
>     }
> }

Ideally both should work.  The loop case works when the loop
vectorizer is disabled: with -O3 -fno-tree-loop-vectorize it
produces

        vpmovzxbd       3(%rax), %ymm0
        vmovdqa %ymm0, (%rsp)

The loop vectorizer is constrained to using the same vector sizes
and thus makes a mess of it by unpacking the 8-element char
vector two times into four 2-element int vectors.

I do have plans to address this, but I'm not sure whether they can
materialize for GCC 15.
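
For experimenting with the workaround above, here is a minimal sketch
that isolates the two conversions from the question into standalone
functions.  The function names (widen8_loop, widen8_cvt,
widen8_selfcheck), the v8si_t typedef, the restrict qualifiers and the
memcpy-based loads and stores are assumptions of this sketch, not part
of the original code.  Only the loop form was reported above to give
vpmovzxbd, and only with the loop vectorizer disabled; whether this
exact layout gives the same instruction is untested and worth checking
in the assembly for your GCC version.

```c
/* Suggested build, using the flags discussed in this thread:
     gcc -O3 -mavx2 -fno-tree-loop-vectorize -S widen.c        */
#include <stdint.h>
#include <string.h>

/* GCC vector-extension types: 8 x uint8_t (8 bytes) and
   8 x int32_t (32 bytes).  v8si_t is a name local to this sketch. */
typedef uint8_t v8qiu  __attribute__ ((__vector_size__ (8)));
typedef int32_t v8si_t __attribute__ ((__vector_size__ (32)));

/* Plain widening loop: zero-extend 8 bytes to 8 ints.
   restrict tells the compiler that src and dst do not overlap. */
void widen8_loop (const uint8_t *restrict src, int32_t *restrict dst)
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[i];
}

/* Same conversion through __builtin_convertvector.  memcpy is used
   for the load and store so that an unaligned src (such as a + 3)
   is well defined. */
void widen8_cvt (const uint8_t *restrict src, int32_t *restrict dst)
{
    v8qiu in;
    memcpy (&in, src, sizeof in);
    v8si_t out = __builtin_convertvector (in, v8si_t);
    memcpy (dst, &out, sizeof out);
}

/* Returns 1 if both variants zero-extend correctly, 0 otherwise. */
int widen8_selfcheck (void)
{
    uint8_t buf[16];
    for (int i = 0; i < 16; i++)
        buf[i] = (uint8_t) (250 + i);   /* wraps past 255 on purpose */

    int32_t a[8], b[8];
    widen8_loop (buf + 3, a);           /* same offset as in the question */
    widen8_cvt (buf + 3, b);

    for (int i = 0; i < 8; i++)
        if (a[i] != buf[3 + i] || b[i] != buf[3 + i])
            return 0;
    return 1;
}
```

Initializing the input, unlike the malloc'ed buffer in the question,
keeps the result well defined, and values above 127 make it visible
that the conversion zero-extends rather than sign-extends.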

> The instructions generated by both functions are redundant and
> complex, and are quite difficult to read compared to calling
> `_mm256_cvtepu8_epi32` directly. (You can see it here as well:
> https://godbolt.org/z/8ojzdav47)
>
> What I want to ask is: How should I write the source code to get
> assembly instructions similar to directly calling
> _mm256_cvtepu8_epi32?
>
> Or would it be easier if I modified the GIMPLE directly? But it seems
> that there is no relevant expression or interface directly
> corresponding to `vpmovzxbd` in GIMPLE.
>
> Thanks
> Hanke Zhang
