(resent because first one is held?) Hello cfe-users,
I’m trying to get clang (or GCC for that matter) to vectorize a very simple loop, and I’m wondering what I’m doing wrong. I’d rather write the loop as a loop instead of using intrinsics or the clang vector extensions, because I want the code to be portable. Pragmas and magic attributes are also undesirable, but they’re better than intrinsics.

This file is representative of what I’m trying to do. I’m compiling with -O3 -std=c99 -mavx2, but the same issues should apply for other vector settings.

"""
#include <stdint.h>

typedef struct this_should_totally_be_a_vector {
  uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

void add(a_vector *a, const a_vector *b) {
  for (int i=0; i<8; i++) a->limb[i] += b->limb[i];
}

void mac(a_vector *a, const a_vector *b) {
  const a_vector c = {{0,1,2,3,4,5,6,7}};
  for (int i=0; i<8; i++) a->limb[i] += b->limb[i] + 3*c.limb[i];
}
"""

Can someone suggest flags, pragmas, attributes etc which would cause these functions to produce good code? I’m seeing lots of problems. I’m testing for now on clang-3.6 release.

For starters, the compiler is unable to determine that there is no loop dependency, and therefore unrolls the loop instead of vectorizing. When passed #pragma clang loop unroll(disable) vectorize(enable), it is still not able to determine that there is no dependency, and so branches to a scalar version if a is close to b. Furthermore, it ignores the alignment hint and uses vmovdqu for everything, though maybe that doesn’t actually cost any performance.

In fact, there cannot be a loop dependency both because of the alignment and because the arrays are in structs. Clang produces the correct code if a is declared __restrict__, but in the real code it is possible that a=b so I’d rather not say __restrict__ if I don’t have to (especially since the code may be inlined, possibly causing alias analysis to break). GCC has #pragma GCC ivdep, which causes it to vectorize properly, but does Clang have any equivalent to #pragma ivdep?
Also, __restrict__ still doesn’t give me vmovdqa.

For mac, with __restrict__ (again undesirable) I get decent 2-way vectorized sse3 code, which isn’t bad I guess, but I’d rather the compiler automatically produced 4-way avx2 code. If I add #pragma clang loop unroll(disable) vectorize(enable), I get

"""
vmovdqa mac.c(%rip), %ymm0
vpbroadcastq .LCPI2_0(%rip), %ymm1
vpmuludq %ymm1, %ymm0, %ymm2
vpxor %ymm3, %ymm3, %ymm3
vpmuludq %ymm3, %ymm0, %ymm4
vpsllq $32, %ymm4, %ymm4
vpaddq %ymm4, %ymm2, %ymm2
vpsrlq $32, %ymm0, %ymm0
vpmuludq %ymm1, %ymm0, %ymm0
vpsllq $32, %ymm0, %ymm0
vpaddq %ymm0, %ymm2, %ymm0
vpaddq (%rsi), %ymm0, %ymm0
vpaddq (%rdi), %ymm0, %ymm0
vmovdqu %ymm0, (%rdi)
vmovdqa mac.c+32(%rip), %ymm0
vpmuludq %ymm1, %ymm0, %ymm2
vpmuludq %ymm3, %ymm0, %ymm3
vpsllq $32, %ymm3, %ymm3
vpaddq %ymm3, %ymm2, %ymm2
vpsrlq $32, %ymm0, %ymm0
vpmuludq %ymm1, %ymm0, %ymm0
vpsllq $32, %ymm0, %ymm0
vpaddq %ymm0, %ymm2, %ymm0
vpaddq 32(%rsi), %ymm0, %ymm0
vpaddq 32(%rdi), %ymm0, %ymm0
vmovdqu %ymm0, 32(%rdi)
vzeroupper
retq
"""

In other words, clang has failed to propagate constants, and is trying to do 64-bit multiplies (lowered to vpsllq and vpmuludq) at runtime.

Can anyone help me get decent, portable code out of this? GCC performs well on add with #pragma GCC ivdep, but it also does silly things with mac. Is there a way to do this which doesn’t depend on intrinsics or extensions? If I absolutely have to write this with intrinsics or extensions, is there a nice way to do it which doesn’t change the struct definition and doesn’t break strict aliasing?

Thanks a lot,
-- Mike

_______________________________________________
cfe-users mailing list
cfe-users@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-users
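
To make the questions above concrete, here is a rough sketch of three possible workarounds. It has not been checked against clang-3.6, and none of it is confirmed to produce the desired code. The function names add_copy, mac_folded and add_ext and the type name v4u64 are made up for illustration; only a_vector comes from the file above. The first two functions are plain C99 (apart from the aligned attribute already in the struct): they snapshot *b into a local, which cannot alias *a and stays correct when a == b, and mac_folded additionally multiplies the constant table by 3 by hand. The third uses the GCC/Clang vector_size extension and moves data with memcpy, so the struct definition is unchanged and no pointer of a different type is dereferenced.

```c
#include <stdint.h>
#include <string.h>

typedef struct this_should_totally_be_a_vector {
    uint64_t limb[8];
} __attribute__((aligned(32))) a_vector;

/* Hypothetical variant 1: portable loop over a local snapshot of *b.
 * The local cannot overlap *a, and the result is still right if a == b. */
void add_copy(a_vector *a, const a_vector *b) {
    const a_vector tmp = *b;
    for (int i = 0; i < 8; i++)
        a->limb[i] += tmp.limb[i];
}

/* Hypothetical variant 2: same idea for mac, with 3*c folded by hand
 * into a constant table so no multiply is needed at run time. */
void mac_folded(a_vector *a, const a_vector *b) {
    static const a_vector c3 = {{0, 3, 6, 9, 12, 15, 18, 21}};
    const a_vector tmp = *b;
    for (int i = 0; i < 8; i++)
        a->limb[i] += tmp.limb[i] + c3.limb[i];
}

/* Hypothetical variant 3: GCC/Clang vector extension (not ISO C).
 * v4u64 is an assumed name for a 4 x 64-bit vector type. */
typedef uint64_t v4u64 __attribute__((vector_size(32)));

void add_ext(a_vector *a, const a_vector *b) {
    for (int half = 0; half < 2; half++) {
        v4u64 va, vb;
        /* memcpy copies the bytes without type-punned pointer accesses. */
        memcpy(&va, &a->limb[4 * half], sizeof va);
        memcpy(&vb, &b->limb[4 * half], sizeof vb);
        va += vb; /* element-wise 64-bit add */
        memcpy(&a->limb[4 * half], &va, sizeof va);
    }
}
```

Whether any of these yields vmovdqa or 4-way AVX2 code would still have to be checked in the generated assembly; the sketch only shows shapes that keep the struct definition and avoid __restrict__.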