https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83202
--- Comment #1 from Daniel Fruzynski <bugzi...@poradnik-webmastera.com> ---
This was compiled with -O3 -mavx -ftree-vectorize.

After sending this I noticed that I wrote the inner loop incorrectly; I meant
the one below. Anyway, it is also not optimized:

for (int j = 0; j < i; j+=4)

I also checked code which could be optimized using operations on YMM registers:

void test(double data[8][8])
{
    for (int i = 0; i < 8; i++)
    {
        for (int j = 0; j < i; j+=4)
        {
            data[i][j] *= data[i][j];
            data[i][j+1] *= data[i][j+1];
            data[i][j+2] *= data[i][j+2];
            data[i][j+3] *= data[i][j+3];
        }
    }
}

gcc output is, hmm, interesting:

test(double (*) [8]):
        vmovupd xmm0, XMMWORD PTR [rdi+64]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+80], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+64], xmm0
        vextractf128 XMMWORD PTR [rdi+80], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+128]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+144], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+128], xmm0
        vextractf128 XMMWORD PTR [rdi+144], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+192]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+208], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+192], xmm0
        vextractf128 XMMWORD PTR [rdi+208], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+256]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+272], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+256], xmm0
        vextractf128 XMMWORD PTR [rdi+272], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+320]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+336], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+320], xmm0
        vextractf128 XMMWORD PTR [rdi+336], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+352]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+368], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+352], xmm0
        vextractf128 XMMWORD PTR [rdi+368], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+384]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+400], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+384], xmm0
        vextractf128 XMMWORD PTR [rdi+400], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+416]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+432], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+416], xmm0
        vextractf128 XMMWORD PTR [rdi+432], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+448]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+464], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+448], xmm0
        vextractf128 XMMWORD PTR [rdi+464], ymm0, 0x1
        vmovupd xmm0, XMMWORD PTR [rdi+480]
        vinsertf128 ymm0, ymm0, XMMWORD PTR [rdi+496], 0x1
        vmulpd ymm0, ymm0, ymm0
        vmovups XMMWORD PTR [rdi+480], xmm0
        vextractf128 XMMWORD PTR [rdi+496], ymm0, 0x1
        vzeroupper
        ret
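
For reading the listing, it may help to see which 32-byte blocks the loop nest touches. Below is a small sketch, assuming sizeof(double) == 8 as on x86-64. The helper names block_offset and touched_blocks are made up for illustration and do not come from the test case or from gcc.

```c
#include <stddef.h>

/* Byte offset of data[i][j] within a double[8][8] array.
   Assumption: sizeof(double) == 8, as on x86-64. */
static size_t block_offset(int i, int j)
{
    return ((size_t)i * 8 + (size_t)j) * sizeof(double);
}

/* Walks the same loop nest as test() and records the byte offset of
   each 4-double block it squares, in iteration order. Writes at most
   `capacity` entries to out[] and returns the total number of blocks. */
static int touched_blocks(size_t out[], int capacity)
{
    int n = 0;
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < i; j += 4) {
            if (n < capacity)
                out[n] = block_offset(i, j);
            n++;
        }
    }
    return n;
}
```

Under that assumption, calling touched_blocks with a buffer of 16 entries reports 10 blocks at offsets 64, 128, 192, 256, 320, 352, 384, 416, 448 and 480. These match the ten load/multiply/store groups in the gcc output, where each group handles one block as two 16-byte halves (offset and offset+16) instead of a single 32-byte YMM load and store.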