https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91460
H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #1 from H.J. Lu <hjl.tools at gmail dot com> ---
This testcase
---
int block[9][9][9];

void foo(int row, int k, int h)
{
  /* Variable nrow ranges from 4 to 9.  */
  int nrow = ((row - 1)/3 + 1)*3 + 1;
  for (int i = nrow; i < 9; i++)
    block[k][h][i] = block[k][h][i] - 10;
}
---
Since nrow ranges from 4 to 9, the loop runs at most 5 iterations, so the
256-bit vector loop (8 ints per vector) is never executed. 256-bit
vectorization is therefore equivalent to no vectorization plus additional
branch cost. Even with epilogue vectorization, the 256-bit vector version
still has more overhead. When this is a hot function, 256-bit vectorization
can reduce performance by 6%.
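
The trip-count argument above can be checked with a small sketch. It is only an illustration, not part of the original testcase: the helper names (trip_count, max_trip_count, full_vector_iters) are hypothetical, int is assumed to be 32 bits wide, and nrow is assumed to stay within the 4 to 9 range stated in the comment.

```c
#include <assert.h>

/* Number of int lanes per vector, assuming a 32-bit int. */
enum { LANES_256 = 256 / 32, LANES_128 = 128 / 32 };

/* Iterations of `for (i = nrow; i < 9; i++)` for a given start index. */
int trip_count(int nrow)
{
    return nrow < 9 ? 9 - nrow : 0;
}

/* Largest trip count over the nrow range stated in the comment (4..9). */
int max_trip_count(void)
{
    int max = 0;
    for (int nrow = 4; nrow <= 9; nrow++) {
        int n = trip_count(nrow);
        if (n > max)
            max = n;
    }
    return max;
}

/* Number of full vector iterations for a given trip count and lane count. */
int full_vector_iters(int trips, int lanes)
{
    assert(lanes > 0);
    return trips / lanes;
}
```

Under these assumptions, full_vector_iters(max_trip_count(), LANES_256) is 0, so the 256-bit loop body is never entered, while a 128-bit loop (4 lanes) can run one full iteration when the trip count is 4 or 5.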