https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91460

H.J. Lu <hjl.tools at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com

--- Comment #1 from H.J. Lu <hjl.tools at gmail dot com> ---
Consider this testcase:

---
int block[9][9][9];
void foo(int row, int k, int h)
{
  /* Variable nrow ranges from 4 to 9.  */
  int nrow = ((row - 1)/3 + 1)*3 + 1;

  for (int i = nrow; i < 9; i++)
    block[k][h][i] = block[k][h][i] - 10;
}
---

Since nrow ranges from 4 to 9, the loop runs for at most 5 iterations, so
the 256-bit vector loop, which processes 8 int elements at a time, is never
executed.  256-bit vectorization is therefore equivalent to no vectorization
plus additional branch cost.  Even with epilogue vectorization, the 256-bit
vector version still has more overhead.  When this is a hot function,
256-bit vectorization can reduce performance by 6%.
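
The trip-count argument can be checked with a small sketch.  It is only an
illustration, not part of the testcase: the helper names start_index,
trip_count and max_trip_count are hypothetical, and the input range
row = 1..9 is an assumption (the testcase itself does not constrain row).

```c
#include <stdio.h>

/* Start index, computed exactly as nrow is in the testcase.  */
static int start_index(int row)
{
  return ((row - 1) / 3 + 1) * 3 + 1;
}

/* Number of iterations of `for (int i = nrow; i < 9; i++)`.  */
static int trip_count(int row)
{
  int nrow = start_index(row);
  return nrow < 9 ? 9 - nrow : 0;
}

/* Largest trip count over the ASSUMED input range row = 1..9.
   Prints one line per row so the individual counts can be inspected.  */
static int max_trip_count(void)
{
  int max = 0;
  for (int row = 1; row <= 9; row++) {
    int n = trip_count(row);
    printf("row=%d nrow=%d trips=%d\n", row, start_index(row), n);
    if (n > max)
      max = n;
  }
  return max;
}
```

Calling max_trip_count() from a main function should report a maximum
below 8, the number of int elements in one 256-bit vector, which is why
the 256-bit loop body would never run under this assumption.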
