...only without the typo in the ChangeLog below... > On May 3, 2017, at 2:43 PM, Bill Schmidt <wschm...@linux.vnet.ibm.com> wrote: > > Hi, > > We recently became aware of some poor code generation as a result of > unprofitable (for POWER) loop vectorization. When a loop is simply copying > data with 64-bit loads and stores, vectorizing with 128-bit loads and stores > generally does not provide any benefit on modern POWER processors. > Furthermore, if there is a requirement to version the loop for aliasing, > alignment, etc., the cost of the versioning test is almost certainly a > performance loss for such loops. The user code example included such a copy > loop, executed only a few times on average, within an outer loop that was > executed many times on average, causing a tremendous slowdown. > > This patch very specifically targets these kinds of loops and no others, > and artificially inflates the vectorization cost to ensure vectorization > does not appear profitable. This is done within the target model cost > hooks to avoid affecting other targets. A new test case is included that > demonstrates the refusal to vectorize. > > We've done SPEC performance testing to verify that the patch does not > degrade such workloads. Results were all in the noise range. The > customer code performance loss was verified to have been reversed. > > Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions. > Is this ok for trunk? > > Thanks, > Bill > > > [gcc] > > 2017-05-03 Bill Schmidt <wschm...@linux.vnet.ibm.com> > > * config/rs6000/rs6000.c (rs6000_vect_nonmem): New static var. > (rs6000_init_cost): Initialize rs6000_vect_nonmem. > (rs6000_add_stmt_cost): Update rs6000_vect_nonmem. > (rs6000_finish_cost): Avoid vectorizing simple copy loops with > VF=2 that require versioning. > > [gcc/testsuite] > > 2017-05-03 Bill Schmidt <wschm...@linux.vnet.ibm.com> > > * gcc.target/powerpc/veresioned-copy-loop.c: New file.
^^ fixed to "versioned". Bill > > > Index: gcc/config/rs6000/rs6000.c > =================================================================== > --- gcc/config/rs6000/rs6000.c (revision 247560) > +++ gcc/config/rs6000/rs6000.c (working copy) > @@ -5873,6 +5873,8 @@ rs6000_density_test (rs6000_cost_data *data) > > /* Implement targetm.vectorize.init_cost. */ > > +static bool rs6000_vect_nonmem; > + > static void * > rs6000_init_cost (struct loop *loop_info) > { > @@ -5881,6 +5883,7 @@ rs6000_init_cost (struct loop *loop_info) > data->cost[vect_prologue] = 0; > data->cost[vect_body] = 0; > data->cost[vect_epilogue] = 0; > + rs6000_vect_nonmem = false; > return data; > } > > @@ -5907,6 +5910,19 @@ rs6000_add_stmt_cost (void *data, int count, enum > > retval = (unsigned) (count * stmt_cost); > cost_data->cost[where] += retval; > + > + /* Check whether we're doing something other than just a copy loop. > + Not all such loops may be profitably vectorized; see > + rs6000_finish_cost. */ > + if ((where == vect_body > + && (kind == vector_stmt || kind == vec_to_scalar || kind == vec_perm > + || kind == vec_promote_demote || kind == vec_construct > + || kind == scalar_to_vec)) > + || (where != vect_body > + && (kind == vec_to_scalar || kind == vec_perm > + || kind == vec_promote_demote || kind == vec_construct > + || kind == scalar_to_vec))) > + rs6000_vect_nonmem = true; > } > > return retval; > @@ -5923,6 +5939,19 @@ rs6000_finish_cost (void *data, unsigned *prologue > if (cost_data->loop_info) > rs6000_density_test (cost_data); > > + /* Don't vectorize minimum-vectorization-factor, simple copy loops > + that require versioning for any reason. The vectorization is at > + best a wash inside the loop, and the versioning checks make > + profitability highly unlikely and potentially quite harmful. */ > + if (cost_data->loop_info) > + { > + loop_vec_info vec_info = loop_vec_info_for_loop (cost_data->loop_info); > + if (!rs6000_vect_nonmem > + && LOOP_VINFO_VECT_FACTOR (vec_info) == 2 > + && LOOP_REQUIRES_VERSIONING (vec_info)) > + cost_data->cost[vect_body] += 10000; > + } > + > *prologue_cost = cost_data->cost[vect_prologue]; > *body_cost = cost_data->cost[vect_body]; > *epilogue_cost = cost_data->cost[vect_epilogue]; > Index: gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c > =================================================================== > --- gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c (nonexistent) > +++ gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c (working copy) > @@ -0,0 +1,30 @@ > +/* { dg-do compile } */ > +/* { dg-require-effective-target powerpc_p8vector_ok } */ > +/* { dg-options "-O3 -fdump-tree-vect-details" } */ > + > +/* Verify that a pure copy loop with a vectorization factor of two > + that requires alignment will not be vectorized. See the cost > + model hooks in rs6000.c. */ > + > +typedef long unsigned int size_t; > +typedef unsigned char uint8_t; > + > +extern void *memcpy (void *__restrict __dest, const void *__restrict __src, > + size_t __n) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ > ((__nonnull__ (1, 2))); > + > +void foo (void *dstPtr, const void *srcPtr, void *dstEnd) > +{ > + uint8_t *d = (uint8_t*)dstPtr; > + const uint8_t *s = (const uint8_t*)srcPtr; > + uint8_t* const e = (uint8_t*)dstEnd; > + > + do > + { > + memcpy (d, s, 8); > + d += 8; > + s += 8; > + } > + while (d < e); > +} > + > +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */ >