https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94406

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Hmm, we're invoking memset from libc which might use a different path on
CXL compared to Zen2?

Note that a vectorized epilogue should in no way cause additional
store-to-load forwarding penalties _but_ it might cause additional
(positive) store-to-load forwardings.

Code-generation wise the loop leaves a lot to be desired, and given
we know the number of iterations is 5, the vectorized epilogue will
never be entered, so its overhead will only hurt.  Maybe CXL
branch prediction behaves better here.

Note there's room for improvement in the way we dispatch to the vectorized
epilogue.  Exiting the main vectorized loop we do

  if (do_we_need_an_epilogue)
    {

then for the vectorized epilogue we do

       if (remaining-niters == 1)
         do scalar epilogue
       else
         do vector epilogue

where the complication is due to the fact that we share the scalar
epilogue loops with the loop used when the runtime cost model check
fails.

Thus the CFG with a vectorized epilogue could be structured better,
reducing the overhead to a single jump-around.
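
A rough C sketch of what a single jump-around layout might look like.
This is only an illustration, not what the vectorizer emits: the
vectorization factors (4 for the main loop, 2 for the epilogue), the
function name add_one and the struct path bookkeeping are all made up
here, and the "vector" loops are modelled as fixed-size scalar chunks.

```c
#include <stddef.h>

/* Hypothetical vectorization factors, chosen only for illustration. */
enum { MAIN_VF = 4, EPI_VF = 2 };

/* Records which loops actually ran, so the dispatch can be observed. */
struct path { int main_vec, vec_epilogue, scalar_epilogue; };

/* Adds 1.0 to each of the n elements of a. */
struct path add_one(double *a, size_t n)
{
    struct path p = {0, 0, 0};
    size_t i = 0;

    /* Main vectorized loop: full chunks of MAIN_VF elements. */
    for (; i + MAIN_VF <= n; i += MAIN_VF) {
        for (size_t k = 0; k < MAIN_VF; k++)
            a[i + k] += 1.0;
        p.main_vec = 1;
    }

    /* The single jump-around: nothing left, skip both epilogues. */
    if (i == n)
        return p;

    /* Vectorized epilogue: at most one chunk of EPI_VF elements. */
    if (n - i >= EPI_VF) {
        for (size_t k = 0; k < EPI_VF; k++)
            a[i + k] += 1.0;
        i += EPI_VF;
        p.vec_epilogue = 1;
    }

    /* Scalar epilogue: whatever is still left over. */
    for (; i < n; i++) {
        a[i] += 1.0;
        p.scalar_epilogue = 1;
    }
    return p;
}
```

With these assumed factors, n == 5 runs the main loop once and then
goes to the scalar epilogue for the last element, so the vector
epilogue is never entered, as in the case discussed above.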

For bwaves the other improvement opportunity is to move the memset out
of the full loop nest rather than just covering the innermost two loops.
That probably improves register allocation.
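
To illustrate the idea in C (bwaves itself is Fortran, and the array
shape and function names below are invented for the example): assuming
the nest zeroes one contiguous array completely, a memset per
innermost-two-loops slab can be replaced by one memset for the whole
array.  This also assumes all-zero bytes represent 0.0, as with IEEE 754
doubles.

```c
#include <string.h>

/* Hypothetical array extents, for illustration only. */
enum { NI = 3, NJ = 4, NK = 5, NL = 5 };

/* memset covers only the two innermost loops: one call per (i, j). */
void zero_inner_two(double a[NI][NJ][NK][NL])
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            memset(a[i][j], 0, sizeof a[i][j]);
}

/* memset moved out of the whole nest: one call for the full array. */
void zero_whole(double a[NI][NJ][NK][NL])
{
    memset(a, 0, sizeof(double) * NI * NJ * NK * NL);
}
```

Both functions leave the array in the same state; the second does it
without any surrounding loop control.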
