[Bug tree-optimization/100076] eembc/automotive/basefp01 has 30.3% regression compare -O2 -ftree-vectorize with -O2 on CLX/Znver3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100076

--- Comment #6 from Hongtao.liu ---
(In reply to Richard Biener from comment #5)
> Note even when avoiding the STLF hit the vectorized version is slower.
> You can use -mtune-ctrl=^sse_unaligned_load_optimal to force loading
> the lower/upper half of vectors separately.

This leads to extra instructions (two extra loads), and if the vectorizer knew that, it would find that the cost of vectorization is larger than the scalar cost.

> The reason is that without -ffast-math we are using an in-order reduction
> which doesn't save us much but instead just combines dependence chains
> here. We do have a related bug for this somewhere.
>
> With -ffast-math the version with/without
> -mtune-ctrl=^sse_unaligned_load_optimal
> is about the same speed, so STLF is a red herring here (on Zen2).
>
> Still not vectorizing is a lot faster.

Yes. As far as vectorization is concerned, it does not improve performance here (compare -O2 -funroll-loops vs. -O2 -ftree-vectorize -funroll-loops), so I'm wondering if we can adjust the heuristic or cost model so that the loop is not vectorized.

> Can you check if -mtune-ctrl=^sse_unaligned_load_optimal helps on CLX?

It doesn't help.
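What -mtune-ctrl=^sse_unaligned_load_optimal asks for can be sketched at the intrinsics level: instead of one unaligned 16-byte load, the two 8-byte halves are loaded separately. The pair of loads in the split form is exactly the "extra 2 loads" the comment says the vectorizer's cost model does not account for. The helper names below are illustrative, not GCC internals:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One unaligned 16-byte load (movupd): what the vectorizer emits
   when unaligned SSE loads are considered optimal for the tuning. */
static inline __m128d load_whole(const double *p)
{
    return _mm_loadu_pd(p);
}

/* The split form the tuning knob forces: load the low and high
   8-byte halves separately (movsd + movhpd).  Same value, but two
   loads instead of one; that extra cost is invisible to the
   vectorizer's cost model in the scenario discussed above. */
static inline __m128d load_split(const double *p)
{
    __m128d lo = _mm_load_sd(p);    /* low half  */
    return _mm_loadh_pd(lo, p + 1); /* high half */
}
```

Both helpers return the same vector; the difference is purely in instruction count and latency, which is what matters when weighing vectorized against scalar code.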
Richard Biener changed:

           What           |Removed      |Added
 ---------------------------------------------
 Ever confirmed           |0            |1
 Last reconfirmed         |             |2021-04-15
 Status                   |UNCONFIRMED  |NEW

--- Comment #5 from Richard Biener ---
Note that even when avoiding the STLF hit, the vectorized version is slower. You can use -mtune-ctrl=^sse_unaligned_load_optimal to force loading the lower and upper halves of vectors separately.

The reason is that without -ffast-math we are using an in-order reduction, which doesn't save us much but instead just combines dependence chains here. We do have a related bug for this somewhere.

With -ffast-math the versions with and without -mtune-ctrl=^sse_unaligned_load_optimal are about the same speed, so STLF is a red herring here (on Zen2). Still, not vectorizing is a lot faster.

Can you check whether -mtune-ctrl=^sse_unaligned_load_optimal helps on CLX?
--- Comment #4 from Hongtao.liu ---
Created attachment 50590
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50590&action=edit
eembc_automotive_basefp01.cpp
Richard Biener changed:

           What   |Removed |Added
 --------------------------------
 Target           |        |x86_64-*-*
 CC               |        |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
See also PR90579.

I wonder if there's a way to tell the CPU not to forward a load. Does emitting an lfence in between the scalar store and the vector load fix the issue? ISTR that the "bad" effect is not so much the delay between flushing the store buffers to L1 and then loading from L1, but rather that the CPU speculates there is no conflicting [non-forwardable] store in the store buffer, fetches a wrong value from L1, and then has to flush and restart the pipeline once the conflict is discovered late.

Otherwise these kinds of issues are really hard to address. For doubles and SSE vectorization we might simply perform all loads as scalars, but that doesn't scale for larger VFs. It might eventually be enough to force-peel a single iteration of all loops, at the cost of code size (and of performance when there is no STLF issue). That said, CPU design folks should try to address this by making the penalty smaller ;)

Can you share a runtime testcase?
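The hazard under discussion can be shown as a minimal sketch (the function is hypothetical; the benchmark's real code is in the attachment): scalar stores immediately followed by a wider vector load that overlaps both. The load needs bytes from two separate store-buffer entries, so store-to-load forwarding fails:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical reduction of the STLF hazard: two 8-byte scalar
   stores followed at once by a 16-byte vector load spanning both.
   Neither store-buffer entry can forward the full 16 bytes, so the
   CPU must wait for the stores to reach L1, or, in the speculation
   scenario described above, replay after detecting the conflict. */
static __m128d store_then_vector_load(double *p, double x, double y)
{
    p[0] = x;               /* scalar store, low 8 bytes  */
    p[1] = y;               /* scalar store, high 8 bytes */
    return _mm_loadu_pd(p); /* vector load overlapping both stores */
}
```

Correctness is unaffected; only latency suffers, which is why the regression shows up as a slowdown rather than a miscompile.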
--- Comment #2 from Hongtao.liu ---
(In reply to H.J. Lu from comment #1)
> Is -O3 slower than -O3 -fno-tree-vectorize? If not, why?

For this case -O3 is OK, because -O3 enables pass_cunroll to completely unroll loop1/loop2/loop3, and later pass_fre eliminates the redundant loads of polyX1 in loop2 and loop3 for both -O3 and -O3 -fno-tree-vectorize.
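The shape being described can be sketched as follows. The loop bodies and the coef[] array (standing in for polyX1) are hypothetical stand-ins, since the real source is in the attached eembc_automotive_basefp01.cpp; the point is only that with a small constant trip count, pass_cunroll flattens all three loops into straight-line code, after which pass_fre can reuse the values loaded in loop1 instead of reloading them in loop2 and loop3:

```c
#include <stddef.h>

#define N 4  /* small constant trip count, so -O3 can fully unroll */

/* Hypothetical stand-in for the pattern discussed above: three
   short loops that each traverse the same coefficient array.
   Before unrolling, loop2 and loop3 re-load coef[i] from memory;
   after pass_cunroll removes the loops, pass_fre sees those loads
   are redundant with loop1's and eliminates them. */
double three_loops(const double coef[N], double x)
{
    double s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < N; i++) s1 += coef[i] * x;      /* loop1 */
    for (size_t i = 0; i < N; i++) s2 += coef[i] * x * x;  /* loop2 */
    for (size_t i = 0; i < N; i++) s3 += coef[i];          /* loop3 */
    return s1 + s2 + s3;
}
```

Because the redundancy elimination happens after unrolling rather than after vectorization, it fires for both -O3 and -O3 -fno-tree-vectorize, which is why the two are comparable at -O3.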
--- Comment #1 from H.J. Lu ---
Is -O3 slower than -O3 -fno-tree-vectorize? If not, why?