[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #10 from Thomas Koenig --- Created attachment 44121 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121=edit Test case for vectorizing, inc. assembly This test case, written by Nicolas König, shows a proof of concept for vectorizing a maxloc loop using AVX2. It lacks loop peeling, so the overhead for real-life cases will be a little higher. Compiled with $ gcc -Ofast -funroll-loops -Wall main-gp.c maxloc_unroll.s maxloc_nounroll.s with a Ryzen 1700, the results are # Ints per cycle # nnormalexpect AVX2 AVX2_unroll 128 0.179272 0.107563 0.537815 0.537815 256 0.396285 0.442907 1.075630 1.254902 512 0.456328 0.654731 1.368984 1.254902 1024 0.501961 0.912656 1.771626 1.434174 2048 0.552617 1.368984 1.434174 1.771626 4096 0.568257 1.338562 1.673203 2.041874 8192 0.529541 1.392724 1.661663 1.958871 16384 0.533056 1.342291 1.204706 1.617055 32768 0.535128 1.478167 1.451453 1.832252 65536 0.435206 1.479301 1.612995 2.077079 131072 0.586499 1.366558 1.603602 2.081565 262144 0.533831 1.366315 1.708424 2.071499 524288 0.582885 1.458868 1.601936 2.053294 1048576 0.580800 1.367224 1.563047 2.060289 2097152 0.567541 1.120453 1.552151 1.420402 4194304 0.27 0.934199 1.001712 1.210831 8388608 0.553388 0.968962 1.211283 1.010533 16777216 0.529886 0.972008 1.143200 1.291568 33554432 0.538509 0.971257 1.303704 1.245031 67108864 0.567889 1.016425 1.320057 1.413633 134217728 0.572604 1.043561 1.338406 1.437528 268435456 0.578321 1.046748 1.420888 1.450485 536870912 0.572145 1.042577 1.445350 1.424409 536870912 0.458679 1.072158 1.451617 1.442926 268435456 0.460019 1.011306 1.380031 1.404121 134217728 0.460833 1.007872 1.318771 1.409553 67108864 0.457754 1.008473 1.281166 1.387048 33554432 0.425367 0.933973 1.378792 1.078957 16777216 0.449178 0.903576 1.371796 1.416321 8388608 0.436478 1.043172 1.298123 1.344617 4194304 0.421925 1.023954 1.214096 1.288334 2097152 0.458274 1.068309 1.060794 1.147894 1048576 0.470394 1.488727 1.247491 1.777959 524288 0.473536 1.493919 2.073448 2.096280 262144 0.476521 1.504707 2.261032 2.067056 131072 0.473536 1.494209 2.189131 2.060427 65536 0.432861 1.477034 2.238710 2.018355 32768 0.468757 1.482715 2.072612 2.024716 16384 0.470129 1.455838 2.068165 1.974928 8192 0.469671 1.417301 2.190374 2.041874 4096 0.474294 1.368984 2.007843 1.974928 2048 0.459811 1.075630 2.007843 2.077079 1024 0.406995 1.254902 1.882353 1.505882 512 0.406995 0.792570 1.673203 0.941176 256 0.327366 0.627451 1.254902 0.752941 128 0.342246 0.941176 0.941176 0.537815 so a) __builtin_expect is a big win b) AVX2 is an even bigger win c) Unrolling the AVX2 loop is a win for intermediate sizes, at large sizes (outside of any cache) Dunno how realistic it is to emit this kind of code for the general case, or what we could do in the Fortran front end to encourage vectorization like that. Use vector data types and do the loop peeling by hand, I presume.
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #9 from Thomas Koenig --- (In reply to Martin Liška from comment #8) > (In reply to Richard Biener from comment #7) > > Confirmed with a Haswell CPU as well. Without the __builtin_expect we > > rightfully predict the branch to be 50%/50% which means BB re-ordering will > > do either nothing to pre-existing order or apply some other magic. CFG > > construction makes the > > flow exactly as visible in the source. > > > > So not sure what you are asking here, but annotating the libgfortran > > routines > > or inline expansion from the FE with __builtin_expect is probably a good > > idea. First, I think that -funroll-loop is such a big win that it would be good to have it enabled for such a loop with -Ofast. Second, vectorization would be nice, but quite possibly unrealistic. I'll add a test case momentarliy for AVX2, which also clearly shows a benefit. > If the code is emitted in Fortran FE, that it's similar to specific > predictors: > grep for 'PRED_FORTRAN_'. These are predictors emitted by the FE and can > have specific probability based on SPEC benchmarks. > > Can you Thomas point me to code that emits the maxloc/minloc? I already added __builtin_expect to the minloc/maxloc routines, in gfc_conv_intrinsic_minmaxloc. Maybe the likelyhood of missing the maximum case was too high. > > > > At least I can't really see how to easily derive a new predictor that would > > match > > this case... > > Agree. Running through an array and finding a min/max? Hmm, a pity...
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #8 from Martin Liška --- (In reply to Richard Biener from comment #7) > Confirmed with a Haswell CPU as well. Without the __builtin_expect we > rightfully predict the branch to be 50%/50% which means BB re-ordering will > do either nothing to pre-existing order or apply some other magic. CFG > construction makes the > flow exactly as visible in the source. > > So not sure what you are asking here, but annotating the libgfortran routines > or inline expansion from the FE with __builtin_expect is probably a good > idea. If the code is emitted in Fortran FE, that it's similar to specific predictors: grep for 'PRED_FORTRAN_'. These are predictors emitted by the FE and can have specific probability based on SPEC benchmarks. Can you Thomas point me to code that emits the maxloc/minloc? > > At least I can't really see how to easily derive a new predictor that would > match > this case... Agree.
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 Richard Biener changed: What|Removed |Added Keywords||missed-optimization Target||x86_64-*-*, i?86-*-* Status|UNCONFIRMED |NEW Last reconfirmed||2018-05-11 CC||hubicka at gcc dot gnu.org, ||marxin at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #7 from Richard Biener --- Confirmed with a Haswell CPU as well. Without the __builtin_expect we rightfully predict the branch to be 50%/50% which means BB re-ordering will do either nothing to pre-existing order or apply some other magic. CFG construction makes the flow exactly as visible in the source. So not sure what you are asking here, but annotating the libgfortran routines or inline expansion from the FE with __builtin_expect is probably a good idea. At least I can't really see how to easily derive a new predictor that would match this case...
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #6 from Andrew Pinski --- (In reply to Thomas Koenig from comment #5) > (In reply to Andrew Pinski from comment #1) > > These functions are not functional equivalent. > > > > In the b.c, it records the max location but it is the last element which > > contains that value. While in c.c, the first element which contains the > > value is recorded. > > I do not understand. > > The only difference between the files is > > $ diff -u -b b.c c.c > --- b.c 2018-05-10 22:22:06.276904322 +0200 > +++ c.c 2018-05-10 22:21:50.028801008 +0200 > @@ -14,7 +14,7 @@ >nm = -1; >for (i=0; i{ > - if (__builtin_expect (a[i] > m, 0)) > + if (a[i] > m) > { > m = a[i]; > nm = i; > > Does this change the semantics of the comparison? If it does, it > would be a bug in __builtin_expect. The documentation is hard to understand but you are correct that it just says we are expecting the value of the comparison to be false. Anyways the problem here is a bit more complex than just marking the edge as expect. For aarch64 we produce csel (which is basically cmov) for both loops: .L11: ldr w4, [x5, x2, lsl 2] cmp w3, w4 cselw0, w0, w2, ge add x2, x2, 1 cselw3, w3, w4, ge cmp w1, w2 bgt .L11
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #5 from Thomas Koenig --- (In reply to Andrew Pinski from comment #1) > These functions are not functional equivalent. > > In the b.c, it records the max location but it is the last element which > contains that value. While in c.c, the first element which contains the > value is recorded. I do not understand. The only difference between the files is $ diff -u -b b.c c.c --- b.c 2018-05-10 22:22:06.276904322 +0200 +++ c.c 2018-05-10 22:21:50.028801008 +0200 @@ -14,7 +14,7 @@ nm = -1; for (i=0; im, 0)) + if (a[i] > m) { m = a[i]; nm = i; Does this change the semantics of the comparison? If it does, it would be a bug in __builtin_expect.
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #4 from Andrew Pinski --- (a a) != 0
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #3 from Dominique d'Humieres --- > These functions are not functional equivalent. > > In the b.c, it records the max location but it is the last element which > contains that value. While in c.c, the first element which contains > the value is recorded. > (In reply to Andrew Pinski from comment #1) > > These functions are not functional equivalent. > > To get them equivalent, you either need to use >= or <=. Could you please elaborate? The results seem to depend on the processor. On my Core i7 I get around 43000 cycles for both tests and -O2 or above, except -Ofast -funroll-loops which gives around 26000 cycles for the first test and 58000 cycles for the second one.
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #2 from Andrew Pinski --- (In reply to Andrew Pinski from comment #1) > These functions are not functional equivalent. To get them equivalent, you either need to use >= or <=.
[Bug middle-end/85740] Non-optimal determining maximum in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 --- Comment #1 from Andrew Pinski --- These functions are not functional equivalent. In the b.c, it records the max location but it is the last element which contains that value. While in c.c, the first element which contains the value is recorded.