[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-12 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #10 from Thomas Koenig  ---
Created attachment 44121
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121=edit
Test case for vectorizing, inc. assembly

This test case, written by Nicolas König, shows a proof of concept for
vectorizing a maxloc loop using AVX2. It lacks loop peeling, so the
overhead for real-life cases will be a little higher.

Compiled with

$ gcc -Ofast -funroll-loops -Wall main-gp.c maxloc_unroll.s maxloc_nounroll.s

with a Ryzen 1700, the results are

# Ints per cycle
#   nnormalexpect  AVX2   AVX2_unroll
  128  0.179272  0.107563  0.537815  0.537815
  256  0.396285  0.442907  1.075630  1.254902
  512  0.456328  0.654731  1.368984  1.254902
 1024  0.501961  0.912656  1.771626  1.434174
 2048  0.552617  1.368984  1.434174  1.771626
 4096  0.568257  1.338562  1.673203  2.041874
 8192  0.529541  1.392724  1.661663  1.958871
16384  0.533056  1.342291  1.204706  1.617055
32768  0.535128  1.478167  1.451453  1.832252
65536  0.435206  1.479301  1.612995  2.077079
   131072  0.586499  1.366558  1.603602  2.081565
   262144  0.533831  1.366315  1.708424  2.071499
   524288  0.582885  1.458868  1.601936  2.053294
  1048576  0.580800  1.367224  1.563047  2.060289
  2097152  0.567541  1.120453  1.552151  1.420402
  4194304  0.27  0.934199  1.001712  1.210831
  8388608  0.553388  0.968962  1.211283  1.010533
 16777216  0.529886  0.972008  1.143200  1.291568
 33554432  0.538509  0.971257  1.303704  1.245031
 67108864  0.567889  1.016425  1.320057  1.413633
134217728  0.572604  1.043561  1.338406  1.437528
268435456  0.578321  1.046748  1.420888  1.450485
536870912  0.572145  1.042577  1.445350  1.424409

536870912  0.458679  1.072158  1.451617  1.442926
268435456  0.460019  1.011306  1.380031  1.404121
134217728  0.460833  1.007872  1.318771  1.409553
 67108864  0.457754  1.008473  1.281166  1.387048
 33554432  0.425367  0.933973  1.378792  1.078957
 16777216  0.449178  0.903576  1.371796  1.416321
  8388608  0.436478  1.043172  1.298123  1.344617
  4194304  0.421925  1.023954  1.214096  1.288334
  2097152  0.458274  1.068309  1.060794  1.147894
  1048576  0.470394  1.488727  1.247491  1.777959
   524288  0.473536  1.493919  2.073448  2.096280
   262144  0.476521  1.504707  2.261032  2.067056
   131072  0.473536  1.494209  2.189131  2.060427
65536  0.432861  1.477034  2.238710  2.018355
32768  0.468757  1.482715  2.072612  2.024716
16384  0.470129  1.455838  2.068165  1.974928
 8192  0.469671  1.417301  2.190374  2.041874
 4096  0.474294  1.368984  2.007843  1.974928
 2048  0.459811  1.075630  2.007843  2.077079
 1024  0.406995  1.254902  1.882353  1.505882
  512  0.406995  0.792570  1.673203  0.941176
  256  0.327366  0.627451  1.254902  0.752941
  128  0.342246  0.941176  0.941176  0.537815

so

a) __builtin_expect is a big win

b) AVX2 is an even bigger win

c) Unrolling the AVX2 loop is a win for intermediate sizes, at large sizes
   (outside of any cache)

Dunno how realistic it is to emit this kind of code for the general case,
or what we could do in the Fortran front end to encourage vectorization
like that. Use vector data types and do the loop peeling by hand, I presume.

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-12 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #9 from Thomas Koenig  ---
(In reply to Martin Liška from comment #8)
> (In reply to Richard Biener from comment #7)
> > Confirmed with a Haswell CPU as well.  Without the __builtin_expect we
> > rightfully predict the branch to be 50%/50% which means BB re-ordering will
> > do either nothing to pre-existing order or apply some other magic.  CFG
> > construction makes the
> > flow exactly as visible in the source.
> > 
> > So not sure what you are asking here, but annotating the libgfortran 
> > routines
> > or inline expansion from the FE with __builtin_expect is probably a good
> > idea.

First, I think that -funroll-loop is such a big win that it
would be good to have it enabled for such a loop with -Ofast.

Second, vectorization would be nice, but quite possibly unrealistic.

I'll add a test case momentarliy for AVX2, which also clearly shows
a benefit.

> If the code is emitted in Fortran FE, that it's similar to specific
> predictors:
> grep for 'PRED_FORTRAN_'. These are predictors emitted by the FE and can
> have specific probability based on SPEC benchmarks. 
> 
> Can you Thomas point me to code that emits the maxloc/minloc?

I already added __builtin_expect to the minloc/maxloc routines,
in  gfc_conv_intrinsic_minmaxloc.  Maybe the likelyhood of missing
the maximum case was too high.

> > 
> > At least I can't really see how to easily derive a new predictor that would
> > match
> > this case...
> 
> Agree.

Running through an array and finding a min/max?  Hmm, a pity...

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-11 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #8 from Martin Liška  ---
(In reply to Richard Biener from comment #7)
> Confirmed with a Haswell CPU as well.  Without the __builtin_expect we
> rightfully predict the branch to be 50%/50% which means BB re-ordering will
> do either nothing to pre-existing order or apply some other magic.  CFG
> construction makes the
> flow exactly as visible in the source.
> 
> So not sure what you are asking here, but annotating the libgfortran routines
> or inline expansion from the FE with __builtin_expect is probably a good
> idea.

If the code is emitted in Fortran FE, that it's similar to specific predictors:
grep for 'PRED_FORTRAN_'. These are predictors emitted by the FE and can have
specific probability based on SPEC benchmarks. 

Can you Thomas point me to code that emits the maxloc/minloc?

> 
> At least I can't really see how to easily derive a new predictor that would
> match
> this case...

Agree.

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-*-*, i?86-*-*
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2018-05-11
 CC||hubicka at gcc dot gnu.org,
   ||marxin at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #7 from Richard Biener  ---
Confirmed with a Haswell CPU as well.  Without the __builtin_expect we
rightfully predict the branch to be 50%/50% which means BB re-ordering will do
either nothing to pre-existing order or apply some other magic.  CFG
construction makes the
flow exactly as visible in the source.

So not sure what you are asking here, but annotating the libgfortran routines
or inline expansion from the FE with __builtin_expect is probably a good idea.

At least I can't really see how to easily derive a new predictor that would
match
this case...

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-11 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #6 from Andrew Pinski  ---
(In reply to Thomas Koenig from comment #5)
> (In reply to Andrew Pinski from comment #1)
> > These functions are not functional equivalent.
> > 
> > In the b.c, it records the max location but it is the last element which
> > contains that value.  While in c.c, the first element which contains the
> > value is recorded.
> 
> I do not understand.
> 
> The only difference between the files is
> 
> $ diff -u -b b.c c.c
> --- b.c 2018-05-10 22:22:06.276904322 +0200
> +++ c.c 2018-05-10 22:21:50.028801008 +0200
> @@ -14,7 +14,7 @@
>nm = -1;
>for (i=0; i  {
> -  if (__builtin_expect (a[i] > m, 0))
> +  if (a[i] > m)
> {
>   m = a[i];
>   nm = i;
> 
> Does this change the semantics of the comparison? If it does, it
> would be a bug in __builtin_expect.

The documentation is hard to understand but you are correct that it just says
we are expecting the value of the comparison to be false.

Anyways the problem here is a bit more complex than just marking the edge as
expect.

For aarch64 we produce csel (which is basically cmov) for both loops:
.L11:
ldr w4, [x5, x2, lsl 2]
cmp w3, w4
cselw0, w0, w2, ge
add x2, x2, 1
cselw3, w3, w4, ge
cmp w1, w2
bgt .L11

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-11 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #5 from Thomas Koenig  ---
(In reply to Andrew Pinski from comment #1)
> These functions are not functional equivalent.
> 
> In the b.c, it records the max location but it is the last element which
> contains that value.  While in c.c, the first element which contains the
> value is recorded.

I do not understand.

The only difference between the files is

$ diff -u -b b.c c.c
--- b.c 2018-05-10 22:22:06.276904322 +0200
+++ c.c 2018-05-10 22:21:50.028801008 +0200
@@ -14,7 +14,7 @@
   nm = -1;
   for (i=0; i m, 0))
+  if (a[i] > m)
{
  m = a[i];
  nm = i;

Does this change the semantics of the comparison? If it does, it
would be a bug in __builtin_expect.

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-10 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #4 from Andrew Pinski  ---
(a a) != 0

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-10 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #3 from Dominique d'Humieres  ---
> These functions are not functional equivalent.
>
> In the b.c, it records the max location but it is the last element which
> contains that value.  While in c.c, the first element which contains
> the value is recorded.
> (In reply to Andrew Pinski from comment #1)
> > These functions are not functional equivalent.
>
> To get them equivalent, you either need to use >= or <=.

Could you please elaborate?

The results seem to depend on the processor. On my Core i7 I get around 
43000 cycles for both tests and -O2 or above, except -Ofast -funroll-loops
which gives around 26000 cycles for the first test and 58000 cycles for
the second one.

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-10 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #2 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #1)
> These functions are not functional equivalent.

To get them equivalent, you either need to use >= or <=.

[Bug middle-end/85740] Non-optimal determining maximum in a loop

2018-05-10 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740

--- Comment #1 from Andrew Pinski  ---
These functions are not functional equivalent.

In the b.c, it records the max location but it is the last element which
contains that value.  While in c.c, the first element which contains the value
is recorded.