[Bug tree-optimization/14741] graphite with loop blocking and interchanging doesn't optimize a matrix multiplication loop

spop at gcc dot gnu.org Fri, 11 Sep 2015 12:28:37 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741


--- Comment #34 from Sebastian Pop <spop at gcc dot gnu.org> ---
r227567 extends the limits of a scop, and now we can detect a scop in the
MAIN__ function corresponding to the following code:

  A=0.1D0
  B=0.1D0

-fdump-tree-graphite-all shows that the loops have been tiled:

tiled by 51
tiled by 51

ISL AST generated by ISL: 
{
  for (int c1 = 0; c1 <= 1023; c1 += 51)
    for (int c2 = 0; c2 <= 1023; c2 += 51)
      for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1)
        for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1)
          S_4(c3, c4);
  for (int c1 = 0; c1 <= 1023; c1 += 51)
    for (int c2 = 0; c2 <= 1023; c2 += 51)
      for (int c3 = c1; c3 <= min(1023, c1 + 50); c3 += 1)
        for (int c4 = c2; c4 <= min(1023, c2 + 50); c4 += 1)
          S_10(c3, c4);
}
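For readers who want to try the transformation by hand, here is a minimal C sketch of what the first loop nest in the dump does. It assumes that S_4(c3, c4) is a plain store of a constant into a 1024x1024 array of doubles. The array names, the helper functions and the subscript order are hypothetical and are not taken from the Graphite dump or from the Fortran test case. With a tile size of 51 there are 21 tiles per dimension, and the last one covers only 1020..1023.

```c
#include <stddef.h>

#define N 1024   /* matches the 0..1023 bounds in the ISL AST */
#define TILE 51  /* matches "tiled by 51" in the dump */

/* Hypothetical arrays: one filled by the plain nest, one by the tiled nest. */
static double a_plain[N][N];
static double a_tiled[N][N];

static int min_int(int x, int y) { return x < y ? x : y; }

/* Untiled reference: a straightforward two-deep initialization loop. */
void init_plain(double v)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      a_plain[i][j] = v;
}

/* Same loop structure as the first nest in the ISL AST above.
   The store stands in for S_4(c3, c4); that mapping is an assumption. */
void init_tiled(double v)
{
  for (int c1 = 0; c1 <= N - 1; c1 += TILE)
    for (int c2 = 0; c2 <= N - 1; c2 += TILE)
      for (int c3 = c1; c3 <= min_int(N - 1, c1 + TILE - 1); c3 += 1)
        for (int c4 = c2; c4 <= min_int(N - 1, c2 + TILE - 1); c4 += 1)
          a_tiled[c3][c4] = v;
}

/* Number of elements on which the two arrays differ. */
long count_mismatches(void)
{
  long bad = 0;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (a_plain[i][j] != a_tiled[i][j])
        bad++;
  return bad;
}
```

Calling init_plain(0.1) and init_tiled(0.1) and then count_mismatches() checks that the tiled nest writes the same elements as the plain one. It says nothing about which version is faster.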

What puzzles me is why tiling gives better performance for memset-like
loops, as reported:

before:
   17.848000000000003
after:
   15.847999999999999

Btw, what architecture have you used for this experiment?

The same happens on an AArch64 machine, where I was able to reproduce your
results: the loop-blocked initialization of the arrays is consistently faster
by about 10%.

I noticed that on a recent Intel x86_64 machine the first runs show about a
10% speedup with loop blocking, and then the speedup disappears in subsequent
runs (I alternated runs with and without loop blocking 10 times).
