csinc

tnfchris at gcc dot gnu.org via Gcc-bugs Tue, 02 Apr 2024 10:46:28 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682


Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
          Component|middle-end                  |rtl-optimization
   Last reconfirmed|                            |2024-04-02

--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Ok, so for now I think we should separate out the issue of whether/when we
should perform the ifconversion and instead for a bit look at why we miss these
conversions:

The first problem is that GCC does not support if-converting blocks which
terminate into another branch.  In other words, the destination of 1 of the
arms must have a single successor per noce_find_if_block which requires the
final block to be a JOIN block.

In

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);

        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low; // csinc
        high = res < 0 ? i: high; // csel

        if (res == 0) return i;
    }
    return -1;
}

this fails because of the:

        if (res == 0) return i;

which means that the final block doesn't adhere to this.  so the cbranch blocks
the ifconversion.

if we simplify the testcase:

typedef unsigned long size_t;
extern int cmp(size_t *);

size_t bsearch2(size_t high, size_t low, size_t i) {
    while (low < high) {
        int res = cmp(&i);

        low = res > 0 ? i + 1 : low;
        high = res < 0 ? i: high;
    }
    return -1;
}

we now detect the csel but not the csinc. This has two reasons, we visit the BB
in entry->exit order, but we probably should do exit->entry, such that when we
if-convert blocks and delete the branches, the preceeding blocks become valid
to if-convert.

but the main reason the csinc is missed is because of code sinking. The update
of the reference &i is sank into the res > 0 branch.

  <bb 3> [local count: 955630224]:
  # high_18 = PHI <high_14(7), high_7(D)(2)>
  # low_19 = PHI <low_5(7), low_8(D)(2)>
  res_11 = cmp (&i);
  if (res_11 > 0)
    goto <bb 4>; [59.00%]
  else
    goto <bb 5>; [41.00%]

  <bb 4> [local count: 563821836]:
  i.1_1 = i;
  iftmp.0_12 = i.1_1 + 1;
  goto <bb 7>; [100.00%]

This is because the compiler (righfully) decides that on the dereference should
only happen in the branch that needs it.  Because of this block 4 is no longer
if-convertable.  However since the load is to the stack we are allowed to move
it around technically speaking.  But this is only beneficial is the branch is
really unpredictable.

To show that it is the sinking, we can avoid it by using -fdisable-tree-sink2
and this example:

size_t bsearch2(size_t high, size_t low, size_t *iv, int *resv) {
    while (low < high) {
        int res = cmp (iv);
        size_t i = *iv;
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which generates the csinc.

However the following example again doesn't generate the csinc for GCC but does
for LLVM:

size_t bsearch3(size_t high, size_t low, size_t i, int res) {
    while (low < high) {
        asm volatile ("" : "=w"(i));
        asm volatile ("" : "=w"(res));
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which is because of the change header pass.  GCC optimizes this scalar code
from:

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

  <bb 2> [local count: 59055799]:
  goto <bb 5>; [100.00%]

  <bb 3> [local count: 1014686024]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 6>; [94.50%]

  <bb 4> [local count: 55807731]:
  iftmp.0_7 = i_5 + 1;

  <bb 5> [local count: 114863530]:
  # low_1 = PHI <low_2(D)(2), iftmp.0_7(4)>

  <bb 6> [local count: 1073741824]:
  if (low_1 < high_3(D))
    goto <bb 3>; [94.50%]
  else
    goto <bb 7>; [5.50%]

  <bb 7> [local count: 59055800]:
  return 18446744073709551615;

}

into 

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

  <bb 2> [local count: 59055799]:
  if (low_2(D) < high_3(D))
    goto <bb 5>; [48.59%]
  else
    goto <bb 6>; [51.41%]

  <bb 3> [local count: 958878294]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

  <bb 4> [local count: 55807731]:
  # i_10 = PHI <i_5(3), i_8(5)>
  iftmp.0_7 = i_10 + 1;
  if (high_3(D) > iftmp.0_7)
    goto <bb 5>; [48.59%]
  else
    goto <bb 6>; [51.41%]

  <bb 5> [local count: 55807730]:
  __asm__ __volatile__("" : "=w" i_8);
  __asm__ __volatile__("" : "=w" res_9);
  if (res_9 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

  <bb 6> [local count: 59055800]:
  return 18446744073709551615;

}

which then hits the limitation of the single pred I mentioned before.

using -fdisable-tree-ch2 we get:

        add     x2, x2, 1
        csel    x1, x2, x1, gt

shows that oddly we missed the csinc which means noce_try_addcc doesn't trigger
for some reason.

ultimately RTL seems like the wrong place to fix up these issues as it's much
easier to do in GIMPLE where we can more freely modify the CFG.  Though ce1 and
ce2 are still in CFG layout mode so we could..

But I also think gimple is better as it allows us to check the critical path
and the loop carried dependencies by walking the use-def chains cheaply.

Perhaps we can use something like gimple-isel to either do the conversion, or
use another pass to reshape the gimple so CE can find them?

Additionally it would be good to have a way for users to tell us that the
branch is unpredictable and so that the heuristics should be replaxed.  LLVM
has __builtin_unpredictable to help it along.

Perhaps we should do the same? Though there is already
__builtin_expect_with_probability and I suppose a ~50% predictable chance
should be interpreted as unpredictable?

thoughts?

note that these are only issues for loops, outside of loops we detect them
reliably because the transformations above don't happen.:

unsigned long long
test_csinc64_ifcvt(unsigned long long *x0,
                   unsigned long long x1,
                   unsigned long long x2) {
  if (*x0 == x1)
    ++ x2;

  return x2;
}

does the right thing.

[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc

Reply via email to