[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc

2024-04-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #9 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #8)
> This might be the path splitting running on the gimple level causing issues
> too; see PR 112402 .

Ah that's a good shout.  It looks like Richi already agrees that we should
recognize/do some ifcvt at GIMPLE.

Guess that just leaves the where.

[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc

2024-04-02 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Andrew Pinski  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=112402

--- Comment #8 from Andrew Pinski  ---
This might be the path splitting running on the gimple level causing issues
too; see PR 112402 .

[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc

2024-04-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Tamar Christina  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
  Component|middle-end  |rtl-optimization
   Last reconfirmed||2024-04-02

--- Comment #7 from Tamar Christina  ---
Ok, so for now I think we should separate out the issue of whether/when we
should perform the ifconversion and instead for a bit look at why we miss these
conversions:

The first problem is that GCC does not support if-converting blocks which
terminate into another branch.  In other words, the destination of 1 of the
arms must have a single successor per noce_find_if_block which requires the
final block to be a JOIN block.

In

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
auto pos = tree;
size_t low = 0;
size_t high = treeSize;
while (low < high) {
size_t i = (low + high) / 2;
int res = cmp(user, tree[i]);

// These should be cmp + csinc + csel on arm
// and lea + test + cmov + cmov on x86.
low = res > 0 ? i + 1 : low; // csinc
high = res < 0 ? i: high; // csel

if (res == 0) return i;
}
return -1;
}

this fails because of the:

if (res == 0) return i;

which means that the final block doesn't adhere to this.  so the cbranch blocks
the ifconversion.

if we simplify the testcase:

typedef unsigned long size_t;
extern int cmp(size_t *);

size_t bsearch2(size_t high, size_t low, size_t i) {
while (low < high) {
int res = cmp();

low = res > 0 ? i + 1 : low;
high = res < 0 ? i: high;
}
return -1;
}

we now detect the csel but not the csinc. This has two reasons, we visit the BB
in entry->exit order, but we probably should do exit->entry, such that when we
if-convert blocks and delete the branches, the preceeding blocks become valid
to if-convert.

but the main reason the csinc is missed is because of code sinking. The update
of the reference  is sank into the res > 0 branch.

   [local count: 955630224]:
  # high_18 = PHI 
  # low_19 = PHI 
  res_11 = cmp ();
  if (res_11 > 0)
goto ; [59.00%]
  else
goto ; [41.00%]

   [local count: 563821836]:
  i.1_1 = i;
  iftmp.0_12 = i.1_1 + 1;
  goto ; [100.00%]

This is because the compiler (righfully) decides that on the dereference should
only happen in the branch that needs it.  Because of this block 4 is no longer
if-convertable.  However since the load is to the stack we are allowed to move
it around technically speaking.  But this is only beneficial is the branch is
really unpredictable.

To show that it is the sinking, we can avoid it by using -fdisable-tree-sink2
and this example:

size_t bsearch2(size_t high, size_t low, size_t *iv, int *resv) {
while (low < high) {
int res = cmp (iv);
size_t i = *iv;
low = res > 0 ? i + 1 : low;
}
return -1;
}

which generates the csinc.

However the following example again doesn't generate the csinc for GCC but does
for LLVM:

size_t bsearch3(size_t high, size_t low, size_t i, int res) {
while (low < high) {
asm volatile ("" : "=w"(i));
asm volatile ("" : "=w"(res));
low = res > 0 ? i + 1 : low;
}
return -1;
}

which is because of the change header pass.  GCC optimizes this scalar code
from:

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

   [local count: 59055799]:
  goto ; [100.00%]

   [local count: 1014686024]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
goto ; [5.50%]
  else
goto ; [94.50%]

   [local count: 55807731]:
  iftmp.0_7 = i_5 + 1;

   [local count: 114863530]:
  # low_1 = PHI 

   [local count: 1073741824]:
  if (low_1 < high_3(D))
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 59055800]:
  return 18446744073709551615;

}

into 

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

   [local count: 59055799]:
  if (low_2(D) < high_3(D))
goto ; [48.59%]
  else
goto ; [51.41%]

   [local count: 958878294]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
goto ; [5.50%]
  else
goto ; [94.50%]

   [local count: 55807731]:
  # i_10 = PHI 
  iftmp.0_7 = i_10 + 1;
  if (high_3(D) > iftmp.0_7)
goto ; [48.59%]
  else
goto ; [51.41%]

   [local count: 55807730]:
  __asm__ __volatile__("" : "=w" i_8);
  __asm__ __volatile__("" : "=w" res_9);
  if (res_9 > 0)
goto ; [5.50%]
  else
goto ; [94.50%]

   [local count: 59055800]:
  return 18446744073709551615;

}

which then hits the limitation of the single pred I mentioned before.

using -fdisable-tree-ch2 we get:

add x2, x2, 1
cselx1, 

[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc

2024-01-31 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #2 from Andrew Pinski  ---
I should note that on x86, 2 cmov in a row might be an issue and worse than
branches. There is a cost model and the x86 backend rejects that.

There are some cores where it is worse. I don't know if it applies to recent
ones though.

[Bug rtl-optimization/113682] Branches in branchless binary search rather than cmov/csel/csinc

2024-01-31 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
  Component|other   |rtl-optimization
Version|unknown |14.0
 Target||aarch64, x86_64-*-*

--- Comment #1 from Richard Biener  ---
Since there's a loop exit involved (and the loop has multiple exits)
if-conversion is made difficult here.

You could try rotating manually producing a do { } while loop with
a "nicer" exit condition and see whether that helps.