https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682
Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |NEW
     Ever confirmed|0           |1
          Component|middle-end  |rtl-optimization
   Last reconfirmed|            |2024-04-02

--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Ok, so for now I think we should separate out the issue of whether/when we
should perform the if-conversion and instead look for a bit at why we miss
these conversions.

The first problem is that GCC does not support if-converting blocks which
terminate in another branch.  In other words, the destination of one of the
arms must have a single successor per noce_find_if_block, which requires the
final block to be a JOIN block.  In

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize)
{
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);
        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low;   // csinc
        high = res < 0 ? i : high;     // csel
        if (res == 0)
            return i;
    }
    return -1;
}

this fails because of the:

        if (res == 0)
            return i;

which means that the final block doesn't adhere to this constraint, so the
cbranch blocks the if-conversion.

If we simplify the testcase:

typedef unsigned long size_t;
extern int cmp(size_t *);

size_t bsearch2(size_t high, size_t low, size_t i)
{
    while (low < high) {
        int res = cmp(&i);
        low = res > 0 ? i + 1 : low;
        high = res < 0 ? i : high;
    }
    return -1;
}

we now detect the csel but not the csinc.  This has two reasons.  First, we
visit the BBs in entry->exit order, but we probably should do exit->entry,
such that when we if-convert blocks and delete the branches, the preceding
blocks become valid to if-convert.  But the main reason the csinc is missed
is code sinking: the update of the reference &i is sunk into the res > 0
branch.
<bb 3> [local count: 955630224]:
# high_18 = PHI <high_14(7), high_7(D)(2)>
# low_19 = PHI <low_5(7), low_8(D)(2)>
res_11 = cmp (&i);
if (res_11 > 0)
  goto <bb 4>; [59.00%]
else
  goto <bb 5>; [41.00%]

<bb 4> [local count: 563821836]:
i.1_1 = i;
iftmp.0_12 = i.1_1 + 1;
goto <bb 7>; [100.00%]

This is because the compiler (rightfully) decides that the dereference should
only happen in the branch that needs it.  Because of this, block 4 is no
longer if-convertible.  However, since the load is from the stack we are
technically allowed to move it around.  But this is only beneficial if the
branch is really unpredictable.

To show that it is the sinking, we can avoid it by using -fdisable-tree-sink2
and this example:

size_t bsearch2(size_t high, size_t low, size_t *iv, int *resv)
{
    while (low < high) {
        int res = cmp (iv);
        size_t i = *iv;
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which generates the csinc.  However, the following example again doesn't
generate the csinc for GCC but does for LLVM:

size_t bsearch3(size_t high, size_t low, size_t i, int res)
{
    while (low < high) {
        asm volatile ("" : "=w"(i));
        asm volatile ("" : "=w"(res));
        low = res > 0 ? i + 1 : low;
    }
    return -1;
}

which is because of the loop header copying (ch) pass.
GCC optimizes this scalar code from:

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

  <bb 2> [local count: 59055799]:
  goto <bb 5>; [100.00%]

  <bb 3> [local count: 1014686024]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 6>; [94.50%]

  <bb 4> [local count: 55807731]:
  iftmp.0_7 = i_5 + 1;

  <bb 5> [local count: 114863530]:
  # low_1 = PHI <low_2(D)(2), iftmp.0_7(4)>

  <bb 6> [local count: 1073741824]:
  if (low_1 < high_3(D))
    goto <bb 3>; [94.50%]
  else
    goto <bb 7>; [5.50%]

  <bb 7> [local count: 59055800]:
  return 18446744073709551615;
}

into

size_t bsearch3 (size_t high, size_t low, size_t i, int res)
{
  size_t iftmp.0_7;

  <bb 2> [local count: 59055799]:
  if (low_2(D) < high_3(D))
    goto <bb 5>; [48.59%]
  else
    goto <bb 6>; [51.41%]

  <bb 3> [local count: 958878294]:
  __asm__ __volatile__("" : "=w" i_5);
  __asm__ __volatile__("" : "=w" res_6);
  if (res_6 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

  <bb 4> [local count: 55807731]:
  # i_10 = PHI <i_5(3), i_8(5)>
  iftmp.0_7 = i_10 + 1;
  if (high_3(D) > iftmp.0_7)
    goto <bb 5>; [48.59%]
  else
    goto <bb 6>; [51.41%]

  <bb 5> [local count: 55807730]:
  __asm__ __volatile__("" : "=w" i_8);
  __asm__ __volatile__("" : "=w" res_9);
  if (res_9 > 0)
    goto <bb 4>; [5.50%]
  else
    goto <bb 3>; [94.50%]

  <bb 6> [local count: 59055800]:
  return 18446744073709551615;
}

which then hits the single-pred limitation I mentioned before.

Using -fdisable-tree-ch2 we get:

        add     x2, x2, 1
        csel    x1, x2, x1, gt

which shows that, oddly, we missed the csinc here, meaning noce_try_addcc
doesn't trigger for some reason.

Ultimately RTL seems like the wrong place to fix up these issues, as it's much
easier to do in GIMPLE where we can more freely modify the CFG.  Though ce1
and ce2 are still in CFG layout mode, so we could..  But I also think GIMPLE
is better as it allows us to check the critical path and the loop carried
dependencies by walking the use-def chains cheaply.
Perhaps we can use something like gimple-isel to either do the conversion, or
use another pass to reshape the GIMPLE so CE can find them?

Additionally it would be good to have a way for users to tell us that the
branch is unpredictable, so that the heuristics can be relaxed.  LLVM has
__builtin_unpredictable to help it along.  Perhaps we should do the same?
Though there is already __builtin_expect_with_probability, and I suppose a
probability of ~50% should be interpreted as unpredictable?  Thoughts?

Note that these are only issues for loops; outside of loops we detect them
reliably because the transformations above don't happen:

unsigned long long
test_csinc64_ifcvt(unsigned long long *x0, unsigned long long x1,
                   unsigned long long x2)
{
    if (*x0 == x1)
        ++x2;
    return x2;
}

does the right thing.