[Bug c/110007] Implement support for Clang’s __builtin_unpredictable()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

--- Comment #8 from Richard Yao ---
(In reply to Alexander Monakov from comment #6)
> Are you sure the branch is unpredictable in your micro-benchmark? If you
> have repeated runs you'll train the predictors.

(In reply to Jan Hubicka from comment #7)
> Also note that branch predicted with 50% outcome is not necessarily
> unpredictable for example in this:
>
>   for (int i = 0; i < 1; i++)
>     if (i&1)
>
> I would expect branch predictor to work this out on modern systems.
> So having explicit flag in branch_probability that given probability is hard
> for CPU to predict would make sense and I was thinking we may try to get
> this info from auto-fdo eventually too.

Good point. I had reused an existing micro-benchmark, but it uses libc's srand(), which is known for poor RNG quality. It is quite possible that the last branch really is predictable because of that. Having only one unpredictable branch is not that terrible, so I will probably defer looking into this further.

(In reply to Alexander Monakov from comment #6)
> Implementing a __builtin_branchless_select would address such needs more
> directly. There were similar requests in the past, two of them related to
> qsort and bsearch, unsurprisingly: PR 93165, PR 97734, PR 106804.

As a developer who works on a project that supports GCC and Clang equally, but must support older versions of GCC for longer, I would like to see GCC and Clang adopt each other's builtins. That way, I need fewer compatibility shims to support both compilers, and the shims I do need can be dropped sooner (perhaps after 10 years). I am not against the new builtin, but I would also like to have __builtin_unpredictable().
It would be consistent with the existing likely/unlikely macros that my project uses, which should mean other developers will not need to learn the specifics of how predication works to read code that uses it: they can treat it as a compiler hint and read the code as if the hint were not there, unless they have some reason to think more deeply about it. I could wrap __builtin_expect_with_probability() in a macro, but I greatly prefer __builtin_unpredictable(), since the name is self-documenting and readers do not need to decipher an argument list to understand what it does.
[Bug c/110007] Implement support for Clang’s __builtin_unpredictable()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

--- Comment #5 from Richard Yao ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Richard Yao from comment #3)
> > (In reply to Andrew Pinski from comment #2)
> > > (In reply to Richard Yao from comment #0)
> > > > Having the ability to specify __builtin_unpredictable() as a hint to
> > > > encourage the compiler to use cmov would be useful for implementing
> > > > algorithms like binary search that have unpredictable branches.
> > > > __builtin_expect_with_probability() looks like a possible substitute,
> > > > but Clang does not support it and it does not always work as described
> > > > in #110001.
> > >
> > > PR 110001 has nothing to do with __builtin_expect_with_probability
> >
> > I mentioned it briefly in PR 110001, but I guess I should make it explicit.
> > See line 58 here:
> >
> > https://gcc.godbolt.org/z/ef3Yfchzv
> >
> > That is made into a jump, when all that is necessary is cmov, which is what
> > Clang generates. In the example I had posted in PR 110001, I had not used
> > __builtin_expect_with_probability() because it made no difference, but here
> > I am using it to show that cmov is not used.
>
> If we look at the -fdump-tree-optimized-lineno dump we see why it was not
> turned into a cmov:
>
>   [/app/example.c:58:12 discrim 2] if (a_111 <= key_128(D))
>     goto ; [50.00%]
>   else
>     goto ; [50.00%]
>
>   [local count: 447392427]:
>   [/app/example.c:75:40] _268 = (long unsigned int) i_82;
>   [/app/example.c:75:40] _271 = _268 * 4;
>   [/app/example.c:75:34] _274 = array_89(D) + _271;
>   [/app/example.c:5:6] pretmp_277 = *_274;
>   goto ; [100.00%]
>
> There is a load moved inside the conditional.

Is executing an unpredictable branch cheaper than executing a redundant load?
This patch against the assembly output from GCC 12.3 modifies it to use cmov:

diff --git a/out.s b/out.s
index d796087..f0f009c 100644
--- a/out.s
+++ b/out.s
@@ -317,15 +317,11 @@ custom_binary_search_fast:
 	cmovle	%ecx, %edx
 .L43:
 	leal	1(%rdx), %ecx
-	movq	%rcx, %rax
-	movl	(%rdi,%rcx,4), %ecx
-	cmpl	%esi, %ecx
-	jle	.L15
+	cmpl	%esi, (%rdi,%rcx,4)
+	cmovle	%ecx, %edx
 .L25:
 	movl	%edx, %eax
 	movl	(%rdi,%rax,4), %ecx
-	movl	%edx, %eax
-.L15:
 	cmpl	%ecx, %esi
 	setl	%cl
 	setg	%dl

Micro-benchmarking the two suggests that the answer is yes on Zen 3, although I do not understand why.
[Bug c/110007] Implement support for Clang’s __builtin_unpredictable()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

--- Comment #3 from Richard Yao ---
(In reply to Andrew Pinski from comment #2)
> (In reply to Richard Yao from comment #0)
> > Having the ability to specify __builtin_unpredictable() as a hint to
> > encourage the compiler to use cmov would be useful for implementing
> > algorithms like binary search that have unpredictable branches.
> > __builtin_expect_with_probability() looks like a possible substitute, but
> > Clang does not support it and it does not always work as described in
> > #110001.
>
> PR 110001 has nothing to do with __builtin_expect_with_probability

I mentioned it briefly in PR 110001, but I guess I should make it explicit. See line 58 here:

https://gcc.godbolt.org/z/ef3Yfchzv

That is made into a jump, when all that is necessary is cmov, which is what Clang generates. In the example I had posted in PR 110001, I had not used __builtin_expect_with_probability() because it made no difference, but here I am using it to show that cmov is not used.

Earlier in the output, GCC 12.3 uses cmov for every 3rd instruction, so I suspect this is not a case of two cmovs being too close to each other. Should I file another issue explicitly for that, so this one can be left for the request that we have __builtin_unpredictable()?

> __builtin_expect_with_probability will do exactly the same as
> __builtin_unpredictable will do really.

__builtin_unpredictable() is easier to read than __builtin_expect_with_probability(), and it would be nice for both Clang and GCC to support the same builtins.
[Bug middle-end/98801] Request for a conditional move built-in function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801

Richard Yao changed:

           What    |Removed    |Added
----------------------------------------------------------------------------
                 CC|           |richard.yao at alumni dot stonybrook.edu

--- Comment #8 from Richard Yao ---
I just filed #110007 to ask for __builtin_unpredictable(). Something like __builtin_unpredictable(condition) ? a : b should be equivalent to a __builtin_cmov(). __builtin_unpredictable() is already implemented in Clang (although it does not yet pass it to the backend), so adopting it would be preferable to implementing an entirely new builtin.
[Bug c/110007] New: Implement support for Clang’s __builtin_unpredictable()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

            Bug ID: 110007
           Summary: Implement support for Clang’s __builtin_unpredictable()
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: richard.yao at alumni dot stonybrook.edu
  Target Milestone: ---

Having the ability to specify __builtin_unpredictable() as a hint to encourage the compiler to use cmov would be useful for implementing algorithms like binary search that have unpredictable branches. __builtin_expect_with_probability() looks like a possible substitute, but Clang does not support it and it does not always work as described in #110001. Combining this with C’s ternary operator should make the proposed cmov builtin from #98801 unnecessary.

Note that while Clang implements __builtin_unpredictable(), it does not currently pass that information to the backend:

https://github.com/llvm/llvm-project/issues/62790#issuecomment-1552671099
[Bug tree-optimization/109901] Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

--- Comment #8 from Richard Yao ---
Created attachment 55177
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55177&action=edit
Source code for micro-benchmark.

Here is an example of how not having this optimization slows us down:

https://gcc.godbolt.org/z/nxdrb17G7

custom_binary_search_slow() and custom_binary_search_fast() are the same function. The only difference is that I manually applied the ((((a) > (b)) - ((a) < (b))) <= 0) -> ((a) <= (b)) transformation to see what GCC would generate if it were able to do this transformation itself. This optimization alone makes binary search ~78% faster on my Ryzen 7 5800X:

Benchmark: array size: 1024, runs: 1000, repetitions: 1, seed: 1685158101, density: 10

Even distribution with 1024 32 bit integers, random access

| Name                      | Items | Hits | Misses | Time     |
| --                        | --    | --   | --     | --       |
| custom_binary_search_slow | 1024  | 983  | 9017   | 0.000313 |
| custom_binary_search_fast | 1024  | 983  | 9017   | 0.000176 |

I modified the micro-benchmark from scandum/binary_search to better suit a workload that I am micro-optimizing:

https://github.com/scandum/binary_search

Specifically, I wanted to test on small arrays while avoiding cache effects contaminating the results. One could easily add the search functions from my modified version into the original to get numbers for bigger array sizes. I have attached the source code for the modified micro-benchmark.

The above run was done after compiling with -O2 -fno-inline. Compiling with just -O2 does not make much difference, since I deleted the code where -fno-inline makes a difference from that file, as it was not relevant to this issue.
[Bug tree-optimization/110001] New: [13 regression] Suboptimal code generation for branchless binary search
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110001

            Bug ID: 110001
           Summary: [13 regression] Suboptimal code generation for
                    branchless binary search
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: richard.yao at alumni dot stonybrook.edu
  Target Milestone: ---

GCC 12.3 generated beautiful code for this, with all but the last of the unrolled loop iterations using only 3 instructions:

https://gcc.godbolt.org/z/eGbEj9YKd

Currently, GCC generates 4 instructions:

https://gcc.godbolt.org/z/Ebczq8jjx

This probably does not make a huge difference given the data hazard, but there is something awe-inspiring about seeing GCC generate only 3 instructions per unrolled loop iteration for binary search. It would be nice if future versions went back to generating three instructions.

This function was inspired by this D code:

https://godbolt.org/z/5En7xajzc

The bsearch1000() function is entirely branchless and has no more than 2 instructions for every cmov, excluding ret. I wrote a more general version in C that can handle variable array sizes, and to my pleasant surprise, GCC 12.3 generated a similar 3-instruction sequence for all but the last of the unrolled loop iterations. I was saddened when I saw the output from GCC 13.1 and trunk.

Anyway, all recent versions of GCC that I cared to check generate a branch for the last unrolled iteration on line 58. That branch is unpredictable, so GCC would generate better code here if it used predication to avoid the branch. I had been able to give GCC a hint to avoid a similar branch at the end using __builtin_expect_with_probability(), but that trick did not work for line 58.

Also, if anyone is interested in where that D code originated, here is the source:

https://muscar.eu/shar-binary-search-meta.html
[Bug tree-optimization/109901] Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

--- Comment #7 from Richard Yao ---
Two more rules:

bool0 - bool1 >= 0 -> bool0 | !bool1 -> bool0 >= bool1
bool0 - bool1 <= 0 -> !bool0 | bool1 -> bool0 <= bool1
[Bug tree-optimization/109901] Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

--- Comment #6 from Richard Yao ---
(In reply to Andrew Pinski from comment #1)
> bool0 - bool1 == 1 -> bool0 & !bool1 -> bool0 < bool1
> bool0 - bool1 > 0 -> bool0 & !bool1 -> bool0 < bool1

That should be:

bool0 - bool1 == 1 -> bool0 & !bool1 -> bool0 > bool1
bool0 - bool1 > 0 -> bool0 & !bool1 -> bool0 > bool1
[Bug middle-end/109901] New: Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

            Bug ID: 109901
           Summary: Optimization opportunity: ((((a) > (b)) - ((a) < (b)))
                    < 0) -> ((a) < (b))
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: richard.yao at alumni dot stonybrook.edu
  Target Milestone: ---

LLVM/Clang also misses this optimization opportunity, so I also filed an issue with them:

https://github.com/llvm/llvm-project/issues/62790

The following transformations can be done as an optimization:

((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))
((((a) > (b)) - ((a) < (b))) <= 0) -> ((a) <= (b))
((((a) > (b)) - ((a) < (b))) == -1) -> ((a) < (b))
((((a) > (b)) - ((a) < (b))) == 1) -> ((a) > (b))
((((a) > (b)) - ((a) < (b))) == 0) -> ((a) == (b))
((((a) > (b)) - ((a) < (b))) > 0) -> ((a) > (b))
((((a) > (b)) - ((a) < (b))) >= 0) -> ((a) >= (b))
((((a) >= (b)) - ((a) <= (b))) < 0) -> ((a) < (b))
((((a) >= (b)) - ((a) <= (b))) <= 0) -> ((a) <= (b))
((((a) >= (b)) - ((a) <= (b))) == -1) -> ((a) < (b))
((((a) >= (b)) - ((a) <= (b))) == 1) -> ((a) > (b))
((((a) >= (b)) - ((a) <= (b))) == 0) -> ((a) == (b))
((((a) >= (b)) - ((a) <= (b))) > 0) -> ((a) > (b))
((((a) >= (b)) - ((a) <= (b))) >= 0) -> ((a) >= (b))

Both (((a) > (b)) - ((a) < (b))) and (((a) >= (b)) - ((a) <= (b))) will generate -1, 0 or 1 when comparing two integers (signed or unsigned). When comparators using this trick are inlined into the caller, the above transformations become applicable.

I noticed that neither compiler exploits this high-level optimization opportunity while working on a faster binary search implementation for the OpenZFS B-tree code. It relies on a macro to achieve C++-style inlining of the comparator into the implementation by making different versions of the same function.
The following example at godbolt does not use a macro, to make it easier to see which lines of assembly correspond to which lines of high-level C:

https://gcc.godbolt.org/z/cdedccxae

On amd64, GCC generates 15 instructions for the loop. If you comment out line 35 and uncomment line 37, GCC will generate 11 instructions for the loop. That is the output GCC would produce if it supported this high-level optimization.

Had the comparator returned 1 for less than and 0 for greater than or equal to, we would have had the 11-instruction version of the loop without any need for this optimization. Changing the semantics because our compilers lack this optimization would be painful, in part because the entire code base expects the -1, 0 or 1 return value semantics and other code depends on these comparators.

It would be nice if GCC implemented this optimization.