[Bug c/110007] Implement support for Clang’s __builtin_unpredictable()

2023-05-28 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

--- Comment #8 from Richard Yao  ---
(In reply to Alexander Monakov from comment #6)
> Are you sure the branch is unpredictable in your micro-benchmark? If you
> have repeated runs you'll train the predictors.

(In reply to Jan Hubicka from comment #7)
> Also note that branch predicted with 50% outcome is not necessarily
> unpredictable for example in this:
> 
> for (int i = 0; i < 1; i++)
>   if (i&1)
>  
> 
> I would expect branch predictor to work this out on modern systems.
> So having explicit flag in branch_probability that given probability is hard
> for CPU to predict would make sense and I was thinking we may try to get
> this info from auto-fdo eventually too.

Good point. I had reused an existing micro-benchmark, but it is using libc's
srand(), which is known for not having great quality RNG. It is quite possible
that the last branch really is predictable because of that. Having only 1
unpredictable branch is not that terrible, so I probably will defer looking
into this further to a future date.

(In reply to Alexander Monakov from comment #6)
> Implementing a __builtin_branchless_select would address such needs more
> directly. There were similar requests in the past, two of them related to
> qsort and bsearch, unsurprisingly: PR 93165, PR 97734, PR 106804.

As a developer that works on a project that supports GCC and Clang equally, but
must support older versions of GCC longer, I would like to see both GCC and
Clang adopt each others' builtins. That way, I need to implement fewer
compatibility shims to support both compilers and what compatibility shims I do
need can be dropped sooner (maybe after 10 years).

I am not against the new builtin, but I would like to also have
__builtin_unpredictable(). It would be consistent with the existing
likely/unlikely macros that my project uses, which should mean other developers
will not have to learn the specifics of how predication works to read code
using it, since they can just treat it as a compiler hint and read things as if
it were not there unless they have some reason to reason more deeply about
things.

I could just do a macro around __builtin_expect_with_probability(), but I
greatly prefer __builtin_unpredictable() since people reading the code do not
need to decipher the argument list to understand what it does since the name is
self documenting.

[Bug c/110007] Implement support for Clang’s __builtin_unpredictable()

2023-05-27 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

--- Comment #5 from Richard Yao  ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Richard Yao from comment #3)
> > (In reply to Andrew Pinski from comment #2)
> > > (In reply to Richard Yao from comment #0)
> > > > Having the ability to specify __builtin_unpredictable() as a hint to
> > > > encourage the compiler to use cmov would be useful for implementing
> > > > algorithms like binary search that have unpredictable branches.
> > > > __builtin_expect_with_probability() looks like a possible substitute, 
> > > > but
> > > > Clang does not support it and it does not always work as described in
> > > > #110001.
> > > 
> > > PR 110001 has nothing to do with __builtin_expect_with_probability and
> > 
> > I mentioned it briefly in PR 110001, but I guess I should make it explicit.
> > See line 58 here:
> > 
> > https://gcc.godbolt.org/z/ef3Yfchzv
> > 
> > That is made into a jump, when all that is necessary is cmov, which is what
> > Clang generates. In the example I had posted in PR 110001, I had not used
> > __builtin_expect_with_probability() because it made no difference, but here
> > I am using it to show that cmov is not used.
> 
> 
> 
> If we look at the -fdump-tree-optimized-lineno dump we see why it was not
> turned into a cmov:
>   [/app/example.c:58:12 discrim 2] if (a_111 <= key_128(D))
> goto ; [50.00%]
>   else
> goto ; [50.00%]
> 
>[local count: 447392427]:
>   [/app/example.c:75:40] _268 = (long unsigned int) i_82;
>   [/app/example.c:75:40] _271 = _268 * 4;
>   [/app/example.c:75:34] _274 = array_89(D) + _271;
>   [/app/example.c:5:6] pretmp_277 = *_274;
>   goto ; [100.00%]
> 
> 
> There is a load moved inside the conditional.

Is executing an unpredictable branch cheaper than executing a redundant load?

This a patch against the assembly output from GCC 12.3 modifies this to use
cmov:

diff --git a/out.s b/out.s
index d796087..f0f009c 100644
--- a/out.s
+++ b/out.s
@@ -317,15 +317,11 @@ custom_binary_search_fast:
cmovle  %ecx, %edx
 .L43:
leal1(%rdx), %ecx
-   movq%rcx, %rax
-   movl(%rdi,%rcx,4), %ecx
-   cmpl%esi, %ecx
-   jle .L15
+   cmpl%esi, (%rdi,%rcx,4)
+   cmovle  %ecx, %edx
 .L25:
movl%edx, %eax
movl(%rdi,%rax,4), %ecx
-   movl%edx, %eax
-.L15:
cmpl%ecx, %esi
setl%cl
setg%dl

Micro-benchmarking the two suggests that the answer is yes on Zen 3, although I
do not understand why.

[Bug c/110007] Implement support for Clang’s __builtin_unpredictable()

2023-05-27 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

--- Comment #3 from Richard Yao  ---
(In reply to Andrew Pinski from comment #2)
> (In reply to Richard Yao from comment #0)
> > Having the ability to specify __builtin_unpredictable() as a hint to
> > encourage the compiler to use cmov would be useful for implementing
> > algorithms like binary search that have unpredictable branches.
> > __builtin_expect_with_probability() looks like a possible substitute, but
> > Clang does not support it and it does not always work as described in
> > #110001.
> 
> PR 110001 has nothing to do with __builtin_expect_with_probability and

I mentioned it briefly in PR 110001, but I guess I should make it explicit. See
line 58 here:

https://gcc.godbolt.org/z/ef3Yfchzv

That is made into a jump, when all that is necessary is cmov, which is what
Clang generates. In the example I had posted in PR 110001, I had not used
__builtin_expect_with_probability() because it made no difference, but here I
am using it to show that cmov is not used.

Earlier in the output, GCC 12.3 uses cmov for every 3rd instruction, so I
suspect that this is not a case of two cmovs being too close to each other.

Should I file another issue explicitly for that so this could be left for the
request that we have __builtin_unpredictable()?

> __builtin_expect_with_probability will do exactly the same as
> __builtin_unpredictable will do really.

__builtin_unpredictable() is easier to read than
__builtin_expect_with_probability() and it would be nice for both Clang and GCC
to support the same builtins.

[Bug middle-end/98801] Request for a conditional move built-in function

2023-05-27 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801

Richard Yao  changed:

   What|Removed |Added

 CC||richard.yao at alumni dot 
stonybro
   ||ok.edu

--- Comment #8 from Richard Yao  ---
I just filed #110007 to ask for __builtin_unpredictable(). Something like
__builtin_unpredictable(condition) ? a : b should be equivalent to a
__builtin_cmov(). __builtin_unpredictable() is already implemented in Clang
(although it does not yet pass it to the backend), so adopting it would be
preferable to implementing an entirely new builtin.

[Bug c/110007] New: Implement support for Clang’s __builtin_unpredictable()

2023-05-27 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110007

Bug ID: 110007
   Summary: Implement support for Clang’s
__builtin_unpredictable()
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: richard.yao at alumni dot stonybrook.edu
  Target Milestone: ---

Having the ability to specify __builtin_unpredictable() as a hint to encourage
the compiler to use cmov would be useful for implementing algorithms like
binary search that have unpredictable branches.
__builtin_expect_with_probability() looks like a possible substitute, but Clang
does not support it and it does not always work as described in #110001.

Combining this with C’s ternary operator should make the proposed cmov builtin
from #98801 unnecessary.

Note that while clang implements __builtin_unpredictable(), it does not
currently pass that information to the backend:

https://github.com/llvm/llvm-project/issues/62790#issuecomment-1552671099

[Bug tree-optimization/109901] Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))

2023-05-26 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

--- Comment #8 from Richard Yao  ---
Created attachment 55177
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55177&action=edit
Source code for micro-benchmark.

Here is an example of how not having this optimization slows us down:

https://gcc.godbolt.org/z/nxdrb17G7

custom_binary_search_slow() and custom_binary_search_fast() are the same
function. The only difference is that I manually applied the a) > (b)) -
((a) < (b))) <= 0) -> ((a) <= (b)) transformation to see what GCC would
generate if it were able to do this transformation.

This optimization alone makes binary search ~78% faster on my Ryzen 7 5800X:

Benchmark: array size: 1024, runs: 1000, repetitions: 1, seed: 1685158101,
density: 10

Even distribution with 1024 32 bit integers, random access

|   Name |  Items |   Hits
| Misses |   Time |
| -- | -- | --
| -- | -- |
|  custom_binary_search_slow |   1024 |983
|   9017 |   0.000313 |
|  custom_binary_search_fast |   1024 |983
|   9017 |   0.000176 |

I modified the microbenchmark from scandum/binary_search to better suit a
workload that I am micro-optimizing:

https://github.com/scandum/binary_search

In specific, I wanted to test on small arrays, but avoid cache effects
contaminating the results. One could easily add the search functions from my
modified version into the original to get get numbers for bigger array sizes.

I have attached the source code for the modified micro-benchmark. The above run
was done after compiling with -O2 -fno-inline. Compiling with just -O2 does not
make much difference, since I deleted the code where -fno-inline makes a
difference from that file since it was not relevant to this issue.

[Bug tree-optimization/110001] New: [13 regression] Suboptimal code generation for branchless binary search

2023-05-26 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110001

Bug ID: 110001
   Summary: [13 regression] Suboptimal code generation for
branchless binary search
   Product: gcc
   Version: 13.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: richard.yao at alumni dot stonybrook.edu
  Target Milestone: ---

GCC 12.3 generated beautiful code for this, with all but the last of the
unrolled loop iterations using only 3 instructions:

https://gcc.godbolt.org/z/eGbEj9YKd

Currently, GCC generates 4 instructions:

https://gcc.godbolt.org/z/Ebczq8jjx

This probably does not make a huge difference given the data hazard, but there
is something awe-inspiring from seeing GCC generate only 3 instructions per
unrolled loop iteration for binary search. It would be nice if future versions
went back to generating three instructions.

This function was inspired by this D code:

https://godbolt.org/z/5En7xajzc

The bsearch1000() function is entirely branchless and has no more than 2
instructions for every cmov, excluding ret. I wrote a more general version in C
that can handle variable array sizes, and to my pleasant surprise, GCC 12.3
generated a similar 3 instruction sequence for all but the last of the unrolled
loop iterations. I was saddened when I saw the output from GCC 13.1 and trunk.

Anyway, all recent versions of GCC that I cared to check generate a branch for
the last unrolled iteration on line 58. That branch is unpredictable, so GCC
would generate better code here if it used predication to avoid the branch. I
had been able to give GCC a hint to avoid a similar branch at the end using
__builtin_expect_with_probability(), but that trick did not work for line 58.

Also, if anyone is interested in where that D code originated, here is the
source:

https://muscar.eu/shar-binary-search-meta.html

[Bug tree-optimization/109901] Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))

2023-05-17 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

--- Comment #7 from Richard Yao  ---
Two more rules:

bool0 - bool1 >= 0 -> bool0 | !bool1 -> bool1 >= bool0
bool0 - bool1 <= 0 -> !bool0 | bool1 -> bool0 <= bool1

[Bug tree-optimization/109901] Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))

2023-05-17 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

--- Comment #6 from Richard Yao  ---
(In reply to Andrew Pinski from comment #1)
> bool0 - bool1 == 1 -> bool0 & !bool1 -> bool0 < bool1
> bool0 - bool1 > 0 -> bool0 & !bool1 -> bool0 < bool1

That should be:

bool0 - bool1 == 1 -> bool0 & !bool1 -> bool0 > bool1
bool0 - bool1 > 0 -> bool0 & !bool1 -> bool0 > bool1

[Bug middle-end/109901] New: Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))

2023-05-17 Thread richard.yao at alumni dot stonybrook.edu via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109901

Bug ID: 109901
   Summary: Optimization opportunity: a) > (b)) - ((a) < (b)))
< 0) -> ((a) < (b))
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: richard.yao at alumni dot stonybrook.edu
  Target Milestone: ---

LLVM/Clang also misses this optimization opportunity, so I also filed an issue
with them:

https://github.com/llvm/llvm-project/issues/62790

The following transformations can be done as an optimization:

a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b))

a) > (b)) - ((a) < (b))) <= 0) -> ((a) <= (b))

a) > (b)) - ((a) < (b))) == -1) -> ((a) < (b))

a) > (b)) - ((a) < (b))) == 1) -> ((a) > (b))

a) > (b)) - ((a) < (b))) == 0) -> ((a) == (b))

a) > (b)) - ((a) < (b))) > 0) -> ((a) > (b))

a) > (b)) - ((a) < (b))) >= 0) -> ((a) >= (b))

a) >= (b)) - ((a) <= (b))) < 0) -> ((a) < (b))

a) >= (b)) - ((a) <= (b))) <= 0) -> ((a) <= (b))

a) >= (b)) - ((a) <= (b))) == -1) -> ((a) < (b))

a) >= (b)) - ((a) <= (b))) == 1) -> ((a) > (b))

a) >= (b)) - ((a) <= (b))) == 0) -> ((a) == (b))

a) >= (b)) - ((a) <= (b))) > 0) -> ((a) > (b))

a) >= (b)) - ((a) <= (b))) >= 0) -> ((a) >= (b))


Both (((a) > (b)) - ((a) < (b))) and (((a) >= (b)) - ((a) <= (b))) will
generate -1, 0 or 1 when comparing two integers (signed or unsigned). When
comparators using this trick are inlined into the caller, the above
transformations become applicable.

I noticed that neither compiler exploits this high level optimization
opportunity when I was working on using a faster binary search implementation
for the OpenZFS b-tree code. It relied on a macro to achieve C++-style inlining
of the comparator into the implementation by making different versions of the
same function.

The following example at godbolt does not use a macro to make it easier to see
which lines of assembly correspond to which lines of high level C:

https://gcc.godbolt.org/z/cdedccxae

On amd64, GCC generates 15 instructions for the loop. If you comment out line
35 and uncomment line 37, GCC will generate 11 instructions for the loop. This
is the output GCC would produce if it supported that high level optimization.

Had the comparator returned 1 for less than and 0 for greater than or equal to,
we would have had the 11-instruction version of the loop without any need for
this optimization. Changing the semantics because our compilers lack this
optimization would be painful in part because the entire code base expects the
-1, 0 or 1 return value semantics and other code depends on these comparators.

It would be nice if GCC implemented this optimization.