Yeah +1 on waiting/asking/expecting CMOV to be properly utilized by
Hotspot, instead of trying to target the instruction ourselves.  This is
all more of a curiosity / exploration.  I am curious whether "branchless
binary search" is something Arrays.binarySearch already does / compiles
to.  Even in C (much closer to shiny bare metal than javaland!) you must
write the for loop just so to tickle the compiler into producing CMOV.
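
For reference, a minimal sketch of what "branchless binary search" could look
like in Java. This is not what Arrays.binarySearch or Lucene does; the class
and method names are made up for illustration. The ternary only gives C2 the
chance to emit CMOV, and whether it actually does has to be checked in the
generated assembly.

```java
// Hypothetical illustration only; not taken from Lucene or the JDK.
final class BranchlessSearchSketch {

    /**
     * Returns the index of the first element >= target in the sorted array,
     * or sorted.length if every element is smaller (a "lower bound" search).
     */
    public static int lowerBound(int[] sorted, int target) {
        int len = sorted.length;
        if (len == 0) {
            return 0;
        }
        int base = 0;
        // The trip count depends only on the array length, never on the data,
        // so the loop branch itself is easy to predict.
        while (len > 1) {
            int half = len >>> 1;
            // Data-dependent choice written as a ternary rather than if/else;
            // this is the spot where a conditional move may be generated.
            base = sorted[base + half - 1] < target ? base + half : base;
            len -= half;
        }
        // One final comparison decides between base and base + 1.
        return sorted[base] < target ? base + 1 : base;
    }
}
```

For example, lowerBound(new int[] {1, 3, 5, 7}, 5) returns 2, and a target
larger than every element returns the array length.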

When Adrien fixed our FOR postings decode to write the Java "just so" so
that Hotspot would auto-vectorize properly, it was an impactful performance
jump.  But I agree this is all very brittle: one small change on upgrade to
the JDK, CPU instructions, whatnot, might break the optimization.  How do
we know/track/test that it is even still working?  Maybe we need some crazy
unit test involving hsdis to confirm :)  Cutting over to explicit
vectorized code (Panama) should ensure that, but for postings it looks like
we are still a ways off from that being realistic:
https://github.com/apache/lucene/issues/12477#issuecomment-1658224212 ...
though for KNN we were able to cut over.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jul 28, 2023 at 4:36 AM Uwe Schindler <u...@thetaphi.de> wrote:

> Actually this is exactly the same for Java: You can try whatever you want,
> the outcome of the dynamic optimization applied by various dynamic building
> blocks (Java bytecode, Java/Hotspot version, command line parameters,
> hardware CPU, virtualization) is not predictable and any change anywhere
> may produce different results. So we should stop arguing about changing
> *our* code to improve assembly code. If we have some code on our side and
> it is not correctly converted to CMOV, we should open a bug report on OpenJDK
> (Chris H. and I can do this easily - and ask for improvement).
>
> As you have seen in my other answer to this thread: Hotspot applies CMOV
> depending on analysis of branches. So in general our code *should* make use
> of CMOV. You can only get certainty by using hsdis and printing the assembly
> for some of our methods which you think should use CMOV. But there's no
> guarantee that it is applied. And as always: It may take a very long time
> until Hotspot replaces the standard branched code by conditional moves (as
> they have significant overhead if used in cases where the result is
>
> With Hotspot you can try to add -XX:ConditionalMoveLimit ("Limit of ops
> to make speculative when using CMOVE") and try with different values (0
> disables, default is 3 on x86 and aarch64, 4 on arm). But as always: Wait
> long enough.
>
> To enforce usage of CMOV (maybe that's the first thing to try, to look at
> the kind of assembly generated; but this may slow down other code, as CMOV
> is then always used, without analysis):
> -XX:+UseCMoveUnconditionally ("Generates CMove (scalar and vector)
> instructions regardless of profitability analysis.")
>
> Uwe
>
> P.S.: Hotspot also has cmov for vectorized code
> On 28.07.2023 at 09:08, Dawid Weiss wrote:
>
>
>
>> Specifically, one of the fascinating Tantivy optimizations is the
>> branchless binary search: https://quickwit.io/blog/search-a-sorted-block
>> .
>>
>
> This is an interesting post, thanks for sharing, Mike. I remember when
> people did such low-level tricks frequently (but on much simpler processors
> and fairly consistent hardware) and it always makes me wonder whether all
> the moving blocks involved here (Rust, LLVM, actual hardware) make it sane -
> any change in any of these layers may
> affect the outcome (and debugging what actually happened will be a
> nightmare...). I like it though - nice intellectual exercise and some
> assembly dumps for a change. ;)
>
> D.
>
>> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
