On 8/12/18 10:57 AM, Martin Buchholz wrote:
If delegating to nlz is the winner so far, we should be able to do at least as well by inlining nlz into ntz and then looking for more optimizations. Following this strategy leads naturally to

    static int ntz_inlineNlz2(int i) {
        i &= -i;
        if (i <= 0) return 32 - (i >>> 31);
        int n = 0;
        if (i >= 1 << 16) { n += 16; i >>>= 16; }
        if (i >= 1 <<  8) { n +=  8; i >>>=  8; }
        if (i >= 1 <<  4) { n +=  4; i >>>=  4; }
        if (i >= 1 <<  2) { n +=  2; i >>>=  2; }
        return n + (i >>> 1);
    }

Right. The variant numberOfLeadingZeros_01() from the benchmark is very close to this, though you've got better handling of (i <= 0) case, so I added it as numberOfLeadingZeros_01a() with a minor modification.

which should save a branch and so should be a benchmark winner.

A reason why delegating to nlz may have beat my previous attempt is because direct comparison with a constant is an operation the hardware tries hard to optimize, e.g. branch predict.

Most of the comparisons should be false in practice because "most ints are small".

I also see that our nlz implementation favors small integers, which helps with ntz.

It's possible that benchmarking may cause branches to be very highly predictable. It should be more real-world for each benchmark method to see a variety of inputs, perhaps in an array.

Okay. Now I tried to combine calculating of several results in one iteration of benchmark to make it harder to predict branches :) Surprisingly, this made the variant 05 (reducing to nlz) the leader, for which I don't have a good explanation, as it does strictly more calculations than 01 or 01a even after inlining.

Anyways, please find the updated benchmark here:
http://cr.openjdk.java.net/~igerasim/8209171/03/bench/int/MyBenchmark.java

The graphs for -client and -server are here:
http://cr.openjdk.java.net/~igerasim/8209171/03/bench/int/bench-int-03-client.png
http://cr.openjdk.java.net/~igerasim/8209171/03/bench/int/bench-int-03-server.png

It took almost an hour to generate the results, so they seem to be quite accurate.

So, I'm still inclined to prefer the variant 05 (which is to reduce ntz to nlz) :)

With kind regards,
Ivan


On Sun, Aug 12, 2018 at 7:22 AM, Ivan Gerasimov <ivan.gerasi...@oracle.com <mailto:ivan.gerasi...@oracle.com>> wrote:

    Hi Martin!


    On 8/11/18 5:54 PM, Martin Buchholz wrote:
    Hi Ivan,

    Oh the allure of bit-twiddling!

    Yes :)

    I'm skeptical that ntz should ever delegate to nlz, and not only
    because of the overhead of a wrapper, but because small numbers
    are more common, and we can take that into account when
    implementing ntz.
    I was under impression that the more frequently a function is
    called the faster it gets compiled, so all the callers of this
    function will benefit.
    For example, if numberOfTrailingZeros is reduced to
    numberOfLeadingZeros then when the later is compiled while the
    former is still not, it will still be running faster than the
    variant with independent implementations.

      At least add "1" to the set of numbers to benchmark.
    In the last proposed patch, all odd numbers will be processed via
    a fast path (because for any odd i, ~i & (i - 1) == 0).
    So, I added 1, 16 and 42 -- small numbers with different number of
    trailing zeros.

    Here's the updated benchmark:
    http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/MyBenchmark.java
    
<http://cr.openjdk.java.net/%7Eigerasim/8209171/02/bench/int/MyBenchmark.java>
    (I only executed four implementations to keep the picture clear.)

      Here's my own entry in this race:

        static int ntz(int i) {
            if (i == 0) return 32;
            int n = 0;
            if ((i << 16)  == 0) { n += 16; i >>>= 16; }
            if ((i & 0xFF) == 0) { n +=  8; i >>>=  8; }
            if ((i & 0xF)  == 0) { n +=  4; i >>>=  4; }
            if ((i & 0x3)  == 0) { n +=  2; i >>>=  2; }
            return n + (~i & 1);
        }

    Interesting!
    You might also avoid inversion at the end, if n is initialized
    with 1, and then the last line may be written as return n - (i & 1).

    Still this variant appears to be slower in most tried cases.
    Here's the graph of the latest benchmark:
    
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-02-client.png
    
<http://cr.openjdk.java.net/%7Eigerasim/8209171/02/bench/int/bench-int-02-client.png>
    
http://cr.openjdk.java.net/~igerasim/8209171/02/bench/int/bench-int-02-server.png
    
<http://cr.openjdk.java.net/%7Eigerasim/8209171/02/bench/int/bench-int-02-server.png>

    The variant from the test01 is the fastest in most cases, but I
    would prefer to proceed with the variant from test05:
    It's only slightly slower than 01, but contains less bytecode and
    helps to warm up numberOfLeadingZeros().

    Whatever happens, we ought to check in the microbenchmarks
    somewhere.  It looks like the new jmh-jdk-microbenchmarks is the
    place.
    I also suspect that tests for these methods could be improved
    (but there are existing hotspot tests).

    To make sure the new code is correct I ran an exhaustive loop from
    Integer.MIN_VALUE to MAX_VALUE inclusive and checked all the
    tested variants of implementation.

    nlz seems harder than ntz in the sense that for nlz "small ints"
    and random bits may have different optimal implementations.


    On Fri, Aug 10, 2018 at 7:03 PM, Ivan Gerasimov
    <ivan.gerasi...@oracle.com <mailto:ivan.gerasi...@oracle.com>> wrote:

        Thanks Martin!


        On 8/9/18 5:42 PM, Martin Buchholz wrote:


        On Thu, Aug 9, 2018 at 5:27 PM, Ivan Gerasimov
        <ivan.gerasi...@oracle.com
        <mailto:ivan.gerasi...@oracle.com>> wrote:

            I did not use the intrinsified variants of
            numberOfLeadingZeros in the benchmark.


        Oops! Should have looked more closely!
        Did you know about
        http://www.hackersdelight.org/hdcodetxt/ntz.c.txt
        <http://www.hackersdelight.org/hdcodetxt/ntz.c.txt>

        Ah, right, ntz1() is even better because it has less
        branches.  How could I miss that?

        Here's the updated webrev and benchmarks:

        http://cr.openjdk.java.net/~igerasim/8209171/01/webrev/
        <http://cr.openjdk.java.net/%7Eigerasim/8209171/01/webrev/>
        
http://cr.openjdk.java.net/~igerasim/8209171/01/bench/int/MyBenchmark.java
        
<http://cr.openjdk.java.net/%7Eigerasim/8209171/01/bench/int/MyBenchmark.java>
        
http://cr.openjdk.java.net/~igerasim/8209171/01/bench/long/MyBenchmark.java
        
<http://cr.openjdk.java.net/%7Eigerasim/8209171/01/bench/long/MyBenchmark.java>

-- With kind regards,
        Ivan Gerasimov



-- With kind regards,
    Ivan Gerasimov



--
With kind regards,
Ivan Gerasimov

Reply via email to