Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-26 Thread Paul Sandoz


> On Mar 26, 2018, at 2:26 PM, Ivan Gerasimov  wrote:
> 
> Thank you Andrew for looking into this!
> 
> 
> On 3/24/18 4:13 AM, Andrew Haley wrote:
>> On 03/20/2018 05:20 PM, Ivan Gerasimov wrote:
>>> I tried to run it, but the numbers are non-distinguishable for non-zero
>>> arguments.
>>> And my variant performs slightly better with zero argument.
>>> So, I think it's reasonable to keep the variant with the ternary operator.
>> I am suspicious of this argument.  Did you look at the generated code?
>> 
>> I get
>> 
>>cbnz  w10, 0x03ffa8202384
>>mov   w0, wzr
>> 
>> for the zero test and
>> 
>>cbz   w10, 0x03ffa81d228c
>>clz   w11, w10
>>orr   w10, wzr, #0x8000
>>lsr   w0, w10, w11;*iushr
>> 
>> for 42.
>> 
>> The branch at the start of both versions goes to a deoptimize trap.
>> We don't want deoptimize traps if we can avoid them, so the branchless
>> version is better IMO.
> This looks persuasive, so let's go ahead with the branchless variant!
> 

+1

In my experience with nano-benchamarks like this it's often informative to look 
at the generated code.

This is even easier now that JMH supports a dtrace ASM profiler on the Mac:

  http://mail.openjdk.java.net/pipermail/jmh-dev/2018-January/002686.html 


(YMMV, I ran into some issues on the mac whereby the spawned dtrace stopped 
logging output and needed to poked with a signal into action and dump the rest 
of its output. Not had time to investigate in detail and report back on this.)  

Paul.

Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-26 Thread Ivan Gerasimov

Thank you Andrew for looking into this!


On 3/24/18 4:13 AM, Andrew Haley wrote:

On 03/20/2018 05:20 PM, Ivan Gerasimov wrote:

I tried to run it, but the numbers are non-distinguishable for non-zero
arguments.
And my variant performs slightly better with zero argument.
So, I think it's reasonable to keep the variant with the ternary operator.

I am suspicious of this argument.  Did you look at the generated code?

I get

cbnzw10, 0x03ffa8202384
mov w0, wzr

for the zero test and

cbz w10, 0x03ffa81d228c
clz w11, w10
orr w10, wzr, #0x8000
lsr w0, w10, w11;*iushr

for 42.

The branch at the start of both versions goes to a deoptimize trap.
We don't want deoptimize traps if we can avoid them, so the branchless
version is better IMO.

This looks persuasive, so let's go ahead with the branchless variant!

With kind regards,
Ivan


I think your benchmark is of questionable validity because it always
uses the same value.  This is unlikely in real code.  I think the
versions should be benchmarked with a *varying* argument.



--
With kind regards,
Ivan Gerasimov



Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-24 Thread Andrew Haley
On 03/20/2018 05:20 PM, Ivan Gerasimov wrote:
> I tried to run it, but the numbers are non-distinguishable for non-zero 
> arguments.
> And my variant performs slightly better with zero argument.
> So, I think it's reasonable to keep the variant with the ternary operator.

I am suspicious of this argument.  Did you look at the generated code?

I get

   cbnz w10, 0x03ffa8202384
   mov  w0, wzr

for the zero test and

   cbz  w10, 0x03ffa81d228c
   clz  w11, w10
   orr  w10, wzr, #0x8000
   lsr  w0, w10, w11;*iushr

for 42.

The branch at the start of both versions goes to a deoptimize trap.
We don't want deoptimize traps if we can avoid them, so the branchless
version is better IMO.

I think your benchmark is of questionable validity because it always
uses the same value.  This is unlikely in real code.  I think the
versions should be benchmarked with a *varying* argument.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. 
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-20 Thread Brian Burkhalter
Hi Ivan,

On Mar 20, 2018, at 10:24 AM, Ivan Gerasimov  wrote:

> With non-zero values the new function performs 11-13% worse.
> I guess it's acceptable?

If we are mostly concerned about the intrinsified cases then I would think that 
this would be acceptable.

Brian

Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-20 Thread Ivan Gerasimov

Hi Claes!


On 3/20/18 2:46 AM, Claes Redestad wrote:

Hi,

On 2018-03-20 09:58, Ivan Gerasimov wrote:

Hello!

The hightestOneBit function doesn't have an intrinsic and is 
currently implemented with a dozen of instructions.
Alternatively, it could be implemented as MIN_VALUE >>> 
numberOfLeadingZeros(i), which works for all integers but zero.
The former function gets intrisified by hotspot, which results in 
+27% of throughput (see the jmh results below).


Would you please help review this simple fix?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8199843
WEBREV: http://cr.openjdk.java.net/~igerasim/8199843/00/webrev/


nice optimization!

Benchmark: 
http://cr.openjdk.java.net/~igerasim/8199843/00/MyBenchmark.java


Benchmark results:

Benchmark(arg)   Mode  Cnt Score Error Units
MyBenchmark.int_testMethod_new   0  thrpt   35 323430664.593 ±  
7492044.171  ops/s
MyBenchmark.int_testMethod_new  42  thrpt   35 298526237.078 ±  
5978291.689  ops/s
MyBenchmark.int_testMethod_new -42  thrpt   35 302903562.073 ±  
7984723.721  ops/s
MyBenchmark.int_testMethod_org   0  thrpt   35 236245042.891 ±  
3635990.596  ops/s
MyBenchmark.int_testMethod_org  42  thrpt   35 237903410.753 ±  
3437684.390  ops/s
MyBenchmark.int_testMethod_org -42  thrpt   35 238472580.618 ±  
2654886.010  ops/s
MyBenchmark.long_testMethod_new  0  thrpt   35 282646114.501 ± 
48028366.305  ops/s
MyBenchmark.long_testMethod_new 42  thrpt   35 282382228.405 ±  
5781529.307  ops/s
MyBenchmark.long_testMethod_new-42  thrpt   35 276724858.286 ±  
6529561.227  ops/s
MyBenchmark.long_testMethod_org  0  thrpt   35 198500211.972 ± 
15096862.367  ops/s
MyBenchmark.long_testMethod_org 42  thrpt   35 215854630.194 ±  
3112930.563  ops/s
MyBenchmark.long_testMethod_org-42  thrpt   35 217992805.521 ±  
2622877.082  ops/s


To nitpick a bit:

Please run with some appropriate time unit, e.g., "-tu us" to make 
results more human readable.

And where are the baseline results? :-)

It'd also be nice to verify we don't regress too much in case there's 
no intrinsic, i.e., test with the

intrinsic disabled.


Good point!

Here are results for Integer.highestOneBit with the intrinsic of 
numberOfLeadingZeros being disabled:

Benchmark   (arg)   Mode  CntScore Error   Units
MyBenchmark.int_testMethod_00_base  0  thrpt   35 324.369 ± 15.437  
ops/us
MyBenchmark.int_testMethod_00_base 42  thrpt   35 307.741 ± 29.623  
ops/us
MyBenchmark.int_testMethod_00_base-42  thrpt   35 324.563 ± 25.039  
ops/us
MyBenchmark.int_testMethod_01_org   0  thrpt   35 231.276 ±  8.392  
ops/us
MyBenchmark.int_testMethod_01_org  42  thrpt   35 230.466 ± 10.557  
ops/us
MyBenchmark.int_testMethod_01_org -42  thrpt   35 238.579 ±  8.257  
ops/us
MyBenchmark.int_testMethod_02_new   0  thrpt   35 326.752 ± 18.400  
ops/us
MyBenchmark.int_testMethod_02_new  42  thrpt   35 200.604 ±  8.139  
ops/us
MyBenchmark.int_testMethod_02_new -42  thrpt   35 212.313 ± 21.284  
ops/us


Base case just returns the argument, thus shows the maximum possible 
upper bound of the throughput.


With non-zero values the new function performs 11-13% worse.
I guess it's acceptable?

With kind regards,
Ivan

Thanks!

/Claes



--
With kind regards,
Ivan Gerasimov



Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-20 Thread Ivan Gerasimov

Hi Peter!


On 3/20/18 6:05 AM, Peter Levart wrote:

Hi Ivan,

What about branch-less variant?

public static int highestOneBit(int i) {
return i & (MIN_VALUE >>> numberOfLeadingZeros(i));
}


Nice variant!

I tried to run it, but the numbers are non-distinguishable for non-zero 
arguments.

And my variant performs slightly better with zero argument.
So, I think it's reasonable to keep the variant with the ternary operator.

Here are the results:
Benchmark   (arg)   Mode  CntScore Error   Units
MyBenchmark.int_testMethod_00_base  0  thrpt   35  352.509 ± 20.796  
ops/us
MyBenchmark.int_testMethod_00_base 42  thrpt   35 362.955 ±  7.278  
ops/us
MyBenchmark.int_testMethod_00_base-42  thrpt   35 366.084 ±  6.770  
ops/us
MyBenchmark.int_testMethod_01_org   0  thrpt   35 249.340 ±  1.981  
ops/us
MyBenchmark.int_testMethod_01_org  42  thrpt   35 249.007 ±  2.005  
ops/us
MyBenchmark.int_testMethod_01_org -42  thrpt   35 241.866 ±  4.759  
ops/us
MyBenchmark.int_testMethod_02_new   0  thrpt   35 328.389 ±  5.883  
ops/us
MyBenchmark.int_testMethod_02_new  42  thrpt   35 306.381 ±  5.505  
ops/us
MyBenchmark.int_testMethod_02_new -42  thrpt   35 300.328 ±  5.644  
ops/us
MyBenchmark.int_testMethod_03_and   0  thrpt   35 301.559 ±  5.453  
ops/us
MyBenchmark.int_testMethod_03_and  42  thrpt   35  299.694 ± 5.000  
ops/us
MyBenchmark.int_testMethod_03_and -42  thrpt   35  301.505 ± 5.664  
ops/us


The benchmark updated in place:
http://cr.openjdk.java.net/~igerasim/8199843/00/MyBenchmark.java

With kind regards
Ivan


Would it be any better for call sites that vary 0 and non-0 argument?

Regards, Peter

On 03/20/2018 09:58 AM, Ivan Gerasimov wrote:

Hello!

The hightestOneBit function doesn't have an intrinsic and is 
currently implemented with a dozen of instructions.
Alternatively, it could be implemented as MIN_VALUE >>> 
numberOfLeadingZeros(i), which works for all integers but zero.
The former function gets intrisified by hotspot, which results in 
+27% of throughput (see the jmh results below).


Would you please help review this simple fix?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8199843
WEBREV: http://cr.openjdk.java.net/~igerasim/8199843/00/webrev/
Benchmark: 
http://cr.openjdk.java.net/~igerasim/8199843/00/MyBenchmark.java


Benchmark results:

Benchmark(arg)   Mode  Cnt Score Error Units
MyBenchmark.int_testMethod_new   0  thrpt   35 323430664.593 ±  
7492044.171  ops/s
MyBenchmark.int_testMethod_new  42  thrpt   35 298526237.078 ±  
5978291.689  ops/s
MyBenchmark.int_testMethod_new -42  thrpt   35 302903562.073 ±  
7984723.721  ops/s
MyBenchmark.int_testMethod_org   0  thrpt   35 236245042.891 ±  
3635990.596  ops/s
MyBenchmark.int_testMethod_org  42  thrpt   35 237903410.753 ±  
3437684.390  ops/s
MyBenchmark.int_testMethod_org -42  thrpt   35 238472580.618 ±  
2654886.010  ops/s
MyBenchmark.long_testMethod_new  0  thrpt   35 282646114.501 ± 
48028366.305  ops/s
MyBenchmark.long_testMethod_new 42  thrpt   35 282382228.405 ±  
5781529.307  ops/s
MyBenchmark.long_testMethod_new-42  thrpt   35 276724858.286 ±  
6529561.227  ops/s
MyBenchmark.long_testMethod_org  0  thrpt   35 198500211.972 ± 
15096862.367  ops/s
MyBenchmark.long_testMethod_org 42  thrpt   35 215854630.194 ±  
3112930.563  ops/s
MyBenchmark.long_testMethod_org-42  thrpt   35 217992805.521 ±  
2622877.082  ops/s


Thanks in advance!





--
With kind regards,
Ivan Gerasimov



Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-20 Thread Peter Levart

Hi Ivan,

What about branch-less variant?

public static int highestOneBit(int i) {
    return i & (MIN_VALUE >>> numberOfLeadingZeros(i));
}

Would it be any better for call sites that vary 0 and non-0 argument?

Regards, Peter

On 03/20/2018 09:58 AM, Ivan Gerasimov wrote:

Hello!

The hightestOneBit function doesn't have an intrinsic and is currently 
implemented with a dozen of instructions.
Alternatively, it could be implemented as MIN_VALUE >>> 
numberOfLeadingZeros(i), which works for all integers but zero.
The former function gets intrisified by hotspot, which results in +27% 
of throughput (see the jmh results below).


Would you please help review this simple fix?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8199843
WEBREV: http://cr.openjdk.java.net/~igerasim/8199843/00/webrev/
Benchmark: 
http://cr.openjdk.java.net/~igerasim/8199843/00/MyBenchmark.java


Benchmark results:

Benchmark    (arg)   Mode  Cnt Score Error  Units
MyBenchmark.int_testMethod_new   0  thrpt   35 323430664.593 ±  
7492044.171  ops/s
MyBenchmark.int_testMethod_new  42  thrpt   35 298526237.078 ±  
5978291.689  ops/s
MyBenchmark.int_testMethod_new -42  thrpt   35 302903562.073 ±  
7984723.721  ops/s
MyBenchmark.int_testMethod_org   0  thrpt   35 236245042.891 ±  
3635990.596  ops/s
MyBenchmark.int_testMethod_org  42  thrpt   35 237903410.753 ±  
3437684.390  ops/s
MyBenchmark.int_testMethod_org -42  thrpt   35 238472580.618 ±  
2654886.010  ops/s
MyBenchmark.long_testMethod_new  0  thrpt   35 282646114.501 ± 
48028366.305  ops/s
MyBenchmark.long_testMethod_new 42  thrpt   35 282382228.405 ±  
5781529.307  ops/s
MyBenchmark.long_testMethod_new    -42  thrpt   35 276724858.286 ±  
6529561.227  ops/s
MyBenchmark.long_testMethod_org  0  thrpt   35 198500211.972 ± 
15096862.367  ops/s
MyBenchmark.long_testMethod_org 42  thrpt   35 215854630.194 ±  
3112930.563  ops/s
MyBenchmark.long_testMethod_org    -42  thrpt   35 217992805.521 ±  
2622877.082  ops/s


Thanks in advance!





Re: RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-20 Thread Claes Redestad

Hi,

On 2018-03-20 09:58, Ivan Gerasimov wrote:

Hello!

The hightestOneBit function doesn't have an intrinsic and is currently 
implemented with a dozen of instructions.
Alternatively, it could be implemented as MIN_VALUE >>> 
numberOfLeadingZeros(i), which works for all integers but zero.
The former function gets intrisified by hotspot, which results in +27% 
of throughput (see the jmh results below).


Would you please help review this simple fix?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8199843
WEBREV: http://cr.openjdk.java.net/~igerasim/8199843/00/webrev/


nice optimization!

Benchmark: 
http://cr.openjdk.java.net/~igerasim/8199843/00/MyBenchmark.java


Benchmark results:

Benchmark    (arg)   Mode  Cnt Score Error  Units
MyBenchmark.int_testMethod_new   0  thrpt   35 323430664.593 ±  
7492044.171  ops/s
MyBenchmark.int_testMethod_new  42  thrpt   35 298526237.078 ±  
5978291.689  ops/s
MyBenchmark.int_testMethod_new -42  thrpt   35 302903562.073 ±  
7984723.721  ops/s
MyBenchmark.int_testMethod_org   0  thrpt   35 236245042.891 ±  
3635990.596  ops/s
MyBenchmark.int_testMethod_org  42  thrpt   35 237903410.753 ±  
3437684.390  ops/s
MyBenchmark.int_testMethod_org -42  thrpt   35 238472580.618 ±  
2654886.010  ops/s
MyBenchmark.long_testMethod_new  0  thrpt   35 282646114.501 ± 
48028366.305  ops/s
MyBenchmark.long_testMethod_new 42  thrpt   35 282382228.405 ±  
5781529.307  ops/s
MyBenchmark.long_testMethod_new    -42  thrpt   35 276724858.286 ±  
6529561.227  ops/s
MyBenchmark.long_testMethod_org  0  thrpt   35 198500211.972 ± 
15096862.367  ops/s
MyBenchmark.long_testMethod_org 42  thrpt   35 215854630.194 ±  
3112930.563  ops/s
MyBenchmark.long_testMethod_org    -42  thrpt   35 217992805.521 ±  
2622877.082  ops/s


To nitpick a bit:

Please run with some appropriate time unit, e.g., "-tu us" to make 
results more human readable.

And where are the baseline results? :-)

It'd also be nice to verify we don't regress too much in case there's no 
intrinsic, i.e., test with the

intrinsic disabled.

Thanks!

/Claes


RFR 8199843 : Optimize Integer/Long.highestOneBit()

2018-03-20 Thread Ivan Gerasimov

Hello!

The hightestOneBit function doesn't have an intrinsic and is currently 
implemented with a dozen of instructions.
Alternatively, it could be implemented as MIN_VALUE >>> 
numberOfLeadingZeros(i), which works for all integers but zero.
The former function gets intrisified by hotspot, which results in +27% 
of throughput (see the jmh results below).


Would you please help review this simple fix?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8199843
WEBREV: http://cr.openjdk.java.net/~igerasim/8199843/00/webrev/
Benchmark: http://cr.openjdk.java.net/~igerasim/8199843/00/MyBenchmark.java

Benchmark results:

Benchmark(arg)   Mode  Cnt Score  Error  
Units
MyBenchmark.int_testMethod_new   0  thrpt   35 323430664.593 ±  
7492044.171  ops/s
MyBenchmark.int_testMethod_new  42  thrpt   35 298526237.078 ±  
5978291.689  ops/s
MyBenchmark.int_testMethod_new -42  thrpt   35 302903562.073 ±  
7984723.721  ops/s
MyBenchmark.int_testMethod_org   0  thrpt   35 236245042.891 ±  
3635990.596  ops/s
MyBenchmark.int_testMethod_org  42  thrpt   35 237903410.753 ±  
3437684.390  ops/s
MyBenchmark.int_testMethod_org -42  thrpt   35 238472580.618 ±  
2654886.010  ops/s
MyBenchmark.long_testMethod_new  0  thrpt   35 282646114.501 ± 
48028366.305  ops/s
MyBenchmark.long_testMethod_new 42  thrpt   35 282382228.405 ±  
5781529.307  ops/s
MyBenchmark.long_testMethod_new-42  thrpt   35 276724858.286 ±  
6529561.227  ops/s
MyBenchmark.long_testMethod_org  0  thrpt   35 198500211.972 ± 
15096862.367  ops/s
MyBenchmark.long_testMethod_org 42  thrpt   35 215854630.194 ±  
3112930.563  ops/s
MyBenchmark.long_testMethod_org-42  thrpt   35 217992805.521 ±  
2622877.082  ops/s


Thanks in advance!

--
With kind regards,
Ivan Gerasimov