The initial version of the patch produced worse C1 code because of the
additional locals it introduced; this may matter for client VMs (arm32).
I fixed this by simply grouping the XORs with parentheses. I also took
measurements with Graal and AOT. Note that in tiered mode with an
AOT-compiled java.base, the intrinsic is used if present.
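To illustrate the XOR grouping mentioned above, here is a minimal sketch. It is not the code from the webrev: the class name, method names and table contents are hypothetical stand-ins, and the tables are filled with arbitrary values rather than real CRC32C tables. It only shows that combining the lookups inside one parenthesized expression gives the same result as naming the partial results with extra locals.

```java
public final class XorGrouping {
    // Hypothetical lookup tables filled with arbitrary values; these are
    // NOT the real CRC32C tables, only stand-ins for illustration.
    static final int[] T0 = new int[256];
    static final int[] T1 = new int[256];
    static final int[] T2 = new int[256];
    static final int[] T3 = new int[256];

    static {
        for (int i = 0; i < 256; i++) {
            T0[i] = i * 0x01010101;
            T1[i] = i * 0x9E3779B9;
            T2[i] = Integer.rotateLeft(i * 0x85EBCA6B, 13);
            T3[i] = (~i) * 0xC2B2AE35;
        }
    }

    // Variant with extra locals holding the partial XOR results.
    static int withLocals(int a, int b) {
        int lo = T0[a & 0xFF] ^ T1[(a >>> 8) & 0xFF];
        int hi = T2[b & 0xFF] ^ T3[(b >>> 8) & 0xFF];
        return lo ^ hi;
    }

    // Variant with the same grouping expressed by parentheses only,
    // so no additional locals appear in the bytecode.
    static int withParens(int a, int b) {
        return (T0[a & 0xFF] ^ T1[(a >>> 8) & 0xFF])
             ^ (T2[b & 0xFF] ^ T3[(b >>> 8) & 0xFF]);
    }

    public static void main(String[] args) {
        int a = 0x12345678, b = 0x0BADCAFE;
        System.out.println(withLocals(a, b) == withParens(a, b));
    }
}
```

Both variants compute the same value; whether one compiles to better code depends on the compiler tier, which is what the measurements below are about.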
Updated webrev: http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/
Updated benchmark:
http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/CRC32CAltBench.java
Results on my x86 laptop and JDK 10:

Tiered
  before     375 ± 6 ns/op
  after      334 ± 3 ns/op      11%
Tiered with Graal (JVMCI)
  before     356 ± 7 ns/op
  after      327 ± 6 ns/op      8%
Tiered with AOT compiled benchmark (non-tiered)
  before    1308 ± 58 ns/op
  after     1010 ± 8 ns/op      1.3x
Tiered with -XX:MaxInlineLevel=0
  before     660 ± 4 ns/op
  after      338 ± 3 ns/op      1.9x
C1
  before     498 ± 4 ns/op
  after      495 ± 4 ns/op      same
Interpreter
  before   40844 ± 333 ns/op
  after    24777 ± 624 ns/op    1.7x

-Dmitry
On 11/16/2017 07:42 PM, Dmitry Chuyko wrote:
On 11/15/2017 09:44 PM, Andrew Haley wrote:
On 15/11/17 18:38, Vitaly Davidovich wrote:
On Wed, Nov 15, 2017 at 12:40 PM, Andrew Haley <a...@redhat.com> wrote:
On 15/11/17 15:38, Alan Bateman wrote:
Moving the nativeOrder out of the loop makes sense but I'm curious
about the context for improving this implementation.
I wonder about lifting ByteOrder.nativeOrder(). Maybe it fails to
inline because the method is too large: if that happens, we really
lose. I'm not seeing that, though: it seems to be inlined just fine,
and has no effect.
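For readers following along, the lifting being discussed can be sketched roughly as follows. This is a simplified illustration, not the CRC32C code itself: the class, the `word` helper and the folding step are hypothetical, and only `ByteOrder.nativeOrder()` is the real API. The first loop queries the byte order on every iteration and is cheap only if that call is inlined; the second queries it once before the loop.

```java
import java.nio.ByteOrder;

public final class HoistNativeOrder {
    // Assemble four bytes into an int the way a word load in the given
    // byte order would see them (hypothetical helper for illustration).
    static int word(byte[] b, int off, boolean little) {
        int b0 = b[off] & 0xFF, b1 = b[off + 1] & 0xFF;
        int b2 = b[off + 2] & 0xFF, b3 = b[off + 3] & 0xFF;
        return little ? (b3 << 24 | b2 << 16 | b1 << 8 | b0)
                      : (b0 << 24 | b1 << 16 | b2 << 8 | b3);
    }

    // Byte order is queried on every iteration of the loop.
    static int foldInLoop(byte[] b) {
        int acc = 0;
        for (int off = 0; off + 4 <= b.length; off += 4) {
            boolean little = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
            acc = Integer.rotateLeft(acc, 5) ^ word(b, off, little);
        }
        return acc;
    }

    // Byte order is queried once and kept in a local outside the loop.
    static int foldHoisted(byte[] b) {
        final boolean little = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
        int acc = 0;
        for (int off = 0; off + 4 <= b.length; off += 4) {
            acc = Integer.rotateLeft(acc, 5) ^ word(b, off, little);
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i * 31 + 7);
        System.out.println(foldInLoop(data) == foldHoisted(data));
    }
}
```

The two methods return the same value; any difference is in how much work the loop body does when the call is not inlined.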
Sure, it is the effect of missing inlining. But inlining is relatively
easy to break with your tiered JIT settings. Not sure about AOT. For
example (in HotSpot):
-XX:-Inline, -XX:MaxInlineLevel=0 (it would be no surprise to meet this
one in the wild), -XX:FreqInlineSize=3, -XX:InlineSmallCode=15..
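When experimenting with flags like these, it can help to confirm from inside the running JVM which inlining-related options were actually passed. The sketch below is one possible way to do that; the class and the `inlineFlags` helper are hypothetical, while `ManagementFactory.getRuntimeMXBean().getInputArguments()` is the standard API. It only reports flags given on the command line, not defaults.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public final class InlineFlags {
    // Keep only -XX options whose name mentions "Inline"
    // (hypothetical filter, chosen to match the flags listed above).
    static List<String> inlineFlags(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(a -> a.startsWith("-XX:") && a.contains("Inline"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Arguments the JVM itself was started with (not the program arguments).
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println(inlineFlags(jvmArgs));
    }
}
```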
In any case, this patch doesn't help anything on my test hardware.
Is this with -Xcomp though? That can generate crap code because
there's no profiling information. Not that -Xcomp should be the way
to test peak performance IMO, but that is the setting that was used I
believe.
Another noticeable case is -Xint where absolute times of CRC
calculation are quite long.
Here is a benchmark that is easier to experiment with (no need to
build jdk or to turn off intrinsics):
http://cr.openjdk.java.net/~dchuyko/8191328/CRC32CAltBench.java
Some x86 results:

default tiered
  before     380.957 ± 11.621 ns/op
  after      350.838 ± 5.149 ns/op
-XX:MaxInlineLevel=0
  before     656.791 ± 8.216 ns/op
  after      340.999 ± 2.686 ns/op
-Xint
  before   36113.441 ± 197.716 ns/op
  after    26928.593 ± 133.309 ns/op

-Dmitry
Shrug; maybe. We shouldn't mess the code up for -Xcomp.