Re: RFR (S): JDK-8191328: Avoid unnecessary overhead in CRC32C

Dmitry Chuyko Tue, 28 Nov 2017 07:38:13 -0800

The initial version of the patch produced worse C1 code because of the additionally introduced locals, which may matter for client (arm32). I fixed this by simply grouping the xors with parentheses. I also took measurements with Graal and AOT. Note that in the tiered case with AOT-compiled java.base, the intrinsic is used if present.
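To make the two changes discussed here concrete, below is a minimal sketch, not the code from the webrev. It assumes a plain slicing-by-4 CRC-32C written with ByteBuffer; the class name Crc32cHoistSketch and the methods before, after and tail are hypothetical names chosen for illustration, and the real JDK implementation may read memory and lay out its tables differently. The before method queries ByteOrder.nativeOrder() on every iteration and combines the table lookups through extra locals; the after method hoists the byte-order check out of the loop and groups the xors with parentheses instead of introducing locals.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the JDK's CRC32C source.
final class Crc32cHoistSketch {
    private static final int POLY = 0x82F63B78; // reflected CRC-32C polynomial
    private static final int[][] T = new int[4][256];

    static {
        // T[0] is the usual byte-at-a-time table.
        for (int i = 0; i < 256; i++) {
            int r = i;
            for (int k = 0; k < 8; k++) {
                r = (r & 1) != 0 ? (r >>> 1) ^ POLY : r >>> 1;
            }
            T[0][i] = r;
        }
        // T[1..3] extend it so that four bytes can be folded per step.
        for (int i = 0; i < 256; i++) {
            for (int t = 1; t < 4; t++) {
                int prev = T[t - 1][i];
                T[t][i] = (prev >>> 8) ^ T[0][prev & 0xFF];
            }
        }
    }

    // Byte-at-a-time handling of the 0..3 bytes left after the word loop.
    private static int tail(byte[] b, int from, int crc) {
        for (int i = from; i < b.length; i++) {
            crc = (crc >>> 8) ^ T[0][(crc ^ b[i]) & 0xFF];
        }
        return crc;
    }

    // "Before" shape: nativeOrder() is queried in the loop, xors use extra locals.
    static int before(byte[] b) {
        ByteBuffer buf = ByteBuffer.wrap(b).order(ByteOrder.nativeOrder());
        int crc = 0xFFFFFFFF;
        int i = 0;
        for (; i + 4 <= b.length; i += 4) {
            int word = buf.getInt(i);
            if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
                word = Integer.reverseBytes(word);
            }
            crc ^= word;
            int c0 = T[3][crc & 0xFF];
            int c1 = T[2][(crc >>> 8) & 0xFF];
            int c2 = T[1][(crc >>> 16) & 0xFF];
            int c3 = T[0][crc >>> 24];
            crc = c0 ^ c1 ^ c2 ^ c3;
        }
        return ~tail(b, i, crc);
    }

    // "After" shape: byte-order check hoisted, xors grouped with parentheses.
    static int after(byte[] b) {
        final boolean little = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
        ByteBuffer buf = ByteBuffer.wrap(b).order(ByteOrder.nativeOrder());
        int crc = 0xFFFFFFFF;
        int i = 0;
        for (; i + 4 <= b.length; i += 4) {
            int word = buf.getInt(i);
            crc ^= little ? word : Integer.reverseBytes(word);
            crc = (T[3][crc & 0xFF] ^ T[2][(crc >>> 8) & 0xFF])
                ^ (T[1][(crc >>> 16) & 0xFF] ^ T[0][crc >>> 24]);
        }
        return ~tail(b, i, crc);
    }
}
```

Both methods compute the same checksum; the difference is only in the shape of the code handed to the interpreter and the JIT compilers, which is where the measurements below differ.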


Updated webrev: http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/

Updated benchmark: http://cr.openjdk.java.net/~dchuyko/8191328/webrev.01/CRC32CAltBench.java


Results on my x86 laptop and JDK 10:

Tiered
before  375 ± 6  ns/op
after   334 ± 3  ns/op 11%

Tiered with Graal (JVMCI)
before  356 ± 7  ns/op
after   327 ± 6  ns/op 8%

Tiered with AOT compiled benchmark (non-tiered)
before  1308 ± 58  ns/op
after   1010 ±  8  ns/op 1.3x

Tiered with -XX:MaxInlineLevel=0
before  660 ± 4  ns/op
after   338 ± 3  ns/op 1.9x

C1
before  498 ± 4  ns/op
after   495 ± 4  ns/op same

Interpreter
before  40844 ± 333  ns/op
after   24777 ± 624  ns/op 1.7x

-Dmitry


On 11/16/2017 07:42 PM, Dmitry Chuyko wrote:

On 11/15/2017 09:44 PM, Andrew Haley wrote:
On 15/11/17 18:38, Vitaly Davidovich wrote:
On Wed, Nov 15, 2017 at 12:40 PM, Andrew Haley <a...@redhat.com> wrote:
On 15/11/17 15:38, Alan Bateman wrote:
Moving the nativeOrder out of the loop makes sense but I'm curious about
the context for improving this implementation.
I wonder about lifting ByteOrder.nativeOrder().  Maybe it fails to
inline because the method is too large: if that happens, we really
lose.  I'm not seeing that, though: it seems to be inlined just fine,
and has no effect.
Sure, it is the effect of missing inlining. But you can relatively easily break it with your tiered JIT settings. Not sure about AOT. For example (in HotSpot): -XX:-Inline, -XX:MaxInlineLevel=0 (no surprise to meet this one in the wild), -XX:FreqInlineSize=3, -XX:InlineSmallCode=15.
In any case, this patch doesn't help anything on my test hardware.
Is this with -Xcomp though? That can generate crap code because
there's no profiling information.  Not that -Xcomp should be the way
to test peak performance IMO, but that is the setting that was used I
believe.
Another noticeable case is -Xint, where absolute times of CRC calculation are quite long.
Here is a benchmark that is easier to experiment with (no need to build jdk or to turn off intrinsics):
http://cr.openjdk.java.net/~dchuyko/8191328/CRC32CAltBench.java

Some x86 results:

default tiered
before  380.957 ± 11.621  ns/op
after   350.838 ±  5.149  ns/op

-XX:MaxInlineLevel=0
before  656.791 ± 8.216  ns/op
after   340.999 ± 2.686  ns/op

-Xint
before  36113.441 ± 197.716  ns/op
after   26928.593 ± 133.309  ns/op

-Dmitry
Shrug; maybe.  We shouldn't mess the code up for -Xcomp.
