On Tue, 8 Sep 2020 11:59:36 GMT, Jason Tatton 
<github.com+70893615+jasontatton-...@openjdk.org> wrote:

> This is an implementation of the indexOf(char) intrinsic for StringLatin1 (1-byte
> encoded Strings), provided for x86 and ARM64. The implementation is greatly inspired
> by the indexOf(char) intrinsic for StringUTF16. To incorporate it I had to make a
> small change to StringLatin1.java (refactoring the functionality into an intrinsified
> private method) as well as add code for C2. Submitted to hotspot-compiler-dev and
> core-libs-dev, as this patch contains changes to both hotspot and
> java/lang/StringLatin1.java. https://bugs.openjdk.java.net/browse/JDK-8173585
> 
> Details of testing:
> ============
> I have created a jtreg test “compiler/intrinsics/string/TestStringLatin1IndexOfChar”
> to cover this new intrinsic. Note that, particularly for the x86 implementation of
> the intrinsic, the code path taken depends on the length of the input String. Hence
> the test has been designed to cover all these cases. In summary they are:
> - A “short” string of < 16 characters.
> - A SIMD String of 16 – 31 characters.
> - An AVX2 SIMD String of 32+ characters.
> 
> Hardware used for testing:
> -----------------------------
> 
> - Intel Xeon CPU E5-2680 (the JVM did not recognize this as having AVX2 support).
> - Intel i7 processor (with AVX2 support).
> - AWS Graviton 2 (ARM64 processor).
> 
> I also ran ‘run-test-tier1’ and ‘run-test-tier2’ for x86_64 and aarch64.
> 
> Possible future enhancements:
> ====================
> For the x86 implementation there may be two further improvements we can make 
> in order to improve performance of both
> the StringUTF16 and StringLatin1 indexOf(char) intrinsics:
> 1. Make use of AVX-512 instructions.
> 2. For “short” Strings (see below), I think it may be possible to modify the 
> existing algorithm to still use SSE SIMD
> instructions instead of a loop.
> 
> Benchmark results:
> ============
> **Without** the new StringLatin1 indexOf(char) intrinsic:
> 
> | Benchmark | Mode | Cnt | Score | Error | Units |
> | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
> | IndexOfBenchmark.latin1_mixed_char | avgt | 5 | **26,389.129** | ± 182.581 | ns/op |
> | IndexOfBenchmark.utf16_mixed_char | avgt | 5 | 17,885.383 | ± 435.933 | ns/op |
> 
> 
> **With** the new StringLatin1 indexOf(char) intrinsic:
> 
> | Benchmark | Mode | Cnt | Score | Error | Units |
> | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
> | IndexOfBenchmark.latin1_mixed_char | avgt | 5 | **17,875.185** | ± 407.716 | ns/op |
> | IndexOfBenchmark.utf16_mixed_char | avgt | 5 | 18,292.802 | ± 167.306 | ns/op |
> 
> 
> The objective of the patch is to bring the performance of StringLatin1 
> indexOf(char) in line with StringUTF16
> indexOf(char) for x86 and ARM64. We can see above that this has been 
> achieved. Similar results were obtained when
> running on ARM.
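
As a point of reference for the refactor of StringLatin1.java into an intrinsified private method mentioned in the description above, here is a minimal, self-contained sketch of that shape. The class name is invented and the bounds handling is illustrative; the actual change lives in java.lang.StringLatin1 and may differ in detail.

```java
// Sketch of the "intrinsified private method" pattern: the argument checks stay
// in ordinary Java, while the tight search loop is factored into a small private
// method that C2 can recognize and replace with a hand-written SIMD stub on CPUs
// that support it. The Java loop remains the interpreter/C1 fallback.
final class Latin1IndexOfSketch {

    // True if the code point fits in a single Latin-1 byte.
    private static boolean canEncode(int cp) {
        return cp >>> 8 == 0;
    }

    // Public entry point: encodability and range checks.
    static int indexOf(byte[] value, int ch, int fromIndex) {
        if (!canEncode(ch)) {
            return -1;                       // the char can never occur in a Latin-1 string
        }
        int max = value.length;
        if (fromIndex < 0) {
            fromIndex = 0;
        } else if (fromIndex >= max) {
            return -1;
        }
        return indexOfChar(value, ch, fromIndex, max);
    }

    // In the real patch this method carries the HotSpot intrinsic-candidate
    // annotation so C2 can substitute the SIMD implementation.
    private static int indexOfChar(byte[] value, int ch, int fromIndex, int max) {
        byte c = (byte) ch;
        for (int i = fromIndex; i < max; i++) {   // plain scalar search loop
            if (value[i] == c) {
                return i;
            }
        }
        return -1;
    }
}
```

Factoring the loop out this way keeps the intrinsic's contract small: C2 only has to match a method taking the byte array, the char, and a range, and everything else continues to run as ordinary Java.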

Hi Andrew,

The current indexOf(char) intrinsic for StringUTF16 and the new one here for 
StringLatin1 both use AVX2 (i.e. 256-bit) instructions, which are also subject to 
the frequency scaling you pointed out for the AVX-512 instructions. Of course, in a 
world where all the work taking place is AVX instructions this wouldn’t be an issue, 
but in mixed-mode execution it is a problem.

However, the compiler does know the capabilities of the CPU it is optimizing code 
for, and it can decide to use AVX instructions only when the CPU supports them AND 
using them would not be detrimental to performance. In fact, there is a flag one can 
use to interact with this: -XX:UseAVX=version.
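
For illustration, here is roughly how one can pin that flag in a JMH run to compare AVX levels. The class, field, and workload below are invented for the example and are not the IndexOfBenchmark used for the numbers above.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 1, jvmArgsAppend = {"-XX:UseAVX=2"})   // try 0, 1 or 2 to compare AVX levels
@State(Scope.Thread)
public class IndexOfCharAvxSketch {

    private String latin1;   // all chars < 256, so the 1-byte coder is used (compact strings)

    @Setup
    public void setup() {
        // Long enough to reach the 32+ character AVX2 path described above.
        latin1 = "abcdefghijklmnopqrstuvwxyz".repeat(40) + "#";
    }

    @Benchmark
    public int latin1IndexOfChar() {
        // With compact strings enabled (the default), this dispatches to
        // StringLatin1.indexOf internally.
        return latin1.indexOf('#');
    }
}
```

Running the same benchmark with different -XX:UseAVX values then gives a rough picture of how much of the difference comes from the wider vector path.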

This of course made testing this patch an interesting experience, as the AVX2 
instructions were not enabled on the Xeon processors I had access to at AWS; in the 
end I was able to use an i7 on my corporate MacBook to validate the code.

From: mlbridge[bot] <notificati...@github.com>
Sent: 11 September 2020 17:01
To: openjdk/jdk <j...@noreply.github.com>
Cc: Tatton, Jason <jptat...@amazon.com>; Mention <ment...@noreply.github.com>
Subject: Re: [openjdk/jdk] 8173585: Intrinsify StringLatin1.indexOf(char) (#71)


Mailing list message from Andrew Haley <a...@redhat.com> on 
hotspot-dev <hotspot-...@openjdk.java.net>:

On 11/09/2020 11:23, Jason Tatton wrote:

For the x86 implementation there may be two further improvements we
can make in order to improve performance of both the StringUTF16 and
StringLatin1 indexOf(char) intrinsics:

1. Make use of AVX-512 instructions.

Is this really a good idea?

When the processor detects Intel AVX instructions, additional
voltage is applied to the core. With the additional voltage applied,
the processor could run hotter, requiring the operating frequency to
be reduced to maintain operations within the TDP limits. The higher
voltage is maintained for 1 millisecond after the last Intel AVX
instruction completes, and then the voltage returns to the nominal
TDP voltage level.

https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf

So, if StringLatin1.indexOf(char) executes enough to make a difference
to any real-world program, the effect may well be to slow down the clock
for all of the code that does not use AVX.

2. For “short” Strings (see below), I think it may be possible to modify the 
existing algorithm to still use SSE SIMD instructions instead of a loop.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


-------------

PR: https://git.openjdk.java.net/jdk/pull/71
