Integrated: 8335921: Fix HotSpot VM build without JVMTI

2024-07-17 Thread Vladimir Kozlov
On Wed, 17 Jul 2024 03:37:36 GMT, Vladimir Kozlov  wrote:

> Citing David Holmes from bug report:
> "We provided the ability to leave out certain VM services (JVMTI, GC's other 
> than serial, ...) as part of the design of the MinimalVM to support Java SE 
> Embedded, along with the Compact Profiles of JDK 8. This manifested in the 
> source code as a set of INCLUDE_XXX ifdef guards. The build system later 
> exposed these as individual --with-jvm-features=xxx,yyy support. However, it 
> was never intended (and certainly not tested) that you could mix-and-match 
> arbitrary subsets of these VM features at will. Consequently if you start 
> trying to do this you will find things that need fixing."
> 
> I added `INCLUDE_JVMTI` guards in two places where it was missed: JVMCI and 
> JFR.  Affected code was added recently, in the past year. After that I was 
> able to build VM on all supported platforms.
> 
> Note: building VM without JVMTI is not officially supported feature. We are 
> not testing it and such failures (missing guards) are not unexpected.
> 
> A lot of tests failed with VM without JVMTI. All are expected failures. I 
> listed failed tests in bug report.
> I fixed (added requires `vm.jvmti`) only one which was part of 
> [JDK-8257967](https://bugs.openjdk.org/browse/JDK-8257967) changes which 
> introduced JFR code without `INCLUDE_JVMTI` guards.
> 
> I ran 2 rounds of testing:
> 
> First, only **tier1** with VM built without JVMTI to see if builds passed and 
> which tests affected. I wrote comment in bug report which tests failed (all 
> expected to fail without JVMTI).
> 
> Second round of testing with JVMTI in VM: tier1-4

This pull request has now been integrated.

Changeset: bcb5e695
Author:Vladimir Kozlov 
URL:   
https://git.openjdk.org/jdk/commit/bcb5e69505f6cc8a4f323924cd2c58e630595fc0
Stats: 20 lines in 8 files changed: 7 ins; 0 del; 13 mod

8335921: Fix HotSpot VM build without JVMTI

Reviewed-by: dholmes, shade

-

PR: https://git.openjdk.org/jdk/pull/20209


Re: RFR: 8335921: Fix HotSpot VM build without JVMTI

2024-07-17 Thread Vladimir Kozlov
On Wed, 17 Jul 2024 03:37:36 GMT, Vladimir Kozlov  wrote:

> Citing David Holmes from bug report:
> "We provided the ability to leave out certain VM services (JVMTI, GC's other 
> than serial, ...) as part of the design of the MinimalVM to support Java SE 
> Embedded, along with the Compact Profiles of JDK 8. This manifested in the 
> source code as a set of INCLUDE_XXX ifdef guards. The build system later 
> exposed these as individual --with-jvm-features=xxx,yyy support. However, it 
> was never intended (and certainly not tested) that you could mix-and-match 
> arbitrary subsets of these VM features at will. Consequently if you start 
> trying to do this you will find things that need fixing."
> 
> I added `INCLUDE_JVMTI` guards in two places where it was missed: JVMCI and 
> JFR.  Affected code was added recently, in the past year. After that I was 
> able to build VM on all supported platforms.
> 
> Note: building VM without JVMTI is not officially supported feature. We are 
> not testing it and such failures (missing guards) are not unexpected.
> 
> A lot of tests failed with VM without JVMTI. All are expected failures. I 
> listed failed tests in bug report.
> I fixed (added requires `vm.jvmti`) only one which was part of 
> [JDK-8257967](https://bugs.openjdk.org/browse/JDK-8257967) changes which 
> introduced JFR code without `INCLUDE_JVMTI` guards.
> 
> I ran 2 rounds of testing:
> 
> First, only **tier1** with VM built without JVMTI to see if builds passed and 
> which tests affected. I wrote comment in bug report which tests failed (all 
> expected to fail without JVMTI).
> 
> Second round of testing with JVMTI in VM: tier1-4

Thank you, Aleksey, for review.

-

PR Comment: https://git.openjdk.org/jdk/pull/20209#issuecomment-2233729056


Re: RFR: 8335921: Fix HotSpot VM build without JVMTI

2024-07-16 Thread Vladimir Kozlov
On Wed, 17 Jul 2024 04:52:35 GMT, David Holmes  wrote:

>> Citing David Holmes from bug report:
>> "We provided the ability to leave out certain VM services (JVMTI, GC's other 
>> than serial, ...) as part of the design of the MinimalVM to support Java SE 
>> Embedded, along with the Compact Profiles of JDK 8. This manifested in the 
>> source code as a set of INCLUDE_XXX ifdef guards. The build system later 
>> exposed these as individual --with-jvm-features=xxx,yyy support. However, it 
>> was never intended (and certainly not tested) that you could mix-and-match 
>> arbitrary subsets of these VM features at will. Consequently if you start 
>> trying to do this you will find things that need fixing."
>> 
>> I added `INCLUDE_JVMTI` guards in two places where it was missed: JVMCI and 
>> JFR.  Affected code was added recently, in the past year. After that I was 
>> able to build VM on all supported platforms.
>> 
>> Note: building VM without JVMTI is not officially supported feature. We are 
>> not testing it and such failures (missing guards) are not unexpected.
>> 
>> A lot of tests failed with VM without JVMTI. All are expected failures. I 
>> listed failed tests in bug report.
>> I fixed (added requires `vm.jvmti`) only one which was part of 
>> [JDK-8257967](https://bugs.openjdk.org/browse/JDK-8257967) changes which 
>> introduced JFR code without `INCLUDE_JVMTI` guards.
>> 
>> I ran 2 rounds of testing:
>> 
>> First, only **tier1** with VM built without JVMTI to see if builds passed 
>> and which tests affected. I wrote comment in bug report which tests failed 
>> (all expected to fail without JVMTI).
>> 
>> Second round of testing with JVMTI in VM: tier1-4
>
> src/hotspot/share/jfr/instrumentation/jfrJvmtiAgent.hpp line 35:
> 
>> 33:   JfrJvmtiAgent();
>> 34:   ~JfrJvmtiAgent();
>> 35:   static bool create() NOT_JVMTI_RETURN_(true);
> 
> It initially seemed odd to return `true` here, but looking through the JFR 
> code that interacts with the Agent it seems the right way to view this is 
> that without JVMTI we have a no-op agent.

Right.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/20209#discussion_r1680433885


Re: RFR: 8335921: Fix HotSpot VM build without JVMTI

2024-07-16 Thread Vladimir Kozlov
On Wed, 17 Jul 2024 03:37:36 GMT, Vladimir Kozlov  wrote:

> Citing David Holmes from bug report:
> "We provided the ability to leave out certain VM services (JVMTI, GC's other 
> than serial, ...) as part of the design of the MinimalVM to support Java SE 
> Embedded, along with the Compact Profiles of JDK 8. This manifested in the 
> source code as a set of INCLUDE_XXX ifdef guards. The build system later 
> exposed these as individual --with-jvm-features=xxx,yyy support. However, it 
> was never intended (and certainly not tested) that you could mix-and-match 
> arbitrary subsets of these VM features at will. Consequently if you start 
> trying to do this you will find things that need fixing."
> 
> I added `INCLUDE_JVMTI` guards in two places where it was missed: JVMCI and 
> JFR.  Affected code was added recently, in the past year. After that I was 
> able to build VM on all supported platforms.
> 
> Note: building VM without JVMTI is not officially supported feature. We are 
> not testing it and such failures (missing guards) are not unexpected.
> 
> A lot of tests failed with VM without JVMTI. All are expected failures. I 
> listed failed tests in bug report.
> I fixed (added requires `vm.jvmti`) only one which was part of 
> [JDK-8257967](https://bugs.openjdk.org/browse/JDK-8257967) changes which 
> introduced JFR code without `INCLUDE_JVMTI` guards.
> 
> I ran 2 rounds of testing:
> 
> First, only **tier1** with VM built without JVMTI to see if builds passed and 
> which tests affected. I wrote comment in bug report which tests failed (all 
> expected to fail without JVMTI).
> 
> Second round of testing with JVMTI in VM: tier1-4

Thank you, David, for review.

-

PR Comment: https://git.openjdk.org/jdk/pull/20209#issuecomment-2232459481


RFR: 8335921: Fix HotSpot VM build without JVMTI

2024-07-16 Thread Vladimir Kozlov
Citing David Holmes from bug report:
"We provided the ability to leave out certain VM services (JVMTI, GC's other 
than serial, ...) as part of the design of the MinimalVM to support Java SE 
Embedded, along with the Compact Profiles of JDK 8. This manifested in the 
source code as a set of INCLUDE_XXX ifdef guards. The build system later 
exposed these as individual --with-jvm-features=xxx,yyy support. However, it 
was never intended (and certainly not tested) that you could mix-and-match 
arbitrary subsets of these VM features at will. Consequently if you start 
trying to do this you will find things that need fixing."

I added `INCLUDE_JVMTI` guards in two places where it was missed: JVMCI and 
JFR.  Affected code was added recently, in the past year. After that I was able 
to build VM on all supported platforms.

Note: building VM without JVMTI is not officially supported feature. We are not 
testing it and such failures (missing guards) are not unexpected.

A lot of tests failed with VM without JVMTI. All are expected failures. I 
listed failed tests in bug report.
I fixed (added requires `vm.jvmti`) only one which was part of 
[JDK-8257967](https://bugs.openjdk.org/browse/JDK-8257967) changes which 
introduced JFR code without `INCLUDE_JVMTI` guards.

I ran 2 rounds of testing:

First, only **tier1** with VM built without JVMTI to see if builds passed and 
which tests affected. I wrote comment in bug report which tests failed (all 
expected to fail without JVMTI).

Second round of testing with JVMTI in VM: tier1-4

-

Commit messages:
 - 8335921: Fix HotSpot VM build without JVMTI

Changes: https://git.openjdk.org/jdk/pull/20209/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20209&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8335921
  Stats: 20 lines in 8 files changed: 7 ins; 0 del; 13 mod
  Patch: https://git.openjdk.org/jdk/pull/20209.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/20209/head:pull/20209

PR: https://git.openjdk.org/jdk/pull/20209


Re: RFR: 8327173: HotSpot Style Guide needs update regarding nullptr vs NULL [v2]

2024-03-05 Thread Vladimir Kozlov
On Tue, 5 Mar 2024 07:12:09 GMT, Kim Barrett  wrote:

>> Please review this change to update the HotSpot Style Guide's discussion of
>> nullptr and its use.
>> 
>> I suggest this is an editorial rather than substantive change to the style
>> guide.  As such, the normal HotSpot PR process can be used for this change.
>
> Kim Barrett has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   respond to shipilev comments

Approved.

-

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18101#pullrequestreview-1917796573


Re: RFR: 8327173: HotSpot Style Guide needs update regarding nullptr vs NULL

2024-03-04 Thread Vladimir Kozlov
On Mon, 4 Mar 2024 09:51:16 GMT, Aleksey Shipilev  wrote:

>> doc/hotspot-style.md line 730:
>> 
>>> 728: Use `nullptr`
>>> 729: 
>>> ([n2431](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2431.pdf))
>>> 730: rather than `NULL`.  Don't use (constant expression or literal) 0 for 
>>> pointers.
>> 
>> Should there be any discussion here of why we want to avoid `NULL`?  
>> Specifically the varying
>> definitions and the pitfalls of `NULL` in varargs context.  Also template 
>> and overload issues.
>
> I think it would be enough to write 1..2 sentences about this, and then defer 
> to N2431 already linked here for more details.

I agree with Aleksey.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/18101#discussion_r1511561200


Re: RFR: 8318446: C2: implement StoreNode::Ideal_merge_stores

2024-01-16 Thread Vladimir Kozlov
On Wed, 18 Oct 2023 13:04:35 GMT, Emanuel Peter  wrote:

> This is a feature requiested by @RogerRiggs and @cl4es .
> 
> **Idea**
> 
> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger 
> stores (e.g. one long store) can lead to speedup. Recently, @cl4es and 
> @RogerRiggs had to review a few PR's where people would try to get speedups 
> by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian 
> (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an 
> optimization in C2, rather than in the Java library code, or even user code.
> 
> This patch here supports a few simple use-cases, like these:
> 
> Merge consecutive array stores, with constants. We can combine the separate 
> constants into a larger constant:
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395
> 
> Merge consecutive array stores, with a variable that was split (using 
> shifts). We can essentially undo the splitting (i.e. shifting and 
> truncation), and directly store the variable:
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456
> 
> The idea is that this would allow the introduction of a very simple API, 
> without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):
> 
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472
> 
> **Details**
> 
> This draft currently implements the optimization in an additional special 
> IGVN phase:
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485
> 
> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see 
> `Compile::gather_nodes_for_merge_stores`). During IGVN, we call 
> `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of 
> `StoreNode::Ideal`. We essentially try to establish a chain of mergable 
> stores:
> 
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806
> 
> Mergable stores must have the same Opcode (implies they have the same element 
> type and hence size). Further, mergable stores must have the same control (or 
> be separated by only a RangeCheck). Further, they must either both store 
> constants, or adjacent segments of a larger value ...

About changes. May be you can use something similar to ClearArrayNode. Collect 
all stores into one node and corresponding Mach (machine) nodes will implement 
it using available instructions instead of C2 decide the size of combined store.

One drawback for these changes I see that you may use a lot more registers to 
keep all values.

For constants you need to keep in mind the order of memory (little or big 
endian).

-

PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1894665744


Re: RFR: 8318446: C2: implement StoreNode::Ideal_merge_stores

2024-01-16 Thread Vladimir Kozlov
On Tue, 16 Jan 2024 15:09:27 GMT, Emanuel Peter  wrote:

>> @eme64 I have tried your patch, it seems that there are some limitations:
>> 
>> - The stores are not merged if the order is not right (e.g `a[2] = 2; a[1] = 
>> 1;`)
>> - The stores are not merged if they are floating point constants.
>> - The stores are not merged if they are consecutive fields in an object. E.g:
>> 
>> 
>> class Point {
>> int x; int y;
>> }
>> 
>> p.x = 1;
>> p.y = 2; // Cannot merge into mov [p.x], 0x20001
>> 
>> 
>> Regarding the final point, fields may be of different types with different 
>> sizes and there may be padding between them. This means that for load-store 
>> sequence merges, I think SLP cannot handle these cases.
>> 
>> Thanks.
>
> @merykitty @cl4es @RogerRiggs @vnkozlov I wonder if you think that the 
> approach of this PR is good, and if you have any suggestions about it?
> 
> - Is a separate phase ok?
> - Is this PR in a sweet-spot that reaches the goals of the library-folks, but 
> is not too complex?
> - Would you prefer a more general solution, like a straight-line SLP 
> algorithm, that can merge (even vectorize) any load / store sequences, even 
> merge accesses with different element sizes and with gaps/padding?

@eme64  I would suggest to change the subject of RFE and this PR to something 
like:
 "C2: optimize stores into primitive arrays by combining values into larger 
store"
 
It will correctly describes the scope of changes.  In a future we may have 
separate RFE for object fields - I don't think we should do it in this RFE.

For performance result it would be nice to have only one table with additional 
column with % difference. It is hard to see now the difference.

-

PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1894659376


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]

2023-12-08 Thread Vladimir Kozlov
On Wed, 6 Dec 2023 23:12:13 GMT, Srinivas Vamsi Parasa  wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX2 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 
>> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the 
>> performance data below.
>> 
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit 
>> datatypes (int, float) as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
>> 
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Benchmark (Serial Sort) | Size | Baseline  (us/op) | AVX2 (us/op) | 
>> Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6
>> 
>> 
>> 
>> 
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Revert "Change supported intrinsic check"
>   
>   This reverts commit 9621eb045c2958582f81ec06b237789a07481ddd.

I pushed closed changes.

-

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16534#pullrequestreview-1773255608


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]

2023-12-07 Thread Vladimir Kozlov
On Fri, 8 Dec 2023 00:33:49 GMT, Srinivas Vamsi Parasa  wrote:

> > Testing have only one failure in closed tests and I need to fix it before 
> > this can be pushed.
> 
> Thanks Vladimir for the update. Is the test failure because of this PR?

Yes. One of our test, which checks integrity of built JDK, is confused by 
changes in libsimdsort.so.

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1846326542


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]

2023-12-07 Thread Vladimir Kozlov
On Wed, 6 Dec 2023 23:12:13 GMT, Srinivas Vamsi Parasa  wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX2 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 
>> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the 
>> performance data below.
>> 
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit 
>> datatypes (int, float) as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
>> 
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Benchmark (Serial Sort) | Size | Baseline  (us/op) | AVX2 (us/op) | 
>> Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6
>> 
>> 
>> 
>> 
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Revert "Change supported intrinsic check"
>   
>   This reverts commit 9621eb045c2958582f81ec06b237789a07481ddd.

Testing have only one failure in closed tests and I need to fix it before this 
can be pushed.

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1846315767


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]

2023-12-06 Thread Vladimir Kozlov
On Wed, 6 Dec 2023 23:09:01 GMT, Srinivas Vamsi Parasa  wrote:

>>> LGTM, thanks!
>> 
>> Thanks Jatin!
>
>> @vamsi-parasa, sorry, I was wrong. I missed that you need to check type 
>> `bt`. Latest change is more complicated than it was before. Please revert it 
>> back (undo last change). I will test previous version 09.
> @vnkozlov 
> Vladimir, please see the commit reverted in the updated changes pushed now.

@vamsi-parasa, please, remind me which tests check that code in `libsmdsort.so` 
is used?

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843938518


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v12]

2023-12-06 Thread Vladimir Kozlov
On Wed, 6 Dec 2023 23:12:13 GMT, Srinivas Vamsi Parasa  wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX2 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 
>> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the 
>> performance data below.
>> 
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit 
>> datatypes (int, float) as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
>> 
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Benchmark (Serial Sort) | Size | Baseline  (us/op) | AVX2 (us/op) | 
>> Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6
>> 
>> 
>> 
>> 
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Revert "Change supported intrinsic check"
>   
>   This reverts commit 9621eb045c2958582f81ec06b237789a07481ddd.

Good. I submitted testing.

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843839669


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]

2023-12-06 Thread Vladimir Kozlov
On Wed, 6 Dec 2023 17:44:24 GMT, Srinivas Vamsi Parasa  wrote:

>> LGTM, thanks!
>
>> LGTM, thanks!
> 
> Thanks Jatin!

@vamsi-parasa, sorry, I was wrong. I missed that you need to check type `bt`. 
Latest change is more complicated than it was before. Please revert it back 
(undo last change). I will test previous version 09.

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843822075


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]

2023-12-06 Thread Vladimir Kozlov
On Wed, 6 Dec 2023 17:48:04 GMT, Srinivas Vamsi Parasa  wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX2 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> For serial sort on random data, this PR shows upto ~7.5x improvement for 
>> 32-bit datatypes (int, float) on Intel TigerLake machine as shown in the 
>> performance data below.
>> 
>> For parallel sort on random data, this PR shows upto ~3.4x for 32-bit 
>> datatypes (int, float) as shown below.
>> 
>> **Note:** This PR also improves the performance of AVX512 sort by upto 35%.
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
>> 
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm">
>> > href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml">
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Benchmark (Serial Sort) | Size | Baseline  (us/op) | AVX2 (us/op) | 
>> Speedup
>> -- | -- | -- | -- | --
>> ArraysSort.intSort | 10 | 0.034 | 0.029 | 1.2
>> ArraysSort.intSort | 25 | 0.088 | 0.044 | 2.0
>> ArraysSort.intSort | 50 | 0.239 | 0.159 | 1.5
>> ArraysSort.intSort | 75 | 0.417 | 0.27 | 1.5
>> ArraysSort.intSort | 100 | 0.572 | 0.265 | 2.2
>> ArraysSort.intSort | 1000 | 10.098 | 4.282 | 2.4
>> ArraysSort.intSort | 1 | 330.065 | 43.383 | 7.6
>> ArraysSort.intSort | 10 | 4099.527 | 778.943 | 5.3
>> ArraysSort.intSort | 100 | 49150.16 | 9634.335 | 5.1
>> ArraysSort.floatSort | 10 | 0.045 | 0.043 | 1.0
>> ArraysSort.floatSort | 25 | 0.105 | 0.073 | 1.4
>> ArraysSort.floatSort | 50 | 0.278 | 0.216 | 1.3
>> ArraysSort.floatSort | 75 | 0.476 | 0.241 | 2.0
>> ArraysSort.floatSort | 100 | 0.583 | 0.313 | 1.9
>> ArraysSort.floatSort | 1000 | 10.182 | 4.329 | 2.4
>> ArraysSort.floatSort | 1 | 323.136 | 57.175 | 5.7
>> ArraysSort.floatSort | 10 | 4299.519 | 862.63 | 5.0
>> ArraysSort.floatSort | 100 | 50889.4 | 10972.19 | 4.6
>> 
>> 
>> 
>> 
>> 
>> > xmlns:o="urn:schemas-microsoft-com:office:office"
>> xmlns:x="urn:schemas-microsoft-com:office:excel"
>> xmlns="http://www.w3.org/TR/REC-html40";>
>> 
>> 
>> 
>> 
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   add missing header files

src/hotspot/share/opto/library_call.cpp line 5393:

> 5391: if (!Matcher::supports_simd_sort(bt)) {
> 5392:   return false;
> 5393: }

This check should be in `C2Compiler::is_intrinsic_supported()`

src/hotspot/share/opto/library_call.cpp line 5450:

> 5448:   if (!Matcher::supports_simd_sort(bt)) {
> 5449: return false;
> 5450:   }

Same.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417831171
PR Review Comment: https://git.openjdk.org/jdk/pull/16534#discussion_r1417832246


Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]

2023-12-06 Thread Vladimir Kozlov
On Wed, 6 Dec 2023 18:05:22 GMT, Sandhya Viswanathan  
wrote:

> @TobiHartmann @vnkozlov Please advice if we can go head and integrate this PR 
> today before the fork.

Too late. Changes looks fine to me (I am still on fence that we moving to C++ 
implementation of intrinsics and require latest C++ compiler version). I need 
to run testing for latest version of changes before approval. Lets not rush.

-

PR Comment: https://git.openjdk.org/jdk/pull/16534#issuecomment-1843446956


Re: RFR: 8318027: Support alternative name to jdk.internal.vm.compiler [v3]

2023-10-20 Thread Vladimir Kozlov
On Fri, 20 Oct 2023 15:45:50 GMT, Doug Simon  wrote:

>> The Graal code base has 
>> [renamed](https://github.com/oracle/graal/commit/1e41203d10db321f86723eac90f6cd0573b08b33)
>>  its module to `jdk.compiler.graal` as part of preparations for Project 
>> Galahad. Due to the way Java modules work, this requires a JDK change. The 
>> core of the issue is that the 
>> [service](https://github.com/openjdk/jdk/blob/56aa1e8dc8047cbc29d554889c64beb6eca0b8eb/src/jdk.internal.vm.ci/share/classes/module-info.java#L37)
>>  by which HotSpot requests a Graal compilation is defined in JVMCI. Since 
>> JVMCI is in the boot layer, the service can only be implemented by a 
>> provider in the boot layer and the package defining the service must be 
>> exported to the provider's defining module. This export currently targets 
>> [`jdk.internal.vm.compiler`](https://github.com/openjdk/jdk/blob/56aa1e8dc8047cbc29d554889c64beb6eca0b8eb/src/jdk.internal.vm.ci/share/classes/module-info.java#L28)
>>  and so the binding fails for the new Graal module. To address this, this PR 
>> reflects the Graal module ren
 aming, including adjusting the qualified export.
>
> Doug Simon has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   re-fix since tags to reflect current JDK release

Good.

-

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16189#pullrequestreview-1690675532


Re: RFR: 8318027: Support alternative name to jdk.internal.vm.compiler

2023-10-20 Thread Vladimir Kozlov
On Fri, 13 Oct 2023 16:28:19 GMT, Doug Simon  wrote:

> The Graal code base has 
> [renamed](https://github.com/oracle/graal/commit/1e41203d10db321f86723eac90f6cd0573b08b33)
>  its module to `jdk.compiler.graal` as part of preparations for Project 
> Galahad. Due to the way Java modules work, this requires a JDK change. The 
> core of the issue is that the 
> [service](https://github.com/openjdk/jdk/blob/56aa1e8dc8047cbc29d554889c64beb6eca0b8eb/src/jdk.internal.vm.ci/share/classes/module-info.java#L37)
>  by which HotSpot requests a Graal compilation is defined in JVMCI. Since 
> JVMCI is in the boot layer, the service can only be implemented by a provider 
> in the boot layer and the package defining the service must be exported to 
> the provider's defining module. This export currently targets 
> [`jdk.internal.vm.compiler`](https://github.com/openjdk/jdk/blob/56aa1e8dc8047cbc29d554889c64beb6eca0b8eb/src/jdk.internal.vm.ci/share/classes/module-info.java#L28)
>  and so the binding fails for the new Graal module. To address this, this PR 
> reflects the Graal module rena
 ming, including adjusting the qualified export.

Copyright is associated with file and should be kept, only update last year.
I don't know rules about 'since'. But there will be no confusion for 
developers. Everyone knows that these files were created for Graal from the 
start.

-

PR Comment: https://git.openjdk.org/jdk/pull/16189#issuecomment-1772894022


Re: RFR: 8318027: Support alternative name to jdk.internal.vm.compiler

2023-10-20 Thread Vladimir Kozlov
On Fri, 13 Oct 2023 16:28:19 GMT, Doug Simon  wrote:

> The Graal code base has 
> [renamed](https://github.com/oracle/graal/commit/1e41203d10db321f86723eac90f6cd0573b08b33)
>  its module to `jdk.compiler.graal` as part of preparations for Project 
> Galahad. Due to the way Java modules work, this requires a JDK change. The 
> core of the issue is that the 
> [service](https://github.com/openjdk/jdk/blob/56aa1e8dc8047cbc29d554889c64beb6eca0b8eb/src/jdk.internal.vm.ci/share/classes/module-info.java#L37)
>  by which HotSpot requests a Graal compilation is defined in JVMCI. Since 
> JVMCI is in the boot layer, the service can only be implemented by a provider 
> in the boot layer and the package defining the service must be exported to 
> the provider's defining module. This export currently targets 
> [`jdk.internal.vm.compiler`](https://github.com/openjdk/jdk/blob/56aa1e8dc8047cbc29d554889c64beb6eca0b8eb/src/jdk.internal.vm.ci/share/classes/module-info.java#L28)
>  and so the binding fails for the new Graal module. To address this, this PR 
> reflects the Graal module rena
 ming, including adjusting the qualified export.

Why you replaced pair of copyright years with one year in module-info.Java 
files? Instead of updating last year only.
Why also update 'since' there?
Even if you changed location these files existed already.

-

Changes requested by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16189#pullrequestreview-1690172543


Re: RFR: 8318078: ADLC: pass ASSERT and PRODUCT flags [v2]

2023-10-16 Thread Vladimir Kozlov
On Mon, 16 Oct 2023 10:34:37 GMT, Emanuel Peter  wrote:

>> @vnkozlov asked me to guard some debug AD file rules in `#ifdef ASSERT`. 
>> https://github.com/openjdk/jdk/pull/14785#discussion_r1349391130
>> 
>> We discovered that the `ASSERT` and `PRODUCT` are not yet passed to ADLC, 
>> and hence they are always considered `undefined`. Hence, all of these 
>> `ifdef` blocks would always be ignored.
>> 
>> **Solution**
>> I added the flags to `make/hotspot/gensrc/GensrcAdlc.gmk`, just like in 
>> `make/hotspot/lib/JvmFlags.gmk`.
>> 
>> As @erikj79 commented: we should probably unify this. But I leave that to 
>> the build team.
>> 
>> **Testing**
>> With this code you can see what flags are passed to ADLC:
>> 
>> --- a/src/hotspot/share/adlc/main.cpp
>> +++ b/src/hotspot/share/adlc/main.cpp
>> @@ -56,6 +56,11 @@ int main(int argc, char *argv[])
>>// Check for proper arguments
>>if( argc == 1 ) usage(AD);// No arguments?  Then print usage
>>  
>> +  for( int i = 1; i < argc; i++ ) { // For all arguments
>> +char *s = argv[i];  // Get option/filename
>> +fprintf(stderr, "ARGV[%d] %s\n", i, s);
>> +  }
>> +
>>// Read command line arguments and file names
>>for( int i = 1; i < argc; i++ ) { // For all arguments
>>  char *s = argv[i];  // Get option/filename
>> 
>> 
>> On `linux-x64` I get:
>> 
>> ARGV[1] -q
>> ARGV[2] -T
>> ARGV[3] -DLINUX=1
>> ARGV[4] -D_GNU_SOURCE=1
>> ARGV[5] -g
>> ARGV[6] -DAMD64=1
>> ARGV[7] -D_LP64=1
>> ARGV[8] -DNDEBUG
>> ARGV[9] -DPRODUCT
>> 
>> 
>> And on `linux-x64-debug` I get:
>> 
>> ARGV[1] -q
>> ARGV[2] -T
>> ARGV[3] -DLINUX=1
>> ARGV[4] -D_GNU_SOURCE=1
>> ARGV[5] -g
>> ARGV[6] -DAMD64=1
>> ARGV[7] -D_LP64=1
>> ARGV[8] -DASSERT
>> 
>> 
>> I verified that the `#ifdef` work as expected, by adding this code to 
>> `src/hotspot/cpu/x86/x86.ad`:
>> 
>> #ifdef ASSERT
>> #ifdef PRODUCT
>> control
>> #endif
>> #endif
>> 
>> #ifdef ASSERT
>> xxx
>> #endif
>> 
>> #ifdef PRODUCT
>> yyy
>> #endif
>> 
>> When compiling, I get complaints for `yyy` on `linux-x64` and for `xxx` on 
>> `linux-x64-debug`. But since `ASSERT` and `PRODUCT` never occur together, we 
>> never get complaints about `control`.
>> 
>> **Running tier1-3 and stress testing ...**
>
> Emanuel Peter has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   add comments like Vladimir requested

Good

-

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16178#pullrequestreview-1680416164


Re: RFR: 8318078: ADLC: pass ASSERT and PRODUCT flags

2023-10-13 Thread Vladimir Kozlov
On Fri, 13 Oct 2023 09:49:48 GMT, Emanuel Peter  wrote:

> @vnkozlov asked me to guard some debug AD file rules in `#ifdef ASSERT`. 
> https://github.com/openjdk/jdk/pull/14785#discussion_r1349391130
> 
> We discovered that the `ASSERT` and `PRODUCT` are not yet passed to ADLC, and 
> hence they are always considered `undefined`. Hence, all of these `ifdef` 
> blocks would always be ignored.
> 
> **Solution**
> I added the flags to `make/hotspot/gensrc/GensrcAdlc.gmk`, just like in 
> `make/hotspot/lib/JvmFlags.gmk`.
> 
> As @erikj79 commented: we should probably unify this. But I leave that to the 
> build team.
> 
> **Testing**
> With this code you can see what flags are passed to ADLC:
> 
> --- a/src/hotspot/share/adlc/main.cpp
> +++ b/src/hotspot/share/adlc/main.cpp
> @@ -56,6 +56,11 @@ int main(int argc, char *argv[])
>// Check for proper arguments
>if( argc == 1 ) usage(AD);// No arguments?  Then print usage
>  
> +  for( int i = 1; i < argc; i++ ) { // For all arguments
> +char *s = argv[i];  // Get option/filename
> +fprintf(stderr, "ARGV[%d] %s\n", i, s);
> +  }
> +
>// Read command line arguments and file names
>for( int i = 1; i < argc; i++ ) { // For all arguments
>  char *s = argv[i];  // Get option/filename
> 
> 
> On `linux-x64` I get:
> 
> ARGV[1] -q
> ARGV[2] -T
> ARGV[3] -DLINUX=1
> ARGV[4] -D_GNU_SOURCE=1
> ARGV[5] -g
> ARGV[6] -DAMD64=1
> ARGV[7] -D_LP64=1
> ARGV[8] -DNDEBUG
> ARGV[9] -DPRODUCT
> 
> 
> And on `linux-x64-debug` I get:
> 
> ARGV[1] -q
> ARGV[2] -T
> ARGV[3] -DLINUX=1
> ARGV[4] -D_GNU_SOURCE=1
> ARGV[5] -g
> ARGV[6] -DAMD64=1
> ARGV[7] -D_LP64=1
> ARGV[8] -DASSERT
> 
> 
> I verified that the `#ifdef` work as expected, by adding this code to 
> `src/hotspot/cpu/x86/x86.ad`:
> 
> #ifdef ASSERT
> #ifdef PRODUCT
> control
> #endif
> #endif
> 
> #ifdef ASSERT
> xxx
> #endif
> 
> #ifdef PRODUCT
> yyy
> #endif
> 
> When compiling, I get complaints for `yyy` on `linux-x64` and for `xxx` on 
> `linux-x64-debug`. But since `ASSERT` and `PRODUCT` never occur together, we 
> never get complaints about `control`.
> 
> **Running tier1-3 and stress testing ...**

make/hotspot/gensrc/GensrcAdlc.gmk line 138:

> 136:   # Set ASSERT, NDEBUG and PRODUCT flags just like in JvmFlags.gmk
> 137:   ifeq ($(DEBUG_LEVEL), release)
> 138: ADLCFLAGS += -DNDEBUG

May be you should also copy all comments from `JvmFlags.gmk` to avoid confusion 
because, for example, `NDEBUG` is not used directly in HotSpot code but it is 
needed "to disable uses of assert macro from ."

-

PR Review Comment: https://git.openjdk.org/jdk/pull/16178#discussion_r1358574163


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v5]

2023-10-11 Thread Vladimir Kozlov
On Wed, 11 Oct 2023 20:58:23 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal of this PR is to address the follow-up comments to the SIMD 
>> accelerated sort PR (#14227) which implemented AVX512 intrinsics for 
>> Arrays.sort() methods.
>> The proposed changes are:
>> 
>> 1) Restriction of the AVX512 sort acceleration to only Intel CPUs. A 
>> performance regression (due to micro-architectural differences) was reported 
>> for AMD Zen4 CPUs in the comments section of PR.
>> 2) Addressing the build failure due to a bug in GCC 12 (which was fixed in 
>> version 12.3.1). The details of the bug are at: 
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593
>> 3) Minor changes in Javadoc strings
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Revert @ForceInline annotations for small array sort methods

My tier1-3,xcomp testing for v04 passed. I am integrating these changes.
Lets continue discussion about changes for AMD in  
https://bugs.openjdk.org/browse/JDK-8317976.

-

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16124#pullrequestreview-1673014053


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v5]

2023-10-11 Thread Vladimir Kozlov
On Wed, 11 Oct 2023 23:40:55 GMT, Sandhya Viswanathan 
 wrote:

> > It makes sense to let `-XX:ControlIntrinsic=` overrule 
> > `VM_Version::is_intel()` check and enable the intrinsics when `AVX512DQ` is 
> > supported.
> 
> May be it could be done as part of 
> https://bugs.openjdk.org/browse/JDK-8317976.

It is not easy to do. You need to make sure that you not enable intrinsic by 
`-XX:ControlIntrinsic=` when it is not supported by hardware. I would say it is 
separate (second) RFE because we need to test it separately from current Zen 4 
issue.

Note, we are not introducing regression to AMD with these changes - code is 
reverted back to before previous changes 
[14227](https://github.com/openjdk/jdk/pull/14227). We have some  time to 
propose patch.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758737716


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v5]

2023-10-11 Thread Vladimir Kozlov
On Wed, 11 Oct 2023 20:58:23 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal of this PR is to address the follow-up comments to the SIMD 
>> accelerated sort PR (#14227) which implemented AVX512 intrinsics for 
>> Arrays.sort() methods.
>> The proposed changes are:
>> 
>> 1) Restriction of the AVX512 sort acceleration to only Intel CPUs. A 
>> performance regression (due to micro-architectural differences) was reported 
>> for AMD Zen4 CPUs in the comments section of PR.
>> 2) Addressing the build failure due to a bug in GCC 12 (which was fixed in 
>> version 12.3.1). The details of the bug are at: 
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593
>> 3) Minor changes in Javadoc strings
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Revert @ForceInline annotations for small array sort methods

I started testing for v04 assuming if you make make file changes they will be 
cosmetic.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758619820


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v3]

2023-10-11 Thread Vladimir Kozlov
On Wed, 11 Oct 2023 19:06:24 GMT, Vladimir Kozlov  wrote:

>> @vnkozlov Please advice if we can integrate this PR or if you would like to 
>> run some tests first.
>
> Okay. I will start testing for current changes. @sviswa7 please file RFE for 
> Zen 4. If we get patch for it we do followup changes in that RFE.

> @vnkozlov
> Are there any options to be sure that C2 JIT compiler is used during tests?

May be you already figure out that. Use `-XX:+PrintCompilation 
-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:-TieredCompilation` flags 
(Tiered is off to use only C2) and look for something like next

 82   20 java.util.Random::next (47 bytes)
@ 3   java.util.Random::next (47 bytes)   inline 
(hot)
  @ 8   java.util.concurrent.atomic.AtomicLong::get 
(5 bytes)   accessor
  @ 32   
java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)   inline (hot)
@ 9   
jdk.internal.misc.Unsafe::compareAndSetLong (0 bytes)   (intrinsic)

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758406966


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR [v3]

2023-10-11 Thread Vladimir Kozlov
On Wed, 11 Oct 2023 18:31:44 GMT, Sandhya Viswanathan 
 wrote:

>> Also @forceinline in these changes only works for case when new intrinsics 
>> are not used.
>> I would suggest to adapt/update JMH benchmark to cover all cases and see 
>> effect @forceinline without intrinsics.
>> That will tell us which @forceinline annotations are needed.
>
> @vnkozlov Please advice if we can integrate this PR or if you would like to 
> run some tests first.

Okay. I will start testing for current changes. @sviswa7 please file RFE for 
Zen 4. If we get patch for it we do followup changes in that RFE.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758342800


Re: RFR: 8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR

2023-10-11 Thread Vladimir Kozlov
On Tue, 10 Oct 2023 18:40:30 GMT, R1chterScale  wrote:

>> The goal of this PR is to address the follow-up comments to the SIMD 
>> accelerated sort PR (#14227) which implemented AVX512 intrinsics for 
>> Arrays.sort() methods.
>> The proposed changes are:
>> 
>> 1) Restriction of the AVX512 sort acceleration to only Intel CPUs. A 
>> performance regression (due to micro-architectural differences) was reported 
>> for AMD Zen4 CPUs in the comments section of PR.
>> 2) Addressing the build failure due to a bug in GCC 12 (which was fixed in 
>> version 12.3.1). The details of the bug are at: 
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593
>> 3) Minor changes in Javadoc strings
>
> Forgive me, I might be missing something very obvious, but is there any 
> particular reason to entirely disable the SIMD accelerated sort on Zen 4 
> rather than having an alternate code path for Zen 4 where it has the 
> `compressstoreu` instructions split up into separate `compress` and `storeu` 
> instructions so that Zen 4 platforms can still benefit from a decent degree 
> of performance uplift from AVX512 acceleration of sort?

@R1chterScale or someone can suggest patch for Zen 4 and test it so we can 
include it into these changes?
Otherwise we will file separate followup RFE as Sandhya suggested?

And yes, I need to run testing before approval but after we decide what to do 
with Zen 4 in these changes.

-

PR Comment: https://git.openjdk.org/jdk/pull/16124#issuecomment-1758306441


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v42]

2023-10-06 Thread Vladimir Kozlov
On Thu, 5 Oct 2023 23:36:48 GMT, Srinivas Vamsi Parasa  wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due 
> to a merge or a rebase. The pull request now contains 45 commits:
> 
>  - fix code style and formatting
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into avx512sort
>  - Update CompileThresholdScaling only for the sort and partition intrinsics; 
> update build script to remove nested if
>  - change variable names of indexPivot* to pivotIndex*
>  - Update DualPivotQuicksort.java
>  - Rename arraySort and arrayPartition Java methods to sort and partition. 
> Cleanup some comments
>  - Remove the unnecessary exception in single pivot partitioning fallback 
> method
>  - Move functional interfaces close to the associated methods
>  - Refactor the sort and partition 

Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v41]

2023-10-05 Thread Vladimir Kozlov
On Thu, 5 Oct 2023 23:39:20 GMT, Srinivas Vamsi Parasa  wrote:

>> In general it looks good. I have some code style comments and file name 
>> change request.
>> After you fix that I will need to rerun testing for it before approval.
>
>> In general it looks good. I have some code style comments and file name 
>> change request. After you fix that I will need to rerun testing for it 
>> before approval.
> 
> Hello Vladimir (@vnkozlov),
> 
> Thank you for the suggestions related to the code style and formatting!
> Please see all the suggested changes incorporated into the latest commit that 
> was just pushed.
> 
> Thanks,
> Vamsi

Thank you @vamsi-parasa. I will start new testing.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1749806333


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v41]

2023-10-05 Thread Vladimir Kozlov
On Fri, 22 Sep 2023 16:53:21 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Update CompileThresholdScaling only for the sort and partition intrinsics; 
> update build script to remove nested if

In general it looks good. I have some code style comments and file name change 
request.
After you fix that I will need to rerun testing for it before approval.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4183:

> 4181: 
> 4182:   // Load x86_64_sort library on supported hardware to enable avx512 
> sort and partition intrinsics
> 4183: if (UseAVX > 2 && VM_Version::supports_avx512dq()) {

Indention (spacing) is wrong for lines 4183-4190 after you moved check.

src/hotspot/share/opto/library_call.cpp line

Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

2023-08-29 Thread Vladimir Kozlov
On Mon, 28 Aug 2023 21:27:25 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Clean up parameters passed to arrayPartition; update the check to load 
> library

Good. Thank you.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1698380569


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

2023-08-29 Thread Vladimir Kozlov
On Mon, 28 Aug 2023 21:27:25 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Clean up parameters passed to arrayPartition; update the check to load 
> library

I looked on my testing log and I see that this test was run on machines which 
do not have avx512. I am re-running jdk/util tests with -Xcomp flag on avx512 
machines.

My testing  with -Xcomp flag on avx512 machines passed.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1698106526
PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1698272873


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

2023-08-29 Thread Vladimir Kozlov
On Tue, 29 Aug 2023 19:32:44 GMT, iaroslavski  wrote:

> Hi, We already have correctness tests. See 
> test/jdk/java/util/Arrays/Sorting.java
> 
> The latest version you can find in PR 
> https://github.com/openjdk/jdk/pull/13568/files

Does test/jdk/java/util/Arrays/Sorting.java trigger usage of this intrinsic 
without additional flags?
@vamsi-parasa can you check?

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1698092741


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v29]

2023-08-29 Thread Vladimir Kozlov
On Tue, 29 Aug 2023 17:32:26 GMT, Srinivas Vamsi Parasa  
wrote:

>>> The reason this PR is focused on Linux is because the AVX512 sort and 
>>> partitioning routines are based on Intel’s x86-simd-library 
>>> (https://github.com/intel/x86-simd-sort) which was originally developed 
>>> with GCC as the target compiler. Thus, this PR has restricted itself to 
>>> Linux as the code was tested using GCC/Linux platforms. Additionally, the 
>>> x86_64 library is compiled for AVX512 using file specific compilation 
>>> pragmas (`#pragma GCC target("avx512dq", "avx512f")`). This feature is 
>>> absent for Windows/MSVC++ compiler.”
>> 
>> That is why I am questioning this approach to have additional separate C++ 
>> code library - too much dependencies on other tools.
>> 
>> As I suggested before try to disassemble this library and use assembler code 
>> in VM new stubs. You can create specialized 
>> stubGenerator_x86_64_array_sort.cpp file for it. Then you don't need to 
>> depend on C++ compiler or OS.
>
> The shared library approach is being followed currently as an initial 
> implementation to demonstrate the value of AVX512 sorting. This will be 
> followed up in future with support for Windows as well. 
> If it is ok with you, the shared library approach could be pursued for now to 
> be later replaced with specialized assembly stubs (which are agnostic to OS 
> and compiler) when AVX512 sort is enabled for Windows. Please let us know.

I am okay with such incremental approach. Please, file RFE to replace library 
with stubs in a future (it could be still separate library but with assembler 
code).

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1309316767


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v29]

2023-08-29 Thread Vladimir Kozlov
On Mon, 28 Aug 2023 23:28:44 GMT, Erik Joelsson  wrote:

>> Hi Erik,
>> 
>> The reason this PR is focused on Linux is because the AVX512 sort and 
>> partitioning routines are based on Intel’s x86-simd-library 
>> (https://github.com/intel/x86-simd-sort) which was originally developed with 
>> GCC as the target compiler. Thus, this PR has restricted itself to Linux as 
>> the code was tested using GCC/Linux platforms. 
>> Additionally, the x86_64 library is compiled for AVX512 using file specific 
>> compilation pragmas (`#pragma GCC target("avx512dq", "avx512f")`). This 
>> feature is absent for Windows/MSVC++ compiler.”
>> 
>> Thanks,
>> Vamsi
>
> If it's tied to GCC as well, then we should probably include that in the 
> condition here unless it's also expected to work with Clang. 
> (`TOOLCHAIN_TYPE` = `gcc`)

> The reason this PR is focused on Linux is because the AVX512 sort and 
> partitioning routines are based on Intel’s x86-simd-library 
> (https://github.com/intel/x86-simd-sort) which was originally developed with 
> GCC as the target compiler. Thus, this PR has restricted itself to Linux as 
> the code was tested using GCC/Linux platforms. Additionally, the x86_64 
> library is compiled for AVX512 using file specific compilation pragmas 
> (`#pragma GCC target("avx512dq", "avx512f")`). This feature is absent for 
> Windows/MSVC++ compiler.”

That is why I am questioning this approach to have additional separate C++ code 
library - too much dependencies on other tools.

As I suggested before try to disassemble this library and use assembler code in 
VM new stubs. You can create specialized stubGenerator_x86_64_array_sort.cpp 
file for it. Then you don't need to depend on C++ compiler or OS.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/14227#discussion_r1309054538


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v30]

2023-08-29 Thread Vladimir Kozlov
On Mon, 28 Aug 2023 21:27:25 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Clean up parameters passed to arrayPartition; update the check to load 
> library

My testing passed. But I am not sure correctness of code is fully tested. For 
now we have only JMH benchmark for this new code.  Do we have JDK test which 
can check correctness of this code?

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1697743981


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v29]

2023-08-25 Thread Vladimir Kozlov
On Fri, 25 Aug 2023 01:57:41 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Remove unnecessary import in Arrays.java

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4143:

> 4141: log_info(library)("Loaded library %s, handle " INTPTR_FORMAT, 
> JNI_LIB_PREFIX "x86_64" JNI_LIB_SUFFIX, p2i(libx86_64));
> 4142: 
> 4143: if (UseAVX > 2 && VM_Version::supports_avx512dq()) {

This check should be done before you locate and load library

src/hotspot/share/opto/library_call.cpp line 5218:

> 5216:   BasicType bt = elem_type->basic_type();
> 5217:   stubAddr = StubRoutines::select_array_partition_function(bt);
> 5218:   if (stubAddr == nullptr) return false;

I see now how you check for AVX512 support.

Y

Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v29]

2023-08-25 Thread Vladimir Kozlov
On Fri, 25 Aug 2023 01:57:41 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Remove unnecessary import in Arrays.java

After I fixed it Tier1 passed and I submitted other tiers.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1693791251


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v22]

2023-08-25 Thread Vladimir Kozlov
On Thu, 24 Aug 2023 06:23:29 GMT, Srinivas Vamsi Parasa  
wrote:

>>> Improvements are nice but it would not pay off if you have big regressions. 
>>> I can accept 0.9x but 0.4x - 0.8x regressions should be investigated and 
>>> implementation adjusted to avoid them.
>> 
>> Hi Vladimir,
>> 
>> Thank you for the suggestion! 
>> Currently, AVX512sort is doing well for Random, Repeated and Shuffle 
>> patterns of input data. The regressions are observed for Staggered (Wave) 
>> pattern of input data. 
>> Will investigate the regressions and adjust the implementations to address 
>> them.
>> 
>> Thanks,
>> Vamsi
>
>> Improvements are nice but it would not pay off if you have big regressions. 
>> I can accept 0.9x but 0.4x - 0.8x regressions should be investigated and 
>> implementation adjusted to avoid them.
> 
> Hello Vladimir (@vnkozlov) ,
> 
> As per your suggestion, the implementation was adjusted to address the 
> regressions caused for STAGGER and REPEATED type of input data patterns. 
> Please see below the new JMH performance data using the adjusted 
> implementation.
> 
> In the new implementation, we don't call the AVX512 sort intrinsic at the top 
> level (`Arrays.sort()`) . Instead, we take a decomposed or a 'building 
> blocks' approach where we only intrinsify  (using AVX512 instructions) the 
> partitioning and small array sort functions used in the current baseline ( 
> `DualPivotQuickSort.sort()` ). Since the current baseline has logic to detect 
> and sort special input patterns like STAGGER, we fallback to the current 
> baseline instead of using AVX512 partitioning and sorting (which works best 
> for RANDOM, REPEATED and SHUFFLE patterns).
> 
> Data below shows `Arrays.sort()` performance comparison between the current 
> **Java baseline (DPQS)** vs. **AVX512 sort** (this PR)  using the 
> `ArraysSort.java` JMH 
> [benchmark](https://github.com/openjdk/jdk/pull/13568/files#diff-dee51b13bd1872ff455cec2f29255cfd25014a5dd33dda55a2fc68638c3dd4b2)
>  provided in the PR for [JDK-8266431: Dual-Pivot Quicksort improvements 
> (Radix sort)](https://github.com/openjdk/jdk/pull/13568/files#top) ( #13568)
> 
> - The following command line was used to run the benchmarks: ` java -jar 
> $JDK_HOME/build/linux-x86_64-server-release/images/test/micro/benchmarks.jar 
> -jvmArgs "-XX:CompileThreshold=1 -XX:-TieredCompilation" ArraysSort`
> - The scores shown are the average time (us/op),  thus lower is better. The 
> last column towards the right shows the speedup. 
> 
> 
> | Benchmark   |   Mode|   Size|   Baseline DPQS 
> (us/op)   |   AVX512 partitioning & sort (us/op)  |   Speedup |
> | --- |   --- |   --- |   --- |   --- 
> |   --- |
> | ArraysSort.Double.testSort  |   RANDOM  |   800 |   
> 6.7 |   4.8 |   1.39x   |
> | ArraysSort.Double.testSort  |   RANDOM  |   7000|   
> 234.1   |   51.5|   **4.55x**   |
> | ArraysSort.Double.testSort  |   RANDOM  |   5   |   
> 2155.9  |   470.0   |   **4.59x**   |
> | ArraysSort.Double.testSort  |   RANDOM  |   30  |   
> 15076.3 |   3391.3  |   **4.45x**   |
> | ArraysSort.Double.testSort  |   RANDOM  |   200 |   
> 116445.5|   27491.7 |   **4.24x**   |
> | ArraysSort.Double.testSort  |   REPEATED|   800 
> |   2.3 |   1.7 |   1.35x   |
> | ArraysSort.Double.testSort  |   REPEATED|   7000
> |   23.3|   12.1|   **1.92x**   |
> | ArraysSort.Double.testSort  |...

@vamsi-parasa I submitted our testing of latest v28 version. It found issue in 
`ArraysSort.java` new benchmark file. You missed `,`after year in copyright 
line:

 * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1693790589


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v29]

2023-08-24 Thread Vladimir Kozlov
On Fri, 25 Aug 2023 01:57:41 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Remove unnecessary import in Arrays.java

Second.  We do have already the precedent to generate separate dynamic library 
(and load it into JVM) for intrinsics : 
[8265783](https://bugs.openjdk.org/browse/JDK-8265783). But I consider that as 
an exception.

To have second such library gives me concerns. Especially C++ code - we can't 
control what vectors code particular version of C++ produces. Are 
`_mm512_set1_*` part of C++ standard or it is dependancies on another tool?
In 8265783 case we had assembler code  which is why we accepted it after some 
discussions.

And I don't see (may be missing it somewhere) any checks in JVM tha

Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v28]

2023-08-24 Thread Vladimir Kozlov
On Fri, 25 Aug 2023 01:51:12 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal is to develop faster sort routines for x86_64 CPUs by taking 
>> advantage of AVX512 instructions. This enhancement provides an order of 
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>> 
>> This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and 
>> upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the 
>> performance data below.
>> 
>> 
>> **Arrays.sort performance data using JMH benchmarks for arrays with random 
>> data** 
>> 
>> |Arrays.sort benchmark   |   Array Size  |   Baseline 
>> (us/op)|   AVX512 Sort (us/op) |   Speedup |
>> |--- |   --- |   --- |   --- |   --- 
>> |
>> |ArraysSort.doubleSort   |   10  |   0.034   |   0.035   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   25  |   0.116   |   0.089   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   50  |   0.282   |   0.291   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   75  |   0.474   |   0.358   
>> |   1.3 |
>> |ArraysSort.doubleSort   |   100 |   0.654   |   0.623   
>> |   1.0 |
>> |ArraysSort.doubleSort   |   1000|   9.274   |   6.331   
>> |   1.5 |
>> |ArraysSort.doubleSort   |   1   |   323.339 |   71.228  
>> |   **4.5** |
>> |ArraysSort.doubleSort   |   10  |   4471.871|   
>> 1002.748|   **4.5** |
>> |ArraysSort.doubleSort   |   100 |   51660.742   |   
>> 12921.295   |   **4.0** |
>> |ArraysSort.floatSort|   10  |   0.045   |   0.046   
>> |   1.0 |
>> |ArraysSort.floatSort|   25  |   0.103   |   0.084   
>> |   1.2 |
>> |ArraysSort.floatSort|   50  |   0.285   |   0.33
>> |   0.9 |
>> |ArraysSort.floatSort|   75  |   0.492   |   0.346   
>> |   1.4 |
>> |ArraysSort.floatSort|   100 |   0.597   |   0.326   
>> |   1.8 |
>> |ArraysSort.floatSort|   1000|   9.811   |   5.294   
>> |   1.9 |
>> |ArraysSort.floatSort|   1   |   323.955 |   50.547  
>> |   **6.4** |
>> |ArraysSort.floatSort|   10  |   4326.38 |   731.152 
>> |   **5.9** |
>> |ArraysSort.floatSort|   100 |   52413.88|   
>> 8409.193|   **6.2** |
>> |ArraysSort.intSort  |   10  |   0.033   |   0.033   
>> |   1.0 |
>> |ArraysSort.intSort  |   25  |   0.086   |   0.051   
>> |   1.7 |
>> |ArraysSort.intSort  |   50  |   0.236   |   0.151   
>> |   1.6 |
>> |ArraysSort.intSort  |   75  |   0.416   |   0.332   
>> |   1.3 |
>> |ArraysSort.intSort  |   100 |   0.63|   0.521   
>> |   1.2 |
>> |ArraysSort.intSort  |   1000|   10.518  |   4.698   
>> |   2.2 |
>> |ArraysSort.intSort  |   1   |   309.659 |   42.518  
>> |   **7.3** |
>> |ArraysSort.intSort  |   10  |   4130.917|   
>> 573.956 |   **7.2** |
>> |ArraysSort.intSort  |   100 |   49876.307   |   
>> 6712.812|   **7.4** |
>> |ArraysSort.longSort |   10  |   0.036   |   0.037   
>> |   1.0 |
>> |ArraysSort.longSort |   25  |   0.094   |   0.08
>> |   1.2 |
>> |ArraysSort.longSort |   50  |   0.218   |   0.227   
>> |   1.0 |
>> |ArraysSort.longSort |   75  |   0.466   |   0.402   
>> |   1.2 |
>> |ArraysSort.longSort |   100 |   0.76|   0.58
>> |   1.3 |
>> |ArraysSort.longSort |   1000|   10.449  |   6
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Move sort and partition intrinsics from Arrays.java to DPQS.java

Also by build group.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1692645418


Re: RFR: 8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) [v22]

2023-08-24 Thread Vladimir Kozlov
On Thu, 24 Aug 2023 06:23:29 GMT, Srinivas Vamsi Parasa  
wrote:

>>> Improvements are nice but it would not pay off if you have big regressions. 
>>> I can accept 0.9x but 0.4x - 0.8x regressions should be investigated and 
>>> implementation adjusted to avoid them.
>> 
>> Hi Vladimir,
>> 
>> Thank you for the suggestion! 
>> Currently, AVX512sort is doing well for Random, Repeated and Shuffle 
>> patterns of input data. The regressions are observed for Staggered (Wave) 
>> pattern of input data. 
>> Will investigate the regressions and adjust the implementations to address 
>> them.
>> 
>> Thanks,
>> Vamsi
>
>> Improvements are nice but it would not pay off if you have big regressions. 
>> I can accept 0.9x but 0.4x - 0.8x regressions should be investigated and 
>> implementation adjusted to avoid them.
> 
> Hello Vladimir (@vnkozlov) ,
> 
> As per your suggestion, the implementation was adjusted to address the 
> regressions caused for STAGGER and REPEATED type of input data patterns. 
> Please see below the new JMH performance data using the adjusted 
> implementation.
> 
> In the new implementation, we don't call the AVX512 sort intrinsic at the top 
> level (`Arrays.sort()`) . Instead, we take a decomposed or a 'building 
> blocks' approach where we only intrinsify  (using AVX512 instructions) the 
> partitioning and small array sort functions used in the current baseline ( 
> `DualPivotQuickSort.sort()` ). Since the current baseline has logic to detect 
> and sort special input patterns like STAGGER, we fallback to the current 
> baseline instead of using AVX512 partitioning and sorting (which works best 
> for RANDOM, REPEATED and SHUFFLE patterns).
> 
> Data below shows `Arrays.sort()` performance comparison between the current 
> **Java baseline (DPQS)** vs. **AVX512 sort** (this PR)  using the 
> `ArraysSort.java` JMH 
> [benchmark](https://github.com/openjdk/jdk/pull/13568/files#diff-dee51b13bd1872ff455cec2f29255cfd25014a5dd33dda55a2fc68638c3dd4b2)
>  provided in the PR for [JDK-8266431: Dual-Pivot Quicksort improvements 
> (Radix sort)](https://github.com/openjdk/jdk/pull/13568/files#top) ( #13568)
> 
> - The following command line was used to run the benchmarks: ` java -jar 
> $JDK_HOME/build/linux-x86_64-server-release/images/test/micro/benchmarks.jar 
> -jvmArgs "-XX:CompileThreshold=1 -XX:-TieredCompilation" ArraysSort`
> - The scores shown are the average time (us/op),  thus lower is better. The 
> last column towards the right shows the speedup. 
> 
> 
> | Benchmark   |   Mode|   Size|   Baseline DPQS 
> (us/op)   |   AVX512 partitioning & sort (us/op)  |   Speedup |
> | --- |   --- |   --- |   --- |   --- 
> |   --- |
> | ArraysSort.Double.testSort  |   RANDOM  |   800 |   
> 6.7 |   4.8 |   1.39x   |
> | ArraysSort.Double.testSort  |   RANDOM  |   7000|   
> 234.1   |   51.5|   **4.55x**   |
> | ArraysSort.Double.testSort  |   RANDOM  |   5   |   
> 2155.9  |   470.0   |   **4.59x**   |
> | ArraysSort.Double.testSort  |   RANDOM  |   30  |   
> 15076.3 |   3391.3  |   **4.45x**   |
> | ArraysSort.Double.testSort  |   RANDOM  |   200 |   
> 116445.5|   27491.7 |   **4.24x**   |
> | ArraysSort.Double.testSort  |   REPEATED|   800 
> |   2.3 |   1.7 |   1.35x   |
> | ArraysSort.Double.testSort  |   REPEATED|   7000
> |   23.3|   12.1|   **1.92x**   |
> | ArraysSort.Double.testSort  |...

@vamsi-parasa Thank you for addressing performance issues I asked about.

First, since you add new public Java API to Arrays class this have to be 
reviewed by Core Libs.

-

PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1692644591


Re: RFR: JDK-8303798: REDO - Remove fdlibm C sources

2023-04-03 Thread Vladimir Kozlov
On Sat, 1 Apr 2023 18:08:44 GMT, Joe Darcy  wrote:

> This PR is a redo of JDK-8302801: Remove fdlibm C sources. The problem with 
> JDK-8302801 was that it neglected (mea culpa) to include a Java 
> implementation of IEEEremainder before the FDLIBM C implementation was 
> deleted. Such an implementation has been successfully provided under 
> JDK-8304028: Port fdlibm IEEEremainder to Java. After JDK-8304028, there are 
> no native methods left in StrictMath.
> 
> This PR is the same as JDK-8302801 other than StrictMath.c already being 
> removed under JDK-8304028.

Seems fine to me.

-

Marked as reviewed by kvn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/13279#pullrequestreview-1369672551


Re: RFR: JDK-8302801: Remove fdlibm C sources [v4]

2023-03-05 Thread Vladimir Kozlov
On Sun, 5 Mar 2023 06:19:06 GMT, Joe Darcy  wrote:

>> While the review of https://github.com/openjdk/jdk/pull/12800 finishes up, I 
>> thought I'd get out for the review the next phase of the FDLIBM port: 
>> removing the FDLIBM C sources from the repo.
>> 
>> A repo with the changes for JDK-8302027 and this PR successful build on the 
>> default set of platform and successfully run tier 1 tests, which includes 
>> tests of the math library.
>> 
>> There are a few remaining references to the case-independent string "fdlibm" 
>> in the make directory and HotSpot sources. HotSpot contains a partial fork 
>> for FDLIBM (a tine of FDLIBM?) to use for intrinsics. The remaining make 
>> machinery contains logic to determine what set of gcc options can be used 
>> for the compile.
>> 
>> The intention of this change is to remove use of FDLIBM for the core 
>> libraries.
>
> Joe Darcy has updated the pull request with a new target base due to a merge 
> or a rebase. The pull request now contains 20 commits:
> 
>  - Merge branch 'master' of https://git.openjdk.org/jdk into JDK-8302801-expr
>  - Respond to review feedback.
>  - Respond to review feedback and add description of transliteration process.
>  - JDK-8302801: Remove fdlibm C sources
>  - Update src/java.base/share/classes/java/lang/FdLibm.java
>
>Co-authored-by: Andrey Turbanov 
>  - Update src/java.base/share/classes/java/lang/FdLibm.java
>
>Co-authored-by: Andrey Turbanov 
>  - Update src/java.base/share/classes/java/lang/FdLibm.java
>
>Co-authored-by: Andrey Turbanov 
>  - Update src/java.base/share/classes/java/lang/FdLibm.java
>
>Co-authored-by: Andrey Turbanov 
>  - Update src/java.base/share/classes/java/lang/FdLibm.java
>
>Co-authored-by: Andrey Turbanov 
>  - Update src/java.base/share/classes/java/lang/FdLibm.java
>
>Co-authored-by: Andrey Turbanov 
>  - ... and 10 more: https://git.openjdk.org/jdk/compare/1bb39a95...437a8fce

HotSpot changes look good.

-

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/12821


Re: RFR: 8303588: [JVMCI] make JVMCI source directories conform with standard layout

2023-03-03 Thread Vladimir Kozlov
On Fri, 3 Mar 2023 17:47:20 GMT, Doug Simon  wrote:

> The layout of the sources in `jdk.internal.vm.ci` stems from their initial 
> development outside the JDK where they adopted a layout influenced by Eclipse.
> 
> There's no good reason for maintaining this layout any more. Moving to a 
> standard layout also means IDEs will be able to make sense of the JVMCI 
> sources in `src.zip`.

Good.
Please, test it in mach5.

-

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/12860


Re: RFR: JDK-8302801: Remove fdlibm C sources [v3]

2023-03-02 Thread Vladimir Kozlov
On Fri, 3 Mar 2023 00:31:12 GMT, David Holmes  wrote:

> Hotspot changes are okay but I'm a bit confused about what the hotspot code 
> will now be used for?

`SharedRuntime::*` runtime math functions are used on platforms where there are 
no HW instructions or intrinsics (Zero VM). JIT compiled code may also call 
them in such case (or when intrinsics disabled with flag):
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/library_call.cpp#L1815

Consider them as default intrinsics for all platforms (since they are written 
in C). They are faster than interpreting bytecode.

They are also needed for results consistency - the same code is used by 
Interpreter and JITed code.

-

PR: https://git.openjdk.org/jdk/pull/12821


Re: RFR: 8295970: Add vector api sanity tests in tier1 [v2]

2022-11-01 Thread Vladimir Kozlov
On Fri, 28 Oct 2022 07:19:31 GMT, Jie Fu  wrote:

>> Hi all,
>> 
>> As discussed here 
>> https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it 
>> would be better to add the vector api tests in GHA.
>> 
>> Thanks.
>> Best regards,
>> Jie
>
> Jie Fu has updated the pull request with a new target base due to a merge or 
> a rebase. The incremental webrev excludes the unrelated changes brought in by 
> the merge/rebase. The pull request contains four additional commits since the 
> last revision:
> 
>  - Add jdk_vector_sanity test group
>  - Merge branch 'master' into JDK-8295970
>  - Revert changes in test.yml
>  - 8295970: Add jdk_vector tests in GHA

Good.

I think we need review from @shipilev too since he had strong opinion about 
this.

-

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/10879


Re: RFR: 8295970: Add jdk_vector tests in GHA

2022-10-27 Thread Vladimir Kozlov
On Thu, 27 Oct 2022 03:53:13 GMT, Jie Fu  wrote:

> Hi all,
> 
> As discussed here 
> https://github.com/openjdk/jdk/pull/10807#pullrequestreview-1150314487 , it 
> would be better to add the vector api tests in GHA.
> 
> Thanks.
> Best regards,
> Jie

I think we can meet all requests by adding `jdk_vector_sanity` subset of vector 
tests similar to existing `jdk_svc_sanity`.
The motivation for that is to catch obvious bugs in vector code generation on 
32-bit platforms and aarch64 very early.
We, in Oracle, help to test 64 bits but it is not enough I think.
Vector API is stable enough already and I think we can have some sanity checks 
for it in tier1.
The only issue is to find subset of tests.

-

PR: https://git.openjdk.org/jdk/pull/10879