Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-19 Thread Jatin Bhateja
On Fri, 19 Jan 2024 07:43:18 GMT, Emanuel Peter wrote:

>> For long/double, each permute row is 32 bytes in size, so we shift by 5 to
>> compute the row address.
>
> Ah right. Maybe we could say `32byte = 4 long = 4 * 64bit`.
> Because "64bit row" sounds like the whole row is only 64 bit long. It is 
> actually the cells that are 64bits, not the rows!

DONE

-

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1459568064


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-18 Thread Emanuel Peter
On Thu, 18 Jan 2024 17:06:55 GMT, Jatin Bhateja wrote:

>> @jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit?
>
> For long/double, each permute row is 32 bytes in size, so we shift by 5 to
> compute the row address.

Ah right. Maybe we could say `32byte = 4 long = 4 * 64bit`.
Because "64bit row" sounds like the whole row is only 64 bit long. It is 
actually the cells that are 64bits, not the rows!

-

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1458509886


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-18 Thread Jatin Bhateja
On Tue, 16 Jan 2024 07:08:57 GMT, Emanuel Peter wrote:

>> Each long/double permute lane holds a 64-bit value.
>
> @jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit?

For long/double, each permute row is 32 bytes in size, so we shift by 5 to
compute the row address.
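
The arithmetic behind the shift can be sketched in Java as follows. This is only an illustration of the address computation discussed above, not the HotSpot code; the class name, method name, and parameters are hypothetical. It assumes a 256-bit vector, so a row of 4 longs occupies 4 * 8 = 32 bytes and the row offset is the mask value shifted left by log2(32) = 5.

```java
final class PermuteRowOffset {
    // Byte offset of the permute-table row selected by a lane mask.
    // laneCount and laneBytes are illustrative parameters, not HotSpot names.
    static long rowByteOffset(int maskBits, int laneCount, int laneBytes) {
        int rowBytes = laneCount * laneBytes;                 // e.g. 4 lanes * 8 bytes = 32 bytes
        int shift = Integer.numberOfTrailingZeros(rowBytes);  // log2(32) = 5
        if ((1 << shift) != rowBytes) {
            throw new IllegalArgumentException("row size must be a power of two");
        }
        // Equivalent to maskBits * rowBytes, expressed as the shift used in the review.
        return ((long) maskBits) << shift;
    }

    public static void main(String[] args) {
        int mask = 0b1010; // lanes 1 and 3 selected, i.e. table row 10
        System.out.println(rowByteOffset(mask, 4, 8)); // prints 320
    }
}
```

With 64-bit cells the row is 32 bytes wide, which is why the comment wording `32byte = 4 long = 4 * 64bit` describes the shift more precisely than "64 bit rows".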

-

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1457747672


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:13:43 GMT, Jatin Bhateja wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:
>> 
>>> 5307: assert(bt == T_LONG || bt == T_DOUBLE, "");
>>> 5308: vmovmskpd(rtmp, mask, vec_enc);
>>> 5309: shlq(rtmp, 5); // for 64 bit rows (4 longs)
>> 
>> Suggestion:
>> 
>> shlq(rtmp, 5); // for 32 bit rows (4 longs)
>
> Each long/double permute lane holds a 64-bit value.

@jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit?

-

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1453003935


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 09:10:38 GMT, Emanuel Peter wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Using emulated variable blend E-Core optimized instruction.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:
> 
>> 5307: assert(bt == T_LONG || bt == T_DOUBLE, "");
>> 5308: vmovmskpd(rtmp, mask, vec_enc);
>> 5309: shlq(rtmp, 5); // for 64 bit rows (4 longs)
> 
> Suggestion:
> 
> shlq(rtmp, 5); // for 32 bit rows (4 longs)

Each long/double permute lane holds a 64-bit value.

-

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1452967063


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-15 Thread Emanuel Peter
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote:

>> Hi,
>> 
>> This patch optimizes the non-subword vector compress and expand APIs for
>> x86 AVX2-only targets.
>> Upcoming E-core Xeons (Sierra Forest) and hybrid CPUs support only the AVX2
>> instruction set.
>> These APIs are used very frequently in columnar database filter operations.
>> 
>> The implementation uses a lookup table to record permute indices. The table
>> index is computed from the mask argument of the compress/expand operation.
>> 
>> The following are the performance numbers of the JMH micro benchmark included
>> with the patch.
>> 
>> 
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>> 
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844         ops/ms
>> 
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt  ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Using emulated variable blend E-Core optimized instruction.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:

> 5307: assert(bt == T_LONG || bt == T_DOUBLE, "");
> 5308: vmovmskpd(rtmp, mask, vec_enc);
> 5309: shlq(rtmp, 5); // for 64 bit rows (4 longs)

Suggestion:

shlq(rtmp, 5); // for 32 bit rows (4 longs)

-

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1452098849


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-14 Thread Andrey Turbanov
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote:

>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Using emulated variable blend E-Core optimized instruction.

test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java 
line 37:

> 35: @Fork(jvmArgsPrepend = {"--add-modules=jdk.incubator.vector", 
> "-XX:UseAVX=2"})
> 36: public class ColumnFilterBenchmark {
> 37: @Param({"1024","2047", "4096"})

Suggestion:

@Param({"1024", "2047", "4096"})

-

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1452021322


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-10 Thread Jatin Bhateja
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote:

>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Using emulated variable blend E-Core optimized instruction.

The following are the performance numbers for the existing Vector API JMH micro
benchmarks on Meteor Lake (Crestmont E-cores).
![image](https://github.com/openjdk/jdk/assets/59989778/dab762f8-2379-4fcf-90da-f765e907c6c1)

-

PR Comment: https://git.openjdk.org/jdk/pull/17261#issuecomment-1885525420


Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-09 Thread Jatin Bhateja
> Hi,
> 
> This patch optimizes the non-subword vector compress and expand APIs for
> x86 AVX2-only targets.
> Upcoming E-core Xeons (Sierra Forest) and hybrid CPUs support only the AVX2
> instruction set.
> These APIs are used very frequently in columnar database filter operations.
> 
> The implementation uses a lookup table to record permute indices. The table
> index is computed from the mask argument of the compress/expand operation.
> 
> The following are the performance numbers of the JMH micro benchmark included
> with the patch.
> 
> 
> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
> 
> Baseline:
> Benchmark                                 (size)   Mode  Cnt     Score  Error   Units
> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757         ops/ms
> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099         ops/ms
> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981         ops/ms
> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170         ops/ms
> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017         ops/ms
> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516         ops/ms
> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844         ops/ms
> 
> Withopt:
> Benchmark                                 (size)   Mode  Cnt     Score  Error   Units
> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665         ops/ms
> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384         ops/ms
> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2  1791.170         ops/ms
> ColumnFilterBenchmark.filterIntColumn       4096...

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  Using emulated variable blend E-Core optimized instruction.
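
For readers unfamiliar with the approach, the lookup-table idea from the quoted description (permute indices recorded per mask value, with the mask used as the table index) can be sketched in scalar Java. This is a minimal model under stated assumptions, not the patch itself: the class and method names are hypothetical, the lane count is fixed at 4 (long/double in a 256-bit vector), and the `-1` sentinel for "no source lane, write zero" is an assumption of this sketch.

```java
import java.util.Arrays;

final class CompressLookupSketch {
    static final int LANES = 4; // assumed: 4 long/double lanes in a 256-bit vector

    // Row m lists, in ascending order, the source lanes whose bit is set in m.
    // Unused slots hold -1 (a sentinel chosen for this sketch).
    static int[][] buildCompressTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < table.length; mask++) {
            Arrays.fill(table[mask], -1);
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    table[mask][out++] = lane;
                }
            }
        }
        return table;
    }

    // Row m maps each set lane of m to the next packed source element;
    // lanes whose bit is clear hold -1.
    static int[][] buildExpandTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < table.length; mask++) {
            int next = 0;
            for (int lane = 0; lane < LANES; lane++) {
                table[mask][lane] = ((mask & (1 << lane)) != 0) ? next++ : -1;
            }
        }
        return table;
    }

    // Scalar stand-in for a vector permute followed by zeroing of unselected lanes.
    static long[] permute(long[] src, int[] row) {
        long[] dst = new long[LANES];
        for (int i = 0; i < LANES; i++) {
            dst[i] = row[i] >= 0 ? src[row[i]] : 0L;
        }
        return dst;
    }

    public static void main(String[] args) {
        int mask = 0b1010; // lanes 1 and 3 selected
        long[] src = {10, 20, 30, 40};
        long[] packed = permute(src, buildCompressTable()[mask]);
        System.out.println(Arrays.toString(packed)); // prints [20, 40, 0, 0]
        long[] spread = permute(packed, buildExpandTable()[mask]);
        System.out.println(Arrays.toString(spread)); // prints [0, 20, 0, 40]
    }
}
```

Because the rows are precomputed, compress and expand reduce to a table lookup, a permute, and a step that zeroes the unselected lanes. The scalar loop in `permute` only models that sequence.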

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/17261/files
  - new: https://git.openjdk.org/jdk/pull/17261/files/257a6351..c3f1c50e

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=03-04

  Stats: 28 lines in 4 files changed: 18 ins; 0 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/17261.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17261/head:pull/17261

PR: https://git.openjdk.org/jdk/pull/17261