On Mon, 8 Jan 2024 06:23:46 GMT, Jatin Bhateja wrote:
>> Hi,
>>
>> This patch optimizes the non-subword vector compress and expand APIs for
>> AVX2-only x86 targets.
>> Upcoming E-core Xeons (Sierra Forest) and hybrid CPUs support only the AVX2
>> instruction set.
>> These APIs are used very frequently in columnar database filter operations.
>>
>> The implementation uses a lookup table to record permute indices. The table
>> index is computed from the mask argument of the compress/expand operation.
>>
>> The following are the performance numbers for the JMH microbenchmark included
>> with the patch.
>>
>>
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>>
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844         ops/ms
>>
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Review suggestions incorporated.
I think we are almost there! 😊
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5291:
> 5289: if (bt == T_INT || bt == T_FLOAT) {
> 5290: vmovmskps(rtmp, mask, vec_enc);
> 5291: shlq(rtmp, 5);
Suggestion:
shlq(rtmp, 5); // for 32 byte rows (8 int)
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:
> 5307: assert(bt == T_LONG || bt == T_DOUBLE, "");
> 5308: vmovmskpd(rtmp, mask, vec_enc);
> 5309: shlq(rtmp, 5);
Suggestion:
shlq(rtmp, 5); // for 32 byte rows (4 long)
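To spell out the arithmetic behind the shift: shifting left by 5 multiplies the mask value by 32, and a row of 8 int or 4 long indices occupies 32 bytes. Here is a minimal sketch in plain C++ (not the HotSpot assembler); `kRowBytes` and `row_offset_bytes` are hypothetical names used only for illustration, and the row layout is an assumption based on the comments above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: each table row holds one 256-bit permute vector, i.e. 32 bytes.
constexpr std::size_t kRowBytes = 32;

// Hypothetical helper: byte offset of the row selected by a lane mask,
// mirroring what shlq(rtmp, 5) does to the movmsk result.
constexpr std::size_t row_offset_bytes(std::uint32_t lane_mask) {
    return static_cast<std::size_t>(lane_mask) << 5;  // lane_mask * 32
}

// Both element widths fill exactly one 32-byte row.
static_assert(8 * sizeof(std::int32_t) == kRowBytes, "8 int lanes per row");
static_assert(4 * sizeof(std::int64_t) == kRowBytes, "4 long lanes per row");
```

So the same shift amount serves both branches: the int/float path has 256 possible masks and the long/double path has 16, but every row is the same size.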
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 1018:
> 1016: } else {
> 1017: assert(esize == 64, "");
> 1018: // Loop to generate 16 x 4 int expand permute index table. A row is accessed
Suggestion:
// Loop to generate 16 x 4 long expand permute index table. A row is accessed
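For readers following along, this is roughly how I picture a 16 x 4 long expand table, as a scalar C++ model rather than stub-generator code. It is only a sketch under assumptions: the names `ExpandTable`, `make_expand_table` and `expand` are hypothetical, and I assume that a clear mask bit is marked with -1 so that the lane can be zeroed after the permute.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the generated table: 16 rows (one per 4-bit
// mask), each with 4 entries of 64 bits.
using ExpandTable = std::array<std::array<std::int64_t, 4>, 16>;

// Assumption: a set mask bit at a lane receives the next unused source
// index; a clear bit is marked -1 so the lane can be zeroed afterwards.
ExpandTable make_expand_table() {
    ExpandTable table{};
    for (int mask = 0; mask < 16; ++mask) {
        std::int64_t next_src = 0;
        for (int lane = 0; lane < 4; ++lane) {
            table[mask][lane] = (mask & (1 << lane)) ? next_src++ : -1;
        }
    }
    return table;
}

// Scalar model of expand: permute by the selected row, zero the -1 lanes.
std::array<std::int64_t, 4> expand(const std::array<std::int64_t, 4>& src,
                                   unsigned mask, const ExpandTable& table) {
    std::array<std::int64_t, 4> dst{};
    for (int lane = 0; lane < 4; ++lane) {
        const std::int64_t idx = table[mask & 0xF][lane];
        dst[lane] = (idx >= 0) ? src[static_cast<std::size_t>(idx)] : 0;
    }
    return dst;
}
```

With a source of {10, 20, 30, 40} and mask 0b1010, the selected row is {-1, 0, -1, 1} and the expanded result is {0, 10, 0, 20}.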
-------------
Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/17261#pullrequestreview-1811224600
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1446133371
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1446133800
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1446132575