Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v4]

Emanuel Peter Tue, 09 Jan 2024 06:28:32 -0800

On Mon, 8 Jan 2024 06:23:46 GMT, Jatin Bhateja wrote:


>> Hi,
>> 
>> This patch optimizes the non-subword vector compress and expand APIs for
>> x86 AVX2-only targets.
>> Upcoming E-core Xeons (Sierra Forest) and hybrid CPUs only support the AVX2
>> instruction set.
>> These APIs are used very frequently in columnar database filter operations.
>> 
>> The implementation uses a lookup table to record permute indices. The table
>> index is computed from the mask argument of the compress/expand operation.
>> 
>> Following are the performance numbers for the JMH micro-benchmark included
>> with the patch.
>> 
>> 
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>> 
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt    Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  142.767          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   71.436          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   35.992          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  182.151          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2   91.096          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2   44.757          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  184.099          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2   91.981          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2   45.170          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2  148.017          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2   73.516          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2   36.844          ops/ms
>> 
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt  ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Review suggestions incorporated.

I think we are almost there! 😊

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5291:

> 5289:   if (bt == T_INT || bt == T_FLOAT) {
> 5290:     vmovmskps(rtmp, mask, vec_enc);
> 5291:     shlq(rtmp, 5);

Suggestion:

    shlq(rtmp, 5); // for 32 byte rows (8 int)

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:

> 5307:     assert(bt == T_LONG || bt == T_DOUBLE, "");
> 5308:     vmovmskpd(rtmp, mask, vec_enc);
> 5309:     shlq(rtmp, 5);

Suggestion:

    shlq(rtmp, 5); // for 32 byte rows (4 long)
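
To spell out the arithmetic behind both shifts, here is a minimal sketch in plain Java rather than the assembler code from the patch. The class and method names are hypothetical. It assumes each table row is one 256-bit vector (8 ints or 4 longs, 32 bytes either way), so the byte offset of a row is the mask value times 32, which is what a left shift by 5 computes.

```java
// Hypothetical illustration only; not code from the patch.
final class RowOffsetSketch {
    // Bytes per table row: one 256-bit vector, i.e. 8 ints or 4 longs.
    static final int ROW_BYTES = 32;

    // Byte offset of the row selected by a lane mask (one bit per lane).
    static long rowOffset(int laneMask) {
        return ((long) laneMask) << 5; // same as laneMask * ROW_BYTES
    }

    public static void main(String[] args) {
        // Both row layouts occupy the same number of bytes.
        System.out.println(8 * Integer.BYTES == ROW_BYTES);
        System.out.println(4 * Long.BYTES == ROW_BYTES);
        // Offset of the last row in a 256-row (8 lane) table.
        System.out.println(rowOffset(0xFF));
    }
}
```

Because both element sizes give the same row width, the same shift amount works for the int/float path and the long/double path; only the number of rows differs (256 versus 16).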

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 1018:

> 1016:   } else {
> 1017:     assert(esize == 64, "");
> 1018:     // Loop to generate 16 x 4 int expand permute index table. A row is accessed

Suggestion:

    // Loop to generate 16 x 4 long expand permute index table. A row is accessed
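
For readers following along, here is a minimal sketch of how a 16 x 4 expand permute index table could be built, written in plain Java instead of the stub generator's assembler. The names are hypothetical, and the use of -1 to mark lanes whose mask bit is clear is an assumption of this sketch, not something confirmed by the quoted code.

```java
import java.util.Arrays;

// Hypothetical illustration only; not code from the patch.
final class ExpandTableSketch {
    // Builds one row per possible mask value. In each row, a lane whose mask
    // bit is set receives the index of the next unread source lane; a lane
    // whose bit is clear receives -1 (assumed marker for "not selected").
    static long[][] buildExpandTable(int lanes) {
        int rows = 1 << lanes;
        long[][] table = new long[rows][lanes];
        for (int mask = 0; mask < rows; mask++) {
            int next = 0; // next source lane to consume
            for (int lane = 0; lane < lanes; lane++) {
                boolean selected = ((mask >> lane) & 1) != 0;
                table[mask][lane] = selected ? next++ : -1;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        long[][] table = buildExpandTable(4); // 16 rows of 4 longs
        System.out.println(table.length);
        // Mask 0b1010 selects lanes 1 and 3.
        System.out.println(Arrays.toString(table[0b1010])); // prints [-1, 0, -1, 1]
    }
}
```

With 4 long lanes there are 2^4 = 16 possible masks, which is where the 16 x 4 shape comes from.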

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/17261#pullrequestreview-1811224600
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1446133371
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1446133800
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1446132575
