Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Fri, 19 Jan 2024 07:43:18 GMT, Emanuel Peter wrote:

>> For long/double each permute row is 32 byte in size, so a shift by 5 to
>> compute row address.
>
> Ah right. Maybe we could say `32byte = 4 long = 4 * 64bit`.
> Because "64bit row" sounds like the whole row is only 64 bit long. It is
> actually the cells that are 64bits, not the rows!

DONE

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1459568064
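The row-address arithmetic in this exchange can be modelled in plain Java. This is only an illustrative sketch, not the HotSpot code: the class and method names are hypothetical, and it assumes a 256-bit vector with 4 long/double lanes, with the mask value being the 4-bit integer that `vmovmskpd` collects from the lane sign bits.

```java
// Hypothetical model of the row-address computation for long/double lanes.
final class PermuteRowOffset {
    static final int LANES = 4;                       // 4 long/double lanes in a 256-bit vector
    static final int CELL_BYTES = Long.BYTES;         // each cell is 64 bits = 8 bytes
    static final int ROW_BYTES = LANES * CELL_BYTES;  // 4 * 8 = 32 bytes per row
    // log2(32) = 5, which is the shift amount discussed in the review
    static final int ROW_SHIFT = Integer.numberOfTrailingZeros(ROW_BYTES);

    // Emulates the effect of vmovmskpd: one result bit per lane,
    // taken from the sign bit of that 64-bit lane.
    static int moveMask(long[] maskLanes) {
        int bits = 0;
        for (int i = 0; i < LANES; i++) {
            bits |= (int) (maskLanes[i] >>> 63) << i;
        }
        return bits;
    }

    // Byte offset of the permute row selected by the mask bits.
    static int rowOffsetBytes(int maskBits) {
        return maskBits << ROW_SHIFT;
    }

    public static void main(String[] args) {
        long[] mask = {-1L, 0L, -1L, 0L};  // lanes 0 and 2 are set
        int bits = moveMask(mask);
        System.out.println(bits + " -> " + rowOffsetBytes(bits));  // prints "5 -> 160"
    }
}
```

With 4 lanes there are 16 possible mask values, so a table laid out this way would span 16 rows of 32 bytes each.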
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Thu, 18 Jan 2024 17:06:55 GMT, Jatin Bhateja wrote:

>> @jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit?
>
> For long/double each permute row is 32 byte in size, so a shift by 5 to
> compute row address.

Ah right. Maybe we could say `32byte = 4 long = 4 * 64bit`.
Because "64bit row" sounds like the whole row is only 64 bit long. It is actually the cells that are 64bits, not the rows!

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1458509886
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Tue, 16 Jan 2024 07:08:57 GMT, Emanuel Peter wrote:

>> Each long/double permute lane holds 64 bit value.
>
> @jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit?

For long/double each permute row is 32 byte in size, so a shift by 5 to compute row address.

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1457747672
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Tue, 16 Jan 2024 06:13:43 GMT, Jatin Bhateja wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:
>>
>>> 5307:     assert(bt == T_LONG || bt == T_DOUBLE, "");
>>> 5308:     vmovmskpd(rtmp, mask, vec_enc);
>>> 5309:     shlq(rtmp, 5); // for 64 bit rows (4 longs)
>>
>> Suggestion:
>>
>>     shlq(rtmp, 5); // for 32 bit rows (4 longs)
>
> Each long/double permute lane holds 64 bit value.

@jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit?

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1453003935
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Mon, 15 Jan 2024 09:10:38 GMT, Emanuel Peter wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>>   Using emulated variable blend E-Core optimized instruction.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:
>
>> 5307:     assert(bt == T_LONG || bt == T_DOUBLE, "");
>> 5308:     vmovmskpd(rtmp, mask, vec_enc);
>> 5309:     shlq(rtmp, 5); // for 64 bit rows (4 longs)
>
> Suggestion:
>
>     shlq(rtmp, 5); // for 32 bit rows (4 longs)

Each long/double permute lane holds 64 bit value.

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1452967063
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote:

>> Hi,
>>
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2
>> only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2
>> instruction set.
>> These are very frequently used APIs in columnar database filter operation.
>>
>> Implementation uses a lookup table to record permute indices. Table index is
>> computed using
>> mask argument of compress/expand operation.
>>
>> Following are the performance number of JMH micro included with the patch.
>>
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>>
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error  Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844         ops/ms
>>
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error  Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Using emulated variable blend E-Core optimized instruction.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309:

> 5307:     assert(bt == T_LONG || bt == T_DOUBLE, "");
> 5308:     vmovmskpd(rtmp, mask, vec_enc);
> 5309:     shlq(rtmp, 5); // for 64 bit rows (4 longs)

Suggestion:

    shlq(rtmp, 5); // for 32 bit rows (4 longs)

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1452098849
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote:

> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Using emulated variable blend E-Core optimized instruction.

test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 37:

> 35: @Fork(jvmArgsPrepend = {"--add-modules=jdk.incubator.vector", "-XX:UseAVX=2"})
> 36: public class ColumnFilterBenchmark {
> 37:     @Param({"1024","2047", "4096"})

Suggestion:

    @Param({"1024", "2047", "4096"})

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1452021322
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote:

> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Using emulated variable blend E-Core optimized instruction.

Following are the performance numbers for existing Vector API JMH micro benchmark over Meteor Lake - Crestmont E-cores.

![image](https://github.com/openjdk/jdk/assets/59989778/dab762f8-2379-4fcf-90da-f765e907c6c1)

PR Comment: https://git.openjdk.org/jdk/pull/17261#issuecomment-1885525420
Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]
> Hi,
>
> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only
> targets.
> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2
> instruction set.
> These are very frequently used APIs in columnar database filter operation.
>
> Implementation uses a lookup table to record permute indices. Table index is
> computed using
> mask argument of compress/expand operation.
>
> Following are the performance number of JMH micro included with the patch.
>
> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>
> Baseline:
> Benchmark                                 (size)   Mode  Cnt     Score  Error  Units
> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757         ops/ms
> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099         ops/ms
> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981         ops/ms
> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170         ops/ms
> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017         ops/ms
> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516         ops/ms
> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844         ops/ms
>
> Withopt:
> Benchmark                                 (size)   Mode  Cnt     Score  Error  Units
> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072         ops/ms
> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229         ops/ms
> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665         ops/ms
> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384         ops/ms
> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2  1791.170         ops/ms
> ColumnFilterBenchmark.filterIntColumn       4096...

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  Using emulated variable blend E-Core optimized instruction.

Changes:
  - all: https://git.openjdk.org/jdk/pull/17261/files
  - new: https://git.openjdk.org/jdk/pull/17261/files/257a6351..c3f1c50e

Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=04
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=03-04

  Stats: 28 lines in 4 files changed: 18 ins; 0 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/17261.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17261/head:pull/17261

PR: https://git.openjdk.org/jdk/pull/17261
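The description says the implementation uses a lookup table of permute indices, indexed by the mask. A minimal scalar sketch of that idea for a compress over 4 lanes follows. It is not the patch's code: the class and method names are hypothetical, the table layout (one row per mask value, `-1` marking lanes that receive zero) is an assumption made for illustration, and the real implementation works on vector registers rather than Java arrays.

```java
import java.util.Arrays;

// Hypothetical scalar model of a mask-indexed permute table for compress.
final class CompressPermuteTable {
    static final int LANES = 4;

    // Row m lists, for mask value m, the source lane feeding each destination lane.
    // Destination lanes beyond the number of set mask bits are marked -1 (assumed marker).
    static int[][] buildTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int m = 0; m < table.length; m++) {
            Arrays.fill(table[m], -1);
            int dst = 0;
            for (int src = 0; src < LANES; src++) {
                if ((m & (1 << src)) != 0) {
                    table[m][dst++] = src;  // pack selected lanes toward lane 0
                }
            }
        }
        return table;
    }

    // Applies the permute row chosen by maskBits; unselected tail lanes become zero.
    static long[] compress(long[] v, int maskBits, int[][] table) {
        long[] out = new long[LANES];
        int[] row = table[maskBits];
        for (int i = 0; i < LANES; i++) {
            out[i] = row[i] < 0 ? 0L : v[row[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] table = buildTable();
        long[] v = {10, 20, 30, 40};
        // Mask 0b1010 selects lanes 1 and 3.
        System.out.println(Arrays.toString(compress(v, 0b1010, table)));  // prints "[20, 40, 0, 0]"
    }
}
```

The table is built once, so each compress reduces to one table lookup plus one permute. An expand table could be built the same way with the roles of source and destination lanes swapped.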