On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai <[email protected]> wrote:
> `Vector::slice` is a method in the top-level class of the Vector API that
> concatenates its 2 inputs into an intermediate composite and extracts a
> window equal to the size of the inputs into the result. It is used in vector
> conversion methods where the part number is not 0 to slice the parts to the
> correct positions. Slicing is also used in text processing such as UTF-8 and
> UTF-16 validation. x86, starting from SSSE3, has `palignr`, which does vector
> slicing very efficiently. As a result, I think it is beneficial to add a C2
> node for this operation as well as to intrinsify the `Vector::slice` method.
>
> A slice is currently implemented as
> `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires
> preparation of the index vector and the blending mask. Even with the
> preparations being hoisted out of the loops, microbenchmarks show improvement
> with the slice intrinsics. Some show tremendous increases in throughput
> because a mask of length 2 cannot currently be intrinsified, so the existing
> code falls back to the Java implementation.
>
> Please take a look and leave some reviews. Thank you very much.
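
For readers less familiar with the operation, here is a minimal scalar sketch of the slice semantics described above, together with a scalar model of the `rearrange`/`blend` formulation. It is not the JDK implementation and does not use the Vector API: the class `SliceModel`, its method names, and the way `iota` and `blendMask` are built here (indices wrapped modulo the lane count, mask set where the lane still comes from the first input) are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical scalar model; plain int arrays stand in for vector lanes.
public class SliceModel {

    // Reference semantics: lane i of the result is lane (origin + i) of the
    // concatenation [v1, v2]. origin ranges over 0..n inclusive.
    public static int[] slice(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        if (v2.length != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("bad lengths or origin");
        }
        int[] r = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            r[i] = j < n ? v1[j] : v2[j - n];
        }
        return r;
    }

    // Scalar model of rearrange: lane i takes the value at index idx[i].
    public static int[] rearrange(int[] v, int[] idx) {
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[idx[i]];
        }
        return r;
    }

    // Scalar model of a.blend(b, mask): lane i takes b[i] where mask[i] is
    // set, and a[i] otherwise.
    public static int[] blend(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Models v2.rearrange(iota).blend(v1.rearrange(iota), blendMask).
    // Assumed construction: iota wraps around the lane count, and blendMask
    // is set for the lanes that still come from v1.
    public static int[] sliceViaRearrangeBlend(int[] v1, int[] v2, int origin) {
        int n = v1.length;
        int[] iota = new int[n];
        boolean[] blendMask = new boolean[n];
        for (int i = 0; i < n; i++) {
            iota[i] = (origin + i) % n;
            blendMask[i] = origin + i < n;
        }
        return blend(rearrange(v2, iota), rearrange(v1, iota), blendMask);
    }

    public static void main(String[] args) {
        int[] v1 = {0, 1, 2, 3};
        int[] v2 = {4, 5, 6, 7};
        // prints [1, 2, 3, 4]
        System.out.println(Arrays.toString(slice(v1, v2, 1)));
    }
}
```

In this model both formulations agree for every origin from 0 to the lane count. The point of the sketch is that the `rearrange`/`blend` route needs two extra operands (the index vector and the mask), whereas a dedicated slice operation needs only the origin.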
Benchmark results:

                                                         Before               After
Benchmark                            (size)   Mode  Cnt     Score      Error     Score     Error  Units     Change
Byte128Vector.sliceBinaryConstant      1024  thrpt    5  5058.760 ± 2214.115  8315.263 ± 102.169  ops/ms   +64.37%
Byte256Vector.sliceBinaryConstant      1024  thrpt    5  6986.299 ± 1028.257  8440.387 ±  30.163  ops/ms   +20.81%
Byte64Vector.sliceBinaryConstant       1024  thrpt    5  2944.869 ±  849.548  5926.054 ± 493.146  ops/ms  +101.23%
ByteMaxVector.sliceBinaryConstant      1024  thrpt    5  7269.226 ±  366.246  8201.184 ± 309.539  ops/ms   +12.82%
Double128Vector.sliceBinaryConstant    1024  thrpt    5    10.204 ±    0.508   979.287 ±  19.991  ops/ms    x95.97
Double256Vector.sliceBinaryConstant    1024  thrpt    5   868.085 ±   26.378   967.799 ±  10.224  ops/ms   +11.49%
DoubleMaxVector.sliceBinaryConstant    1024  thrpt    5   813.646 ±   74.468   978.150 ±  14.316  ops/ms   +20.22%
Float128Vector.sliceBinaryConstant     1024  thrpt    5  1297.281 ±   23.650  1850.995 ±  29.741  ops/ms   +42.68%
Float256Vector.sliceBinaryConstant     1024  thrpt    5  1796.121 ±   26.662  2011.362 ±  38.418  ops/ms   +11.98%
Float64Vector.sliceBinaryConstant      1024  thrpt    5    10.381 ±    0.194  1628.510 ±   8.752  ops/ms   x156.87
FloatMaxVector.sliceBinaryConstant     1024  thrpt    5  1820.161 ±   26.802  1988.085 ±  41.835  ops/ms    +9.23%
Int128Vector.sliceBinaryConstant       1024  thrpt    5  1394.911 ±   40.815  1864.818 ±  33.792  ops/ms   +33.69%
Int256Vector.sliceBinaryConstant       1024  thrpt    5  1874.496 ±   60.541  1864.818 ±  33.792  ops/ms    -0.52%
Int64Vector.sliceBinaryConstant        1024  thrpt    5    10.942 ±    0.377  1621.849 ±  56.538  ops/ms   x148.22
IntMaxVector.sliceBinaryConstant       1024  thrpt    5  1870.746 ±   40.665  2027.041 ±  25.880  ops/ms    +8.35%
Long128Vector.sliceBinaryConstant      1024  thrpt    5    10.595 ±    0.306   991.969 ±  15.033  ops/ms    x93.63
Long256Vector.sliceBinaryConstant      1024  thrpt    5   815.689 ±   12.243   989.365 ±  25.969  ops/ms   +21.29%
LongMaxVector.sliceBinaryConstant      1024  thrpt    5   822.060 ±   12.337   977.061 ±  31.968  ops/ms   +18.86%
Short128Vector.sliceBinaryConstant     1024  thrpt    5  3062.676 ±  124.796  3890.796 ± 326.767  ops/ms   +27.04%
Short256Vector.sliceBinaryConstant     1024  thrpt    5  3747.778 ±  119.356  4125.463 ±  33.602  ops/ms   +10.08%
Short64Vector.sliceBinaryConstant      1024  thrpt    5  1879.203 ±   69.160  2899.515 ±  57.870  ops/ms   +54.29%
ShortMaxVector.sliceBinaryConstant     1024  thrpt    5  3717.217 ±   48.876  4035.455 ± 102.725  ops/ms    +8.56%
-------------
PR: https://git.openjdk.org/jdk/pull/12909