Re: RFR: 8303762: [vectorapi] Intrinsification of Vector.slice

Quan Anh Mai Tue, 07 Mar 2023 10:35:00 -0800

On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:


> `Vector::slice` is a method on the top-level class of the Vector API that 
> concatenates its two inputs into an intermediate composite and extracts a 
> window the size of one input as the result. It is used in vector conversion 
> methods where the part number is not 0, to slice the parts into the correct 
> positions. Slicing is also used in text processing such as UTF-8 and UTF-16 
> validation. x86, starting from SSSE3, has `palignr`, which performs vector 
> slicing very efficiently. As a result, I think it is beneficial to add a C2 
> node for this operation and to intrinsify the `Vector::slice` method.
> 
> A slice is currently implemented as 
> `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)`, which requires 
> preparing the index vector and the blending mask. Even with that 
> preparation hoisted out of the loops, microbenchmarks show an improvement 
> when using the slice intrinsics. Some cases show dramatic increases in 
> throughput because a mask of length 2 cannot currently be intrinsified, 
> so the existing code falls back to the Java implementation.
> 
> Please take a look and leave a review. Thank you very much.

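For readers unfamiliar with the operation, the semantics described above can be sketched in scalar form. This is only a model on plain `int[]` arrays, not the Vector API and not the JDK implementation. The names `SliceModel`, `slice` and `sliceViaRearrangeBlend` are made up for illustration, and the accepted origin range (0 to the lane count, inclusive) is an assumption of this sketch.

```java
import java.util.Arrays;

// Hypothetical scalar model; names are illustrative and not part of the JDK.
final class SliceModel {

    // Window of v1.length lanes starting at 'origin' in the concatenation v1 ++ v2.
    // Assumption of this sketch: 0 <= origin <= v1.length.
    static int[] slice(int origin, int[] v1, int[] v2) {
        int n = v1.length;
        if (v2.length != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("mismatched lengths or bad origin");
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;                      // index into the composite
            result[i] = j < n ? v1[j] : v2[j - n];   // low half is v1, high half is v2
        }
        return result;
    }

    // Scalar analogue of v2.rearrange(iota).blend(v1.rearrange(iota), blendMask):
    // both inputs are permuted by a wrapped index vector, then a mask selects
    // which input supplies each lane.
    static int[] sliceViaRearrangeBlend(int origin, int[] v1, int[] v2) {
        int n = v1.length;
        if (v2.length != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("mismatched lengths or bad origin");
        }
        int[] iota = new int[n];
        boolean[] blendMask = new boolean[n];
        for (int i = 0; i < n; i++) {
            iota[i] = (origin + i) % n;       // wrapped source lane
            blendMask[i] = i < n - origin;    // true where the lane comes from v1
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = blendMask[i] ? v1[iota[i]] : v2[iota[i]];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3};
        int[] b = {4, 5, 6, 7};
        System.out.println(Arrays.toString(slice(1, a, b)));                  // [1, 2, 3, 4]
        System.out.println(Arrays.toString(sliceViaRearrangeBlend(1, a, b))); // [1, 2, 3, 4]
    }
}
```

The second method makes the extra work visible: the index vector and the mask both have to be materialised before any lane is moved, which is the preparation the quoted description refers to.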
Benchmark results:

                                                                   Before               After
    Benchmark                            (size)   Mode  Cnt     Score      Error     Score     Error   Units    Change
    Byte128Vector.sliceBinaryConstant      1024  thrpt    5  5058.760 ± 2214.115  8315.263 ± 102.169  ops/ms   +64.37%
    Byte256Vector.sliceBinaryConstant      1024  thrpt    5  6986.299 ± 1028.257  8440.387 ±  30.163  ops/ms   +20.81%
    Byte64Vector.sliceBinaryConstant       1024  thrpt    5  2944.869 ±  849.548  5926.054 ± 493.146  ops/ms  +101.23%
    ByteMaxVector.sliceBinaryConstant      1024  thrpt    5  7269.226 ±  366.246  8201.184 ± 309.539  ops/ms   +12.82%
    Double128Vector.sliceBinaryConstant    1024  thrpt    5    10.204 ±    0.508   979.287 ±  19.991  ops/ms    x95.97
    Double256Vector.sliceBinaryConstant    1024  thrpt    5   868.085 ±   26.378   967.799 ±  10.224  ops/ms   +11.49%
    DoubleMaxVector.sliceBinaryConstant    1024  thrpt    5   813.646 ±   74.468   978.150 ±  14.316  ops/ms   +20.22%
    Float128Vector.sliceBinaryConstant     1024  thrpt    5  1297.281 ±   23.650  1850.995 ±  29.741  ops/ms   +42.68%
    Float256Vector.sliceBinaryConstant     1024  thrpt    5  1796.121 ±   26.662  2011.362 ±  38.418  ops/ms   +11.98%
    Float64Vector.sliceBinaryConstant      1024  thrpt    5    10.381 ±    0.194  1628.510 ±   8.752  ops/ms   x156.87
    FloatMaxVector.sliceBinaryConstant     1024  thrpt    5  1820.161 ±   26.802  1988.085 ±  41.835  ops/ms    +9.23%
    Int128Vector.sliceBinaryConstant       1024  thrpt    5  1394.911 ±   40.815  1864.818 ±  33.792  ops/ms   +33.69%
    Int256Vector.sliceBinaryConstant       1024  thrpt    5  1874.496 ±   60.541  1864.818 ±  33.792  ops/ms    -0.52%
    Int64Vector.sliceBinaryConstant        1024  thrpt    5    10.942 ±    0.377  1621.849 ±  56.538  ops/ms   x148.22
    IntMaxVector.sliceBinaryConstant       1024  thrpt    5  1870.746 ±   40.665  2027.041 ±  25.880  ops/ms    +8.35%
    Long128Vector.sliceBinaryConstant      1024  thrpt    5    10.595 ±    0.306   991.969 ±  15.033  ops/ms    x93.63
    Long256Vector.sliceBinaryConstant      1024  thrpt    5   815.689 ±   12.243   989.365 ±  25.969  ops/ms   +21.29%
    LongMaxVector.sliceBinaryConstant      1024  thrpt    5   822.060 ±   12.337   977.061 ±  31.968  ops/ms   +18.86%
    Short128Vector.sliceBinaryConstant     1024  thrpt    5  3062.676 ±  124.796  3890.796 ± 326.767  ops/ms   +27.04%
    Short256Vector.sliceBinaryConstant     1024  thrpt    5  3747.778 ±  119.356  4125.463 ±  33.602  ops/ms   +10.08%
    Short64Vector.sliceBinaryConstant      1024  thrpt    5  1879.203 ±   69.160  2899.515 ±  57.870  ops/ms   +54.29%
    ShortMaxVector.sliceBinaryConstant     1024  thrpt    5  3717.217 ±   48.876  4035.455 ± 102.725  ops/ms    +8.56%

-------------

PR: https://git.openjdk.org/jdk/pull/12909
