jhorstmann commented on issue #1182:
URL: https://github.com/apache/arrow-rs/issues/1182#issuecomment-1013669825


   Benchmarks with an array size of 64k, run on an AMD Ryzen 3700U laptop.
   Compiled with `$ RUSTFLAGS="-C target-cpu=skylake"`
   (the Skylake code generator in LLVM seems to have received more tuning than the Zen one, but the two architectures are otherwise quite close).
   
   With simd feature:
   
   ```
   add                     time:   [27.039 us 27.895 us 28.542 us]
   
   subtract                time:   [26.408 us 27.194 us 28.040 us]
   
   multiply                time:   [27.303 us 27.872 us 28.429 us]
   
   divide                  time:   [45.050 us 46.385 us 47.590 us]
   
   divide_unchecked        time:   [27.806 us 29.673 us 31.300 us]
   
   divide_scalar           time:   [21.931 us 23.228 us 24.801 us]
   
   modulo                  time:   [434.12 us 435.14 us 436.78 us]
   
   modulo_scalar           time:   [1.5086 ms 1.6025 ms 1.7023 ms]
   
   add_nulls               time:   [26.502 us 27.072 us 27.736 us]
   
   divide_nulls            time:   [40.180 us 40.329 us 40.515 us]
   
   divide_nulls_unchecked  time:   [32.199 us 32.461 us 32.743 us]
   
   divide_scalar_nulls     time:   [33.899 us 33.911 us 33.922 us]
   
   modulo_nulls            time:   [663.39 us 708.55 us 756.58 us]
   
   modulo_scalar_nulls     time:   [1.1747 ms 1.2106 ms 1.2556 ms]
   ```
   
   Without simd feature:
   
   ```
   add                     time:   [12.703 us 13.485 us 14.165 us]
                           change: [-45.506% -41.284% -36.365%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   subtract                time:   [17.586 us 17.829 us 18.018 us]
                           change: [-42.087% -40.678% -39.176%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   multiply                time:   [16.261 us 16.638 us 17.035 us]
                           change: [-41.123% -39.753% -38.321%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   divide                  time:   [97.142 us 104.55 us 111.13 us]
                           change: [+113.55% +125.75% +136.87%] (p = 0.00 < 0.05)
                           Performance has regressed.
   
   divide_unchecked        time:   [24.008 us 24.153 us 24.328 us]
                           change: [-23.200% -19.850% -16.066%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   divide_scalar           time:   [14.484 us 15.522 us 16.626 us]
                           change: [-40.708% -36.430% -32.487%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   modulo                  time:   [368.17 us 390.38 us 411.11 us]
                           change: [-9.5426% -6.4033% -3.4307%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   modulo_scalar           time:   [1.2196 ms 1.2890 ms 1.3754 ms]
                           change: [-14.441% -9.7393% -5.2267%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   add_nulls               time:   [17.075 us 17.339 us 17.598 us]
                           change: [-40.726% -39.565% -38.339%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   divide_nulls            time:   [409.69 us 437.35 us 460.53 us]
                           change: [+762.95% +804.17% +855.01%] (p = 0.00 < 0.05)
                           Performance has regressed.
   
   divide_nulls_unchecked  time:   [24.312 us 24.435 us 24.566 us]
                           change: [-25.160% -24.640% -24.154%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   divide_scalar_nulls     time:   [18.353 us 19.136 us 19.961 us]
                           change: [-46.439% -44.289% -41.942%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   modulo_nulls            time:   [479.73 us 509.93 us 546.35 us]
                           change: [-26.438% -22.742% -18.697%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   modulo_scalar_nulls     time:   [1.0541 ms 1.1235 ms 1.1982 ms]
                           change: [-21.713% -17.154% -12.422%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```
   
   **Summary**: Autovectorized code takes about 40% less time for simple arithmetic (add, subtract, multiply, divide_scalar).
   Checked division is much slower without simd, about 2x without nulls and about 10x with nulls, so we should keep that optimized implementation.
   Modulo is about the same speed with and without simd: this CPU does not have a simd fmod instruction,
   and the implementation actually calls the libc `fmodf` function for each element in both versions.
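To make the three cases in the summary concrete, here is a minimal sketch in plain Rust. It is not the arrow-rs code: the function names, the `&[bool]` validity mask (Arrow uses a packed bitmap), and the `String` error type are all assumptions made for illustration. The multiply loop has the shape LLVM can typically autovectorize, the `%` operator on `f32` usually lowers to a per-element `fmodf` call on x86-64, and the checked division shows why a naive scalar loop with a per-element zero test and early return is hard to vectorize.

```rust
/// Element-wise multiply. Hypothetical helper, not the arrow-rs kernel.
pub fn multiply(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    // Zipped iterators need no per-element bounds checks, which leaves
    // the compiler free to vectorize and unroll the loop.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).collect()
}

/// Element-wise remainder. Without a vector fmod instruction, `%` on
/// `f32` is typically compiled to a library call per element.
pub fn modulo(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter().zip(b.iter()).map(|(x, y)| x % y).collect()
}

/// Checked division with a validity mask (assumed representation).
/// A zero divisor is an error only where the slot is valid; null slots
/// produce a placeholder value of 0.0.
pub fn divide_checked(a: &[f32], b: &[f32], valid: &[bool]) -> Result<Vec<f32>, String> {
    assert!(a.len() == b.len() && a.len() == valid.len());
    let mut out = Vec::with_capacity(a.len());
    for i in 0..a.len() {
        // The data-dependent early return is what makes this naive
        // loop difficult to turn into straight-line vector code.
        if valid[i] && b[i] == 0.0 {
            return Err(format!("divide by zero at index {}", i));
        }
        out.push(if valid[i] { a[i] / b[i] } else { 0.0 });
    }
    Ok(out)
}
```

Comparing `divide_checked` with an unchecked variant that drops the zero test mirrors the gap between the `divide_nulls` and `divide_nulls_unchecked` benchmarks above, although the actual kernels may be structured differently.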
   
   **Assembly**: Inner loop of the autovectorized `multiply` kernel. It is unrolled to process 64 floats (8 vectors of 8 `f32` lanes each) before checking the loop condition:
   
   ```
     0,57 │320:┌─→vmovups    ymm0,YMMWORD PTR [rdi+rsi*4-0xe0]
     2,75 │    │  vmovups    ymm1,YMMWORD PTR [rdi+rsi*4-0xc0]
     3,16 │    │  vmulps     ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0xe0]
     8,51 │    │  vmulps     ymm1,ymm1,YMMWORD PTR [rax+rsi*4-0xc0]
     3,91 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0x0],ymm0
     1,44 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0x20],ymm1
     3,97 │    │  vmovups    ymm0,YMMWORD PTR [rdi+rsi*4-0xa0]
     2,57 │    │  vmovups    ymm1,YMMWORD PTR [rdi+rsi*4-0x80]
     1,59 │    │  vmulps     ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0xa0]
    11,66 │    │  vmulps     ymm1,ymm1,YMMWORD PTR [rax+rsi*4-0x80]
     3,86 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0x40],ymm0
     1,39 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0x60],ymm1
     4,00 │    │  vmovups    ymm0,YMMWORD PTR [rdi+rsi*4-0x60]
     2,58 │    │  vmovups    ymm1,YMMWORD PTR [rdi+rsi*4-0x40]
     1,59 │    │  vmulps     ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0x60]
     9,60 │    │  vmulps     ymm1,ymm1,YMMWORD PTR [rax+rsi*4-0x40]
     4,19 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0x80],ymm0
     1,36 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0xa0],ymm1
     4,21 │    │  vmovups    ymm0,YMMWORD PTR [rdi+rsi*4-0x20]
     3,98 │    │  vmovups    ymm1,YMMWORD PTR [rdi+rsi*4]
     1,77 │    │  vmulps     ymm0,ymm0,YMMWORD PTR [rax+rsi*4-0x20]
    10,02 │    │  vmulps     ymm1,ymm1,YMMWORD PTR [rax+rsi*4]
     3,82 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0xc0],ymm0
     2,31 │    │  vmovups    YMMWORD PTR [rbp+rsi*4+0xe0],ymm1
     4,07 │    │  add        rsi,0x40
     0,26 │    │  add        rdx,0x4
          │    └──jne        320
   ```
   
   The custom simd code works on 8-lane vectors, processes only two of them (16 floats) per iteration, and also contains additional bounds checks at the top of the loop:
   
   ```
     0,79 │1e0:┌─→cmp        r15,rsi
     3,88 │    │↓ je         214
     0,00 │    │  cmp        r13,rsi
     0,54 │    │↓ je         214
          │    │  vmovups    ymm0,YMMWORD PTR [rdi+rsi*4]
    12,98 │    │  vmovups    ymm1,YMMWORD PTR [rdi+rsi*4+0x20]
    11,00 │    │  vmulps     ymm0,ymm0,YMMWORD PTR [rbx+rsi*4]
    36,47 │    │  vmulps     ymm1,ymm1,YMMWORD PTR [rbx+rsi*4+0x20]
    15,89 │    │  vmovups    YMMWORD PTR [rax+rsi*4+0x20],ymm1
    10,02 │    │  vmovups    YMMWORD PTR [rax+rsi*4],ymm0
     5,77 │    │  add        rsi,0x10
     1,86 │    │  cmp        r8,rsi
     0,00 │    └──jne        1e0
   ```
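One way to avoid per-iteration bounds checks in a hand-chunked kernel is to let the iterator establish the chunk boundaries once, up front. The sketch below is an assumption about how such a loop could be written in plain Rust, not the arrow-rs implementation; `multiply_chunked` and `LANES` are hypothetical names. It walks all three slices with `chunks_exact`, so every chunk is known to have exactly `LANES` elements, and handles the tail separately.

```rust
/// Number of f32 lanes in a 256-bit vector (assumed target width).
const LANES: usize = 8;

/// Multiplies `a` and `b` element-wise into `out`, eight lanes at a time.
pub fn multiply_chunked(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == out.len());

    let mut a_chunks = a.chunks_exact(LANES);
    let mut b_chunks = b.chunks_exact(LANES);
    let mut out_chunks = out.chunks_exact_mut(LANES);

    // Borrow the iterators mutably so their remainders stay accessible
    // after the main loop.
    for ((oc, ac), bc) in (&mut out_chunks).zip(&mut a_chunks).zip(&mut b_chunks) {
        // Each chunk has exactly LANES elements, so the compiler can
        // usually prove these indices are in range.
        for i in 0..LANES {
            oc[i] = ac[i] * bc[i];
        }
    }

    // Scalar tail for the elements that do not fill a whole chunk.
    let tail = out_chunks.into_remainder();
    for ((o, x), y) in tail
        .iter_mut()
        .zip(a_chunks.remainder())
        .zip(b_chunks.remainder())
    {
        *o = x * y;
    }
}
```

Whether this removes the checks seen in the listing above depends on the compiler version and on how the real kernel obtains its slices, so the generated assembly would need to be inspected to confirm it.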

