At 2024-11-26 21:24:17, "Micro Daryl Robles" <[email protected]> wrote:
>Also optimize the transpose_4x4_s16 implementation.
>
>Relative performance compared to scalar C:
>
>  Neoverse N1: 1.63x
>  Neoverse V1: 1.85x
>  Neoverse V2: 2.00x
>---
> source/common/aarch64/dct-prim.cpp | 88 +++++++++++++++++++++++++-----
> 1 file changed, 74 insertions(+), 14 deletions(-)
>
>+template<int shift>
>+static inline void fastForwardDst4_neon(const int16_t *src, int16_t *dst)
>+{
>+    int16x4_t s0 = vld1_s16(src + 0);
>+    int16x4_t s1 = vld1_s16(src + 4);
>+    int16x4_t s2 = vld1_s16(src + 8);
>+    int16x4_t s3 = vld1_s16(src + 12);

Could these loads be merged using a load-pair or a single 4-register load
instruction? (See the vld1_s16_x4 sketch at the end of this mail.)

>+    vst1_s16(dst + 0, d0);
>+    vst1_s16(dst + 4, d1);
>+    vst1_s16(dst + 8, d2);
>+    vst1_s16(dst + 12, d3);

The same applies to the store instructions.

>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)

In the optimized version we should not need this wrapper function, and
especially not the memcpy, which costs performance. (A sketch of loading
the strided rows directly also follows below.)

>+{
>+    const int shift_pass1 = 1 + X265_DEPTH - 8;
>+    const int shift_pass2 = 8;
>+
>+    ALIGN_VAR_32(int16_t, coef[4 * 4]);
>+    ALIGN_VAR_32(int16_t, block[4 * 4]);
>+
>+    for (int i = 0; i < 4; i++)
>+    {
>+        memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
>+    }
>+
>+    fastForwardDst4_neon<shift_pass1>(block, coef);
>+    fastForwardDst4_neon<shift_pass2>(coef, dst);
>+}
>
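For reference, something along these lines could merge the loads and
stores. This is an untested sketch: vld1_s16_x4/vst1_s16_x4 map to the
multi-register LD1/ST1 forms, and the s.val[]/d.val[] names would replace
s0..s3/d0..d3 in the kernel body. It requires the 16 coefficients to be
contiguous, which holds for the block[]/coef[] buffers in this patch.

    // Untested sketch: one LD1 {v0.4h-v3.4h} for all 16 input coefficients.
    int16x4x4_t s = vld1_s16_x4(src);
    // ... transform stages operating on s.val[0]..s.val[3] ...
    int16x4x4_t d;
    d.val[0] = d0;
    d.val[1] = d1;
    d.val[2] = d2;
    d.val[3] = d3;
    // One ST1 {v0.4h-v3.4h} for all 16 results.
    vst1_s16_x4(dst, d);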
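And a rough, untested sketch of dropping the staging copy: pass srcStride
down into the first pass and load the rows directly. The extra stride
parameter on fastForwardDst4_neon is my invention, not part of the patch.

    template<int shift>
    static inline void fastForwardDst4_neon(const int16_t *src,
                                            intptr_t srcStride, int16_t *dst)
    {
        // Load the four rows straight from the strided source;
        // no block[] staging buffer and no memcpy loop.
        int16x4_t s0 = vld1_s16(src + 0 * srcStride);
        int16x4_t s1 = vld1_s16(src + 1 * srcStride);
        int16x4_t s2 = vld1_s16(src + 2 * srcStride);
        int16x4_t s3 = vld1_s16(src + 3 * srcStride);
        // ... rest of the transform unchanged ...
    }

    void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
    {
        const int shift_pass1 = 1 + X265_DEPTH - 8;
        const int shift_pass2 = 8;

        ALIGN_VAR_32(int16_t, coef[4 * 4]);

        fastForwardDst4_neon<shift_pass1>(src, srcStride, coef);
        fastForwardDst4_neon<shift_pass2>(coef, 4, dst); // coef[] is contiguous
    }

Note the trade-off: this makes the pass-1 input strided, so the merged
vld1_s16_x4 load above would then only apply to the second pass (or to the
first pass when srcStride happens to be 4).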
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel
