At 2024-12-04 23:38:12, "Micro Daryl Robles" <[email protected]> wrote:
>+template<int shift>
>+static inline void inverseDst4_neon(const int16_t *src, int16_t *dst,
>intptr_t dstStride)
>+{
>+    int16x4_t s0 = vld1_s16(src + 0);
>+    int16x4_t s1 = vld1_s16(src + 4);

s0 and s1 could be loaded together with a single 128-bit load instruction.

>+    int16x4_t s2 = vld1_s16(src + 8);
>+    int16x4_t s3 = vld1_s16(src + 12);
>+
>+    int32x4_t c0 = vaddl_s16(s0, s2);
>+    int32x4_t c1 = vaddl_s16(s2, s3);
>+    int32x4_t c2 = vsubl_s16(s0, s3);
>+    int32x4_t c3 = vmull_n_s16(s1, 74);

With the above optimization, s1 would sit in the upper half of the 128-bit register, so this multiply could use the smull2 instruction.
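
A minimal sketch of the two suggestions above, not a replacement for the patch. It assumes an AArch64 target, where vmull_high_n_s16 is available and is expected to map to smull2. The helper name dst4_terms and the output array c are hypothetical and exist only for illustration. The scalar branch is there only so the sketch builds on other targets. s2 and s3 are left as 64-bit loads, as in the quoted code.

```cpp
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the patch): computes the four
// intermediate terms c0..c3 and stores them as 4 rows of 4 int32_t in c.
static inline void dst4_terms(const int16_t *src, int32_t *c)
{
#if defined(__aarch64__)
    // One 128-bit load covers src[0..7]: s0 is the low half, s1 the high half.
    int16x8_t s01 = vld1q_s16(src + 0);
    int16x4_t s0  = vget_low_s16(s01);
    int16x4_t s2  = vld1_s16(src + 8);
    int16x4_t s3  = vld1_s16(src + 12);

    int32x4_t c0 = vaddl_s16(s0, s2);
    int32x4_t c1 = vaddl_s16(s2, s3);
    int32x4_t c2 = vsubl_s16(s0, s3);
    // Widening multiply of the high half (s1) by 74, i.e. the smull2 form.
    int32x4_t c3 = vmull_high_n_s16(s01, 74);

    vst1q_s32(c + 0,  c0);
    vst1q_s32(c + 4,  c1);
    vst1q_s32(c + 8,  c2);
    vst1q_s32(c + 12, c3);
#else
    // Scalar fallback with the same arithmetic, for non-AArch64 builds only.
    for (int j = 0; j < 4; j++)
    {
        c[0 + j]  = (int32_t)src[0 + j] + src[8 + j];
        c[4 + j]  = (int32_t)src[8 + j] + src[12 + j];
        c[8 + j]  = (int32_t)src[0 + j] - src[12 + j];
        c[12 + j] = 74 * (int32_t)src[4 + j];
    }
#endif
}
```

Whether this saves anything in practice depends on how the compiler already schedules the two 64-bit loads, so it would need to be checked against the generated assembly.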
_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel
