At 2024-12-04 23:38:12, "Micro Daryl Robles" <[email protected]> wrote:
>+template<int shift>
>+static inline void inverseDst4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>+{
>+    int16x4_t s0 = vld1_s16(src + 0);

>+    int16x4_t s1 = vld1_s16(src + 4);
s0 and s1 are contiguous in memory, so they could be loaded with a single 128-bit load instruction.


>+    int16x4_t s2 = vld1_s16(src + 8);
>+    int16x4_t s3 = vld1_s16(src + 12);
>+
>+    int32x4_t c0 = vaddl_s16(s0, s2);
>+    int32x4_t c1 = vaddl_s16(s2, s3);
>+    int32x4_t c2 = vsubl_s16(s0, s3);

>+    int32x4_t c3 = vmull_n_s16(s1, 74);
With the 128-bit load suggested above, the s1 multiply could use the smull2 instruction on the high half of that register.
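
To make the two suggestions concrete, here is a rough sketch of what I have in mind. It is not taken from the patch: the helper name dst4Rows01Sketch and the output arrays c0/c3 are made up for illustration, and it only covers the s0/s1 part of the function. It assumes AArch64 (vmull_high_s16 is A64-only); the scalar branch is there only so the snippet builds on other targets. Whether the compiler actually emits smull2 for the high-half multiply would need to be checked in the disassembly.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not in the patch): computes c0 = s0 + s2 and
// c3 = 74 * s1 for a 4x4 block of int16_t coefficients stored row by row.
static inline void dst4Rows01Sketch(const int16_t *src, int32_t c0[4], int32_t c3[4])
{
    assert(src != nullptr);
#if defined(__aarch64__)
    // One 128-bit load covers rows 0 and 1 (src[0..7]).
    int16x8_t s01 = vld1q_s16(src);
    int16x4_t s0 = vget_low_s16(s01);
    int16x4_t s2 = vld1_s16(src + 8);

    vst1q_s32(c0, vaddl_s16(s0, s2));

    // Widening multiply of the high half (row 1) by 74. This intrinsic
    // corresponds to SMULL2, so no separate extract of s1 is needed.
    vst1q_s32(c3, vmull_high_s16(s01, vdupq_n_s16(74)));
#else
    // Scalar reference with the same results, for non-AArch64 builds.
    for (int i = 0; i < 4; i++)
    {
        c0[i] = (int32_t)src[i] + (int32_t)src[8 + i];
        c3[i] = 74 * (int32_t)src[4 + i];
    }
#endif
}
```

The remaining loads of s2 and s3 could probably be merged in the same way, but I have not checked how that interacts with the vaddl/vsubl operand pairing further down.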

_______________________________________________
x265-devel mailing list
[email protected]
https://mailman.videolan.org/listinfo/x265-devel