Hi Micro,

Thanks for the explanation.

1) vcombine_s16(s0, vdup_n_s16(0)): it looks like the compiler does not provide a
vreinterpret cast between 64-bit and 128-bit vectors, so we can keep your version;
we may optimize it in the future with hand-written assembly.

2)
>>>+    int16x4_t s0 = vld1_s16(src + 0);
>>>+    int16x4_t s1 = vld1_s16(src + 4);
>>
>>s0 and s1 may be loaded by one 128-bit instruction
>
>These loads are properly converted by the compiler using LDP d0, d1 and only
>take 1 instruction.

In this case s1 is used by vmull_n_s16 only, so we could combine s0 and s1 into one
register, but s2/s3 can't be combined. Since it is a small improvement, we can keep
your version and optimize it in the future with hand-written assembly.
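
For reference on point 1), a rough sketch of the zip-based 4x4 transpose under
discussion could look like the code below. This is illustrative only, not the exact
patch code; the function name and the a/b/c/d lane names are made up here. The
vcombine_s16(sX, vdup_n_s16(0)) lines cost nothing in practice because a 64-bit vld1
already zeroes the upper half of the 128-bit register; they only let the compiler use
the full Q register with vzip1q_s16.

    #include <arm_neon.h>

    // Inputs are rows: s0 = {a0,a1,a2,a3}, s1 = {b0,..,b3},
    //                  s2 = {c0,..,c3},    s3 = {d0,..,d3}.
    static inline void transpose_4x4_s16_sketch(int16x4_t &s0, int16x4_t &s1,
                                                int16x4_t &s2, int16x4_t &s3)
    {
        // Zero-extend each row so the whole 128-bit register can be used.
        int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
        int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
        int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
        int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));

        // First zip interleaves rows 0/2 and 1/3:
        // s02 = {a0,c0,a1,c1,a2,c2,a3,c3}, s13 = {b0,d0,b1,d1,b2,d2,b3,d3}.
        int16x8_t s02 = vzip1q_s16(s0q, s2q);
        int16x8_t s13 = vzip1q_s16(s1q, s3q);

        // Second zip yields the transposed rows two at a time.
        int16x8x2_t s0123 = vzipq_s16(s02, s13);

        s0 = vget_low_s16(s0123.val[0]);   // {a0,b0,c0,d0}
        s1 = vget_high_s16(s0123.val[0]);  // {a1,b1,c1,d1}
        s2 = vget_low_s16(s0123.val[1]);   // {a2,b2,c2,d2}
        s3 = vget_high_s16(s0123.val[1]);  // {a3,b3,c3,d3}
    }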

*) I have no more comments now, [PATCH v3] looks good to me.

Regards,
Chen

At 2024-12-14 07:16:24, "Micro Daryl Robles" <[email protected]> wrote:
>Hi Chen,
>
>Thank you for your comments. See my replies below.
>
>>>-static void transpose_4x4x16(int16x4_t &x0, int16x4_t &x1, int16x4_t &x2, int16x4_t &x3)
>>>+static inline void transpose_4x4_s16(int16x4_t &s0, int16x4_t &s1, int16x4_t &s2, int16x4_t &s3)
>>> {
>>>-    int32x2_t s0, s1, s2, s3;
>>>+    int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>>+    int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>>+    int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>>>+    int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>>
>>Why clear the high 64 bits? They will be overwritten by ZIP1 below.
>>
>>>+    int16x8x2_t s0123 = vzipq_s16(s02, s13);
>>
>The vcombine_s16(s0, vdup_n_s16(0)) is a NOP operation, since the previous 64-bit vld1
>instruction already has the implicit effect of clearing the top half of the 128-bit vector.
>So the vcombine intrinsic is just there to allow the compiler to use the whole 128-bit vector
>of s0 so that we can use the vzip1q intrinsic to zip the lower 64-bit data.
>
>This approach is faster than using vcombine(s0, s1), which needs an extra MOV instruction to
>move s1 into the upper 64 bits.
>
>>>+void dst4_neon(const int16_t *src, int16_t *dst, intptr_t srcStride)
>>>+{
>>>+    const int shift_pass1 = 1 + X265_DEPTH - 8;
>>>+    const int shift_pass2 = 8;
>>>+
>>>+    ALIGN_VAR_32(int16_t, coef[4 * 4]);
>>>+    ALIGN_VAR_32(int16_t, block[4 * 4]);
>>>+
>>>+    for (int i = 0; i < 4; i++)
>>>+    {
>>>+        memcpy(&block[i * 4], &src[i * srcStride], 4 * sizeof(int16_t));
>>>+    }
>>
>>We do not need this loop to copy data from the input buffer.
>
>For the forward DCTs, this memcpy loop is optimized away by the compiler, so I decided
>to keep it to stay consistent with the other forward DCT functions which I have not
>modified in this patch set.
>
>>>+template<int shift>
>>>+static inline void inverseDst4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>>+{
>>>+    int16x4_t s0 = vld1_s16(src + 0);
>>>+    int16x4_t s1 = vld1_s16(src + 4);
>>
>>s0 and s1 may be loaded by one 128-bit instruction
>
>These loads are properly converted by the compiler using LDP d0, d1 and only
>take 1 instruction.
>
>>>+static inline void transpose_4x8_s16(int16x4_t s0, int16x4_t s1, int16x4_t s2, int16x4_t s3,
>>>+                                     int16x4_t s4, int16x4_t s5, int16x4_t s6, int16x4_t s7,
>>>+                                     int16x8_t &d0, int16x8_t &d1, int16x8_t &d2, int16x8_t &d3)
>>>+{
>>>+    int16x8_t s0q = vcombine_s16(s0, vdup_n_s16(0));
>>>+    int16x8_t s1q = vcombine_s16(s1, vdup_n_s16(0));
>>>+    int16x8_t s2q = vcombine_s16(s2, vdup_n_s16(0));
>>>+    int16x8_t s3q = vcombine_s16(s3, vdup_n_s16(0));
>>>+    int16x8_t s4q = vcombine_s16(s4, vdup_n_s16(0));
>>>+    int16x8_t s5q = vcombine_s16(s5, vdup_n_s16(0));
>>>+    int16x8_t s6q = vcombine_s16(s6, vdup_n_s16(0));
>>>+    int16x8_t s7q = vcombine_s16(s7, vdup_n_s16(0));
>>
>>Same as before, clearing the high 64 bits is unnecessary.
>
>Same explanation as my first comment.
>These vcombine_s16(sx, vdup_n_s16(0)) instructions are effectively NOPs, since the previous
>64-bit vld1 instructions have the implicit effect of clearing the top half of the 128-bit vector.
>The vcombine is just there to allow the compiler to use the vzip1q intrinsic to zip the
>lower 64-bit data.
>
>>>+template<int shift>
>>>+static inline void partialButterflyInverse8_neon(const int16_t *src, int16_t *dst, intptr_t dstStride)
>>>+    if (vget_lane_u64(vreinterpret_u64_s16(vget_low_s16(s3)), 0) != 0)
>>
>>Detecting zeros is a good idea; however, 4 instructions are not enough to hide the
>>pipeline flush cost. Suggest combining each pair of if-sections below (O_lo & O_hi)
>>into one.
>>
>>>+    {
>>>+        O_lo[0] = vmlal_lane_s16(O_lo[0], vget_low_s16(s3), c_odd, 1);  //  75
>>>+        O_lo[1] = vmlsl_lane_s16(O_lo[1], vget_low_s16(s3), c_odd, 3);  // -18
>>>+        O_lo[2] = vmlsl_lane_s16(O_lo[2], vget_low_s16(s3), c_odd, 0);  // -89
>>>+        O_lo[3] = vmlsl_lane_s16(O_lo[3], vget_low_s16(s3), c_odd, 2);  // -50
>>>+    }
>>>+    if (vget_lane_u64(vreinterpret_u64_s16(vget_high_s16(s3)), 0) != 0)
>>>+    {
>>>+        O_hi[0] = vmlal_lane_s16(O_hi[0], vget_high_s16(s3), c_odd, 1);  //  75
>>>+        O_hi[1] = vmlsl_lane_s16(O_hi[1], vget_high_s16(s3), c_odd, 3);  // -18
>>>+        O_hi[2] = vmlsl_lane_s16(O_hi[2], vget_high_s16(s3), c_odd, 0);  // -89
>>>+        O_hi[3] = vmlsl_lane_s16(O_hi[3], vget_high_s16(s3), c_odd, 2);  // -50
>>>+    }
>
>Thank you for the suggestion; I have considered it and am sending patch v3.
>
>I have modified this by using (vaddlvq_u32(vreinterpretq_u32_s16(s3)) != 0).
>
>This unsigned add-long sums the 32-bit lanes into a single unsigned 64-bit value, which avoids
>any risk of a non-zero sum overflowing to zero. This should be better than using
>vorr(vget_low, vget_high).
>
>Many thanks,
>Micro
>
>Micro Daryl Robles (7):
>  AArch64: Add Neon implementation of 4x4 DST
>  AArch64: Add Neon implementation of 4x4 IDST
>  AArch64: Add Neon implementation of 4x4 DCT
>  AArch64: Add Neon implementation of 4x4 IDCT
>  AArch64: Add Neon implementation of 8x8 IDCT
>  AArch64: Improve the Neon implementation of 16x16 IDCT
>  AArch64: Improve the Neon implementation of 32x32 IDCT
>
> source/common/aarch64/dct-prim.cpp | 1440 +++++++++++++++++++++-------
> 1 file changed, 1102 insertions(+), 338 deletions(-)
>
>--
>2.34.1
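
To make the zero-detection change concrete, a minimal sketch of the test described in
Micro's reply above could look like the code below. This is illustrative only, not the
exact v3 patch code; the helper name row_is_nonzero is made up here.

    #include <arm_neon.h>

    // True if any of the eight int16 coefficients in the row is non-zero.
    // vaddlvq_u32 widens each 32-bit lane to 64 bits before summing, so a
    // non-zero vector can never sum to zero, whereas a plain 32-bit
    // horizontal add could wrap around to zero.
    static inline bool row_is_nonzero(int16x8_t coeffs)
    {
        return vaddlvq_u32(vreinterpretq_u32_s16(coeffs)) != 0;
    }

In the inverse-transform butterfly a single such test guards both the O_lo and O_hi
multiply-accumulate groups for that coefficient row, so the cost of a possible branch
mispredict is spread over eight multiply-accumulates instead of four, as suggested in
the review.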
_______________________________________________ x265-devel mailing list [email protected] https://mailman.videolan.org/listinfo/x265-devel
