Hi Chen,

Thank you for your comments. Please see the replies below.

>May we merge these memory load with pair or 4-element load instruction?
>Storage instruction is same

The compiler should reliably generate LDP/STP instructions for the existing
intrinsics code. Using the 4-register load instructions (vld1_s16_x4 etc)
has a couple of additional disadvantages compared to the existing code:

1) Older compilers (especially GCC) emit a lot of unnecessary MOV
   instructions around the multi-register load/store instructions.

2) Using plain load/stores allows the compiler to more easily elide the
   stores and loads to the temporary buffer between calls to e.g.
   fastForwardDst4_neon, whereas with the 4-register load instructions the
   stores and loads to the temporary buffer remain in the generated code.

>In optimize version, we need not this wrapper functions, especially memcpy, it
>made slower performance

For the forward transforms, the memcpy part is effectively removed by the
compiler, so removing memcpy in the intrinsics code generates the same
assembly code as the current one.

However, for the inverse transforms, there seems to be some benefit in
removing the memcpy part, so I removed them only for the inverse transforms
in this v2 patch set.

Many thanks,
Micro

Micro Daryl Robles (7):
  AArch64: Add Neon implementation of 4x4 DST
  AArch64: Add Neon implementation of 4x4 IDST
  AArch64: Add Neon implementation of 4x4 DCT
  AArch64: Add Neon implementation of 4x4 IDCT
  AArch64: Add Neon implementation of 8x8 IDCT
  AArch64: Improve the Neon implementation of 16x16 IDCT
  AArch64: Improve the Neon implementation of 32x32 IDCT

 source/common/aarch64/dct-prim.cpp | 1442 +++++++++++++++++++++-------
 1 file changed, 1104 insertions(+), 338 deletions(-)

-- 
2.34.1

_______________________________________________
x265-devel mailing list
https://mailman.videolan.org/listinfo/x265-devel
