Hi Chen,

Thank you for your comments. Please see the replies below.

>May we merge these memory load with pair or 4-element load instruction?
>Storage instruction is same

The compiler should reliably generate LDP/STP instructions for the
existing intrinsics code. Using the 4-register load instructions
(vld1_s16_x4 etc.) also has a couple of disadvantages compared to the
existing code:

1) Older compilers (especially GCC) emit a lot of unnecessary MOV
   instructions around the multi-register load/store instructions.

2) Using plain loads/stores makes it easier for the compiler to elide
   the stores and loads to the temporary buffer between calls to e.g.
   fastForwardDst4_neon, whereas with the 4-register load instructions
   the stores and loads to the temporary buffer remain in the generated
   code.
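
As a rough illustration of the two load styles discussed above, the
sketch below compares four plain 64-bit loads with a single
vld1_s16_x4 call. The function names and the trivial summing operation
are hypothetical and are not taken from the patch; the scalar fallback
exists only so that the snippet also builds on non-AArch64 hosts. Which
instructions are actually emitted (LDP, a 4-register LD1, extra MOVs)
depends on the compiler and its version, so the generated assembly
should be inspected rather than assumed.

```cpp
#include <assert.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical example: sum four consecutive groups of four int16_t values.

// Variant A: four plain loads. These are ordinary adjacent loads, which the
// compiler is free to pair up (e.g. as LDP) or keep in whatever registers it likes.
static void sum4_plain(const int16_t *src, int16_t *dst)
{
#if defined(__aarch64__)
    int16x4_t r0 = vld1_s16(src + 0);
    int16x4_t r1 = vld1_s16(src + 4);
    int16x4_t r2 = vld1_s16(src + 8);
    int16x4_t r3 = vld1_s16(src + 12);
    vst1_s16(dst, vadd_s16(vadd_s16(r0, r1), vadd_s16(r2, r3)));
#else
    // Scalar fallback for non-AArch64 builds (assumption, for portability only).
    for (int i = 0; i < 4; i++)
        dst[i] = (int16_t)(src[i] + src[i + 4] + src[i + 8] + src[i + 12]);
#endif
}

// Variant B: one 4-register load. The result is a struct of four vectors,
// which maps to a multi-register load needing consecutive registers.
static void sum4_x4(const int16_t *src, int16_t *dst)
{
#if defined(__aarch64__)
    int16x4x4_t r = vld1_s16_x4(src);
    vst1_s16(dst, vadd_s16(vadd_s16(r.val[0], r.val[1]),
                           vadd_s16(r.val[2], r.val[3])));
#else
    // Same scalar fallback as above.
    for (int i = 0; i < 4; i++)
        dst[i] = (int16_t)(src[i] + src[i + 4] + src[i + 8] + src[i + 12]);
#endif
}

// Both variants must produce identical results; only the codegen differs.
static void check_load_variants()
{
    int16_t src[16];
    for (int i = 0; i < 16; i++)
        src[i] = (int16_t)(i * 3 - 7);

    int16_t a[4], b[4];
    sum4_plain(src, a);
    sum4_x4(src, b);
    for (int i = 0; i < 4; i++)
        assert(a[i] == b[i]);
}
```

Compiling both variants with optimisation enabled and comparing the
output of objdump -d is one way to see which loads and MOVs a given
compiler produces.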

>In optimize version, we need not this wrapper functions, especially memcpy, it 
>made slower performance

For the forward transforms, the memcpy part is effectively removed by
the compiler, so removing the memcpy from the intrinsics code generates
the same assembly as the current code.

However, for the inverse transforms, there seems to be some benefit in
removing the memcpy part, so I removed it only for the inverse
transforms in this v2 patch set.
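
To make the two wrapper styles concrete, here is a minimal sketch. It
assumes a wrapper that transforms into a contiguous temporary block and
then copies each row to the strided destination with memcpy, compared
with a variant that stores directly to the destination. The function
names are hypothetical and toyInverse4 is only a placeholder that
doubles each coefficient; it is not the transform code from the patch.

```cpp
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for a 4x4 inverse transform pass: it doubles each
// coefficient and writes row i of the result to dst + i * dstStride.
static void toyInverse4(const int16_t *src, int16_t *dst, intptr_t dstStride)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            dst[i * dstStride + j] = (int16_t)(2 * src[i * 4 + j]);
}

// Wrapper style: transform into a contiguous temporary block, then memcpy
// each row out to the strided destination.
static void inverse4_with_memcpy(const int16_t *src, int16_t *dst, intptr_t dstStride)
{
    int16_t block[4 * 4];
    toyInverse4(src, block, 4);
    for (int i = 0; i < 4; i++)
        memcpy(&dst[i * dstStride], &block[i * 4], 4 * sizeof(int16_t));
}

// Direct style: the transform stores straight to the strided destination,
// so there is no temporary block and no copy.
static void inverse4_direct(const int16_t *src, int16_t *dst, intptr_t dstStride)
{
    toyInverse4(src, dst, dstStride);
}

// Both styles must write the same values and leave the stride padding alone.
static void check_inverse_variants()
{
    int16_t src[16];
    for (int i = 0; i < 16; i++)
        src[i] = (int16_t)(i - 8);

    int16_t a[4 * 8] = {0}, b[4 * 8] = {0};
    inverse4_with_memcpy(src, a, 8);
    inverse4_direct(src, b, 8);
    assert(memcmp(a, b, sizeof(a)) == 0);
}
```

Whether the compiler can remove the temporary block by itself depends
on the surrounding code, so comparing the generated assembly of the two
styles is the reliable way to tell.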

Many thanks,
Micro

Micro Daryl Robles (7):
  AArch64: Add Neon implementation of 4x4 DST
  AArch64: Add Neon implementation of 4x4 IDST
  AArch64: Add Neon implementation of 4x4 DCT
  AArch64: Add Neon implementation of 4x4 IDCT
  AArch64: Add Neon implementation of 8x8 IDCT
  AArch64: Improve the Neon implementation of 16x16 IDCT
  AArch64: Improve the Neon implementation of 32x32 IDCT

 source/common/aarch64/dct-prim.cpp | 1442 +++++++++++++++++++++-------
 1 file changed, 1104 insertions(+), 338 deletions(-)

-- 
2.34.1
