I have no ideas for significantly improving performance; the macro helps keep the code readable. A few minor comments: moving the SUB so that it follows the LD1 would hide some memory-operation latency, as would interleaving the ST1 with the next LD1, and so on. But in these cases the code becomes harder to read, so I do not suggest these adjustments.
Regards,
Min Chen

At 2021-07-31 12:14:29, "Pop, Sebastian" <s...@amazon.com> wrote:

Hi,

Please let me know if you have ideas on how to make this code faster.
I tried to remove the stall by fetching more memory earlier, still no change in performance:

    // void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride)
    function x265_scale2D_64to32_neon
        mov             w12, #15
        ld1             {v0.16b-v3.16b}, [x1], x2
        ld1             {v4.16b-v7.16b}, [x1], x2
    .loop_scale2D:
        sub             w12, w12, #1
        ld1             {v20.16b-v23.16b}, [x1], x2
        ld1             {v24.16b-v27.16b}, [x1], x2
        scale2D_1       v0, v1, v2, v3, v4, v5, v6, v7
        ld1             {v0.16b-v3.16b}, [x1], x2
        ld1             {v4.16b-v7.16b}, [x1], x2
        scale2D_1       v20, v21, v22, v23, v24, v25, v26, v27
        cbnz            w12, .loop_scale2D
        ld1             {v20.16b-v23.16b}, [x1], x2
        ld1             {v24.16b-v27.16b}, [x1], x2
        scale2D_1       v0, v1, v2, v3, v4, v5, v6, v7
        scale2D_1       v20, v21, v22, v23, v24, v25, v26, v27
        ret
    endfunc

    .macro scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7
        uaddlp          \v0\().8h, \v0\().16b
        uaddlp          \v1\().8h, \v1\().16b
        uaddlp          \v2\().8h, \v2\().16b
        uaddlp          \v3\().8h, \v3\().16b
        uaddlp          \v4\().8h, \v4\().16b
        uaddlp          \v5\().8h, \v5\().16b
        uaddlp          \v6\().8h, \v6\().16b
        uaddlp          \v7\().8h, \v7\().16b
        add             \v0\().8h, \v0\().8h, \v4\().8h
        add             \v1\().8h, \v1\().8h, \v5\().8h
        add             \v2\().8h, \v2\().8h, \v6\().8h
        add             \v3\().8h, \v3\().8h, \v7\().8h
        uqrshrn         \v0\().8b, \v0\().8h, #2
        uqrshrn2        \v0\().16b, \v1\().8h, #2
        uqrshrn         \v1\().8b, \v2\().8h, #2
        uqrshrn2        \v1\().16b, \v3\().8h, #2
        st1             {\v0\().16b-\v1\().16b}, [x0], #32
    .endm

The only change that I did is to further optimize for code size by re-rolling the loop that was unrolled 2x.
No change in performance, and 2x smaller code.

Sebastian
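For readers following the arithmetic in the scale2D_1 macro, here is a minimal scalar sketch in C of what the routine appears to compute: each output pixel is the rounded average of a 2x2 block of input pixels (uaddlp sums horizontal pairs, add sums the two rows, and uqrshrn #2 adds 2 and shifts right by 2). The name scale2D_64to32_ref is hypothetical, the pixel typedef assumes an 8-bit build, and the contiguous 32-pixel destination rows are inferred from the #32 post-increment on the st1. It is only a model for checking results, not the x265 reference implementation.

```c
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit pixel build */

/* Hypothetical scalar model of the NEON routine above.
 * src: 64x64 block, rows separated by `stride` pixels.
 * dst: 32x32 block stored contiguously (32 pixels per row). */
static void scale2D_64to32_ref(pixel *dst, const pixel *src, intptr_t stride)
{
    for (int y = 0; y < 32; y++) {
        const pixel *row0 = src + (intptr_t)(2 * y) * stride;
        const pixel *row1 = row0 + stride;
        for (int x = 0; x < 32; x++) {
            /* Sum of the 2x2 block; the maximum is 4 * 255 = 1020. */
            unsigned sum = row0[2 * x] + row0[2 * x + 1]
                         + row1[2 * x] + row1[2 * x + 1];
            /* Rounding shift, matching uqrshrn #2; the result always
             * fits in 8 bits, so saturation never triggers. */
            dst[y * 32 + x] = (pixel)((sum + 2) >> 2);
        }
    }
}
```

A model like this can be run against the assembly on random input to confirm that any reordering of loads and stores leaves the output unchanged.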
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel