On Wed, 28 Apr 2021, Josh Dekker wrote:

From: Rafal Dabrowa <fatwild...@gmail.com>


First, a couple of technical details:

The use of '.ifeqs "\op", "sshr"' needs to be changed to '.ifc \op, sshr', because gas-preprocessor doesn't implement '.ifeqs'.
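
For reference, the change is only in the conditional directive itself, i.e. roughly:

        .ifeqs  "\op", "sshr"   // quoted-string compare, not implemented by gas-preprocessor

becomes

        .ifc    \op, sshr       // plain string compare, handled by both gas and gas-preprocessor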

The checkasm tests for hevc_pel that were added in 9c513edb7999a35ddcc6e3a8d984a96c8fb492a3 aren't actually ever run when you run "make fate-checkasm", because they're not added in tests/fate/checkasm.mak. They're also, IMO, split far too fine-grainedly, into 10 test groups: hevc_epel, hevc_epel_uni, hevc_epel_uni_w, etc. I'd strongly suggest merging them into one single group, hevc_pel, just like the file name. That makes it easier to hook up to fate.
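
With a single hevc_pel group, hooking it up to fate is then roughly a one-line addition to the test list in tests/fate/checkasm.mak, along these lines (variable name and neighbouring entries quoted from memory):

    FATE_CHECKASM = fate-checkasm-aacpsdsp          \
                    ...
                    fate-checkasm-hevc_add_res      \
                    fate-checkasm-hevc_pel          \
                    ...

where fate-checkasm-hevc_pel has to match the group name registered in tests/checkasm/checkasm.c.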

(As a side note, the fact that new checkasm tests need to be hooked up in fate like this, in tests/fate/checkasm.mak, is easy to overlook, and the benefit of being able to mark one specific checkasm group as a known failure to ignore isn't used very much. So maybe we should reconsider and just have one single "fate-checkasm" target that runs the full suite?)

All the tests crash under checkasm on linux. The reason is that they use e.g. x3 for the loop counter, while the corresponding parameter is declared as int. Checkasm guards against this by making sure that the upper halves of such registers are filled with garbage, but the wrapper that does that can't be used on apple platforms, so that aspect goes untested when developing on a mac. So please test checkasm on something other than a mac too.
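
Since the parameter is an int, the counter needs to be handled as a 32-bit value; e.g. for the height, something along these lines instead of the current subs on x3:

        subs            w3, w3, #1   // height is an int, so operate on the w register
        b.ne            1b

(or zero-/sign-extend it once before the loop, if the full 64-bit register really is needed).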

Then there's the sheer size of the patch. You said you considered templating it - if that's possible, it would be very welcome. I haven't looked closely enough to be able to say how to do it, or whether it's feasible at all, though.
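
Just to sketch what I mean (the macro name and parameter below are made up, not taken from the patch): the usual approach in the existing aarch64 code is to generate the near-identical width variants from a single .macro and instantiate it once per width, roughly:

        .macro  epel_h_fn w     // hypothetical macro, \w = block width
        function ff_hevc_put_hevc_epel_h\w\()_8_neon, export=1
                // shared body here, with .if \w == ... for the few width-specific parts
        endfunc
        .endm

        epel_h_fn 32
        epel_h_fn 48
        epel_h_fn 64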

Then finally, the code is very stall-prone. It might not be measurable on an M1, but it is very real on other cores. While you can say "I haven't bothered tuning it yet", the thing is, you don't need to actually run it on such a core to do a rough pass of much better scheduling. I did one initial test on one function and got it sped up by 4% with just a first, simple reshuffle. Example:

function ff_hevc_put_hevc_epel_h48_8_neon, export=1
[...]
1:      ld3            {v26.16b, v27.16b, v28.16b}, [x1], x5
        ushr            v29.2d, v26.2d, #8    // Uses v26 which was loaded right before
        ushr            v30.2d, v27.2d, #8
        ushr            v31.2d, v28.2d, #8
        ld1            {v24.s}[0], [x1], x5
        ld1            {v25.s}[0], [x1], x2 // Uses x1 which was updated in the previous instruction
        mov             v29.b[7], v24.b[0] // Uses v24 which was loaded almost right before
        mov             v30.b[7], v24.b[1]
        mov             v31.b[7], v24.b[2]
        mov             v29.b[15], v25.b[0]
        mov             v30.b[15], v25.b[1]
        mov             v31.b[15], v25.b[2]
        movi            v16.8h, #0 // This has no dependencies and could be done anytime, while e.g. waiting for results for the loads above!
        movi            v17.8h, #0
        movi            v18.8h, #0
        movi            v20.8h, #0
        movi            v21.8h, #0
        movi            v22.8h, #0
        calc_epelb      v16, v26, v27, v28, v29
        calc_epelb2     v20, v26, v27, v28, v29
        calc_epelb      v17, v27, v28, v29, v30
        calc_epelb2     v21, v27, v28, v29, v30
        calc_epelb      v18, v28, v29, v30, v31
        calc_epelb2     v22, v28, v29, v30, v31
        st3            {v16.8h, v17.8h, v18.8h}, [x0], #48
        st3            {v20.8h, v21.8h, v22.8h}, [x0], x10
        subs            x3, x3, #1 // This updates the condition flags right before the branch, while we could stick it anywhere else in the loop, where we have an instruction waiting for the result of the previous one
        b.ne            1b
        ret


A trivial rescheduling of it looks like this:

1:      ld3            {v26.16b, v27.16b, v28.16b}, [x1], x5
        movi            v16.8h, #0
        movi            v17.8h, #0
        movi            v18.8h, #0
        ld1            {v24.s}[0], [x1], x5
        movi            v20.8h, #0
        movi            v21.8h, #0
        movi            v22.8h, #0
        ld1            {v25.s}[0], [x1], x2
        ushr            v29.2d, v26.2d, #8
        ushr            v30.2d, v27.2d, #8
        ushr            v31.2d, v28.2d, #8
        mov             v29.b[7], v24.b[0]
        mov             v30.b[7], v24.b[1]
        mov             v31.b[7], v24.b[2]
        mov             v29.b[15], v25.b[0]
        mov             v30.b[15], v25.b[1]
        mov             v31.b[15], v25.b[2]
        calc_epelb      v16, v26, v27, v28, v29
        calc_epelb2     v20, v26, v27, v28, v29
        calc_epelb      v17, v27, v28, v29, v30
        calc_epelb2     v21, v27, v28, v29, v30
        calc_epelb      v18, v28, v29, v30, v31
        calc_epelb2     v22, v28, v29, v30, v31
        st3            {v16.8h, v17.8h, v18.8h}, [x0], #48
        subs            x3, x3, #1   // height
        st3            {v20.8h, v21.8h, v22.8h}, [x0], x10
        b.ne            1b

The first version gave these checkasm numbers for me:

                        Cortex A53      A72      A73
put_hevc_epel_h48_8_neon:   3312.7   2545.7   2961.5

The new version was:

put_hevc_epel_h48_8_neon:   3168.7   2497.0   2842.5

That is a 4% speedup on the A53, a 1.8% speedup on the A72 and a 4% speedup on the A73. (The latter two were quite noisy though, so the exact benefit is unknown.)

One doesn't necessarily need to run and test and tune all of it to get the exact perfect scheduling, but the functions look simple enough that you can improve them a lot just by shuffling things around like this.

// Martin
