On 2016-11-02 13:11:19 +0200, Martin Storsjö wrote:
> On Wed, 2 Nov 2016, Janne Grunau wrote:
> 
> >On 2016-10-19 22:18:43 +0300, Martin Storsjö wrote:
> 
> >>+    /* We only need h + 7 lines, but the horizontal filter assumes an \
> >>+     * even number of rows, so filter h + 8 lines here. */             \
> >>+    ff_vp9_put_##filter##sz##_h_neon(temp, 64,                         \
> >>+                                     src - 3 * src_stride, src_stride, \
> >>+                                     h + 8, mx, 0);                    \
> >>+    ff_vp9_##op##_##filter##sz##_v_neon(dst, dst_stride,               \
> >>+                                        temp + 3 * 64, 64,             \
> >>+                                        h, 0, my);                     \
> >
> >I guess you haven't tried implementing the diagonal mc fully in asm?
> >Probably not worth the effort since the only significant gain comes from
> >keeping the temporary values in registers, which works for size 4 and 8 and
> >maybe 16 for aarch64, which are probably not that important.
> 
> I didn't try that fully, no. I did try an asm frontend just like this one,
> though, to avoid a separate vpush/vpop in each of them, and while it does
> save a little, I had to restructure the end of the filter function, which
> cost a little for the plain-horizontal/vertical functions. All in all I
> guess it would be a gain, but not really worth it I think. Being able to
> keep all the intermediates in registers would probably make it more
> worthwhile, but that would also be even more effort. Since you need 7 more
> rows than the actual input size, and the height can be width*2, I think it
> would be hard to keep the intermediates for anything more than size 4 in
> registers even if you'd do a handmade version for it.

Yes, the variable height complicates it further; I guess 8 and 16 (on 
aarch64) work only if you merge the horizontal and vertical filters 
completely. I don't think it is worth the effort, and the patch is good 
as it is anyway.
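For reference, the two-pass structure under discussion can be modelled in scalar C. This is only a sketch of the data flow: `filter_h`/`filter_v` are hypothetical identity stand-ins for the real NEON 8-tap filters, and the buffer sizes assume the 64-byte temp stride from the macro above.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the horizontal NEON filter: identity copy here,
 * just to show the data flow through the temp buffer. */
static void filter_h(uint8_t *dst, int dst_stride,
                     const uint8_t *src, int src_stride,
                     int w, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride, w);
}

/* Stand-in for the vertical NEON filter, likewise an identity copy. */
static void filter_v(uint8_t *dst, int dst_stride,
                     const uint8_t *src, int src_stride,
                     int w, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * dst_stride, src + y * src_stride, w);
}

/* Two-pass hv composition: filter horizontally into a 64-byte-strided
 * temp buffer, starting 3 rows above the block so the vertical 8-tap
 * filter has its top context, then filter vertically from
 * temp + 3 * 64 into dst. Only h + 7 rows are needed, but the count
 * is rounded up to an even h + 8. */
static void put_hv(uint8_t *dst, int dst_stride,
                   const uint8_t *src, int src_stride,
                   int w, int h)
{
    uint8_t temp[(64 + 8) * 64]; /* worst case: 64-wide block, h + 8 rows */

    filter_h(temp, 64, src - 3 * src_stride, src_stride, w, h + 8);
    filter_v(dst, dst_stride, temp + 3 * 64, 64, w, h);
}
```

With identity filters, the output block simply equals the input block, which makes the row bookkeeping (the `- 3 * src_stride` / `+ 3 * 64` offsets cancelling out) easy to check.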

> >>+        vrhadd.u8       q3,  q3,  q11
> >>+        vst1.8          {q0},  [r0, :128]!
> >>+        vst1.8          {q1},  [r0, :128]!
> >>+        vst1.8          {q2},  [r0, :128]!
> >>+        vst1.8          {q3},  [r0, :128], r1
> >>+        subs            r12, r12, #1
> >>+        bne             1b
> >>+        bx              lr
> >>+endfunc
> >
> >Have you tried 4 d-register loads/stores? If it's not slower, it's
> >preferable since it uses fewer instructions. Same for size 32, with the
> >additional advantage of not having to adjust the stride.
> 
> Oh, why didn't I remember those instructions? Yes, this gives a huge gain,
> like a 20% speedup in the 32/64 functions. Thanks!

That's more than I expected; I almost didn't ask since I assumed you 
had already tried it.

> >>+@ Evaluate the filter twice in parallel, from the inputs src1-src9
> >>into dst1-dst2
> >>+@ (src1-src8 into dst1, src2-src9 into dst2), adding idx2 separately
> >>+@ at the end with saturation
> >>+.macro convolve dst1, dst2, src1, src2, src3, src4, src5, src6, src7, 
> >>src8, src9, idx1, idx2, tmp1, tmp2
> >>+        vmul.s16        \dst1, \src1, d0[0]
> >>+        vmul.s16        \dst2, \src2, d0[0]
> >>+        vmla.s16        \dst1, \src2, d0[1]
> >>+        vmla.s16        \dst2, \src3, d0[1]
> >>+        vmla.s16        \dst1, \src3, d0[2]
> >>+        vmla.s16        \dst2, \src4, d0[2]
> >>+.if \idx1 == 3
> >>+        vmla.s16        \dst1, \src4, d0[3]
> >>+        vmla.s16        \dst2, \src5, d0[3]
> >>+.else
> >>+        vmla.s16        \dst1, \src5, d1[0]
> >>+        vmla.s16        \dst2, \src6, d1[0]
> >>+.endif
> >>+        vmla.s16        \dst1, \src6, d1[1]
> >>+        vmla.s16        \dst2, \src7, d1[1]
> >>+        vmla.s16        \dst1, \src7, d1[2]
> >>+        vmla.s16        \dst2, \src8, d1[2]
> >>+        vmla.s16        \dst1, \src8, d1[3]
> >>+        vmla.s16        \dst2, \src9, d1[3]
> >>+.if \idx2 == 3
> >>+        vmul.s16        \tmp1, \src4, d0[3]
> >>+        vmul.s16        \tmp2, \src5, d0[3]
> >>+.else
> >>+        vmul.s16        \tmp1, \src5, d1[0]
> >>+        vmul.s16        \tmp2, \src6, d1[0]
> >>+.endif
> >>+        vqadd.s16       \dst1, \dst1, \tmp1
> >>+        vqadd.s16       \dst2, \dst2, \tmp2
> >>+.endm
> >
> >The negative terms could be accumulated in tmp1/tmp2. I doubt it makes a
> >difference in practice, but it reduces data dependencies.
> 
> This did actually turn out to help quite a lot on A53 (from 11707 cycles to
> 9649, for the 64v function). Very marginal slowdowns on A8 and A9, but
> clearly worthwhile doing.

A little surprising. Either vmla is really slow on the Cortex-A53, or 
its NEON unit is, contrary to reports in the tech press, at least 
partially 128-bit wide.
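The saturation trick the convolve macro implements can be modelled in scalar C. This is a sketch, not the actual FFmpeg code: the filter coefficients below are hypothetical (the real ones sit in d0/d1), but the structure matches the macro: seven taps accumulated with wrapping 16-bit adds (vmla.s16), and the largest tap's product computed separately (vmul.s16) and added last with saturation (vqadd.s16), so an overflowing sum clips instead of wrapping.

```c
#include <stdint.h>

/* One output sample of the 8-tap filter. idx marks the largest
 * (positive) tap, whose product is added with saturation at the end,
 * mirroring the idx1/idx2 handling in the NEON macro. */
static int16_t convolve8(const int16_t *src, const int16_t *filter, int idx)
{
    int16_t acc = 0;
    int16_t tmp;
    int     r;

    for (int i = 0; i < 8; i++)
        if (i != idx)                        /* vmla.s16: wraps mod 2^16 */
            acc = (int16_t)(acc + src[i] * filter[i]);

    tmp = (int16_t)(src[idx] * filter[idx]); /* vmul.s16 into tmp1/tmp2 */

    r = acc + tmp;                           /* vqadd.s16: saturating add */
    if (r >  32767) r =  32767;
    if (r < -32768) r = -32768;
    return (int16_t)r;
}
```

This also shows why pulling one tap out of the vmla chain is cheap: the separated product is independent of the running accumulator, which is presumably what Janne's tmp1/tmp2 suggestion exploits for the negative terms as well.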

Janne
_______________________________________________
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel