On 2016-11-02 13:11:19 +0200, Martin Storsjö wrote: > On Wed, 2 Nov 2016, Janne Grunau wrote: > > >On 2016-10-19 22:18:43 +0300, Martin Storsjö wrote: > > >>+ /* We only need h + 7 lines, but the horizontal filter assumes an > >> \ > >>+ * even number of rows, so filter h + 8 lines here. */ > >> \ > >>+ ff_vp9_put_##filter##sz##_h_neon(temp, 64, > >> \ > >>+ src - 3 * src_stride, src_stride, > >> \ > >>+ h + 8, mx, 0); > >> \ > >>+ ff_vp9_##op##_##filter##sz##_v_neon(dst, dst_stride, > >> \ > >>+ temp + 3 * 64, 64, > >> \ > >>+ h, 0, my); > >> \ > > > >I guess you haven't tried implementing the diagonal mc fully in asm? > >Probably not worth the effort since the only significant gain comes from > >keeping the temporary values in registers which works on size 4 and 8 and > >maybe 16 for aarch64 wich are probably not that important. > > I didn't try that fully, no. I did try an asm frontend just like this one, > though, to avoid a separate vpush/vpop in each of them, and while it does > save a little, I had to restructure the end of the filter function, which > cost a little for the plain-horizontal/vertical functions. All in all I > guess it would be a gain, but not really worth it I think. Being able to > keep all the intermediates in registers would probably make it more > worthwhile, but that would also be even more effort. Since you need 7 more > rows than the actual input size, and the height can be width*2, I think it > would be hard to keep the intermediates for anything more than size 4 in > registers even if you'd do a handmade version for it.
Yes, the variable height complicates it further, I guess 8 and 16 (on aarch64) works only if you merge the horizontal and vertical filter completely. I don't think it is worth the effort and patch good as it is anyway. > >>+ vrhadd.u8 q3, q3, q11 > >>+ vst1.8 {q0}, [r0, :128]! > >>+ vst1.8 {q1}, [r0, :128]! > >>+ vst1.8 {q2}, [r0, :128]! > >>+ vst1.8 {q3}, [r0, :128], r1 > >>+ subs r12, r12, #1 > >>+ bne 1b > >>+ bx lr > >>+endfunc > > > >Have you tried 4 d-register loads/stores? If it's not slower it's > >preferable since it uses less instructions. Same for size 32 with the > >additional advantange of not having to adjust the stride > > Oh, why didn't I remember those instructions? Yes, this gives a huge gain, > like a 20% speedup in the 32/64 functions. Thanks! More than expected and I almost didn't ask for it since I assumed you had tried. > >>+@ Evaluate the filter twice in parallel, from the inputs src1-src9 > >>into dst1-dst2 > >>+@ (src1-src8 into dst1, src2-src9 into dst2), adding idx2 separately > >>+@ at the end with saturation > >>+.macro convolve dst1, dst2, src1, src2, src3, src4, src5, src6, src7, > >>src8, src9, idx1, idx2, tmp1, tmp2 > >>+ vmul.s16 \dst1, \src1, d0[0] > >>+ vmul.s16 \dst2, \src2, d0[0] > >>+ vmla.s16 \dst1, \src2, d0[1] > >>+ vmla.s16 \dst2, \src3, d0[1] > >>+ vmla.s16 \dst1, \src3, d0[2] > >>+ vmla.s16 \dst2, \src4, d0[2] > >>+.if \idx1 == 3 > >>+ vmla.s16 \dst1, \src4, d0[3] > >>+ vmla.s16 \dst2, \src5, d0[3] > >>+.else > >>+ vmla.s16 \dst1, \src5, d1[0] > >>+ vmla.s16 \dst2, \src6, d1[0] > >>+.endif > >>+ vmla.s16 \dst1, \src6, d1[1] > >>+ vmla.s16 \dst2, \src7, d1[1] > >>+ vmla.s16 \dst1, \src7, d1[2] > >>+ vmla.s16 \dst2, \src8, d1[2] > >>+ vmla.s16 \dst1, \src8, d1[3] > >>+ vmla.s16 \dst2, \src9, d1[3] > >>+.if \idx2 == 3 > >>+ vmul.s16 \tmp1, \src4, d0[3] > >>+ vmul.s16 \tmp2, \src5, d0[3] > >>+.else > >>+ vmul.s16 \tmp1, \src5, d1[0] > >>+ vmul.s16 \tmp2, \src6, d1[0] > >>+.endif > >>+ vqadd.s16 \dst1, \dst1, \tmp1 > >>+ vqadd.s16 \dst2, \dst2, \tmp2 > >>+.endm > > > >the negative terms could be accumulated in tmp1/2 I doubt it makes a > >difference in practice but reduces data dependencies. > > This did actually turn out to help quite a lot on A53 (from 11707 cycles to > 9649, for the 64v function). Very marginal slowdowns on A8 and A9 though, > but clearly worthwhile doing. A little surprising. Either vmla is really slow on the cortex-a53 or its neon unit is contrary to reports in tech press at least partially 128-bit wide. Janne _______________________________________________ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel