On Mon, Apr 11, 2016 at 4:18 PM, Matthieu Bouron <matthieu.bou...@gmail.com> wrote:
>
> On Mon, Apr 11, 2016 at 9:58 AM, Benoit Fouet <benoit.fo...@free.fr> wrote:
>
>> Hi,
>>
>> (again, thanks to both of you for documenting all this assembly/NEON
>> code)
>>
>> On 09/04/2016 10:22, Matthieu Bouron wrote:
>>
>>> From: Matthieu Bouron <matthieu.bou...@stupeflix.com>
>>>
>>> ---
>>>
>>> Hello,
>>>
>>> The following patch adds a yuv2planeX_8_neon function for the arm
>>> platform. It is currently restricted to 8-bit per component sources
>>> until I fix the FATE issues with 10-bit sources (the dnxhd-*-10bit
>>> tests fail, but I haven't figured out yet where the failure comes from).
>>>
>>> Matthieu
>>>
>>> ---
>>>  libswscale/arm/Makefile  |  1 +
>>>  libswscale/arm/output.S  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>  libswscale/arm/swscale.c |  7 +++++
>>>  libswscale/utils.c       |  3 +-
>>>  4 files changed, 88 insertions(+), 1 deletion(-)
>>>  create mode 100644 libswscale/arm/output.S
>>>
>>> [...]
>>>
>>> diff --git a/libswscale/arm/output.S b/libswscale/arm/output.S
>>> new file mode 100644
>>> index 0000000..4437447
>>> --- /dev/null
>>> +++ b/libswscale/arm/output.S
>>> @@ -0,0 +1,78 @@
>>>
>>
>> [...]
>>
>>> +function ff_yuv2planeX_8_neon, export=1
>>> +    push            {r4-r12, lr}
>>> +    vpush           {q4-q7}
>>> +    ldr             r4, [sp, #104]          @ dstW
>>> +    ldr             r5, [sp, #108]          @ dither
>>> +    ldr             r6, [sp, #112]          @ offset
>>> +    vld1.8          {d0}, [r5]              @ load 8x8-bit dither values
>>> +    tst             r6, #0                  @ check offsetting which can be 0 or 3 only
>>> +    beq             1f
>>> +    vext.u8         d0, d0, d0, #3          @ honor offsetting which can be 3 only
>>> +1:  vmovl.u8        q0, d0                  @ extend dither to 16-bit
>>> +    vshll.u16       q1, d0, #12             @ extend dither to 32-bit with left shift by 12 (part 1)
>>> +    vshll.u16       q2, d1, #12             @ extend dither to 32-bit with left shift by 12 (part 2)
>>> +    mov             r7, #0                  @ i = 0
>>> +2:  vmov.u8         q3, q1                  @ initialize accumulator with dithering values (part 1)
>>> +    vmov.u8         q4, q2                  @ initialize accumulator with dithering values (part 2)
>>> +    mov             r8, r1                  @ tmpFilterSize = filterSize
>>> +    mov             r9, r2                  @ srcp
>>> +    mov             r10, r0                 @ filterp
>>> +3:  ldr             r11, [r9], #4           @ get pointer @ src[j]
>>> +    ldr             r12, [r9], #4           @ get pointer @ src[j+1]
>>> +    add             r11, r11, r7, lsl #1    @ &src[j][i]
>>> +    add             r12, r12, r7, lsl #1    @ &src[j+1][i]
>>> +    vld1.16         {q5}, [r11]             @ read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H
>>> +    vld1.16         {q6}, [r12]             @ read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
>>> +    ldr             r11, [r10], #4          @ read 2x16-bit coeffs (X, Y) at (filter[j], filter[j+1])
>>> +    vmov.16         q7, q5                  @ copy 8x16-bit @ src[j ][i + {0..7}] for following in-place zip instruction
>>> +    vmov.16         q8, q6                  @ copy 8x16-bit @ src[j+1][i + {0..7}] for following in-place zip instruction
>>> +    vzip.16         q7, q8                  @ A,I,B,J,C,K,D,L,E,M,F,N,G,O,H,L
>>
>> nit: O,H,P
>
>
> Fixed.
>
> Patch updated, fixing the FATE issues with 10-bit sources (the code was not
> honoring offsetting: tst r6, #0 has been replaced with cmp r6, #0).
> If there is no objection, I will push the patch in the next few hours.
>

Patch applied.
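Why tst r6, #0 never honored the offset can be shown with a small C model of the Z flag. This is an illustrative sketch, not FFmpeg code, and the helper names are made up. TST ANDs its operands, and any value ANDed with 0 is 0, so Z was always set and beq 1f always skipped the vext.u8 rotation. CMP subtracts instead, so Z is set only when r6 really is 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* TST Rn, #imm sets Z when (Rn & imm) == 0. */
static bool z_after_tst(uint32_t rn, uint32_t imm)
{
    return (rn & imm) == 0;
}

/* CMP Rn, #imm sets Z when (Rn - imm) == 0, i.e. when Rn == imm. */
static bool z_after_cmp(uint32_t rn, uint32_t imm)
{
    return rn - imm == 0;
}

/* "beq 1f" is taken when Z is set, which skips the dither rotation. */
static bool rotation_skipped_tst(uint32_t offset) { return z_after_tst(offset, 0); }
static bool rotation_skipped_cmp(uint32_t offset) { return z_after_cmp(offset, 0); }
```

With offset = 3, rotation_skipped_tst returns true while rotation_skipped_cmp returns false, which is the behavior change described in the message.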
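The review nit can be checked with a lane-level C model of the two NEON permutes used in the patch. This is a sketch of the instructions' documented semantics with hypothetical helper names, not FFmpeg code. vzip.16 q7, q8 interleaves the two registers, q7 receiving the first eight results and q8 the last eight, so the sequence ends in ...,G,O,H,P rather than ...,H,L. vext.u8 d0, d0, d0, #3 with both sources equal is a rotation: lane k receives byte (k + 3) & 7.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "vzip.16 qA, qB": interleave the 8 halfwords of each register;
 * qA gets the first 8 interleaved values, qB the last 8. */
static void vzip16(uint16_t a[8], uint16_t b[8])
{
    uint16_t out[16];
    for (int k = 0; k < 8; k++) {
        out[2 * k]     = a[k];
        out[2 * k + 1] = b[k];
    }
    for (int k = 0; k < 8; k++) {
        a[k] = out[k];
        b[k] = out[8 + k];
    }
}

/* Model of "vext.8 dD, dN, dN, #shift" (same register twice): a byte
 * rotation where lane k reads source byte (k + shift) & 7. */
static void vext8_rotate(uint8_t d[8], const uint8_t n[8], int shift)
{
    for (int k = 0; k < 8; k++)
        d[k] = n[(k + shift) & 7];
}
```

Feeding A..H and I..P through vzip16 leaves the second register holding E,M,F,N,G,O,H,P.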
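For reference, here is a scalar C sketch of what the routine computes per output pixel. It assumes the argument order implied by the register usage (filter, filterSize, src, dest, dstW, dither, offset) and assumes that the final >> 19 and unsigned 8-bit clamp match libswscale's C path; the tail of the assembly is not in the quoted excerpt, and yuv2planeX_8_ref and clip_uint8 are hypothetical names. The same argument order explains the stack loads: under the AAPCS the first four arguments arrive in r0-r3, and after push {r4-r12, lr} (10 x 4 = 40 bytes) and vpush {q4-q7} (4 x 16 = 64 bytes), the fifth argument, dstW, sits at [sp, #104], followed by dither at #108 and offset at #112.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 0..255 range of an 8-bit output sample. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference: seed each pixel with dither << 12
 * (the vshll.u16 #12 step), accumulate filterSize taps of 16-bit
 * intermediate samples, then (assumed) drop 19 bits and clamp. */
static void yuv2planeX_8_ref(const int16_t *filter, int filterSize,
                             const int16_t **src, uint8_t *dest, int dstW,
                             const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        /* offset is 0 or 3; with 3 this matches the vext.u8 #3 rotation */
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_uint8(val >> 19);
    }
}
```

The NEON loop appears to do the same work eight pixels and two taps at a time: q3/q4 hold eight 32-bit accumulators seeded with the shifted dither, and each pass through label 3 loads two source rows and one pair of 16-bit coefficients.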
Matthieu
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel