On Thu, May 12, 2016 at 3:50 PM, Benoit Fouet <benoit.fo...@free.fr> wrote:
> Hi, > > > On 12/05/2016 15:22, Matthieu Bouron wrote: > >> On Thu, May 12, 2016 at 10:01 AM, Benoit Fouet <benoit.fo...@free.fr> >> wrote: >> >> Hi, >>> >>> I mostly have nits remarks. >>> >>> On 11/05/2016 18:39, Matthieu Bouron wrote: >>> >>> From: Matthieu Bouron <matthieu.bou...@stupeflix.com> >>>> >>>> >>>> [...] >>> >>> diff --git a/libswresample/arm/resample.S b/libswresample/arm/resample.S >>> >>>> new file mode 100644 >>>> index 0000000..13462e3 >>>> --- /dev/null >>>> +++ b/libswresample/arm/resample.S >>>> @@ -0,0 +1,77 @@ >>>> >>>> [...] >>>> >>>> +function ff_resample_common_apply_filter_x4_float_neon, export=1 >>>> + vmov.f32 q0, #0.0 >>>> @ >>>> accumulator >>>> +1: vld1.32 {q1}, [r1]! >>>> @ >>>> src >>>> + vld1.32 {q2}, [r2]! >>>> @ >>>> filter >>>> + vmla.f32 q0, q1, q2 >>>> @ >>>> src + {0..3} * filter + {0..3} >>>> >>>> nit: the comment could be "accu += src[0..3] . filter[0..3]" >>> same for the other ones below >>> >>> [...] >>> >>> + subs r3, #4 @ >>> >>>> filter_length -= 4 >>>> + bgt 1b >>>> @ >>>> loop until filter_length >>>> + vpadd.f32 d0, d0, d1 >>>> @ >>>> pair adding of the 4x32-bit accumulated values >>>> + vpadd.f32 d0, d0, d0 >>>> @ >>>> pair adding of the 4x32-bit accumulator values >>>> + vst1.32 {d0[0]}, [r0] >>>> @ >>>> write accumulator >>>> + mov pc, lr >>>> +endfunc >>>> + >>>> +function ff_resample_common_apply_filter_x8_float_neon, export=1 >>>> + vmov.f32 q0, #0.0 >>>> @ >>>> accumulator >>>> +1: vld1.32 {q1}, [r1]! >>>> @ >>>> src1 >>>> + vld1.32 {q2}, [r2]! >>>> @ >>>> filter1 >>>> + vld1.32 {q8}, [r1]! >>>> @ >>>> src2 >>>> + vld1.32 {q9}, [r2]! >>>> @ >>>> filter2 >>>> + vmla.f32 q0, q1, q2 >>>> @ >>>> src1 + {0..3} * filter1 + {0..3} >>>> + vmla.f32 q0, q8, q9 >>>> @ >>>> src2 + {0..3} * filter2 + {0..3} >>>> >>>> instead of using src1 and src2, you may want to use src[0..3] and >>> src[4..7] >>> so, if I reuse the formulation I proposed above: >>> accu += src[0..3] . filter[0..3] >>> accu += src[4..7] . filter[4..7] >>> >>> Fixed locally (as well as the other case you mentionned) with: >> - vmla.f32 q0, q1, q2 @ >> src1 + {0..3} * filter1 + {0..3} >> - vmla.f32 q0, q8, q9 @ >> src2 + {0..3} * filter2 + {0..3} >> + vmla.f32 q0, q1, q2 @ >> accumulator += src1 + {0..3} * filter1 + {0..3} >> + vmla.f32 q0, q8, q9 @ >> accumulator += src2 + {4..7} * filter2 + {4..7} >> >> I prefer to use + {0..3} instead of [0..3] to make the comments consistent >> with what has been done in swscale/arm. >> >> > Fine for me (I chose the "[]" notation to be consistent with the "." > notation also, in order to do as if it were a dot product between two > vectors). > > > + subs r3, #8 @ >>> >>>> filter_length -= 4 >>>> >>>> -= 8 >>> >>> Fixed locally. >> >> >> [...] >>> >>> diff --git a/libswresample/arm/resample_init.c >>> >>>> b/libswresample/arm/resample_init.c >>>> new file mode 100644 >>>> index 0000000..c817d03 >>>> --- /dev/null >>>> +++ b/libswresample/arm/resample_init.c >>>> >>>> [...] >>>> >>>> +static int ff_resample_common_##TYPE##_neon(ResampleContext *c, void >>>> *dest, const void *source, \ >>>> + int n, int update_ctx) >>>> \ >>>> +{ >>>> \ >>>> + DELEM *dst = dest; >>>> \ >>>> + const DELEM *src = source; >>>> \ >>>> + int dst_index; >>>> \ >>>> + int index= c->index; >>>> \ >>>> + int frac= c->frac; >>>> \ >>>> + int sample_index = index >> c->phase_shift; >>>> \ >>>> + int x4_aligned_filter_length = c->filter_length & ~3; >>>> \ >>>> + int x8_aligned_filter_length = c->filter_length & ~7; >>>> \ >>>> + >>>> \ >>>> + index &= c->phase_mask; >>>> \ >>>> + for (dst_index = 0; dst_index < n; dst_index++) { >>>> \ >>>> + FELEM *filter = ((FELEM *) c->filter_bank) + c->filter_alloc * >>>> index; \ >>>> + >>>> \ >>>> + FELEM2 val=0; >>>> \ >>>> + int i = 0; >>>> \ >>>> + if (x8_aligned_filter_length >= 8) { >>>> \ >>>> + ff_resample_common_apply_filter_x8_##TYPE##_neon(&val, >>>> &src[sample_index], \ >>>> + filter, >>>> x8_aligned_filter_length); \ >>>> + i += x8_aligned_filter_length; >>>> \ >>>> + >>>> \ >>>> + } else if (x4_aligned_filter_length >= 4) { >>>> \ >>>> >>>> do you think there could be a gain processing the remainder of the >>> 8-aligned part through the 4-aligned part of the code? e.g. for a filter >>> length of 15, that would make: >>> - one run of the 8-aligned >>> - one run of the 4-aligned >>> - 3 C loops >>> As you stated filter length seems to commonly be 32, I guess that >>> wouldn't >>> be easy to benchmark, though. >>> >>> I'll see if I could trigger a case where the filter_length is 15 and come >> up with a benchmark. If there is a performance gain, would you be ok if it >> is implemented as a separate patch ? >> > > Sure, no problem for me. > Pushed. Thanks. Matthieu [...] _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel