opusdsp: implement NEON accelerated postfilter and deemphasis

Lynne Sat, 23 Mar 2019 09:17:18 -0700

23 Mar 2019, 15:04 by [email protected]:

> 2019-03-23 15:23 GMT+01:00, Lynne <[email protected]>:
>
>> 16 Mar 2019, 16:34 by [email protected]:
>>
>>> 153372 UNITS in postfilter_c,   65536 runs,      0 skips
>>> 73164 UNITS in postfilter_neon,   65536 runs,      0 skips -> 2.1x speedup
>>>
>>> 80591 UNITS in deemphasis_c,  131072 runs,      0 skips
>>> 43969 UNITS in deemphasis_neon,  131072 runs,      0 skips -> 1.83x speedup
>>>
>>> Total decoder speedup: ~15% on a Raspberry Pi 3 (from 28.1x to 33.5x
>>> realtime)
>>>
>>> Deemphasis SIMD based on the following unrolling:
>>> const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;
>>> float state = coeff;
>>>
>>> for (int i = 0; i < len; i += 4) {
>>>  y[0] = x[0] + c1*state;
>>>  y[1] = x[1] + c2*state + c1*x[0];
>>>  y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
>>>  y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];
>>>
>>>  state = y[3];
>>>  y += 4;
>>>  x += 4;
>>> }
>>>
>>> Unlike the x86 version, duplication is used instead of pslldq, so
>>> the structure and tables are different.
>>> The same approach was tested on x86 (3x pslldq -> vbroadcastss + shufps
>>> + pslldq) and had the same performance, so 3x pslldq was kept, as
>>> vbroadcastss has a higher latency.
>>>
>>
>> Could someone review the patches?
>>
>
> Which toolchains did you test?
> (For compilation, not performance.)
>


gcc 8.2.1 on both aarch64 and x86-64
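
For reference, the unrolling quoted above follows from expanding the
recurrence y[n] = x[n] + c*y[n-1] four steps at a time. Below is a minimal C
sketch that checks the unrolled form against the plain loop. It is not the
code from the patch: the value of CELT_EMPH_COEFF is assumed here, and
deemphasis_scalar, deemphasis_unrolled and deemphasis_max_error are
hypothetical names used only for illustration.

```c
#include <assert.h>

/* Assumption: value taken to match FFmpeg's Opus decoder constant. */
#define CELT_EMPH_COEFF 0.85000610f

/* Plain one-pole recurrence: y[n] = x[n] + c*y[n-1]. Returns the last state. */
static float deemphasis_scalar(float *y, const float *x, float state, int len)
{
    for (int i = 0; i < len; i++)
        state = y[i] = x[i] + CELT_EMPH_COEFF * state;
    return state;
}

/* The 4x unrolling from the commit message; len must be a multiple of 4. */
static float deemphasis_unrolled(float *y, const float *x, float state, int len)
{
    const float c1 = CELT_EMPH_COEFF, c2 = c1*c1, c3 = c2*c1, c4 = c3*c1;

    assert(len % 4 == 0);
    for (int i = 0; i < len; i += 4) {
        y[0] = x[0] + c1*state;
        y[1] = x[1] + c2*state + c1*x[0];
        y[2] = x[2] + c3*state + c1*x[1] + c2*x[0];
        y[3] = x[3] + c4*state + c1*x[2] + c2*x[1] + c3*x[0];

        state = y[3];
        y += 4;
        x += 4;
    }
    return state;
}

/* Hypothetical helper: runs both versions on deterministic pseudo-random
 * input in [-1, 1) and returns the largest absolute difference seen,
 * including the difference between the two final states. */
static float deemphasis_max_error(int len, float initial_state)
{
    enum { MAX_LEN = 960 };
    float x[MAX_LEN], a[MAX_LEN], b[MAX_LEN];
    unsigned seed = 12345u;
    float worst, sa, sb;

    assert(len > 0 && len <= MAX_LEN);
    for (int i = 0; i < len; i++) {
        seed = seed * 1664525u + 1013904223u;            /* simple LCG step */
        x[i] = (float)((seed >> 8) & 0xFFFFFFu) / 8388608.0f - 1.0f;
    }

    sa = deemphasis_scalar(a, x, initial_state, len);
    sb = deemphasis_unrolled(b, x, initial_state, len);

    worst = sa > sb ? sa - sb : sb - sa;
    for (int i = 0; i < len; i++) {
        float d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

The two forms are algebraically identical but round differently, so
deemphasis_max_error(960, 0.0f) should return a small value on the order of
float rounding error rather than exactly zero.
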
_______________________________________________
ffmpeg-devel mailing list
[email protected]
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
[email protected] with subject "unsubscribe".

Re: [FFmpeg-devel] [PATCH 5/5] aarch64/opusdsp: implement NEON accelerated postfilter and deemphasis
