Hi,
On Tue, Jul 17, 2012 at 6:16 AM, Justin Ruggles wrote:
> ---
> libavresample/x86/audio_convert.asm    | 63
> libavresample/x86/audio_convert_init.c |  9 +
> 2 files changed, 72 insertions(+), 0 deletions(-)
(I'm going to assume Loren had no furt
On 07/18/2012 03:06 PM, Loren Merritt wrote:
> On Wed, 18 Jul 2012, Justin Ruggles wrote:
>> On 07/18/2012 05:15 AM, Loren Merritt wrote:
>>>
>>> Aha, a large part of the discrepancy is due to cache aliasing, when the
>>> offsets between the 6 output streams are divisible by some large power of
>>>
On Wed, 18 Jul 2012, Justin Ruggles wrote:
> On 07/18/2012 05:15 AM, Loren Merritt wrote:
>>
>> Aha, a large part of the discrepancy is due to cache aliasing, when the
>> offsets between the 6 output streams are divisible by some large power of
>> 2. This would have to be fixed in whatever piece of
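The aliasing condition described in the quoted text (offsets between the output streams divisible by a large power of 2) can be checked with a few lines of C. This is only a sketch: the function name `planes_alias` is hypothetical, the 4096-byte stride is an assumed stand-in for "some large power of 2", and only adjacent planes are compared.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4096 bytes stands in for "some large power of 2". */
#define ALIAS_STRIDE 4096u

/* Hypothetical helper: returns 1 if any two adjacent output planes are
 * separated by a multiple of ALIAS_STRIDE bytes, 0 otherwise. */
static int planes_alias(float *const planes[], int nb_planes)
{
    for (int ch = 1; ch < nb_planes; ch++) {
        uintptr_t a = (uintptr_t)planes[ch - 1];
        uintptr_t b = (uintptr_t)planes[ch];
        uintptr_t d = a > b ? a - b : b - a;   /* byte distance */
        if (d % ALIAS_STRIDE == 0)
            return 1;
    }
    return 0;
}
```

A benchmark harness could call this on its 6 output pointers and, if it returns 1, pad each plane by a small amount (for example one cache line) before timing.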
On 07/18/2012 05:15 AM, Loren Merritt wrote:
> On Tue, 17 Jul 2012, Loren Merritt wrote:
>
>> 25% faster on penryn (even though I didn't predict that by counting uops).
>> 25% faster on sandybridge.
>> No change on bulldozer.
>>
>> But even though I successfully predicted that this is an improveme
On Tue, 17 Jul 2012, Loren Merritt wrote:
> 25% faster on penryn (even though I didn't predict that by counting uops).
> 25% faster on sandybridge.
> No change on bulldozer.
>
> But even though I successfully predicted that this is an improvement, I don't
> understand its performance.
> 6x load,
---
libavresample/x86/audio_convert.asm    | 63
libavresample/x86/audio_convert_init.c |  9 +
2 files changed, 72 insertions(+), 0 deletions(-)
diff --git a/libavresample/x86/audio_convert.asm b/libavresample/x86/audio_convert.asm
index 4899d91..cdd9824
On 07/17/2012 07:02 AM, Loren Merritt wrote:
> 25% faster on penryn (even though I didn't predict that by counting uops).
> 25% faster on sandybridge.
> No change on bulldozer.
>
> But even though I successfully predicted that this is an improvement, I don't
> understand its performance.
> 6x loa
25% faster on penryn (even though I didn't predict that by counting uops).
25% faster on sandybridge.
No change on bulldozer.
But even though I successfully predicted that this is an improvement, I don't
understand its performance.
6x load, 12x punpckldq, 6x store, 4x scalar:
Should take 12 cycle
On 07/16/2012 08:49 AM, Loren Merritt wrote:
> On Sun, 15 Jul 2012, Justin Ruggles wrote:
>
>> +.loop:
>> +mova m0, [srcq ] ; m0 = 0/0, 1/0, 2/0, 3/0
>> +mova m1, [srcq+ mmsize] ; m1 = 4/0, 5/0, 0/1, 1/1
>> +mova m2, [srcq+2*mmsize] ; m2 = 2/1, 3/1, 4/1, 5
On Sun, 15 Jul 2012, Justin Ruggles wrote:
> +.loop:
> +mova m0, [srcq ] ; m0 = 0/0, 1/0, 2/0, 3/0
> +mova m1, [srcq+ mmsize] ; m1 = 4/0, 5/0, 0/1, 1/1
> +mova m2, [srcq+2*mmsize] ; m2 = 2/1, 3/1, 4/1, 5/1
> +mova m3, [srcq+3*mmsize] ; m3 = 0/2,
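For readers following the lane comments above, which read as channel/sample, a scalar C version of the same deinterleave may help. It is a reference sketch only, not code from the patch: the name `deinterleave_6ch_ref` is hypothetical, and the 32-bit elements are assumed to be `float`.

```c
#include <stddef.h>

/* Hypothetical scalar reference, not the code from the patch.
 * src holds interleaved samples: ch0/s0, ch1/s0, ..., ch5/s0, ch0/s1, ...
 * dst[ch] receives the samples of channel ch in order. */
static void deinterleave_6ch_ref(float *const dst[6], const float *src,
                                 size_t nb_samples)
{
    for (size_t i = 0; i < nb_samples; i++)
        for (int ch = 0; ch < 6; ch++)
            dst[ch][i] = src[6 * i + ch];
}
```

With 4 elements per register, the first load in the quoted loop covers `src[0..3]`, that is channels 0 to 3 of sample 0, which matches the `m0 = 0/0, 1/0, 2/0, 3/0` comment.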
---
libavresample/x86/audio_convert.asm    | 68
libavresample/x86/audio_convert_init.c |  9
2 files changed, 77 insertions(+), 0 deletions(-)
diff --git a/libavresample/x86/audio_convert.asm b/libavresample/x86/audio_convert.asm
index 4899d91..de95151