On Wed, Oct 21, 2015 at 10:32 AM Timothy Gu wrote:
> On Tue, Oct 20, 2015 at 7:36 PM James Almer wrote:
>
>> On 10/20/2015 10:32 PM, Timothy Gu wrote:
>>
>> > +; mov type used for src1q, dstq, first reg, second reg
>> > +%macro DIFF_BYTES_LOOP_CORE 4
>> > +%if regsize != 16
>>
>> %if mmsize != 16
On Tue, Oct 20, 2015 at 7:36 PM James Almer wrote:
> On 10/20/2015 10:32 PM, Timothy Gu wrote:
> > +; mov type used for src1q, dstq, first reg, second reg
> > +%macro DIFF_BYTES_LOOP_CORE 4
> > +%if regsize != 16
>
> %if mmsize != 16
>
> By checking regsize you're using the SSE2 version in the AV
On 10/20/2015 10:32 PM, Timothy Gu wrote:
> SSE2 version 4%-35% faster than MMX depending on the width.
> AVX2 version 1%-13% faster than SSE2 depending on the width.
> ---
>
> Addressed James's and Henrik's advice. Removed heuristics based on width.
> Made available both aligned and unaligned ve
SSE2 version 4%-35% faster than MMX depending on the width.
AVX2 version 1%-13% faster than SSE2 depending on the width.
---
Addressed James's and Henrik's advice. Removed heuristics based on width.
Made available both aligned and unaligned versions. For AVX2 version,
gracefully fall back on SSE2