Re: [libav-devel] [PATCH] hevcdsp: add x86 SIMD for MC

2015-08-23 Thread James Almer
On 23/08/15 3:27 PM, Anton Khirnov wrote:
> Quoting James Almer (2015-08-22 23:58:41)
>> On 22/08/15 1:16 PM, Anton Khirnov wrote:
> +%macro QPEL_8 2
> +%if %2
> +%define postfixv
> +%define mvfrac myq

 Same here and below the else, rename this to mvfracq and add a mvfracd.

> +%define pixstride  srcstrideq
> +%define pixstride3 sstride3q
> +%define src_m3 srcm3q
> +%else
> +%define postfixh
> +%define mvfrac mxq
> +%define pixstride  1
> +%define pixstride3 3
> +%define src_m3 (srcq - 3)
> +%endif
> +
> +cglobal hevc_qpel_ %+ postfix %+ _ %+ %1 %+ _8, 8, 10, 7, dst, 
> dststride, src, srcstride, height, mx, my, sstride3, srcm3, coeffsreg
>>
>> This should be 7, 10, 7, Otherwise you're loading sstride3 from stack as if 
>> it were
>> a function argument.
>> Ideally though, for vertical you'd use 5, 9, 7 then manually load either mx 
>> or my
>> instead of both, saving one register, or even 5, 8, 7, since coeffsreg and 
>> mvfrac
>> are only used during init, and you can easily reuse one of those two 
>> registers for
>> sstride3 or srcm3.
>> You can also push it down to 4, 7, 7 if you manually load height before or 
>> after
>> the SPLATWs and reuse the regs for coeffsreg and mvfrac. As a plus, this 
>> would make
>> the functions work with x86_32.
>>
>> For horizontal you don't even need sstride3 or srcm3, so you definitely 
>> should
>> declare and use less registers.
>>
>> Didn't check other functions but I'm sure similar optimizations can be done.
>>
> +%if %2
> +and   mvfrac, 0x3
> +%endif
> +dec   mvfrac
> +shl   mvfrac, 4

 Use mvfracd on these three, it will clear the high bits for the mova below.
>>>
>>> anding the whole register with 3/7 should also work fine, with less
>>> clutter.
>>
>> "and mvfrac, 0x3" is only in ff_hevc_qpel_v_* functions, but not 
>> ff_hevc_qpel_h_*.
>> It's the same with the "and mvfrac, 0x7" cases below.
> 
> Sure, I meant to change the code so it's done in both paths.

It's not necessary. Just use the 32bit gprs.

>> You need to use the d suffix
>> instead of q on the register names to make sure the high bits are cleared.
> 
> Eh? Perhaps I'm misunderstading something, but I'd expect that using d
> here would do exactly the opposite and keep the random data in the high bits.

No, using d to write a gprs on x86_64 will clear the high bits (32 to 63) in a 
similar
way that using VEX coding instructions to write xmm registers will clear bits 
128 to
255 on ymm registers.

___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH] hevcdsp: add x86 SIMD for MC

2015-08-23 Thread Henrik Gramner
On Sun, Aug 23, 2015 at 8:27 PM, Anton Khirnov  wrote:
> Quoting James Almer (2015-08-22 23:58:41)
>> You need to use the d suffix
>> instead of q on the register names to make sure the high bits are cleared.
>
> Eh? Perhaps I'm misunderstading something, but I'd expect that using d
> here would do exactly the opposite and keep the random data in the high bits.

Operations on 32-bit registers zeroes the high bits of the register.
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH] hevcdsp: add x86 SIMD for MC

2015-08-23 Thread Anton Khirnov
Quoting James Almer (2015-08-22 23:58:41)
> On 22/08/15 1:16 PM, Anton Khirnov wrote:
> >>> +%macro QPEL_8 2
> >>> +%if %2
> >>> +%define postfixv
> >>> +%define mvfrac myq
> >>
> >> Same here and below the else, rename this to mvfracq and add a mvfracd.
> >>
> >>> +%define pixstride  srcstrideq
> >>> +%define pixstride3 sstride3q
> >>> +%define src_m3 srcm3q
> >>> +%else
> >>> +%define postfixh
> >>> +%define mvfrac mxq
> >>> +%define pixstride  1
> >>> +%define pixstride3 3
> >>> +%define src_m3 (srcq - 3)
> >>> +%endif
> >>> +
> >>> +cglobal hevc_qpel_ %+ postfix %+ _ %+ %1 %+ _8, 8, 10, 7, dst, 
> >>> dststride, src, srcstride, height, mx, my, sstride3, srcm3, coeffsreg
> 
> This should be 7, 10, 7, Otherwise you're loading sstride3 from stack as if 
> it were
> a function argument.
> Ideally though, for vertical you'd use 5, 9, 7 then manually load either mx 
> or my
> instead of both, saving one register, or even 5, 8, 7, since coeffsreg and 
> mvfrac
> are only used during init, and you can easily reuse one of those two 
> registers for
> sstride3 or srcm3.
> You can also push it down to 4, 7, 7 if you manually load height before or 
> after
> the SPLATWs and reuse the regs for coeffsreg and mvfrac. As a plus, this 
> would make
> the functions work with x86_32.
> 
> For horizontal you don't even need sstride3 or srcm3, so you definitely should
> declare and use less registers.
> 
> Didn't check other functions but I'm sure similar optimizations can be done.
> 
> >>> +%if %2
> >>> +and   mvfrac, 0x3
> >>> +%endif
> >>> +dec   mvfrac
> >>> +shl   mvfrac, 4
> >>
> >> Use mvfracd on these three, it will clear the high bits for the mova below.
> > 
> > anding the whole register with 3/7 should also work fine, with less
> > clutter.
> 
> "and mvfrac, 0x3" is only in ff_hevc_qpel_v_* functions, but not 
> ff_hevc_qpel_h_*.
> It's the same with the "and mvfrac, 0x7" cases below.

Sure, I meant to change the code so it's done in both paths.

> You need to use the d suffix
> instead of q on the register names to make sure the high bits are cleared.

Eh? Perhaps I'm misunderstading something, but I'd expect that using d
here would do exactly the opposite and keep the random data in the high bits.

-- 
Anton Khirnov
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH] hevcdsp: add x86 SIMD for MC

2015-08-22 Thread James Almer
On 21/08/15 4:19 AM, Anton Khirnov wrote:
> +%macro PUT_WEIGHTED_PRED 3
> +%if %1
> +cglobal hevc_put_weighted_pred_avg_ %+ %2 %+ _ %+ %3, 11, 11, 8, denom, 
> weight0, weight1, offset0, offset1, dst, dststride, src0, src1, srcstride, 
> height
> +%else
> +cglobal hevc_put_weighted_pred_ %+ %2 %+ _ %+ %3, 8, 8, 8, denom, weight0, 
> offset0, dst, dststride, src0, srcstride, height
> +%endif
> +and heightq,0x7fff
> +
> +add denomq, 14 + %1 - %3
> +movqm0, denomq

demon is an uint8_t. This should be

add denomd, 14 + %1 - %3
movdm0, denomd

I don't think doing a movzx denomd, denomb to clear bits 9 to 31 is necessary, 
so
the above should suffice.
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH] hevcdsp: add x86 SIMD for MC

2015-08-22 Thread James Almer
On 22/08/15 1:16 PM, Anton Khirnov wrote:
>>> +%macro QPEL_8 2
>>> +%if %2
>>> +%define postfixv
>>> +%define mvfrac myq
>>
>> Same here and below the else, rename this to mvfracq and add a mvfracd.
>>
>>> +%define pixstride  srcstrideq
>>> +%define pixstride3 sstride3q
>>> +%define src_m3 srcm3q
>>> +%else
>>> +%define postfixh
>>> +%define mvfrac mxq
>>> +%define pixstride  1
>>> +%define pixstride3 3
>>> +%define src_m3 (srcq - 3)
>>> +%endif
>>> +
>>> +cglobal hevc_qpel_ %+ postfix %+ _ %+ %1 %+ _8, 8, 10, 7, dst, dststride, 
>>> src, srcstride, height, mx, my, sstride3, srcm3, coeffsreg

This should be 7, 10, 7, Otherwise you're loading sstride3 from stack as if it 
were
a function argument.
Ideally though, for vertical you'd use 5, 9, 7 then manually load either mx or 
my
instead of both, saving one register, or even 5, 8, 7, since coeffsreg and 
mvfrac
are only used during init, and you can easily reuse one of those two registers 
for
sstride3 or srcm3.
You can also push it down to 4, 7, 7 if you manually load height before or after
the SPLATWs and reuse the regs for coeffsreg and mvfrac. As a plus, this would 
make
the functions work with x86_32.

For horizontal you don't even need sstride3 or srcm3, so you definitely should
declare and use less registers.

Didn't check other functions but I'm sure similar optimizations can be done.

>>> +%if %2
>>> +and   mvfrac, 0x3
>>> +%endif
>>> +dec   mvfrac
>>> +shl   mvfrac, 4
>>
>> Use mvfracd on these three, it will clear the high bits for the mova below.
> 
> anding the whole register with 3/7 should also work fine, with less
> clutter.

"and mvfrac, 0x3" is only in ff_hevc_qpel_v_* functions, but not 
ff_hevc_qpel_h_*.
It's the same with the "and mvfrac, 0x7" cases below. You need to use the d 
suffix
instead of q on the register names to make sure the high bits are cleared.
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH] hevcdsp: add x86 SIMD for MC

2015-08-22 Thread Anton Khirnov
Quoting James Almer (2015-08-21 19:10:19)
> On 21/08/15 4:19 AM, Anton Khirnov wrote:
> > +
> > +add   dstq, dststrideq
> > +add   srcq, srcstrideq
> > +
> > +%assign i (i + 1)
> > +%endrep
> > +
> > +dec heightq
> 
> This and every other case should be heightd. There's no guarantee the high 
> bits will be zero
> on every x86_64 target.
> This is the source of the crashes i was getting.

Right, fixed.

> 
> > +jg .loop
> > +RET
> > +%endmacro
> > +
> > +INIT_XMM sse2
> > +GET_PIXELS 4,  8, 1
> > +GET_PIXELS 8,  8, 1
> > +GET_PIXELS 12, 8, 3
> > +GET_PIXELS 16, 8, 2
> > +GET_PIXELS 24, 8, 3
> > +GET_PIXELS 32, 8, 3
> > +GET_PIXELS 48, 8, 3
> > +GET_PIXELS 64, 8, 3
> > +
> > +GET_PIXELS 4,  10, 1
> > +GET_PIXELS 8,  10, 1
> > +GET_PIXELS 12, 10, 3
> > +GET_PIXELS 16, 10, 2
> > +GET_PIXELS 24, 10, 3
> > +GET_PIXELS 32, 10, 3
> > +GET_PIXELS 48, 10, 3
> > +GET_PIXELS 64, 10, 3
> > +
> > +; hevc_qpel_h/v__8(int16_t *dst, ptrdiff_t dststride,
> > +; uint8_t *src, ptrdiff_t srcstride,
> > +; int height, int mx, int my, int *mcbuffer)
> > +
> > +; 8-bit qpel interpolation
> > +; %1: block width
> > +; %2: 0 - horizontal; 1 - vertical
> > +%macro QPEL_8 2
> > +%if %2
> > +%define postfixv
> > +%define mvfrac myq
> 
> Same here and below the else, rename this to mvfracq and add a mvfracd.
> 
> > +%define pixstride  srcstrideq
> > +%define pixstride3 sstride3q
> > +%define src_m3 srcm3q
> > +%else
> > +%define postfixh
> > +%define mvfrac mxq
> > +%define pixstride  1
> > +%define pixstride3 3
> > +%define src_m3 (srcq - 3)
> > +%endif
> > +
> > +cglobal hevc_qpel_ %+ postfix %+ _ %+ %1 %+ _8, 8, 10, 7, dst, dststride, 
> > src, srcstride, height, mx, my, sstride3, srcm3, coeffsreg
> > +%if %2
> > +and   mvfrac, 0x3
> > +%endif
> > +dec   mvfrac
> > +shl   mvfrac, 4
> 
> Use mvfracd on these three, it will clear the high bits for the mova below.

anding the whole register with 3/7 should also work fine, with less
clutter.

-- 
Anton Khirnov
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH] hevcdsp: add x86 SIMD for MC

2015-08-21 Thread James Almer
On 21/08/15 4:19 AM, Anton Khirnov wrote:
> +
> +add   dstq, dststrideq
> +add   srcq, srcstrideq
> +
> +%assign i (i + 1)
> +%endrep
> +
> +dec heightq

This and every other case should be heightd. There's no guarantee the high bits 
will be zero
on every x86_64 target.
This is the source of the crashes i was getting.

> +jg .loop
> +RET
> +%endmacro
> +
> +INIT_XMM sse2
> +GET_PIXELS 4,  8, 1
> +GET_PIXELS 8,  8, 1
> +GET_PIXELS 12, 8, 3
> +GET_PIXELS 16, 8, 2
> +GET_PIXELS 24, 8, 3
> +GET_PIXELS 32, 8, 3
> +GET_PIXELS 48, 8, 3
> +GET_PIXELS 64, 8, 3
> +
> +GET_PIXELS 4,  10, 1
> +GET_PIXELS 8,  10, 1
> +GET_PIXELS 12, 10, 3
> +GET_PIXELS 16, 10, 2
> +GET_PIXELS 24, 10, 3
> +GET_PIXELS 32, 10, 3
> +GET_PIXELS 48, 10, 3
> +GET_PIXELS 64, 10, 3
> +
> +; hevc_qpel_h/v__8(int16_t *dst, ptrdiff_t dststride,
> +; uint8_t *src, ptrdiff_t srcstride,
> +; int height, int mx, int my, int *mcbuffer)
> +
> +; 8-bit qpel interpolation
> +; %1: block width
> +; %2: 0 - horizontal; 1 - vertical
> +%macro QPEL_8 2
> +%if %2
> +%define postfixv
> +%define mvfrac myq

Same here and below the else, rename this to mvfracq and add a mvfracd.

> +%define pixstride  srcstrideq
> +%define pixstride3 sstride3q
> +%define src_m3 srcm3q
> +%else
> +%define postfixh
> +%define mvfrac mxq
> +%define pixstride  1
> +%define pixstride3 3
> +%define src_m3 (srcq - 3)
> +%endif
> +
> +cglobal hevc_qpel_ %+ postfix %+ _ %+ %1 %+ _8, 8, 10, 7, dst, dststride, 
> src, srcstride, height, mx, my, sstride3, srcm3, coeffsreg
> +%if %2
> +and   mvfrac, 0x3
> +%endif
> +dec   mvfrac
> +shl   mvfrac, 4

Use mvfracd on these three, it will clear the high bits for the mova below.

> +lea   coeffsregq, [hevc_qpel_coeffs8]
> +mova  m0, [coeffsregq + mvfrac]

Then use mvfraq here. Replicate this on every function, of course.

> +
> +%macro PUT_WEIGHTED_PRED 3
> +%if %1
> +cglobal hevc_put_weighted_pred_avg_ %+ %2 %+ _ %+ %3, 11, 11, 8, denom, 
> weight0, weight1, offset0, offset1, dst, dststride, src0, src1, srcstride, 
> height
> +%else
> +cglobal hevc_put_weighted_pred_ %+ %2 %+ _ %+ %3, 8, 8, 8, denom, weight0, 
> offset0, dst, dststride, src0, srcstride, height
> +%endif
> +and heightq,0x7fff

You should be able to remove this after the above changes.

___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel