Re: [FFmpeg-devel] [PATCH v2] swscale/output: Altivec-optimize float yuv2plane1

2018-12-26 Thread Michael Niedermayer
On Mon, Dec 24, 2018 at 07:39:18PM +0200, Lauri Kasanen wrote:
> On Sun, 16 Dec 2018 11:06:53 +0200
> Lauri Kasanen  wrote:
> 
> > This function wouldn't benefit from VSX instructions, so I put it
> > under altivec.
> > 
> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
> > -f null -vframes 100 -v error -nostats -
> > 
> > 3743 UNITS in planar1,   65495 runs, 41 skips
> > 
> > -cpuflags 0
> > 
> > 23511 UNITS in planar1,   65530 runs,  6 skips
> > 
> > grayf32be
> > 
> > 4647 UNITS in planar1,   65449 runs, 87 skips
> > 
> > -cpuflags 0
> > 
> > 28608 UNITS in planar1,   65530 runs,  6 skips
> > 
> > The native speedup is 6.28133, and the bswapping one 6.15623.
> > Fate passes, each format tested with an image to video conversion.
> > 
> > Signed-off-by: Lauri Kasanen 
> > ---
> > 
> > Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.
> > 
> > v2: Added #undef vzero, that define broke the build on older gcc. Thanks Michael
> 
> Ping. And of course it's not gcc version dependent, but rather it was
> the BE ifdef; it was too early in the morning.

seems working, will apply

thx

[...]
-- 
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Rewriting code that is poorly written but fully understood is good.
Rewriting code that one doesnt understand is a sign that one is less smart
then the original author, trying to rewrite it will not make it better.


___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH v2] swscale/output: Altivec-optimize float yuv2plane1

2018-12-24 Thread Lauri Kasanen
On Sun, 16 Dec 2018 11:06:53 +0200
Lauri Kasanen  wrote:

> This function wouldn't benefit from VSX instructions, so I put it
> under altivec.
> 
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
> -f null -vframes 100 -v error -nostats -
> 
> 3743 UNITS in planar1,   65495 runs, 41 skips
> 
> -cpuflags 0
> 
> 23511 UNITS in planar1,   65530 runs,  6 skips
> 
> grayf32be
> 
> 4647 UNITS in planar1,   65449 runs, 87 skips
> 
> -cpuflags 0
> 
> 28608 UNITS in planar1,   65530 runs,  6 skips
> 
> The native speedup is 6.28133, and the bswapping one 6.15623.
> Fate passes, each format tested with an image to video conversion.
> 
> Signed-off-by: Lauri Kasanen 
> ---
> 
> Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.
> 
> v2: Added #undef vzero, that define broke the build on older gcc. Thanks Michael

Ping. And of course it's not gcc version dependent, but rather it was
the BE ifdef; it was too early in the morning.

- Lauri


Re: [FFmpeg-devel] [PATCH v2] swscale/output: Altivec-optimize float yuv2plane1

2018-12-17 Thread Lauri Kasanen
On Mon, 17 Dec 2018 14:52:49 +0100
Carl Eugen Hoyos  wrote:

> >> Note that this function / this pix_fmt currently has no real use-case
> >> afaict.
> >
> > Is there a list of which pix fmts are useful? Of course I don't want to
> > waste both my and reviewers' time, if the format is considered for
> > removal or otherwise broken.
> 
> The pix_fmt is not deprecated (it's new), what I meant was that it is
> currently only used for obscure monochrome Photoshop images
> and one filter, so I am not sure optimizing this colour conversion
> will help often.

Oh, thanks for the clarification. I'm going roughly in difficulty
order, doing the easy functions first.

- Lauri


Re: [FFmpeg-devel] [PATCH v2] swscale/output: Altivec-optimize float yuv2plane1

2018-12-17 Thread Carl Eugen Hoyos
2018-12-17 8:37 GMT+01:00, Lauri Kasanen :
> On Mon, 17 Dec 2018 01:03:36 +0100
> Carl Eugen Hoyos  wrote:
>
>> 2018-12-16 10:06 GMT+01:00, Lauri Kasanen :
>> > This function wouldn't benefit from VSX instructions, so I put it
>> > under altivec.
>> >
>> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
>> > -f null -vframes 100 -v error -nostats -
>> >
>> > 3743 UNITS in planar1,   65495 runs, 41 skips
>> >
>> > -cpuflags 0
>> >
>> > 23511 UNITS in planar1,   65530 runs,  6 skips
>> >
>> > grayf32be
>> >
>> > 4647 UNITS in planar1,   65449 runs, 87 skips
>> >
>> > -cpuflags 0
>> >
>> > 28608 UNITS in planar1,   65530 runs,  6 skips
>> >
>> > The native speedup is 6.28133, and the bswapping one 6.15623.
>>
>> > Fate passes
>>
>> I wonder a little how, given that grayf32 already breaks fate as-is...
>
> Are the tests for it disabled? fate.ffmpeg.org reports 100% success for
> many platforms.

Iirc, it is broken with --disable-sse

>> Note that this function / this pix_fmt currently has no real use-case
>> afaict.
>
> Is there a list of which pix fmts are useful? Of course I don't want to
> waste both my and reviewers' time, if the format is considered for
> removal or otherwise broken.

The pix_fmt is not deprecated (it's new), what I meant was that it is
currently only used for obscure monochrome Photoshop images
and one filter, so I am not sure optimizing this colour conversion
will help often.

But this is of course not very much related to this patch, sorry
for the noise!

Carl Eugen


Re: [FFmpeg-devel] [PATCH v2] swscale/output: Altivec-optimize float yuv2plane1

2018-12-16 Thread Lauri Kasanen
On Mon, 17 Dec 2018 01:03:36 +0100
Carl Eugen Hoyos  wrote:

> 2018-12-16 10:06 GMT+01:00, Lauri Kasanen :
> > This function wouldn't benefit from VSX instructions, so I put it
> > under altivec.
> >
> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
> > -f null -vframes 100 -v error -nostats -
> >
> > 3743 UNITS in planar1,   65495 runs, 41 skips
> >
> > -cpuflags 0
> >
> > 23511 UNITS in planar1,   65530 runs,  6 skips
> >
> > grayf32be
> >
> > 4647 UNITS in planar1,   65449 runs, 87 skips
> >
> > -cpuflags 0
> >
> > 28608 UNITS in planar1,   65530 runs,  6 skips
> >
> > The native speedup is 6.28133, and the bswapping one 6.15623.
> 
> > Fate passes
> 
> I wonder a little how, given that grayf32 already breaks fate as-is...

Are the tests for it disabled? fate.ffmpeg.org reports 100% success for
many platforms.

> Note that this function / this pix_fmt currently has no real use-case
> afaict.

Is there a list of which pix fmts are useful? Of course I don't want to
waste both my and reviewers' time, if the format is considered for
removal or otherwise broken.

- Lauri


Re: [FFmpeg-devel] [PATCH v2] swscale/output: Altivec-optimize float yuv2plane1

2018-12-16 Thread Carl Eugen Hoyos
2018-12-16 10:06 GMT+01:00, Lauri Kasanen :
> This function wouldn't benefit from VSX instructions, so I put it
> under altivec.
>
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
> -f null -vframes 100 -v error -nostats -
>
> 3743 UNITS in planar1,   65495 runs, 41 skips
>
> -cpuflags 0
>
> 23511 UNITS in planar1,   65530 runs,  6 skips
>
> grayf32be
>
> 4647 UNITS in planar1,   65449 runs, 87 skips
>
> -cpuflags 0
>
> 28608 UNITS in planar1,   65530 runs,  6 skips
>
> The native speedup is 6.28133, and the bswapping one 6.15623.

> Fate passes

I wonder a little how, given that grayf32 already breaks fate as-is...

Note that this function / this pix_fmt currently has no real use-case
afaict.

Carl Eugen


[FFmpeg-devel] [PATCH v2] swscale/output: Altivec-optimize float yuv2plane1

2018-12-16 Thread Lauri Kasanen
This function wouldn't benefit from VSX instructions, so I put it
under altivec.

./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
-f null -vframes 100 -v error -nostats -

3743 UNITS in planar1,   65495 runs, 41 skips

-cpuflags 0

23511 UNITS in planar1,   65530 runs,  6 skips

grayf32be

4647 UNITS in planar1,   65449 runs, 87 skips

-cpuflags 0

28608 UNITS in planar1,   65530 runs,  6 skips

The native speedup is 6.28133, and the bswapping one 6.15623.
Fate passes, each format tested with an image to video conversion.
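
For anyone checking the figures: the speedups above are just the ratio of
the -cpuflags 0 timing to the Altivec timing for the same format. A minimal
sketch of that arithmetic follows; speedup() and report() are hypothetical
helper names used only for this illustration, not anything in FFmpeg.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of baseline (-cpuflags 0) timer units to optimized timer units.
 * Lower units mean faster code, so a ratio above 1 is a speedup. */
static double speedup(double baseline_units, double optimized_units)
{
    assert(optimized_units > 0.0);
    return baseline_units / optimized_units;
}

/* Print one labelled ratio with the same precision used in this mail. */
static void report(const char *label, double baseline_units,
                   double optimized_units)
{
    printf("%s: %.5f\n", label, speedup(baseline_units, optimized_units));
}
```

With the numbers from the runs above, report("native", 23511, 3743) prints
"native: 6.28133" and report("bswap", 28608, 4647) prints "bswap: 6.15623".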

Signed-off-by: Lauri Kasanen 
---

Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.

v2: Added #undef vzero, that define broke the build on older gcc. Thanks Michael
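
For readers without a PPC toolchain, the per-sample arithmetic that the
scalar fallback in the patch below performs can be sketched in portable C.
This is only an illustration under stated assumptions: sample_to_float() is
a hypothetical name, the clamp is written out by hand in place of
av_clip_uint16(), and the endian-dependent store of the 16-bit value is
left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: converts one intermediate
 * swscale sample to a normalized float in the range [0, 1]. */
static float sample_to_float(int32_t sample)
{
    const int shift = 3;
    int32_t val;

    /* Keep the rounding add below from overflowing a signed 32-bit int. */
    assert(sample <= INT32_MAX - (1 << (shift - 1)));

    /* Round to nearest, then drop the 3 fractional bits.
     * Assumes an arithmetic right shift for negative values. */
    val = (sample + (1 << (shift - 1))) >> shift;

    /* Clamp to the unsigned 16-bit range. */
    if (val < 0)
        val = 0;
    else if (val > 65535)
        val = 65535;

    /* Scale so that 65535 maps to roughly 1.0. */
    return (float)val * (1.0f / 65535.0f);
}
```

The vector loop in the patch does the same add, shift, clamp and multiply
on four samples at a time.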

 libswscale/ppc/swscale_altivec.c | 141 ++-
 1 file changed, 139 insertions(+), 2 deletions(-)

diff --git a/libswscale/ppc/swscale_altivec.c b/libswscale/ppc/swscale_altivec.c
index 1d2b2fa..d72ed1e 100644
--- a/libswscale/ppc/swscale_altivec.c
+++ b/libswscale/ppc/swscale_altivec.c
@@ -31,7 +31,8 @@
 #include "yuv2rgb_altivec.h"
 #include "libavutil/ppc/util_altivec.h"
 
-#if HAVE_ALTIVEC && HAVE_BIGENDIAN
+#if HAVE_ALTIVEC
+#if HAVE_BIGENDIAN
 #define vzero vec_splat_s32(0)
 
 #define  GET_LS(a,b,c,s) {\
@@ -102,7 +103,137 @@
 #include "swscale_ppc_template.c"
 #undef FUNC
 
-#endif /* HAVE_ALTIVEC && HAVE_BIGENDIAN */
+#undef vzero
+
+#endif /* HAVE_BIGENDIAN */
+
+#define output_pixel(pos, val, bias, signedness) \
+if (big_endian) { \
+AV_WB16(pos, bias + av_clip_ ## signedness ## 16(val >> shift)); \
+} else { \
+AV_WL16(pos, bias + av_clip_ ## signedness ## 16(val >> shift)); \
+}
+
+static void
+yuv2plane1_float_u(const int32_t *src, float *dest, int dstW, int start)
+{
+static const int big_endian = HAVE_BIGENDIAN;
+static const int shift = 3;
+static const float float_mult = 1.0f / 65535.0f;
+int i, val;
+uint16_t val_uint;
+
+for (i = start; i < dstW; ++i){
+val = src[i] + (1 << (shift - 1));
+output_pixel(&val_uint, val, 0, uint);
+dest[i] = float_mult * (float)val_uint;
+}
+}
+
+static void
+yuv2plane1_float_bswap_u(const int32_t *src, uint32_t *dest, int dstW, int start)
+{
+static const int big_endian = HAVE_BIGENDIAN;
+static const int shift = 3;
+static const float float_mult = 1.0f / 65535.0f;
+int i, val;
+uint16_t val_uint;
+
+for (i = start; i < dstW; ++i){
+val = src[i] + (1 << (shift - 1));
+output_pixel(&val_uint, val, 0, uint);
+dest[i] = av_bswap32(av_float2int(float_mult * (float)val_uint));
+}
+}
+
+static void yuv2plane1_float_altivec(const int32_t *src, float *dest, int dstW)
+{
+const int dst_u = -(uintptr_t)dest & 3;
+const int shift = 3;
+const int add = (1 << (shift - 1));
+const int clip = (1 << 16) - 1;
+const float fmult = 1.0f / 65535.0f;
+const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+const vector uint32_t vshift = (vector uint32_t) vec_splat_u32(shift);
+const vector uint32_t vlargest = (vector uint32_t) {clip, clip, clip, clip};
+const vector float vmul = (vector float) {fmult, fmult, fmult, fmult};
+const vector float vzero = (vector float) {0, 0, 0, 0};
+vector uint32_t v;
+vector float vd;
+int i;
+
+yuv2plane1_float_u(src, dest, dst_u, 0);
+
+for (i = dst_u; i < dstW - 3; i += 4) {
+v = vec_ld(0, (const uint32_t *) &src[i]);
+v = vec_add(v, vadd);
+v = vec_sr(v, vshift);
+v = vec_min(v, vlargest);
+
+vd = vec_ctf(v, 0);
+vd = vec_madd(vd, vmul, vzero);
+
+vec_st(vd, 0, &dest[i]);
+}
+
+yuv2plane1_float_u(src, dest, dstW, i);
+}
+
+static void yuv2plane1_float_bswap_altivec(const int32_t *src, uint32_t *dest, int dstW)
+{
+const int dst_u = -(uintptr_t)dest & 3;
+const int shift = 3;
+const int add = (1 << (shift - 1));
+const int clip = (1 << 16) - 1;
+const float fmult = 1.0f / 65535.0f;
+const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+const vector uint32_t vshift = (vector uint32_t) vec_splat_u32(shift);
+const vector uint32_t vlargest = (vector uint32_t) {clip, clip, clip, clip};
+const vector float vmul = (vector float) {fmult, fmult, fmult, fmult};
+const vector float vzero = (vector float) {0, 0, 0, 0};
+const vector uint32_t vswapbig = (vector uint32_t) {16, 16, 16, 16};
+const vector uint16_t vswapsmall = vec_splat_u16(8);
+vector uint32_t v;
+vector float vd;
+int i;
+
+yuv2plane1_float_bswap_u(src, dest, dst_u, 0);
+
+for (i = dst_u; i < dstW - 3; i += 4) {
+v = vec_ld(0, (const uint32_t *) &src[i]);
+v = vec_add(v, vadd);
+v = vec_sr(v, vshift);
+v = vec_min(v, vlargest);
+
+vd = vec_ctf(v, 0);
+vd = vec_madd(vd, vmul, vzero);
+
+vd = (vector fl