2018-11-16 13:59 GMT+01:00, Lauri Kasanen <c...@gmx.com>: > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt > yuv420p \ > -f null -vframes 100 -v error -nostats - > > 1158 UNITS in planar1, 65528 runs, 8 skips > > -cpuflags 0 > > 19082 UNITS in planar1, 65533 runs, 3 skips > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version > takes as many cycles as the x86 SSE2 version, yikes it's fast.
> Note that this function uses VSX instructions, but is not marked so. > This is because several existing functions also make that mistake. > I'll submit a patch moving them all once this is reviewed. (This is less important atm, but I believe all functions currently in libswscale/ppc compile and run fine on - old - 32bit be hardware as your new function does. My completely inexperienced suspicion is that the instruction that you call "VSX" also exists on Altivec.) > No BE support since I can only test LE. LE is however the common case > for POWER8 and POWER9. > > Signed-off-by: Lauri Kasanen <c...@gmx.com> > --- > libswscale/ppc/swscale_altivec.c | 55 > ++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 55 insertions(+) > > diff --git a/libswscale/ppc/swscale_altivec.c > b/libswscale/ppc/swscale_altivec.c > index 2fb2337..a064016 100644 > --- a/libswscale/ppc/swscale_altivec.c > +++ b/libswscale/ppc/swscale_altivec.c > @@ -324,6 +324,53 @@ static void hScale_altivec_real(SwsContext *c, int16_t > *dst, int dstW, > } > } > } > + > +static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW, > + const uint8_t *dither, int offset, int start) > +{ > + int i; > + for (i = start; i < dstW; i++) { > + int val = (src[i] + dither[(i + offset) & 7]) >> 7; > + dest[i] = av_clip_uint8(val); > + } > +} > + > +static void yuv2plane1_8_altivec(const int16_t *src, uint8_t *dest, int > dstW, > + const uint8_t *dither, int offset) > +{ > + const int dst_u = -(uintptr_t)dest & 15; > + int i, j; > + LOCAL_ALIGNED(16, int16_t, val, [16]); > + const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, > 7}; > + vector int16_t vi, vileft, ditherleft, ditherright; > + vector uint8_t vd; > + > + for (j = 0; j < 16; j++) { > + val[j] = dither[(dst_u + offset + j) & 7]; > + } > + > + ditherleft = vec_ld(0, val); > + ditherright = vec_ld(0, &val[8]); > + > + yuv2plane1_8_u(src, dest, dst_u, dither, offset, 0); > + > + for (i = dst_u; i < dstW - 15; i += 16) { > + > + vi = vec_vsx_ld(0, &src[i]); > + vi = vec_adds(ditherleft, vi); > + vileft = vec_sra(vi, shifts); > + > + vi = vec_vsx_ld(0, &src[i + 8]); > + vi = vec_adds(ditherright, vi); > + vi = vec_sra(vi, shifts); > + > + vd = vec_packsu(vileft, vi); > + vec_st(vd, 0, &dest[i]); > + } > + > + yuv2plane1_8_u(src, dest, dstW, dither, offset, i); > +} > + > #endif /* HAVE_ALTIVEC */ > > av_cold void ff_sws_init_swscale_ppc(SwsContext *c) > @@ -367,6 +414,14 @@ av_cold void ff_sws_init_swscale_ppc(SwsContext *c) > c->yuv2packedX = ff_yuv2rgb24_X_altivec; > break; > } > + > + switch (c->dstBpc) { > + case 8: > +#if !HAVE_BIGENDIAN > + c->yuv2plane1 = yuv2plane1_8_altivec; > + break; > +#endif /* !HAVE_BIGENDIAN */ > + } I wanted to write that this hunk breaks compilation on big-endian (you should be able to test with "#if 0" instead of "#if !HAVE_BIGENDIAN") but the good news is that your patch works fine on big-endian, just remove the if-endif block. (Tested visually with lena on 32 and 64bit be.) Are you aware of the bounty that is offered for this task? https://trac.ffmpeg.org/ticket/5568 (and #5569, #5570) There is a bug report about one altivec routine that works on big-endian but breaks the output visually on little-endian while many other functions work on both, could you have a look? https://trac.ffmpeg.org/ticket/7124 Carl Eugen _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel