On Mon, Aug 4, 2025 at 7:19 PM Jacob Lifshay <programmerj...@gmail.com> wrote:
>
> On August 4, 2025 6:49:20 AM PDT, Alan Kelly via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> > The gather is unmasked but the instruction does a merge into ymm4, which
> > depends on the value of ymm4 from the previous loop iteration. The
> > out-of-order scheduler does not know statically that the instruction is
> > fully unmasked, preventing parallel out-of-order execution of the
> > gathers.
> > ---
> >  libswscale/x86/scale_avx2.asm | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
> > index b4b852d60b..90ee8b0a0e 100644
> > --- a/libswscale/x86/scale_avx2.asm
> > +++ b/libswscale/x86/scale_avx2.asm
> > @@ -68,8 +68,10 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
> >  .innerloop:
> >  %endif
> >      vpcmpeqd  m13, m13
> > +    pxor m3, m3  ; break loop-carried dependency
>
> This is AVX2 code, so you should use vpxor: a legacy-encoded pxor only
> clears the lower 128 bits and leaves the upper 128 bits unmodified. On some
> older Intel CPUs it will also cause a huge stall because it is not v-prefixed:
> https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake/41349852#41349852
>

The v prefix is actually added automatically by the preprocessor, through
x86inc.asm, if the function is marked as avx. That is a bit confusing
here, however, because all the other instructions use it explicitly, so
it might still be a good idea to be explicit about it.
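
A minimal sketch of the two spellings, assuming a function declared under
INIT_YMM avx2 the way the functions in this file are. The function name
zero_demo and its single dst argument are hypothetical and only there for
illustration; they are not part of the patch:

```nasm
%include "libavutil/x86/x86util.asm"

SECTION .text

INIT_YMM avx2
; hypothetical demo function: 1 argument, 1 GPR, 4 vector registers
cglobal zero_demo, 1, 1, 4, dst
    pxor    m3, m3          ; x86inc emits the VEX form: vpxor ymm3, ymm3, ymm3
    vpxor   m3, m3, m3      ; same instruction, written out explicitly
    movu    [dstq], m3      ; store the zeroed 256-bit register
    RET
```

Both zeroing lines should assemble to the same instruction, so the explicit
form only makes the intent visible next to the other v-prefixed
instructions in the loop.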

As for the patch itself, do you have any benchmark numbers?

- Hendrik