On Mon, Aug 4, 2025 at 7:19 PM Jacob Lifshay <programmerj...@gmail.com> wrote:
>
> On August 4, 2025 6:49:20 AM PDT, Alan Kelly via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> > The gather is unmasked but the instruction does a merge into ymm4, which
> > depends on the value of ymm4 from the previous loop iteration. The
> > out-of-order scheduler does not know statically that the instruction is
> > fully unmasked, preventing parallel out-of-order execution of the
> > gathers.
> > ---
> >  libswscale/x86/scale_avx2.asm | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
> > index b4b852d60b..90ee8b0a0e 100644
> > --- a/libswscale/x86/scale_avx2.asm
> > +++ b/libswscale/x86/scale_avx2.asm
> > @@ -68,8 +68,10 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
> >  .innerloop:
> >  %endif
> >      vpcmpeqd m13, m13
> > +    pxor m3, m3 ; break loop-carried dependency
>
> this is in AVX2 code, so you should use vpxor since pxor will just clear the
> lower 128 bits and leave the upper 128 bits unmodified. actually, on some
> older intel cpus it will cause a huge stall due to not being v-prefixed:
> https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake/41349852#41349852
>
The v is actually added automatically by the preprocessor through x86inc.asm
if the function is marked as avx. It's a bit confusing, however, because all
the other instructions in this file use the prefix explicitly, so it might
still be a good idea to be explicit here as well.

As for the patch itself, any numbers?

- Hendrik
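
For illustration, a minimal sketch of the x86inc behaviour described above. It
assumes a file built inside the FFmpeg tree, where x86util.asm pulls in
x86inc.asm. The function name zero_example is hypothetical and not part of the
patch.

```asm
%include "libavutil/x86/x86util.asm"

SECTION .text

INIT_YMM avx2                  ; mN now maps to ymmN, avx2 flags enabled
; hypothetical function, for illustration only
cglobal zero_example, 0, 0, 4
    pxor   m3, m3              ; x86inc should emit: vpxor ymm3, ymm3, ymm3
    vpxor  m3, m3, m3          ; explicit spelling of the same instruction
    RET
```

Under these assumptions both lines should assemble to the same VEX-encoded
instruction, so the choice between them is a matter of readability and
consistency with the rest of the file, not of correctness.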