On Tue, Jun 16, 2015 at 2:30 PM, Stefano Sabatini <stefa...@gmail.com> wrote:
> On date Tuesday 2015-06-16 14:16:11 +0200, Gwenole Beauchesne encoded:
>> Hi,
>>
>> 2015-06-16 14:03 GMT+02:00 Michael Niedermayer <michae...@gmx.at>:
> [...]
>> >> +#if HAVE_SSE2
>> >> +/* Copy 16/64 bytes from srcp to dstp loading data with the SSE>=2 instruction
>> >> + * load and storing data with the SSE>=2 instruction store.
>> >> + */
>> >> +#define COPY16(dstp, srcp, load, store) \
>> >> + __asm__ volatile ( \
>> >> + load " 0(%[src]), %%xmm1\n" \
>> >> + store " %%xmm1, 0(%[dst])\n" \
>> >> + : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1")
>> >> +
>> >> +#define COPY64(dstp, srcp, load, store) \
>> >> + __asm__ volatile ( \
>> >> + load " 0(%[src]), %%xmm1\n" \
>> >> + load " 16(%[src]), %%xmm2\n" \
>> >> + load " 32(%[src]), %%xmm3\n" \
>> >> + load " 48(%[src]), %%xmm4\n" \
>> >> + store " %%xmm1, 0(%[dst])\n" \
>> >> + store " %%xmm2, 16(%[dst])\n" \
>> >> + store " %%xmm3, 32(%[dst])\n" \
>> >> + store " %%xmm4, 48(%[dst])\n" \
>> >> + : : [dst]"r"(dstp), [src]"r"(srcp) : "memory", "xmm1", "xmm2", "xmm3", "xmm4")
>> >> +#endif
>> >> +
>> >> +#define COPY_LINE(dstp, srcp, size, load) \
>> >> + const unsigned unaligned = (-(uintptr_t)srcp) & 0x0f; \
>> >> + unsigned x = unaligned; \
>> >> + \
>> >> + av_assert0(((intptr_t)dstp & 0x0f) == 0); \
>> >> + \
>> >> + __asm__ volatile ("mfence"); \
>> >> + if (!unaligned) { \
>> >> + for (; x+63 < size; x += 64) \
>> >> + COPY64(&dstp[x], &srcp[x], load, "movdqa"); \
>> >> + } else { \
>> >> + COPY16(dst, src, "movdqu", "movdqa"); \
>> >> + for (; x+63 < size; x += 64) \
>> >> + COPY64(&dstp[x], &srcp[x], load, "movdqu"); \
>> >
>> > to use SSE registers in inline asm operands or clobber list you need
>> > to build with -msse (which is probably on by default on x86-64)
>> >
>> > files built with -msse will result in undefined behavior if anything
>> > in them is executed on a pre-SSE cpu, as these allow gcc to put
>> > SSE instructions directly in the code where it likes
>> >
>> > The way out of this "design" is not to tell gcc that it passes
>> > a string with SSE code to the assembler,
>> > that is, not to use SSE registers in operands and not to put them
>> > on the clobber list unless gcc actually is in SSE mode and can use
>> > and needs them there.
>> > see XMM_CLOBBERS*
>>
>> Well, from past experience, lying to gcc is generally not a good thing
>> either. There are multiple interesting ways it could fail from time to
>> time. :)
>>
>> Other approaches:
>> - With GCC >= 4.4, you can use __attribute__((target(T))) where T =
>>   "ssse3", "sse4.1", etc. This is the easiest way;
>> - Split into several separate files per target. Though one would then
>>   argue that, while we are at it, why not just start moving to yasm.
>>
>> The former approach looks more appealing to me, considering there may
>> be an effort to migrate to yasm afterwards.
>
> I plan to port this patch to yasm. I'll ask for help on IRC since
> it will probably take too much time otherwise without any guidance.
> --
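As an aside for readers following the patch, the alignment-handling strategy of the quoted COPY_LINE macro can be sketched in portable C. This is an illustrative model, not the patch itself: `copy_line_sketch` is an invented name, memcpy stands in for the movdqu (unaligned) / movdqa (aligned) load-store pairs, and the tail handling is assumed, since the quoted macro is truncated before that point.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable sketch of the copy strategy discussed above.  The names and
 * the tail handling are illustrative, not taken from the patch; memcpy
 * stands in for the SSE load/store pairs. */
static void copy_line_sketch(uint8_t *dst, const uint8_t *src, size_t size)
{
    /* Bytes needed to reach the next 16-byte boundary of src. */
    size_t unaligned = (-(uintptr_t)src) & 0x0f;
    size_t x = 0;

    if (unaligned && size >= 16) {
        memcpy(dst, src, 16);       /* one unaligned 16-byte head copy */
        x = unaligned;              /* continue from the aligned offset */
    }
    for (; x + 63 < size; x += 64)  /* 64-byte body; src + x now aligned */
        memcpy(dst + x, src + x, 64);
    if (x < size)                   /* remaining tail, if any */
        memcpy(dst + x, src + x, size - x);
}
```

Note that the head copy and the first body chunk deliberately overlap on the bytes between `unaligned` and 16; re-storing a few bytes is simpler than issuing a variable-length head copy.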
If you accept a few restrictions (like requiring aligned and padded
input/output), and maybe give it a more specific name so that people
won't try to replace generic memcpy with it, yasm'ing it would be pretty
simple. If you want it to be generic like the C version, supporting
unaligned access and whatnot, the asm is going to get a bit more verbose.

I could probably whip up a basic implementation of the restricted
version, and the yasm experts can make suggestions on improvements then.

- Hendrik
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
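The restricted variant Hendrik describes can be modelled in C to make the proposed contract concrete. Everything below is an assumption sketched from the email, not actual FFmpeg code: `padded_aligned_copy` is an invented name (deliberately specific, per the naming suggestion), and memcpy stands in for the movdqa load/store pairs a yasm version would emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the restricted contract: both pointers must be 16-byte
 * aligned, and the allocations must be padded so that copying in whole
 * 64-byte chunks is safe.  With those guarantees there is no unaligned
 * head and no scalar tail, only a single aligned loop. */
static void padded_aligned_copy(void *dst, const void *src, size_t size)
{
    assert(((uintptr_t)dst & 0x0f) == 0);
    assert(((uintptr_t)src & 0x0f) == 0);

    uint8_t       *d = dst;
    const uint8_t *s = src;
    /* The loop effectively rounds size up to the next multiple of 64;
     * the padding guarantee makes touching the extra bytes legal. */
    for (size_t x = 0; x < size; x += 64)
        memcpy(d + x, s + x, 64);
}
```

Pinning the contract down this way keeps the asm to one aligned loop, which is presumably why the restricted yasm version is expected to stay short.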