On Thu, 14 May 2015 14:52:29 +0200 Stefano Sabatini <stefa...@gmail.com> wrote:
> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
> > On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
> [...]
> > > One limitation is as the manual said, it needs to be copied from the
> > > GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
> > > copy function for this, it uses plain old memcpy.
> > > Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
> > > is optimized for copying from USWC memory (Uncacheable Speculative
> > > Write Combining) to system memory. Using this may help speed up the
> > > process significantly, and VLC probably uses it.
> >
> > Now the question is, how would be possible to optimize GPU to CPU copy
> > to get an overall performance gain? At least VLC seems able to get
> > better performances when using HW decoding, but I'm not sure it is
> > copying decoded data back to the CPU (indeed it may perform direct
> > rendering).
>
> Self-reply:
> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
> Author: Laurent Aimar <fen...@videolan.org>
> Date: Tue Nov 17 01:09:43 2009 +0100
>
>     Improved performance when copying video surface in dxva2.
>
> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
> instructions are available.

Here's what lavfilters appears to use:
http://git.1f0.de/gitweb?p=lavfsplitter.git;a=blob;f=common/DSUtilLite/gpu_memcpy_sse4.h

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
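
For illustration, here is a rough sketch of what a MOVNTDQA-based copy out of USWC memory could look like. It is not taken from VLC, LAV Filters or ffmpeg_dxva2.c; the function names copy_from_uswc and copy_uswc_sse41 are made up for this example. MOVNTDQA belongs to SSE4.1 and is exposed as the _mm_stream_load_si128 intrinsic. The sketch assumes GCC or Clang (for the target attribute and __builtin_cpu_supports), requires 16-byte aligned source and destination for the fast path, and falls back to plain memcpy otherwise. It copies one contiguous buffer only and ignores surface pitch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

#if SKETCH_X86
/* Hypothetical helper: streaming-load copy, 64 bytes per iteration.
 * Both src and dst must be 16-byte aligned. The target attribute lets
 * this compile without passing -msse4.1 for the whole file. */
__attribute__((target("sse4.1")))
static void copy_uswc_sse41(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t i = 0;

    for (; i + 64 <= size; i += 64) {
        /* MOVNTDQA: non-temporal aligned 16-byte loads */
        __m128i a = _mm_stream_load_si128((__m128i *)(src + i));
        __m128i b = _mm_stream_load_si128((__m128i *)(src + i + 16));
        __m128i c = _mm_stream_load_si128((__m128i *)(src + i + 32));
        __m128i d = _mm_stream_load_si128((__m128i *)(src + i + 48));

        /* Regular aligned stores into cacheable system memory */
        _mm_store_si128((__m128i *)(dst + i),      a);
        _mm_store_si128((__m128i *)(dst + i + 16), b);
        _mm_store_si128((__m128i *)(dst + i + 32), c);
        _mm_store_si128((__m128i *)(dst + i + 48), d);
    }

    /* Tail of fewer than 64 bytes: plain memcpy */
    if (i < size)
        memcpy(dst + i, src + i, size - i);
}
#endif

/* Hypothetical entry point: use the streaming path when the CPU reports
 * SSE4.1 and both pointers are 16-byte aligned, otherwise use memcpy. */
void copy_from_uswc(uint8_t *dst, const uint8_t *src, size_t size)
{
#if SKETCH_X86
    if (__builtin_cpu_supports("sse4.1") &&
        ((uintptr_t)src % 16) == 0 && ((uintptr_t)dst % 16) == 0) {
        copy_uswc_sse41(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

The streaming load is only expected to pay off when the source really is write-combining memory, such as a locked DXVA2 surface. On ordinary cacheable memory it should behave much like a normal aligned load, so the function stays correct there but no speedup should be expected. A real implementation would also have to copy row by row to account for the surface pitch.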