On 18.05.2015, at 12:37, Stefano Sabatini <stefa...@gmail.com> wrote:
> On Thu, May 14, 2015 at 2:52 PM, Stefano Sabatini <stefa...@gmail.com>
> wrote:
>
>> On date Thursday 2015-05-14 13:01:51 +0200, Stefano Sabatini encoded:
>>> On date Tuesday 2015-05-12 15:54:17 +0200, Hendrik Leppkes encoded:
>> [...]
>>>> One limitation is as the manual said, it needs to be copied from the
>>>> GPU to system memory. ffmpeg_dxva2.c does not implement a optimized
>>>> copy function for this, it uses plain old memcpy.
>>>> Intel introduced a new instruction for this in SSE4, MOVNTDQA, which
>>>> is optimized for copying from USWC memory (Uncacheable Speculative
>>>> Write Combining) to system memory. Using this may help speed up the
>>>> process significantly, and VLC probably uses it.
>>>
>>> Now the question is, how would be possible to optimize GPU to CPU copy
>>> to get an overall performance gain? At least VLC seems able to get
>>> better performances when using HW decoding, but I'm not sure it is
>>> copying decoded data back to the CPU (indeed it may perform direct
>>> rendering).
>>
>> Self-reply:
>> commit 62107e563f979c638f9a5f58cdfd5639d9c63ac7
>> Author: Laurent Aimar <fen...@videolan.org>
>> Date: Tue Nov 17 01:09:43 2009 +0100
>>
>> Improved performance when copying video surface in dxva2.
>>
>> That is, VLC is using optimized GPU->CPU copy when the relevant SSE2
>> instructions are available.
>>
>
> I have a first hackish patch, performed some tests and I got some
> significant performance gains, on my iCore5 with Intel Graphics HD4000 I
> have now the same performance as the software decoder using DXVA2 for
> decoding a H.264 1920x1080 video, but using only a single thread. The patch
> as is is a hack, since I had to modify the compilation flags to enable
> assembly compilation in the ffmpeg_dxva2.c file. I should probably create
> an optimized copy function in libavutil, comments are welcome.

What exactly is SSE4 needed for?
Both non-temporal movs and prefetches existed before SSE4, so if they are
critical for performance, the fallback implementation is bad.

However, possibly more important: why is a memcpy needed at all?
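
For reference, a minimal sketch of the kind of copy being discussed, not the actual patch. The names copy_from_uswc and copy_stream_load_sse41 are made up for illustration. It assumes GCC or Clang on x86, and it uses a per-function target attribute plus a runtime CPU check, so the file's compilation flags do not need to change. The streaming load is MOVNTDQA (SSE4.1); the older non-temporal instructions (MOVNTPS, MOVNTDQ) are stores. The streaming hint only matters when the source is actually mapped as USWC memory; on ordinary write-back memory it behaves like a normal aligned load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <smmintrin.h>              /* SSE4.1: _mm_stream_load_si128 */
#define HAVE_X86_STREAM_LOAD 1

/* Hypothetical helper: src must be 16-byte aligned, dst may be unaligned.
 * The target attribute enables SSE4.1 for this function only. */
__attribute__((target("sse4.1")))
static void copy_stream_load_sse41(uint8_t *dst, const uint8_t *src,
                                   size_t size)
{
    size_t i = 0;

    /* Main loop: four streaming loads (64 bytes) per iteration. */
    for (; i + 64 <= size; i += 64) {
        __m128i x0 = _mm_stream_load_si128((__m128i *)(src + i));
        __m128i x1 = _mm_stream_load_si128((__m128i *)(src + i + 16));
        __m128i x2 = _mm_stream_load_si128((__m128i *)(src + i + 32));
        __m128i x3 = _mm_stream_load_si128((__m128i *)(src + i + 48));
        _mm_storeu_si128((__m128i *)(dst + i),      x0);
        _mm_storeu_si128((__m128i *)(dst + i + 16), x1);
        _mm_storeu_si128((__m128i *)(dst + i + 32), x2);
        _mm_storeu_si128((__m128i *)(dst + i + 48), x3);
    }
    /* Remaining whole 16-byte blocks. */
    for (; i + 16 <= size; i += 16) {
        __m128i x = _mm_stream_load_si128((__m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), x);
    }
    /* Tail shorter than 16 bytes. */
    if (i < size)
        memcpy(dst + i, src + i, size - i);
}
#endif

/* Hypothetical entry point: uses the streaming-load path when the CPU
 * supports SSE4.1 and src is 16-byte aligned, plain memcpy otherwise. */
void copy_from_uswc(uint8_t *dst, const uint8_t *src, size_t size)
{
    assert(dst != NULL && src != NULL);
#ifdef HAVE_X86_STREAM_LOAD
    if (((uintptr_t)src & 15) == 0 && __builtin_cpu_supports("sse4.1")) {
        copy_stream_load_sse41(dst, src, size);
        return;
    }
#endif
    memcpy(dst, src, size);
}
```

For a decoded surface this would be called once per row, e.g. copy_from_uswc(dst + y * dst_pitch, src + y * src_pitch, width_in_bytes), with the pitches taken from the locked surface. Whether it beats memcpy has to be measured on the real USWC mapping.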