I want to suggest a way to eliminate a substantial amount of data copying when playing video on X servers that do not provide hardware video windows, even on servers that offer the X shared memory extension. In common situations, I suspect this could reduce the memory bus utilization of video playback by more than a factor of two.
I do not know if I have time to implement this optimization right now, but I think it is potentially a big enough benefit that I really ought to describe it here, in case someone else wants to implement it, or can relieve me of thinking about it by showing me why it will not work.

The copy operation that I want to eliminate occurs when the X server reads data from XPutImage (usually via a shared memory area) and copies it to the frame buffer. The amount of data copied is particularly large because the image is often "stretched" from its native dimensions (720x480 for DVD, for example) to the dimensions of the display area (for example, 1920x1200 for full screen video on a 24" panel).

To eliminate this copy, I want the X server to receive the unstretched YUV image via the XvPutImage request of the XVideo 2.2 ("Xv") extension, just as it does for display hardware that provides video windows, which typically do the YUV->RGB conversion and stretch in the display hardware. In the proposed Xv driver, which I will refer to as "soft XvPutImage", the YUV->RGB and stretch operations would instead be done in software by the X server, just as they are currently done in software by video playing programs. The difference is that by combining this work with the X server's receipt of the image, a big copy operation is eliminated, one that might plausibly account for more than half of the memory bus utilization in some common video playing scenarios. (A sketch of the unchanged client side of this path follows the list below.)

I realize that most modern video hardware has YUV/stretch video window capabilities or other hardware acceleration for this operation (for example, in 3D hardware), but there are common cases in practice where this optimization should be useful:

1) Improving the capabilities of the weakest systems would allow video to be used more ubiquitously (for example, adding video-based tutorials to larger application suites might become more common).

2) Many open source drivers lack this YUV/stretch capability even when the hardware has it, due to a lack of public documentation or to development that is slow in comparison to the life cycle of the hardware, even though efforts to address these problems are definitely helping.

3) The following scenarios may fall under #1 or #2, but are worth separate mention:

   a) On systems with more processor cores (typically systems that have YUV/stretch hardware but lack drivers for it), memory bus utilization will be especially important.

   b) "Fake" X servers, such as those used for VNC or those running on a virtualized computer, are less likely to have access to acceleration hardware (although it is possible).

   c) There are those who believe that 3D acceleration hardware will be traded off for more CPU cores in typical systems of the future, so, at least for the case of playing video through a 3D effect, this optimization may help. See, for example, the "Twilight of the GPU" interview on slashdot yesterday at http://tech.slashdot.org/tech/08/09/15/2116240.shtml .

4) There are also a couple of cases of small benefit that I note for completeness:

   a) For video with a slow frame rate playing on a monitor with a high refresh rate, where the frame buffer and video window are part of system memory (i.e., no video RAM) and the pixels in the frame buffer under the video window are still fetched for chroma key comparison, soft XvPutImage might actually use less bandwidth than a YUV/stretch video window.

   b) Not part of this proposal, but a similar idea for systems that have Xv but lack XvMC would be a "soft XvMC", eliminating the verbatim copy of YUV data into the X server that Xv performs; the bandwidth savings would be more modest, though.
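Since the client side of this path is unchanged, a player that already supports Xv would need no modification at all. For concreteness, here is a minimal sketch of that existing client-side sequence (error handling, format negotiation via XvListImageFormats, and the event loop are omitted, and the adaptor choice is simplified; the Xv and XShm calls themselves are the real ones):

/* Minimal sketch of the client-side Xv path that existing players
 * already use; soft XvPutImage would require no client changes.
 * Build with: cc xv_sketch.c -lXv -lXext -lX11 (name hypothetical).
 */
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <X11/extensions/Xvlib.h>

#define FOURCC_YUY2 0x32595559  /* packed yuv422, 2 bytes/pixel */

int main(void)
{
    Display *dpy = XOpenDisplay(NULL);
    Window win = XCreateSimpleWindow(dpy, DefaultRootWindow(dpy),
                                     0, 0, 1920, 1200, 0, 0, 0);
    GC gc = XCreateGC(dpy, win, 0, NULL);
    XMapWindow(dpy, win);

    /* Take the first adaptor's first port (simplified: a real player
     * checks ai[i].type & XvImageMask and the supported formats). */
    unsigned int nadaptors;
    XvAdaptorInfo *ai;
    XvQueryAdaptors(dpy, DefaultRootWindow(dpy), &nadaptors, &ai);
    XvPortID port = ai[0].base_id;
    XvGrabPort(dpy, port, CurrentTime);

    /* One shared-memory image at the *unstretched* video size. */
    XShmSegmentInfo shm;
    XvImage *im = XvShmCreateImage(dpy, port, FOURCC_YUY2, NULL,
                                   720, 480, &shm);
    shm.shmid = shmget(IPC_PRIVATE, im->data_size, IPC_CREAT | 0600);
    shm.shmaddr = im->data = shmat(shm.shmid, NULL, 0);
    shm.readOnly = False;
    XShmAttach(dpy, &shm);

    /* Per frame: decode into im->data, then let the server do the
     * YUV->RGB conversion and the 720x480 -> 1920x1200 stretch. */
    memset(im->data, 0x80, im->data_size);  /* stand-in for a frame */
    XvShmPutImage(dpy, port, win, gc, im,
                  0, 0, 720, 480,           /* source rectangle */
                  0, 0, 1920, 1200,         /* stretched destination */
                  False);
    XSync(dpy, False);
    return 0;
}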
To understand the possible bandwidth savings, here is a calculation based on the scenario mentioned earlier: playing a standard DVD (720x480 yuv422) stretched to 1920x1200 (a popular full screen resolution).

To start, here is a list of the data transfers that occur in the early stages of video decoding regardless of whether soft XvPutImage is used. In the descriptions I refer to the media player as "mplayer" and the video format as "MPEG", but the argument applies to user level video players in general and to most video formats, since the only property of the video format that matters here is that its compression is good enough that copying the fully decoded video is what dominates memory bus utilization.

Common transfers with and without this optimization:

  kernel:  MPEG data, DVD drive -> kernel buffer    0.66 MiB/sec ("1X DVD")
  kernel:  MPEG data, kernel buffer -> CPU          0.66 MiB/sec
  kernel:  MPEG data, CPU -> mplayer buf            0.66 MiB/sec
  mplayer: MPEG data, mplayer buf -> CPU            0.66 MiB/sec
  mplayer: VLD + iDCT                                  ?
  mplayer: interpolated frames + motion comp.         80 MiB/sec???
                                                   ---------
                                                   82.64 MiB/sec + ?

Having separated out the operations common to both cases, we can now more easily compare the memory bandwidth of XPutImage versus soft XvPutImage (both using shared memory).

With XPutImage:

  Common transfers shown above                      82.64 MiB/sec + ?
  mplayer: CPU -> 720x480 yuv422 @ 60Hz                40 MiB/sec (720x480x2x60) *
  mplayer: 720x480 yuv422 -> CPU                       40 MiB/sec
  mplayer: CPU -> 1920x1200 RGB (XPutImage)           527 MiB/sec (1920x1200x4x60)
  Xserver: 1920x1200 RGB -> CPU                       527 MiB/sec
  Xserver: 1920x1200 RGB, CPU -> frame buffer         527 MiB/sec
                                                   ---------
                                                    1.70 GiB/sec + ?

With soft XvPutImage:

  Common transfers shown above                      82.64 MiB/sec + ?
  mplayer: CPU -> 720x480 yuv422 (XvPutImage)          40 MiB/sec (720x480x2x60) *
  Xserver: 720x480 yuv422 -> CPU                       40 MiB/sec
  Xserver: 1920x1200 RGB, CPU -> frame buffer         527 MiB/sec (1920x1200x4x60)
                                                   ---------
                                                    0.67 GiB/sec + ?

  * I believe yuv422 is 2 bytes per pixel.

In the impossibly ideal case where no memory transfers are generated by the variable length decoding, inverse discrete cosine transform, motion compensation and other activity, this optimization would reduce memory bus utilization by a factor of 2.52 at a screen resolution of 1920x1200. As screens get bigger (or as the unstretched video gets smaller), the improvement is asymptotic to a factor of 3, because the three 527 MiB/sec transfers of the stretched image shrink to one while everything else becomes negligible. In reality, there will be other sources of memory bandwidth utilization, reducing the fraction that this optimization accounts for, but I expect the savings would still be very substantial.

Consider that these bandwidth predictions are a substantial fraction of the total CPU to DRAM bandwidth available on a typical computer, which is the relevant resource given that the stretched video is typically too large to fit in the CPU caches. For example, a DDR2-1066 memory module has a maximum transfer rate of 7.94 GiB/sec, minus delays due to switching between 16KiB DRAM rows (double this for dual channel systems, halve it for DDR2-533 memory).
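As a sanity check on the arithmetic (and to make it easy to plug in other resolutions), here is a small C program that reproduces the totals above to within rounding; all inputs are the assumptions from the tables, not measurements:

/* Back-of-the-envelope check of the bandwidth tables above.
 * Inputs are the assumptions from the text (60Hz update, 2-byte
 * yuv422 source, 4-byte RGB destination); the "?" entries (VLD,
 * iDCT, etc.) are unknown and therefore omitted here too. */
#include <stdio.h>

int main(void)
{
    const double MiB = 1024.0 * 1024.0, GiB = 1024.0 * MiB;
    const double hz = 60.0;                          /* update rate   */
    const double src = 720 * 480 * 2 * hz;           /* yuv422 frame  */
    const double dst = 1920 * 1200 * 4 * hz;         /* stretched RGB */
    const double common = 4 * 0.66 * MiB + 80 * MiB; /* kernel+decode */

    /* XPutImage: write+read the small image, then write, read and
     * write again the big stretched image (mplayer -> shm -> fb). */
    double xput = common + 2 * src + 3 * dst;

    /* Soft XvPutImage: write+read the small image, then write the
     * big image once, directly to the frame buffer. */
    double xvput = common + 2 * src + 1 * dst;

    printf("unstretched image: %.0f MiB/sec\n", src / MiB);     /* ~40   */
    printf("stretched image:   %.0f MiB/sec\n", dst / MiB);     /* ~527  */
    printf("XPutImage:         %.2f GiB/sec\n", xput / GiB);    /* ~1.70 */
    printf("soft XvPutImage:   %.2f GiB/sec\n", xvput / GiB);   /* ~0.67 */
    printf("ratio: %.2f (approaches 3 as dst/src grows)\n",
           xput / xvput);                                       /* ~2.5  */
    return 0;
}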
By the way, the effect of the CPU cache is likely to push the benefit of this optimization toward that ideal factor of 3, because only the earlier stages of video decoding have data footprints small enough for the CPU data caches to eliminate memory bus transfers. Memory bus utilization is also further reduced (though never by more than that factor of 3) as the ratio of the stretched video size to the unstretched video size increases, such as when playing a 720x480 video on a newer 2560x1600 display.

The potential benefits seem pretty substantial to me, and I have more happy vaporware speculation: the implementation effort may be quite small, because it should not be necessary to write the format conversion and stretch code. That code can be taken from existing free video players that already do this on the client side, particularly libswscale, which conveniently lives in its own subdirectory of the ffmpeg sources used by many video players (see the sketch near the end of this message). Since there are no formal releases of ffmpeg, and since ffmpeg already maintains libswscale as a separate source control tree, it might make sense to ask the ffmpeg people about evolving libswscale into a freedesktop.org project with tar file releases, or about merging it into libpixman. At the least, this would require some header file movement to allow libswscale to compile without the rest of ffmpeg.

Finally, I would like to mention a couple of alternatives to this approach, for completeness, because I do not want to suggest that this is necessarily the end of the line for optimizing software based video playing.

Alternative #1: If you have 3D hardware that supports YUV textures, properly configured drivers, and support in your X server, but lack Xv support for some reason, 3D will surely provide better performance than any soft XvPutImage, and there is apparently already support for this in at least one video player.

Alternative #2: I do not know enough about DRI to be sure, but I think a slightly more optimal approach for systems without any hardware video acceleration would be to allow trusted (i.e., privileged) video players to map the frame buffer directly and coordinate changes to clip regions and window coordinates, as I believe is already done for 3D using the Direct Rendering Infrastructure (DRI). As far as I know, DRI does not actually require 3D, so video players with appropriate permissions should be able to use it to map the frame buffer and draw to it directly. However, DRI currently requires a DRM driver for each type of video card, and this approach would require writing a new (but relatively simple) video output driver for every video player that wanted to use it, whereas soft XvPutImage should work with existing video players that already support Xv. There are also the relatively minor drawbacks that this approach only covers the (very common) case of the player running on the same machine as the frame buffer, that it requires the frame buffer to be memory mapped in a sensible format, and that it requires the video player to run with enough privilege to use DRI.

I have written this description partly so that I will no longer feel like a roadblock to more ubiquitous video playing for neither mentioning this idea nor getting around to implementing it, and partly to find out whether there are compelling reasons not to do it, such as an existing implementation that I have missed, or an existing optimization in software video playing that makes this idea useless.
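To make the libswscale suggestion above a bit more concrete, here is a rough sketch of what the per-frame work inside a soft XvPutImage driver might look like. Only sws_getContext() and sws_scale() are real libswscale calls (shown with the current AV_PIX_FMT_* names); the two soft_xvputimage_* functions, the frame buffer pointer and the strides are hypothetical stand-ins for whatever the X server would actually hand the driver, and clipping, window offsets and format negotiation are all ignored:

/* Hypothetical per-frame path of a soft XvPutImage driver using
 * libswscale.  Real libswscale API: sws_getContext(), sws_scale();
 * everything else is a made-up stand-in for X server glue. */
#include <stdint.h>
#include <libswscale/swscale.h>

static struct SwsContext *ctx;  /* reused while geometry is unchanged */

/* Set up once per port/window geometry. */
int soft_xvputimage_init(int src_w, int src_h, int dst_w, int dst_h)
{
    ctx = sws_getContext(src_w, src_h, AV_PIX_FMT_YUYV422, /* packed yuv422 */
                         dst_w, dst_h, AV_PIX_FMT_RGB32,   /* 4-byte RGB    */
                         SWS_BILINEAR, NULL, NULL, NULL);
    return ctx != NULL;
}

/* Convert and stretch straight from the client's shared memory image
 * into the frame buffer: one read of the small YUV image and one
 * write of the big RGB image, with no intermediate copy. */
void soft_xvputimage_frame(const uint8_t *yuv_shm, int src_w, int src_h,
                           uint8_t *framebuffer, int fb_pitch)
{
    const uint8_t *src[1] = { yuv_shm };
    int src_stride[1]     = { src_w * 2 };   /* yuv422: 2 bytes/pixel */
    uint8_t *dst[1]       = { framebuffer };
    int dst_stride[1]     = { fb_pitch };

    sws_scale(ctx, src, src_stride, 0, src_h, dst, dst_stride);
}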
So, further analysis, corrections, and, of course, implementation, would be most welcome.

Adam Richter