Re: R200 ReadPixels optimization
On Tuesday, 12 October 2004 20:24, Ian Romanick wrote:
> Dieter Nützel wrote:
> > NONE of your three versions gave me direct rendering?!
> > I've tested with and without your TLS-patch (progress?).
> >
> > The symbols are in.
> >
> > DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep r200ReadRGBASpan_ARGB
> > 00175714 t r200ReadRGBASpan_ARGB
> > 00175be4 t r200ReadRGBASpan_ARGB_MMX
> > 00175ad4 t r200ReadRGBASpan_ARGB_SSE
> > 001759c4 t r200ReadRGBASpan_ARGB_SSE2
> >
> > But
> >
> > DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep _generic_read_RGBA_span_BGRA
> >          U _generic_read_RGBA_span_BGRA_REV_MMX
> >          U _generic_read_RGBA_span_BGRA_REV_SSE
> >          U _generic_read_RGBA_span_BGRA_REV_SSE2
> >
> > I'm on an XFree86 DRI CVS build as long as my distro is based on it ;-)
>
> You'll have to update the Imakefiles to build in DRI CVS. The problem is
> that it's not linking (or compiling) ../common/read_rgba_span_x86.o.

Got it with DRI and Mesa CVS on XFree86.

--- xc/lib/GL/mesa/x86/Imakefile.inc 2004-10-18 20:29:09.824517266 +0200
+++ xc/lib/GL/mesa/x86/Imakefile.inc.Dieter 2004-10-18 18:13:42.0 +0200
@@ -8,6 +8,7 @@
 MESA_X86_SRCS = $(MESAX86BUILDDIR)common_x86.c \
                 $(MESAX86BUILDDIR)common_x86_asm.S \
+                $(MESAX86BUILDDIR)read_rgba_span_x86.S \
                 $(MESAX86BUILDDIR)glapi_x86.S \
                 $(MESAX86BUILDDIR)x86.c \
                 $(MESAX86BUILDDIR)x86_cliptest.S \
@@ -18,6 +19,7 @@
 #ifdef NeedToLinkMesaSrc
 LinkSourceFile(common_x86.c, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(common_x86_asm.S, $(MESASRCDIR)/src/mesa/x86)
+LinkSourceFile(read_rgba_span_x86.S, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(glapi_x86.S, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(x86.c, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(x86_cliptest.S, $(MESASRCDIR)/src/mesa/x86)
@@ -28,6 +30,7 @@
 MESA_X86_OBJS = $(MESAX86BUILDDIR)common_x86.o \
                 $(MESAX86BUILDDIR)common_x86_asm.o \
+                $(MESAX86BUILDDIR)read_rgba_span_x86.o \
                 $(MESAX86BUILDDIR)x86.o \
                 $(MESAX86BUILDDIR)x86_cliptest.o \
                 $(MESAX86BUILDDIR)x86_xform2.o \
@@ -37,6 +40,7 @@
 #if defined(DoSharedLib) DoSharedLib
 MESA_X86_UOBJS = $(MESAX86BUILDDIR)unshared/common_x86.o \
                 $(MESAX86BUILDDIR)common_x86_asm.o \
+                $(MESAX86BUILDDIR)unshared/read_rgba_span_x86.o \
                 $(MESAX86BUILDDIR)unshared/x86.o \
                 $(MESAX86BUILDDIR)x86_cliptest.o \
                 $(MESAX86BUILDDIR)x86_xform2.o \
@@ -48,6 +52,7 @@
 MESA_X86_DOBJS = $(MESAX86BUILDDIR)debugger/common_x86.o \
                 $(MESAX86BUILDDIR)common_x86_asm.o \
+                $(MESAX86BUILDDIR)debugger/read_rgba_span_x86.o \
                 $(MESAX86BUILDDIR)debugger/x86.o \
                 $(MESAX86BUILDDIR)x86_cliptest.o \
                 $(MESAX86BUILDDIR)x86_xform2.o \
@@ -56,6 +61,7 @@
 MESA_X86_POBJS = $(MESAX86BUILDDIR)profiled/common_x86.o \
                 $(MESAX86BUILDDIR)common_x86_asm.o \
+                $(MESAX86BUILDDIR)profiled/read_rgba_span_x86.o \
                 $(MESAX86BUILDDIR)profiled/x86.o \
                 $(MESAX86BUILDDIR)x86_cliptest.o \
                 $(MESAX86BUILDDIR)x86_xform2.o \

-Dieter

---
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
Use IT products in your business? Tell us what you think of them. Give us
Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
--
___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: R200 ReadPixels optimization
Dieter Nützel wrote:
> NONE of your three versions gave me direct rendering?!
> I've tested with and without your TLS-patch (progress?).
>
> The symbols are in.
>
> DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep r200ReadRGBASpan_ARGB
> 00175714 t r200ReadRGBASpan_ARGB
> 00175be4 t r200ReadRGBASpan_ARGB_MMX
> 00175ad4 t r200ReadRGBASpan_ARGB_SSE
> 001759c4 t r200ReadRGBASpan_ARGB_SSE2
>
> But
>
> DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep _generic_read_RGBA_span_BGRA
>          U _generic_read_RGBA_span_BGRA_REV_MMX
>          U _generic_read_RGBA_span_BGRA_REV_SSE
>          U _generic_read_RGBA_span_BGRA_REV_SSE2
>
> I'm on an XFree86 DRI CVS build as long as my distro is based on it ;-)

You'll have to update the Imakefiles to build in DRI CVS. The problem is
that it's not linking (or compiling) ../common/read_rgba_span_x86.o.

> BTW The old indirect mode is way faster than direct for me:

It should be several orders of magnitude faster. After all, it's actually
just doing a memory-to-memory copy instead of reading from the framebuffer.
Re: R200 ReadPixels optimization
On Saturday, 9 October 2004 03:33, Ian Romanick wrote:
> Ian Romanick wrote:
> > Here's a simple patch that gives about a 50% (on my box) speed boost to
> > glReadPixels performance in 24-bit. I measured using the benchmark built
> > into progs/demos/readpix. The interesting thing is that the core MMX and
> > SSE2 routines can be used for other cards as well. For example, it looks
> > like MGA, Unichrome, and others can use the same code for 24-bit.
> >
> > Before pursuing this too far, I'd like to look at ways to make the
> > *compiled* code from spantmp.h be more device-independent. That would
> > make it easier to generate a bunch of these generic routines and just
> > plug them in.
>
> Here's version 3 of the patch. This is *probably* the last version that
> will circulate as a patch. Here are the changes from the last version of
> the patch:
>
> - Fixes the problem where the R200 driver would only use the MMX version.
> - Numerous little optimizations to all 3 versions. The SSE version is
>   still crap. :(
> - Trivially optimized the C version. ;)
>
> I'm thinking that a lot of this will actually get pulled into spantmp.h
> when I commit it. My thinking is to have the driver define which pixel
> format it uses (e.g., #define SPANTMP_USE_BGRA_REV) and have spantmp.h
> automatically generate the optimized versions (based on the existence of
> USE_MMX_ASM, etc.). Since there are just a handful of pixel formats that
> appear in practice, this should be pretty easy to do.
>
> My only concern is big-endian machines. I should be able to try this out
> on a Rage128 in a Power Mac. Maybe there will be another version as a
> patch...ugh...

Ian,

NONE of your three versions gave me direct rendering?!
I've tested with and without your TLS-patch (progress?).

The symbols are in.

DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep r200ReadRGBASpan_ARGB
00175714 t r200ReadRGBASpan_ARGB
00175be4 t r200ReadRGBASpan_ARGB_MMX
00175ad4 t r200ReadRGBASpan_ARGB_SSE
001759c4 t r200ReadRGBASpan_ARGB_SSE2

But

DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep _generic_read_RGBA_span_BGRA
         U _generic_read_RGBA_span_BGRA_REV_MMX
         U _generic_read_RGBA_span_BGRA_REV_SSE
         U _generic_read_RGBA_span_BGRA_REV_SSE2

I'm on an XFree86 DRI CVS build as long as my distro is based on it ;-)

Any ideas?

-Dieter

BTW The old indirect mode is way faster than direct for me:

progs/demos ./readpix
Mesa: software DXTn compression/decompression available
GL_VERSION = 1.3 Mesa 6.3
GL_RENDERER = Mesa DRI R200 20040929 AGP 4x x86/MMX+/3DNow!+/SSE TCL
Loaded 194 by 188 image
Benchmarking...
Result: 348 reads in 4.009000 seconds = 3165940.633574 pixels/sec
Benchmarking...
Result: 344 reads in 4.007000 seconds = 3131112.553032 pixels/sec
Benchmarking...
Result: 346 reads in 4.001000 seconds = 3154039.490127 pixels/sec
Benchmarking...
Result: 278 reads in 4.007000 seconds = 2530375.842276 pixels/sec
Benchmarking...
Result: 275 reads in 4.003000 seconds = 2505570.821884 pixels/sec
Benchmarking...
Result: 272 reads in 4.001000 seconds = 2479476.130967 pixels/sec
glDrawBuffer(GL_FRONT)
Benchmarking...
Result: 342 reads in 4.004000 seconds = 3115240.759241 pixels/sec
Benchmarking...
Result: 352 reads in 4.01 seconds = 3201532.169576 pixels/sec
Benchmarking...
Result: 342 reads in 4.004000 seconds = 3115240.759241 pixels/sec
Benchmarking...
Result: 269 reads in 4.011000 seconds = 2446015.457492 pixels/sec
Benchmarking...
Result: 268 reads in 4.00 seconds = 2443624.00 pixels/sec
Benchmarking...
Result: 270 reads in 4.01 seconds = 2455720.698254 pixels/sec

Mesa indirect:

progs/demos ./readpix
GL_VERSION = 1.2 (1.5 Mesa 6.3)
GL_RENDERER = Mesa GLX Indirect
Loaded 194 by 188 image
Benchmarking...
Result: 1793 reads in 4.002000 seconds = 16340403.798101 pixels/sec
Benchmarking...
Result: 1797 reads in 4.00 seconds = 16385046.00 pixels/sec
Benchmarking...
Result: 1792 reads in 4.00 seconds = 16339456.00 pixels/sec
Benchmarking...
Result: 800 reads in 4.003000 seconds = 7288933.300025 pixels/sec
Benchmarking...
Result: 799 reads in 4.004000 seconds = 7278003.996004 pixels/sec
Benchmarking...
Result: 797 reads in 4.004000 seconds = 7259786.213786 pixels/sec
glDrawBuffer(GL_FRONT)
Benchmarking...
Result: 294 reads in 4.007000 seconds = 2676008.984278 pixels/sec
Benchmarking...
Result: 290 reads in 4.002000 seconds = 2642898.550725 pixels/sec
Benchmarking...
Result: 291 reads in 4.008000 seconds = 2648041.916168 pixels/sec
Benchmarking...
Result: 241 reads in 4.009000 seconds = 2192504.864056 pixels/sec
Benchmarking...
Result: 240 reads in 4.015000 seconds = 2180144.458281 pixels/sec
Benchmarking...
Result: 240 reads in 4.014000 seconds = 2180687.593423 pixels/sec
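The pixels/sec figures printed by readpix follow from the image size and the read count: each read covers the whole 194 by 188 image. A minimal C sketch of that arithmetic is below. The function name is made up for illustration and is not taken from readpix.

```c
#include <assert.h>

/* Hypothetical helper, not taken from readpix itself: throughput in pixels
 * per second for `reads` full-image reads of a width x height image. */
static double pixels_per_second(unsigned reads, unsigned width,
                                unsigned height, double seconds)
{
    assert(seconds > 0.0);
    return (double)reads * width * height / seconds;
}
```

Fed the first direct-rendering line (348 reads in 4.009 s) this gives about 3.17 million pixels/sec, and for the first indirect line (1793 reads in 4.002 s) about 16.3 million, so the gap in these particular runs is roughly a factor of five.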
Re: R200 ReadPixels optimization
Ian Romanick wrote:
> Here's a simple patch that gives about a 50% (on my box) speed boost to
> glReadPixels performance in 24-bit. I measured using the benchmark built
> into progs/demos/readpix. The interesting thing is that the core MMX and
> SSE2 routines can be used for other cards as well. For example, it looks
> like MGA, Unichrome, and others can use the same code for 24-bit.
>
> Before pursuing this too far, I'd like to look at ways to make the
> *compiled* code from spantmp.h be more device-independent. That would
> make it easier to generate a bunch of these generic routines and just
> plug them in.

Here's an updated version of the patch. I think this one includes all the
code. ;) In addition, this patch adds support for the Unichrome driver.
It is still called r200_readpixels-*.patch.

Attachment: r200_readpixels-02.tar.bz2 (BZip2 compressed data)
Re: R200 ReadPixels optimization
Marcello Maggioni wrote:
> I experience a great slowdown in using this patch.
>
> [EMAIL PROTECTED]:~/driconf-0.2.2$ glxgears
> Mesa: software DXTn compression/decompression available
> Using MMX version of ReadRGBASpan
> 27 frames in 5.1 seconds = 5.320 FPS
> 25 frames in 5.0 seconds = 4.982 FPS
> [EMAIL PROTECTED]:~/driconf-0.2.2$

That's just weird. glxgears should never end up in any of the code
modified by this patch. I get ~1600 fps on my box with the patch applied.
Do you have any other patches?

There is one known bug in the patch. I accidentally left it set so that
the SSE and SSE2 versions were never used. That's why you get the
"Using MMX version" message. I should have an updated version in a bit.
Re: R200 ReadPixels optimization
On Fri, 08 Oct 2004 15:12:51 -0700, Ian Romanick [EMAIL PROTECTED] wrote:
> Marcello Maggioni wrote:
> > I experience a great slowdown in using this patch.
> >
> > [EMAIL PROTECTED]:~/driconf-0.2.2$ glxgears
> > Mesa: software DXTn compression/decompression available
> > Using MMX version of ReadRGBASpan
> > 27 frames in 5.1 seconds = 5.320 FPS
> > 25 frames in 5.0 seconds = 4.982 FPS
> > [EMAIL PROTECTED]:~/driconf-0.2.2$
>
> That's just weird. glxgears should never end up in any of the code
> modified by this patch. I get ~1600 fps on my box with the patch applied.
> Do you have any other patches?
>
> There is one known bug in the patch. I accidentally left it set so that
> the SSE and SSE2 versions were never used. That's why you get the
> "Using MMX version" message. I should have an updated version in a bit.

Mmm, I rebooted my system and now glxgears is back to acceptable values.
Probably only restarting X wasn't enough :)

I'm waiting for the SSE version :)

Bye
Re: R200 ReadPixels optimization
Ian Romanick wrote:
> Here's a simple patch that gives about a 50% (on my box) speed boost to
> glReadPixels performance in 24-bit. I measured using the benchmark built
> into progs/demos/readpix. The interesting thing is that the core MMX and
> SSE2 routines can be used for other cards as well. For example, it looks
> like MGA, Unichrome, and others can use the same code for 24-bit.
>
> Before pursuing this too far, I'd like to look at ways to make the
> *compiled* code from spantmp.h be more device-independent. That would
> make it easier to generate a bunch of these generic routines and just
> plug them in.

Here's version 3 of the patch. This is *probably* the last version that
will circulate as a patch. Here are the changes from the last version of
the patch:

- Fixes the problem where the R200 driver would only use the MMX version.
- Numerous little optimizations to all 3 versions. The SSE version is
  still crap. :(
- Trivially optimized the C version. ;)

I'm thinking that a lot of this will actually get pulled into spantmp.h
when I commit it. My thinking is to have the driver define which pixel
format it uses (e.g., #define SPANTMP_USE_BGRA_REV) and have spantmp.h
automatically generate the optimized versions (based on the existence of
USE_MMX_ASM, etc.). Since there are just a handful of pixel formats that
appear in practice, this should be pretty easy to do.

My only concern is big-endian machines. I should be able to try this out
on a Rage128 in a Power Mac. Maybe there will be another version as a
patch...ugh...

Attachment: r200_readpixels-03.tar.bz2 (BZip2 compressed data)
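The spantmp.h idea above (the driver names its pixel format, the template generates the read routine) could look roughly like the sketch below. Only SPANTMP_USE_BGRA_REV and USE_MMX_ASM come from the message. The SRC_* macros, the function name, and the assumed B, G, R, A memory layout are illustrative guesses, not what the real header does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A driver would define its pixel format before pulling in the template. */
#define SPANTMP_USE_BGRA_REV

#if defined(SPANTMP_USE_BGRA_REV)
/* Byte offsets of each channel inside one 4-byte source pixel
 * (assumed layout: B, G, R, A in memory order). */
#define SRC_R 2
#define SRC_G 1
#define SRC_B 0
#define SRC_A 3
#else
#error "this sketch only knows the BGRA_REV layout"
#endif

/* Plain C routine generated from the format macros.  A real template could
 * test USE_MMX_ASM and friends here and plug in an assembly routine instead;
 * that part is left out because its interface is not shown in the thread. */
static void spantmp_read_rgba_span(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[4 * i + 0] = src[4 * i + SRC_R];
        dst[4 * i + 1] = src[4 * i + SRC_G];
        dst[4 * i + 2] = src[4 * i + SRC_B];
        dst[4 * i + 3] = src[4 * i + SRC_A];
    }
}
```

Because the channel offsets are byte offsets rather than shifts on a packed word, this particular sketch does not depend on host endianness, which is one way to sidestep the big-endian concern for the C fallback.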
Re: R200 ReadPixels optimization
Alan Cox wrote:
> On Wed, 2004-10-06 at 22:02, Ian Romanick wrote:
> > Here's my question. Is there any way to trick it into doing
> > back-to-back reads as a single PCI transfer? So, if I did something
> > like:
>
> Not that anyone has found. I'm not sure PCI even really allows it except
> for prefetchable memory. Except of course DMA - so the DMA for radeon
> announcement seems ideal for Mesa here.

Note that there's some code in there already which uses the blitter to
copy from framebuffer to AGP memory, though it tries to implement the
entire readpixels() operation rather than being a useful low-level
operation.

Keith
Re: R200 ReadPixels optimization
> Note that there's some code in there already which uses the blitter to
> copy from framebuffer to AGP memory, though it tries to implement the
> entire readpixels() operation rather than being a useful low-level
> operation.

AGP memory is host-side uncached (CPU limitations on x86 for one), which
means it is better (swapping PCI for DDR RAM bus latencies is good) but
still benefits from the treatment.

Alan.
Re: R200 ReadPixels optimization
On Thu, Oct 07, 2004 at 02:02:38PM +0100, Alan Cox wrote:
> > Note that there's some code in there already which uses the blitter to
> > copy from framebuffer to AGP memory, though it tries to implement the
> > entire readpixels() operation rather than being a useful low-level
> > operation.
>
> AGP memory is host-side uncached (CPU limitations on x86 for one), which
> means it is better (swapping PCI for DDR RAM bus latencies is good) but
> still benefits from the treatment.

Why can't we make AGP memory cached? Wouldn't it be enough to flush the
caches at some critical points?

I was playing around with DirectFB and AGP some years ago and enabling
write-back caching didn't seem to have any side effects. Without caching,
AGP is almost as bad as video memory for sw fallbacks.

--
Ville Syrjälä
[EMAIL PROTECTED]
http://www.sci.fi/~syrjala/
Re: R200 ReadPixels optimization / AGP
On Thu, 2004-10-07 at 15:40, Ville Syrjälä wrote:
> Why can't we make AGP memory cached? Wouldn't it be enough to flush the
> caches at some critical points?

Possibly, although it is not trivial to see how we get that right,
especially with the 4MB kernel maps. The x86 processor cannot handle a
page being mapped both cached and uncached at once. Even more excitingly,
this includes implicit surprise caching by the CPU (speculative and
hardware prefetch). This is more a TLB than a cache issue.

> I was playing around with DirectFB and AGP some years ago and enabling
> write-back caching didn't seem to have any side effects. Without caching,
> AGP is almost as bad as video memory for sw fallbacks.

It would be nice to get cacheable AGP, but that means some fairly hairy
things, especially on SMP systems. Pulling a page out of AGP, doing a TLB
shootdown for it, and then putting it back after wbinvd might work. One
to talk to the processor people about.
Re: R200 ReadPixels optimization / AGP
Alan Cox wrote:
> On Thu, 2004-10-07 at 15:40, Ville Syrjälä wrote:
> > Why can't we make AGP memory cached? Wouldn't it be enough to flush
> > the caches at some critical points?
>
> Possibly, although it is not trivial to see how we get that right,
> especially with the 4MB kernel maps. The x86 processor cannot handle a
> page being mapped both cached and uncached at once. Even more excitingly,
> this includes implicit surprise caching by the CPU (speculative and
> hardware prefetch). This is more a TLB than a cache issue.
>
> > I was playing around with DirectFB and AGP some years ago and enabling
> > write-back caching didn't seem to have any side effects. Without
> > caching, AGP is almost as bad as video memory for sw fallbacks.
>
> It would be nice to get cacheable AGP, but that means some fairly hairy
> things, especially on SMP systems. Pulling a page out of AGP, doing a TLB
> shootdown for it, and then putting it back after wbinvd might work. One
> to talk to the processor people about.

I think the traditional path is to pull pages into and out of the AGP
table. So, blit to AGP, then remove those pages from the GART table and
(somehow) make them cacheable and allow the client to access them
directly.

Keith
Re: R200 ReadPixels optimization
On Wed, 6 Oct 2004, Eric Anholt wrote:
> On Wed, 2004-10-06 at 09:33, Vladimir Dergachev wrote:
> > On Wed, 6 Oct 2004, Dieter Nützel wrote:
> > > On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
> > > > Here's a simple patch that gives about a 50% (on my box) speed
> > > > boost to glReadPixels performance in 24-bit. I measured using the
> > > > benchmark built into progs/demos/readpix. The interesting thing is
> > > > that the core MMX and SSE2 routines can be used for other cards as
> > > > well. For example, it looks like MGA, Unichrome, and others can use
> > > > the same code for 24-bit.
> >
> > Stupid question - aren't newer versions of gcc capable of producing
> > SSE/MMX code? Would it be enough just to turn on appropriate flags?
>
> Distributions are probably shipping only one set of binaries. So,
> automatically chosen hand-coded assembler in the places that matter is
> much better than people having to compile using a custom set of gcc
> flags and maybe getting code that's about as good for their particular
> machine (if they get lucky).

Oh, I simply meant to compile the same source multiple times with
different gcc options and different function (or object) names.

But Ian says that current gcc is unusable, so it is irrelevant.

best

Vladimir Dergachev

> --
> Eric Anholt [EMAIL PROTECTED]
> http://people.freebsd.org/~anholt/ [EMAIL PROTECTED]
Re: R200 ReadPixels optimization
--- Ville Syrjälä [EMAIL PROTECTED] wrote:
> On Thu, Oct 07, 2004 at 02:02:38PM +0100, Alan Cox wrote:
> > > Note that there's some code in there already which uses the blitter
> > > to copy from framebuffer to AGP memory, though it tries to implement
> > > the entire readpixels() operation rather than being a useful
> > > low-level operation.
> >
> > AGP memory is host-side uncached (CPU limitations on x86 for one),
> > which means it is better (swapping PCI for DDR RAM bus latencies is
> > good) but still benefits from the treatment.
>
> Why can't we make AGP memory cached? Wouldn't it be enough to flush the
> caches at some critical points?
>
> I was playing around with DirectFB and AGP some years ago and enabling
> write-back caching didn't seem to have any side effects. Without caching,
> AGP is almost as bad as video memory for sw fallbacks.

I don't get it... We would have to flush the cache after the AGP transfer
anyway, right? If this truly can't be cached, then wouldn't the AGP
transfer slow us down even more? So unless we get vastly improved
performance, the end result is that the AGP transfer is just more painful
to use.

> --
> Ville Syrjälä
> [EMAIL PROTECTED]
> http://www.sci.fi/~syrjala/
Re: R200 ReadPixels optimization
On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
> Here's a simple patch that gives about a 50% (on my box) speed boost to
> glReadPixels performance in 24-bit. I measured using the benchmark built
> into progs/demos/readpix. The interesting thing is that the core MMX and
> SSE2 routines can be used for other cards as well. For example, it looks
> like MGA, Unichrome, and others can use the same code for 24-bit.
>
> Before pursuing this too far, I'd like to look at ways to make the
> *compiled* code from spantmp.h be more device-independent. That would
> make it easier to generate a bunch of these generic routines and just
> plug them in.

You have forgotten 'read_rgba_span_SSE2.S'.

What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?

It would be nice if we had this like MPlayer:

CPU: Advanced Micro Devices Athlon 4 /Athlon MP/XP Palomino 1763 MHz (Family: 6, Stepping: 2)
Detected cache-line size is 64 bytes
CPUflags: MMX: 1 MMX2: 1 3DNow: 1 3DNow2: 1 SSE: 1 SSE2: 0
Compiled with runtime CPU detection - WARNING - this is not optimal!

What do you think?

-Dieter
Re: R200 ReadPixels optimization
On Wed, 6 Oct 2004, Dieter Nützel wrote:
> On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
> > Here's a simple patch that gives about a 50% (on my box) speed boost to
> > glReadPixels performance in 24-bit. I measured using the benchmark
> > built into progs/demos/readpix. The interesting thing is that the core
> > MMX and SSE2 routines can be used for other cards as well. For example,
> > it looks like MGA, Unichrome, and others can use the same code for
> > 24-bit.

Stupid question - aren't newer versions of gcc capable of producing
SSE/MMX code? Would it be enough just to turn on appropriate flags?

best

Vladimir Dergachev

> > Before pursuing this too far, I'd like to look at ways to make the
> > *compiled* code from spantmp.h be more device-independent. That would
> > make it easier to generate a bunch of these generic routines and just
> > plug them in.
>
> You have forgotten 'read_rgba_span_SSE2.S'.
>
> What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?
>
> It would be nice if we had this like MPlayer:
>
> CPU: Advanced Micro Devices Athlon 4 /Athlon MP/XP Palomino 1763 MHz (Family: 6, Stepping: 2)
> Detected cache-line size is 64 bytes
> CPUflags: MMX: 1 MMX2: 1 3DNow: 1 3DNow2: 1 SSE: 1 SSE2: 0
> Compiled with runtime CPU detection - WARNING - this is not optimal!
>
> What do you think?
>
> -Dieter
Re: R200 ReadPixels optimization
Vladimir Dergachev wrote:
> > On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
> > > Here's a simple patch that gives about a 50% (on my box) speed boost
> > > to glReadPixels performance in 24-bit. I measured using the benchmark
> > > built into progs/demos/readpix. The interesting thing is that the
> > > core MMX and SSE2 routines can be used for other cards as well. For
> > > example, it looks like MGA, Unichrome, and others can use the same
> > > code for 24-bit.
>
> Stupid question - aren't newer versions of gcc capable of producing
> SSE/MMX code? Would it be enough just to turn on appropriate flags?

My original version used the MMX and SSE intrinsics in GCC 3.4.1. The MMX
code it generated was 5 times *slower* than the vanilla C version, and the
SSE version caused an internal compiler error. At this point, I don't have
a lot of faith in the SIMD intrinsics in current versions of GCC. I fully
expect that they will improve, but for now they are unusable. :(
Re: R200 ReadPixels optimization
On Wed, 2004-10-06 at 09:33, Vladimir Dergachev wrote:
> On Wed, 6 Oct 2004, Dieter Nützel wrote:
> > On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
> > > Here's a simple patch that gives about a 50% (on my box) speed boost
> > > to glReadPixels performance in 24-bit. I measured using the benchmark
> > > built into progs/demos/readpix. The interesting thing is that the
> > > core MMX and SSE2 routines can be used for other cards as well. For
> > > example, it looks like MGA, Unichrome, and others can use the same
> > > code for 24-bit.
>
> Stupid question - aren't newer versions of gcc capable of producing
> SSE/MMX code? Would it be enough just to turn on appropriate flags?

Distributions are probably shipping only one set of binaries. So,
automatically chosen hand-coded assembler in the places that matter is
much better than people having to compile using a custom set of gcc flags
and maybe getting code that's about as good for their particular machine
(if they get lucky).

--
Eric Anholt [EMAIL PROTECTED]
http://people.freebsd.org/~anholt/ [EMAIL PROTECTED]
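The "automatically chosen" part of that argument amounts to picking a routine once at run time from the CPU's feature flags. A rough C sketch follows, assuming a GCC or Clang build on x86 where the __builtin_cpu_supports() built-in is available (much newer than the compilers discussed in this thread). The stub routines stand in for the hand-written assembly, and every name here is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*read_span_fn)(const uint8_t *src, uint8_t *dst, size_t n);

/* Portable fallback: copy B,G,R,A pixels out as R,G,B,A. */
static void read_span_c(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++, src += 4, dst += 4) {
        dst[0] = src[2];
        dst[1] = src[1];
        dst[2] = src[0];
        dst[3] = src[3];
    }
}

/* Stand-ins for the assembly versions; here they just call the C code. */
static void read_span_mmx_stub(const uint8_t *s, uint8_t *d, size_t n)  { read_span_c(s, d, n); }
static void read_span_sse2_stub(const uint8_t *s, uint8_t *d, size_t n) { read_span_c(s, d, n); }

struct span_impl {
    const char  *name;
    read_span_fn fn;
};

/* Pick the best routine the running CPU supports, most capable first. */
static struct span_impl choose_read_span(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return (struct span_impl){ "SSE2", read_span_sse2_stub };
    if (__builtin_cpu_supports("mmx"))
        return (struct span_impl){ "MMX", read_span_mmx_stub };
#endif
    return (struct span_impl){ "C", read_span_c };
}
```

A driver would call choose_read_span() once at context creation and keep the function pointer, so a single distribution binary still gets the fastest path the machine offers without per-call feature checks.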
Re: R200 ReadPixels optimization
On Wed, 2004-10-06 at 16:56, Dieter Nützel wrote:
> What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?
>
> It would be nice if we had this like MPlayer:

Soreen wrote a set of routines for this that are in Xorg 6.8.* and
optimise the readback of video memory for render operations - naturally
enough they include the speed-ups for readback of video RAM.

DMA is still a better option - being able to DMA a tile to main memory
for fixup and DMA back..
Re: R200 ReadPixels optimization
Alan Cox wrote:
> On Wed, 2004-10-06 at 16:56, Dieter Nützel wrote:
> > What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?
> >
> > It would be nice if we had this like MPlayer:
>
> Soreen wrote a set of routines for this that are in Xorg 6.8.* and
> optimise the readback of video memory for render operations - naturally
> enough they include the speed-ups for readback of video RAM.

I'll take a look at that. It should make a useful reference. The gotcha
is that the readback routine is (slightly) more than just a copy from
video RAM to system RAM. It has to convert the pixel data from its native,
on-card format to RGBA. In the case of my patch, it converts from BGRA to
RGBA while doing the copy. That's why it needs the SSE2 shift
instructions.

> DMA is still a better option - being able to DMA a tile to main memory
> for fixup and DMA back..

Right.
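To make the "more than just a copy" point concrete, here is a scalar C sketch of the shift-and-mask swizzle applied to one packed pixel at a time. It assumes a little-endian host, where a B,G,R,A pixel read as a 32-bit word is 0xAARRGGBB. The function names are invented for illustration, and the SIMD routines in the patch are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the B and R bytes of one packed pixel; G and A stay in place.
 * Input word: 0xAARRGGBB, output word: 0xAABBGGRR (little-endian view). */
static uint32_t bgra_to_rgba_word(uint32_t p)
{
    return (p & 0xff00ff00u)            /* keep A and G where they are */
         | ((p >> 16) & 0x000000ffu)    /* move R down to the low byte */
         | ((p << 16) & 0x00ff0000u);   /* move B up to byte 2         */
}

/* Convert a span of n pixels while copying it from src to dst. */
static void read_span_words(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = bgra_to_rgba_word(src[i]);
}
```

A SIMD version would perform the same two shifts and one mask on several pixels per register, which is presumably where the shift instructions mentioned above come in.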
Re: R200 ReadPixels optimization
On Wed, 2004-10-06 at 19:36, Ian Romanick wrote:
> from video RAM to system RAM. It has to convert the pixel data from its native, on-card format to RGBA. In the case of my patch, it converts from BGRA to RGBA while doing the copy. That's why it needs the SSE2 shift instructions.

From the data Soreen posted it seems to come down to how many bytes you can pull at once; the rest is noise compared to the PCI latency.
Re: R200 ReadPixels optimization
Alan Cox wrote:
> On Wed, 2004-10-06 at 19:36, Ian Romanick wrote:
>> from video RAM to system RAM. It has to convert the pixel data from its native, on-card format to RGBA. In the case of my patch, it converts from BGRA to RGBA while doing the copy. That's why it needs the SSE2 shift instructions.
>
> From the data Soreen posted it seems to come down to how many bytes can you pull at once, the rest is noise to the PCI latency.

That matches what I saw. I tested both routines (MMX and SSE2) outside Mesa with the source and destination buffers in main memory. Both routines were obviously much faster. However, I noticed that the SSE2 version took less of a hit going from VRAM to system memory than the MMX version.

Here's my question. Is there any way to trick it into doing back-to-back reads as a single PCI transfer? So, if I did something like:

    movaps (%ebx), %xmm0
    movaps 16(%ebx), %xmm1

would it do a single 32-byte PCI transfer? I /assume/ there isn't any way to do so. When I unrolled the inner loop of the SSE2 version one time (and had code like the above), the performance increase was on the order of 1%.
Re: R200 ReadPixels optimization
On Wed, 2004-10-06 at 22:02, Ian Romanick wrote:
> Here's my question. Is there any way to trick it into doing back-to-back reads as a single PCI transfer? So, if I did something like:

Not that anyone has found. I'm not sure PCI even really allows it except for prefetchable memory. Except of course DMA - so the DMA for radeon announcement seems ideal for Mesa here.
Re: R200 ReadPixels optimization
Dieter Nützel wrote:
> On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
>> Here's a simple patch that gives about a 50% (on my box) speed boost to glReadPixels performance in 24-bit. I measured using the benchmark built into progs/demos/readpix. The interesting thing is that the core MMX and SSE2 routines can be used for other cards as well. For example, it looks like MGA, Unichrome, and others can use the same code for 24-bit. Before pursuing this too far, I'd like to look at ways to make the *compiled* code from spantmp.h be more device-independent. That would make it easier to generate a bunch of these generic routines and just plug them in.
>
> You have forgotten 'read_rgba_span_SSE2.S'.

D'oh! You are correct. I guess that's what I get for putting the patch together in a hurry. I suspect I just mis-typed the name when I ran tar. :( The bad part is, I'm not where that code is right now. I'll have to re-send it tomorrow.

> What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?

For this particular case, MMX and SSE2 are the only ways to get a speedup. The SSE2 code is basically identical to the MMX code except that it processes twice as many pixels per pass through the loop. There are two reasons the SSE version requires the SSE2 instruction set. In the original SSE instruction set there is no way to shift an XMM register. There is also no way to read or write a packed XMM register (i.e., all 128 bits) to an unaligned address. The code is careful to make sure the source data is 16-byte aligned, but there's no way to guarantee that both source and destination will have the same alignment.

For SSE, it may be possible to use a blend of SSE and MMX. Basically, read the data using movaps to an XMM register, then move the data to MMX registers. The performance delta between the MMX and SSE version is pretty small in practice, so I'm not sure how much that would help. A number of variables (card, system chipset, processor, phase of the moon, etc.) may affect that.

The main idea of this patch was to get some improvement to as many drivers as possible with as little effort as possible. With a little more effort, I think it will achieve that. The *real* speed up will be to DMA a large block of data from the card to system memory. I didn't take that route initially because the implementation will be different for each card.
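The alignment handling described in this message (source kept 16-byte aligned, destination possibly not) can be sketched in portable C roughly as follows. This is only an illustration under stated assumptions, not the assembly from the patch: `swap_rb` and `read_span_src_aligned` are hypothetical names, pixels are assumed to be little-endian 0xAARRGGBB words, and the four-pixel inner loop stands in for what would be a single aligned 128-bit load followed by an unaligned store in an SSE2 version.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swap red and blue in one 0xAARRGGBB pixel. */
static uint32_t swap_rb(uint32_t p)
{
    return (p & 0xFF00FF00u) | ((p >> 16) & 0xFFu) | ((p << 16) & 0x00FF0000u);
}

/* Hypothetical span reader. Converts n pixels from src to dst and
 * returns how many 16-byte (4-pixel) blocks went through the main loop. */
static size_t read_span_src_aligned(uint32_t *dst, const uint32_t *src,
                                    size_t n)
{
    size_t i = 0;
    size_t blocks = 0;

    /* Prologue: one pixel at a time until src hits a 16-byte boundary. */
    while (i < n && ((uintptr_t)(src + i) & 15u) != 0) {
        dst[i] = swap_rb(src[i]);
        i++;
    }

    /* Main loop: 4 pixels per pass. src + i is 16-byte aligned here;
     * dst + i may not be, which is why an unaligned store is needed. */
    for (; i + 4 <= n; i += 4) {
        for (size_t k = 0; k < 4; k++)
            dst[i + k] = swap_rb(src[i + k]);
        blocks++;
    }

    /* Epilogue: leftover pixels that do not fill a whole block. */
    for (; i < n; i++)
        dst[i] = swap_rb(src[i]);

    return blocks;
}
```

Aligning on the source rather than the destination fits the discussion in this thread, where the reads from video RAM are the expensive side of the copy.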