Re: R200 ReadPixels optimization

2004-10-18 Thread Dieter Nützel
On Tuesday, 12 October 2004 20:24, Ian Romanick wrote:
 Dieter Nützel wrote:
  NONE of your three versions gave me direct rendering?!
  I've tested with and without your TLS-patch (progress?).
 
  The symbols are in.
  DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep r200ReadRGBASpan_ARGB
  00175714 t r200ReadRGBASpan_ARGB
  00175be4 t r200ReadRGBASpan_ARGB_MMX
  00175ad4 t r200ReadRGBASpan_ARGB_SSE
  001759c4 t r200ReadRGBASpan_ARGB_SSE2
 
  But
  DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep _generic_read_RGBA_span_BGRA
   U _generic_read_RGBA_span_BGRA_REV_MMX
   U _generic_read_RGBA_span_BGRA_REV_SSE
   U _generic_read_RGBA_span_BGRA_REV_SSE2
 
  I'm on an XFree86 DRI CVS build, as long as my distro is based on it ;-)

 You'll have to update the Imakefiles to build in DRI CVS.  The problem
 is that it's not linking (or compiling) ../common/read_rgba_span_x86.o.

Got it with DRI and Mesa CVS on XFree86.

--- xc/lib/GL/mesa/x86/Imakefile.inc	2004-10-18 20:29:09.824517266 +0200
+++ xc/lib/GL/mesa/x86/Imakefile.inc.Dieter	2004-10-18 18:13:42.0 +0200
@@ -8,6 +8,7 @@

 MESA_X86_SRCS = $(MESAX86BUILDDIR)common_x86.c \
    $(MESAX86BUILDDIR)common_x86_asm.S \
+   $(MESAX86BUILDDIR)read_rgba_span_x86.S \
    $(MESAX86BUILDDIR)glapi_x86.S \
    $(MESAX86BUILDDIR)x86.c \
    $(MESAX86BUILDDIR)x86_cliptest.S \
@@ -18,6 +19,7 @@
 #ifdef NeedToLinkMesaSrc
 LinkSourceFile(common_x86.c, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(common_x86_asm.S, $(MESASRCDIR)/src/mesa/x86)
+LinkSourceFile(read_rgba_span_x86.S, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(glapi_x86.S, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(x86.c, $(MESASRCDIR)/src/mesa/x86)
 LinkSourceFile(x86_cliptest.S, $(MESASRCDIR)/src/mesa/x86)
@@ -28,6 +30,7 @@

 MESA_X86_OBJS = $(MESAX86BUILDDIR)common_x86.o \
    $(MESAX86BUILDDIR)common_x86_asm.o \
+   $(MESAX86BUILDDIR)read_rgba_span_x86.o \
    $(MESAX86BUILDDIR)x86.o \
    $(MESAX86BUILDDIR)x86_cliptest.o \
    $(MESAX86BUILDDIR)x86_xform2.o \
@@ -37,6 +40,7 @@
 #if defined(DoSharedLib) && DoSharedLib
 MESA_X86_UOBJS = $(MESAX86BUILDDIR)unshared/common_x86.o \
    $(MESAX86BUILDDIR)common_x86_asm.o \
+   $(MESAX86BUILDDIR)unshared/read_rgba_span_x86.o \
    $(MESAX86BUILDDIR)unshared/x86.o \
    $(MESAX86BUILDDIR)x86_cliptest.o \
    $(MESAX86BUILDDIR)x86_xform2.o \
@@ -48,6 +52,7 @@

 MESA_X86_DOBJS = $(MESAX86BUILDDIR)debugger/common_x86.o \
    $(MESAX86BUILDDIR)common_x86_asm.o \
+   $(MESAX86BUILDDIR)debugger/read_rgba_span_x86.o \
    $(MESAX86BUILDDIR)debugger/x86.o \
    $(MESAX86BUILDDIR)x86_cliptest.o \
    $(MESAX86BUILDDIR)x86_xform2.o \
@@ -56,6 +61,7 @@

 MESA_X86_POBJS = $(MESAX86BUILDDIR)profiled/common_x86.o \
    $(MESAX86BUILDDIR)common_x86_asm.o \
+   $(MESAX86BUILDDIR)profiled/read_rgba_span_x86.o \
    $(MESAX86BUILDDIR)profiled/x86.o \
    $(MESAX86BUILDDIR)x86_cliptest.o \
    $(MESAX86BUILDDIR)x86_xform2.o \

-Dieter


---
This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
Use IT products in your business? Tell us what you think of them. Give us
Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out more
http://productguide.itmanagersjournal.com/guidepromo.tmpl
--
___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: R200 ReadPixels optimization

2004-10-12 Thread Ian Romanick
Dieter Nützel wrote:
 NONE of your three versions gave me direct rendering?!
 I've tested with and without your TLS-patch (progress?).

 The symbols are in.
 DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep r200ReadRGBASpan_ARGB
 00175714 t r200ReadRGBASpan_ARGB
 00175be4 t r200ReadRGBASpan_ARGB_MMX
 00175ad4 t r200ReadRGBASpan_ARGB_SSE
 001759c4 t r200ReadRGBASpan_ARGB_SSE2

 But
 DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep _generic_read_RGBA_span_BGRA
  U _generic_read_RGBA_span_BGRA_REV_MMX
  U _generic_read_RGBA_span_BGRA_REV_SSE
  U _generic_read_RGBA_span_BGRA_REV_SSE2

 I'm on an XFree86 DRI CVS build, as long as my distro is based on it ;-)

You'll have to update the Imakefiles to build in DRI CVS.  The problem
is that it's not linking (or compiling) ../common/read_rgba_span_x86.o.

 BTW The old indirect mode is way faster than direct for me:

It should be several orders of magnitude faster.  After all, it's
actually just doing a memory-to-memory copy instead of reading from the
framebuffer.



Re: R200 ReadPixels optimization

2004-10-09 Thread Dieter Nützel
On Saturday, 9 October 2004 03:33, Ian Romanick wrote:
 Ian Romanick wrote:
  Here's a simple patch that gives about a 50% (on my box) speed boost to
  glReadPixels performance in 24-bit.  I measured using the benchmark
  built into progs/demos/readpix.  The interesting thing is that the core
  MMX & SSE2 routines can be used for other cards as well.  For example,
  it looks like MGA, Unichrome, and others can use the same code for
  24-bit.
 
  Before pursuing this too far, I'd like to look at ways to make the
  *compiled* code from spantmp.h be more device-independent.  That would
  make it easier to generate a bunch of these generic routines and just
  plug them in.

 Here's version 3 of the patch.  This is *probably* the last version that
 will circulate as a patch.  Here are the changes from the last version
 of the patch:

 - Fixes the problem where the R200 driver would only use the MMX version.
 - Numerous little optimizations to all 3 versions.  The SSE version is
 still crap. :(
 - Trivially optimized the C version. ;)

 I'm thinking that a lot of this will actually get pulled into spantmp.h
 when I commit it.  My thinking is to have the driver define which pixel
 format it uses (e.g., #define SPANTMP_USE_BGRA_REV) and have
 spantmp.h automatically generate the optimized versions (based on the
 existence of USE_MMX_ASM, etc.).  Since there are just a handful of pixel
 formats that appear in practice, this should be pretty easy to do.

 My only concern is big-endian machines.  I should be able to try this
 out on a Rage128 in a Power Mac.  Maybe there will be another version as
 a patch...ugh...

Ian,

NONE of your three versions gave me direct rendering?!
I've tested with and without your TLS-patch (progress?).

The symbols are in.
DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep r200ReadRGBASpan_ARGB
00175714 t r200ReadRGBASpan_ARGB
00175be4 t r200ReadRGBASpan_ARGB_MMX
00175ad4 t r200ReadRGBASpan_ARGB_SSE
001759c4 t r200ReadRGBASpan_ARGB_SSE2

But
DRI-Mesa/Patches nm /usr/X11R6-NO-TLS/lib/modules/dri/r200_dri.so | grep _generic_read_RGBA_span_BGRA
 U _generic_read_RGBA_span_BGRA_REV_MMX
 U _generic_read_RGBA_span_BGRA_REV_SSE
 U _generic_read_RGBA_span_BGRA_REV_SSE2

I'm on an XFree86 DRI CVS build, as long as my distro is based on it ;-)

Any ideas?

-Dieter

BTW The old indirect mode is way faster than direct for me:

progs/demos ./readpix
Mesa: software DXTn compression/decompression available
GL_VERSION = 1.3 Mesa 6.3
GL_RENDERER = Mesa DRI R200 20040929 AGP 4x x86/MMX+/3DNow!+/SSE TCL
Loaded 194 by 188 image

Benchmarking...
Result:  348 reads in 4.009000 seconds = 3165940.633574 pixels/sec
Benchmarking...
Result:  344 reads in 4.007000 seconds = 3131112.553032 pixels/sec
Benchmarking...
Result:  346 reads in 4.001000 seconds = 3154039.490127 pixels/sec
Benchmarking...
Result:  278 reads in 4.007000 seconds = 2530375.842276 pixels/sec
Benchmarking...
Result:  275 reads in 4.003000 seconds = 2505570.821884 pixels/sec
Benchmarking...
Result:  272 reads in 4.001000 seconds = 2479476.130967 pixels/sec
glDrawBuffer(GL_FRONT)
Benchmarking...
Result:  342 reads in 4.004000 seconds = 3115240.759241 pixels/sec
Benchmarking...
Result:  352 reads in 4.01 seconds = 3201532.169576 pixels/sec
Benchmarking...
Result:  342 reads in 4.004000 seconds = 3115240.759241 pixels/sec
Benchmarking...
Result:  269 reads in 4.011000 seconds = 2446015.457492 pixels/sec
Benchmarking...
Result:  268 reads in 4.00 seconds = 2443624.00 pixels/sec
Benchmarking...
Result:  270 reads in 4.01 seconds = 2455720.698254 pixels/sec


Mesa indirect:
progs/demos ./readpix
GL_VERSION = 1.2 (1.5 Mesa 6.3)
GL_RENDERER = Mesa GLX Indirect
Loaded 194 by 188 image
Benchmarking...
Result:  1793 reads in 4.002000 seconds = 16340403.798101 pixels/sec
Benchmarking...
Result:  1797 reads in 4.00 seconds = 16385046.00 pixels/sec
Benchmarking...
Result:  1792 reads in 4.00 seconds = 16339456.00 pixels/sec
Benchmarking...
Result:  800 reads in 4.003000 seconds = 7288933.300025 pixels/sec
Benchmarking...
Result:  799 reads in 4.004000 seconds = 7278003.996004 pixels/sec
Benchmarking...
Result:  797 reads in 4.004000 seconds = 7259786.213786 pixels/sec
glDrawBuffer(GL_FRONT)
Benchmarking...
Result:  294 reads in 4.007000 seconds = 2676008.984278 pixels/sec
Benchmarking...
Result:  290 reads in 4.002000 seconds = 2642898.550725 pixels/sec
Benchmarking...
Result:  291 reads in 4.008000 seconds = 2648041.916168 pixels/sec
Benchmarking...
Result:  241 reads in 4.009000 seconds = 2192504.864056 pixels/sec
Benchmarking...
Result:  240 reads in 4.015000 seconds = 2180144.458281 pixels/sec
Benchmarking...
Result:  240 reads in 4.014000 seconds = 2180687.593423 pixels/sec
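
For anyone comparing the two runs: the pixels/sec column appears to be reads x image width x image height / elapsed seconds. That formula is inferred from the figures above (it reproduces them), not taken from the readpix source, and the function name below is made up for illustration. A minimal sketch in C:

```c
#include <assert.h>

/* Hypothetical helper: throughput of a readback benchmark.
 * Assumes every read transfers the full width x height image. */
double pixels_per_second(int reads, int width, int height, double seconds)
{
    assert(seconds > 0.0);
    return (double)reads * width * height / seconds;
}
```

With the first direct-rendering line (348 reads of the 194 by 188 image in 4.009 s) this gives about 3.17 million pixels/sec; the first indirect line (1793 reads in 4.002 s) gives about 16.3 million, roughly a factor of five.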



Re: R200 ReadPixels optimization

2004-10-08 Thread Ian Romanick
Ian Romanick wrote:
 Here's a simple patch that gives about a 50% (on my box) speed boost to
 glReadPixels performance in 24-bit.  I measured using the benchmark
 built into progs/demos/readpix.  The interesting thing is that the core
 MMX & SSE2 routines can be used for other cards as well.  For example,
 it looks like MGA, Unichrome, and others can use the same code for 24-bit.

 Before pursuing this too far, I'd like to look at ways to make the
 *compiled* code from spantmp.h be more device-independent.  That would
 make it easier to generate a bunch of these generic routines and just
 plug them in.

Here's an updated version of the patch.  I think this one includes all
the code. ;)  In addition, this patch adds support for the Unichrome
driver.  It is still called r200_readpixels-*.patch.


r200_readpixels-02.tar.bz2
Description: BZip2 compressed data


Re: R200 ReadPixels optimization

2004-10-08 Thread Ian Romanick
Marcello Maggioni wrote:
 I experience a great slowdown in using this patch.

 [EMAIL PROTECTED]:~/driconf-0.2.2$ glxgears
 Mesa: software DXTn compression/decompression available
 Using MMX version of ReadRGBASpan
 27 frames in 5.1 seconds =  5.320 FPS
 25 frames in 5.0 seconds =  4.982 FPS
 [EMAIL PROTECTED]:~/driconf-0.2.2$

That's just weird.  glxgears should never end up in any of the code
modified by this patch.  I get ~1600 fps on my box with the patch
applied.  Do you have any other patches?

There is one known bug in the patch.  I accidentally left it set so that
the SSE and SSE2 versions were never used.  That's why you get the
"Using MMX version" message.  I should have an updated version in a bit.



Re: R200 ReadPixels optimization

2004-10-08 Thread Marcello Maggioni
On Fri, 08 Oct 2004 15:12:51 -0700, Ian Romanick [EMAIL PROTECTED] wrote:
 Marcello Maggioni wrote:
 
  I experience a great slowdown in using this patch .
 
  [EMAIL PROTECTED]:~/driconf-0.2.2$ glxgears
  Mesa: software DXTn compression/decompression available
  Using MMX version of ReadRGBASpan
  27 frames in 5.1 seconds =  5.320 FPS
  25 frames in 5.0 seconds =  4.982 FPS
  [EMAIL PROTECTED]:~/driconf-0.2.2$
 
 That's just weird.  glxgears should never end up in any of the code
 modified by this patch.  I get ~1600 fps on my box with the patch
 applied.  Do you have any other patches?
 
 There is one known bug in the patch.  I accidentally left it set so that
 the SSE and SSE2 versions were never used.  That's why you get the
 Using MMX version message.  I should have an updated version in a bit.
 


Mmm, I rebooted my system and now GLXGEARS is back to acceptable values.

Probably only rebooting X wasn't enough :)

I'm waiting for the SSE version :)

Bye




Re: R200 ReadPixels optimization

2004-10-08 Thread Ian Romanick
Ian Romanick wrote:
 Here's a simple patch that gives about a 50% (on my box) speed boost to
 glReadPixels performance in 24-bit.  I measured using the benchmark
 built into progs/demos/readpix.  The interesting thing is that the core
 MMX & SSE2 routines can be used for other cards as well.  For example,
 it looks like MGA, Unichrome, and others can use the same code for 24-bit.

 Before pursuing this too far, I'd like to look at ways to make the
 *compiled* code from spantmp.h be more device-independent.  That would
 make it easier to generate a bunch of these generic routines and just
 plug them in.

Here's version 3 of the patch.  This is *probably* the last version that
will circulate as a patch.  Here are the changes from the last version 
of the patch:

- Fixes the problem where the R200 driver would only use the MMX version.
- Numerous little optimizations to all 3 versions.  The SSE version is 
still crap. :(
- Trivially optimized the C version. ;)

I'm thinking that a lot of this will actually get pulled into spantmp.h 
when I commit it.  My thinking is to have the driver define which pixel 
format it uses (e.g., #define SPANTMP_USE_BGRA_REV) and have 
spantmp.h automatically generate the optimized versions (based on the 
existence of USE_MMX_ASM, etc.).  Since there are just a handful of pixel
formats that appear in practice, this should be pretty easy to do.
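
To make the idea concrete, here is a rough sketch in C of how a template might turn a format macro into a reader. SPANTMP_USE_BGRA_REV is the name floated above; the SPAN_*_SHIFT macros and the function name are invented for this illustration and are not what spantmp.h actually contains. It assumes BGRA_REV means packed 0xAARRGGBB words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The driver would define exactly one format macro before including
 * the template. */
#define SPANTMP_USE_BGRA_REV 1

#if defined(SPANTMP_USE_BGRA_REV)
/* Assumed layout: packed 0xAARRGGBB words (hypothetical macro names). */
#define SPAN_R_SHIFT 16
#define SPAN_G_SHIFT 8
#define SPAN_B_SHIFT 0
#define SPAN_A_SHIFT 24
#else
#error "no pixel format selected"
#endif

/* Generic C reader built from the shifts above.  An optimized build
 * could replace this body with an assembly routine when USE_MMX_ASM or
 * a similar macro is defined; that part is left out of the sketch. */
void example_read_rgba_span(const uint32_t *src, uint8_t (*rgba)[4], size_t n)
{
    assert(n == 0 || (src != NULL && rgba != NULL));
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i];
        rgba[i][0] = (uint8_t)(p >> SPAN_R_SHIFT);
        rgba[i][1] = (uint8_t)(p >> SPAN_G_SHIFT);
        rgba[i][2] = (uint8_t)(p >> SPAN_B_SHIFT);
        rgba[i][3] = (uint8_t)(p >> SPAN_A_SHIFT);
    }
}
```

Adding another pixel format would then mean adding one more #elif branch with its own shifts, which is why a small number of formats keeps this manageable.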

My only concern is big-endian machines.  I should be able to try this 
out on a Rage128 in a Power Mac.  Maybe there will be another version as 
a patch...ugh...


r200_readpixels-03.tar.bz2
Description: BZip2 compressed data


Re: R200 ReadPixels optimization

2004-10-07 Thread Keith Whitwell
Alan Cox wrote:
 On Wed, 2004-10-06 at 22:02, Ian Romanick wrote:
  Here's my question.  Is there any way to trick it into doing
  back-to-back reads as a single PCI transfer?  So, if I did something like:

 Not that anyone has found. I'm not sure PCI even really allows it except
 for prefetchable memory.

 Except of course DMA - so the DMA for radeon announcement seems ideal
 for Mesa here.

Note that there's some code in there already which uses the blitter to copy
from framebuffer to agp memory, though it tries to implement the entire
readpixels() operation rather than being a useful low-level operation.

Keith


Re: R200 ReadPixels optimization

2004-10-07 Thread Alan Cox
 Note that there's some code in there already which uses the blitter to copy 
 from framebuffer to agp memory, though it tries to implement the entire 
 readpixels() operation rather than being a useful low-level operation.

AGP memory is hostside uncached (CPU limitations on x86 for one) which
means it is better (swap PCI for DDR ram bus latencies is good) but
still benefits from the treatment.

Alan.




Re: R200 ReadPixels optimization

2004-10-07 Thread Ville Syrjälä
On Thu, Oct 07, 2004 at 02:02:38PM +0100, Alan Cox wrote:
  Note that there's some code in there already which uses the blitter to copy 
  from framebuffer to agp memory, though it tries to implement the entire 
  readpixels() operation rather than being a useful low-level operation.
 
 AGP memory is hostside uncached (CPU limitations on x86 for one) which
 means it is better (swap PCI for DDR ram bus latencies is good) but
 still benefits from the treatment.

Why can't we make AGP memory cached? Wouldn't it be enough to flush the 
caches at some critical points?

I was playing around with DirectFB and AGP some years ago and enabling 
write-back caching didn't seem to have any side effects. Without caching 
AGP is almost as bad as video memory for sw fallbacks.

-- 
Ville Syrjälä
[EMAIL PROTECTED]
http://www.sci.fi/~syrjala/




Re: R200 ReadPixels optimization / AGP

2004-10-07 Thread Alan Cox
On Thu, 2004-10-07 at 15:40, Ville Syrjälä wrote:
 Why can't we make AGP memory cached? Wouldn't it be enough to flush the 
 caches at some critical points?

Possibly although it is not trivial to see how we get that right,
especially with the 4MB kernel maps. The x86 processor cannot handle a
page being mapped both cached and uncached at once. Even more excitingly
this includes implicit surprise caching by the CPU (speculative and
hardware prefetch). This is more a TLB than cache issue.

 I was playing around with DirectFB and AGP some years ago and enabling 
 write-back caching didn't seem to have any side effects. Without caching 
 AGP is almost as bad as video memory for sw fallbacks.

It would be nice to get cacheable AGP but that means some fairly hairy
things especially on SMP systems. Pulling a page out of AGP, doing a TLB
shootdown for it, and then putting it back after wbinvd might work.

One to talk to the processor people about.





Re: R200 ReadPixels optimization / AGP

2004-10-07 Thread Keith Whitwell
Alan Cox wrote:
 On Thu, 2004-10-07 at 15:40, Ville Syrjälä wrote:
  Why can't we make AGP memory cached? Wouldn't it be enough to flush the
  caches at some critical points?

 Possibly although it is not trivial to see how we get that right,
 especially with the 4MB kernel maps. The x86 processor cannot handle a
 page being mapped both cached and uncached at once. Even more excitingly
 this includes implicit surprise caching by the CPU (speculative and
 hardware prefetch). This is more a TLB than cache issue.

  I was playing around with DirectFB and AGP some years ago and enabling
  write-back caching didn't seem to have any side effects. Without caching
  AGP is almost as bad as video memory for sw fallbacks.

 It would be nice to get cacheable AGP but that means some fairly hairy
 things especially on SMP systems. Pulling a page out of AGP, doing a TLB
 shootdown for it, and then putting it back after wbinvd might work.

 One to talk to the processor people about.

I think the traditional path is to pull pages into & out of the agp table.
So, blit to agp, then remove those pages from the gart table and (somehow)
make them cacheable and allow the client to access them directly.

Keith


Re: R200 ReadPixels optimization

2004-10-07 Thread Vladimir Dergachev

On Wed, 6 Oct 2004, Eric Anholt wrote:
 On Wed, 2004-10-06 at 09:33, Vladimir Dergachev wrote:
  On Wed, 6 Oct 2004, Dieter Nützel wrote:
   On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
    Here's a simple patch that gives about a 50% (on my box) speed boost to
    glReadPixels performance in 24-bit.  I measured using the benchmark
    built into progs/demos/readpix.  The interesting thing is that the core
    MMX & SSE2 routines can be used for other cards as well.  For example,
    it looks like MGA, Unichrome, and others can use the same code for 24-bit.

  Stupid question - aren't newer versions of gcc capable of producing
  SSE/MMX code? Would it be enough just to turn on appropriate flags?

 Distributions are probably shipping only one set of binaries.  So,
 automatically chosen hand-coded assembler in the places that matter is
 much better than people having to compile using a custom set of gcc
 flags and maybe getting code that's about as good for their particular
 machine (if they get lucky).

Oh, I simply meant to compile the same source multiple times with
different gcc options and different function (or object) names.

But Ian says that current gcc is unusable, so it is irrelevant.

best
Vladimir Dergachev

 --
 Eric Anholt    [EMAIL PROTECTED]
 http://people.freebsd.org/~anholt/ [EMAIL PROTECTED]



Re: R200 ReadPixels optimization

2004-10-07 Thread Mike Mestnik

--- Ville Syrjälä [EMAIL PROTECTED] wrote:

 On Thu, Oct 07, 2004 at 02:02:38PM +0100, Alan Cox wrote:
   Note that there's some code in there already which uses the blitter to copy
   from framebuffer to agp memory, though it tries to implement the entire
   readpixels() operation rather than being a useful low-level operation.

  AGP memory is hostside uncached (CPU limitations on x86 for one) which
  means it is better (swap PCI for DDR ram bus latencies is good) but
  still benefits from the treatment.

 Why can't we make AGP memory cached? Wouldn't it be enough to flush the
 caches at some critical points?

 I was playing around with DirectFB and AGP some years ago and enabling
 write-back caching didn't seem to have any side effects. Without caching
 AGP is almost as bad as video memory for sw fallbacks.
 
I don't get it...  We would have to flush the cache after the AGP transfer
anyway, right?

If this truly can't be cached, then wouldn't the AGP transfer slow us
down even more?  So unless we get vastly improved performance, the end
result is that the AGP transfer is more painful to use.

 -- 
 Ville Syrjälä
 [EMAIL PROTECTED]
 http://www.sci.fi/~syrjala/
 
 


Re: R200 ReadPixels optimization

2004-10-06 Thread Dieter Nützel
On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
 Here's a simple patch that gives about a 50% (on my box) speed boost to
 glReadPixels performance in 24-bit.  I measured using the benchmark
 built into progs/demos/readpix.  The interesting thing is that the core
 MMX & SSE2 routines can be used for other cards as well.  For example,
 it looks like MGA, Unichrome, and others can use the same code for 24-bit.

 Before pursuing this too far, I'd like to look at ways to make the
 *compiled* code from spantmp.h be more device-independent.  That would
 make it easier to generate a bunch of these generic routines and just
 plug them in.

You have forgotten 'read_rgba_span_SSE2.S'.

What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?

It would be nice if we had this, like MPlayer:

CPU: Advanced Micro Devices Athlon 4 /Athlon MP/XP Palomino 1763 MHz (Family: 6, Stepping: 2)
Detected cache-line size is 64 bytes
CPUflags:  MMX: 1 MMX2: 1 3DNow: 1 3DNow2: 1 SSE: 1 SSE2: 0
Compiled with runtime CPU detection - WARNING - this is not optimal!

What do you think?

-Dieter




Re: R200 ReadPixels optimization

2004-10-06 Thread Vladimir Dergachev

On Wed, 6 Oct 2004, Dieter Nützel wrote:
 On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
  Here's a simple patch that gives about a 50% (on my box) speed boost to
  glReadPixels performance in 24-bit.  I measured using the benchmark
  built into progs/demos/readpix.  The interesting thing is that the core
  MMX & SSE2 routines can be used for other cards as well.  For example,
  it looks like MGA, Unichrome, and others can use the same code for 24-bit.

Stupid question - aren't newer versions of gcc capable of producing
SSE/MMX code? Would it be enough just to turn on appropriate flags?

best
  Vladimir Dergachev

  Before pursuing this too far, I'd like to look at ways to make the
  *compiled* code from spantmp.h be more device-independent.  That would
  make it easier to generate a bunch of these generic routines and just
  plug them in.

 You have forgotten 'read_rgba_span_SSE2.S'.

 What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?

 It would be nice if we had this, like MPlayer:

 CPU: Advanced Micro Devices Athlon 4 /Athlon MP/XP Palomino 1763 MHz (Family: 6, Stepping: 2)
 Detected cache-line size is 64 bytes
 CPUflags:  MMX: 1 MMX2: 1 3DNow: 1 3DNow2: 1 SSE: 1 SSE2: 0
 Compiled with runtime CPU detection - WARNING - this is not optimal!

 What do you think?

 -Dieter


Re: R200 ReadPixels optimization

2004-10-06 Thread Ian Romanick
Vladimir Dergachev wrote:
  On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
   Here's a simple patch that gives about a 50% (on my box) speed boost to
   glReadPixels performance in 24-bit.  I measured using the benchmark
   built into progs/demos/readpix.  The interesting thing is that the core
   MMX & SSE2 routines can be used for other cards as well.  For example,
   it looks like MGA, Unichrome, and others can use the same code for
   24-bit.

 Stupid question - aren't newer versions of gcc capable of producing
 SSE/MMX code? Would it be enough just to turn on appropriate flags?

My original version used the MMX and SSE intrinsics in GCC 3.4.1.  The
MMX code it generated was 5 times *slower* than the vanilla C version,
and the SSE version caused an internal compiler error.  At this point, I
don't have a lot of faith in the SIMD intrinsics in current versions of
GCC.  I fully expect that it will improve, but for now it is unusable. :(



Re: R200 ReadPixels optimization

2004-10-06 Thread Eric Anholt
On Wed, 2004-10-06 at 09:33, Vladimir Dergachev wrote:
 On Wed, 6 Oct 2004, Dieter Nützel wrote:
 
  On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
  Here's a simple patch that gives about a 50% (on my box) speed boost to
  glReadPixels performance in 24-bit.  I measured using the benchmark
  built into progs/demos/readpix.  The interesting thing is that the core
  MMX & SSE2 routines can be used for other cards as well.  For example,
  it looks like MGA, Unichrome, and others can use the same code for 24-bit.
 
 Stupid question - aren't newer versions of gcc capable of producing 
 SSE/MMX code ? Would it be enough just to turn on appropriate flags ?

Distributions are probably shipping only one set of binaries.  So,
automatically chosen hand-coded assembler in the places that matter is
much better than people having to compile using a custom set of gcc
flags and maybe getting code that's about as good for their particular
machine (if they get lucky).
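
A minimal sketch in C of what "automatically chosen" could look like at run time: one binary carries every variant and picks one through a function pointer. All names here are hypothetical, the feature bits stand in for whatever CPUID-based detection the driver really does, and the "MMX" and "SSE2" routines are plain C placeholders rather than assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; a real driver would derive them from CPUID. */
enum { EX_CPU_MMX = 1 << 0, EX_CPU_SSE2 = 1 << 1 };

typedef void (*ex_read_span_fn)(const uint32_t *src, uint32_t *dst, size_t n);

/* Portable fallback: a plain copy standing in for the real conversion. */
void ex_read_span_c(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Placeholders for hand-written assembly; here they just call the C code. */
void ex_read_span_mmx(const uint32_t *src, uint32_t *dst, size_t n)
{
    ex_read_span_c(src, dst, n);
}

void ex_read_span_sse2(const uint32_t *src, uint32_t *dst, size_t n)
{
    ex_read_span_c(src, dst, n);
}

/* Pick a routine once, e.g. at context creation.  The preference order
 * (SSE2, then MMX, then C) is an assumption of this sketch. */
ex_read_span_fn ex_choose_read_span(unsigned cpu_flags)
{
    if (cpu_flags & EX_CPU_SSE2)
        return ex_read_span_sse2;
    if (cpu_flags & EX_CPU_MMX)
        return ex_read_span_mmx;
    return ex_read_span_c;
}
```

The cost is one indirect call per span, which should be small next to the framebuffer read itself.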

-- 
Eric Anholt[EMAIL PROTECTED]  
http://people.freebsd.org/~anholt/ [EMAIL PROTECTED]






Re: R200 ReadPixels optimization

2004-10-06 Thread Alan Cox
On Wed, 2004-10-06 at 16:56, Dieter Nützel wrote:
 What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?
 
 It would be nice if we have this like MPlayer:

Soreen wrote a set of routines for this that are in Xorg 6.8.* and
optimise the readback of video memory for render operations - naturally
enough they include the speed ups for readback of videoram.

DMA is still a better option - being able to DMA a tile to main memory
for fixup and DMA back..





Re: R200 ReadPixels optimization

2004-10-06 Thread Ian Romanick
Alan Cox wrote:
 On Wed, 2004-10-06 at 16:56, Dieter Nützel wrote:
  What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?
  It would be nice if we have this like MPlayer:
 
 Soreen wrote a set of routines for this that are in Xorg 6.8.* and
 optimise the readback of video memory for render operations - naturally
 enough they include the speed ups for readback of videoram.

I'll take a look at that.  It should make a useful reference.  The 
gotcha is that the readback routine is (slightly) more than just a copy 
from video RAM to system RAM.  It has to convert the pixel data from its 
native, on-card format to RGBA.  In the case of my patch, it 
converts from BGRA to RGBA while doing the copy.  That's why it needs 
the SSE2 shift instructions.
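
As a rough scalar illustration of that copy-with-conversion (not the code
from the patch), here is a sketch in C.  It assumes a little-endian host,
where a pixel stored as bytes B,G,R,A reads back as the 32-bit word
0xAARRGGBB; the function names are made up for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* Convert one pixel word 0xAARRGGBB (bytes B,G,R,A on a little-endian
 * host) to 0xAABBGGRR (bytes R,G,B,A) using only masks and shifts. */
static uint32_t bgra_to_rgba(uint32_t p)
{
    return (p & 0xff00ff00u)            /* A and G stay in place      */
         | ((p >> 16) & 0x000000ffu)    /* R moves down to bits 0-7   */
         | ((p & 0x000000ffu) << 16);   /* B moves up to bits 16-23   */
}

/* Copy a span of n pixels, converting each one on the way. */
static void read_span_bgra(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = bgra_to_rgba(src[i]);
}
```

The packed versions do the same mask-and-shift work on several pixels per
register, which is where the shift instructions come in.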

 DMA is still a better option - being able to DMA a tile to main memory
 for fixup and DMA back..

Right.


Re: R200 ReadPixels optimization

2004-10-06 Thread Alan Cox
On Wed, 2004-10-06 at 19:36, Ian Romanick wrote:
 from video RAM to system RAM.  It has to convert the pixel data from its 
 native, on-card format to RGBA.  In the case of my patch, it 
 converts from BGRA to RGBA while doing the copy.  That's why it needs 
 the SSE2 shift instructions.

From the data Soreen posted it seems to come down to how many bytes can
you pull at once, the rest is noise to the PCI latency.




Re: R200 ReadPixels optimization

2004-10-06 Thread Ian Romanick
Alan Cox wrote:
 On Wed, 2004-10-06 at 19:36, Ian Romanick wrote:
  from video RAM to system RAM.  It has to convert the pixel data from its
  native, on-card format to RGBA.  In the case of my patch, it
  converts from BGRA to RGBA while doing the copy.  That's why it needs
  the SSE2 shift instructions.
 
 From the data Soreen posted it seems to come down to how many bytes can
 you pull at once, the rest is noise to the PCI latency.

That matches what I saw.  I tested both routines (MMX & SSE2) outside
Mesa with the source and destination buffers in main memory.  Both
routines were obviously much faster.  However, I noticed that the SSE2
version took less of a hit going from VRAM to system memory than the MMX
version.

Here's my question.  Is there any way to trick it into doing 
back-to-back reads as a single PCI transfer?  So, if I did something like:

    movaps  (%ebx), %xmm0
    movaps  16(%ebx), %xmm1

it would do a single 32-byte PCI transfer?  I /assume/ there isn't any 
way to do so.  When I unrolled the inner loop of the SSE2 version one 
time (and had code like the above), the performance increase was on the 
order of 1%.



Re: R200 ReadPixels optimization

2004-10-06 Thread Alan Cox
On Wed, 2004-10-06 at 22:02, Ian Romanick wrote:
 Here's my question.  Is there any way to trick it into doing 
 back-to-back reads as a single PCI transfer?  So, if I did something like:

Not that anyone has found. I'm not sure PCI even really allows it except
for prefetchable memory.

Except of course DMA - so the DMA for radeon announcement seems ideal
for Mesa here.





Re: R200 ReadPixels optimization

2004-10-06 Thread Ian Romanick
Dieter Nützel wrote:
 On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:
  Here's a simple patch that gives about a 50% (on my box) speed boost to
  glReadPixels performance in 24-bit.  I measured using the benchmark
  built into progs/demos/readpix.  The interesting thing is that the core
  MMX & SSE2 routines can be used for other cards as well.  For example,
  it looks like MGA, Unichrome, and others can use the same code for 24-bit.
 
  Before pursuing this too far, I'd like to look at ways to make the
  *compiled* code from spantmp.h be more device-independent.  That would
  make it easier to generate a bunch of these generic routines and just
  plug them in.
 
 You have forgotten 'read_rgba_span_SSE2.S'.

D'oh!  You are correct.  I guess that's what I get for putting the patch
together in a hurry.  I suspect I just mis-typed the name when I ran
tar. :(  The bad part is, I'm not where that code is right now.  I'll
have to re-send it tomorrow.

 What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?

For this particular case, MMX and SSE2 are the only ways to get a 
speedup.  The SSE2 code is basically identical to the MMX code except 
that it processes twice as many pixels per pass through the loop.
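
To illustrate the "more pixels per pass" idea without assembly, here is a
sketch that uses a plain 64-bit integer as a stand-in for a 64-bit MMX
register, so each pass handles two 32-bit pixels; a 128-bit XMM register
would hold four.  The function names are made up, and this is not the
code from the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Swap the R and B bytes of two packed pixels at once.  Each 32-bit
 * lane goes from 0xAARRGGBB to 0xAABBGGRR; the per-lane masks keep the
 * shifted bits from leaking into the neighbouring pixel. */
static uint64_t swap_rb_x2(uint64_t v)
{
    const uint64_t keep = 0xff00ff00ff00ff00ull;   /* A and G of both lanes */
    const uint64_t low  = 0x000000ff000000ffull;   /* bits 0-7 of each lane */
    return (v & keep) | ((v >> 16) & low) | ((v & low) << 16);
}

/* Convert n pixels, two per pass, with a one-pixel tail for odd n. */
static void read_span_x2(const uint32_t *src, uint32_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t v;
        memcpy(&v, &src[i], sizeof v);     /* load two pixels  */
        v = swap_rb_x2(v);
        memcpy(&dst[i], &v, sizeof v);     /* store two pixels */
    }
    if (i < n)                             /* odd trailing pixel */
        dst[i] = (uint32_t)swap_rb_x2(src[i]);
}
```

Doubling the register width halves the number of loop passes while the
per-pass work stays the same shape.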

There are two reasons the SSE version requires the SSE2 instruction set.
In the original SSE instruction set there is no way to shift an XMM 
register.  There is also no way to read or write a packed XMM register 
(i.e., all 128 bits) to an unaligned address.  The code is careful to 
make sure the source data is 16-byte aligned, but there's no way to 
guarantee that both source and destination will have the same alignment.
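
One way to picture the alignment handling, as a sketch rather than the
code in the patch: process pixels one at a time until the source address
reaches a 16-byte boundary, then switch to 16-byte loads.  The helper
names are hypothetical, and the sketch assumes 4-byte pixels at a
4-byte-aligned address:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading pixels to handle individually before src_addr is
 * 16-byte aligned, capped at the span length n. */
static size_t pixels_until_aligned(uintptr_t src_addr, size_t n)
{
    size_t misalign = (size_t)(src_addr & 15u);
    if (misalign == 0)
        return 0;
    size_t head = (16 - misalign) / 4;     /* 4 bytes per pixel */
    return head < n ? head : n;
}

/* Nonzero when source and destination share the same offset within a
 * 16-byte block, i.e. aligning one also aligns the other. */
static int same_alignment(uintptr_t src_addr, uintptr_t dst_addr)
{
    return ((src_addr ^ dst_addr) & 15u) == 0;
}
```

When same_alignment() is false, aligning the source leaves the
destination misaligned, so the stores have to tolerate unaligned
addresses.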

For SSE, it may be possible to use a blend of SSE and MMX.  Basically, 
read the data using movaps to an XMM register, then move the data to MMX 
registers.  The performance delta between the MMX and SSE version is 
pretty small in practice, so I'm not sure how much that would help.  A 
number of variables (card, system chipset, processor, phase of the moon, 
etc.) may affect that.

The main idea of this patch was to get some improvement to as many 
drivers as possible with as little effort as possible.  With a little 
more effort, I think it will achieve that.  The *real* speed up will be 
to DMA a large block of data from the card to system memory.  I didn't 
take that route initially because the implementation will be different 
for each card.
