Dieter Nützel wrote:
On Wednesday, 6 October 2004 03:52, Ian Romanick wrote:

Here's a simple patch that gives about a 50% (on my box) speed boost to
glReadPixels performance in 24-bit.  I measured using the benchmark
built into progs/demos/readpix.  The interesting thing is that the core
MMX & SSE2 routines can be used for other cards as well.  For example,
it looks like MGA, Unichrome, and others can use the same code for 24-bit.

Before pursuing this too far, I'd like to look at ways to make the
*compiled* code from spantmp.h be more device-independent.  That would
make it easier to generate a bunch of these generic routines and just
plug them in.

You have forgotten "read_rgba_span_SSE2.S".

D'oh! You are correct. I guess that's what I get for putting the patch together in a hurry. I suspect I just mis-typed the name when I ran tar. :( The bad part is, I'm not where that code is right now. I'll have to re-send it tomorrow.


What about MMX2, 3DNow, 3DNow2 (pro), SSE (1)?

For this particular case, MMX and SSE2 are the only ways to get a speedup. The SSE2 code is basically identical to the MMX code except that it processes twice as many pixels per pass through the loop.


There are two reasons the SSE version requires the SSE2 instruction set. In the original SSE instruction set there is no way to shift an XMM register. There is also no way to read or write a packed XMM register (i.e., all 128 bits) to an unaligned address. The code is careful to make sure the source data is 16-byte aligned, but there's no way to guarantee that both source and destination will have the same alignment.
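
To make the two SSE2 dependencies concrete, here is a rough C sketch using compiler intrinsics. It is not the code from the patch (that is hand-written assembly); the function name swap_rb_span and the assumed pixel layout (0xAARRGGBB in, 0xAABBGGRR out) are only illustrative assumptions. It shows the general shape: scalar pixels until the source is 16-byte aligned, then an aligned 128-bit load, XMM shifts, and an unaligned 128-bit store.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: swap the red and blue bytes of one 32-bit pixel.
 * The 0xAARRGGBB input layout is an assumption made for this sketch. */
static uint32_t swap_rb_pixel(uint32_t p)
{
    return (p & 0xff00ff00u) | ((p >> 16) & 0xffu) | ((p & 0xffu) << 16);
}

/* Hypothetical span routine, not the routine from the patch. */
void swap_rb_span(const uint32_t *src, uint32_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i keep_ga = _mm_set1_epi32((int)0xff00ff00u);
    const __m128i keep_lo = _mm_set1_epi32(0x000000ff);

    /* Handle leading pixels one at a time until src is 16-byte aligned. */
    while (i < n && ((uintptr_t)(src + i) & 15u) != 0) {
        dst[i] = swap_rb_pixel(src[i]);
        i++;
    }

    /* Four pixels per pass; an MMX loop of the same shape handles two. */
    for (; i + 4 <= n; i += 4) {
        /* Aligned 128-bit load (movdqa). */
        __m128i p  = _mm_load_si128((const __m128i *)(src + i));
        __m128i ga = _mm_and_si128(p, keep_ga);
        /* Shifts on XMM registers (psrld/pslld) need SSE2. */
        __m128i r  = _mm_and_si128(_mm_srli_epi32(p, 16), keep_lo);
        __m128i b  = _mm_slli_epi32(_mm_and_si128(p, keep_lo), 16);
        /* dst may not share src's alignment, so store unaligned (movdqu). */
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm_or_si128(ga, _mm_or_si128(r, b)));
    }
#endif
    /* Remaining pixels, or everything when SSE2 is not available. */
    for (; i < n; i++)
        dst[i] = swap_rb_pixel(src[i]);
}
```

Without SSE2 the sketch simply falls back to the scalar loop, so it builds on any target.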

For SSE, it may be possible to use a blend of SSE and MMX. Basically, read the data using movaps to an XMM register, then move the data to MMX registers. The performance delta between the MMX and SSE versions is pretty small in practice, so I'm not sure how much that would help. A number of variables (card, system chipset, processor, phase of the moon, etc.) may affect that.

The main idea of this patch was to get some improvement to as many drivers as possible with as little effort as possible. With a little more effort, I think it will achieve that. The *real* speedup will be to DMA a large block of data from the card to system memory. I didn't take that route initially because the implementation will be different for each card.


_______________________________________________
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel