On May 25 around 18:00, Linus Torvalds wrote:

> You can often make things go faster by simplifying and streamlining the
> code rather than trying to be clever and having a big footprint. Ask Keith
> Packard about the X frame buffer code and this very issue some day.

The frame buffer code has very different tradeoffs; all of the memory 
references are across the PCI/AGP bus, so even issues like instruction 
caches are pretty much lost in the noise.  That means you can completely 
ignore instruction count issues when estimating algorithm performance and 
look only at bus cycles.  

The result is similar; code gets rolled up into the smallest space, though
not entirely for efficiency -- mostly to make it easier to understand and to
count memory cycles.  Of course, it's also nice to avoid thrashing the
i-cache so that when the frame buffer access is done there isn't a huge
penalty in getting back to the rest of the X server.

Reading data from the frame buffer takes nearly forever -- uncached PCI/AGP
reads are completely synchronous. The frame buffer code stands on its head
to avoid that, even at the cost of some significant code expansion in
places.  For example, when filling rectangles, the edges are often not
aligned on 32-bit boundaries.  It's much more efficient to do a sequence of
byte/short writes than the read-mask-write cycle that the older frame
buffer code used.  Writes are a bit better, but the lame Intel CPUs can't 
saturate an AGP bus in write-combining mode -- that mode doesn't go through
the regular cache logic and instead uses a separate buffer which isn't 
deep enough to cover the bus latency.  Hence the performance difference 
between DMA and PIO for simple 2D graphics operations.

The code also takes advantage of dynamic branch prediction; tests which 
resolve the same direction on each pass through a loop are left inside the 
loop instead of duplicating the code to avoid the test.  There is no 
pipeline branch penalty while running through the loop because the 
predictor will guess right every time.

The result is code which handles all of the X data formats (1,4,8,16,24,32)
in about half the space the older code used to handle only a single format. 
The old code was optimized for 60ns CPUs with 300ns memory systems; new 
machines have much faster CPUs but only marginally faster memory.

Getting a chance to implement the same spec in two radically different 
performance environments has been a lot of fun.

Keith Packard        XFree86 Core Team        HP Cambridge Research Lab




_______________________________________________
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel