Around 18 o'clock on May 25, Linus Torvalds wrote:
> You can often make things go faster by simplifying and streamlining the
> code rather than trying to be clever and having a big footprint. Ask Keith
> Packard about the X frame buffer code and this very issue some day.

The frame buffer code has very different tradeoffs; all of the memory references are across the PCI/AGP bus, so even issues like instruction caches are pretty much lost in the noise. That means you can completely ignore instruction count issues when estimating algorithm performance and look only at bus cycles.

The result is similar; code gets rolled up into the smallest space, but not entirely for efficiency -- rather to make it easier to understand and count memory cycles. Of course, it's also nice to avoid thrashing the i-cache so that when the frame buffer access is done there isn't a huge penalty in getting back to the rest of the X server.

Reading data from the frame buffer takes nearly forever -- uncached PCI/AGP reads are completely synchronous. The frame buffer code stands on its head to avoid that, even at the cost of some significant code expansion in places. For example, when filling rectangles, the edges are often not aligned on 32-bit boundaries. It's much more efficient to do a sequence of byte/short writes than the read-mask-write cycle that the older frame buffer code used.

Writes are a bit better, but the lame Intel CPUs can't saturate an AGP bus in write-combining mode -- that mode doesn't go through the regular cache logic and instead uses a separate buffer which isn't deep enough to cover the bus latency. Hence the performance difference between DMA and PIO for simple 2D graphics operations.

The code also takes advantage of dynamic branch prediction; tests which resolve the same direction each pass through a loop are left inside the loop instead of duplicating the code to avoid the test. There isn't a pipeline branch penalty while running through the loop, because the predictor will guess right every time.
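The edge-handling trick can be sketched roughly like this -- a minimal, hypothetical span fill (the function name and 8bpp layout are my assumptions, not the XFree86 source): narrow stores cover the unaligned leading and trailing edges, aligned 32-bit stores cover the middle, and the frame buffer is never read.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of a write-only span fill: set `len` bytes at
 * `dst` to the pixel value `pix` without ever reading the destination.
 * Unaligned edges get byte stores; the aligned middle gets 32-bit
 * stores, which are the fast path on the bus. */
static void
fb_fill_span(uint8_t *dst, size_t len, uint8_t pix)
{
    uint32_t pix4 = pix * 0x01010101u;   /* replicate the byte into a word */

    /* Leading edge: byte stores until dst is 4-byte aligned. */
    while (len && ((uintptr_t)dst & 3)) {
        *dst++ = pix;
        len--;
    }
    /* Middle: aligned 32-bit stores. */
    while (len >= 4) {
        *(uint32_t *)dst = pix4;
        dst += 4;
        len -= 4;
    }
    /* Trailing edge: byte stores for the remainder. */
    while (len--)
        *dst++ = pix;
}
```

The older read-mask-write approach would instead load the edge word, merge the new pixels under a mask, and store it back -- three bus transactions including a synchronous read, versus a couple of cheap posted writes here.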
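The branch-prediction point can be illustrated with a made-up inner loop (my example, not XFree86 code): the direction test is loop-invariant, so it resolves the same way on every iteration and a dynamic predictor gets it right each time, making the in-loop test essentially free while keeping one loop instead of two duplicated ones.

```c
#include <stdint.h>
#include <stddef.h>

/* Copy one row of pixels either left-to-right or right-to-left.
 * The `reverse` test stays inside the loop: it is invariant, so
 * the branch predictor guesses it correctly on every pass, and we
 * avoid duplicating the loop body for each direction. */
static void
copy_row(uint8_t *dst, const uint8_t *src, size_t n, int reverse)
{
    for (size_t i = 0; i < n; i++) {
        if (reverse)                  /* invariant: predicted perfectly */
            dst[n - 1 - i] = src[n - 1 - i];
        else
            dst[i] = src[i];
    }
}
```

The classic alternative -- hoisting the test and writing two loops -- doubles the code for no measurable gain on a CPU with dynamic branch prediction, and costs i-cache space.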
The result is code which handles all of the X data formats (1, 4, 8, 16, 24, 32) in about half the space the older code used to handle only a single format. The old code was optimized for 60ns CPUs with 300ns memory systems; new machines have much faster CPUs but only marginally faster memory. Getting a chance to implement the same spec in two radically different performance environments has been a lot of fun.

Keith Packard
XFree86 Core Team
HP Cambridge Research Lab