Michel Lanners writes:

> I wouldn't know for top, but I can say that mtrr definitely makes a
> difference: 45% cpu for X without mtrr, down to roughly 5% with the
> proper mtrr configured.

So that was it.
I mostly fixed top. Debian-unstable has the fix. There is a bit of
randomness due to data collection that isn't instant, and the very
first screen is bogus due to kernel limitations.

> We might be able to squeeze a few percent out of better caching for the
> framebuffer (making X's framebuffer mapping cacheable enables bursting
> from the CPU; combined with float or vector stores instead of regular
> memcpy that should give a boost that _could_ come close to what mtrr
> achieves on i386.

Oh, you really want to play with the WIMG bits! It would be nice
if there were arch-specific mmap() flags for this.

If you can spare a BAT register, use it. (Maybe use page tables for
most of the kernel address space, with just one BAT to cover the
kernel itself.)

The WIMG setting should be 0000. This is:

    write-back (not write-through)
    cached
    coherency not enforced (must use cache control instructions!!!)
    not guarded against ordering/merging/speculative troubles

Then you do:

    "dcba" for a frame buffer cache line
    fill the cache line with your data
    "dcbf" to write out and then free the cache line

That "fill the cache line with your data" part should also have
some sort of cache control stuff. Most likely it should use the
AltiVec prefetch stuff for streaming data too.

As always, unroll the loop a bit so that you can move instructions
around. You need to do this to avoid stalls due to instructions
needing to wait for preceding instructions to complete. Put as
much distance between such instructions as you can.

Here is your %CPU goal:

    100*bandwidth_needed/bandwidth_available

You should be able to get pretty close to that, since memory
operations may be interleaved with other operations that will
then become "free", just as the TCP/IP checksum comes "free"
with a copy.
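
To put a number on that %CPU goal, here is a minimal C sketch of the formula. The function name cpu_goal_percent and the -1.0 error return are made up for illustration, and both bandwidths just need to be in the same unit (say, bytes per second).

```c
#include <stdio.h>

/*
 * %CPU goal = 100 * bandwidth_needed / bandwidth_available
 *
 * Hypothetical helper, not from any existing tool. Both arguments
 * must use the same unit (for example bytes per second).
 * Returns -1.0 if bandwidth_available is not positive, since the
 * ratio is meaningless in that case.
 */
static double cpu_goal_percent(double bandwidth_needed,
                               double bandwidth_available)
{
    if (bandwidth_available <= 0.0)
        return -1.0;
    return 100.0 * bandwidth_needed / bandwidth_available;
}
```

As a made-up example: redrawing a 1024x768 screen at 4 bytes per pixel 60 times a second needs about 189 MB/s. If the path to the frame buffer could sustain 800 MB/s (an assumed figure, not a measurement), cpu_goal_percent(188743680.0, 800e6) gives a goal of roughly 24%.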