On Sun, 12 Sep 1999, Eero Pajarre wrote:
> Here are the top entries for my application:
> gl_x86_transform_points3_3d_raw
> gl_x86_cliptest_points4
> gl_x86_transform_points3_3d_no_rot_raw
> gl_x86_transform_points3_perspective_raw

Thank you, Eero and Keith, for your information. It seems that the 
most critical functions are quite limited in number after all. My
problem right now is that I haven't really got the PIII operations
running at all. The compilation works all right, but I get exceptions 
as soon as I try the XMM ops. Maybe the kernel I'm using hasn't set 
up the proper control bits to allow XMM. The only PIII in our lab 
is actually an autonomous robot, and people would probably not like 
me installing unofficial patches on that machine.
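
A minimal sketch of how one might probe for this from user space,
assuming GCC or Clang on a POSIX system. On x86 the OS has to set
CR4.OSFXSR before SSE instructions are legal; if it hasn't, the first
XMM op raises an invalid-opcode fault, which Linux delivers as SIGILL.
The names sse_usable and sse_probe_sigill are made up for this
example and are not Mesa functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>            /* GCC/Clang wrapper around CPUID */
#define PROBE_X86 1
#else
#define PROBE_X86 0
#endif

static sigjmp_buf sse_probe_env;

/* SIGILL handler: jump back out of the faulting instruction. */
static void sse_probe_sigill(int sig)
{
    (void)sig;
    siglongjmp(sse_probe_env, 1);
}

/* Returns 1 if the CPU reports SSE and an XMM instruction actually
 * executes without faulting, 0 otherwise (including non-x86 builds). */
static int sse_usable(void)
{
#if PROBE_X86
    unsigned int eax, ebx, ecx, edx;
    struct sigaction sa, old;
    volatile int ok = 0;

    /* CPUID leaf 1, EDX bit 25 = SSE supported by the CPU. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 25)))
        return 0;

    sa.sa_handler = sse_probe_sigill;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGILL, &sa, &old);

    if (sigsetjmp(sse_probe_env, 1) == 0) {
        /* Faults with SIGILL if the OS has not enabled SSE. */
        __asm__ __volatile__("xorps %%xmm0, %%xmm0" ::: "xmm0");
        ok = 1;
    }

    sigaction(SIGILL, &old, NULL);   /* restore previous handler */
    return ok;
#else
    return 0;
#endif
}
```

Calling sse_usable() once at start-up and falling back to the plain
x86 code when it returns 0 would avoid the exceptions, at the cost of
not using the XMM paths on such kernels.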

> The cliptest_points contains a fdiv instruction which seems
> (according to Vtune) to be responsible for
> at least half of the CPU consumption of the whole function.

This function should probably be quite easy to parallelize, and 
I suppose the projections should be kept within the function.
It's hard to see where else they would go.
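
As a rough scalar sketch of what such a function does (this is not
the Mesa source; the name cliptest_project4, the CLIP_* bit values and
the choice to store 1/w in the fourth component are assumptions made
for illustration): each point gets an outcode from six comparisons
against w, and unclipped points are projected with a single divide
followed by multiplies.

```c
#include <stddef.h>

/* Hypothetical outcode bits, one per clip plane. */
enum {
    CLIP_RIGHT  = 1,  CLIP_LEFT = 2,
    CLIP_TOP    = 4,  CLIP_BOTTOM = 8,
    CLIP_FAR    = 16, CLIP_NEAR = 32
};

/* Clip-tests n homogeneous points and projects the visible ones in
 * place. Returns the OR of all outcodes (0 means nothing is clipped). */
static unsigned char cliptest_project4(float (*v)[4], size_t n,
                                       unsigned char *mask)
{
    unsigned char or_mask = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        const float x = v[i][0], y = v[i][1], z = v[i][2], w = v[i][3];
        unsigned char m = 0;

        if (x >  w) m |= CLIP_RIGHT;
        if (x < -w) m |= CLIP_LEFT;
        if (y >  w) m |= CLIP_TOP;
        if (y < -w) m |= CLIP_BOTTOM;
        if (z >  w) m |= CLIP_FAR;
        if (z < -w) m |= CLIP_NEAR;

        mask[i] = m;
        or_mask |= m;

        /* Only unclipped points are projected: one divide, then
         * three multiplies instead of three divides. */
        if (m == 0 && w != 0.0f) {
            const float oow = 1.0f / w;
            v[i][0] = x * oow;
            v[i][1] = y * oow;
            v[i][2] = z * oow;
            v[i][3] = oow;
        }
    }
    return or_mask;
}
```

The six comparisons are independent of each other and of the
neighbouring points, which is what makes the loop a natural candidate
for processing several points per iteration.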

> Quite interestingly Vtune also places almost half of the
> CPU hit of the transform_points3_3d_raw operation to the
> first multiply instruction. I think this means that at least
> in my setup it is not the arithmetic which is expensive
> but fetching the data from the main memory to the FPU.

It's probably because the data is aligned, so cache misses (when 
they occur) always occur at the beginning of the loop and the first
instruction that touches the data takes the whole stall. I did some
tests before and concluded that cache misses (on our 450MHz PIII) 
cost roughly as follows:

L1 read miss:   7 cycles
L1 write miss: 37 cycles
L2 read miss:  26 cycles
L2 write miss: 80 cycles

Yes, those numbers are hard to believe indeed, but you can be quite
sure that they are in that neighbourhood. If the coordinate data
doesn't all fit into the caches, one can expect a miss once for
every two coordinates (a 32-byte cache line holds 2x16 bytes). So,
the data reads (and writes) will surely dominate. Lots of speed is
probably to be gained by careful use of the new prefetching
operations. 
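
A minimal sketch of the idea in plain C, assuming GCC or Clang (whose
__builtin_prefetch is used here as a stand-in for the PIII prefetch
instructions) and 16-byte coordinates stored as float[4]. The function
name and the PREFETCH_AHEAD distance are made up for this example;
the right distance has to be found by measurement.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in points. */
#define PREFETCH_AHEAD 4

/* dst = M * (x, y, z, 1) for a column-major 4x4 matrix whose last
 * row is (0, 0, 0, 1), i.e. a general 3D affine transform. */
static void transform_points3_prefetch(float (*dst)[4],
                                       const float (*src)[4],
                                       size_t n, const float m[16])
{
    size_t i;

    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Ask for data a few points ahead so the miss overlaps with
         * the arithmetic on the current point. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(src[i + PREFETCH_AHEAD], 0, 3); /* read  */
            __builtin_prefetch(dst[i + PREFETCH_AHEAD], 1, 3); /* write */
        }
#endif
        const float x = src[i][0], y = src[i][1], z = src[i][2];

        dst[i][0] = m[0] * x + m[4] * y + m[8]  * z + m[12];
        dst[i][1] = m[1] * x + m[5] * y + m[9]  * z + m[13];
        dst[i][2] = m[2] * x + m[6] * y + m[10] * z + m[14];
        dst[i][3] = 1.0f;
    }
}
```

With two coordinates per cache line, every second prefetch in this
loop hits a line that has already been requested; unrolling by two
and prefetching once per line would avoid that.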

> In any case issues like "pipelining for cache"
> should be considered in addition to the minimizing
> of the CPU cycles for the actual arithmetic.

I definitely agree! One should also try to store as little temporary
data as possible in dedicated memory. It's better to reuse the same
memory locations for different kinds of data. This can, however, be 
a little messy and almost impossible to read.
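
One simple form of this, sketched under the assumption that the
untransformed coordinates are not needed afterwards, is to transform
in place so that the same buffer first holds object-space and then
clip-space data. The function name is made up for this example.

```c
#include <stddef.h>

/* Overwrites each point with M * point (column-major 4x4 matrix).
 * No second array is needed, so only one buffer competes for cache. */
static void transform_points4_in_place(float (*v)[4], size_t n,
                                       const float m[16])
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Load all four inputs before any output is stored, since
         * the outputs land on top of them. */
        const float x = v[i][0], y = v[i][1], z = v[i][2], w = v[i][3];

        v[i][0] = m[0] * x + m[4] * y + m[8]  * z + m[12] * w;
        v[i][1] = m[1] * x + m[5] * y + m[9]  * z + m[13] * w;
        v[i][2] = m[2] * x + m[6] * y + m[10] * z + m[14] * w;
        v[i][3] = m[3] * x + m[7] * y + m[11] * z + m[15] * w;
    }
}
```

The readability cost shows up at the call site: the array's meaning
now depends on which stage has already run.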

Eero Pajarre? Sounds much like an old friend from my years at Mentor
Graphics. :-) Sorry for not responding earlier. Since my back and ribs 
have prevented me from sitting too much in front of the screen, I've
been a little off grid for a while. Crayfish parties are very popular
in Sweden and this year's party was no exception. This year I managed
to break a number of ribs, falling off a pier.



_______________________________________________
Mesa-dev maillist  -  [EMAIL PROTECTED]
http://lists.mesa3d.org/mailman/listinfo/mesa-dev
