Re: [Mesa-dev] Transformation functions

1999-09-17 Thread Mårten Björkman


On Sun, 12 Sep 1999, Eero Pajarre wrote:
 Here are the top entries for my application:
 gl_x86_transform_points3_3d_raw
 gl_x86_cliptest_points4
 gl_x86_transform_points3_3d_no_rot_raw
 gl_x86_transform_points3_perspective_raw

Thank you Eero and Keith for your information. It seems that the 
most critical functions are quite limited in number after all. My
problem right now is that I haven't got the PIII operations
running at all. The compilation works all right, but I get exceptions 
as soon as I try the XMM ops. Maybe the kernel I'm using hasn't set 
up the proper control bits to allow XMM. The only PIII in our lab 
is actually an autonomous robot, and people would probably not like me 
installing unofficial patches on that machine.

 The cliptest_points contains a fdiv instruction which seems
 (according to Vtune) to be responsible for
 at least half of the CPU consumption of the whole function.

This function should probably be quite easy to parallelize and 
I suppose the projections should be kept within the function.
It's hard to see them elsewhere.
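
One way to picture the point about the fdiv is the C sketch below. It is
only an illustration, not Mesa's gl_x86_cliptest_points4: the function
name, the flag names, and the flat four-floats-per-point layout are all
assumptions. It tests each point against -w <= x,y,z <= w and projects
the unclipped ones with a single reciprocal followed by three multiplies,
so there is one divide per point instead of three.

```c
#include <assert.h>
#include <stddef.h>

/* Clip-flag bits. Names and values are made up for this sketch. */
enum {
    CLIP_RIGHT  = 1 << 0,  /* x >  w */
    CLIP_LEFT   = 1 << 1,  /* x < -w */
    CLIP_TOP    = 1 << 2,  /* y >  w */
    CLIP_BOTTOM = 1 << 3,  /* y < -w */
    CLIP_FAR    = 1 << 4,  /* z >  w */
    CLIP_NEAR   = 1 << 5   /* z < -w */
};

/* Hypothetical helper: 'in' and 'proj' hold n points of 4 floats each
 * (x, y, z, w). Writes one flag byte per point to 'mask' and returns
 * the OR of all flags. */
static unsigned char cliptest_points4_sketch(const float *in, float *proj,
                                             unsigned char *mask, size_t n)
{
    unsigned char or_mask = 0;

    assert(in != NULL && proj != NULL && mask != NULL);

    for (size_t i = 0; i < n; i++) {
        const float *p = in + 4 * i;
        float *q = proj + 4 * i;
        const float x = p[0], y = p[1], z = p[2], w = p[3];
        unsigned char m = 0;

        if (x >  w) m |= CLIP_RIGHT;
        if (x < -w) m |= CLIP_LEFT;
        if (y >  w) m |= CLIP_TOP;
        if (y < -w) m |= CLIP_BOTTOM;
        if (z >  w) m |= CLIP_FAR;
        if (z < -w) m |= CLIP_NEAR;

        mask[i] = m;
        or_mask |= m;

        if (m == 0 && w != 0.0f) {
            /* One divide, then three multiplies. */
            const float oow = 1.0f / w;
            q[0] = x * oow;
            q[1] = y * oow;
            q[2] = z * oow;
            q[3] = oow;
        } else {
            /* Clipped (or degenerate) point: store a harmless default. */
            q[0] = 0.0f;
            q[1] = 0.0f;
            q[2] = 0.0f;
            q[3] = 1.0f;
        }
    }
    return or_mask;
}
```

Each iteration is independent of the others, which is what makes the
loop a candidate for processing several points at once with SIMD.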

 Quite interestingly Vtune also places almost half of the
 CPU hit of the transform_points3_3d_raw operation to the
 first multiply instruction. I think this means that at least
 in my setup it is not the arithmetic which is expensive
 but fetching the data from the main memory to the FPU.

It's probably because the data is aligned, so cache misses (when 
they occur) always occur at the beginning of the loop. I did some
tests earlier and concluded that cache misses (on our 450MHz PIII) 
cost as follows:

L1 read miss:   7 cycles
L1 write miss: 37 cycles
L2 read miss:  26 cycles
L2 write miss: 80 cycles

Yes, those numbers are hard to believe indeed, but you can be quite
sure that they are in that neighbourhood. If all the coordinate data
doesn't fit into the caches, one could expect a miss to occur once for
every two coordinates (2x16 bytes). So, the data reads (and writes)
will surely dominate. Lots of speed is probably to be gained by
careful use of the new prefetching operations. 
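
As a rough illustration of the prefetching idea: with one miss per two
16-byte coordinates and the L2 read-miss figure above, reads alone would
cost about 26 / 2 = 13 cycles per point. The C sketch below issues a
prefetch hint a few points ahead of the one being transformed. It is an
assumption-laden sketch, not Mesa's code: the function name, the flat
four-floats-per-point layout, and the prefetch distance are hypothetical,
and it uses the GCC/Clang __builtin_prefetch builtin rather than the
PIII prefetch instructions directly.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC or Clang. Other compilers get a no-op hint. */
#if defined(__GNUC__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Hypothetical tuning knob: how many points ahead to prefetch. */
#define PREFETCH_AHEAD 8

/* Applies a 3D affine transform (column-major 4x4 matrix 'm', as in
 * OpenGL) to n points. Each point occupies 4 floats (16 bytes); only
 * x, y, z are read and written, the fourth slot is padding. */
static void transform_points3_3d_sketch(const float m[16], const float *from,
                                        float *to, size_t n)
{
    assert(m != NULL && from != NULL && to != NULL);

    for (size_t i = 0; i < n; i++) {
        const float *p = from + 4 * i;
        float *q = to + 4 * i;

        /* Two 16-byte points share a 32-byte line, so one hint per
         * pair of points is enough. */
        if ((i & 1) == 0 && i + PREFETCH_AHEAD < n)
            PREFETCH_READ(from + 4 * (i + PREFETCH_AHEAD));

        const float x = p[0], y = p[1], z = p[2];
        q[0] = m[0] * x + m[4] * y + m[8]  * z + m[12];
        q[1] = m[1] * x + m[5] * y + m[9]  * z + m[13];
        q[2] = m[2] * x + m[6] * y + m[10] * z + m[14];
    }
}
```

The right prefetch distance depends on the memory latency and on how
much work each iteration does, so it would have to be found by
measurement.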

 In any case issues like "pipelining for cache"
 should be considered in addition to the minimizing
 of the CPU cycles for the actual arithmetic.

I definitely agree! One should also try to store as little temporary
data as possible in dedicated memory. It's better to reuse the same
memory locations for different kinds of data. This can, however, be 
a little messy and almost impossible to read.

Eero Pajarre? Sounds much like an old friend from my years at Mentor
Graphics. :-) Sorry for not responding earlier. Since my back and ribs 
have prevented me from sitting too much in front of the screen, I've
been a little off the grid for a while. Crayfish parties are very popular
in Sweden and this year's party was no exception. This year I managed
to break a number of ribs, falling down from a pier.



___
Mesa-dev maillist  -  [EMAIL PROTECTED]
http://lists.mesa3d.org/mailman/listinfo/mesa-dev



Re: [Mesa-dev] Transformation functions

1999-09-12 Thread Eero Pajarre



Mårten Björkman wrote:
 
 I'm trying to estimate the amount of work required in order to
 optimize the transformation functions for Pentium III. Since the
 functions are many and some of them won't benefit much from using the
 new SIMD instructions, it's probably preferable to spend most energy
 on the most frequently used functions and forget about the rest.
 
 Does anyone know which functions one ought to concentrate on?
 

Here are the top entries for my application:
gl_x86_transform_points3_3d_raw
gl_x86_cliptest_points4
gl_x86_transform_points3_3d_no_rot_raw
gl_x86_transform_points3_perspective_raw

I have tested these in a Windows95/Pentium Pro/Vtune environment.

The cliptest_points contains a fdiv instruction which seems
(according to Vtune) to be responsible for
at least half of the CPU consumption of the whole function.
The heavy CPU penalty may be caused partly by the fact that
the instruction also takes the data cache miss penalties,
as it is the first instruction which accesses vertex data
in the loop.

Quite interestingly Vtune also places almost half of the
CPU hit of the transform_points3_3d_raw operation to the
first multiply instruction. I think this means that at least
in my setup it is not the arithmetic which is expensive
but fetching the data from the main memory to the FPU.

I am using quite a lot of CVA stuff, so I suspect
that the transformation engines are fed directly
from my application data. If I used vertex3f
commands instead, the cache miss penalty might
be in my application instead of the opengl library.


In any case issues like "pipelining for cache"
should be considered in addition to the minimizing
of the CPU cycles for the actual arithmetic.


Eero





[Mesa-dev] Transformation functions

1999-09-09 Thread Mårten Björkman


I'm trying to estimate the amount of work required in order to
optimize the transformation functions for Pentium III. Since the
functions are many and some of them won't benefit much from using the
new SIMD instructions, it's probably preferable to spend most energy 
on the most frequently used functions and forget about the rest.

Does anyone know which functions one ought to concentrate on?

/ Mårten Björkman 







Re: [Mesa-dev] Transformation functions

1999-09-09 Thread Keith Whitwell

Mårten Björkman wrote:
 
 I'm trying to estimate the amount of work required in order to
 optimize the transformation functions for Pentium III. Since the
 functions are many and some of them won't benefit much from using the
 new SIMD instructions, it's probably preferable to spend most energy
 on the most frequently used functions and forget about the rest.
 
 Does anyone know which functions one ought to concentrate on?
 


The functions with cullmask are less used than those without.  Functions for 1
and 2 vertices are less used than the 3 and 4 cases.  Probably the top few are:

cliptest-points-4
transform_points3_general_raw
transform_points3_3d_no_rot_raw

I'd also add 

transform_points3_3d_raw

and have a look at the ones used by the fx/mga fast paths in vertices.c.  These
are the ones used by q3.

Keith

