Keith Whitwell wrote:
Ian Romanick wrote:
One thing about Jakub's patch is that, on x86, it eliminates the need for the specialized _ts_* versions of the dispatch functions. It basically converts the DISPATCH macro (as used in src/mesa/main/dispatch.c) from:
    #define DISPATCH(FUNC, ARGS, MESSAGE) \
        (_glapi_Dispatch->FUNC) ARGS
to:
    #define DISPATCH(FUNC, ARGS, MESSAGE) \
        const struct _glapi_table * d = _glapi_Dispatch; \
        if ( __builtin_expect( d == NULL, 0 ) ) \
            d = get_dispatch(); \
        (d->FUNC) ARGS
There is some extra cost in the non-threaded case, but it seems very minimal. In the x86 assembly case, it's only a test and a conditional branch that is usually not taken. Does this seem like a reasonable change to make across the board?
Hmm. The _ts_* macros were introduced to eliminate exactly that sort of test - though we probably coded it up in a less optimal way than that. Are you saying that the dispatch tables would really become compiled 'C'? At the moment they are typically generated as assembly and use a jmp rather than calling a new function as in either of the examples above.
My feeling is that the non-threaded case should run as fast as possible, being the normal usage. Maybe some timings would make things clearer.
Attached is the test program I used. It calls each of a few API functions in turn, 1,000,000 times each by default (a different count can be given on the command line). I tried it on a 2.4GHz Pentium 4 and a 400MHz K6-3. Both systems run Red Hat 7.3 + patches (and are in need of upgrades, I know). All code was compiled with gcc 2.96-113.
On the K6-3, the results were within the measurable margin of error for the two x86 assembly dispatch methods.
On the P4, the old-style dispatch was between 5 and 20 clock cycles faster per call. Put the other way, the new dispatch increased the cost of each call by between 5% and 38%. The worst was glTexCoord3fv, which increased from ~52 cycles to ~72 cycles. The two exceptions were glMultiTexCoord2fv and glMultiTexCoord2f. The timings for these were virtually identical.
I'm a bit confused as to why the overhead isn't constant from function to function. The difference per-call should be identical. I suspect there is some other difference in my build. :( I'll keep looking into it...
#include <stdio.h>
#include <stdlib.h>

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>
#include <GL/glut.h>
#include <asm/timex.h>

static float Width = 400.0;
static float Height = 400.0;
static unsigned count = 1000000;

static void Idle( void )
{
   glutPostRedisplay();
}

#define DO_FUNC(f,p) \
   do { \
      t0 = get_cycles(); \
      for ( i = 0 ; i < count ; i++ ) { \
         f p ; \
      } \
      t1 = get_cycles(); \
      printf("%u calls to % 20s required %llu cycles.\n", count, # f, t1 - t0); \
   } while( 0 )

static void Display( void )
{
   int i;
   const float v[3] = { 1.0, 0.0, 0.0 };
   cycles_t t0;
   cycles_t t1;

   glBegin(GL_TRIANGLE_STRIP);

   DO_FUNC( glColor3fv, (v) );
   DO_FUNC( glNormal3fv, (v) );
   DO_FUNC( glTexCoord2fv, (v) );
   DO_FUNC( glTexCoord3fv, (v) );
   DO_FUNC( glMultiTexCoord2fv, (GL_TEXTURE0, v) );
   DO_FUNC( glMultiTexCoord2f, (GL_TEXTURE0, 0.0, 0.0) );
   DO_FUNC( glFogCoordfv, (v) );
   DO_FUNC( glFogCoordf, (0.5) );

   glEnd();

   exit(0);
}

static void Reshape( int width, int height )
{
   Width = width;
   Height = height;
   glViewport( 0, 0, width, height );
   glMatrixMode( GL_PROJECTION );
   glLoadIdentity();
   glOrtho(0.0, width, 0.0, height, -1.0, 1.0);
   glMatrixMode( GL_MODELVIEW );
   glLoadIdentity();
}

static void Key( unsigned char key, int x, int y )
{
   (void) x;
   (void) y;
   switch (key) {
   case 27:
      exit(0);
      break;
   }
   glutPostRedisplay();
}

int main( int argc, char *argv[] )
{
   glutInit( &argc, argv );
   glutInitWindowSize( (int) Width, (int) Height );
   glutInitWindowPosition( 0, 0 );
   glutInitDisplayMode( GLUT_RGB );
   glutCreateWindow( argv[0] );

   if ( argc > 1 ) {
      count = strtoul( argv[1], NULL, 0 );
   }

   glutReshapeFunc( Reshape );
   glutKeyboardFunc( Key );
   glutDisplayFunc( Display );
   glutIdleFunc( Idle );
   glutMainLoop();
   return 0;
}