Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, Nov 29, 2002 at 11:15:19AM +0000, José Fonseca wrote:
> The latter is the approach taken e.g. in chromium http://chromium.sourceforge.net , but actually I don't know if for any application besides scientific visualization a pipeline handles so many primitives at a time. For applications such as games, state changes (like texture changes) seem to happen too often for that.

The problem with that approach is that you have to do one of: a) sort primitives in screen space and assign non-overlapping primitives to the different pipelines; b) keep multiple buffers and blend them together at some point. If you do a), you can have parallelism after the transformation stage or a synchronization point in the middle of the pipelines. b) is horribly slow for a pipeline running completely in software.

Stage parallelization is in this case a much better approach. The problem is that the final stages of the pipeline are much more CPU intensive than the initial stages (think vertex processing vs. fragment processing), so you can't split pipeline stages equally across threads.

-- Marcelo

___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, Nov 29, 2002 at 01:13:22PM +0100, Felix Kühling wrote:
> On Fri, 29 Nov 2002 11:15:19 +0000 José Fonseca [EMAIL PROTECTED] wrote:
> > On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:
> > > First some high level considerations. Having each pipeline stage in a separate thread would lead to a large number of threads (e.g. 9 in the case of the radeon driver). Most of the pipeline stages take a rather short time, so the extra overhead of multi-threading and synchronization could have a significant impact. Alternatively one could use a fixed number N of threads and schedule pipeline stages on them: the main thread and N-1 free threads. If a free thread is available, the next pipeline stage would be executed on that thread and the OpenGL client could continue on the main thread without waiting for all pipeline stages to complete. Note that on a non-SMP system there would be only the main thread, which is equivalent to how the pipeline is currently executed.
> > I think that one thing that must be thought about is whether the parallelism should be in the pipeline stages or in the pipeline data, i.e., if we [...]
> I am not sure I understand the difference. The idea of a pipeline is that you split the tasks performed on data into several stages. Mesa does this part already. Then while one package is in stage 1 another one can be processed in stage 2 at the same time. So I think I have parallelism both in pipeline data and the stages.

The problem is two-fold in this case. First, most of the time not all of the stages are executed (i.e., the software rasterizer case is rarely executed). Second, most of the stages are very short. You'll spend most of your execution time synchronizing between the stages. I seem to recall that Carmack had a .plan update about that when he was adding SMP support to Quake 3. I'll see if I can find it. Most research in parallelizing code points to doing whatever is possible to minimize synchronization costs.

You might search through previous years' SIGGRAPH papers to see what other people have done in this area. It's not a new field. I know that there are patents in this area (sigh) that go back at least 5 or 10 years.

> > All assumptions have to be very well verified against all existing Mesa drivers, otherwise a discrete hack can cause havoc...
> All the hardware specific stages are drawing stages. So only one of them will be executed at a time. I don't see any problem here.
>
> One tricky part could be to find out how much of the context actually has to be copied. Obviously, all data that is modified by the pipeline stages needs to be copied. Everything that is read-only can be shared by all context copies.

What about TCL stages?

I think one problem that you'll run into is that, as more and more of the OpenGL pipeline gets moved into hardware, you'll see less and less benefit in doing this. :(

What might be worth looking into is using left-over CPU time to optimize data that is being sent to the card. That is, if the card is the rendering bottleneck, use some CPU cycles to optimize triangle strips that are being submitted, optimize out redundant state changes from the command stream, etc. The trick is in deciding when to enable the optimizer pass.

-- Smile! http://antwrp.gsfc.nasa.gov/apod/ap990315.html
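The command-stream clean-up suggested above (dropping redundant state changes while the card is the rendering bottleneck) can be sketched as a small filter pass. Everything below — the command encoding, the `StateCmd` type, the function name — is invented for illustration and is not Mesa or DRI code:

```c
/* Toy filter that drops redundant state changes from a command stream.
 * The command encoding here is invented for illustration only. */
#include <string.h>

enum { CMD_BIND_TEXTURE, CMD_SET_BLEND, CMD_NUM_STATES };

typedef struct { int cmd; int value; } StateCmd;

/* Copies src to dst, skipping commands that set a state to the value it
 * already has. Returns the number of commands kept. */
static int filter_redundant(const StateCmd *src, int n, StateCmd *dst)
{
    int last[CMD_NUM_STATES];
    int kept = 0;
    memset(last, 0xff, sizeof last);      /* -1 == state unknown */
    for (int i = 0; i < n; i++) {
        if (last[src[i].cmd] == src[i].value)
            continue;                     /* redundant, drop it */
        last[src[i].cmd] = src[i].value;
        dst[kept++] = src[i];
    }
    return kept;
}
```

A texture bound twice in a row, for example, costs one command instead of two after the pass; the open question from the mail — when it pays to enable such a pass — is not addressed here.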
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, Nov 29, 2002 at 01:20:45PM +0000, José Fonseca wrote:
> On Fri, Nov 29, 2002 at 01:13:22PM +0100, Felix Kühling wrote:
> > On Fri, 29 Nov 2002 11:15:19 +0000 José Fonseca [EMAIL PROTECTED] wrote:
> > > I think that one thing that must be thought about is whether the parallelism should be in the pipeline stages or in the pipeline data, i.e., if we [...]
> > I am not sure I understand the difference. The idea of a pipeline is that you split the tasks performed on data into several stages. Mesa does this part already. Then while one package is in stage 1 another one can be processed in stage 2 at the same time. So I think I have parallelism both in pipeline data and the stages.
> Let me illustrate with an example. Imagine you have 1000 polygons to process (i.e., transform, clip, build vertex buffers, and render). If you have an SMP computer with 4 processors you can make use of parallelism in at least two ways: a) have a processor do the transform of the 1000 polygons, another do the clipping, ... etc.; b) have a processor do the transform+clipping+... of 250 polygons, have another do the same for another 250 polygons, ... etc.

One thing I forgot to mention in my other message is that the b) option will make MUCH better use of the CPU caches.
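Option b) can be sketched with POSIX threads: each worker runs the whole per-vertex pipeline over its own quarter of the data, touching one contiguous slice (which is also why it is friendlier to the CPU caches). The `Vertex` type and the `transform_vertex` stand-in are hypothetical, not Mesa code:

```c
/* Sketch of option b): data parallelism. Each thread runs the *entire*
 * per-vertex pipeline on its own slice of 250 vertices. */
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4
#define NUM_VERTS   1000

typedef struct { float x, y, z; } Vertex;

static Vertex verts[NUM_VERTS];

/* Stand-in for the whole per-vertex pipeline: transform + clip + ... */
static void transform_vertex(Vertex *v)
{
    v->x *= 2.0f;   /* pretend this is a modelview/projection transform */
}

typedef struct { size_t first, count; } Slice;

static void *worker(void *arg)
{
    Slice *s = (Slice *)arg;
    for (size_t i = s->first; i < s->first + s->count; i++)
        transform_vertex(&verts[i]);
    return NULL;
}

/* Process all vertices using NUM_THREADS workers, one slice each. */
static void process_all(void)
{
    pthread_t tid[NUM_THREADS];
    Slice slice[NUM_THREADS];
    size_t chunk = NUM_VERTS / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        slice[t].first = t * chunk;
        slice[t].count = (t == NUM_THREADS - 1) ? NUM_VERTS - t * chunk
                                                : chunk;
        pthread_create(&tid[t], NULL, worker, &slice[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}
```

No synchronization is needed while the workers run, since the slices are disjoint; the cost moves to whatever recombines the results for in-order drawing afterwards.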
Re: [Dri-devel] Parallelizing MESA's pipeline?
Felix Kühling wrote:
> On Fri, 29 Nov 2002 07:55:52 -0700 Brian Paul [EMAIL PROTECTED] wrote:
> > [snip] Implementing a true threaded pipeline could be very complicated. State changes are the big issue. If you stall/flush the pipeline for every state change you wouldn't gain anything. The alternative is to associate the GL state with each chunk of vertex data as it passes through the pipeline AND reconfigure the pipeline in the midst of state changes. Again, I think this would be very complicated.
> I see the problem. On many state changes, a corresponding driver function is called. In a parallel pipeline implementation, if there is still vertex data (with associated state) pending in the pipeline, it will be rendered by the driver with the wrong state. A proper solution would be to call the state-changing driver functions (or only UpdateState?) from within the pipeline, just before a driver stage is run. The required amount of modifications to Mesa's driver state management seems not too big. A quick recursive grep in xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63 lines in 23 files. I found many state-changing callbacks in dd.h which don't seem to be used. Are they left-overs from earlier Mesa versions or did my grep miss something?

Which ones?

Keith
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 07:55:52 -0700 Brian Paul [EMAIL PROTECTED] wrote:
> [snip] Implementing a true threaded pipeline could be very complicated. State changes are the big issue. If you stall/flush the pipeline for every state change you wouldn't gain anything. The alternative is to associate the GL state with each chunk of vertex data as it passes through the pipeline AND reconfigure the pipeline in the midst of state changes. Again, I think this would be very complicated.

I see the problem. On many state changes, a corresponding driver function is called. In a parallel pipeline implementation, if there is still vertex data (with associated state) pending in the pipeline, it will be rendered by the driver with the wrong state. A proper solution would be to call the state-changing driver functions (or only UpdateState?) from within the pipeline, just before a driver stage is run. The required amount of modifications to Mesa's driver state management seems not too big. A quick recursive grep in xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63 lines in 23 files.

I found many state-changing callbacks in dd.h which don't seem to be used. Are they left-overs from earlier Mesa versions or did my grep miss something?

Regards,
Felix
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Sat, 30 Nov 2002 16:24:59 +0000 Keith Whitwell [EMAIL PROTECTED] wrote:
> Felix Kühling wrote:
> > [snip] I found many state-changing callbacks in dd.h which don't seem to be used. Are they left-overs from earlier Mesa versions or did my grep miss something?
> Which ones?

Ok, I got the answer to my question ;) There are many calls like this one:

    (*ctx->Driver.Enable)( ctx, GL_TEXTURE_GEN_Q, GL_FALSE );

My grep missed them. Another grep for ctx->Driver\..*( finds 162 lines in 34 files. :-/

Felix
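For reference, the difference between the two greps can be reproduced on a toy file. The archive tends to eat the `>` in `ctx->Driver`, so the patterns are written out in full below; the file contents are invented, standing in for the Mesa sources:

```shell
# The first pattern requires "(" right after the member name, so it misses
# indirect calls through a function-pointer dereference like
# (*ctx->Driver.Enable)(...). The looser second pattern catches both forms.
cat > /tmp/dd_demo.c <<'EOF'
(*ctx->Driver.Enable)( ctx, GL_TEXTURE_GEN_Q, GL_FALSE );
ctx->Driver.Clear( ctx );
EOF
grep -c 'ctx->Driver\.[[:alnum:]]*[[:space:]]*(' /tmp/dd_demo.c   # finds 1
grep -c 'ctx->Driver\..*(' /tmp/dd_demo.c                         # finds 2
```

That matches the counts in the thread going up from 63 lines to 162 once the indirect calls are included.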
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Sat, 30 Nov 2002 17:20:04 +0100 Felix Kühling [EMAIL PROTECTED] wrote:
> On Fri, 29 Nov 2002 07:55:52 -0700 Brian Paul [EMAIL PROTECTED] wrote:
> > [snip] Implementing a true threaded pipeline could be very complicated. State changes are the big issue. If you stall/flush the pipeline for every state change you wouldn't gain anything. The alternative is to associate the GL state with each chunk of vertex data as it passes through the pipeline AND reconfigure the pipeline in the midst of state changes. Again, I think this would be very complicated.
> I see the problem. On many state changes, a corresponding driver function is called. In a parallel pipeline implementation, if there is still vertex data (with associated state) pending in the pipeline, it will be rendered by the driver with the wrong state. A proper solution would be to call the state-changing driver functions (or only UpdateState?) from within the pipeline, just before a driver stage is run. The required amount of modifications to Mesa's driver state management seems not too big. A quick recursive grep in xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63 lines in 23 files.

Vertex formats are the real problem. Driver functions are called directly from the GL application and modify the driver state, bypassing the pipeline. Would it be worth sacrificing for a multi-threaded pipeline on an SMP system? In the radeon driver vtxfmt seems to be only enabled with TCL. But I remember some mails about vertex formats and the mach64 driver. Are vertex formats widely used with non-TCL drivers?

Felix
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Sat, Nov 30, 2002 at 08:04:55PM +0100, Felix Kühling wrote:
> On Sat, 30 Nov 2002 17:20:04 +0100 Felix Kühling [EMAIL PROTECTED] wrote:
> > I see the problem. On many state changes, a corresponding driver function is called. In a parallel pipeline implementation, if there is still vertex data (with associated state) pending in the pipeline, it will be rendered by the driver with the wrong state. A proper solution would be to call the state-changing driver functions (or only UpdateState?) from within the pipeline, just before a driver stage is run. The required amount of modifications to Mesa's driver state management seems not too big. A quick recursive grep in xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63 lines in 23 files.
> Vertex formats are the real problem. Driver functions are called directly from the GL application and modify the driver state, bypassing the pipeline.

Could you please give an example?

> Would it be worth sacrificing for a multi-threaded pipeline on an SMP system? In the radeon driver vtxfmt seems to be only enabled with TCL. But I remember some mails about vertex formats and the mach64 driver. Are vertex formats widely used with non-TCL drivers?

All drivers have a vertex format. One of the last stages of the Mesa pipeline on a DRI driver is to build vertex buffers, which are nothing more than a big array of vertices using that vertex format. What happens is that most drivers use the D3D vertex format. While on newer cards that format usually corresponds to the card's native vertex format, on older cards (such as the Mach64 and 3DFX) it doesn't. In summary, the vertex format is like a C structure describing how the card expects to see a vertex when reading system memory by DMA.

Now, returning to the main issue here, I'm not seeing what state changes happen during the vertex buffer construction... besides the vertex buffer itself, which would have to be thread specific.

José Fonseca
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Sat, 30 Nov 2002 19:52:50 +0000 José Fonseca [EMAIL PROTECTED] wrote:
> On Sat, Nov 30, 2002 at 08:04:55PM +0100, Felix Kühling wrote:
> > [snip] Vertex formats are the real problem. Driver functions are called directly from the GL application and modify the driver state, bypassing the pipeline.
> Could you please give an example?

I am referring to the GLvertexformat structure which is defined in dd.h. It contains driver callbacks which are installed directly into a glapi_table. This happens in xc/xc/extras/Mesa/src/vtxfmt.c. Reading the comments in vtxfmt.c and dd.h this really seems to be TCL related. An example is radeon_Materialfv in radeon_vtxfmt.c. A backtrace from radeon_Materialfv looks like this:

#0  radeon_Materialfv (face=1032, pname=4609, params=0xb22c) at radeon_vtxfmt.c:764
#1  0x404b627c in neutral_Materialfv (face=134605728, pname=134605728, v=0x805eba0) at ../../../../extras/Mesa/src/vtxfmt_tmp.h:162
#2  0x0804ce29 in draw_stairs_internal (mi=0xb388) at stairs.c:318
#3  0x0804d580 in draw_stairs (mi=0xb388) at stairs.c:555
#4  0x0804fb19 in xlockmore_screenhack (dpy=0x80664d8, window=48234497, want_writable_colors=0, want_uniform_colors=0, want_smooth_colors=0, want_bright_colors=0, event_mask=0, hack_init=0x804d380 init_stairs, hack_draw=0x804d454 draw_stairs, hack_reshape=0x804cf9c reshape_stairs, hack_handle_events=0, hack_free=0) at xlockmore.c:385
#5  0x0804c78e in screenhack (dpy=0x80664d8, window=48234497) at ../xlockmore.h:154
#6  0x0804e6d0 in main (argc=1, argv=0xb7d4) at ./../screenhack.c:638

There is that neutral_* wrapper layer in between. So there may still be hope ;-)

> > Would it be worth sacrificing for a multi-threaded pipeline on an SMP system? In the radeon driver vtxfmt seems to be only enabled with TCL. But I remember some mails about vertex formats and the mach64 driver. Are vertex formats widely used with non-TCL drivers?
> All drivers have a vertex format. One of the last stages of the Mesa pipeline on a DRI driver is to build vertex buffers, which are nothing more than a big array of vertices using that vertex format. What happens is that most drivers use the D3D vertex format. While on newer cards that format usually corresponds to the card's native vertex format, on older cards (such as the Mach64 and 3DFX) it doesn't. In summary, the vertex format is like a C structure describing how the card expects to see a vertex when reading system memory by DMA.

Ok. This seems to be something different from what I described above.

> Now, returning to the main issue here, I'm not seeing what state changes happen during the vertex buffer construction... besides the vertex buffer itself, which would have to be thread specific.

Felix
Re: [Dri-devel] Parallelizing MESA's pipeline?
Hi,

I got pretty excited about this idea and spent some more thought on it. I'm going to share my insights in this mail. Any comments are appreciated.

First some high level considerations. Having each pipeline stage in a separate thread would lead to a large number of threads (e.g. 9 in the case of the radeon driver). Most of the pipeline stages take a rather short time, so the extra overhead of multi-threading and synchronization could have a significant impact. Alternatively one could use a fixed number N of threads and schedule pipeline stages on them: the main thread and N-1 free threads. If a free thread is available, the next pipeline stage would be executed on that thread and the OpenGL client could continue on the main thread without waiting for all pipeline stages to complete. Note that on a non-SMP system there would be only the main thread, which is equivalent to how the pipeline is currently executed.

There are some functions that would have to flush the pipeline. I'm thinking of glXSwapBuffers and glFinish. There might be more (functions that access the color or z-buffers directly).

Extra care has to be taken if there are several pipeline stages which access the color or z-buffers, like a hardware rendering stage + a software rendering stage for fallbacks. I'll call them drawing stages. They have to be executed one at a time and in the correct order. This can be ensured by flushing all higher pipeline stages before running a drawing stage. I'm not sure if textures would require any special attention. I'd have to read a bit more about that.

If several pipeline stages are running at the same time for different vertex buffers, they have to work with separate copies of the GLcontext reflecting the state at the time the buffer was fed into the pipeline. It is enough to copy a context only once, when it moves from the main thread to a free thread.

I think this task is not trivial but feasible. It would allow SMP systems to benefit from the inherent parallelism in the rendering process.

Regards,
Felix

On Thu, 28 Nov 2002 20:59:08 +0100 Felix Kühling [EMAIL PROTECTED] wrote:
> Hi, I came across the Mesa pipeline a couple of times now, reading the Mesa sources. It struck me that all the pipeline stages are actually executed sequentially. Does anyone who is more involved think it would be worthwhile to process several vertex buffers at a time in parallel pipeline stages by having each stage in its own thread? I realize that only drivers/cards without hardware TCL would benefit from this (or fallback cases where TCL doesn't work). Thoughts? In case this is considered useful I'm volunteering to make this a small project of mine.
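The "main thread plus N-1 free threads" scheduling described above might look roughly like this with N = 2. All names (`VertexBuffer`, `run_remaining_stages`, ...) are invented for the sketch, and the fallback path is the current sequential behaviour:

```c
/* Sketch: hand a vertex buffer to a free thread if one is available,
 * otherwise process it on the main (OpenGL client) thread. */
#include <pthread.h>
#include <stdbool.h>

typedef struct { int id; int processed; } VertexBuffer;

/* Stand-in for all remaining pipeline stages for this buffer. */
static void run_remaining_stages(VertexBuffer *vb) { vb->processed = 1; }

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static bool thread_busy = false;   /* N = 2: exactly one free thread */

static void *free_thread_fn(void *arg)
{
    run_remaining_stages((VertexBuffer *)arg);
    pthread_mutex_lock(&pool_lock);
    thread_busy = false;           /* become available again */
    pthread_mutex_unlock(&pool_lock);
    return NULL;
}

/* Called from the main thread for each buffer. Returns true if the
 * buffer was handed off, so the client can continue without waiting. */
static bool submit(VertexBuffer *vb, pthread_t *tid)
{
    pthread_mutex_lock(&pool_lock);
    bool have_free = !thread_busy;
    if (have_free)
        thread_busy = true;
    pthread_mutex_unlock(&pool_lock);

    if (have_free) {
        pthread_create(tid, NULL, free_thread_fn, vb);
        return true;
    }
    run_remaining_stages(vb);      /* non-SMP degenerate case */
    return false;
}
```

On a uniprocessor build one would simply never spawn the free thread, and `submit` collapses to the current sequential pipeline.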
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 10:19:52 +0100 Felix Kühling [EMAIL PROTECTED] wrote:
> I think this task is not trivial but feasible. It would allow SMP systems to benefit from the inherent parallelism in the rendering process.

Sounds pretty sweet. Now make a patch ;-)
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:
> Hi, I got pretty excited about this idea and spent some more thought on it. I'm going to share my insights in this mail. Any comments are appreciated.
>
> First some high level considerations. Having each pipeline stage in a separate thread would lead to a large number of threads (e.g. 9 in the case of the radeon driver). Most of the pipeline stages take a rather short time, so the extra overhead of multi-threading and synchronization could have a significant impact. Alternatively one could use a fixed number N of threads and schedule pipeline stages on them: the main thread and N-1 free threads. If a free thread is available, the next pipeline stage would be executed on that thread and the OpenGL client could continue on the main thread without waiting for all pipeline stages to complete. Note that on a non-SMP system there would be only the main thread, which is equivalent to how the pipeline is currently executed.

I think that one thing that must be thought about is whether the parallelism should be in the pipeline stages or in the pipeline data, i.e., if we should partition a stage for each thread (as you're suggesting), or if there is enough data running through the pipeline to make it worthwhile to have a whole pipeline for each thread, each processing a part of the primitives. The latter is the approach taken e.g. in chromium http://chromium.sourceforge.net , but actually I don't know if for any application besides scientific visualization a pipeline handles so many primitives at a time. For applications such as games, state changes (like texture changes) seem to happen too often for that.

> There are some functions that would have to flush the pipeline. I'm thinking of glXSwapBuffers and glFinish. There might be more (functions that access the color or z-buffers directly). Extra care has to be taken if there are several pipeline stages which access the color or z-buffers, like a hardware rendering stage + a software rendering stage for fallbacks. I'll call them drawing stages. They have to be executed one at a time and in the correct order. This can be ensured by flushing all higher pipeline stages before running a drawing stage. I'm not sure if textures would require any special attention. I'd have to read a bit more about that.
>
> If several pipeline stages are running at the same time for different vertex buffers, they have to work with separate copies of the GLcontext reflecting the state at the time the buffer was fed into the pipeline. It is enough to copy a context only once, when it moves from the main thread to a free thread.

All assumptions have to be very well verified against all existing Mesa drivers, otherwise a discrete hack can cause havoc...

> I think this task is not trivial but feasible. It would allow SMP systems to benefit from the inherent parallelism in the rendering process.

It surely seems interesting! ;-)

José Fonseca
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 11:15:19 +0000 José Fonseca [EMAIL PROTECTED] wrote:
> On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:
> > Hi, I got pretty excited about this idea and spent some more thought on it. I'm going to share my insights in this mail. Any comments are appreciated.
> >
> > First some high level considerations. Having each pipeline stage in a separate thread would lead to a large number of threads (e.g. 9 in the case of the radeon driver). Most of the pipeline stages take a rather short time, so the extra overhead of multi-threading and synchronization could have a significant impact. Alternatively one could use a fixed number N of threads and schedule pipeline stages on them: the main thread and N-1 free threads. If a free thread is available, the next pipeline stage would be executed on that thread and the OpenGL client could continue on the main thread without waiting for all pipeline stages to complete. Note that on a non-SMP system there would be only the main thread, which is equivalent to how the pipeline is currently executed.
> I think that one thing that must be thought about is whether the parallelism should be in the pipeline stages or in the pipeline data, i.e., if we [...]

I am not sure I understand the difference. The idea of a pipeline is that you split the tasks performed on data into several stages. Mesa does this part already. Then while one package is in stage 1 another one can be processed in stage 2 at the same time. So I think I have parallelism both in pipeline data and the stages.

> should partition a stage for each thread (as you're suggesting), or if there is enough data running through the pipeline to make it worthwhile to have a whole pipeline for each thread, each processing a part of the primitives.

Having one thread for each pipeline stage was my first idea. The alternative approach I tried to explain could be implemented in a way that, once data is processed by a free thread, it runs through all remaining pipeline stages in that same thread. Only if no free thread is available does processing start on the main thread. Still, semaphores have to be used to synchronize the threads (including the main thread) so that data packets cannot overtake each other. In the end, the drawing has to occur in the same order as data was fed into the pipeline.

> The latter is the approach taken e.g. in chromium http://chromium.sourceforge.net , but actually I don't know if for any application besides scientific visualization a pipeline handles so many primitives at a time. For applications such as games, state changes (like texture changes) seem to happen too often for that.

I just had a look at their web page. They take a very different approach to parallelizing the rendering task. They are targeting clusters, not SMP systems.

> > There are some functions that would have to flush the pipeline. I'm thinking of glXSwapBuffers and glFinish. There might be more (functions that access the color or z-buffers directly). Extra care has to be taken if there are several pipeline stages which access the color or z-buffers, like a hardware rendering stage + a software rendering stage for fallbacks. I'll call them drawing stages. They have to be executed one at a time and in the correct order. This can be ensured by flushing all higher pipeline stages before running a drawing stage. I'm not sure if textures would require any special attention. I'd have to read a bit more about that.
> >
> > If several pipeline stages are running at the same time for different vertex buffers, they have to work with separate copies of the GLcontext reflecting the state at the time the buffer was fed into the pipeline. It is enough to copy a context only once, when it moves from the main thread to a free thread.
> All assumptions have to be very well verified against all existing Mesa drivers, otherwise a discrete hack can cause havoc...

All the hardware specific stages are drawing stages. So only one of them will be executed at a time. I don't see any problem here.
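The ordering constraint above — packets may be processed concurrently, but drawing must retire in the order the data was fed into the pipeline — can be sketched with a ticket counter. A mutex/condvar is used here where the mail says semaphores, and all names are invented:

```c
/* Sketch: threads finish in any order, but draw_in_order() forces the
 * drawing stage to retire strictly in submission (ticket) order. */
#include <pthread.h>

#define PACKETS 4

static pthread_mutex_t order_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  order_turn = PTHREAD_COND_INITIALIZER;
static int draw_turn = 0;                /* next ticket allowed to draw */
static int draw_log[PACKETS], drawn = 0; /* records retirement order */

static void draw_in_order(int ticket)
{
    pthread_mutex_lock(&order_lock);
    while (draw_turn != ticket)          /* not our turn yet: wait */
        pthread_cond_wait(&order_turn, &order_lock);
    draw_log[drawn++] = ticket;          /* the actual drawing stage */
    draw_turn++;
    pthread_cond_broadcast(&order_turn); /* wake the next ticket */
    pthread_mutex_unlock(&order_lock);
}
```

Each packet gets its ticket on the main thread when it enters the pipeline, so no packet can overtake another at the drawing stage no matter which thread processed it.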
One tricky part could be to find out how much of the context actually has to be copied. Obviously, all data that is modified by the pipeline stages needs to be copied. Everything that is read-only can be shared by all context copies.

> > I think this task is not trivial but feasible. It would allow SMP systems to benefit from the inherent parallelism in the rendering process.
> It surely seems interesting! ;-)
>
> José Fonseca

Felix
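The copy-once idea above (copy mutable state when a buffer leaves the main thread, share everything that is read-only) can be sketched with a reference-counted shared part. The types are invented and far simpler than Mesa's real GLcontext:

```c
/* Sketch: snapshot a context for a free thread. Mutable state is copied;
 * read-only data is shared by reference and reference-counted. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    int refcount;
    unsigned char texels[64];  /* stand-in for shared read-only data */
} SharedState;

typedef struct {
    float modelview[16];       /* mutable: must be copied per snapshot */
    SharedState *shared;       /* read-only: shared by all snapshots */
} ContextSnapshot;

static ContextSnapshot *snapshot_context(const ContextSnapshot *ctx)
{
    ContextSnapshot *copy = malloc(sizeof *copy);
    memcpy(copy->modelview, ctx->modelview, sizeof copy->modelview);
    copy->shared = ctx->shared;
    copy->shared->refcount++;  /* NOTE: needs locking/atomics for real */
    return copy;
}

static void release_snapshot(ContextSnapshot *snap)
{
    if (--snap->shared->refcount == 0)
        free(snap->shared);    /* last user frees the shared part */
    free(snap);
}
```

The snapshot is taken exactly once, when the buffer moves from the main thread to a free thread, matching the "copy a context only once" point in the proposal.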
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 11:15:19 +0000 José Fonseca [EMAIL PROTECTED] wrote:
> On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:
> > [snip]
> I think that one thing that must be thought about is whether the parallelism should be in the pipeline stages or in the pipeline data, i.e., if we should partition a stage for each thread (as you're suggesting), or if there is enough data running through the pipeline to make it worthwhile to have a whole pipeline for each thread, each processing a part of the primitives.

I just came back from lunch with my brain back online ;-). Now I understand what you're saying. I think my approach is much simpler to implement. I don't have to worry about how to compose the final image from images generated by subsets of the vertex data. And even before that, there would be more problems if we wanted to render to several parallel back buffers at the same time using the same hardware.

> The latter is the approach taken e.g. in chromium http://chromium.sourceforge.net , but actually I don't know if for any application besides scientific visualization a pipeline handles so many primitives at a time. For applications such as games, state changes (like texture changes) seem to happen too often for that.

Agreed. To me it sounds like chromium splits the vertex data on a frame level. We're talking about much smaller units here.

[snip]

Felix
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, Nov 29, 2002 at 01:13:22PM +0100, Felix Kühling wrote: On Fri, 29 Nov 2002 11:15:19 + José Fonseca [EMAIL PROTECTED] wrote: I think that one thing that must be thought about is whether the parallelism should be in the pipeline stages or in the pipeline data, i.e., whether we I am not sure I understand the difference. The idea of a pipeline is that you split the tasks performed on data into several stages. Mesa does this part already. Then while one package is in stage 1, another one can be processed in stage 2 at the same time. So I think I have parallelism both in the pipeline data and the stages. Let me illustrate with an example. Imagine you have 1000 polygons to process (i.e., transform, clip, build vertex buffers, and render). If you have an SMP computer with 4 processors you can make use of parallelism in at least two ways: a) Have a processor do the transform of the 1000 polygons, another do the clipping, ... etc. b) Have a processor do the transform+clipping+... of 250 polygons, have another do the same for another 250 polygons, ... etc. should partition a stage for each thread (as you're suggesting), or if there is enough data running through the pipeline making it worthwhile to have a whole pipeline for each thread, each processing a part of the primitives. Having one thread for each pipeline stage was my first idea. The alternative approach I tried to explain could be implemented in a way that, once data is picked up by a free thread, it runs through all remaining pipeline stages in that same thread. Only if no free thread is available does processing start on the main thread. I understood your second proposal, but you still have one thread doing a pipeline stage processing all the data (case a) above). Still, semaphores have to be used to synchronize the threads (including the main thread) so that data packets cannot overtake each other. In the end, the drawing has to occur in the same order as the data was fed into the pipeline.
The latter is the approach taken, e.g., in Chromium http://chromium.sourceforge.net , but actually I don't know if for any application besides scientific visualization a pipeline handles so many primitives at a time. For applications such as games, state changes (like texture changes) seem to happen too often for that. I just had a look at their web page. They take a very different approach to parallelizing the rendering task. They are targeting clusters, not SMP systems. Why do you dismiss it so quickly? Have you seen http://www.cs.virginia.edu/~humper/chromium_documentation/threadedapplication.html ? There is nothing in their approach specific to clusters. Using approach b) yields much better parallelism than a) because each thread can work independently of the others, and therefore there are fewer waits/lock contentions/etc. Nevertheless, it isn't worthwhile if you're processing 50 polygons at a time. Still, if you have 50 polygons in blue, 50 in red, and 50 in texture A, and you run them all separately on different processors, you will probably still get better parallelism. I'm not sure which approach will give better speedups, but this has to be considered. Perhaps it would be a good idea to talk with the Chromium guys to find out what order of speedups they achieve on SMP systems with usual applications. José Fonseca
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 13:20:45 + José Fonseca [EMAIL PROTECTED] wrote: [snip] Let me illustrate with an example. Imagine you have 1000 polygons to process (i.e., transform, clip, build vertex buffers, and render). If you have an SMP computer with 4 processors you can make use of parallelism in at least two ways: a) Have a processor do the transform of the 1000 polygons, another do the clipping, ... etc. b) Have a processor do the transform+clipping+... of 250 polygons, have another do the same for another 250 polygons, ... etc. Ok, now I get the point. should partition a stage for each thread (as you're suggesting), or if there is enough data running through the pipeline making it worthwhile to have a whole pipeline for each thread, each processing a part of the primitives. Having one thread for each pipeline stage was my first idea. The alternative approach I tried to explain could be implemented in a way that, once data is picked up by a free thread, it runs through all remaining pipeline stages in that same thread. Only if no free thread is available does processing start on the main thread. I understood your second proposal, but you still have one thread doing a pipeline stage processing all the data (case a) above). Sorry, my fault. I should have read more carefully. The Chromium approach was new to me. I guess it took me a while to digest. Still, semaphores have to be used to synchronize the threads (including the main thread) so that data packets cannot overtake each other. In the end, the drawing has to occur in the same order as the data was fed into the pipeline. The latter is the approach taken, e.g., in Chromium http://chromium.sourceforge.net , but actually I don't know if for any application besides scientific visualization a pipeline handles so many primitives at a time. For applications such as games, state changes (like texture changes) seem to happen too often for that. I just had a look at their web page.
They take a very different approach to parallelizing the rendering task. They are targeting clusters, not SMP systems. Why do you dismiss it so quickly? Have you seen I wasn't dismissing it. It's just what they say on their web page: "Chromium is a system for interactive rendering on clusters of workstations." http://www.cs.virginia.edu/~humper/chromium_documentation/threadedapplication.html ? Not yet. I'll read it. There is nothing in their approach specific to clusters. Using approach b) yields much better parallelism than a) because each thread can work independently of the others, and therefore there are fewer waits/lock contentions/etc. If they want it to run on a cluster they cannot use shared memory. Also, they have to optimize their system so it does not suffer too much from network latencies. Sure, their system will also work and yield a speedup on a single SMP system. But my approach is limited to single SMP systems. So in a way, my approach is more limited, if that's what was bothering you ;-) Nevertheless, it isn't worthwhile if you're processing 50 polygons at a time. Still, if you have 50 polygons in blue, 50 in red, and 50 in texture A, and you run them all separately on different processors, you will probably still get better parallelism. On an SMP system my scheme could exploit this. It would render, say, the blue polygons in a free thread while in the main thread the OpenGL client continues queuing up the red polygons. When the red polygons are to be run through the pipeline and the blue ones aren't finished yet, they would be rendered in the main thread in parallel with the blue polygons. If you have more threads, you can have more vertex buffers processed in parallel in this way. If the synchronization is implemented as unrestrictively as possible, the only reason to wait should be to force the drawing stages to occur in the correct order.
Perhaps it would be a good idea to talk with the Chromium guys to find out what order of speedups they achieve on SMP systems with usual applications. My approach was basically inspired by the fact that there is something in Mesa that is called a pipeline. So I thought, why not implement it like a real pipeline. If we really want to parallelize Mesa, then we should consider all options. I'm probably biased towards my proposal ;-) Best regards, Felix
Re: [Dri-devel] Parallelizing MESA's pipeline?
Felix Kühling wrote: My approach was basically inspired by the fact that there is something in Mesa that is called a pipeline. So I thought, why not implement it like a real pipeline. If we really want to parallelize Mesa, then we should consider all options. I'm probably biased towards my proposal ;-) A few years ago another group developed pmesa: http://pmesa.sourceforge.net/ You might look at that. I think someone else brought this up on the Mesa-dev or DRI list earlier this year. I have to say I'm skeptical. A hardware TCL driver like the radeon or r200 won't benefit from this. In most other cases, I think the overhead of parallelization will result in very modest speed-ups, if any. The only situation in which I can see a benefit is when applications draw very long primitive strips (many thousands of vertices). In that case, splitting the N vertices into N/P pieces for P processors and transforming them in parallel would be the best approach. I think that's what pmesa did. Implementing a true threaded pipeline could be very complicated. State changes are the big issue. If you stall/flush the pipeline for every state change you won't gain anything. The alternative is to associate the GL state with each chunk of vertex data as it passes through the pipeline AND reconfigure the pipeline in the midst of state changes. Again, I think this would be very complicated. As for Chromium, there are situations in which multiprocessors can be helpful, but it depends largely on the nature of the application, and the best speed-ups come from parallelizing the application itself so that multiple threads or instances of the application all issue rendering commands in parallel to an array of graphics cards. With modern graphics cards, the bottleneck is often the application itself: the card starves because the app can't issue rendering commands and vertex data fast enough.
So, feel free to experiment with this, but realize that you may be disappointed with the results with typical applications. -Brian
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 07:55:52 -0700 Brian Paul [EMAIL PROTECTED] wrote: Felix Kühling wrote: My approach was basically inspired by the fact that there is something in Mesa that is called a pipeline. So I thought, why not implement it like a real pipeline. If we really want to parallelize Mesa, then we should consider all options. I'm probably biased towards my proposal ;-) A few years ago another group developed pmesa: http://pmesa.sourceforge.net/ You might look at that. I had a look at their page. The last news is from April 2000. Looks dead. I think someone else brought this up on the Mesa-dev or DRI list earlier this year. I have to say I'm skeptical. I saw his post. He had a slightly different plan. He didn't want to have several chunks of vertex data processed in parallel, but to exploit independencies between pipeline stages for parallelization. A hardware TCL driver like the radeon or r200 won't benefit from this. That's not quite true. With what I have in mind, the OpenGL client could continue while another thread is submitting the vertex data to the TCL unit. But you're right, drivers without hardware TCL would benefit more from this. In most other cases, I think the overhead of parallelization will result in very modest speed-ups, if any. The only situation in which I can see a benefit is when applications draw very long primitive strips (many thousands of vertices). In that case, splitting the N vertices into N/P pieces for P processors and transforming them in parallel would be the best approach. I think that's what pmesa did. That's right. Implementing a true threaded pipeline could be very complicated. State changes are the big issue. If you stall/flush the pipeline for every state change you won't gain anything. The alternative is to associate the GL state with each chunk of vertex data as it passes through the pipeline AND reconfigure the pipeline in the midst of state changes. Again, I think this would be very complicated. That's what I had in mind.
As for Chromium, there are situations in which multiprocessors can be helpful, but it depends largely on the nature of the application, and the best speed-ups come from parallelizing the application itself so that multiple threads or instances of the application all issue rendering commands in parallel to an array of graphics cards. With modern graphics cards, the bottleneck is often the application itself: the card starves because the app can't issue rendering commands and vertex data fast enough. Yeah, I hardly ever see 100% CPU usage with 3D applications on my Radeon 7500. But the latest Quake3 point release has some multithreading support. They must have seen some benefit from it. So, feel free to experiment with this, but realize that you may be disappointed with the results with typical applications. I think I'll give it a shot. -Brian Thanks for the pointers, Felix
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 14:54:30 +0100 Felix Kühling [EMAIL PROTECTED] wrote: When the red polygons are to be run through the pipeline and the blue ones aren't finished yet What if the red ones are transparent and overlap the blue ones? They will have to wait, but perhaps not all of them? How fine-grained can this go without becoming an exercise in futility?
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 13:20:45 + José Fonseca [EMAIL PROTECTED] wrote: Nevertheless, it isn't worthwhile if you're processing 50 polygons at a time. Still, if you have 50 polygons in blue, 50 in red, and 50 in texture A, and you run them all separately on different processors, you will probably still get better parallelism. OOI, how did the (long dead) pmesa work?
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 16:59:16 +0100 Felix Kühling [EMAIL PROTECTED] wrote: A hardware TCL driver like the radeon or r200 won't benefit from this. That's not quite true. With what I have in mind, the OpenGL client could continue while another thread is submitting the vertex data to the TCL unit. But you're right, drivers without hardware TCL would benefit more from this. What about a card with multiple TCL units? The client could then be submitting streams in parallel?
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 16:59:16 +0100 Felix Kühling [EMAIL PROTECTED] wrote: Yeah, I hardly ever see 100% CPU usage with 3D applications on my Radeon 7500. Heh, me neither. Q3 uses bugger all CPU here. That said, the xmms default GL plugin uses 100%, and it's NOT software rendering.
Re: [Dri-devel] Parallelizing MESA's pipeline?
On Fri, 29 Nov 2002 22:46:08 + Ian Molton [EMAIL PROTECTED] wrote: On Fri, 29 Nov 2002 14:54:30 +0100 Felix Kühling [EMAIL PROTECTED] wrote: When the red polygons are to be run through the pipeline and the blue ones aren't finished yet What if the red ones are transparent and overlap the blue ones? They will have to wait, but perhaps not all of them? The actual drawing has to happen sequentially in any case, as we have only one graphics card to render with. But while the blue polygons are drawn, the red polygons can be transformed and clipped, the vertices lit, and whatever other stages there are before rendering. How fine-grained can this go without becoming an exercise in futility? Regarding the pipeline stages, I was planning to use the granularity already defined by the Mesa pipeline stages. If you are looking for an example, see lines 206-236 in xc/xc/lib/GL/mesa/src/drv/radeon/radeon_context.c. The size of the vertex chunks processed at a time depends on the application (how often the state changes). Felix