Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-12-02 Thread Marcelo E. Magallon
On Fri, Nov 29, 2002 at 11:15:19AM +, José Fonseca wrote:

  The latter is the approach taken e.g., in chromium
  http://chromium.sourceforge.net , but actually I don't know if for
  any application besides scientific visualization a pipeline handles
  so many primitives at a time. For applications such as games, state
  changes (like texture changes) seem to happen too often for that.

 The problem with that approach is that you have to do one of a) sort
 primitives in screen space and assign non-overlapping primitives to
 the different pipelines; b) keep multiple buffers and blend them
 together at some point.  If you do a), you can have parallelism after
 the transformation stage or have a synchronization point in the middle
 of the pipelines.  b) is horribly slow for a pipeline running completely
 in software.

 Stage parallelization is in this case a much better approach.  Problem
 is the final stages of the pipeline are much more CPU intensive than
 the initial stages (think vertex processing vs. fragment processing),
 so you can't split pipeline stages equally across threads.

-- 
Marcelo


---
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel



Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-12-02 Thread Ian Romanick
On Fri, Nov 29, 2002 at 01:13:22PM +0100, Felix Kühling wrote:
 On Fri, 29 Nov 2002 11:15:19 +
 José Fonseca [EMAIL PROTECTED] wrote:
 
  On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:

   First some high level considerations. Having each pipeline stage in a
   separate thread would lead to a large number of threads (e.g. 9 in the
   case of the radeon driver). Most of the pipeline stages take rather
   short time so that the extra overhead of multi threading and
   synchronization could have a significant impact. Alternatively one could
   use a fixed number N of threads and schedule pipeline stages on them,
   the main thread and N-1 free threads. If a free thread is available
   the next pipeline stage would be executed on that thread and the OpenGL
   client could continue on the main thread without waiting for all
   pipeline stages to complete. Note that on a non-SMP system there would
   be only the main thread which is equivalent to how the pipeline is
   currently executed.
  
  I think that one thing that must be considered is whether the parallelism
  should be in the pipeline stages or in the pipeline data, i.e., if we
 
 I am not sure I understand the difference. The idea of a pipeline is
 that you split the tasks performed on data into several stages. Mesa
 does this part already. Then while one package is in stage 1 another one
 can be processed in stage 2 at the same time. So I think I have
 parallelism both in pipeline data and the stages.

The problem is two-fold in this case.  First, most of the time not all of
the stages are executed (i.e., the software rasterizer case is rarely
executed).  Second, most of the stages are very short.  You'll spend most of
your execution time synchronizing between the stages.  I seem to recall the
Carmack had a .plan update about that when he was adding SMP support to
Quake3.  I'll see if I can find it.

Most research in parallelizing code points to doing whatever is possible to
minimize synchronization costs.  You might search through previous years
SIGGRAPH papers to see what other people have done in this area.  It's not a
new field.  I know that there are patents in this area (sigh) that go back
at least 5 or 10 years.

  All assumptions have to be very well verified against all existing Mesa
  drivers, otherwise a discrete hack can cause havoc...
 
 All the hardware specific stages are drawing stages. So only one of them
 will be executed at a time. I don't see any problem here. One tricky
 part could be to find out, how much of the context actually has to be
 copied. Obviously, all data that is modified by the pipeline stages
 needs to be copied. Everything that is read only can be shared by all
 context copies.

What about TCL stages?

I think one problem that you'll run into is that, as more and more of the
OpenGL pipeline gets moved into hardware, you'll see less and less benefit
in doing this. :(

What might be worth looking into is using left over CPU time to optimize
data that is being sent to the card.  That is, if the card is the rendering
bottleneck, use some CPU cycles to optimize triangle strips that are being
submitted, optimize out redundant state changes from the command stream, etc.
The trick is in deciding when to enable the optimizer pass.

-- 
Smile!  http://antwrp.gsfc.nasa.gov/apod/ap990315.html





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-12-02 Thread Ian Romanick
On Fri, Nov 29, 2002 at 01:20:45PM +, José Fonseca wrote:
 On Fri, Nov 29, 2002 at 01:13:22PM +0100, Felix Kühling wrote:
  On Fri, 29 Nov 2002 11:15:19 +
  José Fonseca [EMAIL PROTECTED] wrote:
   I think that one thing that must be considered is whether the parallelism
   should be in the pipeline stages or in the pipeline data, i.e., if we
  
  I am not sure I understand the difference. The idea of a pipeline is
  that you split the tasks performed on data into several stages. Mesa
  does this part already. Then while one package is in stage 1 another one
  can be processed in stage 2 at the same time. So I think I have
  parallelism both in pipeline data and the stages.
 
 
 Let me illustrate with an example. Imagine you have 1000 polygons to
 process (i.e., transform, clip, build vertex buffers, and render). If
 you have a SMP computer with 4 processors you can make use of
 parallelism in at least two ways:
 
 a) Have a processor do the transform of the 1000 polygons, another do
 the clipping, ... etc.
 
 b) Have a processor do the transform+clipping+... of 250 polygons, have
 another do the same for another 250 polygons, ... etc.

One thing I forgot to mention in my other message is that the b option
will make MUCH better use of the CPU caches.

-- 
Smile!  http://antwrp.gsfc.nasa.gov/apod/ap990315.html





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-30 Thread Keith Whitwell
Felix Kühling wrote:

On Fri, 29 Nov 2002 07:55:52 -0700
Brian Paul [EMAIL PROTECTED] wrote:

[snip]


Implementing a true threaded pipeline could be very complicated.  State
changes are the big issue.  If you stall/flush the pipeline for every
state change you wouldn't gain anything.  The alternative is to associate
the GL state with each chunk of vertex data as it passes through the
pipeline AND reconfigure the pipeline in the midst of state changes.
Again, I think this would be very complicated.



I see the problem. On many state changes, a corresponding driver
function is called. In a parallel pipeline implementation, if there is
still vertex data (with associated state) pending in the pipeline, it
will be rendered by the driver with the wrong state. A proper solution
would be to call the state-changing driver functions (or only
UpdateState?) from within the pipeline, just before a driver stage is
run. The required amount of modifications to Mesa's driver state
management seems not too big. A quick recursive grep in
xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63
lines in 23 files.

I found many state changing callbacks in dd.h which don't seem to be
used. Are they left-overs from earlier Mesa versions or did my grep miss
something?


Which ones?

Keith







Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-30 Thread Felix Kühling
On Fri, 29 Nov 2002 07:55:52 -0700
Brian Paul [EMAIL PROTECTED] wrote:

[snip]
 Implementing a true threaded pipeline could be very complicated.  State
 changes are the big issue.  If you stall/flush the pipeline for every
 state change you wouldn't gain anything.  The alternative is to associate
 the GL state with each chunk of vertex data as it passes through the
 pipeline AND reconfigure the pipeline in the midst of state changes.
 Again, I think this would be very complicated.

I see the problem. On many state changes, a corresponding driver
function is called. In a parallel pipeline implementation, if there is
still vertex data (with associated state) pending in the pipeline, it
will be rendered by the driver with the wrong state. A proper solution
would be to call the state-changing driver functions (or only
UpdateState?) from within the pipeline, just before a driver stage is
run. The required amount of modifications to Mesa's driver state
management seems not too big. A quick recursive grep in
xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63
lines in 23 files.

I found many state changing callbacks in dd.h which don't seem to be
used. Are they left-overs from earlier Mesa versions or did my grep miss
something?

Regards,
Felix

   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-30 Thread Felix Kühling
On Sat, 30 Nov 2002 16:24:59 +
Keith Whitwell [EMAIL PROTECTED] wrote:

 Felix Kühling wrote:
[snip]
  I found many state changing callbacks in dd.h which don't seem to be
  used. Are they left-overs from earlier Mesa versions or did my grep miss
  something?
 
 Which ones?

Ok, I got the answer to my question ;) There are many calls like this
one: (*ctx->Driver.Enable)( ctx, GL_TEXTURE_GEN_Q, GL_FALSE ); My grep
missed them. Another grep ctx->Driver\..*( finds 162 lines in 34
files. :-/
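
For reference, the difference between the two patterns can be demonstrated on a tiny sample file instead of the Mesa tree (the sample and counts below are made up for the demo, and `grep -E` needs the trailing parenthesis escaped; counts in the real tree will of course differ):

```shell
# Two representative call styles: a direct call and the
# (*ctx->Driver.Fn)(...) dispatch form that the narrow pattern misses.
mkdir -p /tmp/grepdemo && cd /tmp/grepdemo
cat > sample.c <<'EOF'
ctx->Driver.UpdateState( ctx, new_state );
(*ctx->Driver.Enable)( ctx, GL_TEXTURE_GEN_Q, GL_FALSE );
EOF
# The narrow pattern only matches direct calls:
grep -cE 'ctx->Driver\.[[:alnum:]]*[[:space:]]*\(' sample.c    # prints 1
# The looser pattern also catches the dispatch form:
grep -cE 'ctx->Driver\..*\(' sample.c                          # prints 2
```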

Felix

   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-30 Thread Felix Kühling
On Sat, 30 Nov 2002 17:20:04 +0100
Felix Kühling [EMAIL PROTECTED] wrote:

 On Fri, 29 Nov 2002 07:55:52 -0700
 Brian Paul [EMAIL PROTECTED] wrote:
 
 [snip]
  Implementing a true threaded pipeline could be very complicated.  State
  changes are the big issue.  If you stall/flush the pipeline for every
  state change you wouldn't gain anything.  The alternative is to associate
  the GL state with each chunk of vertex data as it passes through the
  pipeline AND reconfigure the pipeline in the midst of state changes.
  Again, I think this would be very complicated.
 
 I see the problem. On many state changes, a corresponding driver
 function is called. In a parallel pipeline implementation, if there is
 still vertex data (with associated state) pending in the pipeline, it
 will be rendered by the driver with the wrong state. A proper solution
 would be to call the state-changing driver functions (or only
 UpdateState?) from within the pipeline, just before a driver stage is
 run. The required amount of modifications to Mesa's driver state
 management seems not too big. A quick recursive grep in
 xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63
 lines in 23 files.

Vertex formats are the real problem. Driver functions are called
directly from the GL application and modify the driver state bypassing
the pipeline. Would it be worth sacrificing for a multi-threaded
pipeline on an SMP system? In the radeon driver vtxfmt seems to be only
enabled with TCL. But I remember some mails about vertex formats and the
mach64 driver. Are vertex formats widely used with non-TCL drivers?

Felix

   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-30 Thread José Fonseca
On Sat, Nov 30, 2002 at 08:04:55PM +0100, Felix Kühling wrote:
 On Sat, 30 Nov 2002 17:20:04 +0100
 Felix Kühling [EMAIL PROTECTED] wrote:
  I see the problem. On many state changes, a corresponding driver
  function is called. In a parallel pipeline implementation, if there is
  still vertex data (with associated state) pending in the pipeline, it
  will be rendered by the driver with the wrong state. A proper solution
  would be to call the state-changing driver functions (or only
  UpdateState?) from within the pipeline, just before a driver stage is
  run. The required amount of modifications to Mesa's driver state
  management seems not too big. A quick recursive grep in
  xc/xc/extras/Mesa for ctx->Driver\.[[:alnum:]]*[[:space:]]*( finds 63
  lines in 23 files.
 
 Vertex formats are the real problem. Driver functions are called
 directly from the GL application and modify the driver state bypassing
 the pipeline. 

Could you please give an example?

 Would it be worth sacrificing for a multi-threaded pipeline on an SMP
 system? In the radeon driver vtxfmt seems to be only enabled with TCL.
 But I remember some mails about vertex formats and the mach64 driver.
 Are vertex formats widely used with non-TCL drivers?

All drivers have a vertex format. One of the last stages of the Mesa
pipeline on a DRI driver is to build vertex buffers, which are nothing
more than a big array of vertices using that vertex format. What happens
is that most drivers use the D3D vertex format. While in newer cards
that format usually corresponds to the card native vertex format, on
older cards (such as Mach64, and 3DFX) it doesn't.

In summary, the vertex format is like a C structure describing how the
card expects to see a vertex when reading system memory via DMA.

Now, returning to the main issue here, I'm not seeing what state changes
happen during the vertex buffer construction... besides the vertex
buffer itself, which would have to be thread-specific.

José Fonseca
__
Do You Yahoo!?
Everything you'll ever need on one web page
from News and Sport to Email and Music Charts
http://uk.my.yahoo.com





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-30 Thread Felix Kühling
On Sat, 30 Nov 2002 19:52:50 +
José Fonseca [EMAIL PROTECTED] wrote:

 On Sat, Nov 30, 2002 at 08:04:55PM +0100, Felix Kühling wrote:
[snip]
  
  Vertex formats are the real problem. Driver functions are called
  directly from the GL application and modify the driver state bypassing
  the pipeline. 
 
 Could you please give an example?

I am referring to the GLvertexformat structure which is defined in dd.h.
It contains driver callbacks which are installed directly into a
glapi_table. This happens in xc/xc/extras/Mesa/src/vtxfmt.c. Reading the
comments in vtxfmt.c and dd.h this really seems to be TCL related.

An example is radeon_Materialfv in radeon_vtxfmt.c. A backtrace from
radeon_Materialfv looks like this:

#0  radeon_Materialfv (face=1032, pname=4609, params=0xb22c)
at radeon_vtxfmt.c:764
#1  0x404b627c in neutral_Materialfv (face=134605728, pname=134605728, 
v=0x805eba0) at ../../../../extras/Mesa/src/vtxfmt_tmp.h:162
#2  0x0804ce29 in draw_stairs_internal (mi=0xb388) at stairs.c:318
#3  0x0804d580 in draw_stairs (mi=0xb388) at stairs.c:555
#4  0x0804fb19 in xlockmore_screenhack (dpy=0x80664d8, window=48234497, 
want_writable_colors=0, want_uniform_colors=0, want_smooth_colors=0, 
want_bright_colors=0, event_mask=0, hack_init=0x804d380 init_stairs, 
hack_draw=0x804d454 draw_stairs, 
hack_reshape=0x804cf9c reshape_stairs, hack_handle_events=0, hack_free=0)
at xlockmore.c:385
#5  0x0804c78e in screenhack (dpy=0x80664d8, window=48234497)
at ../xlockmore.h:154
#6  0x0804e6d0 in main (argc=1, argv=0xb7d4) at ./../screenhack.c:638

There is that neutral_* wrapper layer in between. So there may still be
hope ;-)

  Would it be worth sacrificing for a multi-threaded pipeline on an SMP
  system? In the radeon driver vtxfmt seems to be only enabled with TCL.
  But I remember some mails about vertex formats and the mach64 driver.
  Are vertex formats widely used with non-TCL drivers?
 
 All drivers have a vertex format. One of the last stages of the Mesa
 pipeline on a DRI driver is to build vertex buffers, which are nothing
 more than a big array of vertices using that vertex format. What happens
 is that most drivers use the D3D vertex format. While in newer cards
 that format usually corresponds to the card native vertex format, on
 older cards (such as Mach64, and 3DFX) it doesn't.
 
 In summary, the vertex format is like a C structure describing how the
 card expects to see a vertex when reading system memory via DMA.

Ok. This seems to be something different from what I described above.

 
 Now, returning to the main issue here, I'm not seeing what state changes
 happen during the vertex buffer construction... besides the vertex
 buffer itself, which would have to be thread-specific.
 
 José Fonseca

Felix

   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Felix Kühling
Hi,

I got pretty excited about this idea and spent some more thought on it.
I'm going to share my insights in this mail. Any comments are
appreciated.

First some high level considerations. Having each pipeline stage in a
separate thread would lead to a large number of threads (e.g. 9 in the
case of the radeon driver). Most of the pipeline stages take rather
short time so that the extra overhead of multi threading and
synchronization could have a significant impact. Alternatively one could
use a fixed number N of threads and schedule pipeline stages on them,
the main thread and N-1 free threads. If a free thread is available
the next pipeline stage would be executed on that thread and the OpenGL
client could continue on the main thread without waiting for all
pipeline stages to complete. Note that on a non-SMP system there would
be only the main thread which is equivalent to how the pipeline is
currently executed.

There are some functions that would have to flush the pipeline. I'm
thinking of glXSwapBuffers and glFinish. There might be more (functions
that access the color or z-buffers directly). Extra care has to be taken
if there are several pipeline stages which access the color or z-buffers
like a hardware rendering stage + a software rendering stage for
fallbacks. I'll call them drawing stages. They have to be executed one
at a time and in the correct order. This can be ensured by flushing all
higher pipeline stages before running a drawing stage. I'm not sure if
textures would require any special attention. I'd have to read a bit
more about that.

If several pipeline stages are running at the same time for different
vertex buffers, they have to work with separate copies of the GLcontext
reflecting the state at the time the buffer was fed into the pipeline.
It is enough to copy a context only once, when it moves from the main
thread to a free thread.

I think this task is not trivial but feasible. It would allow SMP
systems to benefit from the inherent parallelism in the rendering
process.

Regards,
   Felix

On Thu, 28 Nov 2002 20:59:08 +0100
Felix Kühling [EMAIL PROTECTED] wrote:

 Hi,
 
 I came across the mesa pipeline a couple of times now, reading the mesa
 sources. It struck me that all the pipeline stages are actually executed
 sequentially. Does anyone who is more involved think it would be
 worthwhile to process several vertex buffers at a time in parallel
 pipeline stages by having each stage in its own thread? I realize that
 only drivers/cards without hardware TCL would benefit from this (or
 fallback cases where TCL doesn't work).
 
 Thoughts? In case this is considered useful I'm volunteering to make
 this a small project of mine.


   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Ian Molton
On Fri, 29 Nov 2002 10:19:52 +0100
Felix Kühling [EMAIL PROTECTED] wrote:

 
 I think this task is not trivial but feasible. It would allow SMP
 systems to benefit from the inherent parallelism in the rendering
 process.

Sounds pretty sweet.

now make a patch ;-)





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread José Fonseca
On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:
 Hi,
 
 I got pretty excited about this idea and spent some more thought on it.
 I'm going to share my insights in this mail. Any comments are
 appreciated.
 
 First some high level considerations. Having each pipeline stage in a
 separate thread would lead to a large number of threads (e.g. 9 in the
 case of the radeon driver). Most of the pipeline stages take rather
 short time so that the extra overhead of multi threading and
 synchronization could have a significant impact. Alternatively one could
 use a fixed number N of threads and schedule pipeline stages on them,
 the main thread and N-1 free threads. If a free thread is available
 the next pipeline stage would be executed on that thread and the OpenGL
 client could continue on the main thread without waiting for all
 pipeline stages to complete. Note that on a non-SMP system there would
 be only the main thread which is equivalent to how the pipeline is
 currently executed.

I think that one thing that must be considered is whether the parallelism
should be in the pipeline stages or in the pipeline data, i.e., if we
should partition a stage for each thread (as you're suggesting), or if
there is enough data running through the pipeline to make it worthwhile to
have a whole pipeline for each thread, each processing a part of the
primitives.

The latter is the approach taken e.g., in chromium
http://chromium.sourceforge.net , but actually I don't know if for any
application besides scientific visualization a pipeline handles so many
primitives at a time. For applications such as games, state changes (like
texture changes) seem to happen too often for that.

 There are some functions that would have to flush the pipeline. I'm
 thinking of glXSwapBuffers and glFinish. There might be more (functions
 that access the color or z-buffers directly). Extra care has to be taken
 if there are several pipeline stages which access the color or z-buffers
 like a hardware rendering stage + a software rendering stage for
 fallbacks. I'll call them drawing stages. They have to be executed one
 at a time and in the correct order. This can be ensured by flushing all
 higher pipeline stages before running a drawing stage. I'm not sure if
 textures would require any special attention. I'd have to read a bit
 more about that.
 
 If several pipeline stages are running at the same time for different
 vertex buffers, they have to work with separate copies of the GLcontext
 reflecting the state at the time the buffer was fed into the pipeline.
 It is enough to copy a context only once, when it moves from the main
 thread to a free thread.

All assumptions have to be very well verified against all existing Mesa
drivers, otherwise a discrete hack can cause havoc...

 
 I think this task is not trivial but feasible. It would allow SMP
 systems to benefit from the inherent parallelism in the rendering
 process.

It surely seems interesting! ;-)

José Fonseca





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Felix Kühling
On Fri, 29 Nov 2002 11:15:19 +
José Fonseca [EMAIL PROTECTED] wrote:

 On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:
  Hi,
  
  I got pretty excited about this idea and spent some more thought on it.
  I'm going to share my insights in this mail. Any comments are
  appreciated.
  
  First some high level considerations. Having each pipeline stage in a
  separate thread would lead to a large number of threads (e.g. 9 in the
  case of the radeon driver). Most of the pipeline stages take rather
  short time so that the extra overhead of multi threading and
  synchronization could have a significant impact. Alternatively one could
  use a fixed number N of threads and schedule pipeline stages on them,
  the main thread and N-1 free threads. If a free thread is available
  the next pipeline stage would be executed on that thread and the OpenGL
  client could continue on the main thread without waiting for all
  pipeline stages to complete. Note that on a non-SMP system there would
  be only the main thread which is equivalent to how the pipeline is
  currently executed.
 
 I think that one thing that must be considered is whether the parallelism
 should be in the pipeline stages or in the pipeline data, i.e., if we

I am not sure I understand the difference. The idea of a pipeline is
that you split the tasks performed on data into several stages. Mesa
does this part already. Then while one package is in stage 1 another one
can be processed in stage 2 at the same time. So I think I have
parallelism both in pipeline data and the stages.

 should partition a stage for each thread (as you're suggesting), or if
 there is enough data running through the pipeline to make it worthwhile to
 have a whole pipeline for each thread, each processing a part of the
 primitives.

Having one thread for each pipeline stage was my first idea. The
alternative approach I tried to explain could be implemented in a way
that, once data is processed by a free thread, it runs through all
remaining pipeline stages in that same thread. Only if no free thread
is available does processing start on the main thread.

Still semaphores have to be used to synchronize the threads (including
the main thread) so that data packets cannot overtake each other. In the
end, the drawing has to occur in the same order as data was fed into the
pipeline.

 The latter is the approach taken e.g., in chromium
 http://chromium.sourceforge.net , but actually I don't know if for any
 application besides scientific visualization a pipeline handles so many
 primitives at a time. For applications such as games, state changes (like
 texture changes) seem to happen too often for that.

I just had a look at their web page. They take a very different approach
to parallelizing the rendering task. They are targeting clusters,
not SMP systems.

  There are some functions that would have to flush the pipeline. I'm
  thinking of glXSwapBuffers and glFinish. There might be more (functions
  that access the color or z-buffers directly). Extra care has to be taken
  if there are several pipeline stages which access the color or z-buffers
  like a hardware rendering stage + a software rendering stage for
  fallbacks. I'll call them drawing stages. They have to be executed one
  at a time and in the correct order. This can be ensured by flushing all
  higher pipeline stages before running a drawing stage. I'm not sure if
  textures would require any special attention. I'd have to read a bit
  more about that.
  
  If several pipeline stages are running at the same time for different
  vertex buffers, they have to work with separate copies of the GLcontext
  reflecting the state at the time the buffer was fed into the pipeline.
  It is enough to copy a context only once, when it moves from the main
  thread to a free thread.
 
 All assumptions have to be very well verified against all existing Mesa
 drivers, otherwise a discrete hack can cause havoc...

All the hardware specific stages are drawing stages. So only one of them
will be executed at a time. I don't see any problem here. One tricky
part could be to find out, how much of the context actually has to be
copied. Obviously, all data that is modified by the pipeline stages
needs to be copied. Everything that is read only can be shared by all
context copies.

 
  
  I think this task is not trivial but feasible. It would allow SMP
  systems to benefit from the inherent parallelism in the rendering
  process.
 
 It surely seems interesting! ;-)
 
 José Fonseca

Felix

   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!



Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Felix Kühling
On Fri, 29 Nov 2002 11:15:19 +
José Fonseca [EMAIL PROTECTED] wrote:

 On Fri, Nov 29, 2002 at 10:19:52AM +0100, Felix Kühling wrote:
[snip]

 I think that one thing that must be considered is whether the parallelism
 should be in the pipeline stages or in the pipeline data, i.e., if we
 should partition a stage for each thread (as you're suggesting), or if
 there is enough data running through the pipeline to make it worthwhile to
 have a whole pipeline for each thread, each processing a part of the
 primitives.

I just came back from lunch with my brain back online ;-). Now I
understand what you're saying. I think my approach is much simpler to
implement. I don't have to worry about how to compose the final image
from images generated by subsets of the vertex data. And even before
that there would be more problems if we wanted to render to several
parallel back buffers at the same time using the same hardware.

 The latter is the approach taken e.g., in chromium
 http://chromium.sourceforge.net , but actually I don't know if for any
 application besides scientific visualization a pipeline handles so many
 primitives at a time. For applications such as games, state changes (like
 texture changes) seem to happen too often for that.

Agreed. To me it sounds like chromium splits the vertex data on a frame
level. We're talking about much smaller units here.

[snip]

Felix



___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel



Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread José Fonseca
On Fri, Nov 29, 2002 at 01:13:22PM +0100, Felix Kühling wrote:
 On Fri, 29 Nov 2002 11:15:19 +
 José Fonseca [EMAIL PROTECTED] wrote:
  I think that one thing that must be thought about is whether the parallelism
  should be in the pipeline stages or in the pipeline data, i.e., if we
 
 I am not sure I understand the difference. The idea of a pipeline is
 that you split the tasks performed on data into several stages. Mesa
 does this part already. Then while one package is in stage 1 another one
 can be processed in stage 2 at the same time. So I think I have
 parallelism both in pipeline data and the stages.


Let me illustrate with an example. Imagine you have 1000 polygons to
process (i.e., transform, clip, build vertex buffers, and render). If
you have an SMP computer with 4 processors you can make use of
parallelism in at least two ways:

a) Have a processor do the transform of the 1000 polygons, another do
the clipping, ... etc.

b) Have a processor do the transform+clipping+... of 250 polygons, have
another do the same for another 250 polygons, ... etc.

  should partition a stage for each thread (as you're suggesting), or if
  there is enough data running through the pipeline to make it worth
  having a whole pipeline for each thread, each processing a part of the
  primitives.
 
 Having one thread for each pipeline stage was my first idea. The
 alternative approach I tried to explain could be implemented in a way
 that, once data is processed by a free thread, it runs through all
 remaining pipeline stages in that same thread. Only, if no free thread
 is available, processing starts on the main thread.

I understood your second proposal, but you still have one thread doing
a pipeline stage processing all the data (case a) above).

 
 Still semaphores have to be used to synchronize the threads (including
 the main thread) so that data packets cannot overtake each other. In the
 end, the drawing has to occur in the same order as data was fed into the
 pipeline.
 
  The latter is the approach taken e.g., in chromium
  http://chromium.sourceforge.net , but actually I don't know if for any
  application besides scientific visualization a pipeline handles so many
  primitives at a time. For applications such as games, state changes (like
  texture changes) seem to happen too often for that.
 
 I just had a look at their web page. They take a very different approach
 to parallelizing the rendering task. They are targeting clusters,
 not SMP systems.

Why do you dismiss it so quickly? Have you seen
http://www.cs.virginia.edu/~humper/chromium_documentation/threadedapplication.html
?

There is nothing in their approach specific to clusters. Using approach
b) yields much better parallelism than a) because each thread can work
independently of the others, and therefore there are fewer waits, lock
contentions, etc.

Nevertheless, it isn't worthwhile if you're processing 50 polygons at a
time. Still, if you have 50 polygons in blue, 50 in red, and 50 in
texture A, and you run them all separately on different processors, you
will probably still get better parallelism.

I'm not sure which approach will give better speedups, but this has to
be considered. Perhaps it would be a good idea to talk with the Chromium
guys to find out what order of speedup they achieve on SMP systems with
usual applications.

José Fonseca





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Felix Kühling
On Fri, 29 Nov 2002 13:20:45 +
José Fonseca [EMAIL PROTECTED] wrote:

[snip]
 
 Let me illustrate with an example. Imagine you have 1000 polygons to
 process (i.e., transform, clip, build vertex buffers, and render). If
 you have an SMP computer with 4 processors you can make use of
 parallelism in at least two ways:
 
 a) Have a processor do the transform of the 1000 polygons, another do
 the clipping, ... etc.
 
 b) Have a processor do the transform+clipping+... of 250 polygons, have
 another do the same for another 250 polygons, ... etc.

Ok, now I get the point.

   should partition a stage for each thread (as you're suggesting), or if
   there is enough data running through the pipeline to make it worth
   having a whole pipeline for each thread, each processing a part of the
   primitives.
  
  Having one thread for each pipeline stage was my first idea. The
  alternative approach I tried to explain could be implemented in a way
  that, once data is processed by a free thread, it runs through all
  remaining pipeline stages in that same thread. Only, if no free thread
  is available, processing starts on the main thread.
 
 I understood your second proposal, but you still have one thread doing
 a pipeline stage processing all the data (case a) above).

Sorry, my fault. I should have read more carefully. The chromium
approach was new to me. I guess it took me a while to digest.

  Still semaphores have to be used to synchronize the threads (including
  the main thread) so that data packets cannot overtake each other. In the
  end, the drawing has to occur in the same order as data was fed into the
  pipeline.
  
   The latter is the approach taken e.g., in chromium
   http://chromium.sourceforge.net , but actually I don't know if for any
   application besides scientific visualization a pipeline handles so many
   primitives at a time. For applications such as games, state changes (like
   texture changes) seem to happen too often for that.
  
  I just had a look at their web page. They take a very different approach
  to parallelizing the rendering task. They are targeting clusters,
  not SMP systems.
 
 Why do you dismiss it so quickly? Have you seen

I wasn't dismissing it. It's just what they say on their web page:
quote
Chromium is a system for interactive rendering on clusters of workstations.
/quote

 http://www.cs.virginia.edu/~humper/chromium_documentation/threadedapplication.html
 ?

Not yet. I'll read it.

 There is nothing in their approach specific to clusters. Using approach
 b) yields much better parallelism than a) because each thread can work
 independently of the others, and therefore there are fewer waits, lock
 contentions, etc.

If they want it to run on a cluster they cannot use shared memory. Also
they have to optimize their system to not suffer too much from network
latencies. Sure, their system will also work and yield a speedup on a
single SMP system. But my approach is limited to single SMP systems. So
in a way, my approach is more limited, if that's what was bothering you
;-)

 Nevertheless, it isn't worthwhile if you're processing 50 polygons at a
 time. Still, if you have 50 polygons in blue, 50 in red, and 50 in
 texture A, and you run them all separately on different processors, you
 will probably still get better parallelism.

On an SMP system my approach could exploit this. It would render, say,
the blue polygons in a free thread while in the main thread the OpenGL
client continues queuing up the red polygons. When the red polygons are
to be run through the pipeline and the blue ones aren't finished yet,
they would be rendered in the main thread in parallel with the blue
polygons. If you have more threads, you can have more vertex buffers
processed in parallel in this way. If the synchronization is implemented
as unrestrictively as possible, the only reason to wait should be to
force the drawing stages to occur in the correct order.

 I'm not sure which approach will give better speedups, but this has to
 be considered. Perhaps it would be a good idea to talk with the Chromium
 guys to find out what order of speedup they achieve on SMP systems with
 usual applications.

My approach was basically inspired by the fact that there is something
in Mesa that is called a "pipeline". So I thought, why not implement it
like a real pipeline? If we really want to parallelize Mesa, then we
should consider all options. I'm probably biased towards my proposal ;-)

Best regards,
  Felix




Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Brian Paul
Felix Kühling wrote:


My approach was basically inspired by the fact that there is something
in Mesa that is called a "pipeline". So I thought, why not implement it
like a real pipeline? If we really want to parallelize Mesa, then we
should consider all options. I'm probably biased towards my proposal ;-)



A few years ago another group developed pmesa: http://pmesa.sourceforge.net/
You might look at that.

I think someone else brought this up on the Mesa-dev or DRI list earlier
this year.  I have to say I'm skeptical.

A hardware TCL driver like the radeon or r200 won't benefit from this.
In most other cases, I think the overhead of parallelization will result
in very modest speed-ups, if any.

The only situation in which I can see benefit is when applications draw
very long primitive strips (many thousands of vertices).  In that case,
splitting the N vertices into N/P pieces for P processors and transforming
them in parallel would be the best approach.  I think that's what pmesa did.

Implementing a true threaded pipeline could be very complicated.  State
changes are the big issue.  If you stall/flush the pipeline for every
state change you wouldn't gain anything.  The alternative is to associate
the GL state with each chunk of vertex data as it passes through the
pipeline AND reconfigure the pipeline in the midst of state changes.
Again, I think this would be very complicated.


As for Chromium, there are situations in which multiprocessors can be
helpful, but it depends largely on the nature of the application and
the best speed-ups come from parallelizing the application itself so
that multiple threads or instances of the application all issue rendering
commands in parallel to an array of graphics cards.

With modern graphics cards, the bottleneck is often the application itself:
the card starves because the app can't issue rendering commands and vertex
data fast enough.

So, feel free to experiment with this, but realize that you may be disappointed
with the results with typical applications.

-Brian






Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Felix Kühling
On Fri, 29 Nov 2002 07:55:52 -0700
Brian Paul [EMAIL PROTECTED] wrote:

 Felix Kühling wrote:
  
  My approach was basically inspired by the fact that there is something
  in Mesa that is called a "pipeline". So I thought, why not implement it
  like a real pipeline? If we really want to parallelize Mesa, then we
  should consider all options. I'm probably biased towards my proposal ;-)
 
 
 A few years ago another group developed pmesa: http://pmesa.sourceforge.net/
 You might look at that.

I had a look at their page. The latest news item is from April 2000. Looks dead.

 I think someone else brought this up on the Mesa-dev or DRI list earlier
 this year.  I have to say I'm skeptical.

I saw his post. He had a slightly different plan. He didn't want to have
several chunks of vertex data processed in parallel, but rather to
exploit the independence of pipeline stages for parallelization.

 A hardware TCL driver like the radeon or r200 won't benefit from this.

That's not quite true. With what I have in mind, the OpenGL client could
continue while another thread is submitting the vertex data to the TCL
unit. But you're right, drivers without hardware TCL would benefit more
from this.

 In most other cases, I think the overhead of parallelization will result
 in very modest speed-ups, if any.
 
 The only situation in which I can see benefit is when applications draw
 very long primitive strips (many thousands of vertices).  In that case,
 splitting the N vertices into N/P pieces for P processors and transforming
 them in parallel would be the best approach.  I think that's what pmesa did.

That's right.

 Implementing a true threaded pipeline could be very complicated.  State
 changes are the big issue.  If you stall/flush the pipeline for every
 state change you wouldn't gain anything.  The alternative is to associate
 the GL state with each chunk of vertex data as it passes through the
 pipeline AND reconfigure the pipeline in the midst of state changes.
 Again, I think this would be very complicated.

That's what I had in mind.

 As for Chromium, there are situations in which multiprocessors can be
 helpful, but it depends largely on the nature of the application and
 the best speed-ups come from parallelizing the application itself so
 that multiple threads or instances of the application all issue rendering
 commands in parallel to an array of graphics cards.
 
 With modern graphics cards, the bottleneck is often the application itself:
 the card starves because the app can't issue rendering commands and vertex
 data fast enough.

Yeah, I hardly see 100% CPU usage with 3D applications on my Radeon
7500. But the latest Quake3 point release has some multithreading
support. They must have seen some benefit from this.

 So, feel free to experiment with this, but realize that you may be disappointed
 with the results with typical applications.

I think I'll give it a shot.

 
 -Brian
 

Thanks for the pointers,
   Felix






Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Ian Molton
On Fri, 29 Nov 2002 14:54:30 +0100
Felix Kühling [EMAIL PROTECTED] wrote:

 When the red polygons are
 to be run through the pipeline and the blue ones aren't finished yet

What if the red ones are transparent and overlap the blue ones? They
will have to wait, but perhaps not all of them?

How fine-grained can this go without becoming an exercise in futility?





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Ian Molton
On Fri, 29 Nov 2002 13:20:45 +
José Fonseca [EMAIL PROTECTED] wrote:

 
 Nevertheless, it isn't worthwhile if you're processing 50 polygons at a
 time. Still, if you have 50 polygons in blue, 50 in red, and 50 in
 texture A, and you run them all separately on different processors,
 you will probably still get better parallelism.

OOI, how did the (long-dead) pmesa work?





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Ian Molton
On Fri, 29 Nov 2002 16:59:16 +0100
Felix Kühling [EMAIL PROTECTED] wrote:

  A hardware TCL driver like the radeon or r200 won't benefit from
  this.
 
 That's not quite true. With what I have in mind, the OpenGL client
 could continue while another thread is submitting the vertex data to
 the TCL unit. But you're right, drivers without hardware TCL would
 benefit more from this.

What about a card with multiple TCL units? The client could then be
submitting streams in parallel?





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Ian Molton
On Fri, 29 Nov 2002 16:59:16 +0100
Felix Kühling [EMAIL PROTECTED] wrote:

 
 Yeah, I hardly see 100% CPU usage with 3D applications on my Radeon
 7500.

Heh, me neither. Q3 uses bugger all CPU here.

That said, the xmms default GL plugin uses 100% and it's NOT software
rendering.





Re: [Dri-devel] Parallelizing MESA's pipeline?

2002-11-29 Thread Felix Kühling
On Fri, 29 Nov 2002 22:46:08 +
Ian Molton [EMAIL PROTECTED] wrote:

 On Fri, 29 Nov 2002 14:54:30 +0100
 Felix Kühling [EMAIL PROTECTED] wrote:
 
  When the red polygons are
  to be run through the pipeline and the blue ones aren't finished yet
 
 What if the red ones are transparent and overlap the blue ones? they
 will have to wait, but perhaps not all of them?

The actual drawing has to happen sequentially in any case as we have
only one graphics card to render with. But while the blue polygons are
drawn, the red polygons can be transformed and clipped, the vertices
lighted and whatever other stages are there before rendering.

 how fine-grained can this go without becoming an excercise in futility?

Regarding the pipeline stages, I was planning to use the granularity
already defined by the Mesa pipeline stages. If you are looking for an
example see lines 206-236 in
xc/xc/lib/GL/mesa/src/drv/radeon/radeon_context.c.

The size of vertex chunks processed at a time depends on the application
(how often the state changes).

Felix


