On Thursday 10 February 2005 04:06, Hans Kristian Rosbach wrote:
> > This will deliver 2 pixels/clock most of the time, but will stall
> > up to 3 clocks on short spans.  Worst case is to fill the screen
> > with single pixel vertical strips, which will cut the fill rate by
> > 75%.  Typical throughput reduction should be much less, perhaps in
> > the 5-10% range. If there really are four multipliers left over,
> > they could be applied to cutting the span setup stall to a single
> > clock.
>
> What about separate FPGA flashes for speed and correctness?
> This thought is not new, but I think it's been a while since
> it was considered and discussed.
>
> This FPGA will be reprogrammable, that will pretty much guarantee
> that flashable mods will appear (if design is Open Source). But what
> about having two "official" designs that the user can choose and
> trust depending on his needs.
>
> The default one should be profiled for image correctness, and thus
> may sacrifice speed in favor of extra processing for quality.
> (ex: extra precision)
>
> The alternative should be profiled for speed with acceptable quality
> losses. This (I hope) would allow the design to free up transistors
> that can be used for increasing overall speed instead. (ex: extra
> pipelining to allow higher clock)
>
> Since the default is slow but pretty, I think nobody would complain
> about a quality decrease if they themselves choose to change the
> profile to favor speed.
>
> This would allow once-a-week gamers to get that extra punch when
> they try to frag their fellow CAD designers.
>
> Yes it'll be extra work, but optimizing for both profiles in one core
> may be more frustrating and lead to unneeded sacrifices.
>
> Thoughts?

Hi Hans,

Yes, it would be cool to be able to hotrod the card with alternate
logic, but is it worth the development cost of opening up a design
fork, especially with Timothy already the overloaded resource?

Optimizing is always frustrating, yet it seems to work out in the end.  
As it is, it seems we're heading towards a design that delivers both 
respectable throughput and decent accuracy.  I think we just need to 
keep plugging away in that direction.

There's another way to approach the span setup bottleneck: set up a few 
spans in advance and save them in a queue.  This will cost about 100
bytes of distributed RAM per queue element.  The queue can deliver
spans to the rasterizer at full clock speed until it drains, then the 
pipeline stalls as before.  However, this maps well onto the small
triangle case, where each of the two trapezoids is wide at the top or
bottom and narrow at the other end.  The queue fills during wide spans and
drains during narrow ones.  So span setup can be designed to handle 
average load instead of worst case and thus get away with fewer 
multipliers.
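
To make the queue argument concrete, here is a rough sketch of a toy
timing model, not a description of the real pipeline.  It assumes span
setup takes a fixed 4 clocks (my reading of the "stall up to 3 clocks"
figure quoted above), that the rasterizer emits 2 pixels/clock, and that
setup may only start on a span once a queue slot is free.  The names
SETUP_CLOCKS, PIXELS_PER_CLOCK and total_clocks are made up for this
illustration.

```python
from math import ceil

SETUP_CLOCKS = 4      # assumption: inferred from "stall up to 3 clocks"
PIXELS_PER_CLOCK = 2  # peak rasterizer rate from the quoted text


def total_clocks(widths, depth):
    """Clocks needed to draw spans of the given pixel widths.

    depth is the number of set-up spans that can wait in the queue.
    depth=1 models the plain pipeline: setup of the next span overlaps
    rasterization of the current one, but cannot run further ahead.
    """
    start = []       # clock at which the rasterizer begins each span
    setup_done = 0   # clock at which the setup unit last finished
    raster_free = 0  # clock at which the rasterizer becomes idle
    for i, width in enumerate(widths):
        # A slot frees up when the rasterizer takes the span that was
        # queued `depth` entries earlier.
        slot_free = start[i - depth] if i >= depth else 0
        setup_done = max(setup_done, slot_free) + SETUP_CLOCKS
        begin = max(setup_done, raster_free)
        start.append(begin)
        raster_free = begin + ceil(width / PIXELS_PER_CLOCK)
    return raster_free


# A trapezoid that is wide at the top and narrow at the bottom.
trapezoid = [16, 14, 12, 10, 8, 6, 4, 2, 1, 1]
print(total_clocks(trapezoid, depth=1))  # 51
print(total_clocks(trapezoid, depth=4))  # 42

# Worst case: all single-pixel spans.  The queue never gets a chance
# to fill, so depth makes no difference.
print(total_clocks([1] * 10, depth=1))   # 41
print(total_clocks([1] * 10, depth=4))   # 41
```

In this toy model the 4-deep queue reaches the rasterizer-bound minimum
for the trapezoid (38 raster clocks plus the first 4-clock setup), while
the single-pixel case stays setup-bound regardless of depth, which is
the "stalls as before" behavior.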

This pleasant behavior extends to trapezoid setup as well, and we could 
even contemplate bringing triangle setup onto the card.  One thing 
that's been bothering me about delivering trapezoids to the card
instead of triangles is that two trapezoids are quite a lot bulkier than
one triangle, and it's harder to come up with simple compression tricks
like the ones I suggested earlier for small triangles.  So, do only the
expensive reciprocal on the host and multiply out the perspective 
gradients on the card.  It would be nice to be able to handle something 
in the neighborhood of 3 million triangles/sec on PCI and this strategy 
just might get there.
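
For a sense of scale, here is a back-of-envelope sketch of the byte
budget that target implies.  It assumes conventional 32-bit, 33 MHz PCI
and uses the theoretical peak rate; neither assumption comes from the
discussion above, and sustained PCI throughput is lower than peak in
practice, so treat the result as an upper bound.  The names are
invented for this example.

```python
PCI_CLOCK_HZ = 33_333_333             # assumption: 33 MHz PCI
PCI_BUS_BYTES = 4                     # assumption: 32-bit bus
TARGET_TRIANGLES_PER_SEC = 3_000_000  # target mentioned above


def bytes_per_triangle(clock_hz=PCI_CLOCK_HZ,
                       bus_bytes=PCI_BUS_BYTES,
                       triangles_per_sec=TARGET_TRIANGLES_PER_SEC):
    """Upper bound on bytes available per triangle at peak bus rate."""
    peak_bytes_per_sec = clock_hz * bus_bytes
    return peak_bytes_per_sec / triangles_per_sec


print(f"{bytes_per_triangle():.1f} bytes per triangle at peak")
```

Under those assumptions the budget comes out at roughly 44 bytes per
triangle, which is why the bulk of two trapezoids versus one triangle
matters.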

Regards,

Daniel
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
