On Thursday 10 February 2005 04:06, Hans Kristian Rosbach wrote:
> > This will deliver 2 pixels/clock most of the time, but will stall
> > up to 3 clocks on short spans. Worst case is to fill the screen
> > with single pixel vertical strips, which will cut the fill rate by
> > 75%. Typical throughput reduction should be much less, perhaps in
> > the 5-10% range. If there really are four multipliers left over,
> > they could be applied to cutting the span setup stall to a single
> > clock.
>
> What about separate FPGA flashes for speed and correctness?
> This thought is not new, but I think it's been a while since
> it was considered and discussed.
>
> This FPGA will be reprogrammable, that will pretty much guarantee
> that flashable mods will appear (if design is Open Source). But what
> about having two "official" designs that the user can choose and
> trust depending on his needs.
>
> The default one should be profiled for image correctness, and thus
> may sacrifice speed in favor of extra processing for quality.
> (ex: extra precision)
>
> The alternative should be profiled for speed with acceptable quality
> losses. This (i hope) would allow the design to free up transistors
> that can be used for increasing overall speed instead. (ex: extra
> pipelining to allow higher clock)
>
> Since the default is slow but pretty I think nobody would complain
> about quality decrease if they themself choose to change the profile
> to favor speed.
>
> This would allow once-a-week gamers to get that extra punch when
> they try to frag their fellow CAD designers.
>
> Yes it'll be extra work, but optimizing for both profiles in one core
> may be more frustrating and lead to unneeded sacrifices.
>
> Thoughts?
Hi Hans,

Yes, it would be cool to be able to hotrod the card with alternate logic, but is it worth the development cost of opening up a design fork, and with Timothy as the overloaded resource? Optimizing is always frustrating, yet it seems to work out in the end. As it is, it seems we're heading towards a design that delivers both respectable throughput and decent accuracy. I think we just need to keep plugging away in that direction.

There's another way to approach the span setup bottleneck: set up a few spans in advance and save them in a queue. This will cost about 100 bytes of distributed RAM per queue element. The queue can deliver spans to the rasterizer at full clock speed until it drains, then the pipeline stalls as before. However, this maps well onto the small triangle case, where each of the two trapezoids is wide at the top or bottom and narrow on the other side. The queue fills during wide spans and drains during narrow ones. So span setup can be designed to handle the average load instead of the worst case, and thus get away with fewer multipliers.

This pleasant behavior extends to trapezoid setup as well, and we could even contemplate bringing triangle setup onto the card. One thing that's been bothering me about delivering trapezoids to the card instead of triangles is that two trapezoids are quite a lot bulkier than one triangle, and it's harder to come up with simple compression tricks like the ones I suggested earlier for small triangles. So, do only the expensive reciprocal on the host and multiply out the perspective gradients on the card. It would be nice to be able to handle something in the neighborhood of 3 million triangles/sec on PCI, and this strategy just might get there.

Regards,

Daniel
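To make the queue argument concrete, here is a rough Python model of span setup feeding the rasterizer through a FIFO. It is only a sketch under stated assumptions, not the actual design: the 4-clock setup cost is my guess from the "stall up to 3 clocks" figure, the 2 pixels/clock rate is from the earlier message, and the names (`total_clocks`, `queue_depth`, `SPANS`) and the span widths are made up for illustration.

```python
from math import ceil


def total_clocks(span_widths, queue_depth, setup_clocks=4, pixels_per_clock=2):
    """Clocks needed to rasterize all spans when span setup feeds the
    rasterizer through a FIFO holding up to `queue_depth` finished spans.

    Assumed model: setup takes a fixed `setup_clocks` per span, and the
    setup unit holds a finished span (and idles) while the queue is full.
    """
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")

    raster_start = []  # clock at which each span leaves the queue
    setup_free = 0     # clock at which setup can begin the next span
    raster_free = 0    # clock at which the rasterizer becomes idle

    for i, width in enumerate(span_widths):
        ready = setup_free + setup_clocks
        # A full queue delays the push until span i - queue_depth has left.
        if i >= queue_depth:
            ready = max(ready, raster_start[i - queue_depth])
        setup_free = ready

        # The rasterizer waits for the span, or the span waits in the queue.
        start = max(ready, raster_free)
        raster_start.append(start)
        raster_free = start + ceil(width / pixels_per_clock)

    return raster_free


# Hypothetical trapezoid: four wide spans followed by eight narrow ones.
SPANS = [16] * 4 + [2] * 8

for depth in (1, 3, 4):
    print(depth, total_clocks(SPANS, depth))
```

With these assumed numbers the model takes 61 clocks at depth 1 and 49 at depth 4. The 49 is the floor set by setup alone (12 spans at 4 clocks each, plus one clock for the last span), so the queue hides all of the rasterizer-side stalls here. The setup unit keeps working ahead while the wide spans are being drawn, which is the fill-then-drain behavior described above.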
