Hey Raystonn,

Oh my goodness, who fed that troll? ;-)

Let's still assume you haven't had the facts handy
(due to your age, education, place of birth or current location)
to see clearly in all the subjects you are trying to address.

I would be much happier if I felt that you were really working
with at least a few of the concepts and tools you are referring to.
The web is big, and there is not much to hinder you from getting
your hands on them, one after the other, if you really want to.

Maybe I am in error about you, and you might want to enlighten
me as to which successful computer projects on earth are a result of
your bold mind. (Don't exclude computer projects for spacecraft.)

> > That is far from the truth - they have internal pipelining
> > and parallelism.  Their use of silicon can be optimised to balance
> > the performance of just one single algorithm.  You can never do that
> > for a machine that also has to run an OS, word process and run
> > spreadsheets.
> 
> Modern processors have internal pipelining and parallelism as 
> well.  

But to a much lower degree. Please understand that a Pentium 4, with
only a single 64-bit FPU unit to handle a 128-bit SSE operand, is far away
from the performance that a dedicated graphics pipeline with some
four dozen FPUs can deliver. And some of these FPUs even work
in parallel with their neighbours...

> Most of the processing power of today's CPUs go completely unused.  

One more reason why a dedicated graphics chip with about the same
number of transistors as a CPU (that is almost true today) performs
better than the competing CPU by huge factors.

> It is possible to create optimized implementations using 
> Single-Instruction-Multiple-Data (SIMD) instructions of 
> efficient algorithms.

As long as the optimized version improves things by only some 30% to 50%
(or at most some 100%), it will never come close to what graphics
chips do by default.

> > >  Since memory bandwidth is increasing rapidly,...
> >
> > It is?!?  Let's look at the facts:
> >
> > Since 1989, CPU speed has grown by a factor of 70.  Over the same
> > period the memory bus has increased by a factor of maybe 6 or so.
> 
> We have gone from approximately 200MB/s of memory bandwidth 
> (PC66 EDO RAM)
> to over 3.2GB/s (dual 16-bit RDRAM channels) in the last 5 
> years.  We have
> over 16 times the memory bandwidth available today than we 
> did just 5 years
> ago.  

> Available memory bandwidth has been growing more quickly than
> processor clockspeed lately, 

No - you are still in error...

CPU clock rate growth = 70 (from 33 MHz to 2400 MHz)
CPU register width growth = 4 (from 32 to 128 bit)
CPU pipelining speedup = 2 (worst-case assumption)
total CPU performance growth = 70 * 4 * 2 = 560

According to your own example:
total memory bandwidth growth = 16 (taken from your numbers)

Hmm, my 1990 system had some DIL RAMs marked "-70" and "-80",
which gives the latency in nanoseconds. Some cache RAM was already
down to about 20 ns. If you assume DDR is at 2.5 ns today,
I get a factor of 32. Further assuming a factor of 4 in bus width,
we are still at an overall increase of only 64.

Even generously allowing another factor of 2 for bus clocking optimisations,
we are still looking at a speed increase of only 128 for the RAM,
whilst the CPU speed increase was determined above to be 560.

> and I do not foresee an end to this any time soon.

That doesn't advance the argument in any way. Sorry.

> > On the other hand, the graphics card can use heavily pipelined
> > operations to guarantee that the memory bandwidth is 100% utilised
> 
> Overutilised in my opinion.  

Not sure what you want to say with that. Beyond 100% there is nothing but
saturation.

A well-tuned system utilises all units at 100% in its main application
case.
In the less common cases one of the units will run at 100% whilst the other
components idle along at a smaller percentage.

> The amount of overdraw performed by today's
> video cards on modern games and applications is incredible.  

No, it's a sign of bad application design. ;-)
And anyway, a good video card will already eliminate 50% or more of the
supplied data in early calculation stages, especially when a dumb
application just sends everything to the adapter.

> Immediate mode rendering is an inefficient algorithm.

Agreed, but you can only make efficient use of retained display-list
rendering if your scenery has largely constant geometry data. It's a question
of what you are doing, and a question of the application.
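
Just to illustrate (a rough sketch in plain C against a current OpenGL
context; the function and variable names are mine, not from any real
application): static geometry gets compiled once and replayed every frame,
so the CPU hardly touches it again.

#include <GL/gl.h>

/* Sketch: compile static geometry into a display list once,
 * replay it cheaply every frame.  Names are illustrative. */
static GLuint scene_list;

void build_scene_once(void)
{
    scene_list = glGenLists(1);
    glNewList(scene_list, GL_COMPILE);   /* record, don't draw yet */
    glBegin(GL_TRIANGLES);
    glVertex3f(-1.0f, -1.0f, 0.0f);
    glVertex3f( 1.0f, -1.0f, 0.0f);
    glVertex3f( 0.0f,  1.0f, 0.0f);
    glEnd();
    glEndList();
}

void draw_frame(void)
{
    glCallList(scene_list);   /* the geometry already lives driver/card side */
}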

> Video cards tend to have extremely well optimized implementations 
> of this inefficient algorithm.

I don't feel that you know much about OpenGL and its core design principles.
I would recommend reading the "red book" to get into the subject.

You might find out that most of the above is in no way required when
doing OpenGL-based rendering. And you will further see that most of the
cases you suppose to be damaging to performance a) won't occur often
and b) get nicely eliminated by the hardware at a pretty early stage.

The only case not covered is when you insist on writing dumb code.
(That only happens when the basics aren't understood.)

> Modern processors have a considerable amount of parallelism 
> built in.  

It's relative. Two integer ALUs are a lot compared to one, but it's only
a laugh if you compare it to the 128-channel-to-128-channel parallel
pipelined digital switch matrix ICs running at some 5 GHz, which were
introduced last year for later use in communications satellites.
I cannot specify the exact number of integer and logic ALUs in a
graphics chipset, but they are big and they are in heavy parallel use.

> With advanced prefetch and streaming SIMD instructions it is very 
> possible to do these types of operations in a modern processor.  

You are talking about an extension of the CPU instruction set? Note that any
new instruction in current processor architectures makes opcode decoding more
complex.

Hey, sure, you can do anything in software. Even simulate a graphics chip's
operations. But that's not the point. The only question is which is faster.
At the very least it's simpler to say "x=3, y=4, length=5, DRAW_SEMI_TRANSPARENT_LINE"
to a graphics processor and do something else in the meantime, instead of
doing it on your main processor. And if you say it's fast enough for you,
then I can only tell you that it heavily depends on what you want to
do.
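
As a rough illustration (a sketch only, assuming a current OpenGL context;
the function name just mirrors my pseudocode above), the CPU side of
"draw me a semi-transparent line" is a handful of calls, while the per-pixel
blending work happens on the card:

#include <GL/gl.h>

/* Sketch: ask the graphics processor for a semi-transparent line.
 * The per-fragment blend (read framebuffer, mix, write back) runs
 * on the card while the CPU is free to do other work. */
void draw_semi_transparent_line(float x0, float y0, float x1, float y1)
{
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glColor4f(1.0f, 1.0f, 1.0f, 0.5f);   /* 50% alpha */
    glBegin(GL_LINES);
    glVertex2f(x0, y0);
    glVertex2f(x1, y1);
    glEnd();
    glDisable(GL_BLEND);
}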

> It will, however, take another couple of years to be able to render at
> great framerates and high resolutions.

Interesting point of view. In a few years, CPUs will have evolved in several
areas, and graphics processors will have evolved as well. So the gap will
still exist, which will still give you a good reason to use dedicated
graphics hardware for the graphics. You also have to consider that your
requirements for a computer system may have evolved in that time frame as
well: maybe a fully 3D desktop, or stereo 3D on the desktop via shutter
glasses, or high-resolution cinemascope movies from highly compressed MPEG-4
and successor file formats.

> > You only have to look at the gap you are trying to bridge - a
> > modern graphics card is *easily* 100 times faster at rendering
> > sophisticated pixels (with pixel shaders, multiple textures and
> > antialiasing) than the CPU.
> 
> They are limited in what they can do.  

No doubt. But if it fulfills its purpose, why should you care?
Maybe you own a minivan, but you still get the gas for your stove
via a pipeline - and you never think about changing that.

> In order to allow more flexibility they have recently introduced 
> pixel shaders, which basically turns the video card into a mini-CPU.

Nice point - it's a RISC engine, then. It has a pretty compact
command set for doing everything from inside its caches. The main target
of such engines is a higher level of customisation, so far correct.
But it's still dedicated hardware. Forming a curve from a line geometry
plus a formula is nothing new at all; read the "red book" about NURBS
and polylines. The advantage is that only a little data has to be transferred
to the adapter, while the results are wide and fed directly to the render
circuits at the maximum rate the silicon allows. That's much more
than you could ever do over any external bus system.
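
To give an idea of what I mean (a minimal "red book" style sketch, assuming
a current OpenGL context; the array and function names are illustrative):
four control points cross the bus, the adapter expands them into a whole
curve.

#include <GL/gl.h>

/* Sketch: an OpenGL 1-D evaluator.  Four control points go over the
 * bus, the implementation expands them into a 30-segment Bezier curve. */
static const GLfloat ctrl[4][3] = {
    { -4.0f, -4.0f, 0.0f },
    { -2.0f,  4.0f, 0.0f },
    {  2.0f, -4.0f, 0.0f },
    {  4.0f,  4.0f, 0.0f }
};

void draw_curve(void)
{
    glMap1f(GL_MAP1_VERTEX_3, 0.0f, 1.0f, 3, 4, &ctrl[0][0]);
    glEnable(GL_MAP1_VERTEX_3);
    glMapGrid1f(30, 0.0f, 1.0f);     /* 30 segments from 4 points */
    glEvalMesh1(GL_LINE, 0, 30);
}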

> Modern processors can perform these features more quickly and would 
> allow an order of magnitude more flexibility in what can be done.

I doubt the "more quickly" rather strongly. And as soon as they start
storing their results back into main memory, all the gain may be lost.
A typical OpenGL setup might have some 5 kB to 20 kB of important state
that can all impact the results of a geometry calculation and the rendering.
You cannot keep all these parameters in CPU registers, and the much bigger
program of the corresponding software renderer won't fit your fastest code
caches.

> Kyro-based video cards perform quite well.

Hmm, is SGI's Fahrenheit API still alive?

I think only scene-graph-based applications have a chance of bringing the
Kyro a significant performance boost. But that means you would have to trash
most of your existing software and buy a big bunch of new software to
satisfy that silicon's needs. Or have I overlooked a path that allows a soft
migration and unlocks the hidden performance already today?

> They are not quite up to the level of nVidia's latest cards but this is 
> new technology and is being worked on by a relatively new company.  

What's new with the Kyro? I think there is not much that is really new about it.

Z-buffer sorting algorithms for reducing the number of finally drawn pixels
aren't new. They have been known since the first "Castle Wolfenstein"
releases, back in the days when my C64 was still powered on for regular use.

Tiled rendering? Nothing new there either. Rendering a bigger scene through
a limited viewport is something George Lucas has used in his labs ever
since hardware rendering came into use. Merging several sub-images into
one big image is nothing that thrills an insider anymore.

> These cards do not require nearly as much memory bandwidth as 
> immediate-mode renderers, performing 0 overdraw.
> They are more processing intensive rather than being bandwidth intensive.
> I see this as a more efficient algorithm.

So you suddenly favour complex graphics processors instead of CPUs?

Let me put it this way: fogging of the various sorts, alpha blending,
stenciling and whatever else is used to make rendered images look more
realistic all imply that framebuffer reads and writes are performed
(unless you are using that high-priced and nearly dead 3D-memory concept
that E&S/Mitsubishi offered for some time, e.g. for the FireGL 4000).

Conclusion: if you render simple images, then you might have a chance
of gaining a slight performance boost from specific hardware techniques,
but as soon as you turn to more realistic images, you will only gain
performance from well-designed memory interfaces, high-performance
caching, and a chipset that runs at high clock rates with wide parallelism.
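
To make the framebuffer traffic explicit, here is what a single blended
pixel costs in a plain software renderer (a sketch only, 8-bit RGBA assumed,
the function name is mine); the read-modify-write has to happen for every
covered pixel, and fog, stencil and so on only add to it:

#include <stdint.h>

/* Sketch: classic "source over" alpha blend for one RGBA8 pixel.
 * dst = src*alpha + dst*(1 - alpha) -- note the framebuffer READ
 * as well as the write. */
void blend_pixel(uint8_t dst[4], const uint8_t src[4])
{
    unsigned a = src[3];
    for (int i = 0; i < 3; i++)
        dst[i] = (uint8_t)((src[i] * a + dst[i] * (255 - a)) / 255);
}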

> Everything starts out in hardware and eventually moves to software.  

Things only move to software if the CPU gets fast enough and the
respective hardware is too costly. Think of the DVD accelerators
that made DVD playback possible on a P-166. They were replaced
by software players once the P-500 with MMX popped up. But there
is still DCT/iDCT in current graphics processors, because it makes
much more sense there and doesn't cost $100 in total but only some
$0.002 per graphics unit. While some external source pumps the
data in one format to the graphics chip, the operation is applied
immediately and the data is written in the result format to the
framebuffer. Compare this to all the data streams you would cause
if a CPU were to do the same...
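
For reference, this is the kind of arithmetic the hardware hides (a naive,
unnormalised sketch of the 1-D DCT-II, the building block of the 8x8 iDCT
used in MPEG/JPEG; real decoders use fast fixed-point versions, and the
function name is purely illustrative):

#include <math.h>

/* Sketch: naive 8-point DCT-II.  A software decoder repeats this,
 * plus the inverse, for every 8x8 block of every frame. */
void dct8(const double in[8], double out[8])
{
    for (int k = 0; k < 8; k++) {
        double sum = 0.0;
        for (int n = 0; n < 8; n++)
            sum += in[n] * cos(M_PI / 8.0 * (n + 0.5) * k);
        out[k] = sum;
    }
}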

> There will come a time when the basic functionality provided by 
> video cards can be easily done by a main processor.  

Ah, you agree that a CPU needs complex coding to do such
easy stuff. Maybe that's because the main purpose of a CPU is not
what you want it to do. Even if it could do it as easily as the graphics
chipset, the graphics chipset would still do it faster because of its
strategically optimal location between the main memory and the framebuffer
memory. And furthermore it is a pretty nice and attractive sort of
multiprocessing, which benefits overall system availability.

> The extra features offered by the video cards, such as pixel shaders, 
> are simply attempts to stand-in as a main processor.

Typical consumer devices have the _smallest_ processor as their central
circuit.
These central processors only have to query the status of the other devices,
check the keyboard and initiate activity.

I don't think a graphics chipset will take over the main CPU's duties.
Don't worry. ;-)
For general tasks: the CPU.
For graphics: the graphics core.
At least so far there has been no graphics core that had its own OS or
"animated" a whole productive environment just because it had some sort of
ROM attached.

> Once the basic functionality of the video card can be performed
> by the main system procsesor, there will really be no need for extra
> hardware to perform these tasks.

Are you again thinking of something like UMA? (See my other mail.)

Try urging a Pentium 4+X to supply a totally uninterrupted data stream
to a digital-to-analog converter (DAC) with nanosecond precision
in order to keep an image on some screen (or multiple screens).

If you now say the screen is smart and only needs part-time updates,
then I will tell you that in this case the screen includes the
graphics chipset in some form. This has been seen for decades, when
PostScript-capable monitors (like printers) were available for some old
Unix workstations. If you know the GLX remote rendering protocol, it is
something like this.

> What I see now is a move by the video card companies to software-based 
> solutions (pixel shaders, etc.). They have recognized that there are 
> limitations to what specialized hardware can do and they are now
> attempting to allow programmers more flexibility.

They provide another feature that plugs into the high-speed graphics chipset
and performs much better there than it would if it were programmed
remotely.

> However, this is the kind of functionality where the main system 
> processor has a huge advantage.

The goal is not to program just anything via the graphics chip,
but only to allow several nice things of repeated use
while providing the best possible performance. ;-)

See this:
CPU --- programmable pipelined graphics chipset --- pipelined graphics chipset

The left side provides the most programming freedom, at unbounded complexity.
The right side provides the most performance, at a fixed complexity of
programming.
The middle provides the performance of the right combined with
a programming freedom that meets almost 90% of existing desires.

So you now get 9 out of 10 of your previous CPU tasks solved at maximum
speed.
Only for 1 in 10 cases do you still have to stick with slower CPU
operations.

> If more features are added in this manner (as software) then the
> specialized video card hardware will lose its edge.  

As long as graphics chips evolve as quickly as CPUs, there is
no reason to predict that the power ratio between the two
will ever change. Therefore I see no reason why one would make the other
obsolete.

> Intel is capable of pushing microprocessor technology more quickly 
> than nVidia or ATI, regardless of how much nVidia wants their technology 
> to be at the center of the chipset.

Intel bought a graphics vendor ages ago. As far as I remember it was
the Lockheed Martin subsidiary Real3D that was bought, and it stood behind
the first Intel x86-PC graphics ever, the i740. There were successors like
the i815 and i830 mainboard chipsets with integrated graphics, partially with
audio and modem support. But those were neither highly impressive nor
seen in a wide range of computers, only in all-in-one office desktops.
(ATI and nVidia are doing embedded chipsets as well, so there is not much
more to say.)

Looking back, the i740 was mainly made to promote the Intel-owned AGP bus.

Yes, Intel is in the 3D graphics business, but there is no sign that they
want to change anything dramatically about their current level of presence.
If you know more, just tell me!

> What would you call MMX, SSE, SSE2, and even 3dnow? These are additional
> instructions designed to optimize the use of these new transistors.

They are instruction set extensions that you must take special care
and effort to make use of. Their use is limited to specific
sequences in the graphics code. I think only some 20% of those opcodes
might come into play if you run e.g. Mesa software rendering.

Jose recently did a fix for such code, and if you have seen the
patches he sent to the list, you will know it was neither a simple
nor a quick task to fix it in the assembler statements.

> > instructions is already limited by the brain power of compiler
> > writers.
> 
> Since when can you write a pixel shading routine in a standard C/C++
> compiler?

Since I noticed that handwritten assembler most often results in slower code
than C code that the compiler was allowed to optimise automatically.
Get your hands on VTune and become aware of all those 99 assembler tricks
that the compiler solves better than your brain does.

BTW, I think the amount of C++ code in all the OpenGL implementations out
there is below 3%. In the end it is simply overhead that slows the code down
significantly.

> Assembly language can be used for the main processor just as easily 
> as it can be used for pixel shaders using nVidia's own assembly language.

But it will result in much faster execution on the dedicated graphics RISC
engine.
So why code for the CPU in assembler, when it executes about equally fast
written in C, or much faster when done as microcode for the graphics chipset?

> In fact, there is a great deal more support for assembly language
> on the main processor.

Ouch. I would hand-code binary opcodes at double the coding time if I got
a significant performance increase from the hardware for an indefinite
runtime.

> Modern processors have multiple parallel units for both 
> integer and FPU operations.  

Quite nice, but if I have data of only one sort, then it's of no use.
That's the typical case once you have decided on one of MMX, SSE,
16-bit math, ...
And interleaving such data processing within a single piece of code won't
work. Hyperthreading might utilise both units if there are two different
code paths, neither of which consumes the full bandwidth to memory - and
only in theory.

> Increasing processor performance is much more complex than a simple die
> shrink.

And therefore graphics performance is likely to grow much faster than CPU
performance.
A graphics core is more straightforward, but that was already said in this
thread.

> Fill rate is just memory bandwidth.  It is not hard to offer more memory
> channels.  In fact, a dual-channel DDR chipset is coming soon for the
> Pentium 4.  In May the Pentium 4 will have access to 4.3GB/s of memory
> bandwidth.  Future generations will offer considerably more.

Don't assume that graphics chipset development has reached a standstill
today. Things are evolving everywhere for memories and buses, so it is not
even certain which component gains that sort of advance first and which next.

> The Intel C/C++ compiler generates MMX, SSE, and SSE2 instructions if you
> tell it to do so.  It requires no inline assembly, though inline assembly
> is always a good idea.  SSE and SSE2 are used in nVidia's drivers...

You have never heard of compiler "intrinsics", as introduced by Intel some 3
years ago?
You have never had a look into supercomputing compiler optimisation issues?
The tools are there to do it at an abstraction level, but as soon as
dedicated hardware is in place for your tasks, it can easily outdo your CPU.
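
For reference, this is the sort of thing intrinsics buy you (a sketch using
the SSE intrinsics from <xmmintrin.h>; the function name is mine, and I
assume the length is a multiple of 4 and the pointers are 16-byte aligned):

#include <xmmintrin.h>

/* Sketch: scale an array of floats by a constant, four at a time.
 * Written with intrinsics, so the compiler emits SSE without inline asm. */
void scale_sse(float *dst, const float *src, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);
        _mm_store_ps(dst + i, _mm_mul_ps(v, vs));
    }
}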

> > CONCLUSION.
> > ~~~~~~~~~~~
> > There is no sign whatever that CPU's are "catching up" with
> > graphics cards - and no logical reason why they ever will.
> 
> I will have to disagree here.  Indications are that the video card
> manufacturers are looking more and more into 'programmable' 
> features such as the pixel shaders.

Yes? And it boosts performance far - especially far beyond the level
a CPU of the same complexity can ever provide. And there are no signs
that the complexity of a CPU and a graphics chipset will drift apart widely.

> If this is the case it would be relatively easy for the
> main processor to 'catch up'.  Programmability is its specialty.

Programmability is the very thing that keeps it from reaching ultimate
performance levels.
Ultimate programmability is nice, but never intended by the graphics
vendors. Catching up with general RISC designs (like the Motorola M68k)
took Intel some 20 years, and the DEC Alpha series still outdoes
the current Intel designs. Trying to catch up with dedicated on-die
RISC engines acting as the front gate to graphics processors is a no-win
goal. Intel would be crazy to put their engineering power into such
a subject, and I don't think AMD will either. (At least they are still
doing nice instruction set speedups for their CPUs today.)

> At any rate, we will probably just have to agree to disagree here. ;)
> 
> -Raystonn

I strongly disagree with your overall opinion.
I just find myself reasoning through several points where your
insights were possibly too far away from the facts.

So there it is.
Regards, Alex.

