On Mon, Mar 20, 2017 at 3:52 PM, Greg Ewing <greg.ew...@canterbury.ac.nz>
 wrote:

> Ian Mallett wrote:
>
>> Per-pixel drawing operations, if they must be individually managed by the
>> CPU, will always be much faster to do on the CPU. This means things like
>> Surface.set_at(...) and likely draw.circle(...) as well as potentially
>> things like draw.rect(...) and draw.line(...).
>>
>
> This is undoubtedly true in the case of drawing one pixel at a time,
> but are you sure it's true for lines and rectangles?


On Mon, Mar 20, 2017 at 12:25 PM, Leif Theden <leif.the...@gmail.com> wrote:

> Good points Ian, but I don't see why we need to support software drawing
> when OpenGL supports drawing primitives?  Is there a compelling reason
> that drawing lines with the CPU is better than doing it on the GPU?
>
>
Oh yes!

Basically, it's because a GPU runs well when it has a big, parallelizable
workload, and terribly when it doesn't. Small, flexible workloads, such as
you see in a typical indie game or small project, are basically this worst
case. They are small (rendering dozens to hundreds of objects), and they
are dynamic in that the objects change position and shading according to
CPU-hosted logic. Heuristic: if you've got a branch deciding where/whether
to render your object or what color it should be, then the GPU hates it and
you.*
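
To make that concrete, here's a minimal sketch (the sprite attributes are
hypothetical, not a real API) of the kind of branchy per-object logic I
mean; every decision about whether, where, and how to draw happens on the
CPU, per object, per frame:

    import pygame

    def draw_world(screen, sprites, camera):
        for sprite in sprites:
            if not sprite.alive:               # whether to draw: a branch
                continue
            # where to draw: CPU-side logic, different every frame
            pos = sprite.rect.move(-camera.x, -camera.y)
            # how to draw: another branch, changing per object
            if sprite.hp < 10:
                screen.blit(sprite.hurt_image, pos)
            else:
                screen.blit(sprite.image, pos)

This works perfectly well with pygame's CPU blitter; it's also exactly the
shape of program a GPU pipeline punishes.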

If that made sense to you, you can skip this elaboration:
----

The GPU is basically a bunch of workers (thousands, nowadays) sitting in a
room. When you tell the GPU to do something, you tell everyone in the room
to do that same thing. Configuring the GPU to do something else (saliently:
changing the shader) is slow (for technical reasons).

I have a GTX 980 sitting on my desk right now, and it has 2048 thread
processors clocked at 1126 MHz. That's ****ing *insane*. I can throw
millions and millions of triangles at it, and it laughs right back at me
because it's rendering them (essentially) 2048 at a time. The fragments (≈
pixels) generated from those triangles are also rendered 2048 at a time.
This is awesome, but only if you're drawing lots of triangles or shading
lots of pixels in the same way (the same shader).

But I *cannot* change the way I'm drawing those triangles individually. Say
I alternate between a red shader and a blue shader for each of a million
triangles. NVIDIA guidelines tell me I'm at about 3 *seconds per frame*,
not even counting the rendering. This is what I mean by overhead. (To work
around this problem, you *double* the amount of vertex data by sending a
color along with each vertex. That's just more data, and the GPU can handle
it easily. But reconfigure? No good.) And this is in C-like languages. In
Python, you have a huge amount of software overhead for those state
changes, even before you get to the PCIe bus.
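
For what it's worth, a sketch of that per-vertex-color workaround (data
layout only, GL calls omitted, names are mine): instead of switching
shaders per triangle, you interleave an RGB color with every vertex so a
single draw call, under a single shader, covers all of them:

    import numpy as np

    RED  = (1.0, 0.0, 0.0)
    BLUE = (0.0, 0.0, 1.0)

    def build_vertex_buffer(triangles):
        # triangles: list of (three (x, y) positions, is_red flag).
        # Returns interleaved float32 data -- x, y, r, g, b per vertex --
        # ready to upload once and draw in one call.
        verts = []
        for positions, is_red in triangles:
            color = RED if is_red else BLUE
            for (x, y) in positions:
                verts.append((x, y) + color)
        return np.asarray(verts, dtype=np.float32)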

And in a typical pygame project or indie game, this is basically exactly
what we're trying to do. We've got sprites with individual location data
and different ways of being rendered--different textures, different blend
modes, etc. Only a few objects, but decent complexity in how to draw them.
With a bunch of cleverness, you could conceivably write some complex code
to work around this (generate work batches, abstract to an übershader,
etc.), but I doubt you could (or would want to) fully abstract this away
from the user--particularly in such a flexible API as pygame.
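
If anyone is curious, the "work batches" idea is roughly this (a sketch;
texture_id and blend_mode are hypothetical sprite attributes): sort by
render state, then pay each state change once per group rather than once
per sprite:

    from itertools import groupby

    def batches(sprites):
        key = lambda s: (s.texture_id, s.blend_mode)
        for state, group in groupby(sorted(sprites, key=key), key=key):
            # bind `state` once here, then draw the whole group
            yield state, list(group)

It helps, but notice how much bookkeeping it already adds for a handful of
sprites, and it still doesn't hide the state changes from the user.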

The second issue is that the PCIe bus, which is how the CPU talks to the
GPU, is *really slow* compared to the CPU's memory subsystem--both in terms
of latency and bandwidth. My lab computer has ~64 GB/s of DDR4 bandwidth
(my computer at home has quadruple that) at 50-500 ns latency. By contrast,
the PCIe bus tops out at ~2 GB/s at ~20,000 ns latency. My CPU also has
15 MB of L3 cache, while my 980 has no L3 cache and only 2 MB of L2 (because
streaming workloads need less caching and caching is expensive).
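
Back-of-envelope with those figures (my numbers, rounded; a full frame is
the pessimistic case, but small transfers pay the latency far more often):

    frame_bytes = 1920 * 1080 * 4              # one RGBA frame, ~8.3 MB
    pcie_s = frame_bytes / 2e9  + 20_000e-9    # ~4.2 ms over PCIe
    dram_s = frame_bytes / 64e9 + 500e-9       # ~0.13 ms through DDR4
    print(pcie_s * 1e3, dram_s * 1e3)          # milliseconds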

So when you draw something on the CPU, you're drawing with a fast processor
(my machine: 3.5 GHz, wide vectors, long pipe) into very close DRAM at a
really low latency, and it's probably sitting in L3 or a closer cache
anyway. When you draw something on the GPU, you're drawing (slowly (~1 GHz,
narrow vectors, short pipe), but in parallel) into different DRAM, which is
optimized more for streaming and which may or may not be cached at all.
That might be roughly comparable on its own, but you *also* have to wait
for the command to go over the PCIe bus, take any driver sync hit, spool up
the GPU pipeline in the right configuration, and so on. The overhead is
worth it if you're drawing a million triangles, but not if you're calling
Surface.set_at(...).
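
If you want to see the CPU side of that for yourself, something like this
(a rough timing sketch, numbers will vary by machine) runs tens of
thousands of Surface.set_at() calls without ever leaving system RAM:

    import time
    import pygame

    surf = pygame.Surface((256, 256))
    t0 = time.perf_counter()
    for y in range(256):
        for x in range(256):
            surf.set_at((x, y), (x, y, 0))
    dt = time.perf_counter() - t0
    print("65536 set_at calls:", dt * 1000, "ms")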

The point is, GPUs have great parallelism, but you pay for it in latency
and usability. It's a tradeoff, and when you consider all the rest you need
to do on the CPU, it's not always a clear one. But, as a heuristic, lots of
geometry or fillrate or math-intensive shading calls for a GPU. Anything
less calls for a CPU. My argument is that the typical use-case of pygame
falls, *easily*, into the latter.


----

*(Of course, you *can* make this fast at a tremendous programmer cost by
emulating all that logic on the GPU using e.g. compute shaders, which is
what all the cool kids are doing, or amortizing state changes with e.g.
Vulkan's new command lists. But it requires (1) being competent at GPU
architecture and (2) being willing to invest the time. I still use pygame
mainly because of (2).)


> Also, I'm a bit tired of the "python is slow so you may as well make
> everything slow and not expect it to work quickly" attitude.
>
I was worried someone might take it that way; this isn't my point at all.
What I want is for people to remember what's important.

Clearly, one should not aspire to make things slow. I'm just saying that if
a game developer tries to use Python+pygame to write some crazy
graphics-intensive mega-AAA game and it fails, that's really on them for
picking the wrong tool. At least for now--this is what I mean when I say we
need to figure out whether we like our niche.

> A pygame app burns through the CPU not because of the interpreter, but
> because it is flipping bits in ram when a GPU could do it.
>
It's both of these, and more. SDL's core blitting routines are in C and
occasionally vectorized, IIRC, whereas, as I mentioned above, you have to
figure in the cost of command transfers and other overhead when you do
operations on the GPU.

Ian
