The use case is to speed up naieve drawing. With things like pygame zero,
where they update the whole screen no matter what you draw. Where if people
just draw several images to the screen, we should be able to easily track
their changes, and only update what has changed.

Yeah! Batching APIs would be a useful additions indeed. Ones which do
_multi things at once. The tricky bit is to finding a set of those
functions which are useful for many cases. Also extending the API without
adding a million different functions here and there. fill_multi, blit_multi
etc. However, what does blit_multi do? Blit one surface to multiple places?
What about blit blend arguments? What about sub rects? Shouldn't there be a
batch API for blitting a lot of different surfaces to different rects at
once? After all this you can see how the APIs could get complicated and the
number of them large. We have drawline, and drawlines, do we have blit and
blits?

Relatedly, there is also a possibility of compiling the sprite classes in
Cython. This may also give a decent speedup in some cases. As shown by
things like psyco, and pypy speeding them up. They're loops in python after
all. But I wonder if they can be compiled easily, or if they're too dynamic.

Apart from the jit work going on using libjit, it would be useful to do
data aware blits. What I mean by that is by looking at an image, you can
find better ways to draw it. Eg. does the image only contain 6bits worth of
colour? Then an RLE blit is the fastest, and using 32bit alpha transparency
is wrong. Are there large areas of the same colour (which can be more
quickly drawn with fill())? Can we split it into 5 surfaces (four on the
edges, one in the middle) where only the edge surfaces are drawn with the
much slower alpha transparency blitters? For many games, the images don't
change, so this works quite well. However this would perhaps require quite
an API change (not entirely sure of that), or could likely be implemented
inside the sprite classes with some trickery.

Another area is better memory alignment on a platform specific basis can
give considerable gains. On the slower raspberry pi for example, 16 byte,
and 32 byte aligned memory goes much faster. On many platforms this is 4
bytes, or 8 bytes. Gains of around 600MB/s ->1300MB/s have been seen. This
is one area where a libjit based blitter (or one like in ANGLE) can make
significant improvements. This type of change only needs to happen where we
allocate memory for surfaces (or maybe people have done that already).

Finally, hardware support for things like jpeg drawing, and DMA are
possibilities. For example, the raspberrypi has dedicated hardware for jpeg
decoding (as does most hardware these days). This can be used to draw a
jpeg from memory into a buffer, so if you are drawing photos onto the
screen, then this is a very fast way to do it. However, this would require
a more abstract Sprite class which loads files itself.
Sprite('mybigphoto.jpg'). But interestingly (but not unexpectedly) the CPUs
are outperforming the dedicated hardware, if you dedicate the CPUs to the
task. Also for when the images are small. However, the older, and slower
CPUs don't.
https://info-beamer.com/blog/omx-jpeg-decoding-performance-vs-libjpeg-turbo
But either way, keeping the jpg photos in memory allows you to draw a lot
more of them, than if you decode them and use up all of the limited memory
on small platforms like the Orange|Raspberry PI. So we need a
Sprite('photo.jpg') style api for this reason.

I'm interested in if anyone has done any work on dirty rect optimizations
for reducing over draw for now. Because that could be applied to something
like pygame zero with almost no work on their part, and no work on the
users part. For non-worst cases, like where there's < 50 rects, I guess
simply compiling an algorithm like DR0ID has done in C/Cython will work
nicely enough - if the python code isn't fast enough already. Would need
some representative pygamezero examples to check this.




On Mon, Mar 27, 2017 at 3:55 AM, Leif Theden <leif.the...@gmail.com> wrote:

> Just to clarify, this is an optimization to update the display surface?
> TBH, if DR0ID says that it isn't a good optimization, then I'll have to
> follow him.  Thinking deeply about the issue, the would be better payoffs
> by finding way to not flip bits in the display surface in the first place.
> To say it another way, if we can inform developers that they are drawing to
> specific areas in the display surface several times, then it would help
> profile the game better and reduce extraneous draws.  Android has options
> to detect extraneous overdraw, and I'm sure other platforms do as well.
>
> Also, if we are going to promote pygame group rendering, a cheap 15-20%
> performance increase (based on my system) could be had by creating a C
> function that accepts a list of surfaces and their positions, then
> executing the blits in C.  I've brought it up before, but nothing came if
> it.  In fact, I was asked more than once if I had converted my surfaces, so
> maybe people just don't read threads?  ¯\_(ツ)_/¯
>
> On Sun, Mar 26, 2017 at 2:50 PM, René Dudfield <ren...@gmail.com> wrote:
>
>> Because rectangles can overlap, it's possible to reduce the amount
>> drawing done when using them for dirty rect updating. If there's two
>> rectangles overlapping, then we don't need to over draw the overlapping
>> area twice. Normally the dirty rect update algorithm used just makes a
>> bigger rectangle which contains two of the smaller rectangles. But that's
>> not optimal.
>>
>> But, as with all over draw algorithms, can it be done fast enough to be
>> worth it?
>>
>> Here's an article on the topic:
>>     http://gandraxa.com/detect_overlapping_subrectangles.xml
>>
>> jmm0 made some code here:
>>     https://bitbucket.org/jmm0/optimize_dirty_rects/src/c2affd54
>> 50b57fc142c919de10e530e367306222/optimize_dirty_rects.py?at=
>> default&fileviewer=file-view-default
>>
>> DR0ID, also did some with tests and faster code...
>> https://bitbucket.org/dr0id/pyweek-games/src/1a9943ebadc6e21
>> 02db0457d17ca3e6025f6ca60/pyweek19-2014-10/practice/
>> dirtyrects/?at=default
>>
>>
>> So far DR0ID says it's not really fast enough to be worth it. However,
>> there are opportunities to improve it. Perhaps a more optimal algorithm, or
>> one which uses C or Cython.
>>
>> "worst case szenario with 2000 rects it takes ~0.31299996376 seconds"
>> If it's 20x faster in C, then that gets down to 0.016666. Still too slow
>> perhaps.
>>
>>
>> It's not an embarrassingly parallel problem... I think. But I haven't
>> thought on it much at all. Maybe there is a use for multi core here.
>>
>>
>> Anyone done any other work like this? Or know of some good algos for this?
>>
>>
>>
>> cheers,
>>
>
>

Reply via email to