Cool about the blit function. Since the rects and surfaces are held by
reference, you only need to build sprite_image_list once (if you are
blitting the same things every time). Additionally, I think it would likely
be fast enough for another commonly requested case (blitting the same image
to many different spots). It would also work nicely for sprite groups. I
think it's worth some more API discussion before merging it in. We'd also
need some tests and docs.
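
To make the "build the list once" point concrete, here is a minimal sketch.
It assumes a pure-Python stand-in for the proposed blit_list (the real one
is C code in the gist), and RecordingSurface is a hypothetical fake display
that only records calls, so the example runs without pygame. The rects are
plain mutable lists standing in for pygame Rects.

```python
class RecordingSurface:
    """Hypothetical stand-in for a display surface; it only records blit calls."""

    def __init__(self):
        self.calls = []

    def blit(self, image, rect):
        # Copy the position now, as a real blit would use it immediately.
        self.calls.append((image, tuple(rect)))
        return tuple(rect)


def blit_list(dest, pairs):
    """Pure-Python stand-in for the proposed C function: blit every (image, rect) pair."""
    blit = dest.blit  # look the method up once, not once per sprite
    return [blit(image, rect) for image, rect in pairs]


# Build the list once. The rects are mutable and shared by reference.
rect_a, rect_b = [0, 0], [10, 10]
sprite_image_list = [("image_a", rect_a), ("image_b", rect_b)]

screen = RecordingSurface()
blit_list(screen, sprite_image_list)   # frame 1

rect_a[0] += 5                         # move a sprite in place
blit_list(screen, sprite_image_list)   # frame 2 sees the new position

# Same image to many spots: one image paired with many positions.
same_image_many_spots = [("star", [x, 0]) for x in range(0, 30, 10)]
blit_list(screen, same_image_many_spots)
```

This only holds while sprites are moved by mutating their rect in place. If
a sprite gets a brand new rect or image object assigned, the list is stale
and has to be rebuilt.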


A couple of quick points.

   - Hardware is sometimes slower (see the raspberrypi link I posted). That's
   not surprising for platforms like this, where video memory is the same
   speed for CPUs and GPUs, and where there are 4 CPU cores with a modern
   architecture.
   - Memory reads are reduced a lot when you change the bit depth from 4
   bytes per pixel to 1 or 2. On modern computers memory IO is usually the
   bottleneck. Even with hardware acceleration, blitting less memory is
   faster. If you can use the same format as the screen (some Bayer format)
   then it's often only 8-12 bits of data per LCD pixel. YUV style formats are
   very fast, and use less memory, because they are what movies use.
   - Run length encoded blits, even if not hardware accelerated, can be very
   quick because you read less memory.
   - You can't use memcpy for blit blends, format conversions, scaling etc.
   Memory alignment can be done easily by allocating an extra 16 or 32 bytes
   of data and starting the pixel data at the aligned address. If you do this
   for each scan line, even better.
   - The current blit functions to screen can be problematic on some
   platforms with threads, since some platforms don't have thread safe display
   functions. Calling them outside the main thread is not possible (on some
   platforms). There are a few other parts of SDL which aren't exactly thread
   safe. A workaround is to use multiple processes (1 per core is optimal)
   with shared memory. That way there are no threading issues, and it's the
   fastest way to do it anyway. You can create mmap'd Surfaces with
   frombuffer.
   - There are, however, some multi-threaded blit functions floating around
   that use OpenMP, in the pgreloaded branches.
   - The jit blit functions can mostly gain speed by taking advantage of
   platform optimal prefetch instructions and SIMD, but also the CPU optimal
   stride (how much data to process in what sized chunk so it fits in the
   cache best). The best example of this is ANGLE, which has a very advanced
   run time assembler (it has to, to emulate OpenGL and dx in software!). But
   numexpr also does some of these optimizations.
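
The bit depth and alignment points above come down to simple arithmetic. A
rough sketch, with hypothetical helper names (none of these are pygame or
SDL functions) and 16 byte alignment assumed as the default:

```python
def aligned_offset(address, alignment=16):
    """Bytes to skip so that address + offset is a multiple of alignment."""
    return (-address) % alignment


def aligned_pitch(width, bytes_per_pixel, alignment=16):
    """Bytes per scan line, padded so every row starts on an aligned address."""
    row_bytes = width * bytes_per_pixel
    return (row_bytes + alignment - 1) // alignment * alignment


def frame_bytes(width, height, bytes_per_pixel, alignment=16):
    """Memory touched when blitting one full frame at the given pixel size."""
    return aligned_pitch(width, bytes_per_pixel, alignment) * height


# Compare how much memory one 640x480 blit reads at each pixel size.
for bpp in (4, 2, 1):
    print(bpp, "bytes/pixel:", frame_bytes(640, 480, bpp), "bytes per frame")
```

Over-allocating by one alignment unit and skipping aligned_offset() bytes
gives an aligned start; padding each row to aligned_pitch() keeps every scan
line aligned too. Going from 4 bytes per pixel to 1 cuts the memory read per
frame to a quarter.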

Another software improvement would be using the new METH_FASTCALL on Python
3.6+, which can give 15%-30% improvements for cases like calling lots of
functions.
Anyway... I think there are lots of potential improvements for the
raspberrypi. Some of them will benefit both software and hardware renderers.


But I'm still interested in the dirty rect handling, for the naive case.
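
To show the kind of thing I mean, here is a small sketch of removing
overdraw from a list of dirty rects by cutting each new rect against the
ones already kept. It is not jmm0's or DR0ID's code, just a simple O(n^2)
illustration; rects are plain (x, y, w, h) tuples and all function names
are made up for the example.

```python
def subtract(a, b):
    """Return the parts of rect a not covered by rect b, as disjoint rects."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Corners of the intersection of a and b.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    if ix1 >= ix2 or iy1 >= iy2:
        return [a]  # no overlap, a survives whole
    parts = []
    if ay < iy1:        # strip above the overlap, full width of a
        parts.append((ax, ay, aw, iy1 - ay))
    if iy2 < ay + ah:   # strip below the overlap, full width of a
        parts.append((ax, iy2, aw, ay + ah - iy2))
    if ax < ix1:        # piece left of the overlap
        parts.append((ax, iy1, ix1 - ax, iy2 - iy1))
    if ix2 < ax + aw:   # piece right of the overlap
        parts.append((ix2, iy1, ax + aw - ix2, iy2 - iy1))
    return parts


def make_disjoint(rects):
    """Return rects covering the same pixels as the input, with no overlaps."""
    result = []
    for rect in rects:
        pieces = [rect]
        for kept in result:
            pieces = [p for piece in pieces for p in subtract(piece, kept)]
            if not pieces:
                break  # rect is completely covered already
        result.extend(pieces)
    return result


def area(rects):
    return sum(w * h for _, _, w, h in rects)


dirty = [(0, 0, 10, 10), (5, 5, 10, 10)]
print(area(dirty))                 # pixels drawn if both rects are updated as-is
print(area(make_disjoint(dirty)))  # pixels drawn after removing the overlap
print(15 * 15)                     # pixels drawn with one big bounding rect
```

The question is still whether the splitting costs less than the overdraw it
saves, which is what needs measuring on real pygame zero style examples.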


cheers,


On Mon, Mar 27, 2017 at 4:59 PM, Leif Theden <leif.the...@gmail.com> wrote:

> I think a function that accepts a sequence of tuples in the form of
> (dest_surface, dest_position, source_area, blend_mode) would be enough.  It
> is only needed as a power-user optimization and I don't see much value in
> watering it down or splitting it into multiple functions for beginners.  No
> need to overthink and complicate it.  Cython may help sprite rendering in
> Groups, but it would have to be implemented and tested.
>
> Has anyone considered releasing the GIL during blit operations?  Is it
> possible to release the GIL during a blit, then block when needed, in the
> cases of subsequent draw operations?  The optimized case would be to allow
> math operations to execute while the display buffer is being modified,
> while you are iterating over sprites.
>
> Here is my preliminary C code for a "blit_multi" function.  It is called
> "blit_list".  It only supports (dest_surface, dest_position) tuples, but it
> works so far.
> https://gist.github.com/bitcraft/1785face7c5684916cde
>
> Use it like this:
>
> sprite_image_list = [(s.image, s.rect) for s in self.sprites]
> display.blit_list(sprite_image_list)
>
>
> Jitblit sounds interesting, but how is that different from just using
> Surface.convert() with the correct flags?  I may be mistaken here, but on
> all modern computers, using 32-bit surfaces can be optimized into a single
> memcpy call...bytealigned and whatnot.  No need to iterate over pixel data
> (RLE).  If that's the case, then it doesn't make sense to support legacy
> 8/16/24-bit image formats, and validates pygame_sdl2's position not to
> support them.
>
> I don't see a way to combat overdraw, except by informing users that it is
> happening.
>
> Anyway, all of this chatter is moot, if pygame moves to a hardware sprites
> soon.
>
> On Mon, Mar 27, 2017 at 6:18 AM, René Dudfield <ren...@gmail.com> wrote:
>
>> The use case is to speed up naive drawing. With things like pygame zero,
>> where they update the whole screen no matter what you draw. Where if people
>> just draw several images to the screen, we should be able to easily track
>> their changes, and only update what has changed.
>>
>> Yeah! Batching APIs would be useful additions indeed. Ones which do
>> _multi things at once. The tricky bit is finding a set of those
>> functions which are useful for many cases. Also extending the API without
>> adding a million different functions here and there. fill_multi, blit_multi
>> etc. However, what does blit_multi do? Blit one surface to multiple places?
>> What about blit blend arguments? What about sub rects? Shouldn't there be a
>> batch API for blitting a lot of different surfaces to different rects at
>> once? After all this you can see how the APIs could get complicated and the
>> number of them large. We have drawline, and drawlines, do we have blit and
>> blits?
>>
>> Relatedly, there is also a possibility of compiling the sprite classes in
>> Cython. This may also give a decent speedup in some cases. As shown by
>> things like psyco, and pypy speeding them up. They're loops in python after
>> all. But I wonder if they can be compiled easily, or if they're too dynamic.
>>
>> Apart from the jit work going on using libjit, it would be useful to do
>> data aware blits. What I mean by that is by looking at an image, you can
>> find better ways to draw it. Eg. does the image only contain 6bits worth of
>> colour? Then an RLE blit is the fastest, and using 32bit alpha transparency
>> is wrong. Are there large areas of the same colour (which can be more
>> quickly drawn with fill())? Can we split it into 5 surfaces (four on the
>> edges, one in the middle) where only the edge surfaces are drawn with the
>> much slower alpha transparency blitters? For many games, the images don't
>> change, so this works quite well. However this would perhaps require quite
>> an API change (not entirely sure of that), or could likely be implemented
>> inside the sprite classes with some trickery.
>>
>> Another area is memory alignment, where tuning on a platform specific
>> basis can give considerable gains. On the slower raspberry pi for example,
>> 16 byte and 32 byte aligned memory goes much faster. On many platforms this
>> is 4 bytes, or 8 bytes. Gains of around 600MB/s ->1300MB/s have been seen. This
>> is one area where a libjit based blitter (or one like in ANGLE) can make
>> significant improvements. This type of change only needs to happen where we
>> allocate memory for surfaces (or maybe people have done that already).
>>
>> Finally, hardware support for things like jpeg drawing, and DMA are
>> possibilities. For example, the raspberrypi has dedicated hardware for jpeg
>> decoding (as does most hardware these days). This can be used to draw a
>> jpeg from memory into a buffer, so if you are drawing photos onto the
>> screen, then this is a very fast way to do it. However, this would require
>> a more abstract Sprite class which loads files itself.
>> Sprite('mybigphoto.jpg'). Interestingly (but not unexpectedly), the CPUs
>> outperform the dedicated hardware if you dedicate the CPUs to the task,
>> and also when the images are small. The older, slower CPUs don't, however:
>> https://info-beamer.com/blog/omx-jpeg-decoding-performance-vs-libjpeg-turbo
>> Either way, keeping the jpg photos in memory allows you to draw a lot more
>> of them than if you decode them and use up all of the limited memory on
>> small platforms like the Orange|Raspberry PI. So we need a
>> Sprite('photo.jpg') style API for this reason.
>>
>> I'm interested in if anyone has done any work on dirty rect optimizations
>> for reducing over draw for now. Because that could be applied to something
>> like pygame zero with almost no work on their part, and no work on the
>> users part. For non-worst cases, like where there's < 50 rects, I guess
>> simply compiling an algorithm like DR0ID has done in C/Cython will work
>> nicely enough - if the python code isn't fast enough already. Would need
>> some representative pygamezero examples to check this.
>>
>>
>>
>>
>> On Mon, Mar 27, 2017 at 3:55 AM, Leif Theden <leif.the...@gmail.com>
>> wrote:
>>
>>> Just to clarify, this is an optimization to update the display surface?
>>> TBH, if DR0ID says that it isn't a good optimization, then I'll have to
>>> follow him.  Thinking deeply about the issue, there would be better payoffs
>>> by finding a way to not flip bits in the display surface in the first place.
>>> To say it another way, if we can inform developers that they are drawing to
>>> specific areas in the display surface several times, then it would help
>>> profile the game better and reduce extraneous draws.  Android has options
>>> to detect extraneous overdraw, and I'm sure other platforms do as well.
>>>
>>> Also, if we are going to promote pygame group rendering, a cheap 15-20%
>>> performance increase (based on my system) could be had by creating a C
>>> function that accepts a list of surfaces and their positions, then
>>> executing the blits in C.  I've brought it up before, but nothing came of
>>> it.  In fact, I was asked more than once if I had converted my surfaces, so
>>> maybe people just don't read threads?  ¯\_(ツ)_/¯
>>>
>>> On Sun, Mar 26, 2017 at 2:50 PM, René Dudfield <ren...@gmail.com> wrote:
>>>
>>>> Because rectangles can overlap, it's possible to reduce the amount of
>>>> drawing done when using them for dirty rect updating. If there are two
>>>> rectangles overlapping, then we don't need to draw the overlapping
>>>> area twice. Normally the dirty rect update algorithm used just makes a
>>>> bigger rectangle which contains two of the smaller rectangles. But that's
>>>> not optimal.
>>>>
>>>> But, as with all over draw algorithms, can it be done fast enough to be
>>>> worth it?
>>>>
>>>> Here's an article on the topic:
>>>>     http://gandraxa.com/detect_overlapping_subrectangles.xml
>>>>
>>>> jmm0 made some code here:
>>>>     https://bitbucket.org/jmm0/optimize_dirty_rects/src/c2affd5450b57fc142c919de10e530e367306222/optimize_dirty_rects.py?at=default&fileviewer=file-view-default
>>>>
>>>> DR0ID also did some, with tests and faster code...
>>>> https://bitbucket.org/dr0id/pyweek-games/src/1a9943ebadc6e2102db0457d17ca3e6025f6ca60/pyweek19-2014-10/practice/dirtyrects/?at=default
>>>>
>>>>
>>>> So far DR0ID says it's not really fast enough to be worth it. However,
>>>> there are opportunities to improve it. Perhaps a more optimal algorithm, or
>>>> one which uses C or Cython.
>>>>
>>>> "worst case szenario with 2000 rects it takes ~0.31299996376 seconds"
>>>> If it's 20x faster in C, then that gets down to about 0.0156 seconds,
>>>> which is nearly a whole frame at 60fps. Still too slow perhaps.
>>>>
>>>>
>>>> It's not an embarrassingly parallel problem... I think. But I haven't
>>>> thought on it much at all. Maybe there is a use for multi core here.
>>>>
>>>>
>>>> Anyone done any other work like this? Or know of some good algos for
>>>> this?
>>>>
>>>>
>>>>
>>>> cheers,
>>>>
>>>
>>>
>>
>
