On Sat, Jul 2, 2016 at 12:52 PM, Rob Clark <robdcl...@gmail.com> wrote: > So, games/apps that are aware of how a tiler gpu works will make an > effort to avoid mid-batch (tile pass) updates to textures, UBOs, etc, > since this will force a flush, and extra resolve (tile->mem) and > restore (mem->tile) in the next batch. They also avoid unnecessary > framebuffer switches, for the same reason. > > But turns out that many games, benchmarks, etc, aren't very good at > this. But what if we could re-order the batches (and potentially > shadow texture/UBO/etc resources) to minimize the tile passes and > unnecessary resolve/restore? > > This is based on a rough idea that Eric suggested a while back, and > a few other experiments that I have been trying recently. It boils > down to three parts: > > 1) Add an fd_batch object, which tracks cmdstream being built for that > particular tile pass. State that is global to the tile pass is > move from fd_context to fd_batch. (Mostly the framebuffer state, > but also so internal tracking that is done to decide whether to > use GMEM or sysmem/bypass mode, etc.) > > Tracking of resources written/read in the batch is also moved from > ctx to batch.
So, it turned out that tracking only the most recent batch that reads a resource leads to unnecessary dependencies, and results in batches getting force-flushed (to avoid a dependency loop) when otherwise not needed. I initially did things this way so I could have a single list_head in the pipe_resource, and to avoid needing to track a 'struct set' of batches per pipe_resource. But we really need a way to allow tracking of multiple batches that read a resource without introducing an artificial dependency between the reading batches. So I came up with a different approach after discussing a few different options with glisse. It involves putting an upper bound on the # of batches at 32 (although 64 would be a possibility). In the batch, we end up needing a hash-set to track resources accessed by the batch. But in the resource we only need a bitmask of which batches access this resource (and a single 'struct fd_batch *write_batch' for most recent writer). (And I check the bitmask to short-circuit the hashset lookup/insert in the common case.) So now I'm getting ~+20% on manhattan, and a bit more improvement in xonotic than before. There are still a few glitches in xonotic (the increased re-ordering exposes that occlusion query is completely broken and queries need some work to dtrt in the face of re-ordering). And the map in the upper left corner somehow doesn't show the outline/map of the level (just the dots where the players are at). Not sure yet what is going on there. Mostly I only hit forced flushes due to hitting upper limit on # of batches during game startup, when it is doing a lot of texture uploads and mipmap generation, but not yet submitted any rendering that uses those textures. And an upper-bound on un-flushed batches in that sort of scenario actually seems like a good thing. Although I could probably be more clever about picking which batch(es) to flush in that scenario. The upper limit could be problematic if someone uploaded layer 0 to a bunch of textures, and then generated mipmap for all of the textures (as opposed to interleaving upload/genmipmap). I guess you probably have to go out of your way to be that stupid, so meh? One of the annoying things, since pipe_resource is per-screen, not per-context, I end up having to push batch_cache down into screen. Which means that, for example, one context switching fb state could force flushing a batch from another context. Eventually if I push of gmem+submit to a per-context helper thread, that should help keep things properly serialized. Although I still need some (currently missing) mutexes to serialize batch_cache access, etc. Also it means that the upper limit on # of batches is per-screen, not per-context. Not really sure what to do about that. I really wish resources were not shared across contexts (but rather just use flink or dmabuf when you need to share), but I guess it is too late for that now :-( https://github.com/freedreno/mesa/commit/e23dac02234de1c688efbad58758fdf9d837c94b BR, -R > 2) Add a batch-cache. Previously, whenever new framebuffer state is > set, it forced a flush. Now (if reordering is enabled), we use > the framebuffer state as key into a hashtable to map it to an > existing batch (if there is one, otherwise construct a new batch > and add it to the table). > > When a resource is marked as read/written by a batch, which is > already pending access by another batch, a dependency between the > two batches is added. > > TODO there is probably a bit more room for improvement here. See > below analysis of supertuxkart. > > 3) Shadow resources. Mid-batch UBO updates or uploading new contents > to an in-use texture is sadly too common. Traditional (non-tiler) > gpu's could solve this with a staging buffer, and blitting from the > staging to real buffer at the appropriate spot in the cmdstream. > But this doesn't work for a tiling gpu, since we'll need the old > contents again when we move on to the next tile. To solve this, > allocate a new buffer and back-blit the previous contents to the > new buffer. The existing buffer becomes a shadow and is unref'd > (the backing GEM object is kept alive since it is referenced by > the cmdstream). > > For example, a texture upload + mipmap gen turns into transfer_map > for level zero (glTexSubImage*, etc), followed by blits to the > remaining mipmap levels (glGenerateMipmap()). So in transfer_map() > if writing new contents into the buffer would trigger a flush or > stall, we shadow the existing buffer, and blit the remaining levels > from old to new. Each blit turns into a batch (different frame- > buffer state), and is not immediately flushed, but just hangs out > in the batch cache. When the next blit (from glGenerateMipmap() > overwrites the contents from the back-blit, we realize this and > drop the previous rendering to the batch, so in many cases the > back-blit ends up discarded. > > > > Results: > > supertuxkart was a big winner, with an overall ~30% boost, making the > new render engine finally playable on most levels. Fps varies a lot > by level, but on average going from 14-19fps to 20-25fps. > > (Sadly, the old render engine, which was much faster on lower end hw, > seems to be in disrepair.) > > I did also add some instrumentation to collect some stats on # of > different sorts of batches. Since supertuxkart --profile-laps is > not repeatable, I could not directly compare results there, but I > could compare an apitrace replay of stk level: > > normal: batch_sysmem=10398, batch_gmem=6958, batch_restore=3864 > reorder: batch_sysmem=16825, batch_gmem=6956, batch_restore=3863 > (for 792 frames) > > I was expecting a drop in gmem batches, and restores, because stk > does two problematic things: (1) render target switches, ie. clear, > switch fb, clear, switch fb, draw, etc., and (2) mid-batch UBO > update. > > I've looked a bit into the render target switches, but it seems like > it is mixing/matching zsbuf and cbuf's in a way that makes them map > to different batches. Ie: > > set fb: zsbuf=A, cbuf[0]=B > clear color0 > clear stencil > set fb: zsbuf=A, cbuf[0]=C > draw > > Not entirely sure what to do about that. I suppose I could track the > cmdstream for the clears individually, and juggle them between batches > somehow to avoid the flush? > > The mid-batch UBO update seems to actually happen between two fb states > with the same color0 and zs, but first treats color0 as R8G8B8A8_SRGB > and the next R8G8B8A8_UNORM. Probably we need a flush here anyways, > but use of glDiscardFramebuffer() in the app (and wiring up the driver > bits) could avoid a lot of restores. > > Most of the gain seems to come from simply not stalling on the UBO > update. > > > xonitic also seems to be a winner, although I haven't analyzed it as > closely: > > med: 48fps -> 52fps > high: 25fps -> 31fps > ultra: 15fps -> 19fps > > and the batch stats show more of an improvement: > > med: > normal: batch_sysmem=0, batch_gmem=18055, batch_restore=3748 > reorder: batch_sysmem=2220, batch_gmem=14483, batch_restore=174 > (10510 frames) > > high: > normal: batch_sysmem=63072, batch_gmem=62692, batch_restore=48384 > reorder: batch_sysmem=65429, batch_gmem=58284, batch_restore=43971 > (10510 frames) > > ultra: > normal: batch_sysmem=63072, batch_gmem=81318, batch_restore=66863 > reorder: batch_sysmem=65869, batch_gmem=71360, batch_restore=56939 > (10510 frames) > > So in all cases a nice drop in tile passes (batch_gmem) and reduction > in number of times we need to move back from system memory to tile > buffer (batch_restore). High/ultra still has a lot of restore's per > frame, so maybe there is still some room for improvement. Not sure > yet if it is the same sort of thing going on as supertuxkart. > > I would expect to see some gains in manhattan and possibly trex, but > unfortunately it is mostly using compressed textures that util_blitter > cannot blit, so the resource shadowing back-blit ends up on the CPU > (which ends up flushing previous mipmap generation and stalling, which > kind of defeats the purpose). I'm not entirely sure what to do here. > Since we don't need scaling/filtering/etc we could map things to a > different format which can be rendered to, but I think we end up > needing to also lie about the width/height. Which works ok for fb > state (we take w/h from the pipe_surface, not the pipe_resource). But > not on the src (tex state) side. Possibly we could add w/h to > pipe_sampler_view to solve this? Solving this should at least bring > about +15% in manhattan, and maybe a bit in trex. > > > At any rate, the freedreno bits end up depending on some libdrm > patches[1] which in turn depend on some kernel stuff I have queued up > for 4.8. So it will be some time before it lands. But I'd like to > get the first three patches reviewed and pushed. And suggestions > about the remaining issues welcome, since there is still some room > for further gains. > > [1] https://github.com/freedreno/libdrm/commits/fd-next > > Rob Clark (12): > gallium/util: make util_copy_framebuffer_state(src=NULL) work > gallium: un-inline pipe_surface_desc > list: fix list_replace() for empty lists > freedreno: introduce fd_batch > freedreno: push resource tracking down into batch > freedreno: dynamically sized/growable cmd buffers > freedreno: move more batch related tracking to fd_batch > freedreno: add batch-cache > freedreno: batch re-ordering support > freedreno: spiff up some debug traces > freedreno: shadow textures if possible to avoid stall/flush > freedreno: support discarding previous rendering in special cases > > src/gallium/auxiliary/util/u_framebuffer.c | 37 ++- > src/gallium/drivers/freedreno/Makefile.sources | 4 + > src/gallium/drivers/freedreno/a2xx/fd2_draw.c | 12 +- > src/gallium/drivers/freedreno/a2xx/fd2_emit.c | 15 +- > src/gallium/drivers/freedreno/a2xx/fd2_gmem.c | 63 ++--- > src/gallium/drivers/freedreno/a3xx/fd3_context.c | 4 - > src/gallium/drivers/freedreno/a3xx/fd3_context.h | 5 - > src/gallium/drivers/freedreno/a3xx/fd3_draw.c | 23 +- > src/gallium/drivers/freedreno/a3xx/fd3_emit.c | 23 +- > src/gallium/drivers/freedreno/a3xx/fd3_emit.h | 2 +- > src/gallium/drivers/freedreno/a3xx/fd3_gmem.c | 146 +++++------ > src/gallium/drivers/freedreno/a4xx/fd4_draw.c | 41 +-- > src/gallium/drivers/freedreno/a4xx/fd4_draw.h | 13 +- > src/gallium/drivers/freedreno/a4xx/fd4_emit.c | 24 +- > src/gallium/drivers/freedreno/a4xx/fd4_emit.h | 2 +- > src/gallium/drivers/freedreno/a4xx/fd4_gmem.c | 122 ++++----- > src/gallium/drivers/freedreno/freedreno_batch.c | 280 > +++++++++++++++++++++ > src/gallium/drivers/freedreno/freedreno_batch.h | 152 +++++++++++ > .../drivers/freedreno/freedreno_batch_cache.c | 246 ++++++++++++++++++ > .../drivers/freedreno/freedreno_batch_cache.h | 51 ++++ > src/gallium/drivers/freedreno/freedreno_context.c | 131 ++-------- > src/gallium/drivers/freedreno/freedreno_context.h | 123 ++------- > src/gallium/drivers/freedreno/freedreno_draw.c | 132 +++++----- > src/gallium/drivers/freedreno/freedreno_draw.h | 15 +- > src/gallium/drivers/freedreno/freedreno_gmem.c | 110 ++++---- > src/gallium/drivers/freedreno/freedreno_gmem.h | 6 +- > src/gallium/drivers/freedreno/freedreno_query_hw.c | 8 +- > src/gallium/drivers/freedreno/freedreno_resource.c | 242 ++++++++++++++++-- > src/gallium/drivers/freedreno/freedreno_resource.h | 10 +- > src/gallium/drivers/freedreno/freedreno_screen.c | 9 + > src/gallium/drivers/freedreno/freedreno_screen.h | 2 + > src/gallium/drivers/freedreno/freedreno_state.c | 19 +- > src/gallium/drivers/freedreno/freedreno_util.h | 43 ++-- > src/gallium/include/pipe/p_state.h | 23 +- > src/util/list.h | 14 +- > 35 files changed, 1486 insertions(+), 666 deletions(-) > create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.c > create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.h > create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.c > create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.h > > -- > 2.7.4 > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev