So some rendering patterns that I've seen in apps turn out to be somewhat evil for tiling GPUs.. a couple of cases I've seen:
1) stk has some silliness where it binds an fbo, clears, binds another fbo and clears, binds the previous fbo and draws, and so on.  This one is probably not too hard to just fix in stk.

2) I've seen a render pattern in manhattan where the app does a bunch of texture uploads mid-frame via a pbo (and then generates mipmap levels for the updated texture, which hits the blit path, which changes fb state and forces a flush).  This one is probably not something that can be fixed in the app ;-)

There are probably other cases where this comes up which I haven't noticed yet.  I'm not entirely sure how common the pattern that I see in manhattan is.

At one point, Eric Anholt mentioned the idea of tracking rendering cmdstream per render-target, as well as dependency information between these different sets of cmdstream (if you render to one fbo, then turn around and sample from it, the rendering needs to happen before the sampling).  I've been thinking a bit about how this would actually work, and trying to do some experiments to get an idea of how useful this would be.

In the manhattan case, via a bit of a hack (basically no-op'ing the pipe->blit() to avoid interrupting the tiling pass), I guesstimate that if we were able to re-order the rendering it would gain us something around 15%.  (This is on ifc6540.. the win might be bigger on something more memory bandwidth constrained.)  To realize the benefit we would also need a bit more cleverness in pipe->transfer_map, to notice that the whole texture contents are being updated and turn the DISCARD_RANGE into DISCARD_WHOLE_RESOURCE.  The problem, I think, is that only the first mipmap level is being discarded, so we'd need to realize that in the new buffer the additional mipmap levels aren't valid.. no idea how that would work.. but in this case it seems to be a mostly smallish (128x128) texture, so maybe it is a win to just memcpy the rest of the old texture data over to the new texture bo to avoid the stall/flush.

Anyways, the basic idea involves turning pipe_framebuffer_state into a refcnt'd CSO inside the driver, and using that as the point to track rendering cmds and dependency info.  (It would be kinda nice if fb state was already a CSO.. but I guess we can work around that in the driver by using the pipe_framebuffer_state as the hashtable key.. hopefully we can rely on not having garbage data in unused cbuf slots?  Otherwise we might need a custom hash/equals fxn.)  So something like:

	/* framebuffer CSO: */
	/* TODO maybe it is more clear to call it fd_batch? */
	struct fd_framebuffer_state {
		struct pipe_reference refcnt;
		struct pipe_framebuffer_state base;
		struct fd_context *ctx;
		struct fd_ringbuffer *ring;
		struct set *dependencies;  /* hashset of dependent fd_framebuffer_state(s) */
		bool dirty;
	};

When new fb state is set, do a hashtable lookup and increment the refcnt of the existing CSO if it exists, else create a new state object.  And unref the outgoing CSO.
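A rough sketch of what the set_framebuffer_state hook could look like under that scheme.. note that fd_framebuffer_lookup() and fd_framebuffer_reference() are hypothetical helpers here, just to illustrate the refcounting, not existing code:

	static void
	fd_set_framebuffer_state(struct pipe_context *pctx,
			const struct pipe_framebuffer_state *pfb)
	{
		struct fd_context *ctx = fd_context(pctx);
		struct fd_framebuffer_state *fb;

		/* hashtable lookup keyed on the pipe_framebuffer_state,
		 * returning a new ref on the existing CSO if found, else
		 * creating (and referencing) a new state object:
		 */
		fb = fd_framebuffer_lookup(ctx, pfb);

		/* unref the outgoing CSO and make the new (already
		 * ref'd) one current:
		 */
		fd_framebuffer_reference(&ctx->fb, NULL);
		ctx->fb = fb;
	}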
Whenever there is unflushed rendering to a prsc (pipe_resource), the prsc would need to also hold a refcnt to the most recent fb CSO which renders to the prsc, to keep the fb CSO live as long as something depends on it.  Also we need to hold refs to all the entries in the dependencies table.

Whenever we emit a reference to another prsc (texture, vbo, index buffer, etc), we'd have to check if it has pending rendering in a different fb CSO.  I think for the most part we could replace the OUT_RELOC(fd_bo *) helper with OUT_PRSC(pipe_resource *).. so something roughly like:

	struct fd_resource {
		struct u_resource base;
		...
	-	struct fd_context *pending_ctx;
	+	/* hold ref to most recent fb CSO that rendered to us: */
	+	struct fd_framebuffer_state *pending_fb;
	};

	static inline void
	OUT_RSC(struct fd_ringbuffer *ring, struct fd_resource *rsc)
	{
		if (rsc->pending_fb && rsc->pending_fb->dirty) {
			/* a bit ugly to chase the current ctx ptr this way, but
			 * OUT_RING() is already used in a lot of places that
			 * don't have the ctx ptr handy..
			 */
			struct fd_context *ctx = rsc->pending_fb->ctx;
			/* check for reverse dependency.. if the other fb CSO
			 * already depends on the current fb then we cannot
			 * create a loop:
			 */
			if (depends_on(rsc->pending_fb, ctx->fb)) {
				fd_context_render(ctx, ctx->fb);
			} else {
				.. add rsc->pending_fb to ctx->fb->dependencies ..
			}
		}
		OUT_RING(ring, rsc->bo);
	}

	static inline void
	OUT_PRSC(struct fd_ringbuffer *ring, struct pipe_resource *prsc)
	{
		OUT_RSC(ring, fd_resource(prsc));
	}

TODO:

1) How would queries work when we start re-ordering rendering?  I guess we need a query results bo per fb CSO, and the query needs to hold refs to all the fb CSOs that were active for the duration of the query?  (A rough sketch of what that could look like follows this list.)  Timestamp queries would have truly nonsense results (but that is already more or less the case for tilers).

2) What happens w/ prsc's shared across multiple pipe_context's?  I guess we get a pipe->flush(), otherwise sharing would never work, so maybe that is good enough?

3) Anything useful to extract out into helpers?  I guess vc4 and freedreno more or less want the same thing..

4) What gremlins have I not imagined yet?  There seem to be a lot of ways to get this wrong..
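For (1), a rough sketch of per-fb-CSO query tracking.. the names below (fd_query_period, fd_reorder_query) are made up for illustration, not existing code; the idea is just that the query collects one {fb CSO, results bo} pair per fb CSO it spans, holding a ref on each until the results are read back:

	/* one entry per fb CSO that was active for some portion of
	 * the query:
	 */
	struct fd_query_period {
		struct fd_framebuffer_state *fb;   /* holds a ref, keeps the
		                                    * fb CSO live until readback */
		struct fd_bo *results_bo;          /* results written by that
		                                    * fb CSO's cmdstream */
	};

	/* and the query object itself would just collect periods: */
	struct fd_reorder_query {
		struct fd_query base;
		struct util_dynarray periods;      /* of struct fd_query_period */
	};

At get_query_result() time, any period whose fb CSO is still dirty would need to be flushed before accumulating the per-period results.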