Re: [Mesa-dev] tilers and out-of-order rendering..

2016-06-06 Thread Rob Clark
On Mon, Jun 6, 2016 at 5:19 AM, Jose Fonseca  wrote:
> On 04/06/16 20:36, Rob Clark wrote:
>>
>> On Fri, Jun 3, 2016 at 8:53 AM, Rob Clark  wrote:
>>>
>>> Ok, so I had a really evil thought that I wanted to bounce off
>>> people..  it's a quite different approach from the more obvious one
>>> discussed below (and which I've already started implementing)
>>>
>>> Basically, idea is to have a wrapper pipe driver, similar to
>>> ddebug/rbug/trace/etc, which re-orders draw calls.  All the CSO
>>> objects would have to be wrapped in a refcounted thing, so
>>> pending-draw's could hang on to their associated state.  For things
>>> that are not refcounted (draw_info, and all the non-CSO state) there
>>> would unfortunately be some memcpy involved.. not sure how bad that
>>> would be, but it seems like the thing that could doom the idea?
>>
>>
>> so the slightly awkward thing is how to deal with things like
>> u_blitter (pipe->blit/pipe->copy_region).. if we were re-ordering
>> things to avoid unnecessary render target switches, the wrapper layer
>> would have to handle these paths itself.  But looks like vc4 has some
>> special handling (vc4_tile_blit()).. not really sure how that would
>> work out.
>>
>> (and in general, the wrapper layer would want to handle some cases, as
>> well as transfer_map, itself.. so it could generate ghost
>> pipe_resources for things like writing into a busy texture.. but that
>> probably isn't too hard since a wrapper pipe_resource could replace
>> the ref to hw driver's pipe_resource and schedule blits to copy from
>> previous pipe_resource where needed..  hopefully combination of
>> PIPE_TRANSFER_DISCARD* and pipe_draw_info::discard type hint (as I
>> mentioned below) could "DCE" those copy blits.  Except I somehow need
>> to deal w/ CSO's which have reference to the ghosted resource.. bleh)
>
>
> Yeah, wrapper pipe drivers sound nice in theory, but aren't that great in
> practice, for anything other than debugging pipe drivers.
>
> A driver auxiliary library that drivers can opt-in/out and extend, is much
> more flexible.
>
> But still, rather than aiming straight at driver-independent code, it
> might be better to first prototype as an internal driver component, and
> then generalize/refactor in a 2nd step.

well, it is two fairly different approaches.  Doing it in the driver,
you would (presumably) be buffering up baked cmdstream.  Vs the
wrapper layer that is buffering up gallium state.  I do like the
aspect that it keeps all the (rather generic) complexity of dependency
tracking and resource ghosting out of the driver.

And I do like the fact that switching it on/off is just a matter of
deciding to wrap the screen or not, and when you don't wrap the screen
the overhead goes away completely.

idk, I may end up trying it both ways in the end..

BR,
-R
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] tilers and out-of-order rendering..

2016-06-06 Thread Jose Fonseca

On 04/06/16 20:36, Rob Clark wrote:
>
> On Fri, Jun 3, 2016 at 8:53 AM, Rob Clark  wrote:
>>
>> Ok, so I had a really evil thought that I wanted to bounce off
>> people..  it's a quite different approach from the more obvious one
>> discussed below (and which I've already started implementing)
>>
>> Basically, idea is to have a wrapper pipe driver, similar to
>> ddebug/rbug/trace/etc, which re-orders draw calls.  All the CSO
>> objects would have to be wrapped in a refcounted thing, so
>> pending-draw's could hang on to their associated state.  For things
>> that are not refcounted (draw_info, and all the non-CSO state) there
>> would unfortunately be some memcpy involved.. not sure how bad that
>> would be, but it seems like the thing that could doom the idea?
>
>
> so the slightly awkward thing is how to deal with things like
> u_blitter (pipe->blit/pipe->copy_region).. if we were re-ordering
> things to avoid unnecessary render target switches, the wrapper layer
> would have to handle these paths itself.  But looks like vc4 has some
> special handling (vc4_tile_blit()).. not really sure how that would
> work out.
>
> (and in general, the wrapper layer would want to handle some cases, as
> well as transfer_map, itself.. so it could generate ghost
> pipe_resources for things like writing into a busy texture.. but that
> probably isn't too hard since a wrapper pipe_resource could replace
> the ref to hw driver's pipe_resource and schedule blits to copy from
> previous pipe_resource where needed..  hopefully combination of
> PIPE_TRANSFER_DISCARD* and pipe_draw_info::discard type hint (as I
> mentioned below) could "DCE" those copy blits.  Except I somehow need
> to deal w/ CSO's which have reference to the ghosted resource.. bleh)


Yeah, wrapper pipe drivers sound nice in theory, but aren't that great 
in practice, for anything other than debugging pipe drivers.


A driver auxiliary library that drivers can opt-in/out and extend, is 
much more flexible.


But still, rather than aiming straight at driver-independent code,
it might be better to first prototype as an internal driver component,
and then generalize/refactor in a 2nd step.


Jose


Re: [Mesa-dev] tilers and out-of-order rendering..

2016-06-04 Thread Rob Clark
On Fri, Jun 3, 2016 at 8:53 AM, Rob Clark  wrote:
> Ok, so I had a really evil thought that I wanted to bounce off
> people..  it's a quite different approach from the more obvious one
> discussed below (and which I've already started implementing)
>
> Basically, idea is to have a wrapper pipe driver, similar to
> ddebug/rbug/trace/etc, which re-orders draw calls.  All the CSO
> objects would have to be wrapped in a refcounted thing, so
> pending-draw's could hang on to their associated state.  For things
> that are not refcounted (draw_info, and all the non-CSO state) there
> would unfortunately be some memcpy involved.. not sure how bad that
> would be, but it seems like the thing that could doom the idea?

so the slightly awkward thing is how to deal with things like
u_blitter (pipe->blit/pipe->copy_region).. if we were re-ordering
things to avoid unnecessary render target switches, the wrapper layer
would have to handle these paths itself.  But looks like vc4 has some
special handling (vc4_tile_blit()).. not really sure how that would
work out.

(and in general, the wrapper layer would want to handle some cases, as
well as transfer_map, itself.. so it could generate ghost
pipe_resources for things like writing into a busy texture.. but that
probably isn't too hard since a wrapper pipe_resource could replace
the ref to hw driver's pipe_resource and schedule blits to copy from
previous pipe_resource where needed..  hopefully combination of
PIPE_TRANSFER_DISCARD* and pipe_draw_info::discard type hint (as I
mentioned below) could "DCE" those copy blits.  Except I somehow need
to deal w/ CSO's which have reference to the ghosted resource.. bleh)

BR,
-R

> The nice thing is it becomes basically free to turn on/off for
> different drivers, at least at screen create time.. basically it gets
> 100% re-use, rather than having to re-implement the concepts in each
> (tiler) driver.
>
> Not sure if we need a way to turn it on/off at context create time,
> but either way it would be nice if it were somehow a driconf option so
> that it could be enabled/disabled per app, as to not penalize properly
> written apps.
>
> Thoughts?
>
> 
>
> Semi-related issue, which applies to either of the draw-reordering
> approaches.  A frequent pattern is:
>
>    ... bunch of draws ...
>    glTexSubImage2D()
>    glGenerateMipmap()
>    ... bunch more draws ...
>    ... repeat sequence a bunch of times with same texture ...
>
> That glTexSubImage() comes to driver as transfer_map(DISCARD_RANGE).
> At this point the backing bo is likely to be busy (since above
> sequence repeats a bunch of times with the same texture).  So the best
> we can do is discard whole bo and schedule blit(s) for the remaining
> levels into the new bo.
>
> But then at the glGenerateMipmap() step, we overwrite the contents of
> all the other levels.  Which means if driver (or re-ordering wrapper
> layer) had some extra hints, the blits triggered by the transfer_map()
> could be skipped.
>
> What I'm thinking would be a simple solution is to have an extra field
> in pipe_draw_info so that internal blits (like mipmap generation)
> could hint to the driver that the entire previous contents of the
> render target are discarded.  (Or possibly we want it more
> fine-grained, to indicate which render-targets and z/s are discarded,
> if not all?  But this doesn't seem useful.)  This could help tell
> tilers that they could discard previous blits (and even skip
> system-memory -> tile transfer).
>
> (Hell, there might even be some use to apps to expose the "this draw
> discards previous contents" type extension..  given some of the wonky
> vendor extensions I've seen, I wouldn't be surprised if it already
> existed.)
>
> Thoughts?
>
> BR,
> -R
>
>
> On Fri, May 20, 2016 at 10:51 AM, Rob Clark  wrote:
>> On Fri, May 20, 2016 at 3:35 AM, Jose Fonseca  wrote:
>>> On 20/05/16 00:34, Rob Clark wrote:

>>>> On Thu, May 19, 2016 at 6:21 PM, Eric Anholt  wrote:
>>>>>
>>>>> Rob Clark  writes:
>>>>>
>>>>>> So some rendering patterns that I've seen in apps turn out to be
>>>>>> somewhat evil for tiling gpu's.. couple cases I've seen:
>>>>>>
>>>>>> 1) stk has some silliness where it binds an fbo, clears, binds other
>>>>>> fbo clears, binds previous fbo and draws, and so on.  This one is
>>>>>> probably not too hard to just fix in stk.
>>>>>>
>>>>>> 2) I've seen a render pattern in manhattan where app does a bunch of
>>>>>> texture uploads mid-frame via a pbo (and then generates mipmap levels
>>>>>> for the updated texture, which hits the blit path which changes fb
>>>>>> state and forces a flush).  This one probably not something that can
>>>>>> be fixed in the app ;-)
>>>>>>
>>>>>> There are probably other cases where this comes up which I haven't
>>>>>> noticed yet.  I'm not entirely sure how common the pattern that I see
>>>>>> in manhattan is.
>>>>>>
>>>>>> At one

Re: [Mesa-dev] tilers and out-of-order rendering..

2016-06-03 Thread Rob Clark
Ok, so I had a really evil thought that I wanted to bounce off
people..  it's a quite different approach from the more obvious one
discussed below (and which I've already started implementing)

Basically, idea is to have a wrapper pipe driver, similar to
ddebug/rbug/trace/etc, which re-orders draw calls.  All the CSO
objects would have to be wrapped in a refcounted thing, so
pending-draw's could hang on to their associated state.  For things
that are not refcounted (draw_info, and all the non-CSO state) there
would unfortunately be some memcpy involved.. not sure how bad that
would be, but it seems like the thing that could doom the idea?
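
To make the refcounting-plus-memcpy cost a bit more concrete, here is a minimal sketch in plain C of how a queued draw could pin its state.  Everything in it is hypothetical (wrap_cso, pending_draw and draw_params are illustration-only names, not Mesa types), and it shows a single CSO where a real draw would reference many:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for per-draw parameters (not Mesa's pipe_draw_info). */
struct draw_params {
   unsigned start, count, instance_count;
};

/* Hypothetical refcounted wrapper around an opaque CSO handle that the
 * wrapped hw driver returned. */
struct wrap_cso {
   int refcnt;
   void *hw_cso;
};

/* A queued draw: holds a reference on the CSO and a private copy of the
 * non-refcounted parameters, since the caller's copy is transient. */
struct pending_draw {
   struct wrap_cso *blend;
   struct draw_params info;
};

static struct wrap_cso *wrap_cso_create(void *hw_cso)
{
   struct wrap_cso *cso = calloc(1, sizeof(*cso));
   if (!cso)
      return NULL;
   cso->refcnt = 1;
   cso->hw_cso = hw_cso;
   return cso;
}

static void wrap_cso_ref(struct wrap_cso *cso)
{
   cso->refcnt++;
}

/* Drops one reference; returns 1 if the wrapper was freed. */
static int wrap_cso_unref(struct wrap_cso *cso)
{
   if (--cso->refcnt > 0)
      return 0;
   free(cso);
   return 1;
}

static void pending_draw_init(struct pending_draw *d, struct wrap_cso *blend,
                              const struct draw_params *info)
{
   wrap_cso_ref(blend);
   d->blend = blend;
   memcpy(&d->info, info, sizeof(*info));   /* the memcpy cost in question */
}

/* Called once the draw has been replayed into the hw driver. */
static int pending_draw_fini(struct pending_draw *d)
{
   return wrap_cso_unref(d->blend);
}
```

With this shape the state tracker can delete its CSO right after the draw call, and the wrapper only frees the hw object once the last pending draw referencing it has been replayed.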

The nice thing is it becomes basically free to turn on/off for
different drivers, at least at screen create time.. basically it gets
100% re-use, rather than having to re-implement the concepts in each
(tiler) driver.

Not sure if we need a way to turn it on/off at context create time,
but either way it would be nice if it were somehow a driconf option so
that it could be enabled/disabled per app, as to not penalize properly
written apps.

Thoughts?



Semi-related issue, which applies to either of the draw-reordering
approaches.  A frequent pattern is:

   ... bunch of draws ...
   glTexSubImage2D()
   glGenerateMipmap()
   ... bunch more draws ...
   ... repeat sequence a bunch of times with same texture ...

That glTexSubImage() comes to driver as transfer_map(DISCARD_RANGE).
At this point the backing bo is likely to be busy (since above
sequence repeats a bunch of times with the same texture).  So the best
we can do is discard whole bo and schedule blit(s) for the remaining
levels into the new bo.

But then at the glGenerateMipmap() step, we overwrite the contents of
all the other levels.  Which means if driver (or re-ordering wrapper
layer) had some extra hints, the blits triggered by the transfer_map()
could be skipped.

What I'm thinking would be a simple solution is to have an extra field
in pipe_draw_info so that internal blits (like mipmap generation)
could hint to the driver that the entire previous contents of the
render target are discarded.  (Or possibly we want it more
fine-grained, to indicate which render-targets and z/s are discarded,
if not all?  But this doesn't seem useful.)  This could help tell
tilers that they could discard previous blits (and even skip
system-memory -> tile transfer).
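
A rough sketch of how such a hint could be consumed, in plain C.  The draw_hint struct and its discard flag are assumptions standing in for the proposed pipe_draw_info field, and pending_blit is a made-up record for the copy blits that transfer_map() scheduled; none of these exist in gallium today:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of a copy blit scheduled when a busy bo was replaced. */
struct pending_blit {
   int dst_rsc;     /* id of the destination resource */
   int level;       /* destination mipmap level */
   bool dead;       /* set once we know the result is never observed */
};

/* Hypothetical stand-in for the proposed hint on a draw. */
struct draw_hint {
   int target_rsc;
   int target_level;
   bool discard;    /* previous contents of the render target are discarded */
};

/* Marks every pending blit whose destination is fully overwritten by the
 * hinted draw as dead.  Returns how many blits were dropped. */
static size_t drop_dead_blits(struct pending_blit *blits, size_t n,
                              const struct draw_hint *hint)
{
   size_t dropped = 0;

   if (!hint->discard)
      return 0;

   for (size_t i = 0; i < n; i++) {
      if (!blits[i].dead &&
          blits[i].dst_rsc == hint->target_rsc &&
          blits[i].level == hint->target_level) {
         blits[i].dead = true;
         dropped++;
      }
   }
   return dropped;
}
```

In the glTexSubImage2D() + glGenerateMipmap() sequence, each mipmap-generation draw would carry the hint for the level it renders to, so the per-level copy blits queued by the earlier transfer_map() would be dropped one by one.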

(Hell, there might even be some use to apps to expose the "this draw
discards previous contents" type extension..  given some of the wonky
vendor extensions I've seen, I wouldn't be surprised if it already
existed.)

Thoughts?

BR,
-R


On Fri, May 20, 2016 at 10:51 AM, Rob Clark  wrote:
> On Fri, May 20, 2016 at 3:35 AM, Jose Fonseca  wrote:
>> On 20/05/16 00:34, Rob Clark wrote:
>>>
>>> On Thu, May 19, 2016 at 6:21 PM, Eric Anholt  wrote:

>>>> Rob Clark  writes:
>>>>
>>>>> So some rendering patterns that I've seen in apps turn out to be
>>>>> somewhat evil for tiling gpu's.. couple cases I've seen:
>>>>>
>>>>> 1) stk has some silliness where it binds an fbo, clears, binds other
>>>>> fbo clears, binds previous fbo and draws, and so on.  This one is
>>>>> probably not too hard to just fix in stk.
>>>>>
>>>>> 2) I've seen a render pattern in manhattan where app does a bunch of
>>>>> texture uploads mid-frame via a pbo (and then generates mipmap levels
>>>>> for the updated texture, which hits the blit path which changes fb
>>>>> state and forces a flush).  This one probably not something that can
>>>>> be fixed in the app ;-)
>>>>>
>>>>> There are probably other cases where this comes up which I haven't
>>>>> noticed yet.  I'm not entirely sure how common the pattern that I see
>>>>> in manhattan is.
>>>>>
>>>>> At one point, Eric Anholt mentioned the idea of tracking rendering
>>>>> cmdstream per render-target, as well as dependency information between
>>>>> these different sets of cmdstream (if you render to one fbo, then turn
>>>>> around and sample from it, the rendering needs to happen before the
>>>>> sampling).  I've been thinking a bit about how this would actually
>>>>> work, and trying to do some experiments to get an idea about how
>>>>> useful this would be.
>>>>
>>>>
>>>> My plan was pretty much what you laid out here, except I was going to
>>>> just map to my CL struct with a little hash table from the FB state
>>>> members since FB state isn't a CSO.
>>>
>>>
>>> ok, yeah, I guess that solves the naming conflict (fd_batch(_state)
>>> sounds nicer for what its purpose really is than
>>> fd_framebuffer_state)
>>>
>>> BR,
>>> -R
>>
>>
>> llvmpipe is also a tiler and we've seen similar patterns.  Flushing reduces
>> caching effectiveness, however in llvmpipe quite often texture sampling is
>> the bottleneck, and an additional flush doesn't make a huge difference.
>>
>
> interesting, it hadn't occurred to me about llvmpipe
>
>>
>> I think the internal hash table as 

Re: [Mesa-dev] tilers and out-of-order rendering..

2016-05-20 Thread Rob Clark
On Fri, May 20, 2016 at 3:35 AM, Jose Fonseca  wrote:
> On 20/05/16 00:34, Rob Clark wrote:
>>
>> On Thu, May 19, 2016 at 6:21 PM, Eric Anholt  wrote:
>>>
>>> Rob Clark  writes:
>>>
>>>> So some rendering patterns that I've seen in apps turn out to be
>>>> somewhat evil for tiling gpu's.. couple cases I've seen:
>>>>
>>>> 1) stk has some silliness where it binds an fbo, clears, binds other
>>>> fbo clears, binds previous fbo and draws, and so on.  This one is
>>>> probably not too hard to just fix in stk.
>>>>
>>>> 2) I've seen a render pattern in manhattan where app does a bunch of
>>>> texture uploads mid-frame via a pbo (and then generates mipmap levels
>>>> for the updated texture, which hits the blit path which changes fb
>>>> state and forces a flush).  This one probably not something that can
>>>> be fixed in the app ;-)
>>>>
>>>> There are probably other cases where this comes up which I haven't
>>>> noticed yet.  I'm not entirely sure how common the pattern that I see
>>>> in manhattan is.
>>>>
>>>> At one point, Eric Anholt mentioned the idea of tracking rendering
>>>> cmdstream per render-target, as well as dependency information between
>>>> these different sets of cmdstream (if you render to one fbo, then turn
>>>> around and sample from it, the rendering needs to happen before the
>>>> sampling).  I've been thinking a bit about how this would actually
>>>> work, and trying to do some experiments to get an idea about how
>>>> useful this would be.
>>>
>>>
>>> My plan was pretty much what you laid out here, except I was going to
>>> just map to my CL struct with a little hash table from the FB state
>>> members since FB state isn't a CSO.
>>
>>
>> ok, yeah, I guess that solves the naming conflict (fd_batch(_state)
>> sounds nicer for what its purpose really is than
>> fd_framebuffer_state)
>>
>> BR,
>> -R
>
>
> llvmpipe is also a tiler and we've seen similar patterns.  Flushing reduces
> caching effectiveness, however in llvmpipe quite often texture sampling is
> the bottleneck, and an additional flush doesn't make a huge difference.
>

interesting, it hadn't occurred to me about llvmpipe

>
> I think the internal hash table as Eric proposes seems a better first step.
>
> Later on we could try make framebuffer state a first class cso, but I
> suspect you'll probably want to walk internally all pending FBOs CLs anyway
> (to see which need to be flushed on transfers.)
>
> So first changing the driver internals, then abstract if there are
> commonalities, seems more effective way forward.


yeah, makes sense.. and I'm planning to go w/ Eric's idea to keep
fd_batch separate from framebuffer state.

It did occur to me that I forgot to think about the write-after-read
hazard case.  Those need to be handled with an extra dependency
between batches too.

And at least for this particular case, I need somehow some cleverness
to discard or clone the old bo to avoid that write-after-read forcing
a flush.  (Maybe in transfer_map?  But I guess there are other paths..
hmm..)
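
For illustration, one way the write-after-read bookkeeping could look, as a plain C sketch.  The batch and resource structs below are simplified, hypothetical stand-ins (not the actual fd_batch or fd_resource), and dependencies are kept as a bitmask, so batch ids are assumed to be below 32:

```c
#include <stdint.h>

/* Hypothetical batch: 'deps' is a bitmask of batch ids that must be
 * flushed before this one. */
struct batch {
   unsigned id;
   uint32_t deps;
};

/* Hypothetical per-resource tracking of pending (unflushed) access. */
struct resource {
   uint32_t readers;   /* bitmask of batches with pending reads */
   int writer;         /* id of batch with pending write, or -1 */
};

static void batch_read(struct batch *b, struct resource *r)
{
   /* read-after-write: the pending writer has to run first */
   if (r->writer >= 0 && (unsigned)r->writer != b->id)
      b->deps |= 1u << r->writer;
   r->readers |= 1u << b->id;
}

static void batch_write(struct batch *b, struct resource *r)
{
   /* write-after-read: every batch still reading the old contents has
    * to run before we overwrite them */
   b->deps |= r->readers & ~(1u << b->id);

   /* write-after-write: keep ordering against the previous writer */
   if (r->writer >= 0 && (unsigned)r->writer != b->id)
      b->deps |= 1u << r->writer;

   /* later writers order against us, and transitively against the
    * readers we just picked up as dependencies */
   r->readers = 0;
   r->writer = (int)b->id;
}
```

Discarding or cloning the bo, as suggested above, amounts to giving the writer a fresh resource with no pending readers, so batch_write() would add no dependency and nothing would be forced to flush.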

BR,
-R


Re: [Mesa-dev] tilers and out-of-order rendering..

2016-05-20 Thread Jose Fonseca

On 20/05/16 00:34, Rob Clark wrote:
>
> On Thu, May 19, 2016 at 6:21 PM, Eric Anholt  wrote:
>>
>> Rob Clark  writes:
>>
>>> So some rendering patterns that I've seen in apps turn out to be
>>> somewhat evil for tiling gpu's.. couple cases I've seen:
>>>
>>> 1) stk has some silliness where it binds an fbo, clears, binds other
>>> fbo clears, binds previous fbo and draws, and so on.  This one is
>>> probably not too hard to just fix in stk.
>>>
>>> 2) I've seen a render pattern in manhattan where app does a bunch of
>>> texture uploads mid-frame via a pbo (and then generates mipmap levels
>>> for the updated texture, which hits the blit path which changes fb
>>> state and forces a flush).  This one probably not something that can
>>> be fixed in the app ;-)
>>>
>>> There are probably other cases where this comes up which I haven't
>>> noticed yet.  I'm not entirely sure how common the pattern that I see
>>> in manhattan is.
>>>
>>> At one point, Eric Anholt mentioned the idea of tracking rendering
>>> cmdstream per render-target, as well as dependency information between
>>> these different sets of cmdstream (if you render to one fbo, then turn
>>> around and sample from it, the rendering needs to happen before the
>>> sampling).  I've been thinking a bit about how this would actually
>>> work, and trying to do some experiments to get an idea about how
>>> useful this would be.
>>
>>
>> My plan was pretty much what you laid out here, except I was going to
>> just map to my CL struct with a little hash table from the FB state
>> members since FB state isn't a CSO.
>
>
> ok, yeah, I guess that solves the naming conflict (fd_batch(_state)
> sounds nicer for what its purpose really is than
> fd_framebuffer_state)
>
> BR,
> -R


llvmpipe is also a tiler and we've seen similar patterns.  Flushing 
reduces caching effectiveness, however in llvmpipe quite often texture 
sampling is the bottleneck, and an additional flush doesn't make a huge 
difference.



I think the internal hash table as Eric proposes seems a better first step.

Later on we could try make framebuffer state a first class cso, but I 
suspect you'll probably want to walk internally all pending FBOs CLs 
anyway (to see which need to be flushed on transfers.)


So first changing the driver internals, then abstract if there are 
commonalities, seems more effective way forward.


Jose


Re: [Mesa-dev] tilers and out-of-order rendering..

2016-05-19 Thread Rob Clark
On Thu, May 19, 2016 at 6:21 PM, Eric Anholt  wrote:
> Rob Clark  writes:
>
>> So some rendering patterns that I've seen in apps turn out to be
>> somewhat evil for tiling gpu's.. couple cases I've seen:
>>
>> 1) stk has some silliness where it binds an fbo, clears, binds other
>> fbo clears, binds previous fbo and draws, and so on.  This one is
>> probably not too hard to just fix in stk.
>>
>> 2) I've seen a render pattern in manhattan where app does a bunch of
>> texture uploads mid-frame via a pbo (and then generates mipmap levels
>> for the updated texture, which hits the blit path which changes fb
>> state and forces a flush).  This one probably not something that can
>> be fixed in the app ;-)
>>
>> There are probably other cases where this comes up which I haven't
>> noticed yet.  I'm not entirely sure how common the pattern that I see
>> in manhattan is.
>>
>> At one point, Eric Anholt mentioned the idea of tracking rendering
>> cmdstream per render-target, as well as dependency information between
>> these different sets of cmdstream (if you render to one fbo, then turn
>> around and sample from it, the rendering needs to happen before the
>> sampling).  I've been thinking a bit about how this would actually
>> work, and trying to do some experiments to get an idea about how
>> useful this would be.
>
> My plan was pretty much what you laid out here, except I was going to
> just map to my CL struct with a little hash table from the FB state
> members since FB state isn't a CSO.

ok, yeah, I guess that solves the naming conflict (fd_batch(_state)
sounds nicer for what its purpose really is than
fd_framebuffer_state)

BR,
-R


Re: [Mesa-dev] tilers and out-of-order rendering..

2016-05-19 Thread Eric Anholt
Rob Clark  writes:

> So some rendering patterns that I've seen in apps turn out to be
> somewhat evil for tiling gpu's.. couple cases I've seen:
>
> 1) stk has some silliness where it binds an fbo, clears, binds other
> fbo clears, binds previous fbo and draws, and so on.  This one is
> probably not too hard to just fix in stk.
>
> 2) I've seen a render pattern in manhattan where app does a bunch of
> texture uploads mid-frame via a pbo (and then generates mipmap levels
> for the updated texture, which hits the blit path which changes fb
> state and forces a flush).  This one probably not something that can
> be fixed in the app ;-)
>
> There are probably other cases where this comes up which I haven't
> noticed yet.  I'm not entirely sure how common the pattern that I see
> in manhattan is.
>
> At one point, Eric Anholt mentioned the idea of tracking rendering
> cmdstream per render-target, as well as dependency information between
> these different sets of cmdstream (if you render to one fbo, then turn
> around and sample from it, the rendering needs to happen before the
> sampling).  I've been thinking a bit about how this would actually
> work, and trying to do some experiments to get an idea about how
> useful this would be.

My plan was pretty much what you laid out here, except I was going to
just map to my CL struct with a little hash table from the FB state
members since FB state isn't a CSO.
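
As an illustration of keying a hash table on FB state members, here is a self-contained C sketch.  The fb_key struct is a hypothetical, cut-down stand-in for pipe_framebuffer_state, and FNV-1a is just a convenient hash picked for the example.  Hashing and comparing only the first nr_cbufs slots, field by field, sidesteps both stale data in unused cbuf slots and struct padding:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CBUFS 8

/* Hypothetical, simplified framebuffer state used as the lookup key. */
struct fb_key {
   unsigned width, height;
   unsigned nr_cbufs;
   const void *cbufs[MAX_CBUFS];   /* only the first nr_cbufs are valid */
   const void *zsbuf;
};

/* FNV-1a over a byte range, continuing from hash value h. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
   const unsigned char *p = data;
   for (size_t i = 0; i < len; i++) {
      h ^= p[i];
      h *= 16777619u;
   }
   return h;
}

static uint32_t fb_key_hash(const struct fb_key *k)
{
   uint32_t h = 2166136261u;
   h = fnv1a(h, &k->width, sizeof(k->width));
   h = fnv1a(h, &k->height, sizeof(k->height));
   h = fnv1a(h, &k->nr_cbufs, sizeof(k->nr_cbufs));
   for (unsigned i = 0; i < k->nr_cbufs; i++)
      h = fnv1a(h, &k->cbufs[i], sizeof(k->cbufs[i]));
   h = fnv1a(h, &k->zsbuf, sizeof(k->zsbuf));
   return h;
}

static bool fb_key_equals(const struct fb_key *a, const struct fb_key *b)
{
   if (a->width != b->width || a->height != b->height ||
       a->nr_cbufs != b->nr_cbufs || a->zsbuf != b->zsbuf)
      return false;
   for (unsigned i = 0; i < a->nr_cbufs; i++) {
      if (a->cbufs[i] != b->cbufs[i])
         return false;
   }
   return true;
}
```

These two functions are the hash/equals pair a generic hash table would need; the value stored per key would be the driver's per-render-target batch or CL struct.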




[Mesa-dev] tilers and out-of-order rendering..

2016-05-19 Thread Rob Clark
So some rendering patterns that I've seen in apps turn out to be
somewhat evil for tiling gpu's.. couple cases I've seen:

1) stk has some silliness where it binds an fbo, clears, binds other
fbo clears, binds previous fbo and draws, and so on.  This one is
probably not too hard to just fix in stk.

2) I've seen a render pattern in manhattan where app does a bunch of
texture uploads mid-frame via a pbo (and then generates mipmap levels
for the updated texture, which hits the blit path which changes fb
state and forces a flush).  This one probably not something that can
be fixed in the app ;-)

There are probably other cases where this comes up which I haven't
noticed yet.  I'm not entirely sure how common the pattern that I see
in manhattan is.

At one point, Eric Anholt mentioned the idea of tracking rendering
cmdstream per render-target, as well as dependency information between
these different sets of cmdstream (if you render to one fbo, then turn
around and sample from it, the rendering needs to happen before the
sampling).  I've been thinking a bit about how this would actually
work, and trying to do some experiments to get an idea about how
useful this would be.

In the manhattan case, via a bit of a hack (to basically no-op the
pipe->blit() to avoid interrupting the tiling pass), I guesstimate that
if we were able to re-order the rendering it would gain us something
around 15%.  (This is on ifc6540.. the win might be bigger on
something more memory bandwidth constrained.)

To realize the benefit we would require a bit more cleverness in
pipe->transfer_map to realize that the whole texture contents are being
updated and turn the DISCARD_RANGE into DISCARD_WHOLE_RESOURCE.  The
problem being, I think, that it is only discarding the first mipmap
level so we'd need to realize that in the new buffer the additional
mipmap levels aren't valid.. no idea how that would work.. but in this
case it seems like mostly a smallish (128x128) texture so maybe it is
a win to just memcpy the rest of the old texture data over to the new
texture bo to avoid the stall/flush.
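
The check involved could look roughly like the following plain C sketch.  The tex and box structs and the minify() helper are local, hypothetical stand-ins (not pipe_resource, pipe_box or u_minify), and only 2D textures are considered:

```c
#include <stdbool.h>

/* Hypothetical 2D transfer region. */
struct box {
   int x, y, w, h;
};

/* Hypothetical texture description. */
struct tex {
   int width0, height0;    /* size of level 0 */
   unsigned last_level;    /* index of the smallest mipmap level */
};

/* Size of one dimension at a given mipmap level, clamped to 1. */
static int minify(int v, unsigned level)
{
   int m = v >> level;
   return m > 0 ? m : 1;
}

/* True if the mapped box covers the whole of the given level. */
static bool covers_whole_level(const struct tex *t, unsigned level,
                               const struct box *b)
{
   return b->x == 0 && b->y == 0 &&
          b->w == minify(t->width0, level) &&
          b->h == minify(t->height0, level);
}

/* If the backing bo is replaced instead of stalling, how many levels
 * still hold data that has to be carried over to the new bo?  Zero
 * means DISCARD_RANGE can be treated as DISCARD_WHOLE_RESOURCE. */
static unsigned levels_to_copy(const struct tex *t, unsigned level,
                               const struct box *b)
{
   unsigned n = t->last_level;          /* every level except 'level' */
   if (!covers_whole_level(t, level, b))
      n++;                              /* partial update keeps old texels */
   return n;
}
```

For the 128x128 case with a full mipmap chain, a whole-level-0 upload leaves 7 small levels to copy, which is where the memcpy-vs-stall tradeoff mentioned above comes in.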

Anyways, the basic idea involves turning pipe_framebuffer_state into a
refcnt'd CSO inside the driver, and use that as the point to track
rendering cmds and dependency info.  (It would be kinda nice if fb
state was already a CSO.. but I guess we can work around that in the
driver using the pipe_framebuffer_state as the hashtable key..
hopefully we can rely on not having garbage data in unused cbuf slots?
 Otherwise we might need a custom hash/equals fxn.)  So something
like:

   /* framebuffer CSO: */
   /* TODO maybe it is more clear to call it fd_batch? */
   struct fd_framebuffer_state {
      struct pipe_reference refcnt;
      struct pipe_framebuffer_state base;
      struct fd_context *ctx;
      struct fd_ringbuffer *ring;
      /* hashset of dependent fd_framebuffer_state(s): */
      struct set *dependencies;
      bool dirty;
   };

When new fb state is set, hashtable lookup and increment the refcnt of
existing CSO if it exists, else create new state object.  And unref
the outgoing CSO.  Whenever there is unflushed rendering to a prsc
(pipe_resource), the prsc would need to also hold a refcnt to the most
recent fb CSO which renders to the prsc to keep the fb CSO live as
long as something depends on it.  Also we need to hold ref's to all
the entries in the dependencies table.

Whenever we emit a reference to another prsc (texture, vbo, index
buffer, etc), we'd have to check if it has pending rendering in a
different fb CSO.  I think for the most part we could replace
OUT_RELOC(fd_bo *) helper with OUT_PRSC(pipe_resource *).. so
something roughly like:

   struct fd_resource {
      struct u_resource base;
      ...
-     struct fd_context *pending_ctx;
+     /* hold ref to most recent fb CSO that rendered to us: */
+     struct fd_framebuffer_state *pending_fb;
   };

   static inline void
   OUT_RSC(struct fd_ringbuffer *ring, struct fd_resource *rsc)
   {
      if (rsc->pending_fb && rsc->pending_fb->dirty) {
         /* a bit ugly to chase the current ctx ptr this way, but
          * OUT_RING() is already used in a lot of places that
          * don't have ctx ptr handy..
          */
         struct fd_context *ctx = rsc->pending_fb->ctx;

         /* check for reverse dependency.. if other fb CSO already
          * depends on current fb then we cannot create a loop:
          */
         if (depends_on(rsc->pending_fb, ctx->fb)) {
            fd_context_render(ctx, ctx->fb);
         } else {
            .. add rsc->pending_fb to ctx->fb->dependencies ..
         }
      }
      OUT_RING(ring, rsc->bo);
   }

   static inline void
   OUT_PRSC(struct fd_ringbuffer *ring, struct pipe_resource *prsc)
   {
      OUT_RSC(ring, fd_resource(prsc));
   }
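
The depends_on() check is left undefined above.  One possible shape for it, as a standalone C sketch: the batch struct here is a hypothetical simplification with a small fixed-size dependency array instead of a hashset, and the walk assumes the graph is acyclic, which is exactly the invariant add_dependency() maintains:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 8

/* Hypothetical, simplified batch with a fixed-size dependency list. */
struct batch {
   struct batch *deps[MAX_DEPS];
   size_t num_deps;
};

/* True if 'a' depends on 'b', directly or transitively. */
static bool depends_on(const struct batch *a, const struct batch *b)
{
   for (size_t i = 0; i < a->num_deps; i++) {
      if (a->deps[i] == b || depends_on(a->deps[i], b))
         return true;
   }
   return false;
}

/* Tries to record that 'cur' must run after 'other'.  Returns false if
 * that would create a loop (or the list is full), in which case the
 * caller has to flush instead of re-ordering. */
static bool add_dependency(struct batch *cur, struct batch *other)
{
   if (other == cur)
      return true;                  /* nothing to order against */
   if (depends_on(other, cur))
      return false;                 /* reverse dependency: would loop */
   if (depends_on(cur, other))
      return true;                  /* already ordered */
   if (cur->num_deps == MAX_DEPS)
      return false;
   cur->deps[cur->num_deps++] = other;
   return true;
}
```

A false return corresponds to the fd_context_render() branch in OUT_RSC() above.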



TODO:
1) how would queries work when we start re-ordering rendering?
   I guess we need a query results bo per fb CSO and the query
   needs to hold ref's to all the fb CSO's that