On Thu, 2007-12-27 at 10:34 +0100, Thomas Hellström wrote:

> This would not be sufficient to optimize the (re-)use of buffers.

Buffers are ready for re-use when they are no longer referenced by the
command stream and when some kind of flush operation has occurred to
remove them from caches or other memory buffers. By having the 'flush'
operation affect all buffers with pending memory accesses, we can reuse
buffers as soon as the flush has passed in the execution stream. With the
current model, every buffer must be associated with a specific flushing
fence or it will never be freed. Your example explicitly requires
that.

If the driver desires to close the gap between use and re-use, it could
simply emit more flush sequences, which would serve to clear buffers
from the GPU list more quickly.

> Exec buffers should be released when the EXE type has signalled.
> For Poulsbo, vertex-, index- and command stream buffers can be released 
> when the TA (aka BINNER) is done, whereas texture buffers only when the 
> RASTERIZER done type signals.
> All for the same sequence of commands. So if we expose these stages, we 
> also must expose a way for the waiting code to make sure that they 
> eventually signal.

I will not be able to comment on unreleased hardware designs and how
they affect the DRM design. Suffice it to say that this granularity of
signalling seems unnecessary to me; a single flush operation that was
signalled at the end of the sequence to free all of these buffers seems
more than adequate.

> Let's say you want to implement flushing separately from the fencing 
> code and expose only a single EXE type, the current generic code indeed 
> allows you to do that.

I believe this should be the generic model, not cobbled on top of the
current complexity. You've exposed a driver-internal state machine to
the generic code in ways which do not elucidate the function of either
the hardware or the generic operations. I cannot see how this benefits
anyone; drivers still need to implement the flushing state machine
internally, but instead of polling back up to the generic wait loop,
they would poll internally instead.

> Right. I agree a driver-specific wait is a good idea.
> However we must be careful about how the word "polling" is used.
> A fence flush may well be used to turn on IRQ flags in the presence of 
> waiters.

A driver-specific wait function would be able to enable interrupts as
necessary, block waiting for them, and return. I think this would
actually solve the whole problem: there wouldn't need to be any
generic 'flush' flags, as all of that state would become
driver-specific.
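
To make this concrete, here is a minimal sketch of what such a
driver-supplied wait hook might look like. None of these structure or
function names exist in the DRM tree; they are assumptions purely for
illustration:

/*
 * Hypothetical sketch only -- the point is that all IRQ and flush
 * bookkeeping lives behind a single driver hook, and the generic
 * layer stops tracking any of that state.
 */
struct my_fence;

struct my_fence_driver_ops {
        /*
         * Block until the fence has signalled the requested type mask.
         * The driver may enable interrupts, poll a breadcrumb, or sleep
         * on a wait queue -- the generic code no longer needs to know.
         */
        int (*wait)(struct my_fence *fence, unsigned int type_mask,
                    int interruptible);
};

/* Generic wait: no generic 'flush' flags, no generic polling loop. */
static int my_fence_wait(struct my_fence_driver_ops *ops,
                         struct my_fence *fence,
                         unsigned int type_mask, int interruptible)
{
        return ops->wait(fence, type_mask, interruptible);
}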

> And how would a user-space sub-allocator distinguish between EXE buffer 
> idle and
> Texture buffer idle for the same command sequence?

It wouldn't need to -- it would 'flush' either buffer, and the flush
code would know, from the previous accesses, how to eliminate references
to those buffers from the GPU.

> Yes, or If the IRQs are turned off because there are no waiters.
> But this is not an expensive polling. A waiter would do a single poll 
> and then wait.

>     * Fence types are needed to expose completion steps to be able to
>       reuse buffers without excessive waits.

I believe a GPU-pending buffer queue would work better for this -- each
'flush' type would walk that list and remove buffers that have no active
GPU access. I don't see how fences have anything to do with this, and,
again, requiring that every buffer be associated with a specific
flushing fence is tantamount to requiring that every fence include a
flush sequence.
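
As a rough illustration of what I mean by a GPU-pending buffer queue,
here is a kernel-style sketch; all of the identifiers are made up for
the example and are not part of the current code:

/*
 * Rough sketch of a GPU-busy buffer queue.  A buffer is placed on the
 * queue at submission time with a mask of the GPU access classes that
 * reference it; no fence is involved anywhere.
 */
#include <linux/list.h>
#include <linux/spinlock.h>

struct busy_buffer {
        struct list_head link;          /* entry on the device busy list */
        unsigned int pending_access;    /* GPU access classes still live */
};

struct busy_queue {
        spinlock_t lock;
        struct list_head list;
};

/*
 * Called whenever a flush covering 'flushed_mask' is known to have
 * completed, whether it was emitted explicitly or implied by some
 * hardware event.  Buffers with no remaining GPU access are released.
 */
static void busy_queue_flush_done(struct busy_queue *q,
                                  unsigned int flushed_mask)
{
        struct busy_buffer *buf, *tmp;

        spin_lock(&q->lock);
        list_for_each_entry_safe(buf, tmp, &q->list, link) {
                buf->pending_access &= ~flushed_mask;
                if (!buf->pending_access)
                        list_del_init(&buf->link);      /* safe to reuse */
        }
        spin_unlock(&q->lock);
}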

>     * The current implementation with the extensions outlined above and
>       in previous mails would be able to do what you want in a close-to
>       optimal way.

I really don't care about how much code we have to rewrite; let's get
the solution correct, at least for hardware that is reasonably well
known to the community.

>     * I want to avoid core flush-specific buffer object lists as I
>       believe they duplicate information easily obtainable by other means.

I disagree -- as long as buffers are referenced only by fences, those
fences must always include a flush.

>     * I want to avoid core pending flush lists at this point as most
>       current hardware doesn't need to use them. I believe currently
>       Intel and Radeon would be the only ones. 

I'd like to encourage you to not marginalize these two architectures.

> Poulsbo has very small
>       (16x16pixels) rendering caches since it's tile based, so it
>       doesn't implement any separate flushing, and probably wouldn't
>       benefit from it anyway. Unichromes and SiS apparently flush when
>       switching from 3D to 2D, and that's done when a breadcrumb write
>       is executed.

Not requiring flushing is not the same as not wanting some queue of
buffers that can be flushed.  Implicit flushing is just as useful as
explicit flushing here; you would walk the pending flush list and free
up all buffers covered by whatever flush operation was executed.

>     * GPU flushing need not be tied to the fencing mechanism, but there
>       must be a way (exposed or not) for a fence waiter to initiate a
>       flush which hasn't yet been queued.

Yes, an explicit 'flush' API in the generic layer would be used when
changing access modes. When switching from GPU to CPU, that API would
block as needed using fences or polling as appropriate.
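
Something along these lines is what I have in mind, sketched with
hypothetical names (neither the structure nor the hook exists today):

/*
 * Sketch of the access-mode switch from GPU to CPU; the field and
 * function names are assumptions, not existing interfaces.
 */
#include <linux/errno.h>

struct cpu_access_buffer {
        unsigned int pending_access;    /* outstanding GPU access classes */
        int (*flush_and_wait)(struct cpu_access_buffer *buf,
                              unsigned int access_mask); /* driver hook */
};

/* Called by the generic layer before handing a buffer back to the CPU. */
static int buffer_make_cpu_visible(struct cpu_access_buffer *buf, int no_wait)
{
        if (!buf->pending_access)
                return 0;               /* already idle */
        if (no_wait)
                return -EBUSY;
        /*
         * Ask the driver to emit (or wait for) whatever flush covers the
         * outstanding access, then block until it completes.
         */
        return buf->flush_and_wait(buf, buf->pending_access);
}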

> Are there any other issues I fail to see here such as problems flushing 
> at wait time due to ring locking issues etc?

I think the biggest issue right now is that buffers must be tied to
flushing fences every time they are used -- otherwise, they will never
be known to have been flushed. 

> The by far easiest way to remedy both those situations is if you could 
> provide a detailed example (in pseudo-code or whatever) where you feel 
> the suggested solution fails to do what you want or would issue an 
> unnecessary flush (such as a flush that is not issued on-demand or 
> duplicated) or an unnecessary wait?

> If, on the other hand we agree that both solutions are comparable in 
> terms of efficiency and capabilities we should probably start listing 
> important pros and cons, otherwise we could continue this discussion 
> forever....

The current implementation needs fixing to eliminate polling in the
generic code; that seems to require a driver-specific wait function.

It also needs some mechanism to mark buffers as no longer referenced by
the GPU which does not depend on having each buffer tied to a flushing
fence.

The first will eliminate the state machine exposed by the current
generic fencing code; the second will (I believe) require some kind of
GPU-busy queue of buffers. With those changes in place, we can change
drivers to signal buffers as GPU-idle whenever suitable hardware
notifications are made.

I will try to implement this so that we can see how it would work in
practice. At this point, we've strayed far enough from the current
design that it's hard to see what the overall result would look like
without doing some code.

-- 
[EMAIL PROTECTED]
