On Tue, 2007-12-25 at 14:24 +0100, Thomas Hellström wrote:
  
> The fence type bits are solely intended to indicate different completion 
> stages, so if the union of access modes defined as (RWX) is the
> same for all fence objects of a particular driver, you're right. Fence 
> flushing of a particular fence type is intended to make sure that that
> fence type will eventually signal. This functionality (which may seem 
> awkward and unnecessary at first) can be used to turn on / off irqs in 
> the presence / absence of waiters and can be used to perform things like 
> RW flushing on Intel chipsets.

RW flushing is independent of fencing for most use cases -- when
switching between reading and writing operations in the ring, there's no
need to fence the operation, only to place a flush in the command
stream. The only time when flushing is related to fencing is when you
need to wait for a fence to expire with a flush operation in front of
it.
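
To make the distinction concrete, a minimal sketch of the ring case; emit_flush(), emit_batch() and struct my_batch are invented helpers, not the current i915 entry points:

/* When the ring switches from writing to reading the same memory, an
 * MI_FLUSH goes into the command stream between the two batches; no
 * fence is created or waited on. */
static void submit_after_mode_switch(struct drm_device *dev,
                                     struct my_batch *batch,
                                     int prev_was_write, int now_reading)
{
    if (prev_was_write && now_reading)
        emit_flush(dev);        /* MI_FLUSH into the command stream */
    emit_batch(dev, batch);     /* later commands see coherent data */
}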

Mixing flushing and fencing in the same mechanism has made that very
difficult to understand, and I believe it has removed the common
optimization where flushing needn't wait on a fence.

I'd much rather see flushing as a distinct operation, which may (if
necessary) wait for a fence on its own before completing the flush. As
all operations involving flushing are executed in non-interrupt context,
flushes can block on fences quite easily. Having the generic code
provide a simple 'block on the ring' mechanism which drivers can use to
implement their own synchronization seems clearer than attempting to
solve a general case we don't even completely understand yet.
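
For instance, something along these lines; drm_ring_block(), struct my_ring and its has_passed() hook are assumptions of mine for illustration, not existing DRM interfaces:

#include <linux/wait.h>
#include <linux/types.h>

/* Hypothetical per-ring state kept by the driver. */
struct my_ring {
    wait_queue_head_t wait;     /* woken from the breadcrumb IRQ or a poll loop */
    int (*has_passed)(struct my_ring *ring, uint32_t seq);
};

/* Generic "block on the ring": sleep until the driver says the ring has
 * consumed everything up to 'seq'.  Always called from process context,
 * so sleeping is fine. */
static int drm_ring_block(struct my_ring *ring, uint32_t seq)
{
    return wait_event_interruptible(ring->wait,
                                    ring->has_passed(ring, seq));
}

A driver's flush path could then call this before completing the flush, without any of it going through the fence machinery.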

> Other uses I foresee is, (again with Unichrome as an example), marking a 
> buffer for access with the mpeg- or the HQV engine. These use the same 
> command submission mechanism, but have separate IRQs or status bits to 
> indicate engine idle.

Right, having different fence types for different notification
mechanisms seems necessary and useful for such hardware. Intel has only
a single mechanism for blocking; however, we may want to use a different
fence type to indicate fences coming from the higher-priority ring. Good
stuff.

> It's important here that the delayed deletion of buffers is considered.
> The code is currently using the fence flush mechanism to avoid
> busy-waiting also in the case where only polling is available.

I'd say a simpler mechanism would be to use a helper thread that polls
when such delayed work needs to be done without having an obvious
process context available. It seems like you're counting on having some
process polling the hardware to clean things up, but we need to deal
with the idle-process case, where we want to release resources
reasonably quickly instead of waiting for that process to wake up and
process more requests.
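
Something along these lines would do; the mydrv_* names, the 10 ms interval and the helpers they call are placeholders I've invented, but the delayed-work API is the standard kernel one:

#include <linux/workqueue.h>
#include <linux/jiffies.h>

/* Placeholders for the driver-specific polling and freeing. */
static int mydrv_have_pending_buffers(void);
static void mydrv_free_completed_buffers(void);

static struct delayed_work cleanup_work;

/* Poll for completed buffers from a worker instead of relying on some
 * other process happening to enter the driver. */
static void mydrv_delayed_cleanup(struct work_struct *work)
{
    /* Check the hardware status page / engine-idle bits and free any
     * buffers whose commands have completed. */
    mydrv_free_completed_buffers();

    /* Re-arm only while delayed deletions are still outstanding. */
    if (mydrv_have_pending_buffers())
        schedule_delayed_work(&cleanup_work, msecs_to_jiffies(10));
}

/* Called at driver initialization. */
static void mydrv_init_cleanup(void)
{
    INIT_DELAYED_WORK(&cleanup_work, mydrv_delayed_cleanup);
    schedule_delayed_work(&cleanup_work, msecs_to_jiffies(10));
}

That keeps cleanup latency bounded regardless of whether any client ever touches the device again.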
 
> Yes, we can provide a driver-specific optimized fence_wait method, 
> perhaps with
> a default generic one. However, I do believe we still need to be able to 
> do fence_flush and fence_signaled
> considering the delayed delete mechanism. We probably also need to 
> provide example code for the
> pure IRQ and pure polling case.

Please consider removing the delayed delete mechanism from fencing --
the two are completely unrelated, and a polling driver should use a
worker thread to clean up instead of hoping that some other process will
come along and poll so that objects may be freed.

Fencing should be restricted to knowing when the GPU has completed a
sequence of commands.

> >> It will not require any synchronization with the extension mentioned. 
> >> Rather it's a restriction that a single buffer cannot be accessed in 
> >> both read and write mode by the same superioctl command submission. 
> >> There must be a call to drm_bo_do_validate() in between to switch access 
> >> mode on the buffer object and emit the MI_FLUSH operation, and that call 
> >> needs to be made using a new superioctl call. I'm not sure how else you 
> >> would tell the kernel that we are switching buffer access modes?
> >>     
> >
> > The question is where this information lies. The fence contains no
> > information about the access modes for each buffer, so that information
> > must be saved inside the buffer object. However, when an MI_FLUSH
> > operation is placed in the ring, *all* buffers get flushed at that
> > point, so we must update the flush pending status on all buffers at that
> > point.
> >   
> I can't really see why?

The fence only knows the global access modes, not specific modes for
each buffer. As I queue additional requests for that buffer, I must know
whether to insert a flush command for that buffer. I'll need a flush
command precisely when the current and previous access modes require it,
and when no other flush operation that covers this buffer has been
executed since the previous access was made.
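
To be explicit about the state I think this requires, here is a sketch; struct my_bo, the access bits and the helpers are all invented for illustration and are not the existing drm_buffer_object fields:

#include <linux/list.h>
#include <linux/types.h>

#define MY_ACCESS_READ  0x1
#define MY_ACCESS_WRITE 0x2

/* Hypothetical per-buffer state. */
struct my_bo {
    struct list_head flush_link;    /* on the to-be-flushed queue */
    uint32_t pending_access;        /* access mode of the last emitted commands */
    int flushed;                    /* a covering flush is already in the ring */
};

/* e.g. any transition out of write access needs a flush. */
static int access_modes_conflict(uint32_t old, uint32_t new)
{
    return (old & MY_ACCESS_WRITE) && (new & MY_ACCESS_READ);
}

/* Does queuing a new access to 'bo' require an in-ring flush first? */
static int bo_needs_flush(struct my_bo *bo, uint32_t new_access)
{
    if (!access_modes_conflict(bo->pending_access, new_access))
        return 0;
    /* Only if no covering flush has gone into the ring since the
     * buffer's last access. */
    return !bo->flushed;
}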

Yes, we can simplify this by adding more flushes than necessary, but I
would really like to make sure we don't *have* to do that. It's not
terrible now, but I can assure you that flushing won't get any cheaper
as caches get larger and GPUs get faster relative to memory.

In thinking about this today, I believe (echoing comments above) that
this flushing has nothing to do with fencing. All I need to do is queue
the flush operation to the ring and mark all to-be-flushed buffers as
'flushed'. Future access to that buffer will notice the 'flushed' status
and not emit an additional flush operation.

So, 'execbuffer' would first check whether the new access modes are
compatible with the old ones, emitting a flush operation if necessary.
Then, it would add the buffers being manipulated to the 'to be flushed'
queue, set some status in them indicating what kind of access needs to
be flushed (read, write or exe) and emit the buffer commands to the
ring.
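
Sketched out, reusing the invented struct my_bo and bo_needs_flush() from above (emit_flush(), emit_commands() and the list are likewise made up):

/* Buffers which still owe a flush to somebody. */
static LIST_HEAD(flush_pending);

static void execbuffer_sketch(struct drm_device *dev,
                              struct my_bo **bo, int nbo,
                              uint32_t *new_access)
{
    struct my_bo *b;
    int i, need_flush = 0;

    /* 1. Do any of the new access modes conflict with the old ones? */
    for (i = 0; i < nbo; i++)
        if (bo_needs_flush(bo[i], new_access[i]))
            need_flush = 1;

    /* One MI_FLUSH covers everything queued so far, so mark every
     * buffer on the to-be-flushed queue as flushed. */
    if (need_flush) {
        emit_flush(dev);
        list_for_each_entry(b, &flush_pending, flush_link)
            b->flushed = 1;
    }

    /* 2. Queue this submission's buffers for a future flush and note
     * what kind of access will need flushing. */
    for (i = 0; i < nbo; i++) {
        bo[i]->flushed = 0;
        bo[i]->pending_access = new_access[i];
        list_move_tail(&bo[i]->flush_link, &flush_pending);
    }

    /* 3. Only then emit the actual buffer commands. */
    emit_commands(dev, bo, nbo);
}

No fence is created or waited for anywhere in that path.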

We'd keep two separate 'flush status' entries -- one for the GPU, and
one for the CPU. The CPU one would keep the fence covering the flush
operation. When the buffer is pulled from the GTT, the buffer would be
flushed if necessary (if the buffer was on the pending fence list), and
then the appropriate fence waited for.
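
In sketch form, with the same caveat that all names (emit_flush_and_fence(), my_fence_wait(), struct my_fence) are assumptions of mine:

/* Split the per-buffer flush status into a GPU side and a CPU side, and
 * remember the fence covering the in-ring flush. */
struct my_bo_flush_state {
    int gpu_flushed;                /* an MI_FLUSH is in the ring behind the last access */
    int cpu_flushed;                /* that flush is known to have executed */
    struct my_fence *flush_fence;   /* fence emitted right after the flush */
    struct list_head fence_link;    /* on that fence's covered-buffers list */
};

/* Pulling the buffer out of the GTT is the only place the CPU has to
 * wait: emit a flush (plus fence) if none is queued, then wait for it. */
static void evict_sketch(struct my_bo_flush_state *fs)
{
    if (!fs->gpu_flushed)
        emit_flush_and_fence(fs);   /* sets flush_fence and gpu_flushed */
    if (!fs->cpu_flushed) {
        my_fence_wait(fs->flush_fence);
        fs->cpu_flushed = 1;
    }
    /* Now it is safe to unbind the pages from the GTT. */
}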

> However, keeping track of all buffers with a certain pending flush 
> status seems very awkward to me, so if the buffer carries a reference to 
> the completion information together with its access mode we only need 
> to update the completion information which is not per buffer:
> 
> a) The operation in question. (which would be the fence object).
> b) The completion status of any flushes needed to complete the operation 
> (which would be the fence object signaled flags).
> c) Whether there are any flushes pending in the command queue that would 
> perform a particular flush operation on a fence sequence number of 
> interest. (This wasn't needed with the sync flush).
> d) The knowledge that any flush performed will update the signaled flags 
> of all preceding fence objects. (This is built into the current 
> implementation).
> 
> Now, c) isn't implemented and you would need it for flush type 2) above. 
> However it would be easy just to keep the breadcrumb sequence number of 
> the fence object corresponding to the last submitted sequence flush in 
> the driver-private data structure, remembering that a non-signaled fence 
> object always has a valid sequence number, and the global flush sequence 
> numbers can be marked invalid when they are equal to the current 
> breadcrumb sequence number. Care must be taken to avoid any wraparound 
> issues.

I don't see how this discovers which buffers have been flushed by a
particular flush operation as seen by the CPU though -- when the fence
passes, somehow the buffers which were flushed at that point, and which
have not been accessed since then, must be found and signalled.
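
Concretely, I'd expect something like a list hung off each flush-covering fence that gets walked when the fence passes; again every name here is invented, building on the sketches above:

/* Hypothetical fence carrying the list of flush states it covers. */
struct my_fence {
    struct list_head covered_buffers;
    /* ... breadcrumb sequence number, signaled flags, etc. */
};

/* Called when the fence emitted right after an MI_FLUSH signals: every
 * buffer whose flush it covered, and which has not been re-accessed
 * since, can now be marked clean on the CPU side. */
static void flush_fence_signaled(struct my_fence *fence)
{
    struct my_bo_flush_state *fs, *tmp;

    list_for_each_entry_safe(fs, tmp, &fence->covered_buffers, fence_link) {
        list_del_init(&fs->fence_link);
        if (fs->flush_fence == fence)   /* not re-accessed since */
            fs->cpu_flushed = 1;
    }
}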

> 
> > I don't believe this is equivalent to fencing -- most of the time,
> > there's no need for the process to wait for the flush to occur, only to
> > know that it has been queued. Of course, if you want to pull the object
> > from the GTT, you'll have to wait for that flush to execute, which means
> > knowing which buffers are flushed at each breadcrumb. Which could use
> > the fencing mechanism, for in-ring flush operations.
> >   
> But the code wouldn't wait for the flush to occur.

The 'flush' is an in-ring operation in this case, followed by a normal
fence which could then be waited for. For polling-only hardware, the
'flush' operation would execute the flush synchronously, without using
the fencing mechanism.
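
Something like this, with hypothetical driver hooks (hw_has_inring_flush(), emit_fence(), wait_for_engine_idle()):

/* One 'flush' entry point, two flavours: with an in-ring flush the work
 * is queued and a fence returned for whoever eventually needs to wait;
 * polling-only hardware just flushes synchronously. */
static struct my_fence *my_flush(struct drm_device *dev)
{
    if (hw_has_inring_flush(dev)) {
        emit_flush(dev);            /* MI_FLUSH or equivalent */
        return emit_fence(dev);     /* waited on only if somebody needs to */
    }

    wait_for_engine_idle(dev);      /* synchronous, polled */
    return NULL;                    /* nothing left to wait for */
}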

> If we take a look, again, at example 2 above and let's say we need to 
> change status from write to read of three buffers a, b and c during the 
> validation sequence of a superioctl submission:
> 
> 1) The validation code inspects the per buffer access mode information 
> bo->fence_type of buffer a) and determines that the WRITE flag is going 
> to be cleared on that buffer. Hence an in-ring write flush might be needed.
> 2) The validation code inspects the fence object pointed to by the 
> buffer. The WRITE flush type has not signaled.

But, I don't need to wait for the flush to signal; I only need to know
that a flush is in the ring after all previous access to the buffer.
Forcing me to wait for the flush to become signalled means a huge delay
in execution.

> 3) The validation code emits an MI_FLUSH, updates the breadcrumb 
> counter, updates the last_write_flush sequence to the current breadcrumb 
> number and then emits a new fence object, f_new.

I don't need a fence here -- I only need to place the flush in the ring.
Of course, a fence might be useful in the future if this buffer was ever
needed by the CPU. Emitting a fence is costly, and not usually necessary
given sufficient GTT space.

> 4) The validation code dereferences the old fence object pointed to by 
> buffer a), f_old.
> 5) The validation code points the buffer to the new fence object, f_new, 
> and places the buffer object on the unfenced list. The new per buffer 
> access mode information bo->new_fence_type is set to contain the READ flag.
> 6) The validation code moves on to buffer b) and determines that the 
> WRITE flag is going to be cleared on that buffer. Hence an in-ring write 
> flush might be needed.
> 7) Buffer b) still has a pointer to f_old; the WRITE flush type has 
> still not signaled.
> 8) The validation code determines that the sequence number of f_old is 
> less than or equal to the last_write_flush sequence, and that the 
> last_write_flush sequence is valid, and therefore doesn't emit a flush. 
> It puts buffer b) on the unfenced list, and the new per buffer access 
> mode information bo->new_fence_type is set to contain the READ flag.
> 9) The validation code moves on to buffer c) and determines that the 
> WRITE flag is going to be cleared on that buffer. Hence an in-ring write 
> flush is needed.
> 10) Buffer c) still has a pointer to f_old, but by now the in-ring flush 
> emitted for buffer a) has executed, and f_old has been updated by 
> the fence IRQ handler so that the WRITE type (and perhaps even the READ 
> type) has signaled. The buffer's fence pointer is set to NULL and f_old 
> is dereferenced. Buffer c) is put on the unfenced list, and the new per 
> buffer access mode information bo->new_fence_flags is set to contain 
> the READ flag.
> 11a) Commands are put on the ring, a new fence object f_final is 
> created, and buffer objects are updated to point to the new fence 
> object, without waiting for any old fence objects to signal. Per buffer 
> access information bo->fence_flags := bo->new_fence_flags. Old fence 
> objects are dereferenced as appropriate, and buffers are moved off the 
> unfenced list.
> 11b) (Error scenario) Command submission is interrupted by a mouse-move 
> signal, returning an -EAGAIN. Buffers are moved off the unfenced list, 
> retaining their old per buffer access mode information bo->fence_flags, 
> and their old fence objects pointers. a) points to f_new, b) points to 
> f_old and c) has a NULL fence object pointer. The caller is required to 
> re-run the superioctl call.
> 
> So while I can see the need for a new driver entry point when the old 
> and new access modes are incompatible
> (currently the code idles the buffer here), I still fail to see the need 
> for any additional mechanism as it feels natural to otherwise only 
> submit flushes on-demand (from the fence waiting code), and make sure 
> that all flush completion information is properly propagated to all 
> possible users of that information.

Certainly we want to emit flushes on demand, but I don't believe fences
correctly capture the appropriate demand. In particular, we need never
wait for a fence when deciding whether to emit an in-ring flush. We only
need to remember whether a flush has been submitted since the last
access to the buffer.
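
That test reduces to a sequence-number comparison, roughly as below; the names are invented and wraparound is handled only by the signed difference:

/* Has a flush been placed in the ring since this buffer's last access?
 * last_access_seq is the breadcrumb of the newest commands touching the
 * buffer; last_flush_seq is the breadcrumb of the newest MI_FLUSH. */
static int flush_already_queued(uint32_t last_access_seq,
                                uint32_t last_flush_seq)
{
    return (int32_t)(last_flush_seq - last_access_seq) > 0;
}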

> I also can't see the need for additional per-buffer flush pending 
> information or per-flush-type buffer lists, as the fence code already 
> updates all active fence objects with any relevant flush information.
> Could you give a detailed example when the current code is less 
> efficient or fails to do what you want here?

Fence state updates only when the CPU sees a state change from the GPU;
flushing needs separate state to track what is already pending ahead in
the ring.

> Perhaps it's also worth considering doing flush example 2) from user-space?
> If user-space detects that a buffer is changing mode, just emit an 
> MI_FLUSH to the command stream.

User space cannot know what operations are happening across
applications, so it would not be able to perform this operation
reliably.

Also, kernel mode must know the pending access status of buffers so that
it can perform flushing when pulling buffers from the GTT, and hence it
must track buffer access in any case.

> This would, however, also require that the superioctl code emits an 
> MI_FLUSH when it switches to a new client.

Right, something which we can avoid by doing all flushing in kernel
mode.

-- 
[EMAIL PROTECTED]

