On 21/06/16 11:44, Tvrtko Ursulin wrote:

On 17/06/16 12:05, john.c.harri...@intel.com wrote:
From: John Harrison <john.c.harri...@intel.com>

The intended usage model for struct fence is that the signalled status
should be set on demand rather than polled. That is, there should not
be a need for a 'signaled' function to be called everytime the status
is queried. Instead, 'something' should be done to enable a signal
callback from the hardware which will update the state directly. In
the case of requests, this is the seqno update interrupt. The idea is
that this callback will only be enabled on demand when something
actually tries to wait on the fence.

This change removes the polling test and replaces it with the callback
scheme. Each fence is added to a 'please poke me' list at the start of
i915_add_request(). The interrupt handler (via a deferred work queue)
then scans through the 'poke me' list when a new seqno pops out and
signals any matching fence/request. The fence is then removed from the
list so the entire request stack does not need to be scanned every
time. The fence is added to the list before the commands to generate
the seqno interrupt are added to the ring. Thus the sequence is
guaranteed to be race free if the interrupt is already enabled.

Note that the interrupt is only enabled on demand (i.e. when
__wait_request() is called). Thus there is still a potential race when
enabling the interrupt as the request may already have completed.
However, this is simply solved by calling the interrupt processing
code immediately after enabling the interrupt and thereby checking for
already completed requests.

Lastly, the ring clean up code has the possibility to cancel
outstanding requests (e.g. because TDR has reset the ring). These
requests will never get signalled and so must be removed from the
signal list manually. This is done by setting a 'cancelled' flag and
then calling the regular notify/retire code path rather than
attempting to duplicate the list manipulatation and clean up code in
multiple places. This also avoids any race condition where the
cancellation request might occur after/during the completion interrupt
actually arriving.

v2: Updated to take advantage of the request unreference no longer
requiring the mutex lock.

v3: Move the signal list processing around to prevent unsubmitted
requests being added to the list. This was occurring on Android
because the native sync implementation calls the
fence->enable_signalling API immediately on fence creation.

Updated after review comments by Tvrtko Ursulin. Renamed list nodes to
'link' instead of 'list'. Added support for returning an error code on
a cancelled fence. Update list processing to be more efficient/safer
with respect to spinlocks.

v5: Made i915_gem_request_submit a static as it is only ever called
from one place.

Fixed up the low latency wait optimisation. The time delay between the
seqno value being to memory and the drive's ISR running can be
significant, at least for the wait request micro-benchmark. This can
be greatly improved by explicitly checking for seqno updates in the
pre-wait busy poll loop. Also added some documentation comments to the
busy poll code.

Fixed up support for the faking of lost interrupts
(test_irq_rings/missed_irq_rings). That is, there is an IGT test that
tells the driver to loose interrupts deliberately and then check that
everything still works as expected (albeit much slower).

Updates from review comments: use non IRQ-save spinlocking, early exit
on WARN and improved comments (Tvrtko Ursulin).

v6: Updated to newer nigthly and resolved conflicts around the
wait_request busy spin optimisation. Also fixed a race condition
between this early exit path and the regular completion path.

v7: Updated to newer nightly - lots of ring -> engine renaming plus an
interface change on get_seqno(). Also added a list_empty() check
before acquring spinlocks and doing list processing.

v8: Updated to newer nightly - changes to request clean up code mean
non of the deferred free mess is needed any more.

v9: Moved the request completion processing out of the interrupt
handler and into a worker thread (Chris Wilson).

v10: Changed to an un-ordered work queue to allow parallel processing
of different engines. Also set the high priority flag for reduced
latency. Removed some unnecessary checks for invalid seqno values.
Improved/added some comments and WARNs. Moved a spinlock release a few
lines later to make the 'locked' parameter of
i915_gem_request_enable_interrupt redundant and removed it. Also
shuffled the function around in the file so as to make it static and
remove it from the header file. Corrected the use of
fence_signal_locked() to fence_signal() in the retire code. Dropped
the irq save part of the spin lock calls in the notify code as this is
no longer called from the ISR. Changed the call of
i915_gem_retire_requests_ring() in the reset cleanup code to
i915_gem_request_notify() instead as the former is just duplicating a
lot of operations.
[Review comments from Maarten Lankhorst & Tvrtko Ursulin]

Made the call to _notify() from _retire_requests_ring() conditional on
interrupts not being enabled as it is only a race condition work
around for that case. Also re-instated the lazy_coherency flag (but
now on the _notify() function) to reduce the overhead of
_retire_requests_ring() calling _notify() lots and lots (even with the
anti-interrupt check).

Updated for yet more nightly changes (u64 for fence context).

v10b: Re-ordered the fence signal and IRQ release in _notify() to fix
a race
condition when disabling interrupts. Also, moved the wake_up_all()
call from the IRQ handler to the worker thread to prevent the wake up
of waiting threads from overtaking the signalling of the request.

I was concerned by the second part of this change which will increase
the wake-up latency for the waiters and has asked John do do some quick
low-level (gem_latency) testing.

Preliminary results were a bit strange with small batches experiencing
the expected slowdown but large one being significantly faster.

Since he is out this week I will try and run some more benchmarks on this.

To re-iterate, concern is moving the wake_up_all(&engine->irq_queue)
from notify_ring (hard irq) to the fence notify worker. This adds one
additional scheduling wakeup latency to the waiters which use the i915 API.

Okay, benchmarking is done and good news first.

SynMark2 and GfxBench show probably only noise on almost all tests. What regresses is OglBatch[567] between 2-5%, OglMultithread by 10%, OglDrvState by 2% and OglDrvCtx by 30%.

As far as I remember those are not important in general so that is good.

But as expected low-level testing with gem_latency, with three sets of batches, zero length, 100 NOPs and 1000 NOPs show on average 5-6% less throughput and around 18% worse wakeup latency.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Reply via email to