ttm: handle blitter failure on DG2

Matthew Auld Thu, 23 Jun 2022 08:31:32 -0700

On 23/06/2022 15:52, Christian König wrote:

On 23.06.22 at 16:13, Matthew Auld wrote:

[SNIP]

TTM_BO_VM_NUM_PREFAULT);
+               /*
+                * Ensure we check for any fatal errors if we had to move/clear
+                * the object. The device should already be wedged if we hit
+                * such an error.
+                */
+               if (i915_gem_object_wait_moving_fence(obj, true))
+                       ret = VM_FAULT_SIGBUS;


We should check with Christian here whether it's ok to export
ttm_bo_vm_fault_idle() as a helper, so that we release the proper locks
while waiting. The above is not a bug, but causes us to wait for the
moving fence under the mmap_lock, which is considered bad.

Christian, any chance we can export ttm_bo_vm_fault_idle() for use here? Or is that NACK?


Well question is why you want to do this? E.g. what's the background?

Right, so basically we need to prevent userspace from being able to access the pages for the object, if the ttm blit/move hits an error (some kind of GPU error). Normally we can just fall back to memcpy/memset to ensure we never leak anything (i915 is never allowed to hand userspace non-zeroed memory even for VRAM), but with small-BAR systems this might not be possible. Anyway, if we do hit an error during the ttm move we might now mark the object as being in an "unknown state" before signalling the fence. Later when binding the GPU page-tables we check for the "unknown state" and skip the bind (it will end up just pointing to some scratch pages instead). And then here on the CPU side, we need to sync against all the kernel fences, before then checking for the potential "unknown state", which is then handled by returning SIGBUS. The i915_gem_object_wait_moving_fence() is basically doing exactly that, but it looks dumb compared to what ttm_bo_vm_fault_idle() is doing. And then while all this is going on we then also "wedge" the device to basically signal that it's busted, which should prevent further work being submitted to the gpu.
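
To make the flow described above easier to follow, here is a rough userspace model of it. This is only a sketch: every name in it (fake_device, fake_object, fake_move_complete() and so on) is a hypothetical stand-in invented for illustration, none of it is the actual i915 or TTM code, and the "wait" does not block because the model has no real fences.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only, not kernel definitions. */
enum fake_fault { FAKE_FAULT_OK = 0, FAKE_FAULT_SIGBUS = 1 };

struct fake_device {
	bool wedged;              /* set once the GPU is considered busted */
};

struct fake_object {
	bool move_fence_signaled; /* models the kernel "moving" fence */
	bool unknown_state;       /* contents may be stale or non-zeroed */
};

/*
 * Models the end of a move/clear. If the blit failed and there is no
 * memcpy/memset fallback (the small-BAR case), mark the object as being
 * in an unknown state BEFORE signalling the fence, and wedge the device.
 */
static inline void fake_move_complete(struct fake_device *dev,
				      struct fake_object *obj,
				      bool blit_ok, bool can_memcpy)
{
	if (!blit_ok && !can_memcpy) {
		obj->unknown_state = true;
		dev->wedged = true;
	}
	obj->move_fence_signaled = true;
}

/*
 * Models syncing against the moving fence and then checking the state.
 * Simplification: an unsignaled fence is reported as an error here,
 * where real code would wait for it instead.
 */
static inline int fake_wait_moving_fence(const struct fake_object *obj)
{
	if (!obj->move_fence_signaled)
		return -1;
	return obj->unknown_state ? -1 : 0;
}

/* CPU side: refuse to map pages whose contents are unknown. */
static inline enum fake_fault fake_cpu_fault(const struct fake_object *obj)
{
	if (fake_wait_moving_fence(obj))
		return FAKE_FAULT_SIGBUS;
	return FAKE_FAULT_OK;
}

/* GPU side: false means "skip the bind, leave scratch pages in place". */
static inline bool fake_gpu_bind(const struct fake_object *obj)
{
	return !obj->unknown_state;
}
```

The ordering in fake_move_complete() is the point of the model: since the unknown state is set before the fence signals, anything that has synced against the fence is guaranteed to see it.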


Regards,
Christian.

Re: [Intel-gfx] [PATCH v2 10/12] drm/i915/ttm: handle blitter failure on DG2
