Re: [Intel-gfx] [PATCH 2/8] drm/i915: Complete the fences as they are cancelled due to wedging

2018-12-03 Thread Tvrtko Ursulin


On 03/12/2018 11:36, Chris Wilson wrote:

We inspect the requests under the assumption that they will be marked as
completed when they are removed from the queue. Currently however, in the
process of wedging the requests will be removed from the queue before they
are completed, so rearrange the code to complete the fences before the
locks are dropped.

<1>[  354.473346] BUG: unable to handle kernel NULL pointer dereference at 
0250
<6>[  354.473363] PGD 0 P4D 0
<4>[  354.473370] Oops:  [#1] PREEMPT SMP PTI
<4>[  354.473380] CPU: 0 PID: 4470 Comm: gem_eio Tainted: G U
4.20.0-rc4-CI-CI_DRM_5216+ #1
<4>[  354.473393] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS 
JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4>[  354.473480] RIP: 0010:__i915_schedule+0x311/0x5e0 [i915]
<4>[  354.473490] Code: 49 89 44 24 20 4d 89 4c 24 28 4d 89 29 44 39 b3 a0 04 00 00 
7d 3a 41 8b 44 24 78 85 c0 74 13 48 8b 93 78 04 00 00 48 83 e2 fc <39> 82 50 02 00 00 
79 1e 44 89 b3 a0 04 00 00 48 8d bb d0 03 00 00


This confuses me, isn't the code segment usually at the end? And then 
you have another after the call trace which doesn't match 
__i915_scheduel.. anyways, _this_ code seems to be this part:


if (node_to_request(node)->global_seqno &&
 90d:   8b 43 78moveax,DWORD PTR [rbx+0x78]
 910:   85 c0   test   eax,eax
 912:   74 13   je 927 <__i915_schedule+0x317>

i915_seqno_passed(port_request(engine->execlists.port)->global_seqno,
 914:   49 8b 97 c0 04 00 00movrdx,QWORD PTR [r15+0x4c0]
 91b:   48 83 e2 fc andrdx,0xfffc
if (node_to_request(node)->global_seqno &&
 91f:   39 82 50 02 00 00   cmpDWORD PTR [rdx+0x250],eax
 925:   79 1e   jns945 <__i915_schedule+0x335>


<4>[  354.473515] RSP: 0018:c91bba90 EFLAGS: 00010046
<4>[  354.473524] RAX: 0003 RBX: 8882624c8008 RCX: 
f34a7378
<4>[  354.473535] RDX:  RSI:  RDI: 
8882624c8048
<4>[  354.473545] RBP: c91bbab0 R08: 5963f1f1 R09: 

<4>[  354.473556] R10: c91bba10 R11: 8882624c8060 R12: 
88824fdd7b98
<4>[  354.473567] R13: 88824fdd7bb8 R14: 0001 R15: 
88824fdd7750
<4>[  354.473578] FS:  7f44b4b5b980() GS:888277e0() 
knlGS:
<4>[  354.473590] CS:  0010 DS:  ES:  CR0: 80050033
<4>[  354.473599] CR2: 0250 CR3: 00026976e000 CR4: 
00340ef0


Given the registers above, I think it means this - eax is global_seqno 
of the node rq. rdx is is port_request so NULL and bang. No request in 
port, but why would there always be one at the point we are scheduling 
in a new request to the runnable queue?


Regards,

Tvrtko


<4>[  354.473611] Call Trace:
<4>[  354.473622]  ? lock_acquire+0xa6/0x1c0
<4>[  354.473677]  ? i915_schedule_bump_priority+0x57/0xd0 [i915]
<4>[  354.473736]  i915_schedule_bump_priority+0x72/0xd0 [i915]
<4>[  354.473792]  i915_request_wait+0x4db/0x840 [i915]
<4>[  354.473804]  ? get_pwq.isra.4+0x2c/0x50
<4>[  354.473813]  ? ___preempt_schedule+0x16/0x18
<4>[  354.473824]  ? wake_up_q+0x70/0x70
<4>[  354.473831]  ? wake_up_q+0x70/0x70
<4>[  354.473882]  ? gen6_rps_boost+0x118/0x120 [i915]
<4>[  354.473936]  i915_gem_object_wait_fence+0x8a/0x110 [i915]
<4>[  354.473991]  i915_gem_object_wait+0x113/0x500 [i915]
<4>[  354.474047]  i915_gem_wait_ioctl+0x11c/0x2f0 [i915]
<4>[  354.474101]  ? i915_gem_unset_wedged+0x210/0x210 [i915]
<4>[  354.474113]  drm_ioctl_kernel+0x81/0xf0
<4>[  354.474123]  drm_ioctl+0x2de/0x390
<4>[  354.474175]  ? i915_gem_unset_wedged+0x210/0x210 [i915]
<4>[  354.474187]  ? finish_task_switch+0x95/0x260
<4>[  354.474197]  ? lock_acquire+0xa6/0x1c0
<4>[  354.474207]  do_vfs_ioctl+0xa0/0x6e0
<4>[  354.474217]  ? __fget+0xfc/0x1e0
<4>[  354.474225]  ksys_ioctl+0x35/0x60
<4>[  354.474233]  __x64_sys_ioctl+0x11/0x20
<4>[  354.474241]  do_syscall_64+0x55/0x190
<4>[  354.474251]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  354.474260] RIP: 0033:0x7f44b3de65d7
<4>[  354.474267] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 
ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 
73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4>[  354.474293] RSP: 002b:7fff974948e8 EFLAGS: 0246 ORIG_RAX: 
0010
<4>[  354.474305] RAX: ffda RBX:  RCX: 
7f44b3de65d7
<4>[  354.474316] RDX: 7fff97494940 RSI: c010646c RDI: 
0007
<4>[  354.474327] RBP: 7fff97494940 R08:  R09: 
7f44b40bbc40
<4>[  354.474337] R10:  R11: 0246 R12: 
c010646c
<4>[  354.474348] R13: 0007 R14:  R15: 


v2: Avoid floating requests.
v3: Can't call dma_fence_signal() under the timeline lock!

Re: [Intel-gfx] [PATCH 2/8] drm/i915: Complete the fences as they are cancelled due to wedging

2018-12-03 Thread Chris Wilson
Quoting Tvrtko Ursulin (2018-12-03 17:11:59)
> 
> On 03/12/2018 11:36, Chris Wilson wrote:
> > We inspect the requests under the assumption that they will be marked as
> > completed when they are removed from the queue. Currently however, in the
> > process of wedging the requests will be removed from the queue before they
> > are completed, so rearrange the code to complete the fences before the
> > locks are dropped.
> > 
> > <1>[  354.473346] BUG: unable to handle kernel NULL pointer dereference at 
> > 0250
> > <6>[  354.473363] PGD 0 P4D 0
> > <4>[  354.473370] Oops:  [#1] PREEMPT SMP PTI
> > <4>[  354.473380] CPU: 0 PID: 4470 Comm: gem_eio Tainted: G U   
> >  4.20.0-rc4-CI-CI_DRM_5216+ #1
> > <4>[  354.473393] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS 
> > JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
> > <4>[  354.473480] RIP: 0010:__i915_schedule+0x311/0x5e0 [i915]
> > <4>[  354.473490] Code: 49 89 44 24 20 4d 89 4c 24 28 4d 89 29 44 39 b3 a0 
> > 04 00 00 7d 3a 41 8b 44 24 78 85 c0 74 13 48 8b 93 78 04 00 00 48 83 e2 fc 
> > <39> 82 50 02 00 00 79 1e 44 89 b3 a0 04 00 00 48 8d bb d0 03 00 00
> 
> This confuses me, isn't the code segment usually at the end?

*shrug* It was cut and paste.

> And then 
> you have another after the call trace which doesn't match 
> __i915_scheduel.. anyways, _this_ code seems to be this part:
> 
>  if (node_to_request(node)->global_seqno &&
>   90d:   8b 43 78moveax,DWORD PTR [rbx+0x78]
>   910:   85 c0   test   eax,eax
>   912:   74 13   je 927 <__i915_schedule+0x317>
>  
> i915_seqno_passed(port_request(engine->execlists.port)->global_seqno,
>   914:   49 8b 97 c0 04 00 00movrdx,QWORD PTR [r15+0x4c0]
>   91b:   48 83 e2 fc andrdx,0xfffc
>  if (node_to_request(node)->global_seqno &&
>   91f:   39 82 50 02 00 00   cmpDWORD PTR [rdx+0x250],eax
>   925:   79 1e   jns945 <__i915_schedule+0x335>
> 
> > <4>[  354.473515] RSP: 0018:c91bba90 EFLAGS: 00010046
> > <4>[  354.473524] RAX: 0003 RBX: 8882624c8008 RCX: 
> > f34a7378
> > <4>[  354.473535] RDX:  RSI:  RDI: 
> > 8882624c8048
> > <4>[  354.473545] RBP: c91bbab0 R08: 5963f1f1 R09: 
> > 
> > <4>[  354.473556] R10: c91bba10 R11: 8882624c8060 R12: 
> > 88824fdd7b98
> > <4>[  354.473567] R13: 88824fdd7bb8 R14: 0001 R15: 
> > 88824fdd7750
> > <4>[  354.473578] FS:  7f44b4b5b980() GS:888277e0() 
> > knlGS:
> > <4>[  354.473590] CS:  0010 DS:  ES:  CR0: 80050033
> > <4>[  354.473599] CR2: 0250 CR3: 00026976e000 CR4: 
> > 00340ef0
> 
> Given the registers above, I think it means this - eax is global_seqno 
> of the node rq. rdx is is port_request so NULL and bang. No request in 
> port, but why would there always be one at the point we are scheduling 
> in a new request to the runnable queue?

Correct. The answer, as I chose to interpret it, is because of the
incomplete submitted+dequeued requests during cancellation which this
patch attempts to address.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 2/8] drm/i915: Complete the fences as they are cancelled due to wedging

2018-12-04 Thread Tvrtko Ursulin


On 03/12/2018 17:36, Chris Wilson wrote:

Quoting Tvrtko Ursulin (2018-12-03 17:11:59)


On 03/12/2018 11:36, Chris Wilson wrote:

We inspect the requests under the assumption that they will be marked as
completed when they are removed from the queue. Currently however, in the
process of wedging the requests will be removed from the queue before they
are completed, so rearrange the code to complete the fences before the
locks are dropped.

<1>[  354.473346] BUG: unable to handle kernel NULL pointer dereference at 
0250
<6>[  354.473363] PGD 0 P4D 0
<4>[  354.473370] Oops:  [#1] PREEMPT SMP PTI
<4>[  354.473380] CPU: 0 PID: 4470 Comm: gem_eio Tainted: G U
4.20.0-rc4-CI-CI_DRM_5216+ #1
<4>[  354.473393] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS 
JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4>[  354.473480] RIP: 0010:__i915_schedule+0x311/0x5e0 [i915]
<4>[  354.473490] Code: 49 89 44 24 20 4d 89 4c 24 28 4d 89 29 44 39 b3 a0 04 00 00 
7d 3a 41 8b 44 24 78 85 c0 74 13 48 8b 93 78 04 00 00 48 83 e2 fc <39> 82 50 02 00 00 
79 1e 44 89 b3 a0 04 00 00 48 8d bb d0 03 00 00


This confuses me, isn't the code segment usually at the end?


*shrug* It was cut and paste.


And then
you have another after the call trace which doesn't match
__i915_scheduel.. anyways, _this_ code seems to be this part:

  if (node_to_request(node)->global_seqno &&
   90d:   8b 43 78moveax,DWORD PTR [rbx+0x78]
   910:   85 c0   test   eax,eax
   912:   74 13   je 927 <__i915_schedule+0x317>
  
i915_seqno_passed(port_request(engine->execlists.port)->global_seqno,

   914:   49 8b 97 c0 04 00 00movrdx,QWORD PTR [r15+0x4c0]
   91b:   48 83 e2 fc andrdx,0xfffc
  if (node_to_request(node)->global_seqno &&
   91f:   39 82 50 02 00 00   cmpDWORD PTR [rdx+0x250],eax
   925:   79 1e   jns945 <__i915_schedule+0x335>


<4>[  354.473515] RSP: 0018:c91bba90 EFLAGS: 00010046
<4>[  354.473524] RAX: 0003 RBX: 8882624c8008 RCX: 
f34a7378
<4>[  354.473535] RDX:  RSI:  RDI: 
8882624c8048
<4>[  354.473545] RBP: c91bbab0 R08: 5963f1f1 R09: 

<4>[  354.473556] R10: c91bba10 R11: 8882624c8060 R12: 
88824fdd7b98
<4>[  354.473567] R13: 88824fdd7bb8 R14: 0001 R15: 
88824fdd7750
<4>[  354.473578] FS:  7f44b4b5b980() GS:888277e0() 
knlGS:
<4>[  354.473590] CS:  0010 DS:  ES:  CR0: 80050033
<4>[  354.473599] CR2: 0250 CR3: 00026976e000 CR4: 
00340ef0


Given the registers above, I think it means this - eax is global_seqno
of the node rq. rdx is is port_request so NULL and bang. No request in
port, but why would there always be one at the point we are scheduling
in a new request to the runnable queue?


Correct. The answer, as I chose to interpret it, is because of the
incomplete submitted+dequeued requests during cancellation which this
patch attempts to address.


I couldn't find any other route to this state myself, so on the basis of 
that, but with a little bit of fear from "Could it have really been so 
much simpler all along?!":


Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx