Re: [Intel-gfx] [PATCH] drm/i915/gt: Do not add same i915_request to intel_context twice

2021-12-14 Thread Tvrtko Ursulin



On 14/12/2021 05:58, Yang, Dong wrote:

> Thanks Tvrtko, I will try the patch you mentioned.
>
> BTW, what do you think about us using this patch in our project? Could it
> have any side effects? If not, we can take it as a WA, a temporary fix
> until we get the final root-cause fix.


For side effects I can't be sure. Best to try backporting and see if it 
fixes your issue, but note backporting may be tricky and you may end up 
pulling other patches as well.


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH] drm/i915/gt: Do not add same i915_request to intel_context twice

2021-12-13 Thread Yang, Dong
Thanks Tvrtko, I will try the patch you mentioned.

BTW, what do you think about us using this patch in our project? Could it
have any side effects? If not, we can take it as a WA, a temporary fix
until we get the final root-cause fix.

Thanks,
Dong



Re: [Intel-gfx] [PATCH] drm/i915/gt: Do not add same i915_request to intel_context twice

2021-12-13 Thread Tvrtko Ursulin



On 13/12/2021 01:53, Yang, Dong wrote:

> I am working on a customized kernel based on 5.4.39. The issue can only be
> reproduced when the system is under low-memory pressure and tries to reclaim
> memory; the erroneous double insert of the i915_request then comes from the
> i915_gem_shrink() path.


5.4 is quite old and there have been fixes to this code since. Any chance that 
you can repro on drm-tip? What project are you working on?

Is your bug perhaps similar to what c744d50363b7 ("drm/i915/gt: Split the breadcrumb 
spinlock between global and contexts") fixed? As the commit says:

"""
 Furthermore, this closes the race between enabling the signaling context
 while it is in the process of being signaled and removed:
"""



> i915_request_enable_breadcrumb+0x136/0x14a
> dma_fence_enable_sw_signaling+0x47/0xb0
> enable_signaling+0x66/0x80
> i915_active_wait+0xc1/0x150
> __i915_vma_unbind+0x17/0x1a0
> i915_vma_unbind+0x47/0xc0
> i915_gem_object_unbind+0x189/0x290
> i915_gem_shrink+0x139/0x460
> ? __pm_runtime_resume+0x53/0x70
> i915_gem_shrinker_scan+0x9c/0xb0
> do_shrink_slab+0x14f/0x2b0
> shrink_slab+0xa7/0x2a0
> shrink_node+0xd1/0x410
> balance_pgdat+0x2b7/0x500
> kswapd+0x1e2/0x3b0
>
> I believe it's not related to ce->signal_lock; the lock should work
> normally.
>
> i915_request_enable_breadcrumb() can be invoked from several contexts, such
> as from ioctl(), from interrupt context, and from the memory swap thread. I
> suggest adding a double check before inserting the i915_request into the
> list. It's hard to ensure a valid call from all the paths, but adding the
> check avoids the critical effect: adding the same i915_request twice
> triggers a dead loop in signal_irq_work(), and the loop never breaks. It
> continues, the i915_request.hwsp_seqno gets changed, and an invalid address
> access error is reported, followed by a system panic.


Maybe, but I was pointing out that a double insert_breadcrumb is already
protected against when called inside i915_request_enable_breadcrumb, by virtue
of the spinlock and I915_FENCE_FLAG_SIGNAL. So maybe it is a race with remove or
something, but it looks unlikely that it is a simple double add due to parallel
enablement.
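
For readers following along, the guard described above (a flag checked and set while holding the per-context lock) can be sketched in user space. This is only an illustration under stated assumptions: `struct node`, `struct context`, `struct request`, `enable()`, `cancel()` and `count_entries()` are hypothetical stand-ins, the `queued` field plays the role of I915_FENCE_FLAG_SIGNAL, and a C11 `atomic_flag` spinlock plays the role of ce->signal_lock. It is not the i915 implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of this is i915 code. */
struct node {
	struct node *prev, *next;
};

struct context {
	atomic_flag lock;	/* plays the role of ce->signal_lock */
	struct node signals;	/* circular list head, like ce->signals */
};

struct request {
	struct context *ce;
	struct node link;	/* like rq->signal_link */
	bool queued;		/* plays the role of I915_FENCE_FLAG_SIGNAL */
};

static void ctx_lock(struct context *ce)
{
	while (atomic_flag_test_and_set_explicit(&ce->lock, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void ctx_unlock(struct context *ce)
{
	atomic_flag_clear_explicit(&ce->lock, memory_order_release);
}

void context_init(struct context *ce)
{
	atomic_flag_clear(&ce->lock);
	ce->signals.prev = ce->signals.next = &ce->signals;
}

/* Caller must hold the context lock. Returns true only on a real insert. */
static bool insert_locked(struct request *rq)
{
	struct node *head = &rq->ce->signals;

	if (rq->queued)		/* already on the list: do nothing */
		return false;

	/* Append at the tail of the circular list. */
	rq->link.prev = head->prev;
	rq->link.next = head;
	head->prev->next = &rq->link;
	head->prev = &rq->link;
	rq->queued = true;
	return true;
}

bool enable(struct request *rq)
{
	bool added;

	assert(rq && rq->ce);
	ctx_lock(rq->ce);
	added = insert_locked(rq);
	ctx_unlock(rq->ce);
	return added;
}

/* Removal clears the flag under the same lock, so a later enable() may re-add. */
bool cancel(struct request *rq)
{
	bool removed = false;

	ctx_lock(rq->ce);
	if (rq->queued) {
		rq->link.prev->next = rq->link.next;
		rq->link.next->prev = rq->link.prev;
		rq->queued = false;
		removed = true;
	}
	ctx_unlock(rq->ce);
	return removed;
}

int count_entries(struct context *ce)
{
	int n = 0;

	ctx_lock(ce);
	for (struct node *p = ce->signals.next; p != &ce->signals; p = p->next)
		n++;
	ctx_unlock(ce);
	return n;
}
```

Calling enable() twice on the same request leaves a single list entry, because the check and the flag update happen inside one critical section. The sketch says nothing about paths that touch the list or the flag without taking the lock, which is where a race with removal would have to come from.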

Regards,

Tvrtko






Re: [Intel-gfx] [PATCH] drm/i915/gt: Do not add same i915_request to intel_context twice

2021-12-12 Thread Yang, Dong
I am working on a customized kernel based on 5.4.39. The issue can only be
reproduced when the system is under low-memory pressure and tries to reclaim
memory; the erroneous double insert of the i915_request then comes from the
i915_gem_shrink() path.

i915_request_enable_breadcrumb+0x136/0x14a
dma_fence_enable_sw_signaling+0x47/0xb0
enable_signaling+0x66/0x80
i915_active_wait+0xc1/0x150
__i915_vma_unbind+0x17/0x1a0
i915_vma_unbind+0x47/0xc0
i915_gem_object_unbind+0x189/0x290
i915_gem_shrink+0x139/0x460
? __pm_runtime_resume+0x53/0x70
i915_gem_shrinker_scan+0x9c/0xb0
do_shrink_slab+0x14f/0x2b0
shrink_slab+0xa7/0x2a0
shrink_node+0xd1/0x410
balance_pgdat+0x2b7/0x500
kswapd+0x1e2/0x3b0

I believe it's not related to ce->signal_lock; the lock should work
normally.

i915_request_enable_breadcrumb() can be invoked from several contexts, such as
from ioctl(), from interrupt context, and from the memory swap thread. I
suggest adding a double check before inserting the i915_request into the list.
It's hard to ensure a valid call from all the paths, but adding the check
avoids the critical effect: adding the same i915_request twice triggers a dead
loop in signal_irq_work(), and the loop never breaks. It continues, the
i915_request.hwsp_seqno gets changed, and an invalid address access error is
reported, followed by a system panic.
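
To illustrate why adding a node that is already linked can hang a list walk, here is a minimal user-space sketch. It assumes a plain circular doubly linked list with the usual insert-after pointer updates; `list_init()`, `list_add()`, `walk_terminates()` and the two demo functions are hypothetical names for this example, not kernel code. The second demo inserts a node after itself, which is the case the proposed check in the patch tests for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical minimal circular list, for illustration only. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Insert 'new' right after 'prev' using the usual four pointer updates. */
static void list_add(struct list_head *new, struct list_head *prev)
{
	struct list_head *next = prev->next;

	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/*
 * Walk forward from the head and report whether the walk gets back to the
 * head within max_steps. A healthy circular list always does.
 */
static bool walk_terminates(const struct list_head *head, int max_steps)
{
	const struct list_head *p = head->next;

	assert(max_steps > 0);
	for (int i = 0; i < max_steps; i++) {
		if (p == head)
			return true;
		p = p->next;
	}
	return false;	/* gave up: the list no longer leads back to head */
}

bool demo_single_add(void)
{
	struct list_head head, a;

	list_init(&head);
	list_add(&a, &head);
	return walk_terminates(&head, 1000);
}

bool demo_double_add(void)
{
	struct list_head head, a;

	list_init(&head);
	list_add(&a, &head);
	list_add(&a, &a);	/* second add: the node is inserted after itself */
	return walk_terminates(&head, 1000);
}
```

After the second add, the node's next pointer refers to the node itself, so an unbounded walk starting at the head never reaches the head again. The step limit exists only so the demo can return.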

Thanks,
Dong



Re: [Intel-gfx] [PATCH] drm/i915/gt: Do not add same i915_request to intel_context twice

2021-12-10 Thread Tvrtko Ursulin



On 10/12/2021 01:31, dong.y...@intel.com wrote:

> From: "Yang, Dong"
>
> With unknown race condition, the i915_request will be added

What do you mean by unknown here?

> to intel_context list twice, and result in system panic.
>
> If node already exists then do not add it again.


Note the call chains are under ce->signal_lock and protecting from double add 
AFAICT:

static void insert_breadcrumb(struct i915_request *rq)
{
...
	if (test_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags))
		return;
...
	set_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags);


bool i915_request_enable_breadcrumb(struct i915_request *rq)
{
...
	spin_lock(&ce->signal_lock);
	if (test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags))
		insert_breadcrumb(rq);
	spin_unlock(&ce->signal_lock);


void i915_request_cancel_breadcrumb(struct i915_request *rq)
{
...
	spin_lock(&ce->signal_lock);
	if (!test_and_clear_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags)) {
		spin_unlock(&ce->signal_lock);
		return;
	}

void intel_context_remove_breadcrumbs(struct intel_context *ce,
				      struct intel_breadcrumbs *b)
{
...
	spin_lock_irqsave(&ce->signal_lock, flags);

	if (list_empty(&ce->signals))
		goto unlock;

	list_for_each_entry_safe(rq, rn, &ce->signals, signal_link) {
		GEM_BUG_ON(!__i915_request_is_complete(rq));
		if (!test_and_clear_bit(I915_FENCE_FLAG_SIGNAL,
					&rq->fence.flags))
			continue;

The last one in signal_irq_work is guarded by the __i915_request_is_complete 
check.

So I think more context is needed on how you found this may be an issue.

Regards,

Tvrtko



> Signed-off-by: Yang, Dong
> ---
>   drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
> index 209cf265bf74..9c7bc060d2ae 100644
> --- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
> @@ -387,6 +387,9 @@ static void insert_breadcrumb(struct i915_request *rq)
>  		}
>  	}
>  
> +	if (&rq->signal_link == pos)
> +		return;
> +
>  	i915_request_get(rq);
>  	list_add_rcu(&rq->signal_link, pos);
>  	GEM_BUG_ON(!check_signal_order(ce, rq));