Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-26 Thread Christian König
What I mean is - should we get rid of the dma_fence_add/remove_callback
logic in drm_sched_job_timedout and do it for each driver between
scheduler deactivation and reactivation?



Yes, exactly. That's the reason why I already have a revert for the
patch that removes the dance from drm_sched_job_timedout again.
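
A minimal sketch of what that driver-side flow could look like; the stop/start
helpers and the per-job HW fence fields are placeholders, only
dma_fence_add/remove_callback and ring_mirror_list come from the discussion
above:

	struct drm_sched_job *s_job;

	/* between scheduler deactivation and reactivation, in the driver's
	 * reset path instead of in drm_sched_job_timedout */
	my_sched_stop(sched);			/* placeholder: park scheduler, pause timeout */

	list_for_each_entry(s_job, &sched->ring_mirror_list, node)
		dma_fence_remove_callback(s_job->hw_fence,	/* fields illustrative */
					  &s_job->hw_cb);	/* disconnect from HW fence */

	/* ... the actual ASIC reset runs here ... */

	list_for_each_entry(s_job, &sched->ring_mirror_list, node)
		dma_fence_add_callback(s_job->hw_fence, &s_job->hw_cb,
				       my_hw_fence_done);	/* reconnect */

	my_sched_start(sched);			/* placeholder: resume job submission */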


Christian.


On 26.11.18 at 20:28, Grodzovsky, Andrey wrote:



Actually, after looking again at drm_sched_job_timedout, from which
amdgpu_device_gpu_recover will be called, I see that we already
disconnect all the pending scheduler fences from the HW fence, including
the guilty job. I also see that in drm_sched_job_timedout,
job_list_lock is released before calling sched->ops->timedout_job and
then reacquired after, so new jobs can slip into ring_mirror_list in
between.
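
A simplified illustration of the window being described (not the verbatim
kernel function; only job_list_lock, ring_mirror_list and the timedout_job
callback are named in the text above):

	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	spin_unlock(&sched->job_list_lock);	/* lock dropped here ... */

	if (job)
		sched->ops->timedout_job(job);	/* amdgpu_device_gpu_recover() runs unlocked */

	spin_lock(&sched->job_list_lock);	/* ... and reacquired here, so new jobs may
						 * have been added to ring_mirror_list in
						 * the meantime */
	/* continue walking ring_mirror_list */
	spin_unlock(&sched->job_list_lock);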


I will also end up going over the ring_mirror_list twice, once
from amdgpu_device_post_asic_reset and later from
drm_sched_job_timedout - this might cause double fence processing.


Isn't it more correct to only do the disconnect from the HW fence after the
schedulers have been stopped, and to reconnect before we restart the
schedulers (as you pointed out here before)?


What I mean is - should we get rid of the dma_fence_add/remove_callback
logic in drm_sched_job_timedout and do it for each driver between
scheduler deactivation and reactivation?

Andrey


On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote:

In addition to that, I would try to improve the pre, middle, post handling
towards checking whether we made some progress in between.

In other words we stop all schedulers in the pre handling and
disconnect the scheduler fences from the hardware fence like I did in
patch "drm/sched: fix timeout handling v2".

Then before we do the actual reset in the middle handling we check if
the offending job has completed or at least made some progress in the
meantime.

I understand how to check if the job completed - its fence has already
signaled - but how do I test if the job made 'at least some progress'?

Good question. Maybe we can somehow query from the hardware the number
of primitives or pixels processed so far and then compare after a moment?

I will check on this later. In the meantime I will update the code
with the proposed per-hive locking, and I will add a check whether the
guilty job completed before the ASIC reset, skipping the reset if it did.

Andrey






Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-26 Thread Grodzovsky, Andrey

Actually, after looking again at drm_sched_job_timedout, from which
amdgpu_device_gpu_recover will be called, I see that we already disconnect all
the pending scheduler fences from the HW fence, including
the guilty job. I also see that in drm_sched_job_timedout, job_list_lock is
released before calling sched->ops->timedout_job and then reacquired after, so
new jobs can slip into ring_mirror_list in between.

I will also end up going over the ring_mirror_list twice, once from
amdgpu_device_post_asic_reset and later from drm_sched_job_timedout - this
might cause double fence processing.

Isn't it more correct to only do the disconnect from the HW fence after the
schedulers have been stopped, and to reconnect before we restart the
schedulers (as you pointed out here before)?

What I mean is - should we get rid of the dma_fence_add/remove_callback logic
in drm_sched_job_timedout and do it for each driver between

scheduler deactivation and reactivation?

Andrey

On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote:

In addition to that, I would try to improve the pre, middle, post handling
towards checking whether we made some progress in between.

In other words we stop all schedulers in the pre handling and
disconnect the scheduler fences from the hardware fence like I did in
patch "drm/sched: fix timeout handling v2".

Then before we do the actual reset in the middle handling we check if
the offending job has completed or at least made some progress in the
meantime.


I understand how to check if the job completed - its fence has already
signaled - but how do I test if the job made 'at least some progress'?


Good question. Maybe we can somehow query from the hardware the number
of primitives or pixels processed so far and then compare after a moment?


I will check on this later. In the meantime I will update the code
with the proposed per-hive locking, and I will add a check whether the
guilty job completed before the ASIC reset, skipping the reset if it did.

Andrey





Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-22 Thread Grodzovsky, Andrey


On 11/22/2018 02:03 PM, Christian König wrote:
> On 22.11.18 at 16:44, Grodzovsky, Andrey wrote:
>>
>> On 11/22/2018 06:16 AM, Christian König wrote:
>>> How about using a lock per hive and then acquiring that with trylock()
>>> instead?
>>>
>>> This way you should at least catch cases where multiple causes try to
>>> reset the same hive at the same time. True that there is still some
>>> racing involved, but it's at least a good start.
>> What about using a per-hive work_struct which, when scheduled, will execute
>> amdgpu_device_gpu_recover? Since the work queue rejects duplicates we will
>> get per-hive serialization automatically from that, without the need for
>> any new mutex. Also, this might be necessary anyway for RAS, as I believe
>> RAS will trigger an interrupt when errors are detected and might then
>> decide to reset the GPU, so we will have to switch to bottom half context
>> anyway. And I don't think it's a problem to schedule this work from the
>> job timeout handler if needed.
>
> Mhm, what is the advantage over the trylock variant? The key problem
> with that is how to get the offending job into the background worker.
>
> E.g. we have trouble guaranteeing that the job isn't destroyed while
> this runs in a background worker.

Yea, actually there isn't.

>
>>>
>>>
>>> In addition to that, I would try to improve the pre, middle, post handling
>>> towards checking whether we made some progress in between.
>>>
>>> In other words we stop all schedulers in the pre handling and
>>> disconnect the scheduler fences from the hardware fence like I did in
>>> patch "drm/sched: fix timeout handling v2".
>>>
>>> Then before we do the actual reset in the middle handling we check if
>>> the offending job has completed or at least made some progress in the
>>> meantime.
>> I understand how to check if the job completed - its fence has already
>> signaled - but how do I test if the job made 'at least some progress'?
>
> Good question. Maybe we can somehow query from the hardware the number 
> of primitives or pixels processed so far and then compare after a moment?

I will check on this later. In the meantime I will update the code
with the proposed per-hive locking, and I will add a check whether the
guilty job completed before the ASIC reset, skipping the reset if it did.
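
A rough sketch of that check; dma_fence_is_signaled() is the real API, while
the lookup of the HW fence through base.s_fence->parent and the surrounding
names are assumptions about the job layout:

	/* hypothetical helper: has the guilty job's HW fence already signaled? */
	static bool guilty_job_done(struct amdgpu_job *job)
	{
		struct dma_fence *hw_fence;

		if (!job)
			return false;	/* e.g. manual reset, nothing to check */

		hw_fence = job->base.s_fence->parent;	/* assumed: HW fence, if attached */
		return hw_fence && dma_fence_is_signaled(hw_fence);
	}

	/* in the recovery path, between the pre handling and the ASIC reset step: */
	if (guilty_job_done(job))
		skip_hw_reset = true;	/* go straight to the post handling */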

Andrey

>
>> Another question - what's the purpose of this progress check? If I've
>> already completed the pre handling sequence I can't bail out even if
>> the guilty job is signaled by the time I do the progress check; I
>> have to complete at least the post handling too. Do you mean I can at
>> least skip the ASIC reset phase in that case?
>
> Yes, exactly. We have a rather big race problem in the current reset
> logic between completing the job and resetting the hardware.
>
> In other words, when the job completes at exactly the moment we reset
> the GPU, we try to signal it twice, etc. The whole handling here
> is not really thought through.
>
> Christian.
>
>>
>> Andrey
>>
>>> In the case of a manual reset we skip that because we don't have an
>>> offending job to check.
>>>
>>> In the post handling we stitch everything together again and start the
>>> scheduler to go on with job submission.
>>>
>>> Christian.
>>>
>>> On 21.11.18 at 23:02, Grodzovsky, Andrey wrote:
It depends on what the reason for triggering the reset for that node was -
how do we know?
If the reason was a RAS error, it's probably not hard to check that all
errors are cleared, but
if the reason was a job timeout on that specific node, I will need to
recheck that no jobs are left in an incomplete state.
And if the reason is a manual GPU reset triggered from sysfs, then
what's the policy?
Sounds to me like it's just easier to go ahead and allow all the pending
resets to proceed, unless there is
a clear and quick criterion you can check after you grab the mutex - then
sure - but I don't know what it would be.

 Andrey

 On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
> I saw you use the global xgmi_mutex to prevent concurrent resets from being
> triggered by different nodes, but after the mutex has been released,
> the current node may grab the mutex and continue to do another reset.
> Maybe we should check the GPU status and skip the reset in this case,
> since the GPU may already be in a good state.
>
> Regards
>
> shaoyun.liu
>
> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>> For the XGMI hive case, do the reset in steps where each step iterates over
>> all devices in the hive. This is especially important for the ASIC reset,
>> since all PSP FW in the hive must come up within a limited time
>> (around 1 sec) to properly negotiate the link.
>> Do this by refactoring amdgpu_device_gpu_recover and
>> amdgpu_device_reset
>> into pre_asic_reset, asic_reset and post_asic_reset functions, where each
>> part is executed for all the GPUs in the hive before going to the next

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-22 Thread Christian König

On 22.11.18 at 16:44, Grodzovsky, Andrey wrote:


On 11/22/2018 06:16 AM, Christian König wrote:

How about using a lock per hive and then acquiring that with trylock()
instead?

This way you should at least catch cases where multiple causes try to
reset the same hive at the same time. True that there is still some
racing involved, but it's at least a good start.

What about using a per-hive work_struct which, when scheduled, will execute
amdgpu_device_gpu_recover? Since the work queue rejects duplicates we will
get per-hive serialization automatically from that, without the need for
any new mutex. Also, this might be necessary anyway for RAS, as I believe
RAS will trigger an interrupt when errors are detected and might then
decide to reset the GPU, so we will have to switch to bottom half context
anyway. And I don't think it's a problem to schedule this work from the
job timeout handler if needed.


Mhm, what is the advantage over the trylock variant? The key problem with
that is how to get the offending job into the background worker.


E.g. we have trouble guaranteeing that the job isn't destroyed while this
runs in a background worker.





In addition to that, I would try to improve the pre, middle, post handling
towards checking whether we made some progress in between.

In other words we stop all schedulers in the pre handling and
disconnect the scheduler fences from the hardware fence like I did in
patch "drm/sched: fix timeout handling v2".

Then before we do the actual reset in the middle handling we check if
the offending job has completed or at least made some progress in the
meantime.

I understand how to check if the job completed - its fence has already
signaled - but how do I test if the job made 'at least some progress'?


Good question. Maybe we can somehow query from the hardware the number 
of primitives or pixels processed so far and then compare after a moment?



Another question - what's the purpose of this progress check? If I've
already completed the pre handling sequence I can't bail out even if
the guilty job is signaled by the time I do the progress check; I
have to complete at least the post handling too. Do you mean I can at
least skip the ASIC reset phase in that case?


Yes, exactly. We have a rather big race problem in the current reset
logic between completing the job and resetting the hardware.


In other words, when the job completes at exactly the moment we reset the
GPU, we try to signal it twice, etc. The whole handling here is not
really thought through.


Christian.



Andrey


In the case of a manual reset we skip that because we don't have an
offending job to check.

In the post handling we stitch everything together again and start the
scheduler to go on with job submission.

Christian.

On 21.11.18 at 23:02, Grodzovsky, Andrey wrote:

It depends on what the reason for triggering the reset for that node was -
how do we know?
If the reason was a RAS error, it's probably not hard to check that all
errors are cleared, but
if the reason was a job timeout on that specific node, I will need to
recheck that no jobs are left in an incomplete state.
And if the reason is a manual GPU reset triggered from sysfs, then
what's the policy?
Sounds to me like it's just easier to go ahead and allow all the pending
resets to proceed, unless there is
a clear and quick criterion you can check after you grab the mutex - then
sure - but I don't know what it would be.

Andrey

On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:

I saw you use the global xgmi_mutex to prevent concurrent resets from being
triggered by different nodes, but after the mutex has been released,
the current node may grab the mutex and continue to do another reset.
Maybe we should check the GPU status and skip the reset in this case,
since the GPU may already be in a good state.

Regards

shaoyun.liu

On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:

For the XGMI hive case, do the reset in steps where each step iterates over
all devices in the hive. This is especially important for the ASIC reset,
since all PSP FW in the hive must come up within a limited time
(around 1 sec) to properly negotiate the link.
Do this by refactoring amdgpu_device_gpu_recover and
amdgpu_device_reset
into pre_asic_reset, asic_reset and post_asic_reset functions, where each
part is executed for all the GPUs in the hive before going to the next
step.

Signed-off-by: Andrey Grodzovsky 
---
     drivers/gpu/drm/amd/amdgpu/amdgpu.h    |   5 +-
     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375
-
     2 files changed, 264 insertions(+), 116 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4ef5f7a..bd06d45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1026,6 +1026,9 @@ struct amdgpu_device {
     unsigned long last_mm_index;
     bool    in_gpu_reset;
     struct mutex  lock_reset;
+
+    int asic_reset_res;
+    int resched;
     };
         static

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-22 Thread Grodzovsky, Andrey


On 11/22/2018 06:16 AM, Christian König wrote:
> How about using a lock per hive and then acquiring that with trylock() 
> instead?
>
> This way you should at least catch cases where multiple causes try to 
> reset the same hive at the same time. True that there is still some 
> racing involved, but it's at least a good start.

What about using a per-hive work_struct which, when scheduled, will execute
amdgpu_device_gpu_recover? Since the work queue rejects duplicates we will
get per-hive serialization automatically from that, without the need for
any new mutex. Also, this might be necessary anyway for RAS, as I believe
RAS will trigger an interrupt when errors are detected and might then
decide to reset the GPU, so we will have to switch to bottom half context
anyway. And I don't think it's a problem to schedule this work from the
job timeout handler if needed.
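
A rough sketch of that idea; the struct member and function names are
hypothetical, but schedule_work() does return false when the work item is
already queued, which is what gives the per-hive serialization:

	struct amdgpu_hive_info {
		/* ... existing members ... */
		struct work_struct reset_work;	/* INIT_WORK()ed when the hive is created */
	};

	static void amdgpu_hive_reset_work(struct work_struct *work)
	{
		struct amdgpu_hive_info *hive =
			container_of(work, struct amdgpu_hive_info, reset_work);

		/* walk the hive's devices and run amdgpu_device_gpu_recover() here */
	}

	/* from the job timeout handler or the RAS interrupt bottom half: */
	if (!schedule_work(&hive->reset_work))
		return;	/* a reset for this hive is already pending */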

>
>
>
> In addition to that, I would try to improve the pre, middle, post handling
> towards checking whether we made some progress in between.
>
> In other words we stop all schedulers in the pre handling and 
> disconnect the scheduler fences from the hardware fence like I did in 
> patch "drm/sched: fix timeout handling v2".
>
> Then before we do the actual reset in the middle handling we check if 
> the offending job has completed or at least made some progress in the 
> meantime. 

I understand how to check if the job completed - its fence has already
signaled - but how do I test if the job made 'at least some progress'?
Another question - what's the purpose of this progress check? If I've
already completed the pre handling sequence I can't bail out even if
the guilty job is signaled by the time I do the progress check; I
have to complete at least the post handling too. Do you mean I can at
least skip the ASIC reset phase in that case?

Andrey

> In the case of a manual reset we skip that because we don't have an 
> offending job to check.
>
> In the post handling we stitch everything together again and start the 
> scheduler to go on with job submission.
>
> Christian.
>
> On 21.11.18 at 23:02, Grodzovsky, Andrey wrote:
>> It depends on what the reason for triggering the reset for that node was -
>> how do we know?
>> If the reason was a RAS error, it's probably not hard to check that all
>> errors are cleared, but
>> if the reason was a job timeout on that specific node, I will need to
>> recheck that no jobs are left in an incomplete state.
>> And if the reason is a manual GPU reset triggered from sysfs, then
>> what's the policy?
>> Sounds to me like it's just easier to go ahead and allow all the pending
>> resets to proceed, unless there is
>> a clear and quick criterion you can check after you grab the mutex - then
>> sure - but I don't know what it would be.
>>
>> Andrey
>>
>> On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
>>> I saw you use the global xgmi_mutex to prevent concurrent resets from being
>>> triggered by different nodes, but after the mutex has been released,
>>> the current node may grab the mutex and continue to do another reset.
>>> Maybe we should check the GPU status and skip the reset in this case,
>>> since the GPU may already be in a good state.
>>>
>>> Regards
>>>
>>> shaoyun.liu
>>>
>>> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
For the XGMI hive case, do the reset in steps where each step iterates over
all devices in the hive. This is especially important for the ASIC reset,
since all PSP FW in the hive must come up within a limited time
(around 1 sec) to properly negotiate the link.
Do this by refactoring amdgpu_device_gpu_recover and
amdgpu_device_reset
into pre_asic_reset, asic_reset and post_asic_reset functions, where each
part is executed for all the GPUs in the hive before going to the next
step.

 Signed-off-by: Andrey Grodzovsky 
 ---
     drivers/gpu/drm/amd/amdgpu/amdgpu.h    |   5 +-
     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 
 -
     2 files changed, 264 insertions(+), 116 deletions(-)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
 b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
 index 4ef5f7a..bd06d45 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
 @@ -1026,6 +1026,9 @@ struct amdgpu_device {
     unsigned long last_mm_index;
     bool    in_gpu_reset;
     struct mutex  lock_reset;
 +
 +    int asic_reset_res;
 +    int resched;
     };
         static inline struct amdgpu_device *amdgpu_ttm_adev(struct 
 ttm_bo_device *bdev)
 @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
         struct list_head *amdgpu_xgmi_get_adev_list_handle(struct 
 amdgpu_hive_info *hive);
     struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct 
 amdgpu_device *adev);
 -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
 +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *h

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-22 Thread Christian König
How about using a lock per hive and then acquiring that with trylock() 
instead?


This way you should at least catch cases where multiple causes try to 
reset the same hive at the same time. True that there is still some 
racing involved, but it's at least a good start.
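
A minimal sketch of that, with a hypothetical per-hive lock name:

	/* at the top of the recovery path; hive->reset_lock is hypothetical */
	if (!mutex_trylock(&hive->reset_lock))
		return 0;	/* another node is already resetting this hive - back off */

	/* ... pre / ASIC / post reset steps ... */

	mutex_unlock(&hive->reset_lock);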




In addition to that, I would try to improve the pre, middle, post handling
towards checking whether we made some progress in between.


In other words we stop all schedulers in the pre handling and disconnect 
the scheduler fences from the hardware fence like I did in patch 
"drm/sched: fix timeout handling v2".


Then before we do the actual reset in the middle handling we check if 
the offending job has completed or at least made some progress in the 
meantime. In the case of a manual reset we skip that because we don't 
have an offending job to check.


In the post handling we stitch everything together again and start the 
scheduler to go on with job submission.
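
A rough outline of that pre/middle/post split combined with the per-step hive
iteration from the patch below; amdgpu_xgmi_get_adev_list_handle() and the
pre/asic/post function names come from the patch, while the completion check
and the list member are assumptions:

	static void gpu_recover_hive(struct amdgpu_hive_info *hive,
				     struct amdgpu_job *guilty_job)
	{
		struct list_head *devs = amdgpu_xgmi_get_adev_list_handle(hive);
		struct amdgpu_device *adev;

		/* pre: stop all schedulers, disconnect scheduler fences from HW fences */
		list_for_each_entry(adev, devs, gmc.xgmi.head)	/* list member assumed */
			pre_asic_reset(adev, guilty_job);

		/* middle: skip the HW reset if the offending job completed meanwhile;
		 * otherwise reset every node back to back so all PSP FW comes up
		 * within the ~1 second link negotiation window */
		if (!guilty_job_completed(guilty_job))
			list_for_each_entry(adev, devs, gmc.xgmi.head)
				asic_reset(adev);

		/* post: stitch the fences back together and restart the schedulers */
		list_for_each_entry(adev, devs, gmc.xgmi.head)
			post_asic_reset(adev);
	}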


Christian.

On 21.11.18 at 23:02, Grodzovsky, Andrey wrote:

It depends on what the reason for triggering the reset for that node was -
how do we know?
If the reason was a RAS error, it's probably not hard to check that all
errors are cleared, but
if the reason was a job timeout on that specific node, I will need to
recheck that no jobs are left in an incomplete state.
And if the reason is a manual GPU reset triggered from sysfs, then
what's the policy?
Sounds to me like it's just easier to go ahead and allow all the pending
resets to proceed, unless there is
a clear and quick criterion you can check after you grab the mutex - then
sure - but I don't know what it would be.

Andrey

On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:

I saw you use the global xgmi_mutex to prevent concurrent resets from being
triggered by different nodes, but after the mutex has been released,
the current node may grab the mutex and continue to do another reset.
Maybe we should check the GPU status and skip the reset in this case,
since the GPU may already be in a good state.

Regards

shaoyun.liu

On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:

For the XGMI hive case, do the reset in steps where each step iterates over
all devices in the hive. This is especially important for the ASIC reset,
since all PSP FW in the hive must come up within a limited time
(around 1 sec) to properly negotiate the link.
Do this by refactoring amdgpu_device_gpu_recover and amdgpu_device_reset
into pre_asic_reset, asic_reset and post_asic_reset functions, where each
part is executed for all the GPUs in the hive before going to the next step.

Signed-off-by: Andrey Grodzovsky 
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h|   5 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 
-
2 files changed, 264 insertions(+), 116 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4ef5f7a..bd06d45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1026,6 +1026,9 @@ struct amdgpu_device {
unsigned long last_mm_index;
bool    in_gpu_reset;
struct mutex  lock_reset;
+
+   int asic_reset_res;
+   int resched;
};

static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)

@@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;

struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info *hive);

struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
-int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
+int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct 
amdgpu_device *adev);
int amdgpu_xgmi_add_device(struct amdgpu_device *adev);

/*

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cb06e68..8e94d7f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct 
amdgpu_device *adev)
return 0;
}

-/**

- * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
- *
- * @adev: amdgpu device pointer
- *
- * attempt to do soft-reset or full-reset and reinitialize Asic
- * return 0 means succeeded otherwise failed
- */
-static int amdgpu_device_reset(struct amdgpu_device *adev)
-{
-   bool need_full_reset, vram_lost = 0;
-   int r;
-
-   need_full_reset = amdgpu_device_ip_need_full_reset(adev);
-
-   if (!need_full_reset) {
-   amdgpu_device_ip_pre_soft_reset(adev);
-   r = amdgpu_device_ip_soft_reset(adev);
-   amdgpu_device_ip_post_soft_reset(adev);
-   if (r || amdgpu_device_ip_check_soft_reset(adev)) {
-   DRM_INFO("soft reset failed, will fallback to full 
reset!\n");
-   need_full_reset = true;
-   }
-   }
-
-   if (need_full_reset) {
-   r = amdgpu_device_ip_suspe

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-21 Thread Grodzovsky, Andrey
It depends on what the reason for triggering the reset for that node was -
how do we know?
If the reason was a RAS error, it's probably not hard to check that all
errors are cleared, but
if the reason was a job timeout on that specific node, I will need to
recheck that no jobs are left in an incomplete state.
And if the reason is a manual GPU reset triggered from sysfs, then
what's the policy?
Sounds to me like it's just easier to go ahead and allow all the pending
resets to proceed, unless there is
a clear and quick criterion you can check after you grab the mutex - then
sure - but I don't know what it would be.

Andrey

On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
> I saw you use the global xgmi_mutex to prevent concurrent resets from being
> triggered by different nodes, but after the mutex has been released,
> the current node may grab the mutex and continue to do another reset.
> Maybe we should check the GPU status and skip the reset in this case,
> since the GPU may already be in a good state.
>
> Regards
>
> shaoyun.liu
>
> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>> For the XGMI hive case, do the reset in steps where each step iterates over
>> all devices in the hive. This is especially important for the ASIC reset,
>> since all PSP FW in the hive must come up within a limited time
>> (around 1 sec) to properly negotiate the link.
>> Do this by refactoring amdgpu_device_gpu_recover and amdgpu_device_reset
>> into pre_asic_reset, asic_reset and post_asic_reset functions, where each
>> part is executed for all the GPUs in the hive before going to the next step.
>>
>> Signed-off-by: Andrey Grodzovsky 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu.h|   5 +-
>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 
>> -
>>2 files changed, 264 insertions(+), 116 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 4ef5f7a..bd06d45 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>>  unsigned long last_mm_index;
>>  bool    in_gpu_reset;
>>  struct mutex  lock_reset;
>> +
>> +int asic_reset_res;
>> +int resched;
>>};
>>
>>static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
>> *bdev)
>> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>>
>>struct list_head *amdgpu_xgmi_get_adev_list_handle(struct 
>> amdgpu_hive_info *hive);
>>struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
>> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct 
>> amdgpu_device *adev);
>>int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>>
>>/*
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index cb06e68..8e94d7f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct 
>> amdgpu_device *adev)
>>  return 0;
>>}
>>
>> -/**
>> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
>> - *
>> - * @adev: amdgpu device pointer
>> - *
>> - * attempt to do soft-reset or full-reset and reinitialize Asic
>> - * return 0 means succeeded otherwise failed
>> - */
>> -static int amdgpu_device_reset(struct amdgpu_device *adev)
>> -{
>> -bool need_full_reset, vram_lost = 0;
>> -int r;
>> -
>> -need_full_reset = amdgpu_device_ip_need_full_reset(adev);
>> -
>> -if (!need_full_reset) {
>> -amdgpu_device_ip_pre_soft_reset(adev);
>> -r = amdgpu_device_ip_soft_reset(adev);
>> -amdgpu_device_ip_post_soft_reset(adev);
>> -if (r || amdgpu_device_ip_check_soft_reset(adev)) {
>> -DRM_INFO("soft reset failed, will fallback to full 
>> reset!\n");
>> -need_full_reset = true;
>> -}
>> -}
>> -
>> -if (need_full_reset) {
>> -r = amdgpu_device_ip_suspend(adev);
>> -
>> -retry:
>> -r = amdgpu_asic_reset(adev);
>> -/* post card */
>> -amdgpu_atom_asic_init(adev->mode_info.atom_context);
>> -
>> -if (!r) {
>> -dev_info(adev->dev, "GPU reset succeeded, trying to 
>> resume\n");
>> -r = amdgpu_device_ip_resume_phase1(adev);
>> -if (r)
>> -goto out;
>> -
>> -vram_lost = amdgpu_device_check_vram_lost(adev);
>> -if (vram_lost) {
>> -DRM_ERROR("VRAM is lost!\n");
>> -atomic_inc(&adev->vram_lost_counter);
>> -}
>> -
>> -r = amdgpu_gtt_mgr_recover(
>> -&adev->mman.bdev.man[

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-21 Thread Liu, Shaoyun
I saw you use the global xgmi_mutex to prevent concurrent resets from being
triggered by different nodes, but after the mutex has been released,
the current node may grab the mutex and continue to do another reset.
Maybe we should check the GPU status and skip the reset in this case,
since the GPU may already be in a good state.

Regards

shaoyun.liu

On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
> For the XGMI hive case, do the reset in steps where each step iterates over
> all devices in the hive. This is especially important for the ASIC reset,
> since all PSP FW in the hive must come up within a limited time
> (around 1 sec) to properly negotiate the link.
> Do this by refactoring amdgpu_device_gpu_recover and amdgpu_device_reset
> into pre_asic_reset, asic_reset and post_asic_reset functions, where each
> part is executed for all the GPUs in the hive before going to the next step.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h|   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 
> -
>   2 files changed, 264 insertions(+), 116 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 4ef5f7a..bd06d45 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>   unsigned long last_mm_index;
>   bool    in_gpu_reset;
>   struct mutex  lock_reset;
> +
> + int asic_reset_res;
> + int resched;
>   };
>   
>   static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
> *bdev)
> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>   
>   struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info 
> *hive);
>   struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct 
> amdgpu_device *adev);
>   int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>   
>   /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cb06e68..8e94d7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct 
> amdgpu_device *adev)
>   return 0;
>   }
>   
> -/**
> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
> - *
> - * @adev: amdgpu device pointer
> - *
> - * attempt to do soft-reset or full-reset and reinitialize Asic
> - * return 0 means succeeded otherwise failed
> - */
> -static int amdgpu_device_reset(struct amdgpu_device *adev)
> -{
> - bool need_full_reset, vram_lost = 0;
> - int r;
> -
> - need_full_reset = amdgpu_device_ip_need_full_reset(adev);
> -
> - if (!need_full_reset) {
> - amdgpu_device_ip_pre_soft_reset(adev);
> - r = amdgpu_device_ip_soft_reset(adev);
> - amdgpu_device_ip_post_soft_reset(adev);
> - if (r || amdgpu_device_ip_check_soft_reset(adev)) {
> - DRM_INFO("soft reset failed, will fallback to full 
> reset!\n");
> - need_full_reset = true;
> - }
> - }
> -
> - if (need_full_reset) {
> - r = amdgpu_device_ip_suspend(adev);
> -
> -retry:
> - r = amdgpu_asic_reset(adev);
> - /* post card */
> - amdgpu_atom_asic_init(adev->mode_info.atom_context);
> -
> - if (!r) {
> - dev_info(adev->dev, "GPU reset succeeded, trying to 
> resume\n");
> - r = amdgpu_device_ip_resume_phase1(adev);
> - if (r)
> - goto out;
> -
> - vram_lost = amdgpu_device_check_vram_lost(adev);
> - if (vram_lost) {
> - DRM_ERROR("VRAM is lost!\n");
> - atomic_inc(&adev->vram_lost_counter);
> - }
> -
> - r = amdgpu_gtt_mgr_recover(
> - &adev->mman.bdev.man[TTM_PL_TT]);
> - if (r)
> - goto out;
> -
> - r = amdgpu_device_fw_loading(adev);
> - if (r)
> - return r;
> -
> - r = amdgpu_device_ip_resume_phase2(adev);
> - if (r)
> - goto out;
> -
> - if (vram_lost)
> - amdgpu_device_fill_reset_magic(adev);
> - }
> - }
> -
> -out:
> - if (!r) {
> - amdgpu_irq_gpu_reset_resume_helper(adev);
> - r = amdgpu_ib_ring_tests(adev);
> - if (r) {
> - dev_err(adev->dev, "ib ring test failed (%d).\n", r);

Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

2018-11-21 Thread Alex Deucher
On Wed, Nov 21, 2018 at 1:11 PM Andrey Grodzovsky
 wrote:
>
> For the XGMI hive case, do the reset in steps where each step iterates over
> all devices in the hive. This is especially important for the ASIC reset,
> since all PSP FW in the hive must come up within a limited time
> (around 1 sec) to properly negotiate the link.
> Do this by refactoring amdgpu_device_gpu_recover and amdgpu_device_reset
> into pre_asic_reset, asic_reset and post_asic_reset functions, where each
> part is executed for all the GPUs in the hive before going to the next step.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|   5 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 
> -
>  2 files changed, 264 insertions(+), 116 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 4ef5f7a..bd06d45 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
> unsigned long last_mm_index;
> bool    in_gpu_reset;
> struct mutex  lock_reset;
> +
> +   int asic_reset_res;
> +   int resched;
>  };
>
>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device 
> *bdev)
> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>
>  struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info 
> *hive);
>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct 
> amdgpu_device *adev);
>  int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cb06e68..8e94d7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct 
> amdgpu_device *adev)
> return 0;
>  }
>
> -/**
> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
> - *
> - * @adev: amdgpu device pointer
> - *
> - * attempt to do soft-reset or full-reset and reinitialize Asic
> - * return 0 means succeeded otherwise failed
> - */
> -static int amdgpu_device_reset(struct amdgpu_device *adev)
> -{
> -   bool need_full_reset, vram_lost = 0;
> -   int r;
> -
> -   need_full_reset = amdgpu_device_ip_need_full_reset(adev);
> -
> -   if (!need_full_reset) {
> -   amdgpu_device_ip_pre_soft_reset(adev);
> -   r = amdgpu_device_ip_soft_reset(adev);
> -   amdgpu_device_ip_post_soft_reset(adev);
> -   if (r || amdgpu_device_ip_check_soft_reset(adev)) {
> -   DRM_INFO("soft reset failed, will fallback to full 
> reset!\n");
> -   need_full_reset = true;
> -   }
> -   }
> -
> -   if (need_full_reset) {
> -   r = amdgpu_device_ip_suspend(adev);
> -
> -retry:
> -   r = amdgpu_asic_reset(adev);
> -   /* post card */
> -   amdgpu_atom_asic_init(adev->mode_info.atom_context);
> -
> -   if (!r) {
> -   dev_info(adev->dev, "GPU reset succeeded, trying to 
> resume\n");
> -   r = amdgpu_device_ip_resume_phase1(adev);
> -   if (r)
> -   goto out;
> -
> -   vram_lost = amdgpu_device_check_vram_lost(adev);
> -   if (vram_lost) {
> -   DRM_ERROR("VRAM is lost!\n");
> -   atomic_inc(&adev->vram_lost_counter);
> -   }
> -
> -   r = amdgpu_gtt_mgr_recover(
> -   &adev->mman.bdev.man[TTM_PL_TT]);
> -   if (r)
> -   goto out;
> -
> -   r = amdgpu_device_fw_loading(adev);
> -   if (r)
> -   return r;
> -
> -   r = amdgpu_device_ip_resume_phase2(adev);
> -   if (r)
> -   goto out;
> -
> -   if (vram_lost)
> -   amdgpu_device_fill_reset_magic(adev);
> -   }
> -   }
> -
> -out:
> -   if (!r) {
> -   amdgpu_irq_gpu_reset_resume_helper(adev);
> -   r = amdgpu_ib_ring_tests(adev);
> -   if (r) {
> -   dev_err(adev->dev, "ib ring test failed (%d).\n", r);
> -   r = amdgpu_device_ip_suspend(adev);
> -   need_full_reset = true;
> -   goto retry;
> -   }
> -   }
> -
> -   if (!r)
> -   r = amdgpu_device_recover_vram(adev);
>