Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
> What I mean is - should we get rid of the dma_fence_add/remove_callback
> logic in drm_sched_job_timedout and do it for each driver in between
> scheduler deactivation and activation back?

Yes, exactly. That's the reason why I already have a revert for the patch
and remove the dance from drm_sched_job_timedout again.

Christian.

Am 26.11.18 um 20:28 schrieb Grodzovsky, Andrey:
> Actually, after looking again at drm_sched_job_timedout, from which
> amdgpu_device_gpu_recover will be called, I see that we already disconnect
> all the pending scheduler fences from the HW fence, including the guilty
> job.
>
> I also see that in drm_sched_job_timedout the job_list_lock is released
> before calling sched->ops->timedout_job and reacquired after, so new jobs
> can slip into ring_mirror_list in between. I will also end up going over
> ring_mirror_list twice, once from amdgpu_device_post_asic_reset and later
> from drm_sched_job_timedout - this might cause double fence processing.
>
> Isn't it more correct to only do the disconnect from the HW fence after
> the schedulers have been stopped, and to connect back before we restart
> the schedulers (as you pointed out here before)? What I mean is - should
> we get rid of the dma_fence_add/remove_callback logic in
> drm_sched_job_timedout and do it for each driver in between scheduler
> deactivation and activation back?
>
> Andrey
>
> On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote:
>> Additional to that I would try to improve the pre, middle, post handling
>> towards checking if we made some progress in between. In other words, we
>> stop all schedulers in the pre handling and disconnect the scheduler
>> fences from the hardware fence like I did in the patch "drm/sched: fix
>> timeout handling v2". Then, before we do the actual reset in the middle
>> handling, we check if the offending job has completed or at least made
>> some progress in the meantime.
>>
>> I understand how to check if the job completed - its fence is already
>> signaled - but how do I test if the job made 'at least some progress'?
>>
>> Good question. Maybe we can somehow query from the hardware the number
>> of primitives or pixels processed so far and then compare after a
>> moment?
>>
>> I will check on this later. In the meantime I will update the code with
>> the proposed per-hive locking and add the check whether the guilty job
>> completed before the ASIC reset, skipping the reset if it did.
>>
>> Andrey
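A minimal sketch of the ordering being agreed on here, i.e. doing the HW fence disconnect/reconnect in the driver, between stopping and restarting the schedulers, rather than inside drm_sched_job_timedout. The my_job structure, the pending list and the my_pre_reset()/my_post_reset() helpers are hypothetical placeholders rather than actual amdgpu or drm_sched code, and sched->thread refers to the scheduler kthread of that era; only kthread_park()/kthread_unpark() and the dma_fence callback API are real kernel interfaces:

#include <linux/dma-fence.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical per-job bookkeeping, standing in for whatever the driver tracks. */
struct my_job {
	struct dma_fence *hw_fence;	/* fence emitted by the hardware ring  */
	struct dma_fence_cb cb;		/* callback that was armed on hw_fence */
	struct list_head node;
};

/* Pre handling: park the scheduler kthread first, then drop the HW fence callbacks. */
static void my_pre_reset(struct drm_gpu_scheduler *sched, struct list_head *pending)
{
	struct my_job *job;

	kthread_park(sched->thread);		/* no new jobs reach the hardware */

	list_for_each_entry(job, pending, node)
		dma_fence_remove_callback(job->hw_fence, &job->cb);
}

/* Post handling: re-arm the callbacks before the scheduler runs again. */
static void my_post_reset(struct drm_gpu_scheduler *sched, struct list_head *pending,
			  dma_fence_func_t process_job)
{
	struct my_job *job;

	list_for_each_entry(job, pending, node) {
		/* dma_fence_add_callback() returns nonzero if the fence already signaled */
		if (dma_fence_add_callback(job->hw_fence, &job->cb, process_job))
			process_job(job->hw_fence, &job->cb);
	}

	kthread_unpark(sched->thread);		/* resume job submission */
}

With the scheduler parked before the callbacks are removed, nothing new reaches the hardware while the fences are disconnected, which is exactly the kind of window the ring_mirror_list concern above describes.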
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
Actually, after looking again at drm_sched_job_timedout, from which
amdgpu_device_gpu_recover will be called, I see that we already disconnect
all the pending scheduler fences from the HW fence, including the guilty
job.

I also see that in drm_sched_job_timedout the job_list_lock is released
before calling sched->ops->timedout_job and reacquired after, so new jobs
can slip into ring_mirror_list in between. I will also end up going over
ring_mirror_list twice, once from amdgpu_device_post_asic_reset and later
from drm_sched_job_timedout - this might cause double fence processing.

Isn't it more correct to only do the disconnect from the HW fence after the
schedulers have been stopped, and to connect back before we restart the
schedulers (as you pointed out here before)? What I mean is - should we get
rid of the dma_fence_add/remove_callback logic in drm_sched_job_timedout
and do it for each driver in between scheduler deactivation and activation
back?

Andrey

On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote:
> Additional to that I would try to improve the pre, middle, post handling
> towards checking if we made some progress in between. In other words, we
> stop all schedulers in the pre handling and disconnect the scheduler
> fences from the hardware fence like I did in the patch "drm/sched: fix
> timeout handling v2". Then, before we do the actual reset in the middle
> handling, we check if the offending job has completed or at least made
> some progress in the meantime.
>
> I understand how to check if the job completed - its fence is already
> signaled - but how do I test if the job made 'at least some progress'?
>
> Good question. Maybe we can somehow query from the hardware the number of
> primitives or pixels processed so far and then compare after a moment?
>
> I will check on this later. In the meantime I will update the code with
> the proposed per-hive locking and add the check whether the guilty job
> completed before the ASIC reset, skipping the reset if it did.
>
> Andrey
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
On 11/22/2018 02:03 PM, Christian König wrote:
> Am 22.11.18 um 16:44 schrieb Grodzovsky, Andrey:
>>
>> On 11/22/2018 06:16 AM, Christian König wrote:
>>> How about using a lock per hive and then acquiring that with trylock()
>>> instead?
>>>
>>> This way you should at least catch cases where multiple causes try to
>>> reset the same hive at the same time. True that there is still some
>>> racing involved, but it's at least a good start.
>> What about using a per-hive work_struct which, when scheduled, will
>> execute amdgpu_device_gpu_recover? Since the work queue rejects
>> duplicates, we get per-hive serialization automatically from that
>> without needing any new mutex. Also, this might be necessary anyway for
>> RAS, as I believe RAS will trigger an interrupt when errors are detected
>> and might then decide to reset the GPU, so we will have to switch to
>> bottom-half context anyway. And I don't think it's a problem to schedule
>> this work from the job timeout handler if needed.
>
> Mhm, what is the advantage to the trylock variant? The key problem with
> that is how to get the offending job into the background worker.
>
> E.g. we have trouble guaranteeing that the job isn't destroyed while this
> is in a background worker.

Yea, actually there isn't.

>
>>>
>>> Additional to that I would try to improve the pre, middle, post
>>> handling towards checking if we made some progress in between.
>>>
>>> In other words we stop all schedulers in the pre handling and
>>> disconnect the scheduler fences from the hardware fence like I did in
>>> patch "drm/sched: fix timeout handling v2".
>>>
>>> Then before we do the actual reset in the middle handling we check if
>>> the offending job has completed or at least made some progress in the
>>> meantime.
>> I understand how to check if the job completed - its fence is already
>> signaled - but how do I test if the job made 'at least some progress'?
>
> Good question. Maybe we can somehow query from the hardware the number
> of primitives or pixels processed so far and then compare after a moment?

I will check on this later. In the meantime I will update the code with the
proposed per-hive locking and add the check whether the guilty job completed
before the ASIC reset, skipping the reset if it did.

Andrey

>
>> Another question - what's the purpose of this progress check? If I've
>> already completed the pre handling sequence I can't bail out even if the
>> guilty job is signaled by the time I do the progress check; I have to
>> complete at least the post handling too. Do you mean I can at least skip
>> the ASIC reset phase in that case?
>
> Yes, exactly. We have a rather big race problem in the current reset
> logic between completing the job and resetting the hardware.
>
> In other words, when the job completes exactly in the moment we reset the
> GPU, we try to signal it twice etc. etc. The whole handling here is not
> really thought through.
>
> Christian.
>
>>
>> Andrey
>>
>>> In the case of a manual reset we skip that because we don't have an
>>> offending job to check.
>>>
>>> In the post handling we stitch everything together again and start the
>>> scheduler to go on with job submission.
>>>
>>> Christian.
>>>
>>> Am 21.11.18 um 23:02 schrieb Grodzovsky, Andrey:
>>>> Depends what the reason for triggering the reset for that node was -
>>>> how do we know? If the reason was a RAS error it is probably not hard
>>>> to check that all errors are cleared, but if the reason was a job
>>>> timeout on that specific node I will need to recheck that no jobs are
>>>> left in an incomplete state. And if the reason is a manual GPU reset
>>>> triggered from sysfs, then what's the policy? Sounds to me it's just
>>>> easier to go ahead and allow all the pending resets to proceed; unless
>>>> there is a clear and quick criterion you can check after you grab the
>>>> mutex, then sure - but I don't know what it would be.
>>>>
>>>> Andrey
>>>>
>>>> On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
>>>>> I saw you use the global xgmi_mutex to prevent concurrent resets from
>>>>> being triggered by different nodes, but after the mutex has been
>>>>> released, the current node may grab the mutex and continue to do
>>>>> another reset. Maybe we should check the GPU status and skip the
>>>>> reset in this case since the GPU may already be in a good state.
>>>>>
>>>>> Regards
>>>>>
>>>>> shaoyun.liu
>>>>>
>>>>> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>>>>>> For XGMI hive case do reset in steps where each step iterates over
>>>>>> all devs in hive. This is especially important for asic reset since
>>>>>> all PSP FW in hive must come up within a limited time (around 1 sec)
>>>>>> to properly negotiate the link. Do this by refactoring
>>>>>> amdgpu_device_gpu_recover and amdgpu_device_reset into
>>>>>> pre_asic_reset, asic_reset and post_asic_reset functions where each
>>>>>> part is executed for all the GPUs in the hive before going to the
>>>>>> next
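The check Andrey says he will add, skipping the ASIC reset when the guilty job finished on its own, boils down to testing the guilty job's hardware fence between the pre and middle handling. A minimal sketch, assuming hypothetical my_mid_reset()/my_do_asic_reset() helpers and leaving open how the fence is looked up; dma_fence_is_signaled() is the real kernel API:

#include <linux/device.h>
#include <linux/dma-fence.h>

/* Hypothetical full reset helper, stands in for the real asic_reset step. */
static int my_do_asic_reset(struct device *dev);

/* Middle handling: reset only if the offending job is still stuck. */
static int my_mid_reset(struct device *dev, struct dma_fence *guilty_hw_fence)
{
	/*
	 * The schedulers were already stopped in the pre handling, so the
	 * only thing that can still signal this fence is the hardware
	 * finishing the job on its own.
	 */
	if (guilty_hw_fence && dma_fence_is_signaled(guilty_hw_fence)) {
		dev_info(dev, "guilty job completed on its own, skipping ASIC reset\n");
		return 0;	/* fall through to the post handling only */
	}

	return my_do_asic_reset(dev);
}

As discussed above, the post handling still has to run in either case; only the ASIC reset phase is skipped.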
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
Am 22.11.18 um 16:44 schrieb Grodzovsky, Andrey:
> On 11/22/2018 06:16 AM, Christian König wrote:
>> How about using a lock per hive and then acquiring that with trylock()
>> instead?
>>
>> This way you should at least catch cases where multiple causes try to
>> reset the same hive at the same time. True that there is still some
>> racing involved, but it's at least a good start.
> What about using a per-hive work_struct which, when scheduled, will
> execute amdgpu_device_gpu_recover? Since the work queue rejects
> duplicates, we get per-hive serialization automatically from that without
> needing any new mutex. Also, this might be necessary anyway for RAS, as I
> believe RAS will trigger an interrupt when errors are detected and might
> then decide to reset the GPU, so we will have to switch to bottom-half
> context anyway. And I don't think it's a problem to schedule this work
> from the job timeout handler if needed.

Mhm, what is the advantage to the trylock variant? The key problem with
that is how to get the offending job into the background worker.

E.g. we have trouble guaranteeing that the job isn't destroyed while this
is in a background worker.

>> Additional to that I would try to improve the pre, middle, post handling
>> towards checking if we made some progress in between.
>>
>> In other words we stop all schedulers in the pre handling and disconnect
>> the scheduler fences from the hardware fence like I did in patch
>> "drm/sched: fix timeout handling v2".
>>
>> Then before we do the actual reset in the middle handling we check if
>> the offending job has completed or at least made some progress in the
>> meantime.
> I understand how to check if the job completed - its fence is already
> signaled - but how do I test if the job made 'at least some progress'?

Good question. Maybe we can somehow query from the hardware the number of
primitives or pixels processed so far and then compare after a moment?

> Another question - what's the purpose of this progress check? If I've
> already completed the pre handling sequence I can't bail out even if the
> guilty job is signaled by the time I do the progress check; I have to
> complete at least the post handling too. Do you mean I can at least skip
> the ASIC reset phase in that case?

Yes, exactly. We have a rather big race problem in the current reset logic
between completing the job and resetting the hardware.

In other words, when the job completes exactly in the moment we reset the
GPU, we try to signal it twice etc. etc. The whole handling here is not
really thought through.

Christian.

>
> Andrey
>
>> In the case of a manual reset we skip that because we don't have an
>> offending job to check.
>>
>> In the post handling we stitch everything together again and start the
>> scheduler to go on with job submission.
>>
>> Christian.
>>
>> Am 21.11.18 um 23:02 schrieb Grodzovsky, Andrey:
>>> Depends what the reason for triggering the reset for that node was -
>>> how do we know? If the reason was a RAS error it is probably not hard
>>> to check that all errors are cleared, but if the reason was a job
>>> timeout on that specific node I will need to recheck that no jobs are
>>> left in an incomplete state. And if the reason is a manual GPU reset
>>> triggered from sysfs, then what's the policy? Sounds to me it's just
>>> easier to go ahead and allow all the pending resets to proceed; unless
>>> there is a clear and quick criterion you can check after you grab the
>>> mutex, then sure - but I don't know what it would be.
>>>
>>> Andrey
>>>
>>> On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
>>>> I saw you use the global xgmi_mutex to prevent concurrent resets from
>>>> being triggered by different nodes, but after the mutex has been
>>>> released, the current node may grab the mutex and continue to do
>>>> another reset. Maybe we should check the GPU status and skip the reset
>>>> in this case since the GPU may already be in a good state.
>>>>
>>>> Regards
>>>>
>>>> shaoyun.liu
>>>>
>>>> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>>>>> For XGMI hive case do reset in steps where each step iterates over
>>>>> all devs in hive. This is especially important for asic reset since
>>>>> all PSP FW in hive must come up within a limited time (around 1 sec)
>>>>> to properly negotiate the link. Do this by refactoring
>>>>> amdgpu_device_gpu_recover and amdgpu_device_reset into
>>>>> pre_asic_reset, asic_reset and post_asic_reset functions where each
>>>>> part is executed for all the GPUs in the hive before going to the
>>>>> next step.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   5 +-
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 -
>>>>>   2 files changed, 264 insertions(+), 116 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index 4ef5f7a..bd06d45 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>>>>>  	unsigned long last_mm_index;
>>>>>  	bool                            in_gpu_reset;
>>>>>  	struct mutex  lock_reset;
>>>>> +
>>>>> +	int asic_reset_res;
>>>>> +	int resched;
>>>>>  };
>>>>>
>>>>>  static
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
On 11/22/2018 06:16 AM, Christian König wrote:
> How about using a lock per hive and then acquiring that with trylock()
> instead?
>
> This way you should at least catch cases where multiple causes try to
> reset the same hive at the same time. True that there is still some
> racing involved, but it's at least a good start.

What about using a per-hive work_struct which, when scheduled, will execute
amdgpu_device_gpu_recover? Since the work queue rejects duplicates, we get
per-hive serialization automatically from that without needing any new
mutex. Also, this might be necessary anyway for RAS, as I believe RAS will
trigger an interrupt when errors are detected and might then decide to
reset the GPU, so we will have to switch to bottom-half context anyway. And
I don't think it's a problem to schedule this work from the job timeout
handler if needed.

>
>
> Additional to that I would try to improve the pre, middle, post handling
> towards checking if we made some progress in between.
>
> In other words we stop all schedulers in the pre handling and disconnect
> the scheduler fences from the hardware fence like I did in patch
> "drm/sched: fix timeout handling v2".
>
> Then before we do the actual reset in the middle handling we check if the
> offending job has completed or at least made some progress in the
> meantime.

I understand how to check if the job completed - its fence is already
signaled - but how do I test if the job made 'at least some progress'?

Another question - what's the purpose of this progress check? If I've
already completed the pre handling sequence I can't bail out even if the
guilty job is signaled by the time I do the progress check; I have to
complete at least the post handling too. Do you mean I can at least skip
the ASIC reset phase in that case?

Andrey

> In the case of a manual reset we skip that because we don't have an
> offending job to check.
>
> In the post handling we stitch everything together again and start the
> scheduler to go on with job submission.
>
> Christian.
>
> Am 21.11.18 um 23:02 schrieb Grodzovsky, Andrey:
>> Depends what the reason for triggering the reset for that node was - how
>> do we know? If the reason was a RAS error it is probably not hard to
>> check that all errors are cleared, but if the reason was a job timeout
>> on that specific node I will need to recheck that no jobs are left in an
>> incomplete state. And if the reason is a manual GPU reset triggered from
>> sysfs, then what's the policy? Sounds to me it's just easier to go ahead
>> and allow all the pending resets to proceed; unless there is a clear and
>> quick criterion you can check after you grab the mutex, then sure - but
>> I don't know what it would be.
>>
>> Andrey
>>
>> On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
>>> I saw you use the global xgmi_mutex to prevent concurrent resets from
>>> being triggered by different nodes, but after the mutex has been
>>> released, the current node may grab the mutex and continue to do
>>> another reset. Maybe we should check the GPU status and skip the reset
>>> in this case since the GPU may already be in a good state.
>>>
>>> Regards
>>>
>>> shaoyun.liu
>>>
>>> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>>>> For XGMI hive case do reset in steps where each step iterates over
>>>> all devs in hive. This is especially important for asic reset since
>>>> all PSP FW in hive must come up within a limited time (around 1 sec)
>>>> to properly negotiate the link. Do this by refactoring
>>>> amdgpu_device_gpu_recover and amdgpu_device_reset into pre_asic_reset,
>>>> asic_reset and post_asic_reset functions where each part is executed
>>>> for all the GPUs in the hive before going to the next step.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 -
>>>>   2 files changed, 264 insertions(+), 116 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index 4ef5f7a..bd06d45 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>>>>  	unsigned long last_mm_index;
>>>>  	bool                            in_gpu_reset;
>>>>  	struct mutex  lock_reset;
>>>> +
>>>> +	int asic_reset_res;
>>>> +	int resched;
>>>>  };
>>>>
>>>>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>>>> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>>>>
>>>>  struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info *hive);
>>>>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>>>> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *h
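For reference, the work queue property relied on above is that queueing an already pending work item is a no-op, so concurrent reset triggers against one hive collapse into a single recovery run. A minimal sketch with hypothetical my_hive / my_hive_reset_work() names, not the actual amdgpu_hive_info code:

#include <linux/kernel.h>
#include <linux/workqueue.h>

struct my_hive {
	struct work_struct reset_work;
	/* ... device list and other per-hive state ... */
};

static void my_hive_reset_work(struct work_struct *work)
{
	struct my_hive *hive = container_of(work, struct my_hive, reset_work);

	/* run the pre/asic/post reset steps over every device in the hive */
	(void)hive;
}

static void my_hive_init(struct my_hive *hive)
{
	INIT_WORK(&hive->reset_work, my_hive_reset_work);
}

/* May be called from a job timeout handler, a RAS bottom half, sysfs, ... */
static void my_hive_trigger_reset(struct my_hive *hive)
{
	/* schedule_work() returns false if the work is already pending */
	if (!schedule_work(&hive->reset_work))
		pr_debug("hive reset already queued, request merged\n");
}

As noted elsewhere in the thread, the hard part remains handing the offending job to the worker without it being freed underneath; a bare work_struct does not solve that by itself.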
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
How about using a lock per hive and then acquiring that with trylock()
instead?

This way you should at least catch cases where multiple causes try to reset
the same hive at the same time. True that there is still some racing
involved, but it's at least a good start.

Additional to that I would try to improve the pre, middle, post handling
towards checking if we made some progress in between.

In other words we stop all schedulers in the pre handling and disconnect
the scheduler fences from the hardware fence like I did in patch
"drm/sched: fix timeout handling v2".

Then before we do the actual reset in the middle handling we check if the
offending job has completed or at least made some progress in the meantime.

In the case of a manual reset we skip that because we don't have an
offending job to check.

In the post handling we stitch everything together again and start the
scheduler to go on with job submission.

Christian.

Am 21.11.18 um 23:02 schrieb Grodzovsky, Andrey:
> Depends what the reason for triggering the reset for that node was - how
> do we know? If the reason was a RAS error it is probably not hard to
> check that all errors are cleared, but if the reason was a job timeout on
> that specific node I will need to recheck that no jobs are left in an
> incomplete state. And if the reason is a manual GPU reset triggered from
> sysfs, then what's the policy? Sounds to me it's just easier to go ahead
> and allow all the pending resets to proceed; unless there is a clear and
> quick criterion you can check after you grab the mutex, then sure - but I
> don't know what it would be.
>
> Andrey
>
> On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
>> I saw you use the global xgmi_mutex to prevent concurrent resets from
>> being triggered by different nodes, but after the mutex has been
>> released, the current node may grab the mutex and continue to do another
>> reset. Maybe we should check the GPU status and skip the reset in this
>> case since the GPU may already be in a good state.
>>
>> Regards
>>
>> shaoyun.liu
>>
>> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>>> For XGMI hive case do reset in steps where each step iterates over
>>> all devs in hive. This is especially important for asic reset since all
>>> PSP FW in hive must come up within a limited time (around 1 sec) to
>>> properly negotiate the link. Do this by refactoring
>>> amdgpu_device_gpu_recover and amdgpu_device_reset into pre_asic_reset,
>>> asic_reset and post_asic_reset functions where each part is executed
>>> for all the GPUs in the hive before going to the next step.
>>>
>>> Signed-off-by: Andrey Grodzovsky
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   5 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 -
>>>   2 files changed, 264 insertions(+), 116 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index 4ef5f7a..bd06d45 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>>>  	unsigned long last_mm_index;
>>>  	bool                            in_gpu_reset;
>>>  	struct mutex  lock_reset;
>>> +
>>> +	int asic_reset_res;
>>> +	int resched;
>>>  };
>>>
>>>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>>> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>>>
>>>  struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info *hive);
>>>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>>> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
>>> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>>>  int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>>>
>>>  /*
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index cb06e68..8e94d7f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
>>>  	return 0;
>>>  }
>>>
>>> -/**
>>> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
>>> - *
>>> - * @adev: amdgpu device pointer
>>> - *
>>> - * attempt to do soft-reset or full-reset and reinitialize Asic
>>> - * return 0 means succeeded otherwise failed
>>> - */
>>> -static int amdgpu_device_reset(struct amdgpu_device *adev)
>>> -{
>>> -	bool need_full_reset, vram_lost = 0;
>>> -	int r;
>>> -
>>> -	need_full_reset = amdgpu_device_ip_need_full_reset(adev);
>>> -
>>> -	if (!need_full_reset) {
>>> -		amdgpu_device_ip_pre_soft_reset(adev);
>>> -		r = amdgpu_device_ip_soft_reset(adev);
>>> -		amdgpu_device_ip_post_soft_reset(adev);
>>> -		if (r || amdgpu_device_ip_check_soft_reset(adev)) {
>>> -			DRM_INFO("soft reset failed, will fallback to full reset!\n");
>>> -			need_full_reset = true;
>>> -		}
>>> -	}
>>> -
>>> -	if (need_full_reset) {
>>> -		r = amdgpu_device_ip_suspe
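For comparison, the per-hive trylock() variant suggested at the top of this message is roughly the following; my_hive and the recovery body are placeholders, while mutex_trylock() is the real API (it returns 1 on success and 0 if the lock is already held):

#include <linux/errno.h>
#include <linux/mutex.h>

struct my_hive {
	struct mutex reset_lock;	/* one lock per XGMI hive */
};

static int my_hive_try_recover(struct my_hive *hive)
{
	if (!mutex_trylock(&hive->reset_lock))
		return -EBUSY;		/* another reset of this hive is already running */

	/* pre_asic_reset / asic_reset / post_asic_reset over all hive devices ... */

	mutex_unlock(&hive->reset_lock);
	return 0;
}

The window Shaoyun describes remains: a caller arriving right after the unlock can still start another reset, which is why some "is the GPU already healthy again?" check after taking the lock is still wanted.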
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
Depends what the reason for triggering the reset for that node was - how do
we know? If the reason was a RAS error it is probably not hard to check
that all errors are cleared, but if the reason was a job timeout on that
specific node I will need to recheck that no jobs are left in an incomplete
state. And if the reason is a manual GPU reset triggered from sysfs, then
what's the policy? Sounds to me it's just easier to go ahead and allow all
the pending resets to proceed; unless there is a clear and quick criterion
you can check after you grab the mutex, then sure - but I don't know what
it would be.

Andrey

On 11/21/2018 03:49 PM, Liu, Shaoyun wrote:
> I saw you use the global xgmi_mutex to prevent concurrent resets from
> being triggered by different nodes, but after the mutex has been
> released, the current node may grab the mutex and continue to do another
> reset. Maybe we should check the GPU status and skip the reset in this
> case since the GPU may already be in a good state.
>
> Regards
>
> shaoyun.liu
>
> On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
>> For XGMI hive case do reset in steps where each step iterates over
>> all devs in hive. This is especially important for asic reset since all
>> PSP FW in hive must come up within a limited time (around 1 sec) to
>> properly negotiate the link. Do this by refactoring
>> amdgpu_device_gpu_recover and amdgpu_device_reset into pre_asic_reset,
>> asic_reset and post_asic_reset functions where each part is executed for
>> all the GPUs in the hive before going to the next step.
>>
>> Signed-off-by: Andrey Grodzovsky
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   5 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 -
>>   2 files changed, 264 insertions(+), 116 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 4ef5f7a..bd06d45 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>>  	unsigned long last_mm_index;
>>  	bool                            in_gpu_reset;
>>  	struct mutex  lock_reset;
>> +
>> +	int asic_reset_res;
>> +	int resched;
>>  };
>>
>>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
>> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>>
>>  struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info *hive);
>>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
>> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
>> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>>  int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>>
>>  /*
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index cb06e68..8e94d7f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
>>  	return 0;
>>  }
>>
>> -/**
>> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
>> - *
>> - * @adev: amdgpu device pointer
>> - *
>> - * attempt to do soft-reset or full-reset and reinitialize Asic
>> - * return 0 means succeeded otherwise failed
>> - */
>> -static int amdgpu_device_reset(struct amdgpu_device *adev)
>> -{
>> -	bool need_full_reset, vram_lost = 0;
>> -	int r;
>> -
>> -	need_full_reset = amdgpu_device_ip_need_full_reset(adev);
>> -
>> -	if (!need_full_reset) {
>> -		amdgpu_device_ip_pre_soft_reset(adev);
>> -		r = amdgpu_device_ip_soft_reset(adev);
>> -		amdgpu_device_ip_post_soft_reset(adev);
>> -		if (r || amdgpu_device_ip_check_soft_reset(adev)) {
>> -			DRM_INFO("soft reset failed, will fallback to full reset!\n");
>> -			need_full_reset = true;
>> -		}
>> -	}
>> -
>> -	if (need_full_reset) {
>> -		r = amdgpu_device_ip_suspend(adev);
>> -
>> -retry:
>> -		r = amdgpu_asic_reset(adev);
>> -		/* post card */
>> -		amdgpu_atom_asic_init(adev->mode_info.atom_context);
>> -
>> -		if (!r) {
>> -			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
>> -			r = amdgpu_device_ip_resume_phase1(adev);
>> -			if (r)
>> -				goto out;
>> -
>> -			vram_lost = amdgpu_device_check_vram_lost(adev);
>> -			if (vram_lost) {
>> -				DRM_ERROR("VRAM is lost!\n");
>> -				atomic_inc(&adev->vram_lost_counter);
>> -			}
>> -
>> -			r = amdgpu_gtt_mgr_recover(
>> -				&adev->mman.bdev.man[
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
I saw you use the global xgmi_mutex to prevent concurrent resets from being
triggered by different nodes, but after the mutex has been released, the
current node may grab the mutex and continue to do another reset. Maybe we
should check the GPU status and skip the reset in this case since the GPU
may already be in a good state.

Regards

shaoyun.liu

On 2018-11-21 1:10 p.m., Andrey Grodzovsky wrote:
> For XGMI hive case do reset in steps where each step iterates over
> all devs in hive. This is especially important for asic reset since all
> PSP FW in hive must come up within a limited time (around 1 sec) to
> properly negotiate the link. Do this by refactoring
> amdgpu_device_gpu_recover and amdgpu_device_reset into pre_asic_reset,
> asic_reset and post_asic_reset functions where each part is executed for
> all the GPUs in the hive before going to the next step.
>
> Signed-off-by: Andrey Grodzovsky
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 -
>   2 files changed, 264 insertions(+), 116 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 4ef5f7a..bd06d45 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>  	unsigned long last_mm_index;
>  	bool                            in_gpu_reset;
>  	struct mutex  lock_reset;
> +
> +	int asic_reset_res;
> +	int resched;
>  };
>
>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>
>  struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info *hive);
>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>  int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cb06e68..8e94d7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
>  	return 0;
>  }
>
> -/**
> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
> - *
> - * @adev: amdgpu device pointer
> - *
> - * attempt to do soft-reset or full-reset and reinitialize Asic
> - * return 0 means succeeded otherwise failed
> - */
> -static int amdgpu_device_reset(struct amdgpu_device *adev)
> -{
> -	bool need_full_reset, vram_lost = 0;
> -	int r;
> -
> -	need_full_reset = amdgpu_device_ip_need_full_reset(adev);
> -
> -	if (!need_full_reset) {
> -		amdgpu_device_ip_pre_soft_reset(adev);
> -		r = amdgpu_device_ip_soft_reset(adev);
> -		amdgpu_device_ip_post_soft_reset(adev);
> -		if (r || amdgpu_device_ip_check_soft_reset(adev)) {
> -			DRM_INFO("soft reset failed, will fallback to full reset!\n");
> -			need_full_reset = true;
> -		}
> -	}
> -
> -	if (need_full_reset) {
> -		r = amdgpu_device_ip_suspend(adev);
> -
> -retry:
> -		r = amdgpu_asic_reset(adev);
> -		/* post card */
> -		amdgpu_atom_asic_init(adev->mode_info.atom_context);
> -
> -		if (!r) {
> -			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
> -			r = amdgpu_device_ip_resume_phase1(adev);
> -			if (r)
> -				goto out;
> -
> -			vram_lost = amdgpu_device_check_vram_lost(adev);
> -			if (vram_lost) {
> -				DRM_ERROR("VRAM is lost!\n");
> -				atomic_inc(&adev->vram_lost_counter);
> -			}
> -
> -			r = amdgpu_gtt_mgr_recover(
> -				&adev->mman.bdev.man[TTM_PL_TT]);
> -			if (r)
> -				goto out;
> -
> -			r = amdgpu_device_fw_loading(adev);
> -			if (r)
> -				return r;
> -
> -			r = amdgpu_device_ip_resume_phase2(adev);
> -			if (r)
> -				goto out;
> -
> -			if (vram_lost)
> -				amdgpu_device_fill_reset_magic(adev);
> -		}
> -	}
> -
> -out:
> -	if (!r) {
> -		amdgpu_irq_gpu_reset_resume_helper(adev);
> -		r = amdgpu_ib_ring_tests(adev);
> -		if (r) {
> -			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
On Wed, Nov 21, 2018 at 1:11 PM Andrey Grodzovsky wrote:
>
> For XGMI hive case do reset in steps where each step iterates over
> all devs in hive. This is especially important for asic reset since all
> PSP FW in hive must come up within a limited time (around 1 sec) to
> properly negotiate the link. Do this by refactoring
> amdgpu_device_gpu_recover and amdgpu_device_reset into pre_asic_reset,
> asic_reset and post_asic_reset functions where each part is executed for
> all the GPUs in the hive before going to the next step.
>
> Signed-off-by: Andrey Grodzovsky
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 375 -
>   2 files changed, 264 insertions(+), 116 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 4ef5f7a..bd06d45 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1026,6 +1026,9 @@ struct amdgpu_device {
>  	unsigned long last_mm_index;
>  	bool                            in_gpu_reset;
>  	struct mutex  lock_reset;
> +
> +	int asic_reset_res;
> +	int resched;
>  };
>
>  static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> @@ -1232,7 +1235,7 @@ struct amdgpu_hive_info;
>
>  struct list_head *amdgpu_xgmi_get_adev_list_handle(struct amdgpu_hive_info *hive);
>  struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev);
> -int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive);
> +int amdgpu_xgmi_update_topology(struct amdgpu_hive_info *hive, struct amdgpu_device *adev);
>  int amdgpu_xgmi_add_device(struct amdgpu_device *adev);
>
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cb06e68..8e94d7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3157,86 +3157,6 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
>  	return 0;
>  }
>
> -/**
> - * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
> - *
> - * @adev: amdgpu device pointer
> - *
> - * attempt to do soft-reset or full-reset and reinitialize Asic
> - * return 0 means succeeded otherwise failed
> - */
> -static int amdgpu_device_reset(struct amdgpu_device *adev)
> -{
> -	bool need_full_reset, vram_lost = 0;
> -	int r;
> -
> -	need_full_reset = amdgpu_device_ip_need_full_reset(adev);
> -
> -	if (!need_full_reset) {
> -		amdgpu_device_ip_pre_soft_reset(adev);
> -		r = amdgpu_device_ip_soft_reset(adev);
> -		amdgpu_device_ip_post_soft_reset(adev);
> -		if (r || amdgpu_device_ip_check_soft_reset(adev)) {
> -			DRM_INFO("soft reset failed, will fallback to full reset!\n");
> -			need_full_reset = true;
> -		}
> -	}
> -
> -	if (need_full_reset) {
> -		r = amdgpu_device_ip_suspend(adev);
> -
> -retry:
> -		r = amdgpu_asic_reset(adev);
> -		/* post card */
> -		amdgpu_atom_asic_init(adev->mode_info.atom_context);
> -
> -		if (!r) {
> -			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
> -			r = amdgpu_device_ip_resume_phase1(adev);
> -			if (r)
> -				goto out;
> -
> -			vram_lost = amdgpu_device_check_vram_lost(adev);
> -			if (vram_lost) {
> -				DRM_ERROR("VRAM is lost!\n");
> -				atomic_inc(&adev->vram_lost_counter);
> -			}
> -
> -			r = amdgpu_gtt_mgr_recover(
> -				&adev->mman.bdev.man[TTM_PL_TT]);
> -			if (r)
> -				goto out;
> -
> -			r = amdgpu_device_fw_loading(adev);
> -			if (r)
> -				return r;
> -
> -			r = amdgpu_device_ip_resume_phase2(adev);
> -			if (r)
> -				goto out;
> -
> -			if (vram_lost)
> -				amdgpu_device_fill_reset_magic(adev);
> -		}
> -	}
> -
> -out:
> -	if (!r) {
> -		amdgpu_irq_gpu_reset_resume_helper(adev);
> -		r = amdgpu_ib_ring_tests(adev);
> -		if (r) {
> -			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
> -			r = amdgpu_device_ip_suspend(adev);
> -			need_full_reset = true;
> -			goto retry;
> -		}
> -	}
> -
> -	if (!r)
> -		r = amdgpu_device_recover_vram(adev);
>
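To summarize the shape of the refactoring the patch describes: each phase runs for every device in the hive before the next phase starts, so all PSP firmware instances come back up close together. A rough sketch only - the my_* helpers stand in for the new pre_asic_reset/asic_reset/post_asic_reset functions, the gmc.xgmi.head list linkage is an assumption about the surrounding amdgpu code rather than something quoted from the patch, and error handling is mostly elided:

#include "amdgpu.h"	/* amdgpu-internal header providing struct amdgpu_device */

/* Placeholder phase helpers standing in for the refactored
 * pre_asic_reset / asic_reset / post_asic_reset functions. */
int my_pre_asic_reset(struct amdgpu_device *adev);
int my_asic_reset(struct amdgpu_device *adev);
int my_post_asic_reset(struct amdgpu_device *adev);

static int my_hive_recover(struct list_head *device_list)
{
	struct amdgpu_device *adev;
	int r = 0;

	/* Phase 1: stop schedulers and suspend IP blocks on every device first. */
	list_for_each_entry(adev, device_list, gmc.xgmi.head)
		my_pre_asic_reset(adev);

	/* Phase 2: reset all ASICs back to back, so every PSP FW in the hive
	 * comes up inside the roughly one second link negotiation window. */
	list_for_each_entry(adev, device_list, gmc.xgmi.head) {
		int ret = my_asic_reset(adev);

		if (ret)
			r = ret;
	}

	/* Phase 3: resume IP blocks and restart the schedulers everywhere. */
	list_for_each_entry(adev, device_list, gmc.xgmi.head)
		my_post_asic_reset(adev);

	return r;
}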