Re: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
On 2021-11-22 at 11:16 a.m., Liu, Shaoyun wrote:
> [AMD Official Use Only]
>
> Thanks for the review.
> The hash for the previous change from the gerritgit/amd-staging-drm-next
> branch is 7079e7d5c6bf248bff, so there is another drm-next branch that is
> not in gerritgit for upstream?

Yes. amd-staging-drm-next is our AMD internal branch. Alex sends pull requests to Dave Airlie for his drm-next branch, where they get integrated with all the other DRM driver changes. That usually results in different commit hashes.

Regards,
  Felix

>
> Thanks
> Shaoyun.liu
>
> -----Original Message-----
> From: Kuehling, Felix
> Sent: Monday, November 22, 2021 10:40 AM
> To: Liu, Shaoyun; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
>
> On 2021-11-18 at 11:57 a.m., shaoyunl wrote:
>> For SRIOV XGMI configurations, the host driver handles the hive
>> reset, so on the guest side reset_sriov is only called once, on one
>> device. This makes kfd post_reset unbalanced with kfd pre_reset,
>> since kfd pre_reset has already been moved out of the reset_sriov
>> function. Move kfd post_reset out of the reset_sriov function to
>> balance them.
>>
>> Signed-off-by: shaoyunl
> Please change the headline prefix to "drm/amdgpu: ". The extra "/amd" is
> redundant. And I'd also add a tag
>
> Fixes: 9f4f2c1a3524 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
>
> Note that the commit hash is the one from the drm-next branch, which is what
> will get merged into master eventually.
RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
[AMD Official Use Only]

Thanks for the review.
The hash for the previous change from the gerritgit/amd-staging-drm-next branch is 7079e7d5c6bf248bff, so there is another drm-next branch that is not in gerritgit for upstream?

Thanks
Shaoyun.liu

-----Original Message-----
From: Kuehling, Felix
Sent: Monday, November 22, 2021 10:40 AM
To: Liu, Shaoyun; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

On 2021-11-18 at 11:57 a.m., shaoyunl wrote:
> For SRIOV XGMI configurations, the host driver handles the hive
> reset, so on the guest side reset_sriov is only called once, on one
> device. This makes kfd post_reset unbalanced with kfd pre_reset,
> since kfd pre_reset has already been moved out of the reset_sriov
> function. Move kfd post_reset out of the reset_sriov function to
> balance them.
>
> Signed-off-by: shaoyunl

Please change the headline prefix to "drm/amdgpu: ". The extra "/amd" is redundant. And I'd also add a tag

Fixes: 9f4f2c1a3524 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")

Note that the commit hash is the one from the drm-next branch, which is what will get merged into master eventually.
Re: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
On 2021-11-18 at 11:57 a.m., shaoyunl wrote:
> For SRIOV XGMI configurations, the host driver handles the hive reset,
> so on the guest side reset_sriov is only called once, on one device. This
> makes kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has
> already been moved out of the reset_sriov function. Move kfd post_reset out
> of the reset_sriov function to balance them.
>
> Signed-off-by: shaoyunl

Please change the headline prefix to "drm/amdgpu: ". The extra "/amd" is redundant. And I'd also add a tag

Fixes: 9f4f2c1a3524 ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")

Note that the commit hash is the one from the drm-next branch, which is what will get merged into master eventually. With those changes, the patch is

Reviewed-by: Felix Kuehling

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 10c8008d1da0..9a9d5493c676 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>
>  	amdgpu_irq_gpu_reset_resume_helper(adev);
>  	r = amdgpu_ib_ring_tests(adev);
> -	amdgpu_amdkfd_post_reset(adev);
>
>  error:
>  	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> @@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>
>  	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
>  	/* Actual ASIC resets if needed.*/
> -	/* TODO Implement XGMI hive reset logic for SRIOV */
> +	/* Host driver will handle XGMI hive reset for SRIOV */
>  	if (amdgpu_sriov_vf(adev)) {
>  		r = amdgpu_device_reset_sriov(adev, job ? false : true);
>  		if (r)
> @@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>
>  skip_sched_resume:
>  	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> -		/* unlock kfd: SRIOV would do it separately */
> -		if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
> +		/* unlock kfd */
> +		if (!need_emergency_restart)
>  			amdgpu_amdkfd_post_reset(tmp_adev);
>
>  		/* kfd_post_reset will do nothing if kfd device is not initialized,
RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
[AMD Official Use Only]

ping

-----Original Message-----
From: Liu, Shaoyun
Sent: Thursday, November 18, 2021 10:08 PM
To: amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

[AMD Official Use Only]

Ping

-----Original Message-----
From: Liu, Shaoyun
Sent: Thursday, November 18, 2021 11:58 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun
Subject: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

For SRIOV XGMI configurations, the host driver handles the hive reset,
so on the guest side reset_sriov is only called once, on one device. This
makes kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has
already been moved out of the reset_sriov function. Move kfd post_reset out
of the reset_sriov function to balance them.

Signed-off-by: shaoyunl
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 10c8008d1da0..9a9d5493c676 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,

 	amdgpu_irq_gpu_reset_resume_helper(adev);
 	r = amdgpu_ib_ring_tests(adev);
-	amdgpu_amdkfd_post_reset(adev);

 error:
 	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
@@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

 	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
 	/* Actual ASIC resets if needed.*/
-	/* TODO Implement XGMI hive reset logic for SRIOV */
+	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
 		r = amdgpu_device_reset_sriov(adev, job ? false : true);
 		if (r)
@@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

 skip_sched_resume:
 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
-		/* unlock kfd: SRIOV would do it separately */
-		if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
+		/* unlock kfd */
+		if (!need_emergency_restart)
 			amdgpu_amdkfd_post_reset(tmp_adev);

 		/* kfd_post_reset will do nothing if kfd device is not initialized,
-- 
2.17.1
RE: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
[AMD Official Use Only]

Ping

-----Original Message-----
From: Liu, Shaoyun
Sent: Thursday, November 18, 2021 11:58 AM
To: amd-gfx@lists.freedesktop.org
Cc: Liu, Shaoyun
Subject: [PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function

For SRIOV XGMI configurations, the host driver handles the hive reset,
so on the guest side reset_sriov is only called once, on one device. This
makes kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has
already been moved out of the reset_sriov function. Move kfd post_reset out
of the reset_sriov function to balance them.

Signed-off-by: shaoyunl
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 10c8008d1da0..9a9d5493c676 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,

 	amdgpu_irq_gpu_reset_resume_helper(adev);
 	r = amdgpu_ib_ring_tests(adev);
-	amdgpu_amdkfd_post_reset(adev);

 error:
 	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
@@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

 	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
 	/* Actual ASIC resets if needed.*/
-	/* TODO Implement XGMI hive reset logic for SRIOV */
+	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
 		r = amdgpu_device_reset_sriov(adev, job ? false : true);
 		if (r)
@@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

 skip_sched_resume:
 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
-		/* unlock kfd: SRIOV would do it separately */
-		if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
+		/* unlock kfd */
+		if (!need_emergency_restart)
 			amdgpu_amdkfd_post_reset(tmp_adev);

 		/* kfd_post_reset will do nothing if kfd device is not initialized,
-- 
2.17.1
[PATCH] drm/amd/amdgpu: move kfd post_reset out of reset_sriov function
For SRIOV XGMI configurations, the host driver handles the hive reset,
so on the guest side reset_sriov is only called once, on one device. This
makes kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has
already been moved out of the reset_sriov function. Move kfd post_reset out
of the reset_sriov function to balance them.

Signed-off-by: shaoyunl
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 10c8008d1da0..9a9d5493c676 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4308,7 +4308,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,

 	amdgpu_irq_gpu_reset_resume_helper(adev);
 	r = amdgpu_ib_ring_tests(adev);
-	amdgpu_amdkfd_post_reset(adev);

 error:
 	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
@@ -5081,7 +5080,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

 	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
 	/* Actual ASIC resets if needed.*/
-	/* TODO Implement XGMI hive reset logic for SRIOV */
+	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
 		r = amdgpu_device_reset_sriov(adev, job ? false : true);
 		if (r)
@@ -5141,8 +5140,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

 skip_sched_resume:
 	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
-		/* unlock kfd: SRIOV would do it separately */
-		if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
+		/* unlock kfd */
+		if (!need_emergency_restart)
 			amdgpu_amdkfd_post_reset(tmp_adev);

 		/* kfd_post_reset will do nothing if kfd device is not initialized,
-- 
2.17.1
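
For readers following the thread, the imbalance described in the commit message can be illustrated with a small stand-alone model. This is only a sketch built from the commit message, not driver code: struct toy_device, the per-device kfd_lock_depth counter, and every toy_* function are hypothetical names invented here, and the real driver's lock bookkeeping may well differ. It assumes pre_reset runs for every device on the reset list while the guest's reset_sriov runs once, on one device.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_DEVICES 8

/* Hypothetical stand-in for one GPU in the hive. kfd_lock_depth goes up on
 * pre_reset and down on post_reset; 0 means KFD is unlocked again. */
struct toy_device {
	int kfd_lock_depth;
};

static void toy_kfd_pre_reset(struct toy_device *dev)
{
	dev->kfd_lock_depth++;
}

static void toy_kfd_post_reset(struct toy_device *dev)
{
	dev->kfd_lock_depth--;
}

/* Toy guest-side reset. Before the patch it also called post_reset. */
static void toy_reset_sriov(struct toy_device *dev, bool patched)
{
	if (!patched)
		toy_kfd_post_reset(dev);
}

/* Runs one toy recovery over n devices and returns how many of them are
 * still locked afterwards. */
int toy_recover_count_locked(int n, bool patched)
{
	struct toy_device hive[TOY_MAX_DEVICES] = { { 0 } };
	int i, still_locked = 0;

	assert(n >= 1 && n <= TOY_MAX_DEVICES);

	/* pre_reset runs for every device on the reset list. */
	for (i = 0; i < n; i++)
		toy_kfd_pre_reset(&hive[i]);

	/* The guest resets through a single device; the host handles the hive. */
	toy_reset_sriov(&hive[0], patched);

	/* After the patch, post_reset runs in the common per-device loop.
	 * Before it, SRIOV devices were skipped in that loop. */
	if (patched)
		for (i = 0; i < n; i++)
			toy_kfd_post_reset(&hive[i]);

	for (i = 0; i < n; i++)
		if (hive[i].kfd_lock_depth != 0)
			still_locked++;

	return still_locked;
}
```

In this model, toy_recover_count_locked(4, false) leaves three of the four devices locked, while toy_recover_count_locked(4, true) leaves none. A single-device setup is balanced either way, which is why the problem only shows up with an XGMI hive.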