Re: [PATCH] drm/radeon: remove load callback

2024-06-21 Thread Hoi Pok Wu
Dear Thomas,

Thank you for testing my patch. dev->dev_private is indeed the problem.

However, most of the functions that use dev->dev_private take a
drm_device as a parameter and then use dev->dev_private to retrieve the
radeon_device, which contradicts what the patch intended. They should use
radeon_device directly. Should I send a follow-up with the updated patch?

Thank you.

Best Regards
Wu

On Wed, Jun 19, 2024 at 10:28 AM Thomas Zimmermann  wrote:
>
> Hi
>
> On 07.06.24 at 03:14, wu hoi pok wrote:
> > This patch removes the load callback from the kms_driver. Following
> > amdgpu closely, radeon_driver_load_kms and devm_drm_dev_alloc are used.
> > Most of the changes here are rdev->ddev to rdev_to_drm, which maps to
> > adev_to_drm in amdgpu. However, this patch is not tested on hardware,
> > so if you are free and have a GCN1 or GCN2 card, please test it.
> >
> > Signed-off-by: wu hoi pok 
>
> I volunteer for testing. The test device is
>
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Turks PRO [Radeon HD 6570/7570/8550 / R5 230] (prog-if 00 [VGA
> controller])
>  Subsystem: PC Partner Limited / Sapphire Technology Device e193
>  Flags: bus master, fast devsel, latency 0, IRQ 147
>  Memory at c000 (64-bit, prefetchable) [size=256M]
>  Memory at dfe2 (64-bit, non-prefetchable) [size=128K]
>  I/O ports at e000 [size=256]
>  Expansion ROM at 000c [disabled] [size=128K]
>  Capabilities: 
>  Kernel driver in use: radeon
>  Kernel modules: radeon, amdgpu
>
>
> With the current patch, the driver crashes upon booting. Here is your
> backtrace:
>
> [   24.013524] Console: switching to colour dummy device 80x25
> [   24.021093] radeon :01:00.0: vgaarb: deactivate vga console
> [   24.031806] [drm] initializing kernel modesetting (TURKS
> 0x1002:0x6759 0x174B:0xE193 0x00).
> [   24.041066] ATOM BIOS: YODA
> [   24.043930]
> ==
> [   24.051195] BUG: KASAN: user-memory-access in
> radeon_atom_initialize_bios_scratch_regs+0x33/0x110 [radeon]
> [   24.061287] Read of size 4 at addr 1058 by task
> (udev-worker)/349
> [   24.061292]
> [   24.061295] CPU: 3 PID: 349 Comm: (udev-worker) Tainted: G U
> E  6.10.0-rc4-1-default+ #2977
> [   24.061301] Hardware name: System manufacturer System Product
> Name/Z170-A, BIOS 3802 03/15/2018
> [   24.061305] Call Trace:
> [   24.061308]
> [   24.061313]  dump_stack_lvl+0x68/0x90
> [   24.061322]  ? radeon_atom_initialize_bios_scratch_regs+0x33/0x110 [radeon]
> [   24.105026]  kasan_report+0xcf/0x1a0
> [   24.105039]  ? radeon_atom_initialize_bios_scratch_regs+0x33/0x110 [radeon]
> [   24.117055]  ? __pfx_cail_ioreg_read+0x10/0x10 [radeon]
> [   24.123698]  radeon_atom_initialize_bios_scratch_regs+0x33/0x110 [radeon]
> [   24.131933]  radeon_atombios_init+0x192/0x220 [radeon]
>
> [   24.138506]  evergreen_init+0x57/0x400 [radeon]
> [   24.143473]  radeon_device_init+0x8f2/0x1040 [radeon]
> [   24.148897]  ? down_read_failed+0x7/0x410
> [   24.152936]  ? ksm_might_need_to_copy+0x10/0x280
> [   24.157594]  radeon_driver_load_kms+0xe3/0x330 [radeon]
> [   24.163198]  radeon_pci_probe+0x117/0x180 [radeon]
> [   24.168431]  ? __pfx_radeon_pci_probe+0x10/0x10 [radeon]
> [   24.174161]  local_pci_probe+0x74/0xc0
> [   24.177945]  pci_call_probe+0xc6/0x260
> [   24.181727]  ? __pfx_pci_call_probe+0x10/0x10
> [   24.186118]  ? do_raw_spin_trylock+0xb0/0xf0
> [   24.190439]  ? pci_match_device+0x1c5/0x240
> [   24.194651]  ? pci_match_id+0x102/0x150
> [   24.198522]  ? pci_match_device+0x1dd/0x240
> [   24.202752]  pci_device_probe+0x9d/0x150
> [   24.206705]  ? driver_sysfs_add+0xb0/0x130
> [   24.210838]  really_probe+0x13b/0x490
> [   24.214547]  __driver_probe_device+0xca/0x1b0
> [   24.218943]  driver_probe_device+0x4a/0xf0
> [   24.223073]  __driver_attach+0x136/0x290
> [   24.227032]  ? __pfx___driver_attach+0x10/0x10
> [   24.231508]  bus_for_each_dev+0xc0/0x110
> [   24.235465]  ? __pfx_bus_for_each_dev+0x10/0x10
> [   24.240032]  ? bus_add_driver+0x17a/0x2b0
> [   24.244079]  bus_add_driver+0x19a/0x2b0
> [   24.247950]  driver_register+0xc5/0x140
> [   24.251817]  ? __pfx_radeon_module_init+0x10/0x10 [radeon]
> [   24.257674]  do_one_initcall+0xbc/0x390
> [   24.261542]  ? __pfx_do_one_initcall+0x10/0x10
> [   24.266022]  ? kasan_unpoison+0x40/0x70
> [   24.269891]  ? rcu_is_watching+0x34/0x60
> [   24.273849]  ? kmalloc_trace_noprof+0x286/0x320
> [   24.278415]  ? do_init_module+0x38/0x3a0
> [   24.282387]  ? kasan_unpoison+0x40/0x70
> [   24.286264]  do_init_module+0x13a/0x3a0
> [   24.290133]  init_module_from_file+0xc0/0x100
> [   24.294523]  ? __pfx_init_module_from_file+0x10/0x10
> [   24.299522]  ? __lock_release.isra.0+0x132/0x4f0
> [   24.304185]  ? do_raw_spin_unlock+0x83/0xe0
> [   24.304209]  ide

[PATCH] drm/amd/display: Remove redundant code and semicolons

2024-06-21 Thread Jiapeng Chong
No functional modification involved.

./drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_core/dml2_core_shared.c:3171:2-3:
 Unneeded semicolon.
./drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_core/dml2_core_shared.c:3185:2-3:
 Unneeded semicolon.
./drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_core/dml2_core_shared.c:3200:2-3:
 Unneeded semicolon.

Reported-by: Abaci Robot 
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9365
Signed-off-by: Jiapeng Chong 
---
 .../dml21/src/dml2_core/dml2_core_shared.c| 46 +--
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_core/dml2_core_shared.c b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_core/dml2_core_shared.c
index cfa4c4475821..1a9895b1833f 100644
--- a/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_core/dml2_core_shared.c
+++ b/drivers/gpu/drm/amd/display/dc/dml2/dml21/src/dml2_core/dml2_core_shared.c
@@ -3142,62 +3142,62 @@ static unsigned int dml_get_tile_block_size_bytes(enum dml2_swizzle_mode sw_mode
 {
 	switch (sw_mode) {
 	case (dml2_sw_linear):
-		return 256; break;
+		return 256;
 	case (dml2_sw_256b_2d):
-		return 256; break;
+		return 256;
 	case (dml2_sw_4kb_2d):
-		return 4096; break;
+		return 4096;
 	case (dml2_sw_64kb_2d):
-		return 65536; break;
+		return 65536;
 	case (dml2_sw_256kb_2d):
-		return 262144; break;
+		return 262144;
 	case (dml2_gfx11_sw_linear):
-		return 256; break;
+		return 256;
 	case (dml2_gfx11_sw_64kb_d):
-		return 65536; break;
+		return 65536;
 	case (dml2_gfx11_sw_64kb_d_t):
-		return 65536; break;
+		return 65536;
 	case (dml2_gfx11_sw_64kb_d_x):
-		return 65536; break;
+		return 65536;
 	case (dml2_gfx11_sw_64kb_r_x):
-		return 65536; break;
+		return 65536;
 	case (dml2_gfx11_sw_256kb_d_x):
-		return 262144; break;
+		return 262144;
 	case (dml2_gfx11_sw_256kb_r_x):
-		return 262144; break;
+		return 262144;
 	default:
 		DML2_ASSERT(0);
 		return 256;
-	};
+	}
 }
 
 const char *dml2_core_internal_bw_type_str(enum dml2_core_internal_bw_type bw_type)
 {
 	switch (bw_type) {
 	case (dml2_core_internal_bw_sdp):
-		return("dml2_core_internal_bw_sdp"); break;
+		return("dml2_core_internal_bw_sdp");
 	case (dml2_core_internal_bw_dram):
-		return("dml2_core_internal_bw_dram"); break;
+		return("dml2_core_internal_bw_dram");
 	case (dml2_core_internal_bw_max):
-		return("dml2_core_internal_bw_max"); break;
+		return("dml2_core_internal_bw_max");
 	default:
-		return("dml2_core_internal_bw_unknown"); break;
-	};
+		return("dml2_core_internal_bw_unknown");
+	}
 }
 
 const char *dml2_core_internal_soc_state_type_str(enum dml2_core_internal_soc_state_type dml2_core_internal_soc_state_type)
 {
 	switch (dml2_core_internal_soc_state_type) {
 	case (dml2_core_internal_soc_state_sys_idle):
-		return("dml2_core_internal_soc_state_sys_idle"); break;
+		return("dml2_core_internal_soc_state_sys_idle");
 	case (dml2_core_internal_soc_state_sys_active):
-		return("dml2_core_internal_soc_state_sys_active"); break;
+		return("dml2_core_internal_soc_state_sys_active");
 	case (dml2_core_internal_soc_state_svp_prefetch):
-		return("dml2_core_internal_soc_state_svp_prefetch"); break;
+		return("dml2_core_internal_soc_state_svp_prefetch");
 	case dml2_core_internal_soc_state_max:
 	default:
-		return("dml2_core_internal_soc_state_unknown"); break;
-	};
+		return("dml2_core_internal_soc_state_unknown");
+	}
 }
 
 static bool dml_is_vertical_rotation(enum dml2_rotation_angle Scan)
-- 
2.20.1.7.g153144c
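
For readers skimming the patch above, here is a minimal standalone sketch of why the change is behavior-neutral. The enum and function names are hypothetical stand-ins, not the driver's own; the only assumption is the pattern itself: a `break` placed after a `return` can never run, and `};` after a switch body just adds an empty statement.

```c
/* Hypothetical enum standing in for the driver's swizzle modes. */
enum demo_sw_mode { DEMO_SW_LINEAR, DEMO_SW_4KB, DEMO_SW_64KB };

/* Before: each `break` is unreachable and `};` adds an empty statement. */
static unsigned int demo_block_size_before(enum demo_sw_mode mode)
{
    switch (mode) {
    case DEMO_SW_LINEAR:
        return 256; break;
    case DEMO_SW_4KB:
        return 4096; break;
    case DEMO_SW_64KB:
        return 65536; break;
    default:
        return 256;
    };
}

/* After: identical control flow with the dead statements removed. */
static unsigned int demo_block_size_after(enum demo_sw_mode mode)
{
    switch (mode) {
    case DEMO_SW_LINEAR:
        return 256;
    case DEMO_SW_4KB:
        return 4096;
    case DEMO_SW_64KB:
        return 65536;
    default:
        return 256;
    }
}
```

Both functions return the same value for every input, which matches the "No functional modification involved" note in the commit message.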



[PATCH] drm/amdgpu/kfd: Add unlock() on error path to add_queue_mes()

2024-06-21 Thread Dan Carpenter
We recently added locking to add_queue_mes() but this error path was
overlooked.  Add an unlock to the error path.

Fixes: 1802b042a343 ("drm/amdgpu/kfd: remove is_hws_hang and is_resetting")
Signed-off-by: Dan Carpenter 
---
 drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index d2fceb6f9802..4f48507418d2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -230,6 +230,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, struct queue *q,
 	if (queue_type < 0) {
 		dev_err(adev->dev, "Queue type not supported with MES, queue:%d\n",
 			q->properties.type);
+		up_read(&adev->reset_domain->sem);
 		return -EINVAL;
 	}
 	queue_input.queue_type = (uint32_t)queue_type;
-- 
2.43.0
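
The class of bug fixed above can be sketched in plain C. This is only an illustration under stated assumptions: `demo_rwsem`, `demo_down_read()`, `demo_up_read()` and `demo_add_queue()` are hypothetical stand-ins that merely count readers, and the sketch assumes the lock is taken at the top of the function and released before every return. The quoted hunk only shows the error path, so the rest of the flow is not taken from the real driver.

```c
#include <errno.h>

/* Hypothetical stand-in for a kernel rw_semaphore: it only counts readers. */
struct demo_rwsem {
    int readers;
};

static void demo_down_read(struct demo_rwsem *sem) { sem->readers++; }
static void demo_up_read(struct demo_rwsem *sem)   { sem->readers--; }

/* Every return after demo_down_read() must be paired with demo_up_read(). */
static int demo_add_queue(struct demo_rwsem *sem, int queue_type)
{
    demo_down_read(sem);

    if (queue_type < 0) {
        /* Error path: drop the lock before bailing out. */
        demo_up_read(sem);
        return -EINVAL;
    }

    /* ... the real work would happen here ... */

    demo_up_read(sem);
    return 0;
}
```

If the `demo_up_read()` call in the error branch is removed, the reader count stays at 1 after a failed call, which is the leak that Smatch-style checkers flag.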



[PATCH] drm/amd/display: Clean up indenting in dm_dp_mst_is_port_support_mode()

2024-06-21 Thread Dan Carpenter
This code works, but it's not aligned correctly.  Add a couple missing
tabs.

Signed-off-by: Dan Carpenter 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
index 48118447c8d9..5d4f831b1e55 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_mst_types.c
@@ -1691,7 +1691,7 @@ enum dc_status dm_dp_mst_is_port_support_mode(
if (aconnector->mst_output_port->passthrough_aux) {
if (bw_range.min_kbps > end_to_end_bw_in_kbps) {
DRM_DEBUG_DRIVER("DSC passthrough. Max dsc 
compression can't fit into end-to-end bw\n");
-   return DC_FAIL_BANDWIDTH_VALIDATE;
+   return DC_FAIL_BANDWIDTH_VALIDATE;
}
} else {
/*dsc bitstream decoded at the dp last link*/
@@ -1756,7 +1756,7 @@ enum dc_status dm_dp_mst_is_port_support_mode(
if (branch_max_throughput_mps != 0 &&
((stream->timing.pix_clk_100hz / 10) >  
branch_max_throughput_mps * 1000)) {
DRM_DEBUG_DRIVER("DSC is required but max throughput 
mps fails");
-   return DC_FAIL_BANDWIDTH_VALIDATE;
+   return DC_FAIL_BANDWIDTH_VALIDATE;
}
} else {
DRM_DEBUG_DRIVER("DSC is required but can't find common dsc 
config.");
-- 
2.43.0



Re: [PATCH v3 2/2] drm/amd: Add power_saving_policy drm property to eDP connectors

2024-06-21 Thread Xaver Hugl
On Wed, 19 June 2024 at 06:08, Mario Limonciello wrote:

> Thanks!  I don't have permissions, so can you (or someone else) please
> apply to drm-misc-next for me?
>
> After it's merged I'll rebase and work on the feedback for the new IGT
> tests.

Merging can only happen once a real-world userspace application has
implemented support for it. I'll try to do that sometime next week in
KWin.


Re: [PATCH 21/39] drm/amd/display: Make DML2.1 P-State method force per stream

2024-06-21 Thread Greg KH
On Thu, Jun 20, 2024 at 10:11:27AM -0600, Alex Hung wrote:
> From: Dillon Varone 
> 
> [WHY & HOW]
> Currently the force only works for a single display, make it so it can
> be forced per stream.
> 
> Reviewed-by: Alvin Lee 
> Cc: Mario Limonciello 
> Cc: Alex Deucher 
> Cc: sta...@vger.kernel.org
> Acked-by: Alex Hung 
> Signed-off-by: Dillon Varone 

When submitting patches from others, you too have to sign-off on the
patch.  Read the DCO in the documentation for details.

thanks,

greg k-h


Re: [PATCH] drm/radeon: remove load callback

2024-06-21 Thread Thomas Zimmermann

Hi

On 20.06.24 at 16:30, Hoi Pok Wu wrote:
> Dear Thomas,
>
> Thank you for testing my patch. dev->dev_private is indeed the problem.
>
> However, most of the functions that use dev->dev_private take a
> drm_device as a parameter and then use dev->dev_private to retrieve the
> radeon_device, which contradicts what the patch intended. They should use
> radeon_device directly. Should I send a follow-up with the updated patch?

Simply assign the radeon_device to dev_private as before and you'll be
fine. Reworking all function calls would be a patchset of its own.


Best regards
Thomas
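
A rough sketch of what "assign the radeon_device to dev_private as before" amounts to, using heavily reduced, hypothetical stand-in structures (`demo_drm_device`, `demo_radeon_device`) rather than the real DRM types. It assumes the driver struct embeds the DRM device, as is usual with devm_drm_dev_alloc(), and that legacy helpers still look the driver struct up through the back-pointer.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the DRM structures. */
struct demo_drm_device {
    void *dev_private;
};

struct demo_radeon_device {
    struct demo_drm_device ddev;  /* embedded DRM device */
    int family;
};

/* Legacy-style lookup: helpers that only get a drm_device rely on this. */
static struct demo_radeon_device *demo_to_rdev(struct demo_drm_device *dev)
{
    return dev->dev_private;
}

/* Probe-time setup: keep the back-pointer so legacy helpers keep working. */
static void demo_probe(struct demo_radeon_device *rdev)
{
    rdev->ddev.dev_private = rdev;
}
```

Without the assignment in `demo_probe()`, `demo_to_rdev()` returns NULL, and any field access through it reads from a small offset off address zero, which is consistent with the kind of KASAN report quoted earlier in the thread.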




Re: [PATCH] drm/radeon: remove load callback

2024-06-21 Thread Christian König

On 21.06.24 at 09:16, Thomas Zimmermann wrote:
> Hi
>
> On 20.06.24 at 16:30, Hoi Pok Wu wrote:
> > Dear Thomas,
> >
> > Thank you for testing my patch. dev->dev_private is indeed the problem.
> >
> > However, most of the functions that use dev->dev_private take a
> > drm_device as a parameter and then use dev->dev_private to retrieve the
> > radeon_device, which contradicts what the patch intended. They should use
> > radeon_device directly. Should I send a follow-up with the updated patch?
>
> Simply assign the radeon_device to dev_private as before and you'll be
> fine. Reworking all function calls would be a patchset of its own.

Yeah, completely agree. Try to keep it as simple and stupid as possible
for now; more extensive cleanups can come later.


And thanks a lot for looking into this and testing the stuff.

Regards,
Christian.




Re: [PATCH 1/2] drm/amdgpu: Unmap BO memory before calling amdgpu_bo_unref()

2024-06-21 Thread Thomas Zimmermann

Hi

On 20.06.24 at 17:50, Christian König wrote:
> On 20.06.24 at 16:44, Thomas Zimmermann wrote:
> > Prepares for using ttm_bo_vmap() and ttm_bo_vunmap() in amdgpu. Both
> > require the caller to hold the GEM reservation lock, which is not the
> > case while releasing a buffer object. Hence, push a possible call to
> > unmap out from the buffer-object release code. Warn if a buffer object
> > with mapped pages is supposed to be released.
>
> Yeah, I've looked into this a while ago as well and that unfortunately
> won't work like this.
>
> Amdgpu also uses ttm_bo_kmap() on user allocations, so the
> amdgpu_bo_kunmap() in amdgpu_bo_destroy() is a must have.

Is there a testcase (igt-gpu-tools?) that runs this code? I've tested
these patches by booting and running a 3D game under X11, but I didn't
expect that to fully cover all cases.

Best regards
Thomas

> On the other hand I'm pretty sure that calling ttm_bo_vunmap() without
> holding the reservation lock is ok in this situation.
>
> After all it's guaranteed that nobody else is having a reference to
> the BO any more.
>
> Regards,
> Christian.
>
> > Signed-off-by: Thomas Zimmermann
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 11 +++
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> > index a1b7438c43dc8..d58b11ea0ead5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> > @@ -58,7 +58,12 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
> >  {
> >  	struct amdgpu_bo *bo = ttm_to_amdgpu_bo(tbo);
> >  
> > -	amdgpu_bo_kunmap(bo);
> > +	/*
> > +	 * BO memory pages should be unmapped at this point. Call
> > +	 * amdgpu_bo_kunmap() before releasing the BO.
> > +	 */
> > +	if (drm_WARN_ON_ONCE(bo->tbo.base.dev, bo->kmap.bo))
> > +		amdgpu_bo_kunmap(bo);
> >  
> >  	if (bo->tbo.base.import_attach)
> >  		drm_prime_gem_destroy(&bo->tbo.base, bo->tbo.sg);
> > @@ -450,9 +455,7 @@ void amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64 *gpu_addr,
> >  	WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend);
> >  
> >  	if (likely(amdgpu_bo_reserve(*bo, true) == 0)) {
> > -		if (cpu_addr)
> > -			amdgpu_bo_kunmap(*bo);
> > -
> > +		amdgpu_bo_kunmap(*bo);
> >  		amdgpu_bo_unpin(*bo);
> >  		amdgpu_bo_unreserve(*bo);
> >  	}




--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstrasse 146, 90461 Nuernberg, Germany
GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
HRB 36809 (AG Nuernberg)



[PATCH] drm/amdgpu: Fix smatch static checker warning

2024-06-21 Thread Hawking Zhang
adev->gfx.imu.funcs could be NULL.

Signed-off-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 8 
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index b4575765d7a8..5c17409439f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4498,11 +4498,11 @@ static int gfx_v11_0_hw_init(void *handle)
 			/* RLC autoload sequence 1: Program rlc ram */
 			if (adev->gfx.imu.funcs->program_rlc_ram)
 				adev->gfx.imu.funcs->program_rlc_ram(adev);
+			/* rlc autoload firmware */
+			r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
+			if (r)
+				return r;
 		}
-		/* rlc autoload firmware */
-		r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
-		if (r)
-			return r;
 	} else {
 		if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
 			if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 460bf33a22b1..16fc5c5b15f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -3258,11 +3258,11 @@ static int gfx_v12_0_hw_init(void *handle)
 			/* RLC autoload sequence 1: Program rlc ram */
 			if (adev->gfx.imu.funcs->program_rlc_ram)
 				adev->gfx.imu.funcs->program_rlc_ram(adev);
+			/* rlc autoload firmware */
+			r = gfx_v12_0_rlc_backdoor_autoload_enable(adev);
+			if (r)
+				return r;
 		}
-		/* rlc autoload firmware */
-		r = gfx_v12_0_rlc_backdoor_autoload_enable(adev);
-		if (r)
-			return r;
 	} else {
 		if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
 			if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
-- 
2.17.1
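
A minimal sketch of the pattern behind this fix, under stated assumptions: the structs and functions below (`demo_imu_funcs`, `demo_dev`, `demo_autoload_enable()`, `demo_hw_init()`, `demo_load_ok()`) are hypothetical stand-ins, and the callee is assumed to dereference the function table unconditionally, which is what the static checker report describes. Moving the call inside the NULL check keeps that dereference from being reached when the table is absent.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-ins for adev->gfx.imu.funcs and adev. */
struct demo_imu_funcs {
    int (*load_microcode)(void);
};

struct demo_dev {
    const struct demo_imu_funcs *imu_funcs;  /* may be NULL */
    int autoload_calls;
};

/* Callee that dereferences imu_funcs without checking it. */
static int demo_autoload_enable(struct demo_dev *dev)
{
    dev->autoload_calls++;
    return dev->imu_funcs->load_microcode();
}

/* Caller: the call now sits inside the NULL check, as in the patch. */
static int demo_hw_init(struct demo_dev *dev)
{
    int r;

    if (dev->imu_funcs) {
        r = demo_autoload_enable(dev);
        if (r)
            return r;
    }
    return 0;
}

/* Trivial callback used to exercise the non-NULL path. */
static int demo_load_ok(void)
{
    return 0;
}
```

With `imu_funcs` set to NULL, `demo_hw_init()` returns 0 without touching the callee; with a valid table, the callee runs exactly once.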



RE: [bug report] drm/amdgpu: add init support for GFX11 (v2)

2024-06-21 Thread Zhang, Hawking
[AMD Official Use Only - AMD Internal Distribution Only]

Hi,

The fix is sent out for code review.

Regards,
Hawking

-----Original Message-----
From: Dan Carpenter
Sent: Saturday, June 15, 2024 01:33
To: Zhang, Hawking
Cc: amd-gfx@lists.freedesktop.org; SHANMUGAM, SRINIVASAN
Subject: [bug report] drm/amdgpu: add init support for GFX11 (v2)

Hello Hawking Zhang,

Commit 3d879e81f0f9 ("drm/amdgpu: add init support for GFX11 (v2)") from Apr 
13, 2022 (linux-next), leads to the following Smatch static checker warning:

drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c:4503 gfx_v11_0_hw_init()
error: we previously assumed 'adev->gfx.imu.funcs' could be null (see line 4497)

drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
    4491 static int gfx_v11_0_hw_init(void *handle)
    4492 {
    4493         int r;
    4494         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
    4495
    4496         if (adev->firmware.load_type == AMDGPU_FW_LOAD_RLC_BACKDOOR_AUTO) {
    4497                 if (adev->gfx.imu.funcs) {
                             ^^^ Check for NULL

    4498                         /* RLC autoload sequence 1: Program rlc ram */
    4499                         if (adev->gfx.imu.funcs->program_rlc_ram)
    4500                                 adev->gfx.imu.funcs->program_rlc_ram(adev);
    4501                 }
    4502                 /* rlc autoload firmware */
--> 4503                 r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);

Unchecked dereference inside the function.  (Probably just delete the NULL check?)

    4504                 if (r)
    4505                         return r;
    4506         } else {

regards,
dan carpenter


[PATCH] drm/amdgpu: normalize registers as local xcc to read/write under sriov in TLB flush

2024-06-21 Thread Jane Jian
[WHY]
SR-IOV hits a higher-bit violation when flushing the TLB.

[HOW]
Normalize the registers to keep the lower 16 bits (dword aligned) to avoid
the higher-bit violation.
RLCG will mask the xcd out and always assume it is accessing its own xcd.
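
As a rough illustration of the normalization described in [HOW], the sketch below reads "keep the lower 16 bits" as masking a register offset with 0xFFFF. The macro name is hypothetical and the mask value is an assumption drawn only from that wording; the macro actually added by the patch is not visible in the truncated diff below.

```c
#include <stdint.h>

/*
 * Hypothetical macro: keep bits [15:0] of a register offset and drop the
 * higher bits. The 0xFFFF mask is an assumption based on the commit text.
 */
#define DEMO_NORMALIZE_REG_OFFSET(offset) ((uint32_t)(offset) & 0xFFFFu)

/* Function form of the same operation, handy for call sites and tests. */
static inline uint32_t demo_normalize_reg_offset(uint32_t offset)
{
    return DEMO_NORMALIZE_REG_OFFSET(offset);
}
```

An offset that already fits in 16 bits passes through unchanged, so applying the normalization to registers without high bits set would be a no-op under this reading.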

[TODO]
Add the normalization in sriov w/rreg later, after fixing bugs.

v2:
- rename the normalized macro, add ip block type for further use
- move asics func declaration after ip block type since new func refers ip
  block type
- add normalization in emit flush tlb as well

Signed-off-by: Jane Jian 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h| 112 +++--
 drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c |  16 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  32 --
 drivers/gpu/drm/amd/amdgpu/soc15.c |   1 +
 drivers/gpu/drm/amd/amdgpu/soc15.h |   1 +
 drivers/gpu/drm/amd/amdgpu/soc15_common.h  |   5 +-
 6 files changed, 101 insertions(+), 66 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 083f353cff6e..070fd9e601fe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -583,61 +583,6 @@ struct amdgpu_video_codecs {
const struct amdgpu_video_codec_info *codec_array;
 };
 
-/*
- * ASIC specific functions.
- */
-struct amdgpu_asic_funcs {
-   bool (*read_disabled_bios)(struct amdgpu_device *adev);
-   bool (*read_bios_from_rom)(struct amdgpu_device *adev,
-  u8 *bios, u32 length_bytes);
-   int (*read_register)(struct amdgpu_device *adev, u32 se_num,
-u32 sh_num, u32 reg_offset, u32 *value);
-   void (*set_vga_state)(struct amdgpu_device *adev, bool state);
-   int (*reset)(struct amdgpu_device *adev);
-   enum amd_reset_method (*reset_method)(struct amdgpu_device *adev);
-   /* get the reference clock */
-   u32 (*get_xclk)(struct amdgpu_device *adev);
-   /* MM block clocks */
-   int (*set_uvd_clocks)(struct amdgpu_device *adev, u32 vclk, u32 dclk);
-   int (*set_vce_clocks)(struct amdgpu_device *adev, u32 evclk, u32 ecclk);
-   /* static power management */
-   int (*get_pcie_lanes)(struct amdgpu_device *adev);
-   void (*set_pcie_lanes)(struct amdgpu_device *adev, int lanes);
-   /* get config memsize register */
-   u32 (*get_config_memsize)(struct amdgpu_device *adev);
-   /* flush hdp write queue */
-   void (*flush_hdp)(struct amdgpu_device *adev, struct amdgpu_ring *ring);
-   /* invalidate hdp read cache */
-   void (*invalidate_hdp)(struct amdgpu_device *adev,
-  struct amdgpu_ring *ring);
-   /* check if the asic needs a full reset of if soft reset will work */
-   bool (*need_full_reset)(struct amdgpu_device *adev);
-   /* initialize doorbell layout for specific asic*/
-   void (*init_doorbell_index)(struct amdgpu_device *adev);
-   /* PCIe bandwidth usage */
-   void (*get_pcie_usage)(struct amdgpu_device *adev, uint64_t *count0,
-  uint64_t *count1);
-   /* do we need to reset the asic at init time (e.g., kexec) */
-   bool (*need_reset_on_init)(struct amdgpu_device *adev);
-   /* PCIe replay counter */
-   uint64_t (*get_pcie_replay_count)(struct amdgpu_device *adev);
-   /* device supports BACO */
-   int (*supports_baco)(struct amdgpu_device *adev);
-   /* pre asic_init quirks */
-   void (*pre_asic_init)(struct amdgpu_device *adev);
-   /* enter/exit umd stable pstate */
-   int (*update_umd_stable_pstate)(struct amdgpu_device *adev, bool enter);
-   /* query video codecs */
-   int (*query_video_codecs)(struct amdgpu_device *adev, bool encode,
- const struct amdgpu_video_codecs **codecs);
-   /* encode "> 32bits" smn addressing */
-   u64 (*encode_ext_smn_addressing)(int ext_id);
-
-   ssize_t (*get_reg_state)(struct amdgpu_device *adev,
-enum amdgpu_reg_state reg_state, void *buf,
-size_t max_size);
-};
-
 /*
  * IOCTL.
  */
@@ -728,6 +673,63 @@ enum amd_hw_ip_block_type {
MAX_HWIP
 };
 
+/*
+ * ASIC specific functions.
+ */
+struct amdgpu_asic_funcs {
+   bool (*read_disabled_bios)(struct amdgpu_device *adev);
+   bool (*read_bios_from_rom)(struct amdgpu_device *adev,
+  u8 *bios, u32 length_bytes);
+   int (*read_register)(struct amdgpu_device *adev, u32 se_num,
+u32 sh_num, u32 reg_offset, u32 *value);
+   void (*set_vga_state)(struct amdgpu_device *adev, bool state);
+   int (*reset)(struct amdgpu_device *adev);
+   enum amd_reset_method (*reset_method)(struct amdgpu_device *adev);
+   /* get the reference clock */
+   u32 (*get_xclk)(struct amdgpu_device *adev);
+   /* MM block clocks */
+   int (*set_uvd_clocks)(struct amdgpu_device *adev, u32 vclk, u

Re: [PATCH 1/2] drm/amdgpu: Unmap BO memory before calling amdgpu_bo_unref()

2024-06-21 Thread Christian König

Am 21.06.24 um 09:32 schrieb Thomas Zimmermann:

Hi

Am 20.06.24 um 17:50 schrieb Christian König:

Am 20.06.24 um 16:44 schrieb Thomas Zimmermann:

Prepares for using ttm_bo_vmap() and ttm_bo_vunmap() in amdgpu. Both
require the caller to hold the GEM reservation lock, which is not the
case while releasing a buffer object. Hence, push a possible call to
unmap out from the buffer-object release code. Warn if a buffer object
with mapped pages is supposed to be released.


Yeah, I've looked into this a while ago as well and that 
unfortunately won't work like this.


Amdgpu also uses ttm_bo_kmap() on user allocations, so the 
amdgpu_bo_kunmap() in amdgpu_bo_destroy() is a must have.


Is there a testcase (igt-gpu-tools ?) that runs this code?  I've 
tested these patches by booting and running a 3d game under X11. But I 
didn't expect that to fully cover all cases.


You need a hardware generation and use case which needs patching or 
inspection of IBs.


Video decoding on old SI or CIK hardware generation should probably do 
the trick.


Regards,
Christian.



Best regards
Thomas



On the other hand I'm pretty sure that calling ttm_bo_vunmap() 
without holding the reservation lock is ok in this situation.


After all it's guaranteed that nobody else is having a reference to 
the BO any more.


Regards,
Christian.



Signed-off-by: Thomas Zimmermann 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

index a1b7438c43dc8..d58b11ea0ead5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -58,7 +58,12 @@ static void amdgpu_bo_destroy(struct 
ttm_buffer_object *tbo)

  {
  struct amdgpu_bo *bo = ttm_to_amdgpu_bo(tbo);
  -    amdgpu_bo_kunmap(bo);
+    /*
+ * BO memory pages should be unmapped at this point. Call
+ * amdgpu_bo_kunmap() before releasing the BO.
+ */
+    if (drm_WARN_ON_ONCE(bo->tbo.base.dev, bo->kmap.bo))
+    amdgpu_bo_kunmap(bo);
    if (bo->tbo.base.import_attach)
  drm_prime_gem_destroy(&bo->tbo.base, bo->tbo.sg);
@@ -450,9 +455,7 @@ void amdgpu_bo_free_kernel(struct amdgpu_bo 
**bo, u64 *gpu_addr,

WARN_ON(amdgpu_ttm_adev((*bo)->tbo.bdev)->in_suspend);
    if (likely(amdgpu_bo_reserve(*bo, true) == 0)) {
-    if (cpu_addr)
-    amdgpu_bo_kunmap(*bo);
-
+    amdgpu_bo_kunmap(*bo);
  amdgpu_bo_unpin(*bo);
  amdgpu_bo_unreserve(*bo);
  }








RE: [PATCH] drm/amdgpu: Fix smatch static checker warning

2024-06-21 Thread Gao, Likun
[AMD Official Use Only - AMD Internal Distribution Only]

It seems this only needs to be handled on gfx v11. For gfx v12, the code already 
checks (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) before using the imu funcs in 
gfx_v12_0_rlc_backdoor_autoload_enable.
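
For illustration, the guard being discussed can be sketched roughly as below. This is a minimal standalone model, not the driver code: struct imu_funcs, struct gfx_ctx, demo_program_rlc_ram() and run_imu_path() are hypothetical stand-ins, and the dpm field only mimics the amdgpu_dpm check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct imu_funcs {
	void (*program_rlc_ram)(void *ctx);
};

struct gfx_ctx {
	const struct imu_funcs *imu_funcs; /* optional table, may be NULL */
	int dpm;                           /* stand-in for amdgpu_dpm */
	int rlc_ram_programmed;            /* counts calls, for demonstration */
};

/* Example callback that just records that it ran. */
static void demo_program_rlc_ram(void *ctx)
{
	((struct gfx_ctx *)ctx)->rlc_ram_programmed++;
}

/*
 * Returns true if the IMU path ran. The table pointer is checked before
 * any member is read, so a NULL table is never dereferenced.
 */
static bool run_imu_path(struct gfx_ctx *gfx)
{
	if (!gfx->imu_funcs || gfx->dpm <= 0)
		return false;

	/* Individual callbacks are optional as well. */
	if (gfx->imu_funcs->program_rlc_ram)
		gfx->imu_funcs->program_rlc_ram(gfx);

	return true;
}
```

Anything that uses the function table has to sit inside the guarded branch, which is what moving the autoload call into the block achieves on gfx v11.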

Regards,
Likun

-Original Message-
From: Hawking Zhang 
Sent: Friday, June 21, 2024 3:56 PM
To: amd-gfx@lists.freedesktop.org; Gao, Likun ; Min, Frank 

Cc: Zhang, Hawking 
Subject: [PATCH] drm/amdgpu: Fix smatch static checker warning

adev->gfx.imu.funcs could be NULL.

Signed-off-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 8 
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index b4575765d7a8..5c17409439f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4498,11 +4498,11 @@ static int gfx_v11_0_hw_init(void *handle)
/* RLC autoload sequence 1: Program rlc ram */
if (adev->gfx.imu.funcs->program_rlc_ram)
adev->gfx.imu.funcs->program_rlc_ram(adev);
+   /* rlc autoload firmware */
+   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
+   if (r)
+   return r;
}
-   /* rlc autoload firmware */
-   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
-   if (r)
-   return r;
} else {
if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 460bf33a22b1..16fc5c5b15f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -3258,11 +3258,11 @@ static int gfx_v12_0_hw_init(void *handle)
/* RLC autoload sequence 1: Program rlc ram */
if (adev->gfx.imu.funcs->program_rlc_ram)
adev->gfx.imu.funcs->program_rlc_ram(adev);
+   /* rlc autoload firmware */
+   r = gfx_v12_0_rlc_backdoor_autoload_enable(adev);
+   if (r)
+   return r;
}
-   /* rlc autoload firmware */
-   r = gfx_v12_0_rlc_backdoor_autoload_enable(adev);
-   if (r)
-   return r;
} else {
if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
--
2.17.1



Re: [PATCH v2 8/8] drm/amdgpu: Call drm_atomic_helper_shutdown() at shutdown time

2024-06-21 Thread Maxime Ripard
On Thu, Jun 20, 2024 at 09:00:23AM GMT, Alex Deucher wrote:
> On Thu, Jun 20, 2024 at 3:10 AM Maxime Ripard  wrote:
> >
> > Hi,
> >
> > On Wed, Jun 19, 2024 at 09:53:12AM GMT, Alex Deucher wrote:
> > > On Wed, Jun 19, 2024 at 9:50 AM Alex Deucher  
> > > wrote:
> > > >
> > > > On Tue, Jun 18, 2024 at 7:53 PM Doug Anderson  
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On Tue, Jun 18, 2024 at 3:00 PM Alex Deucher  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Jun 18, 2024 at 5:40 PM Doug Anderson 
> > > > > >  wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jun 17, 2024 at 8:01 AM Alex Deucher 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 12, 2024 at 6:37 PM Douglas Anderson 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Based on grepping through the source code this driver appears 
> > > > > > > > > to be
> > > > > > > > > missing a call to drm_atomic_helper_shutdown() at system 
> > > > > > > > > shutdown
> > > > > > > > > time. Among other things, this means that if a panel is in 
> > > > > > > > > use that it
> > > > > > > > > won't be cleanly powered off at system shutdown time.
> > > > > > > > >
> > > > > > > > > The fact that we should call drm_atomic_helper_shutdown() in 
> > > > > > > > > the case
> > > > > > > > > of OS shutdown/restart comes straight out of the kernel doc 
> > > > > > > > > "driver
> > > > > > > > > instance overview" in drm_drv.c.
> > > > > > > > >
> > > > > > > > > Suggested-by: Maxime Ripard 
> > > > > > > > > Cc: Alex Deucher 
> > > > > > > > > Cc: Christian König 
> > > > > > > > > Cc: Xinhui Pan 
> > > > > > > > > Signed-off-by: Douglas Anderson 
> > > > > > > > > ---
> > > > > > > > > This commit is only compile-time tested.
> > > > > > > > >
> > > > > > > > > ...and further, I'd say that this patch is more of a plea for 
> > > > > > > > > help
> > > > > > > > > than a patch I think is actually right. I'm _fairly_ certain 
> > > > > > > > > that
> > > > > > > > > drm/amdgpu needs this call at shutdown time but the logic is 
> > > > > > > > > a bit
> > > > > > > > > hard for me to follow. I'd appreciate if anyone who actually 
> > > > > > > > > knows
> > > > > > > > > what this should look like could illuminate me, or perhaps 
> > > > > > > > > even just
> > > > > > > > > post a patch themselves!
> > > > > > > >
> > > > > > > > I'm not sure this patch makes sense or not.  The driver doesn't 
> > > > > > > > really
> > > > > > > > do a formal tear down in its shutdown routine, it just quiesces 
> > > > > > > > the
> > > > > > > > hardware.  What are the actual requirements of the shutdown 
> > > > > > > > function?
> > > > > > > > In the past when we did a full driver tear down in shutdown, it
> > > > > > > > delayed the shutdown sequence and users complained.
> > > > > > >
> > > > > > > The "inspiration" for this patch is to handle panels properly.
> > > > > > > Specifically, panels often have several power/enable signals 
> > > > > > > going to
> > > > > > > them and often have requirements that these signals are powered 
> > > > > > > off in
> > > > > > > the proper order with the proper delays between them. While we 
> > > > > > > can't
> > > > > > > always do so when the system crashes / reboots in an uncontrolled 
> > > > > > > way,
> > > > > > > panel manufacturers / HW Engineers get upset if we don't power 
> > > > > > > things
> > > > > > > off properly during an orderly shutdown/reboot. When panels are
> > > > > > > powered off badly it can cause garbage on the screen and, so I've 
> > > > > > > been
> > > > > > > told, can even cause long term damage to the panels over time.
> > > > > > >
> > > > > > > In Linux, some panel drivers have tried to ensure a proper 
> > > > > > > poweroff of
> > > > > > > the panel by handling the shutdown() call themselves. However, 
> > > > > > > this is
> > > > > > > ugly and panel maintainers want panel drivers to stop doing it. We
> > > > > > > have removed the code doing this from most panels now [1]. 
> > > > > > > Instead the
> > > > > > > assumption is that the DRM modeset drivers should be calling
> > > > > > > drm_atomic_helper_shutdown() which will make sure panels get an
> > > > > > > orderly shutdown.
> > > > > > >
> > > > > > > For a lot more details, see the cover letter [2] which then 
> > > > > > > contains
> > > > > > > links to even more discussions about the topic.
> > > > > > >
> > > > > > > [1] 
> > > > > > > https://lore.kernel.org/r/20240605002401.2848541-1-diand...@chromium.org
> > > > > > > [2] 
> > > > > > > https://lore.kernel.org/r/2024061435.3188234-1-diand...@chromium.org
> > > > > >
> > > > > > I don't think it's an issue.  We quiesce the hardware as if we were
> > > > > > about to suspend the system (e.g., S3).  For the display hardware we
> > > > > > call drm_atomic_helper_suspend() as part of that sequence.
> > > > >
> > > > > OK. It's no skin off my teeth and we can drop this patch if you're
> > > > > c

RE: [PATCH V2 1/4] drm/amdgpu: add variable to record the deferred error number read by driver

2024-06-21 Thread Chai, Thomas
[AMD Official Use Only - AMD Internal Distribution Only]

prev_de_queried_count and de_queried_count are used to accurately count the 
number of DEs lost after the driver receives a large number of poison creation 
interrupts.

Since amdgpu_ras_query_error_status can be called by page_retirement_thread, 
the xxx_err_count sysfs nodes and gpu recovery, 
using a local variable to save the old de_queried_count before calling 
amdgpu_ras_query_error_status in page_retirement_thread would be inaccurate.
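
A rough standalone sketch of the idea, for illustration only: struct ecc_log, query_error_status() and consume_new_de() are hypothetical simplifications, and the locking the driver would need is left out. The point is that the "previous" value lives next to the shared counter, so DEs recorded by any query path are still seen by the next consumer.

```c
#include <stdint.h>

/* Simplified model of the two counters; not the real ras_ecc_log_info. */
struct ecc_log {
	uint64_t de_queried_count;      /* running total, bumped by every query path */
	uint64_t prev_de_queried_count; /* total already consumed by the handler */
};

/* Any caller (retirement thread, sysfs read, gpu recovery) may record DEs. */
static void query_error_status(struct ecc_log *log, uint64_t found)
{
	log->de_queried_count += found;
}

/* Returns how many DEs were recorded since the handler last consumed them. */
static uint64_t consume_new_de(struct ecc_log *log)
{
	uint64_t now = log->de_queried_count;
	uint64_t fresh = 0;

	if (now > log->prev_de_queried_count) {
		fresh = now - log->prev_de_queried_count;
		log->prev_de_queried_count = now;
	}
	return fresh;
}
```

With a snapshot taken in a local variable just before the handler's own query, DEs that another path recorded earlier would already be included in the snapshot and would not be counted.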


-
Best Regards,
Thomas

-Original Message-
From: Zhang, Hawking 
Sent: Friday, June 21, 2024 2:37 PM
To: Chai, Thomas ; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao ; Li, Candice ; Wang, 
Yang(Kevin) ; Yang, Stanley 
Subject: RE: [PATCH V2 1/4] drm/amdgpu: add variable to record the deferred 
error number read by driver

[AMD Official Use Only - AMD Internal Distribution Only]

Shall we make prev_de_queried_count a local variable? Everything else looks good to me.

Regards,
Hawking

-Original Message-
From: Chai, Thomas 
Sent: Thursday, June 20, 2024 13:40
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; Li, 
Candice ; Wang, Yang(Kevin) ; Yang, 
Stanley ; Chai, Thomas 
Subject: [PATCH V2 1/4] drm/amdgpu: add variable to record the deferred error 
number read by driver

Add variable to record the deferred error number read by driver.

Signed-off-by: YiPeng Chai 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 62 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  3 +-
 drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  |  4 +-
 3 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 86cb97d2155b..f674e34037b6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -120,7 +120,7 @@ const char *get_ras_block_str(struct ras_common_if 
*ras_block)
 /* typical ECC bad page rate is 1 bad page per 100MB VRAM */
 #define RAS_BAD_PAGE_COVER  (100 * 1024 * 1024ULL)

-#define MAX_UMC_POISON_POLLING_TIME_ASYNC  100  //ms
+#define MAX_UMC_POISON_POLLING_TIME_ASYNC  300  //ms

 #define AMDGPU_RAS_RETIRE_PAGE_INTERVAL 100  //ms

@@ -2804,7 +2804,8 @@ static void amdgpu_ras_ecc_log_init(struct 
ras_ecc_log_info *ecc_log)
memset(&ecc_log->ecc_key, 0xad, sizeof(ecc_log->ecc_key));

INIT_RADIX_TREE(&ecc_log->de_page_tree, GFP_KERNEL);
-   ecc_log->de_updated = false;
+   ecc_log->de_queried_count = 0;
+   ecc_log->prev_de_queried_count = 0;
 }

 static void amdgpu_ras_ecc_log_fini(struct ras_ecc_log_info *ecc_log)
@@ -2823,7 +2824,8 @@ static void amdgpu_ras_ecc_log_fini(struct ras_ecc_log_info *ecc_log)
mutex_unlock(&ecc_log->lock);

mutex_destroy(&ecc_log->lock);
-   ecc_log->de_updated = false;
+   ecc_log->de_queried_count = 0;
+   ecc_log->prev_de_queried_count = 0;
 }
 #endif

@@ -2856,40 +2858,64 @@ static void amdgpu_ras_do_page_retirement(struct 
work_struct *work)
mutex_unlock(&con->umc_ecc_log.lock);
 }

-static void amdgpu_ras_poison_creation_handler(struct amdgpu_device *adev,
-   uint32_t timeout_ms)
+static int amdgpu_ras_poison_creation_handler(struct amdgpu_device *adev,
+   uint32_t poison_creation_count)
 {
int ret = 0;
struct ras_ecc_log_info *ecc_log;
struct ras_query_if info;
-   uint32_t timeout = timeout_ms;
+   uint32_t timeout = 0;
struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
+   uint64_t de_queried_count;
+   uint32_t new_detect_count, total_detect_count;
+   uint32_t need_query_count = poison_creation_count;
+   bool query_data_timeout = false;

memset(&info, 0, sizeof(info));
info.head.block = AMDGPU_RAS_BLOCK__UMC;

ecc_log = &ras->umc_ecc_log;
-   ecc_log->de_updated = false;
+   total_detect_count = 0;
do {
ret = amdgpu_ras_query_error_status(adev, &info);
-   if (ret) {
-   dev_err(adev->dev, "Failed to query ras error! 
ret:%d\n", ret);
-   return;
+   if (ret)
+   return ret;
+
+   de_queried_count = ecc_log->de_queried_count;
+   if (de_queried_count > ecc_log->prev_de_queried_count) {
+   new_detect_count = de_queried_count - 
ecc_log->prev_de_queried_count;
+   ecc_log->prev_de_queried_count = de_queried_count;
+   timeout = 0;
+   } else {
+   new_detect_count = 0;
}

-   if (timeout && !ecc_log->de_updated) {
-   msleep(1);
-   timeout--;
+   if (new_detect_count) {
+   total_detect_count += new_detect_count;
+   } else {
+   if (!timeout && nee

[PATCH] drm/amdgpu: add missing error handling for amdgpu_ring_alloc()

2024-06-21 Thread Bob Zhou
Fix the unchecked return value warnings reported by Coverity
by adding the missing error handling.

Signed-off-by: Bob Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c  | 7 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 4 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c | 6 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c  | 3 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c| 8 ++--
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   | 6 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 6 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   | 6 +-
 8 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 82452606ae6c..25cab6a8d478 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -1005,7 +1005,8 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, 
uint32_t reg, uint32_t xcc_
pr_err("critical bug! too many kiq readers\n");
goto failed_unlock;
}
-   amdgpu_ring_alloc(ring, 32);
+   if (amdgpu_ring_alloc(ring, 32))
+   goto failed_unlock;
amdgpu_ring_emit_rreg(ring, reg, reg_val_offs);
r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
if (r)
@@ -1071,7 +1072,8 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t 
reg, uint32_t v, uint3
}
 
spin_lock_irqsave(&kiq->ring_lock, flags);
-   amdgpu_ring_alloc(ring, 32);
+   if (amdgpu_ring_alloc(ring, 32))
+   goto failed_unlock;
amdgpu_ring_emit_wreg(ring, reg, v);
r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
if (r)
@@ -1107,6 +1109,7 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t 
reg, uint32_t v, uint3
 
 failed_undo:
amdgpu_ring_undo(ring);
+failed_unlock:
spin_unlock_irqrestore(&kiq->ring_lock, flags);
 failed_kiq_write:
dev_err(adev->dev, "failed to write reg:%x\n", reg);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 3a7622611916..778941f47c96 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -768,7 +768,8 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device 
*adev,
}
 
spin_lock_irqsave(&kiq->ring_lock, flags);
-   amdgpu_ring_alloc(ring, 32);
+   if (amdgpu_ring_alloc(ring, 32))
+   goto failed_unlock;
amdgpu_ring_emit_reg_write_reg_wait(ring, reg0, reg1,
ref, mask);
r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
@@ -798,6 +799,7 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device 
*adev,
 
 failed_undo:
amdgpu_ring_undo(ring);
+failed_unlock:
spin_unlock_irqrestore(&kiq->ring_lock, flags);
 failed_kiq:
dev_err(adev->dev, "failed to write reg %x wait reg %x\n", reg0, reg1);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
index d234b7ccfaaf..01864990a192 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
@@ -63,12 +63,14 @@ static void amdgpu_ring_mux_copy_pkt_from_sw_ring(struct 
amdgpu_ring_mux *mux,
return;
}
if (start > end) {
-   amdgpu_ring_alloc(real_ring, (ring->ring_size >> 2) + end - 
start);
+   if (amdgpu_ring_alloc(real_ring, (ring->ring_size >> 2) + end - 
start))
+   return;
amdgpu_ring_write_multiple(real_ring, (void 
*)&ring->ring[start],
   (ring->ring_size >> 2) - start);
amdgpu_ring_write_multiple(real_ring, (void *)&ring->ring[0], 
end);
} else {
-   amdgpu_ring_alloc(real_ring, end - start);
+   if (amdgpu_ring_alloc(real_ring, end - start))
+   return;
amdgpu_ring_write_multiple(real_ring, (void 
*)&ring->ring[start], end - start);
}
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
index bad232859972..d7d68e61506d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
@@ -610,7 +610,8 @@ static int vpe_ring_preempt_ib(struct amdgpu_ring *ring)
 
/* emit the trailing fence */
ring->trail_seq += 1;
-   amdgpu_ring_alloc(ring, 10);
+   if (amdgpu_ring_alloc(ring, 10))
+   return -ENOMEM;
vpe_ring_emit_fence(ring, ring->trail_fence_gpu_addr, ring->trail_seq, 
0);
amdgpu_ring_commit(ring);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 2929c8972ea7..810f7f7e279d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -4093,7 +4093,8 @@ st

Re: [PATCH] drm/amdgpu: add missing error handling for amdgpu_ring_alloc()

2024-06-21 Thread Christian König

Am 21.06.24 um 11:24 schrieb Bob Zhou:

Fix the unchecked return value warning reported by Coverity,
so add error handling.


That seems to be completely superfluous. The only case when 
amdgpu_ring_alloc() returns an error is when we try to allocate more 
than the maximum submission size.


And in that case we will get quite a warning already.

I strongly suggest to instead drop the return value from 
amdgpu_ring_alloc().
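
As a rough illustration of the contract described above (a standalone sketch, not the driver code; struct ring and ring_alloc() are hypothetical, and the counter merely stands in for the kernel warning):

```c
#include <errno.h>

/* Simplified ring model; not the real struct amdgpu_ring. */
struct ring {
	unsigned int max_dw;            /* maximum submission size in dwords */
	unsigned int count_dw;          /* dwords reserved for the current submission */
	unsigned int oversize_warnings; /* stands in for the kernel warning */
};

/*
 * The only failure is a request larger than the maximum submission size,
 * which is a programming error and is reported loudly on its own.
 */
static int ring_alloc(struct ring *r, unsigned int ndw)
{
	if (ndw > r->max_dw) {
		r->oversize_warnings++;
		return -ENOMEM;
	}
	r->count_dw = ndw;
	return 0;
}
```

Under that contract a caller that passes a small constant such as 32 cannot hit the error path, so a check at each call site adds nothing.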


Regards,
Christian.



Signed-off-by: Bob Zhou 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c  | 7 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c  | 4 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c | 6 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c  | 3 ++-
  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c| 8 ++--
  drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c   | 6 +-
  drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c   | 6 +-
  drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   | 6 +-
  8 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 82452606ae6c..25cab6a8d478 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -1005,7 +1005,8 @@ uint32_t amdgpu_kiq_rreg(struct amdgpu_device *adev, 
uint32_t reg, uint32_t xcc_
pr_err("critical bug! too many kiq readers\n");
goto failed_unlock;
}
-   amdgpu_ring_alloc(ring, 32);
+   if (amdgpu_ring_alloc(ring, 32))
+   goto failed_unlock;
amdgpu_ring_emit_rreg(ring, reg, reg_val_offs);
r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
if (r)
@@ -1071,7 +1072,8 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t 
reg, uint32_t v, uint3
}
  
  spin_lock_irqsave(&kiq->ring_lock, flags);
-   amdgpu_ring_alloc(ring, 32);
+   if (amdgpu_ring_alloc(ring, 32))
+   goto failed_unlock;
amdgpu_ring_emit_wreg(ring, reg, v);
r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
if (r)
@@ -1107,6 +1109,7 @@ void amdgpu_kiq_wreg(struct amdgpu_device *adev, uint32_t 
reg, uint32_t v, uint3
  
  failed_undo:
amdgpu_ring_undo(ring);
+failed_unlock:
spin_unlock_irqrestore(&kiq->ring_lock, flags);
  failed_kiq_write:
dev_err(adev->dev, "failed to write reg:%x\n", reg);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index 3a7622611916..778941f47c96 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -768,7 +768,8 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device 
*adev,
}
  
  spin_lock_irqsave(&kiq->ring_lock, flags);
-   amdgpu_ring_alloc(ring, 32);
+   if (amdgpu_ring_alloc(ring, 32))
+   goto failed_unlock;
amdgpu_ring_emit_reg_write_reg_wait(ring, reg0, reg1,
ref, mask);
r = amdgpu_fence_emit_polling(ring, &seq, MAX_KIQ_REG_WAIT);
@@ -798,6 +799,7 @@ void amdgpu_gmc_fw_reg_write_reg_wait(struct amdgpu_device 
*adev,
  
  failed_undo:
amdgpu_ring_undo(ring);
+failed_unlock:
spin_unlock_irqrestore(&kiq->ring_lock, flags);
  failed_kiq:
dev_err(adev->dev, "failed to write reg %x wait reg %x\n", reg0, reg1);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
index d234b7ccfaaf..01864990a192 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring_mux.c
@@ -63,12 +63,14 @@ static void amdgpu_ring_mux_copy_pkt_from_sw_ring(struct 
amdgpu_ring_mux *mux,
return;
}
if (start > end) {
-   amdgpu_ring_alloc(real_ring, (ring->ring_size >> 2) + end - 
start);
+   if (amdgpu_ring_alloc(real_ring, (ring->ring_size >> 2) + end - 
start))
+   return;
amdgpu_ring_write_multiple(real_ring, (void 
*)&ring->ring[start],
   (ring->ring_size >> 2) - start);
amdgpu_ring_write_multiple(real_ring, (void *)&ring->ring[0], 
end);
} else {
-   amdgpu_ring_alloc(real_ring, end - start);
+   if (amdgpu_ring_alloc(real_ring, end - start))
+   return;
amdgpu_ring_write_multiple(real_ring, (void 
*)&ring->ring[start], end - start);
}
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
index bad232859972..d7d68e61506d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
@@ -610,7 +610,8 @@ static int vpe_ring_preempt_ib(struct amdgpu_ring *ring)
  
  /* emit the trailing fence */
ring->trail_seq += 1;
-   amdgpu_ring_alloc(ring, 10);
+   if (amdgpu_ring_alloc(ring, 10))
+   ret

[PATCH] drm/amdgpu: Fix smatch static checker warning

2024-06-21 Thread Hawking Zhang
adev->gfx.imu.funcs could be NULL

Signed-off-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index b4575765d7a8..5c17409439f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4498,11 +4498,11 @@ static int gfx_v11_0_hw_init(void *handle)
/* RLC autoload sequence 1: Program rlc ram */
if (adev->gfx.imu.funcs->program_rlc_ram)
adev->gfx.imu.funcs->program_rlc_ram(adev);
+   /* rlc autoload firmware */
+   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
+   if (r)
+   return r;
}
-   /* rlc autoload firmware */
-   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
-   if (r)
-   return r;
} else {
if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
-- 
2.17.1



RE: [PATCH] drm/amdgpu: Fix smatch static checker warning

2024-06-21 Thread Zhang, Hawking
[AMD Official Use Only - AMD Internal Distribution Only]

Sure, that works. Send out v2

Regards,
Hawking

-Original Message-
From: Gao, Likun 
Sent: Friday, June 21, 2024 16:35
To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org; Min, 
Frank 
Cc: Zhang, Hawking 
Subject: RE: [PATCH] drm/amdgpu: Fix smatch static checker warning

[AMD Official Use Only - AMD Internal Distribution Only]

It seems this only needs to be handled on gfx v11. For gfx v12, the code already 
checks (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) before using the imu funcs in 
gfx_v12_0_rlc_backdoor_autoload_enable.

Regards,
Likun

-Original Message-
From: Hawking Zhang 
Sent: Friday, June 21, 2024 3:56 PM
To: amd-gfx@lists.freedesktop.org; Gao, Likun ; Min, Frank 

Cc: Zhang, Hawking 
Subject: [PATCH] drm/amdgpu: Fix smatch static checker warning

adev->gfx.imu.funcs could be NULL.

Signed-off-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 8 
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index b4575765d7a8..5c17409439f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4498,11 +4498,11 @@ static int gfx_v11_0_hw_init(void *handle)
/* RLC autoload sequence 1: Program rlc ram */
if (adev->gfx.imu.funcs->program_rlc_ram)
adev->gfx.imu.funcs->program_rlc_ram(adev);
+   /* rlc autoload firmware */
+   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
+   if (r)
+   return r;
}
-   /* rlc autoload firmware */
-   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
-   if (r)
-   return r;
} else {
if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 460bf33a22b1..16fc5c5b15f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -3258,11 +3258,11 @@ static int gfx_v12_0_hw_init(void *handle)
/* RLC autoload sequence 1: Program rlc ram */
if (adev->gfx.imu.funcs->program_rlc_ram)
adev->gfx.imu.funcs->program_rlc_ram(adev);
+   /* rlc autoload firmware */
+   r = gfx_v12_0_rlc_backdoor_autoload_enable(adev);
+   if (r)
+   return r;
}
-   /* rlc autoload firmware */
-   r = gfx_v12_0_rlc_backdoor_autoload_enable(adev);
-   if (r)
-   return r;
} else {
if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
--
2.17.1




RE: [PATCH] drm/amdgpu: Fix smatch static checker warning

2024-06-21 Thread Gao, Likun
[AMD Official Use Only - AMD Internal Distribution Only]

This patch was
Reviewed-by: Likun Gao .

Regards,
Likun

-Original Message-
From: Hawking Zhang 
Sent: Friday, June 21, 2024 5:54 PM
To: amd-gfx@lists.freedesktop.org; Gao, Likun ; Min, Frank 

Cc: Zhang, Hawking 
Subject: [PATCH] drm/amdgpu: Fix smatch static checker warning

adev->gfx.imu.funcs could be NULL

Signed-off-by: Hawking Zhang 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index b4575765d7a8..5c17409439f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -4498,11 +4498,11 @@ static int gfx_v11_0_hw_init(void *handle)
/* RLC autoload sequence 1: Program rlc ram */
if (adev->gfx.imu.funcs->program_rlc_ram)
adev->gfx.imu.funcs->program_rlc_ram(adev);
+   /* rlc autoload firmware */
+   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
+   if (r)
+   return r;
}
-   /* rlc autoload firmware */
-   r = gfx_v11_0_rlc_backdoor_autoload_enable(adev);
-   if (r)
-   return r;
} else {
if (adev->firmware.load_type == AMDGPU_FW_LOAD_DIRECT) {
if (adev->gfx.imu.funcs && (amdgpu_dpm > 0)) {
--
2.17.1



Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-06-21 Thread Mikhail Gavrilov
On Fri, Jun 21, 2024 at 12:56 PM Linux regression tracking (Thorsten
Leemhuis)  wrote:
> Hmmm, I might have missed something, but it looks like nothing happened
> here since then. What's the status? Is the issue still happening?

Yes. Tested on e5b3efbe1ab1.

I spotted that the problem disappears after forcing the TV to sleep
(activate screensaver  + ) and then wake it up by pressing
any button and entering a password.
I hope this information can help figure out how to fix it.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-06-21 Thread Linux regression tracking (Thorsten Leemhuis)
On 09.06.24 23:19, Mikhail Gavrilov wrote:
> On Fri, Jun 7, 2024 at 6:39 PM Alex Deucher  wrote:
>>
>> --- a/drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c
>> +++ b/drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c
>> @@ -944,7 +944,7 @@ void optc1_set_drr(
>> OTG_V_TOTAL_MAX_SEL, 1,
>> OTG_FORCE_LOCK_ON_EVENT, 0,
>> OTG_SET_V_TOTAL_MIN_MASK_EN, 0,
>> -   OTG_SET_V_TOTAL_MIN_MASK, 0);
>> +   OTG_SET_V_TOTAL_MIN_MASK, (1 << 1)); /* 
>> TRIGA */
>>
>> // Setup manual flow control for EOF via TRIG_A
>> optc->funcs->setup_manual_trigger(optc);
> 
> Thanks, Alex.
> I applied this patch on top of 771ed66105de and unfortunately the
> issue is not fixed.
> I saw a green flashing bar on top of the screen again.

Hmmm, I might have missed something, but it looks like nothing happened
here since then. What's the status? Is the issue still happening? Any
solution in sight?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke



Re: [PATCH] drm/buddy: Add start address support to trim function

2024-06-21 Thread Matthew Auld

Hi,

On 21/06/2024 06:29, Arunpravin Paneer Selvam wrote:

- Add a new start parameter in trim function to specify exact
   address from where to start the trimming. This would help us
   in situations like if drivers would like to do address alignment
   for specific requirements.

- Add a new flag DRM_BUDDY_TRIM_DISABLE. Drivers can use this
   flag to disable the allocator trimming part. This patch enables
   the drivers control trimming and they can do it themselves
   based on the application requirements.

Signed-off-by: Arunpravin Paneer Selvam 
---
  drivers/gpu/drm/drm_buddy.c  | 22 --
  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  2 +-
  include/drm/drm_buddy.h  |  2 ++
  3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 6a8e45e9d0ec..287b6acb1637 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -851,6 +851,7 @@ static int __alloc_contig_try_harder(struct drm_buddy *mm,
   * drm_buddy_block_trim - free unused pages
   *
   * @mm: DRM buddy manager
+ * @start: start address to begin the trimming.
   * @new_size: original size requested
   * @blocks: Input and output list of allocated blocks.
   * MUST contain single block as input to be trimmed.
@@ -866,11 +867,13 @@ static int __alloc_contig_try_harder(struct drm_buddy *mm,
   * 0 on success, error code on failure.
   */
  int drm_buddy_block_trim(struct drm_buddy *mm,
+u64 *start,


I guess I'm just wondering whether this should be an offset within the
block or an address. If it is an offset, then zero would be the valid
default, giving the existing behaviour. But it's hard to say without
seeing the user for this. Are there more patches that give some context
for this use case?



 u64 new_size,
 struct list_head *blocks)
  {
struct drm_buddy_block *parent;
struct drm_buddy_block *block;
+   u64 block_start, block_end;
LIST_HEAD(dfs);
u64 new_start;
int err;
@@ -882,6 +885,9 @@ int drm_buddy_block_trim(struct drm_buddy *mm,
 struct drm_buddy_block,
 link);
  
+	block_start = drm_buddy_block_offset(block);

+   block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
if (WARN_ON(!drm_buddy_block_is_allocated(block)))
return -EINVAL;
  
@@ -894,6 +900,17 @@ int drm_buddy_block_trim(struct drm_buddy *mm,

if (new_size == drm_buddy_block_size(mm, block))
return 0;
  
+	new_start = block_start;

+   if (start) {
+   new_start = *start;
+
+   if (new_start < block_start)
+   return -EINVAL;


In addition, this should check that the alignment of new_start is at least
compatible with the min chunk_size. Otherwise I think bad stuff can happen.
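
A rough user-space sketch of the kind of alignment check meant here,
assuming chunk_size is a non-zero power of two. trim_start_is_aligned()
is a hypothetical name used only for illustration, not an existing
drm_buddy helper; in the kernel this would presumably be an IS_ALIGNED()
test against mm->chunk_size placed before the range checks.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not part of drm_buddy).
 * Returns true when addr is a multiple of chunk_size.
 * Assumption: chunk_size is a non-zero power of two, so the low bits
 * selected by (chunk_size - 1) must all be zero for an aligned address.
 */
static bool trim_start_is_aligned(uint64_t addr, uint64_t chunk_size)
{
	return (addr & (chunk_size - 1)) == 0;
}
```

With a 4 KiB chunk size, a start of 0x2000 would pass and 0x2800 would be
rejected, which is the case that would otherwise reach __alloc_range()
with a start that no buddy block boundary can match.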



+
+   if ((new_start + new_size) > block_end)


range_overflows() ?
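
To illustrate why a helper of that kind is preferable to the open-coded
addition, here is a minimal user-space sketch. range_overflows_u64() is a
hypothetical stand-in written for this example and does not reproduce the
kernel definition; it assumes an exclusive upper bound, so no "- 1"
adjustment is needed on the caller side.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in (illustration only).
 * Returns true when the range [start, start + size) does not fit
 * below max, where max is an exclusive upper bound.
 * The comparison uses max - start instead of start + size, so the
 * sum is never computed and cannot wrap around.
 */
static bool range_overflows_u64(uint64_t start, uint64_t size, uint64_t max)
{
	return start >= max || size > max - start;
}
```

The open-coded form (new_start + new_size) can wrap for large inputs and
then compare as small, whereas the subtraction form stays well defined
once start < max has been established.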


+   return -EINVAL;
+   }
+
list_del(&block->link);
mark_free(mm, block);
mm->avail += drm_buddy_block_size(mm, block);
@@ -904,7 +921,6 @@ int drm_buddy_block_trim(struct drm_buddy *mm,
parent = block->parent;
block->parent = NULL;
  
-	new_start = drm_buddy_block_offset(block);

list_add(&block->tmp_link, &dfs);
err =  __alloc_range(mm, &dfs, new_start, new_size, blocks, NULL);
if (err) {
@@ -1066,7 +1082,8 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
} while (1);
  
  	/* Trim the allocated block to the required size */

-   if (original_size != size) {
+   if (!(flags & DRM_BUDDY_TRIM_DISABLE) &&
+   original_size != size) {
struct list_head *trim_list;
LIST_HEAD(temp);
u64 trim_size;
@@ -1083,6 +1100,7 @@ int drm_buddy_alloc_blocks(struct drm_buddy *mm,
}
  
  		drm_buddy_block_trim(mm,

+NULL,
 trim_size,
 trim_list);
  
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c

index fe3779fdba2c..423b261ea743 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -150,7 +150,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
} while (remaining_size);
  
  	if (place->flags & TTM_PL_FLAG_CONTIGUOUS) {

-   if (!drm_buddy_block_trim(mm, vres->base.size, &vres->blocks))
+   if (!drm_buddy_block_trim(mm, NULL, vres->base.size, &vres->blocks))
size = vres->base.size;
}
  
diff --git a/include/drm/drm_buddy.h b/include/drm/drm_buddy.h

index 2a74fa9d0ce5..9689a7c5dd36 100644
--- a/include/drm/drm_buddy.h
+++ b/include/drm/drm_buddy.h
@@ -27,6 +27,7 @@
  #define DRM_BUDDY_CONTIGUOUS_ALLOCATION   BIT(

[PATCH] drm/amdkfd: Correct svm prange overlapping handling at svm_range_set_attr ioctl

2024-06-21 Thread Xiaogang . Chen
From: Xiaogang Chen 

When a user adds a new vm range that overlaps existing svm pranges, kfd
currently clones a new prange and removes the existing pranges, including all
data associated with them. That is not necessary. We can handle the overlap on
the existing pranges directly, which simplifies the kfd code. Also, when an
existing prange is removed, its locks are destroyed. This may cause issues if
code still uses these locks, and the locks of the cloned prange do not inherit
the context of the locks that were removed.

This patch neither removes existing pranges nor clones new ones, and so keeps
the pranges' locks alive.

Signed-off-by: Xiaogang Chen
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 89 
 1 file changed, 12 insertions(+), 77 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 407636a68814..a8fcace6f9a2 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -904,23 +904,6 @@ svm_range_copy_array(void *psrc, size_t size, uint64_t num_elements,
return (void *)dst;
 }
 
-static int
-svm_range_copy_dma_addrs(struct svm_range *dst, struct svm_range *src)
-{
-   int i;
-
-   for (i = 0; i < MAX_GPU_INSTANCE; i++) {
-   if (!src->dma_addr[i])
-   continue;
-   dst->dma_addr[i] = svm_range_copy_array(src->dma_addr[i],
-   sizeof(*src->dma_addr[i]), src->npages, 0, NULL);
-   if (!dst->dma_addr[i])
-   return -ENOMEM;
-   }
-
-   return 0;
-}
-
 static int
 svm_range_split_array(void *ppnew, void *ppold, size_t size,
  uint64_t old_start, uint64_t old_n,
@@ -1967,38 +1950,6 @@ svm_range_evict(struct svm_range *prange, struct mm_struct *mm,
return r;
 }
 
-static struct svm_range *svm_range_clone(struct svm_range *old)
-{
-   struct svm_range *new;
-
-   new = svm_range_new(old->svms, old->start, old->last, false);
-   if (!new)
-   return NULL;
-   if (svm_range_copy_dma_addrs(new, old)) {
-   svm_range_free(new, false);
-   return NULL;
-   }
-   if (old->svm_bo) {
-   new->ttm_res = old->ttm_res;
-   new->offset = old->offset;
-   new->svm_bo = svm_range_bo_ref(old->svm_bo);
-   spin_lock(&new->svm_bo->list_lock);
-   list_add(&new->svm_bo_list, &new->svm_bo->range_list);
-   spin_unlock(&new->svm_bo->list_lock);
-   }
-   new->flags = old->flags;
-   new->preferred_loc = old->preferred_loc;
-   new->prefetch_loc = old->prefetch_loc;
-   new->actual_loc = old->actual_loc;
-   new->granularity = old->granularity;
-   new->mapped_to_gpu = old->mapped_to_gpu;
-   new->vram_pages = old->vram_pages;
-   bitmap_copy(new->bitmap_access, old->bitmap_access, MAX_GPU_INSTANCE);
-   bitmap_copy(new->bitmap_aip, old->bitmap_aip, MAX_GPU_INSTANCE);
-
-   return new;
-}
-
 void svm_range_set_max_pages(struct amdgpu_device *adev)
 {
uint64_t max_pages;
@@ -2057,7 +2008,6 @@ svm_range_split_new(struct svm_range_list *svms, uint64_t start, uint64_t last,
  * @attrs: array of attributes
  * @update_list: output, the ranges need validate and update GPU mapping
  * @insert_list: output, the ranges need insert to svms
- * @remove_list: output, the ranges are replaced and need remove from svms
  * @remap_list: output, remap unaligned svm ranges
  *
  * Check if the virtual address range has overlap with any existing ranges,
@@ -2082,7 +2032,7 @@ static int
 svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
  uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs,
  struct list_head *update_list, struct list_head *insert_list,
- struct list_head *remove_list, struct list_head *remap_list)
+ struct list_head *remap_list)
 {
unsigned long last = start + size - 1UL;
struct svm_range_list *svms = &p->svms;
@@ -2096,7 +2046,6 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
 
INIT_LIST_HEAD(update_list);
INIT_LIST_HEAD(insert_list);
-   INIT_LIST_HEAD(remove_list);
INIT_LIST_HEAD(&new_list);
INIT_LIST_HEAD(remap_list);
 
@@ -2117,20 +2066,11 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size,
/* nothing to do */
} else if (node->start < start || node->last > last) {
/* node intersects the update range and its attributes
-* will change. Clone and split it, apply updates only
+* will change. Split it, apply updates only
 * to the overlapping part
 */
-   struct svm_range *old = prange;
-
-   prange = svm_range_clone(old);
-

[PATCH 1/2] drm/amdgpu: Disable compute partition switch under SRIOV

2024-06-21 Thread Rajneesh Bhardwaj
Do not allow the compute partition mode switch from the guest driver.

Signed-off-by: Rajneesh Bhardwaj 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 82452606ae6c..722c3fef09a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -1292,6 +1292,9 @@ static ssize_t amdgpu_gfx_set_compute_partition(struct device *dev,
enum amdgpu_gfx_partition mode;
int ret = 0, num_xcc;
 
+   if (amdgpu_sriov_vf(adev))
+   return -EINVAL;
+
num_xcc = NUM_XCC(adev->gfx.xcc_mask);
if (num_xcc % 2 != 0)
return -EINVAL;
-- 
2.34.1



[PATCH 2/2] drm/amdgpu: Don't warn for compute mode switch under SRIOV

2024-06-21 Thread Rajneesh Bhardwaj
In an SRIOV environment, the compute partition mode is set up by the
host driver, so the state machine's cached copy might differ when doing
the transition for the first time.

Signed-off-by: Rajneesh Bhardwaj 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index 2b99eed5ba19..c4a9669bceb0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -228,7 +228,8 @@ int amdgpu_xcp_query_partition_mode(struct amdgpu_xcp_mgr *xcp_mgr, u32 flags)
if (!(flags & AMDGPU_XCP_FL_LOCKED))
mutex_lock(&xcp_mgr->xcp_lock);
mode = xcp_mgr->funcs->query_partition_mode(xcp_mgr);
-   if (xcp_mgr->mode != AMDGPU_XCP_MODE_TRANS && mode != xcp_mgr->mode)
+   if (xcp_mgr->mode != AMDGPU_XCP_MODE_TRANS && mode != xcp_mgr->mode
+   && !amdgpu_sriov_vf(xcp_mgr->adev))
dev_WARN(
xcp_mgr->adev->dev,
"Cached partition mode %d not matching with device mode %d",
-- 
2.34.1



RE: [PATCH] drm/amdgpu: process RAS fatal error MB notification

2024-06-21 Thread Chan, Hing Pong
[AMD Official Use Only - AMD Internal Distribution Only]

Should we also set the fed flag for the case where kfd detects the timeout first?

I.e. adding

amdgpu_ras_set_fed(adev, true);

to amdgpu_device_gpu_recover or amdgpu_virt_rcvd_ras_interrupt if the RAS 
signature is found?

Thanks,
Hing Pong

-Original Message-
From: Chander, Vignesh 
Sent: Thursday, June 20, 2024 2:26 AM
To: amd-gfx@lists.freedesktop.org
Cc: Chan, Hing Pong ; Luo, Zhigang ; 
Lazar, Lijo ; Chander, Vignesh 
Subject: [PATCH] drm/amdgpu: process RAS fatal error MB notification

For the RAS error scenario, the VF guest driver will check the mailbox and set
the fed flag to avoid unnecessary HW accesses.
Additionally, poll for the reset completion message first to avoid accidentally
spamming multiple reset requests to the host.

v2: add another mailbox check for handling case where kfd detects timeout first

v3: set host_flr bit and use wait_for_reset

Signed-off-by: Vignesh Chander 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 25 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  4 +++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c  | 14 +++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h  |  4 +++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c  | 14 +++-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.h  |  5 +++--
 7 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7df5544ac9839e..1b204af9831d24 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5834,6 +5834,11 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Actual ASIC resets if needed.*/
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
+   if (amdgpu_ras_get_fed_status(adev) || amdgpu_virt_rcvd_ras_interrupt(adev)) {
+   dev_dbg(adev->dev, "Detected RAS error, wait for FLR completion\n");
+   set_bit(AMDGPU_HOST_FLR, &reset_context->flags);
+   }
+
r = amdgpu_device_reset_sriov(adev, reset_context);
if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
amdgpu_virt_release_full_gpu(adev, true);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 63f2286858c484..ccb3d041c2b249 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -229,6 +229,22 @@ void amdgpu_virt_free_mm_table(struct amdgpu_device *adev)
adev->virt.mm_table.gpu_addr = 0;
 }

+/**
+ * amdgpu_virt_rcvd_ras_interrupt() - receive ras interrupt
+ * @adev:  amdgpu device.
+ * Check whether host sent RAS error message
+ * Return: true if found, otherwise false
+ */
+bool amdgpu_virt_rcvd_ras_interrupt(struct amdgpu_device *adev)
+{
+   struct amdgpu_virt *virt = &adev->virt;
+
+   if (!virt->ops || !virt->ops->rcvd_ras_intr)
+   return false;
+
+   return virt->ops->rcvd_ras_intr(adev);
+}
+

 unsigned int amd_sriov_msg_checksum(void *obj,
unsigned long obj_size,
@@ -612,11 +628,14 @@ static void amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
ret = amdgpu_virt_read_pf2vf_data(adev);
if (ret) {
adev->virt.vf2pf_update_retry_cnt++;
-   if ((adev->virt.vf2pf_update_retry_cnt >= AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
-   amdgpu_sriov_runtime(adev)) {
+
+   if ((amdgpu_virt_rcvd_ras_interrupt(adev) ||
+   adev->virt.vf2pf_update_retry_cnt >= AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
+   amdgpu_sriov_runtime(adev)) {
+
amdgpu_ras_set_fed(adev, true);
if (amdgpu_reset_domain_schedule(adev->reset_domain,
-   &adev->kfd.reset_work))
+   &adev->kfd.reset_work))
return;
else
dev_err(adev->dev, "Failed to queue work! at %s", __func__);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index f04cd1586c7220..b42a8854dca0cb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -52,7 +52,7 @@
 /* tonga/fiji use this offset */
 #define mmBIF_IOV_FUNC_IDENTIFIER 0x1503

-#define AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT 5
+#define AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT 2

 enum amdgpu_sriov_vf_mode {
SRIOV_VF_MODE_BARE_METAL = 0,
@@ -94,6 +94,7 @@ struct amdgpu_virt_ops {
  u32 data1, u32 data2, u32 data3);
void (*ras_poison_handler)(struct amdgpu_device *adev,