Re: [PATCH] drm/amdkfd: disable SVM for GC 10.1.3/4

2023-09-07 Thread Felix Kuehling
We need heavy-weight flushes not just for SVM. If this is broken, it will 
affect ROCm either way.


Regards,
  Felix


On 2023-09-07 08:08, Lang Yu wrote:

GC 10.1.3/4 have problems with TLB_FLUSH_HEAVYWEIGHT,
which is used by SVM in svm_range_unmap_from_gpus(),
so disable SVM on these GPUs.

Signed-off-by: Lang Yu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 22 +-
  1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 7d82c7da223a..dd3db3d88d59 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -992,6 +992,22 @@ static const struct dev_pagemap_ops svm_migrate_pgmap_ops = {
  /* Each VRAM page uses sizeof(struct page) on system memory */
  #define SVM_HMM_PAGE_STRUCT_SIZE(size) ((size)/PAGE_SIZE * sizeof(struct page))
  
+static inline bool is_zone_device_needed(struct amdgpu_device *adev)
+{
+   /* Page migration works on gfx9 or newer */
+   if (adev->ip_versions[GC_HWIP][0] < IP_VERSION(9, 0, 1))
+   return false;
+
+   if (adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 1, 3) ||
+   adev->ip_versions[GC_HWIP][0] == IP_VERSION(10, 1, 4))
+   return false;
+
+   if (adev->gmc.is_app_apu)
+   return false;
+
+   return true;
+}
+
  int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
  {
	struct amdgpu_kfd_dev *kfddev = &adev->kfd;
@@ -1000,11 +1016,7 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
unsigned long size;
void *r;
  
-	/* Page migration works on gfx9 or newer */
-   if (adev->ip_versions[GC_HWIP][0] < IP_VERSION(9, 0, 1))
-   return -EINVAL;
-
-   if (adev->gmc.is_app_apu)
+   if (!is_zone_device_needed(adev))
return 0;
  
	pgmap = &kfddev->pgmap;
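For reference, the gating logic that is_zone_device_needed() adds can be restated as a plain userspace C sketch. The IP_VERSION packing below mirrors the kernel macro; the function and flag names are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the kernel's IP_VERSION(ver, rel, var) packing */
#define IP_VERSION(ver, rel, var) (((ver) << 16) | ((rel) << 8) | (var))

/* Illustrative restatement of is_zone_device_needed(): gate SVM
 * zone-device setup on the GC IP version and the app-APU flag. */
bool zone_device_needed(uint32_t gc_ver, bool is_app_apu)
{
	/* Page migration works on gfx9 (GC 9.0.1) or newer */
	if (gc_ver < IP_VERSION(9, 0, 1))
		return false;

	/* GC 10.1.3/4 have problems with heavy-weight TLB flushes */
	if (gc_ver == IP_VERSION(10, 1, 3) ||
	    gc_ver == IP_VERSION(10, 1, 4))
		return false;

	/* App APUs carve device memory out of system memory */
	if (is_app_apu)
		return false;

	return true;
}
```

Because the three components are packed high-to-low, a single integer comparison orders versions correctly, which is why the `< IP_VERSION(9, 0, 1)` check works.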


Re: [PATCHv3] drm/amdkfd: Fix unaligned 64-bit doorbell warning

2023-09-06 Thread Felix Kuehling

On 2023-09-06 11:39, Mukul Joshi wrote:

This patch fixes the following unaligned 64-bit doorbell
warning, seen when submitting packets on the HIQ on GFX v9.4.3,
by making the HIQ doorbell 64-bit aligned. The warning is seen
when the GPU is loaded in any mode other than SPX mode.

[  +0.000301] [ cut here ]
[  +0.03] Unaligned 64-bit doorbell
[  +0.30] WARNING: /amdkfd/kfd_doorbell.c:339 
write_kernel_doorbell64+0x72/0x80
[  +0.03] RIP: 0010:write_kernel_doorbell64+0x72/0x80
[  +0.04] RSP: 0018:c90004287730 EFLAGS: 00010246
[  +0.05] RAX:  RBX:  RCX: 
[  +0.03] RDX: 0001 RSI: 82837c71 RDI: 
[  +0.03] RBP: c90004287748 R08: 0003 R09: 0001
[  +0.02] R10: 001a R11: 88a034008198 R12: c900013bd004
[  +0.03] R13: 0008 R14: c900042877b0 R15: 007f
[  +0.03] FS:  7fa8c7b62000() GS:889f8840() 
knlGS:
[  +0.04] CS:  0010 DS:  ES:  CR0: 80050033
[  +0.03] CR2: 56111c45aaf0 CR3: 0001414f2002 CR4: 00770ee0
[  +0.03] PKRU: 5554
[  +0.02] Call Trace:
[  +0.04]  
[  +0.06]  kq_submit_packet+0x45/0x50 [amdgpu]
[  +0.000524]  pm_send_set_resources+0x7f/0xc0 [amdgpu]
[  +0.000500]  set_sched_resources+0xe4/0x160 [amdgpu]
[  +0.000503]  start_cpsch+0x1c5/0x2a0 [amdgpu]
[  +0.000497]  kgd2kfd_device_init.cold+0x816/0xb42 [amdgpu]
[  +0.000743]  amdgpu_amdkfd_device_init+0x15f/0x1f0 [amdgpu]
[  +0.000602]  amdgpu_device_init.cold+0x1813/0x2176 [amdgpu]
[  +0.000684]  ? pci_bus_read_config_word+0x4a/0x80
[  +0.12]  ? do_pci_enable_device+0xdc/0x110
[  +0.08]  amdgpu_driver_load_kms+0x1a/0x110 [amdgpu]
[  +0.000545]  amdgpu_pci_probe+0x197/0x400 [amdgpu]

Fixes: cfeaeb3c0ce7 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells")
Signed-off-by: Mukul Joshi 


Reviewed-by: Felix Kuehling 



---
v1->v2:
- Update the logic to make it work with both 32-bit and
   64-bit doorbells.
- Add the Fixes tag
v2->v3:
- Revert to the original change to align it with what's done in
   amdgpu_doorbell_index_on_bar.

  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index c2e0b79dcc6d..7b38537c7c99 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -162,6 +162,7 @@ void __iomem *kfd_get_kernel_doorbell(struct kfd_dev *kfd,
return NULL;
  
  	*doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, kfd->doorbells, inx);

+   inx *= 2;
  
  	pr_debug("Get kernel queue doorbell\n"

" doorbell offset   == 0x%08X\n"
@@ -176,6 +177,7 @@ void kfd_release_kernel_doorbell(struct kfd_dev *kfd, u32 
__iomem *db_addr)
unsigned int inx;
  
  	inx = (unsigned int)(db_addr - kfd->doorbell_kernel_ptr);

+   inx /= 2;
  
	mutex_lock(&kfd->doorbell_mutex);

__clear_bit(inx, kfd->doorbell_bitmap);


Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-31 Thread Felix Kuehling

On 2023-08-31 16:33, Chen, Xiaogang wrote:
That said, I'm not actually sure why we're freeing the DMA 
address array after migration to RAM at all. I think we still 
need it even when we're using VRAM. We call svm_range_dma_map in 
svm_range_validate_and_map regardless of whether the range is in 
VRAM or system memory. So it will just allocate a new array the 
next time the range is validated anyway. VRAM pages use a special 
address encoding to indicate VRAM pages to the GPUVM code.
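The "special address encoding" mentioned above can be pictured as a flag bit carried in the per-page address array, so GPUVM code can tell VRAM pages from DMA-mapped system pages. This is only a schematic sketch; the flag bit and function names are illustrative, not the driver's actual encoding:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bit; the real driver uses its own encoding. */
#define ADDR_VRAM_FLAG 0x1ULL

/* Mark an address-array entry as referring to a VRAM page */
uint64_t encode_vram_addr(uint64_t vram_offset)
{
	return vram_offset | ADDR_VRAM_FLAG;
}

/* Entries without the flag are ordinary DMA addresses of system pages */
bool addr_is_vram(uint64_t addr)
{
	return (addr & ADDR_VRAM_FLAG) != 0;
}

/* Strip the marker to recover the underlying offset */
uint64_t decode_addr(uint64_t addr)
{
	return addr & ~ADDR_VRAM_FLAG;
}
```

With such an encoding, one array can describe a range whose pages live partly in system memory and partly in VRAM, which is why the array remains useful after a migration to VRAM.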


I think we do not need to free the DMA address array, as you said; it 
is another thing though.


We need to unmap the DMA address (dma_unmap_page) after migrating 
from ram to vram, because we always do dma_map_page in 
svm_range_validate_and_map. If not, we would have multiple DMA 
mappings for the same sys ram page.


svm_range_dma_map_dev calls dma_unmap_page before overwriting an 
existing valid entry in the dma_addr array. Anyway, dma unmapping 
the old pages in bulk may still be cleaner. And it avoids delays in 
cleaning up DMA mappings after migrations.
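The overwrite-safe pattern described above can be sketched in userspace C with mock map/unmap calls, so the double-mapping hazard becomes countable. All names here are illustrative, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counter of outstanding "DMA mappings" in this mock */
int active_mappings;

uint64_t mock_dma_map(uint64_t page)
{
	active_mappings++;
	return page | 1;	/* non-zero == "valid mapping" */
}

void mock_dma_unmap(uint64_t addr)
{
	(void)addr;
	active_mappings--;
}

/* Revalidate a range: map every page, but unmap any stale entry first,
 * the way svm_range_dma_map_dev overwrites valid entries. */
void dma_map_range(uint64_t *dma_addr, const uint64_t *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (dma_addr[i])	/* stale mapping from a previous pass */
			mock_dma_unmap(dma_addr[i]);
		dma_addr[i] = mock_dma_map(pages[i]);
	}
}
```

Running two validation passes over the same four pages leaves exactly four mappings outstanding, not eight; without the unmap-before-overwrite step the count would leak.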


Regards,
  Felix


Then we may not need to do dma_unmap after migrating from ram to vram, 
since svm_range_dma_map_dev always does dma_unmap_page if the address 
is a valid DMA address for sys ram, and after migrating from ram to 
vram we always do the gpu mapping?


I think with XNACK enabled, the DMA mapping may be delayed until a 
page fault. For example on a multi-GPU system, GPU1 page faults and 
migrates data from system memory to its VRAM. Immediately afterwards, 
the page fault handler should use svm_validate_and_map to update GPU1 
page tables. But GPU2 page tables are not updated immediately. So the 
now stale DMA mappings for GPU2 would continue to exist until the 
next page fault on GPU2.


Regards,
  Felix

If I understand correctly: when the user calls svm_range_set_attr, if 
p->xnack_enabled is true, we can skip calling svm_range_validate_and_map. 
We postpone validating and gpu-mapping the buffer until a page fault, or 
until the buffer is really used by a GPU, and only dma-map and gpu-map 
for that GPU.


The current implementation of svm_range_set_attr skips the validation 
after migration if XNACK is off, because it is handled by 
svm_range_restore_work that gets scheduled by the MMU notifier triggered 
by the migration.


With XNACK on, svm_range_set_attr currently validates and maps after 
migration assuming that the data will be used by the GPU(s) soon. That 
is something we could change and let page faults take care of the 
mappings as needed.


Regards,
  Felix




Regards

Xiaogang 


Re: [PATCH v2 1/2] drm/amdgpu: Merge debug module parameters

2023-08-31 Thread Felix Kuehling



On 2023-08-30 18:08, André Almeida wrote:

Merge all developer debug options, available as separate module
parameters, into one, making it obvious that they are for developers.

Drop the obsolete module options in favor of the new ones.

Signed-off-by: André Almeida 
---
v2:
- drop old module params
- use BIT() macros
- replace global var with adev-> vars
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h  |  4 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  | 48 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c  |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c   |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c|  2 +-
  drivers/gpu/drm/amd/include/amd_shared.h |  8 
  8 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4de074243c4d..82eaccfce347 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1101,6 +1101,10 @@ struct amdgpu_device {
booldc_enabled;
/* Mask of active clusters */
uint32_taid_mask;
+
+   /* Debug */
+   booldebug_vm;
+   booldebug_largebar;
  };
  
  static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fb78a8f47587..8a26bed76505 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1191,7 +1191,7 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser 
*p)
job->vm_pd_addr = amdgpu_gmc_pd_addr(vm->root.bo);
}
  
-	if (amdgpu_vm_debug) {

+   if (adev->debug_vm) {
/* Invalidate all BOs to test for userspace bugs */
amdgpu_bo_list_for_each_entry(e, p->bo_list) {
struct amdgpu_bo *bo = ttm_to_amdgpu_bo(e->tv.bo);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index f5856b82605e..0cd48c025433 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -140,7 +140,6 @@ int amdgpu_vm_size = -1;
  int amdgpu_vm_fragment_size = -1;
  int amdgpu_vm_block_size = -1;
  int amdgpu_vm_fault_stop;
-int amdgpu_vm_debug;
  int amdgpu_vm_update_mode = -1;
  int amdgpu_exp_hw_support;
  int amdgpu_dc = -1;
@@ -194,6 +193,7 @@ int amdgpu_use_xgmi_p2p = 1;
  int amdgpu_vcnfw_log;
  int amdgpu_sg_display = -1; /* auto */
  int amdgpu_user_partt_mode = AMDGPU_AUTO_COMPUTE_PARTITION_MODE;
+uint amdgpu_debug_mask;
  
  static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work);
  
@@ -405,13 +405,6 @@ module_param_named(vm_block_size, amdgpu_vm_block_size, int, 0444);

  MODULE_PARM_DESC(vm_fault_stop, "Stop on VM fault (0 = never (default), 1 = print 
first, 2 = always)");
  module_param_named(vm_fault_stop, amdgpu_vm_fault_stop, int, 0444);
  
-/**

- * DOC: vm_debug (int)
- * Debug VM handling (0 = disabled, 1 = enabled). The default is 0 (Disabled).
- */
-MODULE_PARM_DESC(vm_debug, "Debug VM handling (0 = disabled (default), 1 = 
enabled)");
-module_param_named(vm_debug, amdgpu_vm_debug, int, 0644);


This parameter used to be writable, which means it could be changed 
through sysfs after loading the module. Code looking at the global 
variable would see the last value written by user mode. With your 
changes, this is no longer writable, and driver code is now looking at 
adev->debug_vm, which cannot be updated through sysfs. As long as 
everyone is OK with that change, I have no objections. Just pointing it out.
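The behavioral difference being pointed out can be sketched in userspace C: a writable (0644) parameter is re-read on every use, while a value snapshotted into adev at init time is frozen afterwards. The struct and function names below are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a writable module parameter: code that reads the global
 * each time sees sysfs writes immediately. */
int vm_debug_param;

bool vm_debug_live(void)
{
	return vm_debug_param != 0;
}

/* Stand-in for the per-device cached copy (like adev->debug_vm):
 * sampled once at init time and never re-read. */
struct fake_adev {
	bool debug_vm;
};

void fake_device_init(struct fake_adev *adev)
{
	adev->debug_vm = vm_debug_param != 0;
}
```

If the parameter is flipped after init, only the live read observes the change; the cached copy keeps its init-time value, which is exactly the trade-off of moving from a 0644 global to an adev field.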


Regardless, this patch is

Acked-by: Felix Kuehling 



-
  /**
   * DOC: vm_update_mode (int)
   * Override VM update mode. VM updated by using CPU (0 = never, 1 = Graphics 
only, 2 = Compute only, 3 = Both). The default
@@ -743,18 +736,6 @@ module_param(send_sigterm, int, 0444);
  MODULE_PARM_DESC(send_sigterm,
"Send sigterm to HSA process on unhandled exception (0 = disable, 1 = 
enable)");
  
-/**

- * DOC: debug_largebar (int)
- * Set debug_largebar as 1 to enable simulating large-bar capability on 
non-large bar
- * system. This limits the VRAM size reported to ROCm applications to the 
visible
- * size, usually 256MB.
- * Default value is 0, diabled.
- */
-int debug_largebar;
-module_param(debug_largebar, int, 0444);
-MODULE_PARM_DESC(debug_largebar,
-   "Debug large-bar flag used to simulate large-bar capability on non-large bar 
machine (0 = disable, 1 = enable)");
-
  /**
   * DOC: halt_if_hws_hang (int)
   * Halt if HWS hang is detected. Default value, 0, disables the halt on hang.
@@ -938,6 +919,18 @@ module_param_named(user_partt_mode, 
amdgpu_user_partt_mo

Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-31 Thread Felix Kuehling



On 2023-08-30 19:02, Chen, Xiaogang wrote:


On 8/30/2023 3:56 PM, Felix Kuehling wrote:


On 2023-08-30 15:39, Chen, Xiaogang wrote:


On 8/28/2023 5:37 PM, Felix Kuehling wrote:


On 2023-08-28 16:57, Chen, Xiaogang wrote:


On 8/28/2023 2:06 PM, Felix Kuehling wrote:


On 2023-08-24 18:08, Xiaogang.Chen wrote:

From: Xiaogang Chen 

This patch implements partial migration in gpu page fault handling 
according to migration granularity (default 2MB) and does not split 
the svm range in cpu page fault handling. Now a svm range may have 
pages from both system ram and vram of one gpu.
These changes are expected to improve migration performance and 
reduce mmu callback and TLB flush workloads.

Signed-off-by: xiaogang chen 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 153 
+++

  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  87 -
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   7 +-
  4 files changed, 162 insertions(+), 91 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c

index 7d82c7da223a..5a3aa80a1834 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -479,6 +479,8 @@ svm_migrate_vma_to_vram(struct kfd_node 
*node, struct svm_range *prange,
   * svm_migrate_ram_to_vram - migrate svm range from system to 
device

   * @prange: range structure
   * @best_loc: the device to migrate to
+ * @start_mgr: start page to migrate
+ * @last_mgr: last page to migrate
   * @mm: the process mm structure
   * @trigger: reason of migration
   *
@@ -489,6 +491,7 @@ svm_migrate_vma_to_vram(struct kfd_node 
*node, struct svm_range *prange,

   */
  static int
  svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t 
best_loc,

+    unsigned long start_mgr, unsigned long last_mgr,
  struct mm_struct *mm, uint32_t trigger)
  {
  unsigned long addr, start, end;
@@ -498,9 +501,9 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

  unsigned long cpages = 0;
  long r = 0;
  -    if (prange->actual_loc == best_loc) {
-    pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 
0x%x\n",

- prange->svms, prange->start, prange->last, best_loc);
+    if (!best_loc) {
+    pr_debug("request svms 0x%p [0x%lx 0x%lx] migrate to 
sys ram\n",

+ prange->svms, start_mgr, last_mgr);
  return 0;
  }
  @@ -513,8 +516,8 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,
  pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", 
prange->svms,

   prange->start, prange->last, best_loc);
  -    start = prange->start << PAGE_SHIFT;
-    end = (prange->last + 1) << PAGE_SHIFT;
+    start = start_mgr << PAGE_SHIFT;
+    end = (last_mgr + 1) << PAGE_SHIFT;
    r = svm_range_vram_node_new(node, prange, true);
  if (r) {
@@ -544,10 +547,12 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

    if (cpages) {
  prange->actual_loc = best_loc;
-    svm_range_free_dma_mappings(prange, true);
-    } else {
+    /* only free dma mapping in the migrated range */
+    svm_range_free_dma_mappings(prange, true, start_mgr - 
prange->start,

+ last_mgr - start_mgr + 1);


This is wrong. If we only migrated some of the pages, we should 
not free the DMA mapping array at all. The array is needed as 
long as there are any valid DMA mappings in it.


Yes, I realized it after submitting. I cannot free the DMA mapping 
array at this stage.


The concern (also related to comments below) is that I do not know 
how many pages are in vram after a partial migration. Originally I 
used a bitmap to record which pages were migrated in each migration 
function. Here I do not need to use an hmm function to get that 
info; inside each migration function we know which pages got 
migrated, and can then update the bitmap accordingly.




I think the condition above with cpages should be updated. 
Instead of cpages, we need to keep track of a count of pages in 
VRAM in struct svm_range. See more below.


I think you want to add a new integer in svm_range to remember how 
many pages are on the vram side for each svm_range, instead of a 
bitmap. There is a problem I saw: when we need to split a prange 
(such as when the user uses the set_attr API), how do we know how 
many pages are in vram for each split prange?


Right, that's a bit problematic. But it should be a relatively rare 
corner case. It may be good enough to make a "pessimistic" 
assumption when splitting ranges that have some pages in VRAM, that 
everything is in VRAM. And update that to 0 after migrate_to_ram 
for the entire range, to allow the BO reference to be released.


migrate_to_ram is a partial migration too, where only 2MB of vram got migrated.

Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-30 Thread Felix Kuehling



On 2023-08-30 15:39, Chen, Xiaogang wrote:


On 8/28/2023 5:37 PM, Felix Kuehling wrote:


On 2023-08-28 16:57, Chen, Xiaogang wrote:


On 8/28/2023 2:06 PM, Felix Kuehling wrote:


On 2023-08-24 18:08, Xiaogang.Chen wrote:

From: Xiaogang Chen 

This patch implements partial migration in gpu page fault handling 
according to migration granularity (default 2MB) and does not split 
the svm range in cpu page fault handling. Now a svm range may have 
pages from both system ram and vram of one gpu.
These changes are expected to improve migration performance and 
reduce mmu callback and TLB flush workloads.

Signed-off-by: xiaogang chen 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 153 
+++

  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  87 -
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   7 +-
  4 files changed, 162 insertions(+), 91 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c

index 7d82c7da223a..5a3aa80a1834 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -479,6 +479,8 @@ svm_migrate_vma_to_vram(struct kfd_node *node, 
struct svm_range *prange,
   * svm_migrate_ram_to_vram - migrate svm range from system to 
device

   * @prange: range structure
   * @best_loc: the device to migrate to
+ * @start_mgr: start page to migrate
+ * @last_mgr: last page to migrate
   * @mm: the process mm structure
   * @trigger: reason of migration
   *
@@ -489,6 +491,7 @@ svm_migrate_vma_to_vram(struct kfd_node *node, 
struct svm_range *prange,

   */
  static int
  svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t 
best_loc,

+    unsigned long start_mgr, unsigned long last_mgr,
  struct mm_struct *mm, uint32_t trigger)
  {
  unsigned long addr, start, end;
@@ -498,9 +501,9 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

  unsigned long cpages = 0;
  long r = 0;
  -    if (prange->actual_loc == best_loc) {
-    pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 
0x%x\n",

- prange->svms, prange->start, prange->last, best_loc);
+    if (!best_loc) {
+    pr_debug("request svms 0x%p [0x%lx 0x%lx] migrate to sys 
ram\n",

+ prange->svms, start_mgr, last_mgr);
  return 0;
  }
  @@ -513,8 +516,8 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

  pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms,
   prange->start, prange->last, best_loc);
  -    start = prange->start << PAGE_SHIFT;
-    end = (prange->last + 1) << PAGE_SHIFT;
+    start = start_mgr << PAGE_SHIFT;
+    end = (last_mgr + 1) << PAGE_SHIFT;
    r = svm_range_vram_node_new(node, prange, true);
  if (r) {
@@ -544,10 +547,12 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

    if (cpages) {
  prange->actual_loc = best_loc;
-    svm_range_free_dma_mappings(prange, true);
-    } else {
+    /* only free dma mapping in the migrated range */
+    svm_range_free_dma_mappings(prange, true, start_mgr - 
prange->start,

+ last_mgr - start_mgr + 1);


This is wrong. If we only migrated some of the pages, we should not 
free the DMA mapping array at all. The array is needed as long as 
there are any valid DMA mappings in it.


Yes, I realized it after submitting. I cannot free the DMA mapping 
array at this stage.


The concern (also related to comments below) is that I do not know how 
many pages are in vram after a partial migration. Originally I used a 
bitmap to record which pages were migrated in each migration function. 
Here I do not need to use an hmm function to get that info; inside 
each migration function we know which pages got migrated, and can then 
update the bitmap accordingly.




I think the condition above with cpages should be updated. Instead 
of cpages, we need to keep track of a count of pages in VRAM in 
struct svm_range. See more below.


I think you want to add a new integer in svm_range to remember how many 
pages are on the vram side for each svm_range, instead of a bitmap. 
There is a problem I saw: when we need to split a prange (such as when 
the user uses the set_attr API), how do we know how many pages are in 
vram for each split prange?


Right, that's a bit problematic. But it should be a relatively rare 
corner case. It may be good enough to make a "pessimistic" assumption 
when splitting ranges that have some pages in VRAM, that everything 
is in VRAM. And update that to 0 after migrate_to_ram for the entire 
range, to allow the BO reference to be released.


migrate_to_ram is a partial migration too, where only 2MB of vram got 
migrated. After the split, if we assume all pages are vram 
(pessimistic), we will give the ne

Re: [PATCHv2] drm/amdkfd: Fix unaligned 64-bit doorbell warning

2023-08-30 Thread Felix Kuehling

On 2023-08-30 16:01, Mukul Joshi wrote:

This patch fixes the following unaligned 64-bit doorbell
warning, seen when submitting packets on the HIQ on GFX v9.4.3,
by making the HIQ doorbell 64-bit aligned. The warning is seen
when the GPU is loaded in any mode other than SPX mode.

[  +0.000301] [ cut here ]
[  +0.03] Unaligned 64-bit doorbell
[  +0.30] WARNING: /amdkfd/kfd_doorbell.c:339 
write_kernel_doorbell64+0x72/0x80 [amdgpu]
[  +0.03] RIP: 0010:write_kernel_doorbell64+0x72/0x80 [amdgpu]
[  +0.04] RSP: 0018:c90004287730 EFLAGS: 00010246
[  +0.05] RAX:  RBX:  RCX: 
[  +0.03] RDX: 0001 RSI: 82837c71 RDI: 
[  +0.03] RBP: c90004287748 R08: 0003 R09: 0001
[  +0.02] R10: 001a R11: 88a034008198 R12: c900013bd004
[  +0.03] R13: 0008 R14: c900042877b0 R15: 007f
[  +0.03] FS:  7fa8c7b62000() GS:889f8840() 
knlGS:
[  +0.04] CS:  0010 DS:  ES:  CR0: 80050033
[  +0.03] CR2: 56111c45aaf0 CR3: 0001414f2002 CR4: 00770ee0
[  +0.03] PKRU: 5554
[  +0.02] Call Trace:
[  +0.04]  
[  +0.06]  kq_submit_packet+0x45/0x50 [amdgpu]
[  +0.000524]  pm_send_set_resources+0x7f/0xc0 [amdgpu]
[  +0.000500]  set_sched_resources+0xe4/0x160 [amdgpu]
[  +0.000503]  start_cpsch+0x1c5/0x2a0 [amdgpu]
[  +0.000497]  kgd2kfd_device_init.cold+0x816/0xb42 [amdgpu]
[  +0.000743]  amdgpu_amdkfd_device_init+0x15f/0x1f0 [amdgpu]
[  +0.000602]  amdgpu_device_init.cold+0x1813/0x2176 [amdgpu]
[  +0.000684]  ? pci_bus_read_config_word+0x4a/0x80
[  +0.12]  ? do_pci_enable_device+0xdc/0x110
[  +0.08]  amdgpu_driver_load_kms+0x1a/0x110 [amdgpu]
[  +0.000545]  amdgpu_pci_probe+0x197/0x400 [amdgpu]

Fixes: cfeaeb3c0ce7 ("drm/amdgpu: use doorbell mgr for kfd kernel doorbells")
Signed-off-by: Mukul Joshi 
---
v1->v2:
- Update the logic to make it work with both 32-bit and
   64-bit doorbells.
- Add the Fixes tag.

  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index c2e0b79dcc6d..e0d44f4af18e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -162,6 +162,7 @@ void __iomem *kfd_get_kernel_doorbell(struct kfd_dev *kfd,
return NULL;
  
  	*doorbell_off = amdgpu_doorbell_index_on_bar(kfd->adev, kfd->doorbells, inx);

+   inx *= kfd->device_info.doorbell_size / sizeof(u32);


Sorry for going back and forth on this. But you pointed out offline, 
that amdgpu_doorbell_index_on_bar calculates the doorbell address on the 
bar by always multiplying with 2. I think we need to do the same thing 
here for calculating the CPU address of the doorbell. Otherwise the CPU 
may not write to the same doorbell that the GPU is listening on. In 
practice this only matters on GPUs that create multiple HIQs. But at 
least I'd like the driver to be internally consistent and calculate the 
doorbell addresses the same way in the two addresses spaces.
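The consistency requirement above reduces to using the same scaling factor on both sides: doorbell_kernel_ptr is an array of u32 slots, so a 64-bit doorbell at kernel-queue index inx occupies two slots, and releasing must invert the same arithmetic. A schematic userspace sketch (array size and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for kfd->doorbell_kernel_ptr: a page of u32 doorbell slots */
uint32_t doorbells[64];

/* CPU address of the 64-bit doorbell at kernel-queue index inx */
uint32_t *get_kernel_doorbell(unsigned int inx)
{
	return doorbells + inx * 2;	/* two u32 slots per 64-bit doorbell */
}

/* Invert the scaling so the bitmap bit cleared on release matches the
 * bit set on allocation */
unsigned int release_kernel_doorbell(uint32_t *db_addr)
{
	return (unsigned int)(db_addr - doorbells) / 2;
}
```

The round trip get → release must recover the original index for every slot; if one side scaled and the other did not, the release path would clear the wrong bitmap bit.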



  
  	pr_debug("Get kernel queue doorbell\n"

" doorbell offset   == 0x%08X\n"
@@ -175,7 +176,8 @@ void kfd_release_kernel_doorbell(struct kfd_dev *kfd, u32 
__iomem *db_addr)
  {
unsigned int inx;
  
-	inx = (unsigned int)(db_addr - kfd->doorbell_kernel_ptr);

+   inx = (unsigned int)(db_addr - kfd->doorbell_kernel_ptr)
+   * sizeof(u32) / kfd->device_info.doorbell_size;


Same as above.

Regards,
  Felix


  
	mutex_lock(&kfd->doorbell_mutex);

__clear_bit(inx, kfd->doorbell_bitmap);


Re: [PATCH] drm/amdkfd: Fix unaligned 64-bit doorbell warning

2023-08-30 Thread Felix Kuehling
+Shashank, FYI. I believe this is a regression from your patch 
"drm/amdgpu: use doorbell mgr for kfd kernel doorbells".


On 2023-08-29 12:16, Mukul Joshi wrote:


This patch fixes the following unaligned 64-bit doorbell
warning, seen when submitting packets on the HIQ on GFX v9.4.3,
by making the HIQ doorbell 64-bit aligned. The warning is seen
when the GPU is loaded in any mode other than SPX mode.

[  +0.000301] [ cut here ]
[  +0.03] Unaligned 64-bit doorbell
[  +0.30] WARNING: /amdkfd/kfd_doorbell.c:339 
write_kernel_doorbell64+0x72/0x80 [amdgpu]
[  +0.03] RIP: 0010:write_kernel_doorbell64+0x72/0x80 [amdgpu]
[  +0.04] RSP: 0018:c90004287730 EFLAGS: 00010246
[  +0.05] RAX:  RBX:  RCX: 
[  +0.03] RDX: 0001 RSI: 82837c71 RDI: 
[  +0.03] RBP: c90004287748 R08: 0003 R09: 0001
[  +0.02] R10: 001a R11: 88a034008198 R12: c900013bd004
[  +0.03] R13: 0008 R14: c900042877b0 R15: 007f
[  +0.03] FS:  7fa8c7b62000() GS:889f8840() 
knlGS:
[  +0.04] CS:  0010 DS:  ES:  CR0: 80050033
[  +0.03] CR2: 56111c45aaf0 CR3: 0001414f2002 CR4: 00770ee0
[  +0.03] PKRU: 5554
[  +0.02] Call Trace:
[  +0.04]  
[  +0.06]  kq_submit_packet+0x45/0x50 [amdgpu]
[  +0.000524]  pm_send_set_resources+0x7f/0xc0 [amdgpu]
[  +0.000500]  set_sched_resources+0xe4/0x160 [amdgpu]
[  +0.000503]  start_cpsch+0x1c5/0x2a0 [amdgpu]
[  +0.000497]  kgd2kfd_device_init.cold+0x816/0xb42 [amdgpu]
[  +0.000743]  amdgpu_amdkfd_device_init+0x15f/0x1f0 [amdgpu]
[  +0.000602]  amdgpu_device_init.cold+0x1813/0x2176 [amdgpu]
[  +0.000684]  ? pci_bus_read_config_word+0x4a/0x80
[  +0.12]  ? do_pci_enable_device+0xdc/0x110
[  +0.08]  amdgpu_driver_load_kms+0x1a/0x110 [amdgpu]
[  +0.000545]  amdgpu_pci_probe+0x197/0x400 [amdgpu]

Signed-off-by: Mukul Joshi 


This should have a Fixes tag:

Fixes: cfeaeb3c0ce7 ("drm/amdgpu: use doorbell mgr for kfd kernel 
doorbells")


The original code before that patch used "* sizeof(u32) / 
kfd->device_info.doorbell_size" instead of "* 2". May be safer to 
restore the original calculation to have the correct doorbell size on 
old and new GPUs.
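The original calculation referred to above derives the slot stride from the per-ASIC doorbell size rather than hard-coding 2, so it handles both 4-byte and 8-byte doorbells. A minimal sketch of that arithmetic, assuming doorbell_size is the per-device value from device_info:

```c
#include <assert.h>
#include <stdint.h>

/* Number of u32 slots occupied by one doorbell of the given byte size.
 * Scaling by doorbell_size instead of a literal 2 keeps the index
 * arithmetic correct on ASICs with 32-bit doorbells as well. */
unsigned int doorbell_slot_stride(unsigned int doorbell_size)
{
	return doorbell_size / (unsigned int)sizeof(uint32_t);
}
```

For 8-byte doorbells this yields the same factor of 2 as the hard-coded version, while 4-byte doorbells get a stride of 1.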


Regards,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index c2e0b79dcc6d..b1c2772c3a8d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -168,7 +168,7 @@ void __iomem *kfd_get_kernel_doorbell(struct kfd_dev *kfd,
" doorbell index== 0x%x\n",
*doorbell_off, inx);
  
-	return kfd->doorbell_kernel_ptr + inx;

+   return kfd->doorbell_kernel_ptr + inx * 2;
  }
  
  void kfd_release_kernel_doorbell(struct kfd_dev *kfd, u32 __iomem *db_addr)

@@ -176,6 +176,7 @@ void kfd_release_kernel_doorbell(struct kfd_dev *kfd, u32 
__iomem *db_addr)
unsigned int inx;
  
  	inx = (unsigned int)(db_addr - kfd->doorbell_kernel_ptr);

+   inx /= 2;
  
	mutex_lock(&kfd->doorbell_mutex);

__clear_bit(inx, kfd->doorbell_bitmap);


Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-28 Thread Felix Kuehling



On 2023-08-28 16:57, Chen, Xiaogang wrote:


On 8/28/2023 2:06 PM, Felix Kuehling wrote:


On 2023-08-24 18:08, Xiaogang.Chen wrote:

From: Xiaogang Chen 

This patch implements partial migration in gpu page fault handling 
according to migration granularity (default 2MB) and does not split 
the svm range in cpu page fault handling. Now a svm range may have 
pages from both system ram and vram of one gpu.

These changes are expected to improve migration performance and reduce
mmu callback and TLB flush workloads.

Signed-off-by: xiaogang chen 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 153 
+++

  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  87 -
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   7 +-
  4 files changed, 162 insertions(+), 91 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c

index 7d82c7da223a..5a3aa80a1834 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -479,6 +479,8 @@ svm_migrate_vma_to_vram(struct kfd_node *node, 
struct svm_range *prange,

   * svm_migrate_ram_to_vram - migrate svm range from system to device
   * @prange: range structure
   * @best_loc: the device to migrate to
+ * @start_mgr: start page to migrate
+ * @last_mgr: last page to migrate
   * @mm: the process mm structure
   * @trigger: reason of migration
   *
@@ -489,6 +491,7 @@ svm_migrate_vma_to_vram(struct kfd_node *node, 
struct svm_range *prange,

   */
  static int
  svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
+    unsigned long start_mgr, unsigned long last_mgr,
  struct mm_struct *mm, uint32_t trigger)
  {
  unsigned long addr, start, end;
@@ -498,9 +501,9 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

  unsigned long cpages = 0;
  long r = 0;
  -    if (prange->actual_loc == best_loc) {
-    pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n",
- prange->svms, prange->start, prange->last, best_loc);
+    if (!best_loc) {
+    pr_debug("request svms 0x%p [0x%lx 0x%lx] migrate to sys 
ram\n",

+ prange->svms, start_mgr, last_mgr);
  return 0;
  }
  @@ -513,8 +516,8 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

  pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms,
   prange->start, prange->last, best_loc);
  -    start = prange->start << PAGE_SHIFT;
-    end = (prange->last + 1) << PAGE_SHIFT;
+    start = start_mgr << PAGE_SHIFT;
+    end = (last_mgr + 1) << PAGE_SHIFT;
    r = svm_range_vram_node_new(node, prange, true);
  if (r) {
@@ -544,10 +547,12 @@ svm_migrate_ram_to_vram(struct svm_range 
*prange, uint32_t best_loc,

    if (cpages) {
  prange->actual_loc = best_loc;
-    svm_range_free_dma_mappings(prange, true);
-    } else {
+    /* only free dma mapping in the migrated range */
+    svm_range_free_dma_mappings(prange, true,  start_mgr - 
prange->start,

+ last_mgr - start_mgr + 1);


This is wrong. If we only migrated some of the pages, we should not 
free the DMA mapping array at all. The array is needed as long as 
there are any valid DMA mappings in it.


Yes, I realized it after submitting. I cannot free the DMA mapping 
array at this stage.


The concern (also related to comments below) is that I do not know how 
many pages are in vram after a partial migration. Originally I used a 
bitmap to record which pages were migrated in each migration function. 
Here I do not need to use an hmm function to get that info; inside each 
migration function we know which pages got migrated, and can then 
update the bitmap accordingly.




I think the condition above with cpages should be updated. Instead of 
cpages, we need to keep track of a count of pages in VRAM in struct 
svm_range. See more below.


I think you want to add a new integer in svm_range, instead of a bitmap, to 
remember how many pages are on the VRAM side for each svm_range. One 
problem I see: when we need to split a prange (such as when the user calls 
the set_attr API), how do we know how many pages are in VRAM for each split 
prange?


Right, that's a bit problematic. But it should be a relatively rare 
corner case. It may be good enough to make a "pessimistic" assumption 
when splitting ranges that have some pages in VRAM, that everything is 
in VRAM. And update that to 0 after migrate_to_ram for the entire range, 
to allow the BO reference to be released.


So in the worst case, you keep your DMA addresses and BOs allocated 
slightly longer than necessary. If that doesn't work, I agree that we 
need a bitmap with one bit per 4KB page. But I hope that can be avoided.
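For illustration, a minimal userspace sketch of the counter-based alternative 
(hypothetical names and a simplified struct, not the real svm_range 
bookkeeping): a per-range vram_pages counter that is split pessimistically, so 
the DMA array and BO are only released once the count provably drops to zero.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for struct svm_range bookkeeping. */
struct range {
	unsigned long start, last;   /* page numbers, inclusive */
	unsigned long vram_pages;    /* pages believed to be in VRAM */
};

static unsigned long npages(const struct range *r)
{
	return r->last - r->start + 1;
}

/* Pessimistic split: if the parent has any pages in VRAM, assume both
 * halves are fully in VRAM so no DMA array or BO is freed too early. */
static void range_split(const struct range *parent, unsigned long mid,
			struct range *lo, struct range *hi)
{
	lo->start = parent->start;  lo->last = mid;
	hi->start = mid + 1;        hi->last = parent->last;
	lo->vram_pages = parent->vram_pages ? npages(lo) : 0;
	hi->vram_pages = parent->vram_pages ? npages(hi) : 0;
}

/* After migrating the whole range back to system RAM, drop the count to 0
 * so the VRAM BO reference can finally be released. */
static int range_migrate_all_to_ram(struct range *r)
{
	r->vram_pages = 0;
	return r->vram_pages == 0;  /* BO may be freed now */
}
```

In the worst case this keeps DMA addresses and BOs allocated slightly longer 
than strictly necessary, which is the trade-off discussed above.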


That said, I'm not actually sure w

Re: [PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-28 Thread Felix Kuehling



On 2023-08-24 18:08, Xiaogang.Chen wrote:

From: Xiaogang Chen 

This patch implements partial migration in gpu page fault according to migration
granularity(default 2MB) and not split svm range in cpu page fault handling.
Now a svm range may have pages from both system ram and vram of one gpu.
These changes are expected to improve migration performance and reduce
mmu callback and TLB flush workloads.

Signed-off-by: xiaogang chen 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 153 +++
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  87 -
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   7 +-
  4 files changed, 162 insertions(+), 91 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 7d82c7da223a..5a3aa80a1834 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -479,6 +479,8 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
   * svm_migrate_ram_to_vram - migrate svm range from system to device
   * @prange: range structure
   * @best_loc: the device to migrate to
+ * @start_mgr: start page to migrate
+ * @last_mgr: last page to migrate
   * @mm: the process mm structure
   * @trigger: reason of migration
   *
@@ -489,6 +491,7 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
   */
  static int
  svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
+   unsigned long start_mgr, unsigned long last_mgr,
struct mm_struct *mm, uint32_t trigger)
  {
unsigned long addr, start, end;
@@ -498,9 +501,9 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t 
best_loc,
unsigned long cpages = 0;
long r = 0;
  
-	if (prange->actual_loc == best_loc) {

-   pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n",
-prange->svms, prange->start, prange->last, best_loc);
+   if (!best_loc) {
+   pr_debug("request svms 0x%p [0x%lx 0x%lx] migrate to sys ram\n",
+prange->svms, start_mgr, last_mgr);
return 0;
}
  
@@ -513,8 +516,8 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,

pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms,
 prange->start, prange->last, best_loc);
  
-	start = prange->start << PAGE_SHIFT;

-   end = (prange->last + 1) << PAGE_SHIFT;
+   start = start_mgr << PAGE_SHIFT;
+   end = (last_mgr + 1) << PAGE_SHIFT;
  
  	r = svm_range_vram_node_new(node, prange, true);

if (r) {
@@ -544,10 +547,12 @@ svm_migrate_ram_to_vram(struct svm_range *prange, 
uint32_t best_loc,
  
  	if (cpages) {

prange->actual_loc = best_loc;
-   svm_range_free_dma_mappings(prange, true);
-   } else {
+   /* only free dma mapping in the migrated range */
+		svm_range_free_dma_mappings(prange, true, start_mgr - prange->start,
+					    last_mgr - start_mgr + 1);


This is wrong. If we only migrated some of the pages, we should not free 
the DMA mapping array at all. The array is needed as long as there are 
any valid DMA mappings in it.


I think the condition above with cpages should be updated. Instead of 
cpages, we need to keep track of a count of pages in VRAM in struct 
svm_range. See more below.




+   } else if (!prange->actual_loc)
+   /* if all pages from prange are at sys ram */
svm_range_vram_node_free(prange);
-   }
  
  	return r < 0 ? r : 0;

  }
@@ -762,6 +767,8 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct 
svm_range *prange,
   * svm_migrate_vram_to_ram - migrate svm range from device to system
   * @prange: range structure
   * @mm: process mm, use current->mm if NULL
+ * @start_mgr: start page need be migrated to sys ram
+ * @last_mgr: last page need be migrated to sys ram
   * @trigger: reason of migration
   * @fault_page: is from vmf->page, svm_migrate_to_ram(), this is CPU page 
fault callback
   *
@@ -771,7 +778,8 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct 
svm_range *prange,
   * 0 - OK, otherwise error code
   */
  int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm,
-			    uint32_t trigger, struct page *fault_page)
+			    unsigned long start_mgr, unsigned long last_mgr,
+			    uint32_t trigger, struct page *fault_page)
  {
struct kfd_node *node;
struct vm_area_struct *vma;
@@ -781,23 +789,30 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, 
struct mm_struct *mm,
unsigned long upages = 0;
long r = 0;
  
+	/* this prange has no vram pages to migrate to sys ram */

if 

Re: [PATCH] drm/amdkfd: Add missing gfx11 MQD manager callbacks

2023-08-28 Thread Felix Kuehling

On 2023-08-25 17:30, Harish Kasiviswanathan wrote:

From: Jay Cornwall 

mqd_stride function was introduced in commit 129c7b6a0217
("drm/amdkfd: Update MQD management on multi XCC setup")
but not assigned for gfx11. Fixes a NULL dereference in debugfs.

Signed-off-by: Jay Cornwall 
Signed-off-by: Harish Kasiviswanathan 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
index 2319467d2d95..0bbf0edbabd4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
@@ -457,6 +457,7 @@ struct mqd_manager *mqd_manager_init_v11(enum KFD_MQD_TYPE 
type,
mqd->is_occupied = kfd_is_occupied_cp;
mqd->mqd_size = sizeof(struct v11_compute_mqd);
mqd->get_wave_state = get_wave_state;
+   mqd->mqd_stride = kfd_mqd_stride;
  #if defined(CONFIG_DEBUG_FS)
mqd->debugfs_show_mqd = debugfs_show_mqd;
  #endif
@@ -472,6 +473,7 @@ struct mqd_manager *mqd_manager_init_v11(enum KFD_MQD_TYPE 
type,
mqd->destroy_mqd = destroy_hiq_mqd;
mqd->is_occupied = kfd_is_occupied_cp;
mqd->mqd_size = sizeof(struct v11_compute_mqd);
+   mqd->mqd_stride = kfd_mqd_stride;
  #if defined(CONFIG_DEBUG_FS)
mqd->debugfs_show_mqd = debugfs_show_mqd;
  #endif
@@ -501,6 +503,7 @@ struct mqd_manager *mqd_manager_init_v11(enum KFD_MQD_TYPE 
type,
mqd->destroy_mqd = kfd_destroy_mqd_sdma;
mqd->is_occupied = kfd_is_occupied_sdma;
mqd->mqd_size = sizeof(struct v11_sdma_mqd);
+   mqd->mqd_stride = kfd_mqd_stride;
  #if defined(CONFIG_DEBUG_FS)
mqd->debugfs_show_mqd = debugfs_show_mqd_sdma;
  #endif


Re: [PATCH] drm/amdkfd: use mask to get v9 interrupt sq data bits correctly

2023-08-28 Thread Felix Kuehling



On 2023-08-28 11:35, Alex Sierra wrote:

Interrupt sq data bits were not taken properly from contextid0 and contextid1.
Use macro KFD_CONTEXT_ID_GET_SQ_INT_DATA instead.

Signed-off-by: Alex Sierra 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
index f0731a6a5306..830396b1c3b1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
@@ -384,7 +384,7 @@ static void event_interrupt_wq_v9(struct kfd_node *dev,
default:
break;
}
-   kfd_signal_event_interrupt(pasid, context_id0 & 
0xff, 24);
+   kfd_signal_event_interrupt(pasid, sq_int_data, 24);
} else if (source_id == SOC15_INTSRC_CP_BAD_OPCODE) {
kfd_set_dbg_ev_from_interrupt(dev, pasid,
KFD_DEBUG_DOORBELL_ID(context_id0),


Re: [PATCH v2] drm/amdkfd: Replace pr_err with dev_err

2023-08-28 Thread Felix Kuehling



On 2023-08-26 09:41, Asad Kamal wrote:

Replace pr_err with dev_err to show the bus-id of
failing device with kfd queue errors

Signed-off-by: Asad Kamal 
Reviewed-by: Lijo Lazar 


Reviewed-by: Felix Kuehling 



---
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 116 +++---
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   2 +-
  2 files changed, 71 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index b166f30f083e..cd6cfffd6436 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -232,8 +232,8 @@ static int add_queue_mes(struct device_queue_manager *dqm, 
struct queue *q,
  
  	queue_type = convert_to_mes_queue_type(q->properties.type);

if (queue_type < 0) {
-   pr_err("Queue type not supported with MES, queue:%d\n",
-   q->properties.type);
+   dev_err(adev->dev, "Queue type not supported with MES, 
queue:%d\n",
+   q->properties.type);
return -EINVAL;
}
queue_input.queue_type = (uint32_t)queue_type;
@@ -244,9 +244,9 @@ static int add_queue_mes(struct device_queue_manager *dqm, 
struct queue *q,
	r = adev->mes.funcs->add_hw_queue(&adev->mes, &queue_input);
	amdgpu_mes_unlock(&adev->mes);
if (r) {
-   pr_err("failed to add hardware queue to MES, doorbell=0x%x\n",
+   dev_err(adev->dev, "failed to add hardware queue to MES, 
doorbell=0x%x\n",
q->properties.doorbell_off);
-   pr_err("MES might be in unrecoverable state, issue a GPU 
reset\n");
+   dev_err(adev->dev, "MES might be in unrecoverable state, issue a GPU 
reset\n");
kfd_hws_hang(dqm);
}
  
@@ -272,9 +272,9 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,

	amdgpu_mes_unlock(&adev->mes);
  
  	if (r) {

-   pr_err("failed to remove hardware queue from MES, 
doorbell=0x%x\n",
+   dev_err(adev->dev, "failed to remove hardware queue from MES, 
doorbell=0x%x\n",
q->properties.doorbell_off);
-   pr_err("MES might be in unrecoverable state, issue a GPU 
reset\n");
+   dev_err(adev->dev, "MES might be in unrecoverable state, issue a GPU 
reset\n");
kfd_hws_hang(dqm);
}
  
@@ -284,6 +284,7 @@ static int remove_queue_mes(struct device_queue_manager *dqm, struct queue *q,

  static int remove_all_queues_mes(struct device_queue_manager *dqm)
  {
struct device_process_node *cur;
+   struct device *dev = dqm->dev->adev->dev;
struct qcm_process_device *qpd;
struct queue *q;
int retval = 0;
@@ -294,7 +295,7 @@ static int remove_all_queues_mes(struct 
device_queue_manager *dqm)
if (q->properties.is_active) {
retval = remove_queue_mes(dqm, q, qpd);
if (retval) {
-   pr_err("%s: Failed to remove queue %d for 
dev %d",
+   dev_err(dev, "%s: Failed to remove queue %d 
for dev %d",
__func__,
q->properties.queue_id,
dqm->dev->id);
@@ -443,6 +444,7 @@ static int allocate_vmid(struct device_queue_manager *dqm,
struct qcm_process_device *qpd,
struct queue *q)
  {
+   struct device *dev = dqm->dev->adev->dev;
int allocated_vmid = -1, i;
  
  	for (i = dqm->dev->vm_info.first_vmid_kfd;

@@ -454,7 +456,7 @@ static int allocate_vmid(struct device_queue_manager *dqm,
}
  
  	if (allocated_vmid < 0) {

-   pr_err("no more vmid to allocate\n");
+   dev_err(dev, "no more vmid to allocate\n");
return -ENOSPC;
}
  
@@ -510,10 +512,12 @@ static void deallocate_vmid(struct device_queue_manager *dqm,

struct qcm_process_device *qpd,
struct queue *q)
  {
+   struct device *dev = dqm->dev->adev->dev;
+
/* On GFX v7, CP doesn't flush TC at dequeue */
if (q->device->adev->asic_type == CHIP_HAWAII)
if (flush_texture_cache_nocpsch(q->device, qpd))
-   pr_err("Failed to flush TC\n");
+   dev_err(dev, "Failed to flush TC\n");
  
  	kfd_flush_tlb(qpd_to_pdd(qpd), TLB_FLUSH_LEGACY);
  
@@ -708

Re: [PATCH AUTOSEL 5.15 6/6] drm/amdkfd: ignore crat by default

2023-08-23 Thread Felix Kuehling

On 2023-08-22 11:41, Deucher, Alexander wrote:

[Public]


-Original Message-
From: Sasha Levin 
Sent: Tuesday, August 22, 2023 7:37 AM
To: linux-ker...@vger.kernel.org; sta...@vger.kernel.org
Cc: Deucher, Alexander ; Kuehling, Felix
; Koenig, Christian ;
Mike Lothian ; Sasha Levin ; Pan,
Xinhui ; airl...@gmail.com; dan...@ffwll.ch; amd-
g...@lists.freedesktop.org; dri-de...@lists.freedesktop.org
Subject: [PATCH AUTOSEL 5.15 6/6] drm/amdkfd: ignore crat by default

From: Alex Deucher 

[ Upstream commit a6dea2d64ff92851e68cd4e20a35f6534286e016 ]

We are dropping the IOMMUv2 path, so no need to enable this.
It's often buggy on consumer platforms anyway.

This is not needed for stable.


I agree. I was about to comment in the 5.10 patch as well.

Regards,
  Felix




Alex


Reviewed-by: Felix Kuehling 
Acked-by: Christian König 
Tested-by: Mike Lothian 
Signed-off-by: Alex Deucher 
Signed-off-by: Sasha Levin 
---
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index e574aa32a111d..46dfd9baeb013 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -1523,11 +1523,7 @@ static bool kfd_ignore_crat(void)
   if (ignore_crat)
   return true;

-#ifndef KFD_SUPPORT_IOMMU_V2
   ret = true;
-#else
- ret = false;
-#endif

   return ret;
  }
--
2.40.1


Re: [PATCH] drm/amdgpu: Rework memory limits to allow big allocations

2023-08-22 Thread Felix Kuehling



On 2023-08-22 9:49, Bhardwaj, Rajneesh wrote:


On 8/21/2023 4:32 PM, Felix Kuehling wrote:


On 2023-08-21 15:20, Rajneesh Bhardwaj wrote:

Rework the KFD max system memory and ttm limit to allow bigger
system memory allocations up to 63/64 of the available memory, which is
controlled by ttm module params pages_limit and page_pool_size. Also 
for

NPS1 mode, report the max ttm limit as the available VRAM size. For max
system memory limit, leave 1GB exclusively outside ROCm allocations 
i.e.
on 16GB system, >14 GB can be used by ROCm still leaving some memory 
for

other system applications and on 128GB systems (e.g. GFXIP 9.4.3 APU in
NPS1 mode) nearly >120GB can be used by ROCm.

Signed-off-by: Rajneesh Bhardwaj 
---
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  5 ++--
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 25 
+--

  2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c

index 9e18fe5eb190..3387dcdf1bc9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -44,6 +44,7 @@
   * changes to accumulate
   */
  #define AMDGPU_USERPTR_RESTORE_DELAY_MS 1
+#define ONE_GB    (1UL << 30)
    /*
   * Align VRAM availability to 2MB to avoid fragmentation caused by 
4K allocations in the tail 2MB

@@ -117,11 +118,11 @@ void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
  return;
  si_meminfo(&si);
-    mem = si.freeram - si.freehigh;
+    mem = si.totalram - si.totalhigh;
  mem *= si.mem_unit;
  spin_lock_init(&kfd_mem_limit.mem_limit_lock);
-    kfd_mem_limit.max_system_mem_limit = mem - (mem >> 4);
+    kfd_mem_limit.max_system_mem_limit = mem - (mem >> 6) - (ONE_GB);


I believe this is an OK heuristic for large systems and medium-sized 
systems. But it produces a negative number or an underflow for 
systems with very small system memory (about 1.1GB).  It's not 
practical to run ROCm on such a small system, but the code at least 
needs to be robust here and produce something meaningful. E.g.



Sure, I agree.



kfd_mem_limit.max_system_mem_limit = mem - (mem >> 6);
if (kfd_mem_limit.max_system_mem_limit < 2 * ONE_GB)
    kfd_mem_limit.max_system_mem_limit <<= 1;
else
    kfd_mem_limit.max_system_mem_limit -= ONE_GB;

Since this change affects all GPUs and the change below is specific 
to GFXv9.4.3 APUs, I'd separate this into two patches.



Ok, will split into two changes.




  kfd_mem_limit.max_ttm_mem_limit = ttm_tt_pages_limit() << 
PAGE_SHIFT;

  pr_debug("Kernel memory limit %lluM, TTM limit %lluM\n",
  (kfd_mem_limit.max_system_mem_limit >> 20),
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c

index 8447fcada8bb..4962e35df617 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -25,6 +25,7 @@
  #include 
    #include 
+#include 
    #include "amdgpu.h"
  #include "gmc_v9_0.h"
@@ -1877,6 +1878,7 @@ static void
  gmc_v9_0_init_acpi_mem_ranges(struct amdgpu_device *adev,
    struct amdgpu_mem_partition_info *mem_ranges)
  {
+    uint64_t max_ttm_size = ttm_tt_pages_limit() << PAGE_SHIFT;
  int num_ranges = 0, ret, mem_groups;
  struct amdgpu_numa_info numa_info;
  int node_ids[MAX_MEM_RANGES];
@@ -1913,8 +1915,17 @@ gmc_v9_0_init_acpi_mem_ranges(struct 
amdgpu_device *adev,

    /* If there is only partition, don't use entire size */
  if (adev->gmc.num_mem_partitions == 1) {
-    mem_ranges[0].size = mem_ranges[0].size * (mem_groups - 1);
-    do_div(mem_ranges[0].size, mem_groups);
+    if (max_ttm_size > mem_ranges[0].size || max_ttm_size <= 0) {


This gives some weird dis-continuous behaviour. For max_ttm_size > 
mem_ranges[0].size it gives you 3/4. For max_ttm_size == 
mem_ranges[0].size it gives you all the memory.


Also, why is this only applied for num_mem_partitions == 1? The TTM 
limit also applies when there are more memory partitions. Would it 
make more sense to always evenly divide the ttm_tt_pages_limit 
between all the memory partitions? And cap the size at the NUMA node 
size. I think that would eliminate special cases for different 
memory-partition configs and give you sensible behaviour in all cases.



I think TTM doesn't check what values are being passed to pages_limit 
or page_pool_size so when the user passes an arbitrary number here, I 
wanted to retain the default behavior for NPS1 mode i.e. 3/4th of the 
available NUMA memory should be reported as VRAM. Also for >NPS1 mode, 
the partition size is already proportionately divided i.e in TPX/NPS4 
mode, we have 1/4th NUMA memory visible as VRAM but KFD limits will be 
already bigger than that and we will be capped by VRAM size so this

Re: [PATCH] drm/amdgpu: Use READ_ONCE() when reading the values in 'sdma_v4_4_2_ring_get_rptr'

2023-08-21 Thread Felix Kuehling
Would it make sense to include a link to a better explanation of the 
underlying issue? E.g. https://lwn.net/Articles/624126/?


Regards,
  Felix


On 2023-08-21 07:23, Christian König wrote:

Am 04.08.23 um 07:46 schrieb Srinivasan Shanmugam:

Instead of declaring pointers, use READ_ONCE() when accessing those
values to make sure that the compiler doesn't violate any cache
coherence.


That commit message is a bit confusing and not 100% technically correct.

The compiler is not causing any cache coherency issues, but 
potentially re-ordering things or reading the value multiple times.


Just write something like "Use READ_ONCE() instead of declaring the 
pointer volatile.". The background explanation would exceed the 
information suitable for a commit message anyway.


Apart from that looks good to me,
Christian.
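For illustration, a userspace approximation of what READ_ONCE() gives you 
(the kernel's real macro in rwonce.h is more elaborate): the volatile cast 
forces exactly one load per invocation, so the compiler cannot tear, cache, 
or re-issue the read between the debug print and the return. The wb array 
here is a hypothetical stand-in for the ring's writeback slot.

```c
#include <stdint.h>

/* Simplified userspace model of the kernel's READ_ONCE(): the volatile
 * cast tells the compiler to perform exactly one load, and not to tear
 * or re-issue it. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static uint64_t wb[4];   /* stand-in for the ring's writeback area */

static uint64_t ring_get_rptr(unsigned int rptr_offs)
{
	/* One load into a plain local; every later use sees the same value. */
	uint64_t rptr = READ_ONCE(wb[rptr_offs]);

	return rptr >> 2;    /* rptr is stored in quantas of 4 */
}
```

This mirrors the shape of the patch: a single snapshot into a local variable 
instead of a volatile pointer that is dereferenced multiple times.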



Cc: Guchun Chen 
Cc: Christian König 
Cc: Alex Deucher 
Cc: "Pan, Xinhui" 
Cc: Le Ma 
Cc: Hawking Zhang 
Signed-off-by: Srinivasan Shanmugam 
---
  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c

index f413898dda37..267c1b7b8dcd 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -154,13 +154,13 @@ static int sdma_v4_4_2_init_microcode(struct 
amdgpu_device *adev)

   */
  static uint64_t sdma_v4_4_2_ring_get_rptr(struct amdgpu_ring *ring)
  {
-    u64 *rptr;
+    u64 rptr;
    /* XXX check if swapping is necessary on BE */
-    rptr = ((u64 *)&ring->adev->wb.wb[ring->rptr_offs]);
+    rptr = READ_ONCE(*((u64 *)&ring->adev->wb.wb[ring->rptr_offs]));
  -    DRM_DEBUG("rptr before shift == 0x%016llx\n", *rptr);
-    return ((*rptr) >> 2);
+    DRM_DEBUG("rptr before shift == 0x%016llx\n", rptr);
+    return rptr >> 2;
  }
    /**




Re: [PATCH] drm/amdkfd: Share the original BO for GTT mapping

2023-08-21 Thread Felix Kuehling



On 2023-08-21 15:29, Philip Yang wrote:

If mGPUs are in the same IOMMU group, or RAM is direct-mapped, then mGPUs
can share the original BO for the GTT mapping DMA address, without creating
a new BO from export/import dmabuf.

Signed-off-by: Philip Yang 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 282879c3441a..b5b940485059 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -864,9 +864,10 @@ static int kfd_mem_attach(struct amdgpu_device *adev, 
struct kgd_mem *mem,
  
  		if ((adev == bo_adev && !(mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) ||

(amdgpu_ttm_tt_get_usermm(mem->bo->tbo.ttm) && 
reuse_dmamap(adev, bo_adev)) ||
-   same_hive) {
+   (mem->domain == AMDGPU_GEM_DOMAIN_GTT && reuse_dmamap(adev, 
bo_adev)) ||
+   same_hive) {
/* Mappings on the local GPU, or VRAM mappings in the
-* local hive, or userptr mapping can reuse dma map
+* local hive, or userptr, or GTT mapping can reuse dma 
map
 * address space share the original BO
 */
attachment[i]->type = KFD_MEM_ATT_SHARED;


Re: [PATCH] drm/amdgpu: Rework memory limits to allow big allocations

2023-08-21 Thread Felix Kuehling



On 2023-08-21 15:20, Rajneesh Bhardwaj wrote:

Rework the KFD max system memory and ttm limit to allow bigger
system memory allocations up to 63/64 of the available memory, which is
controlled by ttm module params pages_limit and page_pool_size. Also for
NPS1 mode, report the max ttm limit as the available VRAM size. For max
system memory limit, leave 1GB exclusively outside ROCm allocations i.e.
on 16GB system, >14 GB can be used by ROCm still leaving some memory for
other system applications and on 128GB systems (e.g. GFXIP 9.4.3 APU in
NPS1 mode) nearly >120GB can be used by ROCm.

Signed-off-by: Rajneesh Bhardwaj 
---
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  5 ++--
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 25 +--
  2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 9e18fe5eb190..3387dcdf1bc9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -44,6 +44,7 @@
   * changes to accumulate
   */
  #define AMDGPU_USERPTR_RESTORE_DELAY_MS 1
+#define ONE_GB (1UL << 30)
  
  /*

   * Align VRAM availability to 2MB to avoid fragmentation caused by 4K 
allocations in the tail 2MB
@@ -117,11 +118,11 @@ void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
return;
  
	si_meminfo(&si);

-   mem = si.freeram - si.freehigh;
+   mem = si.totalram - si.totalhigh;
mem *= si.mem_unit;
  
	spin_lock_init(&kfd_mem_limit.mem_limit_lock);

-   kfd_mem_limit.max_system_mem_limit = mem - (mem >> 4);
+   kfd_mem_limit.max_system_mem_limit = mem - (mem >> 6) - (ONE_GB);


I believe this is an OK heuristic for large systems and medium-sized 
systems. But it produces a negative number or an underflow for systems 
with very small system memory (about 1.1GB).  It's not practical to run 
ROCm on such a small system, but the code at least needs to be robust 
here and produce something meaningful. E.g.


kfd_mem_limit.max_system_mem_limit = mem - (mem >> 6);
if (kfd_mem_limit.max_system_mem_limit < 2 * ONE_GB)
kfd_mem_limit.max_system_mem_limit <<= 1;
else
kfd_mem_limit.max_system_mem_limit -= ONE_GB;

Since this change affects all GPUs and the change below is specific to 
GFXv9.4.3 APUs, I'd separate this into two patches.
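The underflow guard suggested above can be checked numerically; a small 
standalone sketch of that heuristic (constants and function name chosen for 
illustration, not the driver's final code):

```c
#include <stdint.h>

#define ONE_GB (1ULL << 30)

/* Suggested heuristic: keep 1/64 plus 1 GB out of ROCm's reach, but fall
 * back to a doubling rule on tiny systems so the limit never underflows. */
static uint64_t max_system_mem_limit(uint64_t mem)
{
	uint64_t limit = mem - (mem >> 6);

	if (limit < 2 * ONE_GB)
		limit <<= 1;
	else
		limit -= ONE_GB;
	return limit;
}
```

On a 128 GB system this yields 125 GB (128 - 2 - 1), matching the ">120GB 
usable" goal in the commit message, while a ~1 GB system takes the doubling 
branch and never wraps below zero.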




kfd_mem_limit.max_ttm_mem_limit = ttm_tt_pages_limit() << PAGE_SHIFT;
pr_debug("Kernel memory limit %lluM, TTM limit %lluM\n",
(kfd_mem_limit.max_system_mem_limit >> 20),
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 8447fcada8bb..4962e35df617 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -25,6 +25,7 @@
  #include 
  
  #include 

+#include 
  
  #include "amdgpu.h"

  #include "gmc_v9_0.h"
@@ -1877,6 +1878,7 @@ static void
  gmc_v9_0_init_acpi_mem_ranges(struct amdgpu_device *adev,
  struct amdgpu_mem_partition_info *mem_ranges)
  {
+   uint64_t max_ttm_size = ttm_tt_pages_limit() << PAGE_SHIFT;
int num_ranges = 0, ret, mem_groups;
struct amdgpu_numa_info numa_info;
int node_ids[MAX_MEM_RANGES];
@@ -1913,8 +1915,17 @@ gmc_v9_0_init_acpi_mem_ranges(struct amdgpu_device *adev,
  
  	/* If there is only partition, don't use entire size */

if (adev->gmc.num_mem_partitions == 1) {
-   mem_ranges[0].size = mem_ranges[0].size * (mem_groups - 1);
-   do_div(mem_ranges[0].size, mem_groups);
+   if (max_ttm_size > mem_ranges[0].size || max_ttm_size <= 0) {


This gives some weird dis-continuous behaviour. For max_ttm_size > 
mem_ranges[0].size it gives you 3/4. For max_ttm_size == 
mem_ranges[0].size it gives you all the memory.


Also, why is this only applied for num_mem_partitions == 1? The TTM 
limit also applies when there are more memory partitions. Would it make 
more sense to always evenly divide the ttm_tt_pages_limit between all 
the memory partitions? And cap the size at the NUMA node size. I think 
that would eliminate special cases for different memory-partition 
configs and give you sensible behaviour in all cases.
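A sketch of that alternative (a hypothetical helper, not driver code): divide 
the TTM page limit evenly across the memory partitions and cap each share at 
its backing NUMA node size, so the same rule covers NPS1 and NPS4 alike.

```c
#include <stdint.h>

/* Hypothetical helper: give each memory partition an equal share of the
 * TTM page limit, capped at the size of its backing NUMA node. */
static uint64_t partition_vram_size(uint64_t ttm_limit_bytes,
				    unsigned int num_partitions,
				    uint64_t numa_node_bytes)
{
	uint64_t share = ttm_limit_bytes / num_partitions;

	return share < numa_node_bytes ? share : numa_node_bytes;
}
```

With this shape there is no discontinuity: raising the ttm limit grows every 
partition's reported VRAM smoothly until the NUMA cap is hit.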


Regards,
  Felix



+   /* Report VRAM as 3/4th of available numa memory */
+   mem_ranges[0].size = mem_ranges[0].size * (mem_groups - 
1);
+   do_div(mem_ranges[0].size, mem_groups);
+   } else {
+   /* Report VRAM as set by ttm.pages_limit or default ttm
+* limit which is 1/2 of system memory
+*/
+   mem_ranges[0].size = max_ttm_size;
+   }
+   pr_debug("NPS1 mode, setting VRAM size = %llu\n", 
mem_ranges[0].size);
}
  }
  
@@ -2159,6 +2170,11 @@ static 

Re: [PATCH] drm/amdkfd: use correct method to get clock under SRIOV

2023-08-17 Thread Felix Kuehling

On 2023-08-17 07:08, Horace Chen wrote:

[What]
Current SR-IOV is still using adev->clock.default_XX, which is read from
atomfirmware. But these fields were abandoned in atomfirmware long ago,
which may cause the function to return a 0 value.

[How]
We don't need to check whether this is SR-IOV. For SR-IOV one-VF mode,
pm is enabled and the VF is able to read the dpm clock
from the PMFW, so we can use the dpm clock interface directly. For
multi-VF mode, VF pm is disabled, so the driver can just react as if pm
were disabled. One-VF mode was introduced with GFX9, so this shall not
have any backward compatibility issue.

Signed-off-by: Horace Chen 


Acked-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 8 ++--
  1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index df633e9ce920..cdf6087706aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -442,9 +442,7 @@ void amdgpu_amdkfd_get_local_mem_info(struct amdgpu_device 
*adev,
mem_info->local_mem_size_public,
mem_info->local_mem_size_private);
  
-	if (amdgpu_sriov_vf(adev))

-   mem_info->mem_clk_max = adev->clock.default_mclk / 100;
-   else if (adev->pm.dpm_enabled) {
+   if (adev->pm.dpm_enabled) {
if (amdgpu_emu_mode == 1)
mem_info->mem_clk_max = 0;
else
@@ -463,9 +461,7 @@ uint64_t amdgpu_amdkfd_get_gpu_clock_counter(struct 
amdgpu_device *adev)
  uint32_t amdgpu_amdkfd_get_max_engine_clock_in_mhz(struct amdgpu_device *adev)
  {
/* the sclk is in quantas of 10kHz */
-   if (amdgpu_sriov_vf(adev))
-   return adev->clock.default_sclk / 100;
-   else if (adev->pm.dpm_enabled)
+   if (adev->pm.dpm_enabled)
return amdgpu_dpm_get_sclk(adev, false) / 100;
else
return 100;


Re: [PATCH] drm/amdkfd: retry after EBUSY is returned from hmm_ranges_get_pages

2023-08-17 Thread Felix Kuehling

On 2023-08-16 14:44, Alex Sierra wrote:

If hmm_range_get_pages returns an EBUSY error during
svm_range_validate_and_map, within the context of a page fault
interrupt, it should retry through the svm_range_restore_pages
callback. Therefore we treat this as an EAGAIN error instead, and defer
it to the restore-pages fallback.

Signed-off-by: Alex Sierra 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 93609ea42163..3ebd5d99f39e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1685,6 +1685,8 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
WRITE_ONCE(p->svms.faulting_task, NULL);
if (r) {
pr_debug("failed %d to get svm range pages\n", r);
+   if (r == -EBUSY)
+   r = -EAGAIN;
goto unreserve_out;
}
  


Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
If you have a complete kernel log, it may be worth looking at backtraces 
from other threads, to better understand the interactions. I'd expect 
that there is a thread there that's in an RCU read critical section. It 
may not be in our driver, though. If it's a customer system, it may also 
help to see the kernel config. Maybe the kernel was configured without 
preemption:


-   For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the kernel
without invoking schedule().  If the looping in the kernel is
really expected and desirable behavior, you might need to add
some calls to cond_resched().

But then I would expect cond_resched() to fix the problem, according to 
this document.
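The cond_resched() pattern from the RCU stall documentation can be sketched 
in userspace (with a stub cond_resched() standing in for the kernel call, 
and the batch size chosen for illustration): yield periodically inside a 
long loop rather than unconditionally after every element.

```c
static int resched_calls;

/* Userspace stub for the kernel's cond_resched(); in a !CONFIG_PREEMPTION
 * kernel this is where a long-running loop lets other tasks (and RCU
 * grace periods) make progress. */
static void cond_resched(void)
{
	resched_calls++;
}

/* Process a long range list, yielding every `batch` iterations instead of
 * on every element, so the yield overhead stays bounded. */
static void process_ranges(int nranges, int batch)
{
	for (int i = 0; i < nranges; i++) {
		/* ... per-range work (migration, map updates) ... */
		if ((i + 1) % batch == 0)
			cond_resched();
	}
}
```

Note this only helps if the stalled reader is in this loop; it does not drop 
the mmap read lock, which is the separate concern raised above.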


Regards,
  Felix


On 2023-08-11 17:27, Chen, Xiaogang wrote:


On 8/11/2023 4:22 PM, Felix Kuehling wrote:

On 2023-08-11 17:12, Chen, Xiaogang wrote:


I know the original jira ticket. The system got RCU cpu stall, then 
kernel enter panic, then no response or ssh. This patch let prange 
list update task yield cpu after each range update. It can prevent 
task holding mm lock too long.


Calling schedule does not drop the lock. If anything, it causes the 
lock to be held longer, because the function takes longer to complete.


Regards,
  Felix

Right. I also do not see how this patch targets the root cause. It is 
on a customer system that can have many RCU operations (not necessarily 
from our code). Any read critical section can cause a write stall.


I think we can try some RCU parameters first to see if things change: 
like CONFIG_RCU_CPU_STALL_TIMEOUT to increase the grace period, or 
rcupdate.rcu_cpu_stall_suppress to suppress the RCU stall.


Regards

Xiaogang

The mm lock is a rw_semaphore, not an RCU mechanism. Can you explain how 
that can prevent the RCU CPU stall in this case?


Regards

Xiaogang

On 8/11/2023 2:11 PM, James Zhu wrote:
Caution: This message originated from an External Source. Use 
proper caution when opening attachments, clicking links, or 
responding.



update_list could be big in list_for_each_entry(prange, &update_list, 
update_list);
mmap_read_lock(mm) is held the whole time, so adding schedule() can remove
the RCU stall on the CPU for this case.

RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu]
Code: 00 00 00 bf 00 02 00 00 48 81 c2 90 00 00 00 e8 1f 6a b9 e0 
65 48 8b 14 25 00 bd 01 00 8b 42 2c 48 8b 3c 24 80 e4 f7 0b 43 d8 
<89> 42 2c e8 51 dd 2d e1 48 8b 7b 38 e8 98 29 b7 e0 48 83 c4 30 b8

RSP: 0018:c9000ffd7b10 EFLAGS: 0206
RAX: 0100 RBX: 88c493968d80 RCX: 88d1a6469b18
RDX: 88e18ef1ec80 RSI: c9000ffd7be0 RDI: 88c493968d38
RBP: 0003062e R08: 3042f000 R09: 3062efff
R10: 1000 R11: 88c1ad255000 R12: 0003042f
R13: 88c493968c00 R14: c9000ffd7be0 R15: 88c493968c00
__mmu_notifier_invalidate_range_start+0x132/0x1d0
? amdgpu_vm_bo_update+0x3fd/0x520 [amdgpu]
migrate_vma_setup+0x6c7/0x8f0
? kfd_smi_event_migration_start+0x5f/0x80 [amdgpu]
svm_migrate_ram_to_vram+0x14e/0x580 [amdgpu]
svm_range_set_attr+0xe34/0x11a0 [amdgpu]
kfd_ioctl+0x271/0x4e0 [amdgpu]
? kfd_ioctl_set_xnack_mode+0xd0/0xd0 [amdgpu]
__x64_sys_ioctl+0x92/0xd0

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

index 113fd11aa96e..9f2d48ade7fa 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3573,6 +3573,7 @@ svm_range_set_attr(struct kfd_process *p, 
struct mm_struct *mm,
 r = svm_range_trigger_migration(mm, prange, 
);

 if (r)
 goto out_unlock_range;
+   schedule();

 if (migrated && (!p->xnack_enabled ||
 (prange->flags & 
KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED)) &&

--
2.34.1



Re: [PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Felix Kuehling

On 2023-08-11 17:06, James Zhu wrote:

Return 0 when drm device allocation fails with -ENOSPC in
order to allow amdgpu driver loading. But an xcp without
a drm device node assigned won't be visible in user space.
This helps amdgpu driver loading on systems which have more
than 64 nodes, the current limitation.

The proposal to add more drm nodes is discussed in public,
which will support up to 2^20 nodes totally.
kernel drm:
https://lore.kernel.org/lkml/20230724211428.3831636-1-michal.winiar...@intel.com/T/
libdrm:
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/305

Signed-off-by: James Zhu 
Acked-by: Christian König 


Reviewed-by: Felix Kuehling 




-v2: added warning message
-v3: use dev_warn
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c   | 13 -
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 10 +-
  2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index 9c9cca129498..565a1fa436d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -239,8 +239,13 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
  
  	for (i = 1; i < MAX_XCP; i++) {

ret = amdgpu_xcp_drm_dev_alloc(_ddev);
-   if (ret)
+   if (ret == -ENOSPC) {
+   dev_warn(adev->dev,
+   "Skip xcp node #%d when out of drm node resource.", i);
+   return 0;
+   } else if (ret) {
return ret;
+   }
  
  		/* Redirect all IOCTLs to the primary device */

adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
@@ -328,6 +333,9 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
return 0;
  
  	for (i = 1; i < MAX_XCP; i++) {

+   if (!adev->xcp_mgr->xcp[i].ddev)
+   break;
+
ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, 
ent->driver_data);
if (ret)
return ret;
@@ -345,6 +353,9 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
return;
  
  	for (i = 1; i < MAX_XCP; i++) {

+   if (!adev->xcp_mgr->xcp[i].ddev)
+   break;
+
p_ddev = adev->xcp_mgr->xcp[i].ddev;
drm_dev_unplug(p_ddev);
p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 3b0749390388..310df98ba46a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1969,8 +1969,16 @@ int kfd_topology_add_device(struct kfd_node *gpu)
int i;
const char *asic_name = amdgpu_asic_name[gpu->adev->asic_type];
  
+

gpu_id = kfd_generate_gpu_id(gpu);
-   pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
+   if (!gpu->xcp->ddev) {
+   dev_warn(gpu->adev->dev,
+   "Won't add GPU (ID: 0x%x) to topology since it has no drm node 
assigned.",
+   gpu_id);
+   return 0;
+   } else {
+   pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
+   }
  
  	/* Check to see if this gpu device exists in the topology_device_list.

 * If so, assign the gpu to that device,


Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling

On 2023-08-11 17:12, Chen, Xiaogang wrote:


I know the original Jira ticket. The system got an RCU CPU stall, then 
the kernel entered panic, and then there was no response or ssh. This 
patch lets the prange list update task yield the CPU after each range 
update. It can prevent the task from holding the mm lock too long.


Calling schedule does not drop the lock. If anything, it causes the lock 
to be held longer, because the function takes longer to complete.


Regards,
  Felix


The mm lock is an rw_semaphore, not an RCU mechanism. Can you explain how that 
can prevent an RCU CPU stall in this case?


Regards

Xiaogang

On 8/11/2023 2:11 PM, James Zhu wrote:
Caution: This message originated from an External Source. Use proper 
caution when opening attachments, clicking links, or responding.



update_list could be big in list_for_each_entry(prange, &update_list, 
update_list),
mmap_read_lock(mm) is held all the time, adding schedule() can remove

RCU stall on CPU for this case.

RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu]
Code: 00 00 00 bf 00 02 00 00 48 81 c2 90 00 00 00 e8 1f 6a b9 e0 65 
48 8b 14 25 00 bd 01 00 8b 42 2c 48 8b 3c 24 80 e4 f7 0b 43 d8 <89> 
42 2c e8 51 dd 2d e1 48 8b 7b 38 e8 98 29 b7 e0 48 83 c4 30 b8

RSP: 0018:c9000ffd7b10 EFLAGS: 0206
RAX: 0100 RBX: 88c493968d80 RCX: 88d1a6469b18
RDX: 88e18ef1ec80 RSI: c9000ffd7be0 RDI: 88c493968d38
RBP: 0003062e R08: 3042f000 R09: 3062efff
R10: 1000 R11: 88c1ad255000 R12: 0003042f
R13: 88c493968c00 R14: c9000ffd7be0 R15: 88c493968c00
__mmu_notifier_invalidate_range_start+0x132/0x1d0
? amdgpu_vm_bo_update+0x3fd/0x520 [amdgpu]
migrate_vma_setup+0x6c7/0x8f0
? kfd_smi_event_migration_start+0x5f/0x80 [amdgpu]
svm_migrate_ram_to_vram+0x14e/0x580 [amdgpu]
svm_range_set_attr+0xe34/0x11a0 [amdgpu]
kfd_ioctl+0x271/0x4e0 [amdgpu]
? kfd_ioctl_set_xnack_mode+0xd0/0xd0 [amdgpu]
__x64_sys_ioctl+0x92/0xd0

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

index 113fd11aa96e..9f2d48ade7fa 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3573,6 +3573,7 @@ svm_range_set_attr(struct kfd_process *p, 
struct mm_struct *mm,

 r = svm_range_trigger_migration(mm, prange, );
 if (r)
 goto out_unlock_range;
+   schedule();

 if (migrated && (!p->xnack_enabled ||
 (prange->flags & 
KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED)) &&

--
2.34.1



Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
I don't understand why this loop is causing a stall. These stall 
warnings indicate that there is an RCU grace period that's not making 
progress. That means there must be an RCU read critical section that's 
being blocked. But there is no RCU-read critical section in 
svm_range_set_attr function. You mentioned the mmap-read-lock. But why 
is that causing an issue? Does it trigger any of the conditions listed 
in kernel/Documentation/RCU/stallwarn.rst?


-   A CPU looping in an RCU read-side critical section.
-   A CPU looping with interrupts disabled.
-   A CPU looping with preemption disabled.
-   A CPU looping with bottom halves disabled.

Or is there another thread that has an mmap_write_lock inside an RCU 
read critical section that's getting stalled by the mmap_read_lock?


Regards,
  Felix


On 2023-08-11 16:50, James Zhu wrote:


On 2023-08-11 16:06, Felix Kuehling wrote:


On 2023-08-11 15:11, James Zhu wrote:
update_list could be big in list_for_each_entry(prange, &update_list, 
update_list),
mmap_read_lock(mm) is held all the time, adding schedule() can remove

RCU stall on CPU for this case.

RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu]


You're just showing the backtrace here, but not what the problem is. 
Can you include more context, e.g. the message that says something 
about a stall?


[JZ] I attached more log here, and will update the patch later.

2023-07-20T14:15:39-04:00 frontier06693 kernel: rcu: INFO: rcu_sched 
self-detected stall on CPU
2023-07-20T14:15:39-04:00 frontier06693 kernel: rcu: #01134-: 
(59947 ticks this GP) idle=7f6/1/0x4000 softirq=1735/1735 
fqs=29977
2023-07-20T14:15:39-04:00 frontier06693 kernel: #011(t=60006 jiffies 
g=3265905 q=15150)
2023-07-20T14:15:39-04:00 frontier06693 kernel: rcu: CPU 34: RCU dump 
cpu stacks:

2023-07-20T14:15:39-04:00 frontier06693 kernel: NMI backtrace for cpu 34
2023-07-20T14:15:39-04:00 frontier06693 kernel: CPU: 34 PID: 72044 
Comm: ncsd-it-hip.exe Kdump: loaded Tainted: G   OE 
5.14.21-150400.24.46_12.0.83-cray_shasta_c #1 SLE15-SP4 (unreleased)
2023-07-20T14:15:39-04:00 frontier06693 kernel: Hardware name: HPE 
HPE_CRAY_EX235A/HPE CRAY EX235A, BIOS 1.6.2 03-22-2023

2023-07-20T14:15:39-04:00 frontier06693 kernel: Call Trace:
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
2023-07-20T14:15:39-04:00 frontier06693 kernel: dump_stack_lvl+0x44/0x5b
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
nmi_cpu_backtrace+0xdd/0xe0
2023-07-20T14:15:39-04:00 frontier06693 kernel: ? 
lapic_can_unplug_cpu+0xa0/0xa0
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
nmi_trigger_cpumask_backtrace+0xfd/0x130
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
rcu_dump_cpu_stacks+0x13b/0x180
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
rcu_sched_clock_irq+0x6cb/0x930
2023-07-20T14:15:39-04:00 frontier06693 kernel: ? 
trigger_load_balance+0x158/0x390
2023-07-20T14:15:39-04:00 frontier06693 kernel: ? 
scheduler_tick+0xe1/0x290
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
update_process_times+0x8c/0xb0
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
tick_sched_handle.isra.21+0x1d/0x60
2023-07-20T14:15:39-04:00 frontier06693 kernel: ? 
tick_sched_handle.isra.21+0x60/0x60
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
tick_sched_timer+0x67/0x80
2023-07-20T14:15:39-04:00 frontier06693 kernel: ? 
tick_sched_handle.isra.21+0x60/0x60
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
__hrtimer_run_queues+0xa0/0x2b0
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
hrtimer_interrupt+0xe5/0x250
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
__sysvec_apic_timer_interrupt+0x62/0x100
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
sysvec_apic_timer_interrupt+0x4b/0x90

2023-07-20T14:15:39-04:00 frontier06693 kernel: 
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
2023-07-20T14:15:39-04:00 frontier06693 kernel: 
asm_sysvec_apic_timer_interrupt+0x12/0x20
2023-07-20T14:15:39-04:00 frontier06693 kernel: RIP: 
0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu]
2023-07-20T14:15:39-04:00 frontier06693 kernel: Code: 00 00 00 bf 00 
02 00 00 48 81 c2 90 00 00 00 e8 1f 6a b9 e0 65 48 8b 14 25 00 bd 01 
00 8b 42 2c 48 8b 3c 24 80 e4 f7 0b 43 d8 <89> 42 2c e8 51 dd 2d e1 48 
8b 7b 38 e8 98 29 b7 e0 48 83 c4 30 b8
2023-07-20T14:15:39-04:00 frontier06693 kernel: RSP: 
0018:c9000ffd7b10 EFLAGS: 0206
2023-07-20T14:15:39-04:00 frontier06693 kernel: RAX: 0100 
RBX: 88c493968d80 RCX: 88d1a6469b18
2023-07-20T14:15:39-04:00 frontier06693 kernel: RDX: 88e18ef1ec80 
RSI: c9000ffd7be0 RDI: 88c493968d38
2023-07-20T14:15:39-04:00 frontier06693 kernel: RBP: 0003062e 
R08: 3042f000 R09: 3062efff
2023-07-20T14:15:39-04:00 frontier06693 kernel: R10: 1000 
R11: 88c1ad255000 R12: 0003042f
2023-07-20T14:15:39-04:00 frontier06693 kernel: R13: 88c493968c00 
R14: c9000ffd7be0 R15: 88c493968c00
2023-07-20T14:15:39

Re: [PATCH v2] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Felix Kuehling

On 2023-08-11 16:23, James Zhu wrote:

Return 0 when drm device allocation fails with -ENOSPC in
order to allow amdgpu driver loading. But an xcp without
a drm device node assigned won't be visible in user space.
This helps amdgpu driver loading on systems which have more
than 64 nodes, the current limitation.

The proposal to add more drm nodes is discussed in public,
which will support up to 2^20 nodes totally.
kernel drm:
https://lore.kernel.org/lkml/20230724211428.3831636-1-michal.winiar...@intel.com/T/
libdrm:
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/305

Signed-off-by: James Zhu 
Acked-by: Christian König 

-v2: added warning message
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c   | 13 -
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 10 +-
  2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index 9c9cca129498..f0754d70da5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -239,8 +239,13 @@ static int amdgpu_xcp_dev_alloc(struct amdgpu_device *adev)
  
  	for (i = 1; i < MAX_XCP; i++) {

ret = amdgpu_xcp_drm_dev_alloc(_ddev);
-   if (ret)
+   if (ret == -ENOSPC) {
+   dev_WARN(adev->dev,
+   "Skip xcp node #%d when out of drm node resource.", i);


This prints a noisy backtrace. Maybe that's a bit too much. I'd just use 
dev_warn, so it only prints your message without a backtrace.




+   return 0;
+   } else if (ret) {
return ret;
+   }
  
  		/* Redirect all IOCTLs to the primary device */

adev->xcp_mgr->xcp[i].rdev = p_ddev->render->dev;
@@ -328,6 +333,9 @@ int amdgpu_xcp_dev_register(struct amdgpu_device *adev,
return 0;
  
  	for (i = 1; i < MAX_XCP; i++) {

+   if (!adev->xcp_mgr->xcp[i].ddev)
+   break;
+
ret = drm_dev_register(adev->xcp_mgr->xcp[i].ddev, 
ent->driver_data);
if (ret)
return ret;
@@ -345,6 +353,9 @@ void amdgpu_xcp_dev_unplug(struct amdgpu_device *adev)
return;
  
  	for (i = 1; i < MAX_XCP; i++) {

+   if (!adev->xcp_mgr->xcp[i].ddev)
+   break;
+
p_ddev = adev->xcp_mgr->xcp[i].ddev;
drm_dev_unplug(p_ddev);
p_ddev->render->dev = adev->xcp_mgr->xcp[i].rdev;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 3b0749390388..0f844151caaf 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1969,8 +1969,16 @@ int kfd_topology_add_device(struct kfd_node *gpu)
int i;
const char *asic_name = amdgpu_asic_name[gpu->adev->asic_type];
  
+

gpu_id = kfd_generate_gpu_id(gpu);
-   pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
+   if (!gpu->xcp->ddev) {
+   dev_WARN(gpu->adev->dev,
+   "Won't add GPU (ID: 0x%x) to topology since it has no drm node 
assigned.",
+   gpu_id);


Same as above.

Regards,
  Felix



+   return 0;
+   } else {
+   pr_debug("Adding new GPU (ID: 0x%x) to topology\n", gpu_id);
+   }
  
  	/* Check to see if this gpu device exists in the topology_device_list.

 * If so, assign the gpu to that device,


Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling



On 2023-08-11 15:11, James Zhu wrote:

update_list could be big in list_for_each_entry(prange, &update_list, 
update_list),
mmap_read_lock(mm) is held all the time, adding schedule() can remove
RCU stall on CPU for this case.

RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu]


You're just showing the backtrace here, but not what the problem is. Can 
you include more context, e.g. the message that says something about a 
stall?




Code: 00 00 00 bf 00 02 00 00 48 81 c2 90 00 00 00 e8 1f 6a b9 e0 65 48 8b 14 25 00 
bd 01 00 8b 42 2c 48 8b 3c 24 80 e4 f7 0b 43 d8 <89> 42 2c e8 51 dd 2d e1 48 8b 
7b 38 e8 98 29 b7 e0 48 83 c4 30 b8
RSP: 0018:c9000ffd7b10 EFLAGS: 0206
RAX: 0100 RBX: 88c493968d80 RCX: 88d1a6469b18
RDX: 88e18ef1ec80 RSI: c9000ffd7be0 RDI: 88c493968d38
RBP: 0003062e R08: 3042f000 R09: 3062efff
R10: 1000 R11: 88c1ad255000 R12: 0003042f
R13: 88c493968c00 R14: c9000ffd7be0 R15: 88c493968c00
__mmu_notifier_invalidate_range_start+0x132/0x1d0
? amdgpu_vm_bo_update+0x3fd/0x520 [amdgpu]
migrate_vma_setup+0x6c7/0x8f0
? kfd_smi_event_migration_start+0x5f/0x80 [amdgpu]
svm_migrate_ram_to_vram+0x14e/0x580 [amdgpu]
svm_range_set_attr+0xe34/0x11a0 [amdgpu]
kfd_ioctl+0x271/0x4e0 [amdgpu]
? kfd_ioctl_set_xnack_mode+0xd0/0xd0 [amdgpu]
__x64_sys_ioctl+0x92/0xd0

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 113fd11aa96e..9f2d48ade7fa 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3573,6 +3573,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
r = svm_range_trigger_migration(mm, prange, );
if (r)
goto out_unlock_range;
+   schedule();


I'm not sure that unconditionally scheduling here in every loop 
iteration is a good solution. This could lead to performance degradation 
when there are many small ranges. I think a better option is to call 
cond_resched. That would reschedule only "if necessary", though I 
haven't quite figured out the criteria for rescheduling being necessary.


Regards,
  Felix


  
  		if (migrated && (!p->xnack_enabled ||

(prange->flags & KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED)) &&


Re: [PATCH] drm/amdgpu: don't allow userspace to create a doorbell BO

2023-08-11 Thread Felix Kuehling

Am 2023-08-09 um 15:09 schrieb Alex Deucher:

We need the domains in amdgpu_drm.h for the kernel driver to manage
the pool, but we don't want userspace using it until the code
is ready.  So reject for now.

Signed-off-by: Alex Deucher 


Acked-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 693b1fd1191a..ca4d2d430e28 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -289,6 +289,10 @@ int amdgpu_gem_create_ioctl(struct drm_device *dev, void 
*data,
uint32_t handle, initial_domain;
int r;
  
+	/* reject DOORBELLs until userspace code to use it is available */

+   if (args->in.domains & AMDGPU_GEM_DOMAIN_DOORBELL)
+   return -EINVAL;
+
/* reject invalid gem flags */
if (flags & ~(AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED |
  AMDGPU_GEM_CREATE_NO_CPU_ACCESS |


Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

2023-08-11 Thread Felix Kuehling

Am 2023-08-10 um 18:27 schrieb Eric Huang:
There is no UNMAP_QUEUES command sent for queue preemption because 
the queue is suspended and the test is close to the end. Function 
unmap_queue_cpsch will do nothing after that.


How do you suspend queues without sending an UNMAP_QUEUES command?

Regards,
  Felix




The workaround is new and only for gfx v9.4.2, because the debugger tests 
have changed to check whether all address watch points are correctly set, 
i.e. test A sets more than one watchpoint and leaves; the following 
test B only sets one watchpoint, but test A's settings will cause more 
than one watchpoint event, so test B checks and reports an error on a 
second or third watchpoint not set by itself.


Regards,
Eric

On 2023-08-10 17:56, Felix Kuehling wrote:
I think Jon is suggesting that the UNMAP_QUEUES command should clear 
the address watch registers. Requesting such a change from the 
HWS team may take a long time.


That said, when was this workaround implemented and reviewed? Did I 
review it as part of Jon's debugger upstreaming patch series? Or did 
this come later? This patch only enables the workaround for v9.4.2.


Regards,
  Felix


On 2023-08-10 17:52, Eric Huang wrote:
The problem is that the queue is suspended before the clear address watch 
call in KFD; there is no queue preemption and queue resume after the 
clearing call, and the test ends. So there is no chance to send 
MAP_PROCESS to the HWS. At this point the FW has nothing to do. We have 
several test FWs from Tej, and none of them works, so I reviewed the 
kernel debug log and found the problem.


GFX11 has a different scheduler: when clearing an address watch, KFD 
directly sends MES_MISC_OP_SET_SHADER_DEBUGGER to the MES; it 
doesn't consider whether the queue is suspended. So GFX11 doesn't have 
this issue.


Regards,
Eric

On 2023-08-10 17:27, Kim, Jonathan wrote:

[AMD Official Use Only - General]

This is a strange solution because the MEC should set watch 
controls as non-valid automatically on queue preemption to avoid 
this kind of issue in the first place by design. MAP_PROCESS on 
resume will take whatever the driver requests.

GFX11 has no issue with letting the HWS do this.

Are we sure we're not working around some HWS bug?

Thanks,

Jon


-Original Message-
From: Kuehling, Felix 
Sent: Thursday, August 10, 2023 5:03 PM
To: Huang, JinHuiEric ; amd-
g...@lists.freedesktop.org
Cc: Kim, Jonathan 
Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug 
for gfx v9.4.2


I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
different because it needs to support multiple XCCs.

That said, this patch is

Reviewed-by: Felix Kuehling 


On 2023-08-10 16:47, Eric Huang wrote:

KFD currently relies on the MEC FW to clear the tcp watch control
register by sending a MAP_PROCESS packet with 0 in the field
tcp_watch_cntl to the HWS, but if the queue is suspended, the
packet will not be sent and the previous value will be
left in the register, which will affect the following apps.
So the solution is to clear the register in KFD, as on gfx v9.

Signed-off-by: Eric Huang 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +---
   1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c

b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c

index e2fed6edbdd0..aff08321e976 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
@@ -163,12 +163,6 @@ static uint32_t

kgd_gfx_aldebaran_set_address_watch(

 return watch_address_cntl;
   }

-static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct

amdgpu_device *adev,

- uint32_t watch_id)
-{
-   return 0;
-}
-
   const struct kfd2kgd_calls aldebaran_kfd2kgd = {
 .program_sh_mem_settings =

kgd_gfx_v9_program_sh_mem_settings,

 .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
@@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
 .set_wave_launch_trap_override =

kgd_aldebaran_set_wave_launch_trap_override,

 .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
 .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
-   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
+   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
 .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
 .build_grace_period_packet_info =

kgd_gfx_v9_build_grace_period_packet_info,

.program_trap_handler_settings =

kgd_gfx_v9_program_trap_handler_settings,






Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-08-11 Thread Felix Kuehling

Am 2023-08-11 um 06:11 schrieb Mike Lothian:

On Thu, 3 Aug 2023 at 20:43, Felix Kuehling  wrote:

Is your kernel configured without dynamic debugging? Maybe we need to
wrap this in some #if defined(CONFIG_DYNAMIC_DEBUG_CORE).


Apologies, I thought I'd replied to this, yes I didn't have dynamic
debugging enabled


I submitted a fix for this by Arnd Bergmann: 
https://patchwork.freedesktop.org/patch/551367/. It should show up in 
Alex's public branch soon.


Regards,
  Felix




Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

2023-08-10 Thread Felix Kuehling
I think Jon is suggesting that the UNMAP_QUEUES command should clear the 
address watch registers. Requesting such a change from the HWS team 
may take a long time.


That said, when was this workaround implemented and reviewed? Did I 
review it as part of Jon's debugger upstreaming patch series? Or did 
this come later? This patch only enables the workaround for v9.4.2.


Regards,
  Felix


On 2023-08-10 17:52, Eric Huang wrote:
The problem is that the queue is suspended before the clear address watch 
call in KFD; there is no queue preemption and queue resume after the 
clearing call, and the test ends. So there is no chance to send 
MAP_PROCESS to the HWS. At this point the FW has nothing to do. We have 
several test FWs from Tej, and none of them works, so I reviewed the 
kernel debug log and found the problem.


GFX11 has a different scheduler: when clearing an address watch, KFD 
directly sends MES_MISC_OP_SET_SHADER_DEBUGGER to the MES; it doesn't 
consider whether the queue is suspended. So GFX11 doesn't have this issue.


Regards,
Eric

On 2023-08-10 17:27, Kim, Jonathan wrote:

[AMD Official Use Only - General]

This is a strange solution because the MEC should set watch controls 
as non-valid automatically on queue preemption to avoid this kind of 
issue in the first place by design.  MAP_PROCESS on resume will take 
whatever the driver requests.

GFX11 has no issue with letting the HWS do this.

Are we sure we're not working around some HWS bug?

Thanks,

Jon


-Original Message-
From: Kuehling, Felix 
Sent: Thursday, August 10, 2023 5:03 PM
To: Huang, JinHuiEric ; amd-
g...@lists.freedesktop.org
Cc: Kim, Jonathan 
Subject: Re: [PATCH] drm/amdkfd: fix address watch clearing bug for 
gfx v9.4.2


I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit
different because it needs to support multiple XCCs.

That said, this patch is

Reviewed-by: Felix Kuehling 


On 2023-08-10 16:47, Eric Huang wrote:

KFD currently relies on the MEC FW to clear the tcp watch control
register by sending a MAP_PROCESS packet with 0 in the field
tcp_watch_cntl to the HWS, but if the queue is suspended, the
packet will not be sent and the previous value will be
left in the register, which will affect the following apps.
So the solution is to clear the register in KFD, as on gfx v9.

Signed-off-by: Eric Huang 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +---
   1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c

b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c

index e2fed6edbdd0..aff08321e976 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
@@ -163,12 +163,6 @@ static uint32_t

kgd_gfx_aldebaran_set_address_watch(

 return watch_address_cntl;
   }

-static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct

amdgpu_device *adev,

- uint32_t watch_id)
-{
-   return 0;
-}
-
   const struct kfd2kgd_calls aldebaran_kfd2kgd = {
 .program_sh_mem_settings =

kgd_gfx_v9_program_sh_mem_settings,

 .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
@@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
 .set_wave_launch_trap_override =

kgd_aldebaran_set_wave_launch_trap_override,

 .set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
 .set_address_watch = kgd_gfx_aldebaran_set_address_watch,
-   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
+   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
 .get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
 .build_grace_period_packet_info =

kgd_gfx_v9_build_grace_period_packet_info,

 .program_trap_handler_settings =

kgd_gfx_v9_program_trap_handler_settings,




Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

2023-08-10 Thread Felix Kuehling
I think amdgpu_amdkfd_gc_9_4_3.c needs a similar fix. But maybe a bit 
different because it needs to support multiple XCCs.


That said, this patch is

Reviewed-by: Felix Kuehling 


On 2023-08-10 16:47, Eric Huang wrote:

KFD currently relies on the MEC FW to clear the tcp watch control
register by sending a MAP_PROCESS packet with 0 in the field
tcp_watch_cntl to the HWS, but if the queue is suspended, the
packet will not be sent and the previous value will be
left in the register, which will affect the following apps.
So the solution is to clear the register in KFD, as on gfx v9.

Signed-off-by: Eric Huang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c | 8 +---
  1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
index e2fed6edbdd0..aff08321e976 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
@@ -163,12 +163,6 @@ static uint32_t kgd_gfx_aldebaran_set_address_watch(
return watch_address_cntl;
  }
  
-static uint32_t kgd_gfx_aldebaran_clear_address_watch(struct amdgpu_device *adev,

- uint32_t watch_id)
-{
-   return 0;
-}
-
  const struct kfd2kgd_calls aldebaran_kfd2kgd = {
.program_sh_mem_settings = kgd_gfx_v9_program_sh_mem_settings,
.set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
@@ -193,7 +187,7 @@ const struct kfd2kgd_calls aldebaran_kfd2kgd = {
.set_wave_launch_trap_override = 
kgd_aldebaran_set_wave_launch_trap_override,
.set_wave_launch_mode = kgd_aldebaran_set_wave_launch_mode,
.set_address_watch = kgd_gfx_aldebaran_set_address_watch,
-   .clear_address_watch = kgd_gfx_aldebaran_clear_address_watch,
+   .clear_address_watch = kgd_gfx_v9_clear_address_watch,
.get_iq_wait_times = kgd_gfx_v9_get_iq_wait_times,
.build_grace_period_packet_info = 
kgd_gfx_v9_build_grace_period_packet_info,
.program_trap_handler_settings = 
kgd_gfx_v9_program_trap_handler_settings,


Re: [PATCH] drm/amdkfd: fix double assign skip process context clear

2023-08-10 Thread Felix Kuehling

On 2023-08-10 15:03, Jonathan Kim wrote:

Remove redundant assignment when skipping process ctx clear.

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index aa5091f18681..89c2bfcb36ce 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -227,7 +227,6 @@ static int add_queue_mes(struct device_queue_manager *dqm, 
struct queue *q,
queue_input.tba_addr = qpd->tba_addr;
queue_input.tma_addr = qpd->tma_addr;
queue_input.trap_en = !kfd_dbg_has_cwsr_workaround(q->device);
-   queue_input.skip_process_ctx_clear = 
qpd->pqm->process->debug_trap_enabled;
queue_input.skip_process_ctx_clear = 
qpd->pqm->process->debug_trap_enabled ||
 
kfd_dbg_has_ttmps_always_setup(q->device);
  


Re: [PATCH] drm/amdkfd: Add missing tba_hi programming on aldebaran

2023-08-09 Thread Felix Kuehling

On 2023-08-09 17:26, Jay Cornwall wrote:

Previously asymptomatic because high 32 bits were zero.

Fixes: 615222cfed20 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole")
Signed-off-by: Jay Cornwall 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c
index 8fda16e6fee6..8ce6f5200905 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c
@@ -121,6 +121,7 @@ static int pm_map_process_aldebaran(struct packet_manager 
*pm,
packet->sh_mem_bases = qpd->sh_mem_bases;
if (qpd->tba_addr) {
packet->sq_shader_tba_lo = lower_32_bits(qpd->tba_addr >> 8);
+   packet->sq_shader_tba_hi = upper_32_bits(qpd->tba_addr >> 8);
packet->sq_shader_tma_lo = lower_32_bits(qpd->tma_addr >> 8);
packet->sq_shader_tma_hi = upper_32_bits(qpd->tma_addr >> 8);
}


Re: [PATCH v2] drm/amdkfd: Use memdup_user() rather than duplicating its implementation

2023-08-09 Thread Felix Kuehling



On 2023-08-09 01:30, Atul Raut wrote:

To prevent its redundant implementation and streamline
code, use memdup_user.

This fixes warnings reported by Coccinelle:
./drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:2811:13-20: WARNING 
opportunity for memdup_user

Signed-off-by: Atul Raut 


The patch is

Reviewed-by: Felix Kuehling 

I'm applying it to amd-staging-drm-next.

Regards,
  Felix



---
v1 -> v2
   caller checks for errors, hence removed
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 +-
  1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 2df153828ff4..df9b618756e6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2803,19 +2803,11 @@ static void copy_context_work_handler (struct 
work_struct *work)
  static uint32_t *get_queue_ids(uint32_t num_queues, uint32_t 
*usr_queue_id_array)
  {
size_t array_size = num_queues * sizeof(uint32_t);
-   uint32_t *queue_ids = NULL;
  
  	if (!usr_queue_id_array)

return NULL;
  
-	queue_ids = kzalloc(array_size, GFP_KERNEL);

-   if (!queue_ids)
-   return ERR_PTR(-ENOMEM);
-
-   if (copy_from_user(queue_ids, usr_queue_id_array, array_size))
-   return ERR_PTR(-EFAULT);
-
-   return queue_ids;
+   return memdup_user(usr_queue_id_array, array_size);
  }
  
  int resume_queues(struct kfd_process *p,


Re: drm/amdkfd: Use memdup_user() rather than duplicating its

2023-08-08 Thread Felix Kuehling

On 2023-08-08 16:57, Atul Raut wrote:

To prevent its redundant implementation and streamline
code, use memdup_user.

This fixes warnings reported by Coccinelle:
./drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:2811:13-20: WARNING 
opportunity for memdup_user

Signed-off-by: Atul Raut 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 +++--
  1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 2df153828ff4..51740e007e89 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2808,12 +2808,9 @@ static uint32_t *get_queue_ids(uint32_t num_queues, 
uint32_t *usr_queue_id_array
if (!usr_queue_id_array)
return NULL;
  
-	queue_ids = kzalloc(array_size, GFP_KERNEL);

-   if (!queue_ids)
-   return ERR_PTR(-ENOMEM);
-
-   if (copy_from_user(queue_ids, usr_queue_id_array, array_size))
-   return ERR_PTR(-EFAULT);
+   queue_ids = memdup_user(usr_queue_id_array, array_size);
+   if (IS_ERR(Iqueue_ids))


You have a typo in the variable name here. Did you at least compile-test 
the patch?




+   return ERR_PTR(queue_ids);


I think it should just return queue_ids here. That's already an ERR_PTR 
in case of errors. So you don't even need the "if". Just this should do 
the job:


    return memdup_user(usr_queue_id_array, array_size);

The error checking is done by the caller.

Regards,
  Felix


  
  	return queue_ids;

  }


Re: [PATCH V2 1/5] drm/amdkfd: ignore crat by default

2023-08-08 Thread Felix Kuehling

On 2023-08-07 18:05, Alex Deucher wrote:

We are dropping the IOMMUv2 path, so no need to enable this.
It's often buggy on consumer platforms anyway.

Signed-off-by: Alex Deucher 


The series is

Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
index 49f40d9f16e86..f5a6f562e2a80 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_crat.c
@@ -1543,11 +1543,7 @@ static bool kfd_ignore_crat(void)
if (ignore_crat)
return true;
  
-#ifndef KFD_SUPPORT_IOMMU_V2

ret = true;
-#else
-   ret = false;
-#endif
  
  	return ret;

  }


Re: [PATCH] drm/amdkfd: wrap dynamic debug call with CONFIG_DYNAMIC_DEBUG_CORE

2023-08-04 Thread Felix Kuehling
I just applied Arnd Bergmann's patch "drm/amdkfd: fix build failure 
without CONFIG_DYNAMIC_DEBUG". This patch is no longer needed.


Regards,
  Felix

On 2023-08-04 12:05, Alex Sierra wrote:

This causes error compilation if CONFIG_DYNAMIC_DEBUG_CORE is not
defined.

Signed-off-by: Alex Sierra 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index a69994ff1c2f..cde4cc6afa83 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -824,6 +824,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct 
svm_range *prange,
   *
   * Context: The caller must hold svms->lock
   */
+#if defined(CONFIG_DYNAMIC_DEBUG_CORE)
  static void svm_range_debug_dump(struct svm_range_list *svms)
  {
struct interval_tree_node *node;
@@ -851,6 +852,7 @@ static void svm_range_debug_dump(struct svm_range_list 
*svms)
node = interval_tree_iter_next(node, 0, ~0ULL);
}
  }
+#endif
  
  static void *

  svm_range_copy_array(void *psrc, size_t size, uint64_t num_elements,
@@ -3594,7 +3596,9 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
break;
}
  
+#if defined(CONFIG_DYNAMIC_DEBUG_CORE)

dynamic_svm_range_dump(svms);
+#endif
  
	mutex_unlock(&svms->lock);

mmap_read_unlock(mm);


Re: [PATCH] drm/amdkfd: fix build failure without CONFIG_DYNAMIC_DEBUG

2023-08-04 Thread Felix Kuehling

On 2023-08-04 9:29, Arnd Bergmann wrote:

From: Arnd Bergmann 

When CONFIG_DYNAMIC_DEBUG is disabled altogether, calling
_dynamic_func_call_no_desc() does not work:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c: In function 
'svm_range_set_attr':
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:52:9: error: implicit 
declaration of function '_dynamic_func_call_no_desc' 
[-Werror=implicit-function-declaration]
52 | _dynamic_func_call_no_desc("svm_range_dump", 
svm_range_debug_dump, svms)
   | ^~
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:3564:9: note: in expansion of 
macro 'dynamic_svm_range_dump'
  3564 | dynamic_svm_range_dump(svms);
   | ^~

Add a compile-time conditional in addition to the runtime check.

Fixes: 8923137dbe4b2 ("drm/amdkfd: avoid svm dump when dynamic debug disabled")
Signed-off-by: Arnd Bergmann 


The patch is

Reviewed-by: Felix Kuehling 

I'm applying it to amd-staging-drm-next.

Thanks,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 308384dbc502d..44e710821b6d9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -23,6 +23,7 @@
  
  #include 

  #include 
+#include 
  #include 
  #include 
  
@@ -48,8 +49,13 @@

   * page table is updated.
   */
  #define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING  (2UL * NSEC_PER_MSEC)
+#if IS_ENABLED(CONFIG_DYNAMIC_DEBUG)
  #define dynamic_svm_range_dump(svms) \
_dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)
+#else
+#define dynamic_svm_range_dump(svms) \
+   do { if (0) svm_range_debug_dump(svms); } while (0)
+#endif
  
  /* Giant svm range split into smaller ranges based on this, it is decided using

   * minimum of all dGPU/APU 1/32 VRAM size, between 2MB to 1GB and alignment to


Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-08-03 Thread Felix Kuehling
Is your kernel configured without dynamic debugging? Maybe we need to 
wrap this in some #if defined(CONFIG_DYNAMIC_DEBUG_CORE).


Regards,
  Felix


On 2023-08-03 at 15:38, Mike Lothian wrote:

Hi

I'm seeing a compiler failure with Clang 16

   CC  drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.o
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:3568:2: error: call to
undeclared function '_dynamic_func_call_no_desc'; ISO C99 and later do
not support implicit function declarations
[-Wimplicit-function-declaration]
dynamic_svm_range_dump(svms);
^
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:50:2: note: expanded
from macro 'dynamic_svm_range_dump'
_dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)
^
1 error generated.

Cheers

Mike

On Wed, 19 Jul 2023 at 22:27, Felix Kuehling  wrote:

On 2023-07-19 at 17:22, Alex Sierra wrote:

Set dynamic_svm_range_dump macro to avoid iterating over SVM lists
from svm_range_debug_dump when dynamic debug is disabled. Otherwise,
it could drop performance, specially with big number of SVM ranges.
Make sure both svm_range_set_attr and svm_range_debug_dump functions
are dynamically enabled to print svm_range_debug_dump debug traces.

Signed-off-by: Alex Sierra 
Tested-by: Alex Sierra 
Signed-off-by: Philip Yang 
Signed-off-by: Felix Kuehling 

I don't think my name on a Signed-off-by is appropriate here. I didn't
write the patch. And I'm not submitting it. However, the patch is

Reviewed-by: Felix Kuehling 



---
   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
   1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 479c4f66afa7..1b50eae051a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -46,6 +46,8 @@
* page table is updated.
*/
   #define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING(2UL * NSEC_PER_MSEC)
+#define dynamic_svm_range_dump(svms) \
+ _dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)

   /* Giant svm range split into smaller ranges based on this, it is decided 
using
* minimum of all dGPU/APU 1/32 VRAM size, between 2MB to 1GB and alignment 
to
@@ -3563,7 +3565,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
   break;
   }

- svm_range_debug_dump(svms);
+ dynamic_svm_range_dump(svms);

   mutex_unlock(&svms->lock);
   mmap_read_unlock(mm);


Re: [PATCH 1/3] drm/amdkfd: Sync trap handler binaries with source

2023-08-02 Thread Felix Kuehling

On 2023-07-31 16:40, Jay Cornwall wrote:

Some changes have been lost during rebases. Rebuild sources.

Signed-off-by: Jay Cornwall 


The series is

Reviewed-by: Felix Kuehling 



---
  .../gpu/drm/amd/amdkfd/cwsr_trap_handler.h| 741 +-
  1 file changed, 371 insertions(+), 370 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h 
b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
index 73ca9aebf086..717ad0633dbe 100644
--- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
+++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
@@ -283,7 +283,7 @@ static const uint32_t cwsr_trap_gfx9_hex[] = {
0x866eff7b, 0x0400,
0xbf850051, 0xbf8e0010,
0xb8fbf803, 0xbf82fffa,
-   0x866eff7b, 0x0900,
+   0x866eff7b, 0x03c00900,
0xbf850015, 0x866eff7b,
0x71ff, 0xbf840008,
0x866fff7b, 0x7080,
@@ -1103,7 +1103,7 @@ static const uint32_t cwsr_trap_arcturus_hex[] = {
0x866eff7b, 0x0400,
0xbf850051, 0xbf8e0010,
0xb8fbf803, 0xbf82fffa,
-   0x866eff7b, 0x0900,
+   0x866eff7b, 0x03c00900,
0xbf850015, 0x866eff7b,
0x71ff, 0xbf840008,
0x866fff7b, 0x7080,
@@ -1581,7 +1581,7 @@ static const uint32_t cwsr_trap_aldebaran_hex[] = {
0x866eff7b, 0x0400,
0xbf850051, 0xbf8e0010,
0xb8fbf803, 0xbf82fffa,
-   0x866eff7b, 0x0900,
+   0x866eff7b, 0x03c00900,
0xbf850015, 0x866eff7b,
0x71ff, 0xbf840008,
0x866fff7b, 0x7080,
@@ -2494,6 +2494,7 @@ static const uint32_t cwsr_trap_gfx10_hex[] = {
0xbf9f, 0xbf9f,
0xbf9f, 0x,
  };
+
  static const uint32_t cwsr_trap_gfx11_hex[] = {
0xbfa1, 0xbfa00221,
0xb0804006, 0xb8f8f802,
@@ -2938,211 +2939,149 @@ static const uint32_t cwsr_trap_gfx11_hex[] = {
  };
  
  static const uint32_t cwsr_trap_gfx9_4_3_hex[] = {

-   0xbf820001, 0xbf8202d6,
-   0xb8f8f802, 0x89788678,
-   0xb8fbf803, 0x866eff78,
-   0x2000, 0xbf840009,
-   0x866eff6d, 0x00ff,
-   0xbf85001a, 0x866eff7b,
-   0x0400, 0xbf85004d,
-   0xbf8e0010, 0xb8fbf803,
-   0xbf82fffa, 0x866eff7b,
-   0x03c00900, 0xbf850011,
-   0x866eff7b, 0x71ff,
-   0xbf840008, 0x866fff7b,
-   0x7080, 0xbf840001,
-   0xbeee1a87, 0xb8eff801,
-   0x8e6e8c6e, 0x866e6f6e,
-   0xbf850006, 0x866eff6d,
-   0x00ff, 0xbf850003,
+   0xbf820001, 0xbf8202d7,
+   0xb8f8f802, 0x8978ff78,
+   0x00020006, 0xb8fbf803,
+   0x866eff78, 0x2000,
+   0xbf840009, 0x866eff6d,
+   0x00ff, 0xbf85001a,
0x866eff7b, 0x0400,
-   0xbf850036, 0xb8faf807,
+   0xbf85004d, 0xbf8e0010,
+   0xb8fbf803, 0xbf82fffa,
+   0x866eff7b, 0x03c00900,
+   0xbf850011, 0x866eff7b,
+   0x71ff, 0xbf840008,
+   0x866fff7b, 0x7080,
+   0xbf840001, 0xbeee1a87,
+   0xb8eff801, 0x8e6e8c6e,
+   0x866e6f6e, 0xbf850006,
+   0x866eff6d, 0x00ff,
+   0xbf850003, 0x866eff7b,
+   0x0400, 0xbf850036,
+   0xb8faf807, 0x867aff7a,
+   0x001f8000, 0x8e7a8b7a,
+   0x8979ff79, 0xfc00,
+   0x87797a79, 0xba7ff807,
+   0x, 0xb8faf812,
+   0xb8fbf813, 0x8efa887a,
+   0xc0031bbd, 0x0010,
+   0xbf8cc07f, 0x8e6e976e,
+   0x8979ff79, 0x0080,
+   0x87796e79, 0xc0071bbd,
+   0x, 0xbf8cc07f,
+   0xc0071ebd, 0x0008,
+   0xbf8cc07f, 0x86ee6e6e,
+   0xbf840001, 0xbe801d6e,
+   0x866eff6d, 0x01ff,
+   0xbf850005, 0x8778ff78,
+   0x2000, 0x80ec886c,
+   0x82ed806d, 0xbf820005,
+   0x866eff6d, 0x0100,
+   0xbf850002, 0x806c846c,
+   0x826d806d, 0x866dff6d,
+   0x, 0x8f7a8b79,
0x867aff7a, 0x001f8000,
-   0x8e7a8b7a, 0x8979ff79,
-   0xfc00, 0x87797a79,
-   0xba7ff807, 0x,
-   0xb8faf812, 0xb8fbf813,
-   0x8efa887a, 0xc0031bbd,
-   0x0010, 0xbf8cc07f,
-   0x8e6e976e, 0x8979ff79,
-   0x0080, 0x87796e79,
-   0xc0071bbd, 0x,
-   0xbf8cc07f, 0xc0071ebd,
-   0x0008, 0xbf8cc07f,
-   0x86ee6e6e, 0xbf840001,
-   0xbe801d6e, 0x866eff6d,
-   0x01ff, 0xbf850005,
-   0x8778ff78, 0x2000,
-   0x80ec886c, 0x82ed806d,
-   0xbf820005, 0x866eff6d,
-   0x0100, 0xbf850002,
-   0x806c846c, 0x826d806d,
+   0xb97af807, 0x86fe7e7e,
+   0x86ea6a6a, 0x8f6e8378,
+   0xb96ee0c2, 0xbf82,
+   0xb9780002, 0xbe801f6c,
0x866dff6d, 0x,
-   0x8f7a8b79, 0x867aff7a,
-   0x001f8000, 0xb97af807,
-   0x86fe7e7e, 0x86ea6a6a,
-   0x8f6e8378, 0xb96ee0c2,
-   0xbf82, 0xb9780002,
-   0xbe801f6c, 0x866dff6d,
-   0x, 0xbefa0080,
-   0xb97a0283, 0xb8faf807,
-   0x867aff7a, 0x001f8000,
-   0x8e7a8b7a, 0x8979ff79,
-   0xfc00, 0x87797a79

Re: [PATCH] drm/amdkfd: avoid unmap dma address when svm_ranges are split

2023-07-28 Thread Felix Kuehling

On 2023-07-28 17:41, Alex Sierra wrote:

DMA address reference within svm_ranges should be unmapped only after
the memory has been released from the system. In case of range
splitting, the DMA address information should be copied to the
corresponding range after this has split. But leaving dma mapping
intact.

Signed-off-by: Alex Sierra 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  7 +--
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 61 +---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  2 +-
  3 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 709ac885ca6d..7d82c7da223a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -461,7 +461,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
0, node->id, trigger);
  
  	svm_range_dma_unmap(adev->dev, scratch, 0, npages);

-   svm_range_free_dma_mappings(prange);
  
  out_free:

kvfree(buf);
@@ -543,10 +542,12 @@ svm_migrate_ram_to_vram(struct svm_range *prange, 
uint32_t best_loc,
addr = next;
}
  
-	if (cpages)

+   if (cpages) {
prange->actual_loc = best_loc;
-   else
+   svm_range_free_dma_mappings(prange, true);
+   } else {
svm_range_vram_node_free(prange);
+   }
  
  	return r < 0 ? r : 0;

  }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 1b50eae051a4..a69994ff1c2f 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -241,7 +241,7 @@ void svm_range_dma_unmap(struct device *dev, dma_addr_t 
*dma_addr,
}
  }
  
-void svm_range_free_dma_mappings(struct svm_range *prange)

+void svm_range_free_dma_mappings(struct svm_range *prange, bool unmap_dma)
  {
struct kfd_process_device *pdd;
dma_addr_t *dma_addr;
@@ -262,13 +262,14 @@ void svm_range_free_dma_mappings(struct svm_range *prange)
continue;
}
		dev = &pdd->dev->adev->pdev->dev;
-   svm_range_dma_unmap(dev, dma_addr, 0, prange->npages);
+   if (unmap_dma)
+   svm_range_dma_unmap(dev, dma_addr, 0, prange->npages);
kvfree(dma_addr);
prange->dma_addr[gpuidx] = NULL;
}
  }
  
-static void svm_range_free(struct svm_range *prange, bool update_mem_usage)

+static void svm_range_free(struct svm_range *prange, bool do_unmap)
  {
uint64_t size = (prange->last - prange->start + 1) << PAGE_SHIFT;
struct kfd_process *p = container_of(prange->svms, struct kfd_process, 
svms);
@@ -277,9 +278,9 @@ static void svm_range_free(struct svm_range *prange, bool 
update_mem_usage)
 prange->start, prange->last);
  
  	svm_range_vram_node_free(prange);

-   svm_range_free_dma_mappings(prange);
+   svm_range_free_dma_mappings(prange, do_unmap);
  
-	if (update_mem_usage && !p->xnack_enabled) {

+   if (do_unmap && !p->xnack_enabled) {
pr_debug("unreserve prange 0x%p size: 0x%llx\n", prange, size);
amdgpu_amdkfd_unreserve_mem_limit(NULL, size,
KFD_IOC_ALLOC_MEM_FLAGS_USERPTR, 0);
@@ -851,6 +852,37 @@ static void svm_range_debug_dump(struct svm_range_list 
*svms)
}
  }
  
+static void *

+svm_range_copy_array(void *psrc, size_t size, uint64_t num_elements,
+uint64_t offset)
+{
+   unsigned char *dst;
+
+   dst = kvmalloc_array(num_elements, size, GFP_KERNEL);
+   if (!dst)
+   return NULL;
+   memcpy(dst, (unsigned char *)psrc + offset, num_elements * size);
+
+   return (void *)dst;
+}
+
+static int
+svm_range_copy_dma_addrs(struct svm_range *dst, struct svm_range *src)
+{
+   int i;
+
+   for (i = 0; i < MAX_GPU_INSTANCE; i++) {
+   if (!src->dma_addr[i])
+   continue;
+   dst->dma_addr[i] = svm_range_copy_array(src->dma_addr[i],
+   sizeof(*src->dma_addr[i]), src->npages, 
0);
+   if (!dst->dma_addr[i])
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
  static int
  svm_range_split_array(void *ppnew, void *ppold, size_t size,
  uint64_t old_start, uint64_t old_n,
@@ -865,22 +897,16 @@ svm_range_split_array(void *ppnew, void *ppold, size_t 
size,
if (!pold)
return 0;
  
-	new = kvmalloc_array(new_n, size, GFP_KERNEL);

+   d = (new_start - old_start) * size;
+   new = svm_range_copy_array(pold, size, new_n, d);
if (!new)
		return -ENOMEM;

Re: [PATCH v3] drm/amdgpu: Add EXT_COHERENT memory allocation flags

2023-07-28 Thread Felix Kuehling

On 2023-07-28 15:39, David Francis wrote:

These flags (for GEM and SVM allocations) allocate
memory that allows for system-scope atomic semantics.

On GFX943 these flags cause caches to be avoided on
non-local memory.

On all other ASICs they are identical in functionality to the
equivalent COHERENT flags.

Corresponding Thunk patch is at
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/pull/88

v3: changed name of flag

Signed-off-by: David Francis 


I made one comment on the user mode patch regarding the explicit 
handling of invalid combinations of Uncached, Coherent, ExtCoherent 
flags. I'm not sure what we agreed on any more. But I don't think we 
want to just leave it up to chance. Other than that, this patch looks 
good to me.


Regards,
  Felix



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c  |  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c   |  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c   |  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c|  5 -
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 10 +-
  include/uapi/drm/amdgpu_drm.h| 10 +-
  include/uapi/linux/kfd_ioctl.h   |  3 +++
  8 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index d34c3ef8f3ed..a1ce261f2d06 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1738,6 +1738,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
  
  	if (flags & KFD_IOC_ALLOC_MEM_FLAGS_COHERENT)

alloc_flags |= AMDGPU_GEM_CREATE_COHERENT;
+   if (flags & KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT)
+   alloc_flags |= AMDGPU_GEM_CREATE_EXT_COHERENT;
if (flags & KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED)
alloc_flags |= AMDGPU_GEM_CREATE_UNCACHED;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c

index 12210598e5b8..76b618735dc0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -331,6 +331,7 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct 
dma_buf *dma_buf)
  
  		flags |= other->flags & (AMDGPU_GEM_CREATE_CPU_GTT_USWC |

 AMDGPU_GEM_CREATE_COHERENT |
+AMDGPU_GEM_CREATE_EXT_COHERENT |
 AMDGPU_GEM_CREATE_UNCACHED);
}
  
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

index 6b430e10d38e..301ffe30824f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -632,6 +632,7 @@ static void gmc_v10_0_get_vm_pte(struct amdgpu_device *adev,
}
  
  	if (bo && bo->flags & (AMDGPU_GEM_CREATE_COHERENT |

+  AMDGPU_GEM_CREATE_EXT_COHERENT |
   AMDGPU_GEM_CREATE_UNCACHED))
*flags = (*flags & ~AMDGPU_PTE_MTYPE_NV10_MASK) |
 AMDGPU_PTE_MTYPE_NV10(MTYPE_UC);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index a6ee0220db56..846894e212e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -540,6 +540,7 @@ static void gmc_v11_0_get_vm_pte(struct amdgpu_device *adev,
}
  
  	if (bo && bo->flags & (AMDGPU_GEM_CREATE_COHERENT |

+  AMDGPU_GEM_CREATE_EXT_COHERENT |
   AMDGPU_GEM_CREATE_UNCACHED))
*flags = (*flags & ~AMDGPU_PTE_MTYPE_NV10_MASK) |
 AMDGPU_PTE_MTYPE_NV10(MTYPE_UC);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 880460cd3239..92a623e130d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1183,7 +1183,8 @@ static void gmc_v9_0_get_coherence_flags(struct 
amdgpu_device *adev,
  {
struct amdgpu_device *bo_adev = amdgpu_ttm_adev(bo->tbo.bdev);
bool is_vram = bo->tbo.resource->mem_type == TTM_PL_VRAM;
-   bool coherent = bo->flags & AMDGPU_GEM_CREATE_COHERENT;
+   bool coherent = bo->flags & (AMDGPU_GEM_CREATE_COHERENT | 
AMDGPU_GEM_CREATE_EXT_COHERENT);
+   bool ext_coherent = bo->flags & AMDGPU_GEM_CREATE_EXT_COHERENT;
bool uncached = bo->flags & AMDGPU_GEM_CREATE_UNCACHED;
struct amdgpu_vm *vm = mapping->bo_va->base.vm;
unsigned int mtype_local, mtype;
@@ -1251,6 +1252,8 @@ static void gmc_v9_0_get_coherence_flags(struct 
amdgpu_device *adev,
snoop = true;
if (uncached) {
mtype = MTYPE_UC;
+   } else if (ext_coherent) {
+   

Re: [PATCH 2/4] drm/amdkfd: disable IOMMUv2 support for KV/CZ

2023-07-28 Thread Felix Kuehling
There are some APU-specific code paths for Kaveri and Carrizo in the 
device queue manager and MQD manager. I think a minimal fix would be to 
change device_queue_manager_init to call 
device_queue_manager_init_cik_hawaii for Kaveri and 
device_queue_manager_init_vi_tonga for Carrizo to use the dGPU code paths.


Then we could probably remove the APU-specific functions and remove the 
_hawaii and _tonga suffixes from the dGPU functions.


Regards,
  Felix


On 2023-07-28 12:41, Alex Deucher wrote:

Use the dGPU path instead.  There were a lot of platform
issues with IOMMU in general on these chips due to windows
not enabling IOMMU at the time.  The dGPU path has been
used for a long time with newer APUs and works fine.  This
also paves the way to simplify the driver significantly.

Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 6 --
  1 file changed, 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 64772921ea43b..814a6116ca9bb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -234,10 +234,6 @@ static void kfd_device_info_init(struct kfd_dev *kfd,
asic_type != CHIP_TONGA)
kfd->device_info.supports_cwsr = true;
  
-		if (asic_type == CHIP_KAVERI ||

-   asic_type == CHIP_CARRIZO)
-   kfd->device_info.needs_iommu_device = true;
-
if (asic_type != CHIP_HAWAII && !vf)
kfd->device_info.needs_pci_atomics = true;
}
@@ -250,7 +246,6 @@ struct kfd_dev *kgd2kfd_probe(struct amdgpu_device *adev, 
bool vf)
uint32_t gfx_target_version = 0;
  
  	switch (adev->asic_type) {

-#ifdef KFD_SUPPORT_IOMMU_V2
  #ifdef CONFIG_DRM_AMDGPU_CIK
case CHIP_KAVERI:
gfx_target_version = 7;
@@ -263,7 +258,6 @@ struct kfd_dev *kgd2kfd_probe(struct amdgpu_device *adev, 
bool vf)
if (!vf)
f2g = _v8_kfd2kgd;
break;
-#endif
  #ifdef CONFIG_DRM_AMDGPU_CIK
case CHIP_HAWAII:
gfx_target_version = 70001;


Re: [PATCH] drm/amdkfd: avoid unmap dma address when svm_ranges are split

2023-07-28 Thread Felix Kuehling



On 2023-07-27 19:43, Alex Sierra wrote:

DMA address reference within svm_ranges should be unmapped only after
the memory has been released from the system. In case of range
splitting, the DMA address information should be copied to the
corresponding range after this has split. But leaving dma mapping
intact.

Signed-off-by: Alex Sierra 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 67 ++--
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  2 +-
  3 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 709ac885ca6d..2586ac070190 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -461,7 +461,7 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
0, node->id, trigger);
  
  	svm_range_dma_unmap(adev->dev, scratch, 0, npages);

-   svm_range_free_dma_mappings(prange);
+   svm_range_free_dma_mappings(prange, true);


Do we even need to call svm_range_dma_unmap just before? Looks like 
that's done inside svm_range_free_dma_mappings anyway.


Maybe this should also be moved to svm_migrate_ram_to_vram because it 
affects the entire prange and not just one VMA. So you only need to do 
it once per prange. Let's clean that up in a follow up change.



  
  out_free:

kvfree(buf);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 1b50eae051a4..d1ff1c7e96d0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -241,7 +241,7 @@ void svm_range_dma_unmap(struct device *dev, dma_addr_t 
*dma_addr,
}
  }
  
-void svm_range_free_dma_mappings(struct svm_range *prange)

+void svm_range_free_dma_mappings(struct svm_range *prange, bool unmap_dma)
  {
struct kfd_process_device *pdd;
dma_addr_t *dma_addr;
@@ -262,7 +262,8 @@ void svm_range_free_dma_mappings(struct svm_range *prange)
continue;
}
		dev = &pdd->dev->adev->pdev->dev;
-   svm_range_dma_unmap(dev, dma_addr, 0, prange->npages);
+   if (unmap_dma)
+   svm_range_dma_unmap(dev, dma_addr, 0, prange->npages);
kvfree(dma_addr);
prange->dma_addr[gpuidx] = NULL;
}
@@ -277,7 +278,7 @@ static void svm_range_free(struct svm_range *prange, bool 
update_mem_usage)


I'd rename the update_mem_usage parameter to better represent what it 
means. Maybe something like "do_unmap".




 prange->start, prange->last);
  
  	svm_range_vram_node_free(prange);

-   svm_range_free_dma_mappings(prange);
+   svm_range_free_dma_mappings(prange, update_mem_usage);
  
  	if (update_mem_usage && !p->xnack_enabled) {

pr_debug("unreserve prange 0x%p size: 0x%llx\n", prange, size);
@@ -851,12 +852,46 @@ static void svm_range_debug_dump(struct svm_range_list 
*svms)
}
  }
  
+static int

+svm_range_copy_array(void *ppdst, void *ppsrc, size_t size,


ppdst and ppsrc should be defined as void ** to avoid some ugly pointer 
type casts below. I'm not sure why ppsrc is a pointer to a pointer in 
the first place. I think it should just be a pointer because you don't 
need to update the caller's pointer.


It may also be cleaner if you return the destination pointer as return 
value instead and return NULL if allocation failed.




+uint64_t num_elements, uint64_t offset)
+{
+   unsigned char *dst, *psrc;
+
+   psrc = *(unsigned char **)ppsrc;
+   dst = kvmalloc_array(num_elements, size, GFP_KERNEL);
+   if (!dst)
+   return -ENOMEM;
+   memcpy(dst, psrc + offset, num_elements * size);
+   *(void **)ppdst = dst;
+
+   return 0;
+}
+
+static int
+svm_range_copy_dma_addrs(struct svm_range *dst, struct svm_range *src)
+{
+   int i, r;
+
+   for (i = 0; i < MAX_GPU_INSTANCE; i++) {
+   if (!src->dma_addr[i])
+   continue;
+   r = svm_range_copy_array(&dst->dma_addr[i], &src->dma_addr[i],
+sizeof(*src->dma_addr[i]), 
src->npages, 0);
+   if (r)
+   return r;
+   }
+
+   return 0;
+}
+
  static int
  svm_range_split_array(void *ppnew, void *ppold, size_t size,
  uint64_t old_start, uint64_t old_n,
  uint64_t new_start, uint64_t new_n)
  {
unsigned char *new, *old, *pold;
+   int r;
uint64_t d;
  
  	if (!ppold)

@@ -865,22 +900,16 @@ svm_range_split_array(void *ppnew, void *ppold, size_t 
size,
if (!pold)
return 0;
  
-	new = kvmalloc_array(new_n, size, GFP_KERNEL);

-   if (!new)
-   return -ENOMEM;
-
d = (new_start - old_start) * size;
-   

Re: [PATCH v3] drm/amdgpu: Add EXT_COHERENCE memory allocation flags

2023-07-27 Thread Felix Kuehling
In amdgpu_dma_buf_create_obj we copy the coherence-related flags to the 
SG BO that's used to attach the BO to the importer device. You need to 
add the new flag to the list.


Some more nit-picks inline.

Am 2023-07-26 um 09:34 schrieb David Francis:

These flags (for GEM and SVM allocations) allocate
memory that allows for system-scope atomic semantics.

On GFX943 these flags cause caches to be avoided on
non-local memory.

On all other ASICs they are identical in functionality to the
equivalent COHERENT flags.

Corresponding Thunk patch is at
https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/pull/88

Signed-off-by: David Francis 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |  2 ++
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c   |  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c   |  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c|  5 -
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 10 +-
  include/uapi/drm/amdgpu_drm.h|  7 +++
  include/uapi/linux/kfd_ioctl.h   |  3 +++
  7 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index d34c3ef8f3ed..7f23bc0ee592 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1738,6 +1738,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
  
  	if (flags & KFD_IOC_ALLOC_MEM_FLAGS_COHERENT)

alloc_flags |= AMDGPU_GEM_CREATE_COHERENT;
+   if (flags & KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENCE)
+   alloc_flags |= AMDGPU_GEM_CREATE_EXT_COHERENCE;
if (flags & KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED)
alloc_flags |= AMDGPU_GEM_CREATE_UNCACHED;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c

index 6b430e10d38e..8e951688668b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -632,6 +632,7 @@ static void gmc_v10_0_get_vm_pte(struct amdgpu_device *adev,
}
  
  	if (bo && bo->flags & (AMDGPU_GEM_CREATE_COHERENT |

+  AMDGPU_GEM_CREATE_EXT_COHERENCE |
   AMDGPU_GEM_CREATE_UNCACHED))
*flags = (*flags & ~AMDGPU_PTE_MTYPE_NV10_MASK) |
 AMDGPU_PTE_MTYPE_NV10(MTYPE_UC);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index a6ee0220db56..ff330c7c0232 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -540,6 +540,7 @@ static void gmc_v11_0_get_vm_pte(struct amdgpu_device *adev,
}
  
  	if (bo && bo->flags & (AMDGPU_GEM_CREATE_COHERENT |

+  AMDGPU_GEM_CREATE_EXT_COHERENCE |
   AMDGPU_GEM_CREATE_UNCACHED))
*flags = (*flags & ~AMDGPU_PTE_MTYPE_NV10_MASK) |
 AMDGPU_PTE_MTYPE_NV10(MTYPE_UC);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 880460cd3239..e40fcfc1a3f3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1183,7 +1183,8 @@ static void gmc_v9_0_get_coherence_flags(struct 
amdgpu_device *adev,
  {
struct amdgpu_device *bo_adev = amdgpu_ttm_adev(bo->tbo.bdev);
bool is_vram = bo->tbo.resource->mem_type == TTM_PL_VRAM;
-   bool coherent = bo->flags & AMDGPU_GEM_CREATE_COHERENT;
+	bool coherent = bo->flags & (AMDGPU_GEM_CREATE_COHERENT | AMDGPU_GEM_CREATE_EXT_COHERENCE);
+   bool ext_coherence = bo->flags & AMDGPU_GEM_CREATE_EXT_COHERENCE;
bool uncached = bo->flags & AMDGPU_GEM_CREATE_UNCACHED;
struct amdgpu_vm *vm = mapping->bo_va->base.vm;
unsigned int mtype_local, mtype;
@@ -1251,6 +1252,8 @@ static void gmc_v9_0_get_coherence_flags(struct 
amdgpu_device *adev,
snoop = true;
if (uncached) {
mtype = MTYPE_UC;
+   } else if (ext_coherence) {
+   mtype = is_local ? MTYPE_CC : MTYPE_UC;
} else if (adev->flags & AMD_IS_APU) {
mtype = is_local ? mtype_local : MTYPE_NC;
} else {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 1b50eae051a4..28304b93a990 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1155,7 +1155,8 @@ svm_range_get_pte_flags(struct kfd_node *node,
uint32_t mapping_flags = 0;
uint64_t pte_flags;
bool snoop = (domain != SVM_RANGE_VRAM_DOMAIN);
-   bool coherent = flags & KFD_IOCTL_SVM_FLAG_COHERENT;
+	bool coherent = flags & (KFD_IOCTL_SVM_FLAG_COHERENT | KFD_IOCTL_SVM_FLAG_EXT_COHERENCE);
+	bool ext_coherence = flags & KFD_IOCTL_SVM_FLAG_EXT_COHERENCE;

Re: [Patch V2 v2] drm/amdgpu: Checkpoint and Restore VRAM BOs without VA

2023-07-25 Thread Felix Kuehling

Am 2023-07-25 um 17:11 schrieb Ramesh Errabolu:

Extend checkpoint logic to allow inclusion of VRAM BOs that
do not have a VA attached

Signed-off-by: Ramesh Errabolu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 40ac093b5035..44c647c82070 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1845,7 +1845,8 @@ static uint32_t get_process_num_bos(struct kfd_process *p)
		idr_for_each_entry(&pdd->alloc_idr, mem, id) {
struct kgd_mem *kgd_mem = (struct kgd_mem *)mem;
  
-			if ((uint64_t)kgd_mem->va > pdd->gpuvm_base)

+   if (((uint64_t)kgd_mem->va > pdd->gpuvm_base) ||


Unnecessary parentheses around (a > b).



+   !kgd_mem->va)
num_of_bos++;
}
}
@@ -1917,7 +1918,11 @@ static int criu_checkpoint_bos(struct kfd_process *p,
kgd_mem = (struct kgd_mem *)mem;
dumper_bo = kgd_mem->bo;
  
-			if ((uint64_t)kgd_mem->va <= pdd->gpuvm_base)

+   /* Skip checkpointing BOs that are used for Trap handler
+* code and state. Currently, these BOs have a VA that
+* is less than GPUVM Base
+*/
+			if (((uint64_t)kgd_mem->va <= pdd->gpuvm_base) && kgd_mem->va)


Unnecessary parentheses around (a <= b). In this condition I'd also 
prefer to put kgd_mem->va first, because it short-circuits execution for 
the case that va is NULL.


With that fixed, the patch is

Reviewed-by: Felix Kuehling 




continue;
  
			bo_bucket = &bo_buckets[bo_index];


Re: [PATCH] drm/amdgpu: Checkpoint and Restore VRAM BOs without VA

2023-07-25 Thread Felix Kuehling

Am 2023-07-25 um 16:04 schrieb Errabolu, Ramesh:

[AMD Official Use Only - General]

Responses inline.

-Original Message-
From: Kuehling, Felix 
Sent: Monday, July 24, 2023 2:51 PM
To: amd-gfx@lists.freedesktop.org; Errabolu, Ramesh 
Subject: Re: [PATCH] drm/amdgpu: Checkpoint and Restore VRAM BOs without VA


On 2023-07-24 11:57, Ramesh Errabolu wrote:

Extend checkpoint logic to allow inclusion of VRAM BOs that do not
have a VA attached

Signed-off-by: Ramesh Errabolu 
---
   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 --
   1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 40ac093b5035..5cc00ff4b635 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1845,7 +1845,8 @@ static uint32_t get_process_num_bos(struct kfd_process *p)
    idr_for_each_entry(&pdd->alloc_idr, mem, id) {
   struct kgd_mem *kgd_mem = (struct kgd_mem *)mem;

- if ((uint64_t)kgd_mem->va > pdd->gpuvm_base)
+ if (((uint64_t)kgd_mem->va > pdd->gpuvm_base) ||
+ (kgd_mem->va == 0))

I'm trying to remember what this condition is there to protect against, because 
it almost looks like we could drop the entire condition. I think it's there to 
avoid checkpointing the TMA/TBA BOs allocated by KFD itself.

Ramesh: I am unsure as to how we can detect TMA/TBA BOs if we don't want them 
checkpointed. Alternatively we can checkpoint and restore TMA/TBA BOs without 
loss of function. If this is OK, we can drop the check that determines BO 
qualification.


It's OK. Currently they have a VA > 0 and < gpuvm_base. So this check 
will still work if you only allow BOs with VA == 0.


There is a patch in the works to move the TMA and TBA to the upper half 
of the virtual address space. Then we'll need to update this check to 
exclude anything that has bit 63 of the VA set.


Regards,
  Felix




That said, you have some unnecessary parentheses in this expression. And just 
using !x to check for 0 is usually preferred in the kernel. This should work 
and is more readable IMO:

+			if ((uint64_t)kgd_mem->va > pdd->gpuvm_base || !kgd_mem->va)



   num_of_bos++;
   }
   }
@@ -1917,7 +1918,8 @@ static int criu_checkpoint_bos(struct kfd_process *p,
   kgd_mem = (struct kgd_mem *)mem;
   dumper_bo = kgd_mem->bo;

- if ((uint64_t)kgd_mem->va <= pdd->gpuvm_base)
+ if (((uint64_t)kgd_mem->va <= pdd->gpuvm_base) &&
+ !(kgd_mem->va == 0))

Similar to above:

+			if (kgd_mem->va && (uint64_t)kgd_mem->va <= pdd->gpuvm_base)

Regards,
Felix



   continue;

			bo_bucket = &bo_buckets[bo_index];


Re: [PATCH] drm/amdkfd: start_cpsch don't map queues

2023-07-24 Thread Felix Kuehling

On 2023-07-24 13:52, Philip Yang wrote:

start_cpsch maps queues while kfd_init_node has a race condition with
IOMMUv2 init, which causes the gfx ring test to fail later. Remove the
mapping from start_cpsch because queues will be mapped when they are
created and resumed.

Reported-by: Michel Dänzer 
Signed-off-by: Philip Yang 


Reviewed-by: Felix Kuehling 

Michel, can you test whether this fixes your regression on Raven? Would 
be good to get a Tested-by for this patch, since we haven't been able to 
reproduce the problem yet.


Thanks,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 71b7f16c0173..a2d0d0bcf853 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1658,9 +1658,6 @@ static int start_cpsch(struct device_queue_manager *dqm)
dqm->is_resetting = false;
dqm->sched_running = true;
  
-	if (!dqm->dev->kfd->shared_resources.enable_mes)

-		execute_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0, USE_DEFAULT_GRACE_PERIOD);
-
/* Set CWSR grace period to 1x1000 cycle for GFX9.4.3 APU */
if (amdgpu_emu_mode == 0 && dqm->dev->adev->gmc.is_app_apu &&
(KFD_GC_VERSION(dqm->dev) == IP_VERSION(9, 4, 3))) {


Re: [PATCH] drm/amdgpu: Checkpoint and Restore VRAM BOs without VA

2023-07-24 Thread Felix Kuehling



On 2023-07-24 11:57, Ramesh Errabolu wrote:

Extend checkpoint logic to allow inclusion of VRAM BOs that
do not have a VA attached

Signed-off-by: Ramesh Errabolu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 40ac093b5035..5cc00ff4b635 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1845,7 +1845,8 @@ static uint32_t get_process_num_bos(struct kfd_process *p)
		idr_for_each_entry(&pdd->alloc_idr, mem, id) {
struct kgd_mem *kgd_mem = (struct kgd_mem *)mem;
  
-			if ((uint64_t)kgd_mem->va > pdd->gpuvm_base)

+   if (((uint64_t)kgd_mem->va > pdd->gpuvm_base) ||
+   (kgd_mem->va == 0))


I'm trying to remember what this condition is there to protect against, 
because it almost looks like we could drop the entire condition. I think 
it's there to avoid checkpointing the TMA/TBA BOs allocated by KFD itself.


That said, you have some unnecessary parentheses in this expression. And 
just using !x to check for 0 is usually preferred in the kernel. This 
should work and is more readable IMO:


+			if ((uint64_t)kgd_mem->va > pdd->gpuvm_base || !kgd_mem->va)



num_of_bos++;
}
}
@@ -1917,7 +1918,8 @@ static int criu_checkpoint_bos(struct kfd_process *p,
kgd_mem = (struct kgd_mem *)mem;
dumper_bo = kgd_mem->bo;
  
-			if ((uint64_t)kgd_mem->va <= pdd->gpuvm_base)

+   if (((uint64_t)kgd_mem->va <= pdd->gpuvm_base) &&
+   !(kgd_mem->va == 0))


Similar to above:

+			if (kgd_mem->va && (uint64_t)kgd_mem->va <= pdd->gpuvm_base)

Regards,
  Felix



continue;
  
			bo_bucket = &bo_buckets[bo_index];


Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-07-19 Thread Felix Kuehling

Am 2023-07-19 um 17:22 schrieb Alex Sierra:

Set dynamic_svm_range_dump macro to avoid iterating over SVM lists
from svm_range_debug_dump when dynamic debug is disabled. Otherwise,
it could drop performance, especially with a big number of SVM ranges.
Make sure both svm_range_set_attr and svm_range_debug_dump functions
are dynamically enabled to print svm_range_debug_dump debug traces.

Signed-off-by: Alex Sierra 
Tested-by: Alex Sierra 
Signed-off-by: Philip Yang 
Signed-off-by: Felix Kuehling 


I don't think my name on a Signed-off-by is appropriate here. I didn't 
write the patch. And I'm not submitting it. However, the patch is


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 479c4f66afa7..1b50eae051a4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -46,6 +46,8 @@
   * page table is updated.
   */
  #define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING  (2UL * NSEC_PER_MSEC)
+#define dynamic_svm_range_dump(svms) \
+   _dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)
  
  /* Giant svm range split into smaller ranges based on this, it is decided using

   * minimum of all dGPU/APU 1/32 VRAM size, between 2MB to 1GB and alignment to
@@ -3563,7 +3565,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
break;
}
  
-	svm_range_debug_dump(svms);

+   dynamic_svm_range_dump(svms);
  
	mutex_unlock(&svms->lock);

mmap_read_unlock(mm);


Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-07-19 Thread Felix Kuehling

Am 2023-07-19 um 14:03 schrieb Alex Sierra:

Set dynamic_svm_range_dump macro to avoid iterating over SVM lists
from svm_range_debug_dump when dynamic debug is disabled. Otherwise,
it could drop performance, especially with a big number of SVM ranges.
Make sure both svm_range_set_attr and svm_range_debug_dump functions
are dynamically enabled to print svm_range_debug_dump debug traces.

Signed-off-by: Alex Sierra 
Tested-by: Alex Sierra 
Signed-off-by: Philip Yang 
Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +++
  2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 479c4f66afa7..0687f27f506c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3563,7 +3563,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
break;
}
  
-	svm_range_debug_dump(svms);

+   dynamic_svm_range_dump(svms);
  
	mutex_unlock(&svms->lock);

mmap_read_unlock(mm);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 21b14510882b..ed4cd501fafe 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -39,6 +39,9 @@
  #define SVM_ADEV_PGMAP_OWNER(adev)\
((adev)->hive ? (void *)(adev)->hive : (void *)(adev))
  
+#define dynamic_svm_range_dump(svms) \

+   _dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)
+


This should be in kfd_svm.c. The function svm_range_debug_dump is a 
static function in that file. This macro is not useful outside of it.


Regards,
  Felix



  struct svm_range_bo {
struct amdgpu_bo*bo;
struct kref kref;


Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-07-19 Thread Felix Kuehling

Am 2023-07-08 um 12:57 schrieb Alex Sierra:

svm_range_debug_dump should not be called at all when dynamic debug
is disabled to avoid iterating over SVM lists. This could drop
performance, especially with a big number of SVM ranges.

Signed-off-by: Alex Sierra 
Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 479c4f66afa7..4fb427fc5942 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -821,7 +821,7 @@ svm_range_is_same_attrs(struct kfd_process *p, struct 
svm_range *prange,
   *
   * Context: The caller must hold svms->lock
   */
-static void svm_range_debug_dump(struct svm_range_list *svms)
+static int svm_range_debug_dump(struct svm_range_list *svms)
  {
struct interval_tree_node *node;
struct svm_range *prange;
@@ -847,6 +847,8 @@ static void svm_range_debug_dump(struct svm_range_list 
*svms)
 prange->actual_loc);
node = interval_tree_iter_next(node, 0, ~0ULL);
}
+
+   return 0;
  }
  
  static int

@@ -3563,7 +3565,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
break;
}
  
-	svm_range_debug_dump(svms);

+   pr_debug("%d", svm_range_debug_dump(svms));


This is a bit hacky. I would use the way dynamic_hex_dump is defined as 
an example for how to do this without the dummy pr_debug and without 
returning a dummy result from svm_range_debug_dump:


#define dynamic_svm_range_dump(svms) \
_dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)

Then instead of calling svm_range_debug_dump directly, call 
dynamic_svm_range_dump(svms).


Regards,
  Felix


  
	mutex_unlock(&svms->lock);

mmap_read_unlock(mm);


Re: [PATCH] drm/amdkfd: enable cooperative groups for gfx11

2023-07-19 Thread Felix Kuehling

Am 2023-07-19 um 10:36 schrieb Jonathan Kim:

MES can concurrently schedule queues on the device that require
exclusive device access if marked exclusively_scheduled without the
requirement of GWS.  Similar to the F32 HWS, MES will manage
quality of service for these queues.
Use this for cooperative groups since cooperative groups are device
occupancy limited.

Since some GFX11 devices can only be debugged with partial CUs, do not
allow the debugging of cooperative groups on these devices as the CU
occupancy limit will change on attach.

In addition, zero initialize the MES add queue submission vector for MES
initialization tests as we do not want these to be cooperative
dispatches.

v2: fix up indentation and comments.
remove unnecessary perf warning on oversubscription.
change 0 init to 0 memset to deal with padding.

Signed-off-by: Jonathan Kim 


Sorry. More indentation nit-picks inline. With those fixed, the patch is

Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c  |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h  |  1 +
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c   |  2 ++
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |  3 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c   |  3 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_device.c  |  6 +-
  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c|  7 ++-
  .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c   | 12 
  drivers/gpu/drm/amd/include/mes_v11_api_def.h|  4 +++-
  9 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index f808841310fd..72ab6a838bb6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -642,6 +642,8 @@ int amdgpu_mes_add_hw_queue(struct amdgpu_device *adev, int 
gang_id,
unsigned long flags;
int r;
  
+	memset(&queue_input, 0, sizeof(struct mes_add_queue_input));

+
/* allocate the mes queue buffer */
queue = kzalloc(sizeof(struct amdgpu_mes_queue), GFP_KERNEL);
if (!queue) {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index 2d6ac30b7135..2053954a235c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -224,6 +224,7 @@ struct mes_add_queue_input {
uint32_tis_kfd_process;
uint32_tis_aql_queue;
uint32_tqueue_size;
+   uint32_texclusively_scheduled;
  };
  
  struct mes_remove_queue_input {

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 1bdaa00c0b46..8e67e965f7ea 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -214,6 +214,8 @@ static int mes_v11_0_add_hw_queue(struct amdgpu_mes *mes,
mes_add_queue_pkt.is_aql_queue = input->is_aql_queue;
mes_add_queue_pkt.gds_size = input->queue_size;
  
+	mes_add_queue_pkt.exclusively_scheduled = input->exclusively_scheduled;

+
return mes_v11_0_submit_pkt_and_poll_completion(mes,
			&mes_add_queue_pkt, sizeof(mes_add_queue_pkt),
offsetof(union MESAPI__ADD_QUEUE, api_status));
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 40ac093b5035..e0f9cf6dd8fd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1487,7 +1487,8 @@ static int kfd_ioctl_alloc_queue_gws(struct file *filep,
goto out_unlock;
}
  
-	if (!kfd_dbg_has_gws_support(dev) && p->debug_trap_enabled) {

+   if (p->debug_trap_enabled && (!kfd_dbg_has_gws_support(dev) ||
+ kfd_dbg_has_cwsr_workaround(dev))) {
retval = -EBUSY;
goto out_unlock;
}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index ccfc81f085ce..1f82caea59ba 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -753,7 +753,8 @@ int kfd_dbg_trap_enable(struct kfd_process *target, 
uint32_t fd,
if (!KFD_IS_SOC15(pdd->dev))
return -ENODEV;
  
-		if (!kfd_dbg_has_gws_support(pdd->dev) && pdd->qpd.num_gws)

+   if (pdd->qpd.num_gws && (!kfd_dbg_has_gws_support(pdd->dev) ||
+kfd_dbg_has_cwsr_workaround(pdd->dev)))
return -EBUSY;
}
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c

index 0b3dc754e06b..ebc9674d3ce1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -508,6 +508,7 @@ stati

Re: [PATCH v2 2/4] drm/amdkfd: use vma_is_initial_stack() and vma_is_initial_heap()

2023-07-19 Thread Felix Kuehling



Am 2023-07-19 um 03:51 schrieb Kefeng Wang:

Use the helpers to simplify code.

Cc: Felix Kuehling 
Cc: Alex Deucher 
Cc: "Christian König" 
Cc: "Pan, Xinhui" 
Cc: David Airlie 
Cc: Daniel Vetter 
Signed-off-by: Kefeng Wang 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 5ff1a5a89d96..0b7bfbd0cb66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2621,10 +2621,7 @@ svm_range_get_range_boundaries(struct kfd_process *p, 
int64_t addr,
return -EFAULT;
}
  
-	*is_heap_stack = (vma->vm_start <= vma->vm_mm->brk &&

- vma->vm_end >= vma->vm_mm->start_brk) ||
-(vma->vm_start <= vma->vm_mm->start_stack &&
- vma->vm_end >= vma->vm_mm->start_stack);
+   *is_heap_stack = vma_is_initial_heap(vma) || vma_is_initial_stack(vma);
  
  	start_limit = max(vma->vm_start >> PAGE_SHIFT,

  (unsigned long)ALIGN_DOWN(addr, 2UL << 8));


Re: [PATCH 1/2] drm/amdkfd: fix trap handling work around for debugging

2023-07-19 Thread Felix Kuehling

Am 2023-07-14 um 05:37 schrieb Jonathan Kim:

Update the list of devices that require the cwsr trap handling
workaround for debugging use cases.

Signed-off-by: Jonathan Kim 


This patch is

Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c| 5 ++---
  drivers/gpu/drm/amd/amdkfd/kfd_debug.h| 6 ++
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 6 ++
  3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index 190b03efe5ff..ccfc81f085ce 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -302,8 +302,7 @@ static int kfd_dbg_set_queue_workaround(struct queue *q, 
bool enable)
if (!q)
return 0;
  
-	if (KFD_GC_VERSION(q->device) < IP_VERSION(11, 0, 0) ||

-   KFD_GC_VERSION(q->device) >= IP_VERSION(12, 0, 0))
+   if (!kfd_dbg_has_cwsr_workaround(q->device))
return 0;
  
  	if (enable && q->properties.is_user_cu_masked)

@@ -349,7 +348,7 @@ int kfd_dbg_set_mes_debug_mode(struct kfd_process_device 
*pdd)
  {
	uint32_t spi_dbg_cntl = pdd->spi_dbg_override | pdd->spi_dbg_launch_mode;
uint32_t flags = pdd->process->dbg_flags;
-   bool sq_trap_en = !!spi_dbg_cntl;
+	bool sq_trap_en = !!spi_dbg_cntl || !kfd_dbg_has_cwsr_workaround(pdd->dev);
  
  	if (!kfd_dbg_is_per_vmid_supported(pdd->dev))

return 0;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.h
index ba616ed17dee..586d7f886712 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.h
@@ -101,6 +101,12 @@ static inline bool kfd_dbg_is_rlc_restore_supported(struct 
kfd_node *dev)
 KFD_GC_VERSION(dev) == IP_VERSION(10, 1, 1));
  }
  
+static inline bool kfd_dbg_has_cwsr_workaround(struct kfd_node *dev)

+{
+   return KFD_GC_VERSION(dev) >= IP_VERSION(11, 0, 0) &&
+  KFD_GC_VERSION(dev) <= IP_VERSION(11, 0, 3);
+}
+
  static inline bool kfd_dbg_has_gws_support(struct kfd_node *dev)
  {
if ((KFD_GC_VERSION(dev) == IP_VERSION(9, 0, 1)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 31cac1fd0d58..761963ad6154 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -226,8 +226,7 @@ static int add_queue_mes(struct device_queue_manager *dqm, 
struct queue *q,
queue_input.paging = false;
queue_input.tba_addr = qpd->tba_addr;
queue_input.tma_addr = qpd->tma_addr;
-	queue_input.trap_en = KFD_GC_VERSION(q->device) < IP_VERSION(11, 0, 0) ||
-			      KFD_GC_VERSION(q->device) > IP_VERSION(11, 0, 3);
+   queue_input.trap_en = !kfd_dbg_has_cwsr_workaround(q->device);
	queue_input.skip_process_ctx_clear = qpd->pqm->process->debug_trap_enabled;
  
  	queue_type = convert_to_mes_queue_type(q->properties.type);

@@ -1827,8 +1826,7 @@ static int create_queue_cpsch(struct device_queue_manager 
*dqm, struct queue *q,
 */
q->properties.is_evicted = !!qpd->evicted;
q->properties.is_dbg_wa = qpd->pqm->process->debug_trap_enabled &&
-   KFD_GC_VERSION(q->device) >= IP_VERSION(11, 0, 0) &&
-   KFD_GC_VERSION(q->device) <= IP_VERSION(11, 0, 3);
+ kfd_dbg_has_cwsr_workaround(q->device);
  
  	if (qd)

		mqd_mgr->restore_mqd(mqd_mgr, &q->mqd, q->mqd_mem_obj, &q->gart_mqd_addr,
>gart_mqd_addr,


Re: [PATCH 2/2] drm/amdkfd: enable cooperative groups for gfx11

2023-07-18 Thread Felix Kuehling



Am 2023-07-14 um 05:37 schrieb Jonathan Kim:

MES can concurrently schedule queues on the device that require
exclusive device access if marked exclusively_scheduled without the
requirement of GWS.  Similar to the F32 HWS, MES will manage
quality of service for these queues.
Use this for cooperative groups since cooperative groups are device
occupancy limited.

Since some GFX11 devices can only be debugged with partial CUs, do not
allow the debugging of cooperative groups on these devices as the CU
occupancy limit will change on attach.

In addition, zero initialize the MES add queue submission vector for MES
initialization tests as we do not want these to be cooperative
dispatches.

NOTE: FIXME MES FW enablement checks are a placeholder at the moment and
will be updated when the binary revision number is finalized.

Signed-off-by: Jonathan Kim 

Some nit-picks inline. Looks good to me otherwise.



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   |  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   |  1 +
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c|  2 ++
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  3 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c|  3 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c |  9 -
  .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c| 11 +++
  drivers/gpu/drm/amd/include/mes_v11_api_def.h |  4 +++-
  9 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index e9091ebfe230..8d13623389d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -638,7 +638,7 @@ int amdgpu_mes_add_hw_queue(struct amdgpu_device *adev, int 
gang_id,
  {
struct amdgpu_mes_queue *queue;
struct amdgpu_mes_gang *gang;
-   struct mes_add_queue_input queue_input;
+   struct mes_add_queue_input queue_input = {0};


 Generally, it is preferred to use memset to initialize structures on 
the stack because that also initializes padding.




unsigned long flags;
int r;
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h

index 2d6ac30b7135..2053954a235c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -224,6 +224,7 @@ struct mes_add_queue_input {
uint32_tis_kfd_process;
uint32_tis_aql_queue;
uint32_tqueue_size;
+   uint32_texclusively_scheduled;
  };
  
  struct mes_remove_queue_input {

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 1bdaa00c0b46..8e67e965f7ea 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -214,6 +214,8 @@ static int mes_v11_0_add_hw_queue(struct amdgpu_mes *mes,
mes_add_queue_pkt.is_aql_queue = input->is_aql_queue;
mes_add_queue_pkt.gds_size = input->queue_size;
  
+	mes_add_queue_pkt.exclusively_scheduled = input->exclusively_scheduled;

+
return mes_v11_0_submit_pkt_and_poll_completion(mes,
			&mes_add_queue_pkt, sizeof(mes_add_queue_pkt),
offsetof(union MESAPI__ADD_QUEUE, api_status));
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 40ac093b5035..e18401811956 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1487,7 +1487,8 @@ static int kfd_ioctl_alloc_queue_gws(struct file *filep,
goto out_unlock;
}
  
-	if (!kfd_dbg_has_gws_support(dev) && p->debug_trap_enabled) {

+   if (p->debug_trap_enabled && (!kfd_dbg_has_gws_support(dev) ||
+  kfd_dbg_has_cwsr_workaround(dev))) {


Indentation looks off. kfd_dbg_has_cwsr_workaround should be indented 
one less space. Otherwise you may be incorrectly implying that the ! 
applies to it.




retval = -EBUSY;
goto out_unlock;
}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index ccfc81f085ce..895e7f690fd0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -753,7 +753,8 @@ int kfd_dbg_trap_enable(struct kfd_process *target, 
uint32_t fd,
if (!KFD_IS_SOC15(pdd->dev))
return -ENODEV;
  
-		if (!kfd_dbg_has_gws_support(pdd->dev) && pdd->qpd.num_gws)

+   if (pdd->qpd.num_gws && (!kfd_dbg_has_gws_support(pdd->dev) ||
+ 
kfd_dbg_has_cwsr_workaround(pdd->dev)))


Same as above.



return -EBUSY;
}
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 

Re: [PATCH 4/4] drm/amdgpu: use a macro to define no xcp partition case

2023-07-17 Thread Felix Kuehling

On 2023-07-16 22:26, Guchun Chen wrote:

~0 is used in several places to mean "no xcp partition", so improve its
definition with a macro for code consistency.

Suggested-by: Christian König 
Signed-off-by: Guchun Chen 


The series is

Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 3 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c  | 4 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h  | 2 ++
  drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c   | 4 ++--
  4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a7f314ddd173..d34c3ef8f3ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1709,7 +1709,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
alloc_flags |= (flags & KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC) 
?
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0;
}
-   xcp_id = fpriv->xcp_id == ~0 ? 0 : fpriv->xcp_id;
+   xcp_id = fpriv->xcp_id == AMDGPU_XCP_NO_PARTITION ?
+   0 : fpriv->xcp_id;
} else if (flags & KFD_IOC_ALLOC_MEM_FLAGS_GTT) {
domain = alloc_domain = AMDGPU_GEM_DOMAIN_GTT;
alloc_flags = 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
index d175e862f222..9c9cca129498 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c
@@ -363,7 +363,7 @@ int amdgpu_xcp_open_device(struct amdgpu_device *adev,
if (!adev->xcp_mgr)
return 0;
  
-	fpriv->xcp_id = ~0;

+   fpriv->xcp_id = AMDGPU_XCP_NO_PARTITION;
for (i = 0; i < MAX_XCP; ++i) {
if (!adev->xcp_mgr->xcp[i].ddev)
break;
@@ -381,7 +381,7 @@ int amdgpu_xcp_open_device(struct amdgpu_device *adev,
}
}
  
-	fpriv->vm.mem_id = fpriv->xcp_id == ~0 ? -1 :

+   fpriv->vm.mem_id = fpriv->xcp_id == AMDGPU_XCP_NO_PARTITION ? -1 :
adev->xcp_mgr->xcp[fpriv->xcp_id].mem_id;
return 0;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
index 0f8026d64ea5..9a1036aeec2a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.h
@@ -37,6 +37,8 @@
  #define AMDGPU_XCP_FL_NONE 0
  #define AMDGPU_XCP_FL_LOCKED (1 << 0)
  
+#define AMDGPU_XCP_NO_PARTITION (~0)

+
  struct amdgpu_fpriv;
  
  enum AMDGPU_XCP_IP_BLOCK {

diff --git a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c 
b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
index 16471b81a1f5..72b629a78c62 100644
--- a/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
+++ b/drivers/gpu/drm/amd/amdgpu/aqua_vanjaram.c
@@ -68,7 +68,7 @@ static void aqua_vanjaram_set_xcp_id(struct amdgpu_device 
*adev,
enum AMDGPU_XCP_IP_BLOCK ip_blk;
uint32_t inst_mask;
  
-	ring->xcp_id = ~0;

+   ring->xcp_id = AMDGPU_XCP_NO_PARTITION;
if (adev->xcp_mgr->mode == AMDGPU_XCP_MODE_NONE)
return;
  
@@ -177,7 +177,7 @@ static int aqua_vanjaram_select_scheds(

u32 sel_xcp_id;
int i;
  
-	if (fpriv->xcp_id == ~0) {

+   if (fpriv->xcp_id == AMDGPU_XCP_NO_PARTITION) {
u32 least_ref_cnt = ~0;
  
  		fpriv->xcp_id = 0;


Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()

2023-07-14 Thread Felix Kuehling

Am 2023-07-14 um 10:26 schrieb Vlastimil Babka:

On 7/12/23 18:24, Felix Kuehling wrote:

Allocations in the heap and stack tend to be small, with several
allocations sharing the same page. Sharing the same page for different
allocations with different access patterns leads to thrashing when we
migrate data back and forth on GPU and CPU access. To avoid this we
disable HMM migrations for heap and stack VMAs.

I wonder how well this really works in practice. AFAIK "heaps" (malloc())
today use various arenas obtained by mmap() and not a single brk()-managed
space anymore. And programs might be multithreaded, and thus have multiple
stacks, while vma_is_stack() will recognize only the initial one...


Thanks for these pointers. I have not heard of such problems with mmap 
arenas and multiple thread stacks in practice. But I'll keep it in mind 
in case we observe unexpected thrashing in the future. FWIW, we once had 
the opposite problem of a custom malloc implementation that used sbrk 
for very large allocations. This disabled migrations of large buffers 
unexpectedly.


I agree that eventually we'll want a more dynamic way of detecting and 
suppressing thrashing that's based on observed memory access patterns. 
Getting this right is probably trickier than it sounds, so I'd prefer to 
have some more experience with real workloads to use as benchmarks. 
Compared to other things we're working on, this is fairly low on our 
priority list at the moment. Using the VMA flags is a simple and 
effective method for now, at least until we see it failing in real 
workloads.


Regards,
  Felix




Vlastimil


Regards,
    Felix


Am 2023-07-12 um 10:42 schrieb Christoph Hellwig:

On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote:

Use the helpers to simplify code.

Nothing against your addition of a helper, but a GPU driver really
should have no business even looking at this information..




Re: [PATCH v3 09/12] drm/amdgpu: use doorbell manager for kfd process doorbells

2023-07-13 Thread Felix Kuehling

On 2023-06-20 13:16, Shashank Sharma wrote:

This patch:
- adds a doorbell object in kfd pdd structure.
- allocates doorbells for a process while creating its queue.
- frees the doorbells with pdd destroy.
- moves doorbell bitmap init function to kfd_doorbell.c

PS: This patch ensures that we don't break the existing KFD
 functionality, but now KFD userspace library should also
 create doorbell pages as AMDGPU GEM objects using libdrm
 functions in userspace. The reference code for the same
 is available with AMDGPU Usermode queue libdrm MR. Once
 this is done, we will not need to create process doorbells
 in kernel.

V2: - Do not use doorbell wrapper API, use amdgpu_bo_create_kernel
   instead (Alex).
 - Do not use custom doorbell structure, instead use separate
   variables for bo and doorbell_bitmap (Alex)
V3:
- Do not allocate doorbell page with PDD, delay doorbell process
  page allocation until really needed (Felix)

Cc: Alex Deucher 
Cc: Christian Koenig 
Cc: Felix Kuehling 
Acked-by: Christian König 
Signed-off-by: Shashank Sharma 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  |  20 ++--
  .../drm/amd/amdkfd/kfd_device_queue_manager.c |   8 +-
  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 103 +-
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   9 +-
  drivers/gpu/drm/amd/amdkfd/kfd_process.c  |  40 +--
  .../amd/amdkfd/kfd_process_queue_manager.c|  23 ++--
  6 files changed, 108 insertions(+), 95 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 1b54a9aaae70..5d4f4fca793a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -327,10 +327,12 @@ static int kfd_ioctl_create_queue(struct file *filep, 
struct kfd_process *p,
goto err_bind_process;
}
  
-	if (!pdd->doorbell_index &&

-   kfd_alloc_process_doorbells(dev, >doorbell_index) < 0) {
-   err = -ENOMEM;
-   goto err_alloc_doorbells;
+   if (!pdd->qpd.proc_doorbells) {
+   err = kfd_alloc_process_doorbells(dev, pdd);
+   if (err) {
+   pr_debug("failed to allocate process doorbells\n");
+   goto err_bind_process;
+   }
}
  
  	/* Starting with GFX11, wptr BOs must be mapped to GART for MES to determine work

@@ -410,7 +412,6 @@ static int kfd_ioctl_create_queue(struct file *filep, 
struct kfd_process *p,
if (wptr_bo)
amdgpu_amdkfd_free_gtt_mem(dev->adev, wptr_bo);
  err_wptr_map_gart:
-err_alloc_doorbells:
  err_bind_process:
  err_pdd:
mutex_unlock(>mutex);
@@ -2239,11 +2240,12 @@ static int criu_restore_devices(struct kfd_process *p,
goto exit;
}
  
-		if (!pdd->doorbell_index &&

-   kfd_alloc_process_doorbells(pdd->dev, >doorbell_index) 
< 0) {
-   ret = -ENOMEM;
-   goto exit;
+   if (!pdd->qpd.proc_doorbells) {
+   ret = kfd_alloc_process_doorbells(dev, pdd);
+   if (ret)
+   goto exit;
}
+
}
  
  	/*

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 7a95698d83f7..834f640cf807 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -371,7 +371,7 @@ static int allocate_doorbell(struct qcm_process_device *qpd,
unsigned int found;
  
  			found = find_first_zero_bit(qpd->doorbell_bitmap,

-   
KFD_MAX_NUM_OF_QUEUES_PER_PROCESS);
+   
KFD_MAX_NUM_OF_QUEUES_PER_PROCESS);
if (found >= KFD_MAX_NUM_OF_QUEUES_PER_PROCESS) {
pr_debug("No doorbells available");
return -EBUSY;
@@ -381,9 +381,9 @@ static int allocate_doorbell(struct qcm_process_device *qpd,
}
}
  
-	q->properties.doorbell_off =

-   kfd_get_doorbell_dw_offset_in_bar(dev, qpd_to_pdd(qpd),
- q->doorbell_id);
+   q->properties.doorbell_off = amdgpu_doorbell_index_on_bar(dev->adev,
+ 
qpd->proc_doorbells,
+ 
q->doorbell_id);
return 0;
  }
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c

index f7d45057ed32..c9ca21e1a99a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/

Re: [PATCH v3 08/12] drm/amdgpu: use doorbell manager for kfd kernel doorbells

2023-07-13 Thread Felix Kuehling

On 2023-06-20 13:16, Shashank Sharma wrote:

This patch:
- adds a doorbell bo in kfd device structure.
- creates doorbell page for kfd kernel usages.
- updates the get_kernel_doorbell and free_kernel_doorbell functions
   accordingly

V2: Do not use wrapper API, use direct amdgpu_create_kernel(Alex)
V3:
  - Move single variable declaration below (Christian)
  - Add a to-do item to reuse the KGD kernel level doorbells for
KFD for non-MES cases, instead of reserving one page (Felix)

Cc: Alex Deucher 
Cc: Christian Koenig 
Cc: Felix Kuehling 
Signed-off-by: Shashank Sharma 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c   |   2 -
  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 109 +++---
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   6 ++
  3 files changed, 39 insertions(+), 78 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 00f528eb9812..36fbe9c840ee 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -437,8 +437,6 @@ struct kfd_dev *kgd2kfd_probe(struct amdgpu_device *adev, 
bool vf)
atomic_set(>compute_profile, 0);
  
  	mutex_init(>doorbell_mutex);

-   memset(>doorbell_available_index, 0,
-   sizeof(kfd->doorbell_available_index));
  
  	atomic_set(>sram_ecc_flag, 0);
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c

index 38c9e1ca6691..f7d45057ed32 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -61,81 +61,46 @@ size_t kfd_doorbell_process_slice(struct kfd_dev *kfd)
  /* Doorbell calculations for device init. */
  int kfd_doorbell_init(struct kfd_dev *kfd)
  {
-   size_t doorbell_start_offset;
-   size_t doorbell_aperture_size;
-   size_t doorbell_process_limit;
+   int size = PAGE_SIZE;
+   int r;
  
  	/*

-* With MES enabled, just set the doorbell base as it is needed
-* to calculate doorbell physical address.
-*/
-   if (kfd->shared_resources.enable_mes) {
-   kfd->doorbell_base =
-   kfd->shared_resources.doorbell_physical_address;
-   return 0;
-   }
-
-   /*
-* We start with calculations in bytes because the input data might
-* only be byte-aligned.
-* Only after we have done the rounding can we assume any alignment.
+* Todo: KFD kernel level operations need only one doorbell for
+* ring test/HWS. So instead of reserving a whole page here for
+* kernel, reserve and consume a doorbell from existing KGD kernel
+* doorbell page.
 */
  
-	doorbell_start_offset =

-   roundup(kfd->shared_resources.doorbell_start_offset,
-   kfd_doorbell_process_slice(kfd));
-
-   doorbell_aperture_size =
-   rounddown(kfd->shared_resources.doorbell_aperture_size,
-   kfd_doorbell_process_slice(kfd));
-
-   if (doorbell_aperture_size > doorbell_start_offset)
-   doorbell_process_limit =
-   (doorbell_aperture_size - doorbell_start_offset) /
-   kfd_doorbell_process_slice(kfd);
-   else
-   return -ENOSPC;
-
-   if (!kfd->max_doorbell_slices ||
-   doorbell_process_limit < kfd->max_doorbell_slices)
-   kfd->max_doorbell_slices = doorbell_process_limit;
-
-   kfd->doorbell_base = kfd->shared_resources.doorbell_physical_address +
-   doorbell_start_offset;
-
-   kfd->doorbell_base_dw_offset = doorbell_start_offset / sizeof(u32);
-
-   kfd->doorbell_kernel_ptr = ioremap(kfd->doorbell_base,
-  kfd_doorbell_process_slice(kfd));
-
-   if (!kfd->doorbell_kernel_ptr)
+   /* Bitmap to dynamically allocate doorbells from kernel page */
+   kfd->doorbell_bitmap = bitmap_zalloc(size / sizeof(u32), GFP_KERNEL);
+   if (!kfd->doorbell_bitmap) {
+   DRM_ERROR("Failed to allocate kernel doorbell bitmap\n");
return -ENOMEM;
+   }
  
-	pr_debug("Doorbell initialization:\n");

-   pr_debug("doorbell base   == 0x%08lX\n",
-   (uintptr_t)kfd->doorbell_base);
-
-   pr_debug("doorbell_base_dw_offset  == 0x%08lX\n",
-   kfd->doorbell_base_dw_offset);
-
-   pr_debug("doorbell_process_limit  == 0x%08lX\n",
-   doorbell_process_limit);
-
-   pr_debug("doorbell_kernel_offset  == 0x%08lX\n",
-   (uintptr_t)kfd->doorbell_base);
-
-   pr_debug("doorbell aperture size  

Re: [PATCH Review V2 2/2] drm/amdgpu: Disable RAS by default on APU platform

2023-07-13 Thread Felix Kuehling



On 2023-07-13 10:50, Stanley.Yang wrote:

Disable RAS feature by default for aqua vanjaram on APU platform.

Changes from V1:
Split "Disable RAS by default on APU platform" into a
separate patch.

Signed-off-by: Stanley.Yang 
Reviewed-by: Hawking Zhang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 8673d9790bb0..ec5f60b64346 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2517,6 +2517,15 @@ static void amdgpu_ras_check_supported(struct 
amdgpu_device *adev)
adev->ras_hw_enabled |= (1 << AMDGPU_RAS_BLOCK__GFX |
   1 << AMDGPU_RAS_BLOCK__SDMA |
   1 << AMDGPU_RAS_BLOCK__MMHUB);
+
+   if (adev->ip_versions[MP0_HWIP][0] == IP_VERSION(13, 0, 6)) {
+   /*
+* Disable ras feature for aqua vanjaram
+* by default on apu platform.
+*/
+   if (-1 == amdgpu_ras_enable)
+   amdgpu_ras_enable = 0;
Changing a global variable here is probably not appropriate. The 
condition above suggests that this should affect a device-specific 
variable only.


Regards,
  Felix



+   }
}
  
  	amdgpu_ras_get_quirks(adev);


Re: [PATCH 3/5] drm/amdkfd: use vma_is_stack() and vma_is_heap()

2023-07-12 Thread Felix Kuehling
Allocations in the heap and stack tend to be small, with several 
allocations sharing the same page. Sharing the same page for different 
allocations with different access patterns leads to thrashing when we 
migrate data back and forth on GPU and CPU access. To avoid this we 
disable HMM migrations for heap and stack VMAs.


Regards,
  Felix


Am 2023-07-12 um 10:42 schrieb Christoph Hellwig:

On Wed, Jul 12, 2023 at 10:38:29PM +0800, Kefeng Wang wrote:

Use the helpers to simplify code.

Nothing against your addition of a helper, but a GPU driver really
should have no business even looking at this information..




Re: [PATCH v5 04/10] drm/amdgpu: create GFX-gen11 usermode queue

2023-07-12 Thread Felix Kuehling

Am 2023-07-12 um 11:55 schrieb Shashank Sharma:


On 11/07/2023 21:51, Felix Kuehling wrote:


On 2023-07-06 09:39, Christian König wrote:

Am 06.07.23 um 15:37 schrieb Shashank Sharma:


On 06/07/2023 15:22, Christian König wrote:

Am 06.07.23 um 14:35 schrieb Shashank Sharma:

A Memory queue descriptor (MQD) of a userqueue defines it in
the hw's context. As MQD format can vary between different
graphics IPs, we need gfx GEN specific handlers to create MQDs.

This patch:
- Introduces MQD handler functions for the usermode queues.
- Adds new functions to create and destroy userqueue MQD for
   GFX-GEN-11 IP

V1: Worked on review comments from Alex:
 - Make MQD functions GEN and IP specific

V2: Worked on review comments from Alex:
 - Reuse the existing adev->mqd[ip] for MQD creation
 - Formatting and arrangement of code

V3:
 - Integration with doorbell manager

V4: Review comments addressed:
 - Do not create a new file for userq, reuse gfx_v11_0.c (Alex)
 - Align name of structure members (Luben)
 - Don't break up the Cc tag list and the Sob tag list in commit
   message (Luben)
V5:
    - No need to reserve the bo for MQD (Christian).
    - Some more changes to support IP specific MQD creation.

Cc: Alex Deucher 
Cc: Christian Koenig 
Signed-off-by: Shashank Sharma 
Signed-off-by: Arvind Yadav 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 16 
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c    | 73 
+++

  .../gpu/drm/amd/include/amdgpu_userqueue.h    |  7 ++
  3 files changed, 96 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c

index e37b5da5a0d0..bb774144c372 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
@@ -134,12 +134,28 @@ int amdgpu_userq_ioctl(struct drm_device 
*dev, void *data,

  return r;
  }
  +extern const struct amdgpu_userq_funcs userq_gfx_v11_funcs;
+
+static void
+amdgpu_userqueue_setup_gfx(struct amdgpu_userq_mgr *uq_mgr)
+{
+    int maj;
+    struct amdgpu_device *adev = uq_mgr->adev;
+    uint32_t version = adev->ip_versions[GC_HWIP][0];
+
+    /* We support usermode queue only for GFX V11 as of now */
+    maj = IP_VERSION_MAJ(version);
+    if (maj == 11)
+    uq_mgr->userq_funcs[AMDGPU_HW_IP_GFX] = 
_gfx_v11_funcs;

+}
+
  int amdgpu_userq_mgr_init(struct amdgpu_userq_mgr *userq_mgr, 
struct amdgpu_device *adev)

  {
  mutex_init(_mgr->userq_mutex);
  idr_init_base(_mgr->userq_idr, 1);
  userq_mgr->adev = adev;
  +    amdgpu_userqueue_setup_gfx(userq_mgr);
  return 0;
  }
  diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c

index c4940b6ea1c4..e76e1b86b434 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -30,6 +30,7 @@
  #include "amdgpu_psp.h"
  #include "amdgpu_smu.h"
  #include "amdgpu_atomfirmware.h"
+#include "amdgpu_userqueue.h"
  #include "imu_v11_0.h"
  #include "soc21.h"
  #include "nvd.h"
@@ -6486,3 +6487,75 @@ const struct amdgpu_ip_block_version 
gfx_v11_0_ip_block =

  .rev = 0,
  .funcs = _v11_0_ip_funcs,
  };
+
+static int gfx_v11_0_userq_mqd_create(struct amdgpu_userq_mgr 
*uq_mgr,

+  struct drm_amdgpu_userq_in *args_in,
+  struct amdgpu_usermode_queue *queue)
+{
+    struct amdgpu_device *adev = uq_mgr->adev;
+    struct amdgpu_mqd *mqd_gfx_generic = 
>mqds[AMDGPU_HW_IP_GFX];

+    struct drm_amdgpu_userq_mqd_gfx_v11_0 mqd_user;
+    struct amdgpu_mqd_prop userq_props;
+    int r;
+
+    /* Incoming MQD parameters from userspace to be saved here */
+    memset(_user, 0, sizeof(mqd_user));
+
+    /* Structure to initialize MQD for userqueue using generic 
MQD init function */

+    memset(_props, 0, sizeof(userq_props));
+
+    if (args_in->mqd_size != sizeof(struct 
drm_amdgpu_userq_mqd_gfx_v11_0)) {

+    DRM_ERROR("MQD size mismatch\n");
+    return -EINVAL;
+    }
+
+    if (copy_from_user(_user, u64_to_user_ptr(args_in->mqd), 
args_in->mqd_size)) {

+    DRM_ERROR("Failed to get user MQD\n");
+    return -EFAULT;
+    }
+
+    /* Create BO for actual Userqueue MQD now */
+    r = amdgpu_bo_create_kernel(adev, mqd_gfx_generic->mqd_size, 
PAGE_SIZE,

+    AMDGPU_GEM_DOMAIN_GTT,
+    >mqd.obj,
+    >mqd.gpu_addr,
+    >mqd.cpu_ptr);
+    if (r) {
+    DRM_ERROR("Failed to allocate BO for userqueue (%d)", r);
+    return -ENOMEM;
+    }


Using amdgpu_bo_create_kernel() for the MQD is most likely not a 
good idea in the long term, but should work for now.


I was a bit curious about this, the scope of this MQD object is 
kernel internal and used for queue mapp

Re: [PATCH v5 04/10] drm/amdgpu: create GFX-gen11 usermode queue

2023-07-11 Thread Felix Kuehling



On 2023-07-06 09:39, Christian König wrote:

Am 06.07.23 um 15:37 schrieb Shashank Sharma:


On 06/07/2023 15:22, Christian König wrote:

Am 06.07.23 um 14:35 schrieb Shashank Sharma:

A Memory queue descriptor (MQD) of a userqueue defines it in
the hw's context. As MQD format can vary between different
graphics IPs, we need gfx GEN specific handlers to create MQDs.

This patch:
- Introduces MQD handler functions for the usermode queues.
- Adds new functions to create and destroy userqueue MQD for
   GFX-GEN-11 IP

V1: Worked on review comments from Alex:
 - Make MQD functions GEN and IP specific

V2: Worked on review comments from Alex:
 - Reuse the existing adev->mqd[ip] for MQD creation
 - Formatting and arrangement of code

V3:
 - Integration with doorbell manager

V4: Review comments addressed:
 - Do not create a new file for userq, reuse gfx_v11_0.c (Alex)
 - Align name of structure members (Luben)
 - Don't break up the Cc tag list and the Sob tag list in commit
   message (Luben)
V5:
    - No need to reserve the bo for MQD (Christian).
    - Some more changes to support IP specific MQD creation.

Cc: Alex Deucher 
Cc: Christian Koenig 
Signed-off-by: Shashank Sharma 
Signed-off-by: Arvind Yadav 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c | 16 
  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c    | 73 
+++

  .../gpu/drm/amd/include/amdgpu_userqueue.h    |  7 ++
  3 files changed, 96 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c

index e37b5da5a0d0..bb774144c372 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c
@@ -134,12 +134,28 @@ int amdgpu_userq_ioctl(struct drm_device 
*dev, void *data,

  return r;
  }
  +extern const struct amdgpu_userq_funcs userq_gfx_v11_funcs;
+
+static void
+amdgpu_userqueue_setup_gfx(struct amdgpu_userq_mgr *uq_mgr)
+{
+    int maj;
+    struct amdgpu_device *adev = uq_mgr->adev;
+    uint32_t version = adev->ip_versions[GC_HWIP][0];
+
+    /* We support usermode queue only for GFX V11 as of now */
+    maj = IP_VERSION_MAJ(version);
+    if (maj == 11)
+    uq_mgr->userq_funcs[AMDGPU_HW_IP_GFX] = _gfx_v11_funcs;
+}
+
  int amdgpu_userq_mgr_init(struct amdgpu_userq_mgr *userq_mgr, 
struct amdgpu_device *adev)

  {
  mutex_init(_mgr->userq_mutex);
  idr_init_base(_mgr->userq_idr, 1);
  userq_mgr->adev = adev;
  +    amdgpu_userqueue_setup_gfx(userq_mgr);
  return 0;
  }
  diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c

index c4940b6ea1c4..e76e1b86b434 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -30,6 +30,7 @@
  #include "amdgpu_psp.h"
  #include "amdgpu_smu.h"
  #include "amdgpu_atomfirmware.h"
+#include "amdgpu_userqueue.h"
  #include "imu_v11_0.h"
  #include "soc21.h"
  #include "nvd.h"
@@ -6486,3 +6487,75 @@ const struct amdgpu_ip_block_version 
gfx_v11_0_ip_block =

  .rev = 0,
  .funcs = _v11_0_ip_funcs,
  };
+
+static int gfx_v11_0_userq_mqd_create(struct amdgpu_userq_mgr 
*uq_mgr,

+  struct drm_amdgpu_userq_in *args_in,
+  struct amdgpu_usermode_queue *queue)
+{
+    struct amdgpu_device *adev = uq_mgr->adev;
+    struct amdgpu_mqd *mqd_gfx_generic = 
>mqds[AMDGPU_HW_IP_GFX];

+    struct drm_amdgpu_userq_mqd_gfx_v11_0 mqd_user;
+    struct amdgpu_mqd_prop userq_props;
+    int r;
+
+    /* Incoming MQD parameters from userspace to be saved here */
+    memset(_user, 0, sizeof(mqd_user));
+
+    /* Structure to initialize MQD for userqueue using generic MQD 
init function */

+    memset(_props, 0, sizeof(userq_props));
+
+    if (args_in->mqd_size != sizeof(struct 
drm_amdgpu_userq_mqd_gfx_v11_0)) {

+    DRM_ERROR("MQD size mismatch\n");
+    return -EINVAL;
+    }
+
+    if (copy_from_user(_user, u64_to_user_ptr(args_in->mqd), 
args_in->mqd_size)) {

+    DRM_ERROR("Failed to get user MQD\n");
+    return -EFAULT;
+    }
+
+    /* Create BO for actual Userqueue MQD now */
+    r = amdgpu_bo_create_kernel(adev, mqd_gfx_generic->mqd_size, 
PAGE_SIZE,

+    AMDGPU_GEM_DOMAIN_GTT,
+    >mqd.obj,
+    >mqd.gpu_addr,
+    >mqd.cpu_ptr);
+    if (r) {
+    DRM_ERROR("Failed to allocate BO for userqueue (%d)", r);
+    return -ENOMEM;
+    }


Using amdgpu_bo_create_kernel() for the MQD is most likely not a 
good idea in the long term, but should work for now.


I was a bit curious about this. The scope of this MQD object is 
kernel-internal and it is used for queue mapping only; userspace doesn't 
know much about it. Do you still think we should not create a kernel 
object for it?



Well we should use a kernel BO. But amdgpu_bo_create_kernel() not only 
creates a kernel BO but also pins it! And that is problematic 

Re: [PATCH 3/6] drm/amdkfd: switch over to using drm_exec v2

2023-07-11 Thread Felix Kuehling



On 2023-07-11 09:31, Christian König wrote:

Avoids quite a bit of logic and kmalloc overhead.

v2: fix multiple problems pointed out by Felix

Signed-off-by: Christian König 


Two nit-picks inline about DRM_EXEC_INTERRUPTIBLE_WAIT. With those 
fixed, the patch is


Reviewed-by: Felix Kuehling 




---
  drivers/gpu/drm/amd/amdgpu/Kconfig|   1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h|   5 +-
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 299 +++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  18 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|   4 +
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c  |  45 ++-
  6 files changed, 162 insertions(+), 210 deletions(-)

[snip]

@@ -2538,50 +2489,41 @@ static int update_invalid_user_pages(struct 
amdkfd_process_info *process_info,
   */
  static int validate_invalid_user_pages(struct amdkfd_process_info 
*process_info)
  {
-   struct amdgpu_bo_list_entry *pd_bo_list_entries;
-   struct list_head resv_list, duplicates;
-   struct ww_acquire_ctx ticket;
+   struct ttm_operation_ctx ctx = { false, false };
struct amdgpu_sync sync;
+   struct drm_exec exec;
  
  	struct amdgpu_vm *peer_vm;

struct kgd_mem *mem, *tmp_mem;
struct amdgpu_bo *bo;
-   struct ttm_operation_ctx ctx = { false, false };
-   int i, ret;
-
-   pd_bo_list_entries = kcalloc(process_info->n_vms,
-sizeof(struct amdgpu_bo_list_entry),
-GFP_KERNEL);
-   if (!pd_bo_list_entries) {
-   pr_err("%s: Failed to allocate PD BO list entries\n", __func__);
-   ret = -ENOMEM;
-   goto out_no_mem;
-   }
-
-   INIT_LIST_HEAD(_list);
-   INIT_LIST_HEAD();
+   int ret;
  
-	/* Get all the page directory BOs that need to be reserved */

-   i = 0;
-   list_for_each_entry(peer_vm, _info->vm_list_head,
-   vm_list_node)
-   amdgpu_vm_get_pd_bo(peer_vm, _list,
-   _bo_list_entries[i++]);
-   /* Add the userptr_inval_list entries to resv_list */
-   list_for_each_entry(mem, _info->userptr_inval_list,
-   validate_list.head) {
-   list_add_tail(>resv_list.head, _list);
-   mem->resv_list.bo = mem->validate_list.bo;
-   mem->resv_list.num_shared = mem->validate_list.num_shared;
-   }
+   amdgpu_sync_create();
  
+	drm_exec_init(, DRM_EXEC_INTERRUPTIBLE_WAIT);


This runs in a worker thread. So I think it doesn't need to be 
interruptible.




/* Reserve all BOs and page tables for validation */
-   ret = ttm_eu_reserve_buffers(, _list, false, );
-   WARN(!list_empty(), "Duplicates should be empty");
-   if (ret)
-   goto out_free;
+   drm_exec_until_all_locked() {
+   /* Reserve all the page directories */
+   list_for_each_entry(peer_vm, _info->vm_list_head,
+   vm_list_node) {
+   ret = amdgpu_vm_lock_pd(peer_vm, , 2);
+   drm_exec_retry_on_contention();
+   if (unlikely(ret))
+   goto unreserve_out;
+   }
  
-	amdgpu_sync_create();

+   /* Reserve the userptr_inval_list entries to resv_list */
+   list_for_each_entry(mem, _info->userptr_inval_list,
+   validate_list) {
+   struct drm_gem_object *gobj;
+
+   gobj = >bo->tbo.base;
+   ret = drm_exec_prepare_obj(, gobj, 1);
+   drm_exec_retry_on_contention();
+   if (unlikely(ret))
+   goto unreserve_out;
+   }
+   }
  
  	ret = process_validate_vms(process_info);

if (ret)


[snip]

@@ -1467,25 +1467,24 @@ static int svm_range_reserve_bos(struct 
svm_validate_context *ctx)

uint32_t gpuidx;
int r;
  
-	INIT_LIST_HEAD(>validate_list);

-   for_each_set_bit(gpuidx, ctx->bitmap, MAX_GPU_INSTANCE) {
-   pdd = kfd_process_device_from_gpuidx(ctx->process, gpuidx);
-   if (!pdd) {
-   pr_debug("failed to find device idx %d\n", gpuidx);
-   return -EINVAL;
-   }
-   vm = drm_priv_to_vm(pdd->drm_priv);
-
-   ctx->tv[gpuidx].bo = >root.bo->tbo;
-   ctx->tv[gpuidx].num_shared = 4;
-   list_add(>tv[gpuidx].head, >validate_list);
-   }
+   drm_exec_init(>exec, DRM_EXEC_INTERRUPTIBLE_WAIT);


This function is only called from svm_range_validate_and_map, which has 
an "intr" parameter. If you pass that through, you could make 
D

Re: [PATCH] drm/amdkfd: enable grace period for xcp instance

2023-07-11 Thread Felix Kuehling



On 2023-07-11 10:28, Eric Huang wrote:

Read/write grace period from/to first xcc instance of
xcp in kfd node.

Signed-off-by: Eric Huang 
---
  .../drm/amd/amdkfd/kfd_device_queue_manager.c | 21 ---
  .../drm/amd/amdkfd/kfd_device_queue_manager.h |  2 +-
  .../drm/amd/amdkfd/kfd_packet_manager_v9.c|  8 ---
  3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 31cac1fd0d58..9000c4b778fd 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1619,10 +1619,14 @@ static int initialize_cpsch(struct device_queue_manager 
*dqm)
  
  	init_sdma_bitmaps(dqm);
  
-	if (dqm->dev->kfd2kgd->get_iq_wait_times)

+   if (dqm->dev->kfd2kgd->get_iq_wait_times) {
+   u32 first_inst = dqm->dev->xcp->id *
+dqm->dev->adev->gfx.num_xcc_per_xcp;
dqm->dev->kfd2kgd->get_iq_wait_times(dqm->dev->adev,
-   >wait_times,
-   ffs(dqm->dev->xcc_mask) - 1);
+   >wait_times[first_inst],
+   first_inst);
+   }
+
return 0;
  }
  
@@ -1675,13 +1679,16 @@ static int start_cpsch(struct device_queue_manager *dqm)

grace_period);
if (retval)
pr_err("Setting grace timeout failed\n");
-   else if (dqm->dev->kfd2kgd->build_grace_period_packet_info)
+   else if (dqm->dev->kfd2kgd->build_grace_period_packet_info) {
+   u32 first_inst = dqm->dev->xcp->id *
+dqm->dev->adev->gfx.num_xcc_per_xcp;
/* Update dqm->wait_times maintained in software */
dqm->dev->kfd2kgd->build_grace_period_packet_info(
-   dqm->dev->adev,   dqm->wait_times,
+   dqm->dev->adev,   
dqm->wait_times[first_inst],
grace_period, _offset,
-   >wait_times,
-   ffs(dqm->dev->xcc_mask) - 1);
+   >wait_times[first_inst],
+   first_inst);
+   }
}
  
  	dqm_unlock(dqm);

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 7dd4b177219d..45959c33b944 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -262,7 +262,7 @@ struct device_queue_manager {
/* used for GFX 9.4.3 only */
uint32_tcurrent_logical_xcc_start;
  
-	uint32_t		wait_times;

+   uint32_twait_times[MAX_XCP];


Why do you need an array here, if it only saves the wait times in one of 
the array entries [first_inst]?


Regards,
  Felix


  
  	wait_queue_head_t	destroy_wait;

  };
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c
index 8fda16e6fee6..960404a6379b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_packet_manager_v9.c
@@ -292,17 +292,19 @@ static int pm_set_grace_period_v9(struct packet_manager 
*pm,
struct pm4_mec_write_data_mmio *packet;
uint32_t reg_offset = 0;
uint32_t reg_data = 0;
+   uint32_t first_inst = pm->dqm->dev->xcp->id *
+ pm->dqm->dev->adev->gfx.num_xcc_per_xcp;
  
  	pm->dqm->dev->kfd2kgd->build_grace_period_packet_info(

pm->dqm->dev->adev,
-   pm->dqm->wait_times,
+   pm->dqm->wait_times[first_inst],
grace_period,
_offset,
_data,
-   0);
+   first_inst);
  
  	if (grace_period == USE_DEFAULT_GRACE_PERIOD)

-   reg_data = pm->dqm->wait_times;
+   reg_data = pm->dqm->wait_times[first_inst];
  
  	packet = (struct pm4_mec_write_data_mmio *)buffer;

memset(buffer, 0, sizeof(struct pm4_mec_write_data_mmio));


Re: [PATCH] drm/amdkfd: report dispatch id always saved in ttmps after gc9.4.2

2023-07-11 Thread Felix Kuehling

On 2023-07-11 13:19, Jonathan Kim wrote:

The feature to save the dispatch ID in trap temporaries 6 & 7 on context
save is unconditionally enabled during MQD initialization.

Now that TTMPs are always setup regardless of debug mode for GC 9.4.3, we
should report that the dispatch ID is always available for debug/trap
handling.

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 1a4cdee86759..eeedc3ddffeb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -1941,10 +1941,11 @@ static void kfd_topology_set_capabilities(struct 
kfd_topology_device *dev)
HSA_DBG_WATCH_ADDR_MASK_LO_BIT_GFX9 |
HSA_DBG_WATCH_ADDR_MASK_HI_BIT;
  
-		if (KFD_GC_VERSION(dev->gpu) < IP_VERSION(9, 4, 2))

+   if (KFD_GC_VERSION(dev->gpu) != IP_VERSION(9, 4, 2))
dev->node_props.debug_prop |=
HSA_DBG_DISPATCH_INFO_ALWAYS_VALID;
-   else
+
+   if (KFD_GC_VERSION(dev->gpu) >= IP_VERSION(9, 4, 2))
dev->node_props.capability |=

HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED;
} else {


Re: [PATCH v2] drm/amdgpu: Increase soft IH ring size

2023-07-07 Thread Felix Kuehling

On 2023-07-07 11:49, Philip Yang wrote:

Retry faults are delegated to the soft IH ring and then processed by a
deferred worker. The current soft IH ring size of PAGE_SIZE can store 128
entries, which may overflow and drop retry faults, causing HW hangs
because the retry fault is not recovered.

Increase soft IH ring size to 8KB, enough to store 256 CAM entries
because we clear the CAM entry after handling the retry fault from soft
ring.

Define macro IH_RING_SIZE and IH_SW_RING_SIZE to remove duplicate
constant.

Show warning message if soft IH ring overflows because this should not
happen.


It would indicate a problem with the CAM or it could happen on older 
GPUs that don't have a CAM. See below.





Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c  | 8 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  | 7 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c| 4 ++--
  drivers/gpu/drm/amd/amdgpu/navi10_ih.c  | 4 ++--
  drivers/gpu/drm/amd/amdgpu/vega10_ih.c  | 4 ++--
  drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 4 ++--
  7 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index fceb3b384955..51a0dbd2358a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -138,6 +138,7 @@ void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct 
amdgpu_ih_ring *ih)
  /**
   * amdgpu_ih_ring_write - write IV to the ring buffer
   *
+ * @adev: amdgpu_device pointer
   * @ih: ih ring to write to
   * @iv: the iv to write
   * @num_dw: size of the iv in dw
@@ -145,8 +146,8 @@ void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct 
amdgpu_ih_ring *ih)
   * Writes an IV to the ring buffer using the CPU and increment the wptr.
   * Used for testing and delegating IVs to a software ring.
   */
-void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const uint32_t *iv,
- unsigned int num_dw)
+void amdgpu_ih_ring_write(struct amdgpu_device *adev, struct amdgpu_ih_ring 
*ih,
+ const uint32_t *iv, unsigned int num_dw)
  {
uint32_t wptr = le32_to_cpu(*ih->wptr_cpu) >> 2;
unsigned int i;
@@ -161,6 +162,9 @@ void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const 
uint32_t *iv,
if (wptr != READ_ONCE(ih->rptr)) {
wmb();
WRITE_ONCE(*ih->wptr_cpu, cpu_to_le32(wptr));
+   } else {
+   dev_warn(adev->dev, "IH soft ring buffer overflow 0x%X, 0x%X\n",
+wptr, ih->rptr);


If this happens, it's probably going to flood the log. It would be a 
good idea to apply a rate-limit, or use dev_warn_once. With that fixed, 
the patch is


Reviewed-by: Felix Kuehling 



}
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h

index dd1c2eded6b9..6c6184f0dbc1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
@@ -27,6 +27,9 @@
  /* Maximum number of IVs processed at once */
  #define AMDGPU_IH_MAX_NUM_IVS 32
  
+#define IH_RING_SIZE	(256 * 1024)

+#define IH_SW_RING_SIZE	(8 * 1024)	/* enough for 256 CAM entries */
+
  struct amdgpu_device;
  struct amdgpu_iv_entry;
  
@@ -97,8 +100,8 @@ struct amdgpu_ih_funcs {

  int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
unsigned ring_size, bool use_bus_addr);
  void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct amdgpu_ih_ring 
*ih);
-void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const uint32_t *iv,
- unsigned int num_dw);
+void amdgpu_ih_ring_write(struct amdgpu_device *adev, struct amdgpu_ih_ring 
*ih,
+ const uint32_t *iv, unsigned int num_dw);
  int amdgpu_ih_wait_on_checkpoint_process_ts(struct amdgpu_device *adev,
struct amdgpu_ih_ring *ih);
  int amdgpu_ih_process(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 5273decc5753..fa6d0adcec20 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -493,7 +493,7 @@ void amdgpu_irq_delegate(struct amdgpu_device *adev,
 struct amdgpu_iv_entry *entry,
 unsigned int num_dw)
  {
-   amdgpu_ih_ring_write(&adev->irq.ih_soft, entry->iv_entry, num_dw);
+   amdgpu_ih_ring_write(adev, &adev->irq.ih_soft, entry->iv_entry, num_dw);
	schedule_work(&adev->irq.ih_soft_work);
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c

index b02e1cef78a7..980b24120080 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
@@ -535,7 +53

Re: [PATCH] drm/amdgpu: Increase IH soft ring size

2023-07-07 Thread Felix Kuehling



Am 2023-07-07 um 10:14 schrieb Philip Yang:

Retry faults are delegated to the IH soft ring and then processed by a
deferred worker. The current IH soft ring size of PAGE_SIZE can store 128
entries; the ring may overflow and drop retry faults, which causes the HW
to get stuck because the retry fault is never recovered.

Increase the IH soft ring size to the same size as the IH ring, and define
the macro IH_RING_SIZE to remove a duplicated constant.


As discussed offline, dropping retry fault interrupts is only a problem 
when the CAM is enabled. You only need as many entries in the soft IH 
ring as there are entries in the CAM.


Regards,
  Felix




Show a warning message if the IH soft ring overflows, because this should
not happen any more.

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c  | 8 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  | 4 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 2 +-
  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c| 5 +++--
  drivers/gpu/drm/amd/amdgpu/navi10_ih.c  | 5 +++--
  drivers/gpu/drm/amd/amdgpu/vega10_ih.c  | 5 +++--
  drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 5 +++--
  7 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index fceb3b384955..51a0dbd2358a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -138,6 +138,7 @@ void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct 
amdgpu_ih_ring *ih)
  /**
   * amdgpu_ih_ring_write - write IV to the ring buffer
   *
+ * @adev: amdgpu_device pointer
   * @ih: ih ring to write to
   * @iv: the iv to write
   * @num_dw: size of the iv in dw
@@ -145,8 +146,8 @@ void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct 
amdgpu_ih_ring *ih)
   * Writes an IV to the ring buffer using the CPU and increment the wptr.
   * Used for testing and delegating IVs to a software ring.
   */
-void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const uint32_t *iv,
- unsigned int num_dw)
+void amdgpu_ih_ring_write(struct amdgpu_device *adev, struct amdgpu_ih_ring 
*ih,
+ const uint32_t *iv, unsigned int num_dw)
  {
uint32_t wptr = le32_to_cpu(*ih->wptr_cpu) >> 2;
unsigned int i;
@@ -161,6 +162,9 @@ void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const 
uint32_t *iv,
if (wptr != READ_ONCE(ih->rptr)) {
wmb();
WRITE_ONCE(*ih->wptr_cpu, cpu_to_le32(wptr));
+   } else {
+   dev_warn(adev->dev, "IH soft ring buffer overflow 0x%X, 0x%X\n",
+wptr, ih->rptr);
}
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h

index dd1c2eded6b9..a8cf67f1f011 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
@@ -97,8 +97,8 @@ struct amdgpu_ih_funcs {
  int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
unsigned ring_size, bool use_bus_addr);
  void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct amdgpu_ih_ring 
*ih);
-void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const uint32_t *iv,
- unsigned int num_dw);
+void amdgpu_ih_ring_write(struct amdgpu_device *adev, struct amdgpu_ih_ring 
*ih,
+ const uint32_t *iv, unsigned int num_dw);
  int amdgpu_ih_wait_on_checkpoint_process_ts(struct amdgpu_device *adev,
struct amdgpu_ih_ring *ih);
  int amdgpu_ih_process(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 5273decc5753..fa6d0adcec20 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -493,7 +493,7 @@ void amdgpu_irq_delegate(struct amdgpu_device *adev,
 struct amdgpu_iv_entry *entry,
 unsigned int num_dw)
  {
-   amdgpu_ih_ring_write(&adev->irq.ih_soft, entry->iv_entry, num_dw);
+   amdgpu_ih_ring_write(adev, &adev->irq.ih_soft, entry->iv_entry, num_dw);
	schedule_work(&adev->irq.ih_soft_work);
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c

index b02e1cef78a7..21d2e57cffe7 100644
--- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
@@ -32,6 +32,7 @@
  #include "soc15_common.h"
  #include "ih_v6_0.h"
  
+#define IH_RING_SIZE	(256 * 1024)

  #define MAX_REARM_RETRY 10
  
  static void ih_v6_0_set_interrupt_funcs(struct amdgpu_device *adev);

@@ -535,7 +536,7 @@ static int ih_v6_0_sw_init(void *handle)
 * use bus address for ih ring by psp bl */
use_bus_addr =
(adev->firmware.load_type == AMDGPU_FW_LOAD_PSP) ? false : true;
-   r = amdgpu_ih_ring_init(adev, &adev->irq.ih, 256 * 1024, use_bus_addr);
+   r = amdgpu_ih_ring_init(adev, 

Re: [PATCH] drm/amdkfd: Access gpuvm_export_dmabuf() api

2023-06-21 Thread Felix Kuehling

Am 2023-06-20 um 22:11 schrieb Ramesh Errabolu:

Call the KFD API to get a dmabuf instead of calling the GEM prime API

Signed-off-by: Ramesh Errabolu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index cf1db0ab3471..c37d82b35372 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1852,13 +1852,13 @@ static uint32_t get_process_num_bos(struct kfd_process 
*p)
return num_of_bos;
  }
  
-static int criu_get_prime_handle(struct drm_gem_object *gobj, int flags,

+static int criu_get_prime_handle(struct kgd_mem *mem, int flags,
  u32 *shared_fd)
  {
struct dma_buf *dmabuf;
int ret;
  
-	dmabuf = amdgpu_gem_prime_export(gobj, flags);

+   ret = amdgpu_amdkfd_gpuvm_export_dmabuf(mem, &dmabuf);
if (IS_ERR(dmabuf)) {


I think you need to check ret here instead of IS_ERR(dmabuf). Please 
also check with Rajneesh. I think he ran into this before and I 
discussed this fix with him.


Otherwise the patch looks reasonable to me.

Thanks,
  Felix



ret = PTR_ERR(dmabuf);
pr_err("dmabuf export failed for the BO\n");
@@ -1940,7 +1940,7 @@ static int criu_checkpoint_bos(struct kfd_process *p,
}
if (bo_bucket->alloc_flags
& (KFD_IOC_ALLOC_MEM_FLAGS_VRAM | 
KFD_IOC_ALLOC_MEM_FLAGS_GTT)) {
-   ret = 
criu_get_prime_handle(&dumper_bo->tbo.base,
+   ret = criu_get_prime_handle(kgd_mem,
bo_bucket->alloc_flags &

KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ? DRM_RDWR : 0,
&bo_bucket->dmabuf_fd);
@@ -2402,7 +2402,7 @@ static int criu_restore_bo(struct kfd_process *p,
/* create the dmabuf object and export the bo */
if (bo_bucket->alloc_flags
& (KFD_IOC_ALLOC_MEM_FLAGS_VRAM | KFD_IOC_ALLOC_MEM_FLAGS_GTT)) {
-   ret = criu_get_prime_handle(&kgd_mem->bo->tbo.base, DRM_RDWR,
+   ret = criu_get_prime_handle(kgd_mem, DRM_RDWR,
&bo_bucket->dmabuf_fd);
if (ret)
return ret;


Re: [PATCH] drm/amdgpu: Forbid kfd using cpu to update pt if vm is shared with gfx

2023-06-21 Thread Felix Kuehling
Can we change the flags if needed. E.g. see what 
amdgpu_bo_pin_restricted does:


if (!(bo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS))
bo->flags |= AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED;
amdgpu_bo_placement_from_domain(bo, domain);

This shouldn't really change anything about the BO placement because we 
only enable CPU page table updates on large-BAR GPUs by default. 
Alternatively, we could create VM BOs with 
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED on large-BAR GPUs to make it 
possible to switch to CPU page table updates for compute VMs.


Regards,
  Felix


Am 2023-06-21 um 05:46 schrieb YuBiao Wang:

If the same GPU VM is shared by KFD and graphics operations, we must align
the VM update mode to SDMA, or a CPU kmap will fail and cause a
null-pointer issue.

Signed-off-by: YuBiao Wang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 291977b93b1d..e105ff9e8041 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2239,6 +2239,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct 
amdgpu_vm *vm)
  int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
  {
bool pte_support_ats = (adev->asic_type == CHIP_RAVEN);
+   struct amdgpu_bo *bo = vm->root.bo;
int r;
  
  	r = amdgpu_bo_reserve(vm->root.bo, true);

@@ -2265,6 +2266,10 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)
/* Update VM state */
vm->use_cpu_for_update = !!(adev->vm_manager.vm_update_mode &
AMDGPU_VM_USE_CPU_FOR_COMPUTE);
+
+   if (bo && !(bo->flags & AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED))
+   vm->use_cpu_for_update = false;
+
DRM_DEBUG_DRIVER("VM update mode is %s\n",
 vm->use_cpu_for_update ? "CPU" : "SDMA");
WARN_ONCE((vm->use_cpu_for_update &&


Re: [PATCHv4] drm/amdgpu: Update invalid PTE flag setting

2023-06-20 Thread Felix Kuehling

On 2023-06-19 13:38, Mukul Joshi wrote:

Update the invalid PTE flag setting with TF enabled.
This is to ensure, in addition to transitioning the
retry fault to a no-retry fault, it also causes the
wavefront to enter the trap handler. With the current
setting, the fault only transitions to a no-retry fault.
Additionally, have two sets of invalid PTE settings, one for
TF enabled and the other for TF disabled. The setting with
TF disabled doesn't work with TF enabled.

Signed-off-by: Mukul Joshi 


Reviewed-by: Felix Kuehling 



---
v1->v2:
- Update handling according to Christian's feedback.

v2->v3:
- Remove ASIC specific callback (Felix).

v3->v4:
- Add noretry flag to amdgpu->gmc. This allows to set
   ASIC specific flags.

  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h   |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  6 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c | 31 +++
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c|  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c |  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c |  1 +
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |  1 +
  9 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 56d73fade568..fdc25cd559b6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -331,6 +331,8 @@ struct amdgpu_gmc {
u64 VM_CONTEXT_PAGE_TABLE_END_ADDR_LO32[16];
u64 VM_CONTEXT_PAGE_TABLE_END_ADDR_HI32[16];
u64 MC_VM_MX_L1_TLB_CNTL;
+
+   u64 noretry_flags;
  };
  
  #define amdgpu_gmc_flush_gpu_tlb(adev, vmid, vmhub, type) ((adev)->gmc.gmc_funcs->flush_gpu_tlb((adev), (vmid), (vmhub), (type)))

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index eff73c428b12..8c7861a4d75d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2604,7 +2604,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, 
u32 pasid,
/* Intentionally setting invalid PTE flag
 * combination to force a no-retry-fault
 */
-   flags = AMDGPU_PTE_SNOOPED | AMDGPU_PTE_PRT;
+   flags = AMDGPU_VM_NORETRY_FLAGS;
value = 0;
} else if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) {
/* Redirect the access to the dummy page */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 9c85d494f2a2..b81fcb962d8f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -84,7 +84,13 @@ struct amdgpu_mem_stats;
  /* PDE Block Fragment Size for VEGA10 */
  #define AMDGPU_PDE_BFS(a) ((uint64_t)a << 59)
  
+/* Flag combination to set no-retry with TF disabled */

+#define AMDGPU_VM_NORETRY_FLAGS	(AMDGPU_PTE_EXECUTABLE | AMDGPU_PDE_PTE | \
+	AMDGPU_PTE_TF)
  
+/* Flag combination to set no-retry with TF enabled */

+#define AMDGPU_VM_NORETRY_FLAGS_TF (AMDGPU_PTE_VALID | AMDGPU_PTE_SYSTEM | \
+  AMDGPU_PTE_PRT)
  /* For GFX9 */
  #define AMDGPU_PTE_MTYPE_VG10(a)  ((uint64_t)(a) << 57)
  #define AMDGPU_PTE_MTYPE_VG10_MASKAMDGPU_PTE_MTYPE_VG10(3ULL)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
index dea1a64be44d..24ddf6a0512a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
@@ -778,6 +778,27 @@ int amdgpu_vm_pde_update(struct amdgpu_vm_update_params 
*params,
1, 0, flags);
  }
  
+/**

+ * amdgpu_vm_pte_update_noretry_flags - Update PTE no-retry flags
+ *
+ * @adev - amdgpu_device pointer
+ * @flags: pointer to PTE flags
+ *
+ * Update PTE no-retry flags when TF is enabled.
+ */
+static void amdgpu_vm_pte_update_noretry_flags(struct amdgpu_device *adev,
+   uint64_t *flags)
+{
+   /*
+* Update no-retry flags with the corresponding TF
+* no-retry combination.
+*/
+   if ((*flags & AMDGPU_VM_NORETRY_FLAGS) == AMDGPU_VM_NORETRY_FLAGS) {
+   *flags &= ~AMDGPU_VM_NORETRY_FLAGS;
+   *flags |= adev->gmc.noretry_flags;
+   }
+}
+
  /*
   * amdgpu_vm_pte_update_flags - figure out flags for PTE updates
   *
@@ -804,6 +825,16 @@ static void amdgpu_vm_pte_update_flags(struct 
amdgpu_vm_update_params *params,
flags |= AMDGPU_PTE_EXECUTABLE;
}
  
+	/*

+* Update no-retry flags to use the no-retry flag combination
+* with TF enabled. The AMDGPU_VM_NORETRY_FLAGS flag combination
+* does not work when TF is enabled. So, replace them w

Re: [PATCH] drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute

2023-06-19 Thread Felix Kuehling

On 2023-06-19 17:28, Xiaogang.Chen wrote:

From: Xiaogang Chen 

Since we allow KFD and graphics to operate on the same GPU VM for
interoperation between them, the GPU VM may have been used by graphics VM
operations before KFD turns it into a compute VM. Remove the VM clean check
from amdgpu_vm_make_compute.

Signed-off-by: Xiaogang Chen


Reviewed-by: Felix Kuehling 

As discussed, we can follow this up with a change that enables ATS for 
graphics VMs as well, so we don't need to enable ATS in 
amdgpu_vm_make_compute. This would improve interop for Raven. We only 
enable ATS for the lower half of the address space, so it should not 
affect graphics client that use the upper half.


Thanks,
  Felix



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index eff73c428b12..291977b93b1d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2245,16 +2245,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)
if (r)
return r;
  
-	/* Sanity checks */

-   if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
-   r = -EINVAL;
-   goto unreserve_bo;
-   }
-
/* Check if PD needs to be reinitialized and do it before
 * changing any other state, in case it fails.
 */
if (pte_support_ats != vm->pte_support_ats) {
+   /* Sanity checks */
+   if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
+   r = -EINVAL;
+   goto unreserve_bo;
+   }
+
vm->pte_support_ats = pte_support_ats;
r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo),
   false);


Re: [PATCH] drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute

2023-06-19 Thread Felix Kuehling



On 2023-06-19 15:06, Xiaogang.Chen wrote:

From: Xiaogang Chen 

Since we allow KFD and graphics to operate on the same GPU VM for
interoperation between them, the GPU VM may have been used by graphics VM
operations before KFD turns a GFX VM into a compute VM. Remove the VM clean
check from amdgpu_vm_make_compute.

Signed-off-by: Xiaogang Chen
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index eff73c428b12..33f05297ab7e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2246,7 +2246,7 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)
return r;
  
  	/* Sanity checks */

-   if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
+   if (pte_support_ats && !amdgpu_vm_pt_is_root_clean(adev, vm)) {


I think the correct condition here would be "pte_support_ats != 
vm->pte_support_ats", because that's what's used to reinitialize the 
page table just below. I think it would be even cleaner if you moved 
that check inside the "if (pte_support_ats != vm->pte_support_ats)" 
block below.


Regards,
  Felix



r = -EINVAL;
goto unreserve_bo;
}


Re: [PATCHv2] drm/amdkfd: Enable GWS on GFX9.4.3

2023-06-16 Thread Felix Kuehling



On 2023-06-16 14:44, Mukul Joshi wrote:

Enable GWS-capable queue creation for a forward
progress guarantee on GFX 9.4.3.

Signed-off-by: Mukul Joshi 


Reviewed-by: Felix Kuehling 



---
v1->v2:
- Update the condition for setting pqn->q->gws
   for GFX 9.4.3.
  drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  1 +
  .../amd/amdkfd/kfd_process_queue_manager.c| 35 ---
  2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 9d4abfd8b55e..226d2dd7fa49 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -518,6 +518,7 @@ static int kfd_gws_init(struct kfd_node *node)
&& kfd->mec2_fw_version >= 0x30)   ||
(KFD_GC_VERSION(node) == IP_VERSION(9, 4, 2)
&& kfd->mec2_fw_version >= 0x28) ||
+   (KFD_GC_VERSION(node) == IP_VERSION(9, 4, 3)) ||
(KFD_GC_VERSION(node) >= IP_VERSION(10, 3, 0)
&& KFD_GC_VERSION(node) < IP_VERSION(11, 0, 0)
&& kfd->mec2_fw_version >= 0x6b
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 9ad1a2186a24..ba9d69054119 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -123,16 +123,24 @@ int pqm_set_gws(struct process_queue_manager *pqm, 
unsigned int qid,
if (!gws && pdd->qpd.num_gws == 0)
return -EINVAL;
  
-	if (gws)

-   ret = 
amdgpu_amdkfd_add_gws_to_process(pdd->process->kgd_process_info,
-   gws, &mem);
-   else
-   ret = 
amdgpu_amdkfd_remove_gws_from_process(pdd->process->kgd_process_info,
-   pqn->q->gws);
-   if (unlikely(ret))
-   return ret;
+   if (KFD_GC_VERSION(dev) != IP_VERSION(9, 4, 3)) {
+   if (gws)
+   ret = 
amdgpu_amdkfd_add_gws_to_process(pdd->process->kgd_process_info,
+   gws, &mem);
+   else
+   ret = 
amdgpu_amdkfd_remove_gws_from_process(pdd->process->kgd_process_info,
+   pqn->q->gws);
+   if (unlikely(ret))
+   return ret;
+   pqn->q->gws = mem;
+   } else {
+   /*
+* Intentionally set GWS to a non-NULL value
+* for GFX 9.4.3.
+*/
+   pqn->q->gws = gws ? ERR_PTR(-ENOMEM) : NULL;
+   }
  
-	pqn->q->gws = mem;

pdd->qpd.num_gws = gws ? dev->adev->gds.gws_size : 0;
  
  	return pqn->q->device->dqm->ops.update_queue(pqn->q->device->dqm,

@@ -164,7 +172,8 @@ void pqm_uninit(struct process_queue_manager *pqm)
struct process_queue_node *pqn, *next;
  
list_for_each_entry_safe(pqn, next, &pqm->queues, process_queue_list) {

-   if (pqn->q && pqn->q->gws)
+   if (pqn->q && pqn->q->gws &&
+   KFD_GC_VERSION(pqn->q->device) != IP_VERSION(9, 4, 3))

amdgpu_amdkfd_remove_gws_from_process(pqm->process->kgd_process_info,
pqn->q->gws);
kfd_procfs_del_queue(pqn->q);
@@ -446,8 +455,10 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, 
unsigned int qid)
}
  
  		if (pqn->q->gws) {

-   
amdgpu_amdkfd_remove_gws_from_process(pqm->process->kgd_process_info,
-   pqn->q->gws);
+   if (KFD_GC_VERSION(pqn->q->device) != IP_VERSION(9, 4, 
3))
+   amdgpu_amdkfd_remove_gws_from_process(
+   pqm->process->kgd_process_info,
+   pqn->q->gws);
pdd->qpd.num_gws = 0;
}
  


Re: [PATCH] drm/amdkfd: Use KIQ to unmap HIQ

2023-06-16 Thread Felix Kuehling



On 2023-06-16 14:00, Mukul Joshi wrote:

Currently, we unmap HIQ by directly writing to HQD
registers. This doesn't work for GFX9.4.3. Instead,
use KIQ to unmap HIQ, similar to how we use KIQ to
map HIQ. Using KIQ to unmap HIQ works for all GFX
series post GFXv9.

Signed-off-by: Mukul Joshi 
---
  .../drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c   |  1 +
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 47 ++
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h|  3 ++
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10_3.c  |  1 +
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c| 47 ++
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 48 +++
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.h |  3 ++
  drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.c  |  8 
  drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager.h  |  4 ++
  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v10.c  |  2 +-
  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c  |  2 +-
  .../gpu/drm/amd/amdkfd/kfd_mqd_manager_v9.c   |  7 ++-
  .../gpu/drm/amd/include/kgd_kfd_interface.h   |  3 ++
  13 files changed, 170 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c
index 5b4b7f8b92a5..b82435e17ed0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gc_9_4_3.c
@@ -372,6 +372,7 @@ const struct kfd2kgd_calls gc_9_4_3_kfd2kgd = {
.hqd_sdma_dump = kgd_gfx_v9_4_3_hqd_sdma_dump,
.hqd_is_occupied = kgd_gfx_v9_hqd_is_occupied,
.hqd_sdma_is_occupied = kgd_gfx_v9_4_3_hqd_sdma_is_occupied,
+   .hiq_hqd_destroy = kgd_gfx_v9_hiq_hqd_destroy,
.hqd_destroy = kgd_gfx_v9_hqd_destroy,
.hqd_sdma_destroy = kgd_gfx_v9_4_3_hqd_sdma_destroy,
.wave_control_execute = kgd_gfx_v9_wave_control_execute,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
index 8ad7a7779e14..a919fb8e09a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
@@ -510,6 +510,52 @@ static bool kgd_hqd_sdma_is_occupied(struct amdgpu_device 
*adev, void *mqd)
return false;
  }
  
+int kgd_gfx_v10_hiq_hqd_destroy(struct amdgpu_device *adev, void *mqd,

+   uint32_t pipe_id, uint32_t queue_id,
+   uint32_t inst)
+{
+   struct amdgpu_ring *kiq_ring = &adev->gfx.kiq[0].ring;
+   struct v10_compute_mqd *m = get_mqd(mqd);
+   uint32_t mec, pipe;
+   uint32_t doorbell_off;
+   int r;
+
+   doorbell_off = m->cp_hqd_pq_doorbell_control >>
+   CP_HQD_PQ_DOORBELL_CONTROL__DOORBELL_OFFSET__SHIFT;
+
+   acquire_queue(adev, pipe_id, queue_id);
+
+   mec = (pipe_id / adev->gfx.mec.num_pipe_per_mec) + 1;
+   pipe = (pipe_id % adev->gfx.mec.num_pipe_per_mec);
+
+   spin_lock(&adev->gfx.kiq[0].ring_lock);
+   r = amdgpu_ring_alloc(kiq_ring, 6);
+   if (r) {
+   pr_err("Failed to alloc KIQ (%d).\n", r);
+   goto out_unlock;
+   }
+
+   amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_UNMAP_QUEUES, 4));
+   amdgpu_ring_write(kiq_ring, /* Q_sel: 0, vmid: 0, engine: 0, num_Q: 1 */
+ PACKET3_UNMAP_QUEUES_ACTION(0) |
+ PACKET3_UNMAP_QUEUES_QUEUE_SEL(0) |
+ PACKET3_UNMAP_QUEUES_ENGINE_SEL(0) |
+ PACKET3_UNMAP_QUEUES_NUM_QUEUES(1));
+   amdgpu_ring_write(kiq_ring,
+ PACKET3_UNMAP_QUEUES_DOORBELL_OFFSET0(doorbell_off));
+   amdgpu_ring_write(kiq_ring, 0);
+   amdgpu_ring_write(kiq_ring, 0);
+   amdgpu_ring_write(kiq_ring, 0);


This looks like you're duplicating the functionality in 
kiq->pmf->kiq_unmap_queues. Can we just call that instead? See 
amdgpu_gfx_disable_kcq for example.


Regards,
  Felix



+
+   amdgpu_ring_commit(kiq_ring);
+
+out_unlock:
+   spin_unlock(&adev->gfx.kiq[0].ring_lock);
+   release_queue(adev);
+
+   return r;
+}
+
  static int kgd_hqd_destroy(struct amdgpu_device *adev, void *mqd,
enum kfd_preempt_type reset_type,
unsigned int utimeout, uint32_t pipe_id,
@@ -1034,6 +1080,7 @@ const struct kfd2kgd_calls gfx_v10_kfd2kgd = {
.hqd_sdma_dump = kgd_hqd_sdma_dump,
.hqd_is_occupied = kgd_hqd_is_occupied,
.hqd_sdma_is_occupied = kgd_hqd_sdma_is_occupied,
+   .hiq_hqd_destroy = kgd_gfx_v10_hiq_hqd_destroy,
.hqd_destroy = kgd_hqd_destroy,
.hqd_sdma_destroy = kgd_hqd_sdma_destroy,
.wave_control_execute = kgd_wave_control_execute,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h
index e6b70196071a..00b4514ebdd5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.h
+++ 

Re: [PATCH] drm/amdkfd: Enable GWS on GFX9.4.3

2023-06-16 Thread Felix Kuehling



On 2023-06-16 13:59, Mukul Joshi wrote:

Enable GWS-capable queue creation for a forward
progress guarantee on GFX 9.4.3.

Signed-off-by: Mukul Joshi 
---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c   |  1 +
  .../amd/amdkfd/kfd_process_queue_manager.c| 31 ---
  2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 9d4abfd8b55e..226d2dd7fa49 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -518,6 +518,7 @@ static int kfd_gws_init(struct kfd_node *node)
&& kfd->mec2_fw_version >= 0x30)   ||
(KFD_GC_VERSION(node) == IP_VERSION(9, 4, 2)
&& kfd->mec2_fw_version >= 0x28) ||
+   (KFD_GC_VERSION(node) == IP_VERSION(9, 4, 3)) ||
(KFD_GC_VERSION(node) >= IP_VERSION(10, 3, 0)
&& KFD_GC_VERSION(node) < IP_VERSION(11, 0, 0)
&& kfd->mec2_fw_version >= 0x6b
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index 9ad1a2186a24..9a091d8f9aaf 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -123,16 +123,20 @@ int pqm_set_gws(struct process_queue_manager *pqm, 
unsigned int qid,
if (!gws && pdd->qpd.num_gws == 0)
return -EINVAL;
  
-	if (gws)

-   ret = 
amdgpu_amdkfd_add_gws_to_process(pdd->process->kgd_process_info,
-   gws, &mem);
-   else
-   ret = 
amdgpu_amdkfd_remove_gws_from_process(pdd->process->kgd_process_info,
-   pqn->q->gws);
-   if (unlikely(ret))
-   return ret;
+   if (KFD_GC_VERSION(dev) != IP_VERSION(9, 4, 3)) {
+   if (gws)
+   ret = 
amdgpu_amdkfd_add_gws_to_process(pdd->process->kgd_process_info,
+   gws, &mem);
+   else
+   ret = 
amdgpu_amdkfd_remove_gws_from_process(pdd->process->kgd_process_info,
+   pqn->q->gws);
+   if (unlikely(ret))
+   return ret;
+   pqn->q->gws = mem;
+   } else {
+   pqn->q->gws = ERR_PTR(-ENOMEM);


I think this needs to be

pqn->q->gws = gws ? ERR_PTR(-ENOMEM) : NULL;

Regards,
  Felix



+   }
  
-	pqn->q->gws = mem;

pdd->qpd.num_gws = gws ? dev->adev->gds.gws_size : 0;
  
  	return pqn->q->device->dqm->ops.update_queue(pqn->q->device->dqm,

@@ -164,7 +168,8 @@ void pqm_uninit(struct process_queue_manager *pqm)
struct process_queue_node *pqn, *next;
  
list_for_each_entry_safe(pqn, next, &pqm->queues, process_queue_list) {

-   if (pqn->q && pqn->q->gws)
+   if (pqn->q && pqn->q->gws &&
+   KFD_GC_VERSION(pqn->q->device) != IP_VERSION(9, 4, 3))

amdgpu_amdkfd_remove_gws_from_process(pqm->process->kgd_process_info,
pqn->q->gws);
kfd_procfs_del_queue(pqn->q);
@@ -446,8 +451,10 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, 
unsigned int qid)
}
  
  		if (pqn->q->gws) {

-   
amdgpu_amdkfd_remove_gws_from_process(pqm->process->kgd_process_info,
-   pqn->q->gws);
+   if (KFD_GC_VERSION(pqn->q->device) != IP_VERSION(9, 4, 
3))
+   amdgpu_amdkfd_remove_gws_from_process(
+   pqm->process->kgd_process_info,
+   pqn->q->gws);
pdd->qpd.num_gws = 0;
}
  


Re: [PATCH] drm/amdgpu: Modify for_each_inst macro

2023-06-16 Thread Felix Kuehling



Am 2023-06-16 um 06:23 schrieb Lijo Lazar:

Modify it such that it doesn't change the instance mask parameter.

Signed-off-by: Lijo Lazar 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index f4029c13a9be..c5451a9b0ee4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1295,9 +1295,9 @@ int emu_soc_asic_init(struct amdgpu_device *adev);
  
  #define amdgpu_inc_vram_lost(adev) atomic_inc(&((adev)->vram_lost_counter));
  
-#define for_each_inst(i, inst_mask)\

-   for (i = ffs(inst_mask) - 1; inst_mask;\
-inst_mask &= ~(1U << i), i = ffs(inst_mask) - 1)
+#define for_each_inst(i, inst_mask)\
+   for (i = ffs(inst_mask); i-- != 0; \
+    i = ffs(inst_mask & (~0U << (i + 1))))
  
  #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
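The point of the revised macro is that it walks the set bits of inst_mask in ascending order without consuming the mask, whereas the old version destroyed it with "inst_mask &= ~(1U << i)". A userspace copy of the new form (same shape as the patch, small masks only) can be exercised directly:

```c
#include <assert.h>
#include <strings.h>	/* ffs() */

/* Userspace copy of the revised for_each_inst(): iterate the set bits of
 * inst_mask in ascending order.  ffs() is 1-based and returns 0 for an
 * empty mask, hence the "i-- != 0" test.  The mask itself is never
 * modified. */
#define for_each_inst(i, inst_mask) \
	for (i = ffs(inst_mask); i-- != 0; \
	     i = ffs(inst_mask & (~0U << (i + 1))))

/* Collect the visited instance indices; returns how many were visited. */
static int collect_insts(unsigned int inst_mask, int *out)
{
	int i, n = 0;

	for_each_inst(i, inst_mask)
		out[n++] = i;
	return n;
}
```

Because the loop's step expression recomputes ffs() on a masked copy rather than clearing bits in place, a caller can reuse inst_mask after the loop — which is exactly what the patch description promises.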
  


Re: [PATCH] drm/amdkfd: set coherent host access capability flag

2023-06-15 Thread Felix Kuehling



Am 2023-06-16 um 00:29 schrieb Felix Kuehling:


Am 2023-06-15 um 18:54 schrieb Alex Sierra:

This flag determines whether the host possesses coherent access to
the memory of the device.

Signed-off-by: Alex Sierra 
---
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c

index 90b86a6ac7bd..7ede3de4f7fb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -2107,6 +2107,10 @@ int kfd_topology_add_device(struct kfd_node *gpu)
  if (KFD_IS_SVM_API_SUPPORTED(dev->gpu->adev))
  dev->node_props.capability |= HSA_CAP_SVMAPI_SUPPORTED;
  +    if (dev->gpu->adev->gmc.is_app_apu |
+    dev->gpu->adev->gmc.xgmi.connected_to_cpu)
+    dev->node_props.capability |= HSA_CAP_FLAGS_COHERENTHOSTACCESS;


I believe this is not true for "small APUs" because they map the 
framebuffer as WC on the CPU. I think you need to check specifically 
for APP APU.


Never mind, I read it wrong. You are checking the correct APP APU flag. 
Just one more nit-pick, in the condition you should use logical OR (a || 
b), not bit-wise OR (a | b). With that fixed, the patch is


Reviewed-by: Felix Kuehling 




Regards,
  Felix



+
  kfd_debug_print_topology();
    kfd_notify_gpu_change(gpu_id, 1);
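The logical-vs-bitwise OR nit above is easy to demonstrate: for two plain 0/1 flags the results happen to agree, but || short-circuits while | always evaluates both operands, and with non-boolean values the two can produce different results. A small illustrative sketch (probe() is an invented helper, not driver code):

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how many operands actually get evaluated, to show that || stops
 * at the first true operand while | never does. */
static int calls;

static bool probe(bool v)
{
	calls++;
	return v;
}

static int or_bitwise(bool a, bool b) { return probe(a) | probe(b); }
static int or_logical(bool a, bool b) { return probe(a) || probe(b); }
```

For condition checks like the is_app_apu/connected_to_cpu test, || both documents intent and avoids surprises if either operand ever stops being a strict 0/1 value.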


Re: [PATCH] drm/amdkfd: set coherent host access capability flag

2023-06-15 Thread Felix Kuehling



Am 2023-06-15 um 18:54 schrieb Alex Sierra:

This flag determines whether the host possesses coherent access to
the memory of the device.

Signed-off-by: Alex Sierra 
---
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 90b86a6ac7bd..7ede3de4f7fb 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -2107,6 +2107,10 @@ int kfd_topology_add_device(struct kfd_node *gpu)
if (KFD_IS_SVM_API_SUPPORTED(dev->gpu->adev))
dev->node_props.capability |= HSA_CAP_SVMAPI_SUPPORTED;
  
+	if (dev->gpu->adev->gmc.is_app_apu |

+   dev->gpu->adev->gmc.xgmi.connected_to_cpu)
+   dev->node_props.capability |= HSA_CAP_FLAGS_COHERENTHOSTACCESS;


I believe this is not true for "small APUs" because they map the 
framebuffer as WC on the CPU. I think you need to check specifically for 
APP APU.


Regards,
  Felix



+
kfd_debug_print_topology();
  
  	kfd_notify_gpu_change(gpu_id, 1);


Re: [PATCH 2/3] drm/amdgpu: Implement new dummy vram manager

2023-06-15 Thread Felix Kuehling



Am 2023-06-15 um 03:37 schrieb Christian König:

Am 14.06.23 um 17:42 schrieb Felix Kuehling:

Am 2023-06-14 um 06:38 schrieb Christian König:

Am 10.05.23 um 00:01 schrieb Alex Deucher:

From: Rajneesh Bhardwaj 

This adds dummy vram manager to support ASICs that do not have a
dedicated or carvedout vram domain.


Well that doesn't seem to make much sense. Why we should have that?


TTM always expects a resource manager for VRAM. There are no NULL 
pointer checks in TTM for not having a resource manager for VRAM. The 
existing amdgpu_vram_mgr gets confused if there is no VRAM. It seemed 
cleaner to add a dummy manager than to scatter conditions for a 
memory-less GPU corner case through the regular VRAM manager.


Well no that's absolutely *not* cleaner. TTM has a predefined manager 
if you need to use a dummy.


I think you are referring to ttm_range_manager. ttm_range_man_alloc does 
a bunch of useless stuff when there is no hope of succeeding:


 * kzalloc a node struct
 * ttm_resource_init
 o add the node to an LRU
 * drm_mm_insert_node_in_range (which fails because the drm_mm was
   created with 0 size)
 * ttm_resource_fini
 o remove the node from an LRU
 * kfree the node struct

In that process it also takes 3 spin_locks. All of that for TTM to 
figure out that VRAM is not a feasible placement. All we need to do here 
in the dummy manager is to return -ENOSPC.


I really don't get why this bothers you so much, or why this is even 
controversial.


Regards,
  Felix




Why the heck didn't you ask me before doing stuff like that?

Regards,
Christian.



Regards,
  Felix




Christian.



Reviewed-by: Felix Kuehling 
Signed-off-by: Rajneesh Bhardwaj 
Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 67 
++--

  1 file changed, 60 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c

index 43d6a9d6a538..89d35d194f2c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -370,6 +370,45 @@ int amdgpu_vram_mgr_query_page_status(struct 
amdgpu_vram_mgr *mgr,

  return ret;
  }
  +static void amdgpu_dummy_vram_mgr_debug(struct 
ttm_resource_manager *man,

+  struct drm_printer *printer)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr debug\n");
+}
+
+static bool amdgpu_dummy_vram_mgr_compatible(struct 
ttm_resource_manager *man,

+   struct ttm_resource *res,
+   const struct ttm_place *place,
+   size_t size)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr compatible\n");
+    return false;
+}
+
+static bool amdgpu_dummy_vram_mgr_intersects(struct 
ttm_resource_manager *man,

+   struct ttm_resource *res,
+   const struct ttm_place *place,
+   size_t size)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr intersects\n");
+    return true;
+}
+
+static void amdgpu_dummy_vram_mgr_del(struct ttm_resource_manager 
*man,

+    struct ttm_resource *res)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr deleted\n");
+}
+
+static int amdgpu_dummy_vram_mgr_new(struct ttm_resource_manager 
*man,

+   struct ttm_buffer_object *tbo,
+   const struct ttm_place *place,
+   struct ttm_resource **res)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr new\n");
+    return -ENOSPC;
+}
+
  /**
   * amdgpu_vram_mgr_new - allocate new ranges
   *
@@ -817,6 +856,14 @@ static void amdgpu_vram_mgr_debug(struct 
ttm_resource_manager *man,

  mutex_unlock(&mgr->lock);
  }
  +static const struct ttm_resource_manager_func 
amdgpu_dummy_vram_mgr_func = {

+    .alloc    = amdgpu_dummy_vram_mgr_new,
+    .free    = amdgpu_dummy_vram_mgr_del,
+    .intersects = amdgpu_dummy_vram_mgr_intersects,
+    .compatible = amdgpu_dummy_vram_mgr_compatible,
+    .debug    = amdgpu_dummy_vram_mgr_debug
+};
+
  static const struct ttm_resource_manager_func 
amdgpu_vram_mgr_func = {

  .alloc    = amdgpu_vram_mgr_new,
  .free    = amdgpu_vram_mgr_del,
@@ -841,17 +888,22 @@ int amdgpu_vram_mgr_init(struct amdgpu_device 
*adev)

  ttm_resource_manager_init(man, &adev->mman.bdev,
    adev->gmc.real_vram_size);
  -    man->func = &amdgpu_vram_mgr_func;
-
-    err = drm_buddy_init(&mgr->mm, man->size, PAGE_SIZE);
-    if (err)
-    return err;
-
  mutex_init(&mgr->lock);
  INIT_LIST_HEAD(&mgr->reservations_pending);
  INIT_LIST_HEAD(&mgr->reserved_pages);
  mgr->default_page_size = PAGE_SIZE;
  +    if (!adev->gmc.is_app_apu) {
+    man->func = &amdgpu_vram_mgr_func;
+
+    err = drm_buddy_init(&mgr->mm, man->size, PAGE_SIZE);
+    if (err)
+    return err;
+    } else {
+    man->func = &amdgpu_dummy_vram_mgr_func;
+    DRM_INFO("Setup dummy vram mgr\n");
+    }
+
  t

Re: [PATCH 2/3] drm/amdgpu: Implement new dummy vram manager

2023-06-14 Thread Felix Kuehling

Am 2023-06-14 um 06:38 schrieb Christian König:

Am 10.05.23 um 00:01 schrieb Alex Deucher:

From: Rajneesh Bhardwaj 

This adds dummy vram manager to support ASICs that do not have a
dedicated or carvedout vram domain.


Well that doesn't seem to make much sense. Why we should have that?


TTM always expects a resource manager for VRAM. There are no NULL 
pointer checks in TTM for not having a resource manager for VRAM. The 
existing amdgpu_vram_mgr gets confused if there is no VRAM. It seemed 
cleaner to add a dummy manager than to scatter conditions for a 
memory-less GPU corner case through the regular VRAM manager.


Regards,
  Felix




Christian.



Reviewed-by: Felix Kuehling 
Signed-off-by: Rajneesh Bhardwaj 
Signed-off-by: Alex Deucher 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 67 ++--
  1 file changed, 60 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c

index 43d6a9d6a538..89d35d194f2c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -370,6 +370,45 @@ int amdgpu_vram_mgr_query_page_status(struct 
amdgpu_vram_mgr *mgr,

  return ret;
  }
  +static void amdgpu_dummy_vram_mgr_debug(struct 
ttm_resource_manager *man,

+  struct drm_printer *printer)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr debug\n");
+}
+
+static bool amdgpu_dummy_vram_mgr_compatible(struct 
ttm_resource_manager *man,

+   struct ttm_resource *res,
+   const struct ttm_place *place,
+   size_t size)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr compatible\n");
+    return false;
+}
+
+static bool amdgpu_dummy_vram_mgr_intersects(struct 
ttm_resource_manager *man,

+   struct ttm_resource *res,
+   const struct ttm_place *place,
+   size_t size)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr intersects\n");
+    return true;
+}
+
+static void amdgpu_dummy_vram_mgr_del(struct ttm_resource_manager *man,
+    struct ttm_resource *res)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr deleted\n");
+}
+
+static int amdgpu_dummy_vram_mgr_new(struct ttm_resource_manager *man,
+   struct ttm_buffer_object *tbo,
+   const struct ttm_place *place,
+   struct ttm_resource **res)
+{
+    DRM_DEBUG_DRIVER("Dummy vram mgr new\n");
+    return -ENOSPC;
+}
+
  /**
   * amdgpu_vram_mgr_new - allocate new ranges
   *
@@ -817,6 +856,14 @@ static void amdgpu_vram_mgr_debug(struct 
ttm_resource_manager *man,

  mutex_unlock(&mgr->lock);
  }
  +static const struct ttm_resource_manager_func 
amdgpu_dummy_vram_mgr_func = {

+    .alloc    = amdgpu_dummy_vram_mgr_new,
+    .free    = amdgpu_dummy_vram_mgr_del,
+    .intersects = amdgpu_dummy_vram_mgr_intersects,
+    .compatible = amdgpu_dummy_vram_mgr_compatible,
+    .debug    = amdgpu_dummy_vram_mgr_debug
+};
+
  static const struct ttm_resource_manager_func amdgpu_vram_mgr_func = {
  .alloc    = amdgpu_vram_mgr_new,
  .free    = amdgpu_vram_mgr_del,
@@ -841,17 +888,22 @@ int amdgpu_vram_mgr_init(struct amdgpu_device 
*adev)

  ttm_resource_manager_init(man, &adev->mman.bdev,
    adev->gmc.real_vram_size);
  -    man->func = &amdgpu_vram_mgr_func;
-
-    err = drm_buddy_init(&mgr->mm, man->size, PAGE_SIZE);
-    if (err)
-    return err;
-
  mutex_init(&mgr->lock);
  INIT_LIST_HEAD(&mgr->reservations_pending);
  INIT_LIST_HEAD(&mgr->reserved_pages);
  mgr->default_page_size = PAGE_SIZE;
  +    if (!adev->gmc.is_app_apu) {
+    man->func = &amdgpu_vram_mgr_func;
+
+    err = drm_buddy_init(&mgr->mm, man->size, PAGE_SIZE);
+    if (err)
+    return err;
+    } else {
+    man->func = &amdgpu_dummy_vram_mgr_func;
+    DRM_INFO("Setup dummy vram mgr\n");
+    }
+
  ttm_set_driver_manager(&adev->mman.bdev, TTM_PL_VRAM, &mgr->manager);

  ttm_resource_manager_set_used(man, true);
  return 0;
@@ -886,7 +938,8 @@ void amdgpu_vram_mgr_fini(struct amdgpu_device 
*adev)

  drm_buddy_free_list(&mgr->mm, &rsv->allocated);
  kfree(rsv);
  }
-    drm_buddy_fini(&mgr->mm);
+    if (!adev->gmc.is_app_apu)
+    drm_buddy_fini(&mgr->mm);
  mutex_unlock(&mgr->lock);
    ttm_resource_manager_cleanup(man);




Re: [PATCH] drm/amdkfd: Switch over to memdup_user()

2023-06-14 Thread Felix Kuehling



Am 2023-06-13 um 22:04 schrieb Jiapeng Chong:

Use memdup_user() rather than duplicating its implementation. This is a
little bit restricted to reduce false positives.

./drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c:2813:13-20: WARNING 
opportunity for memdup_user.

Reported-by: Abaci Robot 
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5523
Signed-off-by: Jiapeng Chong 


Kernel test robot is reporting a failure with this patch; it looks like you 
used PTR_ERR incorrectly. Please make sure your patch compiles without 
warnings.


I see more opportunities to use memdup_user in kfd_chardev.c, 
kfd_events.c, kfd_process_queue_manager.c and kfd_svm.c. Do you want to 
fix those, too, while you're at it?


Thanks,
  Felix



---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 9 +++--
  1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index d6b15493fffd..637962d4083c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2810,12 +2810,9 @@ static uint32_t *get_queue_ids(uint32_t num_queues, 
uint32_t *usr_queue_id_array
if (!usr_queue_id_array)
return NULL;
  
-	queue_ids = kzalloc(array_size, GFP_KERNEL);

-   if (!queue_ids)
-   return ERR_PTR(-ENOMEM);
-
-   if (copy_from_user(queue_ids, usr_queue_id_array, array_size))
-   return ERR_PTR(-EFAULT);
+   queue_ids = memdup_user(usr_queue_id_array, array_size);
+   if (IS_ERR(queue_ids))
+   return PTR_ERR(queue_ids);
  
  	return queue_ids;

  }
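The build failure Felix flags comes from a type mismatch: get_queue_ids() returns a uint32_t pointer, so an error from memdup_user() must be propagated as an error pointer (the kernel's ERR_CAST), not as the long that PTR_ERR extracts. A userspace model of the corrected pattern — memdup_user_sketch() and the ERR_* helpers are simplified stand-ins, not the kernel implementations:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (unsigned long)p >= (unsigned long)-4095;
}
static inline void *ERR_CAST(const void *p) { return (void *)p; }

/* Stand-in for memdup_user(): duplicate a buffer or return ERR_PTR(-ENOMEM). */
static void *memdup_user_sketch(const void *src, size_t len)
{
	void *dst = len ? malloc(len) : NULL;

	if (!dst)
		return ERR_PTR(-ENOMEM);
	return memcpy(dst, src, len);
}

static uint32_t *get_queue_ids_sketch(const uint32_t *usr, size_t n)
{
	uint32_t *ids = memdup_user_sketch(usr, n * sizeof(*ids));

	if (IS_ERR(ids))
		return ERR_CAST(ids);	/* not: return PTR_ERR(ids); */
	return ids;
}
```

Since memdup_user() already returns an ERR_PTR-encoded pointer on failure, ERR_CAST (or simply returning the pointer unchanged) keeps the encoding intact while satisfying the compiler's return type.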


Re: [PATCH] drm/amdkfd: decrement queue count on mes queue destroy

2023-06-13 Thread Felix Kuehling

On 2023-06-13 17:48, Jonathan Kim wrote:

Queue count should decrement on queue destruction regardless of HWS
support type.

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 8a39a9e0ed5a..f515cb8f30ca 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2089,8 +2089,8 @@ static int destroy_queue_cpsch(struct 
device_queue_manager *dqm,
list_del(>list);
qpd->queue_count--;
if (q->properties.is_active) {
+   decrement_queue_count(dqm, qpd, q);
if (!dqm->dev->kfd->shared_resources.enable_mes) {
-   decrement_queue_count(dqm, qpd, q);
retval = execute_queues_cpsch(dqm,
  
KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
  USE_DEFAULT_GRACE_PERIOD);


Re: [PATCH] drm/amdgpu/sdma4: set align mask to 255

2023-06-12 Thread Felix Kuehling

Am 2023-06-07 um 12:31 schrieb Alex Deucher:

The wptr needs to be incremented in at least 64-dword intervals;
use 256 to align with Windows.  This should fix potential hangs
with unaligned updates.

Signed-off-by: Alex Deucher 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 4 ++--
  drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 4 ++--
  2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 1f83eebfc8a7..cd37f45e01a1 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -2312,7 +2312,7 @@ const struct amd_ip_funcs sdma_v4_0_ip_funcs = {
  
  static const struct amdgpu_ring_funcs sdma_v4_0_ring_funcs = {

.type = AMDGPU_RING_TYPE_SDMA,
-   .align_mask = 0xf,
+   .align_mask = 0xff,
.nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
.support_64bit_ptrs = true,
.secure_submission_supported = true,
@@ -2344,7 +2344,7 @@ static const struct amdgpu_ring_funcs 
sdma_v4_0_ring_funcs = {
  
  static const struct amdgpu_ring_funcs sdma_v4_0_page_ring_funcs = {

.type = AMDGPU_RING_TYPE_SDMA,
-   .align_mask = 0xf,
+   .align_mask = 0xff,
.nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
.support_64bit_ptrs = true,
.secure_submission_supported = true,
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 8eebf9c2bbcd..05bb0691ee0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -1823,7 +1823,7 @@ const struct amd_ip_funcs sdma_v4_4_2_ip_funcs = {
  
  static const struct amdgpu_ring_funcs sdma_v4_4_2_ring_funcs = {

.type = AMDGPU_RING_TYPE_SDMA,
-   .align_mask = 0xf,
+   .align_mask = 0xff,
.nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
.support_64bit_ptrs = true,
.get_rptr = sdma_v4_4_2_ring_get_rptr,
@@ -1854,7 +1854,7 @@ static const struct amdgpu_ring_funcs 
sdma_v4_4_2_ring_funcs = {
  
  static const struct amdgpu_ring_funcs sdma_v4_4_2_page_ring_funcs = {

.type = AMDGPU_RING_TYPE_SDMA,
-   .align_mask = 0xf,
+   .align_mask = 0xff,
.nop = SDMA_PKT_NOP_HEADER_OP(SDMA_OP_NOP),
.support_64bit_ptrs = true,
.get_rptr = sdma_v4_4_2_ring_get_rptr,
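In practical terms, a ring's align_mask means each submission is padded with NOPs so the write pointer advances in (align_mask + 1)-dword steps; raising the mask from 0xf to 0xff makes every wptr update a multiple of 256 dwords. A sketch of the rounding arithmetic (illustrative helper, not the amdgpu ring code):

```c
#include <assert.h>
#include <stdint.h>

/* Round a submission size up to the next multiple of (align_mask + 1)
 * dwords, the way a ring pads with NOPs before bumping wptr.  Assumes
 * align_mask + 1 is a power of two, as ring alignment masks are. */
static uint32_t pad_to_align(uint32_t ndw, uint32_t align_mask)
{
	return (ndw + align_mask) & ~align_mask;
}
```

With the old 0xf mask a 5-dword write could land wptr on a 16-dword boundary; with 0xff every update lands on a 256-dword boundary, matching the constraint described in the commit message.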


Re: [PATCHv2] drm/amdgpu: Update invalid PTE flag setting

2023-06-12 Thread Felix Kuehling



Am 2023-06-12 um 12:23 schrieb Mukul Joshi:

Update the invalid PTE flag setting with TF enabled.
This is to ensure, in addition to transitioning the
retry fault to a no-retry fault, it also causes the
wavefront to enter the trap handler. With the current
setting, the fault only transitions to a no-retry fault.
Additionally, have 2 sets of invalid PTE settings, one for
TF enabled, the other for TF disabled. The setting with
TF disabled, doesn't work with TF enabled.

Signed-off-by: Mukul Joshi 
---
v1->v2:
- Update handling according to Christian's feedback.

  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h   |  7 +++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c|  2 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h|  6 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c |  3 +++
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 11 +++
  5 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
index 6794edd1d2d2..e5c6b075fbbb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -152,6 +152,10 @@ struct amdgpu_gmc_funcs {
void (*override_vm_pte_flags)(struct amdgpu_device *dev,
  struct amdgpu_vm *vm,
  uint64_t addr, uint64_t *flags);
+   /* update no-retry flags */
+   void (*update_vm_pte_noretry_flags)(struct amdgpu_device *dev,
+   uint64_t *flags);
+
/* get the amount of memory used by the vbios for pre-OS console */
unsigned int (*get_vbios_fb_size)(struct amdgpu_device *adev);
  
@@ -343,6 +347,9 @@ struct amdgpu_gmc {

  #define amdgpu_gmc_override_vm_pte_flags(adev, vm, addr, pte_flags)   \
(adev)->gmc.gmc_funcs->override_vm_pte_flags  \
((adev), (vm), (addr), (pte_flags))
+#define amdgpu_gmc_update_vm_pte_noretry_flags(adev, pte_flags)
\
+   ((adev)->gmc.gmc_funcs->update_vm_pte_noretry_flags   \
+   ((adev), (pte_flags)))
  #define amdgpu_gmc_get_vbios_fb_size(adev) 
(adev)->gmc.gmc_funcs->get_vbios_fb_size((adev))
  
  /**

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 1cb14ea18cd9..ff9db7e5c086 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2583,7 +2583,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, 
u32 pasid,
/* Intentionally setting invalid PTE flag
 * combination to force a no-retry-fault
 */
-   flags = AMDGPU_PTE_SNOOPED | AMDGPU_PTE_PRT;
+   flags = AMDGPU_VM_NORETRY_FLAGS;
value = 0;
} else if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) {
/* Redirect the access to the dummy page */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 9c85d494f2a2..b81fcb962d8f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -84,7 +84,13 @@ struct amdgpu_mem_stats;
  /* PDE Block Fragment Size for VEGA10 */
  #define AMDGPU_PDE_BFS(a) ((uint64_t)a << 59)
  
+/* Flag combination to set no-retry with TF disabled */

+#define AMDGPU_VM_NORETRY_FLAGS	(AMDGPU_PTE_EXECUTABLE | AMDGPU_PDE_PTE | \
+   AMDGPU_PTE_TF)
  
+/* Flag combination to set no-retry with TF enabled */

+#define AMDGPU_VM_NORETRY_FLAGS_TF (AMDGPU_PTE_VALID | AMDGPU_PTE_SYSTEM | \
+  AMDGPU_PTE_PRT)
  /* For GFX9 */
  #define AMDGPU_PTE_MTYPE_VG10(a)  ((uint64_t)(a) << 57)
  #define AMDGPU_PTE_MTYPE_VG10_MASKAMDGPU_PTE_MTYPE_VG10(3ULL)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
index dea1a64be44d..39f1650f6d00 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
@@ -804,6 +804,9 @@ static void amdgpu_vm_pte_update_flags(struct 
amdgpu_vm_update_params *params,
flags |= AMDGPU_PTE_EXECUTABLE;
}
  
+	if (adev->gmc.translate_further && level == AMDGPU_VM_PTB)

+   amdgpu_gmc_update_vm_pte_noretry_flags(adev, &flags);


Don't you need a check that 
(adev)->gmc.gmc_funcs->update_vm_pte_noretry_flags is not NULL? But 
adding a new callback for this may be overkill. Since the 
AMDGPU_VM_NORETRY_FLAGS(_TF) are defined in a non-HW-specific header 
file, you can probably implement the application of those flags in 
amdgpu_vm_pte_update_flags directly.


Regards,
  Felix



+
/* APUs mapping system memory may need different MTYPEs on different
 * NUMA nodes. Only do this for contiguous ranges that can be assumed
 * to be on the same NUMA node.
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
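Felix's question about a NULL check comes down to the usual guarded ops-table pattern: a callback that only some ASICs implement must be checked before it is invoked through a macro. A sketch under invented names (gmc_funcs_sketch and friends are illustrative, not the real amdgpu structures):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical ops table with an optional hook that only some chips set. */
struct gmc_funcs_sketch {
	void (*update_noretry_flags)(uint64_t *flags);	/* may be NULL */
};

#define NORETRY_BIT (1ULL << 3)	/* illustrative flag bit */

static void gfx9_update_noretry_flags(uint64_t *flags)
{
	*flags |= NORETRY_BIT;
}

/* Guarded invocation: a NULL hook is simply a no-op instead of a crash. */
static void update_noretry_flags(const struct gmc_funcs_sketch *funcs,
				 uint64_t *flags)
{
	if (funcs->update_noretry_flags)
		funcs->update_noretry_flags(flags);
}
```

As Felix notes, the alternative to a new guarded callback is applying the AMDGPU_VM_NORETRY_FLAGS(_TF) masks directly in amdgpu_vm_pte_update_flags(), since they are defined in a non-HW-specific header anyway.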

Re: [PATCH] drm/amdkfd: fix null queue check on debug setting exceptions

2023-06-12 Thread Felix Kuehling



Am 2023-06-12 um 11:46 schrieb Jonathan Kim:

Null check should be done on queue struct itself and not on the
process queue list node.

Signed-off-by: Jonathan Kim 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index cd34e7aaead4..fff3ccc04fa9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -1097,7 +1097,7 @@ void kfd_dbg_set_enabled_debug_exception_mask(struct 
kfd_process *target,
  
	pqm = &target->pqm;

list_for_each_entry(pqn, &pqm->queues, process_queue_list) {
-   if (!pqn)
+   if (!pqn->q)
continue;
  
  		found_mask |= pqn->q->properties.exception_status;
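The reason "if (!pqn)" is dead code here: a list_for_each_entry() cursor is computed with container_of() from a real list node, so it can never be NULL — only a member such as pqn->q can be. A userspace sketch of the intrusive-list pattern (simplified list_head and container_of, not the kernel headers):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive list, in the style of include/linux/list.h. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for a process queue node. */
struct pqn_sketch {
	int *q;				/* may be NULL */
	struct list_head node;
};

static int count_queues(struct list_head *head)
{
	struct list_head *pos;
	int n = 0;

	for (pos = head->next; pos != head; pos = pos->next) {
		struct pqn_sketch *pqn =
			container_of(pos, struct pqn_sketch, node);

		assert(pqn != NULL);	/* always holds: derived from a node */
		if (pqn->q)		/* the meaningful check */
			n++;
	}
	return n;
}
```

Iteration only ever visits nodes that are linked into the list, so the cursor is a fixed offset from a valid node address — hence Jon's follow-up fix checks pqn->q instead.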


Re: [PATCH v5 3/5] drm/amdkfd: set activated flag true when event age unmatchs

2023-06-12 Thread Felix Kuehling
Testing for intermittent failures or race conditions is not easy. If we 
create such a test, we need to make sure it can catch the problem when 
not using the event ages, just to know that the test is good enough.


I guess it could be a parametrized test that can run with or without 
event age. Without event age, we'd expect to catch a timeout. Not 
catching a timeout would be a test failure (indicating that the test is 
not good enough). With event age it should not time out, i.e. a timeout 
would be considered a failure in this case (indicating a problem with 
the event age mechanism).


That said, I'd feel better about a ROCr test that doesn't just cover the 
KFD event age mechanism, but also its use in the ROCr implementation of 
HSA signal waiting.


Regards,
  Felix


Am 2023-06-12 um 12:19 schrieb Yat Sin, David:

[AMD Official Use Only - General]

The current ROCr patches already address my previous feedback. I am ok with the 
current ROCr patches.

Currently, there is no ROCrtst that would stress this multiple-waiters issue. I 
was thinking something like the KFDTest, but with by calling the waiters from 
different threads. @Zhu, James Would you have time to look into this?

~David


-Original Message-
From: Kuehling, Felix 
Sent: Friday, June 9, 2023 6:44 PM
To: Zhu, James ; amd-gfx@lists.freedesktop.org
Cc: Yat Sin, David ; Zhu, James

Subject: Re: [PATCH v5 3/5] drm/amdkfd: set activated flag true when event
age unmatchs

  From the KFD perspective, the series is

Reviewed-by: Felix Kuehling 

David, I looked at the ROCr and Thunk changes as well, and they look
reasonable to me. Do you have any feedback on these patches from a ROCr
point of view? Is there a reasonable stress test that could be used check that
this handles the race conditions as expected and allows all waiters to sleep?

Regards,
Felix


On 2023-06-09 16:43, James Zhu wrote:

Set waiter's activated flag true when event age unmatchs with

last_event_age.

-v4: add event type check
-v5: improve on event age enable and activated flags

Signed-off-by: James Zhu 
---
   drivers/gpu/drm/amd/amdkfd/kfd_events.c | 17 +
   1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index c7689181cc22..b2586a1dd35d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -41,6 +41,7 @@ struct kfd_event_waiter {
 wait_queue_entry_t wait;
 struct kfd_event *event; /* Event to wait for */
 bool activated;  /* Becomes true when event is signaled */
+   bool event_age_enabled;  /* set to true when last_event_age is
+non-zero */
   };

   /*
@@ -797,9 +798,9 @@ static struct kfd_event_waiter
*alloc_event_waiters(uint32_t num_events)

   static int init_event_waiter(struct kfd_process *p,
 struct kfd_event_waiter *waiter,
-   uint32_t event_id)
+   struct kfd_event_data *event_data)
   {
-   struct kfd_event *ev = lookup_event_by_id(p, event_id);
+   struct kfd_event *ev = lookup_event_by_id(p, event_data->event_id);

 if (!ev)
 return -EINVAL;
@@ -808,6 +809,15 @@ static int init_event_waiter(struct kfd_process *p,
 waiter->event = ev;
 waiter->activated = ev->signaled;
 ev->signaled = ev->signaled && !ev->auto_reset;
+
+   /* last_event_age = 0 reserved for backward compatible */
+   if (waiter->event->type == KFD_EVENT_TYPE_SIGNAL &&
+   event_data->signal_event_data.last_event_age) {
+   waiter->event_age_enabled = true;
+   if (ev->event_age != event_data->signal_event_data.last_event_age)
+   waiter->activated = true;
+   }
+
 if (!waiter->activated)
 add_wait_queue(&ev->wq, &waiter->wait);
 spin_unlock(&ev->lock);
@@ -948,8 +958,7 @@ int kfd_wait_on_events(struct kfd_process *p,
 goto out_unlock;
 }

-   ret = init_event_waiter(p, &event_waiters[i],
-   event_data.event_id);
+   ret = init_event_waiter(p, &event_waiters[i], &event_data);
 if (ret)
 goto out_unlock;
 }
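The event-age mechanism under review can be sketched single-threaded: the waiter passes the last_event_age it observed, and if the event has been signaled since (the age moved), the waiter is activated immediately instead of sleeping and missing the wakeup. The structures below are illustrative stand-ins, not the KFD event code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a KFD signal event with an age counter. */
struct event_sketch {
	uint64_t event_age;	/* bumped on every signal */
	bool signaled;
};

static void signal_event(struct event_sketch *ev)
{
	ev->event_age++;
	ev->signaled = true;
}

/* Returns true when the waiter may proceed without sleeping.
 * last_event_age == 0 is reserved for backward compatibility:
 * older userspace gets the original signaled-flag behaviour. */
static bool waiter_activated(const struct event_sketch *ev,
			     uint64_t last_event_age)
{
	if (ev->signaled)
		return true;
	return last_event_age && ev->event_age != last_event_age;
}
```

This is the race the series closes: if another waiter consumed the signaled flag between the caller's snapshot and kfd_wait_on_events(), the age mismatch still prevents the late waiter from sleeping forever.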


Re: [PATCH v2] gpu: drm/amd: Remove the redundant null pointer check in list_for_each_entry() loops

2023-06-12 Thread Felix Kuehling

[+Jon]

Am 2023-06-12 um 07:58 schrieb Lu Hongfei:

pqn bound in list_for_each_entry loop will not be null, so there is
no need to check whether pqn is NULL or not.
Thus remove a redundant null pointer check.

Signed-off-by: Lu Hongfei 
---
The filename of the previous version was:
0001-gpu-drm-amd-Fix-the-bug-in-list_for_each_entry-loops.patch

The modifications made compared to the previous version are as follows:
1. Modified the patch title
2. "Thus remove a redundant null pointer check." is used instead of
"We could remove this check."

  drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index cd34e7aaead4..10d0cef844f0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -1097,9 +1097,6 @@ void kfd_dbg_set_enabled_debug_exception_mask(struct 
kfd_process *target,
  
	pqm = &target->pqm;

list_for_each_entry(pqn, &pqm->queues, process_queue_list) {
-   if (!pqn)


Right, this check doesn't make a lot of sense. Jon, was this meant to 
check pqn->q?


Regards,
  Felix



-   continue;
-
found_mask |= pqn->q->properties.exception_status;
}
  


Re: [PATCH v5 3/5] drm/amdkfd: set activated flag true when event age unmatchs

2023-06-09 Thread Felix Kuehling

From the KFD perspective, the series is

Reviewed-by: Felix Kuehling 

David, I looked at the ROCr and Thunk changes as well, and they look 
reasonable to me. Do you have any feedback on these patches from a ROCr 
point of view? Is there a reasonable stress test that could be used to 
check that this handles the race conditions as expected and allows all 
waiters to sleep?


Regards,
  Felix


On 2023-06-09 16:43, James Zhu wrote:

Set waiter's activated flag true when event age unmatchs with last_event_age.

-v4: add event type check
-v5: improve on event age enable and activated flags

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_events.c | 17 +
  1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index c7689181cc22..b2586a1dd35d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -41,6 +41,7 @@ struct kfd_event_waiter {
wait_queue_entry_t wait;
struct kfd_event *event; /* Event to wait for */
bool activated;  /* Becomes true when event is signaled */
+   bool event_age_enabled;  /* set to true when last_event_age is non-zero 
*/
  };
  
  /*

@@ -797,9 +798,9 @@ static struct kfd_event_waiter 
*alloc_event_waiters(uint32_t num_events)
  
  static int init_event_waiter(struct kfd_process *p,

struct kfd_event_waiter *waiter,
-   uint32_t event_id)
+   struct kfd_event_data *event_data)
  {
-   struct kfd_event *ev = lookup_event_by_id(p, event_id);
+   struct kfd_event *ev = lookup_event_by_id(p, event_data->event_id);
  
  	if (!ev)

return -EINVAL;
@@ -808,6 +809,15 @@ static int init_event_waiter(struct kfd_process *p,
waiter->event = ev;
waiter->activated = ev->signaled;
ev->signaled = ev->signaled && !ev->auto_reset;
+
+   /* last_event_age = 0 reserved for backward compatible */
+   if (waiter->event->type == KFD_EVENT_TYPE_SIGNAL &&
+   event_data->signal_event_data.last_event_age) {
+   waiter->event_age_enabled = true;
+   if (ev->event_age != 
event_data->signal_event_data.last_event_age)
+   waiter->activated = true;
+   }
+
if (!waiter->activated)
 add_wait_queue(&ev->wq, &waiter->wait);
 spin_unlock(&ev->lock);
@@ -948,8 +958,7 @@ int kfd_wait_on_events(struct kfd_process *p,
goto out_unlock;
}
  
-		ret = init_event_waiter(p, &event_waiters[i],

-   event_data.event_id);
+   ret = init_event_waiter(p, &event_waiters[i], &event_data);
if (ret)
goto out_unlock;
}


Re: [PATCH v4 3/5] drm/amdkfd: set activated flag true when event age unmatchs

2023-06-09 Thread Felix Kuehling



On 2023-06-09 16:13, James Zhu wrote:

Set waiter's activated flag true when event age unmatchs with last_event_age.

-v4: add event type check

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_events.c | 15 +++
  1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index c7689181cc22..2cc1a7e976f4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -41,6 +41,7 @@ struct kfd_event_waiter {
wait_queue_entry_t wait;
struct kfd_event *event; /* Event to wait for */
bool activated;  /* Becomes true when event is signaled */
+   bool event_age_enabled;  /* set to true when last_event_age is non-zero 
*/
  };
  
  /*

@@ -797,9 +798,9 @@ static struct kfd_event_waiter 
*alloc_event_waiters(uint32_t num_events)
  
  static int init_event_waiter(struct kfd_process *p,

struct kfd_event_waiter *waiter,
-   uint32_t event_id)
+   struct kfd_event_data *event_data)
  {
-   struct kfd_event *ev = lookup_event_by_id(p, event_id);
+   struct kfd_event *ev = lookup_event_by_id(p, event_data->event_id);
  
  	if (!ev)

return -EINVAL;
@@ -808,6 +809,13 @@ static int init_event_waiter(struct kfd_process *p,
waiter->event = ev;
waiter->activated = ev->signaled;
ev->signaled = ev->signaled && !ev->auto_reset;
+
+   /* last_event_age = 0 reserved for backward compatible */
+   waiter->event_age_enabled = 
!!event_data->signal_event_data.last_event_age;


This should also be inside the "if (waiter->event->type == 
KFD_EVENT_TYPE_SIGNAL)". I'd do something like this:


if (waiter->event->type == KFD_EVENT_TYPE_SIGNAL &&
event_data->signal_event_data.last_event_age) {
waiter->event_age_enabled = true;
if (ev->event_age != 
event_data->signal_event_data.last_event_age)
waiter->activated = true;
}

You don't need WRITE_ONCE here because there can be no concurrent access 
before you add the waiter to the wait queue.


Regards,
  Felix



+   if (waiter->event->type == KFD_EVENT_TYPE_SIGNAL && waiter->event_age_enabled 
&&
+   ev->event_age != event_data->signal_event_data.last_event_age)
+   WRITE_ONCE(waiter->activated, true);
+
if (!waiter->activated)
add_wait_queue(&ev->wq, &waiter->wait);
spin_unlock(&ev->lock);
@@ -948,8 +956,7 @@ int kfd_wait_on_events(struct kfd_process *p,
goto out_unlock;
}
  
-		ret = init_event_waiter(p, &event_waiters[i],
-					event_data.event_id);
+		ret = init_event_waiter(p, &event_waiters[i], &event_data);
if (ret)
goto out_unlock;
}


Re: [PATCH v3 3/5] drm/amdkfd: set activated flag true when event age unmatchs

2023-06-09 Thread Felix Kuehling



On 2023-06-08 13:07, James Zhu wrote:

Set the waiter's activated flag to true when the event age does not match last_event_age.

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_events.c | 15 +++
  1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index c7689181cc22..4c6907507190 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -41,6 +41,7 @@ struct kfd_event_waiter {
wait_queue_entry_t wait;
struct kfd_event *event; /* Event to wait for */
bool activated;  /* Becomes true when event is signaled */
+   bool event_age_enabled;  /* set to true when last_event_age is non-zero 
*/
  };
  
  /*

@@ -797,9 +798,9 @@ static struct kfd_event_waiter 
*alloc_event_waiters(uint32_t num_events)
  
  static int init_event_waiter(struct kfd_process *p,

struct kfd_event_waiter *waiter,
-   uint32_t event_id)
+   struct kfd_event_data *event_data)
  {
-   struct kfd_event *ev = lookup_event_by_id(p, event_id);
+   struct kfd_event *ev = lookup_event_by_id(p, event_data->event_id);
  
  	if (!ev)

return -EINVAL;
@@ -808,6 +809,13 @@ static int init_event_waiter(struct kfd_process *p,
waiter->event = ev;
waiter->activated = ev->signaled;
ev->signaled = ev->signaled && !ev->auto_reset;
+
+   /* last_event_age = 0 is reserved for backward compatibility */
+   waiter->event_age_enabled =
+   !!event_data->signal_event_data.last_event_age;
+   if (waiter->event_age_enabled &&
+   ev->event_age != event_data->signal_event_data.last_event_age)
+   WRITE_ONCE(waiter->activated, true);


This needs to check the event type. Looking at 
event_data->signal_event_data when this is not a signal event is 
illegal, because it is aliased in a union with other event type data.


Other than that, the series looks good to me now.

Regards,
  Felix



+
if (!waiter->activated)
add_wait_queue(&ev->wq, &waiter->wait);
spin_unlock(&ev->lock);
@@ -948,8 +956,7 @@ int kfd_wait_on_events(struct kfd_process *p,
goto out_unlock;
}
  
-		ret = init_event_waiter(p, &event_waiters[i],
-					event_data.event_id);
+		ret = init_event_waiter(p, &event_waiters[i], &event_data);
if (ret)
goto out_unlock;
}


Re: [PATCH v2 10/12] drm/amdgpu: remove unused functions and variables

2023-06-09 Thread Felix Kuehling

On 2023-04-12 12:25, Shashank Sharma wrote:

This patch removes some variables and functions from KFD
doorbell handling code, which are no more required since
doorbell manager is handling doorbell calculations.

Cc: Alex Deucher 
Cc: Christian Koenig 
Signed-off-by: Shashank Sharma 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c | 32 ---
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 12 -
  2 files changed, 44 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
index 718cfe9cb4f5..f4088cfd52db 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c
@@ -193,38 +193,6 @@ void write_kernel_doorbell64(void __iomem *db, u64 value)
}
  }
  
-unsigned int kfd_get_doorbell_dw_offset_in_bar(struct kfd_dev *kfd,

-   struct kfd_process_device *pdd,
-   unsigned int doorbell_id)
-{
-   /*
-* doorbell_base_dw_offset accounts for doorbells taken by KGD.
-* index * kfd_doorbell_process_slice/sizeof(u32) adjusts to
-* the process's doorbells. The offset returned is in dword
-* units regardless of the ASIC-dependent doorbell size.
-*/
-   if (!kfd->shared_resources.enable_mes)
-   return kfd->doorbell_base_dw_offset +
-   pdd->doorbell_index
-   * kfd_doorbell_process_slice(kfd) / sizeof(u32) +
-   doorbell_id *
-   kfd->device_info.doorbell_size / sizeof(u32);
-   else
-   return amdgpu_mes_get_doorbell_dw_offset_in_bar(
-   (struct amdgpu_device *)kfd->adev,
-   pdd->doorbell_index, doorbell_id);
-}
-
-uint64_t kfd_get_number_elems(struct kfd_dev *kfd)
-{
-   uint64_t num_of_elems = (kfd->shared_resources.doorbell_aperture_size -
-   kfd->shared_resources.doorbell_start_offset) /
-   kfd_doorbell_process_slice(kfd) + 1;
-
-   return num_of_elems;
-
-}
-
  phys_addr_t kfd_get_process_doorbells(struct kfd_process_device *pdd)
  {
struct amdgpu_device *adev = pdd->dev->adev;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index dfff77379acb..1bc6a8ed8cda 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -257,15 +257,6 @@ struct kfd_dev {
  
  	unsigned int id;		/* topology stub index */
  
-	phys_addr_t doorbell_base;	/* Start of actual doorbells used by

-* KFD. It is aligned for mapping
-* into user mode
-*/
-   size_t doorbell_base_dw_offset; /* Offset from the start of the PCI
-* doorbell BAR to the first KFD
-* doorbell in dwords. GFX reserves
-* the segment before this offset.
-*/
u32 __iomem *doorbell_kernel_ptr; /* This is a pointer for a doorbells
   * page used by kernel queue
   */
@@ -276,8 +267,6 @@ struct kfd_dev {
  
  	const struct kfd2kgd_calls *kfd2kgd;

struct mutex doorbell_mutex;
-   DECLARE_BITMAP(doorbell_available_index,
-   KFD_MAX_NUM_OF_QUEUES_PER_PROCESS);
  
  	void *gtt_mem;

uint64_t gtt_start_gpu_addr;
@@ -754,7 +743,6 @@ struct kfd_process_device {
struct attribute attr_evict;
  
  	struct kobject *kobj_stats;

-   unsigned int doorbell_index;
  
  	/*

 * @cu_occupancy: Reports occupancy of Compute Units (CU) of a process


Re: [PATCH v2 08/12] drm/amdgpu: use doorbell manager for kfd kernel doorbells

2023-06-09 Thread Felix Kuehling



On 2023-04-25 15:59, Shashank Sharma wrote:


On 24/04/2023 21:56, Felix Kuehling wrote:

On 2023-04-22 2:39, Shashank Sharma wrote:
- KFD process level doorbells: doorbell pages which are allocated by 
kernel but mapped and written by userspace processes, saved in 
struct pdd->qpd->doorbells


size = kfd_doorbell_process_slice.

We realized that we only need 1-2 doorbells for KFD kernel-level 
stuff (so we kept it at one page), but need two pages of doorbells per 
KFD process, so they are sized accordingly.


We have also run kfd_test_suit and verified the changes for any 
regression. Hope this helps in explaining the design.


Right, I missed that this was only for kernel doorbells. I wonder 
whether KFD really needs its own page here. I think we only need a 
doorbell for HWS. And when we use MES, I think even that isn't needed 
because MES packet submissions go through amdgpu. So maybe KFD 
doesn't need its own kernel-mode doorbell page any more on systems 
with user graphics mode queues.


Yeah, for any IP with MES enabled, KFD doesn't need kernel-level 
doorbells. But I still allocated a page just to make sure we do not 
break any non-MES platforms or use cases where MES is deliberately 
disabled from the kernel command line. Hope that works for you.


Even without MES, we still only need one doorbell for HWS. Allocating a 
whole page for that is wasteful. Anyway, I'm OK with cleaning that up later.


Regards,
  Felix




- Shashank



Regards,
  Felix




- Shashank 


Re: [PATCH] drm/amdkfd: fix and enable debugging for gfx11

2023-06-07 Thread Felix Kuehling



On 2023-06-07 16:20, Jonathan Kim wrote:

There are a couple of fixes required to enable gfx11 debugging.

First, ADD_QUEUE.trap_en is an inappropriate place to toggle
a per-process register so move it to SET_SHADER_DEBUGGER.trap_en.
When ADD_QUEUE.skip_process_ctx_clear is set, MES will prioritize
the SET_SHADER_DEBUGGER.trap_en setting.

Second, to preserve correct save/restore of privileged wave states
in coordination with the trap enablement setting, resume suspended
waves early in the disable call.

NOTE: The AMDGPU_MES_VERSION_MASK check is a place holder as
MES FW updates have been reviewed but is awaiting binary
creation.  Once the binaries have been created, this check may
be subject to change.

v2: do a trap_en safety check in case old mes doesn't accept
unused trap_en d-word.
remove unnecessary process termination work around.

Signed-off-by: Jonathan Kim 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c|  7 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h|  4 +++-
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c |  1 +
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c | 14 ++
  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  |  3 +--
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c  | 12 +++-
  drivers/gpu/drm/amd/include/mes_v11_api_def.h  |  1 +
  7 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 20cc3fffe921..e9091ebfe230 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -928,7 +928,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
uint64_t process_context_addr,
uint32_t spi_gdbg_per_vmid_cntl,
const uint32_t *tcp_watch_cntl,
-   uint32_t flags)
+   uint32_t flags,
+   bool trap_en)
  {
struct mes_misc_op_input op_input = {0};
int r;
@@ -945,6 +946,10 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl,
sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
  
+	if (((adev->mes.sched_version & AMDGPU_MES_API_VERSION_MASK) >>
+   AMDGPU_MES_API_VERSION_SHIFT) >= 14)
+   op_input.set_shader_debugger.trap_en = trap_en;
+


It's probably too late to change the GFX11 MES API at this point. But 
why didn't they just add a trap_en bit in the existing flags field? That 
could have avoided the need for the compatibility checks.


Anyway, the patch is

Reviewed-by: Felix Kuehling 



amdgpu_mes_lock(&adev->mes);
  
	r = adev->mes.funcs->misc_op(&adev->mes, &op_input);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index b5f5eed2b5ef..2d6ac30b7135 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -294,6 +294,7 @@ struct mes_misc_op_input {
} flags;
uint32_t spi_gdbg_per_vmid_cntl;
uint32_t tcp_watch_cntl[4];
+   uint32_t trap_en;
} set_shader_debugger;
};
  };
@@ -361,7 +362,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
uint64_t process_context_addr,
uint32_t spi_gdbg_per_vmid_cntl,
const uint32_t *tcp_watch_cntl,
-   uint32_t flags);
+   uint32_t flags,
+   bool trap_en);
  
  int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id,

int queue_type, int idx,
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index c4e3cb8d44de..1bdaa00c0b46 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -347,6 +347,7 @@ static int mes_v11_0_misc_op(struct amdgpu_mes *mes,
memcpy(misc_pkt.set_shader_debugger.tcp_watch_cntl,
input->set_shader_debugger.tcp_watch_cntl,

sizeof(misc_pkt.set_shader_debugger.tcp_watch_cntl));
+   misc_pkt.set_shader_debugger.trap_en = input->set_shader_debugger.trap_en;
break;
default:
DRM_ERROR("unsupported misc op (%d) \n", input->op);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index 125274445f43..cd34e7aaead4 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -349,12 +349,13 @@ int kfd_dbg_set

Re: [PATCH] drm/amdkfd: optimize gfx off enable toggle for debugging

2023-06-07 Thread Felix Kuehling



On 2023-06-07 13:32, Jonathan Kim wrote:

Legacy debug devices limited to pinning a single debug VMID for debugging
are the only devices that require disabling GFX OFF while accessing
debug registers.  Debug devices that support multi-process debugging
rely on the hardware scheduler to update debug registers and do not run
into GFX OFF access issues.

Remove KFD GFX OFF enable toggle clutter by moving these calls into the
KGD debug calls themselves.

v2: toggle gfx off around address watch hi/lo settings as well.

Signed-off-by: Jonathan Kim 
---
  .../drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c  |  4 +++
  .../drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c   |  7 
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c| 33 ++-
  .../drm/amd/amdgpu/amdgpu_amdkfd_gfx_v11.c|  4 +++
  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 24 ++
  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 22 +++--
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c| 21 +---


Looks like you missed one amdgpu_amdkfd_gfx_off_ctrl call in kfd_process.c.



  7 files changed, 77 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
index 60f9e027fb66..1f0e6ec56618 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_aldebaran.c
@@ -150,6 +150,8 @@ static uint32_t kgd_gfx_aldebaran_set_address_watch(
VALID,
1);
  
+	amdgpu_gfx_off_ctrl(adev, false);

+


Aldebaran doesn't use automatic gfxoff, so this should not be needed.



WREG32_RLC((SOC15_REG_OFFSET(GC, 0, regTCP_WATCH0_ADDR_H) +
(watch_id * TCP_WATCH_STRIDE)),
watch_address_high);
@@ -158,6 +160,8 @@ static uint32_t kgd_gfx_aldebaran_set_address_watch(
(watch_id * TCP_WATCH_STRIDE)),
watch_address_low);
  
+	amdgpu_gfx_off_ctrl(adev, true);

+
return watch_address_cntl;
  }
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c

index 625db444df1c..a4e28d547173 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
@@ -350,6 +350,8 @@ static uint32_t kgd_arcturus_enable_debug_trap(struct 
amdgpu_device *adev,
bool restore_dbg_registers,
uint32_t vmid)
  {
+   amdgpu_gfx_off_ctrl(adev, false);
+


I would need to double check, but I believe Arcturus also doesn't 
support gfxoff.




mutex_lock(&adev->grbm_idx_mutex);
  
  	kgd_gfx_v9_set_wave_launch_stall(adev, vmid, true);

@@ -362,6 +364,8 @@ static uint32_t kgd_arcturus_enable_debug_trap(struct 
amdgpu_device *adev,
  
	mutex_unlock(&adev->grbm_idx_mutex);
  
+	amdgpu_gfx_off_ctrl(adev, true);

+
return 0;
  }
  
@@ -375,6 +379,7 @@ static uint32_t kgd_arcturus_disable_debug_trap(struct amdgpu_device *adev,

bool keep_trap_enabled,
uint32_t vmid)
  {
+   amdgpu_gfx_off_ctrl(adev, false);
  
	mutex_lock(&adev->grbm_idx_mutex);
  
@@ -388,6 +393,8 @@ static uint32_t kgd_arcturus_disable_debug_trap(struct amdgpu_device *adev,
  
	mutex_unlock(&adev->grbm_idx_mutex);
  
+	amdgpu_gfx_off_ctrl(adev, true);

+
return 0;
  }
  const struct kfd2kgd_calls arcturus_kfd2kgd = {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
index 8ad7a7779e14..415928139861 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v10.c
@@ -754,12 +754,13 @@ uint32_t kgd_gfx_v10_enable_debug_trap(struct 
amdgpu_device *adev,
bool restore_dbg_registers,
uint32_t vmid)
  {
+   amdgpu_gfx_off_ctrl(adev, false);
  
	mutex_lock(&adev->grbm_idx_mutex);
  
  	kgd_gfx_v10_set_wave_launch_stall(adev, vmid, true);
  
-	/* assume gfx off is disabled for the debug session if rlc restore not supported. */

+   /* keep gfx off disabled for the debug session if rlc restore not supported. */
if (restore_dbg_registers) {
uint32_t data = 0;
  
@@ -784,6 +785,8 @@ uint32_t kgd_gfx_v10_enable_debug_trap(struct amdgpu_device *adev,
  
	mutex_unlock(&adev->grbm_idx_mutex);
  
+	amdgpu_gfx_off_ctrl(adev, true);

+
return 0;
  }
  
@@ -791,6 +794,8 @@ uint32_t kgd_gfx_v10_disable_debug_trap(struct amdgpu_device *adev,

bool keep_trap_enabled,
uint32_t vmid)
  {
+   amdgpu_gfx_off_ctrl(adev, false);
+
mutex_lock(&adev->grbm_idx_mutex);
  
  	kgd_gfx_v10_set_wave_launch_stall(adev, vmid, true);

@@ -801,6 +806,16 @@ uint32_t 

Re: [PATCH] drm/amdkfd: fix and enable debugging for gfx11

2023-06-07 Thread Felix Kuehling



On 2023-06-07 13:26, Jonathan Kim wrote:

There are a few fixes required to enable gfx11 debugging.

First, ADD_QUEUE.trap_en is an inappropriate place to toggle
a per-process register so move it to SET_SHADER_DEBUGGER.trap_en.
When ADD_QUEUE.skip_process_ctx_clear is set, MES will prioritize
the SET_SHADER_DEBUGGER.trap_en setting.


I see you have a firmware version check for enabling debugging. But is 
the struct SET_SHADER_DEBUGGER change safe with older firmware when 
debugging is disabled?





Second, to preserve correct save/restore of privileged wave states
in coordination with the trap enablement setting, resume suspended
waves early in the disable call.

Finally, displaced single stepping can cause non-fatal illegal
instructions during process termination on debug disable.  To work
around this, stall the waves prior to disable and allow clean
up to happen naturally on process termination.

NOTE: The AMDGPU_MES_VERSION_MASK check is a place holder as
MES FW updates have been reviewed but is awaiting binary
creation.  Once the binaries have been created, this check may
be subject to change.

Signed-off-by: Jonathan Kim 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c   |  5 ++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h   |  4 ++-
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c|  1 +
  drivers/gpu/drm/amd/amdkfd/kfd_debug.c| 31 ++-
  .../drm/amd/amdkfd/kfd_device_queue_manager.c |  3 +-
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 12 ---
  drivers/gpu/drm/amd/include/mes_v11_api_def.h |  1 +
  7 files changed, 40 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 20cc3fffe921..95d69f9c7361 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -928,7 +928,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
uint64_t process_context_addr,
uint32_t spi_gdbg_per_vmid_cntl,
const uint32_t *tcp_watch_cntl,
-   uint32_t flags)
+   uint32_t flags,
+   bool trap_en)
  {
struct mes_misc_op_input op_input = {0};
int r;
@@ -945,6 +946,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
memcpy(op_input.set_shader_debugger.tcp_watch_cntl, tcp_watch_cntl,
sizeof(op_input.set_shader_debugger.tcp_watch_cntl));
  
+	op_input.set_shader_debugger.trap_en = trap_en;

+
amdgpu_mes_lock(&adev->mes);
  
	r = adev->mes.funcs->misc_op(&adev->mes, &op_input);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index b5f5eed2b5ef..2d6ac30b7135 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -294,6 +294,7 @@ struct mes_misc_op_input {
} flags;
uint32_t spi_gdbg_per_vmid_cntl;
uint32_t tcp_watch_cntl[4];
+   uint32_t trap_en;
} set_shader_debugger;
};
  };
@@ -361,7 +362,8 @@ int amdgpu_mes_set_shader_debugger(struct amdgpu_device 
*adev,
uint64_t process_context_addr,
uint32_t spi_gdbg_per_vmid_cntl,
const uint32_t *tcp_watch_cntl,
-   uint32_t flags);
+   uint32_t flags,
+   bool trap_en);
  
  int amdgpu_mes_add_ring(struct amdgpu_device *adev, int gang_id,

int queue_type, int idx,
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index c4e3cb8d44de..1bdaa00c0b46 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -347,6 +347,7 @@ static int mes_v11_0_misc_op(struct amdgpu_mes *mes,
memcpy(misc_pkt.set_shader_debugger.tcp_watch_cntl,
input->set_shader_debugger.tcp_watch_cntl,

sizeof(misc_pkt.set_shader_debugger.tcp_watch_cntl));
+   misc_pkt.set_shader_debugger.trap_en = input->set_shader_debugger.trap_en;
break;
default:
DRM_ERROR("unsupported misc op (%d) \n", input->op);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
index 125274445f43..e7bc07068eed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_debug.c
@@ -349,12 +349,30 @@ int kfd_dbg_set_mes_debug_mode(struct kfd_process_device 
*pdd)
  {
uint32_t spi_dbg_cntl = pdd->spi_dbg_override | 
pdd->spi_dbg_launch_mode;
uint32_t flags = pdd->process->dbg_flags;
+   bool sq_trap_en = !!spi_dbg_cntl;
  
  	if 

Re: [PATCH v2 3/3] drm/amdkfd: don't sleep when event age unmatch

2023-06-07 Thread Felix Kuehling

On 2023-06-06 12:24, James Zhu wrote:

Don't sleep when the event age does not match, and update last_event_age.
It is only for KFD_EVENT_TYPE_SIGNAL which is checked by user space.

Signed-off-by: James Zhu 
---
  drivers/gpu/drm/amd/amdkfd/kfd_events.c | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index c7689181cc22..f4ceb5be78ed 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -952,6 +952,21 @@ int kfd_wait_on_events(struct kfd_process *p,
event_data.event_id);
if (ret)
goto out_unlock;
+
+   /* last_event_age = 0 is reserved for backward compatibility */
+   if (event_data.signal_event_data.last_event_age &&
+   event_waiters[i].event->event_age !=
+   event_data.signal_event_data.last_event_age) {
+   event_data.signal_event_data.last_event_age =
+   event_waiters[i].event->event_age;


The event_age is updated in set_event under the event->spin_lock. You 
need to take that lock for this check here as well.


I think the easiest way to do this would be to move the check into 
init_event_waiter. That way you can initialize the waiter as activated 
if the event age is not up to date.




+   WRITE_ONCE(event_waiters[i].activated, true);
+
+   if (copy_to_user(&events[i], &event_data,
+   sizeof(struct kfd_event_data))) {
+   ret = -EFAULT;
+   goto out_unlock;
+   }
+   }


I think we also need to update the event age in event data after an 
event has signaled. You should probably move updating and copying of the 
event age to user mode into copy_signaled_event_data. That way it would 
handle all the cases.


Regards,
  Felix



}
  
  	/* Check condition once. */


Re: [PATCH v2 1/3] drm/amdkfd: add event age tracking

2023-06-07 Thread Felix Kuehling

On 2023-06-06 12:24, James Zhu wrote:

Add event age tracking

Signed-off-by: James Zhu 
---
  include/uapi/linux/kfd_ioctl.h | 13 +++--
  1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 1781e7669982..eeb2fdcbdcb7 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -39,9 +39,10 @@
   * - 1.11 - Add unified memory for ctx save/restore area
   * - 1.12 - Add DMA buf export ioctl
   * - 1.13 - Add debugger API
+ * - 1.14 - Update kfd_event_data
   */
  #define KFD_IOCTL_MAJOR_VERSION 1
-#define KFD_IOCTL_MINOR_VERSION 13
+#define KFD_IOCTL_MINOR_VERSION 14


Bumping the version number should be done in the last patch in the 
series, once the feature is fully enabled.


Regards,
  Felix


  
  struct kfd_ioctl_get_version_args {

__u32 major_version;/* from KFD */
@@ -320,12 +321,20 @@ struct kfd_hsa_hw_exception_data {
__u32 gpu_id;
  };
  
+/* hsa signal event data */

+struct kfd_hsa_signal_event_data {
+   __u64 last_event_age;   /* to and from KFD */
+};
+
  /* Event data */
  struct kfd_event_data {
union {
+   /* From KFD */
struct kfd_hsa_memory_exception_data memory_exception_data;
struct kfd_hsa_hw_exception_data hw_exception_data;
-   };  /* From KFD */
+   /* To and From KFD */
+   struct kfd_hsa_signal_event_data signal_event_data;
+   };
__u64 kfd_event_data_ext;   /* pointer to an extension structure
   for future exception types */
__u32 event_id; /* to KFD */


Re: [PATCH] drm/amdkfd: Fix reserved SDMA queues handling

2023-06-07 Thread Felix Kuehling



On 2023-06-07 11:27, Mukul Joshi wrote:

This patch fixes a regression caused by a bad merge where
the handling of reserved SDMA queues was accidentally removed.
With the fix, the reserved SDMA queues are again correctly
marked as unavailable for allocation.

Fixes: c27842c84a848 ("drm/amdkfd: Update SDMA queue management for GFX9.4.3")
Signed-off-by: Mukul Joshi 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_device.c | 13 ++---
  .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c   | 10 +-
  drivers/gpu/drm/amd/amdkfd/kfd_priv.h   |  2 +-
  3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
index 9fc9d32cb579..9d4abfd8b55e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -106,20 +106,19 @@ static void kfd_device_info_set_sdma_info(struct kfd_dev 
*kfd)
kfd->device_info.num_sdma_queues_per_engine = 8;
}
  
+	bitmap_zero(kfd->device_info.reserved_sdma_queues_bitmap, KFD_MAX_SDMA_QUEUES);

+
switch (sdma_version) {
case IP_VERSION(6, 0, 0):
+   case IP_VERSION(6, 0, 1):
case IP_VERSION(6, 0, 2):
case IP_VERSION(6, 0, 3):
/* Reserve 1 for paging and 1 for gfx */
kfd->device_info.num_reserved_sdma_queues_per_engine = 2;
/* BIT(0)=engine-0 queue-0; BIT(1)=engine-1 queue-0; 
BIT(2)=engine-0 queue-1; ... */
-   kfd->device_info.reserved_sdma_queues_bitmap = 0xFULL;
-   break;
-   case IP_VERSION(6, 0, 1):
-   /* Reserve 1 for paging and 1 for gfx */
-   kfd->device_info.num_reserved_sdma_queues_per_engine = 2;
-   /* BIT(0)=engine-0 queue-0; BIT(1)=engine-0 queue-1; ... */
-   kfd->device_info.reserved_sdma_queues_bitmap = 0x3ULL;
+   bitmap_set(kfd->device_info.reserved_sdma_queues_bitmap, 0,
+  kfd->adev->sdma.num_instances *
+  
kfd->device_info.num_reserved_sdma_queues_per_engine);
break;
default:
break;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 0c1be91a87c6..498ad7d4e7d9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -123,11 +123,6 @@ unsigned int get_num_xgmi_sdma_queues(struct 
device_queue_manager *dqm)
dqm->dev->kfd->device_info.num_sdma_queues_per_engine;
  }
  
-static inline uint64_t get_reserved_sdma_queues_bitmap(struct device_queue_manager *dqm)

-{
-   return dqm->dev->kfd->device_info.reserved_sdma_queues_bitmap;
-}
-
  static void init_sdma_bitmaps(struct device_queue_manager *dqm)
  {
bitmap_zero(dqm->sdma_bitmap, KFD_MAX_SDMA_QUEUES);
@@ -135,6 +130,11 @@ static void init_sdma_bitmaps(struct device_queue_manager 
*dqm)
  
  	bitmap_zero(dqm->xgmi_sdma_bitmap, KFD_MAX_SDMA_QUEUES);

bitmap_set(dqm->xgmi_sdma_bitmap, 0, get_num_xgmi_sdma_queues(dqm));
+
+   /* Mask out the reserved queues */
+   bitmap_andnot(dqm->sdma_bitmap, dqm->sdma_bitmap,
+ dqm->dev->kfd->device_info.reserved_sdma_queues_bitmap,
+ KFD_MAX_SDMA_QUEUES);
  }
  
  void program_sh_mem_settings(struct device_queue_manager *dqm,

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 023b17e0116b..7364a5d77c6e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -239,7 +239,7 @@ struct kfd_device_info {
uint32_t no_atomic_fw_version;
unsigned int num_sdma_queues_per_engine;
unsigned int num_reserved_sdma_queues_per_engine;
-   uint64_t reserved_sdma_queues_bitmap;
+   DECLARE_BITMAP(reserved_sdma_queues_bitmap, KFD_MAX_SDMA_QUEUES);
  };
  
  unsigned int kfd_get_num_sdma_engines(struct kfd_node *kdev);

