[PATCH] drm/kfd: Correct pined buffer handling at kfd restore and validate process

2024-05-13 Thread Xiaogang . Chen
From: Xiaogang Chen This reverts 8a774fe912ff09e39c2d3a3589c729330113f388 "drm/amdgpu: avoid restore process run into dead loop", since whether a buffer is pinned is unrelated to whether it needs mapping. Also skip buffer validation in the kfd driver if the buffer has been pinned. Signed-off-by: Xia

[PATCH] amd/kfd: cancle work iterms at ih_wq in kfd_interrupt_exit

2024-04-10 Thread Xiaogang . Chen
From: Xiaogang Chen While the kfd/amdgpu driver is tearing down, it cannot handle callbacks from ih_wq. If work items are still left, cancel them instead of flushing, which would wait until they are served. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c | 7 --- 1 file

[PATCH v4] drm/amdkfd: Use partial hmm page walk during buffer validation in SVM

2023-12-13 Thread Xiaogang . Chen
From: Xiaogang Chen v2: -no need to calculate the vram page number for a newly registered svm range; only do it for split vram pages. v3: -use the dma address to calculate the vram page number of a split svm range; use migrate_vma from hmm to calculate the page number that migrates to vram. v4: -combine calculating

[PATCH v3] drm/amdkfd: Use partial hmm page walk during buffer validation in SVM

2023-12-06 Thread Xiaogang . Chen
From: Xiaogang Chen v2: -no need to calculate the vram page number for a newly registered svm range; only do it for split vram pages. v3: -use the dma address to calculate the vram page number of a split svm range; use migrate_vma from hmm to calculate the page number that migrates to vram. SVM uses hmm page walk

[PATCH v2] drm/amdkfd: Use partial hmm page walk during buffer validation in SVM

2023-12-04 Thread Xiaogang . Chen
From: Xiaogang Chen v2: -no need to calculate the vram page number for a newly registered svm range; only do it for split vram pages. SVM uses an hmm page walk to validate a buffer before mapping it to the gpu vm. With partial migration/mapping in place, do validation on the same vm range that the migration/mapping covers instead of the whole svm

[PATCH] drm/amdkfd: Use partial hmm page walk during buffer validation in SVM

2023-12-04 Thread Xiaogang . Chen
From: Xiaogang Chen SVM uses an hmm page walk to validate a buffer before mapping it to the gpu vm. With partial migration/mapping in place, do validation on the same vm range that the migration/mapping covers instead of the whole svm range, which can be very large. This change is expected to improve svm code performance. Signed-off

[PATCH] drm/amdkfd: Use partial migrations/mapping for GPU/CPU page faults in SVM

2023-11-14 Thread Xiaogang . Chen
From: Xiaogang Chen This patch implements partial migration/mapping for gpu/cpu page faults in SVM according to migration granularity (default 2MB). A svm range may now include pages from both system ram and vram of one gpu. These changes are expected to improve migration performance and reduce
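The granularity-based partial range described in this entry can be illustrated with a minimal user-space sketch. It assumes the partial range is the granule-aligned window around the faulting page, clamped to the svm range; `struct page_range`, `partial_range` and the parameter names are hypothetical and are not taken from kfd_svm.c. With 4KiB pages, the default 2MB granularity corresponds to 512 pages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type for illustration only; not the kfd_svm.c definition. */
struct page_range {
    uint64_t start; /* first page index, inclusive */
    uint64_t last;  /* last page index, inclusive */
};

/*
 * Return the granule-aligned window that contains fault_page, clamped to
 * the enclosing svm range. granule_pages must be a power of two
 * (assumption: 512 pages for a 2MB granule with 4KiB pages).
 */
static struct page_range
partial_range(struct page_range svm, uint64_t fault_page,
              uint64_t granule_pages)
{
    struct page_range r;
    uint64_t lo, hi;

    assert(granule_pages && (granule_pages & (granule_pages - 1)) == 0);
    assert(fault_page >= svm.start && fault_page <= svm.last);

    lo = fault_page & ~(granule_pages - 1); /* align down to the granule */
    hi = lo + granule_pages - 1;            /* last page of that granule */

    r.start = lo > svm.start ? lo : svm.start;
    r.last = hi < svm.last ? hi : svm.last;
    return r;
}
```

Only the pages in the returned window would then be migrated or mapped, rather than the whole svm range in which the fault happened.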

[PATCH v3] drm/amdkfd: Use partial mapping in GPU page faults

2023-10-20 Thread Xiaogang . Chen
From: Xiaogang Chen After partial migration to recover GPU page fault this patch does GPU vm space mapping for same page range that got migrated instead of mapping all pages of svm range in which the page fault happened. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 29

[PATCH v2] drm/amdkfd: Use partial mapping in GPU page fault recovery

2023-10-20 Thread Xiaogang . Chen
From: Xiaogang Chen After partial migration to recover GPU page fault this patch does GPU vm space mapping for same page range that got migrated instead of mapping all pages of svm range in which the page fault happened. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c

[PATCH] drm/amdkfd: Use partial mapping in GPU page fault recovery

2023-10-19 Thread Xiaogang . Chen
From: Xiaogang Chen After partial migration to recover GPU page fault this patch does GPU vm space mapping for same page range that got migrated instead of mapping all pages of svm range in which the page fault happened. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c

[PATCH v2] drm/amdgpu: Correctly use bo_va->ref_count in compute VMs

2023-10-12 Thread Xiaogang . Chen
From: Xiaogang Chen This is needed to correctly handle BOs imported into compute VM from gfx. Both kfd and gfx should use the same bo_va and set bo_va->ref_count correctly when mapping the BOs into the same VM; otherwise we may trigger a kernel general protection fault when iterating mappings over bo_va's val

[PATCH] Find bo_va before create it when map bo into compute VM

2023-10-11 Thread Xiaogang . Chen
From: Xiaogang Chen This is needed to correctly handle BOs imported into compute VM from gfx. Both kfd and gfx should use the same bo_va when mapping the BOs into the same VM; otherwise we may trigger a kernel general protection fault when iterating mappings from bo_va. Signed-off-by: Felix Kuehling Acked

[PATCH v4] drm/amdkfd: Use partial migrations in GPU page faults

2023-10-03 Thread Xiaogang . Chen
From: Xiaogang Chen This patch implements partial migration in gpu page fault handling according to migration granularity (default 2MB) and does not split the svm range in cpu page fault handling. A svm range may now include pages from both system ram and vram of one gpu. These changes are expected to improve

[PATCH] drm/amdkfd: Fix a race condition of vram buffer unref in svm code

2023-09-26 Thread Xiaogang . Chen
From: Xiaogang Chen prange->svm_bo unref can happen in both mmu callback and a callback after migrate to system ram. Both are async call in different tasks. Sync svm_bo unref operation to avoid random "use-after-free". Signed-off-by: Xiaogang.Chen --- drivers/gpu/drm/amd/amdkfd/

[PATCH v3] drm/amdkfd: Use partial migrations in GPU page faults

2023-09-20 Thread Xiaogang . Chen
From: Xiaogang Chen This patch implements partial migration in gpu page fault handling according to migration granularity (default 2MB) and does not split the svm range in cpu page fault handling. A svm range may now include pages from both system ram and vram of one gpu. These changes are expected to improve

[PATCH v2] drm/amdkfd: fix some race conditions in vram buffer alloc/free of svm code

2023-09-20 Thread Xiaogang . Chen
From: Xiaogang Chen This patch fixes: 1: the ref count of prange's svm_bo is decreased by an async call from hmm. When waiting for prange's svm_bo to be released we should also wait for prange->svm_bo to become NULL, otherwise prange->svm_bo may be set to NULL after a new vram buffer is allocated. 2: During w

[PATCH] drm/amdkfd: fix some race conditions in vram buffer alloc/free of svm code

2023-09-20 Thread Xiaogang . Chen
From: Xiaogang Chen This patch fixes: 1: the ref count of prange's svm_bo is decreased by an async call from hmm. When waiting for prange's svm_bo to be released we should also wait for prange->svm_bo to become NULL, otherwise prange->svm_bo may be set to NULL after a new vram buffer is allocated. 2: During w

[PATCH] drm/amdkfd: Seperate dma unmap and free of dma address array operations

2023-09-15 Thread Xiaogang . Chen
From: Xiaogang Chen We do not need to free the dma address array of an svm_range each time we do dma unmap for pages in the svm_range, as we can reuse the same array. Only free it when freeing the svm_range. Separate these two operations and use them accordingly. Signed-off-by: Xiaogang.Chen --- drivers/gpu/drm/amd

[PATCH v2] drm/amdkfd: Use partial migrations in GPU page faults

2023-09-11 Thread Xiaogang . Chen
From: Xiaogang Chen This patch implements partial migration in gpu page fault handling according to migration granularity (default 2MB) and does not split the svm range in cpu page fault handling. A svm range may now include pages from both system ram and vram of one gpu. These changes are expected to improve

[PATCH] drm/amdkfd: Use partial migrations in GPU page faults

2023-08-24 Thread Xiaogang . Chen
From: Xiaogang Chen This patch implements partial migration in gpu page fault handling according to migration granularity (default 2MB) and does not split the svm range in cpu page fault handling. Now a svm range may have pages from both system ram and vram of one gpu. These changes are expected to improve

[PATCH] drm/amdgpu: have bos for PDs/PTS cpu accessible when kfd uses cpu to update vm

2023-06-29 Thread Xiaogang . Chen
From: Xiaogang Chen When kfd uses cpu to update vm iterates all current PDs/PTs bos, adds AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED flag and kmap them to kernel virtual address space before kfd updates the vm that was created by gfx. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdgpu

[PATCH] drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute

2023-06-19 Thread Xiaogang . Chen
From: Xiaogang Chen Since we allow kfd and graphics to operate on the same GPU VM for interoperation between them, the GPU VM may have been used by graphics vm operations before kfd turns a GPU VM into a compute VM. Remove the vm clean check in amdgpu_vm_make_compute. Signed-off-by: Xiaogang Chen

[PATCH] drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute

2023-06-19 Thread Xiaogang . Chen
From: Xiaogang Chen Since we allow kfd and graphics to operate on the same GPU VM for interoperation between them, the GPU VM may have been used by graphics vm operations before kfd turns a GFX VM into a compute VM. Remove the vm clean check in amdgpu_vm_make_compute. Signed-off-by: Xiaogang Chen

[PATCH] drm/amdkfd: Fix an issue at userptr buffer validation process.

2023-04-20 Thread Xiaogang . Chen
From: Xiaogang Chen amdgpu_ttm_tt_get_user_pages can fail (-EFAULT). If it fails, mem has no associated hmm range or user_pages. Keep it in process_info->userptr_inval_list and mark mem->invalid until a following scheduled attempt can validate it. Signed-off-by: Xiaogan

[PATCH] drm/amdkfd: Fix some issues at userptr buffer validation process.

2023-04-12 Thread Xiaogang . Chen
From: Xiaogang Chen Notice userptr buffer restore process has following issues: 1: amdgpu_ttm_tt_get_user_pages can fail(-EFAULT). If it failed we should not set it valid(mem->invalid = 0). In this case mem has no associated hmm range or user_pages associated. 2: mmu notifier can hap

[PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.

2023-04-10 Thread Xiaogang . Chen
From: Xiaogang Chen During KFD restore evicted userptr BOs mmu invalidate callback may invalidate same userptr BOs that have been just restored. When KFD restore process detects it KFD will reschedule another validation process. It is not an error. Change WARN to pr_debug, not put the BOs

[PATCH] drm/amdkfd: fix warnings in kfd_migrate.c

2023-03-10 Thread Xiaogang . Chen
From: Xiaogang Chen drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_migrate.c: In function ‘svm_migrate_copy_to_vram’: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_migrate.c:393:1: warning: label ‘out’ defined but not used [-Wunused-label] 393 | out: | ^~~ drivers/gpu/drm/amd/amdgpu/../amdkfd

[PATCH] drm/amdkfd: Get prange->offset after svm_range_vram_node_new

2023-03-07 Thread Xiaogang . Chen
From: Xiaogang Chen During migration to vram, prange->offset is valid after the vram buffer is located, either by reusing the old one or allocating a new one. Move svm_range_vram_node_new before the per-vma migration to get a valid prange->offset. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/

[PATCH v2] drm/amdkfd: Cal vram offset in TTM resource for each svm_migrate_copy_to_vram

2023-03-01 Thread Xiaogang . Chen
From: Xiaogang Chen svm_migrate_ram_to_vram migrates a prange from sys ram to vram. The prange may cross multiple vma. Need remember current dst vram offset in the TTM resource for each migration. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17

[PATCH] drm/amdkfd: Cal vram offset in page for each svm_migrate_copy_to_vram

2023-02-27 Thread Xiaogang . Chen
From: Xiaogang Chen svm_migrate_ram_to_vram migrate a prange from sys ram to vram. The prange may cross multiple vma. Need remember current dst vram offset in page for each migration. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++--- 1 file

[PATCH v3] drm/amdkfd: Prevent user space using both svm and kfd api to register same user buffer

2023-02-08 Thread Xiaogang . Chen
From: Xiaogang Chen When xnack is on user space can use svm page restore to set a vm range without setup it first, then use regular api to register. Currently kfd api and svm are not interoperable. We already have check on that, but for user buffer the mapping address is not same as buffer cpu

[PATCH v2] drm/amdkfd: Prevent user space using both svm and kfd api to register same user buffer

2023-02-08 Thread Xiaogang . Chen
From: Xiaogang Chen When xnack is on user space can use svm page restore to set a vm range without setup it first, then use regular api to register. Currently kfd api and svm are not interoperable. We already have check on that, but for user buffer the mapping address is not same as buffer cpu

[PATCH] drm/amdkfd: Prevent user space using both svm and kfd api to register same user buffer

2023-02-07 Thread Xiaogang . Chen
From: Xiaogang Chen When xnack is on user space can use svm page restore to set a vm range without setup it first, then use regular api to register. Currently kfd api and svm are not interoperable. We already have check on that, but for user buffer the mapping address is not same as buffer cpu

[PATCH] drm/amdkfd: Remove skiping userptr buffer mapping when mmu notifier marks it as invalid

2022-10-28 Thread Xiaogang . Chen
From: Xiaogang Chen The mmu notifier does not always hold mm->sem during its callback. That causes a race condition between kfd userptr buffer mapping and the mmu notifier, which leads to a gpu shader or SDMA accessing the userptr buffer before it has been mapped to the gpu VM. Always map userptr buffer to av

[PATCH v2] drm/amdgpu: config HDP_MISC_CNTL.READ_BUFFER_WATERMARK to fix applications running across multiple GPU config hang.

2022-02-22 Thread Xiaogang . Chen
From: Xiaogang Chen Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdgpu/hdp_v4_0.c | 3 +++ drivers/gpu/drm/amd/include/asic_reg/hdp/hdp_4_0_sh_mask.h | 2 ++ 2 files changed, 5 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/hdp_v4_0.c b/drivers/gpu/drm

[PATCH] drm/amdgpu: config HDP_MISC_CNTL.READ_BUFFER_WATERMARK to fix applications running across multiple GPU config hang.

2022-02-21 Thread Xiaogang . Chen
From: Xiaogang Chen Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdgpu/hdp_v4_0.c | 1 + drivers/gpu/drm/amd/include/asic_reg/hdp/hdp_4_0_sh_mask.h | 2 ++ 2 files changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/hdp_v4_0.c b/drivers/gpu/drm/amd

[PATCH] drm/amdkfd: explicitly create/destroy queue attributes under /sys

2021-12-09 Thread Xiaogang . Chen
From: Xiaogang Chen When application is about finish it destroys queues it has created by an ioctl. Driver deletes queue entry(/sys/class/kfd/kfd/proc/pid/queues/queueid/) which is directory including this queue all attributes. Low level kernel code deletes all attributes under this directory

[PATCH 2/2] drm/amdgpu/display: buffer INTERRUPT_LOW_IRQ_CONTEXT interrupt work

2021-01-03 Thread Xiaogang . Chen
From: Xiaogang Chen amdgpu DM handles INTERRUPT_LOW_IRQ_CONTEXT interrupt(hpd, hpd_rx) by using work queue and uses single work_struct. If previous interrupt has not been handled new interrupts(same type) will be discarded and driver just sends "amdgpu_dm_irq_schedule_work FAILED" m

[PATCH 1/2] drm: distinguish return value of drm_dp_check_and_send_link_address.

2021-01-03 Thread Xiaogang . Chen
From: Xiaogang Chen drm_dp_check_and_send_link_address discovers MST device topology. It can return both positive and negative values. When it returns positive values there is no error found. If it returns negative values there is error found, such as get NAK , timeout, etc. Following