On Mon, Jan 12, 2026 at 3:28 PM Felix Kuehling <[email protected]> wrote:
>
>
> On 2026-01-12 09:06, Donet Tom wrote:
> > RFC -> v2
> > =========
> >
> > In RFC patch v1 [1], there were 8 patches. From that series, patches 1–3
> > are required to enable minimal support for 64K pages in AMDGPU. I have
> > added those 3 patches in this series.
> >
> > With these three patches applied, all RCCL tests and the rocr-debug-agent
> > tests pass on a ppc64le system with 64K page size on 2 GPUs. However, on
> > systems with more than 2 GPUs and with XNACK enabled, we require
> > additional patches [4-8], which were posted earlier as part of RFC [1].
> > Since those require a bit of additional work and discussion, we will
> > post v2 of them later as Part-2.
> >
> > 1. Patch 1 was updated to only relax the EOP buffer size check, based on
> >    Philip Yang’s comment.
> >
> > 2. Philip’s review comments on Patch 2 were addressed, and Reviewed-by
> >    tags were added to Patch 2 and Patch 3.
> >
> > [1] https://lore.kernel.org/all/[email protected]/
> >
> > If this looks good, could we pull these changes into v6.20?
>
> The series looks good to me.
>
> Reviewed-by: Felix Kuehling <[email protected]>
>
> Alex, what does it take to get this into 6.20? I guess you'll want to
> include this in a pull-request for drm-fixes ASAP?

Yes, if you can land it in amd-staging-drm-next ASAP, I'll include it
in this week's PR.

Alex

>
> Regards,
>   Felix
>
>
> >
> > This patch series addresses a few issues which we encountered while
> > running the rocr debug agent and rccl unit tests with an AMD GPU on
> > Power10 (ppc64le), using 64K system pagesize.
> >
> > Note that we don't observe any of these issues while booting with 4K
> > system pagesize on Power. With the 64K system pagesize, what we have
> > observed so far is that, in a few places, the conversion between GPU pfn
> > and CPU pfn (or vice versa) may not be done correctly, because the AMD
> > GPU page size (4K) differs from the CPU page size (64K). This causes
> > issues like GPU page faults or GPU hangs while running these tests.
> >
> > Changes so far in this series:
> > =============================
> > 1. For now, during kfd queue creation, this patch lifts the restriction
> >    that the EOP buffer size must be the same as the buffer object
> >    mapping size.
> >
> > 2. Fix SVM range map/unmap operations to convert CPU page numbers to GPU
> >    page numbers before calling amdgpu_vm_update_range(), which expects
> >    4K GPU pages. Without this, the rocr-debug-agent tests and rccl unit
> >    tests were failing.
> >
> > 3. Fix GART PTE allocation in migration code to account for multiple GPU
> >    pages per CPU page. The current code only allocates PTEs based on the
> >    number of CPU pages, but GART may need one PTE per 4K GPU page.
> >
> > Setup details:
> > ============
> > System details: Power10 LPAR using 64K pagesize.
> > AMD GPU:
> >  Name:                    gfx90a
> >  Marketing Name:          AMD Instinct MI210
> >
> > Donet Tom (3):
> >   drm/amdkfd: Relax size checking during queue buffer get
> >   drm/amdkfd: Fix SVM map/unmap address conversion for non-4k page sizes
> >   drm/amdkfd: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map()
> >
> >  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
> >  drivers/gpu/drm/amd/amdkfd/kfd_queue.c   |  6 ++---
> >  drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 29 +++++++++++++++++------
> >  3 files changed, 25 insertions(+), 12 deletions(-)
> >
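
For readers following the page-size discussion in the cover letter, the arithmetic behind changes 2 and 3 can be sketched in plain C. This is only an illustration under stated assumptions, not the code from the patches: the 64K and 4K sizes come from the cover letter, while the macro and helper names below are hypothetical, and the range end is assumed to be inclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the cover letter: 64K CPU pages, 4K GPU pages. */
#define CPU_PAGE_SHIFT 16u
#define GPU_PAGE_SHIFT 12u
/* Hypothetical name: how many 4K GPU pages fit in one CPU page (16 here). */
#define GPU_PAGES_PER_CPU_PAGE (1u << (CPU_PAGE_SHIFT - GPU_PAGE_SHIFT))

static_assert(CPU_PAGE_SHIFT >= GPU_PAGE_SHIFT,
              "CPU page must not be smaller than GPU page");

/* Hypothetical helper: first GPU page number covered by a CPU page number. */
static inline uint64_t cpu_pfn_to_gpu_pfn(uint64_t cpu_pfn)
{
    return cpu_pfn << (CPU_PAGE_SHIFT - GPU_PAGE_SHIFT);
}

/*
 * Hypothetical helper: last GPU page number covered by an inclusive
 * CPU range end. The end must cover every GPU page inside the final
 * CPU page, so it is not simply the shifted CPU page number.
 */
static inline uint64_t cpu_last_to_gpu_last(uint64_t cpu_last)
{
    return ((cpu_last + 1) << (CPU_PAGE_SHIFT - GPU_PAGE_SHIFT)) - 1;
}

/* Hypothetical helper: PTE count when one PTE is needed per 4K GPU page. */
static inline uint64_t gpu_ptes_for_cpu_pages(uint64_t cpu_npages)
{
    return cpu_npages * GPU_PAGES_PER_CPU_PAGE;
}
```

With 4K CPU pages both shifts are equal and every helper degenerates to the identity, which is consistent with the observation that the issues do not appear with a 4K system page size.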
