On Mon, Jan 12, 2026 at 3:28 PM Felix Kuehling <[email protected]> wrote:
>
>
> On 2026-01-12 09:06, Donet Tom wrote:
> > RFC -> v2
> > =========
> >
> > In RFC patch v1 [1], there were 8 patches. From that series, patches 1–3 are
> > required to enable minimal support for 64K pages in AMDGPU. I have included
> > those 3 patches in this series.
> >
> > With these three patches applied, all RCCL tests and the rocr-debug-agent
> > tests pass on a ppc64le system with 64K page size on 2 GPUs. However, on
> > systems with more than 2 GPUs and with XNACK enabled, we require additional
> > patches [4-8], which were posted earlier as part of the RFC [1]. Since those
> > require a bit of additional work and discussion, we will post v2 of them
> > later as Part-2.
> >
> > 1. Patch 1 was updated to only relax the EOP buffer size check, based on
> >    Philip Yang’s comment.
> >
> > 2. Philip’s review comments on Patch 2 were addressed, and Reviewed-by tags
> >    were added to Patch 2 and Patch 3.
> >
> > [1] https://lore.kernel.org/all/[email protected]/
> >
> > If this looks good, could we pull these changes into v6.20?
>
> The series looks good to me.
>
> Reviewed-by: Felix Kuehling <[email protected]>
>
> Alex, what does it take to get this into 6.20? I guess you'll want to
> include this in a pull-request for drm-fixes ASAP?

Yes, if you can land it in amd-staging-drm-next ASAP, I'll include it
in this week's PR.

Alex

>
> Regards,
>    Felix
>
>
> >
> > This patch series addresses a few issues that we encountered while running
> > the rocr-debug-agent and RCCL unit tests with an AMD GPU on Power10
> > (ppc64le), using a 64K system page size.
> >
> > Note that we don't observe any of these issues when booting with a 4K system
> > page size on Power. With the 64K system page size, what we have observed so
> > far is that, in a few places, the conversion between GPU PFN and CPU PFN (or
> > vice versa) may not be done correctly, because the AMD GPU page size (4K)
> > differs from the CPU page size (64K). This causes issues such as GPU page
> > faults or GPU hangs while running these tests.
> >
> > Changes so far in this series:
> > =============================
> > 1. For now, during kfd queue creation, this patch lifts the restriction that
> >     the EOP buffer size must be the same as the buffer object mapping size.
> >
> > 2. Fix SVM range map/unmap operations to convert CPU page numbers to GPU
> >     page numbers before calling amdgpu_vm_update_range(), which expects 4K
> >     GPU pages. Without this, the rocr-debug-agent tests and RCCL unit tests
> >     were failing.
> >
> > 3. Fix GART PTE allocation in migration code to account for multiple GPU
> >     pages per CPU page. The current code only allocates PTEs based on the
> >     number of CPU pages, but GART may need one PTE per 4K GPU page.
> >
> > Setup details:
> > ============
> > System details: Power10 LPAR using 64K pagesize.
> > AMD GPU:
> >    Name:                    gfx90a
> >    Marketing Name:          AMD Instinct MI210
> >
> > Donet Tom (3):
> >    drm/amdkfd: Relax size checking during queue buffer get
> >    drm/amdkfd: Fix SVM map/unmap address conversion for non-4k page sizes
> >    drm/amdkfd: Fix GART PTE for non-4K pagesize in svm_migrate_gart_map()
> >
> >   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  2 +-
> >   drivers/gpu/drm/amd/amdkfd/kfd_queue.c   |  6 ++---
> >   drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 29 +++++++++++++++++-------
> >   3 files changed, 25 insertions(+), 12 deletions(-)
> >
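For reference, the page-size arithmetic behind patches 2 and 3 can be sketched
as below. This is a standalone illustration and not code from the series: the
helper names and the GPU_PAGE_SHIFT constant are hypothetical, and the only
facts taken from the thread are the 4K GPU page size and the 64K CPU page size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: GPU pages are 4K (shift 12), as in the thread. */
#define GPU_PAGE_SHIFT 12u

/* Hypothetical helper: number of 4K GPU pages covered by one CPU page. */
static inline uint64_t gpu_pages_per_cpu_page(unsigned int cpu_page_shift)
{
	/* A CPU page smaller than a GPU page is outside this sketch. */
	assert(cpu_page_shift >= GPU_PAGE_SHIFT);
	return UINT64_C(1) << (cpu_page_shift - GPU_PAGE_SHIFT);
}

/* Hypothetical helper: first GPU page number covered by a CPU page number. */
static inline uint64_t cpu_pfn_to_gpu_pfn(uint64_t cpu_pfn,
					  unsigned int cpu_page_shift)
{
	return cpu_pfn * gpu_pages_per_cpu_page(cpu_page_shift);
}

/* Hypothetical helper: GPU PTEs needed to map npages CPU pages. */
static inline uint64_t gpu_ptes_for_cpu_pages(uint64_t npages,
					      unsigned int cpu_page_shift)
{
	return npages * gpu_pages_per_cpu_page(cpu_page_shift);
}
```

With a 64K CPU page (shift 16), each CPU page spans 16 GPU pages, so 8 CPU
pages need 128 GPU PTEs. With a 4K CPU page the factor is 1, which is
consistent with the issues not showing up on 4K kernels.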
