Re: Need to support mixed memory mappings with HMM

2021-03-26 Thread Christian König

On 2021-03-26 at 15:49, Felix Kuehling wrote:

On 2021-03-26 at 4:50 a.m., Christian König wrote:


On 2021-03-25 at 17:28, Felix Kuehling wrote:

On 2021-03-25 at 12:22 p.m., Christian König wrote:

My idea is to change the amdgpu_vm_update_mapping interface to use
some
high-bit in the pages_addr array to indicate the page type. For
the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing
callers
it's as easy as changing flags to &flags (array of size 1). For
HMM we
would pass a pointer to a real array.

Yeah, I've thought about stuff like that as well for a while.

Problem is that this won't work that easily. We assume in many places
that the flags don't change for the range in question.

I think some lower level functions assume that the flags stay the same
for physically contiguous ranges. But if you use the high bits to
encode the page type, the ranges won't be contiguous any more. So you
can still use different page flags for different contiguous ranges.

Yeah, but then you also get absolutely zero THP and fragment flags
support.

As long as you have a contiguous 2MB page with the same page type, I
think you can still get a THP mapping in the GPU page table. But if
one
page in the middle of a 2MB page has a different page type, that will
break the THP mapping, as it should.

Yeah, but currently we detect that before we call down into the
functions to update the tables.

When you give it a mixed list, the chance for that is just completely gone.

Currently the detection of contiguous ranges is in
amdgpu_vm_update_mapping:


if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
    uint64_t pfn = cursor.start >> PAGE_SHIFT;
    uint64_t count;

    contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
    tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
    for (count = 2; count < tmp; ++count) {
        uint64_t idx = pfn + count;

        if (contiguous != (pages_addr[idx] ==
                           pages_addr[idx - 1] + PAGE_SIZE))
            break;
    }
    num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
}

If a page type changes from one page to the next, it will end a
contiguous range and call into the lower level (amdgpu_vm_update_ptes).
I don't think anything needs to change in amdgpu_vm_update_ptes, because
all contiguous pages passed to it use the same page type, so the same
flags. And the existing code in amdgpu_vm_update_mapping already does
the right thing. I honestly don't see the problem.

But then you could also just call amdgpu_vm_update_mapping() multiple
times, feeding it from whatever data structure you use in the HMM code.

Using the page array sounds to me like an intermediate data structure
you only created to feed into amdgpu_vm_update_mapping().

You're right. It could be done. One new call to amdgpu_vm_update_mapping
every time the flags change as we iterate over virtual addresses. It
should work OK as long as the mappings of different memory types aren't
too finely interleaved.


I suggest that we stick to this approach for now.

The only alternative I see is that we have a unified address space for
DMA addresses, similar to what you have outlined.


This way we can easily figure out which device each address belongs to
and then calculate the best path to that device.


E.g. using the address directly for system memory, clearing the S bit
for local memory, using XGMI for remote memory, etc.


That is similar to what you have in mind, just a bit more generalized.
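
As a purely illustrative sketch of that lookup (all structure and
function names here are invented; only the S-bit / MTYPE / XGMI
distinction comes from the description above):

struct unified_mem_region {
    uint64_t start, end;          /* range in the unified DMA address space */
    struct amdgpu_device *owner;  /* NULL for system memory */
};

/* Pick PTE flags for a region based on its owner relative to the GPU
 * that is mapping it. flags[] holds precomputed values for system
 * memory, local VRAM and remote XGMI VRAM. */
static uint64_t region_pte_flags(struct amdgpu_device *adev,
                                 const struct unified_mem_region *r,
                                 const uint64_t flags[3])
{
    if (!r->owner)
        return flags[0];   /* system memory: S bit set */
    if (r->owner == adev)
        return flags[1];   /* local VRAM: S bit cleared, local MTYPE */
    return flags[2];       /* remote VRAM over XGMI: MTYPE + C */
}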

Regards,
Christian.



Regards,
   Felix



Regards,
Christian.


Regards,
    Felix



Regards,
Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




Re: Need to support mixed memory mappings with HMM

2021-03-26 Thread Felix Kuehling
On 2021-03-26 at 4:50 a.m., Christian König wrote:
>
>
> On 2021-03-25 at 17:28, Felix Kuehling wrote:
>> On 2021-03-25 at 12:22 p.m., Christian König wrote:
 My idea is to change the amdgpu_vm_update_mapping interface to use
 some
 high-bit in the pages_addr array to indicate the page type. For
 the
 default page type (0) nothing really changes for the callers. The
 "flags" parameter needs to become a pointer to an array that gets
 indexed by the high bits from the pages_addr array. For existing
 callers
 it's as easy as changing flags to &flags (array of size 1). For
 HMM we
 would pass a pointer to a real array.
>>> Yeah, I've thought about stuff like that as well for a while.
>>>
>>> Problem is that this won't work that easily. We assume in many
>>> places that the flags don't change for the range in question.
>> I think some lower level functions assume that the flags stay the
>> same for physically contiguous ranges. But if you use the high bits
>> to encode the page type, the ranges won't be contiguous any more. So
>> you can still use different page flags for different contiguous
>> ranges.
> Yeah, but then you also get absolutely zero THP and fragment flags
> support.
 As long as you have a contiguous 2MB page with the same page type, I
 think you can still get a THP mapping in the GPU page table. But if
 one
 page in the middle of a 2MB page has a different page type, that will
 break the THP mapping, as it should.
>>> Yeah, but currently we detect that before we call down into the
>>> functions to update the tables.
>>>
>>> When you give it a mixed list, the chance for that is just completely gone.
>> Currently the detection of contiguous ranges is in
>> amdgpu_vm_update_mapping:
>>
>>> if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
>>>     uint64_t pfn = cursor.start >> PAGE_SHIFT;
>>>     uint64_t count;
>>>
>>>     contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
>>>     tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>>>     for (count = 2; count < tmp; ++count) {
>>>         uint64_t idx = pfn + count;
>>>
>>>         if (contiguous != (pages_addr[idx] ==
>>>                            pages_addr[idx - 1] + PAGE_SIZE))
>>>             break;
>>>     }
>>>     num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>>> }
>> If a page type changes from one page to the next, it will end a
>> contiguous range and call into the lower level (amdgpu_vm_update_ptes).
>> I don't think anything needs to change in amdgpu_vm_update_ptes, because
>> all contiguous pages passed to it use the same page type, so the same
>> flags. And the existing code in amdgpu_vm_update_mapping already does
>> the right thing. I honestly don't see the problem.
>
> But then you could also just call amdgpu_vm_update_mapping() multiple
> times, feeding it from whatever data structure you use in the HMM code.
>
> Using the page array sounds to me like an intermediate data structure
> you only created to feed into amdgpu_vm_update_mapping().

You're right. It could be done. One new call to amdgpu_vm_update_mapping
every time the flags change as we iterate over virtual addresses. It
should work OK as long as the mappings of different memory types aren't
too finely interleaved.
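
For illustration, a rough sketch of that loop; page_type() and
map_run() are hypothetical stand-ins for looking up a page's type from
the HMM output and for the real amdgpu_vm_update_mapping call with its
full argument list:

static int svm_map_mixed(struct amdgpu_vm *vm, uint64_t start,
                         uint64_t npages, const uint64_t *pages_addr,
                         const uint64_t *flags)
{
    uint64_t i = 0;

    while (i < npages) {
        unsigned int type = page_type(pages_addr[i]);   /* hypothetical */
        uint64_t j = i + 1;
        int r;

        /* extend the run while the page type (and thus the flags) stays the same */
        while (j < npages && page_type(pages_addr[j]) == type)
            j++;

        /* one mapping update for pages [i, j) of this type */
        r = map_run(vm, start + i * PAGE_SIZE, start + j * PAGE_SIZE,
                    &pages_addr[i], flags[type]);
        if (r)
            return r;
        i = j;
    }
    return 0;
}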

Regards,
  Felix


> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Need to support mixed memory mappings with HMM

2021-03-26 Thread Christian König



On 2021-03-25 at 17:28, Felix Kuehling wrote:

On 2021-03-25 at 12:22 p.m., Christian König wrote:

My idea is to change the amdgpu_vm_update_mapping interface to use
some
high-bit in the pages_addr array to indicate the page type. For the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing
callers
it's as easy as changing flags to &flags (array of size 1). For
HMM we
would pass a pointer to a real array.

Yeah, I've thought about stuff like that as well for a while.

Problem is that this won't work that easily. We assume in many places
that the flags don't change for the range in question.

I think some lower level functions assume that the flags stay the same
for physically contiguous ranges. But if you use the high bits to
encode the page type, the ranges won't be contiguous any more. So you
can still use different page flags for different contiguous ranges.

Yeah, but then you also get absolutely zero THP and fragment flags
support.

As long as you have a contiguous 2MB page with the same page type, I
think you can still get a THP mapping in the GPU page table. But if one
page in the middle of a 2MB page has a different page type, that will
break the THP mapping, as it should.

Yeah, but currently we detect that before we call down into the
functions to update the tables.

When you give it a mixed list, the chance for that is just completely gone.

Currently the detection of contiguous ranges is in amdgpu_vm_update_mapping:


if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
    uint64_t pfn = cursor.start >> PAGE_SHIFT;
    uint64_t count;

    contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
    tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
    for (count = 2; count < tmp; ++count) {
        uint64_t idx = pfn + count;

        if (contiguous != (pages_addr[idx] ==
                           pages_addr[idx - 1] + PAGE_SIZE))
            break;
    }
    num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
}

If a page type changes from one page to the next, it will end a
contiguous range and call into the lower level (amdgpu_vm_update_ptes).
I don't think anything needs to change in amdgpu_vm_update_ptes, because
all contiguous pages passed to it use the same page type, so the same
flags. And the existing code in amdgpu_vm_update_mapping already does
the right thing. I honestly don't see the problem.


But then you could also just call amdgpu_vm_update_mapping() multiple
times, feeding it from whatever data structure you use in the HMM code.


Using the page array sounds to me like an intermediate data structure
you only created to feed into amdgpu_vm_update_mapping().


Regards,
Christian.



Regards,
   Felix



Regards,
Christian.


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Need to support mixed memory mappings with HMM

2021-03-25 Thread Felix Kuehling

On 2021-03-25 at 12:22 p.m., Christian König wrote:
>> My idea is to change the amdgpu_vm_update_mapping interface to use
>> some
>> high-bit in the pages_addr array to indicate the page type. For the
>> default page type (0) nothing really changes for the callers. The
>> "flags" parameter needs to become a pointer to an array that gets
>> indexed by the high bits from the pages_addr array. For existing
>> callers
>> it's as easy as changing flags to &flags (array of size 1). For
>> HMM we
>> would pass a pointer to a real array.
> Yeah, I've thought about stuff like that as well for a while.
>
> Problem is that this won't work that easily. We assume in many places
> that the flags don't change for the range in question.
 I think some lower level functions assume that the flags stay the same
 for physically contiguous ranges. But if you use the high bits to
 encode the page type, the ranges won't be contiguous any more. So you
 can still use different page flags for different contiguous ranges.
>>> Yeah, but then you also get absolutely zero THP and fragment flags
>>> support.
>> As long as you have a contiguous 2MB page with the same page type, I
>> think you can still get a THP mapping in the GPU page table. But if one
>> page in the middle of a 2MB page has a different page type, that will
>> break the THP mapping, as it should.
>
> Yeah, but currently we detect that before we call down into the
> functions to update the tables.
>
> When you give it a mixed list, the chance for that is just completely gone.

Currently the detection of contiguous ranges is in amdgpu_vm_update_mapping:

> if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
>     uint64_t pfn = cursor.start >> PAGE_SHIFT;
>     uint64_t count;
>
>     contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
>     tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>     for (count = 2; count < tmp; ++count) {
>         uint64_t idx = pfn + count;
>
>         if (contiguous != (pages_addr[idx] ==
>                            pages_addr[idx - 1] + PAGE_SIZE))
>             break;
>     }
>     num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
> }

If a page type changes from one page to the next, it will end a
contiguous range and call into the lower level (amdgpu_vm_update_ptes).
I don't think anything needs to change in amdgpu_vm_update_ptes, because
all contiguous pages passed to it use the same page type, so the same
flags. And the existing code in amdgpu_vm_update_mapping already does
the right thing. I honestly don't see the problem.
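
A small worked example of why that is; the bit position used here for
the page type is invented for illustration:

/* two physically adjacent pages with different page types */
pages_addr[n]     = 0x100000000ULL;                 /* type 0: system */
pages_addr[n + 1] = 0x100001000ULL | (1ULL << 62);  /* type 1: VRAM   */

/* pages_addr[n + 1] != pages_addr[n] + PAGE_SIZE because of the high
 * bit, so the contiguous run ends at the type boundary and the next
 * amdgpu_vm_update_ptes call uses the other type's flags. */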

Regards,
  Felix


>
> Regards,
> Christian. 
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Need to support mixed memory mappings with HMM

2021-03-25 Thread Christian König

On 2021-03-25 at 17:20, Felix Kuehling wrote:

On 2021-03-25 at 12:16 p.m., Christian König wrote:

On 2021-03-25 at 17:14, Felix Kuehling wrote:

On 2021-03-25 at 12:10 p.m., Christian König wrote:

On 2021-03-25 at 17:03, Felix Kuehling wrote:

Hi,

This is a long one with a proposal for a pretty significant
redesign of
how we handle migrations and VRAM management with HMM. This is
based on
my own debugging and reading of the migrate_vma helpers, as well as
Alex's problems with migrations on A+A. I hope we can discuss this next
Monday after you've had some time to digest it.

I did some debugging yesterday and found that migrations to VRAM can
fail for some pages. The current migration helpers have many corner
cases where a page cannot be migrated. Some of them may be fixable
(adding support for THP), others are not (locked pages are skipped to
avoid deadlocks). Therefore I think our current code is too inflexible
when it assumes that a range is entirely in one place.

Alex also ran into some funny issues with COW on A+A where some pages
get faulted back to system memory. I think a lot of the problems here
will get easier once we support mixed mappings.

Mixed GPU mappings
===

The idea is to remove any assumptions that an entire svm_range is in
one place. Instead, hmm_range_fault gives us a list of pages, some of
which are system memory and others are device_private or
device_generic.

We will need an amdgpu_vm interface that lets us map mixed page arrays
where different pages use different PTE flags. We can have at least 3
different types of pages in one mapping:

    1. System memory (S-bit set)
    2. Local memory (S-bit cleared, MTYPE for local memory)
    3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

My idea is to change the amdgpu_vm_update_mapping interface to use
some
high-bit in the pages_addr array to indicate the page type. For the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing
callers
it's as easy as changing flags to &flags (array of size 1). For HMM we
would pass a pointer to a real array.

Yeah, I've thought about stuff like that as well for a while.

Problem is that this won't work that easily. We assume in many places
that the flags don't change for the range in question.

I think some lower level functions assume that the flags stay the same
for physically contiguous ranges. But if you use the high bits to encode
the page type, the ranges won't be contiguous any more. So you can
still use different page flags for different contiguous ranges.

Yeah, but then you also get absolutely zero THP and fragment flags
support.

As long as you have a contiguous 2MB page with the same page type, I
think you can still get a THP mapping in the GPU page table. But if one
page in the middle of a 2MB page has a different page type, that will
break the THP mapping, as it should.


Yeah, but currently we detect that before we call down into the 
functions to update the tables.


When you give it a mixed list, the chance for that is just completely gone.

Regards,
Christian.



Regards,
   Felix



But I think we could also add those later on.

Regards,
Christian.


Regards,
    Felix



We would somehow need to change that to get the flags directly from
the low bits of the DMA address or something instead.

Christian.


Once this is done, it leads to a number of opportunities for
simplification and better efficiency in kfd_svm:

     * Support migration when cpages != npages
     * Migrate a part of an svm_range without splitting it. No more
   splitting of ranges in CPU page faults
     * Migrate a part of an svm_range in GPU page fault handler. No
more
   migrating the whole range for a single page fault
     * Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If
we're
not sure where the data is, we need to call migrate. If it's
already in
the right place, then cpages will be 0 and we can skip all the steps
after migrate_vma_setup.

Simplified VRAM management
==

VRAM BOs are no longer associated with pranges. Instead they are
"free-floating", allocated during migration to VRAM, with reference
count for each page that uses the BO. Ref is released in page-release
callback. When the ref count drops to 0, free the BO.

VRAM BO size should match the migration granularity, 2MB by default.
That way the BO can be freed when memory gets migrated out. If
migration
of some pages fails the BO may not be fully occupied. Also some pages
may be released individually on A+A due to COW or other events.

Eviction needs to migrate all the pages still using the BO. If the BO
struct keeps an array of page pointers, that's basically the
migrate.src
for the eviction. Migration calls "try_to_unmap", which has the best
chance of freeing the BO, even when shared by multiple processes.

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
pressure so we can start evicting memory.

Re: Need to support mixed memory mappings with HMM

2021-03-25 Thread Felix Kuehling
On 2021-03-25 at 12:16 p.m., Christian König wrote:
> On 2021-03-25 at 17:14, Felix Kuehling wrote:
>> On 2021-03-25 at 12:10 p.m., Christian König wrote:
>>>
>>> On 2021-03-25 at 17:03, Felix Kuehling wrote:
 Hi,

 This is a long one with a proposal for a pretty significant
 redesign of
 how we handle migrations and VRAM management with HMM. This is
 based on
 my own debugging and reading of the migrate_vma helpers, as well as
 Alex's problems with migrations on A+A. I hope we can discuss this next
 Monday after you've had some time to digest it.

 I did some debugging yesterday and found that migrations to VRAM can
 fail for some pages. The current migration helpers have many corner
 cases where a page cannot be migrated. Some of them may be fixable
 (adding support for THP), others are not (locked pages are skipped to
 avoid deadlocks). Therefore I think our current code is too inflexible
 when it assumes that a range is entirely in one place.

 Alex also ran into some funny issues with COW on A+A where some pages
 get faulted back to system memory. I think a lot of the problems here
 will get easier once we support mixed mappings.

 Mixed GPU mappings
 ===

 The idea is to remove any assumptions that an entire svm_range is in
 one place. Instead, hmm_range_fault gives us a list of pages, some of
 which are system memory and others are device_private or
 device_generic.

 We will need an amdgpu_vm interface that lets us map mixed page arrays
 where different pages use different PTE flags. We can have at least 3
 different types of pages in one mapping:

    1. System memory (S-bit set)
    2. Local memory (S-bit cleared, MTYPE for local memory)
    3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

 My idea is to change the amdgpu_vm_update_mapping interface to use
 some
 high-bit in the pages_addr array to indicate the page type. For the
 default page type (0) nothing really changes for the callers. The
 "flags" parameter needs to become a pointer to an array that gets
 indexed by the high bits from the pages_addr array. For existing
 callers
 it's as easy as changing flags to &flags (array of size 1). For HMM we
 would pass a pointer to a real array.
>>> Yeah, I've thought about stuff like that as well for a while.
>>>
>>> Problem is that this won't work that easily. We assume in many places
>>> that the flags don't change for the range in question.
>> I think some lower level functions assume that the flags stay the same
>> for physically contiguous ranges. But if you use the high bits to encode
>> the page type, the ranges won't be contiguous any more. So you can
>> still use different page flags for different contiguous ranges.
>
> Yeah, but then you also get absolutely zero THP and fragment flags
> support.

As long as you have a contiguous 2MB page with the same page type, I
think you can still get a THP mapping in the GPU page table. But if one
page in the middle of a 2MB page has a different page type, that will
break the THP mapping, as it should.

Regards,
  Felix


>
> But I think we could also add those later on.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> We would somehow need to change that to get the flags directly from
>>> the low bits of the DMA address or something instead.
>>>
>>> Christian.
>>>
 Once this is done, it leads to a number of opportunities for
 simplification and better efficiency in kfd_svm:

     * Support migration when cpages != npages
     * Migrate a part of an svm_range without splitting it. No more
   splitting of ranges in CPU page faults
     * Migrate a part of an svm_range in GPU page fault handler. No
 more
   migrating the whole range for a single page fault
     * Simplified VRAM management (see below)

 With that, svm_range will no longer have an "actual_loc" field. If
 we're
 not sure where the data is, we need to call migrate. If it's
 already in
 the right place, then cpages will be 0 and we can skip all the steps
 after migrate_vma_setup.

 Simplified VRAM management
 ==

 VRAM BOs are no longer associated with pranges. Instead they are
 "free-floating", allocated during migration to VRAM, with reference
 count for each page that uses the BO. Ref is released in page-release
 callback. When the ref count drops to 0, free the BO.

 VRAM BO size should match the migration granularity, 2MB by default.
 That way the BO can be freed when memory gets migrated out. If
 migration
 of some pages fails the BO may not be fully occupied. Also some pages
 may be released individually on A+A due to COW or other events.

 Eviction needs to migrate all the pages still using the BO. If the BO
 struct keeps an array of page pointers, that's basically the migrate.src
 for the eviction. Migration calls "try_to_unmap", which has the best
 chance of freeing the BO, even when shared by multiple processes.

 If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
 allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
 pressure so we can start evicting memory.

Re: Need to support mixed memory mappings with HMM

2021-03-25 Thread Christian König

On 2021-03-25 at 17:14, Felix Kuehling wrote:

On 2021-03-25 at 12:10 p.m., Christian König wrote:


On 2021-03-25 at 17:03, Felix Kuehling wrote:

Hi,

This is a long one with a proposal for a pretty significant redesign of
how we handle migrations and VRAM management with HMM. This is based on
my own debugging and reading of the migrate_vma helpers, as well as
Alex's problems with migrations on A+A. I hope we can discuss this next
Monday after you've had some time to digest it.

I did some debugging yesterday and found that migrations to VRAM can
fail for some pages. The current migration helpers have many corner
cases where a page cannot be migrated. Some of them may be fixable
(adding support for THP), others are not (locked pages are skipped to
avoid deadlocks). Therefore I think our current code is too inflexible
when it assumes that a range is entirely in one place.

Alex also ran into some funny issues with COW on A+A where some pages
get faulted back to system memory. I think a lot of the problems here
will get easier once we support mixed mappings.

Mixed GPU mappings
===

The idea is to remove any assumptions that an entire svm_range is in
one place. Instead, hmm_range_fault gives us a list of pages, some of
which are system memory and others are device_private or device_generic.

We will need an amdgpu_vm interface that lets us map mixed page arrays
where different pages use different PTE flags. We can have at least 3
different types of pages in one mapping:

   1. System memory (S-bit set)
   2. Local memory (S-bit cleared, MTYPE for local memory)
   3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

My idea is to change the amdgpu_vm_update_mapping interface to use some
high-bit in the pages_addr array to indicate the page type. For the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing callers
it's as easy as changing flags to &flags (array of size 1). For HMM we
would pass a pointer to a real array.

Yeah, I've thought about stuff like that as well for a while.

Problem is that this won't work that easily. We assume in many places
that the flags don't change for the range in question.

I think some lower level functions assume that the flags stay the same
for physically contiguous ranges. But if you use the high bits to encode
the page type, the ranges won't be contiguous any more. So you can
still use different page flags for different contiguous ranges.


Yeah, but then you also get absolutely zero THP and fragment flags support.

But I think we could also add those later on.

Regards,
Christian.



Regards,
   Felix



We would somehow need to change that to get the flags directly from
the low bits of the DMA address or something instead.

Christian.


Once this is done, it leads to a number of opportunities for
simplification and better efficiency in kfd_svm:

    * Support migration when cpages != npages
    * Migrate a part of an svm_range without splitting it. No more
  splitting of ranges in CPU page faults
    * Migrate a part of an svm_range in GPU page fault handler. No more
  migrating the whole range for a single page fault
    * Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If we're
not sure where the data is, we need to call migrate. If it's already in
the right place, then cpages will be 0 and we can skip all the steps
after migrate_vma_setup.

Simplified VRAM management
==

VRAM BOs are no longer associated with pranges. Instead they are
"free-floating", allocated during migration to VRAM, with reference
count for each page that uses the BO. Ref is released in page-release
callback. When the ref count drops to 0, free the BO.

VRAM BO size should match the migration granularity, 2MB by default.
That way the BO can be freed when memory gets migrated out. If migration
of some pages fails the BO may not be fully occupied. Also some pages
may be released individually on A+A due to COW or other events.

Eviction needs to migrate all the pages still using the BO. If the BO
struct keeps an array of page pointers, that's basically the migrate.src
for the eviction. Migration calls "try_to_unmap", which has the best
chance of freeing the BO, even when shared by multiple processes.

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
pressure so we can start evicting memory.

Regards,
    Felix



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Need to support mixed memory mappings with HMM

2021-03-25 Thread Felix Kuehling
On 2021-03-25 at 12:10 p.m., Christian König wrote:
>
>
> On 2021-03-25 at 17:03, Felix Kuehling wrote:
>> Hi,
>>
>> This is a long one with a proposal for a pretty significant redesign of
>> how we handle migrations and VRAM management with HMM. This is based on
>> my own debugging and reading of the migrate_vma helpers, as well as
>> Alex's problems with migrations on A+A. I hope we can discuss this next
>> Monday after you've had some time to digest it.
>>
>> I did some debugging yesterday and found that migrations to VRAM can
>> fail for some pages. The current migration helpers have many corner
>> cases where a page cannot be migrated. Some of them may be fixable
>> (adding support for THP), others are not (locked pages are skipped to
>> avoid deadlocks). Therefore I think our current code is too inflexible
>> when it assumes that a range is entirely in one place.
>>
>> Alex also ran into some funny issues with COW on A+A where some pages
>> get faulted back to system memory. I think a lot of the problems here
>> will get easier once we support mixed mappings.
>>
>> Mixed GPU mappings
>> ===
>>
>> The idea is to remove any assumptions that an entire svm_range is in
>> one place. Instead, hmm_range_fault gives us a list of pages, some of
>> which are system memory and others are device_private or device_generic.
>>
>> We will need an amdgpu_vm interface that lets us map mixed page arrays
>> where different pages use different PTE flags. We can have at least 3
>> different types of pages in one mapping:
>>
>>   1. System memory (S-bit set)
>>   2. Local memory (S-bit cleared, MTYPE for local memory)
>>   3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>
>> My idea is to change the amdgpu_vm_update_mapping interface to use some
>> high-bit in the pages_addr array to indicate the page type. For the
>> default page type (0) nothing really changes for the callers. The
>> "flags" parameter needs to become a pointer to an array that gets
>> indexed by the high bits from the pages_addr array. For existing callers
>> it's as easy as changing flags to &flags (array of size 1). For HMM we
>> would pass a pointer to a real array.
>
> Yeah, I've thought about stuff like that as well for a while.
>
> Problem is that this won't work that easily. We assume in many places
> that the flags don't change for the range in question.

I think some lower level functions assume that the flags stay the same
for physically contiguous ranges. But if you use the high bits to encode
the page type, the ranges won't be contiguous any more. So you can
still use different page flags for different contiguous ranges.

Regards,
  Felix


>
> We would somehow need to change that to get the flags directly from
> the low bits of the DMA address or something instead.
>
> Christian.
>
>>
>> Once this is done, it leads to a number of opportunities for
>> simplification and better efficiency in kfd_svm:
>>
>>    * Support migration when cpages != npages
>>    * Migrate a part of an svm_range without splitting it. No more
>>  splitting of ranges in CPU page faults
>>    * Migrate a part of an svm_range in GPU page fault handler. No more
>>  migrating the whole range for a single page fault
>>    * Simplified VRAM management (see below)
>>
>> With that, svm_range will no longer have an "actual_loc" field. If we're
>> not sure where the data is, we need to call migrate. If it's already in
>> the right place, then cpages will be 0 and we can skip all the steps
>> after migrate_vma_setup.
>>
>> Simplified VRAM management
>> ==
>>
>> VRAM BOs are no longer associated with pranges. Instead they are
>> "free-floating", allocated during migration to VRAM, with reference
>> count for each page that uses the BO. Ref is released in page-release
>> callback. When the ref count drops to 0, free the BO.
>>
>> VRAM BO size should match the migration granularity, 2MB by default.
>> That way the BO can be freed when memory gets migrated out. If migration
>> of some pages fails the BO may not be fully occupied. Also some pages
>> may be released individually on A+A due to COW or other events.
>>
>> Eviction needs to migrate all the pages still using the BO. If the BO
>> struct keeps an array of page pointers, that's basically the migrate.src
>> for the eviction. Migration calls "try_to_unmap", which has the best
>> chance of freeing the BO, even when shared by multiple processes.
>>
>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
>> allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
>> pressure so we can start evicting memory.
>>
>> Regards,
>>    Felix
>>
>
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Need to support mixed memory mappings with HMM

2021-03-25 Thread Christian König



On 2021-03-25 at 17:03, Felix Kuehling wrote:

Hi,

This is a long one with a proposal for a pretty significant redesign of
how we handle migrations and VRAM management with HMM. This is based on
my own debugging and reading of the migrate_vma helpers, as well as
Alex's problems with migrations on A+A. I hope we can discuss this next
Monday after you've had some time to digest it.

I did some debugging yesterday and found that migrations to VRAM can
fail for some pages. The current migration helpers have many corner
cases where a page cannot be migrated. Some of them may be fixable
(adding support for THP), others are not (locked pages are skipped to
avoid deadlocks). Therefore I think our current code is too inflexible
when it assumes that a range is entirely in one place.

Alex also ran into some funny issues with COW on A+A where some pages
get faulted back to system memory. I think a lot of the problems here
will get easier once we support mixed mappings.

Mixed GPU mappings
===

The idea is to remove any assumptions that an entire svm_range is in
one place. Instead, hmm_range_fault gives us a list of pages, some of
which are system memory and others are device_private or device_generic.

We will need an amdgpu_vm interface that lets us map mixed page arrays
where different pages use different PTE flags. We can have at least 3
different types of pages in one mapping:

  1. System memory (S-bit set)
  2. Local memory (S-bit cleared, MTYPE for local memory)
  3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

My idea is to change the amdgpu_vm_update_mapping interface to use some
high-bit in the pages_addr array to indicate the page type. For the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing callers
it's as easy as changing flags to &flags (array of size 1). For HMM we
would pass a pointer to a real array.


Yeah, I've thought about stuff like that as well for a while.

Problem is that this won't work that easily. We assume in many places
that the flags don't change for the range in question.


We would somehow need to change that to get the flags directly from the 
low bits of the DMA address or something instead.
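
One possible reading of that, purely as a sketch: the DMA addresses in
pages_addr are page aligned, so the low bits are otherwise unused and
could carry a small type index (names invented):

#define SVM_ADDR_TYPE_MASK 0x3ULL   /* invented: type index in low bits */

static inline uint64_t svm_pack(uint64_t dma_addr, unsigned int type)
{
    return (dma_addr & PAGE_MASK) | type;
}

static inline unsigned int svm_unpack_type(uint64_t entry)
{
    return entry & SVM_ADDR_TYPE_MASK;
}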


Christian.



Once this is done, it leads to a number of opportunities for
simplification and better efficiency in kfd_svm:

   * Support migration when cpages != npages
   * Migrate a part of an svm_range without splitting it. No more
 splitting of ranges in CPU page faults
   * Migrate a part of an svm_range in GPU page fault handler. No more
 migrating the whole range for a single page fault
   * Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If we're
not sure where the data is, we need to call migrate. If it's already in
the right place, then cpages will be 0 and we can skip all the steps
after migrate_vma_setup.

Simplified VRAM management
==

VRAM BOs are no longer associated with pranges. Instead they are
"free-floating", allocated during migration to VRAM, with reference
count for each page that uses the BO. Ref is released in page-release
callback. When the ref count drops to 0, free the BO.

VRAM BO size should match the migration granularity, 2MB by default.
That way the BO can be freed when memory gets migrated out. If migration
of some pages fails the BO may not be fully occupied. Also some pages
may be released individually on A+A due to COW or other events.

Eviction needs to migrate all the pages still using the BO. If the BO
struct keeps an array of page pointers, that's basically the migrate.src
for the eviction. Migration calls "try_to_unmap", which has the best
chance of freeing the BO, even when shared by multiple processes.

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
pressure so we can start evicting memory.

Regards,
   Felix



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Need to support mixed memory mappings with HMM

2021-03-25 Thread Felix Kuehling
Hi,

This is a long one with a proposal for a pretty significant redesign of
how we handle migrations and VRAM management with HMM. This is based on
my own debugging and reading of the migrate_vma helpers, as well as
Alex's problems with migrations on A+A. I hope we can discuss this next
Monday after you've had some time to digest it.

I did some debugging yesterday and found that migrations to VRAM can
fail for some pages. The current migration helpers have many corner
cases where a page cannot be migrated. Some of them may be fixable
(adding support for THP), others are not (locked pages are skipped to
avoid deadlocks). Therefore I think our current code is too inflexible
when it assumes that a range is entirely in one place.

Alex also ran into some funny issues with COW on A+A where some pages
get faulted back to system memory. I think a lot of the problems here
will get easier once we support mixed mappings.

Mixed GPU mappings
===

The idea is to remove any assumptions that an entire svm_range is in
one place. Instead, hmm_range_fault gives us a list of pages, some of
which are system memory and others are device_private or device_generic.

We will need an amdgpu_vm interface that lets us map mixed page arrays
where different pages use different PTE flags. We can have at least 3
different types of pages in one mapping:

 1. System memory (S-bit set)
 2. Local memory (S-bit cleared, MTYPE for local memory)
 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

My idea is to change the amdgpu_vm_update_mapping interface to use some
high-bit in the pages_addr array to indicate the page type. For the
default page type (0) nothing really changes for the callers. The
"flags" parameter needs to become a pointer to an array that gets
indexed by the high bits from the pages_addr array. For existing callers
it's as easy as changing flags to &flags (array of size 1). For HMM we
would pass a pointer to a real array.
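
For illustration, a minimal sketch of what the encoding and the
flags-array lookup could look like; the bit position, mask and helper
names are invented, only the pages_addr/flags-array idea itself is the
proposal:

/* Illustration only: page-type index stored in otherwise unused high
 * bits of a pages_addr entry. */
#define SVM_PAGE_TYPE_SHIFT 62
#define SVM_PAGE_TYPE_MASK  (3ULL << SVM_PAGE_TYPE_SHIFT)

enum svm_page_type {
    SVM_PAGE_SYSTEM = 0,   /* system memory: S-bit set */
    SVM_PAGE_LOCAL  = 1,   /* local VRAM: S-bit cleared, local MTYPE */
    SVM_PAGE_XGMI   = 2,   /* remote VRAM: S-bit cleared, MTYPE+C */
};

static inline uint64_t svm_encode_addr(uint64_t dma_addr, enum svm_page_type t)
{
    return dma_addr | ((uint64_t)t << SVM_PAGE_TYPE_SHIFT);
}

static inline uint64_t svm_entry_flags(const uint64_t *flags, uint64_t entry)
{
    return flags[(entry & SVM_PAGE_TYPE_MASK) >> SVM_PAGE_TYPE_SHIFT];
}

/* Existing callers with a single page type would just pass &flags (an
 * array of size 1); all their entries keep type index 0. */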

Once this is done, it leads to a number of opportunities for
simplification and better efficiency in kfd_svm:

  * Support migration when cpages != npages
  * Migrate a part of an svm_range without splitting it. No more
splitting of ranges in CPU page faults
  * Migrate a part of an svm_range in GPU page fault handler. No more
migrating the whole range for a single page fault
  * Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If we're
not sure where the data is, we need to call migrate. If it's already in
the right place, then cpages will be 0 and we can skip all the steps
after migrate_vma_setup.
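
A heavily abbreviated sketch of that early-out with the migrate_vma
helpers; the pgmap_owner and selection flag are assumptions, and the
surrounding function (vma, start/end, pfn arrays) as well as the
allocation/copy/finalize steps are omitted:

struct migrate_vma migrate = {
    .vma         = vma,
    .start       = start,
    .end         = end,
    .src         = src_pfns,
    .dst         = dst_pfns,
    .pgmap_owner = adev,                       /* assumed owner cookie */
    .flags       = MIGRATE_VMA_SELECT_SYSTEM,  /* sysmem-to-VRAM case */
};
int r = migrate_vma_setup(&migrate);

if (r)
    return r;
if (!migrate.cpages)
    return 0;   /* nothing collected: data is already in the right place */

/* ... allocate VRAM pages, copy the collected pages, then
 * migrate_vma_pages() and migrate_vma_finalize() ... */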

Simplified VRAM management
==

VRAM BOs are no longer associated with pranges. Instead they are
"free-floating", allocated during migration to VRAM, with reference
count for each page that uses the BO. Ref is released in page-release
callback. When the ref count drops to 0, free the BO.

VRAM BO size should match the migration granularity, 2MB by default.
That way the BO can be freed when memory gets migrated out. If migration
of some pages fails the BO may not be fully occupied. Also some pages
may be released individually on A+A due to COW or other events.
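
A sketch of what such a free-floating BO could look like; structure and
function names are invented, 512 assumes 4KB pages in a 2MB granule,
and page->zone_device_data is assumed to point back at the BO:

struct svm_vram_bo {
    struct amdgpu_bo *bo;        /* the 2MB VRAM allocation */
    struct page *pages[512];     /* device pages backed by this BO */
    struct kref refcount;        /* one reference per page still in use */
};

static void svm_vram_bo_release(struct kref *kref)
{
    struct svm_vram_bo *vbo = container_of(kref, struct svm_vram_bo, refcount);

    amdgpu_bo_unref(&vbo->bo);
    kfree(vbo);
}

/* dev_pagemap page_free callback: drop this page's reference */
static void svm_vram_page_free(struct page *page)
{
    struct svm_vram_bo *vbo = page->zone_device_data;

    kref_put(&vbo->refcount, svm_vram_bo_release);
}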

Eviction needs to migrate all the pages still using the BO. If the BO
struct keeps an array of page pointers, that's basically the migrate.src
for the eviction. Migration calls "try_to_unmap", which has the best
chance of freeing the BO, even when shared by multiple processes.
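
For example, building the eviction source list from that page array
could look roughly like this; migrate_pfn() and MIGRATE_PFN_MIGRATE are
the existing helpers, while vbo and src come from the surrounding
(omitted) eviction code:

/* One migrate entry per device page still backed by this BO; holes
 * (already-released pages) become empty entries. */
for (i = 0; i < ARRAY_SIZE(vbo->pages); i++) {
    struct page *page = vbo->pages[i];

    src[i] = page ? migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE : 0;
}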

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM
allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory
pressure so we can start evicting memory.

Regards,
  Felix

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx