Re: Need to support mixed memory mappings with HMM
Am 26.03.21 um 15:49 schrieb Felix Kuehling:
> Am 2021-03-26 um 4:50 a.m. schrieb Christian König:
>> Am 25.03.21 um 17:28 schrieb Felix Kuehling:
>>> Am 2021-03-25 um 12:22 p.m. schrieb Christian König:
>>>>>>>>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
>>>>>>>> Yeah, I've thought about stuff like that as well for a while.
>>>>>>>>
>>>>>>>> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.
>>>>>>> I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.
>>>>>> Yeah, but then you also get absolutely zero THP and fragment flags support.
>>>>> As long as you have a contiguous 2MB page with the same page type, I think you can still get a THP mapping in the GPU page table. But if one page in the middle of a 2MB page has a different page type, that will break the THP mapping, as it should.
>>>> Yeah, but currently we detect that before we call down into the functions to update the tables.
>>>>
>>>> When you give a mixed list the chance for that is just completely gone.
>>> Currently the detection of contiguous ranges is in amdgpu_vm_update_mapping:
>>>
>>>> if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
>>>> 	uint64_t pfn = cursor.start >> PAGE_SHIFT;
>>>> 	uint64_t count;
>>>>
>>>> 	contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
>>>>
>>>> 	tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>>>> 	for (count = 2; count < tmp; ++count) {
>>>> 		uint64_t idx = pfn + count;
>>>>
>>>> 		if (contiguous != (pages_addr[idx] == pages_addr[idx - 1] + PAGE_SIZE))
>>>> 			break;
>>>> 	}
>>>> 	num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>>>> }
>>> If a page type changes from one page to the next, it will end a contiguous range and call into the lower level (amdgpu_vm_update_ptes). I don't think anything needs to change in amdgpu_vm_update_ptes, because all contiguous pages passed to it use the same page type, so the same flags. And the existing code in amdgpu_vm_update_mapping already does the right thing. I honestly don't see the problem.
>> But then you could also just call amdgpu_vm_update_mapping() multiple times feeding it from whatever data structure you use in the HMM code.
>>
>> Using the page array sounds like an intermediate data structure to me you only created to feed into amdgpu_vm_update_mapping().
> You're right. It could be done. One new call to amdgpu_vm_update_mapping every time the flags change as we iterate over virtual addresses. It should work OK as long as the mappings of different memory types aren't too finely interleaved.
>
> Regards,
> Felix

I suggest that we stick to this approach for now.

The only alternative I see is that we have a unified address space for DMA addresses similar to what you have outlined. This way we can easily figure out for each address to which device it belongs and then calculate the best path to that device. E.g. direct address for system memory, clearing the S bit for local memory, using XGMI for remote memory etc... That is similar to what you have in mind, just a bit more generalized.

Regards,
Christian.
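The approach agreed on above (one amdgpu_vm_update_mapping call per run of identical flags while iterating over virtual addresses) can be sketched in userspace C roughly as follows. update_mapping() is a hypothetical stand-in for amdgpu_vm_update_mapping(), and the flag values are invented for illustration only:

```c
/*
 * Sketch: instead of teaching the page-table updater about mixed page
 * types, walk the HMM page list and issue one mapping update per run of
 * pages with identical PTE flags. All names and flag values here are
 * assumptions for illustration, not the real amdgpu interface.
 */
#include <assert.h>
#include <stdint.h>

#define GPU_PAGE_SHIFT	12
#define FLAG_SYSTEM	0x1ULL	/* S bit set: system memory (assumed value) */
#define FLAG_VRAM	0x2ULL	/* local VRAM MTYPE (assumed value) */

/* Stand-in for amdgpu_vm_update_mapping(): map npages at va with flags. */
static void update_mapping(uint64_t va, uint64_t npages, uint64_t flags)
{
	(void)va; (void)npages; (void)flags;	/* would update PTEs here */
}

/* Issue one update_mapping() call per run of identical flags. */
static unsigned int map_mixed(uint64_t va, const uint64_t *flags,
			      uint64_t npages)
{
	unsigned int calls = 0;
	uint64_t i = 0;

	while (i < npages) {
		uint64_t start = i;

		/* extend the run while the flags stay the same */
		while (i < npages && flags[i] == flags[start])
			i++;
		update_mapping(va + (start << GPU_PAGE_SHIFT),
			       i - start, flags[start]);
		calls++;
	}
	return calls;
}
```

As noted in the thread, the call count stays close to one per range as long as the memory types are not too finely interleaved.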
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: Need to support mixed memory mappings with HMM
Am 2021-03-26 um 4:50 a.m. schrieb Christian König:
> Am 25.03.21 um 17:28 schrieb Felix Kuehling:
>> Am 2021-03-25 um 12:22 p.m. schrieb Christian König:
>>>>>>>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
>>>>>>> Yeah, I've thought about stuff like that as well for a while.
>>>>>>>
>>>>>>> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.
>>>>>> I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.
>>>>> Yeah, but then you also get absolutely zero THP and fragment flags support.
>>>> As long as you have a contiguous 2MB page with the same page type, I think you can still get a THP mapping in the GPU page table. But if one page in the middle of a 2MB page has a different page type, that will break the THP mapping, as it should.
>>> Yeah, but currently we detect that before we call down into the functions to update the tables.
>>>
>>> When you give a mixed list the chance for that is just completely gone.
>> Currently the detection of contiguous ranges is in amdgpu_vm_update_mapping:
>>
>>> if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
>>> 	uint64_t pfn = cursor.start >> PAGE_SHIFT;
>>> 	uint64_t count;
>>>
>>> 	contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
>>>
>>> 	tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>>> 	for (count = 2; count < tmp; ++count) {
>>> 		uint64_t idx = pfn + count;
>>>
>>> 		if (contiguous != (pages_addr[idx] == pages_addr[idx - 1] + PAGE_SIZE))
>>> 			break;
>>> 	}
>>> 	num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>>> }
>> If a page type changes from one page to the next, it will end a contiguous range and call into the lower level (amdgpu_vm_update_ptes). I don't think anything needs to change in amdgpu_vm_update_ptes, because all contiguous pages passed to it use the same page type, so the same flags. And the existing code in amdgpu_vm_update_mapping already does the right thing. I honestly don't see the problem.
> But then you could also just call amdgpu_vm_update_mapping() multiple times feeding it from whatever data structure you use in the HMM code.
>
> Using the page array sounds like an intermediate data structure to me you only created to feed into amdgpu_vm_update_mapping().

You're right. It could be done. One new call to amdgpu_vm_update_mapping every time the flags change as we iterate over virtual addresses. It should work OK as long as the mappings of different memory types aren't too finely interleaved.

Regards,
Felix

> Regards,
> Christian.
Re: Need to support mixed memory mappings with HMM
Am 25.03.21 um 17:28 schrieb Felix Kuehling:
> Am 2021-03-25 um 12:22 p.m. schrieb Christian König:
>>>>>>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
>>>>>> Yeah, I've thought about stuff like that as well for a while.
>>>>>>
>>>>>> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.
>>>>> I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.
>>>> Yeah, but then you also get absolutely zero THP and fragment flags support.
>>> As long as you have a contiguous 2MB page with the same page type, I think you can still get a THP mapping in the GPU page table. But if one page in the middle of a 2MB page has a different page type, that will break the THP mapping, as it should.
>> Yeah, but currently we detect that before we call down into the functions to update the tables.
>>
>> When you give a mixed list the chance for that is just completely gone.
> Currently the detection of contiguous ranges is in amdgpu_vm_update_mapping:
>
>> if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
>> 	uint64_t pfn = cursor.start >> PAGE_SHIFT;
>> 	uint64_t count;
>>
>> 	contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
>>
>> 	tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>> 	for (count = 2; count < tmp; ++count) {
>> 		uint64_t idx = pfn + count;
>>
>> 		if (contiguous != (pages_addr[idx] == pages_addr[idx - 1] + PAGE_SIZE))
>> 			break;
>> 	}
>> 	num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
>> }
> If a page type changes from one page to the next, it will end a contiguous range and call into the lower level (amdgpu_vm_update_ptes). I don't think anything needs to change in amdgpu_vm_update_ptes, because all contiguous pages passed to it use the same page type, so the same flags. And the existing code in amdgpu_vm_update_mapping already does the right thing. I honestly don't see the problem.

But then you could also just call amdgpu_vm_update_mapping() multiple times feeding it from whatever data structure you use in the HMM code.

Using the page array sounds like an intermediate data structure to me you only created to feed into amdgpu_vm_update_mapping().

Regards,
Christian.

> Regards,
> Felix
Re: Need to support mixed memory mappings with HMM
Am 2021-03-25 um 12:22 p.m. schrieb Christian König:
>>>>>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
>>>>> Yeah, I've thought about stuff like that as well for a while.
>>>>>
>>>>> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.
>>>> I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.
>>> Yeah, but then you also get absolutely zero THP and fragment flags support.
>> As long as you have a contiguous 2MB page with the same page type, I think you can still get a THP mapping in the GPU page table. But if one page in the middle of a 2MB page has a different page type, that will break the THP mapping, as it should.
> Yeah, but currently we detect that before we call down into the functions to update the tables.
>
> When you give a mixed list the chance for that is just completely gone.

Currently the detection of contiguous ranges is in amdgpu_vm_update_mapping:

> if (num_entries > AMDGPU_GPU_PAGES_IN_CPU_PAGE) {
> 	uint64_t pfn = cursor.start >> PAGE_SHIFT;
> 	uint64_t count;
>
> 	contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;
>
> 	tmp = num_entries / AMDGPU_GPU_PAGES_IN_CPU_PAGE;
> 	for (count = 2; count < tmp; ++count) {
> 		uint64_t idx = pfn + count;
>
> 		if (contiguous != (pages_addr[idx] == pages_addr[idx - 1] + PAGE_SIZE))
> 			break;
> 	}
> 	num_entries = count * AMDGPU_GPU_PAGES_IN_CPU_PAGE;
> }

If a page type changes from one page to the next, it will end a contiguous range and call into the lower level (amdgpu_vm_update_ptes). I don't think anything needs to change in amdgpu_vm_update_ptes, because all contiguous pages passed to it use the same page type, so the same flags. And the existing code in amdgpu_vm_update_mapping already does the right thing. I honestly don't see the problem.

Regards,
Felix

> Regards,
> Christian.
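The contiguity check quoted in this thread can be demonstrated with a small userspace adaptation. PAGE_SIZE/PAGE_SHIFT are simplified here, and the high-bit page-type tag (bit 62) is an assumption purely for illustration; the point is that a page-type change in the high bits terminates a physically contiguous run, exactly as the kernel loop would see it:

```c
/*
 * Userspace sketch of the run-detection loop from the amdgpu page-table
 * update path. A high-bit page-type tag (an assumed encoding, not the
 * real driver's) breaks address contiguity and therefore ends the run.
 */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1ULL << PAGE_SHIFT)
#define TYPE_VRAM	(1ULL << 62)	/* assumed high-bit page-type tag */

/*
 * Length of the run starting at pfn, following the kernel loop's logic.
 * Requires n >= 2 (the first pair decides whether the run is contiguous).
 */
static uint64_t contig_run(const uint64_t *pages_addr, uint64_t pfn, uint64_t n)
{
	uint64_t count;
	int contiguous = pages_addr[pfn + 1] == pages_addr[pfn] + PAGE_SIZE;

	for (count = 2; count < n; ++count) {
		uint64_t idx = pfn + count;

		/* stop as soon as the contiguity status flips */
		if (contiguous != (pages_addr[idx] ==
				   pages_addr[idx - 1] + PAGE_SIZE))
			break;
	}
	return count;
}
```

With four physically contiguous pages the run covers all four; tagging the third page with a different type ends the run after two pages, which is what lets the existing code fall back to per-type mappings without further changes.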
Re: Need to support mixed memory mappings with HMM
Am 25.03.21 um 17:20 schrieb Felix Kuehling:
> Am 2021-03-25 um 12:16 p.m. schrieb Christian König:
>> Am 25.03.21 um 17:14 schrieb Felix Kuehling:
>>> Am 2021-03-25 um 12:10 p.m. schrieb Christian König:
>>>> Am 25.03.21 um 17:03 schrieb Felix Kuehling:
>>>>> Hi,
>>>>>
>>>>> This is a long one with a proposal for a pretty significant redesign of how we handle migrations and VRAM management with HMM. This is based on my own debugging and reading of the migrate_vma helpers, as well as Alex's problems with migrations on A+A. I hope we can discuss this next Monday after you've had some time to digest it.
>>>>>
>>>>> I did some debugging yesterday and found that migrations to VRAM can fail for some pages. The current migration helpers have many corner cases where a page cannot be migrated. Some of them may be fixable (adding support for THP), others are not (locked pages are skipped to avoid deadlocks). Therefore I think our current code is too inflexible when it assumes that a range is entirely in one place.
>>>>>
>>>>> Alex also ran into some funny issues with COW on A+A where some pages get faulted back to system memory. I think a lot of the problems here will get easier once we support mixed mappings.
>>>>>
>>>>> Mixed GPU mappings
>>>>> ===
>>>>>
>>>>> The idea is to remove any assumptions that an entire svm_range is in one place. Instead hmm_range_fault gives us a list of pages, some of which are system memory and others are device_private or device_generic.
>>>>>
>>>>> We will need an amdgpu_vm interface that lets us map mixed page arrays where different pages use different PTE flags. We can have at least 3 different types of pages in one mapping:
>>>>>
>>>>> 1. System memory (S-bit set)
>>>>> 2. Local memory (S-bit cleared, MTYPE for local memory)
>>>>> 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>>>>
>>>>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
>>>> Yeah, I've thought about stuff like that as well for a while.
>>>>
>>>> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.
>>> I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.
>> Yeah, but then you also get absolutely zero THP and fragment flags support.
> As long as you have a contiguous 2MB page with the same page type, I think you can still get a THP mapping in the GPU page table. But if one page in the middle of a 2MB page has a different page type, that will break the THP mapping, as it should.

Yeah, but currently we detect that before we call down into the functions to update the tables.

When you give a mixed list the chance for that is just completely gone.

Regards,
Christian.

> Regards,
> Felix
>> But I think we could also add those later on.
>>
>> Regards,
>> Christian.
>>> Regards,
>>> Felix
>>>> We would somehow need to change that to get the flags directly from the low bits of the DMA address or something instead.
>>>>
>>>> Christian.
>>>>> Once this is done, it leads to a number of opportunities for simplification and better efficiency in kfd_svm:
>>>>>
>>>>> * Support migration when cpages != npages
>>>>> * Migrate a part of an svm_range without splitting it. No more splitting of ranges in CPU page faults
>>>>> * Migrate a part of an svm_range in GPU page fault handler. No more migrating the whole range for a single page fault
>>>>> * Simplified VRAM management (see below)
>>>>>
>>>>> With that, svm_range will no longer have an "actual_loc" field. If we're not sure where the data is, we need to call migrate. If it's already in the right place, then cpages will be 0 and we can skip all the steps after migrate_vma_setup.
>>>>>
>>>>> Simplified VRAM management
>>>>> ==
>>>>>
>>>>> VRAM BOs are no longer associated with pranges. Instead they are "free-floating", allocated during migration to VRAM, with a reference count for each page that uses the BO. The ref is released in the page-release callback. When the ref count drops to 0, free the BO.
>>>>>
>>>>> VRAM BO size should match the migration granularity, 2MB by default. That way the BO can be freed when memory gets migrated out. If migration of some pages fails the BO may not be fully occupied. Also some pages may be released individually on A+A due to COW or other events.
>>>>>
>>>>> Eviction needs to migrate all the pages still using the BO. If the BO struct keeps an array of page pointers, that's basically the migrate.src for the eviction. Migration calls "try_to_unmap", which has the best chance of freeing the BO, even when shared by multiple processes.
>>>>>
>>>>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory pressure so we can start evicting memory.
>>>>>
>>>>> Regards,
>>>>> Felix
Re: Need to support mixed memory mappings with HMM
Am 2021-03-25 um 12:16 p.m. schrieb Christian König:
> Am 25.03.21 um 17:14 schrieb Felix Kuehling:
>> Am 2021-03-25 um 12:10 p.m. schrieb Christian König:
>>> Am 25.03.21 um 17:03 schrieb Felix Kuehling:
>>>> Hi,
>>>>
>>>> This is a long one with a proposal for a pretty significant redesign of how we handle migrations and VRAM management with HMM. This is based on my own debugging and reading of the migrate_vma helpers, as well as Alex's problems with migrations on A+A. I hope we can discuss this next Monday after you've had some time to digest it.
>>>>
>>>> I did some debugging yesterday and found that migrations to VRAM can fail for some pages. The current migration helpers have many corner cases where a page cannot be migrated. Some of them may be fixable (adding support for THP), others are not (locked pages are skipped to avoid deadlocks). Therefore I think our current code is too inflexible when it assumes that a range is entirely in one place.
>>>>
>>>> Alex also ran into some funny issues with COW on A+A where some pages get faulted back to system memory. I think a lot of the problems here will get easier once we support mixed mappings.
>>>>
>>>> Mixed GPU mappings
>>>> ===
>>>>
>>>> The idea is to remove any assumptions that an entire svm_range is in one place. Instead hmm_range_fault gives us a list of pages, some of which are system memory and others are device_private or device_generic.
>>>>
>>>> We will need an amdgpu_vm interface that lets us map mixed page arrays where different pages use different PTE flags. We can have at least 3 different types of pages in one mapping:
>>>>
>>>> 1. System memory (S-bit set)
>>>> 2. Local memory (S-bit cleared, MTYPE for local memory)
>>>> 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>>>
>>>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
>>> Yeah, I've thought about stuff like that as well for a while.
>>>
>>> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.
>> I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.
> Yeah, but then you also get absolutely zero THP and fragment flags support.

As long as you have a contiguous 2MB page with the same page type, I think you can still get a THP mapping in the GPU page table. But if one page in the middle of a 2MB page has a different page type, that will break the THP mapping, as it should.

Regards,
Felix

> But I think we could also add those later on.
>
> Regards,
> Christian.
>> Regards,
>> Felix
>>> We would somehow need to change that to get the flags directly from the low bits of the DMA address or something instead.
>>>
>>> Christian.
>>>> Once this is done, it leads to a number of opportunities for simplification and better efficiency in kfd_svm:
>>>>
>>>> * Support migration when cpages != npages
>>>> * Migrate a part of an svm_range without splitting it. No more splitting of ranges in CPU page faults
>>>> * Migrate a part of an svm_range in GPU page fault handler. No more migrating the whole range for a single page fault
>>>> * Simplified VRAM management (see below)
>>>>
>>>> With that, svm_range will no longer have an "actual_loc" field. If we're not sure where the data is, we need to call migrate. If it's already in the right place, then cpages will be 0 and we can skip all the steps after migrate_vma_setup.
>>>>
>>>> Simplified VRAM management
>>>> ==
>>>>
>>>> VRAM BOs are no longer associated with pranges. Instead they are "free-floating", allocated during migration to VRAM, with a reference count for each page that uses the BO. The ref is released in the page-release callback. When the ref count drops to 0, free the BO.
>>>>
>>>> VRAM BO size should match the migration granularity, 2MB by default. That way the BO can be freed when memory gets migrated out. If migration of some pages fails the BO may not be fully occupied. Also some pages may be released individually on A+A due to COW or other events.
>>>>
>>>> Eviction needs to migrate all the pages still using the BO. If the BO struct keeps an array of page pointers, that's basically the migrate.src for the eviction. Migration calls "try_to_unmap", which has the best chance of freeing the BO, even when shared by multiple processes.
>>>>
>>>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory pressure so we can start evicting memory.
>>>>
>>>> Regards,
>>>> Felix
Re: Need to support mixed memory mappings with HMM
Am 25.03.21 um 17:14 schrieb Felix Kuehling:
> Am 2021-03-25 um 12:10 p.m. schrieb Christian König:
>> Am 25.03.21 um 17:03 schrieb Felix Kuehling:
>>> Hi,
>>>
>>> This is a long one with a proposal for a pretty significant redesign of how we handle migrations and VRAM management with HMM. This is based on my own debugging and reading of the migrate_vma helpers, as well as Alex's problems with migrations on A+A. I hope we can discuss this next Monday after you've had some time to digest it.
>>>
>>> I did some debugging yesterday and found that migrations to VRAM can fail for some pages. The current migration helpers have many corner cases where a page cannot be migrated. Some of them may be fixable (adding support for THP), others are not (locked pages are skipped to avoid deadlocks). Therefore I think our current code is too inflexible when it assumes that a range is entirely in one place.
>>>
>>> Alex also ran into some funny issues with COW on A+A where some pages get faulted back to system memory. I think a lot of the problems here will get easier once we support mixed mappings.
>>>
>>> Mixed GPU mappings
>>> ===
>>>
>>> The idea is to remove any assumptions that an entire svm_range is in one place. Instead hmm_range_fault gives us a list of pages, some of which are system memory and others are device_private or device_generic.
>>>
>>> We will need an amdgpu_vm interface that lets us map mixed page arrays where different pages use different PTE flags. We can have at least 3 different types of pages in one mapping:
>>>
>>> 1. System memory (S-bit set)
>>> 2. Local memory (S-bit cleared, MTYPE for local memory)
>>> 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>>
>>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
>> Yeah, I've thought about stuff like that as well for a while.
>>
>> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.
> I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.

Yeah, but then you also get absolutely zero THP and fragment flags support.

But I think we could also add those later on.

Regards,
Christian.

> Regards,
> Felix
>> We would somehow need to change that to get the flags directly from the low bits of the DMA address or something instead.
>>
>> Christian.
>>> Once this is done, it leads to a number of opportunities for simplification and better efficiency in kfd_svm:
>>>
>>> * Support migration when cpages != npages
>>> * Migrate a part of an svm_range without splitting it. No more splitting of ranges in CPU page faults
>>> * Migrate a part of an svm_range in GPU page fault handler. No more migrating the whole range for a single page fault
>>> * Simplified VRAM management (see below)
>>>
>>> With that, svm_range will no longer have an "actual_loc" field. If we're not sure where the data is, we need to call migrate. If it's already in the right place, then cpages will be 0 and we can skip all the steps after migrate_vma_setup.
>>>
>>> Simplified VRAM management
>>> ==
>>>
>>> VRAM BOs are no longer associated with pranges. Instead they are "free-floating", allocated during migration to VRAM, with a reference count for each page that uses the BO. The ref is released in the page-release callback. When the ref count drops to 0, free the BO.
>>>
>>> VRAM BO size should match the migration granularity, 2MB by default. That way the BO can be freed when memory gets migrated out. If migration of some pages fails the BO may not be fully occupied. Also some pages may be released individually on A+A due to COW or other events.
>>>
>>> Eviction needs to migrate all the pages still using the BO. If the BO struct keeps an array of page pointers, that's basically the migrate.src for the eviction. Migration calls "try_to_unmap", which has the best chance of freeing the BO, even when shared by multiple processes.
>>>
>>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory pressure so we can start evicting memory.
>>>
>>> Regards,
>>> Felix
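The "free-floating" VRAM BO scheme proposed in this thread (per-page references, BO freed when the last page goes away) could be sketched as follows. All names here (vram_bo, bo_get_page, bo_put_page) are hypothetical, and the page-pointer array doubles as the migrate.src for eviction as described above:

```c
/*
 * Sketch of a free-floating VRAM BO: a 2MB BO holds an array of page
 * slots, each migrated page takes a reference, and the page-release
 * callback drops it. The BO is freed on the last put. Hypothetical
 * names and structure, for illustration only.
 */
#include <assert.h>
#include <stdlib.h>

#define BO_PAGES 512			/* 2MB / 4KB migration granularity */

struct vram_bo {
	int refcount;			/* one ref per page using the BO */
	void *pages[BO_PAGES];		/* backpointers; the migrate.src for eviction */
};

static struct vram_bo *bo_create(void)
{
	return calloc(1, sizeof(struct vram_bo));
}

/* Called when a page is migrated into slot 'i' of the BO. */
static void bo_get_page(struct vram_bo *bo, unsigned int i, void *page)
{
	bo->pages[i] = page;
	bo->refcount++;
}

/* Page-release callback: drop the ref; free the BO on the last one. */
static int bo_put_page(struct vram_bo *bo, unsigned int i)
{
	bo->pages[i] = NULL;
	if (--bo->refcount == 0) {
		free(bo);
		return 1;		/* BO was freed */
	}
	return 0;
}
```

Because the BO only tracks which pages still reference it, a partially failed migration or individual COW releases simply leave some slots empty until the remaining refs are dropped.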
Re: Need to support mixed memory mappings with HMM
Am 2021-03-25 um 12:10 p.m. schrieb Christian König:
> Am 25.03.21 um 17:03 schrieb Felix Kuehling:
>> Hi,
>>
>> This is a long one with a proposal for a pretty significant redesign of how we handle migrations and VRAM management with HMM. This is based on my own debugging and reading of the migrate_vma helpers, as well as Alex's problems with migrations on A+A. I hope we can discuss this next Monday after you've had some time to digest it.
>>
>> I did some debugging yesterday and found that migrations to VRAM can fail for some pages. The current migration helpers have many corner cases where a page cannot be migrated. Some of them may be fixable (adding support for THP), others are not (locked pages are skipped to avoid deadlocks). Therefore I think our current code is too inflexible when it assumes that a range is entirely in one place.
>>
>> Alex also ran into some funny issues with COW on A+A where some pages get faulted back to system memory. I think a lot of the problems here will get easier once we support mixed mappings.
>>
>> Mixed GPU mappings
>> ===
>>
>> The idea is to remove any assumptions that an entire svm_range is in one place. Instead hmm_range_fault gives us a list of pages, some of which are system memory and others are device_private or device_generic.
>>
>> We will need an amdgpu_vm interface that lets us map mixed page arrays where different pages use different PTE flags. We can have at least 3 different types of pages in one mapping:
>>
>> 1. System memory (S-bit set)
>> 2. Local memory (S-bit cleared, MTYPE for local memory)
>> 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)
>>
>> My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
> Yeah, I've thought about stuff like that as well for a while.
>
> Problem is that this won't work that easily. We assume at many places that the flags don't change for the range in question.

I think some lower level functions assume that the flags stay the same for physically contiguous ranges. But if you use the high-bits to encode the page type, the ranges won't be contiguous any more. So you can change page flags for different contiguous ranges.

Regards,
Felix

> We would somehow need to change that to get the flags directly from the low bits of the DMA address or something instead.
>
> Christian.
>
>> Once this is done, it leads to a number of opportunities for simplification and better efficiency in kfd_svm:
>>
>> * Support migration when cpages != npages
>> * Migrate a part of an svm_range without splitting it. No more splitting of ranges in CPU page faults
>> * Migrate a part of an svm_range in GPU page fault handler. No more migrating the whole range for a single page fault
>> * Simplified VRAM management (see below)
>>
>> With that, svm_range will no longer have an "actual_loc" field. If we're not sure where the data is, we need to call migrate. If it's already in the right place, then cpages will be 0 and we can skip all the steps after migrate_vma_setup.
>>
>> Simplified VRAM management
>> ==
>>
>> VRAM BOs are no longer associated with pranges. Instead they are "free-floating", allocated during migration to VRAM, with a reference count for each page that uses the BO. The ref is released in the page-release callback. When the ref count drops to 0, free the BO.
>>
>> VRAM BO size should match the migration granularity, 2MB by default. That way the BO can be freed when memory gets migrated out. If migration of some pages fails the BO may not be fully occupied. Also some pages may be released individually on A+A due to COW or other events.
>>
>> Eviction needs to migrate all the pages still using the BO. If the BO struct keeps an array of page pointers, that's basically the migrate.src for the eviction. Migration calls "try_to_unmap", which has the best chance of freeing the BO, even when shared by multiple processes.
>>
>> If we cannot guarantee eviction of pages, we cannot use TTM for VRAM allocations. Need to use amdgpu_vram_mgr. Need a way to detect memory pressure so we can start evicting memory.
>>
>> Regards,
>> Felix
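The high-bit page-type encoding proposed in this thread (type index in the upper bits of each pages_addr entry, selecting an entry in a flags array) might look roughly like the following. The bit positions and flag values are assumptions for illustration, not the actual amdgpu encoding:

```c
/*
 * Sketch of the proposed interface change: a page-type index stored in
 * the high bits of each pages_addr entry selects an entry in a flags
 * array. Existing callers pass &flags (array of size 1, all pages type
 * 0); HMM would pass a real array. Bit positions are assumptions.
 */
#include <assert.h>
#include <stdint.h>

#define PAGE_TYPE_SHIFT	60
#define PAGE_TYPE_MASK	(0xfULL << PAGE_TYPE_SHIFT)
#define PAGE_ADDR_MASK	(~PAGE_TYPE_MASK)

enum { TYPE_SYSTEM = 0, TYPE_VRAM = 1, TYPE_XGMI = 2 };

/* Tag a DMA address with a page-type index in the high bits. */
static uint64_t encode(uint64_t addr, unsigned int type)
{
	return (addr & PAGE_ADDR_MASK) |
	       ((uint64_t)type << PAGE_TYPE_SHIFT);
}

static unsigned int page_type(uint64_t entry)
{
	return entry >> PAGE_TYPE_SHIFT;
}

/* Look up the PTE flags for one pages_addr entry. */
static uint64_t pte_flags(const uint64_t *flags, uint64_t entry)
{
	return flags[page_type(entry)];
}
```

For untagged entries page_type() is 0, so an existing caller's single-element flags array keeps working unchanged, which is the backward-compatibility property the proposal relies on.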
Re: Need to support mixed memory mappings with HMM
Am 25.03.21 um 17:03 schrieb Felix Kuehling: Hi, This is a long one with a proposal for a pretty significant redesign of how we handle migrations and VRAM management with HMM. This is based on my own debugging and reading of the migrate_vma helpers, as well as Alex's problems with migrations on A+A. I hope we can discuss this next Monday after you've had some time do digest it. I did some debugging yesterday and found that migrations to VRAM can fail for some pages. The current migration helpers have many corner cases where a page cannot be migrated. Some of them may be fixable (adding support for THP), others are not (locked pages are skipped to avoid deadlocks). Therefore I think our current code is too inflexible when it assumes that a range is entirely in one place. Alex also ran into some funny issues with COW on A+A where some pages get faulted back to system memory. I think a lot of the problems here will get easier once we support mixed mappings. Mixed GPU mappings === The idea is, to remove any assumptions that an entire svm_range is in one place. Instead hmm_range_fault gives us a list of pages, some of which are system memory and others are device_private or device_generic. We will need an amdgpu_vm interface that lets us map mixed page arrays where different pages use different PTE flags. We can have at least 3 different types of pages in one mapping: 1. System memory (S-bit set) 2. Local memory (S-bit cleared, MTYPE for local memory) 3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory) My idea is to change the amdgpu_vm_update_mapping interface to use some high-bit in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array. 
Yeah, I've thought about stuff like that as well for a while. Problem is that this won't work that easily. We assume in many places that the flags don't change for the range in question. We would somehow need to change that to get the flags directly from the low bits of the DMA address or something instead.

Christian.

Once this is done, it leads to a number of opportunities for simplification and better efficiency in kfd_svm:

* Support migration when cpages != npages
* Migrate a part of an svm_range without splitting it. No more splitting of ranges in CPU page faults
* Migrate a part of an svm_range in the GPU page fault handler. No more migrating the whole range for a single page fault
* Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If we're not sure where the data is, we need to call migrate. If it's already in the right place, then cpages will be 0 and we can skip all the steps after migrate_vma_setup.

Simplified VRAM management
==

VRAM BOs are no longer associated with pranges. Instead they are "free-floating": allocated during migration to VRAM, with a reference count for each page that uses the BO. The ref is released in the page-release callback. When the ref count drops to 0, free the BO.

VRAM BO size should match the migration granularity, 2MB by default. That way the BO can be freed when memory gets migrated out. If migration of some pages fails, the BO may not be fully occupied. Also, some pages may be released individually on A+A due to COW or other events.

Eviction needs to migrate all the pages still using the BO. If the BO struct keeps an array of page pointers, that's basically the migrate.src for the eviction. Migration calls "try_to_unmap", which has the best chance of freeing the BO, even when shared by multiple processes.

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM allocations. We need to use amdgpu_vram_mgr, and we need a way to detect memory pressure so we can start evicting memory.
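The free-floating VRAM BO lifetime described above could be sketched roughly as follows. The struct and helper names (svm_vram_bo, svm_vram_bo_put, ...) are made up for illustration, and the "freed" flag stands in for the actual BO release through amdgpu_vram_mgr:

```c
#include <assert.h>
#include <stddef.h>

struct page; /* opaque stand-in for the kernel's struct page */

/* A VRAM BO sized to the migration granularity (2MB = 512 4K pages),
 * holding one reference per page still backed by it.  The pages array
 * doubles as the basis for migrate.src on eviction. */
#define PAGES_PER_BO 512

struct svm_vram_bo {
	struct page *pages[PAGES_PER_BO];
	int refcount; /* one ref per in-use page */
	int freed;    /* stand-in for freeing via amdgpu_vram_mgr */
};

static void svm_vram_bo_get(struct svm_vram_bo *bo)
{
	bo->refcount++;
}

/* Called from the device-private page-release callback when one page
 * of the BO goes away (migrated out, COW, etc.). */
static void svm_vram_bo_put(struct svm_vram_bo *bo, unsigned int idx)
{
	bo->pages[idx] = NULL;
	if (--bo->refcount == 0)
		bo->freed = 1;
}
```

The point of the design is that the BO's lifetime is driven entirely by page releases, so partial migration failures or individual page releases on A+A just leave a partially occupied BO behind instead of breaking a range-to-BO association.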
Regards,
  Felix

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Need to support mixed memory mappings with HMM
Hi,

This is a long one with a proposal for a pretty significant redesign of how we handle migrations and VRAM management with HMM. This is based on my own debugging and reading of the migrate_vma helpers, as well as Alex's problems with migrations on A+A. I hope we can discuss this next Monday after you've had some time to digest it.

I did some debugging yesterday and found that migrations to VRAM can fail for some pages. The current migration helpers have many corner cases where a page cannot be migrated. Some of them may be fixable (adding support for THP), others are not (locked pages are skipped to avoid deadlocks). Therefore I think our current code is too inflexible when it assumes that a range is entirely in one place.

Alex also ran into some funny issues with COW on A+A where some pages get faulted back to system memory. I think a lot of the problems here will get easier once we support mixed mappings.

Mixed GPU mappings
===

The idea is to remove any assumption that an entire svm_range is in one place. Instead, hmm_range_fault gives us a list of pages, some of which are system memory and others are device_private or device_generic. We will need an amdgpu_vm interface that lets us map mixed page arrays where different pages use different PTE flags. We can have at least 3 different types of pages in one mapping:

1. System memory (S-bit set)
2. Local memory (S-bit cleared, MTYPE for local memory)
3. Remote XGMI memory (S-bit cleared, MTYPE+C for remote memory)

My idea is to change the amdgpu_vm_update_mapping interface to use some high bits in the pages_addr array to indicate the page type. For the default page type (0) nothing really changes for the callers. The "flags" parameter needs to become a pointer to an array that gets indexed by the high bits from the pages_addr array. For existing callers it's as easy as changing flags to &flags (array of size 1). For HMM we would pass a pointer to a real array.
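A caller could also consume such a mixed pages_addr array by walking it in runs of equal page type, issuing one mapping update per run with that run's flags. A hypothetical helper (TYPE_SHIFT and same_type_run are assumptions for the sketch, not driver code):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Assumed position of the page-type bits in a pages_addr entry. */
#define TYPE_SHIFT 62

/* Return the length of the run of entries starting at 'start' that all
 * share the page type of pages_addr[start].  A caller could map each
 * such run with one amdgpu_vm_update_mapping call, switching PTE flags
 * between runs. */
static size_t same_type_run(const uint64_t *pages_addr, size_t start,
			    size_t n)
{
	unsigned int type = (unsigned int)(pages_addr[start] >> TYPE_SHIFT);
	size_t i = start + 1;

	while (i < n && (pages_addr[i] >> TYPE_SHIFT) == type)
		i++;
	return i - start;
}
```

A run boundary is exactly where the page type, and therefore the flags, change; as long as the different memory types aren't too finely interleaved this costs only a few extra mapping calls.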
Once this is done, it leads to a number of opportunities for simplification and better efficiency in kfd_svm:

* Support migration when cpages != npages
* Migrate a part of an svm_range without splitting it. No more splitting of ranges in CPU page faults
* Migrate a part of an svm_range in the GPU page fault handler. No more migrating the whole range for a single page fault
* Simplified VRAM management (see below)

With that, svm_range will no longer have an "actual_loc" field. If we're not sure where the data is, we need to call migrate. If it's already in the right place, then cpages will be 0 and we can skip all the steps after migrate_vma_setup.

Simplified VRAM management
==

VRAM BOs are no longer associated with pranges. Instead they are "free-floating": allocated during migration to VRAM, with a reference count for each page that uses the BO. The ref is released in the page-release callback. When the ref count drops to 0, free the BO.

VRAM BO size should match the migration granularity, 2MB by default. That way the BO can be freed when memory gets migrated out. If migration of some pages fails, the BO may not be fully occupied. Also, some pages may be released individually on A+A due to COW or other events.

Eviction needs to migrate all the pages still using the BO. If the BO struct keeps an array of page pointers, that's basically the migrate.src for the eviction. Migration calls "try_to_unmap", which has the best chance of freeing the BO, even when shared by multiple processes.

If we cannot guarantee eviction of pages, we cannot use TTM for VRAM allocations. We need to use amdgpu_vram_mgr, and we need a way to detect memory pressure so we can start evicting memory.

Regards,
  Felix
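Building migrate.src for an eviction from the BO's page-pointer array could look like the sketch below. The MIGRATE_PFN_* values and migrate_pfn() here are simplified stand-ins mirroring the kernel's migrate_vma encoding, and the pfns array replaces what page_to_pfn() would provide in the kernel:

```c
#include <assert.h>
#include <stddef.h>

struct page; /* opaque stand-in for the kernel's struct page */

/* Simplified stand-ins for the kernel's migrate_vma pfn encoding. */
#define MIGRATE_PFN_VALID   (1UL << 0)
#define MIGRATE_PFN_MIGRATE (1UL << 1)
#define migrate_pfn(pfn)    (((unsigned long)(pfn) << 2) | MIGRATE_PFN_VALID)

/* Fill migrate.src for evicting a BO: one entry per page slot, zero for
 * slots whose page was already released individually (COW etc.).
 * Returns the number of pages that still need migrating. */
static size_t bo_fill_migrate_src(struct page *const *pages,
				  const unsigned long *pfns,
				  size_t n, unsigned long *src)
{
	size_t cpages = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i]) {
			src[i] = 0; /* hole: page already gone */
			continue;
		}
		src[i] = migrate_pfn(pfns[i]) | MIGRATE_PFN_MIGRATE;
		cpages++;
	}
	return cpages;
}
```

This is what "the page array is basically the migrate.src" means in practice: eviction needs no per-range bookkeeping, only the BO's own record of which pages still reference it.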