Re: [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
"Aneesh Kumar K.V" writes: > On 7/8/20 10:14 AM, Michael Ellerman wrote: >> "Aneesh Kumar K.V" writes: >>> To enable memory unplug without splitting kernel page table >>> mapping, we force the max mapping size to the LMB size. LMB >>> size is the unit in which hypervisor will do memory add/remove >>> operation. >>> >>> This implies on pseries system, we now end up mapping >> >> Please expand on why it "implies" that for pseries. >> >>> memory with 2M page size instead of 1G. To improve >>> that we want hypervisor to hint the kernel about the hotplug >>> memory range. This was added that as part of >> That >>> >>> commit b6eca183e23e ("powerpc/kernel: Enables memory >>> hot-remove after reboot on pseries guests") >>> >>> But we still don't do that on PowerVM. Once we get PowerVM >> >> I think you mean PowerVM doesn't provide that hint yet? >> >> Realistically it won't until P10. So this means we'll always use 2MB on >> Power9 PowerVM doesn't it? >> >> What about KVM? >> >> Have you done any benchmarking on the impact of switching the linear >> mapping to 2MB pages? >> > > The TLB impact should be minimal because with a 256M LMB size partition > scoped entries are still 2M and hence we end up with TLBs of 2M size. > > >>> updated, we can then force the 2M mapping only to hot-pluggable >>> memory region using memblock_is_hotpluggable(). Till then >>> let's depend on LMB size for finding the mapping page size >>> for linear range. >>> > > updated > > > powerpc/mm/radix: Create separate mappings for hot-plugged memory > > To enable memory unplug without splitting kernel page table > mapping, we force the max mapping size to the LMB size. LMB > size is the unit in which hypervisor will do memory add/remove > operation. > > Pseries systems supports max LMB size of 256MB. Hence on pseries, > we now end up mapping memory with 2M page size instead of 1G. To improve > that we want hypervisor to hint the kernel about the hotplug > memory range. 
That was added that as part of > > commit b6eca18 ("powerpc/kernel: Enables memory > hot-remove after reboot on pseries guests") > > But PowerVM doesn't provide that hint yet. Once we get PowerVM > updated, we can then force the 2M mapping only to hot-pluggable > memory region using memblock_is_hotpluggable(). Till then > let's depend on LMB size for finding the mapping page size > for linear range. > > With this change KVM guest will also be doing linear mapping with > 2M page size. ... >>> @@ -494,17 +544,27 @@ void __init radix__early_init_devtree(void) >>> * Try to find the available page sizes in the device-tree >>> */ >>> rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL); >>> - if (rc != 0) /* Found */ >>> - goto found; >>> + if (rc == 0) { >>> + /* >>> +* no page size details found in device tree >>> +* let's assume we have page 4k and 64k support >> >> Capitals and punctuation please? >> >>> +*/ >>> + mmu_psize_defs[MMU_PAGE_4K].shift = 12; >>> + mmu_psize_defs[MMU_PAGE_4K].ap = 0x0; >>> + >>> + mmu_psize_defs[MMU_PAGE_64K].shift = 16; >>> + mmu_psize_defs[MMU_PAGE_64K].ap = 0x5; >>> + } >> >> Moving that seems like an unrelated change. It's a reasonable change but >> I'd rather you did it in a standalone patch. >> > > we needed that change so that we can call radix_memory_block_size() for > both found and !found case. But the found and !found cases converge at found:, which is where you call it. So I don't understand. But as I said below, it would be even simpler if you worked out the memory block size first. cheers >>> /* >>> -* let's assume we have page 4k and 64k support >>> +* Max mapping size used when mapping pages. We don't use >>> +* ppc_md.memory_block_size() here because this get called >>> +* early and we don't have machine probe called yet. Also >>> +* the pseries implementation only check for ibm,lmb-size. >>> +* All hypervisor supporting radix do expose that device >>> +* tree node. 
>>> */ >>> - mmu_psize_defs[MMU_PAGE_4K].shift = 12; >>> - mmu_psize_defs[MMU_PAGE_4K].ap = 0x0; >>> - >>> - mmu_psize_defs[MMU_PAGE_64K].shift = 16; >>> - mmu_psize_defs[MMU_PAGE_64K].ap = 0x5; >>> -found: >>> + radix_mem_block_size = radix_memory_block_size(); >> >> If you did that earlier in the function, before >> radix_dt_scan_page_sizes(), the logic would be simpler. >> >>> return; >>> }
Re: [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
On 7/8/20 10:14 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" writes:
>> To enable memory unplug without splitting kernel page table
>> mapping, we force the max mapping size to the LMB size. LMB
>> size is the unit in which hypervisor will do memory add/remove
>> operation.
>>
>> This implies on pseries system, we now end up mapping
>
> Please expand on why it "implies" that for pseries.
>
>> memory with 2M page size instead of 1G. To improve
>> that we want hypervisor to hint the kernel about the hotplug
>> memory range. This was added that as part of
>
> That
>
>> commit b6eca183e23e ("powerpc/kernel: Enables memory
>> hot-remove after reboot on pseries guests")
>>
>> But we still don't do that on PowerVM. Once we get PowerVM
>
> I think you mean PowerVM doesn't provide that hint yet?
>
> Realistically it won't until P10. So this means we'll always use 2MB on
> Power9 PowerVM doesn't it?
>
> What about KVM?
>
> Have you done any benchmarking on the impact of switching the linear
> mapping to 2MB pages?

The TLB impact should be minimal because with a 256M LMB size partition
scoped entries are still 2M and hence we end up with TLBs of 2M size.

>> updated, we can then force the 2M mapping only to hot-pluggable
>> memory region using memblock_is_hotpluggable(). Till then
>> let's depend on LMB size for finding the mapping page size
>> for linear range.

updated

    powerpc/mm/radix: Create separate mappings for hot-plugged memory

    To enable memory unplug without splitting kernel page table
    mapping, we force the max mapping size to the LMB size. LMB
    size is the unit in which hypervisor will do memory add/remove
    operation.

    Pseries systems support a max LMB size of 256MB. Hence on pseries,
    we now end up mapping memory with 2M page size instead of 1G. To
    improve that we want hypervisor to hint the kernel about the hotplug
    memory range. That was added as part of

    commit b6eca18 ("powerpc/kernel: Enables memory
    hot-remove after reboot on pseries guests")

    But PowerVM doesn't provide that hint yet. Once we get PowerVM
    updated, we can then force the 2M mapping only to hot-pluggable
    memory region using memblock_is_hotpluggable(). Till then
    let's depend on LMB size for finding the mapping page size
    for linear range.

    With this change KVM guest will also be doing linear mapping with
    2M page size.

>> Signed-off-by: Bharata B Rao
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>>  arch/powerpc/mm/book3s64/radix_pgtable.c | 83
>>  arch/powerpc/platforms/powernv/setup.c | 10 ++-
>>  2 files changed, 81 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> index 78ad11812e0d..a4179e4da49d 100644
>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -15,6 +15,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include
>>  #include
>> @@ -34,6 +35,7 @@
>>
>>  unsigned int mmu_pid_bits;
>>  unsigned int mmu_base_pid;
>> +unsigned int radix_mem_block_size;
>
> static? __ro_after_init?

Added __ro_after_init.

>> @@ -266,6 +268,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)
>>
>>  static int __meminit create_physical_mapping(unsigned long start,
>>  					     unsigned long end,
>> +					     unsigned long max_mapping_size,
>>  					     int nid, pgprot_t _prot)
>>  {
>>  	unsigned long vaddr, addr, mapping_size = 0;
>> @@ -279,6 +282,8 @@ static int __meminit create_physical_mapping(unsigned long start,
>>  		int rc;
>>
>>  		gap = next_boundary(addr, end) - addr;
>> +		if (gap > max_mapping_size)
>> +			gap = max_mapping_size;
>>  		previous_size = mapping_size;
>>  		prev_exec = exec;
>>
>> @@ -329,8 +334,9 @@ static void __init radix_init_pgtable(void)
>>
>>  	/* We don't support slb for radix */
>>  	mmu_slb_size = 0;
>> +
>>  	/*
>> -	 * Create the linear mapping, using standard page size for now
>> +	 * Create the linear mapping
>>  	 */
>>  	for_each_memblock(memory, reg) {
>>  		/*
>> @@ -346,6 +352,7 @@ static void __init radix_init_pgtable(void)
>>
>>  		WARN_ON(create_physical_mapping(reg->base,
>>  						reg->base + reg->size,
>> +						radix_mem_block_size,
>> 						-1, PAGE_KERNEL));
>>  	}
>>
>> @@ -486,6 +493,49 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
>>  	return 1;
>>  }
>>
>> +static int __init probe_memory_block_size(unsigned long node, const char *uname, int
>> +					  depth, void *data)
>> +{
>> +	const __be32 *block_size;
>> +	int len;
>> +
>> +	if (depth != 1)
>> +		return 0;
>> +
>> +	if (!strcmp(uname, "ibm,dynamic-reconfiguration-memory")) {
>
> 	if (strcmp(uname, "ibm,dynamic-reconfiguration-memory") != 0)
> 		return 0;
Re: [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
"Aneesh Kumar K.V" writes: > To enable memory unplug without splitting kernel page table > mapping, we force the max mapping size to the LMB size. LMB > size is the unit in which hypervisor will do memory add/remove > operation. > > This implies on pseries system, we now end up mapping Please expand on why it "implies" that for pseries. > memory with 2M page size instead of 1G. To improve > that we want hypervisor to hint the kernel about the hotplug > memory range. This was added that as part of That > > commit b6eca183e23e ("powerpc/kernel: Enables memory > hot-remove after reboot on pseries guests") > > But we still don't do that on PowerVM. Once we get PowerVM I think you mean PowerVM doesn't provide that hint yet? Realistically it won't until P10. So this means we'll always use 2MB on Power9 PowerVM doesn't it? What about KVM? Have you done any benchmarking on the impact of switching the linear mapping to 2MB pages? > updated, we can then force the 2M mapping only to hot-pluggable > memory region using memblock_is_hotpluggable(). Till then > let's depend on LMB size for finding the mapping page size > for linear range. > > Signed-off-by: Bharata B Rao > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/mm/book3s64/radix_pgtable.c | 83 > arch/powerpc/platforms/powernv/setup.c | 10 ++- > 2 files changed, 81 insertions(+), 12 deletions(-) > > diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c > b/arch/powerpc/mm/book3s64/radix_pgtable.c > index 78ad11812e0d..a4179e4da49d 100644 > --- a/arch/powerpc/mm/book3s64/radix_pgtable.c > +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -34,6 +35,7 @@ > > unsigned int mmu_pid_bits; > unsigned int mmu_base_pid; > +unsigned int radix_mem_block_size; static? __ro_after_init? 
> @@ -266,6 +268,7 @@ static unsigned long next_boundary(unsigned long addr, > unsigned long end) > > static int __meminit create_physical_mapping(unsigned long start, >unsigned long end, > + unsigned long max_mapping_size, >int nid, pgprot_t _prot) > { > unsigned long vaddr, addr, mapping_size = 0; > @@ -279,6 +282,8 @@ static int __meminit create_physical_mapping(unsigned > long start, > int rc; > > gap = next_boundary(addr, end) - addr; > + if (gap > max_mapping_size) > + gap = max_mapping_size; > previous_size = mapping_size; > prev_exec = exec; > > @@ -329,8 +334,9 @@ static void __init radix_init_pgtable(void) > > /* We don't support slb for radix */ > mmu_slb_size = 0; > + > /* > - * Create the linear mapping, using standard page size for now > + * Create the linear mapping >*/ > for_each_memblock(memory, reg) { > /* > @@ -346,6 +352,7 @@ static void __init radix_init_pgtable(void) > > WARN_ON(create_physical_mapping(reg->base, > reg->base + reg->size, > + radix_mem_block_size, > -1, PAGE_KERNEL)); > } > > @@ -486,6 +493,49 @@ static int __init radix_dt_scan_page_sizes(unsigned long > node, > return 1; > } > > +static int __init probe_memory_block_size(unsigned long node, const char > *uname, int > + depth, void *data) > +{ > + const __be32 *block_size; > + int len; > + > + if (depth != 1) > + return 0; > + > + if (!strcmp(uname, "ibm,dynamic-reconfiguration-memory")) { if (strcmp(uname, "ibm,dynamic-reconfiguration-memory") != 0) return 0; Then you can de-indent the rest of the body. > + block_size = of_get_flat_dt_prop(node, "ibm,lmb-size", &len); > + if (!block_size || len < dt_root_size_cells * sizeof(__be32)) This will give us a sparse warning, please do the endian conversion before looking at the value. > + /* > + * Nothing in the device tree > + */ > + return MIN_MEMORY_BLOCK_SIZE; > + > + return dt_mem_next_cell(dt_root_size_cells, &block_size); Just of_read_number() ? 
This is misusing the return value, as I explained on one of your other recent patches. You should return !0 to indicate that scanning should stop, and the actual value can go via the data pointer, or better just set the global. > + } > + > + return 0; > +} > + > +static unsigned long radix_memory_block_size(void) > +{ > + unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE; > + > + if (firmware_has_feature(FW_FEATURE_OPAL)) { > + > + mem_block_size
[PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.

This implies on pseries system, we now end up mapping
memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range. This was added that as part of

commit b6eca183e23e ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")

But we still don't do that on PowerVM. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.

Signed-off-by: Bharata B Rao
Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 83
 arch/powerpc/platforms/powernv/setup.c | 10 ++-
 2 files changed, 81 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 78ad11812e0d..a4179e4da49d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -34,6 +35,7 @@

 unsigned int mmu_pid_bits;
 unsigned int mmu_base_pid;
+unsigned int radix_mem_block_size;

 static __ref void *early_alloc_pgtable(unsigned long size, int nid,
 			unsigned long region_start, unsigned long region_end)
@@ -266,6 +268,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)

 static int __meminit create_physical_mapping(unsigned long start,
 					     unsigned long end,
+					     unsigned long max_mapping_size,
 					     int nid, pgprot_t _prot)
 {
 	unsigned long vaddr, addr, mapping_size = 0;
@@ -279,6 +282,8 @@ static int __meminit create_physical_mapping(unsigned long start,
 		int rc;

 		gap = next_boundary(addr, end) - addr;
+		if (gap > max_mapping_size)
+			gap = max_mapping_size;
 		previous_size = mapping_size;
 		prev_exec = exec;

@@ -329,8 +334,9 @@ static void __init radix_init_pgtable(void)

 	/* We don't support slb for radix */
 	mmu_slb_size = 0;
+
 	/*
-	 * Create the linear mapping, using standard page size for now
+	 * Create the linear mapping
 	 */
 	for_each_memblock(memory, reg) {
 		/*
@@ -346,6 +352,7 @@ static void __init radix_init_pgtable(void)

 		WARN_ON(create_physical_mapping(reg->base,
 						reg->base + reg->size,
+						radix_mem_block_size,
 						-1, PAGE_KERNEL));
 	}

@@ -486,6 +493,49 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
 	return 1;
 }

+static int __init probe_memory_block_size(unsigned long node, const char *uname, int
+					  depth, void *data)
+{
+	const __be32 *block_size;
+	int len;
+
+	if (depth != 1)
+		return 0;
+
+	if (!strcmp(uname, "ibm,dynamic-reconfiguration-memory")) {
+
+		block_size = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
+		if (!block_size || len < dt_root_size_cells * sizeof(__be32))
+			/*
+			 * Nothing in the device tree
+			 */
+			return MIN_MEMORY_BLOCK_SIZE;
+
+		return dt_mem_next_cell(dt_root_size_cells, &block_size);
+
+	}
+
+	return 0;
+}
+
+static unsigned long radix_memory_block_size(void)
+{
+	unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+
+	if (firmware_has_feature(FW_FEATURE_OPAL)) {
+
+		mem_block_size = 1UL * 1024 * 1024 * 1024;
+
+	} else if (firmware_has_feature(FW_FEATURE_LPAR)) {
+		mem_block_size = of_scan_flat_dt(probe_memory_block_size, NULL);
+		if (!mem_block_size)
+			mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+	}
+
+	return mem_block_size;
+}
+
+
 void __init radix__early_init_devtree(void)
 {
 	int rc;
@@ -494,17 +544,27 @@ void __init radix__early_init_devtree(void)
 	 * Try to find the available page sizes in the device-tree
 	 */
 	rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
-	if (rc != 0)  /* Found */
-		goto found;
+	if (rc == 0) {
+		/*
+		 * no page size details found in device tree
+		 * let's assume we have page 4k and 64k support
+		 */
+		mmu_psize_defs[MMU_PAGE_4K].shift =