Re: [PATCH 1/1] arm64: reduce section size for sparsemem
On 2021-01-11 03:09, Anshuman Khandual wrote:

> + Catalin
>
> Hello Sudarshan,
>
> Could you please change the subject line above as follows for better
> classification and clarity.
>
> arm64/sparsemem: Reduce SECTION_SIZE_BITS
>
> On 1/9/21 4:46 AM, Sudarshan Rajagopalan wrote:
>> Reducing the section size helps reduce wastage of reserved memory for
>> huge memory holes in sparsemem model. But having a much smaller
>
> There are two distinct benefits of reducing SECTION_SIZE_BITS.
>
> - Improve memory hotplug granularity
> - Reduce reserved memory wastage for vmemmap mappings for sections
>   with large memory holes
>
>> section size bits could break PMD mappings for vmemmap and wouldn't
>> accommodate the highest order page for certain page size granule
>> configs.
>
> There are constraints in reducing SECTION_SIZE_BITS, like
>
> - Should accommodate the highest order page for a given config
> - Should not break PMD mapping in vmemmap for 4K pages
> - Should not consume too many page->flags bits, reducing space for
>   other info
>
> Both benefits and constraints should be described in the commit
> message for folks to understand the rationale clearly at a later point
> in time.
>
>> It is determined that SECTION_SIZE_BITS of 27 (128MB) could be ideal
>
> Probably needs some description of how we arrived here.
>
>> default value for 4K_PAGES that gives least section size without
>> breaking PMD based vmemmap mappings. For simplicity, 16K_PAGES could
>> follow the same as 4K_PAGES. And the least SECTION_SIZE_BITS for
>> 64K_PAGES is 29 that could accommodate MAX_ORDER.
>
> Did not see this patch earlier and hence ended up writing yet another
> one. Here is the draft commit message from that patch, please feel
> free to use in part or full. But please do include the benefits, the
> constraints and the rationale for arriving at these figures.
>
> -
> memory_block_size_bytes() determines the memory hotplug granularity
> i.e the amount of memory which can be hot added or hot removed from
> the kernel. The generic value here being MIN_MEMORY_BLOCK_SIZE
> (1UL << SECTION_SIZE_BITS) applies for memory_block_size_bytes() on
> platforms like arm64 that do not override it. The current
> SECTION_SIZE_BITS is 30 i.e 1GB, which is large; a reduction here
> increases memory hotplug granularity, thus improving its agility. A
> reduced section size also reduces memory wastage in the vmemmap
> mapping for sections with large memory holes.
>
> A section size bits selection must follow:
> (MAX_ORDER - 1 + PAGE_SHIFT) <= SECTION_SIZE_BITS
>
> CONFIG_FORCE_MAX_ZONEORDER is always defined on arm64 and just
> following it would help achieve the smallest section size.
>
> SECTION_SIZE_BITS = (CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT)
>
> SECTION_SIZE_BITS = 22 (11 - 1 + 12) i.e 4MB   for 4K pages
> SECTION_SIZE_BITS = 24 (11 - 1 + 14) i.e 16MB  for 16K pages without THP
> SECTION_SIZE_BITS = 25 (12 - 1 + 14) i.e 32MB  for 16K pages with THP
> SECTION_SIZE_BITS = 26 (11 - 1 + 16) i.e 64MB  for 64K pages without THP
> SECTION_SIZE_BITS = 29 (14 - 1 + 16) i.e 512MB for 64K pages with THP
>
> But there are other problems. Reducing the section size too much would
> over-populate /sys/devices/system/memory/ and also consume too many
> page->flags bits in the !vmemmap case. Also, the section size needs to
> be a multiple of 128MB to have PMD based vmemmap mapping with
> CONFIG_ARM64_4K_PAGES.
>
> Given these constraints, let's just reduce the section size to 128MB
> for 4K and 16K base page size configs and to 512MB for the 64K base
> page size config.
> -
>
>> Signed-off-by: Sudarshan Rajagopalan
>> Suggested-by: David Hildenbrand
>> Cc: Will Deacon
>> Cc: Anshuman Khandual
>> Cc: Mike Rapoport
>> Cc: Mark Rutland
>> Cc: Suren Baghdasaryan
>
> A nit. Please add all relevant mailing lists like LAKML, MM along with
> other developers here in the CC list, so that it would never be
> missed.
>
>> ---
>>  arch/arm64/include/asm/sparsemem.h | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
>> index 1f43fcc79738..ff08ff6b677c 100644
>> --- a/arch/arm64/include/asm/sparsemem.h
>> +++ b/arch/arm64/include/asm/sparsemem.h
>> @@ -7,7 +7,13 @@
>>  #ifdef CONFIG_SPARSEMEM
>>  #define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS
>> -#define SECTION_SIZE_BITS	30
>> -#endif
>> +
>> +#if defined(CONFIG_ARM64_4K_PAGES) || defined(CONFIG_ARM64_16K_PAGES)
>
> Please add a comment, something like
>
> /*
>  * Section size must be at least 128MB for 4K base
>  * page size config. Otherwise PMD based huge page
>  * entries could not be created for vmemmap mappings.
>  * 16K follows 4K for simplicity.
>  */
>
>> +#define SECTION_SIZE_BITS	27
>> +#else
>
> Please add a comment, something like
>
> /*
>  * Section size must be at least 512MB for 64K base
>  * page size config. Otherwise it will be less than
>  * (MAX_ORDER - 1) and the build process will fail.
>  */
>
>> +#define SECTION_SIZE_BITS	29
>> +#endif /* CONFIG_ARM64_4K_PAGES || CONFIG_ARM64_16K_PAGES */
[PATCH 1/1] arm64/sparsemem: reduce SECTION_SIZE_BITS
memory_block_size_bytes() determines the memory hotplug granularity i.e
the amount of memory which can be hot added or hot removed from the
kernel. The generic value here being MIN_MEMORY_BLOCK_SIZE
(1UL << SECTION_SIZE_BITS) applies for memory_block_size_bytes() on
platforms like arm64 that do not override it. The current
SECTION_SIZE_BITS is 30 i.e 1GB, which is large; a reduction here
increases memory hotplug granularity, thus improving its agility. A
reduced section size also reduces memory wastage in the vmemmap mapping
for sections with large memory holes. So we try to set the smallest
possible section size.

A section size bits selection must follow:
(MAX_ORDER - 1 + PAGE_SHIFT) <= SECTION_SIZE_BITS

CONFIG_FORCE_MAX_ZONEORDER is always defined on arm64, so just
following it would help achieve the smallest section size.

SECTION_SIZE_BITS = (CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT)

SECTION_SIZE_BITS = 22 (11 - 1 + 12) i.e 4MB   for 4K pages
SECTION_SIZE_BITS = 24 (11 - 1 + 14) i.e 16MB  for 16K pages without THP
SECTION_SIZE_BITS = 25 (12 - 1 + 14) i.e 32MB  for 16K pages with THP
SECTION_SIZE_BITS = 26 (11 - 1 + 16) i.e 64MB  for 64K pages without THP
SECTION_SIZE_BITS = 29 (14 - 1 + 16) i.e 512MB for 64K pages with THP

But there are other problems in reducing SECTION_SIZE_BITS. Reducing it
by too much would over-populate /sys/devices/system/memory/ and also
consume too many page->flags bits in the !vmemmap case. Also, the
section size needs to be a multiple of 128MB to have PMD based vmemmap
mapping with CONFIG_ARM64_4K_PAGES.

Given these constraints, let's just reduce the section size to 128MB
for 4K and 16K base page size configs, and to 512MB for the 64K base
page size config.

Signed-off-by: Sudarshan Rajagopalan
Suggested-by: Anshuman Khandual
Suggested-by: David Hildenbrand
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Anshuman Khandual
Cc: David Hildenbrand
Cc: Mike Rapoport
Cc: Mark Rutland
Cc: Logan Gunthorpe
Cc: Andrew Morton
Cc: Steven Price
Cc: Suren Baghdasaryan
---
 arch/arm64/include/asm/sparsemem.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
index 1f43fcc79738..eb4a75d720ed 100644
--- a/arch/arm64/include/asm/sparsemem.h
+++ b/arch/arm64/include/asm/sparsemem.h
@@ -7,7 +7,26 @@
 #ifdef CONFIG_SPARSEMEM
 #define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS
-#define SECTION_SIZE_BITS	30
-#endif
+
+/*
+ * Section size must be at least 512MB for 64K base
+ * page size config. Otherwise it will be less than
+ * (MAX_ORDER - 1) and the build process will fail.
+ */
+#ifdef CONFIG_ARM64_64K_PAGES
+#define SECTION_SIZE_BITS 29
+
+#else
+
+/*
+ * Section size must be at least 128MB for 4K base
+ * page size config. Otherwise PMD based huge page
+ * entries could not be created for vmemmap mappings.
+ * 16K follows 4K for simplicity.
+ */
+#define SECTION_SIZE_BITS 27
+#endif /* CONFIG_ARM64_64K_PAGES */
+
+#endif /* CONFIG_SPARSEMEM */

 #endif

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH 0/1] arm64/sparsemem: reduce SECTION_SIZE_BITS
This patch is the follow-up from the discussions in the thread [1].

Reducing the section size has the merit of reducing wastage of reserved
memory for vmemmap mappings for sections with large memory holes. A
smaller section size also gives more granularity and agility for memory
hot(un)plugging.

But there are also constraints in reducing SECTION_SIZE_BITS:
- Should accommodate the highest order page for a given config
- Should not break PMD mapping in vmemmap for 4K pages
- Should not consume too many page->flags bits, reducing space for
  other info

This patch uses the suggestions from Anshuman Khandual and David
Hildenbrand in thread [1] to set the smallest possible section size:
128MB for the 4K and 16K base page size configs (16K follows 4K for
simplicity), and 512MB for the 64K base page size config.

[1] https://lore.kernel.org/lkml/cover.1609895500.git.sudar...@codeaurora.org/T/#m8ee60ae69db5e9eb06ca7999c43828d49ccb9626

Sudarshan Rajagopalan (1):
  arm64/sparsemem: reduce SECTION_SIZE_BITS

 arch/arm64/include/asm/sparsemem.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH 1/1] arm64: reduce section size for sparsemem
On 2021-01-20 09:49, Will Deacon wrote:

> On Fri, Jan 08, 2021 at 03:16:00PM -0800, Sudarshan Rajagopalan wrote:
>> Reducing the section size helps reduce wastage of reserved memory for
>> huge memory holes in sparsemem model. But having a much smaller
>> section size bits could break PMD mappings for vmemmap and wouldn't
>> accommodate the highest order page for certain page size granule
>> configs.
>>
>> It is determined that SECTION_SIZE_BITS of 27 (128MB) could be the
>> ideal default value for 4K_PAGES that gives the smallest section size
>> without breaking PMD based vmemmap mappings. For simplicity,
>> 16K_PAGES could follow the same as 4K_PAGES. And the smallest
>> SECTION_SIZE_BITS for 64K_PAGES is 29, which can accommodate
>> MAX_ORDER.
>>
>> Signed-off-by: Sudarshan Rajagopalan
>> Suggested-by: David Hildenbrand
>> Cc: Will Deacon
>> Cc: Anshuman Khandual
>> Cc: Mike Rapoport
>> Cc: Mark Rutland
>> Cc: Suren Baghdasaryan
>> ---
>>  arch/arm64/include/asm/sparsemem.h | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
>> index 1f43fcc79738..ff08ff6b677c 100644
>> --- a/arch/arm64/include/asm/sparsemem.h
>> +++ b/arch/arm64/include/asm/sparsemem.h
>> @@ -7,7 +7,13 @@
>>  #ifdef CONFIG_SPARSEMEM
>>  #define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS
>> -#define SECTION_SIZE_BITS	30
>> -#endif
>> +
>> +#if defined(CONFIG_ARM64_4K_PAGES) || defined(CONFIG_ARM64_16K_PAGES)
>> +#define SECTION_SIZE_BITS	27
>> +#else
>> +#define SECTION_SIZE_BITS	29
>> +#endif /* CONFIG_ARM64_4K_PAGES || CONFIG_ARM64_16K_PAGES */
>> +
>> +#endif /* CONFIG_SPARSEMEM */
>
> Please can you repost this in light of the comments from Anshuman?
>
> Thanks,
>
> Will

Sure Will. We were held up with some other critical tasks... will repost
the patch by EOD after addressing Anshuman's comments.

--
Sudarshan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH] mm: vmscan: support equal reclaim for anon and file pages
When performing memory reclaim, support treating anonymous and file
backed pages equally. Swapping anonymous pages out can be efficient
enough to justify treating anonymous and file backed pages equally
during reclaim.

Signed-off-by: Sudarshan Rajagopalan
Cc: Andrew Morton
---
 mm/vmscan.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 257cba79a96d..ec7585e0d5f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,6 +169,8 @@ struct scan_control {
  */
 int vm_swappiness = 60;

+bool balance_anon_file_reclaim = false;
+
 static void set_task_reclaim_state(struct task_struct *task,
 				   struct reclaim_state *rs)
 {
@@ -201,6 +203,13 @@ static DECLARE_RWSEM(shrinker_rwsem);
 static DEFINE_IDR(shrinker_idr);
 static int shrinker_nr_max;

+static int __init cmdline_parse_balance_reclaim(char *p)
+{
+	balance_anon_file_reclaim = true;
+	return 0;
+}
+early_param("balance_reclaim", cmdline_parse_balance_reclaim);
+
 static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 {
 	int id, ret = -ENOMEM;
@@ -2291,9 +2300,11 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,

 	/*
 	 * If there is enough inactive page cache, we do not reclaim
-	 * anything from the anonymous working set right now.
+	 * anything from the anonymous working set right now. But when
+	 * balancing anon and page cache files for reclaim, allow swapping
+	 * of anon pages even if there are a number of inactive file cache
+	 * pages.
 	 */
-	if (sc->cache_trim_mode) {
+	if (!balance_anon_file_reclaim && sc->cache_trim_mode) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH 1/1] arm64: reduce section size for sparsemem
Reducing the section size helps reduce wastage of reserved memory for
huge memory holes in sparsemem model. But having a much smaller section
size bits could break PMD mappings for vmemmap and wouldn't accommodate
the highest order page for certain page size granule configs.

It is determined that SECTION_SIZE_BITS of 27 (128MB) could be the
ideal default value for 4K_PAGES that gives the smallest section size
without breaking PMD based vmemmap mappings. For simplicity, 16K_PAGES
could follow the same as 4K_PAGES. And the smallest SECTION_SIZE_BITS
for 64K_PAGES is 29, which can accommodate MAX_ORDER.

Signed-off-by: Sudarshan Rajagopalan
Suggested-by: David Hildenbrand
Cc: Will Deacon
Cc: Anshuman Khandual
Cc: Mike Rapoport
Cc: Mark Rutland
Cc: Suren Baghdasaryan
---
 arch/arm64/include/asm/sparsemem.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
index 1f43fcc79738..ff08ff6b677c 100644
--- a/arch/arm64/include/asm/sparsemem.h
+++ b/arch/arm64/include/asm/sparsemem.h
@@ -7,7 +7,13 @@
 #ifdef CONFIG_SPARSEMEM
 #define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS
-#define SECTION_SIZE_BITS	30
-#endif
+
+#if defined(CONFIG_ARM64_4K_PAGES) || defined(CONFIG_ARM64_16K_PAGES)
+#define SECTION_SIZE_BITS	27
+#else
+#define SECTION_SIZE_BITS	29
+#endif /* CONFIG_ARM64_4K_PAGES || CONFIG_ARM64_16K_PAGES */
+
+#endif /* CONFIG_SPARSEMEM */

 #endif

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH 0/1] arm64: reduce section size for sparsemem
This patch is the follow-up from the discussions in the thread [1].

Reducing the section size has the merit of reducing wastage of reserved
memory for huge memory holes in sparsemem model. A smaller section size
also gives more granularity and agility for memory hot(un)plugging.

This patch uses the suggestion from David Hildenbrand in thread [1] to
set the smallest possible SECTION_SIZE_BITS for 4K, 16K and 64K page
granule. That is 27 (128MB) for 4K/16K and 29 (512MB) for 64K page
granule.

[1] https://lore.kernel.org/lkml/cover.1609895500.git.sudar...@codeaurora.org/T/#m8ee60ae69db5e9eb06ca7999c43828d49ccb9626

Sudarshan Rajagopalan (1):
  arm64: reduce section size for sparsemem

 arch/arm64/include/asm/sparsemem.h | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH 1/1] arm64: make section size configurable for memory hotplug
On 2021-01-05 22:11, Anshuman Khandual wrote:

Hello Anshuman, thanks for your response.

> (+ Will)
>
> Hi Sudarshan,
>
> This patch (and the cover letter) does not copy LAKML even though the
> entire change here is arm64 specific. Please do copy all applicable
> mailing lists for a given patch.

I used ./scripts/get_maintainer.pl patch.patch to get the maintainers
list. It somehow didn't mention LAKML. I've added the mailing list to
this thread.

> On 1/6/21 6:58 AM, Sudarshan Rajagopalan wrote:
>> Currently on arm64, memory section size is hard-coded to 1GB. Make
>> this configurable if memory hotplug is enabled, to support finer
>> granularity for hotplug-able memory.
>
> Section size has always been decided by the platform. It cannot be a
> configurable option because the user would not know the constraints
> for memory representation on the platform, and besides it also cannot
> be trusted.
>
>> Signed-off-by: Sudarshan Rajagopalan
>> ---
>>  arch/arm64/Kconfig                 | 11 +++++++++++
>>  arch/arm64/include/asm/sparsemem.h |  4 ++++
>>  2 files changed, 15 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 6d232837cbee..34124eee65da 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -294,6 +294,17 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
>>  config SMP
>>  	def_bool y
>>
>> +config HOTPLUG_SIZE_BITS
>> +	int "Memory hotplug block size (29 => 512MB, 30 => 1GB)"
>> +	depends on SPARSEMEM
>> +	depends on MEMORY_HOTPLUG
>> +	range 28 30
>
> 28 would not work for 64K pages.
>
>> +	default 30
>> +	help
>> +	  Selects granularity of hotplug memory. Block size for
>> +	  memory hotplug is represented as a power of 2.
>> +	  If unsure, stick with the default value.
>> +
>>  config KERNEL_MODE_NEON
>>  	def_bool y
>>
>> diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
>> index 1f43fcc79738..3d5310f3aad5 100644
>> --- a/arch/arm64/include/asm/sparsemem.h
>> +++ b/arch/arm64/include/asm/sparsemem.h
>> @@ -7,7 +7,11 @@
>>  #ifdef CONFIG_SPARSEMEM
>>  #define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS
>> +#ifndef CONFIG_MEMORY_HOTPLUG
>>  #define SECTION_SIZE_BITS	30
>> +#else
>> +#define SECTION_SIZE_BITS	CONFIG_HOTPLUG_SIZE_BITS
>> +#endif
>>  #endif
>>
>>  #endif
>
> There was an inconclusive discussion regarding this last month.
>
> https://lore.kernel.org/linux-arm-kernel/20201204014443.43329-1-liwei...@huawei.com/

Thanks for pointing out this thread. Looking into all the comments, the
major concern with reducing the section size seems to be the risk of
running out of bits in page->flags. And while SECTION_SIZE must be
greater than or equal to the highest order page in the buddy, it must
also satisfy the case for 4K page size where it shouldn't break PMD
mapping for vmemmap - hence SECTION_SIZE_BITS of 27 could be set for 4K
page size, which allows 2MB PMD mappings for each 128MB (2^27) block.

While this is the smallest value that can be set (27 for 4K_PAGES,
MAX_ZONEORDER - 1 + PAGE_SHIFT for 16K or 64K_PAGES), are there any
concerns with setting higher values (but <= 30 bits)? It seems like any
arbitrary number within this range could be applied without breaking
vmemmap. That's why we were thinking of letting the user configure it,
since this directly impacts memory hotplug granularity i.e. the
smallest size that can be hot (un)plugged. The current setting of 1GB
for arm64 poses a lot of challenges in utilizing memory hotplug via a
driver, esp. for low RAM targets. I agree it's sub-optimal in some
sense, but wanted to know the maintainers' opinion on this.

Also, the patch introduced in that thread does seem to help reduce
vmemmap memory if there are large holes. So there is some merit in
reducing the section size along with memory hotplug leveraging it.

> I have been wondering if this would solve the problem for the 4K page
> size config, which requires PMD mapping for the vmemmap mapping, while
> making section size bits dependent on max order. But this has not been
> tested properly.
>
> diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
> index 1f43fcc79738..fe4353cb1dce 100644
> --- a/arch/arm64/include/asm/sparsemem.h
> +++ b/arch/arm64/include/asm/sparsemem.h
> @@ -7,7 +7,18 @@
>  #ifdef CONFIG_SPARSEMEM
>  #define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS
> -#define SECTION_SIZE_BITS	30
> -#endif
> +
> +#ifdef CONFIG_ARM64_4K_PAGES
> +#define SECTION_SIZE_BITS	27
> +#else
> +#ifdef CONFIG_FORCE_MAX_ZONEORDER
> +#define SECTION_SIZE_BITS	(CONFIG_FORCE_MAX_ZONEORDER - 1 + PAGE_SHIFT)
> +#else
> +#define SECTION_SIZE_BITS	30
> +#endif /* CONFIG_FORCE_MAX_ZONEORDER */
> +
> +#endif /* CONFIG_ARM64_4K_PAGES */
> +
> +#endif /* CONFIG_SPARSEMEM */
>
>  #endif

SECTION_SIZE_BITS of 27 for 4K_PAGES should be fine for us. Would you
know if there's a possibility of the patch above being applied upstream
anytime soon? This is in regards to the Generic Kernel Image (GKI) that
we are working on with Google. If this patch would positively end up
upstream, we could appl
[PATCH 1/1] arm64: make section size configurable for memory hotplug
Currently on arm64, the memory section size is hard-coded to 1GB. Make
this configurable if memory hotplug is enabled, to support finer
granularity for hotplug-able memory.

Signed-off-by: Sudarshan Rajagopalan
---
 arch/arm64/Kconfig                 | 11 +++++++++++
 arch/arm64/include/asm/sparsemem.h |  4 ++++
 2 files changed, 15 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbee..34124eee65da 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -294,6 +294,17 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
 config SMP
 	def_bool y

+config HOTPLUG_SIZE_BITS
+	int "Memory hotplug block size (29 => 512MB, 30 => 1GB)"
+	depends on SPARSEMEM
+	depends on MEMORY_HOTPLUG
+	range 28 30
+	default 30
+	help
+	  Selects granularity of hotplug memory. Block size for
+	  memory hotplug is represented as a power of 2.
+	  If unsure, stick with the default value.
+
 config KERNEL_MODE_NEON
 	def_bool y

diff --git a/arch/arm64/include/asm/sparsemem.h b/arch/arm64/include/asm/sparsemem.h
index 1f43fcc79738..3d5310f3aad5 100644
--- a/arch/arm64/include/asm/sparsemem.h
+++ b/arch/arm64/include/asm/sparsemem.h
@@ -7,7 +7,11 @@
 #ifdef CONFIG_SPARSEMEM
 #define MAX_PHYSMEM_BITS	CONFIG_ARM64_PA_BITS
+#ifndef CONFIG_MEMORY_HOTPLUG
 #define SECTION_SIZE_BITS	30
+#else
+#define SECTION_SIZE_BITS	CONFIG_HOTPLUG_SIZE_BITS
+#endif
 #endif

 #endif

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH 0/1] arm64: make section size configurable for memory hotplug
The section size defines the granularity of memory hotplug. This is
currently hard-coded to 1GB on arm64 linux, which means the smallest
memblock that can be hotplugged out is 1GB. Some DDR configurations
(especially low RAM and dual-rank DDRs) may have section sizes that are
less than 1GB (e.g. 512MB, 256MB etc.). Having an option to reduce the
memblock size to the section size or lower gives finer granularity of
memory hotplug. For example, on a system with a DDR section size of
512MB and a kernel memblock size of 1GB, we would have to remove two
segments of DDR sections in order to hotplug out at least one memblock
from the kernel's POV.

Section sizes of DDRs vary based on specs (number of ranks, channels,
regions etc.). Making this section size configurable helps users assign
it based on the DDR being used. The default is set to 1GB, which is the
current memblock size.

Sudarshan Rajagopalan (1):
  arm64: Make section size configurable for memory hotplug

 arch/arm64/Kconfig                 | 11 +++++++++++
 arch/arm64/include/asm/sparsemem.h |  4 ++++
 2 files changed, 15 insertions(+)

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH] mm: vmscan: support complete shrinker reclaim
Ensure that shrinkers are given the option to completely drop their
caches even when their caches are smaller than the batch size. This
change helps improve memory headroom by ensuring that under significant
memory pressure shrinkers can drop all of their caches. This change
only attempts to call the shrinkers more aggressively during background
memory reclaim, in order to avoid hurting the performance of direct
memory reclaim.

Signed-off-by: Sudarshan Rajagopalan
Cc: Andrew Morton
---
 mm/vmscan.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9727dd8e2581..35973665ae64 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -424,6 +424,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
 	long scanned = 0, next_deferred;
+	long min_cache_size = batch_size;
+
+	if (current_is_kswapd())
+		min_cache_size = 0;

 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
@@ -503,7 +507,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * scanning at high prio and therefore should try to reclaim as much as
 	 * possible.
 	 */
-	while (total_scan >= batch_size ||
+	while (total_scan > min_cache_size ||
 	       total_scan >= freeable) {
 		unsigned long ret;
 		unsigned long nr_to_scan = min(batch_size, total_scan);

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
[RFC] depopulate_range_driver_managed() for removing page-table mappings for hot-added memory blocks
Hello,

When memory blocks are removed, along with removing the memmap entries,
memory resource and memory block devices, the arch specific
arch_remove_memory() is called, which takes care of tearing down the
page tables.

Suppose there's a usecase where the removed memory blocks will be added
back into the system at a later point: we can remove/offline the block
in a way that all entries such as memmaps, memory resources and block
devices are kept intact, so that they won't need to be created again
when the blocks are added back. Now this can be done by doing offline
alone. But if there's a special usecase where the page-table entries
need to be torn down when blocks are offlined, in order to avoid
speculative accesses on the offlined memory region, while also keeping
the memmap entries and block devices intact, I was thinking we could
implement something like {populate|depopulate}_range_driver_managed()
that can be called after online/offline, and which creates/tears down
page table mappings for that range.

This would avoid the need to do remove_memory() entirely just for the
sake of removing the page-table entries. We can now just offline the
block and call depopulate_range_driver_managed(). This basically
isolates arch_{add/remove}_memory() outside of the add/remove_memory()
routines, so that drivers can choose whether to just offline and remove
page-table mappings, or hotremove the memory entirely. This gives
drivers the flexibility to retain memmap entries, the memory resource
and block device creation so that these steps can be skipped when
blocks are added back - this helps us reduce the latencies of removing
and adding memory blocks.

I'm still in the process of creating the patch that implements this,
which would give a clearer view of this RFC, but I'm putting out the
thought here to ask whether it makes sense or not.

Sudarshan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v4] arm64/mm: add fallback option to allocate virtually contiguous memory
On 2020-10-16 11:56, Sudarshan Rajagopalan wrote:

Hello Will, Catalin,

Did you have a chance to review this patch? It has been reviewed by
others and I haven't seen any NAKs. This patch will be useful to have
so that memory hotremove doesn't fail when such PMD_SIZE pages aren't
available... which is usually the case on low RAM devices.

> When section mappings are enabled, we allocate vmemmap pages from
> physically continuous memory of size PMD_SIZE using
> vmemmap_alloc_block_buf(). Section mappings are good to reduce TLB
> pressure. But when the system is highly fragmented and memory blocks
> are being hot-added at runtime, it's possible that such physically
> continuous memory allocations can fail. Rather than failing the
> memory hot-add procedure, add a fallback option to allocate vmemmap
> pages from discontinuous pages using vmemmap_populate_basepages().
>
> Signed-off-by: Sudarshan Rajagopalan
> Reviewed-by: Gavin Shan
> Reviewed-by: Anshuman Khandual
> Cc: Catalin Marinas
> Cc: Will Deacon
> Cc: Anshuman Khandual
> Cc: Mark Rutland
> Cc: Logan Gunthorpe
> Cc: David Hildenbrand
> Cc: Andrew Morton
> Cc: Steven Price
> ---
>  arch/arm64/mm/mmu.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 75df62fea1b6..44486fd0e883 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1121,8 +1121,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  			void *p = NULL;
>
>  			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
> -			if (!p)
> -				return -ENOMEM;
> +			if (!p) {
> +				if (vmemmap_populate_basepages(addr, next, node, altmap))
> +					return -ENOMEM;
> +				continue;
> +			}
>
>  			pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL));
>  		} else

--
Sudarshan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: mm/memblock: export memblock_{start/end}_of_DRAM
On 2020-10-30 01:38, Mike Rapoport wrote:

> On Thu, Oct 29, 2020 at 02:29:27PM -0700, Sudarshan Rajagopalan wrote:
>> Hello all,
>>
>> We have a usecase where a module driver adds certain memory blocks
>> using add_memory_driver_managed(), so that it can perform memory
>> hotplug operations on these blocks. In general, these memory blocks
>> aren't something that gets physically added later, but are part of
>> the actual RAM that the system booted up with. Meaning - we set the
>> 'mem=' cmdline parameter to limit the memory and later add the
>> remaining ones using add_memory*() variants. The basic idea is to
>> have the driver own and manage certain memory blocks for hotplug
>> operations.
>>
>> For the driver to be able to know how much memory was limited and
>> how much is actually present, we take the delta of the 'bootmem
>> physical end address' and 'memblock_end_of_DRAM'. The 'bootmem
>> physical end address' is obtained by scanning the reg values in the
>> 'memory' DT node and determining the max {addr,size}. Since our
>> driver is getting modularized, we won't have access to
>> memblock_end_of_DRAM (i.e. the end address of all memory blocks
>> after 'mem=' is applied).
>>
>> So checking if the memblock_{start/end}_of_DRAM() symbols can be
>> exported? Also, this information can be obtained by userspace by
>> doing 'cat /proc/iomem' and grepping for 'System RAM'. So wondering
>> if userspace can have access to such info, can we allow kernel
>> module drivers to have access by exporting
>> memblock_{start/end}_of_DRAM()?
>
> These functions cannot be exported not because we want to hide this
> information from the modules but because it is unsafe to use them. On
> most architectures these functions are __init so they are discarded
> after boot anyway. Besides, the memory configuration known to
> memblock might be inaccurate in many cases, as David explained in his
> reply.

I don't see how the information contained in
memblock_{start/end}_of_DRAM() is considered hidden if it can be
obtained using 'cat /proc/iomem'. The memory resource manager adds
these blocks as "System RAM", "reserved", "Kernel data/code" etc.
Inspecting this, one could determine the start and end of the
memblocks. I agree on the part that it's __init annotated and could be
removed after boot - this is something the driver can be wary of too.

>> Or are there any other ways where a module driver can get the end
>> address of the system memory block?
>
> What do you mean by "system memory block"? There could be a lot of
> interpretations if you take into account memory hotplug, the "mem="
> option, reserved and firmware memory.

I meant the physical end address of the memblock - the equivalent of
memblock_end_of_DRAM.

> I'd suggest you describe the entire use case in more detail. Having
> the complete picture would help in finding a proper solution.

The usecase in general is to have a way to add/remove and
online/offline certain memory blocks which are part of boot. We do this
by limiting the memory using "mem=" and later adding the remaining
blocks using add_memory_driver_managed().

Sudarshan

> --
> Sincerely yours,
> Mike.

--
Sudarshan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: mm/memblock: export memblock_{start/end}_of_DRAM
On 2020-10-29 23:41, David Hildenbrand wrote: On 29.10.20 22:29, Sudarshan Rajagopalan wrote: Hello all, Hi! Hi David.. thanks for the response as always. We have a usecase where a module driver adds certain memory blocks using add_memory_driver_managed(), so that it can perform memory hotplug operations on these blocks. In general, these memory blocks aren’t something that gets physically added later, but is part of actual RAM that system booted up with. Meaning – we set the ‘mem=’ cmdline parameter to limit the memory and later add the remaining ones using add_memory*() variants. The basic idea is to have driver have ownership and manage certain memory blocks for hotplug operations. So, in summary, you're still abusing the memory hot(un)plug infrastructure from your driver - just not in a severe way as before. And I'll tell you why, so you might understand why exposing this API is not really a good idea and why your driver wouldn't - for example - be upstream material. Don't get me wrong, what you are doing might be ok in your context, but it's simply not universally applicable in our current model. Ordinary system RAM works different than many other devices (like PCI devices) whereby *something* senses the device and exposes it to the system, and some available driver binds to it and owns the memory. Memory is detected by a driver and added to the system via e.g., add_memory_driver_managed(). Memory devices are created and the memory is directly handed off to the system, to be used as system RAM as soon as memory devices are onlined. There is no driver that "binds" memory like other devices - it's rather the core (buddy) that uses/owns that memory immediately after device creation. I see.. and I agree that drivers are meant to *sense* that something changed or newly added, so that driver can check if it's the one responsible or compatible for handling this entity and binds to it. 
So I guess what it boils down to is - a driver that uses memory hotplug _cannot_ add/remove or have ownership of memblock boot memory, but only of the newly added RAM blocks later on. I was trying to mimic the detecting and adding of extra RAM by limiting the System RAM with "mem=XGB" as though the system booted with XGB of boot memory, and later add the remaining blocks (force detection and adding) using add_memory_driver_managed(). These remaining blocks are calculated as 'physical end addr of boot memory' - 'memblock_end_of_DRAM'. The "physical end addr of boot memory" i.e. the actual RAM that the bootloader informs the kernel of, can be obtained by scanning the 'memory' DT node. For the driver to be able to know how much memory was limited and how much is actually present, we take the delta of ‘bootmem physical end address’ and ‘memblock_end_of_DRAM’. The 'bootmem physical end address' is obtained by scanning the reg values in the ‘memory’ DT node and determining the max {addr,size}. Since our driver is getting modularized, we won’t have access to memblock_end_of_DRAM (i.e. the end address of all memory blocks after ‘mem=’ is applied). What you do with "mem=" is force memory detection to ignore some of its detected memory. So checking if the memblock_{start/end}_of_DRAM() symbols can be exported? Also, this information can be obtained by userspace by doing ‘cat /proc/iomem’ and grepping for ‘System RAM’. So wondering if userspace can Not correct: with "mem=", cat /proc/iomem only shows *detected* + added system RAM, not the unmodified detection. That's correct - I meant 'memblock_end_of_DRAM' along with "mem=" can be calculated using 'cat /proc/iomem' which shows "detected plus added" System RAM, and not the remaining undetected part which got stripped off due to "mem=XGB". Basically, the 'memblock_end_of_DRAM' address with 'mem=XGB' is {end addr of boot RAM - XGB}.. which would be the same as the end address of "System RAM" shown in /proc/iomem. 
The reasoning for this is - if userspace can have access to such info and calculate the memblock end address, why not let drivers have this info using memblock_end_of_DRAM()? have access to such info, can we allow kernel module drivers to have access by exporting memblock_{start/end}_of_DRAM(). Or are there any other ways where a module driver can get the end address of the system memory block? And here is our problem: You disabled *detection* of that memory by the responsible driver (here: core). Now your driver wants to know what would have been detected. Assume you have a memory hole in that region - it would not work by simply looking at start/end. Your driver is not the one doing the detection. Regarding the memory hole - the driver can inspect the 'memory' DT node that the kernel gets from ABL from the RAM partition table to see if any such holes exist or not. I agree that if such holes exist, hot adding will fail since memory needs to be added in block-size units. The same issue will arise
mm/memblock: export memblock_{start/end}_of_DRAM
Hello all, We have a use case where a module driver adds certain memory blocks using add_memory_driver_managed(), so that it can perform memory hotplug operations on these blocks. In general, these memory blocks aren’t something that gets physically added later, but are part of the actual RAM that the system booted up with. Meaning – we set the ‘mem=’ cmdline parameter to limit the memory and later add the remaining ones using add_memory*() variants. The basic idea is to have the driver own and manage certain memory blocks for hotplug operations. For the driver to be able to know how much memory was limited and how much is actually present, we take the delta of ‘bootmem physical end address’ and ‘memblock_end_of_DRAM’. The 'bootmem physical end address' is obtained by scanning the reg values in the ‘memory’ DT node and determining the max {addr,size}. Since our driver is getting modularized, we won’t have access to memblock_end_of_DRAM (i.e. the end address of all memory blocks after ‘mem=’ is applied). So checking if the memblock_{start/end}_of_DRAM() symbols can be exported? Also, this information can be obtained by userspace by doing ‘cat /proc/iomem’ and grepping for ‘System RAM’. So wondering, if userspace can have access to such info, can we allow kernel module drivers to have access by exporting memblock_{start/end}_of_DRAM()? Or are there any other ways where a module driver can get the end address of the system memory block? Sudarshan -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: arm64: dropping prevent_bootmem_remove_notifier
Hi Anshuman, David, Thanks for all the detailed explanations for the reasoning to have bootmem protected from being removed. Also, I do agree drivers being able to mark memory sections isn't the right thing to do. We went ahead with the approach of using "mem=" as you suggested to limit the bootmem and add the remaining blocks using add_memory_driver_managed() so that the driver has ownership of these blocks. We do have some follow-up questions regarding this - will initiate a discussion soon. On 2020-10-18 22:37, Anshuman Khandual wrote: Hello Sudarshan, On 10/17/2020 04:41 AM, Sudarshan Rajagopalan wrote: Hello Anshuman, In the patch that enables memory hot-remove (commit bbd6ec605c0f ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier put in place that prevents boot memory from being offlined and removed. Also, the commit text mentions that boot memory on arm64 cannot be removed. We wanted to understand more about the reasoning for this. x86 and other archs don’t seem to do this prevention. There’s also a comment in the code that this notifier could be dropped in the future if and when boot memory can be removed. Right, and till then the notifier cannot be dropped. There were a lot of discussions around this topic during multiple iterations of the memory hot remove series. Hence, I would just request you to please go through them first. This list here is from one such series (https://lwn.net/Articles/809179/) but might not be exhaustive. - On the arm64 platform, it is essential to ensure that the boot time discovered memory couldn't be hot-removed so that, 1. FW data structures used across kexec are idempotent e.g. the EFI memory map. 2. linear map or vmemmap would not have to be dynamically split, and can map boot memory at a large granularity 3. Avoid penalizing paths that have to walk page tables, where we can be certain that the memory is not hot-removable - The primary reason being kexec which would need substantial rework otherwise. 
The current logic is that only “new” memory blocks which are hot-added can later be offlined and removed. The memory that the system booted up with cannot be offlined and removed. But there could be many use cases such as inter-VM memory sharing where a primary VM could offline and hot-remove a block/section of memory and lend it to a secondary VM where it could hot-add it. And after the use case is done, the reverse happens where the secondary VM hot-removes and gives it back to the primary which can hot-add it back. In such cases, the present logic for arm64 doesn’t allow this hot-remove in the primary to happen. That is not true. Each VM could just boot with a minimum boot memory which cannot be offlined or removed, but then a possibly larger portion of memory can be hot added during the boot process itself, making it available for any future inter-VM sharing purpose. Hence this problem could easily be solved in user space itself. Also, on systems with a movable zone that sort of guarantees pages to be migrated and isolated so that blocks can be offlined, this logic also defeats the purpose of having a movable zone, which the system can rely on for memory hot-plugging, and which e.g. virtio-mem also relies on for fully plugged memory blocks. ZONE_MOVABLE does not really guarantee migration, isolation and removal. There are reasons an offline request might just fail. I agree that those reasons are normally not platform related, but core memory gives the platform an opportunity to decline an offlining request via a notifier. Hence a ZONE_MOVABLE offline can be denied. Semantics wise we are still okay. This might look a bit inconsistent in that with movablecore/kernelcore/movable_node and firmware sending in 'hot pluggable' memory (IIRC arm64 does not really support this yet), the system might end up with ZONE_MOVABLE marked boot memory which cannot be offlined or removed. But an offline notifier action is orthogonal. 
Hence we did not block those kernel command line paths that create ZONE_MOVABLE during boot, to preserve existing behavior. I understand that some regions of boot RAM shouldn’t be allowed to be removed, but such regions won’t be allowed to be offlined in the first place since pages cannot be migrated and isolated, for example reserved pages. So we’re trying to understand the reasoning for such a prevention put in place for the arm64 arch alone. The primary reason being kexec. During kexec on arm64, the next kernel's memory map is derived from firmware and not from the current running kernel. So the next kernel will crash if it accesses memory that might have been removed in the running kernel. Until kexec on arm64 changes substantially and takes into account the real available memory on the current kernel, boot memory cannot be removed. One possible way to solve this is by marking the required sections as “non-early” by removing the SECTION_IS_EARLY bit in its section_mem_map
[PATCH 2/2] arm64: allow hotpluggable sections to be offlined
On receiving the MEM_GOING_OFFLINE notification, we disallow offlining of any boot memory by checking whether the section is an early section or not. With the introduction of SECTION_MARK_HOTPLUGGABLE, allow boot mem sections that are marked as hotpluggable with this bit set to be offlined and removed. This now allows certain boot mem sections to be offlined. Signed-off-by: Sudarshan Rajagopalan Cc: Catalin Marinas Cc: Will Deacon Cc: Anshuman Khandual Cc: Mark Rutland Cc: Gavin Shan Cc: Logan Gunthorpe Cc: David Hildenbrand Cc: Andrew Morton Cc: Steven Price Cc: Suren Baghdasaryan --- arch/arm64/mm/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..fb8878698672 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1487,7 +1487,7 @@ static int prevent_bootmem_remove_notifier(struct notifier_block *nb, for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { ms = __pfn_to_section(pfn); - if (early_section(ms)) + if (early_section(ms) && !removable_section(ms)) return NOTIFY_BAD; } return NOTIFY_OK; -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH 1/2] mm/memory_hotplug: allow marking of memory sections as hotpluggable
Certain architectures such as arm64 don't allow boot memory to be offlined and removed. Distinguish certain memory sections as "hotpluggable", which module drivers can mark, stating to the memory hotplug layer that these sections can be offlined and then removed. This is done by using a separate section mem map bit and setting it, rather than clearing the existing SECTION_IS_EARLY bit. This patch introduces the SECTION_MARK_HOTPLUGGABLE bit into the section mem map. Only sections which are in the movable zone and have no unmovable pages are allowed to have this new bit set. Signed-off-by: Sudarshan Rajagopalan Cc: Catalin Marinas Cc: Will Deacon Cc: Mike Rapoport Cc: Anshuman Khandual Cc: David Hildenbrand Cc: Mark Rutland Cc: Steven Price Cc: Logan Gunthorpe Cc: Suren Baghdasaryan --- include/linux/memory_hotplug.h | 1 + include/linux/mmzone.h | 9 - mm/memory_hotplug.c| 20 mm/sparse.c| 31 +++ 4 files changed, 60 insertions(+), 1 deletion(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 375515803cd8..81df45b582c8 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -319,6 +319,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages); extern int remove_memory(int nid, u64 start, u64 size); extern void __remove_memory(int nid, u64 start, u64 size); extern int offline_and_remove_memory(int nid, u64 start, u64 size); +extern int mark_memory_hotpluggable(unsigned long start, unsigned long end); #else static inline void try_offline_node(int nid) {} diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 8379432f4f2f..3df3a4975236 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1247,7 +1247,8 @@ extern size_t mem_section_usage_size(void); #define SECTION_HAS_MEM_MAP(1UL<<1) #define SECTION_IS_ONLINE (1UL<<2) #define SECTION_IS_EARLY (1UL<<3) -#define SECTION_MAP_LAST_BIT (1UL<<4) +#define SECTION_MARK_HOTPLUGGABLE (1UL<<4) +#define 
SECTION_MAP_LAST_BIT (1UL<<5) #define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) #define SECTION_NID_SHIFT 3 @@ -1278,6 +1279,11 @@ static inline int early_section(struct mem_section *section) return (section && (section->section_mem_map & SECTION_IS_EARLY)); } +static inline int removable_section(struct mem_section *section) +{ + return (section && (section->section_mem_map & SECTION_MARK_HOTPLUGGABLE)); +} + static inline int valid_section_nr(unsigned long nr) { return valid_section(__nr_to_section(nr)); @@ -1297,6 +1303,7 @@ static inline int online_section_nr(unsigned long nr) void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn); #ifdef CONFIG_MEMORY_HOTREMOVE void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn); +int section_mark_hotpluggable(struct mem_section *ms); #endif #endif diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index e9d5ab5d3ca0..503b0de489a0 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1860,4 +1860,24 @@ int offline_and_remove_memory(int nid, u64 start, u64 size) return rc; } EXPORT_SYMBOL_GPL(offline_and_remove_memory); + +int mark_memory_hotpluggable(unsigned long start_pfn, unsigned long end_pfn) +{ + struct mem_section *ms; + unsigned long nr; + int rc = -EINVAL; + + if (end_pfn < start_pfn) + return rc; + + for (nr = start_pfn; nr <= end_pfn; nr++) { + ms = __pfn_to_section(nr); + rc = section_mark_hotpluggable(ms); + if (!rc) + break; + } + + return rc; +} +EXPORT_SYMBOL_GPL(mark_memory_hotpluggable); #endif /* CONFIG_MEMORY_HOTREMOVE */ diff --git a/mm/sparse.c b/mm/sparse.c index fcc3d176f1ea..cc21c23e2f1d 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "internal.h" #include @@ -644,6 +645,36 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn) ms->section_mem_map &= ~SECTION_IS_ONLINE; } } + +int section_mark_hotpluggable(struct mem_section *ms) +{ + unsigned long section_nr, pfn; + 
bool unmovable; + struct page *page; + + /* section needs to be both valid and present to be marked */ + if (WARN_ON(!valid_section(ms)) || !present_section(ms)) + return -EINVAL; + + /* +* now check if this section is removable. This can be done by checking +* if section has unmovable pages or not. +*/ + section_nr = __section_nr(ms); + pfn = section_nr_to_pfn(section_nr); + page = pfn_to_page(pfn); + unmovable = has_unmovable_p
[PATCH 0/2] mm/memory_hotplug, arm64: allow certain bootmem sections to be offlinable
In the patch that enables memory hot-remove (commit bbd6ec605c0f ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier put in place that prevents boot memory from being offlined and removed. The commit text mentions that boot memory on arm64 cannot be removed. But x86 and other archs don’t seem to do this prevention. The current logic is that only “new” memory blocks which are hot-added can later be offlined and removed. The memory that the system booted up with cannot be offlined and removed. But there could be many use cases such as inter-VM memory sharing where a primary VM could offline and hot-remove a block/section of memory and lend it to a secondary VM where it could hot-add it. And after the use case is done, the reverse happens where the secondary VM hot-removes and gives it back to the primary which can hot-add it back. In such cases, the present logic for arm64 doesn’t allow this hot-remove in the primary to happen. Also, on systems with a movable zone that sort of guarantees pages to be migrated and isolated so that blocks can be offlined, this logic also defeats the purpose of having a movable zone, which the system can rely on for memory hot-plugging, and which e.g. virtio-mem also relies on for fully plugged memory blocks. This patch series tries to solve this by introducing a new section mem map bit 'SECTION_MARK_HOTPLUGGABLE' which allows the concerned module drivers to mark required sections as "hotpluggable" by setting this bit. Also, this marking is only allowed for sections which are in the movable zone and have no unmovable pages. The arm64 mmu code, on receiving the MEM_GOING_OFFLINE notification, disallows offlining of any boot memory by checking whether the section is an early section or not. With the introduction of SECTION_MARK_HOTPLUGGABLE, we allow boot mem sections that are marked as hotpluggable with this bit set to be offlined and removed. Thereby allowing required bootmem sections to be offlinable. 
Sudarshan Rajagopalan (2): mm/memory_hotplug: allow marking of memory sections as hotpluggable arm64: allow hotpluggable sections to be offlined arch/arm64/mm/mmu.c| 2 +- include/linux/memory_hotplug.h | 1 + include/linux/mmzone.h | 9 - mm/memory_hotplug.c| 20 mm/sparse.c| 31 +++ 5 files changed, 61 insertions(+), 2 deletions(-) -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
arm64: dropping prevent_bootmem_remove_notifier
Hello Anshuman, In the patch that enables memory hot-remove (commit bbd6ec605c0f ("arm64/mm: Enable memory hot remove")) for arm64, there’s a notifier put in place that prevents boot memory from being offlined and removed. Also, the commit text mentions that boot memory on arm64 cannot be removed. We wanted to understand more about the reasoning for this. x86 and other archs don’t seem to do this prevention. There’s also a comment in the code that this notifier could be dropped in the future if and when boot memory can be removed. The current logic is that only “new” memory blocks which are hot-added can later be offlined and removed. The memory that the system booted up with cannot be offlined and removed. But there could be many use cases such as inter-VM memory sharing where a primary VM could offline and hot-remove a block/section of memory and lend it to a secondary VM where it could hot-add it. And after the use case is done, the reverse happens where the secondary VM hot-removes and gives it back to the primary which can hot-add it back. In such cases, the present logic for arm64 doesn’t allow this hot-remove in the primary to happen. Also, on systems with a movable zone that sort of guarantees pages to be migrated and isolated so that blocks can be offlined, this logic also defeats the purpose of having a movable zone, which the system can rely on for memory hot-plugging, and which e.g. virtio-mem also relies on for fully plugged memory blocks. I understand that some regions of boot RAM shouldn’t be allowed to be removed, but such regions won’t be allowed to be offlined in the first place since pages cannot be migrated and isolated, for example reserved pages. So we’re trying to understand the reasoning for such a prevention put in place for the arm64 arch alone. One possible way to solve this is by marking the required sections as “non-early” by removing the SECTION_IS_EARLY bit in its section_mem_map. 
This puts these sections in the context of “memory hotpluggable”, which can be offlined-removed and added-onlined, and which are part of boot RAM itself and don’t need any extra blocks to be hot added. This way of marking certain sections as “non-early” could be exported so that module drivers can set the required number of sections as “memory hotpluggable”. This could have certain checks put in place to see which sections are allowed, for example only movable zone sections can be marked as “non-early”. Your thoughts on this? We are also looking for different ways to solve the problem without having to completely drop this notifier, but just putting out the concern here about the notifier logic that is breaking our use case, which is a generic memory sharing use case using the memory hotplug feature. Sudarshan -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory
On 2020-10-15 01:36, Will Deacon wrote: On Wed, Oct 14, 2020 at 05:51:23PM -0700, Sudarshan Rajagopalan wrote: When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good for reducing TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages(). Signed-off-by: Sudarshan Rajagopalan Reviewed-by: Gavin Shan Reviewed-by: Anshuman Khandual Cc: Catalin Marinas Cc: Will Deacon Cc: Anshuman Khandual Cc: Mark Rutland Cc: Logan Gunthorpe Cc: David Hildenbrand Cc: Andrew Morton Cc: Steven Price --- arch/arm64/mm/mmu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) Please can you fix the subject? I have three copies of "PATCH v3" from different days in my inbox. I know it sounds trivial, but getting these little things right really helps with review, especially when it's sitting amongst a sea of other patches. Yes sure, sorry about that - will change it to "PATCH v4" to make it stand out from other patches. Thanks, Will Sudarshan -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory
On 2020-10-13 04:38, Anshuman Khandual wrote: On 10/13/2020 04:35 AM, Sudarshan Rajagopalan wrote: When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good for reducing TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages(). There is a checkpatch warning here, which could be fixed while merging? WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #7: When section mappings are enabled, we allocate vmemmap pages from physically total: 0 errors, 1 warnings, 13 lines checked Thanks Anshuman for the review. I sent out an updated patch fixing the checkpatch warning. Signed-off-by: Sudarshan Rajagopalan Reviewed-by: Gavin Shan Cc: Catalin Marinas Cc: Will Deacon Cc: Anshuman Khandual Cc: Mark Rutland Cc: Logan Gunthorpe Cc: David Hildenbrand Cc: Andrew Morton Cc: Steven Price Nonetheless, this looks fine. Did not see any particular problem while creating an experimental vmemmap with interleaved section and base page mappings. 
Reviewed-by: Anshuman Khandual --- arch/arm64/mm/mmu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..44486fd0e883 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1121,8 +1121,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void *p = NULL; p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); - if (!p) - return -ENOMEM; + if (!p) { + if (vmemmap_populate_basepages(addr, next, node, altmap)) + return -ENOMEM; + continue; + } pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL)); } else Sudarshan -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory
When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good for reducing TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages(). Signed-off-by: Sudarshan Rajagopalan Reviewed-by: Gavin Shan Reviewed-by: Anshuman Khandual Cc: Catalin Marinas Cc: Will Deacon Cc: Anshuman Khandual Cc: Mark Rutland Cc: Logan Gunthorpe Cc: David Hildenbrand Cc: Andrew Morton Cc: Steven Price --- arch/arm64/mm/mmu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..44486fd0e883 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1121,8 +1121,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void *p = NULL; p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); - if (!p) - return -ENOMEM; + if (!p) { + if (vmemmap_populate_basepages(addr, next, node, altmap)) + return -ENOMEM; + continue; + } pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL)); } else -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory
V1: The initial patch used the approach of aborting at the first instance of PMD_SIZE allocation failure, unmapping all previously mapped sections using vmemmap_free and mapping the entire request with vmemmap_populate_basepages to allocate virtually contiguous memory. https://lkml.org/lkml/2020/9/10/66 V2: Allocates virtually contiguous memory only for sections that failed PMD_SIZE allocation, and continues to allocate physically contiguous memory for other sections. https://lkml.org/lkml/2020/9/30/1489 V3: Addressed trivial review comments. Pass in altmap to vmemmap_populate_basepages. Sudarshan Rajagopalan (1): arm64/mm: add fallback option to allocate virtually contiguous memory arch/arm64/mm/mmu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory
When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good for reducing TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages(). Signed-off-by: Sudarshan Rajagopalan Reviewed-by: Gavin Shan Cc: Catalin Marinas Cc: Will Deacon Cc: Anshuman Khandual Cc: Mark Rutland Cc: Logan Gunthorpe Cc: David Hildenbrand Cc: Andrew Morton Cc: Steven Price --- arch/arm64/mm/mmu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..44486fd0e883 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1121,8 +1121,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void *p = NULL; p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); - if (!p) - return -ENOMEM; + if (!p) { + if (vmemmap_populate_basepages(addr, next, node, altmap)) + return -ENOMEM; + continue; + } pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL)); } else -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory
V1: The initial patch used the approach of aborting at the first instance of PMD_SIZE allocation failure, unmapping all previously mapped sections using vmemmap_free and mapping the entire request with vmemmap_populate_basepages to allocate virtually contiguous memory. https://lkml.org/lkml/2020/9/10/66 V2: Allocates virtually contiguous memory only for sections that failed PMD_SIZE allocation, and continues to allocate physically contiguous memory for other sections. https://lkml.org/lkml/2020/9/30/1489 V3: Addressed trivial review comments. Pass in altmap to vmemmap_populate_basepages. Sudarshan Rajagopalan (1): arm64/mm: add fallback option to allocate virtually contiguous memory arch/arm64/mm/mmu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] arm64/mm: add fallback option to allocate virtually contiguous memory
When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good for reducing TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages(). Signed-off-by: Sudarshan Rajagopalan Cc: Catalin Marinas Cc: Will Deacon Cc: Anshuman Khandual Cc: Mark Rutland Cc: Logan Gunthorpe Cc: David Hildenbrand Cc: Andrew Morton Cc: Steven Price --- arch/arm64/mm/mmu.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62f..11f8639 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1121,8 +1121,15 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void *p = NULL; p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); - if (!p) - return -ENOMEM; + if (!p) { + /* +* fallback allocating with virtually +* contiguous memory for this section +*/ + if (vmemmap_populate_basepages(addr, next, node, NULL)) + return -ENOMEM; + continue; + } pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL)); } else -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] arm64/mm: add fallback option to allocate virtually
V1: The initial patch used the approach to abort at the first instance of PMD_SIZE allocation failure, unmaps all previously mapped sections using vmemmap_free and maps the entire request with vmemmap_populate_basepages to allocate virtually contiguous memory. https://lkml.org/lkml/2020/9/10/66 V2: Allocates virtually contiguous memory only for sections that failed PMD_SIZE allocation, and continues to allocate physically contiguous memory for other sections. https://lkml.org/lkml/2020/9/30/1489 V3: Addresses Anshuman's comment to allow fallback to altmap base pages as well if and when required. Sudarshan Rajagopalan (1): arm64/mm: add fallback option to allocate virtually contiguous memory arch/arm64/mm/mmu.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2] arm64/mm: add fallback option to allocate virtually contiguous memory
On 2020-09-30 17:30, Anshuman Khandual wrote: On 10/01/2020 04:43 AM, Sudarshan Rajagopalan wrote: When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good for reducing TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages(). Signed-off-by: Sudarshan Rajagopalan Cc: Catalin Marinas Cc: Will Deacon Cc: Anshuman Khandual Cc: Mark Rutland Cc: Logan Gunthorpe Cc: David Hildenbrand Cc: Andrew Morton Cc: Steven Price --- arch/arm64/mm/mmu.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62f..9edbbb8 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1121,8 +1121,18 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void *p = NULL; p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); - if (!p) - return -ENOMEM; + if (!p) { + if (altmap) + return -ENOMEM; /* no fallback */ Why? If huge pages inside a vmemmap section might have been allocated from altmap, the base pages could also fall back on altmap. If this patch has just followed the existing x86 semantics, those were written [1] long back, before vmemmap_populate_basepages() supported altmap allocation. While adding that support [2] recently, it was deliberate not to change the x86 semantics as it was a platform decision. Nonetheless, it makes sense to fall back on altmap base pages if and when required. [1] 4b94ffdc4163 (x86, mm: introduce vmem_altmap to augment vmemmap_populate()) [2] 1d9cfee7535c (mm/sparsemem: enable vmem_altmap support in vmemmap_populate_basepages()) Yes agreed. We can allow fallback on altmap as well. 
I did indeed follow the x86 semantics. Will send the updated patch.

Sudarshan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
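[Editor's note: not part of the thread.] The altmap question above comes down to fragmentation: a PMD_SIZE request (512 base pages with 4K pages) can fail on a fragmented pool while single base-page requests from the same pool still succeed, which is why falling back on altmap base pages is legitimate. A small userspace sketch of that argument, with entirely hypothetical names (struct altmap_pool, populate_section, etc.):

```c
/*
 * Toy userspace model (not kernel code) of the altmap fallback argument.
 * The pool tracks both total free pages and the largest contiguous run,
 * so a 512-page request can fail while 1-page requests still succeed.
 */
#include <assert.h>
#include <stddef.h>

struct altmap_pool {
	size_t free_pages;	/* pages left in the pool */
	size_t max_contig;	/* largest physically contiguous run */
};

enum { MAPPED_PMD = 1, MAPPED_BASE = 2, NO_MEM = -1 };

/* Try to carve `want` contiguous pages out of the pool. */
static int pool_alloc(struct altmap_pool *p, size_t want)
{
	if (want > p->free_pages || want > p->max_contig)
		return 0;
	p->free_pages -= want;
	if (p->max_contig > p->free_pages)
		p->max_contig = p->free_pages;
	return 1;
}

/* One section: try a PMD-sized block first, then 512 base pages. */
int populate_section(struct altmap_pool *p)
{
	if (pool_alloc(p, 512))
		return MAPPED_PMD;	/* pmd_set_huge() path */
	for (int i = 0; i < 512; i++)
		if (!pool_alloc(p, 1))
			return NO_MEM;	/* genuinely out of memory */
	return MAPPED_BASE;		/* basepages fallback path */
}
```

With 1024 free pages but a largest run of only 16, the PMD path fails yet the base-page path succeeds; refusing to fall back just because an altmap is present would return -ENOMEM needlessly.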
[PATCH v2] arm64/mm: add fallback option to allocate virtually contiguous memory
When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good to reduce TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages().

Signed-off-by: Sudarshan Rajagopalan
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Anshuman Khandual
Cc: Mark Rutland
Cc: Logan Gunthorpe
Cc: David Hildenbrand
Cc: Andrew Morton
Cc: Steven Price
---
 arch/arm64/mm/mmu.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62f..9edbbb8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1121,8 +1121,18 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			void *p = NULL;
 
 			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
-			if (!p)
-				return -ENOMEM;
+			if (!p) {
+				if (altmap)
+					return -ENOMEM; /* no fallback */
+
+				/*
+				 * fallback allocating with virtually
+				 * contiguous memory for this section
+				 */
+				if (vmemmap_populate_basepages(addr, next, node, NULL))
+					return -ENOMEM;
+				continue;
+			}
 
 			pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL));
 		} else
[PATCH v2] arm64/mm: add fallback option to allocate virtually contiguous memory
V1: The initial patch aborts at the first instance of PMD_SIZE allocation failure, unmaps all previously mapped sections using vmemmap_free(), and maps the entire request with vmemmap_populate_basepages() to allocate virtually contiguous memory.
https://lkml.org/lkml/2020/9/10/66

V2: Allocates virtually contiguous memory only for the sections that failed the PMD_SIZE allocation, and continues to allocate physically contiguous memory for the other sections.

Sudarshan Rajagopalan (1):
  arm64/mm: add fallback option to allocate virtually contiguous memory

 arch/arm64/mm/mmu.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
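[Editor's note: not part of the thread.] The behavioural difference between V1 and V2 described above can be sketched as a small userspace model. All names are hypothetical; huge_ok[i] stands in for whether the PMD_SIZE allocation for section i happens to succeed at runtime:

```c
/*
 * Toy model (not kernel code) of the two fallback strategies.
 * huge_ok[i] models whether the PMD_SIZE allocation for section i
 * succeeds; out[i] records how that section ends up mapped.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum vmap { MAP_HUGE, MAP_BASE };

/* V1: abort at the first failure, free everything mapped so far,
 * then map the whole request with base pages. */
void populate_v1(const bool *huge_ok, enum vmap *out, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!huge_ok[i]) {
			for (size_t j = 0; j < n; j++)
				out[j] = MAP_BASE;	/* whole range redone */
			return;
		}
		out[i] = MAP_HUGE;
	}
}

/* V2: fall back to base pages only for the failing section;
 * the other sections keep their PMD mappings. */
void populate_v2(const bool *huge_ok, enum vmap *out, size_t n)
{
	for (size_t i = 0; i < n; i++)
		out[i] = huge_ok[i] ? MAP_HUGE : MAP_BASE;
}
```

For a request spanning three sections where only the middle allocation fails, V1 ends up with all three mapped by base pages, while V2 keeps the PMD mappings (and their TLB benefit) for the two sections that succeeded.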
[PATCH] arm64/mm: add fallback option to allocate virtually contiguous memory
When section mappings are enabled, we allocate vmemmap pages from physically contiguous memory of size PMD_SIZE using vmemmap_alloc_block_buf(). Section mappings are good to reduce TLB pressure. But when the system is highly fragmented and memory blocks are being hot-added at runtime, it's possible that such physically contiguous memory allocations can fail. Rather than failing the memory hot-add procedure, add a fallback option to allocate vmemmap pages from discontiguous pages using vmemmap_populate_basepages().

Signed-off-by: Sudarshan Rajagopalan
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Anshuman Khandual
Cc: Mark Rutland
Cc: Logan Gunthorpe
Cc: David Hildenbrand
Cc: Andrew Morton
Cc: Steven Price
---
 arch/arm64/mm/mmu.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62f..a46c7d4 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1100,6 +1100,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 	p4d_t *p4dp;
 	pud_t *pudp;
 	pmd_t *pmdp;
+	int ret = 0;
 
 	do {
 		next = pmd_addr_end(addr, end);
@@ -1121,15 +1122,23 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 			void *p = NULL;
 
 			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
-			if (!p)
-				return -ENOMEM;
+			if (!p) {
+#ifdef CONFIG_MEMORY_HOTPLUG
+				vmemmap_free(start, end, altmap);
+#endif
+				ret = -ENOMEM;
+				break;
+			}
 
 			pmd_set_huge(pmdp, __pa(p), __pgprot(PROT_SECT_NORMAL));
 		} else
 			vmemmap_verify((pte_t *)pmdp, node, addr, next);
 	} while (addr = next, addr != end);
 
-	return 0;
+	if (ret)
+		return vmemmap_populate_basepages(start, end, node, altmap);
+	else
+		return ret;
 }
 #endif	/* !ARM64_SWAPPER_USES_SECTION_MAPS */
 void vmemmap_free(unsigned long start, unsigned long end,