Re: [PATCH v2] mm: z3fold: deprecate CONFIG_Z3FOLD
On Wed, Sep 04, 2024 at 11:33:43PM +, Yosry Ahmed wrote:
> The z3fold compressed pages allocator is rarely used, most users use
> zsmalloc. The only disadvantage of zsmalloc in comparison is the
> dependency on MMU, and zbud is a more common option for !MMU as it was
> the default zswap allocator for a long time.
>
> Historically, zsmalloc had worse latency than zbud and z3fold but
> offered better memory savings. This is no longer the case as shown by
> a simple recent analysis [1]. That analysis showed that z3fold does not
> have any advantage over zsmalloc or zbud considering both performance
> and memory usage. In a kernel build test on tmpfs in a limited cgroup,
> z3fold took 3% more time and used 1.8% more memory. The latency of
> zswap_load() was 7% higher, and that of zswap_store() was 10% higher.
> Zsmalloc is better in all metrics.
>
> Moreover, z3fold apparently has latent bugs, which was made noticeable
> by a recent soft lockup bug report with z3fold [2]. Switching to
> zsmalloc not only fixed the problem, but also reduced the swap usage
> from 6~8G to 1~2G. Other users have also reported being bitten by
> mistakenly enabling z3fold.
>
> Other than hurting users, z3fold is repeatedly causing wasted
> engineering effort. Apart from investigating the above bug, it came up
> in multiple development discussions (e.g. [3]) as something we need to
> handle, when there aren't any legit users (at least not intentionally).
>
> The natural course of action is to deprecate z3fold, and remove in a few
> cycles if no objections are raised from active users. Next on the list
> should be zbud, as it offers marginal latency gains at the cost of huge
> memory waste when compared to zsmalloc. That one will need to wait until
> zsmalloc does not depend on MMU.
>
> Rename the user-visible config option from CONFIG_Z3FOLD to
> CONFIG_Z3FOLD_DEPRECATED so that users with CONFIG_Z3FOLD=y get a new
> prompt with explanation during make oldconfig. Also, remove
> CONFIG_Z3FOLD=y from defconfigs.
>
> [1] https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2ww13scj4s3i4bnndqf+3+...@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/ef0abd3e-a239-4111-a8ab-5c442e759...@gmail.com/
> [3] https://lore.kernel.org/lkml/CAJD7tkbnmeVugfunffSovJf9FAgy9rhBVt_tx=nxuvelufq...@mail.gmail.com/
>
> Acked-by: Chris Down
> Acked-by: Nhat Pham
> Signed-off-by: Yosry Ahmed

zsmalloc's CONFIG_MMU requirement was a concern in the past, but z3fold
appears so bitrotted at this point that it's simply not a usable option
anymore.

> I think it should actually be fine to remove z3fold without deprecating
> it first, but I am doing the due diligence.

Yeah, you never know for sure if users exist. Deprecating it for a few
cycles is the safer option.

Acked-by: Johannes Weiner
Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types
On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page. It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
>
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
>
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock. It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
>
> Before: 650.980 ms (+-1.94%)
> After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
>
> Signed-off-by: Peter Xu

Acked-by: Johannes Weiner
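For the arch side, the resulting change in the per-architecture fault
handlers is small. A minimal sketch of that flow (the function name
sketch_handle_page_fault is made up for illustration; details vary per
architecture):

#include <linux/mm.h>

/*
 * Sketch of the common arch fault-handler pattern with the new code:
 * when handle_mm_fault() reports VM_FAULT_COMPLETED, the fault was
 * fully handled and the mmap lock has already been dropped, so there
 * is nothing left to do - no retry, no mmap_read_unlock().
 */
static void sketch_handle_page_fault(struct vm_area_struct *vma,
                                     unsigned long address,
                                     unsigned int flags,
                                     struct pt_regs *regs)
{
        vm_fault_t fault;

        fault = handle_mm_fault(vma, address, flags, regs);

        /* Fault completed, mmap lock already released by the core code */
        if (fault & VM_FAULT_COMPLETED)
                return;

        if (fault & VM_FAULT_RETRY) {
                /* Lock was dropped for us; retry with FAULT_FLAG_TRIED set */
                return;
        }

        mmap_read_unlock(vma->vm_mm);
}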
Re: mm: Question about the use of 'accessed' flags and pte_young() helper
On Tue, Oct 20, 2020 at 05:52:07PM +0200, Vlastimil Babka wrote:
> On 10/8/20 11:49 AM, Christophe Leroy wrote:
> > In a 10 years old commit
> > (https://github.com/linuxppc/linux/commit/d069cb4373fe0d451357c4d3769623a7564dfa9f),
> > powerpc 8xx has made the handling of the PTE accessed bit conditional on
> > CONFIG_SWAP. Since then, this has been extended to some other powerpc
> > variants.
> >
> > That commit means that when CONFIG_SWAP is not selected, the accessed bit
> > is not set by the SW TLB miss handlers, leading to pte_young() returning
> > garbage, or should I say possibly returning false although a page has
> > been accessed since its access flag was reset.
> >
> > Looking at various mm/ places, pte_young() is used independent of
> > CONFIG_SWAP.
> >
> > Is it still valid to not manage accessed flags when CONFIG_SWAP is not
> > selected?
>
> AFAIK it's wrong, reclaim needs it to detect accessed pages on inactive
> list, via page_referenced(), including file pages (page cache) where
> CONFIG_SWAP plays no role. Maybe it was different 10 years ago.

Yes, we require this bit for properly aging mmapped file pages. The
underlying assumption in the referenced commit is incorrect.

> > If yes, should pte_young() always return true in that case?
>
> It should best work as intended. If not possible, true is maybe better,
> as false will lead to inactive file list thrashing.

An unconditional true will cause mmapped file pages to be permanently
mlocked / unevictable. Either way will break some workloads. The only
good answer is the truth :-)
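To illustrate why reclaim cares, here is a simplified sketch (the
function name sketch_pte_was_referenced is made up; the real check
happens inside page_referenced() during the rmap walk): the only aging
signal the LRU has for mapped pages is the PTE accessed bit, read and
cleared through ptep_test_and_clear_young().

#include <linux/mm.h>
#include <linux/pgtable.h>

/*
 * Simplified sketch of the reclaim-side check: if the TLB miss handler
 * never sets the accessed bit, this always reports "not referenced" and
 * mapped file pages look idle to the LRU - regardless of CONFIG_SWAP.
 */
static bool sketch_pte_was_referenced(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t *ptep)
{
        /* Reads and clears the accessed bit so the next access shows up */
        return ptep_test_and_clear_young(vma, addr, ptep);
}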
Re: linux-next: runtime warning in Linus' tree
On Thu, Aug 13, 2020 at 04:46:54PM +1000, Stephen Rothwell wrote:
> [0.055220][T0] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5220
> mem_cgroup_css_alloc+0x350/0x904
> [The line numbers in the final linux next are 5226 and 5141 due to
> later patches.]
>
> Introduced (or exposed) by commit
>
>   3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
>
> This commit actually adds the WARN_ON, so it either adds the bug that
> sets it off, or the bug already existed.
>
> Unfortunately, the version of this patch in linux-next up until today
> is different. :-(

Sorry, I made a last-minute request to include these checks in that
patch to make the code a bit more robust, but they trigger a false
positive here. Let's remove them.

---
From de8ea7c96c056c3cbe7b93995029986a158fb9cd Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Thu, 13 Aug 2020 10:40:54 -0400
Subject: [PATCH] mm: memcontrol: fix warning when allocating the root cgroup

Commit 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the
parent cgroup") adds memory tracking to the memcg kernel structures
themselves to make cgroups liable for the memory they are consuming
through the allocation of child groups (which can be significant).

This code is a bit awkward as it's spread out through several
functions: The outermost function does memalloc_use_memcg(parent) to
set up current->active_memcg, which designates which cgroup to charge,
and the inner functions pass GFP_ACCOUNT to request charging for
specific allocations.

To make sure this dependency is satisfied at all times - to make sure
we don't randomly charge whoever is calling the functions - the inner
functions warn on !current->active_memcg. However, this triggers a
false warning when the root memcg itself is allocated. No parent
exists in this case, and so current->active_memcg is rightfully NULL.
It's a false positive, not indicative of a bug.

Delete the warnings for now, we can revisit this later.

Fixes: 3e38e0aaca9e ("mm: memcg: charge memcg percpu memory to the parent cgroup")
Signed-off-by: Johannes Weiner
---
 mm/memcontrol.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d59fd9af6e63..9d87082e64aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5137,9 +5137,6 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
-	/* We charge the parent cgroup, never the current task */
-	WARN_ON_ONCE(!current->active_memcg);
-
 	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
 						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
@@ -5222,9 +5219,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
-	/* We charge the parent cgroup, never the current task */
-	WARN_ON_ONCE(!current->active_memcg);
-
 	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
 						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
-- 
2.28.0
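The charging scheme described above looks roughly like this (a sketch of
the pattern only, using the 5.9-era memalloc_use_memcg() API; the
function name sketch_alloc_child_state is made up):

#include <linux/sched/mm.h>
#include <linux/slab.h>

/*
 * Sketch: the outer function designates the cgroup to charge via
 * current->active_memcg, and the inner allocations opt in with
 * GFP_KERNEL_ACCOUNT. For the root cgroup there is no parent, so
 * active_memcg is legitimately NULL - which is what tripped the
 * WARN_ON_ONCE removed by the patch above.
 */
static void *sketch_alloc_child_state(struct mem_cgroup *parent, size_t size)
{
        void *buf;

        memalloc_use_memcg(parent);              /* may be NULL for the root */
        buf = kzalloc(size, GFP_KERNEL_ACCOUNT); /* charged to active_memcg */
        memalloc_unuse_memcg();

        return buf;
}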
Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()
On Mon, Apr 13, 2020 at 05:15:49PM -0400, Waiman Long wrote:
> As said by Linus:
>
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.

As the btrfs example proves - people can be tempted by this false
symmetry to pair kzalloc with kzfree, which isn't what we wanted.

>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
>
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
>
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
>
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
>
> The renaming is done by using the command sequence:
>
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
>
> followed by some editing of the kfree_sensitive() kerneldoc and the
> use of memzero_explicit() instead of memset().
>
> Suggested-by: Joe Perches
> Signed-off-by: Waiman Long

Looks good to me. Thanks for fixing this very old mistake.

Acked-by: Johannes Weiner
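For context, the renamed helper is conceptually just "zero, then free",
with the zeroing done through memzero_explicit() so the compiler cannot
elide it. A rough sketch (not a verbatim copy of the slab implementation;
the name sketch_kfree_sensitive is made up):

#include <linux/slab.h>
#include <linux/string.h>

/*
 * Sketch of what kfree_sensitive() boils down to: wipe the whole object
 * before returning it to the allocator so key material and other
 * secrets do not linger in freed memory.
 */
static void sketch_kfree_sensitive(const void *p)
{
        void *mem = (void *)p;

        if (ZERO_OR_NULL_PTR(mem))
                return;

        /* unlike memset(), memzero_explicit() is never optimized away */
        memzero_explicit(mem, ksize(mem));
        kfree(mem);
}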
Re: [PATCH 25/28] mm: remove vmalloc_user_node_flags
On Thu, Apr 09, 2020 at 03:25:03PM -0700, Andrii Nakryiko wrote:
> cc Johannes who suggested this API call originally

I forgot why we did it this way - probably just cruft begetting more
cruft. Either way, Christoph's cleanup makes this look a lot better.

> On Wed, Apr 8, 2020 at 5:03 AM Christoph Hellwig wrote:
> >
> > Open code it in __bpf_map_area_alloc, which is the only caller. Also
> > clean up __bpf_map_area_alloc to have a single vmalloc call with
> > slightly different flags instead of the current two different calls.
> >
> > For this to compile for the nommu case add a __vmalloc_node_range stub
> > to nommu.c.
> >
> > Signed-off-by: Christoph Hellwig

Acked-by: Johannes Weiner
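The shape of that cleanup, as a rough sketch of the idea rather than the
exact kernel/bpf/syscall.c code after the patch (the helper name
sketch_map_area_alloc is made up and the __vmalloc_node_range() argument
list is assumed from the ~5.7-era API): one vmalloc-family call, with
only the alignment and the VM_USERMAP flag depending on whether the area
must be mmapable to userspace.

#include <linux/vmalloc.h>
#include <asm/shmparam.h>

/* Rough sketch: the mmapable case only changes alignment and vm_flags. */
static void *sketch_map_area_alloc(unsigned long size, int numa_node,
                                   bool user_mmapable)
{
        unsigned long align = 1;
        unsigned long vm_flags = 0;

        if (user_mmapable) {
                align = SHMLBA;         /* user mappings want SHMLBA alignment */
                vm_flags = VM_USERMAP;  /* allow remap_vmalloc_range() later */
        }

        return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
                                    GFP_KERNEL | __GFP_RETRY_MAYFAIL,
                                    PAGE_KERNEL, vm_flags, numa_node,
                                    __builtin_return_address(0));
}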
Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node
On Thu, Jul 30, 2015 at 06:34:29PM +0200, Vlastimil Babka wrote:
> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page
> allocator: do not check NUMA node ID when the caller knows the node is valid")
> as an optimized variant of alloc_pages_node(), that doesn't fallback to
> current node for nid == NUMA_NO_NODE. Unfortunately the name of the function
> can easily suggest that the allocation is restricted to the given node and
> fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE
> is passed among the gfp flags.
>
> The misleading name has led to mistakes in the past, see 5265047ac301 ("mm,
> thp: really limit transparent hugepage allocation to local node") and
> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node").
>
> Another issue with the name is that there's a family of alloc_pages_exact*()
> functions where 'exact' means exact size (instead of page order), which leads
> to more confusion.
>
> To prevent further mistakes, this patch effectively renames
> alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's
> an optimized variant of alloc_pages_node() not intended for general usage.
> Both functions get described in comments.
>
> It has been also considered to really provide a convenience function for
> allocations restricted to a node, but the major opinion seems to be that
> __GFP_THISNODE already provides that functionality and we shouldn't
> duplicate the API needlessly. The number of users would be small anyway.
>
> Existing callers of alloc_pages_exact_node() are simply converted to call
> __alloc_pages_node(), with two exceptions. sba_alloc_coherent() and
> slob_new_page() both open-code the check for NUMA_NO_NODE, so they are
> converted to use alloc_pages_node() instead. This means they no longer
> perform some VM_BUG_ON checks, and since the current check for nid in
> alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE),
> it may hide wrong values which would be previously exposed. Both differences
> will be rectified by the next patch.
>
> To sum up, this patch makes no functional changes, except temporarily hiding
> potentially buggy callers. Restricting the checks in alloc_pages_node() is
> left for the next patch which can in turn expose more existing buggy callers.
>
> Signed-off-by: Vlastimil Babka
> Cc: Mel Gorman
> Cc: David Rientjes
> Cc: Greg Thelen
> Cc: Aneesh Kumar K.V
> Cc: Christoph Lameter
> Cc: Pekka Enberg
> Cc: Joonsoo Kim
> Cc: Naoya Horiguchi
> Cc: Tony Luck
> Cc: Fenghua Yu
> Cc: Arnd Bergmann
> Cc: Benjamin Herrenschmidt
> Cc: Paul Mackerras
> Acked-by: Michael Ellerman
> Cc: Gleb Natapov
> Cc: Paolo Bonzini
> Cc: Thomas Gleixner
> Cc: Ingo Molnar
> Cc: "H. Peter Anvin"
> Cc: Cliff Whickman
> Acked-by: Robin Holt

Acked-by: Johannes Weiner
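The renamed helper is roughly the following (a sketch of the gfp.h
wrapper as described in the changelog above, not a verbatim copy):

/*
 * Sketch: the double-underscore variant trusts the caller to pass a
 * valid, online node id. The node is still only a preference - the
 * allocation can fall back to other nodes unless __GFP_THISNODE is
 * also passed in gfp_mask.
 */
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
        VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

        return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}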
Re: [PATCH v3 2/3] mm: unify checks in alloc_pages_node() and __alloc_pages_node()
On Thu, Jul 30, 2015 at 06:34:30PM +0200, Vlastimil Babka wrote:
> Perform the same debug checks in alloc_pages_node() as are done in
> __alloc_pages_node(), by making the former function a wrapper of the
> latter one.
>
> In addition to better diagnostics in DEBUG_VM builds for situations which
> have been already fatal (e.g. out-of-bounds node id), there are two visible
> changes for potential existing buggy callers of alloc_pages_node():
>
> - calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
>   overflow) was treated as passing NUMA_NO_NODE and fallback to local node
>   was applied. This will now be fatal.
> - calling alloc_pages_node() with an offline node will now be checked for
>   DEBUG_VM builds. Since it's not fatal if the node has been previously
>   online, and this patch may expose some existing buggy callers, change
>   the VM_BUG_ON in __alloc_pages_node() to VM_WARN_ON.
>
> Signed-off-by: Vlastimil Babka
> Acked-by: David Rientjes

Acked-by: Johannes Weiner
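After this change the generic entry point is a thin wrapper, so the only
behavioral difference between the two is the NUMA_NO_NODE fallback. A
sketch of the result (based on the changelog above, not copied verbatim):

/*
 * Sketch: alloc_pages_node() keeps the local-node fallback but now
 * funnels through __alloc_pages_node(), so any negative nid other than
 * NUMA_NO_NODE (e.g. from arithmetic overflow) hits the VM_BUG_ON there
 * instead of being silently treated as "no preference", and an offline
 * nid trips the (non-fatal) VM_WARN_ON in DEBUG_VM builds.
 */
static inline struct page *
alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
        if (nid == NUMA_NO_NODE)
                nid = numa_node_id();

        return __alloc_pages_node(nid, gfp_mask, order);
}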
Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()
On Thu, Jul 30, 2015 at 06:34:31PM +0200, Vlastimil Babka wrote:
> numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
> so it's a more robust fallback than the currently used numa_node_id().

Won't it fall through to the next closest memory node in the zonelist
anyway? Is this for callers doing NUMA_NO_NODE with __GFP_THISNODE?
Re: [PATCH 6/7][TRIVIAL][resend] mm: cleanup page reclaim comment error
On Fri, Jun 15, 2012 at 09:19:45PM +0800, Wanpeng Li wrote:
> From: Wanpeng Li
>
> Since there are five lists in LRU cache, the array nr in get_scan_count
> should be:
>
> nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
>
> Signed-off-by: Wanpeng Li
> Acked-by: KOSAKI Motohiro
> Acked-by: Minchan Kim
> Reviewed-by: Rik van Riel
>
> ---
>  mm/vmscan.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..ed823df 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1567,7 +1567,8 @@ static int vmscan_swappiness(struct scan_control *sc)
>   * by looking at the fraction of the pages scanned we did rotate back
>   * onto the active list instead of evict.
>   *
> - * nr[0] = anon pages to scan; nr[1] = file pages to scan
> + * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> + * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
>   */

Does including this in the comment have any merit in the first place?

We never access nr[0] or nr[1] etc. anywhere with magic numbers. It's a
local function with one callsite, the passed array is declared and
accessed exclusively by what is defined in enum lru_list, where is the
point in repeating the enum items?

I'd rather the next change to this comment would be its removal.
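To make the "we only index by the enum" point concrete, a small
standalone sketch (simplified; the real definitions live in
include/linux/mmzone.h and the helper name sketch_total_to_scan is made
up):

/* Simplified copy of the LRU list enumeration used to index nr[] */
enum lru_list {
        LRU_INACTIVE_ANON,
        LRU_ACTIVE_ANON,
        LRU_INACTIVE_FILE,
        LRU_ACTIVE_FILE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};

/*
 * Sketch: consumers of get_scan_count()'s nr[] walk the evictable lists
 * by enum value, never by literal index - which is why spelling out
 * nr[0]..nr[3] in a comment only duplicates the enum.
 */
static unsigned long sketch_total_to_scan(const unsigned long *nr)
{
        unsigned long total = 0;
        enum lru_list lru;

        for (lru = LRU_INACTIVE_ANON; lru < LRU_UNEVICTABLE; lru++)
                total += nr[lru];

        return total;
}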
Re: [PATCH v2] bootmem/sparsemem: remove limit constraint in alloc_bootmem_section
> While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
> Overcommit) on powerpc, we tripped the following:
>
> kernel BUG at mm/bootmem.c:483!
> cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
> pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
> lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
> sp: c0c03bc0
>    msr: 80021032
> current = 0xc0b0cce0
> paca= 0xc1d8
> pid = 0, comm = swapper
> kernel BUG at mm/bootmem.c:483!
> enter ? for help
> [c0c03c80] c0a64bcc
> .sparse_early_usemaps_alloc_node+0x84/0x29c
> [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
> [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
> [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
> [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
>
> This is
>
> BUG_ON(limit && goal + size > limit);
>
> and after some debugging, it seems that
>
> goal = 0x700
> limit = 0x800
>
> and sparse_early_usemaps_alloc_node ->
> sparse_early_usemaps_alloc_pgdat_section calls
>
> return alloc_bootmem_section(usemap_size() * count, section_nr);
>
> This is on a system with 8TB available via the AMS pool, and as a quirk
> of AMS in firmware, all of that memory shows up in node 0. So, we end up
> with an allocation that will fail the goal/limit constraints. In theory,
> we could "fall-back" to alloc_bootmem_node() in
> sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
> defined, we'll BUG_ON() instead. A simple solution appears to be to
> unconditionally remove the limit condition in alloc_bootmem_section,
> meaning allocations are allowed to cross section boundaries (necessary
> for systems of this size).
>
> Johannes Weiner pointed out that if alloc_bootmem_section() no longer
> guarantees section-locality, we need check_usemap_section_nr() to print
> possible cross-dependencies between node descriptors and the usemaps
> allocated through it. That makes the two loops in
> sparse_early_usemaps_alloc_node() identical, so re-factor the code a
> bit.
>
> Signed-off-by: Nishanth Aravamudan

Acked-by: Johannes Weiner
Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations
On Tue, Feb 28, 2012 at 12:11:51PM -0800, Nishanth Aravamudan wrote:
> On 28.02.2012 [14:53:26 +0100], Johannes Weiner wrote:
> > On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote:
> > > While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
> > > Overcommit) on powerpc, we tripped the following:
> > >
> > > kernel BUG at mm/bootmem.c:483!
> > > cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
> > > pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
> > > lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
> > > sp: c0c03bc0
> > >    msr: 80021032
> > > current = 0xc0b0cce0
> > > paca= 0xc1d8
> > > pid = 0, comm = swapper
> > > kernel BUG at mm/bootmem.c:483!
> > > enter ? for help
> > > [c0c03c80] c0a64bcc
> > > .sparse_early_usemaps_alloc_node+0x84/0x29c
> > > [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
> > > [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
> > > [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
> > > [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
> > >
> > > This is
> > >
> > > BUG_ON(limit && goal + size > limit);
> > >
> > > and after some debugging, it seems that
> > >
> > > goal = 0x700
> > > limit = 0x800
> > >
> > > and sparse_early_usemaps_alloc_node ->
> > > sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section calls
> > >
> > > return alloc_bootmem_section(usemap_size() * count, section_nr);
> > >
> > > This is on a system with 8TB available via the AMS pool, and as a quirk
> > > of AMS in firmware, all of that memory shows up in node 0. So, we end up
> > > with an allocation that will fail the goal/limit constraints. In theory,
> > > we could "fall-back" to alloc_bootmem_node() in
> > > sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
> > > defined, we'll BUG_ON() instead. A simple solution appears to be to
> > > disable the limit check if the size of the allocation in
> > > alloc_bootmem_section exceeds the section size.
> >
> > It makes sense to allow the usemaps to spill over to subsequent
> > sections instead of panicking, so FWIW:
> >
> > Acked-by: Johannes Weiner
> >
> > That being said, it would be good if check_usemap_section_nr() printed
> > the cross-dependencies between pgdats and sections when the usemaps of
> > a node spilled over to other sections than the ones holding the pgdat.
> >
> > How about this?
> >
> > ---
> > From: Johannes Weiner
> > Subject: sparsemem/bootmem: catch greater than section size allocations fix
> >
> > If alloc_bootmem_section() no longer guarantees section-locality, we
> > need check_usemap_section_nr() to print possible cross-dependencies
> > between node descriptors and the usemaps allocated through it.
> >
> > Signed-off-by: Johannes Weiner
> > ---
> >
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 61d7cde..9e032dc 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -359,6 +359,7 @@ static void __init sparse_early_usemaps_alloc_node(unsigned long**usemap_map,
> >  			continue;
> >  		usemap_map[pnum] = usemap;
> >  		usemap += size;
> > +		check_usemap_section_nr(nodeid, usemap_map[pnum]);
> >  	}
> >  	return;
> >  }

> This makes sense to me -- ok if I fold it into the re-worked patch
> (based upon Mel's comments)?

Sure thing!

> > Furthermore, I wonder if we can remove the sparse-specific stuff from
> > bootmem.c as well, as now even more so than before, calculating the
> > desired area is really none of bootmem's business.
> >
> > Would something like this be okay?
> >
> > ---
> > From: Johannes Weiner
> > Subject: [patch] mm: remove sparsemem allocation details from the bootmem allocator
> >
> > alloc_bootmem_section() derives allocation area constraints from the
> > specified sparsemem section. This is a bit specific for a generic
> > memory allocator like bootmem, though, so move it over to sparsemem.
> >
> > Since __alloc_bootmem_node() already retries failed allocations with
> > relaxed area constraints, the fallback code in sparsemem.c can be
> > removed and the code becomes a bit more compact overall.
> >
> > Signed-off-by: Johannes Weiner

> I've not tested it, but the intention seems sensible. I think it should
> remain a separate change.

Yes, I agree. I'll resend it in a bit as stand-alone patch.
Re: [PATCH] sparsemem/bootmem: catch greater than section size allocations
On Fri, Feb 24, 2012 at 11:33:58AM -0800, Nishanth Aravamudan wrote:
> While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
> Overcommit) on powerpc, we tripped the following:
>
> kernel BUG at mm/bootmem.c:483!
> cpu 0x0: Vector: 700 (Program Check) at [c0c03940]
> pc: c0a62bd8: .alloc_bootmem_core+0x90/0x39c
> lr: c0a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
> sp: c0c03bc0
>    msr: 80021032
> current = 0xc0b0cce0
> paca= 0xc1d8
> pid = 0, comm = swapper
> kernel BUG at mm/bootmem.c:483!
> enter ? for help
> [c0c03c80] c0a64bcc
> .sparse_early_usemaps_alloc_node+0x84/0x29c
> [c0c03d50] c0a64f10 .sparse_init+0x12c/0x28c
> [c0c03e20] c0a474f4 .setup_arch+0x20c/0x294
> [c0c03ee0] c0a4079c .start_kernel+0xb4/0x460
> [c0c03f90] c0009670 .start_here_common+0x1c/0x2c
>
> This is
>
> BUG_ON(limit && goal + size > limit);
>
> and after some debugging, it seems that
>
> goal = 0x700
> limit = 0x800
>
> and sparse_early_usemaps_alloc_node ->
> sparse_early_usemaps_alloc_pgdat_section -> alloc_bootmem_section calls
>
> return alloc_bootmem_section(usemap_size() * count, section_nr);
>
> This is on a system with 8TB available via the AMS pool, and as a quirk
> of AMS in firmware, all of that memory shows up in node 0. So, we end up
> with an allocation that will fail the goal/limit constraints. In theory,
> we could "fall-back" to alloc_bootmem_node() in
> sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
> defined, we'll BUG_ON() instead. A simple solution appears to be to
> disable the limit check if the size of the allocation in
> alloc_bootmem_section exceeds the section size.

It makes sense to allow the usemaps to spill over to subsequent
sections instead of panicking, so FWIW:

Acked-by: Johannes Weiner

That being said, it would be good if check_usemap_section_nr() printed
the cross-dependencies between pgdats and sections when the usemaps of
a node spilled over to other sections than the ones holding the pgdat.

How about this?

---
From: Johannes Weiner
Subject: sparsemem/bootmem: catch greater than section size allocations fix

If alloc_bootmem_section() no longer guarantees section-locality, we
need check_usemap_section_nr() to print possible cross-dependencies
between node descriptors and the usemaps allocated through it.

Signed-off-by: Johannes Weiner
---

diff --git a/mm/sparse.c b/mm/sparse.c
index 61d7cde..9e032dc 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -359,6 +359,7 @@ static void __init sparse_early_usemaps_alloc_node(unsigned long**usemap_map,
 			continue;
 		usemap_map[pnum] = usemap;
 		usemap += size;
+		check_usemap_section_nr(nodeid, usemap_map[pnum]);
 	}
 	return;
 }

---

Furthermore, I wonder if we can remove the sparse-specific stuff from
bootmem.c as well, as now even more so than before, calculating the
desired area is really none of bootmem's business.

Would something like this be okay?

---
From: Johannes Weiner
Subject: [patch] mm: remove sparsemem allocation details from the bootmem allocator

alloc_bootmem_section() derives allocation area constraints from the
specified sparsemem section. This is a bit specific for a generic
memory allocator like bootmem, though, so move it over to sparsemem.

Since __alloc_bootmem_node() already retries failed allocations with
relaxed area constraints, the fallback code in sparsemem.c can be
removed and the code becomes a bit more compact overall.

Signed-off-by: Johannes Weiner
---
 include/linux/bootmem.h |    3 ---
 mm/bootmem.c            |   26 --------------------------
 mm/sparse.c             |   29 +++++++++--------------------
 3 files changed, 9 insertions(+), 49 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index ab344a5..001c248 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -135,9 +135,6 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 extern int reserve_bootmem_generic(unsigned long addr, unsigned long size,
 				   int flags);
 
-extern void *alloc_bootmem_section(unsigned long size,
-				   unsigned long section_nr);
-
 #ifdef CONFIG_HAVE_ARCH_ALLOC_REMAP
 extern void *alloc_remap(int nid, unsigned long size);
 #else
diff --git a/mm/bootmem.c b/mm/bootmem.c
index 7bc0557..d34026c 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -756,32 +756,6 @@ void * __init __alloc_bootmem_node_high(pg_data_t *pgdat, unsigned long size,
 }
 
-#ifdef CONFIG_