Re: [PATCH v2 0/5] Add movablecore_map boot option
On Wed, Nov 28, 2012 at 04:29:01PM +0800, Wen Congyang wrote: >At 11/28/2012 12:08 PM, Jiang Liu Wrote: >> On 2012-11-28 11:24, Bob Liu wrote: >>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen wrote: On 11/27/2012 08:09 PM, Bob Liu wrote: > > On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen > wrote: >> >> Hi Liu, >> >> >> This feature is used in memory hotplug. >> >> In order to implement a whole node hotplug, we need to make sure the >> node contains no kernel memory, because memory used by kernel could >> not be migrated. (Since the kernel memory is directly mapped, >> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.) >> >> User could specify all the memory on a node to be movable, so that the >> node could be hot-removed. >> > > Thank you for your explanation. It's reasonable. > > But i think it's a bit duplicated with CMA, i'm not sure but maybe we > can combine it with CMA which already in mainline? > Hi Liu, Thanks for your advice. :) CMA is Contiguous Memory Allocator, right? What I'm trying to do is controlling where is the start of ZONE_MOVABLE of each node. Could CMA do this job ? >>> >>> cma will not control the start of ZONE_MOVABLE of each node, but it >>> can declare a memory that always movable >>> and all non movable allocate request will not happen on that area. >>> >>> Currently cma use a boot parameter "cma=" to declare a memory size >>> that always movable. >>> I think it might fulfill your requirement if extending the boot >>> parameter with a start address. >>> >>> more info at http://lwn.net/Articles/468044/ And also, after a short investigation, CMA seems need to base on memblock. But we need to limit memblock not to allocate memory on ZONE_MOVABLE. As a result, we need to know the ranges before memblock could be used. I'm afraid we still need an approach to get the ranges, such as a boot option, or from static ACPI tables such as SRAT/MPST. >>> >>> Yes, it's based on memblock and with boot option. >>> In setup_arch32() >>> dma_contiguous_reserve(0); => will declare a cma area using >>> memblock_reserve() >>> I'm don't know much about CMA for now. So if you have any better idea, please share with us, thanks. :) >>> >>> My idea is reuse cma like below patch(even not compiled) and boot with >>> "cma=size@start_address". >>> I don't know whether it can work and whether suitable for your >>> requirement, if not forgive me for this noises. 
>>> >>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c >>> index 612afcc..564962a 100644 >>> --- a/drivers/base/dma-contiguous.c >>> +++ b/drivers/base/dma-contiguous.c >>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area; >>> */ >>> static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M; >>> static long size_cmdline = -1; >>> +static long cma_start_cmdline = -1; >>> >>> static int __init early_cma(char *p) >>> { >>> + char *oldp; >>> pr_debug("%s(%s)\n", __func__, p); >>> + oldp = p; >>> size_cmdline = memparse(p, &p); >>> + >>> + if (*p == '@') >>> + cma_start_cmdline = memparse(p+1, &p); >>> + printk("cma start:0x%x, size: 0x%x\n", size_cmdline, >>> cma_start_cmdline); >>> return 0; >>> } >>> early_param("cma", early_cma); >>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit) >>> if (selected_size) { >>> pr_debug("%s: reserving %ld MiB for global area\n", >>> __func__, >>> selected_size / SZ_1M); >>> - >>> - dma_declare_contiguous(NULL, selected_size, 0, limit); >>> + if (cma_size_cmdline != -1) >>> + dma_declare_contiguous(NULL, selected_size, >>> cma_start_cmdline, limit); >>> + else >>> + dma_declare_contiguous(NULL, selected_size, 0, >>> limit); >>> } >>> }; >> Seems a good idea to reserve memory by reusing CMA logic, though need more >> investigation here. One of CMA goal is to ensure pages in CMA are really >> movable, and this patchset tries to achieve the same goal at a first glance. > >Hmm, I don't like to reuse CMA. Because CMA is used for DMA. If we reuse it >for movable memory, I think movable zone is enough. And the start address is >not acceptable, because we want to specify the start address for each node. > >I think we can implement movablecore_map like that: >1. parse the parameter >2. reserve the memory after efi_reserve_boot_services() >3. release the memory in mem_init > Hi Tang, I haven't read the patchset yet, but could you give a short describe how you design your implementation in this patchset? Regards, Jaegeuk >What about this? > >Thanks >Wen Congyang >> >> >> >> >> > >-- >To unsubscribe, send a message with 'unsubscribe linux-mm' in
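Since Bob's example above is explicitly untested (the reserve hunk checks cma_size_cmdline where size_cmdline was declared, and prints longs with %x), here is a minimal corrected sketch of how a "cma=size[@start]" early parameter could be parsed with memparse(). This is illustrative only; it is not what the movablecore_map series ultimately does, and the dma_declare_contiguous() wiring is omitted.

#include <linux/kernel.h>
#include <linux/init.h>

static long size_cmdline = -1;		/* -1: no "cma=" option given */
static long cma_start_cmdline = -1;	/* -1: let the kernel pick the base */

/* parse "cma=<size>[@<start>]", e.g. booting with cma=1G@0x100000000 */
static int __init early_cma(char *p)
{
	pr_debug("%s(%s)\n", __func__, p);
	size_cmdline = memparse(p, &p);
	if (*p == '@')
		cma_start_cmdline = memparse(p + 1, &p);
	pr_info("cma: size 0x%lx, start 0x%lx\n",
		size_cmdline, cma_start_cmdline);
	return 0;
}
early_param("cma", early_cma);

With such a parameter, dma_contiguous_reserve() would pass cma_start_cmdline (when set) instead of 0 as the base argument of dma_declare_contiguous(), which is what the hunk above intends with its cma_size_cmdline check.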
Re: [PATCH] tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
On Wed, Nov 28, 2012 at 05:22:03PM -0800, Hugh Dickins wrote: >Revert 3.5's f21f8062201f ("tmpfs: revert SEEK_DATA and SEEK_HOLE") >to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"), >with the intervening additional arg to generic_file_llseek_size(). > >In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper >SEEK_DATA and SEEK_HOLE support; and a good case has now been made >for it on tmpfs, so let's join the party. > Hi Hugh, IIUC, several months ago you revert the patch. You said, "I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it would be of any use to them on tmpfs. This code adds 92 lines and 752 bytes on x86_64 - is that bloat or worthwhile?" But this time in which scenario will use it? Regards, Jaegeuk >It's quite easy for tmpfs to scan the radix_tree to support llseek's new >SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still >on my mind (in particular, the !PageUptodate-ness of pages fallocated but >still unwritten). > >[a...@linux-foundation.org: fix warning with CONFIG_TMPFS=n] >Signed-off-by: Hugh Dickins >--- > > mm/shmem.c | 92 ++- > 1 file changed, 91 insertions(+), 1 deletion(-) > >--- 3.7-rc7/mm/shmem.c 2012-11-16 19:26:56.388459961 -0800 >+++ linux/mm/shmem.c 2012-11-28 15:53:38.788477201 -0800 >@@ -1709,6 +1709,96 @@ static ssize_t shmem_file_splice_read(st > return error; > } > >+/* >+ * llseek SEEK_DATA or SEEK_HOLE through the radix_tree. >+ */ >+static pgoff_t shmem_seek_hole_data(struct address_space *mapping, >+ pgoff_t index, pgoff_t end, int origin) >+{ >+ struct page *page; >+ struct pagevec pvec; >+ pgoff_t indices[PAGEVEC_SIZE]; >+ bool done = false; >+ int i; >+ >+ pagevec_init(&pvec, 0); >+ pvec.nr = 1;/* start small: we may be there already */ >+ while (!done) { >+ pvec.nr = shmem_find_get_pages_and_swap(mapping, index, >+ pvec.nr, pvec.pages, indices); >+ if (!pvec.nr) { >+ if (origin == SEEK_DATA) >+ index = end; >+ break; >+ } >+ for (i = 0; i < pvec.nr; i++, index++) { >+ if (index < indices[i]) { >+ if (origin == SEEK_HOLE) { >+ done = true; >+ break; >+ } >+ index = indices[i]; >+ } >+ page = pvec.pages[i]; >+ if (page && !radix_tree_exceptional_entry(page)) { >+ if (!PageUptodate(page)) >+ page = NULL; >+ } >+ if (index >= end || >+ (page && origin == SEEK_DATA) || >+ (!page && origin == SEEK_HOLE)) { >+ done = true; >+ break; >+ } >+ } >+ shmem_deswap_pagevec(&pvec); >+ pagevec_release(&pvec); >+ pvec.nr = PAGEVEC_SIZE; >+ cond_resched(); >+ } >+ return index; >+} >+ >+static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin) >+{ >+ struct address_space *mapping = file->f_mapping; >+ struct inode *inode = mapping->host; >+ pgoff_t start, end; >+ loff_t new_offset; >+ >+ if (origin != SEEK_DATA && origin != SEEK_HOLE) >+ return generic_file_llseek_size(file, offset, origin, >+ MAX_LFS_FILESIZE, i_size_read(inode)); >+ mutex_lock(&inode->i_mutex); >+ /* We're holding i_mutex so we can access i_size directly */ >+ >+ if (offset < 0) >+ offset = -EINVAL; >+ else if (offset >= inode->i_size) >+ offset = -ENXIO; >+ else { >+ start = offset >> PAGE_CACHE_SHIFT; >+ end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; >+ new_offset = shmem_seek_hole_data(mapping, start, end, origin); >+ new_offset <<= PAGE_CACHE_SHIFT; >+ if (new_offset > offset) { >+ if (new_offset < inode->i_size) >+ offset = new_offset; >+ else if (origin == SEEK_DATA) >+ offset = -ENXIO; >+ else >+ offset = inode->i_size; >+ } >+ } >+ >+ if (offset >= 0 && offset != file->f_pos) { >+ 
file->f_pos = offset; >+ file->f_version = 0; >+ } >+ mutex_unlock(&inode->i_mutex); >+ return offset; >+} >+ > static long shmem_fallocate(struct file *file, int mode, lo
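To make the "who actually uses this" question concrete, here is a minimal user-space sketch (not part of Hugh's patch, error handling trimmed) of the typical consumer: a sparse-file-aware tool walking only the allocated extents of a file with SEEK_DATA/SEEK_HOLE, which works on tmpfs once this patch is in.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	int fd;
	off_t data, hole, end;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	end = lseek(fd, 0, SEEK_END);
	data = 0;
	while (data < end) {
		/* find the next region containing data at or after 'data' */
		data = lseek(fd, data, SEEK_DATA);
		if (data < 0)
			break;	/* ENXIO: no more data before EOF */
		/* find the hole that terminates this data region */
		hole = lseek(fd, data, SEEK_HOLE);
		printf("data extent: [%lld, %lld)\n",
		       (long long)data, (long long)hole);
		data = hole;
	}
	close(fd);
	return 0;
}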
Re: [PATCH v2 0/5] Add movablecore_map boot option
On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote: >Hi all, > Seems it's a great chance to discuss about the memory hotplug feature >within this thread. So I will try to give some high level thoughts about memory >hotplug feature on x86/IA64. Any comments are welcomed! > First of all, I think usability really matters. Ideally, memory hotplug >feature should just work out of box, and we shouldn't expect administrators to >add several extra platform dependent parameters to enable memory hotplug. >But how to enable memory (or CPU/node) hotplug out of box? I think the key >point >is to cooperate with BIOS/ACPI/firmware/device management teams. > I still position memory hotplug as an advanced feature for high end >servers and those systems may/should provide some management interfaces to >configure CPU/memory/node hotplug features. The configuration UI may be >provided >by BIOS, BMC or centralized system management suite. Once administrator enables >hotplug feature through those management UI, OS should support system device >hotplug out of box. For example, HP SuperDome2 management suite provides >interface >to configure a node as floating node(hot-removable). And OpenSolaris supports >CPU/memory hotplug out of box without any extra configurations. So we should >shape interfaces between firmware and OS to better support system device >hotplug. > On the other hand, I think there are no commercial available x86/IA64 >platforms with system device hotplug capabilities in the field yet, at least >only >limited quantity if any. So backward compatibility is not a big issue for us >now. >So I think it's doable to rely on firmware to provide better support for system >device hotplug. > Then what should be enhanced to better support system device hotplug? > >1) ACPI specification should be enhanced to provide a static table to describe >components with hotplug features, so OS could reserve special resources for >hotplug at early boot stages. For example, to reserve enough CPU ids for CPU >hot-add. Currently we guess maximum number of CPUs supported by the platform >by counting CPU entries in APIC table, that's not reliable. > >2) BIOS should implement SRAT, MPST and PMTT tables to better support memory >hotplug. SRAT associates memory ranges with proximity domains with an extra >"hotpluggable" flag. PMTT provides memory device topology information, such >as "socket->memory controller->DIMM". MPST is used for memory power management >and provides a way to associate memory ranges with memory devices in PMTT. >With all information from SRAT, MPST and PMTT, OS could figure out hotplug >memory ranges automatically, so no extra kernel parameters needed. > >3) Enhance ACPICA to provide a method to scan static ACPI tables before >memory subsystem has been initialized because OS need to access SRAT, >MPST and PMTT when initializing memory subsystem. > >4) The last and the most important issue is how to minimize performance >drop caused by memory hotplug. As proposed by this patchset, once we >configure all memory of a NUMA node as movable, it essentially disable >NUMA optimization of kernel memory allocation from that node. According >to experience, that will cause huge performance drop. We have observed >10-30% performance drop with memory hotplug enabled. And on another >OS the average performance drop caused by memory hotplug is about 10%. 
>If we can't resolve the performance drop, memory hotplug is just a feature >for demo:( With help from hardware, we do have some chances to reduce >performance penalty caused by memory hotplug. > As we know, Linux could migrate movable page, but can't migrate >non-movable pages used by kernel/DMA etc. And the most hard part is how >to deal with those unmovable pages when hot-removing a memory device. >Now hardware has given us a hand with a technology named memory migration, >which could transparently migrate memory between memory devices. There's >no OS visible changes except NUMA topology before and after hardware memory >migration. > And if there are multiple memory devices within a NUMA node, >we could configure some memory devices to host unmovable memory and the >other to host movable memory. With this configuration, there won't be >bigger performance drop because we have preserved all NUMA optimizations. >We also could achieve memory hotplug remove by: >1) Use existing page migration mechanism to reclaim movable pages. >2) For memory devices hosting unmovable pages, we need: >2.1) find a movable memory device on other nodes with enough capacity >and reclaim it. >2.2) use hardware migration technology to migrate unmovable memory to Hi Jiang, Could you give an explanation how hardware migration technology works? Regards, Jaegeuk >the just reclaimed memory device on other nodes. > > I hope we could expect users to adopt memory hotplug technology >with all these implemented. > > Back to this patch, we could rely
Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON
On 11/07/2012 07:48 AM, Hugh Dickins wrote: On Tue, 6 Nov 2012, Dave Jones wrote: On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote: > -/* We already confirmed swap, and make no allocation */ > -VM_BUG_ON(error); > +/* > + * We already confirmed swap under page lock, and make > + * no memory allocation here, so usually no possibility > + * of error; but free_swap_and_cache() only trylocks a > + * page, so it is just possible that the entry has been > + * truncated or holepunched since swap was confirmed. > + * shmem_undo_range() will have done some of the > + * unaccounting, now delete_from_swap_cache() will do > + * the rest (including mem_cgroup_uncharge_swapcache). > + * Reset swap.val? No, leave it so "failed" goes back to > + * "repeat": reading a hole and writing should succeed. > + */ > +if (error) { > +VM_BUG_ON(error != -ENOENT); > +delete_from_swap_cache(page); > +} > } I ran with this overnight, Thanks a lot... and still hit the (new!) VM_BUG_ON ... but that's even more surprising than your original report. Perhaps we should print out what 'error' was too ? I'll rebuild with that.. Thanks; though I thought the error was going to turn out too boring, and was preparing a debug patch for you to show the expected and found values too. But then got very puzzled... [ cut here ] WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Hardware name: 2012 Client Platform Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54 That's the very same line number as in your original report, despite the long comment which the patch adds. Are you sure that kernel was built with the patch in? I wouldn't usually question you, but I'm going mad trying to understand how the VM_BUG_ON(error != -ENOENT) fires. At the time I wrote that line, and when I was preparing the debug patch, I was thinking that an error from shmem_radix_tree_replace could also be -EEXIST, for when a different something rather than nothing is found [*]. But that's not the case, shmem_radix_tree_replace returns either 0 or -ENOENT. So if error != -ENOENT, that means shmem_add_to_page_cache went the radix_tree_insert route instead of the shmem_radix_tree_replace route; which means that its 'expected' is NULL, so swp_to_radix_entry(swap) is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt the radix_tree might be, I do not understand the new VM_BUG_ON firing. Please tell me it was the wrong kernel! Hugh [*] But in thinking it over, I realize that if shmem_radix_tree_replace had returned -EEXIST for the "wrong something" case, I would have been wrong to BUG on that; because just as truncation could remove an entry, something else could immediately after instantiate a new page there. Hi Hugh, As you said, swp_to_radix_entry() does an "| 2", so even if truncation could remove an entry and something else could immediately after instantiate a new page there, but the expected parameter will not be NULL, the result is radix_tree_insert will not be called and shmem_add_to_page_cache will not return -EEXIST, then why trigger BUG_ON ? Regards, Jaegeuk So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's not saying what I had intended to say with it, and would have been wrong to say that anyway. It just looks stupid to me now, rather like inserting a VM_BUG_ON(false) - but that does become interesting when you report that you've hit it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
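The "| 2" point is the crux of the exchange above, so here is a small self-contained illustration (simplified, not the kernel's exact definitions) of why swp_to_radix_entry() can never yield NULL, and hence why shmem_add_to_page_cache() with a swap 'expected' value always takes the shmem_radix_tree_replace() path and can only fail with -ENOENT.

#include <assert.h>
#include <stdio.h>

#define RADIX_TREE_EXCEPTIONAL_ENTRY	2UL
#define RADIX_TREE_EXCEPTIONAL_SHIFT	2

typedef struct { unsigned long val; } swp_entry_t;

static void *swp_to_radix_entry(swp_entry_t entry)
{
	/* shift the swap value up and set bit 1: the result is never NULL */
	return (void *)((entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT) |
			RADIX_TREE_EXCEPTIONAL_ENTRY);
}

static int radix_tree_exceptional_entry(void *arg)
{
	return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
}

int main(void)
{
	swp_entry_t swap = { .val = 0 };
	void *expected = swp_to_radix_entry(swap);

	/* Even for swap.val == 0 the encoded entry is non-NULL, so
	 * shmem_add_to_page_cache() takes the "replace" path rather than
	 * radix_tree_insert(), and can only fail with -ENOENT. */
	assert(expected != NULL);
	assert(radix_tree_exceptional_entry(expected));
	printf("encoded entry: %p\n", expected);
	return 0;
}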
Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON
On 11/14/2012 11:50 AM, Hugh Dickins wrote: On Wed, 14 Nov 2012, Jaegeuk Hanse wrote: On 11/07/2012 07:48 AM, Hugh Dickins wrote: On Tue, 6 Nov 2012, Dave Jones wrote: On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote: > - /* We already confirmed swap, and make no allocation */ > - VM_BUG_ON(error); > + /* > +* We already confirmed swap under page lock, and make > +* no memory allocation here, so usually no possibility > +* of error; but free_swap_and_cache() only trylocks a > +* page, so it is just possible that the entry has been > +* truncated or holepunched since swap was confirmed. > +* shmem_undo_range() will have done some of the > +* unaccounting, now delete_from_swap_cache() will do > +* the rest (including mem_cgroup_uncharge_swapcache). > +* Reset swap.val? No, leave it so "failed" goes back to > +* "repeat": reading a hole and writing should succeed. > +*/ > + if (error) { > + VM_BUG_ON(error != -ENOENT); > + delete_from_swap_cache(page); > + } > } I ran with this overnight, Thanks a lot... and still hit the (new!) VM_BUG_ON ... but that's even more surprising than your original report. Perhaps we should print out what 'error' was too ? I'll rebuild with that.. Thanks; though I thought the error was going to turn out too boring, and was preparing a debug patch for you to show the expected and found values too. But then got very puzzled... [ cut here ] WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Hardware name: 2012 Client Platform Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54 That's the very same line number as in your original report, despite the long comment which the patch adds. Are you sure that kernel was built with the patch in? I wouldn't usually question you, but I'm going mad trying to understand how the VM_BUG_ON(error != -ENOENT) fires. At the time I wrote that line, and when I was preparing the debug patch, I was thinking that an error from shmem_radix_tree_replace could also be -EEXIST, for when a different something rather than nothing is found [*]. But that's not the case, shmem_radix_tree_replace returns either 0 or -ENOENT. So if error != -ENOENT, that means shmem_add_to_page_cache went the radix_tree_insert route instead of the shmem_radix_tree_replace route; which means that its 'expected' is NULL, so swp_to_radix_entry(swap) is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt the radix_tree might be, I do not understand the new VM_BUG_ON firing. Please tell me it was the wrong kernel! Hugh [*] But in thinking it over, I realize that if shmem_radix_tree_replace had returned -EEXIST for the "wrong something" case, I would have been wrong to BUG on that; because just as truncation could remove an entry, something else could immediately after instantiate a new page there. Hi Hugh, As you said, swp_to_radix_entry() does an "| 2", so even if truncation could remove an entry and something else could immediately after instantiate a new page there, but the expected parameter will not be NULL, the result is radix_tree_insert will not be called and shmem_add_to_page_cache will not return -EEXIST, then why trigger BUG_ON ? Why insert the VM_BUG_ON? Because at the time I thought that it asserted something useful; but I was mistaken, as explained above. How can the VM_BUG_ON trigger (without stack corruption, or something of that kind)? I have no idea. We are in agreement: I now think that VM_BUG_ON is misleading and silly, and sent Andrew a further patch to remove it a just couple of hours ago. 
Originally I was waiting to hear further from Dave; but his test machine was giving trouble, and it occurred to me that, never mind whether he says he has hit it again, or he has not hit it again, the answer is the same: don't send that VM_BUG_ON upstream. Hugh Thanks Hugh. Another question. Why the function shmem_fallocate which you add to kernel need call shmem_getpage? Regards, Jaegeuk Regards, Jaegeuk So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's not saying what I had intended to say with it, and would have been wrong to say that anyway. It just looks stupid to me now, rather like inserting a VM_BUG_ON(false) - but that does become interesting when you report that you
Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON
On 11/16/2012 03:56 AM, Hugh Dickins wrote: Offtopic... On Thu, 15 Nov 2012, Jaegeuk Hanse wrote: Another question. Why the function shmem_fallocate which you add to kernel need call shmem_getpage? Because shmem_getpage(_gfp) is where shmem's page lookup and allocation complexities are handled. I assume the question behind your question is: why does shmem actually allocate pages for its fallocate, instead of just reserving the space? Yeah, this is what I want to know. I did play with just reserving the space, with more special entries in the radix_tree to note the reservations made. It should be doable for the vm_enough_memory and sbinfo->used_blocks reservations. What absolutely deterred me from taking that path was the mem_cgroup case: shmem and swap and memcg are not easy to get working right together, and nobody would thank me for complicating memcg just for shmem_fallocate. By allocating pages, the pre-existing memcg code just works; if we used reservations instead, we would have to track their memcg charges in some additional new way. I see no justification for that complication. Oh, I see, thanks Hugh. :-) Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 20/21] mm: drop vmtruncate
On 11/03/2012 05:32 PM, Marco Stornelli wrote: Removed vmtruncate Hi Marco, Could you explain me why vmtruncate need remove? What's the problem and how to substitute it? Regards, Jaegeuk Signed-off-by: Marco Stornelli --- include/linux/mm.h |1 - mm/truncate.c | 23 --- 2 files changed, 0 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index fa06804..95f70bb 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -977,7 +977,6 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping, extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new); extern void truncate_setsize(struct inode *inode, loff_t newsize); -extern int vmtruncate(struct inode *inode, loff_t offset); void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); int generic_error_remove_page(struct address_space *mapping, struct page *page); diff --git a/mm/truncate.c b/mm/truncate.c index d51ce92..c75b736 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -577,29 +577,6 @@ void truncate_setsize(struct inode *inode, loff_t newsize) EXPORT_SYMBOL(truncate_setsize); /** - * vmtruncate - unmap mappings "freed" by truncate() syscall - * @inode: inode of the file used - * @newsize: file offset to start truncating - * - * This function is deprecated and truncate_setsize or truncate_pagecache - * should be used instead, together with filesystem specific block truncation. - */ -int vmtruncate(struct inode *inode, loff_t newsize) -{ - int error; - - error = inode_newsize_ok(inode, newsize); - if (error) - return error; - - truncate_setsize(inode, newsize); - if (inode->i_op->truncate) - inode->i_op->truncate(inode); - return 0; -} -EXPORT_SYMBOL(vmtruncate); - -/** * truncate_pagecache_range - unmap and remove pagecache that is hole-punched * @inode: inode * @lstart: offset of beginning of hole -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
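To answer the substitution question directly: callers of vmtruncate() are converted to call truncate_setsize() plus their own on-disk block truncation from ->setattr(). A hedged sketch of the expected pattern follows; "myfs" and myfs_truncate_blocks() are hypothetical placeholders, not code from Marco's series.

#include <linux/fs.h>
#include <linux/mm.h>

static void myfs_truncate_blocks(struct inode *inode, loff_t newsize);	/* hypothetical */

static int myfs_setattr(struct dentry *dentry, struct iattr *attr)
{
	struct inode *inode = dentry->d_inode;
	int error;

	error = inode_change_ok(inode, attr);
	if (error)
		return error;

	if ((attr->ia_valid & ATTR_SIZE) &&
	    attr->ia_size != i_size_read(inode)) {
		error = inode_newsize_ok(inode, attr->ia_size);
		if (error)
			return error;
		/* update i_size and unmap/remove pagecache beyond it */
		truncate_setsize(inode, attr->ia_size);
		/* filesystem-specific on-disk block truncation */
		myfs_truncate_blocks(inode, attr->ia_size);
	}

	setattr_copy(inode, attr);
	mark_inode_dirty(inode);
	return 0;
}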
Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON
On 11/16/2012 03:56 AM, Hugh Dickins wrote: Offtopic... On Thu, 15 Nov 2012, Jaegeuk Hanse wrote: Another question. Why the function shmem_fallocate which you add to kernel need call shmem_getpage? Because shmem_getpage(_gfp) is where shmem's page lookup and allocation complexities are handled. I assume the question behind your question is: why does shmem actually allocate pages for its fallocate, instead of just reserving the space? I did play with just reserving the space, with more special entries in the radix_tree to note the reservations made. It should be doable for the vm_enough_memory and sbinfo->used_blocks reservations. What absolutely deterred me from taking that path was the mem_cgroup case: shmem and swap and memcg are not easy to get working right together, and nobody would thank me for complicating memcg just for shmem_fallocate. By allocating pages, the pre-existing memcg code just works; if we used reservations instead, we would have to track their memcg charges in some additional new way. I see no justification for that complication. Hi Hugh Some questions about your shmem/tmpfs: misc and fallocate patchset. - Since shmem_setattr can truncate tmpfs files, why need add another similar codes in function shmem_fallocate? What's the trick? - in tmpfs: support fallocate preallocation patch changelog: "Christoph Hellwig: What for exactly? Please explain why preallocating on tmpfs would make any sense. Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or -EOPNOTSUPP] on fallocate is just ugly." Could shmem/tmpfs fallocate prevent one process truncate the file which the second process mmap() and get SIGBUS when the second process access mmap but out of current size of file? Regards, Jaegeuk Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON
On 11/17/2012 12:48 PM, Hugh Dickins wrote: Further offtopic.. Thanks for your explanation, Hugh. :-) On Fri, 16 Nov 2012, Jaegeuk Hanse wrote: Some questions about your shmem/tmpfs: misc and fallocate patchset. - Since shmem_setattr can truncate tmpfs files, why need add another similar codes in function shmem_fallocate? What's the trick? I don't know if I understand you. In general, hole-punching is different from truncation. Supporting the hole-punch mode of the fallocate system call is different from supporting truncation. They're closely related, and share code, but meet different specifications. What's the different between shmem/tmpfs hole-punching and truncate_setsize/truncate_pagecache? Do you mean one is punch hole in the file and the other one is shrink or extent the size of a file? - in tmpfs: support fallocate preallocation patch changelog: "Christoph Hellwig: What for exactly? Please explain why preallocating on tmpfs would make any sense. Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or -EOPNOTSUPP] on fallocate is just ugly." Could shmem/tmpfs fallocate prevent one process truncate the file which the second process mmap() and get SIGBUS when the second process access mmap but out of current size of file? Again, I don't know if I understand you. fallocate does not prevent truncation or races or SIGBUS. I believe that Kay meant that without using fallocate to allocate the memory in advance, systemd found it hard to protect itself from the possibility of getting a SIGBUS, if access to a shmem mapping happened to run out of memory/space in the middle. IIUC, it will return VM_xxx_OOM instead of SIGBUS if run out of memory. Then how can get SIGBUS in this scene? Regards, Jaegeuk I never grasped why writing the file in advance was not good enough: fallocate happened to be what they hoped to use, and it was hard to deny it, given that tmpfs already supported hole-punching, and was about to convert to the fallocate interface for that. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
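To illustrate the /dev/shm scenario Kay described (again a sketch, not from the thread): preallocating with fallocate() before mmap() means a later store into the mapping cannot run the file out of tmpfs space part-way through, so the failure shows up as an early ENOSPC/ENOMEM from fallocate() rather than a SIGBUS at fault time. Truncation of the file by another process can of course still cause SIGBUS; as Hugh notes, fallocate does not prevent that.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t size = 16 * 1024 * 1024;
	char *p;
	int fd = open("/dev/shm/example", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Reserve the backing pages now; this fails up front with
	 * ENOSPC/ENOMEM instead of a SIGBUS once the mapping is written. */
	if (fallocate(fd, 0, 0, size) < 0) {
		perror("fallocate");
		return 1;
	}
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, size);	/* safe: backing pages already allocated */
	munmap(p, size);
	close(fd);
	return 0;
}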
Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON
On 11/17/2012 12:48 PM, Hugh Dickins wrote: Further offtopic.. Hi Hugh, - I see you add this in vfs.txt: + fallocate: called by the VFS to preallocate blocks or punch a hole. I want to know if it's necessary to add it to man page since users still don't know fallocate can punch a hole from man fallocate. - in function shmem_fallocate: + else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced) + error = -ENOMEM; If this changelog "shmem_fallocate() compare counts and give up once the reactivated pages have started to coming back to writepage (approximately: some zones would in fact recycle faster than others)." describe why need this change? If the answer is yes, I have two questions. 1) how can guarantee it really don't need preallocation if just one or a few pages always reactivated, in this scene, nr_unswapped maybe grow bigger enough than shmem_falloc.nr_falloced 2) why return -ENOMEM, it's not really OOM, is it a trick or ...? Regards, Jaegeuk On Fri, 16 Nov 2012, Jaegeuk Hanse wrote: Some questions about your shmem/tmpfs: misc and fallocate patchset. - Since shmem_setattr can truncate tmpfs files, why need add another similar codes in function shmem_fallocate? What's the trick? I don't know if I understand you. In general, hole-punching is different from truncation. Supporting the hole-punch mode of the fallocate system call is different from supporting truncation. They're closely related, and share code, but meet different specifications. - in tmpfs: support fallocate preallocation patch changelog: "Christoph Hellwig: What for exactly? Please explain why preallocating on tmpfs would make any sense. Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or -EOPNOTSUPP] on fallocate is just ugly." Could shmem/tmpfs fallocate prevent one process truncate the file which the second process mmap() and get SIGBUS when the second process access mmap but out of current size of file? Again, I don't know if I understand you. fallocate does not prevent truncation or races or SIGBUS. I believe that Kay meant that without using fallocate to allocate the memory in advance, systemd found it hard to protect itself from the possibility of getting a SIGBUS, if access to a shmem mapping happened to run out of memory/space in the middle. I never grasped why writing the file in advance was not good enough: fallocate happened to be what they hoped to use, and it was hard to deny it, given that tmpfs already supported hole-punching, and was about to convert to the fallocate interface for that. Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page
On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" H. Peter Anvin doesn't like huge zero page which sticks in memory forever after the first allocation. Here's implementation of lockless refcounting for huge zero page. We have two basic primitives: {get,put}_huge_zero_page(). They manipulate reference counter. If counter is 0, get_huge_zero_page() allocates a new huge page and takes two references: one for caller and one for shrinker. We free the page only in shrinker callback if counter is 1 (only shrinker has the reference). put_huge_zero_page() only decrements counter. Counter is never zero in put_huge_zero_page() since shrinker holds on reference. Freeing huge zero page in shrinker callback helps to avoid frequent allocate-free. Refcounting has cost. On 4 socket machine I observe ~1% slowdown on parallel (40 processes) read page faulting comparing to lazy huge page allocation. I think it's pretty reasonable for synthetic benchmark. Hi Kirill, I see your and Andew's hot discussion in v4 resend thread. "I also tried another scenario: usemem -n16 100M -r 1000. It creates real memory pressure - no easy reclaimable memory. This time callback called with nr_to_scan > 0 and we freed hzp. " What's "usemem"? Is it a tool and how to get it? It's hard for me to find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your scenario? Regards, Jaegeuk Signed-off-by: Kirill A. Shutemov --- mm/huge_memory.c | 112 ++- 1 file changed, 87 insertions(+), 25 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bad9c8f..923ea75 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -47,7 +48,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 1; /* during fragmentation poll the hugepage allocator once every minute */ static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 6; static struct task_struct *khugepaged_thread __read_mostly; -static unsigned long huge_zero_pfn __read_mostly; static DEFINE_MUTEX(khugepaged_mutex); static DEFINE_SPINLOCK(khugepaged_mm_lock); static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait); @@ -160,31 +160,74 @@ static int start_khugepaged(void) return err; } -static int init_huge_zero_pfn(void) +static atomic_t huge_zero_refcount; +static unsigned long huge_zero_pfn __read_mostly; + +static inline bool is_huge_zero_pfn(unsigned long pfn) { - struct page *hpage; - unsigned long pfn; + unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn); + return zero_pfn && pfn == zero_pfn; +} + +static inline bool is_huge_zero_pmd(pmd_t pmd) +{ + return is_huge_zero_pfn(pmd_pfn(pmd)); +} + +static unsigned long get_huge_zero_page(void) +{ + struct page *zero_page; +retry: + if (likely(atomic_inc_not_zero(&huge_zero_refcount))) + return ACCESS_ONCE(huge_zero_pfn); - hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, + zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE, HPAGE_PMD_ORDER); - if (!hpage) - return -ENOMEM; - pfn = page_to_pfn(hpage); - if (cmpxchg(&huge_zero_pfn, 0, pfn)) - __free_page(hpage); - return 0; + if (!zero_page) + return 0; + preempt_disable(); + if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) { + preempt_enable(); + __free_page(zero_page); + goto retry; + } + + /* We take additional reference here. 
It will be put back by shrinker */ + atomic_set(&huge_zero_refcount, 2); + preempt_enable(); + return ACCESS_ONCE(huge_zero_pfn); } -static inline bool is_huge_zero_pfn(unsigned long pfn) +static void put_huge_zero_page(void) { - return huge_zero_pfn && pfn == huge_zero_pfn; + /* +* Counter should never go to zero here. Only shrinker can put +* last reference. +*/ + BUG_ON(atomic_dec_and_test(&huge_zero_refcount)); } -static inline bool is_huge_zero_pmd(pmd_t pmd) +static int shrink_huge_zero_page(struct shrinker *shrink, + struct shrink_control *sc) { - return is_huge_zero_pfn(pmd_pfn(pmd)); + if (!sc->nr_to_scan) + /* we can free zero page only if last reference remains */ + return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0; + + if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) { + unsigned long zero_pfn = xchg(&huge_zero_pfn, 0); + BUG_ON(zero_pfn == 0); + __free_page(__pfn_to_page(zero_pfn)); + } + + return 0; } +static struct shrinker huge_zero_page_shrinker = { + .shrink = shrink_huge_zero_page, + .
Re: [PATCH] tmpfs: change final i_blocks BUG to WARNING
On 11/06/2012 09:34 AM, Hugh Dickins wrote: Under a particular load on one machine, I have hit shmem_evict_inode()'s BUG_ON(inode->i_blocks), enough times to narrow it down to a particular race between swapout and eviction. It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(), and the lack of coherent locking between mapping's nrpages and shmem's swapped count. There's a window in shmem_writepage(), between lowering nrpages in shmem_delete_from_page_cache() and then raising swapped count, when the freed count appears to be +1 when it should be 0, and then the asymmetry stops it from being corrected with -1 before hitting the BUG. Hi Hugh, So if race happen, still have pages swapout after inode and radix tree destroied. What will happen when the pages need be swapin in the scenacio like swapoff. Regards, Jaegeuk One answer is coherent locking: using tree_lock throughout, without info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on used_blocks makes that messier than expected. Another answer may be a further effort to eliminate the weird shmem_recalc_inode() altogether, but previous attempts at that failed. So far undecided, but for now change the BUG_ON to WARN_ON: in usual circumstances it remains a useful consistency check. Signed-off-by: Hugh Dickins Cc: sta...@vger.kernel.org --- mm/shmem.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- 3.7-rc4/mm/shmem.c 2012-10-14 16:16:58.361309122 -0700 +++ linux/mm/shmem.c2012-11-01 14:31:04.288185742 -0700 @@ -643,7 +643,7 @@ static void shmem_evict_inode(struct ino kfree(info->symlink); simple_xattrs_free(&info->xattrs); - BUG_ON(inode->i_blocks); + WARN_ON(inode->i_blocks); shmem_free_inode(inode->i_sb); clear_inode(inode); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page
On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote: On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote: On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" H. Peter Anvin doesn't like huge zero page which sticks in memory forever after the first allocation. Here's implementation of lockless refcounting for huge zero page. We have two basic primitives: {get,put}_huge_zero_page(). They manipulate reference counter. If counter is 0, get_huge_zero_page() allocates a new huge page and takes two references: one for caller and one for shrinker. We free the page only in shrinker callback if counter is 1 (only shrinker has the reference). put_huge_zero_page() only decrements counter. Counter is never zero in put_huge_zero_page() since shrinker holds on reference. Freeing huge zero page in shrinker callback helps to avoid frequent allocate-free. Refcounting has cost. On 4 socket machine I observe ~1% slowdown on parallel (40 processes) read page faulting comparing to lazy huge page allocation. I think it's pretty reasonable for synthetic benchmark. Hi Kirill, I see your and Andew's hot discussion in v4 resend thread. "I also tried another scenario: usemem -n16 100M -r 1000. It creates real memory pressure - no easy reclaimable memory. This time callback called with nr_to_scan > 0 and we freed hzp. " What's "usemem"? Is it a tool and how to get it? http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar Thanks for your response. But how to use it, I even can't compile the files. # ./case-lru-file-mmap-read ./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory ./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0 (error token is "nr_cpu") # gcc usemem.c -o usemem /tmp/ccFkIDWk.o: In function `do_task': usemem.c:(.text+0x9f2): undefined reference to `pthread_create' usemem.c:(.text+0xa44): undefined reference to `pthread_join' collect2: ld returned 1 exit status It's hard for me to find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your scenario? shrink_slab() calls the callback with nr_to_scan > 0 if system is under pressure -- look for do_shrinker_shrink(). Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this path? I think it also can add memory pressure, where I miss? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page
On 11/19/2012 06:23 PM, Kirill A. Shutemov wrote: On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote: On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote: On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote: On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" H. Peter Anvin doesn't like huge zero page which sticks in memory forever after the first allocation. Here's implementation of lockless refcounting for huge zero page. We have two basic primitives: {get,put}_huge_zero_page(). They manipulate reference counter. If counter is 0, get_huge_zero_page() allocates a new huge page and takes two references: one for caller and one for shrinker. We free the page only in shrinker callback if counter is 1 (only shrinker has the reference). put_huge_zero_page() only decrements counter. Counter is never zero in put_huge_zero_page() since shrinker holds on reference. Freeing huge zero page in shrinker callback helps to avoid frequent allocate-free. Refcounting has cost. On 4 socket machine I observe ~1% slowdown on parallel (40 processes) read page faulting comparing to lazy huge page allocation. I think it's pretty reasonable for synthetic benchmark. Hi Kirill, I see your and Andew's hot discussion in v4 resend thread. "I also tried another scenario: usemem -n16 100M -r 1000. It creates real memory pressure - no easy reclaimable memory. This time callback called with nr_to_scan > 0 and we freed hzp. " What's "usemem"? Is it a tool and how to get it? http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar Thanks for your response. But how to use it, I even can't compile the files. # ./case-lru-file-mmap-read ./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory ./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0 (error token is "nr_cpu") # gcc usemem.c -o usemem -lpthread /tmp/ccFkIDWk.o: In function `do_task': usemem.c:(.text+0x9f2): undefined reference to `pthread_create' usemem.c:(.text+0xa44): undefined reference to `pthread_join' collect2: ld returned 1 exit status It's hard for me to find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your scenario? shrink_slab() calls the callback with nr_to_scan > 0 if system is under pressure -- look for do_shrinker_shrink(). Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this path? I think it also can add memory pressure, where I miss? dd if=large-file only fills pagecache -- easy reclaimable memory. Pagecache will be dropped first, before shrinking slabs. How could I confirm page reclaim working hard and slabs are reclaimed at this time? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page
On 11/19/2012 07:09 PM, Kirill A. Shutemov wrote: On Mon, Nov 19, 2012 at 07:02:22PM +0800, Jaegeuk Hanse wrote: On 11/19/2012 06:23 PM, Kirill A. Shutemov wrote: On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote: On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote: On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote: On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote: From: "Kirill A. Shutemov" H. Peter Anvin doesn't like huge zero page which sticks in memory forever after the first allocation. Here's implementation of lockless refcounting for huge zero page. We have two basic primitives: {get,put}_huge_zero_page(). They manipulate reference counter. If counter is 0, get_huge_zero_page() allocates a new huge page and takes two references: one for caller and one for shrinker. We free the page only in shrinker callback if counter is 1 (only shrinker has the reference). put_huge_zero_page() only decrements counter. Counter is never zero in put_huge_zero_page() since shrinker holds on reference. Freeing huge zero page in shrinker callback helps to avoid frequent allocate-free. Refcounting has cost. On 4 socket machine I observe ~1% slowdown on parallel (40 processes) read page faulting comparing to lazy huge page allocation. I think it's pretty reasonable for synthetic benchmark. Hi Kirill, I see your and Andew's hot discussion in v4 resend thread. "I also tried another scenario: usemem -n16 100M -r 1000. It creates real memory pressure - no easy reclaimable memory. This time callback called with nr_to_scan > 0 and we freed hzp. " What's "usemem"? Is it a tool and how to get it? http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar Thanks for your response. But how to use it, I even can't compile the files. # ./case-lru-file-mmap-read ./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory ./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0 (error token is "nr_cpu") # gcc usemem.c -o usemem -lpthread /tmp/ccFkIDWk.o: In function `do_task': usemem.c:(.text+0x9f2): undefined reference to `pthread_create' usemem.c:(.text+0xa44): undefined reference to `pthread_join' collect2: ld returned 1 exit status It's hard for me to find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your scenario? shrink_slab() calls the callback with nr_to_scan > 0 if system is under pressure -- look for do_shrinker_shrink(). Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this path? I think it also can add memory pressure, where I miss? dd if=large-file only fills pagecache -- easy reclaimable memory. Pagecache will be dropped first, before shrinking slabs. How could I confirm page reclaim working hard and slabs are reclaimed at this time? The only what I see is slabs_scanned in vmstat. Oh, I see. Thanks! :-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
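Since the nr_to_scan question keeps coming up in this sub-thread, here is a minimal sketch of the 3.7-era shrinker protocol that shrink_huge_zero_page() above plugs into; the cache being shrunk is hypothetical. The point is the two phases: shrink_slab() first calls the callback with sc->nr_to_scan == 0 just to ask how many objects are freeable, and only calls it again with nr_to_scan > 0 when page reclaim is under enough pressure to start shrinking slabs, which is why plain page-cache streaming (dd of a large file) rarely reaches that phase.

#include <linux/shrinker.h>
#include <linux/atomic.h>

static atomic_long_t my_cache_objects;	/* hypothetical freeable-object count */

static int my_cache_shrink(struct shrinker *shrink, struct shrink_control *sc)
{
	if (!sc->nr_to_scan)
		/* query phase: report how much could be freed */
		return atomic_long_read(&my_cache_objects);

	/* scan phase: only reached under real memory pressure */
	/* ... free up to sc->nr_to_scan objects here ... */
	return atomic_long_read(&my_cache_objects);
}

static struct shrinker my_cache_shrinker = {
	.shrink	= my_cache_shrink,
	.seeks	= DEFAULT_SEEKS,
};
/* register_shrinker(&my_cache_shrinker) at init time,
 * unregister_shrinker(&my_cache_shrinker) at teardown. */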
Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
On 11/19/2012 12:07 AM, Jiang Liu wrote: The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to resolve an issue caused by inaccurate zone->present_pages, but that fix is incomplete and causes regressions with HIGHMEM. It has been reverted by commit 5576646 revert "mm: fix-up zone present pages". This is a follow-up patchset for the issue above. It introduces a new field named "managed_pages" to struct zone, which counts pages managed by the buddy system in the zone, while zone->present_pages is used to count pages existing in the zone, which is spanned_pages - absent_pages. That way, zone->present_pages is kept consistent with pgdat->node_present_pages, which is the sum of zone->present_pages. This patchset has only been tested on x86_64 with nobootmem.c. So we need help to test this patchset on machines that: 1) use bootmem.c

Is it only x86_32 that uses bootmem.c instead of nobootmem.c? How can I confirm that?

2) have highmem
This patchset applies to "f4a75d2e Linux 3.7-rc6" from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Any comments and help are welcome!

Jiang Liu (5):
  mm: introduce new field "managed_pages" to struct zone
  mm: replace zone->present_pages with zone->managed_pages if appreciated
  mm: set zone->present_pages to number of existing pages in the zone
  mm: provide more accurate estimation of pages occupied by memmap
  mm: increase totalram_pages when free pages allocated by bootmem allocator

 include/linux/mmzone.h |  1 +
 mm/bootmem.c           | 14
 mm/memory_hotplug.c    |  6
 mm/mempolicy.c         |  2 +-
 mm/nobootmem.c         | 15
 mm/page_alloc.c        | 89 +++-
 mm/vmscan.c            | 16 -
 mm/vmstat.c            |  8 +++--
 8 files changed, 108 insertions(+), 43 deletions(-)
Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap
On 11/19/2012 12:07 AM, Jiang Liu wrote: If SPARSEMEM is enabled, it won't build page structures for non-existing pages (holes) within a zone, so provide a more accurate estimation of pages occupied by memmap if there are big holes within the zone. And pages for highmem zones' memmap will be allocated from lowmem, so charge nr_kernel_pages for that. Signed-off-by: Jiang Liu --- mm/page_alloc.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5b327d7..eb25679 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void) #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */ +static unsigned long calc_memmap_size(unsigned long spanned_pages, + unsigned long present_pages) +{ + unsigned long pages = spanned_pages; + + /* +* Provide a more accurate estimation if there are big holes within +* the zone and SPARSEMEM is in use. +*/ + if (spanned_pages > present_pages + (present_pages >> 4) && + IS_ENABLED(CONFIG_SPARSEMEM)) + pages = present_pages; + + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT; +} + /* * Set up the zone data structures: * - mark all pages reserved @@ -4469,8 +4485,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, * is used by this zone for memmap. This affects the watermark * and per-cpu initialisations */ - memmap_pages = - PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT; + memmap_pages = calc_memmap_size(size, realsize); if (freesize >= memmap_pages) { freesize -= memmap_pages; if (memmap_pages) @@ -4491,6 +4506,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, if (!is_highmem_idx(j)) nr_kernel_pages += freesize; + /* Charge for highmem memmap if there are enough kernel pages */ + else if (nr_kernel_pages > memmap_pages * 2) + nr_kernel_pages -= memmap_pages; Since this is in else branch, if nr_kernel_pages is equal to 0 at initially time? nr_all_pages += freesize; zone->spanned_pages = size; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
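For scale, a back-of-the-envelope check of the estimate above, assuming 4 KiB pages and a 64-byte struct page: a zone spanning 1,048,576 pages (4 GiB) needs 64 MiB of memmap, i.e. 16,384 pages, roughly 1.6% of the zone. calc_memmap_size() charges spanned_pages unless the holes exceed 1/16 of present_pages with SPARSEMEM enabled, since SPARSEMEM allocates no memmap for the holes. The sketch below is a user-space mirror of that arithmetic only, not the kernel function itself.

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define STRUCT_PAGE_SIZE 64UL	/* assumption: typical x86_64 struct page size */

/* mirrors the estimate in calc_memmap_size(), with SPARSEMEM assumed */
static unsigned long calc_memmap_size(unsigned long spanned, unsigned long present)
{
	unsigned long pages = spanned;

	if (spanned > present + (present >> 4))
		pages = present;	/* big holes: charge only existing pages */

	return PAGE_ALIGN(pages * STRUCT_PAGE_SIZE) >> PAGE_SHIFT;
}

int main(void)
{
	/* 4 GiB zone, no holes: 16384 memmap pages (64 MiB, ~1.6%) */
	printf("%lu\n", calc_memmap_size(1UL << 20, 1UL << 20));
	/* same span, only half present: charge for present pages only */
	printf("%lu\n", calc_memmap_size(1UL << 20, 1UL << 19));
	return 0;
}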
Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages
On 11/20/2012 10:43 AM, Jiang Liu wrote: On 2012-11-20 10:13, Jaegeuk Hanse wrote: On 11/19/2012 12:07 AM, Jiang Liu wrote: The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to resolve an issue caused by inaccurate zone->present_pages, but that fix is incomplete and causes regressions with HIGHMEM. It has been reverted by commit 5576646 revert "mm: fix-up zone present pages". This is a follow-up patchset for the issue above. It introduces a new field named "managed_pages" to struct zone, which counts pages managed by the buddy system in the zone, while zone->present_pages is used to count pages existing in the zone, which is spanned_pages - absent_pages. That way, zone->present_pages is kept consistent with pgdat->node_present_pages, which is the sum of zone->present_pages. This patchset has only been tested on x86_64 with nobootmem.c. So we need help to test this patchset on machines that: 1) use bootmem.c

Is it only x86_32 that uses bootmem.c instead of nobootmem.c? How can I confirm that?

Hi Jaegeuk, Thanks for reviewing this patch set. Currently x86/x86_64/Sparc have been converted to use nobootmem.c, and the other arches still use bootmem.c. So it needs to be tested on other arches, such as ARM etc. Yesterday we tested this patchset on an Itanium platform, so bootmem.c should work as expected too.

Hi Jiang, Were any x86/x86_64 code changes made specifically for the nobootmem.c logic? I mean, if the "config NO_BOOTMEM / def_bool y" entry were removed from arch/x86/Kconfig, could x86/x86_64 still make use of bootmem.c?

Regards, Jaegeuk

Thanks! Gerry
Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
On 11/01/2012 05:44 PM, Wen Congyang wrote: From: Yasuaki Ishimatsu Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. So the patch add unregister_memory_section() into __remove_section(). Hi Yasuaki, In order to review this patch, I should dig sparse memory codes in advance. But I have some confuse of codes. Why need encode/decode mem map instead of set mem_map to ms->section_mem_map directly? Regards, Jaegeuk CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- mm/memory_hotplug.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ca07433..66a79a7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone *zone, #ifdef CONFIG_SPARSEMEM_VMEMMAP static int __remove_section(struct zone *zone, struct mem_section *ms) { - /* -* XXX: Freeing memmap with vmemmap is not implement yet. -* This should be removed later. -*/ - return -EBUSY; + int ret = -EINVAL; + + if (!valid_section(ms)) + return ret; + + ret = unregister_memory_section(ms); + + return ret; } #else static int __remove_section(struct zone *zone, struct mem_section *ms) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
On 11/20/2012 02:55 PM, Wen Congyang wrote: At 11/20/2012 02:22 PM, Jaegeuk Hanse Wrote: On 11/01/2012 05:44 PM, Wen Congyang wrote: From: Yasuaki Ishimatsu Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. So the patch add unregister_memory_section() into __remove_section(). Hi Yasuaki, In order to review this patch, I should dig sparse memory codes in advance. But I have some confuse of codes. Why need encode/decode mem map instead of set mem_map to ms->section_mem_map directly? The memmap is aligned, and the low bits are zero. We store some information in these bits. So we need to encode/decode memmap here. Hi Congyang, Thanks for you reponse. But I mean why return (unsigned long)(mem_map - (section_nr_to_pfn(pnum))); in function sparse_encode_mem_map, and then return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum); in funtion sparse_decode_mem_map instead of just store mem_map in ms->section_mep_map directly. Regards, Jaegeuk Thanks Wen Congyang Regards, Jaegeuk CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- mm/memory_hotplug.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ca07433..66a79a7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone *zone, #ifdef CONFIG_SPARSEMEM_VMEMMAP static int __remove_section(struct zone *zone, struct mem_section *ms) { -/* - * XXX: Freeing memmap with vmemmap is not implement yet. - * This should be removed later. - */ -return -EBUSY; +int ret = -EINVAL; + +if (!valid_section(ms)) +return ret; + +ret = unregister_memory_section(ms); + +return ret; } #else static int __remove_section(struct zone *zone, struct mem_section *ms) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
On 11/20/2012 05:37 PM, Wen Congyang wrote: At 11/20/2012 02:58 PM, Jaegeuk Hanse Wrote: On 11/20/2012 02:55 PM, Wen Congyang wrote: At 11/20/2012 02:22 PM, Jaegeuk Hanse Wrote: On 11/01/2012 05:44 PM, Wen Congyang wrote: From: Yasuaki Ishimatsu Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. So the patch add unregister_memory_section() into __remove_section(). Hi Yasuaki, In order to review this patch, I should dig sparse memory codes in advance. But I have some confuse of codes. Why need encode/decode mem map instead of set mem_map to ms->section_mem_map directly? The memmap is aligned, and the low bits are zero. We store some information in these bits. So we need to encode/decode memmap here. Hi Congyang, Thanks for you reponse. But I mean why return (unsigned long)(mem_map - (section_nr_to_pfn(pnum))); in function sparse_encode_mem_map, and then return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum); in funtion sparse_decode_mem_map instead of just store mem_map in ms->section_mep_map directly. I don't know why. I try to find the reason, but I don't find any place to use the pfn stored in the mem_map except in the decode function. Maybe the designer doesn't want us to access the mem_map directly. It seems that mem_map is per node, but pfn is real pfn. you can check __page_to_pfn. Thanks Wen Congyang Regards, Jaegeuk Thanks Wen Congyang Regards, Jaegeuk CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- mm/memory_hotplug.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ca07433..66a79a7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone *zone, #ifdef CONFIG_SPARSEMEM_VMEMMAP static int __remove_section(struct zone *zone, struct mem_section *ms) { -/* - * XXX: Freeing memmap with vmemmap is not implement yet. - * This should be removed later. - */ -return -EBUSY; +int ret = -EINVAL; + +if (!valid_section(ms)) +return ret; + +ret = unregister_memory_section(ms); + +return ret; } #else static int __remove_section(struct zone *zone, struct mem_section *ms) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
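[To make the __page_to_pfn() hint above concrete, the classic (non-vmemmap) SPARSEMEM accessors look roughly like this, paraphrased from include/asm-generic/memory_model.h; __section_mem_map_addr() returns the encoded pointer with the low flag bits masked off.]

#define __page_to_pfn(pg)						\
({	const struct page *__pg = (pg);					\
	int __sec = page_to_section(__pg);				\
	(unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec)));	\
})

#define __pfn_to_page(pfn)						\
({	unsigned long __pfn = (pfn);					\
	struct mem_section *__sec = __pfn_to_section(__pfn);		\
	__section_mem_map_addr(__sec) + __pfn;				\
})

Since the stored pointer is pre-biased by the section's first pfn, the subtraction gives the absolute pfn, and __pfn_to_page() can add the absolute pfn rather than a section-relative offset.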
Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
On 11/01/2012 05:44 PM, Wen Congyang wrote: From: Yasuaki Ishimatsu Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. So the patch add unregister_memory_section() into __remove_section(). Hi Yasuaki, I have a question about these sparse vmemmap memory related patches. Hot add memory need allocated vmemmap pages, but this time is allocated by buddy system. How can gurantee virtual address is continuous to the address allocated before? If not continuous, page_to_pfn and pfn_to_page can't work correctly. Regards, Jaegeuk CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- mm/memory_hotplug.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ca07433..66a79a7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone *zone, #ifdef CONFIG_SPARSEMEM_VMEMMAP static int __remove_section(struct zone *zone, struct mem_section *ms) { - /* -* XXX: Freeing memmap with vmemmap is not implement yet. -* This should be removed later. -*/ - return -EBUSY; + int ret = -EINVAL; + + if (!valid_section(ms)) + return ret; + + ret = unregister_memory_section(ms); + + return ret; } #else static int __remove_section(struct zone *zone, struct mem_section *ms) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Add movablecore_map boot option.
On 11/20/2012 07:07 PM, Yasuaki Ishimatsu wrote: 2012/11/20 5:53, Andrew Morton wrote: On Mon, 19 Nov 2012 22:27:21 +0800 Tang Chen wrote: This patchset provide a boot option for user to specify ZONE_MOVABLE memory map for each node in the system. movablecore_map=nn[KMG]@ss[KMG] This option make sure memory range from ss to ss+nn is movable memory. 1) If the range is involved in a single node, then from ss to the end of the node will be ZONE_MOVABLE. 2) If the range covers two or more nodes, then from ss to the end of the node will be ZONE_MOVABLE, and all the other nodes will only have ZONE_MOVABLE. 3) If no range is in the node, then the node will have no ZONE_MOVABLE unless kernelcore or movablecore is specified. 4) This option could be specified at most MAX_NUMNODES times. 5) If kernelcore or movablecore is also specified, movablecore_map will have higher priority to be satisfied. 6) This option has no conflict with memmap option. This doesn't describe the problem which the patchset solves. I can kinda see where it's coming from, but it would be nice to have it all spelled out, please. - What is wrong with the kernel as it stands? If we hot remove a memroy, the memory cannot have kernel memory, because Linux cannot migrate kernel memory currently. Therefore, we have to guarantee that the hot removed memory has only movable memoroy. Linux has two boot options, kernelcore= and movablecore=, for creating movable memory. These boot options can specify the amount of memory use as kernel or movable memory. Using them, we can create ZONE_MOVABLE which has only movable memory. But it does not fulfill a requirement of memory hot remove, because even if we specify the boot options, movable memory is distributed in each node evenly. So when we want to hot remove memory which memory range is 0x8000-0c000, we have no way to specify the memory as movable memory. Could you explain why can't specify the memory as movable memory in this case? So we proposed a new feature which specifies memory range to use as movable memory. - What are the possible ways of solving this? I thought 2 ways to specify movable memory. 1. use firmware information 2. use boot option 1. use firmware information According to ACPI spec 5.0, SRAT table has memory affinity structure and the structure has Hot Pluggable Filed. See "5.2.16.2 Memory Affinity Structure". If we use the information, we might be able to specify movable memory by firmware. For example, if Hot Pluggable Filed is enabled, Linux sets the memory as movable memory. 2. use boot option This is our proposal. New boot option can specify memory range to use as movable memory. - Describe the chosen way, explain why it is superior to alternatives We chose second way, because if we use first way, users cannot change memory range to use as movable memory easily. We think if we create movable memory, performance regression may occur by NUMA. In this case, Could you explain why regression occur in details? user can turn off the feature easily if we prepare the boot option. And if we prepare the boot optino, the user can select which memory to use as movable memory easily. Thanks, Yasuaki Ishimatsu The amount of manual system configuration in this proposal looks quite high. Adding kernel boot parameters really is a last resort. Why was it unavoidable here? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
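[As an illustration of rules 1) through 3) above, here is a hypothetical helper for deciding where ZONE_MOVABLE would start on a given node. All names are invented for the sketch; this is not taken from the patchset.]

struct movable_entry {
	u64 start;	/* ss */
	u64 end;	/* ss + nn */
};

/* Return the physical address at which ZONE_MOVABLE begins on the node
 * spanning [nd_start, nd_end), or (u64)-1 if the node has none. */
static u64 movable_start_for_node(u64 nd_start, u64 nd_end,
				  const struct movable_entry *map, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* rule 1: a range begins inside this node, so everything
		 * from ss to the end of the node is ZONE_MOVABLE */
		if (map[i].start >= nd_start && map[i].start < nd_end)
			return map[i].start;
		/* rule 2: a range that began on a lower node spills into
		 * this one, so the whole node becomes ZONE_MOVABLE */
		if (map[i].start < nd_start && map[i].end > nd_start)
			return nd_start;
	}
	/* rule 3: no range touches this node */
	return (u64)-1;
}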
Re: fadvise interferes with readahead
On 11/20/2012 04:04 PM, Fengguang Wu wrote: Hi Claudio, Thanks for the detailed problem description! On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote: Hi. First of all, I'm not subscribed to this list, so I'd suggest all replies copy me personally. I have been trying to implement some I/O pipelining in Postgres (ie: read the next data page asynchronously while working on the current page), and stumbled upon some puzzling behavior involving the interaction between fadvise and readahead. I'm running kernel 3.0.0 (debian testing), on a single-disk system which, though unsuitable for database workloads, is slow enough to let me experiment with these read-ahead issues. Typical random I/O performance is on the order of between 150 r/s to 200 r/s (ballpark 7200rpm I'd say), with thoughput around 1.5MB/s. Sequential I/O can go up to 60MB/s, though it tends to be around 50. Now onto the problem. In order to parallelize I/O with computation, I've made postgres fadvise(willneed) the pages it will read next. How far ahead is configurable, and I've tested with a number of configurations. The prefetching logic is aware of the OS and pg-specific cache, so it will only fadvise a block once. fadvise calls will stay 1 (or a configurable N) real I/O ahead of read calls, and there's no fadvising of pages that won't be read eventually, in the same order. I checked with strace. However, performance when fadvising drops considerably for a specific yet common access pattern: When a nested loop with two index scans happens, access is random locally, but eventually whole ranges of a file get read (in this random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2 4 5 101 298 301". Though random, there are ranges there that can be merged in one read-request. The kernel seems to do the merge by applying some form of readahead, not sure if it's context, ondemand or adaptive readahead on the 3.0.0 kernel. Anyway, it seems to do readahead, as iostat says: Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 4.40 224.202.00 4.16 0.03 37.86 1.918.438.00 56.80 4.40 99.44 (notice the avgrq-sz of 37.8) With fadvise calls, the thing looks a lot different: Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.0018.00 226.801.00 1.80 0.07 16.81 4.00 17.52 17.23 82.40 4.39 99.92 FYI, there is a readahead tracing/stats patchset that can provide far more accurate numbers about what's going on with readahead, which will help eliminate lots of the guess works here. https://lwn.net/Articles/472798/ Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's spot-on with a postgres page (8k). So, fadvise seems to carry out the requests verbatim, while read manages to merge at least two of them. The random nature of reads makes me think the scheduler is failing to merge the requests in both cases (rrqm/s = 0), because it only looks at successive requests (I'm only guessing here though). I guess it's not a merging problem, but that the kernel readahead code manages to submit larger IO requests in the first place. Looking into the kernel code, it seems the problem could be related to how fadvise works in conjunction with readahead. fadvise seems to call the function in readahead.c that schedules the asynchornous I/O[0]. It doesn't seem subject to readahead logic itself[1], which in on itself doesn't seem bad. 
But it does, I assume (not knowing the code that well), prevent readahead logic[2] to eventually see the pattern. It effectively disables readahead altogether. You are right. If user space does fadvise() and the fadvised pages cover all read() pages, the kernel readahead code will not run at all. So the title is actually a bit misleading. The kernel readahead won't interfere with user space prefetching at all. ;) This, I theorize, may be because after the fadvise call starts an async I/O on the page, further reads won't hit readahead code because of the page cache[3] (!PageUptodate I imagine). Whether this is desirable or not is not really obvious. In this particular case, doing fadvise calls in what would seem an optimum way, results in terribly worse performance. So I'd suggest it's not really that advisable. Yes. The kernel readahead code by design will outperform simple fadvise in the case of clustered random reads. Imagine the access pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs literally. While You mean it will trigger 6 IOs in the POSIX_FADV_RANDOM case or POSIX_FADV_WILLNEED case? kernel readahead will likely trigger 3 IOs for 1, 3, 2-9. Because on the page miss for 2, it will detect the existence of history page 1 and do readahead properly. For hard disks, it's mainly the number of If the first IO read 1, it will call page_
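[For readers following along, a minimal user-space sketch of the prefetching pattern Claudio describes: stay one real I/O ahead of the read() calls with POSIX_FADV_WILLNEED. The file name and block list are made up for illustration.]

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLKSZ 8192	/* one PostgreSQL-sized page */

int main(void)
{
	/* locally random block order, as in the nested-loop example above */
	long blocks[] = { 1, 6, 8, 100, 34, 299, 3, 7, 68, 24 };
	int n = sizeof(blocks) / sizeof(blocks[0]);
	char buf[BLKSZ];
	int i, fd = open("datafile", O_RDONLY);	/* hypothetical file */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < n; i++) {
		/* hint the next block while still working on the current one */
		if (i + 1 < n)
			posix_fadvise(fd, blocks[i + 1] * BLKSZ, BLKSZ,
				      POSIX_FADV_WILLNEED);
		if (pread(fd, buf, BLKSZ, blocks[i] * BLKSZ) < 0)
			perror("pread");
		/* ... process buf ... */
	}
	close(fd);
	return 0;
}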
Re: fadvise interferes with readahead
On 11/20/2012 04:04 PM, Fengguang Wu wrote: Hi Claudio, Thanks for the detailed problem description! Hi Fengguang, Another question, thanks in advance. What's the meaning of interleaved reads? If the first process readahead from start ~ start + size - async_size, another process read start + size - aysnc_size + 1, then what will happen? It seems that variable hit_readahead_marker is false, and related codes can't run, where I miss? Regards, Jaegeuk On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote: Hi. First of all, I'm not subscribed to this list, so I'd suggest all replies copy me personally. I have been trying to implement some I/O pipelining in Postgres (ie: read the next data page asynchronously while working on the current page), and stumbled upon some puzzling behavior involving the interaction between fadvise and readahead. I'm running kernel 3.0.0 (debian testing), on a single-disk system which, though unsuitable for database workloads, is slow enough to let me experiment with these read-ahead issues. Typical random I/O performance is on the order of between 150 r/s to 200 r/s (ballpark 7200rpm I'd say), with thoughput around 1.5MB/s. Sequential I/O can go up to 60MB/s, though it tends to be around 50. Now onto the problem. In order to parallelize I/O with computation, I've made postgres fadvise(willneed) the pages it will read next. How far ahead is configurable, and I've tested with a number of configurations. The prefetching logic is aware of the OS and pg-specific cache, so it will only fadvise a block once. fadvise calls will stay 1 (or a configurable N) real I/O ahead of read calls, and there's no fadvising of pages that won't be read eventually, in the same order. I checked with strace. However, performance when fadvising drops considerably for a specific yet common access pattern: When a nested loop with two index scans happens, access is random locally, but eventually whole ranges of a file get read (in this random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2 4 5 101 298 301". Though random, there are ranges there that can be merged in one read-request. The kernel seems to do the merge by applying some form of readahead, not sure if it's context, ondemand or adaptive readahead on the 3.0.0 kernel. Anyway, it seems to do readahead, as iostat says: Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 4.40 224.202.00 4.16 0.03 37.86 1.918.438.00 56.80 4.40 99.44 (notice the avgrq-sz of 37.8) With fadvise calls, the thing looks a lot different: Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.0018.00 226.801.00 1.80 0.07 16.81 4.00 17.52 17.23 82.40 4.39 99.92 FYI, there is a readahead tracing/stats patchset that can provide far more accurate numbers about what's going on with readahead, which will help eliminate lots of the guess works here. https://lwn.net/Articles/472798/ Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's spot-on with a postgres page (8k). So, fadvise seems to carry out the requests verbatim, while read manages to merge at least two of them. The random nature of reads makes me think the scheduler is failing to merge the requests in both cases (rrqm/s = 0), because it only looks at successive requests (I'm only guessing here though). I guess it's not a merging problem, but that the kernel readahead code manages to submit larger IO requests in the first place. 
Looking into the kernel code, it seems the problem could be related to how fadvise works in conjunction with readahead. fadvise seems to call the function in readahead.c that schedules the asynchornous I/O[0]. It doesn't seem subject to readahead logic itself[1], which in on itself doesn't seem bad. But it does, I assume (not knowing the code that well), prevent readahead logic[2] to eventually see the pattern. It effectively disables readahead altogether. You are right. If user space does fadvise() and the fadvised pages cover all read() pages, the kernel readahead code will not run at all. So the title is actually a bit misleading. The kernel readahead won't interfere with user space prefetching at all. ;) This, I theorize, may be because after the fadvise call starts an async I/O on the page, further reads won't hit readahead code because of the page cache[3] (!PageUptodate I imagine). Whether this is desirable or not is not really obvious. In this particular case, doing fadvise calls in what would seem an optimum way, results in terribly worse performance. So I'd suggest it's not really that advisable. Yes. The kernel readahead code by design will outperform simple fadvise in the case of clustered random reads. Imagine the access pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs
Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
On 11/21/2012 11:05 AM, Wen Congyang wrote: At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote: On 11/01/2012 05:44 PM, Wen Congyang wrote: From: Yasuaki Ishimatsu Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. So the patch add unregister_memory_section() into __remove_section(). Hi Yasuaki, I have a question about these sparse vmemmap memory related patches. Hot add memory need allocated vmemmap pages, but this time is allocated by buddy system. How can gurantee virtual address is continuous to the address allocated before? If not continuous, page_to_pfn and pfn_to_page can't work correctly. vmemmap has its virtual address range: ea00 - eaff (=40 bits) virtual memory map (1TB) We allocate memory from buddy system to store struct page, and its virtual address isn't in this range. So we should update the page table: kmalloc_section_memmap() sparse_mem_map_populate() pfn_to_page() // get the virtual address in the vmemmap range vmemmap_populate() // we update page table here When we use vmemmap, page_to_pfn() always returns address in the vmemmap range, not the address that kmalloc() returns. So the virtual address is continuous. Hi Congyang, Another question about memory hotplug. During hot remove memory, it will also call memblock_remove to remove related memblock. memblock_remove() __memblock_remove() memblock_isolate_range() memblock_remove_region() But memblock_isolate_range() only record fully contained regions, regions which are partial overlapped just be splitted instead of record. So these partial overlapped regions can't be removed. Where I miss? Regards, Jaegeuk Thanks Wen Congyang Regards, Jaegeuk CC: David Rientjes CC: Jiang Liu CC: Len Brown CC: Christoph Lameter Cc: Minchan Kim CC: Andrew Morton CC: KOSAKI Motohiro CC: Wen Congyang Signed-off-by: Yasuaki Ishimatsu --- mm/memory_hotplug.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ca07433..66a79a7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone *zone, #ifdef CONFIG_SPARSEMEM_VMEMMAP static int __remove_section(struct zone *zone, struct mem_section *ms) { -/* - * XXX: Freeing memmap with vmemmap is not implement yet. - * This should be removed later. - */ -return -EBUSY; +int ret = -EINVAL; + +if (!valid_section(ms)) +return ret; + +ret = unregister_memory_section(ms); + +return ret; } #else static int __remove_section(struct zone *zone, struct mem_section *ms) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
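[A rough sketch of the call chain Wen describes, condensed into one hypothetical helper; the real code is split across kmalloc_section_memmap()/sparse_mem_map_populate(), and vmemmap_populate()'s exact prototype has varied between kernel versions.]

/* hypothetical condensation of the SPARSEMEM_VMEMMAP hot-add path */
static struct page *populate_section_memmap_sketch(unsigned long pnum, int nid)
{
	/* fixed virtual address inside the vmemmap range for this section;
	 * for a hot-added section nothing is mapped there yet */
	struct page *map = pfn_to_page(pnum * PAGES_PER_SECTION);
	unsigned long start = (unsigned long)map;
	unsigned long end = (unsigned long)(map + PAGES_PER_SECTION);

	/* allocate backing pages (from the buddy allocator after boot) and
	 * install page-table entries so that [start, end) is addressable */
	if (vmemmap_populate(start, end, nid))
		return NULL;

	/* page_to_pfn()/pfn_to_page() now work for the new section: the
	 * memmap stays virtually contiguous regardless of where the
	 * backing pages sit physically */
	return map;
}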
Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP
On 11/21/2012 12:42 PM, Wen Congyang wrote: At 11/21/2012 12:22 PM, Jaegeuk Hanse Wrote: On 11/21/2012 11:05 AM, Wen Congyang wrote: At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote: On 11/01/2012 05:44 PM, Wen Congyang wrote: From: Yasuaki Ishimatsu Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section. So the patch add unregister_memory_section() into __remove_section(). Hi Yasuaki, I have a question about these sparse vmemmap memory related patches. Hot add memory need allocated vmemmap pages, but this time is allocated by buddy system. How can gurantee virtual address is continuous to the address allocated before? If not continuous, page_to_pfn and pfn_to_page can't work correctly. vmemmap has its virtual address range: ea00 - eaff (=40 bits) virtual memory map (1TB) We allocate memory from buddy system to store struct page, and its virtual address isn't in this range. So we should update the page table: kmalloc_section_memmap() sparse_mem_map_populate() pfn_to_page() // get the virtual address in the vmemmap range vmemmap_populate() // we update page table here When we use vmemmap, page_to_pfn() always returns address in the vmemmap range, not the address that kmalloc() returns. So the virtual address is continuous. Hi Congyang, Another question about memory hotplug. During hot remove memory, it will also call memblock_remove to remove related memblock. IIRC, we don't touch memblock when hot-add/hot-remove memory. memblock is only used for bootmem allocator. I think it isn't used after booting. In IBM pseries servers. pseries_remove_memory() pseries_remove_memblock() memblock_remove() Furthermore, memblock is set to record available memory ranges get from e820 map(you can check it in memblock_x86_fill()) in x86 case, after hot-remove memory, this range of memory can't be available, why not remove them as pseries servers' codes do. memblock_remove() __memblock_remove()memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP memblock_isolate_range() memblock_remove_region() But memblock_isolate_range() only record fully contained regions, regions which are partial overlapped just be splitted instead of record. So these partial overlapped regions can't be removed. Where I miss? No, memblock_isolate_range() can deal with partial overlapped region. = if (rbase < base) { /* * @rgn intersects from below. Split and continue * to process the next region - the new top half. */ rgn->base = base; rgn->size -= base - rbase; type->total_size -= base - rbase; memblock_insert_region(type, i, rbase, base - rbase, memblock_get_region_node(rgn)); } else if (rend > end) { /* * @rgn intersects from above. Split and redo the * current region - the new bottom half. */ rgn->base = end; rgn->size -= end - rbase; type->total_size -= end - rbase; memblock_insert_region(type, i--, rbase, end - rbase, memblock_get_region_node(rgn)); = If the region is partial overlapped region, we will split the old region into two regions. After doing this, it is full contained region now. You are right, I misunderstand the codes. 
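[A tiny hypothetical example of the splitting behaviour quoted above; addresses are invented, and this would only make sense early in boot while memblock is still live.]

static int __init memblock_split_demo(void)
{
	/* one memory region covering [4G, 6G) */
	memblock_add(0x100000000ULL, 0x80000000ULL);

	/*
	 * Remove [5G, 5G + 512M): memblock_isolate_range() first splits the
	 * 2G region so that the requested range becomes a fully contained
	 * region of its own, then memblock_remove_region() drops only that
	 * piece, leaving [4G, 5G) and [5G + 512M, 6G) behind.
	 */
	memblock_remove(0x140000000ULL, 0x20000000ULL);

	return 0;
}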
Re: fadvise interferes with readahead
On 11/20/2012 11:15 PM, Fengguang Wu wrote: On Tue, Nov 20, 2012 at 10:11:54PM +0800, Jaegeuk Hanse wrote: On 11/20/2012 04:04 PM, Fengguang Wu wrote: Hi Claudio, Thanks for the detailed problem description! Hi Fengguang, Another question, thanks in advance. What's the meaning of interleaved reads? If the first process It's access patterns like 1, 1001, 2, 1002, 3, 1003, ... in which there are two (or more) mixed sequential read streams. readahead from start ~ start + size - async_size, another process read start + size - aysnc_size + 1, then what will happen? It seems that variable hit_readahead_marker is false, and related codes can't run, where I miss? Yes hit_readahead_marker will be false. However on reading 1002, hit_readahead_marker()/count_history_pages() will find the previous page 1001 already in page cache and trigger context readahead. Hi Fengguang, Thanks for your explaination, the comment in function ondemand_readahead, "Hit a marked page without valid readahead state". What's the meaning of "without valid readahead state"? Regards, Jaegeuk Thanks, Fengguang On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote: Hi. First of all, I'm not subscribed to this list, so I'd suggest all replies copy me personally. I have been trying to implement some I/O pipelining in Postgres (ie: read the next data page asynchronously while working on the current page), and stumbled upon some puzzling behavior involving the interaction between fadvise and readahead. I'm running kernel 3.0.0 (debian testing), on a single-disk system which, though unsuitable for database workloads, is slow enough to let me experiment with these read-ahead issues. Typical random I/O performance is on the order of between 150 r/s to 200 r/s (ballpark 7200rpm I'd say), with thoughput around 1.5MB/s. Sequential I/O can go up to 60MB/s, though it tends to be around 50. Now onto the problem. In order to parallelize I/O with computation, I've made postgres fadvise(willneed) the pages it will read next. How far ahead is configurable, and I've tested with a number of configurations. The prefetching logic is aware of the OS and pg-specific cache, so it will only fadvise a block once. fadvise calls will stay 1 (or a configurable N) real I/O ahead of read calls, and there's no fadvising of pages that won't be read eventually, in the same order. I checked with strace. However, performance when fadvising drops considerably for a specific yet common access pattern: When a nested loop with two index scans happens, access is random locally, but eventually whole ranges of a file get read (in this random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2 4 5 101 298 301". Though random, there are ranges there that can be merged in one read-request. The kernel seems to do the merge by applying some form of readahead, not sure if it's context, ondemand or adaptive readahead on the 3.0.0 kernel. 
Anyway, it seems to do readahead, as iostat says: Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 4.40 224.202.00 4.16 0.03 37.86 1.918.438.00 56.80 4.40 99.44 (notice the avgrq-sz of 37.8) With fadvise calls, the thing looks a lot different: Device: rrqm/s wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.0018.00 226.801.00 1.80 0.07 16.81 4.00 17.52 17.23 82.40 4.39 99.92 FYI, there is a readahead tracing/stats patchset that can provide far more accurate numbers about what's going on with readahead, which will help eliminate lots of the guess works here. https://lwn.net/Articles/472798/ Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's spot-on with a postgres page (8k). So, fadvise seems to carry out the requests verbatim, while read manages to merge at least two of them. The random nature of reads makes me think the scheduler is failing to merge the requests in both cases (rrqm/s = 0), because it only looks at successive requests (I'm only guessing here though). I guess it's not a merging problem, but that the kernel readahead code manages to submit larger IO requests in the first place. Looking into the kernel code, it seems the problem could be related to how fadvise works in conjunction with readahead. fadvise seems to call the function in readahead.c that schedules the asynchornous I/O[0]. It doesn't seem subject to readahead logic itself[1], which in on itself doesn't seem bad. But it does, I assume (not knowing the code that well), prevent readahead logic[2] to eventually see the pattern. It effectively disables readahead altogether. You are right. If user space does fadvise() and the
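[For anyone who wants to reproduce the "interleaved reads" case discussed above, a small user-space sketch that generates two sequential streams through the same file, interleaved page by page (1, 1001, 2, 1002, ...). File name and sizes are arbitrary.]

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define PG 4096

int main(int argc, char **argv)
{
	char buf[PG];
	long i;
	int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 1; i <= 500; i++) {
		pread(fd, buf, PG, i * PG);		/* stream A: page i */
		pread(fd, buf, PG, (1000 + i) * PG);	/* stream B: page 1000+i */
	}
	close(fd);
	return 0;
}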
Re: fadvise interferes with readahead
On 11/20/2012 10:58 PM, Fengguang Wu wrote: On Tue, Nov 20, 2012 at 10:34:11AM -0300, Claudio Freire wrote: On Tue, Nov 20, 2012 at 5:04 AM, Fengguang Wu wrote: Yes. The kernel readahead code by design will outperform simple fadvise in the case of clustered random reads. Imagine the access pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs literally. While kernel readahead will likely trigger 3 IOs for 1, 3, 2-9. Because on the page miss for 2, it will detect the existence of history page 1 and do readahead properly. For hard disks, it's mainly the number of IOs that matters. So even if kernel readahead loses some opportunities to do async IO and possibly loads some extra pages that will never be used, it still manges to perform much better. The fix would lay in fadvise, I think. It should update readahead tracking structures. Alternatively, one could try to do it in do_generic_file_read, updating readahead on !PageUptodate or even on page cache hits. I really don't have the expertise or time to go modifying, building and testing the supposedly quite simple patch that would fix this. It's mostly about the testing, in fact. So if someone can comment or try by themselves, I guess it would really benefit those relying on fadvise to fix this behavior. One possible solution is to try the context readahead at fadvise time to check the existence of history pages and do readahead accordingly. However it will introduce *real interferences* between kernel readahead and user prefetching. The original scheme is, once user space starts its own informed prefetching, kernel readahead will automatically stand out of the way. I understand that would seem like a reasonable design, but in this particular case it doesn't seem to be. I propose that in most cases it doesn't really work well as a design decision, to make fadvise work as direct I/O. Precisely because fadvise is supposed to be a hint to let the kernel make better decisions, and not a request to make the kernel stop making decisions. Any interference so introduced wouldn't be any worse than the interference introduced by readahead over reads. I agree, if fadvise were to trigger readahead, it could be bad for applications that don't read what they say the will. Right. But if cache hits were to simply update readahead state, it would only mean that read calls behave the same regardless of fadvise calls. I think that's worth pursuing. Here you are describing an alternative solution that will somehow trap into the readahead code even when, for example, the application is accessing once and again an already cached file? I'm afraid this will add non-trivial overheads and is less attractive than the "readahead on fadvise" solution. Hi Fengguang, Page cache sync readahead only triggered when cache miss, but if file has already cached, how can readahead be trigged again if the application is accessing once and again an already cached file. Regards, Jaegeuk I ought to try to prepare a patch for this to illustrate my point. Not sure I'll be able to though. I'd be glad to materialize the readahead on fadvise proposal, if there are no obvious negative examples/cases. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
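[For reference, this is roughly how the buffered read path hands control to readahead, heavily condensed from do_generic_file_read() in mm/filemap.c with locking and error handling omitted. It also shows why a fully cached file never re-enters the readahead code: neither branch fires once every page is present and the PG_readahead marks have been consumed.]

page = find_get_page(mapping, index);
if (!page) {
	/* cache miss: sync readahead reads this page plus a window ahead */
	page_cache_sync_readahead(mapping, ra, filp,
				  index, last_index - index);
	page = find_get_page(mapping, index);
}
if (page && PageReadahead(page)) {
	/* hit the PG_readahead marker: pipeline the next window */
	page_cache_async_readahead(mapping, ra, filp, page,
				   index, last_index - index);
}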
Re: Problem in Page Cache Replacement
Cc Fengguang Wu. On 11/21/2012 04:13 PM, metin d wrote: Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening? Thank you, Metin On Tue 20-11-12 09:42:42, metin d wrote: I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB. I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk. I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days. Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem. Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is: $ uname -r 3.2.28-45.62.amzn1.x86_64 Edit: and it seems that I use one NUMA instance, if you think that it can a problem. $ numactl --hardware available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 70007 MB node 0 free: 360 MB node distances: node 0 0: 10 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem in Page Cache Replacement
On 11/21/2012 05:02 PM, Fengguang Wu wrote: On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: Cc Fengguang Wu. On 11/21/2012 04:13 PM, metin d wrote: Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening? We may debug it this way. 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages (please double check via /proc/vmstat whether it does the expected work) 2) run 'page-types -r' with root, to view the page status for the remaining pages of data-1 The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" page-types can be found in the kernel source tree tools/vm/page-types.c Sorry that sounds a bit twisted.. I do have a patch to directly dump page cache status of a user specified file, however it's not upstreamed yet. Hi Fengguang, Thanks for you detail steps, I think metin can have a try. flagspage-count MB symbolic-flags long-symbolic-flags 0x607699 2373 ___ 0x0001343227 1340 ___r___reserved But I have some questions of the print of page-type: Is 2373MB here mean total memory in used include page cache? I don't think so. Which kind of pages will be marked reserved? Which line of long-symbolic-flags is for page cache? Regards, Jaegeuk Thanks, Fengguang On Tue 20-11-12 09:42:42, metin d wrote: I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB. I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk. I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days. Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem. Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is: $ uname -r 3.2.28-45.62.amzn1.x86_64 Edit: and it seems that I use one NUMA instance, if you think that it can a problem. 
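[In case the ext3-tools binary is not at hand, a minimal stand-in for the 'fadvise data-2 0 0 dontneed' step above. This is only a sketch; dirty pages should be synced first or DONTNEED may quietly skip them.]

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, err;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* offset 0, len 0 means "from the start to the end of the file" */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
	close(fd);
	return 0;
}

The drop can then be verified by re-running fincore on the file or by watching nr_file_pages in /proc/vmstat.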
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
  0:  10
Re: Problem in Page Cache Replacement
On 11/22/2012 05:34 AM, Johannes Weiner wrote: Hi, On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: On Tue 20-11-12 09:42:42, metin d wrote: I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB. I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk. I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days. Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem. This might be because we do not deactive pages as long as there is cache on the inactive list. I'm guessing that the inter-reference distance of data-2 is bigger than half of memory, so it's never getting activated and data-1 is never challenged. Hi Johannes, What's the meaning of "inter-reference distance" and why compare it with half of memoy, what's the trick? Regards, Jaegeuk I have a series of patches that detects a thrashing inactive list and handles working set changes up to the size of memory. Would you be willing to test them? They are currently based on 3.4, let me know what version works best for you. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem in Page Cache Replacement
On 11/22/2012 09:09 AM, Johannes Weiner wrote: On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: On 11/22/2012 05:34 AM, Johannes Weiner wrote: Hi, On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: On Tue 20-11-12 09:42:42, metin d wrote: I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB. I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk. I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days. Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem. This might be because we do not deactive pages as long as there is cache on the inactive list. I'm guessing that the inter-reference distance of data-2 is bigger than half of memory, so it's never getting activated and data-1 is never challenged. Hi Johannes, What's the meaning of "inter-reference distance" It's the number of memory accesses between two accesses to the same page: A B C D A B C E ... |___| | | and why compare it with half of memoy, what's the trick? If B gets accessed twice, it gets activated. If it gets evicted in between, the second access will be a fresh page fault and B will not be recognized as frequently used. Our cutoff for scanning the active list is cache size / 2 right now (inactive_file_is_low), leaving 50% of memory to the inactive list. If the inter-reference distance for pages on the inactive list is bigger than that, they get evicted before their second access. Hi Johannes, Thanks for your explanation. But could you give a short description of how you resolve this inactive list thrashing issues? Regards, Jaegeuk -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
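[The cutoff Johannes refers to is roughly this check from mm/vmscan.c of that era (global, non-memcg case).]

static int inactive_file_is_low_global(struct zone *zone)
{
	unsigned long active, inactive;

	active = zone_page_state(zone, NR_ACTIVE_FILE);
	inactive = zone_page_state(zone, NR_INACTIVE_FILE);

	/* the active file list is only scanned once it outgrows the
	 * inactive list, i.e. the inactive list keeps ~50% of the cache */
	return active > inactive;
}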
Re: kswapd endless loop for compaction
On 11/21/2012 03:04 AM, Johannes Weiner wrote: Hi guys, while testing a 3.7-rc5ish kernel, I noticed that kswapd can drop into a busy spin state without doing reclaim. printk-style debugging told me that this happens when the distance between a zone's high watermark and its low watermark is less than two huge pages (DMA zone). 1. The first loop in balance_pgdat() over the zones finds all zones to be above their high watermark and only does goto out (all_zones_ok). 2. pgdat_balanced() at the out: label also just checks the high watermark, so the node is considered balanced and the order is not reduced. 3. In the `if (order)' block after it, compaction_suitable() checks if the zone's low watermark + twice the huge page size is okay, which it's not necessarily in a small zone, and so COMPACT_SKIPPED makes it it go back to loop_again:. This will go on until somebody else allocates and breaches the high watermark and then hopefully goes on to reclaim the zone above low watermark + 2 * THP. I'm not really sure what the correct solution is. Should we modify the zone_watermark_ok() checks in balance_pgdat() to take into account the higher watermark requirements for reclaim on behalf of compaction? Change the check in compaction_suitable() / not use it directly? Hi Johannes, - If all zones meet high watermark, goto out, then why go to `if (order)' block? - If depend on compaction get enough contigous pages, why if (CONPACT_BUILD && order && compaction_suitable(zone, order) != COMPACTION_SKIPPED) testorder = 0; can't guarantee low watermark + twice the huge page size is okay? Regards, Jaegeuk Thanks, Johannes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
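[For reference, the check inside compaction_suitable() that a small DMA zone can never satisfy looks roughly like this, condensed from mm/compaction.c.]

/* reclaim must leave room for the order-N allocation itself plus
 * compaction's temporary working space */
watermark = low_wmark_pages(zone) + (2UL << order);
if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
	return COMPACT_SKIPPED;

When high_wmark - low_wmark is smaller than 2UL << order pages, the zone can pass the high-watermark tests in balance_pgdat() yet still fail this check, which is the busy spin described above.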
Re: [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages
On 11/21/2012 08:01 PM, Maxim Patlasov wrote: Added linux-mm@ to cc:. The patch can stand on it's own. Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP counter is high enough. This prevents us from having too many dirty pages on fuse, thus giving the userspace part of it a chance to write stuff properly. Note, that the existing balance logic is per-bdi, i.e. if the fuse user task gets stuck in the function this means, that it either writes to the mountpoint it serves (but it can deadlock even without the writeback) or it is writing to some _other_ dirty bdi and in the latter case someone else will free the memory for it. Signed-off-by: Maxim V. Patlasov Signed-off-by: Pavel Emelyanov --- mm/page-writeback.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 830893b..499a606 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1220,7 +1220,8 @@ static void balance_dirty_pages(struct address_space *mapping, */ nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); - nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); + nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) + + global_page_state(NR_WRITEBACK_TEMP); Could you explain NR_WRITEBACK_TEMP is used for accounting what? And when it will increase? global_dirty_limits(&background_thresh, &dirty_thresh); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem in Page Cache Replacement
On 11/22/2012 11:26 PM, Fengguang Wu wrote: Hi Jaegeuk, Sorry for the delay. I'm traveling these days.. On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote: On 11/21/2012 05:02 PM, Fengguang Wu wrote: On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote: Cc Fengguang Wu. On 11/21/2012 04:13 PM, metin d wrote: Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? I'm guessing it'd evict the entries, but am wondering if we could run any more diagnostics before trying this. We regularly use a setup where we have two databases; one gets used frequently and the other one about once a month. It seems like the memory manager keeps unused pages in memory at the expense of frequently used database's performance. My understanding was that under memory pressure from heavily accessed pages, unused pages would eventually get evicted. Is there anything else we can try on this host to understand why this is happening? We may debug it this way. 1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages (please double check via /proc/vmstat whether it does the expected work) 2) run 'page-types -r' with root, to view the page status for the remaining pages of data-1 The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached) Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE" page-types can be found in the kernel source tree tools/vm/page-types.c Sorry that sounds a bit twisted.. I do have a patch to directly dump page cache status of a user specified file, however it's not upstreamed yet. Hi Fengguang, Thanks for you detail steps, I think metin can have a try. flagspage-count MB symbolic-flags long-symbolic-flags 0x607699 2373 ___ 0x0001343227 1340 ___r___reserved We don't need to care about the above two pages states actually. Page cache pages will never be in the special reserved or all-flags-cleared state. Hi Fengguang, Thanks for your response. But which kind of pages are in the special reserved and which are all-flags-cleared? Regards, Jaegeuk But I have some questions of the print of page-type: Is 2373MB here mean total memory in used include page cache? I don't think so. Which kind of pages will be marked reserved? Which line of long-symbolic-flags is for page cache? The (lru && !anonymous) pages are page cache pages. Thanks, Fengguang On Tue 20-11-12 09:42:42, metin d wrote: I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB. I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk. I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days. Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem. Curious. Added linux-mm list to CC to catch more attention. 
If you run echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is: $ uname -r 3.2.28-45.62.amzn1.x86_64 Edit: and it seems that I use one NUMA instance, if you think that it can a problem. $ numactl --hardware available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 70007 MB node 0 free: 360 MB node distances: node 0 0: 10 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem in Page Cache Replacement
On 11/21/2012 02:25 AM, Jan Kara wrote: On Tue 20-11-12 09:42:42, metin d wrote: I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB. I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk. I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days. Hi metin d, fincore is a tool or ...? How could I get it? Regards, Jaegeuk Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem. Curious. Added linux-mm list to CC to catch more attention. If you run echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory? This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is: $ uname -r 3.2.28-45.62.amzn1.x86_64 Edit: and it seems that I use one NUMA instance, if you think that it can a problem. $ numactl --hardware available: 1 nodes (0) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 70007 MB node 0 free: 360 MB node distances: node 0 0: 10 Honza -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem in Page Cache Replacement
On 11/22/2012 11:53 PM, Fengguang Wu wrote: On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote: On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote: On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse wrote: On 11/21/2012 05:58 PM, metin d wrote: Hi Fengguang, I ran the tests and attached the results. The line below, I guess, shows the data-1 page caches. 0x0008006c 658405125718 __RU_lA___P referenced,uptodate,lru,active,private I think this is just one state of page cache pages. But why are these page caches in this state as opposed to the others? From the results I conclude that: data-1 pages are in state: referenced,uptodate,lru,active,private I wonder if it's this code that stops data-1 pages from being reclaimed: shrink_page_list(): if (page_has_private(page)) { if (!try_to_release_page(page, sc->gfp_mask)) goto activate_locked; What's the filesystem used? Ah it's more likely caused by this logic: if (is_active_lru(lru)) { if (inactive_list_is_low(mz, file)) shrink_active_list(nr_to_scan, mz, sc, priority, file); The active file list won't be scanned at all as long as it's smaller than the inactive list. In this case, it's inactive=33586MB > active=25719MB. So the data-1 pages in the active list will never be scanned and reclaimed. Hi Fengguang, It seems that most of data-1's file pages are on the active LRU list and most of data-2's file pages are on the inactive LRU list. As Johannes mentioned, if the inter-reference distance is bigger than half of memory, the pages will never be activated. How do you intend to resolve this issue? Are Johannes's inactive-list thrashing patches available? Regards, Jaegeuk data-2 pages are in state: referenced,uptodate,lru,mappedtodisk Thanks, Fengguang
Re: Problem in Page Cache Replacement
On 11/23/2012 12:17 AM, Johannes Weiner wrote: On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote: On 11/22/2012 09:09 AM, Johannes Weiner wrote: On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote: On 11/22/2012 05:34 AM, Johannes Weiner wrote: Hi, On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote: On Tue 20-11-12 09:42:42, metin d wrote: I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB. I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM to data-2's files. As a result, my queries on data-2 keep hitting disk. I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days. Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kind of suggestions you think it might relate to problem. This might be because we do not deactive pages as long as there is cache on the inactive list. I'm guessing that the inter-reference distance of data-2 is bigger than half of memory, so it's never getting activated and data-1 is never challenged. Hi Johannes, What's the meaning of "inter-reference distance" It's the number of memory accesses between two accesses to the same page: A B C D A B C E ... |___| | | and why compare it with half of memoy, what's the trick? If B gets accessed twice, it gets activated. If it gets evicted in between, the second access will be a fresh page fault and B will not be recognized as frequently used. Our cutoff for scanning the active list is cache size / 2 right now (inactive_file_is_low), leaving 50% of memory to the inactive list. If the inter-reference distance for pages on the inactive list is bigger than that, they get evicted before their second access. Hi Johannes, Thanks for your explanation. But could you give a short description of how you resolve this inactive list thrashing issues? I remember a time stamp of evicted file pages in the page cache radix tree that let me reconstruct the inter-reference distance even after a page has been evicted from cache when it's faulted back in. This way I can tell a one-time sequence from thrashing, no matter how small the inactive list. When thrashing is detected, I start deactivating protected pages and put them next to the refaulted cache on the head of the inactive list and let them fight it out as usual. In this reported case, the old data will be challenged and since it's no longer used, it will just drop off the inactive list eventually. If the guess is wrong and the deactivated memory is used more heavily than the refaulting pages, they will just get activated again without incurring any disruption like a major fault. Hi Johannes, If you also add the time stamp to the protected pages which you deactive when incur thrashing? Regards, Jaegeuk -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
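[A very rough, hypothetical sketch of the mechanism Johannes outlines; the names and packing here are invented, and the real series stores and interprets its shadow entries differently.]

/* on eviction: leave a "shadow" entry with a timestamp in the radix tree
 * slot the page used to occupy */
static void *pack_shadow(unsigned long evict_time)
{
	return (void *)((evict_time << 2) | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

/* on refault: the eviction-to-refault distance tells us whether the page
 * would have survived on a larger inactive list */
static bool refault_was_thrashing(void *shadow, unsigned long now,
				  unsigned long nr_inactive)
{
	unsigned long evict_time = (unsigned long)shadow >> 2;
	unsigned long distance = now - evict_time;

	/*
	 * If the page came back within roughly what the inactive list plus
	 * the protected pages could have held, the inactive list is
	 * thrashing and active pages get deactivated to challenge them.
	 */
	return distance <= nr_inactive;
}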
Re: Problem in Page Cache Replacement
On 11/23/2012 04:08 PM, metin d wrote:

----- Original Message -----
From: Jaegeuk Hanse
To: metin d
Cc: Jan Kara; "linux-kernel@vger.kernel.org"; linux...@kvack.org
Sent: Friday, November 23, 2012 3:58 AM
Subject: Re: Problem in Page Cache Replacement

On 11/21/2012 02:25 AM, Jan Kara wrote:
On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their data. Then, I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache, and reserves about 35 GB of RAM for data-2's files. As a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days.

Hi metin d,

fincore is a tool or ...? How could I get it?

Regards,
Jaegeuk

Hi Jaegeuk,

Yes, it is a tool; you can get it from here: http://code.google.com/p/linux-ftools/

Hi Metin,

Could you give me a link to download it? I can't get it from the link you gave me. Thanks in advance. :-)

Regards,
Jaegeuk

Regards,
Metin

Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to all kinds of suggestions you think might relate to the problem.

Curious. Added linux-mm list to CC to catch more attention. If you run

    echo 1 >/proc/sys/vm/drop_caches

does it evict data-1 pages from memory?

This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is:

    $ uname -r
    3.2.28-45.62.amzn1.x86_64

Edit: and it seems that I use one NUMA node, if you think that could be a problem.

    $ numactl --hardware
    available: 1 nodes (0)
    node 0 cpus: 0 1 2 3 4 5 6 7
    node 0 size: 70007 MB
    node 0 free: 360 MB
    node distances:
    node   0
      0:  10

Honza
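For anyone who cannot fetch the tool: the residency check fincore performs can be reproduced with the mincore(2) system call, which reports, per page of a mapping, whether that page is resident in memory. Below is a minimal stand-alone sketch along those lines; it is not the linux-ftools source, and error handling and cleanup are trimmed.

    /*
     * Minimal fincore-style check using mincore(2): map a file and count
     * how many of its pages are resident in the page cache.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { perror("fstat"); return 1; }

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        size_t pages = (st.st_size + page - 1) / page;
        unsigned char *vec = malloc(pages);
        if (!vec || mincore(map, st.st_size, vec) < 0) { perror("mincore"); return 1; }

        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;     /* bit 0: page is in memory */

        printf("%s: %zu of %zu pages in page cache\n", argv[1], resident, pages);
        return 0;
    }

Running it against data-1's and data-2's files should show the same imbalance the report describes.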
Re: kswapd endless loop for compaction
On 11/21/2012 03:04 AM, Johannes Weiner wrote:

Hi guys,

While testing a 3.7-rc5ish kernel, I noticed that kswapd can drop into a busy spin state without doing reclaim. printk-style debugging told me that this happens when the distance between a zone's high watermark and its low watermark is less than two huge pages (DMA zone).

1. The first loop in balance_pgdat() over the zones finds all zones to be above their high watermark and only does goto out (all_zones_ok).

2. pgdat_balanced() at the out: label also just checks the high watermark, so the node is considered balanced and the order is not reduced.

3. In the `if (order)' block after it, compaction_suitable() checks if the zone's low watermark + twice the huge page size is okay, which it is not necessarily in a small zone, and so COMPACT_SKIPPED makes it go back to loop_again:.

This will go on until somebody else allocates and breaches the high watermark, and then hopefully goes on to reclaim the zone above low watermark + 2 * THP.

I'm not really sure what the correct solution is. Should we modify the zone_watermark_ok() checks in balance_pgdat() to take into account the higher watermark requirements for reclaim on behalf of compaction? Change the check in compaction_suitable() / not use it directly?

Thanks,
Johannes

Hi Johannes,

If we depend on compaction to get enough contiguous pages, why can't

    if (COMPACTION_BUILD && order &&
        compaction_suitable(zone, order) != COMPACT_SKIPPED)
            testorder = 0;

guarantee that low watermark + twice the huge page size is okay?

Regards,
Jaegeuk
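The arithmetic behind the spin can be reproduced outside the kernel. The stand-alone model below uses made-up but representative numbers for a small DMA zone whose high/low watermark gap is smaller than two huge pages; it is only an illustration, not kernel code. The zone passes the high-watermark test used by balance_pgdat()/pgdat_balanced() yet fails the low watermark + 2<<order test applied by compaction_suitable(), so kswapd neither reclaims nor gives up the order.

    /*
     * Stand-alone illustration of the deadlock arithmetic described above.
     * Not kernel code; watermark numbers are invented but typical for a
     * small DMA zone.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define HUGEPAGE_ORDER 9                 /* 2 MB huge page = 512 base pages */

    struct zone_model {
        unsigned long free, low_wmark, high_wmark;
    };

    static bool watermark_ok(const struct zone_model *z)
    {
        return z->free >= z->high_wmark;                  /* balance_pgdat's view */
    }

    static bool compaction_ok(const struct zone_model *z, int order)
    {
        return z->free >= z->low_wmark + (2UL << order);  /* compaction_suitable's view */
    }

    int main(void)
    {
        /* A small DMA zone: high - low watermark gap < two huge pages. */
        struct zone_model dma = { .free = 900, .low_wmark = 300, .high_wmark = 600 };

        printf("zone balanced:       %s\n", watermark_ok(&dma) ? "yes" : "no");
        printf("compaction suitable: %s\n",
               compaction_ok(&dma, HUGEPAGE_ORDER) ? "yes" : "no");
        /* "yes" + "no": kswapd neither reclaims nor drops the order -> busy spin. */
        return 0;
    }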
Re: [PATCH] mm,vmscan: free pages if compaction_suitable tells us to
On 11/26/2012 06:44 AM, Johannes Weiner wrote:
On Sun, Nov 25, 2012 at 01:29:50PM -0500, Rik van Riel wrote:
On Sun, 25 Nov 2012 17:57:28 +0100 Johannes Hirte wrote:

With kernel 3.7-rc6 I still have problems with kswapd0 on my laptop, and this is most of the time. I've only observed this behavior on the laptop; other systems don't show this.

This suggests it may have something to do with small memory zones, where we end up with the "funny" situation that the high watermark (+ balance gap) for a particular zone is less than the low watermark + 2<<order.

It's not quite enough, because it's not reaching the conditions you changed; see the analysis in https://lkml.org/lkml/2012/11/20/567

But even fixing it up (by adding the compaction_suitable() test in this preliminary scan over the zones and setting end_zone accordingly) is not enough, because no actual reclaim happens at priority 12 in a small zone. So the number of free pages is not actually changing and the compaction_suitable() checks keep the loop going.

The preliminary scan is in the highmem->dma direction, so it will miss a high zone that does not meet the compaction_suitable() test, rather than the lowest zone.

The problem is fairly easy to reproduce, by the way. Just boot with mem=800M to have a relatively small lowmem reserve in the DMA zone. Fill it up with page cache, then allocate transparent huge pages. With your patch and my fix to the preliminary zone loop, there won't be any hung task warnings anymore, because kswapd actually calls shrink_slab() and there is a rescheduling point in there, but it still loops forever.

It also seems a bit aggressive to try to balance a small zone like DMA for a huge page when it's not a GFP_DMA allocation, but none of these checks actually take the classzone into account.

Do we have any agreement over what this whole thing is supposed to be doing?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b99ecba..f7e54df 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2412,6 +2412,9 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
  * would need to be at least 256M for it to be balance a whole node.
  * Similarly, on x86-64 the Normal zone would need to be at least 1G
  * to balance a node on its own. These seemed like reasonable ratios.
+ *
+ * The kswapd source code is brought to you by Advil®. "For today's
+ * tough pain, one might not be enough."
  */
 static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
                            int classzone_idx)
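The scan-direction point can be illustrated with a small stand-alone model. It is only a sketch with invented names and made-up per-zone verdicts (the real loop is the first zone scan in balance_pgdat() in mm/vmscan.c): the scan walks from the highest zone index down to zero and stops at the first zone that looks unbalanced, so whether a small low zone's compaction failure is ever noticed depends on which checks that scan applies.

    /*
     * Stand-alone model of the preliminary zone scan discussed above.
     * Illustrative only -- invented names and verdicts, not the kernel's
     * balance_pgdat() code.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define NR_ZONES 3
    static const char *zone_name[NR_ZONES] = { "DMA", "DMA32", "Normal" };

    /* Made-up per-zone verdicts for the example: true = the check passes. */
    static const bool watermark_passes[NR_ZONES]  = { true,  true, true };
    static const bool compaction_passes[NR_ZONES] = { false, true, true };

    /* Walk highmem->dma and stop at the first zone that looks unbalanced. */
    static int pick_end_zone(bool also_check_compaction)
    {
        for (int i = NR_ZONES - 1; i >= 0; i--) {
            bool balanced = watermark_passes[i] &&
                (!also_check_compaction || compaction_passes[i]);
            if (!balanced)
                return i;
        }
        return -1;      /* every zone passed: node considered balanced */
    }

    static void report(const char *what, int end_zone)
    {
        if (end_zone < 0)
            printf("%s: node balanced, goto out\n", what);
        else
            printf("%s: end_zone = %s\n", what, zone_name[end_zone]);
    }

    int main(void)
    {
        report("watermarks only      ", pick_end_zone(false));
        report("plus compaction check", pick_end_zone(true));
        return 0;
    }

With watermark checks alone the model declares the node balanced, the goto out case; adding the compaction check makes the scan reach the small DMA zone, which is the kind of fix-up to the preliminary loop referred to above.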