Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jaegeuk Hanse
On Wed, Nov 28, 2012 at 04:29:01PM +0800, Wen Congyang wrote:
>At 11/28/2012 12:08 PM, Jiang Liu Wrote:
>> On 2012-11-28 11:24, Bob Liu wrote:
>>> On Tue, Nov 27, 2012 at 8:49 PM, Tang Chen  wrote:
 On 11/27/2012 08:09 PM, Bob Liu wrote:
>
> On Tue, Nov 27, 2012 at 4:29 PM, Tang Chen
> wrote:
>>
>> Hi Liu,
>>
>>
>> This feature is used in memory hotplug.
>>
>> In order to implement a whole node hotplug, we need to make sure the
>> node contains no kernel memory, because memory used by kernel could
>> not be migrated. (Since the kernel memory is directly mapped,
>> VA = PA + __PAGE_OFFSET. So the physical address could not be changed.)
>>
>> User could specify all the memory on a node to be movable, so that the
>> node could be hot-removed.
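
For illustration, the direct-mapping relation referred to above is roughly the
sketch below; the offset value is only a placeholder, not the real
architecture-specific PAGE_OFFSET.

/* Sketch only: kernel direct mapping, VA = PA + PAGE_OFFSET.  Because a
 * directly mapped page's virtual address is a fixed offset from its
 * physical address, such a page cannot move to a different physical
 * address without breaking every existing pointer to it. */
#define EXAMPLE_PAGE_OFFSET 0xffff880000000000UL	/* placeholder value */

static inline void *example_phys_to_virt(unsigned long pa)
{
	return (void *)(pa + EXAMPLE_PAGE_OFFSET);
}

static inline unsigned long example_virt_to_phys(const void *va)
{
	return (unsigned long)va - EXAMPLE_PAGE_OFFSET;
}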
>>
>
> Thank you for your explanation. It's reasonable.
>
> But I think it's somewhat duplicated with CMA. I'm not sure, but maybe we
> can combine it with CMA, which is already in mainline?
>
 Hi Liu,

 Thanks for your advice. :)

 CMA is the Contiguous Memory Allocator, right?  What I'm trying to do is
 control where the start of ZONE_MOVABLE is on each node. Could
 CMA do this job?
>>>
>>> CMA will not control the start of ZONE_MOVABLE on each node, but it
>>> can declare a memory area that is always movable,
>>> and no non-movable allocation request will be satisfied from that area.
>>>
>>> Currently CMA uses a boot parameter "cma=" to declare a memory size
>>> that is always movable.
>>> I think it might fulfill your requirement if the boot
>>> parameter were extended with a start address.
>>>
>>> more info at http://lwn.net/Articles/468044/

 Also, after a short investigation, CMA seems to need to be based on
 memblock. But we need to keep memblock from allocating memory in
 ZONE_MOVABLE. As a result, we need to know the ranges before memblock
 is used. I'm afraid we still need an approach to get the ranges,
 such as a boot option, or from static ACPI tables such as SRAT/MPST.

>>>
>>> Yes, it's based on memblock and configured with a boot option.
>>> In setup_arch32():
>>> dma_contiguous_reserve(0);   => will declare a CMA area using memblock_reserve()
>>>
 I don't know much about CMA for now. So if you have any better idea,
 please share it with us, thanks. :)
>>>
>>> My idea is to reuse CMA as in the patch below (not even compiled) and boot with
>>> "cma=size@start_address".
>>> I don't know whether it can work or whether it is suitable for your
>>> requirement; if not, forgive me for the noise.
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index 612afcc..564962a 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -59,11 +59,18 @@ struct cma *dma_contiguous_default_area;
>>>   */
>>>  static const unsigned long size_bytes = CMA_SIZE_MBYTES * SZ_1M;
>>>  static long size_cmdline = -1;
>>> +static long cma_start_cmdline = -1;
>>>
>>>  static int __init early_cma(char *p)
>>>  {
>>> +   char *oldp;
>>> pr_debug("%s(%s)\n", __func__, p);
>>> +   oldp = p;
>>> size_cmdline = memparse(p, &p);
>>> +
>>> +   if (*p == '@')
>>> +   cma_start_cmdline = memparse(p+1, &p);
>>> +   printk("cma start:0x%x, size: 0x%x\n", size_cmdline, 
>>> cma_start_cmdline);
>>> return 0;
>>>  }
>>>  early_param("cma", early_cma);
>>> @@ -127,8 +134,10 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>>> if (selected_size) {
>>> pr_debug("%s: reserving %ld MiB for global area\n", 
>>> __func__,
>>>  selected_size / SZ_1M);
>>> -
>>> -   dma_declare_contiguous(NULL, selected_size, 0, limit);
>>> +   if (cma_size_cmdline != -1)
>>> +   dma_declare_contiguous(NULL, selected_size,
>>> cma_start_cmdline, limit);
>>> +   else
>>> +   dma_declare_contiguous(NULL, selected_size, 0, 
>>> limit);
>>> }
>>>  };
>> It seems a good idea to reserve memory by reusing the CMA logic, though it needs more
>> investigation. One of CMA's goals is to ensure that pages in CMA are really
>> movable, and at first glance this patchset tries to achieve the same goal.
>
>Hmm, I don't like reusing CMA, because CMA is meant for DMA. If we reuse it
>for movable memory, I think the movable zone is enough. And a single start address
>is not acceptable, because we want to specify the start address for each node.
>
>I think we can implement movablecore_map like this (a rough parsing sketch follows below):
>1. parse the parameter
>2. reserve the memory after efi_reserve_boot_services()
>3. release the memory in mem_init()
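
A minimal sketch of step 1, assuming a "movablecore_map=size@start" syntax
(the function and variable names here are hypothetical; the real patchset may
parse per-node ranges differently):

static unsigned long long movable_start __initdata;
static unsigned long long movable_size __initdata;

/* Parse "movablecore_map=nn[KMG]@ss[KMG]" from the kernel command line. */
static int __init cmdline_parse_movablecore_map(char *p)
{
	if (!p)
		return -EINVAL;
	movable_size = memparse(p, &p);
	if (*p != '@')
		return -EINVAL;
	movable_start = memparse(p + 1, &p);
	/* [movable_start, movable_start + movable_size) would later be kept
	 * out of early (memblock/bootmem) allocations and end up in
	 * ZONE_MOVABLE. */
	return 0;
}
early_param("movablecore_map", cmdline_parse_movablecore_map);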
>

Hi Tang,

I haven't read the patchset yet, but could you give a short description of how
you designed your implementation in this patchset?

Regards,
Jaegeuk

>What about this?
>
>Thanks
>Wen Congyang

Re: [PATCH] tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)

2012-11-28 Thread Jaegeuk Hanse
On Wed, Nov 28, 2012 at 05:22:03PM -0800, Hugh Dickins wrote:
>Revert 3.5's f21f8062201f ("tmpfs: revert SEEK_DATA and SEEK_HOLE")
>to reinstate 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE"),
>with the intervening additional arg to generic_file_llseek_size().
>
>In 3.8, ext4 is expected to join btrfs, ocfs2 and xfs with proper
>SEEK_DATA and SEEK_HOLE support; and a good case has now been made
>for it on tmpfs, so let's join the party.
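
For readers unfamiliar with these flags, a minimal userspace probe might look
like the sketch below (illustrative only; the file name is arbitrary):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);

	if (fd < 0)
		return 1;
	/* First data and first hole at or after offset 0; lseek returns -1
	 * with errno == ENXIO if there is none before EOF. */
	off_t data = lseek(fd, 0, SEEK_DATA);
	off_t hole = lseek(fd, 0, SEEK_HOLE);

	printf("first data at %lld, first hole at %lld\n",
	       (long long)data, (long long)hole);
	close(fd);
	return 0;
}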
>

Hi Hugh,

IIUC, several months ago you reverted the patch. You said,

"I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
would be of any use to them on tmpfs.  This code adds 92 lines and 752
bytes on x86_64 - is that bloat or worthwhile?"

But this time, in which scenario will it be used?

Regards,
Jaegeuk

>It's quite easy for tmpfs to scan the radix_tree to support llseek's new
>SEEK_DATA and SEEK_HOLE options: so add them while the minutiae are still
>on my mind (in particular, the !PageUptodate-ness of pages fallocated but
>still unwritten).
>
>[a...@linux-foundation.org: fix warning with CONFIG_TMPFS=n]
>Signed-off-by: Hugh Dickins 
>---
>
> mm/shmem.c |   92 ++-
> 1 file changed, 91 insertions(+), 1 deletion(-)
>
>--- 3.7-rc7/mm/shmem.c 2012-11-16 19:26:56.388459961 -0800
>+++ linux/mm/shmem.c   2012-11-28 15:53:38.788477201 -0800
>@@ -1709,6 +1709,96 @@ static ssize_t shmem_file_splice_read(st
>   return error;
> }
> 
>+/*
>+ * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
>+ */
>+static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
>+  pgoff_t index, pgoff_t end, int origin)
>+{
>+  struct page *page;
>+  struct pagevec pvec;
>+  pgoff_t indices[PAGEVEC_SIZE];
>+  bool done = false;
>+  int i;
>+
>+  pagevec_init(&pvec, 0);
>+  pvec.nr = 1;/* start small: we may be there already */
>+  while (!done) {
>+  pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
>+  pvec.nr, pvec.pages, indices);
>+  if (!pvec.nr) {
>+  if (origin == SEEK_DATA)
>+  index = end;
>+  break;
>+  }
>+  for (i = 0; i < pvec.nr; i++, index++) {
>+  if (index < indices[i]) {
>+  if (origin == SEEK_HOLE) {
>+  done = true;
>+  break;
>+  }
>+  index = indices[i];
>+  }
>+  page = pvec.pages[i];
>+  if (page && !radix_tree_exceptional_entry(page)) {
>+  if (!PageUptodate(page))
>+  page = NULL;
>+  }
>+  if (index >= end ||
>+  (page && origin == SEEK_DATA) ||
>+  (!page && origin == SEEK_HOLE)) {
>+  done = true;
>+  break;
>+  }
>+  }
>+  shmem_deswap_pagevec(&pvec);
>+  pagevec_release(&pvec);
>+  pvec.nr = PAGEVEC_SIZE;
>+  cond_resched();
>+  }
>+  return index;
>+}
>+
>+static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin)
>+{
>+  struct address_space *mapping = file->f_mapping;
>+  struct inode *inode = mapping->host;
>+  pgoff_t start, end;
>+  loff_t new_offset;
>+
>+  if (origin != SEEK_DATA && origin != SEEK_HOLE)
>+  return generic_file_llseek_size(file, offset, origin,
>+  MAX_LFS_FILESIZE, i_size_read(inode));
>+  mutex_lock(&inode->i_mutex);
>+  /* We're holding i_mutex so we can access i_size directly */
>+
>+  if (offset < 0)
>+  offset = -EINVAL;
>+  else if (offset >= inode->i_size)
>+  offset = -ENXIO;
>+  else {
>+  start = offset >> PAGE_CACHE_SHIFT;
>+  end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>+  new_offset = shmem_seek_hole_data(mapping, start, end, origin);
>+  new_offset <<= PAGE_CACHE_SHIFT;
>+  if (new_offset > offset) {
>+  if (new_offset < inode->i_size)
>+  offset = new_offset;
>+  else if (origin == SEEK_DATA)
>+  offset = -ENXIO;
>+  else
>+  offset = inode->i_size;
>+  }
>+  }
>+
>+  if (offset >= 0 && offset != file->f_pos) {
>+  file->f_pos = offset;
>+  file->f_version = 0;
>+  }
>+  mutex_unlock(&inode->i_mutex);
>+  return offset;
>+}
>+
> static long shmem_fallocate(struct file *file, int mode, lo

Re: [PATCH v2 0/5] Add movablecore_map boot option

2012-11-28 Thread Jaegeuk Hanse
On Wed, Nov 28, 2012 at 04:47:42PM +0800, Jiang Liu wrote:
>Hi all,
>   Seems it's a great chance to discuss about the memory hotplug feature
>within this thread. So I will try to give some high level thoughts about memory
>hotplug feature on x86/IA64. Any comments are welcomed!
>   First of all, I think usability really matters. Ideally, the memory hotplug
>feature should just work out of the box, and we shouldn't expect administrators to
>add several extra platform-dependent parameters to enable memory hotplug.
>But how do we enable memory (or CPU/node) hotplug out of the box? I think the
>key point is to cooperate with the BIOS/ACPI/firmware/device management teams.
>   I still position memory hotplug as an advanced feature for high end 
>servers and those systems may/should provide some management interfaces to 
>configure CPU/memory/node hotplug features. The configuration UI may be 
>provided
>by BIOS, BMC or centralized system management suite. Once administrator enables
>hotplug feature through those management UI, OS should support system device
>hotplug out of box. For example, HP SuperDome2 management suite provides 
>interface
>to configure a node as floating node(hot-removable). And OpenSolaris supports
>CPU/memory hotplug out of box without any extra configurations. So we should
>shape interfaces between firmware and OS to better support system device 
>hotplug.
>   On the other hand, I think there are no commercially available x86/IA64
>platforms with system device hotplug capabilities in the field yet, at least
>only limited quantities if any. So backward compatibility is not a big issue
>for us now.
>So I think it's doable to rely on firmware to provide better support for system
>device hotplug.
>   Then what should be enhanced to better support system device hotplug?
>
>1) ACPI specification should be enhanced to provide a static table to describe
>components with hotplug features, so OS could reserve special resources for
>hotplug at early boot stages. For example, to reserve enough CPU ids for CPU
>hot-add. Currently we guess maximum number of CPUs supported by the platform
>by counting CPU entries in APIC table, that's not reliable.
>
>2) BIOS should implement SRAT, MPST and PMTT tables to better support memory
>hotplug. SRAT associates memory ranges with proximity domains with an extra
>"hotpluggable" flag. PMTT provides memory device topology information, such
>as "socket->memory controller->DIMM". MPST is used for memory power management
>and provides a way to associate memory ranges with memory devices in PMTT.
>With all information from SRAT, MPST and PMTT, OS could figure out hotplug
>memory ranges automatically, so no extra kernel parameters needed.
>
>3) Enhance ACPICA to provide a method to scan static ACPI tables before
>the memory subsystem has been initialized, because the OS needs to access SRAT,
>MPST and PMTT when initializing the memory subsystem.
>
>4) The last and the most important issue is how to minimize performance
>drop caused by memory hotplug. As proposed by this patchset, once we
>configure all memory of a NUMA node as movable, it essentially disables
>NUMA optimization of kernel memory allocation from that node. According
>to experience, that will cause huge performance drop. We have observed
>10-30% performance drop with memory hotplug enabled. And on another
>OS the average performance drop caused by memory hotplug is about 10%.
>If we can't resolve the performance drop, memory hotplug is just a feature
>for demo:( With help from hardware, we do have some chances to reduce
>performance penalty caused by memory hotplug.
>   As we know, Linux can migrate movable pages, but can't migrate
>non-movable pages used by the kernel/DMA etc. And the hardest part is how
>to deal with those unmovable pages when hot-removing a memory device.
>Now hardware has given us a hand with a technology named memory migration,
>which could transparently migrate memory between memory devices. There are
>no OS-visible changes except the NUMA topology before and after hardware memory
>migration.
>   And if there are multiple memory devices within a NUMA node,
>we could configure some memory devices to host unmovable memory and the
>other to host movable memory. With this configuration, there won't be
>a big performance drop because we have preserved all NUMA optimizations.
>We also could achieve memory hot-remove by:
>1) Use existing page migration mechanism to reclaim movable pages.
>2) For memory devices hosting unmovable pages, we need:
>2.1) find a movable memory device on other nodes with enough capacity
>and reclaim it.
>2.2) use hardware migration technology to migrate unmovable memory to

Hi Jiang,

Could you give an explanation of how this hardware migration technology works?

Regards,
Jaegeuk

>the just reclaimed memory device on other nodes.
>
>   I hope we could expect users to adopt memory hotplug technology
>with all these implemented.
>
>   Back to this patch, we could rely

Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

2012-11-13 Thread Jaegeuk Hanse

On 11/07/2012 07:48 AM, Hugh Dickins wrote:

On Tue, 6 Nov 2012, Dave Jones wrote:

On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote:

  > -/* We already confirmed swap, and make no allocation */
  > -VM_BUG_ON(error);
  > +/*
  > + * We already confirmed swap under page lock, and make
  > + * no memory allocation here, so usually no possibility
  > + * of error; but free_swap_and_cache() only trylocks a
  > + * page, so it is just possible that the entry has been
  > + * truncated or holepunched since swap was confirmed.
  > + * shmem_undo_range() will have done some of the
  > + * unaccounting, now delete_from_swap_cache() will do
  > + * the rest (including mem_cgroup_uncharge_swapcache).
  > + * Reset swap.val? No, leave it so "failed" goes back to
  > + * "repeat": reading a hole and writing should succeed.
  > + */
  > +if (error) {
  > +VM_BUG_ON(error != -ENOENT);
  > +delete_from_swap_cache(page);
  > +}
  >  }

I ran with this overnight,

Thanks a lot...


and still hit the (new!) VM_BUG_ON

... but that's even more surprising than your original report.


Perhaps we should print out what 'error' was too ?  I'll rebuild with that..

Thanks; though I thought the error was going to turn out too boring,
and was preparing a debug patch for you to show the expected and found
values too.  But then got very puzzled...
  

[ cut here ]
WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Hardware name: 2012 Client Platform
Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54

That's the very same line number as in your original report, despite
the long comment which the patch adds.  Are you sure that kernel was
built with the patch in?

I wouldn't usually question you, but I'm going mad trying to understand
how the VM_BUG_ON(error != -ENOENT) fires.  At the time I wrote that
line, and when I was preparing the debug patch, I was thinking that an
error from shmem_radix_tree_replace could also be -EEXIST, for when a
different something rather than nothing is found [*].  But that's not
the case, shmem_radix_tree_replace returns either 0 or -ENOENT.

So if error != -ENOENT, that means shmem_add_to_page_cache went the
radix_tree_insert route instead of the shmem_radix_tree_replace route;
which means that its 'expected' is NULL, so swp_to_radix_entry(swap)
is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt
the radix_tree might be, I do not understand the new VM_BUG_ON firing.

Please tell me it was the wrong kernel!
Hugh

[*] But in thinking it over, I realize that if shmem_radix_tree_replace
had returned -EEXIST for the "wrong something" case, I would have been
wrong to BUG on that; because just as truncation could remove an entry,
something else could immediately after instantiate a new page there.


Hi Hugh,

As you said, swp_to_radix_entry() does an "| 2". So even if truncation
could remove an entry and something else could immediately afterwards
instantiate a new page there, the 'expected' parameter will not be
NULL; as a result, radix_tree_insert will not be called and
shmem_add_to_page_cache will not return -EEXIST. Then why does the VM_BUG_ON trigger?


Regards,
Jaegeuk


So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's
not saying what I had intended to say with it, and would have been
wrong to say that anyway.  It just looks stupid to me now, rather
like inserting a VM_BUG_ON(false) - but that does become interesting
when you report that you've hit it.



Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

2012-11-14 Thread Jaegeuk Hanse

On 11/14/2012 11:50 AM, Hugh Dickins wrote:

On Wed, 14 Nov 2012, Jaegeuk Hanse wrote:

On 11/07/2012 07:48 AM, Hugh Dickins wrote:

On Tue, 6 Nov 2012, Dave Jones wrote:

On Mon, Nov 05, 2012 at 05:32:41PM -0800, Hugh Dickins wrote:

   > -   /* We already confirmed swap, and make no allocation */
   > -   VM_BUG_ON(error);
   > +   /*
   > +* We already confirmed swap under page lock, and make
   > +* no memory allocation here, so usually no possibility
   > +* of error; but free_swap_and_cache() only trylocks a
   > +* page, so it is just possible that the entry has been
   > +* truncated or holepunched since swap was confirmed.
   > +* shmem_undo_range() will have done some of the
   > +* unaccounting, now delete_from_swap_cache() will do
   > +* the rest (including mem_cgroup_uncharge_swapcache).
   > +* Reset swap.val? No, leave it so "failed" goes back to
   > +* "repeat": reading a hole and writing should succeed.
   > +*/
   > +   if (error) {
   > +   VM_BUG_ON(error != -ENOENT);
   > +   delete_from_swap_cache(page);
   > +   }
   > }

I ran with this overnight,

Thanks a lot...


and still hit the (new!) VM_BUG_ON

... but that's even more surprising than your original report.


Perhaps we should print out what 'error' was too ?  I'll rebuild with
that..

Thanks; though I thought the error was going to turn out too boring,
and was preparing a debug patch for you to show the expected and found
values too.  But then got very puzzled...
   

[ cut here ]
WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
Hardware name: 2012 Client Platform
Pid: 21798, comm: trinity-child4 Not tainted 3.7.0-rc4+ #54

That's the very same line number as in your original report, despite
the long comment which the patch adds.  Are you sure that kernel was
built with the patch in?

I wouldn't usually question you, but I'm going mad trying to understand
how the VM_BUG_ON(error != -ENOENT) fires.  At the time I wrote that
line, and when I was preparing the debug patch, I was thinking that an
error from shmem_radix_tree_replace could also be -EEXIST, for when a
different something rather than nothing is found [*].  But that's not
the case, shmem_radix_tree_replace returns either 0 or -ENOENT.

So if error != -ENOENT, that means shmem_add_to_page_cache went the
radix_tree_insert route instead of the shmem_radix_tree_replace route;
which means that its 'expected' is NULL, so swp_to_radix_entry(swap)
is NULL; but swp_to_radix_entry() does an "| 2", so however corrupt
the radix_tree might be, I do not understand the new VM_BUG_ON firing.

Please tell me it was the wrong kernel!
Hugh

[*] But in thinking it over, I realize that if shmem_radix_tree_replace
had returned -EEXIST for the "wrong something" case, I would have been
wrong to BUG on that; because just as truncation could remove an entry,
something else could immediately after instantiate a new page there.

Hi Hugh,

As you said, swp_to_radix_entry() does an "| 2". So even if truncation could
remove an entry and something else could immediately afterwards instantiate a new
page there, the 'expected' parameter will not be NULL; as a result,
radix_tree_insert will not be called and shmem_add_to_page_cache will not
return -EEXIST. Then why does the VM_BUG_ON trigger?

Why insert the VM_BUG_ON?  Because at the time I thought that it
asserted something useful; but I was mistaken, as explained above.

How can the VM_BUG_ON trigger (without stack corruption, or something
of that kind)?  I have no idea.

We are in agreement: I now think that VM_BUG_ON is misleading and silly,
and sent Andrew a further patch to remove it a just couple of hours ago.

Originally I was waiting to hear further from Dave; but his test
machine was giving trouble, and it occurred to me that, never mind
whether he says he has hit it again, or he has not hit it again,
the answer is the same: don't send that VM_BUG_ON upstream.

Hugh


Thanks Hugh.

Another question: why does the shmem_fallocate function which you added to the
kernel need to call shmem_getpage?


Regards,
Jaegeuk




Regards,
Jaegeuk


So although I believe my VM_BUG_ON(error != -ENOENT) is safe, it's
not saying what I had intended to say with it, and would have been
wrong to say that anyway.  It just looks stupid to me now, rather
like inserting a VM_BUG_ON(false) - but that does become interesting
when you report that you

Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

2012-11-15 Thread Jaegeuk Hanse

On 11/16/2012 03:56 AM, Hugh Dickins wrote:

Offtopic...

On Thu, 15 Nov 2012, Jaegeuk Hanse wrote:

Another question. Why the function shmem_fallocate which you add to kernel
need call shmem_getpage?

Because shmem_getpage(_gfp) is where shmem's
page lookup and allocation complexities are handled.

I assume the question behind your question is: why does shmem actually
allocate pages for its fallocate, instead of just reserving the space?


Yeah, this is what I want to know.



I did play with just reserving the space, with more special entries in
the radix_tree to note the reservations made.  It should be doable for
the vm_enough_memory and sbinfo->used_blocks reservations.

What absolutely deterred me from taking that path was the mem_cgroup
case: shmem and swap and memcg are not easy to get working right together,
and nobody would thank me for complicating memcg just for shmem_fallocate.

By allocating pages, the pre-existing memcg code just works; if we used
reservations instead, we would have to track their memcg charges in some
additional new way.  I see no justification for that complication.


Oh, I see, thanks Hugh. :-)


Hugh




Re: [PATCH 20/21] mm: drop vmtruncate

2012-11-15 Thread Jaegeuk Hanse

On 11/03/2012 05:32 PM, Marco Stornelli wrote:

Removed vmtruncate


Hi Marco,

Could you explain to me why vmtruncate needs to be removed? What's the problem,
and what should be used instead?


Regards,
Jaegeuk



Signed-off-by: Marco Stornelli 
---
  include/linux/mm.h |1 -
  mm/truncate.c  |   23 ---
  2 files changed, 0 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..95f70bb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -977,7 +977,6 @@ static inline void unmap_shared_mapping_range(struct 
address_space *mapping,
  
  extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new);

  extern void truncate_setsize(struct inode *inode, loff_t newsize);
-extern int vmtruncate(struct inode *inode, loff_t offset);
  void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
  int truncate_inode_page(struct address_space *mapping, struct page *page);
  int generic_error_remove_page(struct address_space *mapping, struct page 
*page);
diff --git a/mm/truncate.c b/mm/truncate.c
index d51ce92..c75b736 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -577,29 +577,6 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
  EXPORT_SYMBOL(truncate_setsize);
  
  /**

- * vmtruncate - unmap mappings "freed" by truncate() syscall
- * @inode: inode of the file used
- * @newsize: file offset to start truncating
- *
- * This function is deprecated and truncate_setsize or truncate_pagecache
- * should be used instead, together with filesystem specific block truncation.
- */
-int vmtruncate(struct inode *inode, loff_t newsize)
-{
-   int error;
-
-   error = inode_newsize_ok(inode, newsize);
-   if (error)
-   return error;
-
-   truncate_setsize(inode, newsize);
-   if (inode->i_op->truncate)
-   inode->i_op->truncate(inode);
-   return 0;
-}
-EXPORT_SYMBOL(vmtruncate);
-
-/**
   * truncate_pagecache_range - unmap and remove pagecache that is hole-punched
   * @inode: inode
   * @lstart: offset of beginning of hole




Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

2012-11-16 Thread Jaegeuk Hanse

On 11/16/2012 03:56 AM, Hugh Dickins wrote:

Offtopic...

On Thu, 15 Nov 2012, Jaegeuk Hanse wrote:

Another question. Why the function shmem_fallocate which you add to kernel
need call shmem_getpage?

Because shmem_getpage(_gfp) is where shmem's
page lookup and allocation complexities are handled.

I assume the question behind your question is: why does shmem actually
allocate pages for its fallocate, instead of just reserving the space?

I did play with just reserving the space, with more special entries in
the radix_tree to note the reservations made.  It should be doable for
the vm_enough_memory and sbinfo->used_blocks reservations.

What absolutely deterred me from taking that path was the mem_cgroup
case: shmem and swap and memcg are not easy to get working right together,
and nobody would thank me for complicating memcg just for shmem_fallocate.

By allocating pages, the pre-existing memcg code just works; if we used
reservations instead, we would have to track their memcg charges in some
additional new way.  I see no justification for that complication.


Hi Hugh

Some questions about your shmem/tmpfs: misc and fallocate patchset.

- Since shmem_setattr can truncate tmpfs files, why is similar code needed
again in shmem_fallocate? What's the trick?

- in tmpfs: support fallocate preallocation patch changelog:
  "Christoph Hellwig: What for exactly?  Please explain why 
preallocating on tmpfs would make any sense.
  Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on 
files on the /dev/shm filesystem.  The glibc fallback loop for -ENOSYS 
[or -EOPNOTSUPP] on fallocate is just ugly."
  Could shmem/tmpfs fallocate prevent one process from truncating a file
that a second process has mmap()ed, so that the second process gets SIGBUS
when it accesses the mapping beyond the current size of the file?


Regards,
Jaegeuk


Hugh




Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

2012-11-17 Thread Jaegeuk Hanse

On 11/17/2012 12:48 PM, Hugh Dickins wrote:

Further offtopic..


Thanks for your explanation, Hugh. :-)



On Fri, 16 Nov 2012, Jaegeuk Hanse wrote:

Some questions about your shmem/tmpfs: misc and fallocate patchset.

- Since shmem_setattr can truncate tmpfs files, why need add another similar
codes in function shmem_fallocate? What's the trick?

I don't know if I understand you.  In general, hole-punching is different
from truncation.  Supporting the hole-punch mode of the fallocate system
call is different from supporting truncation.  They're closely related,
and share code, but meet different specifications.


What's the difference between shmem/tmpfs hole-punching and
truncate_setsize/truncate_pagecache?
Do you mean one punches a hole in the file and the other shrinks or
extends the size of a file?



- in tmpfs: support fallocate preallocation patch changelog:
   "Christoph Hellwig: What for exactly?  Please explain why preallocating on
tmpfs would make any sense.
   Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on
the /dev/shm filesystem.  The glibc fallback loop for -ENOSYS [or
-EOPNOTSUPP] on fallocate is just ugly."
   Could shmem/tmpfs fallocate prevent one process truncate the file which the
second process mmap() and get SIGBUS when the second process access mmap but
out of current size of file?

Again, I don't know if I understand you.  fallocate does not prevent
truncation or races or SIGBUS.  I believe that Kay meant that without
using fallocate to allocate the memory in advance, systemd found it hard
to protect itself from the possibility of getting a SIGBUS, if access to
a shmem mapping happened to run out of memory/space in the middle.
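
As a rough userspace illustration of Kay's use case (path and size are
arbitrary): preallocating before mmap() turns a potential later SIGBUS into an
up-front error return.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1 << 20;	/* 1 MiB, arbitrary */
	int fd = open("/dev/shm/example", O_RDWR | O_CREAT, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;
	/* Reserve the space up front: if tmpfs is full, the failure shows up
	 * here as an error code instead of as SIGBUS during a later access. */
	if (posix_fallocate(fd, 0, len) != 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 0, len);		/* safe: blocks already allocated */
	munmap(p, len);
	close(fd);
	unlink("/dev/shm/example");
	return 0;
}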


IIUC, it will return VM_xxx_OOM instead of SIGBUS if it runs out of memory.
Then how can SIGBUS happen in this scenario?


Regards,
Jaegeuk


I never grasped why writing the file in advance was not good enough:
fallocate happened to be what they hoped to use, and it was hard to
deny it, given that tmpfs already supported hole-punching, and was
about to convert to the fallocate interface for that.
Hugh




Re: [PATCH] tmpfs: fix shmem_getpage_gfp VM_BUG_ON

2012-11-17 Thread Jaegeuk Hanse

On 11/17/2012 12:48 PM, Hugh Dickins wrote:

Further offtopic..


Hi Hugh,

- I see you add this in vfs.txt:
  +  fallocate: called by the VFS to preallocate blocks or punch a hole.
  I want to know whether it's necessary to add it to the man page, since users
still can't learn from "man fallocate" that fallocate can punch a hole.

- in function shmem_fallocate:
+   else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
+   error = -ENOMEM;
If this changelog "shmem_fallocate() compare counts and give up once the 
reactivated pages have started to coming back to writepage 
(approximately: some zones would in fact recycle faster than others)." 
describe why need this change? If the answer is yes, I have two questions.
1) how can guarantee it really don't need preallocation if just one or a 
few pages always reactivated, in this scene, nr_unswapped maybe grow 
bigger enough than shmem_falloc.nr_falloced

2) why return -ENOMEM, it's not really OOM, is it a trick or ...?

Regards,
Jaegeuk



On Fri, 16 Nov 2012, Jaegeuk Hanse wrote:

Some questions about your shmem/tmpfs: misc and fallocate patchset.

- Since shmem_setattr can truncate tmpfs files, why need add another similar
codes in function shmem_fallocate? What's the trick?

I don't know if I understand you.  In general, hole-punching is different
from truncation.  Supporting the hole-punch mode of the fallocate system
call is different from supporting truncation.  They're closely related,
and share code, but meet different specifications.


- in tmpfs: support fallocate preallocation patch changelog:
   "Christoph Hellwig: What for exactly?  Please explain why preallocating on
tmpfs would make any sense.
   Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files on
the /dev/shm filesystem.  The glibc fallback loop for -ENOSYS [or
-EOPNOTSUPP] on fallocate is just ugly."
   Could shmem/tmpfs fallocate prevent one process truncate the file which the
second process mmap() and get SIGBUS when the second process access mmap but
out of current size of file?

Again, I don't know if I understand you.  fallocate does not prevent
truncation or races or SIGBUS.  I believe that Kay meant that without
using fallocate to allocate the memory in advance, systemd found it hard
to protect itself from the possibility of getting a SIGBUS, if access to
a shmem mapping happened to run out of memory/space in the middle.

I never grasped why writing the file in advance was not good enough:
fallocate happened to be what they hoped to use, and it was hard to
deny it, given that tmpfs already supported hole-punching, and was
about to convert to the fallocate interface for that.

Hugh




Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

2012-11-17 Thread Jaegeuk Hanse

On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's implementation of lockless refcounting
for huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.

If counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for caller and one for shrinker. We free the
page only in shrinker callback if counter is 1 (only shrinker has the
reference).

put_huge_zero_page() only decrements counter. Counter is never zero
in put_huge_zero_page() since shrinker holds on reference.

Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.

Refcounting has a cost. On a 4-socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation.  I think it's pretty reasonable for a synthetic benchmark.


Hi Kirill,

I see your and Andrew's hot discussion in the v4 resend thread.

"I also tried another scenario: usemem -n16 100M -r 1000. It creates 
real memory pressure - no easy reclaimable memory. This time callback 
called with nr_to_scan > 0 and we freed hzp. "


What's "usemem"? Is it a tool and how to get it? It's hard for me to 
find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your 
scenario?


Regards,
Jaegeuk



Signed-off-by: Kirill A. Shutemov 
---
  mm/huge_memory.c | 112 ++-
  1 file changed, 87 insertions(+), 25 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bad9c8f..923ea75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include "internal.h"
@@ -47,7 +48,6 @@ static unsigned int khugepaged_scan_sleep_millisecs 
__read_mostly = 1;
  /* during fragmentation poll the hugepage allocator once every minute */
  static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 6;
  static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
  static DEFINE_MUTEX(khugepaged_mutex);
  static DEFINE_SPINLOCK(khugepaged_mm_lock);
  static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -160,31 +160,74 @@ static int start_khugepaged(void)
return err;
  }
  
-static int init_huge_zero_pfn(void)

+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
  {
-   struct page *hpage;
-   unsigned long pfn;
+   unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+   return zero_pfn && pfn == zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+   return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
+static unsigned long get_huge_zero_page(void)
+{
+   struct page *zero_page;
+retry:
+   if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+   return ACCESS_ONCE(huge_zero_pfn);
  
-	hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,

+   zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
-   if (!hpage)
-   return -ENOMEM;
-   pfn = page_to_pfn(hpage);
-   if (cmpxchg(&huge_zero_pfn, 0, pfn))
-   __free_page(hpage);
-   return 0;
+   if (!zero_page)
+   return 0;
+   preempt_disable();
+   if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+   preempt_enable();
+   __free_page(zero_page);
+   goto retry;
+   }
+
+   /* We take additional reference here. It will be put back by shrinker */
+   atomic_set(&huge_zero_refcount, 2);
+   preempt_enable();
+   return ACCESS_ONCE(huge_zero_pfn);
  }
  
-static inline bool is_huge_zero_pfn(unsigned long pfn)

+static void put_huge_zero_page(void)
  {
-   return huge_zero_pfn && pfn == huge_zero_pfn;
+   /*
+* Counter should never go to zero here. Only shrinker can put
+* last reference.
+*/
+   BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
  }
  
-static inline bool is_huge_zero_pmd(pmd_t pmd)

+static int shrink_huge_zero_page(struct shrinker *shrink,
+   struct shrink_control *sc)
  {
-   return is_huge_zero_pfn(pmd_pfn(pmd));
+   if (!sc->nr_to_scan)
+   /* we can free zero page only if last reference remains */
+   return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+   if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+   unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+   BUG_ON(zero_pfn == 0);
+   __free_page(__pfn_to_page(zero_pfn));
+   }
+
+   return 0;
  }
  
+static struct shrinker huge_zero_page_shrinker = {

+   .shrink = shrink_huge_zero_page,
+   .

Re: [PATCH] tmpfs: change final i_blocks BUG to WARNING

2012-11-18 Thread Jaegeuk Hanse

On 11/06/2012 09:34 AM, Hugh Dickins wrote:

Under a particular load on one machine, I have hit shmem_evict_inode()'s
BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
race between swapout and eviction.

It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
and the lack of coherent locking between mapping's nrpages and shmem's
swapped count.  There's a window in shmem_writepage(), between lowering
nrpages in shmem_delete_from_page_cache() and then raising swapped count,
when the freed count appears to be +1 when it should be 0, and then the
asymmetry stops it from being corrected with -1 before hitting the BUG.


Hi Hugh,

So if the race happens, there can still be pages swapped out after the inode and
radix tree are destroyed.
What will happen when those pages need to be swapped in, in a scenario like
swapoff?


Regards,
Jaegeuk



One answer is coherent locking: using tree_lock throughout, without
info->lock; reasonable, but the raw_spin_lock in percpu_counter_add()
on used_blocks makes that messier than expected.  Another answer may be
a further effort to eliminate the weird shmem_recalc_inode() altogether,
but previous attempts at that failed.

So far undecided, but for now change the BUG_ON to WARN_ON:
in usual circumstances it remains a useful consistency check.

Signed-off-by: Hugh Dickins 
Cc: sta...@vger.kernel.org
---

  mm/shmem.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

--- 3.7-rc4/mm/shmem.c  2012-10-14 16:16:58.361309122 -0700
+++ linux/mm/shmem.c2012-11-01 14:31:04.288185742 -0700
@@ -643,7 +643,7 @@ static void shmem_evict_inode(struct ino
kfree(info->symlink);
  
  	simple_xattrs_free(&info->xattrs);

-   BUG_ON(inode->i_blocks);
+   WARN_ON(inode->i_blocks);
shmem_free_inode(inode->i_sb);
clear_inode(inode);
  }



Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

2012-11-19 Thread Jaegeuk Hanse

On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:

On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:

On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's implementation of lockless refcounting
for huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.

If counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for caller and one for shrinker. We free the
page only in shrinker callback if counter is 1 (only shrinker has the
reference).

put_huge_zero_page() only decrements counter. Counter is never zero
in put_huge_zero_page() since shrinker holds on reference.

Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.

Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting comparing to lazy huge page
allocation.  I think it's pretty reasonable for synthetic benchmark.

Hi Kirill,

I see your and Andew's hot discussion in v4 resend thread.

"I also tried another scenario: usemem -n16 100M -r 1000. It creates
real memory pressure - no easy reclaimable memory. This time
callback called with nr_to_scan > 0 and we freed hzp. "

What's "usemem"? Is it a tool and how to get it?

http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar


Thanks for your response.  But how do I use it? I can't even compile the
files.


# ./case-lru-file-mmap-read
./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0 
(error token is "nr_cpu")


# gcc usemem.c -o usemem
/tmp/ccFkIDWk.o: In function `do_task':
usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
usemem.c:(.text+0xa44): undefined reference to `pthread_join'
collect2: ld returned 1 exit status




It's hard for me to
find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
scenario?

shrink_slab() calls the callback with nr_to_scan > 0 if system is under
pressure -- look for do_shrinker_shrink().
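
As a minimal sketch of that two-phase protocol with the shrinker API of that
era (illustrative only; the helpers are hypothetical, not the huge zero page
code):

#include <linux/shrinker.h>

static int example_count_freeable(void);
static int example_free_some(unsigned long nr);

/* nr_to_scan == 0 is a query ("how many objects could you free?");
 * nr_to_scan > 0 asks the callback to actually free up to that many. */
static int example_shrink(struct shrinker *shrink, struct shrink_control *sc)
{
	if (!sc->nr_to_scan)
		return example_count_freeable();

	example_free_some(sc->nr_to_scan);
	return example_count_freeable();
}

static struct shrinker example_shrinker = {
	.shrink	= example_shrink,
	.seeks	= DEFAULT_SEEKS,
};

/* register_shrinker(&example_shrinker) at init time,
 * unregister_shrinker(&example_shrinker) on teardown. */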


Why doesn't Andrew's example (dd if=/fast-disk/large-file) hit this
path? I think it can also add memory pressure; what am I missing?





Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

2012-11-19 Thread Jaegeuk Hanse

On 11/19/2012 06:23 PM, Kirill A. Shutemov wrote:

On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote:

On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:

On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:

On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's implementation of lockless refcounting
for huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.

If counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for caller and one for shrinker. We free the
page only in shrinker callback if counter is 1 (only shrinker has the
reference).

put_huge_zero_page() only decrements counter. Counter is never zero
in put_huge_zero_page() since shrinker holds on reference.

Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.

Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting comparing to lazy huge page
allocation.  I think it's pretty reasonable for synthetic benchmark.

Hi Kirill,

I see your and Andew's hot discussion in v4 resend thread.

"I also tried another scenario: usemem -n16 100M -r 1000. It creates
real memory pressure - no easy reclaimable memory. This time
callback called with nr_to_scan > 0 and we freed hzp. "

What's "usemem"? Is it a tool and how to get it?

http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar

Thanks for your response.  But how to use it, I even can't compile
the files.

# ./case-lru-file-mmap-read
./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0
(error token is "nr_cpu")

# gcc usemem.c -o usemem

-lpthread


/tmp/ccFkIDWk.o: In function `do_task':
usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
usemem.c:(.text+0xa44): undefined reference to `pthread_join'
collect2: ld returned 1 exit status


It's hard for me to
find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
scenario?

shrink_slab() calls the callback with nr_to_scan > 0 if system is under
pressure -- look for do_shrinker_shrink().

Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this
path? I think it also can add memory pressure, where I miss?

dd if=large-file only fills pagecache -- easy reclaimable memory.
Pagecache will be dropped first, before shrinking slabs.


How can I confirm that page reclaim is working hard and that slabs are being
reclaimed at that time?





Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

2012-11-19 Thread Jaegeuk Hanse

On 11/19/2012 07:09 PM, Kirill A. Shutemov wrote:

On Mon, Nov 19, 2012 at 07:02:22PM +0800, Jaegeuk Hanse wrote:

On 11/19/2012 06:23 PM, Kirill A. Shutemov wrote:

On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote:

On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:

On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:

On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:

From: "Kirill A. Shutemov" 

H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's implementation of lockless refcounting
for huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.

If counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for caller and one for shrinker. We free the
page only in shrinker callback if counter is 1 (only shrinker has the
reference).

put_huge_zero_page() only decrements counter. Counter is never zero
in put_huge_zero_page() since shrinker holds on reference.

Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.

Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting comparing to lazy huge page
allocation.  I think it's pretty reasonable for synthetic benchmark.

Hi Kirill,

I see your and Andew's hot discussion in v4 resend thread.

"I also tried another scenario: usemem -n16 100M -r 1000. It creates
real memory pressure - no easy reclaimable memory. This time
callback called with nr_to_scan > 0 and we freed hzp. "

What's "usemem"? Is it a tool and how to get it?

http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar

Thanks for your response.  But how to use it, I even can't compile
the files.

# ./case-lru-file-mmap-read
./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0
(error token is "nr_cpu")

# gcc usemem.c -o usemem

-lpthread


/tmp/ccFkIDWk.o: In function `do_task':
usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
usemem.c:(.text+0xa44): undefined reference to `pthread_join'
collect2: ld returned 1 exit status


It's hard for me to
find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
scenario?

shrink_slab() calls the callback with nr_to_scan > 0 if system is under
pressure -- look for do_shrinker_shrink().

Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this
path? I think it also can add memory pressure, where I miss?

dd if=large-file only fills pagecache -- easy reclaimable memory.
Pagecache will be dropped first, before shrinking slabs.

How could I confirm page reclaim working hard and slabs are
reclaimed at this time?

The only thing I can see is slabs_scanned in vmstat.
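
For example, one illustrative way to watch that counter during a run is to
sample it before and after:

# grep slabs_scanned /proc/vmstat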


Oh, I see. Thanks! :-)




Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

2012-11-19 Thread Jaegeuk Hanse

On 11/19/2012 12:07 AM, Jiang Liu wrote:

The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
resolve an issue caused by inaccurate zone->present_pages, but that
fix is incomplete and causes regressions with HIGHMEM. And it has been
reverted by commit
5576646 revert "mm: fix-up zone present pages"

This is a follow-up patchset for the issue above. It introduces a
new field named "managed_pages" to struct zone, which counts pages
managed by the buddy system from the zone. And zone->present_pages
is used to count pages existing in the zone, which is
spanned_pages - absent_pages.

In that way, zone->present_pages will be kept consistent with
pgdat->node_present_pages, which is the sum of zone->present_pages.
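
In other words, a comment-style sketch of the relations described here (not
kernel code):

/*
 *   spanned_pages = zone_end_pfn - zone_start_pfn    (range, including holes)
 *   present_pages = spanned_pages - absent_pages     (pages that physically exist)
 *   managed_pages = present_pages - reserved_pages   (pages handed to the buddy
 *                                                     allocator)
 *
 * pgdat->node_present_pages is then the sum of zone->present_pages over
 * the node's zones.
 */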

This patchset has only been tested on x86_64 with nobootmem.c. So we need
help to test this patchset on machines that:
1) use bootmem.c


Does only x86_32 use bootmem.c instead of nobootmem.c? How can I confirm that?

2) have highmem

This patchset applies to "f4a75d2e Linux 3.7-rc6" from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Any comments and helps are welcomed!

Jiang Liu (5):
   mm: introduce new field "managed_pages" to struct zone
   mm: replace zone->present_pages with zone->managed_pages if
 appreciated
   mm: set zone->present_pages to number of existing pages in the zone
   mm: provide more accurate estimation of pages occupied by memmap
   mm: increase totalram_pages when free pages allocated by bootmem
 allocator

  include/linux/mmzone.h |1 +
  mm/bootmem.c   |   14 
  mm/memory_hotplug.c|6 
  mm/mempolicy.c |2 +-
  mm/nobootmem.c |   15 
  mm/page_alloc.c|   89 +++-
  mm/vmscan.c|   16 -
  mm/vmstat.c|8 +++--
  8 files changed, 108 insertions(+), 43 deletions(-)





Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

2012-11-19 Thread Jaegeuk Hanse

On 11/19/2012 12:07 AM, Jiang Liu wrote:

If SPARSEMEM is enabled, it won't build page structures for
non-existing pages (holes) within a zone, so provide a more accurate
estimation of pages occupied by memmap if there are big holes within
the zone.

And pages for highmem zones' memmap will be allocated from lowmem,
so charge nr_kernel_pages for that.

Signed-off-by: Jiang Liu 
---
  mm/page_alloc.c |   22 --
  1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b327d7..eb25679 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
  
  #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
  
+static unsigned long calc_memmap_size(unsigned long spanned_pages,

+ unsigned long present_pages)
+{
+   unsigned long pages = spanned_pages;
+
+   /*
+* Provide a more accurate estimation if there are big holes within
+* the zone and SPARSEMEM is in use.
+*/
+   if (spanned_pages > present_pages + (present_pages >> 4) &&
+   IS_ENABLED(CONFIG_SPARSEMEM))
+   pages = present_pages;
+
+   return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
+}
+
  /*
   * Set up the zone data structures:
   *   - mark all pages reserved
@@ -4469,8 +4485,7 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 * is used by this zone for memmap. This affects the watermark
 * and per-cpu initialisations
 */
-   memmap_pages =
-   PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+   memmap_pages = calc_memmap_size(size, realsize);
if (freesize >= memmap_pages) {
freesize -= memmap_pages;
if (memmap_pages)
@@ -4491,6 +4506,9 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
  
  		if (!is_highmem_idx(j))

nr_kernel_pages += freesize;
+   /* Charge for highmem memmap if there are enough kernel pages */
+   else if (nr_kernel_pages > memmap_pages * 2)
+   nr_kernel_pages -= memmap_pages;


Since this is in the else branch, what if nr_kernel_pages is equal to 0
initially?



nr_all_pages += freesize;
  
  		zone->spanned_pages = size;




Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

2012-11-19 Thread Jaegeuk Hanse

On 11/20/2012 10:43 AM, Jiang Liu wrote:

On 2012-11-20 10:13, Jaegeuk Hanse wrote:

On 11/19/2012 12:07 AM, Jiang Liu wrote:

The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
resolve an issue caused by inaccurate zone->present_pages, but that
fix is incomplete and causes regresions with HIGHMEM. And it has been
reverted by commit
5576646 revert "mm: fix-up zone present pages"

This is a following-up patchset for the issue above. It introduces a
new field named "managed_pages" to struct zone, which counts pages
managed by the buddy system from the zone. And zone->present_pages
is used to count pages existing in the zone, which is
 spanned_pages - absent_pages.

But that way, zone->present_pages will be kept in consistence with
pgdat->node_present_pages, which is sum of zone->present_pages.

This patchset has only been tested on x86_64 with nobootmem.c. So need
help to test this patchset on machines:
1) use bootmem.c

If only x86_32 use bootmem.c instead of nobootmem.c? How could I confirm it?

Hi Jaegeuk,
Thanks for reviewing this patch set.
Currently x86/x86_64/Sparc have been converted to use nobootmem.c,
and other arches still use bootmem.c. So we need to test it on other arches,
such as ARM etc. Yesterday we tested this patchset on an Itanium platform,
so bootmem.c should work as expected too.


Hi Jiang,

Were any code changes needed in x86/x86_64 to fit the nobootmem.c logic?
I mean, if I remove

config NO_BOOTMEM
   def_bool y

from arch/x86/Kconfig, can x86/x86_64 still fall back to bootmem.c
or not?


Regards,
Jaegeuk


Thanks!
Gerry






Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-19 Thread Jaegeuk Hanse

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch add unregister_memory_section() into __remove_section().


Hi Yasuaki,

In order to review this patch, I need to dig into the sparse memory code
first. But I'm somewhat confused by the code. Why does the mem_map need to be
encoded/decoded instead of storing mem_map in ms->section_mem_map directly?


Regards,
Jaegeuk



CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
  mm/memory_hotplug.c | 13 -
  1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone 
*zone,
  #ifdef CONFIG_SPARSEMEM_VMEMMAP
  static int __remove_section(struct zone *zone, struct mem_section *ms)
  {
-   /*
-* XXX: Freeing memmap with vmemmap is not implement yet.
-*  This should be removed later.
-*/
-   return -EBUSY;
+   int ret = -EINVAL;
+
+   if (!valid_section(ms))
+   return ret;
+
+   ret = unregister_memory_section(ms);
+
+   return ret;
  }
  #else
  static int __remove_section(struct zone *zone, struct mem_section *ms)




Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-19 Thread Jaegeuk Hanse

On 11/20/2012 02:55 PM, Wen Congyang wrote:

At 11/20/2012 02:22 PM, Jaegeuk Hanse Wrote:

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch adds unregister_memory_section() into __remove_section().

Hi Yasuaki,

In order to review this patch, I had to dig into the sparse memory code
in advance. But some of the code confuses me. Why do we need to
encode/decode the mem_map instead of setting mem_map into
ms->section_mem_map directly?

The memmap is aligned, and the low bits are zero. We store some information
in these bits. So we need to encode/decode memmap here.
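
For reference, the encoding being discussed relies on a few low-bit flags;
they look roughly like this in include/linux/mmzone.h (quoted from memory,
so treat it as a sketch rather than the exact source):

/*
 * The low bits of the section_mem_map value are known to be zero because
 * of pointer alignment, so a little extra information is stored there.
 */
#define SECTION_MARKED_PRESENT	(1UL<<0)	/* section exists */
#define SECTION_HAS_MEM_MAP	(1UL<<1)	/* section has a memmap */
#define SECTION_MAP_LAST_BIT	(1UL<<2)
#define SECTION_MAP_MASK	(~(SECTION_MAP_LAST_BIT-1))

static inline struct page *__section_mem_map_addr(struct mem_section *section)
{
	unsigned long map = section->section_mem_map;

	map &= SECTION_MAP_MASK;	/* strip the flag bits */
	return (struct page *)map;
}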


Hi Congyang,

Thanks for your response. But I mean: why return (unsigned long)(mem_map -
(section_nr_to_pfn(pnum))); in function sparse_encode_mem_map, and then
return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum); in
function sparse_decode_mem_map, instead of just storing mem_map in
ms->section_mem_map directly?


Regards,
Jaegeuk



Thanks
Wen Congyang


Regards,
Jaegeuk


CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
   mm/memory_hotplug.c | 13 -
   1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid,
struct zone *zone,
   #ifdef CONFIG_SPARSEMEM_VMEMMAP
   static int __remove_section(struct zone *zone, struct mem_section *ms)
   {
-/*
- * XXX: Freeing memmap with vmemmap is not implement yet.
- *  This should be removed later.
- */
-return -EBUSY;
+int ret = -EINVAL;
+
+if (!valid_section(ms))
+return ret;
+
+ret = unregister_memory_section(ms);
+
+return ret;
   }
   #else
   static int __remove_section(struct zone *zone, struct mem_section *ms)






Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Jaegeuk Hanse

On 11/20/2012 05:37 PM, Wen Congyang wrote:

At 11/20/2012 02:58 PM, Jaegeuk Hanse Wrote:

On 11/20/2012 02:55 PM, Wen Congyang wrote:

At 11/20/2012 02:22 PM, Jaegeuk Hanse Wrote:

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch adds unregister_memory_section() into __remove_section().

Hi Yasuaki,

In order to review this patch, I had to dig into the sparse memory code
in advance. But some of the code confuses me. Why do we need to
encode/decode the mem_map instead of setting mem_map into
ms->section_mem_map directly?

The memmap is aligned, and the low bits are zero. We store some
information
in these bits. So we need to encode/decode memmap here.

Hi Congyang,

Thanks for your response. But I mean: why return (unsigned long)(mem_map -
(section_nr_to_pfn(pnum))); in function sparse_encode_mem_map, and then
return ((struct page *)coded_mem_map) + section_nr_to_pfn(pnum); in
function sparse_decode_mem_map, instead of just storing mem_map in
ms->section_mem_map directly?

I don't know why. I tried to find the reason, but I couldn't find any
place that uses the pfn stored in the mem_map except the decode
function. Maybe the designer doesn't want us to access the mem_map
directly.


It seems that mem_map is per-node, but the pfn is the real (global) pfn.
You can check __page_to_pfn.
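
To illustrate the point: with classic SPARSEMEM the section_mem_map value is
stored relative to the section's first pfn, so the pfn arithmetic can use the
real (global) pfn even though each section's memmap may live in different
per-node memory. Roughly, from include/asm-generic/memory_model.h of that era
(quoted from memory, treat it as a sketch):

/* memmap is virtually contiguous only per section */
#define __page_to_pfn(pg)					\
({	const struct page *__pg = (pg);				\
	int __sec = page_to_section(__pg);			\
	(unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); \
})

#define __pfn_to_page(pfn)					\
({	unsigned long __pfn = (pfn);				\
	struct mem_section *__sec = __pfn_to_section(__pfn);	\
	__section_mem_map_addr(__sec) + __pfn;			\
})

Because sparse_encode_mem_map() already subtracted section_nr_to_pfn(pnum),
__pfn_to_page() can simply add the global pfn, and __page_to_pfn() gets the
global pfn back by subtraction.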



Thanks
Wen Congyang


Regards,
Jaegeuk


Thanks
Wen Congyang


Regards,
Jaegeuk


CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
mm/memory_hotplug.c | 13 -
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid,
struct zone *zone,
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static int __remove_section(struct zone *zone, struct mem_section
*ms)
{
-/*
- * XXX: Freeing memmap with vmemmap is not implement yet.
- *  This should be removed later.
- */
-return -EBUSY;
+int ret = -EINVAL;
+
+if (!valid_section(ms))
+return ret;
+
+ret = unregister_memory_section(ms);
+
+return ret;
}
#else
static int __remove_section(struct zone *zone, struct mem_section
*ms)






Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Jaegeuk Hanse

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch adds unregister_memory_section() into __remove_section().


Hi Yasuaki,

I have a question about these sparse vmemmap memory related patches.
Hot-adding memory needs newly allocated vmemmap pages, but this time they
are allocated by the buddy system. How can we guarantee that the virtual
address is contiguous with the addresses allocated before? If it is not
contiguous, page_to_pfn and pfn_to_page can't work correctly.


Regards,
Jaegeuk



CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
  mm/memory_hotplug.c | 13 -
  1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone 
*zone,
  #ifdef CONFIG_SPARSEMEM_VMEMMAP
  static int __remove_section(struct zone *zone, struct mem_section *ms)
  {
-   /*
-* XXX: Freeing memmap with vmemmap is not implement yet.
-*  This should be removed later.
-*/
-   return -EBUSY;
+   int ret = -EINVAL;
+
+   if (!valid_section(ms))
+   return ret;
+
+   ret = unregister_memory_section(ms);
+
+   return ret;
  }
  #else
  static int __remove_section(struct zone *zone, struct mem_section *ms)




Re: [PATCH 0/5] Add movablecore_map boot option.

2012-11-20 Thread Jaegeuk Hanse

On 11/20/2012 07:07 PM, Yasuaki Ishimatsu wrote:

2012/11/20 5:53, Andrew Morton wrote:

On Mon, 19 Nov 2012 22:27:21 +0800
Tang Chen  wrote:

This patchset provides a boot option for the user to specify the
ZONE_MOVABLE memory map for each node in the system:

movablecore_map=nn[KMG]@ss[KMG]

This option makes sure the memory range from ss to ss+nn is movable memory.
1) If the range falls within a single node, then from ss to the end of
   the node will be ZONE_MOVABLE.
2) If the range covers two or more nodes, then from ss to the end of
   the node will be ZONE_MOVABLE, and all the other nodes will only
   have ZONE_MOVABLE.
3) If no range is in a node, then the node will have no ZONE_MOVABLE
   unless kernelcore or movablecore is specified.
4) This option can be specified at most MAX_NUMNODES times.
5) If kernelcore or movablecore is also specified, movablecore_map will
   have higher priority to be satisfied.
6) This option has no conflict with the memmap option. (A concrete usage
   example follows below.)
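
As a concrete usage example (the values here are hypothetical, just to show
the syntax), booting with:

    movablecore_map=4G@8G

asks that the 4G of memory starting at physical address 8G be movable; by
rule 1) above, if that range lies within a single node, everything from 8G
to the end of that node becomes ZONE_MOVABLE.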


This doesn't describe the problem which the patchset solves.  I can
kinda see where it's coming from, but it would be nice to have it all
spelled out, please.




- What is wrong with the kernel as it stands?


If we hot remove memory, that memory cannot contain kernel memory,
because Linux cannot migrate kernel memory currently. Therefore,
we have to guarantee that the hot removed memory has only movable
memory.

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options can specify the amount
of memory use as kernel or movable memory. Using them, we can
create ZONE_MOVABLE which has only movable memory.

But it does not fulfill a requirement of memory hot remove, because
even if we specify the boot options, movable memory is distributed
evenly across the nodes. So when we want to hot remove memory whose
range is 0x8000-0c000, we have no way to specify
that memory as movable memory.


Could you explain why we can't specify the memory as movable memory in this
case?




So we proposed a new feature which specifies memory range to use as
movable memory.


- What are the possible ways of solving this?


I thought 2 ways to specify movable memory.
 1. use firmware information
 2. use boot option

1. use firmware information
  According to the ACPI 5.0 spec, the SRAT table has a memory affinity
  structure and the structure has a Hot Pluggable Field. See "5.2.16.2
  Memory Affinity Structure". If we use that information, we might be able
  to specify movable memory via firmware. For example, if the Hot Pluggable
  Field is enabled, Linux sets the memory as movable memory.

2. use boot option
  This is our proposal. New boot option can specify memory range to use
  as movable memory.


- Describe the chosen way, explain why it is superior to alternatives


We chose the second way, because with the first way users cannot easily
change which memory range is used as movable memory. We think that if we
create movable memory, a performance regression may occur due to NUMA. In
this case,


Could you explain in detail why the regression occurs?


the user can turn the feature off easily if we prepare the boot option.
And if we prepare the boot option, the user can select which memory
to use as movable memory easily.

Thanks,
Yasuaki Ishimatsu



The amount of manual system configuration in this proposal looks quite
high.  Adding kernel boot parameters really is a last resort. Why was
it unavoidable here?








Re: fadvise interferes with readahead

2012-11-20 Thread Jaegeuk Hanse

On 11/20/2012 04:04 PM, Fengguang Wu wrote:

Hi Claudio,

Thanks for the detailed problem description!

On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote:

Hi. First of all, I'm not subscribed to this list, so I'd suggest all
replies copy me personally.

I have been trying to implement some I/O pipelining in Postgres (ie:
read the next data page asynchronously while working on the current
page), and stumbled upon some puzzling behavior involving the
interaction between fadvise and readahead.

I'm running kernel 3.0.0 (debian testing), on a single-disk system
which, though unsuitable for database workloads, is slow enough to let
me experiment with these read-ahead issues.

Typical random I/O performance is on the order of between 150 r/s to
200 r/s (ballpark 7200rpm I'd say), with throughput around 1.5MB/s.
Sequential I/O can go up to 60MB/s, though it tends to be around 50.

Now onto the problem. In order to parallelize I/O with computation,
I've made postgres fadvise(willneed) the pages it will read next. How
far ahead is configurable, and I've tested with a number of
configurations.

The prefetching logic is aware of the OS and pg-specific cache, so it
will only fadvise a block once. fadvise calls will stay 1 (or a
configurable N) real I/O ahead of read calls, and there's no fadvising
of pages that won't be read eventually, in the same order. I checked
with strace.

However, performance when fadvising drops considerably for a specific
yet common access pattern:

When a nested loop with two index scans happens, access is random
locally, but eventually whole ranges of a file get read (in this
random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2
4 5 101 298 301". Though random, there are ranges there that can be
merged in one read-request.

The kernel seems to do the merge by applying some form of readahead,
not sure if it's context, ondemand or adaptive readahead on the 3.0.0
kernel. Anyway, it seems to do readahead, as iostat says:

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.00 4.40  224.202.00 4.16 0.03
37.86 1.918.438.00   56.80   4.40  99.44

(notice the avgrq-sz of 37.8)

With fadvise calls, the thing looks a lot different:

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.0018.00  226.801.00 1.80 0.07
16.81 4.00   17.52   17.23   82.40   4.39  99.92

FYI, there is a readahead tracing/stats patchset that can provide far
more accurate numbers about what's going on with readahead, which will
help eliminate lots of the guess works here.

https://lwn.net/Articles/472798/


Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's
spot-on with a postgres page (8k). So, fadvise seems to carry out the
requests verbatim, while read manages to merge at least two of them.

The random nature of reads makes me think the scheduler is failing to
merge the requests in both cases (rrqm/s = 0), because it only looks
at successive requests (I'm only guessing here though).

I guess it's not a merging problem, but that the kernel readahead code
manages to submit larger IO requests in the first place.


Looking into the kernel code, it seems the problem could be related to
how fadvise works in conjunction with readahead. fadvise seems to call
the function in readahead.c that schedules the asynchronous I/O[0]. It
doesn't seem subject to the readahead logic itself[1], which in and of
itself doesn't seem bad. But it does, I assume (not knowing the code that
well), prevent the readahead logic[2] from eventually seeing the pattern.
It effectively disables readahead altogether.

You are right. If user space does fadvise() and the fadvised pages
cover all read() pages, the kernel readahead code will not run at all.

So the title is actually a bit misleading. The kernel readahead won't
interfere with user space prefetching at all. ;)
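
For reference, the fadvise path referred to at [0] looked roughly like this
(mm/fadvise.c of that era, quoted from memory, treat it as a sketch); note
that it goes through force_page_cache_readahead(), which submits the I/O
directly without touching the per-file readahead state:

	case POSIX_FADV_WILLNEED:
		/* First and last PARTIAL page! */
		start_index = offset >> PAGE_CACHE_SHIFT;
		end_index = endbyte >> PAGE_CACHE_SHIFT;

		/* Careful about overflow on the "+1" */
		nrpages = end_index - start_index + 1;
		if (!nrpages)
			nrpages = ~0UL;

		/* bypasses the ondemand readahead heuristics entirely */
		ret = force_page_cache_readahead(mapping, file,
						 start_index, nrpages);
		if (ret > 0)
			ret = 0;
		break;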


This, I theorize, may be because after the fadvise call starts an
async I/O on the page, further reads won't hit readahead code because
of the page cache[3] (!PageUptodate I imagine). Whether this is
desirable or not is not really obvious. In this particular case, doing
fadvise calls in what would seem an optimum way, results in terribly
worse performance. So I'd suggest it's not really that advisable.

Yes. The kernel readahead code by design will outperform simple
fadvise in the case of clustered random reads. Imagine the access
pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs literally. While


Do you mean it will trigger 6 IOs in the POSIX_FADV_RANDOM case or in the
POSIX_FADV_WILLNEED case?



kernel readahead will likely trigger 3 IOs for 1, 3, 2-9. Because on
the page miss for 2, it will detect the existence of history page 1
and do readahead properly. For hard disks, it's mainly the number of


If the first IO read 1, it will call page_

Re: fadvise interferes with readahead

2012-11-20 Thread Jaegeuk Hanse

On 11/20/2012 04:04 PM, Fengguang Wu wrote:

Hi Claudio,

Thanks for the detailed problem description!


Hi Fengguang,

Another question, thanks in advance.

What's the meaning of interleaved reads? If the first process does
readahead from start to start + size - async_size, and another process
reads start + size - async_size + 1, then what will happen? It seems that
the variable hit_readahead_marker is false, so the related code can't run.
What am I missing?


Regards,
Jaegeuk



On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote:

Hi. First of all, I'm not subscribed to this list, so I'd suggest all
replies copy me personally.

I have been trying to implement some I/O pipelining in Postgres (ie:
read the next data page asynchronously while working on the current
page), and stumbled upon some puzzling behavior involving the
interaction between fadvise and readahead.

I'm running kernel 3.0.0 (debian testing), on a single-disk system
which, though unsuitable for database workloads, is slow enough to let
me experiment with these read-ahead issues.

Typical random I/O performance is on the order of between 150 r/s to
200 r/s (ballpark 7200rpm I'd say), with throughput around 1.5MB/s.
Sequential I/O can go up to 60MB/s, though it tends to be around 50.

Now onto the problem. In order to parallelize I/O with computation,
I've made postgres fadvise(willneed) the pages it will read next. How
far ahead is configurable, and I've tested with a number of
configurations.

The prefetching logic is aware of the OS and pg-specific cache, so it
will only fadvise a block once. fadvise calls will stay 1 (or a
configurable N) real I/O ahead of read calls, and there's no fadvising
of pages that won't be read eventually, in the same order. I checked
with strace.

However, performance when fadvising drops considerably for a specific
yet common access pattern:

When a nested loop with two index scans happens, access is random
locally, but eventually whole ranges of a file get read (in this
random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2
4 5 101 298 301". Though random, there are ranges there that can be
merged in one read-request.

The kernel seems to do the merge by applying some form of readahead,
not sure if it's context, ondemand or adaptive readahead on the 3.0.0
kernel. Anyway, it seems to do readahead, as iostat says:

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.00 4.40  224.202.00 4.16 0.03
37.86 1.918.438.00   56.80   4.40  99.44

(notice the avgrq-sz of 37.8)

With fadvise calls, the thing looks a lot different:

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.0018.00  226.801.00 1.80 0.07
16.81 4.00   17.52   17.23   82.40   4.39  99.92

FYI, there is a readahead tracing/stats patchset that can provide far
more accurate numbers about what's going on with readahead, which will
help eliminate lots of the guess works here.

https://lwn.net/Articles/472798/


Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's
spot-on with a postgres page (8k). So, fadvise seems to carry out the
requests verbatim, while read manages to merge at least two of them.

The random nature of reads makes me think the scheduler is failing to
merge the requests in both cases (rrqm/s = 0), because it only looks
at successive requests (I'm only guessing here though).

I guess it's not a merging problem, but that the kernel readahead code
manages to submit larger IO requests in the first place.


Looking into the kernel code, it seems the problem could be related to
how fadvise works in conjunction with readahead. fadvise seems to call
the function in readahead.c that schedules the asynchronous I/O[0]. It
doesn't seem subject to the readahead logic itself[1], which in and of
itself doesn't seem bad. But it does, I assume (not knowing the code that
well), prevent the readahead logic[2] from eventually seeing the pattern.
It effectively disables readahead altogether.

You are right. If user space does fadvise() and the fadvised pages
cover all read() pages, the kernel readahead code will not run at all.

So the title is actually a bit misleading. The kernel readahead won't
interfere with user space prefetching at all. ;)


This, I theorize, may be because after the fadvise call starts an
async I/O on the page, further reads won't hit readahead code because
of the page cache[3] (!PageUptodate I imagine). Whether this is
desirable or not is not really obvious. In this particular case, doing
fadvise calls in what would seem an optimum way, results in terribly
worse performance. So I'd suggest it's not really that advisable.

Yes. The kernel readahead code by design will outperform simple
fadvise in the case of clustered random reads. Imagine the access
pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs 

Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Jaegeuk Hanse

On 11/21/2012 11:05 AM, Wen Congyang wrote:

At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote:

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch adds unregister_memory_section() into __remove_section().

Hi Yasuaki,

I have a question about these sparse vmemmap memory related patches.
Hot-adding memory needs newly allocated vmemmap pages, but this time they
are allocated by the buddy system. How can we guarantee that the virtual
address is contiguous with the addresses allocated before? If it is not
contiguous, page_to_pfn and pfn_to_page can't work correctly.

vmemmap has its own virtual address range (on x86_64):
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)

We allocate memory from the buddy system to store the struct pages, and its
virtual address isn't in this range. So we should update the page table:

kmalloc_section_memmap()
 sparse_mem_map_populate()
 pfn_to_page() // get the virtual address in the vmemmap range
 vmemmap_populate() // we update the page table here

When we use vmemmap, page_to_pfn() always returns an address in the vmemmap
range, not the address that kmalloc() returns. So the virtual address
is contiguous.
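
For reference, the populate step in that chain looked roughly like this
around the 3.7 timeframe (mm/sparse-vmemmap.c, quoted from memory, treat it
as a sketch): pfn_to_page() computes the fixed virtual address inside the
vmemmap window, and vmemmap_populate() then maps freshly allocated physical
pages behind that fixed virtual range.

struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
{
	/* fixed virtual address of this section's memmap in the vmemmap window */
	struct page *map = pfn_to_page(pnum * PAGES_PER_SECTION);

	/*
	 * Allocate backing pages (from bootmem at boot, or the buddy system
	 * on hot-add) and install them in the page table behind that range.
	 */
	if (vmemmap_populate(map, PAGES_PER_SECTION, nid))
		return NULL;

	return map;
}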


Hi Congyang,

Another question about memory hotplug. During memory hot-remove, it will
also call memblock_remove to remove the related memblock region.

memblock_remove()
   __memblock_remove()
   memblock_isolate_range()
   memblock_remove_region()

But memblock_isolate_range() only records fully contained regions;
regions which are partially overlapped are just split instead of recorded.
So these partially overlapped regions can't be removed. What am I missing?


Regards,
Jaegeuk


Thanks
Wen Congyang

Regards,
Jaegeuk


CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
   mm/memory_hotplug.c | 13 -
   1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid,
struct zone *zone,
   #ifdef CONFIG_SPARSEMEM_VMEMMAP
   static int __remove_section(struct zone *zone, struct mem_section *ms)
   {
-/*
- * XXX: Freeing memmap with vmemmap is not implement yet.
- *  This should be removed later.
- */
-return -EBUSY;
+int ret = -EINVAL;
+
+if (!valid_section(ms))
+return ret;
+
+ret = unregister_memory_section(ms);
+
+return ret;
   }
   #else
   static int __remove_section(struct zone *zone, struct mem_section *ms)






Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Jaegeuk Hanse

On 11/21/2012 12:42 PM, Wen Congyang wrote:

At 11/21/2012 12:22 PM, Jaegeuk Hanse Wrote:

On 11/21/2012 11:05 AM, Wen Congyang wrote:

At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote:

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch adds unregister_memory_section() into __remove_section().

Hi Yasuaki,

I have a question about these sparse vmemmap memory related patches.
Hot-adding memory needs newly allocated vmemmap pages, but this time they
are allocated by the buddy system. How can we guarantee that the virtual
address is contiguous with the addresses allocated before? If it is not
contiguous, page_to_pfn and pfn_to_page can't work correctly.

vmemmap has its own virtual address range (on x86_64):
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)

We allocate memory from the buddy system to store the struct pages, and its
virtual address isn't in this range. So we should update the page table:

kmalloc_section_memmap()
  sparse_mem_map_populate()
  pfn_to_page() // get the virtual address in the vmemmap range
  vmemmap_populate() // we update the page table here

When we use vmemmap, page_to_pfn() always returns an address in the vmemmap
range, not the address that kmalloc() returns. So the virtual address
is contiguous.

Hi Congyang,

Another question about memory hotplug. During memory hot-remove, it will
also call memblock_remove to remove the related memblock region.

IIRC, we don't touch memblock when hot-adding/hot-removing memory. memblock
is only used by the bootmem allocator. I think it isn't used after booting.


It is touched in IBM pseries servers:

pseries_remove_memory()
pseries_remove_memblock()
memblock_remove()

Furthermore, in the x86 case memblock is set up to record the available
memory ranges taken from the e820 map (you can check it in
memblock_x86_fill()). After hot-removing memory, that range of memory is no
longer available, so why not remove it the way the pseries servers' code
does?



memblock_remove()
__memblock_remove()
memblock_isolate_range()
memblock_remove_region()

But memblock_isolate_range() only records fully contained regions;
regions which are partially overlapped are just split instead of recorded.
So these partially overlapped regions can't be removed. What am I missing?

No, memblock_isolate_range() can deal with partially overlapped regions.
=
if (rbase < base) {
/*
 * @rgn intersects from below.  Split and continue
 * to process the next region - the new top half.
 */
rgn->base = base;
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
   memblock_get_region_node(rgn));
} else if (rend > end) {
/*
 * @rgn intersects from above.  Split and redo the
 * current region - the new bottom half.
 */
rgn->base = end;
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
   memblock_get_region_node(rgn));
=

If the region is a partially overlapped region, we will split the old region
into two regions. After doing this, it is a fully contained region now.
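
For completeness, here is roughly what the removal path looked like in
mm/memblock.c around 3.7 (quoted from memory, treat it as a sketch): after
the isolation step, every slot in [start_rgn, end_rgn) fully covers part of
the requested range, so it can be dropped unconditionally.

static int __init_memblock __memblock_remove(struct memblock_type *type,
					     phys_addr_t base, phys_addr_t size)
{
	int start_rgn, end_rgn;
	int i, ret;

	/* split any partially overlapping regions at 'base' and 'base + size' */
	ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
	if (ret)
		return ret;

	/* everything in [start_rgn, end_rgn) is now fully contained: drop it */
	for (i = end_rgn - 1; i >= start_rgn; i--)
		memblock_remove_region(type, i);
	return 0;
}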


You are right, I misunderstood the code.



Thanks
Wen Congyang


Regards,
Jaegeuk


Thanks
Wen Congyang

Regards,
Jaegeuk


CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
mm/memory_hotplug.c | 13 -
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid,
struct zone *zone,
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static int __remove_section(struct zone *zone, struct mem_section
*ms)
{
-/*
- * XXX: Freeing memmap with vmemmap is not implement yet.
- *  This should be removed later.
- */
-return -EBUSY;
+int ret = -EINVAL;
+
+if (!valid_section(ms))
+return ret;
+
+ret = unregister_memory_section(ms);
+
+return ret;
}
#else
static int __remove_section(struct zone *zone, struct mem_section
*ms)





Re: fadvise interferes with readahead

2012-11-20 Thread Jaegeuk Hanse

On 11/20/2012 11:15 PM, Fengguang Wu wrote:

On Tue, Nov 20, 2012 at 10:11:54PM +0800, Jaegeuk Hanse wrote:

On 11/20/2012 04:04 PM, Fengguang Wu wrote:

Hi Claudio,

Thanks for the detailed problem description!

Hi Fengguang,

Another question, thanks in advance.

What's the meaning of interleaved reads? If the first process

It's access patterns like

 1, 1001, 2, 1002, 3, 1003, ...

in which there are two (or more) mixed sequential read streams.


does readahead from start to start + size - async_size, and another process
reads start + size - async_size + 1, then what will happen? It seems
that the variable hit_readahead_marker is false, so the related code can't
run. What am I missing?

Yes hit_readahead_marker will be false. However on reading 1002,
hit_readahead_marker()/count_history_pages() will find the previous
page 1001 already in page cache and trigger context readahead.
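
A simplified sketch of that context readahead step (paraphrased, not the
exact kernel code; helper names and signatures may differ from
mm/readahead.c): on a page miss, count how many history pages sit
immediately before the missing offset, and if there are any, treat the
access as part of a stream and size a new window from that history.

static int context_readahead_sketch(struct address_space *mapping,
				    struct file_ra_state *ra,
				    pgoff_t offset, unsigned long req_size,
				    unsigned long max)
{
	/* how many consecutive pages before 'offset' are already cached? */
	pgoff_t hist = count_history_pages(mapping, offset, max);

	if (!hist)		/* no history: looks like a pure random read */
		return 0;

	if (hist >= offset)	/* history reaches the start of the file */
		hist *= 2;	/* strong hint of a long sequential stream */

	ra->start = offset;
	ra->size = get_init_ra_size(hist + req_size, max);
	ra->async_size = ra->size;
	return 1;
}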


Hi Fengguang,

Thanks for your explanation. Regarding the comment in
ondemand_readahead(), "Hit a marked page without valid readahead state":
what is the meaning of "without valid readahead state"?


Regards,
Jaegeuk



Thanks,
Fengguang


On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote:

Hi. First of all, I'm not subscribed to this list, so I'd suggest all
replies copy me personally.

I have been trying to implement some I/O pipelining in Postgres (ie:
read the next data page asynchronously while working on the current
page), and stumbled upon some puzzling behavior involving the
interaction between fadvise and readahead.

I'm running kernel 3.0.0 (debian testing), on a single-disk system
which, though unsuitable for database workloads, is slow enough to let
me experiment with these read-ahead issues.

Typical random I/O performance is on the order of between 150 r/s to
200 r/s (ballpark 7200rpm I'd say), with throughput around 1.5MB/s.
Sequential I/O can go up to 60MB/s, though it tends to be around 50.

Now onto the problem. In order to parallelize I/O with computation,
I've made postgres fadvise(willneed) the pages it will read next. How
far ahead is configurable, and I've tested with a number of
configurations.

The prefetching logic is aware of the OS and pg-specific cache, so it
will only fadvise a block once. fadvise calls will stay 1 (or a
configurable N) real I/O ahead of read calls, and there's no fadvising
of pages that won't be read eventually, in the same order. I checked
with strace.

However, performance when fadvising drops considerably for a specific
yet common access pattern:

When a nested loop with two index scans happens, access is random
locally, but eventually whole ranges of a file get read (in this
random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2
4 5 101 298 301". Though random, there are ranges there that can be
merged in one read-request.

The kernel seems to do the merge by applying some form of readahead,
not sure if it's context, ondemand or adaptive readahead on the 3.0.0
kernel. Anyway, it seems to do readahead, as iostat says:

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.00 4.40  224.202.00 4.16 0.03
37.86 1.918.438.00   56.80   4.40  99.44

(notice the avgrq-sz of 37.8)

With fadvise calls, the thing looks a lot different:

Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.0018.00  226.801.00 1.80 0.07
16.81 4.00   17.52   17.23   82.40   4.39  99.92

FYI, there is a readahead tracing/stats patchset that can provide far
more accurate numbers about what's going on with readahead, which will
help eliminate lots of the guess works here.

https://lwn.net/Articles/472798/


Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's
spot-on with a postgres page (8k). So, fadvise seems to carry out the
requests verbatim, while read manages to merge at least two of them.

The random nature of reads makes me think the scheduler is failing to
merge the requests in both cases (rrqm/s = 0), because it only looks
at successive requests (I'm only guessing here though).

I guess it's not a merging problem, but that the kernel readahead code
manages to submit larger IO requests in the first place.


Looking into the kernel code, it seems the problem could be related to
how fadvise works in conjunction with readahead. fadvise seems to call
the function in readahead.c that schedules the asynchronous I/O[0]. It
doesn't seem subject to the readahead logic itself[1], which in and of
itself doesn't seem bad. But it does, I assume (not knowing the code that
well), prevent the readahead logic[2] from eventually seeing the pattern.
It effectively disables readahead altogether.

You are right. If user space does fadvise() and the 

Re: fadvise interferes with readahead

2012-11-20 Thread Jaegeuk Hanse

On 11/20/2012 10:58 PM, Fengguang Wu wrote:

On Tue, Nov 20, 2012 at 10:34:11AM -0300, Claudio Freire wrote:

On Tue, Nov 20, 2012 at 5:04 AM, Fengguang Wu  wrote:

Yes. The kernel readahead code by design will outperform simple
fadvise in the case of clustered random reads. Imagine the access
pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs literally. While
kernel readahead will likely trigger 3 IOs for 1, 3, 2-9. Because on
the page miss for 2, it will detect the existence of history page 1
and do readahead properly. For hard disks, it's mainly the number of
IOs that matters. So even if kernel readahead loses some opportunities
to do async IO and possibly loads some extra pages that will never be
used, it still manges to perform much better.


The fix would lay in fadvise, I think. It should update readahead
tracking structures. Alternatively, one could try to do it in
do_generic_file_read, updating readahead on !PageUptodate or even on
page cache hits. I really don't have the expertise or time to go
modifying, building and testing the supposedly quite simple patch that
would fix this. It's mostly about the testing, in fact. So if someone
can comment or try by themselves, I guess it would really benefit
those relying on fadvise to fix this behavior.

One possible solution is to try the context readahead at fadvise time
to check the existence of history pages and do readahead accordingly.

However it will introduce *real interferences* between kernel
readahead and user prefetching. The original scheme is, once user
space starts its own informed prefetching, kernel readahead will
automatically stand out of the way.

I understand that would seem like a reasonable design, but in this
particular case it doesn't seem to be. I propose that in most cases it
doesn't really work well as a design decision, to make fadvise work as
direct I/O. Precisely because fadvise is supposed to be a hint to let
the kernel make better decisions, and not a request to make the kernel
stop making decisions.

Any interference so introduced wouldn't be any worse than the
interference introduced by readahead over reads. I agree, if fadvise
were to trigger readahead, it could be bad for applications that don't
read what they say the will.

Right.


But if cache hits were to simply update
readahead state, it would only mean that read calls behave the same
regardless of fadvise calls. I think that's worth pursuing.

Here you are describing an alternative solution that will somehow trap
into the readahead code even when, for example, the application is
accessing once and again an already cached file?  I'm afraid this will
add non-trivial overheads and is less attractive than the "readahead
on fadvise" solution.


Hi Fengguang,

Page cache sync readahead is only triggered on a cache miss, but if the
file is already cached, how can readahead be triggered again when the
application is accessing an already cached file over and over?
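
For reference, the read() side looks roughly like this in
do_generic_file_read() (mm/filemap.c of that era, quoted from memory, treat
it as a sketch): sync readahead only fires on a miss, and async readahead
only fires when a cached page still carries the PG_readahead marker, so a
fully cached file with no markers triggers neither.

		page = find_get_page(mapping, index);
		if (!page) {
			/* cache miss: sync readahead */
			page_cache_sync_readahead(mapping,
					ra, filp,
					index, last_index - index);
			page = find_get_page(mapping, index);
			if (unlikely(page == NULL))
				goto no_cached_page;
		}
		if (PageReadahead(page)) {
			/* hit a PG_readahead marker: async readahead */
			page_cache_async_readahead(mapping,
					ra, filp, page,
					index, last_index - index);
		}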


Regards,
Jaegeuk




I ought to try to prepare a patch for this to illustrate my point. Not
sure I'll be able to though.

I'd be glad to materialize the readahead on fadvise proposal, if there
are no obvious negative examples/cases.

Thanks,
Fengguang





Re: Problem in Page Cache Replacement

2012-11-21 Thread Jaegeuk Hanse

Cc Fengguang Wu.

On 11/21/2012 04:13 PM, metin d wrote:

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more 
diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently 
and the other one about once a month. It seems like the memory manager keeps 
unused pages in memory at the expense of frequently used database's performance.

My understanding was that under memory pressure from heavily accessed pages, 
unused pages would eventually get evicted. Is there anything else we can try on 
this host to understand why this is happening?

Thank you,

Metin

On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kind of suggestions you think it might relate to problem.

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
   does it evict data-1 pages from memory?


This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit:

and it seems that the instance has only one NUMA node, in case you think that could be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
0:  10




Re: Problem in Page Cache Replacement

2012-11-21 Thread Jaegeuk Hanse

On 11/21/2012 05:02 PM, Fengguang Wu wrote:

On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:

Cc Fengguang Wu.

On 11/21/2012 04:13 PM, metin d wrote:

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more 
diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently 
and the other one about once a month. It seems like the memory manager keeps 
unused pages in memory at the expense of frequently used database's performance.
My understanding was that under memory pressure from heavily
accessed pages, unused pages would eventually get evicted. Is there
anything else we can try on this host to understand why this is
happening?

We may debug it this way.

1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
(please double check via /proc/vmstat whether it does the expected work)

2) run 'page-types -r' with root, to view the page status for the
remaining pages of data-1

The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
-D_LARGEFILE64_SOURCE"

page-types can be found in the kernel source tree tools/vm/page-types.c

Sorry that sounds a bit twisted.. I do have a patch to directly dump
page cache status of a user specified file, however it's not
upstreamed yet.


Hi Fengguang,

Thanks for your detailed steps. I think metin can have a try.

flagspage-count   MB  symbolic-flags long-symbolic-flags
0x607699 2373 
___
0x0001343227 1340 
___r___reserved


But I have some questions about the page-types output:

Does the 2373MB here mean the total memory in use, including page cache? I
don't think so.

Which kinds of pages will be marked reserved?
Which line of long-symbolic-flags is for page cache?

Regards,
Jaegeuk



Thanks,
Fengguang


On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kind of suggestions you think it might relate to problem.

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
   does it evict data-1 pages from memory?


This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit:

and it seems that the instance has only one NUMA node, in case you think that could be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
0:  10




Re: Problem in Page Cache Replacement

2012-11-22 Thread Jaegeuk Hanse

On 11/22/2012 05:34 AM, Johannes Weiner wrote:

Hi,

On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:

On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kind of suggestions you think it might relate to problem.

This might be because we do not deactivate pages as long as there is
cache on the inactive list.  I'm guessing that the inter-reference
distance of data-2 is bigger than half of memory, so it's never
getting activated and data-1 is never challenged.


Hi Johannes,

What's the meaning of "inter-reference distance", and why compare it with
half of memory? What's the trick?


Regards,
Jaegeuk



I have a series of patches that detects a thrashing inactive list and
handles working set changes up to the size of memory.  Would you be
willing to test them?  They are currently based on 3.4, let me know
what version works best for you.





Re: Problem in Page Cache Replacement

2012-11-22 Thread Jaegeuk Hanse

On 11/22/2012 09:09 AM, Johannes Weiner wrote:

On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote:

On 11/22/2012 05:34 AM, Johannes Weiner wrote:

Hi,

On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:

On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kind of suggestions you think it might relate to problem.

This might be because we do not deactivate pages as long as there is
cache on the inactive list.  I'm guessing that the inter-reference
distance of data-2 is bigger than half of memory, so it's never
getting activated and data-1 is never challenged.

Hi Johannes,

What's the meaning of "inter-reference distance"

It's the number of memory accesses between two accesses to the same
page:

   A B C D A B C E ...
   |_______|


and why compare it with half of memory? What's the trick?

If B gets accessed twice, it gets activated.  If it gets evicted in
between, the second access will be a fresh page fault and B will not
be recognized as frequently used.

Our cutoff for scanning the active list is cache size / 2 right now
(inactive_file_is_low), leaving 50% of memory to the inactive list.
If the inter-reference distance for pages on the inactive list is
bigger than that, they get evicted before their second access.
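
For reference, that cutoff was expressed by a check roughly like this in
mm/vmscan.c of the 3.2 era (the kernel in question; quoted from memory,
treat it as a sketch): the active file list is only deactivated once it
outgrows the inactive file list, i.e. the inactive list is kept at about
half of the file cache.

static int inactive_file_is_low_global(struct zone *zone)
{
	unsigned long active, inactive;

	active = zone_page_state(zone, NR_ACTIVE_FILE);
	inactive = zone_page_state(zone, NR_INACTIVE_FILE);

	/* only deactivate when the active list is larger than the inactive one */
	return (active > inactive);
}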


Hi Johannes,

Thanks for your explanation. But could you give a short description of
how you resolve this inactive list thrashing issue?


Regards,
Jaegeuk





Re: kswapd endless loop for compaction

2012-11-22 Thread Jaegeuk Hanse

On 11/21/2012 03:04 AM, Johannes Weiner wrote:

Hi guys,

while testing a 3.7-rc5ish kernel, I noticed that kswapd can drop into
a busy spin state without doing reclaim.  printk-style debugging told
me that this happens when the distance between a zone's high watermark
and its low watermark is less than two huge pages (DMA zone).

1. The first loop in balance_pgdat() over the zones finds all zones to
be above their high watermark and only does goto out (all_zones_ok).

2. pgdat_balanced() at the out: label also just checks the high
watermark, so the node is considered balanced and the order is not
reduced.

3. In the `if (order)' block after it, compaction_suitable() checks if
the zone's low watermark + twice the huge page size is okay, which
it's not necessarily in a small zone, and so COMPACT_SKIPPED makes it
it go back to loop_again:.

This will go on until somebody else allocates and breaches the high
watermark and then hopefully goes on to reclaim the zone above low
watermark + 2 * THP.

I'm not really sure what the correct solution is.  Should we modify
the zone_watermark_ok() checks in balance_pgdat() to take into account
the higher watermark requirements for reclaim on behalf of compaction?
Change the check in compaction_suitable() / not use it directly?
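
For reference, the check in question is roughly this part of
compaction_suitable() (mm/compaction.c of that era, quoted from memory,
treat it as a sketch); the 2UL << order slack is what makes a small DMA zone
fail even though it is above its high watermark:

	/*
	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
	 * This is because during migration, copies of pages need to be
	 * allocated and for a short time, the footprint is higher.
	 */
	watermark = low_wmark_pages(zone) + (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
		return COMPACT_SKIPPED;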



Hi Johannes,

- If all zones meet the high watermark and we goto out, then why do we go
  into the `if (order)' block?

- If we depend on compaction to get enough contiguous pages, why can't

	if (COMPACTION_BUILD && order &&
			compaction_suitable(zone, order) !=
				COMPACT_SKIPPED)
		testorder = 0;

  guarantee that low watermark + twice the huge page size is okay?


Regards,

Jaegeuk



Thanks,
Johannes





Re: [PATCH 14/14] mm: Account for WRITEBACK_TEMP in balance_dirty_pages

2012-11-22 Thread Jaegeuk Hanse

On 11/21/2012 08:01 PM, Maxim Patlasov wrote:

Added linux-mm@ to cc:. The patch can stand on its own.


Make balance_dirty_pages start the throttling when the WRITEBACK_TEMP
counter is high enough. This prevents us from having too many dirty
pages on fuse, thus giving the userspace part of it a chance to write
stuff properly.

Note that the existing balance logic is per-bdi, i.e. if the fuse
user task gets stuck in the function, this means that it either
writes to the mountpoint it serves (but it can deadlock even without
the writeback) or it is writing to some _other_ dirty bdi, and in the
latter case someone else will free the memory for it.

Signed-off-by: Maxim V. Patlasov 
Signed-off-by: Pavel Emelyanov 
---
  mm/page-writeback.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..499a606 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1220,7 +1220,8 @@ static void balance_dirty_pages(struct address_space 
*mapping,
 */
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+   nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK) +
+   global_page_state(NR_WRITEBACK_TEMP);
  


Could you explain what NR_WRITEBACK_TEMP is used to account for? And
when will it increase?
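
For context, NR_WRITEBACK_TEMP is (as far as I know) only accounted by fuse:
when fuse writes a page back, it copies the data to a temporary page so the
original page's writeback can be ended immediately, and the temporary page
is what gets counted. Roughly, from fuse_writepage_locked() in
fs/fuse/file.c of that era (quoted from memory, treat it as a sketch):

	/* excerpt: request/tmp_page allocation omitted */
	set_page_writeback(page);
	copy_highpage(tmp_page, page);	/* write back a copy, not the page itself */

	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
	end_page_writeback(page);	/* the original page can be reused at once */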



global_dirty_limits(&background_thresh, &dirty_thresh);
  






Re: Problem in Page Cache Replacement

2012-11-22 Thread Jaegeuk Hanse

On 11/22/2012 11:26 PM, Fengguang Wu wrote:

Hi Jaegeuk,

Sorry for the delay. I'm traveling these days..

On Wed, Nov 21, 2012 at 05:42:33PM +0800, Jaegeuk Hanse wrote:

On 11/21/2012 05:02 PM, Fengguang Wu wrote:

On Wed, Nov 21, 2012 at 04:34:40PM +0800, Jaegeuk Hanse wrote:

Cc Fengguang Wu.

On 11/21/2012 04:13 PM, metin d wrote:

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches does it evict data-1 pages from memory?

I'm guessing it'd evict the entries, but am wondering if we could run any more 
diagnostics before trying this.

We regularly use a setup where we have two databases; one gets used frequently 
and the other one about once a month. It seems like the memory manager keeps 
unused pages in memory at the expense of frequently used database's performance.
My understanding was that under memory pressure from heavily
accessed pages, unused pages would eventually get evicted. Is there
anything else we can try on this host to understand why this is
happening?

We may debug it this way.

1) run 'fadvise data-2 0 0 dontneed' to drop data-2 cached pages
(please double check via /proc/vmstat whether it does the expected work)

2) run 'page-types -r' with root, to view the page status for the
remaining pages of data-1

The fadvise tool comes from Andrew Morton's ext3-tools. (source code attached)
Please compile them with options "-Dlinux -I. -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
-D_LARGEFILE64_SOURCE"

page-types can be found in the kernel source tree tools/vm/page-types.c

Sorry that sounds a bit twisted.. I do have a patch to directly dump
page cache status of a user specified file, however it's not
upstreamed yet.

Hi Fengguang,

Thanks for you detail steps, I think metin can have a try.

 flagspage-count   MB  symbolic-flags long-symbolic-flags
0x607699 2373
___
0x0001343227 1340
___r___reserved
  
We don't need to care about the above two page states, actually.

Page cache pages will never be in the special reserved or
all-flags-cleared state.


Hi Fengguang,

Thanks for your response. But which kinds of pages are in the special
reserved state, and which are all-flags-cleared?


Regards,
Jaegeuk




But I have some questions about the page-types output:

Does the 2373MB here mean the total memory in use, including page cache? I
don't think so.
Which kinds of pages will be marked reserved?
Which line of long-symbolic-flags is for page cache?

The (lru && !anonymous) pages are page cache pages.

Thanks,
Fengguang


On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kind of suggestions you think it might relate to problem.

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
   does it evict data-1 pages from memory?


This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit:

and it seems that I use one NUMA node, if you think that could be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
0:  10




Re: Problem in Page Cache Replacement

2012-11-22 Thread Jaegeuk Hanse

On 11/21/2012 02:25 AM, Jan Kara wrote:

On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.


Hi metin d,

Is fincore a tool, or ...? How can I get it?

Regards,
Jaegeuk



Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kinds of suggestions you think might relate to the problem.

   Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
   does it evict data-1 pages from memory?


This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit:

and it seems that I use one NUMA node, if you think that could be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
   0:  10

Honza




Re: Problem in Page Cache Replacement

2012-11-22 Thread Jaegeuk Hanse

On 11/22/2012 11:53 PM, Fengguang Wu wrote:

On Thu, Nov 22, 2012 at 11:41:07PM +0800, Fengguang Wu wrote:

On Wed, Nov 21, 2012 at 12:07:22PM +0200, Metin Döşlü wrote:

On Wed, Nov 21, 2012 at 12:00 PM, Jaegeuk Hanse  wrote:

On 11/21/2012 05:58 PM, metin d wrote:

Hi Fengguang,

I ran the tests and attached the results. The line below, I guess, shows the
data-1 page cache pages.

0x0008006c     6584051    25718  __RU_lA___P  referenced,uptodate,lru,active,private


I think this is just one state of page cache pages.

But why are these page cache pages in this state, as opposed to other page
cache pages? From the results I conclude that:

data-1 pages are in state: referenced,uptodate,lru,active,private

I wonder if it's this code that stops data-1 pages from being
reclaimed:

shrink_page_list():

	if (page_has_private(page)) {
		if (!try_to_release_page(page, sc->gfp_mask))
			goto activate_locked;

What's the filesystem used?

Ah it's more likely caused by this logic:

	if (is_active_lru(lru)) {
		if (inactive_list_is_low(mz, file))
			shrink_active_list(nr_to_scan, mz, sc, priority, file);

The active file list won't be scanned at all as long as the inactive list
is larger than it. In this case inactive=33586MB > active=25719MB, so the
data-1 pages on the active list will never be scanned and reclaimed.
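
A stripped-down sketch of that check (the real inactive_list_is_low() is
memcg-aware and works on the per-zone counters; the plain comparison below
is a simplification):

/*
 * The active file list is only shrunk when this returns true, i.e. when
 * the inactive list is the smaller of the two.  With inactive=33586MB and
 * active=25719MB it returns false, so shrink_active_list() is never
 * called and data-1's pages stay on the active list untouched.
 */
static int inactive_file_list_is_low(unsigned long nr_inactive_file,
				     unsigned long nr_active_file)
{
	return nr_inactive_file < nr_active_file;
}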


Hi Fengguang,

It seems that most of data-1's file pages are on the active LRU list and
most of data-2's file pages are on the inactive LRU list. As Johannes
mentioned, if the inter-reference distance is bigger than half of memory,
the pages will not be activated. How do you intend to resolve this issue?
Is Johannes's inactive list thrashing idea available?


Regards,
Jaegeuk




data-2 pages are in state: referenced,uptodate,lru,mappedtodisk

Thanks,
Fengguang




Re: Problem in Page Cache Replacement

2012-11-22 Thread Jaegeuk Hanse

On 11/23/2012 12:17 AM, Johannes Weiner wrote:

On Thu, Nov 22, 2012 at 09:16:27PM +0800, Jaegeuk Hanse wrote:

On 11/22/2012 09:09 AM, Johannes Weiner wrote:

On Thu, Nov 22, 2012 at 08:48:07AM +0800, Jaegeuk Hanse wrote:

On 11/22/2012 05:34 AM, Johannes Weiner wrote:

Hi,

On Tue, Nov 20, 2012 at 07:25:00PM +0100, Jan Kara wrote:

On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kinds of suggestions you think might relate to the problem.

This might be because we do not deactivate pages as long as there is
cache on the inactive list.  I'm guessing that the inter-reference
distance of data-2 is bigger than half of memory, so it's never
getting activated and data-1 is never challenged.

Hi Johannes,

What's the meaning of "inter-reference distance"

It's the number of memory accesses between two accesses to the same
page:

   A B C D A B C E ...
   |_______|


and why compare it with half of memory? What's the trick?

If B gets accessed twice, it gets activated.  If it gets evicted in
between, the second access will be a fresh page fault and B will not
be recognized as frequently used.

Our cutoff for scanning the active list is cache size / 2 right now
(inactive_file_is_low), leaving 50% of memory to the inactive list.
If the inter-reference distance for pages on the inactive list is
bigger than that, they get evicted before their second access.
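
Back-of-the-envelope numbers for this report (MB figures from the
page-types output quoted earlier in the thread, rounded; only meant as an
illustration):

#include <stdio.h>

int main(void)
{
	unsigned long inactive_mb = 33586;	/* roughly cache/2, the cutoff */
	unsigned long data2_mb = 40 * 1024;	/* data-2 is scanned end to end */

	/*
	 * data-2's inter-reference distance is about its own size.  Since
	 * that exceeds the inactive list, each page is evicted before its
	 * second access, never activated, and data-1 is never challenged.
	 */
	printf("data-2 pages survive until their second access: %s\n",
	       data2_mb <= inactive_mb ? "yes" : "no");
	return 0;
}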

Hi Johannes,

Thanks for your explanation. But could you give a short description
of how you resolve this inactive list thrashing issue?

I remember a time stamp of evicted file pages in the page cache radix
tree that let me reconstruct the inter-reference distance even after a
page has been evicted from cache when it's faulted back in.  This way
I can tell a one-time sequence from thrashing, no matter how small the
inactive list.

When thrashing is detected, I start deactivating protected pages and
put them next to the refaulted cache on the head of the inactive list
and let them fight it out as usual.  In this reported case, the old
data will be challenged and since it's no longer used, it will just
drop off the inactive list eventually.  If the guess is wrong and the
deactivated memory is used more heavily than the refaulting pages,
they will just get activated again without incurring any disruption
like a major fault.
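
A very rough sketch of that bookkeeping (this is not the actual patch;
the names, the shadow-entry encoding and the thrashing threshold are all
made up to illustrate the idea):

#include <stdbool.h>

/* "LRU clock", advanced on every page cache eviction. */
static unsigned long lru_clock;

/*
 * On eviction: instead of clearing the radix tree slot, leave a shadow
 * entry behind that remembers when the page was evicted.
 */
static void *make_shadow_entry(void)
{
	return (void *)((++lru_clock << 1) | 1);	/* low bit tags the shadow */
}

/*
 * On refault: the difference between now and the eviction stamp
 * approximates the inter-reference distance.  If the page would have
 * survived on a bigger inactive list, it was thrashing and the protected
 * pages should be challenged.  (The exact threshold is a policy detail;
 * the total file cache size is used here only for illustration.)
 */
static bool refault_was_thrashing(void *shadow, unsigned long nr_inactive,
				  unsigned long nr_active)
{
	unsigned long evicted = (unsigned long)shadow >> 1;
	unsigned long distance = lru_clock - evicted;

	return distance <= nr_inactive + nr_active;
}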


Hi Johannes,

Do you also add the time stamp to the protected pages which you
deactivate when thrashing occurs?


Regards,
Jaegeuk





Re: Problem in Page Cache Replacement

2012-11-23 Thread Jaegeuk Hanse

On 11/23/2012 04:08 PM, metin d wrote:

- Original Message -

From: Jaegeuk Hanse 
To: metin d 
Cc: Jan Kara ; "linux-kernel@vger.kernel.org" 
; linux...@kvack.org
Sent: Friday, November 23, 2012 3:58 AM
Subject: Re: Problem in Page Cache Replacement

On 11/21/2012 02:25 AM, Jan Kara wrote:

On Tue 20-11-12 09:42:42, metin d wrote:

I have two PostgreSQL databases named data-1 and data-2 that sit on the
same machine. Both databases keep 40 GB of data, and the total memory
available on the machine is 68GB.

I started data-1 and data-2, and ran several queries to go over all their
data. Then, I shut down data-1 and kept issuing queries against data-2.
For some reason, the OS still holds on to large parts of data-1's pages
in its page cache, and reserves about 35 GB of RAM to data-2's files. As
a result, my queries on data-2 keep hitting disk.

I'm checking page cache usage with fincore. When I run a table scan query
against data-2, I see that data-2's pages get evicted and put back into
the cache in a round-robin manner. Nothing happens to data-1's pages,
although they haven't been touched for days.

Hi metin d,
Is fincore a tool, or ...? How can I get it?
Regards,
Jaegeuk


Hi Jaegeuk,

Yes, it is a tool; you can get it from here:
http://code.google.com/p/linux-ftools/


Hi Metin,

Could you give me a link to download it? I can't get it from the link
you gave me. Thanks in advance. :-)


Regards,
Jaegeuk




Regards,
Metin

Does anybody know why data-1's pages aren't evicted from the page cache?
I'm open to all kinds of suggestions you think might relate to the problem.

 Curious. Added linux-mm list to CC to catch more attention. If you run
echo 1 >/proc/sys/vm/drop_caches
 does it evict data-1 pages from memory?


This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no
swap space. The kernel version is:

$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit:

and it seems that I use one NUMA node, if you think that could be a problem.

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 70007 MB
node 0 free: 360 MB
node distances:
node   0
 0:  10

 Honza




Re: kswapd endless loop for compaction

2012-11-23 Thread Jaegeuk Hanse

On 11/21/2012 03:04 AM, Johannes Weiner wrote:

Hi guys,

while testing a 3.7-rc5ish kernel, I noticed that kswapd can drop into
a busy spin state without doing reclaim.  printk-style debugging told
me that this happens when the distance between a zone's high watermark
and its low watermark is less than two huge pages (DMA zone).

1. The first loop in balance_pgdat() over the zones finds all zones to
be above their high watermark and only does goto out (all_zones_ok).

2. pgdat_balanced() at the out: label also just checks the high
watermark, so the node is considered balanced and the order is not
reduced.

3. In the `if (order)' block after it, compaction_suitable() checks if
the zone's low watermark + twice the huge page size is okay, which
it's not necessarily in a small zone, and so COMPACT_SKIPPED makes it
go back to loop_again:.

This will go on until somebody else allocates and breaches the high
watermark and then hopefully goes on to reclaim the zone above low
watermark + 2 * THP.
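
Put differently, the two checks that end up disagreeing look roughly like
this (simplified pseudo-kernel code, not the actual functions; the names
are made up):

/* balance_pgdat()'s view: the zone is fine once free > high watermark. */
static int zone_looks_balanced(long free_pages, long high_wmark)
{
	return free_pages > high_wmark;
}

/* compaction_suitable()'s view: it wants low watermark + 2x the request. */
static int zone_has_compaction_headroom(long free_pages, long low_wmark,
					int order)
{
	return free_pages >= low_wmark + (2L << order);
}

In a small DMA zone, high_wmark - low_wmark can be much smaller than
2 << order for THP (order 9, i.e. 1024 pages), so the first check can pass
while the second keeps failing, and kswapd spins in loop_again.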

I'm not really sure what the correct solution is.  Should we modify
the zone_watermark_ok() checks in balance_pgdat() to take into account
the higher watermark requirements for reclaim on behalf of compaction?
Change the check in compaction_suitable() / not use it directly?


Hi Johannes,

If we depend on compaction to get enough contiguous pages, why can't

	if (COMPACTION_BUILD && order &&
		    compaction_suitable(zone, order) !=
			COMPACT_SKIPPED)
		testorder = 0;

guarantee that low watermark + twice the huge page size is okay?

Regards,
Jaegeuk


Thanks,
Johannes





Re: [PATCH] mm,vmscan: free pages if compaction_suitable tells us to

2012-11-25 Thread Jaegeuk Hanse

On 11/26/2012 06:44 AM, Johannes Weiner wrote:

On Sun, Nov 25, 2012 at 01:29:50PM -0500, Rik van Riel wrote:

On Sun, 25 Nov 2012 17:57:28 +0100
Johannes Hirte  wrote:


With kernel 3.7-rc6 I still have problems with kswapd0 on my laptop,
and this is most of the time. I've only observed this behavior on the
laptop; other systems don't show it.

This suggests it may have something to do with small memory zones,
where we end up with the "funny" situation that the high watermark
(+ balance gap) for a particular zone is less than the low watermark
+ 2<<order.
It's not quite enough because it's not reaching the conditions you
changed, see analysis in https://lkml.org/lkml/2012/11/20/567

But even fixing it up (by adding the compaction_suitable() test in
this preliminary scan over the zones and setting end_zone accordingly)
is not enough because no actual reclaim happens at priority 12 in a


The preliminary scan is in the highmem->dma direction, so it will miss a
high zone which doesn't pass the compaction_suitable() test, rather than
the lowest zone.



small zone.  So the number of free pages is not actually changing and
the compaction_suitable() checks keep the loop going.

The problem is fairly easy to reproduce, by the way.  Just boot with
mem=800M to have a relatively small lowmem reserve in the DMA zone.
Fill it up with page cache, then allocate transparent huge pages.

With your patch and my fix to the preliminary zone loop, there won't
be any hung task warnings anymore because kswapd actually calls
shrink_slab() and there is a rescheduling point in there, but it still
loops forever.

It also seems a bit aggressive to try to balance a small zone like DMA
for a huge page when it's not a GFP_DMA allocation, but none of these
checks actually take the classzone into account.  Do we have any
agreement over what this whole thing is supposed to be doing?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b99ecba..f7e54df 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2412,6 +2412,9 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
  * would need to be at least 256M for it to be balance a whole node.
  * Similarly, on x86-64 the Normal zone would need to be at least 1G
  * to balance a node on its own. These seemed like reasonable ratios.
+ *
+ * The kswapd source code is brought to you by Advil®.  "For today's
+ * tough pain, one might not be enough."
  */
 static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
			   int classzone_idx)


