Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-01 Thread Minchan Kim
s the pseudo-RAM
> to treat the pool as shared using a 128-bit UUID as a key.  On systems
> that may run multiple kernels (such as hard partitioned or virtualized
> systems) that may share a clustered filesystem, and where the pseudo-RAM
> may be shared among those kernels, calls to init_shared_fs that specify the
> same UUID will receive the same pool id, thus allowing the pages to
> be shared.  Note that any security requirements must be imposed outside
> of the kernel (e.g. by "tools" that control the pseudo-RAM).  Or a
> pseudo-RAM implementation can simply disable shared_init by always
> returning a negative value.
>
> If a get_page is successful on a non-shared pool, the page is flushed (thus
> making cleancache an "exclusive" cache).  On a shared pool, the page

Is there a reason to force "exclusive" behavior on a non-shared pool?
To free memory on the pseudo-RAM side?
I would like to make it "inclusive" for a reason of my own, but
unfortunately I can't articulate that reason yet.

While you describe it as "exclusive", cleancache_get_page doesn't
flush the page in the code below.
Is the flush the responsibility of whoever implements
cleancache_ops->get_page?

+int __cleancache_get_page(struct page *page)
+{
+	int ret = 0;
+	int pool_id = page->mapping->host->i_sb->cleancache_poolid;
+
+	if (pool_id >= 0) {
+		ret = (*cleancache_ops->get_page)(pool_id,
+						  page->mapping->host->i_ino,
+						  page->index,
+						  page);
+		if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
+			succ_gets++;
+		else
+			failed_gets++;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(__cleancache_get_page);

If the backing device is RAM (e.g., a RAM disk), could we _move_ the
pages from the page cache to cleancache?
I mean, I want to avoid copying pages on get/put operations; we could
just move the page when the backing device is RAM. Is that possible?

You sent the patches that form the core of cleancache, but I don't see
any use case.
Could you send use-case patches with this series?
It would help people understand cleancache's benefit.

-- 
Kind regards,
Minchan Kim


Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-02 Thread Minchan Kim
Hi, Dan. 

On Wed, Jun 02, 2010 at 08:27:48AM -0700, Dan Magenheimer wrote:
> Hi Minchan --
> 
> > I think cleancache approach is cool. :)
> > I have some suggestions and questions.
> 
> Thanks for your interest!
> 
> > > If a get_page is successful on a non-shared pool, the page is flushed
> > (thus
> > > making cleancache an "exclusive" cache).  On a shared pool, the page
> > 
> > Is there a reason to force "exclusive" behavior on a non-shared pool?
> > To free memory on the pseudo-RAM side?
> > I would like to make it "inclusive" for a reason of my own, but
> > unfortunately I can't articulate that reason yet.
> 
> The main reason is to free up memory in pseudo-RAM and to
> avoid unnecessary cleancache_flush calls.  If you want
> inclusive, the page can be put immediately following
> the get.  If put-after-get for inclusive becomes common,
> the interface could easily be extended to add a "get_no_flush"
> call.

Sounds good to me. 
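
Just to spell the put-after-get idea out, a hypothetical wrapper (name
and placement mine, not part of the series) could emulate inclusive
behavior today:

	/* Hypothetical sketch: emulate "inclusive" semantics on top of
	 * the exclusive get by immediately re-putting a successfully
	 * fetched page, so the backend keeps its copy. */
	static int cleancache_get_page_inclusive(struct page *page)
	{
		int ret = __cleancache_get_page(page);

		if (ret == CLEANCACHE_GET_PAGE_SUCCESS)
			cleancache_put_page(page);	/* put it straight back */
		return ret;
	}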

>  
> > While you describe it as "exclusive", cleancache_get_page doesn't
> > flush the page in the code below.
> > Is the flush the responsibility of whoever implements
> > cleancache_ops->get_page?
> 
> Yes, the flush is done by the cleancache implementation.
> 
> > If the backing device is RAM (e.g., a RAM disk), could we _move_ the
> > pages from the page cache to cleancache?
> > I mean, I want to avoid copying pages on get/put operations; we could
> > just move the page when the backing device is RAM. Is that possible?
> 
> By "move", do you mean changing the virtual mappings?  Yes,
> this could be done as long as the source and destination are
> both directly addressable (that is, true physical RAM), but
> requires TLB manipulation and has some complicated corner
> cases.  The copy semantics simplifies the implementation on
> both the "frontend" and the "backend" and also allows the
> backend to do fancy things on-the-fly like page compression
> and page deduplication.

Agreed, but that's not what I meant.
If I use brd as the backend, I want to do the following:

put_page:

	/* detach the page from the page cache and hand it to brd */
	remove_from_page_cache(page);
	brd_insert_page(page);

get_page:

	/* take the page back from brd and re-insert it into the page cache */
	brd_lookup_page(page);
	add_to_page_cache(page);

Of course, I know this is impossible without new metadata and changes to
page cache handling, and that it would compromise the clean layering
between frontend and backend.

What I want is to remove the copy overhead when the backend is RAM that
is also part of main memory (i.e., memory with page descriptors).

Do you have any ideas?

> 
> > You sent the patches that form the core of cleancache, but I don't
> > see any use case.
> > Could you send use-case patches with this series?
> > It would help people understand cleancache's benefit.
> 
> Do you mean the Xen Transcendent Memory ("tmem") implementation?
> If so, this is four files in the Xen source tree (common/tmem.c,
> common/tmem_xen.c, include/xen/tmem.h, include/xen/tmem_xen.h).
> There is also an html document in the Xen source tree, which can
> be viewed here:
> http://oss.oracle.com/projects/tmem/dist/documentation/internals/xen4-internals-v01.html
>  
> 
> Or did you mean a cleancache_ops "backend"?  For tmem, there
> is one file linux/drivers/xen/tmem.c and it interfaces between
> the cleancache_ops calls and Xen hypercalls.  It should be in
> a Xenlinux pv_ops tree soon, or I can email it sooner.

I mean "backend". :) 

> 
> I am also eagerly awaiting Nitin Gupta's cleancache backend
> and implementation to do in-kernel page cache compression.

Did Nitin say he will write a cleancache backend for page cache
compression?

That would be a good feature.
I'm interested in it, too. :)

Thanks, Dan. 

> 
> Thanks,
> Dan

-- 
Kind regards,
Minchan Kim


Re: [PATCH V2 3/7] Cleancache (was Transcendent Memory): VFS hooks

2010-06-04 Thread Minchan Kim
 once our page is gone
> +  */
> + if (PageUptodate(page))
> + cleancache_put_page(page);
> + else
> + cleancache_flush_page(mapping, page);
> +
>   radix_tree_delete(&mapping->page_tree, page->index);

I doubt this is the right place with respect to the PFRA.

1)
You mentioned the PFRA in your description, and I understood that
cleancache holds cold clean pages evicted by the reclaimer.
But __remove_from_page_cache can be called from other call sites.

For example, shmem_writepage calls it to move a page from the page cache
to the swap cache. Although the page is no longer in the page cache, it
is in the swap cache, so until swapout happens, subsequent shmem
reads/writes can be served from the swap cache.

I haven't looked at all the call sites, but please review them again.

2)
While reviewing this, I noticed the GFP_ATOMIC in add_to_swap_cache.
I don't know why it is there, but at least when you review the call
sites, please consider the gfp flags of each context.

If the PFRA is running, the system is under memory pressure, so pages
cannot be allocated easily; the calling functions may be restricted to
GFP_NOIO, GFP_NOFS, GFP_ATOMIC and so on. That means your backend can't
call alloc_pages freely.
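
To make that concrete, here is an illustrative sketch (not from this
series) of what a non-virtualized backend's put path is constrained to:

	/* Illustrative sketch only: a backend put can be reached from
	 * reclaim context, so it must not sleep, issue I/O, or recurse
	 * into the FS.  GFP_NOWAIT keeps the allocation opportunistic;
	 * failure just means the clean page is dropped, not cached. */
	static int backend_put_page(int pool_id, ino_t ino, pgoff_t index,
				    struct page *page)
	{
		struct page *copy = alloc_page(GFP_NOWAIT | __GFP_NOWARN);

		if (!copy)
			return -1;
		copy_highpage(copy, page);
		/* ... index 'copy' by the (pool_id, ino, index) handle ... */
		return 0;
	}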

It would be best if your hooks were only called from places with no
allocation restrictions. Maybe you have already done that. :)

3)
Please consider system memory pressure.
Without that, the PFRA might reclaim a page only for cleancache's
(non-virtualized) backend to consume another page to store that clean
page. It could change system behavior, although it can reduce I/O cost.
I don't know which matters more; let's discuss it in another thread
when we get the chance.

In this thread, my concern is that we need a new hook to notify the
backend of urgent memory reclaim (e.g., direct reclaim).
I think that, at least when system memory pressure is high, the backend
needs to free its own cache memory. Of course, the backend could
monitor this itself, but that's awkward; I think the frontend should
do it.
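
Concretely, even a single optional op would be enough; something like
this (hypothetical member, name illustrative):

	/* Hypothetical cleancache_ops extension: called by the frontend
	 * from the direct reclaim path so the backend can shed its own
	 * cache pages.  A backend that doesn't need the hint leaves it
	 * NULL. */
	void (*memory_pressure)(int pool_id);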

And I hope Nitin considers this, too.


-- 
Kind regards,
Minchan Kim


Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-04 Thread Minchan Kim
Hi, Nitin. 

I am happy to hear you started this work. 

On Fri, Jun 04, 2010 at 03:06:49PM +0530, Nitin Gupta wrote:
> On 06/03/2010 09:13 PM, Dan Magenheimer wrote:
> >> On 06/03/2010 10:23 AM, Andreas Dilger wrote:
> >>> On 2010-06-02, at 20:46, Nitin Gupta wrote:
> >>
> >>> I was thinking it would be quite clever to do compression in, say,
> >>> 64kB or 128kB chunks in a mapping (to get decent compression) and
> >>> then write these compressed chunks directly from the page cache
> >>> to disk in btrfs and/or a revived compressed ext4.
> >>
> >> Batching of pages to get good compression ratio seems doable.
> > 
> > Is there evidence that batching a set of random individual 4K
> > pages will have a significantly better compression ratio than
> > compressing the pages separately?  I certainly understand that
> > if the pages are from the same file, compression is likely to
> > be better, but pages evicted from the page cache (which is
> > the source for all cleancache_puts) are likely to be quite a
> > bit more random than that, aren't they?
> > 
> 
> 
> Batching of pages from random files may not be so effective but
> it would be interesting to collect some data for this. Still,
> per-inode batching of pages seems doable and this should help
> us get over this problem.

1)
Please consider the system memory pressure case.
In that case, we have to release compressed cache pages.
Or, better, pages that don't compress well could be discarded at
compression time.
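
For instance, the backend could refuse pages that remain large after
compression; an illustrative sketch (the threshold and the LZO choice
are mine):

	/* Compress the page and give up if it doesn't shrink below a
	 * threshold, so poorly compressing pages are dropped rather than
	 * occupying nearly a full page in the compressed cache. */
	ret = lzo1x_1_compress(src, PAGE_SIZE, dst, &clen, wrkmem);
	if (ret != LZO_E_OK || clen > PAGE_SIZE * 3 / 4)
		return -1;	/* not worth caching; let the page go */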

2)
This work is tied to page reclaim, whose purpose is to make free memory,
but this work might free less memory than before.
I admit the concept is good in terms of I/O cost, but if you batch
pages for good compression, we might discard more clean pages than
before.

3)
Test case.

As I mentioned, this could be good in terms of I/O cost, but it could
change the system's behavior due to the backend's page consumption:
much more page scanning/reclaiming could happen, which means hot pages
could be discarded with this patch.
That is just a guess, though, so we need numbers from a test case that
measures both I/O and system responsiveness.

> 
> Thanks,
> Nitin

-- 
Kind regards,
Minchan Kim


Re: [PATCH V2 3/7] Cleancache (was Transcendent Memory): VFS hooks

2010-06-04 Thread Minchan Kim
On Fri, Jun 04, 2010 at 08:13:14AM -0700, Dan Magenheimer wrote:
> > 1)
> > You mentioned the PFRA in your description, and I understood that
> > cleancache holds cold clean pages evicted by the reclaimer.
> > But __remove_from_page_cache can be called from other call sites.
> > 
> > For example, shmem_writepage calls it to move a page from the page
> > cache to the swap cache. Although the page is no longer in the page
> > cache, it is in the swap cache, so until swapout happens, subsequent
> > shmem reads/writes can be served from the swap cache.
> > 
> > I haven't looked at all the call sites, but please review them again.
> 
> I think the "if (PageUptodate(page))" eliminates all the cases
> where bad things can happen.

I missed that; my first concern is gone. :)

> 
> Note that there may be cases where some unnecessary puts/flushes
> occur.  The focus of the patch is on correctness first; it may
> be possible to increase performance (marginally) in the future by
> reducing unnecessary cases.

I don't think it would be marginal; it depends on the backend
implementation.
I think it would be better for the frontend to notify the backend only
from the exact place. Per your description, we could call it in
shrink_page_list with some checks, or change __remove_mapping to take
an argument that says "this call is from the reclaim path".
> 
> > 3) Please consider system memory pressure.
> > And I hope Nitin considers this, too.
> 
> This is definitely very important but remember that cleancache
> provides a great deal of flexibility:  Any page in cleancache
> can be thrown away at any time as every page is clean!  It
> can even accept a page and throw it away immediately.  Clearly
> the backend needs to do this intelligently so this will
> take some policy work.

I admit the design goal of cleancache is to provide a great deal of
flexibility, but I think system memory pressure (i.e., direct reclaim
and even OOM) is exceptional. As we implement various backends, every
(non-virtualized) backend will have to implement a policy for dealing
with memory pressure emergencies to prevent a system hang, I think.

And a backend might need some hack just to detect that situation, which
is horrible. So I hope the frontend gives the backend at least a little
information.

If a backend doesn't need it, it can simply ignore it; but for a
backend that does need it, it can be a big deal. :)

> 
> Since I saw you sent a separate response to Nitin, I'll
> let him answer for his in-kernel page cache compression
> work.  The solution to the similar problem for Xen is
> described in the tmem internals document that I think
> I pointed to earlier here:
> http://oss.oracle.com/projects/tmem/documentation/internals/ 

I will read it when I have time.
Thanks for the quick reply; I can't get to it now, though, as it's time
for sleep and the weekend.
See you soon, and have a nice weekend.

> 
> Thanks,
> Dan
> 

-- 
Kind regards,
Minchan Kim


Re: [PATCH V3 0/8] Cleancache: overview

2010-07-23 Thread Minchan Kim
On Fri, Jul 23, 2010 at 4:36 PM, Nitin Gupta  wrote:
>
> 2. I think the change in btrfs can be avoided by moving cleancache_get_page()
> from do_mpage_readpage() to filemap_fault(), and this should work for all
> filesystems. See:
>
> handle_pte_fault() -> do_(non)linear_fault() -> __do_fault()
>                                                -> vma->vm_ops->fault()
>
> which is defined as filemap_fault() for all filesystems. If some future
> filesystem uses its own custom function (why?) then it will have to
> arrange for the call to cleancache_get_page(), if it wants this feature.


filemap_fault only covers file-backed pages that are mapped; it does
not cover unmapped page cache pages. So if we moved the hook into
filemap_fault, we could miss cached pages on the read system call path.

-- 
Kind regards,
Minchan Kim


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
er zone lru page counts in mem_cgroup_per_node
> and subtract per-memcg highmem counts when memcg is enabled. Introduce
> helper lruvec_zone_lru_size which redirects to either zone counters or
> mem_cgroup_get_zone_lru_size when appropriate.
> 
> We are losing the empty LRU but non-zero lru size detection introduced by
> ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
> of the inherent zone vs. node discrepancy.
> 
> Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible 
> zones inactive ratio")
> Cc: stable # 4.8+
> Reported-by: Nils Holland 
> Signed-off-by: Michal Hocko 
Acked-by: Minchan Kim 



Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per node zone will not match per
> > zone statistics. The updated patch is below. So I hope my brain already
> > works after it's been mostly off last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests 
> > when
> >  memcg is enabled
> > 
> > Nils Holland has reported unexpected OOM killer invocations with 32b
> > kernel starting with 4.8 kernels
> > 
> > kworker/u4:5 invoked oom-killer: 
> > gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> > oom_score_adj=0
> > kworker/u4:5 cpuset=/ mems_allowed=0
> > CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> > [...]
> > Mem-Info:
> > active_anon:58685 inactive_anon:90 isolated_anon:0
> >  active_file:274324 inactive_file:281962 isolated_file:0
> >  unevictable:0 dirty:649 writeback:0 unstable:0
> >  slab_reclaimable:40662 slab_unreclaimable:17754
> >  mapped:7382 shmem:202 pagetables:351 bounce:0
> >  free:206736 free_pcp:332 free_cma:0
> > Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
> > inactive_file:1127848kB unevictable:0kB isolated(anon):0kB 
> > isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB 
> > shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB 
> > unstable:0kB pages_scanned:0 all_unreclaimable? no
> > DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
> > inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
> > writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> > slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> > pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 813 3474 3474
> > Normal free:41332kB min:41368kB low:51708kB high:62048kB 
> > active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
> > unevictable:0kB writepending:24kB present:897016kB managed:836248kB 
> > mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB 
> > kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB 
> > local_pcp:340kB free_cma:0kB
> > lowmem_reserve[]: 0 0 21292 21292
> > HighMem free:781660kB min:512kB low:34356kB high:68200kB 
> > active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> > inactive_file:1127804kB unevictable:0kB writepending:2592kB 
> > present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB 
> > slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB 
> > free_pcp:800kB local_pcp:608kB free_cma:0kB
> > 
> > the oom killer is clearly premature because there is still a
> > lot of page cache in the zone Normal which should satisfy this lowmem
> > request. Further debugging has shown that the reclaim cannot make any
> > forward progress because the page cache is hidden in the active list
> > which doesn't get rotated because inactive_list_is_low is not memcg
> > aware.
> > It simply subtracts per-zone highmem counters from the re

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> Hi,
> could you try to run with the following patch on top of the previous
> one? I do not think it will make a large change in your workload but
> I think we need something like that so some testing under which is known
> to make a high lowmem pressure would be really appreciated. If you have
> more time to play with it then running with and without the patch with
> mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> whether it make any difference at all.
> 
> I would also appreciate if Mel and Johannes had a look at it. I am not
> yet sure whether we need the same thing for anon/file balancing in
> get_scan_count. I suspect we need but need to think more about that.
> 
> Thanks a lot again!
> ---
> From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 27 Dec 2016 16:28:44 +0100
> Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> 
> get_scan_count considers the whole node LRU size when
> - doing SCAN_FILE due to many page cache inactive pages
> - calculating the number of pages to scan
> 
> in both cases this might lead to unexpected behavior especially on 32b
> systems where we can expect lowmem memory pressure very often.
> 
> A large highmem zone can easily distort SCAN_FILE heuristic because
> there might be only few file pages from the eligible zones on the node
> lru and we would still enforce file lru scanning which can lead to
> > thrashing while we could still scan anonymous pages.

Nit:
It doesn't cause thrashing, because isolate_lru_pages filters those
pages out, but I agree it burns CPU pointlessly finding eligible pages.

> 
> The later use of lruvec_lru_size can be problematic as well. Especially
> when there are not many pages from the eligible zones. We would have to
> skip over many pages to find anything to reclaim but shrink_node_memcg
> would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
> at maximum. Therefore we can end up going over a large LRU many times
> without actually having chance to reclaim much if anything at all. The
> closer we are out of memory on lowmem zone the worse the problem will
> be.
> 
> Signed-off-by: Michal Hocko 
> ---
>  mm/vmscan.c | 30 --
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c98b1a585992..785b4d7fb8a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec *lruvec, enum lru_list lru, int
>  }
>  
>  /*
> + * Return the number of pages on the given lru which are eligibne for the
eligible
> + * given zone_idx
> + */
> +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> +			enum lru_list lru, int zone_idx)

Nit:

Although there is a comment, the function name is rather confusing when
compared with lruvec_zone_lru_size.

Would lruvec_eligible_zones_lru_size be better?


> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +	unsigned long lru_size;
> +	int zid;
> +
> +	lru_size = lruvec_lru_size(lruvec, lru);
> +	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> +		struct zone *zone = &pgdat->node_zones[zid];
> +		unsigned long size;
> +
> +		if (!managed_zone(zone))
> +			continue;
> +
> +		size = lruvec_zone_lru_size(lruvec, lru, zid);
> +		lru_size -= min(size, lru_size);
> +	}
> +
> +	return lru_size;
> +}
> +
> +/*
>   * Add a shrinker callback to be called from the vm.
>   */
>  int register_shrinker(struct shrinker *shrinker)
> @@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	 * system is under heavy pressure.
>  	 */
>  	if (!inactive_list_is_low(lruvec, true, sc) &&
> -	    lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
> +	    lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, sc->reclaim_idx) >> sc->priority) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
>  	}
> @@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  		unsigned long size;
>  		unsigned long scan;
>  
> -		size = lruvec_lru_size(lruvec, lru);
> +		size = lruvec_lru_size_zone_idx(lruvec, lru, sc->reclaim_idx);
>  		scan = size >> sc->prio

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-29 Thread Minchan Kim
On Thu, Dec 29, 2016 at 10:04:32AM +0100, Michal Hocko wrote:
> On Thu 29-12-16 10:20:26, Minchan Kim wrote:
> > On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > > Hi,
> > > could you try to run with the following patch on top of the previous
> > > one? I do not think it will make a large change in your workload but
> > > I think we need something like that so some testing under which is known
> > > to make a high lowmem pressure would be really appreciated. If you have
> > > more time to play with it then running with and without the patch with
> > > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > > whether it make any difference at all.
> > > 
> > > I would also appreciate if Mel and Johannes had a look at it. I am not
> > > yet sure whether we need the same thing for anon/file balancing in
> > > get_scan_count. I suspect we need but need to think more about that.
> > > 
> > > Thanks a lot again!
> > > ---
> > > From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko 
> > > Date: Tue, 27 Dec 2016 16:28:44 +0100
> > > Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> > > 
> > > get_scan_count considers the whole node LRU size when
> > > - doing SCAN_FILE due to many page cache inactive pages
> > > - calculating the number of pages to scan
> > > 
> > > in both cases this might lead to unexpected behavior especially on 32b
> > > systems where we can expect lowmem memory pressure very often.
> > > 
> > > A large highmem zone can easily distort SCAN_FILE heuristic because
> > > there might be only few file pages from the eligible zones on the node
> > > lru and we would still enforce file lru scanning which can lead to
> > > thrashing while we could still scan anonymous pages.
> > 
> > Nit:
> > It doesn't cause thrashing, because isolate_lru_pages filters those
> > pages out, but I agree it burns CPU pointlessly finding eligible pages.
> 
> This is not about isolate_lru_pages. The thrashing could happen if we
> had a lowmem pagecache user which would constantly reclaim recently
> faulted-in pages while there is anonymous memory in lowmem which could
> be reclaimed instead.
>  
> [...]
> > >  /*
> > > + * Return the number of pages on the given lru which are eligibne for the
> > eligible
> 
> fixed
> 
> > > + * given zone_idx
> > > + */
> > > +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> > > + enum lru_list lru, int zone_idx)
> > 
> > Nit:
> > 
> > Although there is a comment, the function name is rather confusing
> > when compared with lruvec_zone_lru_size.
> 
> I am all for a better name.
> 
> > Would lruvec_eligible_zones_lru_size be better?
> 
> this would be too easy to confuse with lruvec_eligible_zone_lru_size.
> What about lruvec_lru_size_eligible_zones?

I don't mind.

>  
> > Nit:
> > 
> > With this patch, inactive_list_is_low could use
> > lruvec_lru_size_zone_idx rather than its own custom calculation to
> > filter out non-eligible pages.
> 
> Yes, that would be possible and I was considering that. But then I
> found it useful to see both the total and the reduced numbers in the
> tracepoint
> http://lkml.kernel.org/r/20161228153032.10821-8-mho...@kernel.org
> and didn't want to call lruvec_lru_size twice. But if you insist then
> I can just do that.

I don't mind either way, but I think we need to document the reason if
you want to keep the open-coded version; otherwise, someone will try to
"fix" it.
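
The open-coded part I mean is the per-zone loop inside
inactive_list_is_low; with the new helper it could shrink to something
like this (sketch; inactive_lru/active_lru computed as before):

	inactive = lruvec_lru_size_zone_idx(lruvec, inactive_lru,
					    sc->reclaim_idx);
	active = lruvec_lru_size_zone_idx(lruvec, active_lru,
					  sc->reclaim_idx);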


Re: [PATCH V2 0/3] drivers/staging: zcache: dynamic page cache/swap compression

2011-02-15 Thread Minchan Kim
On Wed, Feb 16, 2011 at 10:27 AM, Dan Magenheimer
 wrote:
>> -Original Message-
>> From: Matt [mailto:jackdac...@gmail.com]
>> Sent: Tuesday, February 15, 2011 5:12 PM
>> To: Minchan Kim
>> Cc: Dan Magenheimer; gre...@suse.de; Chris Mason; linux-
>> ker...@vger.kernel.org; linux...@kvack.org; ngu...@vflare.org; linux-
>> bt...@vger.kernel.org; Josef Bacik; Dan Rosenberg; Yan Zheng;
>> mi...@cn.fujitsu.com; Li Zefan
>> Subject: Re: [PATCH V2 0/3] drivers/staging: zcache: dynamic page
>> cache/swap compression
>>
>> On Mon, Feb 14, 2011 at 4:35 AM, Minchan Kim  wrote:
>> > Just my guess; I might be wrong.
>> >
>> > __cleancache_flush_inode calls cleancache_get_key with a
>> > cleancache_filekey, whose size is just 6 * u32.
>> > cleancache_get_key calls btrfs_encode_fh with that key, but
>> > btrfs_encode_fh casts the key to btrfs_fid, which is bigger than
>> > cleancache_filekey, so it may access fields beyond the end of the
>> > key.
>> >
>> > I think some filesystems use an extended fid, so this problem can
>> > happen there. I don't know why we didn't find it earlier; maybe Dan
>> > and others have been testing it for a long time.
>> >
>> > Am I missing something?
>> >
>> >
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>> >
>>
>> reposting Minchan's message for reference to the btrfs mailing list
>> while also adding
>>
>> Li Zefan, Miao Xie, Yan Zheng, Dan Rosenberg and Josef Bacik to CC
>>
>> Regards
>>
>> Matt
>
> Hi Matt and Minchan --
>
> (BTRFS EXPERTS SEE *** BELOW)
>
> I definitely see a bug in cleancache_get_key in the monolithic
> zcache+cleancache+frontswap patch I posted on oss.oracle.com
> that is corrected in linux-next but I don't see how it could
> get provoked by btrfs.
>
> The bug is that, in cleancache_get_key, the return value of fhfn should
> be checked against 255.  If the return value is 255, cleancache_get_key
> should return -1.  This should disable cleancache for any filesystem
> where KEY_MAX is too large.
>
> But cleancache_get_key always calls fhfn with connectable == 0 and
> CLEANCACHE_KEY_MAX==6 should be greater than BTRFS_FID_SIZE_CONNECTABLE
> (which I think should be 5?).  And the elements written into the
> typecast btrfs_fid should be only writing the first 5 32-bit words.

BTRFS_FID_SIZE_NON_CONNECTABLE is 5, not BTRFS_FID_SIZE_CONNECTABLE.
Anyway, you pass connectable as 0, so it should only write the first
5 32-bit words, as you said.
That's one I missed. ;-)
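
For reference, the corrected check would look roughly like this
(condensed sketch of the fix you describe, not the exact linux-next
code):

	/* If encode_fh returns 255, the handle doesn't fit in the key, so
	 * fail here and leave cleancache disabled for this file. */
	len = (*fhfn)(&d, &key->u.fh[0], &maxlen, 0 /* connectable */);
	if (len <= 0 || len == 255 || maxlen > CLEANCACHE_KEY_MAX)
		return -1;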

Thanks.
-- 
Kind regards,
Minchan Kim


Re: [PATCH V8 4/8] mm/fs: add hooks to support cleancache

2011-04-14 Thread Minchan Kim
> +       /*
> +        * if we're uptodate, flush out into the cleancache, otherwise
> +        * invalidate any existing cleancache entries.  We can't leave
> +        * stale data around in the cleancache once our page is gone
> +        */
> +       if (PageUptodate(page) && PageMappedToDisk(page))
> +               cleancache_put_page(page);
> +       else
> +               cleancache_flush_page(mapping, page);
> +

First of all, thanks for resolving the conflict with my patch.

Earlier, I raised a point about cleancache_flush_page and
cleancache_flush_inode.

What is the intended meaning of "flush" here?
I thought it meant invalidation.
If so, how about replacing "flush" with "invalidate"?


-- 
Kind regards,
Minchan Kim


Re: [PATCH V8 4/8] mm/fs: add hooks to support cleancache

2011-04-17 Thread Minchan Kim
sed for the data can also be freed.
>
> (Note that "take the data" and "recreate the page of data" are
> similar in semantics to "copy to" and "copy from", but since
> the cleancache operation may perform an "inflight" transformation
> on the data, and "copy" usually means a byte-for-byte replication,
> the word "copy" is also misleading.)
>
> The cleancache operation currently known as "flush" has the
> following semantics:  Derive the filesystem-determined handle
> from this struct page and struct mapping.  If cleancache
> contains a page matching that handle, delete in cleancache any
> record of the handle and any data associated with it, so that a
> subsequent "get" will no longer find a match for the handle;
> any space used for the data can also be freed
>
> The cleancache operation currently known as "flush inode" has
> the following semantics: Derive the filesystem-determined filekey
> from this struct mapping.  If cleancache contains ANY handles
> matching that filekey, delete in cleancache any record of
> any matching handle and any data associated with those handles;
> any space used for the data can also be freed.
>
> The cleancache operation currently known as "init fs" has
> the following semantics: Create a unique poolid to refer
> to this filesystem and save it in the superblock's
> cleancache_poolid field.
>
> The cleancache operation currently known as "flush fs" has
> the following semantics: Get the cleancache_poolid field
> from this superblock.  If cleancache contains ANY handles
> associated with that poolid, delete in cleancache any
> record of any matching handles and any data associated with
> those handles; any space used for the data can also be freed.
> Also, set the superblock's cleancache_poolid to be invalid
> and, in cleancache, recycle the poolid so a subsequent init_fs
> operation can reuse it.
>
> That's all!
>
> Thanks,
> Dan
>

At least, I wasn't confused by your semantics except for "flush"; that's
why I suggested renaming only flush. But after seeing your explanation,
there is one more thing I want to change. get/put is the common idiom
for reference counting in the kernel; from the point of view of your
semantics it makes sense, but since your get has exclusive semantics,
I want the API name to express that. Maybe
cleancache_get_page_exclusive.

In summary, I don't want to rename the whole API, just two things
(I am not sure you and others agree with me; it's just a suggestion):

1. cleancache_flush_page -> cleancache_[invalidate|remove]_page
2. cleancache_get_page -> cleancache_get_page_exclusive
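
As prototypes, the two renames would be (illustrative only):

	void cleancache_invalidate_page(struct address_space *mapping,
					struct page *page);   /* was cleancache_flush_page */
	int cleancache_get_page_exclusive(struct page *page); /* was cleancache_get_page */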

By the way, nice description.
Please include it in the documentation even if we can't reach a
conclusion on naming; it will help others understand the semantics of
cleancache.

Thanks, Dan.

-- 
Kind regards,
Minchan Kim


Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

2011-09-27 Thread Minchan Kim
rty_balance_reserve __read_mostly;
> +
>  int percpu_pagelist_fraction;
>  gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
>  
> @@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void)
>   if (max > zone->present_pages)
>   max = zone->present_pages;
>   reserve_pages += max;
> + /*
> +  * Lowmem reserves are not available to
> +  * GFP_HIGHUSER page cache allocations and
> +  * kswapd tries to balance zones to their high
> +  * watermark.  As a result, neither should be
> +  * regarded as dirtyable memory, to prevent a
> +  * situation where reclaim has to clean pages
> +  * in order to balance the zones.
> +  */

Could you use Mel's description here instead, if you don't mind?
If I hadn't seen Mel's version, maybe I wouldn't suggest this, but his
seems easier to understand.

> + zone->dirty_balance_reserve = max;
>   }
>   }
> + dirty_balance_reserve = reserve_pages;
>   totalreserve_pages = reserve_pages;
>  }
>  
> -- 
> 1.7.6.2
> 

-- 
Kind regards,
Minchan Kim


Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-27 Thread Minchan Kim
On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> The maximum number of dirty pages that exist in the system at any time
> is determined by a number of pages considered dirtyable and a
> user-configured percentage of those, or an absolute number in bytes.

This is a description of the old approach.

> 
> This number of dirtyable pages is the sum of memory provided by all
> the zones in the system minus their lowmem reserves and high
> watermarks, so that the system can retain a healthy number of free
> pages without having to reclaim dirty pages.

This is a description of the new approach.

> 
> But there is a flaw in that we have a zoned page allocator which does
> not care about the global state but rather the state of individual
> memory zones.  And right now there is nothing that prevents one zone
> from filling up with dirty pages while other zones are spared, which
> frequently leads to situations where kswapd, in order to restore the
> watermark of free pages, does indeed have to write pages from that
> zone's LRU list.  This can interfere so badly with IO from the flusher
> threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> requests from reclaim already, taking away the VM's only possibility
> to keep such a zone balanced, aside from hoping the flushers will soon
> clean pages from that zone.

This describes the old approach again!
Shouldn't we move the new-approach paragraph above down below this one?

> 
> Enter per-zone dirty limits.  They are to a zone's dirtyable memory
> what the global limit is to the global amount of dirtyable memory, and
> try to make sure that no single zone receives more than its fair share
> of the globally allowed dirty pages in the first place.  As the number
> of pages considered dirtyable exclude the zones' lowmem reserves and
> high watermarks, the maximum number of dirty pages in a zone is such
> that the zone can always be balanced without requiring page cleaning.
> 
> As this is a placement decision in the page allocator and pages are
> dirtied only after the allocation, this patch allows allocators to
> pass __GFP_WRITE when they know in advance that the page will be
> written to and become dirty soon.  The page allocator will then
> attempt to allocate from the first zone of the zonelist - which on
> NUMA is determined by the task's NUMA memory policy - that has not
> exceeded its dirty limit.
> 
> At first glance, it would appear that the diversion to lower zones can
> increase pressure on them, but this is not the case.  With a full high
> zone, allocations will be diverted to lower zones eventually, so it is
> more of a shift in timing of the lower zone allocations.  Workloads
> that previously could fit their dirty pages completely in the higher
> zone may be forced to allocate from lower zones, but the amount of
> pages that 'spill over' are limited themselves by the lower zones'
> dirty constraints, and thus unlikely to become a problem.

That's a good justification.

> 
> For now, the problem of unfair dirty page distribution remains for
> NUMA configurations where the zones allowed for allocation are in sum
> not big enough to trigger the global dirty limits, wake up the flusher
> threads and remedy the situation.  Because of this, an allocation that
> could not succeed on any of the considered zones is allowed to ignore
> the dirty limits before going into direct reclaim or even failing the
> allocation, until a future patch changes the global dirty throttling
> and flusher thread activation so that they take individual zone states
> into account.
> 
> Signed-off-by: Johannes Weiner 

Otherwise, looks good to me.
Reviewed-by: Minchan Kim 

-- 
Kind regards,
Minchan Kim


Re: [patch 1/2/4] mm: writeback: cleanups in preparation for per-zone dirty limits

2011-09-27 Thread Minchan Kim
On Fri, Sep 23, 2011 at 04:41:07PM +0200, Johannes Weiner wrote:
> On Thu, Sep 22, 2011 at 10:52:42AM +0200, Johannes Weiner wrote:
> > On Wed, Sep 21, 2011 at 04:02:26PM -0700, Andrew Morton wrote:
> > > Should we rename determine_dirtyable_memory() to
> > > global_dirtyable_memory(), to get some sense of its relationship with
> > > zone_dirtyable_memory()?
> > 
> > Sounds good.
> 
> ---
> 
> The next patch will introduce per-zone dirty limiting functions in
> addition to the traditional global dirty limiting.
> 
> Rename determine_dirtyable_memory() to global_dirtyable_memory()
> before adding the zone-specific version, and fix up its documentation.
> 
> Also, move the functions to determine the dirtyable memory and the
> function to calculate the dirty limit based on that together so that
> their relationship is more apparent and that they can be commented on
> as a group.
> 
> Signed-off-by: Johannes Weiner 
Reviewed-by: Minchan Kim 
-- 
Kind regards,
Minchan Kim


Re: [patch 3/4] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

2011-09-27 Thread Minchan Kim
On Tue, Sep 20, 2011 at 03:45:14PM +0200, Johannes Weiner wrote:
> Tell the page allocator that pages allocated through
> grab_cache_page_write_begin() are expected to become dirty soon.
> 
> Signed-off-by: Johannes Weiner 
Reviewed-by: Minchan Kim 
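
(For context, the mechanics of the hint; a sketch of the effect:)

	/* __GFP_WRITE is a placement hint only: the allocator may now
	 * prefer a zone that has not yet exhausted its dirty limit. */
	page = __page_cache_alloc(mapping_gfp_mask(mapping) | __GFP_WRITE);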

-- 
Kind regards,
Minchan Kim


Re: [patch 2/2/4] mm: try to distribute dirty pages fairly across zones

2011-09-28 Thread Minchan Kim
On Wed, Sep 28, 2011 at 09:11:54AM +0200, Johannes Weiner wrote:
> On Wed, Sep 28, 2011 at 02:56:40PM +0900, Minchan Kim wrote:
> > On Fri, Sep 23, 2011 at 04:42:48PM +0200, Johannes Weiner wrote:
> > > The maximum number of dirty pages that exist in the system at any time
> > > is determined by a number of pages considered dirtyable and a
> > > user-configured percentage of those, or an absolute number in bytes.
> > 
> > This is a description of the old approach.
> 
> What do you mean?  This does not change with this patch.  We still
> have a number of dirtyable pages and a limit that is applied
> relatively to this number.
> 
> > > This number of dirtyable pages is the sum of memory provided by all
> > > the zones in the system minus their lowmem reserves and high
> > > watermarks, so that the system can retain a healthy number of free
> > > pages without having to reclaim dirty pages.
> > 
> > This is a description of the new approach.
> 
> Same here, this aspect is also not changed with this patch!
> 
> > > But there is a flaw in that we have a zoned page allocator which does
> > > not care about the global state but rather the state of individual
> > > memory zones.  And right now there is nothing that prevents one zone
> > > from filling up with dirty pages while other zones are spared, which
> > > frequently leads to situations where kswapd, in order to restore the
> > > watermark of free pages, does indeed have to write pages from that
> > > zone's LRU list.  This can interfere so badly with IO from the flusher
> > > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > > requests from reclaim already, taking away the VM's only possibility
> > > to keep such a zone balanced, aside from hoping the flushers will soon
> > > clean pages from that zone.
> > 
> > This describes the old approach again!
> > Shouldn't we move the new-approach paragraph above down below this one?
> 
> Everything above describes the current behaviour (at the point of this
> patch, so respecting lowmem_reserve e.g. is part of the current
> behaviour by now) and its problems.  And below follows a description
> of how the patch tries to fix it.

It seems my choice of the terms "old" and "new" wasn't a good one.
Hannes, please ignore this; it's not a biggie.

-- 
Kind regards,
Minchan Kim


Re: [patch 1/4 v2] mm: exclude reserved pages from dirtyable memory

2011-09-28 Thread Minchan Kim
On Wed, Sep 28, 2011 at 09:50:54AM +0200, Johannes Weiner wrote:
> On Wed, Sep 28, 2011 at 01:55:51PM +0900, Minchan Kim wrote:
> > Hi Hannes,
> > 
> > On Fri, Sep 23, 2011 at 04:38:17PM +0200, Johannes Weiner wrote:
> > > The amount of dirtyable pages should not include the full number of
> > > free pages: there is a number of reserved pages that the page
> > > allocator and kswapd always try to keep free.
> > > 
> > > The closer (reclaimable pages - dirty pages) is to the number of
> > > reserved pages, the more likely it becomes for reclaim to run into
> > > dirty pages:
> > > 
> > >+--+ ---
> > >|   anon   |  |
> > >+--+  |
> > >|  |  |
> > >|  |  -- dirty limit new-- flusher new
> > >|   file   |  | |
> > >|  |  | |
> > >|  |  -- dirty limit old-- flusher old
> > >|  ||
> > >+--+   --- reclaim
> > >| reserved |
> > >+--+
> > >|  kernel  |
> > >+--+
> > > 
> > > This patch introduces a per-zone dirty reserve that takes both the
> > > lowmem reserve as well as the high watermark of the zone into account,
> > > and a global sum of those per-zone values that is subtracted from the
> > > global amount of dirtyable pages.  The lowmem reserve is unavailable
> > > to page cache allocations and kswapd tries to keep the high watermark
> > > free.  We don't want to end up in a situation where reclaim has to
> > > clean pages in order to balance zones.
> > > 
> > > Not treating reserved pages as dirtyable on a global level is only a
> > > conceptual fix.  In reality, dirty pages are not distributed equally
> > > across zones and reclaim runs into dirty pages on a regular basis.
> > > 
> > > But it is important to get this right before tackling the problem on a
> > > per-zone level, where the distance between reclaim and the dirty pages
> > > is mostly much smaller in absolute numbers.
> > > 
> > > Signed-off-by: Johannes Weiner 
> > > ---
> > >  include/linux/mmzone.h |6 ++
> > >  include/linux/swap.h   |1 +
> > >  mm/page-writeback.c|6 --
> > >  mm/page_alloc.c|   19 +++
> > >  4 files changed, 30 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 1ed4116..37a61e7 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -317,6 +317,12 @@ struct zone {
> > >*/
> > >   unsigned long   lowmem_reserve[MAX_NR_ZONES];
> > >  
> > > + /*
> > > +  * This is a per-zone reserve of pages that should not be
> > > +  * considered dirtyable memory.
> > > +  */
> > > + unsigned long   dirty_balance_reserve;
> > > +
> > >  #ifdef CONFIG_NUMA
> > >   int node;
> > >   /*
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index b156e80..9021453 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -209,6 +209,7 @@ struct swap_list_t {
> > >  /* linux/mm/page_alloc.c */
> > >  extern unsigned long totalram_pages;
> > >  extern unsigned long totalreserve_pages;
> > > +extern unsigned long dirty_balance_reserve;
> > >  extern unsigned int nr_free_buffer_pages(void);
> > >  extern unsigned int nr_free_pagecache_pages(void);
> > >  
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index da6d263..c8acf8a 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> > >   &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> > >  
> > >   x += zone_page_state(z, NR_FREE_PAGES) +
> > > -  zone_reclaimable_pages(z);
> > > +  zone_reclaimable_pages(z) -
> > > +  zone->dirty_balance_reserve;
> > >   }
> > >   /*
> > >* Make sure that the number of highmem pages is never larger
> > > @@ -194,7 +195,8 @@ static unsigned long

Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory

2011-10-01 Thread Minchan Kim
On Fri, Sep 30, 2011 at 09:17:20AM +0200, Johannes Weiner wrote:
> The amount of dirtyable pages should not include the full number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
> 
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:
> 
>+--+ ---
>|   anon   |  |
>+--+  |
>|  |  |
>|  |  -- dirty limit new-- flusher new
>|   file   |  | |
>|  |  | |
>|  |  -- dirty limit old-- flusher old
>|  ||
>+--+   --- reclaim
>| reserved |
>+--+
>|  kernel  |
>+--+
> 
> This patch introduces a per-zone dirty reserve that takes both the
> lowmem reserve as well as the high watermark of the zone into account,
> and a global sum of those per-zone values that is subtracted from the
> global amount of dirtyable pages.  The lowmem reserve is unavailable
> to page cache allocations and kswapd tries to keep the high watermark
> free.  We don't want to end up in a situation where reclaim has to
> clean pages in order to balance zones.
> 
> Not treating reserved pages as dirtyable on a global level is only a
> conceptual fix.  In reality, dirty pages are not distributed equally
> across zones and reclaim runs into dirty pages on a regular basis.
> 
> But it is important to get this right before tackling the problem on a
> per-zone level, where the distance between reclaim and the dirty pages
> is mostly much smaller in absolute numbers.
> 
> Signed-off-by: Johannes Weiner 
> Reviewed-by: Rik van Riel 
Reviewed-by: Minchan Kim 
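
For readers skimming the series: per zone, the reserve boils down to
roughly (capped at the zone's present_pages)

	dirty_balance_reserve = max_j(lowmem_reserve[j]) + high_wmark_pages(zone)

and the global dirtyable count subtracts the sum of these per-zone
values.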

-- 
Kind regards,
Minchan Kim