Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
On Sun, Apr 12, 2015 at 11:48:23PM +0900, Minchan Kim wrote:
> Hello Hugh,
> 
> On Sat, Apr 11, 2015 at 02:40:46PM -0700, Hugh Dickins wrote:
> > On Wed, 11 Mar 2015, Minchan Kim wrote:
> > 
> > > Basically, MADV_FREE relies on the pte dirty bit to decide whether
> > > it allows the VM to discard the page. However, if there is a swap-in,
> > > the pte pointing to the page has no pte_dirty. So MADV_FREE checks
> > > PageDirty and PageSwapCache for those pages and does not discard them,
> > > because a swapped-in page could still live in the swap cache, or be
> > > PageDirty once it is removed from the swap cache.
> > > 
> > > The problem here is that an anonymous page can have PageDirty set
> > > once it is removed from the swap cache, so the VM cannot treat such
> > > pages as freeable even after madvise_free. Look at the example below.
> > > 
> > > ptr = malloc();
> > > memset(ptr);
> > > ..
> > > heavy memory pressure -> swap-out all of the pages
> > > ..
> > > memory pressure is gone, so there are lots of free pages
> > > ..
> > > var = *ptr; -> swap-in the page / remove the page from the swap cache,
> > >                so pte_clean but SetPageDirty
> > > 
> > > madvise_free(ptr);
> > > ..
> > > ..
> > > heavy memory pressure -> VM cannot discard the page because of PageDirty.
> > > 
> > > PageDirty for an anonymous page aims at avoiding duplicated swap-out.
> > > In other words, if a page has been swapped in but still lives in the
> > > swap cache (ie, !PageDirty), we can save a swap-out if the page is
> > > later selected as a victim by the VM, because the swap device has kept
> > > the previously swapped-out contents of the page.
> > > 
> > > So, rather than relying on PG_dirty for madvise_free to work,
> > > pte_dirty is more straightforward. Inherently, a swapped-out page was
> > > pte_dirty, so this patch restores the dirtiness when a swap-in fault
> > > happens, and madvise_free doesn't rely on PageDirty any more.
> > > 
> > > Cc: Hugh Dickins
> > > Cc: Cyrill Gorcunov
> > > Cc: Pavel Emelyanov
> > > Reported-by: Yalin Wang
> > > Signed-off-by: Minchan Kim
> > 
> > Sorry, but NAK to this patch,
> > mm-make-every-pte-dirty-on-do_swap_page.patch in akpm's mm tree
> > (I hope it hasn't reached linux-next yet).
> > 
> > You may well be right that pte_dirty<->PageDirty can be handled
> > differently, in a way more favourable to MADV_FREE. And this patch
> > may be a step in the right direction, but I've barely given it thought.
> > 
> > As it stands, it segfaults more than any patch I've seen in years:
> > I just tried applying it to 4.0-rc7-mm1, and running kernel builds
> > in low memory with swap. Even if I leave KSM out, and memcg out, and
> > swapoff out, and THP out, and tmpfs out, it still SIGSEGVs very soon.
> > 
> > I have a choice: spend a few hours tracking down the errors, and
> > post a fix patch on top of yours? But even then I'd want to spend
> > a lot longer thinking through every dirty/Dirty in the source before
> > I'd feel comfortable to give an ack.
> > 
> > This is users' data, and we need to be very careful with it: errors
> > in MADV_FREE are one thing, for now that's easy to avoid; but in this
> > patch you're changing the rules for Anon PageDirty for everyone.
> > 
> > I think for now I'll have to leave it to you to do much more source
> > diligence and testing, before coming back with a corrected patch for
> > us then to review, slowly and carefully.
> 
> Sorry for my mistake. I will keep your advice in mind.
> I will investigate the problem as soon as I get back to work
> after vacation.
> 
> Thanks for the review.

When I look at the code, migration doesn't restore the dirty bit of the
pte in remove_migration_pte, and relies on the PG_dirty that was set by
try_to_unmap_one. I think that is a reason you saw the segfaults.

I will spend more time investigating other code paths that might skip
restoring the dirty bit.

Thanks.
> > 
> -- 
> Kind regards,
> Minchan Kim

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
Hello Hugh,

On Sat, Apr 11, 2015 at 02:40:46PM -0700, Hugh Dickins wrote:
> On Wed, 11 Mar 2015, Minchan Kim wrote:
> 
> > Basically, MADV_FREE relies on the pte dirty bit to decide whether
> > it allows the VM to discard the page. However, if there is a swap-in,
> > the pte pointing to the page has no pte_dirty. So MADV_FREE checks
> > PageDirty and PageSwapCache for those pages and does not discard them,
> > because a swapped-in page could still live in the swap cache, or be
> > PageDirty once it is removed from the swap cache.
> > 
> > The problem here is that an anonymous page can have PageDirty set
> > once it is removed from the swap cache, so the VM cannot treat such
> > pages as freeable even after madvise_free. Look at the example below.
> > 
> > ptr = malloc();
> > memset(ptr);
> > ..
> > heavy memory pressure -> swap-out all of the pages
> > ..
> > memory pressure is gone, so there are lots of free pages
> > ..
> > var = *ptr; -> swap-in the page / remove the page from the swap cache,
> >                so pte_clean but SetPageDirty
> > 
> > madvise_free(ptr);
> > ..
> > ..
> > heavy memory pressure -> VM cannot discard the page because of PageDirty.
> > 
> > PageDirty for an anonymous page aims at avoiding duplicated swap-out.
> > In other words, if a page has been swapped in but still lives in the
> > swap cache (ie, !PageDirty), we can save a swap-out if the page is
> > later selected as a victim by the VM, because the swap device has kept
> > the previously swapped-out contents of the page.
> > 
> > So, rather than relying on PG_dirty for madvise_free to work,
> > pte_dirty is more straightforward. Inherently, a swapped-out page was
> > pte_dirty, so this patch restores the dirtiness when a swap-in fault
> > happens, and madvise_free doesn't rely on PageDirty any more.
> > 
> > Cc: Hugh Dickins
> > Cc: Cyrill Gorcunov
> > Cc: Pavel Emelyanov
> > Reported-by: Yalin Wang
> > Signed-off-by: Minchan Kim
> 
> Sorry, but NAK to this patch,
> mm-make-every-pte-dirty-on-do_swap_page.patch in akpm's mm tree
> (I hope it hasn't reached linux-next yet).
> 
> You may well be right that pte_dirty<->PageDirty can be handled
> differently, in a way more favourable to MADV_FREE. And this patch
> may be a step in the right direction, but I've barely given it thought.
> 
> As it stands, it segfaults more than any patch I've seen in years:
> I just tried applying it to 4.0-rc7-mm1, and running kernel builds
> in low memory with swap. Even if I leave KSM out, and memcg out, and
> swapoff out, and THP out, and tmpfs out, it still SIGSEGVs very soon.
> 
> I have a choice: spend a few hours tracking down the errors, and
> post a fix patch on top of yours? But even then I'd want to spend
> a lot longer thinking through every dirty/Dirty in the source before
> I'd feel comfortable to give an ack.
> 
> This is users' data, and we need to be very careful with it: errors
> in MADV_FREE are one thing, for now that's easy to avoid; but in this
> patch you're changing the rules for Anon PageDirty for everyone.
> 
> I think for now I'll have to leave it to you to do much more source
> diligence and testing, before coming back with a corrected patch for
> us then to review, slowly and carefully.

Sorry for my mistake. I will keep your advice in mind.
I will investigate the problem as soon as I get back to work
after vacation.

Thanks for the review.

-- 
Kind regards,
Minchan Kim
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
On Wed, 11 Mar 2015, Minchan Kim wrote:

> Basically, MADV_FREE relies on the pte dirty bit to decide whether
> it allows the VM to discard the page. However, if there is a swap-in,
> the pte pointing to the page has no pte_dirty. So MADV_FREE checks
> PageDirty and PageSwapCache for those pages and does not discard them,
> because a swapped-in page could still live in the swap cache, or be
> PageDirty once it is removed from the swap cache.
> 
> The problem here is that an anonymous page can have PageDirty set
> once it is removed from the swap cache, so the VM cannot treat such
> pages as freeable even after madvise_free. Look at the example below.
> 
> ptr = malloc();
> memset(ptr);
> ..
> heavy memory pressure -> swap-out all of the pages
> ..
> memory pressure is gone, so there are lots of free pages
> ..
> var = *ptr; -> swap-in the page / remove the page from the swap cache,
>                so pte_clean but SetPageDirty
> 
> madvise_free(ptr);
> ..
> ..
> heavy memory pressure -> VM cannot discard the page because of PageDirty.
> 
> PageDirty for an anonymous page aims at avoiding duplicated swap-out.
> In other words, if a page has been swapped in but still lives in the
> swap cache (ie, !PageDirty), we can save a swap-out if the page is
> later selected as a victim by the VM, because the swap device has kept
> the previously swapped-out contents of the page.
> 
> So, rather than relying on PG_dirty for madvise_free to work,
> pte_dirty is more straightforward. Inherently, a swapped-out page was
> pte_dirty, so this patch restores the dirtiness when a swap-in fault
> happens, and madvise_free doesn't rely on PageDirty any more.
> 
> Cc: Hugh Dickins
> Cc: Cyrill Gorcunov
> Cc: Pavel Emelyanov
> Reported-by: Yalin Wang
> Signed-off-by: Minchan Kim

Sorry, but NAK to this patch,
mm-make-every-pte-dirty-on-do_swap_page.patch in akpm's mm tree
(I hope it hasn't reached linux-next yet).

You may well be right that pte_dirty<->PageDirty can be handled
differently, in a way more favourable to MADV_FREE. And this patch
may be a step in the right direction, but I've barely given it thought.

As it stands, it segfaults more than any patch I've seen in years:
I just tried applying it to 4.0-rc7-mm1, and running kernel builds
in low memory with swap. Even if I leave KSM out, and memcg out, and
swapoff out, and THP out, and tmpfs out, it still SIGSEGVs very soon.

I have a choice: spend a few hours tracking down the errors, and
post a fix patch on top of yours? But even then I'd want to spend
a lot longer thinking through every dirty/Dirty in the source before
I'd feel comfortable to give an ack.

This is users' data, and we need to be very careful with it: errors
in MADV_FREE are one thing, for now that's easy to avoid; but in this
patch you're changing the rules for Anon PageDirty for everyone.

I think for now I'll have to leave it to you to do much more source
diligence and testing, before coming back with a corrected patch for
us then to review, slowly and carefully.

Hugh

> ---
>  mm/madvise.c | 1 -
>  mm/memory.c  | 9 +++--
>  mm/rmap.c    | 2 +-
>  mm/vmscan.c  | 3 +--
>  4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 22e8f0c..a045798 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			continue;
>  		}
>  
> -		ClearPageDirty(page);
>  		unlock_page(page);
>  	}
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f96a4a..40428a5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2521,9 +2521,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +
> +	/*
> +	 * Every page swapped-out was pte_dirty so we make pte dirty again.
> +	 * MADV_FREE relies on it.
> +	 */
> +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
>  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = maybe_mkwrite(pte, vma);
>  		flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
>  		exclusive = 1;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 47b3ba8..34c1d66 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  
>  	if (flags & TTU_FREE) {
>  		VM_BUG_ON_PAGE(PageSwapCache(page), page);
> -		if (!dirty && !PageDirty(page)) {
> +		if (!dirty) {
>  			/* It's a freeable page by MADV_FREE */
>  			dec_mm_counter(mm, MM_ANONPAGES);
>  			goto discard;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 260c413..3357ffa 100644
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
On Thu, 9 Apr 2015 08:50:25 +0900 Minchan Kim wrote:

> Bump.

I'm getting the feeling that MADV_FREE is out of control.

Below is the overall rollup of

  mm-support-madvisemadv_free.patch
  mm-support-madvisemadv_free-fix.patch
  mm-support-madvisemadv_free-fix-2.patch
  mm-dont-split-thp-page-when-syscall-is-called.patch
  mm-dont-split-thp-page-when-syscall-is-called-fix.patch
  mm-dont-split-thp-page-when-syscall-is-called-fix-2.patch
  mm-free-swp_entry-in-madvise_free.patch
  mm-move-lazy-free-pages-to-inactive-list.patch
  mm-move-lazy-free-pages-to-inactive-list-fix.patch
  mm-move-lazy-free-pages-to-inactive-list-fix-fix.patch
  mm-move-lazy-free-pages-to-inactive-list-fix-fix-fix.patch
  mm-make-every-pte-dirty-on-do_swap_page.patch

It's pretty large and has its sticky little paws in all sorts of places.
The feature would need to be pretty darn useful to justify a mainline
merge.  Has any such usefulness been demonstrated?

 arch/alpha/include/uapi/asm/mman.h     |    1 
 arch/mips/include/uapi/asm/mman.h      |    1 
 arch/parisc/include/uapi/asm/mman.h    |    1 
 arch/xtensa/include/uapi/asm/mman.h    |    1 
 include/linux/huge_mm.h                |    4 
 include/linux/rmap.h                   |    9 -
 include/linux/swap.h                   |    1 
 include/linux/vm_event_item.h          |    1 
 include/uapi/asm-generic/mman-common.h |    1 
 mm/huge_memory.c                       |   35 
 mm/madvise.c                           |  175 +++
 mm/memory.c                            |   10 +
 mm/rmap.c                              |   46 +-
 mm/swap.c                              |   44 +
 mm/vmscan.c                            |   63 ++--
 mm/vmstat.c                            |    1 
 16 files changed, 372 insertions(+), 22 deletions(-)

diff -puN include/linux/rmap.h~mm-support-madvisemadv_free include/linux/rmap.h
--- a/include/linux/rmap.h~mm-support-madvisemadv_free
+++ a/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
 	TTU_UNMAP = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
+	TTU_FREE = 8,			/* free mode */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
@@ -183,7 +184,8 @@ static inline void page_dup_rmap(struct
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *memcg, unsigned long *vm_flags);
+			struct mem_cgroup *memcg, unsigned long *vm_flags,
+			int *is_pte_dirty);
 
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
@@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct
 static inline int page_referenced(struct page *page, int is_locked,
 				  struct mem_cgroup *memcg,
-				  unsigned long *vm_flags)
+				  unsigned long *vm_flags,
+				  int *is_pte_dirty)
 {
 	*vm_flags = 0;
+	if (is_pte_dirty)
+		*is_pte_dirty = 0;
 	return 0;
 }

diff -puN include/linux/vm_event_item.h~mm-support-madvisemadv_free include/linux/vm_event_item.h
--- a/include/linux/vm_event_item.h~mm-support-madvisemadv_free
+++ a/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGLAZYFREED,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),

diff -puN include/uapi/asm-generic/mman-common.h~mm-support-madvisemadv_free include/uapi/asm-generic/mman-common.h
--- a/include/uapi/asm-generic/mman-common.h~mm-support-madvisemadv_free
+++ a/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */

diff -puN mm/madvise.c~mm-support-madvisemadv_free mm/madvise.c
--- a/mm/madvise.c~mm-support-madvisemadv_free
+++ a/mm/madvise.c
@@ -19,6 +19,14 @@
 #include <linux/blkdev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
+
+struct madvise_free_private {
+	struct vm_area_struct *vma;
+	struct mmu_gather *tlb;
+};
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int b
 	case MADV_REMOVE:
 	case
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
Bump.

On Wed, Mar 11, 2015 at 10:20:38AM +0900, Minchan Kim wrote:
> Basically, MADV_FREE relies on the pte dirty bit to decide whether
> it allows the VM to discard the page. However, if there is a swap-in,
> the pte pointing at the page has no pte_dirty. So, MADV_FREE checks
> PageDirty and PageSwapCache for those pages so as not to discard them,
> because a swapped-in page could live in the swap cache or be PageDirty
> when it is removed from the swapcache.
>
> The problem here is that anonymous pages can have PageDirty set once
> they are removed from the swapcache, so the VM cannot treat those pages
> as freeable even if we did madvise_free. Look at the example below.
>
> ptr = malloc();
> memset(ptr);
> ..
> heavy memory pressure -> swap-out all of the pages
> ..
> memory pressure is gone, so there are lots of free pages
> ..
> var = *ptr; -> swap-in page/remove the page from swapcache, so pte_clean
>                but SetPageDirty
>
> madvise_free(ptr);
> ..
> ..
> heavy memory pressure -> VM cannot discard the page because of PageDirty.
>
> PageDirty for an anonymous page aims to avoid duplicate swap-out.
> In other words, if a page has been swapped in but still lives in the
> swapcache (ie, !PageDirty), we could save a swap-out if the page is
> selected as a victim by the VM in the future, because the swap device
> has kept the previously swapped-out contents of the page.
>
> So, rather than relying on PG_dirty for madvise_free to work,
> pte_dirty is more straightforward. Inherently, a swapped-out page was
> pte_dirty, so this patch restores the dirtiness when a swap-in fault
> happens, so madvise_free doesn't rely on PageDirty any more.
>
> Cc: Hugh Dickins
> Cc: Cyrill Gorcunov
> Cc: Pavel Emelyanov
> Reported-by: Yalin Wang
> Signed-off-by: Minchan Kim
> ---
>  mm/madvise.c | 1 -
>  mm/memory.c  | 9 +++--
>  mm/rmap.c    | 2 +-
>  mm/vmscan.c  | 3 +--
>  4 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 22e8f0c..a045798 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> 			continue;
> 		}
>
> -		ClearPageDirty(page);
> 		unlock_page(page);
> 	}
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 0f96a4a..40428a5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2521,9 +2521,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> 	inc_mm_counter_fast(mm, MM_ANONPAGES);
> 	dec_mm_counter_fast(mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +
> +	/*
> +	 * Every page swapped-out was pte_dirty so we make pte dirty again.
> +	 * MADV_FREE relies on it.
> +	 */
> +	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
> 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = maybe_mkwrite(pte, vma);
> 		flags &= ~FAULT_FLAG_WRITE;
> 		ret |= VM_FAULT_WRITE;
> 		exclusive = 1;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 47b3ba8..34c1d66 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>
> 	if (flags & TTU_FREE) {
> 		VM_BUG_ON_PAGE(PageSwapCache(page), page);
> -		if (!dirty && !PageDirty(page)) {
> +		if (!dirty) {
> 			/* It's a freeable page by MADV_FREE */
> 			dec_mm_counter(mm, MM_ANONPAGES);
> 			goto discard;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 260c413..3357ffa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
> 		return PAGEREF_KEEP;
> 	}
>
> -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> -	    !PageDirty(page))
> +	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
> 		*freeable = true;
>
> 	/* Reclaim if clean, defer dirty pages to writeback */
> --
> 1.9.3

--
Kind regards,
Minchan Kim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
On Tue, Mar 31, 2015 at 12:14:46AM +0300, Cyrill Gorcunov wrote: > On Mon, Mar 30, 2015 at 05:59:15PM +0900, Minchan Kim wrote: > > Hi Cyrill, > > > > On Mon, Mar 30, 2015 at 11:51:12AM +0300, Cyrill Gorcunov wrote: > > > On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote: > > > > 2nd description trial. > > > ... > > > Hi Minchan, could you please point for which repo this patch, > > > linux-next? > > > > It was based on v4.0-rc5-mmotm-2015-03-24-17-02. > > As well, I confirmed it was applied on local-next-20150327. > > > > Thanks. > > Hi Minchan! I managed to fetch mmotm and the change looks > reasonable to me. Still better to wait for review from Mel > or Hugh, maybe I miss something obvious. Thanks for the review! -- Kind regards, Minchan Kim
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
On Mon, Mar 30, 2015 at 05:59:15PM +0900, Minchan Kim wrote: > Hi Cyrill, > > On Mon, Mar 30, 2015 at 11:51:12AM +0300, Cyrill Gorcunov wrote: > > On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote: > > > 2nd description trial. > > ... > > Hi Minchan, could you please point for which repo this patch, > > linux-next? > > It was based on v4.0-rc5-mmotm-2015-03-24-17-02. > As well, I confirmed it was applied on local-next-20150327. > > Thanks. Hi Minchan! I managed to fetch mmotm and the change looks reasonable to me. Still better to wait for review from Mel or Hugh, maybe I miss something obvious.
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
Hi Cyrill, On Mon, Mar 30, 2015 at 11:51:12AM +0300, Cyrill Gorcunov wrote: > On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote: > > 2nd description trial. > ... > Hi Minchan, could you please point for which repo this patch, > linux-next? It was based on v4.0-rc5-mmotm-2015-03-24-17-02. As well, I confirmed it was applied on local-next-20150327. Thanks. -- Kind regards, Minchan Kim
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
On Mon, Mar 30, 2015 at 02:22:50PM +0900, Minchan Kim wrote: > 2nd description trial. ... Hi Minchan, could you please point for which repo this patch, linux-next?
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
2nd description trial.

From ccfc6c79634f6cec69d8fb23b0e863ebfa5b893c Mon Sep 17 00:00:00 2001
From: Minchan Kim
Date: Mon, 30 Mar 2015 13:43:08 +0900
Subject: [PATCH v2] mm: make every pte dirty on do_swap_page

Basically, MADV_FREE relies on the dirty bit in the page table entry to decide whether the VM is allowed to discard the page or not. IOW, if the page table entry has the dirty bit set, the VM shouldn't discard the page. However, if a swap-in by read fault happens, the page table entry pointing at the page doesn't have the dirty bit set, so MADV_FREE might discard the page wrongly. To avoid the problem, MADV_FREE did more checks with PageDirty and PageSwapCache. That worked because a swapped-in page lives in the swap cache, and once it is evicted from the swap cache the page has the PG_dirty flag. So both page-flag checks effectively prevent wrong discarding by MADV_FREE.

A problem in the above logic is that a swapped-in page keeps PG_dirty after it is removed from the swap cache, so the VM cannot consider the page freeable any more although madvise_free is called later. Look at the example below for detail.

ptr = malloc();
memset(ptr);
..
..
heavy memory pressure so all of the pages are swapped out
..
..
var = *ptr; -> a page is swapped in and removed from swapcache. The page
               table entry doesn't have the dirty bit set, but the page
               descriptor has PG_dirty.
..
..
madvise_free(ptr);
..
..
heavy memory pressure again.
.. This time, the VM cannot discard the page because the page
.. has *PG_dirty*.

Rather than relying on PG_dirty in the page descriptor to prevent discarding a page, the dirty bit in the page table is more straightforward and simple. So, this patch marks the page table dirty bit whenever a swap-in happens. Inherently, the page table entry pointing at the swapped-out page had the dirty bit set, so I think that's no problem. With this, the complicated logic is removed and the freeable-page check in madvise_free becomes simple. Of course, it also solves the example mentioned above.
Cc: Hugh Dickins
Cc: Cyrill Gorcunov
Cc: Pavel Emelyanov
Reported-by: Yalin Wang
Signed-off-by: Minchan Kim
---
* From v1:
  * Rewrite description - Andrew

 mm/madvise.c | 1 -
 mm/memory.c  | 10 --
 mm/rmap.c    | 2 +-
 mm/vmscan.c  | 3 +--
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..a045798 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}

-		ClearPageDirty(page);
 		unlock_page(page);
 	}

diff --git a/mm/memory.c b/mm/memory.c
index 6743966..48ff537 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2521,9 +2521,15 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,

 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/*
+	 * The page swapping in now was dirty before it was swapped out
+	 * so restore the state again (ie, pte_mkdirty) because MADV_FREE
+	 * relies on the dirty bit on page table.
+	 */
+	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a4..281e806 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1275,7 +1275,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

 	if (flags & TTU_FREE) {
 		VM_BUG_ON_PAGE(PageSwapCache(page), page);
-		if (!dirty && !PageDirty(page)) {
+		if (!dirty) {
 			/* It's a freeable page by MADV_FREE */
 			dec_mm_counter(mm, MM_ANONPAGES);
 			goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dc6cd51..fffebf0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}

-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-	    !PageDirty(page))
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
 		*freeable = true;

 	/* Reclaim if clean, defer dirty pages to writeback */
--
1.9.3

--
Kind regards,
Minchan Kim
Re: [PATCH 4/4] mm: make every pte dirty on do_swap_page
2nd description trial. From ccfc6c79634f6cec69d8fb23b0e863ebfa5b893c Mon Sep 17 00:00:00 2001 From: Minchan Kim minc...@kernel.org Date: Mon, 30 Mar 2015 13:43:08 +0900 Subject: [PATCH v2] mm: make every pte dirty on do_swap_page Bascially, MADV_FREE relys on the dirty bit in page table entry to decide whether VM allows to discard the page or not. IOW, if page table entry includes marked dirty bit, VM shouldn't discard the page. However, if swap-in by read fault happens, page table entry point out the page doesn't have marked dirty bit so MADV_FREE might discard the page wrongly. For avoiding the problem, MADV_FREE did more checks with PageDirty and PageSwapCache. It worked out because swapped-in page lives on swap cache and since it was evicted from the swap cache, the page has PG_dirty flag. So both page flags checks effectvely prevent wrong discarding by MADV_FREE. A problem in above logic is that swapped-in page has PG_dirty since they are removed from swap cache so VM cannot consider those pages as freeable any more alghouth madvise_free is called in future. Look at below example for detail. ptr = malloc(); memset(ptr); .. .. .. heavy memory pressure so all of pages are swapped out .. .. var = *ptr; - a page swapped-in and removed from swapcache. page table doesn't mark dirty bit and page descriptor includes PG_dirty .. .. madvise_free(ptr); .. .. .. .. heavy memory pressure again. .. In this time, VM cannot discard the page because the page .. has *PG_dirty* Rather than relying on the PG_dirty of page descriptor for preventing discarding a page, dirty bit in page table is more straightforward and simple. So, this patch makes page table dirty bit marked whenever swap-in happens. Inherenty, page table entry point out swapped-out page had dirty bit so I think it's no prblem. With this, it removes complicated logic and makes freeable page checking by madvise_free simple. Of course, we could solve above mentioned example. 
Cc: Hugh Dickins hu...@google.com Cc: Cyrill Gorcunov gorcu...@gmail.com Cc: Pavel Emelyanov xe...@parallels.com Reported-by: Yalin Wang yalin.w...@sonymobile.com Signed-off-by: Minchan Kim minc...@kernel.org --- * From v1: * Rewrite description - Andrew mm/madvise.c | 1 - mm/memory.c | 10 -- mm/rmap.c| 2 +- mm/vmscan.c | 3 +-- 4 files changed, 10 insertions(+), 6 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 22e8f0c..a045798 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, continue; } - ClearPageDirty(page); unlock_page(page); } diff --git a/mm/memory.c b/mm/memory.c index 6743966..48ff537 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2521,9 +2521,15 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, inc_mm_counter_fast(mm, MM_ANONPAGES); dec_mm_counter_fast(mm, MM_SWAPENTS); - pte = mk_pte(page, vma-vm_page_prot); + + /* +* The page is swapping in now was dirty before it was swapped out +* so restore the state again(ie, pte_mkdirty) because MADV_FREE +* relies on the dirty bit on page table. 
+*/ + pte = pte_mkdirty(mk_pte(page, vma-vm_page_prot)); if ((flags FAULT_FLAG_WRITE) reuse_swap_page(page)) { - pte = maybe_mkwrite(pte_mkdirty(pte), vma); + pte = maybe_mkwrite(pte, vma); flags = ~FAULT_FLAG_WRITE; ret |= VM_FAULT_WRITE; exclusive = 1; diff --git a/mm/rmap.c b/mm/rmap.c index dad23a4..281e806 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1275,7 +1275,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (flags TTU_FREE) { VM_BUG_ON_PAGE(PageSwapCache(page), page); - if (!dirty !PageDirty(page)) { + if (!dirty) { /* It's a freeable page by MADV_FREE */ dec_mm_counter(mm, MM_ANONPAGES); goto discard; diff --git a/mm/vmscan.c b/mm/vmscan.c index dc6cd51..fffebf0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_KEEP; } - if (PageAnon(page) !pte_dirty !PageSwapCache(page) - !PageDirty(page)) + if (PageAnon(page) !pte_dirty !PageSwapCache(page)) *freeable = true; /* Reclaim if clean, defer dirty pages to writeback */ -- 1.9.3 -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at
[PATCH 4/4] mm: make every pte dirty on do_swap_page
Basically, MADV_FREE relies on the pte dirty bit to decide whether it allows the VM to discard the page. However, if there is a swap-in, the pte pointing at the page has no pte_dirty. So, MADV_FREE checks PageDirty and PageSwapCache for those pages so as not to discard them, because a swapped-in page could live in the swap cache or be PageDirty when it is removed from the swapcache.

The problem here is that anonymous pages can have PageDirty set once they are removed from the swapcache, so the VM cannot treat those pages as freeable even if we did madvise_free. Look at the example below.

ptr = malloc();
memset(ptr);
..
heavy memory pressure -> swap-out all of the pages
..
memory pressure is gone, so there are lots of free pages
..
var = *ptr; -> swap-in page/remove the page from swapcache, so pte_clean
               but SetPageDirty

madvise_free(ptr);
..
..
heavy memory pressure -> VM cannot discard the page because of PageDirty.

PageDirty for an anonymous page aims to avoid duplicate swap-out. In other words, if a page has been swapped in but still lives in the swapcache (ie, !PageDirty), we could save a swap-out if the page is selected as a victim by the VM in the future, because the swap device has kept the previously swapped-out contents of the page.

So, rather than relying on PG_dirty for madvise_free to work, pte_dirty is more straightforward. Inherently, a swapped-out page was pte_dirty, so this patch restores the dirtiness when a swap-in fault happens, so madvise_free doesn't rely on PageDirty any more.
Cc: Hugh Dickins
Cc: Cyrill Gorcunov
Cc: Pavel Emelyanov
Reported-by: Yalin Wang
Signed-off-by: Minchan Kim
---
 mm/madvise.c | 1 -
 mm/memory.c  | 9 +++--
 mm/rmap.c    | 2 +-
 mm/vmscan.c  | 3 +--
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 22e8f0c..a045798 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -325,7 +325,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}

-		ClearPageDirty(page);
 		unlock_page(page);
 	}

diff --git a/mm/memory.c b/mm/memory.c
index 0f96a4a..40428a5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2521,9 +2521,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,

 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/*
+	 * Every page swapped-out was pte_dirty so we make pte dirty again.
+	 * MADV_FREE relies on it.
+	 */
+	pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index 47b3ba8..34c1d66 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

 	if (flags & TTU_FREE) {
 		VM_BUG_ON_PAGE(PageSwapCache(page), page);
-		if (!dirty && !PageDirty(page)) {
+		if (!dirty) {
 			/* It's a freeable page by MADV_FREE */
 			dec_mm_counter(mm, MM_ANONPAGES);
 			goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 260c413..3357ffa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}

-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-	    !PageDirty(page))
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
 		*freeable = true;

 	/* Reclaim if clean, defer dirty pages to writeback */
--
1.9.3