[PATCH v2 2/4] mm: make tlb_flush_pending global

2017-07-31 Thread Minchan Kim
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
COMPACTION], but upcoming patches to solve a subtle TLB flush batching
problem will use it regardless of compaction/numa, so this patch
removes the dependency.

Cc: Nadav Amit 
Cc: Mel Gorman 
Signed-off-by: Minchan Kim 
---
 include/linux/mm_types.h | 21 -
 mm/debug.c   |  2 --
 2 files changed, 23 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c605f2a3a68e..892a7b0196fd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -487,14 +487,12 @@ struct mm_struct {
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
 #endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
 * An operation with batched TLB flushing is going on. Anything that
 * can move process memory needs to flush the TLB when moving a
 * PROT_NONE or PROT_NUMA mapped page.
 */
atomic_t tlb_flush_pending;
-#endif
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
/* See flush_tlb_batched_pending() */
bool tlb_flush_batched;
@@ -528,7 +526,6 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 extern void tlb_finish_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end);
 
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
 /*
  * Memory barriers to keep this state in sync are graciously provided by
  * the page table locks, outside of which no page table modifications happen.
@@ -569,24 +566,6 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
smp_mb__before_atomic();
atomic_dec(&mm->tlb_flush_pending);
 }
-#else
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
-{
-   return false;
-}
-
-static inline void init_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-
-static inline void inc_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-
-static inline void dec_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-#endif
 
 struct vm_fault;
 
diff --git a/mm/debug.c b/mm/debug.c
index d70103bb4731..18a9b15b1e37 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -158,9 +158,7 @@ void dump_mm(const struct mm_struct *mm)
 #ifdef CONFIG_NUMA_BALANCING
mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
 #endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
atomic_read(&mm->tlb_flush_pending),
-#endif
mm->def_flags, &mm->def_flags
);
 }
-- 
2.7.4



[PATCH v2 0/4] fix several TLB batch races

2017-07-31 Thread Minchan Kim
Nadav and Mel found several subtle races caused by TLB batching.
This patchset aims to solve those problems by embedding
[inc|dec]_tlb_flush_pending into the TLB batching API.
With that, places that need to know whether a TLB flush is pending
can catch it by using mm_tlb_flush_pending.
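
For illustration, the batching API roughly ends up driving the counter like
this (a simplified sketch, not the literal code in the patches; the
arch_tlb_*_mmu split comes from patch 1/4 and mm_tlb_flush_nested from the
v1 feedback):

void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
		    unsigned long start, unsigned long end)
{
	arch_tlb_gather_mmu(tlb, mm, start, end);
	inc_tlb_flush_pending(tlb->mm);		/* a batched flush is now in flight */
}

void tlb_finish_mmu(struct mmu_gather *tlb,
		    unsigned long start, unsigned long end)
{
	/* flush unconditionally if another thread batches on the same mm */
	bool force = mm_tlb_flush_nested(tlb->mm);

	arch_tlb_finish_mmu(tlb, start, end, force);
	dec_tlb_flush_pending(tlb->mm);
}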

Each patch includes detailed description.

This patchset is based on v4.13-rc2-mmots-2017-07-28-16-10 +
"[PATCH v5 0/3] mm: fixes of tlb_flush_pending races" from Nadav

* from v1
  * separate the core part of the TLB batching API from the arch-specific one - Mel
  * introduce mm_tlb_flush_nested - Mel

Minchan Kim (4):
  mm: refactoring TLB gathering API
  mm: make tlb_flush_pending global
  mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
  mm: fix KSM data corruption

 arch/arm/include/asm/tlb.h  | 11 ++--
 arch/ia64/include/asm/tlb.h |  8 --
 arch/s390/include/asm/tlb.h | 17 -
 arch/sh/include/asm/tlb.h   |  8 +++---
 arch/um/include/asm/tlb.h   | 13 +++---
 fs/proc/task_mmu.c  |  4 ++-
 include/asm-generic/tlb.h   |  7 ++---
 include/linux/mm_types.h| 35 ++---
 mm/debug.c  |  2 --
 mm/ksm.c|  3 ++-
 mm/memory.c | 62 +++--
 11 files changed, 107 insertions(+), 63 deletions(-)

-- 
2.7.4



[PATCH v2 4/4] mm: fix KSM data corruption

2017-07-31 Thread Minchan Kim
Nadav reported that KSM can corrupt user data via a TLB batching race[1].
That means data the user has written can be lost.

Quote from Nadav Amit
"
For this race we need 4 CPUs:

CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
write later.

CPU1: Runs madvise_free on the range that includes the PTE. It would clear
the dirty-bit. It batches TLB flushes.

CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
care about the fact that it clears the PTE write-bit, and of course, batches
TLB flushes.

CPU3: Runs KSM. Our purpose is to pass the following test in
write_protect_page():

if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
(pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

Since it will avoid TLB flush. And we want to do it while the PTE is stale.
Later, and before replacing the page, we would be able to change the page.

Note that all the operations the CPU1-3 perform can happen in parallel since
they only acquire mmap_sem for read.

We start with two identical pages. Everything below regards the same
page/PTE.

CPU0CPU1CPU2CPU3

Write the same
value on page

[cache PTE as
 dirty in TLB]

MADV_FREE
pte_mkclean()

4 > clear_refs
pte_wrprotect()

write_protect_page()
[ success, no flush ]

pages_identical()
[ ok ]

Write to page
different value

[Ok, using stale
 PTE]

replace_page()

Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
already wrote on the page, but KSM ignored this write, and it got lost.
"

In the above scenario, the MADV_FREE side is fixed by changing the TLB
batching API, including [set|clear]_tlb_flush_pending. What remains is the
soft-dirty part.

This patch changes soft-dirty to use the TLB batching API instead of
flush_tlb_mm, and makes KSM check for a pending TLB flush by using
mm_tlb_flush_pending, so that it flushes the TLB to avoid data loss when
other parallel threads have a TLB flush pending.

[1] http://lkml.kernel.org/r/bd3a0ebe-ecf4-41d4-87fa-c755ea9ab...@gmail.com

Note:
I failed to reproduce this problem with Nadav's test program, which needs
its timing tuned to my system's speed, so I could not confirm that it works.
Nadav, could you test this patch on your test machine?

Thanks!

Cc: Nadav Amit 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Andrea Arcangeli 
Signed-off-by: Minchan Kim 
---
 fs/proc/task_mmu.c | 4 +++-
 mm/ksm.c   | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9782dedeead7..58ef3a6abbc0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1018,6 +1018,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
enum clear_refs_types type;
int itype;
int rv;
+   struct mmu_gather tlb;
 
memset(buffer, 0, sizeof(buffer));
if (count > sizeof(buffer) - 1)
@@ -1062,6 +1063,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
}
 
down_read(&mm->mmap_sem);
+   tlb_gather_mmu(&tlb, mm, 0, -1);
if (type == CLEAR_REFS_SOFT_DIRTY) {
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (!(vma->vm_flags & VM_SOFTDIRTY))
@@ -1083,7 +1085,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
if (type == CLEAR_REFS_SOFT_DIRTY)
mmu_notifier_invalidate_range_end(mm, 0, -1);
-   flush_tlb_mm(mm);
+   tlb_finish_mmu(&tlb, 0, -1);
up_read(&mm->mmap_sem);
 out_mm:
mmput(mm);
diff --git a/mm/ksm.c b/mm/ksm.c
index 0c927e36a639..15dd7415f7b3 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1038,7 +1038,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
goto out_unlock;
 
if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
-   (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) {
+   (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
+   mm_tlb_flush_pending(mm)) {
pte_t entry;
 
swapped = PageSwapCache(page);
-- 
2.7.4



Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-31 Thread Minchan Kim
On Mon, Jul 31, 2017 at 09:17:07AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> > rw_page's gain is reducing of dynamic allocation in swap path
> > as well as performance gain thorugh avoiding bio allocation.
> > And it would be important in memory pressure situation.
> 
> There is no need for any dynamic allocation when using the bio
> path.  Take a look at __blkdev_direct_IO_simple for an example
> that doesn't do any allocations.

Do you suggest defining a special flag (e.g., SWP_INMEMORY) in
swap_info_struct for in-memory swap, set manually at swapon time
or automatically from bdi_queue_something?
And, depending on that flag of swap_info_struct, using an on-stack bio
instead of a dynamically allocated one if the swap device is in-memory?
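
For illustration, a minimal sketch of the on-stack bio idea for a single
swap page (the helper name and the synchronous submit are assumptions made
for brevity, not code from this thread; field and helper names follow the
~4.13-era block API):

/* hypothetical helper: read or write one swap page without bio_alloc() */
static int swap_page_sync_io(struct block_device *bdev, sector_t sector,
			     struct page *page, bool is_write)
{
	struct bio bio;
	struct bio_vec bvec;

	bio_init(&bio, &bvec, 1);		/* lives on the stack, no mempool */
	bio.bi_bdev = bdev;			/* newer kernels: bio_set_dev() */
	bio.bi_iter.bi_sector = sector;
	bio_add_page(&bio, page, PAGE_SIZE, 0);
	bio_set_op_attrs(&bio, is_write ? REQ_OP_WRITE : REQ_OP_READ, 0);

	/* synchronous wait only to keep the example small */
	return submit_bio_wait(&bio);
}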


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-30 Thread Minchan Kim
On Mon, Jul 31, 2017 at 07:16:59AM +0900, Minchan Kim wrote:
> Hi Andrew,
> 
> On Fri, Jul 28, 2017 at 02:21:23PM -0700, Andrew Morton wrote:
> > On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox  
> > wrote:
> > 
> > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > whether the rw_page() interface made sense for synchronous memory 
> > > > drivers
> > > > [1][2].  It's unclear whether this interface has any performance benefit
> > > > for these drivers, but as we continue to fix bugs it is clear that it 
> > > > does
> > > > have a maintenance burden.  This series removes the rw_page()
> > > > implementations in brd, pmem and btt to relieve this burden.
> > > 
> > > Why don't you measure whether it has performance benefits?  I don't
> > > understand why zram would see performance benefits and not other drivers.
> > > If it's going to be removed, then the whole interface should be removed,
> > > not just have the implementations removed from some drivers.
> > 
> > Yes please.  Minchan, could you please take a look sometime?
> 
> rw_page's gain is reducing of dynamic allocation in swap path
> as well as performance gain thorugh avoiding bio allocation.
> And it would be important in memory pressure situation.
> 
> I guess it comes from bio_alloc mempool. Usually, zram-swap works
> in high memory pressure so mempool would be exahusted easily.
> It means that mempool wait and repeated alloc would consume the
> overhead.
> 
> Actually, at that time although Karam reported the gain is 2.4%,
> I got a report from production team that the gain in corner case
> (e.g., animation playing is smooth) would be much higher than
> expected.

One idea is to create a bioset only for swap, not shared with the FS side,
so that bio allocation for swap doesn't have to wait for the FS side (which
does slow NAND IO) to return bios to the mempool.
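
A rough sketch of that idea (pool size, flag and function names are
illustrative; the exact bioset_create() signature varies between kernel
versions):

static struct bio_set *swap_bio_set;	/* dedicated pool, not shared with FS */

static int __init swap_bioset_init(void)
{
	swap_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
	return swap_bio_set ? 0 : -ENOMEM;
}

static struct bio *swap_alloc_bio(gfp_t gfp)
{
	/* swap no longer waits for bios held up by slow filesystem IO */
	return bio_alloc_bioset(gfp, 1, swap_bio_set);
}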


Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

2017-07-30 Thread Minchan Kim
Hi Andrew,

On Fri, Jul 28, 2017 at 02:21:23PM -0700, Andrew Morton wrote:
> On Fri, 28 Jul 2017 10:31:43 -0700 Matthew Wilcox  wrote:
> 
> > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > whether the rw_page() interface made sense for synchronous memory drivers
> > > [1][2].  It's unclear whether this interface has any performance benefit
> > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > have a maintenance burden.  This series removes the rw_page()
> > > implementations in brd, pmem and btt to relieve this burden.
> > 
> > Why don't you measure whether it has performance benefits?  I don't
> > understand why zram would see performance benefits and not other drivers.
> > If it's going to be removed, then the whole interface should be removed,
> > not just have the implementations removed from some drivers.
> 
> Yes please.  Minchan, could you please take a look sometime?

rw_page's gain is reducing dynamic allocation in the swap path
as well as a performance gain through avoiding bio allocation.
And that matters under memory pressure.

I guess it comes from the bio_alloc mempool. Usually, zram-swap works
under high memory pressure, so the mempool can be exhausted easily.
That means waiting on the mempool and repeated allocation add overhead.

Actually, at that time, although Karam reported the gain is 2.4%,
I got a report from the production team that the gain in corner cases
(e.g., keeping animation playback smooth) would be much higher than
expected.


Re: [PATCH 2/3] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

2017-07-28 Thread Minchan Kim
On Fri, Jul 28, 2017 at 09:46:34AM +0100, Mel Gorman wrote:
> On Fri, Jul 28, 2017 at 03:41:51PM +0900, Minchan Kim wrote:
> > Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
> > problem and Mel fixed it[1] and found same problem on MADV_FREE[2].
> > 
> > Quote from Mel Gorman
> > 
> > "The race in question is CPU 0 running madv_free and updating some PTEs
> > while CPU 1 is also running madv_free and looking at the same PTEs.
> > CPU 1 may have writable TLB entries for a page but fail the pte_dirty
> > check (because CPU 0 has updated it already) and potentially fail to flush.
> > Hence, when madv_free on CPU 1 returns, there are still potentially writable
> > TLB entries and the underlying PTE is still present so that a subsequent 
> > write
> > does not necessarily propagate the dirty bit to the underlying PTE any more.
> > Reclaim at some unknown time at the future may then see that the PTE is 
> > still
> > clean and discard the page even though a write has happened in the meantime.
> > I think this is possible but I could have missed some protection in 
> > madv_free
> > that prevents it happening."
> > 
> > This patch aims for solving both problems all at once and is ready for
> > other problem with KSM, MADV_FREE and soft-dirty story[3].
> > 
> > TLB batch API(tlb_[gather|finish]_mmu] uses [set|clear]_tlb_flush_pending
> > and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can 
> > catch
> > there are parallel threads going on. In that case, flush TLB to prevent
> > for user to access memory via stale TLB entry although it fail to gather
> > pte entry.
> > 
> > I confiremd this patch works with [4] test program Nadav gave so this patch
> > supersedes "mm: Always flush VMA ranges affected by zap_page_range v2"
> > in current mmotm.
> > 
> > NOTE:
> > This patch modifies arch-specific TLB gathering interface(x86, ia64,
> > s390, sh, um). It seems most of architecture are straightforward but s390
> > need to be careful because tlb_flush_mmu works only if mm->context.flush_mm
> > is set to non-zero which happens only a pte entry really is cleared by
> > ptep_get_and_clear and friends. However, this problem never changes the
> > pte entries but need to flush to prevent memory access from stale tlb.
> > 
> > Any thoughts?
> > 
> 
> The cc list is somewhat . extensive, given the topic. Trim it if
> there is another version.

Most of them are maintainers and mailing lists for each architecture
I am changing. I'm not sure what I can trim. As you said it's rather
extensive, so I will trim the mailing lists for each arch but keep the
maintainers and linux-arch.

> 
> > index 3f2eb76243e3..8c26961f0503 100644
> > --- a/arch/arm/include/asm/tlb.h
> > +++ b/arch/arm/include/asm/tlb.h
> > @@ -163,13 +163,26 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct 
> > mm_struct *mm, unsigned long start
> >  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
> > tlb->batch = NULL;
> >  #endif
> > +   set_tlb_flush_pending(tlb->mm);
> >  }
> >  
> >  static inline void
> >  tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long 
> > end)
> >  {
> > -   tlb_flush_mmu(tlb);
> > +   /*
> > +* If there are parallel threads are doing PTE changes on same range
> > +* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> > +* flush by batching, a thread has stable TLB entry can fail to flush
> > +* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> > +* if we detect parallel PTE batching threads.
> > +*/
> > +   if (mm_tlb_flush_pending(tlb->mm, false) > 1) {
> > +   tlb->range_start = start;
> > +   tlb->range_end = end;
> > +   }
> >  
> > +   tlb_flush_mmu(tlb);
> > +   clear_tlb_flush_pending(tlb->mm);
> > /* keep the page table cache within bounds */
> > check_pgt_cache();
> >  
> 
> mm_tlb_flush_pending shouldn't be taking a barrier specific arg. I expect
> this to change in the future and cause a conflict. At least I think in
> this context, it's the conditional barrier stuff.
> 

Yub. I saw your comment to Nadav, so I expect you want mm_tlb_flush_pending
to be called under the pte lock. However, I will use it outside the pte lock,
in tlb_finish_mmu; in that case, the atomic op plus a barrier to prevent
compiler reordering between the TLB flush and the atomic_read
in mm_tlb_flush_pending are enough for it to work.
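
A sketch of the ordering being argued for (the decrement side matches the
clear/dec helpers shown in the patches; the reader side is simplified):

/* flusher side: make the TLB flush visible before dropping the count */
static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
	smp_mb__before_atomic();	/* order the flush before the decrement */
	atomic_dec(&mm->tlb_flush_pending);
}

/*
 * reader side, outside the pte lock: atomic_read() is a READ_ONCE(), so the
 * compiler cannot cache or reorder the load at compile time
 */
static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
{
	return atomic_read(&mm->tlb_flush_pending) > 0;
}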

> That aside, it's very unfortunate that the return value of
>

[PATCH 0/3] fix several TLB batch races

2017-07-27 Thread Minchan Kim
Nadav and Mel found several subtle races caused by TLB batching.
This patchset aims to solve those problems by embedding
[set|clear]_tlb_flush_pending into the TLB batching API.
With that, places that need to know whether a TLB flush is pending
can catch it by using mm_tlb_flush_pending.

Each patch includes detailed description.

This patchset is based on v4.13-rc2-mmots-2017-07-26-16-16 +
revert: "mm: prevent racy access to tlb_flush_pending" +
adding: "[PATCH v3 0/2] mm: fixes of tlb_flush_pending races".

Minchan Kim (3):
  mm: make tlb_flush_pending global
  mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
  mm: fix KSM data corruption

 arch/arm/include/asm/tlb.h  | 15 ++-
 arch/ia64/include/asm/tlb.h | 12 
 arch/s390/include/asm/tlb.h | 15 +++
 arch/sh/include/asm/tlb.h   |  4 +++-
 arch/um/include/asm/tlb.h   |  8 
 fs/proc/task_mmu.c  |  4 +++-
 include/linux/mm_types.h| 22 +-
 kernel/fork.c   |  2 --
 mm/debug.c  |  2 --
 mm/ksm.c|  3 ++-
 mm/memory.c | 24 
 11 files changed, 74 insertions(+), 37 deletions(-)

-- 
2.7.4



[PATCH 3/3] mm: fix KSM data corruption

2017-07-27 Thread Minchan Kim
Nadav reported that KSM can corrupt user data via a TLB batching race[1].
That means data the user has written can be lost.

Quote from Nadav Amit
"
For this race we need 4 CPUs:

CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
write later.

CPU1: Runs madvise_free on the range that includes the PTE. It would clear
the dirty-bit. It batches TLB flushes.

CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
care about the fact that it clears the PTE write-bit, and of course, batches
TLB flushes.

CPU3: Runs KSM. Our purpose is to pass the following test in
write_protect_page():

if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
(pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

Since it will avoid TLB flush. And we want to do it while the PTE is stale.
Later, and before replacing the page, we would be able to change the page.

Note that all the operations the CPU1-3 perform can happen in parallel since
they only acquire mmap_sem for read.

We start with two identical pages. Everything below regards the same
page/PTE.

CPU0CPU1CPU2CPU3

Write the same
value on page

[cache PTE as
 dirty in TLB]

MADV_FREE
pte_mkclean()

4 > clear_refs
pte_wrprotect()

write_protect_page()
[ success, no flush ]

pages_identical()
[ ok ]

Write to page
different value

[Ok, using stale
 PTE]

replace_page()

Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
already wrote on the page, but KSM ignored this write, and it got lost.
"

In the above scenario, the MADV_FREE side is fixed by changing the TLB
batching API, including [set|clear]_tlb_flush_pending. What remains is the
soft-dirty part.

This patch changes soft-dirty to use the TLB batching API instead of
flush_tlb_mm, and makes KSM check for a pending TLB flush by using
mm_tlb_flush_pending, so that it flushes the TLB to avoid data loss when
other parallel threads have a TLB flush pending.

[1] http://lkml.kernel.org/r/bd3a0ebe-ecf4-41d4-87fa-c755ea9ab...@gmail.com

Note:
I failed to reproduce this problem with Nadav's test program, which needs
its timing tuned to my system's speed, so I could not confirm that it works.
Nadav, could you test this patch on your test machine?

Thanks!

Cc: Nadav Amit 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Andrea Arcangeli 
Signed-off-by: Minchan Kim 
---
 fs/proc/task_mmu.c | 4 +++-
 mm/ksm.c   | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 35be35e05153..583fc50eb36d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1019,6 +1019,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
enum clear_refs_types type;
int itype;
int rv;
+   struct mmu_gather tlb;
 
memset(buffer, 0, sizeof(buffer));
if (count > sizeof(buffer) - 1)
@@ -1063,6 +1064,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
}
 
down_read(&mm->mmap_sem);
+   tlb_gather_mmu(&tlb, mm, 0, -1);
if (type == CLEAR_REFS_SOFT_DIRTY) {
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (!(vma->vm_flags & VM_SOFTDIRTY))
@@ -1084,7 +1086,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
if (type == CLEAR_REFS_SOFT_DIRTY)
mmu_notifier_invalidate_range_end(mm, 0, -1);
-   flush_tlb_mm(mm);
+   tlb_finish_mmu(&tlb, 0, -1);
up_read(&mm->mmap_sem);
 out_mm:
mmput(mm);
diff --git a/mm/ksm.c b/mm/ksm.c
index 4dc92f138786..d3b1c70aac18 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1038,7 +1038,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
goto out_unlock;
 
if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
-   (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) {
+   (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
+   mm_tlb_flush_pending(mm, true)) {
pte_t entry;
 
swapped = PageSwapCache(page);
-- 
2.7.4



[PATCH 2/3] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

2017-07-27 Thread Minchan Kim
Nadav reported that parallel MADV_DONTNEED on the same range has a stale TLB
problem, and Mel fixed it[1] and found the same problem with MADV_FREE[2].

Quote from Mel Gorman

"The race in question is CPU 0 running madv_free and updating some PTEs
while CPU 1 is also running madv_free and looking at the same PTEs.
CPU 1 may have writable TLB entries for a page but fail the pte_dirty
check (because CPU 0 has updated it already) and potentially fail to flush.
Hence, when madv_free on CPU 1 returns, there are still potentially writable
TLB entries and the underlying PTE is still present so that a subsequent write
does not necessarily propagate the dirty bit to the underlying PTE any more.
Reclaim at some unknown time at the future may then see that the PTE is still
clean and discard the page even though a write has happened in the meantime.
I think this is possible but I could have missed some protection in madv_free
that prevents it happening."

This patch aims to solve both problems at once and also prepares for the
other problem with the KSM, MADV_FREE and soft-dirty story[3].

The TLB batch API (tlb_[gather|finish]_mmu) uses [set|clear]_tlb_flush_pending
and mm_tlb_flush_pending so that when tlb_finish_mmu is called, we can catch
that parallel threads are going on. In that case, flush the TLB to prevent
the user from accessing memory via a stale TLB entry even though we failed to
gather the pte entry.

I confirmed this patch works with the test program [4] Nadav gave, so this
patch supersedes "mm: Always flush VMA ranges affected by zap_page_range v2"
in the current mmotm.

NOTE:
This patch modifies the arch-specific TLB gathering interface (x86, ia64,
s390, sh, um). Most architectures seem straightforward, but s390 needs
care because tlb_flush_mmu works only if mm->context.flush_mm is set to
non-zero, which happens only when a pte entry is really cleared by
ptep_get_and_clear and friends. However, in this case we never change the
pte entries ourselves but still need to flush to prevent memory access
through a stale TLB.
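
A hypothetical sketch of the s390 side of that concern (the helper placement
and the pending check are illustrative only, not the literal hunk):

static inline void
tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
{
	/*
	 * tlb_flush_mmu() is a no-op unless this thread really cleared ptes
	 * (mm->context.flush_mm set), so force a flush when another thread
	 * is batching on the same mm.
	 */
	if (mm_tlb_flush_pending(tlb->mm, false) > 1)
		__tlb_flush_mm_lazy(tlb->mm);

	tlb_flush_mmu(tlb);
}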

Any thoughts?

[1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzz...@techsingularity.net
[2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrf...@suse.de
[3] http://lkml.kernel.org/r/bd3a0ebe-ecf4-41d4-87fa-c755ea9ab...@gmail.com
[4] https://patchwork.kernel.org/patch/9861621/

Cc: Ingo Molnar 
Cc: x...@kernel.org
Cc: Russell King 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Tony Luck 
Cc: linux-i...@vger.kernel.org
Cc: Martin Schwidefsky 
Cc: "David S. Miller" 
Cc: Heiko Carstens 
Cc: linux-s...@vger.kernel.org
Cc: Yoshinori Sato 
Cc: linux...@vger.kernel.org
Cc: Jeff Dike 
Cc: user-mode-linux-de...@lists.sourceforge.net
Cc: linux-a...@vger.kernel.org
Cc: Nadav Amit 
Reported-by: Mel Gorman 
Signed-off-by: Minchan Kim 
---
 arch/arm/include/asm/tlb.h  | 15 ++-
 arch/ia64/include/asm/tlb.h | 12 
 arch/s390/include/asm/tlb.h | 15 +++
 arch/sh/include/asm/tlb.h   |  4 +++-
 arch/um/include/asm/tlb.h   |  8 
 include/linux/mm_types.h|  7 +--
 mm/memory.c | 24 
 7 files changed, 69 insertions(+), 16 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 3f2eb76243e3..8c26961f0503 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -163,13 +163,26 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
tlb->batch = NULL;
 #endif
+   set_tlb_flush_pending(tlb->mm);
 }
 
 static inline void
 tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-   tlb_flush_mmu(tlb);
+   /*
+* If there are parallel threads are doing PTE changes on same range
+* under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
+* flush by batching, a thread has stable TLB entry can fail to flush
+* the TLB by observing pte_none|!pte_dirty, for example so flush TLB
+* if we detect parallel PTE batching threads.
+*/
+   if (mm_tlb_flush_pending(tlb->mm, false) > 1) {
+   tlb->range_start = start;
+   tlb->range_end = end;
+   }
 
+   tlb_flush_mmu(tlb);
+   clear_tlb_flush_pending(tlb->mm);
/* keep the page table cache within bounds */
check_pgt_cache();
 
diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index fced197b9626..22fe976a4693 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -178,6 +178,7 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
tlb->start = start;
tlb->end = end;
tlb->start_addr = ~0UL;
+   set_tlb_flush_pending(tlb->mm);
 }
 
 /*
@@ -188,10 +189,21 @@ static inline void
 tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
/*
+* If there are parallel threads are doing

[PATCH 1/3] mm: make tlb_flush_pending global

2017-07-27 Thread Minchan Kim
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
COMPACTION], but upcoming patches to solve a subtle TLB flush batching
problem will use it regardless of compaction/numa, so this patch
removes the dependency.

Cc: Nadav Amit 
Cc: Mel Gorman 
Signed-off-by: Minchan Kim 
---
 include/linux/mm_types.h | 15 ---
 kernel/fork.c|  2 --
 mm/debug.c   |  2 --
 3 files changed, 19 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4b9a625c370c..6953d2c706fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -487,14 +487,12 @@ struct mm_struct {
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
 #endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
/*
 * An operation with batched TLB flushing is going on. Anything that
 * can move process memory needs to flush the TLB when moving a
 * PROT_NONE or PROT_NUMA mapped page.
 */
atomic_t tlb_flush_pending;
-#endif
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
/* See flush_tlb_batched_pending() */
bool tlb_flush_batched;
@@ -522,7 +520,6 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return mm->cpu_vm_mask_var;
 }
 
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
 /*
  * Memory barriers to keep this state in sync are graciously provided by
  * the page table locks, outside of which no page table modifications happen.
@@ -565,18 +562,6 @@ static inline void clear_tlb_flush_pending(struct mm_struct *mm)
smp_mb__before_atomic();
atomic_dec(&mm->tlb_flush_pending);
 }
-#else
-static inline bool mm_tlb_flush_pending(struct mm_struct *mm, bool pt_locked)
-{
-   return false;
-}
-static inline void set_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-static inline void clear_tlb_flush_pending(struct mm_struct *mm)
-{
-}
-#endif
 
 struct vm_fault;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index aaf4d70afd8b..7e9f42060976 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -807,9 +807,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm_init_aio(mm);
mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
atomic_set(&mm->tlb_flush_pending, 0);
-#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
 #endif
diff --git a/mm/debug.c b/mm/debug.c
index d70103bb4731..18a9b15b1e37 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -158,9 +158,7 @@ void dump_mm(const struct mm_struct *mm)
 #ifdef CONFIG_NUMA_BALANCING
mm->numa_next_scan, mm->numa_scan_offset, mm->numa_scan_seq,
 #endif
-#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
atomic_read(&mm->tlb_flush_pending),
-#endif
mm->def_flags, &mm->def_flags
);
 }
-- 
2.7.4



Re: [PATCH] zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse

2017-07-25 Thread Minchan Kim
On Mon, Jul 24, 2017 at 05:45:35PM +0800, Hui Zhu wrote:
> The first version is in [1].
> 
> Got -EBUSY from zs_page_migrate will make migration
> slow (retry) or fail (zs_page_putback will schedule_work free_work,
> but it cannot ensure the success).
> 
> I noticed this issue because my Kernel patched [2]
> that will remove retry in __alloc_contig_migrate_range.
> This retry willhandle the -EBUSY because it will re-isolate the page
> and re-call migrate_pages.
> Without it will make cma_alloc fail at once with -EBUSY.
> 
> According to the review from Minchan Kim in [3], I update the patch
> to skip unnecessary loops but not return -EBUSY if zspage is not inuse.
> 
> Following is what I got with highalloc-performance in a vbox with 2
> cpu 1G memory 512 zram as swap.  And the swappiness is set to 100.
>ori  ne
>   orig new
> Minor Faults  5080511350830235
> Major Faults 43918   56530
> Swap Ins 42087   55680
> Swap Outs89718  104700
> Allocation stalls0   0
> DMA allocs   57787   52364
> DMA32 allocs  4796459948043563
> Normal allocs0   0
> Movable allocs   0   0
> Direct pages scanned 45493   23167
> Kswapd pages scanned   1565222 1725078
> Kswapd pages reclaimed 134 1503037
> Direct pages reclaimed   45615   25186
> Kswapd efficiency  85% 87%
> Kswapd velocity   1897.1011949.042
> Direct efficiency 100%108%
> Direct velocity 55.139  26.175
> Percentage direct scans 2%  1%
> Zone normal velocity  1952.2401975.217
> Zone dma32 velocity  0.000   0.000
> Zone dma velocity0.000   0.000
> Page writes by reclaim   89764.000  105233.000
> Page writes file46 533
> Page writes anon 89718  104700
> Page reclaim immediate   214573699
> Sector Reads   3259688 3441368
> Sector Writes  3667252 3754836
> Page rescued immediate   0   0
> Slabs scanned  1042872 1160855
> Direct inode steals   8042   10089
> Kswapd inode steals  54295   29170
> Kswapd skipped wait  0   0
> THP fault alloc175 154
> THP collapse alloc 226 289
> THP splits   0   0
> THP fault fallback  11  14
> THP collapse fail3   2
> Compaction stalls  536 646
> Compaction success 322 358
> Compaction failures214 288
> Page migrate success119608  111063
> Page migrate failure  27232593
> Compaction pages isolated   250179  232652
> Compaction migrate scanned 9131832 9942306
> Compaction free scanned2093272 2613998
> Compaction cost192 189
> NUMA alloc hit4712455547193990
> NUMA alloc miss  0   0
> NUMA interleave hit  0   0
> NUMA alloc local  4712455547193990
> NUMA base PTE updates0   0
> NUMA huge PMD updates0   0
> NUMA page range updates  0   0
> NUMA hint faults 0   0
> NUMA hint local faults   0   0
> NUMA hint local percent100 100
> NUMA pages migrated  0   0
> AutoNUMA cost   0%  0%
> 
> [1]: https://lkml.org/lkml/2017/7/14/93
> [2]: https://lkml.org/lkml/2014/5/28/113
> [3]: https://lkml.org/lkml/2017/7/21/10
> 
> Signed-off-by: Hui Zhu 
> ---
>  mm/zsmalloc.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index d41edd2..c2c7ba9 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1997,8 +1997,11 @@ int zs_page_migrate(struct address_space *mapping, 
> struct page *newpage,
>  
>   spin_lock(&class->lock);
>   if (!get_zspage_inuse(zspage)) {
> - ret = -EBUSY;
> - goto unlock_class;
> + /*
> +  * Set "offset" to end of the pag

Re: [zram] ltp inspired explosion - master v4.13-rc1-3-g87b2c3fc6317

2017-07-24 Thread Minchan Kim
Hi,

On Mon, Jul 24, 2017 at 08:17:01PM +0200, Mike Galbraith wrote:
> Now bisected and verified via revert, the culprit is:
> 
> cf8e0fedf078 mm/zsmalloc: simplify zs_max_alloc_size handling
> 
> Reproducer: ltp::testcases/bin/zram03.
> 

Thanks for the report and bisecting.
I believe this patch should fix it.

Thanks!

From 0ffbd3c8769fdf56e2f14908f890f9d1703ed32e Mon Sep 17 00:00:00 2001
From: Minchan Kim 
Date: Tue, 25 Jul 2017 15:15:18 +0900
Subject: [PATCH] zram: do not free pool->size_class

Mike reported the kernel oopses with the ltp:zram03 testcase.

[ 1449.835161] zram: Added device: zram0
[ 1449.929981] zram0: detected capacity change from 0 to 107374182400
[ 1449.968583] BUG: unable to handle kernel paging request at 306d61727a77
[ 1449.975550] IP: zs_map_object+0xb9/0x260
[ 1449.979472] PGD 0
[ 1449.979473] P4D 0
[ 1449.981488]
[ 1449.984997] Oops:  [#1] SMP
[ 1449.988139] Dumping ftrace buffer:
[ 1449.991545](ftrace buffer empty)
[ 1449.995120] Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) 
raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) 
ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) 
br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) 
cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) 
crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) 
ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) 
ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) 
shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) 
cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) 
ipmi_msghandler(E)
[ 1450.065731]  nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) 
ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) 
sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) 
libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) 
scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) 
autofs4(E) [last unloaded: zram]
[ 1450.107900] CPU: 6 PID: 12356 Comm: swapon Tainted: GE   
4.13.0.g87b2c3f-default #194
[ 1450.116760] Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698 , 
BIOS -[D6E150AUS-1.10]- 12/15/2010
[ 1450.126486] task: 880158d2c4c0 task.stack: c9000168
[ 1450.132401] RIP: 0010:zs_map_object+0xb9/0x260
[ 1450.136843] RSP: 0018:c90001683988 EFLAGS: 00010202
[ 1450.142063] RAX:  RBX: 8801547a98d0 RCX: 8801211b78b0
[ 1450.149190] RDX: 306d61727a2f RSI: 0016 RDI: 8801547a98f8
[ 1450.156317] RBP: c900016839c8 R08: 04db4200 R09: 0008
[ 1450.163446] R10: 880151329260 R11:  R12: 880158f76000
[ 1450.170573] R13: 0001 R14:  R15: ea0004db4200
[ 1450.177700] FS:  7fe1b4e8b880() GS:88017f18() 
knlGS:
[ 1450.185782] CS:  0010 DS:  ES:  CR0: 80050033
[ 1450.191522] CR2: 306d61727a77 CR3: 000154415000 CR4: 06e0
[ 1450.198649] Call Trace:
[ 1450.201103]  zram_bvec_rw.isra.26+0xe8/0x780 [zram]
[ 1450.205978]  zram_rw_page+0x6e/0xa0 [zram]
[ 1450.210077]  bdev_read_page+0x81/0xb0
[ 1450.213738]  do_mpage_readpage+0x51a/0x710
[ 1450.217837]  ? lru_cache_add+0xe/0x10
[ 1450.221498]  mpage_readpages+0x122/0x1a0
[ 1450.225420]  ? I_BDEV+0x20/0x20
[ 1450.228560]  ? I_BDEV+0x20/0x20
[ 1450.231702]  ? alloc_pages_current+0x6a/0xb0
[ 1450.235971]  blkdev_readpages+0x1d/0x20
[ 1450.239805]  __do_page_cache_readahead+0x1b2/0x270
[ 1450.244596]  ondemand_readahead+0x180/0x2c0
[ 1450.248777]  page_cache_sync_readahead+0x31/0x50
[ 1450.253394]  generic_file_read_iter+0x7e7/0xaf0
[ 1450.257922]  blkdev_read_iter+0x37/0x40
[ 1450.261756]  __vfs_read+0xce/0x140
[ 1450.265160]  vfs_read+0x9e/0x150
[ 1450.268389]  SyS_read+0x46/0xa0
[ 1450.271533]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[ 1450.276149] RIP: 0033:0x7fe1b4344270
[ 1450.279724] RSP: 002b:7ffdb4299f48 EFLAGS: 0246 ORIG_RAX: 

[ 1450.287287] RAX: ffda RBX: 7fe1b4604678 RCX: 7fe1b4344270
[ 1450.294414] RDX: 0001 RSI: 00db2c00 RDI: 0006
[ 1450.301541] RBP: 7fe1b4604620 R08: 0003 R09: 7fe1b4604678
[ 1450.308667] R10:  R11: 0246 R12: 00010030
[ 1450.315794] R13: 0001 R14: 2710 R15: 00010011
[ 1450.322920] Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f 
b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 
48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70
[ 1450.341785] RIP

Re: [PATCH RFC] mm: allow isolation for pages not inserted into lru lists yet

2017-07-23 Thread Minchan Kim
Hi,

On Tue, Jul 18, 2017 at 07:00:23PM +0300, Konstantin Khlebnikov wrote:
> Pages are added into lru lists via per-cpu page vectors in order
> to combine these insertions and reduce lru lock contention.
> 
> These pending pages cannot be isolated and moved into another lru.
> This breaks in some cases page activation and makes mlock-munlock
> much more complicated.
> 
> Also this breaks newly added swapless MADV_FREE: if it cannot move
> anon page into file lru then page could never be freed lazily.

Yes, it's really unfortunate.

> 
> This patch rearranges lru list handling to allow lru isolation for
> such pages. It set PageLRU earlier and initialize page->lru to mark
> pages still pending for lru insert.

At first glance, it seems to work, but it's rather hacky to me.

Could you make mark_page_lazyfree aware of it?
IOW, mark_page_lazyfree can clear PG_active|referenced|swapbacked under
the lru_lock if the page was not on the LRU yet. With that, the pagevec
handler for the LRU can move pages onto the proper list when the drain
happens.
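
Something along these lines, as a rough sketch of the suggestion (not a
tested patch; lru_lazyfree_queue() is a hypothetical stand-in for the
existing lazyfree pagevec path, and locking is elided):

void mark_page_lazyfree(struct page *page)
{
	if (!PageAnon(page) || !PageSwapBacked(page) ||
	    PageSwapCache(page) || PageUnevictable(page))
		return;

	if (!PageLRU(page)) {
		/*
		 * Still sitting in the per-cpu lru_add pagevec: just clear
		 * the flags so the later drain files it as inactive file.
		 */
		ClearPageActive(page);
		ClearPageReferenced(page);
		ClearPageSwapBacked(page);
		return;
	}

	lru_lazyfree_queue(page);	/* hypothetical: today's pagevec path */
}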

> 
> Signed-off-by: Konstantin Khlebnikov 
> ---
>  include/linux/mm_inline.h |   10 --
>  mm/swap.c |   26 --
>  2 files changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index e030a68ead7e..6618c588ee40 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -60,8 +60,14 @@ static __always_inline void 
> add_page_to_lru_list_tail(struct page *page,
>  static __always_inline void del_page_from_lru_list(struct page *page,
>   struct lruvec *lruvec, enum lru_list lru)
>  {
> - list_del(&page->lru);
> - update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
> + /*
> +  * Empty list head means page is not drained to lru list yet.
> +  */
> + if (likely(!list_empty(&page->lru))) {
> + list_del(&page->lru);
> + update_lru_size(lruvec, lru, page_zonenum(page),
> + -hpage_nr_pages(page));
> + }
>  }
>  
>  /**
> diff --git a/mm/swap.c b/mm/swap.c
> index 23fc6e049cda..ba4c98074a09 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -400,13 +400,35 @@ void mark_page_accessed(struct page *page)
>  }
>  EXPORT_SYMBOL(mark_page_accessed);
>  
> +static void __pagevec_lru_add_drain_fn(struct page *page, struct lruvec 
> *lruvec,
> +void *arg)
> +{
> + /* Check for isolated or already added pages */
> + if (likely(PageLRU(page) && list_empty(&page->lru))) {
> + int file = page_is_file_cache(page);
> + int active = PageActive(page);
> + enum lru_list lru = page_lru(page);
> +
> + add_page_to_lru_list(page, lruvec, lru);
> + update_page_reclaim_stat(lruvec, file, active);
> + trace_mm_lru_insertion(page, lru);
> + }
> +}
> +
>  static void __lru_cache_add(struct page *page)
>  {
>   struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>  
> + /*
> +  * Set PageLRU right here and initialize list head to
> +  * allow page isolation while it on the way to the LRU list.
> +  */
> + VM_BUG_ON_PAGE(PageLRU(page), page);
> + INIT_LIST_HEAD(&page->lru);
>   get_page(page);
> + SetPageLRU(page);
>   if (!pagevec_add(pvec, page) || PageCompound(page))
> - __pagevec_lru_add(pvec);
> + pagevec_lru_move_fn(pvec, __pagevec_lru_add_drain_fn, NULL);
>   put_cpu_var(lru_add_pvec);
>  }
>  
> @@ -611,7 +633,7 @@ void lru_add_drain_cpu(int cpu)
>   struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);
>  
>   if (pagevec_count(pvec))
> - __pagevec_lru_add(pvec);
> + pagevec_lru_move_fn(pvec, __pagevec_lru_add_drain_fn, NULL);
>  
>   pvec = &per_cpu(lru_rotate_pvecs, cpu);
>   if (pagevec_count(pvec)) {
> 


Re: [PATCH] zsmalloc: zs_page_migrate: not check inuse if migrate_mode is not MIGRATE_ASYNC

2017-07-20 Thread Minchan Kim
Hi Hui,

On Thu, Jul 20, 2017 at 05:33:45PM +0800, Hui Zhu wrote:

< snip >

> >> >> +++ b/mm/zsmalloc.c
> >> >> @@ -1982,6 +1982,7 @@ int zs_page_migrate(struct address_space 
> >> >> *mapping, struct page *newpage,
> >> >>   unsigned long old_obj, new_obj;
> >> >>   unsigned int obj_idx;
> >> >>   int ret = -EAGAIN;
> >> >> + int inuse;
> >> >>
> >> >>   VM_BUG_ON_PAGE(!PageMovable(page), page);
> >> >>   VM_BUG_ON_PAGE(!PageIsolated(page), page);
> >> >> @@ -1996,21 +1997,24 @@ int zs_page_migrate(struct address_space 
> >> >> *mapping, struct page *newpage,
> >> >>   offset = get_first_obj_offset(page);
> >> >>
> >> >>   spin_lock(&class->lock);
> >> >> - if (!get_zspage_inuse(zspage)) {
> >> >> + inuse = get_zspage_inuse(zspage);
> >> >> + if (mode == MIGRATE_ASYNC && !inuse) {
> >> >>   ret = -EBUSY;
> >> >>   goto unlock_class;
> >> >>   }
> >> >>
> >> >>   pos = offset;
> >> >>   s_addr = kmap_atomic(page);
> >> >> - while (pos < PAGE_SIZE) {
> >> >> - head = obj_to_head(page, s_addr + pos);
> >> >> - if (head & OBJ_ALLOCATED_TAG) {
> >> >> - handle = head & ~OBJ_ALLOCATED_TAG;
> >> >> - if (!trypin_tag(handle))
> >> >> - goto unpin_objects;
> >> >> + if (inuse) {
> >
> > I don't want to add inuse check for every loop. It might avoid unncessary
> > looping in every loop of zs_page_migrate so it is for optimization, not
> > correction. As I consider it would happen rarely, I think we don't need
> > to add the check. Could you just remove get_zspage_inuse check, instead?
> >
> > like this.
> >
> >
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 013eea76685e..2d3d75fb0f16 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -1980,14 +1980,9 @@ int zs_page_migrate(struct address_space *mapping, 
> > struct page *newpage,
> > pool = mapping->private_data;
> > class = pool->size_class[class_idx];
> > offset = get_first_obj_offset(page);
> > +   pos = offset;
> >
> > spin_lock(&class->lock);
> > -   if (!get_zspage_inuse(zspage)) {
> > -   ret = -EBUSY;
> > -   goto unlock_class;
> > -   }
> > -
> > -   pos = offset;
> > s_addr = kmap_atomic(page);
> > while (pos < PAGE_SIZE) {
> > head = obj_to_head(page, s_addr + pos);
> >
> >
> 
> What about set pos to avoid the loops?
> 
> @@ -1997,8 +1997,10 @@ int zs_page_migrate(struct address_space
> *mapping, struct page *newpage,
> 
> spin_lock(&class->lock);
> if (!get_zspage_inuse(zspage)) {
> -   ret = -EBUSY;
> -   goto unlock_class;
> +   /* The page is empty.
> +  Set "offset" to the end of page.
> +  Then the loops of page will be avoided.  */
> +   offset = PAGE_SIZE;

Good idea. Just a nitpick:

/*
 * set "offset" to end of the page so that every loops
 * skips unnecessary object scanning.
 */

Thanks!
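
Putting the two together, the resulting zs_page_migrate() hunk might look
roughly like this (a sketch of the discussion, not the final patch):

	spin_lock(&class->lock);
	if (!get_zspage_inuse(zspage)) {
		/*
		 * Set "offset" to the end of the page so that every loop
		 * skips unnecessary object scanning.
		 */
		offset = PAGE_SIZE;
	}

	pos = offset;
	s_addr = kmap_atomic(page);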


Re: [PATCH] zsmalloc: zs_page_migrate: not check inuse if migrate_mode is not MIGRATE_ASYNC

2017-07-20 Thread Minchan Kim
Hi Hui,

On Thu, Jul 20, 2017 at 02:39:17PM +0800, Hui Zhu wrote:
> Hi Minchan,
> 
> I am sorry for answer late.
> I spent some time on ubuntu 16.04 with mmtests in an old laptop.
> 
> 2017-07-17 13:39 GMT+08:00 Minchan Kim :
> > Hello Hui,
> >
> > On Fri, Jul 14, 2017 at 03:51:07PM +0800, Hui Zhu wrote:
> >> Got some -EBUSY from zs_page_migrate that will make migration
> >> slow (retry) or fail (zs_page_putback will schedule_work free_work,
> >> but it cannot ensure the success).
> >
> > I think EAGAIN(migration retrial) is better than EBUSY(bailout) because
> > expectation is that zsmalloc will release the empty zs_page soon so
> > at next retrial, it will be succeeded.
> 
> 
> I am not sure.
> 
> This is the call trace of zs_page_migrate:
> zs_page_migrate
> mapping->a_ops->migratepage
> move_to_new_page
> __unmap_and_move
> unmap_and_move
> migrate_pages
> 
> In unmap_and_move will remove page from migration page list
> and call putback_movable_page(will call mapping->a_ops->putback_page) if
> return value of zs_page_migrate is not -EAGAIN.
> The comments of this part:
> After called mapping->a_ops->putback_page, zsmalloc can free the page
> from ZS_EMPTY list.
> 
> If retrun -EAGAIN, the page will be not be put back.  EAGAIN page will
> be try again in migrate_pages without re-isolate.

You're right. With -EAGAIN, it burns CPU pointlessly.
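
For context, the control flow being discussed looks roughly like this
(simplified from migrate_pages() of that era, quoted from memory rather than
verbatim):

	int pass, rc, retry = 1, nr_failed = 0, nr_succeeded = 0;
	struct page *page, *page2;

	for (pass = 0; pass < 10 && retry; pass++) {
		retry = 0;
		list_for_each_entry_safe(page, page2, from, lru) {
			rc = unmap_and_move(get_new_page, put_new_page,
					    private, page, pass > 2, mode,
					    reason);
			switch (rc) {
			case -EAGAIN:
				retry++;	/* stays on the list, retried next pass */
				break;
			case MIGRATEPAGE_SUCCESS:
				nr_succeeded++;
				break;
			default:
				/* e.g. -EBUSY: dropped from the list and
				 * handed back via putback_movable_page() */
				nr_failed++;
				break;
			}
		}
	}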

> 
> > About schedule_work, as you said, we don't make sure when it happens but
> > I believe it will happen in a migration iteration most of case.
> > How often do you see that case?
> 
> I noticed this issue because my Kernel patch 
> https://lkml.org/lkml/2014/5/28/113
> that will remove retry in __alloc_contig_migrate_range.
> This retry willhandle the -EBUSY because it will re-isolate the page
> and re-call migrate_pages.
> Without it will make cma_alloc fail at once with -EBUSY.

The LKML.org server is not responding, so it's hard to see the patch you
mentioned, but I've got your point now, so I don't mind any more. Your patch
is simple enough considering the benefit.
Just look at the comment below.

> 
> >
> >>
> >> And I didn't find anything that make zs_page_migrate cannot work with
> >> a ZS_EMPTY zspage.
> >> So make the patch to not check inuse if migrate_mode is not
> >> MIGRATE_ASYNC.
> >
> > At a first glance, I think it work but the question is that it a same 
> > problem
> > ith schedule_work of zs_page_putback. IOW, Until the work is done, 
> > compaction
> > cannot succeed. Do you have any number before and after?
> >
> 
> 
> Following is what I got with highalloc-performance in a vbox with 2
> cpu 1G memory 512 zram as swap:
>oriafte
>   orig   after
> Minor Faults  5080511350801261
> Major Faults 43918   46692
> Swap Ins 42087   46299
> Swap Outs89718  105495
> Allocation stalls0   0
> DMA allocs   57787   69787
> DMA32 allocs  4796459947983772
> Normal allocs0   0
> Movable allocs   0   0
> Direct pages scanned 45493   28837
> Kswapd pages scanned   1565222 1512947
> Kswapd pages reclaimed 134 1334030
> Direct pages reclaimed   45615   30174
> Kswapd efficiency  85% 88%
> Kswapd velocity   1897.1011708.309
> Direct efficiency 100%104%
> Direct velocity 55.139  32.561
> Percentage direct scans 2%  1%
> Zone normal velocity  1952.2401740.870
> Zone dma32 velocity  0.000   0.000
> Zone dma velocity0.000   0.000
> Page writes by reclaim   89764.000  106043.000
> Page writes file46 548
> Page writes anon 89718  105495
> Page reclaim immediate   214577269
> Sector Reads   3259688 3144160
> Sector Writes  3667252 3675528
> Page rescued immediate   0   0
> Slabs scanned  1042872 1035438
> Direct inode steals   80427772
> Kswapd inode steals  54295   55075
> Kswapd skipped wait  0   0
> THP fault alloc175 200
> THP collapse alloc 226 363
> THP splits   

Re: [PATCH] zsmalloc: zs_page_migrate: not check inuse if migrate_mode is not MIGRATE_ASYNC

2017-07-16 Thread Minchan Kim
Hello Hui,

On Fri, Jul 14, 2017 at 03:51:07PM +0800, Hui Zhu wrote:
> Got some -EBUSY from zs_page_migrate that will make migration
> slow (retry) or fail (zs_page_putback will schedule_work free_work,
> but it cannot ensure the success).

I think EAGAIN (migration retrial) is better than EBUSY (bailout) because
the expectation is that zsmalloc will release the empty zs_page soon, so
the next retrial will succeed.
About schedule_work, as you said, we cannot be sure when it happens, but
I believe it will happen within a migration iteration in most cases.
How often do you see that case?

> 
> And I didn't find anything that make zs_page_migrate cannot work with
> a ZS_EMPTY zspage.
> So make the patch to not check inuse if migrate_mode is not
> MIGRATE_ASYNC.

At first glance, I think it works, but the question is whether it has the same
problem with the schedule_work of zs_page_putback. IOW, until the work is done,
compaction cannot succeed. Do you have any numbers before and after?

Thanks.

> 
> Signed-off-by: Hui Zhu 
> ---
>  mm/zsmalloc.c | 66 
> +--
>  1 file changed, 37 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index d41edd2..c298e5c 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1982,6 +1982,7 @@ int zs_page_migrate(struct address_space *mapping, 
> struct page *newpage,
>   unsigned long old_obj, new_obj;
>   unsigned int obj_idx;
>   int ret = -EAGAIN;
> + int inuse;
>  
>   VM_BUG_ON_PAGE(!PageMovable(page), page);
>   VM_BUG_ON_PAGE(!PageIsolated(page), page);
> @@ -1996,21 +1997,24 @@ int zs_page_migrate(struct address_space *mapping, 
> struct page *newpage,
>   offset = get_first_obj_offset(page);
>  
>   spin_lock(&class->lock);
> - if (!get_zspage_inuse(zspage)) {
> + inuse = get_zspage_inuse(zspage);
> + if (mode == MIGRATE_ASYNC && !inuse) {
>   ret = -EBUSY;
>   goto unlock_class;
>   }
>  
>   pos = offset;
>   s_addr = kmap_atomic(page);
> - while (pos < PAGE_SIZE) {
> - head = obj_to_head(page, s_addr + pos);
> - if (head & OBJ_ALLOCATED_TAG) {
> - handle = head & ~OBJ_ALLOCATED_TAG;
> - if (!trypin_tag(handle))
> - goto unpin_objects;
> + if (inuse) {
> + while (pos < PAGE_SIZE) {
> + head = obj_to_head(page, s_addr + pos);
> + if (head & OBJ_ALLOCATED_TAG) {
> + handle = head & ~OBJ_ALLOCATED_TAG;
> + if (!trypin_tag(handle))
> + goto unpin_objects;
> + }
> + pos += class->size;
>   }
> - pos += class->size;
>   }
>  
>   /*
> @@ -2020,20 +2024,22 @@ int zs_page_migrate(struct address_space *mapping, 
> struct page *newpage,
>   memcpy(d_addr, s_addr, PAGE_SIZE);
>   kunmap_atomic(d_addr);
>  
> - for (addr = s_addr + offset; addr < s_addr + pos;
> - addr += class->size) {
> - head = obj_to_head(page, addr);
> - if (head & OBJ_ALLOCATED_TAG) {
> - handle = head & ~OBJ_ALLOCATED_TAG;
> - if (!testpin_tag(handle))
> - BUG();
> -
> - old_obj = handle_to_obj(handle);
> - obj_to_location(old_obj, &dummy, &obj_idx);
> - new_obj = (unsigned long)location_to_obj(newpage,
> - obj_idx);
> - new_obj |= BIT(HANDLE_PIN_BIT);
> - record_obj(handle, new_obj);
> + if (inuse) {
> + for (addr = s_addr + offset; addr < s_addr + pos;
> + addr += class->size) {
> + head = obj_to_head(page, addr);
> + if (head & OBJ_ALLOCATED_TAG) {
> + handle = head & ~OBJ_ALLOCATED_TAG;
> + if (!testpin_tag(handle))
> + BUG();
> +
> + old_obj = handle_to_obj(handle);
> + obj_to_location(old_obj, &dummy, &obj_idx);
> + new_obj = (unsigned long)
> + location_to_obj(newpage, obj_idx);
> + new_obj |= BIT(HANDLE_PIN_BIT);
> + record_obj(handle, new_obj);
> + }
>   }
>   }
>  
> @@ -2055,14 +2061,16 @@ int zs_page_migrate(struct address_space *mapping, 
> struct page *newpage,
>  
>   ret = MIGRATEPAGE_SUCCESS;
>  unpin_objects:
> - for (addr = s_addr + offset; addr < s_addr + pos;
> + if (inuse) {
> + for (addr = s_addr + offset; addr <

Re: [PATCH] zram: constify attribute_group structures.

2017-07-02 Thread Minchan Kim
Hello,

On Mon, Jul 03, 2017 at 11:43:13AM +0530, Arvind Yadav wrote:
> attribute_groups are not supposed to change at runtime. All functions
> working with attribute_groups provided by  work with const
> attribute_group. So mark the non-const structs as const.

If so, how about changing all the other places that have not used const,
not just zram?

Anyway, I'm okay with this.
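
For reference, the change amounts to something like this (identifier names
taken from zram_drv.c from memory, so treat them as illustrative):

static const struct attribute_group zram_disk_attr_group = {
	.attrs = zram_disk_attrs,	/* the attributes themselves are unchanged */
};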

> 
> File size before:
>    text    data     bss     dec     hex filename
>    8293     841       4    9138    23b2 drivers/block/zram/zram_drv.o
> 
> File size After adding 'const':
>    text    data     bss     dec     hex filename
>    8357     777       4    9138    23b2 drivers/block/zram/zram_drv.o
> 
> Signed-off-by: Arvind Yadav 
Acked-by: Minchan Kim 

Thanks.


Re: [PATCH v2] mm/zsmalloc: simplify zs_max_alloc_size handling

2017-07-02 Thread Minchan Kim
Forgot to add Andrew.

On Mon, Jul 03, 2017 at 11:13:12AM +0900, Minchan Kim wrote:
> On Fri, Jun 30, 2017 at 01:48:59PM +0200, Jerome Marchand wrote:
> > Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
> > ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
> > zsmalloc to allocate an object of the maximal size
> > (ZS_MAX_ALLOC_SIZE). I think however the fix is unneededly
> > complicated.
> > 
> > This patch replaces the dynamic calculation of zs_size_classes at init
> > time by a compile time calculation that uses the DIV_ROUND_UP() macro
> > already used in get_size_class_index().
> > 
> > Signed-off-by: Jerome Marchand 
> Acked-by: Minchan Kim 
> 
> Thanks.
> 


Re: [PATCH v2] mm/zsmalloc: simplify zs_max_alloc_size handling

2017-07-02 Thread Minchan Kim
On Fri, Jun 30, 2017 at 01:48:59PM +0200, Jerome Marchand wrote:
> Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
> ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
> zsmalloc to allocate an object of the maximal size
> (ZS_MAX_ALLOC_SIZE). I think however the fix is unneededly
> complicated.
> 
> This patch replaces the dynamic calculation of zs_size_classes at init
> time by a compile time calculation that uses the DIV_ROUND_UP() macro
> already used in get_size_class_index().
> 
> Signed-off-by: Jerome Marchand 
Acked-by: Minchan Kim 

Thanks.



Re: [PATCH -mm -v2 0/6] mm, swap: VMA based swap readahead

2017-06-29 Thread Minchan Kim
Hi Huang,

Ccing Johannes:

I haven't read this patch yet, but I remember Johannes tried a VMA-based
readahead approach a long time ago, so he might have good comments.

On Fri, Jun 30, 2017 at 09:44:37AM +0800, Huang, Ying wrote:
> The swap readahead is an important mechanism to reduce the swap in
> latency.  Although pure sequential memory access pattern isn't very
> popular for anonymous memory, the space locality is still considered
> valid.
> 
> In the original swap readahead implementation, the consecutive blocks
> in swap device are readahead based on the global space locality
> estimation.  But the consecutive blocks in swap device just reflect
> the order of page reclaiming, don't necessarily reflect the access
> pattern in virtual memory space.  And the different tasks in the
> system may have different access patterns, which makes the global
> space locality estimation incorrect.
> 
> In this patchset, when page fault occurs, the virtual pages near the
> fault address will be readahead instead of the swap slots near the
> fault swap slot in swap device.  This avoid to readahead the unrelated
> swap slots.  At the same time, the swap readahead is changed to work
> on per-VMA from globally.  So that the different access patterns of
> the different VMAs could be distinguished, and the different readahead
> policy could be applied accordingly.  The original core readahead
> detection and scaling algorithm is reused, because it is an effect
> algorithm to detect the space locality.
> 
> In addition to the swap readahead changes, some new sysfs interface is
> added to show the efficiency of the readahead algorithm and some other
> swap statistics.
> 
> This new implementation will incur more small random read, on SSD, the
> improved correctness of estimation and readahead target should beat
> the potential increased overhead, this is also illustrated in the test
> results below.  But on HDD, the overhead may beat the benefit, so the
> original implementation will be used by default.
> 
> The test and result is as follow,
> 
> Common test condition
> =
> 
> Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
> Swap device: NVMe disk
> 
> Micro-benchmark with combined access pattern
> 
> 
> vm-scalability, sequential swap test case, 4 processes to eat 50G
> virtual memory space, repeat the sequential memory writing until 300
> seconds.  The first round writing will trigger swap out, the following
> rounds will trigger sequential swap in and out.
> 
> At the same time, run vm-scalability random swap test case in
> background, 8 processes to eat 30G virtual memory space, repeat the
> random memory write until 300 seconds.  This will trigger random
> swap-in in the background.
> 
> This is a combined workload with sequential and random memory
> accessing at the same time.  The result (for sequential workload) is
> as follow,
> 
>                     Base            Optimized
>                     ----            ---------
> throughput          345413 KB/s     414029 KB/s (+19.9%)
> latency.average     97.14 us        61.06 us (-37.1%)
> latency.50th        2 us            1 us
> latency.60th        2 us            1 us
> latency.70th        98 us           2 us
> latency.80th        160 us          2 us
> latency.90th        260 us          217 us
> latency.95th        346 us          369 us
> latency.99th        1.34 ms         1.09 ms
> ra_hit%             52.69%          99.98%
> 
> The original swap readahead algorithm is confused by the background
> random access workload, so readahead hit rate is lower.  The VMA-base
> readahead algorithm works much better.
> 
> Linpack
> ===
> 
> The test memory size is bigger than RAM to trigger swapping.
> 
>                     Base            Optimized
>                     ----            ---------
> elapsed_time        393.49 s        329.88 s (-16.2%)
> ra_hit%             86.21%          98.82%
> 
> The score of base and optimized kernel hasn't visible changes.  But
> the elapsed time reduced and readahead hit rate improved, so the
> optimized kernel runs better for startup and tear down stages.  And
> the absolute value of readahead hit rate is high, shows that the space
> locality is still valid in some practical workloads.


Re: [PATCH] mm/zsmalloc: simplify zs_max_alloc_size handling

2017-06-29 Thread Minchan Kim
Hi Jerome,

On Wed, Jun 28, 2017 at 10:14:20AM +0200, Jerome Marchand wrote:
> Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
> ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
> zsmalloc to allocate an object of the maximal size
> (ZS_MAX_ALLOC_SIZE). I think however the fix is unneededly
> complicated.
> 
> This patch replaces the dynamic calculation of zs_size_classes at init
> time by a compile time calculation that uses the DIV_ROUND_UP() macro
> already used in get_size_class_index().
> 
> Signed-off-by: Jerome Marchand 
> ---
>  mm/zsmalloc.c | 52 +++-
>  1 file changed, 15 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index d41edd2..134024b 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -116,6 +116,11 @@
>  #define OBJ_INDEX_BITS   (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
>  #define OBJ_INDEX_MASK   ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
>  
> +#define FULLNESS_BITS2
> +#define CLASS_BITS   8
> +#define ISOLATED_BITS3
> +#define MAGIC_VAL_BITS   8
> +
>  #define MAX(a, b) ((a) >= (b) ? (a) : (b))
>  /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
>  #define ZS_MIN_ALLOC_SIZE \
> @@ -137,6 +142,8 @@
>   *  (reason above)
>   */
>  #define ZS_SIZE_CLASS_DELTA  (PAGE_SIZE >> CLASS_BITS)
> +#define ZS_SIZE_CLASSES  DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - 
> ZS_MIN_ALLOC_SIZE, \
> +  ZS_SIZE_CLASS_DELTA)

#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
 ZS_SIZE_CLASS_DELTA) + 1)


I think it should add +1 to cover ZS_MIN_ALLOC_SIZE itself.
Otherwise, it looks good to me.
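
Just to illustrate the off-by-one, here is a stand-alone userspace
sketch (with assumed example values for the constants, not the real
kernel ones): DIV_ROUND_UP() alone only counts the classes above
ZS_MIN_ALLOC_SIZE, while index 0 already serves ZS_MIN_ALLOC_SIZE.

#include <stdio.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

#define ZS_SIZE_CLASS_DELTA	16	/* e.g. PAGE_SIZE(4096) >> CLASS_BITS(8) */
#define ZS_MIN_ALLOC_SIZE	32	/* assumed example value */
#define ZS_MAX_ALLOC_SIZE	4096	/* assumed example value */

/* roughly mirrors get_size_class_index(): index 0 serves ZS_MIN_ALLOC_SIZE */
static int get_size_class_index(int size)
{
	if (size <= ZS_MIN_ALLOC_SIZE)
		return 0;
	return DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE, ZS_SIZE_CLASS_DELTA);
}

int main(void)
{
	int max_index = get_size_class_index(ZS_MAX_ALLOC_SIZE);

	/* indexes run 0..max_index, so max_index + 1 classes are needed */
	printf("max index %d -> %d classes\n", max_index, max_index + 1);
	return 0;
}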

Thanks.



Re: [PATCH v1 0/7] writeback incompressible pages to storage

2017-06-29 Thread Minchan Kim
On Thu, Jun 29, 2017 at 06:17:13PM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (06/29/17 17:47), Minchan Kim wrote:
> [..]
> > > > This patch supports writeback feature of zram so admin can set up
> > > > a block device and with it, zram can save the memory via writing
> > > > out the incompressile pages once it found it's incompressible pages
> > > > (1/4 comp ratio) instead of keeping the page in memory.
> > > 
> > > hm, alternative idea. just an idea. can we try compressing the page
> > > with another algorithm? example: downcast from lz4 to zlib? we can
> > > set up a fallback "worst case" algorithm, so each entry can contain
> > > additional flag that would tell if the src page was compressed with
> > > the fast or slow algorithm. that sounds to me easier than "create a
> > > new block device and bond it to zram, etc". but I may be wrong.
> > 
> > We tried it although it was static not dynamic adatation you suggested.
> 
> could you please explain more? I'm not sure I understand what
> was the configuration (what is static adaptation?).

echo deflate > /sys/block/zramX/comp_algorithm

> 
> > However problem was media-stream data so zlib, lzam added just pointless
> > overhead.
> 
> would that overhead be bigger than a full-blown I/O request to
> another block device (potentially slow, or under load, etc. etc.)?

The problem is not the overhead but the memory saving.
Although we used higher-ratio compression algorithms like zlib and
lzma, the compression ratio was no different from lzo and lz4, so they
only added pointless overhead without any saving.


Re: [PATCH v1 0/7] writeback incompressible pages to storage

2017-06-29 Thread Minchan Kim
Hi Sergey,

On Thu, Jun 29, 2017 at 12:41:57AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (06/26/17 15:52), Minchan Kim wrote:
> [..]
> > zRam is useful for memory saving with compressible pages but sometime,
> > workload can be changed and system has lots of incompressible pages
> > which is very harmful for zram.
> 
> could do. that makes zram quite complicated, to be honest. no offense,
> but the whole zram's "good compression" margin looks to me completely
> random and quite unreasonable. building a complex logic atop of random
> logic is a bit tricky. but I see what problem you are trying to address.
> 
> > This patch supports writeback feature of zram so admin can set up
> > a block device and with it, zram can save the memory via writing
> > out the incompressile pages once it found it's incompressible pages
> > (1/4 comp ratio) instead of keeping the page in memory.
> 
> hm, alternative idea. just an idea. can we try compressing the page
> with another algorithm? example: downcast from lz4 to zlib? we can
> set up a fallback "worst case" algorithm, so each entry can contain
> additional flag that would tell if the src page was compressed with
> the fast or slow algorithm. that sounds to me easier than "create a
> new block device and bond it to zram, etc". but I may be wrong.

We tried it, although it was a static setup rather than the dynamic
adaptation you suggested. However, the problem was media-stream data,
so zlib and lzma just added pointless overhead.

Thanks.


Re: [PATCH] thp, mm: Fix crash due race in MADV_FREE handling

2017-06-29 Thread Minchan Kim
On Wed, Jun 28, 2017 at 01:15:50PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jun 28, 2017 at 01:12:49PM +0300, Kirill A. Shutemov wrote:
> > Reinette reported following crash:
> > 
> >   BUG: Bad page state in process log2exe  pfn:57600
> >   page:ea00015d8000 count:0 mapcount:0 mapping:  (null) 
> > index:0x20200
> >   flags: 0x40040019(locked|uptodate|dirty|swapbacked)
> >   raw: 40040019  00020200 
> >   raw: ea00015d8020 ea00015d8020  
> >   page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> >   bad because of flags: 0x1(locked)
> >   Modules linked in: rfcomm 8021q bnep intel_rapl x86_pkg_temp_thermal 
> > coretemp efivars btusb btrtl btbcm pwm_lpss_pci snd_hda_codec_hdmi btintel 
> > pwm_lpss snd_hda_codec_realtek snd_soc_skl snd_hda_codec_generic 
> > snd_soc_skl_ipc spi_pxa2xx_platform snd_soc_sst_ipc snd_soc_sst_dsp 
> > i2c_designware_platform i2c_designware_core snd_hda_ext_core 
> > snd_soc_sst_match snd_hda_intel snd_hda_codec mei_me snd_hda_core mei 
> > snd_soc_rt286 snd_soc_rl6347a snd_soc_core efivarfs
> >   CPU: 1 PID: 354 Comm: log2exe Not tainted 4.12.0-rc7-test-test #19
> >   Hardware name: Intel corporation NUC6CAYS/NUC6CAYB, BIOS 
> > AYAPLCEL.86A.0027.2016.1108.1529 11/08/2016
> >   Call Trace:
> >dump_stack+0x95/0xeb
> >bad_page+0x16a/0x1f0
> >free_pages_check_bad+0x117/0x190
> >? rcu_read_lock_sched_held+0xa8/0x130
> >free_hot_cold_page+0x7b1/0xad0
> >__put_page+0x70/0xa0
> >madvise_free_huge_pmd+0x627/0x7b0
> >madvise_free_pte_range+0x6f8/0x1150
> >? debug_check_no_locks_freed+0x280/0x280
> >? swapin_walk_pmd_entry+0x380/0x380
> >__walk_page_range+0x6b5/0xe30
> >walk_page_range+0x13b/0x310
> >madvise_free_page_range.isra.16+0xad/0xd0
> >? force_swapin_readahead+0x110/0x110
> >? swapin_walk_pmd_entry+0x380/0x380
> >? lru_add_drain_cpu+0x160/0x320
> >madvise_free_single_vma+0x2e4/0x470
> >? madvise_free_page_range.isra.16+0xd0/0xd0
> >? vmacache_update+0x100/0x130
> >? find_vma+0x35/0x160
> >SyS_madvise+0x8ce/0x1450
> > 
> > If somebody frees the page under us and we hold the last reference to
> > it, put_page() would attempt to free the page before unlocking it.
> > 
> > The fix is trivial reorder of operations.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > Reported-by: Reinette Chatre 
> > Fixes: 9818b8cde622 ("madvise_free, thp: fix madvise_free_huge_pmd return 
> > value after splitting")
> 
> Sorry, the wrong Fixes. The right one:
> 
> Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE 
> syscall is called")
> 

Acked-by: Minchan Kim 

Thanks.


[PATCH v1 7/9] zram: write incompressible pages to backing device

2017-06-25 Thread Minchan Kim
This patch enables write IO to transfer data to the backing device.
For that, it implements a write_to_bdev() function which creates a new
bio and chains it to the parent bio so that the parent bio stays
asynchronous.
For rw_page, which has no parent bio, it submits its own bio and
handles the IO completion in zram_page_end_io().

This patch also defines a new flag, ZRAM_WB, to mark written pages for
later read IO.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 113 +-
 drivers/block/zram/zram_drv.h |   1 +
 2 files changed, 102 insertions(+), 12 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 896867e2..99e46ae 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -445,9 +445,76 @@ static void put_entry_bdev(struct zram *zram, unsigned 
long entry)
WARN_ON_ONCE(!was_set);
 }
 
+void zram_page_end_io(struct bio *bio)
+{
+   struct page *page = bio->bi_io_vec[0].bv_page;
+
+   page_endio(page, op_is_write(bio_op(bio)),
+   blk_status_to_errno(bio->bi_status));
+   bio_put(bio);
+}
+
+static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
+   u32 index, struct bio *parent,
+   unsigned long *pentry)
+{
+   struct bio *bio;
+   unsigned long entry;
+
+   bio = bio_alloc(GFP_ATOMIC, 1);
+   if (!bio)
+   return -ENOMEM;
+
+   entry = get_entry_bdev(zram);
+   if (!entry) {
+   bio_put(bio);
+   return -ENOSPC;
+   }
+
+   bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
+   bio->bi_bdev = zram->bdev;
+   if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len,
+   bvec->bv_offset)) {
+   bio_put(bio);
+   put_entry_bdev(zram, entry);
+   return -EIO;
+   }
+
+   if (!parent) {
+   bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+   bio->bi_end_io = zram_page_end_io;
+   } else {
+   bio->bi_opf = parent->bi_opf;
+   bio_chain(bio, parent);
+   }
+
+   submit_bio(bio);
+   *pentry = entry;
+
+   return 0;
+}
+
+static void zram_wb_clear(struct zram *zram, u32 index)
+{
+   unsigned long entry;
+
+   zram_clear_flag(zram, index, ZRAM_WB);
+   entry = zram_get_element(zram, index);
+   zram_set_element(zram, index, 0);
+   put_entry_bdev(zram, entry);
+}
+
 #else
 static bool zram_wb_enabled(struct zram *zram) { return false; }
 static inline void reset_bdev(struct zram *zram) {};
+static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
+   u32 index, struct bio *parent,
+   unsigned long *pentry)
+
+{
+   return -EIO;
+}
+static void zram_wb_clear(struct zram *zram, u32 index) {}
 #endif
 
 
@@ -672,7 +739,13 @@ static bool zram_meta_alloc(struct zram *zram, u64 
disksize)
  */
 static void zram_free_page(struct zram *zram, size_t index)
 {
-   unsigned long handle = zram_get_handle(zram, index);
+   unsigned long handle;
+
+   if (zram_wb_enabled(zram) && zram_test_flag(zram, index, ZRAM_WB)) {
+   zram_wb_clear(zram, index);
+   atomic64_dec(&zram->stats.pages_stored);
+   return;
+   }
 
/*
 * No memory is allocated for same element filled pages.
@@ -686,6 +759,7 @@ static void zram_free_page(struct zram *zram, size_t index)
return;
}
 
+   handle = zram_get_handle(zram, index);
if (!handle)
return;
 
@@ -770,7 +844,8 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec 
*bvec,
return ret;
 }
 
-static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
+static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
+   u32 index, struct bio *bio)
 {
int ret = 0;
unsigned long alloced_pages;
@@ -781,6 +856,7 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
struct page *page = bvec->bv_page;
unsigned long element = 0;
enum zram_pageflags flags = 0;
+   bool allow_wb = true;
 
mem = kmap_atomic(page);
if (page_same_filled(mem, &element)) {
@@ -805,8 +881,20 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
return ret;
}
 
-   if (unlikely(comp_len > max_zpage_size))
+   if (unlikely(comp_len > max_zpage_size)) {
+   if (zram_wb_enabled(zram) && allow_wb) {
+   zcomp_stream_put(zram->comp);
+   ret = write_to_bdev(zram, bvec, index, bio, &element);
+   if (!ret) 

[PATCH v1 3/9] zram: rename zram_decompress_page with __zram_bvec_read

2017-06-25 Thread Minchan Kim
The zram_decompress_page name is not accurate because the function
doesn't decompress anything if the page was a dedup hit or was stored
without compression.
Use a more abstract name that is consistent with the write-path
function __zram_bvec_write.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8138822..5c92209 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -518,7 +518,7 @@ static void zram_free_page(struct zram *zram, size_t index)
zram_set_obj_size(zram, index, 0);
 }
 
-static int zram_decompress_page(struct zram *zram, struct page *page, u32 
index)
+static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index)
 {
int ret;
unsigned long handle;
@@ -570,7 +570,7 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec 
*bvec,
return -ENOMEM;
}
 
-   ret = zram_decompress_page(zram, page, index);
+   ret = __zram_bvec_read(zram, page, index);
if (unlikely(ret))
goto out;
 
@@ -717,7 +717,7 @@ static int zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec,
if (!page)
return -ENOMEM;
 
-   ret = zram_decompress_page(zram, page, index);
+   ret = __zram_bvec_read(zram, page, index);
if (ret)
goto out;
 
-- 
2.7.4



[PATCH v1 2/9] zram: inlining zram_compress

2017-06-25 Thread Minchan Kim
zram_compress does several things: compression, entry allocation and
limit checking. I split it out just for readability, but it hurts
modularization. :(
So this patch removes the zram_compress function and inlines it into
__zram_bvec_write for the upcoming patches.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 64 +++
 1 file changed, 22 insertions(+), 42 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5d3ea405..8138822 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -589,25 +589,38 @@ static int zram_bvec_read(struct zram *zram, struct 
bio_vec *bvec,
return ret;
 }
 
-static int zram_compress(struct zram *zram, struct zcomp_strm **zstrm,
-   struct page *page,
-   unsigned long *out_handle, unsigned int *out_comp_len)
+static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
 {
int ret;
-   unsigned int comp_len;
-   void *src;
unsigned long alloced_pages;
unsigned long handle = 0;
+   unsigned int comp_len = 0;
+   void *src, *dst, *mem;
+   struct zcomp_strm *zstrm;
+   struct page *page = bvec->bv_page;
+   unsigned long element = 0;
+   enum zram_pageflags flags = 0;
+
+   mem = kmap_atomic(page);
+   if (page_same_filled(mem, &element)) {
+   kunmap_atomic(mem);
+   /* Free memory associated with this sector now. */
+   flags = ZRAM_SAME;
+   atomic64_inc(&zram->stats.same_pages);
+   goto out;
+   }
+   kunmap_atomic(mem);
 
 compress_again:
+   zstrm = zcomp_stream_get(zram->comp);
src = kmap_atomic(page);
-   ret = zcomp_compress(*zstrm, src, &comp_len);
+   ret = zcomp_compress(zstrm, src, &comp_len);
kunmap_atomic(src);
 
if (unlikely(ret)) {
+   zcomp_stream_put(zram->comp);
pr_err("Compression failed! err=%d\n", ret);
-   if (handle)
-   zs_free(zram->mem_pool, handle);
+   zs_free(zram->mem_pool, handle);
return ret;
}
 
@@ -639,7 +652,6 @@ static int zram_compress(struct zram *zram, struct 
zcomp_strm **zstrm,
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
__GFP_MOVABLE);
-   *zstrm = zcomp_stream_get(zram->comp);
if (handle)
goto compress_again;
return -ENOMEM;
@@ -649,43 +661,11 @@ static int zram_compress(struct zram *zram, struct 
zcomp_strm **zstrm,
update_used_max(zram, alloced_pages);
 
if (zram->limit_pages && alloced_pages > zram->limit_pages) {
+   zcomp_stream_put(zram->comp);
zs_free(zram->mem_pool, handle);
return -ENOMEM;
}
 
-   *out_handle = handle;
-   *out_comp_len = comp_len;
-   return 0;
-}
-
-static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
-{
-   int ret;
-   unsigned long handle = 0;
-   unsigned int comp_len = 0;
-   void *src, *dst, *mem;
-   struct zcomp_strm *zstrm;
-   struct page *page = bvec->bv_page;
-   unsigned long element = 0;
-   enum zram_pageflags flags = 0;
-
-   mem = kmap_atomic(page);
-   if (page_same_filled(mem, &element)) {
-   kunmap_atomic(mem);
-   /* Free memory associated with this sector now */
-   atomic64_inc(&zram->stats.same_pages);
-   flags = ZRAM_SAME;
-   goto out;
-   }
-   kunmap_atomic(mem);
-
-   zstrm = zcomp_stream_get(zram->comp);
-   ret = zram_compress(zram, &zstrm, page, &handle, &comp_len);
-   if (ret) {
-   zcomp_stream_put(zram->comp);
-   return ret;
-   }
-
dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
 
src = zstrm->buffer;
-- 
2.7.4



[PATCH v1 4/9] zram: add interface to specify backing device

2017-06-25 Thread Minchan Kim
For the writeback feature, the user should set up a backing device
before zram starts working. This patch adds the interface for that via
/sys/block/zramX/backing_dev.

Currently it supports only a block device, but it could be enhanced to
support a file as well.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 142 ++
 drivers/block/zram/zram_drv.h |   5 ++
 2 files changed, 147 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5c92209..eb20655 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -270,6 +270,141 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static bool zram_wb_enabled(struct zram *zram)
+{
+   return zram->backing_dev;
+}
+
+static void reset_bdev(struct zram *zram)
+{
+   struct block_device *bdev;
+
+   if (!zram_wb_enabled(zram))
+   return;
+
+   bdev = zram->bdev;
+   if (zram->old_block_size)
+   set_blocksize(bdev, zram->old_block_size);
+   blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+   /* hope filp_close flush all of IO */
+   filp_close(zram->backing_dev, NULL);
+   zram->backing_dev = NULL;
+   zram->old_block_size = 0;
+   zram->bdev = NULL;
+}
+
+static ssize_t backing_dev_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   struct file *file = zram->backing_dev;
+   char *p;
+   ssize_t ret;
+
+   down_read(&zram->init_lock);
+   if (!zram_wb_enabled(zram)) {
+   memcpy(buf, "none\n", 5);
+   up_read(&zram->init_lock);
+   return 5;
+   }
+
+   p = file_path(file, buf, PAGE_SIZE - 1);
+   if (IS_ERR(p)) {
+   ret = PTR_ERR(p);
+   goto out;
+   }
+
+   ret = strlen(p);
+   memmove(buf, p, ret);
+   buf[ret++] = '\n';
+out:
+   up_read(&zram->init_lock);
+   return ret;
+}
+
+static ssize_t backing_dev_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   char *file_name;
+   struct file *backing_dev = NULL;
+   struct inode *inode;
+   struct address_space *mapping;
+   unsigned int old_block_size = 0;
+   struct block_device *bdev = NULL;
+   int err;
+   struct zram *zram = dev_to_zram(dev);
+
+   file_name = kmalloc(PATH_MAX, GFP_KERNEL);
+   if (!file_name)
+   return -ENOMEM;
+
+   down_write(&zram->init_lock);
+   if (init_done(zram)) {
+   pr_info("Can't setup backing device for initialized device\n");
+   err = -EBUSY;
+   goto out;
+   }
+
+   strlcpy(file_name, buf, len);
+
+   backing_dev = filp_open(file_name, O_RDWR|O_LARGEFILE, 0);
+   if (IS_ERR(backing_dev)) {
+   err = PTR_ERR(backing_dev);
+   backing_dev = NULL;
+   goto out;
+   }
+
+   mapping = backing_dev->f_mapping;
+   inode = mapping->host;
+
+   /* Support only block device in this moment */
+   if (!S_ISBLK(inode->i_mode)) {
+   err = -ENOTBLK;
+   goto out;
+   }
+
+   bdev = bdgrab(I_BDEV(inode));
+   err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
+   if (err < 0)
+   goto out;
+
+   old_block_size = block_size(bdev);
+   err = set_blocksize(bdev, PAGE_SIZE);
+   if (err)
+   goto out;
+
+   reset_bdev(zram);
+
+   zram->old_block_size = old_block_size;
+   zram->bdev = bdev;
+   zram->backing_dev = backing_dev;
+   up_write(&zram->init_lock);
+
+   pr_info("setup backing device %s\n", file_name);
+   kfree(file_name);
+
+   return len;
+out:
+   if (bdev)
+   blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+
+   if (backing_dev)
+   filp_close(backing_dev, NULL);
+
+   up_write(&zram->init_lock);
+
+   kfree(file_name);
+
+   return err;
+}
+
+#else
+static bool zram_wb_enabled(struct zram *zram) { return false; }
+static inline void reset_bdev(struct zram *zram) {};
+#endif
+
+
 /*
  * We switched to per-cpu streams and this attr is not needed anymore.
  * However, we will keep it around for some time, because:
@@ -952,6 +1087,7 @@ static void zram_reset_device(struct zram *zram)
zram_meta_free(zram, disksize);
memset(&zram->stats, 0, sizeof(zram->stats));
zcomp_destroy(comp);
+   reset_bdev(zram);
 }
 
 static ssize_t disksize_store(struct device *dev,
@@ -1077,6 +1213,9 @@ static DEVICE_ATTR_WO(mem_limit);
 static DEVICE_ATTR_WO(mem_used_max);
 static DEVICE_ATTR_RW(max_comp_stream

[PATCH v1 6/9] zram: identify asynchronous IO's return value

2017-06-25 Thread Minchan Kim
For upcoming asynchronous IO like writeback, zram_rw_page should be
aware of whether the requested IO was completed, was merely submitted
successfully, or failed with an error.

For that goal, zram_bvec_rw now has three kinds of return value.

-errno: an error number
 0: the IO request was done synchronously
 1: the IO request was submitted successfully

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 32 
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e31fef7..896867e2 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -772,7 +772,7 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec 
*bvec,
 
 static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
 {
-   int ret;
+   int ret = 0;
unsigned long alloced_pages;
unsigned long handle = 0;
unsigned int comp_len = 0;
@@ -876,7 +876,7 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
 
/* Update stats */
atomic64_inc(&zram->stats.pages_stored);
-   return 0;
+   return ret;
 }
 
 static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
@@ -958,6 +958,11 @@ static void zram_bio_discard(struct zram *zram, u32 index,
}
 }
 
+/*
+ * Returns errno if it has some problem. Otherwise return 0 or 1.
+ * Returns 0 if IO request was done synchronously
+ * Returns 1 if IO request was successfully submitted.
+ */
 static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
int offset, bool is_write)
 {
@@ -979,7 +984,7 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec 
*bvec, u32 index,
 
generic_end_io_acct(rw_acct, &zram->disk->part0, start_time);
 
-   if (unlikely(ret)) {
+   if (unlikely(ret < 0)) {
if (!is_write)
atomic64_inc(&zram->stats.failed_reads);
else
@@ -1072,7 +1077,7 @@ static void zram_slot_free_notify(struct block_device 
*bdev,
 static int zram_rw_page(struct block_device *bdev, sector_t sector,
   struct page *page, bool is_write)
 {
-   int offset, err = -EIO;
+   int offset, ret;
u32 index;
struct zram *zram;
struct bio_vec bv;
@@ -1081,7 +1086,7 @@ static int zram_rw_page(struct block_device *bdev, 
sector_t sector,
 
if (!valid_io_request(zram, sector, PAGE_SIZE)) {
atomic64_inc(&zram->stats.invalid_io);
-   err = -EINVAL;
+   ret = -EINVAL;
goto out;
}
 
@@ -1092,7 +1097,7 @@ static int zram_rw_page(struct block_device *bdev, 
sector_t sector,
bv.bv_len = PAGE_SIZE;
bv.bv_offset = 0;
 
-   err = zram_bvec_rw(zram, &bv, index, offset, is_write);
+   ret = zram_bvec_rw(zram, &bv, index, offset, is_write);
 out:
/*
 * If I/O fails, just return error(ie, non-zero) without
@@ -1102,9 +1107,20 @@ static int zram_rw_page(struct block_device *bdev, 
sector_t sector,
 * bio->bi_end_io does things to handle the error
 * (e.g., SetPageError, set_page_dirty and extra works).
 */
-   if (err == 0)
+   if (unlikely(ret < 0))
+   return ret;
+
+   switch (ret) {
+   case 0:
page_endio(page, is_write, 0);
-   return err;
+   break;
+   case 1:
+   ret = 0;
+   break;
+   default:
+   WARN_ON(1);
+   }
+   return ret;
 }
 
 static void zram_reset_device(struct zram *zram)
-- 
2.7.4



[PATCH v1 9/9] zram: add config and doc file for writeback feature

2017-06-25 Thread Minchan Kim
This patch adds documentation and a Kconfig option for the writeback
feature.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 
 Documentation/blockdev/zram.txt| 11 +++
 drivers/block/zram/Kconfig | 12 
 3 files changed, 31 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 451b6d8..c1513c7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -90,3 +90,11 @@ Contact: Sergey Senozhatsky 

device's debugging info useful for kernel developers. Its
format is not documented intentionally and may change
anytime without any notice.
+
+What:  /sys/block/zram/backing_dev
+Date:  June 2017
+Contact:   Minchan Kim 
+Description:
+   The backing_dev file is read-write and set up backing
+   device for zram to write incompressible pages.
+   For using, user should enable CONFIG_ZRAM_WRITEBACK.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 4fced8a..257e657 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -168,6 +168,7 @@ max_comp_streams  RW    the number of possible concurrent compress operations
 comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
 debug_stat        RO    this file is used for zram debugging purposes
+backing_dev       RW    set up backend storage for zram to write out
 
 
 User space is advised to use the following files to read the device statistics.
@@ -231,5 +232,15 @@ The stat file represents device's mm statistics. It 
consists of a single
resets the disksize to zero. You must set the disksize again
before reusing the device.
 
+* Optional Feature
+
+= writeback
+
+With incompressible pages, there is no memory saving with zram.
+Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+to backing storage rather than keeping it in memory.
+User should set up backing device via /sys/block/zramX/backing_dev
+before disksize setting.
+
 Nitin Gupta
 ngu...@vflare.org
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index b8ecba6..7cd4a8e 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -13,3 +13,15 @@ config ZRAM
  disks and maybe many more.
 
  See zram.txt for more information.
+
+config ZRAM_WRITEBACK
+   bool "Write back incompressible page to backing device"
+   depends on ZRAM
+   default n
+   help
+With incompressible page, there is no memory saving to keep it
+in memory. Instead, write it out to backing device.
+For this feature, admin should set up backing device via
+/sys/block/zramX/backing_dev.
+
+See zram.txt for more infomration.
-- 
2.7.4



[PATCH v1 0/7] writeback incompressible pages to storage

2017-06-25 Thread Minchan Kim
zram is useful for saving memory with compressible pages, but sometimes
the workload changes and the system ends up with lots of incompressible
pages, which is very harmful for zram.

This patchset adds a writeback feature to zram so an admin can set up a
backing block device; with it, zram saves memory by writing out
incompressible pages (worse than a 1/4 compression ratio) as soon as it
detects them, instead of keeping them in memory.

[1-3] are just clean-ups and [4-8] enable the feature step by step.
[4-8] are split into logical units rather than being strictly
bisectable; I tried to keep each step compiling without breakage, but
the split is mainly meant to make review easier.
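
For reviewers, here is a rough sketch of the resulting write-path
decision, condensed and paraphrased from patches 6-8 (not the literal
hunks; error handling, locking and stats are omitted):

	/* rough, paraphrased shape of the new write-path decision */
	ret = zcomp_compress(zstrm, src, &comp_len);
	if (unlikely(comp_len > max_zpage_size) && zram_wb_enabled(zram)) {
		/* incompressible: write the raw page to the backing device */
		ret = write_to_bdev(zram, bvec, index, bio, &element);
		if (!ret)
			flags = ZRAM_WB;	/* later reads go to the bdev */
		/* on failure, fall back to keeping the page in memory */
	} else {
		/* as before: keep the (possibly compressed) copy in zsmalloc */
		handle = zs_malloc(zram->mem_pool, comp_len,
				GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);
	}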

Minchan Kim (9):
  [1] zram: clean up duplicated codes in __zram_bvec_write
  [2] zram: inlining zram_compress
  [3] zram: rename zram_decompress_page with __zram_bvec_read
  [4] zram: add interface to specify backing device
  [5] zram: add free space management in backing device
  [6] zram: identify asynchronous IO's return value
  [7] zram: write incompressible pages to backing device
  [8] zram: read page from backing device
  [9] zram: add config and doc file for writeback feature

 Documentation/ABI/testing/sysfs-block-zram |   8 +
 Documentation/blockdev/zram.txt|  11 +
 drivers/block/zram/Kconfig |  12 +
 drivers/block/zram/zram_drv.c  | 537 -
 drivers/block/zram/zram_drv.h  |   9 +
 5 files changed, 496 insertions(+), 81 deletions(-)

-- 
2.7.4



[PATCH v1 5/9] zram: add free space management in backing device

2017-06-25 Thread Minchan Kim
With a backing device, zram needs to manage the free space of that
device.
This patch adds very naive bitmap logic to manage the free space.
However, it should be simple enough given how infrequently
incompressible pages show up in zram.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 48 ++-
 drivers/block/zram/zram_drv.h |  3 +++
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index eb20655..e31fef7 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -292,6 +292,9 @@ static void reset_bdev(struct zram *zram)
zram->backing_dev = NULL;
zram->old_block_size = 0;
zram->bdev = NULL;
+
+   kvfree(zram->bitmap);
+   zram->bitmap = NULL;
 }
 
 static ssize_t backing_dev_show(struct device *dev,
@@ -330,7 +333,8 @@ static ssize_t backing_dev_store(struct device *dev,
struct file *backing_dev = NULL;
struct inode *inode;
struct address_space *mapping;
-   unsigned int old_block_size = 0;
+   unsigned int bitmap_sz, old_block_size = 0;
+   unsigned long nr_pages, *bitmap = NULL;
struct block_device *bdev = NULL;
int err;
struct zram *zram = dev_to_zram(dev);
@@ -369,16 +373,27 @@ static ssize_t backing_dev_store(struct device *dev,
if (err < 0)
goto out;
 
+   nr_pages = i_size_read(inode) >> PAGE_SHIFT;
+   bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
+   bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
+   if (!bitmap) {
+   err = -ENOMEM;
+   goto out;
+   }
+
old_block_size = block_size(bdev);
err = set_blocksize(bdev, PAGE_SIZE);
if (err)
goto out;
 
reset_bdev(zram);
+   spin_lock_init(&zram->bitmap_lock);
 
zram->old_block_size = old_block_size;
zram->bdev = bdev;
zram->backing_dev = backing_dev;
+   zram->bitmap = bitmap;
+   zram->nr_pages = nr_pages;
up_write(&zram->init_lock);
 
pr_info("setup backing device %s\n", file_name);
@@ -386,6 +401,9 @@ static ssize_t backing_dev_store(struct device *dev,
 
return len;
 out:
+   if (bitmap)
+   kvfree(bitmap);
+
if (bdev)
blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 
@@ -399,6 +417,34 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
+static unsigned long get_entry_bdev(struct zram *zram)
+{
+   unsigned long entry;
+
+   spin_lock(&zram->bitmap_lock);
+   /* skip 0 bit to confuse zram.handle = 0 */
+   entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
+   if (entry == zram->nr_pages) {
+   spin_unlock(&zram->bitmap_lock);
+   return 0;
+   }
+
+   set_bit(entry, zram->bitmap);
+   spin_unlock(&zram->bitmap_lock);
+
+   return entry;
+}
+
+static void put_entry_bdev(struct zram *zram, unsigned long entry)
+{
+   int was_set;
+
+   spin_lock(&zram->bitmap_lock);
+   was_set = test_and_clear_bit(entry, zram->bitmap);
+   spin_unlock(&zram->bitmap_lock);
+   WARN_ON_ONCE(!was_set);
+}
+
 #else
 static bool zram_wb_enabled(struct zram *zram) { return false; }
 static inline void reset_bdev(struct zram *zram) {};
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 113a411..707aec0 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -119,6 +119,9 @@ struct zram {
struct file *backing_dev;
struct block_device *bdev;
unsigned int old_block_size;
+   unsigned long *bitmap;
+   unsigned long nr_pages;
+   spinlock_t bitmap_lock;
 #endif
 };
 #endif
-- 
2.7.4



[PATCH v1 1/9] zram: clean up duplicated codes in __zram_bvec_write

2017-06-25 Thread Minchan Kim
__zram_bvec_write has some duplicated logic for the zram metadata
handling of same-filled and compressed pages.  This patch aims to clean
it up without any behavior change.

Link: 
http://lkml.kernel.org/r/1496019048-27016-1-git-send-email-minc...@kernel.org
Signed-off-by: Minchan Kim 
Reviewed-by: Sergey Senozhatsky 
---
 drivers/block/zram/zram_drv.c | 55 +--
 1 file changed, 22 insertions(+), 33 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d3e3af2..5d3ea405 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -453,30 +453,6 @@ static bool zram_same_page_read(struct zram *zram, u32 
index,
return false;
 }
 
-static bool zram_same_page_write(struct zram *zram, u32 index,
-   struct page *page)
-{
-   unsigned long element;
-   void *mem = kmap_atomic(page);
-
-   if (page_same_filled(mem, &element)) {
-   kunmap_atomic(mem);
-   /* Free memory associated with this sector now. */
-   zram_slot_lock(zram, index);
-   zram_free_page(zram, index);
-   zram_set_flag(zram, index, ZRAM_SAME);
-   zram_set_element(zram, index, element);
-   zram_slot_unlock(zram, index);
-
-   atomic64_inc(&zram->stats.same_pages);
-   atomic64_inc(&zram->stats.pages_stored);
-   return true;
-   }
-   kunmap_atomic(mem);
-
-   return false;
-}
-
 static void zram_meta_free(struct zram *zram, u64 disksize)
 {
size_t num_pages = disksize >> PAGE_SHIFT;
@@ -685,14 +661,23 @@ static int zram_compress(struct zram *zram, struct 
zcomp_strm **zstrm,
 static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
 {
int ret;
-   unsigned long handle;
-   unsigned int comp_len;
-   void *src, *dst;
+   unsigned long handle = 0;
+   unsigned int comp_len = 0;
+   void *src, *dst, *mem;
struct zcomp_strm *zstrm;
struct page *page = bvec->bv_page;
+   unsigned long element = 0;
+   enum zram_pageflags flags = 0;
 
-   if (zram_same_page_write(zram, index, page))
-   return 0;
+   mem = kmap_atomic(page);
+   if (page_same_filled(mem, &element)) {
+   kunmap_atomic(mem);
+   /* Free memory associated with this sector now */
+   atomic64_inc(&zram->stats.same_pages);
+   flags = ZRAM_SAME;
+   goto out;
+   }
+   kunmap_atomic(mem);
 
zstrm = zcomp_stream_get(zram->comp);
ret = zram_compress(zram, &zstrm, page, &handle, &comp_len);
@@ -712,19 +697,23 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
 
zcomp_stream_put(zram->comp);
zs_unmap_object(zram->mem_pool, handle);
-
+out:
/*
 * Free memory associated with this sector
 * before overwriting unused sectors.
 */
zram_slot_lock(zram, index);
zram_free_page(zram, index);
-   zram_set_handle(zram, index, handle);
-   zram_set_obj_size(zram, index, comp_len);
+   if (flags == ZRAM_SAME) {
+   zram_set_flag(zram, index, ZRAM_SAME);
+   zram_set_element(zram, index, element);
+   } else {
+   zram_set_handle(zram, index, handle);
+   zram_set_obj_size(zram, index, comp_len);
+   }
zram_slot_unlock(zram, index);
 
/* Update stats */
-   atomic64_add(comp_len, &zram->stats.compr_data_size);
atomic64_inc(&zram->stats.pages_stored);
return 0;
 }
-- 
2.7.4



[PATCH v1 8/9] zram: read page from backing device

2017-06-25 Thread Minchan Kim
This patch enables read IO from the backing device. For the feature,
it implements two read functions to transfer data from the backing
storage.

One is an asynchronous IO function and the other is a synchronous one.

The reason I need synchronous IO is the partial-write case, which has
to complete the read IO before overwriting the partial data.

We could make the partial IO case asynchronous too, but at the moment I
don't feel like adding more complexity to support such a rare use case,
so I want to keep it simple.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 123 --
 1 file changed, 118 insertions(+), 5 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 99e46ae..a1e8c73 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -454,6 +454,95 @@ void zram_page_end_io(struct bio *bio)
bio_put(bio);
 }
 
+/*
+ * Returns 0 if the submission is successful.
+ */
+static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *parent)
+{
+   struct bio *bio;
+
+   bio = bio_alloc(GFP_ATOMIC, 1);
+   if (!bio)
+   return -ENOMEM;
+
+   bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
+   bio->bi_bdev = zram->bdev;
+   if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len, bvec->bv_offset)) {
+   bio_put(bio);
+   return -EIO;
+   }
+
+   if (!parent) {
+   bio->bi_opf = REQ_OP_READ;
+   bio->bi_end_io = zram_page_end_io;
+   } else {
+   bio->bi_opf = parent->bi_opf;
+   bio_chain(bio, parent);
+   }
+
+   submit_bio(bio);
+   return 1;
+}
+
+struct zram_work {
+   struct work_struct work;
+   struct zram *zram;
+   unsigned long entry;
+   struct bio *bio;
+};
+
+#if PAGE_SIZE != 4096
+static void zram_sync_read(struct work_struct *work)
+{
+   struct bio_vec bvec;
+   struct zram_work *zw = container_of(work, struct zram_work, work);
+   struct zram *zram = zw->zram;
+   unsigned long entry = zw->entry;
+   struct bio *bio = zw->bio;
+
+   read_from_bdev_async(zram, &bvec, entry, bio);
+}
+
+/*
+ * Block layer want one ->make_request_fn to be active at a time
+ * so if we use chained IO with parent IO in same context,
+ * it's a deadlock. To avoid, it, it uses worker thread context.
+ */
+static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *bio)
+{
+   struct zram_work work;
+
+   work.zram = zram;
+   work.entry = entry;
+   work.bio = bio;
+
+   INIT_WORK_ONSTACK(&work.work, zram_sync_read);
+   queue_work(system_unbound_wq, &work.work);
+   flush_work(&work.work);
+   destroy_work_on_stack(&work.work);
+
+   return 1;
+}
+#else
+static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *bio)
+{
+   WARN_ON(1);
+   return -EIO;
+}
+#endif
+
+static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *parent, bool sync)
+{
+   if (sync)
+   return read_from_bdev_sync(zram, bvec, entry, parent);
+   else
+   return read_from_bdev_async(zram, bvec, entry, parent);
+}
+
 static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
u32 index, struct bio *parent,
unsigned long *pentry)
@@ -514,6 +603,12 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
 {
return -EIO;
 }
+
+static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *parent, bool sync)
+{
+   return -EIO;
+}
 static void zram_wb_clear(struct zram *zram, u32 index) {}
 #endif
 
@@ -773,13 +868,31 @@ static void zram_free_page(struct zram *zram, size_t 
index)
zram_set_obj_size(zram, index, 0);
 }
 
-static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index)
+static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
+   struct bio *bio, bool partial_io)
 {
int ret;
unsigned long handle;
unsigned int size;
void *src, *dst;
 
+   if (zram_wb_enabled(zram)) {
+   zram_slot_lock(zram, index);
+   if (zram_test_flag(zram, index, ZRAM_WB)) {
+   struct bio_vec bvec;
+
+   zram_slot_unlock(zram, index);
+
+   bvec.bv_page = page;
+   bvec.bv_len = PAGE_SIZE;
+   bvec.bv_offset = 0;
+   return read_from_bde

Re: zram hot_add device busy

2017-06-25 Thread Minchan Kim
Hello,

On Sat, Jun 24, 2017 at 11:08:01AM +0100, Sami Kerola wrote:
> Hello,
> 
> While going through if there are new util-linux bugs reported I came a
> cross this https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/1645846
> 
> Simple way to reproduce the issue is:
> d=$(cat /sys/class/zram-control/hot_add) && zramctl --size 256M /dev/zram$d

To narrow down which side the problem comes from, could you test it
without the zramctl command?

IOW,
d=$(cat /sys/class/zram-control/hot_add) && echo $((256<<20)) 
/dev/zram$d

If it still has the problem, please share your test code; it would help
a lot in understanding the fundamental problem. ;-)

> 
> I am not entirely sure, but drivers/block/zram/zram_drv.c function
> zram_add() should block until the device is usable. Looking the code
> that it might be the device_add_disk() from block/genhd.c that should
> do the blocking. But perhaps it's best if I leave such detail to
> people who know the code a bit better.

I might be missing something, but I believe the device is in a usable
state after zram_add() is done. Just in case, please check the return
value after each operation.

if [ $? -ne 0  ];
then
echo "fail to some op"
blah blah
fi

Thanks.

> 
> One thing annoys me. I expected 'zramctl --find --size 256M' to suffer
> from same issue but it does not. I can only reproduce the issue when
> triggering hot_add separately, and as quick as possibly using the
> path. Notice that sometimes it takes second try before the hot_add and
> use triggers the issue. That is almost certainly down to speed the
> system in hand, e.g., quicker the computer less likely to trigger.
> 
> -- 
> Sami Kerola
> http://www.iki.fi/kerolasa/


Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits

2017-06-20 Thread Minchan Kim
On Tue, Jun 20, 2017 at 11:52:08AM +0900, Minchan Kim wrote:
> Hello Kirill,
> 
> On Mon, Jun 19, 2017 at 05:03:23PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> > > Hi Andrea,
> > > 
> > > On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > > > Hello Minchan,
> > > > 
> > > > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > > > @@ -1995,8 +1984,6 @@ static void 
> > > > > > > > __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > > > > > > > if (soft_dirty)
> > > > > > > > entry = pte_mksoft_dirty(entry);
> > > > > > > > }
> > > > > > > > -   if (dirty)
> > > > > > > > -   SetPageDirty(page + i);
> > > > > > > > pte = pte_offset_map(&_pmd, addr);
> > > > [..]
> > > > > 
> > > > > split_huge_page set PG_dirty to all subpages unconditionally?
> > > > > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > > > that piece of code. What I found one is just __split_huge_page_tail
> > > > > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > > > > page is not dirty, tail page will be clean, too.
> > > > > Could you point out what routine set PG_dirty to all subpages 
> > > > > unconditionally?
> > 
> > When I wrote this code, I considered that we may want to track dirty
> > status on per-4k basis for file-backed THPs.
> > 
> > > > On a side note the snippet deleted above was useless, as long as
> > > > there's one left hugepmd to split, the physical page has to be still
> > > > compound and huge and as long as that's the case the tail pages
> > > > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > > > during the physical split).
> > > 
> > > I got it during reviewing this patch. That's why I didn't argue
> > > this patch would break MADV_FREE by deleting routine which propagate
> > > dirty to pte of subpages. However, although it's useless, I prefer
> > > not removing the transfer of dirty bit. Because it would help MADV_FREE
> > > users who want to use smaps to know how many of pages are not freeable
> > > (i.e, dirtied) since MADV_FREE although it is not 100% correct.
> > > 
> > > > 
> > > > In short PG_dirty is only meaningful in the head as long as it's
> > > > compound. The physical split in __split_huge_page_tail transfer the
> > > > head value to the tails like you mentioned, that's all as far as I can
> > > > tell.
> > > 
> > > Thanks for the comment. Then, this patch is to fix MADV_FREE's bug
> > > which has lost dirty bit by transferring dirty bit too early.
> > 
> > Erghh. I've misread splitting code. Yes, it's not unconditional. So we fix
> > actual bug.
> > 
> > But I'm not sure it's subject for -stable. I haven't seen any bug reports
> > that can be attributed to the bug.
> 
> Okay, I'm not against but please rewrite changelog to indicate it fixes
> the problem. One more thing, as I mentioned, I don't want to remove
> pmd dirty bit -> PG_dirty propagate to subpage part because it would be
> helpful for MADV_FREE users.

Oops, I misread the smaps accounting code, so there is no problem with
removing the useless propagation part I added for MADV_FREE.

Thanks.


Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits

2017-06-19 Thread Minchan Kim
Hello Kirill,

On Mon, Jun 19, 2017 at 05:03:23PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 11:53:33PM +0900, Minchan Kim wrote:
> > Hi Andrea,
> > 
> > On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> > > Hello Minchan,
> > > 
> > > On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct 
> > > > > > > vm_area_struct *vma, pmd_t *pmd,
> > > > > > >   if (soft_dirty)
> > > > > > >   entry = pte_mksoft_dirty(entry);
> > > > > > >   }
> > > > > > > - if (dirty)
> > > > > > > - SetPageDirty(page + i);
> > > > > > >   pte = pte_offset_map(&_pmd, addr);
> > > [..]
> > > > 
> > > > split_huge_page set PG_dirty to all subpages unconditionally?
> > > > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > > > that piece of code. What I found one is just __split_huge_page_tail
> > > > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > > > page is not dirty, tail page will be clean, too.
> > > > Could you point out what routine set PG_dirty to all subpages 
> > > > unconditionally?
> 
> When I wrote this code, I considered that we may want to track dirty
> status on per-4k basis for file-backed THPs.
> 
> > > On a side note the snippet deleted above was useless, as long as
> > > there's one left hugepmd to split, the physical page has to be still
> > > compound and huge and as long as that's the case the tail pages
> > > PG_dirty bit is meaningless (even if set, it's going to be clobbered
> > > during the physical split).
> > 
> > I got it during reviewing this patch. That's why I didn't argue
> > this patch would break MADV_FREE by deleting routine which propagate
> > dirty to pte of subpages. However, although it's useless, I prefer
> > not removing the transfer of dirty bit. Because it would help MADV_FREE
> > users who want to use smaps to know how many of pages are not freeable
> > (i.e, dirtied) since MADV_FREE although it is not 100% correct.
> > 
> > > 
> > > In short PG_dirty is only meaningful in the head as long as it's
> > > compound. The physical split in __split_huge_page_tail transfer the
> > > head value to the tails like you mentioned, that's all as far as I can
> > > tell.
> > 
> > Thanks for the comment. Then, this patch is to fix MADV_FREE's bug
> > which has lost dirty bit by transferring dirty bit too early.
> 
> Erghh. I've misread splitting code. Yes, it's not unconditional. So we fix
> actual bug.
> 
> But I'm not sure it's subject for -stable. I haven't seen any bug reports
> that can be attributed to the bug.

Okay, I'm not against it, but please rewrite the changelog to indicate
that it fixes the problem. One more thing: as I mentioned, I don't want
to remove the pmd dirty bit -> PG_dirty propagation to the subpages,
because it would be helpful for MADV_FREE users.

Thanks.

> 
> -- 
>  Kirill A. Shutemov


Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits

2017-06-16 Thread Minchan Kim
Hi Andrea,

On Fri, Jun 16, 2017 at 04:27:20PM +0200, Andrea Arcangeli wrote:
> Hello Minchan,
> 
> On Fri, Jun 16, 2017 at 10:52:09PM +0900, Minchan Kim wrote:
> > > > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct 
> > > > > vm_area_struct *vma, pmd_t *pmd,
> > > > >   if (soft_dirty)
> > > > >   entry = pte_mksoft_dirty(entry);
> > > > >   }
> > > > > - if (dirty)
> > > > > - SetPageDirty(page + i);
> > > > >   pte = pte_offset_map(&_pmd, addr);
> [..]
> > 
> > split_huge_page set PG_dirty to all subpages unconditionally?
> > If it's true, yes, it doesn't break MADV_FREE. However, I didn't spot
> > that piece of code. What I found one is just __split_huge_page_tail
> > which set PG_dirty to subpage if head page is dirty. IOW, if the head
> > page is not dirty, tail page will be clean, too.
> > Could you point out what routine set PG_dirty to all subpages 
> > unconditionally?
> 
> On a side note the snippet deleted above was useless, as long as
> there's one left hugepmd to split, the physical page has to be still
> compound and huge and as long as that's the case the tail pages
> PG_dirty bit is meaningless (even if set, it's going to be clobbered
> during the physical split).

I realized that while reviewing this patch. That's why I didn't argue
that this patch would break MADV_FREE by deleting the routine which
propagates the dirty bit to the ptes of the subpages. However, although
it's useless, I prefer not to remove the transfer of the dirty bit,
because it would help MADV_FREE users who want to use smaps to see how
many pages are no longer freeable (i.e., dirtied) since MADV_FREE, even
though it is not 100% accurate.

> 
> In short PG_dirty is only meaningful in the head as long as it's
> compound. The physical split in __split_huge_page_tail transfer the
> head value to the tails like you mentioned, that's all as far as I can
> tell.

Thanks for the comment. Then this patch fixes a MADV_FREE bug: the
dirty bit could be lost because it was transferred too early.
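
To make it concrete, here is a condensed before/after sketch of the
relevant part of __split_huge_pmd_locked() (simplified, not the literal
hunks):

	/* before: dirty is sampled well before the pmd is invalidated */
	dirty = pmd_dirty(*pmd);		/* too early ...              */
	...					/* HW can still set the dirty */
	pmdp_invalidate(vma, haddr, pmd);	/* bit here, and it is lost   */

	/* after: use the value atomically returned at invalidation time */
	old = pmdp_invalidate(vma, haddr, pmd);
	if (pmd_dirty(old))
		SetPageDirty(page);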

Thanks.


Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits

2017-06-16 Thread Minchan Kim
On Fri, Jun 16, 2017 at 04:19:08PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 16, 2017 at 12:02:50PM +0900, Minchan Kim wrote:
> > Hello,
> > 
> > On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> > > This patch uses modifed pmdp_invalidate(), that return previous value of 
> > > pmd,
> > > to transfer dirty and accessed bits.
> > > 
> > > Signed-off-by: Kirill A. Shutemov 
> > > ---
> > >  fs/proc/task_mmu.c |  8 
> > >  mm/huge_memory.c   | 29 -
> > >  2 files changed, 16 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index f0c8b33d99b1..f2fc1ef5bba2 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct 
> > > vm_area_struct *vma,
> > >  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> > >   unsigned long addr, pmd_t *pmdp)
> > >  {
> > > - pmd_t pmd = *pmdp;
> > > + pmd_t old, pmd = *pmdp;
> > >  
> > >   /* See comment in change_huge_pmd() */
> > > - pmdp_invalidate(vma, addr, pmdp);
> > > - if (pmd_dirty(*pmdp))
> > > + old = pmdp_invalidate(vma, addr, pmdp);
> > > + if (pmd_dirty(old))
> > >   pmd = pmd_mkdirty(pmd);
> > > - if (pmd_young(*pmdp))
> > > + if (pmd_young(old))
> > >   pmd = pmd_mkyoung(pmd);
> > >  
> > >   pmd = pmd_wrprotect(pmd);
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index a84909cf20d3..0433e73531bf 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, 
> > > pmd_t *pmd,
> > >* pmdp_invalidate() is required to make sure we don't miss
> > >* dirty/young flags set by hardware.
> > >*/
> > > - entry = *pmd;
> > > - pmdp_invalidate(vma, addr, pmd);
> > > -
> > > - /*
> > > -  * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> > > -  * corrupt them.
> > > -  */
> > > - if (pmd_dirty(*pmd))
> > > - entry = pmd_mkdirty(entry);
> > > - if (pmd_young(*pmd))
> > > - entry = pmd_mkyoung(entry);
> > > + entry = pmdp_invalidate(vma, addr, pmd);
> > >  
> > >   entry = pmd_modify(entry, newprot);
> > >   if (preserve_write)
> > > @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct 
> > > vm_area_struct *vma, pmd_t *pmd,
> > >   struct mm_struct *mm = vma->vm_mm;
> > >   struct page *page;
> > >   pgtable_t pgtable;
> > > - pmd_t _pmd;
> > > - bool young, write, dirty, soft_dirty;
> > > + pmd_t old, _pmd;
> > > + bool young, write, soft_dirty;
> > >   unsigned long addr;
> > >   int i;
> > >  
> > > @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct 
> > > vm_area_struct *vma, pmd_t *pmd,
> > >   page_ref_add(page, HPAGE_PMD_NR - 1);
> > >   write = pmd_write(*pmd);
> > >   young = pmd_young(*pmd);
> > > - dirty = pmd_dirty(*pmd);
> > >   soft_dirty = pmd_soft_dirty(*pmd);
> > >  
> > >   pmdp_huge_split_prepare(vma, haddr, pmd);
> > > @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct 
> > > vm_area_struct *vma, pmd_t *pmd,
> > >   if (soft_dirty)
> > >   entry = pte_mksoft_dirty(entry);
> > >   }
> > > - if (dirty)
> > > - SetPageDirty(page + i);
> > >   pte = pte_offset_map(&_pmd, addr);
> > >   BUG_ON(!pte_none(*pte));
> > >   set_pte_at(mm, addr, pte, entry);
> > > @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct 
> > > vm_area_struct *vma, pmd_t *pmd,
> > >* and finally we write the non-huge version of the pmd entry with
> > >* pmd_populate.
> > >*/
> > > - pmdp_invalidate(vma, haddr, pmd);
> > > + old = pmdp_invalidate(vma, haddr, pmd);
> > > +
> > > + /*
> > > +  * Transfer dirty bit using value returned by pmd_invalidate() to be
> > > +  * sure we don't race with CPU that can set the bit under us.
> > > +  */
> > > + if (pmd_dirty(old))
> > > + SetPageDirty(page);
> > > +
> > 
> > When I see this, without this patch, MADV_FREE has been broken because
> > it can lose dirty bit by early checking. Right?
> > If so, isn't it a candidate for -stable?
> 
> Actually, I don't see how MADV_FREE supposed to work: vmscan splits THP on
> reclaim and split_huge_page() would set unconditionally, so MADV_FREE
> seems no effect on THP.

Does split_huge_page set PG_dirty on all subpages unconditionally?
If that's true, then yes, it doesn't break MADV_FREE. However, I didn't
spot that piece of code. What I found is just __split_huge_page_tail,
which sets PG_dirty on a tail page only if the head page is dirty. IOW,
if the head page is not dirty, the tail pages will be clean too.
Could you point out which routine sets PG_dirty on all subpages
unconditionally?

Thanks.


Re: [PATCHv2 3/3] mm: Use updated pmdp_invalidate() inteface to track dirty/accessed bits

2017-06-15 Thread Minchan Kim
Hello,

On Thu, Jun 15, 2017 at 05:52:24PM +0300, Kirill A. Shutemov wrote:
> This patch uses modifed pmdp_invalidate(), that return previous value of pmd,
> to transfer dirty and accessed bits.
> 
> Signed-off-by: Kirill A. Shutemov 
> ---
>  fs/proc/task_mmu.c |  8 
>  mm/huge_memory.c   | 29 -
>  2 files changed, 16 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index f0c8b33d99b1..f2fc1ef5bba2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -906,13 +906,13 @@ static inline void clear_soft_dirty(struct 
> vm_area_struct *vma,
>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>   unsigned long addr, pmd_t *pmdp)
>  {
> - pmd_t pmd = *pmdp;
> + pmd_t old, pmd = *pmdp;
>  
>   /* See comment in change_huge_pmd() */
> - pmdp_invalidate(vma, addr, pmdp);
> - if (pmd_dirty(*pmdp))
> + old = pmdp_invalidate(vma, addr, pmdp);
> + if (pmd_dirty(old))
>   pmd = pmd_mkdirty(pmd);
> - if (pmd_young(*pmdp))
> + if (pmd_young(old))
>   pmd = pmd_mkyoung(pmd);
>  
>   pmd = pmd_wrprotect(pmd);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a84909cf20d3..0433e73531bf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1777,17 +1777,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>* pmdp_invalidate() is required to make sure we don't miss
>* dirty/young flags set by hardware.
>*/
> - entry = *pmd;
> - pmdp_invalidate(vma, addr, pmd);
> -
> - /*
> -  * Recover dirty/young flags.  It relies on pmdp_invalidate to not
> -  * corrupt them.
> -  */
> - if (pmd_dirty(*pmd))
> - entry = pmd_mkdirty(entry);
> - if (pmd_young(*pmd))
> - entry = pmd_mkyoung(entry);
> + entry = pmdp_invalidate(vma, addr, pmd);
>  
>   entry = pmd_modify(entry, newprot);
>   if (preserve_write)
> @@ -1927,8 +1917,8 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   struct mm_struct *mm = vma->vm_mm;
>   struct page *page;
>   pgtable_t pgtable;
> - pmd_t _pmd;
> - bool young, write, dirty, soft_dirty;
> + pmd_t old, _pmd;
> + bool young, write, soft_dirty;
>   unsigned long addr;
>   int i;
>  
> @@ -1965,7 +1955,6 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   page_ref_add(page, HPAGE_PMD_NR - 1);
>   write = pmd_write(*pmd);
>   young = pmd_young(*pmd);
> - dirty = pmd_dirty(*pmd);
>   soft_dirty = pmd_soft_dirty(*pmd);
>  
>   pmdp_huge_split_prepare(vma, haddr, pmd);
> @@ -1995,8 +1984,6 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   if (soft_dirty)
>   entry = pte_mksoft_dirty(entry);
>   }
> - if (dirty)
> - SetPageDirty(page + i);
>   pte = pte_offset_map(&_pmd, addr);
>   BUG_ON(!pte_none(*pte));
>   set_pte_at(mm, addr, pte, entry);
> @@ -2045,7 +2032,15 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>* and finally we write the non-huge version of the pmd entry with
>* pmd_populate.
>*/
> - pmdp_invalidate(vma, haddr, pmd);
> + old = pmdp_invalidate(vma, haddr, pmd);
> +
> + /*
> +  * Transfer dirty bit using value returned by pmd_invalidate() to be
> +  * sure we don't race with CPU that can set the bit under us.
> +  */
> + if (pmd_dirty(old))
> + SetPageDirty(page);
> +

Looking at this, it seems that without this patch MADV_FREE has been
broken because it can lose the dirty bit due to the early check. Right?
If so, isn't it a candidate for -stable?


Re: [PATCH v1] zram: Use __sysfs_match_string() helper

2017-06-12 Thread Minchan Kim
On Mon, Jun 12, 2017 at 10:52:45AM +0900, Sergey Senozhatsky wrote:
> (Cc Andrew)
> 
> Link: 
> lkml.kernel.org/r/20170609120835.22156-1-andriy.shevche...@linux.intel.com
> 
> 
> On (06/09/17 15:08), Andy Shevchenko wrote:
> > Use __sysfs_match_string() helper instead of open coded variant.
> > 
> > Cc: Minchan Kim 
> > Cc: Nitin Gupta 
> > Cc: Sergey Senozhatsky 
> > Signed-off-by: Andy Shevchenko 
> 
> Reviewed-by: Sergey Senozhatsky 
Acked-by: Minchan Kim 

Thanks.



[RFC 5/7] zram: identify asynchronous IO's return value

2017-06-11 Thread Minchan Kim
For upcoming asynchronous IO like writeback, zram_rw_page should be
aware of whether the requested IO was completed synchronously, was
submitted successfully, or failed with an error.

To that end, zram_bvec_rw now has three return values (a caller-side
sketch follows the list below):

-errno: an error occurred
 0: the IO request was completed synchronously
 1: the IO request was submitted successfully
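
A minimal caller-side sketch of the convention above (illustrative only;
the real dispatch for the rw_page path is in the hunk below):

	ret = zram_bvec_rw(zram, &bv, index, offset, is_write);
	if (ret < 0)
		return ret;			/* -errno: request failed */
	if (ret == 0)
		page_endio(page, is_write, 0);	/* completed synchronously */
	/* ret == 1: submitted; the bio's end_io completes the page later */
	return 0;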

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 32 
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d82914e..f5924ef 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -897,7 +897,7 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec 
*bvec,
 
 static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
 {
-   int ret;
+   int ret = 0;
struct zram_entry *uninitialized_var(entry);
unsigned int uninitialized_var(comp_len);
void *src, *dst, *mem;
@@ -1014,7 +1014,7 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
zram_slot_unlock(zram, index);
atomic64_inc(&zram->stats.pages_stored);
 
-   return 0;
+   return ret;
 }
 
 static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
@@ -1096,6 +1096,11 @@ static void zram_bio_discard(struct zram *zram, u32 
index,
}
 }
 
+/*
+ * Returns errno if it has some problem. Otherwise return 0 or 1.
+ * Returns 0 if IO request was done synchronously
+ * Returns 1 if IO request was successfully submitted.
+ */
 static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
int offset, bool is_write)
 {
@@ -1117,7 +1122,7 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec 
*bvec, u32 index,
 
generic_end_io_acct(rw_acct, &zram->disk->part0, start_time);
 
-   if (unlikely(ret)) {
+   if (unlikely(ret < 0)) {
if (!is_write)
atomic64_inc(&zram->stats.failed_reads);
else
@@ -1210,7 +1215,7 @@ static void zram_slot_free_notify(struct block_device 
*bdev,
 static int zram_rw_page(struct block_device *bdev, sector_t sector,
   struct page *page, bool is_write)
 {
-   int offset, err = -EIO;
+   int offset, ret;
u32 index;
struct zram *zram;
struct bio_vec bv;
@@ -1219,7 +1224,7 @@ static int zram_rw_page(struct block_device *bdev, 
sector_t sector,
 
if (!valid_io_request(zram, sector, PAGE_SIZE)) {
atomic64_inc(&zram->stats.invalid_io);
-   err = -EINVAL;
+   ret = -EINVAL;
goto out;
}
 
@@ -1230,7 +1235,7 @@ static int zram_rw_page(struct block_device *bdev, 
sector_t sector,
bv.bv_len = PAGE_SIZE;
bv.bv_offset = 0;
 
-   err = zram_bvec_rw(zram, &bv, index, offset, is_write);
+   ret = zram_bvec_rw(zram, &bv, index, offset, is_write);
 out:
/*
 * If I/O fails, just return error(ie, non-zero) without
@@ -1240,9 +1245,20 @@ static int zram_rw_page(struct block_device *bdev, 
sector_t sector,
 * bio->bi_end_io does things to handle the error
 * (e.g., SetPageError, set_page_dirty and extra works).
 */
-   if (err == 0)
+   if (unlikely(ret < 0))
+   return ret;
+
+   switch (ret) {
+   case 0:
page_endio(page, is_write, 0);
-   return err;
+   break;
+   case 1:
+   ret = 0;
+   break;
+   default:
+   WARN_ON(1);
+   }
+   return ret;
 }
 
 static void zram_reset_device(struct zram *zram)
-- 
2.7.4



[RFC 4/7] zram: add free space management in backing device

2017-06-11 Thread Minchan Kim
With a backing device, zram needs to manage the free space of that
backing device.
This patch adds a very naive bitmap to manage the free space.
However, it should be simple enough considering how infrequently
incompressible pages occur in zram.
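
As a rough sizing example (my numbers, assuming PAGE_SIZE = 4KiB and a
hypothetical 4GiB backing device), the bitmap overhead stays small:

	/* Hypothetical sizing: 4GiB backing device, 4KiB pages */
	nr_pages  = (4ULL << 30) >> PAGE_SHIFT;	/* 1,048,576 slots */
	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
						/* 16,384 longs = 128KiB */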

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/Kconfig| 13 
 drivers/block/zram/zram_drv.c | 48 ++-
 drivers/block/zram/zram_drv.h |  3 +++
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 2f3dd1f..f2ca2b5 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -27,3 +27,16 @@ config ZRAM_DEDUP
  computation time trade-off. Please check the benefit before
  enabling this option. Experiment shows the positive effect when
  the zram is used as blockdev and is used to store build output.
+
+config ZRAM_WRITEBACK
+   bool "Write back incompressible page to backing device"
+   depends on ZRAM
+   default n
+   help
+ With incompressible page, there is no memory saving to keep it
+ in memory. Instead, write it out to backing device.
+ For this feature, admin should set up backing device via
+ /sys/block/zramX/backing_dev.
+
+ See zram.txt for more information.
+
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index dcb6f83..d82914e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -298,6 +298,9 @@ static void reset_bdev(struct zram *zram)
zram->backing_dev = NULL;
zram->old_block_size = 0;
zram->bdev = NULL;
+
+   kvfree(zram->bitmap);
+   zram->bitmap = NULL;
 }
 
 static ssize_t backing_dev_show(struct device *dev,
@@ -337,7 +340,8 @@ static ssize_t backing_dev_store(struct device *dev,
struct file *backing_dev = NULL;
struct inode *inode;
struct address_space *mapping;
-   unsigned int old_block_size = 0;
+   unsigned int bitmap_sz, old_block_size = 0;
+   unsigned long nr_pages, *bitmap = NULL;
struct block_device *bdev = NULL;
int err;
size_t sz;
@@ -388,16 +392,27 @@ static ssize_t backing_dev_store(struct device *dev,
if (err < 0)
goto out;
 
+   nr_pages = i_size_read(inode) >> PAGE_SHIFT;
+   bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
+   bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
+   if (!bitmap) {
+   err = -ENOMEM;
+   goto out;
+   }
+
old_block_size = block_size(bdev);
err = set_blocksize(bdev, PAGE_SIZE);
if (err)
goto out;
 
reset_bdev(zram);
+   spin_lock_init(&zram->bitmap_lock);
 
zram->old_block_size = old_block_size;
zram->bdev = bdev;
zram->backing_dev = backing_dev;
+   zram->bitmap = bitmap;
+   zram->nr_pages = nr_pages;
up_write(&zram->init_lock);
 
pr_info("setup backing device %s\n", file_name);
@@ -407,6 +422,9 @@ static ssize_t backing_dev_store(struct device *dev,
 
return len;
 out:
+   if (bitmap)
+   kvfree(bitmap);
+
if (bdev)
blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 
@@ -422,6 +440,34 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
+static unsigned long get_entry_bdev(struct zram *zram)
+{
+   unsigned long entry;
+
+   spin_lock(&zram->bitmap_lock);
+   /* skip 0 bit to confuse zram.handle = 0 */
+   entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
+   if (entry == zram->nr_pages) {
+   spin_unlock(&zram->bitmap_lock);
+   return 0;
+   }
+
+   set_bit(entry, zram->bitmap);
+   spin_unlock(&zram->bitmap_lock);
+
+   return entry;
+}
+
+static void put_entry_bdev(struct zram *zram, unsigned long entry)
+{
+   int was_set;
+
+   spin_lock(&zram->bitmap_lock);
+   was_set = test_and_clear_bit(entry, zram->bitmap);
+   spin_unlock(&zram->bitmap_lock);
+   WARN_ON_ONCE(!was_set);
+}
+
 #else
 static bool zram_wb_enabled(struct zram *zram) { return false; }
 static void reset_bdev(struct zram *zram) {};
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 5193bcb..8ae3b3f 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -140,6 +140,9 @@ struct zram {
struct file *backing_dev;
struct block_device *bdev;
unsigned int old_block_size;
+   unsigned long *bitmap;
+   unsigned long nr_pages;
+   spinlock_t bitmap_lock;
 #endif
 };
 
-- 
2.7.4



[RFC 0/7] writeback incompressible pages to storage

2017-06-11 Thread Minchan Kim
zram is useful for saving memory with compressible pages, but sometimes
the workload changes and the system ends up with lots of incompressible
pages, which is very harmful for zram.

This patchset adds a writeback feature to zram: the admin can set up a
backing block device and, with it, zram can save memory by writing out
incompressible pages (1/4 comp ratio) once it finds them, instead of
keeping them in memory.

[1,2] are just cleanups and [3-7] enable the feature step by step.
[3-7] are not logically bisectable (i.e., they are split by logical
unit); although I tried to keep each step compiling without breakage,
I think this split makes the series easier to review.

Thanks.

Minchan Kim (7):
  [1] zram: inlining zram_compress
  [2] zram: rename zram_decompress_page with __zram_bvec_read
  [3] zram: add interface to specify backing device
  [4] zram: add free space management in backing device
  [5] zram: identify asynchronous IO's return value
  [6] zram: write incompressible pages to backing device
  [7] zram: read page from backing device

 drivers/block/zram/Kconfig|  13 +
 drivers/block/zram/zram_drv.c | 549 --
 drivers/block/zram/zram_drv.h |   9 +
 3 files changed, 500 insertions(+), 71 deletions(-)

-- 
2.7.4



[RFC 1/7] zram: inlining zram_compress

2017-06-11 Thread Minchan Kim
zram_compress does several things: compression, entry allocation and
limit checking. I split it out just for readability, but it hurts
modularization. :(
So this patch removes the zram_compress function and inlines it into
__zram_bvec_write for the upcoming patches.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 81 +--
 1 file changed, 31 insertions(+), 50 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5440d1a..bed534e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -692,22 +692,45 @@ static int zram_bvec_read(struct zram *zram, struct 
bio_vec *bvec,
return ret;
 }
 
-static int zram_compress(struct zram *zram, struct zcomp_strm **zstrm,
-   struct page *page, struct zram_entry **out_entry,
-   unsigned int *out_comp_len)
+static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
 {
int ret;
-   unsigned int comp_len;
-   void *src;
+   struct zram_entry *uninitialized_var(entry);
+   unsigned int uninitialized_var(comp_len);
+   void *src, *dst, *mem;
+   struct zcomp_strm *zstrm;
+   struct page *page = bvec->bv_page;
+   u32 checksum;
+   enum zram_pageflags flags = 0;
+   unsigned long uninitialized_var(element);
unsigned long alloced_pages;
-   struct zram_entry *entry = NULL;
+
+   mem = kmap_atomic(page);
+   if (page_same_filled(mem, &element)) {
+   kunmap_atomic(mem);
+   /* Free memory associated with this sector now. */
+   flags = ZRAM_SAME;
+   atomic64_inc(&zram->stats.same_pages);
+   goto out;
+   }
+   kunmap_atomic(mem);
+
+   entry = zram_dedup_find(zram, page, &checksum);
+   if (entry) {
+   comp_len = entry->len;
+   flags = ZRAM_DUP;
+   atomic64_add(comp_len, &zram->stats.dup_data_size);
+   goto out;
+   }
 
 compress_again:
+   zstrm = zcomp_stream_get(zram->comp);
src = kmap_atomic(page);
-   ret = zcomp_compress(*zstrm, src, &comp_len);
+   ret = zcomp_compress(zstrm, src, &comp_len);
kunmap_atomic(src);
 
if (unlikely(ret)) {
+   zcomp_stream_put(zram->comp);
pr_err("Compression failed! err=%d\n", ret);
if (entry)
zram_entry_free(zram, entry);
@@ -742,7 +765,6 @@ static int zram_compress(struct zram *zram, struct 
zcomp_strm **zstrm,
entry = zram_entry_alloc(zram, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
__GFP_MOVABLE);
-   *zstrm = zcomp_stream_get(zram->comp);
if (entry)
goto compress_again;
return -ENOMEM;
@@ -752,52 +774,11 @@ static int zram_compress(struct zram *zram, struct 
zcomp_strm **zstrm,
update_used_max(zram, alloced_pages);
 
if (zram->limit_pages && alloced_pages > zram->limit_pages) {
+   zcomp_stream_put(zram->comp);
zram_entry_free(zram, entry);
return -ENOMEM;
}
 
-   *out_entry = entry;
-   *out_comp_len = comp_len;
-   return 0;
-}
-
-static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
-{
-   int ret;
-   struct zram_entry *uninitialized_var(entry);
-   unsigned int uninitialized_var(comp_len);
-   void *src, *dst, *mem;
-   struct zcomp_strm *zstrm;
-   struct page *page = bvec->bv_page;
-   u32 checksum;
-   enum zram_pageflags flags = 0;
-   unsigned long uninitialized_var(element);
-
-   mem = kmap_atomic(page);
-   if (page_same_filled(mem, &element)) {
-   kunmap_atomic(mem);
-   /* Free memory associated with this sector now. */
-   flags = ZRAM_SAME;
-   atomic64_inc(&zram->stats.same_pages);
-   goto out;
-   }
-   kunmap_atomic(mem);
-
-   entry = zram_dedup_find(zram, page, &checksum);
-   if (entry) {
-   comp_len = entry->len;
-   flags = ZRAM_DUP;
-   atomic64_add(comp_len, &zram->stats.dup_data_size);
-   goto out;
-   }
-
-   zstrm = zcomp_stream_get(zram->comp);
-   ret = zram_compress(zram, &zstrm, page, &entry, &comp_len);
-   if (ret) {
-   zcomp_stream_put(zram->comp);
-   return ret;
-   }
-
dst = zs_map_object(zram->mem_pool,
zram_entry_handle(zram, entry), ZS_MM_WO);
 
-- 
2.7.4



[RFC 6/7] zram: write incompressible pages to backing device

2017-06-11 Thread Minchan Kim
This patch enables write IO to transfer data to the backing device.
For that, it implements the write_to_bdev function, which creates a
new bio and chains it to the parent bio to make the parent bio
asynchronous.
For rw_page, which has no parent bio, it submits its own bio and
handles IO completion in zram_page_end_io. (A condensed sketch of the
two submission paths follows below.)

Also, this patch defines a new flag, ZRAM_WB, to mark written-back
pages for later read IO.
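
The two submission paths mentioned above boil down to the pattern below
(a condensed sketch of the write_to_bdev() hunk in this patch, with
error handling omitted):

	/* rw_page path owns the bio; the bio path chains to its parent. */
	if (!parent) {
		/* rw_page: no parent bio, complete the page ourselves */
		bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
		bio->bi_end_io = zram_page_end_io;
	} else {
		/* bio path: parent completes only after this write does */
		bio->bi_opf = parent->bi_opf;
		bio_chain(bio, parent);
	}
	submit_bio(bio);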

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 108 ++
 drivers/block/zram/zram_drv.h |   1 +
 2 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f5924ef..9b0db9b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -468,9 +468,75 @@ static void put_entry_bdev(struct zram *zram, unsigned 
long entry)
WARN_ON_ONCE(!was_set);
 }
 
+void zram_page_end_io(struct bio *bio)
+{
+   struct page *page = bio->bi_io_vec[0].bv_page;
+
+   page_endio(page, op_is_write(bio_op(bio)), bio->bi_error);
+   bio_put(bio);
+}
+
+static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
+   u32 index, struct bio *parent,
+   unsigned long *pentry)
+{
+   struct bio *bio;
+   unsigned long entry;
+
+   bio = bio_alloc(GFP_ATOMIC, 1);
+   if (!bio)
+   return -ENOMEM;
+
+   entry = get_entry_bdev(zram);
+   if (!entry) {
+   bio_put(bio);
+   return -ENOSPC;
+   }
+
+   bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
+   bio->bi_bdev = zram->bdev;
+   if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len,
+   bvec->bv_offset)) {
+   bio_put(bio);
+   put_entry_bdev(zram, entry);
+   return -EIO;
+   }
+
+   if (!parent) {
+   bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
+   bio->bi_end_io = zram_page_end_io;
+   } else {
+   bio->bi_opf = parent->bi_opf;
+   bio_chain(bio, parent);
+   }
+
+   submit_bio(bio);
+   *pentry = entry;
+
+   return 0;
+}
+
+static void zram_wb_clear(struct zram *zram, u32 index)
+{
+   unsigned long entry;
+
+   zram_clear_flag(zram, index, ZRAM_WB);
+   entry = zram_get_element(zram, index);
+   zram_set_element(zram, index, 0);
+   put_entry_bdev(zram, entry);
+}
+
 #else
 static bool zram_wb_enabled(struct zram *zram) { return false; }
 static void reset_bdev(struct zram *zram) {};
+static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
+   u32 index, struct bio *parent,
+   unsigned long *pentry)
+
+{
+   return -EIO;
+}
+static void zram_wb_clear(struct zram *zram, u32 index) {}
 #endif
 
 
@@ -789,7 +855,15 @@ static bool zram_meta_alloc(struct zram *zram, u64 
disksize)
  */
 static void zram_free_page(struct zram *zram, size_t index)
 {
-   struct zram_entry *entry = zram_get_entry(zram, index);
+   struct zram_entry *uninitialized_var(entry);
+
+   if (zram_wb_enabled(zram) && zram_test_flag(zram, index, ZRAM_WB)) {
+   zram_wb_clear(zram, index);
+   atomic64_dec(&zram->stats.pages_stored);
+   return;
+   }
+
+   entry = zram_get_entry(zram, index);
 
/*
 * No memory is allocated for same element filled pages.
@@ -895,7 +969,8 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec 
*bvec,
return ret;
 }
 
-static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
+static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
+   u32 index, struct bio *bio)
 {
int ret = 0;
struct zram_entry *uninitialized_var(entry);
@@ -907,6 +982,7 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
enum zram_pageflags flags = 0;
unsigned long uninitialized_var(element);
unsigned long alloced_pages;
+   bool allow_wb = true;
 
mem = kmap_atomic(page);
if (page_same_filled(mem, &element)) {
@@ -940,8 +1016,20 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
return ret;
}
 
-   if (unlikely(comp_len > max_zpage_size))
+   if (unlikely(comp_len > max_zpage_size)) {
+   if (zram_wb_enabled(zram) && allow_wb) {
+   zcomp_stream_put(zram->comp);
+   ret = write_to_bdev(zram, bvec, index, bio, &element);
+   if (!ret) {
+   flags = ZRAM_WB;
+   ret = 1;
+   goto out;
+   

[RFC 3/7] zram: add interface to specify backing device

2017-06-11 Thread Minchan Kim
For the writeback feature, the user should set up a backing device
before zram starts working. This patch adds the interface for that via
/sys/block/zramX/backing_dev.

Currently, it supports only a block device, but it could be enhanced
to support a file as well.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 163 ++
 drivers/block/zram/zram_drv.h |   5 ++
 2 files changed, 168 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index a0c304b..dcb6f83 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -271,6 +271,163 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static bool zram_wb_enabled(struct zram *zram)
+{
+   return zram->backing_dev;
+}
+
+static void reset_bdev(struct zram *zram)
+{
+   struct inode *inode;
+   struct address_space *mapping;
+   struct block_device *bdev;
+
+   if (!zram_wb_enabled(zram))
+   return;
+
+   mapping = zram->backing_dev->f_mapping;
+   inode = mapping->host;
+   bdev = I_BDEV(inode);
+
+   if (zram->old_block_size)
+   set_blocksize(bdev, zram->old_block_size);
+   blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
+   /* hope filp_close flush all of IO */
+   filp_close(zram->backing_dev, NULL);
+   zram->backing_dev = NULL;
+   zram->old_block_size = 0;
+   zram->bdev = NULL;
+}
+
+static ssize_t backing_dev_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   struct file *file = zram->backing_dev;
+   char *p;
+   ssize_t ret;
+
+   down_read(&zram->init_lock);
+   if (!zram_wb_enabled(zram)) {
+   memcpy(buf, "none\n", 5);
+   up_read(&zram->init_lock);
+   return 5;
+   }
+
+   p = file_path(file, buf, PAGE_SIZE - 1);
+   if (IS_ERR(p)) {
+   ret = PTR_ERR(p);
+   goto out;
+   }
+
+   ret = strlen(p);
+   memmove(buf, p, ret);
+   buf[ret++] = '\n';
+out:
+   up_read(&zram->init_lock);
+   return ret;
+}
+
+static ssize_t backing_dev_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   char *file_name;
+   struct filename *name = NULL;
+   struct file *backing_dev = NULL;
+   struct inode *inode;
+   struct address_space *mapping;
+   unsigned int old_block_size = 0;
+   struct block_device *bdev = NULL;
+   int err;
+   size_t sz;
+   struct zram *zram = dev_to_zram(dev);
+
+   file_name = kmalloc(PATH_MAX, GFP_KERNEL);
+   if (!file_name)
+   return -ENOMEM;
+
+   down_write(&zram->init_lock);
+   if (init_done(zram)) {
+   pr_info("Can't setup backing device for initialized device\n");
+   err = -EBUSY;
+   goto out;
+   }
+
+   strlcpy(file_name, buf, len);
+   /* ignore trailing newline */
+   sz = strlen(file_name);
+   if (sz > 0 && file_name[sz - 1] == '\n')
+   file_name[sz - 1] = 0x00;
+
+   name = getname_kernel(file_name);
+   if (IS_ERR(name)) {
+   err = PTR_ERR(name);
+   name = NULL;
+   goto out;
+   }
+
+   backing_dev = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
+   if (IS_ERR(backing_dev)) {
+   err = PTR_ERR(backing_dev);
+   backing_dev = NULL;
+   goto out;
+   }
+
+   mapping = backing_dev->f_mapping;
+   inode = mapping->host;
+
+   /* Support only block device in this moment */
+   if (!S_ISBLK(inode->i_mode)) {
+   err = -ENOTBLK;
+   goto out;
+   }
+
+   bdev = bdgrab(I_BDEV(inode));
+   err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
+   if (err < 0)
+   goto out;
+
+   old_block_size = block_size(bdev);
+   err = set_blocksize(bdev, PAGE_SIZE);
+   if (err)
+   goto out;
+
+   reset_bdev(zram);
+
+   zram->old_block_size = old_block_size;
+   zram->bdev = bdev;
+   zram->backing_dev = backing_dev;
+   up_write(&zram->init_lock);
+
+   pr_info("setup backing device %s\n", file_name);
+
+   putname(name);
+   kfree(file_name);
+
+   return len;
+out:
+   if (bdev)
+   blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
+
+   if (backing_dev)
+   filp_close(backing_dev, NULL);
+
+   if (name)
+   putname(name);
+   up_write(&zram->init_lock);
+
+   kfree(file_name);
+
+   return err;
+}
+
+#else
+static bool zram_wb_enabled(struct z

[RFC 7/7] zram: read page from backing device

2017-06-11 Thread Minchan Kim
This patch enables read IO from the backing device. For the feature,
it implements two read functions to transfer data from the backing
storage.

One is an asynchronous IO function and the other is a synchronous one.

The reason I need synchronous IO is the partial-write case, which must
complete the read IO before overwriting the partial data (see the
sketch below).

We could make the partial IO case asynchronous too, but at the moment
I don't feel like adding more complexity to support such a rare use
case, so I want to keep it simple.
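
For context, the partial-write case is essentially a read-modify-write.
A rough sketch, assuming the helpers introduced elsewhere in this series
(not the literal patch code):

	/* read the whole old page synchronously before the partial write */
	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM);
	ret = __zram_bvec_read(zram, page, index, bio, true);
	if (ret)
		goto out;

	/* merge the sub-page data into the old content */
	src = kmap_atomic(bvec->bv_page);
	dst = kmap_atomic(page);
	memcpy(dst + offset, src + bvec->bv_offset, bvec->bv_len);
	kunmap_atomic(dst);
	kunmap_atomic(src);

	/* write the full, merged page back out */
	vec.bv_page = page;
	vec.bv_len = PAGE_SIZE;
	vec.bv_offset = 0;
	ret = __zram_bvec_write(zram, &vec, index, bio);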

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 123 --
 1 file changed, 118 insertions(+), 5 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9b0db9b..d9eb6df 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -476,6 +476,95 @@ void zram_page_end_io(struct bio *bio)
bio_put(bio);
 }
 
+/*
+ * Returns 0 if the submission is successful.
+ */
+static int read_from_bdev_async(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *parent)
+{
+   struct bio *bio;
+
+   bio = bio_alloc(GFP_ATOMIC, 1);
+   if (!bio)
+   return -ENOMEM;
+
+   bio->bi_iter.bi_sector = entry * (PAGE_SIZE >> 9);
+   bio->bi_bdev = zram->bdev;
+   if (!bio_add_page(bio, bvec->bv_page, bvec->bv_len, bvec->bv_offset)) {
+   bio_put(bio);
+   return -EIO;
+   }
+
+   if (!parent) {
+   bio->bi_opf = REQ_OP_READ;
+   bio->bi_end_io = zram_page_end_io;
+   } else {
+   bio->bi_opf = parent->bi_opf;
+   bio_chain(bio, parent);
+   }
+
+   submit_bio(bio);
+   return 1;
+}
+
+struct zram_work {
+   struct work_struct work;
+   struct zram *zram;
+   unsigned long entry;
+   struct bio *bio;
+};
+
+#if PAGE_SIZE != 4096
+static void zram_sync_read(struct work_struct *work)
+{
+   struct bio_vec bvec;
+   struct zram_work *zw = container_of(work, struct zram_work, work);
+   struct zram *zram = zw->zram;
+   unsigned long entry = zw->entry;
+   struct bio *bio = zw->bio;
+
+   read_from_bdev_async(zram, &bvec, entry, bio);
+}
+
+/*
+ * Block layer want one ->make_request_fn to be active at a time
+ * so if we use chained IO with parent IO in same context,
+ * it's a deadlock. To avoid it, it uses a worker thread context.
+ */
+static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *bio)
+{
+   struct zram_work work;
+
+   work.zram = zram;
+   work.entry = entry;
+   work.bio = bio;
+
+   INIT_WORK_ONSTACK(&work.work, zram_sync_read);
+   queue_work(system_unbound_wq, &work.work);
+   flush_work(&work.work);
+   destroy_work_on_stack(&work.work);
+
+   return 1;
+}
+#else
+static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *bio)
+{
+   WARN_ON(1);
+   return -EIO;
+}
+#endif
+
+static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *parent, bool sync)
+{
+   if (sync)
+   return read_from_bdev_sync(zram, bvec, entry, parent);
+   else
+   return read_from_bdev_async(zram, bvec, entry, parent);
+}
+
 static int write_to_bdev(struct zram *zram, struct bio_vec *bvec,
u32 index, struct bio *parent,
unsigned long *pentry)
@@ -536,6 +625,12 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
 {
return -EIO;
 }
+
+static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
+   unsigned long entry, struct bio *parent, bool sync)
+{
+   return -EIO;
+}
 static void zram_wb_clear(struct zram *zram, u32 index) {}
 #endif
 
@@ -897,13 +992,31 @@ static void zram_free_page(struct zram *zram, size_t 
index)
zram_set_obj_size(zram, index, 0);
 }
 
-static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index)
+static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
+   struct bio *bio, bool partial_io)
 {
int ret;
struct zram_entry *entry;
unsigned int size;
void *src, *dst;
 
+   if (zram_wb_enabled(zram)) {
+   zram_slot_lock(zram, index);
+   if (zram_test_flag(zram, index, ZRAM_WB)) {
+   struct bio_vec bvec;
+
+   zram_slot_unlock(zram, index);
+
+   bvec.bv_page = page;
+   bvec.bv_len = PAGE_SIZE;
+   bvec.bv_offset = 0;
+   return read_from_bde

[RFC 2/7] zram: rename zram_decompress_page with __zram_bvec_read

2017-06-11 Thread Minchan Kim
The zram_decompress_page name is not proper because it doesn't
decompress if the page was a dedup hit or was stored without
compression. Use a more abstract name that is consistent with the
write-path function __zram_bvec_write.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bed534e..a0c304b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -620,7 +620,7 @@ static void zram_free_page(struct zram *zram, size_t index)
zram_set_obj_size(zram, index, 0);
 }
 
-static int zram_decompress_page(struct zram *zram, struct page *page, u32 
index)
+static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index)
 {
int ret;
struct zram_entry *entry;
@@ -673,7 +673,7 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec 
*bvec,
return -ENOMEM;
}
 
-   ret = zram_decompress_page(zram, page, index);
+   ret = __zram_bvec_read(zram, page, index);
if (unlikely(ret))
goto out;
 
@@ -833,7 +833,7 @@ static int zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec,
if (!page)
return -ENOMEM;
 
-   ret = zram_decompress_page(zram, page, index);
+   ret = __zram_bvec_read(zram, page, index);
if (ret)
goto out;
 
-- 
2.7.4



Re: [PATCH] mm: correct the comment when reclaimed pages exceed the scanned pages

2017-06-07 Thread Minchan Kim
On Wed, Jun 07, 2017 at 04:31:06PM +0800, zhongjiang wrote:
> The commit e1587a494540 ("mm: vmpressure: fix sending wrong events on
> underflow") declares that reclaimed pages can exceed the scanned pages
> due to THP reclaim. That is incorrect, because a THP will be split into
> normal pages and looped over again, which increments the scanned pages.
> 
> Signed-off-by: zhongjiang 
> ---
>  mm/vmpressure.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 6063581..0e91ba3 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -116,8 +116,9 @@ static enum vmpressure_levels 
> vmpressure_calc_level(unsigned long scanned,
>  
>   /*
>* reclaimed can be greater than scanned in cases
> -  * like THP, where the scanned is 1 and reclaimed
> -  * could be 512
> +  * like reclaimed slab pages, shrink_node just add
> +  * reclaimed page without a related increment to
> +  * scanned pages.
>*/
>   if (reclaimed >= scanned)
>   goto out;

Thanks for fixing my fault!

Acked-by: Minchan Kim 

Frankly speaking, I'm not sure we need such a comment there at the cost
of maintenance: it would be fragile, and the case is easy to handle with
the simple condition above, so I think it would be better to remove the
comment. Others might feel differently, though, so I don't have any
objection.



Re: [PATCH] Revert "mm: vmpressure: fix sending wrong events on underflow"

2017-06-06 Thread Minchan Kim
On Wed, Jun 07, 2017 at 12:56:57PM +0800, zhong jiang wrote:
> On 2017/6/7 11:55, Minchan Kim wrote:
> > On Wed, Jun 07, 2017 at 11:08:37AM +0800, zhongjiang wrote:
> >> This reverts commit e1587a4945408faa58d0485002c110eb2454740c.
> >>
> >> When a THP lru page is reclaimed, the THP is split into normal pages and
> >> looped over again. The reclaimed pages should not be bigger than nr_scan,
> >> because each loop increases the nr_scan counter.
> > Unfortunately, there is still an underflow issue caused by slab pages, as
> > Vinayak reported in the description of e1587a4945408, so we cannot revert.
> > Please correct the comment instead of removing the logic.
> >
> > Thanks.
>   we calculate the vmpressure based on the LRU pages and exclude the slab
>   pages, per the previous discussion. Is that not the case?
> 

IIRC, it is not merged into mainline, although mmotm has it.


Re: [PATCH] Revert "mm: vmpressure: fix sending wrong events on underflow"

2017-06-06 Thread Minchan Kim
On Wed, Jun 07, 2017 at 11:08:37AM +0800, zhongjiang wrote:
> This reverts commit e1587a4945408faa58d0485002c110eb2454740c.
> 
> When a THP lru page is reclaimed, the THP is split into normal pages and
> looped over again. The reclaimed pages should not be bigger than nr_scan,
> because each loop increases the nr_scan counter.

Unfortunately, there is still an underflow issue caused by slab pages, as
Vinayak reported in the description of e1587a4945408, so we cannot revert.
Please correct the comment instead of removing the logic.

Thanks.

> 
> Signed-off-by: zhongjiang 
> ---
>  mm/vmpressure.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 6063581..149fdf6 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -112,16 +112,9 @@ static enum vmpressure_levels 
> vmpressure_calc_level(unsigned long scanned,
>   unsigned long reclaimed)
>  {
>   unsigned long scale = scanned + reclaimed;
> - unsigned long pressure = 0;
> + unsigned long pressure;
>  
>   /*
> -  * reclaimed can be greater than scanned in cases
> -  * like THP, where the scanned is 1 and reclaimed
> -  * could be 512
> -  */
> - if (reclaimed >= scanned)
> - goto out;
> - /*
>* We calculate the ratio (in percents) of how many pages were
>* scanned vs. reclaimed in a given time frame (window). Note that
>* time is in VM reclaimer's "ticks", i.e. number of pages
> @@ -131,7 +124,6 @@ static enum vmpressure_levels 
> vmpressure_calc_level(unsigned long scanned,
>   pressure = scale - (reclaimed * scale / scanned);
>   pressure = pressure * 100 / scale;
>  
> -out:
>   pr_debug("%s: %3lu  (s: %lu  r: %lu)\n", __func__, pressure,
>scanned, reclaimed);
>  
> -- 
> 1.7.12.4
> 


Re: [PATCH] mm: vmscan: do not pass reclaimed slab to vmpressure

2017-06-06 Thread Minchan Kim
Hi,

On Tue, Jun 06, 2017 at 09:00:55PM +0800, zhong jiang wrote:
> On 2017/1/31 7:40, Minchan Kim wrote:
> > Hi Vinayak,
> > Sorry for late response. It was Lunar New Year holidays.
> >
> > On Fri, Jan 27, 2017 at 01:43:23PM +0530, vinayak menon wrote:
> >>> Thanks for the explain. However, such case can happen with THP page
> >>> as well as slab. In case of THP page, nr_scanned is 1 but nr_reclaimed
> >>> could be 512 so I think vmpressure should have a logic to prevent undeflow
> >>> regardless of slab shrinking.
> >>>
> >> I see. Going to send a vmpressure fix. But, wouldn't the THP case
> >> result in incorrect
> >> vmpressure reporting even if we fix the vmpressure underflow problem ?
> > If a THP page is reclaimed, it reports lower pressure due to bigger
> > reclaim ratio(ie, reclaimed/scanned) compared to normal pages but
> > it's not a problem, is it? Because VM reclaimed more memory than
> > expected so memory pressure isn't severe now.
>   Hi, Minchan
> 
>   When a THP lru page is reclaimed, a bigger reclaim ratio makes sense. But
>   when I read the code, I found the THP is split into normal pages and
>   looped over again, so the reclaimed pages should not be bigger than
>   nr_scan, because each loop increases the nr_scan counter.
>  
>   It is likely I am missing something; could you please point it out?

You are absolutely right.

I got confused by nr_scanned from isolate_lru_pages and sc->nr_scanned
from shrink_page_list.

Thanks.



[PATCH] zram: clean up duplicated code in __zram_bvec_write

2017-05-28 Thread Minchan Kim
__zram_bvec_write has some duplicated logic for zram metadata
handling of same_page|dedup_page|compressed_page.
This patch aims to clean it up without any behavior change.

Cc: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 70 +--
 1 file changed, 27 insertions(+), 43 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 5f2a862..0557c15 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -500,30 +500,6 @@ static bool zram_same_page_read(struct zram *zram, u32 
index,
return false;
 }
 
-static bool zram_same_page_write(struct zram *zram, u32 index,
-   struct page *page)
-{
-   unsigned long element;
-   void *mem = kmap_atomic(page);
-
-   if (page_same_filled(mem, &element)) {
-   kunmap_atomic(mem);
-   /* Free memory associated with this sector now. */
-   zram_slot_lock(zram, index);
-   zram_free_page(zram, index);
-   zram_set_flag(zram, index, ZRAM_SAME);
-   zram_set_element(zram, index, element);
-   zram_slot_unlock(zram, index);
-
-   atomic64_inc(&zram->stats.same_pages);
-   atomic64_inc(&zram->stats.pages_stored);
-   return true;
-   }
-   kunmap_atomic(mem);
-
-   return false;
-}
-
 static struct zram_entry *zram_entry_alloc(struct zram *zram,
unsigned int len, gfp_t flags)
 {
@@ -790,28 +766,31 @@ static int zram_compress(struct zram *zram, struct 
zcomp_strm **zstrm,
 static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
index)
 {
int ret;
-   struct zram_entry *entry;
-   unsigned int comp_len;
-   void *src, *dst;
+   struct zram_entry *uninitialized_var(entry);
+   unsigned int uninitialized_var(comp_len);
+   void *src, *dst, *mem;
struct zcomp_strm *zstrm;
struct page *page = bvec->bv_page;
u32 checksum;
+   enum zram_pageflags flags = 0;
+   unsigned long uninitialized_var(element);
 
-   if (zram_same_page_write(zram, index, page))
-   return 0;
+   mem = kmap_atomic(page);
+   if (page_same_filled(mem, &element)) {
+   kunmap_atomic(mem);
+   /* Free memory associated with this sector now. */
+   flags = ZRAM_SAME;
+   atomic64_inc(&zram->stats.same_pages);
+   goto out;
+   }
+   kunmap_atomic(mem);
 
entry = zram_dedup_find(zram, page, &checksum);
if (entry) {
comp_len = entry->len;
-   zram_slot_lock(zram, index);
-   zram_free_page(zram, index);
-   zram_set_flag(zram, index, ZRAM_DUP);
-   zram_set_entry(zram, index, entry);
-   zram_set_obj_size(zram, index, comp_len);
-   zram_slot_unlock(zram, index);
+   flags = ZRAM_DUP;
atomic64_add(comp_len, &zram->stats.dup_data_size);
-   atomic64_inc(&zram->stats.pages_stored);
-   return 0;
+   goto out;
}
 
zstrm = zcomp_stream_get(zram->comp);
@@ -835,19 +814,24 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
zs_unmap_object(zram->mem_pool, zram_entry_handle(zram, entry));
zram_dedup_insert(zram, entry, checksum);
 
+out:
+   zram_slot_lock(zram, index);
/*
 * Free memory associated with this sector
 * before overwriting unused sectors.
 */
-   zram_slot_lock(zram, index);
zram_free_page(zram, index);
-   zram_set_entry(zram, index, entry);
-   zram_set_obj_size(zram, index, comp_len);
+   if (flags)
+   zram_set_flag(zram, index, flags);
+   if (flags != ZRAM_SAME) {
+   zram_set_obj_size(zram, index, comp_len);
+   zram_set_entry(zram, index, entry);
+   } else {
+   zram_set_element(zram, index, element);
+   }
zram_slot_unlock(zram, index);
-
-   /* Update stats */
-   atomic64_add(comp_len, &zram->stats.compr_data_size);
atomic64_inc(&zram->stats.pages_stored);
+
return 0;
 }
 
-- 
2.7.4



Re: [PATCH] mm/zsmalloc: fix -Wunneeded-internal-declaration warning

2017-05-24 Thread Minchan Kim
On Tue, May 23, 2017 at 10:38:57PM -0700, Nick Desaulniers wrote:
> is_first_page() is only called from the macro VM_BUG_ON_PAGE() which is
> only compiled in as a runtime check when CONFIG_DEBUG_VM is set,
> otherwise is checked at compile time and not actually compiled in.
> 
> Fixes the following warning, found with Clang:
> 
> mm/zsmalloc.c:472:12: warning: function 'is_first_page' is not needed and
> will not be emitted [-Wunneeded-internal-declaration]
> static int is_first_page(struct page *page)
>^
> 
> Signed-off-by: Nick Desaulniers 
Acked-by: Minchan Kim 

Thanks.


Re: [PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-21 Thread Minchan Kim
On Sun, May 21, 2017 at 12:04:27AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 05:32:12PM +0900, Minchan Kim wrote:
> > Is it okay for a block device (especially zram, which is a compressed
> > RAM block device) to return garbage when an ongoing overwrite IO fails?
> > 
> > O_DIRECT write 4 block "aaa.." -> success
> > read  4 block "aaa.." -> success
> > O_DIRECT write 4 block "bbb.." -> fail
> > read  4 block "000.." -> is it okay?
> > 
> > Hope to get an answer from the experts. :)
> 
> It's "okay" as it's what existing real block devices do (at least on a
> sector boundary).  It's not "nice" though, so if you can avoid it,
> please do.

That was my understanding, so I wanted to avoid it for a simple
code refactoring. Your comment helps to confirm that thought.

Thanks, Christoph!



Re: [PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-17 Thread Minchan Kim
Hi Sergey,

On Wed, May 17, 2017 at 06:14:23PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (05/17/17 17:32), Minchan Kim wrote:
> [..]
> > > what we can return now is a `partially updated' data, with some new
> > > and some stale pages. this is quite unlikely to end up anywhere good.
> > > am I wrong?
> > > 
> > > why does `rd block 4' in your case causes Oops? as a worst case scenario?
> > > application does not expect page to be 'all A' at this point. pages are
> > > likely to belong to some mappings/files/etc., and there is likely a data
> > > dependency between them, dunno C++ objects that span across pages or
> > > JPEG images, etc. so returning "new data   new data   stale data" is a bit
> > > fishy.
> > 
> > I thought more about it and start to confuse. :/
> 
> sorry, I'm not sure I see what's the source of your confusion :)
> 
> my point is - we should not let READ succeed if we know that WRITE
> failed. assume JPEG image example,

I don't think we should do it. I will write down my thoughts below. :)

> 
> 
> over-write block 1 aaa->xxx OK
> over-write block 2 bbb->yyy OK
> over-write block 3 ccc->zzz error
> 
> reading that JPEG file
> 
> read block 1 xxx OK
> read block 2 yyy OK
> read block 3 ccc OK   << we should not return OK here. because
>  "xxxyyyccc" is not the correct JPEG file
>  anyway.
> 
> do you agree that telling application that read() succeeded and at
> the same returning corrupted "xxxyyyccc" instead of "xxxyyyzzz" is
> not correct?

I don't agree. I *think* a block device is just a dumb device, so
zram doesn't need to know about any objects from the upper layer.
What zram should consider is basically the read/write success or
failure of an IO unit (maybe a BIO).

So if we assume each step in the example above is a bio unit, I think
there is no problem with returning "xxxyyyccc".

What I meant by "started to get confused" was about atomicity, not the
point above.

I think it's okay to return ccc instead of zzz, but is it okay for
zram to return "000" rather than "ccc" or "zzz"?
My conclusion, after a discussion with one of my FS friends, is that
it's okay.

Let's think about it.

The FS requests a write of "aaa" to block 4 and it fails for some reason
(a H/W failure, or a S/W failure like ENOMEM). The interface to catch
the failure is the function registered via bio_endio, which is normally
handled through AS_EIO by mapping_set_error as well as the PG_error
flag of the page. In this case, the FS assumes block 4 can contain
stale data, not 'zzz' or 'ccc': the device may have broken in the
middle of writing data to the block, and if the block device doesn't
support atomic writes (which I guess is the more common case), it is
only safe to consider the block as containing garbage now, rather than
either the old or the new value.
(I hope I explained my thought well :/)

Having said that, I think everyone would like block devices to support
atomicity (i.e., old or new), so I am reluctant to change the behavior
for a simple refactoring.

Thanks.


Re: [PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-17 Thread Minchan Kim
Hi Sergey,

On Tue, May 16, 2017 at 04:36:17PM +0900, Sergey Senozhatsky wrote:
> On (05/16/17 16:16), Minchan Kim wrote:
> > > but would this be correct? the data is not valid - we failed to store
> > > the valid one. but instead we assure application that read()/swapin/etc.,
> > > depending on the usage scenario, is successful (even though the data is
> > > not what application really expects to see), application tries to use the
> > > data from that page and probably crashes (dunno, for example page 
> > > contained
> > > hash tables with pointers that are not valid anymore, etc. etc.).
> > > 
> > > I'm not optimistic about stale data reads; it basically will look like
> > > data corruption to the application.
> > 
> > Hmm, I don't understand what you are saying.
> > My point is that zram_free_page should be done only if the whole write
> > operation is successful.
> > With your change, the following situation can happen.
> > 
> > write block 4, 'all A' -> success
> > read  block 4, 'all A' verified -> Good
> > write block 4, 'all B' -> but failed with ENOMEM
> > read  block 4  expected 'all A' but 'all 0' -> Oops
> 
> yes. 'all A' in #4 can be incorrect. zram can be used as a block device
> with a file system, and pid that does write op not necessarily does read
> op later. it can be a completely different application. e.g. compilation,
> or anything else.
> 
> suppose PID A does
> 
> wr block 1   all a
> wr block 2   all a + 1
> wr block 3   all a + 2
> wr block 4   all a + 3
> 
> now PID A does
> 
> wr block 1   all m
> wr block 2   all m + 1
> wr block 3   all m + 2
> wr block 4   failed. block still has 'all a + 3'.
> exit
> 
> another application, PID C, reads in the file and tries to do
> something sane with it
> 
> rd block 1   all m
> rd block 2   all m + 1
> rd block 3   all m + 3
> rd block 4   all a + 3  << this is dangerous. we should return
>error from read() here; not stale data.
> 
> 
> what we can return now is a `partially updated' data, with some new
> and some stale pages. this is quite unlikely to end up anywhere good.
> am I wrong?
> 
> why does `rd block 4' in your case causes Oops? as a worst case scenario?
> application does not expect page to be 'all A' at this point. pages are
> likely to belong to some mappings/files/etc., and there is likely a data
> dependency between them, dunno C++ objects that span across pages or
> JPEG images, etc. so returning "new data   new data   stale data" is a bit
> fishy.

I thought more about it and started to get confused. :/
So, let's Cc linux-block and the fs people.

The question is:

Is it okay for a block device (especially zram, which is a compressed
RAM block device) to return garbage when an ongoing overwrite IO fails?

O_DIRECT write 4 block "aaa.." -> success
read  4 block "aaa.." -> success
O_DIRECT write 4 block "bbb.." -> fail
read  4 block "000.." -> is it okay?

Hope to get an answer from the experts. :)


Re: [PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-16 Thread Minchan Kim
On Tue, May 16, 2017 at 02:45:33PM +0900, Sergey Senozhatsky wrote:
> On (05/16/17 14:26), Minchan Kim wrote:
> [..]
> > > +   /*
> > > +* Free memory associated with this sector
> > > +* before overwriting unused sectors.
> > > +*/
> > > +   zram_slot_lock(zram, index);
> > > +   zram_free_page(zram, index);
> > 
> > Hmm, zram_free should happen only if the write is done successfully.
> > Otherwise, we lose the valid data when the write IO fails.
> 
> but would this be correct? the data is not valid - we failed to store
> the valid one. but instead we assure application that read()/swapin/etc.,
> depending on the usage scenario, is successful (even though the data is
> not what application really expects to see), application tries to use the
> data from that page and probably crashes (dunno, for example page contained
> hash tables with pointers that are not valid anymore, etc. etc.).
> 
> I'm not optimistic about stale data reads; it basically will look like
> data corruption to the application.

Hmm, I don't understand what you are saying.
My point is that zram_free_page should be done only if the whole write
operation is successful.
With your change, the following situation can happen.

write block 4, 'all A' -> success
read  block 4, 'all A' verified -> Good
write block 4, 'all B' -> but failed with ENOMEM
read  block 4  expected 'all A' but 'all 0' -> Oops

That is the problem I pointed out.
If I am missing something, could you elaborate a bit?

Thanks!

> 
> > > +
> > > if (zram_same_page_write(zram, index, page))
> > > -   return 0;
> > > +   goto out_unlock;
> > >  
> > > entry = zram_dedup_find(zram, page, &checksum);
> > > if (entry) {
> > > comp_len = entry->len;
> > > +   zram_set_flag(zram, index, ZRAM_DUP);
> >
> > In case of hitting dedup, we shouldn't increase compr_data_size.
> 
> no, we should not. you are right. my "... patch" is incomplete and
> wrong. please don't pay too much attention to it.
> 
> 
> > If we fix above two problems, do you think it's still cleaner?
> > (I don't mean to be reluctant with your suggestion. Just a
> >  real question to know your thought.:)
> 
> do you mean code duplication and stale data read?
> 
> I'd probably prefer to address stale data reads separately.
> but it seems that stale reads fix will re-do parts of your
> 0002 patch and, at least potentially, reduce code duplication.
> 
> so we can go with your 0002 and then stale reads fix will try
> to reduce code duplication (unless we want to have 4 places doing
> the same thing :) )
> 
>   -ss


Re: [PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-15 Thread Minchan Kim
On Tue, May 16, 2017 at 11:36:15AM +0900, Sergey Senozhatsky wrote:
> > > > @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, 
> > > > struct bio_vec *bvec, u32 index)
> > > > entry = zram_dedup_find(zram, page, &checksum);
> > > > if (entry) {
> > > > comp_len = entry->len;
> > > > -   goto found_dup;
> > > > +   zram_slot_lock(zram, index);
> > > > +   zram_free_page(zram, index);
> > > > +   zram_set_flag(zram, index, ZRAM_DUP);
> > > > +   zram_set_entry(zram, index, entry);
> > > > +   zram_set_obj_size(zram, index, comp_len);
> > > > +   zram_slot_unlock(zram, index);
> > > > +   atomic64_add(comp_len, &zram->stats.dup_data_size);
> > > > +   atomic64_inc(&zram->stats.pages_stored);
> > > > +   return 0;
> > > 
> > > hm. that's a somewhat big code duplication. isn't it?
> > 
> > Yub. 3 parts. above part,  zram_same_page_write and tail of 
> > __zram_bvec_write.
> 
> hmm... good question... hardly can think of anything significantly
> better, zram object handling is now a mix of flags, entries,
> ref_counters, etc. etc. may be we can merge some of those ops, if we
> would keep slot locked through the entire __zram_bvec_write(), but
> that does not look attractive.
> 
> something ABSOLUTELY untested and incomplete. not even compile tested (!).
> 99% broken and stupid (!). but there is one thing that it has revealed, so
> thus incomplete. see below:
> 
> ---
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 372602c7da49..b31543c40d54 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -509,11 +509,8 @@ static bool zram_same_page_write(struct zram *zram, u32 
> index,
> if (page_same_filled(mem, &element)) {
> kunmap_atomic(mem);
> /* Free memory associated with this sector now. */
> -   zram_slot_lock(zram, index);
> -   zram_free_page(zram, index);
> zram_set_flag(zram, index, ZRAM_SAME);
> zram_set_element(zram, index, element);
> -   zram_slot_unlock(zram, index);
>  
> atomic64_inc(&zram->stats.same_pages);
> return true;
> @@ -778,7 +775,7 @@ static int zram_compress(struct zram *zram, struct 
> zcomp_strm **zstrm,
>  
>  static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 
> index)
>  {
> -   int ret;
> +   int ret = 0;
> struct zram_entry *entry;
> unsigned int comp_len;
> void *src, *dst;
> @@ -786,12 +783,20 @@ static int __zram_bvec_write(struct zram *zram, struct 
> bio_vec *bvec, u32 index)
> struct page *page = bvec->bv_page;
> u32 checksum;
>  
> +   /*
> +* Free memory associated with this sector
> +* before overwriting unused sectors.
> +*/
> +   zram_slot_lock(zram, index);
> +   zram_free_page(zram, index);

Hmm, zram_free should happen only if the write is done successfully.
Otherwise, we lose the valid data when the write IO fails.

> +
> if (zram_same_page_write(zram, index, page))
> -   return 0;
> +   goto out_unlock;
>  
> entry = zram_dedup_find(zram, page, &checksum);
> if (entry) {
> comp_len = entry->len;
> +   zram_set_flag(zram, index, ZRAM_DUP);

In case of hitting dedup, we shouldn't increase compr_data_size.
If we fix the above two problems, do you think it's still cleaner?
(I don't mean to be resistant to your suggestion. It's just a
 real question to understand your thinking. :)


> goto found_dup;
> }
>  
> @@ -799,7 +804,7 @@ static int __zram_bvec_write(struct zram *zram, struct 
> bio_vec *bvec, u32 index)
> ret = zram_compress(zram, &zstrm, page, &entry, &comp_len);
> if (ret) {
> zcomp_stream_put(zram->comp);
> -   return ret;
> +   goto out_unlock;
> }
>  
> dst = zs_map_object(zram->mem_pool,
> @@ -817,20 +822,16 @@ static int __zram_bvec_write(struct zram *zram, struct 
> bio_vec *bvec, u32 index)
> zram_dedup_insert(zram, entry, checksum);
>  
>  found_dup:
> -   /*
> -* Free memory associated with this sector
> -* before overwriting unused sectors.
> -*/
> -   zram_slot_lock(zram, index);
> -   zram_free_page(zram, index);
> zram_set_entry(zram, index, entry);
> zram_set_obj_size(zram, index, comp_len);
> -   zram_slot_unlock(zram, index);
>  
> /* Update stats */
> atomic64_add(comp_len, &zram->stats.compr_data_size);
> atomic64_inc(&zram->stats.pages_stored);
> -   return 0;
> +
> +out_unlock:
> +   zram_slot_unlock(zram, index);
> +   return ret;
>  }
> 
> ---
> 
> 
> namely,
> that zram_compress() error return path fr

Re: [PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-15 Thread Minchan Kim
Hi Sergey,

On Tue, May 16, 2017 at 10:30:22AM +0900, Sergey Senozhatsky wrote:
> On (05/15/17 16:41), Minchan Kim wrote:
> [..]
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index b885356551e9..8152e405117b 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -624,15 +624,22 @@ static void zram_free_page(struct zram *zram, size_t 
> > index)
> > return;
> > }
> >  
> > +   if (zram_dedup_enabled(zram) &&
> > +   zram_test_flag(zram, index, ZRAM_DUP)) {
> > +   zram_clear_flag(zram, index, ZRAM_DUP);
> > +   atomic64_sub(entry->len, &zram->stats.dup_data_size);
> > +   goto out;
> > +   }
> 
> so that `goto' there is to just jump over ->stats.compr_data_size?

Yub.

> can you sub ->stats.compr_data_size before the `if' and avoid labels?


> 
> > if (!entry)
> > return;
> 
> shouldn't this `if' be moved before `if (zram_dedup_enabled(zram)`?

You mean this?

static void zram_free_page(..) {
	if (zram_test_flag(zram, index, ZRAM_SAME))
		...

	if (!entry)
		return;

	if (zram_dedup_enabled(zram) && zram_test_flag(zram, index, ZRAM_DUP)) {
		zram_clear_flag(ZRAM_DUP);
		atomic64_sub(entry->len, &zram->stats.dup_data_size);
	} else {
		atomic64_sub(zram_get_obj_size(zram, index),
			&zram->stats.compr_data_size);
	}

	zram_entry_free
	zram_set_entry
	zram_set_obj_size
}

> 
> 
> [..]
> > @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct 
> > bio_vec *bvec, u32 index)
> > entry = zram_dedup_find(zram, page, &checksum);
> > if (entry) {
> > comp_len = entry->len;
> > -   goto found_dup;
> > +   zram_slot_lock(zram, index);
> > +   zram_free_page(zram, index);
> > +   zram_set_flag(zram, index, ZRAM_DUP);
> > +   zram_set_entry(zram, index, entry);
> > +   zram_set_obj_size(zram, index, comp_len);
> > +   zram_slot_unlock(zram, index);
> > +   atomic64_add(comp_len, &zram->stats.dup_data_size);
> > +   atomic64_inc(&zram->stats.pages_stored);
> > +   return 0;
> 
> hm. that's a somewhat big code duplication. isn't it?

Yub. 3 parts: the part above, zram_same_page_write, and the tail of
__zram_bvec_write.

Do you have any ideas? Feel free to suggest. :)
Thanks.


[PATCH 1/2] zram: count same page write as page_stored

2017-05-15 Thread Minchan Kim
Regardless of whether it is a same-filled page or not, it is surely a
write and it is stored in zram, so we should increase the pages_stored
stat. Otherwise, the user can see a zero value via mm_stat even though
a lot of pages have been written to zram.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 372602c7da49..b885356551e9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -516,6 +516,7 @@ static bool zram_same_page_write(struct zram *zram, u32 
index,
zram_slot_unlock(zram, index);
 
atomic64_inc(&zram->stats.same_pages);
+   atomic64_inc(&zram->stats.pages_stored);
return true;
}
kunmap_atomic(mem);
@@ -619,6 +620,7 @@ static void zram_free_page(struct zram *zram, size_t index)
zram_clear_flag(zram, index, ZRAM_SAME);
zram_set_element(zram, index, 0);
atomic64_dec(&zram->stats.same_pages);
+   atomic64_dec(&zram->stats.pages_stored);
return;
}
 
-- 
2.7.4
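
To see the effect described in the changelog above, here is a tiny
userspace toy model of the two counters (hypothetical code, not the zram
driver; only the field names same_pages and pages_stored mirror the
patch). Without the one-line fix, a workload that writes nothing but
same-filled pages ends up reporting pages_stored == 0 via mm_stat:

#include <stdio.h>
#include <stdbool.h>

struct stats {
	long same_pages;
	long pages_stored;
};

static void write_page(struct stats *s, bool same_filled, bool fixed)
{
	if (same_filled) {
		s->same_pages++;
		if (fixed)		/* the increment the patch adds */
			s->pages_stored++;
		return;
	}
	s->pages_stored++;		/* normal compressed store */
}

int main(void)
{
	struct stats before = { 0, 0 }, after = { 0, 0 };
	int i;

	/* the user writes 1000 same-filled pages */
	for (i = 0; i < 1000; i++) {
		write_page(&before, true, false);
		write_page(&after, true, true);
	}
	printf("without fix: pages_stored=%ld same_pages=%ld\n",
	       before.pages_stored, before.same_pages);
	printf("with fix:    pages_stored=%ld same_pages=%ld\n",
	       after.pages_stored, after.same_pages);
	return 0;
}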



[PATCH 2/2] zram: do not count duplicated pages as compressed

2017-05-15 Thread Minchan Kim
Compressed pages and deduplicated pages are not the same thing,
so we shouldn't count duplicated pages as compressed pages.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_dedup.c |  4 
 drivers/block/zram/zram_drv.c   | 24 +++-
 drivers/block/zram/zram_drv.h   |  1 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
index 14c4988f8ff7..c15848cc1b31 100644
--- a/drivers/block/zram/zram_dedup.c
+++ b/drivers/block/zram/zram_dedup.c
@@ -101,9 +101,6 @@ static unsigned long zram_dedup_put(struct zram *zram,
entry->refcount--;
if (!entry->refcount)
rb_erase(&entry->rb_node, &hash->rb_root);
-   else
-   atomic64_sub(entry->len, &zram->stats.dup_data_size);
-
spin_unlock(&hash->lock);
 
return entry->refcount;
@@ -127,7 +124,6 @@ static struct zram_entry *__zram_dedup_get(struct zram 
*zram,
 
 again:
entry->refcount++;
-   atomic64_add(entry->len, &zram->stats.dup_data_size);
spin_unlock(&hash->lock);
 
if (prev)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b885356551e9..8152e405117b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -624,15 +624,22 @@ static void zram_free_page(struct zram *zram, size_t 
index)
return;
}
 
+   if (zram_dedup_enabled(zram) &&
+   zram_test_flag(zram, index, ZRAM_DUP)) {
+   zram_clear_flag(zram, index, ZRAM_DUP);
+   atomic64_sub(entry->len, &zram->stats.dup_data_size);
+   goto out;
+   }
+
if (!entry)
return;
 
-   zram_entry_free(zram, entry);
-
atomic64_sub(zram_get_obj_size(zram, index),
&zram->stats.compr_data_size);
-   atomic64_dec(&zram->stats.pages_stored);
+out:
+   zram_entry_free(zram, entry);
 
+   atomic64_dec(&zram->stats.pages_stored);
zram_set_entry(zram, index, NULL);
zram_set_obj_size(zram, index, 0);
 }
@@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
entry = zram_dedup_find(zram, page, &checksum);
if (entry) {
comp_len = entry->len;
-   goto found_dup;
+   zram_slot_lock(zram, index);
+   zram_free_page(zram, index);
+   zram_set_flag(zram, index, ZRAM_DUP);
+   zram_set_entry(zram, index, entry);
+   zram_set_obj_size(zram, index, comp_len);
+   zram_slot_unlock(zram, index);
+   atomic64_add(comp_len, &zram->stats.dup_data_size);
+   atomic64_inc(&zram->stats.pages_stored);
+   return 0;
}
 
zstrm = zcomp_stream_get(zram->comp);
@@ -818,7 +833,6 @@ static int __zram_bvec_write(struct zram *zram, struct 
bio_vec *bvec, u32 index)
zs_unmap_object(zram->mem_pool, zram_entry_handle(zram, entry));
zram_dedup_insert(zram, entry, checksum);
 
-found_dup:
/*
 * Free memory associated with this sector
 * before overwriting unused sectors.
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 0091e23873c1..8ccfdcd8f674 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -64,6 +64,7 @@ static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;
 enum zram_pageflags {
/* Page consists entirely of zeros */
ZRAM_SAME = ZRAM_FLAG_SHIFT,
+   ZRAM_DUP,
ZRAM_ACCESS,/* page is now accessed */
 
__NR_ZRAM_PAGEFLAGS,
-- 
2.7.4



[PATCH 2/2] mm: swap: move anonymous THP split logic to vmscan

2017-05-11 Thread Minchan Kim
add_to_swap aims to allocate swap space (i.e., a swap slot and the
swapcache), so if it fails due to lack of space in the THP case or
similar (e.g., HDD swap that attempts a THP swapout), the *caller*
rather than add_to_swap itself should split the THP page and retry
with base pages, which is more natural.

Cc: Johannes Weiner 
Signed-off-by: Minchan Kim 
---
 include/linux/swap.h |  4 ++--
 mm/swap_state.c  | 23 ++-
 mm/vmscan.c  | 17 -
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f12f67e869f..87cca2169d44 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -359,7 +359,7 @@ extern struct address_space *swapper_spaces[];
>> SWAP_ADDRESS_SPACE_SHIFT])
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
-extern int add_to_swap(struct page *, struct list_head *list);
+extern int add_to_swap(struct page *);
 extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
 extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
 extern void __delete_from_swap_cache(struct page *);
@@ -479,7 +479,7 @@ static inline struct page *lookup_swap_cache(swp_entry_t 
swp)
return NULL;
 }
 
-static inline int add_to_swap(struct page *page, struct list_head *list)
+static inline int add_to_swap(struct page *page)
 {
return 0;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0ad214d7a7ad..9c71b6b2562f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -184,7 +184,7 @@ void __delete_from_swap_cache(struct page *page)
  * Allocate swap space for the page and add the page to the
  * swap cache.  Caller needs to hold the page lock. 
  */
-int add_to_swap(struct page *page, struct list_head *list)
+int add_to_swap(struct page *page)
 {
swp_entry_t entry;
int err;
@@ -192,12 +192,12 @@ int add_to_swap(struct page *page, struct list_head *list)
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageUptodate(page), page);
 
-retry:
entry = get_swap_page(page);
if (!entry.val)
-   goto fail;
+   return 0;
+
if (mem_cgroup_try_charge_swap(page, entry))
-   goto fail_free;
+   goto fail;
 
/*
 * Radix-tree node allocations from PF_MEMALLOC contexts could
@@ -218,23 +218,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 * clear SWAP_HAS_CACHE flag.
 */
-   goto fail_free;
-
-   if (PageTransHuge(page)) {
-   err = split_huge_page_to_list(page, list);
-   if (err) {
-   delete_from_swap_cache(page);
-   return 0;
-   }
-   }
+   goto fail;
 
return 1;
 
-fail_free:
-   put_swap_page(page, entry);
 fail:
-   if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
-   goto retry;
+   put_swap_page(page, entry);
return 0;
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 57268c0c8fcf..767e20856080 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1125,8 +1125,23 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
!PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
-   if (!add_to_swap(page, page_list))
+   if (!add_to_swap(page)) {
+   if (!PageTransHuge(page))
+   goto activate_locked;
+   /* Split THP and swap individual base pages */
+   if (split_huge_page_to_list(page, page_list))
+   goto activate_locked;
+   if (!add_to_swap(page))
+   goto activate_locked;
+   }
+
+   /* XXX: We don't support THP writes */
+   if (PageTransHuge(page) &&
+ split_huge_page_to_list(page, page_list)) {
+   delete_from_swap_cache(page);
goto activate_locked;
+   }
+
may_enter_fs = 1;
 
/* Adding to swap updated mapping */
-- 
2.7.4



[PATCH 1/2] mm: swap: unify swap slot free functions to put_swap_page

2017-05-11 Thread Minchan Kim
Now that get_swap_page takes a struct page and allocates swap space
according to the page size (i.e., normal or THP), it would be
cleaner to introduce put_swap_page as the counterpart of
get_swap_page. It then calls the right swap slot free function
depending on the page's size.

Cc: Johannes Weiner 
Signed-off-by: Minchan Kim 
---
 include/linux/swap.h |  4 ++--
 mm/shmem.c   |  2 +-
 mm/swap_state.c  | 13 +++--
 mm/swapfile.c|  8 
 mm/vmscan.c  |  2 +-
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b60fea3748f8..8f12f67e869f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -393,6 +393,7 @@ static inline long get_nr_swap_pages(void)
 
 extern void si_swapinfo(struct sysinfo *);
 extern swp_entry_t get_swap_page(struct page *page);
+extern void put_swap_page(struct page *page, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, bool cluster, swp_entry_t swp_entries[]);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
@@ -400,7 +401,6 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void put_swap_page(struct page *page, swp_entry_t swp)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 29948d7da172..82158edaefdb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct 
writeback_control *wbc)
 
mutex_unlock(&shmem_swaplist_mutex);
 free_swap:
-   swapcache_free(swap);
+   put_swap_page(page, swap);
 redirty:
set_page_dirty(page);
if (wbc->for_reclaim)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..0ad214d7a7ad 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -231,10 +231,7 @@ int add_to_swap(struct page *page, struct list_head *list)
return 1;
 
 fail_free:
-   if (PageTransHuge(page))
-   swapcache_free_cluster(entry);
-   else
-   swapcache_free(entry);
+   put_swap_page(page, entry);
 fail:
if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
goto retry;
@@ -259,11 +256,7 @@ void delete_from_swap_cache(struct page *page)
__delete_from_swap_cache(page);
spin_unlock_irq(&address_space->tree_lock);
 
-   if (PageTransHuge(page))
-   swapcache_free_cluster(entry);
-   else
-   swapcache_free(entry);
-
+   put_swap_page(page, entry);
page_ref_sub(page, hpage_nr_pages(page));
 }
 
@@ -415,7 +408,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, 
gfp_t gfp_mask,
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 * clear SWAP_HAS_CACHE flag.
 */
-   swapcache_free(entry);
+   put_swap_page(new_page, entry);
} while (err != -ENOMEM);
 
if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596306272059..b65e49428090 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
 }
 #endif /* CONFIG_THP_SWAP */
 
+void put_swap_page(struct page *page, swp_entry_t entry)
+{
+   if (!PageTransHuge(page))
+   swapcache_free(entry);
+   else
+   swapcache_free_cluster(entry);
+}
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
const swp_entry_t *e1 = ent1, *e2 = ent2;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..57268c0c8fcf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, 
struct page *page,
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
-   swapcache_free(swap);
+   put_swap_page(page, swap);
} else {
void (*freepage)(struct page *);
void *shadow = NULL;
-- 
2.7.4



Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-05-11 Thread Minchan Kim
Hi Hannes,

On Thu, May 11, 2017 at 06:40:58AM -0400, Johannes Weiner wrote:
> On Thu, May 11, 2017 at 10:22:13AM +0900, Minchan Kim wrote:
> > On Thu, May 11, 2017 at 08:25:56AM +0900, Minchan Kim wrote:
> > > On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
> > > > Hi Michan,
> > > > 
> > > > On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> > > > > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
> > > > >  /*
> > > > >   * Called after dropping swapcache to decrease refcnt to swap 
> > > > > entries.
> > > > >   */
> > > > > -void swapcache_free(swp_entry_t entry)
> > > > > +void __swapcache_free(swp_entry_t entry)
> > > > >  {
> > > > >   struct swap_info_struct *p;
> > > > >  
> > > > > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
> > > > >  }
> > > > >  
> > > > >  #ifdef CONFIG_THP_SWAP
> > > > > -void swapcache_free_cluster(swp_entry_t entry)
> > > > > +void __swapcache_free_cluster(swp_entry_t entry)
> > > > >  {
> > > > >   unsigned long offset = swp_offset(entry);
> > > > >   unsigned long idx = offset / SWAPFILE_CLUSTER;
> > > > > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
> > > > >  }
> > > > >  #endif /* CONFIG_THP_SWAP */
> > > > >  
> > > > > +void swapcache_free(struct page *page, swp_entry_t entry)
> > > > > +{
> > > > > + if (!PageTransHuge(page))
> > > > > + __swapcache_free(entry);
> > > > > + else
> > > > > + __swapcache_free_cluster(entry);
> > > > > +}
> > > > 
> > > > I don't think this is cleaner :/
> > 
> > Let's see a example add_to_swap. Without it, it looks like that.
> > 
> > int add_to_swap(struct page *page)
> > {
> > entry = get_swap_page(page);
> > ..
> > ..
> > fail:
> > if (PageTransHuge(page))
> > swapcache_free_cluster(entry);
> > else
> > swapcache_free(entry);
> > }
> > 
> > It doesn't looks good to me because get_swap_page hides
> > where entry allocation is from cluster or slot but when
> > we free the entry allocated, we should be aware of the
> > internal and call right function. :(
> 
> This could be nicer indeed. I just don't like the underscore versions
> much, but symmetry with get_swap_page() would be nice.
> 
> How about put_swap_page()? :)

Good idea. It's the best one I can come up with now.
Actually, get_swap_page is awkward to me. Maybe it would be nicer to
rename it to get_swap_[slot|entry], but I will postpone that until
someone else is on the same page with me in the future.

> 
> That can call the appropriate swapcache_free function then.

Yub.
Thanks for the review!


Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-05-10 Thread Minchan Kim
On Thu, May 11, 2017 at 08:50:01AM +0800, Huang, Ying wrote:
< snip >

> >> > @@ -1125,8 +1125,28 @@ static unsigned long shrink_page_list(struct 
> >> > list_head *page_list,
> >> >  !PageSwapCache(page)) {
> >> >  if (!(sc->gfp_mask & __GFP_IO))
> >> >  goto keep_locked;
> >> > -if (!add_to_swap(page, page_list))
> >> > +swap_retry:
> >> > +/*
> >> > + * Retry after split if we fail to allocate
> >> > + * swap space of a THP.
> >> > + */
> >> > +if (!add_to_swap(page)) {
> >> > +if (!PageTransHuge(page) ||
> >> > +split_huge_page_to_list(page, 
> >> > page_list))
> >> > +goto activate_locked;
> >> > +goto swap_retry;
> >> > +}
> >> 
> >> This is definitely better.
> >
> > Thanks.
> >
> >> 
> >> However, I think it'd be cleaner without the label here:
> >> 
> >>if (!add_to_swap(page)) {
> >>if (!PageTransHuge(page))
> >>goto activate_locked;
> >>/* Split THP and swap individual base pages */
> >>if (split_huge_page_to_list(page, page_list))
> >>goto activate_locked;
> >>if (!add_to_swap(page))
> >>goto activate_locked;
> >
> > Yes.
> >
> >>}
> >> 
> >> > +/*
> >> > + * Got swap space successfully. But 
> >> > unfortunately,
> >> > + * we don't support a THP page writeout so 
> >> > split it.
> >> > + */
> >> > +if (PageTransHuge(page) &&
> >> > +  split_huge_page_to_list(page, 
> >> > page_list)) {
> >> > +delete_from_swap_cache(page);
> >> >  goto activate_locked;
> >> > +}
> >> 
> >> Pulling this out of add_to_swap() is an improvement for sure. Add an
> >> XXX: before that "we don't support THP writes" comment for good
> >> measure :)
> >
> > Sure.
> >
> > It could be a separate patch which makes add_to_swap clean via
> > removing page_list argument but I hope Huang take/fold it when he
> > resend it because it would be more important with THP swap.
> 
> Sure.  I will take this patch as one patch of the THP swap series.
> Because the first patch of the THP swap series is a little big, I don't
> think it is a good idea to fold this patch into it.  Could you update
> the patch according to Johannes' comments and resend it?

Okay, I will resend this clean-up patch against your patches
after finishing this discussion.

Thanks.


Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-05-10 Thread Minchan Kim
On Thu, May 11, 2017 at 08:25:56AM +0900, Minchan Kim wrote:
> On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
> > Hi Michan,
> > 
> > On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> > > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
> > >  /*
> > >   * Called after dropping swapcache to decrease refcnt to swap entries.
> > >   */
> > > -void swapcache_free(swp_entry_t entry)
> > > +void __swapcache_free(swp_entry_t entry)
> > >  {
> > >   struct swap_info_struct *p;
> > >  
> > > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
> > >  }
> > >  
> > >  #ifdef CONFIG_THP_SWAP
> > > -void swapcache_free_cluster(swp_entry_t entry)
> > > +void __swapcache_free_cluster(swp_entry_t entry)
> > >  {
> > >   unsigned long offset = swp_offset(entry);
> > >   unsigned long idx = offset / SWAPFILE_CLUSTER;
> > > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
> > >  }
> > >  #endif /* CONFIG_THP_SWAP */
> > >  
> > > +void swapcache_free(struct page *page, swp_entry_t entry)
> > > +{
> > > + if (!PageTransHuge(page))
> > > + __swapcache_free(entry);
> > > + else
> > > + __swapcache_free_cluster(entry);
> > > +}
> > 
> > I don't think this is cleaner :/

Let's look at an example, add_to_swap. Without it, it looks like this.

int add_to_swap(struct page *page)
{
entry = get_swap_page(page);
..
..
fail:
if (PageTransHuge(page))
swapcache_free_cluster(entry);
else
swapcache_free(entry);
}

It doesn't look good to me because get_swap_page hides
whether the entry allocation came from a cluster or a slot, but when
we free the allocated entry, we have to be aware of that internal
detail and call the right function. :(

Do you still think it's better?


Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-05-10 Thread Minchan Kim
On Wed, May 10, 2017 at 09:56:54AM -0400, Johannes Weiner wrote:
> Hi Michan,
> 
> On Tue, May 02, 2017 at 08:53:32AM +0900, Minchan Kim wrote:
> > @@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
> >  /*
> >   * Called after dropping swapcache to decrease refcnt to swap entries.
> >   */
> > -void swapcache_free(swp_entry_t entry)
> > +void __swapcache_free(swp_entry_t entry)
> >  {
> > struct swap_info_struct *p;
> >  
> > @@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
> >  }
> >  
> >  #ifdef CONFIG_THP_SWAP
> > -void swapcache_free_cluster(swp_entry_t entry)
> > +void __swapcache_free_cluster(swp_entry_t entry)
> >  {
> > unsigned long offset = swp_offset(entry);
> > unsigned long idx = offset / SWAPFILE_CLUSTER;
> > @@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
> >  }
> >  #endif /* CONFIG_THP_SWAP */
> >  
> > +void swapcache_free(struct page *page, swp_entry_t entry)
> > +{
> > +   if (!PageTransHuge(page))
> > +   __swapcache_free(entry);
> > +   else
> > +   __swapcache_free_cluster(entry);
> > +}
> 
> I don't think this is cleaner :/
> 
> On your second patch:
> 
> > @@ -1125,8 +1125,28 @@ static unsigned long shrink_page_list(struct 
> > list_head *page_list,
> > !PageSwapCache(page)) {
> > if (!(sc->gfp_mask & __GFP_IO))
> > goto keep_locked;
> > -   if (!add_to_swap(page, page_list))
> > +swap_retry:
> > +   /*
> > +* Retry after split if we fail to allocate
> > +* swap space of a THP.
> > +*/
> > +   if (!add_to_swap(page)) {
> > +   if (!PageTransHuge(page) ||
> > +   split_huge_page_to_list(page, page_list))
> > +   goto activate_locked;
> > +   goto swap_retry;
> > +   }
> 
> This is definitely better.

Thanks.

> 
> However, I think it'd be cleaner without the label here:
> 
>   if (!add_to_swap(page)) {
>   if (!PageTransHuge(page))
>   goto activate_locked;
>   /* Split THP and swap individual base pages */
>   if (split_huge_page_to_list(page, page_list))
>   goto activate_locked;
>   if (!add_to_swap(page))
>   goto activate_locked;

Yes.

>   }
> 
> > +   /*
> > +* Got swap space successfully. But unfortunately,
> > +* we don't support a THP page writeout so split it.
> > +*/
> > +   if (PageTransHuge(page) &&
> > + split_huge_page_to_list(page, page_list)) {
> > +   delete_from_swap_cache(page);
> > goto activate_locked;
> > +   }
> 
> Pulling this out of add_to_swap() is an improvement for sure. Add an
> XXX: before that "we don't support THP writes" comment for good
> measure :)

Sure.

It could be a separate patch that cleans up add_to_swap by removing
the page_list argument, but I hope Huang takes/folds it when he
resends his series because it would be more important with THP swap.

Thanks.


[PATCH v2] mm: vmscan: scan until it finds eligible pages

2017-05-10 Thread Minchan Kim
There are premature OOMs happening. Although there is a ton of free
swap and the anonymous LRU lists of eligible zones are large, OOM
happened, because skipping pages in isolate_lru_pages easily makes it
return zero nr_taken, so LRU shrinking does effectively nothing and
just increases the priority aggressively.

This patch makes isolate_lru_pages try to scan pages until it
encounters pages from eligible zones.

Signed-off-by: Minchan Kim 
---
* from v1
  * put more words in the description and code
  * drop unnecessary pages_skipped list flushing

 mm/vmscan.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..e051bf4a1144 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1449,7 +1449,7 @@ static __always_inline void update_lru_sizes(struct 
lruvec *lruvec,
  *
  * Appropriate locks must be held before calling this function.
  *
- * @nr_to_scan:The number of pages to look through on the list.
+ * @nr_to_scan:The number of eligible pages to look through on the 
list.
  * @lruvec:The LRU vector to pull pages from.
  * @dst:   The temp list to put pages on to.
  * @nr_scanned:The number of pages that were scanned.
@@ -1469,11 +1469,13 @@ static unsigned long isolate_lru_pages(unsigned long 
nr_to_scan,
unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
unsigned long skipped = 0;
-   unsigned long scan, nr_pages;
+   unsigned long scan, total_scan, nr_pages;
LIST_HEAD(pages_skipped);
 
-   for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
-   !list_empty(src); scan++) {
+   for (total_scan = scan = 0; scan < nr_to_scan &&
+   nr_taken < nr_to_scan &&
+   !list_empty(src);
+   total_scan++) {
struct page *page;
 
page = lru_to_page(src);
@@ -1487,6 +1489,13 @@ static unsigned long isolate_lru_pages(unsigned long 
nr_to_scan,
continue;
}
 
+   /*
+* Do not count skipped pages because it makes the function to
+* return with none isolated pages if the LRU mostly contains
+* ineligible pages so that VM cannot reclaim any pages and
+* trigger premature OOM.
+*/
+   scan++;
switch (__isolate_lru_page(page, mode)) {
case 0:
nr_pages = hpage_nr_pages(page);
@@ -1524,9 +1533,9 @@ static unsigned long isolate_lru_pages(unsigned long 
nr_to_scan,
skipped += nr_skipped[zid];
}
}
-   *nr_scanned = scan;
+   *nr_scanned = total_scan;
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
-   scan, skipped, nr_taken, mode, lru);
+   total_scan, skipped, nr_taken, mode, lru);
update_lru_sizes(lruvec, lru, nr_zone_taken);
return nr_taken;
 }
-- 
2.7.4



Re: [PATCH] vmscan: scan pages until it finds eligible pages

2017-05-10 Thread Minchan Kim
On Wed, May 10, 2017 at 08:13:12AM +0200, Michal Hocko wrote:
> On Wed 10-05-17 10:46:54, Minchan Kim wrote:
> > On Wed, May 03, 2017 at 08:00:44AM +0200, Michal Hocko wrote:
> [...]
> > > @@ -1486,6 +1486,12 @@ static unsigned long isolate_lru_pages(unsigned 
> > > long nr_to_scan,
> > >   continue;
> > >   }
> > >  
> > > + /*
> > > +  * Do not count skipped pages because we do want to isolate
> > > +  * some pages even when the LRU mostly contains ineligible
> > > +  * pages
> > > +  */
> > 
> > How about adding comment about "why"?
> > 
> > /*
> >  * Do not count skipped pages because it makes the function to return with
> >  * none isolated pages if the LRU mostly contains inelgible pages so that
> >  * VM cannot reclaim any pages and trigger premature OOM.
> >  */
> 
> I am not sure this is necessarily any better. Mentioning a pre-mature
> OOM would require a much better explanation because a first immediate
> question would be "why don't we scan those pages at priority 0". Also
> decision about the OOM is at a different layer and it might change in
> future when this doesn't apply any more. But it is not like I would
> insist...
> 
> > > + scan++;
> > >   switch (__isolate_lru_page(page, mode)) {
> > >   case 0:
> > >   nr_pages = hpage_nr_pages(page);
> > 
> > Confirmed.
> 
> Hmm. I can clearly see how we could skip over too many pages and hit
> small reclaim priorities too quickly but I am still scratching my head
> about how we could hit the OOM killer as a result. The amount of pages
> on the active anonymous list suggests that we are not able to rotate
> pages quickly enough. I have to keep thinking about that.

I explained it, but it seems it was not enough. Let me try again.

The problem is that get_scan_count determines nr_to_scan based only on
eligible zones:

size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
size = size >> sc->priority;

Assume sc->priority is 0 and the LRU list is as follows:

N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

(I.e., a few eligible pages (N) are at the head of the LRU, but the
rest are ineligible pages (H).)

In that case, size becomes 4, so the VM wants to scan 4 pages, but the
4 pages at the tail of the LRU are not eligible pages.
If isolate_lru_pages counts skipped pages toward the scan count, it
cannot reclaim the remaining pages after scanning those 4 pages.

If it helps to understand the problem, I will add this to the
description.
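
To make that scenario concrete, here is a small self-contained
simulation (plain userspace C with made-up names; it is not vmscan.c).
The LRU is scanned from the tail, the four eligible pages (N) sit at the
head behind sixteen ineligible pages (H), and nr_to_scan is 4. Counting
skipped pages toward the scan count isolates nothing, while counting
only eligible pages keeps scanning and isolates the four N pages:

#include <stdio.h>
#include <string.h>

static long isolate(const char *lru, long nr_to_scan, int count_skipped)
{
	long nr_taken = 0, scan = 0, total_scan = 0;
	long i = (long)strlen(lru) - 1;	/* like lru_to_page(): take from the tail */

	while (scan < nr_to_scan && nr_taken < nr_to_scan && i >= 0) {
		char page = lru[i--];

		total_scan++;
		if (page == 'H') {		/* page from an ineligible zone */
			if (count_skipped)
				scan++;
			continue;
		}
		scan++;
		nr_taken++;			/* isolation succeeded */
	}
	printf("%s nr_taken=%ld total_scan=%ld\n",
	       count_skipped ? "old behaviour:" : "patched:      ",
	       nr_taken, total_scan);
	return nr_taken;
}

int main(void)
{
	/* head ............... tail */
	const char *lru = "NNNNHHHHHHHHHHHHHHHH";

	isolate(lru, 4, 1);	/* skipped pages counted -> isolates nothing */
	isolate(lru, 4, 0);	/* only eligible counted -> isolates 4 pages */
	return 0;
}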



Re: [PATCH] vmscan: scan pages until it finds eligible pages

2017-05-09 Thread Minchan Kim
On Wed, May 03, 2017 at 08:00:44AM +0200, Michal Hocko wrote:
> On Wed 03-05-17 13:48:09, Minchan Kim wrote:
> > On Tue, May 02, 2017 at 05:14:36PM +0200, Michal Hocko wrote:
> > > On Tue 02-05-17 23:51:50, Minchan Kim wrote:
> > > > Hi Michal,
> > > > 
> > > > On Tue, May 02, 2017 at 09:54:32AM +0200, Michal Hocko wrote:
> > > > > On Tue 02-05-17 14:14:52, Minchan Kim wrote:
> > > > > > Oops, forgot to add lkml and linux-mm.
> > > > > > Sorry for that.
> > > > > > Send it again.
> > > > > > 
> > > > > > >From 8ddf1c8aa15baf085bc6e8c62ce705459d57ea4c Mon Sep 17 00:00:00 
> > > > > > >2001
> > > > > > From: Minchan Kim 
> > > > > > Date: Tue, 2 May 2017 12:34:05 +0900
> > > > > > Subject: [PATCH] vmscan: scan pages until it founds eligible pages
> > > > > > 
> > > > > > On Tue, May 02, 2017 at 01:40:38PM +0900, Minchan Kim wrote:
> > > > > > There are premature OOM happening. Although there are a ton of free
> > > > > > swap and anonymous LRU list of elgible zones, OOM happened.
> > > > > > 
> > > > > > With investigation, skipping page of isolate_lru_pages makes reclaim
> > > > > > void because it returns zero nr_taken easily so LRU shrinking is
> > > > > > effectively nothing and just increases priority aggressively.
> > > > > > Finally, OOM happens.
> > > > > 
> > > > > I am not really sure I understand the problem you are facing. Could 
> > > > > you
> > > > > be more specific please? What is your configuration etc...
> > > > 
> > > > Sure, KVM guest on x86_64, It has 2G memory and 1G swap and configured
> > > > movablecore=1G to simulate highmem zone.
> > > > Workload is a process consumes 2.2G memory and then random touch the
> > > > address space so it makes lots of swap in/out.
> > > > 
> > > > > 
> > > > > > balloon invoked oom-killer: 
> > > > > > gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), 
> > > > > > nodemask=(null),  order=0, oom_score_adj=0
> > > > > [...]
> > > > > > Node 0 active_anon:1698864kB inactive_anon:261256kB 
> > > > > > active_file:208kB inactive_file:184kB unevictable:0kB 
> > > > > > isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB 
> > > > > > writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB 
> > > > > > all_unreclaimable? no
> > > > > > DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB 
> > > > > > inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB 
> > > > > > writepending:0kB present:15992kB managed:15908kB mlocked:0kB 
> > > > > > slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB 
> > > > > > pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > > > lowmem_reserve[]: 0 992 992 1952
> > > > > > DMA32 free:9088kB min:2048kB low:3064kB high:4080kB 
> > > > > > active_anon:952176kB inactive_anon:0kB active_file:36kB 
> > > > > > inactive_file:0kB unevictable:0kB writepending:88kB 
> > > > > > present:1032192kB managed:1019388kB mlocked:0kB 
> > > > > > slab_reclaimable:13532kB slab_unreclaimable:16460kB 
> > > > > > kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB 
> > > > > > local_pcp:24kB free_cma:0kB
> > > > > > lowmem_reserve[]: 0 0 0 959
> > > > > 
> > > > > Hmm DMA32 has sufficient free memory to allow this order-0 request.
> > > > > Inactive anon lru is basically empty. Why do not we rotate a really
> > > > > large active anon list? Isn't this the primary problem?
> > > > 
> > > > It's a side effect by skipping page logic in isolate_lru_pages
> > > > I mentioned above in changelog.
> > > > 
> > > > The problem is a lot of anonymous memory in movable zone(ie, highmem)
> > > > and non-small memory in DMA32 zone.
> > > 
> > > Such a configuration is questionable on its own. But let't keep this
> > > part alone.
> > 
> > It seems you are misunderstood. It's really common on 32bit.
> 
> Yes, I am not arguing about 32b syst

Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-05-09 Thread Minchan Kim
Hi Huang,

On Fri, Apr 28, 2017 at 08:21:37PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Thu, Apr 27, 2017 at 03:12:34PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying 
> >> >> 
> >> >> In this patch, splitting huge page is delayed from almost the first
> >> >> step of swapping out to after allocating the swap space for the
> >> >> THP (Transparent Huge Page) and adding the THP into the swap cache.
> >> >> This will batch the corresponding operation, thus improve THP swap out
> >> >> throughput.
> >> >> 
> >> >> This is the first step for the THP swap optimization.  The plan is to
> >> >> delay splitting the THP step by step and avoid splitting the THP
> >> >> finally.
> >> >> 
> >> >> The advantages of the THP swap support include:
> >> >> 
> >> >> - Batch the swap operations for the THP and reduce lock
> >> >>   acquiring/releasing, including allocating/freeing the swap space,
> >> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >> >>   space, etc.  This will help to improve the THP swap performance.
> >> >> 
> >> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >> >>   particularly helpful for the swap read, which usually are 4k random
> >> >>   IO.  This will help to improve the THP swap performance.
> >> >> 
> >> >> - It will help the memory fragmentation, especially when the THP is
> >> >>   heavily used by the applications.  The 2M continuous pages will be
> >> >>   free up after the THP swapping out.
> >> >> 
> >> >> - It will improve the THP utilization on the system with the swap
> >> >>   turned on.  Because the speed for khugepaged to collapse the normal
> >> >>   pages into the THP is quite slow.  After the THP is split during the
> >> >>   swapping out, it will take quite long time for the normal pages to
> >> >>   collapse back into the THP after being swapped in.  The high THP
> >> >>   utilization helps the efficiency of the page based memory management
> >> >>   too.
> >> >> 
> >> >> There are some concerns regarding THP swap in, mainly because possible
> >> >> enlarged read/write IO size (for swap in/out) may put more overhead on
> >> >> the storage device.  To deal with that, the THP swap in should be
> >> >> turned on only when necessary.  For example, it can be selected via
> >> >> "always/never/madvise" logic, to be turned on globally, turned off
> >> >> globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
> >> >> 
> >> >> In this patch, one swap cluster is used to hold the contents of each
> >> >> THP swapped out.  So, the size of the swap cluster is changed to that
> >> >> of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
> >> >> other architectures which want such THP swap optimization,
> >> >> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
> >> >> for the architecture.  In effect, this will enlarge swap cluster size
> >> >> by 2 times on x86_64.  Which may make it harder to find a free cluster
> >> >> when the swap space becomes fragmented.  So that, this may reduce the
> >> >> continuous swap space allocation and sequential write in theory.  The
> >> >> performance test in 0day shows no regressions caused by this.
> >> >
> >> > What about other architecures?
> >> >
> >> > I mean THP page size on every architectures would be various.
> >> > If THP page size is much bigger than 2M, the architecture should
> >> > have big swap cluster size for supporting THP swap-out feature.
> >> > It means fast empty-swap cluster consumption so that it can suffer
> >> > from fragmentation easily which causes THP swap void and swap slot
> >> > allocations slow due to not being able to use per-cpu.
> >> >
> >> > What I suggested was contiguous multiple swap cluster allocations
> >> > to meet THP page size. If some of architecure's THP size is 64M
> >> > and SWAP_CLUSTER_SIZE is 2M, it should allocate 32 con

Re: [PATCH] vmscan: scan pages until it finds eligible pages

2017-05-02 Thread Minchan Kim
On Tue, May 02, 2017 at 05:14:36PM +0200, Michal Hocko wrote:
> On Tue 02-05-17 23:51:50, Minchan Kim wrote:
> > Hi Michal,
> > 
> > On Tue, May 02, 2017 at 09:54:32AM +0200, Michal Hocko wrote:
> > > On Tue 02-05-17 14:14:52, Minchan Kim wrote:
> > > > Oops, forgot to add lkml and linux-mm.
> > > > Sorry for that.
> > > > Send it again.
> > > > 
> > > > >From 8ddf1c8aa15baf085bc6e8c62ce705459d57ea4c Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim 
> > > > Date: Tue, 2 May 2017 12:34:05 +0900
> > > > Subject: [PATCH] vmscan: scan pages until it founds eligible pages
> > > > 
> > > > On Tue, May 02, 2017 at 01:40:38PM +0900, Minchan Kim wrote:
> > > > There are premature OOM happening. Although there are a ton of free
> > > > swap and anonymous LRU list of elgible zones, OOM happened.
> > > > 
> > > > With investigation, skipping page of isolate_lru_pages makes reclaim
> > > > void because it returns zero nr_taken easily so LRU shrinking is
> > > > effectively nothing and just increases priority aggressively.
> > > > Finally, OOM happens.
> > > 
> > > I am not really sure I understand the problem you are facing. Could you
> > > be more specific please? What is your configuration etc...
> > 
> > Sure, KVM guest on x86_64, It has 2G memory and 1G swap and configured
> > movablecore=1G to simulate highmem zone.
> > Workload is a process consumes 2.2G memory and then random touch the
> > address space so it makes lots of swap in/out.
> > 
> > > 
> > > > balloon invoked oom-killer: 
> > > > gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), 
> > > > nodemask=(null),  order=0, oom_score_adj=0
> > > [...]
> > > > Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB 
> > > > inactive_file:184kB unevictable:0kB isolated(anon):0kB 
> > > > isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB 
> > > > writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > > > DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB 
> > > > inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB 
> > > > writepending:0kB present:15992kB managed:15908kB mlocked:0kB 
> > > > slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB 
> > > > pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > > > lowmem_reserve[]: 0 992 992 1952
> > > > DMA32 free:9088kB min:2048kB low:3064kB high:4080kB 
> > > > active_anon:952176kB inactive_anon:0kB active_file:36kB 
> > > > inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB 
> > > > managed:1019388kB mlocked:0kB slab_reclaimable:13532kB 
> > > > slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB 
> > > > bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
> > > > lowmem_reserve[]: 0 0 0 959
> > > 
> > > Hmm DMA32 has sufficient free memory to allow this order-0 request.
> > > Inactive anon lru is basically empty. Why do not we rotate a really
> > > large active anon list? Isn't this the primary problem?
> > 
> > It's a side effect by skipping page logic in isolate_lru_pages
> > I mentioned above in changelog.
> > 
> > The problem is a lot of anonymous memory in movable zone(ie, highmem)
> > and non-small memory in DMA32 zone.
> 
> Such a configuration is questionable on its own. But let't keep this
> part alone.

It seems you misunderstood. It's really common on 32bit.
Think of a 2G DRAM system on 32bit: normally it's 1G normal : 1G highmem,
which is almost the same as the configuration I used.

> 
> > In heavy memory pressure,
> > requesting a page in GFP_KERNEL triggers reclaim. VM knows inactive list
> > is low so it tries to deactivate pages. For it, first of all, it tries
> > to isolate pages from active list but there are lots of anonymous pages
> > from movable zone so skipping logic in isolate_lru_pages works. With
> > the result, isolate_lru_pages cannot isolate any eligible pages so
> > reclaim trial is effectively void. It continues to meet OOM.
> 
> But skipped pages should be rotated and we should eventually hit pages
> from the right zone(s). Moreover we should scan the full LRU at priority
> 0 so why exactly we hit the OOM killer?

Yes, full scan in priority 0 but keep it in mind that the number of full
LRU pages to scan is one of eligible pages, no

Re: [PATCH] vmscan: scan pages until it finds eligible pages

2017-05-02 Thread Minchan Kim
Hi Michal,

On Tue, May 02, 2017 at 09:54:32AM +0200, Michal Hocko wrote:
> On Tue 02-05-17 14:14:52, Minchan Kim wrote:
> > Oops, forgot to add lkml and linux-mm.
> > Sorry for that.
> > Send it again.
> > 
> > >From 8ddf1c8aa15baf085bc6e8c62ce705459d57ea4c Mon Sep 17 00:00:00 2001
> > From: Minchan Kim 
> > Date: Tue, 2 May 2017 12:34:05 +0900
> > Subject: [PATCH] vmscan: scan pages until it founds eligible pages
> > 
> > On Tue, May 02, 2017 at 01:40:38PM +0900, Minchan Kim wrote:
> > There are premature OOM happening. Although there are a ton of free
> > swap and anonymous LRU list of elgible zones, OOM happened.
> > 
> > With investigation, skipping page of isolate_lru_pages makes reclaim
> > void because it returns zero nr_taken easily so LRU shrinking is
> > effectively nothing and just increases priority aggressively.
> > Finally, OOM happens.
> 
> I am not really sure I understand the problem you are facing. Could you
> be more specific please? What is your configuration etc...

Sure: a KVM guest on x86_64 with 2G memory and 1G swap, configured with
movablecore=1G to simulate a highmem zone.
The workload is a process that consumes 2.2G of memory and then randomly
touches the address space, so it causes lots of swap in/out.

> 
> > balloon invoked oom-killer: 
> > gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), 
> > nodemask=(null),  order=0, oom_score_adj=0
> [...]
> > Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB 
> > inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> > mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB 
> > unstable:0kB all_unreclaimable? no
> > DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB 
> > inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB 
> > writepending:0kB present:15992kB managed:15908kB mlocked:0kB 
> > slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB 
> > pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 992 992 1952
> > DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB 
> > inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB 
> > writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB 
> > slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB 
> > pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
> > lowmem_reserve[]: 0 0 0 959
> 
> Hmm DMA32 has sufficient free memory to allow this order-0 request.
> Inactive anon lru is basically empty. Why do not we rotate a really
> large active anon list? Isn't this the primary problem?

It's a side effect of the page-skipping logic in isolate_lru_pages
I mentioned above in the changelog.

The problem is a lot of anonymous memory in the movable zone (i.e.,
highmem) and a non-trivial amount of memory in the DMA32 zone. Under
heavy memory pressure, requesting a page with GFP_KERNEL triggers
reclaim. The VM knows the inactive list is low, so it tries to
deactivate pages. To do that, it first tries to isolate pages from the
active list, but there are lots of anonymous pages from the movable
zone, so the skipping logic in isolate_lru_pages kicks in. As a result,
isolate_lru_pages cannot isolate any eligible pages, so the reclaim
attempt is effectively void. It keeps going until it hits OOM.

I'm on a long vacation from today, so please understand if my responses
are slow.


Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-05-01 Thread Minchan Kim
Hi Huang,

On Tue, May 02, 2017 at 01:35:24PM +0800, Huang, Ying wrote:
> Hi, Minchan,
> 
> Minchan Kim  writes:
> 
> > On Fri, Apr 28, 2017 at 09:35:37PM +0800, Huang, Ying wrote:
> >> In fact, during the test, I found the overhead of sort() is comparable
> >> with the performance difference of adding likely()/unlikely() to the
> >> "if" in the function.
> >
> > Huang,
> >
> > This discussion is started from your optimization code:
> >
> > if (nr_swapfiles > 1)
> > sort();
> >
> > I don't have such fast machine so cannot test it. However, you added
> > such optimization code in there so I guess it's *worth* to review so
> > with spending my time, I pointed out what you are missing and
> > suggested a idea to find a compromise.
> 
> Sorry for wasting your time and Thanks a lot for your review and
> suggestion!
> 
> When I started talking this with you, I found there is some measurable
> overhead of sort().  But later when I done more tests, I found the
> measurable overhead is at the same level of likely()/unlikely() compiler
> notation.  So you help me to find that, Thanks again!
> 
> > Now you are saying sort is so fast so no worth to add more logics
> > to avoid the overhead?
> > Then, please just drop that if condition part and instead, sort
> > it unconditionally.
> 
> Now, because we found the overhead of sort() is low, I suggest to put
> minimal effort to avoid it.  Like the original implementation,
> 
>  if (nr_swapfiles > 1)
>  sort();

It might confuse someone in the future and make them send a patch
to fix it, like we discussed. If the logic is not clear and doesn't have
measurable overhead, just drop it, which is simpler and clearer.

> 
> Or, we can make nr_swapfiles more correct as Tim suggested (tracking
> the number of the swap devices during swap on/off).

It might be a better option, but it's still hard to justify the patch
because you said it's hard to measure. Such an optimization patch should
be justified by numbers.
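
As an aside, the lazy-sorting idea mentioned earlier in the thread
("sort only when we actually see prev != p") can be sketched standalone
as below. This is hypothetical userspace code, not mm/swapfile.c; the
real swp_entry_t packs the device index and offset into a single word,
a plain struct is used here only for readability:

#include <stdio.h>
#include <stdlib.h>

struct entry {
	int type;		/* swap device index */
	long offset;
};

static int entry_cmp(const void *a, const void *b)
{
	const struct entry *e1 = a, *e2 = b;

	return (e1->type > e2->type) - (e1->type < e2->type);
}

static void free_entries(struct entry *entries, int n)
{
	int i;

	/* Sort lazily: only when the batch spans more than one device. */
	for (i = 1; i < n; i++) {
		if (entries[i].type != entries[0].type) {
			qsort(entries, n, sizeof(entries[0]), entry_cmp);
			break;
		}
	}
	/* Contiguous runs of one device can now share a single lock. */
	for (i = 0; i < n; i++)
		printf("free type=%d offset=%ld\n",
		       entries[i].type, entries[i].offset);
}

int main(void)
{
	struct entry batch[] = { { 0, 10 }, { 1, 3 }, { 0, 11 }, { 1, 4 } };

	free_entries(batch, 4);
	return 0;
}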


Re: [PATCH] vmscan: scan pages until it finds eligible pages

2017-05-01 Thread Minchan Kim
Oops, forgot to add lkml and linux-mm.
Sorry for that.
Sending it again.

>From 8ddf1c8aa15baf085bc6e8c62ce705459d57ea4c Mon Sep 17 00:00:00 2001
From: Minchan Kim 
Date: Tue, 2 May 2017 12:34:05 +0900
Subject: [PATCH] vmscan: scan pages until it finds eligible pages

On Tue, May 02, 2017 at 01:40:38PM +0900, Minchan Kim wrote:
There are premature OOMs happening. Although there is a ton of free
swap and the anonymous LRU lists of eligible zones are large, OOM happened.

On investigation, the page skipping in isolate_lru_pages makes reclaim
void because it easily returns zero nr_taken, so LRU shrinking does
effectively nothing and just increases the priority aggressively.
Finally, OOM happens.

This patch makes isolate_lru_pages try to scan pages until it
encounters pages from eligible zones, or until too much scanning has
happened (i.e., the node's LRU size).

balloon invoked oom-killer: 
gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), 
nodemask=(null),  order=0, oom_score_adj=0
CPU: 7 PID: 1138 Comm: balloon Not tainted 
4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x65/0x87
 dump_header.isra.19+0x8f/0x20f
 ? preempt_count_add+0x9e/0xb0
 ? _raw_spin_unlock_irqrestore+0x24/0x40
 oom_kill_process+0x21d/0x3f0
 ? has_capability_noaudit+0x17/0x20
 out_of_memory+0xd8/0x390
 __alloc_pages_slowpath+0xbc1/0xc50
 ? anon_vma_interval_tree_insert+0x84/0x90
 __alloc_pages_nodemask+0x1a5/0x1c0
 pte_alloc_one+0x20/0x50
 __pte_alloc+0x1e/0x110
 __handle_mm_fault+0x919/0x960
 handle_mm_fault+0x77/0x120
 __do_page_fault+0x27a/0x550
 trace_do_page_fault+0x43/0x150
 do_async_page_fault+0x2c/0x90
 async_page_fault+0x28/0x30
RIP: 0033:0x7fc4636bacb8
RSP: 002b:7fff97c9c4c0 EFLAGS: 00010202
RAX: 7fc3e818d000 RBX: 7fc4639f8760 RCX: 7fc46372e9ca
RDX: 00101002 RSI: 00101000 RDI: 
RBP: 00100010 R08:  R09: 
R10: 0022 R11: 000a3901 R12: 7fc3e818d010
R13: 00101000 R14: 7fc4639f87b8 R15: 7fc4639f87b8
Mem-Info:
active_anon:424716 inactive_anon:65314 isolated_anon:0
 active_file:52 inactive_file:46 isolated_file:0
 unevictable:0 dirty:27 writeback:0 unstable:0
 slab_reclaimable:3967 slab_unreclaimable:4125
 mapped:133 shmem:43 pagetables:1674 bounce:0
 free:4637 free_pcp:225 free_cma:0
Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB 
inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB 
unstable:0kB all_unreclaimable? no
DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB 
inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB 
writepending:0kB present:15992kB managed:15908kB mlocked:0kB 
slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB 
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 992 992 1952
DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB 
inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB 
writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB 
slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB 
pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
lowmem_reserve[]: 0 0 0 959
Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB 
inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB 
writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB 
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB 
bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB 
(E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 
3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB 
(M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
378 total pagecache pages
17 pages in swap cache
Swap cache stats: add 17325, delete 17302, find 0/27
Free swap  = 978940kB
Total swap = 1048572kB
524157 pages RAM
0 pages HighMem/MovableOnly
12629 pages reserved
0 pages cma reserved
0 pages hwpoisoned
[ pid ]   uid  tgid total_vm  rss nr_ptes nr_pmds swapents oom_score_adj 
name
[  433] 0   433 49045  14   3   82 0 
upstart-udev-br
[  438] 0   438123715  27   3  191 -1000 
systemd-udevd
...

Signed-off-by: Minchan Kim 
---
 mm/vmscan.c | 33 +
 1 file changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2314aca47d12..1fec21d155b3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.

Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-05-01 Thread Minchan Kim
On Fri, Apr 28, 2017 at 09:35:37PM +0800, Huang, Ying wrote:
> In fact, during the test, I found the overhead of sort() is comparable
> with the performance difference of adding likely()/unlikely() to the
> "if" in the function.

Huang,

This discussion started from your optimization code:

if (nr_swapfiles > 1)
sort();

I don't have such a fast machine, so I cannot test it. However, you
added that optimization code there, so I guessed it was *worth*
reviewing; I spent my time pointing out what you were missing and
suggested an idea to find a compromise.

Now you are saying sort() is so fast that it's not worth adding more
logic to avoid the overhead?
Then please just drop that if-condition part and instead sort
unconditionally.


Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-05-01 Thread Minchan Kim
Hi Johannes,

The patch I sent has two clean-ups.

The first part was as follows:

>From 5400dceb3a7739d4e7ff340fc0831e0e1830ec0b Mon Sep 17 00:00:00 2001
From: Minchan Kim 
Date: Fri, 28 Apr 2017 15:04:14 +0900
Subject: [PATCH 1/2] swap: make swapcache_free aware of page size

Now that get_swap_page takes a struct page and allocates swap space
according to the page size (i.e., normal or THP), it would be clearer
to take a struct page in swapcache_free, which is the counterpart of
get_swap_page, without needing an if-else statement on the caller side.

Cc: Johannes Weiner 
Signed-off-by: Minchan Kim 
---
 include/linux/swap.h |  4 ++--
 mm/shmem.c   |  2 +-
 mm/swap_state.c  | 13 +++--
 mm/swapfile.c| 12 ++--
 mm/vmscan.c  |  2 +-
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b60fea3748f8..16c8d2392ddd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -400,7 +400,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void swapcache_free(struct page *page, swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
@@ -459,7 +459,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void swapcache_free(struct page *page, swp_entry_t swp)
 {
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 29948d7da172..ab1802664e97 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1326,7 +1326,7 @@ static int shmem_writepage(struct page *page, struct 
writeback_control *wbc)
 
mutex_unlock(&shmem_swaplist_mutex);
 free_swap:
-   swapcache_free(swap);
+   swapcache_free(page, swap);
 redirty:
set_page_dirty(page);
if (wbc->for_reclaim)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16ff89d058f4..4af44fd4142e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -231,10 +231,7 @@ int add_to_swap(struct page *page, struct list_head *list)
return 1;
 
 fail_free:
-   if (PageTransHuge(page))
-   swapcache_free_cluster(entry);
-   else
-   swapcache_free(entry);
+   swapcache_free(page, entry);
 fail:
if (PageTransHuge(page) && !split_huge_page_to_list(page, list))
goto retry;
@@ -259,11 +256,7 @@ void delete_from_swap_cache(struct page *page)
__delete_from_swap_cache(page);
spin_unlock_irq(&address_space->tree_lock);
 
-   if (PageTransHuge(page))
-   swapcache_free_cluster(entry);
-   else
-   swapcache_free(entry);
-
+   swapcache_free(page, entry);
page_ref_sub(page, hpage_nr_pages(page));
 }
 
@@ -415,7 +408,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, 
gfp_t gfp_mask,
 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 * clear SWAP_HAS_CACHE flag.
 */
-   swapcache_free(entry);
+   swapcache_free(new_page, entry);
} while (err != -ENOMEM);
 
if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596306272059..9496cc3e955a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1144,7 +1144,7 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry)
 {
struct swap_info_struct *p;
 
@@ -1156,7 +1156,7 @@ void swapcache_free(swp_entry_t entry)
 }
 
 #ifdef CONFIG_THP_SWAP
-void swapcache_free_cluster(swp_entry_t entry)
+void __swapcache_free_cluster(swp_entry_t entry)
 {
unsigned long offset = swp_offset(entry);
unsigned long idx = offset / SWAPFILE_CLUSTER;
@@ -1182,6 +1182,14 @@ void swapcache_free_cluster(swp_entry_t entry)
 }
 #endif /* CONFIG_THP_SWAP */
 
+void swapcache_free(struct page *page, swp_entry_t entry)
+{
+   if (!PageTransHuge(page))
+   __swapcache_free(entry);
+   else
+   __swapcache_free_cluster(entry);
+}
+
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
const swp_entry_t *e1 = ent1, *e2 = ent2;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ebf468c5429..0f8ca3d1761d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -708,7 +708,7 @@ static int __remove_mapping(struct address_space *mapping, 
struct page *page,
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
-   swapcache_free(swap);
+   swapcache_free(page, swap);
} else {
void (*freepage)(struct 

Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-04-28 Thread Minchan Kim
On Fri, Apr 28, 2017 at 04:05:26PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Fri, Apr 28, 2017 at 09:09:53AM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Wed, Apr 26, 2017 at 08:42:10PM +0800, Huang, Ying wrote:
> >> >> Minchan Kim  writes:
> >> >> 
> >> >> > On Fri, Apr 21, 2017 at 08:29:30PM +0800, Huang, Ying wrote:
> >> >> >> "Huang, Ying"  writes:
> >> >> >> 
> >> >> >> > Minchan Kim  writes:
> >> >> >> >
> >> >> >> >> On Wed, Apr 19, 2017 at 04:14:43PM +0800, Huang, Ying wrote:
> >> >> >> >>> Minchan Kim  writes:
> >> >> >> >>> 
> >> >> >> >>> > Hi Huang,
> >> >> >> >>> >
> >> >> >> >>> > On Fri, Apr 07, 2017 at 02:49:01PM +0800, Huang, Ying wrote:
> >> >> >> >>> >> From: Huang Ying 
> >> >> >> >>> >> 
> >> >> >> >>> >>  void swapcache_free_entries(swp_entry_t *entries, int n)
> >> >> >> >>> >>  {
> >> >> >> >>> >>   struct swap_info_struct *p, *prev;
> >> >> >> >>> >> @@ -1075,6 +1083,10 @@ void 
> >> >> >> >>> >> swapcache_free_entries(swp_entry_t *entries, int n)
> >> >> >> >>> >>  
> >> >> >> >>> >>   prev = NULL;
> >> >> >> >>> >>   p = NULL;
> >> >> >> >>> >> +
> >> >> >> >>> >> + /* Sort swap entries by swap device, so each lock is 
> >> >> >> >>> >> only taken once. */
> >> >> >> >>> >> + if (nr_swapfiles > 1)
> >> >> >> >>> >> + sort(entries, n, sizeof(entries[0]), 
> >> >> >> >>> >> swp_entry_cmp, NULL);
> >> >> >> >>> >
> >> >> >> >>> > Let's think on other cases.
> >> >> >> >>> >
> >> >> >> >>> > There are two swaps and they are configured by priority so a 
> >> >> >> >>> > swap's usage
> >> >> >> >>> > would be zero unless other swap used up. In case of that, this 
> >> >> >> >>> > sorting
> >> >> >> >>> > is pointless.
> >> >> >> >>> >
> >> >> >> >>> > As well, nr_swapfiles is never decreased so if we enable 
> >> >> >> >>> > multiple
> >> >> >> >>> > swaps and then disable until a swap is remained, this sorting 
> >> >> >> >>> > is
> >> >> >> >>> > pointelss, too.
> >> >> >> >>> >
> >> >> >> >>> > How about lazy sorting approach? IOW, if we found prev != p 
> >> >> >> >>> > and,
> >> >> >> >>> > then we can sort it.
> >> >> >> >>> 
> >> >> >> >>> Yes.  That should be better.  I just don't know whether the added
> >> >> >> >>> complexity is necessary, given the array is short and sort is 
> >> >> >> >>> fast.
> >> >> >> >>
> >> >> >> >> Huh?
> >> >> >> >>
> >> >> >> >> 1. swapon /dev/XXX1
> >> >> >> >> 2. swapon /dev/XXX2
> >> >> >> >> 3. swapoff /dev/XXX2
> >> >> >> >> 4. use only one swap
> >> >> >> >> 5. then, always pointless sort.
> >> >> >> >
> >> >> >> > Yes.  In this situation we will do unnecessary sorting.  What I 
> >> >> >> > don't
> >> >> >> > know is whether the unnecessary sorting will hurt performance in 
> >> >> >> > real
> >> >> >> > life.  I can do some measurement.
> >> >> >> 
> >> >> >> I tested the patch with 1 swap device and 1 process 

Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-04-28 Thread Minchan Kim
On Thu, Apr 27, 2017 at 03:12:34PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Tue, Apr 25, 2017 at 08:56:56PM +0800, Huang, Ying wrote:
> >> From: Huang Ying 
> >> 
> >> In this patch, splitting huge page is delayed from almost the first
> >> step of swapping out to after allocating the swap space for the
> >> THP (Transparent Huge Page) and adding the THP into the swap cache.
> >> This will batch the corresponding operation, thus improve THP swap out
> >> throughput.
> >> 
> >> This is the first step for the THP swap optimization.  The plan is to
> >> delay splitting the THP step by step and avoid splitting the THP
> >> finally.
> >> 
> >> The advantages of the THP swap support include:
> >> 
> >> - Batch the swap operations for the THP and reduce lock
> >>   acquiring/releasing, including allocating/freeing the swap space,
> >>   adding/deleting to/from the swap cache, and writing/reading the swap
> >>   space, etc.  This will help to improve the THP swap performance.
> >> 
> >> - The THP swap space read/write will be 2M sequential IO.  It is
> >>   particularly helpful for the swap read, which usually are 4k random
> >>   IO.  This will help to improve the THP swap performance.
> >> 
> >> - It will help the memory fragmentation, especially when the THP is
> >>   heavily used by the applications.  The 2M continuous pages will be
> >>   free up after the THP swapping out.
> >> 
> >> - It will improve the THP utilization on the system with the swap
> >>   turned on.  Because the speed for khugepaged to collapse the normal
> >>   pages into the THP is quite slow.  After the THP is split during the
> >>   swapping out, it will take quite long time for the normal pages to
> >>   collapse back into the THP after being swapped in.  The high THP
> >>   utilization helps the efficiency of the page based memory management
> >>   too.
> >> 
> >> There are some concerns regarding THP swap in, mainly because possible
> >> enlarged read/write IO size (for swap in/out) may put more overhead on
> >> the storage device.  To deal with that, the THP swap in should be
> >> turned on only when necessary.  For example, it can be selected via
> >> "always/never/madvise" logic, to be turned on globally, turned off
> >> globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
> >> 
> >> In this patch, one swap cluster is used to hold the contents of each
> >> THP swapped out.  So, the size of the swap cluster is changed to that
> >> of the THP (Transparent Huge Page) on x86_64 architecture (512).  For
> >> other architectures which want such THP swap optimization,
> >> ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file
> >> for the architecture.  In effect, this will enlarge swap cluster size
> >> by 2 times on x86_64.  Which may make it harder to find a free cluster
> >> when the swap space becomes fragmented.  So that, this may reduce the
> >> continuous swap space allocation and sequential write in theory.  The
> >> performance test in 0day shows no regressions caused by this.
> >
> > What about other architecures?
> >
> > I mean THP page size on every architectures would be various.
> > If THP page size is much bigger than 2M, the architecture should
> > have big swap cluster size for supporting THP swap-out feature.
> > It means fast empty-swap cluster consumption so that it can suffer
> > from fragmentation easily which causes THP swap void and swap slot
> > allocations slow due to not being able to use per-cpu.
> >
> > What I suggested was contiguous multiple swap cluster allocations
> > to meet THP page size. If some of architecure's THP size is 64M
> > and SWAP_CLUSTER_SIZE is 2M, it should allocate 32 contiguos
> > swap clusters. For that, swap layer need to manage clusters sort
> > in order which would be more overhead in CONFIG_THP_SWAP case
> > but I think it's tradeoff. With that, every architectures can
> > support THP swap easily without arch-specific something.
> 
> That may be a good solution for other architectures.  But I am afraid I
> am not the right person to work on that.  Because I don't know the
> requirement of other architectures, and I have no other architectures
> machines to work on and measure the performance.

IMO, THP swapout is a good thing for every architecture, so I doubt
you need to know every other architecture's requirements.

>

Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-04-28 Thread Minchan Kim
On Fri, Apr 28, 2017 at 09:09:53AM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Wed, Apr 26, 2017 at 08:42:10PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Fri, Apr 21, 2017 at 08:29:30PM +0800, Huang, Ying wrote:
> >> >> "Huang, Ying"  writes:
> >> >> 
> >> >> > Minchan Kim  writes:
> >> >> >
> >> >> >> On Wed, Apr 19, 2017 at 04:14:43PM +0800, Huang, Ying wrote:
> >> >> >>> Minchan Kim  writes:
> >> >> >>> 
> >> >> >>> > Hi Huang,
> >> >> >>> >
> >> >> >>> > On Fri, Apr 07, 2017 at 02:49:01PM +0800, Huang, Ying wrote:
> >> >> >>> >> From: Huang Ying 
> >> >> >>> >> 
> >> >> >>> >>  void swapcache_free_entries(swp_entry_t *entries, int n)
> >> >> >>> >>  {
> >> >> >>> >>  struct swap_info_struct *p, *prev;
> >> >> >>> >> @@ -1075,6 +1083,10 @@ void swapcache_free_entries(swp_entry_t 
> >> >> >>> >> *entries, int n)
> >> >> >>> >>  
> >> >> >>> >>  prev = NULL;
> >> >> >>> >>  p = NULL;
> >> >> >>> >> +
> >> >> >>> >> +/* Sort swap entries by swap device, so each lock is 
> >> >> >>> >> only taken once. */
> >> >> >>> >> +if (nr_swapfiles > 1)
> >> >> >>> >> +sort(entries, n, sizeof(entries[0]), 
> >> >> >>> >> swp_entry_cmp, NULL);
> >> >> >>> >
> >> >> >>> > Let's think on other cases.
> >> >> >>> >
> >> >> >>> > There are two swaps and they are configured by priority so a 
> >> >> >>> > swap's usage
> >> >> >>> > would be zero unless other swap used up. In case of that, this 
> >> >> >>> > sorting
> >> >> >>> > is pointless.
> >> >> >>> >
> >> >> >>> > As well, nr_swapfiles is never decreased so if we enable multiple
> >> >> >>> > swaps and then disable until a swap is remained, this sorting is
> >> >> >>> > pointelss, too.
> >> >> >>> >
> >> >> >>> > How about lazy sorting approach? IOW, if we found prev != p and,
> >> >> >>> > then we can sort it.
> >> >> >>> 
> >> >> >>> Yes.  That should be better.  I just don't know whether the added
> >> >> >>> complexity is necessary, given the array is short and sort is fast.
> >> >> >>
> >> >> >> Huh?
> >> >> >>
> >> >> >> 1. swapon /dev/XXX1
> >> >> >> 2. swapon /dev/XXX2
> >> >> >> 3. swapoff /dev/XXX2
> >> >> >> 4. use only one swap
> >> >> >> 5. then, always pointless sort.
> >> >> >
> >> >> > Yes.  In this situation we will do unnecessary sorting.  What I don't
> >> >> > know is whether the unnecessary sorting will hurt performance in real
> >> >> > life.  I can do some measurement.
> >> >> 
> >> >> I tested the patch with 1 swap device and 1 process to eat memory
> >> >> (remove the "if (nr_swapfiles > 1)" for test).  I think this is the
> >> >> worse case because there is no lock contention.  The memory freeing time
> >> >> increased from 1.94s to 2.12s (increase ~9.2%).  So there is some
> >> >> overhead for some cases.  I change the algorithm to something like
> >> >> below,
> >> >> 
> >> >>  void swapcache_free_entries(swp_entry_t *entries, int n)
> >> >>  {
> >> >> struct swap_info_struct *p, *prev;
> >> >> int i;
> >> >> +   swp_entry_t entry;
> >> >> +   unsigned int prev_swp_type;
> >> >>  
> >> >> if (n <= 0)
> >> >> return;
> >> >>  
> >> >> + 

Re: [PATCH -mm -v10 1/3] mm, THP, swap: Delay splitting THP during swap out

2017-04-26 Thread Minchan Kim
> normal cases.  If the difference of the number of the free swap
> clusters among multiple swap devices is significant, it is possible
> that some THPs are split earlier than necessary.  For example, this
> could be caused by big size difference among multiple swap devices.
> 
> The swap cache functions is enhanced to support add/delete THP to/from
> the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
> enhanced in the future with multi-order radix tree.  But because we
> will split the THP soon during swapping out, that optimization doesn't
> make much sense for this first step.
> 
> The THP splitting functions are enhanced to support to split THP in
> swap cache during swapping out.  The page lock will be held during
> allocating the swap cluster, adding the THP into the swap cache and
> splitting the THP.  So in the code path other than swapping out, if
> the THP need to be split, the PageSwapCache(THP) will be always false.
> 
> The swap cluster is only available for SSD, so the THP swap
> optimization in this patchset has no effect for HDD.
> 
> With the patch, the swap out throughput improves 11.5% (from about
> 3.73GB/s to about 4.16GB/s) in the vm-scalability swap-w-seq test case
> with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
> device used is a RAM simulated PMEM (persistent memory) device.  To
> test the sequential swapping out, the test case creates 8 processes,
> which sequentially allocate and write to the anonymous pages until the
> RAM and part of the swap device is used up.
> 
> [han...@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
> Signed-off-by: "Huang, Ying" 
> Cc: Andrea Arcangeli 
> Cc: Ebru Akagunduz 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Tejun Heo 
> Cc: Hugh Dickins 
> Cc: Shaohua Li 
> Cc: Minchan Kim 
> Cc: Rik van Riel 
> Cc: cgro...@vger.kernel.org
> Suggested-by: Andrew Morton  [for config option]
> Acked-by: Kirill A. Shutemov  [for changes 
> in huge_memory.c and huge_mm.h]
> Signed-off-by: Johannes Weiner 
> ---
>  arch/x86/Kconfig|   1 +
>  include/linux/page-flags.h  |   7 +-
>  include/linux/swap.h|  25 -
>  include/linux/swap_cgroup.h |   6 +-
>  mm/Kconfig  |  12 +++
>  mm/huge_memory.c|  11 +-
>  mm/memcontrol.c |  50 -
>  mm/shmem.c  |   2 +-
>  mm/swap_cgroup.c|  40 +--
>  mm/swap_slots.c |  16 ++-
>  mm/swap_state.c | 114 
>  mm/swapfile.c   | 256 
> 
>  12 files changed, 375 insertions(+), 165 deletions(-)

< snip >

> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1290,7 +1290,7 @@ static int shmem_writepage(struct page *page, struct 
> writeback_control *wbc)
>   SetPageUptodate(page);
>   }
>  
> - swap = get_swap_page();
> + swap = get_swap_page(page);
>   if (!swap.val)
>   goto redirty;
>  

If the swap device is non-SSD, swap.val could be zero, right?
If so, could we retry like the anonymous page swap-out path does?

>  
> -swp_entry_t get_swap_page(void)
> +swp_entry_t get_swap_page(struct page *page)
>  {
>   swp_entry_t entry, *pentry;
>   struct swap_slots_cache *cache;
>  
> + entry.val = 0;
> +
> + if (PageTransHuge(page)) {
> + if (hpage_nr_pages(page) == SWAPFILE_CLUSTER)
> + get_swap_pages(1, true, &entry);
> + return entry;
> + }
> +


< snip >

>  /**
> @@ -178,20 +192,12 @@ int add_to_swap(struct page *page, struct list_head 
> *list)
>   VM_BUG_ON_PAGE(!PageLocked(page), page);
>   VM_BUG_ON_PAGE(!PageUptodate(page), page);
>  
> - entry = get_swap_page();
> +retry:
> + entry = get_swap_page(page);
>   if (!entry.val)
> - return 0;
> -
> - if (mem_cgroup_try_charge_swap(page, entry)) {
> - swapcache_free(entry);
> - return 0;
> - }
> -
> - if (unlikely(PageTransHuge(page)))
> - if (unlikely(split_huge_page_to_list(page, list))) {
> - swapcache_free(entry);
> - return 0;
> - }
> + goto fail;

So, with non-SSD swap, a THP page *always* fails to get a swp_entry_t
and retries after splitting the page. However, that makes an unnecessary
get_swap_pages call, which is not trivial. If there is no SSD swap, THP
swap-out should be a no-op without adding any performance overhead.
Hmm, but I have no good idea to do it simply. :(
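
A rough, untested sketch of one possible short-circuit, purely for
illustration: it assumes a hypothetical nr_ssd_swapfiles counter (not an
existing kernel symbol) maintained at swapon/swapoff time, so the THP path
can skip the doomed cluster allocation when only rotational swap is
configured.

swp_entry_t get_swap_page(struct page *page)
{
	swp_entry_t entry;

	entry.val = 0;

	if (PageTransHuge(page)) {
		/*
		 * Cluster allocation can only succeed on SSD swap, so skip
		 * the doomed get_swap_pages() call when no SSD swap exists
		 * and let the caller split the THP right away.
		 */
		if (atomic_read(&nr_ssd_swapfiles) &&
		    hpage_nr_pages(page) == SWAPFILE_CLUSTER)
			get_swap_pages(1, true, &entry);
		return entry;
	}

	/* ... normal 4K path as in the quoted patch ... */
	return entry;
}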


Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-04-26 Thread Minchan Kim
On Wed, Apr 26, 2017 at 08:42:10PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Fri, Apr 21, 2017 at 08:29:30PM +0800, Huang, Ying wrote:
> >> "Huang, Ying"  writes:
> >> 
> >> > Minchan Kim  writes:
> >> >
> >> >> On Wed, Apr 19, 2017 at 04:14:43PM +0800, Huang, Ying wrote:
> >> >>> Minchan Kim  writes:
> >> >>> 
> >> >>> > Hi Huang,
> >> >>> >
> >> >>> > On Fri, Apr 07, 2017 at 02:49:01PM +0800, Huang, Ying wrote:
> >> >>> >> From: Huang Ying 
> >> >>> >> 
> >> >>> >>  void swapcache_free_entries(swp_entry_t *entries, int n)
> >> >>> >>  {
> >> >>> >> struct swap_info_struct *p, *prev;
> >> >>> >> @@ -1075,6 +1083,10 @@ void swapcache_free_entries(swp_entry_t 
> >> >>> >> *entries, int n)
> >> >>> >>  
> >> >>> >> prev = NULL;
> >> >>> >> p = NULL;
> >> >>> >> +
> >> >>> >> +   /* Sort swap entries by swap device, so each lock is only taken 
> >> >>> >> once. */
> >> >>> >> +   if (nr_swapfiles > 1)
> >> >>> >> +   sort(entries, n, sizeof(entries[0]), swp_entry_cmp, 
> >> >>> >> NULL);
> >> >>> >
> >> >>> > Let's think on other cases.
> >> >>> >
> >> >>> > There are two swaps and they are configured by priority so a swap's 
> >> >>> > usage
> >> >>> > would be zero unless other swap used up. In case of that, this 
> >> >>> > sorting
> >> >>> > is pointless.
> >> >>> >
> >> >>> > As well, nr_swapfiles is never decreased so if we enable multiple
> >> >>> > swaps and then disable until a swap is remained, this sorting is
> >> >>> > pointelss, too.
> >> >>> >
> >> >>> > How about lazy sorting approach? IOW, if we found prev != p and,
> >> >>> > then we can sort it.
> >> >>> 
> >> >>> Yes.  That should be better.  I just don't know whether the added
> >> >>> complexity is necessary, given the array is short and sort is fast.
> >> >>
> >> >> Huh?
> >> >>
> >> >> 1. swapon /dev/XXX1
> >> >> 2. swapon /dev/XXX2
> >> >> 3. swapoff /dev/XXX2
> >> >> 4. use only one swap
> >> >> 5. then, always pointless sort.
> >> >
> >> > Yes.  In this situation we will do unnecessary sorting.  What I don't
> >> > know is whether the unnecessary sorting will hurt performance in real
> >> > life.  I can do some measurement.
> >> 
> >> I tested the patch with 1 swap device and 1 process to eat memory
> >> (remove the "if (nr_swapfiles > 1)" for test).  I think this is the
> >> worse case because there is no lock contention.  The memory freeing time
> >> increased from 1.94s to 2.12s (increase ~9.2%).  So there is some
> >> overhead for some cases.  I change the algorithm to something like
> >> below,
> >> 
> >>  void swapcache_free_entries(swp_entry_t *entries, int n)
> >>  {
> >>struct swap_info_struct *p, *prev;
> >>int i;
> >> +  swp_entry_t entry;
> >> +  unsigned int prev_swp_type;
> >>  
> >>if (n <= 0)
> >>return;
> >>  
> >> +  prev_swp_type = swp_type(entries[0]);
> >> +  for (i = n - 1; i > 0; i--) {
> >> +  if (swp_type(entries[i]) != prev_swp_type)
> >> +  break;
> >> +  }
> >
> > That's really what I want to avoid. For many swap usecases,
> > it adds unnecessary overhead.
> >
> >> +
> >> +  /* Sort swap entries by swap device, so each lock is only taken once. */
> >> +  if (i)
> >> +  sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
> >>prev = NULL;
> >>p = NULL;
> >>for (i = 0; i < n; ++i) {
> >> -  p = swap_info_get_cont(entries[i], prev);
> >> +  entry = entries[i];
> >> +  p = swap_info_get_cont(entry, prev);
> >>

Re: [PATCH v4 2/4] zram: implement deduplication in zram

2017-04-26 Thread Minchan Kim
On Wed, Apr 26, 2017 at 04:21:47PM +0900, Sergey Senozhatsky wrote:
> On (04/26/17 15:59), Joonsoo Kim wrote:
> [..]
> > > Actually, I found it for the last review cycle but didn't say that
> > > intentionally. Because it is also odd to me that pages_stored isn't
> > > increased for same_pages so I thought we can fix it all.
> > >
> > > I mean:
> > >
> > > * normal page
> > > inc pages_stored
> > > inc compr_data_size
> > > * same_page
> > > inc pages_stored
> > > inc same_pages
> > > * dedup_page
> > > inc pages_stored
> > > inc dup_data_size
> > >
> > > IOW, pages_stored should be increased for every write IO.
> > > But the concern is we have said in zram.txt
> > >
> > >  orig_data_size   uncompressed size of data stored in this disk.
> > >   This excludes same-element-filled pages (same_pages) 
> > > since
> > >   no memory is allocated for them.
> > >
> > > So, we might be too late. :-(
> > > What do you think about it?
> > > If anyone doesn't have any objection, I want to correct it all.
> > 
> > I have no objection.
> > If so, do I need to postpone this patchset until others are fixed?
> 
> this probably will mess with your series a lot. so I don't mind if you or
> Minchan will send stats-fixup patch after the dedup series. may be/preferably
> as the last patch in the series. but if you or Minchan want to fix stats
> first, then I wouldn't mind either. I just don't make a big deal out of those
> stats, a bunch of fun to know numbers. my 5cents.

After Andrew takes the dedup patchset, I will fix it later.
Thanks.


Re: [PATCH v4 2/4] zram: implement deduplication in zram

2017-04-25 Thread Minchan Kim
Hi Sergey and Joonsoo,

On Wed, Apr 26, 2017 at 02:57:03PM +0900, Joonsoo Kim wrote:
> On Wed, Apr 26, 2017 at 11:14:52AM +0900, Sergey Senozhatsky wrote:
> > Hello,
> > 
> > On (04/26/17 09:52), js1...@gmail.com wrote:
> > [..]
> > >   ret = scnprintf(buf, PAGE_SIZE,
> > > - "%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
> > > + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
> > >   orig_size << PAGE_SHIFT,
> > >   (u64)atomic64_read(&zram->stats.compr_data_size),
> > >   mem_used << PAGE_SHIFT,
> > >   zram->limit_pages << PAGE_SHIFT,
> > >   max_used << PAGE_SHIFT,
> > >   (u64)atomic64_read(&zram->stats.same_pages),
> > > - pool_stats.pages_compacted);
> > > + pool_stats.pages_compacted,
> > > + zram_dedup_dup_size(zram),
> > > + zram_dedup_meta_size(zram));
> > 
> > hm... should't we subtract zram_dedup_dup_size(zram) from
> > ->stats.compr_data_size? we don't use extra memory for dedupped
> > pages. or don't inc ->stats.compr_data_size for dedupped pages?
> 
> Hmm... My intention is to keep previous stat as much as possible. User
> can just notice the saving by only checking mem_used.
> 
> However, it's also odd that compr_data_size doesn't show actual
> compressed data size.

Actually, I found it during the last review cycle but intentionally didn't
mention it, because it is also odd to me that pages_stored isn't
increased for same_pages, so I thought we could fix it all at once.

I mean:

* normal page
inc pages_stored
inc compr_data_size
* same_page
inc pages_stored
inc same_pages
* dedup_page
inc pages_stored
inc dup_data_size
 
IOW, pages_stored should be increased for every write IO.
But the concern is that we have already said in zram.txt:

 orig_data_size   uncompressed size of data stored in this disk.
  This excludes same-element-filled pages (same_pages) since
  no memory is allocated for them.

So, we might be too late. :-(
What do you think about it?
If no one has any objection, I want to correct it all.

Thanks.




Re: [PATCH v3 0/4] zram: implement deduplication in zram

2017-04-24 Thread Minchan Kim
Hi Joonsoo,

On Fri, Apr 21, 2017 at 10:14:47AM +0900, js1...@gmail.com wrote:
> From: Joonsoo Kim 
> 
> Changes from v2
> o rebase to latest zram code
> o manage alloc/free of the zram_entry in zram_drv.c
> o remove useless RB_CLEAR_NODE
> o set RO permission tor use_deup sysfs entry if CONFIG_ZRAM_DEDUP=n
> 
> Changes from v1
> o reogranize dedup specific functions
> o support Kconfig on/off
> o update zram documents
> o compare all the entries with same checksum (patch #4)
> 
> This patchset implements deduplication feature in zram. Motivation
> is to save memory usage by zram. There are detailed description
> about motivation and experimental results on patch #2 so please
> refer it.
> 
> Thanks.

To all patches:

Acked-by: Minchan Kim 

If you send a new version due to the trivial stuff I mentioned,
feel free to add my Acked-by to the patchset.

Thanks!



Re: [PATCH v3 3/4] zram: make deduplication feature optional

2017-04-24 Thread Minchan Kim
On Fri, Apr 21, 2017 at 10:14:50AM +0900, js1...@gmail.com wrote:
> From: Joonsoo Kim 
> 
> Benefit of deduplication is dependent on the workload so it's not
> preferable to always enable. Therefore, make it optional in Kconfig
> and device param. Default is 'off'. This option will be beneficial
> for users who use the zram as blockdev and stores build output to it.
> 
> Signed-off-by: Joonsoo Kim 

< snip >

>  
>  static struct attribute *zram_disk_attrs[] = {
>   &dev_attr_disksize.attr,
> @@ -1169,6 +1227,7 @@ static struct attribute *zram_disk_attrs[] = {
>   &dev_attr_mem_used_max.attr,
>   &dev_attr_max_comp_streams.attr,
>   &dev_attr_comp_algorithm.attr,
> + &dev_attr_use_dedup.attr,
>   &dev_attr_io_stat.attr,
>   &dev_attr_mm_stat.attr,
>   &dev_attr_debug_stat.attr,
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index 4b86921..3f7649a 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -134,7 +134,12 @@ struct zram {
>* zram is claimed so open request will be failed
>*/
>   bool claim; /* Protected by bdev->bd_mutex */
> + bool use_dedup;
>  };
>  
> +static inline bool zram_dedup_enabled(struct zram *zram)
> +{
> + return zram->use_dedup;

#ifdef CONFIG_ZRAM_DEDUP
return zram->use_dedup;
#else
return false;
#endif

Otherwise, looks good to me.


Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-04-23 Thread Minchan Kim
On Fri, Apr 21, 2017 at 08:29:30PM +0800, Huang, Ying wrote:
> "Huang, Ying"  writes:
> 
> > Minchan Kim  writes:
> >
> >> On Wed, Apr 19, 2017 at 04:14:43PM +0800, Huang, Ying wrote:
> >>> Minchan Kim  writes:
> >>> 
> >>> > Hi Huang,
> >>> >
> >>> > On Fri, Apr 07, 2017 at 02:49:01PM +0800, Huang, Ying wrote:
> >>> >> From: Huang Ying 
> >>> >> 
> >>> >>  void swapcache_free_entries(swp_entry_t *entries, int n)
> >>> >>  {
> >>> >>struct swap_info_struct *p, *prev;
> >>> >> @@ -1075,6 +1083,10 @@ void swapcache_free_entries(swp_entry_t 
> >>> >> *entries, int n)
> >>> >>  
> >>> >>prev = NULL;
> >>> >>p = NULL;
> >>> >> +
> >>> >> +  /* Sort swap entries by swap device, so each lock is only taken 
> >>> >> once. */
> >>> >> +  if (nr_swapfiles > 1)
> >>> >> +  sort(entries, n, sizeof(entries[0]), swp_entry_cmp, 
> >>> >> NULL);
> >>> >
> >>> > Let's think on other cases.
> >>> >
> >>> > There are two swaps and they are configured by priority so a swap's 
> >>> > usage
> >>> > would be zero unless other swap used up. In case of that, this sorting
> >>> > is pointless.
> >>> >
> >>> > As well, nr_swapfiles is never decreased so if we enable multiple
> >>> > swaps and then disable until a swap is remained, this sorting is
> >>> > pointelss, too.
> >>> >
> >>> > How about lazy sorting approach? IOW, if we found prev != p and,
> >>> > then we can sort it.
> >>> 
> >>> Yes.  That should be better.  I just don't know whether the added
> >>> complexity is necessary, given the array is short and sort is fast.
> >>
> >> Huh?
> >>
> >> 1. swapon /dev/XXX1
> >> 2. swapon /dev/XXX2
> >> 3. swapoff /dev/XXX2
> >> 4. use only one swap
> >> 5. then, always pointless sort.
> >
> > Yes.  In this situation we will do unnecessary sorting.  What I don't
> > know is whether the unnecessary sorting will hurt performance in real
> > life.  I can do some measurement.
> 
> I tested the patch with 1 swap device and 1 process to eat memory
> (remove the "if (nr_swapfiles > 1)" for test).  I think this is the
> worse case because there is no lock contention.  The memory freeing time
> increased from 1.94s to 2.12s (increase ~9.2%).  So there is some
> overhead for some cases.  I change the algorithm to something like
> below,
> 
>  void swapcache_free_entries(swp_entry_t *entries, int n)
>  {
>   struct swap_info_struct *p, *prev;
>   int i;
> + swp_entry_t entry;
> + unsigned int prev_swp_type;
>  
>   if (n <= 0)
>   return;
>  
> + prev_swp_type = swp_type(entries[0]);
> + for (i = n - 1; i > 0; i--) {
> + if (swp_type(entries[i]) != prev_swp_type)
> + break;
> + }

That's really what I want to avoid. For many swap usecases,
it adds unnecessary overhead.

> +
> + /* Sort swap entries by swap device, so each lock is only taken once. */
> + if (i)
> + sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
>   prev = NULL;
>   p = NULL;
>   for (i = 0; i < n; ++i) {
> - p = swap_info_get_cont(entries[i], prev);
> + entry = entries[i];
> + p = swap_info_get_cont(entry, prev);
>   if (p)
> - swap_entry_free(p, entries[i]);
> + swap_entry_free(p, entry);
>   prev = p;
>   }
>   if (p)
> 
> With this patch, the memory freeing time increased from 1.94s to 1.97s.
> I think this is good enough.  Do you think so?

What I mean is as follows (I didn't test it at all):

With this, sort the entries only if we find entries from multiple swap
devices in the current batch. It adds some condition checks for the
single-swap usecase, but that is cheaper than the sorting.
It also adds an [un]lock overhead for the multiple-swap usecase, but
that should be an acceptable compromise for the single-swap usecase,
which is more common.

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f23c56e9be39..0d76a492786f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1073,30 +1073,40 @@ static int swp_entry_cmp(const void *ent1, 
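
Since the posted hunk is cut off here, the following is a rough, untested
sketch of the lazy-sorting idea described above, not the exact code that was
sent: the batch is walked as usual, and only when a second swap device is
actually encountered is the not-yet-freed tail sorted, at the cost of one
extra unlock/lock.

void swapcache_free_entries(swp_entry_t *entries, int n)
{
	struct swap_info_struct *p, *prev;
	bool sorted = false;
	int i;

	if (n <= 0)
		return;

	prev = NULL;
	p = NULL;
	for (i = 0; i < n; ++i) {
		p = swap_info_get_cont(entries[i], prev);
		if (p && prev && p != prev && !sorted) {
			/*
			 * The batch really spans more than one device: drop
			 * the just-taken lock, sort the not-yet-freed tail by
			 * device, and re-take the lock for the new entries[i]
			 * so each remaining device lock is taken only once.
			 */
			spin_unlock(&p->lock);
			sort(entries + i, n - i, sizeof(entries[0]),
			     swp_entry_cmp, NULL);
			sorted = true;
			prev = NULL;
			p = swap_info_get_cont(entries[i], NULL);
		}
		if (p)
			swap_entry_free(p, entries[i]);
		prev = p;
	}
	if (p)
		spin_unlock(&p->lock);
}

Compared with checking nr_swapfiles or pre-scanning the whole array, the
sort (and the extra unlock/lock) is only paid when the batch really mixes
devices.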

Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-04-20 Thread Minchan Kim
On Wed, Apr 19, 2017 at 04:14:43PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > Hi Huang,
> >
> > On Fri, Apr 07, 2017 at 02:49:01PM +0800, Huang, Ying wrote:
> >> From: Huang Ying 
> >> 
> >> To reduce the lock contention of swap_info_struct->lock when freeing
> >> swap entry.  The freed swap entries will be collected in a per-CPU
> >> buffer firstly, and be really freed later in batch.  During the batch
> >> freeing, if the consecutive swap entries in the per-CPU buffer belongs
> >> to same swap device, the swap_info_struct->lock needs to be
> >> acquired/released only once, so that the lock contention could be
> >> reduced greatly.  But if there are multiple swap devices, it is
> >> possible that the lock may be unnecessarily released/acquired because
> >> the swap entries belong to the same swap device are non-consecutive in
> >> the per-CPU buffer.
> >> 
> >> To solve the issue, the per-CPU buffer is sorted according to the swap
> >> device before freeing the swap entries.  Test shows that the time
> >> spent by swapcache_free_entries() could be reduced after the patch.
> >> 
> >> Test the patch via measuring the run time of swap_cache_free_entries()
> >> during the exit phase of the applications use much swap space.  The
> >> results shows that the average run time of swap_cache_free_entries()
> >> reduced about 20% after applying the patch.
> >> 
> >> Signed-off-by: Huang Ying 
> >> Acked-by: Tim Chen 
> >> Cc: Hugh Dickins 
> >> Cc: Shaohua Li 
> >> Cc: Minchan Kim 
> >> Cc: Rik van Riel 
> >> 
> >> v3:
> >> 
> >> - Add some comments in code per Rik's suggestion.
> >> 
> >> v2:
> >> 
> >> - Avoid sort swap entries if there is only one swap device.
> >> ---
> >>  mm/swapfile.c | 12 
> >>  1 file changed, 12 insertions(+)
> >> 
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index 90054f3c2cdc..f23c56e9be39 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -37,6 +37,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  
> >>  #include 
> >>  #include 
> >> @@ -1065,6 +1066,13 @@ void swapcache_free(swp_entry_t entry)
> >>}
> >>  }
> >>  
> >> +static int swp_entry_cmp(const void *ent1, const void *ent2)
> >> +{
> >> +  const swp_entry_t *e1 = ent1, *e2 = ent2;
> >> +
> >> +  return (long)(swp_type(*e1) - swp_type(*e2));
> >> +}
> >> +
> >>  void swapcache_free_entries(swp_entry_t *entries, int n)
> >>  {
> >>struct swap_info_struct *p, *prev;
> >> @@ -1075,6 +1083,10 @@ void swapcache_free_entries(swp_entry_t *entries, 
> >> int n)
> >>  
> >>prev = NULL;
> >>p = NULL;
> >> +
> >> +  /* Sort swap entries by swap device, so each lock is only taken once. */
> >> +  if (nr_swapfiles > 1)
> >> +  sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
> >
> > Let's think on other cases.
> >
> > There are two swaps and they are configured by priority so a swap's usage
> > would be zero unless other swap used up. In case of that, this sorting
> > is pointless.
> >
> > As well, nr_swapfiles is never decreased so if we enable multiple
> > swaps and then disable until a swap is remained, this sorting is
> > pointelss, too.
> >
> > How about lazy sorting approach? IOW, if we found prev != p and,
> > then we can sort it.
> 
> Yes.  That should be better.  I just don't know whether the added
> complexity is necessary, given the array is short and sort is fast.

Huh?

1. swapon /dev/XXX1
2. swapon /dev/XXX2
3. swapoff /dev/XXX2
4. use only one swap
5. then, always pointless sort.

Do not add such bogus code.

Nacked.



Re: copy_page() on a kmalloc-ed page with DEBUG_SLAB enabled (was "zram: do not use copy_page with non-page alinged address")

2017-04-19 Thread Minchan Kim
On Thu, Apr 20, 2017 at 10:45:42AM +0900, Sergey Senozhatsky wrote:
> On (04/19/17 04:51), Matthew Wilcox wrote:
> [..]
> > > > > Another approach is the API does normal thing for non-aligned prefix 
> > > > > and
> > > > > tail space and fast thing for aligned space.
> > > > > Otherwise, it would be happy if the API has WARN_ON non-page SIZE 
> > > > > aligned
> > > > > address.
> > 
> > Why not just use memcpy()?  Is copy_page() significantly faster than
> > memcpy() for a PAGE_SIZE amount of data?
> 
> that's a good point.
> 
> I was going to ask yesterday - do we even need copy_page()? arch that
> provides well optimized copy_page() quite likely provides somewhat
> equally optimized memcpy(). so may be copy_page() is not even needed?

I don't know.

I just found https://download.samba.org/pub/paulus/ols-2003-presentation.pdf
and heard about https://lkml.org/lkml/2017/4/10/1270.



Re: [patch] mm, vmscan: avoid thrashing anon lru when free + file is low

2017-04-19 Thread Minchan Kim
Hi David,

On Wed, Apr 19, 2017 at 04:24:48PM -0700, David Rientjes wrote:
> On Wed, 19 Apr 2017, Minchan Kim wrote:
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 24efcc20af91..5d2f3fa41e92 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2174,8 +2174,17 @@ static void get_scan_count(struct lruvec *lruvec, 
> > struct mem_cgroup *memcg,
> > }
> >  
> > if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
> > -   scan_balance = SCAN_ANON;
> > -   goto out;
> > +   /*
> > +* force SCAN_ANON if inactive anonymous LRU lists of
> > +* eligible zones are enough pages. Otherwise, thrashing
> > +* can be happen on the small anonymous LRU list.
> > +*/
> > +   if (!inactive_list_is_low(lruvec, false, NULL, sc, 
> > false) &&
> > +lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, 
> > sc->reclaim_idx)
> > +   >> sc->priority) {
> > +       scan_balance = SCAN_ANON;
> > +   goto out;
> > +   }
> > }
> > }
> >  
> 
> Hi Minchan,
> 
> This looks good and it correctly biases against SCAN_ANON for my workload 
> that was thrashing the anon lrus.  Feel free to use parts of my changelog 
> if you'd like.

Thanks for the testing!
Considering how hard such a problem is to find, the credit should be entirely
yours, so please send the patch with a detailed description. Feel free to
add my Suggested-by. :)

Thanks!



Re: copy_page() on a kmalloc-ed page with DEBUG_SLAB enabled (was "zram: do not use copy_page with non-page alinged address")

2017-04-18 Thread Minchan Kim
Hello Michal,

On Tue, Apr 18, 2017 at 09:33:07AM +0200, Michal Hocko wrote:
> On Tue 18-04-17 09:03:19, Minchan Kim wrote:
> > On Mon, Apr 17, 2017 at 10:20:42AM -0500, Christoph Lameter wrote:
> > > On Mon, 17 Apr 2017, Sergey Senozhatsky wrote:
> > > 
> > > > Minchan reported that doing copy_page() on a kmalloc(PAGE_SIZE) page
> > > > with DEBUG_SLAB enabled can cause a memory corruption (See below or
> > > > lkml.kernel.org/r/1492042622-12074-2-git-send-email-minc...@kernel.org )
> > > 
> > > Yes the alignment guarantees do not require alignment on a page boundary.
> > > 
> > > The alignment for kmalloc allocations is controlled by KMALLOC_MIN_ALIGN.
> > > Usually this is either double word aligned or cache line aligned.
> > > 
> > > > that's an interesting problem. arm64 copy_page(), for instance, wants 
> > > > src
> > > > and dst to be page aligned, which is reasonable, while generic 
> > > > copy_page(),
> > > > on the contrary, simply does memcpy(). there are, probably, other 
> > > > callpaths
> > > > that do copy_page() on kmalloc-ed pages and I'm wondering if there is 
> > > > some
> > > > sort of a generic fix to the problem.
> > > 
> > > Simple solution is to not allocate pages via the slab allocator but use
> > > the page allocator for this. The page allocator provides proper alignment.
> > > 
> > > There is a reason it is called the page allocator because if you want a
> > > page you use the proper allocator for it.
> 
> Agreed. Using the slab allocator for page sized object is just wasting
> cycles and additional metadata.
> 
> > It would be better if the APIs works with struct page, not address but
> > I can imagine there are many cases where don't have struct page itself
> > and redundant for kmap/kunmap.
> 
> I do not follow. Why would you need kmap for something that is already
> in the kernel space?

Because it can work with highmem pages.

> 
> > Another approach is the API does normal thing for non-aligned prefix and
> > tail space and fast thing for aligned space.
> > Otherwise, it would be happy if the API has WARN_ON non-page SIZE aligned
> > address.
> 
> copy_page is a performance sensitive function and I believe that we do
> those tricks exactly for this purpose. Why would we want to add an
> overhead for the alignment check or WARN_ON when using unaligned
> pointers? I do see that debugging a subtle memory corruption is PITA
> but that doesn't imply we should clobber the hot path IMHO.

What I wanted is VM_WARN_ON, so there would be no overhead for those who
want a really fast kernel.

> 
> A big fat warning for copy_page would be definitely helpful though.

It's better than what we have now, but not everyone reads the comment for
such a simple API (e.g., clear_page(void *mem)). And once it happens,
it's really subtle: for example, you have never seen any bug
without slub debug, so you build a new feature on top of it and it crashes
during testing. To find the bug, you enable slub_debug. Bang:
you hit a new bug that has lurked for a long time.
VM_WARN_ON would be valuable, but I'm okay with any option that has a
better chance of catching the bug, if someone donates their time to fix
it up.

Thanks.


Re: [patch] mm, vmscan: avoid thrashing anon lru when free + file is low

2017-04-18 Thread Minchan Kim
Hi David,

On Tue, Apr 18, 2017 at 02:32:56PM -0700, David Rientjes wrote:
> On Tue, 18 Apr 2017, Minchan Kim wrote:
> 
> > > The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan: do
> > > not swap anon pages just because free+file is low'") reintroduces is to
> > > prefer swapping anonymous memory rather than trashing the file lru.
> > > 
> > > If all anonymous memory is unevictable, however, this insistance on
> > 
> > "unevictable" means hot workingset, not (mlocked and increased refcount
> > by some driver)?
> > I got confused.
> > 
> 
> For my purposes, it's mlocked, but I think this thrashing is possible 
> anytime we fail the file lru heuristic and the evictable anon lrus are 
> very small themselves.  I'll update the changelog to make this explicit.

I understood now. Thanks for clarifying.

> 
> > > Check that enough evictable anon memory is actually on this lruvec before
> > > insisting on SCAN_ANON.  SWAP_CLUSTER_MAX is used as the threshold to
> > > determine if only scanning anon is beneficial.
> > 
> > Why do you use SWAP_CLUSTER_MAX instead of (high wmark + free) like
> > file-backed pages?
> > As considering anonymous pages have more probability to become workingset
> > because they are are mapped, IMO, more {strong or equal} condition than
> > file-LRU would be better to prevent anon LRU thrashing.
> > 
> 
> If the suggestion is checking
> NR_ACTIVE_ANON + NR_INACTIVE_ANON > total_high_wmark pages, it would be a 
> separate heurstic to address a problem that I'm not having :)  My issue is 
> specifically when NR_ACTIVE_FILE + NR_INACTIVE_FILE < total_high_wmark, 
> NR_ACTIVE_ANON + NR_INACTIVE_ANON is very large, but all not on this 
> lruvec's evictable lrus.

I understand it as "all not on eligible LRU lists". Right?
I will write the comments below assuming that understanding is right.

> 
> This is the reason why I chose lruvec_lru_size() rather than per-node 
> statistics.  The argument could also be made for the file lrus in the 
> get_scan_count() heuristic that forces SCAN_ANON, but I have not met such 
> an issue (yet).  I could follow-up with that change or incorporate it into 
> a v2 of this patch if you'd prefer.

I don't think we need to fix that part because the logic is to keep
some amount of file-backed page workingset regardless of eligible
zones. 

> 
> In other words, I want get_scan_count() to not force SCAN_ANON and 
> fallback to SCAN_FRACT, absent other heuristics, if the amount of 
> evictable anon is below a certain threshold for this lruvec.  I 
> arbitrarily chose SWAP_CLUSTER_MAX to be conservative, but I could easily 
> compare to total_high_wmark as well, although I would consider that more 
> aggressive.

I realize your problem now. It's a rather different heuristic, so there is
no need to align it with the file LRU. But SWAP_CLUSTER_MAX is too
conservative, IMHO.

How about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 24efcc20af91..5d2f3fa41e92 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2174,8 +2174,17 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
}
 
if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
-   scan_balance = SCAN_ANON;
-   goto out;
+   /*
+* force SCAN_ANON if inactive anonymous LRU lists of
+* eligible zones are enough pages. Otherwise, thrashing
+* can be happen on the small anonymous LRU list.
+*/
+   if (!inactive_list_is_low(lruvec, false, NULL, sc, 
false) &&
+lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, 
sc->reclaim_idx)
+   >> sc->priority) {
+   scan_balance = SCAN_ANON;
+   goto out;
+   }
}
}
 

Thanks.


Re: [PATCH -mm -v3] mm, swap: Sort swap entries before free

2017-04-17 Thread Minchan Kim
Hi Huang,

On Fri, Apr 07, 2017 at 02:49:01PM +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> To reduce the lock contention of swap_info_struct->lock when freeing
> swap entry.  The freed swap entries will be collected in a per-CPU
> buffer firstly, and be really freed later in batch.  During the batch
> freeing, if the consecutive swap entries in the per-CPU buffer belongs
> to same swap device, the swap_info_struct->lock needs to be
> acquired/released only once, so that the lock contention could be
> reduced greatly.  But if there are multiple swap devices, it is
> possible that the lock may be unnecessarily released/acquired because
> the swap entries belong to the same swap device are non-consecutive in
> the per-CPU buffer.
> 
> To solve the issue, the per-CPU buffer is sorted according to the swap
> device before freeing the swap entries.  Test shows that the time
> spent by swapcache_free_entries() could be reduced after the patch.
> 
> Test the patch via measuring the run time of swap_cache_free_entries()
> during the exit phase of the applications use much swap space.  The
> results shows that the average run time of swap_cache_free_entries()
> reduced about 20% after applying the patch.
> 
> Signed-off-by: Huang Ying 
> Acked-by: Tim Chen 
> Cc: Hugh Dickins 
> Cc: Shaohua Li 
> Cc: Minchan Kim 
> Cc: Rik van Riel 
> 
> v3:
> 
> - Add some comments in code per Rik's suggestion.
> 
> v2:
> 
> - Avoid sort swap entries if there is only one swap device.
> ---
>  mm/swapfile.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 90054f3c2cdc..f23c56e9be39 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1065,6 +1066,13 @@ void swapcache_free(swp_entry_t entry)
>   }
>  }
>  
> +static int swp_entry_cmp(const void *ent1, const void *ent2)
> +{
> + const swp_entry_t *e1 = ent1, *e2 = ent2;
> +
> + return (long)(swp_type(*e1) - swp_type(*e2));
> +}
> +
>  void swapcache_free_entries(swp_entry_t *entries, int n)
>  {
>   struct swap_info_struct *p, *prev;
> @@ -1075,6 +1083,10 @@ void swapcache_free_entries(swp_entry_t *entries, int 
> n)
>  
>   prev = NULL;
>   p = NULL;
> +
> + /* Sort swap entries by swap device, so each lock is only taken once. */
> + if (nr_swapfiles > 1)
> + sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);

Let's think about other cases.

There are two swaps and they are configured by priority, so one swap's usage
would be zero unless the other swap is used up. In that case, this sorting
is pointless.

As well, nr_swapfiles is never decreased, so if we enable multiple
swaps and then disable them until only one swap remains, this sorting is
pointless, too.

How about a lazy sorting approach? IOW, if we find prev != p,
then we can sort.

Thanks.


Re: [PATCH 1/3] zram: fix operator precedence to get offset

2017-04-17 Thread Minchan Kim
On Tue, Apr 18, 2017 at 10:53:10AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (04/18/17 08:53), Minchan Kim wrote:
> > On Mon, Apr 17, 2017 at 07:50:16PM +0900, Sergey Senozhatsky wrote:
> > > Hello Minchan,
> > > 
> > > On (04/17/17 11:14), Minchan Kim wrote:
> > > > On Mon, Apr 17, 2017 at 10:54:29AM +0900, Sergey Senozhatsky wrote:
> > > > > On (04/17/17 10:21), Sergey Senozhatsky wrote:
> > > > > > > However, it should be *fixed* to prevent confusion in future
> > > > > 
> > > > > or may be something like below? can save us some cycles.
> > > > > 
> > > > > remove this calculation
> > > > > 
> > > > > -   offset = sector & (SECTORS_PER_PAGE - 1) << SECTOR_SHIFT;
> > > > > 
> > > > > 
> > > > > and pass 0 to zram_bvec_rw()
> > > > > 
> > > > > -   err = zram_bvec_rw(zram, &bv, index, offset, is_write);
> > > > > +   err = zram_bvec_rw(zram, &bv, index, 0, is_write);
> > > > 
> > > > That was one I wrote but have thought it more.
> > > > 
> > > > Because I suspect fs can submit page-size IO in non-aligned PAGE_SIZE
> > > > sector? For example, it can submit PAGE_SIZE read request from 9 sector.
> > > > Is it possible? I don't know.
> > > > 
> > > > As well, FS can format zram from sector 1, not sector 0? IOW, can't it
> > > > use starting sector as non-page algined sector?
> > > > We can do it via fdisk?
> > > > 
> > > > Anyway, If one of scenario I mentioned is possible, zram_rw_page will
> > > > be broken.
> > > > 
> > > > If it's hard to check all of scenario in this moment, it would be
> > > > better to not remove it and then add WARN_ON(offset) in there.
> > > > 
> > > > While I am writing this, I found this.
> > > > 
> > > > /**
> > > >  * bdev_read_page() - Start reading a page from a block device
> > > >  * @bdev: The device to read the page from
> > > >  * @sector: The offset on the device to read the page to (need not be 
> > > > aligned)
> > > >  * @page: The page to read
> > > >  *
> > > > 
> > > > Hmm,, need investigation but no time.
> > > 
> > > good questions.
> > > 
> > > as far as I can see, we never use 'offset' which we pass to zram_bvec_rw()
> > > from zram_rw_page(). `offset' makes a lot of sense for partial IO, but in
> > > zram_bvec_rw() we always do "bv.bv_len = PAGE_SIZE".
> > > 
> > > so what we have is
> > > 
> > > for READ
> > > 
> > > zram_rw_page()
> > >   bv.bv_len = PAGE_SIZE
> > >   zram_bvec_rw(zram, &bv, index, offset, is_write);
> > >   zram_bvec_read()
> > >   if (is_partial_io(bvec))// always false
> > >   memcpy(user_mem + bvec->bv_offset,
> > >   uncmem + offset,
> > >   bvec->bv_len);
> > > 
> > > 
> > > for WRITE
> > > 
> > > zram_rw_page()
> > >   bv.bv_len = PAGE_SIZE
> > >   zram_bvec_rw(zram, &bv, index, offset, is_write);
> > >   zram_bvec_write()
> > >   if (is_partial_io(bvec))// always false
> > >   memcpy(uncmem + offset,
> > >   user_mem + bvec->bv_offset,
> > >   bvec->bv_len);
> > > 
> > > 
> > > and our is_partial_io() looks at ->bv_len:
> > > 
> > >   bvec->bv_len != PAGE_SIZE;
> > > 
> > > which we set to PAGE_SIZE.
> > > 
> > > so in the existing scheme of things, we never care about 'sector'
> > > passed from zram_rw_page(). and this has worked for us for quite
> > > some time. my call would be -- let's drop zram_rw_page() `sector'
> > > calculation.
> > 
> > I can do but before that, I want to confirm. Ccing Matthew,
> > Summary for Matthew,
> > 
> > I see following comment about the sector from bdev_read_page.
> > 
> > /**
> >  * bdev_read_page() - Start reading a page from a block device
> >  * @bdev: The device to read the p

Re: [patch] mm, vmscan: avoid thrashing anon lru when free + file is low

2017-04-17 Thread Minchan Kim
Hello David,

On Mon, Apr 17, 2017 at 05:06:20PM -0700, David Rientjes wrote:
> The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan: do
> not swap anon pages just because free+file is low'") reintroduces is to
> prefer swapping anonymous memory rather than trashing the file lru.
> 
> If all anonymous memory is unevictable, however, this insistance on

"unevictable" means hot workingset, not (mlocked and increased refcount
by some driver)?
I got confused.

> SCAN_ANON ends up thrashing that lru instead.

Sounds reasonable.

> 
> Check that enough evictable anon memory is actually on this lruvec before
> insisting on SCAN_ANON.  SWAP_CLUSTER_MAX is used as the threshold to
> determine if only scanning anon is beneficial.

Why do you use SWAP_CLUSTER_MAX instead of (high wmark + free) like
for file-backed pages?
Considering that anonymous pages are more likely to become part of the
working set because they are mapped, IMO a condition {stronger than or equal
to} the file-LRU one would be better to prevent anon LRU thrashing.

> 
> Otherwise, fallback to balanced reclaim so the file lru doesn't remain
> untouched.
> 
> Signed-off-by: David Rientjes 
> ---
>  mm/vmscan.c | 41 +++--
>  1 file changed, 23 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2186,26 +2186,31 @@ static void get_scan_count(struct lruvec *lruvec, 
> struct mem_cgroup *memcg,
>* anon pages.  Try to detect this based on file LRU size.

Please update this comment, too.

>*/
>   if (global_reclaim(sc)) {
> - unsigned long pgdatfile;
> - unsigned long pgdatfree;
> - int z;
> - unsigned long total_high_wmark = 0;
> -
> - pgdatfree = sum_zone_node_page_state(pgdat->node_id, 
> NR_FREE_PAGES);
> - pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> -node_page_state(pgdat, NR_INACTIVE_FILE);
> -
> - for (z = 0; z < MAX_NR_ZONES; z++) {
> - struct zone *zone = &pgdat->node_zones[z];
> - if (!managed_zone(zone))
> - continue;
> + anon = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, 
> sc->reclaim_idx) +
> +lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, 
> sc->reclaim_idx);
> + if (likely(anon >= SWAP_CLUSTER_MAX)) {

With high_wmark, we can do this.

if (global_reclaim(sc)) {
pgdatfree = xxx;
pgdatfile = xxx;
total_high_wmark = xxx;

if (pgdatfile + pgdatfree <= total_high_wmark) {
pgdatanon = xxx;
if (pgdatanon + pgdatfree > total_high_wmark) {
scan_balance = SCAN_ANON;
goto out;
}
}
}
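
Filled in with the counters used in the quoted hunk below, plus an analogous
pgdatanon, the pseudocode above would look roughly like this untested
fragment of get_scan_count(); it is an illustration, not the patch that was
actually posted:

	if (global_reclaim(sc)) {
		unsigned long pgdatfile, pgdatfree, pgdatanon;
		unsigned long total_high_wmark = 0;
		int z;

		pgdatfree = sum_zone_node_page_state(pgdat->node_id,
						     NR_FREE_PAGES);
		pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
			    node_page_state(pgdat, NR_INACTIVE_FILE);

		for (z = 0; z < MAX_NR_ZONES; z++) {
			struct zone *zone = &pgdat->node_zones[z];

			if (!managed_zone(zone))
				continue;
			total_high_wmark += high_wmark_pages(zone);
		}

		if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
			/*
			 * Only force SCAN_ANON if there is enough anon to
			 * reclaim; otherwise fall through to balanced reclaim
			 * so a tiny anon LRU is not thrashed.
			 */
			pgdatanon = node_page_state(pgdat, NR_ACTIVE_ANON) +
				    node_page_state(pgdat, NR_INACTIVE_ANON);
			if (pgdatanon + pgdatfree > total_high_wmark) {
				scan_balance = SCAN_ANON;
				goto out;
			}
		}
	}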


> + unsigned long total_high_wmark = 0;
> + unsigned long pgdatfile;
> + unsigned long pgdatfree;
> + int z;
> +
> + pgdatfree = sum_zone_node_page_state(pgdat->node_id,
> +  NR_FREE_PAGES);
> + pgdatfile = node_page_state(pgdat, NR_ACTIVE_FILE) +
> + node_page_state(pgdat, NR_INACTIVE_FILE);
> +
> + for (z = 0; z < MAX_NR_ZONES; z++) {
> + struct zone *zone = &pgdat->node_zones[z];
> + if (!managed_zone(zone))
> + continue;
>  
> - total_high_wmark += high_wmark_pages(zone);
> - }
> + total_high_wmark += high_wmark_pages(zone);
> + }
>  
> - if (unlikely(pgdatfile + pgdatfree <= total_high_wmark)) {
> - scan_balance = SCAN_ANON;
> - goto out;
> + if (unlikely(pgdatfile + pgdatfree <= 
> total_high_wmark)) {
> + scan_balance = SCAN_ANON;
> + goto out;
> + }
>   }
>   }
>  
> 


Re: copy_page() on a kmalloc-ed page with DEBUG_SLAB enabled (was "zram: do not use copy_page with non-page alinged address")

2017-04-17 Thread Minchan Kim
On Mon, Apr 17, 2017 at 10:20:42AM -0500, Christoph Lameter wrote:
> On Mon, 17 Apr 2017, Sergey Senozhatsky wrote:
> 
> > Minchan reported that doing copy_page() on a kmalloc(PAGE_SIZE) page
> > with DEBUG_SLAB enabled can cause a memory corruption (See below or
> > lkml.kernel.org/r/1492042622-12074-2-git-send-email-minc...@kernel.org )
> 
> Yes the alignment guarantees do not require alignment on a page boundary.
> 
> The alignment for kmalloc allocations is controlled by KMALLOC_MIN_ALIGN.
> Usually this is either double word aligned or cache line aligned.
> 
> > that's an interesting problem. arm64 copy_page(), for instance, wants src
> > and dst to be page aligned, which is reasonable, while generic copy_page(),
> > on the contrary, simply does memcpy(). there are, probably, other callpaths
> > that do copy_page() on kmalloc-ed pages and I'm wondering if there is some
> > sort of a generic fix to the problem.
> 
> Simple solution is to not allocate pages via the slab allocator but use
> the page allocator for this. The page allocator provides proper alignment.
> 
> There is a reason it is called the page allocator because if you want a
> page you use the proper allocator for it.

It would be better if the API worked with a struct page, not an address, but
I can imagine there are many cases where we don't have the struct page itself
and kmap/kunmap would be redundant.

Another approach is for the API to do the normal thing for the non-aligned
prefix and tail space and the fast thing for the aligned space.
Otherwise, it would be nice if the API had a WARN_ON for non-PAGE_SIZE-aligned
addresses.
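
As a rough illustration of that last option (a sketch only, with an invented
wrapper name, not existing kernel code):

#include <linux/mm.h>
#include <linux/mmdebug.h>

/*
 * Hypothetical sketch: a debug-only alignment check in front of the arch
 * copy_page().  "checked_copy_page" is an invented name.
 */
static inline void checked_copy_page(void *to, void *from)
{
	/* Catch callers handing in kmalloc()-ed, non-page-aligned buffers. */
	VM_WARN_ON(!PAGE_ALIGNED(to) || !PAGE_ALIGNED(from));
	copy_page(to, from);
}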


Re: [PATCH 1/3] zram: fix operator precedence to get offset

2017-04-17 Thread Minchan Kim
Hi Sergey,

On Mon, Apr 17, 2017 at 07:50:16PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (04/17/17 11:14), Minchan Kim wrote:
> > On Mon, Apr 17, 2017 at 10:54:29AM +0900, Sergey Senozhatsky wrote:
> > > On (04/17/17 10:21), Sergey Senozhatsky wrote:
> > > > > However, it should be *fixed* to prevent confusion in future
> > > 
> > > or may be something like below? can save us some cycles.
> > > 
> > > remove this calculation
> > > 
> > > -   offset = sector & (SECTORS_PER_PAGE - 1) << SECTOR_SHIFT;
> > > 
> > > 
> > > and pass 0 to zram_bvec_rw()
> > > 
> > > -   err = zram_bvec_rw(zram, &bv, index, offset, is_write);
> > > +   err = zram_bvec_rw(zram, &bv, index, 0, is_write);
> > 
> > That was one I wrote but have thought it more.
> > 
> > Because I suspect fs can submit page-size IO in non-aligned PAGE_SIZE
> > sector? For example, it can submit PAGE_SIZE read request from 9 sector.
> > Is it possible? I don't know.
> > 
> > As well, FS can format zram from sector 1, not sector 0? IOW, can't it
> > use starting sector as non-page algined sector?
> > We can do it via fdisk?
> > 
> > Anyway, If one of scenario I mentioned is possible, zram_rw_page will
> > be broken.
> > 
> > If it's hard to check all of scenario in this moment, it would be
> > better to not remove it and then add WARN_ON(offset) in there.
> > 
> > While I am writing this, I found this.
> > 
> > /**
> >  * bdev_read_page() - Start reading a page from a block device
> >  * @bdev: The device to read the page from
> >  * @sector: The offset on the device to read the page to (need not be 
> > aligned)
> >  * @page: The page to read
> >  *
> > 
> > Hmm,, need investigation but no time.
> 
> good questions.
> 
> as far as I can see, we never use 'offset' which we pass to zram_bvec_rw()
> from zram_rw_page(). `offset' makes a lot of sense for partial IO, but in
> zram_bvec_rw() we always do "bv.bv_len = PAGE_SIZE".
> 
> so what we have is
> 
> for READ
> 
> zram_rw_page()
>   bv.bv_len = PAGE_SIZE
>   zram_bvec_rw(zram, &bv, index, offset, is_write);
>   zram_bvec_read()
>   if (is_partial_io(bvec))// always false
>   memcpy(user_mem + bvec->bv_offset,
>   uncmem + offset,
>   bvec->bv_len);
> 
> 
> for WRITE
> 
> zram_rw_page()
>   bv.bv_len = PAGE_SIZE
>   zram_bvec_rw(zram, &bv, index, offset, is_write);
>   zram_bvec_write()
>   if (is_partial_io(bvec))// always false
>   memcpy(uncmem + offset,
>   user_mem + bvec->bv_offset,
>   bvec->bv_len);
> 
> 
> and our is_partial_io() looks at ->bv_len:
> 
>   bvec->bv_len != PAGE_SIZE;
> 
> which we set to PAGE_SIZE.
> 
> so in the existing scheme of things, we never care about 'sector'
> passed from zram_rw_page(). and this has worked for us for quite
> some time. my call would be -- let's drop zram_rw_page() `sector'
> calculation.

I can do that, but before that I want to confirm. Ccing Matthew.
Summary for Matthew:

I see the following comment about the sector argument of bdev_read_page.

/**
 * bdev_read_page() - Start reading a page from a block device
 * @bdev: The device to read the page from
 * @sector: The offset on the device to read the page to (need not be aligned)
 * @page: The page to read
 *

Does it mean that the sector does not need to be PAGE_SIZE aligned?

For example, with 512-byte sectors on a 4K page system (4K = 8 sectors), is

bdev_read_page(bdev, 9, page);

possible for a driver declared as below?

blk_queue_physical_block_size(zram->disk->queue, PAGE_SIZE);
blk_queue_logical_block_size(zram->disk->queue,
ZRAM_LOGICAL_BLOCK_SIZE);

ZRAM_LOGICAL_BLOCK_SIZE is 4K regardless of whether the architecture uses 4K or 64K pages.
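
For reference, a small standalone demonstration of the operator-precedence
problem in the offset calculation this thread is about ("<<" binds tighter
than "&"), assuming 512-byte sectors and 4K pages:

#include <stdio.h>

#define SECTOR_SHIFT		9
#define SECTORS_PER_PAGE	(4096 >> SECTOR_SHIFT)	/* 8 on a 4K-page system */

int main(void)
{
	unsigned long sector = 9;

	/* Parses as sector & ((SECTORS_PER_PAGE - 1) << SECTOR_SHIFT) ... */
	unsigned long actual = sector & (SECTORS_PER_PAGE - 1) << SECTOR_SHIFT;
	/* ... while the intended byte offset within the page is this: */
	unsigned long intended = (sector & (SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;

	printf("actual=%lu intended=%lu\n", actual, intended);	/* 0 vs 512 */
	return 0;
}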

