Re: [PATCH v3 5/5] psi: introduce psi monitor

2019-01-28 Thread Minchan Kim
Hi Suren,

When I reviewed this the first time, it was rather hard to understand due to
the naming, so the comments below are mostly cleanups or minor points.
I'm not strongly against keeping things as-is if you don't think they're helpful.
Feel free to pick the parts you like.

Thanks.

On Thu, Jan 24, 2019 at 01:15:18PM -0800, Suren Baghdasaryan wrote:
> Psi monitor aims to provide a low-latency short-term pressure
> detection mechanism configurable by users. It allows users to
> monitor psi metric growth and trigger events whenever a metric
> rises above a user-defined threshold within a user-defined time window.
> 
> Time window and threshold are both expressed in usecs. Multiple psi
> resources with different thresholds and window sizes can be monitored
> concurrently.
> 
> Psi monitors activate when the system enters a stall state for the monitored
> psi metric and deactivate upon exit from the stall state. While the system
> is in the stall state, psi signal growth is monitored at a rate of 10 times
> per tracking window. The min window size is 500ms, therefore the min monitoring
> interval is 50ms. The max window size is 10s with a monitoring interval of 1s.
> 
> When activated, a psi monitor stays active for at least the duration of one
> tracking window to avoid repeated activations/deactivations when the psi
> signal is bouncing.
> 
> Notifications to the users are rate-limited to one per tracking window.
> 
> Signed-off-by: Suren Baghdasaryan 
> Signed-off-by: Johannes Weiner 
> ---
>  Documentation/accounting/psi.txt | 104 ++
>  include/linux/psi.h  |  10 +
>  include/linux/psi_types.h|  59 
>  kernel/cgroup/cgroup.c   | 107 +-
>  kernel/sched/psi.c   | 562 +--
>  5 files changed, 808 insertions(+), 34 deletions(-)
> 
> diff --git a/Documentation/accounting/psi.txt 
> b/Documentation/accounting/psi.txt
> index b8ca28b60215..6b21c72aa87c 100644
> --- a/Documentation/accounting/psi.txt
> +++ b/Documentation/accounting/psi.txt
> @@ -63,6 +63,107 @@ tracked and exported as well, to allow detection of 
> latency spikes
>  which wouldn't necessarily make a dent in the time averages, or to
>  average trends over custom time frames.
>  
> +Monitoring for pressure thresholds
> +==
> +
> +Users can register triggers and use poll() to be woken up when resource
> +pressure exceeds certain thresholds.
> +
> +A trigger describes the maximum cumulative stall time over a specific
> +time window, e.g. 100ms of total stall time within any 500ms window to
> +generate a wakeup event.
> +
> +To register a trigger, the user has to open the psi interface file under
> +/proc/pressure/ representing the resource to be monitored and write the
> +desired threshold and time window. The open file descriptor should be
> +used to wait for trigger events using select(), poll() or epoll().
> +The following format is used:
> +
> +  <some|full> <stall amount in us> <time window in us>
> +
> +For example writing "some 150000 1000000" into /proc/pressure/memory
> +would add 150ms threshold for partial memory stall measured within
> +1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
> +would add 50ms threshold for full io stall measured within 1sec time window.
> +
> +Triggers can be set on more than one psi metric, and more than one trigger
> +for the same psi metric can be specified. However, for each trigger a separate
> +file descriptor is required to be able to poll it separately from the others;
> +therefore, for each trigger a separate open() syscall should be made even
> +when opening the same psi interface file.
> +
> +Monitors activate only when the system enters a stall state for the monitored
> +psi metric and deactivate upon exit from the stall state. While the system is
> +in the stall state psi signal growth is monitored at a rate of 10 times per
> +tracking window.
> +
> +The kernel accepts window sizes ranging from 500ms to 10s, therefore min
> +monitoring update interval is 50ms and max is 1s.

I hope you can add the reason we decided on these numbers for the min/max.

> +
> +When activated, psi monitor stays active for at least the duration of one
> +tracking window to avoid repeated activations/deactivations when the system is
> +bouncing in and out of the stall state.
> +
> +Notifications to the userspace are rate-limited to one per tracking window.
> +
> +The trigger will de-register when the file descriptor used to define the
> +trigger is closed.
> +
> +Userspace monitor usage example
> +===
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <poll.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +/*
> + * Monitor memory partial stall with 1s tracking window size
> + * and 150ms threshold.
> + */
> +int main() {
> + const char trig[] = "some 150000 1000000";
> + struct pollfd fds;
> + int n;
> +
> + fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
> + if (fds.fd < 0) {
> + printf("/proc/pressure/memory open error: %s\n",
> + strerror(errno));
> + return 1;
> + }
> + fds.events = 

Re: [PATCH] zram: idle writeback fixes and cleanup

2018-12-27 Thread Minchan Kim
Hi Sergey,

On Thu, Dec 27, 2018 at 11:26:24AM +0900, Sergey Senozhatsky wrote:
> On (12/24/18 12:35), Minchan Kim wrote:
> [..]
> > @@ -645,10 +680,13 @@ static ssize_t writeback_store(struct device *dev,
> > bvec.bv_len = PAGE_SIZE;
> > bvec.bv_offset = 0;
> >  
> > -   if (zram->stop_writeback) {
> > +   spin_lock(&zram->wb_limit_lock);
> > +   if (zram->wb_limit_enable && !zram->bd_wb_limit) {
> > +   spin_unlock(&zram->wb_limit_lock);
> > ret = -EIO;
> > break;
> > }
> > +   spin_unlock(&zram->wb_limit_lock);
> [..]
> > @@ -732,11 +771,10 @@ static ssize_t writeback_store(struct device *dev,
> > zram_set_element(zram, index, blk_idx);
> > blk_idx = 0;
> > atomic64_inc(>stats.pages_stored);
> > -   if (atomic64_add_unless(&zram->stats.bd_wb_limit,
> > -   -1 << (PAGE_SHIFT - 12), 0)) {
> > -   if (atomic64_read(&zram->stats.bd_wb_limit) == 0)
> > -   zram->stop_writeback = true;
> > -   }
> > +   spin_lock(&zram->wb_limit_lock);
> > +   if (zram->wb_limit_enable && zram->bd_wb_limit > 0)
> > +   zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12);
> > +   spin_unlock(&zram->wb_limit_lock);
> 
> Do we really need ->wb_limit_lock spinlock? We kinda punch it twice
> in this loop. If someone clears ->wb_limit_enable somewhere in between
> then the worst thing to happen is that we will just write extra page
> to the backing device; not a very big deal to me. Am I missing
> something?

Without the lock, bd_wb_limit store/read would be racy.

CPU A                                               CPU B
if (zram->wb_limit_enable && zram->bd_wb_limit > 0)
                                                    zram->bd_wb_limit = 0
zram->bd_wb_limit -= 1UL << (PAGE_SHIFT - 12)

It makes the limit feature void.

> 
>   -ss


[PATCH] zram: idle writeback fixes and cleanup

2018-12-23 Thread Minchan Kim
This patch includes some fixes and cleanup for idle-page writeback.

1. writeback_limit interface

Now the writeback_limit interface is rather confusing.
For example, once the writeback limit budget is exhausted, the admin sees 0
in /sys/block/zramX/writeback_limit, which at this moment has the same
semantics as a disabled writeback_limit. IOW, the admin cannot tell whether
the zero came from a disabled writeback limit or an exhausted writeback limit.

To make the interface clear, let's separate enabling of the writeback limit
into another knob - /sys/block/zram0/writeback_limit_enable

* before:
  while true :
# to re-enable writeback limit once previous one is used up
echo 0 > /sys/block/zram0/writeback_limit
echo $((200<<20)) > /sys/block/zram0/writeback_limit
..
.. # used up the writeback limit budget

* new
  # To enable writeback limit, from the beginning, admin should
  # enable it.
  echo $((200<<20)) > /sys/block/zram0/writeback_limit
  echo 1 > /sys/block/zram0/writeback_limit_enable
  while true :
echo $((200<<20)) > /sys/block/zram0/writeback_limit
..
.. # used up the writeback limit budget

It's much more straightforward.

2. fix the idle/huge writeback mode condition check

The mode in writeback_store is not a bit operation any more, so there is no
need to use bit operations. Furthermore, the current condition check is broken
in that it writes back every page regardless of huge/idle mode.

3. clean up idle_store

No need to use goto.

Suggested-by: John Dias 
Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram | 11 ++-
 Documentation/blockdev/zram.txt| 74 ---
 drivers/block/zram/zram_drv.c  | 86 --
 drivers/block/zram/zram_drv.h  |  5 +-
 4 files changed, 122 insertions(+), 54 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 9d2339a485c8a..14b2bf2e5105c 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -122,11 +122,18 @@ Contact:  Minchan Kim 
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
 
+What:  /sys/block/zram/writeback_limit_enable
+Date:      November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback_limit_enable file is read-write and specifies
+   enabling of the writeback_limit feature. "1" means enable the
+   feature. "0", the initial state, means no limit.
+
 What:  /sys/block/zram/writeback_limit
 Date:  November 2018
 Contact:   Minchan Kim 
 Description:
The writeback_limit file is read-write and specifies the maximum
amount of writeback ZRAM can do. The limit could be changed
-   in run time and "0" means disable the limit.
-   No limit is the initial state.
+   in run time.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 436c5e98e1b60..4df0ce2710857 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -156,22 +156,23 @@ Per-device statistics are exported as various nodes under 
/sys/block/zram/
 A brief description of exported device attributes. For more details please
 read Documentation/ABI/testing/sysfs-block-zram.
 
-Nameaccessdescription
------
-disksize  RWshow and set the device's disk size
-initstate ROshows the initialization state of the device
-reset WOtrigger device reset
-mem_used_max  WOreset the `mem_used_max' counter (see later)
-mem_limit WOspecifies the maximum amount of memory ZRAM can use
-to store the compressed data
-writeback_limit   WOspecifies the maximum amount of write IO zram can
-   write out to backing device as 4KB unit
-max_comp_streams  RWthe number of possible concurrent compress operations
-comp_algorithmRWshow and change the compression algorithm
-compact   WOtrigger memory compaction
-debug_statROthis file is used for zram debugging purposes
-backing_dev  RWset up backend storage for zram to write out
-idle WOmark allocated slot as idle
+Name   accessdescription
+   -----
+disksize   RW  show and set the device's disk size
+initstate  RO  shows the initialization state of the device
+reset  WO  trigger device reset
+mem_used_max   WO  reset the `mem_used_max' counter (see later)
+mem_limit  WO  specifies the maximum amount of memory ZRAM can 
use
+   to store the compressed data
+writebac

Re: [PATCH v3 7/7] zram: writeback throttle

2018-12-02 Thread Minchan Kim
On Mon, Dec 03, 2018 at 11:30:40AM +0900, Sergey Senozhatsky wrote:
> On (12/03/18 08:18), Minchan Kim wrote:
> > 
> > Per andrew's comment:
> > https://lkml.org/lkml/2018/11/27/156
> > 
> > I need to fix it to represent 4K always.
> 
> Aha.
> 
> Then we need to increase bd_writes PAGE_SIZE/4K times in writeback_store()?
> 
>    wb_count = atomic64_inc_return(&zram->stats.bd_writes);
>    ...
>    if (wb_limit != 0 && wb_count >= wb_limit)
>            zram->stop_writeback = true;
> 
> bd_wb_limit is in 4K units; but in writeback_store() we alloc a full page
> and write it to the backing device. So the actual number of written bytes
> can be larger on systems with page_size > 4K. Right?

Hey Sergey,

I changed the interface in the recent version v4. I believe it would be more
straightforward for the user. Could you review it?

Thanks!


[PATCH v4 7/7] zram: writeback throttle

2018-12-02 Thread Minchan Kim
If there is lots of write IO to a flash device, it could cause a
storage wearout problem. To overcome the problem, the admin needs
to design a write limitation to guarantee flash health
for the entire product life.

This patch creates a new knob "writeback_limit" on zram.

writeback_limit's default value is 0 so that it doesn't limit
any writeback. If the admin wants to measure the writeback count in a
certain period, he can read it via /sys/block/zram0/bd_stat's
3rd column.

If the admin wants to limit writeback to 400M per day, he can do it
like below.

MB_SHIFT=20
4K_SHIFT=12
echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit.

If the admin wants to allow further writes again, he can do it like below

echo 0 > /sys/block/zram0/writeback_limit

If the admin wants to see the remaining writeback budget,

cat /sys/block/zram0/writeback_limit

The writeback_limit count will reset whenever you reset zram (e.g.,
system reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how
much writeback happened until you reset the zram, in order to allocate an
extra writeback budget in the next setting, is the user's job.

Signed-off-by: Minchan Kim 
---

I removed Reviewed-by from Sergey and Joey because I modified interface
since they had reviewed.

 Documentation/ABI/testing/sysfs-block-zram |  9 
 Documentation/blockdev/zram.txt| 31 +
 drivers/block/zram/zram_drv.c  | 52 --
 drivers/block/zram/zram_drv.h  |  2 +
 4 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 65fc33b2f53b..9d2339a485c8 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact:   Minchan Kim 
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
+
+What:  /sys/block/zram/writeback_limit
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback_limit file is read-write and specifies the maximum
+   amount of writeback ZRAM can do. The limit could be changed
+   in run time and "0" means disable the limit.
+   No limit is the initial state.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 906df97527a7..436c5e98e1b6 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset WOtrigger device reset
 mem_used_max  WOreset the `mem_used_max' counter (see later)
 mem_limit WOspecifies the maximum amount of memory ZRAM can use
 to store the compressed data
+writeback_limit   WOspecifies the maximum amount of write IO zram can
+   write out to backing device as 4KB unit
 max_comp_streams  RWthe number of possible concurrent compress operations
 comp_algorithmRWshow and change the compression algorithm
 compact   WOtrigger memory compaction
@@ -275,6 +277,35 @@ Admin can request writeback of those idle pages at right 
timing via
 
 With the command, zram writeback idle pages from memory to the storage.
 
+If there is lots of write IO to a flash device, it can potentially cause
+a flash wearout problem, so the admin needs to design a write limitation
+to guarantee storage health for the entire product life.
+To address the concern, zram supports "writeback_limit".
+The "writeback_limit"'s default value is 0 so that it doesn't limit
+any writeback. If the admin wants to measure the writeback count in a
+certain period, he can read it via /sys/block/zram0/bd_stat's 3rd column.
+
+If the admin wants to limit writeback to 400M per day, he can do it
+like below.
+
+MB_SHIFT=20
+4K_SHIFT=12
+echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
+   /sys/block/zram0/writeback_limit.
+
+If the admin wants to allow further writes again, he can do it like below
+
+echo 0 > /sys/block/zram0/writeback_limit
+
+If the admin wants to see the remaining writeback budget since he set it,
+
+cat /sys/block/zram0/writeback_limit
+
+The writeback_limit count will reset whenever you reset zram (e.g.,
+system reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how
+much writeback happened until you reset the zram, in order to allocate an
+extra writeback budget in the next setting, is the user's job.
+
 = memory tracking
 
 With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f1832fa3ba41..33c5cc879f24 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -330,6 +330,39 @@ static ssize_t idle_store(struct device *dev,
 }
 
 #ifdef CONFIG_Z


[PATCH v4 2/7] zram: fix double free backing device

2018-12-02 Thread Minchan Kim
If blkdev_get fails, we shouldn't do blkdev_put. Otherwise, the
kernel emits the log below. This patch fixes it.

[   31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 
blkdev_put+0x105/0x120
[   31.075104] Modules linked in:
[   31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
[   31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   31.079589] RIP: 0010:blkdev_put+0x105/0x120
[   31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 
6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b e9 
5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
[   31.085080] RSP: 0018:b409005c7ed0 EFLAGS: 00010297
[   31.086383] RAX: 9779fe5a8040 RBX: 9779fbc17300 RCX: b9fc37a4
[   31.088105] RDX: 0001 RSI:  RDI: 9640e740
[   31.089850] RBP: 9779fbc17318 R08: 95499a89 R09: 0004
[   31.091201] R10: b409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 0083
[   31.092276] R13: 9779fe607b98 R14:  R15: 9779fe607a38
[   31.093355] FS:  7fc118d9b840() GS:9779fc60() 
knlGS:
[   31.094582] CS:  0010 DS:  ES:  CR0: 80050033
[   31.095541] CR2: 7fc11894b8dc CR3: 339f6001 CR4: 00160ef0
[   31.096781] Call Trace:
[   31.097212]  __x64_sys_swapoff+0x46d/0x490
[   31.097914]  do_syscall_64+0x5a/0x190
[   31.098550]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   31.099402] RIP: 0033:0x7fc11843ec27
[   31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
[   31.103149] RSP: 002b:7ffdf69be648 EFLAGS: 0206 ORIG_RAX: 
00a8
[   31.104425] RAX: ffda RBX: 011d98c0 RCX: 7fc11843ec27
[   31.105627] RDX: 0001 RSI: 0001 RDI: 011d98c0
[   31.106847] RBP: 0001 R08: 7ffdf69be690 R09: 0001
[   31.108038] R10: 02b1 R11: 0206 R12: 0001
[   31.109231] R13:  R14:  R15: 
[   31.110433] irq event stamp: 4466
[   31.111001] hardirqs last  enabled at (4465): [] 
__free_pages_ok+0x1e3/0x490
[   31.112437] hardirqs last disabled at (4466): [] 
trace_hardirqs_off_thunk+0x1a/0x1c
[   31.113973] softirqs last  enabled at (3420): [] 
__do_softirq+0x333/0x446
[   31.115364] softirqs last disabled at (3407): [] 
irq_exit+0xd1/0xe0

Cc: sta...@vger.kernel.org # 4.14+
Reviewed-by: Joey Pabalinas 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 21a7046958a3..d1459cc1159f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -387,8 +387,10 @@ static ssize_t backing_dev_store(struct device *dev,
 
bdev = bdgrab(I_BDEV(inode));
err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
-   if (err < 0)
+   if (err < 0) {
+   bdev = NULL;
goto out;
+   }
 
nr_pages = i_size_read(inode) >> PAGE_SHIFT;
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
-- 
2.20.0.rc1.387.gf8505762e3-goog



[PATCH v4 4/7] zram: introduce ZRAM_IDLE flag

2018-12-02 Thread Minchan Kim
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle",
which marks every allocated zram slot as ZRAM_IDLE.
The user can see it via /sys/kernel/debug/zram/zram0/block_state.

  300  75.033841 ...i
  301  63.806904 s..i
  302  63.806919 ..hi

Once there is IO for the slot, the mark disappears.

  300  75.033841 ...
  301  63.806904 s..i
  302  63.806919 ..hi

Therefore, the 300th block is no longer an idle page after the IO. With this
feature, the user can see how many idle pages zram has, which are a waste of
memory.

Reviewed-by: Joey Pabalinas 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 +++
 Documentation/blockdev/zram.txt| 10 ++--
 drivers/block/zram/zram_drv.c  | 57 --
 drivers/block/zram/zram_drv.h  |  1 +
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim 
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:  /sys/block/zram/idle
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The idle file is write-only and marks zram slots as idle.
+   If the system has mounted debugfs, the user can see which slots
+   are idle via /sys/kernel/debug/zram/zram/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithmRWshow and change the compression 
algorithm
 compact   WOtrigger memory compaction
 debug_statROthis file is used for zram debugging purposes
 backing_dev  RWset up backend storage for zram to write out
+idle WOmark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
- 300  75.033841 .wh
- 301  63.806904 s..
- 302  63.806919 ..h
+ 300  75.033841 .wh.
+ 301  63.806904 s...
+ 302  63.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4457d0395bfb..180613b478a6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,47 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+   int index;
+   char mode_buf[8];
+   ssize_t sz;
+
+   sz = strscpy(mode_buf, buf, sizeof(mode_buf));
+   if (sz <= 0)
+   return -EINVAL;
+
+   /* ignore trailing new line */
+   if (mode_buf[sz - 1] == '\n')
+   mode_buf[sz - 1] = 0x00;
+
+   if (strcmp(mode_buf, "all"))
+   return -EINVAL;
+
+   down_read(&zram->init_lock);
+   if (!init_done(zram)) {
+   up_read(&zram->init_lock);
+   return -EINVAL;
+   }
+
+   for (index = 0; index < nr_pages; index++) {
+   zram_slot_lock(zram, index);
+   if (!zram_allocated(zram, index))
+   goto next;
+
+   zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+   zram_slot_unlock(zram, index);
+   }
+
+   up_read(&zram->init_lock);
+
+   return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -638,6 +679,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+   zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -67

[PATCH v4 5/7] zram: support idle/huge page writeback

2018-12-02 Thread Minchan Kim
This patch supports the new feature "zram idle/huge page writeback".
In the zram-swap usecase, zram usually has many idle/huge swap pages.
It's pointless to keep them in memory (i.e., zram).

To solve the problem, this feature introduces idle/huge page
writeback to the backing device; the goal is to save more memory
space on embedded systems.

Normal sequence to use idle/huge page writeback feature is as follows,

while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# Unless there is no access for some blocks on zram,
# they are still IDLE marked pages.

echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE or/and huge marked slot into backing device
# and free the memory.
}

Per discussion [1], this patch removes the direct incompressible page
writeback feature
(d2afd25114f4, zram: write incompressible pages to backing device).

Below concerns from Sergey:
== &< ==
"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and
uncontrollable; it depens on data patterns and compression algorithms.
While "IDLE writeback" is predictable.

I even suspect, that, *ideally*, we can remove "incompressible
writeback". "IDLE pages" is a super set which also includes
"incompressible" pages. So, technically, we still can do
"incompressible writeback" from "IDLE writeback" path; but a much
more reasonable one, based on a page idling period.

I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from
flash wearout, so I can see "incompressible writeback" path becoming
a dead code, long term.
== &< ==

Below concerns from Minchan:
== &< ==
My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this implementation,
both hugepage/idlepage writeback will turn on. However, some users want
to enable only idlepage writeback, so we need to introduce a turn on/off
knob for hugepage or a new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecases.
I don't want to make it complicated *if possible*.

Long term, I imagine we need to make the VM aware of a new swap hierarchy,
a little bit different from the as-is one.
For example, the first high priority swap can return -EIO or -ENOCOMP, and
swap tries to fall back to the next lower priority swap device. With that,
hugepage writeback will work transparently.

So we could regard it as a regression because incompressible pages
don't go to the backing storage automatically. Instead, the user should
do it manually via "echo huge > /sys/block/zram/writeback".
== &< ==

If we may hear some regression, we could restore the function with
different implemenataion.

[1], https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u

Reviewed-by: Sergey Senozhatsky 
Reviewed-by: Joey Pabalinas 
Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  28 ++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 247 +++--
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 209 insertions(+), 79 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and trigger idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..806cdaabac83 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ The stat file represents device's mm statistics. It 
consists of a single
 
 = writeback
 
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
 to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, admin should set up backing device via
+
+   "echo /dev/sda5 > /sys/block/zra

[PATCH v4 6/7] zram: add bd_stat statistics

2018-12-02 Thread Minchan Kim
bd_stat represents events that happened in the backing device. Currently,
it supports bd_count, bd_reads and bd_writes, which are helpful
for understanding flash wearout and memory savings.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++
 Documentation/blockdev/zram.txt| 11 
 drivers/block/zram/zram_drv.c  | 29 ++
 drivers/block/zram/zram_drv.h  |  5 
 4 files changed, 53 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..65fc33b2f53b 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact:   Minchan Kim 
 Description:
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
+
+What:  /sys/block/zram/bd_stat
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The bd_stat file is read-only and represents backing device's
+   statistics (bd_count, bd_reads, bd_writes) in a format
+   similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 806cdaabac83..906df97527a7 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It 
consists of a single
  pages_compacted  the number of pages freed during compaction
  huge_pages  the number of incompressible pages
 
+File /sys/block/zram/bd_stat
+
+The stat file represents device's backing device statistics. It consists of
+a single line of text and contains the following stats separated by whitespace:
+ bd_count  size of data written in backing device.
+   Unit: 4K bytes
+ bd_reads  the number of reads from backing device
+   Unit: 4K bytes
+ bd_writes the number of writes to backing device
+   Unit: 4K bytes
+
 9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6b5a886c8f32..f1832fa3ba41 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -502,6 +502,7 @@ static unsigned long alloc_block_bdev(struct zram *zram)
if (test_and_set_bit(blk_idx, zram->bitmap))
goto retry;
 
+   atomic64_inc(&zram->stats.bd_count);
return blk_idx;
 }
 
@@ -511,6 +512,7 @@ static void free_block_bdev(struct zram *zram, unsigned 
long blk_idx)
 
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
+   atomic64_dec(&zram->stats.bd_count);
 }
 
 static void zram_page_end_io(struct bio *bio)
@@ -668,6 +670,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
+   atomic64_inc(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -757,6 +760,7 @@ static int read_from_bdev_sync(struct zram *zram, struct 
bio_vec *bvec,
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
 {
+   atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -1013,6 +1017,25 @@ static ssize_t mm_stat_show(struct device *dev,
return ret;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   ssize_t ret;
+
+   down_read(&zram->init_lock);
+   ret = scnprintf(buf, PAGE_SIZE,
+   "%8llu %8llu %8llu\n",
+   (u64)atomic64_read(&zram->stats.bd_count) << (PAGE_SHIFT - 12),
+   (u64)atomic64_read(&zram->stats.bd_reads) << (PAGE_SHIFT - 12),
+   (u64)atomic64_read(&zram->stats.bd_writes) << (PAGE_SHIFT - 12));
+   up_read(&zram->init_lock);
+
+   return ret;
+}
+#endif
+
 static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1033,6 +1056,9 @@ static ssize_t debug_stat_show(struct device *dev,
 
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
 static DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zram, u64 disksize)
@@ -1759,6 +1785,9 @@ static struct attribute *zram_disk_attrs[] = {
 #endif
   &dev_attr_io_stat.attr,
   &dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_WRITEBACK
+   &dev_attr_bd_stat.attr,
+#endif
   &dev_attr_debug_stat.attr,
  

[PATCH v4 1/7] zram: fix lockdep warning of free block handling

2018-12-02 Thread Minchan Kim
[  254.519728] 
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] 
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: 
put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [] 
__slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [] 
__slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [] 
__do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [] 
irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]CPU0
[  254.521732]
[  254.521732]   lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]   <Interrupt>
[  254.521732] lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  <IRQ>
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  </IRQ>

With the writeback feature, zram_slot_free_notify could be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that, so lockdep yells out.

get_entry_bdev
spin_lock(bitmap->lock);
irq
softirq
end_swap_bio_read
zram_slot_free_notify
zram_slot_lock <-- deadlock prone
zram_free_page
put_entry_bdev
spin_lock(bitmap->lock); <-- deadlock prone

With akpm's suggestion (i.e., bitmap operations are already atomic),
we could remove the bitmap lock. It might fail to find an empty slot
if serious contention happens. However, it's not a severe problem
because huge page writeback already has the possibility of failing if there
is severe memory pressure. The worst case is just keeping
the incompressible page in memory, not storage.

The other problem is zram_slot_lock in zram_slot_free_notify.
To make it safe, this patch introduces zram_slot_trylock, which
zram_slot_free_notify uses. Although it's rare to be contended,
this patch adds a new debug stat, "miss_free", to keep monitoring
how often it happens.

Reviewed-by: Joey Pabalinas 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 38 +++
 drivers/block/zram/zram_drv.h |  2 +-
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4879595200e1..21a7046958a3 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -53,6 +53,11 @@ static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, siz

[PATCH v4 2/7] zram: fix double free backing device

2018-12-02 Thread Minchan Kim
If blkdev_get fails, we shouldn't do blkdev_put. Otherwise, the
kernel emits the log below. This patch fixes it.

[   31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 
blkdev_put+0x105/0x120
[   31.075104] Modules linked in:
[   31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
[   31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   31.079589] RIP: 0010:blkdev_put+0x105/0x120
[   31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 
6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b e9 
5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
[   31.085080] RSP: 0018:b409005c7ed0 EFLAGS: 00010297
[   31.086383] RAX: 9779fe5a8040 RBX: 9779fbc17300 RCX: b9fc37a4
[   31.088105] RDX: 0001 RSI:  RDI: 9640e740
[   31.089850] RBP: 9779fbc17318 R08: 95499a89 R09: 0004
[   31.091201] R10: b409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 0083
[   31.092276] R13: 9779fe607b98 R14:  R15: 9779fe607a38
[   31.093355] FS:  7fc118d9b840() GS:9779fc60() 
knlGS:
[   31.094582] CS:  0010 DS:  ES:  CR0: 80050033
[   31.095541] CR2: 7fc11894b8dc CR3: 339f6001 CR4: 00160ef0
[   31.096781] Call Trace:
[   31.097212]  __x64_sys_swapoff+0x46d/0x490
[   31.097914]  do_syscall_64+0x5a/0x190
[   31.098550]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   31.099402] RIP: 0033:0x7fc11843ec27
[   31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
[   31.103149] RSP: 002b:7ffdf69be648 EFLAGS: 0206 ORIG_RAX: 
00a8
[   31.104425] RAX: ffda RBX: 011d98c0 RCX: 7fc11843ec27
[   31.105627] RDX: 0001 RSI: 0001 RDI: 011d98c0
[   31.106847] RBP: 0001 R08: 7ffdf69be690 R09: 0001
[   31.108038] R10: 02b1 R11: 0206 R12: 0001
[   31.109231] R13:  R14:  R15: 
[   31.110433] irq event stamp: 4466
[   31.111001] hardirqs last  enabled at (4465): [] 
__free_pages_ok+0x1e3/0x490
[   31.112437] hardirqs last disabled at (4466): [] 
trace_hardirqs_off_thunk+0x1a/0x1c
[   31.113973] softirqs last  enabled at (3420): [] 
__do_softirq+0x333/0x446
[   31.115364] softirqs last disabled at (3407): [] 
irq_exit+0xd1/0xe0

Cc: sta...@vger.kernel.org # 4.14+
Reviewed-by: Joey Pabalinas 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 21a7046958a3..d1459cc1159f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -387,8 +387,10 @@ static ssize_t backing_dev_store(struct device *dev,
 
bdev = bdgrab(I_BDEV(inode));
err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
-   if (err < 0)
+   if (err < 0) {
+   bdev = NULL;
goto out;
+   }
 
nr_pages = i_size_read(inode) >> PAGE_SHIFT;
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
-- 
2.20.0.rc1.387.gf8505762e3-goog



[PATCH v4 4/7] zram: introduce ZRAM_IDLE flag

2018-12-02 Thread Minchan Kim
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see it via /sys/kernel/debug/zram/zram0/block_state.

  300    75.033841 ...i
  301    63.806904 s..i
  302    63.806919 ..hi

Once there is IO for the slot, the mark disappears.

  300    75.033841 ...
  301    63.806904 s..i
  302    63.806919 ..hi

Therefore, the 300th block is an idle zpage. With this feature,
users can see how many idle pages zram has, which are a waste of memory.

Reviewed-by: Joey Pabalinas 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 +++
 Documentation/blockdev/zram.txt| 10 ++--
 drivers/block/zram/zram_drv.c  | 57 --
 drivers/block/zram/zram_drv.h  |  1 +
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim 
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:  /sys/block/zram/idle
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   idle file is write-only and mark zram slot as idle.
+   If system has mounted debugfs, user can see which slots
+   are idle via /sys/kernel/debug/zram/zram/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithm  RW    show and change the compression algorithm
 compact         WO    trigger memory compaction
 debug_stat      RO    this file is used for zram debugging purposes
 backing_dev     RW    set up backend storage for zram to write out
+idle            WO    mark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
- 300    75.033841 .wh
- 301    63.806904 s..
- 302    63.806919 ..h
+ 300    75.033841 .wh.
+ 301    63.806904 s...
+ 302    63.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4457d0395bfb..180613b478a6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,47 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+   int index;
+   char mode_buf[8];
+   ssize_t sz;
+
+   sz = strscpy(mode_buf, buf, sizeof(mode_buf));
+   if (sz <= 0)
+   return -EINVAL;
+
+   /* ignore trailing new line */
+   if (mode_buf[sz - 1] == '\n')
+   mode_buf[sz - 1] = 0x00;
+
+   if (strcmp(mode_buf, "all"))
+   return -EINVAL;
+
+   down_read(&zram->init_lock);
+   if (!init_done(zram)) {
+   up_read(&zram->init_lock);
+   return -EINVAL;
+   }
+
+   for (index = 0; index < nr_pages; index++) {
+   zram_slot_lock(zram, index);
+   if (!zram_allocated(zram, index))
+   goto next;
+
+   zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+   zram_slot_unlock(zram, index);
+   }
+
+   up_read(&zram->init_lock);
+
+   return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -638,6 +679,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+   zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -67

[PATCH v4 5/7] zram: support idle/huge page writeback

2018-12-02 Thread Minchan Kim
This patch supports a new feature, "zram idle/huge page writeback".
In the zram-swap usecase, zram usually has many idle/huge swap pages.
It's pointless to keep them in memory (i.e., zram).

To solve the problem, this feature introduces idle/huge page
writeback to the backing device; the goal is to save more memory
space on embedded systems.

Normal sequence to use idle/huge page writeback feature is as follows,

while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# Unless there is no access for some blocks on zram,
# they are still IDLE marked pages.

echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE or/and huge marked slot into backing device
# and free the memory.
}

Per discussion[1], this patch removes the direct incompressible page
writeback feature
(d2afd25114f4, zram: write incompressible pages to backing device).

Below concerns from Sergey:
== &< ==
"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and
uncontrollable; it depends on data patterns and compression algorithms.
While "IDLE writeback" is predictable.

I even suspect, that, *ideally*, we can remove "incompressible
writeback". "IDLE pages" is a super set which also includes
"incompressible" pages. So, technically, we still can do
"incompressible writeback" from "IDLE writeback" path; but a much
more reasonable one, based on a page idling period.

I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from
flash wearout, so I can see "incompressible writeback" path becoming
a dead code, long term.
== &< ==

Below concerns from Minchan:
== &< ==
My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this implementation,
both hugepage/idlepage writeback will turn on. However, some users want
to enable only idlepage writeback, so we need to introduce a turn on/off
knob for hugepage or a new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecases.
I don't want to make it complicated *if possible*.

Long term, I imagine we need to make the VM aware of a new swap hierarchy,
a little different from the as-is one.
For example, if the first high-priority swap can return -EIO or -ENOCOMP,
swap tries to fall back to the next lower-priority swap device. With that,
hugepage writeback will work transparently.

So we could regard it as a regression because incompressible pages
don't go to backing storage automatically. Instead, the user should
do it via "echo huge > /sys/block/zram/writeback" manually.
== &< ==

If we hear of some regression, we could restore the function with a
different implementation.

[1], https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u

Reviewed-by: Sergey Senozhatsky 
Reviewed-by: Joey Pabalinas 
Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  28 ++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 247 +++--
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 209 insertions(+), 79 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and trigger idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..806cdaabac83 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ The stat file represents device's mm statistics. It 
consists of a single
 
 = writeback
 
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
 to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, admin should set up backing device via
+
+   "echo /dev/sda5 > /sys/block/zra


[PATCH v4 1/7] zram: fix lockdep warning of free block handling

2018-12-02 Thread Minchan Kim
[  254.519728] 
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] 
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] b1828693 (&(>bitmap_lock)->rlock){+.?.}, at: 
put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [] 
__slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [] 
__slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [] 
__do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [] 
irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]CPU0
[  254.521732]
[  254.521732]   lock(&(>bitmap_lock)->rlock);
[  254.521732]   
[  254.521732] lock(&(>bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  

With writeback feature, zram_slot_free_notify could be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that so lockdep yell out. Thanks.

get_entry_bdev
spin_lock(bitmap->lock);
irq
softirq
end_swap_bio_read
zram_slot_free_notify
zram_slot_lock <-- deadlock prone
zram_free_page
put_entry_bdev
spin_lock(bitmap->lock); <-- deadlock prone

With akpm's suggestion(i.e. bitmap operation is already atomic),
we could remove bitmap lock. It might fail to find a empty slot
if serious contention happens. However, it's not severe problem
because huge page writeback has already possiblity to fail if there
is severe memory pressure. Worst case is just keeping
the incompressible in memory, not storage.

The other problem is zram_slot_lock in zram_slot_free_notify.
To make it safe, this patch introduces zram_slot_trylock and uses
it in zram_slot_free_notify. Although contention should be rare,
this patch adds a new debug stat "miss_free" to keep monitoring
how often it happens.
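The trylock idea can be sketched in userspace C: a hypothetical slot lock emulated with a C11 atomic flag, where the free path (which may run in softirq context) gives up instead of spinning and bumps a miss_free counter. Names and types are illustrative stand-ins, not the kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for bit_spin_trylock/bit_spin_unlock on ZRAM_LOCK. */
struct slot {
	atomic_flag lock;
};

static unsigned long miss_free;	/* mirrors the new "miss_free" debug stat */

static bool slot_trylock(struct slot *s)
{
	/* Returns true only if the lock was taken; never spins. */
	return !atomic_flag_test_and_set_explicit(&s->lock,
						  memory_order_acquire);
}

static void slot_unlock(struct slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Free-notify path: safe in atomic context because it never busy-waits. */
static void slot_free_notify(struct slot *s)
{
	if (!slot_trylock(s)) {
		miss_free++;	/* contended: count it and give up for now */
		return;
	}
	/* ... free the slot's resources here ... */
	slot_unlock(s);
}
```

Under contention the free is skipped and only counted - which is exactly what the miss_free stat is for.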

Reviewed-by: Joey Pabalinas 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 38 +++
 drivers/block/zram/zram_drv.h |  2 +-
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4879595200e1..21a7046958a3 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -53,6 +53,11 @@ static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, siz

[PATCH v4 3/7] zram: refactoring flags and writeback stuff

2018-12-02 Thread Minchan Kim
This patch renames some variables and restructures some code for
better readability in the writeback and zram_free_page paths.

Reviewed-by: Joey Pabalinas 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 105 +-
 drivers/block/zram/zram_drv.h |   8 +--
 2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d1459cc1159f..4457d0395bfb 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
-   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
-   bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_unlock(struct zram *zram, u32 index)
 {
-   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
 static inline bool zram_allocated(struct zram *zram, u32 index)
 {
 
-   return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
zram->table[index].handle;
 }
 
@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, 
unsigned long handle)
 static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   return zram->table[index].value & BIT(flag);
+   return zram->table[index].flags & BIT(flag);
 }
 
 static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value |= BIT(flag);
+   zram->table[index].flags |= BIT(flag);
 }
 
 static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value &= ~BIT(flag);
+   zram->table[index].flags &= ~BIT(flag);
 }
 
 static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, 
u32 index)
 
 static size_t zram_get_obj_size(struct zram *zram, u32 index)
 {
-   return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+   return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
 }
 
 static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
 {
-   unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+   unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
 
-   zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+   zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
 #if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
-   return zram->backing_dev;
-}
-
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
 
-   if (!zram_wb_enabled(zram))
+   if (!zram->backing_dev)
return;
 
bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
ssize_t ret;
 
down_read(&zram->init_lock);
-   if (!zram_wb_enabled(zram)) {
+   if (!zram->backing_dev) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
@@ -447,7 +442,7 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
 {
unsigned long blk_idx = 1;
 retry:
@@ -462,11 +457,11 @@ static unsigned long get_entry_bdev(struct zram *zram)
return blk_idx;
 }
 
-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 {
int was_set;
 
-   was_set = test_and_clear_bit(entry, zram->bitmap);
+   was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
 }
 
@@ -579,7 +574,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
if (!bio)
return -ENOMEM;
 
-   entry = get_entry_bdev(zram);
+   entry = alloc_block_bdev(zram);
if (!entry) {
bio_put(bio);
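For context on the rename above: the `value`/`flags` word packs two things into one unsigned long - the compressed object size in the low ZRAM_FLAG_SHIFT bits, and the per-slot flag bits above them. A minimal userspace sketch of that packing; the shift value here is illustrative (the kernel derives the real ZRAM_FLAG_SHIFT), and the helpers are hypothetical:

```c
#include <assert.h>

#define FLAG_SHIFT 24	/* illustrative stand-in for ZRAM_FLAG_SHIFT */
#define SIZE_MASK  ((1UL << FLAG_SHIFT) - 1)

/* Low bits: compressed object size; high bits: per-slot flags. */
static unsigned long set_obj_size(unsigned long word, unsigned long size)
{
	unsigned long flags = word >> FLAG_SHIFT;	/* preserve flags */

	return (flags << FLAG_SHIFT) | size;
}

static unsigned long get_obj_size(unsigned long word)
{
	return word & SIZE_MASK;
}

static unsigned long set_flag(unsigned long word, unsigned bit)
{
	return word | (1UL << bit);	/* real flag bits live >= FLAG_SHIFT */
}
```

Updating the size rewrites only the low bits, which is why zram_set_obj_size first saves the flags and then ORs the size back in.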

[PATCH v4 0/7] zram idle page writeback

2018-12-02 Thread Minchan Kim
Inherently, a swap device has many idle pages which are rarely
touched after they are allocated. That is never a problem if we use
a storage device as swap, but it is just a waste for zram-swap.

This patchset supports zram idle page writeback feature.

* Admin can define what an idle page is: "no access since X time ago"
* Admin can define when zram should write those pages back
* Admin can define when zram should stop writeback to prevent wearout

Detail is on each patch's description.

The first two patches below are -stable material, so they could go
in first, separately from the others in this series.

  zram: fix lockdep warning of free block handling
  zram: fix double free backing device

* from v3
  - add more words in changelog - akpm
  - clarification writeback limit - akpm
  - fix 4k unit of bd_stat - akpm
  - change writeback_limit interface - minchan
  - add reviewed-by - sergey, joey

* from v2
  - use strscpy instead of strlcpy - Joey Pabalinas
  - remove irqlock for bitmap op - akpm
  - don't use page as stat unit - akpm

* from v1
  - add fix for double free of backing device - minchan
  - change writeback/idle interface - minchan 
  - remove direct incompressible page writeback - sergey

Minchan Kim (7):
  zram: fix lockdep warning of free block handling
  zram: fix double free backing device
  zram: refactoring flags and writeback stuff
  zram: introduce ZRAM_IDLE flag
  zram: support idle/huge page writeback
  zram: add bd_stat statistics
  zram: writeback throttle

 Documentation/ABI/testing/sysfs-block-zram |  32 ++
 Documentation/blockdev/zram.txt|  80 +++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 502 +++--
 drivers/block/zram/zram_drv.h  |  19 +-
 5 files changed, 476 insertions(+), 162 deletions(-)

-- 
2.20.0.rc1.387.gf8505762e3-goog



Re: [PATCH v3 0/7] zram idle page writeback

2018-12-02 Thread Minchan Kim
On Fri, Nov 30, 2018 at 01:36:56PM +0900, Sergey Senozhatsky wrote:
> On (11/27/18 14:54), Minchan Kim wrote:
> > Inherently, swap device has many idle pages which are rare touched since
> > it was allocated. It is never problem if we use storage device as swap.
> > However, it's just waste for zram-swap.
> > 
> > This patchset supports zram idle page writeback feature.
> > 
> > * Admin can define what is idle page "no access since X time ago"
> > * Admin can define when zram should writeback them
> > * Admin can define when zram should stop writeback to prevent wearout
> > 
> > Detail is on each patch's description.
> > 
> > Below first two patches are -stable material so it could go first
> > separately with others in this series.
> 
> I had some time to look at the patches
> Reviewed-by: Sergey Senozhatsky 
> 
> Will give it some testing later; next week maybe.

Thanks Sergey!


Re: [PATCH v3 7/7] zram: writeback throttle

2018-12-02 Thread Minchan Kim
On Thu, Nov 29, 2018 at 11:23:58AM +0900, Sergey Senozhatsky wrote:
> On (11/27/18 14:54), Minchan Kim wrote:
> > diff --git a/Documentation/ABI/testing/sysfs-block-zram 
> > b/Documentation/ABI/testing/sysfs-block-zram
> > index 65fc33b2f53b..9d2339a485c8 100644
> > --- a/Documentation/ABI/testing/sysfs-block-zram
> > +++ b/Documentation/ABI/testing/sysfs-block-zram
> > @@ -121,3 +121,12 @@ Contact:   Minchan Kim 
> > The bd_stat file is read-only and represents backing device's
> > statistics (bd_count, bd_reads, bd_writes) in a format
> > similar to block layer statistics file format.
> > +
> > +What:  /sys/block/zram/writeback_limit
> > +Date:  November 2018
> > +Contact:   Minchan Kim 
> > +Description:
> > +   The writeback_limit file is read-write and specifies the maximum
> > +   amount of writeback ZRAM can do. The limit could be changed
> > +   in run time and "0" means disable the limit.
> > +   No limit is the initial state.
> > diff --git a/Documentation/blockdev/zram.txt 
> > b/Documentation/blockdev/zram.txt
> > index 906df97527a7..64b61925e475 100644
> > --- a/Documentation/blockdev/zram.txt
> > +++ b/Documentation/blockdev/zram.txt
> > @@ -164,6 +164,8 @@ reset WOtrigger device reset
> >  mem_used_max  WOreset the `mem_used_max' counter (see later)
> >  mem_limit WOspecifies the maximum amount of memory ZRAM can use
> >  to store the compressed data
> > +writeback_limit   WOspecifies the maximum amount of write IO zram can
> > +   write out to backing device as 4KB unit
>  
>   page size units?

Per Andrew's comment:
https://lkml.org/lkml/2018/11/27/156

I need to fix it to always represent 4K units.
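The normalization being discussed - reporting stats in fixed 4K units regardless of PAGE_SIZE - is one shift inside the kernel. A sketch with a hypothetical helper name:

```c
#include <assert.h>

/* Convert a count of PAGE_SIZE pages into 4KB units.
 * For PAGE_SIZE >= 4K, each page is (PAGE_SIZE / 4K) units,
 * i.e. a left shift by (page_shift - 12), since 12 == log2(4096). */
static unsigned long pages_to_4k_units(unsigned long pages,
				       unsigned int page_shift)
{
	return pages << (page_shift - 12);
}
```

With this done in the kernel, userspace developed on 4K page size sees the same numbers on a 64K-page system.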


Re: [PATCH v3 7/7] zram: writeback throttle

2018-11-28 Thread Minchan Kim
On Wed, Nov 28, 2018 at 03:41:41PM -0800, Andrew Morton wrote:
> On Tue, 27 Nov 2018 14:54:29 +0900 Minchan Kim  wrote:
> 
> > On small memory system, there are lots of write IO so if we use
> > flash device as swap, there would be serious flash wearout.
> > To overcome the problem, system developers need to design write
> > limitation strategy to guarantee flash health for entire product life.
> > 
> > This patch creates a new konb "writeback_limit" on zram. With that,
> > if current writeback IO count(/sys/block/zramX/io_stat) excceds

   bd_stat

> > the limitation, zram stops further writeback until admin can reset
> > the limit.
> 
> I'm not really understanding this.  Does this only refer to suspending
> the idle page writeback feature?  Not all zram writeback, surely?

It aims for all zram writeback.

> 
> I don't think the documentation gives an administrator sufficient
> information to effectively use the feature.  Some additional discussion
> would help.  What sort of values should it be set to and why?
> 
> And what is the default setting?  And why?

The default setting is 0, so there is no limitation, because we
cannot predict the user's zram workload.

> 
> And the limit isn't persistent across reboots which makes me wonder
> whether the overall feature is particularly valuable?

Good point.
Keeping the value persistent across reboots is userspace's role.

I will add this for admins:
"
You can see how many writes have happened since system boot via
the bd_writes field of /sys/block/zramX/bd_stat.
If your backing device has wearout concerns, you can limit the
writing via /sys/block/zramX/writeback_limit.

For instance, if the bd_writes value you read is 200, you could
set writeback_limit to 300 so that only the upcoming 100 writes
are allowed. If you set writeback_limit to a value lower than the
current bd_writes, zram allows further writeback without limit.

The value resets when your system reboots, so keeping track of how
many writes have happened across reboots is the user's job.
"





Re: [PATCH v2 6/7] zram: add bd_stat statistics

2018-11-28 Thread Minchan Kim
On Wed, Nov 28, 2018 at 03:30:21PM -0800, Andrew Morton wrote:
> On Tue, 27 Nov 2018 11:07:54 +0900 Minchan Kim  wrote:
> 
> > On Mon, Nov 26, 2018 at 12:58:33PM -0800, Andrew Morton wrote:
> > > On Mon, 26 Nov 2018 17:28:12 +0900 Minchan Kim  wrote:
> > > 
> > > > +File /sys/block/zram/bd_stat
> > > > +
> > > > +The stat file represents device's backing device statistics. It 
> > > > consists of
> > > > +a single line of text and contains the following stats separated by 
> > > > whitespace:
> > > > + bd_count  size of data written in backing device.
> > > > +   Unit: pages
> > > > + bd_reads  the number of reads from backing device
> > > > +   Unit: pages
> > > > + bd_writes the number of writes to backing device
> > > > +   Unit: pages
> > > 
> > > Using `pages' is a bad choice.  And I assume this means that
> > > writeback_limit is in pages as well, which is worse.
> > > 
> > > Page sizes are not constant!  We want userspace which was developed on
> > > 4k pagesize to work the same on 64k pagesize.
> > > 
> > > Arguably, we could require that well-written userspace remember to use
> > > getpagesize().  However we have traditionally tried to avoid that by
> > > performing the pagesize normalization within the kernel.
> > 
> > zram works based on page so I used that term but I agree it's rather
> > vague. If there is no objection, I will use (Unit: 4K) instead of
> > (Unit: pages).
> 
> Is that still true if PAGE_SIZE=64k?

Oops, I will fix it.



Re: [PATCH v3 5/7] zram: support idle/huge page writeback

2018-11-28 Thread Minchan Kim
Hi Andrew,

On Wed, Nov 28, 2018 at 03:35:59PM -0800, Andrew Morton wrote:
> On Tue, 27 Nov 2018 14:54:27 +0900 Minchan Kim  wrote:
> 
> > This patch supports new feature "zram idle/huge page writeback".
> > On zram-swap usecase, zram has usually many idle/huge swap pages.
> > It's pointless to keep in memory(ie, zram).
> > 
> > To solve the problem, this feature introduces idle/huge page
> > writeback to backing device so the goal is to save more memory
> > space on embedded system.
> > 
> > Normal sequence to use idle/huge page writeback feature is as follows,
> > 
> > while (1) {
> > # mark allocated zram slot to idle
> > echo all > /sys/block/zram0/idle
> > # leave system working for several hours
> > # Unless there is no access for some blocks on zram,
> > # they are still IDLE marked pages.
> > 
> > echo "idle" > /sys/block/zram0/writeback
> > or/and
> > echo "huge" > /sys/block/zram0/writeback
> > # write the IDLE or/and huge marked slot into backing device
> > # and free the memory.
> > }
> > 
> > By per discussion:
> > https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
> > 
> > This patch removes direct incommpressibe page writeback feature
> > (d2afd25114f4, zram: write incompressible pages to backing device)
> > so we could regard it as regression because incompressible pages
> > doesn't go to backing storage automatically. Instead, usre should
> > do it via "echo huge" > /sys/block/zram/writeback" manually.
> 
> I'm not in any position to determine the regression risk here.
> 
> Why is that feature being removed, anyway?

Below are Sergey's concerns:
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u

== 8< ==
"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and
uncontrollable; it depens on data patterns and compression algorithms.
While "IDLE writeback" is predictable.

I even suspect, that, *ideally*, we can remove "incompressible
writeback". "IDLE pages" is a super set which also includes
"incompressible" pages. So, technically, we still can do
"incompressible writeback" from "IDLE writeback" path; but a much
more reasonable one, based on a page idling period.

I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from
flash wearout, so I can see "incompressible writeback" path becoming
a dead code, long term.
== 8< ==

My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this
implementation, both hugepage and idlepage writeback will turn on.
However, some users want to enable only idlepage writeback, so we would
need to introduce an on/off knob for hugepage writeback or a new
CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecases.
I don't want to make it complicated *if possible*.

Long term, I imagine we need to make the VM aware of a new swap
hierarchy, a little different from what we have now.
For example, the first, high-priority swap device could return -EIO or
-ENOCOMP, and swap would try to fall back to the next lower-priority
swap device. With that, hugepage writeback would work transparently.
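That fallback idea can be sketched as a walk over swap tiers in priority order, falling back only on the two "this device declines the page" errors. -ENOCOMP comes from the mail above (it is not a standard errno, so a value is made up here); everything else - the struct, the simulated store result, the function name - is hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ENOCOMP 200	/* hypothetical "page is incompressible" errno */

struct swap_dev {
	int priority;
	int store_err;	/* simulated result of storing: 0 or -errno */
};

/* Walk devices (assumed sorted by descending priority) and fall back
 * to the next tier when a device returns -EIO or -ENOCOMP. */
static int swap_out(const struct swap_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int err = devs[i].store_err;

		if (err == 0)
			return 0;	/* stored at this tier */
		if (err != -EIO && err != -ENOCOMP)
			return err;	/* hard failure: no fallback */
	}
	return -ENOSPC;			/* every tier refused the page */
}
```

With such a hierarchy, an incompressible page rejected by zram would land on the next swap device without zram-specific writeback code.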

> 
> > If we hear some regression, we could restore the function.
> 
> Why not do that now?
> 

We want to remove it at this moment. 


Re: [PATCH v3 5/7] zram: support idle/huge page writeback

2018-11-28 Thread Minchan Kim
Hi Andrew,

On Wed, Nov 28, 2018 at 03:35:59PM -0800, Andrew Morton wrote:
> On Tue, 27 Nov 2018 14:54:27 +0900 Minchan Kim  wrote:
> 
> > This patch supports new feature "zram idle/huge page writeback".
> > On zram-swap usecase, zram has usually many idle/huge swap pages.
> > It's pointless to keep in memory(ie, zram).
> > 
> > To solve the problem, this feature introduces idle/huge page
> > writeback to backing device so the goal is to save more memory
> > space on embedded system.
> > 
> > Normal sequence to use idle/huge page writeback feature is as follows,
> > 
> > while (1) {
> > # mark allocated zram slot to idle
> > echo all > /sys/block/zram0/idle
> > # leave system working for several hours
> > # Unless there is no access for some blocks on zram,
> > # they are still IDLE marked pages.
> > 
> > echo "idle" > /sys/block/zram0/writeback
> > or/and
> > echo "huge" > /sys/block/zram0/writeback
> > # write the IDLE or/and huge marked slot into backing device
> > # and free the memory.
> > }
> > 
> > By per discussion:
> > https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
> > 
> > This patch removes direct incommpressibe page writeback feature
> > (d2afd25114f4, zram: write incompressible pages to backing device)
> > so we could regard it as regression because incompressible pages
> > doesn't go to backing storage automatically. Instead, usre should
> > do it via "echo huge" > /sys/block/zram/writeback" manually.
> 
> I'm not in any position to determine the regression risk here.
> 
> Why is that feature being removed, anyway?

Below concerns from Sergey:
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u

== &< ==
"IDLE writeback" is superior to "incompressible writeback".

"incompressible writeback" is completely unpredictable and
uncontrollable; it depens on data patterns and compression algorithms.
While "IDLE writeback" is predictable.

I even suspect, that, *ideally*, we can remove "incompressible
writeback". "IDLE pages" is a super set which also includes
"incompressible" pages. So, technically, we still can do
"incompressible writeback" from "IDLE writeback" path; but a much
more reasonable one, based on a page idling period.

I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from
flash wearout, so I can see "incompressible writeback" path becoming
a dead code, long term.
== &< ==

My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this implementation,
both hugepage and idlepage writeback will turn on. However, some users want
to enable only idlepage writeback, so we would need to introduce an on/off
knob for hugepage, or a new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecases.
I don't want to make it complicated *if possible*.

Long term, I imagine we need to make the VM aware of a swap hierarchy
a little different from what we have now.
For example, the first, high-priority swap device could return -EIO or
-ENOCOMP, and swap would then fall back to the next lower-priority swap
device. With that, hugepage writeback would work transparently.
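The fallback idea can be sketched in userspace C. This is purely illustrative, not kernel code: the error-code convention is the one proposed above, ENOCOMP is a hypothetical errno, and the function name is made up.

```c
#include <errno.h>
#include <stddef.h>

#ifndef ENOCOMP
#define ENOCOMP 1000	/* hypothetical "incompressible" errno from the mail */
#endif

/*
 * Model of the proposed swap hierarchy: errs[i] is what the i-th
 * (highest-priority-first) swap device would return for a page.
 * On -EIO (writeback limit hit) or -ENOCOMP (incompressible), fall
 * back to the next lower-priority device; any other error is fatal.
 * Returns the index of the device that accepted the page, or -1.
 */
static int swap_out_fallback(const int *errs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (errs[i] == 0)
			return (int)i;	/* device i accepted the page */
		if (errs[i] != -EIO && errs[i] != -ENOCOMP)
			return -1;	/* hard failure: no fallback */
	}
	return -1;			/* every device refused the page */
}
```

With such a scheme, a zram device that hits its writeback limit would simply report -EIO and the page would land on the next swap device.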

> 
> > If we hear some regression, we could restore the function.
> 
> Why not do that now?
> 

We want to remove it at this moment. 


[PATCH v3 7/7] zram: writeback throttle

2018-11-26 Thread Minchan Kim
On small memory systems there is lots of write IO, so if we use a
flash device as swap, there will be serious flash wearout.
To overcome the problem, system developers need to design a write
limitation strategy to guarantee flash health for the entire product life.

This patch creates a new knob, "writeback_limit", on zram. With it,
if the current writeback IO count (/sys/block/zramX/io_stat) exceeds
the limit, zram stops further writeback until the admin resets
the limit.
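As a rough illustration of how an admin might pick the knob value: the limit is counted in 4KB units (per the doc change in this patch), so a byte-sized wear budget just divides down. The helper names are made up, and the throttle condition mirrored here is an assumption based on the description, not the exact kernel code.

```c
#include <stdint.h>

#define ZRAM_WB_UNIT 4096ull	/* writeback_limit counts 4KB units */

/* Bytes-of-wear budget -> value to echo into writeback_limit.
 * Rounds down: partial pages cannot be written back. */
static uint64_t wb_limit_pages(uint64_t byte_budget)
{
	return byte_budget / ZRAM_WB_UNIT;
}

/* Assumed throttle condition: writeback stops once the IO count
 * reaches a non-zero limit; 0 means "no limit". */
static int wb_should_stop(uint64_t wb_count, uint64_t wb_limit)
{
	return wb_limit != 0 && wb_count >= wb_limit;
}
```

For example, a 100MiB budget yields 25600, the number an admin would write into /sys/block/zramX/writeback_limit.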

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  9 +
 Documentation/blockdev/zram.txt|  2 +
 drivers/block/zram/zram_drv.c  | 47 +-
 drivers/block/zram/zram_drv.h  |  2 +
 4 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 65fc33b2f53b..9d2339a485c8 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact:   Minchan Kim 
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
+
+What:  /sys/block/zram<id>/writeback_limit
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback_limit file is read-write and specifies the maximum
+   amount of writeback ZRAM can do. The limit can be changed
+   at runtime, and "0" means the limit is disabled.
+   No limit is the initial state.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 906df97527a7..64b61925e475 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset             WO    trigger device reset
 mem_used_max      WO    reset the `mem_used_max' counter (see later)
 mem_limit         WO    specifies the maximum amount of memory ZRAM can use
                         to store the compressed data
+writeback_limit   WO    specifies the maximum amount of write IO zram can
+                        write out to backing device as 4KB unit
 max_comp_streams  RW    the number of possible concurrent compress operations
 comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 67168a6ecca6..58b025c5c83e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -330,6 +330,40 @@ static ssize_t idle_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   u64 val;
+   ssize_t ret = -EINVAL;
+
+   if (kstrtoull(buf, 10, &val))
+   return ret;
+
+   down_read(&zram->init_lock);
+   atomic64_set(&zram->stats.bd_wb_limit, val);
+   if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+      zram->stop_writeback = false;
+   up_read(&zram->init_lock);
+   ret = len;
+
+   return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   u64 val;
+   struct zram *zram = dev_to_zram(dev);
+
+   down_read(&zram->init_lock);
+   val = atomic64_read(&zram->stats.bd_wb_limit);
+   up_read(&zram->init_lock);
+
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
@@ -571,6 +605,7 @@ static ssize_t writeback_store(struct device *dev,
char mode_buf[8];
unsigned long mode = -1UL;
unsigned long blk_idx = 0;
+   u64 wb_count, wb_limit;
 
sz = strscpy(mode_buf, buf, sizeof(mode_buf));
if (sz <= 0)
@@ -612,6 +647,11 @@ static ssize_t writeback_store(struct device *dev,
bvec.bv_len = PAGE_SIZE;
bvec.bv_offset = 0;
 
+   if (zram->stop_writeback) {
+   ret = -EIO;
+   break;
+   }
+
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
@@ -670,7 +710,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
-   atomic64_inc(&zram->stats.bd_writes);
+   wb_count = atomic64_inc_return(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -694,6 +734,9 @@ static ssize_t writeback_store(stru


[PATCH v3 3/7] zram: refactoring flags and writeback stuff

2018-11-26 Thread Minchan Kim
This patch renames some variables and restructures some code
for better readability in writeback and zram_free_page.
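For readers unfamiliar with the packed per-slot word this patch renames from `value` to `flags`, here is a userspace sketch of the scheme in the style of zram_set_obj_size()/zram_get_obj_size(). The shift value is an assumption for illustration only.

```c
#include <stddef.h>

#define ZRAM_FLAG_SHIFT 24	/* assumed; low bits store the object size */

/* One word per slot: flag bits live above ZRAM_FLAG_SHIFT, the
 * compressed object size lives below it. Updating the size must
 * preserve the flags, and vice versa. */
static unsigned long set_obj_size(unsigned long word, size_t size)
{
	unsigned long flags = word >> ZRAM_FLAG_SHIFT;

	return (flags << ZRAM_FLAG_SHIFT) | size;
}

static size_t get_obj_size(unsigned long word)
{
	return word & ((1ul << ZRAM_FLAG_SHIFT) - 1);
}
```

This is why the rename matters: the word is not only "flags", it is flags plus size, so a descriptive name and helpers keep the packing in one place.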

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 105 +-
 drivers/block/zram/zram_drv.h |   8 +--
 2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d1459cc1159f..4457d0395bfb 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
-   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
-   bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_unlock(struct zram *zram, u32 index)
 {
-   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
 static inline bool zram_allocated(struct zram *zram, u32 index)
 {
 
-   return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
zram->table[index].handle;
 }
 
@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, 
unsigned long handle)
 static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   return zram->table[index].value & BIT(flag);
+   return zram->table[index].flags & BIT(flag);
 }
 
 static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value |= BIT(flag);
+   zram->table[index].flags |= BIT(flag);
 }
 
 static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value &= ~BIT(flag);
+   zram->table[index].flags &= ~BIT(flag);
 }
 
 static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, 
u32 index)
 
 static size_t zram_get_obj_size(struct zram *zram, u32 index)
 {
-   return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+   return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
 }
 
 static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
 {
-   unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+   unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
 
-   zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+   zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
 #if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
-   return zram->backing_dev;
-}
-
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
 
-   if (!zram_wb_enabled(zram))
+   if (!zram->backing_dev)
return;
 
bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
ssize_t ret;
 
down_read(&zram->init_lock);
-   if (!zram_wb_enabled(zram)) {
+   if (!zram->backing_dev) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
@@ -447,7 +442,7 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
 {
unsigned long blk_idx = 1;
 retry:
@@ -462,11 +457,11 @@ static unsigned long get_entry_bdev(struct zram *zram)
return blk_idx;
 }
 
-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 {
int was_set;
 
-   was_set = test_and_clear_bit(entry, zram->bitmap);
+   was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
 }
 
@@ -579,7 +574,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
if (!bio)
return -ENOMEM;
 
-   entry = get_entry_bdev(zram);
+   entry = alloc_block_bdev(zram);
if (!entry) {
bio_put(bio);
return -ENOSPC;
@@ -590,7 +585,7 @@ static int write_to_bdev(

[PATCH v3 6/7] zram: add bd_stat statistics

2018-11-26 Thread Minchan Kim
bd_stat represents events that happened in the backing device. Currently,
it supports bd_count, bd_reads and bd_writes, which are helpful
for understanding flash wearout and memory saving.
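A userspace consumer of the new file might parse it like this; the helper name is made up, and the three counters are in 4K units per the documentation in this patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Parse one line of /sys/block/zramX/bd_stat
 * ("bd_count bd_reads bd_writes", each counted in 4K units)
 * and convert the three counters to bytes. Returns 0 on success. */
static int parse_bd_stat(const char *line, uint64_t *count_b,
			 uint64_t *reads_b, uint64_t *writes_b)
{
	uint64_t c, r, w;

	if (sscanf(line, "%" SCNu64 " %" SCNu64 " %" SCNu64,
		   &c, &r, &w) != 3)
		return -1;
	*count_b  = c * 4096;
	*reads_b  = r * 4096;
	*writes_b = w * 4096;
	return 0;
}
```

A wear-monitoring daemon could sample bd_writes this way and warn when the daily delta approaches the flash budget.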

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++
 Documentation/blockdev/zram.txt| 11 
 drivers/block/zram/zram_drv.c  | 29 ++
 drivers/block/zram/zram_drv.h  |  5 
 4 files changed, 53 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..65fc33b2f53b 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact:   Minchan Kim 
 Description:
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
+
+What:  /sys/block/zram<id>/bd_stat
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The bd_stat file is read-only and represents backing device's
+   statistics (bd_count, bd_reads, bd_writes) in a format
+   similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 806cdaabac83..906df97527a7 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It 
consists of a single
  pages_compacted  the number of pages freed during compaction
  huge_pages  the number of incompressible pages
 
+File /sys/block/zram<id>/bd_stat
+
+The stat file represents device's backing device statistics. It consists of
+a single line of text and contains the following stats separated by whitespace:
+ bd_count  size of data written in backing device.
+   Unit: 4K bytes
+ bd_reads  the number of reads from backing device
+   Unit: 4K bytes
+ bd_writes the number of writes to backing device
+   Unit: 4K bytes
+
 9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6b5a886c8f32..67168a6ecca6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -502,6 +502,7 @@ static unsigned long alloc_block_bdev(struct zram *zram)
if (test_and_set_bit(blk_idx, zram->bitmap))
goto retry;
 
+   atomic64_inc(&zram->stats.bd_count);
return blk_idx;
 }
 
@@ -511,6 +512,7 @@ static void free_block_bdev(struct zram *zram, unsigned 
long blk_idx)
 
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
+   atomic64_dec(&zram->stats.bd_count);
 }
 
 static void zram_page_end_io(struct bio *bio)
@@ -668,6 +670,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
+   atomic64_inc(>stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -757,6 +760,7 @@ static int read_from_bdev_sync(struct zram *zram, struct 
bio_vec *bvec,
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
 {
+   atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -1013,6 +1017,25 @@ static ssize_t mm_stat_show(struct device *dev,
return ret;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   ssize_t ret;
+
+   down_read(&zram->init_lock);
+   ret = scnprintf(buf, PAGE_SIZE,
+   "%8llu %8llu %8llu\n",
+   (u64)atomic64_read(&zram->stats.bd_count),
+   (u64)atomic64_read(&zram->stats.bd_reads),
+   (u64)atomic64_read(&zram->stats.bd_writes));
+   up_read(&zram->init_lock);
+
+   return ret;
+}
+#endif
+
 static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1033,6 +1056,9 @@ static ssize_t debug_stat_show(struct device *dev,
 
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
 static DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zram, u64 disksize)
@@ -1759,6 +1785,9 @@ static struct attribute *zram_disk_attrs[] = {
 #endif
   &dev_attr_io_stat.attr,
   &dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_WRITEBACK
+   &dev_attr_bd_stat.attr,
+#endif
   &dev_attr_debug_stat.attr,
NULL,
 };
diff --git a/dr

[PATCH v3 5/7] zram: support idle/huge page writeback

2018-11-26 Thread Minchan Kim
This patch supports a new feature, "zram idle/huge page writeback".
In the zram-swap usecase, zram usually has many idle/huge swap pages.
It's pointless to keep them in memory (ie, zram).

To solve the problem, this feature introduces idle/huge page
writeback to the backing device; the goal is to save more memory
space on embedded systems.

Normal sequence to use idle/huge page writeback feature is as follows,

while (1) {
	# mark allocated zram slots as idle
	echo all > /sys/block/zram0/idle
	# leave system working for several hours
	# Blocks on zram that saw no access in that time
	# are still IDLE-marked pages.

	echo "idle" > /sys/block/zram0/writeback
	or/and
	echo "huge" > /sys/block/zram0/writeback
	# write the IDLE or/and huge marked slots to the backing device
	# and free the memory.
}
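From C, the same sequence is just short string writes into the two sysfs files. A minimal sketch (the helper name is made up; the path is a parameter so it can be exercised against any file):

```c
#include <stdio.h>
#include <string.h>

/* Write a mode string ("all", "idle", "huge") into a zram sysfs knob. */
static int write_knob(const char *path, const char *mode)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(mode, f) >= 0;
	if (fclose(f) != 0)	/* sysfs reports store errors on close too */
		ok = 0;
	return ok ? 0 : -1;
}
```

Usage would be `write_knob("/sys/block/zram0/idle", "all")`, then hours later `write_knob("/sys/block/zram0/writeback", "idle")`.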

Per the discussion at
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,

this patch removes the direct incompressible page writeback feature
(d2afd25114f4, zram: write incompressible pages to backing device),
so we could regard it as a regression because incompressible pages
no longer go to backing storage automatically. Instead, the user should
do it manually via "echo huge > /sys/block/zramX/writeback".

If we hear some regression, we could restore the function.

Reviewed-by: Joey Pabalinas 
Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  28 ++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 247 +++--
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 209 insertions(+), 79 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram<id>/block_state
+
+What:  /sys/block/zram<id>/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and triggers idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..806cdaabac83 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ The stat file represents device's mm statistics. It 
consists of a single
 
 = writeback
 
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
 to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, the admin should set up a backing device via
+
+   "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+
+before setting the disksize. Only a partition is supported at the moment.
+If the admin wants to use incompressible page writeback, they can do so via
+
+   "echo huge > /sys/block/zramX/writeback"
+
+To use idle page writeback, the user first needs to declare zram pages
+as idle.
+
+   "echo all > /sys/block/zramX/idle"
+
+From that point on, every page on zram is an idle page. The idle mark
+is removed once someone requests access to the block.
+IOW, unless there is an access request, those pages remain idle pages.
+
+The admin can request writeback of those idle pages at the right time via
+
+   "echo idle > /sys/block/zramX/writeback"
+
+With this command, zram writes idle pages back from memory to the storage.
 
 = memory tracking
 
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-   bool "Write back incompressible page to backing device"
+   bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
+With /sys/block/zramX/{idle,writeback}, applications can request
+writeback of idle pages to the backing device to save memory.
+
 See Documentation/blockdev/zram.tx

[PATCH v3 2/7] zram: fix double free backing device

2018-11-26 Thread Minchan Kim
If blkdev_get fails, we shouldn't do blkdev_put. Otherwise,
kernel emits below log. This patch fixes it.

[   31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 
blkdev_put+0x105/0x120
[   31.075104] Modules linked in:
[   31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
[   31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   31.079589] RIP: 0010:blkdev_put+0x105/0x120
[   31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 
6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b e9 
5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
[   31.085080] RSP: 0018:b409005c7ed0 EFLAGS: 00010297
[   31.086383] RAX: 9779fe5a8040 RBX: 9779fbc17300 RCX: b9fc37a4
[   31.088105] RDX: 0001 RSI:  RDI: 9640e740
[   31.089850] RBP: 9779fbc17318 R08: 95499a89 R09: 0004
[   31.091201] R10: b409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 0083
[   31.092276] R13: 9779fe607b98 R14:  R15: 9779fe607a38
[   31.093355] FS:  7fc118d9b840() GS:9779fc60() 
knlGS:
[   31.094582] CS:  0010 DS:  ES:  CR0: 80050033
[   31.095541] CR2: 7fc11894b8dc CR3: 339f6001 CR4: 00160ef0
[   31.096781] Call Trace:
[   31.097212]  __x64_sys_swapoff+0x46d/0x490
[   31.097914]  do_syscall_64+0x5a/0x190
[   31.098550]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   31.099402] RIP: 0033:0x7fc11843ec27
[   31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
[   31.103149] RSP: 002b:7ffdf69be648 EFLAGS: 0206 ORIG_RAX: 
00a8
[   31.104425] RAX: ffda RBX: 011d98c0 RCX: 7fc11843ec27
[   31.105627] RDX: 0001 RSI: 0001 RDI: 011d98c0
[   31.106847] RBP: 0001 R08: 7ffdf69be690 R09: 0001
[   31.108038] R10: 02b1 R11: 0206 R12: 0001
[   31.109231] R13:  R14:  R15: 
[   31.110433] irq event stamp: 4466
[   31.111001] hardirqs last  enabled at (4465): [] 
__free_pages_ok+0x1e3/0x490
[   31.112437] hardirqs last disabled at (4466): [] 
trace_hardirqs_off_thunk+0x1a/0x1c
[   31.113973] softirqs last  enabled at (3420): [] 
__do_softirq+0x333/0x446
[   31.115364] softirqs last disabled at (3407): [] 
irq_exit+0xd1/0xe0

Cc: sta...@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 21a7046958a3..d1459cc1159f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -387,8 +387,10 @@ static ssize_t backing_dev_store(struct device *dev,
 
bdev = bdgrab(I_BDEV(inode));
err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
-   if (err < 0)
+   if (err < 0) {
+   bdev = NULL;
goto out;
+   }
 
nr_pages = i_size_read(inode) >> PAGE_SHIFT;
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v3 4/7] zram: introduce ZRAM_IDLE flag

2018-11-26 Thread Minchan Kim
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see it via /sys/kernel/debug/zram/zram0/block_state.

  300    75.033841 ...i
  301    63.806904 s..i
  302    63.806919 ..hi

Once there is IO for the slot, the mark disappears.

  300    75.033841 ...
  301    63.806904 s..i
  302    63.806919 ..hi

Therefore, the 300th block is no longer an idle page once there was IO
for it. With this feature, users can see how many idle pages zram holds,
which are a waste of memory.
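The four-column state string in block_state can be modeled as below. The flag bit values here are illustrative only; just the column legend (s/w/h/i) matches the documentation.

```c
#include <string.h>

/* Illustrative flag bits matching block_state's columns:
 * s = same-filled, w = written back, h = huge, i = idle. */
enum {
	BS_SAME = 1 << 0,
	BS_WB   = 1 << 1,
	BS_HUGE = 1 << 2,
	BS_IDLE = 1 << 3,
};

/* Render a block_state-style column string such as "..hi". */
static void block_state_str(unsigned int flags, char out[5])
{
	out[0] = (flags & BS_SAME) ? 's' : '.';
	out[1] = (flags & BS_WB)   ? 'w' : '.';
	out[2] = (flags & BS_HUGE) ? 'h' : '.';
	out[3] = (flags & BS_IDLE) ? 'i' : '.';
	out[4] = '\0';
}
```

A monitoring script could count lines whose fourth column is 'i' to measure how much zram memory is idle.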

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 +++
 Documentation/blockdev/zram.txt| 10 ++--
 drivers/block/zram/zram_drv.c  | 57 --
 drivers/block/zram/zram_drv.h  |  1 +
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim 
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:  /sys/block/zram<id>/idle
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   idle file is write-only and mark zram slot as idle.
+   If system has mounted debugfs, user can see which slots
+   are idle via /sys/kernel/debug/zram/zram<id>/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
 debug_stat        RO    this file is used for zram debugging purposes
 backing_dev       RW    set up backend storage for zram to write out
+idle              WO    mark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with *pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
- 300    75.033841 .wh
- 301    63.806904 s..
- 302    63.806919 ..h
+ 300    75.033841 .wh.
+ 301    63.806904 s...
+ 302    63.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing storage.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4457d0395bfb..180613b478a6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,47 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+   int index;
+   char mode_buf[8];
+   ssize_t sz;
+
+   sz = strscpy(mode_buf, buf, sizeof(mode_buf));
+   if (sz <= 0)
+   return -EINVAL;
+
+   /* ignore trailing new line */
+   if (mode_buf[sz - 1] == '\n')
+   mode_buf[sz - 1] = 0x00;
+
+   if (strcmp(mode_buf, "all"))
+   return -EINVAL;
+
+   down_read(&zram->init_lock);
+   if (!init_done(zram)) {
+   up_read(&zram->init_lock);
+   return -EINVAL;
+   }
+
+   for (index = 0; index < nr_pages; index++) {
+   zram_slot_lock(zram, index);
+   if (!zram_allocated(zram, index))
+   goto next;
+
+   zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+   zram_slot_unlock(zram, index);
+   }
+
+   up_read(&zram->init_lock);
+
+   return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -638,6 +679,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+   zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -670,12 +712,13 @@ static ssize_t read_block

[PATCH v3 1/7] zram: fix lockdep warning of free block handling

2018-11-26 Thread Minchan Kim
[  254.519728] 
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] 
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: 
put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [] 
__slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [] 
__slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [] 
__do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [] 
irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]CPU0
[  254.521732]
[  254.521732]   lock(&(>bitmap_lock)->rlock);
[  254.521732]   
[  254.521732] lock(&(>bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  <IRQ>
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  </IRQ>

With the writeback feature, zram_slot_free_notify could be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that, so lockdep yells out.

get_entry_bdev
spin_lock(bitmap->lock);
irq
softirq
end_swap_bio_read
zram_slot_free_notify
zram_slot_lock <-- deadlock prone
zram_free_page
put_entry_bdev
spin_lock(bitmap->lock); <-- deadlock prone

With akpm's suggestion (i.e., the bitmap operation is already atomic),
we could remove the bitmap lock. It might fail to find an empty slot
if serious contention happens. However, that is not a severe problem,
because huge page writeback already has the possibility of failing
when there is severe memory pressure. The worst case is just keeping
the incompressible page in memory, not on storage.

The other problem is zram_slot_lock in zram_slot_free_notify.
To make it safe, this patch introduces zram_slot_trylock, which
zram_slot_free_notify uses. Although it is rarely contended,
this patch adds a new debug stat "miss_free" to keep monitoring
how often it happens.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 38 +++
 drivers/block/zram/zram_drv.h |  2 +-
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4879595200e1..21a7046958a3 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -53,6 +53,11 @@ static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
 
+static int zram_slot_trylock(struct zr

[PATCH v3 2/7] zram: fix double free backing device

2018-11-26 Thread Minchan Kim
If blkdev_get() fails, we shouldn't call blkdev_put(). Otherwise,
the kernel emits the log below. This patch fixes it.

[   31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 
blkdev_put+0x105/0x120
[   31.075104] Modules linked in:
[   31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
[   31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   31.079589] RIP: 0010:blkdev_put+0x105/0x120
[   31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 
6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b e9 
5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
[   31.085080] RSP: 0018:b409005c7ed0 EFLAGS: 00010297
[   31.086383] RAX: 9779fe5a8040 RBX: 9779fbc17300 RCX: b9fc37a4
[   31.088105] RDX: 0001 RSI:  RDI: 9640e740
[   31.089850] RBP: 9779fbc17318 R08: 95499a89 R09: 0004
[   31.091201] R10: b409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 0083
[   31.092276] R13: 9779fe607b98 R14:  R15: 9779fe607a38
[   31.093355] FS:  7fc118d9b840() GS:9779fc60() 
knlGS:
[   31.094582] CS:  0010 DS:  ES:  CR0: 80050033
[   31.095541] CR2: 7fc11894b8dc CR3: 339f6001 CR4: 00160ef0
[   31.096781] Call Trace:
[   31.097212]  __x64_sys_swapoff+0x46d/0x490
[   31.097914]  do_syscall_64+0x5a/0x190
[   31.098550]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   31.099402] RIP: 0033:0x7fc11843ec27
[   31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
[   31.103149] RSP: 002b:7ffdf69be648 EFLAGS: 0206 ORIG_RAX: 
00a8
[   31.104425] RAX: ffda RBX: 011d98c0 RCX: 7fc11843ec27
[   31.105627] RDX: 0001 RSI: 0001 RDI: 011d98c0
[   31.106847] RBP: 0001 R08: 7ffdf69be690 R09: 0001
[   31.108038] R10: 02b1 R11: 0206 R12: 0001
[   31.109231] R13:  R14:  R15: 
[   31.110433] irq event stamp: 4466
[   31.111001] hardirqs last  enabled at (4465): [] 
__free_pages_ok+0x1e3/0x490
[   31.112437] hardirqs last disabled at (4466): [] 
trace_hardirqs_off_thunk+0x1a/0x1c
[   31.113973] softirqs last  enabled at (3420): [] 
__do_softirq+0x333/0x446
[   31.115364] softirqs last disabled at (3407): [] 
irq_exit+0xd1/0xe0

Cc: stable@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 21a7046958a3..d1459cc1159f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -387,8 +387,10 @@ static ssize_t backing_dev_store(struct device *dev,
 
bdev = bdgrab(I_BDEV(inode));
err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
-   if (err < 0)
+   if (err < 0) {
+   bdev = NULL;
goto out;
+   }
 
nr_pages = i_size_read(inode) >> PAGE_SHIFT;
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v3 4/7] zram: introduce ZRAM_IDLE flag

2018-11-26 Thread Minchan Kim
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see it via /sys/kernel/debug/zram/zram0/block_state.

  30075.033841 ...i
  30163.806904 s..i
  30263.806919 ..hi

Once there is IO for the slot, the mark disappears.

  30075.033841 ...
  30163.806904 s..i
  30263.806919 ..hi

Therefore, the 300th block was an idle zpage. With this feature,
users can see how many idle pages zram has, which are a waste of memory.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 +++
 Documentation/blockdev/zram.txt| 10 ++--
 drivers/block/zram/zram_drv.c  | 57 --
 drivers/block/zram/zram_drv.h  |  1 +
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim 
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:  /sys/block/zram/idle
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The idle file is write-only and marks zram slots as idle.
+   If the system has debugfs mounted, users can see which slots
+   are idle via /sys/kernel/debug/zram/zram/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithmRWshow and change the compression 
algorithm
 compact   WOtrigger memory compaction
 debug_statROthis file is used for zram debugging purposes
 backing_dev  RWset up backend storage for zram to write out
+idle WOmark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
- 30075.033841 .wh
- 30163.806904 s..
- 30263.806919 ..h
+ 30075.033841 .wh.
+ 30163.806904 s...
+ 30263.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 4457d0395bfb..180613b478a6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,47 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+   int index;
+   char mode_buf[8];
+   ssize_t sz;
+
+   sz = strscpy(mode_buf, buf, sizeof(mode_buf));
+   if (sz <= 0)
+   return -EINVAL;
+
+   /* ignore trailing new line */
+   if (mode_buf[sz - 1] == '\n')
+   mode_buf[sz - 1] = 0x00;
+
+   if (strcmp(mode_buf, "all"))
+   return -EINVAL;
+
+   down_read(&zram->init_lock);
+   if (!init_done(zram)) {
+   up_read(&zram->init_lock);
+   return -EINVAL;
+   }
+
+   for (index = 0; index < nr_pages; index++) {
+   zram_slot_lock(zram, index);
+   if (!zram_allocated(zram, index))
+   goto next;
+
+   zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+   zram_slot_unlock(zram, index);
+   }
+
+   up_read(&zram->init_lock);
+
+   return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -638,6 +679,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+   zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -670,12 +712,13 @@ static ssize_t read_block

[PATCH v3 3/7] zram: refactoring flags and writeback stuff

2018-11-26 Thread Minchan Kim
This patch renames some variables and restructures some code for
better readability in the writeback and zram_free_page paths.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 105 +-
 drivers/block/zram/zram_drv.h |   8 +--
 2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d1459cc1159f..4457d0395bfb 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
-   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
-   bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_unlock(struct zram *zram, u32 index)
 {
-   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
 static inline bool zram_allocated(struct zram *zram, u32 index)
 {
 
-   return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
zram->table[index].handle;
 }
 
@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, 
unsigned long handle)
 static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   return zram->table[index].value & BIT(flag);
+   return zram->table[index].flags & BIT(flag);
 }
 
 static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value |= BIT(flag);
+   zram->table[index].flags |= BIT(flag);
 }
 
 static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value &= ~BIT(flag);
+   zram->table[index].flags &= ~BIT(flag);
 }
 
 static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, 
u32 index)
 
 static size_t zram_get_obj_size(struct zram *zram, u32 index)
 {
-   return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+   return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
 }
 
 static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
 {
-   unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+   unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
 
-   zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+   zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
 #if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
-   return zram->backing_dev;
-}
-
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
 
-   if (!zram_wb_enabled(zram))
+   if (!zram->backing_dev)
return;
 
bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
ssize_t ret;
 
down_read(&zram->init_lock);
-   if (!zram_wb_enabled(zram)) {
+   if (!zram->backing_dev) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
@@ -447,7 +442,7 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
 {
unsigned long blk_idx = 1;
 retry:
@@ -462,11 +457,11 @@ static unsigned long get_entry_bdev(struct zram *zram)
return blk_idx;
 }
 
-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 {
int was_set;
 
-   was_set = test_and_clear_bit(entry, zram->bitmap);
+   was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
 }
 
@@ -579,7 +574,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
if (!bio)
return -ENOMEM;
 
-   entry = get_entry_bdev(zram);
+   entry = alloc_block_bdev(zram);
if (!entry) {
bio_put(bio);
return -ENOSPC;
@@ -590,7 +585,7 @@ static int write_to_bdev(

[PATCH v3 6/7] zram: add bd_stat statistics

2018-11-26 Thread Minchan Kim
bd_stat represents things that happened in the backing device. Currently,
it supports bd_count, bd_reads and bd_writes, which are helpful for
understanding flash wearout and memory savings.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++
 Documentation/blockdev/zram.txt| 11 
 drivers/block/zram/zram_drv.c  | 29 ++
 drivers/block/zram/zram_drv.h  |  5 
 4 files changed, 53 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..65fc33b2f53b 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact:   Minchan Kim 
 Description:
The writeback file is write-only and triggers idle and/or
huge page writeback to the backing device.
+
+What:  /sys/block/zram/bd_stat
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The bd_stat file is read-only and represents backing device's
+   statistics (bd_count, bd_reads, bd_writes) in a format
+   similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 806cdaabac83..906df97527a7 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It 
consists of a single
  pages_compacted  the number of pages freed during compaction
  huge_pages  the number of incompressible pages
 
+File /sys/block/zram/bd_stat
+
The bd_stat file represents a device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:
+ bd_count  size of data written in backing device.
+   Unit: 4K bytes
+ bd_reads  the number of reads from backing device
+   Unit: 4K bytes
+ bd_writes the number of writes to backing device
+   Unit: 4K bytes
+
 9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6b5a886c8f32..67168a6ecca6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -502,6 +502,7 @@ static unsigned long alloc_block_bdev(struct zram *zram)
if (test_and_set_bit(blk_idx, zram->bitmap))
goto retry;
 
+   atomic64_inc(&zram->stats.bd_count);
return blk_idx;
 }
 
@@ -511,6 +512,7 @@ static void free_block_bdev(struct zram *zram, unsigned 
long blk_idx)
 
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
WARN_ON_ONCE(!was_set);
+   atomic64_dec(&zram->stats.bd_count);
 }
 
 static void zram_page_end_io(struct bio *bio)
@@ -668,6 +670,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
+   atomic64_inc(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -757,6 +760,7 @@ static int read_from_bdev_sync(struct zram *zram, struct 
bio_vec *bvec,
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
 {
+   atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -1013,6 +1017,25 @@ static ssize_t mm_stat_show(struct device *dev,
return ret;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   ssize_t ret;
+
+   down_read(&zram->init_lock);
+   ret = scnprintf(buf, PAGE_SIZE,
+   "%8llu %8llu %8llu\n",
+   (u64)atomic64_read(&zram->stats.bd_count),
+   (u64)atomic64_read(&zram->stats.bd_reads),
+   (u64)atomic64_read(&zram->stats.bd_writes));
+   up_read(&zram->init_lock);
+
+   return ret;
+}
+#endif
+
 static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1033,6 +1056,9 @@ static ssize_t debug_stat_show(struct device *dev,
 
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
 static DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zram, u64 disksize)
@@ -1759,6 +1785,9 @@ static struct attribute *zram_disk_attrs[] = {
 #endif
&dev_attr_io_stat.attr,
&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_WRITEBACK
+   &dev_attr_bd_stat.attr,
+#endif
&dev_attr_debug_stat.attr,
NULL,
 };
diff --git a/dr

[PATCH v3 5/7] zram: support idle/huge page writeback

2018-11-26 Thread Minchan Kim
This patch supports a new feature, "zram idle/huge page writeback".
In the zram-swap usecase, zram usually has many idle/huge swap pages.
It's pointless to keep them in memory (i.e., zram).

To solve the problem, this feature introduces idle/huge page
writeback to the backing device, so the goal is to save more memory
space on embedded systems.

The normal sequence to use the idle/huge page writeback feature is as follows:

while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# Unless there is access for some blocks on zram,
# they remain IDLE-marked pages.

echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE and/or HUGE marked slots to the backing device
# and free the memory.
}

Per the discussion:
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,

this patch removes the direct incompressible page writeback feature
(d2afd25114f4, "zram: write incompressible pages to backing device"),
so we could regard it as a regression because incompressible pages
don't go to the backing storage automatically. Instead, the user
should do it manually via "echo huge > /sys/block/zramX/writeback".

If we hear of a regression, we can restore the function.

Reviewed-by: Joey Pabalinas 
Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  28 ++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 247 +++--
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 209 insertions(+), 79 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and trigger idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..806cdaabac83 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ The stat file represents device's mm statistics. It 
consists of a single
 
 = writeback
 
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
 to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, the admin should set up a backing device via
+
+   "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+
+before the disksize setting. Only a partition is supported at the moment.
+If admins want to use incompressible page writeback, they can do so via
+
+   "echo huge > /sys/block/zramX/writeback"
+
+To use idle page writeback, the user first needs to declare zram pages
+as idle.
+
+   "echo all > /sys/block/zramX/idle"
+
+From now on, all pages on zram are idle pages. The idle mark
+is removed when someone requests access to the block.
+IOW, unless there is an access request, those pages remain idle pages.
+
+Admin can request writeback of those idle pages at the right time via
+
+   "echo idle > /sys/block/zramX/writeback"
+
+With the command, zram writes back idle pages from memory to the storage.
 
 = memory tracking
 
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-   bool "Write back incompressible page to backing device"
+   bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
+With /sys/block/zramX/{idle,writeback}, applications can request
+idle page writeback to the backing device to save memory.
+
 See Documentation/blockdev/zram.tx

[PATCH v3 0/7] zram idle page writeback

2018-11-26 Thread Minchan Kim
Inherently, a swap device has many idle pages which are rarely touched
since they were allocated. It is never a problem if we use a storage
device as swap. However, it's just a waste for zram-swap.

This patchset supports zram idle page writeback feature.

* Admin can define what an idle page is: "no access since X time ago"
* Admin can define when zram should write them back
* Admin can define when zram should stop writeback to prevent wearout

Details are in each patch's description.

The first two patches below are -stable material, so they could go in
first, separately from the others in this series.

  zram: fix lockdep warning of free block handling
  zram: fix double free backing device

* from v2
  - use strscpy instead of strlcpy - Joey Pabalinas
  - remove irqlock for bitmap op - akpm
  - don't use page as stat unit - akpm

* from v1
  - add fix for double free of backing device - minchan
  - change writeback/idle interface - minchan 
  - remove direct incompressible page writeback - sergey

Minchan Kim (7):
  zram: fix lockdep warning of free block handling
  zram: fix double free backing device
  zram: refactoring flags and writeback stuff
  zram: introduce ZRAM_IDLE flag
  zram: support idle/huge page writeback
  zram: add bd_stat statistics
  zram: writeback throttle

 Documentation/ABI/testing/sysfs-block-zram |  32 ++
 Documentation/blockdev/zram.txt|  51 ++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 501 +++--
 drivers/block/zram/zram_drv.h  |  19 +-
 5 files changed, 446 insertions(+), 162 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog





Re: [PATCH v2 5/7] zram: support idle/huge page writeback

2018-11-26 Thread Minchan Kim
On Sun, Nov 25, 2018 at 11:47:37PM -1000, Joey Pabalinas wrote:
> On Mon, Nov 26, 2018 at 05:28:11PM +0900, Minchan Kim wrote:
> > +   strlcpy(mode_buf, buf, sizeof(mode_buf));
> > +   /* ignore trailing newline */
> > +   sz = strlen(mode_buf);
> 
> One possible idea would be to use strscpy() instead and directly assign
> the return value to sz, avoiding an extra strlen() call (though you would
> have to check if `sz == -E2BIG` and do `sz = sizeof(mode_buf) - 1` in that
> case).

Thanks for the suggestion.
If I limit the destination buffer to a smaller size, couldn't I still hit -E2BIG?

> 
> > +   if (!strcmp(mode_buf, "idle"))
> > +   mode = IDLE_WRITEBACK;
> > +   if (!strcmp(mode_buf, "huge"))
> > +   mode = HUGE_WRITEBACK;
> 
> Maybe using `else if (!strcmp(mode_buf, "huge"))` would be slightly
> better here, avoiding a second strcmp() if mode_buf has already
> matched "idle".

I considered "huge|idle" as an option. Anyway, in that case, the mode should
be "mode |=". At this moment, yes, let's use "else if" since I don't have a
strong opinion on supporting "idle|huge".
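
The copy-and-parse pattern under discussion can be sketched in userspace C.
The `strscpy_sim` helper below is a simplified stand-in for the kernel's
strscpy() (returning -1 in place of -E2BIG), and the flag values are
illustrative, not zram's exact definitions:

```c
#include <assert.h>
#include <string.h>

#define IDLE_WRITEBACK (1 << 0)
#define HUGE_WRITEBACK (1 << 1)

/* Simplified userspace stand-in for the kernel's strscpy(): copies up
 * to size-1 bytes, always NUL-terminates, and returns the copied length,
 * or -1 (standing in for -E2BIG) on truncation. */
static long strscpy_sim(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (len >= size) {
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -1;
	}
	memcpy(dst, src, len + 1);
	return (long)len;
}

/* Parse the sysfs input: take the length from the copy, clamp on
 * truncation, strip a trailing newline, then match the mode with an
 * else-if chain so "idle" never falls through to the second strcmp().
 * Returns 0 for an unknown mode. */
static int parse_writeback_mode(const char *buf)
{
	char mode_buf[8];
	long sz = strscpy_sim(mode_buf, buf, sizeof(mode_buf));

	if (sz < 0)
		sz = sizeof(mode_buf) - 1;
	if (sz > 0 && mode_buf[sz - 1] == '\n')
		mode_buf[sz - 1] = '\0';

	if (!strcmp(mode_buf, "idle"))
		return IDLE_WRITEBACK;
	else if (!strcmp(mode_buf, "huge"))
		return HUGE_WRITEBACK;
	return 0;
}
```

This shows both suggestions at once: the copy's return value replaces the
extra strlen() call, and the else-if avoids the redundant second strcmp().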

> 
> > +   if ((mode & IDLE_WRITEBACK &&
> > + !zram_test_flag(zram, index, ZRAM_IDLE)) &&
> > +   (mode & HUGE_WRITEBACK &&
> > + !zram_test_flag(zram, index, ZRAM_HUGE)))
> > +   goto next;
> 
> Wouldn't writing this as `mode & (IDLE_WRITEBACK | HUGE_WRITEBACK)`
> be a bit easier to read as well as slightly more compact?
> 
> > +   ret = len;
> > +__free_page(page);
> > +release_init_lock:
> > +   up_read(&zram->init_lock);
> > +   return ret;
> 
> Hm, I noticed that this function either returns an error or just the passed
> in len on success, and I'm left wondering if there might be other useful
> information which could be passed back to the caller instead. I can't
> immediately think of any such information, though, so it's possible I'm
> just daydreaming :)

It is the write syscall semantics of sysfs, so I'm not sure it's doable to
pass another value to the user.

> 
> -- 
> Cheers,
> Joey Pabalinas




Re: [PATCH v2 7/7] zram: writeback throttle

2018-11-26 Thread Minchan Kim
On Mon, Nov 26, 2018 at 12:54:46PM -0800, Andrew Morton wrote:
> On Mon, 26 Nov 2018 17:28:13 +0900 Minchan Kim  wrote:
> 
> > On small memory system, there are lots of write IO so if we use
> > flash device as swap, there would be serious flash wearout.
> > To overcome the problem, system developers need to design write
> > limitation strategy to guarantee flash health for entire product life.
> > 
> > This patch creates a new knob "writeback_limit" on zram. With that,
> > if current writeback IO count(/sys/block/zramX/io_stat) exceeds
> > the limitation, zram stops further writeback until admin can reset
> > the limit.
> > 
> > +++ b/Documentation/ABI/testing/sysfs-block-zram
> > @@ -121,3 +121,12 @@ Contact:   Minchan Kim 
> > The bd_stat file is read-only and represents backing device's
> > statistics (bd_count, bd_reads, bd_writes) in a format
> > similar to block layer statistics file format.
> > +
> > +What:  /sys/block/zram/writeback_limit
> > +Date:  November 2018
> > +Contact:   Minchan Kim 
> > +Description:
> > +   The writeback_limit file is read-write and specifies the maximum
> > +   amount of writeback ZRAM can do. The limit could be changed
> > +   in run time and "0" means disable the limit.
> > +   No limit is the initial state.
> > diff --git a/Documentation/blockdev/zram.txt 
> > b/Documentation/blockdev/zram.txt
> > index 550bca77d322..41748d52712d 100644
> > --- a/Documentation/blockdev/zram.txt
> > +++ b/Documentation/blockdev/zram.txt
> > @@ -164,6 +164,8 @@ reset WOtrigger device reset
> >  mem_used_max  WOreset the `mem_used_max' counter (see later)
> >  mem_limit WOspecifies the maximum amount of memory ZRAM can use
> >  to store the compressed data
> > +writeback_limit  WOspecifies the maximum amount of write IO zram 
> > can
> > +   write out to backing device
> 
> Neither the changelog nor the Documentation specify the units of
> writeback_limit.  Bytes?  Pages?  Blocks?
> 
> This gets so confusing that in many /proc/sys/vm files we actually put
> the units into the filenames.
> 

I will use unit as 4K.

Thanks.
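
With the limit expressed in fixed 4K units, userspace does the byte
conversion itself; a hypothetical helper (the name is illustrative, not
from the patch) is just a shift:

```c
#include <assert.h>

/* Convert a byte budget into the number of 4K writeback units,
 * rounding down: one unit is 1 << 12 bytes. */
static unsigned long bytes_to_wb_units(unsigned long long bytes)
{
	return (unsigned long)(bytes >> 12);
}
```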


Re: [PATCH v2 6/7] zram: add bd_stat statistics

2018-11-26 Thread Minchan Kim
On Mon, Nov 26, 2018 at 12:58:33PM -0800, Andrew Morton wrote:
> On Mon, 26 Nov 2018 17:28:12 +0900 Minchan Kim  wrote:
> 
> > +File /sys/block/zram/bd_stat
> > +
> > +The stat file represents device's backing device statistics. It consists of
> > +a single line of text and contains the following stats separated by 
> > whitespace:
> > + bd_count  size of data written in backing device.
> > +   Unit: pages
> > + bd_reads  the number of reads from backing device
> > +   Unit: pages
> > + bd_writes the number of writes to backing device
> > +   Unit: pages
> 
> Using `pages' is a bad choice.  And I assume this means that
> writeback_limit is in pages as well, which is worse.
> 
> Page sizes are not constant!  We want userspace which was developed on
> 4k pagesize to work the same on 64k pagesize.
> 
> Arguably, we could require that well-written userspace remember to use
> getpagesize().  However we have traditionally tried to avoid that by
> performing the pagesize normalization within the kernel.

zram works on a page basis so I used that term, but I agree it's rather
vague. If there is no objection, I will use (Unit: 4K) instead of
(Unit: pages).
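
Normalizing page counts to 4K units inside the kernel keeps the ABI stable
across page sizes, as Andrew asks for; the conversion the reply implies can
be sketched as (the helper name is illustrative):

```c
#include <assert.h>

/* Report a page count in fixed 4K units so userspace built on a 4K
 * system reads the same numbers on a 64K-page system. page_shift is
 * assumed to be >= 12, i.e. the page size is a multiple of 4K. */
static unsigned long long pages_to_4k_units(unsigned long long pages,
					    unsigned int page_shift)
{
	/* Each page of size 1 << page_shift holds 1 << (page_shift - 12)
	 * 4K units. */
	return pages << (page_shift - 12);
}
```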



Re: [PATCH v2 1/7] zram: fix lockdep warning of free block handling

2018-11-26 Thread Minchan Kim
On Mon, Nov 26, 2018 at 12:49:28PM -0800, Andrew Morton wrote:
> On Mon, 26 Nov 2018 17:28:07 +0900 Minchan Kim  wrote:
> 
> > 
> > ...
> >
> > With writeback feature, zram_slot_free_notify could be called
> > in softirq context by end_swap_bio_read. However, bitmap_lock
> > is not aware of that, so lockdep yells out.
> > 
> > The problem is not only bitmap_lock but also zram_slot_lock,
> > so a straightforward solution would be to disable irqs on
> > zram_slot_lock, which covers every bitmap_lock, too.
> > Although the irq-disabled duration is short in many places where
> > zram_slot_lock is used, one place (ie, decompress) is not fast
> > enough to hold an irqlock, depending on the compression algorithm,
> > so it's not an option.
> > 
> > The approach in this patch is just "best effort", not a guarantee
> > of "freeing orphan zpages". If zram_slot_lock contention happens,
> > the kernel can't free the zpage until it recycles the block. However,
> > such contention between zram_slot_free_notify and other places
> > holding zram_slot_lock should be very rare in real practice.
> > To see how often it happens, this patch adds new debug stat
> > "miss_free".
> > 
> > It also adds an irq lock in get/put_entry_bdev to prevent the
> > deadlock lockdep reported. The reason I used irq disabling rather
> > than bottom halves is that swap_slot_free_notify can be called with
> > irqs disabled, so it breaks local_bh_enable's rule. The irqlock covers
> > only written-back zram slot entries, so it should not be a frequently
> > taken lock.
> > 
> > Cc: sta...@vger.kernel.org # 4.14+
> > Signed-off-by: Minchan Kim 
> > ---
> >  drivers/block/zram/zram_drv.c | 56 +--
> >  drivers/block/zram/zram_drv.h |  1 +
> >  2 files changed, 42 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 4879595200e1..472027eaed60 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -53,6 +53,11 @@ static size_t huge_class_size;
> >  
> >  static void zram_free_page(struct zram *zram, size_t index);
> >  
> > +static int zram_slot_trylock(struct zram *zram, u32 index)
> > +{
> > +   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
> > +}
> > +
> >  static void zram_slot_lock(struct zram *zram, u32 index)
> >  {
> > bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
> > @@ -443,29 +448,45 @@ static ssize_t backing_dev_store(struct device *dev,
> >  
> >  static unsigned long get_entry_bdev(struct zram *zram)
> >  {
> > -   unsigned long entry;
> > +   unsigned long blk_idx;
> > +   unsigned long ret = 0;
> >  
> > -   spin_lock(&zram->bitmap_lock);
> > /* skip 0 bit to confuse zram.handle = 0 */
> > -   entry = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> > -   if (entry == zram->nr_pages) {
> > -   spin_unlock(&zram->bitmap_lock);
> > -   return 0;
> > +   blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> > +   if (blk_idx == zram->nr_pages)
> > +   goto retry;
> > +
> > +   spin_lock_irq(&zram->bitmap_lock);
> > +   if (test_bit(blk_idx, zram->bitmap)) {
> > +   spin_unlock_irq(&zram->bitmap_lock);
> > +   goto retry;
> > }
> >  
> > -   set_bit(entry, zram->bitmap);
> > -   spin_unlock(&zram->bitmap_lock);
> > +   set_bit(blk_idx, zram->bitmap);
> 
> Here we could do
> 
>   if (test_and_set_bit(...)) {
>   spin_unlock(...);
>   goto retry;
> 
> But it's weird to take the spinlock on behalf of bitops which are
> already atomic!
> 
> It seems rather suspicious to me.  Why are we doing this?

What I need is a look_up_and_set operation. I don't see an atomic
operation for that. But I want to minimize the irq-disabled area, so
first it scans the bits locklessly, and if a race happens, it can
retry under the lock.

It seems __set_bit is enough under the lock.
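
The lookup-and-set idea — an optimistic lockless scan, then a re-check and
set under the lock — can be sketched in userspace. A single-word bitmap and
a plain pthread mutex stand in for zram's block bitmap and bitmap_lock; the
names are illustrative:

```c
#include <assert.h>
#include <pthread.h>

#define NR_BLOCKS 64

static unsigned long bitmap;            /* one word, bits 0..63 */
static pthread_mutex_t bitmap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Find the first zero bit at or above 'start', or NR_BLOCKS if none. */
static unsigned int find_zero_bit(unsigned long map, unsigned int start)
{
	unsigned int i;

	for (i = start; i < NR_BLOCKS; i++)
		if (!(map & (1UL << i)))
			return i;
	return NR_BLOCKS;
}

/* Allocate a block index: scan locklessly first, then confirm and set
 * the bit under the lock, rescanning if we raced. Bit 0 is skipped so
 * that 0 can mean "no block". Returns 0 when the bitmap is full. */
static unsigned int alloc_block(void)
{
	unsigned int idx = find_zero_bit(bitmap, 1);
	unsigned int ret = 0;

	pthread_mutex_lock(&bitmap_lock);
	if (idx >= NR_BLOCKS || (bitmap & (1UL << idx)))
		idx = find_zero_bit(bitmap, 1);  /* raced: rescan locked */
	if (idx < NR_BLOCKS) {
		bitmap |= 1UL << idx;            /* __set_bit is enough here */
		ret = idx;
	}
	pthread_mutex_unlock(&bitmap_lock);
	return ret;
}
```

The lockless scan only narrows the candidate; the authoritative test-and-set
happens with the lock held, which is why a non-atomic set suffices there.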

> 
> > +   ret = blk_idx;
> > +   goto out;
> > +retry:
> > +   spin_lock_irq(&zram->bitmap_lock);
> > +   blk_idx = find_next_zero_bit(zram->bitmap, zram->nr_pages, 1);
> > +   if (blk_idx == zram->nr_pages)
> > +   goto out;
> > +
> > +   set_bit(blk_idx, zram->bitmap);
> > +   ret = blk_idx;
> > +out:
> > +   spin_unlock_irq(&zram->bitmap_lock);
> >  
> > -   return entry;
> > +   return ret;
> >  }
> >  
> >  static void put_entry_bdev(struct zram *zram, unsigned long entry)
> >  {
> > int was_set;
> > +   unsigned long flags;
> >  
> > -   spin_lock(&zram->bitmap_lock);
> > +   spin_lock_irqsave(&zram->bitmap_lock, flags);
> > was_set = test_and_clear_bit(entry, zram->bitmap);
> > -   spin_unlock(&zram->bitmap_lock);
> > +   spin_unlock_irqrestore(&zram->bitmap_lock, flags);
> 
> Here's another one.  Surely that locking is unnecessary.

Indeed! Although the get_entry_bdev side can miss some bits, it's not a
critical problem.
The benefit is that we might remove the irq disabling for the lockdep problem.
Yes, I will cook and test.

Thanks, Andrew.


[PATCH v2 3/7] zram: refactoring flags and writeback stuff

2018-11-26 Thread Minchan Kim
This patch renames some variables and restructures some code for better
readability in the writeback and zram_free_page paths.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 105 +-
 drivers/block/zram/zram_drv.h |   8 +--
 2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 514f5aaf6eff..fee7e67c750d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
-   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
-   bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_unlock(struct zram *zram, u32 index)
 {
-   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
 static inline bool zram_allocated(struct zram *zram, u32 index)
 {
 
-   return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
zram->table[index].handle;
 }
 
@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, 
unsigned long handle)
 static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   return zram->table[index].value & BIT(flag);
+   return zram->table[index].flags & BIT(flag);
 }
 
 static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value |= BIT(flag);
+   zram->table[index].flags |= BIT(flag);
 }
 
 static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value &= ~BIT(flag);
+   zram->table[index].flags &= ~BIT(flag);
 }
 
 static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, 
u32 index)
 
 static size_t zram_get_obj_size(struct zram *zram, u32 index)
 {
-   return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+   return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
 }
 
 static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
 {
-   unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+   unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
 
-   zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+   zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
 #if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
-   return zram->backing_dev;
-}
-
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
 
-   if (!zram_wb_enabled(zram))
+   if (!zram->backing_dev)
return;
 
bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
ssize_t ret;
 
	down_read(&zram->init_lock);
-   if (!zram_wb_enabled(zram)) {
+   if (!zram->backing_dev) {
memcpy(buf, "none\n", 5);
	up_read(&zram->init_lock);
return 5;
@@ -448,7 +443,7 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
 {
unsigned long blk_idx;
unsigned long ret = 0;
@@ -481,13 +476,13 @@ static unsigned long get_entry_bdev(struct zram *zram)
return ret;
 }
 
-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 {
int was_set;
unsigned long flags;
 
	spin_lock_irqsave(&zram->bitmap_lock, flags);
-   was_set = test_and_clear_bit(entry, zram->bitmap);
+   was_set = test_and_clear_bit(blk_idx, zram->bitmap);
	spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
 }
@@ -601,7 +596,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
if (!bio)
return -ENOMEM;
 
-   entry = get_entry_bdev(zram);
+   entry = alloc_block_bdev(zram);
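
The packed table-entry layout this refactoring renames — the compressed
object size in the low bits, per-slot flags above ZRAM_FLAG_SHIFT — can be
illustrated standalone. The shift value below is illustrative, not zram's
exact constant:

```c
#include <assert.h>

#define FLAG_SHIFT 24                   /* illustrative, not zram's value */
#define SIZE_MASK  ((1UL << FLAG_SHIFT) - 1)

/* One packed word, as in zram's table[index].flags field: low bits hold
 * the compressed object size, high bits hold per-slot flag bits. */
static unsigned long set_obj_size(unsigned long word, unsigned long size)
{
	unsigned long flags = word >> FLAG_SHIFT;

	/* Preserve the flag bits, replace the size bits. */
	return (flags << FLAG_SHIFT) | size;
}

static unsigned long get_obj_size(unsigned long word)
{
	return word & SIZE_MASK;
}

static unsigned long set_flag(unsigned long word, unsigned int flag)
{
	return word | (1UL << flag);
}

static int test_flag(unsigned long word, unsigned int flag)
{
	return !!(word & (1UL << flag));
}
```

Packing both into one word is why updating the size must shift the flags
out and back in, and why flag bits all live above FLAG_SHIFT.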

[PATCH v2 5/7] zram: support idle/huge page writeback

2018-11-26 Thread Minchan Kim
This patch supports a new feature, "zram idle/huge page writeback".
In the zram-swap usecase, zram usually has many idle/huge swap pages.
It's pointless to keep them in memory (ie, zram).

To solve the problem, this feature introduces idle/huge page
writeback to the backing device; the goal is to save more memory
space on embedded systems.

Normal sequence to use idle/huge page writeback feature is as follows,

while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# Unless some blocks on zram are accessed,
# they remain IDLE-marked pages.

echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE or/and huge marked slot into backing device
# and free the memory.
}

Per the discussion:
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,

This patch removes the direct incompressible page writeback feature
(d2afd25114f4, zram: write incompressible pages to backing device),
so it could be regarded as a regression because incompressible pages
no longer go to backing storage automatically. Instead, users should
do it manually via "echo huge > /sys/block/zramX/writeback".

If we hear some regression, we could restore the function.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  28 ++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 245 ++---
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 207 insertions(+), 79 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and triggers idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..806cdaabac83 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ The stat file represents device's mm statistics. It 
consists of a single
 
 = writeback
 
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
 to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, admin should set up backing device via
+
+   "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+
+before disksize setting. It supports only partitions at this moment.
+If admins want to use incompressible page writeback, they can do so via
+
+   "echo huge > /sys/block/zramX/writeback"
+
+To use idle page writeback, first, users need to declare zram pages
+as idle.
+
+   "echo all > /sys/block/zramX/idle"
+
+From now on, any pages on zram are idle pages. The idle mark
+will be removed when someone requests access to the block.
+IOW, unless there is an access request, those pages remain idle pages.
+
+Admin can request writeback of those idle pages at the right time via
+
+   "echo idle > /sys/block/zramX/writeback"
+
+With the command, zram writes back idle pages from memory to the storage.
 
 = memory tracking
 
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-   bool "Write back incompressible page to backing device"
+   bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
+With /sys/block/zramX/{idle,writeback}, applications can request
+writeback of idle pages to the backing device to save memory.
+
 See Documentation/blockdev/zram.txt for more informat

[PATCH v2 1/7] zram: fix lockdep warning of free block handling

2018-11-26 Thread Minchan Kim
[  254.519728] 
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] 
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: 
put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [] 
__slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [] 
__slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [] 
__do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [] 
irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]CPU0
[  254.521732]
[  254.521732]   lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]   
[  254.521732] lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  

With writeback feature, zram_slot_free_notify could be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that so lockdep yell out. Thanks.

The problem is not only bitmap_lock but it is also zram_slot_lock
so straightforward solution would disable irq on zram_slot_lock
which covers every bitmap_lock, too.
Although duration disabling the irq is short in many places
zram_slot_lock is used, a place(ie, decompress) is not fast
enough to hold irqlock on relying on compression algorithm
so it's not a option.

The approach in this patch is just "best effort", not guarantee
"freeing orphan zpage". If the zram_slot_lock contention may happen,
kernel couldn't free the zpage until it recycles the block. However,
such contention between zram_slot_free_notify and other places to
hold zram_slot_lock should be very rare in real practice.
To see how often it happens, this patch adds new debug stat
"miss_free".

It also adds irq lock in get/put_block_bdev to prevent deadlock
lockdep reported. The reason I used irq disable rather than bottom
half is swap_slot_free_notify could be called with irq disabled
so it breaks local_bh_enable's rule. The irqlock works on only
writebacked zram slot entry so it should be not frequent lock.

Cc: sta...@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 56 +--
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/driver

[PATCH v2 2/7] zram: fix double free backing device

2018-11-26 Thread Minchan Kim
If blkdev_get fails, we shouldn't call blkdev_put on it. Otherwise,
the kernel emits the log below. This patch fixes it.

[   31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 
blkdev_put+0x105/0x120
[   31.075104] Modules linked in:
[   31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
[   31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   31.079589] RIP: 0010:blkdev_put+0x105/0x120
[   31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 
6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b e9 
5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f
[   31.085080] RSP: 0018:b409005c7ed0 EFLAGS: 00010297
[   31.086383] RAX: 9779fe5a8040 RBX: 9779fbc17300 RCX: b9fc37a4
[   31.088105] RDX: 0001 RSI:  RDI: 9640e740
[   31.089850] RBP: 9779fbc17318 R08: 95499a89 R09: 0004
[   31.091201] R10: b409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 0083
[   31.092276] R13: 9779fe607b98 R14:  R15: 9779fe607a38
[   31.093355] FS:  7fc118d9b840() GS:9779fc60() 
knlGS:
[   31.094582] CS:  0010 DS:  ES:  CR0: 80050033
[   31.095541] CR2: 7fc11894b8dc CR3: 339f6001 CR4: 00160ef0
[   31.096781] Call Trace:
[   31.097212]  __x64_sys_swapoff+0x46d/0x490
[   31.097914]  do_syscall_64+0x5a/0x190
[   31.098550]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   31.099402] RIP: 0033:0x7fc11843ec27
[   31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48
[   31.103149] RSP: 002b:7ffdf69be648 EFLAGS: 0206 ORIG_RAX: 
00a8
[   31.104425] RAX: ffda RBX: 011d98c0 RCX: 7fc11843ec27
[   31.105627] RDX: 0001 RSI: 0001 RDI: 011d98c0
[   31.106847] RBP: 0001 R08: 7ffdf69be690 R09: 0001
[   31.108038] R10: 02b1 R11: 0206 R12: 0001
[   31.109231] R13:  R14:  R15: 
[   31.110433] irq event stamp: 4466
[   31.111001] hardirqs last  enabled at (4465): [] 
__free_pages_ok+0x1e3/0x490
[   31.112437] hardirqs last disabled at (4466): [] 
trace_hardirqs_off_thunk+0x1a/0x1c
[   31.113973] softirqs last  enabled at (3420): [] 
__do_softirq+0x333/0x446
[   31.115364] softirqs last disabled at (3407): [] 
irq_exit+0xd1/0xe0

Cc: sta...@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 472027eaed60..514f5aaf6eff 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -387,8 +387,10 @@ static ssize_t backing_dev_store(struct device *dev,
 
bdev = bdgrab(I_BDEV(inode));
err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram);
-   if (err < 0)
+   if (err < 0) {
+   bdev = NULL;
goto out;
+   }
 
nr_pages = i_size_read(inode) >> PAGE_SHIFT;
bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v2 3/7] zram: refactoring flags and writeback stuff

2018-11-26 Thread Minchan Kim
This patch renames some variables and restructures some code for
better readability in the writeback and zram_free_page paths.

Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 105 +-
 drivers/block/zram/zram_drv.h |   8 +--
 2 files changed, 44 insertions(+), 69 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 514f5aaf6eff..fee7e67c750d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -55,17 +55,17 @@ static void zram_free_page(struct zram *zram, size_t index);
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
-   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
+   return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_lock(struct zram *zram, u32 index)
 {
-   bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_lock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static void zram_slot_unlock(struct zram *zram, u32 index)
 {
-   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+   bit_spin_unlock(ZRAM_LOCK, &zram->table[index].flags);
 }
 
 static inline bool init_done(struct zram *zram)
@@ -76,7 +76,7 @@ static inline bool init_done(struct zram *zram)
 static inline bool zram_allocated(struct zram *zram, u32 index)
 {
 
-   return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
zram->table[index].handle;
 }
 
@@ -99,19 +99,19 @@ static void zram_set_handle(struct zram *zram, u32 index, 
unsigned long handle)
 static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   return zram->table[index].value & BIT(flag);
+   return zram->table[index].flags & BIT(flag);
 }
 
 static void zram_set_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value |= BIT(flag);
+   zram->table[index].flags |= BIT(flag);
 }
 
 static void zram_clear_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
-   zram->table[index].value &= ~BIT(flag);
+   zram->table[index].flags &= ~BIT(flag);
 }
 
 static inline void zram_set_element(struct zram *zram, u32 index,
@@ -127,15 +127,15 @@ static unsigned long zram_get_element(struct zram *zram, 
u32 index)
 
 static size_t zram_get_obj_size(struct zram *zram, u32 index)
 {
-   return zram->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
+   return zram->table[index].flags & (BIT(ZRAM_FLAG_SHIFT) - 1);
 }
 
 static void zram_set_obj_size(struct zram *zram,
u32 index, size_t size)
 {
-   unsigned long flags = zram->table[index].value >> ZRAM_FLAG_SHIFT;
+   unsigned long flags = zram->table[index].flags >> ZRAM_FLAG_SHIFT;
 
-   zram->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
+   zram->table[index].flags = (flags << ZRAM_FLAG_SHIFT) | size;
 }
 
 #if PAGE_SIZE != 4096
@@ -282,16 +282,11 @@ static ssize_t mem_used_max_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
-static bool zram_wb_enabled(struct zram *zram)
-{
-   return zram->backing_dev;
-}
-
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
 
-   if (!zram_wb_enabled(zram))
+   if (!zram->backing_dev)
return;
 
bdev = zram->bdev;
@@ -318,7 +313,7 @@ static ssize_t backing_dev_show(struct device *dev,
ssize_t ret;
 
down_read(&zram->init_lock);
-   if (!zram_wb_enabled(zram)) {
+   if (!zram->backing_dev) {
memcpy(buf, "none\n", 5);
up_read(&zram->init_lock);
return 5;
@@ -448,7 +443,7 @@ static ssize_t backing_dev_store(struct device *dev,
return err;
 }
 
-static unsigned long get_entry_bdev(struct zram *zram)
+static unsigned long alloc_block_bdev(struct zram *zram)
 {
unsigned long blk_idx;
unsigned long ret = 0;
@@ -481,13 +476,13 @@ static unsigned long get_entry_bdev(struct zram *zram)
return ret;
 }
 
-static void put_entry_bdev(struct zram *zram, unsigned long entry)
+static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
 {
int was_set;
unsigned long flags;
 
spin_lock_irqsave(&zram->bitmap_lock, flags);
-   was_set = test_and_clear_bit(entry, zram->bitmap);
+   was_set = test_and_clear_bit(blk_idx, zram->bitmap);
spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
 }
@@ -601,7 +596,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
if (!bio)
return -ENOMEM;
 
-   entry = get_entry_bdev(zram);
+   entry = alloc_block_bdev(zram);

[PATCH v2 5/7] zram: support idle/huge page writeback

2018-11-26 Thread Minchan Kim
This patch supports a new feature, "zram idle/huge page writeback".
In the zram-swap usecase, zram usually has many idle/huge swap pages.
It's pointless to keep them in memory (ie, zram).

To solve the problem, this feature introduces idle/huge page
writeback to the backing device; the goal is to save more memory
space on embedded systems.

The normal sequence to use the idle/huge page writeback feature is as follows:

while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# If there is no access to some blocks on zram,
# they are still IDLE-marked pages.

echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE or/and huge marked slot into backing device
# and free the memory.
}

Per the discussion at
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,

this patch removes the direct incompressible page writeback feature
(d2afd25114f4, "zram: write incompressible pages to backing device"),
so it could be regarded as a regression because incompressible pages
no longer go to backing storage automatically. Instead, users should
do it manually via "echo huge > /sys/block/zramX/writeback".

If we hear of a regression, we can restore the function.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  28 ++-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 245 ++---
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 207 insertions(+), 79 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and trigger idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..806cdaabac83 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -238,11 +238,31 @@ The stat file represents device's mm statistics. It 
consists of a single
 
 = writeback
 
-With incompressible pages, there is no memory saving with zram.
-Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
 to backing storage rather than keeping it in memory.
-User should set up backing device via /sys/block/zramX/backing_dev
-before disksize setting.
+To use the feature, admin should set up backing device via
+
+   "echo /dev/sda5 > /sys/block/zramX/backing_dev"
+
+before disksize setting. It supports only partition at this moment.
+If admin want to use incompressible page writeback, they could do via
+
+   "echo huge > /sys/block/zramX/writeback"
+
+To use idle page writeback, first, user need to declare zram pages
+as idle.
+
+   "echo all > /sys/block/zramX/idle"
+
+From now on, any pages on zram are idle pages. The idle mark
+will be removed until someone request access of the block.
+IOW, unless there is access request, those pages are still idle pages.
+
+Admin can request writeback of those idle pages at right timing via
+
+   "echo idle > /sys/block/zramX/writeback"
+
+With the command, zram writeback idle pages from memory to the storage.
 
 = memory tracking
 
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-   bool "Write back incompressible page to backing device"
+   bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
+With /sys/block/zramX/{idle,writeback}, application could ask
+idle page's writeback to the backing device to save in memory.
+
 See Documentation/blockdev/zram.txt for more information.

[PATCH v2 1/7] zram: fix lockdep warning of free block handling

2018-11-26 Thread Minchan Kim
[  254.519728] 
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] 
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: 
put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [] 
__slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [] 
__slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [] 
__do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [] 
irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]CPU0
[  254.521732]
[  254.521732]   lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]   <Interrupt>
[  254.521732] lock(&(&zram->bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  

With the writeback feature, zram_slot_free_notify can be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that, so lockdep yells out.

The problem is not only bitmap_lock but also zram_slot_lock, so the
straightforward solution would be to disable irqs around
zram_slot_lock, which covers every bitmap_lock, too.
Although the irq-disabled duration would be short in most places
where zram_slot_lock is used, one place (ie, decompress) is not fast
enough to hold an irqlock because that depends on the compression
algorithm, so it's not an option.

The approach in this patch is "best effort", with no guarantee of
freeing an orphan zpage. If zram_slot_lock contention happens, the
kernel cannot free the zpage until the block is recycled. However,
such contention between zram_slot_free_notify and the other places
holding zram_slot_lock should be very rare in practice.
To see how often it happens, this patch adds a new debug stat,
"miss_free".

It also adds an irq lock in get/put_entry_bdev to prevent the
deadlock lockdep reported. The reason I used irq disabling rather
than bottom-half disabling is that swap_slot_free_notify can be
called with irqs disabled, which breaks local_bh_enable's rules.
The irqlock covers only written-back zram slot entries, so it should
not be a frequently taken lock.

Cc: sta...@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 56 +--
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c


[PATCH v2 0/7] zram idle page writeback

2018-11-26 Thread Minchan Kim
Inherently, a swap device has many idle pages which are rarely
touched after they are allocated. That is never a problem if we use a
storage device as swap. However, it's just a waste for zram-swap.

This patchset supports zram idle page writeback feature.

* Admin can define what is idle page "no access since X time ago"
* Admin can define when zram should writeback them
* Admin can define when zram should stop writeback to prevent wearout

Details are in each patch's description.

Below first two patches are -stable material so it could go first
separately with others in this series.

  zram: fix lockdep warning of free block handling
  zram: fix double free backing device

* from v1
  - add fix for double free of backing device - minchan
  - change writeback/idle interface - minchan 
  - remove direct incompressible page writeback - sergey

Minchan Kim (7):
  zram: fix lockdep warning of free block handling
  zram: fix double free backing device
  zram: refactoring flags and writeback stuff
  zram: introduce ZRAM_IDLE flag
  zram: support idle/huge page writeback
  zram: add bd_stat statistics
  zram: writeback throttle

 Documentation/ABI/testing/sysfs-block-zram |  32 ++
 Documentation/blockdev/zram.txt|  51 +-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 516 +++--
 drivers/block/zram/zram_drv.h  |  18 +-
 5 files changed, 463 insertions(+), 159 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v2 7/7] zram: writeback throttle

2018-11-26 Thread Minchan Kim
On small memory systems there is a lot of write IO, so if we use a
flash device as swap, there can be serious flash wearout.
To overcome the problem, system developers need to design a write
limitation strategy to guarantee flash health for the entire product
life.

This patch creates a new knob, "writeback_limit", on zram. With it,
if the current writeback IO count (/sys/block/zramX/io_stat) exceeds
the limit, zram stops further writeback until the admin resets the
limit.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  9 +
 Documentation/blockdev/zram.txt|  2 +
 drivers/block/zram/zram_drv.c  | 47 +-
 drivers/block/zram/zram_drv.h  |  2 +
 4 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 65fc33b2f53b..9d2339a485c8 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact:   Minchan Kim 
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
+
+What:  /sys/block/zram/writeback_limit
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback_limit file is read-write and specifies the maximum
+   amount of writeback ZRAM can do. The limit could be changed
+   in run time and "0" means disable the limit.
+   No limit is the initial state.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 550bca77d322..41748d52712d 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset  WO    trigger device reset
 mem_used_max      WO    reset the `mem_used_max' counter (see later)
 mem_limit         WO    specifies the maximum amount of memory ZRAM can use
                         to store the compressed data
+writeback_limit   WO    specifies the maximum amount of write IO zram can
+                        write out to backing device
 max_comp_streams  RW    the number of possible concurrent compress operations
 comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cceaa10301e8..07c0847b7c0f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -328,6 +328,40 @@ static ssize_t idle_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   u64 val;
+   ssize_t ret = -EINVAL;
+
+   if (kstrtoull(buf, 10, &val))
+   return ret;
+
+   down_read(&zram->init_lock);
+   atomic64_set(&zram->stats.bd_wb_limit, val);
+   if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+   zram->stop_writeback = false;
+   up_read(&zram->init_lock);
+   ret = len;
+
+   return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   u64 val;
+   struct zram *zram = dev_to_zram(dev);
+
+   down_read(&zram->init_lock);
+   val = atomic64_read(&zram->stats.bd_wb_limit);
+   up_read(&zram->init_lock);
+
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
@@ -592,6 +626,7 @@ static ssize_t writeback_store(struct device *dev,
char mode_buf[64];
unsigned long mode = -1UL;
unsigned long blk_idx = 0;
+   u64 wb_count, wb_limit;
 
strlcpy(mode_buf, buf, sizeof(mode_buf));
/* ignore trailing newline */
@@ -631,6 +666,11 @@ static ssize_t writeback_store(struct device *dev,
bvec.bv_len = PAGE_SIZE;
bvec.bv_offset = 0;
 
+   if (zram->stop_writeback) {
+   ret = -EIO;
+   break;
+   }
+
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
@@ -689,7 +729,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
-   atomic64_inc(&zram->stats.bd_writes);
+   wb_count = atomic64_inc_return(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -713,6 +753,9 @@ static ssize_t writeback_store(struct 

[PATCH v2 4/7] zram: introduce ZRAM_IDLE flag

2018-11-26 Thread Minchan Kim
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see it via /sys/kernel/debug/zram/zram0/block_state.

  300 75.033841 ...i
  301 63.806904 s..i
  302 63.806919 ..hi

Once there is IO for a slot, the mark disappears.

  300 75.033841 ...
  301 63.806904 s..i
  302 63.806919 ..hi

Therefore, the 300th block has been accessed while the 301st and
302nd blocks are still idle zpages. With this feature, users can see
how many idle pages zram has, which are a waste of memory.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 
 Documentation/blockdev/zram.txt| 10 ++--
 drivers/block/zram/zram_drv.c  | 55 --
 drivers/block/zram/zram_drv.h  |  1 +
 4 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim 
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:  /sys/block/zram/idle
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   idle file is write-only and mark zram slot as idle.
+   If system has mounted debugfs, user can see which slots
+   are idle via /sys/kernel/debug/zram/zram/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@ comp_algorithm  RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
 debug_stat        RO    this file is used for zram debugging purposes
 backing_dev       RW    set up backend storage for zram to write out
+idle              WO    mark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with*pagemap.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
- 300 75.033841 .wh
- 301 63.806904 s..
- 302 63.806919 ..h
+ 300 75.033841 .wh.
+ 301 63.806904 s...
+ 302 63.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fee7e67c750d..59f78011d2d9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,45 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+   int index;
+   char mode_buf[64];
+   ssize_t sz;
+
+   strlcpy(mode_buf, buf, sizeof(mode_buf));
+   /* ignore trailing new line */
+   sz = strlen(mode_buf);
+   if (sz > 0 && mode_buf[sz - 1] == '\n')
+   mode_buf[sz - 1] = 0x00;
+
+   if (strcmp(mode_buf, "all"))
+   return -EINVAL;
+
+   down_read(&zram->init_lock);
+   if (!init_done(zram)) {
+   up_read(&zram->init_lock);
+   return -EINVAL;
+   }
+
+   for (index = 0; index < nr_pages; index++) {
+   zram_slot_lock(zram, index);
+   if (!zram_allocated(zram, index))
+   goto next;
+
+   zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+   zram_slot_unlock(zram, index);
+   }
+
+   up_read(>init_lock);
+
+   return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -660,6 +699,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+   zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -692,12 +732,13 @@ static ssize_t read_block_state(struct file *file, 

[PATCH v2 6/7] zram: add bd_stat statistics

2018-11-26 Thread Minchan Kim
bd_stat represents events that happened in the backing device. Currently,
it supports bd_count, bd_reads and bd_writes, which are helpful for
understanding flash wearout and memory savings.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++
 Documentation/blockdev/zram.txt| 11 
 drivers/block/zram/zram_drv.c  | 30 ++
 drivers/block/zram/zram_drv.h  |  5 
 4 files changed, 54 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..65fc33b2f53b 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact:   Minchan Kim 
 Description:
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
+
+What:  /sys/block/zram<id>/bd_stat
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The bd_stat file is read-only and represents backing device's
+   statistics (bd_count, bd_reads, bd_writes) in a format
+   similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 806cdaabac83..550bca77d322 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It consists of a single
  pages_compacted  the number of pages freed during compaction
  huge_pages       the number of incompressible pages
 
+File /sys/block/zram<id>/bd_stat
+
+The bd_stat file represents a device's backing device statistics. It consists of
+a single line of text and contains the following stats separated by whitespace:
+ bd_count  size of data written in backing device.
+   Unit: pages
+ bd_reads  the number of reads from backing device
+   Unit: pages
+ bd_writes the number of writes to backing device
+   Unit: pages
+
 9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 3d069b2328f8..cceaa10301e8 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -518,6 +518,8 @@ static unsigned long alloc_block_bdev(struct zram *zram)
ret = blk_idx;
 out:
	spin_unlock_irq(&zram->bitmap_lock);
+	if (ret != 0)
+		atomic64_inc(&zram->stats.bd_count);
 
return ret;
 }
@@ -531,6 +533,7 @@ static void free_block_bdev(struct zram *zram, unsigned long blk_idx)
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
	spin_unlock_irqrestore(&zram->bitmap_lock, flags);
	WARN_ON_ONCE(!was_set);
+	atomic64_dec(&zram->stats.bd_count);
 }
 
 static void zram_page_end_io(struct bio *bio)
@@ -686,6 +689,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
+	atomic64_inc(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -775,6 +779,7 @@ static int read_from_bdev_sync(struct zram *zram, struct bio_vec *bvec,
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
 {
+	atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -1031,6 +1036,25 @@ static ssize_t mm_stat_show(struct device *dev,
return ret;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   ssize_t ret;
+
+	down_read(&zram->init_lock);
+	ret = scnprintf(buf, PAGE_SIZE,
+			"%8llu %8llu %8llu\n",
+			(u64)atomic64_read(&zram->stats.bd_count),
+			(u64)atomic64_read(&zram->stats.bd_reads),
+			(u64)atomic64_read(&zram->stats.bd_writes));
+	up_read(&zram->init_lock);
+
+   return ret;
+}
+#endif
+
 static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1051,6 +1075,9 @@ static ssize_t debug_stat_show(struct device *dev,
 
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
 static DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zram, u64 disksize)
@@ -1777,6 +1804,9 @@ static struct attribute *zram_disk_attrs[] = {
 #endif
	&dev_attr_io_stat.attr,
	&dev_attr_mm_stat.attr,
+#ifdef CONFIG_ZRAM_WRITEBACK
+	&dev_attr_bd_stat.attr,
+#endif
   

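For reference, the three bd_stat fields documented above can be consumed from userspace with a tiny parser. A minimal sketch, not part of the patch; the 4 KiB page size is an assumption (the real unit depends on the kernel build), and the path `/sys/block/zram<id>/bd_stat` comes from the ABI entry above:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; not guaranteed on every arch

def parse_bd_stat(line):
    """Parse one line of /sys/block/zram<id>/bd_stat.

    Per the documentation hunk above, the file holds three counters,
    all in units of pages: bd_count, bd_reads, bd_writes.
    """
    bd_count, bd_reads, bd_writes = (int(field) for field in line.split())
    return {
        "bd_count_pages": bd_count,
        "bd_count_bytes": bd_count * PAGE_SIZE,
        "bd_reads": bd_reads,
        "bd_writes": bd_writes,
    }

# bd_stat_show() emits "%8llu %8llu %8llu\n", e.g.:
stats = parse_bd_stat("     100       50      200\n")
```

Since all three counters are page-granular, converting bd_count to bytes is a single multiply, as shown.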
[PATCH v2 4/7] zram: introduce ZRAM_IDLE flag

2018-11-26 Thread Minchan Kim
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.

Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see the result via /sys/kernel/debug/zram/zram0/block_state.

  300 75.033841 ...i
  301 63.806904 s..i
  302 63.806919 ..hi

Once there is IO for the slot, the mark disappears.

  300 75.033841 ...
  301 63.806904 s..i
  302 63.806919 ..hi

Therefore, the 300th block is no longer an idle zpage. With this
feature, users can see how many idle pages zram has, which are a
waste of memory.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 
 Documentation/blockdev/zram.txt| 10 ++--
 drivers/block/zram/zram_drv.c  | 55 --
 drivers/block/zram/zram_drv.h  |  1 +
 4 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index c1513c756af1..04c9a5980bc7 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -98,3 +98,11 @@ Contact: Minchan Kim 
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
+
+What:  /sys/block/zram<id>/idle
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+	The idle file is write-only and marks zram slots as idle.
+	If the system has debugfs mounted, users can see which slots
+	are idle via /sys/kernel/debug/zram/zram<id>/block_state
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 3c1b5ab54bc0..f3bcd716d8a9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -169,6 +169,7 @@
 comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
 debug_stat        RO    this file is used for zram debugging purposes
 backing_dev       RW    set up backend storage for zram to write out
+idle              WO    mark allocated slot as idle
 
 
 User space is advised to use the following files to read the device statistics.
@@ -251,16 +252,17 @@ pages of the process with *pagemap*.
 If you enable the feature, you could see block state via
 /sys/kernel/debug/zram/zram0/block_state". The output is as follows,
 
- 300 75.033841 .wh
- 301 63.806904 s..
- 302 63.806919 ..h
+ 300 75.033841 .wh.
+ 301 63.806904 s...
+ 302 63.806919 ..hi
 
 First column is zram's block index.
 Second column is access time since the system was booted
 Third column is state of the block.
 (s: same page
 w: written page to backing store
-h: huge page)
+h: huge page
+i: idle page)
 
 First line of above example says 300th block is accessed at 75.033841sec
 and the block's state is huge so it is written back to the backing
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index fee7e67c750d..59f78011d2d9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -281,6 +281,45 @@ static ssize_t mem_used_max_store(struct device *dev,
return len;
 }
 
+static ssize_t idle_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+   int index;
+   char mode_buf[64];
+   ssize_t sz;
+
+   strlcpy(mode_buf, buf, sizeof(mode_buf));
+   /* ignore trailing new line */
+   sz = strlen(mode_buf);
+   if (sz > 0 && mode_buf[sz - 1] == '\n')
+   mode_buf[sz - 1] = 0x00;
+
+   if (strcmp(mode_buf, "all"))
+   return -EINVAL;
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram)) {
+		up_read(&zram->init_lock);
+		return -EINVAL;
+   }
+
+   for (index = 0; index < nr_pages; index++) {
+   zram_slot_lock(zram, index);
+   if (!zram_allocated(zram, index))
+   goto next;
+
+   zram_set_flag(zram, index, ZRAM_IDLE);
+next:
+   zram_slot_unlock(zram, index);
+   }
+
+	up_read(&zram->init_lock);
+
+   return len;
+}
+
 #ifdef CONFIG_ZRAM_WRITEBACK
 static void reset_bdev(struct zram *zram)
 {
@@ -660,6 +699,7 @@ static void zram_debugfs_destroy(void)
 
 static void zram_accessed(struct zram *zram, u32 index)
 {
+   zram_clear_flag(zram, index, ZRAM_IDLE);
zram->table[index].ac_time = ktime_get_boottime();
 }
 
@@ -692,12 +732,13 @@ static ssize_t read_block_state(struct file *file, 

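The block_state lines shown above ("<index> <access time> <state>") are easy to decode mechanically. A hedged sketch, not from the patch itself; the flag meanings are taken from the documentation hunk above (s/w/h/i, with '.' meaning the flag is clear):

```python
# Flag characters of /sys/kernel/debug/zram/zram<id>/block_state,
# as documented above. A '.' in the state column means "flag not set".
FLAG_NAMES = {
    "s": "same page",
    "w": "written to backing store",
    "h": "huge page",
    "i": "idle page",
}

def decode_block_state(line):
    """Split one block_state line into (index, access_time, active flags)."""
    index, access_time, state = line.split()
    flags = [name for ch, name in FLAG_NAMES.items() if ch in state]
    return int(index), float(access_time), flags

# Decode the last example line from the commit message:
index, atime, flags = decode_block_state("302 63.806919 ..hi")
```

A script like this makes it easy to count how many slots carry the idle mark, which is the point of the feature.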

[PATCH v2 0/7] zram idle page writeback

2018-11-26 Thread Minchan Kim
Inherently, a swap device has many idle pages which are rarely touched
after they are allocated. That is never a problem if we use a storage
device as swap, but it is just a waste of memory for zram-swap.

This patchset supports zram idle page writeback feature.

* Admin can define what an idle page is: "no access since X time ago"
* Admin can define when zram should write them back
* Admin can define when zram should stop writeback to prevent wearout

Detail is on each patch's description.

The first two patches below are -stable material, so they could go in
first, separately from the others in this series.

  zram: fix lockdep warning of free block handling
  zram: fix double free backing device

* from v1
  - add fix for double free of backing device - minchan
  - change writeback/idle interface - minchan
  - remove direct incompressible page writeback - sergey

Minchan Kim (7):
  zram: fix lockdep warning of free block handling
  zram: fix double free backing device
  zram: refactoring flags and writeback stuff
  zram: introduce ZRAM_IDLE flag
  zram: support idle/huge page writeback
  zram: add bd_stat statistics
  zram: writeback throttle

 Documentation/ABI/testing/sysfs-block-zram |  32 ++
 Documentation/blockdev/zram.txt|  51 +-
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 516 +++--
 drivers/block/zram/zram_drv.h  |  18 +-
 5 files changed, 463 insertions(+), 159 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v2 7/7] zram: writeback throttle

2018-11-26 Thread Minchan Kim
On small memory systems there is a lot of write IO, so if we use a
flash device as swap, there will be serious flash wearout.
To overcome the problem, system developers need to design a write
limitation strategy to guarantee flash health for the entire product
life.

This patch creates a new knob "writeback_limit" on zram. With it,
if the current writeback IO count (/sys/block/zramX/io_stat) exceeds
the limitation, zram stops further writeback until the admin resets
the limit.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  9 +
 Documentation/blockdev/zram.txt|  2 +
 drivers/block/zram/zram_drv.c  | 47 +-
 drivers/block/zram/zram_drv.h  |  2 +
 4 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 65fc33b2f53b..9d2339a485c8 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact:   Minchan Kim 
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
+
+What:  /sys/block/zram<id>/writeback_limit
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+	The writeback_limit file is read-write and specifies the maximum
+	amount of writeback ZRAM can do. The limit can be changed
+	at run time, and "0" means the limit is disabled.
+	No limit is the initial state.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 550bca77d322..41748d52712d 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@
 reset             WO    trigger device reset
 mem_used_max      WO    reset the `mem_used_max' counter (see later)
 mem_limit         WO    specifies the maximum amount of memory ZRAM can use
                         to store the compressed data
+writeback_limit   WO    specifies the maximum amount of write IO zram can
+                        write out to backing device
 max_comp_streams  RW    the number of possible concurrent compress operations
 comp_algorithm    RW    show and change the compression algorithm
 compact           WO    trigger memory compaction
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index cceaa10301e8..07c0847b7c0f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -328,6 +328,40 @@ static ssize_t idle_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   u64 val;
+   ssize_t ret = -EINVAL;
+
+	if (kstrtoull(buf, 10, &val))
+   return ret;
+
+	down_read(&zram->init_lock);
+	atomic64_set(&zram->stats.bd_wb_limit, val);
+	if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+		zram->stop_writeback = false;
+	up_read(&zram->init_lock);
+   ret = len;
+
+   return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   u64 val;
+   struct zram *zram = dev_to_zram(dev);
+
+	down_read(&zram->init_lock);
+	val = atomic64_read(&zram->stats.bd_wb_limit);
+	up_read(&zram->init_lock);
+
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
@@ -592,6 +626,7 @@ static ssize_t writeback_store(struct device *dev,
char mode_buf[64];
unsigned long mode = -1UL;
unsigned long blk_idx = 0;
+   u64 wb_count, wb_limit;
 
strlcpy(mode_buf, buf, sizeof(mode_buf));
/* ignore trailing newline */
@@ -631,6 +666,11 @@ static ssize_t writeback_store(struct device *dev,
bvec.bv_len = PAGE_SIZE;
bvec.bv_offset = 0;
 
+   if (zram->stop_writeback) {
+   ret = -EIO;
+   break;
+   }
+
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
@@ -689,7 +729,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
-	atomic64_inc(&zram->stats.bd_writes);
+	wb_count = atomic64_inc_return(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -713,6 +753,9 @@ static ssize_t writeback_store(struct 

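Because the counter that writeback_limit is checked against (bd_writes) is kept in pages, an admin translating a flash write budget in bytes into a limit value has to divide by the page size. A small sketch under the assumption of 4 KiB pages; the sysfs path comes from the ABI entry above:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def writeback_limit_pages(budget_bytes):
    """Convert a flash write budget in bytes into the page-granular value
    an admin would write to /sys/block/zram<id>/writeback_limit.

    Per the ABI description above, writing 0 disables the limit.
    """
    return budget_bytes // PAGE_SIZE

# Cap writeback at 512 MiB until the limit is next reset:
limit = writeback_limit_pages(512 * 1024 * 1024)
```

The resulting number would then be echoed into the writeback_limit file, and reset (or set to 0) once the admin wants writeback to resume.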
[PATCH] [PATCH for v3.18] zram: close udev startup race condition as default groups

2018-11-22 Thread Minchan Kim
commit fef912bf860e upstream.
commit 98af4d4df889 upstream.

I got a report from Howard Chen that he saw a race between zram and
sysfs (i.e., the zram block device file is created but its sysfs entries
aren't yet) when he tried to create new zram devices via the hotadd knob.

The v4.20 kernel fixes it by [1, 2], but those changes are too large to
merge into -stable, so this patch fixes the problem by registering the
default groups with Greg KH's approach [3].

This patch should be applied to every stable tree [3.16+] currently
existing from kernel.org because the problem was introduced at 2.6.37
by [4].

[1] fef912bf860e, block: genhd: add 'groups' argument to device_add_disk
[2] 98af4d4df889, zram: register default groups with device_add_disk()
[3] http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/
[4] 33863c21e69e9, Staging: zram: Replace ioctls with sysfs interface

Cc: Sergey Senozhatsky 
Cc: Hannes Reinecke 
Tested-by: Howard Chen 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7e94459a489a..5f4e6a3c2dde 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -999,6 +999,11 @@ static struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+	&zram_disk_attr_group,
+   NULL,
+};
+
 static int create_device(struct zram *zram, int device_id)
 {
int ret = -ENOMEM;
@@ -1060,22 +1065,14 @@ static int create_device(struct zram *zram, int device_id)
zram->disk->queue->limits.discard_zeroes_data = 0;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
 
+   disk_to_dev(zram->disk)->groups = zram_disk_attr_groups;
add_disk(zram->disk);
 
-	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-				&zram_disk_attr_group);
-   if (ret < 0) {
-   pr_warn("Error creating sysfs group");
-   goto out_free_disk;
-   }
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
zram->meta = NULL;
zram->max_comp_streams = 1;
return 0;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(zram->queue);
 out:
@@ -1084,9 +1081,6 @@ static int create_device(struct zram *zram, int device_id)
 
 static void destroy_device(struct zram *zram)
 {
-	sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-			&zram_disk_attr_group);
-
del_gendisk(zram->disk);
put_disk(zram->disk);
 
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog




[PATCH] [PATCH for v4.4] zram: close udev startup race condition as default groups

2018-11-22 Thread Minchan Kim
commit fef912bf860e upstream.
commit 98af4d4df889 upstream.

I got a report from Howard Chen that he saw a race between zram and
sysfs (i.e., the zram block device file is created but its sysfs entries
aren't yet) when he tried to create new zram devices via the hotadd knob.

The v4.20 kernel fixes it by [1, 2], but those changes are too large to
merge into -stable, so this patch fixes the problem by registering the
default groups with Greg KH's approach [3].

This patch should be applied to every stable tree [3.16+] currently
existing from kernel.org because the problem was introduced at 2.6.37
by [4].

[1] fef912bf860e, block: genhd: add 'groups' argument to device_add_disk
[2] 98af4d4df889, zram: register default groups with device_add_disk()
[3] http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/
[4] 33863c21e69e9, Staging: zram: Replace ioctls with sysfs interface

Cc: Sergey Senozhatsky 
Cc: Hannes Reinecke 
Tested-by: Howard Chen 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 26 ++
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 502406c9e6e1..616ee4f9c233 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1184,6 +1184,11 @@ static struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+	&zram_disk_attr_group,
+   NULL,
+};
+
 /*
  * Allocate and initialize new zram device. the function returns
  * '>= 0' device_id upon success, and negative value otherwise.
@@ -1264,15 +1269,9 @@ static int zram_add(void)
zram->disk->queue->limits.discard_zeroes_data = 0;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
 
+   disk_to_dev(zram->disk)->groups = zram_disk_attr_groups;
add_disk(zram->disk);
 
-	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-				&zram_disk_attr_group);
-   if (ret < 0) {
-   pr_err("Error creating sysfs group for device %d\n",
-   device_id);
-   goto out_free_disk;
-   }
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
zram->meta = NULL;
zram->max_comp_streams = 1;
@@ -1280,9 +1279,6 @@ static int zram_add(void)
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(queue);
 out_free_idr:
@@ -1310,16 +1306,6 @@ static int zram_remove(struct zram *zram)
zram->claim = true;
	mutex_unlock(&bdev->bd_mutex);
 
-   /*
-* Remove sysfs first, so no one will perform a disksize
-* store while we destroy the devices. This also helps during
-* hot_remove -- zram_reset_device() is the last holder of
-* ->init_lock, no later/concurrent disksize_store() or any
-* other sysfs handlers are possible.
-*/
-	sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-			&zram_disk_attr_group);
-
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
zram_reset_device(zram);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog




[PATCH for 4.9] zram: close udev startup race condition as default groups

2018-11-22 Thread Minchan Kim
commit fef912bf860e upstream.
commit 98af4d4df889 upstream.

I got a report from Howard Chen that he saw a race between zram and
sysfs (i.e., the zram block device file is created but its sysfs entries
aren't yet) when he tried to create new zram devices via the hotadd knob.

The v4.20 kernel fixes it by [1, 2], but those changes are too large to
merge into -stable, so this patch fixes the problem by registering the
default groups with Greg KH's approach [3].

This patch should be applied to every stable tree [3.16+] currently
existing from kernel.org because the problem was introduced at 2.6.37
by [4].

[1] fef912bf860e, block: genhd: add 'groups' argument to device_add_disk
[2] 98af4d4df889, zram: register default groups with device_add_disk()
[3] http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/
[4] 33863c21e69e9, Staging: zram: Replace ioctls with sysfs interface

Cc: Sergey Senozhatsky 
Cc: Hannes Reinecke 
Tested-by: Howard Chen 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 26 ++
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b7c0b69a02f5..d64a53d3270a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1223,6 +1223,11 @@ static struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+   &zram_disk_attr_group,
+   NULL,
+};
+
 /*
  * Allocate and initialize new zram device. the function returns
  * '>= 0' device_id upon success, and negative value otherwise.
@@ -1303,24 +1308,15 @@ static int zram_add(void)
zram->disk->queue->limits.discard_zeroes_data = 0;
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, zram->disk->queue);
 
+   disk_to_dev(zram->disk)->groups = zram_disk_attr_groups;
add_disk(zram->disk);
 
-   ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
-   if (ret < 0) {
-   pr_err("Error creating sysfs group for device %d\n",
-   device_id);
-   goto out_free_disk;
-   }
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
zram->meta = NULL;
 
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(queue);
 out_free_idr:
@@ -1348,16 +1344,6 @@ static int zram_remove(struct zram *zram)
zram->claim = true;
mutex_unlock(&zram->bd_mutex);
 
-   /*
-* Remove sysfs first, so no one will perform a disksize
-* store while we destroy the devices. This also helps during
-* hot_remove -- zram_reset_device() is the last holder of
-* ->init_lock, no later/concurrent disksize_store() or any
-* other sysfs handlers are possible.
-*/
-   sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
-
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
zram_reset_device(zram);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH for v4.14] zram: close udev startup race condition as default groups

2018-11-22 Thread Minchan Kim
commit fef912bf860e upstream.
commit 98af4d4df889 upstream.

I got a report from Howard Chen that he saw a race between zram and sysfs
(i.e., the zram block device file is created but its sysfs entries are not
yet) when he tried to create new zram devices via the hot_add knob.

The v4.20 kernel fixes this by [1, 2], but those changes are too large to
merge into -stable, so this patch fixes the problem by registering a
default group using Greg KH's approach [3].

This patch should be applied to every stable tree [3.16+] currently
existing from kernel.org because the problem was introduced at 2.6.37
by [4].

[1] fef912bf860e, block: genhd: add 'groups' argument to device_add_disk
[2] 98af4d4df889, zram: register default groups with device_add_disk()
[3] http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/
[4] 33863c21e69e9, Staging: zram: Replace ioctls with sysfs interface

Cc: Sergey Senozhatsky 
Cc: Hannes Reinecke 
Tested-by: Howard Chen 
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 26 ++
 1 file changed, 6 insertions(+), 20 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 1e2648e4c286..27b202c64c84 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1491,6 +1491,11 @@ static const struct attribute_group zram_disk_attr_group = {
.attrs = zram_disk_attrs,
 };
 
+static const struct attribute_group *zram_disk_attr_groups[] = {
+   &zram_disk_attr_group,
+   NULL,
+};
+
 /*
  * Allocate and initialize new zram device. the function returns
  * '>= 0' device_id upon success, and negative value otherwise.
@@ -1568,23 +1573,14 @@ static int zram_add(void)
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
+   disk_to_dev(zram->disk)->groups = zram_disk_attr_groups;
add_disk(zram->disk);
 
-   ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
-   if (ret < 0) {
-   pr_err("Error creating sysfs group for device %d\n",
-   device_id);
-   goto out_free_disk;
-   }
strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
pr_info("Added device: %s\n", zram->disk->disk_name);
return device_id;
 
-out_free_disk:
-   del_gendisk(zram->disk);
-   put_disk(zram->disk);
 out_free_queue:
blk_cleanup_queue(queue);
 out_free_idr:
@@ -1612,16 +1608,6 @@ static int zram_remove(struct zram *zram)
zram->claim = true;
mutex_unlock(&zram->bd_mutex);
 
-   /*
-* Remove sysfs first, so no one will perform a disksize
-* store while we destroy the devices. This also helps during
-* hot_remove -- zram_reset_device() is the last holder of
-* ->init_lock, no later/concurrent disksize_store() or any
-* other sysfs handlers are possible.
-*/
-   sysfs_remove_group(&disk_to_dev(zram->disk)->kobj,
-   &zram_disk_attr_group);
-
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
zram_reset_device(zram);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



Re: [PATCH 4/6] zram: support idle page writeback

2018-11-22 Thread Minchan Kim
On Thu, Nov 22, 2018 at 03:59:26PM +0900, Sergey Senozhatsky wrote:
> On (11/22/18 15:31), Minchan Kim wrote:
> > > 
> > > I got what you mean now. Let's call it as "incompressible page wrieback"
> > > to prevent confusing.
> > > 
> > > "incompressible page writeback" would be orthgonal feature. The goal is
> > > "let's save memory at the cost of *latency*". If the page is swapped-in
> > > soon, it's unfortunate. However, the design expects once it's swapped out,
> > > it means it's non-workingset so soonish swappined-in would be rather not
> > > many, theoritically compared to other workingset.
> > > If's it's too frequent, it means system were heavily overcommitted.
> > 
> > Havid said, I agree it's not a good idea to enable incompressible page
> > writeback with idle page writeback. If you don't oppose, I want to add
> > new knob to "enable incompressible page writeback" so by default,
> > although we enable CONFIG_ZRAM_WRITEBACK, incompressible page writeback
> > is off until we enable the knob.
> > It would make some regressison if someone have used the feature but
> > I guess we are not too late.
> > 
> > What do you think?
> 
> Yes, totally works for me!
> 
> 
> "IDLE writeback" is superior to "incompressible writeback".
> 
> "incompressible writeback" is completely unpredictable and
> uncontrollable; it depens on data patterns and compression algorithms.
> While "IDLE writeback" is predictable.
> 
> I even suspect, that, *ideally*, we can remove "incompressible
> writeback". "IDLE pages" is a super set which also includes
> "incompressible" pages. So, technically, we still can do
> "incompressible writeback" from "IDLE writeback" path; but a much
> more reasonable one, based on a page idling period.
> 
> I understand that you want to keep "direct incompressible writeback"
> around. ZRAM is especially popular on devices which do suffer from
> flash wearout, so I can see "incompressible writeback" path becoming
> a dead code, long term.

Okay, both options cause a regression if someone is already using the
feature. Then, let's try to remove it. The code would be cleaner with the
new idle writeback.

Thanks!



Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Minchan Kim
On Thu, Nov 22, 2018 at 03:15:42PM +0900, Minchan Kim wrote:
> On Thu, Nov 22, 2018 at 02:40:40PM +0900, Sergey Senozhatsky wrote:
> > On (11/22/18 14:04), Minchan Kim wrote:
> > > 
> > > > additionally, it's too simple. It writes-back pages which can be
> > > > swapped in immediately; which basically means that we do pointless
> > > > PAGE_SIZE writes to a device which doesn't really like pointless
> > > > writes.
> > > 
> > > This patchset aims for *IDLE page* writeback and you can define
> > > what is IDLE page by yourself. It doesn't do pointless writeback.
> > > > 
> > > > It's a whole different story with idle, compressible pages writeback.
> > > 
> > > I don't understand your point.
> > 
> > Seems you misunderstood me. I'm not saying that IDLE writeback is bad.
> > On the contrary, I think IDLE writeback is x100 better than writeback
> > which we currently have.
> > 
> > The "pointless writeback" comment was about the existing writeback,
> > when we WB pages which we couldn't compress. We can have a relative
> > huge percentage of incompressible pages, and not all of them will end
> > up being IDLE:
> >  - we swap out page
> >  - can't compress it
> >  - writeback PAGE_SIZE
> >  - swap it in two seconds later
> 
> I got what you mean now. Let's call it as "incompressible page wrieback"
> to prevent confusing.
> 
> "incompressible page writeback" would be orthgonal feature. The goal is
> "let's save memory at the cost of *latency*". If the page is swapped-in
> soon, it's unfortunate. However, the design expects once it's swapped out,
> it means it's non-workingset so soonish swappined-in would be rather not
> many, theoritically compared to other workingset.
> If's it's too frequent, it means system were heavily overcommitted.

Having said that, I agree it's not a good idea to enable incompressible page
writeback together with idle page writeback. If you don't oppose, I want to
add a new knob to "enable incompressible page writeback" so that by default,
even when CONFIG_ZRAM_WRITEBACK is enabled, incompressible page writeback
is off until we enable the knob.
It would cause some regression if someone has used the feature, but
I guess we are not too late.

What do you think?



Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Minchan Kim
On Thu, Nov 22, 2018 at 02:40:40PM +0900, Sergey Senozhatsky wrote:
> On (11/22/18 14:04), Minchan Kim wrote:
> > 
> > > additionally, it's too simple. It writes-back pages which can be
> > > swapped in immediately; which basically means that we do pointless
> > > PAGE_SIZE writes to a device which doesn't really like pointless
> > > writes.
> > 
> > This patchset aims for *IDLE page* writeback and you can define
> > what is IDLE page by yourself. It doesn't do pointless writeback.
> > > 
> > > It's a whole different story with idle, compressible pages writeback.
> > 
> > I don't understand your point.
> 
> Seems you misunderstood me. I'm not saying that IDLE writeback is bad.
> On the contrary, I think IDLE writeback is x100 better than writeback
> which we currently have.
> 
> The "pointless writeback" comment was about the existing writeback,
> when we WB pages which we couldn't compress. We can have a relative
> huge percentage of incompressible pages, and not all of them will end
> up being IDLE:
>  - we swap out page
>  - can't compress it
>  - writeback PAGE_SIZE
>  - swap it in two seconds later

I got what you mean now. Let's call it "incompressible page writeback"
to prevent confusion.

"Incompressible page writeback" would be an orthogonal feature. The goal is
"let's save memory at the cost of *latency*". If the page is swapped in
soon, it's unfortunate. However, the design expects that once a page is
swapped out, it is not part of the working set, so near-term swap-ins should
theoretically be rare compared to working-set pages.
If it's too frequent, it means the system was heavily overcommitted.


Re: [PATCH 3/6] zram: introduce ZRAM_IDLE flag

2018-11-21 Thread Minchan Kim
On Tue, Nov 20, 2018 at 11:46:59AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (11/16/18 16:20), Minchan Kim wrote:
> [..]
> > +static ssize_t idle_store(struct device *dev,
> > +   struct device_attribute *attr, const char *buf, size_t len)
> > +{
> > +   struct zram *zram = dev_to_zram(dev);
> > +   unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
> > +   int index;
> > +
> > +   down_read(&zram->init_lock);
> > +   if (!init_done(zram)) {
> > +   up_read(&zram->init_lock);
> > +   return -EINVAL;
> > +   }
> > +
> > +   for (index = 0; index < nr_pages; index++) {
> > +   zram_slot_lock(zram, index);
> > +   if (!zram_allocated(zram, index))
> > +   goto next;
> > +
> > +   zram_set_flag(zram, index, ZRAM_IDLE);
> > +next:
> > +   zram_slot_unlock(zram, index);
> > +   }
> > +
> > +   up_read(&zram->init_lock);
> > +
> > +   return len;
> > +}
> 
> This is one way of doing it.
> 
> The other one could, probabaly, be a bit more friendly to the cache
> lines and CPU cycles. Basically, have a static timestamp variable,
> which would keep the timestamp of last idle_store().
> 
> static idle_snapshot_ts;
> 
> static ssize_t idle_store(struct device *dev,
> struct device_attribute *attr,
> const char *buf, size_t len)
> {
>   idle_snapshot_ts = ktime();
> }
> 
> And then in read_block_state() compare handle access time and
> idle_snapshot_ts (if it's not 0). If the page was not modified/access
> since the last idle_snapshot_ts (handle access time <= idle_snapshot_ts),
> then it's idle, otherwise (handle access time > idle_snapshot_ts) it's
> not idle.
> 
> Would this do the trick?

It was an option when I first imagined this idea, but the concern from the
product division was the memory waste of keeping ac_time in every zram
table entry.



Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Minchan Kim
On Thu, Nov 22, 2018 at 11:14:43AM +0900, Sergey Senozhatsky wrote:
> On (11/21/18 05:34), Minchan Kim wrote:
> > > 
> > > Just a thought,
> > > 
> > > I wonder if it will make sense (and if it will be possible) to writeback
> > > idle _compressed_ objects. Right now we decompress, say, a perfectly
> > > fine 400-byte compressed object to a PAGE_SIZE-d object and then push
> > > it to the WB device. In this particular case it has a x10 bigger IO
> > > pressure on flash. If we can write/read compressed object then we
> > > will write and read 400-bytes, instead of PAGE_SIZE.
> > 
> > Although it has pros/cons, that's the my final goal although it would
> > add much complicated stuffs. Sometime, we should have the feature.
> 
> So you plan to switch to "compressed objects" writeback?

No switch. I eventually want both. There are pros and cons.
Compressed writeback would be good for flash device wearout (that's why I
want to have it), but it requires several reads per block (since a block
holds several zpages) and adds decompression latency as well as complicated
block-management logic. That is unnecessary if the backing device or system
has no wearout concern.

> 
> > However, I want to go simple one first which is very valuable, too.
> 
> Flash wearout is a serious problem; maybe less of a problem on smart
> phones, but much bigger on TVs and on other embedded devices that have
> lifespans of 5+ years. With "writeback idle compressed" we can remove

Yes, it's serious. That's why my patchset has a writeback limit and stats,
as well as idle marking, to help with the system design.

> the existing "writeback incompressible pages" and writeback only
> "idle, compressed" pages.
> 
> The existing incompressible writeback is way too aggressive, and,

I do not agree. It depends on the system design.
Think about idle page writeback. Once a day, you write out idle pages.
In the initial stage, you could write every idle page to storage;
that could be several hundred MB. And the next day? Only a few MB are
written, because all the idle pages were already stored yesterday.
We also have a write_limit, so you can estimate how much the per-day
writes can damage your device. Your daemon can check once a day and
decide whether to write further or not.

> additionally, it's too simple. It writes-back pages which can be
> swapped in immediately; which basically means that we do pointless
> PAGE_SIZE writes to a device which doesn't really like pointless
> writes.

This patchset aims for *IDLE page* writeback and you can define
what is IDLE page by yourself. It doesn't do pointless writeback.

> 
> It's a whole different story with idle, compressible pages writeback.

I don't understand your point.

> 
>   -ss


Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Minchan Kim
On Thu, Nov 22, 2018 at 11:14:43AM +0900, Sergey Senozhatsky wrote:
> On (11/21/18 05:34), Minchan Kim wrote:
> > > 
> > > Just a thought,
> > > 
> > > I wonder if it will make sense (and if it will be possible) to writeback
> > > idle _compressed_ objects. Right now we decompress, say, a perfectly
> > > fine 400-byte compressed object to a PAGE_SIZE-d object and then push
> > > it to the WB device. In this particular case it has a x10 bigger IO
> > > pressure on flash. If we can write/read compressed object then we
> > > will write and read 400-bytes, instead of PAGE_SIZE.
> > 
> > Although it has pros/cons, that's the my final goal although it would
> > add much complicated stuffs. Sometime, we should have the feature.
> 
> So you plan to switch to "compressed objects" writeback?

No switch. I want both finally. There are pros and cons.
Compressible write would be good for wearout of flash device(that's I want
to have it) but it has several read of a block(since the block has several
zpage) and decompression latency as well as complicated logic of management
of block. That's the unnecessary thing If backing device or system doesn't
have wearout concern.

> 
> > However, I want to go simple one first which is very valuable, too.
> 
> Flash wearout is a serious problem; maybe less of a problem on smart
> phones, but much bigger on TVs and on other embedded devices that have
> lifespans of 5+ years. With "writeback idle compressed" we can remove

Yub, It's a serious. That's why my patchset has writeback limitation,
stats as well as idle marking to help the system design.

> the existing "writeback incompressible pages" and writeback only
> "idle, compressed" pages.
> 
> The existing incompressible writeback is way too aggressive, and,

I do not agree. It depends on the system design.
Think about idle page writeback: once a day, you write out idle pages.
In the initial stage, you could write every idle page to storage;
it could be several hundred MB. And the next day? Only a few MB are
written, because all the idle pages were already stored yesterday.
We also have a write limit, so you can estimate how much per-day write
could damage your device. Your daemon can check it once a day and decide
whether to write further.
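To make the above concrete, a daemon implementing that once-a-day cycle could look roughly like this (a sketch only: it assumes a configured /dev/zram0 with a backing device, the idle/writeback/bd_stat knobs from this patchset, a 4 KiB page size, and `pages_to_mb` is a helper name made up here):

```shell
#!/bin/sh
# Sketch of a once-a-day idle writeback cycle (assumes this patchset's
# idle/writeback/bd_stat knobs and a zram0 device with backing_dev set).
ZRAM=/sys/block/zram0

pages_to_mb() {                 # bd_stat counts pages; assume 4 KiB pages
	echo $(( $1 * 4096 / 1048576 ))
}

daily_cycle() {
	echo 1 > "$ZRAM/idle"       # mark all currently allocated slots idle
	sleep 86400                 # slots touched during the day lose the mark
	before=$(awk '{print $3}' "$ZRAM/bd_stat")   # 3rd column: bd_writes
	echo 2 > "$ZRAM/writeback"  # write back slots still marked idle
	after=$(awk '{print $3}' "$ZRAM/bd_stat")
	echo "wrote $(pages_to_mb $(( after - before )) ) MiB today"
}
```

The first run writes out the large initial set of idle pages; subsequent days only write the newly idled few MB.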

> additionally, it's too simple. It writes-back pages which can be
> swapped in immediately; which basically means that we do pointless
> PAGE_SIZE writes to a device which doesn't really like pointless
> writes.

This patchset aims at *IDLE page* writeback, and you can define what an
IDLE page is yourself. It doesn't do pointless writeback.

> 
> It's a whole different story with idle, compressible pages writeback.

I don't understand your point.

> 
>   -ss


Re: [PATCH 4/6] zram: support idle page writeback

2018-11-21 Thread Minchan Kim
On Wed, Nov 21, 2018 at 01:55:51PM +0900, Sergey Senozhatsky wrote:
> On (11/16/18 16:20), Minchan Kim wrote:
> > +   zram_set_flag(zram, index, ZRAM_UNDER_WB);
> > +   zram_slot_unlock(zram, index);
> > +   if (zram_bvec_read(zram, &bvec, index, 0, NULL)) {
> > +   zram_slot_lock(zram, index);
> > +   zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> > +   zram_slot_unlock(zram, index);
> > +   continue;
> > +   }
> > +
> > +   bio_init(&bio, &bio_vec, 1);
> > +   bio_set_dev(, zram->bdev);
> > +   bio.bi_iter.bi_sector = blk_idx * (PAGE_SIZE >> 9);
> > +   bio.bi_opf = REQ_OP_WRITE | REQ_SYNC;
> > +
> > +   bio_add_page(&bio, bvec.bv_page, bvec.bv_len,
> > +   bvec.bv_offset);
> > +   /*
> > +* XXX: A single page IO would be inefficient for write
> > +* but it would be not bad as starter.
> > +*/
> > +   ret = submit_bio_wait(&bio);
> > +   if (ret) {
> > +   zram_slot_lock(zram, index);
> > +   zram_clear_flag(zram, index, ZRAM_UNDER_WB);
> > +   zram_slot_unlock(zram, index);
> > +   continue;
> > +   }
> 
> Just a thought,
> 
> I wonder if it will make sense (and if it will be possible) to writeback
> idle _compressed_ objects. Right now we decompress, say, a perfectly
> fine 400-byte compressed object to a PAGE_SIZE-d object and then push
> it to the WB device. In this particular case it has a x10 bigger IO
> pressure on flash. If we can write/read compressed object then we
> will write and read 400-bytes, instead of PAGE_SIZE.

Although it has pros and cons, that's my final goal, although it would
add a lot of complicated stuff. At some point, we should have the feature.
However, I want to go with the simple one first, which is very valuable, too.


Re: [PATCH] zram: close udev startup race condition as default groups

2018-11-16 Thread Minchan Kim
On Thu, Nov 15, 2018 at 12:45:04PM -0500, Sasha Levin wrote:
> On Wed, Nov 14, 2018 at 02:52:23PM +0900, Minchan Kim wrote:
> > commit fef912bf860e upstream.
> > commit 98af4d4df889 upstream.
> > 
> > I got a report from Howard Chen that he saw zram and sysfs race(ie,
> > zram block device file is created but sysfs for it isn't yet)
> > when he tried to create new zram devices via hotadd knob.
> > 
> > v4.20 kernel fixes it by [1, 2], but the change is too large to merge
> > into -stable, so this patch fixes the problem by registering a default
> > group using Greg KH's approach[3].
> > 
> > This patch should be applied to every stable tree [3.16+] currently
> > existing from kernel.org because the problem was introduced at 2.6.37
> > by [4].
> > 
> > [1] fef912bf860e, block: genhd: add 'groups' argument to device_add_disk
> > [2] 98af4d4df889, zram: register default groups with device_add_disk()
> > [3] 
> > http://kroah.com/log/blog/2013/06/26/how-to-create-a-sysfs-file-correctly/
> > [4] 33863c21e69e9, Staging: zram: Replace ioctls with sysfs interface
> > 
> > Cc: Sergey Senozhatsky 
> > Cc: Hannes Reinecke 
> > Tested-by: Howard Chen 
> > Signed-off-by: Minchan Kim 
> 
> I've queued this for 4.19 and 4.18, but it doesn't apply to anything
> older than that.

Thanks for the review, Hannes.

Sasha, I will send separate patches for older stable kernel.
Thanks for picking the patch.

> 
> --
> Thanks,
> Sasha


[PATCH 5/6] zram: add bd_stat statistics

2018-11-15 Thread Minchan Kim
bd_stat represents things that happened in the backing device. Currently,
it supports bd_count, bd_reads and bd_writes, which are helpful for
understanding flash wearout and memory savings.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  8 ++
 Documentation/blockdev/zram.txt| 11 
 drivers/block/zram/zram_drv.c  | 31 ++
 drivers/block/zram/zram_drv.h  |  5 
 4 files changed, 55 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index d1f80b077885..a4daca7e5043 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -113,3 +113,11 @@ Contact:   Minchan Kim 
 Description:
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
+
+What:  /sys/block/zram/bd_stat
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The bd_stat file is read-only and represents backing device's
+   statistics (bd_count, bd_reads, bd_writes.) in a format
+   similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 60b585dab6e0..1f4907307a0d 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -221,6 +221,17 @@ The stat file represents device's mm statistics. It 
consists of a single
  pages_compacted  the number of pages freed during compaction
  huge_pages  the number of incompressible pages
 
+File /sys/block/zram/bd_stat
+
+The stat file represents device's backing device statistics. It consists of
+a single line of text and contains the following stats separated by whitespace:
+ bd_count  size of data written in backing device.
+   Unit: pages
+ bd_reads  the number of reads from backing device
+   Unit: pages
+ bd_writes the number of writes to backing device
+   Unit: pages
+
 9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b7b5c9e5f0cd..17d566d9a321 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -505,6 +505,8 @@ static unsigned long alloc_block_bdev(struct zram *zram)
ret = blk_idx;
 out:
spin_unlock_irq(&zram->bitmap_lock);
+   if (ret != 0)
+   atomic64_inc(&zram->stats.bd_count);
 
return ret;
 }
@@ -518,6 +520,7 @@ static void free_block_bdev(struct zram *zram, unsigned 
long blk_idx)
was_set = test_and_clear_bit(blk_idx, zram->bitmap);
spin_unlock_irqrestore(&zram->bitmap_lock, flags);
WARN_ON_ONCE(!was_set);
+   atomic64_dec(&zram->stats.bd_count);
 }
 
 static void zram_page_end_io(struct bio *bio)
@@ -661,6 +664,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
+   atomic64_inc(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -748,6 +752,7 @@ static int read_from_bdev_sync(struct zram *zram, struct 
bio_vec *bvec,
 static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
unsigned long entry, struct bio *parent, bool sync)
 {
+   atomic64_inc(&zram->stats.bd_reads);
if (sync)
return read_from_bdev_sync(zram, bvec, entry, parent);
else
@@ -790,6 +795,7 @@ static int write_to_bdev(struct zram *zram, struct bio_vec 
*bvec,
 
submit_bio(bio);
*pentry = entry;
+   atomic64_inc(&zram->stats.bd_writes);
 
return 0;
 }
@@ -1053,6 +1059,25 @@ static ssize_t mm_stat_show(struct device *dev,
return ret;
 }
 
+#ifdef CONFIG_ZRAM_WRITEBACK
+static ssize_t bd_stat_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct zram *zram = dev_to_zram(dev);
+   ssize_t ret;
+
+   down_read(&zram->init_lock);
+   ret = scnprintf(buf, PAGE_SIZE,
+   "%8llu %8llu %8llu\n",
+   (u64)atomic64_read(&zram->stats.bd_count),
+   (u64)atomic64_read(&zram->stats.bd_reads),
+   (u64)atomic64_read(&zram->stats.bd_writes));
+   up_read(&zram->init_lock);
+
+   return ret;
+}
+#endif
+
 static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1073,6 +1098,9 @@ static ssize_t debug_stat_show(struct device *dev,
 
 static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
+#ifdef CONFIG_ZRAM_WRITEBACK
+static DEVICE_ATTR_RO(bd_stat);
+#endif
 static DEVICE_ATTR_RO(debug_stat);
 
 static void zram_meta_free(struct zram *zram, u64 disksi
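For what it's worth, since all three bd_stat columns are plain page counts, a userspace consumer might render them like this (a sketch; the bd_stat format is the one documented in this patch, a 4 KiB page size is assumed, and `bd_stat_report` is a name invented here):

```shell
#!/bin/sh
# bd_stat prints "bd_count bd_reads bd_writes", all in pages.
# This helper turns one line of that file into byte counts.
bd_stat_report() {
	set -- $1                   # word-split "count reads writes"
	echo "count=$(( $1 * 4096 )) reads=$(( $2 * 4096 )) writes=$(( $3 * 4096 ))"
}

# On a live system: bd_stat_report "$(cat /sys/block/zram0/bd_stat)"
bd_stat_report "     100       25      150"   # → count=409600 reads=102400 writes=614400
```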

[PATCH 6/6] zram: writeback throttle

2018-11-15 Thread Minchan Kim
On small memory systems there is a lot of write IO, so if we use a
flash device as swap, there would be serious flash wearout.
To overcome the problem, system developers need to design a write
limitation strategy to guarantee flash health for the entire product life.

This patch creates a new knob "writeback_limit" on zram. With it, if
the current writeback IO count (/sys/block/zramX/io_stat) exceeds
the limit, zram stops further writeback until the admin resets
the limit.

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |  9 
 Documentation/blockdev/zram.txt|  2 +
 drivers/block/zram/zram_drv.c  | 55 --
 drivers/block/zram/zram_drv.h  |  2 +
 4 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index a4daca7e5043..210f2cdac752 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -121,3 +121,12 @@ Contact:   Minchan Kim 
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes.) in a format
similar to block layer statistics file format.
+
+What:  /sys/block/zram/writeback_limit
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback_limit file is read-write and specifies the maximum
+   amount of writeback ZRAM can do. The limit could be changed
+   in run time and "0" means disable the limit.
+   No limit is the initial state.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 1f4907307a0d..39ee416bf552 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -164,6 +164,8 @@ reset WOtrigger device reset
 mem_used_max  WOreset the `mem_used_max' counter (see later)
 mem_limit WOspecifies the maximum amount of memory ZRAM can use
 to store the compressed data
+writeback_limit  WOspecifies the maximum amount of write IO zram 
can
+   write out to backing device
 max_comp_streams  RWthe number of possible concurrent compress operations
 comp_algorithmRWshow and change the compression algorithm
 compact   WOtrigger memory compaction
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 17d566d9a321..b263febaed10 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -317,6 +317,40 @@ static ssize_t idle_store(struct device *dev,
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
+
+static ssize_t writeback_limit_store(struct device *dev,
+   struct device_attribute *attr, const char *buf, size_t len)
+{
+   struct zram *zram = dev_to_zram(dev);
+   u64 val;
+   ssize_t ret = -EINVAL;
+
+   if (kstrtoull(buf, 10, &val))
+   return ret;
+
+   down_read(&zram->init_lock);
+   atomic64_set(&zram->stats.bd_wb_limit, val);
+   if (val == 0 || val > atomic64_read(&zram->stats.bd_writes))
+   zram->stop_writeback = false;
+   up_read(&zram->init_lock);
+   ret = len;
+
+   return ret;
+}
+
+static ssize_t writeback_limit_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   u64 val;
+   struct zram *zram = dev_to_zram(dev);
+
+   down_read(&zram->init_lock);
+   val = atomic64_read(&zram->stats.bd_wb_limit);
+   up_read(&zram->init_lock);
+
+   return scnprintf(buf, PAGE_SIZE, "%llu\n", val);
+}
+
 static void reset_bdev(struct zram *zram)
 {
struct block_device *bdev;
@@ -575,6 +609,7 @@ static ssize_t writeback_store(struct device *dev,
ssize_t ret;
unsigned long mode;
unsigned long blk_idx = 0;
+   u64 wb_count, wb_limit;
 
 #define HUGE_WRITEBACK 0x1
 #define IDLE_WRITEBACK 0x2
@@ -610,6 +645,11 @@ static ssize_t writeback_store(struct device *dev,
bvec.bv_len = PAGE_SIZE;
bvec.bv_offset = 0;
 
+   if (zram->stop_writeback) {
+   ret = -EIO;
+   break;
+   }
+
if (!blk_idx) {
blk_idx = alloc_block_bdev(zram);
if (!blk_idx) {
@@ -664,7 +704,7 @@ static ssize_t writeback_store(struct device *dev,
continue;
}
 
-   atomic64_inc(&zram->stats.bd_writes);
+   wb_count = atomic64_inc_return(&zram->stats.bd_writes);
/*
 * We released zram_slot_lock so need to check if the slot was
 * changed. If there is freeing for the slot, we can catch it
@@ -687,6 +727,9 @@ static ssize_t writeback_store(struct device *dev,
z
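Because the limit is compared against the cumulative bd_writes counter rather than a delta, a daemon granting a per-day budget would re-arm the knob relative to the current counter. A rough sketch (knob names per this series; the helper name and the 64 MiB budget are made up for illustration):

```shell
#!/bin/sh
# writeback_limit holds an absolute page count that bd_writes is checked
# against, so a daily budget means: new limit = writes so far + budget.
rearm_limit() {                 # $1 = current bd_writes, $2 = budget (pages)
	echo $(( $1 + $2 ))
}

BUDGET_PAGES=$(( 64 * 1024 * 1024 / 4096 ))   # 64 MiB/day as 4 KiB pages

# Once a day on a live system:
#   writes=$(awk '{print $3}' /sys/block/zram0/bd_stat)
#   rearm_limit "$writes" "$BUDGET_PAGES" > /sys/block/zram0/writeback_limit
echo "daily budget: $BUDGET_PAGES pages"      # → daily budget: 16384 pages
```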


[PATCH 4/6] zram: support idle page writeback

2018-11-15 Thread Minchan Kim
This patch supports a new feature, "zram idle page writeback".
In the zram-swap usecase, zram usually holds idle swap pages that
come from many processes. It's pointless to keep them in memory
(i.e., zram).

To solve the problem, this feature writes idle pages back to the
backing device, so the goal is to save more memory space on
embedded systems.

Normal sequence to use the feature is as follows,

while (1) {
# mark allocated zram slot to idle
echo 1 > /sys/block/zram0/idle
sleep several hours
# idle zram slots are still IDLE marked.
echo 3 > /sys/block/zram0/writeback
# write the IDLE marked slot into backing device and free
# the memory.
}

echo 'val' > /sys/block/zramX/writeback

val is a combination of bits.

0th bit: hugepage writeback
1st bit: idlepage writeback

Thus,
1 -> hugepage writeback
2 -> idlepage writeback
3 -> writeback both pages

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  19 +++
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 166 +++--
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 187 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and trigger idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..60b585dab6e0 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -244,6 +244,25 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+User can write idle pages back to the backing device. To use the feature,
+the user first needs to mark currently allocated zram slots as idle.
+Afterward, slots that have not been accessed since then still carry the
+idle mark. Then, if the user does
+   "echo val > /sys/block/zramX/writeback"
+
+  val is a combination of bits.
+
+  0th bit: hugepage writeback
+  1st bit: idlepage writeback
+
+  Thus,
+  1 -> hugepage writeback
+  2 -> idlepage writeback
+  3 -> writeback both pages
+
+zram will write the idle/huge pages back to the backing device and free
+the memory the pages occupied, saving memory.
+
 = memory tracking
 
 With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-   bool "Write back incompressible page to backing device"
+   bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
+With /sys/block/zramX/{idle,writeback}, applications can request
+writeback of idle pages to the backing device to save memory.
+
 See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_MEMORY_TRACKING
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f956179076ce..b7b5c9e5f0cd 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
 static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
+static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
+   u32 index, int offset, struct bio *bio);
+
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
@@ -73,13 +76,6 @@ static inline bool init_done(struct zram *zram)
return zram->disksize;
 }
 
-static inline bool zram_allocated(struct zram *zram, u32 index)
-{
-
-   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
-   zram->table[index].handle;
-}
-
 static inline struct zram *dev_to_zram(struct device *dev)
 {
r

[PATCH 4/6] zram: support idle page writeback

2018-11-15 Thread Minchan Kim
This patch supports new feature "zram idle page writeback".
On zram-swap usecase, zram has usually idle swap pages come
from many processes. It's pointless to keep in memory(ie, zram).

To solve the problem, this feature gives idle page writeback to
backing device so the goal is to save more memory space
on embedded system.

Normal sequence to use the feature is as follows,

while (1) {
# mark allocated zram slot to idle
echo 1 > /sys/block/zram0/idle
sleep several hours
# idle zram slots are still IDLE marked.
echo 3 > /sys/block/zram0/writeback
# write the IDLE marked slot into backing device and free
# the memory.
}

echo 'val' > /sys/block/zramX/writeback

val is combination of bits.

0th bit: hugepage writeback
1th bit: idlepage writeback

Thus,
1 -> hugepage writeback
2 -> idlepage writeabck
3 -> writeback both pages

Signed-off-by: Minchan Kim 
---
 Documentation/ABI/testing/sysfs-block-zram |   7 +
 Documentation/blockdev/zram.txt|  19 +++
 drivers/block/zram/Kconfig |   5 +-
 drivers/block/zram/zram_drv.c  | 166 +++--
 drivers/block/zram/zram_drv.h  |   1 +
 5 files changed, 187 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-block-zram 
b/Documentation/ABI/testing/sysfs-block-zram
index 04c9a5980bc7..d1f80b077885 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -106,3 +106,10 @@ Contact:   Minchan Kim 
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram/block_state
+
+What:  /sys/block/zram/writeback
+Date:  November 2018
+Contact:   Minchan Kim 
+Description:
+   The writeback file is write-only and trigger idle and/or
+   huge page writeback to backing device.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index f3bcd716d8a9..60b585dab6e0 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -244,6 +244,25 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+User can writeback idle pages to backing device. To use the feature,
+first, user need to mark zram slots allocated currently as idle.
+Afterward, slots not accessed since then will have still idle mark.
+Then, if user does,
+   "echo val > /sys/block/zramX/writeback"
+
+  val is combination of bits.
+
+  0th bit: hugepage writeback
+  1th bit: idlepage writeback
+
+  Thus,
+  1 -> hugepage writeback
+  2 -> idlepage writeabck
+  3 -> writeback both pages
+
+zram will write back the idle/huge pages to the backing device and free
+the memory the pages occupied, thus saving memory.
+
 = memory tracking
 
 With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index fcd055457364..1ffc64770643 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -15,7 +15,7 @@ config ZRAM
  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
-   bool "Write back incompressible page to backing device"
+   bool "Write back incompressible or idle page to backing device"
depends on ZRAM
help
 With incompressible page, there is no memory saving to keep it
@@ -23,6 +23,9 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
+With /sys/block/zramX/{idle,writeback}, applications can ask for
+idle pages to be written back to the backing device to save memory.
+
 See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_MEMORY_TRACKING
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f956179076ce..b7b5c9e5f0cd 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,6 +52,9 @@ static unsigned int num_devices = 1;
 static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
+static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
+   u32 index, int offset, struct bio *bio);
+
 
 static int zram_slot_trylock(struct zram *zram, u32 index)
 {
@@ -73,13 +76,6 @@ static inline bool init_done(struct zram *zram)
return zram->disksize;
 }
 
-static inline bool zram_allocated(struct zram *zram, u32 index)
-{
-
-   return (zram->table[index].flags >> (ZRAM_FLAG_SHIFT + 1)) ||
-   zram->table[index].handle;
-}
-
 static inline struct zram *dev_to_zram(struct device *dev)
 {
r

[PATCH 1/6] zram: fix lockdep warning of free block handling

2018-11-15 Thread Minchan Kim
[  254.519728] 
[  254.520311] WARNING: inconsistent lock state
[  254.520898] 4.19.0+ #390 Not tainted
[  254.521387] 
[  254.521732] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  254.521732] zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  254.521732] b1828693 (&(>bitmap_lock)->rlock){+.?.}, at: 
put_entry_bdev+0x1e/0x50
[  254.521732] {SOFTIRQ-ON-W} state was registered at:
[  254.521732]   _raw_spin_lock+0x2c/0x40
[  254.521732]   zram_make_request+0x755/0xdc9
[  254.521732]   generic_make_request+0x373/0x6a0
[  254.521732]   submit_bio+0x6c/0x140
[  254.521732]   __swap_writepage+0x3a8/0x480
[  254.521732]   shrink_page_list+0x1102/0x1a60
[  254.521732]   shrink_inactive_list+0x21b/0x3f0
[  254.521732]   shrink_node_memcg.constprop.99+0x4f8/0x7e0
[  254.521732]   shrink_node+0x7d/0x2f0
[  254.521732]   do_try_to_free_pages+0xe0/0x300
[  254.521732]   try_to_free_pages+0x116/0x2b0
[  254.521732]   __alloc_pages_slowpath+0x3f4/0xf80
[  254.521732]   __alloc_pages_nodemask+0x2a2/0x2f0
[  254.521732]   __handle_mm_fault+0x42e/0xb50
[  254.521732]   handle_mm_fault+0x55/0xb0
[  254.521732]   __do_page_fault+0x235/0x4b0
[  254.521732]   page_fault+0x1e/0x30
[  254.521732] irq event stamp: 228412
[  254.521732] hardirqs last  enabled at (228412): [] 
__slab_free+0x3e6/0x600
[  254.521732] hardirqs last disabled at (228411): [] 
__slab_free+0x1c5/0x600
[  254.521732] softirqs last  enabled at (228396): [] 
__do_softirq+0x31e/0x427
[  254.521732] softirqs last disabled at (228403): [] 
irq_exit+0xd1/0xe0
[  254.521732]
[  254.521732] other info that might help us debug this:
[  254.521732]  Possible unsafe locking scenario:
[  254.521732]
[  254.521732]CPU0
[  254.521732]
[  254.521732]   lock(&(>bitmap_lock)->rlock);
[  254.521732]   
[  254.521732] lock(&(>bitmap_lock)->rlock);
[  254.521732]
[  254.521732]  *** DEADLOCK ***
[  254.521732]
[  254.521732] no locks held by zram_verify/2095.
[  254.521732]
[  254.521732] stack backtrace:
[  254.521732] CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
[  254.521732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[  254.521732] Call Trace:
[  254.521732]  
[  254.521732]  dump_stack+0x67/0x9b
[  254.521732]  print_usage_bug+0x1bd/0x1d3
[  254.521732]  mark_lock+0x4aa/0x540
[  254.521732]  ? check_usage_backwards+0x160/0x160
[  254.521732]  __lock_acquire+0x51d/0x1300
[  254.521732]  ? free_debug_processing+0x24e/0x400
[  254.521732]  ? bio_endio+0x6d/0x1a0
[  254.521732]  ? lockdep_hardirqs_on+0x9b/0x180
[  254.521732]  ? lock_acquire+0x90/0x180
[  254.521732]  lock_acquire+0x90/0x180
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  _raw_spin_lock+0x2c/0x40
[  254.521732]  ? put_entry_bdev+0x1e/0x50
[  254.521732]  put_entry_bdev+0x1e/0x50
[  254.521732]  zram_free_page+0xf6/0x110
[  254.521732]  zram_slot_free_notify+0x42/0xa0
[  254.521732]  end_swap_bio_read+0x5b/0x170
[  254.521732]  blk_update_request+0x8f/0x340
[  254.521732]  scsi_end_request+0x2c/0x1e0
[  254.521732]  scsi_io_completion+0x98/0x650
[  254.521732]  blk_done_softirq+0x9e/0xd0
[  254.521732]  __do_softirq+0xcc/0x427
[  254.521732]  irq_exit+0xd1/0xe0
[  254.521732]  do_IRQ+0x93/0x120
[  254.521732]  common_interrupt+0xf/0xf
[  254.521732]  

With the writeback feature, zram_slot_free_notify could be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that, so lockdep yells out.

The problem is not only bitmap_lock but also zram_slot_lock,
so the straightforward solution would be to disable irqs around
zram_slot_lock, which covers every bitmap_lock, too.
Although the duration of irq disabling is short in most places
zram_slot_lock is used, one place (ie, decompress) is not fast
enough to hold an irqlock because it depends on the compression
algorithm, so that is not an option.

The approach in this patch is "best effort", not a guarantee of
"freeing the orphan zpage". If zram_slot_lock contention happens,
the kernel cannot free the zpage until the block is recycled. However,
such contention between zram_slot_free_notify and other places holding
zram_slot_lock should be very rare in practice.
To see how often it happens, this patch adds a new debug stat,
"miss_free".

It also adds an irq lock in get/put_block_bdev to prevent the deadlock
lockdep reported. The reason I used irq disabling rather than bottom
halves is that swap_slot_free_notify can be called with irqs disabled,
which breaks local_bh_enable's rule. The irqlock is taken only for
written-back zram slot entries, so it should not be a frequently
contended lock.
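The "best effort" free pattern described above can be sketched in
user-space C. The pthread calls and names below are stand-ins for the
kernel's zram_slot_trylock and slot table, not the actual driver API:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long miss_free;   /* mirrors the new debug stat */
static int slot_in_use = 1;

/* Called from a context (softirq in the patch) where blocking on the
 * lock is not allowed: try once, and on contention count a miss
 * instead of freeing -- the slot is reclaimed later when the block
 * is recycled. */
static void slot_free_notify(void)
{
	if (pthread_mutex_trylock(&slot_lock) != 0) {
		miss_free++;      /* orphan zpage left for later */
		return;
	}
	slot_in_use = 0;          /* the actual free would happen here */
	pthread_mutex_unlock(&slot_lock);
}
```

The design trade-off is that a contended trylock leaks the zpage only
until the block is recycled, in exchange for never sleeping or spinning
in the softirq path.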

Cc: sta...@vger.kernel.org # 4.14+
Signed-off-by: Minchan Kim 
---
 drivers/block/zram/zram_drv.c | 56 +--
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/driver
