Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-20 Thread Minchan Kim
> > > >   points out that suspend/resume can mess it up and that the counter
> > > >   can drift between cpus.
> > > 
> > > Good point!
> > > 
> > > I just referenced it from ftrace because I thought the goal is similar
> > > "no need to be exact unless the drift is frequent but wanted to be fast"
> > > 
> > > AFAIK, ftrace/printk is active user of the function so if the problem
> > > happens frequently, it might be serious. :)
> > 
> > It could be that ktime_get() is a better fit here - especially if
> > sched_clock() goes nuts after resume.  Unfortunately ktime_get()
> > appears to be totally undocumented :(
> > 
> 
> I will use ktime_get_boottime(). With it, zram is not damaged by
> suspend/resume and the code would be simpler/clearer. For users, it
> would be more straightforward to parse the time.
> 
> Thanks for good suggestion, Andrew!
> 

Hey Andrew,

This is the updated patch for 4/4.
If you want me to resend the full patchset instead, please tell me and
I will send it.

>From 2ac685c32ffd3fba42d5eea6347f924c6e89bec0 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minc...@kernel.org>
Date: Mon, 9 Apr 2018 14:34:43 +0900
Subject: [PATCH v5 4/4] zram: introduce zram memory tracking

zRAM as swap is useful for small memory devices. However, swap means
those pages on zram are mostly cold pages due to the VM's LRU algorithm.
Especially, once an application's init data is touched during launch,
it tends not to be accessed any more and is finally swapped out.
zRAM can store such cold pages in compressed form, but it's pointless
to keep them in memory. A better idea is for app developers to free
them directly rather than leaving them on the heap.

This patch tells us the last access time of each block of zram via
"cat /sys/kernel/debug/zram/zram0/block_state".

The output is as follows,
  300 75.033841 .wh
  301 63.806904 s..
  302 63.806919 ..h

First column is zram's block index and the 3rd one represents the
symbol (s: same page w: written page to backing store h: huge page)
of the block state. Second column represents the time, with
microsecond resolution, at which the block was last accessed. So the
above example means the 300th block was accessed at 75.033841 seconds
and it was huge, so it was written to the backing store.
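As a rough illustration of how simple this format is to consume, here is a hypothetical userspace sketch (not the actual tool discussed in this thread; `parse_block_state` and the sample values are made up, based only on the example output above) that parses such lines and picks out the huge-page blocks:

```python
# Hypothetical parser for block_state output ("index time state"),
# based only on the example lines in this changelog.

def parse_block_state(text):
    """Yield (block_index, access_time_sec, state) tuples."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed/empty lines
        index, when, state = fields
        yield int(index), float(when), state

sample = """\
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
"""

# Blocks whose state includes 'h' (huge page):
huge_blocks = [i for i, _, s in parse_block_state(sample) if 'h' in s]
print(huge_blocks)  # [300, 302]
```

The flag characters (s, w, h) match the legend in this changelog; everything else in the sketch is an assumption.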

Admins can leverage this information to catch cold|incompressible pages
of a process with *pagemap* once part of its heap is swapped out.
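For context, the *pagemap* side of that workflow reads 64-bit little-endian entries from /proc/<pid>/pagemap, where bit 62 flags a swapped-out page and bit 63 a present one (per the kernel's pagemap documentation). A minimal decoding sketch, shown on a synthetic entry rather than a live process; `is_swapped` is an illustrative helper, not part of any real tool:

```python
import struct

# Bit layout of a pagemap entry (Documentation/vm/pagemap.txt):
# bit 63 = page present, bit 62 = page swapped.
PM_SWAP = 1 << 62
PM_PRESENT = 1 << 63

def is_swapped(entry_bytes):
    """True if an 8-byte little-endian pagemap entry marks a swapped page."""
    (entry,) = struct.unpack('<Q', entry_bytes)
    return bool(entry & PM_SWAP)

# Synthetic entries standing in for a read() from /proc/<pid>/pagemap:
swapped_entry = struct.pack('<Q', PM_SWAP | 0x1234)
resident_entry = struct.pack('<Q', PM_PRESENT | 0x1234)
print(is_swapped(swapped_entry), is_swapped(resident_entry))  # True False
```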

I used the feature a few years ago to find memory hoggers in userspace
and notify them of what memory they had wasted without touching it for a
long time. With it, they could reduce unnecessary memory space. However,
at that time I hacked up zram for the feature, but now I need the feature
again, so I decided it would be better to upstream it rather than keeping
it alone. I hope to submit the userspace tool that uses the feature soon.

Cc: Randy Dunlap <rdun...@infradead.org>
Acked-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhat...@gmail.com>
Signed-off-by: Minchan Kim <minc...@kernel.org>
---
* from v4
  * use ktime_get_boottime - Andrew
  * add feature use case in change log - Andrew

 Documentation/blockdev/zram.txt |  24 ++
 drivers/block/zram/Kconfig  |  14 +++-
 drivers/block/zram/zram_drv.c   | 131 +---
 drivers/block/zram/zram_drv.h   |   7 +-
 4 files changed, 162 insertions(+), 14 deletions(-)

diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 78db38d02bc9..6cb804b709cf 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -243,5 +243,29 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+= memory tracking
+
+With CONFIG_ZRAM_MEMORY_TRACKING, user can know information about each
+zram block. It could be useful to catch cold or incompressible
+pages of the process with *pagemap*.
+If you enable the feature, you could see the block state via
+/sys/kernel/debug/zram/zram0/block_state. The output is as follows,
+
+ 300 75.033841 .wh
+ 301 63.806904 s..
+ 302 63.806919 ..h
+
+First column is zram's block index.
+Second column is the access time since system boot.
+Third column is the state of the block.
+(s: same page
+w: written page to backing store
+h: huge page)
+
+The first line of the above example says the 300th block was accessed
+at 75.033841 seconds and the block's state is huge, so it was written
+back to the backing storage. It's a debugging feature, so no one
+should rely on it to work properly.
+
 Nitin Gupta
 ngu...@vflare.org
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index ac3a31d433b2..635235759a0a 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -13,7 +13,7 @@ config ZRAM
  It has several use cases, for example: /tmp storage, use as swap
  disks and maybe many more.
 
- See zram.txt for more information.
+ See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
bool "Write back incompressib

Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-19 Thread Minchan Kim
On Fri, Apr 20, 2018 at 11:18:34AM +0900, Sergey Senozhatsky wrote:
> On (04/20/18 11:09), Minchan Kim wrote:
> [..]
> > > hm, OK, can we get this info into the changelog?  
> > 
> > No problem. I will add as follows,
> > 
> > "I used the feature a few years ago to find memory hoggers in userspace
> > to notify them of what memory they have wasted without touching it for a
> > long time. With it, they could reduce unnecessary memory space. However,
> > at that time, I hacked up zram for the feature but now I need the feature
> > again so I decided it would be better to upstream rather than keeping it
> > alone. I hope to submit the userspace tool to use the feature soon"
> 
> Shall we then just wait until you resubmit the "complete" patch set: zram
> tracking + the user space tool which would parse the tracking output?

tl;dr: I think the userspace tool is just ancillary, not a must.

Although my main purpose is to find idle memory hoggers, I don't think
a userspace tool is a must for merging this feature, because someone
might want to do other things regardless of the tool.

Examples off the top of my head are seeing how the swap write pattern
goes on and how sparse swap writes happen. :)
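As a sketch of that second example, assuming block_state has already been parsed into (index, time, state) tuples, the sparseness of writeback could be gauged from the index gaps between written-back blocks. `writeback_gaps` and the record values are hypothetical:

```python
def writeback_gaps(records):
    """Index gaps between consecutive blocks whose state includes 'w'."""
    written = sorted(index for index, _time, state in records if 'w' in state)
    return [b - a for a, b in zip(written, written[1:])]

# Made-up (index, access_time, state) tuples; 'w' marks written-back blocks.
records = [(300, 75.03, '.wh'), (301, 63.80, 's..'),
           (310, 64.10, '.w.'), (340, 64.20, '.wh')]
print(writeback_gaps(records))  # [10, 30]
```

Large gaps would indicate a sparse writeback pattern; a run of 1s, a dense one.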


Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-19 Thread Sergey Senozhatsky
On (04/20/18 11:09), Minchan Kim wrote:
[..]
> > hm, OK, can we get this info into the changelog?  
> 
> No problem. I will add as follows,
> 
> "I used the feature a few years ago to find memory hoggers in userspace
> to notify them of what memory they have wasted without touching it for a
> long time. With it, they could reduce unnecessary memory space. However,
> at that time, I hacked up zram for the feature but now I need the feature
> again so I decided it would be better to upstream rather than keeping it
> alone. I hope to submit the userspace tool to use the feature soon"

Shall we then just wait until you resubmit the "complete" patch set: zram
tracking + the user space tool which would parse the tracking output?

-ss


Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-19 Thread Minchan Kim
On Wed, Apr 18, 2018 at 02:07:15PM -0700, Andrew Morton wrote:
> On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim  wrote:
> 
> > Hi Andrew,
> > 
> > On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim  wrote:
> > > 
> > > > zRam as swap is useful for small memory device. However, swap means
> > > > those pages on zram are mostly cold pages due to VM's LRU algorithm.
> > > > Especially, once init data for application are touched for launching,
> > > > they tend to be not accessed any more and finally swapped out.
> > > > zRAM can store such cold pages as compressed form but it's pointless
> > > > to keep in memory. Better idea is app developers free them directly
> > > > rather than remaining them on heap.
> > > > 
> > > > This patch tell us last access time of each block of zram via
> > > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > > > 
> > > > The output is as follows,
> > > >   300 75.033841 .wh
> > > >   301 63.806904 s..
> > > >   302 63.806919 ..h
> > > > 
> > > > First column is zram's block index and 3rd one represents symbol
> > > > (s: same page w: written page to backing store h: huge page) of the
> > > > block state. Second column represents usec time unit of the block
> > > > was last accessed. So above example means the 300th block is accessed
> > > > at 75.033841 second and it was huge so it was written to the backing
> > > > store.
> > > > 
> > > > Admin can leverage this information to catch cold|incompressible pages
> > > > of process with *pagemap* once part of heaps are swapped out.
> > > 
> > > A few things..
> > > 
> > > - Terms like "Admin can" and "Admin could" are worrisome.  How do we
> > >   know that admins *will* use this?  How do we know that we aren't
> > >   adding a bunch of stuff which nobody will find to be (sufficiently)
> > >   useful?  For example, is there some userspace tool to which you are
> > >   contributing which will be updated to use this feature?
> > 
> > Actually, I used this feature two years ago to find memory hogger
> > although the feature was very fast prototyping. It was very useful
> > to reduce memory cost in embedded space.
> > 
> > The reason I am trying to upstream the feature is I need the feature
> > again. :)
> > 
> > Yub, I have a userspace tool to use the feature although it was
> > not compatible with this new version. It should be updated with
> > new format. I will find a time to submit the tool.
> 
> hm, OK, can we get this info into the changelog?  

No problem. I will add as follows,

"I used the feature a few years ago to find memory hoggers in userspace
to notify them of what memory they have wasted without touching it for a
long time. With it, they could reduce unnecessary memory space. However,
at that time, I hacked up zram for the feature but now I need the feature
again so I decided it would be better to upstream rather than keeping it
alone. I hope to submit the userspace tool to use the feature soon"

> 
> > > 
> > > - block_state's second column is in microseconds since some
> > >   undocumented time.  But how is userspace to know how much time has
> > >   elapsed since the access?  ie, "current time".
> > 
> > It's sched_clock so it should be the elapsed time since system boot.
> > I should have written it explicitly.
> > I will fix it.
> > 
> > > 
> > > - Is the sched_clock() return value suitable for exporting to
> > >   userspace?  Is it monotonic?  Is it consistent across CPUs, across
> > >   CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
> > >   way up to 2^64 on all CPU types, or will some processors wrap it at
> > >   (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
> > >   points out that suspend/resume can mess it up and that the counter
> > >   can drift between cpus.
> > 
> > Good point!
> > 
> > I just referenced it from ftrace because I thought the goal is similar
> > "no need to be exact unless the drift is frequent but wanted to be fast"
> > 
> > AFAIK, ftrace/printk is active user of the function so if the problem
> > happens frequently, it might be serious. :)
> 
> It could be that ktime_get() is a better fit here - especially if
> sched_clock() goes nuts after resume.  Unfortunately ktime_get()
> appears to be totally undocumented :(
> 

I will use ktime_get_boottime(). With it, zram is not damaged by
suspend/resume and the code would be simpler/clearer. For users, it
would be more straightforward to parse the time.

Thanks for good suggestion, Andrew!
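A small sketch of what the boottime choice buys userspace: since the recorded timestamps share the time base of ktime_get_boottime(), the idle age of a block is just the current CLOCK_BOOTTIME reading minus the recorded access time. `idle_age` and the timestamps are made up; time.CLOCK_BOOTTIME is Linux-only:

```python
import time

def idle_age(access_time_sec, now=None):
    """Seconds since a block was last touched, on the boottime clock."""
    if now is None:
        # Same time base as ktime_get_boottime(): monotonic across
        # suspend/resume, counting time spent suspended.
        now = time.clock_gettime(time.CLOCK_BOOTTIME)
    return now - access_time_sec

# Passing 'now' explicitly keeps the example deterministic:
print(idle_age(75.0, now=100.0))  # 25.0
```

With sched_clock()-based stamps, no such simple subtraction would be reliable after a suspend/resume cycle.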



Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-18 Thread Andrew Morton
On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim  wrote:

> Hi Andrew,
> 
> On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim  wrote:
> > 
> > > zRam as swap is useful for small memory device. However, swap means
> > > those pages on zram are mostly cold pages due to VM's LRU algorithm.
> > > Especially, once init data for application are touched for launching,
> > > they tend to be not accessed any more and finally swapped out.
> > > zRAM can store such cold pages as compressed form but it's pointless
> > > to keep in memory. Better idea is app developers free them directly
> > > rather than remaining them on heap.
> > > 
> > > This patch tell us last access time of each block of zram via
> > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > > 
> > > The output is as follows,
> > >   300 75.033841 .wh
> > >   301 63.806904 s..
> > >   302 63.806919 ..h
> > > 
> > > First column is zram's block index and 3rd one represents symbol
> > > (s: same page w: written page to backing store h: huge page) of the
> > > block state. Second column represents usec time unit of the block
> > > was last accessed. So above example means the 300th block is accessed
> > > at 75.033841 second and it was huge so it was written to the backing
> > > store.
> > > 
> > > Admin can leverage this information to catch cold|incompressible pages
> > > of process with *pagemap* once part of heaps are swapped out.
> > 
> > A few things..
> > 
> > - Terms like "Admin can" and "Admin could" are worrisome.  How do we
> >   know that admins *will* use this?  How do we know that we aren't
> >   adding a bunch of stuff which nobody will find to be (sufficiently)
> >   useful?  For example, is there some userspace tool to which you are
> >   contributing which will be updated to use this feature?
> 
> Actually, I used this feature two years ago to find memory hogger
> although the feature was very fast prototyping. It was very useful
> to reduce memory cost in embedded space.
> 
> The reason I am trying to upstream the feature is I need the feature
> again. :)
> 
> Yub, I have a userspace tool to use the feature although it was
> not compatible with this new version. It should be updated with
> new format. I will find a time to submit the tool.

hm, OK, can we get this info into the changelog?  

> > 
> > - block_state's second column is in microseconds since some
> >   undocumented time.  But how is userspace to know how much time has
> >   elapsed since the access?  ie, "current time".
> 
> It's a sched_clock so it should be elapsed time since the system boot.
> I should have written it explicitly.
> I will fix it.
> 
> > 
> > - Is the sched_clock() return value suitable for exporting to
> >   userspace?  Is it monotonic?  Is it consistent across CPUs, across
> >   CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
> >   way up to 2^64 on all CPU types, or will some processors wrap it at
> >   (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
> >   points out that suspend/resume can mess it up and that the counter
> >   can drift between cpus.
> 
> Good point!
> 
> I just referenced it from ftrace because I thought the goal is similar
> "no need to be exact unless the drift is frequent but wanted to be fast"
> 
> AFAIK, ftrace/printk is active user of the function so if the problem
> happens frequently, it might be serious. :)

It could be that ktime_get() is a better fit here - especially if
sched_clock() goes nuts after resume.  Unfortunately ktime_get()
appears to be totally undocumented :(



Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-17 Thread Minchan Kim
Hi Andrew,

On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim  wrote:
> 
> > zRam as swap is useful for small memory device. However, swap means
> > those pages on zram are mostly cold pages due to VM's LRU algorithm.
> > Especially, once init data for application are touched for launching,
> > they tend to be not accessed any more and finally swapped out.
> > zRAM can store such cold pages as compressed form but it's pointless
> > to keep in memory. Better idea is app developers free them directly
> > rather than remaining them on heap.
> > 
> > This patch tell us last access time of each block of zram via
> > "cat /sys/kernel/debug/zram/zram0/block_state".
> > 
> > The output is as follows,
> >   300 75.033841 .wh
> >   301 63.806904 s..
> >   302 63.806919 ..h
> > 
> > First column is zram's block index and 3rd one represents symbol
> > (s: same page w: written page to backing store h: huge page) of the
> > block state. Second column represents usec time unit of the block
> > was last accessed. So above example means the 300th block is accessed
> > at 75.033841 second and it was huge so it was written to the backing
> > store.
> > 
> > Admin can leverage this information to catch cold|incompressible pages
> > of process with *pagemap* once part of heaps are swapped out.
> 
> A few things..
> 
> - Terms like "Admin can" and "Admin could" are worrisome.  How do we
>   know that admins *will* use this?  How do we know that we aren't
>   adding a bunch of stuff which nobody will find to be (sufficiently)
>   useful?  For example, is there some userspace tool to which you are
>   contributing which will be updated to use this feature?

Actually, I used this feature two years ago to find memory hoggers,
although it was a very rough prototype. It was very useful for reducing
memory cost in the embedded space.

The reason I am trying to upstream the feature is I need the feature
again. :)

Yes, I have a userspace tool that uses the feature, although it is not
compatible with this new version. It should be updated to the new
format. I will find time to submit the tool.

> 
> - block_state's second column is in microseconds since some
>   undocumented time.  But how is userspace to know how much time has
>   elapsed since the access?  ie, "current time".

It's sched_clock, so it should be the elapsed time since system boot.
I should have written that explicitly.
I will fix it.

> 
> - Is the sched_clock() return value suitable for exporting to
>   userspace?  Is it monotonic?  Is it consistent across CPUs, across
>   CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
>   way up to 2^64 on all CPU types, or will some processors wrap it at
>   (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
>   points out that suspend/resume can mess it up and that the counter
>   can drift between cpus.

Good point!

I just referenced it from ftrace because I thought the goal was similar:
"no need to be exact unless the drift is frequent, but wanted to be fast".

AFAIK, ftrace/printk are active users of the function, so if the problem
happens frequently, it might be serious. :)




Re: [PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-17 Thread Andrew Morton
On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim  wrote:

> zRAM as swap is useful for small memory devices. However, swap means
> those pages on zram are mostly cold pages due to the VM's LRU algorithm.
> Especially, once init data for an application are touched for launching,
> they tend not to be accessed any more and are finally swapped out.
> zRAM can store such cold pages in compressed form, but it's pointless
> to keep them in memory. A better idea is for app developers to free
> them directly rather than leaving them on the heap.
> 
> This patch tells us the last access time of each block of zram via
> "cat /sys/kernel/debug/zram/zram0/block_state".
> 
> The output is as follows,
>   30075.033841 .wh
>   30163.806904 s..
>   30263.806919 ..h
> 
> First column is zram's block index and the 3rd one represents the symbol
> (s: same page w: written page to backing store h: huge page) of the
> block state. The second column represents the time, in microseconds,
> when the block was last accessed. So the above example means the 300th
> block was accessed at 75.033841 seconds, and it was huge, so it was
> written to the backing store.
> 
> Admins can leverage this information to catch cold or incompressible
> pages of a process with *pagemap* once part of its heap has been
> swapped out.

A few things..

- Terms like "Admin can" and "Admin could" are worrisome.  How do we
  know that admins *will* use this?  How do we know that we aren't
  adding a bunch of stuff which nobody will find to be (sufficiently)
  useful?  For example, is there some userspace tool to which you are
  contributing which will be updated to use this feature?

- block_state's second column is in microseconds since some
  undocumented time.  But how is userspace to know how much time has
  elapsed since the access?  ie, "current time".

- Is the sched_clock() return value suitable for exporting to
  userspace?  Is it monotonic?  Is it consistent across CPUs, across
  CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
  way up to 2^64 on all CPU types, or will some processors wrap it at
  (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
  points out that suspend/resume can mess it up and that the counter
  can drift between cpus.








[PATCH v5 4/4] zram: introduce zram memory tracking

2018-04-16 Thread Minchan Kim
zRAM as swap is useful for small memory devices. However, swap means
those pages on zram are mostly cold pages due to the VM's LRU algorithm.
Especially, once init data for an application are touched for launching,
they tend not to be accessed any more and are finally swapped out.
zRAM can store such cold pages in compressed form, but it's pointless
to keep them in memory. A better idea is for app developers to free
them directly rather than leaving them on the heap.

This patch tells us the last access time of each block of zram via
"cat /sys/kernel/debug/zram/zram0/block_state".

The output is as follows,
  30075.033841 .wh
  30163.806904 s..
  30263.806919 ..h

First column is zram's block index and the 3rd one represents the symbol
(s: same page w: written page to backing store h: huge page) of the
block state. The second column represents the time, in microseconds,
when the block was last accessed. So the above example means the 300th
block was accessed at 75.033841 seconds, and it was huge, so it was
written to the backing store.

Admins can leverage this information to catch cold or incompressible
pages of a process with *pagemap* once part of its heap has been
swapped out.

Cc: Randy Dunlap 
Acked-by: Greg Kroah-Hartman 
Reviewed-by: Sergey Senozhatsky 
Signed-off-by: Minchan Kim 
---
 Documentation/blockdev/zram.txt |  24 ++
 drivers/block/zram/Kconfig  |  14 +++-
 drivers/block/zram/zram_drv.c   | 140 +---
 drivers/block/zram/zram_drv.h   |   5 ++
 4 files changed, 170 insertions(+), 13 deletions(-)

diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 78db38d02bc9..49015b51ef1e 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -243,5 +243,29 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+= memory tracking
+
+With CONFIG_ZRAM_MEMORY_TRACKING, users can inspect the state of each
+zram block. It could be useful to catch cold or incompressible
+pages of a process with *pagemap*.
+If you enable the feature, you can see the block state via
+/sys/kernel/debug/zram/zram0/block_state. The output is as follows:
+
+ 30075.033841 .wh
+ 30163.806904 s..
+ 30263.806919 ..h
+
+First column is zram's block index.
+Second column is access time.
+Third column is state of the block.
+(s: same page
+w: written page to backing store
+h: huge page)
+
+The first line of the above example says the 300th block was accessed
+at 75.033841 sec, and the block was huge, so it was written back to the
+backing storage. It's a debugging feature, so nobody should rely on it
+to work properly.
+
 Nitin Gupta
 ngu...@vflare.org
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index ac3a31d433b2..635235759a0a 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -13,7 +13,7 @@ config ZRAM
  It has several use cases, for example: /tmp storage, use as swap
  disks and maybe many more.
 
- See zram.txt for more information.
+ See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
bool "Write back incompressible page to backing device"
@@ -25,4 +25,14 @@ config ZRAM_WRITEBACK
 For this feature, admin should set up backing device via
 /sys/block/zramX/backing_dev.
 
-See zram.txt for more infomration.
+See Documentation/blockdev/zram.txt for more information.
+
+config ZRAM_MEMORY_TRACKING
+   bool "Track zRam block status"
+   depends on ZRAM && DEBUG_FS
+   help
+ With this feature, admin can track the state of allocated blocks
+ of zRAM. Admin could see the information via
+ /sys/kernel/debug/zram/zramX/block_state.
+
+ See Documentation/blockdev/zram.txt for more information.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7fc10e2ad734..a999d9996a13 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include <linux/debugfs.h>
 #include 
 
 #include "zram_drv.h"
@@ -67,6 +68,13 @@ static inline bool init_done(struct zram *zram)
return zram->disksize;
 }
 
+static inline bool zram_allocated(struct zram *zram, u32 index)
+{
+
+   return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+   zram->table[index].handle;
+}
+
 static inline struct zram *dev_to_zram(struct device *dev)
 {
return (struct zram *)dev_to_disk(dev)->private_data;
@@ -83,7 +91,7 @@ static void zram_set_handle(struct zram *zram, u32 index, 
unsigned long handle)
 }
 
 /* flag operations require table entry bit_spin_lock() being held */
-static int zram_test_flag(struct zram *zram, u32 index,
+static bool zram_test_flag(struct zram *zram, u32 index,
enum zram_pageflags flag)
 {
return zram->table[index].value & BIT(flag);
@@