[PATCH 2/2] Add shrinker interface for mempressure cgroup

2013-01-04 Thread Anton Vorontsov
This commit implements Andrew Morton's idea of a kernel-controlled userland
reclaimer. This is very similar to the in-kernel shrinker, with one major
difference: it is asynchronous, i.e. like kswapd.

Note that the shrinker interface is not a substitute for the levels; the
two interfaces report different kinds of information (i.e. with the shrinker
you don't know the actual system state -- how bad/good the memory
situation is).

The interface is well documented and comes with a stress-test utility.
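
For illustration only, the userland side of the protocol might be sketched
as below. This is not the code from mempressure_test.c. The helper names
are hypothetical, and the registration string assumes the usual
cgroup.event_control layout ("<eventfd> <control file fd> <args>"), with
the chunk size in bytes as the only argument.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * ASSUMPTION: registration follows the usual cgroup event_control layout,
 * "<eventfd> <fd of mempressure.shrinker> <chunk size in bytes>".
 */
static int fmt_register(char *buf, size_t len, int efd, int sfd,
			size_t chunk_size)
{
	return snprintf(buf, len, "%d %d %zu", efd, sfd, chunk_size);
}

/*
 * Per the documentation in this patch, writes to mempressure.shrinker are
 * "<eventfd> <number of chunks>"; a negative count decrements the counter.
 */
static int fmt_chunks(char *buf, size_t len, int efd, long nr_chunks)
{
	return snprintf(buf, len, "%d %ld", efd, nr_chunks);
}

/* Write a whole string to an already opened control file; 0 on success. */
static int write_str(int fd, const char *s)
{
	size_t len = strlen(s);

	return write(fd, s, len) == (ssize_t)len ? 0 : -1;
}

/* Block on the eventfd; returns the number of chunks to free, 0 on error. */
static uint64_t wait_for_request(int efd)
{
	uint64_t nr = 0;

	if (read(efd, &nr, sizeof(nr)) != (ssize_t)sizeof(nr))
		return 0;
	return nr;
}
```

A program would create the eventfd with eventfd(0, 0), open
mempressure.shrinker and cgroup.event_control, write the fmt_register()
string to event_control, announce its reclaimable chunks with the
fmt_chunks() string, and then loop on wait_for_request(), freeing the
returned number of chunks each time (or re-adding them if it could not).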

Signed-off-by: Anton Vorontsov <anton.voront...@linaro.org>
---
 Documentation/cgroups/mempressure.txt    |  53 +++-
 Documentation/cgroups/mempressure_test.c | 213 +++
 init/Kconfig                             |   5 +-
 mm/mempressure.c                         | 157 +++
 4 files changed, 423 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/cgroups/mempressure_test.c

diff --git a/Documentation/cgroups/mempressure.txt b/Documentation/cgroups/mempressure.txt
index dbc0aca..5094749 100644
--- a/Documentation/cgroups/mempressure.txt
+++ b/Documentation/cgroups/mempressure.txt
@@ -16,10 +16,55 @@
 
   After the hierarchy is mounted, you can use the following API:
 
+  /sys/fs/cgroup/.../mempressure.shrinker
+~~~
+  The file implements a userland shrinker (memory reclaimer) interface, so
+  that the kernel can ask userland to help with the memory reclaiming
+  process.
+
+  There are two basic concepts: chunks and chunk size. The program must
+  tell the kernel the granularity of its allocations (chunk size) and the
+  number of reclaimable chunks. The granularity need not be 100% accurate,
+  but the more accurate it is, the better. E.g. suppose the application
+  has 200 page renders cached (but not displayed), 1MB each. Then the chunk
+  size is 1MB, and the number of chunks is 200.
+
+  The granularity is specified during shrinker registration (i.e. via
+  argument to the event_control cgroup file; and it is OK to register
+  multiple shrinkers for different granularities). The number of
+  reclaimable chunks is specified by writing to the mempressure.shrinker
+  file.
+
+  The notification comes through the eventfd() interface. Upon the
+  notification, a read() from the eventfd returns the number of chunks to
+  reclaim (free).
+
+  It is assumed that the application will free the specified number of
+  chunks before reading from the eventfd again. If that is not the case,
+  e.g. the program was not able to reclaim the chunks, then the application
+  should re-add that number of chunks by writing to the
+  mempressure.shrinker file (otherwise the kernel will not account for the
+  chunks, since it assumes that they were reclaimed).
+
+  Event control:
+    Used to set up shrinker events. There is only one argument for the
+    event control: chunk size in bytes.
+  Read:
+    Not implemented.
+  Write:
+    Writes must be in "<eventfd> <number of chunks>" format. Positive
+    numbers increment the internal counter, negative numbers decrement it
+    (but the kernel prevents the counter from falling below zero).
+  Test:
+    See mempressure_test.c
+
   /sys/fs/cgroup/.../mempressure.level
 
-  To maintain the interactivity/memory allocation cost, one can use the
-  pressure level notifications, and the levels are defined like this:
+  Instead of working at the byte level (like shrinkers), one may decide
+  to maintain the interactivity/memory allocation cost.
+
+  For this, the cgroup has memory pressure level notifications, and the
+  levels are defined like this:
 
   The "low" level means that the system is reclaiming memory for new
   allocations. Monitoring reclaiming activity might be useful for
@@ -30,7 +75,9 @@
   The "medium" level means that the system is experiencing medium memory
   pressure, there is some mild swapping activity. Upon this event
   applications may decide to free any resources that can be easily
-  reconstructed or re-read from a disk.
+  reconstructed or re-read from a disk. Note that for fine-grained
+  control, you should probably use the shrinker interface, as described
+  above.
 
   The "oom" level means that the system is actively thrashing, it is about
   to out of memory (OOM) or even the in-kernel OOM killer is on its way to
diff --git a/Documentation/cgroups/mempressure_test.c b/Documentation/cgroups/mempressure_test.c
new file mode 100644
index 0000000..a6c770c
--- /dev/null
+++ b/Documentation/cgroups/mempressure_test.c
@@ -0,0 +1,213 @@
+/*
+ * mempressure shrinker test
+ *
+ * Copyright 2012 Linaro Ltd.
+ *   Anton Vorontsov 
+ *
+ * It is pretty simple: we create two threads, the first one constantly
+ * tries to allocate memory (more than we physically have), the second
+ * thread listens to the kernel shrinker notifications and frees the
+ * requested number of chunks. When we allocate more than available RAM,
+ * the two threads start to fight. Ideally, we should
