[Devel] [PATCH rh7 10/39] mm: kasan: initial memory quarantine implementation

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

Quarantine isolates freed objects in a separate queue.  The objects are
returned to the allocator later, which helps to detect use-after-free
errors.

When the object is freed, its state changes from KASAN_STATE_ALLOC to
KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
instead of being returned to the allocator, therefore every subsequent
access to that object triggers a KASAN error, and the error handler is
able to say where the object has been allocated and deallocated.

When it's time for the object to leave quarantine, its state becomes
KASAN_STATE_FREE and it's returned to the allocator.  From now on the
allocator may reuse it for another allocation.  Before that happens,
it's still possible to detect a use-after-free on that object (it
retains the allocation/deallocation stacks).

When the allocator reuses this object, the shadow is unpoisoned and the old
allocation/deallocation stacks are wiped.  Therefore a use of this
object, even an incorrect one, won't trigger an ASan warning.

Without the quarantine, it's not guaranteed that objects aren't
reused immediately, which is why the probability of catching a
use-after-free is lower than with the quarantine in place.

Freed objects are first added to per-cpu quarantine queues.  When a
cache is destroyed or memory shrinking is requested, the objects are
moved into the global quarantine queue.  Whenever a kmalloc call allows
memory reclaiming, the oldest objects are popped out of the global queue
until the total size of objects in quarantine is less than 3/4 of the
maximum quarantine size (which is a fraction of installed physical
memory).

As long as an object remains in the quarantine, KASAN is able to report
accesses to it, so the chance of reporting a use-after-free is
increased.  Once the object leaves quarantine, the allocator may reuse
it, in which case the object is unpoisoned and KASAN can't detect
incorrect accesses to it.
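
The bookkeeping this requires is fairly small.  Below is a minimal,
self-contained sketch of the idea, assuming a singly linked FIFO whose
nodes are embedded in the freed objects; all names are illustrative and
intentionally do not match mm/kasan/quarantine.c:

    /*
     * Sketch only: a FIFO of freed objects with a byte budget.  The
     * qnode lives inside the freed object itself, so the queue costs
     * no extra memory beyond the objects it delays.
     */
    #include <stddef.h>

    struct qnode {
            struct qnode *next;     /* embedded in the freed object */
            size_t size;            /* size of the quarantined object */
    };

    struct qlist {
            struct qnode *head, *tail;
            size_t bytes;           /* total bytes currently queued */
    };

    /* On free: poison the object elsewhere, then queue it here. */
    static void quarantine_put_sketch(struct qlist *q, struct qnode *n,
                                      size_t size)
    {
            n->next = NULL;
            n->size = size;
            if (q->tail)
                    q->tail->next = n;
            else
                    q->head = n;
            q->tail = n;
            q->bytes += size;
    }

    /*
     * On a reclaim-capable kmalloc(): return the oldest objects to the
     * allocator until the queue drops below 3/4 of its limit.
     */
    static void quarantine_reduce_sketch(struct qlist *q, size_t limit,
                                         void (*really_free)(void *obj))
    {
            while (q->head && q->bytes > limit / 4 * 3) {
                    struct qnode *n = q->head;

                    q->head = n->next;
                    if (!q->head)
                            q->tail = NULL;
                    q->bytes -= n->size;
                    really_free(n);  /* object finally leaves quarantine */
            }
    }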

Right now quarantine support is only enabled in the SLAB allocator.
Unification of KASAN features in SLAB and SLUB will be done later.

This patch is based on the "mm: kasan: quarantine" patch originally
prepared by Dmitry Chernenkov.  A number of improvements have been
suggested by Andrey Ryabinin.

[gli...@google.com: v9]
  Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-gli...@google.com
Signed-off-by: Alexander Potapenko 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrey Konovalov 
Cc: Dmitry Vyukov 
Cc: Andrey Ryabinin 
Cc: Steven Rostedt 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 55834c59098d0c5a97b0f3247e55832b67facdcf)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h |  13 ++-
 mm/kasan/Makefile |   2 +
 mm/kasan/kasan.c  |  57 --
 mm/kasan/kasan.h  |  21 +++-
 mm/kasan/quarantine.c | 291 ++
 mm/kasan/report.c |   1 +
 mm/mempool.c  |   2 +-
 mm/slab.c |  12 ++-
 mm/slab.h |   1 +
 mm/slab_common.c  |   3 +
 10 files changed, 388 insertions(+), 15 deletions(-)
 create mode 100644 mm/kasan/quarantine.c

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index ab45598049da..9ab426991c4e 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -40,6 +40,8 @@ void kasan_free_pages(struct page *page, unsigned int order);
 
 void kasan_cache_create(struct kmem_cache *cache, size_t *size,
unsigned long *flags);
+void kasan_cache_shrink(struct kmem_cache *cache);
+void kasan_cache_destroy(struct kmem_cache *cache);
 
 void kasan_poison_slab(struct page *page);
 void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -53,7 +55,8 @@ void kasan_kmalloc(struct kmem_cache *s, const void *object, 
size_t size,
 void kasan_krealloc(const void *object, size_t new_size, gfp_t flags);
 
 void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags);
-void kasan_slab_free(struct kmem_cache *s, void *object);
+bool kasan_slab_free(struct kmem_cache *s, void *object);
+void kasan_poison_slab_free(struct kmem_cache *s, void *object);
 
 struct kasan_cache {
int alloc_meta_offset;
@@ -76,6 +79,8 @@ static inline void kasan_free_pages(struct page *page, 
unsigned int order) {}
 static inline void kasan_cache_create(struct kmem_cache *cache,
  size_t *size,
  unsigned long *flags) {}
+static inline void kasan_cache_shrink(struct kmem_cache *cache) {}
+static inline void kasan_cache_destroy(struct kmem_cache *cache) {}
 
 static inline void kasan_poi

[Devel] [PATCH rh7 17/39] lib/stackdepot.c: use __GFP_NOWARN for stack allocations

2017-09-14 Thread Andrey Ryabinin
From: "Kirill A. Shutemov" 

This (large, atomic) allocation attempt can fail.  We expect and handle
that, so avoid the scary warning.

Link: http://lkml.kernel.org/r/20160720151905.gb19...@node.shutemov.name
Cc: Andrey Ryabinin 
Cc: Alexander Potapenko 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: David Rientjes 
Cc: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 87cc271d5e4320d705cfdf59f68d4d037b3511b2)
Signed-off-by: Andrey Ryabinin 
---
 lib/stackdepot.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 53ad6c0831ae..60f77f1d470a 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -242,6 +242,7 @@ depot_stack_handle_t depot_save_stack(struct stack_trace 
*trace,
 */
alloc_flags &= ~GFP_ZONEMASK;
alloc_flags &= (GFP_ATOMIC | GFP_KERNEL);
+   alloc_flags |= __GFP_NOWARN;
page = alloc_pages(alloc_flags, STACK_ALLOC_ORDER);
if (page)
prealloc = page_address(page);
-- 
2.13.5



[Devel] [PATCH rh7 08/39] mm, kasan: stackdepot implementation. Enable stackdepot for SLAB

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
will allow KASAN to store allocation/deallocation stack traces for memory
chunks.  The stack traces are stored in a hash table and referenced by
handles which reside in the kasan_alloc_meta and kasan_free_meta
structures in the allocated memory chunks.

IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.

Right now stackdepot support is only enabled in SLAB allocator.  Once
KASAN features in SLAB are on par with those in SLUB we can switch SLUB
to stackdepot as well, thus removing the dependency on SLUB stack
bookkeeping, which wastes a lot of memory.
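
For reference, a save/fetch round trip through the new API looks roughly
like the sketch below; KASAN's real callers additionally filter IRQ
frames and skip uninteresting entries, and the helper names here are
illustrative:

    #include <linux/gfp.h>
    #include <linux/stacktrace.h>
    #include <linux/stackdepot.h>

    #define SKETCH_STACK_DEPTH 64

    /* Capture the current stack and compress it into a 32-bit handle. */
    static depot_stack_handle_t save_current_stack(gfp_t flags)
    {
            unsigned long entries[SKETCH_STACK_DEPTH];
            struct stack_trace trace = {
                    .entries     = entries,
                    .max_entries = SKETCH_STACK_DEPTH,
            };

            save_stack_trace(&trace);
            return depot_save_stack(&trace, flags);
    }

    /* Later, e.g. while printing a report, expand the handle again. */
    static void print_saved_stack(depot_stack_handle_t handle)
    {
            struct stack_trace trace;

            depot_fetch_stack(handle, &trace);
            print_stack_trace(&trace, 0);
    }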

This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.

Joonsoo has said that he plans to reuse the stackdepot code for the
mm/page_owner.c debugging facility.

[a...@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
[aryabi...@virtuozzo.com: comment style fixes]
Signed-off-by: Alexander Potapenko 
Signed-off-by: Andrey Ryabinin 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrey Konovalov 
Cc: Dmitry Vyukov 
Cc: Steven Rostedt 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit cd11016e5f5212c13c0cec7384a525edc93b4921)
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/kernel/Makefile   |   1 +
 include/linux/stackdepot.h |  32 +
 lib/Kconfig|   4 +
 lib/Kconfig.kasan  |   1 +
 lib/Makefile   |   4 +
 lib/stackdepot.c   | 284 +
 mm/kasan/kasan.c   |  55 -
 mm/kasan/kasan.h   |  11 +-
 mm/kasan/report.c  |  12 +-
 9 files changed, 392 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/stackdepot.h
 create mode 100644 lib/stackdepot.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 2a23dc9eda7a..a6981d800222 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -25,6 +25,7 @@ OBJECT_FILES_NON_STANDARD_entry_$(BITS).o := y
 KASAN_SANITIZE_head$(BITS).o := n
 KASAN_SANITIZE_dumpstack.o := n
 KASAN_SANITIZE_dumpstack_$(BITS).o := n
+KASAN_SANITIZE_stacktrace.o := n
 
 # If instrumentation of this dir is enabled, boot hangs during first second.
 # Probably could be more selective here, but note that files related to irqs,
diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
new file mode 100644
index ..7978b3e2c1e1
--- /dev/null
+++ b/include/linux/stackdepot.h
@@ -0,0 +1,32 @@
+/*
+ * A generic stack depot implementation
+ *
+ * Author: Alexander Potapenko 
+ * Copyright (C) 2016 Google, Inc.
+ *
+ * Based on code by Dmitry Chernenkov.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _LINUX_STACKDEPOT_H
+#define _LINUX_STACKDEPOT_H
+
+typedef u32 depot_stack_handle_t;
+
+struct stack_trace;
+
+depot_stack_handle_t depot_save_stack(struct stack_trace *trace, gfp_t flags);
+
+void depot_fetch_stack(depot_stack_handle_t handle, struct stack_trace *trace);
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 932be006a0ed..4ac97849f562 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -492,4 +492,8 @@ config ARCH_HAS_MMIO_FLUSH
 config PARMAN
tristate
 
+config STACKDEPOT
+   bool
+   select STACKTRACE
+
 endmenu
diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index 6471d772c243..670504a50612 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -7,6 +7,7 @@ config KASAN
bool "KASan: runtime memory debugger"
depends on SLUB_DEBUG || (SLAB && !DEBUG_SLAB)
select CONSTRUCTORS
+   select STACKDEPOT if SLAB
help
  Enables kernel address sanitizer - runtime memory debugger,
  designed to find out-of-bounds accesses and use-after-free bugs.
diff --git a/lib/Makefile b/lib/Makefile
index c02b909c6239..cfe21bd255b4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -166,7 +166,11 @@ obj-$(CONFIG_SG_POOL) += sg_pool.o
 obj-$(CONFIG_STMP_DEVICE) += stmp_device.o
 obj-$(CONFIG_IRQ_POLL) += irq_poll.o
 
+obj-$(CONFIG_STACKDEPOT) += stackdepot.o
+KASAN_SANITIZE_stackdepot.o := n
+
 libfdt_files = fdt.o fdt_ro.o fdt_wip.o fdt_rw.o fdt_sw.o fdt_strerror.o
+
 $(foreach file, $(libfdt_files), \
$(eval CFLAGS_$(file) = -I$(src)/../scripts/dtc/l

[Devel] [PATCH rh7 25/39] kasan: remove the unnecessary WARN_ONCE from quarantine.c

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

It's quite unlikely that the user will have so little memory that the per-CPU
quarantines won't fit into the given fraction of the available memory.
Even in that case he won't be able to do anything with the information
given in the warning.

Link: http://lkml.kernel.org/r/1470929182-101413-1-git-send-email-gli...@google.com
Signed-off-by: Alexander Potapenko 
Acked-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Andrey Konovalov 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Kuthonuzo Luruo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit bcbf0d566b6e59a6e873bfe415cc415111a819e2)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/quarantine.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index b6728a33a4ac..baabaad4a4aa 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -217,11 +217,8 @@ void quarantine_reduce(void)
new_quarantine_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) /
QUARANTINE_FRACTION;
percpu_quarantines = QUARANTINE_PERCPU_SIZE * num_online_cpus();
-   if (WARN_ONCE(new_quarantine_size < percpu_quarantines,
-   "Too little memory, disabling global KASAN quarantine.\n"))
-   new_quarantine_size = 0;
-   else
-   new_quarantine_size -= percpu_quarantines;
+   new_quarantine_size = (new_quarantine_size < percpu_quarantines) ?
+   0 : new_quarantine_size - percpu_quarantines;
WRITE_ONCE(quarantine_size, new_quarantine_size);
 
last = global_quarantine.head;
-- 
2.13.5



[Devel] [PATCH rh7 16/39] mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

For KASAN builds:
 - switch SLUB allocator to using stackdepot instead of storing the
   allocation/deallocation stacks in the objects;
 - change the freelist hook so that parts of the freelist can be put
   into the quarantine (a conceptual sketch follows below).
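
The freelist-hook change can be pictured as a filter over a detached
freelist: every object is offered to kasan_slab_free(), and only the
objects KASAN declines to take are handed back to the allocator.  The
sketch below conveys the idea only and is not the actual mm/slub.c code;
get_freepointer()/set_freepointer() stand in for SLUB's internal
freelist accessors:

    /*
     * Conceptual sketch: kasan_slab_free() returning true means the
     * object went into the quarantine and must be kept away from the
     * allocator.  The relinked order is not preserved here.
     */
    static void *quarantine_filter_freelist(struct kmem_cache *s, void *head)
    {
            void *survivors = NULL;
            void *object = head;

            while (object) {
                    void *next = get_freepointer(s, object);

                    if (!kasan_slab_free(s, object)) {
                            set_freepointer(s, object, survivors);
                            survivors = object;
                    }
                    object = next;
            }
            return survivors;       /* what actually gets freed now */
    }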

[aryabi...@virtuozzo.com: fixes]
  Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabi...@virtuozzo.com
Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-gli...@google.com
Signed-off-by: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Christoph Lameter 
Cc: Dmitry Vyukov 
Cc: Steven Rostedt (Red Hat) 
Cc: Joonsoo Kim 
Cc: Kostya Serebryany 
Cc: Andrey Ryabinin 
Cc: Kuthonuzo Luruo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 80a9201a5965f4715d5c09790862e0df84ce0614)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h|  4 +++
 include/linux/slab_def.h |  3 ++-
 include/linux/slub_def.h |  4 +++
 lib/Kconfig.kasan|  4 +--
 mm/kasan/Makefile|  4 +--
 mm/kasan/kasan.c | 63 
 mm/kasan/kasan.h |  3 +--
 mm/kasan/report.c|  8 +++---
 mm/slub.c| 60 +++--
 9 files changed, 96 insertions(+), 57 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 9ab426991c4e..1122a7ff724b 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -66,6 +66,8 @@ struct kasan_cache {
 int kasan_module_alloc(void *addr, size_t size);
 void kasan_free_shadow(const struct vm_struct *vm);
 
+size_t kasan_metadata_size(struct kmem_cache *cache);
+
 #else /* CONFIG_KASAN */
 
 static inline void kasan_unpoison_shadow(const void *address, size_t size) {}
@@ -107,6 +109,8 @@ static inline void kasan_poison_slab_free(struct kmem_cache 
*s, void *object) {}
 static inline int kasan_module_alloc(void *addr, size_t size) { return 0; }
 static inline void kasan_free_shadow(const struct vm_struct *vm) {}
 
+static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; 
}
+
 #endif /* CONFIG_KASAN */
 
 #endif /* LINUX_KASAN_H */
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 13c72b34c6f4..b2e694e3db4d 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -94,7 +94,8 @@ struct kmem_cache {
 };
 
 static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
-   void *x) {
+   void *x)
+{
void *object = x - (x - page->s_mem) % cache->size;
void *last_object = page->s_mem + (cache->num - 1) * cache->size;
 
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 7188ba07139e..919acd6ed29d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -98,6 +98,10 @@ struct kmem_cache {
 */
int remote_node_defrag_ratio;
 #endif
+#ifdef CONFIG_KASAN
+   struct kasan_cache kasan_info;
+#endif
+
struct kmem_cache_node *node[MAX_NUMNODES];
 };
 
diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index 670504a50612..da48f37ad788 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -5,9 +5,9 @@ if HAVE_ARCH_KASAN
 
 config KASAN
bool "KASan: runtime memory debugger"
-   depends on SLUB_DEBUG || (SLAB && !DEBUG_SLAB)
+   depends on SLUB || (SLAB && !DEBUG_SLAB)
select CONSTRUCTORS
-   select STACKDEPOT if SLAB
+   select STACKDEPOT
help
  Enables kernel address sanitizer - runtime memory debugger,
  designed to find out-of-bounds accesses and use-after-free bugs.
diff --git a/mm/kasan/Makefile b/mm/kasan/Makefile
index 7096981108a6..ac9cc9665e57 100644
--- a/mm/kasan/Makefile
+++ b/mm/kasan/Makefile
@@ -7,6 +7,4 @@ CFLAGS_REMOVE_kasan.o = -pg
 # see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63533
 CFLAGS_kasan.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
 
-obj-y := kasan.o report.o
-obj-$(CONFIG_SLAB) += quarantine.o
-
+obj-y := kasan.o report.o quarantine.o
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 014897fe6f06..8a57f22560a4 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -312,7 +312,6 @@ void kasan_free_pages(struct page *page, unsigned int order)
KASAN_FREE_PAGE);
 }
 
-#ifdef CONFIG_SLAB
 /*
  * Adaptive redzone policy taken from the userspace AddressSanitizer runtime.
  * For larger allocations larger redzones are used.
@@ -334,16 +333,8 @@ void kasan_cache_create(struct kmem_cache *cache, size_t 
*size,
unsigned long *flags)
 {
int redzone_adjust;
-   /* Make sure the adjusted size is still less than
-* KMALLOC_MAX_CACHE_SIZE.
-* TODO: this check is only useful for SLAB, but not SLUB. We'll need
-* to skip it for SLUB when it starts using kasan

[Devel] [PATCH rh7 27/39] kcov: do not instrument lib/stackdepot.c

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

There's no point in collecting coverage from lib/stackdepot.c, as it is
not a function of syscall inputs.  Disabling kcov instrumentation for that
file will reduce the coverage noise level.

Link: http://lkml.kernel.org/r/1474640972-104131-1-git-send-email-gli...@google.com
Signed-off-by: Alexander Potapenko 
Acked-by: Dmitry Vyukov 
Cc: Kostya Serebryany 
Cc: Andrey Konovalov 
Cc: syzkaller 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 65deb8af76defeae4b114a75242ed15b0bcba173)
Signed-off-by: Andrey Ryabinin 
---
 lib/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/Makefile b/lib/Makefile
index cfe21bd255b4..9b8233e61ee0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -168,6 +168,7 @@ obj-$(CONFIG_IRQ_POLL) += irq_poll.o
 
 obj-$(CONFIG_STACKDEPOT) += stackdepot.o
 KASAN_SANITIZE_stackdepot.o := n
+KCOV_INSTRUMENT_stackdepot.o := n
 
 libfdt_files = fdt.o fdt_ro.o fdt_wip.o fdt_rw.o fdt_sw.o fdt_strerror.o
 
-- 
2.13.5



[Devel] [PATCH rh7 01/39] kasan: show gcc version requirements in Kconfig and Documentation

2017-09-14 Thread Andrey Ryabinin
From: Joe Perches 

The documentation shows a need for gcc > 4.9.2, but it's really >=.  The
Kconfig entries don't show the required versions, so add them.  Correct a
latter/later typo too.  Also mention that gcc 5 is required to catch
out-of-bounds accesses to global and stack variables.

Signed-off-by: Joe Perches 
Signed-off-by: Andrey Ryabinin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 01e76903f655a4d88c2e09d3182436c65f6e1213)
Signed-off-by: Andrey Ryabinin 
---
 Documentation/kasan.txt | 8 +---
 lib/Kconfig.kasan   | 8 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt
index ee36ef1a64c0..67e62ed6a198 100644
--- a/Documentation/kasan.txt
+++ b/Documentation/kasan.txt
@@ -9,7 +9,9 @@ a fast and comprehensive solution for finding use-after-free 
and out-of-bounds
 bugs.
 
 KASan uses compile-time instrumentation for checking every memory access,
-therefore you will need a certain version of GCC > 4.9.2
+therefore you will need a gcc version of 4.9.2 or later. KASan could detect out
+of bounds accesses to stack or global variables, but only if gcc 5.0 or later 
was
+used to built the kernel.
 
 Currently KASan is supported only for x86_64 architecture and requires that the
 kernel be built with the SLUB allocator.
@@ -23,8 +25,8 @@ To enable KASAN configure kernel with:
 
 and choose between CONFIG_KASAN_OUTLINE and CONFIG_KASAN_INLINE. Outline/inline
 is compiler instrumentation types. The former produces smaller binary the
-latter is 1.1 - 2 times faster. Inline instrumentation requires GCC 5.0 or
-latter.
+latter is 1.1 - 2 times faster. Inline instrumentation requires a gcc version
+of 5.0 or later.
 
 Currently KASAN works only with the SLUB memory allocator.
 For better bug detection and nicer report and enable CONFIG_STACKTRACE.
diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index 4fecaedc80a2..777eda7d1ab4 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -10,8 +10,11 @@ config KASAN
help
  Enables kernel address sanitizer - runtime memory debugger,
  designed to find out-of-bounds accesses and use-after-free bugs.
- This is strictly debugging feature. It consumes about 1/8
- of available memory and brings about ~x3 performance slowdown.
+ This is strictly a debugging feature and it requires a gcc version
+ of 4.9.2 or later. Detection of out of bounds accesses to stack or
+ global variables requires gcc 5.0 or later.
+ This feature consumes about 1/8 of available memory and brings about
+ ~x3 performance slowdown.
  For better error detection enable CONFIG_STACKTRACE,
  and add slub_debug=U to boot cmdline.
 
@@ -40,6 +43,7 @@ config KASAN_INLINE
  memory accesses. This is faster than outline (in some workloads
  it gives about x2 boost over outline instrumentation), but
  make kernel's .text size much bigger.
+ This requires a gcc version of 5.0 or later.
 
 endchoice
 
-- 
2.13.5



[Devel] [PATCH rh7 15/39] kasan/quarantine: fix bugs on qlist_move_cache()

2017-09-14 Thread Andrey Ryabinin
From: Joonsoo Kim 

There are two bugs in qlist_move_cache().  One is that the qlist's tail
isn't set properly: curr->next can be NULL since it is a singly linked
list, and a NULL tail is invalid if there is an item on the qlist.
The other is that if the cache matches, qlist_put() is called and it
sets curr->next to NULL, which stops the loop prematurely.

These problems come from the complicated implementation, so I'd like to
re-implement it completely.  The implementation in this patch is really
simple: iterate over all qlist_nodes and put each one on the appropriate
list.

Unfortunately, I hit this bug some time ago and lost the oops message.  But
the bug looks trivial, so there is no need to attach the oops.

Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/1467766348-22419-1-git-send-email-iamjoonsoo@lge.com
Signed-off-by: Joonsoo Kim 
Reviewed-by: Dmitry Vyukov 
Acked-by: Andrey Ryabinin 
Acked-by: Alexander Potapenko 
Cc: Kuthonuzo Luruo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 0ab686d8c8303069e80300663b3be6201a8697fb)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/quarantine.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index 4973505a9bdd..65793f150d1f 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -238,30 +238,23 @@ static void qlist_move_cache(struct qlist_head *from,
   struct qlist_head *to,
   struct kmem_cache *cache)
 {
-   struct qlist_node *prev = NULL, *curr;
+   struct qlist_node *curr;
 
if (unlikely(qlist_empty(from)))
return;
 
curr = from->head;
+   qlist_init(from);
while (curr) {
-   struct qlist_node *qlink = curr;
-   struct kmem_cache *obj_cache = qlink_to_cache(qlink);
-
-   if (obj_cache == cache) {
-   if (unlikely(from->head == qlink)) {
-   from->head = curr->next;
-   prev = curr;
-   } else
-   prev->next = curr->next;
-   if (unlikely(from->tail == qlink))
-   from->tail = curr->next;
-   from->bytes -= cache->size;
-   qlist_put(to, qlink, cache->size);
-   } else {
-   prev = curr;
-   }
-   curr = curr->next;
+   struct qlist_node *next = curr->next;
+   struct kmem_cache *obj_cache = qlink_to_cache(curr);
+
+   if (obj_cache == cache)
+   qlist_put(to, curr, obj_cache->size);
+   else
+   qlist_put(from, curr, obj_cache->size);
+
+   curr = next;
}
 }
 
-- 
2.13.5



[Devel] [PATCH rh7 26/39] mm, mempolicy: task->mempolicy must be NULL before dropping final reference

2017-09-14 Thread Andrey Ryabinin
From: David Rientjes 

KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions.  It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:

BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr 88010b48102c
CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
...
Call Trace:
dump_stack
kasan_object_err
kasan_report_error
__asan_report_load2_noabort
alloc_pages_current <-- use after free
depot_save_stack
save_stack
kasan_slab_free
kmem_cache_free
__mpol_put  <-- free
do_exit

This patch sets current->mempolicy to NULL before dropping the final
reference.

Link: http://lkml.kernel.org/r/alpine.deb.2.10.1608301442180.63...@chino.kir.corp.google.com
Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes 
Reported-by: Vegard Nossum 
Acked-by: Andrey Ryabinin 
Cc: Alexander Potapenko 
Cc: Dmitry Vyukov 
Cc: [4.6+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit c11600e4fed67ae4cd6a8096936afd445410e8ed)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/mempolicy.h |  4 
 kernel/exit.c |  7 +--
 mm/mempolicy.c| 17 +
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 7f26526c488b..7e47465520f4 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -196,6 +196,7 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 }
 
 extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned 
long);
+extern void mpol_put_task_policy(struct task_struct *);
 
 #else
 
@@ -320,5 +321,8 @@ static inline int mpol_misplaced(struct page *page, struct 
vm_area_struct *vma,
return -1; /* no node preference */
 }
 
+static inline void mpol_put_task_policy(struct task_struct *task)
+{
+}
 #endif /* CONFIG_NUMA */
 #endif
diff --git a/kernel/exit.c b/kernel/exit.c
index 668cacf375d2..32b7ba21d203 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -866,12 +866,7 @@ void do_exit(long code)
ptrace_put_breakpoints(tsk);
 
exit_notify(tsk, group_dead);
-#ifdef CONFIG_NUMA
-   task_lock(tsk);
-   mpol_put(tsk->mempolicy);
-   tsk->mempolicy = NULL;
-   task_unlock(tsk);
-#endif
+   mpol_put_task_policy(tsk);
 #ifdef CONFIG_FUTEX
if (unlikely(current->pi_state_cache))
kfree(current->pi_state_cache);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9b7800695b72..a2e2422f63c7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2377,6 +2377,23 @@ out:
return ret;
 }
 
+/*
+ * Drop the (possibly final) reference to task->mempolicy.  It needs to be
+ * dropped after task->mempolicy is set to NULL so that any allocation done as
+ * part of its kmem_cache_free(), such as by KASAN, doesn't reference a freed
+ * policy.
+ */
+void mpol_put_task_policy(struct task_struct *task)
+{
+   struct mempolicy *pol;
+
+   task_lock(task);
+   pol = task->mempolicy;
+   task->mempolicy = NULL;
+   task_unlock(task);
+   mpol_put(pol);
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
-- 
2.13.5



[Devel] [PATCH rh7 30/39] kasan: eliminate long stalls during quarantine reduction

2017-09-14 Thread Andrey Ryabinin
From: Dmitry Vyukov 

Currently we dedicate 1/32 of RAM for quarantine and then reduce it by
1/4 of total quarantine size.  This can be a significant amount of
memory.  For example, with 4GB of RAM total quarantine size is 128MB and
it is reduced by 32MB at a time.  With 128GB of RAM total quarantine
size is 4GB and it is reduced by 1GB.  This leads to several problems:

 - freeing 1GB can take tens of seconds, causes rcu stall warnings and
   just introduces unexpected long delays at random places
 - if kmalloc() is called under a mutex, other threads stall on that
   mutex while a thread reduces quarantine
 - threads wait on quarantine_lock while one thread grabs a large batch
   of objects to evict
 - we walk the uncached list of objects to free twice, which makes all of
   the above worse
 - when a thread frees objects, they are already not accounted against
   global_quarantine.bytes; as the result we can have quarantine_size
   bytes in quarantine + unbounded amount of memory in large batches in
   threads that are in process of freeing

Reduce the size of the quarantine in smaller batches to reduce the delays.
The only reason to reduce it in batches is amortization of overheads; the
new batch size of 1MB should be enough to amortize spinlock
lock/unlock and a few function calls.

Plus, organize the quarantine as a FIFO array of batches.  This allows us
not to walk the list in quarantine_reduce() under quarantine_lock, which in
turn reduces contention and is just faster.
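
Schematically, the reworked global queue is a fixed ring of batch heads
that is drained one batch per call, so the expensive freeing happens
outside the spinlock.  The sketch below is simplified from the patch
(sizing, accounting and the per-cpu feed are omitted) and reuses the
qlist helpers that already live in mm/kasan/quarantine.c:

    #define NR_BATCHES_SKETCH 1024          /* QUARANTINE_BATCHES */

    /* quarantine_put() fills batches[batch_tail] and advances the tail
     * once a batch reaches roughly 1MB; reduction drains from the head. */
    static struct qlist_head batches[NR_BATCHES_SKETCH];
    static int batch_head, batch_tail;

    static void reduce_one_batch_sketch(void)
    {
            struct qlist_head to_free = QLIST_INIT;
            unsigned long flags;

            spin_lock_irqsave(&quarantine_lock, flags);
            /* Detach a single batch; freeing happens with the lock dropped. */
            qlist_move_all(&batches[batch_head], &to_free);
            if (++batch_head == NR_BATCHES_SKETCH)
                    batch_head = 0;
            spin_unlock_irqrestore(&quarantine_lock, flags);

            qlist_free_all(&to_free, NULL); /* the slow part, now unlocked */
    }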

This improves performance of heavy load (syzkaller fuzzing) by ~20% with
4 CPUs and 32GB of RAM.  Also this eliminates frequent (every 5 sec)
drops of CPU consumption from ~400% to ~100% (one thread reduces
quarantine while others are waiting on a mutex).

Some reference numbers:
1. Machine with 4 CPUs and 4GB of memory. Quarantine size 128MB.
   Currently we free 32MB at a time.
   With new code we free 1MB at a time (1024 batches, ~128 are used).
2. Machine with 32 CPUs and 128GB of memory. Quarantine size 4GB.
   Currently we free 1GB at a time.
   With new code we free 8MB at a time (1024 batches, ~512 are used).
3. Machine with 4096 CPUs and 1TB of memory. Quarantine size 32GB.
   Currently we free 8GB at a time.
   With new code we free 4MB at a time (16K batches, ~8K are used).

Link: http://lkml.kernel.org/r/1478756952-18695-1-git-send-email-dvyu...@google.com
Signed-off-by: Dmitry Vyukov 
Cc: Eric Dumazet 
Cc: Greg Thelen 
Cc: Alexander Potapenko 
Cc: Andrey Ryabinin 
Cc: Andrey Konovalov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 64abdcb24351a27bed6e2b6a3c27348fe532c73f)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/quarantine.c | 94 ++-
 1 file changed, 48 insertions(+), 46 deletions(-)

diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index baabaad4a4aa..dae929c02bbb 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -86,24 +86,9 @@ static void qlist_move_all(struct qlist_head *from, struct 
qlist_head *to)
qlist_init(from);
 }
 
-static void qlist_move(struct qlist_head *from, struct qlist_node *last,
-   struct qlist_head *to, size_t size)
-{
-   if (unlikely(last == from->tail)) {
-   qlist_move_all(from, to);
-   return;
-   }
-   if (qlist_empty(to))
-   to->head = from->head;
-   else
-   to->tail->next = from->head;
-   to->tail = last;
-   from->head = last->next;
-   last->next = NULL;
-   from->bytes -= size;
-   to->bytes += size;
-}
-
+#define QUARANTINE_PERCPU_SIZE (1 << 20)
+#define QUARANTINE_BATCHES \
+   (1024 > 4 * CONFIG_NR_CPUS ? 1024 : 4 * CONFIG_NR_CPUS)
 
 /*
  * The object quarantine consists of per-cpu queues and a global queue,
@@ -111,11 +96,22 @@ static void qlist_move(struct qlist_head *from, struct 
qlist_node *last,
  */
 static DEFINE_PER_CPU(struct qlist_head, cpu_quarantine);
 
-static struct qlist_head global_quarantine;
+/* Round-robin FIFO array of batches. */
+static struct qlist_head global_quarantine[QUARANTINE_BATCHES];
+static int quarantine_head;
+static int quarantine_tail;
+/* Total size of all objects in global_quarantine across all batches. */
+static unsigned long quarantine_size;
 static DEFINE_SPINLOCK(quarantine_lock);
 
 /* Maximum size of the global queue. */
-static unsigned long quarantine_size;
+static unsigned long quarantine_max_size;
+
+/*
+ * Target size of a batch in global_quarantine.
+ * Usually equal to QUARANTINE_PERCPU_SIZE unless we have too much RAM.
+ */
+static unsigned long quarantine_batch_size;
 
 /*
  * The fraction of physical memory the quarantine is allowed to occupy.
@@ -124,9 +120,6 @@ static unsigned long quarantine_size;
  */
 #define QUARANTINE_FRACTION 32
 
-#define QUARANTINE_LOW_SIZE (READ_ONCE(quarantine_size) * 3 / 4)
-#define QUAR

[Devel] [PATCH rh7 21/39] mm/kasan: get rid of ->state in struct kasan_alloc_meta

2017-09-14 Thread Andrey Ryabinin
The state of an object is currently tracked in two places - shadow memory,
and the ->state field in struct kasan_alloc_meta.  We can get rid of the
latter.  This will save us a little bit of memory.  Also, this allows us
to move the free stack into struct kasan_alloc_meta without increasing
memory consumption.  So now we should always know when the object was
last freed.  This may be useful for long-delayed use-after-free bugs.

As a side effect this fixes the following UBSAN warning:
UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
member access within misaligned address 88000d1efebc for type 'struct qlist_node'
which requires 8 byte alignment
Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabi...@virtuozzo.com
Reported-by: kernel test robot 
Signed-off-by: Andrey Ryabinin 
Cc: Alexander Potapenko 
Cc: Dmitry Vyukov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit b3cbd9bf77cd1888114dbee1653e79aa23fd4068)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h |  3 +++
 mm/kasan/kasan.c  | 61 +++
 mm/kasan/kasan.h  | 12 ++
 mm/kasan/quarantine.c |  2 --
 mm/kasan/report.c | 23 +--
 mm/slab.c |  1 +
 mm/slub.c |  2 ++
 7 files changed, 41 insertions(+), 63 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 1122a7ff724b..536a400d1d39 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -46,6 +46,7 @@ void kasan_cache_destroy(struct kmem_cache *cache);
 void kasan_poison_slab(struct page *page);
 void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
 void kasan_poison_object_data(struct kmem_cache *cache, void *object);
+void kasan_init_slab_obj(struct kmem_cache *cache, const void *object);
 
 void kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags);
 void kasan_kfree_large(const void *ptr);
@@ -89,6 +90,8 @@ static inline void kasan_unpoison_object_data(struct 
kmem_cache *cache,
void *object) {}
 static inline void kasan_poison_object_data(struct kmem_cache *cache,
void *object) {}
+static inline void kasan_init_slab_obj(struct kmem_cache *cache,
+   const void *object) {}
 
 static inline void kasan_kmalloc_large(void *ptr, size_t size, gfp_t flags) {}
 static inline void kasan_kfree_large(const void *ptr) {}
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index a8d3e087dad3..7fa1643e83df 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -403,11 +403,6 @@ void kasan_poison_object_data(struct kmem_cache *cache, 
void *object)
kasan_poison_shadow(object,
round_up(cache->object_size, KASAN_SHADOW_SCALE_SIZE),
KASAN_KMALLOC_REDZONE);
-   if (cache->flags & SLAB_KASAN) {
-   struct kasan_alloc_meta *alloc_info =
-   get_alloc_info(cache, object);
-   alloc_info->state = KASAN_STATE_INIT;
-   }
 }
 
 static inline int in_irqentry_text(unsigned long ptr)
@@ -471,6 +466,17 @@ struct kasan_free_meta *get_free_info(struct kmem_cache 
*cache,
return (void *)object + cache->kasan_info.free_meta_offset;
 }
 
+void kasan_init_slab_obj(struct kmem_cache *cache, const void *object)
+{
+   struct kasan_alloc_meta *alloc_info;
+
+   if (!(cache->flags & SLAB_KASAN))
+   return;
+
+   alloc_info = get_alloc_info(cache, object);
+   __memset(alloc_info, 0, sizeof(*alloc_info));
+}
+
 void kasan_slab_alloc(struct kmem_cache *cache, void *object, gfp_t flags)
 {
kasan_kmalloc(cache, object, cache->object_size, flags);
@@ -490,34 +496,27 @@ void kasan_poison_slab_free(struct kmem_cache *cache, 
void *object)
 
 bool kasan_slab_free(struct kmem_cache *cache, void *object)
 {
+   s8 shadow_byte;
+
/* RCU slabs could be legally used after free within the RCU period */
if (unlikely(cache->flags & SLAB_DESTROY_BY_RCU))
return false;
 
-   if (likely(cache->flags & SLAB_KASAN)) {
-   struct kasan_alloc_meta *alloc_info;
-   struct kasan_free_meta *free_info;
+   shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object));
+   if (shadow_byte < 0 || shadow_byte >= KASAN_SHADOW_SCALE_SIZE) {
+   pr_err("Double free");
+   dump_stack();
+   return true;
+   }
 
-   alloc_info = get_alloc_info(cache, object);
-   free_info = get_free_info(cache, object);
+   kasan_poison_slab_free(cache, object);
 
-   switch (alloc_info->state) {
-   case KASAN_STATE_A

[Devel] [PATCH rh7 22/39] kasan: improve double-free reports

2017-09-14 Thread Andrey Ryabinin
Currently we just dump the stack in case of a double-free bug.
Let's dump all the info about the object that we have.

[aryabi...@virtuozzo.com: change double free message per Alexander]
  Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabi...@virtuozzo.com
Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabi...@virtuozzo.com
Signed-off-by: Andrey Ryabinin 
Cc: Alexander Potapenko 
Cc: Dmitry Vyukov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 7e088978933ee186533355ae03a9dc1de99cf6c7)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/kasan.c  |  3 +--
 mm/kasan/kasan.h  |  2 ++
 mm/kasan/report.c | 51 ++-
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 7fa1643e83df..8f350a2edcb6 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -504,8 +504,7 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object)
 
shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object));
if (shadow_byte < 0 || shadow_byte >= KASAN_SHADOW_SCALE_SIZE) {
-   pr_err("Double free");
-   dump_stack();
+   kasan_report_double_free(cache, object, shadow_byte);
return true;
}
 
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index e4c0e91524b1..ddce58734098 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -100,6 +100,8 @@ static inline bool kasan_enabled(void)
 
 void kasan_report(unsigned long addr, size_t size,
bool is_write, unsigned long ip);
+void kasan_report_double_free(struct kmem_cache *cache, void *object,
+   s8 shadow);
 
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB)
 void quarantine_put(struct kasan_free_meta *info, struct kmem_cache *cache);
diff --git a/mm/kasan/report.c b/mm/kasan/report.c
index 94bb359fd0f3..cbd7f6e50cc1 100644
--- a/mm/kasan/report.c
+++ b/mm/kasan/report.c
@@ -98,6 +98,26 @@ static inline bool init_task_stack_addr(const void *addr)
sizeof(init_thread_union.stack));
 }
 
+static DEFINE_SPINLOCK(report_lock);
+
+static void kasan_start_report(unsigned long *flags)
+{
+   /*
+* Make sure we don't end up in loop.
+*/
+   kasan_disable_current();
+   spin_lock_irqsave(&report_lock, *flags);
+   
pr_err("==\n");
+}
+
+static void kasan_end_report(unsigned long *flags)
+{
+   
pr_err("==\n");
+   add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+   spin_unlock_irqrestore(&report_lock, *flags);
+   kasan_enable_current();
+}
+
 static void print_track(struct kasan_track *track)
 {
pr_err("PID = %u\n", track->pid);
@@ -111,8 +131,7 @@ static void print_track(struct kasan_track *track)
}
 }
 
-static void kasan_object_err(struct kmem_cache *cache, struct page *page,
-   void *object, char *unused_reason)
+static void kasan_object_err(struct kmem_cache *cache, void *object)
 {
struct kasan_alloc_meta *alloc_info = get_alloc_info(cache, object);
 
@@ -129,6 +148,18 @@ static void kasan_object_err(struct kmem_cache *cache, 
struct page *page,
print_track(&alloc_info->free_track);
 }
 
+void kasan_report_double_free(struct kmem_cache *cache, void *object,
+   s8 shadow)
+{
+   unsigned long flags;
+
+   kasan_start_report(&flags);
+   pr_err("BUG: Double free or freeing an invalid pointer\n");
+   pr_err("Unexpected shadow byte: 0x%hhX\n", shadow);
+   kasan_object_err(cache, object);
+   kasan_end_report(&flags);
+}
+
 static void print_address_description(struct kasan_access_info *info)
 {
const void *addr = info->access_addr;
@@ -142,8 +173,7 @@ static void print_address_description(struct 
kasan_access_info *info)
struct kmem_cache *cache = page->slab_cache;
object = nearest_obj(cache, page,
(void *)info->access_addr);
-   kasan_object_err(cache, page, object,
-   "kasan: bad access detected");
+   kasan_object_err(cache, object);
return;
}
dump_page(page, "kasan: bad access detected");
@@ -204,16 +234,13 @@ static void print_shadow_for_address(const void *addr)
}
 }
 
-static DEFINE_SPINLOCK(report_lock);
-
 static void kasan_report_error(struct kasan_access_info *info)
 {
unsigned long flags;
const char *bug_type;
 
-   spin_lock_irqsave(&report_lock, flags);
-

[Devel] [PATCH rh7 07/39] arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces that need to be stored.

Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>.  Also introduce the
__softirq_entry macro, which is similar to __irq_entry but puts the
corresponding functions into the .softirqentry.text section.
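
As an illustration, annotating a (hypothetical) softirq handler with the
new macro places its body in .softirqentry.text, which is what lets
KASAN cut the saved stack at the interrupt entry point:

    #include <linux/interrupt.h>  /* now provides __irq_entry/__softirq_entry */

    /* Hypothetical handler, shown only to demonstrate the annotation. */
    static void __softirq_entry example_softirq_action(struct softirq_action *a)
    {
            /* ... handler body runs from .softirqentry.text ... */
    }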

Signed-off-by: Alexander Potapenko 
Acked-by: Steven Rostedt 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrey Konovalov 
Cc: Dmitry Vyukov 
Cc: Andrey Ryabinin 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit be7635e7287e0e8013af3c89a6354a9e0182594c)
Signed-off-by: Andrey Ryabinin 
---
 arch/arm/include/asm/exception.h |  2 +-
 arch/arm/kernel/vmlinux.lds.S|  1 +
 arch/arm64/kernel/vmlinux.lds.S  |  1 +
 arch/blackfin/kernel/vmlinux.lds.S   |  1 +
 arch/c6x/kernel/vmlinux.lds.S|  1 +
 arch/metag/kernel/vmlinux.lds.S  |  1 +
 arch/microblaze/kernel/vmlinux.lds.S |  1 +
 arch/mips/kernel/vmlinux.lds.S   |  1 +
 arch/openrisc/kernel/vmlinux.lds.S   |  1 +
 arch/parisc/kernel/vmlinux.lds.S |  1 +
 arch/powerpc/kernel/vmlinux.lds.S|  1 +
 arch/s390/kernel/vmlinux.lds.S   |  1 +
 arch/sh/kernel/vmlinux.lds.S |  1 +
 arch/sparc/kernel/vmlinux.lds.S  |  1 +
 arch/tile/kernel/vmlinux.lds.S   |  1 +
 arch/x86/kernel/vmlinux.lds.S|  1 +
 include/asm-generic/vmlinux.lds.h| 12 +++-
 include/linux/ftrace.h   | 11 ---
 include/linux/interrupt.h| 20 
 kernel/softirq.c |  2 +-
 kernel/trace/trace_functions_graph.c |  1 +
 21 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/arch/arm/include/asm/exception.h b/arch/arm/include/asm/exception.h
index 5abaf5bbd985..bf1991263d2d 100644
--- a/arch/arm/include/asm/exception.h
+++ b/arch/arm/include/asm/exception.h
@@ -7,7 +7,7 @@
 #ifndef __ASM_ARM_EXCEPTION_H
 #define __ASM_ARM_EXCEPTION_H
 
-#include <linux/ftrace.h>
+#include <linux/interrupt.h>
 
 #define __exception __attribute__((section(".exception.text")))
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 33f2ea32f5a0..b3428ce67bd0 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -100,6 +100,7 @@ SECTIONS
*(.exception.text)
__exception_text_end = .;
IRQENTRY_TEXT
+   SOFTIRQENTRY_TEXT
TEXT_TEXT
SCHED_TEXT
LOCK_TEXT
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 3fae2be8b016..96b19d8d264d 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -46,6 +46,7 @@ SECTIONS
*(.exception.text)
__exception_text_end = .;
IRQENTRY_TEXT
+   SOFTIRQENTRY_TEXT
TEXT_TEXT
SCHED_TEXT
LOCK_TEXT
diff --git a/arch/blackfin/kernel/vmlinux.lds.S 
b/arch/blackfin/kernel/vmlinux.lds.S
index ba35864b2b74..f7f4c3ae3f3e 100644
--- a/arch/blackfin/kernel/vmlinux.lds.S
+++ b/arch/blackfin/kernel/vmlinux.lds.S
@@ -35,6 +35,7 @@ SECTIONS
 #endif
LOCK_TEXT
IRQENTRY_TEXT
+   SOFTIRQENTRY_TEXT
KPROBES_TEXT
 #ifdef CONFIG_ROMKERNEL
__sinittext = .;
diff --git a/arch/c6x/kernel/vmlinux.lds.S b/arch/c6x/kernel/vmlinux.lds.S
index 1d81c4c129ec..5a05a725331f 100644
--- a/arch/c6x/kernel/vmlinux.lds.S
+++ b/arch/c6x/kernel/vmlinux.lds.S
@@ -78,6 +78,7 @@ SECTIONS
SCHED_TEXT
LOCK_TEXT
IRQENTRY_TEXT
+   SOFTIRQENTRY_TEXT
KPROBES_TEXT
*(.fixup)
*(.gnu.warning)
diff --git a/arch/metag/kernel/vmlinux.lds.S b/arch/metag/kernel/vmlinux.lds.S
index e12055e88bfe..150ace92c7ad 100644
--- a/arch/metag/kernel/vmlinux.lds.S
+++ b/arch/metag/kernel/vmlinux.lds.S
@@ -24,6 +24,7 @@ SECTIONS
LOCK_TEXT
KPROBES_TEXT
IRQENTRY_TEXT
+   SOFTIRQENTRY_TEXT
*(.text.*)
*(.gnu.warning)
}
diff --git a/arch/microblaze/kernel/vmlinux.lds.S 
b/arch/microblaze/kernel/vmlinux.lds.S
index 936d01a689d7..f8ee75888d9c 100644
--- a/arch/microblaze/kernel/vmlinux.lds.S
+++ b/arch/microblaze/kernel/vmlinux.lds.S
@@ -36,6 +36,7 @@ SECTIONS {
LOCK_TEXT
KPROBES_TEXT
IRQENTRY_TEXT
+   SOFTIRQENTRY_TEXT
. = ALIGN (4) ;
_etext = . ;
}
di

[Devel] [PATCH rh7 23/39] kasan: avoid overflowing quarantine size on low memory systems

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

If the total amount of memory assigned to quarantine is less than the
amount of memory assigned to per-cpu quarantines, |new_quarantine_size|
may overflow.  Instead, set it to zero.

[a...@linux-foundation.org: cleanup: use WARN_ONCE return value]
Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-gli...@google.com
Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
Signed-off-by: Alexander Potapenko 
Reported-by: Dmitry Vyukov 
Cc: Andrey Ryabinin 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit c3cee372282cb6bcdf19ac1457581d5dd5ecb554)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/quarantine.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index 7fd121d13b88..b6728a33a4ac 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -198,7 +198,7 @@ void quarantine_put(struct kasan_free_meta *info, struct 
kmem_cache *cache)
 
 void quarantine_reduce(void)
 {
-   size_t new_quarantine_size;
+   size_t new_quarantine_size, percpu_quarantines;
unsigned long flags;
struct qlist_head to_free = QLIST_INIT;
size_t size_to_free = 0;
@@ -216,7 +216,12 @@ void quarantine_reduce(void)
 */
new_quarantine_size = (READ_ONCE(totalram_pages) << PAGE_SHIFT) /
QUARANTINE_FRACTION;
-   new_quarantine_size -= QUARANTINE_PERCPU_SIZE * num_online_cpus();
+   percpu_quarantines = QUARANTINE_PERCPU_SIZE * num_online_cpus();
+   if (WARN_ONCE(new_quarantine_size < percpu_quarantines,
+   "Too little memory, disabling global KASAN quarantine.\n"))
+   new_quarantine_size = 0;
+   else
+   new_quarantine_size -= percpu_quarantines;
WRITE_ONCE(quarantine_size, new_quarantine_size);
 
last = global_quarantine.head;
-- 
2.13.5



[Devel] [PATCH rh7 18/39] mm/kasan: fix corruptions and false positive reports

2017-09-14 Thread Andrey Ryabinin
Once an object is put into the quarantine, we no longer own it, i.e. the
object could leave the quarantine and be reallocated.  So having the
set_track() call after quarantine_put() may corrupt slab objects.

 BUG kmalloc-4096 (Not tainted): Poison overwritten
 -
 Disabling lock debugging due to kernel taint
 INFO: 0x8804540de850-0x8804540de857. First byte 0xb5 instead of 0x6b
...
 INFO: Freed in qlist_free_all+0x42/0x100 age=75 cpu=3 pid=24492
  __slab_free+0x1d6/0x2e0
  ___cache_free+0xb6/0xd0
  qlist_free_all+0x83/0x100
  quarantine_reduce+0x177/0x1b0
  kasan_kmalloc+0xf3/0x100
  kasan_slab_alloc+0x12/0x20
  kmem_cache_alloc+0x109/0x3e0
  mmap_region+0x53e/0xe40
  do_mmap+0x70f/0xa50
  vm_mmap_pgoff+0x147/0x1b0
  SyS_mmap_pgoff+0x2c7/0x5b0
  SyS_mmap+0x1b/0x30
  do_syscall_64+0x1a0/0x4e0
  return_from_SYSCALL_64+0x0/0x7a
 INFO: Slab 0xea0011503600 objects=7 used=7 fp=0x  (null) 
flags=0x80004080
 INFO: Object 0x8804540de848 @offset=26696 fp=0x8804540dc588
 Redzone 8804540de840: bb bb bb bb bb bb bb bb  

 Object 8804540de848: 6b 6b 6b 6b 6b 6b 6b 6b b5 52 00 00 f2 01 60 cc  
.R`.

Similarly, poisoning after the quarantine_put() leads to false positive
use-after-free reports:

 BUG: KASAN: use-after-free in anon_vma_interval_tree_insert+0x304/0x430 at addr 880405c540a0
 Read of size 8 by task trinity-c0/3036
 CPU: 0 PID: 3036 Comm: trinity-c0 Not tainted 4.7.0-think+ #9
 Call Trace:
   dump_stack+0x68/0x96
   kasan_report_error+0x222/0x600
   __asan_report_load8_noabort+0x61/0x70
   anon_vma_interval_tree_insert+0x304/0x430
   anon_vma_chain_link+0x91/0xd0
   anon_vma_clone+0x136/0x3f0
   anon_vma_fork+0x81/0x4c0
   copy_process.part.47+0x2c43/0x5b20
   _do_fork+0x16d/0xbd0
   SyS_clone+0x19/0x20
   do_syscall_64+0x1a0/0x4e0
   entry_SYSCALL64_slow_path+0x25/0x25

Fix this by putting an object in the quarantine after all other
operations.

Fixes: 80a9201a5965 ("mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB")
Link: http://lkml.kernel.org/r/1470062715-14077-1-git-send-email-aryabi...@virtuozzo.com
Signed-off-by: Andrey Ryabinin 
Reported-by: Dave Jones 
Reported-by: Vegard Nossum 
Reported-by: Sasha Levin 
Acked-by: Alexander Potapenko 
Cc: Dmitry Vyukov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 4a3d308d6674fabf213bce9c1a661ef43a85e515)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/kasan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 8a57f22560a4..d7c814309c3e 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -504,9 +504,9 @@ bool kasan_slab_free(struct kmem_cache *cache, void *object)
switch (alloc_info->state) {
case KASAN_STATE_ALLOC:
alloc_info->state = KASAN_STATE_QUARANTINE;
-   quarantine_put(free_info, cache);
set_track(&free_info->track, GFP_NOWAIT);
kasan_poison_slab_free(cache, object);
+   quarantine_put(free_info, cache);
return true;
case KASAN_STATE_QUARANTINE:
case KASAN_STATE_FREE:
-- 
2.13.5



[Devel] [PATCH rh7 20/39] mm/kasan: get rid of ->alloc_size in struct kasan_alloc_meta

2017-09-14 Thread Andrey Ryabinin
The size of a slab object is already stored in cache->object_size.

Note that kmalloc() internally rounds up the size of an allocation, so
object_size may not be equal to alloc_size, but usually we don't need
to know the exact size of the allocated object.  If we do need that
information, we can still figure it out from the report: the dump of
shadow memory allows us to identify the end of the allocated memory, and
thereby the exact allocation size.
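
For instance, with KASAN's 1:8 shadow scale, a kmalloc(41) object served
from the kmalloc-64 cache would typically leave a shadow pattern along
the lines of

    00 00 00 00 00 01 fc fc

where the five 00 bytes cover 5 * 8 = 40 accessible bytes, the 01 marks
one more accessible byte in the next granule, and 0xfc
(KASAN_KMALLOC_REDZONE) marks the redzone - so the requested size of 41
can still be read back from the report even though object_size is 64.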

Link: http://lkml.kernel.org/r/1470062715-14077-4-git-send-email-aryabi...@virtuozzo.com
Signed-off-by: Andrey Ryabinin 
Cc: Alexander Potapenko 
Cc: Dmitry Vyukov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 47b5c2a0f021e90a79845d1a1353780e5edd0bce)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/kasan.c  | 1 -
 mm/kasan/kasan.h  | 4 +---
 mm/kasan/report.c | 8 +++-
 3 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index d7c814309c3e..a8d3e087dad3 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -545,7 +545,6 @@ void kasan_kmalloc(struct kmem_cache *cache, const void 
*object, size_t size,
get_alloc_info(cache, object);
 
alloc_info->state = KASAN_STATE_ALLOC;
-   alloc_info->alloc_size = size;
set_track(&alloc_info->track, flags);
}
 }
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index 1143e64b6a34..1175fa05f8a6 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -76,9 +76,7 @@ struct kasan_track {
 
 struct kasan_alloc_meta {
struct kasan_track track;
-   u32 state : 2;  /* enum kasan_state */
-   u32 alloc_size : 30;
-   u32 reserved;
+   u32 state;
 };
 
 struct qlist_node {
diff --git a/mm/kasan/report.c b/mm/kasan/report.c
index ef85919f4326..45f17623677e 100644
--- a/mm/kasan/report.c
+++ b/mm/kasan/report.c
@@ -118,7 +118,9 @@ static void kasan_object_err(struct kmem_cache *cache, 
struct page *page,
struct kasan_free_meta *free_info;
 
dump_stack();
-   pr_err("Object at %p, in cache %s\n", object, cache->name);
+   pr_err("Object at %p, in cache %s size: %d\n", object, cache->name,
+   cache->object_size);
+
if (!(cache->flags & SLAB_KASAN))
return;
switch (alloc_info->state) {
@@ -126,15 +128,11 @@ static void kasan_object_err(struct kmem_cache *cache, 
struct page *page,
pr_err("Object not allocated yet\n");
break;
case KASAN_STATE_ALLOC:
-   pr_err("Object allocated with size %u bytes.\n",
-  alloc_info->alloc_size);
pr_err("Allocation:\n");
print_track(&alloc_info->track);
break;
case KASAN_STATE_FREE:
case KASAN_STATE_QUARANTINE:
-   pr_err("Object freed, allocated with size %u bytes\n",
-  alloc_info->alloc_size);
free_info = get_free_info(cache, object);
pr_err("Allocation:\n");
print_track(&alloc_info->track);
-- 
2.13.5



[Devel] [PATCH rh7 06/39] mm, kasan: add GFP flags to KASAN API

2017-09-14 Thread Andrey Ryabinin
From: Alexander Potapenko 

Add GFP flags to KASAN hooks for future patches to use.

This patch is based on the "mm: kasan: unified support for SLUB and SLAB
allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Alexander Potapenko 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Andrey Konovalov 
Cc: Dmitry Vyukov 
Cc: Andrey Ryabinin 
Cc: Steven Rostedt 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 505f5dcb1c419e55a9621a01f83eb5745d8d7398)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h | 19 +++
 include/linux/slab.h  |  4 ++--
 mm/kasan/kasan.c  | 15 ---
 mm/mempool.c  | 16 
 mm/slab.c | 15 ---
 mm/slab_common.c  |  4 ++--
 mm/slub.c | 17 +
 7 files changed, 48 insertions(+), 42 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index f55c31becdb6..ab45598049da 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -45,13 +45,14 @@ void kasan_poison_slab(struct page *page);
 void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
 void kasan_poison_object_data(struct kmem_cache *cache, void *object);
 
-void kasan_kmalloc_large(const void *ptr, size_t size);
+void kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags);
 void kasan_kfree_large(const void *ptr);
 void kasan_kfree(void *ptr);
-void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size);
-void kasan_krealloc(const void *object, size_t new_size);
+void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size,
+ gfp_t flags);
+void kasan_krealloc(const void *object, size_t new_size, gfp_t flags);
 
-void kasan_slab_alloc(struct kmem_cache *s, void *object);
+void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags);
 void kasan_slab_free(struct kmem_cache *s, void *object);
 
 struct kasan_cache {
@@ -82,14 +83,16 @@ static inline void kasan_unpoison_object_data(struct 
kmem_cache *cache,
 static inline void kasan_poison_object_data(struct kmem_cache *cache,
void *object) {}
 
-static inline void kasan_kmalloc_large(void *ptr, size_t size) {}
+static inline void kasan_kmalloc_large(void *ptr, size_t size, gfp_t flags) {}
 static inline void kasan_kfree_large(const void *ptr) {}
 static inline void kasan_kfree(void *ptr) {}
 static inline void kasan_kmalloc(struct kmem_cache *s, const void *object,
-   size_t size) {}
-static inline void kasan_krealloc(const void *object, size_t new_size) {}
+   size_t size, gfp_t flags) {}
+static inline void kasan_krealloc(const void *object, size_t new_size,
+gfp_t flags) {}
 
-static inline void kasan_slab_alloc(struct kmem_cache *s, void *object) {}
+static inline void kasan_slab_alloc(struct kmem_cache *s, void *object,
+  gfp_t flags) {}
 static inline void kasan_slab_free(struct kmem_cache *s, void *object) {}
 
 static inline int kasan_module_alloc(void *addr, size_t size) { return 0; }
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 7dc1b73cdcec..d4946a66d15b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -322,7 +322,7 @@ static __always_inline void *kmem_cache_alloc_trace(struct 
kmem_cache *s,
 {
void *ret = kmem_cache_alloc(s, flags);
 
-   kasan_kmalloc(s, ret, size);
+   kasan_kmalloc(s, ret, size, flags);
return ret;
 }
 
@@ -333,7 +333,7 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s,
 {
void *ret = kmem_cache_alloc_node(s, gfpflags, node);
 
-   kasan_kmalloc(s, ret, size);
+   kasan_kmalloc(s, ret, size, gfpflags);
return ret;
 }
 #endif /* CONFIG_TRACING */
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 2e1a640f8772..03a856d1af12 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -411,9 +411,9 @@ struct kasan_free_meta *get_free_info(struct kmem_cache 
*cache,
 }
 #endif
 
-void kasan_slab_alloc(struct kmem_cache *cache, void *object)
+void kasan_slab_alloc(struct kmem_cache *cache, void *object, gfp_t flags)
 {
-   kasan_kmalloc(cache, object, cache->object_size);
+   kasan_kmalloc(cache, object, cache->object_size, flags);
 }
 
 void kasan_slab_free(struct kmem_cache *cache, void *object)
@@ -439,7 +439,8 @@ void kasan_slab_free(struct kmem_cache *cache, void *object)
kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE);
 }
 
-void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size)
+void kasan_kmalloc(struct kmem_cache *cache, const void *object, size_t size,
+  gfp_t flags)
 {
unsigned long redzone_start;
unsigned 

[Devel] [PATCH rh7 14/39] kasan: add newline to messages

2017-09-14 Thread Andrey Ryabinin
From: Dmitry Vyukov 

Currently GPF messages with KASAN look as follows:

  kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral 
protection fault:  [#1] SMP DEBUG_PAGEALLOC KASAN

Add newlines.

Link: 
http://lkml.kernel.org/r/1467294357-98002-1-git-send-email-dvyu...@google.com
Signed-off-by: Dmitry Vyukov 
Acked-by: Andrey Ryabinin 
Cc: Alexander Potapenko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 2ba78056acfe8d63a29565f91dae4678ed6b81ca)
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/mm/kasan_init_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index f9fb08ed645a..dbe2a7156d94 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -168,8 +168,8 @@ static int kasan_die_handler(struct notifier_block *self,
 void *data)
 {
if (val == DIE_GPF) {
-   pr_emerg("CONFIG_KASAN_INLINE enabled");
-   pr_emerg("GPF could be caused by NULL-ptr deref or user memory 
access");
+   pr_emerg("CONFIG_KASAN_INLINE enabled\n");
+   pr_emerg("GPF could be caused by NULL-ptr deref or user memory 
access\n");
}
return NOTIFY_OK;
 }
-- 
2.13.5

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 02/39] Documentation: kasan: fix a typo

2017-09-14 Thread Andrey Ryabinin
From: Wang Long 

Fix a couple of typos in the kasan document.

Signed-off-by: Wang Long 
Signed-off-by: Jonathan Corbet 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit f66fa08bf9e59b1231aba9e3c2ec28dcf08f0389)
Signed-off-by: Andrey Ryabinin 
---
 Documentation/kasan.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt
index 67e62ed6a198..82ed25f9d23c 100644
--- a/Documentation/kasan.txt
+++ b/Documentation/kasan.txt
@@ -149,7 +149,7 @@ AddressSanitizer dedicates 1/8 of kernel memory to its 
shadow memory
 (e.g. 16TB to cover 128TB on x86_64) and uses direct mapping with a scale and
 offset to translate a memory address to its corresponding shadow address.
 
-Here is the function witch translate an address to its corresponding shadow
+Here is the function which translates an address to its corresponding shadow
 address:
 
 static inline void *kasan_mem_to_shadow(const void *addr)
-- 
2.13.5

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 28/39] lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB

2017-09-14 Thread Andrey Ryabinin
From: Dmitry Vyukov 

KASAN uses stackdepot to memorize stacks for all kmalloc/kfree calls.
Current stackdepot capacity is 16MB (1024 top level entries x 4 pages on
second level).  Size of each stack is (num_frames + 3) * sizeof(long).
Which gives us ~84K stacks.  This capacity was chosen empirically and it
is enough to run kernel normally.

However, when lots of configs are enabled and a fuzzer tries to maximize
code coverage, it easily hits the limit within tens of minutes.  I've
tested for a long time with the number of top-level entries bumped 4x
(4096).  And I think I've seen overflow only once.  But I don't have all
configs enabled and code coverage has not reached maximum yet.  So bump
it 8x to 8192.

Since we have two-level table, memory cost of this is very moderate --
currently the top-level table is 8KB, with this patch it is 64KB, which
is negligible under KASAN.

Here is some approx math.

128MB allows us to memorize ~670K stacks (assuming stack is ~200b).
I've grepped kernel for kmalloc|kfree|kmem_cache_alloc|kmem_cache_free|
kzalloc|kstrdup|kstrndup|kmemdup and it gives ~60K matches.  Most of
alloc/free call sites are reachable with only one stack.  But some
utility functions can have large fanout.  Assuming average fanout is 5x,
total number of alloc/free stacks is ~300K.
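
As a sanity check of the arithmetic above, a small stand-alone C program
(the constants mirror the numbers quoted in this message; 22 frames gives the
assumed ~200-byte stack on 64-bit):

    #include <stdio.h>

    int main(void)
    {
            long page = 4096;
            long stack = (22 + 3) * sizeof(long);   /* ~200 bytes on 64-bit */
            long old_cap = 1024L * 4 * page;        /* 16MB: 1024 slabs x 4 pages */
            long new_cap = 8192L * 4 * page;        /* 128MB: 8192 slabs x 4 pages */

            printf("old capacity: ~%ld stacks\n", old_cap / stack);  /* ~84K  */
            printf("new capacity: ~%ld stacks\n", new_cap / stack);  /* ~670K */
            return 0;
    }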

Link: 
http://lkml.kernel.org/r/1476458416-122131-1-git-send-email-dvyu...@google.com
Signed-off-by: Dmitry Vyukov 
Cc: Andrey Ryabinin 
Cc: Alexander Potapenko 
Cc: Joonsoo Kim 
Cc: Baozeng Ding 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 02754e0a484a50a92d44c38879f2cb2792ebc572)
Signed-off-by: Andrey Ryabinin 
---
 lib/stackdepot.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 60f77f1d470a..4d830e299989 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -50,7 +50,7 @@
STACK_ALLOC_ALIGN)
 #define STACK_ALLOC_INDEX_BITS (DEPOT_STACK_BITS - \
STACK_ALLOC_NULL_PROTECTION_BITS - STACK_ALLOC_OFFSET_BITS)
-#define STACK_ALLOC_SLABS_CAP 1024
+#define STACK_ALLOC_SLABS_CAP 8192
 #define STACK_ALLOC_MAX_SLABS \
(((1LL << (STACK_ALLOC_INDEX_BITS)) < STACK_ALLOC_SLABS_CAP) ? \
 (1LL << (STACK_ALLOC_INDEX_BITS)) : STACK_ALLOC_SLABS_CAP)
-- 
2.13.5

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 39/39] module: Fix load_module() error path

2017-09-14 Thread Andrey Ryabinin
From: Peter Zijlstra 

The load_module() error path frees a module but forgot to take it out
of the mod_tree, leaving a dangling entry in the tree, causing havoc.

Cc: Mathieu Desnoyers 
Reported-by: Arthur Marsh 
Tested-by: Arthur Marsh 
Fixes: 93c2e105f6bc ("module: Optimize __module_address() using a latched 
RB-tree")
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 758556bdc1c8a8dffea0ea9f9df891878cc2468c)
Signed-off-by: Andrey Ryabinin 
---
 kernel/module.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/module.c b/kernel/module.c
index 952a9582f840..a5ee99f0f7a0 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3643,6 +3643,7 @@ static int load_module(struct load_info *info, const char 
__user *uargs,
mutex_lock(&module_mutex);
/* Unlink carefully: kallsyms could be walking list. */
list_del_rcu(&mod->list);
+   mod_tree_remove(mod);
wake_up_all(&module_wq);
mutex_unlock(&module_mutex);
  free_module:
-- 
2.13.5

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 33/39] rbtree: Make lockless searches non-fatal

2017-09-14 Thread Andrey Ryabinin
From: Peter Zijlstra 

Change the insert and erase code such that lockless searches are
non-fatal.

In and of itself an rbtree cannot be correctly searched while
in-modification, we can however provide weaker guarantees that will
allow the rbtree to be used in conjunction with other techniques, such
as latches; see 9b0fd802e8c0 ("seqcount: Add raw_write_seqcount_latch()").

For this to work we need the following guarantees from the rbtree
code:

 1) a lockless reader must not see partial stores, this would allow it
to observe nodes that are invalid memory.

 2) there must not be (temporary) loops in the tree structure in the
modifier's program order, this would cause a lookup which
interrupts the modifier to get stuck indefinitely.

For 1) we must use WRITE_ONCE() for all updates to the tree structure;
in particular this patch only does rb_{left,right} as those are the
only element required for simple searches.

It generates slightly worse code, probably because of the volatile accesses.
But in pointer-chasing-heavy code a few more instructions should not matter.

For 2) I have carefully audited the code and drawn every intermediate
link state and not found a loop.
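
To illustrate what these two rules buy us, a hedged sketch of a lockless
lookup built on top of them (struct my_node, its key field and
rb_find_lockless() are made-up names; the caller is assumed to provide
RCU-style protection and to tolerate a possibly stale answer):

    #include <linux/rbtree.h>
    #include <linux/compiler.h>

    struct my_node {
            struct rb_node rb;
            unsigned long key;
    };

    static struct my_node *rb_find_lockless(struct rb_root *root,
                                            unsigned long key)
    {
            struct rb_node *node = ACCESS_ONCE(root->rb_node);

            while (node) {
                    struct my_node *e = rb_entry(node, struct my_node, rb);

                    /* Rule 1) means these loads never see a half-written
                     * pointer; rule 2) means we cannot loop forever even if
                     * a writer rebalances underneath us. */
                    if (key < e->key)
                            node = ACCESS_ONCE(node->rb_left);
                    else if (key > e->key)
                            node = ACCESS_ONCE(node->rb_right);
                    else
                            return e;   /* may be stale; revalidate if needed */
            }
            return NULL;
    }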

Cc: Mathieu Desnoyers 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Andrea Arcangeli 
Cc: David Woodhouse 
Cc: Rik van Riel 
Reviewed-by: Michel Lespinasse 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit d72da4a4d973d8a0a0d3c97e7cdebf287fbe3a99)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/rbtree.h   | 16 +++--
 include/linux/rbtree_augmented.h | 21 +++
 lib/rbtree.c | 76 
 3 files changed, 81 insertions(+), 32 deletions(-)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 57e75ae9910f..829c5a8b41c0 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -31,6 +31,7 @@
 
 #include 
 #include 
+#include 
 
 struct rb_node {
unsigned long  __rb_parent_color;
@@ -73,11 +74,11 @@ extern struct rb_node *rb_first_postorder(const struct 
rb_root *);
 extern struct rb_node *rb_next_postorder(const struct rb_node *);
 
 /* Fast replacement of a single node without remove/rebalance/add/rebalance */
-extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, 
+extern void rb_replace_node(struct rb_node *victim, struct rb_node *new,
struct rb_root *root);
 
-static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
-   struct rb_node ** rb_link)
+static inline void rb_link_node(struct rb_node *node, struct rb_node *parent,
+   struct rb_node **rb_link)
 {
node->__rb_parent_color = (unsigned long)parent;
node->rb_left = node->rb_right = NULL;
@@ -85,6 +86,15 @@ static inline void rb_link_node(struct rb_node * node, 
struct rb_node * parent,
*rb_link = node;
 }
 
+static inline void rb_link_node_rcu(struct rb_node *node, struct rb_node 
*parent,
+   struct rb_node **rb_link)
+{
+   node->__rb_parent_color = (unsigned long)parent;
+   node->rb_left = node->rb_right = NULL;
+
+   rcu_assign_pointer(*rb_link, node);
+}
+
 #define rb_entry_safe(ptr, type, member) \
({ typeof(ptr) ____ptr = (ptr); \
   ____ptr ? rb_entry(____ptr, type, member) : NULL; \
diff --git a/include/linux/rbtree_augmented.h b/include/linux/rbtree_augmented.h
index fea49b5da12a..1690f2612449 100644
--- a/include/linux/rbtree_augmented.h
+++ b/include/linux/rbtree_augmented.h
@@ -113,11 +113,11 @@ __rb_change_child(struct rb_node *old, struct rb_node 
*new,
 {
if (parent) {
if (parent->rb_left == old)
-   parent->rb_left = new;
+   WRITE_ONCE(parent->rb_left, new);
else
-   parent->rb_right = new;
+   WRITE_ONCE(parent->rb_right, new);
} else
-   root->rb_node = new;
+   WRITE_ONCE(root->rb_node, new);
 }
 
 extern void __rb_erase_color(struct rb_node *parent, struct rb_root *root,
@@ -127,7 +127,8 @@ static __always_inline struct rb_node *
 __rb_erase_augmented(struct rb_node *node, struct rb_root *root,
 const struct rb_augment_callbacks *augment)
 {
-   struct rb_node *child = node->rb_right, *tmp = node->rb_left;
+   struct rb_node *child = node->rb_right;
+   struct rb_node *tmp = node->rb_left;
struct rb_node *parent, *rebalance;
unsigned long pc;
 
@@ -157,6 +158,7 @@ __rb_erase_augmented(struct rb_node *node, struct rb_root 
*root,
tmp = parent;
} else {
struct rb_node *successor = child, *child2;
+
tmp = child->rb_left;
if (!tm

[Devel] [PATCH rh7 37/39] rbtree: Implement generic latch_tree

2017-09-14 Thread Andrey Ryabinin
From: Peter Zijlstra 

Implement a latched RB-tree in order to get unconditional RCU/lockless
lookups.
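
A hedged sketch of what a user of this header looks like -- defining the two
operators over a hypothetical object (struct my_obj and the callbacks are
illustrative, not part of the patch); insertions and erasures are expected to
be serialized by the caller, and lookups done under an RCU read-side section:

    #include <linux/rbtree_latch.h>

    struct my_obj {
            unsigned long start, size;
            struct latch_tree_node ltn;     /* one rb_node per tree copy */
    };

    static __always_inline bool my_less(struct latch_tree_node *a,
                                        struct latch_tree_node *b)
    {
            return container_of(a, struct my_obj, ltn)->start <
                   container_of(b, struct my_obj, ltn)->start;
    }

    static __always_inline int my_comp(void *key, struct latch_tree_node *n)
    {
            unsigned long addr = (unsigned long)key;
            struct my_obj *obj = container_of(n, struct my_obj, ltn);

            if (addr < obj->start)
                    return -1;
            if (addr >= obj->start + obj->size)
                    return 1;
            return 0;
    }

    static const struct latch_tree_ops my_ops = {
            .less = my_less,
            .comp = my_comp,
    };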

Cc: Oleg Nesterov 
Cc: Michel Lespinasse 
Cc: Andrea Arcangeli 
Cc: David Woodhouse 
Cc: Rik van Riel 
Cc: Mathieu Desnoyers 
Cc: "Paul E. McKenney" 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit ade3f510f93a5613b672febe88eff8ea7f1c63b7)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/rbtree_latch.h | 212 +++
 1 file changed, 212 insertions(+)
 create mode 100644 include/linux/rbtree_latch.h

diff --git a/include/linux/rbtree_latch.h b/include/linux/rbtree_latch.h
new file mode 100644
index ..4f3432c61d12
--- /dev/null
+++ b/include/linux/rbtree_latch.h
@@ -0,0 +1,212 @@
+/*
+ * Latched RB-trees
+ *
+ * Copyright (C) 2015 Intel Corp., Peter Zijlstra 
+ *
+ * Since RB-trees have non-atomic modifications they're not immediately suited
+ * for RCU/lockless queries. Even though we made RB-tree lookups non-fatal for
+ * lockless lookups; we cannot guarantee they return a correct result.
+ *
+ * The simplest solution is a seqlock + RB-tree, this will allow lockless
+ * lookups; but has the constraint (inherent to the seqlock) that read sides
+ * cannot nest in write sides.
+ *
+ * If we need to allow unconditional lookups (say as required for NMI context
+ * usage) we need a more complex setup; this data structure provides this by
+ * employing the latch technique -- see @raw_write_seqcount_latch -- to
+ * implement a latched RB-tree which does allow for unconditional lookups by
+ * virtue of always having (at least) one stable copy of the tree.
+ *
+ * However, while we have the guarantee that there is at all times one stable
+ * copy, this does not guarantee an iteration will not observe modifications.
+ * What might have been a stable copy at the start of the iteration, need not
+ * remain so for the duration of the iteration.
+ *
+ * Therefore, this does require a lockless RB-tree iteration to be non-fatal;
+ * see the comment in lib/rbtree.c. Note however that we only require the first
+ * condition -- not seeing partial stores -- because the latch thing isolates
+ * us from loops. If we were to interrupt a modification the lookup would be
+ * pointed at the stable tree and complete while the modification was halted.
+ */
+
+#ifndef RB_TREE_LATCH_H
+#define RB_TREE_LATCH_H
+
+#include 
+#include 
+
+struct latch_tree_node {
+   struct rb_node node[2];
+};
+
+struct latch_tree_root {
+   seqcount_t  seq;
+   struct rb_root  tree[2];
+};
+
+/**
+ * latch_tree_ops - operators to define the tree order
+ * @less: used for insertion; provides the (partial) order between two 
elements.
+ * @comp: used for lookups; provides the order between the search key and an 
element.
+ *
+ * The operators are related like:
+ *
+ * comp(a->key,b) < 0  := less(a,b)
+ * comp(a->key,b) > 0  := less(b,a)
+ * comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
+ *
+ * If these operators define a partial order on the elements we make no
+ * guarantee on which of the elements matching the key is found. See
+ * latch_tree_find().
+ */
+struct latch_tree_ops {
+   bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b);
+   int  (*comp)(void *key, struct latch_tree_node *b);
+};
+
+static __always_inline struct latch_tree_node *
+__lt_from_rb(struct rb_node *node, int idx)
+{
+   return container_of(node, struct latch_tree_node, node[idx]);
+}
+
+static __always_inline void
+__lt_insert(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx,
+   bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b))
+{
+   struct rb_root *root = <r->tree[idx];
+   struct rb_node **link = &root->rb_node;
+   struct rb_node *node = <n->node[idx];
+   struct rb_node *parent = NULL;
+   struct latch_tree_node *ltp;
+
+   while (*link) {
+   parent = *link;
+   ltp = __lt_from_rb(parent, idx);
+
+   if (less(ltn, ltp))
+   link = &parent->rb_left;
+   else
+   link = &parent->rb_right;
+   }
+
+   rb_link_node_rcu(node, parent, link);
+   rb_insert_color(node, root);
+}
+
+static __always_inline void
+__lt_erase(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx)
+{
+   rb_erase(<n->node[idx], <r->tree[idx]);
+}
+
+static __always_inline struct latch_tree_node *
+__lt_find(void *key, struct latch_tree_root *ltr, int idx,
+ int (*comp)(void *key, struct latch_tree_node *node))
+{
+   struct rb_node *node = rcu_dereference_raw(ltr->tree[idx].rb_node);
+   struct latch_tree_node *ltn;
+   int c;
+
+   while (node) {
+   ltn = __lt_from_rb(node, idx);
+  

[Devel] [PATCH rh7 34/39] seqlock: Better document raw_write_seqcount_latch()

2017-09-14 Thread Andrey Ryabinin
From: Peter Zijlstra 

Improve the documentation of the latch technique as used in the
current timekeeping code, such that it can be readily employed
elsewhere.

Borrow from the comments in timekeeping and replace those with a
reference to this more generic comment.

Cc: Andrea Arcangeli 
Cc: David Woodhouse 
Cc: Rik van Riel 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Reviewed-by: Mathieu Desnoyers 
Acked-by: Michel Lespinasse 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 6695b92a60bc7160c92d6dc5b17cc79673017c2f)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/seqlock.h   | 76 ++-
 kernel/time/timekeeping.c | 27 +
 2 files changed, 76 insertions(+), 27 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 48f2f69e3867..ee088ed20a6c 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -171,9 +171,83 @@ static inline int read_seqcount_retry(const seqcount_t *s, 
unsigned start)
 }
 
 
-/*
+/**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
  * @s: pointer to seqcount_t
+ *
+ * The latch technique is a multiversion concurrency control method that allows
+ * queries during non-atomic modifications. If you can guarantee queries never
+ * interrupt the modification -- e.g. the concurrency is strictly between CPUs
+ * -- you most likely do not need this.
+ *
+ * Where the traditional RCU/lockless data structures rely on atomic
+ * modifications to ensure queries observe either the old or the new state the
+ * latch allows the same for non-atomic updates. The trade-off is doubling the
+ * cost of storage; we have to maintain two copies of the entire data
+ * structure.
+ *
+ * Very simply put: we first modify one copy and then the other. This ensures
+ * there is always one copy in a stable state, ready to give us an answer.
+ *
+ * The basic form is a data structure like:
+ *
+ * struct latch_struct {
+ * seqcount_t  seq;
+ * struct data_struct  data[2];
+ * };
+ *
+ * Where a modification, which is assumed to be externally serialized, does the
+ * following:
+ *
+ * void latch_modify(struct latch_struct *latch, ...)
+ * {
+ * smp_wmb();  <- Ensure that the last data[1] update is visible
+ * latch->seq++;
+ * smp_wmb();  <- Ensure that the seqcount update is visible
+ *
+ * modify(latch->data[0], ...);
+ *
+ * smp_wmb();  <- Ensure that the data[0] update is visible
+ * latch->seq++;
+ * smp_wmb();  <- Ensure that the seqcount update is visible
+ *
+ * modify(latch->data[1], ...);
+ * }
+ *
+ * The query will have a form like:
+ *
+ * struct entry *latch_query(struct latch_struct *latch, ...)
+ * {
+ * struct entry *entry;
+ * unsigned seq, idx;
+ *
+ * do {
+ * seq = latch->seq;
+ * smp_rmb();
+ *
+ * idx = seq & 0x01;
+ * entry = data_query(latch->data[idx], ...);
+ *
+ * smp_rmb();
+ * } while (seq != latch->seq);
+ *
+ * return entry;
+ * }
+ *
+ * So during the modification, queries are first redirected to data[1]. Then we
+ * modify data[0]. When that is complete, we redirect queries back to data[0]
+ * and we can modify data[1].
+ *
+ * NOTE: The non-requirement for atomic modifications does _NOT_ include
+ *   the publishing of new entries in the case where data is a dynamic
+ *   data structure.
+ *
+ *   An iteration might start in data[0] and get suspended long enough
+ *   to miss an entire modification sequence, once it resumes it might
+ *   observe the new entry.
+ *
+ * NOTE: When data is a dynamic data structure; one should use regular RCU
+ *   patterns to manage the lifetimes of the objects within.
  */
 static inline void raw_write_seqcount_latch(seqcount_t *s)
 {
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index e79a23a1bd03..8e5b95064209 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -229,32 +229,7 @@ static inline s64 timekeeping_get_ns(struct tk_read_base 
*tkr)
  * We want to use this from any context including NMI and tracing /
  * instrumenting the timekeeping code itself.
  *
- * So we handle this differently than the other timekeeping accessor
- * functions which retry when the sequence count has changed. The
- * update side does:
- *
- * smp_wmb();  <- Ensure that the last base[1] update is visible
- * tkf->seq++;
- * smp_wmb();  <- Ensure that the seqcount update is visible
- * update(tkf->base[0], tkr);
- * smp_wmb();  <- Ensure that the base[0] update is visible
- * tkf->seq++;
- * smp_wmb();  <- Ensure that the seqcount update is visible
- * update(tkf->base[1], tkr);
- *
- * The reader side does:
- *
- * do {
- * seq = tkf->seq;
- * smp_rmb();
- * idx = seq

[Devel] [PATCH rh7 38/39] module: Optimize __module_address() using a latched RB-tree

2017-09-14 Thread Andrey Ryabinin
From: Peter Zijlstra 

Currently __module_address() is using a linear search through all
modules in order to find the module corresponding to the provided
address. With a lot of modules this can take a lot of time.

One of the users of this is kernel_text_address() which is employed
in many stack unwinders; which in turn are used by perf-callchain and
ftrace (possibly from NMI context).

So by optimizing __module_address() we optimize many stack unwinders
which are used by both perf and tracing in performance sensitive code.
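
Conceptually the lookup side then becomes a single latched-tree query instead
of a list walk.  A hedged sketch of the new helper (abbreviated; the exact
shape of the mod_tree variable in the final diff may differ):

    static struct module *mod_find(unsigned long addr)
    {
            struct latch_tree_node *ltn;

            /* Protected by RCU-sched; usable from NMI/tracing contexts. */
            ltn = latch_tree_find((void *)addr, &mod_tree, &mod_tree_ops);
            if (!ltn)
                    return NULL;

            return container_of(ltn, struct mod_tree_node, node)->mod;
    }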

Cc: Rusty Russell 
Cc: Steven Rostedt 
Cc: Mathieu Desnoyers 
Cc: Oleg Nesterov 
Cc: "Paul E. McKenney" 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 93c2e105f6bcee231c951ba0e56e84505c4b0483)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/module.h |  32 +++---
 kernel/module.c| 117 ++---
 2 files changed, 138 insertions(+), 11 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index a4155ca70d1a..48c7335b05c8 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -236,8 +237,14 @@ struct module_ext {
 #endif
 };
 
-struct module
-{
+struct module;
+
+struct mod_tree_node {
+   struct module *mod;
+   struct latch_tree_node node;
+};
+
+struct module {
enum module_state state;
 
/* Member of list of modules */
@@ -296,8 +303,15 @@ struct module
/* Startup function. */
int (*init)(void);
 
-   /* If this is non-NULL, vfree after init() returns */
-   void *module_init;
+   /*
+* If this is non-NULL, vfree() after init() returns.
+*
+* Cacheline align here, such that:
+*   module_init, module_core, init_size, core_size,
+*   init_text_size, core_text_size and ltn_core.node[0]
+* are on the same cacheline.
+*/
+   void *module_init   cacheline_aligned;
 
/* Here is the actual code + data, vfree'd on unload. */
void *module_core;
@@ -308,6 +322,14 @@ struct module
/* The size of the executable code in each section.  */
unsigned int init_text_size, core_text_size;
 
+   /*
+* We want mtn_core::{mod,node[0]} to be in the same cacheline as the
+* above entries such that a regular lookup will only touch one
+* cacheline.
+*/
+   struct mod_tree_nodemtn_core;
+   struct mod_tree_nodemtn_init;
+
/* Size of RO sections of the module (text+rodata) */
unsigned int init_ro_size, core_ro_size;
 
@@ -392,7 +414,7 @@ struct module
ctor_fn_t *ctors;
unsigned int num_ctors;
 #endif
-};
+} cacheline_aligned;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
 #endif
diff --git a/kernel/module.c b/kernel/module.c
index 3f5edae1edac..952a9582f840 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -105,6 +105,108 @@
 DEFINE_MUTEX(module_mutex);
 EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
+
+/*
+ * Use a latched RB-tree for __module_address(); this allows us to use
+ * RCU-sched lookups of the address from any context.
+ *
+ * Because modules have two address ranges: init and core, we need two
+ * latch_tree_nodes entries. Therefore we need the back-pointer from
+ * mod_tree_node.
+ *
+ * Because init ranges are short lived we mark them unlikely and have placed
+ * them outside the critical cacheline in struct module.
+ */
+
+static __always_inline unsigned long __mod_tree_val(struct latch_tree_node *n)
+{
+   struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+   struct module *mod = mtn->mod;
+
+   if (unlikely(mtn == &mod->mtn_init))
+   return (unsigned long)mod->module_init;
+
+   return (unsigned long)mod->module_core;
+}
+
+static __always_inline unsigned long __mod_tree_size(struct latch_tree_node *n)
+{
+   struct mod_tree_node *mtn = container_of(n, struct mod_tree_node, node);
+   struct module *mod = mtn->mod;
+
+   if (unlikely(mtn == &mod->mtn_init))
+   return (unsigned long)mod->init_size;
+
+   return (unsigned long)mod->core_size;
+}
+
+static __always_inline bool
+mod_tree_less(struct latch_tree_node *a, struct latch_tree_node *b)
+{
+   return __mod_tree_val(a) < __mod_tree_val(b);
+}
+
+static __always_inline int
+mod_tree_comp(void *key, struct latch_tree_node *n)
+{
+   unsigned long val = (unsigned long)key;
+   unsigned long start, end;
+
+   start = __mod_tree_val(n);
+   if (val < start)
+   return -1;
+
+   end = start + __mod_tree_size(n);
+   if (val >= end)
+   return 1;
+
+   return 0;
+}
+
+static const struct latch_tree_ops mod_tree_ops = {
+   

[Devel] [PATCH rh7 32/39] kasan: fix races in quarantine_remove_cache()

2017-09-14 Thread Andrey Ryabinin
From: Dmitry Vyukov 

quarantine_remove_cache() frees all pending objects that belong to the
cache, before we destroy the cache itself.  However there are currently
two possibilities for how it can fail to do so.

First, another thread can hold some of the objects from the cache in
the temp list in quarantine_put().  quarantine_put() has a window of
enabled interrupts, and on_each_cpu() in quarantine_remove_cache() can
finish right in that window.  These objects will be later freed into the
destroyed cache.

Then, quarantine_reduce() has the same problem.  It grabs a batch of
objects from the global quarantine, then unlocks quarantine_lock and
then frees the batch.  quarantine_remove_cache() can finish while some
objects from the cache are still in the local to_free list in
quarantine_reduce().

Fix the race with quarantine_put() by disabling interrupts for the whole
duration of quarantine_put().  In combination with on_each_cpu() in
quarantine_remove_cache() it ensures that quarantine_remove_cache()
either sees the objects in the per-cpu list or in the global list.

Fix the race with quarantine_reduce() by protecting quarantine_reduce()
with srcu critical section and then doing synchronize_srcu() at the end
of quarantine_remove_cache().
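
In outline, the synchronization added by this patch looks roughly like the
sketch below (locking of the global quarantine, list handling and error paths
are omitted; the srcu domain name follows the upstream commit):

    DEFINE_SRCU(remove_cache_srcu);

    void quarantine_reduce(void)
    {
            int idx = srcu_read_lock(&remove_cache_srcu);

            /* ... pop a batch from the global quarantine into a local
             * to_free list and qlist_free_all() it ... */

            srcu_read_unlock(&remove_cache_srcu, idx);
    }

    void quarantine_remove_cache(struct kmem_cache *cache)
    {
            /* quarantine_put() now keeps interrupts disabled for its whole
             * duration, so this either sees the objects in the per-cpu
             * lists or in the global quarantine. */
            on_each_cpu(per_cpu_remove_cache, cache, 1);

            /* ... drop this cache's objects from the global quarantine ... */

            /* wait for any concurrent quarantine_reduce() to finish */
            synchronize_srcu(&remove_cache_srcu);
    }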

I've done some assessment of how good synchronize_srcu() works in this
case.  And on a 4 CPU VM I see that it blocks waiting for pending read
critical sections in about 2-3% of cases.  Which looks good to me.

I suspect that these races are the root cause of some GPFs that I
episodically hit.  Previously I did not have any explanation for them.

  BUG: unable to handle kernel NULL pointer dereference at 00c8
  IP: qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
  PGD 6aeea067
  PUD 60ed7067
  PMD 0
  Oops:  [#1] SMP KASAN
  Dumping ftrace buffer:
 (ftrace buffer empty)
  Modules linked in:
  CPU: 0 PID: 13667 Comm: syz-executor2 Not tainted 4.10.0+ #60
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  task: 88005f948040 task.stack: 880069818000
  RIP: 0010:qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
  RSP: 0018:88006981f298 EFLAGS: 00010246
  RAX: ea00 RBX:  RCX: ea1f
  RDX:  RSI: 88003fffc3e0 RDI: 
  RBP: 88006981f2c0 R08: 88002fed7bd8 R09: 0001001f000d
  R10: 001f000d R11: 88006981f000 R12: 88003fffc3e0
  R13: 88006981f2d0 R14: 81877fae R15: 8000
  FS:  7fb911a2d700() GS:88003ec0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 00c8 CR3: 60ed6000 CR4: 06f0
  Call Trace:
   quarantine_reduce+0x10e/0x120 mm/kasan/quarantine.c:239
   kasan_kmalloc+0xca/0xe0 mm/kasan/kasan.c:590
   kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
   slab_post_alloc_hook mm/slab.h:456 [inline]
   slab_alloc_node mm/slub.c:2718 [inline]
   kmem_cache_alloc_node+0x1d3/0x280 mm/slub.c:2754
   __alloc_skb+0x10f/0x770 net/core/skbuff.c:219
   alloc_skb include/linux/skbuff.h:932 [inline]
   _sctp_make_chunk+0x3b/0x260 net/sctp/sm_make_chunk.c:1388
   sctp_make_data net/sctp/sm_make_chunk.c:1420 [inline]
   sctp_make_datafrag_empty+0x208/0x360 net/sctp/sm_make_chunk.c:746
   sctp_datamsg_from_user+0x7e8/0x11d0 net/sctp/chunk.c:266
   sctp_sendmsg+0x2611/0x3970 net/sctp/socket.c:1962
   inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
   sock_sendmsg_nosec net/socket.c:633 [inline]
   sock_sendmsg+0xca/0x110 net/socket.c:643
   SYSC_sendto+0x660/0x810 net/socket.c:1685
   SyS_sendto+0x40/0x50 net/socket.c:1653

I am not sure about backporting.  The bug is quite hard to trigger, I've
seen it a few times during our massive continuous testing (however, it
could be the cause of some other episodic stray crashes as it leads to
memory corruption...).  If it is triggered, the consequences are very
bad -- almost certainly bad memory corruption.  The fix is non-trivial
and has a chance of introducing new bugs.  I am also not sure how
actively people use KASAN on older releases.

[dvyu...@google.com: - sorted includes]
  Link: http://lkml.kernel.org/r/20170309094028.51088-1-dvyu...@google.com
Link: http://lkml.kernel.org/r/20170308151532.5070-1-dvyu...@google.com
Signed-off-by: Dmitry Vyukov 
Acked-by: Andrey Ryabinin 
Cc: Greg Thelen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit ce5bec54bb5debbbe51b40270d8f209a23cadae4)
Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/quarantine.c | 42 --
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index 6f1ed1630873..5c44c08f46b6 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -103,6 +104,7 @@ static int quara

[Devel] [PATCH rh7 31/39] kasan: drain quarantine of memcg slab objects

2017-09-14 Thread Andrey Ryabinin
From: Greg Thelen 

Per memcg slab accounting and kasan have a problem with kmem_cache
destruction.
 - kmem_cache_create() allocates a kmem_cache, which is used for
   allocations from processes running in root (top) memcg.
 - Processes running in non root memcg and allocating with either
   __GFP_ACCOUNT or from a SLAB_ACCOUNT cache use a per memcg
   kmem_cache.
 - Kasan catches use-after-free by having kfree() and kmem_cache_free()
   defer freeing of objects. Objects are placed in a quarantine.
 - kmem_cache_destroy() destroys root and non root kmem_caches. It takes
   care to drain the quarantine of objects from the root memcg's
   kmem_cache, but ignores objects associated with non root memcg. This
   causes leaks because quarantined per memcg objects refer to per memcg
   kmem cache being destroyed.

To see the problem:

 1) create a slab cache with kmem_cache_create(,,,SLAB_ACCOUNT,)
 2) from non root memcg, allocate and free a few objects from cache
 3) dispose of the cache with kmem_cache_destroy() kmem_cache_destroy()
will trigger a "Slab cache still has objects" warning indicating
that the per memcg kmem_cache structure was leaked.

Fix the leak by draining kasan quarantined objects allocated from non
root memcg.
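
A minimal in-kernel reproducer for the three steps above might look like this
(hypothetical test-module code; the cache name is made up, error handling is
trimmed, and insmod must be run from a task attached to a non-root memcg):

    #include <linux/module.h>
    #include <linux/slab.h>

    static int __init repro_init(void)
    {
            struct kmem_cache *c;
            int i;

            c = kmem_cache_create("kasan_memcg_repro", 256, 0,
                                  SLAB_ACCOUNT, NULL);
            if (!c)
                    return -ENOMEM;

            for (i = 0; i < 16; i++) {
                    void *p = kmem_cache_alloc(c, GFP_KERNEL);

                    kmem_cache_free(c, p);  /* goes into the KASAN quarantine */
            }

            /* Without this patch: "Slab cache still has objects", because
             * quarantined objects still reference the per-memcg kmem_cache. */
            kmem_cache_destroy(c);
            return 0;
    }
    module_init(repro_init);
    MODULE_LICENSE("GPL");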

Racing memcg deletion is tricky, but handled.  kmem_cache_destroy() =>
shutdown_memcg_caches() => __shutdown_memcg_cache() => shutdown_cache()
flushes per memcg quarantined objects, even if that memcg has been
rmdir'd and gone through memcg_deactivate_kmem_caches().

This leak only affects destroyed SLAB_ACCOUNT kmem caches when kasan is
enabled.  So I don't think it's worth patching stable kernels.

Link: 
http://lkml.kernel.org/r/1482257462-36948-1-git-send-email-gthe...@google.com
Signed-off-by: Greg Thelen 
Reviewed-by: Vladimir Davydov 
Acked-by: Andrey Ryabinin 
Cc: Alexander Potapenko 
Cc: Dmitry Vyukov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit f9fa1d919c696e90c887d8742198023e7639d139)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h | 4 ++--
 mm/kasan/kasan.c  | 2 +-
 mm/kasan/quarantine.c | 1 +
 mm/slab_common.c  | 6 --
 4 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 536a400d1d39..21cedc322d9a 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -41,7 +41,7 @@ void kasan_free_pages(struct page *page, unsigned int order);
 void kasan_cache_create(struct kmem_cache *cache, size_t *size,
unsigned long *flags);
 void kasan_cache_shrink(struct kmem_cache *cache);
-void kasan_cache_destroy(struct kmem_cache *cache);
+void kasan_cache_shutdown(struct kmem_cache *cache);
 
 void kasan_poison_slab(struct page *page);
 void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
@@ -83,7 +83,7 @@ static inline void kasan_cache_create(struct kmem_cache 
*cache,
  size_t *size,
  unsigned long *flags) {}
 static inline void kasan_cache_shrink(struct kmem_cache *cache) {}
-static inline void kasan_cache_destroy(struct kmem_cache *cache) {}
+static inline void kasan_cache_shutdown(struct kmem_cache *cache) {}
 
 static inline void kasan_poison_slab(struct page *page) {}
 static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 8f350a2edcb6..8b9531312417 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -373,7 +373,7 @@ void kasan_cache_shrink(struct kmem_cache *cache)
quarantine_remove_cache(cache);
 }
 
-void kasan_cache_destroy(struct kmem_cache *cache)
+void kasan_cache_shutdown(struct kmem_cache *cache)
 {
quarantine_remove_cache(cache);
 }
diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index dae929c02bbb..6f1ed1630873 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -274,6 +274,7 @@ static void per_cpu_remove_cache(void *arg)
qlist_free_all(&to_free, cache);
 }
 
+/* Free all quarantined objects belonging to cache. */
 void quarantine_remove_cache(struct kmem_cache *cache)
 {
unsigned long flags, i;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8c8c99b9db05..b24d35d85e58 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -359,6 +359,10 @@ EXPORT_SYMBOL(kmem_cache_create);
 static int do_kmem_cache_shutdown(struct kmem_cache *s,
struct list_head *release, bool *need_rcu_barrier)
 {
+
+   /* free asan quarantined objects */
+   kasan_cache_shutdown(s);
+
if (__kmem_cache_shutdown(s) != 0) {
printk(KERN_ERR "kmem_cache_destroy %s: "
   "Slab cache still has objects\n", s->name);
@@ -544,8 +548,6 @@ void kmem_cac

[Devel] [PATCH rh7 35/39] rcu: Move lockless_dereference() out of rcupdate.h

2017-09-14 Thread Andrey Ryabinin
From: Peter Zijlstra 

I want to use lockless_dereference() from seqlock.h, which would mean
including rcupdate.h from it; however, rcupdate.h already includes
seqlock.h.

Avoid this by moving lockless_dereference() into compiler.h. This is
somewhat tricky since it uses smp_read_barrier_depends() which isn't
available there, but it's a CPP macro so we can get away with it.

The alternative would be moving it into asm/barrier.h, but that would
be updating each arch (I can do if people feel that is more
appropriate).
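
For reference, typical usage of the macro once it lives in compiler.h (a
sketch; struct foo and global_foo are made-up names, and the pointer is
assumed to be published with rcu_assign_pointer()-style ordering):

    struct foo {
            int bar;
    };

    static struct foo *global_foo;

    static int read_bar(void)
    {
            /* One ordered load; the dependent access to p->bar cannot be
             * reordered before the load of global_foo. */
            struct foo *p = lockless_dereference(global_foo);

            return p ? p->bar : -1;
    }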

Cc: Paul McKenney 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 0a04b0166929405cd833c1cc40f99e862b965ddc)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/compiler.h | 15 +++
 include/linux/rcupdate.h | 15 ---
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 73647b4cd947..7ce904c040dd 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -466,6 +466,21 @@ static __always_inline void __write_once_size(volatile 
void *p, void *res, int s
  */
 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
 
+/**
+ * lockless_dereference() - safely load a pointer for later dereference
+ * @p: The pointer to load
+ *
+ * Similar to rcu_dereference(), but for situations where the pointed-to
+ * object's lifetime is managed by something other than RCU.  That
+ * "something other" might be reference counting or simple immortality.
+ */
+#define lockless_dereference(p) \
+({ \
+   typeof(p) _p1 = ACCESS_ONCE(p); \
+   smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
+   (_p1); \
+})
+
 /* Ignore/forbid kprobes attach on very low level functions marked by this 
attribute: */
 #ifdef CONFIG_KPROBES
 # define __kprobes __attribute__((__section__(".kprobes.text")))
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 68df10240cb4..981261775a41 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -580,21 +580,6 @@ static inline void rcu_preempt_sleep_check(void)
} while (0)
 
 /**
- * lockless_dereference() - safely load a pointer for later dereference
- * @p: The pointer to load
- *
- * Similar to rcu_dereference(), but for situations where the pointed-to
- * object's lifetime is managed by something other than RCU.  That
- * "something other" might be reference counting or simple immortality.
- */
-#define lockless_dereference(p) \
-({ \
-   typeof(p) _p1 = ACCESS_ONCE(p); \
-   smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
-   (_p1); \
-})
-
-/**
  * rcu_access_pointer() - fetch RCU pointer with no dereferencing
  * @p: The pointer to read
  *
-- 
2.13.5

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 36/39] seqlock: Introduce raw_read_seqcount_latch()

2017-09-14 Thread Andrey Ryabinin
From: Peter Zijlstra 

Because with latches there is a strict data dependency on the seq load
we can avoid the rmb in favour of a read_barrier_depends.

Suggested-by: Ingo Molnar 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Rusty Russell 

https://jira.sw.ru/browse/PSBM-69081
(cherry picked from commit 7fc26327b75685f37f58d64bdb061460f834f80d)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/seqlock.h   | 8 ++--
 kernel/time/timekeeping.c | 2 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index ee088ed20a6c..9d8997027263 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -34,6 +34,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -170,6 +171,10 @@ static inline int read_seqcount_retry(const seqcount_t *s, 
unsigned start)
return __read_seqcount_retry(s, start);
 }
 
+static inline int raw_read_seqcount_latch(seqcount_t *s)
+{
+   return lockless_dereference(s->sequence);
+}
 
 /**
  * raw_write_seqcount_latch - redirect readers to even/odd copy
@@ -222,8 +227,7 @@ static inline int read_seqcount_retry(const seqcount_t *s, 
unsigned start)
  * unsigned seq, idx;
  *
  * do {
- * seq = latch->seq;
- * smp_rmb();
+ * seq = lockless_dereference(latch->seq);
  *
  * idx = seq & 0x01;
  * entry = data_query(latch->data[idx], ...);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 8e5b95064209..d99c89095bfd 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -292,7 +292,7 @@ static __always_inline u64 __ktime_get_fast_ns(struct 
tk_fast *tkf)
u64 now;
 
do {
-   seq = raw_read_seqcount(&tkf->seq);
+   seq = raw_read_seqcount_latch(&tkf->seq);
tkr = tkf->base + (seq & 0x01);
now = ktime_to_ns(tkr->base) + timekeeping_get_ns(tkr);
} while (read_seqcount_retry(&tkf->seq, seq));
-- 
2.13.5

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] ms/mm: mempool: kasan: don't put mempool objects in quarantine

2017-09-28 Thread Andrey Ryabinin
Currently we may put elements reserved by mempool into the quarantine via
kasan_kfree().  This is totally wrong since the quarantine may really free
these objects.  So when mempool tries to use such an element,
a use-after-free will happen.  Or mempool may decide that it no longer
needs that element and double-free it.

So don't put object into quarantine in kasan_kfree(), just poison it.
Rename kasan_kfree() to kasan_poison_kfree() to respect that.

Also, we shouldn't use kasan_slab_alloc()/kasan_krealloc() in
kasan_unpoison_element() because those functions may update allocation
stacktrace.  This would be wrong for most of the remove_element() call
sites.

(The only call site where we may want to update alloc stacktrace is
 in mempool_alloc(). Kmemleak solves this by calling
 kmemleak_update_trace(), so we could make something like that too.
 But this is out of scope of this patch).
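
As a sketch of the resulting pairing on the mempool side (the mm/mempool.c
hunk is truncated in this archive, so treat the helper bodies below as an
approximation of the upstream change rather than the literal diff):

    static void kasan_poison_element(mempool_t *pool, void *element)
    {
            if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
                    kasan_poison_kfree(element);    /* poison only, no quarantine */
            if (pool->alloc == mempool_alloc_pages)
                    kasan_free_pages(element, (unsigned long)pool->pool_data);
    }

    static void kasan_unpoison_element(mempool_t *pool, void *element, gfp_t flags)
    {
            if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
                    kasan_unpoison_slab(element);   /* ksize()-based unpoison; does
                                                     * not touch the alloc stack */
            if (pool->alloc == mempool_alloc_pages)
                    kasan_alloc_pages(element, (unsigned long)pool->pool_data);
    }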

Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/575977c3.1010...@virtuozzo.com
Signed-off-by: Andrey Ryabinin 
Reported-by: Kuthonuzo Luruo 
Acked-by: Alexander Potapenko 
Cc: Dmitriy Vyukov 
Cc: Kostya Serebryany 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-73165
(cherry picked from commit 9b75a867cc9ddbafcaf35029358ac500f2635ff3)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h |  9 +
 mm/kasan/kasan.c  |  6 +++---
 mm/mempool.c  | 12 
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 21cedc322d9a..5dc6eef8351d 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -50,14 +50,13 @@ void kasan_init_slab_obj(struct kmem_cache *cache, const 
void *object);
 
 void kasan_kmalloc_large(const void *ptr, size_t size, gfp_t flags);
 void kasan_kfree_large(const void *ptr);
-void kasan_kfree(void *ptr);
+void kasan_poison_kfree(void *ptr);
 void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size,
  gfp_t flags);
 void kasan_krealloc(const void *object, size_t new_size, gfp_t flags);
 
 void kasan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags);
 bool kasan_slab_free(struct kmem_cache *s, void *object);
-void kasan_poison_slab_free(struct kmem_cache *s, void *object);
 
 struct kasan_cache {
int alloc_meta_offset;
@@ -67,6 +66,8 @@ struct kasan_cache {
 int kasan_module_alloc(void *addr, size_t size);
 void kasan_free_shadow(const struct vm_struct *vm);
 
+size_t ksize(const void *);
+static inline void kasan_unpoison_slab(const void *ptr) { ksize(ptr); }
 size_t kasan_metadata_size(struct kmem_cache *cache);
 
 #else /* CONFIG_KASAN */
@@ -95,7 +96,7 @@ static inline void kasan_init_slab_obj(struct kmem_cache 
*cache,
 
 static inline void kasan_kmalloc_large(void *ptr, size_t size, gfp_t flags) {}
 static inline void kasan_kfree_large(const void *ptr) {}
-static inline void kasan_kfree(void *ptr) {}
+static inline void kasan_poison_kfree(void *ptr) {}
 static inline void kasan_kmalloc(struct kmem_cache *s, const void *object,
size_t size, gfp_t flags) {}
 static inline void kasan_krealloc(const void *object, size_t new_size,
@@ -107,11 +108,11 @@ static inline bool kasan_slab_free(struct kmem_cache *s, 
void *object)
 {
return false;
 }
-static inline void kasan_poison_slab_free(struct kmem_cache *s, void *object) 
{}
 
 static inline int kasan_module_alloc(void *addr, size_t size) { return 0; }
 static inline void kasan_free_shadow(const struct vm_struct *vm) {}
 
+static inline void kasan_unpoison_slab(const void *ptr) { }
 static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; 
}
 
 #endif /* CONFIG_KASAN */
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 8b9531312417..33bc171b5625 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -482,7 +482,7 @@ void kasan_slab_alloc(struct kmem_cache *cache, void 
*object, gfp_t flags)
kasan_kmalloc(cache, object, cache->object_size, flags);
 }
 
-void kasan_poison_slab_free(struct kmem_cache *cache, void *object)
+static void kasan_poison_slab_free(struct kmem_cache *cache, void *object)
 {
unsigned long size = cache->object_size;
unsigned long rounded_up_size = round_up(size, KASAN_SHADOW_SCALE_SIZE);
@@ -581,7 +581,7 @@ void kasan_krealloc(const void *object, size_t size, gfp_t 
flags)
kasan_kmalloc(page->slab_cache, object, size, flags);
 }
 
-void kasan_kfree(void *ptr)
+void kasan_poison_kfree(void *ptr)
 {
struct page *page;
 
@@ -591,7 +591,7 @@ void kasan_kfree(void *ptr)
kasan_poison_shadow(ptr, PAGE_SIZE << compound_order(page),
KASAN_FREE_PAGE);
else
-   kasan_slab_free(page->slab_cache, ptr);
+   kasan_poison_slab_free(page->slab_cache, ptr);
 }
 
 void kasan_kfr

[Devel] [PATCH rh7] mm: issue panic() on bad page/pte bugs if panic_on_warn is set.

2017-10-05 Thread Andrey Ryabinin
Bad page state bugs are a serious issue.  It's worth issuing a panic if
panic_on_warn is set, to collect a crash dump and catch the issue earlier.

https://jira.sw.ru/browse/PSBM-70168
Signed-off-by: Andrey Ryabinin 
---
 mm/memory.c | 2 ++
 mm/page_alloc.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index c30a042cebf5..b1c6968f1746 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -733,6 +733,8 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
   vma->vm_file->f_op->mmap);
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+   if (panic_on_warn)
+   panic("panic_on_warn set ...\n");
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ee5e9afd433..137d1d86ddf4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -463,6 +463,8 @@ out:
/* Leave bad fields for debug, except PageBuddy could make trouble */
page_mapcount_reset(page); /* remove PageBuddy */
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
+   if (panic_on_warn)
+   panic("panic_on_warn set ...\n");
 }
 
 /*
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm,fs: make vfs_cache_min_ratio=0 by default

2017-10-09 Thread Andrey Ryabinin
vfs_cache_min_ratio is a sysctl ported from PSBM-6.  It is supposed
to increase performance, but it was noticed to do the opposite:
https://jira.sw.ru/browse/PSBM-68644.

It is supposed to protect slabs from being reclaimed if their percentage is
lower than the sysctl value (2% by default).
However, it's not quite clear in what scenario it is supposed to improve
performance.  Moreover, I don't quite understand how this 2% protection is
supposed to change anything in the reclaim logic.  If we have a relatively
small number of reclaimable slab objects, they won't be reclaimed anyway
because of priority protection.  The following simple test:

#!/bin/bash
mkdir -p /vz/tst/
mkdir -p /sys/fs/cgroup/memory/tst

for i in {1..4} ; do
mkdir -p /sys/fs/cgroup/memory/tst/test$i
# 1G
echo $((1024*1024*1024)) > 
/sys/fs/cgroup/memory/tst/test$i/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/tst/test$i/tasks
mkdir -p /vz/tst/test$i

#fill slab_reclaimable
stat /vz/tst/test$i > /dev/null
for j in {1..10065} ; do
mkdir -p /vz/tst/test$i/test$j
stat /vz/tst/test$i/test$j > /dev/null
done

#read 1G file to fill page cache in memcg.
cat /vz/test$i > /dev/null
done

cat /sys/fs/cgroup/memory/tst/memory.stat
echo $$ > /sys/fs/cgroup/memory/tasks

#create memory pressure, sizeof /vz/fill_ram
#should be bigger than total ram.
cat /vz/fill_ram > /dev/null

shows that there is not much difference between the 0 and 2% settings.
In both cases we prefer to reclaim page cache.

So let's set vfs_cache_min_ratio to 0 and see how it goes.

https://jira.sw.ru/browse/PSBM-69672
Signed-off-by: Andrey Ryabinin 
---
 fs/dcache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 2cb018a05078..ce8e24f22f9a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -86,7 +86,7 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
 
-int sysctl_vfs_cache_min_ratio __read_mostly = 2;
+int sysctl_vfs_cache_min_ratio __read_mostly = 0;
 
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
 
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] ms/pidns: fix NULL dereference in __task_pid_nr_ns()

2017-10-18 Thread Andrey Ryabinin
From: Eric Dumazet 

commit 81b1a832d79749058863cffe2c0ed4ef40f6e6ec upstream.

I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :

pid_nr_ns() was inlined, but apparently the compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :

if (pid && ns->level <= pid->level) { // CRASH

Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(

get_task_pid() can benefit from the same fix.
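
To make the race concrete, a sketch of what the inlined code was effectively
doing versus the fixed version (illustrative, not the literal generated code):

    /* before: plain load, the compiler may re-read the pointer */
    if (task->pids[type].pid &&                         /* read #1: non-NULL  */
        ns->level <= task->pids[type].pid->level)       /* read #2: may be NULL */
            /* ... */;

    /* after: the pointer is loaded exactly once, under RCU */
    struct pid *pid = rcu_dereference(task->pids[type].pid);

    if (pid && ns->level <= pid->level)
            /* ... */;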

Signed-off-by: Eric Dumazet 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-75247
Signed-off-by: Andrey Ryabinin 
---
 kernel/pid.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 4f8d1d6d50fa..d1f9d4ccf9a5 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -511,7 +511,7 @@ struct pid *get_task_pid(struct task_struct *task, enum 
pid_type type)
rcu_read_lock();
if (type != PIDTYPE_PID)
task = task->group_leader;
-   pid = get_pid(task->pids[type].pid);
+   pid = get_pid(rcu_dereference(task->pids[type].pid));
rcu_read_unlock();
return pid;
 }
@@ -572,7 +572,7 @@ pid_t __task_pid_nr_ns(struct task_struct *task, enum 
pid_type type,
if (likely(pid_alive(task))) {
if (type != PIDTYPE_PID)
task = task->group_leader;
-   nr = pid_nr_ns(task->pids[type].pid, ns);
+   nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
}
rcu_read_unlock();
 
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] fs/nfs: don't use delayed unmount for nfs.

2017-10-27 Thread Andrey Ryabinin
Delayed NFS unmount is too much of a PITA.  We must destroy the VENET IP after
unmount, but in that case we can't reuse that IP in a restarted container
because it might still be alive.

So let's just unmount NFS synchronously and destroy veip after it.

https://jira.sw.ru/browse/PSBM-76086
Signed-off-by: Andrey Ryabinin 
---
 drivers/net/venetdev.c | 9 ++---
 fs/namespace.c | 3 ++-
 fs/nfs/super.c | 1 +
 include/linux/fs.h | 4 
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
index 1c4ae90b7ba8..11f4a66aaf3d 100644
--- a/drivers/net/venetdev.c
+++ b/drivers/net/venetdev.c
@@ -765,7 +765,7 @@ static void venet_dellink(struct net_device *dev, struct 
list_head *head)
 * has VE_FEATURE_NFS enabled. Thus here we have to destroy veip in
 * this case.
 */
-   if (env->ve_netns || (env->features & VE_FEATURE_NFS))
+   if (env->ve_netns)
veip_shutdown(env);
 
env->_venet_dev = NULL;
@@ -1182,12 +1182,7 @@ static struct rtnl_link_ops venet_link_ops = {
 
 static void veip_shutdown_fini(void *data)
 {
-   struct ve_struct *ve = data;
-
-   if (ve->features & VE_FEATURE_NFS)
-   return;
-
-   veip_shutdown(ve);
+   veip_shutdown(data);
 }
 
 static struct ve_hook veip_shutdown_hook = {
diff --git a/fs/namespace.c b/fs/namespace.c
index 2c9824985bc5..c2489dd2f520 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1134,7 +1134,8 @@ static void mntput_no_expire(struct mount *mnt)
}
unlock_mount_hash();
 
-   if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
+   if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))
+   && !(mnt->mnt.mnt_sb->s_iflags & SB_I_UMOUNT_SYNC)) {
struct task_struct *task = current;
if (likely(!(task->flags & PF_KTHREAD))) {
init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 8f29ad17e29e..65a0ac8a3d16 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2414,6 +2414,7 @@ static int nfs_set_super(struct super_block *s, void 
*data)
int ret;
 
s->s_flags = sb_mntdata->mntflags;
+   s->s_iflags |= SB_I_UMOUNT_SYNC;
s->s_fs_info = server;
s->s_d_op = server->nfs_client->rpc_ops->dentry_ops;
ret = set_anon_super(s, server);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79011b4bc040..2f3a983741f8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1526,6 +1526,9 @@ struct mm_struct;
 #define UMOUNT_NOFOLLOW0x0008  /* Don't follow symlink on 
umount */
 #define UMOUNT_UNUSED  0x8000  /* Flag guaranteed to be unused */
 
+/* sb->s_iflags */
+#define SB_I_UMOUNT_SYNC   0x1000 /* don't use delayed unmount 
*/
+
 extern struct list_head super_blocks;
 extern spinlock_t sb_lock;
 
@@ -1566,6 +1569,7 @@ struct super_block {
const struct quotactl_ops   *s_qcop;
const struct export_operations *s_export_op;
unsigned long   s_flags;
+   unsigned long   s_iflags;   /* internal SB_I_* flags */
unsigned long   s_magic;
struct dentry   *s_root;
struct rw_semaphore s_umount;
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] fs/nfs: don't use delayed unmount for nfs.

2017-10-30 Thread Andrey Ryabinin


On 10/27/2017 08:45 PM, Andrei Vagin wrote:
> On Fri, Oct 27, 2017 at 06:31:18PM +0300, Andrey Ryabinin wrote:
>> Delayed nfs unmount causes too much PITA. We must destroy VENET ip after
>> unmount, but in that case we can't reuse that IP on restarted container
>> because it migh be still alive.
>>
>> So let's just unmount NFS synchronously and destroy veip after it.
> 
> You change a general scenario to fix your small case. For users, it will
> be unexpected behaviour. They call umount -l and don't expect any
> delays.


This has nothing to do with MNT_DETACH (umount -l) at all.  It mostly
affects ordinary umount.
Currently, due to the delayed mntput thing (see commit 9ea459e110df32e6
upstream), a successful "umount /mnt" doesn't mean that the umount actually
finished.  So this patch only brings back the pre-9ea459e110df32e6 behaviour
for NFS.



> How nfs mounts are umounted when a host is shutdowned? I think they are
> umounted from init scripts (systemd). Why we can't umount nfs mounts
> with the force flag when we stop a container?
> 

It doesn't really matter how the umount is triggered.  Currently there is no
way to do an unmount synchronously.
You may call "umount /mnt" whenever you want, but a successfully finished
umount doesn't guarantee that delayed_mntput() has finished or even started.


FYI, Stas suggested another possible way to fix this:
 "Bring back that tricky logic, allowing catching falling VEIP object and reuse 
it."

But it's unclear to me how this is supposed to work.  This basically means
that we might have two interfaces with the same IP address.
So, when a packet arrives, it's unclear to which interface we are supposed to
redirect it.
Thus, bringing back the pre-9ea459e110df32e6 behaviour seems like the only way
to me.


>>
>> https://jira.sw.ru/browse/PSBM-76086
>> Signed-off-by: Andrey Ryabinin 
>> ---
>>  drivers/net/venetdev.c | 9 ++---
>>  fs/namespace.c | 3 ++-
>>  fs/nfs/super.c | 1 +
>>  include/linux/fs.h | 4 
>>  4 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/net/venetdev.c b/drivers/net/venetdev.c
>> index 1c4ae90b7ba8..11f4a66aaf3d 100644
>> --- a/drivers/net/venetdev.c
>> +++ b/drivers/net/venetdev.c
>> @@ -765,7 +765,7 @@ static void venet_dellink(struct net_device *dev, struct 
>> list_head *head)
>>   * has VE_FEATURE_NFS enabled. Thus here we have to destroy veip in
>>   * this case.
>>   */
>> -if (env->ve_netns || (env->features & VE_FEATURE_NFS))
>> +if (env->ve_netns)
>>  veip_shutdown(env);
>>  
>>  env->_venet_dev = NULL;
>> @@ -1182,12 +1182,7 @@ static struct rtnl_link_ops venet_link_ops = {
>>  
>>  static void veip_shutdown_fini(void *data)
>>  {
>> -struct ve_struct *ve = data;
>> -
>> -if (ve->features & VE_FEATURE_NFS)
>> -return;
>> -
>> -veip_shutdown(ve);
>> +veip_shutdown(data);
>>  }
>>  
>>  static struct ve_hook veip_shutdown_hook = {
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 2c9824985bc5..c2489dd2f520 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1134,7 +1134,8 @@ static void mntput_no_expire(struct mount *mnt)
>>  }
>>  unlock_mount_hash();
>>  
>> -if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
>> +if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))
>> +&& !(mnt->mnt.mnt_sb->s_iflags & SB_I_UMOUNT_SYNC)) {
>>  struct task_struct *task = current;
>>  if (likely(!(task->flags & PF_KTHREAD))) {
>>  init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
>> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
>> index 8f29ad17e29e..65a0ac8a3d16 100644
>> --- a/fs/nfs/super.c
>> +++ b/fs/nfs/super.c
>> @@ -2414,6 +2414,7 @@ static int nfs_set_super(struct super_block *s, void 
>> *data)
>>  int ret;
>>  
>>  s->s_flags = sb_mntdata->mntflags;
>> +s->s_iflags |= SB_I_UMOUNT_SYNC;
>>  s->s_fs_info = server;
>>  s->s_d_op = server->nfs_client->rpc_ops->dentry_ops;
>>  ret = set_anon_super(s, server);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 79011b4bc040..2f3a983741f8 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1526,6 +1526,9 @@ struct mm_struct;
>>  #define UMOUNT_NOFOLLOW 0x0008  /* Don't follow symlink on 
>> umount */
>>  #define UM

[Devel] [PATCH rh7 2/2] mm/memcg: Fix potential softlockup during memcgroup shutdown.

2017-10-30 Thread Andrey Ryabinin
On a huge memory cgroup, mem_cgroup_force_empty_list() may iterate
for a long time without rescheduling and cause a softlockup.
Add cond_resched() to avoid this.
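
For illustration only (a minimal sketch with made-up names, not part of the
patch): the pattern is simply to give the scheduler a chance inside the
potentially very long per-page loop.

#include <linux/list.h>
#include <linux/sched.h>

static void drain_pages(struct list_head *list)
{
        while (!list_empty(list)) {
                /* ... move or uncharge one page from the list ... */
                cond_resched();         /* avoid soft lockups on huge cgroups */
        }
}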

https://jira.sw.ru/browse/PSBM-76011
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index efc455d8ca81..a7fa84a9980a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4038,8 +4038,10 @@ static void mem_cgroup_force_empty_list(struct 
mem_cgroup *memcg,
/* found lock contention or "pc" is obsolete. */
busy = page;
schedule_timeout_uninterruptible(1);
-   } else
+   } else {
busy = NULL;
+   cond_resched();
+   }
} while (!list_empty(list));
 }
 
-- 
2.13.6



[Devel] [PATCH rh7 1/2] mm/memcg: Don't enable interrupts too soon.

2017-10-30 Thread Andrey Ryabinin
When mem_cgroup_move_parent() moves a huge page, it disables interrupts:

if (nr_pages > 1) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
flags = compound_lock_irqsave(page);
}

and calls:
ret = mem_cgroup_move_account(page, nr_pages, ...

which does the following:

local_irq_disable();
mem_cgroup_charge_statistics(to, page, nr_pages);
...
local_irq_enable();

So the last local_irq_enable() enables interrupts too early, which may lead
to a deadlock. mem_cgroup_move_account() should use the local_irq_save()/
local_irq_restore() primitives instead.
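
A minimal sketch of the intended pattern (illustrative names, not the actual
memcg code): a helper that may be called with interrupts already disabled must
save and restore the caller's IRQ state instead of unconditionally re-enabling
interrupts.

#include <linux/irqflags.h>

static void move_stats(void)
{
        unsigned long flags;

        local_irq_save(flags);          /* correct whether IRQs were on or off */
        /* ... update per-memcg statistics ... */
        local_irq_restore(flags);       /* restore the caller's IRQ state */
}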

Found while investigating https://jira.sw.ru/browse/PSBM-76011
but unrelated.

Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 239fbca70b59..efc455d8ca81 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3597,15 +3597,14 @@ static int mem_cgroup_move_account(struct page *page,
 
/* caller should have done css_get */
pc->mem_cgroup = to;
-   move_unlock_mem_cgroup(from, &flags);
+   spin_unlock(&from->move_lock);
ret = 0;
 
-   local_irq_disable();
mem_cgroup_charge_statistics(to, page, nr_pages);
memcg_check_events(to, page);
mem_cgroup_charge_statistics(from, page, -nr_pages);
memcg_check_events(from, page);
-   local_irq_enable();
+   local_irq_restore(flags);
 out_unlock:
unlock_page(page);
 out:
-- 
2.13.6



Re: [Devel] [PATCH RH7 0/3] ve: properly handle nr_cpus and cpu_rate for nested cgroups

2017-11-02 Thread Andrey Ryabinin


On 11/01/2017 04:49 PM, Pavel Tikhomirov wrote:
> https://jira.sw.ru/browse/PSBM-69678
> 
> Pavel Tikhomirov (3):
>   cgroup: remove rcu_read_lock from cgroup_get_ve_root
>   cgroup: make cgroup_get_ve_root visible in kernel/sched/core.c
>   sched: take nr_cpus and cpu_rate from ve root task group
> 

Reviewed-by: Andrey Ryabinin 

>  include/linux/sched.h |  2 ++
>  include/linux/ve.h|  7 +++
>  kernel/cgroup.c   |  9 +
>  kernel/sched/core.c   | 56 
> +--
>  kernel/sched/fair.c   |  9 +
>  5 files changed, 60 insertions(+), 23 deletions(-)
> 


[Devel] [PATCH rh7 1/2] ms/mm: introduce kv[mz]alloc helpers

2017-11-03 Thread Andrey Ryabinin
This is only a small part of upstream commit
a7c3e901a46ff54c016d040847eda598a9e3e653. I backported only the
part that introduces the kv[mz]alloc helpers.

Description of the original patch:

commit a7c3e901a46ff54c016d040847eda598a9e3e653
Author: Michal Hocko 
Date:   Mon May 8 15:57:09 2017 -0700

mm: introduce kv[mz]alloc helpers

Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree.  Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen, implementing kvmalloc requires quite an intimate
knowledge of the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mho...@kernel.org
[2] 
http://lkml.kernel.org/r/1485273626.16328.301.ca...@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures.  This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers.  Existing ones would have to die
slowly.
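
For reference, a minimal sketch (made-up function names) of the open-coded
pattern these helpers replace, and what it collapses to with
kvmalloc()/kvfree():

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* the open-coded pattern found all over the tree */
static void *table_alloc_old(size_t size)
{
        void *p = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);

        if (!p)
                p = vmalloc(size);
        return p;
}

/* with the new helper; the result is freed with kvfree() either way */
static void *table_alloc_new(size_t size)
{
        return kvmalloc(size, GFP_KERNEL);
}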

https://jira.sw.ru/browse/PSBM-76752
Signed-off-by: Andrey Ryabinin 
---
 include/linux/mm.h   | 14 +
 include/linux/vmalloc.h  |  1 +
 mm/nommu.c   |  5 +++
 mm/util.c| 45 ++
 mm/vmalloc.c |  2 +-
 security/apparmor/apparmorfs.c   |  2 +-
 security/apparmor/include/apparmor.h |  2 --
 security/apparmor/lib.c  | 61 
 security/apparmor/match.c|  2 +-
 9 files changed, 68 insertions(+), 66 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c806f43b5b59..897d7cfd2269 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -401,6 +401,20 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 }
 #endif
 
+extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
+static inline void *kvmalloc(size_t size, gfp_t flags)
+{
+   return kvmalloc_node(size, flags, NUMA_NO_NODE);
+}
+static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
+{
+   return kvmalloc_node(size, flags | __GFP_ZERO, node);
+}
+static inline void *kvzalloc(size_t size, gfp_t flags)
+{
+   return kvmalloc(size, flags | __GFP_ZERO);
+}
+
 extern void kvfree(const void *addr);
 
 static inline void compound_lock(struct page *page)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 6ea82cf30dc1..59c80dd655a3 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -81,6 +81,7 @@ extern void *__vmalloc_node_range(unsigned long size, 
unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller);
+extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 
 extern void vfree(const void *addr);
 
diff --git a/mm/nommu.c b/mm/nommu.c
index beecd953c29c..a16aee9188a8 100644
--

[Devel] [PATCH rh7 2/2] ms/mm: memcontrol: use vmalloc fallback for large kmem memcg arrays

2017-11-03 Thread Andrey Ryabinin
From: Johannes Weiner 

commit f80c7dab95a1f0f968acbafe4426ee9525b6f6ab upstream.

For quick per-memcg indexing, slab caches and list_lru structures
maintain linear arrays of descriptors. As the number of concurrent
memory cgroups in the system goes up, this requires large contiguous
allocations (8k cgroups = order-5, 16k cgroups = order-6 etc.) for
every existing slab cache and list_lru, which can easily fail on
loaded systems. E.g.:

mkdir: page allocation failure: order:5, mode:0x14040c0(GFP_KERNEL|__GFP_COMP), 
nodemask=(null)
CPU: 1 PID: 6399 Comm: mkdir Not tainted 4.13.0-mm1-00065-g720bbe532b7c-dirty 
#481
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-20170228_101828-anatol 04/01/2014
Call Trace:
 dump_stack+0x70/0x9d
 warn_alloc+0xd6/0x170
 ? __alloc_pages_direct_compact+0x4c/0x110
 __alloc_pages_nodemask+0xf50/0x1430
 ? __lock_acquire+0xd19/0x1360
 ? memcg_update_all_list_lrus+0x2e/0x2e0
 ? __mutex_lock+0x7c/0x950
 ? memcg_update_all_list_lrus+0x2e/0x2e0
 alloc_pages_current+0x60/0xc0
 kmalloc_order_trace+0x29/0x1b0
 __kmalloc+0x1f4/0x320
 memcg_update_all_list_lrus+0xca/0x2e0
 mem_cgroup_css_alloc+0x612/0x670
 cgroup_apply_control_enable+0x19e/0x360
 cgroup_mkdir+0x322/0x490
 kernfs_iop_mkdir+0x55/0x80
 vfs_mkdir+0xd0/0x120
 SyS_mkdirat+0x6c/0xe0
 SyS_mkdir+0x14/0x20
 entry_SYSCALL_64_fastpath+0x18/0xad
RIP: 0033:0x7f9ff36cee87
RSP: 002b:7ffc7612d758 EFLAGS: 0202 ORIG_RAX: 0053
RAX: ffda RBX: 7ffc7612da48 RCX: 7f9ff36cee87
RDX: 01ff RSI: 01ff RDI: 7ffc7612de86
RBP: 0002 R08: 01ff R09: 00401db0
R10: 01e2 R11: 0202 R12: 
R13: 7ffc7612da40 R14:  R15: 
Mem-Info:
active_anon:2965 inactive_anon:19 isolated_anon:0
 active_file:100270 inactive_file:98846 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 slab_reclaimable:7328 slab_unreclaimable:16402
 mapped:771 shmem:52 pagetables:278 bounce:0
 free:13718 free_pcp:0 free_cma:0

This output is from an artificial reproducer, but we have repeatedly
observed order-7 failures in production in the Facebook fleet. These
systems become useless as they cannot run more jobs, even though there
is plenty of memory to allocate 128 individual pages.

Use kvmalloc and kvzalloc to fall back to vmalloc space if these
arrays prove too large for allocating them physically contiguous.
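
A rough userspace sketch of where the order-5 figure above comes from (my own
arithmetic, assuming 4 KiB pages, 8-byte pointers and a small array header;
kmalloc rounds the request up to the next power-of-two bucket):

#include <stdio.h>

int main(void)
{
        unsigned long nr_memcg = 8192, ptr_size = 8, page = 4096;
        unsigned long size = nr_memcg * ptr_size + 64;  /* + header, illustrative */
        unsigned long bucket = page, order = 0;

        while (bucket < size) {
                bucket <<= 1;           /* next power-of-two kmalloc bucket */
                order++;
        }
        /* prints: size=65600 -> 32 contiguous pages, order-5 */
        printf("size=%lu -> %lu contiguous pages, order-%lu\n",
               size, bucket / page, order);
        return 0;
}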

Link: http://lkml.kernel.org/r/20170918184919.20644-1-han...@cmpxchg.org
Signed-off-by: Johannes Weiner 
Reviewed-by: Josef Bacik 
Acked-by: Michal Hocko 
Acked-by: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-76752
Signed-off-by: Andrey Ryabinin 
---
 mm/list_lru.c| 17 +++--
 mm/slab_common.c | 20 ++--
 2 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5adc6621b338..91dccc1e30bf 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -322,13 +322,13 @@ static int memcg_init_list_lru_node(struct list_lru_node 
*nlru)
struct list_lru_memcg *memcg_lrus;
int size = memcg_nr_cache_ids;
 
-   memcg_lrus = kmalloc(sizeof(*memcg_lrus) +
+   memcg_lrus = kvmalloc(sizeof(*memcg_lrus) +
 size * sizeof(void *), GFP_KERNEL);
if (!memcg_lrus)
return -ENOMEM;
 
if (__memcg_init_list_lru_node(memcg_lrus, 0, size)) {
-   kfree(memcg_lrus);
+   kvfree(memcg_lrus);
return -ENOMEM;
}
rcu_assign_pointer(nlru->memcg_lrus, memcg_lrus);
@@ -346,7 +346,12 @@ static void memcg_destroy_list_lru_node(struct 
list_lru_node *nlru)
 */
memcg_lrus = rcu_dereference_check(nlru->memcg_lrus, true);
__memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
-   kfree(memcg_lrus);
+   kvfree(memcg_lrus);
+}
+
+static void free_list_lru_memcg(struct rcu_head *head)
+{
+   kvfree(container_of(head, struct list_lru_memcg, rcu));
 }
 
 static int memcg_update_list_lru_node(struct list_lru_node *nlru,
@@ -359,12 +364,12 @@ static int memcg_update_list_lru_node(struct 
list_lru_node *nlru,
 
/* list_lrus_mutex is held, nobody can change memcg_lrus. Silence RCU */
old = rcu_dereference_check(nlru->memcg_lrus, true);
-   new = kmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
+   new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
if (!new)
return -ENOMEM;
 
if (__memcg_init_list_lru_node(new, old_size, new_size)) {
-   kfree(new);
+   kvfree(new);
return -ENOMEM;
}
 
@@ -381,7 +386,7 @@ static int memcg_update_list_lru_node(struct list_lru_node 
*nlru,
rcu_assign_pointer(nlru->memcg_lrus, new);
spin_unlock_irq(&nlru->lock);
 
-  

Re: [Devel] [PATCH] tswap, tcache: Increase shrinkers seeks

2017-11-08 Thread Andrey Ryabinin


On 11/08/2017 01:21 PM, Kirill Tkhai wrote:
> Commit e008b95a28ef95dd4bb08f69c89d26fc5fa7411a
> "ms/mm: use sc->priority for slab shrink targets"
> exposed the fact we shrinks too many tcache pages.
> 
> Shrinkers of {in,}active pages shrink up to 32
> pages, while tcache and tswap shrinks 128 pages.
> This became a reason of tcache active test fail.
> 
> This patch makes numbers of shrinked pages of tcache
> and tswap in consistent state with pages shrinkers,
> and restores the test-expected behaviour.
> 
> https://jira.sw.ru/browse/PSBM-72584
> 
> Signed-off-by: Kirill Tkhai 

Acked-by: Andrey Ryabinin 


Re: [Devel] [PATCH] mm: Fix mis accounting of isolated pages in memcg_numa_isolate_pages()

2017-11-13 Thread Andrey Ryabinin


On 11/13/2017 02:50 PM, Kirill Tkhai wrote:
> When split_huge_page_to_list() fails, and a huge page is going back
> to LRU, the number of isolated pages is decreasing. So we must
> subtract HPAGE_PMD_NR from NR_ISOLATED_ANON counter, not to add it.
> 
> Otherwise, we may bumped into a situation, when number of isolated
> pages grows up to number of inactive pages, and direct reclaim hangs in:
> 
>   shrink_inactive_list()
>  while (too_many_isolated())
> congestion_wait(BLK_RW_ASYNC, HZ/10),
> 
> waiting for the counter becomes less. But it has no a chance
> to finish, and hangs forever. Fix that.
> 
> https://jira.sw.ru/browse/PSBM-76970
> 
> Signed-off-by: Kirill Tkhai 

Acked-by: Andrey Ryabinin 


[Devel] [PATCH rh7 3/4] kernel/ucount.c: mark user_header with kmemleak_ignore()

2017-11-14 Thread Andrey Ryabinin
From: "Luis R. Rodriguez" 

The user_header gets caught by kmemleak with the following splat as
missing a free:

  unreferenced object 0x99667a733d80 (size 96):
  comm "swapper/0", pid 1, jiffies 4294892317 (age 62191.468s)
  hex dump (first 32 bytes):
a0 b6 92 b4 ff ff ff ff 00 00 00 00 01 00 00 00  
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
 kmemleak_alloc+0x4a/0xa0
 __kmalloc+0x144/0x260
 __register_sysctl_table+0x54/0x5e0
 register_sysctl+0x1b/0x20
 user_namespace_sysctl_init+0x17/0x34
 do_one_initcall+0x52/0x1a0
 kernel_init_freeable+0x173/0x200
 kernel_init+0xe/0x100
 ret_from_fork+0x2c/0x40

The BUG_ON()s are intended to crash so no need to clean up after
ourselves on error there.  This is also a kernel/ subsys_init() we don't
need a respective exit call here as this is never modular, so just white
list it.

Link: http://lkml.kernel.org/r/20170203211404.31458-1-mcg...@kernel.org
Signed-off-by: Luis R. Rodriguez 
Cc: Eric W. Biederman 
Cc: Kees Cook 
Cc: Nikolay Borisov 
Cc: Serge Hallyn 
Cc: Jan Kara 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-76924
(cherry picked from commit ed5bd7dc88edf4a4a9c67130742b1b59aa017a5f)
Signed-off-by: Andrey Ryabinin 
---
 kernel/ucount.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 4aea3f02f287..533f78323f27 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -241,11 +241,10 @@ static __init int user_namespace_sysctl_init(void)
 * properly.
 */
user_header = register_sysctl("user", empty);
+   kmemleak_ignore(user_header);
BUG_ON(!user_header);
BUG_ON(!setup_userns_sysctls(&init_user_ns));
 #endif
return 0;
 }
 subsys_initcall(user_namespace_sysctl_init);
-
-
-- 
2.13.6



[Devel] [PATCH rh7 4/4] tty/vt: Fix the memory leak in visual_init

2017-11-14 Thread Andrey Ryabinin
From: Dongxing Zhang 

If vc->vc_uni_pagedir_loc is not NULL, its refcount needs to be
decreased before vc_uni_pagedir_loc is re-assigned.

unreferenced object 0x88002cdd13b0 (size 512):
  comm "setfont", pid 503, jiffies 4294896503 (age 722.828s)
  hex dump (first 32 bytes):
40 92 61 2b 00 88 ff ff 00 00 00 00 00 00 00 00  @.a+
00 00 00 00 00 00 00 00 a0 ad 61 2b 00 88 ff ff  ..a+
  backtrace:
[] kmemleak_alloc+0x4e/0xb0
[] kmem_cache_alloc_trace+0x1c8/0x240
[] con_do_clear_unimap.isra.2+0x83/0xe0
[] con_clear_unimap+0x22/0x40
[] vt_ioctl+0xeb8/0x1170
[] tty_ioctl+0x208/0xca0
[] do_vfs_ioctl+0x2f8/0x510
[] SyS_ioctl+0x81/0xa0
[] system_call_fastpath+0x16/0x75
[] 0x
unreferenced object 0x88002b619240 (size 256):
  comm "setfont", pid 503, jiffies 4294896503 (age 722.828s)
  hex dump (first 32 bytes):
90 bc 84 d5 00 88 ff ff 58 85 84 d5 00 88 ff ff  X...
88 ac 84 d5 00 88 ff ff e0 b1 84 d5 00 88 ff ff  
  backtrace:
[] kmemleak_alloc+0x4e/0xb0
[] kmem_cache_alloc_trace+0x1c8/0x240
[] con_insert_unipair+0x86/0x170
[] con_set_unimap+0x1b7/0x280
[] vt_ioctl+0xe65/0x1170
[] tty_ioctl+0x208/0xca0
[] do_vfs_ioctl+0x2f8/0x510
[] SyS_ioctl+0x81/0xa0
[] system_call_fastpath+0x16/0x75
[] 0x

Signed-off-by: Dongxing Zhang 
Signed-off-by: Xiaoming Wang 
Reviewed-by: Peter Hurley 
Tested-by: Konstantin Khlebnikov 
Signed-off-by: Greg Kroah-Hartman 

https://jira.sw.ru/browse/PSBM-76924
(cherry picked from commit 08b33249d89700ba555d4ab5cc88714192b8ee46)
Signed-off-by: Andrey Ryabinin 
---
 drivers/tty/vt/vt.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 07c5666c2c30..fbc6290e417a 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -742,6 +742,8 @@ static void visual_init(struct vc_data *vc, int num, int 
init)
__module_get(vc->vc_sw->owner);
vc->vc_num = num;
vc->vc_display_fg = &master_display_fg;
+   if (vc->vc_uni_pagedir_loc)
+   con_free_unimap(vc);
vc->vc_uni_pagedir_loc = &vc->vc_uni_pagedir;
vc->vc_uni_pagedir = 0;
vc->vc_hi_font_mask = 0;
-- 
2.13.6



[Devel] [PATCH rh7 2/4] net: sysctl: fix a kmemleak warning

2017-11-14 Thread Andrey Ryabinin
From: Li RongQing 

The buffer returned by register_sysctl() is stored into the net_header
variable, but net_header is not used afterwards, so the compiler may
optimise the variable out, which leads kmemleak to report the warning
below:
comm "swapper/0", pid 1, jiffies 4294937448 (age 267.270s)
hex dump (first 32 bytes):
90 38 8b 01 c0 ff ff ff 00 00 00 00 01 00 00 00 .8..
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
backtrace:
[] create_object+0x10c/0x2a0
[] kmemleak_alloc+0x54/0xa0
[] __kmalloc+0x1f8/0x4f8
[] __register_sysctl_table+0x64/0x5a0
[] register_sysctl+0x30/0x40
[] net_sysctl_init+0x20/0x58
[] sock_init+0x10/0xb0
[] do_one_initcall+0x90/0x1b8
[] kernel_init_freeable+0x218/0x2f0
[] kernel_init+0x1c/0xe8
[] ret_from_fork+0xc/0x50
[] 0x <>

Before fix, the objdump result on ARM64:
 :
   0:   a9be7bfdstp x29, x30, [sp,#-32]!
   4:   9001adrpx1, 0 
   8:   9000adrpx0, 0 
   c:   910003fdmov x29, sp
  10:   9121add x1, x1, #0x0
  14:   9100add x0, x0, #0x0
  18:   a90153f3stp x19, x20, [sp,#16]
  1c:   12800174mov w20, #0xfff4// #-12
  20:   9400bl  0 
  24:   b4000120cbz x0, 48 
  28:   9013adrpx19, 0 
  2c:   91000273add x19, x19, #0x0
  30:   9101a260add x0, x19, #0x68
  34:   9400bl  0 
  38:   2a0003f4mov w20, w0
  3c:   3560cbnzw0, 48 
  40:   aa1303e0mov x0, x19
  44:   9400bl  0 
  48:   2a1403e0mov w0, w20
  4c:   a94153f3ldp x19, x20, [sp,#16]
  50:   a8c27bfdldp x29, x30, [sp],#32
  54:   d65f03c0ret
After:
 :
   0:   a9bd7bfdstp x29, x30, [sp,#-48]!
   4:   9000adrpx0, 0 
   8:   910003fdmov x29, sp
   c:   a90153f3stp x19, x20, [sp,#16]
  10:   9013adrpx19, 0 
  14:   9100add x0, x0, #0x0
  18:   91000273add x19, x19, #0x0
  1c:   f90013f5str x21, [sp,#32]
  20:   aa1303e1mov x1, x19
  24:   12800175mov w21, #0xfff4// #-12
  28:   9400bl  0 
  2c:   f9002260str x0, [x19,#64]
  30:   b40001a0cbz x0, 64 
  34:   9014adrpx20, 0 
  38:   91000294add x20, x20, #0x0
  3c:   9101a280add x0, x20, #0x68
  40:   9400bl  0 
  44:   2a0003f5mov w21, w0
  48:   3580cbnzw0, 58 
  4c:   aa1403e0mov x0, x20
  50:   9400bl  0 
  54:   1404b   64 
  58:   f9402260ldr x0, [x19,#64]
  5c:   9400bl  0 
  60:   f900227fstr xzr, [x19,#64]
  64:   2a1503e0mov w0, w21
  68:   f94013f5ldr x21, [sp,#32]
  6c:   a94153f3ldp x19, x20, [sp,#16]
  70:   a8c37bfdldp x29, x30, [sp],#48
  74:   d65f03c0ret

Add the error handling path to free net_header, which removes the
kmemleak warning.

Signed-off-by: Li RongQing 
Signed-off-by: David S. Miller 

https://jira.sw.ru/browse/PSBM-76924
(cherry picked from commit ce9d9b8e5c2b7486edf76958bcdb5e6534a915b0)
Signed-off-by: Andrey Ryabinin 
---
 net/sysctl_net.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sysctl_net.c b/net/sysctl_net.c
index 42279fdceab4..62eb022db796 100644
--- a/net/sysctl_net.c
+++ b/net/sysctl_net.c
@@ -94,10 +94,14 @@ __init int net_sysctl_init(void)
goto out;
ret = register_pernet_subsys(&sysctl_pernet_ops);
if (ret)
-   goto out;
+   goto out1;
register_sysctl_root(&net_sysctl_root);
 out:
return ret;
+out1:
+   unregister_sysctl_table(net_header);
+   net_header = NULL;
+   goto out;
 }
 
 struct ctl_table_header *register_net_sysctl(struct net *net,
-- 
2.13.6



[Devel] [PATCH rh7 1/4] debugobjects: Make kmemleak ignore debug objects

2017-11-14 Thread Andrey Ryabinin
From: Waiman Long 

The allocated debug objects are either on the free list or in the
hashed bucket lists. So they won't get lost. However if both debug
objects and kmemleak are enabled and kmemleak scanning is done
while some of the debug objects are transitioning from one list to
the others, false negative reporting of memory leaks may happen for
those objects. For example,

[38687.275678] kmemleak: 12 new suspected memory leaks (see
/sys/kernel/debug/kmemleak)
unreferenced object 0x92e98aabeb68 (size 40):
  comm "ksmtuned", pid 4344, jiffies 4298403600 (age 906.430s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 d0 bc db 92 e9 92 ff ff  
01 00 00 00 00 00 00 00 38 36 8a 61 e9 92 ff ff  86.a
  backtrace:
[] kmemleak_alloc+0x4a/0xa0
[] kmem_cache_alloc+0xe9/0x320
[] __debug_object_init+0x3e6/0x400
[] debug_object_activate+0x131/0x210
[] __call_rcu+0x3f/0x400
[] call_rcu_sched+0x1d/0x20
[] put_object+0x2c/0x40
[] __delete_object+0x3c/0x50
[] delete_object_full+0x1d/0x20
[] kmemleak_free+0x32/0x80
[] kmem_cache_free+0x77/0x350
[] unlink_anon_vmas+0x82/0x1e0
[] free_pgtables+0xa1/0x110
[] exit_mmap+0xc1/0x170
[] mmput+0x80/0x150
[] do_exit+0x2a9/0xd20

The references in the debug objects may also hide a real memory leak.

As there is no point in having kmemleak to track debug object
allocations, kmemleak checking is now disabled for debug objects.

Signed-off-by: Waiman Long 
Signed-off-by: Thomas Gleixner 
Cc: Andrew Morton 
Link: 
http://lkml.kernel.org/r/1502718733-8527-1-git-send-email-long...@redhat.com

https://jira.sw.ru/browse/PSBM-76924
Signed-off-by: Andrey Ryabinin 
---
 init/main.c| 2 +-
 lib/debugobjects.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index dd890da69d09..364d4a79dfd9 100644
--- a/init/main.c
+++ b/init/main.c
@@ -615,8 +615,8 @@ asmlinkage void __init start_kernel(void)
}
 #endif
page_cgroup_init();
-   debug_objects_mem_init();
kmemleak_init();
+   debug_objects_mem_init();
setup_per_cpu_pageset();
numa_policy_init();
if (late_time_init)
diff --git a/lib/debugobjects.c b/lib/debugobjects.c
index a8c4b2ff53a0..a1b85ba90d32 100644
--- a/lib/debugobjects.c
+++ b/lib/debugobjects.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define ODEBUG_HASH_BITS   14
 #define ODEBUG_HASH_SIZE   (1 << ODEBUG_HASH_BITS)
@@ -106,6 +107,7 @@ static void fill_pool(void)
if (!new)
return;
 
+   kmemleak_ignore(new);
raw_spin_lock_irqsave(&pool_lock, flags);
hlist_add_head(&new->node, &obj_pool);
debug_objects_alloc++;
@@ -1047,6 +1049,7 @@ static int __init 
debug_objects_replace_static_objects(void)
obj = kmem_cache_zalloc(obj_cache, GFP_KERNEL);
if (!obj)
goto free;
+   kmemleak_ignore(obj);
hlist_add_head(&obj->node, &objects);
}
 
-- 
2.13.6



[Devel] [PATCH rh7 2/2] mm/vmscan/HACK: scan only file if global file inactive isn't low.

2017-11-23 Thread Andrey Ryabinin
Avoid swapping if the global inactive file list is big.

Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 524d1452deb1..798e013757f1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2064,6 +2064,22 @@ static void get_scan_count(struct lruvec *lruvec, struct 
scan_control *sc,
}
}
 
+   if (global_reclaim(sc)) {
+   unsigned long inactive = zone_page_state(zone, 
NR_INACTIVE_FILE);
+   unsigned long active = zone_page_state(zone, NR_ACTIVE_FILE);
+   unsigned long gb, inactive_ratio;
+
+   gb = (inactive + active) >> (30 - PAGE_SHIFT);
+   if (gb)
+   inactive_ratio = int_sqrt(10 * gb);
+   else
+   inactive_ratio = 1;
+   if (inactive_ratio * inactive >= active) {
+   scan_balance = SCAN_FILE;
+   goto out;
+   }
+   }
+
/*
 * There is enough inactive page cache, do not reclaim
 * anything from the anonymous working set right now.
-- 
2.13.6



[Devel] [PATCH rh7 1/2] mm/vmscan: make sysctl_vm_force_scan_thresh 100 by default

2017-11-23 Thread Andrey Ryabinin
force_scan was invented for a very narrow case. It hurts us badly
when we have one cgroup that consumes almost all memory and a few
small ones.

Set sysctl_vm_force_scan_thresh to 100 by default which effectively
disables it.

https://jira.sw.ru/browse/PSBM-77547
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e13a71e4e44e..524d1452deb1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1919,7 +1919,7 @@ static int vmscan_swappiness(struct scan_control *sc)
 }
 
 #ifdef CONFIG_MEMCG
-int sysctl_force_scan_thresh = 50;
+int sysctl_force_scan_thresh = 100;
 
 static inline bool zone_force_scan(struct zone *zone)
 {
-- 
2.13.6



[Devel] [PATCH rh7] mm/memcg: limit page cache in memcg hack.

2017-11-27 Thread Andrey Ryabinin
Add a new memcg file - memory.cache.limit_in_bytes. It is used
to limit page cache usage in a cgroup.
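
A hypothetical usage sketch (the cgroup mount point, the container name
"ct101" and the 512 MiB value are examples only, not part of the patch):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path =
                "/sys/fs/cgroup/memory/ct101/memory.cache.limit_in_bytes";
        const char *limit = "536870912\n";      /* 512 MiB */
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, limit, strlen(limit)) < 0)
                perror("write");
        close(fd);
        return 0;
}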

Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 144 +---
 1 file changed, 126 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a165a221e87b..116b303319af 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -314,6 +314,8 @@ struct mem_cgroup {
 */
struct page_counter dcache;
 
+   struct page_counter cache;
+
/* beancounter-related stats */
unsigned long long swap_max;
atomic_long_t mem_failcnt;
@@ -502,6 +504,7 @@ enum res_type {
_MEMSWAP,
_OOM_TYPE,
_KMEM,
+   _CACHE,
 };
 
 #define MEMFILE_PRIVATE(x, val)((x) << 16 | (val))
@@ -2771,7 +2774,7 @@ static int memcg_cpu_hotplug_callback(struct 
notifier_block *nb,
  * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
  */
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool 
kmem_charge,
- unsigned int nr_pages)
+ unsigned int nr_pages, bool cache_charge)
 {
unsigned int batch = max(CHARGE_BATCH, nr_pages);
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -2786,12 +2789,22 @@ retry:
flags = 0;
 
if (consume_stock(memcg, nr_pages)) {
-   if (!kmem_charge)
-   goto done;
-   if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter))
+   if (kmem_charge && page_counter_try_charge(
+   &memcg->kmem, nr_pages, &counter)) {
+   refill_stock(memcg, nr_pages);
+   goto charge;
+   }
+
+   if (cache_charge && !page_counter_try_charge(
+   &memcg->cache, nr_pages, &counter))
goto done;
+
+   refill_stock(memcg, nr_pages);
+   if (kmem_charge)
+   page_counter_uncharge(&memcg->kmem, nr_pages);
}
 
+charge:
mem_over_limit = NULL;
if (!page_counter_try_charge(&memcg->memory, batch, &counter)) {
if (do_swap_account && page_counter_try_charge(
@@ -2804,15 +2817,29 @@ retry:
mem_over_limit = mem_cgroup_from_counter(counter, memory);
 
if (!mem_over_limit && kmem_charge) {
-   if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter))
+   if (page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
+   flags |= MEM_CGROUP_RECLAIM_KMEM;
+   mem_over_limit = mem_cgroup_from_counter(counter, kmem);
+   page_counter_uncharge(&memcg->memory, batch);
+   if (do_swap_account)
+   page_counter_uncharge(&memcg->memsw, batch);
+   }
+   }
+
+   if (!mem_over_limit && cache_charge) {
+   if (!page_counter_try_charge(&memcg->cache, nr_pages, &counter))
goto done_restock;
 
-   flags |= MEM_CGROUP_RECLAIM_KMEM;
-   mem_over_limit = mem_cgroup_from_counter(counter, kmem);
+   flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+   mem_over_limit = mem_cgroup_from_counter(counter, cache);
page_counter_uncharge(&memcg->memory, batch);
if (do_swap_account)
page_counter_uncharge(&memcg->memsw, batch);
-   } else if (!mem_over_limit)
+   if (kmem_charge)
+   page_counter_uncharge(&memcg->kmem, batch);
+   }
+
+   if (!mem_over_limit)
goto done_restock;
 
if (batch > nr_pages) {
@@ -2898,12 +2925,15 @@ done:
return 0;
 }
 
-static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages,
+   bool cache_charge)
 {
if (!mem_cgroup_is_root(memcg)) {
page_counter_uncharge(&memcg->memory, nr_pages);
if (do_swap_account)
page_counter_uncharge(&memcg->memsw, nr_pages);
+   if (cache_charge)
+   page_counter_uncharge(&memcg->cache, nr_pages);
}
 }
 
@@ -3068,7 +3098,7 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
 {
int ret = 0;
 
-   ret = try_charge(memcg, gfp, true, nr_pages);
+   ret = try_charge(memcg, gfp, true, nr_pages, false);
if (ret == -EINTR)  {
/*
 * try_charge() chose to bypass to root due to OOM kill or
@@ -4327,6 +4357,9 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, 
struct cftype 

[Devel] [PATCH rh7] fs/fuse/dev: improve ->splice() with fragmented memory

2017-11-29 Thread Andrey Ryabinin
fuse_dev_splice_[read,write]() temporarily allocates an array of pipe_buffer
structs. Depending on the pipe size it could be quite large, so we may stall
in a high-order allocation request. Use kvmalloc() instead of kmalloc() to
fall back to vmalloc() if a high-order page is not available at the moment.

https://jira.sw.ru/browse/PSBM-77949
Signed-off-by: Andrey Ryabinin 
---
 fs/fuse/dev.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 3427eddcfb17..83c30e51dfca 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1353,7 +1353,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, 
loff_t *ppos,
if (!fud)
return -EPERM;
 
-   bufs = kmalloc(pipe->buffers * sizeof(struct pipe_buffer), GFP_KERNEL);
+   bufs = kvmalloc(pipe->buffers * sizeof(struct pipe_buffer), GFP_KERNEL);
if (!bufs)
return -ENOMEM;
 
@@ -1410,7 +1410,7 @@ out:
for (; page_nr < cs.nr_segs; page_nr++)
page_cache_release(bufs[page_nr].page);
 
-   kfree(bufs);
+   kvfree(bufs);
return ret;
 }
 
@@ -1991,7 +1991,7 @@ static ssize_t fuse_dev_splice_write(struct 
pipe_inode_info *pipe,
if (!fud)
return -EPERM;
 
-   bufs = kmalloc(pipe->buffers * sizeof(struct pipe_buffer), GFP_KERNEL);
+   bufs = kvmalloc(pipe->buffers * sizeof(struct pipe_buffer), GFP_KERNEL);
if (!bufs)
return -ENOMEM;
 
@@ -2049,7 +2049,7 @@ static ssize_t fuse_dev_splice_write(struct 
pipe_inode_info *pipe,
buf->ops->release(pipe, buf);
}
 out:
-   kfree(bufs);
+   kvfree(bufs);
return ret;
 }
 
-- 
2.13.6



Re: [Devel] [PATCH rh7] fs/fuse/dev: improve ->splice() with fragmented memory

2017-11-29 Thread Andrey Ryabinin


On 11/29/2017 05:31 PM, Vasily Averin wrote:
> Could you please elaborate, why it should help in reported case?
> It seems for me kvmalloc will push reclaimer first exactly like kmalloc does 
> right now.
> 

Currently we try to allocate a possibly high-order page with GFP_KERNEL flags.
For order <= PAGE_ALLOC_COSTLY_ORDER this will loop indefinitely until it
succeeds.

kvmalloc() sets __GFP_NORETRY, so if a high-order page isn't available we bail
out immediately and try vmalloc(), which uses only 0-order pages.


Re: [Devel] [PATCH rh7] fs/fuse/dev: improve ->splice() with fragmented memory

2017-11-29 Thread Andrey Ryabinin


On 11/29/2017 05:37 PM, Vasily Averin wrote:
> got it,
> kvmalloc does not use kmalloc for size <= (16*PAGE_SIZE)
> 

No, it does use kmalloc() first:

void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
        gfp_t kmalloc_flags = flags;
        void *ret;

        /*
         * Make sure that larger requests are not too disruptive - no OOM
         * killer and no allocation failure warnings as we have a fallback
         */
        if (size > PAGE_SIZE)
                kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;

        ret = kmalloc_node(size, kmalloc_flags, node);

        /*
         * It doesn't really make sense to fallback to vmalloc for sub page
         * requests
         */
        if (ret || size <= PAGE_SIZE)
                return ret;

        return __vmalloc_node_flags(size, node, flags | __GFP_HIGHMEM);
}



[Devel] [PATCH rh7] mm/tcache: replace BUG_ON()s with WARN_ON()s

2017-11-30 Thread Andrey Ryabinin
Tcache code is filled with BUG_ON() checks. However, in most cases the
issues that BUG_ON() is supposed to catch are not serious enough
to kill the machine. So relax them to WARN_ON().
Remove the BUG_ON() in tcache_init_fs(), because it's useless:
it's called from the only place in the kernel, which looks
like this:
pool_id = cleancache_ops->init_fs(PAGE_SIZE);

https://jira.sw.ru/browse/PSBM-77154
Signed-off-by: Andrey Ryabinin 
---
 mm/tcache.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index b5157d9861d0..99c799a9d290 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -473,7 +473,7 @@ static void tcache_destroy_pool(int id)
for (i = 0; i < num_node_trees; i++)
tcache_invalidate_node_tree(&pool->node_tree[i]);
 
-   BUG_ON(atomic_long_read(&pool->nr_nodes) != 0);
+   WARN_ON(atomic_long_read(&pool->nr_nodes) != 0);
 
kfree(pool->node_tree);
kfree_rcu(pool, rcu);
@@ -590,9 +590,10 @@ retry:
spin_unlock_irqrestore(&tree->lock, flags);
 
if (node) {
-   BUG_ON(node->pool != pool);
if (node != new_node)
kfree(new_node);
+   if (WARN_ON(node->pool != pool))
+   node = NULL;
return node;
}
 
@@ -696,9 +697,9 @@ tcache_invalidate_node_tree(struct tcache_node_tree *tree)
struct tcache_node, tree_node);
 
/* Remaining nodes must be held solely by their pages */
-   BUG_ON(atomic_read(&node->kref.refcount) != 1);
-   BUG_ON(node->nr_pages == 0);
-   BUG_ON(node->invalidated);
+   WARN_ON(atomic_read(&node->kref.refcount) != 1);
+   WARN_ON(node->nr_pages == 0);
+   WARN_ON(node->invalidated);
 
tcache_hold_node(node);
tcache_invalidate_node_pages(node);
@@ -1182,7 +1183,8 @@ static unsigned long tcache_shrink_scan(struct shrinker 
*shrink,
struct page **pages = get_cpu_var(tcache_page_vec);
int nr_isolated, nr_reclaimed;
 
-   BUG_ON(sc->nr_to_scan > TCACHE_SCAN_BATCH);
+   if (WARN_ON(sc->nr_to_scan > TCACHE_SCAN_BATCH))
+   sc->nr_to_scan = TCACHE_SCAN_BATCH;
 
nr_isolated = tcache_lru_isolate(sc->nid, pages, sc->nr_to_scan);
if (!nr_isolated) {
@@ -1209,7 +1211,6 @@ struct shrinker tcache_shrinker = {
 
 static int tcache_cleancache_init_fs(size_t pagesize)
 {
-   BUG_ON(pagesize != PAGE_SIZE);
return tcache_create_pool();
 }
 
-- 
2.13.6



Re: [Devel] [PATCH rh7] mm/tcache: replace BUG_ON()s with WARN_ON()s

2017-11-30 Thread Andrey Ryabinin


On 11/30/2017 04:24 PM, Kirill Tkhai wrote:
> On 30.11.2017 15:06, Andrey Ryabinin wrote:
>> Tcache code filled with BUG_ON() checks. However the most cases
>> issues that BUG_ON() supposed to catch are not serious enough
>> to kill machine. So relax it's to WARN_ON.
>> Remove BUG_ON() in tcache_init_fs(), because it's useless.
>> It's called from the only one place in the kernel, which looks
>> like this:
>>  pool_id = cleancache_ops->init_fs(PAGE_SIZE);
>>
>> https://jira.sw.ru/browse/PSBM-77154
>> Signed-off-by: Andrey Ryabinin 
>> ---
>>  mm/tcache.c | 15 ---
>>  1 file changed, 8 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/tcache.c b/mm/tcache.c
>> index b5157d9861d0..99c799a9d290 100644
>> --- a/mm/tcache.c
>> +++ b/mm/tcache.c
>> @@ -473,7 +473,7 @@ static void tcache_destroy_pool(int id)
>>  for (i = 0; i < num_node_trees; i++)
>>  tcache_invalidate_node_tree(&pool->node_tree[i]);
>>  
>> -BUG_ON(atomic_long_read(&pool->nr_nodes) != 0);
>> +WARN_ON(atomic_long_read(&pool->nr_nodes) != 0);
> 
> Patch looks good for me. One small question about above. Shouldn't we abort
> pool destroy in case of this WARN_ON() fires like you did for node below?

Yeah, it's better to leak a few bytes than to cause a potential use-after-free.

> Also, if so it seems it would be useful to know the exact count of 
> pool->nr_nodes:
> either there is overcount or undercount...
> 

Ok.



[Devel] [PATCH rh7 v2] mm/tcache: replace BUG_ON()s with WARN_ON()s

2017-11-30 Thread Andrey Ryabinin
Tcache code is filled with BUG_ON() checks. However, in most cases the
issues that BUG_ON() is supposed to catch are not serious enough
to kill the machine. So relax them to WARN_ON().
Remove the BUG_ON() in tcache_init_fs(), because it's useless:
it's called from the only place in the kernel, which looks
like this:
pool_id = cleancache_ops->init_fs(PAGE_SIZE);

https://jira.sw.ru/browse/PSBM-77154
Signed-off-by: Andrey Ryabinin 
---
 mm/tcache.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index b5157d9861d0..31a2d0250fc8 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -446,6 +446,7 @@ static void tcache_destroy_pool(int id)
 {
int i;
struct tcache_pool *pool;
+   unsigned long nr_nodes;
 
spin_lock(&tcache_pool_lock);
pool = idr_find(&tcache_pool_idr, id);
@@ -473,7 +474,9 @@ static void tcache_destroy_pool(int id)
for (i = 0; i < num_node_trees; i++)
tcache_invalidate_node_tree(&pool->node_tree[i]);
 
-   BUG_ON(atomic_long_read(&pool->nr_nodes) != 0);
+   nr_nodes = atomic_long_read(&pool->nr_nodes);
+   if (WARN(nr_nodes != 0, "pool->nr_nodes %ld", nr_nodes))
+   return;
 
kfree(pool->node_tree);
kfree_rcu(pool, rcu);
@@ -590,9 +593,10 @@ retry:
spin_unlock_irqrestore(&tree->lock, flags);
 
if (node) {
-   BUG_ON(node->pool != pool);
if (node != new_node)
kfree(new_node);
+   if (WARN_ON(node->pool != pool))
+   node = NULL;
return node;
}
 
@@ -696,9 +700,9 @@ tcache_invalidate_node_tree(struct tcache_node_tree *tree)
struct tcache_node, tree_node);
 
/* Remaining nodes must be held solely by their pages */
-   BUG_ON(atomic_read(&node->kref.refcount) != 1);
-   BUG_ON(node->nr_pages == 0);
-   BUG_ON(node->invalidated);
+   WARN_ON(atomic_read(&node->kref.refcount) != 1);
+   WARN_ON(node->nr_pages == 0);
+   WARN_ON(node->invalidated);
 
tcache_hold_node(node);
tcache_invalidate_node_pages(node);
@@ -1182,7 +1186,8 @@ static unsigned long tcache_shrink_scan(struct shrinker 
*shrink,
struct page **pages = get_cpu_var(tcache_page_vec);
int nr_isolated, nr_reclaimed;
 
-   BUG_ON(sc->nr_to_scan > TCACHE_SCAN_BATCH);
+   if (WARN_ON(sc->nr_to_scan > TCACHE_SCAN_BATCH))
+   sc->nr_to_scan = TCACHE_SCAN_BATCH;
 
nr_isolated = tcache_lru_isolate(sc->nid, pages, sc->nr_to_scan);
if (!nr_isolated) {
@@ -1209,7 +1214,6 @@ struct shrinker tcache_shrinker = {
 
 static int tcache_cleancache_init_fs(size_t pagesize)
 {
-   BUG_ON(pagesize != PAGE_SIZE);
return tcache_create_pool();
 }
 
-- 
2.13.6



[Devel] [PATCH rh7] NFS: Don't call COMMIT in ->releasepage()

2017-12-01 Thread Andrey Ryabinin
From: Trond Myklebust 

While COMMIT has the potential to free up a lot of memory that is being
taken by unstable writes, it isn't guaranteed to free up this particular
page. Also, calling fsync() on the server is expensive and so we want to
do it in a more controlled fashion, rather than have it triggered at
random by the VM.

Signed-off-by: Trond Myklebust 

https://jira.sw.ru/browse/PSBM-77949
(cherry picked from commit 4f52b6bb8c57b9accafad526a429d6c0851cc62f)
Signed-off-by: Andrey Ryabinin 
---
 fs/nfs/file.c | 23 ---
 1 file changed, 23 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 7ad044976fd1..24d3d0c44bc4 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -470,31 +470,8 @@ static void nfs_invalidate_page(struct page *page, 
unsigned int offset,
  */
 static int nfs_release_page(struct page *page, gfp_t gfp)
 {
-   struct address_space *mapping = page->mapping;
-
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
 
-   /* Always try to initiate a 'commit' if relevant, but only
-* wait for it if __GFP_WAIT is set.  Even then, only wait 1
-* second and only if the 'bdi' is not congested.
-* Waiting indefinitely can cause deadlocks when the NFS
-* server is on this machine, when a new TCP connection is
-* needed and in other rare cases.  There is no particular
-* need to wait extensively here.  A short wait has the
-* benefit that someone else can worry about the freezer.
-*/
-   if (mapping) {
-   struct nfs_server *nfss = NFS_SERVER(mapping->host);
-   nfs_commit_inode(mapping->host, 0);
-   if ((gfp & __GFP_WAIT) &&
-   !bdi_write_congested(&nfss->backing_dev_info)) {
-   wait_on_page_bit_killable_timeout(page, PG_private,
- HZ);
-   if (PagePrivate(page))
-   set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
-   }
-   }
/* If PagePrivate() is set, then the page is not freeable */
if (PagePrivate(page))
return 0;
-- 
2.13.6



Re: [Devel] [PATCH v2] tcache: Repeat invalidation in tcache_invalidate_node_pages()

2017-12-01 Thread Andrey Ryabinin


On 12/01/2017 06:02 PM, Kirill Tkhai wrote:
> When there are more than 2 users of a page,  __tcache_page_tree_delete()
> fails to freeze it. We skip it and never try to freeze it again.
> 
> In this case the page remains not invalidated, and tcache_node->nr_pages
> never decremented. Later, we catch WARN_ON() reporting about this.
> 
> tcache_shrink_scan()   tcache_destroy_pool
>tcache_lru_isolate()
>   tcache_grab_pool()
>   ...
>   page_cache_get_speculative() -->cnt == 2
> 
>   ...
>   tcache_put_pool() --> pool cnt zero
>   ...  
> wait_for_completion(&pool->completion);
>tcache_reclaim_pages
> tcache_invalidate_node_pages()
>   __tcache_reclaim_page()  tcache_lookup()
>   
> page_cache_get_speculative  --> cnt == 3
>
> __tcache_page_tree_delete
> page_ref_freeze(2) -->fail
> page_ref_freeze(2) -->fail
> 
> The patch fixes the problem. In case of we failed to invalidate a page,
> we remember this, and return to such pages after others are invalidated.
> 
> https://jira.sw.ru/browse/PSBM-78354
> 
> v2: Also fix tcache_detach_page()
> 
> Signed-off-by: Kirill Tkhai 
> ---

Acked-by: Andrey Ryabinin 


[Devel] [PATCH rh7] mm/mempolicy: Add cond_resched() in queue_pages_pte_range()

2017-12-18 Thread Andrey Ryabinin
Migrating a huge range of memory may take quite some time, and the
lack of a resched point may cause a softlockup.

NMI watchdog: BUG: soft lockup - CPU#57 stuck for 22s! [vcmmd:1942]
RIP: 0010:[]  [] isolate_lru_page+0x86/0x1c0
...
Call Trace:
 queue_pages_range+0x481/0x6d0
 migrate_to_node+0x79/0xe0
 do_migrate_pages+0x268/0x2d0
 cpuset_migrate_mm+0xcc/0xf0
 cpuset_change_nodemask+0x8e/0x90
 cgroup_scan_tasks+0x147/0x200
 update_tasks_nodemask+0x4b/0x70
 cpuset_migrate_mm+0xf0/0xf0
 cpuset_write_resmask+0x6b4/0x6f0
 lru_cache_add_active_or_unevictable+0x27/0xb0
 cgroup_rightmost_descendant+0x80/0x80
 cpuset_css_offline+0x50/0x50
 cgroup_file_write+0x1fe/0x2f0
 sb_start_write+0x58/0x110
 vfs_write+0xbd/0x1e0
 SyS_write+0x7f/0xe0
 system_call_fastpath+0x16/0x1b

Add cond_resched() to fix that.

Upstream got a similar cond_resched() in commit
6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()").

https://jira.sw.ru/browse/PSBM-79273
Signed-off-by: Andrey Ryabinin 
---
 mm/mempolicy.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9b0dcf1835c4..7bf644c82837 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -513,6 +513,7 @@ static int queue_pages_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
break;
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap_unlock(orig_pte, ptl);
+   cond_resched();
return addr != end;
 }
 
-- 
2.13.6



Re: [Devel] [PATCH 2/2] vznetstat: Convert some kmalloc()/kfree() to __vmalloc()/vfree()

2017-12-20 Thread Andrey Ryabinin


On 12/19/2017 03:25 PM, Kirill Tkhai wrote:
> Let's use virtually continuos pages instead of physically continuos
> as it's easier to allocate them.
> 
> Also, add __GFP_NOWARN to not disturb a user in case of ENOMEM.
> 
> https://jira.sw.ru/browse/PSBM-79502
> 
> Signed-off-by: Kirill Tkhai 
> ---

Acked-by: Andrey Ryabinin 


Re: [Devel] [PATCH 1/2] vznetstat: Add protection to venet_acct_set_classes()

2017-12-20 Thread Andrey Ryabinin


On 12/19/2017 03:24 PM, Kirill Tkhai wrote:
> It seems there was no synchronization since the time
> when ioctls in kernel were serialized via single mutex.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  kernel/ve/vznetstat/vznetstat.c |   11 +++
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/ve/vznetstat/vznetstat.c b/kernel/ve/vznetstat/vznetstat.c
> index 3a53ce27bde2..a65e05378ff4 100644
> --- a/kernel/ve/vznetstat/vznetstat.c
> +++ b/kernel/ve/vznetstat/vznetstat.c
> @@ -52,6 +52,7 @@ static struct class_info_set *info_v4 = NULL;
>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>  static struct class_info_set *info_v6 = NULL;
>  #endif
> +static DEFINE_MUTEX(info_mutex);
>  
>  /* v6: flag IPv6 classes or IPv4 */
>  static int venet_acct_set_classes(const void __user *user_info, int length, 
> int v6)
> @@ -88,15 +89,17 @@ static int venet_acct_set_classes(const void __user 
> *user_info, int length, int
>   goto out_free;
>   }
>  
> - rcu_read_lock();
> + mutex_lock(&info_mutex);
>   if (v6) {
> - old = rcu_dereference(info_v6);
> + old = rcu_dereference_protected(info_v6,
> + lockdep_is_held(&info_mutex));
>   rcu_assign_pointer(info_v6, info);
>   } else {
> - old = rcu_dereference(info_v4);
> + old = rcu_dereference_protected(info_v4,
> + lockdep_is_held(&info_mutex));
>   rcu_assign_pointer(info_v4, info);

It would probably be easier to simply use xchg() here; locking wouldn't be
needed in that case. But since I assume this is not performance-sensitive code,
a mutex should be fine.

Acked-by: Andrey Ryabinin 


>   }
> - rcu_read_unlock();
> + mutex_unlock(&info_mutex);
>  
>   synchronize_net();
>   /* IMPORTANT. I think reset of statistics collected should not be
> 


Re: [Devel] [PATCH 1/2] vznetstat: Add protection to venet_acct_set_classes()

2017-12-20 Thread Andrey Ryabinin


On 12/20/2017 12:30 PM, Kirill Tkhai wrote:

>> How you about this?
>>
> vznetstat: Add protection to venet_acct_set_classes()
> 
> It seems there was no synchronization since the time
> when ioctls in kernel were serialized via single mutex.
> 
> Signed-off-by: Kirill Tkhai 
> ---


Acked-by: Andrey Ryabinin 


> diff --git a/kernel/ve/vznetstat/vznetstat.c b/kernel/ve/vznetstat/vznetstat.c
> index 3a53ce27bde2..92541dbc2a3f 100644
> --- a/kernel/ve/vznetstat/vznetstat.c
> +++ b/kernel/ve/vznetstat/vznetstat.c
> @@ -88,15 +88,11 @@ static int venet_acct_set_classes(const void __user 
> *user_info, int length, int
>   goto out_free;
>   }
>  
> - rcu_read_lock();
> - if (v6) {
> - old = rcu_dereference(info_v6);
> - rcu_assign_pointer(info_v6, info);
> - } else {
> - old = rcu_dereference(info_v4);
> - rcu_assign_pointer(info_v4, info);
> - }
> - rcu_read_unlock();
> + if (v6)
> + old = xchg(&info_v6, info);
> + else
> + old = xchg(&info_v4, info);
> + /* xchg() implies rcu_assign_pointer() barriers */
>  
>   synchronize_net();
>   /* IMPORTANT. I think reset of statistics collected should not be
> 


[Devel] [PATCH rh7] drivers/bnx2x: Limit setting of the max mtu.

2017-12-29 Thread Andrey Ryabinin
Limit the max MTU so that rx_buf_size fits into a single page.
This should save us from allocation failures like this:

kswapd0: page allocation failure: order:2, mode:0x4020
Call Trace:
dump_stack+0x19/0x1b
warn_alloc_failed+0x110/0x180
__alloc_pages_nodemask+0x7bf/0xc60
alloc_pages_current+0x98/0x110
kmalloc_order+0x18/0x40
kmalloc_order_trace+0x26/0xa0
__kmalloc+0x279/0x290
bnx2x_frag_alloc.isra.61+0x2a/0x40 [bnx2x]
bnx2x_rx_int+0x227/0x17c0 [bnx2x]
bnx2x_poll+0x1dd/0x260 [bnx2x]
net_rx_action+0x179/0x390
__do_softirq+0x10f/0x2aa
call_softirq+0x1c/0x30
do_softirq+0x65/0xa0
irq_exit+0x105/0x110
do_IRQ+0x56/0xe0
common_interrupt+0x6d/0x6d

https://jira.sw.ru/browse/PSBM-77016
Signed-off-by: Andrey Ryabinin 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index f4e939f2ad66..339cfb0180e4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4868,6 +4868,15 @@ int bnx2x_change_mtu(struct net_device *dev, int new_mtu)
return -EINVAL;
}
 
+   if (SKB_DATA_ALIGN(new_mtu + BNX2X_FW_RX_ALIGN_START +
+   IP_HEADER_ALIGNMENT_PADDING + ETH_OVERHEAD +
+   BNX2X_FW_RX_ALIGN_END) + NET_SKB_PAD > PAGE_SIZE) {
+   new_mtu = PAGE_SIZE - NET_SKB_PAD - BNX2X_FW_RX_ALIGN_END -
+   ETH_OVERHEAD - IP_HEADER_ALIGNMENT_PADDING -
+   BNX2X_FW_RX_ALIGN_START;
+   }
+
+
/* This does not race with packet allocation
 * because the actual alloc size is
 * only updated as part of load
-- 
2.13.6



[Devel] [PATCH rh7] sched: Fallback to 0-order allocations in sched_create_group()

2017-12-29 Thread Andrey Ryabinin
On large machines the ->cpustat_last/->vcpustat arrays are large,
which can lead to failing high-order allocations:

   fio: page allocation failure: order:4, mode:0xc0d0

   Call Trace:
 dump_stack+0x19/0x1b
 warn_alloc_failed+0x110/0x180
  __alloc_pages_nodemask+0x7bf/0xc60
 alloc_pages_current+0x98/0x110
 kmalloc_order+0x18/0x40
 kmalloc_order_trace+0x26/0xa0
 __kmalloc+0x279/0x290
 sched_create_group+0xba/0x150
 sched_autogroup_create_attach+0x3f/0x1a0
 sys_setsid+0x73/0xc0
 system_call_fastpath+0x16/0x1b

Use kvzalloc() to fall back to vmalloc() and avoid failure if a
high-order page is not available.

https://jira.sw.ru/browse/PSBM-79891
Fixes: 85fd6b2ff490 ("sched: Port cpustat related patches")
Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/core.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 75ef029b1595..6979f7d8ead7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8761,8 +8761,8 @@ static void free_sched_group(struct task_group *tg)
free_rt_sched_group(tg);
autogroup_free(tg);
free_percpu(tg->taskstats);
-   kfree(tg->cpustat_last);
-   kfree(tg->vcpustat);
+   kvfree(tg->cpustat_last);
+   kvfree(tg->vcpustat);
kfree(tg);
 }
 
@@ -8785,12 +8785,12 @@ struct task_group *sched_create_group(struct task_group 
*parent)
if (!tg->taskstats)
goto err;
 
-   tg->cpustat_last = kcalloc(nr_cpu_ids, sizeof(struct kernel_cpustat),
+   tg->cpustat_last = kvzalloc(nr_cpu_ids * sizeof(struct kernel_cpustat),
   GFP_KERNEL);
if (!tg->cpustat_last)
goto err;
 
-   tg->vcpustat = kcalloc(nr_cpu_ids, sizeof(struct kernel_cpustat),
+   tg->vcpustat = kvzalloc(nr_cpu_ids * sizeof(struct kernel_cpustat),
   GFP_KERNEL);
if (!tg->vcpustat)
goto err;
-- 
2.13.6



[Devel] [PATCH rh7] mm/page_alloc add warning about high order allocations.

2018-01-15 Thread Andrey Ryabinin
Add the sysctl vm.warn_high_order. If set, it will warn about
allocations with order >= vm.warn_high_order.
It prints 32 warnings at most and skips all __GFP_NOWARN allocations.
Disabled by default.
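
A hypothetical usage sketch: request a warning for every allocation of
order >= 4 (the /proc path follows from the sysctl table entry added below):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/vm/warn_high_order", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }
        fprintf(f, "4\n");      /* warn on allocations with order >= 4 */
        return fclose(f) ? 1 : 0;
}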

https://jira.sw.ru/browse/PSBM-79892
Signed-off-by: Andrey Ryabinin 
---
 kernel/sysctl.c | 15 +++
 mm/page_alloc.c | 33 +
 2 files changed, 48 insertions(+)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e2d83c602b01..1de17f161be3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -128,6 +128,7 @@ static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
 static unsigned long one_ul = 1;
+static int ten = 10;
 static int one_hundred = 100;
 #ifdef CONFIG_PRINTK
 static int ten_thousand = 1;
@@ -174,6 +175,11 @@ extern int unaligned_dump_stack;
 extern int no_unaligned_warning;
 #endif
 
+extern int warn_order;
+extern int proc_warn_high_order(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos);
+
+
 static bool virtual_ptr(void **ptr, void *base, size_t size, void *cur);
 #define sysctl_virtual(sysctl) 
\
 int sysctl ## _virtual(struct ctl_table *table, int write, 
\
@@ -1664,6 +1670,15 @@ static struct ctl_table vm_table[] = {
.extra2 = &one_hundred,
},
 #endif
+   {
+   .procname   = "warn_high_order",
+   .data   = &warn_order,
+   .maxlen = sizeof(warn_order),
+   .mode   = 0644,
+   .proc_handler   = &proc_warn_high_order,
+   .extra1 = &zero,
+   .extra2 = &ten,
+   },
{ }
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f7cb012508e..e0a390866d4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3138,6 +3138,37 @@ static void __alloc_collect_stats(gfp_t gfp_mask, 
unsigned int order,
 #endif
 }
 
+struct static_key warn_high_order_key = STATIC_KEY_INIT_FALSE;
+int warn_order = MAX_ORDER+1;
+
+int proc_warn_high_order(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret;
+
+   ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+   if (!ret) {
+   smp_wmb();
+   static_key_slow_inc(&warn_high_order_key);
+   }
+
+   return ret;
+}
+
+static __always_inline void warn_high_order(int order, gfp_t gfp_mask)
+{
+   static atomic_t warn_count = ATOMIC_INIT(32);
+
+   if (static_key_false(&warn_high_order_key)) {
+   int tmp_warn_order = smp_load_acquire(&warn_order);
+
+   if (order >= tmp_warn_order && !(gfp_mask & __GFP_NOWARN))
+   WARN(atomic_dec_return(&warn_count),
+   "order %d >= %d, gfp 0x%x\n",
+   order, tmp_warn_order, gfp_mask);
+   }
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3161,6 +3192,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
WARN_ON_ONCE((gfp_mask & __GFP_FS) && current->journal_info &&
!(current->flags & PF_MEMALLOC));
 
+   warn_high_order(order, gfp_mask);
+
if (should_fail_alloc_page(gfp_mask, order))
return NULL;
 
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm/tcache: invalidate existing page during cleancache_put_page().

2018-01-18 Thread Andrey Ryabinin
We ->put_page() into tcache twice w/o ->get_page() in between, resulting in:

WARNING: CPU: 1 PID: 1936 at mm/tcache.c:752 tcache_attach_page+0x218/0x240

Call Trace:
 dump_stack+0x19/0x1b
 add_taint+0x32/0x70
 __warn+0xaa/0x100
 warn_slowpath_null+0x1d/0x20
 tcache_attach_page+0x218/0x240
 tcache_cleancache_put_page+0xdc/0x150
 __cleancache_put_page+0xa2/0xf0
 __delete_from_page_cache+0x309/0x370
 __remove_mapping+0x91/0x180
 shrink_page_list+0x5a8/0xa80
 shrink_inactive_list+0x1da/0x710
 shrink_lruvec+0x3a1/0x760
 shrink_zone+0x15b/0x310
 do_try_to_free_pages+0x1a0/0x610
 try_to_free_mem_cgroup_pages+0xeb/0x190
 mem_cgroup_reclaim+0x63/0x140
 try_charge+0x287/0x560
 mem_cgroup_try_charge+0x7a/0x130
 __add_to_page_cache_locked+0x97/0x2f0
 add_to_page_cache_lru+0x37/0xb0
 mpage_readpages+0xb5/0x150
 ext4_readpages+0x3c/0x40 [ext4]
 __do_page_cache_readahead+0x1cd/0x250
 ondemand_readahead+0x116/0x230
 page_cache_async_readahead+0xa8/0xc0

It's because we put all clean pages into cleancache, but in some cases
we may never read those pages back from cleancache. E.g. see do_mpage_readpage():
there are many cases where cleancache_get_page() is not called.

Fix this by invalidating the already existing page during ->put_page().

https://jira.sw.ru/browse/PSBM-80712
Reported-by: Kirill Tkhai 
Signed-off-by: Andrey Ryabinin 
---
 mm/tcache.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index a45af63fbd1b..76e34d7e8851 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -749,7 +749,6 @@ static int tcache_page_tree_insert(struct tcache_node 
*node, pgoff_t index,
}
 
err = radix_tree_insert(&node->page_tree, index, page);
-   WARN_ON(err == -EEXIST);
if (!err) {
if (!node->nr_pages++)
tcache_hold_node(node);
@@ -1256,7 +1255,15 @@ static int tcache_cleancache_put_page(int pool_id,
cache_page = tcache_alloc_page(node->pool);
if (cache_page) {
copy_highpage(cache_page, page);
-   if (tcache_attach_page(node, index, cache_page)) {
+   ret = tcache_attach_page(node, index, cache_page);
+   if (ret) {
+   if (ret == -EEXIST) {
+   struct page *page;
+
+   page = tcache_detach_page(node, index, false);
+   if (page)
+   tcache_put_page(page);
+   }
if (put_page_testzero(cache_page))
tcache_put_page(cache_page);
} else
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 v2] mm/tcache: invalidate existing page during cleancache_put_page().

2018-01-18 Thread Andrey Ryabinin
We ->put_page() into tcache twice w/o ->get_page() in between, resulting in:

WARNING: CPU: 1 PID: 1936 at mm/tcache.c:752 tcache_attach_page+0x218/0x240

Call Trace:
 dump_stack+0x19/0x1b
 add_taint+0x32/0x70
 __warn+0xaa/0x100
 warn_slowpath_null+0x1d/0x20
 tcache_attach_page+0x218/0x240
 tcache_cleancache_put_page+0xdc/0x150
 __cleancache_put_page+0xa2/0xf0
 __delete_from_page_cache+0x309/0x370
 __remove_mapping+0x91/0x180
 shrink_page_list+0x5a8/0xa80
 shrink_inactive_list+0x1da/0x710
 shrink_lruvec+0x3a1/0x760
 shrink_zone+0x15b/0x310
 do_try_to_free_pages+0x1a0/0x610
 try_to_free_mem_cgroup_pages+0xeb/0x190
 mem_cgroup_reclaim+0x63/0x140
 try_charge+0x287/0x560
 mem_cgroup_try_charge+0x7a/0x130
 __add_to_page_cache_locked+0x97/0x2f0
 add_to_page_cache_lru+0x37/0xb0
 mpage_readpages+0xb5/0x150
 ext4_readpages+0x3c/0x40 [ext4]
 __do_page_cache_readahead+0x1cd/0x250
 ondemand_readahead+0x116/0x230
 page_cache_async_readahead+0xa8/0xc0

It's because we put all clean pages into cleancache, but in some cases
we may never read those pages back from cleancache. E.g. see do_mpage_readpage():
there are many cases where cleancache_get_page() is not called.

Fix this by invalidating the already existing page during ->put_page().

https://jira.sw.ru/browse/PSBM-80712
Signed-off-by: Andrey Ryabinin 
---

Changes since v1:
 - add "ret = 0", as se must return number of put pages from 
tcache_cleancache_put_page()
 mm/tcache.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/tcache.c b/mm/tcache.c
index a45af63fbd1b..9710531731ab 100644
--- a/mm/tcache.c
+++ b/mm/tcache.c
@@ -749,7 +749,6 @@ static int tcache_page_tree_insert(struct tcache_node 
*node, pgoff_t index,
}
 
err = radix_tree_insert(&node->page_tree, index, page);
-   WARN_ON(err == -EEXIST);
if (!err) {
if (!node->nr_pages++)
tcache_hold_node(node);
@@ -1256,9 +1255,18 @@ static int tcache_cleancache_put_page(int pool_id,
cache_page = tcache_alloc_page(node->pool);
if (cache_page) {
copy_highpage(cache_page, page);
-   if (tcache_attach_page(node, index, cache_page)) {
+   ret = tcache_attach_page(node, index, cache_page);
+   if (ret) {
+   if (ret == -EEXIST) {
+   struct page *page;
+
+   page = tcache_detach_page(node, index, false);
+   if (page)
+   tcache_put_page(page);
+   }
if (put_page_testzero(cache_page))
tcache_put_page(cache_page);
+   ret = 0;
} else
ret = 1;
}
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v2] tcache: Close race between tcache_invalidate_node() and tcache_attach_page()

2018-01-18 Thread Andrey Ryabinin
On 01/15/2018 09:08 PM, Kirill Tkhai wrote:
> tcache_attach_page()tcache_invalidate_node()
> ..  __tcache_lookup_node()
> ..  __tcache_delete_node()
> Check node->invalidated ..
> tcache_page_tree_insert()   ..
> tcache_lru_add()..
> ..  tcache_invalidate_node_pages()
> ..node->invalidated = true
> 
> Check nr_pages to determine if there is a race and repeat
> the node pages iteration if so.
> 
> v2: Move invalidate assignment down in tcache_invalidate_node_tree().
> synchronize_sched() to be sure all tcache_attach_page() see invalidated.
> 
> Signed-off-by: Kirill Tkhai 

Acked-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 1/2] tcache: Refactor tcache_shrink_scan()

2018-01-23 Thread Andrey Ryabinin


On 01/23/2018 11:55 AM, Kirill Tkhai wrote:
> Make the function have the only return.
> 
> Signed-off-by: Kirill Tkhai 

Acked-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v3 2/2] tcache: Fix race between tcache_invalidate_node() and tcache_attach_page()

2018-01-23 Thread Andrey Ryabinin


On 01/23/2018 11:56 AM, Kirill Tkhai wrote:
> tcache_attach_page()  tcache_invalidate_node()
> ..__tcache_lookup_node()
> ..__tcache_delete_node()
> Check node->invalidated   ..
> tcache_page_tree_insert() ..
> tcache_lru_add()  ..
> ..tcache_invalidate_node_pages()
> ..  node->invalidated = true
> 
> Check nr_pages to determine if there is a race and repeat
> the node pages iteration if so.
> 
> v2: Move invalidate assignment down in tcache_invalidate_node_tree().
> v3: Synchronize sched in case of race with tcache_shrink_count() too
> to minimize repeats numbers.
> 
> Signed-off-by: Kirill Tkhai 
> ---


Acked-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 2/2] ms/kbuild: add -fno-PIE

2018-01-24 Thread Andrey Ryabinin
From: Sebastian Andrzej Siewior 

Debian started to build gcc with -fPIE by default, so the kernel
build ends before it properly starts with:
|kernel/bounds.c:1:0: error: code model kernel does not support PIC mode

Also add to KBUILD_AFLAGS due to:

|gcc -Wp,-MD,arch/x86/entry/vdso/vdso32/.note.o.d … -mfentry -DCC_USING_FENTRY … vdso/vdso32/note.S
|arch/x86/entry/vdso/vdso32/note.S:1:0: sorry, unimplemented: -mfentry isn’t supported for 32-bit in combination with -fpic

Tagging it stable so it is possible to compile recent stable kernels as
well.

Cc: sta...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Michal Marek 
Signed-off-by: Andrey Ryabinin 
---
 Makefile | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Makefile b/Makefile
index 24108d07384e..f68ea0535834 100644
--- a/Makefile
+++ b/Makefile
@@ -623,6 +623,9 @@ endif # $(dot-config)
 # Defaults to vmlinux, but the arch makefile usually adds further targets
 all: vmlinux
 
+KBUILD_CFLAGS  += $(call cc-option,-fno-PIE)
+KBUILD_AFLAGS  += $(call cc-option,-fno-PIE)
+
 ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
 KBUILD_CFLAGS  += -Os $(call cc-disable-warning,maybe-uninitialized,)
 else
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/2] ms/scripts/has-stack-protector: add -fno-PIE

2018-01-24 Thread Andrey Ryabinin
From: Sebastian Andrzej Siewior 

Adding -no-PIE to the fstack protector check. -no-PIE was introduced
before -fstack-protector so there is no need for a runtime check.

Without it the build stops:
|Cannot use CONFIG_CC_STACKPROTECTOR_STRONG: -fstack-protector-strong available but compiler is broken

due to -mcmodel=kernel + -fPIE if -fPIE is enabled by default.

Tagging it stable so it is possible to compile recent stable kernels as
well.

Cc: sta...@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Michal Marek 
Signed-off-by: Andrey Ryabinin 
---
 scripts/gcc-x86_64-has-stack-protector.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/gcc-x86_64-has-stack-protector.sh b/scripts/gcc-x86_64-has-stack-protector.sh
index 973e8c141567..17867e723a51 100644
--- a/scripts/gcc-x86_64-has-stack-protector.sh
+++ b/scripts/gcc-x86_64-has-stack-protector.sh
@@ -1,6 +1,6 @@
 #!/bin/sh
 
-echo "int foo(void) { char X[200]; return 3; }" | $* -S -x c -c -O0 
-mcmodel=kernel -fstack-protector - -o - 2> /dev/null | grep -q "%gs"
+echo "int foo(void) { char X[200]; return 3; }" | $* -S -x c -c -O0 
-mcmodel=kernel -fno-PIE -fstack-protector - -o - 2> /dev/null | grep -q "%gs"
 if [ "$?" -eq "0" ] ; then
echo y
 else
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 01/25] ms/mm/compaction.c: periodically schedule when freeing pages

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

We've been getting warnings about an excessive amount of time spent
allocating pages for migration during memory compaction without
scheduling.  isolate_freepages_block() already periodically checks for
contended locks or the need to schedule, but isolate_freepages() never
does.

When a zone is massively long and no suitable targets can be found, this
iteration can be quite expensive without ever doing cond_resched().

Check periodically for the need to reschedule while the compaction free
scanner iterates.

Signed-off-by: David Rientjes 
Reviewed-by: Rik van Riel 
Reviewed-by: Wanpeng Li 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit f6ea3adb70b20ae36277a1b0eaaf4da9f6479a28)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 63f5f4627ea7..f693bf3b87e2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -698,6 +698,13 @@ static void isolate_freepages(struct zone *zone,
unsigned long isolated;
unsigned long end_pfn;
 
+   /*
+* This can iterate a massively long zone without finding any
+* suitable migration targets, so periodically check if we need
+* to schedule.
+*/
+   cond_resched();
+
if (!pfn_valid(pfn))
continue;
 
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 14/25] ms/mm/compaction: clean up unused code lines

2018-01-24 Thread Andrey Ryabinin
From: Heesub Shin 

Remove code lines currently not in use or never called.

Signed-off-by: Heesub Shin 
Acked-by: Vlastimil Babka 
Cc: Dongjun Shin 
Cc: Sunghwan Yun 
Cc: Minchan Kim 
Cc: Mel Gorman 
Cc: Joonsoo Kim 
Cc: Bartlomiej Zolnierkiewicz 
Cc: Michal Nazarewicz 
Cc: Naoya Horiguchi 
Cc: Christoph Lameter 
Cc: Rik van Riel 
Cc: Dongjun Shin 
Cc: Sunghwan Yun 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 13fb44e4b0414d7e718433a49e6430d5b76bd46e)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ee0c1e4aecd7..ea44b4bf85d9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -210,12 +210,6 @@ static bool compact_checklock_irqsave(spinlock_t *lock, 
unsigned long *flags,
return true;
 }
 
-static inline bool compact_trylock_irqsave(spinlock_t *lock,
-   unsigned long *flags, struct compact_control *cc)
-{
-   return compact_checklock_irqsave(lock, flags, false, cc);
-}
-
 /* Returns true if the page is within a block suitable for migration to */
 static bool suitable_migration_target(struct page *page)
 {
@@ -740,7 +734,6 @@ static void isolate_freepages(struct zone *zone,
continue;
 
/* Found a block suitable for isolating free pages from */
-   isolated = 0;
 
/*
 * Take care when isolating in last pageblock of a zone which
@@ -1169,9 +1162,6 @@ static void __compact_pgdat(pg_data_t *pgdat, struct 
compact_control *cc)
if (zone_watermark_ok(zone, cc->order,
low_wmark_pages(zone), 0, 0))
compaction_defer_reset(zone, cc->order, false);
-   /* Currently async compaction is never deferred. */
-   else if (cc->sync)
-   defer_compaction(zone, cc->order);
}
 
VM_BUG_ON(!list_empty(&cc->freepages));
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 03/25] ms/mm: compaction: encapsulate defer reset logic

2018-01-24 Thread Andrey Ryabinin
From: Vlastimil Babka 

Currently there are several functions to manipulate the deferred
compaction state variables.  The remaining case where the variables are
touched directly is when a successful allocation occurs in direct
compaction, or is expected to be successful in the future by kswapd.
Here, the lowest order that is expected to fail is updated, and in the
case of successful allocation, the deferred status and counter is reset
completely.

Create a new function compaction_defer_reset() to encapsulate this
functionality and make it easier to understand the code.  No functional
change.

Signed-off-by: Vlastimil Babka 
Acked-by: Mel Gorman 
Reviewed-by: Rik van Riel 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit de6c60a6c115acaa721cfd499e028a413d1fcbf3)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/compaction.h | 16 
 mm/compaction.c|  9 -
 mm/page_alloc.c|  5 +
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 091d72e70d8a..7e1c76e3cd68 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -62,6 +62,22 @@ static inline bool compaction_deferred(struct zone *zone, 
int order)
return zone->compact_considered < defer_limit;
 }
 
+/*
+ * Update defer tracking counters after successful compaction of given order,
+ * which means an allocation either succeeded (alloc_success == true) or is
+ * expected to succeed.
+ */
+static inline void compaction_defer_reset(struct zone *zone, int order,
+   bool alloc_success)
+{
+   if (alloc_success) {
+   zone->compact_considered = 0;
+   zone->compact_defer_shift = 0;
+   }
+   if (order >= zone->compact_order_failed)
+   zone->compact_order_failed = order + 1;
+}
+
 /* Returns true if restarting compaction after many failures */
 static inline bool compaction_restarting(struct zone *zone, int order)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index 31ea5ed95194..f324cbaf923d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1150,12 +1150,11 @@ static void __compact_pgdat(pg_data_t *pgdat, struct 
compact_control *cc)
compact_zone(zone, cc);
 
if (cc->order > 0) {
-   int ok = zone_watermark_ok(zone, cc->order,
-   low_wmark_pages(zone), 0, 0);
-   if (ok && cc->order >= zone->compact_order_failed)
-   zone->compact_order_failed = cc->order + 1;
+   if (zone_watermark_ok(zone, cc->order,
+   low_wmark_pages(zone), 0, 0))
+   compaction_defer_reset(zone, cc->order, false);
/* Currently async compaction is never deferred. */
-   else if (!ok && cc->sync)
+   else if (cc->sync)
defer_compaction(zone, cc->order);
}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec66148d09ba..691c9bdbede3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2683,10 +2683,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
int order,
preferred_zone, migratetype);
if (page) {
preferred_zone->compact_blockskip_flush = false;
-   preferred_zone->compact_considered = 0;
-   preferred_zone->compact_defer_shift = 0;
-   if (order >= preferred_zone->compact_order_failed)
-   preferred_zone->compact_order_failed = order + 1;
+   compaction_defer_reset(preferred_zone, order, true);
count_vm_event(COMPACTSUCCESS);
return page;
}
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 18/25] ms/mm, compaction: add per-zone migration pfn cache for async compaction

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

Each zone has a cached migration scanner pfn for memory compaction so that
subsequent calls to memory compaction can start where the previous call
left off.

Currently, the compaction migration scanner only updates the per-zone
cached pfn when pageblocks were not skipped for async compaction.  This
creates a dependency on calling sync compaction to avoid having subsequent
calls to async compaction scan an enormous number of non-MOVABLE
pageblocks each time it is called.  On large machines, this could be
potentially very expensive.

This patch adds a per-zone cached migration scanner pfn only for async
compaction.  It is updated every time a pageblock has been scanned in its
entirety and when no pages from it were successfully isolated.  The cached
migration scanner pfn for sync compaction is updated only when called for
sync compaction.

Signed-off-by: David Rientjes 
Acked-by: Vlastimil Babka 
Reviewed-by: Naoya Horiguchi 
Cc: Greg Thelen 
Cc: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 35979ef3393110ff3c12c6b94552208d3bdf1a36)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/mmzone.h |  5 ++--
 mm/compaction.c| 66 ++
 2 files changed, 43 insertions(+), 28 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 21963879b00a..8c23907d2bc3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -380,9 +380,10 @@ struct zone {
/* Set to true when the PG_migrate_skip bits should be cleared */
boolcompact_blockskip_flush;
 
-   /* pfns where compaction scanners should start */
+   /* pfn where compaction free scanner should start */
unsigned long   compact_cached_free_pfn;
-   unsigned long   compact_cached_migrate_pfn;
+   /* pfn where async and sync compaction migration scanner should start */
+   unsigned long   compact_cached_migrate_pfn[2];
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
diff --git a/mm/compaction.c b/mm/compaction.c
index 109647df914d..c20efb9ba784 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -91,7 +91,8 @@ static void __reset_isolation_suitable(struct zone *zone)
unsigned long end_pfn = zone_end_pfn(zone);
unsigned long pfn;
 
-   zone->compact_cached_migrate_pfn = start_pfn;
+   zone->compact_cached_migrate_pfn[0] = start_pfn;
+   zone->compact_cached_migrate_pfn[1] = start_pfn;
zone->compact_cached_free_pfn = end_pfn;
zone->compact_blockskip_flush = false;
 
@@ -133,9 +134,10 @@ void reset_isolation_suitable(pg_data_t *pgdat)
  */
 static void update_pageblock_skip(struct compact_control *cc,
struct page *page, unsigned long nr_isolated,
-   bool migrate_scanner)
+   bool set_unsuitable, bool migrate_scanner)
 {
struct zone *zone = cc->zone;
+   unsigned long pfn;
 
if (cc->ignore_skip_hint)
return;
@@ -143,20 +145,31 @@ static void update_pageblock_skip(struct compact_control 
*cc,
if (!page)
return;
 
-   if (!nr_isolated) {
-   unsigned long pfn = page_to_pfn(page);
+   if (nr_isolated)
+   return;
+
+   /*
+* Only skip pageblocks when all forms of compaction will be known to
+* fail in the near future.
+*/
+   if (set_unsuitable)
set_pageblock_skip(page);
 
-   /* Update where compaction should restart */
-   if (migrate_scanner) {
-   if (!cc->finished_update_migrate &&
-   pfn > zone->compact_cached_migrate_pfn)
-   zone->compact_cached_migrate_pfn = pfn;
-   } else {
-   if (!cc->finished_update_free &&
-   pfn < zone->compact_cached_free_pfn)
-   zone->compact_cached_free_pfn = pfn;
-   }
+   pfn = page_to_pfn(page);
+
+   /* Update where async and sync compaction should restart */
+   if (migrate_scanner) {
+   if (cc->finished_update_migrate)
+   return;
+   if (pfn > zone->compact_cached_migrate_pfn[0])
+   zone->compact_cached_migrate_pfn[0] = pfn;
+   if (cc->sync && pfn > zone->compact_cached_migrate_pfn[1])
+   zone->compact_cached_migrate_pfn[1] = pfn;
+   } else {
+   if (cc->finished_update_free)
+   return;
+   if (pfn < zone->compact_cached_free_pfn)
+   zone->compact_cached_free_pfn = pfn;
}
 }
 #else
@@ -168,7 +181,7 @

[Devel] [PATCH rh7 07/25] ms/mm/compaction: disallow high-order page for migration target

2018-01-24 Thread Andrey Ryabinin
From: Joonsoo Kim 

The purpose of compaction is to get a high-order page.  Currently, if we
find a high-order page while searching for a migration target page, we break
it into order-0 pages and use them as migration targets.  That is contrary to
the purpose of compaction, so disallow high-order pages from being used as
migration targets.

Additionally, clean up the logic in suitable_migration_target() to simplify
the code.  There are no functional changes from this clean-up.

Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 7d348b9ea64db0a315d777ce7d4b06697f946503)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 981e6b601bfe..1faade458d38 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -219,21 +219,12 @@ static inline bool compact_trylock_irqsave(spinlock_t 
*lock,
 /* Returns true if the page is within a block suitable for migration to */
 static bool suitable_migration_target(struct page *page)
 {
-   int migratetype = get_pageblock_migratetype(page);
-
-   /* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
-   if (migratetype == MIGRATE_RESERVE)
-   return false;
-
-   if (is_migrate_isolate(migratetype))
-   return false;
-
-   /* If the page is a large free page, then allow migration */
+   /* If the page is a large free page, then disallow migration */
if (PageBuddy(page) && page_order(page) >= pageblock_order)
-   return true;
+   return false;
 
/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
-   if (migrate_async_suitable(migratetype))
+   if (migrate_async_suitable(get_pageblock_migratetype(page)))
return true;
 
/* Otherwise skip the block */
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 11/25] ms/mm/compaction: clean-up code on success of ballon isolation

2018-01-24 Thread Andrey Ryabinin
From: Joonsoo Kim 

This is just a clean-up to reduce code size and improve readability.
There is no functional change.

Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit b6c750163c0d138f5041d95fcdbd1094b6928057)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ba9fa713038d..d833777ddaee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -566,11 +566,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
if (unlikely(balloon_page_movable(page))) {
if (balloon_page_isolate(page)) {
/* Successfully isolated */
-   cc->finished_update_migrate = true;
-   list_add(&page->lru, migratelist);
-   cc->nr_migratepages++;
-   nr_isolated++;
-   goto check_compact_cluster;
+   goto isolate_success;
}
}
continue;
@@ -631,13 +627,14 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
/* Successfully isolated */
-   cc->finished_update_migrate = true;
del_page_from_lru_list(page, lruvec, page_lru(page));
+
+isolate_success:
+   cc->finished_update_migrate = true;
list_add(&page->lru, migratelist);
cc->nr_migratepages++;
nr_isolated++;
 
-check_compact_cluster:
/* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
++low_pfn;
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 04/25] ms/mm: compaction: do not mark unmovable pageblocks as skipped in async compaction

2018-01-24 Thread Andrey Ryabinin
From: Vlastimil Babka 

Compaction temporarily marks pageblocks where it fails to isolate pages
as to-be-skipped in further compactions, in order to improve efficiency.
One of the reasons to fail isolating pages is that isolation is not
attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.

The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
async compaction become skipped due to the temporary mark also in future
sync compaction.  Moreover, this may follow quite soon during
__alloc_page_slowpath, without much time for kswapd to clear the
pageblock skip marks.  This goes against the idea that sync compaction
should try to scan these blocks more thoroughly than the async
compaction.

The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
blocks are not marked to be skipped.  Note this should not affect
performance or locking impact of further async compactions, as skipping
a block due to being !MIGRATE_MOVABLE is done soon after skipping a
block marked to be skipped, both without locking.

Signed-off-by: Vlastimil Babka 
Cc: Rik van Riel 
Acked-by: Mel Gorman 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 50b5b094e683f8e51e82c6dfe97b1608cf97e6c0)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f324cbaf923d..8ebf3d10ef17 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -472,6 +472,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
unsigned long flags;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
+   bool skipped_async_unsuitable = false;
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -547,6 +548,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
if (!cc->sync && last_pageblock_nr != pageblock_nr &&
!migrate_async_suitable(get_pageblock_migratetype(page))) {
cc->finished_update_migrate = true;
+   skipped_async_unsuitable = true;
goto next_pageblock;
}
 
@@ -640,8 +642,13 @@ next_pageblock:
if (locked)
spin_unlock_irqrestore(&zone->lru_lock, flags);
 
-   /* Update the pageblock-skip if the whole pageblock was scanned */
-   if (low_pfn == end_pfn)
+   /*
+* Update the pageblock-skip information and cached scanner pfn,
+* if the whole pageblock was scanned without isolating any page.
+* This is not done when pageblock was skipped due to being unsuitable
+* for async compaction, so that eventual sync compaction can try.
+*/
+   if (low_pfn == end_pfn && !skipped_async_unsuitable)
update_pageblock_skip(cc, valid_page, nr_isolated, true);
 
trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 06/25] ms/mm, compaction: avoid isolating pinned pages

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

Page migration will fail for memory that is pinned in memory with, for
example, get_user_pages().  In this case, it is unnecessary to take
zone->lru_lock or isolating the page and passing it to page migration
which will ultimately fail.

This is a racy check, the page can still change from under us, but in
that case we'll just fail later when attempting to move the page.

This avoids very expensive memory compaction when faulting transparent
hugepages after pinning a lot of memory with a Mellanox driver.

On a 128GB machine and pinning ~120GB of memory, before this patch we
see the enormous disparity in the number of page migration failures
because of the pinning (from /proc/vmstat):

compact_pages_moved 8450
compact_pagemigrate_failed 15614415

0.05% of pages isolated are successfully migrated and explicitly
triggering memory compaction takes 102 seconds.  After the patch:

compact_pages_moved 9197
compact_pagemigrate_failed 7

99.9% of pages isolated are now successfully migrated in this
configuration and memory compaction takes less than one second.

Signed-off-by: David Rientjes 
Acked-by: Hugh Dickins 
Acked-by: Mel Gorman 
Cc: Joonsoo Kim 
Cc: Rik van Riel 
Cc: Greg Thelen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 119d6d59dcc0980dcd581fdadb6b2033b512a473)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index bbb1f65b0041..981e6b601bfe 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -588,6 +588,15 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
continue;
}
 
+   /*
+* Migration will fail if an anonymous page is pinned in memory,
+* so avoid taking lru_lock and isolating it unnecessarily in an
+* admittedly racy check.
+*/
+   if (!page_mapping(page) &&
+   page_count(page) > page_mapcount(page))
+   continue;
+
/* Check if it is ok to still hold the lock */
locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
locked, cc);
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 00/25] compaction related stable backports.

2018-01-24 Thread Andrey Ryabinin
These are some compaction-related -stable backports that we are missing.


David Rientjes (9):
  ms/mm/compaction.c: periodically schedule when freeing pages
  ms/mm, compaction: avoid isolating pinned pages
  ms/mm, compaction: determine isolation mode only once
  ms/mm, compaction: ignore pageblock skip when manually invoking
compaction
  ms/mm, migration: add destination page freeing callback
  ms/mm, compaction: return failed migration target pages back to
freelist
  ms/mm, compaction: add per-zone migration pfn cache for async
compaction
  ms/mm, compaction: embed migration mode in compact_control
  ms/mm, compaction: terminate async compaction when rescheduling

Heesub Shin (1):
  ms/mm/compaction: clean up unused code lines

Hugh Dickins (1):
  ms/mm: fix direct reclaim writeback regression

Joonsoo Kim (6):
  ms/mm/compaction: disallow high-order page for migration target
  ms/mm/compaction: do not call suitable_migration_target() on every
page
  ms/mm/compaction: change the timing to check to drop the spinlock
  ms/mm/compaction: check pageblock suitability once per pageblock
  ms/mm/compaction: clean-up code on success of ballon isolation
  ms/mm/compaction: fix wrong order check in compact_finished()

Mel Gorman (1):
  ms/mm: compaction: trace compaction begin and end

Vlastimil Babka (7):
  ms/mm: compaction: encapsulate defer reset logic
  ms/mm: compaction: do not mark unmovable pageblocks as skipped in
async compaction
  ms/mm: compaction: reset scanner positions immediately when they meet
  ms/mm/compaction: cleanup isolate_freepages()
  ms/mm/compaction: do not count migratepages when unnecessary
  ms/mm/compaction: avoid rescanning pageblocks in isolate_freepages
  ms/mm, compaction: properly signal and act upon lock and need_sched()
contention

 include/linux/compaction.h|  20 ++-
 include/linux/migrate.h   |  11 +-
 include/linux/mmzone.h|   5 +-
 include/trace/events/compaction.h |  67 +++-
 mm/compaction.c   | 352 ++
 mm/internal.h |   8 +-
 mm/memcontrol.c   |   2 +-
 mm/memory-failure.c   |   4 +-
 mm/memory_hotplug.c   |   2 +-
 mm/mempolicy.c|   4 +-
 mm/migrate.c  |  57 --
 mm/page_alloc.c   |  44 ++---
 12 files changed, 363 insertions(+), 213 deletions(-)

-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 02/25] ms/mm: compaction: trace compaction begin and end

2018-01-24 Thread Andrey Ryabinin
[The stress-highalloc benchmark table from the upstream commit message is
garbled and truncated in this archive. Its rows listed, for five kernel
configurations: page migrate failures (all 0), compaction pages isolated,
compaction migrate/free pages scanned, compaction cost, and the NUMA
statistics (PTE updates, hint faults, local faults, local percent, pages
migrated, AutoNUMA cost), with no NUMA activity in any configuration.]

There are some differences from the previous results for THP-like allocations:

- Here, the bad result for unpatched kernel in phase 3 is much more
  consistent to be between 65-70% and not related to the "regression" in
  3.12.  Still there is the improvement from patch 4 onwards, which brings
  it on par with simple GFP_HIGHUSER_MOVABLE allocations.

- Compaction costs have increased, but nowhere near as much as the
  non-THP case.  Again, the patches should be worth the gained
  determinism.

- Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
   This is most likely due to __GFP_NO_KSWAPD flag, which means the cached
  pfn's and pageblock skip bits are not reset by kswapd that often (at
  least in phase 3 where no concurrent activity would wake up kswapd) and
  the patches thus help the sync-after-async compaction.  It doesn't
  however show that the sync compaction would help so much with success
  rates, which can be again seen as a limitation of the benchmark
  scenario.

This patch (of 6):

Add two tracepoints for compaction begin and end of a zone.  Using this it
is possible to calculate how much time a workload is spending within
compaction and potentially debug problems related to cached pfns for
scanning.  In combination with the direct reclaim and slab trace points it
should be possible to estimate most allocation-related overhead for a
workload.
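
For example (assuming the usual tracefs layout, which is not something added
by this patch), the new events could be enabled with
"echo 1 > /sys/kernel/debug/tracing/events/compaction/mm_compaction_begin/enable"
(and likewise for mm_compaction_end), and the time between the two events
read from trace_pipe to estimate how long compact_zone() ran.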

Signed-off-by: Mel Gorman 
Signed-off-by: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 0eb927c0ab789d3d7d69f68acb850f69d4e7c36f)
Signed-off-by: Andrey Ryabinin 
---
 include/trace/events/compaction.h | 42 +++
 mm/compaction.c   |  4 
 2 files changed, 46 insertions(+)

diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index fde1b3e94c7d..06f544ef2f6f 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -67,6 +67,48 @@ TRACE_EVENT(mm_compaction_migratepages,
__entry->nr_failed)
 );
 
+TRACE_EVENT(mm_compaction_begin,
+   TP_PROTO(unsigned long zone_start, unsigned long migrate_start,
+   unsigned long free_start, unsigned long zone_end),
+
+   TP_ARGS(zone_start, migrate_start, free_start, zone_end),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, zone_start)
+   __field(unsigned long, migrate_start)
+   __field(unsigned long, free_start)
+   __field(unsigned long, zone_end)
+   ),
+
+   TP_fast_assign(
+   __entry->zone_start = zone_start;
+   __entry->migrate_start = migrate_start;
+   __entry->free_start = free_start;
+   __entry->zone_end = zone_end;
+   ),
+
+   TP_printk("zone_start=%lu migrate_start=%lu free_start=%lu 
zone_end=%lu",
+   __entry->zone_start,
+   __entry->migrate_start,
+   __entry->free_start,
+   __entry->zone_end)
+);
+
+TRACE_EVENT(mm_compaction_end,
+   TP_PROTO(int status),
+
+   TP_ARGS(status),
+
+   TP_STRUCT__entry(
+   __field(int, status)
+   ),
+
+   TP_fast_assign(
+   __entry->status = status;
+   ),
+
+   TP_printk("status=%d", __entry->status)
+);
 
 #endif /* _TRACE_COMPACTION_H */
 
diff --git a/mm/compaction.c b/mm/compaction.c
index f693bf3b87e2..31ea5ed95194 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -992,6 +992,8 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
zone->compact_cached_migrate_pfn = cc->migrate_pfn;
}
 
+   trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, cc->free_pfn, end_pfn);
+
migrate_prep_local();
 
while ((ret = compact_finished(zone

[Devel] [PATCH rh7 05/25] ms/mm: compaction: reset scanner positions immediately when they meet

2018-01-24 Thread Andrey Ryabinin
From: Vlastimil Babka 

Compaction used to start its migrate and free page scaners at the zone's
lowest and highest pfn, respectively.  Later, caching was introduced to
remember the scanners' progress across compaction attempts so that
pageblocks are not re-scanned uselessly.  Additionally, pageblocks where
isolation failed are marked to be quickly skipped when encountered again
in future compactions.

Currently, both the reset of cached pfn's and clearing of the pageblock
skip information for a zone is done in __reset_isolation_suitable().
This function gets called when:

 - compaction is restarting after being deferred
 - compact_blockskip_flush flag is set in compact_finished() when the scanners
   meet (and not again cleared when direct compaction succeeds in allocation)
   and kswapd acts upon this flag before going to sleep

This behavior is suboptimal for several reasons:

 - when direct sync compaction is called after async compaction fails (in the
   allocation slowpath), it will effectively do nothing, unless kswapd
   happens to process the compact_blockskip_flush flag meanwhile. This is racy
   and goes against the purpose of sync compaction to more thoroughly retry
   the compaction of a zone where async compaction has failed.
   The restart-after-deferring path cannot help here as deferring happens only
   after the sync compaction fails. It is also done only for the preferred
   zone, while the compaction might be done for a fallback zone.

 - the mechanism of marking pageblock to be skipped has little value since the
   cached pfn's are reset only together with the pageblock skip flags. This
   effectively limits pageblock skip usage to parallel compactions.

This patch changes compact_finished() so that cached pfn's are reset
immediately when the scanners meet.  Clearing pageblock skip flags is
unchanged, as well as the other situations where cached pfn's are reset.
This allows the sync-after-async compaction to retry pageblocks not
marked as skipped, such as blocks !MIGRATE_MOVABLE blocks that async
compactions now skips without marking them.

Signed-off-by: Vlastimil Babka 
Cc: Rik van Riel 
Acked-by: Mel Gorman 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 55b7c4c99f6a448f72179297fe6432544f220063)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8ebf3d10ef17..bbb1f65b0041 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -866,6 +866,10 @@ static int compact_finished(struct zone *zone,
 
/* Compaction run completes if the migrate and free scanner meet */
if (cc->free_pfn <= cc->migrate_pfn) {
+   /* Let the next compaction start anew. */
+   zone->compact_cached_migrate_pfn = zone->zone_start_pfn;
+   zone->compact_cached_free_pfn = zone_end_pfn(zone);
+
/*
 * Mark that the PG_migrate_skip information should be cleared
 * by kswapd when it goes to sleep. kswapd does not set the
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 17/25] ms/mm, compaction: return failed migration target pages back to freelist

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

Greg reported that he found isolated free pages were returned back to the
VM rather than the compaction freelist.  This will cause holes behind the
free scanner and cause it to reallocate additional memory if necessary
later.

He detected the problem at runtime seeing that ext4 metadata pages (esp
the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
constantly visited by compaction calls of migrate_pages().  These pages
had a non-zero b_count which caused fallback_migrate_page() ->
try_to_release_page() -> try_to_free_buffers() to fail.

Memory compaction works by having a "freeing scanner" scan from one end of
a zone which isolates pages as migration targets while another "migrating
scanner" scans from the other end of the same zone which isolates pages
for migration.

When page migration fails for an isolated page, the target page is
returned to the system rather than the freelist built by the freeing
scanner.  This may require the freeing scanner to continue scanning memory
after suitable migration targets have already been returned to the system
needlessly.

This patch returns destination pages to the freeing scanner freelist when
page migration fails.  This prevents unnecessary work done by the freeing
scanner but also encourages memory to be as compacted as possible at the
end of the zone.

Signed-off-by: David Rientjes 
Reported-by: Greg Thelen 
Acked-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Reviewed-by: Naoya Horiguchi 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit d53aea3d46d64e95da9952887969f7533b9ab25e)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7e74add6b9c2..109647df914d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -794,23 +794,32 @@ static struct page *compaction_alloc(struct page 
*migratepage,
 }
 
 /*
- * We cannot control nr_migratepages and nr_freepages fully when migration is
- * running as migrate_pages() has no knowledge of compact_control. When
- * migration is complete, we count the number of pages on the lists by hand.
+ * This is a migrate-callback that "frees" freepages back to the isolated
+ * freelist.  All pages on the freelist are from the same zone, so there is no
+ * special handling needed for NUMA.
+ */
+static void compaction_free(struct page *page, unsigned long data)
+{
+   struct compact_control *cc = (struct compact_control *)data;
+
+   list_add(&page->lru, &cc->freepages);
+   cc->nr_freepages++;
+}
+
+/*
+ * We cannot control nr_migratepages fully when migration is running as
+ * migrate_pages() has no knowledge of of compact_control.  When migration is
+ * complete, we count the number of pages on the list by hand.
  */
 static void update_nr_listpages(struct compact_control *cc)
 {
int nr_migratepages = 0;
-   int nr_freepages = 0;
struct page *page;
 
list_for_each_entry(page, &cc->migratepages, lru)
nr_migratepages++;
-   list_for_each_entry(page, &cc->freepages, lru)
-   nr_freepages++;
 
cc->nr_migratepages = nr_migratepages;
-   cc->nr_freepages = nr_freepages;
 }
 
 /* possible outcome of isolate_migratepages */
@@ -1020,8 +1029,8 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
}
 
nr_migrate = cc->nr_migratepages;
-   err = migrate_pages(&cc->migratepages, compaction_alloc, NULL,
-   (unsigned long)cc,
+   err = migrate_pages(&cc->migratepages, compaction_alloc,
+   compaction_free, (unsigned long)cc,
cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
MR_COMPACTION);
update_nr_listpages(cc);
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 16/25] ms/mm, migration: add destination page freeing callback

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

Memory migration uses a callback defined by the caller to determine how to
allocate destination pages.  When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages.  If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails.  If the caller passes NULL then
freeing back to the system will be handled as usual.  This patch
introduces no functional change.
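
A minimal caller-side sketch of the new interface (not part of this patch;
the demo_* helpers are hypothetical):

#include <linux/gfp.h>
#include <linux/migrate.h>

/* Allocate a destination page however the caller prefers. */
static struct page *demo_new_page(struct page *page, unsigned long private,
                                  int **reason)
{
        return alloc_page(GFP_KERNEL);
}

/* Called only for destination pages whose migration failed. */
static void demo_free_page(struct page *page, unsigned long private)
{
        __free_page(page);
}

static int demo_migrate(struct list_head *pagelist)
{
        /* passing NULL instead of demo_free_page keeps the old behaviour */
        return migrate_pages(pagelist, demo_new_page, demo_free_page,
                             0, MIGRATE_SYNC, MR_SYSCALL);
}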

Signed-off-by: David Rientjes 
Reviewed-by: Naoya Horiguchi 
Acked-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Greg Thelen 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/migrate.h | 11 ++
 mm/compaction.c |  2 +-
 mm/memcontrol.c |  2 +-
 mm/memory-failure.c |  4 ++--
 mm/memory_hotplug.c |  2 +-
 mm/mempolicy.c  |  4 ++--
 mm/migrate.c| 56 +
 mm/page_alloc.c |  2 +-
 8 files changed, 53 insertions(+), 30 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ba9b278d8f63..453f40ce636d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -5,7 +5,9 @@
 #include 
 #include 
 
-typedef struct page *new_page_t(struct page *, unsigned long private, int **);
+typedef struct page *new_page_t(struct page *page, unsigned long private,
+   int **reason);
+typedef void free_page_t(struct page *page, unsigned long private);
 
 /*
  * Return values from addresss_space_operations.migratepage():
@@ -30,7 +32,7 @@ extern void putback_lru_pages(struct list_head *l);
 extern void putback_movable_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
-extern int migrate_pages(struct list_head *l, new_page_t x,
+extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
unsigned long private, enum migrate_mode mode, int reason);
 
 extern int fail_migrate_page(struct address_space *,
@@ -53,8 +55,9 @@ extern int migrate_page_move_mapping(struct address_space 
*mapping,
 
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline void putback_movable_pages(struct list_head *l) {}
-static inline int migrate_pages(struct list_head *l, new_page_t x,
-   unsigned long private, enum migrate_mode mode, int reason)
+static inline int migrate_pages(struct list_head *l, new_page_t new,
+   free_page_t free, unsigned long private, enum migrate_mode mode,
+   int reason)
{ return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 0a2d4eded2e0..7e74add6b9c2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1020,7 +1020,7 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
}
 
nr_migrate = cc->nr_migratepages;
-   err = migrate_pages(&cc->migratepages, compaction_alloc,
+   err = migrate_pages(&cc->migratepages, compaction_alloc, NULL,
(unsigned long)cc,
cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
MR_COMPACTION);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 116b303319af..f50377729d10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5087,7 +5087,7 @@ static long __memcg_numa_migrate_pages(struct lruvec 
*lruvec, enum lru_list lru,
if (!scanned)
break;
 
-   ret = migrate_pages(&pages, memcg_numa_migrate_new_page,
+   ret = migrate_pages(&pages, memcg_numa_migrate_new_page, NULL,
(unsigned long)&ms, MIGRATE_ASYNC,
MR_SYSCALL);
putback_lru_pages(&pages);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e1c12352c7fd..f5fdd96740f0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1564,7 +1564,7 @@ static int soft_offline_huge_page(struct page *page, int 
flags)
return -EBUSY;
}
 
-   ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
+   ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
@@ -1719,7

[Devel] [PATCH rh7 08/25] ms/mm/compaction: do not call suitable_migration_target() on every page

2018-01-24 Thread Andrey Ryabinin
From: Joonsoo Kim 

suitable_migration_target() checks whether a pageblock is a suitable
migration target.  In isolate_freepages_block() it is called on every
page, which is inefficient, so call it only once per pageblock.

suitable_migration_target() also checks whether the page is high-order or
not, but its criterion for high-order is the pageblock order, so calling it
once per pageblock range is not a problem.

Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 1faade458d38..c29883fe146d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -247,6 +247,7 @@ static unsigned long isolate_freepages_block(struct 
compact_control *cc,
struct page *cursor, *valid_page = NULL;
unsigned long flags;
bool locked = false;
+   bool checked_pageblock = false;
 
cursor = pfn_to_page(blockpfn);
 
@@ -278,8 +279,16 @@ static unsigned long isolate_freepages_block(struct 
compact_control *cc,
break;
 
/* Recheck this is a suitable migration target under lock */
-   if (!strict && !suitable_migration_target(page))
-   break;
+   if (!strict && !checked_pageblock) {
+   /*
+* We need to check suitability of pageblock only once
+* and this isolate_freepages_block() is called with
+* pageblock range, so just check once is sufficient.
+*/
+   checked_pageblock = true;
+   if (!suitable_migration_target(page))
+   break;
+   }
 
/* Recheck this is a buddy page under lock */
if (!PageBuddy(page))
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 20/25] ms/mm, compaction: terminate async compaction when rescheduling

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

Async compaction terminates prematurely when need_resched(), see
compact_checklock_irqsave().  This can never trigger, however, if the
cond_resched() in isolate_migratepages_range() always takes care of the
scheduling.

If the cond_resched() actually triggers, then terminate this pageblock
scan for async compaction as well.

Signed-off-by: David Rientjes 
Acked-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit aeef4b83806f49a0c454b7d4578671b71045bee2)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 880c420f6ca3..35911b29683f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -501,8 +501,13 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
return 0;
}
 
+   if (cond_resched()) {
+   /* Async terminates prematurely on need_resched() */
+   if (cc->mode == MIGRATE_ASYNC)
+   return 0;
+   }
+
/* Time to isolate some pages for migration */
-   cond_resched();
for (; low_pfn < end_pfn; low_pfn++) {
/* give a chance to irqs before checking need_resched() */
if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
-- 
2.13.6

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 22/25] ms/mm/compaction: avoid rescanning pageblocks in isolate_freepages

2018-01-24 Thread Andrey Ryabinin
From: Vlastimil Babka 

The compaction free scanner in isolate_freepages() currently remembers PFN
of the highest pageblock where it successfully isolates, to be used as the
starting pageblock for the next invocation.  The rationale behind this is
that page migration might return free pages to the allocator when
migration fails and we don't want to skip them if the compaction
continues.

Since migration now returns free pages back to compaction code where they
can be reused, this is no longer a concern.  This patch changes
isolate_freepages() so that the PFN for restarting is updated with each
pageblock where isolation is attempted.  Using stress-highalloc from
mmtests, this resulted in 10% reduction of the pages scanned by the free
scanner.

Note that the somewhat similar functionality that records highest
successful pageblock in zone->compact_cached_free_pfn, remains unchanged.
This cache is used when the whole compaction is restarted, not for
multiple invocations of the free scanner during single compaction.

Signed-off-by: Vlastimil Babka 
Cc: Minchan Kim 
Cc: Mel Gorman 
Cc: Joonsoo Kim 
Cc: Bartlomiej Zolnierkiewicz 
Acked-by: Michal Nazarewicz 
Reviewed-by: Naoya Horiguchi 
Cc: Christoph Lameter 
Cc: Rik van Riel 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit e9ade569910a82614ff5f2c2cea2b65a8d785da4)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 316a7b34ce37..d8ee1536819f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -692,7 +692,6 @@ static void isolate_freepages(struct zone *zone,
unsigned long block_start_pfn;  /* start of current pageblock */
unsigned long block_end_pfn;/* end of current pageblock */
unsigned long low_pfn;   /* lowest pfn scanner is able to scan */
-   unsigned long next_free_pfn; /* start pfn for scaning at next round */
int nr_freepages = cc->nr_freepages;
struct list_head *freelist = &cc->freepages;
 
@@ -713,12 +712,6 @@ static void isolate_freepages(struct zone *zone,
low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
 
/*
-* If no pages are isolated, the block_start_pfn < low_pfn check
-* will kick in.
-*/
-   next_free_pfn = 0;
-
-   /*
 * Isolate free pages until enough are available to migrate the
 * pages on cc->migratepages. We stop searching if the migrate
 * and free page scanners meet or enough free pages are isolated.
@@ -758,19 +751,19 @@ static void isolate_freepages(struct zone *zone,
continue;
 
/* Found a block suitable for isolating free pages from */
+   cc->free_pfn = block_start_pfn;
isolated = isolate_freepages_block(cc, block_start_pfn,
block_end_pfn, freelist, false);
nr_freepages += isolated;
 
/*
-* Record the highest PFN we isolated pages from. When next
-* looking for free pages, the search will restart here as
-* page migration may have returned some pages to the allocator
+* Set a flag that we successfully isolated in this pageblock.
+* In the next loop iteration, zone->compact_cached_free_pfn
+* will not be updated and thus it will effectively contain the
+* highest pageblock we isolated pages from.
 */
-   if (isolated && next_free_pfn == 0) {
+   if (isolated)
cc->finished_update_free = true;
-   next_free_pfn = block_start_pfn;
-   }
}
 
/* split_free_page does not map the pages */
@@ -781,9 +774,8 @@ static void isolate_freepages(struct zone *zone,
 * so that compact_finished() may detect this
 */
if (block_start_pfn < low_pfn)
-   next_free_pfn = cc->migrate_pfn;
+   cc->free_pfn = cc->migrate_pfn;
 
-   cc->free_pfn = next_free_pfn;
cc->nr_freepages = nr_freepages;
 }
 
-- 
2.13.6



[Devel] [PATCH rh7 19/25] ms/mm, compaction: embed migration mode in compact_control

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

We're going to want to manipulate the migration mode for compaction in the
page allocator, and currently compact_control's sync field is only a bool.

Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
depending on the value of this bool.  Convert the bool to enum
migrate_mode and pass the migration mode in directly.  Later, we'll want
to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
avoid unnecessary latency.

This also alters compaction triggered from sysfs, either for the entire
system or for a node, to force MIGRATE_SYNC.
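
For reference (not part of the patch): the enum that replaces the bool is
defined in include/linux/migrate_mode.h; comments here are paraphrased:

	enum migrate_mode {
		MIGRATE_ASYNC,		/* never block */
		MIGRATE_SYNC_LIGHT,	/* allow blocking on most operations,
					 * but not on ->writepage */
		MIGRATE_SYNC,		/* may block, including waiting for
					 * page writeback */
	};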

[a...@linux-foundation.org: fix build]
[iamjoonsoo@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
Signed-off-by: David Rientjes 
Suggested-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Greg Thelen 
Cc: Naoya Horiguchi 
Signed-off-by: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit e0b9daeb453e602a95ea43853dc12d385558ce1f)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/compaction.h |  4 ++--
 mm/compaction.c| 36 +++-
 mm/internal.h  |  3 ++-
 mm/page_alloc.c| 39 +--
 4 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7e1c76e3cd68..01e3132820da 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, 
int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
-   bool sync, bool *contended);
+   enum migrate_mode mode, bool *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -91,7 +91,7 @@ static inline bool compaction_restarting(struct zone *zone, 
int order)
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
-   bool sync, bool *contended)
+   enum migrate_mode mode, bool *contended)
 {
return COMPACT_CONTINUE;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index c20efb9ba784..880c420f6ca3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -163,7 +163,8 @@ static void update_pageblock_skip(struct compact_control 
*cc,
return;
if (pfn > zone->compact_cached_migrate_pfn[0])
zone->compact_cached_migrate_pfn[0] = pfn;
-   if (cc->sync && pfn > zone->compact_cached_migrate_pfn[1])
+   if (cc->mode != MIGRATE_ASYNC &&
+   pfn > zone->compact_cached_migrate_pfn[1])
zone->compact_cached_migrate_pfn[1] = pfn;
} else {
if (cc->finished_update_free)
@@ -210,7 +211,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, 
unsigned long *flags,
}
 
/* async aborts if taking too long or contended */
-   if (!cc->sync) {
+   if (cc->mode == MIGRATE_ASYNC) {
cc->contended = true;
return false;
}
@@ -480,7 +481,8 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
bool set_unsuitable = true;
-   const isolate_mode_t mode = (!cc->sync ? ISOLATE_ASYNC_MIGRATE : 0) |
+   const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
+   ISOLATE_ASYNC_MIGRATE : 0) |
(unevictable ? ISOLATE_UNEVICTABLE : 0);
 
/*
@@ -490,7 +492,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
 */
while (unlikely(too_many_isolated(zone))) {
/* async migration should just abort */
-   if (!cc->sync)
+   if (cc->mode == MIGRATE_ASYNC)
return 0;
 
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -555,7 +557,8 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
 * the minimum amount of work satisfies the allocation
 */
mt = get_pageblock_migratetype(page);
-   if (!cc->sync && !migrate_async_suitable(mt)) {
+   if (cc->mode == MIGRATE_ASYNC &&
+   !migrate_async_suitable(mt)) {

[Devel] [PATCH rh7 25/25] ms/mm: fix direct reclaim writeback regression

2018-01-24 Thread Andrey Ryabinin
From: Hugh Dickins 

Shortly before 3.16-rc1, Dave Jones reported:

  WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
   xfs_vm_writepage+0x5ce/0x630 [xfs]()
  CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
  Call Trace:
xfs_vm_writepage+0x5ce/0x630 [xfs]
shrink_page_list+0x8f9/0xb90
shrink_inactive_list+0x253/0x510
shrink_lruvec+0x563/0x6c0
shrink_zone+0x3b/0x100
shrink_zones+0x1f1/0x3c0
try_to_free_pages+0x164/0x380
__alloc_pages_nodemask+0x822/0xc90
alloc_pages_vma+0xaf/0x1c0
handle_mm_fault+0xa31/0xc50
  etc.

 970   if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
 971   PF_MEMALLOC))

I did not respond at the time, because a glance at the PageDirty block
in shrink_page_list() quickly shows that this is impossible: we don't do
writeback on file pages (other than tmpfs) from direct reclaim nowadays.
Dave was hallucinating, but it would have been disrespectful to say so.

However, my own /var/log/messages now shows similar complaints

  WARNING: CPU: 1 PID: 28814 at fs/ext4/inode.c:1881 ext4_writepage+0xa7/0x38b()
  WARNING: CPU: 0 PID: 27347 at fs/ext4/inode.c:1764 ext4_writepage+0xa7/0x38b()

from stressing some mmotm trees during July.

Could a dirty xfs or ext4 file page somehow get marked PageSwapBacked,
so fail shrink_page_list()'s page_is_file_cache() test, and so proceed
to mapping->a_ops->writepage()?

Yes, 3.16-rc1's commit 68711a746345 ("mm, migration: add destination
page freeing callback") has provided such a way to compaction: if
migrating a SwapBacked page fails, its newpage may be put back on the
list for later use with PageSwapBacked still set, and nothing will clear
it.

Whether that can do anything worse than issue WARN_ON_ONCEs, and get
some statistics wrong, is unclear: easier to fix than to think through
the consequences.

Fixing it here, before the put_new_page(), addresses the bug directly,
but is probably the worst place to fix it.  Page migration is doing too
many parts of the job on too many levels: fixing it in
move_to_new_page() to complement its SetPageSwapBacked would be
preferable, except why is it (and newpage->mapping and newpage->index)
done there, rather than down in migrate_page_move_mapping(), once we are
sure of success? Not a cleanup to get into right now, especially not
with memcg cleanups coming in 3.17.
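
For readers wondering how a stale PageSwapBacked bit gets a dirty ext4/xfs
page into ->writepage() from direct reclaim (not part of the patch):
shrink_page_list() classifies pages with page_is_file_cache(), which in
this era is essentially just the inverse of that flag:

	/* Simplified, from include/linux/mm_inline.h */
	static inline int page_is_file_cache(struct page *page)
	{
		/*
		 * A leftover PageSwapBacked bit makes a file page look like
		 * an anon/tmpfs page, so the "never write back file pages
		 * from direct reclaim" shortcut is skipped for it.
		 */
		return !PageSwapBacked(page);
	}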

Reported-by: Dave Jones 
Signed-off-by: Hugh Dickins 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 8bdd638091605dc66d92c57c4b80eb87fffc15f7)
Signed-off-by: Andrey Ryabinin 
---
 mm/migrate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 3369475fe15f..7741fcca6c2a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1010,9 +1010,10 @@ out:
 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
 * during isolation.
 */
-   if (rc != MIGRATEPAGE_SUCCESS && put_new_page)
+   if (rc != MIGRATEPAGE_SUCCESS && put_new_page) {
+   ClearPageSwapBacked(newpage);
put_new_page(newpage, private);
-   else if (unlikely(__is_movable_balloon_page(newpage)))
+   } else if (unlikely(__is_movable_balloon_page(newpage)))
/* drop our reference, page already in the balloon */
put_page(newpage);
else
-- 
2.13.6



[Devel] [PATCH rh7 09/25] ms/mm/compaction: change the timing to check to drop the spinlock

2018-01-24 Thread Andrey Ryabinin
From: Joonsoo Kim 

It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
pfn page.  This may result in the situation below while isolating
migratepages.

1. Try to isolate pages in the pfn range 0x0 ~ 0x200.
2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so the
   spinlock is dropped.
3. Then, to complete the isolation, the lock has to be re-acquired.

It is better to use the SWAP_CLUSTER_MAX'th pfn as the criterion for
dropping the lock.  This does no harm for pfn 0x0, because at that point
the 'locked' variable is still false.
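
A stand-alone illustration of the two checks (assuming SWAP_CLUSTER_MAX is
32, as in the kernel); this is not kernel code and can be compiled and run
in userspace:

	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32

	int main(void)
	{
		unsigned long low_pfn;

		/* Scan pfns 0x0 .. 0x200 as in the example above. */
		for (low_pfn = 0; low_pfn < 0x200; low_pfn++) {
			int before = !((low_pfn + 1) % SWAP_CLUSTER_MAX);
			int after  = !(low_pfn % SWAP_CLUSTER_MAX);

			if (before || after)
				printf("pfn 0x%03lx: old check %d, new check %d\n",
				       low_pfn, before, after);
		}
		/*
		 * The old check fires at 0x01f, 0x03f, ..., 0x1ff (the last
		 * pfn of each cluster, including the very last pfn scanned);
		 * the new one fires at 0x000, 0x020, ..., 0x1e0 (the first
		 * pfn of each cluster, and 'locked' is still false at 0x000,
		 * so nothing is dropped there).
		 */
		return 0;
	}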

Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit be1aa03b973c7dcdc576f3503f7a60429825c35d)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c29883fe146d..c4b6b134b197 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -494,7 +494,7 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
cond_resched();
for (; low_pfn < end_pfn; low_pfn++) {
/* give a chance to irqs before checking need_resched() */
-   if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
+   if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
if (should_release_lock(&zone->lru_lock)) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
locked = false;
-- 
2.13.6



[Devel] [PATCH rh7 23/25] ms/mm, compaction: properly signal and act upon lock and need_sched() contention

2018-01-24 Thread Andrey Ryabinin
From: Vlastimil Babka 

Compaction uses compact_checklock_irqsave() function to periodically check
for lock contention and need_resched() to either abort async compaction,
or to free the lock, schedule and retake the lock.  When aborting,
cc->contended is set to signal the contended state to the caller.  Two
problems have been identified in this mechanism.

First, compaction also calls cond_resched() directly in both scanners when
no lock is yet taken.  This call neither aborts async compaction nor sets
cc->contended appropriately.  This patch introduces a new
compact_should_abort() function to achieve both.  In isolate_freepages(),
the check frequency is reduced to once per SWAP_CLUSTER_MAX pageblocks to
match what the migration scanner does in the preliminary page checks.  In
case a pageblock is found suitable for calling isolate_freepages_block(),
the checks within it are done at a higher frequency.

Second, isolate_freepages() does not check if isolate_freepages_block()
aborted due to contention, and advances to the next pageblock.  This
violates the principle of aborting on contention, and might result in
pageblocks not being scanned completely, since the scanning cursor is
advanced.  This problem has been noticed in the code by Joonsoo Kim when
reviewing related patches.  This patch makes isolate_freepages_block()
check the cc->contended flag and abort.

In case isolate_freepages() has already isolated some pages before
aborting due to contention, page migration will proceed, which is OK since
we do not want to waste the work that has been done, and page migration
has own checks for contention.  However, we do not want another isolation
attempt by either of the scanners, so cc->contended flag check is added
also to compaction_alloc() and compact_finished() to make sure compaction
is aborted right after the migration.

The outcome of the patch should be reduced lock contention by async
compaction and lower latencies for higher-order allocations where direct
compaction is involved.
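
Note (not part of the patch): the last hunk of the diff below is truncated
right where the free scanner starts using the new helper.  In the upstream
commit the rate-limited check looks roughly like this:

	/*
	 * Check once per SWAP_CLUSTER_MAX pageblocks whether async
	 * compaction should abort, instead of calling cond_resched()
	 * unconditionally for every pageblock.
	 */
	if (!(block_start_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
					&& compact_should_abort(cc))
		break;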

[a...@linux-foundation.org: fix typo in comment]
Reported-by: Joonsoo Kim 
Signed-off-by: Vlastimil Babka 
Reviewed-by: Naoya Horiguchi 
Cc: Minchan Kim 
Cc: Mel Gorman 
Cc: Bartlomiej Zolnierkiewicz 
Cc: Michal Nazarewicz 
Cc: Christoph Lameter 
Cc: Rik van Riel 
Acked-by: Michal Nazarewicz 
Tested-by: Shawn Guo 
Tested-by: Kevin Hilman 
Tested-by: Stephen Warren 
Tested-by: Fabio Estevam 
Cc: David Rientjes 
Cc: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit be9765722e6b7ece8263cbab857490332339bd6f)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 54 --
 mm/internal.h   |  5 -
 2 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index d8ee1536819f..4f6d23a87230 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -224,6 +224,30 @@ static bool compact_checklock_irqsave(spinlock_t *lock, 
unsigned long *flags,
return true;
 }
 
+/*
+ * Aside from avoiding lock contention, compaction also periodically checks
+ * need_resched() and either schedules in sync compaction or aborts async
+ * compaction. This is similar to what compact_checklock_irqsave() does, but
+ * is used where no lock is concerned.
+ *
+ * Returns false when no scheduling was needed, or sync compaction scheduled.
+ * Returns true when async compaction should abort.
+ */
+static inline bool compact_should_abort(struct compact_control *cc)
+{
+   /* async compaction aborts if contended */
+   if (need_resched()) {
+   if (cc->mode == MIGRATE_ASYNC) {
+   cc->contended = true;
+   return true;
+   }
+
+   cond_resched();
+   }
+
+   return false;
+}
+
 /* Returns true if the page is within a block suitable for migration to */
 static bool suitable_migration_target(struct page *page)
 {
@@ -501,11 +525,8 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
return 0;
}
 
-   if (cond_resched()) {
-   /* Async terminates prematurely on need_resched() */
-   if (cc->mode == MIGRATE_ASYNC)
-   return 0;
-   }
+   if (compact_should_abort(cc))
+   return 0;
 
/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
@@ -724,9 +745,11 @@ static void isolate_freepages(struct zone *zone,
/*
 * This can iterate a massively long zone without finding any
 * suitable migration targets, so periodically check if we need
-* to schedule.
+* to schedule, or even abort async compaction.
 */
-   cond_resched();
+   if (!(block_start_pfn % (SWAP_CLUSTER_

[Devel] [PATCH rh7 10/25] ms/mm/compaction: check pageblock suitability once per pageblock

2018-01-24 Thread Andrey Ryabinin
From: Joonsoo Kim 

isolation_suitable() and migrate_async_suitable() are used to make sure
that a pageblock range is fine to be migrated.  There is no need to call
them on every page.  The current code does well when the pageblock is not
suitable, but doesn't do well when it is suitable.

1) It re-checks isolation_suitable() on each page of a pageblock that was
   already established as suitable.
2) It re-checks migrate_async_suitable() on each page of a pageblock that
   was not entered through the next_pageblock: label, because
   last_pageblock_nr is not otherwise updated.

This patch fixes the situation by 1) calling isolation_suitable() only once
per pageblock and 2) always updating last_pageblock_nr to the pageblock
that was just checked.

Additionally, move the PageBuddy() check after the pageblock unit check,
since the pageblock check is the first thing we should do, and this makes
things simpler.
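
A small stand-alone demo of the effect (assuming pageblock_order is 9,
i.e. 512-page / 2MB pageblocks, as on x86-64 with 4K pages); not kernel
code:

	#include <stdio.h>

	#define PAGEBLOCK_ORDER 9

	int main(void)
	{
		unsigned long low_pfn, last_pageblock_nr = -1UL;
		int suitability_checks = 0;

		for (low_pfn = 0; low_pfn < 2048; low_pfn++) {
			unsigned long pageblock_nr = low_pfn >> PAGEBLOCK_ORDER;

			/*
			 * After the patch the per-pageblock suitability
			 * checks run only when a new pageblock is entered,
			 * not for every page.
			 */
			if (pageblock_nr != last_pageblock_nr) {
				last_pageblock_nr = pageblock_nr;
				suitability_checks++;
			}
		}
		printf("%d suitability checks for 2048 pages (4 pageblocks)\n",
		       suitability_checks);
		return 0;
	}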

[vba...@suse.cz: rephrase commit description]
Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit c122b2087ab94192f2b937e47b563a9c4e688ece)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c4b6b134b197..ba9fa713038d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -533,26 +533,31 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
 
/* If isolation recently failed, do not retry */
pageblock_nr = low_pfn >> pageblock_order;
-   if (!isolation_suitable(cc, page))
-   goto next_pageblock;
+   if (last_pageblock_nr != pageblock_nr) {
+   int mt;
+
+   last_pageblock_nr = pageblock_nr;
+   if (!isolation_suitable(cc, page))
+   goto next_pageblock;
+
+   /*
+* For async migration, also only scan in MOVABLE
+* blocks. Async migration is optimistic to see if
+* the minimum amount of work satisfies the allocation
+*/
+   mt = get_pageblock_migratetype(page);
+   if (!cc->sync && !migrate_async_suitable(mt)) {
+   cc->finished_update_migrate = true;
+   skipped_async_unsuitable = true;
+   goto next_pageblock;
+   }
+   }
 
/* Skip if free */
if (PageBuddy(page))
continue;
 
/*
-* For async migration, also only scan in MOVABLE blocks. Async
-* migration is optimistic to see if the minimum amount of work
-* satisfies the allocation
-*/
-   if (!cc->sync && last_pageblock_nr != pageblock_nr &&
-   !migrate_async_suitable(get_pageblock_migratetype(page))) {
-   cc->finished_update_migrate = true;
-   skipped_async_unsuitable = true;
-   goto next_pageblock;
-   }
-
-   /*
 * Check may be lockless but that's ok as we recheck later.
 * It's possible to migrate LRU pages and balloon pages
 * Skip any other type of page
@@ -643,7 +648,6 @@ check_compact_cluster:
 
 next_pageblock:
low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
-   last_pageblock_nr = pageblock_nr;
}
 
acct_isolated(zone, locked, cc);
-- 
2.13.6



[Devel] [PATCH rh7 21/25] ms/mm/compaction: do not count migratepages when unnecessary

2018-01-24 Thread Andrey Ryabinin
From: Vlastimil Babka 

During compaction, update_nr_listpages() has been used to count remaining
non-migrated and free pages after a call to migrate_pages().  The
freepages counting has become unnecessary, and it turns out that
migratepages counting is also unnecessary in most cases.

The only situation when it's needed to count cc->migratepages is when
migrate_pages() returns with a negative error code.  Otherwise, the
non-negative return value is the number of pages that were not migrated,
which is exactly the count of remaining pages in the cc->migratepages
list.

Furthermore, any non-zero count is only interesting for the tracepoint of
mm_compaction_migratepages events, because after that all remaining
unmigrated pages are put back and their count is set to 0.

This patch therefore removes update_nr_listpages() completely, and changes
the tracepoint definition so that the manual counting is done only when
the tracepoint is enabled, and only when migrate_pages() returns a
negative error code.

Furthermore, migrate_pages() and the tracepoints won't be called when
there's nothing to migrate.  This potentially avoids some wasted cycles
and reduces the volume of uninteresting mm_compaction_migratepages events
where "nr_migrated=0 nr_failed=0".  In the stress-highalloc mmtest, this
was about 75% of the events.  The mm_compaction_isolate_migratepages event
is better for determining that nothing was isolated for migration, and
this one was just duplicating the info.
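
A stand-alone sketch of the counting rule described above (hypothetical
helper, not kernel code):

	#include <stdio.h>

	/*
	 * migrate_pages() returns >= 0 (the number of pages that failed) or
	 * a negative error code, in which case the pages still sitting on
	 * the migratepages list have to be counted by hand.
	 */
	static unsigned long count_failed(long migrate_rc,
					  unsigned long pages_left_on_list)
	{
		if (migrate_rc >= 0)
			return migrate_rc;
		return pages_left_on_list;	/* list_for_each() in the tracepoint */
	}

	int main(void)
	{
		unsigned long nr_all = 32;

		/* 3 pages failed migration, the rest succeeded. */
		printf("nr_migrated=%lu\n", nr_all - count_failed(3, 3));
		/* migrate_pages() returned an error, 10 pages still listed. */
		printf("nr_migrated=%lu\n", nr_all - count_failed(-1, 10));
		return 0;
	}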

Signed-off-by: Vlastimil Babka 
Reviewed-by: Naoya Horiguchi 
Cc: Minchan Kim 
Cc: Mel Gorman 
Cc: Joonsoo Kim 
Cc: Bartlomiej Zolnierkiewicz 
Acked-by: Michal Nazarewicz 
Cc: Christoph Lameter 
Cc: Rik van Riel 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit f8c9301fa5a2a8b873c67f2a3d8230d5c13f61b7)
Signed-off-by: Andrey Ryabinin 
---
 include/trace/events/compaction.h | 25 +
 mm/compaction.c   | 31 +++
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/include/trace/events/compaction.h 
b/include/trace/events/compaction.h
index 06f544ef2f6f..c6814b917bdf 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -5,6 +5,7 @@
 #define _TRACE_COMPACTION_H
 
 #include 
+#include 
 #include 
 #include 
 
@@ -47,10 +48,11 @@ DEFINE_EVENT(mm_compaction_isolate_template, 
mm_compaction_isolate_freepages,
 
 TRACE_EVENT(mm_compaction_migratepages,
 
-   TP_PROTO(unsigned long nr_migrated,
-   unsigned long nr_failed),
+   TP_PROTO(unsigned long nr_all,
+   int migrate_rc,
+   struct list_head *migratepages),
 
-   TP_ARGS(nr_migrated, nr_failed),
+   TP_ARGS(nr_all, migrate_rc, migratepages),
 
TP_STRUCT__entry(
__field(unsigned long, nr_migrated)
@@ -58,7 +60,22 @@ TRACE_EVENT(mm_compaction_migratepages,
),
 
TP_fast_assign(
-   __entry->nr_migrated = nr_migrated;
+   unsigned long nr_failed = 0;
+   struct list_head *page_lru;
+
+   /*
+* migrate_pages() returns either a non-negative number
+* with the number of pages that failed migration, or an
+* error code, in which case we need to count the remaining
+* pages manually
+*/
+   if (migrate_rc >= 0)
+   nr_failed = migrate_rc;
+   else
+   list_for_each(page_lru, migratepages)
+   nr_failed++;
+
+   __entry->nr_migrated = nr_all - nr_failed;
__entry->nr_failed = nr_failed;
),
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 35911b29683f..316a7b34ce37 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -826,22 +826,6 @@ static void compaction_free(struct page *page, unsigned 
long data)
cc->nr_freepages++;
 }
 
-/*
- * We cannot control nr_migratepages fully when migration is running as
- * migrate_pages() has no knowledge of of compact_control.  When migration is
- * complete, we count the number of pages on the list by hand.
- */
-static void update_nr_listpages(struct compact_control *cc)
-{
-   int nr_migratepages = 0;
-   struct page *page;
-
-   list_for_each_entry(page, &cc->migratepages, lru)
-   nr_migratepages++;
-
-   cc->nr_migratepages = nr_migratepages;
-}
-
 /* possible outcome of isolate_migratepages */
 typedef enum {
ISOLATE_ABORT,  /* Abort compaction now */
@@ -1036,7 +1020,6 @@ static int compact_zone(struct zone *zone, struct 
compact_control *cc)
migrate_prep_local();
 
while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
-   unsigned long nr_migrate, nr_remaining;
int err;

[Devel] [PATCH rh7 12/25] ms/mm, compaction: determine isolation mode only once

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

The conditions that control the isolation mode in
isolate_migratepages_range() do not change during the iteration, so
extract them out and only define the value once.

This actually does have an effect: gcc doesn't optimize it by itself
because of cc->sync.

Signed-off-by: David Rientjes 
Cc: Mel Gorman 
Acked-by: Rik van Riel 
Acked-by: Vlastimil Babka 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index d833777ddaee..9f2abe03d1aa 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -467,12 +467,13 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
unsigned long last_pageblock_nr = 0, pageblock_nr;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct list_head *migratelist = &cc->migratepages;
-   isolate_mode_t mode = 0;
struct lruvec *lruvec;
unsigned long flags;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
bool skipped_async_unsuitable = false;
+   const isolate_mode_t mode = (!cc->sync ? ISOLATE_ASYNC_MIGRATE : 0) |
+   (unevictable ? ISOLATE_UNEVICTABLE : 0);
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -612,12 +613,6 @@ isolate_migratepages_range(struct zone *zone, struct 
compact_control *cc,
continue;
}
 
-   if (!cc->sync)
-   mode |= ISOLATE_ASYNC_MIGRATE;
-
-   if (unevictable)
-   mode |= ISOLATE_UNEVICTABLE;
-
lruvec = mem_cgroup_page_lruvec(page, zone);
 
/* Try isolate the page */
-- 
2.13.6



[Devel] [PATCH rh7 24/25] ms/mm/compaction: fix wrong order check in compact_finished()

2018-01-24 Thread Andrey Ryabinin
From: Joonsoo Kim 

What we want to check here is whether there is a high-order free page in
the buddy list of another migratetype, in order to steal it without causing
fragmentation.  But the current code just checks cc->order, which is the
allocation request order.  So this is wrong.

Without this fix, non-movable synchronous compaction below pageblock order
would not stop until compaction is complete, because the migratetype of
most pageblocks is movable and a high-order free page made by compaction
usually ends up on a movable-type buddy list.

There is a report related to this bug; see the link below.

  http://www.spinics.net/lists/linux-mm/msg81666.html

Although the affected system still has load spikes coming from compaction,
this change makes the system completely stable and responsive according to
that report.

The stress-highalloc test in mmtests with non-movable order-7 allocations
doesn't show any notable difference in allocation success rate, but it
shows a higher compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
18.47 : 28.94
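
For context (not part of the patch): the check sits inside a loop over
buddy free-list orders in compact_finished(), so the loop variable, not the
request order, is what must be compared against pageblock_order.  A
simplified sketch (kernel MAX_ORDER is 11):

	#define MAX_ORDER 11

	struct free_area { unsigned long nr_free; };

	/* Sketch of the loop in compact_finished(): 'order' walks the buddy
	 * free lists starting from the allocation request order. */
	int suitable_page_free(struct free_area *free_area, int request_order,
			       int pageblock_order)
	{
		int order;

		for (order = request_order; order < MAX_ORDER; order++) {
			/* ... a free page of the right migratetype at this
			 * order would also finish the job ... */

			/* "Job done if allocation would set block type":
			 * compare the order of the list being inspected,
			 * not the request order, which never changes. */
			if (order >= pageblock_order && free_area[order].nr_free)
				return 1;
		}
		return 0;
	}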

Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page 
immediately when it is made available")
Signed-off-by: Joonsoo Kim 
Acked-by: Vlastimil Babka 
Reviewed-by: Zhang Yanfei 
Cc: Mel Gorman 
Cc: David Rientjes 
Cc: Rik van Riel 
Cc: [3.7+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 372549c2a3778fd3df445819811c944ad54609ca)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4f6d23a87230..995097d8c50a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -941,7 +941,7 @@ static int compact_finished(struct zone *zone,
return COMPACT_PARTIAL;
 
/* Job done if allocation would set block type */
-   if (cc->order >= pageblock_order && area->nr_free)
+   if (order >= pageblock_order && area->nr_free)
return COMPACT_PARTIAL;
}
 
-- 
2.13.6



[Devel] [PATCH rh7 15/25] ms/mm/compaction: cleanup isolate_freepages()

2018-01-24 Thread Andrey Ryabinin
From: Vlastimil Babka 

isolate_freepages() is currently somewhat hard to follow thanks to many
different pfn variables.  Especially misleading is the name 'high_pfn',
which looks like it is related to the 'low_pfn' variable, but in fact it
is not.

This patch renames the 'high_pfn' variable to a hopefully less confusing
name, and slightly changes its handling without a functional change.  A
comment made obsolete by recent changes is also updated.

[a...@linux-foundation.org: comment fixes, per Minchan]
[iamjoonsoo@lge.com: cleanups]
Signed-off-by: Vlastimil Babka 
Cc: Minchan Kim 
Cc: Mel Gorman 
Cc: Joonsoo Kim 
Cc: Bartlomiej Zolnierkiewicz 
Cc: Michal Nazarewicz 
Cc: Naoya Horiguchi 
Cc: Christoph Lameter 
Cc: Rik van Riel 
Cc: Dongjun Shin 
Cc: Sunghwan Yun 
Signed-off-by: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit c96b9e508f3d06ddb601dcc9792d62c044ab359e)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 56 +++-
 1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ea44b4bf85d9..0a2d4eded2e0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -669,7 +669,10 @@ static void isolate_freepages(struct zone *zone,
struct compact_control *cc)
 {
struct page *page;
-   unsigned long high_pfn, low_pfn, pfn, z_end_pfn;
+   unsigned long block_start_pfn;  /* start of current pageblock */
+   unsigned long block_end_pfn;/* end of current pageblock */
+   unsigned long low_pfn;   /* lowest pfn scanner is able to scan */
+   unsigned long next_free_pfn; /* start pfn for scaning at next round */
int nr_freepages = cc->nr_freepages;
struct list_head *freelist = &cc->freepages;
 
@@ -677,32 +680,33 @@ static void isolate_freepages(struct zone *zone,
 * Initialise the free scanner. The starting point is where we last
 * successfully isolated from, zone-cached value, or the end of the
 * zone when isolating for the first time. We need this aligned to
-* the pageblock boundary, because we do pfn -= pageblock_nr_pages
-* in the for loop.
+* the pageblock boundary, because we do
+* block_start_pfn -= pageblock_nr_pages in the for loop.
+* For ending point, take care when isolating in last pageblock of a
+* a zone which ends in the middle of a pageblock.
 * The low boundary is the end of the pageblock the migration scanner
 * is using.
 */
-   pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
+   block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
+   block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
+   zone_end_pfn(zone));
low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
 
/*
-* Take care that if the migration scanner is at the end of the zone
-* that the free scanner does not accidentally move to the next zone
-* in the next isolation cycle.
+* If no pages are isolated, the block_start_pfn < low_pfn check
+* will kick in.
 */
-   high_pfn = min(low_pfn, pfn);
-
-   z_end_pfn = zone_end_pfn(zone);
+   next_free_pfn = 0;
 
/*
 * Isolate free pages until enough are available to migrate the
 * pages on cc->migratepages. We stop searching if the migrate
 * and free page scanners meet or enough free pages are isolated.
 */
-   for (; pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
-   pfn -= pageblock_nr_pages) {
+   for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
+   block_end_pfn = block_start_pfn,
+   block_start_pfn -= pageblock_nr_pages) {
unsigned long isolated;
-   unsigned long end_pfn;
 
/*
 * This can iterate a massively long zone without finding any
@@ -711,7 +715,7 @@ static void isolate_freepages(struct zone *zone,
 */
cond_resched();
 
-   if (!pfn_valid(pfn))
+   if (!pfn_valid(block_start_pfn))
continue;
 
/*
@@ -721,7 +725,7 @@ static void isolate_freepages(struct zone *zone,
 * i.e. it's possible that all pages within a zones range of
 * pages do not belong to a single zone.
 */
-   page = pfn_to_page(pfn);
+   page = pfn_to_page(block_start_pfn);
if (page_zone(page) != zone)
continue;
 
@@ -734,14 +738,8 @@ static void isolate_freepages(struct zone *zone,
continue;
 
/* Found a block suitable for isolating free pages from */

[Devel] [PATCH rh7 13/25] ms/mm, compaction: ignore pageblock skip when manually invoking compaction

2018-01-24 Thread Andrey Ryabinin
From: David Rientjes 

The cached pageblock hint should be ignored when triggering compaction
through /proc/sys/vm/compact_memory so all eligible memory is isolated.
Manually invoking compaction is known to be expensive, there's no need
to skip pageblocks based on heuristics (mainly for debugging).

Signed-off-by: David Rientjes 
Acked-by: Mel Gorman 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 91ca9186484809c57303b33778d841cc28f696ed)
Signed-off-by: Andrey Ryabinin 
---
 mm/compaction.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9f2abe03d1aa..ee0c1e4aecd7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1194,6 +1194,7 @@ static void compact_node(int nid)
struct compact_control cc = {
.order = -1,
.sync = true,
+   .ignore_skip_hint = true,
};
 
__compact_pgdat(NODE_DATA(nid), &cc);
-- 
2.13.6



[Devel] [PATCH rh7 7/7] mm/vmscan: call wait_iff_congested() only if we have trouble reclaiming

2018-01-29 Thread Andrey Ryabinin
Even if a zone is congested, it might be better to continue reclaim, since
we may be able to allocate memory from another zone.  So call
wait_iff_congested() only if we are having trouble reclaiming memory; the
change below uses a dropping scan priority (sc->priority < DEF_PRIORITY - 2)
as that signal.

https://jira.sw.ru/browse/PSBM-61409
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d71fa15a1750..4922f734cdb4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2481,7 +2481,8 @@ static void shrink_zone(struct zone *zone, struct 
scan_control *sc,
 * is congested. Allow kswapd to continue until it starts 
encountering
 * unqueued dirty pages or cycling through the LRU too quickly.
 */
-   if (!sc->hibernation_mode && !current_is_kswapd())
+   if (sc->priority < (DEF_PRIORITY - 2) && !sc->hibernation_mode 
&&
+   !current_is_kswapd())
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
 
if (reclaim_state) {
-- 
2.13.6



[Devel] [PATCH rh7 5/7] mm/vmscan: collect reclaim stats across zone

2018-01-29 Thread Andrey Ryabinin
Currently we collect reclaim stats per LRU list and set zone
flags based on these stats.  This seems wrong, as LRUs are per-memcg,
thus one zone could have hundreds of them.

So add a reclaim_stat pointer into the scan_control struct and sum the
counters we need while iterating the LRUs in a zone.  Don't use them yet;
that will be done in the next patch.

https://jira.sw.ru/browse/PSBM-61409
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 39 ---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f974f57dd546..e6dde1e15a54 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -57,6 +57,18 @@
 #define CREATE_TRACE_POINTS
 #include 
 
+struct reclaim_stat {
+   unsigned nr_dirty;
+   unsigned nr_unqueued_dirty;
+   unsigned nr_congested;
+   unsigned nr_writeback;
+   unsigned nr_immediate;
+   unsigned nr_activate;
+   unsigned nr_ref_keep;
+   unsigned nr_unmap_fail;
+   unsigned nr_taken;
+};
+
 struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -97,6 +109,8 @@ struct scan_control {
 */
struct mem_cgroup *target_mem_cgroup;
 
+   struct reclaim_stat *stat;
+
/*
 * Nodemask of nodes allowed by the caller. If NULL, all nodes
 * are scanned.
@@ -845,17 +859,6 @@ static void page_check_dirty_writeback(struct page *page,
mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
-struct reclaim_stat {
-   unsigned nr_dirty;
-   unsigned nr_unqueued_dirty;
-   unsigned nr_congested;
-   unsigned nr_writeback;
-   unsigned nr_immediate;
-   unsigned nr_activate;
-   unsigned nr_ref_keep;
-   unsigned nr_unmap_fail;
-};
-
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1616,6 +1619,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
mem_cgroup_uncharge_list(&page_list);
free_hot_cold_page_list(&page_list, true);
 
+   if (sc->stat) {
+   sc->stat->nr_taken += nr_taken;
+   sc->stat->nr_dirty += stat.nr_dirty;
+   sc->stat->nr_unqueued_dirty += stat.nr_unqueued_dirty;
+   sc->stat->nr_congested += stat.nr_congested;
+   sc->stat->nr_writeback += stat.nr_writeback;
+   sc->stat->nr_immediate += stat.nr_immediate;
+   }
+
/*
 * If reclaim is isolating dirty pages under writeback, it implies
 * that the long-lived page allocation rate is exceeding the page
@@ -2418,6 +2430,9 @@ static void shrink_zone(struct zone *zone, struct 
scan_control *sc,
};
unsigned long zone_lru_pages = 0;
struct mem_cgroup *memcg;
+   struct reclaim_stat stat = {};
+
+   sc->stat = &stat;
 
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;
@@ -2912,6 +2927,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct 
mem_cgroup *memcg,
struct zone *zone,
unsigned long *nr_scanned)
 {
+   struct reclaim_stat stat = {};
struct scan_control sc = {
.nr_scanned = 0,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2921,6 +2937,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct 
mem_cgroup *memcg,
.order = 0,
.priority = 0,
.target_mem_cgroup = memcg,
+   .stat = &stat,
};
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
unsigned long lru_pages;
-- 
2.13.6



[Devel] [PATCH rh7 6/7] mm/vmscan: Use per-zone sum of reclaim_stat to change zone state.

2018-01-29 Thread Andrey Ryabinin
Currently we collect reclaim stats per LRU list and set zone
flags based on these stats.  This seems wrong, as LRUs are per-memcg,
thus one zone could have hundreds of them.

Move all that zone-related logic from shrink_inactive_list() to
shrink_zone(), and make decisions based on the per-zone sum of reclaim
stats instead of just the per-LRU counters.
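
The diff below is truncated; roughly, the intent (a simplified sketch
reconstructed from the checks removed above, not the verbatim result of the
patch) is that shrink_zone() applies the old per-LRU decisions to the
per-zone sums once all lruvecs of the zone have been shrunk:

	/*
	 * Sketch: the summed counters in 'stat' now cover every memcg lruvec
	 * of the zone, so comparisons like nr_writeback == nr_taken describe
	 * the zone as a whole rather than a single per-memcg LRU list.
	 */
	if (stat.nr_writeback && stat.nr_writeback == stat.nr_taken)
		zone_set_flag(zone, ZONE_WRITEBACK);

	if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
		zone_set_flag(zone, ZONE_CONGESTED);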

https://jira.sw.ru/browse/PSBM-61409
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 109 ++--
 1 file changed, 54 insertions(+), 55 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e6dde1e15a54..d71fa15a1750 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1628,61 +1628,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
sc->stat->nr_immediate += stat.nr_immediate;
}
 
-   /*
-* If reclaim is isolating dirty pages under writeback, it implies
-* that the long-lived page allocation rate is exceeding the page
-* laundering rate. Either the global limits are not being effective
-* at throttling processes due to the page distribution throughout
-* zones or there is heavy usage of a slow backing device. The
-* only option is to throttle from reclaim context which is not ideal
-* as there is no guarantee the dirtying process is throttled in the
-* same way balance_dirty_pages() manages.
-*
-* Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
-* of pages under pages flagged for immediate reclaim and stall if any
-* are encountered in the nr_immediate check below.
-*/
-   if (stat.nr_writeback && stat.nr_writeback == nr_taken)
-   zone_set_flag(zone, ZONE_WRITEBACK);
-
-   if (!global_reclaim(sc) && stat.nr_immediate)
-   congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-   if (sane_reclaim(sc)) {
-   /*
-* Tag a zone as congested if all the dirty pages scanned were
-* backed by a congested BDI and wait_iff_congested will stall.
-*/
-   if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
-   zone_set_flag(zone, ZONE_CONGESTED);
-
-   /*
-* If dirty pages are scanned that are not queued for IO, it
-* implies that flushers are not keeping up. In this case, flag
-* the zone ZONE_TAIL_LRU_DIRTY and kswapd will start writing
-* pages from reclaim context.
-*/
-   if (stat.nr_unqueued_dirty == nr_taken)
-   zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
-
-   /*
-* If kswapd scans pages marked marked for immediate
-* reclaim and under writeback (nr_immediate), it implies
-* that pages are cycling through the LRU faster than
-* they are written so also forcibly stall.
-*/
-   if (stat.nr_immediate)
-   congestion_wait(BLK_RW_ASYNC, HZ/10);
-   }
-
-   /*
-* Stall direct reclaim for IO completions if underlying BDIs or zone
-* is congested. Allow kswapd to continue until it starts encountering
-* unqueued dirty pages or cycling through the LRU too quickly.
-*/
-   if (!sc->hibernation_mode && !current_is_kswapd())
-   wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
-
trace_mm_vmscan_lru_shrink_inactive(zone_to_nid(zone), zone_idx(zone),
nr_scanned, nr_reclaimed,
stat.nr_dirty,  stat.nr_writeback,
@@ -2485,6 +2430,60 @@ static void shrink_zone(struct zone *zone, struct 
scan_control *sc,
shrink_slab(slab_gfp, zone_to_nid(zone), NULL,
sc->priority, false);
 
+   if (global_reclaim(sc)) {
+   /*
+* If reclaim is isolating dirty pages under writeback, 
it implies
+* that the long-lived page allocation rate is 
exceeding the page
+* laundering rate. Either the global limits are not 
being effective
+* at throttling processes due to the page distribution 
throughout
+* zones or there is heavy usage of a slow backing 
device. The
+* only option is to throttle from reclaim context 
which is not ideal
+* as there is no guarantee the dirtying process is 
throttled in the
+* same way balance_dirty_pages() manages.
+*
+* Once a zone is flagged ZONE_WRITEBACK, kswapd will 
count the number
+* of pages under pages flagged for immediate reclaim 
and stall if any
+* are

<    3   4   5   6   7   8   9   10   11   >