[Devel] [RFC] Fix get_exec_env() races
Since we allow attaching a non-current task to a ve cgroup, there is a race in every place that uses get_exec_env(): the task's ve may change right after get_exec_env() has been dereferenced, so many problems are possible there. I am sure most call sites of get_exec_env() were not written with the assumption that the ve can change underneath them. Also, there are many nested functions, and it is impossible to audit every one of them to verify that its input parameters, derived from a caller's dereferenced ve, have not become stale because the ve changed.

I suggest modifying get_exec_env() so that it guarantees the ve's stability. It pairs with put_exec_env(), which marks the end of the region where ve modification is undesirable. get_exec_env() may be used nested, so task_struct::ve_attach_lock_depth is introduced to allow nesting.

The counter looks like a better option than a plain read_lock() in get_exec_env() and a write_trylock() loop in ve_attach():

	get_exec_env()
	{
		...
		read_lock();
		...
	}

	ve_attach()
	{
		while (!write_trylock())
			cpu_relax();
	}

because in that case the priority of read_lock() would be absolute, and we would lose all the fairness advantages of queued rwlocks. I also considered variants using RCU and task work, but they seem to be worse.

Please share your comments.
---
 include/linux/init_task.h | 3 ++-
 include/linux/sched.h     | 1 +
 include/linux/ve.h        | 29 +
 include/linux/ve_proto.h  | 1 -
 kernel/fork.c             | 3 +++
 kernel/ve/ve.c            | 8 +++-
 6 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index d2cbad0..57e0796 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -136,7 +136,8 @@ extern struct task_group root_task_group;
 #endif
 
 #ifdef CONFIG_VE
-#define INIT_TASK_VE(tsk)	.task_ve = ,
+#define INIT_TASK_VE(tsk)	.task_ve = ,	\
+	.ve_attach_lock_depth = 0
 #else
 #define INIT_TASK_VE(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1bcabe..948481f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1564,6 +1564,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_VE
 	struct ve_struct *task_ve;
+	unsigned int ve_attach_lock_depth;
 #endif
 #ifdef CONFIG_MEMCG
 	/* memcg uses this to do batch job */
 	struct memcg_batch_info {
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 86b95c3..3cea73d 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -33,6 +33,7 @@ struct ve_monitor;
 struct nsproxy;
 
 struct ve_struct {
+	rwlock_t		attach_lock;
 	struct cgroup_subsys_state	css;
 
 	const char		*ve_name;
@@ -130,6 +131,34 @@ struct ve_struct {
 #endif
 };
 
+static inline struct ve_struct *get_exec_env(void)
+{
+	struct ve_struct *ve;
+
+	if (++current->ve_attach_lock_depth > 1)
+		return current->task_ve;
+
+	rcu_read_lock();
+again:
+	ve = current->task_ve;
+	read_lock(&ve->attach_lock);
+	if (unlikely(current->task_ve != ve)) {
+		read_unlock(&ve->attach_lock);
+		goto again;
+	}
+	rcu_read_unlock();
+
+	return ve;
+}
+
+static inline void put_exec_env(void)
+{
+	struct ve_struct *ve = current->task_ve;
+
+	if (!--current->ve_attach_lock_depth)
+		read_unlock(&ve->attach_lock);
+}
+
 struct ve_devmnt {
 	struct list_head	link;
diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
index 0f5898e..3deb09e 100644
--- a/include/linux/ve_proto.h
+++ b/include/linux/ve_proto.h
@@ -30,7 +30,6 @@ static inline bool ve_is_super(struct ve_struct *ve)
 	return ve == 
 }
 
-#define get_exec_env()	(current->task_ve)
 #define get_env_init(ve)	(ve->ve_ns->pid_ns->child_reaper)
 
 const char *ve_name(struct ve_struct *ve);
diff --git a/kernel/fork.c b/kernel/fork.c
index 505fa21..3d7e452 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1439,6 +1439,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
 #endif
+#ifdef CONFIG_VE
+	p->ve_attach_lock_depth = 0;
+#endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
 	 */
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 39a95e8..23833ed 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -640,6 +640,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup *cg)
 	ve->meminfo_val = VE_MEMINFO_DEFAULT;
 
 do_init:
+	ve->attach_lock = __RW_LOCK_UNLOCKED(&ve->attach_lock);
 	init_rwsem(&ve->op_sem);
 	mutex_init(&ve->sync_mutex);
 	INIT_LIST_HEAD(&ve->devices);
@@ -738,8 +739,11 @@ static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 
 static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 {
+
Re: [Devel] [PATCH RH7 1/2] device_cgroup: fake allowing all devices for docker inside VZCT
Here is the right link for RH7: https://jira.sw.ru/browse/PSBM-34529
The patch actually is a port from RH6.

On 10/15/2015 01:42 PM, Konstantin Khorenko wrote:
Volodya, please review.
-- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team

On 10/13/2015 06:11 PM, Pavel Tikhomirov wrote:
We need it for docker 1.7.+, please review.

On 10/07/2015 11:51 AM, Pavel Tikhomirov wrote:
Docker from 1.7.0 tries to add "a" to devices.allow for the device_cgroup of a newly created privileged container, and thus to allow all devices in the docker container. Docker fails to do so because not all devices are allowed in the parent VZCT cgroup. To support docker we must allow writing "a" to devices.allow in a CT. With this patch, if we get "a", we silently exit without EPERM.

https://jira.sw.ru/browse/PSBM-38691

v2: fix bug link, fix comment style

Signed-off-by: Pavel Tikhomirov
---
 security/device_cgroup.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 531e40c..9f932d7 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -689,7 +689,14 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
 		if (has_children(devcgroup))
 			return -EINVAL;
 
-		if (!may_allow_all(parent))
+		if (!may_allow_all(parent)) {
+			if (ve_is_super(get_exec_env()))
+				return -EPERM;
+			else
+				/* Fooling docker in CT - silently exit */
+				return 0;
+		}
+
 			return -EPERM;
 
 		dev_exception_clean(devcgroup);
 		devcgroup->behavior = DEVCG_DEFAULT_ALLOW;

-- Best regards, Tikhomirov Pavel Software Developer, Odin.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/selftests: add memfd/sealing page-pinning tests
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit 07e7c92c1c0de74828dfd29e39facebf02cdfd63
Author: Andrew Vagin
Date: Thu Oct 15 15:04:19 2015 +0400

ms/selftests: add memfd/sealing page-pinning tests

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: David Herrmann

ML: 87b2d44026e0e315a7401551e95b189ac4b28217

Setting SEAL_WRITE is not possible if there're pending GUP users. This commit adds selftests for memfd+sealing that use FUSE to create pending page-references. FUSE is very helpful here in that it allows us to delay direct-IO operations for an arbitrary amount of time. This way, we can force the kernel to pin pages and then run our normal selftests.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Andrew Vagin
---
 tools/testing/selftests/memfd/.gitignore       | 2 +
 tools/testing/selftests/memfd/Makefile         | 14 +-
 tools/testing/selftests/memfd/fuse_mnt.c       | 110 +
 tools/testing/selftests/memfd/fuse_test.c      | 311 +
 tools/testing/selftests/memfd/run_fuse_test.sh | 14 ++
 5 files changed, 450 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
index bcc8ee2..afe87c4 100644
--- a/tools/testing/selftests/memfd/.gitignore
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -1,2 +1,4 @@
+fuse_mnt
+fuse_test
 memfd_test
 memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
index 36653b9..6816c49 100644
--- a/tools/testing/selftests/memfd/Makefile
+++ b/tools/testing/selftests/memfd/Makefile
@@ -7,6 +7,7 @@ ifeq ($(ARCH),x86_64)
 	ARCH := X86
 endif
 
+CFLAGS += -D_FILE_OFFSET_BITS=64
 CFLAGS += -I../../../../arch/x86/include/generated/uapi/
 CFLAGS += -I../../../../arch/x86/include/uapi/
 CFLAGS += -I../../../../include/uapi/
@@ -25,5 +26,16 @@ ifeq ($(ARCH),X86)
 endif
 	@./memfd_test || echo "memfd_test: [FAIL]"
 
+build_fuse:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) fuse_mnt.c `pkg-config fuse --cflags --libs` -o fuse_mnt
+	gcc $(CFLAGS) fuse_test.c -o fuse_test
+else
+	echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_fuse: build_fuse
+	@./run_fuse_test.sh || echo "fuse_test: [FAIL]"
+
 clean:
-	$(RM) memfd_test
+	$(RM) memfd_test fuse_test
diff --git a/tools/testing/selftests/memfd/fuse_mnt.c b/tools/testing/selftests/memfd/fuse_mnt.c
new file mode 100644
index 000..feacf12
--- /dev/null
+++ b/tools/testing/selftests/memfd/fuse_mnt.c
@@ -0,0 +1,110 @@
+/*
+ * memfd test file-system
+ * This file uses FUSE to create a dummy file-system with only one file /memfd.
+ * This file is read-only and takes 1s per read.
+ *
+ * This file-system is used by the memfd test-cases to force the kernel to pin
+ * pages during reads(). Due to the 1s delay of this file-system, this is a
+ * nice way to test race-conditions against get_user_pages() in the kernel.
+ *
+ * We use direct_io==1 to force the kernel to use direct-IO for this
+ * file-system.
+ */
+
+#define FUSE_USE_VERSION 26
+
+#include <fuse.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char memfd_content[] = "memfd-example-content";
+static const char memfd_path[] = "/memfd";
+
+static int memfd_getattr(const char *path, struct stat *st)
+{
+	memset(st, 0, sizeof(*st));
+
+	if (!strcmp(path, "/")) {
+		st->st_mode = S_IFDIR | 0755;
+		st->st_nlink = 2;
+	} else if (!strcmp(path, memfd_path)) {
+		st->st_mode = S_IFREG | 0444;
+		st->st_nlink = 1;
+		st->st_size = strlen(memfd_content);
+	} else {
+		return -ENOENT;
+	}
+
+	return 0;
+}
+
+static int memfd_readdir(const char *path,
+			 void *buf,
+			 fuse_fill_dir_t filler,
+			 off_t offset,
+			 struct fuse_file_info *fi)
+{
+	if (strcmp(path, "/"))
+		return -ENOENT;
+
+	filler(buf, ".", NULL, 0);
+	filler(buf, "..", NULL, 0);
+	filler(buf, memfd_path + 1, NULL, 0);
+
+	return 0;
+}
+
+static int memfd_open(const char *path, struct fuse_file_info *fi)
+{
+	if (strcmp(path, memfd_path))
+
[Devel] [PATCH RHEL7 COMMIT] ms/shm: wait for pins to be released when sealing
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit c49f7bf80c6e70a3992c13f6b7f7a60b44c81dce
Author: Andrew Vagin
Date: Thu Oct 15 15:04:19 2015 +0400

ms/shm: wait for pins to be released when sealing

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: David Herrmann

ML: 05f65b5c70909ef686f865f0a85406d74d75f70f

If we set SEAL_WRITE on a file, we must make sure there cannot be any ongoing write-operations on the file. For write() calls, we simply lock the inode mutex, for mmap() we simply verify there're no writable mappings.

However, there might be pages pinned by AIO, Direct-IO and similar operations via GUP. We must make sure those do not write to the memfd file after we set SEAL_WRITE.

As there is no way to notify GUP users to drop pages or to wait for them to be done, we implement the wait ourself: When setting SEAL_WRITE, we check all pages for their ref-count. If it's bigger than 1, we know there's some user of the page. We then mark the page and wait for up to 150ms for those ref-counts to be dropped. If the ref-counts are not dropped in time, we refuse the seal operation.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Andrew Vagin
---
 mm/shmem.c | 110 -
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index bc8e08b..fd563aa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,9 +1903,117 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+/*
+ * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
+ * so reuse a tag which we firmly believe is never set or cleared on shmem.
+ */
+#define SHMEM_TAG_PINNED	PAGECACHE_TAG_TOWRITE
+#define LAST_SCAN		4	/* about 150ms max */
+
+static void shmem_tag_pins(struct address_space *mapping)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+	pgoff_t start;
+	struct page *page;
+
+	lru_add_drain();
+	start = 0;
+	rcu_read_lock();
+
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		page = radix_tree_deref_slot(slot);
+		if (!page || radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+		} else if (page_count(page) - page_mapcount(page) > 1) {
+			spin_lock_irq(&mapping->tree_lock);
+			radix_tree_tag_set(&mapping->page_tree, iter.index,
+					   SHMEM_TAG_PINNED);
+			spin_unlock_irq(&mapping->tree_lock);
+		}
+
+		if (need_resched()) {
+			cond_resched_rcu();
+			start = iter.index + 1;
+			goto restart;
+		}
+	}
+	rcu_read_unlock();
+}
+
+/*
+ * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
+ * via get_user_pages(), drivers might have some pending I/O without any active
+ * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
+ * and see whether it has an elevated ref-count. If so, we tag them and wait for
+ * them to be dropped.
+ * The caller must guarantee that no new user will acquire writable references
+ * to those pages to avoid races.
+ */
 static int shmem_wait_for_pins(struct address_space *mapping)
 {
-	return 0;
+	struct radix_tree_iter iter;
+	void **slot;
+	pgoff_t start;
+	struct page *page;
+	int error, scan;
+
+	shmem_tag_pins(mapping);
+
+	error = 0;
+	for (scan = 0; scan <= LAST_SCAN; scan++) {
+		if (!radix_tree_tagged(&mapping->page_tree, SHMEM_TAG_PINNED))
+			break;
+
+		if (!scan)
+			lru_add_drain_all();
+		else if (schedule_timeout_killable((HZ << scan) / 200))
+			scan = LAST_SCAN;
+
+		start = 0;
+		rcu_read_lock();
+restart:
+		radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
+					   start, SHMEM_TAG_PINNED) {
+
+			page = radix_tree_deref_slot(slot);
+
[Devel] [PATCH RHEL7 COMMIT] ms/sched: add cond_resched_rcu() helper
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit f5375ae5711c334bb1305639dc08a45898a32f19
Author: Andrew Vagin
Date: Thu Oct 15 15:04:15 2015 +0400

ms/sched: add cond_resched_rcu() helper

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Simon Horman

ML: f6f3c437d09e2f62533034e67bfb4385191e992c

This is intended for use in loops which read data protected by RCU and may have a large number of iterations. Such an example is dumping the list of connections known to IPVS: ip_vs_conn_array() and ip_vs_conn_seq_next().

The benefits are for CONFIG_PREEMPT_RCU=y where we save CPU cycles by moving rcu_read_lock and rcu_read_unlock out of large loops but still allowing the current task to be preempted after every loop iteration for the CONFIG_PREEMPT_RCU=n case.

The call to cond_resched() is not needed when CONFIG_PREEMPT_RCU=y. Thanks to Paul E. McKenney for explaining this and for the final version that checks the context with CONFIG_DEBUG_ATOMIC_SLEEP=y for all possible configurations.

The function can be empty in the CONFIG_PREEMPT_RCU case; rcu_read_lock and rcu_read_unlock are not needed in this case because the task can be preempted on indication from the scheduler. Thanks to Peter Zijlstra for catching this and for his help in trying a solution that changes __might_sleep.

Initial cond_resched_rcu_lock() function suggested by Eric Dumazet.

Tested-by: Julian Anastasov
Signed-off-by: Julian Anastasov
Signed-off-by: Simon Horman
Acked-by: Peter Zijlstra
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: Andrew Vagin
---
 include/linux/sched.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1bcabe..4560071 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2717,6 +2717,15 @@ extern int __cond_resched_softirq(void);
 	__cond_resched_softirq();	\
 })
 
+static inline void cond_resched_rcu(void)
+{
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
+	rcu_read_unlock();
+	cond_resched();
+	rcu_read_lock();
+#endif
+}
+
 /*
  * Does a critical section need to be broken due to another
  * task waiting?: (technically does not depend on CONFIG_PREEMPT,
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit 19e5b6f0c09fa1a46634605cec0c212a106044ee
Author: Andrew Vagin
Date: Thu Oct 15 15:04:13 2015 +0400

ms/prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Cyrill Gorcunov

ML: f606b77f1a9e362451aca8f81d8f36a3a112139e

During development of c/r we've noticed that if we need to support user namespaces we face a problem with capabilities in the prctl(PR_SET_MM, ...) call; in particular, once a new user namespace is created, capable(CAP_SYS_RESOURCE) no longer passes.

An approach is to eliminate the CAP_SYS_RESOURCE check but pass all new values in one bundle, which would allow the kernel to make a more intensive sanity test of the values and at the same time allow us to support checkpoint/restore of user namespaces.

Thus a new command PR_SET_MM_MAP is introduced. It takes a pointer to a prctl_mm_map structure which carries all the members to be updated.

	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

	struct prctl_mm_map {
		__u64	start_code;
		__u64	end_code;
		__u64	start_data;
		__u64	end_data;
		__u64	start_brk;
		__u64	brk;
		__u64	start_stack;
		__u64	arg_start;
		__u64	arg_end;
		__u64	env_start;
		__u64	env_end;
		__u64	*auxv;
		__u32	auxv_size;
		__u32	exe_fd;
	};

All members except @exe_fd correspond to members of struct mm_struct. To figure out which values these members may take, here are their meanings:

- start_code, end_code: represent bounds of the executable code area
- start_data, end_data: represent bounds of the data area
- start_brk, brk: used to calculate bounds for the brk() syscall
- start_stack: used when accounting space needed for command line arguments, environment and the shmat() syscall
- arg_start, arg_end, env_start, env_end: represent the memory area supplied for command line arguments and environment variables
- auxv, auxv_size: carry the auxiliary vector, Elf format specifics
- exe_fd: file descriptor number for the executable link (/proc/self/exe)

Thus we apply the following requirements to the values:

1) Any member except @auxv, @auxv_size, @exe_fd is an address in user space, thus it must lie inside the [mmap_min_addr, mmap_max_addr) interval.

2) While @[start|end]_code and @[start|end]_data may point to nonexistent VMAs (say a program maps its own new .text and .data segments during execution), the rest of the members should belong to a VMA which must exist.

3) Addresses must be ordered, i.e. a @start_ member must not be greater than or equal to the corresponding @end_ member.

4) As in the regular Elf loading procedure, we require that @start_brk and @brk be greater than @end_data.

5) If the RLIMIT_DATA rlimit is set to non-infinity, new values should not exceed the existing limit. The same applies to RLIMIT_STACK.

6) The auxiliary vector size must not exceed the existing one (which is predefined as AT_VECTOR_SIZE and depends on the architecture).

7) The file descriptor passed in @exe_fd should point to an executable file (because we use the existing prctl_set_mm_exe_file_locked helper, it ensures that the file we are going to use as the exe link has all required permissions granted).

Now about where these members are involved inside kernel code:

- @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

- @start_data and @end_data are used in /proc/$pid/[stat|statm] output; also they are considered when checking whether there is enough space for a brk() syscall result if RLIMIT_DATA is set;

- @start_brk is shown in /proc/$pid/stat output and accounted in the brk() syscall if RLIMIT_DATA is set; also this member is tested to find the symbolic name of an mmap event for the perf system (we choose whether the event is generated for the "heap" area); one more application is selinux -- we test if a process has the PROCESS__EXECHEAP permission when trying to make the heap area executable with the mprotect() syscall;

- @brk is the current value for the brk() syscall, which lies inside the heap area; it's shown in /proc/$pid/stat. When the brk() syscall successfully provides new memory to user space, upon completion mm::brk is updated to carry the new value;

Both @start_brk and @brk are actively used in /proc/$pid/maps
[Devel] [PATCH RHEL7 COMMIT] ms/shm: add memfd_create() syscall
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit 9e421edd0c467fb8d3a230520421a58f55e2a46e
Author: Andrew Vagin
Date: Thu Oct 15 15:04:18 2015 +0400

ms/shm: add memfd_create() syscall

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

ML: 9183df25fe7b194563db3fec6dc3202a5855839c

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It can support sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it.

memfd_create() returns the raw shmem file, so calls like ftruncate() can be used to modify the underlying inode. Also calls like fstat() will return proper information and mark the file as regular file. If you want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not supported (like on all other regular files).

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to a filesystem size limit. It is still properly accounted to memcg limits, though, and to the same overcommit or no-overcommit accounting as all user memory.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Conflicts:
	arch/x86/syscalls/syscall_32.tbl
	arch/x86/syscalls/syscall_64.tbl

Signed-off-by: Andrew Vagin
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h         | 1 +
 kernel/sys_ni.c                  | 1 +
 mm/shmem.c                       | 73 
 5 files changed, 77 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 5d1de5d..4d0e1b4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,6 +357,7 @@
 348	i386	process_vm_writev	sys_process_vm_writev	compat_sys_process_vm_writev
 349	i386	kcmp		sys_kcmp
 350	i386	finit_module	sys_finit_module
+356	i386	memfd_create	sys_memfd_create
 500	i386	fairsched_mknod	sys_fairsched_mknod
 501	i386	fairsched_rmnod	sys_fairsched_rmnod
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 3ed05b4..2415f42 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -321,6 +321,7 @@
 312	common	kcmp		sys_kcmp
 313	common	finit_module	sys_finit_module
 316	common	renameat2	sys_renameat2
+319	common	memfd_create	sys_memfd_create
 320	common	kexec_file_load	sys_kexec_file_load
 497	64	fairsched_nodemask	sys_fairsched_nodemask
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c89c938..2c2e396 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -786,6 +786,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7c98d8f..75a69b0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -194,6 +194,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 3964468..bc8e08b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
@@ -2854,6 +2856,77 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	shmem_show_mpol(seq,
[Devel] [PATCH RHEL7 COMMIT] ms/prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit bf86a407af9fa74c7251b62c539f36c141fc4f77
Author: Andrew Vagin
Date: Thu Oct 15 15:04:13 2015 +0400

ms/prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Cyrill Gorcunov

ML: 71fe97e185040c5dac3216cd54e186dfa534efa0

Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file(), move it out and rename the helper to prctl_set_mm_exe_file_locked(). This will allow reusing this function in a later patch.

Signed-off-by: Cyrill Gorcunov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Eric W. Biederman
Cc: H. Peter Anvin
Acked-by: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Cc: Julien Tinnes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Andrew Vagin
---
 kernel/sys.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index a2d5644..cf580a7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2036,12 +2036,14 @@ SYSCALL_DEFINE1(umask, int, mask)
 	return mask;
 }
 
-static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
+static int prctl_set_mm_exe_file_locked(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
 	struct inode *inode;
 	int err;
 
+	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
 	exe = fdget(fd);
 	if (!exe.file)
 		return -EBADF;
@@ -2062,8 +2064,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	if (err)
 		goto exit;
 
-	down_write(&mm->mmap_sem);
-
 	/*
 	 * Forbid mm->exe_file change if old file still mapped.
 	 */
@@ -2075,7 +2075,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 		if (vma->vm_file &&
 		    path_equal(&vma->vm_file->f_path,
			       &mm->exe_file->f_path))
-			goto exit_unlock;
+			goto exit;
 	}
 
 	/*
@@ -2086,13 +2086,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	 */
 	err = -EPERM;
 	if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags))
-		goto exit_unlock;
+		goto exit;
 
 	err = 0;
 	set_mm_exe_file(mm, exe.file);	/* this grabs a reference to exe.file */
-exit_unlock:
-	up_write(&mm->mmap_sem);
-
 exit:
 	fdput(exe);
 	return err;
@@ -2112,8 +2109,12 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	if (!capable(CAP_SYS_RESOURCE))
 		return -EPERM;
 
-	if (opt == PR_SET_MM_EXE_FILE)
-		return prctl_set_mm_exe_file(mm, (unsigned int)addr);
+	if (opt == PR_SET_MM_EXE_FILE) {
+		down_write(&mm->mmap_sem);
+		error = prctl_set_mm_exe_file_locked(mm, (unsigned int)addr);
+		up_write(&mm->mmap_sem);
+		return error;
+	}
 
 	if (addr >= TASK_SIZE || addr < mmap_min_addr)
 		return -EINVAL;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/aio: Make it possible to remap aio ring
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit a3ffce64acc927dd35825252566389966520dc94
Author: Andrew Vagin
Date: Thu Oct 15 15:04:14 2015 +0400

ms/aio: Make it possible to remap aio ring

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Pavel Emelyanov

ML: e4a0d3e720e7e508749c1439b5ba3aff56c92976

There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values.

Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped.

What do you think?

Signed-off-by: Pavel Emelyanov
Acked-by: Dmitry Monakhov
Signed-off-by: Benjamin LaHaise
Signed-off-by: Andrew Vagin
---
 fs/aio.c           | 20 
 include/linux/fs.h | 1 +
 mm/mremap.c        | 3 ++-
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 9d700b0..301da77 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -257,12 +257,32 @@ static void aio_free_ring(struct kioctx *ctx)
 
 static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	vma->vm_flags |= VM_DONTEXPAND;
 	vma->vm_ops = &generic_file_vm_ops;
 	return 0;
 }
 
+static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct kioctx *ctx;
+
+	spin_lock(&mm->ioctx_lock);
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
+		if (ctx && ctx->aio_ring_file == file) {
+			ctx->user_id = ctx->mmap_base = vma->vm_start;
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+	spin_unlock(&mm->ioctx_lock);
+}
+
 static const struct file_operations aio_ring_fops = {
 	.mmap = aio_ring_mmap,
+	.mremap = aio_ring_remap,
 };
 
 #if IS_ENABLED(CONFIG_MIGRATION)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e7bd3f..bbbf186 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1734,6 +1734,7 @@ struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	void (*mremap)(struct file *, struct vm_area_struct *);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/mm/mremap.c b/mm/mremap.c
index e1db886..0b40af6 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -293,7 +293,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		old_len = new_len;
 		old_addr = new_addr;
 		new_addr = -ENOMEM;
-	}
+	} else if (vma->vm_file && vma->vm_file->f_op->mremap)
+		vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
 
 	/* Conceal VM_ACCOUNT so old
[Devel] [PATCH RHEL7 COMMIT] ms/make default ->i_fop have ->open() fail with ENXIO
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit ab6784bb6f5bca77caef0e23d07e0b86dd178557 Author: Andrew VaginDate: Thu Oct 15 15:04:20 2015 +0400 ms/make default ->i_fop have ->open() fail with ENXIO The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 ML: bd9b51e79cb0b8bc00a7e0076a4a8963ca4a797c As it is, default ->i_fop has NULL ->open() (along with all other methods). The only case where it matters is reopening (via procfs symlink) a file that didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned to something sane (default would fail on read/write/ioctl/etc.). Unfortunately, such case exists - alloc_file() users, especially anon_get_file() ones. There we have tons of opened files of very different kinds sharing the same inode. As the result, attempt to reopen those via procfs succeeds and you get a descriptor you can't do anything with. Moreover, in case of sockets we set ->i_fop that will only be used on such reopen attempts - and put a failing ->open() into it to make sure those do not succeed. It would be simpler to put such ->open() into default ->i_fop and leave it unchanged both for anon inode (as we do anyway) and for socket ones. Result: * everything going through do_dentry_open() works as it used to * sock_no_open() kludge is gone * attempts to reopen anon-inode files fail as they really ought to * ditto for aio_private_file() * ditto for perfmon - this one actually tried to imitate sock_no_open() trick, but failed to set ->i_fop, so in the current tree reopens succeed and yield completely useless descriptor. Intent clearly had been to fail with -ENXIO on such reopens; now it actually does. 
* everything else that used alloc_file() keeps working - it has ->i_fop set for its inodes anyway Signed-off-by: Al Viro Signed-off-by: Andrew Vagin --- arch/ia64/kernel/perfmon.c | 10 -- fs/inode.c | 12 +--- include/linux/fs.h | 1 - net/Makefile | 2 -- net/nonet.c| 26 -- net/socket.c | 19 --- 6 files changed, 9 insertions(+), 61 deletions(-) diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c index 9ea25fc..4334a96 100644 --- a/arch/ia64/kernel/perfmon.c +++ b/arch/ia64/kernel/perfmon.c @@ -2145,22 +2145,12 @@ doit: return 0; } -static int -pfm_no_open(struct inode *irrelevant, struct file *dontcare) -{ - DPRINT(("pfm_no_open called\n")); - return -ENXIO; -} - - - static const struct file_operations pfm_file_ops = { .llseek = no_llseek, .read = pfm_read, .write = pfm_write, .poll = pfm_poll, .unlocked_ioctl = pfm_ioctl, - .open = pfm_no_open, /* special open code to disallow open via /proc */ .fasync = pfm_fasync, .release = pfm_close, .flush = pfm_flush diff --git a/fs/inode.c b/fs/inode.c index 960cd15..6c27178 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -121,6 +121,11 @@ int proc_nr_inodes(ctl_table *table, int write, } #endif +static int no_open(struct inode *inode, struct file *file) +{ + return -ENXIO; +} + /** * inode_init_always - perform inode structure intialisation * @sb: superblock inode belongs to @@ -131,7 +136,8 @@ int proc_nr_inodes(ctl_table *table, int write, */ int inode_init_always(struct super_block *sb, struct inode *inode) { static const struct inode_operations empty_iops; - static const struct file_operations empty_fops; + static const struct file_operations no_open_fops = {.open = no_open}; struct address_space *const mapping = &inode->i_data; inode->i_sb = sb; @@ -139,7 +145,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) inode->i_flags = 0; atomic_set(&inode->i_count, 1); inode->i_op = &empty_iops; - inode->i_fop = &empty_fops; + inode->i_fop = &no_open_fops; inode->__i_nlink = 1; inode->i_opflags = 0; i_uid_write(inode, 0); 
@@ -1900,7 +1906,7 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev) } else if (S_ISFIFO(mode)) inode->i_fop = &pipefifo_fops; else if (S_ISSOCK(mode)) - inode->i_fop = &bad_sock_fops; + ; /* leave it no_open_fops */ else printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for" " inode %s:%lu\n", mode, inode->i_sb->s_id, diff --git a/include/linux/fs.h
[Devel] [PATCH RHEL7 COMMIT] ms/mm: mmap_region: kill correct_wcount/inode, use allow_write_access()
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 55c695be4110abbbc0c16bd2f6d55de27ac03b90 Author: Andrew VaginDate: Thu Oct 15 15:04:16 2015 +0400 ms/mm: mmap_region: kill correct_wcount/inode, use allow_write_access() The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 MM: e86867720e617774b560dfbc169b7f3d0d490950 correct_wcount and inode in mmap_region() just complicate the code. This boolean was needed previously, when deny_write_access() was called before vma_merge(), now we can simply check VM_DENYWRITE and do allow_write_access() if it is set. allow_write_access() checks file != NULL, so this is safe even if it was possible to use VM_DENYWRITE && !file. Just we need to ensure we use the same file which was deny_write_access()'ed, so the patch also moves "file = vma->vm_file" down after allow_write_access(). Signed-off-by: Oleg Nesterov Cc: Hugh Dickins Cc: Al Viro Cc: Colin Cross Cc: David Rientjes Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- mm/mmap.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 826cf37..f87a78b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1494,11 +1494,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; - int correct_wcount = 0; int error; struct rb_node **rb_link, *rb_parent; unsigned long charged = 0; - struct inode *inode = file ? file_inode(file) : NULL; unsigned long ub_charged = 0; /* Check against address space limit. 
*/ @@ -1576,7 +1574,6 @@ munmap_back: error = deny_write_access(file); if (error) goto free_vma; - correct_wcount = 1; } vma->vm_file = get_file(file); error = file->f_op->mmap(file, vma); @@ -1631,11 +1628,10 @@ munmap_back: } vma_link(mm, vma, prev, rb_link, rb_parent); - file = vma->vm_file; - /* Once vma denies write, undo our temporary denial count */ - if (correct_wcount) - atomic_inc(&inode->i_writecount); + if (vm_flags & VM_DENYWRITE) + allow_write_access(file); + file = vma->vm_file; out: perf_event_mmap(vma); @@ -1663,8 +1659,8 @@ out: return addr; unmap_and_free_vma: - if (correct_wcount) - atomic_inc(&inode->i_writecount); + if (vm_flags & VM_DENYWRITE) + allow_write_access(file); vma->vm_file = NULL; fput(file); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/selftests: add memfd_create() + sealing tests
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 2d68e19bd9105a3fe7006d8252ee516f97a9ade8 Author: Andrew VaginDate: Thu Oct 15 15:04:18 2015 +0400 ms/selftests: add memfd_create() + sealing tests The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 From: David Herrmann ML: 4f5ce5e8d7e2da3c714df8a7fa42edb9f992fc52 Some basic tests to verify sealing on memfds works as expected and guarantees the advertised semantics. Signed-off-by: David Herrmann Acked-by: Hugh Dickins Cc: Michael Kerrisk Cc: Ryan Lortie Cc: Lennart Poettering Cc: Daniel Mack Cc: Andy Lutomirski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/memfd/.gitignore | 2 + tools/testing/selftests/memfd/Makefile | 29 + tools/testing/selftests/memfd/memfd_test.c | 913 + 4 files changed, 945 insertions(+) diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index c7fd8ac..ab4015b 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -2,6 +2,7 @@ TARGETS = breakpoints TARGETS += cpu-hotplug TARGETS += efivarfs TARGETS += kcmp +TARGETS += memfd TARGETS += memory-hotplug TARGETS += mqueue TARGETS += net diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore new file mode 100644 index 000..bcc8ee2 --- /dev/null +++ b/tools/testing/selftests/memfd/.gitignore @@ -0,0 +1,2 @@ +memfd_test +memfd-test-file diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile new file mode 100644 index 000..36653b9 --- /dev/null +++ b/tools/testing/selftests/memfd/Makefile @@ -0,0 +1,29 @@ +uname_M := $(shell uname -m 2>/dev/null || echo not) +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/) +ifeq ($(ARCH),i386) + ARCH := X86 +endif +ifeq 
($(ARCH),x86_64) + ARCH := X86 +endif + +CFLAGS += -I../../../../arch/x86/include/generated/uapi/ +CFLAGS += -I../../../../arch/x86/include/uapi/ +CFLAGS += -I../../../../include/uapi/ +CFLAGS += -I../../../../include/ + +all: +ifeq ($(ARCH),X86) + gcc $(CFLAGS) memfd_test.c -o memfd_test +else + echo "Not an x86 target, can't build memfd selftest" +endif + +run_tests: all +ifeq ($(ARCH),X86) + gcc $(CFLAGS) memfd_test.c -o memfd_test +endif + @./memfd_test || echo "memfd_test: [FAIL]" + +clean: + $(RM) memfd_test diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c new file mode 100644 index 000..3634c90 --- /dev/null +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -0,0 +1,913 @@ +#define _GNU_SOURCE +#define __EXPORTED_HEADERS__ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define MFD_DEF_SIZE 8192 +#define STACK_SIZE 65535 + +static int sys_memfd_create(const char *name, + unsigned int flags) +{ + return syscall(__NR_memfd_create, name, flags); +} + +static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags) +{ + int r, fd; + + fd = sys_memfd_create(name, flags); + if (fd < 0) { + printf("memfd_create(\"%s\", %u) failed: %m\n", + name, flags); + abort(); + } + + r = ftruncate(fd, sz); + if (r < 0) { + printf("ftruncate(%llu) failed: %m\n", (unsigned long long)sz); + abort(); + } + + return fd; +} + +static void mfd_fail_new(const char *name, unsigned int flags) +{ + int r; + + r = sys_memfd_create(name, flags); + if (r >= 0) { + printf("memfd_create(\"%s\", %u) succeeded, but failure expected\n", + name, flags); + close(r); + abort(); + } +} + +static __u64 mfd_assert_get_seals(int fd) +{ + long r; + + r = fcntl(fd, F_GET_SEALS); + if (r < 0) { + printf("GET_SEALS(%d) failed: %m\n", fd); + abort(); + } + + return r; +} + +static void mfd_assert_has_seals(int fd, __u64 seals) +{ + __u64 
s; + + s = mfd_assert_get_seals(fd); + if (s != seals) { + printf("%llu != %llu = GET_SEALS(%d)\n", + (unsigned long long)seals, (unsigned
[Devel] [PATCH RHEL7 COMMIT] ms/mm: allow drivers to prevent new writable mappings
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit a60b63122e58834a4fd04b9be05311dd67801a07 Author: Andrew VaginDate: Thu Oct 15 15:04:16 2015 +0400 ms/mm: allow drivers to prevent new writable mappings The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 From: David Herrmann ML: 4bb5f5d9395bc112d93a134d8f5b05611eddc9c0 This patch (of 6): The i_mmap_writable field counts existing writable mappings of an address_space. To allow drivers to prevent new writable mappings, make this counter signed and prevent new writable mappings if it is negative. This is modelled after i_writecount and DENYWRITE. This will be required by the shmem-sealing infrastructure to prevent any new writable mappings after the WRITE seal has been set. In case there exists a writable mapping, this operation will fail with EBUSY. Note that we rely on the fact that iff you already own a writable mapping, you can increase the counter without using the helpers. This is the same that we do for i_writecount. 
Signed-off-by: David Herrmann Acked-by: Hugh Dickins Cc: Michael Kerrisk Cc: Ryan Lortie Cc: Lennart Poettering Cc: Daniel Mack Cc: Andy Lutomirski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- fs/inode.c | 1 + include/linux/fs.h | 29 +++-- kernel/fork.c | 2 +- mm/mmap.c | 30 -- mm/swap_state.c| 1 + 5 files changed, 54 insertions(+), 9 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 8c14103..960cd15 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -171,6 +171,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) mapping->a_ops = _aops; mapping->host = inode; mapping->flags = 0; + atomic_set(>i_mmap_writable, 0); mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); mapping->private_data = NULL; mapping->backing_dev_info = _backing_dev_info; diff --git a/include/linux/fs.h b/include/linux/fs.h index bbbf186..f410c54 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -547,7 +547,7 @@ struct address_space { struct inode*host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ spinlock_t tree_lock; /* and lock protecting it */ - unsigned inti_mmap_writable;/* count VM_SHARED mappings */ + atomic_ti_mmap_writable;/* count VM_SHARED mappings */ struct rb_root i_mmap; /* tree of private and shared mappings */ struct list_headi_mmap_nonlinear;/*list VM_NONLINEAR mappings */ struct mutexi_mmap_mutex; /* protect tree, count, list */ @@ -633,10 +633,35 @@ static inline int mapping_mapped(struct address_space *mapping) * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff * marks vma as VM_SHARED if it is shared, and the file was opened for * writing i.e. vma may be mprotected writable even if now readonly. + * + * If i_mmap_writable is negative, no new writable mappings are allowed. You + * can only deny writable mappings, if none exists right now. 
*/ static inline int mapping_writably_mapped(struct address_space *mapping) { - return mapping->i_mmap_writable != 0; + return atomic_read(&mapping->i_mmap_writable) > 0; +} + +static inline int mapping_map_writable(struct address_space *mapping) +{ + return atomic_inc_unless_negative(&mapping->i_mmap_writable) ? + 0 : -EPERM; +} + +static inline void mapping_unmap_writable(struct address_space *mapping) +{ + atomic_dec(&mapping->i_mmap_writable); +} + +static inline int mapping_deny_writable(struct address_space *mapping) +{ + return atomic_dec_unless_positive(&mapping->i_mmap_writable) ? + 0 : -EBUSY; +} + +static inline void mapping_allow_writable(struct address_space *mapping) +{ + atomic_inc(&mapping->i_mmap_writable); } /* diff --git a/kernel/fork.c b/kernel/fork.c index 505fa21..8fcc5db 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -435,7 +435,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) atomic_dec(&inode->i_writecount); mutex_lock(&mapping->i_mmap_mutex); if (tmp->vm_flags & VM_SHARED) - mapping->i_mmap_writable++; +
Re: [Devel] [RFC] Fix get_exec_env() races
Vova reminded me that we may sleep inside a get_exec_env() section, so it's better to use a task work here. On 15.10.2015 13:02, Kirill Tkhai wrote:
> Since we allow attaching a non-current task to a ve cgroup, there is a race
> in the places where we use get_exec_env(). The task's ve may change after
> get_exec_env() has been dereferenced, so a lot of problems are possible there.
> I'm sure most of the places where we use get_exec_env() were not written with
> the assumption that it may change. Also, there are a lot of nested functions,
> and it's impossible to check every function to verify whether its input
> parameters, derived from a caller's dereferenced ve, have become stale because
> the ve has changed.
>
> I suggest modifying get_exec_env() so that it guarantees the ve's stability.
> It pairs with put_exec_env(), which marks the end of the region where ve
> modification is not desirable.
>
> get_exec_env() may be used nested, hence task_struct::ve_attach_lock_depth,
> which allows nesting. The counter looks like a better variant than a plain
> read_lock() in get_exec_env() and a write_trylock() loop in ve_attach():
>
> get_exec_env()
> {
>	...
>	read_lock();
>	...
> }
>
> ve_attach()
> {
>	while (!write_trylock())
>		cpu_relax();
> }
>
> because in that case the priority of read_lock() would be absolute, and we
> would lose all the advantages of queued rwlock fairness.
>
> I also considered variants using RCU and task works, but they seem to be worse.
>
> Please, your comments. 
>
> ---
> include/linux/init_task.h | 3 ++-
> include/linux/sched.h | 1 +
> include/linux/ve.h | 29 +
> include/linux/ve_proto.h | 1 -
> kernel/fork.c | 3 +++
> kernel/ve/ve.c | 8 +++-
> 6 files changed, 42 insertions(+), 3 deletions(-)
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index d2cbad0..57e0796 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -136,7 +136,8 @@ extern struct task_group root_task_group;
> #endif
>
> #ifdef CONFIG_VE
> -#define INIT_TASK_VE(tsk)	.task_ve = &ve0,
> +#define INIT_TASK_VE(tsk)	.task_ve = &ve0,	\
> +	.ve_attach_lock_depth = 0
> #else
> #define INIT_TASK_VE(tsk)
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e1bcabe..948481f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1564,6 +1564,7 @@ struct task_struct {
> #endif
> #ifdef CONFIG_VE
>	struct ve_struct *task_ve;
> +	unsigned int ve_attach_lock_depth;
> #endif
> #ifdef CONFIG_MEMCG /* memcg uses this to do batch job */
>	struct memcg_batch_info {
> diff --git a/include/linux/ve.h b/include/linux/ve.h
> index 86b95c3..3cea73d 100644
> --- a/include/linux/ve.h
> +++ b/include/linux/ve.h
> @@ -33,6 +33,7 @@ struct ve_monitor;
> struct nsproxy;
>
> struct ve_struct {
> +	rwlock_t attach_lock;
>	struct cgroup_subsys_state css;
>
>	const char *ve_name;
> @@ -130,6 +131,34 @@ struct ve_struct {
> #endif
> };
>
> +static inline struct ve_struct *get_exec_env(void)
> +{
> +	struct ve_struct *ve;
> +
> +	if (++current->ve_attach_lock_depth > 1)
> +		return current->task_ve;
> +
> +	rcu_read_lock();
> +again:
> +	ve = current->task_ve;
> +	read_lock(&ve->attach_lock);
> +	if (unlikely(current->task_ve != ve)) {
> +		read_unlock(&ve->attach_lock);
> +		goto again;
> +	}
> +	rcu_read_unlock();
> +
> +	return ve;
> +}
> +
> +static inline void put_exec_env(void)
> +{
> +	struct ve_struct *ve = current->task_ve;
> +
> +	if (!--current->ve_attach_lock_depth)
> +		read_unlock(&ve->attach_lock);
> +}
> +
> 
> struct ve_devmnt {
>	struct list_head link;
>
> diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
> index 0f5898e..3deb09e 100644
> --- a/include/linux/ve_proto.h
> +++ b/include/linux/ve_proto.h
> @@ -30,7 +30,6 @@ static inline bool ve_is_super(struct ve_struct *ve)
>	return ve == &ve0;
> }
>
> -#define get_exec_env() (current->task_ve)
> #define get_env_init(ve) (ve->ve_ns->pid_ns->child_reaper)
>
> const char *ve_name(struct ve_struct *ve);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 505fa21..3d7e452 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1439,6 +1439,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>	INIT_LIST_HEAD(&p->pi_state_list);
>	p->pi_state_cache = NULL;
> #endif
> +#ifdef CONFIG_VE
> +	p->ve_attach_lock_depth = 0;
> +#endif
>	/*
>	 * sigaltstack should be cleared when sharing the same VM
>	 */
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index 39a95e8..23833ed 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -640,6 +640,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup
Re: [Devel] [PATCH RH7 1/2] device_cgroup: fake allowing all devices for docker inside VZCT
Volodya, please review. -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 10/13/2015 06:11 PM, Pavel Tikhomirov wrote: We need it for docker 1.7.+, please review. On 10/07/2015 11:51 AM, Pavel Tikhomirov wrote: Docker 1.7.0+ tries to write "a" to devices.allow for a newly created privileged container's device cgroup, and thus to allow all devices in the docker container. Docker fails to do so because not all devices are allowed in the parent VZCT cgroup. To support docker we must allow writing "a" to devices.allow in a CT. With this patch, if we get "a" we silently exit without EPERM. https://jira.sw.ru/browse/PSBM-38691 v2: fix bug link, fix comment style Signed-off-by: Pavel Tikhomirov --- security/device_cgroup.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/security/device_cgroup.c b/security/device_cgroup.c index 531e40c..9f932d7 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -689,7 +689,14 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup, if (has_children(devcgroup)) return -EINVAL; - if (!may_allow_all(parent)) + if (!may_allow_all(parent)) { + if (ve_is_super(get_exec_env())) + return -EPERM; + else + /* Fooling docker in CT - silently exit */ + return 0; + } + return -EPERM; dev_exception_clean(devcgroup); devcgroup->behavior = DEVCG_DEFAULT_ALLOW; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ve: Strip unset options in ve.mount_opts
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 5e85a8088c27b22a38d446530fab6904db6314a4 Author: Kirill TkhaiDate: Thu Oct 15 14:37:13 2015 +0400 ve: Strip unset options in ve.mount_opts Igor reports "(null)" in not set options may confuse user: echo "0 182:223361;1 balloon_ino=12,pfcache_csum,,2: (null);" The patch removes not set options from there: echo "0 182:223361;1 balloon_ino=12,pfcache_csum,,;" N.B. No any problem there, because *printf handles zero strings for a long time. Requested-by: Igor Sukhih Signed-off-by: Kirill Tkhai Reviewed-by: Maxim Patlasov --- kernel/ve/ve.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 12cfa33..39a95e8 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -906,9 +906,12 @@ static int ve_mount_opts_show(struct seq_file *m, void *v) struct ve_devmnt *devmnt = v; dev_t dev = devmnt->dev; - seq_printf(m, "0 %u:%u;1 %s;2 %s;\n", MAJOR(dev), MINOR(dev), - devmnt->hidden_options, - devmnt->allowed_options); + seq_printf(m, "0 %u:%u;", MAJOR(dev), MINOR(dev)); + if (devmnt->hidden_options) + seq_printf(m, "1 %s;", devmnt->hidden_options); + if (devmnt->allowed_options) + seq_printf(m, "2 %s;", devmnt->allowed_options); + seq_putc(m, '\n'); return 0; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC] Fix get_exec_env() races
> @@ -130,6 +131,34 @@ struct ve_struct {
> #endif
> };
>
> +static inline struct ve_struct *get_exec_env(void)
> +{
> +	struct ve_struct *ve;
> +
> +	if (++current->ve_attach_lock_depth > 1)
> +		return current->task_ve;
> +
> +	rcu_read_lock();
> +again:
> +	ve = current->task_ve;
> +	read_lock(&ve->attach_lock);
> +	if (unlikely(current->task_ve != ve)) {
> +		read_unlock(&ve->attach_lock);
> +		goto again;

Please, no. The 3.10 kernel has task works: ask the task you want to attach to a ve to execute a task work that moves it there itself, and keep this small routine small and simple.

> +	}
> +	rcu_read_unlock();
> +
> +	return ve;
> +}
> +
[Devel] [PATCH RHEL7 COMMIT] ms/shm: add sealing API
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 38bc7de2200c7f0aafacc2f30769787ca3c55308 Author: Andrew VaginDate: Thu Oct 15 15:04:17 2015 +0400 ms/shm: add sealing API The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 ML: 40e041a2c858b3caefc757e26cb85bfceae5062b If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. 
While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). 
The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the
[Devel] [PATCH RHEL7 COMMIT] ms/mm: introduce check_data_rlimit helper
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit a3ca20dacb7becb446fde9154abd51cbb0594674 Author: Andrew VaginDate: Thu Oct 15 15:04:12 2015 +0400 ms/mm: introduce check_data_rlimit helper The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 From: Cyrill Gorcunov ML: 9c5990240e076ae564cccbd921868cd08f6daaa5 To eliminate code duplication lets introduce check_data_rlimit helper which we will use in brk() and prctl() syscalls. Signed-off-by: Cyrill Gorcunov Cc: Kees Cook Cc: Tejun Heo Cc: Andrew Vagin Cc: Eric W. Biederman Cc: H. Peter Anvin Acked-by: Serge Hallyn Cc: Pavel Emelyanov Cc: Vasiliy Kulikov Cc: KAMEZAWA Hiroyuki Cc: Michael Kerrisk Cc: Julien Tinnes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- include/linux/mm.h | 15 +++ 1 file changed, 15 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8424c6a..163d3d8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -18,6 +18,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -1747,6 +1748,20 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **, bool *need_rmap_locks); extern void exit_mmap(struct mm_struct *); +static inline int check_data_rlimit(unsigned long rlim, + unsigned long new, + unsigned long start, + unsigned long end_data, + unsigned long start_data) +{ + if (rlim < RLIM_INFINITY) { + if (((new - start) + (end_data - start_data)) > rlim) + return -ENOSPC; + } + + return 0; +} + extern int mm_take_all_locks(struct mm_struct *mm); extern void mm_drop_all_locks(struct mm_struct *mm); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/oom: add helpers for setting and clearing TIF_MEMDIE
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 4860757ccf723defc3ba770ca3ad3f8c67c4ae20 Author: Vladimir DavydovDate: Thu Oct 15 17:47:34 2015 +0400 ms/oom: add helpers for setting and clearing TIF_MEMDIE Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Michal Hocko This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. 
I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] 
[ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless of the rest IMO. Patches 3 and 4 are trivial printk ->
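The disable/enable protocol described above boils down to a victim counter: marking a victim raises it, unmarking drops it, and the OOM killer may only be disabled once it reaches zero. A minimal userspace sketch of that protocol (names and the wait-free check are simplifications, not the kernel's implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the TIF_MEMDIE helper protocol: mark/unmark keep
 * a count of in-flight OOM victims, and the PM freezer may only turn
 * the OOM killer off once no victim is still exiting. */
static int oom_victims;
static bool oom_killer_disabled;

static void mark_oom_victim(void)   { oom_victims++; }
static void unmark_oom_victim(void) { oom_victims--; }

/* Succeeds only when no victim is still on its way out. */
static bool try_disable_oom_killer(void)
{
    if (oom_victims > 0)
        return false;
    oom_killer_disabled = true;
    return true;
}
```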
[Devel] [PATCH RHEL7 COMMIT] ms/oom: thaw the OOM victim if it is frozen
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 880e147721e60945828b460b86f36057e72603df Author: Vladimir DavydovDate: Thu Oct 15 17:47:35 2015 +0400 ms/oom: thaw the OOM victim if it is frozen Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Michal Hocko oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko Cc: Tejun Heo Cc: David Rientjes Cc: Johannes Weiner Cc: Oleg Nesterov Cc: Cong Wang Cc: "Rafael J. 
Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 63a8ca9b2084fa5bd91aa380532f18e361764109) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai --- mm/oom_kill.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 224dd8d..7b106e8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -266,8 +266,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, * Don't allow any other task to have access to the reserves. */ if (test_tsk_thread_flag(task, TIF_MEMDIE)) { - if (unlikely(frozen(task))) - __thaw_task(task); if (!force_kill) return OOM_SCAN_ABORT; } @@ -417,6 +415,14 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, void mark_tsk_oom_victim(struct task_struct *tsk) { set_tsk_thread_flag(tsk, TIF_MEMDIE); + + /* +* Make sure that the task is woken up from uninterruptible sleep +* if it is frozen because OOM killer wouldn't be able to free +* any memory and livelock. freezing_slow_path will tell the freezer +* that TIF_MEMDIE tasks should be ignored. +*/ + __thaw_task(tsk); } /** ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
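The behavioral change can be modeled in a few lines of userspace C — marking a victim now thaws it immediately instead of waiting for a later OOM scan to do so (`struct task` and the helpers below are illustrative stand-ins for the kernel objects, not the real API):

```c
#include <assert.h>
#include <stdbool.h>

#define TIF_MEMDIE 0x1

struct task { unsigned flags; bool frozen; };

static void thaw_task(struct task *t) { t->frozen = false; }

/* As in the patch: set TIF_MEMDIE, then thaw unconditionally so a
 * frozen victim can run its exit path instead of livelocking the
 * OOM killer; the real __thaw_task() handles the not-frozen case. */
static void mark_tsk_oom_victim(struct task *t)
{
    t->flags |= TIF_MEMDIE;
    thaw_task(t);
}
```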
[Devel] [PATCH RHEL7 COMMIT] ms/mm, oom: remove unnecessary exit_state check
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit d2dc55df7ee5b44dac752c1ff02e2ae5ce251935 Author: Vladimir DavydovDate: Thu Oct 15 17:47:32 2015 +0400 ms/mm, oom: remove unnecessary exit_state check Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: David Rientjes The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. 
Signed-off-by: David Rientjes Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai --- kernel/exit.c | 1 + mm/oom_kill.c | 2 -- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index dbc8f77..90feb5f 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -520,6 +520,7 @@ static void exit_mm(struct task_struct * tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); + clear_thread_flag(TIF_MEMDIE); } /* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5413a44..57d9f3e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -258,8 +258,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill, bool ignore_memcg_guarantee) { - if (task->exit_state) - return OOM_SCAN_CONTINUE; if (oom_unkillable_task(task, NULL, nodemask)) return OOM_SCAN_CONTINUE; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
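A userspace model of the new lifetime rule: the scan aborts only while TIF_MEMDIE is set, and the exit path clears the flag right after the mm is released, so a victim stops blocking the OOM killer as soon as its memory is actually freed (types and names below are illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define TIF_MEMDIE 0x1

enum oom_scan_t { OOM_SCAN_OK, OOM_SCAN_ABORT };

struct task { unsigned flags; void *mm; };

/* Scan decision after the patch: no exit_state check, only TIF_MEMDIE. */
static enum oom_scan_t oom_scan(const struct task *t)
{
    return (t->flags & TIF_MEMDIE) ? OOM_SCAN_ABORT : OOM_SCAN_OK;
}

/* exit_mm(): release the mm first, then stop being an OOM victim. */
static void exit_mm(struct task *t)
{
    t->mm = NULL;               /* stands in for mmput() */
    t->flags &= ~TIF_MEMDIE;    /* clear_thread_flag(TIF_MEMDIE) */
}
```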
[Devel] [PATCH RHEL7 COMMIT] oom: rework logic behind memory.oom_guarantee
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit acf9780b995d7cabcb227c5a3636635a365a1d7c Author: Vladimir DavydovDate: Thu Oct 15 17:53:02 2015 +0400 oom: rework logic behind memory.oom_guarantee Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect 
berserker mode Reviewed-by: Kirill Tkhai = This patch description: Currently, memory.oom_guarantee works as a threshold: we first select processes in cgroups whose usage is below oom guarantee, and only if there is no eligible process in such cgroups, we disregard oom guarantee configuration and iterate over all processes. Although simple to implement, such a behavior is unfair: we do not differentiate between cgroups that only slightly above their guarantee and those who exceed it significantly. This patch therefore reworks the way how memory.oom_guarantee affects oom killer behavior. First of all, it reverts old logic, which was introduced by commit e94e18346f74c ("memcg: add oom_guarantee"), leaving hunks bringing the memory.oom_guarantee knob intact. Then it implements a new approach of selecting oom victim that works as follows. Now a task is selected by oom killer iff (a) the memory cgroup which the process resides in has the greatest overdraft of all cgroups eligible for scan and (b) the process has the greatest score among all processes which reside in cgroups with the greatest overdraft. 
A cgroup's overdraft is defined as (U-G)/(L-G), if U --- fs/proc/base.c | 2 +- include/linux/memcontrol.h | 6 ++-- include/linux/oom.h | 24 +++-- mm/memcontrol.c | 86 ++ mm/oom_kill.c | 61 ++-- 5 files changed, 92 insertions(+), 87 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index b574498..b5f3a70 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -455,7 +455,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer) read_lock(&tasklist_lock); if (pid_alive(task)) - points = oom_badness(task, NULL, NULL, totalpages) * + points = oom_badness(task, NULL, NULL, totalpages, NULL) * 1000 / totalpages; read_unlock(&tasklist_lock); return sprintf(buffer, "%lu\n", points); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5911327..0c85642 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -122,7 +122,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
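The overdraft sentence above is cut off mid-formula; from the quoted part it is (U - G) / (L - G), and the sketch below assumes the truncated condition is U > G (no overdraft while usage stays under the guarantee), computed in fixed point since the kernel avoids floating point. This is a hedged reconstruction, not the patch's code:

```c
#include <assert.h>
#include <limits.h>

/* overdraft = (U - G) / (L - G), scaled by 1000; U = usage,
 * G = oom_guarantee, L = limit. Assumptions: zero overdraft below the
 * guarantee, maximal when there is no room between guarantee and limit. */
static unsigned long overdraft(unsigned long usage,
                               unsigned long guarantee,
                               unsigned long limit)
{
    if (usage <= guarantee)
        return 0;
    if (limit <= guarantee)
        return ULONG_MAX;
    return (usage - guarantee) * 1000UL / (limit - guarantee);
}
```

With this metric, a cgroup slightly above its guarantee scores much lower than one far beyond it, which is exactly the fairness property the patch description argues for.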
[Devel] [PATCH RHEL7 COMMIT] memcg: add lock for protecting memcg->oom_notify list
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit f86c874c39188e9af50163092e161878a1067977 Author: Vladimir DavydovDate: Thu Oct 15 17:52:59 2015 +0400 memcg: add lock for protecting memcg->oom_notify list Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: 
resurrect berserker mode Reviewed-by: Kirill Tkhai = This patch description: Currently, memcg_oom_lock is used for this, but I'm going to get rid of it in the following patch, so introduce a dedicated lock. Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fdd14dd2..faef356 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5766,12 +5766,18 @@ static int compare_thresholds(const void *a, const void *b) return 0; } +static DEFINE_SPINLOCK(memcg_oom_notify_lock); + static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg) { struct mem_cgroup_eventfd_list *ev; + spin_lock(&memcg_oom_notify_lock); + list_for_each_entry(ev, &memcg->oom_notify, list) eventfd_signal(ev->eventfd, 1); + + spin_unlock(&memcg_oom_notify_lock); return 0; } @@ -5957,7 +5963,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp, if (!event) return -ENOMEM; - spin_lock(&memcg_oom_lock); + spin_lock(&memcg_oom_notify_lock); event->eventfd = eventfd; list_add(&event->list, &memcg->oom_notify); @@ -5965,7 +5971,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp, /* already in OOM ? */ if (atomic_read(&memcg->under_oom)) eventfd_signal(eventfd, 1); - spin_unlock(&memcg_oom_lock); + spin_unlock(&memcg_oom_notify_lock); return 0; } @@ -5979,7 +5985,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp, BUG_ON(type != _OOM_TYPE); - spin_lock(&memcg_oom_lock); + spin_lock(&memcg_oom_notify_lock); list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) { if (ev->eventfd == eventfd) { @@ -5988,7 +5994,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp, } } - spin_unlock(&memcg_oom_lock); + spin_unlock(&memcg_oom_notify_lock); } static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
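The locking pattern the patch introduces — one dedicated lock serializing list mutation against the notify walk — can be sketched in userspace with a pthread mutex standing in for the spinlock (the fixed-size table replaces the kernel's linked list; all names are illustrative):

```c
#include <assert.h>
#include <pthread.h>

#define MAX_EVENTS 8

static pthread_mutex_t oom_notify_lock = PTHREAD_MUTEX_INITIALIZER;
static int notify_fds[MAX_EVENTS];   /* 0 = free slot */

/* Add a listener under the dedicated lock, as the register path does. */
static int register_event(int fd)
{
    int i, ret = -1;
    pthread_mutex_lock(&oom_notify_lock);
    for (i = 0; i < MAX_EVENTS; i++)
        if (!notify_fds[i]) { notify_fds[i] = fd; ret = 0; break; }
    pthread_mutex_unlock(&oom_notify_lock);
    return ret;
}

static void unregister_event(int fd)
{
    int i;
    pthread_mutex_lock(&oom_notify_lock);
    for (i = 0; i < MAX_EVENTS; i++)
        if (notify_fds[i] == fd) notify_fds[i] = 0;
    pthread_mutex_unlock(&oom_notify_lock);
}

/* Walk under the same lock, as mem_cgroup_oom_notify_cb() now does;
 * returns how many listeners would have been signaled. */
static int notify_all(void)
{
    int i, n = 0;
    pthread_mutex_lock(&oom_notify_lock);
    for (i = 0; i < MAX_EVENTS; i++)
        if (notify_fds[i]) n++;
    pthread_mutex_unlock(&oom_notify_lock);
    return n;
}
```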
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 17:23, Pavel Emelyanov wrote: > On 10/15/2015 05:21 PM, Kirill Tkhai wrote: >> >> >> On 15.10.2015 14:15, Pavel Emelyanov wrote: >>> @@ -130,6 +131,34 @@ struct ve_struct { #endif }; +static inline struct ve_struct *get_exec_env(void) +{ + struct ve_struct *ve; + + if (++current->ve_attach_lock_depth > 1) + return current->task_ve; + + rcu_read_lock(); +again: + ve = current->task_ve; + read_lock(&ve->attach_lock); + if (unlikely(current->task_ve != ve)) { + read_unlock(&ve->attach_lock); + goto again; >>> >>> Please, no. 3.10 kernel has task_work-s, ask the task you want to attach to ve to execute the work by moving itself into it and keep this small routine small and simple. >> >> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), so we can't make the attaching task wait till it completes the task work (it may execute any code; it may be taking cgroup_mutex, for example). > > I see. > >> Should we provide an interface for userspace to learn that the task has finished changing its ve? > > No. What are the places where get_exec_env() is still required? Ok. We use it from time to time $ git grep get_exec_env | grep -v ve_is_super | wc -l 71 >> + } + rcu_read_unlock(); + + return ve; +} + >>> >> . >> >
[Devel] [PATCH RHEL7 COMMIT] ve/vtty: Make indices to match pcs6 scheme
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.8 --> commit 77f7c920ddb6426dfb580fff5146da73c6f7f7d3 Author: Cyrill GorcunovDate: Thu Oct 15 20:04:49 2015 +0400 ve/vtty: Make indices to match pcs6 scheme In pcs6 vttys are mapped into internal kernel representation in nonobvious way. The /dev/console represent [maj:5,min:1], in turn /dev/tty[0-...] are defined as [maj:4,min:0...], where minor is bijective to symbol postfix of the tty. Internally in the pcs6 kernel any open of /dev/ttyX has been mapping minor into vtty index as | if (minor > 0) | index = minor - 1 | else | index = 0 which actually shifts indices and make /dev/tty0 as an alias to /dev/console inside container. Same time vzctl tool passes console number argument in a decremented way, iow when one is typing vzctl console $ctid 1 here is 1 is a tty number, the kernel sees is as 0, opening containers /dev/console. When one types "vzctl console $ctid 2" (which implies to open container's /dev/tty2) the vzctl passes index 1 and the kernel opens /dev/tty2 because of the if/else index mapping as show above. Lets implement same indices mapping in pcs7 for backward compatibility (in pcs7 there is a per-VE vtty_map_t structure which reserve up to MAX_NR_VTTY_CONSOLES ttys to track and it is simply an array addressed by tty index). Same time lets fix a few nits: disable setup of controlling terminal on /dev/console only, since all ttys can have controlling sign; make sure we're having @tty_fops for such terminals. 
https://jira.sw.ru/browse/PSBM-40088 Signed-off-by: Cyrill Gorcunov Reviewed-by: Vladimir Davydov CC: Konstantin Khorenko CC: Igor Sukhih --- drivers/tty/pty.c | 7 +-- drivers/tty/tty_io.c | 12 ++-- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c index b74ddca..0ab36f9 100644 --- a/drivers/tty/pty.c +++ b/drivers/tty/pty.c @@ -1240,8 +1240,11 @@ struct tty_driver *vtty_console_driver(int *index) struct tty_driver *vtty_driver(dev_t dev, int *index) { if (MAJOR(dev) == TTY_MAJOR && - MINOR(dev) < MAX_NR_VTTY_CONSOLES) { - *index = MINOR(dev); + MINOR(dev) <= MAX_NR_VTTY_CONSOLES) { + if (MINOR(dev)) + *index = MINOR(dev) - 1; + else + *index = 0; return vttys_driver; } return NULL; diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c index 8fc8334..8ce0a5f 100644 --- a/drivers/tty/tty_io.c +++ b/drivers/tty/tty_io.c @@ -1941,7 +1941,8 @@ static struct tty_driver *tty_lookup_driver(dev_t device, struct file *filp, if (!ve_is_super(ve)) { driver = vtty_driver(device, index); if (driver) { - *noctty = 1; + if (MINOR(device) == 0) + *noctty = 1; return tty_driver_kref_get(driver); } } @@ -1960,8 +1961,15 @@ case MKDEV(TTYAUX_MAJOR, 1): { struct tty_driver *console_driver = console_device(index); #ifdef CONFIG_VE - if (!ve_is_super(ve)) + if (!ve_is_super(ve)) { console_driver = vtty_console_driver(index); + /* + * Reset fops, sometimes there might be + * console_fops picked from inode->i_cdev + * in chrdev_open() + */ + filp->f_op = &tty_fops; + } #endif if (console_driver) { driver = tty_driver_kref_get(console_driver);
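The index mapping the patch implements is a one-liner; this standalone version (hypothetical function name) mirrors the if/else shown in the diff, making /dev/tty0 an alias of /dev/console (index 0) and shifting /dev/ttyN to index N-1:

```c
#include <assert.h>

/* pcs6-compatible vtty index from a TTY_MAJOR minor number:
 * minor 0 (alias of /dev/console) and minor 1 both map to index 0,
 * minor N > 0 maps to index N - 1. */
static int vtty_index_from_minor(int minor)
{
    return minor > 0 ? minor - 1 : 0;
}
```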
[Devel] [PATCH RHEL7 COMMIT] ve/net: introduce TAP accounting
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit b59e089eb2d2fdc939e54abb656cd5b7a2ad500e Author: Vladimir Sementsov-OgievskiyDate: Thu Oct 15 18:56:47 2015 +0400 ve/net: introduce TAP accounting Add ve accounting to tun/tap devices. New ioctl should be called to attach/create ve stat to tun/tap. https://jira.sw.ru/browse/PSBM-27713 Note: TUN accounting is not tested for now and disabled in this commit. only TAP accounting is allowed for now. Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Cyrill Gorcunov Acked-by: Konstantin Khorenko --- drivers/net/tun.c | 68 - include/uapi/linux/if_tun.h | 9 ++ kernel/Kconfig.openvz | 7 + 3 files changed, 83 insertions(+), 1 deletion(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 392d701..4f7eee9 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -71,6 +71,10 @@ #include +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING +#include +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + /* Uncomment to enable debugging */ /* #define TUN_DEBUG 1 */ @@ -190,6 +194,9 @@ struct tun_struct { struct list_head disabled; void *security; u32 flow_count; +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + struct venet_stat *vestat; +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ }; static inline u32 tun_hashfn(u32 rxhash) @@ -1241,6 +1248,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, tun->dev->stats.rx_packets++; tun->dev->stats.rx_bytes += len; +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (tun->vestat) { + venet_acct_classify_add_outgoing(tun->vestat, skb); + } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + tun_flow_update(tun, rxhash, tfile); return total_len; } @@ -1344,6 +1357,12 @@ static ssize_t tun_put_user(struct tun_struct *tun, tun->dev->stats.tx_packets++; tun->dev->stats.tx_bytes += len; +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (tun->vestat) { + venet_acct_classify_add_incoming(tun->vestat, skb); 
+ } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + return total; } @@ -1428,6 +1447,14 @@ static void tun_free_netdev(struct net_device *dev) BUG_ON(!(list_empty(&tun->disabled))); tun_flow_uninit(tun); security_tun_dev_free_security(tun->security); + +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (tun->vestat) { + venet_acct_put_stat(tun->vestat); + tun->vestat = NULL; + } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + free_netdev(dev); } @@ -1892,11 +1919,43 @@ unlock: return ret; } +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING +/* setacctid_ioctl should be called under rtnl_lock */ +static long setacctid_ioctl(struct file *file, void __user *argp) +{ + struct tun_file *tfile = file->private_data; + struct tun_acctid info; + struct net_device *dev; + struct tun_struct *tun; + + if (copy_from_user(&info, argp, sizeof(info))) + return -EFAULT; + + dev = __dev_get_by_name(tfile->net, info.ifname); + if (dev == NULL) + return -ENOENT; + + /* This check may be dropped to allow tun devices */ + if (dev->netdev_ops != &tap_netdev_ops) + return -EINVAL; + + tun = netdev_priv(dev); + if (tun->vestat) { + venet_acct_put_stat(tun->vestat); + } + tun->vestat = venet_acct_find_create_stat(info.acctid); + if (tun->vestat == NULL) + return -ENOMEM; + + return 0; +} +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + static long __tun_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg, int ifreq_len) { struct tun_file *tfile = file->private_data; - struct tun_struct *tun; + struct tun_struct *tun = NULL; void __user* argp = (void __user*)arg; struct ifreq ifr; kuid_t owner; @@ -1925,6 +1984,13 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, ret = 0; rtnl_lock(); +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (cmd == TUNSETACCTID) { + ret = setacctid_ioctl(file, argp); + goto unlock; + } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + tun = __tun_get(tfile); if (cmd == TUNSETIFF && !tun) { ifr.ifr_name[IFNAMSIZ-1] = '\0'; diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h index
c80d152..81e791e 100644 --- a/include/uapi/linux/if_tun.h +++ b/include/uapi/linux/if_tun.h @@ -59,6 +59,9 @@ #define TUNSETIFINDEX _IOW('T', 218, unsigned int) #define TUNGETFILTER
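The refcount dance in setacctid_ioctl() — drop the old stat, then find-or-create the one for the requested acctid — can be modeled with a tiny table. Everything below is an illustrative stand-in for venet_stat and its helpers, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct venet_stat { unsigned acctid; int refcnt; };

static struct venet_stat stat_table[4];

/* Find an existing stat for acctid or create one, taking a reference
 * (models venet_acct_find_create_stat); NULL models -ENOMEM. */
static struct venet_stat *find_create_stat(unsigned acctid)
{
    int i;
    struct venet_stat *free_slot = NULL;

    for (i = 0; i < 4; i++) {
        if (stat_table[i].refcnt && stat_table[i].acctid == acctid) {
            stat_table[i].refcnt++;
            return &stat_table[i];
        }
        if (!stat_table[i].refcnt && !free_slot)
            free_slot = &stat_table[i];
    }
    if (!free_slot)
        return NULL;
    free_slot->acctid = acctid;
    free_slot->refcnt = 1;
    return free_slot;
}

static void put_stat(struct venet_stat *s) { s->refcnt--; }

/* Mirrors the replace step in the ioctl: put the old reference
 * (if any), then attach the new stat to the device's slot. */
static int set_acctid(struct venet_stat **slot, unsigned acctid)
{
    struct venet_stat *new_stat;

    if (*slot)
        put_stat(*slot);
    new_stat = find_create_stat(acctid);
    if (!new_stat)
        return -1;
    *slot = new_stat;
    return 0;
}
```

Two devices given the same acctid end up sharing one stat object, which is what lets the ioctl aggregate traffic per container rather than per device.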
[Devel] [PATCH RHEL7 COMMIT] config.OpenVZ: enable TAP accounting
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 63d0e865e0fbcf122786ae211b1746e25e407657 Author: Konstantin KhorenkoDate: Thu Oct 15 18:59:03 2015 +0400 config.OpenVZ: enable TAP accounting Add ve accounting to tun/tap devices. New ioctl should be called to attach/create ve stat to tun/tap. https://jira.sw.ru/browse/PSBM-27713 Note: TUN accounting is not tested for now and disabled in this commit. only TAP accounting is allowed for now. Signed-off-by: Konstantin Khorenko --- configs/kernel-3.10.0-x86_64-debug.config | 1 + configs/kernel-3.10.0-x86_64.config | 1 + 2 files changed, 2 insertions(+) diff --git a/configs/kernel-3.10.0-x86_64-debug.config b/configs/kernel-3.10.0-x86_64-debug.config index 0f91ff9..6dcb566 100644 --- a/configs/kernel-3.10.0-x86_64-debug.config +++ b/configs/kernel-3.10.0-x86_64-debug.config @@ -5378,6 +5378,7 @@ CONFIG_VZ_LIST=m CONFIG_VZ_GENCALLS=y CONFIG_VE_NETDEV=m CONFIG_VE_NETDEV_ACCOUNTING=m +CONFIG_VE_TUNTAP_ACCOUNTING=y CONFIG_VZ_DEV=m CONFIG_VE_IPTABLES=y CONFIG_VZ_WDOG=m diff --git a/configs/kernel-3.10.0-x86_64.config b/configs/kernel-3.10.0-x86_64.config index 3a5b8c0..7cfaaaf 100644 --- a/configs/kernel-3.10.0-x86_64.config +++ b/configs/kernel-3.10.0-x86_64.config @@ -5351,6 +5351,7 @@ CONFIG_VZ_LIST=m CONFIG_VZ_GENCALLS=y CONFIG_VE_NETDEV=m CONFIG_VE_NETDEV_ACCOUNTING=m +CONFIG_VE_TUNTAP_ACCOUNTING=y CONFIG_VZ_DEV=m CONFIG_VE_IPTABLES=y CONFIG_VZ_WDOG=m ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] vtty: Make indices to match pcs6 scheme
On Mon, Oct 05, 2015 at 12:54:26PM +0300, Cyrill Gorcunov wrote: > In pcs6 vttys are mapped into internal kernel representation in > nonobvious way. The /dev/console represent [maj:5,min:1], in > turn /dev/tty[0-...] are defined as [maj:4,min:0...], where > minor is bijective to symbol postfix of the tty. Internally > in the pcs6 kernel any open of /dev/ttyX has been mapping > minor into vtty index as > > |if (minor > 0) > |index = minor - 1 > |else > |index = 0 > > which actually shifts indices and make /dev/tty0 as > an alias to /dev/console inside container. > > Same time vzctl tool passes console number argument > in a decremented way, iow when one is typing > > vzctl console $ctid 1 > > here is 1 is a tty number, the kernel sees is as 0, > opening containers /dev/console. > > When one types "vzctl console $ctid 2" (which implies > to open container's /dev/tty2) the vzctl passes index 1 > and the kernel opens /dev/tty2 because of the if/else index > mapping as show above. > > Lets implement same indices mapping in pcs7 for backward > compatibility (in pcs7 there is a per-VE vtty_map_t structure > which reserve up to MAX_NR_VTTY_CONSOLES ttys to track > and it is simply an array addressed by tty index). > > Same time lets fix a few nits: disable setup of controlling > terminal on /dev/console only, since all ttys can have > controlling sign; make sure we're having @tty_fops for > such terminals. > > https://jira.sw.ru/browse/PSBM-40088 > > Signed-off-by: Cyrill Gorcunov> CC: Vladimir Davydov > CC: Konstantin Khorenko > CC: Igor Sukhih Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/shm: add memfd_create() syscall: lost hunk
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.7 --> commit efd8ed12768cc7bee733d35a2c35393626707143 Author: Konstantin Khorenko Date: Thu Oct 15 19:39:42 2015 +0400 ms/shm: add memfd_create() syscall: lost hunk Fixes 9e421edd0c467fb8d3a230520421a58f55e2a46e This is a lost hunk from ms commit: 9183df25fe7b194563db3fec6dc3202a5855839c ms/shm: add memfd_create() syscall https://jira.sw.ru/browse/PSBM-39834 Signed-off-by: Konstantin Khorenko --- include/uapi/linux/memfd.h | 8 1 file changed, 8 insertions(+) diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h new file mode 100644 index 000..534e364 --- /dev/null +++ b/include/uapi/linux/memfd.h @@ -0,0 +1,8 @@ +#ifndef _UAPI_LINUX_MEMFD_H +#define _UAPI_LINUX_MEMFD_H + +/* flags for memfd_create(2) (unsigned int) */ +#define MFD_CLOEXEC 0x0001U +#define MFD_ALLOW_SEALING 0x0002U + +#endif /* _UAPI_LINUX_MEMFD_H */
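The two flags restored by this hunk are the only bits memfd_create(2) accepts; anything else fails with EINVAL. A userspace sketch of that validation (the checker function is illustrative, the flag values come from the hunk):

```c
#include <assert.h>

#define MFD_CLOEXEC       0x0001U
#define MFD_ALLOW_SEALING 0x0002U

/* Models the flag check memfd_create(2) performs on entry:
 * any unknown bit -> error (EINVAL in the kernel, -1 here). */
static int check_memfd_flags(unsigned int flags)
{
    if (flags & ~(MFD_CLOEXEC | MFD_ALLOW_SEALING))
        return -1;
    return 0;
}
```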
[Devel] [PATCH RHEL7 COMMIT] memcg: add mem_cgroup_get/put helpers
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 28d232fd4095c371daa0980f5fae9642a30780b1 Author: Vladimir DavydovDate: Thu Oct 15 17:52:58 2015 +0400 memcg: add mem_cgroup_get/put helpers Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker 
mode Reviewed-by: Kirill Tkhai = This patch description: Equivalent to css_get/put(mem_cgroup_css(memcg)). Currently, only used by af_packet.c, but will also be used by the following patches. Signed-off-by: Vladimir Davydov --- include/linux/memcontrol.h | 18 ++ net/packet/af_packet.c | 4 ++-- 2 files changed, 20 insertions(+), 2 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ac3f16f..548a82c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -139,6 +139,16 @@ static inline bool mem_cgroup_disabled(void) return false; } +static inline void mem_cgroup_get(struct mem_cgroup *memcg) +{ + css_get(mem_cgroup_css(memcg)); +} + +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(mem_cgroup_css(memcg)); +} + void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags); @@ -321,6 +331,14 @@ static inline bool mem_cgroup_disabled(void) return true; } +static inline void mem_cgroup_get(struct mem_cgroup *memcg) +{ +} + +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) { diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index ee9d56b..0bc235e 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2524,7 +2524,7 @@ static struct cg_proto *packet_sk_charge(void) goto out; out_put_cg: - css_put(mem_cgroup_css(psc->memcg)); + mem_cgroup_put(psc->memcg); out_free_psc: kfree(psc); psc = NULL; @@ -2545,7 +2545,7 @@ static void packet_sk_uncharge(struct cg_proto *cg) if (psc) { memcg_uncharge_kmem(psc->memcg, psc->amt); - css_put(mem_cgroup_css(psc->memcg)); + mem_cgroup_put(psc->memcg); kfree(psc); } } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC] Fix get_exec_env() races
On 10/15/2015 05:21 PM, Kirill Tkhai wrote:
> On 15.10.2015 14:15, Pavel Emelyanov wrote:
>>
>>> @@ -130,6 +131,34 @@ struct ve_struct {
>>>  #endif
>>>  };
>>>
>>> +static inline struct ve_struct *get_exec_env(void)
>>> +{
>>> +	struct ve_struct *ve;
>>> +
>>> +	if (++current->ve_attach_lock_depth > 1)
>>> +		return current->task_ve;
>>> +
>>> +	rcu_read_lock();
>>> +again:
>>> +	ve = current->task_ve;
>>> +	read_lock(&ve->attach_lock);
>>> +	if (unlikely(current->task_ve != ve)) {
>>> +		read_unlock(&ve->attach_lock);
>>> +		goto again;
>>
>> Please, no. 3.10 kernel has task_work-s, ask the task you want to
>> attach to ve to execute the work by moving itself into it and keep
>> this small routine small and simple.
>
> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(),
> so we can't make the attaching task wait until the target completes the
> task work (the target may be executing any code; it may be taking
> cgroup_mutex, for example).

I see.

> Should we give userspace a possibility (an interface) to find out that
> the task has finished changing its ve?

No. What are the places where get_exec_env() is still required?

>>> +	}
>>> +	rcu_read_unlock();
>>> +
>>> +	return ve;
>>> +}
>>> +
>>
> .
___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/oom: don't count on mm-less current process
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 315f2cf7428d49d724775f01c545926c55b39a7e Author: Vladimir DavydovDate: Thu Oct 15 17:47:33 2015 +0400 ms/oom: don't count on mm-less current process Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Tetsuo Handa out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. 
Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: David Rientjes Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit d7a94e7e11badf8404d40b41e008c3131a3cebe3) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai Conflicts: mm/oom_kill.c --- mm/oom_kill.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 57d9f3e..fd9e13d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -643,8 +643,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * If current has a pending SIGKILL or is exiting, then automatically * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. +* +* But don't select if current has already released its mm and cleared +* TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. */ - if (fatal_signal_pending(current) || current->flags & PF_EXITING) { + if (current->mm && + (fatal_signal_pending(current) || current->flags & PF_EXITING)) { set_thread_flag(TIF_MEMDIE); return; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] oom: pass points and overdraft to oom_kill_process
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit c67b670a0cde9ea89926108a26a651b9108e49c7 Author: Vladimir DavydovDate: Thu Oct 15 17:53:02 2015 +0400 oom: pass points and overdraft to oom_kill_process Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect 
berserker mode Reviewed-by: Kirill Tkhai = This patch description: This is required by oom berserker mode, which will be introduced later in this series. Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 3 ++- mm/memcontrol.c | 6 +++--- mm/oom_kill.c | 26 -- 3 files changed, 21 insertions(+), 14 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 9117d1d..6ea83b2 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -85,7 +85,8 @@ static inline bool oom_worse(unsigned long points, unsigned long overdraft, } extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, -unsigned int points, unsigned long totalpages, +unsigned long points, unsigned long overdraft, +unsigned long totalpages, struct mem_cgroup *memcg, nodemask_t *nodemask, const char *message); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c34dee0..14e6aee 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1935,7 +1935,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned long chosen_points = 0; unsigned long totalpages; unsigned long overdraft; - unsigned int points = 0; + unsigned long points = 0; struct task_struct *chosen = NULL; /* @@ -1987,8 +1987,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, if (!chosen) return; - points = chosen_points * 1000 / totalpages; - oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg, + oom_kill_process(chosen, gfp_mask, order, chosen_points, max_overdraft, +totalpages, memcg, NULL, "Memory cgroup out of memory"); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index a437f68..d8a89c0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -353,7 +353,8 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, * * (not docbooked, we don't want this one cluttering up the manual) */ -static struct task_struct *select_bad_process(unsigned int *ppoints, +static struct task_struct *select_bad_process(unsigned long 
*ppoints, + unsigned long *poverdraft, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill) { @@ -389,7 +390,8 @@ static struct task_struct *select_bad_process(unsigned int
[Devel] [PATCH RHEL7 COMMIT] oom: drop OOM_SCAN_ABORT
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 3bb5625c93235a4fd013b4307a1fd9cc9db4e6a8 Author: Vladimir DavydovDate: Thu Oct 15 17:53:01 2015 +0400 oom: drop OOM_SCAN_ABORT Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: It is not used anymore, neither should it be used with the new locking design. Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 1 - mm/memcontrol.c | 6 -- mm/oom_kill.c | 7 +-- 3 files changed, 1 insertion(+), 13 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index f804551..6d4a94f 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -27,7 +27,6 @@ enum oom_constraint { enum oom_scan_t { OOM_SCAN_OK,/* scan thread and find its badness */ OOM_SCAN_CONTINUE, /* do not consider thread for oom kill */ - OOM_SCAN_ABORT, /* abort the iteration and return */ OOM_SCAN_SELECT,/* always select this thread first */ }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 892e5ff..7df0dff 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2001,12 +2001,6 @@ retry: /* fall through */ case OOM_SCAN_CONTINUE: continue; - case OOM_SCAN_ABORT: - cgroup_iter_end(cgroup, ); - mem_cgroup_iter_break(memcg, iter); - if (chosen) - put_task_struct(chosen); - return; case OOM_SCAN_OK: break; }; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 2fab831..914f9f4 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -358,9 +358,6 @@ retry: /* fall through */ case OOM_SCAN_CONTINUE: continue; - case OOM_SCAN_ABORT: - rcu_read_unlock(); - return ERR_PTR(-1UL); case OOM_SCAN_OK: break; }; @@ -887,11 +884,9 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, if (!p) { dump_header(NULL, gfp_mask, order, NULL, mpol_mask); panic("Out of memory and no killable processes...\n"); - } - if (PTR_ERR(p) != -1UL) { + } else oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, nodemask, "Out of memory"); - } } /* ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] oom: rework locking design
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 6376b304e2690ab7e3868b19f4a3eb8f78ee869e Author: Vladimir DavydovDate: Thu Oct 15 17:53:00 2015 +0400 oom: rework locking design Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: Currently, after oom-killing a process, we keep busy waiting for it until it frees some memory and we can fulfil the allocation request that initiated oom. This slows down oom kill rate dramatically, because the oom victim has to compete for cpu time with other (possibly numerous) processes. The latter is unacceptable for the upcoming oom berserker, which triggers if oom kills happen to often. This patch reworks oom locking design as follows. Now only one process is allowed to invoke oom killer in a memcg (root included) and all its descendants, others have to wait for it to finish. Next, once a victim is selected, the executioner will wait for it to die before retrying allocation. Signed-off-by: Vladimir Davydov --- include/linux/memcontrol.h | 9 ++ include/linux/oom.h| 13 ++- mm/memcontrol.c| 123 +++-- mm/oom_kill.c | 263 + mm/page_alloc.c| 6 +- 5 files changed, 255 insertions(+), 159 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 548a82c..5911327 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -29,6 +29,7 @@ struct page_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_context; /* Stats that can be updated by kernel. 
*/ enum mem_cgroup_page_stat_item { @@ -120,6 +121,7 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg); int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); +extern struct oom_context *mem_cgroup_oom_context(struct mem_cgroup *memcg); extern bool mem_cgroup_below_oom_guarantee(struct task_struct *p); extern void mem_cgroup_note_oom_kill(struct mem_cgroup *memcg, struct task_struct *task); @@ -363,6 +365,13 @@ mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, { } +static inline struct oom_context * +mem_cgroup_oom_context(struct mem_cgroup *memcg) +{ + extern struct oom_context oom_ctx; + return &oom_ctx; +} + static inline bool mem_cgroup_below_oom_guarantee(struct task_struct *p) { return false; diff --git a/include/linux/oom.h b/include/linux/oom.h index 486fc6f..e19385d 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -31,6 +15 @@ enum oom_scan_t {
[Devel] [PATCH RHEL7 COMMIT] oom: introduce oom timeout
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 93e0a04b1eb4bcc4b996fe058af0c5a1c65b90c7 Author: Vladimir DavydovDate: Thu Oct 15 17:53:00 2015 +0400 oom: introduce oom timeout Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: Currently, we won't select a new oom victim until the previous one has passed away. This might lead to a deadlock if an allocating task holds a lock needed by the victim to complete. To cope with this problem, this patch introduces an oom timeout, after which a new task will be selected even if the previous victim hasn't died. The timeout is hard-coded and equals 5 seconds. https://jira.sw.ru/browse/PSBM-38581 Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 2 ++ mm/oom_kill.c | 60 ++--- 2 files changed, 54 insertions(+), 8 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index e19385d..f804551 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -34,6 +34,8 @@ enum oom_scan_t { struct oom_context { struct task_struct *owner; struct task_struct *victim; + bool marked; + unsigned long oom_start; wait_queue_head_t waitq; }; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ef7773f6..2fab831 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -45,6 +45,8 @@ int sysctl_oom_dump_tasks; static DEFINE_SPINLOCK(oom_context_lock); +#define OOM_TIMEOUT (5 * HZ) + #ifndef CONFIG_MEMCG struct oom_context oom_ctx = { .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(oom_ctx.waitq), @@ -55,6 +57,8 @@ void init_oom_context(struct oom_context *ctx) { ctx->owner = NULL; ctx->victim = NULL; + ctx->marked = false; + ctx->oom_start = 0; init_waitqueue_head(&ctx->waitq); } @@ -62,6 +66,7 @@ static void __release_oom_context(struct oom_context *ctx) { ctx->owner = NULL; ctx->victim = NULL; + ctx->marked = false; wake_up_all(&ctx->waitq); } @@ -291,11 +296,14 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, /* * This task already has access to memory reserves and is being killed. -* Don't allow any other task to have access to the reserves. +* Try to select another one. +* +* This can only happen if oom_trylock timeout-ed, which most probably +* means that the victim had dead-locked.
*/ if (test_tsk_thread_flag(task, TIF_MEMDIE)) { if (!force_kill) - return OOM_SCAN_ABORT; + return OOM_SCAN_CONTINUE; } if (!task->mm) return OOM_SCAN_CONTINUE; @@ -463,8 +471,10 @@ void mark_oom_victim(struct task_struct *tsk)
[Devel] [PATCH RHEL7 COMMIT] ms/mm: oom_kill: clean up victim marking and exiting interfaces
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 495272394bfe50c2c5925a1ec2ffbebed25b7fea Author: Vladimir DavydovDate: Thu Oct 15 17:47:36 2015 +0400 ms/mm: oom_kill: clean up victim marking and exiting interfaces Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Johannes Weiner Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. 
Signed-off-by: Johannes Weiner Acked-by: David Rientjes Acked-by: Michal Hocko Cc: Tetsuo Handa Cc: Andrea Arcangeli Cc: Dave Chinner Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 16e951966f05da5ccd650104176f6ba289f7fa20) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai Conflicts: include/linux/oom.h mm/memcontrol.c mm/oom_kill.c --- drivers/staging/android/lowmemorykiller.c | 2 +- include/linux/oom.h | 7 --- kernel/exit.c | 2 +- mm/memcontrol.c | 2 +- mm/oom_kill.c | 14 +++--- 5 files changed, 14 insertions(+), 13 deletions(-) diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c index 4dd6a34..433e9a7 100644 --- a/drivers/staging/android/lowmemorykiller.c +++ b/drivers/staging/android/lowmemorykiller.c @@ -164,7 +164,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc) * infrastructure. There is no real reason why the selected * task should have access to the memory reserves. 
*/ - mark_tsk_oom_victim(selected); + mark_oom_victim(selected); send_sig(SIGKILL, selected, 0); rem += selected_tasksize; } diff --git a/include/linux/oom.h b/include/linux/oom.h index 3c37f1e..486fc6f 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -52,9 +52,7 @@ static inline bool oom_task_origin(const struct task_struct *p) /* linux/mm/oom_group.c */ extern int get_task_oom_score_adj(struct task_struct *t); -extern void mark_tsk_oom_victim(struct task_struct *tsk); - -extern void unmark_oom_victim(void); +extern void mark_oom_victim(struct task_struct *tsk); extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); @@ -75,6 +73,9 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); + +extern void exit_oom_victim(void); + extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); diff --git a/kernel/exit.c b/kernel/exit.c index 1b13207..1cc765b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -521,7 +521,7 @@ static void exit_mm(struct task_struct * tsk) mm_update_next_owner(mm); mmput(mm); if (test_thread_flag(TIF_MEMDIE)) - unmark_oom_victim(); + exit_oom_victim(); } /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cf9ca7f..fdd14dd2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1964,7 +1964,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup
[Devel] [PATCH RHEL7 COMMIT] oom: resurrect berserker mode
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit e651315e4475767b41a7e028c6127b25c5754312 Author: Vladimir DavydovDate: Thu Oct 15 17:53:03 2015 +0400 oom: resurrect berserker mode Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: The logic behind the OOM berserker is the same as in PCS6: if processes are killed by oom killer too often (< sysctl vm.oom_relaxation, 1 sec by default), we increase "rage" (min -10, max 20) and kill 1 << "rage" youngest worst processes if "rage" >= 0. https://jira.sw.ru/browse/PSBM-17930 Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 3 ++ kernel/sysctl.c | 7 mm/oom_kill.c | 106 3 files changed, 116 insertions(+) diff --git a/include/linux/oom.h b/include/linux/oom.h index 6ea83b2..acf58fc 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -35,7 +35,9 @@ struct oom_context { struct task_struct *victim; bool marked; unsigned long oom_start; + unsigned long oom_end; unsigned long overdraft; + int rage; wait_queue_head_t waitq; }; @@ -126,4 +128,5 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p); extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; extern int sysctl_panic_on_oom; +extern int sysctl_oom_relaxation; #endif /* _INCLUDE_LINUX_OOM_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 976f48c..9c081e3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1184,6 +1184,13 @@ static struct ctl_table vm_table[] = { .proc_handler = proc_dointvec, }, { + .procname = "oom_relaxation", + .data = _oom_relaxation, + .maxlen = sizeof(sysctl_oom_relaxation), + .mode = 0644, + .proc_handler = proc_dointvec_ms_jiffies, + }, + { .procname = "overcommit_ratio", .data = _overcommit_ratio, .maxlen = sizeof(sysctl_overcommit_ratio), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d8a89c0..6d16154 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -42,13 +42,18 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks; +int sysctl_oom_relaxation = HZ; static DEFINE_SPINLOCK(oom_context_lock); #define OOM_TIMEOUT(5 * HZ) +#define OOM_BASE_RAGE -10 +#define OOM_MAX_RAGE 20 + #ifndef CONFIG_MEMCG struct oom_context oom_ctx 
= { + .rage = OOM_BASE_RAGE, .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(oom_ctx.waitq), }; #endif @@ -59,6 +64,8 @@ void init_oom_context(struct oom_context *ctx)
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 14:15, Pavel Emelyanov wrote:
>
>> @@ -130,6 +131,34 @@ struct ve_struct {
>>  #endif
>>  };
>>
>> +static inline struct ve_struct *get_exec_env(void)
>> +{
>> +	struct ve_struct *ve;
>> +
>> +	if (++current->ve_attach_lock_depth > 1)
>> +		return current->task_ve;
>> +
>> +	rcu_read_lock();
>> +again:
>> +	ve = current->task_ve;
>> +	read_lock(&ve->attach_lock);
>> +	if (unlikely(current->task_ve != ve)) {
>> +		read_unlock(&ve->attach_lock);
>> +		goto again;
>
> Please, no. 3.10 kernel has task_work-s, ask the task you want to
> attach to ve to execute the work by moving itself into it and keep
> this small routine small and simple.

cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(),
so we can't make the attaching task wait until the target completes the
task work (the target may be executing any code; it may be taking
cgroup_mutex, for example).

Should we give userspace a possibility (an interface) to find out that
the task has finished changing its ve?

>> +	}
>> +	rcu_read_unlock();
>> +
>> +	return ve;
>> +}
>> +
>
___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/oom: make sure that TIF_MEMDIE is set under task_lock
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 82d2c87b0e1ecd58487d26f479142a3517cffc44 Author: Vladimir Davydov Date: Thu Oct 15 17:47:34 2015 +0400 ms/oom: make sure that TIF_MEMDIE is set under task_lock Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch description: From: Michal Hocko OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy.
Reported-by: Tetsuo Handa Signed-off-by: Michal Hocko Cc: David Rientjes Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 83363b917a2982dd509a5e2125e905b6873505a3) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai Conflicts: mm/oom_kill.c --- mm/oom_kill.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index fd9e13d..5ac5d96 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -432,11 +432,14 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + task_lock(p); + if (p->mm && p->flags & PF_EXITING) { set_tsk_thread_flag(p, TIF_MEMDIE); + task_unlock(p); put_task_struct(p); return; } + task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(p, gfp_mask, order, memcg, nodemask); @@ -486,6 +489,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, /* mm cannot safely be dereferenced after task_unlock(victim) */ mm = victim->mm; + set_tsk_thread_flag(victim, TIF_MEMDIE); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), @@ -517,7 +521,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } rcu_read_unlock(); - set_tsk_thread_flag(victim, TIF_MEMDIE); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); mem_cgroup_note_oom_kill(memcg, victim); put_task_struct(victim);
Re: [Devel] [RFC] Fix get_exec_env() races
On Thu, Oct 15, 2015 at 05:21:04PM +0300, Kirill Tkhai wrote: > > > On 15.10.2015 14:15, Pavel Emelyanov wrote: > > > >> @@ -130,6 +131,34 @@ struct ve_struct { > >> #endif > >> }; > >> > >> +static inline struct ve_struct *get_exec_env(void) > >> +{ > >> + struct ve_struct *ve; > >> + > >> + if (++current->ve_attach_lock_depth > 1) > >> + return current->task_ve; > >> + > >> + rcu_read_lock(); > >> +again: > >> + ve = current->task_ve; > >> + read_lock(&ve->attach_lock); > >> + if (unlikely(current->task_ve != ve)) { > >> + read_unlock(&ve->attach_lock); > >> + goto again; > > > > Please, no. 3.10 kernel has task_work-s, ask the task you want to > > attach to ve to execute the work by moving itself into it and keep > > this small routine small and simple. > > cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), > so we can't wait attaching task till it complete the task work (it may > execute any code; to be locking cgroup_mutex, for example). Do we really want to wait until exec_env of the target task has changed? Anyway, I think we can use task_work even if we need to be synchronous. You just need two task works - one for the target and one for the caller. The former will change the target task's exec_env while the latter will wait for it to finish. > > Should we give a possibility (an interface) for userspace to get it know, > the task's finished ve changing? > > > >> + } > >> + rcu_read_unlock(); > >> + > >> + return ve; > >> +} > >> + > > >
[Devel] [NEW KERNEL] 3.10.0-229.7.2.vz7.8.8 (rhel7)
Changelog: OpenVZ kernel rh7-3.10.0-229.7.2.vz7.8.8 * lost hunk brought for memfd_create() syscall port Generated changelog: * Thu Oct 15 2015 Konstantin Khorenko [3.10.0-229.7.2.vz7.8.8] - ms/shm: add memfd_create() syscall: lost hunk (Konstantin Khorenko) [PSBM-39834] Built packages: http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/229.7.2.vz7.8.8/
Re: [Devel] [PATCH rh7] ve: Kill ve_list_head and ve_struct::ve_list
On Thu, Sep 24, 2015 at 06:11:26PM +0300, Kirill Tkhai wrote: > Since we use ve_idr layer to reserve an id for a ve, > and since a ve is linked there, using of ve_list_head > just for linking VEs becomes redundant. Nevertheless, iterating over a list is more convenient than over idr IMO. > > This patch replaces ve_list_head in the places, we iterate > thru VEs list, with ve_idr mechanism, and kills the > duplicate manner. AFAICS this patch doesn't improve performance, nor does it make the code more readable IMHO, so personally I would refrain from merging it. Up to Konstantin. Also, see a few comments regarding the implementation below. > > Signed-off-by: Kirill Tkhai > --- ... > @@ -49,10 +49,9 @@ void vzmon_unregister_veaddr_print_cb(ve_seq_print_t); > int venet_init(void); > #endif > > -extern struct list_head ve_list_head; > -#define for_each_ve(ve) list_for_each_entry((ve), &ve_list_head, > ve_list) I wouldn't drop the macro. > extern struct mutex ve_list_lock; There's no ve_list, but there's still ve_list_lock. Confusing. Same for ve_list_add and ve_list_del. > extern struct ve_struct *get_ve_by_id(envid_t); > +extern struct idr ve_idr; > extern struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t > veid); > extern int ve_cgroup_remove(struct cgroup *root, envid_t veid); > ... > @@ -772,26 +768,24 @@ static int vestat_seq_show(struct seq_file *m, void *v) > > void *ve_seq_start(struct seq_file *m, loff_t *pos) > { > - struct ve_struct *curve; > - > - curve = get_exec_env(); > mutex_lock(&ve_list_lock); > - if (!ve_is_super(curve)) { > - if (*pos != 0) > - return NULL; > - return &curve->ve_list; > - } > > - return seq_list_start(&ve_list_head, *pos); > + return ve_seq_next(m, NULL, pos); I don't think it's correct to increment *pos in seq_start. Look at seq_read: if the buffer is too small to hold the first entry, we will jump over it instead of continuing reading it next time seq_read is called.
> } > EXPORT_SYMBOL(ve_seq_start); > > void *ve_seq_next(struct seq_file *m, void *v, loff_t *pos) > { > - if (!ve_is_super(get_exec_env())) > - return NULL; > - else > - return seq_list_next(v, &ve_list_head, pos); > + struct ve_struct *ve = get_exec_env(); > + int id = *pos; > + > + if (!ve_is_super(ve)) AFAICS you forgot to increment *pos here, which might result in the same entry being output multiple times inside a ve. > + return *pos ? NULL : ve; > + > + ve = idr_get_next(&ve_idr, &id); > + *pos = id + 1; > + > + return ve; > } > EXPORT_SYMBOL(ve_seq_next); >
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 17:44, Vladimir Davydov wrote: > On Thu, Oct 15, 2015 at 05:21:04PM +0300, Kirill Tkhai wrote: >> >> >> On 15.10.2015 14:15, Pavel Emelyanov wrote: >>> @@ -130,6 +131,34 @@ struct ve_struct { #endif }; +static inline struct ve_struct *get_exec_env(void) +{ + struct ve_struct *ve; + + if (++current->ve_attach_lock_depth > 1) + return current->task_ve; + + rcu_read_lock(); +again: + ve = current->task_ve; + read_lock(&ve->attach_lock); + if (unlikely(current->task_ve != ve)) { + read_unlock(&ve->attach_lock); + goto again; >>> >>> Please, no. 3.10 kernel has task_work-s, ask the task you want to >>> attach to ve to execute the work by moving itself into it and keep >>> this small routine small and simple. >> >> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), >> so we can't wait attaching task till it complete the task work (it may >> execute any code; to be locking cgroup_mutex, for example). > > Do we really want to wait until exec_env of the target task has changed? > > Anyway, I think we can use task_work even if we need to be synchronous. > You just need two task works - one for the target and one for the > caller. The former will change the target task's exec_env while the > latter will wait for it to finish. Hm. Maybe, it would be a good idea if cgroup_attach_task() could not be called several times at once. Every attach requires a separate task_work, so it needs additional memory. This complicates things. >> >> Should we give a possibility (an interface) for userspace to get it know, >> the task's finished ve changing? >> >> + } + rcu_read_unlock(); + + return ve; +} + >>> >>
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 19:49, Kirill Tkhai wrote: > > > On 15.10.2015 17:44, Vladimir Davydov wrote: >> On Thu, Oct 15, 2015 at 05:21:04PM +0300, Kirill Tkhai wrote: >>> >>> >>> On 15.10.2015 14:15, Pavel Emelyanov wrote: > @@ -130,6 +131,34 @@ struct ve_struct { > #endif > }; > > +static inline struct ve_struct *get_exec_env(void) > +{ > + struct ve_struct *ve; > + > + if (++current->ve_attach_lock_depth > 1) > + return current->task_ve; > + > + rcu_read_lock(); > +again: > + ve = current->task_ve; > + read_lock(&ve->attach_lock); > + if (unlikely(current->task_ve != ve)) { > + read_unlock(&ve->attach_lock); > + goto again; Please, no. 3.10 kernel has task_work-s, ask the task you want to attach to ve to execute the work by moving itself into it and keep this small routine small and simple. >>> >>> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), >>> so we can't wait attaching task till it complete the task work (it may >>> execute any code; to be locking cgroup_mutex, for example). >> >> Do we really want to wait until exec_env of the target task has changed? >> >> Anyway, I think we can use task_work even if we need to be synchronous. >> You just need two task works - one for the target and one for the >> caller. The former will change the target task's exec_env while the >> latter will wait for it to finish. > > Hm. Maybe, it would be a good idea if cgroup_attach_task() could not be > called several times at once. Every attach requires a separate task_work, > so it needs additional memory. This complicates the thing. Though, we may wait on a counter of the number of all processes and threads. Looks OK. >>> >>> Should we give a possibility (an interface) for userspace to get it know, >>> the task's finished ve changing? >>> >>> > + } > + rcu_read_unlock(); > + > + return ve; > +} > + >>>