[Devel] [RFC] Fix get_exec_env() races
Since we allow attaching a non-current task to a ve cgroup, there is a race in every place that uses get_exec_env(): the task's ve may change right after get_exec_env() has been dereferenced, so many problems are possible there. I am sure most call sites of get_exec_env() were not written with the assumption that the ve can change underneath them. Also, there are many nested functions, and it is impossible to audit every one of them to verify that its input parameters, derived from a caller's dereferenced ve, have not become stale because the ve changed.

I suggest modifying get_exec_env() so that it guarantees the ve's stability. It pairs with put_exec_env(), which marks the end of the region where ve modification is undesirable. get_exec_env() may be used nested, so task_struct::ve_attach_lock_depth is introduced to allow nesting.

The counter looks like a better option than a plain read_lock() in get_exec_env() and a write_trylock() loop in ve_attach():

	get_exec_env()
	{
		...
		read_lock();
		...
	}

	ve_attach()
	{
		while (!write_trylock())
			cpu_relax();
	}

because in that case the priority of read_lock() would be absolute, and we would lose all the fairness advantages of queued rwlocks. I also considered variants using RCU and task work, but they seem to be worse.

Please share your comments.
---
 include/linux/init_task.h | 3 ++-
 include/linux/sched.h     | 1 +
 include/linux/ve.h        | 29 +
 include/linux/ve_proto.h  | 1 -
 kernel/fork.c             | 3 +++
 kernel/ve/ve.c            | 8 +++-
 6 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index d2cbad0..57e0796 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -136,7 +136,8 @@ extern struct task_group root_task_group;
 #endif
 
 #ifdef CONFIG_VE
-#define INIT_TASK_VE(tsk)	.task_ve = ,
+#define INIT_TASK_VE(tsk)	.task_ve = ,	\
+	.ve_attach_lock_depth = 0
 #else
 #define INIT_TASK_VE(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1bcabe..948481f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1564,6 +1564,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_VE
 	struct ve_struct *task_ve;
+	unsigned int ve_attach_lock_depth;
 #endif
 #ifdef CONFIG_MEMCG
 	/* memcg uses this to do batch job */
 	struct memcg_batch_info {
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 86b95c3..3cea73d 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -33,6 +33,7 @@ struct ve_monitor;
 struct nsproxy;
 
 struct ve_struct {
+	rwlock_t		attach_lock;
 	struct cgroup_subsys_state	css;
 
 	const char		*ve_name;
@@ -130,6 +131,34 @@ struct ve_struct {
 #endif
 };
 
+static inline struct ve_struct *get_exec_env(void)
+{
+	struct ve_struct *ve;
+
+	if (++current->ve_attach_lock_depth > 1)
+		return current->task_ve;
+
+	rcu_read_lock();
+again:
+	ve = current->task_ve;
+	read_lock(&ve->attach_lock);
+	if (unlikely(current->task_ve != ve)) {
+		read_unlock(&ve->attach_lock);
+		goto again;
+	}
+	rcu_read_unlock();
+
+	return ve;
+}
+
+static inline void put_exec_env(void)
+{
+	struct ve_struct *ve = current->task_ve;
+
+	if (!--current->ve_attach_lock_depth)
+		read_unlock(&ve->attach_lock);
+}
+
 struct ve_devmnt {
 	struct list_head	link;
diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
index 0f5898e..3deb09e 100644
--- a/include/linux/ve_proto.h
+++ b/include/linux/ve_proto.h
@@ -30,7 +30,6 @@ static inline bool ve_is_super(struct ve_struct *ve)
 	return ve == 
 }
 
-#define get_exec_env()	(current->task_ve)
 #define get_env_init(ve)	(ve->ve_ns->pid_ns->child_reaper)
 
 const char *ve_name(struct ve_struct *ve);
diff --git a/kernel/fork.c b/kernel/fork.c
index 505fa21..3d7e452 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1439,6 +1439,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
 #endif
+#ifdef CONFIG_VE
+	p->ve_attach_lock_depth = 0;
+#endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
 	 */
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 39a95e8..23833ed 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -640,6 +640,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup *cg)
 	ve->meminfo_val = VE_MEMINFO_DEFAULT;
 
 do_init:
+	ve->attach_lock = __RW_LOCK_UNLOCKED(&ve->attach_lock);
 	init_rwsem(&ve->op_sem);
 	mutex_init(&ve->sync_mutex);
 	INIT_LIST_HEAD(&ve->devices);
@@ -738,8 +739,11 @@ static int ve_can_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 
 static void ve_attach(struct cgroup *cg, struct cgroup_taskset *tset)
 {
+
Re: [Devel] [PATCH RH7 1/2] device_cgroup: fake allowing all devices for docker inside VZCT
Here is the right link for RH7: https://jira.sw.ru/browse/PSBM-34529
The patch actually is a port from RH6.

On 10/15/2015 01:42 PM, Konstantin Khorenko wrote:
Volodya, please review.
-- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team

On 10/13/2015 06:11 PM, Pavel Tikhomirov wrote:
We need it for docker 1.7.+, please review.

On 10/07/2015 11:51 AM, Pavel Tikhomirov wrote:
Docker from 1.7.0 tries to add "a" to devices.allow for the device_cgroup of a newly created privileged container, and thus to allow all devices in the docker container. Docker fails to do so because not all devices are allowed in the parent VZCT cgroup. To support docker we must allow writing "a" to devices.allow in a CT. With this patch, if we get "a", we silently exit without EPERM.

https://jira.sw.ru/browse/PSBM-38691

v2: fix bug link, fix comment style

Signed-off-by: Pavel Tikhomirov
---
 security/device_cgroup.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 531e40c..9f932d7 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -689,7 +689,14 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup,
 		if (has_children(devcgroup))
 			return -EINVAL;
 
-		if (!may_allow_all(parent))
+		if (!may_allow_all(parent)) {
+			if (ve_is_super(get_exec_env()))
+				return -EPERM;
+			else
+				/* Fooling docker in CT - silently exit */
+				return 0;
+		}
+
 			return -EPERM;
 
 		dev_exception_clean(devcgroup);
 		devcgroup->behavior = DEVCG_DEFAULT_ALLOW;

-- Best regards, Tikhomirov Pavel Software Developer, Odin.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/selftests: add memfd/sealing page-pinning tests
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit 07e7c92c1c0de74828dfd29e39facebf02cdfd63
Author: Andrew Vagin
Date: Thu Oct 15 15:04:19 2015 +0400

ms/selftests: add memfd/sealing page-pinning tests

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: David Herrmann

ML: 87b2d44026e0e315a7401551e95b189ac4b28217

Setting SEAL_WRITE is not possible if there're pending GUP users. This commit adds selftests for memfd+sealing that use FUSE to create pending page-references. FUSE is very helpful here in that it allows us to delay direct-IO operations for an arbitrary amount of time. This way, we can force the kernel to pin pages and then run our normal selftests.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Andrew Vagin
---
 tools/testing/selftests/memfd/.gitignore       | 2 +
 tools/testing/selftests/memfd/Makefile         | 14 +-
 tools/testing/selftests/memfd/fuse_mnt.c       | 110 +
 tools/testing/selftests/memfd/fuse_test.c      | 311 +
 tools/testing/selftests/memfd/run_fuse_test.sh | 14 ++
 5 files changed, 450 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
index bcc8ee2..afe87c4 100644
--- a/tools/testing/selftests/memfd/.gitignore
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -1,2 +1,4 @@
+fuse_mnt
+fuse_test
 memfd_test
 memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
index 36653b9..6816c49 100644
--- a/tools/testing/selftests/memfd/Makefile
+++ b/tools/testing/selftests/memfd/Makefile
@@ -7,6 +7,7 @@ ifeq ($(ARCH),x86_64)
 	ARCH := X86
 endif
 
+CFLAGS += -D_FILE_OFFSET_BITS=64
 CFLAGS += -I../../../../arch/x86/include/generated/uapi/
 CFLAGS += -I../../../../arch/x86/include/uapi/
 CFLAGS += -I../../../../include/uapi/
@@ -25,5 +26,16 @@ ifeq ($(ARCH),X86)
 endif
 	@./memfd_test || echo "memfd_test: [FAIL]"
 
+build_fuse:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) fuse_mnt.c `pkg-config fuse --cflags --libs` -o fuse_mnt
+	gcc $(CFLAGS) fuse_test.c -o fuse_test
+else
+	echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_fuse: build_fuse
+	@./run_fuse_test.sh || echo "fuse_test: [FAIL]"
+
 clean:
-	$(RM) memfd_test
+	$(RM) memfd_test fuse_test
diff --git a/tools/testing/selftests/memfd/fuse_mnt.c b/tools/testing/selftests/memfd/fuse_mnt.c
new file mode 100644
index 000..feacf12
--- /dev/null
+++ b/tools/testing/selftests/memfd/fuse_mnt.c
@@ -0,0 +1,110 @@
+/*
+ * memfd test file-system
+ * This file uses FUSE to create a dummy file-system with only one file /memfd.
+ * This file is read-only and takes 1s per read.
+ *
+ * This file-system is used by the memfd test-cases to force the kernel to pin
+ * pages during reads(). Due to the 1s delay of this file-system, this is a
+ * nice way to test race-conditions against get_user_pages() in the kernel.
+ *
+ * We use direct_io==1 to force the kernel to use direct-IO for this
+ * file-system.
+ */
+
+#define FUSE_USE_VERSION 26
+
+#include <fuse.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char memfd_content[] = "memfd-example-content";
+static const char memfd_path[] = "/memfd";
+
+static int memfd_getattr(const char *path, struct stat *st)
+{
+	memset(st, 0, sizeof(*st));
+
+	if (!strcmp(path, "/")) {
+		st->st_mode = S_IFDIR | 0755;
+		st->st_nlink = 2;
+	} else if (!strcmp(path, memfd_path)) {
+		st->st_mode = S_IFREG | 0444;
+		st->st_nlink = 1;
+		st->st_size = strlen(memfd_content);
+	} else {
+		return -ENOENT;
+	}
+
+	return 0;
+}
+
+static int memfd_readdir(const char *path,
+			 void *buf,
+			 fuse_fill_dir_t filler,
+			 off_t offset,
+			 struct fuse_file_info *fi)
+{
+	if (strcmp(path, "/"))
+		return -ENOENT;
+
+	filler(buf, ".", NULL, 0);
+	filler(buf, "..", NULL, 0);
+	filler(buf, memfd_path + 1, NULL, 0);
+
+	return 0;
+}
+
+static int memfd_open(const char *path, struct fuse_file_info *fi)
+{
+	if (strcmp(path, memfd_path))
+
[Devel] [PATCH RHEL7 COMMIT] ms/shm: wait for pins to be released when sealing
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit c49f7bf80c6e70a3992c13f6b7f7a60b44c81dce
Author: Andrew Vagin
Date: Thu Oct 15 15:04:19 2015 +0400

ms/shm: wait for pins to be released when sealing

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: David Herrmann

ML: 05f65b5c70909ef686f865f0a85406d74d75f70f

If we set SEAL_WRITE on a file, we must make sure there cannot be any ongoing write-operations on the file. For write() calls, we simply lock the inode mutex, for mmap() we simply verify there're no writable mappings.

However, there might be pages pinned by AIO, Direct-IO and similar operations via GUP. We must make sure those do not write to the memfd file after we set SEAL_WRITE.

As there is no way to notify GUP users to drop pages or to wait for them to be done, we implement the wait ourself: When setting SEAL_WRITE, we check all pages for their ref-count. If it's bigger than 1, we know there's some user of the page. We then mark the page and wait for up to 150ms for those ref-counts to be dropped. If the ref-counts are not dropped in time, we refuse the seal operation.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Andrew Vagin
---
 mm/shmem.c | 110 -
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index bc8e08b..fd563aa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,9 +1903,117 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+/*
+ * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
+ * so reuse a tag which we firmly believe is never set or cleared on shmem.
+ */
+#define SHMEM_TAG_PINNED	PAGECACHE_TAG_TOWRITE
+#define LAST_SCAN		4	/* about 150ms max */
+
+static void shmem_tag_pins(struct address_space *mapping)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+	pgoff_t start;
+	struct page *page;
+
+	lru_add_drain();
+	start = 0;
+	rcu_read_lock();
+
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		page = radix_tree_deref_slot(slot);
+		if (!page || radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+		} else if (page_count(page) - page_mapcount(page) > 1) {
+			spin_lock_irq(&mapping->tree_lock);
+			radix_tree_tag_set(&mapping->page_tree, iter.index,
+					   SHMEM_TAG_PINNED);
+			spin_unlock_irq(&mapping->tree_lock);
+		}
+
+		if (need_resched()) {
+			cond_resched_rcu();
+			start = iter.index + 1;
+			goto restart;
+		}
+	}
+	rcu_read_unlock();
+}
+
+/*
+ * Setting SEAL_WRITE requires us to verify there's no pending writer. However,
+ * via get_user_pages(), drivers might have some pending I/O without any active
+ * user-space mappings (eg., direct-IO, AIO). Therefore, we look at all pages
+ * and see whether it has an elevated ref-count. If so, we tag them and wait for
+ * them to be dropped.
+ * The caller must guarantee that no new user will acquire writable references
+ * to those pages to avoid races.
+ */
 static int shmem_wait_for_pins(struct address_space *mapping)
 {
-	return 0;
+	struct radix_tree_iter iter;
+	void **slot;
+	pgoff_t start;
+	struct page *page;
+	int error, scan;
+
+	shmem_tag_pins(mapping);
+
+	error = 0;
+	for (scan = 0; scan <= LAST_SCAN; scan++) {
+		if (!radix_tree_tagged(&mapping->page_tree, SHMEM_TAG_PINNED))
+			break;
+
+		if (!scan)
+			lru_add_drain_all();
+		else if (schedule_timeout_killable((HZ << scan) / 200))
+			scan = LAST_SCAN;
+
+		start = 0;
+		rcu_read_lock();
+restart:
+		radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
+					   start, SHMEM_TAG_PINNED) {
+
+			page = radix_tree_deref_slot(slot);
+
[Devel] [PATCH RHEL7 COMMIT] ms/sched: add cond_resched_rcu() helper
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit f5375ae5711c334bb1305639dc08a45898a32f19
Author: Andrew Vagin
Date: Thu Oct 15 15:04:15 2015 +0400

ms/sched: add cond_resched_rcu() helper

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Simon Horman

ML: f6f3c437d09e2f62533034e67bfb4385191e992c

This is intended for use in loops which read data protected by RCU and may have a large number of iterations. Such an example is dumping the list of connections known to IPVS: ip_vs_conn_array() and ip_vs_conn_seq_next().

The benefits are for CONFIG_PREEMPT_RCU=y where we save CPU cycles by moving rcu_read_lock and rcu_read_unlock out of large loops but still allowing the current task to be preempted after every loop iteration for the CONFIG_PREEMPT_RCU=n case.

The call to cond_resched() is not needed when CONFIG_PREEMPT_RCU=y. Thanks to Paul E. McKenney for explaining this and for the final version that checks the context with CONFIG_DEBUG_ATOMIC_SLEEP=y for all possible configurations.

The function can be empty in the CONFIG_PREEMPT_RCU case; rcu_read_lock and rcu_read_unlock are not needed in this case because the task can be preempted on indication from the scheduler. Thanks to Peter Zijlstra for catching this and for his help in trying a solution that changes __might_sleep.

Initial cond_resched_rcu_lock() function suggested by Eric Dumazet.

Tested-by: Julian Anastasov
Signed-off-by: Julian Anastasov
Signed-off-by: Simon Horman
Acked-by: Peter Zijlstra
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: Andrew Vagin
---
 include/linux/sched.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1bcabe..4560071 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2717,6 +2717,15 @@ extern int __cond_resched_softirq(void);
 	__cond_resched_softirq();	\
 })
 
+static inline void cond_resched_rcu(void)
+{
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
+	rcu_read_unlock();
+	cond_resched();
+	rcu_read_lock();
+#endif
+}
+
 /*
  * Does a critical section need to be broken due to another
  * task waiting?: (technically does not depend on CONFIG_PREEMPT,
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit 19e5b6f0c09fa1a46634605cec0c212a106044ee
Author: Andrew Vagin
Date: Thu Oct 15 15:04:13 2015 +0400

ms/prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Cyrill Gorcunov

ML: f606b77f1a9e362451aca8f81d8f36a3a112139e

During development of c/r we've noticed that if we need to support user namespaces we face a problem with capabilities in the prctl(PR_SET_MM, ...) call; in particular, once a new user namespace is created, capable(CAP_SYS_RESOURCE) no longer passes.

An approach is to eliminate the CAP_SYS_RESOURCE check but pass all new values in one bundle, which would allow the kernel to make a more intensive sanity test of the values and at the same time allow us to support checkpoint/restore of user namespaces.

Thus a new command PR_SET_MM_MAP is introduced. It takes a pointer to a prctl_mm_map structure which carries all the members to be updated.

	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

	struct prctl_mm_map {
		__u64	start_code;
		__u64	end_code;
		__u64	start_data;
		__u64	end_data;
		__u64	start_brk;
		__u64	brk;
		__u64	start_stack;
		__u64	arg_start;
		__u64	arg_end;
		__u64	env_start;
		__u64	env_end;
		__u64	*auxv;
		__u32	auxv_size;
		__u32	exe_fd;
	};

All members except @exe_fd correspond to members of struct mm_struct. To figure out which values these members may take, here are their meanings:

- start_code, end_code: represent bounds of the executable code area
- start_data, end_data: represent bounds of the data area
- start_brk, brk: used to calculate bounds for the brk() syscall
- start_stack: used when accounting space needed for command line arguments, environment and the shmat() syscall
- arg_start, arg_end, env_start, env_end: represent the memory area supplied for command line arguments and environment variables
- auxv, auxv_size: carry the auxiliary vector, Elf format specifics
- exe_fd: file descriptor number for the executable link (/proc/self/exe)

Thus we apply the following requirements to the values:

1) Any member except @auxv, @auxv_size, @exe_fd is an address in user space, thus it must lie inside the [mmap_min_addr, mmap_max_addr) interval.

2) While @[start|end]_code and @[start|end]_data may point to nonexistent VMAs (say a program maps its own new .text and .data segments during execution), the rest of the members should belong to a VMA which must exist.

3) Addresses must be ordered, i.e. a @start_ member must not be greater than or equal to the corresponding @end_ member.

4) As in the regular Elf loading procedure, we require that @start_brk and @brk be greater than @end_data.

5) If the RLIMIT_DATA rlimit is set to non-infinity, new values should not exceed the existing limit. The same applies to RLIMIT_STACK.

6) The auxiliary vector size must not exceed the existing one (which is predefined as AT_VECTOR_SIZE and depends on the architecture).

7) The file descriptor passed in @exe_fd should point to an executable file (because we use the existing prctl_set_mm_exe_file_locked helper, it ensures that the file we are going to use as the exe link has all required permissions granted).

Now about where these members are involved inside kernel code:

- @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

- @start_data and @end_data are used in /proc/$pid/[stat|statm] output; also they are considered when checking whether there is enough space for a brk() syscall result if RLIMIT_DATA is set;

- @start_brk is shown in /proc/$pid/stat output and accounted in the brk() syscall if RLIMIT_DATA is set; also this member is tested to find the symbolic name of an mmap event for the perf system (we choose whether the event is generated for the "heap" area); one more application is selinux -- we test if a process has the PROCESS__EXECHEAP permission when trying to make the heap area executable with the mprotect() syscall;

- @brk is the current value for the brk() syscall, which lies inside the heap area; it's shown in /proc/$pid/stat. When the brk() syscall successfully provides new memory to user space, upon completion mm::brk is updated to carry the new value;

Both @start_brk and @brk are actively used in /proc/$pid/maps
[Devel] [PATCH RHEL7 COMMIT] ms/shm: add memfd_create() syscall
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit 9e421edd0c467fb8d3a230520421a58f55e2a46e
Author: Andrew Vagin
Date: Thu Oct 15 15:04:18 2015 +0400

ms/shm: add memfd_create() syscall

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

ML: 9183df25fe7b194563db3fec6dc3202a5855839c

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor that you can pass to mmap(). It can support sealing and avoids any connection to user-visible mount-points. Thus, it's not subject to quotas on mounted file-systems, but can be used like malloc()'ed memory, but with a file-descriptor to it.

memfd_create() returns the raw shmem file, so calls like ftruncate() can be used to modify the underlying inode. Also calls like fstat() will return proper information and mark the file as regular file. If you want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not supported (like on all other regular files).

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not subject to a filesystem size limit. It is still properly accounted to memcg limits, though, and to the same overcommit or no-overcommit accounting as all user memory.

Signed-off-by: David Herrmann
Acked-by: Hugh Dickins
Cc: Michael Kerrisk
Cc: Ryan Lortie
Cc: Lennart Poettering
Cc: Daniel Mack
Cc: Andy Lutomirski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Conflicts:
	arch/x86/syscalls/syscall_32.tbl
	arch/x86/syscalls/syscall_64.tbl

Signed-off-by: Andrew Vagin
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h         | 1 +
 kernel/sys_ni.c                  | 1 +
 mm/shmem.c                       | 73 
 5 files changed, 77 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 5d1de5d..4d0e1b4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,6 +357,7 @@
 348	i386	process_vm_writev	sys_process_vm_writev	compat_sys_process_vm_writev
 349	i386	kcmp		sys_kcmp
 350	i386	finit_module	sys_finit_module
+356	i386	memfd_create	sys_memfd_create
 500	i386	fairsched_mknod	sys_fairsched_mknod
 501	i386	fairsched_rmnod	sys_fairsched_rmnod
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 3ed05b4..2415f42 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -321,6 +321,7 @@
 312	common	kcmp		sys_kcmp
 313	common	finit_module	sys_finit_module
 316	common	renameat2	sys_renameat2
+319	common	memfd_create	sys_memfd_create
 320	common	kexec_file_load	sys_kexec_file_load
 497	64	fairsched_nodemask	sys_fairsched_nodemask
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c89c938..2c2e396 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -786,6 +786,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7c98d8f..75a69b0 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -194,6 +194,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 3964468..bc8e08b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
@@ -2854,6 +2856,77 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	shmem_show_mpol(seq,
[Devel] [PATCH RHEL7 COMMIT] ms/prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit bf86a407af9fa74c7251b62c539f36c141fc4f77
Author: Andrew Vagin
Date: Thu Oct 15 15:04:13 2015 +0400

ms/prctl: PR_SET_MM -- factor out mmap_sem when updating mm::exe_file

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Cyrill Gorcunov

ML: 71fe97e185040c5dac3216cd54e186dfa534efa0

Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file(), move it out and rename the helper to prctl_set_mm_exe_file_locked(). This will allow reusing this function in a later patch.

Signed-off-by: Cyrill Gorcunov
Cc: Kees Cook
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Eric W. Biederman
Cc: H. Peter Anvin
Acked-by: Serge Hallyn
Cc: Pavel Emelyanov
Cc: Vasiliy Kulikov
Cc: KAMEZAWA Hiroyuki
Cc: Michael Kerrisk
Cc: Julien Tinnes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Andrew Vagin
---
 kernel/sys.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index a2d5644..cf580a7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2036,12 +2036,14 @@ SYSCALL_DEFINE1(umask, int, mask)
 	return mask;
 }
 
-static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
+static int prctl_set_mm_exe_file_locked(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
 	struct inode *inode;
 	int err;
 
+	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
 	exe = fdget(fd);
 	if (!exe.file)
 		return -EBADF;
@@ -2062,8 +2064,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	if (err)
 		goto exit;
 
-	down_write(&mm->mmap_sem);
-
 	/*
 	 * Forbid mm->exe_file change if old file still mapped.
 	 */
@@ -2075,7 +2075,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 		if (vma->vm_file &&
 		    path_equal(&vma->vm_file->f_path,
			       &mm->exe_file->f_path))
-			goto exit_unlock;
+			goto exit;
 	}
 
 	/*
@@ -2086,13 +2086,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	 */
 	err = -EPERM;
 	if (test_and_set_bit(MMF_EXE_FILE_CHANGED, &mm->flags))
-		goto exit_unlock;
+		goto exit;
 
 	err = 0;
 	set_mm_exe_file(mm, exe.file);	/* this grabs a reference to exe.file */
-exit_unlock:
-	up_write(&mm->mmap_sem);
-
 exit:
 	fdput(exe);
 	return err;
@@ -2112,8 +2109,12 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	if (!capable(CAP_SYS_RESOURCE))
 		return -EPERM;
 
-	if (opt == PR_SET_MM_EXE_FILE)
-		return prctl_set_mm_exe_file(mm, (unsigned int)addr);
+	if (opt == PR_SET_MM_EXE_FILE) {
+		down_write(&mm->mmap_sem);
+		error = prctl_set_mm_exe_file_locked(mm, (unsigned int)addr);
+		up_write(&mm->mmap_sem);
+		return error;
+	}
 
 	if (addr >= TASK_SIZE || addr < mmap_min_addr)
 		return -EINVAL;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/aio: Make it possible to remap aio ring
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6

--> commit a3ffce64acc927dd35825252566389966520dc94
Author: Andrew Vagin
Date: Thu Oct 15 15:04:14 2015 +0400

ms/aio: Make it possible to remap aio ring

The patch is required for CRIU.
https://jira.sw.ru/browse/PSBM-39834

From: Pavel Emelyanov

ML: e4a0d3e720e7e508749c1439b5ba3aff56c92976

There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values.

Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped.

What do you think?

Signed-off-by: Pavel Emelyanov
Acked-by: Dmitry Monakhov
Signed-off-by: Benjamin LaHaise
Signed-off-by: Andrew Vagin
---
 fs/aio.c           | 20 
 include/linux/fs.h | 1 +
 mm/mremap.c        | 3 ++-
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 9d700b0..301da77 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -257,12 +257,32 @@ static void aio_free_ring(struct kioctx *ctx)
 
 static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	vma->vm_flags |= VM_DONTEXPAND;
 	vma->vm_ops = &generic_file_vm_ops;
 	return 0;
 }
 
+static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct kioctx *ctx;
+
+	spin_lock(&mm->ioctx_lock);
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
+		if (ctx && ctx->aio_ring_file == file) {
+			ctx->user_id = ctx->mmap_base = vma->vm_start;
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+	spin_unlock(&mm->ioctx_lock);
+}
+
 static const struct file_operations aio_ring_fops = {
 	.mmap = aio_ring_mmap,
+	.mremap = aio_ring_remap,
 };
 
 #if IS_ENABLED(CONFIG_MIGRATION)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e7bd3f..bbbf186 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1734,6 +1734,7 @@ struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	void (*mremap)(struct file *, struct vm_area_struct *);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/mm/mremap.c b/mm/mremap.c
index e1db886..0b40af6 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -293,7 +293,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		old_len = new_len;
 		old_addr = new_addr;
 		new_addr = -ENOMEM;
-	}
+	} else if (vma->vm_file && vma->vm_file->f_op->mremap)
+		vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
 
 	/* Conceal VM_ACCOUNT so old
[Devel] [PATCH RHEL7 COMMIT] ms/make default ->i_fop have ->open() fail with ENXIO
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit ab6784bb6f5bca77caef0e23d07e0b86dd178557 Author: Andrew VaginDate: Thu Oct 15 15:04:20 2015 +0400 ms/make default ->i_fop have ->open() fail with ENXIO The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 ML: bd9b51e79cb0b8bc00a7e0076a4a8963ca4a797c As it is, default ->i_fop has NULL ->open() (along with all other methods). The only case where it matters is reopening (via procfs symlink) a file that didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned to something sane (default would fail on read/write/ioctl/etc.). Unfortunately, such case exists - alloc_file() users, especially anon_get_file() ones. There we have tons of opened files of very different kinds sharing the same inode. As the result, attempt to reopen those via procfs succeeds and you get a descriptor you can't do anything with. Moreover, in case of sockets we set ->i_fop that will only be used on such reopen attempts - and put a failing ->open() into it to make sure those do not succeed. It would be simpler to put such ->open() into default ->i_fop and leave it unchanged both for anon inode (as we do anyway) and for socket ones. Result: * everything going through do_dentry_open() works as it used to * sock_no_open() kludge is gone * attempts to reopen anon-inode files fail as they really ought to * ditto for aio_private_file() * ditto for perfmon - this one actually tried to imitate sock_no_open() trick, but failed to set ->i_fop, so in the current tree reopens succeed and yield completely useless descriptor. Intent clearly had been to fail with -ENXIO on such reopens; now it actually does. 
* everything else that used alloc_file() keeps working - it has ->i_fop set for its inodes anyway Signed-off-by: Al Viro Signed-off-by: Andrew Vagin --- arch/ia64/kernel/perfmon.c | 10 -- fs/inode.c | 12 +--- include/linux/fs.h | 1 - net/Makefile | 2 -- net/nonet.c| 26 -- net/socket.c | 19 --- 6 files changed, 9 insertions(+), 61 deletions(-) diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c index 9ea25fc..4334a96 100644 --- a/arch/ia64/kernel/perfmon.c +++ b/arch/ia64/kernel/perfmon.c @@ -2145,22 +2145,12 @@ doit: return 0; } -static int -pfm_no_open(struct inode *irrelevant, struct file *dontcare) -{ - DPRINT(("pfm_no_open called\n")); - return -ENXIO; -} - - - static const struct file_operations pfm_file_ops = { .llseek = no_llseek, .read = pfm_read, .write = pfm_write, .poll = pfm_poll, .unlocked_ioctl = pfm_ioctl, - .open = pfm_no_open, /* special open code to disallow open via /proc */ .fasync = pfm_fasync, .release = pfm_close, .flush = pfm_flush diff --git a/fs/inode.c b/fs/inode.c index 960cd15..6c27178 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -121,6 +121,11 @@ int proc_nr_inodes(ctl_table *table, int write, } #endif +static int no_open(struct inode *inode, struct file *file) +{ + return -ENXIO; +} + /** * inode_init_always - perform inode structure intialisation * @sb: superblock inode belongs to @@ -131,7 +136,8 @@ int proc_nr_inodes(ctl_table *table, int write, */ int inode_init_always(struct super_block *sb, struct inode *inode) { static const struct inode_operations empty_iops; - static const struct file_operations empty_fops; + static const struct file_operations no_open_fops = {.open = no_open}; struct address_space *const mapping = &inode->i_data; inode->i_sb = sb; @@ -139,7 +145,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) inode->i_flags = 0; atomic_set(&inode->i_count, 1); inode->i_op = &empty_iops; - inode->i_fop = &empty_fops; + inode->i_fop = &no_open_fops; inode->__i_nlink = 1; inode->i_opflags = 0; i_uid_write(inode, 0); 
@@ -1900,7 +1906,7 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev) } else if (S_ISFIFO(mode)) inode->i_fop = &pipefifo_fops; else if (S_ISSOCK(mode)) - inode->i_fop = &bad_sock_fops; + ; /* leave it no_open_fops */ else printk(KERN_DEBUG "init_special_inode: bogus i_mode (%o) for" " inode %s:%lu\n", mode, inode->i_sb->s_id, diff --git a/include/linux/fs.h
[Devel] [PATCH RHEL7 COMMIT] ms/mm: mmap_region: kill correct_wcount/inode, use allow_write_access()
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 55c695be4110abbbc0c16bd2f6d55de27ac03b90 Author: Andrew VaginDate: Thu Oct 15 15:04:16 2015 +0400 ms/mm: mmap_region: kill correct_wcount/inode, use allow_write_access() The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 MM: e86867720e617774b560dfbc169b7f3d0d490950 correct_wcount and inode in mmap_region() just complicate the code. This boolean was needed previously, when deny_write_access() was called before vma_merge(), now we can simply check VM_DENYWRITE and do allow_write_access() if it is set. allow_write_access() checks file != NULL, so this is safe even if it was possible to use VM_DENYWRITE && !file. Just we need to ensure we use the same file which was deny_write_access()'ed, so the patch also moves "file = vma->vm_file" down after allow_write_access(). Signed-off-by: Oleg Nesterov Cc: Hugh Dickins Cc: Al Viro Cc: Colin Cross Cc: David Rientjes Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- mm/mmap.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 826cf37..f87a78b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1494,11 +1494,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr, { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; - int correct_wcount = 0; int error; struct rb_node **rb_link, *rb_parent; unsigned long charged = 0; - struct inode *inode = file ? file_inode(file) : NULL; unsigned long ub_charged = 0; /* Check against address space limit. 
*/ @@ -1576,7 +1574,6 @@ munmap_back: error = deny_write_access(file); if (error) goto free_vma; - correct_wcount = 1; } vma->vm_file = get_file(file); error = file->f_op->mmap(file, vma); @@ -1631,11 +1628,10 @@ munmap_back: } vma_link(mm, vma, prev, rb_link, rb_parent); - file = vma->vm_file; - /* Once vma denies write, undo our temporary denial count */ - if (correct_wcount) - atomic_inc(&inode->i_writecount); + if (vm_flags & VM_DENYWRITE) + allow_write_access(file); + file = vma->vm_file; out: perf_event_mmap(vma); @@ -1663,8 +1659,8 @@ out: return addr; unmap_and_free_vma: - if (correct_wcount) - atomic_inc(&inode->i_writecount); + if (vm_flags & VM_DENYWRITE) + allow_write_access(file); vma->vm_file = NULL; fput(file); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/selftests: add memfd_create() + sealing tests
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 2d68e19bd9105a3fe7006d8252ee516f97a9ade8 Author: Andrew VaginDate: Thu Oct 15 15:04:18 2015 +0400 ms/selftests: add memfd_create() + sealing tests The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 From: David Herrmann ML: 4f5ce5e8d7e2da3c714df8a7fa42edb9f992fc52 Some basic tests to verify sealing on memfds works as expected and guarantees the advertised semantics. Signed-off-by: David Herrmann Acked-by: Hugh Dickins Cc: Michael Kerrisk Cc: Ryan Lortie Cc: Lennart Poettering Cc: Daniel Mack Cc: Andy Lutomirski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/memfd/.gitignore | 2 + tools/testing/selftests/memfd/Makefile | 29 + tools/testing/selftests/memfd/memfd_test.c | 913 + 4 files changed, 945 insertions(+) diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index c7fd8ac..ab4015b 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -2,6 +2,7 @@ TARGETS = breakpoints TARGETS += cpu-hotplug TARGETS += efivarfs TARGETS += kcmp +TARGETS += memfd TARGETS += memory-hotplug TARGETS += mqueue TARGETS += net diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore new file mode 100644 index 000..bcc8ee2 --- /dev/null +++ b/tools/testing/selftests/memfd/.gitignore @@ -0,0 +1,2 @@ +memfd_test +memfd-test-file diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile new file mode 100644 index 000..36653b9 --- /dev/null +++ b/tools/testing/selftests/memfd/Makefile @@ -0,0 +1,29 @@ +uname_M := $(shell uname -m 2>/dev/null || echo not) +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/) +ifeq ($(ARCH),i386) + ARCH := X86 +endif +ifeq 
($(ARCH),x86_64) + ARCH := X86 +endif + +CFLAGS += -I../../../../arch/x86/include/generated/uapi/ +CFLAGS += -I../../../../arch/x86/include/uapi/ +CFLAGS += -I../../../../include/uapi/ +CFLAGS += -I../../../../include/ + +all: +ifeq ($(ARCH),X86) + gcc $(CFLAGS) memfd_test.c -o memfd_test +else + echo "Not an x86 target, can't build memfd selftest" +endif + +run_tests: all +ifeq ($(ARCH),X86) + gcc $(CFLAGS) memfd_test.c -o memfd_test +endif + @./memfd_test || echo "memfd_test: [FAIL]" + +clean: + $(RM) memfd_test diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c new file mode 100644 index 000..3634c90 --- /dev/null +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -0,0 +1,913 @@ +#define _GNU_SOURCE +#define __EXPORTED_HEADERS__ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define MFD_DEF_SIZE 8192 +#define STACK_SIZE 65535 + +static int sys_memfd_create(const char *name, + unsigned int flags) +{ + return syscall(__NR_memfd_create, name, flags); +} + +static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags) +{ + int r, fd; + + fd = sys_memfd_create(name, flags); + if (fd < 0) { + printf("memfd_create(\"%s\", %u) failed: %m\n", + name, flags); + abort(); + } + + r = ftruncate(fd, sz); + if (r < 0) { + printf("ftruncate(%llu) failed: %m\n", (unsigned long long)sz); + abort(); + } + + return fd; +} + +static void mfd_fail_new(const char *name, unsigned int flags) +{ + int r; + + r = sys_memfd_create(name, flags); + if (r >= 0) { + printf("memfd_create(\"%s\", %u) succeeded, but failure expected\n", + name, flags); + close(r); + abort(); + } +} + +static __u64 mfd_assert_get_seals(int fd) +{ + long r; + + r = fcntl(fd, F_GET_SEALS); + if (r < 0) { + printf("GET_SEALS(%d) failed: %m\n", fd); + abort(); + } + + return r; +} + +static void mfd_assert_has_seals(int fd, __u64 seals) +{ + __u64 
s; + + s = mfd_assert_get_seals(fd); + if (s != seals) { + printf("%llu != %llu = GET_SEALS(%d)\n", + (unsigned long long)seals, (unsigned
[Devel] [PATCH RHEL7 COMMIT] ms/mm: allow drivers to prevent new writable mappings
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit a60b63122e58834a4fd04b9be05311dd67801a07 Author: Andrew VaginDate: Thu Oct 15 15:04:16 2015 +0400 ms/mm: allow drivers to prevent new writable mappings The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 From: David Herrmann ML: 4bb5f5d9395bc112d93a134d8f5b05611eddc9c0 This patch (of 6): The i_mmap_writable field counts existing writable mappings of an address_space. To allow drivers to prevent new writable mappings, make this counter signed and prevent new writable mappings if it is negative. This is modelled after i_writecount and DENYWRITE. This will be required by the shmem-sealing infrastructure to prevent any new writable mappings after the WRITE seal has been set. In case there exists a writable mapping, this operation will fail with EBUSY. Note that we rely on the fact that iff you already own a writable mapping, you can increase the counter without using the helpers. This is the same that we do for i_writecount. 
Signed-off-by: David Herrmann Acked-by: Hugh Dickins Cc: Michael Kerrisk Cc: Ryan Lortie Cc: Lennart Poettering Cc: Daniel Mack Cc: Andy Lutomirski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- fs/inode.c | 1 + include/linux/fs.h | 29 +++-- kernel/fork.c | 2 +- mm/mmap.c | 30 -- mm/swap_state.c| 1 + 5 files changed, 54 insertions(+), 9 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 8c14103..960cd15 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -171,6 +171,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode) mapping->a_ops = _aops; mapping->host = inode; mapping->flags = 0; + atomic_set(>i_mmap_writable, 0); mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); mapping->private_data = NULL; mapping->backing_dev_info = _backing_dev_info; diff --git a/include/linux/fs.h b/include/linux/fs.h index bbbf186..f410c54 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -547,7 +547,7 @@ struct address_space { struct inode*host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ spinlock_t tree_lock; /* and lock protecting it */ - unsigned inti_mmap_writable;/* count VM_SHARED mappings */ + atomic_ti_mmap_writable;/* count VM_SHARED mappings */ struct rb_root i_mmap; /* tree of private and shared mappings */ struct list_headi_mmap_nonlinear;/*list VM_NONLINEAR mappings */ struct mutexi_mmap_mutex; /* protect tree, count, list */ @@ -633,10 +633,35 @@ static inline int mapping_mapped(struct address_space *mapping) * Note that i_mmap_writable counts all VM_SHARED vmas: do_mmap_pgoff * marks vma as VM_SHARED if it is shared, and the file was opened for * writing i.e. vma may be mprotected writable even if now readonly. + * + * If i_mmap_writable is negative, no new writable mappings are allowed. You + * can only deny writable mappings, if none exists right now. 
*/ static inline int mapping_writably_mapped(struct address_space *mapping) { - return mapping->i_mmap_writable != 0; + return atomic_read(&mapping->i_mmap_writable) > 0; +} + +static inline int mapping_map_writable(struct address_space *mapping) +{ + return atomic_inc_unless_negative(&mapping->i_mmap_writable) ? + 0 : -EPERM; +} + +static inline void mapping_unmap_writable(struct address_space *mapping) +{ + atomic_dec(&mapping->i_mmap_writable); +} + +static inline int mapping_deny_writable(struct address_space *mapping) +{ + return atomic_dec_unless_positive(&mapping->i_mmap_writable) ? + 0 : -EBUSY; +} + +static inline void mapping_allow_writable(struct address_space *mapping) +{ + atomic_inc(&mapping->i_mmap_writable); } /* diff --git a/kernel/fork.c b/kernel/fork.c index 505fa21..8fcc5db 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -435,7 +435,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm) atomic_dec(&inode->i_writecount); mutex_lock(&mapping->i_mmap_mutex); if (tmp->vm_flags & VM_SHARED) - mapping->i_mmap_writable++; +
Re: [Devel] [RFC] Fix get_exec_env() races
Vova reminded me that we may sleep inside a get_exec_env() section, so it's better to use a task work here. On 15.10.2015 13:02, Kirill Tkhai wrote:
> Since we allow attaching a non-current task to a ve cgroup, there is a race
> in the places where we use get_exec_env(). The task's ve may change after
> get_exec_env() has been dereferenced, so a lot of problems are possible there.
> I'm sure most of the places where we use get_exec_env() were not written with
> the assumption that it may change. Also, there are a lot of nested functions,
> and it's impossible to check every function to verify whether its input
> parameters, derived from a caller's dereferenced ve, have become stale because
> the ve has changed.
>
> I suggest modifying get_exec_env() so that it guarantees the ve's stability.
> It pairs with put_exec_env(), which marks the end of the region where ve
> modification is not desirable.
>
> get_exec_env() may be used nested, hence task_struct::ve_attach_lock_depth,
> which allows nesting. The counter looks like a better variant than a plain
> read_lock() in get_exec_env() and a write_trylock() loop in ve_attach():
>
> get_exec_env()
> {
>	...
>	read_lock();
>	...
> }
>
> ve_attach()
> {
>	while (!write_trylock())
>		cpu_relax();
> }
>
> because in that case the priority of read_lock() would be absolute, and we
> would lose all the advantages of queued rwlock fairness.
>
> I also considered variants using RCU and task works, but they seem to be worse.
>
> Please, your comments. 
>
> ---
> include/linux/init_task.h | 3 ++-
> include/linux/sched.h | 1 +
> include/linux/ve.h | 29 +
> include/linux/ve_proto.h | 1 -
> kernel/fork.c | 3 +++
> kernel/ve/ve.c | 8 +++-
> 6 files changed, 42 insertions(+), 3 deletions(-)
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index d2cbad0..57e0796 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -136,7 +136,8 @@ extern struct task_group root_task_group;
> #endif
>
> #ifdef CONFIG_VE
> -#define INIT_TASK_VE(tsk)	.task_ve = &ve0,
> +#define INIT_TASK_VE(tsk)	.task_ve = &ve0,	\
> +	.ve_attach_lock_depth = 0
> #else
> #define INIT_TASK_VE(tsk)
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e1bcabe..948481f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1564,6 +1564,7 @@ struct task_struct {
> #endif
> #ifdef CONFIG_VE
>	struct ve_struct *task_ve;
> +	unsigned int ve_attach_lock_depth;
> #endif
> #ifdef CONFIG_MEMCG /* memcg uses this to do batch job */
>	struct memcg_batch_info {
> diff --git a/include/linux/ve.h b/include/linux/ve.h
> index 86b95c3..3cea73d 100644
> --- a/include/linux/ve.h
> +++ b/include/linux/ve.h
> @@ -33,6 +33,7 @@ struct ve_monitor;
> struct nsproxy;
>
> struct ve_struct {
> +	rwlock_t attach_lock;
>	struct cgroup_subsys_state css;
>
>	const char *ve_name;
> @@ -130,6 +131,34 @@ struct ve_struct {
> #endif
> };
>
> +static inline struct ve_struct *get_exec_env(void)
> +{
> +	struct ve_struct *ve;
> +
> +	if (++current->ve_attach_lock_depth > 1)
> +		return current->task_ve;
> +
> +	rcu_read_lock();
> +again:
> +	ve = current->task_ve;
> +	read_lock(&ve->attach_lock);
> +	if (unlikely(current->task_ve != ve)) {
> +		read_unlock(&ve->attach_lock);
> +		goto again;
> +	}
> +	rcu_read_unlock();
> +
> +	return ve;
> +}
> +
> +static inline void put_exec_env(void)
> +{
> +	struct ve_struct *ve = current->task_ve;
> +
> +	if (!--current->ve_attach_lock_depth)
> +		read_unlock(&ve->attach_lock);
> +}
> +
> 
> struct ve_devmnt {
>	struct list_head link;
>
> diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
> index 0f5898e..3deb09e 100644
> --- a/include/linux/ve_proto.h
> +++ b/include/linux/ve_proto.h
> @@ -30,7 +30,6 @@ static inline bool ve_is_super(struct ve_struct *ve)
>	return ve == &ve0;
> }
>
> -#define get_exec_env() (current->task_ve)
> #define get_env_init(ve) (ve->ve_ns->pid_ns->child_reaper)
>
> const char *ve_name(struct ve_struct *ve);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 505fa21..3d7e452 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1439,6 +1439,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>	INIT_LIST_HEAD(&p->pi_state_list);
>	p->pi_state_cache = NULL;
> #endif
> +#ifdef CONFIG_VE
> +	p->ve_attach_lock_depth = 0;
> +#endif
>	/*
>	 * sigaltstack should be cleared when sharing the same VM
>	 */
> diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
> index 39a95e8..23833ed 100644
> --- a/kernel/ve/ve.c
> +++ b/kernel/ve/ve.c
> @@ -640,6 +640,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup
Re: [Devel] [PATCH RH7 1/2] device_cgroup: fake allowing all devices for docker inside VZCT
Volodya, please review. -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 10/13/2015 06:11 PM, Pavel Tikhomirov wrote: We need it for docker 1.7.+, please review. On 10/07/2015 11:51 AM, Pavel Tikhomirov wrote: Docker 1.7.0+ tries to write "a" to devices.allow for a newly created privileged container's device cgroup, and thus to allow all devices in the docker container. Docker fails to do so because not all devices are allowed in the parent VZCT cgroup. To support docker we must allow writing "a" to devices.allow in a CT. With this patch, if we get "a" we silently exit without EPERM. https://jira.sw.ru/browse/PSBM-38691 v2: fix bug link, fix comment style Signed-off-by: Pavel Tikhomirov --- security/device_cgroup.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/security/device_cgroup.c b/security/device_cgroup.c index 531e40c..9f932d7 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -689,7 +689,14 @@ static int devcgroup_update_access(struct dev_cgroup *devcgroup, if (has_children(devcgroup)) return -EINVAL; - if (!may_allow_all(parent)) + if (!may_allow_all(parent)) { + if (ve_is_super(get_exec_env())) + return -EPERM; + else + /* Fooling docker in CT - silently exit */ + return 0; + } + return -EPERM; dev_exception_clean(devcgroup); devcgroup->behavior = DEVCG_DEFAULT_ALLOW; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ve: Strip unset options in ve.mount_opts
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 5e85a8088c27b22a38d446530fab6904db6314a4 Author: Kirill TkhaiDate: Thu Oct 15 14:37:13 2015 +0400 ve: Strip unset options in ve.mount_opts Igor reports "(null)" in not set options may confuse user: echo "0 182:223361;1 balloon_ino=12,pfcache_csum,,2: (null);" The patch removes not set options from there: echo "0 182:223361;1 balloon_ino=12,pfcache_csum,,;" N.B. No any problem there, because *printf handles zero strings for a long time. Requested-by: Igor Sukhih Signed-off-by: Kirill Tkhai Reviewed-by: Maxim Patlasov --- kernel/ve/ve.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 12cfa33..39a95e8 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -906,9 +906,12 @@ static int ve_mount_opts_show(struct seq_file *m, void *v) struct ve_devmnt *devmnt = v; dev_t dev = devmnt->dev; - seq_printf(m, "0 %u:%u;1 %s;2 %s;\n", MAJOR(dev), MINOR(dev), - devmnt->hidden_options, - devmnt->allowed_options); + seq_printf(m, "0 %u:%u;", MAJOR(dev), MINOR(dev)); + if (devmnt->hidden_options) + seq_printf(m, "1 %s;", devmnt->hidden_options); + if (devmnt->allowed_options) + seq_printf(m, "2 %s;", devmnt->allowed_options); + seq_putc(m, '\n'); return 0; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC] Fix get_exec_env() races
> @@ -130,6 +131,34 @@ struct ve_struct {
> #endif
> };
>
> +static inline struct ve_struct *get_exec_env(void)
> +{
> +	struct ve_struct *ve;
> +
> +	if (++current->ve_attach_lock_depth > 1)
> +		return current->task_ve;
> +
> +	rcu_read_lock();
> +again:
> +	ve = current->task_ve;
> +	read_lock(&ve->attach_lock);
> +	if (unlikely(current->task_ve != ve)) {
> +		read_unlock(&ve->attach_lock);
> +		goto again;

Please, no. The 3.10 kernel has task works: ask the task you want to attach to a ve to execute a task work that moves it there itself, and keep this small routine small and simple.

> +	}
> +	rcu_read_unlock();
> +
> +	return ve;
> +}
> +
[Devel] [PATCH RHEL7 COMMIT] ms/shm: add sealing API
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 38bc7de2200c7f0aafacc2f30769787ca3c55308 Author: Andrew VaginDate: Thu Oct 15 15:04:17 2015 +0400 ms/shm: add sealing API The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 ML: 40e041a2c858b3caefc757e26cb85bfceae5062b If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. 
While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). 
The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the
[Devel] [PATCH RHEL7 COMMIT] ms/mm: introduce check_data_rlimit helper
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit a3ca20dacb7becb446fde9154abd51cbb0594674 Author: Andrew VaginDate: Thu Oct 15 15:04:12 2015 +0400 ms/mm: introduce check_data_rlimit helper The patch is required for CRIU. https://jira.sw.ru/browse/PSBM-39834 From: Cyrill Gorcunov ML: 9c5990240e076ae564cccbd921868cd08f6daaa5 To eliminate code duplication lets introduce check_data_rlimit helper which we will use in brk() and prctl() syscalls. Signed-off-by: Cyrill Gorcunov Cc: Kees Cook Cc: Tejun Heo Cc: Andrew Vagin Cc: Eric W. Biederman Cc: H. Peter Anvin Acked-by: Serge Hallyn Cc: Pavel Emelyanov Cc: Vasiliy Kulikov Cc: KAMEZAWA Hiroyuki Cc: Michael Kerrisk Cc: Julien Tinnes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrew Vagin --- include/linux/mm.h | 15 +++ 1 file changed, 15 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8424c6a..163d3d8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -18,6 +18,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -1747,6 +1748,20 @@ extern struct vm_area_struct *copy_vma(struct vm_area_struct **, bool *need_rmap_locks); extern void exit_mmap(struct mm_struct *); +static inline int check_data_rlimit(unsigned long rlim, + unsigned long new, + unsigned long start, + unsigned long end_data, + unsigned long start_data) +{ + if (rlim < RLIM_INFINITY) { + if (((new - start) + (end_data - start_data)) > rlim) + return -ENOSPC; + } + + return 0; +} + extern int mm_take_all_locks(struct mm_struct *mm); extern void mm_drop_all_locks(struct mm_struct *mm); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/oom: add helpers for setting and clearing TIF_MEMDIE
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 4860757ccf723defc3ba770ca3ad3f8c67c4ae20 Author: Vladimir DavydovDate: Thu Oct 15 17:47:34 2015 +0400 ms/oom: add helpers for setting and clearing TIF_MEMDIE Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Michal Hocko This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. 
I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] 
[ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless of the rest IMO. Patches 3 and 4 are trivial printk ->
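The disable/enable protocol described above boils down to a victim counter: marking a victim raises it, unmarking drops it, and the OOM killer may only be disabled once it reaches zero. A minimal userspace sketch of that protocol (names and the wait-free check are simplifications, not the kernel's implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the TIF_MEMDIE helper protocol: mark/unmark keep
 * a count of in-flight OOM victims, and the PM freezer may only turn
 * the OOM killer off once no victim is still exiting. */
static int oom_victims;
static bool oom_killer_disabled;

static void mark_oom_victim(void)   { oom_victims++; }
static void unmark_oom_victim(void) { oom_victims--; }

/* Succeeds only when no victim is still on its way out. */
static bool try_disable_oom_killer(void)
{
    if (oom_victims > 0)
        return false;
    oom_killer_disabled = true;
    return true;
}
```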
[Devel] [PATCH RHEL7 COMMIT] ms/oom: thaw the OOM victim if it is frozen
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 880e147721e60945828b460b86f36057e72603df Author: Vladimir DavydovDate: Thu Oct 15 17:47:35 2015 +0400 ms/oom: thaw the OOM victim if it is frozen Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Michal Hocko oom_kill_process only sets TIF_MEMDIE flag and sends a signal to the victim. This is basically noop when the task is frozen though because the task sleeps in the uninterruptible sleep. The victim is eventually thawed later when oom_scan_process_thread meets the task again in a later OOM invocation so the OOM killer doesn't live lock. But this is less than optimal. Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to the victim. We are not checking whether the task is frozen because that would be racy and __thaw_task does that already. oom_scan_process_thread doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are excluded completely now. Signed-off-by: Michal Hocko Cc: Tejun Heo Cc: David Rientjes Cc: Johannes Weiner Cc: Oleg Nesterov Cc: Cong Wang Cc: "Rafael J. 
Wysocki" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 63a8ca9b2084fa5bd91aa380532f18e361764109) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai --- mm/oom_kill.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 224dd8d..7b106e8 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -266,8 +266,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, * Don't allow any other task to have access to the reserves. */ if (test_tsk_thread_flag(task, TIF_MEMDIE)) { - if (unlikely(frozen(task))) - __thaw_task(task); if (!force_kill) return OOM_SCAN_ABORT; } @@ -417,6 +415,14 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order, void mark_tsk_oom_victim(struct task_struct *tsk) { set_tsk_thread_flag(tsk, TIF_MEMDIE); + + /* +* Make sure that the task is woken up from uninterruptible sleep +* if it is frozen because OOM killer wouldn't be able to free +* any memory and livelock. freezing_slow_path will tell the freezer +* that TIF_MEMDIE tasks should be ignored. +*/ + __thaw_task(tsk); } /** ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
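The behavioral change can be modeled in a few lines of userspace C — marking a victim now thaws it immediately instead of waiting for a later OOM scan to do so (`struct task` and the helpers below are illustrative stand-ins for the kernel objects, not the real API):

```c
#include <assert.h>
#include <stdbool.h>

#define TIF_MEMDIE 0x1

struct task { unsigned flags; bool frozen; };

static void thaw_task(struct task *t) { t->frozen = false; }

/* As in the patch: set TIF_MEMDIE, then thaw unconditionally so a
 * frozen victim can run its exit path instead of livelocking the
 * OOM killer; the real __thaw_task() handles the not-frozen case. */
static void mark_tsk_oom_victim(struct task *t)
{
    t->flags |= TIF_MEMDIE;
    thaw_task(t);
}
```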
[Devel] [PATCH RHEL7 COMMIT] ms/mm, oom: remove unnecessary exit_state check
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit d2dc55df7ee5b44dac752c1ff02e2ae5ce251935 Author: Vladimir DavydovDate: Thu Oct 15 17:47:32 2015 +0400 ms/mm, oom: remove unnecessary exit_state check Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: David Rientjes The oom killer scans each process and determines whether it is eligible for oom kill or whether the oom killer should abort because of concurrent memory freeing. It will abort when an eligible process is found to have TIF_MEMDIE set, meaning it has already been oom killed and we're waiting for it to exit. Processes with task->mm == NULL should not be considered because they are either kthreads or have already detached their memory and killing them would not lead to memory freeing. That memory is only freed after exit_mm() has returned, however, and not when task->mm is first set to NULL. Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process is no longer considered for oom kill, but only until exit_mm() has returned. This was fragile in the past because it relied on exit_notify() to be reached before no longer considering TIF_MEMDIE processes. 
Signed-off-by: David Rientjes Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit fb794bcbb4e5552242f9a4c5e1ffe4c6da29a968) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai --- kernel/exit.c | 1 + mm/oom_kill.c | 2 -- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/exit.c b/kernel/exit.c index dbc8f77..90feb5f 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -520,6 +520,7 @@ static void exit_mm(struct task_struct * tsk) task_unlock(tsk); mm_update_next_owner(mm); mmput(mm); + clear_thread_flag(TIF_MEMDIE); } /* diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 5413a44..57d9f3e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -258,8 +258,6 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill, bool ignore_memcg_guarantee) { - if (task->exit_state) - return OOM_SCAN_CONTINUE; if (oom_unkillable_task(task, NULL, nodemask)) return OOM_SCAN_CONTINUE; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
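A userspace model of the new lifetime rule: the scan aborts only while TIF_MEMDIE is set, and the exit path clears the flag right after the mm is released, so a victim stops blocking the OOM killer as soon as its memory is actually freed (types and names below are illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define TIF_MEMDIE 0x1

enum oom_scan_t { OOM_SCAN_OK, OOM_SCAN_ABORT };

struct task { unsigned flags; void *mm; };

/* Scan decision after the patch: no exit_state check, only TIF_MEMDIE. */
static enum oom_scan_t oom_scan(const struct task *t)
{
    return (t->flags & TIF_MEMDIE) ? OOM_SCAN_ABORT : OOM_SCAN_OK;
}

/* exit_mm(): release the mm first, then stop being an OOM victim. */
static void exit_mm(struct task *t)
{
    t->mm = NULL;               /* stands in for mmput() */
    t->flags &= ~TIF_MEMDIE;    /* clear_thread_flag(TIF_MEMDIE) */
}
```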
[Devel] [PATCH RHEL7 COMMIT] oom: rework logic behind memory.oom_guarantee
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit acf9780b995d7cabcb227c5a3636635a365a1d7c Author: Vladimir DavydovDate: Thu Oct 15 17:53:02 2015 +0400 oom: rework logic behind memory.oom_guarantee Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect 
berserker mode Reviewed-by: Kirill Tkhai = This patch description: Currently, memory.oom_guarantee works as a threshold: we first select processes in cgroups whose usage is below oom guarantee, and only if there is no eligible process in such cgroups, we disregard oom guarantee configuration and iterate over all processes. Although simple to implement, such a behavior is unfair: we do not differentiate between cgroups that only slightly above their guarantee and those who exceed it significantly. This patch therefore reworks the way how memory.oom_guarantee affects oom killer behavior. First of all, it reverts old logic, which was introduced by commit e94e18346f74c ("memcg: add oom_guarantee"), leaving hunks bringing the memory.oom_guarantee knob intact. Then it implements a new approach of selecting oom victim that works as follows. Now a task is selected by oom killer iff (a) the memory cgroup which the process resides in has the greatest overdraft of all cgroups eligible for scan and (b) the process has the greatest score among all processes which reside in cgroups with the greatest overdraft. 
A cgroup's overdraft is defined as (U-G)/(L-G), if U --- fs/proc/base.c | 2 +- include/linux/memcontrol.h | 6 ++-- include/linux/oom.h | 24 +++-- mm/memcontrol.c | 86 ++ mm/oom_kill.c | 61 ++-- 5 files changed, 92 insertions(+), 87 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index b574498..b5f3a70 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -455,7 +455,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer) read_lock(&tasklist_lock); if (pid_alive(task)) - points = oom_badness(task, NULL, NULL, totalpages) * + points = oom_badness(task, NULL, NULL, totalpages, NULL) * 1000 / totalpages; read_unlock(&tasklist_lock); return sprintf(buffer, "%lu\n", points); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5911327..0c85642 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -122,7 +122,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
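The overdraft sentence above is cut off mid-formula; from the quoted part it is (U - G) / (L - G), and the sketch below assumes the truncated condition is U > G (no overdraft while usage stays under the guarantee), computed in fixed point since the kernel avoids floating point. This is a hedged reconstruction, not the patch's code:

```c
#include <assert.h>
#include <limits.h>

/* overdraft = (U - G) / (L - G), scaled by 1000; U = usage,
 * G = oom_guarantee, L = limit. Assumptions: zero overdraft below the
 * guarantee, maximal when there is no room between guarantee and limit. */
static unsigned long overdraft(unsigned long usage,
                               unsigned long guarantee,
                               unsigned long limit)
{
    if (usage <= guarantee)
        return 0;
    if (limit <= guarantee)
        return ULONG_MAX;
    return (usage - guarantee) * 1000UL / (limit - guarantee);
}
```

With this metric, a cgroup slightly above its guarantee scores much lower than one far beyond it, which is exactly the fairness property the patch description argues for.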
[Devel] [PATCH RHEL7 COMMIT] memcg: add lock for protecting memcg->oom_notify list
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit f86c874c39188e9af50163092e161878a1067977 Author: Vladimir DavydovDate: Thu Oct 15 17:52:59 2015 +0400 memcg: add lock for protecting memcg->oom_notify list Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: 
resurrect berserker mode Reviewed-by: Kirill Tkhai = This patch description: Currently, memcg_oom_lock is used for this, but I'm going to get rid of it in the following patch, so introduce a dedicated lock. Signed-off-by: Vladimir Davydov --- mm/memcontrol.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fdd14dd2..faef356 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5766,12 +5766,18 @@ static int compare_thresholds(const void *a, const void *b) return 0; } +static DEFINE_SPINLOCK(memcg_oom_notify_lock); + static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg) { struct mem_cgroup_eventfd_list *ev; + spin_lock(&memcg_oom_notify_lock); + list_for_each_entry(ev, &memcg->oom_notify, list) eventfd_signal(ev->eventfd, 1); + + spin_unlock(&memcg_oom_notify_lock); return 0; } @@ -5957,7 +5963,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp, if (!event) return -ENOMEM; - spin_lock(&memcg_oom_lock); + spin_lock(&memcg_oom_notify_lock); event->eventfd = eventfd; list_add(&event->list, &memcg->oom_notify); @@ -5965,7 +5971,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp, /* already in OOM ? */ if (atomic_read(&memcg->under_oom)) eventfd_signal(eventfd, 1); - spin_unlock(&memcg_oom_lock); + spin_unlock(&memcg_oom_notify_lock); return 0; } @@ -5979,7 +5985,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp, BUG_ON(type != _OOM_TYPE); - spin_lock(&memcg_oom_lock); + spin_lock(&memcg_oom_notify_lock); list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) { if (ev->eventfd == eventfd) { @@ -5988,7 +5994,7 @@ static void mem_cgroup_oom_unregister_event(struct cgroup *cgrp, } } - spin_unlock(&memcg_oom_lock); + spin_unlock(&memcg_oom_notify_lock); } static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
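The locking pattern the patch introduces — one dedicated lock serializing list mutation against the notify walk — can be sketched in userspace with a pthread mutex standing in for the spinlock (the fixed-size table replaces the kernel's linked list; all names are illustrative):

```c
#include <assert.h>
#include <pthread.h>

#define MAX_EVENTS 8

static pthread_mutex_t oom_notify_lock = PTHREAD_MUTEX_INITIALIZER;
static int notify_fds[MAX_EVENTS];   /* 0 = free slot */

/* Add a listener under the dedicated lock, as the register path does. */
static int register_event(int fd)
{
    int i, ret = -1;
    pthread_mutex_lock(&oom_notify_lock);
    for (i = 0; i < MAX_EVENTS; i++)
        if (!notify_fds[i]) { notify_fds[i] = fd; ret = 0; break; }
    pthread_mutex_unlock(&oom_notify_lock);
    return ret;
}

static void unregister_event(int fd)
{
    int i;
    pthread_mutex_lock(&oom_notify_lock);
    for (i = 0; i < MAX_EVENTS; i++)
        if (notify_fds[i] == fd) notify_fds[i] = 0;
    pthread_mutex_unlock(&oom_notify_lock);
}

/* Walk under the same lock, as mem_cgroup_oom_notify_cb() now does;
 * returns how many listeners would have been signaled. */
static int notify_all(void)
{
    int i, n = 0;
    pthread_mutex_lock(&oom_notify_lock);
    for (i = 0; i < MAX_EVENTS; i++)
        if (notify_fds[i]) n++;
    pthread_mutex_unlock(&oom_notify_lock);
    return n;
}
```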
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 17:23, Pavel Emelyanov wrote: > On 10/15/2015 05:21 PM, Kirill Tkhai wrote: >> >> >> On 15.10.2015 14:15, Pavel Emelyanov wrote: >>> @@ -130,6 +131,34 @@ struct ve_struct { #endif }; +static inline struct ve_struct *get_exec_env(void) +{ + struct ve_struct *ve; + + if (++current->ve_attach_lock_depth > 1) + return current->task_ve; + + rcu_read_lock(); +again: + ve = current->task_ve; + read_lock(&ve->attach_lock); + if (unlikely(current->task_ve != ve)) { + read_unlock(&ve->attach_lock); + goto again; >>> >>> Please, no. 3.10 kernel has task_work-s, ask the task you want to attach to ve to execute the work by moving itself into it and keep this small routine small and simple. >> >> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), so we can't make the attaching task wait till it completes the task work (it may execute any code; it may be taking cgroup_mutex, for example). > > I see. > >> Should we provide an interface for userspace to learn that the task has finished changing its ve? > > No. What are the places where get_exec_env() is still required? Ok. We use it from time to time $ git grep get_exec_env | grep -v ve_is_super | wc -l 71 >> + } + rcu_read_unlock(); + + return ve; +} + >>> >> . >> >
[Devel] [PATCH RHEL7 COMMIT] ve/vtty: Make indices to match pcs6 scheme
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.8 --> commit 77f7c920ddb6426dfb580fff5146da73c6f7f7d3 Author: Cyrill GorcunovDate: Thu Oct 15 20:04:49 2015 +0400 ve/vtty: Make indices to match pcs6 scheme In pcs6 vttys are mapped into internal kernel representation in nonobvious way. The /dev/console represent [maj:5,min:1], in turn /dev/tty[0-...] are defined as [maj:4,min:0...], where minor is bijective to symbol postfix of the tty. Internally in the pcs6 kernel any open of /dev/ttyX has been mapping minor into vtty index as | if (minor > 0) | index = minor - 1 | else | index = 0 which actually shifts indices and make /dev/tty0 as an alias to /dev/console inside container. Same time vzctl tool passes console number argument in a decremented way, iow when one is typing vzctl console $ctid 1 here is 1 is a tty number, the kernel sees is as 0, opening containers /dev/console. When one types "vzctl console $ctid 2" (which implies to open container's /dev/tty2) the vzctl passes index 1 and the kernel opens /dev/tty2 because of the if/else index mapping as show above. Lets implement same indices mapping in pcs7 for backward compatibility (in pcs7 there is a per-VE vtty_map_t structure which reserve up to MAX_NR_VTTY_CONSOLES ttys to track and it is simply an array addressed by tty index). Same time lets fix a few nits: disable setup of controlling terminal on /dev/console only, since all ttys can have controlling sign; make sure we're having @tty_fops for such terminals. 
https://jira.sw.ru/browse/PSBM-40088 Signed-off-by: Cyrill Gorcunov Reviewed-by: Vladimir Davydov CC: Konstantin Khorenko CC: Igor Sukhih --- drivers/tty/pty.c | 7 +-- drivers/tty/tty_io.c | 12 ++-- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c index b74ddca..0ab36f9 100644 --- a/drivers/tty/pty.c +++ b/drivers/tty/pty.c @@ -1240,8 +1240,11 @@ struct tty_driver *vtty_console_driver(int *index) struct tty_driver *vtty_driver(dev_t dev, int *index) { if (MAJOR(dev) == TTY_MAJOR && - MINOR(dev) < MAX_NR_VTTY_CONSOLES) { - *index = MINOR(dev); + MINOR(dev) <= MAX_NR_VTTY_CONSOLES) { + if (MINOR(dev)) + *index = MINOR(dev) - 1; + else + *index = 0; return vttys_driver; } return NULL; diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c index 8fc8334..8ce0a5f 100644 --- a/drivers/tty/tty_io.c +++ b/drivers/tty/tty_io.c @@ -1941,7 +1941,8 @@ static struct tty_driver *tty_lookup_driver(dev_t device, struct file *filp, if (!ve_is_super(ve)) { driver = vtty_driver(device, index); if (driver) { - *noctty = 1; + if (MINOR(device) == 0) + *noctty = 1; return tty_driver_kref_get(driver); } } @@ -1960,8 +1961,15 @@ case MKDEV(TTYAUX_MAJOR, 1): { struct tty_driver *console_driver = console_device(index); #ifdef CONFIG_VE - if (!ve_is_super(ve)) + if (!ve_is_super(ve)) { console_driver = vtty_console_driver(index); + /* + * Reset fops, sometimes there might be + * console_fops picked from inode->i_cdev + * in chrdev_open() + */ + filp->f_op = &tty_fops; + } #endif if (console_driver) { driver = tty_driver_kref_get(console_driver);
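The index mapping the patch implements is a one-liner; this standalone version (hypothetical function name) mirrors the if/else shown in the diff, making /dev/tty0 an alias of /dev/console (index 0) and shifting /dev/ttyN to index N-1:

```c
#include <assert.h>

/* pcs6-compatible vtty index from a TTY_MAJOR minor number:
 * minor 0 (alias of /dev/console) and minor 1 both map to index 0,
 * minor N > 0 maps to index N - 1. */
static int vtty_index_from_minor(int minor)
{
    return minor > 0 ? minor - 1 : 0;
}
```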
[Devel] [PATCH RHEL7 COMMIT] ve/net: introduce TAP accounting
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit b59e089eb2d2fdc939e54abb656cd5b7a2ad500e Author: Vladimir Sementsov-OgievskiyDate: Thu Oct 15 18:56:47 2015 +0400 ve/net: introduce TAP accounting Add ve accounting to tun/tap devices. New ioctl should be called to attach/create ve stat to tun/tap. https://jira.sw.ru/browse/PSBM-27713 Note: TUN accounting is not tested for now and disabled in this commit. only TAP accounting is allowed for now. Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Cyrill Gorcunov Acked-by: Konstantin Khorenko --- drivers/net/tun.c | 68 - include/uapi/linux/if_tun.h | 9 ++ kernel/Kconfig.openvz | 7 + 3 files changed, 83 insertions(+), 1 deletion(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 392d701..4f7eee9 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -71,6 +71,10 @@ #include +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING +#include +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + /* Uncomment to enable debugging */ /* #define TUN_DEBUG 1 */ @@ -190,6 +194,9 @@ struct tun_struct { struct list_head disabled; void *security; u32 flow_count; +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + struct venet_stat *vestat; +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ }; static inline u32 tun_hashfn(u32 rxhash) @@ -1241,6 +1248,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, tun->dev->stats.rx_packets++; tun->dev->stats.rx_bytes += len; +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (tun->vestat) { + venet_acct_classify_add_outgoing(tun->vestat, skb); + } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + tun_flow_update(tun, rxhash, tfile); return total_len; } @@ -1344,6 +1357,12 @@ static ssize_t tun_put_user(struct tun_struct *tun, tun->dev->stats.tx_packets++; tun->dev->stats.tx_bytes += len; +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (tun->vestat) { + venet_acct_classify_add_incoming(tun->vestat, skb); 
+ } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + return total; } @@ -1428,6 +1447,14 @@ static void tun_free_netdev(struct net_device *dev) BUG_ON(!(list_empty(&tun->disabled))); tun_flow_uninit(tun); security_tun_dev_free_security(tun->security); + +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (tun->vestat) { + venet_acct_put_stat(tun->vestat); + tun->vestat = NULL; + } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + free_netdev(dev); } @@ -1892,11 +1919,43 @@ unlock: return ret; } +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING +/* setacctid_ioctl should be called under rtnl_lock */ +static long setacctid_ioctl(struct file *file, void __user *argp) +{ + struct tun_file *tfile = file->private_data; + struct tun_acctid info; + struct net_device *dev; + struct tun_struct *tun; + + if (copy_from_user(&info, argp, sizeof(info))) + return -EFAULT; + + dev = __dev_get_by_name(tfile->net, info.ifname); + if (dev == NULL) + return -ENOENT; + + /* This check may be dropped to allow tun devices */ + if (dev->netdev_ops != &tap_netdev_ops) + return -EINVAL; + + tun = netdev_priv(dev); + if (tun->vestat) { + venet_acct_put_stat(tun->vestat); + } + tun->vestat = venet_acct_find_create_stat(info.acctid); + if (tun->vestat == NULL) + return -ENOMEM; + + return 0; +} +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + static long __tun_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg, int ifreq_len) { struct tun_file *tfile = file->private_data; - struct tun_struct *tun; + struct tun_struct *tun = NULL; void __user* argp = (void __user*)arg; struct ifreq ifr; kuid_t owner; @@ -1925,6 +1984,13 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, ret = 0; rtnl_lock(); +#ifdef CONFIG_VE_TUNTAP_ACCOUNTING + if (cmd == TUNSETACCTID) { + ret = setacctid_ioctl(file, argp); + goto unlock; + } +#endif /* CONFIG_VE_TUNTAP_ACCOUNTING */ + tun = __tun_get(tfile); if (cmd == TUNSETIFF && !tun) { ifr.ifr_name[IFNAMSIZ-1] = '\0'; diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h index
c80d152..81e791e 100644 --- a/include/uapi/linux/if_tun.h +++ b/include/uapi/linux/if_tun.h @@ -59,6 +59,9 @@ #define TUNSETIFINDEX _IOW('T', 218, unsigned int) #define TUNGETFILTER
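The refcount dance in setacctid_ioctl() — drop the old stat, then find-or-create the one for the requested acctid — can be modeled with a tiny table. Everything below is an illustrative stand-in for venet_stat and its helpers, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct venet_stat { unsigned acctid; int refcnt; };

static struct venet_stat stat_table[4];

/* Find an existing stat for acctid or create one, taking a reference
 * (models venet_acct_find_create_stat); NULL models -ENOMEM. */
static struct venet_stat *find_create_stat(unsigned acctid)
{
    int i;
    struct venet_stat *free_slot = NULL;

    for (i = 0; i < 4; i++) {
        if (stat_table[i].refcnt && stat_table[i].acctid == acctid) {
            stat_table[i].refcnt++;
            return &stat_table[i];
        }
        if (!stat_table[i].refcnt && !free_slot)
            free_slot = &stat_table[i];
    }
    if (!free_slot)
        return NULL;
    free_slot->acctid = acctid;
    free_slot->refcnt = 1;
    return free_slot;
}

static void put_stat(struct venet_stat *s) { s->refcnt--; }

/* Mirrors the replace step in the ioctl: put the old reference
 * (if any), then attach the new stat to the device's slot. */
static int set_acctid(struct venet_stat **slot, unsigned acctid)
{
    struct venet_stat *new_stat;

    if (*slot)
        put_stat(*slot);
    new_stat = find_create_stat(acctid);
    if (!new_stat)
        return -1;
    *slot = new_stat;
    return 0;
}
```

Two devices given the same acctid end up sharing one stat object, which is what lets the ioctl aggregate traffic per container rather than per device.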
[Devel] [PATCH RHEL7 COMMIT] config.OpenVZ: enable TAP accounting
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 63d0e865e0fbcf122786ae211b1746e25e407657 Author: Konstantin KhorenkoDate: Thu Oct 15 18:59:03 2015 +0400 config.OpenVZ: enable TAP accounting Add ve accounting to tun/tap devices. New ioctl should be called to attach/create ve stat to tun/tap. https://jira.sw.ru/browse/PSBM-27713 Note: TUN accounting is not tested for now and disabled in this commit. only TAP accounting is allowed for now. Signed-off-by: Konstantin Khorenko --- configs/kernel-3.10.0-x86_64-debug.config | 1 + configs/kernel-3.10.0-x86_64.config | 1 + 2 files changed, 2 insertions(+) diff --git a/configs/kernel-3.10.0-x86_64-debug.config b/configs/kernel-3.10.0-x86_64-debug.config index 0f91ff9..6dcb566 100644 --- a/configs/kernel-3.10.0-x86_64-debug.config +++ b/configs/kernel-3.10.0-x86_64-debug.config @@ -5378,6 +5378,7 @@ CONFIG_VZ_LIST=m CONFIG_VZ_GENCALLS=y CONFIG_VE_NETDEV=m CONFIG_VE_NETDEV_ACCOUNTING=m +CONFIG_VE_TUNTAP_ACCOUNTING=y CONFIG_VZ_DEV=m CONFIG_VE_IPTABLES=y CONFIG_VZ_WDOG=m diff --git a/configs/kernel-3.10.0-x86_64.config b/configs/kernel-3.10.0-x86_64.config index 3a5b8c0..7cfaaaf 100644 --- a/configs/kernel-3.10.0-x86_64.config +++ b/configs/kernel-3.10.0-x86_64.config @@ -5351,6 +5351,7 @@ CONFIG_VZ_LIST=m CONFIG_VZ_GENCALLS=y CONFIG_VE_NETDEV=m CONFIG_VE_NETDEV_ACCOUNTING=m +CONFIG_VE_TUNTAP_ACCOUNTING=y CONFIG_VZ_DEV=m CONFIG_VE_IPTABLES=y CONFIG_VZ_WDOG=m ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] vtty: Make indices to match pcs6 scheme
On Mon, Oct 05, 2015 at 12:54:26PM +0300, Cyrill Gorcunov wrote: > In pcs6 vttys are mapped into internal kernel representation in > nonobvious way. The /dev/console represent [maj:5,min:1], in > turn /dev/tty[0-...] are defined as [maj:4,min:0...], where > minor is bijective to symbol postfix of the tty. Internally > in the pcs6 kernel any open of /dev/ttyX has been mapping > minor into vtty index as > > |if (minor > 0) > |index = minor - 1 > |else > |index = 0 > > which actually shifts indices and make /dev/tty0 as > an alias to /dev/console inside container. > > Same time vzctl tool passes console number argument > in a decremented way, iow when one is typing > > vzctl console $ctid 1 > > here is 1 is a tty number, the kernel sees is as 0, > opening containers /dev/console. > > When one types "vzctl console $ctid 2" (which implies > to open container's /dev/tty2) the vzctl passes index 1 > and the kernel opens /dev/tty2 because of the if/else index > mapping as show above. > > Lets implement same indices mapping in pcs7 for backward > compatibility (in pcs7 there is a per-VE vtty_map_t structure > which reserve up to MAX_NR_VTTY_CONSOLES ttys to track > and it is simply an array addressed by tty index). > > Same time lets fix a few nits: disable setup of controlling > terminal on /dev/console only, since all ttys can have > controlling sign; make sure we're having @tty_fops for > such terminals. > > https://jira.sw.ru/browse/PSBM-40088 > > Signed-off-by: Cyrill Gorcunov> CC: Vladimir Davydov > CC: Konstantin Khorenko > CC: Igor Sukhih Reviewed-by: Vladimir Davydov ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/shm: add memfd_create() syscall: lost hunk
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.7 --> commit efd8ed12768cc7bee733d35a2c35393626707143 Author: Konstantin Khorenko Date: Thu Oct 15 19:39:42 2015 +0400 ms/shm: add memfd_create() syscall: lost hunk Fixes 9e421edd0c467fb8d3a230520421a58f55e2a46e This is a lost hunk from ms commit: 9183df25fe7b194563db3fec6dc3202a5855839c ms/shm: add memfd_create() syscall https://jira.sw.ru/browse/PSBM-39834 Signed-off-by: Konstantin Khorenko --- include/uapi/linux/memfd.h | 8 1 file changed, 8 insertions(+) diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h new file mode 100644 index 000..534e364 --- /dev/null +++ b/include/uapi/linux/memfd.h @@ -0,0 +1,8 @@ +#ifndef _UAPI_LINUX_MEMFD_H +#define _UAPI_LINUX_MEMFD_H + +/* flags for memfd_create(2) (unsigned int) */ +#define MFD_CLOEXEC 0x0001U +#define MFD_ALLOW_SEALING 0x0002U + +#endif /* _UAPI_LINUX_MEMFD_H */
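The two flags restored by this hunk are the only bits memfd_create(2) accepts; anything else fails with EINVAL. A userspace sketch of that validation (the checker function is illustrative, the flag values come from the hunk):

```c
#include <assert.h>

#define MFD_CLOEXEC       0x0001U
#define MFD_ALLOW_SEALING 0x0002U

/* Models the flag check memfd_create(2) performs on entry:
 * any unknown bit -> error (EINVAL in the kernel, -1 here). */
static int check_memfd_flags(unsigned int flags)
{
    if (flags & ~(MFD_CLOEXEC | MFD_ALLOW_SEALING))
        return -1;
    return 0;
}
```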
[Devel] [PATCH RHEL7 COMMIT] memcg: add mem_cgroup_get/put helpers
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 28d232fd4095c371daa0980f5fae9642a30780b1 Author: Vladimir DavydovDate: Thu Oct 15 17:52:58 2015 +0400 memcg: add mem_cgroup_get/put helpers Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker 
mode Reviewed-by: Kirill Tkhai = This patch description: Equivalent to css_get/put(mem_cgroup_css(memcg)). Currently, only used by af_packet.c, but will also be used by the following patches. Signed-off-by: Vladimir Davydov --- include/linux/memcontrol.h | 18 ++ net/packet/af_packet.c | 4 ++-- 2 files changed, 20 insertions(+), 2 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ac3f16f..548a82c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -139,6 +139,16 @@ static inline bool mem_cgroup_disabled(void) return false; } +static inline void mem_cgroup_get(struct mem_cgroup *memcg) +{ + css_get(mem_cgroup_css(memcg)); +} + +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ + css_put(mem_cgroup_css(memcg)); +} + void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, unsigned long *flags); @@ -321,6 +331,14 @@ static inline bool mem_cgroup_disabled(void) return true; } +static inline void mem_cgroup_get(struct mem_cgroup *memcg) +{ +} + +static inline void mem_cgroup_put(struct mem_cgroup *memcg) +{ +} + static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec) { diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index ee9d56b..0bc235e 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2524,7 +2524,7 @@ static struct cg_proto *packet_sk_charge(void) goto out; out_put_cg: - css_put(mem_cgroup_css(psc->memcg)); + mem_cgroup_put(psc->memcg); out_free_psc: kfree(psc); psc = NULL; @@ -2545,7 +2545,7 @@ static void packet_sk_uncharge(struct cg_proto *cg) if (psc) { memcg_uncharge_kmem(psc->memcg, psc->amt); - css_put(mem_cgroup_css(psc->memcg)); + mem_cgroup_put(psc->memcg); kfree(psc); } } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [RFC] Fix get_exec_env() races
On 10/15/2015 05:21 PM, Kirill Tkhai wrote:
> On 15.10.2015 14:15, Pavel Emelyanov wrote:
>>
>>> @@ -130,6 +131,34 @@ struct ve_struct {
>>>  #endif
>>>  };
>>>
>>> +static inline struct ve_struct *get_exec_env(void)
>>> +{
>>> +	struct ve_struct *ve;
>>> +
>>> +	if (++current->ve_attach_lock_depth > 1)
>>> +		return current->task_ve;
>>> +
>>> +	rcu_read_lock();
>>> +again:
>>> +	ve = current->task_ve;
>>> +	read_lock(&ve->attach_lock);
>>> +	if (unlikely(current->task_ve != ve)) {
>>> +		read_unlock(&ve->attach_lock);
>>> +		goto again;
>>
>> Please, no. 3.10 kernel has task_work-s, ask the task you want to
>> attach to ve to execute the work by moving itself into it and keep
>> this small routine small and simple.
>
> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(),
> so we can't make the attaching task wait until the target completes the
> task work (the target may be executing any code; it may be taking
> cgroup_mutex, for example).

I see.

> Should we give userspace a possibility (an interface) to find out that
> the task has finished changing its ve?

No. What are the places where get_exec_env() is still required?

>>> +	}
>>> +	rcu_read_unlock();
>>> +
>>> +	return ve;
>>> +}
>>> +
>>
> .
___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/oom: don't count on mm-less current process
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 315f2cf7428d49d724775f01c545926c55b39a7e Author: Vladimir DavydovDate: Thu Oct 15 17:47:33 2015 +0400 ms/oom: don't count on mm-less current process Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Tetsuo Handa out_of_memory() doesn't trigger the OOM killer if the current task is already exiting or it has fatal signals pending, and gives the task access to memory reserves instead. However, doing so is wrong if out_of_memory() is called by an allocation (e.g. from exit_task_work()) after the current task has already released its memory and cleared TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm() current task, the OOM killer will be blocked by the task sitting in the final schedule() waiting for its parent to reap it. It will trigger an OOM livelock if its parent is unable to reap it due to doing an allocation and waiting for the OOM killer to kill it. 
Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: David Rientjes Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit d7a94e7e11badf8404d40b41e008c3131a3cebe3) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai Conflicts: mm/oom_kill.c --- mm/oom_kill.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 57d9f3e..fd9e13d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -643,8 +643,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, * If current has a pending SIGKILL or is exiting, then automatically * select it. The goal is to allow it to allocate so that it may * quickly exit and free its memory. +* +* But don't select if current has already released its mm and cleared +* TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur. */ - if (fatal_signal_pending(current) || current->flags & PF_EXITING) { + if (current->mm && + (fatal_signal_pending(current) || current->flags & PF_EXITING)) { set_thread_flag(TIF_MEMDIE); return; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] oom: pass points and overdraft to oom_kill_process
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit c67b670a0cde9ea89926108a26a651b9108e49c7 Author: Vladimir DavydovDate: Thu Oct 15 17:53:02 2015 +0400 oom: pass points and overdraft to oom_kill_process Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect 
berserker mode Reviewed-by: Kirill Tkhai = This patch description: This is required by oom berserker mode, which will be introduced later in this series. Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 3 ++- mm/memcontrol.c | 6 +++--- mm/oom_kill.c | 26 -- 3 files changed, 21 insertions(+), 14 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index 9117d1d..6ea83b2 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -85,7 +85,8 @@ static inline bool oom_worse(unsigned long points, unsigned long overdraft, } extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, -unsigned int points, unsigned long totalpages, +unsigned long points, unsigned long overdraft, +unsigned long totalpages, struct mem_cgroup *memcg, nodemask_t *nodemask, const char *message); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c34dee0..14e6aee 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1935,7 +1935,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned long chosen_points = 0; unsigned long totalpages; unsigned long overdraft; - unsigned int points = 0; + unsigned long points = 0; struct task_struct *chosen = NULL; /* @@ -1987,8 +1987,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, if (!chosen) return; - points = chosen_points * 1000 / totalpages; - oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg, + oom_kill_process(chosen, gfp_mask, order, chosen_points, max_overdraft, +totalpages, memcg, NULL, "Memory cgroup out of memory"); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index a437f68..d8a89c0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -353,7 +353,8 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, * * (not docbooked, we don't want this one cluttering up the manual) */ -static struct task_struct *select_bad_process(unsigned int *ppoints, +static struct task_struct *select_bad_process(unsigned long 
*ppoints, + unsigned long *poverdraft, unsigned long totalpages, const nodemask_t *nodemask, bool force_kill) { @@ -389,7 +390,8 @@ static struct task_struct *select_bad_process(unsigned int
[Devel] [PATCH RHEL7 COMMIT] oom: drop OOM_SCAN_ABORT
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 3bb5625c93235a4fd013b4307a1fd9cc9db4e6a8 Author: Vladimir DavydovDate: Thu Oct 15 17:53:01 2015 +0400 oom: drop OOM_SCAN_ABORT Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: It is not used anymore, neither should it be used with the new locking design. Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 1 - mm/memcontrol.c | 6 -- mm/oom_kill.c | 7 +-- 3 files changed, 1 insertion(+), 13 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index f804551..6d4a94f 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -27,7 +27,6 @@ enum oom_constraint { enum oom_scan_t { OOM_SCAN_OK,/* scan thread and find its badness */ OOM_SCAN_CONTINUE, /* do not consider thread for oom kill */ - OOM_SCAN_ABORT, /* abort the iteration and return */ OOM_SCAN_SELECT,/* always select this thread first */ }; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 892e5ff..7df0dff 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2001,12 +2001,6 @@ retry: /* fall through */ case OOM_SCAN_CONTINUE: continue; - case OOM_SCAN_ABORT: - cgroup_iter_end(cgroup, ); - mem_cgroup_iter_break(memcg, iter); - if (chosen) - put_task_struct(chosen); - return; case OOM_SCAN_OK: break; }; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 2fab831..914f9f4 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -358,9 +358,6 @@ retry: /* fall through */ case OOM_SCAN_CONTINUE: continue; - case OOM_SCAN_ABORT: - rcu_read_unlock(); - return ERR_PTR(-1UL); case OOM_SCAN_OK: break; }; @@ -887,11 +884,9 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, if (!p) { dump_header(NULL, gfp_mask, order, NULL, mpol_mask); panic("Out of memory and no killable processes...\n"); - } - if (PTR_ERR(p) != -1UL) { + } else oom_kill_process(p, gfp_mask, order, points, totalpages, NULL, nodemask, "Out of memory"); - } } /* ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] oom: rework locking design
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 6376b304e2690ab7e3868b19f4a3eb8f78ee869e Author: Vladimir DavydovDate: Thu Oct 15 17:53:00 2015 +0400 oom: rework locking design Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: Currently, after oom-killing a process, we keep busy waiting for it until it frees some memory and we can fulfil the allocation request that initiated oom. This slows down oom kill rate dramatically, because the oom victim has to compete for cpu time with other (possibly numerous) processes. The latter is unacceptable for the upcoming oom berserker, which triggers if oom kills happen to often. This patch reworks oom locking design as follows. Now only one process is allowed to invoke oom killer in a memcg (root included) and all its descendants, others have to wait for it to finish. Next, once a victim is selected, the executioner will wait for it to die before retrying allocation. Signed-off-by: Vladimir Davydov --- include/linux/memcontrol.h | 9 ++ include/linux/oom.h| 13 ++- mm/memcontrol.c| 123 +++-- mm/oom_kill.c | 263 + mm/page_alloc.c| 6 +- 5 files changed, 255 insertions(+), 159 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 548a82c..5911327 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -29,6 +29,7 @@ struct page_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct oom_context; /* Stats that can be updated by kernel. 
*/ enum mem_cgroup_page_stat_item { @@ -120,6 +121,7 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg); int mem_cgroup_select_victim_node(struct mem_cgroup *memcg); unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list); void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int); +extern struct oom_context *mem_cgroup_oom_context(struct mem_cgroup *memcg); extern bool mem_cgroup_below_oom_guarantee(struct task_struct *p); extern void mem_cgroup_note_oom_kill(struct mem_cgroup *memcg, struct task_struct *task); @@ -363,6 +365,13 @@ mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, { } +static inline struct oom_context * +mem_cgroup_oom_context(struct mem_cgroup *memcg) +{ + extern struct oom_context oom_ctx; + return &oom_ctx; +} + static inline bool mem_cgroup_below_oom_guarantee(struct task_struct *p) { return false; diff --git a/include/linux/oom.h b/include/linux/oom.h index 486fc6f..e19385d 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -31,6 +15 @@ enum oom_scan_t {
[Devel] [PATCH RHEL7 COMMIT] oom: introduce oom timeout
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 93e0a04b1eb4bcc4b996fe058af0c5a1c65b90c7 Author: Vladimir DavydovDate: Thu Oct 15 17:53:00 2015 +0400 oom: introduce oom timeout Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: Currently, we won't select a new oom victim until the previous one has passed away. This might lead to a deadlock if an allocating task holds a lock needed by the victim to complete. To cope with this problem, this patch introduces an oom timeout, after which a new task will be selected even if the previous victim hasn't died. The timeout is hard-coded and equals 5 seconds. https://jira.sw.ru/browse/PSBM-38581 Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 2 ++ mm/oom_kill.c | 60 ++--- 2 files changed, 54 insertions(+), 8 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index e19385d..f804551 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -34,6 +34,8 @@ enum oom_scan_t { struct oom_context { struct task_struct *owner; struct task_struct *victim; + bool marked; + unsigned long oom_start; wait_queue_head_t waitq; }; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ef7773f6..2fab831 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -45,6 +45,8 @@ int sysctl_oom_dump_tasks; static DEFINE_SPINLOCK(oom_context_lock); +#define OOM_TIMEOUT (5 * HZ) + #ifndef CONFIG_MEMCG struct oom_context oom_ctx = { .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(oom_ctx.waitq), @@ -55,6 +57,8 @@ void init_oom_context(struct oom_context *ctx) { ctx->owner = NULL; ctx->victim = NULL; + ctx->marked = false; + ctx->oom_start = 0; init_waitqueue_head(&ctx->waitq); } @@ -62,6 +66,7 @@ static void __release_oom_context(struct oom_context *ctx) { ctx->owner = NULL; ctx->victim = NULL; + ctx->marked = false; wake_up_all(&ctx->waitq); } @@ -291,11 +296,14 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, /* * This task already has access to memory reserves and is being killed. -* Don't allow any other task to have access to the reserves. +* Try to select another one. +* +* This can only happen if oom_trylock timeout-ed, which most probably +* means that the victim had dead-locked.
*/ if (test_tsk_thread_flag(task, TIF_MEMDIE)) { if (!force_kill) - return OOM_SCAN_ABORT; + return OOM_SCAN_CONTINUE; } if (!task->mm) return OOM_SCAN_CONTINUE; @@ -463,8 +471,10 @@ void mark_oom_victim(struct task_struct *tsk)
[Devel] [PATCH RHEL7 COMMIT] ms/mm: oom_kill: clean up victim marking and exiting interfaces
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 495272394bfe50c2c5925a1ec2ffbebed25b7fea Author: Vladimir DavydovDate: Thu Oct 15 17:47:36 2015 +0400 ms/mm: oom_kill: clean up victim marking and exiting interfaces Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch desciption: From: Johannes Weiner Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking are related in functionality, but the interface is not symmetrical at all: one is an internal OOM killer function used during the killing, the other is for an OOM victim to signal its own death on exit later on. This has locking implications, see follow-up changes. While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which is easier on the eye. 
Signed-off-by: Johannes Weiner Acked-by: David Rientjes Acked-by: Michal Hocko Cc: Tetsuo Handa Cc: Andrea Arcangeli Cc: Dave Chinner Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 16e951966f05da5ccd650104176f6ba289f7fa20) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai Conflicts: include/linux/oom.h mm/memcontrol.c mm/oom_kill.c --- drivers/staging/android/lowmemorykiller.c | 2 +- include/linux/oom.h | 7 --- kernel/exit.c | 2 +- mm/memcontrol.c | 2 +- mm/oom_kill.c | 14 +++--- 5 files changed, 14 insertions(+), 13 deletions(-) diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c index 4dd6a34..433e9a7 100644 --- a/drivers/staging/android/lowmemorykiller.c +++ b/drivers/staging/android/lowmemorykiller.c @@ -164,7 +164,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc) * infrastructure. There is no real reason why the selected * task should have access to the memory reserves. 
*/ - mark_tsk_oom_victim(selected); + mark_oom_victim(selected); send_sig(SIGKILL, selected, 0); rem += selected_tasksize; } diff --git a/include/linux/oom.h b/include/linux/oom.h index 3c37f1e..486fc6f 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -52,9 +52,7 @@ static inline bool oom_task_origin(const struct task_struct *p) /* linux/mm/oom_group.c */ extern int get_task_oom_score_adj(struct task_struct *t); -extern void mark_tsk_oom_victim(struct task_struct *tsk); - -extern void unmark_oom_victim(void); +extern void mark_oom_victim(struct task_struct *tsk); extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, const nodemask_t *nodemask, unsigned long totalpages); @@ -75,6 +73,9 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task, extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order, nodemask_t *mask, bool force_kill); + +extern void exit_oom_victim(void); + extern int register_oom_notifier(struct notifier_block *nb); extern int unregister_oom_notifier(struct notifier_block *nb); diff --git a/kernel/exit.c b/kernel/exit.c index 1b13207..1cc765b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -521,7 +521,7 @@ static void exit_mm(struct task_struct * tsk) mm_update_next_owner(mm); mmput(mm); if (test_thread_flag(TIF_MEMDIE)) - unmark_oom_victim(); + exit_oom_victim(); } /* diff --git a/mm/memcontrol.c b/mm/memcontrol.c index cf9ca7f..fdd14dd2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1964,7 +1964,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup
[Devel] [PATCH RHEL7 COMMIT] oom: resurrect berserker mode
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit e651315e4475767b41a7e028c6127b25c5754312 Author: Vladimir DavydovDate: Thu Oct 15 17:53:03 2015 +0400 oom: resurrect berserker mode Patchset description: oom enhancements - part 2 - Patches 1-2 prepare memcg for upcoming changes in oom design. - Patch 3 reworks oom locking design so that the executioner waits for victim to exit. This is necessary to increase oom kill rate, which is essential for berserker mode. - Patch 4 drops unused OOM_SCAN_ABORT - Patch 5 introduces oom timeout. https://jira.sw.ru/browse/PSBM-38581 - Patch 6 makes oom fairer when it comes to selecting a victim among different containers. https://jira.sw.ru/browse/PSBM-37915 - Patch 7 prepares oom for introducing berserker mode - Patch 8 resurrects oom berserker mode, which is supposed to cope with actively forking processes. https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Changes in v3: - rework oom_trylock (patch 3) - select exiting process instead of aborting oom scan so as not to keep busy-waiting for an exiting process to exit (patches 3, 4) - cleanup oom timeout handling + fix stuck process trace dumped multiple times on timeout (patch 5) - set max_overdraft to ULONG_MAX on selected processes (patch 6) - rework oom berserker process selection logic (patches 7, 8) Changes in v2: - s/time_after/time_after_eq to avoid BUG_ON in oom_trylock (patch 4) - propagate victim to the context that initiated oom in oom_unlock (patch 6) - always set oom_end on releasing oom context (patch 6) Vladimir Davydov (8): memcg: add mem_cgroup_get/put helpers memcg: add lock for protecting memcg->oom_notify list oom: rework locking design oom: introduce oom timeout oom: drop OOM_SCAN_ABORT oom: rework logic behind memory.oom_guarantee oom: pass points and overdraft to oom_kill_process oom: resurrect berserker mode 
Reviewed-by: Kirill Tkhai = This patch description: The logic behind the OOM berserker is the same as in PCS6: if processes are killed by oom killer too often (< sysctl vm.oom_relaxation, 1 sec by default), we increase "rage" (min -10, max 20) and kill 1 << "rage" youngest worst processes if "rage" >= 0. https://jira.sw.ru/browse/PSBM-17930 Signed-off-by: Vladimir Davydov --- include/linux/oom.h | 3 ++ kernel/sysctl.c | 7 mm/oom_kill.c | 106 3 files changed, 116 insertions(+) diff --git a/include/linux/oom.h b/include/linux/oom.h index 6ea83b2..acf58fc 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -35,7 +35,9 @@ struct oom_context { struct task_struct *victim; bool marked; unsigned long oom_start; + unsigned long oom_end; unsigned long overdraft; + int rage; wait_queue_head_t waitq; }; @@ -126,4 +128,5 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p); extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; extern int sysctl_panic_on_oom; +extern int sysctl_oom_relaxation; #endif /* _INCLUDE_LINUX_OOM_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 976f48c..9c081e3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1184,6 +1184,13 @@ static struct ctl_table vm_table[] = { .proc_handler = proc_dointvec, }, { + .procname = "oom_relaxation", + .data = _oom_relaxation, + .maxlen = sizeof(sysctl_oom_relaxation), + .mode = 0644, + .proc_handler = proc_dointvec_ms_jiffies, + }, + { .procname = "overcommit_ratio", .data = _overcommit_ratio, .maxlen = sizeof(sysctl_overcommit_ratio), diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d8a89c0..6d16154 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -42,13 +42,18 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks; +int sysctl_oom_relaxation = HZ; static DEFINE_SPINLOCK(oom_context_lock); #define OOM_TIMEOUT(5 * HZ) +#define OOM_BASE_RAGE -10 +#define OOM_MAX_RAGE 20 + #ifndef CONFIG_MEMCG struct oom_context oom_ctx 
= { + .rage = OOM_BASE_RAGE, .waitq = __WAIT_QUEUE_HEAD_INITIALIZER(oom_ctx.waitq), }; #endif @@ -59,6 +64,8 @@ void init_oom_context(struct oom_context *ctx)
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 14:15, Pavel Emelyanov wrote:
>
>> @@ -130,6 +131,34 @@ struct ve_struct {
>>  #endif
>>  };
>>
>> +static inline struct ve_struct *get_exec_env(void)
>> +{
>> +	struct ve_struct *ve;
>> +
>> +	if (++current->ve_attach_lock_depth > 1)
>> +		return current->task_ve;
>> +
>> +	rcu_read_lock();
>> +again:
>> +	ve = current->task_ve;
>> +	read_lock(&ve->attach_lock);
>> +	if (unlikely(current->task_ve != ve)) {
>> +		read_unlock(&ve->attach_lock);
>> +		goto again;
>
> Please, no. 3.10 kernel has task_work-s, ask the task you want to
> attach to ve to execute the work by moving itself into it and keep
> this small routine small and simple.

cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(),
so we can't make the attaching task wait until the target completes the
task work (the target may be executing any code; it may be taking
cgroup_mutex, for example).

Should we give userspace a possibility (an interface) to find out that
the task has finished changing its ve?

>> +	}
>> +	rcu_read_unlock();
>> +
>> +	return ve;
>> +}
>> +
>
___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/oom: make sure that TIF_MEMDIE is set under task_lock
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.8.6 --> commit 82d2c87b0e1ecd58487d26f479142a3517cffc44 Author: Vladimir Davydov Date: Thu Oct 15 17:47:34 2015 +0400 ms/oom: make sure that TIF_MEMDIE is set under task_lock Patchset description: oom enhancements - part 1 Pull mainstream patches that clean up TIF_MEMDIE handling. They will come in handy for the upcoming oom rework. https://jira.sw.ru/browse/PSBM-26973 David Rientjes (1): mm, oom: remove unnecessary exit_state check Johannes Weiner (1): mm: oom_kill: clean up victim marking and exiting interfaces Michal Hocko (3): oom: make sure that TIF_MEMDIE is set under task_lock oom: add helpers for setting and clearing TIF_MEMDIE oom: thaw the OOM victim if it is frozen Tetsuo Handa (1): oom: don't count on mm-less current process === This patch description: From: Michal Hocko OOM killer tries to exclude tasks which do not have mm_struct associated because killing such a task wouldn't help much. The OOM victim gets TIF_MEMDIE set to disable OOM killer while the current victim releases the memory and then enables the OOM killer again by dropping the flag. oom_kill_process is currently prone to a race condition when the OOM victim is already exiting and TIF_MEMDIE is set after the task releases its address space. This might theoretically lead to OOM livelock if the OOM victim blocks on an allocation later during exiting because it wouldn't kill any other process and the exiting one won't be able to exit. The situation is highly unlikely because the OOM victim is expected to release some memory which should help to sort out OOM situation. Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock which will serialize the OOM killer with exit_mm which sets task->mm to NULL. Setting the flag for current is not necessary because check and set is not racy.
Reported-by: Tetsuo Handa Signed-off-by: Michal Hocko Cc: David Rientjes Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds (cherry picked from commit 83363b917a2982dd509a5e2125e905b6873505a3) Signed-off-by: Vladimir Davydov Reviewed-by: Kirill Tkhai Conflicts: mm/oom_kill.c --- mm/oom_kill.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index fd9e13d..5ac5d96 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -432,11 +432,14 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + task_lock(p); + if (p->mm && p->flags & PF_EXITING) { set_tsk_thread_flag(p, TIF_MEMDIE); + task_unlock(p); put_task_struct(p); return; } + task_unlock(p); if (__ratelimit(&oom_rs)) dump_header(p, gfp_mask, order, memcg, nodemask); @@ -486,6 +489,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, /* mm cannot safely be dereferenced after task_unlock(victim) */ mm = victim->mm; + set_tsk_thread_flag(victim, TIF_MEMDIE); pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n", task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), @@ -517,7 +521,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, } rcu_read_unlock(); - set_tsk_thread_flag(victim, TIF_MEMDIE); do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true); mem_cgroup_note_oom_kill(memcg, victim); put_task_struct(victim);
Re: [Devel] [RFC] Fix get_exec_env() races
On Thu, Oct 15, 2015 at 05:21:04PM +0300, Kirill Tkhai wrote: > > > On 15.10.2015 14:15, Pavel Emelyanov wrote: > > > >> @@ -130,6 +131,34 @@ struct ve_struct { > >> #endif > >> }; > >> > >> +static inline struct ve_struct *get_exec_env(void) > >> +{ > >> + struct ve_struct *ve; > >> + > >> + if (++current->ve_attach_lock_depth > 1) > >> + return current->task_ve; > >> + > >> + rcu_read_lock(); > >> +again: > >> + ve = current->task_ve; > >> + read_lock(&ve->attach_lock); > >> + if (unlikely(current->task_ve != ve)) { > >> + read_unlock(&ve->attach_lock); > >> + goto again; > > > > Please, no. 3.10 kernel has task_work-s, ask the task you want to > > attach to ve to execute the work by moving itself into it and keep > > this small routine small and simple. > > cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), > so we can't wait attaching task till it complete the task work (it may > execute any code; to be locking cgroup_mutex, for example). Do we really want to wait until exec_env of the target task has changed? Anyway, I think we can use task_work even if we need to be synchronous. You just need two task works - one for the target and one for the caller. The former will change the target task's exec_env while the latter will wait for it to finish. > > Should we give a possibility (an interface) for userspace to get it know, > the task's finished ve changing? > > > >> + } > >> + rcu_read_unlock(); > >> + > >> + return ve; > >> +} > >> + > > >
[Devel] [NEW KERNEL] 3.10.0-229.7.2.vz7.8.8 (rhel7)
Changelog: OpenVZ kernel rh7-3.10.0-229.7.2.vz7.8.8 * lost hunk brought for memfd_create() syscall port Generated changelog: * Thu Oct 15 2015 Konstantin Khorenko [3.10.0-229.7.2.vz7.8.8] - ms/shm: add memfd_create() syscall: lost hunk (Konstantin Khorenko) [PSBM-39834] Built packages: http://kojistorage.eng.sw.ru/packages/vzkernel/3.10.0/229.7.2.vz7.8.8/
Re: [Devel] [PATCH rh7] ve: Kill ve_list_head and ve_struct::ve_list
On Thu, Sep 24, 2015 at 06:11:26PM +0300, Kirill Tkhai wrote: > Since we use ve_idr layer to reserve an id for a ve, > and since a ve is linked there, using of ve_list_head > just for linking VEs becomes redundant. Nevertheless, iterating over a list is more convenient than over idr IMO. > > This patch replaces ve_list_head in the places, we iterate > thru VEs list, with ve_idr mechanism, and kills the > duplicate manner. AFAICS this patch doesn't improve performance, nor does it make the code more readable IMHO, so personally I would refrain from merging it. Up to Konstantin. Also, see a few comments regarding the implementation below. > > Signed-off-by: Kirill Tkhai > --- ... > @@ -49,10 +49,9 @@ void vzmon_unregister_veaddr_print_cb(ve_seq_print_t); > int venet_init(void); > #endif > > -extern struct list_head ve_list_head; > -#define for_each_ve(ve) list_for_each_entry((ve), &ve_list_head, > ve_list) I wouldn't drop the macro. > extern struct mutex ve_list_lock; There's no ve_list, but there's still ve_list_lock. Confusing. Same for ve_list_add and ve_list_del. > extern struct ve_struct *get_ve_by_id(envid_t); > +extern struct idr ve_idr; > extern struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t > veid); > extern int ve_cgroup_remove(struct cgroup *root, envid_t veid); > ... > @@ -772,26 +768,24 @@ static int vestat_seq_show(struct seq_file *m, void *v) > > void *ve_seq_start(struct seq_file *m, loff_t *pos) > { > - struct ve_struct *curve; > - > - curve = get_exec_env(); > mutex_lock(&ve_list_lock); > - if (!ve_is_super(curve)) { > - if (*pos != 0) > - return NULL; > - return &curve->ve_list; > - } > > - return seq_list_start(&ve_list_head, *pos); > + return ve_seq_next(m, NULL, pos); I don't think it's correct to increment *pos in seq_start. Look at seq_read: if the buffer is too small to hold the first entry, we will jump over it instead of continuing reading it next time seq_read is called.
> } > EXPORT_SYMBOL(ve_seq_start); > > void *ve_seq_next(struct seq_file *m, void *v, loff_t *pos) > { > - if (!ve_is_super(get_exec_env())) > - return NULL; > - else > - return seq_list_next(v, &ve_list_head, pos); > + struct ve_struct *ve = get_exec_env(); > + int id = *pos; > + > + if (!ve_is_super(ve)) AFAICS you forgot to increment *pos here, which might result in the same entry being output multiple times inside a ve. > + return *pos ? NULL : ve; > + > + ve = idr_get_next(&ve_idr, &id); > + *pos = id + 1; > + > + return ve; > } > EXPORT_SYMBOL(ve_seq_next); >
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 17:44, Vladimir Davydov wrote: > On Thu, Oct 15, 2015 at 05:21:04PM +0300, Kirill Tkhai wrote: >> >> >> On 15.10.2015 14:15, Pavel Emelyanov wrote: >>> @@ -130,6 +131,34 @@ struct ve_struct { #endif }; +static inline struct ve_struct *get_exec_env(void) +{ + struct ve_struct *ve; + + if (++current->ve_attach_lock_depth > 1) + return current->task_ve; + + rcu_read_lock(); +again: + ve = current->task_ve; + read_lock(&ve->attach_lock); + if (unlikely(current->task_ve != ve)) { + read_unlock(&ve->attach_lock); + goto again; >>> >>> Please, no. 3.10 kernel has task_work-s, ask the task you want to >>> attach to ve to execute the work by moving itself into it and keep >>> this small routine small and simple. >> >> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), >> so we can't wait attaching task till it complete the task work (it may >> execute any code; to be locking cgroup_mutex, for example). > > Do we really want to wait until exec_env of the target task has changed? > > Anyway, I think we can use task_work even if we need to be synchronous. > You just need two task works - one for the target and one for the > caller. The former will change the target task's exec_env while the > latter will wait for it to finish. Hm. Maybe, it would be a good idea if cgroup_attach_task() could not be called several times at once. Every attach requires a separate task_work, so it needs additional memory. This complicates things. >> >> Should we give a possibility (an interface) for userspace to get it know, >> the task's finished ve changing? >> >> + } + rcu_read_unlock(); + + return ve; +} + >>> >>
Re: [Devel] [RFC] Fix get_exec_env() races
On 15.10.2015 19:49, Kirill Tkhai wrote: > > > On 15.10.2015 17:44, Vladimir Davydov wrote: >> On Thu, Oct 15, 2015 at 05:21:04PM +0300, Kirill Tkhai wrote: >>> >>> >>> On 15.10.2015 14:15, Pavel Emelyanov wrote: > @@ -130,6 +131,34 @@ struct ve_struct { > #endif > }; > > +static inline struct ve_struct *get_exec_env(void) > +{ > + struct ve_struct *ve; > + > + if (++current->ve_attach_lock_depth > 1) > + return current->task_ve; > + > + rcu_read_lock(); > +again: > + ve = current->task_ve; > + read_lock(&ve->attach_lock); > + if (unlikely(current->task_ve != ve)) { > + read_unlock(&ve->attach_lock); > + goto again; Please, no. 3.10 kernel has task_work-s, ask the task you want to attach to ve to execute the work by moving itself into it and keep this small routine small and simple. >>> >>> cgroup_attach_task() is called under cgroup_mutex and threadgroup_lock(), >>> so we can't wait attaching task till it complete the task work (it may >>> execute any code; to be locking cgroup_mutex, for example). >> >> Do we really want to wait until exec_env of the target task has changed? >> >> Anyway, I think we can use task_work even if we need to be synchronous. >> You just need two task works - one for the target and one for the >> caller. The former will change the target task's exec_env while the >> latter will wait for it to finish. > > Hm. Maybe, it would be a good idea if cgroup_attach_task() could not be > called several times at once. Every attach requires a separate task_work, > so it needs additional memory. This complicates the thing. Though, we may wait on a counter of the number of all processes and threads. Looks OK. >>> >>> Should we give a possibility (an interface) for userspace to get it know, >>> the task's finished ve changing? >>> >>> > + } > + rcu_read_unlock(); > + > + return ve; > +} > + >>>