[Devel] [PATCH RHEL7 COMMIT] ve/net: Add VE_NF_CONNTRACK check in resolve_normal_ct()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.2
--
commit 1ae3e69714effdf80dd8306271096d86607608b1
Author: Kirill Tkhai ktk...@odin.com
Date:   Thu Aug 27 20:32:43 2015 +0400

    ve/net: Add VE_NF_CONNTRACK check in resolve_normal_ct()

    This is a missed hunk from diff-ve-net-netfilter-combined.

    https://jira.sw.ru/browse/PSBM-35154

    Signed-off-by: Kirill Tkhai ktk...@odin.com
---
 net/netfilter/nf_conntrack_core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index bcd215d..33a6e9c 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1061,6 +1061,9 @@ resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
 	u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE;
 	u32 hash;
 
+	if (!net_ipt_permitted(net, VE_NF_CONNTRACK))
+		return NULL;
+
 	if (!nf_ct_get_tuple(skb, skb_network_offset(skb),
 			     dataoff, l3num, protonum, &tuple, l3proto,
 			     l4proto)) {
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
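The hunk above adds an early exit: if the container that owns the network namespace has not been granted the conntrack feature, resolve_normal_ct() returns NULL before any tuple parsing. The following is a minimal userspace sketch of that guard pattern only. The struct layout, the mask value, and the helper names (feature_permitted, resolve_ct) are invented for illustration and are not the vzkernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bit; the real VE_NF_CONNTRACK value is not shown in the patch. */
#define FEATURE_CONNTRACK (1u << 0)

/* Hypothetical stand-in for a per-container network namespace. */
struct fake_net {
	unsigned int feature_mask;	/* features granted to the owning container */
};

struct fake_conn {
	int id;
};

/* Models a net_ipt_permitted()-style check: is the feature bit granted? */
static int feature_permitted(const struct fake_net *net, unsigned int feature)
{
	return (net->feature_mask & feature) != 0;
}

/*
 * Models the patched control flow: return NULL before any per-packet
 * work when the feature is not permitted for this namespace.
 */
static struct fake_conn *resolve_ct(const struct fake_net *net,
				    struct fake_conn *candidate)
{
	if (!feature_permitted(net, FEATURE_CONNTRACK))
		return NULL;

	/* Tuple extraction and hash lookup would happen here. */
	return candidate;
}
```

Placing the check first means a container without the feature never pays for tuple extraction or the hash lookup.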
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_ref_cancel_init()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0873bd8f500347f34f06ddad0fbf024df91f8add
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:24 2015 +0400

    ms/percpu-refcount: implement percpu_ref_cancel_init()

    Patchset description:

    Pulling upstream patches converting css refcnt to percpu_ref.

    https://jira.sw.ru/browse/PSBM-34174

    Kent Overstreet (2):
      percpu: implement generic percpu refcounting
      percpu-refcount: Don't use silly cmpxchg()

    Tejun Heo (9):
      percpu-refcount: consistently use plain (non-sched) RCU
      percpu-refcount: cosmetic updates
      percpu-refcount: add __must_check to percpu_ref_init() and don't use
        ACCESS_ONCE() in percpu_ref_kill_rcu()
      percpu-refcount: implement percpu_ref_cancel_init()
      percpu-refcount: implement percpu_tryget() along with
        percpu_ref_kill_and_confirm()
      percpu-refcount: use RCU-sched insted of normal RCU
      cgroup: reorder the operations in cgroup_destroy_locked()
      cgroup: split cgroup destruction into two steps
      cgroup: use percpu refcnt for cgroup_subsys_states

    ===
    This patch description:

    From: Tejun Heo t...@kernel.org

    Normally, percpu_ref_init() initializes and percpu_ref_kill()
    initiates destruction which completes asynchronously. The
    asynchronous destruction can be problematic in init failure path where
    the caller wants to destroy half-constructed object - distinguishing
    half-constructed objects from the usual release method can be painful
    for complex objects.

    This patch implements percpu_ref_cancel_init() which synchronously
    destroys the percpu_ref without invoking release. To avoid
    unintentional misuses, the function requires the ref to have finished
    percpu_ref_init() but never used and triggers WARN otherwise.

    v2: Explain the weird name and usage restriction in the function
        comment.

    Signed-off-by: Tejun Heo t...@kernel.org
    Acked-by: Kent Overstreet koverstr...@google.com
    (cherry picked from commit bc497bd33b2d6a6f07bc8574b4764edbd7fdffa8)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 1 +
 lib/percpu-refcount.c | 31 +++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 8146aa9..6d843d6 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,6 +68,7 @@ struct percpu_ref {
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release);
+void percpu_ref_cancel_init(struct percpu_ref *ref);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS	2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index b35eaac..ebeaac2 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -54,6 +54,37 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 	return 0;
 }
 
+/**
+ * percpu_ref_cancel_init - cancel percpu_ref_init()
+ * @ref: percpu_ref to cancel init for
+ *
+ * Once a percpu_ref is initialized, its destruction is initiated by
+ * percpu_ref_kill() and completes asynchronously, which can be painful to
+ * do when destroying a half-constructed object in init failure path.
+ *
+ * This function destroys @ref without invoking @ref->release and the
+ * memory area containing it can be freed immediately on return. To
+ * prevent accidental misuse, it's required that @ref has finished
+ * percpu_ref_init(), whether successful or not, but never used.
+ *
+ * The weird name and usage restriction are to prevent people from using
+ * this function by mistake for normal shutdown instead of
+ * percpu_ref_kill().
+ */
+void percpu_ref_cancel_init(struct percpu_ref *ref)
+{
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
+	int cpu;
+
+	WARN_ON_ONCE(atomic_read(&ref->count) != 1 + PCPU_COUNT_BIAS);
+
+	if (pcpu_count) {
+		for_each_possible_cpu(cpu)
+			WARN_ON_ONCE(*per_cpu_ptr(pcpu_count, cpu));
+		free_percpu(ref->pcpu_count);
+	}
+}
+
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
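To illustrate the init failure path that percpu_ref_cancel_init() is meant for, here is a rough userspace model. It is only a sketch: the per-CPU counters are replaced by a plain array, the atomic count by an unsigned int, and every name prefixed with model_ is hypothetical. Where the kernel code triggers WARN_ON_ONCE(), the model returns -1 instead.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_MODEL 4			/* assumption: fixed CPU count for the model */
#define COUNT_BIAS (1u << 31)		/* mirrors the idea behind PCPU_COUNT_BIAS */

struct model_ref {
	unsigned int count;		/* models the atomic count: bias + initial ref */
	unsigned int *pcpu_count;	/* one slot per modelled CPU, or NULL */
};

/* Models percpu_ref_init(): it allocates, so it may fail. */
static int model_ref_init(struct model_ref *ref)
{
	ref->count = 1 + COUNT_BIAS;
	ref->pcpu_count = calloc(NR_CPUS_MODEL, sizeof(*ref->pcpu_count));
	return ref->pcpu_count ? 0 : -1;
}

/*
 * Models percpu_ref_cancel_init(): synchronous teardown of a ref that was
 * initialized but never used. Returns 0 if the ref was pristine and -1 if
 * it had been used (the case where the kernel code would WARN).
 */
static int model_ref_cancel_init(struct model_ref *ref)
{
	int misuse = (ref->count != 1 + COUNT_BIAS);
	int cpu;

	if (ref->pcpu_count) {
		for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
			if (ref->pcpu_count[cpu])
				misuse = 1;
		free(ref->pcpu_count);
		ref->pcpu_count = NULL;
	}
	return misuse ? -1 : 0;
}

struct model_obj {
	struct model_ref ref;
};

/* Object constructor whose later steps may fail after the ref is set up. */
static int model_obj_init(struct model_obj *obj, int fail_late)
{
	int err = model_ref_init(&obj->ref);

	if (!err && fail_late)
		err = -1;	/* pretend a later construction step failed */
	if (err) {
		/* No release callback runs; the object can be freed right away. */
		model_ref_cancel_init(&obj->ref);
		return err;
	}
	return 0;
}
```

The point of the pattern is that the error path never has to go through the asynchronous kill and release sequence for an object nobody else has seen yet.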
[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: split cgroup destruction into two steps
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 33f3496e5d1342b4497058d017261d3b3fde0fe1
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:26 2015 +0400

    ms/cgroup: split cgroup destruction into two steps

    Patchset description:

    Pulling upstream patches converting css refcnt to percpu_ref.

    https://jira.sw.ru/browse/PSBM-34174

    Kent Overstreet (2):
      percpu: implement generic percpu refcounting
      percpu-refcount: Don't use silly cmpxchg()

    Tejun Heo (9):
      percpu-refcount: consistently use plain (non-sched) RCU
      percpu-refcount: cosmetic updates
      percpu-refcount: add __must_check to percpu_ref_init() and don't use
        ACCESS_ONCE() in percpu_ref_kill_rcu()
      percpu-refcount: implement percpu_ref_cancel_init()
      percpu-refcount: implement percpu_tryget() along with
        percpu_ref_kill_and_confirm()
      percpu-refcount: use RCU-sched insted of normal RCU
      cgroup: reorder the operations in cgroup_destroy_locked()
      cgroup: split cgroup destruction into two steps
      cgroup: use percpu refcnt for cgroup_subsys_states

    ===
    This patch description:

    From: Tejun Heo t...@kernel.org

    Split cgroup_destroy_locked() into two steps and put the latter half
    into cgroup_offline_fn() which is executed from a work item. The
    latter half is responsible for offlining the css's, removing the
    cgroup from internal lists, and propagating release notification to
    the parent. The separation is to allow using percpu refcnt for css.

    Note that this allows for other cgroup operations to happen between
    the first and second halves of destruction, including creating a new
    cgroup with the same name. As the target cgroup is marked DEAD in the
    first half and cgroup internals don't care about the names of cgroups,
    this should be fine. A comment explaining this will be added by the
    next patch which implements the actual percpu refcnting.

    As RCU freeing is guaranteed to happen after the second step of
    destruction, we can use the same work item for both. This patch
    renames cgroup->free_work to ->destroy_work and uses it for both
    purposes. INIT_WORK() is now performed right before queueing the work
    item.

    Signed-off-by: Tejun Heo t...@kernel.org
    Acked-by: Li Zefan lize...@huawei.com
    (cherry picked from commit ea15f8ccdb430af1e8bc9b4e19a230eb4c356777)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com

    Conflicts:
    	kernel/cgroup.c
---
 include/linux/cgroup.h | 2 +-
 kernel/cgroup.c | 25 -
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 626bc84..d34c42b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -259,7 +259,7 @@ struct cgroup {
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
-	struct work_struct free_work;
+	struct work_struct destroy_work;
 
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 062e0f4..6fd7038 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -213,6 +213,7 @@ static struct cgroup_name root_cgroup_name = { .name = "/" };
  */
 static int need_forkexit_callback __read_mostly;
 
+static void cgroup_offline_fn(struct work_struct *work);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
 			      struct cftype cfts[], bool is_add);
@@ -836,7 +837,7 @@ static struct cgroup_name *cgroup_alloc_name(struct dentry *dentry)
 
 static void cgroup_free_fn(struct work_struct *work)
 {
-	struct cgroup *cgrp = container_of(work, struct cgroup, free_work);
+	struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
 	struct cgroup_subsys *ss;
 
 	mutex_lock(&cgroup_mutex);
@@ -881,7 +882,8 @@ static void cgroup_free_rcu(struct rcu_head *head)
 {
 	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
 
-	queue_work(cgroup_destroy_wq, &cgrp->free_work);
+	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
+	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -1416,7 +1418,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->allcg_node);
 	INIT_LIST_HEAD(&cgrp->release_list);
 	INIT_LIST_HEAD(&cgrp->pidlists);
-	INIT_WORK(&cgrp->free_work, cgroup_free_fn);
 	mutex_init(&cgrp->pidlist_mutex);
 	INIT_LIST_HEAD(&cgrp->event_list);
 	spin_lock_init(&cgrp->event_list_lock);
@@ -4355,7 +4356,6 @@ static int
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: Don't use silly cmpxchg()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 337bb797aa4aa5eca030d634d0a9874290511db5
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:21 2015 +0400

    ms/percpu-refcount: Don't use silly cmpxchg()

    Patchset description:

    Pulling upstream patches converting css refcnt to percpu_ref.

    https://jira.sw.ru/browse/PSBM-34174

    Kent Overstreet (2):
      percpu: implement generic percpu refcounting
      percpu-refcount: Don't use silly cmpxchg()

    Tejun Heo (9):
      percpu-refcount: consistently use plain (non-sched) RCU
      percpu-refcount: cosmetic updates
      percpu-refcount: add __must_check to percpu_ref_init() and don't use
        ACCESS_ONCE() in percpu_ref_kill_rcu()
      percpu-refcount: implement percpu_ref_cancel_init()
      percpu-refcount: implement percpu_tryget() along with
        percpu_ref_kill_and_confirm()
      percpu-refcount: use RCU-sched insted of normal RCU
      cgroup: reorder the operations in cgroup_destroy_locked()
      cgroup: split cgroup destruction into two steps
      cgroup: use percpu refcnt for cgroup_subsys_states

    ===
    This patch description:

    From: Kent Overstreet koverstr...@google.com

    The cmpxchg() was just to ensure the debug check didn't race, which was
    a bit excessive. The caller is supposed to do the appropriate
    synchronization, which means percpu_ref_kill() can just do a simple
    store.

    Signed-off-by: Kent Overstreet koverstr...@google.com
    Signed-off-by: Tejun Heo t...@kernel.org
    (cherry picked from commit c1ae6e9b4db00023b9caed72af49a93abad46452)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 lib/percpu-refcount.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 6f0ffd7..1a17399 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -107,22 +107,11 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
  */
 void percpu_ref_kill(struct percpu_ref *ref)
 {
-	unsigned __percpu *pcpu_count, *old, *new;
+	WARN_ONCE(REF_STATUS(ref->pcpu_count) == PCPU_REF_DEAD,
+		  "percpu_ref_kill() called more than once!\n");
 
-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
-	do {
-		if (REF_STATUS(pcpu_count) == PCPU_REF_DEAD) {
-			WARN(1, "percpu_ref_kill() called more than once!\n");
-			return;
-		}
-
-		old = pcpu_count;
-		new = (unsigned __percpu *)
-			(((unsigned long) pcpu_count)|PCPU_REF_DEAD);
-
-		pcpu_count = cmpxchg(&ref->pcpu_count, old, new);
-	} while (pcpu_count != old);
+	ref->pcpu_count = (unsigned __percpu *)
+		(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);
 
 	call_rcu(&ref->rcu, percpu_ref_kill_rcu);
 }
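The patch above relies on the status being stored in the low bits of the pcpu_count pointer, so that marking the ref dead is a single store of the tagged pointer. A small userspace sketch of that pointer-tagging idea follows. The names are hypothetical, the per-CPU pointer is replaced by an ordinary pointer, and the double_kills counter stands in for what WARN_ONCE() would report. As in the patch, the caller is assumed to serialize calls, so no cmpxchg loop is needed.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_BITS 2
#define STATUS_MASK ((1ul << STATUS_BITS) - 1)
#define REF_PTR  0ul	/* live: pointer is untagged */
#define REF_DEAD 1ul	/* killed: low bit set */

struct model_ref {
	unsigned int *pcpu_count;	/* low bits carry the status tag */
	int double_kills;		/* counts kills of an already dead ref */
};

/* Extract the status tag from the low bits of the pointer. */
static unsigned long ref_status(unsigned int *p)
{
	return (uintptr_t)p & STATUS_MASK;
}

/* Recover the usable pointer by masking the tag bits out. */
static unsigned int *ref_untagged(unsigned int *p)
{
	return (unsigned int *)((uintptr_t)p & ~(uintptr_t)STATUS_MASK);
}

/*
 * Models the simplified percpu_ref_kill(): one debug check, then a plain
 * store of the tagged pointer. Callers must not race with each other.
 */
static void model_ref_kill(struct model_ref *ref)
{
	if (ref_status(ref->pcpu_count) == REF_DEAD)
		ref->double_kills++;

	ref->pcpu_count = (unsigned int *)
		((uintptr_t)ref->pcpu_count | REF_DEAD);
}
```

The tag only works because the counter storage is aligned to at least four bytes, which leaves the two low pointer bits free.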
[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: reorder the operations in cgroup_destroy_locked()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit ce835adec25190f76a26cc97f1a38aadc93a4957
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:25 2015 +0400

    ms/cgroup: reorder the operations in cgroup_destroy_locked()

    Patchset description:

    Pulling upstream patches converting css refcnt to percpu_ref.

    https://jira.sw.ru/browse/PSBM-34174

    Kent Overstreet (2):
      percpu: implement generic percpu refcounting
      percpu-refcount: Don't use silly cmpxchg()

    Tejun Heo (9):
      percpu-refcount: consistently use plain (non-sched) RCU
      percpu-refcount: cosmetic updates
      percpu-refcount: add __must_check to percpu_ref_init() and don't use
        ACCESS_ONCE() in percpu_ref_kill_rcu()
      percpu-refcount: implement percpu_ref_cancel_init()
      percpu-refcount: implement percpu_tryget() along with
        percpu_ref_kill_and_confirm()
      percpu-refcount: use RCU-sched insted of normal RCU
      cgroup: reorder the operations in cgroup_destroy_locked()
      cgroup: split cgroup destruction into two steps
      cgroup: use percpu refcnt for cgroup_subsys_states

    ===
    This patch description:

    From: Tejun Heo t...@kernel.org

    This patch reorders the operations in cgroup_destroy_locked() such
    that the userland visible parts happen before css offlining and
    removal from the ->sibling list. This will be used to make css use
    percpu refcnt.

    While at it, split out CGRP_DEAD related comment from the refcnt
    deactivation one and correct / clarify how different guarantees are
    met.

    While this patch changes the specific order of operations, it
    shouldn't cause any noticeable behavior difference.

    Signed-off-by: Tejun Heo t...@kernel.org
    Acked-by: Li Zefan lize...@huawei.com
    (cherry picked from commit 455050d23e1bfc47ca98e943ad5b2f3a9bbe45fb)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com

    Conflicts:
    	kernel/cgroup.c
---
 kernel/cgroup.c | 48 ++--
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b073fba..062e0f4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4367,9 +4367,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 
 	/*
 	 * Block new css_tryget() by deactivating refcnt and mark @cgrp
-	 * removed. This makes future css_tryget() and child creation
-	 * attempts fail thus maintaining the removal conditions verified
-	 * above.
+	 * removed. This makes future css_tryget() attempts fail which we
+	 * guarantee to ->css_offline() callbacks.
 	 */
 	for_each_subsys(cgrp->root, ss) {
 		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -4379,6 +4378,30 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	}
 	set_bit(CGRP_REMOVED, &cgrp->flags);
 
+	raw_spin_lock(&release_list_lock);
+	if (!list_empty(&cgrp->release_list))
+		list_del_init(&cgrp->release_list);
+	raw_spin_unlock(&release_list_lock);
+
+	/*
+	 * Remove @cgrp directory. The removal puts the base ref but we
+	 * aren't quite done with @cgrp yet, so hold onto it.
+	 */
+	dget(d);
+	cgroup_d_remove_dir(d);
+
+	/*
+	 * Unregister events and notify userspace.
+	 * Notify userspace about cgroup removing only after rmdir of cgroup
+	 * directory to avoid race between userspace and kernelspace.
+	 */
+	spin_lock(&cgrp->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+		list_del_init(&event->list);
+		schedule_work(&event->remove);
+	}
+	spin_unlock(&cgrp->event_list_lock);
+
 	/* tell subsystems to initate destruction */
 	for_each_subsys(cgrp->root, ss)
 		offline_css(ss, cgrp);
@@ -4393,34 +4416,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	for_each_subsys(cgrp->root, ss)
 		css_put(cgrp->subsys[ss->subsys_id]);
 
-	raw_spin_lock(&release_list_lock);
-	if (!list_empty(&cgrp->release_list))
-		list_del_init(&cgrp->release_list);
-	raw_spin_unlock(&release_list_lock);
-
 	/* delete this cgroup from parent->children */
 	list_del_rcu(&cgrp->sibling);
 	list_del_init(&cgrp->allcg_node);
 
-	dget(d);
-	cgroup_d_remove_dir(d);
 	dput(d);
 
 	set_bit(CGRP_RELEASABLE, &parent->flags);
 	check_for_release(parent);
 
-	/*
-	 * Unregister events and notify userspace.
-	 * Notify userspace about cgroup removing only after rmdir of cgroup
-	 * directory to avoid race between userspace and kernelspace.
-	 */
-	spin_lock(&cgrp->event_list_lock);
-
[Devel] [PATCH RHEL7 COMMIT] ve/devpts: Revert 2c27d20125f5
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 99a71c6ceb41b6c8256620c4db844f7395f2a2c9
Author: Cyrill Gorcunov gorcu...@gmail.com
Date:   Fri Aug 28 14:14:08 2015 +0400

    ve/devpts: Revert 2c27d20125f5

    Here we revert 2c27d20125f5 (ve/devpts: cleanup per-VE creation),
    making the code close to the vanilla one. The devpts code is tuned
    a bit in the next patch, though less intrusively.

    https://jira.sw.ru/browse/PSBM-34931

    Signed-off-by: Cyrill Gorcunov gorcu...@virtuozzo.com
    CC: Vladimir Davydov vdavy...@virtuozzo.com
    CC: Andrey Vagin ava...@virtuozzo.com
    CC: Konstantin Khorenko khore...@virtuozzo.com
    CC: Pavel Emelyanov xe...@virtuozzo.com
---
 fs/devpts/inode.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 3dcd4da..be0fb74 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -402,6 +402,20 @@ fail:
 }
 
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+static int test_devpts_sb(struct super_block *s, void *p)
+{
+	return get_exec_env()->devpts_sb == s;
+}
+
+static int set_devpts_sb(struct super_block *s, void *p)
+{
+	int error = set_anon_super(s, p);
+	if (!error) {
+		atomic_inc(&s->s_active);
+		get_exec_env()->devpts_sb = s;
+	}
+	return error;
+}
 
 /*
  * devpts_mount()
@@ -436,7 +450,6 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	int error;
 	struct pts_mount_opts opts;
 	struct super_block *s;
-	struct dentry *root;
 
 	error = parse_mount_options(data, PARSE_MOUNT, &opts);
 	if (error)
@@ -450,29 +463,29 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EINVAL);
 
 	if (opts.newinstance)
-		root = mount_nodev(fs_type, flags, data, devpts_fill_super);
+		s = sget(fs_type, NULL, set_anon_super, flags, NULL);
 	else
-		root = mount_ns(fs_type, flags, data, get_exec_env(), devpts_fill_super);
+		s = sget(fs_type, test_devpts_sb, set_devpts_sb, flags, NULL);
+
+	if (IS_ERR(s))
+		return ERR_CAST(s);
 
-	if (IS_ERR(root))
-		return ERR_CAST(root);
+	if (!s->s_root) {
+		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+		if (error)
+			goto out_undo_sget;
+		s->s_flags |= MS_ACTIVE;
+	}
 
-	s = root->d_sb;
 	memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts));
 
 	error = mknod_ptmx(s);
 	if (error)
 		goto out_undo_sget;
 
-	if (!opts.newinstance) {
-		atomic_inc(&s->s_active);
-		get_exec_env()->devpts_sb = s;
-	}
-
-	return root;
+	return dget(s->s_root);
 
 out_undo_sget:
-	dput(root);
 	deactivate_locked_super(s);
 	return ERR_PTR(error);
 }
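The restored code drives sget() with a pair of callbacks: a test callback that recognizes the superblock already owned by the current container, and a set callback that claims a fresh one. A loose userspace model of that find-or-create pattern is sketched below. Everything here is hypothetical: the table, the model_ names, and passing the container id through the data argument (the patch itself passes NULL and reads get_exec_env() inside the callbacks). Reference counting on the reuse path is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SB 8			/* assumption: tiny fixed-size table */

struct model_sb {
	int in_use;
	int owner;			/* container id, or -1 for anonymous */
	int active;			/* models the extra s_active reference */
};

typedef int (*model_test_fn)(struct model_sb *sb, void *data);
typedef int (*model_set_fn)(struct model_sb *sb, void *data);

static struct model_sb sb_table[MAX_SB];

/* Find an existing entry accepted by test(), else claim a free one via set(). */
static struct model_sb *model_sget(model_test_fn test, model_set_fn set,
				   void *data)
{
	int i;

	if (test) {
		for (i = 0; i < MAX_SB; i++)
			if (sb_table[i].in_use && test(&sb_table[i], data))
				return &sb_table[i];
	}
	for (i = 0; i < MAX_SB; i++) {
		if (sb_table[i].in_use)
			continue;
		if (set(&sb_table[i], data))
			return NULL;
		sb_table[i].in_use = 1;
		return &sb_table[i];
	}
	return NULL;
}

/* Models test_devpts_sb(): does this entry belong to the given container? */
static int test_owner(struct model_sb *sb, void *data)
{
	return sb->owner == *(int *)data;
}

/* Models set_devpts_sb(): record the owner and pin the entry. */
static int set_owner(struct model_sb *sb, void *data)
{
	sb->owner = *(int *)data;
	sb->active++;
	return 0;
}

/* Models the newinstance case: a private entry that is never shared. */
static int set_anon(struct model_sb *sb, void *data)
{
	(void)data;
	sb->owner = -1;
	return 0;
}
```

With a NULL test callback every call creates a new entry, which corresponds to the newinstance branch; with the owner test, repeated mounts from the same container resolve to the same entry.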
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 4149fa7beae723cd745672c749ed0a94f7f672a4
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:24 2015 +0400

    ms/percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()

    Patchset description:

    Pulling upstream patches converting css refcnt to percpu_ref.

    https://jira.sw.ru/browse/PSBM-34174

    Kent Overstreet (2):
      percpu: implement generic percpu refcounting
      percpu-refcount: Don't use silly cmpxchg()

    Tejun Heo (9):
      percpu-refcount: consistently use plain (non-sched) RCU
      percpu-refcount: cosmetic updates
      percpu-refcount: add __must_check to percpu_ref_init() and don't use
        ACCESS_ONCE() in percpu_ref_kill_rcu()
      percpu-refcount: implement percpu_ref_cancel_init()
      percpu-refcount: implement percpu_tryget() along with
        percpu_ref_kill_and_confirm()
      percpu-refcount: use RCU-sched insted of normal RCU
      cgroup: reorder the operations in cgroup_destroy_locked()
      cgroup: split cgroup destruction into two steps
      cgroup: use percpu refcnt for cgroup_subsys_states

    ===
    This patch description:

    From: Tejun Heo t...@kernel.org

    Implement percpu_tryget() which stops giving out references once the
    percpu_ref is visible as killed. Because the refcnt is per-cpu,
    different CPUs will start to see a refcnt as killed at different
    points in time and tryget() may continue to succeed on subset of cpus
    for a while after percpu_ref_kill() returns.

    For use cases where it's necessary to know when all CPUs start to see
    the refcnt as dead, percpu_ref_kill_and_confirm() is added. The new
    function takes an extra argument @confirm_kill which is invoked when
    the refcnt is guaranteed to be viewed as killed on all CPUs.

    While this isn't the prettiest interface, it doesn't force synchronous
    wait and is much safer than requiring the caller to do its own
    call_rcu().

    v2: Patch description rephrased to emphasize that tryget() may
        continue to succeed on some CPUs after kill() returns as suggested
        by Kent.

    v3: Function comment in percpu_ref_kill_and_confirm() updated warning
        people to not depend on the implied RCU grace period from the
        confirm callback as it's an implementation detail.

    Signed-off-by: Tejun Heo t...@kernel.org
    Slightly-Grumpily-Acked-by: Kent Overstreet koverstr...@google.com
    (cherry picked from commit dbece3a0f1ef0b19aff1cc6ed0942fec9ab98de1)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 50 -
 lib/percpu-refcount.c | 23 ++-
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 6d843d6..dd2a086 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -63,13 +63,30 @@ struct percpu_ref {
 	 */
 	unsigned __percpu	*pcpu_count;
 	percpu_ref_func_t	*release;
+	percpu_ref_func_t	*confirm_kill;
 	struct rcu_head		rcu;
 };
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release);
 void percpu_ref_cancel_init(struct percpu_ref *ref);
-void percpu_ref_kill(struct percpu_ref *ref);
+void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
+				 percpu_ref_func_t *confirm_kill);
+
+/**
+ * percpu_ref_kill - drop the initial ref
+ * @ref: percpu_ref to kill
+ *
+ * Must be used to drop the initial ref on a percpu refcount; must be called
+ * precisely once before shutdown.
+ *
+ * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
+ * percpu counters and dropping the initial ref.
+ */
+static inline void percpu_ref_kill(struct percpu_ref *ref)
+{
+	return percpu_ref_kill_and_confirm(ref, NULL);
+}
 
 #define PCPU_STATUS_BITS	2
 #define PCPU_STATUS_MASK	((1 << PCPU_STATUS_BITS) - 1)
@@ -101,6 +118,37 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }
 
 /**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless it has already been killed. Returns
+ * %true on success; %false on failure.
+ *
+ * Completion of percpu_ref_kill() in itself doesn't guarantee that tryget
+ * will fail. For such guarantee, percpu_ref_kill_and_confirm() should be
+ * used. After the confirm_kill callback is invoked, it's guaranteed that
+ * no new reference will be given out by percpu_ref_tryget().
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+	unsigned
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 82f6802b3f09878172024c57ed12cf2da92cccd3
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:23 2015 +0400

    ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()

    Patchset description:

    Pulling upstream patches converting css refcnt to percpu_ref.

    https://jira.sw.ru/browse/PSBM-34174

    Kent Overstreet (2):
      percpu: implement generic percpu refcounting
      percpu-refcount: Don't use silly cmpxchg()

    Tejun Heo (9):
      percpu-refcount: consistently use plain (non-sched) RCU
      percpu-refcount: cosmetic updates
      percpu-refcount: add __must_check to percpu_ref_init() and don't use
        ACCESS_ONCE() in percpu_ref_kill_rcu()
      percpu-refcount: implement percpu_ref_cancel_init()
      percpu-refcount: implement percpu_tryget() along with
        percpu_ref_kill_and_confirm()
      percpu-refcount: use RCU-sched insted of normal RCU
      cgroup: reorder the operations in cgroup_destroy_locked()
      cgroup: split cgroup destruction into two steps
      cgroup: use percpu refcnt for cgroup_subsys_states

    ===
    This patch description:

    From: Tejun Heo t...@kernel.org

    Two small changes.

    * Unlike most init functions, percpu_ref_init() allocates memory and
      may fail. Let's mark it with __must_check in case the caller
      forgets.

    * percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
      dereference @ref->pcpu_count, which can be misleading. The pointer
      is guaranteed to be valid and visible and can't change underneath
      the function. Drop ACCESS_ONCE().

    Signed-off-by: Tejun Heo t...@kernel.org
    (cherry picked from commit acac7883ee7bcc32476963bce7baf73d44574dd1)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 3 ++-
 lib/percpu-refcount.c | 4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index b61bd6f..8146aa9 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -66,7 +66,8 @@ struct percpu_ref {
 	struct rcu_head		rcu;
 };
 
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
+int __must_check percpu_ref_init(struct percpu_ref *ref,
+				 percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS	2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9a78e55..b35eaac 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -57,12 +57,10 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
-	unsigned __percpu *pcpu_count;
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
 	unsigned count = 0;
 	int cpu;
 
-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
 	/* Mask out PCPU_REF_DEAD */
 	pcpu_count = (unsigned __percpu *)
 		(((unsigned long) pcpu_count) & ~PCPU_STATUS_MASK);
[Devel] [PATCH RHEL7 COMMIT] ms/percpu: implement generic percpu refcounting
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit b5ec5570459334e56491e564b567cc5bed16181e
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:21 2015 +0400

    ms/percpu: implement generic percpu refcounting

    Patchset description:

    Pulling upstream patches converting css refcnt to percpu_ref.

    https://jira.sw.ru/browse/PSBM-34174

    Kent Overstreet (2):
      percpu: implement generic percpu refcounting
      percpu-refcount: Don't use silly cmpxchg()

    Tejun Heo (9):
      percpu-refcount: consistently use plain (non-sched) RCU
      percpu-refcount: cosmetic updates
      percpu-refcount: add __must_check to percpu_ref_init() and don't use
        ACCESS_ONCE() in percpu_ref_kill_rcu()
      percpu-refcount: implement percpu_ref_cancel_init()
      percpu-refcount: implement percpu_tryget() along with
        percpu_ref_kill_and_confirm()
      percpu-refcount: use RCU-sched insted of normal RCU
      cgroup: reorder the operations in cgroup_destroy_locked()
      cgroup: split cgroup destruction into two steps
      cgroup: use percpu refcnt for cgroup_subsys_states

    ===
    This patch description:

    From: Kent Overstreet koverstr...@google.com

    This implements a refcount with similar semantics to
    atomic_get()/atomic_dec_and_test() - but percpu.

    It also implements two stage shutdown, as we need it to tear down the
    percpu counts. Before dropping the initial refcount, you must call
    percpu_ref_kill(); this puts the refcount in shutting down mode and
    switches back to a single atomic refcount with the appropriate
    barriers (synchronize_rcu()).

    It's also legal to call percpu_ref_kill() multiple times - it only
    returns true once, so callers don't have to reimplement shutdown
    synchronization.

    [a...@linux-foundation.org: fix build]
    [a...@linux-foundation.org: coding-style tweak]
    Signed-off-by: Kent Overstreet koverstr...@google.com
    Cc: Zach Brown z...@redhat.com
    Cc: Felipe Balbi ba...@ti.com
    Cc: Greg Kroah-Hartman gre...@linuxfoundation.org
    Cc: Mark Fasheh mfas...@suse.com
    Cc: Joel Becker jl...@evilplan.org
    Cc: Rusty Russell ru...@rustcorp.com.au
    Cc: Jens Axboe ax...@kernel.dk
    Cc: Asai Thambi S P asamymuth...@micron.com
    Cc: Selvan Mani sm...@micron.com
    Cc: Sam Bradshaw sbrads...@micron.com
    Cc: Jeff Moyer jmo...@redhat.com
    Cc: Al Viro v...@zeniv.linux.org.uk
    Cc: Benjamin LaHaise b...@kvack.org
    Cc: Tejun Heo t...@kernel.org
    Cc: Oleg Nesterov o...@redhat.com
    Cc: Christoph Lameter c...@linux-foundation.org
    Cc: Ingo Molnar mi...@redhat.com
    Reviewed-by: Theodore Ts'o ty...@mit.edu
    Signed-off-by: Tejun Heo t...@kernel.org
    (cherry picked from commit 215e262f2aeba378aa192da07c30770f9925a4bf)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com

    Conflicts:
    	lib/Makefile
---
 include/linux/percpu-refcount.h | 122 ++
 lib/Makefile | 2 +-
 lib/percpu-refcount.c | 128 
 3 files changed, 251 insertions(+), 1 deletion(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 000..24b31ef
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,122 @@
+/*
+ * Percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet koverstr...@google.com
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * The refcount will have a range of 0 to ((1U << 31) - 1), i.e. one bit less
+ * than an atomic_t - this is because of the way shutdown works, see
+ * percpu_ref_kill()/PCPU_COUNT_BIAS.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0. After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspaces calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu()
[Devel] [PATCH RHEL7 COMMIT] ms/memcg: issue memory.high reclaim after refilling percpu stock
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3
--
commit c315808e33a89086d0dac4624c1fa6f4fe1f8051
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 14:22:20 2015 +0400

    ms/memcg: issue memory.high reclaim after refilling percpu stock

    Currently, we dive into memory.high reclaim before refilling the percpu
    stock. As a result, if we successfully charge a batch for a percpu stock
    while exceeding memory.high, others won't be able to use it until we
    finish and will probably have to reclaim themselves, which may lead to
    overreclaim.

    This patch therefore moves memory.high reclaim after refilling stocks.
    This is how it works upstream. I haven't seen any negative effects caused
    by this backport mistake, but let's stick to the mainstream behavior
    anyway.

    Fixes: 4038cd0e029dd (ms/memcg: port memory.high)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37e81d3..5f3e0ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2730,10 +2730,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (likely(!ret)) {
 		if (!do_swap_account)
-			goto done;
+			return CHARGE_OK;
 		ret = res_counter_charge(&memcg->memsw, csize, &fail_res);
 		if (likely(!ret))
-			goto done;
+			return CHARGE_OK;
 		res_counter_uncharge(&memcg->res, csize);
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
@@ -2790,21 +2790,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		return CHARGE_OOM_DIE;
 	return CHARGE_RETRY;
-
-done:
-	if (!(gfp_mask & __GFP_WAIT))
-		goto out;
-	/*
-	 * If the hierarchy is above the normal consumption range,
-	 * make the charging task trim their excess contribution.
-	 */
-	do {
-		if (res_counter_read_u64(&memcg->res, RES_USAGE) <= memcg->high)
-			continue;
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, false);
-	} while ((memcg = parent_mem_cgroup(memcg)));
-out:
-	return CHARGE_OK;
 }
 /*
@@ -2836,7 +2821,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup *memcg = NULL;
+	struct mem_cgroup *memcg = NULL, *iter;
 	int ret;
 	/*
@@ -2950,6 +2935,20 @@ again:
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+
+	/*
+	 * If the hierarchy is above the normal consumption range,
+	 * make the charging task trim their excess contribution.
+	 */
+	iter = memcg;
+	do {
+		if (!(gfp_mask & __GFP_WAIT))
+			break;
+		if (res_counter_read_u64(&iter->res, RES_USAGE) <= iter->high)
+			continue;
+		try_to_free_mem_cgroup_pages(iter, nr_pages, gfp_mask, false);
+	} while ((iter = parent_mem_cgroup(iter)));
+
 	css_put(&memcg->css);
 done:
 	*ptr = memcg;
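The hierarchy walk that this patch relocates can be modelled in user space. The following is only a sketch under stated assumptions: `struct group`, `reclaim()` and `trim_hierarchy()` are hypothetical stand-ins for `struct mem_cgroup`, `try_to_free_mem_cgroup_pages()` and the open-coded loop, and `can_wait` stands in for the `__GFP_WAIT` test. It illustrates that `continue` inside a `do`/`while` jumps to the loop condition, so a level below its high limit is skipped while the walk still advances to the parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct group {
	unsigned long usage;	/* pages currently charged */
	unsigned long high;	/* memory.high analogue */
	struct group *parent;	/* NULL at the root */
	int reclaimed;		/* number of reclaim() calls on this level */
};

/* Hypothetical stand-in for try_to_free_mem_cgroup_pages(). */
static void reclaim(struct group *g, unsigned long nr_pages)
{
	g->usage = g->usage > nr_pages ? g->usage - nr_pages : 0;
	g->reclaimed++;
}

/*
 * Walk from @g up to the root and reclaim from every level whose usage
 * exceeds its high limit. Returns the number of levels visited.
 */
int trim_hierarchy(struct group *g, unsigned long nr_pages, int can_wait)
{
	int visited = 0;

	assert(g != NULL);
	do {
		if (!can_wait)
			break;		/* caller may not sleep: do nothing */
		visited++;
		if (g->usage <= g->high)
			continue;	/* jumps to the while condition below */
		reclaim(g, nr_pages);
	} while ((g = g->parent));

	return visited;
}
```

With a three-level chain in which only the leaf and the root exceed their limits, the walk visits all three levels and reclaims from exactly those two.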
[Devel] [PATCH RHEL7 COMMIT] ve/vznetstat: Fix potential exit race
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 9a440f22380933dd3547de7d83c553924c6ce284
Author: Cyrill Gorcunov gorcu...@virtuozzo.com
Date: Fri Aug 28 14:31:18 2015 +0400

    ve/vznetstat: Fix potential exit race

    While a container is exiting, another task may still be operating on the
    statistics and incrementing or decrementing the stat counter. The counter
    may therefore be non-zero at that point, in which case we don't zap the
    @ve->stat member.

    Fix it by testing if the net is the last one belonging to a container.

    https://jira.sw.ru/browse/PSBM-35178

    Fixes: 505f8aacf95dce27fad66c90d4e1cd64adcb5432 (ve/vznetstat: Don't destroy statistics until explicitly asked)
    Signed-off-by: Cyrill Gorcunov gorcu...@virtuozzo.com
    CC: Andrey Vagin ava...@virtuozzo.com
    CC: Vladimir Davydov vdavy...@virtuozzo.com
    CC: Konstantin Khorenko khore...@virtuozzo.com
    CC: Pavel Emelyanov xe...@virtuozzo.com
    CC: Igor Sukhih i...@parallels.com
---
 kernel/ve/vznetstat/vznetstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/ve/vznetstat/vznetstat.c b/kernel/ve/vznetstat/vznetstat.c
index 9a25dea..99feafb 100644
--- a/kernel/ve/vznetstat/vznetstat.c
+++ b/kernel/ve/vznetstat/vznetstat.c
@@ -1098,7 +1098,7 @@ static void __net_exit net_exit_acct(struct net *net)
 	if (ve->stat) {
 		venet_acct_put_stat(ve->stat);
-		if (atomic_read(&ve->stat->users) == 0)
+		if (ve->ve_netns == net)
 			ve->stat = NULL;
 	}
 }
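The difference between the two checks can be illustrated with a single-threaded user-space model. This is a sketch only: every type and function name below is hypothetical, plain integers replace the kernel's atomic counter, and the "concurrent" user is simulated by starting with one extra reference. It shows that a counter test leaves the pointer set whenever someone else holds a reference at exit time, while an ownership test does not depend on the counter at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the statistics object, netns and container. */
struct stat_sketch { int users; };
struct net_sketch { int id; };
struct ve_sketch {
	struct stat_sketch *stat;
	struct net_sketch *ve_netns;	/* the container's own netns */
};

static void stat_put(struct stat_sketch *st)
{
	assert(st->users > 0);
	st->users--;
}

/* Old check: clear the pointer only if nobody holds a reference right now. */
void net_exit_by_counter(struct ve_sketch *ve, struct net_sketch *net)
{
	(void)net;
	if (ve->stat) {
		stat_put(ve->stat);
		if (ve->stat->users == 0)
			ve->stat = NULL;
	}
}

/* New check: clear the pointer whenever the container's own netns exits. */
void net_exit_by_owner(struct ve_sketch *ve, struct net_sketch *net)
{
	if (ve->stat) {
		stat_put(ve->stat);
		if (ve->ve_netns == net)
			ve->stat = NULL;
	}
}
```

Starting from `users == 2` (the container's reference plus one transient user), the counter-based variant leaves `ve->stat` set after exit, whereas the owner-based variant clears it.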
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: use RCU-sched insted of normal RCU
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 932bf29b63b1e7c74669a8847d7c69cc8b8ba919 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 14:49:25 2015 +0400 ms/percpu-refcount: use RCU-sched insted of normal RCU Patchset description: Pulling upstream patches converting css refcnt to percpu_ref. https://jira.sw.ru/browse/PSBM-34174 Kent Overstreet (2): percpu: implement generic percpu refcounting percpu-refcount: Don't use silly cmpxchg() Tejun Heo (9): percpu-refcount: consistently use plain (non-sched) RCU percpu-refcount: cosmetic updates percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu() percpu-refcount: implement percpu_ref_cancel_init() percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm() percpu-refcount: use RCU-sched insted of normal RCU cgroup: reorder the operations in cgroup_destroy_locked() cgroup: split cgroup destruction into two steps cgroup: use percpu refcnt for cgroup_subsys_states === This patch description: From: Tejun Heo t...@kernel.org percpu-refcount was incorrectly using preempt_disable/enable() for RCU critical sections against call_rcu(). 6a24474da8 (percpu-refcount: consistently use plain (non-sched) RCU) fixed it by converting the preepmtion operations with rcu_read_[un]lock() citing that there isn't any advantage in using sched-RCU over using the usual one; however, rcu_read_[un]lock() for the preemptible RCU implementation - CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT - are slightly more expensive than preempt_disable/enable(). 
    In a contrived microbenchmark which repeats the following steps,

    - percpu_ref_get()
    - copy 32 bytes of data into percpu buffer
    - percpu_ref_put()
    - copy 32 bytes of data into percpu buffer

    rcu_read_[un]lock() used in percpu_ref_get/put() makes it go slower by
    about 15% when compared to using sched-RCU.

    As the RCU critical sections are extremely short, using sched-RCU
    shouldn't have any latency implications. Convert to RCU-sched.

    Signed-off-by: Tejun Heo t...@kernel.org
    Acked-by: Kent Overstreet koverstr...@google.com
    Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
    Cc: Michal Hocko mho...@suse.cz
    Cc: Rusty Russell ru...@rustcorp.com.au
    (cherry picked from commit a4244454df1296e90cc961c1b636b1176ef0d9a0)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 12 ++--
 lib/percpu-refcount.c           |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index dd2a086..95961f0 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -105,7 +105,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
@@ -114,7 +114,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 	else
 		atomic_inc(&ref->count);
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 }
 /**
@@ -134,7 +134,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	unsigned __percpu *pcpu_count;
 	int ret = false;
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
@@ -143,7 +143,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 		ret = true;
 	}
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 	return ret;
 }
@@ -159,7 +159,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
-	rcu_read_lock();
+	rcu_read_lock_sched();
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
@@ -168,7 +168,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 	else if (unlikely(atomic_dec_and_test(&ref->count)))
 		ref->release(ref);
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 }
 #endif
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 8bf9e71..7deeb62 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -154,5 +154,5 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 		(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);
 	ref->confirm_kill = confirm_kill;
-	call_rcu(&ref->rcu, percpu_ref_kill_rcu);
+	call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: cosmetic updates
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit d6bfd7b559fdbe649d00c272895cb26996d1ee1c Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 14:49:22 2015 +0400 ms/percpu-refcount: cosmetic updates Patchset description: Pulling upstream patches converting css refcnt to percpu_ref. https://jira.sw.ru/browse/PSBM-34174 Kent Overstreet (2): percpu: implement generic percpu refcounting percpu-refcount: Don't use silly cmpxchg() Tejun Heo (9): percpu-refcount: consistently use plain (non-sched) RCU percpu-refcount: cosmetic updates percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu() percpu-refcount: implement percpu_ref_cancel_init() percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm() percpu-refcount: use RCU-sched insted of normal RCU cgroup: reorder the operations in cgroup_destroy_locked() cgroup: split cgroup destruction into two steps cgroup: use percpu refcnt for cgroup_subsys_states === This patch description: From: Tejun Heo t...@kernel.org * s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have _t postfix for types and the type is gonna be used for a different type of callback too. * Add @ARG to function comments. * Drop unnecessary and unaligned indentation from percpu_ref_init() function comment. 
    Signed-off-by: Tejun Heo t...@kernel.org
    Acked-by: Kent Overstreet koverstr...@google.com
    (cherry picked from commit ac899061a93250c28562f05ad94d5c74603415bc)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8 +---
 lib/percpu-refcount.c           | 7 ---
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index abe1411..b61bd6f 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -51,7 +51,7 @@
 #include <linux/rcupdate.h>
 struct percpu_ref;
-typedef void (percpu_ref_release)(struct percpu_ref *);
+typedef void (percpu_ref_func_t)(struct percpu_ref *);
 struct percpu_ref {
 	atomic_t		count;
@@ -62,11 +62,11 @@ struct percpu_ref {
 	 * percpu_ref_kill_rcu())
 	 */
 	unsigned __percpu	*pcpu_count;
-	percpu_ref_release	*release;
+	percpu_ref_func_t	*release;
 	struct rcu_head		rcu;
 };
-int percpu_ref_init(struct percpu_ref *, percpu_ref_release *);
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);
 #define PCPU_STATUS_BITS	2
@@ -78,6 +78,7 @@ void percpu_ref_kill(struct percpu_ref *ref);
 /**
  * percpu_ref_get - increment a percpu refcount
+ * @ref: percpu_ref to get
  *
  * Analagous to atomic_inc().
  */
@@ -99,6 +100,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 /**
  * percpu_ref_put - decrement a percpu refcount
+ * @ref: percpu_ref to put
  *
  * Decrement the refcount, and if 0, call the release function (which was passed
  * to percpu_ref_init())
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 1a17399..9a78e55 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -33,8 +33,8 @@
 /**
  * percpu_ref_init - initialize a percpu refcount
- * @ref: ref to initialize
- * @release: function which will be called when refcount hits 0
+ * @ref: percpu_ref to initialize
+ * @release: function which will be called when refcount hits 0
  *
  * Initializes the refcount in single atomic counter mode with a refcount of 1;
  * analagous to atomic_set(ref, 1).
@@ -42,7 +42,7 @@
  * Note that @release must not sleep - it may potentially be called from RCU
  * callback context by percpu_ref_kill().
  */
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release)
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 {
 	atomic_set(&ref->count, 1 + PCPU_COUNT_BIAS);
@@ -98,6 +98,7 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 /**
  * percpu_ref_kill - safely drop initial ref
+ * @ref: percpu_ref to kill
  *
  * Must be used to drop the initial ref on a percpu refcount; must be called
  * precisely once before shutdown.
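The renamed typedef names a function type rather than a function-pointer type, which is why the struct member and the init parameter are both spelled `percpu_ref_func_t *`. A minimal user-space sketch of the same pattern follows; `ref_sketch`, `ref_func_t`, `ref_init()`, `ref_put()` and `mark_released()` are hypothetical names chosen for illustration, and the plain `int` counter is an assumption that ignores the per-cpu and atomic aspects of the real code.

```c
#include <assert.h>
#include <stddef.h>

struct ref_sketch;

/* A function type (not a pointer type), mirroring percpu_ref_func_t. */
typedef void (ref_func_t)(struct ref_sketch *);

struct ref_sketch {
	int count;
	int released;
	ref_func_t *release;	/* pointer to a function of that type */
};

/* The typedef can also declare a function of that type. */
ref_func_t mark_released;

void mark_released(struct ref_sketch *ref)
{
	ref->released = 1;
}

/* Start with a count of 1 and remember the release callback. */
void ref_init(struct ref_sketch *ref, ref_func_t *release)
{
	assert(release != NULL);
	ref->count = 1;
	ref->released = 0;
	ref->release = release;
}

/* Drop one reference; call the callback when the count reaches 0. */
void ref_put(struct ref_sketch *ref)
{
	if (--ref->count == 0)
		ref->release(ref);
}
```

Because the type describes "a function taking a ref", the same typedef can later be reused for other callbacks with that signature, which is the motivation the commit message gives for the rename.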
[Devel] [PATCH RHEL7 COMMIT] ploop: dio_fastmap() must refresh bvec_merge_data
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3
--
commit fc65c834967a14d37ef23348cec6528d18b0a169
Author: Maxim Patlasov mpatla...@openvz.org
Date: Fri Aug 28 14:18:37 2015 +0400

    ploop: dio_fastmap() must refresh bvec_merge_data

    q->merge_bvec_fn() may override some fields of bvec_merge_data. For
    example, raid0_mergeable_bvec() does so. The blessed way is to initialize
    it from scratch before each use; see how __bio_add_page() prepares bvm
    for calling q->merge_bvec_fn().

    Signed-off-by: Maxim Patlasov mpatla...@openvz.org
    Acked-by: Dmitry Monakhov dmonak...@openvz.org
---
 drivers/block/ploop/io_direct.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 793bcc5..0183b0f 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1487,7 +1487,6 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
 	struct request_queue * q;
 	struct extent_map * em;
 	int i;
-	struct bvec_merge_data bm_data;
 	if (orig_bio->bi_size == 0) {
 		bio->bi_vcnt = 0;
@@ -1535,19 +1534,19 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
 	bio->bi_size = 0;
 	bio->bi_vcnt = 0;
-	bm_data.bi_bdev = bio->bi_bdev;
-	bm_data.bi_sector = bio->bi_sector;
-	bm_data.bi_size = 0;
-	bm_data.bi_rw = bio->bi_rw;
-
 	for (i = 0; i < orig_bio->bi_vcnt; i++) {
 		struct bio_vec * bv = &bio->bi_io_vec[i];
+		struct bvec_merge_data bm_data = {
+			.bi_bdev = bio->bi_bdev,
+			.bi_sector = bio->bi_sector,
+			.bi_size = bio->bi_size,
+			.bi_rw = bio->bi_rw,
+		};
 		if (q->merge_bvec_fn(q, &bm_data, bv) < bv->bv_len) {
 			io->plo->st.fast_neg_backing++;
 			return 1;
 		}
 		bio->bi_size += bv->bv_len;
-		bm_data.bi_size = bio->bi_size;
 		bio->bi_vcnt++;
 	}
 	return 0;
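The hazard fixed here is general: if a callback is allowed to scribble on its argument, a structure initialised once outside the loop carries the damage into later iterations. The sketch below is a user-space illustration under assumptions: `merge_data_sketch` and `merge_fn()` are hypothetical, and the callback simply adds 1000 to the sector to stand in for whatever a real `merge_bvec_fn` might overwrite. Each function returns the sector value that the last callback invocation received.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced analogue of struct bvec_merge_data. */
struct merge_data_sketch {
	unsigned sector;
	unsigned size;
};

/* Hypothetical callback that overwrites a field of its argument. */
static unsigned merge_fn(struct merge_data_sketch *md, unsigned len)
{
	md->sector += 1000;	/* clobbers the caller's copy */
	return len;		/* accepts the whole segment */
}

/* Initialise once: later iterations see the clobbered sector. */
unsigned last_sector_stale(unsigned start, const unsigned *lens, size_t n)
{
	struct merge_data_sketch md = { .sector = start, .size = 0 };
	unsigned seen = start;
	size_t i;

	assert(lens != NULL || n == 0);
	for (i = 0; i < n; i++) {
		seen = md.sector;
		merge_fn(&md, lens[i]);
		md.size += lens[i];
	}
	return seen;
}

/* Initialise from scratch on every iteration, as the patch does. */
unsigned last_sector_fresh(unsigned start, const unsigned *lens, size_t n)
{
	unsigned size = 0, seen = start;
	size_t i;

	assert(lens != NULL || n == 0);
	for (i = 0; i < n; i++) {
		struct merge_data_sketch md = { .sector = start, .size = size };

		seen = md.sector;
		merge_fn(&md, lens[i]);
		size += lens[i];
	}
	return seen;
}
```

The per-iteration designated initialiser costs a few stores per segment and removes any dependence on what the callback leaves behind.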
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: consistently use plain (non-sched) RCU
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 41721ced765e1156651d31c8b9deb0111340e984 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 14:49:22 2015 +0400 ms/percpu-refcount: consistently use plain (non-sched) RCU Patchset description: Pulling upstream patches converting css refcnt to percpu_ref. https://jira.sw.ru/browse/PSBM-34174 Kent Overstreet (2): percpu: implement generic percpu refcounting percpu-refcount: Don't use silly cmpxchg() Tejun Heo (9): percpu-refcount: consistently use plain (non-sched) RCU percpu-refcount: cosmetic updates percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu() percpu-refcount: implement percpu_ref_cancel_init() percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm() percpu-refcount: use RCU-sched insted of normal RCU cgroup: reorder the operations in cgroup_destroy_locked() cgroup: split cgroup destruction into two steps cgroup: use percpu refcnt for cgroup_subsys_states === This patch description: From: Tejun Heo t...@kernel.org percpu_ref_get/put() are using preempt_disable/enable() while percpu_ref_kill() is using plain call_rcu() instead of call_rcu_sched(). This is buggy as grace periods of the two may not match. Fix it by using plain RCU in percpu_ref_get/put(). (I suggested using sched RCU in the first place but there's no actual benefit in doing so unless we're gonna introduce different variants of get/put to be called while preemption is alredy disabled, which we definitely shouldn't.) 
    Signed-off-by: Tejun Heo t...@kernel.org
    Reported-by: Rusty Russell ru...@rustcorp.com.au
    Acked-by: Kent Overstreet koverstr...@google.com
    (cherry picked from commit 6a24474da83ea7c8b7d32f05f858b1259994067a)
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 24b31ef..abe1411 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -85,7 +85,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
-	preempt_disable();
+	rcu_read_lock();
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
@@ -94,7 +94,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 	else
 		atomic_inc(&ref->count);
-	preempt_enable();
+	rcu_read_unlock();
 }
 /**
@@ -107,7 +107,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
-	preempt_disable();
+	rcu_read_lock();
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
@@ -116,7 +116,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 	else if (unlikely(atomic_dec_and_test(&ref->count)))
 		ref->release(ref);
-	preempt_enable();
+	rcu_read_unlock();
 }
 #endif
[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: use percpu refcnt for cgroup_subsys_states
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit b1753091f010a49bcd0a89aa23306ac816302f9c Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 14:49:27 2015 +0400 ms/cgroup: use percpu refcnt for cgroup_subsys_states Patchset description: Pulling upstream patches converting css refcnt to percpu_ref. https://jira.sw.ru/browse/PSBM-34174 Kent Overstreet (2): percpu: implement generic percpu refcounting percpu-refcount: Don't use silly cmpxchg() Tejun Heo (9): percpu-refcount: consistently use plain (non-sched) RCU percpu-refcount: cosmetic updates percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu() percpu-refcount: implement percpu_ref_cancel_init() percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm() percpu-refcount: use RCU-sched insted of normal RCU cgroup: reorder the operations in cgroup_destroy_locked() cgroup: split cgroup destruction into two steps cgroup: use percpu refcnt for cgroup_subsys_states === This patch description: From: Tejun Heo t...@kernel.org A css (cgroup_subsys_state) is how each cgroup is represented to a controller. As such, it can be used in hot paths across the various subsystems different controllers are associated with. One of the common operations is reference counting, which up until now has been implemented using a global atomic counter and can have significant adverse impact on scalability. For example, css refcnt can be gotten and put multiple times by blkcg for each IO request. For highops configurations which try to do as much per-cpu as possible, the global frequent refcnting can be very expensive. In general, given the various and hugely diverse paths css's end up being used from, we need to make it cheap and highly scalable. In its usage, css refcnting isn't very different from module refcnting. 
This patch converts css refcnting to use the recently added percpu_ref. css_get/tryget/put() directly maps to the matching percpu_ref operations and the deactivation logic is no longer necessary as percpu_ref already has refcnt killing. The only complication is that as the refcnt is per-cpu, percpu_ref_kill() in itself doesn't ensure that further tryget operations will fail, which we need to guarantee before invoking -css_offline()'s. This is resolved collecting kill confirmation using percpu_ref_kill_and_confirm() and initiating the offline phase of destruction after all css refcnt's are confirmed to be seen as killed on all CPUs. The previous patches already splitted destruction into two phases, so percpu_ref_kill_and_confirm() can be hooked up easily. This patch removes css_refcnt() which is used for rcu dereference sanity check in css_id(). While we can add a percpu refcnt API to ask the same question, css_id() itself is scheduled to be removed fairly soon, so let's not bother with it. Just drop the sanity check and use rcu_dereference_raw() instead. v2: - init_cgroup_css() was calling percpu_ref_init() without checking the return value. This causes two problems - the obvious lack of error handling and percpu_ref_init() being called from cgroup_init_subsys() before the allocators are up, which triggers warnings but doesn't cause actual problems as the refcnt isn't used for roots anyway. Fix both by moving percpu_ref_init() to cgroup_create(). - The base references were put too early by percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the refs one extra time. This wasn't noticeable because css's go through another RCU grace period before being freed. Update cgroup_destroy_locked() to grab an extra reference before killing the refcnts. This problem was noticed by Kent. 
Signed-off-by: Tejun Heo t...@kernel.org Reviewed-by: Kent Overstreet koverstr...@google.com Acked-by: Li Zefan lize...@huawei.com Cc: Michal Hocko mho...@suse.cz Cc: Mike Snitzer snit...@redhat.com Cc: Vivek Goyal vgo...@redhat.com Cc: Alasdair G. Kergon a...@redhat.com Cc: Jens Axboe ax...@kernel.dk Cc: Mikulas Patocka mpato...@redhat.com Cc: Glauber Costa glom...@gmail.com (cherry picked from commit d3daf28da16a30af95bfb303189a634a87606725) Signed-off-by: Vladimir Davydov vdavy...@parallels.com Conflicts: include/linux/cgroup.h kernel/cgroup.c --- include/linux/cgroup.h | 27 +++- kernel/cgroup.c| 166 +++-- 2
[Devel] [PATCH RHEL7 COMMIT] ve/devtmpfs: lightweight virtualization
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 22255fb606cfd53fb98b11c62b854c0de5a4c713 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:10:59 2015 +0400 ve/devtmpfs: lightweight virtualization Patchset description: Rework devtmpfs virtualization Currently, we implement full-featured devtmpfs virtualization for VE: when a device is created in a VE namespace, we send a signal to kdevtmpfs to create the devnode on devtmpfs mount corresponding to the VE. This seems to be over-complicated: all this work can be done from userspace, because we only have a hardcoded list of devices created exclusively for VE on container start. Those are tty-related stuff and mem devices, and we only need the latter to create devtmpfs nodes. Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can be called before a VE tty device is unregistered, resulting in a KP: https://jira.sw.ru/browse/PSBM-35077 This patch therefore simplifies it. It makes the kernel only provide a single empty tmpfs mount per VE, which appears on an attempt to mount devtmpfs from inside a VE. The content of the fs is to be filled by the userspace on container start, which will be done in the scope of https://jira.sw.ru/browse/PSBM-35146 Vladimir Davydov (6): Revert ve/devtmpfs: Create required devices on container startup Revert ve/devtmpfs: pass proper options string Revert devtmpfs: containerize it with new obj ns operation Revert fs: add data pointer to mount_ns() Revert devtmpfs: per-VE mounts introduced devtmpfs: lightweight virtualization Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com === This patch description: All this patch does is provides each VE with its own empty single tmpfs mount, which appears on an attempt to mount devtmpfs. 
It's up to the userspace to populate this fs on container start, all kernel requests to create a device node inside a VE are ignored. Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/base/devtmpfs.c | 67 + include/linux/ve.h | 1 + kernel/ve/ve.c | 4 +++ 3 files changed, 72 insertions(+) diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c index f59b798..daf97ee 100644 --- a/drivers/base/devtmpfs.c +++ b/drivers/base/devtmpfs.c @@ -23,6 +23,7 @@ #include linux/ramfs.h #include linux/slab.h #include linux/kthread.h +#include linux/ve.h #include base.h static struct task_struct *thread; @@ -53,9 +54,61 @@ static int __init mount_param(char *str) } __setup(devtmpfs.mount=, mount_param); +#ifdef CONFIG_VE +static int ve_test_dev_sb(struct super_block *s, void *p) +{ + return get_exec_env()-dev_sb == s; +} + +static int ve_set_dev_sb(struct super_block *s, void *p) +{ + struct ve_struct *ve = get_exec_env(); + int error; + + error = set_anon_super(s, p); + if (!error) { + BUG_ON(ve-dev_sb); + ve-dev_sb = s; + atomic_inc(s-s_active); + } + return error; +} + +static struct dentry *ve_dev_mount(struct file_system_type *fs_type, int flags, + const char *dev_name, void *data) +{ + int (*fill_super)(struct super_block *, void *, int); + struct super_block *s; + int error; + +#ifdef CONFIG_TMPFS + fill_super = shmem_fill_super; +#else + fill_super = ramfs_fill_super; +#endif + s = sget(fs_type, ve_test_dev_sb, ve_set_dev_sb, flags, NULL); + if (IS_ERR(s)) + return ERR_CAST(s); + + if (!s-s_root) { + error = fill_super(s, data, flags MS_SILENT ? 
1 : 0); + if (error) { + deactivate_locked_super(s); + return ERR_PTR(error); + } + s-s_flags |= MS_ACTIVE; + } + return dget(s-s_root); +} +#endif /* CONFIG_VE */ + static struct dentry *dev_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { +#ifdef CONFIG_VE + if (!ve_is_super(get_exec_env())) + return ve_dev_mount(fs_type, flags, dev_name, data); +#endif #ifdef CONFIG_TMPFS return mount_single(fs_type, flags, data, shmem_fill_super); #else @@ -79,6 +132,16 @@ static inline int is_blockdev(struct device *dev) static inline int is_blockdev(struct device *dev) { return 0; } #endif +#ifdef CONFIG_VE +static inline int is_ve_dev(struct device *dev) +{ + return dev-class dev-class-namespace == ve_namespace + ve_namespace(dev) != get_ve0(); +} +#else +static inline int is_ve_dev(struct device *dev) { return 0; } +#endif + int
[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use sb-s_fs_info
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 17dd96483ff558d44c98c3f8bcb04a86aca843a5 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:42:43 2015 +0400 ve/binfmt_misc: do not use sb-s_fs_info Patchset description: zap sb-s_ns + fix memleak in binfmt_misc Vladimir Davydov (6): binfmt_misc: do not use sb-s_fs_info Revert VE/VFS: use sb-s_ns member to store namespace for mount_ns() calls Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c Revert nfsd/sunrpc/mqueue: use sb-s_ns instead of data in fill_super binfmt_misc: do not use s_ns binfmt_misc: destroy all nodes on ve stop https://jira.sw.ru/browse/PSBM-39154 Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com == This patch description: When we virtualized binfmt_misc, we made sb-s_fs_info store a pointer to binfmt_misc struct. At the same time, we store a pointer to the owner ve_struct in sb-s_ns and a pointer to the same binfmt_misc struct in ve_struct-binfmt_misc. That said, we don't actually need to use s_fs_info, because we can get the binfmt_misc by dereferencing sb-s_ns-binfmt_misc. Using sb-s_fs_info instead of sb-s_ns will allow us to revert our patches introducing sb-s_ns. This could be merged to 0b0dbb644794 (VE/BINFTM: virtualization). 
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 7e760d2..d0cb80c 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,6 +65,8 @@ struct binfmt_misc {
 	int entry_count;
 };
+#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+
 /*
  * Check if we support the binfmt
  * if we do, return the node, else NULL
@@ -541,7 +543,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 	Node *e = file_inode(file)->i_private;
 	int res = parse_command(buffer, count);
 	struct super_block *sb = file->f_path.dentry->d_sb;
-	struct binfmt_misc *bm_data = sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 	switch (res) {
 		case 1: clear_bit(Enabled, &e->flags);
@@ -576,7 +578,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	struct inode *inode;
 	struct dentry *root, *dentry;
 	struct super_block *sb = file->f_path.dentry->d_sb;
-	struct binfmt_misc *bm_data = sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 	int err = 0;
 	e = create_entry(buffer, count);
@@ -641,7 +643,7 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-	struct binfmt_misc *bm_data = file->f_dentry->d_sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(file->f_dentry->d_sb);
 	char *s = bm_data->enabled ? "enabled\n" : "disabled\n";
 	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
@@ -650,7 +652,7 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file * file, const char __user * buffer,
 		size_t count, loff_t *ppos)
 {
-	struct binfmt_misc *bm_data = file->f_dentry->d_sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(file->f_dentry->d_sb);
 	int res = parse_command(buffer, count);
 	struct dentry *root;
@@ -681,7 +683,7 @@ static const struct file_operations bm_status_operations = {
 static void bm_put_super(struct super_block *sb)
 {
-	struct binfmt_misc *bm_data = sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 	struct ve_struct *ve = sb->s_ns;
 	bm_data->enabled = 0;
@@ -723,7 +725,6 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
 	}
 	sb->s_op = &s_ops;
-	sb->s_fs_info = bm_data;
 	bm_data->enabled = 1;
 	get_ve(ve);
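The accessor introduced by this patch replaces a cached pointer with a lookup through the owner. A user-space sketch of that pattern follows, with hypothetical names throughout (`sb_sketch`, `owner_sketch`, `binfmt_sketch`, `BINFMT_SKETCH`); it assumes, as the commit message states for the kernel structures, that the untyped namespace pointer always refers to the owner object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced analogues of binfmt_misc, ve_struct and super_block. */
struct binfmt_sketch { int enabled; };
struct owner_sketch { struct binfmt_sketch *binfmt_misc; };
struct sb_sketch { void *s_ns; };	/* untyped pointer to the owner */

/* Reach the per-owner data through s_ns instead of caching a second pointer. */
#define BINFMT_SKETCH(sb) \
	(((struct owner_sketch *)(sb)->s_ns)->binfmt_misc)

int sb_is_enabled(struct sb_sketch *sb)
{
	assert(sb != NULL && sb->s_ns != NULL);
	return BINFMT_SKETCH(sb)->enabled;
}

void sb_set_enabled(struct sb_sketch *sb, int on)
{
	assert(sb != NULL && sb->s_ns != NULL);
	BINFMT_SKETCH(sb)->enabled = on;
}
```

The cast is only as safe as the invariant behind it: every caller has to be certain that the namespace pointer was set to an owner object before the macro is used.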
[Devel] [PATCH RHEL7 COMMIT] Revert fs: add data pointer to mount_ns()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 8d9d5a10d874b4d9f66f1af3fdcabbe9aee396f2
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:10:58 2015 +0400

Revert fs: add data pointer to mount_ns()

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE: when
a device is created in a VE namespace, we send a signal to kdevtmpfs to
create the devnode on the devtmpfs mount corresponding to the VE. This seems
over-complicated: all this work can be done from userspace, because we
only have a hardcoded list of devices created exclusively for VE on
container start. Those are tty-related devices and mem devices, and we only
need the latter to create devtmpfs nodes.

Moreover, it is buggy: ve_stop_ns, which destroys the VE devtmpfs mount, can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch set therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

===
This patch description:

This reverts commit 69e6ae7f750fc862c9324441130abbff2c8b528e.

The extra data pointer is only needed for per-ns filesystems that can
accept user options. There is only one such filesystem, devtmpfs, which we
made per-container. Since devtmpfs virtualization is going to be dropped,
this patch is not necessary.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 drivers/base/devtmpfs.c | 4 ++--
 fs/binfmt_misc.c        | 2 +-
 fs/nfsd/nfsctl.c        | 2 +-
 fs/super.c              | 4 ++--
 include/linux/fs.h      | 2 +-
 ipc/mqueue.c            | 2 +-
 net/sunrpc/rpc_pipe.c   | 2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 349d6eb..6f4ba37 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -59,9 +59,9 @@ static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
 		      const char *dev_name, void *data)
 {
 #ifdef CONFIG_TMPFS
-	return mount_ns(fs_type, flags, data, get_exec_env(), shmem_fill_super);
+	return mount_ns(fs_type, flags, data, shmem_fill_super);
 #else
-	return mount_ns(fs_type, flags, data, get_exec_env(), ramfs_fill_super);
+	return mount_ns(fs_type, flags, data, ramfs_fill_super);
 #endif
 }
 
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 460d53f..7e760d2 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -734,7 +734,7 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
 static struct dentry *bm_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
-	return mount_ns(fs_type, flags, data, get_exec_env(), bm_fill_super);
+	return mount_ns(fs_type, flags, get_exec_env(), bm_fill_super);
 }
 
 static struct linux_binfmt misc_format = {
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 9b690c9..7411a56 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1126,7 +1126,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 static struct dentry *nfsd_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
-	return mount_ns(fs_type, flags, NULL, current->nsproxy->net_ns, nfsd_fill_super);
+	return mount_ns(fs_type, flags, current->nsproxy->net_ns, nfsd_fill_super);
 }
 
 static void nfsd_umount(struct super_block *sb)
diff --git a/fs/super.c b/fs/super.c
index 7f316e8..c9b47bf 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -890,11 +890,11 @@ static int ns_set_super(struct super_block *sb, void *data)
 }
 
 struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
-	void *data, void *ns, int (*fill_super)(struct super_block *, void *, int))
+	void *data, int (*fill_super)(struct super_block *, void *, int))
 {
 	struct super_block *sb;
 
-	sb = sget(fs_type, ns_test_super, ns_set_super, flags, ns);
+	sb = sget(fs_type, ns_test_super, ns_set_super, flags, data);
 	if (IS_ERR(sb))
 		return
[Devel] [PATCH RHEL7 COMMIT] Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 9b72ce16b191d84da03da83d5ccec29de8854686
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:42:41 2015 +0400

Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

==
This patch description:

This reverts commit 9e7411c5c3b53937171ef962ce7381337f125b28.

This patch is no longer needed, because none of the mount_ns users needs
sb->s_fs_info any more.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 fs/nfs/dns_resolve.c  | 2 +-
 fs/nfsd/nfs4recover.c | 4 ++--
 fs/super.c            | 4 ++--
 include/linux/fs.h    | 2 --
 ipc/mqueue.c          | 6 +++---
 net/sunrpc/clnt.c     | 2 +-
 net/sunrpc/rpc_pipe.c | 4 ++--
 7 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c
index dda6202..d25f10f 100644
--- a/fs/nfs/dns_resolve.c
+++ b/fs/nfs/dns_resolve.c
@@ -415,7 +415,7 @@ static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event,
 			   void *ptr)
 {
 	struct super_block *sb = ptr;
-	struct net *net = sb->s_ns;
+	struct net *net = sb->s_fs_info;
 	struct nfs_net *nn = net_generic(net, nfs_net_id);
 	struct cache_detail *cd = nn->nfs_dns_resolve;
 	int ret = 0;
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index c714602..4c86b18 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -693,7 +693,7 @@ cld_pipe_downcall(struct file *filp, const char __user *src, size_t mlen)
 	struct cld_upcall *tmp, *cup;
 	struct cld_msg __user *cmsg = (struct cld_msg __user *)src;
 	uint32_t xid;
-	struct nfsd_net *nn = net_generic(filp->f_dentry->d_sb->s_ns,
+	struct nfsd_net *nn = net_generic(filp->f_dentry->d_sb->s_fs_info,
 						nfsd_net_id);
 	struct cld_net *cn = nn->cld_net;
 
@@ -1353,7 +1353,7 @@ static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event,
 			   void *ptr)
 {
 	struct super_block *sb = ptr;
-	struct net *net = sb->s_ns;
+	struct net *net = sb->s_fs_info;
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
 	struct cld_net *cn = nn->cld_net;
 	struct dentry *dentry;
diff --git a/fs/super.c b/fs/super.c
index c9b47bf..341650d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -880,12 +880,12 @@ EXPORT_SYMBOL(kill_litter_super);
 
 static int ns_test_super(struct super_block *sb, void *data)
 {
-	return sb->s_ns == data;
+	return sb->s_fs_info == data;
 }
 
 static int ns_set_super(struct super_block *sb, void *data)
 {
-	sb->s_ns = data;
+	sb->s_fs_info = data;
 	return set_anon_super(sb, NULL);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 68cec28..553bca3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1457,8 +1457,6 @@ struct super_block {
 	unsigned int		s_max_links;
 	fmode_t			s_mode;
 
-	void			*s_ns;		/* Pointer to namespace */
-
 	/* Granularity of c/m/atime in ns.
 	   Cannot be worse than a second */
 	u32		   s_time_gran;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 18620cd..c508938 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -104,7 +104,7 @@ static inline struct mqueue_inode_info *MQUEUE_I(struct inode *inode)
  */
 static inline struct ipc_namespace *__get_ns_from_inode(struct inode *inode)
 {
-	return get_ipc_ns(inode->i_sb->s_ns);
+	return get_ipc_ns(inode->i_sb->s_fs_info);
 }
 
 static struct ipc_namespace *get_ns_from_inode(struct inode *inode)
@@ -407,7 +407,7 @@ static void mqueue_evict_inode(struct inode *inode)
 		user->mq_bytes -= mq_bytes;
 		/*
 		 * get_ns_from_inode() ensures that the
-		 * (ipc_ns = sb->s_ns) is either a valid ipc_ns
+		 * (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
 		 * to which we now hold a reference, or it is NULL.
 		 * We can't put it here under mq_lock, though.
 		 */
@@ -1418,7 +1418,7 @@ int mq_init_ns(struct ipc_namespace *ns)
 
 void mq_clear_sbinfo(struct
[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use s_ns
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit a98a90ea907f522f1ae6ff0e1c6e78a39ade2494
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:42:44 2015 +0400

ve/binfmt_misc: do not use s_ns

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

==
This patch description:

Since 9e7411c5c3b5 was reverted, we must use sb->s_fs_info for storing a
pointer to the namespace.

This could be merged to 0b0dbb644794 (VE/BINFTM: virtualization).

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 fs/binfmt_misc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index d0cb80c..4487153 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,7 +65,7 @@ struct binfmt_misc {
 	int entry_count;
 };
 
-#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)
 
 /*
  * Check if we support the binfmt
@@ -684,7 +684,7 @@ static const struct file_operations bm_status_operations = {
 static void bm_put_super(struct super_block *sb)
 {
 	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
-	struct ve_struct *ve = sb->s_ns;
+	struct ve_struct *ve = sb->s_fs_info;
 
 	bm_data->enabled = 0;
 	put_ve(ve);
@@ -703,7 +703,7 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
 		[3] = {"register", &bm_register_operations, S_IWUSR},
 		/* last one */ {""}
 	};
-	struct ve_struct *ve = sb->s_ns;
+	struct ve_struct *ve = data;
 	struct binfmt_misc *bm_data = ve->binfmt_misc;
 	int err;
 
[Devel] [PATCH RHEL7 COMMIT] Revert devtmpfs: containerize it with new obj ns operation
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 968c8efb7981f87f8bc0616741edb6c0bc556d76
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:10:57 2015 +0400

Revert devtmpfs: containerize it with new obj ns operation

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE: when
a device is created in a VE namespace, we send a signal to kdevtmpfs to
create the devnode on the devtmpfs mount corresponding to the VE. This seems
over-complicated: all this work can be done from userspace, because we
only have a hardcoded list of devices created exclusively for VE on
container start. Those are tty-related devices and mem devices, and we only
need the latter to create devtmpfs nodes.

Moreover, it is buggy: ve_stop_ns, which destroys the VE devtmpfs mount, can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch set therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

===
This patch description:

This reverts commit 53343c3b231ed36d973e6d3ac2ab9ad7b7c87e25.

The whole point of devtmpfs is simplifying the system bootup logic. There
is absolutely no point in virtualizing it, because on container start we
create devices from a hardcoded list (these are ttys, which I'd prefer not
to create at all, using ptys instead, but we have to live with it for
compatibility reasons for now). This means that it is enough to provide
userspace with a per-VE tmpfs mount called devtmpfs and teach it to make
device nodes from a hardcoded list on container start, instead of
implementing devtmpfs virtualization in the kernel. The kernel part will be
done by the following patches.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 drivers/base/devtmpfs.c    | 37 ++-----------------------------------
 fs/sysfs/ve.c              |  9 ---------
 include/linux/kobject_ns.h |  2 --
 3 files changed, 2 insertions(+), 46 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 0448af8..349d6eb 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -366,46 +366,13 @@ int devtmpfs_mount(const char *mntdir)
 
 static DECLARE_COMPLETION(setup_done);
 
-static struct path set_dev_pwd(struct device *dev)
-{
-	const struct kobj_ns_type_operations *ops;
-	struct path pwd = current->fs->pwd;
-
-	ops = kobj_ns_ops(&dev->kobj);
-	path_get(&pwd);
-
-	if (ops && ops->devtmpfs) {
-		const struct path *devtmpfs_root;
-
-		devtmpfs_root = ops->devtmpfs(&dev->kobj);
-		BUG_ON(!devtmpfs_root);
-		set_fs_pwd(current->fs, devtmpfs_root);
-	}
-	return pwd;
-}
-
-static void drop_dev_pwd(struct path *pwd)
-{
-	set_fs_pwd(current->fs, pwd);
-	path_put(pwd);
-}
-
 static int handle(const char *name, umode_t mode, kuid_t uid, kgid_t gid,
 		  struct device *dev)
 {
-	struct path pwd;
-	int err;
-
-	pwd = set_dev_pwd(dev);
-
 	if (mode)
-		err = handle_create(name, mode, uid, gid, dev);
+		return handle_create(name, mode, uid, gid, dev);
 	else
-		err = handle_remove(name, dev);
-
-	/* Restore kthread pwd */
-	drop_dev_pwd(&pwd);
-	return err;
+		return handle_remove(name, dev);
 }
 
 static int devtmpfsd(void *p)
diff --git a/fs/sysfs/ve.c b/fs/sysfs/ve.c
index 79ad6d5..bb28a4b 100644
--- a/fs/sysfs/ve.c
+++ b/fs/sysfs/ve.c
@@ -43,21 +43,12 @@ const void *ve_namespace(struct device *dev)
 	return (!dev->groups && dev_get_drvdata(dev)) ? dev_get_drvdata(dev) : get_ve0();
 }
 
-static const struct path *ve_devtmpfs(const struct kobject *kobj)
-{
-	struct device *dev = container_of(kobj, struct device, kobj);
-	const struct ve_struct *ve = dev->class->namespace(dev);
-
-	return &ve->devtmpfs_root;
-}
-
 struct kobj_ns_type_operations ve_ns_type_operations = {
 	.type = KOBJ_NS_TYPE_VE,
 	.grab_current_ns =
[Devel] [PATCH RHEL7 COMMIT] Revert devtmpfs: per-VE mounts introduced
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 3fd8ef28e629c3ec00144f83249628244903876d
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:10:58 2015 +0400

Revert devtmpfs: per-VE mounts introduced

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE: when
a device is created in a VE namespace, we send a signal to kdevtmpfs to
create the devnode on the devtmpfs mount corresponding to the VE. This seems
over-complicated: all this work can be done from userspace, because we
only have a hardcoded list of devices created exclusively for VE on
container start. Those are tty-related devices and mem devices, and we only
need the latter to create devtmpfs nodes.

Moreover, it is buggy: ve_stop_ns, which destroys the VE devtmpfs mount, can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch set therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

===
This patch description:

This reverts commit e85a799b629d5e28c8931ddd9127cf18d501745c.

More devtmpfs virtualization crap to drop. Will be reworked.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>

Conflicts:
	include/linux/ve.h
	kernel/ve/ve.c
---
 drivers/base/devtmpfs.c | 28 ++--------------------------
 include/linux/device.h  |  4 ----
 include/linux/ve.h      |  3 ---
 kernel/ve/ve.c          |  8 --------
 4 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 6f4ba37..f59b798 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -23,8 +23,6 @@
 #include <linux/ramfs.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
-#include <linux/fs_struct.h>
-#include <linux/ve.h>
 #include "base.h"
 
 static struct task_struct *thread;
@@ -59,9 +57,9 @@ static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
 		      const char *dev_name, void *data)
 {
 #ifdef CONFIG_TMPFS
-	return mount_ns(fs_type, flags, data, shmem_fill_super);
+	return mount_single(fs_type, flags, data, shmem_fill_super);
 #else
-	return mount_ns(fs_type, flags, data, ramfs_fill_super);
+	return mount_single(fs_type, flags, data, ramfs_fill_super);
 #endif
 }
 
@@ -387,7 +385,6 @@ static int devtmpfsd(void *p)
 		goto out;
 	sys_chdir("/.."); /* will traverse into overmounted root */
 	sys_chroot(".");
-	get_fs_root(current->fs, &get_exec_env()->devtmpfs_root);
 	complete(&setup_done);
 	while (1) {
 		spin_lock(&req_lock);
@@ -408,33 +405,12 @@ static int devtmpfsd(void *p)
 		spin_unlock(&req_lock);
 		schedule();
 	}
-	path_put(&get_exec_env()->devtmpfs_root);
 	return 0;
 out:
 	complete(&setup_done);
 	return *err;
 }
 
-int ve_init_devtmpfs(void *data)
-{
-	struct ve_struct *ve = data;
-	struct vfsmount *mnt;
-
-	mnt = kern_mount_data(&dev_fs_type, ve);
-	if (IS_ERR(mnt))
-		return PTR_ERR(mnt);
-	ve->devtmpfs_root.mnt = mnt;
-	ve->devtmpfs_root.dentry = mnt->mnt_root;
-	return 0;
-}
-
-void ve_fini_devtmpfs(void *data)
-{
-	struct ve_struct *ve = data;
-
-	kern_unmount(ve->devtmpfs_root.mnt);
-}
-
 /*
  * Create devtmpfs instance, driver-core devices will add their device
  * nodes here.
diff --git a/include/linux/device.h b/include/linux/device.h
index df5152f..2c9c764 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1005,14 +1005,10 @@ extern void put_device(struct device *dev);
 extern int devtmpfs_create_node(struct device *dev);
 extern int devtmpfs_delete_node(struct device *dev);
 extern int devtmpfs_mount(const char *mntdir);
-extern int ve_init_devtmpfs(void *data);
-extern void ve_fini_devtmpfs(void *data);
 #else
 static inline int devtmpfs_create_node(struct device *dev) { return 0; }
 static inline int devtmpfs_delete_node(struct device *dev) { return 0; }
 static inline int
[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: destroy all nodes on ve stop
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 0ea1f95684407db5892760b5a58a24003571f043
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:42:44 2015 +0400

ve/binfmt_misc: destroy all nodes on ve stop

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

==
This patch description:

Each registered binfmt_misc node pins the binfmt_misc mount point, which in
turn pins the owner ve. This means that if we don't clean up binfmt_misc
nodes on ve stop, the mount point as well as the ve struct will leak.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 fs/binfmt_misc.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 4487153..90c306e 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -752,16 +752,42 @@ static struct file_system_type bm_fs_type = {
 };
 MODULE_ALIAS_FS("binfmt_misc");
 
+static void ve_binfmt_fini(void *data)
+{
+	struct ve_struct *ve = data;
+	struct binfmt_misc *bm_data = ve->binfmt_misc;
+
+	if (!bm_data)
+		return;
+
+	/*
+	 * XXX: Note we don't take any locks here. This is safe as long as
+	 * nobody uses binfmt_misc outside the owner ve.
+	 */
+	while (!list_empty(&bm_data->entries))
+		kill_node(bm_data, list_first_entry(
+				&bm_data->entries, Node, list));
+}
+
+static struct ve_hook ve_binfmt_hook = {
+	.fini		= ve_binfmt_fini,
+	.priority	= HOOK_PRIO_DEFAULT,
+	.owner		= THIS_MODULE,
+};
+
 static int __init init_misc_binfmt(void)
 {
 	int err = register_filesystem(&bm_fs_type);
-	if (!err)
+	if (!err) {
 		insert_binfmt(&misc_format);
+		ve_hook_register(VE_SS_CHAIN, &ve_binfmt_hook);
+	}
 	return err;
 }
 
 static void __exit exit_misc_binfmt(void)
 {
+	ve_hook_unregister(&ve_binfmt_hook);
 	unregister_binfmt(&misc_format);
 	unregister_filesystem(&bm_fs_type);
 }
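For readers who have not met the drain idiom used in ve_binfmt_fini(), here is a minimal userspace sketch of it. This is not the kernel code: struct list_head and list_first_entry() are reimplemented locally, and node_sketch, binfmt_misc_sketch, bm_init_sketch(), bm_register_sketch() and kill_node_sketch() are hypothetical stand-ins. In particular, the sketch assumes that kill_node() unlinks the entry it is given, which is what makes the loop terminate; the patch itself does not show kill_node().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly-linked list, modeled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;
}

#define list_first_entry(head, type, member) \
	((type *)((char *)(head)->next - offsetof(type, member)))

/* Hypothetical stand-ins for the kernel's Node and struct binfmt_misc. */
struct node_sketch {
	struct list_head list;
	int id;
};

struct binfmt_misc_sketch {
	struct list_head entries;
	int entry_count;
};

void bm_init_sketch(struct binfmt_misc_sketch *bm)
{
	list_init(&bm->entries);
	bm->entry_count = 0;
}

/* Register one node; returns 0 on success, -1 if allocation fails. */
int bm_register_sketch(struct binfmt_misc_sketch *bm, int id)
{
	struct node_sketch *node = malloc(sizeof(*node));

	if (!node)
		return -1;
	node->id = id;
	list_add(&node->list, &bm->entries);
	bm->entry_count++;
	return 0;
}

/* Assumed behavior of kill_node(): unlink the entry and release it. */
static void kill_node_sketch(struct binfmt_misc_sketch *bm,
			     struct node_sketch *node)
{
	assert(bm->entry_count > 0);
	list_del(&node->list);
	bm->entry_count--;
	free(node);
}

/* Same shape as ve_binfmt_fini(): pop the first entry until none is left. */
void ve_binfmt_fini_sketch(struct binfmt_misc_sketch *bm)
{
	if (!bm)
		return;

	while (!list_empty(&bm->entries))
		kill_node_sketch(bm, list_first_entry(&bm->entries,
						      struct node_sketch, list));
}
```

Re-reading the list head on every iteration, rather than walking with a saved next pointer, keeps the loop correct even if the kill routine removes more than the one entry it was handed. As in the patch, the sketch takes no locks, so it is only safe while nothing else touches the list.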
[Devel] [PATCH RHEL7 COMMIT] Revert ve/devtmpfs: Create required devices on container startup
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 0cdfb581770d883cea99f30e49e3de1583ab6fc1
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:10:56 2015 +0400

Revert ve/devtmpfs: Create required devices on container startup

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE: when
a device is created in a VE namespace, we send a signal to kdevtmpfs to
create the devnode on the devtmpfs mount corresponding to the VE. This seems
over-complicated: all this work can be done from userspace, because we
only have a hardcoded list of devices created exclusively for VE on
container start. Those are tty-related devices and mem devices, and we only
need the latter to create devtmpfs nodes.

Moreover, it is buggy: ve_stop_ns, which destroys the VE devtmpfs mount, can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch set therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

===
This patch description:

This reverts commit 5cd1d17ff1b6a8f476ab6f4cd0a6830fbffe43f2.

We don't actually need separate null, zero, and other mem class devices
inside a VE. The patch being reverted added them merely for kdevtmpfs to
create nodes for these devices under /dev. This work can and should be done
by vzctl on container start, so drop this patch.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 drivers/char/mem.c | 20 -------------------
 kernel/ve/ve.c     | 56 ------------------------------------------------------
 2 files changed, 76 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index c486c83..a3653f7 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -30,7 +30,6 @@
 #include <linux/io.h>
 #include <linux/aio.h>
 #include <linux/security.h>
-#include <linux/ve.h>
 
 #include <asm/uaccess.h>
 
@@ -924,20 +923,7 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
 	return NULL;
 }
 
-#ifdef CONFIG_VE
-static struct class mem_class_base = {
-	.name		= "mem",
-	.devnode	= mem_devnode,
-	.ns_type	= &ve_ns_type_operations,
-	.namespace	= ve_namespace,
-	.owner		= THIS_MODULE,
-};
-
-struct class *mem_class = &mem_class_base;
-EXPORT_SYMBOL(mem_class);
-#else
 static struct class *mem_class;
-#endif
 
 static int __init chr_dev_init(void)
 {
@@ -951,17 +937,11 @@ static int __init chr_dev_init(void)
 	if (register_chrdev(MEM_MAJOR, "mem", &memory_fops))
 		printk("unable to get major %d for memory devs\n", MEM_MAJOR);
 
-#ifdef CONFIG_VE
-	err = class_register(&mem_class_base);
-	if (err)
-		return err;
-#else
 	mem_class = class_create(THIS_MODULE, "mem");
 	if (IS_ERR(mem_class))
 		return PTR_ERR(mem_class);
 
 	mem_class->devnode = mem_devnode;
-#endif
 	for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
 		if (!devlist[minor].name)
 			continue;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 4cd1f8b..cdbb342 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -413,55 +413,6 @@ static void ve_drop_context(struct ve_struct *ve)
 	ve->init_cred = NULL;
 }
 
-static const struct {
-	unsigned int	minor;
-	char		*name;
-} ve_mem_class_devices[] = {
-	{3, "null"},
-	{5, "zero"},
-	{7, "full"},
-	{8, "random"},
-	{9, "urandom"},
-};
-
-extern struct class *mem_class;
-
-static int ve_init_mem_class(struct ve_struct *ve)
-{
-	struct device *dev;
-	dev_t devt;
-	size_t i;
-
-	for (i = 0; i < ARRAY_SIZE(ve_mem_class_devices); i++) {
-		devt = MKDEV(MEM_MAJOR, ve_mem_class_devices[i].minor);
-		dev = device_create(mem_class, NULL, devt,
-				    ve, ve_mem_class_devices[i].name);
-		if (IS_ERR(dev)) {
-			pr_err("Can't create %s (%d)\n",
-			       ve_mem_class_devices[i].name,
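Since the description says this work should move to vzctl, here is a rough sketch of the name-to-device-number table a userspace tool would need. The minors are copied from the ve_mem_class_devices[] table removed by the patch, and MEM_MAJOR is 1 as in the kernel headers. The names mem_dev_sketch, ve_mem_device_count() and ve_mem_device_minor() are hypothetical. This is not vzctl code, and the actual node creation (for example via mknod) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MEM_MAJOR 1	/* major number of the mem character devices */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical userspace copy of the table removed from kernel/ve/ve.c. */
struct mem_dev_sketch {
	unsigned int minor;
	const char *name;
};

static const struct mem_dev_sketch ve_mem_devices[] = {
	{3, "null"},
	{5, "zero"},
	{7, "full"},
	{8, "random"},
	{9, "urandom"},
};

/* Number of device nodes a container start would have to create. */
size_t ve_mem_device_count(void)
{
	return ARRAY_SIZE(ve_mem_devices);
}

/* Look up the minor number for a node name; returns -1 if it is unknown. */
int ve_mem_device_minor(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < ARRAY_SIZE(ve_mem_devices); i++)
		if (strcmp(ve_mem_devices[i].name, name) == 0)
			return (int)ve_mem_devices[i].minor;
	return -1;
}
```

A tool built on this would iterate over the table and create a character device node with major MEM_MAJOR and the listed minor for each entry under the container's /dev.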
[Devel] [PATCH RHEL7 COMMIT] Revert ve/devtmpfs: pass proper options string
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 0ffbb29c45f5ee709f4fa5dfa52f883cbe4a70f1
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:10:57 2015 +0400

Revert ve/devtmpfs: pass proper options string

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE: when
a device is created in a VE namespace, we send a signal to kdevtmpfs to
create the devnode on the devtmpfs mount corresponding to the VE. This seems
over-complicated: all this work can be done from userspace, because we
only have a hardcoded list of devices created exclusively for VE on
container start. Those are tty-related devices and mem devices, and we only
need the latter to create devtmpfs nodes.

Moreover, it is buggy: ve_stop_ns, which destroys the VE devtmpfs mount, can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch set therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

===
This patch description:

This reverts commit 1c6719b8aa075de4c9528811839d5f2595ef2994.

This is related to devtmpfs virtualization, which I'm going to drop.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 drivers/base/devtmpfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c28e42c..0448af8 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -451,10 +451,9 @@ out:
 int ve_init_devtmpfs(void *data)
 {
 	struct ve_struct *ve = data;
-	char opts[] = "mode=0755";
 	struct vfsmount *mnt;
 
-	mnt = kern_mount_data(&dev_fs_type, opts);
+	mnt = kern_mount_data(&dev_fs_type, ve);
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 	ve->devtmpfs_root.mnt = mnt;
[Devel] [PATCH RHEL7 COMMIT] Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit d0856fdc15e0b49540c454b42a11ddf2af70cda6
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 16:42:43 2015 +0400

Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

==
This patch description:

This reverts commit 610d54ccee1af63b1b361d18ec4ee9fa5230dea8.

Since commit 9e7411c5c3b5 was reverted, this one is no longer needed
either.

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 fs/nfsd/nfsctl.c      | 2 +-
 ipc/mqueue.c          | 2 +-
 net/sunrpc/rpc_pipe.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 7411a56..048d61d 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1113,7 +1113,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 #endif
 		/* last one */ {""}
 	};
-	struct net *net = sb->s_ns;
+	struct net *net = data;
 	int ret;
 
 	ret = simple_fill_super(sb, 0x6e667364, nfsd_files);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c508938..6a8f37d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -309,7 +309,7 @@ err:
 static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct inode *inode;
-	struct ipc_namespace *ns = sb->s_ns;
+	struct ipc_namespace *ns = data;
 
 	sb->s_blocksize = PAGE_CACHE_SIZE;
 	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index b8f6185..79681e5 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1395,7 +1395,7 @@ rpc_fill_super(struct super_block *sb, void *data, int silent)
 {
 	struct inode *inode;
 	struct dentry *root, *gssd_dentry;
-	struct net *net = sb->s_ns;
+	struct net *net = data;
 	struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
 	int err;
 
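To make the data flow behind these reverts easier to follow, here is a small userspace model of how mount_ns() keys superblocks by namespace and hands the same pointer to fill_super. It is only a sketch under simplifying assumptions: sb_sketch, sb_table, sget_sketch() and mount_ns_sketch() are hypothetical names, the fixed-size table replaces the kernel's superblock list, and locking, refcounting and error pointers are all omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct super_block. */
struct sb_sketch {
	void *s_fs_info;	/* namespace key, as after the s_ns revert */
	int filled;		/* how many times fill_super ran for this sb */
	int in_use;
};

#define MAX_SB 8
static struct sb_sketch sb_table[MAX_SB];

typedef int (*sb_test_fn)(struct sb_sketch *, void *);
typedef int (*sb_set_fn)(struct sb_sketch *, void *);

/* Mirrors ns_test_super(): does this superblock belong to namespace data? */
static int ns_test_super_sketch(struct sb_sketch *sb, void *data)
{
	return sb->s_fs_info == data;
}

/* Mirrors ns_set_super(): bind a fresh superblock to namespace data. */
static int ns_set_super_sketch(struct sb_sketch *sb, void *data)
{
	sb->s_fs_info = data;
	return 0;
}

/* Simplified sget(): reuse a matching superblock or claim a free slot. */
static struct sb_sketch *sget_sketch(sb_test_fn test, sb_set_fn set, void *data)
{
	size_t i;

	for (i = 0; i < MAX_SB; i++)
		if (sb_table[i].in_use && test(&sb_table[i], data))
			return &sb_table[i];

	for (i = 0; i < MAX_SB; i++) {
		if (!sb_table[i].in_use) {
			if (set(&sb_table[i], data) != 0)
				return NULL;
			sb_table[i].in_use = 1;
			sb_table[i].filled = 0;
			return &sb_table[i];
		}
	}
	return NULL;
}

/* fill_super gets the namespace through data, as in the reverted state. */
static int fill_super_sketch(struct sb_sketch *sb, void *data, int silent)
{
	(void)silent;
	assert(sb->s_fs_info == data);
	sb->filled++;
	return 0;
}

/* Same shape as mount_ns(): one superblock per namespace, filled once. */
struct sb_sketch *mount_ns_sketch(void *ns)
{
	struct sb_sketch *sb;

	sb = sget_sketch(ns_test_super_sketch, ns_set_super_sketch, ns);
	if (!sb)
		return NULL;

	if (!sb->filled && fill_super_sketch(sb, ns, 0) != 0) {
		sb->in_use = 0;
		return NULL;
	}
	return sb;
}
```

Mounting twice with the same namespace pointer returns the same superblock and runs fill_super only once, while a different pointer gets its own superblock. Because the set callback has already stored the key before fill_super runs, a fill_super implementation can read the namespace from either place, which is why the choice between `data` and a dedicated field is a matter of convention rather than necessity.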
[Devel] [PATCH RHEL7 COMMIT] Revert ve/pty: containerize Unix98 pty drivers
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 1ff0db51541d3bf04c228025cb48de284adb78b2
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 18:31:49 2015 +0400

Revert ve/pty: containerize Unix98 pty drivers

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, so nothing needs to be
done on the driver's side. We don't even have this in PCS6.

The patch set makes the ptmx device system-wide while its class, tty_class,
is still virtualized. Since it's now system-wide, we have to add its sysfs
entry to ve.default_sysfs_permissions, but since its class is virtualized,
we won't be able to do it (see sysfs_perms_set -> sysfs_find_dirent). As a
result, if the container relies on sysfs while creating devnodes, it will
not find ptmx and will therefore fall back to legacy ptys, which we are
going to drop. The last patch (ve/pty: create ptmx device per ve namespace)
addresses this.

===
This patch description:

This reverts commit 79b66035f81e1c8996f2524f26af096e44e2ae4b.

Conflicts:
	kernel/ve/ve.c

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 kernel/ve/ve.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index bdfa30d..5025149 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -449,10 +449,6 @@ int ve_start_container(struct ve_struct *ve)
 	if (err)
 		goto err_legacy_pty;
 
-	err = ve_unix98_pty_init(ve);
-	if (err)
-		goto err_unix98_pty;
-
 	err = ve_tty_console_init(ve);
 	if (err)
 		goto err_tty_console;
@@ -472,8 +468,6 @@ int ve_start_container(struct ve_struct *ve)
 err_iterate:
 	ve_tty_console_fini(ve);
 err_tty_console:
-	ve_unix98_pty_fini(ve);
-err_unix98_pty:
 	ve_legacy_pty_fini(ve);
 err_legacy_pty:
 	ve_stop_umh(ve);
@@ -506,7 +500,6 @@ void ve_stop_ns(struct pid_namespace *pid_ns)
 	ve->is_running = 0;
 
 	ve_tty_console_fini(ve);
-	ve_unix98_pty_fini(ve);
 	ve_legacy_pty_fini(ve);
 
 	ve_stop_umh(ve);
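The hunks above remove one rung from the goto-based unwind ladder in ve_start_container(): the init call, its label, and its fini call all go together. A minimal userspace sketch of that ladder after the revert, with failure injection, might look like the following. The names ve_sketch, step_init(), step_fini() and the STEP_* flags are hypothetical stand-ins for the real ve_*_init()/ve_*_fini() helpers, and only the three steps visible in the diff context are modeled.

```c
#include <assert.h>

/* One bit per initialized subsystem; hypothetical stand-ins for VE state. */
enum {
	STEP_UMH         = 1,
	STEP_LEGACY_PTY  = 2,
	STEP_TTY_CONSOLE = 4,
};

struct ve_sketch {
	unsigned int active;	/* bitmask of steps currently initialized */
	int fail_step;		/* step that should fail, 0 for none */
};

static int step_init(struct ve_sketch *ve, unsigned int step)
{
	if (ve->fail_step == (int)step)
		return -1;
	ve->active |= step;
	return 0;
}

static void step_fini(struct ve_sketch *ve, unsigned int step)
{
	/* Undoing a step that never ran would be an unwind-order bug. */
	assert(ve->active & step);
	ve->active &= ~step;
}

/*
 * Each label undoes everything initialized before the step that failed,
 * in reverse order, mirroring the err_* labels in the patch context.
 */
int ve_start_container_sketch(struct ve_sketch *ve)
{
	int err;

	err = step_init(ve, STEP_UMH);
	if (err)
		goto err_umh;

	err = step_init(ve, STEP_LEGACY_PTY);
	if (err)
		goto err_legacy_pty;

	err = step_init(ve, STEP_TTY_CONSOLE);
	if (err)
		goto err_tty_console;

	return 0;

err_tty_console:
	step_fini(ve, STEP_LEGACY_PTY);
err_legacy_pty:
	step_fini(ve, STEP_UMH);
err_umh:
	return err;
}
```

The assert in step_fini() is there to catch exactly the kind of mistake a partial revert could introduce, namely a fini call left behind for a step whose init was removed.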
[Devel] [PATCH RHEL7 COMMIT] Revert pty: split Unix98 init routines
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit ee5a5380520330fedde1a323d5ca3cb5cad20b4f Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 18:32:03 2015 +0400 Revert pty: split Unix98 init routines Patchset description: Zap Unix98 pty virtualization Unix98 ptys are already virtualized on the VFS layer, nothing needs to be done on the driver's side. We don't even have this in PCS6. The patch set makes ptmx device system-wide while its class, tty_class, is still virtualized. Since it's now system-wide, we have to add its sysfs entry to ve.default_sysfs_permissions, but since its class is virtualized, we won't be able to do it (see sysfs_perms_set - sysfs_find_dirent). As a result, if the container relies on sysfs while creating devnodes, it will not find ptmx and therefore fallback to legacy ptys, which we are going to drop. The last patch (ve/pty: create ptmx device per ve namespace) addresses this. === This patch description: This reverts commit 3aec66abd43440bc7dd4c6bbe84734adb6d82851. 
Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/tty/pty.c | 100 -- 1 file changed, 15 insertions(+), 85 deletions(-) diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c index 56c0a21..bd17a45 100644 --- a/drivers/tty/pty.c +++ b/drivers/tty/pty.c @@ -820,62 +820,25 @@ err_file: static struct file_operations ptmx_fops; -static void __unix98_unregister_ptmx(void) -{ - unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1); - cdev_del(ptmx_cdev); -} - -static int __unix98_register_ptmx(void) - { - int err; - - cdev_init(ptmx_cdev, ptmx_fops); - err = cdev_add(ptmx_cdev, MKDEV(TTYAUX_MAJOR, 2), 1); - if (err) { - printk(KERN_ERR Couldn't add /dev/ptmx device); - return err; - } - err = register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, /dev/ptmx); - if (err 0) { - printk(KERN_ERR Couldn't register /dev/ptmx driver); - goto err_ptmx_register; - } - return 0; - -err_ptmx_register: - cdev_del(ptmx_cdev); - return err; -} - -static int __unix98_pty_init(struct tty_driver **ptm_driver_p, - struct tty_driver **pts_driver_p) +static void __init unix98_pty_init(void) { - struct tty_driver *ptm_driver, *pts_driver; - int err; - struct device *dev; - ptm_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV | TTY_DRIVER_DEVPTS_MEM | TTY_DRIVER_DYNAMIC_ALLOC); - if (IS_ERR(ptm_driver)) { - printk(KERN_ERR Couldn't allocate Unix98 ptm driver); - return PTR_ERR(ptm_driver); - } + if (IS_ERR(ptm_driver)) + panic(Couldn't allocate Unix98 ptm driver); pts_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV | TTY_DRIVER_DEVPTS_MEM | TTY_DRIVER_DYNAMIC_ALLOC); - if (IS_ERR(pts_driver)) { - printk(KERN_ERR Couldn't allocate Unix98 pts driver); - err = PTR_ERR(pts_driver); - goto err_pts_alloc; - } + if (IS_ERR(pts_driver)) + panic(Couldn't allocate Unix98 pts driver); + ptm_driver-driver_name = pty_master; ptm_driver-name = ptm; ptm_driver-major = 
UNIX98_PTY_MASTER_MAJOR; @@ -905,53 +868,20 @@ static int __unix98_pty_init(struct tty_driver **ptm_driver_p, pts_driver-other = ptm_driver; tty_set_operations(pts_driver, pty_unix98_ops); - err = tty_register_driver(ptm_driver); - if (err) { - printk(KERN_ERR Couldn't register Unix98 ptm driver); - goto err_ptm_register; - } - err = tty_register_driver(pts_driver); - if (err) { - printk(KERN_ERR Couldn't register Unix98 pts driver); - goto err_pts_register; - } + if (tty_register_driver(ptm_driver)) + panic(Couldn't register Unix98 ptm driver); + if (tty_register_driver(pts_driver)) + panic(Couldn't register Unix98 pts driver); /* Now create the /dev/ptmx special device */ tty_default_fops(ptmx_fops); ptmx_fops.open = ptmx_open; - err = __unix98_register_ptmx(); - if (err) - goto err_ptmx_register; - - dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR,
[Devel] [PATCH RHEL7 COMMIT] ve/radix-tree: do not account radix_tree_nodes to memcg
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit d4b302e64d3523bddf4e300d0a975a7717ac784b Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 18:44:29 2015 +0400 ve/radix-tree: do not account radix_tree_nodes to memcg There are two problems if they are accounted. First, radix_tree_nodes allocated by tcache/tswap for storing their internal data will be accounted to the container that issued a store, which is wrong, because they can only get reclaimed on global pressure. Using __GFP_NOACCOUNT in tcache/tswap wouldn't help due to per cpu radix_tree_node preloads. Second, workingset detection logic (see mm/workingset.c) is still not memory cgroup aware. In particular, this means that shadow radix_tree_nodes can only be reclaimed on global memory pressure although they are accounted to a memory cgroup. As a result, after reading a huge file, all the container's memory can get filled with shadow entries, which won't be reclaimed on local memory pressure, making the container unusable. This is a quick-fix which makes radix_tree_nodes unaccountable. This is acceptable for now, because we had never accounted radix_tree_nodes before Vz7 anyway. The true fix would be (a) making radix_tree_node preloads unaccountable (or per memory cgroup) and (b) making workingset detection logic memory cgroup aware. This should and will be done upstream first. 
https://jira.sw.ru/browse/PSBM-35205

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 lib/radix-tree.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index dd3347f..4b362cb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -228,7 +228,8 @@ radix_tree_node_alloc(struct radix_tree_root *root)
 		}
 	}
 	if (ret == NULL)
-		ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		ret = kmem_cache_alloc(radix_tree_node_cachep,
+				       gfp_mask | __GFP_NOACCOUNT);
 
 	BUG_ON(radix_tree_is_indirect_ptr(ret));
 	return ret;
@@ -279,7 +280,8 @@ static int __radix_tree_preload(gfp_t gfp_mask)
 	rtp = &__get_cpu_var(radix_tree_preloads);
 	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
 		preempt_enable();
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_alloc(radix_tree_node_cachep,
+					gfp_mask | __GFP_NOACCOUNT);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 0845747ebe2654d1e6e56a0425b21e599a47f4f6
Author: Mel Gorman mgor...@suse.de
Date: Fri Aug 28 18:50:29 2015 +0400

    ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY

This patch fixes memcg overreclaim w/o tswap/zswap as described in:

https://jira.sw.ru/browse/PSBM-35275

Memcg overreclaim still happens if tswap or zswap is used. That case has yet to be investigated; however, this patch is definitely worth pulling.

Commit "mm: vmscan: obey proportional scanning requirements for kswapd" ensured that file/anon lists were scanned proportionally for reclaim from kswapd but ignored it for direct reclaim. The intent was to minimise direct reclaim latency, but Yuanhan Liu pointed out that it substitutes one long stall for many small stalls and distorts aging for normal workloads like streaming readers/writers. Hugh Dickins pointed out that a side-effect of the same commit was that when one LRU list dropped to zero, the entirety of the other list was shrunk, leading to excessive reclaim in memcgs.

This patch scans the file/anon lists proportionally for direct reclaim so that pages age similarly whether reclaimed by kswapd or direct reclaim, but takes care to abort reclaim if one LRU drops to zero after reclaiming the requested number of pages.

Based on ext4 and using the Intel VM scalability test

                                        3.15.0-rc5            3.15.0-rc5
                                          shrinker            proportion
Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)

The test cases are running multiple dd instances reading sparse files. The results are within the noise for the small test machine. The impact of the patch is more noticeable from the vmstats

                            3.15.0-rc5  3.15.0-rc5
                              shrinker  proportion
Minor Faults                     35154       36784
Major Faults                       611        1305
Swap Ins                           394        1651
Swap Outs                         4394        5891
Allocation stalls               118616       44781
Direct pages scanned           4935171     4602313
Kswapd pages scanned          15921292    16258483
Kswapd pages reclaimed        15913301    16248305
Direct pages reclaimed         4933368     4601133
Kswapd efficiency                  99%         99%
Kswapd velocity             670088.047  682555.961
Direct efficiency                  99%         99%
Direct velocity             207709.217  193212.133
Percentage direct scans            23%         22%
Page writes by reclaim        4858.000    6232.000
Page writes file                   464         341
Page writes anon                  4394        5891

Note that there are fewer allocation stalls even though the amount of direct reclaim scanning is very approximately the same.

Signed-off-by: Mel Gorman mgor...@suse.de Cc: Johannes Weiner han...@cmpxchg.org Cc: Hugh Dickins hu...@google.com Cc: Tim Chen tim.c.c...@linux.intel.com Cc: Dave Chinner da...@fromorbit.com Tested-by: Yuanhan Liu yuanhan@linux.intel.com Cc: Bob Liu bob@oracle.com Cc: Jan Kara j...@suse.cz Cc: Rik van Riel r...@redhat.com Cc: Al Viro v...@zeniv.linux.org.uk Signed-off-by: Andrew Morton a...@linux-foundation.org Signed-off-by: Linus Torvalds torva...@linux-foundation.org (cherry picked from commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813) Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- mm/vmscan.c | 36 +--- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 0b4c98f..2bb62ce 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2129,13 +2129,27 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned long nr_reclaimed = 0; unsigned long nr_to_reclaim = sc-nr_to_reclaim; struct blk_plug plug; - bool scan_adjusted = false; + bool scan_adjusted; get_scan_count(lruvec, sc, nr, lru_pages); /* Record the original scan target for proportional adjustments
[Devel] [PATCH RHEL7 COMMIT] memcg: fix swap_max calculation for nested cgroups
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 338ce9637d706f2bf01ef9153b78953ff65c2efb
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 18:36:03 2015 +0400

    memcg: fix swap_max calculation for nested cgroups

If there is a sub-memcg in a container, its swapout won't update swap_max of the container's memcg, because we don't ascend the memcg hierarchy in mem_cgroup_update_swap_max. This patch fixes this issue.

Fixes: a74376e2dde13 ("bc/memcg: show correct swap max for beancounters")
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5f3e0ac..7fc2931 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -903,12 +903,14 @@ static void mem_cgroup_update_swap_max(struct mem_cgroup *memcg)
 {
 	long long swap;
 
-	swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
-		res_counter_read_u64(&memcg->res, RES_USAGE);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
 
-	/* This is racy, but we don't have to be absolutely precise */
-	if (swap > (long long)memcg->swap_max)
-		memcg->swap_max = swap;
+		/* This is racy, but we don't have to be absolutely precise */
+		if (swap > (long long)memcg->swap_max)
+			memcg->swap_max = swap;
+	}
 }
 
 static void mem_cgroup_inc_failcnt(struct mem_cgroup *memcg,
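The hierarchy walk in this patch can be illustrated with a small userspace sketch. This is not the kernel code: `struct memcg_sketch`, `sketch_charge_swap()` and `sketch_update_swap_max()` are hypothetical stand-ins, and the two usage counters are plain integers rather than res_counters. Only the idea of following parent pointers and raising a per-level high-water mark is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup: only what the sketch needs. */
struct memcg_sketch {
	struct memcg_sketch *parent;    /* NULL for the top-level group */
	unsigned long long mem_usage;   /* plays the role of the "res" usage */
	unsigned long long memsw_usage; /* plays the role of the "memsw" usage */
	unsigned long long swap_max;    /* high-water mark of swap usage */
};

/* Account swapped-out pages to cg and every ancestor (memsw grows, mem does not). */
static void sketch_charge_swap(struct memcg_sketch *cg, unsigned long long pages)
{
	assert(cg != NULL);
	for (; cg != NULL; cg = cg->parent)
		cg->memsw_usage += pages;
}

/* Walk from cg up to the top, raising swap_max at every level, as the patch does. */
static void sketch_update_swap_max(struct memcg_sketch *cg)
{
	for (; cg != NULL; cg = cg->parent) {
		long long swap = (long long)cg->memsw_usage -
				 (long long)cg->mem_usage;

		if (swap > (long long)cg->swap_max)
			cg->swap_max = (unsigned long long)swap;
	}
}
```

With a container group and one nested group, charging swap to the nested group and then calling `sketch_update_swap_max()` on it raises `swap_max` in both. Without the loop, only the nested group's mark would move, which is the behaviour the commit message describes as the bug.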
[Devel] [PATCH RHEL7 COMMIT] Revert ve/pty: containerize Unix98 driver
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit fd19fc2c70ae5da0a0902dea96213f52dc6afbfd Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 18:31:56 2015 +0400 Revert ve/pty: containerize Unix98 driver Patchset description: Zap Unix98 pty virtualization Unix98 ptys are already virtualized on the VFS layer, nothing needs to be done on the driver's side. We don't even have this in PCS6. The patch set makes ptmx device system-wide while its class, tty_class, is still virtualized. Since it's now system-wide, we have to add its sysfs entry to ve.default_sysfs_permissions, but since its class is virtualized, we won't be able to do it (see sysfs_perms_set - sysfs_find_dirent). As a result, if the container relies on sysfs while creating devnodes, it will not find ptmx and therefore fallback to legacy ptys, which we are going to drop. The last patch (ve/pty: create ptmx device per ve namespace) addresses this. === This patch description: This reverts commit 1b2c1fe8428715c3b5ec0a94d0568b5a5c526032. 
Conflicts: include/linux/ve.h Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/tty/pty.c | 88 ++--- include/linux/tty.h | 6 ++-- include/linux/ve.h | 6 3 files changed, 32 insertions(+), 68 deletions(-) diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c index 7afb822..56c0a21 100644 --- a/drivers/tty/pty.c +++ b/drivers/tty/pty.c @@ -23,10 +23,15 @@ #include linux/devpts_fs.h #include linux/slab.h #include linux/mutex.h -#include linux/ve.h #include bc/misc.h +#ifdef CONFIG_UNIX98_PTYS +static struct tty_driver *ptm_driver; +static struct tty_driver *pts_driver; +static DEFINE_MUTEX(devpts_mutex); +#endif + static void pty_close(struct tty_struct *tty, struct file *filp) { BUG_ON(!tty); @@ -53,11 +58,11 @@ static void pty_close(struct tty_struct *tty, struct file *filp) if (tty-driver-subtype == PTY_TYPE_MASTER) { set_bit(TTY_OTHER_CLOSED, tty-flags); #ifdef CONFIG_UNIX98_PTYS - if (tty-driver == tty-driver-ve-ptm_driver) { - mutex_lock(tty-driver-ve-devpts_mutex); + if (tty-driver == ptm_driver) { + mutex_lock(devpts_mutex); if (tty-link-driver_data) devpts_pty_kill(tty-link-driver_data); - mutex_unlock(tty-driver-ve-devpts_mutex); + mutex_unlock(devpts_mutex); } #endif tty_unlock(tty); @@ -669,9 +674,9 @@ static struct tty_struct *pts_unix98_lookup(struct tty_driver *driver, { struct tty_struct *tty; - mutex_lock(driver-ve-devpts_mutex); + mutex_lock(devpts_mutex); tty = devpts_get_priv(pts_inode); - mutex_unlock(driver-ve-devpts_mutex); + mutex_unlock(devpts_mutex); /* Master must be open before slave */ if (!tty) return ERR_PTR(-EIO); @@ -748,7 +753,6 @@ static int ptmx_open(struct inode *inode, struct file *filp) struct inode *slave_inode; int retval; int index; - struct ve_struct *ve = (inode-i_sb-s_ns) ? : get_exec_env(); nonseekable_open(inode, filp); @@ -760,18 +764,18 @@ static int ptmx_open(struct inode *inode, struct file *filp) return retval; /* find a device that is not in use. 
*/ - mutex_lock(ve-devpts_mutex); + mutex_lock(devpts_mutex); index = devpts_new_index(inode); if (index 0) { retval = index; - mutex_unlock(ve-devpts_mutex); + mutex_unlock(devpts_mutex); goto err_file; } - mutex_unlock(ve-devpts_mutex); + mutex_unlock(devpts_mutex); mutex_lock(tty_mutex); - tty = tty_init_dev(ve-ptm_driver, index); + tty = tty_init_dev(ptm_driver, index); if (IS_ERR(tty)) { retval = PTR_ERR(tty); @@ -796,7 +800,7 @@ static int ptmx_open(struct inode *inode, struct file *filp) } tty-link-driver_data = slave_inode; - retval = ve-ptm_driver-ops-open(tty, filp); + retval = ptm_driver-ops-open(tty, filp); if (retval) goto err_release; @@ -816,22 +820,16 @@ err_file: static struct file_operations ptmx_fops; -static void __unix98_unregister_ptmx(struct ve_struct *ve) +static void __unix98_unregister_ptmx(void) { - if (!ve_is_super(ve)) - return; - unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1); cdev_del(ptmx_cdev); } -static int __unix98_register_ptmx(struct ve_struct *ve) -{ +static int __unix98_register_ptmx(void) + { int
[Devel] [PATCH RHEL7 COMMIT] ve/pty: create ptmx device per ve namespace
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 953017eb9e8237859f63d7b0a2c816b7e7e5a615 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 18:32:16 2015 +0400 ve/pty: create ptmx device per ve namespace Patchset description: Zap Unix98 pty virtualization Unix98 ptys are already virtualized on the VFS layer, nothing needs to be done on the driver's side. We don't even have this in PCS6. The patch set makes ptmx device system-wide while its class, tty_class, is still virtualized. Since it's now system-wide, we have to add its sysfs entry to ve.default_sysfs_permissions, but since its class is virtualized, we won't be able to do it (see sysfs_perms_set - sysfs_find_dirent). As a result, if the container relies on sysfs while creating devnodes, it will not find ptmx and therefore fallback to legacy ptys, which we are going to drop. The last patch (ve/pty: create ptmx device per ve namespace) addresses this. === This patch description: After Unix98 PTY driver virtualization was reverted, we have to manually set sysfs permissions for ptmx. This, however, is currently impossible, because tty_class is still virtualized, which makes ve.sysfs_permissions ignore it (see sysfs_perms_set). This patch is a quick-fix which simply creates/destroys ptmx device in ve namespace on container start/stop. It must be dropped when commit 6022450d12653 (ve/tty: make tty_class VE-namespace aware) is reverted. 
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index bd17a45..529046b 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -818,6 +818,32 @@ err_file:
 	return retval;
 }
 
+static int ve_unix98_pty_init(void *data)
+{
+	struct ve_struct *ve = data;
+	struct device *dev;
+
+	dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), ve, "ptmx");
+	if (IS_ERR(dev)) {
+		pr_warn("Failed to create ptmx device for ve %s: %ld\n",
+			ve->ve_name, PTR_ERR(dev));
+		return PTR_ERR(dev);
+	}
+	return 0;
+}
+
+static void ve_unix98_pty_fini(void *data)
+{
+	device_destroy_namespace(tty_class, MKDEV(TTYAUX_MAJOR, 2), data);
+}
+
+static struct ve_hook ve_unix98_pty_hook = {
+	.init = ve_unix98_pty_init,
+	.fini = ve_unix98_pty_fini,
+	.priority = HOOK_PRIO_DEFAULT,
+	.owner = THIS_MODULE,
+};
+
 static struct file_operations ptmx_fops;
 
 static void __init unix98_pty_init(void)
@@ -882,6 +908,7 @@ static void __init unix98_pty_init(void)
 	    register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, "/dev/ptmx") < 0)
 		panic("Couldn't register /dev/ptmx driver");
 	device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), NULL, "ptmx");
+	ve_hook_register(VE_SS_CHAIN, &ve_unix98_pty_hook);
 }
 
 #else
[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: never isolate more pages than necessary
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
------>
commit 703ed09d7ee4d9af6cec3c4970842f282176f5e0
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 18:50:33 2015 +0400

    ms/mm/vmscan: never isolate more pages than necessary

Along with

  [PATCH rh7] mm: vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY

this should fix https://jira.sw.ru/browse/PSBM-35275

I submitted this patch upstream (https://lkml.org/lkml/2015/8/3/404) and it was merged into the mmotm tree. Hopefully, it will get merged into Linus's tree soon.

If transparent huge pages are enabled, we can isolate many more pages than we actually need to scan, because we count both single and huge pages equally in isolate_lru_pages().

Since commit 5bc7b8aca942d ("mm: thp: add split tail pages to shrink page list in page reclaim"), we scan all the tail pages immediately after a huge page split (see shrink_page_list()). As a result, we can reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run!

This is easy to catch on memcg reclaim with zswap enabled. The latter makes swapout instant so that if we happen to scan an unreferenced huge page we will evict both its head and tail pages immediately, which is likely to result in excessive reclaim.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bb62ce..7beadf5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1297,7 +1297,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long scan;
 
-	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
+					!list_empty(src); scan++) {
 		struct page *page;
 		int nr_pages;
 
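The effect of the extra loop condition can be seen in a userspace sketch that models the LRU list as an array of per-entry page counts. This is an illustration under assumptions, not the kernel function: `sketch_isolate()` and `SKETCH_HPAGE_NR` are hypothetical names, the value 512 is assumed for the number of base pages per huge page, and the `bound_taken` flag merely switches between the old and the patched loop condition.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HPAGE_NR 512UL /* assumed base pages per huge page */

/*
 * pages[i] is the number of base pages behind LRU entry i
 * (1 for a regular page, SKETCH_HPAGE_NR for a huge page).
 * Returns the number of base pages taken. With bound_taken == 0 the
 * loop only counts entries (old behaviour); with bound_taken != 0 it
 * also stops once nr_taken reaches nr_to_scan (patched behaviour).
 */
static unsigned long sketch_isolate(const unsigned long *pages, size_t len,
				    unsigned long nr_to_scan, int bound_taken)
{
	unsigned long nr_taken = 0;
	unsigned long scan;

	assert(pages != NULL);
	for (scan = 0; scan < nr_to_scan && scan < len &&
	     (!bound_taken || nr_taken < nr_to_scan); scan++)
		nr_taken += pages[scan];

	return nr_taken;
}
```

For a list of 32 huge pages and `nr_to_scan == 32`, the old condition takes 32 * 512 base pages, while the patched one stops after the first huge page. The patched loop can still overshoot by up to one huge page, since the check runs before each entry is taken.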
Re: [Devel] [PATCH rh7] ve: Add a ability to show ve.mount_opts
On 07/20/2015 10:05 PM, Maxim Patlasov wrote:
On 07/14/2015 01:27 AM, Kirill Tkhai wrote:
On Mon, 13/07/2015 at 12:38 -0700, Maxim Patlasov wrote:
On 07/08/2015 04:50 AM, Kirill Tkhai wrote:
...

Why do we need to show hidden options to the CT's user? He/she doesn't see the .balloon file, so it doesn't seem consistent to show balloon_ino=N.

But this way read won't show everything that was written using write. It may confuse users or vzctl developers. I think more debug info won't be worse.

Sorry for the delay, I somehow missed your reply in my inbox folder. Are these 'read' and 'write' allowed only from the host system (ve0) or inside a CT as well?

It's allowed from inside a CT like other ve cgroup files. But it's not a problem of this patch, it's a generic problem, because mounting of the ve cgroup from a CT is not prohibited for now. Please see cgroup_mount() for the details.

OK. So, let's continue here:

* by default the ve cgroup is not visible from inside a CT
* currently it's possible to mount the ve cgroup inside a CT, but this is temporary, we'll disable this: https://jira.sw.ru/browse/PSBM-34291
* this patch allows seeing mount options via the ve cgroup
  => after PSBM-34291 is fixed, mount options will be visible only from ve0 (host)
* for the host it's OK to see all hidden options

Kirill, Maxim, please ack that I understand the situation correctly here, and I'll apply the patch.
[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: Do not wait for page writeback for GFP_NOFS allocations
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
------>
commit 32c7d3e46ac5734aa426767243a7e29657141ec3
Author: Michal Hocko
Date: Mon Aug 31 18:57:12 2015 +0400

    ms/mm/vmscan: Do not wait for page writeback for GFP_NOFS allocations

This patch has not been merged upstream yet, I took it from LKML. Nevertheless, it has already been committed to mmotm and even taken by Greg for stable. It is definitely worth backporting if we don't want to get tasks hung in D-state on memcg reclaim. vdavydov@

Nikolay has reported a hang when a memcg reclaim got stuck with the following backtrace:

PID: 18308  TASK: 883d7c9b0a30  CPU: 1  COMMAND: "rsync"
  #0 __schedule at 815ab152
  #1 schedule at 815ab76e
  #2 schedule_timeout at 815ae5e5
  #3 io_schedule_timeout at 815aad6a
  #4 bit_wait_io at 815abfc6
  #5 __wait_on_bit at 815abda5
  #6 wait_on_page_bit at 8111fd4f
  #7 shrink_page_list at 81135445
  #8 shrink_inactive_list at 81135845
  #9 shrink_lruvec at 81135ead
 #10 shrink_zone at 811360c3
 #11 shrink_zones at 81136eff
 #12 do_try_to_free_pages at 8113712f
 #13 try_to_free_mem_cgroup_pages at 811372be
 #14 try_charge at 81189423
 #15 mem_cgroup_try_charge at 8118c6f5
 #16 __add_to_page_cache_locked at 8112137d
 #17 add_to_page_cache_lru at 81121618
 #18 pagecache_get_page at 8112170b
 #19 grow_dev_page at 811c8297
 #20 __getblk_slow at 811c91d6
 #21 __getblk_gfp at 811c92c1
 #22 ext4_ext_grow_indepth at 8124565c
 #23 ext4_ext_create_new_leaf at 81246ca8
 #24 ext4_ext_insert_extent at 81246f09
 #25 ext4_ext_map_blocks at 8124a848
 #26 ext4_map_blocks at 8121a5b7
 #27 mpage_map_one_extent at 8121b1fa
 #28 mpage_map_and_submit_extent at 8121f07b
 #29 ext4_writepages at 8121f6d5
 #30 do_writepages at 8112c490
 #31 __filemap_fdatawrite_range at 81120199
 #32 filemap_flush at 8112041c
 #33 ext4_alloc_da_blocks at 81219da1
 #34 ext4_rename at 81229b91
 #35 ext4_rename2 at 81229e32
 #36 vfs_rename at 811a08a5
 #37 SYSC_renameat2 at 811a3ffc
 #38 sys_renameat2 at 811a408e
 #39 sys_rename at 8119e51e
 #40 system_call_fastpath at 815afa89

Dave Chinner has properly pointed out that this is a deadlock in the reclaim code because ext4 doesn't submit pages which are marked by PG_writeback right away.

The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM with too many dirty pages") and it was applied only when may_enter_fs was specified. The code has been changed by c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages") which has removed the __GFP_FS restriction with a reasoning that we do not get into the fs code. But this is not sufficient apparently because the fs doesn't necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily submit the bio. Instead it tries to map more pages into the bio and mpage_map_one_extent might trigger memcg charge which might end up waiting on a page which is marked PG_writeback but hasn't been submitted yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2) before we go to wait on the writeback. The page fault path, which is the only path that triggers memcg oom killer since 3.12, shouldn't require GFP_NOFS and so we shouldn't reintroduce the premature OOM killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already so ext4 is not the only affected filesystem. Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: sta...@vger.kernel.org # 3.9+
[ty...@mit.edu:
[Devel] [PATCH RHEL7 COMMIT] ve/sysctl: Introduce proc_doulongvec_minmax_virtual()
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
------>
commit 85474cc55aa11512b45b40bf382c620d78646992
Author: Andrey Ryabinin
Date: Mon Aug 31 19:29:17 2015 +0400

    ve/sysctl: Introduce proc_doulongvec_minmax_virtual()

proc_doulongvec_minmax_virtual() is the analogue of proc_doulongvec_minmax() for per-CT sysctls. It will be used for virtualizing aio_nr and aio_max_nr.

https://jira.sw.ru/browse/PSBM-29017

Signed-off-by: Andrey Ryabinin
Reviewed-by: Vladimir Davydov
---
 include/linux/sysctl.h |  2 ++
 kernel/sysctl.c        | 11 +++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index bdcf06d..af467dc 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -60,6 +60,8 @@ extern int proc_do_large_bitmap(struct ctl_table *, int,
 
 extern int proc_dointvec_virtual(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
+extern int proc_doulongvec_minmax_virtual(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos);
 extern int proc_dointvec_immutable(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 extern int proc_dostring_immutable(struct ctl_table *table, int write,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8478a1e..1a568e7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2727,6 +2727,17 @@ int proc_dointvec_virtual(struct ctl_table *table, int write,
 	return -EINVAL;
 }
 
+int proc_doulongvec_minmax_virtual(struct ctl_table *table, int write,
+				   void __user *buffer, size_t *lenp,
+				   loff_t *ppos)
+{
+	struct ctl_table tmp = *table;
+
+	if (virtual_ptr(&tmp.data, &ve0, sizeof(ve0), get_exec_env()))
+		return proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
+	return -EINVAL;
+}
+
 static inline bool sysctl_in_container(void)
 {
 	return !ve_is_super(get_exec_env());
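The patch does not show virtual_ptr() itself, so the following is only a guess at the technique such a helper might use: a sysctl table points at a field inside the host's ve structure, and the handler rewrites that pointer to the same offset inside the calling container's structure before delegating to the generic handler. `struct ve_sketch` and `sketch_virtual_ptr()` are hypothetical names for a userspace illustration, not the vzkernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-container structure holding the virtualized values. */
struct ve_sketch {
	unsigned long aio_nr;
	unsigned long aio_max_nr;
};

/*
 * If *ptr points inside [base, base + size), rewrite it to the same
 * offset inside cur and return 1. Otherwise leave it alone and return 0,
 * which the caller can turn into an error such as -EINVAL.
 */
static int sketch_virtual_ptr(void **ptr, void *base, size_t size, void *cur)
{
	uintptr_t addr = (uintptr_t)*ptr;
	uintptr_t start = (uintptr_t)base;

	assert(cur != NULL);
	if (addr < start || addr - start >= size)
		return 0;

	*ptr = (char *)cur + (addr - start);
	return 1;
}
```

Working on a copy of the table entry (the `tmp` variable in the patch) keeps the shared table untouched, so concurrent readers in other containers still see the original host pointer.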
[Devel] [PATCH RHEL7 COMMIT] ve/fs/aio: aio_nr & aio_max_nr variables virtualization
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit d5a0970d86642a4150439d8a599f2f359e75fbf4
Author: Andrey Ryabinin
Date: Mon Aug 31 19:38:05 2015 +0400

    ve/fs/aio: aio_nr & aio_max_nr variables virtualization

    Virtualization of the kernel-global aio_nr and aio_max_nr variables is
    required to isolate containers and ve0 from each other when allocating
    aio request/event resources.

    Each ve, including ve0, has its own aio_nr and aio_max_nr values.
    ioctx_alloc() charges the aio_nr value selected by the ve context, so
    one ve cannot exhaust the aio event resources of another ve.

    Default per-CT aio_max_nr value == 0x1, including CT0.

    https://jira.sw.ru/browse/PSBM-29017

    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Vladimir Davydov
---
 fs/aio.c            | 38 +-
 include/linux/aio.h |  6 ++
 include/linux/ve.h  |  6 ++
 kernel/sysctl.c     | 16
 kernel/ve/ve.c      |  7 +++
 5 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 70a6599..9d700b0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -122,14 +123,9 @@ struct kioctx {
 	struct page *internal_pages[AIO_RING_PAGES];
 	struct file *aio_ring_file;
+	struct ve_struct*ve;
 };
-/*-- sysctl variables*/
-static DEFINE_SPINLOCK(aio_nr_lock);
-unsigned long aio_nr; /* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x1; /* system wide maximum number of aio requests */
-/*end sysctl variables---*/
-
 static struct kmem_cache *kiocb_cachep;
 static struct kmem_cache *kioctx_cachep;
@@ -495,6 +491,9 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
 static void free_ioctx_rcu(struct rcu_head *head)
 {
 	struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+	struct ve_struct *ve = ctx->ve;
+
+	put_ve(ve);
 	kmem_cache_free(kioctx_cachep, ctx);
 }
@@ -571,6 +570,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 {
 	struct mm_struct *mm = current->mm;
 	struct kioctx *ctx;
+	struct ve_struct *ve = get_exec_env();
 	int err = -ENOMEM;
 	/* Prevent overflows */
@@ -580,7 +580,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 		return ERR_PTR(-EINVAL);
 	}
-	if (!nr_events || (unsigned long)nr_events > aio_max_nr)
+	if (!nr_events || (unsigned long)nr_events > ve->aio_max_nr)
 		return ERR_PTR(-EAGAIN);
 	ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
@@ -588,6 +588,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 		return ERR_PTR(-ENOMEM);
 	ctx->max_reqs = nr_events;
+	ctx->ve = get_ve(ve);
 	spin_lock_init(&ctx->ctx_lock);
 	spin_lock_init(&ctx->completion_lock);
@@ -608,14 +609,14 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 		goto out_freectx;
 	/* limit the number of system wide aios */
-	spin_lock(&aio_nr_lock);
-	if (aio_nr + nr_events > aio_max_nr ||
-	    aio_nr + nr_events < aio_nr) {
-		spin_unlock(&aio_nr_lock);
+	spin_lock(&ve->aio_nr_lock);
+	if (ve->aio_nr + nr_events > ve->aio_max_nr ||
+	    ve->aio_nr + nr_events < ve->aio_nr) {
+		spin_unlock(&ve->aio_nr_lock);
 		goto out_cleanup;
 	}
-	aio_nr += ctx->max_reqs;
-	spin_unlock(&aio_nr_lock);
+	ve->aio_nr += ctx->max_reqs;
+	spin_unlock(&ve->aio_nr_lock);
 	/* now link into global list. */
 	spin_lock(&mm->ioctx_lock);
@@ -633,6 +634,7 @@ out_cleanup:
 	err = -EAGAIN;
 	aio_free_ring(ctx);
 out_freectx:
+	put_ve(ctx->ve);
 	mutex_unlock(&ctx->ring_lock);
 	put_aio_ring_file(ctx);
 	kmem_cache_free(kioctx_cachep, ctx);
@@ -665,6 +667,8 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 		struct completion *requests_done)
 {
 	if (!atomic_xchg(&ctx->dead, 1)) {
+		struct ve_struct *ve = ctx->ve;
+
 		spin_lock(&mm->ioctx_lock);
 		hlist_del_rcu(&ctx->list);
 		spin_unlock(&mm->ioctx_lock);
@@ -676,10 +680,10 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 		 * -EAGAIN with no ioctxs actually in use (as far as userspace
 		 * could tell).
 		 */
-		spin_lock(&aio_nr_lock);
-		BUG_ON(aio_nr - ctx->max_reqs > aio_nr);
-		aio_nr -= ctx->max_reqs;
-		spin_unlock(&aio_nr_lock);
+		spin_lock(&ve->aio_nr_lock);
+
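The per-container accounting in the aio patch above can be illustrated with a small user-space sketch. It is only a model under stated assumptions: `struct ve_aio`, `ve_aio_charge()` and `ve_aio_uncharge()` are hypothetical names, not kernel API, and the spinlock that the patch takes around the counters is left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the per-VE counters the patch adds. */
struct ve_aio {
    unsigned long aio_nr;     /* events currently charged to this container */
    unsigned long aio_max_nr; /* per-container limit */
};

/* Charge nr_events to one container, following the checks in ioctx_alloc(). */
static int ve_aio_charge(struct ve_aio *ve, unsigned long nr_events)
{
    if (!nr_events || nr_events > ve->aio_max_nr)
        return -EAGAIN;
    /* The second comparison catches unsigned wrap-around of the sum. */
    if (ve->aio_nr + nr_events > ve->aio_max_nr ||
        ve->aio_nr + nr_events < ve->aio_nr)
        return -EAGAIN;
    ve->aio_nr += nr_events;
    return 0;
}

/* Give the events back on teardown; the assert plays the role of BUG_ON(). */
static void ve_aio_uncharge(struct ve_aio *ve, unsigned long nr_events)
{
    assert(ve->aio_nr - nr_events <= ve->aio_nr); /* underflow check */
    ve->aio_nr -= nr_events;
}
```

Because each container carries its own counter and limit, filling one `struct ve_aio` to its limit leaves every other container's budget untouched.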
[Devel] [PATCH RHEL7 COMMIT] pfcache/ext4: fix automatic csum calculation
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.4 --> commit 17c90deb8c54a3feef19d557008fcb510bed8cd3 Author: Dmitry MonakhovDate: Mon Aug 31 20:03:06 2015 +0400 pfcache/ext4: fix automatic csum calculation port from 2.6.32-x: diff-pfcache-ext4-fix-automatic-csum-calculation Bug#1) https://jira.sw.ru/browse/PSBM-23774 truncate_data_csum should clear it's state unconditionally Bug#2) BUG_ON fs/jbd2/transaction.c:1033 truncate_data_csum call chain looks like follows: ->generic_file_buffered_write_iter ->ext4_da_write_begin ->ext4_journal_start( ,,1) : reserve 1 journal block ->ext4_write_end ->ext4_update_data_csum ->ext4_truncate_data_csum ->ext4_xattr_set ->ext4_journal_start(,,20): require 20 blocks, but since journal already started it use existing handle ->jbd2_journal_dirty_metadata J_ASSERT_JH(jh, handle->h_buffer_credits > 0) -> FAILURE Obviously it is illegal to modify xattr from random context. In order to fix that bug it is reasonable to call ext4_truncate_data_csum() only from proper context (where journal was not started yet.) This patch splits ext4_update_csum in two peaces: 1) check correct csum window position and drop csum if necessary (called from write_begin) 2) update in-memory csum state (called from write_end) Minor fix: do not calculate csum for empty files. 
https://jira.sw.ru/browse/PSBM-39233 Signed-off-by: Dmitry Monakhov --- fs/ext4/ext4.h | 3 ++- fs/ext4/inode.c| 13 + fs/ext4/pfcache.c | 41 +++-- fs/ext4/truncate.h | 3 +++ 4 files changed, 41 insertions(+), 19 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 7059994..fc9608e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2843,11 +2843,12 @@ extern long ext4_dump_pfcache(struct super_block *sb, struct pfcache_dump_request __user *dump); extern int ext4_load_data_csum(struct inode *inode); extern void ext4_start_data_csum(struct inode *inode); +extern void ext4_check_pos_data_csum(struct inode *inode, loff_t pos); extern void ext4_update_data_csum(struct inode *inode, loff_t pos, unsigned len, struct page* page); extern void ext4_commit_data_csum(struct inode *inode); extern void ext4_clear_data_csum(struct inode *inode); -extern int ext4_truncate_data_csum(struct inode *inode, loff_t end); +extern void ext4_truncate_data_csum(struct inode *inode, loff_t end); extern void ext4_load_dir_csum(struct inode *inode); extern void ext4_save_dir_csum(struct inode *inode); static inline int ext4_want_data_csum(struct inode *dir) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 1b3462c..78fc407 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -238,6 +238,8 @@ void ext4_evict_inode(struct inode *inode) * protection against it */ sb_start_intwrite(inode->i_sb); + if (inode->i_blocks && ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM)) + ext4_truncate_data_csum(inode, inode->i_size); handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, ext4_blocks_for_truncate(inode)+3); if (IS_ERR(handle)) { @@ -936,6 +938,10 @@ retry_grab: unlock_page(page); retry_journal: + /* Check csum window position before journal_start */ + if (ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM)) + ext4_check_pos_data_csum(inode, pos); + handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks); if (IS_ERR(handle)) { page_cache_release(page); @@ -2593,6 
+2599,10 @@ retry_grab: * of file which has an already mapped buffer. */ retry_journal: + /* Check csum window position before journal_start */ + if (ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM)) + ext4_check_pos_data_csum(inode, pos); + handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, ext4_da_write_credits(inode, pos, len)); if (IS_ERR(handle)) { @@ -4640,6 +4650,9 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) if (error) goto err_out; } + if (ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM)) + ext4_truncate_data_csum(inode, attr->ia_size); + handle = ext4_journal_start(inode, EXT4_HT_INODE, 3); if (IS_ERR(handle)) {
[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: remove now unused css_depth()
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 8999360445307b87d687dac0b551d21e02386426
Author: Tejun Heo
Date: Wed Jun 12 21:04:48 2013 -0700

    ms/cgroup: remove now unused css_depth()

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
---
 include/linux/cgroup.h |  1 -
 kernel/cgroup.c        | 12
 2 files changed, 13 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b7eb28f..44b64c9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -884,7 +884,6 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
 /* Get id and depth of css */
 unsigned short css_id(struct cgroup_subsys_state *css);
-unsigned short css_depth(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 #else /* !CONFIG_CGROUPS */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b5a603c..d96176e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5296,18 +5296,6 @@ unsigned short css_id(struct cgroup_subsys_state *css)
 }
 EXPORT_SYMBOL_GPL(css_id);
-unsigned short css_depth(struct cgroup_subsys_state *css)
-{
-	struct css_id *cssid;
-
-	cssid = rcu_dereference_check(css->id, css_refcnt(css));
-
-	if (cssid)
-		return cssid->depth;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(css_depth);
-
 /**
  * css_is_ancestor - test "root" css is an ancestor of "child"
  * @child: the css to be tested.
[Devel] [PATCH RHEL7 COMMIT] ve/video/logo: show Odin's logo on boot
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.4 --> commit 051055ab058cba98ca8e0fe93689162588cb556a Author: Andrey Ryabinin <aryabi...@odin.com> Date: Mon Aug 31 17:22:42 2015 +0400 ve/video/logo: show Odin's logo on boot Show Odin's logo instead of "tux" when booting kernel with framebuffer enabled. https://jira.sw.ru/browse/PSBM-34430 Signed-off-by: Andrey Ryabinin <aryabi...@odin.com> Cc: Vladimir Davydov <vdavy...@parallels.com> Cc: Konstantin Khorenko <khore...@virtuozzo.com> khorenko@ note: the Odin's logo is shown by default, unlike PCS6, no additional kernel boot option required --- drivers/video/logo/Kconfig | 5 + drivers/video/logo/Makefile | 1 + drivers/video/logo/logo.c| 3 + drivers/video/logo/logo_odin_clut224.ppm | 24484 + include/linux/linux_logo.h | 1 + 5 files changed, 24494 insertions(+) diff --git a/drivers/video/logo/Kconfig b/drivers/video/logo/Kconfig index 39ac49e..241a013 100644 --- a/drivers/video/logo/Kconfig +++ b/drivers/video/logo/Kconfig @@ -82,4 +82,9 @@ config LOGO_M32R_CLUT224 depends on M32R default y +config LOGO_ODIN_CLUT224 + bool "224-color Odin logo" + depends on LOGO + default y + endif # LOGO diff --git a/drivers/video/logo/Makefile b/drivers/video/logo/Makefile index 3b43781..de2215f 100644 --- a/drivers/video/logo/Makefile +++ b/drivers/video/logo/Makefile @@ -15,6 +15,7 @@ obj-$(CONFIG_LOGO_SUPERH_MONO)+= logo_superh_mono.o obj-$(CONFIG_LOGO_SUPERH_VGA16)+= logo_superh_vga16.o obj-$(CONFIG_LOGO_SUPERH_CLUT224) += logo_superh_clut224.o obj-$(CONFIG_LOGO_M32R_CLUT224)+= logo_m32r_clut224.o +obj-$(CONFIG_LOGO_ODIN_CLUT224)+= logo_odin_clut224.o obj-$(CONFIG_SPU_BASE) += logo_spe_clut224.o diff --git a/drivers/video/logo/logo.c b/drivers/video/logo/logo.c index 080c35b..72b1542 100644 --- a/drivers/video/logo/logo.c +++ b/drivers/video/logo/logo.c @@ -100,6 +100,9 @@ const struct linux_logo * __init_refok fb_find_logo(int 
depth) /* M32R Linux logo */ logo = &logo_m32r_clut224; #endif +#ifdef CONFIG_LOGO_ODIN_CLUT224 + logo = &logo_odin_clut224; +#endif } return logo; } diff --git a/drivers/video/logo/logo_odin_clut224.ppm b/drivers/video/logo/logo_odin_clut224.ppm new file mode 100644 index 000..d9f7399 --- /dev/null +++ b/drivers/video/logo/logo_odin_clut224.ppm @@ -0,0 +1,24484 @@ +P3 +# CREATOR: GIMP PNM Filter Version 1.1 +80 102 +255 [PPM pixel data for the 80x102 logo omitted]
Re: [Devel] [PATCH 00/17] oom killer enhancements
Kirill, please review. -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 08/14/2015 08:03 PM, Vladimir Davydov wrote: - Patches 1-3 fix stalls on global (1, 2) and local (3) reclaim. https://jira.sw.ru/browse/PSBM-35155 - Patches 4-6 revert our code implementing per memcg oom guarantees. - Patches 7-10 fix stalls on oom - Patch 11 introduced oom timeout https://jira.sw.ru/browse/PSBM-38581 - Patches 12, 13 fix oom vs freezer cgroup race https://jira.sw.ru/browse/PSBM-38758 - Patches 14, 15 reimplement oom guarantees https://jira.sw.ru/browse/PSBM-37915 - Patches 16, 17 resurrect oom berserker mode https://jira.sw.ru/browse/PSBM-17930 https://jira.sw.ru/browse/PSBM-26973 Cong Wang (1): freezer: Do not freeze tasks killed by OOM killer Lisa Du (1): mm: vmscan: fix do_try_to_free_pages() livelock Michal Hocko (1): oom: thaw the OOM victim if it is frozen Vinayak Menon (1): mm: vmscan: fix the page state calculation in too_many_isolated Vladimir Davydov (13): mm: vmscan: do not scan lruvec if it seems to be unreclaimable memcg: revert old oom_guarantee logic Revert "ve/mm: ignore oom_score_adj of containerized tasks on global OOM" oom: zap oom_report_invocation proto Port diff-sched-introduce-cond_resched_may_throttle oom: allow to throttle due to cfs bandwidth while invoking oom sched: add sched_boost_task helper oom: boost dying tasks on global oom oom: introduce oom kill timeout mm: take into account ub oom score on global oom memcg: forbid setting memory.oom_guarantee from inside a container oom: resurrect berserker mode oom: do not dump all tasks info on each oom kill fs/proc/base.c | 7 +- include/bc/beancounter.h | 5 ++ include/linux/memcontrol.h | 8 +- include/linux/mmzone.h | 2 +- include/linux/oom.h| 22 +++--- include/linux/sched.h | 20 + kernel/bc/beancounter.c| 29 +++ kernel/freezer.c | 3 + kernel/rtmutex.c | 5 ++ kernel/sched/core.c| 19 +++-- kernel/sched/fair.c| 3 +- kernel/sysctl.c| 14 mm/internal.h | 1 + mm/memcontrol.c| 99 ++-- 
mm/migrate.c | 2 +- mm/oom_kill.c | 183 + mm/page_alloc.c| 6 +- mm/swap.c | 1 + mm/vmscan.c| 102 + mm/vmstat.c| 4 +- 20 files changed, 377 insertions(+), 158 deletions(-) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] config.OpenVZ: show Odin's logo on boot
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit eeddf77a216082687f589e408d3f560ced20ab96
Author: Konstantin Khorenko <khore...@virtuozzo.com>
Date: Mon Aug 31 17:24:47 2015 +0400

    config.OpenVZ: show Odin's logo on boot

    Show Odin's logo instead of "tux" when booting kernel with framebuffer enabled.

    https://jira.sw.ru/browse/PSBM-34430

    Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

    khorenko@ note: the Odin's logo is shown by default, unlike PCS6,
    no additional kernel boot option required
---
 configs/kernel-3.10.0-x86_64-debug.config | 1 +
 configs/kernel-3.10.0-x86_64.config       | 1 +
 2 files changed, 2 insertions(+)

diff --git a/configs/kernel-3.10.0-x86_64-debug.config b/configs/kernel-3.10.0-x86_64-debug.config
index 5c17082..e8ea24f 100644
--- a/configs/kernel-3.10.0-x86_64-debug.config
+++ b/configs/kernel-3.10.0-x86_64-debug.config
@@ -5401,6 +5401,7 @@ CONFIG_TCACHE=y
 CONFIG_TSWAP=y
 CONFIG_VZ_IOLIMIT=m
+CONFIG_LOGO_ODIN_CLUT224=y
 #
 # User resources
diff --git a/configs/kernel-3.10.0-x86_64.config b/configs/kernel-3.10.0-x86_64.config
index ffc144b..d8b2a97 100644
--- a/configs/kernel-3.10.0-x86_64.config
+++ b/configs/kernel-3.10.0-x86_64.config
@@ -5372,6 +5372,7 @@ CONFIG_TCACHE=y
 CONFIG_TSWAP=y
 CONFIG_VZ_IOLIMIT=m
+CONFIG_LOGO_ODIN_CLUT224=y
 #
 # User resources
[Devel] [PATCH RHEL7 COMMIT] ve/tty: vt -- Fix nil dereference due to race
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.4 --> commit 0fa0a39ad2c644f55f447cef85e5a9a8f06e43b7 Author: Cyrill Gorcunov <gorcu...@virtuozzo.com> Date: Mon Aug 31 17:08:30 2015 +0400 ve/tty: vt -- Fix nil dereference due to race In commit 5571b126368c0153d73eaec0fdf43fbcbae67fd9 we bring in the stabs for virtual terminals but they are race sensitive: all therminals are represented by one per-VE @vz_tty_conm tty peer which can be removed and set to nil if application ask for new terminal when old one is inside "remove" stage. This may lead to nil dereference and panic as Nikita spotted | [ 325.357491] BUG: unable to handle kernel NULL pointer dereference at 0004 | [ 325.357816] IP: [] tty_open+0x610/0x6e0 | [ 325.357994] PGD 3b745067 PUD 3b6ff067 PMD 0 | [ 325.358201] Oops: 0002 [#1] SMP | [ 325.362469] CPU: 1 PID: 2873 Comm: criu ve: 200 Not tainted 3.10.0-123.1.2.vz7.5.29 #1 5.29 | [ 325.362954] task: 88003af56480 ti: 88003bf24000 task.ti: 88003bf24000 | [ 325.363119] RIP: 0010:[] [] tty_open+0x610/0x6e0 | [ 325.363329] RSP: 0018:88003bf25c00 EFLAGS: 00010202 | [ 325.363454] RAX: 0001 RBX: 88003638c000 RCX: 0001 | [ 325.363614] RDX: 88003bf25c34 RSI: 0002 RDI: 88003bf6 | [ 325.363776] RBP: 88003bf25c68 R08: 000208c0 R09: 88003d803c00 | [ 325.363972] R10: 0002 R11: 0004 R12: | [ 325.364145] R13: R14: 0042 R15: | [ 325.364303] FS: 7fe73f0f6740() GS:88003de4() knlGS: | [ 325.364479] CS: 0010 DS: ES: CR0: 80050033 | [ 325.364616] CR2: 0004 CR3: 3c78b000 CR4: 001406e0 | [ 325.364775] DR0: DR1: DR2: | [ 325.364949] DR3: DR6: 0ff0 DR7: 0400 | [ 325.365120] Stack: | [ 325.365198] 88003af56480 00043bf25c68 88003af56480 88003b7a1960 | [ 325.365498] 88003af56480 00428002 0001 b2cf8386 | [ 325.365838] 88003d21a068 88003b7a1960 88003638c000 | [ 325.366130] Call Trace: | [ 325.366215] [] chrdev_open+0xa1/0x1e0 | [ 325.366339] [] ? 
cdev_put+0x30/0x30 | [ 325.366472] [] do_dentry_open.isra.17+0x192/0x290 | [ 325.366625] [] finish_open+0x1e/0x30 | [ 325.366752] [] do_last.isra.62+0x36d/0x1020 | [ 325.366956] [] path_openat.isra.63+0xbe/0x480 | [ 325.367097] [] do_filp_open+0x4b/0xb0 | [ 325.367226] [] ? getname_flags+0x2c/0x120 | [ 325.367361] [] ? __alloc_fd+0xa7/0x130 | [ 325.367490] [] do_sys_open+0xf3/0x1f0 | [ 325.367623] [] SyS_openat+0x14/0x20 | [ 325.367758] [] system_call_fastpath+0x16/0x1b Lets provide per VT tty as it should be. Note the code is being reworked now for bring in real virtualization instead of stubs so this is rather a fix to not block migration testings (that's why I don't remove @vz_tty_conm and @vz_tty_cons from the struct ve_struct since I've already zapped all this including the file kernel/ve/console.c itself and once new version is stabilized we drop all this in one pass). https://jira.sw.ru/browse/PSBM-37929 Signed-off-by: Cyrill Gorcunov <gorcu...@virtuozzo.com> CC: Nikita Spiridonov <nspirido...@odin.com> CC: Vladimir Davydov <vdavy...@virtuozzo.com> CC: Konstantin Khorenko <khore...@virtuozzo.com> --- kernel/ve/console.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/ve/console.c b/kernel/ve/console.c index 922848a..bc7d752 100644 --- a/kernel/ve/console.c +++ b/kernel/ve/console.c @@ -47,7 +47,7 @@ static struct tty_struct *vz_tty_lookup(struct tty_driver *driver, if (idx != VZ_CON_INDEX || driver == vz_cons_driver) return ERR_PTR(-EIO); - return ve->vz_tty_conm; + return ve->vz_tty_vt[idx]; } static int vz_tty_install(struct tty_driver *driver, struct tty_struct *tty) @@ -62,7 +62,7 @@ static int vz_tty_install(struct tty_driver *driver, struct tty_struct *tty) tty_port_init(tty->port); tty->termios = driver->init_termios; - ve->vz_tty_conm = tty; + ve->vz_tty_vt[tty->index] = tty; tty_driver_kref_get(driver); tty->count++; @@ -74,7 +74,7 @@ static void vz_tty_remove(struct tty_driver *driver, struct tty_struct *tty) struct 
ve_struct *ve = get_exec_env(); BUG_ON(driv
[Devel] [PATCH RHEL7 COMMIT] ms/mm: memcontrol: reclaim at least once for __GFP_NORETRY
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 8aee40dd982b476c208aee78e19cd756f2dac8a7
Author: Johannes Weiner
Date: Mon Aug 31 17:09:47 2015 +0400

    ms/mm: memcontrol: reclaim at least once for __GFP_NORETRY

    Currently, __GFP_NORETRY tries charging once and gives up before even
    trying to reclaim. Bring the behavior on par with the page allocator
    and reclaim at least once before giving up.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    (cherry picked from commit 28c34c291e746aab1c2bfd6d6609b2e47fa0978b)
    Signed-off-by: Vladimir Davydov

    Conflicts:
    	mm/memcontrol.c
---
 mm/memcontrol.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fc2931..52c7871 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2754,11 +2754,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		return CHARGE_WOULDBLOCK;
 	}
-	if (gfp_mask & __GFP_NORETRY) {
-		mem_cgroup_inc_failcnt(mem_over_limit, gfp_mask, nr_pages);
-		return CHARGE_NOMEM;
-	}
-
 	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
@@ -2787,6 +2782,9 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	mem_cgroup_inc_failcnt(mem_over_limit, gfp_mask, nr_pages);
+	if (gfp_mask & __GFP_NORETRY)
+		return CHARGE_NOMEM;
+
 	/* check OOM */
 	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
 		return CHARGE_OOM_DIE;
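The reordering in the __GFP_NORETRY patch above can be shown with a toy model of the charge path. Everything here is an assumption made for illustration: `struct toy_memcg`, `toy_reclaim()` and `toy_do_charge()` are hypothetical, and reclaim is reduced to subtracting a fixed number of pages. The point is only the order of the steps: one reclaim pass always happens before the no-retry bail-out.

```c
#include <stdbool.h>

enum toy_charge { TOY_CHARGE_OK, TOY_CHARGE_RETRY, TOY_CHARGE_NOMEM, TOY_CHARGE_OOM };

/* Hypothetical, heavily simplified memory cgroup. */
struct toy_memcg {
    unsigned long usage;        /* pages charged */
    unsigned long limit;        /* hard limit in pages */
    unsigned long reclaimable;  /* pages one reclaim pass can free */
    unsigned int reclaim_calls; /* how often reclaim ran */
};

/* Free pages left under the limit. */
static unsigned long toy_margin(const struct toy_memcg *cg)
{
    return cg->limit > cg->usage ? cg->limit - cg->usage : 0;
}

/* One reclaim pass: frees whatever is reclaimable. */
static void toy_reclaim(struct toy_memcg *cg)
{
    unsigned long freed = cg->reclaimable < cg->usage ? cg->reclaimable : cg->usage;

    cg->usage -= freed;
    cg->reclaimable -= freed;
    cg->reclaim_calls++;
}

static enum toy_charge toy_do_charge(struct toy_memcg *cg,
                                     unsigned long nr_pages, bool noretry)
{
    if (toy_margin(cg) >= nr_pages) {
        cg->usage += nr_pages;
        return TOY_CHARGE_OK;
    }
    toy_reclaim(cg);            /* always reclaim once, even for no-retry */
    if (toy_margin(cg) >= nr_pages)
        return TOY_CHARGE_RETRY;
    if (noretry)                /* give up only after that one pass */
        return TOY_CHARGE_NOMEM;
    return TOY_CHARGE_OOM;      /* stand-in for the OOM handling path */
}
```

With the old ordering, a no-retry caller would have returned the NOMEM result before `toy_reclaim()` ran at all.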
Re: [Devel] [PATCH rh7] memcg: count all oom kills
Kirill, please review.

--
Best regards,
Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/14/2015 12:48 PM, Vladimir Davydov wrote:
We do not count processes killed because they share victim's mm. Fix it.

Fixes: 66053f4201e41 ("memcg: count oom kills")
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 include/linux/memcontrol.h |  4 ++--
 mm/memcontrol.c            | 15 +--
 mm/oom_kill.c              |  3 ++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb7ae43a57f9..ac3f16f0ee28 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -122,7 +122,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern bool mem_cgroup_below_oom_guarantee(struct task_struct *p);
 extern void mem_cgroup_note_oom_kill(struct mem_cgroup *memcg,
-				     struct mm_struct *mm);
+				     struct task_struct *task);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 				      struct task_struct *p);
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
@@ -351,7 +351,7 @@ static inline bool mem_cgroup_below_oom_guarantee(struct task_struct *p)
 }
 static inline void
-mem_cgroup_note_oom_kill(struct mem_cgroup *memcg, struct mm_struct *mm)
+mem_cgroup_note_oom_kill(struct mem_cgroup *memcg, struct task_struct *task)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 52c787165b17..0cb329028a29 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1623,14 +1623,25 @@ bool mem_cgroup_below_oom_guarantee(struct task_struct *p)
 }
 void mem_cgroup_note_oom_kill(struct mem_cgroup *root_memcg,
-			      struct mm_struct *mm)
+			      struct task_struct *task)
 {
 	struct mem_cgroup *memcg, *memcg_to_put;
+	struct task_struct *p;
 	if (!root_memcg)
 		root_memcg = root_mem_cgroup;
-	memcg_to_put = memcg = try_get_mem_cgroup_from_mm(mm);
+	p = find_lock_task_mm(task);
+	if (p) {
+		memcg = try_get_mem_cgroup_from_mm(p->mm);
+		task_unlock(p);
+	} else {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_task(task);
+		css_get(&memcg->css);
+		rcu_read_unlock();
+	}
+	memcg_to_put = memcg;
 	if (!memcg || !mem_cgroup_same_or_subtree(root_memcg, memcg))
 		memcg = root_memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c99a5f559286..70893730524a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -498,7 +498,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
-	mem_cgroup_note_oom_kill(memcg, mm);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -526,11 +525,13 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			mem_cgroup_note_oom_kill(memcg, p);
 		}
 	rcu_read_unlock();
 	set_tsk_thread_flag(victim, TIF_MEMDIE);
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	mem_cgroup_note_oom_kill(memcg, victim);
 	put_task_struct(victim);
 }
 #undef K
[Devel] [PATCH RHEL7 COMMIT] ve/kmod: fix out-of-bounds access in call_modprobe()
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit e2164f15d2f004ce076da3aa925b681bd8cde8d8
Author: Andrey Ryabinin <aryabi...@odin.com>
Date: Mon Aug 31 17:15:30 2015 +0400

    ve/kmod: fix out-of-bounds access in call_modprobe()

    Commit 18f83b2460e2 ("ve/kmod: Port autoloading from CT") extended the
    argv array by one element, but the allocation site was not extended to
    match.

    https://jira.sw.ru/browse/PSBM-38666

    Fixes: 18f83b2460e2 ("ve/kmod: Port autoloading from CT")
    Signed-off-by: Andrey Ryabinin <aryabi...@odin.com>
    Cc: Konstantin Khorenko <khore...@virtuozzo.com>
    Signed-off-by: Andrey Ryabinin <aryabi...@odin.com>
    Acked-by: Kirill Tkhai <ktk...@odin.com>
---
 kernel/kmod.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index e0554f8..aa5cb99 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -91,7 +91,7 @@ static int call_modprobe(char *module_name, int wait, int blacklist)
 		NULL
 	};
-	char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
+	char **argv = kmalloc(sizeof(char *[6]), GFP_KERNEL);
 	if (!argv)
 		goto out;
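The bug class fixed in call_modprobe() above is an argument vector whose allocation does not grow with its contents. A minimal user-space sketch, assuming a hypothetical helper `build_argv()` and placeholder argument strings (the real fifth argument is not shown in the patch), derives the allocation size from the argument count so the two cannot drift apart:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Build a NULL-terminated argv from nargs strings.
 * The vector needs nargs + 1 slots: one per argument plus the terminator.
 */
static char **build_argv(char *const args[], size_t nargs)
{
    char **argv;
    size_t i;

    assert(args != NULL || nargs == 0);
    argv = malloc(sizeof(char *) * (nargs + 1));
    if (!argv)
        return NULL;
    for (i = 0; i < nargs; i++)
        argv[i] = args[i];
    argv[nargs] = NULL; /* terminator lives in the extra slot */
    return argv;
}
```

With five arguments this allocates six pointers, which matches the `char *[6]` in the fix; a hard-coded `char *[5]` would place the terminator one slot past the end of the allocation.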
Re: [Devel] [PATCH rh7] ploop: use GFP_NOIO in ploop_make_request
Maxim, please review. Do we need the same in PCS6? -- Best regards, Konstantin Khorenko, Virtuozzo Linux Kernel Team On 08/17/2015 04:30 PM, Vladimir Davydov wrote: Currently, we use GFP_NOFS, which may result in a dead lock as follows: filemap_fault do_mpage_readpage submit_bio generic_make_request initializes current->bio_list calls make_request_fn ploop_make_request bio_alloc(GFP_NOFS) kmem_cache_alloc memcg_charge_kmem try_to_free_mem_cgroup_pages swap_writepage generic_make_request puts bio on current->bio_list try_to-free_mem_cgroup_pages wait_on_page_writeback The wait_on_page_writeback will never complete then, because the corresponding bio is on current->bio_list and for it to get to the queue we must return from ploop_make_request first. The stack trace of a hung task: [] sleep_on_page+0xe/0x20 [] wait_on_page_bit+0x86/0xb0 [] shrink_page_list+0x6e2/0xaf0 [] shrink_inactive_list+0x1cb/0x610 [] shrink_lruvec+0x395/0x790 [] shrink_zone+0x181/0x350 [] do_try_to_free_pages+0x170/0x530 [] try_to_free_mem_cgroup_pages+0xb6/0x140 [] __mem_cgroup_try_charge+0x1de/0xd70 [] memcg_charge_kmem+0x9b/0x100 [] __memcg_charge_slab+0x3b/0x90 [] new_slab+0x264/0x3f0 [] __slab_alloc+0x315/0x48f [] kmem_cache_alloc+0x1cc/0x210 [] mempool_alloc_slab+0x15/0x20 [] mempool_alloc+0x69/0x170 [] bvec_alloc+0x92/0x120 [] bio_alloc_bioset+0x1e8/0x2e0 [] ploop_make_request+0x2a6/0xac0 [ploop] [] generic_make_request+0xe2/0x130 [] submit_bio+0x77/0x1c0 [] do_mpage_readpage+0x37f/0x6e0 [] mpage_readpages+0xeb/0x160 [] ext4_readpages+0x3c/0x40 [ext4] [] __do_page_cache_readahead+0x1e0/0x260 [] ra_submit+0x21/0x30 [] filemap_fault+0x321/0x4b0 [] __do_fault+0x8a/0x560 [] handle_mm_fault+0x3d0/0xd80 [] __do_page_fault+0x15e/0x530 [] do_page_fault+0x1a/0x70 [] page_fault+0x28/0x30 https://jira.sw.ru/browse/PSBM-38842 Signed-off-by: Vladimir Davydov <vdavy...@parallels.com> --- drivers/block/ploop/dev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git 
a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c index 30eb8a7551e5..f37df4dacf8c 100644 --- a/drivers/block/ploop/dev.c +++ b/drivers/block/ploop/dev.c @@ -717,7 +717,7 @@ preallocate_bio(struct bio * orig_bio, struct ploop_device * plo) } if (nbio == NULL) - nbio = bio_alloc(GFP_NOFS, max(orig_bio->bi_max_vecs, block_vecs(plo))); + nbio = bio_alloc(GFP_NOIO, max(orig_bio->bi_max_vecs, block_vecs(plo))); return nbio; } @@ -852,7 +852,7 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio) if (!current->io_context) { struct io_context *ioc; - ioc = get_task_io_context(current, GFP_NOFS, NUMA_NO_NODE); + ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE); if (ioc) put_io_context(ioc); } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ve/cgroup: fix mangle root in CT
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.5
-->
commit 1518ff8ef0a78d8be1b19774506f355424103e9a
Author: Pavel Tikhomirov
Date: Tue Sep 1 16:13:30 2015 +0400

    ve/cgroup: fix mangle root in CT

    Cgroups nested more than two levels deep were not mangled inside a
    container. This could cause problems with Docker, which was able to see
    host-relative paths in /proc/self/cgroup. The problem is not Docker
    specific, though:

    CT-103 /# mkdir /sys/fs/cgroup/devices/test.slice
    CT-103 /# mkdir /sys/fs/cgroup/devices/test.slice/test.scope
    CT-103 /# sleep 1000&
    [1] 578
    CT-103 /# echo 578 > /sys/fs/cgroup/devices/test.slice/test.scope/tasks

    with patch:
    CT-103 /# cat /proc/578/cgroup
    16:ve:/
    15:hugetlb:/
    14:perf_event:/
    12:net_cls:/
    11:freezer:/
    10:devices:/test.slice/test.scope
    6:name=systemd:/user-0.slice/session-c109.scope
    5:cpuset:/
    4:cpuacct,cpu:/
    3:beancounter:/
    2:memory:/
    1:blkio:/

    without:
    CT-103 /# cat /proc/480/cgroup
    16:ve:/
    15:hugetlb:/
    14:perf_event:/
    12:net_cls:/
    11:freezer:/
    10:devices:/103/test.slice/test.scope
    6:name=systemd:/user.slice/user-0.slice/session-c2.scope
    5:cpuset:/
    4:cpuacct,cpu:/
    3:beancounter:/
    2:memory:/
    1:blkio:/

    https://jira.sw.ru/browse/PSBM-38634

    Signed-off-by: Pavel Tikhomirov
    Reviewed-by: Cyrill Gorcunov

    khorenko@: this fix is quite inflexible: if we move CTs into
    machine.slice, we will have to rework it. But I accept it because we are
    still not sure about the final cgroups "virtualization" implementation,
    so this is less work right now that can be dropped later.
---
 kernel/cgroup.c | 35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d96176e..a07c4e0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1808,6 +1808,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 {
 	int ret = -ENAMETOOLONG;
 	char *start;
+	struct ve_struct *ve = get_exec_env();
 	if (!cgrp->parent) {
 		if (strlcpy(buf, "/", buflen) >= buflen)
@@ -1815,21 +1816,6 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 		return 0;
 	}
-#ifdef CONFIG_VE
-	/*
-	 * Containers cgroups are bind-mounted from node
-	 * so they are like '/' from inside, thus we have
-	 * to mangle cgroup path output.
-	 */
-	if (!ve_is_super(get_exec_env())) {
-		if (cgrp->parent && !cgrp->parent->parent) {
-			if (strlcpy(buf, "/", buflen) >= buflen)
-				return -ENAMETOOLONG;
-			return 0;
-		}
-	}
-#endif
-
 	start = buf + buflen - 1;
 	*start = '\0';
@@ -1838,6 +1824,25 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 		const char *name = cgroup_name(cgrp);
 		int len;
+#ifdef CONFIG_VE
+		if (!ve_is_super(ve) && cgrp->parent && !cgrp->parent->parent) {
+			/*
+			 * Containers cgroups are bind-mounted from node
+			 * so they are like '/' from inside, thus we have
+			 * to mangle cgroup path output. Effectively it is
+			 * enough to remove two topmost cgroups from path.
+			 * e.g. in ct 101: /101/test.slice/test.scope ->
+			 * /test.slice/test.scope
+			 */
+			if (*start != '/') {
+				if (--start < buf)
+					goto out;
+				*start = '/';
+			}
+			break;
+		}
+#endif
+
 		len = strlen(name);
 		if ((start -= len) < buf)
 			goto out;
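The path mangling in the cgroup patch above can be sketched in user space. This is a simplified model, not the kernel code: `struct toy_cgroup` and `toy_cgroup_path()` are hypothetical, and the container check is reduced to an `in_container` flag. As in cgroup_path(), the path is built backwards from the leaf, and the walk stops at the top-level cgroup (the container's own directory) when called from inside a container.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parent-linked cgroup node; parent == NULL marks the root. */
struct toy_cgroup {
    const char *name;
    const struct toy_cgroup *parent;
};

/* Write the path of cg into buf; returns 0 on success, -1 if buf is too small. */
static int toy_cgroup_path(const struct toy_cgroup *cg, int in_container,
                           char *buf, size_t buflen)
{
    size_t pos;

    assert(cg != NULL && buf != NULL);
    if (buflen < 2)
        return -1;
    pos = buflen - 1;
    buf[pos] = '\0';

    /* Walk from the leaf towards the root, prepending "/name" each step. */
    while (cg->parent) {
        size_t len;

        /* Inside a container the top-level cgroup is hidden. */
        if (in_container && !cg->parent->parent)
            break;
        len = strlen(cg->name);
        if (pos < len + 1)
            return -1;
        pos -= len;
        memcpy(buf + pos, cg->name, len);
        buf[--pos] = '/';
        cg = cg->parent;
    }
    if (buf[pos] != '/')  /* nothing emitted: the path is just "/" */
        buf[--pos] = '/';
    memmove(buf, buf + pos, buflen - pos);
    return 0;
}
```

For a chain root -> 103 -> test.slice -> test.scope, the host view is /103/test.slice/test.scope and the container view is /test.slice/test.scope, which corresponds to the "without" and "with patch" outputs in the commit message.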
[Devel] [PATCH RHEL7 COMMIT] mmap: call mmap prep only for regular files
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.5
-->
commit 1e596ab0358ff8dde342efb6274e08459d08a711
Author: Vladimir Davydov
Date: Tue Sep 1 16:16:59 2015 +0400

    mmap: call mmap prep only for regular files

    Port 2.6.32-x diff-mm-mmap-call-mmap-prep-only-for-regular-files

    We forgot to port this patch. This results in KP on an attempt to mmap a
    char device on ext4.

    =
    Author: Vladimir Davydov
    Email: vdavy...@parallels.com
    Subject: mmap: call mmap prep only for regular files
    Date: Mon, 17 Feb 2014 12:59:36 +0400

    To give FS a chance to clear pfcache csum on shared mmap, we issue
    ->mmap(vma=NULL) for those FS's that want it (FS_HAS_MMAP_PREP) before
    taking mmap_sem (we can't do it under mmap_sem due to lockdep, see
    PSBM-23133). There we haven't checked arguments properly yet. In
    particular, the file can refer to a device, in which case we will crash,
    because devices' ->mmap (e.g. /dev/zero) is not supposed to be called
    with vma=NULL.

    Fix this by checking if the file refers to a regular file before calling
    mmap prep for it.

    https://bugzilla.openvz.org/show_bug.cgi?id=2886
    https://jira.sw.ru/browse/PSBM-25031

    Signed-off-by: Vladimir Davydov
    Acked-by: Dmitry Monakhov
    =

    Reported-by: Andrew Perepechko
    Signed-off-by: Vladimir Davydov
    Cc: Andrew Perepechko
    Cc: Alex Lyashkov
    Cc: Igor Seletskiy
---
 mm/util.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/util.c b/mm/util.c
index 31cd9d7..e0ac8ae 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -367,6 +367,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		/* Ugly fix for PSBM-23133 vdavydov@ */
 		if (file && file->f_op && (flag & MAP_TYPE) == MAP_SHARED &&
+		    S_ISREG(file_inode(file)->i_mode) &&
 		    (file_inode(file)->i_sb->s_type->fs_flags & FS_HAS_MMAP_PREP))
 			file->f_op->mmap(file, NULL);
 		down_write(&mm->mmap_sem);
[Devel] [PATCH RHEL7 COMMIT] ms/sched/numa: Fix initialization of sched_domain_topology for NUMA
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 384a4643220fffd9001172e16ea54396a3675ab6 Author: Andrey Ryabinin Date: Thu Sep 3 19:27:30 2015 +0400 ms/sched/numa: Fix initialization of sched_domain_topology for NUMA https://jira.sw.ru/browse/PSBM-26429 From: Vincent Guittot commit c515db8cd311ef77b2dc7cbd6b695022655bb0f3 upstream. Jet Chen has reported a kernel panic when booting qemu-system-x86_64 with kvm64 cpu. A panic occurred while building the sched_domain. In sched_init_numa, we create a new topology table in which both default levels and numa levels are copied. The last row of the table must have a null pointer in the mask field. The current implementation doesn't add this last row in the computation of the table size. So we add 1 row in the allocation size that will be used as the last row of the table. The kzalloc will ensure that the mask field is NULL. Reported-by: Jet Chen Tested-by: Jet Chen Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra Cc: fengguang...@intel.com Link: http://lkml.kernel.org/r/1399972261-25693-1-git-send-email-vincent.guit...@linaro.org Signed-off-by: Ingo Molnar Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 30f39a25..df63b3a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6827,7 +6827,7 @@ static void sched_init_numa(void) /* Compute default topology size */ for (i = 0; sched_domain_topology[i].mask; i++); - tl = kzalloc((i + level) * + tl = kzalloc((i + level + 1) * sizeof(struct sched_domain_topology_level), GFP_KERNEL); if (!tl) return; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
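Note on the patch above: the "+ 1" is the usual rule for sentinel-terminated tables. A minimal userspace sketch follows, assuming a simplified struct demo_level in place of struct sched_domain_topology_level and calloc() in place of kzalloc(); all demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-in for struct sched_domain_topology_level. */
struct demo_level {
    const char *mask;   /* NULL marks the end of the table */
    int flags;
};

/* Count rows up to, but not including, the NULL-mask terminator. */
size_t demo_table_len(const struct demo_level *tbl)
{
    size_t n = 0;

    while (tbl[n].mask)
        n++;
    return n;
}

/*
 * Build a table holding the default rows followed by `extra` NUMA rows.
 * The "+ 1" reserves the terminator row; calloc() zeroes it, so its mask
 * is NULL, which is what kzalloc() guarantees in the patch.
 */
struct demo_level *demo_build_table(const struct demo_level *defaults,
                                    const struct demo_level *numa,
                                    size_t extra)
{
    size_t base = demo_table_len(defaults);
    struct demo_level *tl = calloc(base + extra + 1, sizeof(*tl));

    if (!tl)
        return NULL;
    memcpy(tl, defaults, base * sizeof(*tl));
    memcpy(tl + base, numa, extra * sizeof(*tl));
    return tl;
}
```

Without the extra row, any later walk of the form `for (i = 0; tl[i].mask; i++)` would read past the end of the allocation while looking for the terminator.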
[Devel] [PATCH RHEL7 COMMIT] ms/MIPS: Use NUMA_NO_NODE instead of -1 for node ID.
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 61913332fac411269855bf321d34f87f7a4fb060 Author: Andrey Ryabinin Date: Thu Sep 3 19:27:33 2015 +0400 ms/MIPS: Use NUMA_NO_NODE instead of -1 for node ID. https://jira.sw.ru/browse/PSBM-26429 From: Ralf Baechle commit 761845f0f68cf6eba9cad0a58d977b89f8d4486f upstream. Original patch by Jianguo Wu. Signed-off-by: Ralf Baechle Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- arch/mips/kernel/module.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 977a623..2a52568 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -46,7 +47,7 @@ static DEFINE_SPINLOCK(dbe_lock); void *module_alloc(unsigned long size) { return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END, - GFP_KERNEL, PAGE_KERNEL, -1, + GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE, __builtin_return_address(0)); } #endif ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit e79a1d458f45de9a672aefd76753949780b6af16 Author: Andrey RyabininDate: Thu Sep 3 19:27:34 2015 +0400 ms/mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit 02e72cc61713185013d958baba508288ba2a0157 upstream. There are two versions of alloc/free hooks now - one for CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n. I see no reason why calls to other debugging subsystems (LOCKDEP, DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG. All this features should work regardless of SLUB_DEBUG config, as all of them already have own Kconfig options. This also fixes failslab for CONFIG_SLUB_DEBUG=n configuration. It simply has not worked before because should_failslab() call was in a hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else". Note: There is one concealed change in allocation path for SLUB_DEBUG=n and all other debugging features disabled. The might_sleep_if() call can generate some code even if DEBUG_ATOMIC_SLEEP=n. For PREEMPT_VOLUNTARY=y might_sleep() inserts _cond_resched() call, but I think it should be ok. Signed-off-by: Andrey Ryabinin Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- mm/slub.c | 90 --- 1 file changed, 40 insertions(+), 50 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 51772b6..f39e69c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -928,50 +928,6 @@ static void trace(struct kmem_cache *s, struct page *page, void *object, } /* - * Hooks for other subsystems that check memory allocations. In a typical - * production configuration these hooks all should produce no code at all. 
- */ -static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) -{ - flags &= gfp_allowed_mask; - lockdep_trace_alloc(flags); - might_sleep_if(flags & __GFP_WAIT); - WARN_ON((flags & __GFP_FS) && current->journal_info); - - return should_failslab(s->object_size, flags, s->flags); -} - -static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, void *object) -{ - flags &= gfp_allowed_mask; - kmemcheck_slab_alloc(s, flags, object, slab_ksize(s)); - kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags); -} - -static inline void slab_free_hook(struct kmem_cache *s, void *x) -{ - kmemleak_free_recursive(x, s->flags); - - /* -* Trouble is that we may no longer disable interupts in the fast path -* So in order to make the debug calls that expect irqs to be -* disabled we need to disable interrupts temporarily. -*/ -#if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP) - { - unsigned long flags; - - local_irq_save(flags); - kmemcheck_slab_free(s, x, s->object_size); - debug_check_no_locks_freed(x, s->object_size); - local_irq_restore(flags); - } -#endif - if (!(s->flags & SLAB_DEBUG_OBJECTS)) - debug_check_no_obj_freed(x, s->object_size); -} - -/* * Tracking of fully allocated slabs for debugging purposes. * * list_lock must be held. @@ -1256,16 +1212,50 @@ static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects) {} static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects) {} - +#endif /* CONFIG_SLUB_DEBUG */ +/* + * Hooks for other subsystems that check memory allocations. In a typical + * production configuration these hooks all should produce no code at all. 
+ */ static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags) - { return 0; } +{ + flags &= gfp_allowed_mask; + lockdep_trace_alloc(flags); + might_sleep_if(flags & __GFP_WAIT); + WARN_ON((flags & __GFP_FS) && current->journal_info); -static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, - void *object) {} + return should_failslab(s->object_size, flags, s->flags); +} -static inline void slab_free_hook(struct kmem_cache *s,
[Devel] [PATCH RHEL7 COMMIT] ms/mm/arch: use NUMA_NO_NODE
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 59313b3e6fe7c9ffe5dde09bd8379e3ac8a583e0 Author: Andrey RyabininDate: Thu Sep 3 19:27:32 2015 +0400 ms/mm/arch: use NUMA_NO_NODE https://jira.sw.ru/browse/PSBM-26429 From: Jianguo Wu commit 40c3baa7c66f1352521378ee83509fb8f4c465de upstream. Use more appropriate NUMA_NO_NODE instead of -1 in all archs' module_alloc() Signed-off-by: Jianguo Wu Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- arch/arm/kernel/module.c| 2 +- arch/arm64/kernel/module.c | 2 +- arch/parisc/kernel/module.c | 2 +- arch/s390/kernel/module.c | 2 +- arch/sparc/kernel/module.c | 2 +- arch/x86/kernel/module.c| 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index 1e9be5d..be3232f 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -40,7 +40,7 @@ void *module_alloc(unsigned long size) { return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, -1, + GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE, __builtin_return_address(0)); } #endif diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index ca0e3d5..8f898bd 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -29,7 +29,7 @@ void *module_alloc(unsigned long size) { return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, -1, + GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE, __builtin_return_address(0)); } diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c index 2a625fb..50dfafc 100644 --- a/arch/parisc/kernel/module.c +++ b/arch/parisc/kernel/module.c @@ -219,7 +219,7 @@ void *module_alloc(unsigned long size) * init_data correctly */ return __vmalloc_node_range(size, 1, 
VMALLOC_START, VMALLOC_END, GFP_KERNEL | __GFP_HIGHMEM, - PAGE_KERNEL_RWX, -1, + PAGE_KERNEL_RWX, NUMA_NO_NODE, __builtin_return_address(0)); } diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c index 7845e15..b89b591 100644 --- a/arch/s390/kernel/module.c +++ b/arch/s390/kernel/module.c @@ -50,7 +50,7 @@ void *module_alloc(unsigned long size) if (PAGE_ALIGN(size) > MODULES_LEN) return NULL; return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL, -1, + GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE, __builtin_return_address(0)); } #endif diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c index 4435488..97655e0 100644 --- a/arch/sparc/kernel/module.c +++ b/arch/sparc/kernel/module.c @@ -29,7 +29,7 @@ static void *module_map(unsigned long size) if (PAGE_ALIGN(size) > MODULES_LEN) return NULL; return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL, -1, + GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE, __builtin_return_address(0)); } #else diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index 7c1efc4..958bfb6 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -49,7 +49,7 @@ void *module_alloc(unsigned long size) return NULL; return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, - -1, __builtin_return_address(0)); + NUMA_NO_NODE, __builtin_return_address(0)); } #ifdef CONFIG_X86_32 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/kernel: use the gnu89 standard explicitly
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 0331e712aa16d12ea15c567e25111c3443456479 Author: Andrey Ryabinin Date: Thu Sep 3 19:27:29 2015 +0400 ms/kernel: use the gnu89 standard explicitly https://jira.sw.ru/browse/PSBM-26429 From: "Kirill A. Shutemov" commit 51b97e354ba9fce1890cf38ecc754aa49677fc89 upstream. Sasha Levin reports: "gcc5 changes the default standard to c11, which makes kernel build unhappy. Explicitly define the kernel standard to be gnu89 which should keep everything working exactly like it was before gcc5" There are multiple small issues with the new default, but the biggest issue seems to be that the old - and very useful - GNU extension to allow a cast in front of an initializer has gone away. Patch updated by Kirill: "I'm pretty sure all gcc versions you can build kernel with supports -std=gnu89. cc-option is redundant. We also need to adjust HOSTCFLAGS otherwise allmodconfig fails for me" Note by Andrew Pinski: "Yes it was reported and both problems relating to this extension has been added to gnu99 and gnu11. Though there are other issues with the kernel dealing with extern inline have different semantics between gnu89 and gnu99/11" End result: we may be able to move up to a newer stdc model eventually, but right now the newer models have some annoying deficiencies, so the traditional "gnu89" model ends up being the preferred one. Signed-off-by: Sasha Levin Signed-off-by: Kirill A.
Shutemov Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- Makefile | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index 1ccfd12..bfd04ef 100644 --- a/Makefile +++ b/Makefile @@ -253,7 +253,7 @@ CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo $$BASH; \ HOSTCC = gcc HOSTCXX = g++ -HOSTCFLAGS = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-pointer +HOSTCFLAGS = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-pointer -std=gnu89 HOSTCXXFLAGS = -O2 # Decide whether to build built-in, modular, or both. @@ -385,7 +385,8 @@ KBUILD_CFLAGS := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \ -fno-strict-aliasing -fno-common \ -Werror-implicit-function-declaration \ -Wno-format-security \ - -fno-delete-null-pointer-checks + -fno-delete-null-pointer-checks \ + -std=gnu89 ifeq ($(KBUILD_EXTMOD),) ifneq (,$(filter $(ARCH), x86 x86_64)) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
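Note on the patch above: whether a given compiler invocation is really using a gnu89-style mode can be checked from the predefined macros. The helper names below are hypothetical and this is only a sketch; __STDC_VERSION__ and __STRICT_ANSI__ are the standard and GCC predefined macros, and the file is written so that it also compiles with -std=gnu89.

```c
/* Report which language standard the compiler is applying to this file. */
const char *demo_c_standard(void)
{
#if !defined(__STDC_VERSION__)
    return "C89/C90 (for example -std=gnu89)";
#elif __STDC_VERSION__ < 199901L
    return "C90 with the 1994 amendment";
#elif __STDC_VERSION__ < 201112L
    return "C99";
#else
    return "C11 or newer";
#endif
}

/* 1 when GNU extensions are enabled (gnu89, gnu99, ...), 0 for strict modes. */
int demo_gnu_extensions_enabled(void)
{
#if defined(__GNUC__) && !defined(__STRICT_ANSI__)
    return 1;
#else
    return 0;
#endif
}
```

Building the same file once with no -std option and once with -std=gnu89 and comparing the two results shows the change of default that the commit message is guarding against.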
[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit abee218424a2434e8cc576037563de55bec730de Author: Andrey RyabininDate: Thu Sep 3 19:27:31 2015 +0400 ms/mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area https://jira.sw.ru/browse/PSBM-26429 From: Wanpeng Li commit 762216ab4e175f49d17bc7ad778c57b9028184e6 upstream. Use wrapper function get_vm_area_size to calculate size of vm area. Signed-off-by: Wanpeng Li Cc: Dave Hansen Cc: Rik van Riel Cc: Fengguang Wu Cc: Joonsoo Kim Cc: Johannes Weiner Cc: Tejun Heo Cc: Yasuaki Ishimatsu Cc: David Rientjes Cc: KOSAKI Motohiro Cc: Jiri Kosina Cc: Wanpeng Li Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- mm/vmalloc.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 7fbc92a..0c531e1 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1285,7 +1285,7 @@ void unmap_kernel_range(unsigned long addr, unsigned long size) int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages) { unsigned long addr = (unsigned long)area->addr; - unsigned long end = addr + area->size - PAGE_SIZE; + unsigned long end = addr + get_vm_area_size(area); int err; err = vmap_page_range(addr, end, prot, *pages); @@ -1605,7 +1605,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, unsigned int nr_pages, array_size, i; gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO; - nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT; + nr_pages = get_vm_area_size(area) >> PAGE_SHIFT; array_size = (nr_pages * sizeof(struct page *)); area->nr_pages = nr_pages; @@ -2037,7 +2037,7 @@ long vread(char *buf, char *addr, unsigned long count) vm = va->vm; vaddr = (char *) vm->addr; - if (addr >= vaddr + vm->size - PAGE_SIZE) + if (addr >= vaddr + 
get_vm_area_size(vm)) continue; while (addr < vaddr) { if (count == 0) @@ -2047,7 +2047,7 @@ long vread(char *buf, char *addr, unsigned long count) addr++; count--; } - n = vaddr + vm->size - PAGE_SIZE - addr; + n = vaddr + get_vm_area_size(vm) - addr; if (n > count) n = count; if (!(vm->flags & VM_IOREMAP)) @@ -2119,7 +2119,7 @@ long vwrite(char *buf, char *addr, unsigned long count) vm = va->vm; vaddr = (char *) vm->addr; - if (addr >= vaddr + vm->size - PAGE_SIZE) + if (addr >= vaddr + get_vm_area_size(vm)) continue; while (addr < vaddr) { if (count == 0) @@ -2128,7 +2128,7 @@ long vwrite(char *buf, char *addr, unsigned long count) addr++; count--; } - n = vaddr + vm->size - PAGE_SIZE - addr; + n = vaddr + get_vm_area_size(vm) - addr; if (n > count) n = count; if (!(vm->flags & VM_IOREMAP)) { ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
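Note on the patch above: every replaced expression was `size - PAGE_SIZE`, so at this point the wrapper simply hides the trailing guard page. A userspace sketch of that arithmetic, assuming 4 KiB pages and a simplified struct demo_vm_area (both hypothetical stand-ins):

```c
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((size_t)1 << DEMO_PAGE_SHIFT)

/* Simplified stand-in for struct vm_struct: size includes the guard page. */
struct demo_vm_area {
    void  *addr;
    size_t size;
};

/* Usable bytes: everything except the trailing guard page. */
size_t demo_get_vm_area_size(const struct demo_vm_area *area)
{
    return area->size - DEMO_PAGE_SIZE;
}

/* Number of backing pages, as computed in __vmalloc_area_node() after the patch. */
size_t demo_nr_pages(const struct demo_vm_area *area)
{
    return demo_get_vm_area_size(area) >> DEMO_PAGE_SHIFT;
}
```

Keeping the subtraction in one helper means a later change to how the guard page is accounted for only has to touch that helper rather than every call site.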
[Devel] [PATCH RHEL7 COMMIT] ms/compiler-gcc: integrate the various compiler-gcc[345].h files
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit bd1e2b9bde2a2ec95dbafa3dcf17a29dd3acd63e Author: Andrey RyabininDate: Thu Sep 3 19:27:30 2015 +0400 ms/compiler-gcc: integrate the various compiler-gcc[345].h files https://jira.sw.ru/browse/PSBM-26429 From: Joe Perches commit cb984d101b30eb7478d32df56a0023e4603cba7f upstream. As gcc major version numbers are going to advance rather rapidly in the future, there's no real value in separate files for each compiler version. Deduplicate some of the macros #defined in each file too. Neaten comments using normal kernel commenting style. Signed-off-by: Joe Perches Cc: Andi Kleen Cc: Michal Marek Cc: Segher Boessenkool Cc: Sasha Levin Cc: Anton Blanchard Cc: Alan Modra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- include/linux/compiler-gcc.h | 120 -- include/linux/compiler-gcc3.h | 23 include/linux/compiler-gcc4.h | 92 include/linux/compiler-gcc5.h | 67 --- 4 files changed, 116 insertions(+), 186 deletions(-) diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index 24545cd..0c5d746 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -97,10 +97,122 @@ #define __maybe_unused __attribute__((unused)) #define __always_unused__attribute__((unused)) -#define __gcc_header(x) #x -#define _gcc_header(x) __gcc_header(linux/compiler-gcc##x.h) -#define gcc_header(x) _gcc_header(x) -#include gcc_header(__GNUC__) +/* gcc version specific checks */ + +#if GCC_VERSION < 30200 +# error Sorry, your compiler is too old - please upgrade it. 
+#endif + +#if GCC_VERSION < 30300 +# define __used__attribute__((__unused__)) +#else +# define __used__attribute__((__used__)) +#endif + +#ifdef CONFIG_GCOV_KERNEL +# if GCC_VERSION < 30400 +# error "GCOV profiling support for gcc versions below 3.4 not included" +# endif /* __GNUC_MINOR__ */ +#endif /* CONFIG_GCOV_KERNEL */ + +#if GCC_VERSION >= 30400 +#define __must_check __attribute__((warn_unused_result)) +#endif + +#if GCC_VERSION >= 4 + +/* GCC 4.1.[01] miscompiles __weak */ +#ifdef __KERNEL__ +# if GCC_VERSION >= 40100 && GCC_VERSION <= 40101 +# error Your version of gcc miscompiles the __weak directive +# endif +#endif + +#define __used __attribute__((__used__)) +#define __compiler_offsetof(a, b) \ + __builtin_offsetof(a, b) + +#if GCC_VERSION >= 40100 && GCC_VERSION < 40600 +# define __compiletime_object_size(obj) __builtin_object_size(obj, 0) +#endif + +#if GCC_VERSION >= 40300 +/* Mark functions as cold. gcc will assume any path leading to a call + * to them will be unlikely. This means a lot of manual unlikely()s + * are unnecessary now for any paths leading to the usual suspects + * like BUG(), printk(), panic() etc. [but let's keep them for now for + * older compilers] + * + * Early snapshots of gcc 4.3 don't support this and we can't detect this + * in the preprocessor, but we can live with this because they're unreleased. + * Maketime probing would be overkill here. 
+ * + * gcc also has a __attribute__((__hot__)) to move hot functions into + * a special section, but I don't see any sense in this right now in + * the kernel context + */ +#define __cold __attribute__((__cold__)) + +#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__) + +#ifndef __CHECKER__ +# define __compiletime_warning(message) __attribute__((warning(message))) +# define __compiletime_error(message) __attribute__((error(message))) +#endif /* __CHECKER__ */ +#endif /* GCC_VERSION >= 40300 */ + +#if GCC_VERSION >= 40500 +/* + * Mark a position in code as unreachable. This can be used to + * suppress control flow warnings after asm blocks that transfer + * control elsewhere. + * + * Early snapshots of gcc 4.5 don't support this and we can't detect + * this in the preprocessor, but we can live with this because they're + * unreleased. Really, we need to have autoconf for the kernel. + */ +#define unreachable() __builtin_unreachable() + +/* Mark a function definition as prohibited from being cloned. */ +#define __noclone __attribute__((__noclone__)) + +#endif /* GCC_VERSION >= 40500 */ + +#if
[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Flush TLBs after switching CR3
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit f1796a8d4debf66ae569701aeaf5e739661808c2 Author: Andrey Ryabinin Date: Thu Sep 3 19:27:51 2015 +0400 ms/x86/kasan: Flush TLBs after switching CR3 https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit 241d2c54c62fa0939fc9a9512b48ac3434e90a89 upstream. load_cr3() doesn't cause tlb_flush if PGE enabled. This may cause tons of false positive reports spamming the kernel to death. To fix this __flush_tlb_all() should be called explicitly after CR3 changed. Signed-off-by: Andrey Ryabinin Cc: # 4.0+ Cc: Alexander Popov Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1435828178-10975-4-git-send-email-a.ryabi...@samsung.com Signed-off-by: Ingo Molnar Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- arch/x86/mm/kasan_init_64.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index ad0b931..0ada6cc 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -208,6 +208,7 @@ void __init kasan_init(void) memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt)); load_cr3(early_level4_pgt); + __flush_tlb_all(); clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END); @@ -234,5 +235,6 @@ void __init kasan_init(void) memset(kasan_zero_page, 0, PAGE_SIZE); load_cr3(init_level4_pgt); + __flush_tlb_all(); init_task.kasan_depth = 0; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/module: fix types of device tables aliases
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit db7ae5a5dcbe87b199efdb784074ff3597d06d42 Author: Andrey RyabininDate: Thu Sep 3 19:27:45 2015 +0400 ms/module: fix types of device tables aliases https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit 6301939d97d079f0d3dbe71e750f4daf5d39fc33 upstream. MODULE_DEVICE_TABLE() macro used to create aliases to device tables. Normally alias should have the same type as aliased symbol. Device tables are arrays, so they have 'struct type##_device_id[x]' types. Alias created by MODULE_DEVICE_TABLE() will have non-array type - 'struct type##_device_id'. This inconsistency confuses compiler, it could make a wrong assumption about variable's size which leads KASan to produce a false positive report about out of bounds access. For every global variable compiler calls __asan_register_globals() passing information about global variable (address, size, size with redzone, name ...) __asan_register_globals() poison symbols redzone to detect possible out of bounds accesses. When symbol has an alias __asan_register_globals() will be called as for symbol so for alias. Compiler determines size of variable by size of variable's type. Alias and symbol have the same address, so if alias have the wrong size part of memory that actually belongs to the symbol could be poisoned as redzone of alias symbol. By fixing type of alias symbol we will fix size of it, so __asan_register_globals() will not poison valid memory. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. 
Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- include/linux/module.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/module.h b/include/linux/module.h index c3b88d6..40bb478 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -84,7 +84,7 @@ void trim_init_extable(struct module *m); #ifdef MODULE #define MODULE_GENERIC_TABLE(gtype,name) \ -extern const struct gtype##_id __mod_##gtype##_table \ +extern const typeof(name) __mod_##gtype##_table\ __attribute__ ((unused, alias(__stringify(name #else /* !MODULE */ ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
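Note on the patch above: the size mismatch between an alias declared with the element type and one declared with typeof() can be seen with plain sizeof. This is a sketch for GCC or Clang on an ELF target; struct demo_device_id and all demo_* symbols are hypothetical, and only the alias attribute and __typeof__ are real compiler features.

```c
#include <stddef.h>

struct demo_device_id {
    unsigned int vendor;
    unsigned int device;
};

/* A device table is an array, terminated here by an all-zero entry. */
const struct demo_device_id demo_ids[] = {
    { 0x8086, 0x1000 },
    { 0x8086, 0x1001 },
    { 0, 0 },
};

/* Old style: the alias has the element type, so it looks one entry long. */
extern const struct demo_device_id demo_alias_elem
    __attribute__((unused, alias("demo_ids")));

/* Patched style: typeof() gives the alias the full array type of the table. */
extern const __typeof__(demo_ids) demo_alias_array
    __attribute__((unused, alias("demo_ids")));
```

Here sizeof(demo_alias_elem) is the size of one entry while sizeof(demo_alias_array) equals sizeof(demo_ids). A tool that derives the object size from the declared type, as the commit message describes for __asan_register_globals(), would therefore place the redzone of the element-typed alias inside the real table.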
[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: introduce metadata_access_enable()/metadata_access_disable()
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 5a39d8752462593deb7d24f021f2b5fe5956ad8f Author: Andrey RyabininDate: Thu Sep 3 19:27:39 2015 +0400 ms/mm: slub: introduce metadata_access_enable()/metadata_access_disable() https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit a79316c6178ca419e35feef47d47f50b4e0ee9f2 upstream. It's ok for slub to access memory that marked by kasan as inaccessible (object's metadata). Kasan shouldn't print report in that case because these accesses are valid. Disabling instrumentation of slub.c code is not enough to achieve this because slub passes pointer to object's metadata into external functions like memchr_inv(). We don't want to disable instrumentation for memchr_inv() because this is quite generic function, and we don't want to miss bugs. metadata_access_enable/metadata_access_disable used to tell KASan where accesses to metadata starts/end, so we could temporarily disable KASan reports. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- mm/slub.c | 25 + 1 file changed, 25 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index 306cfc4..d775ccb 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -464,12 +465,30 @@ static char *slub_debug_slabs; static int disable_higher_order_debug; /* + * slub is about to manipulate internal object metadata. 
This memory lies + * outside the range of the allocated object, so accessing it would normally + * be reported by kasan as a bounds error. metadata_access_enable() is used + * to tell kasan that these accesses are OK. + */ +static inline void metadata_access_enable(void) +{ + kasan_disable_current(); +} + +static inline void metadata_access_disable(void) +{ + kasan_enable_current(); +} + +/* * Object debugging */ static void print_section(char *text, u8 *addr, unsigned int length) { + metadata_access_enable(); print_hex_dump(KERN_ERR, text, DUMP_PREFIX_ADDRESS, 16, 1, addr, length, 1); + metadata_access_disable(); } static struct track *get_track(struct kmem_cache *s, void *object, @@ -499,7 +518,9 @@ static void set_track(struct kmem_cache *s, void *object, trace.max_entries = TRACK_ADDRS_COUNT; trace.entries = p->addrs; trace.skip = 3; + metadata_access_enable(); save_stack_trace(); + metadata_access_disable(); /* See rant in lockdep.c */ if (trace.nr_entries != 0 && @@ -672,7 +693,9 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page, u8 *fault; u8 *end; + metadata_access_enable(); fault = memchr_inv(start, value, bytes); + metadata_access_disable(); if (!fault) return 1; @@ -765,7 +788,9 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page) if (!remainder) return 1; + metadata_access_enable(); fault = memchr_inv(end - remainder, POISON_INUSE, remainder); + metadata_access_disable(); if (!fault) return 1; while (end > fault && end[-1] == POISON_INUSE) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
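Note on the patch above: the enable/disable pair brackets a region in which reports are suppressed. The userspace model below is only a sketch under a stated assumption: it uses a nesting counter that is raised while reports are disabled. The kernel's own per-task field and the direction in which it counts are not shown in this patch and may differ; all demo_* names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not the kernel's layout. */
static int demo_suppress_depth;
static int demo_reports;

void demo_kasan_disable_current(void) { demo_suppress_depth++; }
void demo_kasan_enable_current(void)  { demo_suppress_depth--; }

/* Same shape as the helpers added to mm/slub.c by the patch. */
void demo_metadata_access_enable(void)  { demo_kasan_disable_current(); }
void demo_metadata_access_disable(void) { demo_kasan_enable_current(); }

/* Called by the "checker" on an access to poisoned memory. */
bool demo_report_bad_access(void)
{
    if (demo_suppress_depth > 0)
        return false;       /* inside a metadata access window: stay quiet */
    demo_reports++;
    return true;
}

int demo_report_count(void) { return demo_reports; }
```

Because the state is a counter rather than a flag, nested windows behave correctly: reporting resumes only after the outermost demo_metadata_access_disable().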
[Devel] [PATCH RHEL7 COMMIT] ms/kasan, module: move MODULE_ALIGN macro into <linux/moduleloader.h>
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 5ad7c91c09ff6978e0c5df0fd4b25e7baa6eee87 Author: Andrey Ryabinin Date: Thu Sep 3 19:27:47 2015 +0400 ms/kasan, module: move MODULE_ALIGN macro into <linux/moduleloader.h> https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit d3733e5c98e952d419e77fa721912f09d15a2806 upstream. include/linux/moduleloader.h is a more suitable place for this macro. Also change alignment to PAGE_SIZE for CONFIG_KASAN=n as such alignment is already assumed in several places. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Acked-by: Rusty Russell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- include/linux/kasan.h| 4 include/linux/moduleloader.h | 7 +++ 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/include/linux/kasan.h b/include/linux/kasan.h index 5fa48a2..5bb0744 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -50,15 +50,11 @@ void kasan_krealloc(const void *object, size_t new_size); void kasan_slab_alloc(struct kmem_cache *s, void *object); void kasan_slab_free(struct kmem_cache *s, void *object); -#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) - int kasan_module_alloc(void *addr, size_t size); void kasan_free_shadow(const struct vm_struct *vm); #else /* CONFIG_KASAN */ -#define MODULE_ALIGN 1 - static inline void kasan_unpoison_shadow(const void *address, size_t size) {} static inline void kasan_enable_current(void) {} diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h index 560ca53..8405769 100644 --- a/include/linux/moduleloader.h +++ b/include/linux/moduleloader.h @@ -80,4 +80,11 @@ int module_finalize(const Elf_Ehdr *hdr, /* Any cleanup needed when module leaves.
*/ void module_arch_cleanup(struct module *mod); +#ifdef CONFIG_KASAN +#include +#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT) +#else +#define MODULE_ALIGN PAGE_SIZE +#endif + #endif ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
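Note on the patch above: a worked example of the two MODULE_ALIGN values. The numbers are assumptions for illustration (4 KiB pages and a shadow scale shift of 3, i.e. one shadow byte per 8 bytes of memory); the demo_* names are hypothetical.

```c
#define DEMO_PAGE_SIZE                4096UL
#define DEMO_KASAN_SHADOW_SCALE_SHIFT 3   /* assumed: 1 shadow byte per 8 bytes */

/* Alignment for module allocations with and without the shadow requirement. */
unsigned long demo_module_align(int kasan_enabled)
{
    return kasan_enabled ? (DEMO_PAGE_SIZE << DEMO_KASAN_SHADOW_SCALE_SHIFT)
                         : DEMO_PAGE_SIZE;
}

/* Shadow bytes needed to cover `size` bytes of module memory. */
unsigned long demo_shadow_size(unsigned long size)
{
    return size >> DEMO_KASAN_SHADOW_SCALE_SHIFT;
}
```

Under these assumptions the KASAN alignment is 32768 bytes, and a region of that size needs exactly one 4096-byte page of shadow, so page-granular shadow mappings line up with module allocations.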
[Devel] [PATCH RHEL7 COMMIT] ms/kernel: add support for .init_array.* constructors
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 498baa3fdd64742725539c281086d23aee327fa4
Author: Andrey Ryabinin
Date:   Thu Sep 3 19:27:44 2015 +0400

    ms/kernel: add support for .init_array.* constructors

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 9ddf82521c86ae07af79dbe5a93c52890f2bab23 upstream.

    KASan uses constructors to initialize redzones for global variables.
    Globals instrumentation in GCC 4.9.2 produces constructors with a
    priority (.init_array.00099).

    Currently the kernel ignores such constructors. Only constructors with
    the default priority (.init_array) are supported.

    This patch adds support for constructors with priorities. For the kernel
    image we put pointers to constructors between __ctors_start/__ctors_end,
    and do_ctors() will call them on start up. For modules we merge the
    .init_array.* sections into the resulting .init_array. Module code
    properly handles constructors in the .init_array section.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 include/asm-generic/vmlinux.lds.h | 1 +
 scripts/module-common.lds         | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 72e4edc..5c90355 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -481,6 +481,7 @@
 #define KERNEL_CTORS()	. = ALIGN(8);			   \
 			VMLINUX_SYMBOL(__ctors_start) = .; \
 			*(.ctors)			   \
+			*(SORT(.init_array.*))		   \
 			*(.init_array)			   \
 			VMLINUX_SYMBOL(__ctors_end) = .;
 #else
diff --git a/scripts/module-common.lds b/scripts/module-common.lds
index 0865b3e..10fa8bf 100644
--- a/scripts/module-common.lds
+++ b/scripts/module-common.lds
@@ -16,4 +16,8 @@ SECTIONS {
 	__kcrctab_unused	: { *(SORT(___kcrctab_unused+*)) }
 	__kcrctab_unused_gpl	: { *(SORT(___kcrctab_unused_gpl+*)) }
 	__kcrctab_gpl_future	: { *(SORT(___kcrctab_gpl_future+*)) }
+
+
+	. = ALIGN(8);
+	.init_array		0 : { *(SORT(.init_array.*)) *(.init_array) }
 }
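For readers who want to see the priority mechanism outside the kernel, here is a rough userspace sketch. It is not kernel code: it assumes GCC or Clang on an ELF target, where a function tagged constructor(N) is placed in a .init_array.N-style section and a plain constructor goes into .init_array, and it assumes the hosted startup code runs them in that sorted order. All function and variable names below are made up for the illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Records the order in which the constructors below ran. */
static int ctor_order[3];
static int ctor_count;

/* Priorities 0..100 are reserved for the implementation, so start at 101. */
__attribute__((constructor(101))) static void ctor_prio_101(void)
{
	ctor_order[ctor_count++] = 101;
}

__attribute__((constructor(102))) static void ctor_prio_102(void)
{
	ctor_order[ctor_count++] = 102;
}

/* No explicit priority: expected to run after the prioritized ones. */
__attribute__((constructor)) static void ctor_default(void)
{
	ctor_order[ctor_count++] = 0;
}

/* Hypothetical helper: prints the recorded order, returns how many ran. */
static int report_ctor_order(void)
{
	int i;

	for (i = 0; i < ctor_count; i++)
		printf("ctor %d: priority tag %d\n", i, ctor_order[i]);
	return ctor_count;
}
```

Calling report_ctor_order() from main() lists the constructors in the order they ran. The SORT(.init_array.*) line in the linker scripts above is what gives the kernel image and modules the same ordering.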
[Devel] [PATCH RHEL7 COMMIT] ms/mm: vmalloc: add flag preventing guard hole allocation
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 8db2f73889dbd2a488309474ddc1783d3228a40e
Author: Andrey Ryabinin
Date:   Thu Sep 3 19:27:43 2015 +0400

    ms/mm: vmalloc: add flag preventing guard hole allocation

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 71394fe50146202f2c8d92cf50f5ebc761acf254 upstream.

    To instrument global variables, KASan will shadow the memory backing
    modules. So on module loading we need to allocate memory for the shadow
    and map it at the shadow address that corresponds to the address
    allocated in module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it
    puts a guard hole after the allocated area. A guard hole in shadow memory
    would be a problem because at some future point we might need shadow
    memory at the address occupied by the guard hole. So we could fail to
    allocate shadow for module_alloc().

    Add a new vm_struct flag, 'VM_NO_GUARD', indicating that the vm area
    doesn't have a guard hole.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 include/linux/vmalloc.h | 9 +++++++--
 mm/vmalloc.c            | 6 ++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index dd0a2c8..00b9b15 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -16,6 +16,7 @@ struct vm_area_struct;		/* vma defining user mapping in mm_types.h */
 #define VM_USERMAP	0x0008	/* suitable for remap_vmalloc_range */
 #define VM_VPAGES	0x0010	/* buffer for pages was vmalloc'ed */
 #define VM_UNLIST	0x0020	/* vm_struct is not listed in vmlist */
+#define VM_NO_GUARD	0x0040	/* don't add guard page */
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
@@ -96,8 +97,12 @@ void vmalloc_sync_all(void);
 
 static inline size_t get_vm_area_size(const struct vm_struct *area)
 {
-	/* return actual size without guard page */
-	return area->size - PAGE_SIZE;
+	if (!(area->flags & VM_NO_GUARD))
+		/* return actual size without guard page */
+		return area->size - PAGE_SIZE;
+	else
+		return area->size;
+
 }
 
 extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0c531e1..7a0addf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1356,10 +1356,8 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (unlikely(!area))
 		return NULL;
 
-	/*
-	 * We always allocate a guard page.
-	 */
-	size += PAGE_SIZE;
+	if (!(flags & VM_NO_GUARD))
+		size += PAGE_SIZE;
 
 	va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
 	if (IS_ERR(va)) {
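The size accounting this patch changes can be modelled in a few lines of plain C. The sketch below is only a userspace model, not the kernel implementation: the model_* names and the 4096-byte page size are assumptions made for the example, and only the flag bit is taken from the diff.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE   4096UL   /* assumed page size for this model */
#define MODEL_VM_NO_GUARD 0x0040UL /* the VM_NO_GUARD bit from the patch */

/* Hypothetical stand-in for the two vm_struct fields that matter here. */
struct model_vm_area {
	unsigned long flags;
	size_t size;
};

/* Mirrors __get_vm_area_node(): reserve a guard page unless told not to. */
static struct model_vm_area model_get_vm_area(size_t size, unsigned long flags)
{
	struct model_vm_area area = { flags, size };

	if (!(flags & MODEL_VM_NO_GUARD))
		area.size += MODEL_PAGE_SIZE;
	return area;
}

/* Mirrors get_vm_area_size(): usable size excludes the guard page, if any. */
static size_t model_get_vm_area_size(const struct model_vm_area *area)
{
	if (!(area->flags & MODEL_VM_NO_GUARD))
		return area->size - MODEL_PAGE_SIZE;
	return area->size;
}
```

Both helpers have to test the same flag. If only the allocation side skipped the guard page, get_vm_area_size() would under-report a VM_NO_GUARD area by one page.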
[Devel] [PATCH RHEL7 COMMIT] ms/fs: dcache: manually unpoison dname after allocation to shut up kasan's reports
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 6b444b2466dfe34ee64bf03a05c9e8a85c581f0a
Author: Andrey Ryabinin
Date:   Thu Sep 3 19:27:40 2015 +0400

    ms/fs: dcache: manually unpoison dname after allocation to shut up kasan's reports

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd upstream.

    We need to manually unpoison the rounded-up allocation size for dname to
    avoid kasan's reports in dentry_string_cmp(). When
    CONFIG_DCACHE_WORD_ACCESS=y, dentry_string_cmp() may access a few bytes
    beyond the size requested in kmalloc().

    dentry_string_cmp() relies on the fact that the dentry name is allocated
    using kmalloc() and that kmalloc() internally rounds up the allocation
    size. So this is not a bug, but it makes kasan complain about such
    accesses. To avoid such reports we mark the rounded-up allocation size in
    shadow as accessible.

    Signed-off-by: Andrey Ryabinin
    Reported-by: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 fs/dcache.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index a341efe..a4f60d1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -43,6 +44,7 @@
 #include "internal.h"
 #include "mount.h"
 
+
 /*
  * Usage:
  * dcache->d_inode->i_lock protects:
@@ -1550,6 +1552,11 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 			kmem_cache_free(dentry_cache, dentry);
 			return NULL;
 		}
+		if (IS_ENABLED(CONFIG_DCACHE_WORD_ACCESS))
+			kasan_unpoison_shadow(dname,
+				round_up(name->len + 1,
+					sizeof(unsigned long)));
+
 	} else {
 		dname = dentry->d_iname;
 	}
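To make the length passed to kasan_unpoison_shadow() concrete, here is a small userspace sketch of the same rounding. It is an illustration only: model_round_up() and dname_unpoison_len() are names invented for this example, and the first is a plain power-of-two round-up standing in for the kernel's round_up() macro.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of align; align must be a power of two. */
static size_t model_round_up(size_t x, size_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Bytes marked accessible for a name of name_len characters: the name,
 * its terminating NUL, and padding up to the next word boundary, since
 * the word-at-a-time compare may read the whole last word.
 */
static size_t dname_unpoison_len(size_t name_len)
{
	return model_round_up(name_len + 1, sizeof(unsigned long));
}
```

With an 8-byte unsigned long, a 5-character name unpoisons 8 bytes and an 8-character name unpoisons 16, which covers any word-sized read that starts inside the name.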
[Devel] [PATCH RHEL7 COMMIT] ms/mm, mempool: poison elements backed by slab allocator
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit bbeaa6232872bec76a69e7cb6b41606f1cf61ad3 Author: Andrey RyabininDate: Thu Sep 3 19:27:48 2015 +0400 ms/mm, mempool: poison elements backed by slab allocator https://jira.sw.ru/browse/PSBM-26429 From: David Rientjes commit bdfedb76f4f5aa5e37380e3b71adee4a39f30fc6 upstream. Mempools keep elements in a reserved pool for contexts in which allocation may not be possible. When an element is allocated from the reserved pool, its memory contents is the same as when it was added to the reserved pool. Because of this, elements lack any free poisoning to detect use-after-free errors. This patch adds free poisoning for elements backed by the slab allocator. This is possible because the mempool layer knows the object size of each element. When an element is added to the reserved pool, it is poisoned with POISON_FREE. When it is removed from the reserved pool, the contents are checked for POISON_FREE. If there is a mismatch, a warning is emitted to the kernel log. This is only effective for configs with CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG_ON. [fabio.este...@freescale.com: use '%zu' for printing 'size_t' variable] [a...@arndb.de: add missing include] Signed-off-by: David Rientjes Cc: Dave Kleikamp Cc: Christoph Hellwig Cc: Sebastian Ott Cc: Mikulas Patocka Cc: Catalin Marinas Signed-off-by: Fabio Estevam Signed-off-by: Arnd Bergmann Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- mm/mempool.c | 94 ++-- 1 file changed, 92 insertions(+), 2 deletions(-) diff --git a/mm/mempool.c b/mm/mempool.c index 5499047..db146ad 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -6,25 +6,115 @@ * extreme VM load. 
* * started by Ingo Molnar, Copyright (C) 2001 + * debugging by David Rientjes, Copyright (C) 2015 */ #include #include + +#include +#include #include #include #include #include +#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB_DEBUG_ON) +static void poison_error(mempool_t *pool, void *element, size_t size, +size_t byte) +{ + const int nr = pool->curr_nr; + const int start = max_t(int, byte - (BITS_PER_LONG / 8), 0); + const int end = min_t(int, byte + (BITS_PER_LONG / 8), size); + int i; + + pr_err("BUG: mempool element poison mismatch\n"); + pr_err("Mempool %p size %zu\n", pool, size); + pr_err(" nr=%d @ %p: %s0x", nr, element, start > 0 ? "... " : ""); + for (i = start; i < end; i++) + pr_cont("%x ", *(u8 *)(element + i)); + pr_cont("%s\n", end < size ? "..." : ""); + dump_stack(); +} + +static void __check_element(mempool_t *pool, void *element, size_t size) +{ + u8 *obj = element; + size_t i; + + for (i = 0; i < size; i++) { + u8 exp = (i < size - 1) ? POISON_FREE : POISON_END; + + if (obj[i] != exp) { + poison_error(pool, element, size, i); + return; + } + } + memset(obj, POISON_INUSE, size); +} + +static void check_element(mempool_t *pool, void *element) +{ + /* Mempools backed by slab allocator */ + if (pool->free == mempool_free_slab || pool->free == mempool_kfree) + __check_element(pool, element, ksize(element)); + + /* Mempools backed by page allocator */ + if (pool->free == mempool_free_pages) { + int order = (int)(long)pool->pool_data; + void *addr = kmap_atomic((struct page *)element); + + __check_element(pool, addr, 1UL << (PAGE_SHIFT + order)); + kunmap_atomic(addr); + } +} + +static void __poison_element(void *element, size_t size) +{ + u8 *obj = element; + + memset(obj, POISON_FREE, size - 1); + obj[size - 1] = POISON_END; +} + +static void poison_element(mempool_t *pool, void *element) +{ + /* Mempools backed by slab allocator */ + if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) + __poison_element(element, 
ksize(element)); + + /* Mempools backed by page allocator */ + if (pool->alloc == mempool_alloc_pages) { + int order = (int)(long)pool->pool_data; + void
[Devel] [PATCH RHEL7 COMMIT] ms/mm/mempool.c: kasan: poison mempool elements
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 185298f11666838595fb5a2574231e5248178256
Author: Andrey Ryabinin
Date:   Thu Sep 3 19:27:48 2015 +0400

    ms/mm/mempool.c: kasan: poison mempool elements

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 923936157b158f36bd6a3d86496dce82b1a957de upstream.

    Mempools keep allocated objects in reserve for situations when an
    ordinary allocation may not be possible to satisfy. These objects
    shouldn't be accessed before they leave the pool.

    This patch poisons elements when they get into the pool and unpoisons
    them when they leave it. This lets KASan detect use-after-free of
    mempool elements.

    Signed-off-by: Andrey Ryabinin
    Tested-by: David Rientjes
    Cc: Catalin Marinas
    Cc: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 include/linux/kasan.h |  2 ++
 mm/kasan/kasan.c      | 13 +++++++++++++
 mm/mempool.c          | 23 +++++++++++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 5bb0744..5486d77 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -44,6 +44,7 @@ void kasan_poison_object_data(struct kmem_cache *cache, void *object);
 
 void kasan_kmalloc_large(const void *ptr, size_t size);
 void kasan_kfree_large(const void *ptr);
+void kasan_kfree(void *ptr);
 void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size);
 void kasan_krealloc(const void *object, size_t new_size);
 
@@ -71,6 +72,7 @@ static inline void kasan_poison_object_data(struct kmem_cache *cache,
 
 static inline void kasan_kmalloc_large(void *ptr, size_t size) {}
 static inline void kasan_kfree_large(const void *ptr) {}
+static inline void kasan_kfree(void *ptr) {}
 static inline void kasan_kmalloc(struct kmem_cache *s, const void *object,
 				size_t size) {}
 static inline void kasan_krealloc(const void *object, size_t new_size) {}
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 936d816..6c513a6 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -389,6 +389,19 @@ void kasan_krealloc(const void *object, size_t size)
 		kasan_kmalloc(page->slab_cache, object, size);
 }
 
+void kasan_kfree(void *ptr)
+{
+	struct page *page;
+
+	page = virt_to_head_page(ptr);
+
+	if (unlikely(!PageSlab(page)))
+		kasan_poison_shadow(ptr, PAGE_SIZE << compound_order(page),
+				KASAN_FREE_PAGE);
+	else
+		kasan_slab_free(page->slab_cache, ptr);
+}
+
 void kasan_kfree_large(const void *ptr)
 {
 	struct page *page = virt_to_page(ptr);
diff --git a/mm/mempool.c b/mm/mempool.c
index db146ad..abf8243 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -13,6 +13,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
 #include
@@ -101,10 +102,31 @@ static inline void poison_element(mempool_t *pool, void *element)
 }
 #endif /* CONFIG_DEBUG_SLAB || CONFIG_SLUB_DEBUG_ON */
 
+static void kasan_poison_element(mempool_t *pool, void *element)
+{
+	if (pool->alloc == mempool_alloc_slab)
+		kasan_slab_free(pool->pool_data, element);
+	if (pool->alloc == mempool_kmalloc)
+		kasan_kfree(element);
+	if (pool->alloc == mempool_alloc_pages)
+		kasan_free_pages(element, (unsigned long)pool->pool_data);
+}
+
+static void kasan_unpoison_element(mempool_t *pool, void *element)
+{
+	if (pool->alloc == mempool_alloc_slab)
+		kasan_slab_alloc(pool->pool_data, element);
+	if (pool->alloc == mempool_kmalloc)
+		kasan_krealloc(element, (size_t)pool->pool_data);
+	if (pool->alloc == mempool_alloc_pages)
+		kasan_alloc_pages(element, (unsigned long)pool->pool_data);
+}
+
 static void add_element(mempool_t *pool, void *element)
 {
 	BUG_ON(pool->curr_nr >= pool->min_nr);
 	poison_element(pool, element);
+	kasan_poison_element(pool, element);
 	pool->elements[pool->curr_nr++] = element;
 }
 
@@ -114,6 +136,7 @@ static void *remove_element(mempool_t *pool)
 
 	BUG_ON(pool->curr_nr < 0);
 	check_element(pool, element);
+	kasan_unpoison_element(pool, element);
 	return element;
 }
 
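The add/remove symmetry in this patch can be shown with a toy pool. This is a deliberately simplified userspace model under stated assumptions: the per-element accessible flag stands in for KASan shadow memory, and every model_* name is hypothetical rather than part of the mempool or KASan API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_POOL_MAX 4

/* The accessible flag plays the role of the element's shadow state. */
struct model_elem {
	bool accessible;
	int payload;
};

struct model_pool {
	struct model_elem *elements[MODEL_POOL_MAX];
	int curr_nr;
};

/* Like add_element(): poison the element, then park it in the reserve. */
static bool model_add_element(struct model_pool *pool, struct model_elem *e)
{
	if (pool->curr_nr >= MODEL_POOL_MAX)
		return false;
	e->accessible = false;
	pool->elements[pool->curr_nr++] = e;
	return true;
}

/* Like remove_element(): take the element out and unpoison it. */
static struct model_elem *model_remove_element(struct model_pool *pool)
{
	struct model_elem *e;

	if (pool->curr_nr <= 0)
		return NULL;
	e = pool->elements[--pool->curr_nr];
	e->accessible = true;
	return e;
}
```

In the real patch the poisoned state lives in shadow memory, so an access to an element that is still sitting in the reserve is reported by KASan instead of being silently allowed.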
[Devel] [PATCH RHEL7 COMMIT] ms/lib: add kasan test module
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 86fbf39cbfaeefd815791b34628f6df8040b4d2f Author: Andrey RyabininDate: Thu Sep 3 19:27:41 2015 +0400 ms/lib: add kasan test module https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit 3f15801cdc2379ca4bf507f48bffd788f9e508ae upstream. This is a test module doing various nasty things like out of bounds accesses, use after free. It is useful for testing kernel debugging features like kernel address sanitizer. It mostly concentrates on testing of slab allocator, but we might want to add more different stuff here in future (like stack/global variables out of bounds accesses and so on). Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- lib/Kconfig.kasan | 8 ++ lib/Makefile | 1 + lib/test_kasan.c | 277 ++ 3 files changed, 286 insertions(+) diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan index a11ac02..4d47d87 100644 --- a/lib/Kconfig.kasan +++ b/lib/Kconfig.kasan @@ -42,4 +42,12 @@ config KASAN_INLINE endchoice +config TEST_KASAN + tristate "Module for testing kasan for bug detection" + depends on m && KASAN + help + This is a test module doing various nasty things like + out of bounds accesses, use after free. It is useful for testing + kernel debugging features like kernel address sanitizer. 
+ endif diff --git a/lib/Makefile b/lib/Makefile index 175face7..d5372fc 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -31,6 +31,7 @@ obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += kstrtox.o obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o +obj-$(CONFIG_TEST_KASAN) += test_kasan.o obj-y += kmapset.o diff --git a/lib/test_kasan.c b/lib/test_kasan.c new file mode 100644 index 000..098c08e --- /dev/null +++ b/lib/test_kasan.c @@ -0,0 +1,277 @@ +/* + * + * Copyright (c) 2014 Samsung Electronics Co., Ltd. + * Author: Andrey Ryabinin + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + */ + +#define pr_fmt(fmt) "kasan test: %s " fmt, __func__ + +#include +#include +#include +#include +#include + +static noinline void __init kmalloc_oob_right(void) +{ + char *ptr; + size_t size = 123; + + pr_info("out-of-bounds to right\n"); + ptr = kmalloc(size, GFP_KERNEL); + if (!ptr) { + pr_err("Allocation failed\n"); + return; + } + + ptr[size] = 'x'; + kfree(ptr); +} + +static noinline void __init kmalloc_oob_left(void) +{ + char *ptr; + size_t size = 15; + + pr_info("out-of-bounds to left\n"); + ptr = kmalloc(size, GFP_KERNEL); + if (!ptr) { + pr_err("Allocation failed\n"); + return; + } + + *ptr = *(ptr - 1); + kfree(ptr); +} + +static noinline void __init kmalloc_node_oob_right(void) +{ + char *ptr; + size_t size = 4096; + + pr_info("kmalloc_node(): out-of-bounds to right\n"); + ptr = kmalloc_node(size, GFP_KERNEL, 0); + if (!ptr) { + pr_err("Allocation failed\n"); + return; + } + + ptr[size] = 0; + kfree(ptr); +} + +static noinline void __init kmalloc_large_oob_rigth(void) +{ + char *ptr; + size_t size = KMALLOC_MAX_CACHE_SIZE + 10; + + pr_info("kmalloc large allocation: out-of-bounds to right\n"); + ptr = kmalloc(size, GFP_KERNEL); + if (!ptr) { + pr_err("Allocation failed\n"); + return; + 
} + + ptr[size] =
[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: share object_err function
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c9f94e82e07bf5bafb4e1afa04875aa444a276f7
Author: Andrey Ryabinin
Date:   Thu Sep 3 19:27:38 2015 +0400

    ms/mm: slub: share object_err function

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 75c66def8d815201aa0386ecc7c66a5c8dbca1ee upstream.

    Remove static and add the function declaration to linux/slub_def.h so
    that it can be used by the kernel address sanitizer.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 include/linux/slub_def.h | 3 +++
 mm/slub.c                | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index bd48c92..89bcb9e 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -139,4 +139,7 @@ static inline void *virt_to_obj(struct kmem_cache *s,
 	return (void *)x - ((x - slab_page) % s->size);
 }
 
+void object_err(struct kmem_cache *s, struct page *page,
+		u8 *object, char *reason);
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index f39e69c..306cfc4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -625,7 +625,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 	dump_stack();
 }
 
-static void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct page *page,
 			u8 *object, char *reason)
 {
 	slab_bug(s, "%s", reason);
[Devel] [PATCH RHEL7 COMMIT] ms/x86_64: kasan: add interceptors for memset/memmove/memcpy functions
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 100caa44f8eb00f138f374132ff5137d85dc21da Author: Andrey RyabininDate: Thu Sep 3 19:27:42 2015 +0400 ms/x86_64: kasan: add interceptors for memset/memmove/memcpy functions https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit 393f203f5fd54421fddb1e2a263f64d3876eeadb upstream. Recently instrumentation of builtin functions calls was removed from GCC 5.0. To check the memory accessed by such functions, userspace asan always uses interceptors for them. So now we should do this as well. This patch declares memset/memmove/memcpy as weak symbols. In mm/kasan/kasan.c we have our own implementation of those functions which checks memory before accessing it. Default memset/memmove/memcpy now now always have aliases with '__' prefix. For files that built without kasan instrumentation (e.g. mm/slub.c) original mem* replaced (via #define) with prefixed variants, cause we don't want to check memory accesses there. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. 
Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- arch/x86/boot/compressed/eboot.c | 5 +++-- arch/x86/boot/compressed/misc.h | 1 + arch/x86/include/asm/string_64.h | 18 +- arch/x86/kernel/x8664_ksyms_64.c | 10 -- arch/x86/lib/memcpy_64.S | 6 -- arch/x86/lib/memmove_64.S| 4 arch/x86/lib/memset_64.S | 10 ++ mm/kasan/kasan.c | 29 + 8 files changed, 72 insertions(+), 11 deletions(-) diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c index dd94e98..dc3694d 100644 --- a/arch/x86/boot/compressed/eboot.c +++ b/arch/x86/boot/compressed/eboot.c @@ -7,6 +7,9 @@ * * --- */ +#include "misc.h" +#include +#include "../string.h" #include #include #include @@ -14,8 +17,6 @@ #include #include -#undef memcpy /* Use memcpy from misc.c */ - #include "eboot.h" static efi_system_table_t *sys_table; diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index 674019d..768b889 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -7,6 +7,7 @@ * we just keep it from happening */ #undef CONFIG_PARAVIRT +#undef CONFIG_KASAN #ifdef CONFIG_X86_32 #define _ASM_X86_DESC_H 1 #endif diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h index 19e2c46..e466119 100644 --- a/arch/x86/include/asm/string_64.h +++ b/arch/x86/include/asm/string_64.h @@ -27,11 +27,12 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t function. 
*/ #define __HAVE_ARCH_MEMCPY 1 +extern void *__memcpy(void *to, const void *from, size_t len); + #ifndef CONFIG_KMEMCHECK #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4 extern void *memcpy(void *to, const void *from, size_t len); #else -extern void *__memcpy(void *to, const void *from, size_t len); #define memcpy(dst, src, len) \ ({ \ size_t __len = (len); \ @@ -53,9 +54,11 @@ extern void *__memcpy(void *to, const void *from, size_t len); #define __HAVE_ARCH_MEMSET void *memset(void *s, int c, size_t n); +void *__memset(void *s, int c, size_t n); #define __HAVE_ARCH_MEMMOVE void *memmove(void *dest, const void *src, size_t count); +void *__memmove(void *dest, const void *src, size_t count); int memcmp(const void *cs, const void *ct, size_t count); size_t strlen(const char *s); @@ -63,6 +66,19 @@ char *strcpy(char *dest, const char *src); char *strcat(char *dest,
[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: add kernel address sanitizer support for slub allocator
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit f5739fe62cb93cddd8165d0fca93f773d36431b8 Author: Andrey RyabininDate: Thu Sep 3 19:27:40 2015 +0400 ms/mm: slub: add kernel address sanitizer support for slub allocator https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit 0316bec22ec95ea2faca6406437b0b5950553b7c upstream. With this patch kasan will be able to catch bugs in memory allocated by slub. Initially all objects in newly allocated slab page, marked as redzone. Later, when allocation of slub object happens, requested by caller number of bytes marked as accessible, and the rest of the object (including slub's metadata) marked as redzone (inaccessible). We also mark object as accessible if ksize was called for this object. There is some places in kernel where ksize function is called to inquire size of really allocated area. Such callers could validly access whole allocated memory, so it should be marked as accessible. Code in slub.c and slab_common.c files could validly access to object's metadata, so instrumentation for this files are disabled. Signed-off-by: Andrey Ryabinin Signed-off-by: Dmitry Chernenkov Cc: Dmitry Vyukov Cc: Konstantin Serebryany Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. 
Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- include/linux/kasan.h | 27 ++ include/linux/slab.h | 11 -- lib/Kconfig.kasan | 1 + mm/Makefile | 3 ++ mm/kasan/kasan.c | 98 +++ mm/kasan/kasan.h | 5 +++ mm/kasan/report.c | 21 +++ mm/slab_common.c | 5 ++- mm/slub.c | 31 ++-- 9 files changed, 197 insertions(+), 5 deletions(-) diff --git a/include/linux/kasan.h b/include/linux/kasan.h index f00c15c..d5310ee 100644 --- a/include/linux/kasan.h +++ b/include/linux/kasan.h @@ -37,6 +37,18 @@ void kasan_unpoison_shadow(const void *address, size_t size); void kasan_alloc_pages(struct page *page, unsigned int order); void kasan_free_pages(struct page *page, unsigned int order); +void kasan_poison_slab(struct page *page); +void kasan_unpoison_object_data(struct kmem_cache *cache, void *object); +void kasan_poison_object_data(struct kmem_cache *cache, void *object); + +void kasan_kmalloc_large(const void *ptr, size_t size); +void kasan_kfree_large(const void *ptr); +void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size); +void kasan_krealloc(const void *object, size_t new_size); + +void kasan_slab_alloc(struct kmem_cache *s, void *object); +void kasan_slab_free(struct kmem_cache *s, void *object); + #else /* CONFIG_KASAN */ static inline void kasan_unpoison_shadow(const void *address, size_t size) {} @@ -47,6 +59,21 @@ static inline void kasan_disable_current(void) {} static inline void kasan_alloc_pages(struct page *page, unsigned int order) {} static inline void kasan_free_pages(struct page *page, unsigned int order) {} +static inline void kasan_poison_slab(struct page *page) {} +static inline void kasan_unpoison_object_data(struct kmem_cache *cache, + void *object) {} +static inline void kasan_poison_object_data(struct kmem_cache *cache, + void *object) {} + +static inline void kasan_kmalloc_large(void *ptr, 
size_t size) {} +static inline void kasan_kfree_large(const void *ptr) {} +static inline void kasan_kmalloc(struct kmem_cache *s, const void *object, + size_t size) {} +static inline void kasan_krealloc(const void *object, size_t new_size) {} + +static inline void kasan_slab_alloc(struct kmem_cache *s, void *object) {} +static inline void kasan_slab_free(struct kmem_cache *s, void *object) {} + #endif /* CONFIG_KASAN */ #endif /* LINUX_KASAN_H */ diff --git a/include/linux/slab.h b/include/linux/slab.h
[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: introduce virt_to_obj function
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c79da004858018af5e66fd380014fea3e5d5271d
Author: Andrey Ryabinin
Date:   Thu Sep 3 19:27:38 2015 +0400

    ms/mm: slub: introduce virt_to_obj function

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 912f5fbf1d3060f25d6994aed0265c55b974b2e9 upstream.

    virt_to_obj() takes the kmem_cache address, the address of the slab page,
    and an address x pointing somewhere inside a slab object, and returns the
    address of the beginning of that object.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Christoph Lameter
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 include/linux/slub_def.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d7d4571..bd48c92 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -123,4 +123,20 @@ static inline void sysfs_slab_remove(struct kmem_cache *s)
 }
 #endif
 
+
+/**
+ * virt_to_obj - returns address of the beginning of object.
+ * @s: object's kmem_cache
+ * @slab_page: address of slab page
+ * @x: address within object memory range
+ *
+ * Returns address of the beginning of object
+ */
+static inline void *virt_to_obj(struct kmem_cache *s,
+				const void *slab_page,
+				const void *x)
+{
+	return (void *)x - ((x - slab_page) % s->size);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
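The arithmetic in virt_to_obj() is easy to check in userspace. The sketch below is an illustration rather than the kernel helper: model_virt_to_obj() is a hypothetical name, it takes the object stride directly instead of a kmem_cache, and it uses uintptr_t where the kernel relies on GNU void-pointer arithmetic. It assumes, as the kernel helper does, that objects are laid out back to back from the start of the slab page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the object stride, the slab base, and any address x inside an
 * object, return the address where that object begins.
 */
static void *model_virt_to_obj(size_t obj_size, const void *slab_page,
			       const void *x)
{
	uintptr_t base = (uintptr_t)slab_page;
	uintptr_t addr = (uintptr_t)x;

	/* Offset of x within its object is (x - base) modulo the stride. */
	return (void *)(addr - ((addr - base) % obj_size));
}
```

For a 64-byte stride, an address 70 bytes into the slab maps back to offset 64, and an address 63 bytes in maps back to offset 0.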
[Devel] [PATCH RHEL7 COMMIT] ms/kasan: Makefile: shut up warnings if CONFIG_COMPILE_TEST=y
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 0bc35fb562a57be834fea65d992bb49b94909579
Author: Andrey Ryabinin
Date:   Thu Sep 3 19:27:49 2015 +0400

    ms/kasan: Makefile: shut up warnings if CONFIG_COMPILE_TEST=y

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 6e54abac1b8e0b7febffdbad37b605daef1cfcff upstream.

    It might be annoying to constantly see this:

        scripts/Makefile.kasan:16: Cannot use CONFIG_KASAN: -fsanitize=kernel-address is not supported by compiler

    while performing allmodconfig/allyesconfig build tests. Disable this
    warning if CONFIG_COMPILE_TEST=y.

    Signed-off-by: Andrey Ryabinin
    Cc: Michal Marek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 scripts/Makefile.kasan | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.kasan b/scripts/Makefile.kasan
index 631619b..3f874d2 100644
--- a/scripts/Makefile.kasan
+++ b/scripts/Makefile.kasan
@@ -13,12 +13,16 @@ CFLAGS_KASAN := $(call cc-option, -fsanitize=kernel-address \
 		--param asan-instrumentation-with-call-threshold=$(call_threshold))
 
 ifeq ($(call cc-option, $(CFLAGS_KASAN_MINIMAL) -Werror),)
+   ifneq ($(CONFIG_COMPILE_TEST),y)
         $(warning Cannot use CONFIG_KASAN: \
             -fsanitize=kernel-address is not supported by compiler)
+   endif
 else
     ifeq ($(CFLAGS_KASAN),)
-        $(warning CONFIG_KASAN: compiler does not support all options.\
-            Trying minimal configuration)
+        ifneq ($(CONFIG_COMPILE_TEST),y)
+            $(warning CONFIG_KASAN: compiler does not support all options.\
+                Trying minimal configuration)
+        endif
         CFLAGS_KASAN := $(CFLAGS_KASAN_MINIMAL)
     endif
 endif
[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Add message about KASAN being initialized
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 4e8e61be6ec2720f7871f8d673c8e806dd93
Author: Andrey Ryabinin
Date: Thu Sep 3 19:27:52 2015 +0400

    ms/x86/kasan: Add message about KASAN being initialized

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 8515522949951d81fe2d06c0a3292f171f2b8ec4 upstream.

    Print informational message to tell user that kernel runs with KASAN
    enabled. Add a "kasan: " prefix to all messages in kasan_init_64.c.

    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Borislav Petkov
    Cc: Dmitry Vyukov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1435828178-10975-6-git-send-email-a.ryabi...@samsung.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 arch/x86/mm/kasan_init_64.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index ef3dea9..f9fb08e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -1,3 +1,4 @@
+#define pr_fmt(fmt) "kasan: " fmt
 #include
 #include
 #include
@@ -237,4 +238,6 @@ void __init kasan_init(void)
 	load_cr3(init_level4_pgt);
 	__flush_tlb_all();
 	init_task.kasan_depth = 0;
+
+	pr_info("Kernel address sanitizer initialized\n");
 }
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
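The `pr_fmt()` trick in this patch relies on compile-time concatenation of adjacent string literals. A minimal userspace sketch of the idea follows; `pr_info_sketch()`, `kasan_init_message_sketch()` and `has_kasan_prefix_sketch()` are hypothetical helpers written for this illustration, and unlike the kernel's `pr_info()` the macro accepts only a literal message with no format arguments.

```c
/* Defined before anything that uses it, as in the patch. */
#define pr_fmt(fmt) "kasan: " fmt

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for pr_info(): formats into a caller buffer. */
#define pr_info_sketch(buf, len, msg) \
	snprintf((buf), (len), "%s", pr_fmt(msg))

/* Writes the prefixed message into buf, returns snprintf()'s result. */
static int kasan_init_message_sketch(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	/* pr_fmt() pastes the two literals into one string at compile time. */
	return pr_info_sketch(buf, len, "Kernel address sanitizer initialized\n");
}

/* True if msg starts with the prefix that pr_fmt() adds. */
static int has_kasan_prefix_sketch(const char *msg)
{
	return strncmp(msg, "kasan: ", 7) == 0;
}
```

Because the prefix is attached by the preprocessor, every message in the file picks it up without any change at the call sites.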
[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Fix boot crash on AMD processors
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 1551fd8cc2353656479158d30ad46940290098da
Author: Andrey Ryabinin
Date: Thu Sep 3 19:27:51 2015 +0400

    ms/x86/kasan: Fix boot crash on AMD processors

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit d4f86beacc21d538dc41e1fc75a22e084f547edf upstream.

    While populating the zero shadow, the wrong bits were used in the
    upper-level page tables. __PAGE_KERNEL_RO, which was used for
    pgd/pud/pmd, has _PAGE_BIT_GLOBAL set. The global bit is present only
    in the lowest level of the page translation hierarchy (ptes), and it
    should be zero in upper levels.

    This bug does not seem to cause any trouble on Intel CPUs, while on
    AMD CPUs it causes a kernel crash on boot.

    Use _KERNPG_TABLE bits for pgds/puds/pmds to fix this.

    Reported-by: Borislav Petkov
    Signed-off-by: Andrey Ryabinin
    Cc: # 4.0+
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1435828178-10975-5-git-send-email-a.ryabi...@samsung.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 arch/x86/mm/kasan_init_64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0ada6cc..ef3dea9 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -85,7 +85,7 @@ static int __init zero_pmd_populate(pud_t *pud, unsigned long addr,
 	while (IS_ALIGNED(addr, PMD_SIZE) && addr + PMD_SIZE <= end) {
 		WARN_ON(!pmd_none(*pmd));
 		set_pmd(pmd, __pmd(__pa_nodebug(kasan_zero_pte)
-					| __PAGE_KERNEL_RO));
+					| _KERNPG_TABLE));
 		addr += PMD_SIZE;
 		pmd = pmd_offset(pud, addr);
 	}
@@ -111,7 +111,7 @@ static int __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
 	while (IS_ALIGNED(addr, PUD_SIZE) && addr + PUD_SIZE <= end) {
 		WARN_ON(!pud_none(*pud));
 		set_pud(pud, __pud(__pa_nodebug(kasan_zero_pmd)
-					| __PAGE_KERNEL_RO));
+					| _KERNPG_TABLE));
 		addr += PUD_SIZE;
 		pud = pud_offset(pgd, addr);
 	}
@@ -136,7 +136,7 @@ static int __init zero_pgd_populate(unsigned long addr, unsigned long end)
 	while (IS_ALIGNED(addr, PGDIR_SIZE) && addr + PGDIR_SIZE <= end) {
 		WARN_ON(!pgd_none(*pgd));
 		set_pgd(pgd, __pgd(__pa_nodebug(kasan_zero_pud)
-					| __PAGE_KERNEL_RO));
+					| _KERNPG_TABLE));
 		addr += PGDIR_SIZE;
 		pgd = pgd_offset_k(addr);
 	}
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
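The difference between the two flag sets in this fix can be checked with plain bit arithmetic. The sketch below is an illustration under stated assumptions, not a copy of the kernel headers: the `PG_*` bit positions follow the x86-64 page-table entry layout as I recall it, and `SK_PAGE_KERNEL_RO`, `SK_KERNPG_TABLE`, `make_table_entry()` and `entry_has_global()` are hypothetical names that mirror what the commit message describes (a read-only kernel mapping with the global bit set versus plain table bits without it).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 page-table entry bits (illustrative names). */
#define PG_PRESENT	(UINT64_C(1) << 0)
#define PG_RW		(UINT64_C(1) << 1)
#define PG_ACCESSED	(UINT64_C(1) << 5)
#define PG_DIRTY	(UINT64_C(1) << 6)
#define PG_GLOBAL	(UINT64_C(1) << 8)	/* meaningful only in leaf entries */
#define PG_NX		(UINT64_C(1) << 63)

/* Stand-in for __PAGE_KERNEL_RO: a leaf-style flag set, global bit included. */
#define SK_PAGE_KERNEL_RO \
	(PG_PRESENT | PG_ACCESSED | PG_DIRTY | PG_GLOBAL | PG_NX)
/* Stand-in for _KERNPG_TABLE: flags for an entry that points at a table. */
#define SK_KERNPG_TABLE \
	(PG_PRESENT | PG_RW | PG_ACCESSED | PG_DIRTY)

/* Build an upper-level entry that points at a page-aligned lower table. */
static uint64_t make_table_entry(uint64_t next_table_phys, uint64_t flags)
{
	assert((next_table_phys & 0xfff) == 0);
	return next_table_phys | flags;
}

/* Non-zero if the entry carries the global bit. */
static int entry_has_global(uint64_t entry)
{
	return (entry & PG_GLOBAL) != 0;
}
```

An upper-level entry built with the first flag set carries bit 8, which the commit message says must be zero above the pte level; the second flag set leaves it clear.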
[Devel] [PATCH RHEL7 COMMIT] ms/kasan: enable stack instrumentation
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit afb61959d53afd934a5117de982401ddd17cf44a Author: Andrey RyabininDate: Thu Sep 3 19:27:43 2015 +0400 ms/kasan: enable stack instrumentation https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit c420f167db8c799d69fe43a801c58a7f02e9d57c upstream. Stack instrumentation allows to detect out of bounds memory accesses for variables allocated on stack. Compiler adds redzones around every variable on stack and poisons redzones in function's prologue. Such approach significantly increases stack usage, so all in-kernel stacks size were doubled. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. 
Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- arch/x86/include/asm/page_64_types.h | 12 +--- arch/x86/kernel/Makefile | 2 ++ arch/x86/mm/kasan_init_64.c | 11 +-- include/linux/init_task.h| 8 mm/kasan/kasan.h | 9 + mm/kasan/report.c| 6 ++ scripts/Makefile.kasan | 1 + 7 files changed, 44 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h index 735457b..042942c 100644 --- a/arch/x86/include/asm/page_64_types.h +++ b/arch/x86/include/asm/page_64_types.h @@ -1,17 +1,23 @@ #ifndef _ASM_X86_PAGE_64_DEFS_H #define _ASM_X86_PAGE_64_DEFS_H -#define THREAD_SIZE_ORDER 2 +#ifdef CONFIG_KASAN +#define KASAN_STACK_ORDER 1 +#else +#define KASAN_STACK_ORDER 0 +#endif + +#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER) #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER) #define CURRENT_MASK (~(THREAD_SIZE - 1)) -#define EXCEPTION_STACK_ORDER 0 +#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER) #define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER) #define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1) #define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER) -#define IRQ_STACK_ORDER 2 +#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER) #define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER) #define DOUBLEFAULT_STACK 1 diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 102a138..4d5df57 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -17,6 +17,8 @@ CFLAGS_REMOVE_early_printk.o = -pg endif KASAN_SANITIZE_head$(BITS).o := n +KASAN_SANITIZE_dumpstack.o := n +KASAN_SANITIZE_dumpstack_$(BITS).o := n CFLAGS_irq.o := -I$(src)/../include/asm/trace diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index cf3190a..a0c0dcc 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -189,11 +189,18 @@ 
void __init kasan_init(void) if (map_range(_mapped[i])) panic("kasan: unable to allocate shadow!"); } - populate_zero_shadow(kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM), - (void *)KASAN_SHADOW_END); + kasan_mem_to_shadow((void *)__START_KERNEL_map)); + + vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext), + (unsigned long)kasan_mem_to_shadow(_end), + NUMA_NO_NODE); + + populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_VADDR), + (void *)KASAN_SHADOW_END); memset(kasan_zero_page, 0, PAGE_SIZE); load_cr3(init_level4_pgt); + init_task.kasan_depth = 0; } diff --git a/include/linux/init_task.h b/include/linux/init_task.h index b1bdeb6..d2cbad0 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -161,6 +161,13 @@ extern struct task_group root_task_group; #define INIT_TASK_COMM "swapper" +#ifdef CONFIG_KASAN +# define INIT_KASAN(tsk)
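The stack-size change in page_64_types.h above is a shift by one extra order. A minimal sketch of that arithmetic, assuming 4 KiB pages; `SK_PAGE_SIZE` and `stack_bytes_sketch()` are hypothetical names introduced for this illustration.

```c
#include <assert.h>

#define SK_PAGE_SIZE 4096UL	/* assumed page size */

/*
 * Size in bytes of a stack whose order without KASAN is base_order.
 * With KASAN enabled the patch adds KASAN_STACK_ORDER (1) to the
 * order, which doubles the size.
 */
static unsigned long stack_bytes_sketch(unsigned int base_order,
					int kasan_enabled)
{
	unsigned int kasan_stack_order = kasan_enabled ? 1 : 0;

	assert(base_order < 16);
	return SK_PAGE_SIZE << (base_order + kasan_stack_order);
}
```

Under these assumptions the thread and IRQ stacks (base order 2) grow from 16 KiB to 32 KiB, and the exception stack (base order 0) grows from 4 KiB to 8 KiB.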
[Devel] [PATCH RHEL7 COMMIT] ms/kmemleak: disable kasan instrumentation for kmemleak
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c15154cbcf8144b7f551b571c0a4a129bfcf999d
Author: Andrey Ryabinin
Date: Thu Sep 3 19:27:41 2015 +0400

    ms/kmemleak: disable kasan instrumentation for kmemleak

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit e79ed2f13faab8fc9d4ad76d5f5a241724e45836 upstream.

    kmalloc internally rounds up the allocation size, and kmemleak uses
    the rounded-up size as the object's size. This makes kasan complain
    while kmemleak scans memory or calculates an object's checksum. The
    simplest solution here is to disable kasan.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 mm/kmemleak.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 98e1b34..5fe0a34 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -98,6 +98,7 @@
 #include
 #include

+#include
 #include
 #include
 #include
@@ -1077,7 +1078,10 @@ static bool update_checksum(struct kmemleak_object *object)
 	if (!kmemcheck_is_obj_initialized(object->pointer, object->size))
 		return false;

+	kasan_disable_current();
 	object->checksum = crc32(0, (void *)object->pointer, object->size);
+	kasan_enable_current();
+
 	return object->checksum != old_csum;
 }

@@ -1128,7 +1132,9 @@ static void scan_block(void *_start, void *_end,
 						  BYTES_PER_POINTER))
 			continue;

+		kasan_disable_current();
 		pointer = *ptr;
+		kasan_enable_current();

 		object = find_and_get_object(pointer, 1);
 		if (!object)
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
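The kasan_disable_current()/kasan_enable_current() pairs in this patch bracket accesses that are known to touch redzones. One way to picture how such a pair can nest safely is a per-task depth counter. The sketch below assumes that model (the `kasan_depth` field on `init_task` in the stack-instrumentation patch suggests it); `struct task_sketch` and the three helpers are hypothetical userspace stand-ins, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task KASAN state. */
struct task_sketch {
	unsigned int kasan_depth;	/* > 0 means reports are suppressed */
};

/* Enter a region whose accesses must not be reported. */
static void kasan_disable_sketch(struct task_sketch *t)
{
	t->kasan_depth++;
}

/* Leave the region; must pair with an earlier disable. */
static void kasan_enable_sketch(struct task_sketch *t)
{
	assert(t->kasan_depth > 0);
	t->kasan_depth--;
}

/* Reports are emitted only when no suppressing region is active. */
static bool kasan_report_enabled_sketch(const struct task_sketch *t)
{
	return t->kasan_depth == 0;
}
```

A counter rather than a boolean means an inner pair cannot re-enable reporting while an outer region is still active.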
[Devel] [PATCH RHEL7 COMMIT] ms/kasan: enable instrumentation of global variables
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit b3ad5de4e3c0866f2aa2b581f348989eee5bc9df Author: Andrey RyabininDate: Thu Sep 3 19:27:46 2015 +0400 ms/kasan: enable instrumentation of global variables https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit bebf56a1b176c2e1c9efe44e7e6915532cc682cf upstream. This feature let us to detect accesses out of bounds of global variables. This will work as for globals in kernel image, so for globals in modules. Currently this won't work for symbols in user-specified sections (e.g. __init, __read_mostly, ...) The idea of this is simple. Compiler increases each global variable by redzone size and add constructors invoking __asan_register_globals() function. Information about global variable (address, size, size with redzone ...) passed to __asan_register_globals() so we could poison variable's redzone. This patch also forces module_alloc() to return 8*PAGE_SIZE aligned address making shadow memory handling ( kasan_module_alloc()/kasan_module_free() ) more simple. Such alignment guarantees that each shadow page backing modules address space correspond to only one module_alloc() allocation. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. 
Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- Documentation/kasan.txt | 2 +- arch/x86/kernel/module.c| 12 +-- arch/x86/mm/kasan_init_64.c | 4 ++-- include/linux/kasan.h | 10 + kernel/module.c | 2 ++ lib/Kconfig.kasan | 1 + mm/kasan/kasan.c| 52 + mm/kasan/kasan.h| 25 ++ mm/kasan/report.c | 22 +++ scripts/Makefile.kasan | 2 +- 10 files changed, 126 insertions(+), 6 deletions(-) diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt index f0645a8..092fc10 100644 --- a/Documentation/kasan.txt +++ b/Documentation/kasan.txt @@ -9,7 +9,7 @@ a fast and comprehensive solution for finding use-after-free and out-of-bounds bugs. KASan uses compile-time instrumentation for checking every memory access, -therefore you will need a certain version of GCC >= 4.9.2 +therefore you will need a certain version of GCC > 4.9.2 Currently KASan is supported only for x86_64 architecture and requires that the kernel be built with the SLUB allocator. 
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c index 2ce4a9a..5892e83 100644 --- a/arch/x86/kernel/module.c +++ b/arch/x86/kernel/module.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -45,11 +46,18 @@ do { \ void *module_alloc(unsigned long size) { + void *p; + if (PAGE_ALIGN(size) > MODULES_LEN) return NULL; - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, + p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR, MODULES_END, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, -0, NUMA_NO_NODE, __builtin_return_address(0)); + 0, NUMA_NO_NODE, __builtin_return_address(0)); + if (p && (kasan_module_alloc(p, size) < 0)) { + vfree(p); + return NULL; + } + return p; } #ifdef CONFIG_X86_32 diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index a0c0dcc..7620537 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -194,9 +194,9 @@ void __init kasan_init(void) vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext), (unsigned long)kasan_mem_to_shadow(_end), -
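The commit message for this patch says that aligning module_alloc() to 8*PAGE_SIZE guarantees that each shadow page backs only one allocation. That follows from the one-shadow-byte-per-eight-bytes mapping and can be checked numerically. The sketch below assumes 4 KiB pages and a scale shift of 3, and leaves out the constant shadow offset, which does not affect which addresses share a shadow page as long as it is page-aligned; all `SK_*` names and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT		12	/* assumed 4 KiB pages */
#define SK_SHADOW_SCALE_SHIFT	3	/* one shadow byte covers 8 bytes */
/* 8 * PAGE_SIZE: the amount of memory one shadow page covers. */
#define SK_MODULE_ALIGN \
	(UINT64_C(1) << (SK_PAGE_SHIFT + SK_SHADOW_SCALE_SHIFT))

/* Index of the shadow page that backs addr (shadow offset omitted). */
static uint64_t shadow_page_index(uint64_t addr)
{
	return (addr >> SK_SHADOW_SCALE_SHIFT) >> SK_PAGE_SHIFT;
}

/* Non-zero if [a, a+alen) and [b, b+blen) need a common shadow page. */
static int shadow_pages_overlap(uint64_t a, uint64_t alen,
				uint64_t b, uint64_t blen)
{
	assert(alen > 0 && blen > 0);

	uint64_t a_first = shadow_page_index(a);
	uint64_t a_last = shadow_page_index(a + alen - 1);
	uint64_t b_first = shadow_page_index(b);
	uint64_t b_last = shadow_page_index(b + blen - 1);

	return a_first <= b_last && b_first <= a_last;
}
```

Two three-page allocations that start on consecutive 8*PAGE_SIZE boundaries use different shadow pages, while the same allocations placed only page-aligned, four pages apart, would share one.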
[Devel] [PATCH RHEL7 COMMIT] ms/kasan: disable memory hotplug
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 2aab6eabec66429c34f23086f82dd793d660b283
Author: Andrey Ryabinin
Date: Thu Sep 3 19:27:36 2015 +0400

    ms/kasan: disable memory hotplug

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit 786a8959912eb94fc2381c2ae487a96ce55dabca upstream.

    Currently memory hotplug won't work with KASan. As we don't have shadow
    for hotplugged memory, kernel will crash on the first access to it. To
    make this work we will need to allocate shadow for new memory.

    At some future point proper memory hotplug support will be implemented.
    Until then, print a warning at startup and disable memory hot-add.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 mm/kasan/kasan.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 6dc1aa7..def8110 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -300,3 +301,23 @@ EXPORT_SYMBOL(__asan_storeN_noabort);
 /* to shut up compiler complaints */
 void __asan_handle_no_return(void) {}
 EXPORT_SYMBOL(__asan_handle_no_return);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static int kasan_mem_notifier(struct notifier_block *nb,
+			unsigned long action, void *data)
+{
+	return (action == MEM_GOING_ONLINE) ? NOTIFY_BAD : NOTIFY_OK;
+}
+
+static int __init kasan_memhotplug_init(void)
+{
+	pr_err("WARNING: KASan doesn't support memory hot-add\n");
+	pr_err("Memory hot-add will be disabled\n");
+
+	hotplug_memory_notifier(kasan_mem_notifier, 0);
+
+	return 0;
+}
+
+module_init(kasan_memhotplug_init);
+#endif
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
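The notifier in this patch disables hot-add by vetoing one specific event. The veto pattern can be sketched in userspace as follows; the enums, `mem_notifier_fn`, `try_online_memory_sketch()` and `allow_all_sketch()` are hypothetical and their values are not the kernel's, only the decision in `kasan_mem_notifier_sketch()` mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's memory hotplug events. */
enum mem_action_sketch {
	MEM_GOING_ONLINE_SK,
	MEM_ONLINE_SK,
	MEM_GOING_OFFLINE_SK,
	MEM_OFFLINE_SK
};

/* Hypothetical stand-ins for the notifier return codes. */
enum notify_result_sketch { NOTIFY_OK_SK, NOTIFY_BAD_SK };

typedef enum notify_result_sketch (*mem_notifier_fn)(enum mem_action_sketch);

/* Same decision as the patch: refuse only the "going online" event. */
static enum notify_result_sketch
kasan_mem_notifier_sketch(enum mem_action_sketch action)
{
	return (action == MEM_GOING_ONLINE_SK) ? NOTIFY_BAD_SK : NOTIFY_OK_SK;
}

/* A notifier that never objects, for comparison. */
static enum notify_result_sketch
allow_all_sketch(enum mem_action_sketch action)
{
	(void)action;
	return NOTIFY_OK_SK;
}

/* Returns 1 if the memory came online, 0 if the notifier vetoed it. */
static int try_online_memory_sketch(mem_notifier_fn notifier)
{
	assert(notifier != NULL);
	if (notifier(MEM_GOING_ONLINE_SK) == NOTIFY_BAD_SK)
		return 0;		/* vetoed before anything changes */
	(void)notifier(MEM_ONLINE_SK);	/* announce completion */
	return 1;
}
```

Since only the "going online" event is refused, the other events pass through untouched, which matches the patch's stated aim of disabling hot-add only.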
[Devel] [PATCH RHEL7 COMMIT] ms/MODULE_DEVICE_TABLE: fix some callsites
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c2814d62886ad2b1697f7a4152d545662bdb2351
Author: Andrey Ryabinin
Date: Thu Sep 3 19:27:34 2015 +0400

    ms/MODULE_DEVICE_TABLE: fix some callsites

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrew Morton

    commit 0f989f749b51ec1fd94bb5a42f8ad10c8b9f73cb upstream.

    The patch "module: fix types of device tables aliases" newly requires
    that invocations of

        MODULE_DEVICE_TABLE(type, name);

    come *after* the definition of `name'. That is reasonable, but some
    drivers weren't doing this. Fix them.

    Cc: James Bottomley
    Cc: Andrey Ryabinin
    Cc: David Miller
    Cc: Hans Verkuil
    Acked-by: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 drivers/net/ethernet/emulex/benet/be_main.c | 1 -
 drivers/scsi/be2iscsi/be_main.c             | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 167fe08..4e60ee7 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -26,7 +26,6 @@
 #include

 MODULE_VERSION(DRV_VER);
-MODULE_DEVICE_TABLE(pci, be_dev_ids);
 MODULE_DESCRIPTION(DRV_DESC " " DRV_VER);
 MODULE_AUTHOR("Emulex Corporation");
 MODULE_LICENSE("GPL");
diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c
index 6b079d6..f9506b2 100644
--- a/drivers/scsi/be2iscsi/be_main.c
+++ b/drivers/scsi/be2iscsi/be_main.c
@@ -48,7 +48,6 @@ static unsigned int be_iopoll_budget = 10;
 static unsigned int be_max_phys_size = 64;
 static unsigned int enable_msix = 1;

-MODULE_DEVICE_TABLE(pci, beiscsi_pci_id_table);
 MODULE_DESCRIPTION(DRV_DESC " " BUILD_STR);
 MODULE_VERSION(BUILD_STR);
 MODULE_AUTHOR("Emulex Corporation");
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/x86/init: Clear 'init_level4_pgt' earlier
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 2c3d4203ed393d91ba79a0fa59f5e1ce5fe7627a
Author: Andrey Ryabinin
Date: Thu Sep 3 19:27:49 2015 +0400

    ms/x86/init: Clear 'init_level4_pgt' earlier

    https://jira.sw.ru/browse/PSBM-26429

    From: Andrey Ryabinin

    commit d0f77d4d04b222a817925d33ba3589b190bfa863 upstream.

    Currently x86_64_start_kernel() has two KASAN related function calls.
    The first call maps shadow to early_level4_pgt, the second maps shadow
    to init_level4_pgt.

    If we move clear_page(init_level4_pgt) earlier, we could hide KASAN low
    level detail from generic x86_64 initialization code. The next patch
    will do it.

    Signed-off-by: Andrey Ryabinin
    Cc: # 4.0+
    Cc: Alexander Popov
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Borislav Petkov
    Cc: Dmitry Vyukov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1435828178-10975-2-git-send-email-a.ryabi...@samsung.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrey Ryabinin
---
 arch/x86/kernel/head64.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 67df086..357ce8a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -164,6 +164,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	/* clear bss before set_intr_gate with early_idt_handler */
 	clear_bss();

+	clear_page(init_level4_pgt);
+
 	for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
 		set_intr_gate(i, early_idt_handlers[i]);
 	load_idt((const struct desc_ptr *)&idt_descr);
@@ -178,7 +180,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
 	if (console_loglevel == 10)
 		early_printk("Kernel alive\n");

-	clear_page(init_level4_pgt);
 	/* set init_level4_pgt kernel high mapping*/
 	init_level4_pgt[511] = early_level4_pgt[511];

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Fix KASAN shadow region page tables
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit 71bad1e5d1a2aa16ce31dc1413d36d066e73b7e4 Author: Andrey RyabininDate: Thu Sep 3 19:27:50 2015 +0400 ms/x86/kasan: Fix KASAN shadow region page tables https://jira.sw.ru/browse/PSBM-26429 From: Alexander Popov commit 5d5aa3cfca5cf74cd928daf3674642e6004328d1 upstream. Currently KASAN shadow region page tables created without respect of physical offset (phys_base). This causes kernel halt when phys_base is not zero. So let's initialize KASAN shadow region page tables in kasan_early_init() using __pa_nodebug() which considers phys_base. This patch also separates x86_64_start_kernel() from KASAN low level details by moving kasan_map_early_shadow(init_level4_pgt) into kasan_early_init(). Remove the comment before clear_bss() which stopped bringing much profit to the code readability. Otherwise describing all the new order dependencies would be too verbose. 
Signed-off-by: Alexander Popov Signed-off-by: Andrey Ryabinin Cc: # 4.0+ Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Borislav Petkov Cc: Dmitry Vyukov Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1435828178-10975-3-git-send-email-a.ryabi...@samsung.com Signed-off-by: Ingo Molnar Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- arch/x86/include/asm/kasan.h | 8 ++-- arch/x86/kernel/head64.c | 7 ++- arch/x86/kernel/head_64.S| 28 arch/x86/mm/kasan_init_64.c | 36 ++-- 4 files changed, 38 insertions(+), 41 deletions(-) diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h index 8b22422..74a2a8d 100644 --- a/arch/x86/include/asm/kasan.h +++ b/arch/x86/include/asm/kasan.h @@ -14,15 +14,11 @@ #ifndef __ASSEMBLY__ -extern pte_t kasan_zero_pte[]; -extern pte_t kasan_zero_pmd[]; -extern pte_t kasan_zero_pud[]; - #ifdef CONFIG_KASAN -void __init kasan_map_early_shadow(pgd_t *pgd); +void __init kasan_early_init(void); void __init kasan_init(void); #else -static inline void kasan_map_early_shadow(pgd_t *pgd) { } +static inline void kasan_early_init(void) { } static inline void kasan_init(void) { } #endif diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c index 357ce8a..c2dd757 100644 --- a/arch/x86/kernel/head64.c +++ b/arch/x86/kernel/head64.c @@ -159,13 +159,12 @@ void __init x86_64_start_kernel(char * real_mode_data) /* Kill off the identity-map trampoline */ reset_early_page_tables(); - kasan_map_early_shadow(early_level4_pgt); - - /* clear bss before set_intr_gate with early_idt_handler */ clear_bss(); clear_page(init_level4_pgt); + kasan_early_init(); + for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) set_intr_gate(i, early_idt_handlers[i]); load_idt((const struct desc_ptr *)_descr); @@ -183,8 +182,6 @@ void __init x86_64_start_kernel(char * real_mode_data) /* set init_level4_pgt kernel high mapping*/ init_level4_pgt[511] = early_level4_pgt[511]; - 
kasan_map_early_shadow(init_level4_pgt); - x86_64_start_reservations(real_mode_data); } diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index 4178929..cb5bf29 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -514,22 +514,6 @@ ENTRY(phys_base) /* This must match the first entry in level2_kernel_pgt */ .quad 0x -#ifdef CONFIG_KASAN -#define FILL(VAL, COUNT) \ - .rept (COUNT) ; \ - .quad (VAL) ; \ - .endr - -NEXT_PAGE(kasan_zero_pte) - FILL(kasan_zero_page - __START_KERNEL_map + _KERNPG_TABLE, 512) -NEXT_PAGE(kasan_zero_pmd) - FILL(kasan_zero_pte - __START_KERNEL_map + _KERNPG_TABLE, 512) -NEXT_PAGE(kasan_zero_pud) - FILL(kasan_zero_pmd - __START_KERNEL_map + _KERNPG_TABLE, 512) - -#undef FILL -#endif - #include "../../x86/xen/xen-head.S" .section .bss, "aw", @nobits @@ -551,15 +535,3 @@ ENTRY(trace_idt_table) NEXT_PAGE(empty_zero_page) .skip PAGE_SIZE -#ifdef CONFIG_KASAN -/* - * This page used as early shadow. We don't use empty_zero_page - * at early stages, stack instrumentation could write some garbage - * to this
[Devel] [PATCH RHEL7 COMMIT] ms/mm: vmalloc: pass additional vm_flags to __vmalloc_node_range()
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.6 --> commit a670730ea44733529cfb1072b8d50bbf4956858d Author: Andrey RyabininDate: Thu Sep 3 19:27:44 2015 +0400 ms/mm: vmalloc: pass additional vm_flags to __vmalloc_node_range() https://jira.sw.ru/browse/PSBM-26429 From: Andrey Ryabinin commit cb9e3c292d0115499c660028ad35ac5501d722b5 upstream. For instrumenting global variables KASan will shadow memory backing memory for modules. So on module loading we will need to allocate memory for shadow and map it at address in shadow that corresponds to the address allocated in module_alloc(). __vmalloc_node_range() could be used for this purpose, except it puts a guard hole after allocated area. Guard hole in shadow memory should be a problem because at some future point we might need to have a shadow memory at address occupied by guard hole. So we could fail to allocate shadow for module_alloc(). Now we have VM_NO_GUARD flag disabling guard page, so we need to pass into __vmalloc_node_range(). Add new parameter 'vm_flags' to __vmalloc_node_range() function. Signed-off-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Konstantin Serebryany Cc: Dmitry Chernenkov Signed-off-by: Andrey Konovalov Cc: Yuri Gribov Cc: Konstantin Khlebnikov Cc: Sasha Levin Cc: Christoph Lameter Cc: Joonsoo Kim Cc: Dave Hansen Cc: Andi Kleen Cc: Ingo Molnar Cc: Thomas Gleixner Cc: "H. 
Peter Anvin" Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Andrey Ryabinin Signed-off-by: Andrey Ryabinin --- arch/arm/kernel/module.c| 2 +- arch/arm64/kernel/module.c | 4 ++-- arch/mips/kernel/module.c | 2 +- arch/parisc/kernel/module.c | 2 +- arch/s390/kernel/module.c | 2 +- arch/sparc/kernel/module.c | 2 +- arch/x86/kernel/module.c| 2 +- include/linux/vmalloc.h | 4 +++- mm/vmalloc.c| 10 ++ 9 files changed, 17 insertions(+), 13 deletions(-) diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index be3232f..162c0b3 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -40,7 +40,7 @@ void *module_alloc(unsigned long size) { return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE, + GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, __builtin_return_address(0)); } #endif diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index 8f898bd..c7bc3e6 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -29,8 +29,8 @@ void *module_alloc(unsigned long size) { return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE, - __builtin_return_address(0)); + GFP_KERNEL, PAGE_KERNEL_EXEC, 0, + NUMA_NO_NODE, __builtin_return_address(0)); } enum aarch64_reloc_op { diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 2a52568..1833f51 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -47,7 +47,7 @@ static DEFINE_SPINLOCK(dbe_lock); void *module_alloc(unsigned long size) { return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END, - GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE, + GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } #endif diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c index 50dfafc..0d498ef 100644 --- a/arch/parisc/kernel/module.c 
+++ b/arch/parisc/kernel/module.c @@ -219,7 +219,7 @@ void *module_alloc(unsigned long size) * init_data correctly */ return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, GFP_KERNEL | __GFP_HIGHMEM, - PAGE_KERNEL_RWX, NUMA_NO_NODE, + PAGE_KERNEL_RWX, 0,
[Devel] [PATCH RHEL7 COMMIT] ve: revise permissions to allow mount smth
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit 68cf9d3cff9993ae2793c53661721b89d1b2895b
Author: Andrew Vagin
Date: Tue Sep 8 12:47:01 2015 +0400

    ve: revise permissions to allow mount smth

    reverts commit d492bfa387237 ("ve/vfs: allow mount/umount, pivot_root with CAP_VE_SYS_ADMIN")

    Return back to the behavior of the upstream kernel. Currently we use
    mount namespaces and need nothing special here.

    https://jira.sw.ru/browse/PSBM-39077

    Signed-off-by: Andrew Vagin
    Reviewed-by: Vladimir Davydov
---
 fs/namespace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 593b262..77a1ede 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1306,9 +1306,7 @@ static int do_umount(struct mount *mnt, int flags)
  */
 static inline bool may_mount(void)
 {
-	return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN) ||
-		nsown_capable(CAP_SYS_ADMIN) ||
-		nsown_capable(CAP_VE_SYS_ADMIN);
+	return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
 }

 /*
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] cred: add ve_capable to check capabilities relative to the current VE (v2)
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit 9c0a32ed2a39800f298cf96308530c805a9188fd
Author: Andrew Vagin
Date:   Tue Sep 8 12:45:07 2015 +0400

    cred: add ve_capable to check capabilities relative to the current VE (v2)

    We want to allow a few operations in VE. Currently we use nsown_capable,
    but it's wrong, because in this case we allow these operations in any
    user namespace.

    v2: take ve0->cred if the current ve isn't running

    https://jira.sw.ru/browse/PSBM-39077

    Signed-off-by: Andrew Vagin
    Reviewed-by: Vladimir Davydov
---
 fs/autofs4/root.c          |  6 ++
 fs/ioprio.c                |  2 +-
 fs/namei.c                 |  2 +-
 include/linux/capability.h |  1 +
 kernel/capability.c        | 20
 kernel/printk.c            |  5 ++---
 net/ipv6/sit.c             |  2 +-
 net/netfilter/nf_sockopt.c |  2 +-
 security/commoncap.c       |  4 ++--
 security/device_cgroup.c   |  4 ++--
 10 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index 68e3edb..1462d8b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -588,8 +588,7 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
 	struct autofs_info *p_ino;

 	/* This allows root to remove symlinks */
-	if (!autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) &&
-		!capable(CAP_VE_SYS_ADMIN))
+	if (!autofs4_oz_mode(sbi) && !ve_capable(CAP_SYS_ADMIN))
 		return -EPERM;

 	if (atomic_dec_and_test(>count)) {
@@ -837,8 +836,7 @@ static int autofs4_root_ioctl_unlocked(struct inode *inode, struct file *filp,
 	     _IOC_NR(cmd) - _IOC_NR(AUTOFS_IOC_FIRST) >= AUTOFS_IOC_COUNT)
 		return -ENOTTY;

-	if (!autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) &&
-		!capable(CAP_VE_SYS_ADMIN))
+	if (!autofs4_oz_mode(sbi) && !ve_capable(CAP_SYS_ADMIN))
 		return -EPERM;

 	switch(cmd) {
diff --git a/fs/ioprio.c b/fs/ioprio.c
index c876fad..f9d9187 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -75,7 +75,7 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)

 	switch (class) {
 		case IOPRIO_CLASS_RT:
-			if (!capable(CAP_VE_ADMIN))
+			if (!ve_capable(CAP_SYS_ADMIN))
 				return -EPERM;
 			class = IOPRIO_CLASS_BE;
 			data = 0;
diff --git a/fs/namei.c b/fs/namei.c
index 8e29a44..e7d9f54 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3397,7 +3397,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 	if (error)
 		return error;

-	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
+	if ((S_ISCHR(mode) || S_ISBLK(mode)) && !ve_capable(CAP_MKNOD))
 		return -EPERM;

 	if (!dir->i_op->mknod)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 2b77384..b1131e3 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -217,6 +217,7 @@ extern bool has_ns_capability_noaudit(struct task_struct *t,
 extern bool capable(int cap);
 extern bool ns_capable(struct user_namespace *ns, int cap);
 extern bool nsown_capable(int cap);
+extern bool ve_capable(int cap);
 extern bool inode_capable(const struct inode *inode, int cap);
 extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);

diff --git a/kernel/capability.c b/kernel/capability.c
index 0a843d5..4a73381 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include

 /*
  * Leveraged for setting/resetting capabilities
@@ -396,6 +397,25 @@ bool ns_capable(struct user_namespace *ns, int cap)
 }
 EXPORT_SYMBOL(ns_capable);

+#if CONFIG_VE
+bool ve_capable(int cap)
+{
+	struct cred *cred = get_exec_env()->init_cred;
+
+	if (cred == NULL) /* ve isn't running */
+		cred = ve0.init_cred;
+
+	return ns_capable(cred->user_ns, cap);
+}
+#else
+bool ve_capable(int cap)
+{
+	return capable(cap);
+}
+#endif
+
+EXPORT_SYMBOL_GPL(ve_capable);
+
 /**
  * file_ns_capable - Determine if the file's opener had a capability in effect
  * @file: The file we want to check
diff --git a/kernel/printk.c b/kernel/printk.c
index 44b3783..91766fc 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -468,14 +468,13 @@ static int check_syslog_permissions(int type, bool from_file)
 		return 0;

 	if (syslog_action_restricted(type)) {
-		if (nsown_capable(CAP_SYSLOG))
+		if
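The difference between nsown_capable() and ve_capable() described in the commit message above can be pictured with a small user-space model. This is only a sketch under stated assumptions: every `model_*` name, the explicit task argument and the bitmask of capabilities are hypothetical stand-ins, not kernel types, and the namespace walk leaves out the owner-uid rule that the real capability check also applies.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CAP_MKNOD 27	/* same number as CAP_MKNOD, used only as a bit index */

/* Hypothetical stand-ins for user namespace, task, cred and VE. */
struct model_user_ns { struct model_user_ns *parent; };
struct model_task { struct model_user_ns *user_ns; unsigned long caps; };
struct model_cred { struct model_user_ns *user_ns; };
struct model_ve { struct model_cred *init_cred; /* NULL: VE is not running */ };

/*
 * Simplified ns_capable(): walk from the target namespace towards the root;
 * the task is capable only if its own namespace is found on that path and
 * the capability bit is raised there.
 */
static bool model_ns_capable(const struct model_task *t,
			     const struct model_user_ns *target, int cap)
{
	const struct model_user_ns *ns;

	for (ns = target; ns != NULL; ns = ns->parent) {
		if (ns == t->user_ns)
			return ((t->caps >> cap) & 1UL) != 0;
	}
	return false;
}

/* nsown_capable(): the check is made against the task's own namespace. */
static bool model_nsown_capable(const struct model_task *t, int cap)
{
	return model_ns_capable(t, t->user_ns, cap);
}

/*
 * ve_capable(): the check is made against the namespace of the VE's init
 * cred, falling back to ve0 when the VE is not running (the v2 change).
 */
static bool model_ve_capable(const struct model_task *t,
			     const struct model_ve *cur,
			     const struct model_ve *ve0, int cap)
{
	const struct model_cred *cred = cur->init_cred;

	if (cred == NULL)
		cred = ve0->init_cred;

	return model_ns_capable(t, cred->user_ns, cap);
}
```

In this model, a task that is root in a user namespace nested inside the container passes the nsown-style check but fails the ve-style one, which is the behaviour the patch is after.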
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/rtnl: allow move network devices into network namespace in CT"
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit 48d88de87d8ce2bd7e69ba82e5410dd2ec82f602
Author: Andrew Vagin
Date:   Tue Sep 8 12:50:52 2015 +0400

    Revert "ve/rtnl: allow move network devices into network namespace in CT"

    This reverts commit b238eaaf8029c022899ee874132814bd1be5551f.

    https://jira.sw.ru/browse/PSBM-39077

    Signed-off-by: Andrew Vagin
    Reviewed-by: Vladimir Davydov
---
 net/core/rtnetlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2e8b10f..0d2df96 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1403,8 +1403,7 @@ static int do_setlink(const struct sk_buff *skb,
 			err = PTR_ERR(net);
 			goto errout;
 		}
-		if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN) &&
-		    !netlink_ns_capable(skb, net->user_ns, CAP_VE_NET_ADMIN)) {
+		if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
 			err = -EPERM;
 			goto errout;
 		}
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/net/ioctl: allow change net-device name with CAP_VE_NET_ADMIN"
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit bf95a05ce5971fa899e169aa27e869f34ac91b72
Author: Andrew Vagin
Date:   Tue Sep 8 12:50:37 2015 +0400

    Revert "ve/net/ioctl: allow change net-device name with CAP_VE_NET_ADMIN"

    This reverts commit 9118029490d75eee8ea1c8513412b55b94be92d9.

    https://jira.sw.ru/browse/PSBM-39077

    Signed-off-by: Andrew Vagin
    Reviewed-by: Vladimir Davydov
---
 net/core/dev_ioctl.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 77df687..d407219 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -476,11 +476,8 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 	 */
 	case SIOCGMIIPHY:
 	case SIOCGMIIREG:
-		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-			return -EPERM;
 	case SIOCSIFNAME:
-		if (!ns_capable(net->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(net->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;
 		dev_load(net, ifr.ifr_name);
 		rtnl_lock();
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/net: allow containers create bridges with CAP_VE_NET_ADMIN"
The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit ddcb719bd3e3ea79056bcc74db038c3c5d0e10a1
Author: Andrew Vagin
Date:   Tue Sep 8 12:50:24 2015 +0400

    Revert "ve/net: allow containers create bridges with CAP_VE_NET_ADMIN"

    This reverts commit 52b6df12cf62fc92edadcec3860f6418d4d8333e.

    https://jira.sw.ru/browse/PSBM-39077

    Signed-off-by: Andrew Vagin
    Reviewed-by: Vladimir Davydov
---
 net/bridge/br_ioctl.c | 33 +++--
 net/core/dev_ioctl.c  |  8
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 45c4c22..98447b8 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -89,8 +89,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int isadd)
 	struct net_device *dev;
 	int ret;

-	if (!ns_capable(net->user_ns, CAP_NET_ADMIN) &&
-	    !ns_capable(net->user_ns, CAP_VE_NET_ADMIN))
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;

 	dev = __dev_get_by_index(net, ifindex);
@@ -180,29 +179,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}

 	case BRCTL_SET_BRIDGE_FORWARD_DELAY:
-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		return br_set_forward_delay(br, args[1]);

 	case BRCTL_SET_BRIDGE_HELLO_TIME:
-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		return br_set_hello_time(br, args[1]);

 	case BRCTL_SET_BRIDGE_MAX_AGE:
-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		return br_set_max_age(br, args[1]);

 	case BRCTL_SET_AGEING_TIME:
-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		br->ageing_time = clock_t_to_jiffies(args[1]);
@@ -242,16 +237,14 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	}

 	case BRCTL_SET_BRIDGE_STP_STATE:
-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		br_stp_set_enabled(br, args[1]);
 		return 0;

 	case BRCTL_SET_BRIDGE_PRIORITY:
-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		spin_lock_bh(>lock);
@@ -264,8 +257,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;

-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		spin_lock_bh(>lock);
@@ -282,8 +274,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 		struct net_bridge_port *p;
 		int ret;

-		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		spin_lock_bh(>lock);
@@ -340,8 +331,7 @@ static int old_deviceless(struct net *net, void __user *uarg)
 	{
 		char buf[IFNAMSIZ];

-		if (!ns_capable(net->user_ns, CAP_NET_ADMIN) &&
-		    !ns_capable(net->user_ns, CAP_VE_NET_ADMIN))
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
 			return -EPERM;

 		if (copy_from_user(buf, (void
[Devel] [PATCH RHEL7 COMMIT] net: udpv6: release memcg on destroy
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit b9643707fb3f7e18c9681e14c7184d0aa17110a9
Author: Vladimir Davydov
Date:   Thu Sep 3 13:17:57 2015 +0400

    net: udpv6: release memcg on destroy

    In case of udpv6 we never release the memcg reference taken in
    udpv6_prot->init. This leads to memcg leak. Fix it by calling
    sock_release_memcg from udpv6_prot->destroy.

    https://jira.sw.ru/browse/PSBM-39084

    Fixes: ee3396bb65bf ("udp: Charge ingress buffers into cg memory")
    Signed-off-by: Vladimir Davydov
---
 net/ipv6/udp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4d3754d..780e823 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1352,6 +1352,7 @@ void udpv6_destroy_sock(struct sock *sk)
 	}

 	inet6_destroy_sock(sk);
+	sock_release_memcg(sk);
 }

 /*
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
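The leak fixed by the udpv6 patch above comes down to an unbalanced reference count: init takes a reference and destroy has to drop it. A minimal user-space sketch of that pairing follows; all `model_*` names are hypothetical and the plain integer counter only stands in for the real memcg reference.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a memcg and a socket holding a reference to it. */
struct model_memcg { int refcnt; };
struct model_sock { struct model_memcg *memcg; };

/* Plays the role of ->init: the socket pins its memcg. */
static void model_sock_init(struct model_sock *sk, struct model_memcg *cg)
{
	sk->memcg = cg;
	cg->refcnt++;
}

/* Plays the role of sock_release_memcg(): drop the reference, if any. */
static void model_sock_release_memcg(struct model_sock *sk)
{
	if (sk->memcg != NULL) {
		sk->memcg->refcnt--;
		sk->memcg = NULL;
	}
}

/* Destroy path before the fix: protocol teardown only, reference is kept. */
static void model_destroy_leaky(struct model_sock *sk)
{
	(void)sk;
}

/* Destroy path after the fix: teardown followed by the release. */
static void model_destroy_fixed(struct model_sock *sk)
{
	model_sock_release_memcg(sk);
}
```

With the leaky variant the counter never returns to zero after the socket is gone, so the group it belongs to could not be freed; with the fixed variant every init is matched by a release.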
[Devel] [PATCH RHEL7 COMMIT] ve/writeback: revert ub dirty limit related stuff
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit fe9db6c5f1e3e58f0ad60caf19c3d9e1a1cca474
Author: Vladimir Davydov
Date:   Thu Sep 3 14:10:34 2015 +0400

    ve/writeback: revert ub dirty limit related stuff

    This patch reverts ub dirty limit related hunks brought by the initial
    commit 2a8b5de95918. None of them actually works, so this patch
    introduces no functional changes. Dirty set control will be
    reimplemented in the scope of

    https://jira.sw.ru/browse/PSBM-33841

    Signed-off-by: Vladimir Davydov
---
 fs/fs-writeback.c         | 39 +++
 include/linux/writeback.h |  4
 mm/page-writeback.c       |  4
 3 files changed, 7 insertions(+), 40 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 66586a4..ac8066b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -40,7 +40,6 @@
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
-	struct user_beancounter *ub;
 	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
@@ -130,8 +129,8 @@ out_unlock:
 }

 static void
-__bdi_start_writeback(struct backing_dev_info *bdi,
-		long nr_pages, bool range_cyclic, enum wb_reason reason)
+__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
+		      bool range_cyclic, enum wb_reason reason)
 {
 	struct wb_writeback_work *work;

@@ -150,7 +149,6 @@ __bdi_start_writeback(struct backing_dev_info *bdi,
 	work->nr_pages = nr_pages;
 	work->range_cyclic = range_cyclic;
 	work->reason = reason;
-	work->ub = NULL;

 	bdi_queue_work(bdi, work);
 }
@@ -673,7 +671,6 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.range_cyclic = work->range_cyclic,
 		.range_start = 0,
 		.range_end = LLONG_MAX,
-		.wb_ub = work->ub,
 	};
 	unsigned long start_time = jiffies;
 	long write_chunk;
@@ -707,14 +704,6 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 * kind writeout is handled by the freer.
 		 */
 		spin_lock(>i_lock);
-		if (wbc.wb_ub && !wb->bdi->dirty_exceeded &&
-		    (inode->i_mapping->dirtied_ub != wbc.wb_ub) &&
-		    (inode->i_state & I_DIRTY) == I_DIRTY_PAGES &&
-		    ub_should_skip_writeback(wbc.wb_ub, inode)) {
-			requeue_io(inode, wb);
-			continue;
-		}
-
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(>i_lock);
 			redirty_tail(inode, wb);
@@ -913,12 +902,9 @@ static long wb_writeback(struct bdi_writeback *wb,

 		/*
 		 * For background writeout, stop when we are below the
-		 * background dirty threshold. For filtered background
-		 * writeback we write all inodes dirtied before us,
-		 * because we cannot dereference this ub pointer.
+		 * background dirty threshold
 		 */
-		if (work->for_background && !work->ub &&
-		    !over_bground_thresh(wb->bdi))
+		if (work->for_background && !over_bground_thresh(wb->bdi))
 			break;

 		/*
@@ -1371,7 +1357,7 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);

-static void wait_sb_inodes(struct super_block *sb, struct user_beancounter *ub)
+static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;

@@ -1399,11 +1385,6 @@ static void wait_sb_inodes(struct super_block *sb, struct user_beancounter *ub)
 			spin_unlock(>i_lock);
 			continue;
 		}
-		if (ub && (mapping->dirtied_ub != ub) &&
-		    (inode->i_state & I_DIRTY) == I_DIRTY_PAGES) {
-			spin_unlock(>i_lock);
-			continue;
-		}
 		__iget(inode);
 		spin_unlock(>i_lock);
 		spin_unlock(_sb_list_lock);
@@ -1522,12 +1503,11 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
  * This function writes and waits on any dirty inode belonging to this
  * super_block.
  */
-void sync_inodes_sb_ub(struct super_block *sb, struct user_beancounter *ub)
+void sync_inodes_sb(struct super_block *sb)
 {
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
 		.sb = sb,
-		.ub = ub,
 		.sync_mode = WB_SYNC_ALL,
[Devel] [PATCH RHEL7 COMMIT] ub: zap unused socket accounting bits
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 86ade5f1aad07dbdee7324c78e506f0240cefa18
Author: Vladimir Davydov
Date:   Thu Sep 3 14:34:11 2015 +0400

    ub: zap unused socket accounting bits

    It should have been done in the scope of c73bfca7594c ("bc: Rip old
    network buffers and sockets accounting").

    Signed-off-by: Vladimir Davydov
---
 include/bc/beancounter.h | 24
 kernel/bc/beancounter.c  |  8
 2 files changed, 32 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index 9180f2a..3c32ddf 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -49,26 +49,11 @@
  */

 struct task_beancounter;
-struct sock_beancounter;

 struct page_private {
 	unsigned long ubp_tmpfs_respages;
 };

-struct sock_private {
-	unsigned long ubp_rmem_thres;
-	unsigned long ubp_wmem_pressure;
-	unsigned long ubp_maxadvmss;
-	unsigned long ubp_rmem_pressure;
-	int ubp_tw_count;
-#define UB_RMEM_EXPAND 0
-#define UB_RMEM_KEEP 1
-#define UB_RMEM_SHRINK 2
-	struct list_head ubp_other_socks;
-	struct list_head ubp_tcp_socks;
-	struct percpu_counter ubp_orphan_count;
-};
-
 struct ub_percpu_struct {
 	int dirty_pages;
 	int writeback_pages;
@@ -129,15 +114,6 @@ struct user_beancounter {

 	struct page_private ppriv;
 #define ub_tmpfs_respages ppriv.ubp_tmpfs_respages
-	struct sock_private spriv;
-#define ub_rmem_thres spriv.ubp_rmem_thres
-#define ub_maxadvmss spriv.ubp_maxadvmss
-#define ub_rmem_pressure spriv.ubp_rmem_pressure
-#define ub_wmem_pressure spriv.ubp_wmem_pressure
-#define ub_tcp_sk_list spriv.ubp_tcp_socks
-#define ub_other_sk_list spriv.ubp_other_socks
-#define ub_orphan_count spriv.ubp_orphan_count
-#define ub_tw_count spriv.ubp_tw_count

 	atomic_long_t dirty_pages;
 	atomic_long_t writeback_pages;
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index 6b5ed78..8edef0d 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -347,9 +347,6 @@ static struct user_beancounter *alloc_ub(const char *name)
 	if (!new_ub->ub_name)
 		goto fail_name;

-	if (percpu_counter_init(_ub->ub_orphan_count, 0))
-		goto fail_pcpu;
-
 	new_ub->ub_percpu = alloc_percpu(struct ub_percpu_struct);
 	if (new_ub->ub_percpu == NULL)
 		goto fail_free;
@@ -357,8 +354,6 @@ static struct user_beancounter *alloc_ub(const char *name)
 	return new_ub;

 fail_free:
-	percpu_counter_destroy(_ub->ub_orphan_count);
-fail_pcpu:
 	kfree(new_ub->ub_name);
 fail_name:
 	kfree(new_ub);
@@ -367,7 +362,6 @@ fail_name:

 static inline void free_ub(struct user_beancounter *ub)
 {
-	percpu_counter_destroy(>ub_orphan_count);
 	free_percpu(ub->ub_percpu);
 	kfree(ub->ub_store);
 	kfree(ub->private_data2);
@@ -1068,8 +1062,6 @@ static void init_beancounter_struct(struct user_beancounter *ub)
 {
 	ub->ub_magic = UB_MAGIC;
 	spin_lock_init(>ub_lock);
-	INIT_LIST_HEAD(>ub_tcp_sk_list);
-	INIT_LIST_HEAD(>ub_other_sk_list);
 }

 static void init_beancounter_nolimits(struct user_beancounter *ub)
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ub: zap ub_tmpfs_respages
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit fb50f65f2c49b6a73e369153a660ff17227223b0
Author: Vladimir Davydov
Date:   Thu Sep 3 14:34:34 2015 +0400

    ub: zap ub_tmpfs_respages

    It is always 0 in both Vz7 and PCS6, so drop it.

    Signed-off-by: Vladimir Davydov
---
 include/bc/beancounter.h |  7 ---
 kernel/bc/beancounter.c  |  1 -
 kernel/bc/vm_pages.c     | 19 ---
 3 files changed, 27 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index ec2ba18..a6241f6 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -50,10 +50,6 @@

 struct task_beancounter;

-struct page_private {
-	unsigned long ubp_tmpfs_respages;
-};
-
 struct ub_percpu_struct {
 	int dirty_pages;
 	int writeback_pages;
@@ -106,9 +102,6 @@ struct user_beancounter {

 	struct ratelimit_state ub_ratelimit;

-	struct page_private ppriv;
-#define ub_tmpfs_respages ppriv.ubp_tmpfs_respages
-
 	atomic_long_t dirty_pages;
 	atomic_long_t writeback_pages;
 	atomic_long_t wb_requests;
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index 8edef0d..d0ab65a 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -492,7 +492,6 @@ static inline int bc_verify_held(struct user_beancounter *ub)
 			__ub_stat_get(ub, dirty_pages));
 	clean &= verify_res(ub, "writeback_pages",
 			__ub_stat_get(ub, writeback_pages));
-	clean &= verify_res(ub, "tmpfs_respages", ub->ub_tmpfs_respages);

 	return clean;
 }
diff --git a/kernel/bc/vm_pages.c b/kernel/bc/vm_pages.c
index 23e8742..7529899 100644
--- a/kernel/bc/vm_pages.c
+++ b/kernel/bc/vm_pages.c
@@ -119,22 +119,6 @@ void ub_lockedshm_uncharge(struct shmem_inode_info *shi, unsigned long size)
 	uncharge_beancounter(ub, UB_LOCKEDPAGES, size >> PAGE_SHIFT);
 }

-static inline void do_ub_tmpfs_respages_sub(struct user_beancounter *ub,
-		unsigned long size)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(>ub_lock, flags);
-	/* catch possible overflow */
-	if (ub->ub_tmpfs_respages < size) {
-		uncharge_warn(ub, "tmpfs_respages",
-				size, ub->ub_tmpfs_respages);
-		size = ub->ub_tmpfs_respages;
-	}
-	ub->ub_tmpfs_respages -= size;
-	spin_unlock_irqrestore(>ub_lock, flags);
-}
-
 static int bc_fill_sysinfo(struct user_beancounter *ub,
 		unsigned long meminfo_val, struct sysinfo *si)
 {
@@ -269,9 +253,6 @@ static int bc_vmaux_show(struct seq_file *f, void *v)

 	ub_sync_memcg(ub);

-	seq_printf(f, bc_proc_lu_fmt, "tmpfs_respages",
-			ub->ub_tmpfs_respages);
-
 	seq_printf(f, bc_proc_lu_fmt, "ram", ub->ub_parms[UB_PHYSPAGES].held);

 	return 0;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ms/crypto/ghash-intel: specify context size for ghash async algorithm
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 94edfce3b5486a560f9a469659038ffa6310a621
Author: Andrey Ryabinin
Date:   Thu Sep 3 13:25:49 2015 +0400

    ms/crypto/ghash-intel: specify context size for ghash async algorithm

    Currently the context size (cra_ctxsize) isn't specified for
    ghash_async_alg, which means it's zero. Thus crypto_create_tfm()
    doesn't allocate the space needed for ghash_async_ctx, so any
    read/write to ctx becomes invalid.

    https://jira.sw.ru/browse/PSBM-38669

    Signed-off-by: Andrey Ryabinin

    khorenko@: the patch to be sent to mainstream as well.
---
 arch/x86/crypto/ghash-clmulni-intel_glue.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c b/arch/x86/crypto/ghash-clmulni-intel_glue.c
index 6759dd1..11e213e 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_glue.c
+++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c
@@ -283,6 +283,7 @@ static struct ahash_alg ghash_async_alg = {
 			.cra_name = "ghash",
 			.cra_driver_name = "ghash-clmulni",
 			.cra_priority = 400,
+			.cra_ctxsize = sizeof(struct ghash_async_ctx),
 			.cra_flags = CRYPTO_ALG_TYPE_AHASH | CRYPTO_ALG_ASYNC,
 			.cra_blocksize = GHASH_BLOCK_SIZE,
 			.cra_type = _ahash_type,
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
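Why a zero cra_ctxsize makes every access to the context invalid can be shown with a user-space model of "object header plus trailing context" allocation. This is a sketch under stated assumptions: the `model_*` types and functions are hypothetical, and the real crypto_create_tfm() also accounts for alignment and frontend-specific sizes that are left out here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical algorithm descriptor: only the declared context size matters. */
struct model_alg { size_t ctxsize; };

/* Hypothetical transform: a header followed by ctxsize bytes of context. */
struct model_tfm {
	const struct model_alg *alg;
	unsigned char ctx[];	/* flexible array member, sized at allocation */
};

/* Hypothetical per-transform state the driver wants to keep in ctx. */
struct model_async_ctx { void *cryptd_tfm; };

/* The allocation is sized from what the algorithm declares, nothing more. */
static size_t model_tfm_alloc_size(const struct model_alg *alg)
{
	return sizeof(struct model_tfm) + alg->ctxsize;
}

static struct model_tfm *model_create_tfm(const struct model_alg *alg)
{
	struct model_tfm *tfm = calloc(1, model_tfm_alloc_size(alg));

	if (tfm != NULL)
		tfm->alg = alg;
	return tfm;
}

/* Nonzero only if `need` bytes of context were actually allocated. */
static int model_ctx_fits(const struct model_tfm *tfm, size_t need)
{
	return tfm->alg->ctxsize >= need;
}
```

With ctxsize left at zero the allocation ends right after the header, so storing a `struct model_async_ctx` in `ctx` would write past the end of the object; declaring the size, as the one-line patch does, reserves the room.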
[Devel] [PATCH RHEL7 COMMIT] Revert diff-writeback-throttle-writer-when-local-BDI-threshold-is-hit bits
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 56782e5a6d798078f3523e97f6f2eb8277028b91
Author: Vladimir Davydov
Date:   Thu Sep 3 13:53:01 2015 +0400

    Revert diff-writeback-throttle-writer-when-local-BDI-threshold-is-hit bits

    This was brought by the initial commit 2a8b5de95918, but it is
    incomplete - the following hunk patching balance_dirty_pages was lost:

    > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
    > index 003b68e..a58795c 100644
    > --- a/mm/page-writeback.c
    > +++ b/mm/page-writeback.c
    > @@ -546,7 +546,8 @@ static void balance_dirty_pages(struct address_space *mapping,
    >  		 * catch-up. This avoids (excessively) small writeouts
    >  		 * when the bdi limits are ramping up.
    >  		 */
    > -		if (nr_reclaimable + nr_writeback <
    > +		if (bdi_cap_account_writeback(bdi) &&
    > +		    nr_reclaimable + nr_writeback <
    >  				(background_thresh + dirty_thresh) / 2 &&
    >  		    ub_dirty + ub_writeback <
    >  				(ub_background_thresh + ub_thresh) / 2)

    I've filed a separate issue for porting it:

    https://jira.sw.ru/browse/PSBM-39167

    Signed-off-by: Vladimir Davydov
---
 fs/fs-writeback.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9cdcc28..66586a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -843,9 +843,6 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 {
 	unsigned long background_thresh, dirty_thresh;

-	if (!bdi_cap_account_writeback(bdi) && bdi->dirty_exceeded)
-		return true;
-
 	global_dirty_limits(_thresh, _thresh);

 	if (global_page_state(NR_FILE_DIRTY) +
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ub: drop swapin/swapout stats
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 681603761a5a2c380c8c3c2f2d3c878aa5a267a0
Author: Vladimir Davydov
Date:   Thu Sep 3 14:34:22 2015 +0400

    ub: drop swapin/swapout stats

    Swapin/swapout cannot be accounted by beancounters anymore, because
    memory management moved to memcg. Right now, these stats are not
    provided by memcg, so this patch simply drops them from /proc/vmstat
    inside container and from /proc/bc/CTID/vmaux on the host.

    If anybody requests these counters, they should be reimplemented in
    the scope of memcg and returned back.

    https://jira.sw.ru/browse/PSBM-39327

    Signed-off-by: Vladimir Davydov
---
 include/bc/beancounter.h |  6 --
 kernel/bc/vm_pages.c     | 31 +--
 2 files changed, 1 insertion(+), 36 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index 3c32ddf..ec2ba18 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -63,12 +63,6 @@ struct ub_percpu_struct {
 	unsigned long fuse_requests;
 	unsigned long fuse_bytes;

-	unsigned long swapin;
-	unsigned long swapout;
-
-	unsigned long vswapin;
-	unsigned long vswapout;
-
 #ifdef CONFIG_BC_IO_ACCOUNTING
 	unsigned long async_write_complete;
 	unsigned long async_write_canceled;
diff --git a/kernel/bc/vm_pages.c b/kernel/bc/vm_pages.c
index c52d34f..23e8742 100644
--- a/kernel/bc/vm_pages.c
+++ b/kernel/bc/vm_pages.c
@@ -220,18 +220,7 @@ out:

 static int bc_fill_vmstat(struct user_beancounter *ub, unsigned long *stat)
 {
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		struct ub_percpu_struct *pcpu = ub_percpu(ub, cpu);
-
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPIN] += pcpu->swapin;
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPOUT] += pcpu->swapout;
-
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPIN] += pcpu->vswapin;
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPOUT] += pcpu->vswapout;
-	}
-
+	/* FIXME: show swapin/swapout? */
 	return NOTIFY_OK;
 }

@@ -275,32 +264,14 @@ module_exit(fini_vmguar_notifier);
 static int bc_vmaux_show(struct seq_file *f, void *v)
 {
 	struct user_beancounter *ub;
-	struct ub_percpu_struct *ub_pcpu;
-	unsigned long swapin, swapout, vswapin, vswapout;
-	int i;

 	ub = seq_beancounter(f);

 	ub_sync_memcg(ub);

-	swapin = swapout = vswapin = vswapout = 0;
-	for_each_possible_cpu(i) {
-		ub_pcpu = ub_percpu(ub, i);
-		swapin += ub_pcpu->swapin;
-		swapout += ub_pcpu->swapout;
-		vswapin += ub_pcpu->vswapin;
-		vswapout += ub_pcpu->vswapout;
-	}
-
 	seq_printf(f, bc_proc_lu_fmt, "tmpfs_respages",
 			ub->ub_tmpfs_respages);

-	seq_printf(f, bc_proc_lu_fmt, "swapin", swapin);
-	seq_printf(f, bc_proc_lu_fmt, "swapout", swapout);
-
-	seq_printf(f, bc_proc_lu_fmt, "vswapin", vswapin);
-	seq_printf(f, bc_proc_lu_fmt, "vswapout", vswapout);
-
 	seq_printf(f, bc_proc_lu_fmt, "ram", ub->ub_parms[UB_PHYSPAGES].held);

 	return 0;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7 0/4] memcg/kmem: account some non-slab objects
And another patchset for your attention.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/26/2015 07:28 PM, Vladimir Davydov wrote:
> This patch set implements memcg/kmem accounting for vmalloc, pipe
> buffers, and page tables. I'll probably try to submit these patches
> (slightly modified) upstream after v4.2 has been released.
>
> Vladimir Davydov (4):
>   vmalloc: account to memcg/kmem
>   fs: account anon pipe buffers to memcg/kmem
>   gfp: add __get_free_kmem_pages helper
>   arch: x86: charge page tables to memcg/kmem
>
>  arch/x86/include/asm/pgalloc.h | 13 +++--
>  arch/x86/mm/pgtable.c          | 24 +++-
>  fs/pipe.c                      | 13 -
>  include/linux/gfp.h            |  1 +
>  mm/page_alloc.c                | 12
>  mm/vmalloc.c                   |  6 +++---
>  6 files changed, 46 insertions(+), 23 deletions(-)
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ub: zap ub_dirty_pages
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit d6328c465a4b45ab97e55f17b4896da32986c1bd
Author: Vladimir Davydov
Date:   Thu Sep 3 15:26:24 2015 +0400

    ub: zap ub_dirty_pages

    It is not used anywhere.

    Signed-off-by: Vladimir Davydov
---
 include/bc/io_acct.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/include/bc/io_acct.h b/include/bc/io_acct.h
index 5b51853..fa7afb1 100644
--- a/include/bc/io_acct.h
+++ b/include/bc/io_acct.h
@@ -56,8 +56,6 @@ extern void ub_io_account_cancel(struct address_space *mapping);
 extern void ub_io_writeback_inc(struct address_space *mapping);
 extern void ub_io_writeback_dec(struct address_space *mapping);

-#define ub_dirty_pages(ub) ub_stat_get(ub, dirty_pages)
-
 extern int ub_dirty_limits(unsigned long *pbackground,
 		long *pdirty, struct user_beancounter *ub);

@@ -101,11 +99,6 @@ static inline void ub_io_writeback_dec(struct address_space *mapping)
 {
 }

-static inline unsigned long ub_dirty_pages(struct user_beancounter *ub)
-{
-	return 0;
-}
-
 static inline int ub_dirty_limits(unsigned long *pbackground,
 		long *pdirty, struct user_beancounter *ub)
 {
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] ub: rename private_data2 to iolimit
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 2b26765194fd30316e5b34aa72a2c5043bd11e8d
Author: Vladimir Davydov
Date:   Thu Sep 3 15:29:20 2015 +0400

    ub: rename private_data2 to iolimit

    ub->private_data2 is only used for storing iolimit housekeeping
    struct, so call it appropriately.

    Signed-off-by: Vladimir Davydov
---
 include/bc/beancounter.h |  2 +-
 kernel/bc/beancounter.c  |  2 +-
 kernel/ve/vziolimit.c    | 14 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index a6241f6..5ba999e 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -107,7 +107,7 @@ struct user_beancounter {
 	atomic_long_t wb_requests;
 	atomic_long_t wb_sectors;

-	void *private_data2;
+	void *iolimit;

 	/* resources statistic and settings */
 	struct ubparm ub_parms[UB_RESOURCES];
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index f9e7fea..90fc1dd 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -364,8 +364,8 @@ static inline void free_ub(struct user_beancounter *ub)
 {
 	free_percpu(ub->ub_percpu);
 	kfree(ub->ub_store);
-	kfree(ub->private_data2);
 	kfree(ub->ub_name);
+	kfree(ub->iolimit);
 	kfree(ub);
 }

diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 1da233d..628ec80 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -163,7 +163,7 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
 		unsigned long cmd, void *arg, int old_ret)
 {
 	struct user_beancounter *ub = get_exec_ub();
-	struct iolimit *iolimit = ub->private_data2;
+	struct iolimit *iolimit = ub->iolimit;
 	unsigned long flags, timeout;
 	struct request_queue *q;

@@ -257,7 +257,7 @@ static void throttle_state(struct user_beancounter *ub,

 static struct iolimit *iolimit_get(struct user_beancounter *ub)
 {
-	struct iolimit *iolimit = ub->private_data2;
+	struct iolimit *iolimit = ub->iolimit;

 	if (iolimit)
 		return iolimit;
@@ -268,11 +268,11 @@ static struct iolimit *iolimit_get(struct user_beancounter *ub)
 	init_waitqueue_head(>wq);

 	spin_lock_irq(>ub_lock);
-	if (ub->private_data2) {
+	if (ub->iolimit) {
 		kfree(iolimit);
-		iolimit = ub->private_data2;
+		iolimit = ub->iolimit;
 	} else
-		ub->private_data2 = iolimit;
+		ub->iolimit = iolimit;
 	spin_unlock_irq(>ub_lock);

 	return iolimit;
@@ -296,7 +296,7 @@ static int iolimit_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 	if (!ub)
 		return -ENOENT;

-	iolimit = ub->private_data2;
+	iolimit = ub->iolimit;

 	switch (cmd) {
 	case VZCTL_SET_IOLIMIT:
@@ -365,7 +365,7 @@ static ssize_t iolimit_cgroup_read(struct cgroup *cg, struct cftype *cft,
 		size_t nbytes, loff_t *ppos)
 {
 	struct user_beancounter *ub = cgroup_ub(cg);
-	struct iolimit *iolimit = ub->private_data2;
+	struct iolimit *iolimit = ub->iolimit;
 	unsigned long val = 0;
 	int len;
 	char str[32];
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RHEL7 COMMIT] ub: drop swapin/swapout stats
If anybody really checks swapin/swapout inside Containers, please let us know the use case - how you use these stats.

Thank you.

--
Konstantin

On 09/03/2015 01:34 PM, Konstantin Khorenko wrote:
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 681603761a5a2c380c8c3c2f2d3c878aa5a267a0
Author: Vladimir Davydov <vdavy...@parallels.com>
Date: Thu Sep 3 14:34:22 2015 +0400

    ub: drop swapin/swapout stats

    Swapin/swapout cannot be accounted by beancounters anymore, because
    memory management moved to memcg. Right now, these stats are not
    provided by memcg, so this patch simply drops them from /proc/vmstat
    inside container and from /proc/bc/CTID/vmaux on the host.

    If anybody requests these counters, they should be reimplemented in the
    scope of memcg and returned back.

    https://jira.sw.ru/browse/PSBM-39327

    Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 include/bc/beancounter.h | 6 --
 kernel/bc/vm_pages.c | 31 +--
 2 files changed, 1 insertion(+), 36 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index 3c32ddf..ec2ba18 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -63,12 +63,6 @@ struct ub_percpu_struct {
 	unsigned long fuse_requests;
 	unsigned long fuse_bytes;

-	unsigned long swapin;
-	unsigned long swapout;
-
-	unsigned long vswapin;
-	unsigned long vswapout;
-
 #ifdef CONFIG_BC_IO_ACCOUNTING
 	unsigned long async_write_complete;
 	unsigned long async_write_canceled;
diff --git a/kernel/bc/vm_pages.c b/kernel/bc/vm_pages.c
index c52d34f..23e8742 100644
--- a/kernel/bc/vm_pages.c
+++ b/kernel/bc/vm_pages.c
@@ -220,18 +220,7 @@ out:

 static int bc_fill_vmstat(struct user_beancounter *ub, unsigned long *stat)
 {
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		struct ub_percpu_struct *pcpu = ub_percpu(ub, cpu);
-
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPIN] += pcpu->swapin;
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPOUT] += pcpu->swapout;
-
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPIN] += pcpu->vswapin;
-		stat[NR_VM_ZONE_STAT_ITEMS + PSWPOUT] += pcpu->vswapout;
-	}
-
+	/* FIXME: show swapin/swapout? */
 	return NOTIFY_OK;
 }

@@ -275,32 +264,14 @@ module_exit(fini_vmguar_notifier);
 static int bc_vmaux_show(struct seq_file *f, void *v)
 {
 	struct user_beancounter *ub;
-	struct ub_percpu_struct *ub_pcpu;
-	unsigned long swapin, swapout, vswapin, vswapout;
-	int i;

 	ub = seq_beancounter(f);
 	ub_sync_memcg(ub);

-	swapin = swapout = vswapin = vswapout = 0;
-	for_each_possible_cpu(i) {
-		ub_pcpu = ub_percpu(ub, i);
-		swapin += ub_pcpu->swapin;
-		swapout += ub_pcpu->swapout;
-		vswapin += ub_pcpu->vswapin;
-		vswapout += ub_pcpu->vswapout;
-	}
-
 	seq_printf(f, bc_proc_lu_fmt, "tmpfs_respages", ub->ub_tmpfs_respages);

-	seq_printf(f, bc_proc_lu_fmt, "swapin", swapin);
-	seq_printf(f, bc_proc_lu_fmt, "swapout", swapout);
-
-	seq_printf(f, bc_proc_lu_fmt, "vswapin", vswapin);
-	seq_printf(f, bc_proc_lu_fmt, "vswapout", vswapout);
-
 	seq_printf(f, bc_proc_lu_fmt, "ram", ub->ub_parms[UB_PHYSPAGES].held);

 	return 0;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
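For readers skimming the "ub: drop swapin/swapout stats" patch above: the removed loops summed per-CPU counters each time the proc file was read. The following is only a userspace sketch of that read-side aggregation pattern, not the kernel code. The names `toy_percpu`, `toy_totals` and `toy_sum_percpu` are hypothetical, and a plain array stands in for the kernel's per-CPU storage and `for_each_possible_cpu()`.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU part of a beancounter. */
struct toy_percpu {
	unsigned long swapin;
	unsigned long swapout;
};

/* Totals as they would be reported to the reader of a proc file. */
struct toy_totals {
	unsigned long swapin;
	unsigned long swapout;
};

/*
 * Sum the counters of every CPU slot. Each CPU updates only its own
 * slot, so the update path needs no shared lock; the cost is paid on
 * read, where all slots are walked.
 */
static struct toy_totals toy_sum_percpu(const struct toy_percpu *pcpu,
					size_t ncpus)
{
	struct toy_totals sum = { 0, 0 };
	size_t i;

	for (i = 0; i < ncpus; i++) {
		sum.swapin += pcpu[i].swapin;
		sum.swapout += pcpu[i].swapout;
	}
	return sum;
}
```

Once nothing increments the per-CPU fields any more, as the commit message says is the case after the move to memcg, such a sum would always report stale or zero values, which is why the patch removes both the fields and the loops.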
[Devel] [PATCH RHEL7 COMMIT] ploop: use GFP_NOIO in ploop_make_request
The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit af3295b412e5b1fefb22857d26e1332cafc186d5
Author: Vladimir Davydov
Date: Thu Sep 3 15:37:24 2015 +0400

    ploop: use GFP_NOIO in ploop_make_request

    Currently, we use GFP_NOFS, which may result in a deadlock as follows:

    filemap_fault
     do_mpage_readpage
      submit_bio
       generic_make_request
        initializes current->bio_list
        calls make_request_fn
         ploop_make_request
          bio_alloc(GFP_NOFS)
           kmem_cache_alloc
            memcg_charge_kmem
             try_to_free_mem_cgroup_pages
              swap_writepage
               generic_make_request
                puts bio on current->bio_list
              try_to_free_mem_cgroup_pages
               wait_on_page_writeback

    The wait_on_page_writeback will never complete then, because the
    corresponding bio is on current->bio_list and for it to get to the
    queue we must return from ploop_make_request first.

    The stack trace of a hung task:

     [] sleep_on_page+0xe/0x20
     [] wait_on_page_bit+0x86/0xb0
     [] shrink_page_list+0x6e2/0xaf0
     [] shrink_inactive_list+0x1cb/0x610
     [] shrink_lruvec+0x395/0x790
     [] shrink_zone+0x181/0x350
     [] do_try_to_free_pages+0x170/0x530
     [] try_to_free_mem_cgroup_pages+0xb6/0x140
     [] __mem_cgroup_try_charge+0x1de/0xd70
     [] memcg_charge_kmem+0x9b/0x100
     [] __memcg_charge_slab+0x3b/0x90
     [] new_slab+0x264/0x3f0
     [] __slab_alloc+0x315/0x48f
     [] kmem_cache_alloc+0x1cc/0x210
     [] mempool_alloc_slab+0x15/0x20
     [] mempool_alloc+0x69/0x170
     [] bvec_alloc+0x92/0x120
     [] bio_alloc_bioset+0x1e8/0x2e0
     [] ploop_make_request+0x2a6/0xac0 [ploop]
     [] generic_make_request+0xe2/0x130
     [] submit_bio+0x77/0x1c0
     [] do_mpage_readpage+0x37f/0x6e0
     [] mpage_readpages+0xeb/0x160
     [] ext4_readpages+0x3c/0x40 [ext4]
     [] __do_page_cache_readahead+0x1e0/0x260
     [] ra_submit+0x21/0x30
     [] filemap_fault+0x321/0x4b0
     [] __do_fault+0x8a/0x560
     [] handle_mm_fault+0x3d0/0xd80
     [] __do_page_fault+0x15e/0x530
     [] do_page_fault+0x1a/0x70
     [] page_fault+0x28/0x30

    https://jira.sw.ru/browse/PSBM-38842

    Signed-off-by: Vladimir Davydov
---
 drivers/block/ploop/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 97e75a7..7eb9865 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -717,7 +717,7 @@ preallocate_bio(struct bio * orig_bio, struct ploop_device * plo)
 	}

 	if (nbio == NULL)
-		nbio = bio_alloc(GFP_NOFS, max(orig_bio->bi_max_vecs, block_vecs(plo)));
+		nbio = bio_alloc(GFP_NOIO, max(orig_bio->bi_max_vecs, block_vecs(plo)));

 	return nbio;
 }
@@ -852,7 +852,7 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
 	if (!current->io_context) {
 		struct io_context *ioc;

-		ioc = get_task_io_context(current, GFP_NOFS, NUMA_NO_NODE);
+		ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE);
 		if (ioc)
 			put_io_context(ioc);
 	}
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
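The deadlock described in the "ploop: use GFP_NOIO in ploop_make_request" commit message above can be illustrated with a small userspace model. This is a sketch under simplifying assumptions, not kernel code: `toy_task`, `toy_submit`, `toy_alloc_under_pressure` and `toy_make_request` are hypothetical names, a flag stands in for `current->bio_list` being set, and "waiting forever" is reported as a `false` return instead of actually hanging.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two allocation contexts in the patch. */
enum toy_gfp { TOY_GFP_NOFS, TOY_GFP_NOIO };

struct toy_task {
	bool in_make_request;	/* models current->bio_list being set */
	int deferred;		/* bios parked until make_request returns */
};

/*
 * Models a nested generic_make_request(): while the task is already
 * inside a make_request function, a new bio is only queued.
 * Returns true if the bio was dispatched, false if it was parked.
 */
static bool toy_submit(struct toy_task *t)
{
	if (t->in_make_request) {
		t->deferred++;
		return false;
	}
	return true;
}

/*
 * Models an allocation that hits memory pressure and enters reclaim.
 * Returns false when reclaim would wait on I/O that cannot start.
 */
static bool toy_alloc_under_pressure(struct toy_task *t, enum toy_gfp gfp)
{
	if (gfp == TOY_GFP_NOIO)
		return true;	/* reclaim must not start I/O: nothing to wait on */

	/*
	 * NOFS-like context: reclaim may write a page to swap and then
	 * wait for that writeback. The wait can only finish if the bio
	 * was actually dispatched.
	 */
	return toy_submit(t);
}

/* Models ploop_make_request() allocating a bio with the given flags. */
static bool toy_make_request(enum toy_gfp gfp)
{
	struct toy_task t = { .in_make_request = true, .deferred = 0 };
	bool ok = toy_alloc_under_pressure(&t, gfp);

	t.in_make_request = false;	/* only now would parked bios be issued */
	return ok;
}
```

In this model `toy_make_request(TOY_GFP_NOFS)` returns `false`, because the swap writeback bio stays parked while the caller waits on it, and `toy_make_request(TOY_GFP_NOIO)` returns `true`, which mirrors the reasoning behind switching both allocations to GFP_NOIO.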