[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_ref_cancel_init()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0873bd8f500347f34f06ddad0fbf024df91f8add
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 14:49:24 2015 +0400

ms/percpu-refcount: implement percpu_ref_cancel_init()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo <t...@kernel.org>

Normally, percpu_ref_init() initializes and percpu_ref_kill()
initiates destruction which completes asynchronously. The
asynchronous destruction can be problematic in init failure path where
the caller wants to destroy half-constructed object - distinguishing
half-constructed objects from the usual release method can be painful
for complex objects.

This patch implements percpu_ref_cancel_init() which synchronously
destroys the percpu_ref without invoking release. To avoid
unintentional misuses, the function requires the ref to have finished
percpu_ref_init() but never used and triggers WARN otherwise.

v2: Explain the weird name and usage restriction in the function
    comment.
Signed-off-by: Tejun Heo <t...@kernel.org>
Acked-by: Kent Overstreet <koverstr...@google.com>
(cherry picked from commit bc497bd33b2d6a6f07bc8574b4764edbd7fdffa8)
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c           | 31 +++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 8146aa9..6d843d6 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,6 +68,7 @@ struct percpu_ref {

 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release);
+void percpu_ref_cancel_init(struct percpu_ref *ref);
 void percpu_ref_kill(struct percpu_ref *ref);

 #define PCPU_STATUS_BITS	2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index b35eaac..ebeaac2 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -54,6 +54,37 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 	return 0;
 }

+/**
+ * percpu_ref_cancel_init - cancel percpu_ref_init()
+ * @ref: percpu_ref to cancel init for
+ *
+ * Once a percpu_ref is initialized, its destruction is initiated by
+ * percpu_ref_kill() and completes asynchronously, which can be painful to
+ * do when destroying a half-constructed object in init failure path.
+ *
+ * This function destroys @ref without invoking @ref->release and the
+ * memory area containing it can be freed immediately on return.  To
+ * prevent accidental misuse, it's required that @ref has finished
+ * percpu_ref_init(), whether successful or not, but never used.
+ *
+ * The weird name and usage restriction are to prevent people from using
+ * this function by mistake for normal shutdown instead of
+ * percpu_ref_kill().
+ */
+void percpu_ref_cancel_init(struct percpu_ref *ref)
+{
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
+	int cpu;
+
+	WARN_ON_ONCE(atomic_read(&ref->count) != 1 + PCPU_COUNT_BIAS);
+
+	if (pcpu_count) {
+		for_each_possible_cpu(cpu)
+			WARN_ON_ONCE(*per_cpu_ptr(pcpu_count, cpu));
+		free_percpu(ref->pcpu_count);
+	}
+}
+
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
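The init-failure pattern that percpu_ref_cancel_init() is meant for can be sketched in plain userspace C. This is only a rough model under stated assumptions: `toy_ref`, `toy_obj` and every `toy_*` function are hypothetical names invented for illustration, the "per-CPU" counters are an ordinary array, and nothing here calls the kernel API. It shows a constructor whose second step fails and which then tears the ref down synchronously, without any release callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_NR_CPUS 4	/* hypothetical fixed CPU count for the model */

/* Hypothetical stand-in for struct percpu_ref; not the kernel type. */
struct toy_ref {
	unsigned *pcpu_count;	/* per-CPU slots, NULL if allocation failed */
	long count;		/* holds the initial reference */
	int release_calls;	/* how often a release callback ran */
};

/* Like percpu_ref_init(): allocates memory, so it can fail. */
static int toy_ref_init(struct toy_ref *ref)
{
	ref->count = 1;
	ref->release_calls = 0;
	ref->pcpu_count = calloc(TOY_NR_CPUS, sizeof(*ref->pcpu_count));
	return ref->pcpu_count ? 0 : -1;
}

/*
 * Models the idea behind percpu_ref_cancel_init(): tear down synchronously
 * and never call release. Returns false where the kernel would WARN,
 * i.e. when the ref shows signs of having been used.
 */
static bool toy_ref_cancel_init(struct toy_ref *ref)
{
	bool unused = (ref->count == 1);

	if (ref->pcpu_count) {
		for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			if (ref->pcpu_count[cpu])
				unused = false;
		free(ref->pcpu_count);
		ref->pcpu_count = NULL;
	}
	return unused;
}

struct toy_obj {
	struct toy_ref ref;
	char *buf;
};

/* Two-step constructor; the second step can be forced to fail. */
static struct toy_obj *toy_obj_create(bool fail_second_step)
{
	struct toy_obj *obj = calloc(1, sizeof(*obj));

	if (!obj)
		return NULL;
	if (toy_ref_init(&obj->ref))
		goto err_cancel;	/* cancel is allowed after a failed init too */
	obj->buf = fail_second_step ? NULL : malloc(64);
	if (!obj->buf)
		goto err_cancel;
	return obj;

err_cancel:
	/* Half-constructed: free right away, no release callback involved. */
	toy_ref_cancel_init(&obj->ref);
	free(obj);
	return NULL;
}

/* Normal teardown of the toy object (this model has no async step). */
static void toy_obj_destroy(struct toy_obj *obj)
{
	assert(obj != NULL);
	free(obj->buf);
	free(obj->ref.pcpu_count);
	free(obj);
}
```

In the real kernel the normal teardown path would go through percpu_ref_kill() and the release callback; the toy destroy function above skips that on purpose, since the point of the sketch is only the error path.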
[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: split cgroup destruction into two steps
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 33f3496e5d1342b4497058d017261d3b3fde0fe1
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 14:49:26 2015 +0400

ms/cgroup: split cgroup destruction into two steps

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo <t...@kernel.org>

Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item. The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent. The separation is to allow using percpu refcnt for css.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name. As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine. A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.
As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both. This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes. INIT_WORK() is now performed right before queueing the work
item.

Signed-off-by: Tejun Heo <t...@kernel.org>
Acked-by: Li Zefan <lize...@huawei.com>
(cherry picked from commit ea15f8ccdb430af1e8bc9b4e19a230eb4c356777)
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>

Conflicts:
	kernel/cgroup.c
---
 include/linux/cgroup.h |  2 +-
 kernel/cgroup.c        | 25 -
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 626bc84..d34c42b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -259,7 +259,7 @@ struct cgroup {

 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
-	struct work_struct free_work;
+	struct work_struct destroy_work;

 	/* List of events which userspace want to receive */
 	struct list_head event_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 062e0f4..6fd7038 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -213,6 +213,7 @@ static struct cgroup_name root_cgroup_name = { .name = "/" };
  */
 static int need_forkexit_callback __read_mostly;

+static void cgroup_offline_fn(struct work_struct *work);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
 			      struct cftype cfts[], bool is_add);
@@ -836,7 +837,7 @@ static struct cgroup_name *cgroup_alloc_name(struct dentry *dentry)

 static void cgroup_free_fn(struct work_struct *work)
 {
-	struct cgroup *cgrp = container_of(work, struct cgroup, free_work);
+	struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
 	struct cgroup_subsys *ss;

 	mutex_lock(&cgroup_mutex);
@@ -881,7 +882,8 @@ static void cgroup_free_rcu(struct rcu_head *head)
 {
 	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

-	queue_work(cgroup_destroy_wq, &cgrp->free_work);
+	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
+	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }

 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -1416,7 +1418,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->allcg_node);
 	INIT_LIST_HEAD(&cgrp->release_list);
 	INIT_LIST_HEAD(&cgrp->pidlists);
-	INIT_WORK(&cgrp->free_work, cgroup_free_fn);
 	mutex_init(&cgrp->pidlist_mutex);
 	INIT_LIST_HEAD(&cgrp->event_list);
 	spin_lock_init(&cgrp->event_list_lock);
@@ -4355,7 +4356,6 @@ static int
Re: [Devel] [RFC rh7 v5] ve/tty: vt -- Implement per VE support for console and terminals
On Thu, Aug 27, 2015 at 10:24:15PM +0300, Cyrill Gorcunov wrote:
> On Thu, Aug 27, 2015 at 07:11:28PM +0300, Vladimir Davydov wrote:
> > Hmm, checkpatch still has max_line_length set to 80. Could you please
> > share a link to this agreement?
>
> fine. Wonder, do you really still sit on 80 chars terminal?

I use a 12" laptop. With the window vertically split into two panes, I
have only ~80 characters per each pane.

> https://lkml.org/lkml/2012/2/3/101
>
> One of the several conversations.

Fortunately, 80 column limit is still there. I've just checked that on
my external 20" display, with the font size my eyes are used to, I can
keep two panes of 104 columns max. So if they decided to switch to 100
column standard, even a huge 15" laptop wouldn't save me :-/

> nb: you know, moving patches from mainline (slave lock) seems to be
> not that simple, they introduced own new lock class for that. at
> moment i think how to modify our vtty code without mangling general
> tty code.

I'd still suggest moving EXTRA_REF logic to tty_io.c. Yes, it's going to
hurt a little during rebases, but if we try to keep it in pty.c we will
implicitly rely on tty_io.c internal logic (locking or ref counting
rules), which will probably result in failures at runtime, which is much
worse than failures at build time.
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: Don't use silly cmpxchg()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 337bb797aa4aa5eca030d634d0a9874290511db5
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 14:49:21 2015 +0400

ms/percpu-refcount: Don't use silly cmpxchg()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Kent Overstreet <koverstr...@google.com>

The cmpxchg() was just to ensure the debug check didn't race, which was
a bit excessive. The caller is supposed to do the appropriate
synchronization, which means percpu_ref_kill() can just do a simple
store.
Signed-off-by: Kent Overstreet <koverstr...@google.com>
Signed-off-by: Tejun Heo <t...@kernel.org>
(cherry picked from commit c1ae6e9b4db00023b9caed72af49a93abad46452)
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 lib/percpu-refcount.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 6f0ffd7..1a17399 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -107,22 +107,11 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
  */
 void percpu_ref_kill(struct percpu_ref *ref)
 {
-	unsigned __percpu *pcpu_count, *old, *new;
+	WARN_ONCE(REF_STATUS(ref->pcpu_count) == PCPU_REF_DEAD,
+		  "percpu_ref_kill() called more than once!\n");

-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
-	do {
-		if (REF_STATUS(pcpu_count) == PCPU_REF_DEAD) {
-			WARN(1, "percpu_ref_kill() called more than once!\n");
-			return;
-		}
-
-		old = pcpu_count;
-		new = (unsigned __percpu *)
-			(((unsigned long) pcpu_count)|PCPU_REF_DEAD);
-
-		pcpu_count = cmpxchg(&ref->pcpu_count, old, new);
-	} while (pcpu_count != old);
+	ref->pcpu_count = (unsigned __percpu *)
+		(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);

 	call_rcu(&ref->rcu, percpu_ref_kill_rcu);
 }
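The "simple store" in this patch sets a status flag in the low bits of the counter pointer. A minimal userspace sketch of that pointer-tagging idea follows. It assumes a single caller that does its own synchronization, as the patch description requires. All `toy_*` / `TOY_*` names and the flag values are hypothetical and chosen for illustration; they are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status encoding kept in the low bits of the pointer. */
#define TOY_STATUS_BITS	2
#define TOY_STATUS_MASK	((uintptr_t)((1u << TOY_STATUS_BITS) - 1))
#define TOY_REF_PTR	0u	/* pointer is live, counters in use */
#define TOY_REF_DEAD	1u	/* ref has been killed */

struct toy_ref {
	unsigned *pcpu_count;	/* must be at least 4-byte aligned */
};

/* Extract the status flag from the tagged pointer. */
static uintptr_t toy_status(const struct toy_ref *ref)
{
	return (uintptr_t)ref->pcpu_count & TOY_STATUS_MASK;
}

/* Recover the real pointer by masking the status bits out. */
static unsigned *toy_counters(const struct toy_ref *ref)
{
	return (unsigned *)((uintptr_t)ref->pcpu_count & ~TOY_STATUS_MASK);
}

/*
 * Plain store, no cmpxchg loop: the caller is assumed to guarantee that
 * kill is not raced. Returns false if the ref was already dead, which is
 * the case where the kernel version emits a one-time warning.
 */
static bool toy_ref_kill(struct toy_ref *ref)
{
	bool was_alive = toy_status(ref) != TOY_REF_DEAD;

	ref->pcpu_count = (unsigned *)((uintptr_t)ref->pcpu_count | TOY_REF_DEAD);
	return was_alive;
}

/* Small self-check: returns 1 if tagging and untagging behave as expected. */
static int toy_kill_demo(void)
{
	static _Alignas(8) unsigned slots[4];
	struct toy_ref ref = { .pcpu_count = slots };

	if (toy_status(&ref) != TOY_REF_PTR)
		return 0;
	if (!toy_ref_kill(&ref))		/* first kill succeeds */
		return 0;
	if (toy_status(&ref) != TOY_REF_DEAD)
		return 0;
	if (toy_counters(&ref) != slots)	/* pointer survives the tag */
		return 0;
	if (toy_ref_kill(&ref))			/* second kill is reported */
		return 0;
	return 1;
}
```

The trade-off is the one the commit message names: without the compare-and-swap loop a racing double kill is no longer detected reliably, so correctness rests on the caller's own synchronization.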
[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: reorder the operations in cgroup_destroy_locked()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit ce835adec25190f76a26cc97f1a38aadc93a4957
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 14:49:25 2015 +0400

ms/cgroup: reorder the operations in cgroup_destroy_locked()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo <t...@kernel.org>

This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list. This will be used to make css use
percpu refcnt.

While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.
Signed-off-by: Tejun Heo <t...@kernel.org>
Acked-by: Li Zefan <lize...@huawei.com>
(cherry picked from commit 455050d23e1bfc47ca98e943ad5b2f3a9bbe45fb)
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>

Conflicts:
	kernel/cgroup.c
---
 kernel/cgroup.c | 48 ++--
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b073fba..062e0f4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4367,9 +4367,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)

 	/*
 	 * Block new css_tryget() by deactivating refcnt and mark @cgrp
-	 * removed.  This makes future css_tryget() and child creation
-	 * attempts fail thus maintaining the removal conditions verified
-	 * above.
+	 * removed.  This makes future css_tryget() attempts fail which we
+	 * guarantee to ->css_offline() callbacks.
 	 */
 	for_each_subsys(cgrp->root, ss) {
 		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -4379,6 +4378,30 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	}
 	set_bit(CGRP_REMOVED, &cgrp->flags);

+	raw_spin_lock(&release_list_lock);
+	if (!list_empty(&cgrp->release_list))
+		list_del_init(&cgrp->release_list);
+	raw_spin_unlock(&release_list_lock);
+
+	/*
+	 * Remove @cgrp directory.  The removal puts the base ref but we
+	 * aren't quite done with @cgrp yet, so hold onto it.
+	 */
+	dget(d);
+	cgroup_d_remove_dir(d);
+
+	/*
+	 * Unregister events and notify userspace.
+	 * Notify userspace about cgroup removing only after rmdir of cgroup
+	 * directory to avoid race between userspace and kernelspace.
+	 */
+	spin_lock(&cgrp->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+		list_del_init(&event->list);
+		schedule_work(&event->remove);
+	}
+	spin_unlock(&cgrp->event_list_lock);
+
 	/* tell subsystems to initate destruction */
 	for_each_subsys(cgrp->root, ss)
 		offline_css(ss, cgrp);
@@ -4393,34 +4416,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	for_each_subsys(cgrp->root, ss)
 		css_put(cgrp->subsys[ss->subsys_id]);

-	raw_spin_lock(&release_list_lock);
-	if (!list_empty(&cgrp->release_list))
-		list_del_init(&cgrp->release_list);
-	raw_spin_unlock(&release_list_lock);
-
 	/* delete this cgroup from parent->children */
 	list_del_rcu(&cgrp->sibling);
 	list_del_init(&cgrp->allcg_node);

-	dget(d);
-	cgroup_d_remove_dir(d);
 	dput(d);

 	set_bit(CGRP_RELEASABLE, &parent->flags);
 	check_for_release(parent);

-	/*
-	 * Unregister events and notify userspace.
-	 * Notify userspace about cgroup removing only after rmdir of cgroup
-	 * directory to avoid race between userspace and kernelspace.
-	 */
-	spin_lock(&cgrp->event_list_lock);
-
[Devel] [PATCH RHEL7 COMMIT] ve/devpts: Revert 2c27d20125f5
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 99a71c6ceb41b6c8256620c4db844f7395f2a2c9
Author: Cyrill Gorcunov <gorcu...@gmail.com>
Date:   Fri Aug 28 14:14:08 2015 +0400

ve/devpts: Revert 2c27d20125f5

Here we revert 2c27d20125f5 (ve/devpts: cleanup per-VE creation)
making code close to the vanilla one. We've tune devpts code a bit
though in next patch but less intrusive.

https://jira.sw.ru/browse/PSBM-34931

Signed-off-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

CC: Vladimir Davydov <vdavy...@virtuozzo.com>
CC: Andrey Vagin <ava...@virtuozzo.com>
CC: Konstantin Khorenko <khore...@virtuozzo.com>
CC: Pavel Emelyanov <xe...@virtuozzo.com>
---
 fs/devpts/inode.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 3dcd4da..be0fb74 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -402,6 +402,20 @@ fail:
 }

 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+static int test_devpts_sb(struct super_block *s, void *p)
+{
+	return get_exec_env()->devpts_sb == s;
+}
+
+static int set_devpts_sb(struct super_block *s, void *p)
+{
+	int error = set_anon_super(s, p);
+	if (!error) {
+		atomic_inc(&s->s_active);
+		get_exec_env()->devpts_sb = s;
+	}
+	return error;
+}

 /*
  * devpts_mount()
@@ -436,7 +450,6 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	int error;
 	struct pts_mount_opts opts;
 	struct super_block *s;
-	struct dentry *root;

 	error = parse_mount_options(data, PARSE_MOUNT, &opts);
 	if (error)
@@ -450,29 +463,29 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EINVAL);

 	if (opts.newinstance)
-		root = mount_nodev(fs_type, flags, data, devpts_fill_super);
+		s = sget(fs_type, NULL, set_anon_super, flags, NULL);
 	else
-		root = mount_ns(fs_type, flags, data, get_exec_env(), devpts_fill_super);
+		s = sget(fs_type, test_devpts_sb, set_devpts_sb, flags, NULL);
+
+	if (IS_ERR(s))
+		return ERR_CAST(s);

-	if (IS_ERR(root))
-		return ERR_CAST(root);
+	if (!s->s_root) {
+		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+		if (error)
+			goto out_undo_sget;
+		s->s_flags |= MS_ACTIVE;
+	}

-	s = root->d_sb;
 	memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts));

 	error = mknod_ptmx(s);
 	if (error)
 		goto out_undo_sget;

-	if (!opts.newinstance) {
-		atomic_inc(&s->s_active);
-		get_exec_env()->devpts_sb = s;
-	}
-
-	return root;
+	return dget(s->s_root);

 out_undo_sget:
-	dput(root);
 	deactivate_locked_super(s);
 	return ERR_PTR(error);
 }
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 4149fa7beae723cd745672c749ed0a94f7f672a4
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 14:49:24 2015 +0400

ms/percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo <t...@kernel.org>

Implement percpu_tryget() which stops giving out references once the
percpu_ref is visible as killed. Because the refcnt is per-cpu,
different CPUs will start to see a refcnt as killed at different
points in time and tryget() may continue to succeed on subset of cpus
for a while after percpu_ref_kill() returns.

For use cases where it's necessary to know when all CPUs start to see
the refcnt as dead, percpu_ref_kill_and_confirm() is added. The new
function takes an extra argument @confirm_kill which is invoked when
the refcnt is guaranteed to be viewed as killed on all CPUs.

While this isn't the prettiest interface, it doesn't force synchronous
wait and is much safer than requiring the caller to do its own
call_rcu().
v2: Patch description rephrased to emphasize that tryget() may
    continue to succeed on some CPUs after kill() returns as suggested
    by Kent.

v3: Function comment in percpu_ref_kill_and_confirm() updated warning
    people to not depend on the implied RCU grace period from the
    confirm callback as it's an implementation detail.

Signed-off-by: Tejun Heo <t...@kernel.org>
Slightly-Grumpily-Acked-by: Kent Overstreet <koverstr...@google.com>
(cherry picked from commit dbece3a0f1ef0b19aff1cc6ed0942fec9ab98de1)
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 include/linux/percpu-refcount.h | 50 -
 lib/percpu-refcount.c           | 23 ++-
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 6d843d6..dd2a086 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -63,13 +63,30 @@ struct percpu_ref {
 	 */
 	unsigned __percpu	*pcpu_count;
 	percpu_ref_func_t	*release;
+	percpu_ref_func_t	*confirm_kill;
 	struct rcu_head		rcu;
 };

 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release);
 void percpu_ref_cancel_init(struct percpu_ref *ref);
-void percpu_ref_kill(struct percpu_ref *ref);
+void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
+				 percpu_ref_func_t *confirm_kill);
+
+/**
+ * percpu_ref_kill - drop the initial ref
+ * @ref: percpu_ref to kill
+ *
+ * Must be used to drop the initial ref on a percpu refcount; must be called
+ * precisely once before shutdown.
+ *
+ * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
+ * percpu counters and dropping the initial ref.
+ */
+static inline void percpu_ref_kill(struct percpu_ref *ref)
+{
+	return percpu_ref_kill_and_confirm(ref, NULL);
+}

 #define PCPU_STATUS_BITS	2
 #define PCPU_STATUS_MASK	((1 << PCPU_STATUS_BITS) - 1)
@@ -101,6 +118,37 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }

 /**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless it has already been killed.  Returns
+ * %true on success; %false on failure.
+ *
+ * Completion of percpu_ref_kill() in itself doesn't guarantee that tryget
+ * will fail.  For such guarantee, percpu_ref_kill_and_confirm() should be
+ * used.  After the confirm_kill callback is invoked, it's guaranteed that
+ * no new reference will be given out by percpu_ref_tryget().
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+	unsigned
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 82f6802b3f09878172024c57ed12cf2da92cccd3
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 14:49:23 2015 +0400

ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo <t...@kernel.org>

Two small changes.

* Unlike most init functions, percpu_ref_init() allocates memory and
  may fail. Let's mark it with __must_check in case the caller
  forgets.

* percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
  dereference @ref->pcpu_count, which can be misleading. The pointer
  is guaranteed to be valid and visible and can't change underneath
  the function. Drop ACCESS_ONCE().
Signed-off-by: Tejun Heo <t...@kernel.org>
(cherry picked from commit acac7883ee7bcc32476963bce7baf73d44574dd1)
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
 include/linux/percpu-refcount.h | 3 ++-
 lib/percpu-refcount.c           | 4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index b61bd6f..8146aa9 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -66,7 +66,8 @@ struct percpu_ref {
 	struct rcu_head		rcu;
 };

-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
+int __must_check percpu_ref_init(struct percpu_ref *ref,
+				 percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);

 #define PCPU_STATUS_BITS	2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9a78e55..b35eaac 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -57,12 +57,10 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
-	unsigned __percpu *pcpu_count;
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
 	unsigned count = 0;
 	int cpu;

-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
 	/* Mask out PCPU_REF_DEAD */
 	pcpu_count = (unsigned __percpu *)
 		(((unsigned long) pcpu_count) & ~PCPU_STATUS_MASK);
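The effect of marking an allocating init function as "must check" can be tried outside the kernel with the compiler attribute that such annotations are usually built on. The sketch below assumes a GCC-compatible compiler. `toy_must_check`, `toy_ref` and the `toy_*` functions are hypothetical names for illustration only, and the macro here is a local definition, not the kernel's.

```c
#include <stdlib.h>

/* Local, hypothetical equivalent of a "must check" annotation. */
#if defined(__GNUC__)
#define toy_must_check __attribute__((warn_unused_result))
#else
#define toy_must_check
#endif

struct toy_ref {
	unsigned *slots;	/* allocated by toy_ref_init() */
};

/* Allocates, so it can fail; ignoring the result draws a compiler warning. */
static int toy_must_check toy_ref_init(struct toy_ref *ref, size_t nr_slots)
{
	ref->slots = calloc(nr_slots, sizeof(*ref->slots));
	return ref->slots ? 0 : -1;
}

static void toy_ref_exit(struct toy_ref *ref)
{
	free(ref->slots);
	ref->slots = NULL;
}

/* Correct usage: the return value is consumed and acted upon. */
static int toy_setup_demo(void)
{
	struct toy_ref ref;
	int err = toy_ref_init(&ref, 4);

	if (err)
		return err;
	toy_ref_exit(&ref);
	return 0;
}
```

A bare `toy_ref_init(&ref, 4);` statement would still compile, but GCC and Clang report an unused-result warning for it, which is the reminder the patch is after.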
[Devel] [PATCH RHEL7 COMMIT] ms/percpu: implement generic percpu refcounting
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit b5ec5570459334e56491e564b567cc5bed16181e
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Fri Aug 28 14:49:21 2015 +0400

ms/percpu: implement generic percpu refcounting

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Kent Overstreet <koverstr...@google.com>

This implements a refcount with similar semantics to
atomic_get()/atomic_dec_and_test() - but percpu.

It also implements two stage shutdown, as we need it to tear down the
percpu counts. Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in "shutting down mode" and
switches back to a single atomic refcount with the appropriate
barriers (synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only
returns true once, so callers don't have to reimplement shutdown
synchronization.
[a...@linux-foundation.org: fix build]
[a...@linux-foundation.org: coding-style tweak]
Signed-off-by: Kent Overstreet <koverstr...@google.com>
Cc: Zach Brown <z...@redhat.com>
Cc: Felipe Balbi <ba...@ti.com>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Mark Fasheh <mfas...@suse.com>
Cc: Joel Becker <jl...@evilplan.org>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Asai Thambi S P <asamymuth...@micron.com>
Cc: Selvan Mani <sm...@micron.com>
Cc: Sam Bradshaw <sbrads...@micron.com>
Cc: Jeff Moyer <jmo...@redhat.com>
Cc: Al Viro <v...@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <b...@kvack.org>
Cc: Tejun Heo <t...@kernel.org>
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Christoph Lameter <c...@linux-foundation.org>
Cc: Ingo Molnar <mi...@redhat.com>
Reviewed-by: "Theodore Ts'o" <ty...@mit.edu>
Signed-off-by: Tejun Heo <t...@kernel.org>
(cherry picked from commit 215e262f2aeba378aa192da07c30770f9925a4bf)
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>

Conflicts:
	lib/Makefile
---
 include/linux/percpu-refcount.h | 122 ++
 lib/Makefile                    |   2 +-
 lib/percpu-refcount.c           | 128 
 3 files changed, 251 insertions(+), 1 deletion(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 000..24b31ef
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,122 @@
+/*
+ * Percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet <koverstr...@google.com>
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * The refcount will have a range of 0 to ((1U << 31) - 1), i.e. one bit less
+ * than an atomic_t - this is because of the way shutdown works, see
+ * percpu_ref_kill()/PCPU_COUNT_BIAS.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0.  After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspaces calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu()
[Devel] [PATCH RHEL7 COMMIT] ms/memcg: issue memory.high reclaim after refilling percpu stock
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit c315808e33a89086d0dac4624c1fa6f4fe1f8051
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 14:22:20 2015 +0400

ms/memcg: issue memory.high reclaim after refilling percpu stock

Currently, we dive into memory.high reclaim before refilling the percpu
stock. As a result, if we successfully charge a batch for a percpu stock
while exceeding memory.high, others won't be able to use it until we
finish and will probably have to reclaim themselves, which may lead to
overreclaim.

This patch therefore moves memory.high reclaim after refilling stocks.
This is how it works upstream.

I haven't seen any negative effects caused by this backport mistake, but
let's stick to the mainstream behavior anyway.

Fixes: 4038cd0e029dd ("ms/memcg: port memory.high")
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37e81d3..5f3e0ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2730,10 +2730,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	if (likely(!ret)) {
 		if (!do_swap_account)
-			goto done;
+			return CHARGE_OK;
 		ret = res_counter_charge(&memcg->memsw, csize, &fail_res);
 		if (likely(!ret))
-			goto done;
+			return CHARGE_OK;
 
 		res_counter_uncharge(&memcg->res, csize);
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
@@ -2790,21 +2790,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		return CHARGE_OOM_DIE;
 
 	return CHARGE_RETRY;
-
-done:
-	if (!(gfp_mask & __GFP_WAIT))
-		goto out;
-	/*
-	 * If the hierarchy is above the normal consumption range,
-	 * make the charging task trim their excess contribution.
-	 */
-	do {
-		if (res_counter_read_u64(&memcg->res, RES_USAGE) <= memcg->high)
-			continue;
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, false);
-	} while ((memcg = parent_mem_cgroup(memcg)));
-out:
-	return CHARGE_OK;
 }
 
 /*
@@ -2836,7 +2821,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup *memcg = NULL;
+	struct mem_cgroup *memcg = NULL, *iter;
 	int ret;
 
 	/*
@@ -2950,6 +2935,20 @@ again:
 
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+
+	/*
+	 * If the hierarchy is above the normal consumption range,
+	 * make the charging task trim their excess contribution.
+	 */
+	iter = memcg;
+	do {
+		if (!(gfp_mask & __GFP_WAIT))
+			break;
+		if (res_counter_read_u64(&iter->res, RES_USAGE) <= iter->high)
+			continue;
+		try_to_free_mem_cgroup_pages(iter, nr_pages, gfp_mask, false);
+	} while ((iter = parent_mem_cgroup(iter)));
+
 	css_put(&memcg->css);
 done:
 	*ptr = memcg;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
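The ordering described in this patch (charge a batch, hand the surplus to the percpu stock, and only then trim every ancestor that is above its high threshold) can be sketched in userspace as follows. This is a toy, single-threaded model and not the kernel code: toy_memcg, toy_stock, toy_reclaim(), toy_reclaim_high() and toy_charge() are all hypothetical names, and "reclaim" here only decrements a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a memory cgroup; every name here is hypothetical. */
struct toy_memcg {
	unsigned long usage;		/* pages currently charged */
	unsigned long high;		/* memory.high threshold, in pages */
	struct toy_memcg *parent;	/* NULL for the root */
	unsigned long reclaimed;	/* pages this model pretended to reclaim */
};

/* Toy stand-in for the percpu stock of pre-charged pages. */
struct toy_stock {
	unsigned long cached;
};

/* Pretend to reclaim up to nr_pages from one group. */
static void toy_reclaim(struct toy_memcg *cg, unsigned long nr_pages)
{
	unsigned long n = nr_pages < cg->usage ? nr_pages : cg->usage;

	cg->usage -= n;
	cg->reclaimed += n;
}

/*
 * Walk from the charged group up to the root and trim every level that
 * is above its high threshold. may_wait models gfp_mask & __GFP_WAIT.
 */
static void toy_reclaim_high(struct toy_memcg *cg, unsigned long nr_pages,
			     int may_wait)
{
	struct toy_memcg *iter = cg;

	if (!may_wait)
		return;
	do {
		if (iter->usage <= iter->high)
			continue;	/* jumps to the while condition */
		toy_reclaim(iter, nr_pages);
	} while ((iter = iter->parent) != NULL);
}

/* Charge a batch hierarchically, refill the stock, then reclaim. */
static void toy_charge(struct toy_memcg *cg, struct toy_stock *stock,
		       unsigned long nr_pages, unsigned long batch,
		       int may_wait)
{
	struct toy_memcg *iter;

	for (iter = cg; iter; iter = iter->parent)
		iter->usage += batch;

	/* The surplus becomes usable by others before any reclaim starts. */
	if (batch > nr_pages)
		stock->cached += batch - nr_pages;

	toy_reclaim_high(cg, nr_pages, may_wait);
}
```

In this model the stock is already refilled by the time toy_reclaim_high() runs, which is the property the patch restores; the kernel version additionally has to deal with concurrency and real reclaim, which the sketch ignores.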
[Devel] [PATCH RHEL7 COMMIT] ve/vznetstat: Fix potential exit race
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 9a440f22380933dd3547de7d83c553924c6ce284
Author: Cyrill Gorcunov gorcu...@virtuozzo.com
Date: Fri Aug 28 14:31:18 2015 +0400

ve/vznetstat: Fix potential exit race

When a container is exiting, another task may be operating on the
statistics, incrementing or decrementing the stat counter. This may lead
to a situation where the counter is not zero, so we don't zap the
@ve->stat member.

Fix it by testing whether the net is the last one belonging to the
container.

https://jira.sw.ru/browse/PSBM-35178

Fixes: 505f8aacf95dce27fad66c90d4e1cd64adcb5432 ("ve/vznetstat: Don't destroy statistics until explicitly asked")
Signed-off-by: Cyrill Gorcunov gorcu...@virtuozzo.com
CC: Andrey Vagin ava...@virtuozzo.com
CC: Vladimir Davydov vdavy...@virtuozzo.com
CC: Konstantin Khorenko khore...@virtuozzo.com
CC: Pavel Emelyanov xe...@virtuozzo.com
CC: Igor Sukhih i...@parallels.com
---
 kernel/ve/vznetstat/vznetstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/ve/vznetstat/vznetstat.c b/kernel/ve/vznetstat/vznetstat.c
index 9a25dea..99feafb 100644
--- a/kernel/ve/vznetstat/vznetstat.c
+++ b/kernel/ve/vznetstat/vznetstat.c
@@ -1098,7 +1098,7 @@ static void __net_exit net_exit_acct(struct net *net)
 
 	if (ve->stat) {
 		venet_acct_put_stat(ve->stat);
-		if (atomic_read(&ve->stat->users) == 0)
+		if (ve->ve_netns == net)
 			ve->stat = NULL;
 	}
 }
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: use RCU-sched insted of normal RCU
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 932bf29b63b1e7c74669a8847d7c69cc8b8ba919
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 14:49:25 2015 +0400

ms/percpu-refcount: use RCU-sched insted of normal RCU

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

percpu-refcount was incorrectly using preempt_disable/enable() for RCU
critical sections against call_rcu(). 6a24474da8 ("percpu-refcount:
consistently use plain (non-sched) RCU") fixed it by converting the
preemption operations to rcu_read_[un]lock(), citing that there isn't
any advantage in using sched-RCU over using the usual one; however,
rcu_read_[un]lock() for the preemptible RCU implementation -
CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT - are slightly more
expensive than preempt_disable/enable().

In a contrived microbench which repeats the following,

- percpu_ref_get()
- copy 32 bytes of data into percpu buffer
- percpu_ref_put()
- copy 32 bytes of data into percpu buffer

rcu_read_[un]lock() used in percpu_ref_get/put() makes it go slower by
about 15% when compared to using sched-RCU.

As the RCU critical sections are extremely short, using sched-RCU
shouldn't have any latency implications. Convert to RCU-sched.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Michal Hocko mho...@suse.cz
Cc: Rusty Russell ru...@rustcorp.com.au
(cherry picked from commit a4244454df1296e90cc961c1b636b1176ef0d9a0)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 12 ++++++------
 lib/percpu-refcount.c           |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index dd2a086..95961f0 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -105,7 +105,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -114,7 +114,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 	else
 		atomic_inc(&ref->count);
 
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 }
 
 /**
@@ -134,7 +134,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	unsigned __percpu *pcpu_count;
 	int ret = false;
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -143,7 +143,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 		ret = true;
 	}
 
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 
 	return ret;
 }
@@ -159,7 +159,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
 
-	rcu_read_lock();
+	rcu_read_lock_sched();
 
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -168,7 +168,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 	else if (unlikely(atomic_dec_and_test(&ref->count)))
 		ref->release(ref);
 
-	rcu_read_unlock();
+	rcu_read_unlock_sched();
 }
 
 #endif
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 8bf9e71..7deeb62 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -154,5 +154,5 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 		(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);
 	ref->confirm_kill = confirm_kill;
 
-	call_rcu(&ref->rcu, percpu_ref_kill_rcu);
+	call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: cosmetic updates
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d6bfd7b559fdbe649d00c272895cb26996d1ee1c
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 14:49:22 2015 +0400

ms/percpu-refcount: cosmetic updates

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

* s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have _t
  postfix for types and the type is gonna be used for a different type
  of callback too.

* Add @ARG to function comments.

* Drop unnecessary and unaligned indentation from percpu_ref_init()
  function comment.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit ac899061a93250c28562f05ad94d5c74603415bc)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8 +++++---
 lib/percpu-refcount.c           | 7 ++++---
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index abe1411..b61bd6f 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -51,7 +51,7 @@
 #include <linux/rcupdate.h>
 
 struct percpu_ref;
-typedef void (percpu_ref_release)(struct percpu_ref *);
+typedef void (percpu_ref_func_t)(struct percpu_ref *);
 
 struct percpu_ref {
 	atomic_t		count;
@@ -62,11 +62,11 @@ struct percpu_ref {
 	 * percpu_ref_kill_rcu())
 	 */
 	unsigned __percpu	*pcpu_count;
-	percpu_ref_release	*release;
+	percpu_ref_func_t	*release;
 	struct rcu_head		rcu;
 };
 
-int percpu_ref_init(struct percpu_ref *, percpu_ref_release *);
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS	2
@@ -78,6 +78,7 @@ void percpu_ref_kill(struct percpu_ref *ref);
 
 /**
  * percpu_ref_get - increment a percpu refcount
+ * @ref: percpu_ref to get
  *
  * Analagous to atomic_inc().
  */
@@ -99,6 +100,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 
 /**
  * percpu_ref_put - decrement a percpu refcount
+ * @ref: percpu_ref to put
  *
  * Decrement the refcount, and if 0, call the release function (which was passed
  * to percpu_ref_init())
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 1a17399..9a78e55 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -33,8 +33,8 @@
 
 /**
  * percpu_ref_init - initialize a percpu refcount
- * @ref:	ref to initialize
- * @release:	function which will be called when refcount hits 0
+ * @ref: percpu_ref to initialize
+ * @release: function which will be called when refcount hits 0
  *
  * Initializes the refcount in single atomic counter mode with a refcount of 1;
  * analagous to atomic_set(ref, 1).
@@ -42,7 +42,7 @@
  * Note that @release must not sleep - it may potentially be called from RCU
  * callback context by percpu_ref_kill().
  */
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release)
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 {
 	atomic_set(&ref->count, 1 + PCPU_COUNT_BIAS);
 
@@ -98,6 +98,7 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 
 /**
  * percpu_ref_kill - safely drop initial ref
+ * @ref: percpu_ref to kill
  *
  * Must be used to drop the initial ref on a percpu refcount; must be called
  * precisely once before shutdown.
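The init/get/put/kill lifecycle documented by the kerneldoc comments in this diff can be illustrated with a small userspace model. This is only a sketch under loud assumptions: it is single-threaded, has no RCU, assumes the ref is in percpu mode from init until kill, and all toy_* names and TOY_COUNT_BIAS are hypothetical stand-ins rather than the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS	4
#define TOY_COUNT_BIAS	(1U << 31)	/* models PCPU_COUNT_BIAS */

struct toy_ref {
	unsigned int count;		/* stands in for the atomic_t */
	unsigned int pcpu[TOY_NR_CPUS];	/* per-cpu deltas, allowed to wrap */
	bool percpu_mode;
	int released;			/* how many times release "ran" */
};

static void toy_ref_init(struct toy_ref *ref)
{
	int cpu;

	ref->count = 1 + TOY_COUNT_BIAS;	/* initial ref plus bias */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		ref->pcpu[cpu] = 0;
	ref->percpu_mode = true;
	ref->released = 0;
}

static void toy_ref_get(struct toy_ref *ref, int cpu)
{
	if (ref->percpu_mode)
		ref->pcpu[cpu]++;
	else
		ref->count++;
}

static void toy_ref_put(struct toy_ref *ref, int cpu)
{
	if (ref->percpu_mode)
		ref->pcpu[cpu]--;	/* no zero check in percpu mode */
	else if (--ref->count == 0)
		ref->released++;
}

/* Switch to the single counter, fold in the deltas, drop the initial ref. */
static void toy_ref_kill(struct toy_ref *ref)
{
	unsigned int sum = 0;
	int cpu;

	assert(ref->percpu_mode);	/* must be called precisely once */
	ref->percpu_mode = false;
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += ref->pcpu[cpu];	/* wraps cancel out in the sum */
	ref->count += sum - TOY_COUNT_BIAS;
	toy_ref_put(ref, 0);
}
```

A get on one CPU and a put on another leave individual per-cpu slots "negative" (wrapped), yet the unsigned sum is still the true number of outstanding references, which is why the zero check is only meaningful after toy_ref_kill().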
Re: [Devel] [RFC rh7 v5] ve/tty: vt -- Implement per VE support for console and terminals
On Fri, Aug 28, 2015 at 11:12:39AM +0300, Vladimir Davydov wrote:
> > nb: you know, moving patches from mainline (slave lock) seems to be
> > not that simple, they introduced own new lock class for that. at
> > moment i think how to modify our vtty code without mangling general
> > tty code.
>
> I'd still suggest moving EXTRA_REF logic to tty_io.c. Yes, it's going
> to hurt a little during rebases, but if we try to keep it in pty.c we
> will implicitly rely on tty_io.c internal logic (locking or ref
> counting rules), which will probably result in failures at runtime,
> which is much worse than failures at build time.

Seems so :/ I didn't find a way to hide all this things solely inside
vtty code.

	Cyrill
[Devel] [PATCH RHEL7 COMMIT] ploop: dio_fastmap() must refresh bvec_merge_data
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit fc65c834967a14d37ef23348cec6528d18b0a169
Author: Maxim Patlasov mpatla...@openvz.org
Date: Fri Aug 28 14:18:37 2015 +0400

ploop: dio_fastmap() must refresh bvec_merge_data

q->merge_bvec_fn() may override some fields of bvec_merge_data. For
example, raid0_mergeable_bvec() does so. The blessed way is to
initialize it from scratch before use -- see how __bio_add_page()
prepares bvm for calling q->merge_bvec_fn().

Signed-off-by: Maxim Patlasov mpatla...@openvz.org
Acked-by: Dmitry Monakhov dmonak...@openvz.org
---
 drivers/block/ploop/io_direct.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 793bcc5..0183b0f 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1487,7 +1487,6 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
 	struct request_queue * q;
 	struct extent_map * em;
 	int i;
-	struct bvec_merge_data bm_data;
 
 	if (orig_bio->bi_size == 0) {
 		bio->bi_vcnt = 0;
@@ -1535,19 +1534,19 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
 	bio->bi_size = 0;
 	bio->bi_vcnt = 0;
 
-	bm_data.bi_bdev = bio->bi_bdev;
-	bm_data.bi_sector = bio->bi_sector;
-	bm_data.bi_size = 0;
-	bm_data.bi_rw = bio->bi_rw;
-
 	for (i = 0; i < orig_bio->bi_vcnt; i++) {
 		struct bio_vec * bv = &bio->bi_io_vec[i];
+		struct bvec_merge_data bm_data = {
+			.bi_bdev = bio->bi_bdev,
+			.bi_sector = bio->bi_sector,
+			.bi_size = bio->bi_size,
+			.bi_rw = bio->bi_rw,
+		};
 		if (q->merge_bvec_fn(q, &bm_data, bv) < bv->bv_len) {
 			io->plo->st.fast_neg_backing++;
 			return 1;
 		}
 		bio->bi_size += bv->bv_len;
-		bm_data.bi_size = bio->bi_size;
 		bio->bi_vcnt++;
 	}
 	return 0;
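Why the merge data has to be rebuilt on every iteration can be shown with a toy callback that clobbers its argument, in the way the commit message says raid0_mergeable_bvec() may. The sketch below is a userspace illustration only: toy_merge_data, toy_merge_fn() and toy_fill() are hypothetical, and the "+1000" remapping is an arbitrary stand-in for whatever a real merge function writes back.

```c
#include <assert.h>

/* Hypothetical miniature of struct bvec_merge_data. */
struct toy_merge_data {
	unsigned long sector;
	unsigned int size;
};

/* Records the sector the callback was handed on its latest call. */
static unsigned long toy_last_seen_sector;

/*
 * Hypothetical merge callback: reports that all len bytes fit, but
 * overwrites a field of its argument as a side effect.
 */
static unsigned int toy_merge_fn(struct toy_merge_data *bvm, unsigned int len)
{
	toy_last_seen_sector = bvm->sector;
	bvm->sector += 1000;	/* clobber, e.g. remap to a member device */
	return len;
}

/*
 * Add nr segments of 512 bytes starting at sector start. With reinit set,
 * the merge data is built from scratch for every call; otherwise one
 * instance is reused and only its size is updated, as the old code did.
 * Returns the sector value the callback saw on the last iteration.
 */
static unsigned long toy_fill(unsigned long start, int nr, int reinit)
{
	struct toy_merge_data stale = { .sector = start, .size = 0 };
	unsigned int size = 0;
	int i;

	for (i = 0; i < nr; i++) {
		struct toy_merge_data fresh = { .sector = start, .size = size };
		struct toy_merge_data *bvm = reinit ? &fresh : &stale;

		if (toy_merge_fn(bvm, 512) < 512)
			break;
		size += 512;
		stale.size = size;
	}
	return toy_last_seen_sector;
}
```

With reinit the callback always sees the caller's real starting sector; without it, each call inherits whatever the previous call wrote, so the values drift further from the truth with every segment.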
[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: consistently use plain (non-sched) RCU
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 41721ced765e1156651d31c8b9deb0111340e984
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 14:49:22 2015 +0400

ms/percpu-refcount: consistently use plain (non-sched) RCU

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

percpu_ref_get/put() are using preempt_disable/enable() while
percpu_ref_kill() is using plain call_rcu() instead of
call_rcu_sched(). This is buggy as grace periods of the two may not
match. Fix it by using plain RCU in percpu_ref_get/put().

(I suggested using sched RCU in the first place but there's no actual
benefit in doing so unless we're gonna introduce different variants of
get/put to be called while preemption is already disabled, which we
definitely shouldn't.)

Signed-off-by: Tejun Heo t...@kernel.org
Reported-by: Rusty Russell ru...@rustcorp.com.au
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit 6a24474da83ea7c8b7d32f05f858b1259994067a)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 24b31ef..abe1411 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -85,7 +85,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
 
-	preempt_disable();
+	rcu_read_lock();
 
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -94,7 +94,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 	else
 		atomic_inc(&ref->count);
 
-	preempt_enable();
+	rcu_read_unlock();
 }
 
 /**
@@ -107,7 +107,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
 	unsigned __percpu *pcpu_count;
 
-	preempt_disable();
+	rcu_read_lock();
 
 	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -116,7 +116,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 	else if (unlikely(atomic_dec_and_test(&ref->count)))
 		ref->release(ref);
 
-	preempt_enable();
+	rcu_read_unlock();
 }
 
 #endif
[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: use percpu refcnt for cgroup_subsys_states
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit b1753091f010a49bcd0a89aa23306ac816302f9c
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 14:49:27 2015 +0400

ms/cgroup: use percpu refcnt for cgroup_subsys_states

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt can
be gotten and put multiple times by blkcg for each IO request. For
high-IOPS configurations which try to do as much per-cpu as possible,
the global frequent refcnting can be very expensive.

In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added percpu_ref.
css_get/tryget/put() directly map to the matching percpu_ref operations
and the deactivation logic is no longer necessary as percpu_ref already
has refcnt killing.

The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved by collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase of
destruction after all css refcnt's are confirmed to be seen as killed
on all CPUs. The previous patches already split destruction into two
phases, so percpu_ref_kill_and_confirm() can be hooked up easily.

This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.

v2: - init_cgroup_css() was calling percpu_ref_init() without checking
      the return value. This causes two problems - the obvious lack of
      error handling and percpu_ref_init() being called from
      cgroup_init_subsys() before the allocators are up, which triggers
      warnings but doesn't cause actual problems as the refcnt isn't
      used for roots anyway. Fix both by moving percpu_ref_init() to
      cgroup_create().

    - The base references were put too early by
      percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
      refs one extra time. This wasn't noticeable because css's go
      through another RCU grace period before being freed. Update
      cgroup_destroy_locked() to grab an extra reference before killing
      the refcnts. This problem was noticed by Kent.

Signed-off-by: Tejun Heo t...@kernel.org
Reviewed-by: Kent Overstreet koverstr...@google.com
Acked-by: Li Zefan lize...@huawei.com
Cc: Michal Hocko mho...@suse.cz
Cc: Mike Snitzer snit...@redhat.com
Cc: Vivek Goyal vgo...@redhat.com
Cc: Alasdair G. Kergon a...@redhat.com
Cc: Jens Axboe ax...@kernel.dk
Cc: Mikulas Patocka mpato...@redhat.com
Cc: Glauber Costa glom...@gmail.com
(cherry picked from commit d3daf28da16a30af95bfb303189a634a87606725)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
	include/linux/cgroup.h
	kernel/cgroup.c
---
 include/linux/cgroup.h | 27 +++-
 kernel/cgroup.c | 166 +++--
 2
[Devel] [PATCH RHEL7 COMMIT] ve/devtmpfs: lightweight virtualization
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 22255fb606cfd53fb98b11c62b854c0de5a4c713
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 16:10:59 2015 +0400

ve/devtmpfs: lightweight virtualization

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.

Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert "ve/devtmpfs: Create required devices on container startup"
  Revert "ve/devtmpfs: pass proper options string"
  Revert "devtmpfs: containerize it with new obj ns operation"
  Revert "fs: add data pointer to mount_ns()"
  Revert "devtmpfs: per-VE mounts introduced"
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

All this patch does is provide each VE with its own empty single tmpfs
mount, which appears on an attempt to mount devtmpfs. It's up to the
userspace to populate this fs on container start; all kernel requests
to create a device node inside a VE are ignored.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 67 +
 include/linux/ve.h | 1 +
 kernel/ve/ve.c | 4 +++
 3 files changed, 72 insertions(+)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f59b798..daf97ee 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -23,6 +23,7 @@
 #include <linux/ramfs.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/ve.h>
 #include "base.h"
 
 static struct task_struct *thread;
@@ -53,9 +54,61 @@ static int __init mount_param(char *str)
 }
 __setup("devtmpfs.mount=", mount_param);
 
+#ifdef CONFIG_VE
+static int ve_test_dev_sb(struct super_block *s, void *p)
+{
+	return get_exec_env()->dev_sb == s;
+}
+
+static int ve_set_dev_sb(struct super_block *s, void *p)
+{
+	struct ve_struct *ve = get_exec_env();
+	int error;
+
+	error = set_anon_super(s, p);
+	if (!error) {
+		BUG_ON(ve->dev_sb);
+		ve->dev_sb = s;
+		atomic_inc(&s->s_active);
+	}
+	return error;
+}
+
+static struct dentry *ve_dev_mount(struct file_system_type *fs_type, int flags,
+				   const char *dev_name, void *data)
+{
+	int (*fill_super)(struct super_block *, void *, int);
+	struct super_block *s;
+	int error;
+
+#ifdef CONFIG_TMPFS
+	fill_super = shmem_fill_super;
+#else
+	fill_super = ramfs_fill_super;
+#endif
+	s = sget(fs_type, ve_test_dev_sb, ve_set_dev_sb, flags, NULL);
+	if (IS_ERR(s))
+		return ERR_CAST(s);
+
+	if (!s->s_root) {
+		error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+		if (error) {
+			deactivate_locked_super(s);
+			return ERR_PTR(error);
+		}
+		s->s_flags |= MS_ACTIVE;
+	}
+	return dget(s->s_root);
+}
+#endif /* CONFIG_VE */
+
 static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
 		      const char *dev_name, void *data)
 {
+#ifdef CONFIG_VE
+	if (!ve_is_super(get_exec_env()))
+		return ve_dev_mount(fs_type, flags, dev_name, data);
+#endif
 #ifdef CONFIG_TMPFS
 	return mount_single(fs_type, flags, data, shmem_fill_super);
 #else
@@ -79,6 +132,16 @@ static inline int is_blockdev(struct device *dev)
 static inline int is_blockdev(struct device *dev) { return 0; }
 #endif
 
+#ifdef CONFIG_VE
+static inline int is_ve_dev(struct device *dev)
+{
+	return dev->class && dev->class->namespace == ve_namespace &&
+		ve_namespace(dev) != get_ve0();
+}
+#else
+static inline int is_ve_dev(struct device *dev) { return 0; }
+#endif
+
 int
[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use sb->s_fs_info
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 17dd96483ff558d44c98c3f8bcb04a86aca843a5
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 16:42:43 2015 +0400

ve/binfmt_misc: do not use sb->s_fs_info

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"
  Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
  Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

When we virtualized binfmt_misc, we made sb->s_fs_info store a pointer
to binfmt_misc struct. At the same time, we store a pointer to the owner
ve_struct in sb->s_ns and a pointer to the same binfmt_misc struct in
ve_struct->binfmt_misc. That said, we don't actually need to use
s_fs_info, because we can get the binfmt_misc by dereferencing
sb->s_ns->binfmt_misc.

Using sb->s_fs_info instead of sb->s_ns will allow us to revert our
patches introducing sb->s_ns.

This could be merged to 0b0dbb644794 (VE/BINFTM: virtualization).

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 7e760d2..d0cb80c 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,6 +65,8 @@ struct binfmt_misc {
 	int entry_count;
 };
 
+#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+
 /* 
  * Check if we support the binfmt
  * if we do, return the node, else NULL
@@ -541,7 +543,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 	Node *e = file_inode(file)->i_private;
 	int res = parse_command(buffer, count);
 	struct super_block *sb = file->f_path.dentry->d_sb;
-	struct binfmt_misc *bm_data = sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 
 	switch (res) {
 		case 1: clear_bit(Enabled, &e->flags);
@@ -576,7 +578,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	struct inode *inode;
 	struct dentry *root, *dentry;
 	struct super_block *sb = file->f_path.dentry->d_sb;
-	struct binfmt_misc *bm_data = sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 	int err = 0;
 
 	e = create_entry(buffer, count);
@@ -641,7 +643,7 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-	struct binfmt_misc *bm_data = file->f_dentry->d_sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(file->f_dentry->d_sb);
 	char *s = bm_data->enabled ? "enabled\n" : "disabled\n";
 
 	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
@@ -650,7 +652,7 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file * file, const char __user * buffer,
 		size_t count, loff_t *ppos)
 {
-	struct binfmt_misc *bm_data = file->f_dentry->d_sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(file->f_dentry->d_sb);
 	int res = parse_command(buffer, count);
 	struct dentry *root;
 
@@ -681,7 +683,7 @@ static const struct file_operations bm_status_operations = {
 
 static void bm_put_super(struct super_block *sb)
 {
-	struct binfmt_misc *bm_data = sb->s_fs_info;
+	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 	struct ve_struct *ve = sb->s_ns;
 
 	bm_data->enabled = 0;
@@ -723,7 +725,6 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
 	}
 
 	sb->s_op = &s_ops;
-	sb->s_fs_info = bm_data;
 
 	bm_data->enabled = 1;
 	get_ve(ve);
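The accessor pattern this patch introduces (derive the per-VE binfmt_misc pointer from the superblock's owner instead of caching a second copy in the superblock) can be tried out in userspace with stand-in types. Everything prefixed toy_ or TOY_ below is hypothetical and only mirrors the shape of the BINFMT_MISC() macro from the diff.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel structures named in the patch. */
struct toy_binfmt_misc {
	int enabled;
};

struct toy_ve {
	struct toy_binfmt_misc *binfmt_misc;
};

struct toy_super_block {
	void *s_ns;	/* points at the owning toy_ve */
};

/* Same shape as BINFMT_MISC(sb): go through the owner, not a cached copy. */
#define TOY_BINFMT_MISC(sb) \
	(((struct toy_ve *)(sb)->s_ns)->binfmt_misc)

/* Example reader in the spirit of bm_status_read(). */
static int toy_status(struct toy_super_block *sb)
{
	return TOY_BINFMT_MISC(sb)->enabled;
}
```

Because there is a single source of truth, readers such as toy_status() cannot observe a stale cached pointer; the cost is one extra dereference per access.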
[Devel] [PATCH RHEL7 COMMIT] Revert "fs: add data pointer to mount_ns()"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 8d9d5a10d874b4d9f66f1af3fdcabbe9aee396f2 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:10:58 2015 +0400 Revert fs: add data pointer to mount_ns() Patchset description: Rework devtmpfs virtualization Currently, we implement full-featured devtmpfs virtualization for VE: when a device is created in a VE namespace, we send a signal to kdevtmpfs to create the devnode on devtmpfs mount corresponding to the VE. This seems to be over-complicated: all this work can be done from userspace, because we only have a hardcoded list of devices created exclusively for VE on container start. Those are tty-related stuff and mem devices, and we only need the latter to create devtmpfs nodes. Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can be called before a VE tty device is unregistered, resulting in a KP: https://jira.sw.ru/browse/PSBM-35077 This patch therefore simplifies it. It makes the kernel only provide a single empty tmpfs mount per VE, which appears on an attempt to mount devtmpfs from inside a VE. The content of the fs is to be filled by the userspace on container start, which will be done in the scope of https://jira.sw.ru/browse/PSBM-35146 Vladimir Davydov (6): Revert ve/devtmpfs: Create required devices on container startup Revert ve/devtmpfs: pass proper options string Revert devtmpfs: containerize it with new obj ns operation Revert fs: add data pointer to mount_ns() Revert devtmpfs: per-VE mounts introduced devtmpfs: lightweight virtualization Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com === This patch description: This reverts commit 69e6ae7f750fc862c9324441130abbff2c8b528e. This is only needed for per-ns filesystems that can accept user options. There is the only such a filesystem, devtmpfs, which we made per container. 
Since devtmpfs virtualization is going to be dropped, this patch is not necessary. Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/base/devtmpfs.c | 4 ++-- fs/binfmt_misc.c| 2 +- fs/nfsd/nfsctl.c| 2 +- fs/super.c | 4 ++-- include/linux/fs.h | 2 +- ipc/mqueue.c| 2 +- net/sunrpc/rpc_pipe.c | 2 +- 7 files changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c index 349d6eb..6f4ba37 100644 --- a/drivers/base/devtmpfs.c +++ b/drivers/base/devtmpfs.c @@ -59,9 +59,9 @@ static struct dentry *dev_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { #ifdef CONFIG_TMPFS - return mount_ns(fs_type, flags, data, get_exec_env(), shmem_fill_super); + return mount_ns(fs_type, flags, data, shmem_fill_super); #else - return mount_ns(fs_type, flags, data, get_exec_env(), ramfs_fill_super); + return mount_ns(fs_type, flags, data, ramfs_fill_super); #endif } diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index 460d53f..7e760d2 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -734,7 +734,7 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent) static struct dentry *bm_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - return mount_ns(fs_type, flags, data, get_exec_env(), bm_fill_super); + return mount_ns(fs_type, flags, get_exec_env(), bm_fill_super); } static struct linux_binfmt misc_format = { diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c index 9b690c9..7411a56 100644 --- a/fs/nfsd/nfsctl.c +++ b/fs/nfsd/nfsctl.c @@ -1126,7 +1126,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent) static struct dentry *nfsd_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - return mount_ns(fs_type, flags, NULL, current-nsproxy-net_ns, nfsd_fill_super); + return mount_ns(fs_type, flags, current-nsproxy-net_ns, nfsd_fill_super); } static void nfsd_umount(struct 
super_block *sb) diff --git a/fs/super.c b/fs/super.c index 7f316e8..c9b47bf 100644 --- a/fs/super.c +++ b/fs/super.c @@ -890,11 +890,11 @@ static int ns_set_super(struct super_block *sb, void *data) } struct dentry *mount_ns(struct file_system_type *fs_type, int flags, - void *data, void *ns, int (*fill_super)(struct super_block *, void *, int)) + void *data, int (*fill_super)(struct super_block *, void *, int)) { struct super_block *sb; - sb = sget(fs_type, ns_test_super, ns_set_super, flags, ns); + sb = sget(fs_type, ns_test_super, ns_set_super, flags, data); if (IS_ERR(sb)) return
[Devel] [PATCH RHEL7 COMMIT] Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 9b72ce16b191d84da03da83d5ccec29de8854686
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 16:42:41 2015 +0400

    Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"

    Patchset description:

    zap sb->s_ns + fix memleak in binfmt_misc

    Vladimir Davydov (6):
      binfmt_misc: do not use sb->s_fs_info
      Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"
      Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
      Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"
      binfmt_misc: do not use s_ns
      binfmt_misc: destroy all nodes on ve stop

    https://jira.sw.ru/browse/PSBM-39154

    Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

    ==
    This patch description:

    This reverts commit 9e7411c5c3b53937171ef962ce7381337f125b28.

    This patch is no longer needed, because none of the mount_ns users needs
    sb->s_fs_info any more.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- fs/nfs/dns_resolve.c | 2 +- fs/nfsd/nfs4recover.c | 4 ++-- fs/super.c| 4 ++-- include/linux/fs.h| 2 -- ipc/mqueue.c | 6 +++--- net/sunrpc/clnt.c | 2 +- net/sunrpc/rpc_pipe.c | 4 ++-- 7 files changed, 11 insertions(+), 13 deletions(-) diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c index dda6202..d25f10f 100644 --- a/fs/nfs/dns_resolve.c +++ b/fs/nfs/dns_resolve.c @@ -415,7 +415,7 @@ static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event, void *ptr) { struct super_block *sb = ptr; - struct net *net = sb-s_ns; + struct net *net = sb-s_fs_info; struct nfs_net *nn = net_generic(net, nfs_net_id); struct cache_detail *cd = nn-nfs_dns_resolve; int ret = 0; diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c index c714602..4c86b18 100644 --- a/fs/nfsd/nfs4recover.c +++ b/fs/nfsd/nfs4recover.c @@ -693,7 +693,7 @@ cld_pipe_downcall(struct file *filp, const char __user *src, size_t mlen) struct cld_upcall *tmp, *cup; struct cld_msg __user *cmsg = (struct cld_msg __user *)src; uint32_t xid; - struct nfsd_net *nn = net_generic(filp-f_dentry-d_sb-s_ns, + struct nfsd_net *nn = net_generic(filp-f_dentry-d_sb-s_fs_info, nfsd_net_id); struct cld_net *cn = nn-cld_net; @@ -1353,7 +1353,7 @@ static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event, void *ptr) { struct super_block *sb = ptr; - struct net *net = sb-s_ns; + struct net *net = sb-s_fs_info; struct nfsd_net *nn = net_generic(net, nfsd_net_id); struct cld_net *cn = nn-cld_net; struct dentry *dentry; diff --git a/fs/super.c b/fs/super.c index c9b47bf..341650d 100644 --- a/fs/super.c +++ b/fs/super.c @@ -880,12 +880,12 @@ EXPORT_SYMBOL(kill_litter_super); static int ns_test_super(struct super_block *sb, void *data) { - return sb-s_ns == data; + return sb-s_fs_info == data; } static int ns_set_super(struct super_block *sb, void *data) { - sb-s_ns = data; + sb-s_fs_info = data; return set_anon_super(sb, NULL); } diff 
--git a/include/linux/fs.h b/include/linux/fs.h index 68cec28..553bca3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1457,8 +1457,6 @@ struct super_block { unsigned ints_max_links; fmode_t s_mode; - void*s_ns; /* Pointer to namespace */ - /* Granularity of c/m/atime in ns. Cannot be worse than a second */ u32s_time_gran; diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 18620cd..c508938 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -104,7 +104,7 @@ static inline struct mqueue_inode_info *MQUEUE_I(struct inode *inode) */ static inline struct ipc_namespace *__get_ns_from_inode(struct inode *inode) { - return get_ipc_ns(inode-i_sb-s_ns); + return get_ipc_ns(inode-i_sb-s_fs_info); } static struct ipc_namespace *get_ns_from_inode(struct inode *inode) @@ -407,7 +407,7 @@ static void mqueue_evict_inode(struct inode *inode) user-mq_bytes -= mq_bytes; /* * get_ns_from_inode() ensures that the -* (ipc_ns = sb-s_ns) is either a valid ipc_ns +* (ipc_ns = sb-s_fs_info) is either a valid ipc_ns * to which we now hold a reference, or it is NULL. * We can't put it here under mq_lock, though. */ @@ -1418,7 +1418,7 @@ int mq_init_ns(struct ipc_namespace *ns) void mq_clear_sbinfo(struct
[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use s_ns
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit a98a90ea907f522f1ae6ff0e1c6e78a39ade2494
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 16:42:44 2015 +0400

    ve/binfmt_misc: do not use s_ns

    Patchset description:

    zap sb->s_ns + fix memleak in binfmt_misc

    Vladimir Davydov (6):
      binfmt_misc: do not use sb->s_fs_info
      Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"
      Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
      Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"
      binfmt_misc: do not use s_ns
      binfmt_misc: destroy all nodes on ve stop

    https://jira.sw.ru/browse/PSBM-39154

    Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

    ==
    This patch description:

    Since 9e7411c5c3b5 was reverted, we must use sb->s_fs_info for storing a
    pointer to the namespace.

    This could be merged to 0b0dbb644794 ("VE/BINFTM: virtualization").

    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index d0cb80c..4487153 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,7 +65,7 @@ struct binfmt_misc {
 	int entry_count;
 };
 
-#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)
 
 /*
  * Check if we support the binfmt
@@ -684,7 +684,7 @@ static const struct file_operations bm_status_operations = {
 static void bm_put_super(struct super_block *sb)
 {
 	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
-	struct ve_struct *ve = sb->s_ns;
+	struct ve_struct *ve = sb->s_fs_info;
 
 	bm_data->enabled = 0;
 	put_ve(ve);
@@ -703,7 +703,7 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
 		[3] = {"register", &bm_register_operations, S_IWUSR},
 		/* last one */ {""}
 	};
-	struct ve_struct *ve = sb->s_ns;
+	struct ve_struct *ve = data;
 	struct binfmt_misc *bm_data = ve->binfmt_misc;
 	int err;
 
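For readers following the s_ns to s_fs_info switch, here is a minimal userspace sketch of the idea, not the kernel code itself. The struct layouts are cut-down stand-ins (only the fields named in the patch are kept), and ns_set_super()/ns_test_super() are simplified models of the fs/super.c helpers touched elsewhere in this series: the namespace pointer handed to mount_ns() ends up in sb->s_fs_info, and BINFMT_MISC() casts it back to the owning ve.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct binfmt_misc { int enabled; int entry_count; };
struct ve_struct { struct binfmt_misc *binfmt_misc; };
struct super_block { void *s_fs_info; };

/* Same shape as the macro in the patch: s_fs_info holds the owner ve. */
#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)

/* Model of ns_set_super(): remember which namespace owns this superblock. */
static int ns_set_super(struct super_block *sb, void *data)
{
	sb->s_fs_info = data;
	return 0;
}

/* Model of ns_test_super(): an existing superblock matches only its own namespace. */
static int ns_test_super(struct super_block *sb, void *data)
{
	return sb->s_fs_info == data;
}
```

In this model a second mount from the same ve finds the existing superblock through ns_test_super(), while a different ve gets no match and would receive a fresh one.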
[Devel] [PATCH RHEL7 COMMIT] Revert "devtmpfs: containerize it with new obj ns operation"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 968c8efb7981f87f8bc0616741edb6c0bc556d76 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:10:57 2015 +0400 Revert devtmpfs: containerize it with new obj ns operation Patchset description: Rework devtmpfs virtualization Currently, we implement full-featured devtmpfs virtualization for VE: when a device is created in a VE namespace, we send a signal to kdevtmpfs to create the devnode on devtmpfs mount corresponding to the VE. This seems to be over-complicated: all this work can be done from userspace, because we only have a hardcoded list of devices created exclusively for VE on container start. Those are tty-related stuff and mem devices, and we only need the latter to create devtmpfs nodes. Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can be called before a VE tty device is unregistered, resulting in a KP: https://jira.sw.ru/browse/PSBM-35077 This patch therefore simplifies it. It makes the kernel only provide a single empty tmpfs mount per VE, which appears on an attempt to mount devtmpfs from inside a VE. The content of the fs is to be filled by the userspace on container start, which will be done in the scope of https://jira.sw.ru/browse/PSBM-35146 Vladimir Davydov (6): Revert ve/devtmpfs: Create required devices on container startup Revert ve/devtmpfs: pass proper options string Revert devtmpfs: containerize it with new obj ns operation Revert fs: add data pointer to mount_ns() Revert devtmpfs: per-VE mounts introduced devtmpfs: lightweight virtualization Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com === This patch description: This reverts commit 53343c3b231ed36d973e6d3ac2ab9ad7b7c87e25. The whole point of devtmpfs is simplifying the system bootup logic. 
There is absolutely no point in virtualizing it, because on container start we create devices from a hardcoded list (these are ttys, which I'd prefer not to create at all using ptys instead, but we have to live with it for compatibility reasons for now). This means that it is enough to provide the userspace with per VE tmpfs mount called devtmpfs and teach it to make device nodes from a hardcoded list on container start instead of implementing devtmpfs virtualization in the kernel. The kernel part will be done by the following patches. Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/base/devtmpfs.c| 37 ++--- fs/sysfs/ve.c | 9 - include/linux/kobject_ns.h | 2 -- 3 files changed, 2 insertions(+), 46 deletions(-) diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c index 0448af8..349d6eb 100644 --- a/drivers/base/devtmpfs.c +++ b/drivers/base/devtmpfs.c @@ -366,46 +366,13 @@ int devtmpfs_mount(const char *mntdir) static DECLARE_COMPLETION(setup_done); -static struct path set_dev_pwd(struct device *dev) -{ - const struct kobj_ns_type_operations *ops; - struct path pwd = current-fs-pwd; - - ops = kobj_ns_ops(dev-kobj); - path_get(pwd); - - if (ops ops-devtmpfs) { - const struct path *devtmpfs_root; - - devtmpfs_root = ops-devtmpfs(dev-kobj); - BUG_ON(!devtmpfs_root); - set_fs_pwd(current-fs, devtmpfs_root); - } - return pwd; -} - -static void drop_dev_pwd(struct path *pwd) -{ - set_fs_pwd(current-fs, pwd); - path_put(pwd); -} - static int handle(const char *name, umode_t mode, kuid_t uid, kgid_t gid, struct device *dev) { - struct path pwd; - int err; - - pwd = set_dev_pwd(dev); - if (mode) - err = handle_create(name, mode, uid, gid, dev); + return handle_create(name, mode, uid, gid, dev); else - err = handle_remove(name, dev); - - /* Restore kthread pwd */ - drop_dev_pwd(pwd); - return err; + return handle_remove(name, dev); } static int devtmpfsd(void *p) diff --git a/fs/sysfs/ve.c b/fs/sysfs/ve.c index 79ad6d5..bb28a4b 100644 --- 
a/fs/sysfs/ve.c +++ b/fs/sysfs/ve.c @@ -43,21 +43,12 @@ const void *ve_namespace(struct device *dev) return (!dev-groups dev_get_drvdata(dev)) ? dev_get_drvdata(dev) : get_ve0(); } -static const struct path *ve_devtmpfs(const struct kobject *kobj) -{ - struct device *dev = container_of(kobj, struct device, kobj); - const struct ve_struct *ve = dev-class-namespace(dev); - - return ve-devtmpfs_root; -} - struct kobj_ns_type_operations ve_ns_type_operations = { .type = KOBJ_NS_TYPE_VE, .grab_current_ns =
Re: [Devel] [PATCH 2/3] ve: revise permissions to allow mount smth
On Fri, Aug 28, 2015 at 05:20:02PM +0400, Andrew Vagin wrote:
> Return back to the behavior of the upstream kernel. Currently we use mount
> namespaces and need nothing special here.
>
> Signed-off-by: Andrew Vagin ava...@openvz.org

Reviewed-by: Vladimir Davydov vdavy...@parallels.com

It's worth noting that this patch reverts commit d492bfa387237 ("ve/vfs:
allow mount/umount, pivot_root with CAP_VE_SYS_ADMIN").
[Devel] [PATCH RHEL7 COMMIT] Revert "devtmpfs: per-VE mounts introduced"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 3fd8ef28e629c3ec00144f83249628244903876d Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:10:58 2015 +0400 Revert devtmpfs: per-VE mounts introduced Patchset description: Rework devtmpfs virtualization Currently, we implement full-featured devtmpfs virtualization for VE: when a device is created in a VE namespace, we send a signal to kdevtmpfs to create the devnode on devtmpfs mount corresponding to the VE. This seems to be over-complicated: all this work can be done from userspace, because we only have a hardcoded list of devices created exclusively for VE on container start. Those are tty-related stuff and mem devices, and we only need the latter to create devtmpfs nodes. Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can be called before a VE tty device is unregistered, resulting in a KP: https://jira.sw.ru/browse/PSBM-35077 This patch therefore simplifies it. It makes the kernel only provide a single empty tmpfs mount per VE, which appears on an attempt to mount devtmpfs from inside a VE. The content of the fs is to be filled by the userspace on container start, which will be done in the scope of https://jira.sw.ru/browse/PSBM-35146 Vladimir Davydov (6): Revert ve/devtmpfs: Create required devices on container startup Revert ve/devtmpfs: pass proper options string Revert devtmpfs: containerize it with new obj ns operation Revert fs: add data pointer to mount_ns() Revert devtmpfs: per-VE mounts introduced devtmpfs: lightweight virtualization Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com === This patch description: This reverts commit e85a799b629d5e28c8931ddd9127cf18d501745c. More devtmpfs virtualization crap to drop. Will be reworked. 
Signed-off-by: Vladimir Davydov vdavy...@parallels.com Conflicts: include/linux/ve.h kernel/ve/ve.c --- drivers/base/devtmpfs.c | 28 ++-- include/linux/device.h | 4 include/linux/ve.h | 3 --- kernel/ve/ve.c | 8 4 files changed, 2 insertions(+), 41 deletions(-) diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c index 6f4ba37..f59b798 100644 --- a/drivers/base/devtmpfs.c +++ b/drivers/base/devtmpfs.c @@ -23,8 +23,6 @@ #include linux/ramfs.h #include linux/slab.h #include linux/kthread.h -#include linux/fs_struct.h -#include linux/ve.h #include base.h static struct task_struct *thread; @@ -59,9 +57,9 @@ static struct dentry *dev_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { #ifdef CONFIG_TMPFS - return mount_ns(fs_type, flags, data, shmem_fill_super); + return mount_single(fs_type, flags, data, shmem_fill_super); #else - return mount_ns(fs_type, flags, data, ramfs_fill_super); + return mount_single(fs_type, flags, data, ramfs_fill_super); #endif } @@ -387,7 +385,6 @@ static int devtmpfsd(void *p) goto out; sys_chdir(/..); /* will traverse into overmounted root */ sys_chroot(.); - get_fs_root(current-fs, get_exec_env()-devtmpfs_root); complete(setup_done); while (1) { spin_lock(req_lock); @@ -408,33 +405,12 @@ static int devtmpfsd(void *p) spin_unlock(req_lock); schedule(); } - path_put(get_exec_env()-devtmpfs_root); return 0; out: complete(setup_done); return *err; } -int ve_init_devtmpfs(void *data) -{ - struct ve_struct *ve = data; - struct vfsmount *mnt; - - mnt = kern_mount_data(dev_fs_type, ve); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); - ve-devtmpfs_root.mnt = mnt; - ve-devtmpfs_root.dentry = mnt-mnt_root; - return 0; -} - -void ve_fini_devtmpfs(void *data) -{ - struct ve_struct *ve = data; - - kern_unmount(ve-devtmpfs_root.mnt); -} - /* * Create devtmpfs instance, driver-core devices will add their device * nodes here. 
diff --git a/include/linux/device.h b/include/linux/device.h index df5152f..2c9c764 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -1005,14 +1005,10 @@ extern void put_device(struct device *dev); extern int devtmpfs_create_node(struct device *dev); extern int devtmpfs_delete_node(struct device *dev); extern int devtmpfs_mount(const char *mntdir); -extern int ve_init_devtmpfs(void *data); -extern void ve_fini_devtmpfs(void *data); #else static inline int devtmpfs_create_node(struct device *dev) { return 0; } static inline int devtmpfs_delete_node(struct device *dev) { return 0; } static inline int
[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: destroy all nodes on ve stop
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 0ea1f95684407db5892760b5a58a24003571f043 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:42:44 2015 +0400 ve/binfmt_misc: destroy all nodes on ve stop Patchset description: zap sb-s_ns + fix memleak in binfmt_misc Vladimir Davydov (6): binfmt_misc: do not use sb-s_fs_info Revert VE/VFS: use sb-s_ns member to store namespace for mount_ns() calls Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c Revert nfsd/sunrpc/mqueue: use sb-s_ns instead of data in fill_super binfmt_misc: do not use s_ns binfmt_misc: destroy all nodes on ve stop https://jira.sw.ru/browse/PSBM-39154 Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com == This patch description: Each registered binfmt_misc node pins binfmt_misc mount point, which in turn pins the owner ve. This means that if we don't clean up binfmt_misc nodes on ve stop, the mount point as well as the ve struct will leak. Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- fs/binfmt_misc.c | 28 +++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index 4487153..90c306e 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -752,16 +752,42 @@ static struct file_system_type bm_fs_type = { }; MODULE_ALIAS_FS(binfmt_misc); +static void ve_binfmt_fini(void *data) +{ + struct ve_struct *ve = data; + struct binfmt_misc *bm_data = ve-binfmt_misc; + + if (!bm_data) + return; + + /* +* XXX: Note we don't take any locks here. This is safe as long as +* nobody uses binfmt_misc outside the owner ve. 
+*/ + while (!list_empty(bm_data-entries)) + kill_node(bm_data, list_first_entry( + bm_data-entries, Node, list)); +} + +static struct ve_hook ve_binfmt_hook = { + .fini = ve_binfmt_fini, + .priority = HOOK_PRIO_DEFAULT, + .owner = THIS_MODULE, +}; + static int __init init_misc_binfmt(void) { int err = register_filesystem(bm_fs_type); - if (!err) + if (!err) { insert_binfmt(misc_format); + ve_hook_register(VE_SS_CHAIN, ve_binfmt_hook); + } return err; } static void __exit exit_misc_binfmt(void) { + ve_hook_unregister(ve_binfmt_hook); unregister_binfmt(misc_format); unregister_filesystem(bm_fs_type); } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
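The teardown loop in ve_binfmt_fini() follows a common pattern: keep removing the first entry until the list is empty, so nothing that pins the mount point is left behind. A minimal userspace sketch of that pattern follows. The list helpers and struct entry are hypothetical stand-ins written for this illustration, not the kernel's list.h or binfmt_misc Node, and free() stands in for whatever kill_node() does to an entry.

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled loosely on the kernel's. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_add_front(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_unlink(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

/* Hypothetical entry type; 'list' is the first member so the cast below is valid. */
struct entry { struct list_head list; int id; };

static struct entry *add_entry(struct list_head *entries, int id)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->id = id;
	list_add_front(&e->list, entries);
	return e;
}

/* Drain the list: unlink and free the first entry until none remain. */
static int drain_entries(struct list_head *entries)
{
	int freed = 0;

	while (!list_is_empty(entries)) {
		struct entry *e = (struct entry *)entries->next;

		list_unlink(&e->list);
		free(e);
		freed++;
	}
	return freed;
}
```

As the XXX comment in the patch says, the real loop takes no locks, and the sketch shares that assumption: it is only safe while nothing else touches the list concurrently.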
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/devtmpfs: Create required devices on container startup"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 0cdfb581770d883cea99f30e49e3de1583ab6fc1 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:10:56 2015 +0400 Revert ve/devtmpfs: Create required devices on container startup Patchset description: Rework devtmpfs virtualization Currently, we implement full-featured devtmpfs virtualization for VE: when a device is created in a VE namespace, we send a signal to kdevtmpfs to create the devnode on devtmpfs mount corresponding to the VE. This seems to be over-complicated: all this work can be done from userspace, because we only have a hardcoded list of devices created exclusively for VE on container start. Those are tty-related stuff and mem devices, and we only need the latter to create devtmpfs nodes. Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can be called before a VE tty device is unregistered, resulting in a KP: https://jira.sw.ru/browse/PSBM-35077 This patch therefore simplifies it. It makes the kernel only provide a single empty tmpfs mount per VE, which appears on an attempt to mount devtmpfs from inside a VE. The content of the fs is to be filled by the userspace on container start, which will be done in the scope of https://jira.sw.ru/browse/PSBM-35146 Vladimir Davydov (6): Revert ve/devtmpfs: Create required devices on container startup Revert ve/devtmpfs: pass proper options string Revert devtmpfs: containerize it with new obj ns operation Revert fs: add data pointer to mount_ns() Revert devtmpfs: per-VE mounts introduced devtmpfs: lightweight virtualization Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com === This patch description: This reverts commit 5cd1d17ff1b6a8f476ab6f4cd0a6830fbffe43f2. We don't actually need separate null, zero, and other mem class devices inside a VE. 
The patch being reverted added them merely for kdevtmpfs to create nodes for this devices under /dev. This work can and should be done by vzctl on container start, so drop this patch. Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/char/mem.c | 20 --- kernel/ve/ve.c | 56 -- 2 files changed, 76 deletions(-) diff --git a/drivers/char/mem.c b/drivers/char/mem.c index c486c83..a3653f7 100644 --- a/drivers/char/mem.c +++ b/drivers/char/mem.c @@ -30,7 +30,6 @@ #include linux/io.h #include linux/aio.h #include linux/security.h -#include linux/ve.h #include asm/uaccess.h @@ -924,20 +923,7 @@ static char *mem_devnode(struct device *dev, umode_t *mode) return NULL; } -#ifdef CONFIG_VE -static struct class mem_class_base = { - .name = mem, - .devnode= mem_devnode, - .ns_type= ve_ns_type_operations, - .namespace = ve_namespace, - .owner = THIS_MODULE, -}; - -struct class *mem_class = mem_class_base; -EXPORT_SYMBOL(mem_class); -#else static struct class *mem_class; -#endif static int __init chr_dev_init(void) { @@ -951,17 +937,11 @@ static int __init chr_dev_init(void) if (register_chrdev(MEM_MAJOR, mem, memory_fops)) printk(unable to get major %d for memory devs\n, MEM_MAJOR); -#ifdef CONFIG_VE - err = class_register(mem_class_base); - if (err) - return err; -#else mem_class = class_create(THIS_MODULE, mem); if (IS_ERR(mem_class)) return PTR_ERR(mem_class); mem_class-devnode = mem_devnode; -#endif for (minor = 1; minor ARRAY_SIZE(devlist); minor++) { if (!devlist[minor].name) continue; diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 4cd1f8b..cdbb342 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -413,55 +413,6 @@ static void ve_drop_context(struct ve_struct *ve) ve-init_cred = NULL; } -static const struct { - unsigned intminor; - char*name; -} ve_mem_class_devices[] = { - {3, null}, - {5, zero}, - {7, full}, - {8, random}, - {9, urandom}, -}; - -extern struct class *mem_class; - -static int ve_init_mem_class(struct ve_struct *ve) -{ - struct device 
*dev; - dev_t devt; - size_t i; - - for (i = 0; i ARRAY_SIZE(ve_mem_class_devices); i++) { - devt = MKDEV(MEM_MAJOR, ve_mem_class_devices[i].minor); - dev = device_create(mem_class, NULL, devt, - ve, ve_mem_class_devices[i].name); - if (IS_ERR(dev)) { - pr_err(Can't create %s (%d)\n, - ve_mem_class_devices[i].name, -
Re: [Devel] [PATCH 1/3] cred: add ve_capable to check capabilities relative to the current VE
On Fri, Aug 28, 2015 at 05:20:01PM +0400, Andrew Vagin wrote:
> +bool ve_capable(int cap)
> +{
> +	return ns_capable(get_exec_env()->init_cred->user_ns, cap);
> +}

init_cred is set in ve_grab_context, which means that if a task
occasionally uses ve_capable() before writing START to ve.state, the
kernel will panic. Please add a sanity check, which will make
ve_capable() fall back on capable() if init_cred is not available.
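The requested sanity check could look roughly like the userspace model below. This is only a sketch of the fallback logic under stated assumptions: the struct layouts are cut-down stand-ins, model_capable() and model_ns_capable() are hypothetical substitutes for the kernel's capable() and ns_capable(), and the ve is passed in explicitly instead of coming from get_exec_env(). It says nothing about the locking or RCU rules a real kernel version would need.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the fields used in the thread are modelled. */
struct user_namespace { bool grants_cap; };
struct cred { struct user_namespace *user_ns; };
struct ve_struct { const struct cred *init_cred; };

enum { CAP_SYS_ADMIN_MODEL = 21 };

/* Models the answer the global capable() check would give. */
static bool global_cap_granted = true;

static bool model_capable(int cap)
{
	(void)cap;
	return global_cap_granted;
}

static bool model_ns_capable(struct user_namespace *ns, int cap)
{
	(void)cap;
	return ns->grants_cap;
}

/*
 * ve_capable() with the sanity check: before the VE is started,
 * init_cred is NULL, so fall back on the global check instead of
 * dereferencing it.
 */
static bool ve_capable_checked(const struct ve_struct *ve, int cap)
{
	const struct cred *cred = ve->init_cred;

	if (!cred)
		return model_capable(cap);
	return model_ns_capable(cred->user_ns, cap);
}
```

With a ve whose init_cred is still NULL the function returns whatever the global check says; once init_cred is set, the answer comes from the VE's user namespace.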
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/devtmpfs: pass proper options string"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 0ffbb29c45f5ee709f4fa5dfa52f883cbe4a70f1 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:10:57 2015 +0400 Revert ve/devtmpfs: pass proper options string Patchset description: Rework devtmpfs virtualization Currently, we implement full-featured devtmpfs virtualization for VE: when a device is created in a VE namespace, we send a signal to kdevtmpfs to create the devnode on devtmpfs mount corresponding to the VE. This seems to be over-complicated: all this work can be done from userspace, because we only have a hardcoded list of devices created exclusively for VE on container start. Those are tty-related stuff and mem devices, and we only need the latter to create devtmpfs nodes. Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can be called before a VE tty device is unregistered, resulting in a KP: https://jira.sw.ru/browse/PSBM-35077 This patch therefore simplifies it. It makes the kernel only provide a single empty tmpfs mount per VE, which appears on an attempt to mount devtmpfs from inside a VE. The content of the fs is to be filled by the userspace on container start, which will be done in the scope of https://jira.sw.ru/browse/PSBM-35146 Vladimir Davydov (6): Revert ve/devtmpfs: Create required devices on container startup Revert ve/devtmpfs: pass proper options string Revert devtmpfs: containerize it with new obj ns operation Revert fs: add data pointer to mount_ns() Revert devtmpfs: per-VE mounts introduced devtmpfs: lightweight virtualization Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com === This patch description: This reverts commit 1c6719b8aa075de4c9528811839d5f2595ef2994. This is related to devtmpfs virtualization, which I'm going to drop. 
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c28e42c..0448af8 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -451,10 +451,9 @@ out:
 int ve_init_devtmpfs(void *data)
 {
 	struct ve_struct *ve = data;
-	char opts[] = "mode=0755";
 	struct vfsmount *mnt;
 
-	mnt = kern_mount_data(&dev_fs_type, opts);
+	mnt = kern_mount_data(&dev_fs_type, ve);
 	if (IS_ERR(mnt))
 		return PTR_ERR(mnt);
 	ve->devtmpfs_root.mnt = mnt;
[Devel] [PATCH 2/3] ve: revise permissions to allow mount smth
Return back to the behavior of the upstream kernel. Currently we use mount
namespaces and need nothing special here.

Signed-off-by: Andrew Vagin ava...@openvz.org
---
 fs/namespace.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 593b262..77a1ede 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1306,9 +1306,7 @@ static int do_umount(struct mount *mnt, int flags)
  */
 static inline bool may_mount(void)
 {
-	return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN) ||
-		nsown_capable(CAP_SYS_ADMIN) ||
-		nsown_capable(CAP_VE_SYS_ADMIN);
+	return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
 }
 
 /*
-- 
1.7.1
[Devel] [PATCH RHEL7 COMMIT] Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit d0856fdc15e0b49540c454b42a11ddf2af70cda6 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 16:42:43 2015 +0400 Revert nfsd/sunrpc/mqueue: use sb-s_ns instead of data in fill_super Patchset description: zap sb-s_ns + fix memleak in binfmt_misc Vladimir Davydov (6): binfmt_misc: do not use sb-s_fs_info Revert VE/VFS: use sb-s_ns member to store namespace for mount_ns() calls Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c Revert nfsd/sunrpc/mqueue: use sb-s_ns instead of data in fill_super binfmt_misc: do not use s_ns binfmt_misc: destroy all nodes on ve stop https://jira.sw.ru/browse/PSBM-39154 Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com == This patch description: This reverts commit 610d54ccee1af63b1b361d18ec4ee9fa5230dea8. Since commit 9e7411c5c3b5 was reverted, this one is no longer needed either. 
Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- fs/nfsd/nfsctl.c | 2 +- ipc/mqueue.c | 2 +- net/sunrpc/rpc_pipe.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c index 7411a56..048d61d 100644 --- a/fs/nfsd/nfsctl.c +++ b/fs/nfsd/nfsctl.c @@ -1113,7 +1113,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent) #endif /* last one */ {} }; - struct net *net = sb-s_ns; + struct net *net = data; int ret; ret = simple_fill_super(sb, 0x6e667364, nfsd_files); diff --git a/ipc/mqueue.c b/ipc/mqueue.c index c508938..6a8f37d 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -309,7 +309,7 @@ err: static int mqueue_fill_super(struct super_block *sb, void *data, int silent) { struct inode *inode; - struct ipc_namespace *ns = sb-s_ns; + struct ipc_namespace *ns = data; sb-s_blocksize = PAGE_CACHE_SIZE; sb-s_blocksize_bits = PAGE_CACHE_SHIFT; diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c index b8f6185..79681e5 100644 --- a/net/sunrpc/rpc_pipe.c +++ b/net/sunrpc/rpc_pipe.c @@ -1395,7 +1395,7 @@ rpc_fill_super(struct super_block *sb, void *data, int silent) { struct inode *inode; struct dentry *root, *gssd_dentry; - struct net *net = sb-s_ns; + struct net *net = data; struct sunrpc_net *sn = net_generic(net, sunrpc_net_id); int err; ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 1/3] cred: add ve_capable to check capabilities relative to the current VE
We want to allow a few operations in VE. Currently we use nsown_capable, but it's wrong, because in this case we allow these operations in any user namespace. Signed-off-by: Andrew Vagin ava...@openvz.org --- fs/autofs4/root.c |6 ++ fs/ioprio.c|2 +- fs/namei.c |2 +- include/linux/capability.h |1 + kernel/capability.c| 15 +++ kernel/printk.c|5 ++--- net/ipv6/sit.c |2 +- net/netfilter/nf_sockopt.c |2 +- security/commoncap.c |4 ++-- security/device_cgroup.c |4 ++-- 10 files changed, 28 insertions(+), 15 deletions(-) diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c index 68e3edb..1462d8b 100644 --- a/fs/autofs4/root.c +++ b/fs/autofs4/root.c @@ -588,8 +588,7 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry) struct autofs_info *p_ino; /* This allows root to remove symlinks */ - if (!autofs4_oz_mode(sbi) !capable(CAP_SYS_ADMIN) - !capable(CAP_VE_SYS_ADMIN)) + if (!autofs4_oz_mode(sbi) !ve_capable(CAP_SYS_ADMIN)) return -EPERM; if (atomic_dec_and_test(ino-count)) { @@ -837,8 +836,7 @@ static int autofs4_root_ioctl_unlocked(struct inode *inode, struct file *filp, _IOC_NR(cmd) - _IOC_NR(AUTOFS_IOC_FIRST) = AUTOFS_IOC_COUNT) return -ENOTTY; - if (!autofs4_oz_mode(sbi) !capable(CAP_SYS_ADMIN) - !capable(CAP_VE_SYS_ADMIN)) + if (!autofs4_oz_mode(sbi) !ve_capable(CAP_SYS_ADMIN)) return -EPERM; switch(cmd) { diff --git a/fs/ioprio.c b/fs/ioprio.c index c876fad..f9d9187 100644 --- a/fs/ioprio.c +++ b/fs/ioprio.c @@ -75,7 +75,7 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) switch (class) { case IOPRIO_CLASS_RT: - if (!capable(CAP_VE_ADMIN)) + if (!ve_capable(CAP_SYS_ADMIN)) return -EPERM; class = IOPRIO_CLASS_BE; data = 0; diff --git a/fs/namei.c b/fs/namei.c index 8e29a44..e7d9f54 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3397,7 +3397,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) if (error) return error; - if ((S_ISCHR(mode) || S_ISBLK(mode)) !nsown_capable(CAP_MKNOD)) + if ((S_ISCHR(mode) 
|| S_ISBLK(mode)) !ve_capable(CAP_MKNOD)) return -EPERM; if (!dir-i_op-mknod) diff --git a/include/linux/capability.h b/include/linux/capability.h index 2b77384..b1131e3 100644 --- a/include/linux/capability.h +++ b/include/linux/capability.h @@ -217,6 +217,7 @@ extern bool has_ns_capability_noaudit(struct task_struct *t, extern bool capable(int cap); extern bool ns_capable(struct user_namespace *ns, int cap); extern bool nsown_capable(int cap); +extern bool ve_capable(int cap); extern bool inode_capable(const struct inode *inode, int cap); extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap); diff --git a/kernel/capability.c b/kernel/capability.c index 0a843d5..e409594 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -16,6 +16,7 @@ #include linux/pid_namespace.h #include linux/user_namespace.h #include asm/uaccess.h +#include linux/ve.h /* * Leveraged for setting/resetting capabilities @@ -396,6 +397,20 @@ bool ns_capable(struct user_namespace *ns, int cap) } EXPORT_SYMBOL(ns_capable); +#if CONFIG_VE +bool ve_capable(int cap) +{ + return ns_capable(get_exec_env()-init_cred-user_ns, cap); +} +#else +bool ve_capable(int cap) +{ + return capable(cap); +} +#endif + +EXPORT_SYMBOL_GPL(ve_capable); + /** * file_ns_capable - Determine if the file's opener had a capability in effect * @file: The file we want to check diff --git a/kernel/printk.c b/kernel/printk.c index 44b3783..91766fc 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -468,14 +468,13 @@ static int check_syslog_permissions(int type, bool from_file) return 0; if (syslog_action_restricted(type)) { - if (nsown_capable(CAP_SYSLOG)) + if (ve_capable(CAP_SYSLOG)) return 0; /* * For historical reasons, accept CAP_SYS_ADMIN too, with * a warning. 
*/ - if (nsown_capable(CAP_SYS_ADMIN) || - nsown_capable(CAP_VE_ADMIN)) { + if (ve_capable(CAP_SYS_ADMIN)) { pr_warn_once(%s (%d): Attempt to access syslog with CAP_SYS_ADMIN but no CAP_SYSLOG (deprecated).\n, diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 8f4c52d..0cbb2b2 100644 ---
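The reasoning behind ve_capable() in the patch above (a check against the current task's own user namespace passes in any nested namespace, while a check against the VE's user namespace does not) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, ve_user_ns stands in for get_exec_env()->init_cred->user_ns, and the namespace walk is a simplified rendering of how ns_capable() is understood to resolve a capability; it is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability numbers, unrelated to the real CAP_* values. */
enum { TOY_CAP_MKNOD = 0, TOY_CAP_SYSLOG = 1 };

/* Hypothetical, simplified stand-in for struct user_namespace. */
struct toy_user_ns {
	struct toy_user_ns *parent;	/* NULL for the initial namespace */
	unsigned int owner_uid;		/* uid in the parent that created it */
};

/* Hypothetical, simplified stand-in for a task and its credentials. */
struct toy_task {
	struct toy_user_ns *user_ns;	/* namespace the task's creds live in */
	struct toy_user_ns *ve_user_ns;	/* user namespace of the task's VE */
	unsigned int euid;
	unsigned long caps;		/* bitmask of effective capabilities */
};

/* Does the task hold 'cap' with respect to the 'target' namespace? */
static bool toy_ns_capable(const struct toy_task *t,
			   const struct toy_user_ns *target, int cap)
{
	for (;;) {
		/* Same namespace: the effective set decides. */
		if (target == t->user_ns)
			return (t->caps >> cap) & 1UL;
		/* Ran out of ancestors without meeting the task's namespace. */
		if (target->parent == NULL)
			return false;
		/* The creator of a namespace has all caps over it. */
		if (target->parent == t->user_ns &&
		    target->owner_uid == t->euid)
			return true;
		target = target->parent;
	}
}

/* Model of nsown_capable(): check against the task's own namespace. */
static bool toy_nsown_capable(const struct toy_task *t, int cap)
{
	return toy_ns_capable(t, t->user_ns, cap);
}

/* Model of ve_capable(): check against the VE's user namespace. */
static bool toy_ve_capable(const struct toy_task *t, int cap)
{
	return toy_ns_capable(t, t->ve_user_ns, cap);
}

void ve_capable_demo(void)
{
	struct toy_user_ns host = { NULL, 0 };
	struct toy_user_ns ve = { &host, 0 };
	struct toy_user_ns nested = { &ve, 1000 };
	/* root of the container, living in the VE's user namespace */
	struct toy_task ct_root = { &ve, &ve, 0, 1UL << TOY_CAP_MKNOD };
	/* "root" of a user namespace created by uid 1000 inside the VE */
	struct toy_task nested_root = { &nested, &ve, 0, 1UL << TOY_CAP_MKNOD };

	assert(toy_nsown_capable(&ct_root, TOY_CAP_MKNOD));
	assert(toy_ve_capable(&ct_root, TOY_CAP_MKNOD));
	/* The own-namespace check passes in the nested namespace... */
	assert(toy_nsown_capable(&nested_root, TOY_CAP_MKNOD));
	/* ...while the VE-relative check does not. */
	assert(!toy_ve_capable(&nested_root, TOY_CAP_MKNOD));
}
```

In this model an unprivileged container user who creates a user namespace becomes "root" there, which is why the VE-relative check is the stricter of the two.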
[Devel] [PATCH 3/3] ve: remove ns_capable(CAP_VE.*)
If we use user namespaces, we don't need to have special capabilities. Signed-off-by: Andrew Vagin ava...@openvz.org --- fs/proc/root.c |3 +-- ipc/mqueue.c|3 +-- ipc/util.c |2 +- kernel/nsproxy.c|6 ++ kernel/sys.c|4 ++-- net/bridge/br_ioctl.c | 33 +++-- net/core/dev_ioctl.c|9 +++-- net/core/ethtool.c |3 +-- net/core/rtnetlink.c|6 ++ net/core/scm.c |2 +- net/decnet/netfilter/dn_rtmsg.c |3 +-- net/ipv4/arp.c |3 +-- net/ipv4/devinet.c |6 ++ net/ipv4/fib_frontend.c |2 +- net/ipv4/ip_sockglue.c |3 +-- net/ipv4/ip_tunnel.c|6 ++ net/ipv4/netfilter/ip_tables.c | 12 net/ipv6/addrconf.c |4 ++-- net/ipv6/ip6_tunnel.c |6 ++ net/ipv6/netfilter/ip6_tables.c | 12 net/ipv6/route.c|2 +- net/ipv6/sit.c |9 +++-- net/key/af_key.c|3 +-- net/netfilter/nfnetlink.c |3 +-- net/netlink/af_netlink.c|1 - net/netlink/genetlink.c |3 +-- net/xfrm/xfrm_user.c|3 +-- 27 files changed, 53 insertions(+), 99 deletions(-) diff --git a/fs/proc/root.c b/fs/proc/root.c index 0b7dbdb..923b398 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -121,8 +121,7 @@ static struct dentry *proc_mount(struct file_system_type *fs_type, options = data; if (!current_user_ns()-may_mount_proc || - (!ns_capable(ns-user_ns, CAP_SYS_ADMIN) -!ns_capable(ns-user_ns, CAP_VE_SYS_ADMIN))) + (!ns_capable(ns-user_ns, CAP_SYS_ADMIN))) return ERR_PTR(-EPERM); } diff --git a/ipc/mqueue.c b/ipc/mqueue.c index c5f1d3e..657814c 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -335,8 +335,7 @@ static struct dentry *mqueue_mount(struct file_system_type *fs_type, /* Don't allow mounting unless the caller has CAP_SYS_ADMIN * over the ipc namespace. 
*/ - if (!ns_capable(ns-user_ns, CAP_SYS_ADMIN) - !ns_capable(ns-user_ns, CAP_VE_SYS_ADMIN)) + if (!ns_capable(ns-user_ns, CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); data = ns; diff --git a/ipc/util.c b/ipc/util.c index 795e05f..15e09aa 100644 --- a/ipc/util.c +++ b/ipc/util.c @@ -771,7 +771,7 @@ struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns, euid = current_euid(); if (uid_eq(euid, ipcp-cuid) || uid_eq(euid, ipcp-uid) || - ns_capable(ns-user_ns, CAP_VE_SYS_ADMIN)) + ns_capable(ns-user_ns, CAP_SYS_ADMIN)) return ipcp; /* successful lookup */ err: return ERR_PTR(err); diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c index 81402a8..62aebc8 100644 --- a/kernel/nsproxy.c +++ b/kernel/nsproxy.c @@ -136,8 +136,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk) CLONE_NEWPID | CLONE_NEWNET))) return 0; - if (!ns_capable(user_ns, CAP_SYS_ADMIN) - !ns_capable(user_ns, CAP_VE_SYS_ADMIN)) { + if (!ns_capable(user_ns, CAP_SYS_ADMIN)) { err = -EPERM; goto out; } @@ -198,8 +197,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags, return 0; user_ns = new_cred ? 
new_cred-user_ns : current_user_ns(); - if (!ns_capable(user_ns, CAP_SYS_ADMIN) - !ns_capable(user_ns, CAP_VE_SYS_ADMIN)) + if (!ns_capable(user_ns, CAP_SYS_ADMIN)) return -EPERM; *new_nsp = create_new_namespaces(unshare_flags, current, user_ns, diff --git a/kernel/sys.c b/kernel/sys.c index 44f0295..a2d5644 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -1604,7 +1604,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len) int errno; char tmp[__NEW_UTS_LEN]; - if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_VE_SYS_ADMIN)) + if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_SYS_ADMIN)) return -EPERM; if (len 0 || len __NEW_UTS_LEN) @@ -1655,7 +1655,7 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len) int errno; char tmp[__NEW_UTS_LEN]; - if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_VE_SYS_ADMIN)) + if (!ns_capable(current-nsproxy-uts_ns-user_ns, CAP_SYS_ADMIN)) return -EPERM; if (len 0 || len __NEW_UTS_LEN) return -EINVAL; diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c index 45c4c22..98447b8 100644 --- a/net/bridge/br_ioctl.c
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/pty: containerize Unix98 pty drivers"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 1ff0db51541d3bf04c228025cb48de284adb78b2
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 18:31:49 2015 +0400

Revert "ve/pty: containerize Unix98 pty drivers"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, nothing needs to be done on the driver's side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is still virtualized. Since it's now system-wide, we have to add its sysfs entry to ve.default_sysfs_permissions, but since its class is virtualized, we won't be able to do it (see sysfs_perms_set -> sysfs_find_dirent). As a result, if the container relies on sysfs while creating devnodes, it will not find ptmx and therefore fall back to legacy ptys, which we are going to drop. The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

This reverts commit 79b66035f81e1c8996f2524f26af096e44e2ae4b.

Conflicts:
	kernel/ve/ve.c

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 kernel/ve/ve.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index bdfa30d..5025149 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -449,10 +449,6 @@ int ve_start_container(struct ve_struct *ve)
 	if (err)
 		goto err_legacy_pty;
 
-	err = ve_unix98_pty_init(ve);
-	if (err)
-		goto err_unix98_pty;
-
 	err = ve_tty_console_init(ve);
 	if (err)
 		goto err_tty_console;
@@ -472,8 +468,6 @@ int ve_start_container(struct ve_struct *ve)
 err_iterate:
 	ve_tty_console_fini(ve);
 err_tty_console:
-	ve_unix98_pty_fini(ve);
-err_unix98_pty:
 	ve_legacy_pty_fini(ve);
 err_legacy_pty:
 	ve_stop_umh(ve);
@@ -506,7 +500,6 @@ void ve_stop_ns(struct pid_namespace *pid_ns)
 	ve->is_running = 0;
 
 	ve_tty_console_fini(ve);
-	ve_unix98_pty_fini(ve);
 	ve_legacy_pty_fini(ve);
 	ve_stop_umh(ve);
 
[Devel] [PATCH RHEL7 COMMIT] Revert "pty: split Unix98 init routines"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit ee5a5380520330fedde1a323d5ca3cb5cad20b4f
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 18:32:03 2015 +0400

Revert "pty: split Unix98 init routines"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, nothing needs to be done on the driver's side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is still virtualized. Since it's now system-wide, we have to add its sysfs entry to ve.default_sysfs_permissions, but since its class is virtualized, we won't be able to do it (see sysfs_perms_set -> sysfs_find_dirent). As a result, if the container relies on sysfs while creating devnodes, it will not find ptmx and therefore fall back to legacy ptys, which we are going to drop. The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

This reverts commit 3aec66abd43440bc7dd4c6bbe84734adb6d82851.
Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/tty/pty.c | 100 -- 1 file changed, 15 insertions(+), 85 deletions(-) diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c index 56c0a21..bd17a45 100644 --- a/drivers/tty/pty.c +++ b/drivers/tty/pty.c @@ -820,62 +820,25 @@ err_file: static struct file_operations ptmx_fops; -static void __unix98_unregister_ptmx(void) -{ - unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1); - cdev_del(ptmx_cdev); -} - -static int __unix98_register_ptmx(void) - { - int err; - - cdev_init(ptmx_cdev, ptmx_fops); - err = cdev_add(ptmx_cdev, MKDEV(TTYAUX_MAJOR, 2), 1); - if (err) { - printk(KERN_ERR Couldn't add /dev/ptmx device); - return err; - } - err = register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, /dev/ptmx); - if (err 0) { - printk(KERN_ERR Couldn't register /dev/ptmx driver); - goto err_ptmx_register; - } - return 0; - -err_ptmx_register: - cdev_del(ptmx_cdev); - return err; -} - -static int __unix98_pty_init(struct tty_driver **ptm_driver_p, - struct tty_driver **pts_driver_p) +static void __init unix98_pty_init(void) { - struct tty_driver *ptm_driver, *pts_driver; - int err; - struct device *dev; - ptm_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV | TTY_DRIVER_DEVPTS_MEM | TTY_DRIVER_DYNAMIC_ALLOC); - if (IS_ERR(ptm_driver)) { - printk(KERN_ERR Couldn't allocate Unix98 ptm driver); - return PTR_ERR(ptm_driver); - } + if (IS_ERR(ptm_driver)) + panic(Couldn't allocate Unix98 ptm driver); pts_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV | TTY_DRIVER_DEVPTS_MEM | TTY_DRIVER_DYNAMIC_ALLOC); - if (IS_ERR(pts_driver)) { - printk(KERN_ERR Couldn't allocate Unix98 pts driver); - err = PTR_ERR(pts_driver); - goto err_pts_alloc; - } + if (IS_ERR(pts_driver)) + panic(Couldn't allocate Unix98 pts driver); + ptm_driver-driver_name = pty_master; ptm_driver-name = ptm; ptm_driver-major = 
UNIX98_PTY_MASTER_MAJOR; @@ -905,53 +868,20 @@ static int __unix98_pty_init(struct tty_driver **ptm_driver_p, pts_driver-other = ptm_driver; tty_set_operations(pts_driver, pty_unix98_ops); - err = tty_register_driver(ptm_driver); - if (err) { - printk(KERN_ERR Couldn't register Unix98 ptm driver); - goto err_ptm_register; - } - err = tty_register_driver(pts_driver); - if (err) { - printk(KERN_ERR Couldn't register Unix98 pts driver); - goto err_pts_register; - } + if (tty_register_driver(ptm_driver)) + panic(Couldn't register Unix98 ptm driver); + if (tty_register_driver(pts_driver)) + panic(Couldn't register Unix98 pts driver); /* Now create the /dev/ptmx special device */ tty_default_fops(ptmx_fops); ptmx_fops.open = ptmx_open; - err = __unix98_register_ptmx(); - if (err) - goto err_ptmx_register; - - dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR,
[Devel] [PATCH RHEL7 COMMIT] ve/radix-tree: do not account radix_tree_nodes to memcg
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d4b302e64d3523bddf4e300d0a975a7717ac784b
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 18:44:29 2015 +0400

ve/radix-tree: do not account radix_tree_nodes to memcg

There are two problems if they are accounted.

First, radix_tree_nodes allocated by tcache/tswap for storing their internal data will be accounted to the container that issued a store, which is wrong, because they can only get reclaimed on global pressure. Using __GFP_NOACCOUNT in tcache/tswap wouldn't help due to per cpu radix_tree_node preloads.

Second, workingset detection logic (see mm/workingset.c) is still not memory cgroup aware. In particular, this means that shadow radix_tree_nodes can only be reclaimed on global memory pressure although they are accounted to a memory cgroup. As a result, after reading a huge file, all the container's memory can get filled with shadow entries, which won't be reclaimed on local memory pressure, making the container unusable.

This is a quick-fix which makes radix_tree_nodes unaccountable. This is acceptable for now, because we had never accounted radix_tree_nodes before Vz7 anyway.

The true fix would be (a) making radix_tree_node preloads unaccountable (or per memory cgroup) and (b) making workingset detection logic memory cgroup aware. This should and will be done upstream first.

https://jira.sw.ru/browse/PSBM-35205

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 lib/radix-tree.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index dd3347f..4b362cb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -228,7 +228,8 @@ radix_tree_node_alloc(struct radix_tree_root *root)
 		}
 	}
 	if (ret == NULL)
-		ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		ret = kmem_cache_alloc(radix_tree_node_cachep,
+				       gfp_mask | __GFP_NOACCOUNT);
 
 	BUG_ON(radix_tree_is_indirect_ptr(ret));
 	return ret;
@@ -279,7 +280,8 @@ static int __radix_tree_preload(gfp_t gfp_mask)
 	rtp = &__get_cpu_var(radix_tree_preloads);
 	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
 		preempt_enable();
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_alloc(radix_tree_node_cachep,
+					gfp_mask | __GFP_NOACCOUNT);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0845747ebe2654d1e6e56a0425b21e599a47f4f6
Author: Mel Gorman mgor...@suse.de
Date: Fri Aug 28 18:50:29 2015 +0400

ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY

This patch fixes memcg overreclaim w/o tswap/zswap as described in:

https://jira.sw.ru/browse/PSBM-35275

Memcg overreclaim still happens if tswap or zswap is used. This case is to be investigated yet, however, this patch is definitely worth pulling.

Commit "mm: vmscan: obey proportional scanning requirements for kswapd" ensured that file/anon lists were scanned proportionally for reclaim from kswapd but ignored it for direct reclaim. The intent was to minimise direct reclaim latency but Yuanhan Liu pointed out that it substitutes one long stall for many small stalls and distorts aging for normal workloads like streaming readers/writers. Hugh Dickins pointed out that a side-effect of the same commit was that when one LRU list dropped to zero that the entirety of the other list was shrunk leading to excessive reclaim in memcgs. This patch scans the file/anon lists proportionally for direct reclaim to similarly age page whether reclaimed by kswapd or direct reclaim but takes care to abort reclaim if one LRU drops to zero after reclaiming the requested number of pages.

Based on ext4 and using the Intel VM scalability test

                                    3.15.0-rc5        3.15.0-rc5
                                    shrinker          proportion
Unit lru-file-readonce elapsed      5.3500 (  0.00%)  5.4200 ( -1.31%)
Unit lru-file-readonce time_range   0.2700 (  0.00%)  0.1400 ( 48.15%)
Unit lru-file-readonce time_stddv   0.1148 (  0.00%)  0.0536 ( 53.33%)
Unit lru-file-readtwice elapsed     8.1700 (  0.00%)  8.1700 (  0.00%)
Unit lru-file-readtwice time_range  0.4300 (  0.00%)  0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv  0.1650 (  0.00%)  0.0971 ( 41.16%)

The test cases are running multiple dd instances reading sparse files. The results are within the noise for the small test machine. The impact of the patch is more noticeable from the vmstats

                          3.15.0-rc5  3.15.0-rc5
                            shrinker  proportion
Minor Faults                   35154       36784
Major Faults                     611        1305
Swap Ins                         394        1651
Swap Outs                       4394        5891
Allocation stalls             118616       44781
Direct pages scanned         4935171     4602313
Kswapd pages scanned        15921292    16258483
Kswapd pages reclaimed      15913301    16248305
Direct pages reclaimed       4933368     4601133
Kswapd efficiency                99%         99%
Kswapd velocity           670088.047  682555.961
Direct efficiency                99%         99%
Direct velocity           207709.217  193212.133
Percentage direct scans          23%         22%
Page writes by reclaim      4858.000    6232.000
Page writes file                 464         341
Page writes anon                4394        5891

Note that there are fewer allocation stalls even though the amount of direct reclaim scanning is very approximately the same.

Signed-off-by: Mel Gorman mgor...@suse.de
Cc: Johannes Weiner han...@cmpxchg.org
Cc: Hugh Dickins hu...@google.com
Cc: Tim Chen tim.c.c...@linux.intel.com
Cc: Dave Chinner da...@fromorbit.com
Tested-by: Yuanhan Liu yuanhan@linux.intel.com
Cc: Bob Liu bob@oracle.com
Cc: Jan Kara j...@suse.cz
Cc: Rik van Riel r...@redhat.com
Cc: Al Viro v...@zeniv.linux.org.uk
Signed-off-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Linus Torvalds torva...@linux-foundation.org
(cherry picked from commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/vmscan.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b4c98f..2bb62ce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2129,13 +2129,27 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc,
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 	struct blk_plug plug;
-	bool scan_adjusted = false;
+	bool scan_adjusted;
 
 	get_scan_count(lruvec, sc, nr, lru_pages);
 
 	/* Record the original scan target for proportional adjustments
[Devel] [PATCH RHEL7 COMMIT] memcg: fix swap_max calculation for nested cgroups
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 338ce9637d706f2bf01ef9153b78953ff65c2efb
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 18:36:03 2015 +0400

memcg: fix swap_max calculation for nested cgroups

If there is a sub-memcg in a container, its swapout won't update swap_max of the container's memcg, because we don't ascend the memcg hierarchy in mem_cgroup_update_swap_max. This patch fixes this issue.

Fixes: a74376e2dde13 ("bc/memcg: show correct swap max for beancounters")
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5f3e0ac..7fc2931 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -903,12 +903,14 @@ static void mem_cgroup_update_swap_max(struct mem_cgroup *memcg)
 {
 	long long swap;
 
-	swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
-		res_counter_read_u64(&memcg->res, RES_USAGE);
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
 
-	/* This is racy, but we don't have to be absolutely precise */
-	if (swap > (long long)memcg->swap_max)
-		memcg->swap_max = swap;
+		/* This is racy, but we don't have to be absolutely precise */
+		if (swap > (long long)memcg->swap_max)
+			memcg->swap_max = swap;
+	}
 }
 
 static void mem_cgroup_inc_failcnt(struct mem_cgroup *memcg,
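The effect of ascending the hierarchy can be shown with a small userspace model of the function. This is a sketch only: struct toy_memcg and both helpers are hypothetical, the usage counters are plain fields that are assumed to be hierarchical (a parent's usage includes its children's), and no locking or res_counter API is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct mem_cgroup. */
struct toy_memcg {
	struct toy_memcg *parent;	/* NULL for the root group */
	long long memsw_usage;		/* memory + swap charged (hierarchical) */
	long long mem_usage;		/* memory charged (hierarchical) */
	long long swap_max;		/* high-water mark of swap usage */
};

/* Behaviour before the fix: only the group that swapped out is updated. */
static void toy_update_swap_max_leaf_only(struct toy_memcg *memcg)
{
	long long swap = memcg->memsw_usage - memcg->mem_usage;

	if (swap > memcg->swap_max)
		memcg->swap_max = swap;
}

/* Behaviour after the fix: every ancestor may raise its own mark too. */
static void toy_update_swap_max(struct toy_memcg *memcg)
{
	for (; memcg; memcg = memcg->parent) {
		long long swap = memcg->memsw_usage - memcg->mem_usage;

		if (swap > memcg->swap_max)
			memcg->swap_max = swap;
	}
}

void swap_max_demo(void)
{
	/* Container group with 100 units in swap, all of it from 'sub'. */
	struct toy_memcg ct = { NULL, 300, 200, 0 };
	struct toy_memcg sub = { &ct, 150, 50, 0 };

	toy_update_swap_max_leaf_only(&sub);
	assert(sub.swap_max == 100);
	assert(ct.swap_max == 0);	/* the container never sees the swapout */

	toy_update_swap_max(&sub);
	assert(sub.swap_max == 100);
	assert(ct.swap_max == 100);	/* now propagated to the container */
}
```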
[Devel] [PATCH rh7] Revert diff-writeback-throttle-writer-when-local-BDI-threshold-is-hit bits
This was brought by the initial commit 2a8b5de95918, but it is incomplete - the following hunk patching balance_dirty_pages was lost:

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 003b68e..a58795c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -546,7 +546,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <
+		if (bdi_cap_account_writeback(bdi) &&
+		    nr_reclaimable + nr_writeback <
 				(background_thresh + dirty_thresh) / 2 &&
 		    ub_dirty + ub_writeback <
 				(ub_background_thresh + ub_thresh) / 2)

I've filed a separate issue for porting it:

https://jira.sw.ru/browse/PSBM-39167

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/fs-writeback.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9cdcc28b2ee5..66586a4f32de 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -843,9 +843,6 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 {
 	unsigned long background_thresh, dirty_thresh;
 
-	if (!bdi_cap_account_writeback(bdi) && bdi->dirty_exceeded)
-		return true;
-
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 
 	if (global_page_state(NR_FILE_DIRTY) +
-- 
2.1.4
Re: [Devel] [PATCH 3/3] ve: remove ns_capable(CAP_VE.*)
On Fri, Aug 28, 2015 at 05:20:03PM +0400, Andrew Vagin wrote:
> If we use user namespaces, we don't need to have special capabilities.
>
> Signed-off-by: Andrew Vagin ava...@openvz.org

Lovely :-) Although it'd be even better if you reverted all the patches tampering with capability checks one-by-one so that it'd be easier to drop them during the next rebase. Anyway,

Reviewed-by: Vladimir Davydov vdavy...@parallels.com

A couple of notes regarding this patch set.

It seems CAP_VE_ADMIN and CAP_VE_NET_ADMIN are not used anymore. Let's drop them?

Also, you forgot to revert commit 1875887f263e ("ve: caps: ignore setting wrong caps with CAP_SETPCAP"), please do it in a separate patch.
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/pty: containerize Unix98 driver"
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit fd19fc2c70ae5da0a0902dea96213f52dc6afbfd
Author: Vladimir Davydov vdavy...@parallels.com
Date: Fri Aug 28 18:31:56 2015 +0400

Revert "ve/pty: containerize Unix98 driver"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, nothing needs to be done on the driver's side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is still virtualized. Since it's now system-wide, we have to add its sysfs entry to ve.default_sysfs_permissions, but since its class is virtualized, we won't be able to do it (see sysfs_perms_set -> sysfs_find_dirent). As a result, if the container relies on sysfs while creating devnodes, it will not find ptmx and therefore fall back to legacy ptys, which we are going to drop. The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

This reverts commit 1b2c1fe8428715c3b5ec0a94d0568b5a5c526032.
Conflicts: include/linux/ve.h Signed-off-by: Vladimir Davydov vdavy...@parallels.com --- drivers/tty/pty.c | 88 ++--- include/linux/tty.h | 6 ++-- include/linux/ve.h | 6 3 files changed, 32 insertions(+), 68 deletions(-) diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c index 7afb822..56c0a21 100644 --- a/drivers/tty/pty.c +++ b/drivers/tty/pty.c @@ -23,10 +23,15 @@ #include linux/devpts_fs.h #include linux/slab.h #include linux/mutex.h -#include linux/ve.h #include bc/misc.h +#ifdef CONFIG_UNIX98_PTYS +static struct tty_driver *ptm_driver; +static struct tty_driver *pts_driver; +static DEFINE_MUTEX(devpts_mutex); +#endif + static void pty_close(struct tty_struct *tty, struct file *filp) { BUG_ON(!tty); @@ -53,11 +58,11 @@ static void pty_close(struct tty_struct *tty, struct file *filp) if (tty-driver-subtype == PTY_TYPE_MASTER) { set_bit(TTY_OTHER_CLOSED, tty-flags); #ifdef CONFIG_UNIX98_PTYS - if (tty-driver == tty-driver-ve-ptm_driver) { - mutex_lock(tty-driver-ve-devpts_mutex); + if (tty-driver == ptm_driver) { + mutex_lock(devpts_mutex); if (tty-link-driver_data) devpts_pty_kill(tty-link-driver_data); - mutex_unlock(tty-driver-ve-devpts_mutex); + mutex_unlock(devpts_mutex); } #endif tty_unlock(tty); @@ -669,9 +674,9 @@ static struct tty_struct *pts_unix98_lookup(struct tty_driver *driver, { struct tty_struct *tty; - mutex_lock(driver-ve-devpts_mutex); + mutex_lock(devpts_mutex); tty = devpts_get_priv(pts_inode); - mutex_unlock(driver-ve-devpts_mutex); + mutex_unlock(devpts_mutex); /* Master must be open before slave */ if (!tty) return ERR_PTR(-EIO); @@ -748,7 +753,6 @@ static int ptmx_open(struct inode *inode, struct file *filp) struct inode *slave_inode; int retval; int index; - struct ve_struct *ve = (inode-i_sb-s_ns) ? : get_exec_env(); nonseekable_open(inode, filp); @@ -760,18 +764,18 @@ static int ptmx_open(struct inode *inode, struct file *filp) return retval; /* find a device that is not in use. 
*/ - mutex_lock(ve-devpts_mutex); + mutex_lock(devpts_mutex); index = devpts_new_index(inode); if (index 0) { retval = index; - mutex_unlock(ve-devpts_mutex); + mutex_unlock(devpts_mutex); goto err_file; } - mutex_unlock(ve-devpts_mutex); + mutex_unlock(devpts_mutex); mutex_lock(tty_mutex); - tty = tty_init_dev(ve-ptm_driver, index); + tty = tty_init_dev(ptm_driver, index); if (IS_ERR(tty)) { retval = PTR_ERR(tty); @@ -796,7 +800,7 @@ static int ptmx_open(struct inode *inode, struct file *filp) } tty-link-driver_data = slave_inode; - retval = ve-ptm_driver-ops-open(tty, filp); + retval = ptm_driver-ops-open(tty, filp); if (retval) goto err_release; @@ -816,22 +820,16 @@ err_file: static struct file_operations ptmx_fops; -static void __unix98_unregister_ptmx(struct ve_struct *ve) +static void __unix98_unregister_ptmx(void) { - if (!ve_is_super(ve)) - return; - unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1); cdev_del(ptmx_cdev); } -static int __unix98_register_ptmx(struct ve_struct *ve) -{ +static int __unix98_register_ptmx(void) + { int
[Devel] [PATCH 2/2] fs: allow to mount devtmpfs in a non-root userns (v2)
devtmpfs is virtualized, so it has to be secure.

v2: fix return code

Signed-off-by: Andrew Vagin ava...@openvz.org
---
 drivers/base/devtmpfs.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c28e42c..f21e292 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -58,6 +58,9 @@ __setup("devtmpfs.mount=", mount_param);
 static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
 		      const char *dev_name, void *data)
 {
+	if (get_exec_env()->init_cred->user_ns != current_user_ns())
+		return ERR_PTR(-EPERM);
+
 #ifdef CONFIG_TMPFS
 	return mount_ns(fs_type, flags, data, get_exec_env(), shmem_fill_super);
 #else
@@ -69,7 +72,7 @@ static struct file_system_type dev_fs_type = {
 	.name = "devtmpfs",
 	.mount = dev_mount,
 	.kill_sb = kill_litter_super,
-	.fs_flags = FS_VIRTUALIZED,
+	.fs_flags = FS_VIRTUALIZED | FS_USERNS_MOUNT | FS_USERNS_DEV_MOUNT,
 };
 
 #ifdef CONFIG_BLOCK
-- 
1.7.1
[Devel] [PATCH RHEL7 COMMIT] ve/pty: create ptmx device per ve namespace
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-229.7.2.vz7.6.3 -- commit 953017eb9e8237859f63d7b0a2c816b7e7e5a615 Author: Vladimir Davydov vdavy...@parallels.com Date: Fri Aug 28 18:32:16 2015 +0400 ve/pty: create ptmx device per ve namespace Patchset description: Zap Unix98 pty virtualization Unix98 ptys are already virtualized on the VFS layer, nothing needs to be done on the driver's side. We don't even have this in PCS6. The patch set makes ptmx device system-wide while its class, tty_class, is still virtualized. Since it's now system-wide, we have to add its sysfs entry to ve.default_sysfs_permissions, but since its class is virtualized, we won't be able to do it (see sysfs_perms_set - sysfs_find_dirent). As a result, if the container relies on sysfs while creating devnodes, it will not find ptmx and therefore fallback to legacy ptys, which we are going to drop. The last patch (ve/pty: create ptmx device per ve namespace) addresses this. === This patch description: After Unix98 PTY driver virtualization was reverted, we have to manually set sysfs permissions for ptmx. This, however, is currently impossible, because tty_class is still virtualized, which makes ve.sysfs_permissions ignore it (see sysfs_perms_set). This patch is a quick-fix which simply creates/destroys ptmx device in ve namespace on container start/stop. It must be dropped when commit 6022450d12653 (ve/tty: make tty_class VE-namespace aware) is reverted. 
    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index bd17a45..529046b 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -818,6 +818,32 @@ err_file:
 	return retval;
 }
 
+static int ve_unix98_pty_init(void *data)
+{
+	struct ve_struct *ve = data;
+	struct device *dev;
+
+	dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), ve, "ptmx");
+	if (IS_ERR(dev)) {
+		pr_warn("Failed to create ptmx device for ve %s: %ld\n",
+			ve->ve_name, PTR_ERR(dev));
+		return PTR_ERR(dev);
+	}
+	return 0;
+}
+
+static void ve_unix98_pty_fini(void *data)
+{
+	device_destroy_namespace(tty_class, MKDEV(TTYAUX_MAJOR, 2), data);
+}
+
+static struct ve_hook ve_unix98_pty_hook = {
+	.init = ve_unix98_pty_init,
+	.fini = ve_unix98_pty_fini,
+	.priority = HOOK_PRIO_DEFAULT,
+	.owner = THIS_MODULE,
+};
+
 static struct file_operations ptmx_fops;
 
 static void __init unix98_pty_init(void)
@@ -882,6 +908,7 @@ static void __init unix98_pty_init(void)
 	    register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, "/dev/ptmx") < 0)
 		panic("Couldn't register /dev/ptmx driver");
 	device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), NULL, "ptmx");
+	ve_hook_register(VE_SS_CHAIN, &ve_unix98_pty_hook);
 }
 
 #else
[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: never isolate more pages than necessary
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 703ed09d7ee4d9af6cec3c4970842f282176f5e0
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:50:33 2015 +0400

    ms/mm/vmscan: never isolate more pages than necessary

    Along with

      [PATCH rh7] mm: vmscan: use proportional scanning during direct
      reclaim and full scan at DEF_PRIORITY

    this should fix https://jira.sw.ru/browse/PSBM-35275

    I submitted this patch upstream (https://lkml.org/lkml/2015/8/3/404)
    and it was merged into the mmotm tree. Hopefully, it will get merged
    into Linus's tree soon.

    If transparent huge pages are enabled, we can isolate many more pages
    than we actually need to scan, because we count both single and huge
    pages equally in isolate_lru_pages().

    Since commit 5bc7b8aca942d ("mm: thp: add split tail pages to shrink
    page list in page reclaim"), we scan all the tail pages immediately
    after a huge page split (see shrink_page_list()). As a result, we can
    reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run!

    This is easy to catch on memcg reclaim with zswap enabled. The latter
    makes swapout instant so that if we happen to scan an unreferenced huge
    page we will evict both its head and tail pages immediately, which is
    likely to result in excessive reclaim.

    Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bb62ce..7beadf5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1297,7 +1297,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long scan;
 
-	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
+					!list_empty(src); scan++) {
 		struct page *page;
 		int nr_pages;
 
[Devel] [PATCH RHEL7 COMMIT] ve/net: Fix vlan NETIF_F_VIRTUAL feature initialization
The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 3e11f3abe191cb393cd8c025913e6a9b739fcabe
Author: Kirill Tkhai ktk...@odin.com
Date:   Sat Aug 29 02:30:46 2015 +0400

    ve/net: Fix vlan NETIF_F_VIRTUAL feature initialization

    vlan_setup() is called when dev's net hasn't been set yet:

    rtnl_create_link
        alloc_netdev_mqs
            dev_net_set(dev, &init_net)
            vlan_setup
                ...
                if (!ve_is_super(dev_net(dev)->owner_ve))
                    dev->features |= NETIF_F_VIRTUAL
                ...
        dev_net_set(dev, net)

    So the vlan dev never gets the NETIF_F_VIRTUAL feature, and the later
    ve_is_dev_movable() check fails.

    The patch sets the feature unconditionally, independent of dev_net().
    It is only tested later when the ve is not super anyway, and other
    devices (loopback, for example) always set it too.

    https://jira.sw.ru/browse/PSBM-35266

    Signed-off-by: Kirill Tkhai ktk...@odin.com
    Acked-by: Andrew Vagin ava...@odin.com
---
 net/8021q/vlan_dev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 80fa918..09205c3 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -794,6 +794,5 @@ void vlan_setup(struct net_device *dev)
 	dev->ethtool_ops	= &vlan_ethtool_ops;
 
 	memset(dev->broadcast, 0, ETH_ALEN);
-	if (!ve_is_super(dev_net(dev)->owner_ve))
-		dev->features |= NETIF_F_VIRTUAL;
+	dev->features |= NETIF_F_VIRTUAL;
 }