date:20150828

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_ref_cancel_init()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0873bd8f500347f34f06ddad0fbf024df91f8add
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:24 2015 +0400

ms/percpu-refcount: implement percpu_ref_cancel_init()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Normally, percpu_ref_init() initializes and percpu_ref_kill()
initiates destruction which completes asynchronously.  The
asynchronous destruction can be problematic in init failure path where
the caller wants to destroy half-constructed object - distinguishing
half-constructed objects from the usual release method can be painful
for complex objects.

This patch implements percpu_ref_cancel_init() which synchronously
destroys the percpu_ref without invoking release.  To avoid
unintentional misuses, the function requires the ref to have finished
percpu_ref_init() but never used and triggers WARN otherwise.

v2: Explain the weird name and usage restriction in the function
comment.
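
The init-failure pattern described above can be sketched in userspace C. This is only a model under stated assumptions, not the kernel implementation: toy_ref, toy_ref_init(), toy_ref_cancel_init() and widget_create() are hypothetical names, plain integers stand in for the atomic and per-CPU counters, and the cancel helper returns a flag where the kernel would trigger WARN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_NR_CPUS    4            /* assumed CPU count for the model */
#define TOY_COUNT_BIAS (1U << 31)   /* plays the role of PCPU_COUNT_BIAS */

struct toy_ref {
	unsigned int count;             /* stands in for the atomic count */
	unsigned int *pcpu_count;       /* stands in for the percpu counters */
	void (*release)(struct toy_ref *ref);
};

static int toy_ref_init(struct toy_ref *ref, void (*release)(struct toy_ref *))
{
	ref->count = 1 + TOY_COUNT_BIAS;
	ref->pcpu_count = calloc(TOY_NR_CPUS, sizeof(*ref->pcpu_count));
	if (!ref->pcpu_count)
		return -1;
	ref->release = release;
	return 0;
}

/*
 * Synchronously tear down a ref that finished init but was never used.
 * Returns false if the ref looks used (the kernel would WARN instead).
 * The release callback is deliberately not invoked.
 */
static bool toy_ref_cancel_init(struct toy_ref *ref)
{
	bool unused = ref->count == 1 + TOY_COUNT_BIAS;
	int cpu;

	if (ref->pcpu_count) {
		for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			unused = unused && ref->pcpu_count[cpu] == 0;
		free(ref->pcpu_count);
		ref->pcpu_count = NULL;
	}
	return unused;
}

struct widget {
	struct toy_ref ref;
	char *buf;
};

static int widget_release_calls;    /* counts release invocations */

static void widget_release(struct toy_ref *ref)
{
	(void)ref;
	widget_release_calls++;
}

/* fail_late simulates an error that hits after the ref was initialized. */
static struct widget *widget_create(bool fail_late)
{
	struct widget *w = calloc(1, sizeof(*w));

	if (!w)
		return NULL;
	if (toy_ref_init(&w->ref, widget_release))
		goto err_free;
	w->buf = fail_late ? NULL : malloc(64);
	if (!w->buf)
		goto err_cancel;
	return w;

err_cancel:
	toy_ref_cancel_init(&w->ref);   /* no asynchronous release to wait for */
err_free:
	free(w);                        /* safe: nothing else references w */
	return NULL;
}
```

In this model widget_create(true) returns NULL without ever calling widget_release(), which is the point of the function: the half-constructed object is freed directly in the error path.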

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit bc497bd33b2d6a6f07bc8574b4764edbd7fdffa8)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c   | 31 +++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 8146aa9..6d843d6 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,6 +68,7 @@ struct percpu_ref {
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 percpu_ref_func_t *release);
+void percpu_ref_cancel_init(struct percpu_ref *ref);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS   2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index b35eaac..ebeaac2 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -54,6 +54,37 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 	return 0;
 }
 
+/**
+ * percpu_ref_cancel_init - cancel percpu_ref_init()
+ * @ref: percpu_ref to cancel init for
+ *
+ * Once a percpu_ref is initialized, its destruction is initiated by
+ * percpu_ref_kill() and completes asynchronously, which can be painful to
+ * do when destroying a half-constructed object in init failure path.
+ *
+ * This function destroys @ref without invoking @ref->release and the
+ * memory area containing it can be freed immediately on return.  To
+ * prevent accidental misuse, it's required that @ref has finished
+ * percpu_ref_init(), whether successful or not, but never used.
+ *
+ * The weird name and usage restriction are to prevent people from using
+ * this function by mistake for normal shutdown instead of
+ * percpu_ref_kill().
+ */
+void percpu_ref_cancel_init(struct percpu_ref *ref)
+{
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
+	int cpu;
+
+	WARN_ON_ONCE(atomic_read(&ref->count) != 1 + PCPU_COUNT_BIAS);
+
+	if (pcpu_count) {
+		for_each_possible_cpu(cpu)
+			WARN_ON_ONCE(*per_cpu_ptr(pcpu_count, cpu));
+		free_percpu(ref->pcpu_count);
+	}
+}
+
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: split cgroup destruction into two steps

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 33f3496e5d1342b4497058d017261d3b3fde0fe1
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:26 2015 +0400

ms/cgroup: split cgroup destruction into two steps

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item.  The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent.  The separation is to allow using percpu refcnt for css.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name.  As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine.  A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both.  This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes.  INIT_WORK() is now performed right before queueing the work
item.
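
The two-step flow above can be sketched with a toy work queue in userspace C. Everything here is a hypothetical model (the toy_* names, the FIFO standing in for the workqueue, the boolean fields); in particular the RCU grace period between the offline and free steps is elided, so the offline step queues the free step directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_work {
	void (*fn)(struct toy_work *work);
	struct toy_work *next;
};

/* A minimal FIFO standing in for the destroy workqueue. */
static struct toy_work *toy_wq_head;
static struct toy_work **toy_wq_tail = &toy_wq_head;

static void toy_init_work(struct toy_work *work, void (*fn)(struct toy_work *))
{
	work->fn = fn;
	work->next = NULL;
}

static void toy_queue_work(struct toy_work *work)
{
	*toy_wq_tail = work;
	toy_wq_tail = &work->next;
}

/* Run one queued item; returns false when the queue is empty. */
static bool toy_run_one(void)
{
	struct toy_work *work = toy_wq_head;

	if (!work)
		return false;
	toy_wq_head = work->next;       /* dequeue before running ... */
	if (!toy_wq_head)
		toy_wq_tail = &toy_wq_head;
	work->fn(work);                 /* ... because fn may re-queue the item */
	return true;
}

struct toy_cgroup {
	bool dead;                      /* set by the first half */
	bool linked;                    /* still on the parent's list */
	bool freed;
	struct toy_work destroy_work;   /* one work item reused for both steps */
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void toy_free_fn(struct toy_work *work)
{
	struct toy_cgroup *cg =
		toy_container_of(work, struct toy_cgroup, destroy_work);

	cg->freed = true;
}

/* Second half: unlink, then hand the same work item to the free step. */
static void toy_offline_fn(struct toy_work *work)
{
	struct toy_cgroup *cg =
		toy_container_of(work, struct toy_cgroup, destroy_work);

	cg->linked = false;
	toy_init_work(&cg->destroy_work, toy_free_fn);
	toy_queue_work(&cg->destroy_work);
}

/* First half: mark dead and defer the rest to a work item. */
static void toy_destroy_locked(struct toy_cgroup *cg)
{
	cg->dead = true;
	toy_init_work(&cg->destroy_work, toy_offline_fn);
	toy_queue_work(&cg->destroy_work);
}
```

Between toy_destroy_locked() and the first toy_run_one() the object is dead but still linked, which corresponds to the window in which other cgroup operations may run. Initializing the work item right before each queueing is what lets a single item serve both steps.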

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Li Zefan lize...@huawei.com
(cherry picked from commit ea15f8ccdb430af1e8bc9b4e19a230eb4c356777)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
kernel/cgroup.c
---
 include/linux/cgroup.h |  2 +-
 kernel/cgroup.c| 25 -
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 626bc84..d34c42b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -259,7 +259,7 @@ struct cgroup {
 
/* For RCU-protected deletion */
struct rcu_head rcu_head;
-   struct work_struct free_work;
+   struct work_struct destroy_work;
 
/* List of events which userspace want to receive */
struct list_head event_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 062e0f4..6fd7038 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -213,6 +213,7 @@ static struct cgroup_name root_cgroup_name = { .name = "/" };
  */
 static int need_forkexit_callback __read_mostly;
 
+static void cgroup_offline_fn(struct work_struct *work);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
 			      struct cftype cfts[], bool is_add);
@@ -836,7 +837,7 @@ static struct cgroup_name *cgroup_alloc_name(struct dentry *dentry)
 
 static void cgroup_free_fn(struct work_struct *work)
 {
-	struct cgroup *cgrp = container_of(work, struct cgroup, free_work);
+	struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
 	struct cgroup_subsys *ss;
 
 	mutex_lock(&cgroup_mutex);
@@ -881,7 +882,8 @@ static void cgroup_free_rcu(struct rcu_head *head)
 {
 	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
 
-	queue_work(cgroup_destroy_wq, &cgrp->free_work);
+	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
+	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -1416,7 +1418,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->allcg_node);
 	INIT_LIST_HEAD(&cgrp->release_list);
 	INIT_LIST_HEAD(&cgrp->pidlists);
-	INIT_WORK(&cgrp->free_work, cgroup_free_fn);
 	mutex_init(&cgrp->pidlist_mutex);
 	INIT_LIST_HEAD(&cgrp->event_list);
 	spin_lock_init(&cgrp->event_list_lock);
@@ -4355,7 +4356,6 @@ static int

Re: [Devel] [RFC rh7 v5] ve/tty: vt -- Implement per VE support for console and terminals

2015-08-28 Thread Vladimir Davydov

On Thu, Aug 27, 2015 at 10:24:15PM +0300, Cyrill Gorcunov wrote:
> On Thu, Aug 27, 2015 at 07:11:28PM +0300, Vladimir Davydov wrote:
> > 
> > Hmm, checkpatch still has max_line_length set to 80. Could you please
> > share a link to this agreement?
> > 
> > > fine. Wonder, do you really still sit on 80 chars terminal?
> > 
> > I use a 12" laptop. With the window vertically split into two panes, I
> > have only ~80 characters per each pane.
> 
> https://lkml.org/lkml/2012/2/3/101
> 
> One of the several conversations.

Fortunately, 80 column limit is still there.

I've just checked that on my external 20" display, with the font size my
eyes are used to, I can keep two panes of 104 columns max. So if they
decided to switch to 100 column standard, even a huge 15" laptop
wouldn't save me :-/

 
> nb: you know, moving patches from mainline (slave lock) seems
> to be not that simple, they introduced own new lock class for that.
> at moment i think how to modify our vtty code without mangling
> general tty code.

I'd still suggest moving EXTRA_REF logic to tty_io.c. Yes, it's going to
hurt a little during rebases, but if we try to keep it in pty.c we will
implicitly rely on tty_io.c internal logic (locking or ref counting
rules), which will probably result in failures at runtime, which is much
worse than failures at build time.

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: Don't use silly cmpxchg()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 337bb797aa4aa5eca030d634d0a9874290511db5
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:21 2015 +0400

ms/percpu-refcount: Don't use silly cmpxchg()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Kent Overstreet koverstr...@google.com

The cmpxchg() was just to ensure the debug check didn't race, which was
a bit excessive. The caller is supposed to do the appropriate
synchronization, which means percpu_ref_kill() can just do a simple
store.
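
A minimal userspace sketch of the plain-store variant, assuming the caller already serializes kills. The toy_* names are hypothetical; the status lives in the low two bits of the counter pointer as in the patch, and a boolean return value takes the place of the WARN_ONCE() debug check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_STATUS_MASK ((uintptr_t)0x3)   /* low two pointer bits hold status */
#define TOY_REF_DEAD    ((uintptr_t)0x1)

struct toy_ref {
	/* Counter pointer; assumed at least 4-byte aligned so the low bits are free. */
	unsigned int *pcpu_count;
};

static uintptr_t toy_ref_status(const struct toy_ref *ref)
{
	return (uintptr_t)ref->pcpu_count & TOY_STATUS_MASK;
}

/* Strip the status bits to recover the usable counter pointer. */
static unsigned int *toy_ref_counters(const struct toy_ref *ref)
{
	return (unsigned int *)((uintptr_t)ref->pcpu_count & ~TOY_STATUS_MASK);
}

/*
 * Mark the ref dead with a plain store, no cmpxchg loop.
 * Returns false on a repeated kill (where the kernel code warns).
 */
static bool toy_ref_kill(struct toy_ref *ref)
{
	bool first = toy_ref_status(ref) != TOY_REF_DEAD;

	ref->pcpu_count =
		(unsigned int *)((uintptr_t)ref->pcpu_count | TOY_REF_DEAD);
	return first;
}
```

The store is only safe because exactly one caller is expected to kill the ref; the check before it exists to report misuse, not to arbitrate between racing callers.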

Signed-off-by: Kent Overstreet koverstr...@google.com
Signed-off-by: Tejun Heo t...@kernel.org
(cherry picked from commit c1ae6e9b4db00023b9caed72af49a93abad46452)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 lib/percpu-refcount.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 6f0ffd7..1a17399 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -107,22 +107,11 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
  */
 void percpu_ref_kill(struct percpu_ref *ref)
 {
-	unsigned __percpu *pcpu_count, *old, *new;
+	WARN_ONCE(REF_STATUS(ref->pcpu_count) == PCPU_REF_DEAD,
+		  "percpu_ref_kill() called more than once!\n");
 
-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
-	do {
-		if (REF_STATUS(pcpu_count) == PCPU_REF_DEAD) {
-			WARN(1, "percpu_ref_kill() called more than once!\n");
-			return;
-		}
-
-		old = pcpu_count;
-		new = (unsigned __percpu *)
-			(((unsigned long) pcpu_count)|PCPU_REF_DEAD);
-
-		pcpu_count = cmpxchg(&ref->pcpu_count, old, new);
-	} while (pcpu_count != old);
+	ref->pcpu_count = (unsigned __percpu *)
+		(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);
 
 	call_rcu(&ref->rcu, percpu_ref_kill_rcu);
 }

[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: reorder the operations in cgroup_destroy_locked()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit ce835adec25190f76a26cc97f1a38aadc93a4957
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:25 2015 +0400

ms/cgroup: reorder the operations in cgroup_destroy_locked()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list.  This will be used to make css use
percpu refcnt.

While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Li Zefan lize...@huawei.com
(cherry picked from commit 455050d23e1bfc47ca98e943ad5b2f3a9bbe45fb)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
kernel/cgroup.c
---
 kernel/cgroup.c | 48 ++--
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b073fba..062e0f4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4367,9 +4367,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 
 	/*
 	 * Block new css_tryget() by deactivating refcnt and mark @cgrp
-	 * removed.  This makes future css_tryget() and child creation
-	 * attempts fail thus maintaining the removal conditions verified
-	 * above.
+	 * removed.  This makes future css_tryget() attempts fail which we
+	 * guarantee to ->css_offline() callbacks.
 	 */
 	for_each_subsys(cgrp->root, ss) {
 		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -4379,6 +4378,30 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	}
 	set_bit(CGRP_REMOVED, &cgrp->flags);
 
+	raw_spin_lock(&release_list_lock);
+	if (!list_empty(&cgrp->release_list))
+		list_del_init(&cgrp->release_list);
+	raw_spin_unlock(&release_list_lock);
+
+	/*
+	 * Remove @cgrp directory.  The removal puts the base ref but we
+	 * aren't quite done with @cgrp yet, so hold onto it.
+	 */
+	dget(d);
+	cgroup_d_remove_dir(d);
+
+	/*
+	 * Unregister events and notify userspace.
+	 * Notify userspace about cgroup removing only after rmdir of cgroup
+	 * directory to avoid race between userspace and kernelspace.
+	 */
+	spin_lock(&cgrp->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+		list_del_init(&event->list);
+		schedule_work(&event->remove);
+	}
+	spin_unlock(&cgrp->event_list_lock);
+
 	/* tell subsystems to initate destruction */
 	for_each_subsys(cgrp->root, ss)
 		offline_css(ss, cgrp);
@@ -4393,34 +4416,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	for_each_subsys(cgrp->root, ss)
 		css_put(cgrp->subsys[ss->subsys_id]);
 
-	raw_spin_lock(&release_list_lock);
-	if (!list_empty(&cgrp->release_list))
-		list_del_init(&cgrp->release_list);
-	raw_spin_unlock(&release_list_lock);
-
 	/* delete this cgroup from parent->children */
 	list_del_rcu(&cgrp->sibling);
 	list_del_init(&cgrp->allcg_node);
 
-	dget(d);
-	cgroup_d_remove_dir(d);
 	dput(d);
 
 	set_bit(CGRP_RELEASABLE, &parent->flags);
 	check_for_release(parent);
 
-	/*
-	 * Unregister events and notify userspace.
-	 * Notify userspace about cgroup removing only after rmdir of cgroup
-	 * directory to avoid race between userspace and kernelspace.
-	 */
-	spin_lock(&cgrp->event_list_lock);
-

[Devel] [PATCH RHEL7 COMMIT] ve/devpts: Revert 2c27d20125f5

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 99a71c6ceb41b6c8256620c4db844f7395f2a2c9
Author: Cyrill Gorcunov gorcu...@gmail.com
Date:   Fri Aug 28 14:14:08 2015 +0400

ve/devpts: Revert 2c27d20125f5

Here we revert 2c27d20125f5 ("ve/devpts: cleanup per-VE creation"),
bringing the code closer to the vanilla one. We still tune the devpts code
a bit in the next patch, but less intrusively.

https://jira.sw.ru/browse/PSBM-34931

Signed-off-by: Cyrill Gorcunov gorcu...@virtuozzo.com

CC: Vladimir Davydov vdavy...@virtuozzo.com
CC: Andrey Vagin ava...@virtuozzo.com
CC: Konstantin Khorenko khore...@virtuozzo.com
CC: Pavel Emelyanov xe...@virtuozzo.com
---
 fs/devpts/inode.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 3dcd4da..be0fb74 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -402,6 +402,20 @@ fail:
 }
 
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+static int test_devpts_sb(struct super_block *s, void *p)
+{
+	return get_exec_env()->devpts_sb == s;
+}
+
+static int set_devpts_sb(struct super_block *s, void *p)
+{
+	int error = set_anon_super(s, p);
+	if (!error) {
+		atomic_inc(&s->s_active);
+		get_exec_env()->devpts_sb = s;
+	}
+	return error;
+}
 
 /*
  * devpts_mount()
@@ -436,7 +450,6 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	int error;
 	struct pts_mount_opts opts;
 	struct super_block *s;
-	struct dentry *root;
 
 	error = parse_mount_options(data, PARSE_MOUNT, &opts);
 	if (error)
@@ -450,29 +463,29 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EINVAL);
 
 	if (opts.newinstance)
-		root = mount_nodev(fs_type, flags, data, devpts_fill_super);
+		s = sget(fs_type, NULL, set_anon_super, flags, NULL);
 	else
-		root = mount_ns(fs_type, flags, data, get_exec_env(), devpts_fill_super);
+		s = sget(fs_type, test_devpts_sb, set_devpts_sb, flags, NULL);
+
+	if (IS_ERR(s))
+		return ERR_CAST(s);
 
-	if (IS_ERR(root))
-		return ERR_CAST(root);
+	if (!s->s_root) {
+		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+		if (error)
+			goto out_undo_sget;
+		s->s_flags |= MS_ACTIVE;
+	}
 
-	s = root->d_sb;
 	memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts));
 
 	error = mknod_ptmx(s);
 	if (error)
 		goto out_undo_sget;
 
-	if (!opts.newinstance) {
-		atomic_inc(&s->s_active);
-		get_exec_env()->devpts_sb = s;
-	}
-
-	return root;
+	return dget(s->s_root);
 
 out_undo_sget:
-	dput(root);
 	deactivate_locked_super(s);
 	return ERR_PTR(error);
 }

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 4149fa7beae723cd745672c749ed0a94f7f672a4
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:24 2015 +0400

ms/percpu-refcount: implement percpu_tryget() along with 
percpu_ref_kill_and_confirm()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Implement percpu_tryget() which stops giving out references once the
percpu_ref is visible as killed.  Because the refcnt is per-cpu,
different CPUs will start to see a refcnt as killed at different
points in time and tryget() may continue to succeed on subset of cpus
for a while after percpu_ref_kill() returns.

For use cases where it's necessary to know when all CPUs start to see
the refcnt as dead, percpu_ref_kill_and_confirm() is added.  The new
function takes an extra argument @confirm_kill which is invoked when
the refcnt is guaranteed to be viewed as killed on all CPUs.

While this isn't the prettiest interface, it doesn't force synchronous
wait and is much safer than requiring the caller to do its own
call_rcu().

v2: Patch description rephrased to emphasize that tryget() may
continue to succeed on some CPUs after kill() returns as suggested
by Kent.

v3: Function comment in percpu_ref_kill_and_confirm() updated warning
people to not depend on the implied RCU grace period from the
confirm callback as it's an implementation detail.
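
The visibility window described above can be illustrated with a single-threaded toy model. This is a sketch under loud assumptions: the toy_* names are hypothetical, per-CPU visibility is tracked with an explicit flag per CPU, and toy_cpu_observe_kill() stands in for whatever makes a CPU notice the kill; the kernel relies on RCU for that, which this model does not attempt to reproduce.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4               /* assumed CPU count for the model */

struct toy_ref {
	bool killed;                            /* kill has been requested */
	bool cpu_sees_dead[TOY_NR_CPUS];        /* CPUs that observed the kill */
	unsigned int pcpu_count[TOY_NR_CPUS];   /* per-CPU reference counts */
	int confirmed;                          /* how often confirm_kill ran */
	void (*confirm_kill)(struct toy_ref *ref);
};

static void toy_ref_init(struct toy_ref *ref)
{
	*ref = (struct toy_ref){ 0 };
}

/* Succeeds until the calling CPU has observed the kill. */
static bool toy_ref_tryget(struct toy_ref *ref, int cpu)
{
	if (ref->cpu_sees_dead[cpu])
		return false;
	ref->pcpu_count[cpu]++;
	return true;
}

/* Request the kill; no CPU is guaranteed to see it when this returns. */
static void toy_ref_kill_and_confirm(struct toy_ref *ref,
				     void (*confirm)(struct toy_ref *))
{
	ref->killed = true;
	ref->confirm_kill = confirm;
}

/* One CPU notices the kill; the last CPU to notice triggers the callback. */
static void toy_cpu_observe_kill(struct toy_ref *ref, int cpu)
{
	int i;

	if (!ref->killed || ref->cpu_sees_dead[cpu])
		return;
	ref->cpu_sees_dead[cpu] = true;
	for (i = 0; i < TOY_NR_CPUS; i++)
		if (!ref->cpu_sees_dead[i])
			return;
	if (ref->confirm_kill)
		ref->confirm_kill(ref);
}

/* Example callback: just record that confirmation happened. */
static void toy_count_confirm(struct toy_ref *ref)
{
	ref->confirmed++;
}
```

In this model toy_ref_tryget() keeps succeeding on CPUs that have not yet observed the kill, and only after the callback has run is it guaranteed to fail everywhere.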

Signed-off-by: Tejun Heo t...@kernel.org
Slightly-Grumpily-Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit dbece3a0f1ef0b19aff1cc6ed0942fec9ab98de1)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 50 -
 lib/percpu-refcount.c   | 23 ++-
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 6d843d6..dd2a086 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -63,13 +63,30 @@ struct percpu_ref {
 */
unsigned __percpu   *pcpu_count;
percpu_ref_func_t   *release;
+   percpu_ref_func_t   *confirm_kill;
struct rcu_head rcu;
 };
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 percpu_ref_func_t *release);
 void percpu_ref_cancel_init(struct percpu_ref *ref);
-void percpu_ref_kill(struct percpu_ref *ref);
+void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
+percpu_ref_func_t *confirm_kill);
+
+/**
+ * percpu_ref_kill - drop the initial ref
+ * @ref: percpu_ref to kill
+ *
+ * Must be used to drop the initial ref on a percpu refcount; must be called
+ * precisely once before shutdown.
+ *
+ * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
+ * percpu counters and dropping the initial ref.
+ */
+static inline void percpu_ref_kill(struct percpu_ref *ref)
+{
+   return percpu_ref_kill_and_confirm(ref, NULL);
+}
 
 #define PCPU_STATUS_BITS   2
 #define PCPU_STATUS_MASK   ((1 << PCPU_STATUS_BITS) - 1)
@@ -101,6 +118,37 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }
 
 /**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless it has already been killed.  Returns
+ * %true on success; %false on failure.
+ *
+ * Completion of percpu_ref_kill() in itself doesn't guarantee that tryget
+ * will fail.  For such guarantee, percpu_ref_kill_and_confirm() should be
+ * used.  After the confirm_kill callback is invoked, it's guaranteed that
+ * no new reference will be given out by percpu_ref_tryget().
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+   unsigned

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 82f6802b3f09878172024c57ed12cf2da92cccd3
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:23 2015 +0400

ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use 
ACCESS_ONCE() in percpu_ref_kill_rcu()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Two small changes.

* Unlike most init functions, percpu_ref_init() allocates memory and
  may fail.  Let's mark it with __must_check in case the caller
  forgets.

* percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
  dereference @ref->pcpu_count, which can be misleading.  The pointer
  is guaranteed to be valid and visible and can't change underneath
  the function.  Drop ACCESS_ONCE().
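
As a rough userspace illustration of the first change, the GCC/Clang warn_unused_result attribute makes the compiler flag callers that drop the return value of an init function that can fail. The toy_* names and the toy_must_check macro are hypothetical and local to this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Compiler attribute of the kind __must_check relies on (GCC/Clang). */
#define toy_must_check __attribute__((warn_unused_result))

struct toy_ref {
	unsigned int *pcpu_count;       /* allocated by init, so init may fail */
};

static int toy_must_check toy_ref_init(struct toy_ref *ref, size_t nr_cpus)
{
	ref->pcpu_count = calloc(nr_cpus, sizeof(*ref->pcpu_count));
	return ref->pcpu_count ? 0 : -1;
}

static void toy_ref_exit(struct toy_ref *ref)
{
	free(ref->pcpu_count);
	ref->pcpu_count = NULL;
}

/*
 * Correct caller: the result is checked and propagated. Writing a bare
 * "toy_ref_init(ref, 4);" here instead would draw a compiler warning.
 */
static int toy_setup(struct toy_ref *ref)
{
	int ret = toy_ref_init(ref, 4);

	if (ret)
		return ret;
	return 0;
}
```

The attribute costs nothing at run time; it only moves the "forgot to check for allocation failure" mistake from run time to build time.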

Signed-off-by: Tejun Heo t...@kernel.org
(cherry picked from commit acac7883ee7bcc32476963bce7baf73d44574dd1)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 3 ++-
 lib/percpu-refcount.c   | 4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index b61bd6f..8146aa9 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -66,7 +66,8 @@ struct percpu_ref {
struct rcu_head rcu;
 };
 
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
+int __must_check percpu_ref_init(struct percpu_ref *ref,
+percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS   2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9a78e55..b35eaac 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -57,12 +57,10 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
-	unsigned __percpu *pcpu_count;
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
 	unsigned count = 0;
 	int cpu;
 
-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
 	/* Mask out PCPU_REF_DEAD */
 	pcpu_count = (unsigned __percpu *)
 		(((unsigned long) pcpu_count) & ~PCPU_STATUS_MASK);

[Devel] [PATCH RHEL7 COMMIT] ms/percpu: implement generic percpu refcounting

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit b5ec5570459334e56491e564b567cc5bed16181e
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:21 2015 +0400

ms/percpu: implement generic percpu refcounting

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Kent Overstreet koverstr...@google.com

This implements a refcount with similar semantics to
atomic_get()/atomic_dec_and_test() - but percpu.

It also implements two stage shutdown, as we need it to tear down the
percpu counts.  Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in shutting down mode and
switches back to a single atomic refcount with the appropriate
barriers (synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only
returns true once, so callers don't have to reimplement shutdown
synchronization.
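
The two-stage scheme can be modelled in single-threaded userspace C. This is a sketch under assumptions, not the kernel code: the toy_* names are hypothetical, plain unsigned integers replace the atomic and per-CPU counters, and the RCU synchronization is left out. It follows the description above, so the caller drops the initial reference after the kill.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS    4            /* assumed CPU count for the model */
#define TOY_COUNT_BIAS (1U << 31)   /* keeps the shared count away from zero */

struct toy_ref {
	unsigned int count;                 /* shared count, biased until kill */
	unsigned int percpu[TOY_NR_CPUS];   /* per-CPU deltas; each may wrap */
	bool dead;                          /* true once in single-counter mode */
	int released;                       /* times the count reached zero */
};

static void toy_ref_init(struct toy_ref *ref)
{
	*ref = (struct toy_ref){ .count = 1 + TOY_COUNT_BIAS };
}

static void toy_ref_get(struct toy_ref *ref, int cpu)
{
	if (!ref->dead)
		ref->percpu[cpu]++;
	else
		ref->count++;
}

static void toy_ref_put(struct toy_ref *ref, int cpu)
{
	if (!ref->dead)
		ref->percpu[cpu]--;     /* cannot detect zero in percpu mode */
	else if (--ref->count == 0)
		ref->released++;
}

/*
 * Switch to single-counter mode: fold the per-CPU deltas into the shared
 * count and remove the bias. Returns true only for the first call.
 */
static bool toy_ref_kill(struct toy_ref *ref)
{
	unsigned int sum = 0;
	int cpu;

	if (ref->dead)
		return false;
	ref->dead = true;
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += ref->percpu[cpu];        /* unsigned wraparound cancels out */
	ref->count += sum - TOY_COUNT_BIAS;
	return true;
}
```

A get on one CPU and a put on another leave individual slots at wrapped values, but the unsigned sum is still the true delta, so after the kill the shared count equals the number of outstanding references and the final put can detect zero.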

[a...@linux-foundation.org: fix build]
[a...@linux-foundation.org: coding-style tweak]
Signed-off-by: Kent Overstreet koverstr...@google.com
Cc: Zach Brown z...@redhat.com
Cc: Felipe Balbi ba...@ti.com
Cc: Greg Kroah-Hartman gre...@linuxfoundation.org
Cc: Mark Fasheh mfas...@suse.com
Cc: Joel Becker jl...@evilplan.org
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Jens Axboe ax...@kernel.dk
Cc: Asai Thambi S P asamymuth...@micron.com
Cc: Selvan Mani sm...@micron.com
Cc: Sam Bradshaw sbrads...@micron.com
Cc: Jeff Moyer jmo...@redhat.com
Cc: Al Viro v...@zeniv.linux.org.uk
Cc: Benjamin LaHaise b...@kvack.org
Cc: Tejun Heo t...@kernel.org
Cc: Oleg Nesterov o...@redhat.com
Cc: Christoph Lameter c...@linux-foundation.org
Cc: Ingo Molnar mi...@redhat.com
Reviewed-by: Theodore Ts'o ty...@mit.edu
Signed-off-by: Tejun Heo t...@kernel.org

(cherry picked from commit 215e262f2aeba378aa192da07c30770f9925a4bf)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
lib/Makefile
---
 include/linux/percpu-refcount.h | 122 ++
 lib/Makefile|   2 +-
 lib/percpu-refcount.c   | 128 
 3 files changed, 251 insertions(+), 1 deletion(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 000..24b31ef
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,122 @@
+/*
+ * Percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet koverstr...@google.com
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * The refcount will have a range of 0 to ((1U << 31) - 1), i.e. one bit less
+ * than an atomic_t - this is because of the way shutdown works, see
+ * percpu_ref_kill()/PCPU_COUNT_BIAS.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0.  After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspaces calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu()

[Devel] [PATCH RHEL7 COMMIT] ms/memcg: issue memory.high reclaim after refilling percpu stock

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit c315808e33a89086d0dac4624c1fa6f4fe1f8051
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:22:20 2015 +0400

ms/memcg: issue memory.high reclaim after refilling percpu stock

Currently, we dive into memory.high reclaim before refilling percpu
stock. As a result, if we successfully charge a batch for a percpu stock
while exceeding memory.high, others won't be able to use it until we
finish and will probably have to reclaim themselves, which may lead to
overreclaim. This patch therefore moves memory.high reclaim after
refilling stocks. This is how it works upstream.

I haven't seen any negative effects caused by this backport mistake, but
let's stick to the mainstream behavior anyways.
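
The ordering described above can be sketched in userspace C as follows. This is a simplified model under stated assumptions, not the memcg code: cg_model, cg_model_reclaim() and cg_model_charge() are hypothetical names, the percpu stock is a single counter, and reclaim only lowers the usage of the level it is called on.

```c
#include <assert.h>
#include <stddef.h>

struct cg_model {
    struct cg_model *parent;      /* NULL for the root */
    unsigned long usage;          /* pages currently charged */
    unsigned long high;           /* memory.high threshold in pages */
};

/* Hypothetical stand-in for reclaim: frees up to nr_pages at this level. */
static void cg_model_reclaim(struct cg_model *cg, unsigned long nr_pages)
{
    cg->usage -= nr_pages < cg->usage ? nr_pages : cg->usage;
}

/* Charges a batch, refills the stock, then reclaims; returns levels reclaimed. */
static int cg_model_charge(struct cg_model *cg, unsigned long nr_pages,
                           unsigned long batch, unsigned long *stock,
                           int can_wait)
{
    struct cg_model *iter;
    int reclaimed_levels = 0;

    assert(batch >= nr_pages);            /* batch is max(CHARGE_BATCH, nr_pages) */

    for (iter = cg; iter; iter = iter->parent)
        iter->usage += batch;             /* charge the whole batch up the tree */

    if (batch > nr_pages)
        *stock += batch - nr_pages;       /* surplus is usable before any reclaim */

    if (!can_wait)
        return 0;                         /* mirrors the __GFP_WAIT check */

    for (iter = cg; iter; iter = iter->parent) {
        if (iter->usage <= iter->high)
            continue;                     /* this level is within its range */
        cg_model_reclaim(iter, nr_pages);
        reclaimed_levels++;
    }
    return reclaimed_levels;
}
```

The point of the ordering is that the stock is refilled before the potentially slow reclaim loop starts, so other tasks on the same CPU can consume the surplus in the meantime.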

Fixes: 4038cd0e029dd (ms/memcg: port memory.high)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37e81d3..5f3e0ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2730,10 +2730,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
if (likely(!ret)) {
if (!do_swap_account)
-   goto done;
+   return CHARGE_OK;
ret = res_counter_charge(&memcg->memsw, csize, &fail_res);
if (likely(!ret))
-   goto done;
+   return CHARGE_OK;
 
res_counter_uncharge(&memcg->res, csize);
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
@@ -2790,21 +2790,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return CHARGE_OOM_DIE;
 
return CHARGE_RETRY;
-
-done:
-   if (!(gfp_mask & __GFP_WAIT))
-   goto out;
-   /*
-* If the hierarchy is above the normal consumption range,
-* make the charging task trim their excess contribution.
-*/
-   do {
-   if (res_counter_read_u64(&memcg->res, RES_USAGE) <= memcg->high)
-   continue;
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, false);
-   } while ((memcg = parent_mem_cgroup(memcg)));
-out:
-   return CHARGE_OK;
 }
 
 /*
@@ -2836,7 +2821,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 {
unsigned int batch = max(CHARGE_BATCH, nr_pages);
int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-   struct mem_cgroup *memcg = NULL;
+   struct mem_cgroup *memcg = NULL, *iter;
int ret;
 
/*
@@ -2950,6 +2935,20 @@ again:
 
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
+
+   /*
+* If the hierarchy is above the normal consumption range,
+* make the charging task trim their excess contribution.
+*/
+   iter = memcg;
+   do {
+   if (!(gfp_mask & __GFP_WAIT))
+   break;
+   if (res_counter_read_u64(&iter->res, RES_USAGE) <= iter->high)
+   continue;
+   try_to_free_mem_cgroup_pages(iter, nr_pages, gfp_mask, false);
+   } while ((iter = parent_mem_cgroup(iter)));
+
css_put(&memcg->css);
 done:
*ptr = memcg;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

[Devel] [PATCH RHEL7 COMMIT] ve/vznetstat: Fix potential exit race

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 9a440f22380933dd3547de7d83c553924c6ce284
Author: Cyrill Gorcunov gorcu...@virtuozzo.com
Date:   Fri Aug 28 14:31:18 2015 +0400

ve/vznetstat: Fix potential exit race

When a container is exiting, another task may still be operating on the
statistics and incrementing or decrementing the stat counter. The counter
may then be non-zero at exit, in which case we don't zap the @ve->stat
member.

Fix it by testing if the net is the last one belonging
to a container.

https://jira.sw.ru/browse/PSBM-35178

Fixes: 505f8aacf95dce27fad66c90d4e1cd64adcb5432
(ve/vznetstat: Don't destroy statistics until explicitly asked)

Signed-off-by: Cyrill Gorcunov gorcu...@virtuozzo.com

CC: Andrey Vagin ava...@virtuozzo.com
CC: Vladimir Davydov vdavy...@virtuozzo.com
CC: Konstantin Khorenko khore...@virtuozzo.com
CC: Pavel Emelyanov xe...@virtuozzo.com
CC: Igor Sukhih i...@parallels.com
---
 kernel/ve/vznetstat/vznetstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/ve/vznetstat/vznetstat.c b/kernel/ve/vznetstat/vznetstat.c
index 9a25dea..99feafb 100644
--- a/kernel/ve/vznetstat/vznetstat.c
+++ b/kernel/ve/vznetstat/vznetstat.c
@@ -1098,7 +1098,7 @@ static void __net_exit net_exit_acct(struct net *net)
 
if (ve->stat) {
venet_acct_put_stat(ve->stat);
-   if (atomic_read(&ve->stat->users) == 0)
+   if (ve->ve_netns == net)
ve->stat = NULL;
}
 }

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: use RCU-sched insted of normal RCU

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 932bf29b63b1e7c74669a8847d7c69cc8b8ba919
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:25 2015 +0400

ms/percpu-refcount: use RCU-sched insted of normal RCU

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

percpu-refcount was incorrectly using preempt_disable/enable() for RCU
critical sections against call_rcu().  6a24474da8 (percpu-refcount:
consistently use plain (non-sched) RCU) fixed it by converting the
preemption operations to rcu_read_[un]lock(), citing that there isn't
any advantage in using sched-RCU over using the usual one; however,
rcu_read_[un]lock() for the preemptible RCU implementation -
CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT - are slightly
more expensive than preempt_disable/enable().

In a contrived microbench which repeats the following,

 - percpu_ref_get()
 - copy 32 bytes of data into percpu buffer
 - percpu_ref_put()
 - copy 32 bytes of data into percpu buffer

rcu_read_[un]lock() used in percpu_ref_get/put() makes it go slower by
about 15% when compared to using sched-RCU.

As the RCU critical sections are extremely short, using sched-RCU
shouldn't have any latency implications.  Convert to RCU-sched.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Michal Hocko mho...@suse.cz
Cc: Rusty Russell ru...@rustcorp.com.au
(cherry picked from commit a4244454df1296e90cc961c1b636b1176ef0d9a0)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 12 ++--
 lib/percpu-refcount.c   |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index dd2a086..95961f0 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -105,7 +105,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   rcu_read_lock();
+   rcu_read_lock_sched();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -114,7 +114,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
else
atomic_inc(&ref->count);
 
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
 }
 
 /**
@@ -134,7 +134,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
unsigned __percpu *pcpu_count;
int ret = false;
 
-   rcu_read_lock();
+   rcu_read_lock_sched();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -143,7 +143,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
ret = true;
}
 
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
 
return ret;
 }
@@ -159,7 +159,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   rcu_read_lock();
+   rcu_read_lock_sched();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -168,7 +168,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
else if (unlikely(atomic_dec_and_test(&ref->count)))
ref->release(ref);
 
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
 }
 
 #endif
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 8bf9e71..7deeb62 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -154,5 +154,5 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);
ref->confirm_kill = confirm_kill;
 
-   call_rcu(&ref->rcu, percpu_ref_kill_rcu);
+   call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: cosmetic updates

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d6bfd7b559fdbe649d00c272895cb26996d1ee1c
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:22 2015 +0400

ms/percpu-refcount: cosmetic updates

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

* s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have _t
  postfix for types and the type is gonna be used for a different type
  of callback too.

* Add @ARG to function comments.

* Drop unnecessary and unaligned indentation from percpu_ref_init()
  function comment.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit ac899061a93250c28562f05ad94d5c74603415bc)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8 +---
 lib/percpu-refcount.c   | 7 ---
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index abe1411..b61bd6f 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -51,7 +51,7 @@
 #include <linux/rcupdate.h>
 
 struct percpu_ref;
-typedef void (percpu_ref_release)(struct percpu_ref *);
+typedef void (percpu_ref_func_t)(struct percpu_ref *);
 
 struct percpu_ref {
atomic_t count;
@@ -62,11 +62,11 @@ struct percpu_ref {
 * percpu_ref_kill_rcu())
 */
unsigned __percpu   *pcpu_count;
-   percpu_ref_release  *release;
+   percpu_ref_func_t   *release;
struct rcu_head rcu;
 };
 
-int percpu_ref_init(struct percpu_ref *, percpu_ref_release *);
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS   2
@@ -78,6 +78,7 @@ void percpu_ref_kill(struct percpu_ref *ref);
 
 /**
  * percpu_ref_get - increment a percpu refcount
+ * @ref: percpu_ref to get
  *
  * Analagous to atomic_inc().
   */
@@ -99,6 +100,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 
 /**
  * percpu_ref_put - decrement a percpu refcount
+ * @ref: percpu_ref to put
  *
  * Decrement the refcount, and if 0, call the release function (which was passed
  * to percpu_ref_init())
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 1a17399..9a78e55 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -33,8 +33,8 @@
 
 /**
  * percpu_ref_init - initialize a percpu refcount
- * @ref:   ref to initialize
- * @release:   function which will be called when refcount hits 0
+ * @ref: percpu_ref to initialize
+ * @release: function which will be called when refcount hits 0
  *
  * Initializes the refcount in single atomic counter mode with a refcount of 1;
  * analagous to atomic_set(ref, 1).
@@ -42,7 +42,7 @@
  * Note that @release must not sleep - it may potentially be called from RCU
  * callback context by percpu_ref_kill().
  */
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release)
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 {
atomic_set(&ref->count, 1 + PCPU_COUNT_BIAS);
 
@@ -98,6 +98,7 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 
 /**
  * percpu_ref_kill - safely drop initial ref
+ * @ref: percpu_ref to kill
  *
  * Must be used to drop the initial ref on a percpu refcount; must be called
  * precisely once before shutdown.

Re: [Devel] [RFC rh7 v5] ve/tty: vt -- Implement per VE support for console and terminals

2015-08-28 Thread Cyrill Gorcunov

On Fri, Aug 28, 2015 at 11:12:39AM +0300, Vladimir Davydov wrote:
  
> > nb: you know, moving patches from mainline (slave lock) seems
> > to be not that simple, they introduced own new lock class for that.
> > at moment i think how to modify our vtty code without mangling
> > general tty code.
>
> I'd still suggest moving EXTRA_REF logic to tty_io.c. Yes, it's going to
> hurt a little during rebases, but if we try to keep it in pty.c we will
> implicitly rely on tty_io.c internal logic (locking or ref counting
> rules), which will probably result in failures at runtime, which is much
> worse than failures at build time.

Seems so :/ I didn't find a way to hide all these things solely
inside vtty code.

Cyrill

[Devel] [PATCH RHEL7 COMMIT] ploop: dio_fastmap() must refresh bvec_merge_data

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit fc65c834967a14d37ef23348cec6528d18b0a169
Author: Maxim Patlasov mpatla...@openvz.org
Date:   Fri Aug 28 14:18:37 2015 +0400

ploop: dio_fastmap() must refresh bvec_merge_data

q->merge_bvec_fn() may override some fields of bvec_merge_data.
For example, raid0_mergeable_bvec() does so. The blessed way is
to initialize it from scratch before use -- see how __bio_add_page()
prepares bvm for calling q->merge_bvec_fn().
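
To show why the structure has to be rebuilt on every iteration, here is a small userspace model. Everything in it is an assumption made for illustration: merge_data_model, clobbering_merge_fn() and the chunk geometry are hypothetical and only mimic a callback that remaps the sector in place before computing how many bytes fit.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SECTORS 16u     /* hypothetical stripe chunk size in sectors */
#define SECTOR_BYTES  512u
#define DEV_OFFSET    5u      /* hypothetical remap offset applied by the callback */

struct merge_data_model {     /* simplified stand-in for bvec_merge_data */
    unsigned long sector;
    unsigned size;
};

/* Returns how many of len bytes fit; overwrites bvm->sector as a side effect. */
static unsigned clobbering_merge_fn(struct merge_data_model *bvm, unsigned len)
{
    unsigned long in_chunk;
    unsigned room_bytes;

    bvm->sector += DEV_OFFSET;            /* callee modifies the caller's field */
    in_chunk = bvm->sector % CHUNK_SECTORS;
    room_bytes = (unsigned)(CHUNK_SECTORS - in_chunk) * SECTOR_BYTES;
    room_bytes = room_bytes > bvm->size ? room_bytes - bvm->size : 0;
    return room_bytes < len ? room_bytes : len;
}

/* Counts accepted segments; refresh selects the per-iteration rebuild. */
static size_t count_accepted(const unsigned *lens, size_t n, int refresh)
{
    struct merge_data_model bvm = { .sector = 0, .size = 0 };
    unsigned total = 0;
    size_t i;

    assert(lens != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (refresh)
            bvm = (struct merge_data_model){ .sector = 0, .size = total };
        else
            bvm.size = total;             /* old behaviour: only size is updated */
        if (clobbering_merge_fn(&bvm, lens[i]) < lens[i])
            break;
        total += lens[i];
    }
    return i;
}
```

With three 2048-byte segments, the refreshed variant accepts two of them in this model, while the stale variant accepts only one because the sector drifts by DEV_OFFSET on every call.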

Signed-off-by: Maxim Patlasov mpatla...@openvz.org
Acked-by: Dmitry Monakhov dmonak...@openvz.org
---
 drivers/block/ploop/io_direct.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 793bcc5..0183b0f 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1487,7 +1487,6 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
struct request_queue * q;
struct extent_map * em;
int i;
-   struct bvec_merge_data bm_data;
 
if (orig_bio->bi_size == 0) {
bio->bi_vcnt   = 0;
@@ -1535,19 +1534,19 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
bio->bi_size = 0;
bio->bi_vcnt = 0;
 
-   bm_data.bi_bdev = bio->bi_bdev;
-   bm_data.bi_sector = bio->bi_sector;
-   bm_data.bi_size = 0;
-   bm_data.bi_rw = bio->bi_rw;
-
for (i = 0; i < orig_bio->bi_vcnt; i++) {
struct bio_vec * bv = &bio->bi_io_vec[i];
+   struct bvec_merge_data bm_data = {
+   .bi_bdev = bio->bi_bdev,
+   .bi_sector = bio->bi_sector,
+   .bi_size = bio->bi_size,
+   .bi_rw = bio->bi_rw,
+   };
if (q->merge_bvec_fn(q, &bm_data, bv) < bv->bv_len) {
io->plo->st.fast_neg_backing++;
return 1;
}
bio->bi_size += bv->bv_len;
-   bm_data.bi_size = bio->bi_size;
bio->bi_vcnt++;
}
return 0;

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: consistently use plain (non-sched) RCU

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 41721ced765e1156651d31c8b9deb0111340e984
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:22 2015 +0400

ms/percpu-refcount: consistently use plain (non-sched) RCU

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

percpu_ref_get/put() are using preempt_disable/enable() while
percpu_ref_kill() is using plain call_rcu() instead of
call_rcu_sched().  This is buggy as grace periods of the two may not
match.  Fix it by using plain RCU in percpu_ref_get/put().

(I suggested using sched RCU in the first place but there's no actual
 benefit in doing so unless we're gonna introduce different variants
 of get/put to be called while preemption is already disabled, which we
 definitely shouldn't.)

Signed-off-by: Tejun Heo t...@kernel.org
Reported-by: Rusty Russell ru...@rustcorp.com.au
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit 6a24474da83ea7c8b7d32f05f858b1259994067a)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 24b31ef..abe1411 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -85,7 +85,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   preempt_disable();
+   rcu_read_lock();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -94,7 +94,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
else
atomic_inc(&ref->count);
 
-   preempt_enable();
+   rcu_read_unlock();
 }
 
 /**
@@ -107,7 +107,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   preempt_disable();
+   rcu_read_lock();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -116,7 +116,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
else if (unlikely(atomic_dec_and_test(&ref->count)))
ref->release(ref);
 
-   preempt_enable();
+   rcu_read_unlock();
 }
 
 #endif

[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: use percpu refcnt for cgroup_subsys_states

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit b1753091f010a49bcd0a89aa23306ac816302f9c
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:27 2015 +0400

ms/cgroup: use percpu refcnt for cgroup_subsys_states

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.

In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly map to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.

The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved by collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already split destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.

This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.

v2: - init_cgroup_css() was calling percpu_ref_init() without checking
  the return value.  This causes two problems - the obvious lack
  of error handling and percpu_ref_init() being called from
  cgroup_init_subsys() before the allocators are up, which
  triggers warnings but doesn't cause actual problems as the
  refcnt isn't used for roots anyway.  Fix both by moving
  percpu_ref_init() to cgroup_create().

- The base references were put too early by
  percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
  refs one extra time.  This wasn't noticeable because css's go
  through another RCU grace period before being freed.  Update
  cgroup_destroy_locked() to grab an extra reference before
  killing the refcnts.  This problem was noticed by Kent.

Signed-off-by: Tejun Heo t...@kernel.org
Reviewed-by: Kent Overstreet koverstr...@google.com
Acked-by: Li Zefan lize...@huawei.com
Cc: Michal Hocko mho...@suse.cz
Cc: Mike Snitzer snit...@redhat.com
Cc: Vivek Goyal vgo...@redhat.com
Cc: Alasdair G. Kergon a...@redhat.com
Cc: Jens Axboe ax...@kernel.dk
Cc: Mikulas Patocka mpato...@redhat.com
Cc: Glauber Costa glom...@gmail.com
(cherry picked from commit d3daf28da16a30af95bfb303189a634a87606725)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
include/linux/cgroup.h
kernel/cgroup.c
---
 include/linux/cgroup.h |  27 +++-
 kernel/cgroup.c| 166 +++--
 2

[Devel] [PATCH RHEL7 COMMIT] ve/devtmpfs: lightweight virtualization

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 22255fb606cfd53fb98b11c62b854c0de5a4c713
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:59 2015 +0400

ve/devtmpfs: lightweight virtualization

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

All this patch does is provide each VE with its own empty single tmpfs
mount, which appears on an attempt to mount devtmpfs. It's up to the
userspace to populate this fs on container start; all kernel requests to
create a device node inside a VE are ignored.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 67 +
 include/linux/ve.h  |  1 +
 kernel/ve/ve.c  |  4 +++
 3 files changed, 72 insertions(+)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f59b798..daf97ee 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -23,6 +23,7 @@
 #include <linux/ramfs.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/ve.h>
 #include "base.h"
 
 static struct task_struct *thread;
@@ -53,9 +54,61 @@ static int __init mount_param(char *str)
 }
 __setup("devtmpfs.mount=", mount_param);
 
+#ifdef CONFIG_VE
+static int ve_test_dev_sb(struct super_block *s, void *p)
+{
+   return get_exec_env()->dev_sb == s;
+}
+
+static int ve_set_dev_sb(struct super_block *s, void *p)
+{
+   struct ve_struct *ve = get_exec_env();
+   int error;
+
+   error = set_anon_super(s, p);
+   if (!error) {
+   BUG_ON(ve->dev_sb);
+   ve->dev_sb = s;
+   atomic_inc(&s->s_active);
+   }
+   return error;
+}
+
+static struct dentry *ve_dev_mount(struct file_system_type *fs_type, int flags,
+ const char *dev_name, void *data)
+{
+   int (*fill_super)(struct super_block *, void *, int);
+   struct super_block *s;
+   int error;
+
+#ifdef CONFIG_TMPFS
+   fill_super = shmem_fill_super;
+#else
+   fill_super = ramfs_fill_super;
+#endif
+   s = sget(fs_type, ve_test_dev_sb, ve_set_dev_sb, flags, NULL);
+   if (IS_ERR(s))
+   return ERR_CAST(s);
+
+   if (!s->s_root) {
+   error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+   if (error) {
+   deactivate_locked_super(s);
+   return ERR_PTR(error);
+   }
+   s->s_flags |= MS_ACTIVE;
+   }
+   return dget(s->s_root);
+}
+#endif /* CONFIG_VE */
+
 static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
  const char *dev_name, void *data)
 {
+#ifdef CONFIG_VE
+   if (!ve_is_super(get_exec_env()))
+   return ve_dev_mount(fs_type, flags, dev_name, data);
+#endif
 #ifdef CONFIG_TMPFS
return mount_single(fs_type, flags, data, shmem_fill_super);
 #else
@@ -79,6 +132,16 @@ static inline int is_blockdev(struct device *dev)
 static inline int is_blockdev(struct device *dev) { return 0; }
 #endif
 
+#ifdef CONFIG_VE
+static inline int is_ve_dev(struct device *dev)
+{
+   return dev->class && dev->class->namespace == ve_namespace &&
+   ve_namespace(dev) != get_ve0();
+}
+#else
+static inline int is_ve_dev(struct device *dev) { return 0; }
+#endif
+
 int

[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use sb->s_fs_info

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 17dd96483ff558d44c98c3f8bcb04a86aca843a5
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:43 2015 +0400

ve/binfmt_misc: do not use sb->s_fs_info

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

When we virtualized binfmt_misc, we made sb->s_fs_info store a pointer
to binfmt_misc struct. At the same time, we store a pointer to the owner
ve_struct in sb->s_ns and a pointer to the same binfmt_misc struct in
ve_struct->binfmt_misc. That said, we don't actually need to use
s_fs_info, because we can get the binfmt_misc by dereferencing
sb->s_ns->binfmt_misc.

Using sb-s_fs_info instead of sb-s_ns will allow us to revert our
patches introducing sb-s_ns.
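
For illustration only (not part of the patch): a minimal user-space
sketch of the lookup described above. The struct definitions are
reduced stand-ins for the kernel ones, and bm_enabled() is a
hypothetical helper; only the BINFMT_MISC() macro mirrors what the
patch adds.

```c
#include <stddef.h>

/* Reduced stand-ins for the kernel structures: only the fields that
 * matter for the lookup are modelled here. */
struct binfmt_misc {
	int enabled;
	int entry_count;
};

struct ve_struct {
	struct binfmt_misc *binfmt_misc;
};

struct super_block {
	void *s_ns;	/* owner namespace; a ve_struct for binfmt_misc */
};

/* Same shape as the macro introduced by the patch. */
#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_ns)->binfmt_misc)

/* Hypothetical helper: is binfmt_misc enabled for the container that
 * owns this superblock?  Returns 0 if the container has no state. */
static int bm_enabled(struct super_block *sb)
{
	struct binfmt_misc *bm_data = BINFMT_MISC(sb);

	return bm_data ? bm_data->enabled : 0;
}
```

The point of the indirection is that the superblock no longer needs a
private copy of the binfmt_misc pointer: the owner ve_struct is the
single place that holds it.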

This could be merged to 0b0dbb644794 (VE/BINFTM: virtualization).

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 7e760d2..d0cb80c 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,6 +65,8 @@ struct binfmt_misc {
int entry_count;
 };
 
+#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+
 /* 
  * Check if we support the binfmt
  * if we do, return the node, else NULL
@@ -541,7 +543,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
    Node *e = file_inode(file)->i_private;
    int res = parse_command(buffer, count);
    struct super_block *sb = file->f_path.dentry->d_sb;
-   struct binfmt_misc *bm_data = sb->s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 
    switch (res) {
    case 1: clear_bit(Enabled, &e->flags);
@@ -576,7 +578,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
    struct inode *inode;
    struct dentry *root, *dentry;
    struct super_block *sb = file->f_path.dentry->d_sb;
-   struct binfmt_misc *bm_data = sb->s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(sb);
    int err = 0;
 
    e = create_entry(buffer, count);
@@ -641,7 +643,7 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-   struct binfmt_misc *bm_data = file->f_dentry->d_sb->s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(file->f_dentry->d_sb);
    char *s = bm_data->enabled ? "enabled\n" : "disabled\n";
 
    return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
@@ -650,7 +652,7 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file * file, const char __user * buffer,
    size_t count, loff_t *ppos)
 {
-   struct binfmt_misc *bm_data = file->f_dentry->d_sb->s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(file->f_dentry->d_sb);
    int res = parse_command(buffer, count);
    struct dentry *root;
 
@@ -681,7 +683,7 @@ static const struct file_operations bm_status_operations = {
 
 static void bm_put_super(struct super_block *sb)
 {
-   struct binfmt_misc *bm_data = sb->s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(sb);
    struct ve_struct *ve = sb->s_ns;
 
    bm_data->enabled = 0;
@@ -723,7 +725,6 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
    }
 
    sb->s_op = &s_ops;
-   sb->s_fs_info = bm_data;
 
    bm_data->enabled = 1;
    get_ve(ve);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

[Devel] [PATCH RHEL7 COMMIT] Revert fs: add data pointer to mount_ns()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 8d9d5a10d874b4d9f66f1af3fdcabbe9aee396f2
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:58 2015 +0400

Revert fs: add data pointer to mount_ns()

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 69e6ae7f750fc862c9324441130abbff2c8b528e.

This is only needed for per-ns filesystems that can accept user options.
There is only one such filesystem, devtmpfs, which we made per
container. Since devtmpfs virtualization is going to be dropped, this
patch is not necessary.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 4 ++--
 fs/binfmt_misc.c| 2 +-
 fs/nfsd/nfsctl.c| 2 +-
 fs/super.c  | 4 ++--
 include/linux/fs.h  | 2 +-
 ipc/mqueue.c| 2 +-
 net/sunrpc/rpc_pipe.c   | 2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 349d6eb..6f4ba37 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -59,9 +59,9 @@ static struct dentry *dev_mount(struct file_system_type 
*fs_type, int flags,
  const char *dev_name, void *data)
 {
 #ifdef CONFIG_TMPFS
-   return mount_ns(fs_type, flags, data, get_exec_env(), shmem_fill_super);
+   return mount_ns(fs_type, flags, data, shmem_fill_super);
 #else
-   return mount_ns(fs_type, flags, data, get_exec_env(), ramfs_fill_super);
+   return mount_ns(fs_type, flags, data, ramfs_fill_super);
 #endif
 }
 
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 460d53f..7e760d2 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -734,7 +734,7 @@ static int bm_fill_super(struct super_block * sb, void * 
data, int silent)
 static struct dentry *bm_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
 {
-   return mount_ns(fs_type, flags, data, get_exec_env(), bm_fill_super);
+   return mount_ns(fs_type, flags, get_exec_env(), bm_fill_super);
 }
 
 static struct linux_binfmt misc_format = {
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 9b690c9..7411a56 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1126,7 +1126,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 static struct dentry *nfsd_mount(struct file_system_type *fs_type,
    int flags, const char *dev_name, void *data)
 {
-   return mount_ns(fs_type, flags, NULL, current->nsproxy->net_ns, nfsd_fill_super);
+   return mount_ns(fs_type, flags, current->nsproxy->net_ns, nfsd_fill_super);
 }
 
 static void nfsd_umount(struct super_block *sb)
diff --git a/fs/super.c b/fs/super.c
index 7f316e8..c9b47bf 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -890,11 +890,11 @@ static int ns_set_super(struct super_block *sb, void 
*data)
 }
 
 struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
-   void *data, void *ns, int (*fill_super)(struct super_block *, void *, 
int))
+   void *data, int (*fill_super)(struct super_block *, void *, int))
 {
struct super_block *sb;
 
-   sb = sget(fs_type, ns_test_super, ns_set_super, flags, ns);
+   sb = sget(fs_type, ns_test_super, ns_set_super, flags, data);
if (IS_ERR(sb))
return

[Devel] [PATCH RHEL7 COMMIT] Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 9b72ce16b191d84da03da83d5ccec29de8854686
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:41 2015 +0400

Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

This reverts commit 9e7411c5c3b53937171ef962ce7381337f125b28.

This patch is no longer needed, because none of the mount_ns users
needs sb->s_fs_info any more.
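
For illustration only (not part of the patch): a user-space sketch of
how mount_ns() keys superblocks by namespace once the revert is
applied. ns_test_super() and ns_set_super() follow the fs/super.c hunk
below; sget_model() and the list it walks are hypothetical stand-ins
for the kernel's sget(), and set_anon_super() is omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Reduced stand-in for the kernel superblock. */
struct super_block {
	void *s_fs_info;		/* namespace key after the revert */
	struct super_block *next;
};

static struct super_block *sb_list;	/* toy list of live superblocks */

/* Match callback: does this superblock belong to the namespace? */
static int ns_test_super(struct super_block *sb, void *data)
{
	return sb->s_fs_info == data;
}

/* Setup callback: remember the namespace in the new superblock. */
static int ns_set_super(struct super_block *sb, void *data)
{
	sb->s_fs_info = data;
	return 0;
}

/* Hypothetical stand-in for sget(): reuse the superblock whose key
 * matches, otherwise allocate one.  Returns NULL if allocation fails. */
static struct super_block *sget_model(void *ns)
{
	struct super_block *sb;

	for (sb = sb_list; sb; sb = sb->next)
		if (ns_test_super(sb, ns))
			return sb;

	sb = calloc(1, sizeof(*sb));
	if (!sb)
		return NULL;
	ns_set_super(sb, ns);
	sb->next = sb_list;
	sb_list = sb;
	return sb;
}
```

Because the key lives in s_fs_info, a mount_ns() user cannot also use
that field for private data, which is why binfmt_misc had to stop
doing so first.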

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/nfs/dns_resolve.c  | 2 +-
 fs/nfsd/nfs4recover.c | 4 ++--
 fs/super.c| 4 ++--
 include/linux/fs.h| 2 --
 ipc/mqueue.c  | 6 +++---
 net/sunrpc/clnt.c | 2 +-
 net/sunrpc/rpc_pipe.c | 4 ++--
 7 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c
index dda6202..d25f10f 100644
--- a/fs/nfs/dns_resolve.c
+++ b/fs/nfs/dns_resolve.c
@@ -415,7 +415,7 @@ static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event,
    void *ptr)
 {
    struct super_block *sb = ptr;
-   struct net *net = sb->s_ns;
+   struct net *net = sb->s_fs_info;
    struct nfs_net *nn = net_generic(net, nfs_net_id);
    struct cache_detail *cd = nn->nfs_dns_resolve;
    int ret = 0;
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index c714602..4c86b18 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -693,7 +693,7 @@ cld_pipe_downcall(struct file *filp, const char __user *src, size_t mlen)
    struct cld_upcall *tmp, *cup;
    struct cld_msg __user *cmsg = (struct cld_msg __user *)src;
    uint32_t xid;
-   struct nfsd_net *nn = net_generic(filp->f_dentry->d_sb->s_ns,
+   struct nfsd_net *nn = net_generic(filp->f_dentry->d_sb->s_fs_info,
    nfsd_net_id);
    struct cld_net *cn = nn->cld_net;
 
@@ -1353,7 +1353,7 @@ static int
 rpc_pipefs_event(struct notifier_block *nb, unsigned long event, void *ptr)
 {
    struct super_block *sb = ptr;
-   struct net *net = sb->s_ns;
+   struct net *net = sb->s_fs_info;
    struct nfsd_net *nn = net_generic(net, nfsd_net_id);
    struct cld_net *cn = nn->cld_net;
    struct dentry *dentry;
diff --git a/fs/super.c b/fs/super.c
index c9b47bf..341650d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -880,12 +880,12 @@ EXPORT_SYMBOL(kill_litter_super);
 
 static int ns_test_super(struct super_block *sb, void *data)
 {
-   return sb->s_ns == data;
+   return sb->s_fs_info == data;
 }
 
 static int ns_set_super(struct super_block *sb, void *data)
 {
-   sb->s_ns = data;
+   sb->s_fs_info = data;
    return set_anon_super(sb, NULL);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 68cec28..553bca3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1457,8 +1457,6 @@ struct super_block {
    unsigned int s_max_links;
    fmode_t s_mode;
 
-   void *s_ns;  /* Pointer to namespace */
-
    /* Granularity of c/m/atime in ns.
       Cannot be worse than a second */
    u32 s_time_gran;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 18620cd..c508938 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -104,7 +104,7 @@ static inline struct mqueue_inode_info *MQUEUE_I(struct inode *inode)
  */
 static inline struct ipc_namespace *__get_ns_from_inode(struct inode *inode)
 {
-   return get_ipc_ns(inode->i_sb->s_ns);
+   return get_ipc_ns(inode->i_sb->s_fs_info);
 }
 
 static struct ipc_namespace *get_ns_from_inode(struct inode *inode)
@@ -407,7 +407,7 @@ static void mqueue_evict_inode(struct inode *inode)
    user->mq_bytes -= mq_bytes;
    /*
     * get_ns_from_inode() ensures that the
-    * (ipc_ns = sb->s_ns) is either a valid ipc_ns
+    * (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
     * to which we now hold a reference, or it is NULL.
     * We can't put it here under mq_lock, though.
     */
@@ -1418,7 +1418,7 @@ int mq_init_ns(struct ipc_namespace *ns)
 
 void mq_clear_sbinfo(struct

[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use s_ns

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit a98a90ea907f522f1ae6ff0e1c6e78a39ade2494
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:44 2015 +0400

ve/binfmt_misc: do not use s_ns

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

Since 9e7411c5c3b5 was reverted, we must use sb->s_fs_info for storing a
pointer to the namespace.

This could be merged to 0b0dbb644794 (VE/BINFTM: virtualization).

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index d0cb80c..4487153 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,7 +65,7 @@ struct binfmt_misc {
int entry_count;
 };
 
-#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+#define BINFMT_MISC(sb) (((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)
 
 /* 
  * Check if we support the binfmt
@@ -684,7 +684,7 @@ static const struct file_operations bm_status_operations = {
 static void bm_put_super(struct super_block *sb)
 {
    struct binfmt_misc *bm_data = BINFMT_MISC(sb);
-   struct ve_struct *ve = sb->s_ns;
+   struct ve_struct *ve = sb->s_fs_info;
 
    bm_data->enabled = 0;
    put_ve(ve);
@@ -703,7 +703,7 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
    [3] = {"register", &bm_register_operations, S_IWUSR},
    /* last one */ {""}
    };
-   struct ve_struct *ve = sb->s_ns;
+   struct ve_struct *ve = data;
    struct binfmt_misc *bm_data = ve->binfmt_misc;
    int err;
 

[Devel] [PATCH RHEL7 COMMIT] Revert devtmpfs: containerize it with new obj ns operation

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 968c8efb7981f87f8bc0616741edb6c0bc556d76
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:57 2015 +0400

Revert devtmpfs: containerize it with new obj ns operation

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 53343c3b231ed36d973e6d3ac2ab9ad7b7c87e25.

The whole point of devtmpfs is simplifying the system bootup logic.
There is absolutely no point in virtualizing it, because on container
start we create devices from a hardcoded list (these are ttys, which I'd
prefer not to create at all using ptys instead, but we have to live with
it for compatibility reasons for now). This means that it is enough to
provide the userspace with per VE tmpfs mount called devtmpfs and
teach it to make device nodes from a hardcoded list on container start
instead of implementing devtmpfs virtualization in the kernel. The
kernel part will be done by the following patches.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c| 37 ++---
 fs/sysfs/ve.c  |  9 -
 include/linux/kobject_ns.h |  2 --
 3 files changed, 2 insertions(+), 46 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 0448af8..349d6eb 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -366,46 +366,13 @@ int devtmpfs_mount(const char *mntdir)
 
 static DECLARE_COMPLETION(setup_done);
 
-static struct path set_dev_pwd(struct device *dev)
-{
-   const struct kobj_ns_type_operations *ops;
-   struct path pwd = current->fs->pwd;
-
-   ops = kobj_ns_ops(&dev->kobj);
-   path_get(&pwd);
-
-   if (ops && ops->devtmpfs) {
-       const struct path *devtmpfs_root;
-
-       devtmpfs_root = ops->devtmpfs(&dev->kobj);
-       BUG_ON(!devtmpfs_root);
-       set_fs_pwd(current->fs, devtmpfs_root);
-   }
-   return pwd;
-}
-
-static void drop_dev_pwd(struct path *pwd)
-{
-   set_fs_pwd(current->fs, pwd);
-   path_put(pwd);
-}
-
 static int handle(const char *name, umode_t mode, kuid_t uid, kgid_t gid,
   struct device *dev)
 {
-   struct path pwd;
-   int err;
-
-   pwd = set_dev_pwd(dev);
-
    if (mode)
-       err = handle_create(name, mode, uid, gid, dev);
+       return handle_create(name, mode, uid, gid, dev);
    else
-       err = handle_remove(name, dev);
-
-   /* Restore kthread pwd */
-   drop_dev_pwd(&pwd);
-   return err;
+       return handle_remove(name, dev);
 }
 
 static int devtmpfsd(void *p)
diff --git a/fs/sysfs/ve.c b/fs/sysfs/ve.c
index 79ad6d5..bb28a4b 100644
--- a/fs/sysfs/ve.c
+++ b/fs/sysfs/ve.c
@@ -43,21 +43,12 @@ const void *ve_namespace(struct device *dev)
    return (!dev->groups && dev_get_drvdata(dev)) ? dev_get_drvdata(dev) : get_ve0();
 }
 
-static const struct path *ve_devtmpfs(const struct kobject *kobj)
-{
-   struct device *dev = container_of(kobj, struct device, kobj);
-   const struct ve_struct *ve = dev->class->namespace(dev);
-
-   return &ve->devtmpfs_root;
-}
-
 struct kobj_ns_type_operations ve_ns_type_operations = {
.type = KOBJ_NS_TYPE_VE,
.grab_current_ns =

Re: [Devel] [PATCH 2/3] ve: revise permissions to allow mount smth

2015-08-28 Thread Vladimir Davydov

On Fri, Aug 28, 2015 at 05:20:02PM +0400, Andrew Vagin wrote:
> Return back to the behavior of the upstream kernel.
> Currently we use mount namespaces and need nothing special here.
>
> Signed-off-by: Andrew Vagin ava...@openvz.org

Reviewed-by: Vladimir Davydov vdavy...@parallels.com

It's worth noting that this patch reverts commit d492bfa387237 (ve/vfs:
allow mount/umount, pivot_root with CAP_VE_SYS_ADMIN).

[Devel] [PATCH RHEL7 COMMIT] Revert devtmpfs: per-VE mounts introduced

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 3fd8ef28e629c3ec00144f83249628244903876d
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:58 2015 +0400

Revert devtmpfs: per-VE mounts introduced

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit e85a799b629d5e28c8931ddd9127cf18d501745c.

More devtmpfs virtualization crap to drop. Will be reworked.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
include/linux/ve.h
kernel/ve/ve.c
---
 drivers/base/devtmpfs.c | 28 ++--
 include/linux/device.h  |  4 
 include/linux/ve.h  |  3 ---
 kernel/ve/ve.c  |  8 
 4 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 6f4ba37..f59b798 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -23,8 +23,6 @@
 #include <linux/ramfs.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
-#include <linux/fs_struct.h>
-#include <linux/ve.h>
 #include "base.h"
 
 static struct task_struct *thread;
@@ -59,9 +57,9 @@ static struct dentry *dev_mount(struct file_system_type 
*fs_type, int flags,
  const char *dev_name, void *data)
 {
 #ifdef CONFIG_TMPFS
-   return mount_ns(fs_type, flags, data, shmem_fill_super);
+   return mount_single(fs_type, flags, data, shmem_fill_super);
 #else
-   return mount_ns(fs_type, flags, data, ramfs_fill_super);
+   return mount_single(fs_type, flags, data, ramfs_fill_super);
 #endif
 }
 
@@ -387,7 +385,6 @@ static int devtmpfsd(void *p)
    goto out;
    sys_chdir("/.."); /* will traverse into overmounted root */
    sys_chroot(".");
-   get_fs_root(current->fs, &get_exec_env()->devtmpfs_root);
    complete(&setup_done);
    while (1) {
    spin_lock(&req_lock);
@@ -408,33 +405,12 @@ static int devtmpfsd(void *p)
    spin_unlock(&req_lock);
    schedule();
    }
-   path_put(&get_exec_env()->devtmpfs_root);
    return 0;
 out:
    complete(&setup_done);
    return *err;
 }
 
-int ve_init_devtmpfs(void *data)
-{
-   struct ve_struct *ve = data;
-   struct vfsmount *mnt;
-
-   mnt = kern_mount_data(&dev_fs_type, ve);
-   if (IS_ERR(mnt))
-       return PTR_ERR(mnt);
-   ve->devtmpfs_root.mnt = mnt;
-   ve->devtmpfs_root.dentry = mnt->mnt_root;
-   return 0;
-}
-
-void ve_fini_devtmpfs(void *data)
-{
-   struct ve_struct *ve = data;
-
-   kern_unmount(ve->devtmpfs_root.mnt);
-}
-
 /*
  * Create devtmpfs instance, driver-core devices will add their device
  * nodes here.
diff --git a/include/linux/device.h b/include/linux/device.h
index df5152f..2c9c764 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1005,14 +1005,10 @@ extern void put_device(struct device *dev);
 extern int devtmpfs_create_node(struct device *dev);
 extern int devtmpfs_delete_node(struct device *dev);
 extern int devtmpfs_mount(const char *mntdir);
-extern int ve_init_devtmpfs(void *data);
-extern void ve_fini_devtmpfs(void *data);
 #else
 static inline int devtmpfs_create_node(struct device *dev) { return 0; }
 static inline int devtmpfs_delete_node(struct device *dev) { return 0; }
 static inline int

[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: destroy all nodes on ve stop

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0ea1f95684407db5892760b5a58a24003571f043
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:44 2015 +0400

ve/binfmt_misc: destroy all nodes on ve stop

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

Each registered binfmt_misc node pins binfmt_misc mount point, which in
turn pins the owner ve. This means that if we don't clean up binfmt_misc
nodes on ve stop, the mount point as well as the ve struct will leak.
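
For illustration only (not part of the patch): a user-space sketch of
the pinning problem and of the drain loop that fixes it. The list and
the pin counter are simplified stand-ins; register_node(),
kill_first_node() and fini_model() are hypothetical names, with
fini_model() mirroring the shape of ve_binfmt_fini() below.

```c
#include <stddef.h>

/* One registered binfmt_misc entry (reduced to its list linkage). */
struct node {
	struct node *next;
};

struct binfmt_misc {
	struct node *entries;	/* registered nodes */
	int mnt_pins;		/* pins the nodes hold on the mount point */
};

/* Registering a node pins the mount point once. */
static void register_node(struct binfmt_misc *bm, struct node *e)
{
	e->next = bm->entries;
	bm->entries = e;
	bm->mnt_pins++;
}

/* Removing the first node releases the pin it held. */
static void kill_first_node(struct binfmt_misc *bm)
{
	struct node *e = bm->entries;

	bm->entries = e->next;
	e->next = NULL;
	bm->mnt_pins--;
}

/* Same shape as ve_binfmt_fini(): drain the list until it is empty,
 * so that no pin on the mount point (and thus on the ve) survives. */
static void fini_model(struct binfmt_misc *bm)
{
	if (!bm)
		return;
	while (bm->entries)
		kill_first_node(bm);
}
```

Without the drain step the pin count never reaches zero after the
container stops, which is the leak described above.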

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 4487153..90c306e 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -752,16 +752,42 @@ static struct file_system_type bm_fs_type = {
 };
 MODULE_ALIAS_FS(binfmt_misc);
 
+static void ve_binfmt_fini(void *data)
+{
+   struct ve_struct *ve = data;
+   struct binfmt_misc *bm_data = ve->binfmt_misc;
+
+   if (!bm_data)
+       return;
+
+   /*
+    * XXX: Note we don't take any locks here. This is safe as long as
+    * nobody uses binfmt_misc outside the owner ve.
+    */
+   while (!list_empty(&bm_data->entries))
+       kill_node(bm_data, list_first_entry(
+           &bm_data->entries, Node, list));
+}
+
+static struct ve_hook ve_binfmt_hook = {
+   .fini       = ve_binfmt_fini,
+   .priority   = HOOK_PRIO_DEFAULT,
+   .owner      = THIS_MODULE,
+};
+
 static int __init init_misc_binfmt(void)
 {
    int err = register_filesystem(&bm_fs_type);
-   if (!err)
+   if (!err) {
    insert_binfmt(&misc_format);
+       ve_hook_register(VE_SS_CHAIN, &ve_binfmt_hook);
+   }
    return err;
 }
 
 static void __exit exit_misc_binfmt(void)
 {
+   ve_hook_unregister(&ve_binfmt_hook);
    unregister_binfmt(&misc_format);
    unregister_filesystem(&bm_fs_type);
 }

[Devel] [PATCH RHEL7 COMMIT] Revert ve/devtmpfs: Create required devices on container startup

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0cdfb581770d883cea99f30e49e3de1583ab6fc1
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:56 2015 +0400

Revert ve/devtmpfs: Create required devices on container startup

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 5cd1d17ff1b6a8f476ab6f4cd0a6830fbffe43f2.

We don't actually need separate null, zero, and other mem class devices
inside a VE. The patch being reverted added them merely for kdevtmpfs to
create nodes for these devices under /dev. This work can and should be
done by vzctl on container start, so drop this patch.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/char/mem.c | 20 ---
 kernel/ve/ve.c | 56 --
 2 files changed, 76 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index c486c83..a3653f7 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -30,7 +30,6 @@
 #include <linux/io.h>
 #include <linux/aio.h>
 #include <linux/security.h>
-#include <linux/ve.h>
 
 #include <asm/uaccess.h>
 
@@ -924,20 +923,7 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
return NULL;
 }
 
-#ifdef CONFIG_VE
-static struct class mem_class_base = {
-   .name       = "mem",
-   .devnode    = mem_devnode,
-   .ns_type    = &ve_ns_type_operations,
-   .namespace  = ve_namespace,
-   .owner      = THIS_MODULE,
-};
-
-struct class *mem_class = &mem_class_base;
-EXPORT_SYMBOL(mem_class);
-#else
 static struct class *mem_class;
-#endif
 
 static int __init chr_dev_init(void)
 {
@@ -951,17 +937,11 @@ static int __init chr_dev_init(void)
    if (register_chrdev(MEM_MAJOR, "mem", &memory_fops))
    printk("unable to get major %d for memory devs\n", MEM_MAJOR);
 
-#ifdef CONFIG_VE
-   err = class_register(&mem_class_base);
-   if (err)
-       return err;
-#else
    mem_class = class_create(THIS_MODULE, "mem");
    if (IS_ERR(mem_class))
    return PTR_ERR(mem_class);
 
    mem_class->devnode = mem_devnode;
-#endif
    for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
    if (!devlist[minor].name)
    continue;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 4cd1f8b..cdbb342 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -413,55 +413,6 @@ static void ve_drop_context(struct ve_struct *ve)
    ve->init_cred = NULL;
 }
 
-static const struct {
-   unsigned int minor;
-   char *name;
-} ve_mem_class_devices[] = {
-   {3, "null"},
-   {5, "zero"},
-   {7, "full"},
-   {8, "random"},
-   {9, "urandom"},
-};
-
-extern struct class *mem_class;
-
-static int ve_init_mem_class(struct ve_struct *ve)
-{
-   struct device *dev;
-   dev_t devt;
-   size_t i;
-
-   for (i = 0; i < ARRAY_SIZE(ve_mem_class_devices); i++) {
-       devt = MKDEV(MEM_MAJOR, ve_mem_class_devices[i].minor);
-       dev = device_create(mem_class, NULL, devt,
-               ve, ve_mem_class_devices[i].name);
-       if (IS_ERR(dev)) {
-           pr_err("Can't create %s (%d)\n",
-              ve_mem_class_devices[i].name,
-

Re: [Devel] [PATCH 1/3] cred: add ve_capable to check capabilities relative to the current VE

2015-08-28 Thread Vladimir Davydov

On Fri, Aug 28, 2015 at 05:20:01PM +0400, Andrew Vagin wrote:

> +bool ve_capable(int cap)
> +{
> + return ns_capable(get_exec_env()->init_cred->user_ns, cap);
> +}

init_cred is set in ve_grab_context, which means that if a task
occasionally uses ve_capable() before writing START to ve.state, the
kernel will panic. Please add a sanity check, which will make
ve_capable() fall back on capable() if init_cred is not available.
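
A minimal sketch of such a check, as a user-space model only. The
struct definitions, the stubbed ns_capable(), capable() and
get_exec_env(), and the last_check marker are all assumptions made so
the example is self-contained; a real kernel version would also have
to respect whatever synchronization protects init_cred.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures. */
struct user_namespace { int id; };
struct cred { struct user_namespace *user_ns; };
struct ve_struct { const struct cred *init_cred; };

/* Test scaffolding: records which check ran last.
 * 1 = ns_capable(), 2 = capable(). */
static int last_check;
static struct ve_struct current_ve;

/* Stubs standing in for the kernel helpers; they always grant. */
static bool ns_capable(struct user_namespace *ns, int cap)
{
	(void)ns;
	(void)cap;
	last_check = 1;
	return true;
}

static bool capable(int cap)
{
	(void)cap;
	last_check = 2;
	return true;
}

static struct ve_struct *get_exec_env(void)
{
	return &current_ve;
}

/* ve_capable() with the sanity check: fall back on capable() while
 * init_cred has not been set up yet. */
static bool ve_capable(int cap)
{
	const struct cred *cred = get_exec_env()->init_cred;

	if (!cred)
		return capable(cap);
	return ns_capable(cred->user_ns, cap);
}
```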
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

[Devel] [PATCH RHEL7 COMMIT] Revert ve/devtmpfs: pass proper options string

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0ffbb29c45f5ee709f4fa5dfa52f883cbe4a70f1
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:57 2015 +0400

Revert "ve/devtmpfs: pass proper options string"

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related devices and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys the VE devtmpfs mount, can
be called before a VE tty device is unregistered, resulting in a kernel panic:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert "ve/devtmpfs: Create required devices on container startup"
  Revert "ve/devtmpfs: pass proper options string"
  Revert "devtmpfs: containerize it with new obj ns operation"
  Revert "fs: add data pointer to mount_ns()"
  Revert "devtmpfs: per-VE mounts introduced"
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 1c6719b8aa075de4c9528811839d5f2595ef2994.

This is related to devtmpfs virtualization, which I'm going to drop.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c28e42c..0448af8 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -451,10 +451,9 @@ out:
 int ve_init_devtmpfs(void *data)
 {
struct ve_struct *ve = data;
-   char opts[] = "mode=0755";
struct vfsmount *mnt;
 
-   mnt = kern_mount_data(&dev_fs_type, opts);
+   mnt = kern_mount_data(&dev_fs_type, ve);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
ve->devtmpfs_root.mnt = mnt;

[Devel] [PATCH 2/3] ve: revise permissions to allow mount smth

2015-08-28 Thread Andrew Vagin

Return back to the behavior of the upstream kernel.
Currently we use mount namespaces and need nothing special here.

Signed-off-by: Andrew Vagin ava...@openvz.org
---
 fs/namespace.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 593b262..77a1ede 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1306,9 +1306,7 @@ static int do_umount(struct mount *mnt, int flags)
  */
 static inline bool may_mount(void)
 {
-   return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN) ||
- nsown_capable(CAP_SYS_ADMIN) ||
- nsown_capable(CAP_VE_SYS_ADMIN);
+   return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
 }
 
 /*
-- 
1.7.1


[Devel] [PATCH RHEL7 COMMIT] Revert nfsd/sunrpc/mqueue: use sb-s_ns instead of data in fill_super

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d0856fdc15e0b49540c454b42a11ddf2af70cda6
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:43 2015 +0400

Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls"
  Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
  Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super"
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

This reverts commit 610d54ccee1af63b1b361d18ec4ee9fa5230dea8.

Since commit 9e7411c5c3b5 was reverted, this one is no longer needed
either.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/nfsd/nfsctl.c  | 2 +-
 ipc/mqueue.c  | 2 +-
 net/sunrpc/rpc_pipe.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 7411a56..048d61d 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1113,7 +1113,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 #endif
/* last one */ {}
};
-   struct net *net = sb->s_ns;
+   struct net *net = data;
int ret;
 
ret = simple_fill_super(sb, 0x6e667364, nfsd_files);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c508938..6a8f37d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -309,7 +309,7 @@ err:
 static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 {
struct inode *inode;
-   struct ipc_namespace *ns = sb->s_ns;
+   struct ipc_namespace *ns = data;
 
sb->s_blocksize = PAGE_CACHE_SIZE;
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index b8f6185..79681e5 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1395,7 +1395,7 @@ rpc_fill_super(struct super_block *sb, void *data, int silent)
 {
struct inode *inode;
struct dentry *root, *gssd_dentry;
-   struct net *net = sb->s_ns;
+   struct net *net = data;
struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
int err;
 

[Devel] [PATCH 1/3] cred: add ve_capable to check capabilities relative to the current VE

2015-08-28 Thread Andrew Vagin

We want to allow a few operations in VE. Currently we use nsown_capable,
but it's wrong, because in this case we allow these operations in any
user namespace.
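
To make the difference concrete, here is a simplified userspace model of the
two checks. The struct layouts and the three-argument ns_capable() are
assumptions made for the sketch; only the namespace walk mirrors what the
kernel capability check does. A task that has unshared its own user
namespace inside a VE passes nsown_capable() but not ve_capable().

```c
#include <stdbool.h>
#include <stddef.h>

#define CAP_SYS_ADMIN 21

/* Simplified stand-ins for the kernel structures (assumed layout). */
struct user_ns {
	struct user_ns *parent; /* NULL for the initial namespace */
	unsigned int owner;     /* uid that created the namespace */
};

struct task {
	struct user_ns *user_ns; /* namespace of the task's credentials */
	unsigned int euid;
	unsigned long caps;      /* effective capability bitmask */
};

/* Walk from the target namespace up towards the initial one. */
bool ns_capable(const struct task *t, const struct user_ns *ns, int cap)
{
	for (;;) {
		/* Same namespace: the task's own capability set decides. */
		if (ns == t->user_ns)
			return (t->caps & (1UL << cap)) != 0;
		/* Reached the initial namespace without a match. */
		if (!ns->parent)
			return false;
		/* The creator of a namespace has all capabilities in it. */
		if (ns->parent == t->user_ns && ns->owner == t->euid)
			return true;
		ns = ns->parent;
	}
}

/* nsown_capable(): relative to the task's own user namespace. */
bool nsown_capable(const struct task *t, int cap)
{
	return ns_capable(t, t->user_ns, cap);
}

/* ve_capable(): relative to the user namespace of the VE's init. */
bool ve_capable(const struct task *t, const struct user_ns *ve_ns, int cap)
{
	return ns_capable(t, ve_ns, cap);
}
```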

Signed-off-by: Andrew Vagin ava...@openvz.org
---
 fs/autofs4/root.c  |6 ++
 fs/ioprio.c|2 +-
 fs/namei.c |2 +-
 include/linux/capability.h |1 +
 kernel/capability.c|   15 +++
 kernel/printk.c|5 ++---
 net/ipv6/sit.c |2 +-
 net/netfilter/nf_sockopt.c |2 +-
 security/commoncap.c   |4 ++--
 security/device_cgroup.c   |4 ++--
 10 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index 68e3edb..1462d8b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -588,8 +588,7 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
struct autofs_info *p_ino;

/* This allows root to remove symlinks */
-   if (!autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) &&
-   !capable(CAP_VE_SYS_ADMIN))
+   if (!autofs4_oz_mode(sbi) && !ve_capable(CAP_SYS_ADMIN))
return -EPERM;
 
if (atomic_dec_and_test(&ino->count)) {
@@ -837,8 +836,7 @@ static int autofs4_root_ioctl_unlocked(struct inode *inode, struct file *filp,
 _IOC_NR(cmd) - _IOC_NR(AUTOFS_IOC_FIRST) >= AUTOFS_IOC_COUNT)
return -ENOTTY;

-   if (!autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) &&
-   !capable(CAP_VE_SYS_ADMIN))
+   if (!autofs4_oz_mode(sbi) && !ve_capable(CAP_SYS_ADMIN))
return -EPERM;

switch(cmd) {
diff --git a/fs/ioprio.c b/fs/ioprio.c
index c876fad..f9d9187 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -75,7 +75,7 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
 
switch (class) {
case IOPRIO_CLASS_RT:
-   if (!capable(CAP_VE_ADMIN))
+   if (!ve_capable(CAP_SYS_ADMIN))
return -EPERM;
class = IOPRIO_CLASS_BE;
data = 0;
diff --git a/fs/namei.c b/fs/namei.c
index 8e29a44..e7d9f54 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3397,7 +3397,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
if (error)
return error;
 
-   if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
+   if ((S_ISCHR(mode) || S_ISBLK(mode)) && !ve_capable(CAP_MKNOD))
return -EPERM;
 
if (!dir->i_op->mknod)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 2b77384..b1131e3 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -217,6 +217,7 @@ extern bool has_ns_capability_noaudit(struct task_struct *t,
 extern bool capable(int cap);
 extern bool ns_capable(struct user_namespace *ns, int cap);
 extern bool nsown_capable(int cap);
+extern bool ve_capable(int cap);
 extern bool inode_capable(const struct inode *inode, int cap);
 extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
 
diff --git a/kernel/capability.c b/kernel/capability.c
index 0a843d5..e409594 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -16,6 +16,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <asm/uaccess.h>
+#include <linux/ve.h>
 
 /*
  * Leveraged for setting/resetting capabilities
@@ -396,6 +397,20 @@ bool ns_capable(struct user_namespace *ns, int cap)
 }
 EXPORT_SYMBOL(ns_capable);
 
+#if CONFIG_VE
+bool ve_capable(int cap)
+{
+   return ns_capable(get_exec_env()->init_cred->user_ns, cap);
+}
+#else
+bool ve_capable(int cap)
+{
+   return capable(cap);
+}
+#endif
+
+EXPORT_SYMBOL_GPL(ve_capable);
+
 /**
  * file_ns_capable - Determine if the file's opener had a capability in effect
  * @file:  The file we want to check
diff --git a/kernel/printk.c b/kernel/printk.c
index 44b3783..91766fc 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -468,14 +468,13 @@ static int check_syslog_permissions(int type, bool from_file)
return 0;
 
if (syslog_action_restricted(type)) {
-   if (nsown_capable(CAP_SYSLOG))
+   if (ve_capable(CAP_SYSLOG))
return 0;
/*
 * For historical reasons, accept CAP_SYS_ADMIN too, with
 * a warning.
 */
-   if (nsown_capable(CAP_SYS_ADMIN) ||
-   nsown_capable(CAP_VE_ADMIN)) {
+   if (ve_capable(CAP_SYS_ADMIN)) {
pr_warn_once("%s (%d): Attempt to access syslog with "
 "CAP_SYS_ADMIN but no CAP_SYSLOG "
 "(deprecated).\n",
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 8f4c52d..0cbb2b2 100644
---

[Devel] [PATCH 3/3] ve: remove ns_capable(CAP_VE.*)

2015-08-28 Thread Andrew Vagin

If we use user namespaces, we don't need to have special capabilities.

Signed-off-by: Andrew Vagin ava...@openvz.org
---
 fs/proc/root.c  |3 +--
 ipc/mqueue.c|3 +--
 ipc/util.c  |2 +-
 kernel/nsproxy.c|6 ++
 kernel/sys.c|4 ++--
 net/bridge/br_ioctl.c   |   33 +++--
 net/core/dev_ioctl.c|9 +++--
 net/core/ethtool.c  |3 +--
 net/core/rtnetlink.c|6 ++
 net/core/scm.c  |2 +-
 net/decnet/netfilter/dn_rtmsg.c |3 +--
 net/ipv4/arp.c  |3 +--
 net/ipv4/devinet.c  |6 ++
 net/ipv4/fib_frontend.c |2 +-
 net/ipv4/ip_sockglue.c  |3 +--
 net/ipv4/ip_tunnel.c|6 ++
 net/ipv4/netfilter/ip_tables.c  |   12 
 net/ipv6/addrconf.c |4 ++--
 net/ipv6/ip6_tunnel.c   |6 ++
 net/ipv6/netfilter/ip6_tables.c |   12 
 net/ipv6/route.c|2 +-
 net/ipv6/sit.c  |9 +++--
 net/key/af_key.c|3 +--
 net/netfilter/nfnetlink.c   |3 +--
 net/netlink/af_netlink.c|1 -
 net/netlink/genetlink.c |3 +--
 net/xfrm/xfrm_user.c|3 +--
 27 files changed, 53 insertions(+), 99 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0b7dbdb..923b398 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -121,8 +121,7 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
options = data;
 
if (!current_user_ns()->may_mount_proc ||
-   (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) &&
-!ns_capable(ns->user_ns, CAP_VE_SYS_ADMIN)))
+   (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)))
return ERR_PTR(-EPERM);
}
 
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c5f1d3e..657814c 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -335,8 +335,7 @@ static struct dentry *mqueue_mount(struct file_system_type *fs_type,
/* Don't allow mounting unless the caller has CAP_SYS_ADMIN
 * over the ipc namespace.
 */
-   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) &&
-   !ns_capable(ns->user_ns, CAP_VE_SYS_ADMIN))
+   if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
 
data = ns;
diff --git a/ipc/util.c b/ipc/util.c
index 795e05f..15e09aa 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -771,7 +771,7 @@ struct kern_ipc_perm *ipcctl_pre_down_nolock(struct ipc_namespace *ns,
 
euid = current_euid();
if (uid_eq(euid, ipcp->cuid) || uid_eq(euid, ipcp->uid)  ||
-   ns_capable(ns->user_ns, CAP_VE_SYS_ADMIN))
+   ns_capable(ns->user_ns, CAP_SYS_ADMIN))
return ipcp; /* successful lookup */
 err:
return ERR_PTR(err);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 81402a8..62aebc8 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -136,8 +136,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
CLONE_NEWPID | CLONE_NEWNET)))
return 0;
 
-   if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-   !ns_capable(user_ns, CAP_VE_SYS_ADMIN)) {
+   if (!ns_capable(user_ns, CAP_SYS_ADMIN)) {
err = -EPERM;
goto out;
}
@@ -198,8 +197,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
return 0;
 
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-   if (!ns_capable(user_ns, CAP_SYS_ADMIN) &&
-   !ns_capable(user_ns, CAP_VE_SYS_ADMIN))
+   if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
 
*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
diff --git a/kernel/sys.c b/kernel/sys.c
index 44f0295..a2d5644 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1604,7 +1604,7 @@ SYSCALL_DEFINE2(sethostname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];
 
-   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_VE_SYS_ADMIN))
+   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
 
if (len < 0 || len > __NEW_UTS_LEN)
@@ -1655,7 +1655,7 @@ SYSCALL_DEFINE2(setdomainname, char __user *, name, int, len)
int errno;
char tmp[__NEW_UTS_LEN];
 
-   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_VE_SYS_ADMIN))
+   if (!ns_capable(current->nsproxy->uts_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
if (len < 0 || len > __NEW_UTS_LEN)
return -EINVAL;
diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 45c4c22..98447b8 100644
--- a/net/bridge/br_ioctl.c

[Devel] [PATCH RHEL7 COMMIT] Revert ve/pty: containerize Unix98 pty drivers

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 1ff0db51541d3bf04c228025cb48de284adb78b2
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:31:49 2015 +0400

Revert "ve/pty: containerize Unix98 pty drivers"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already
virtualized on the VFS layer, nothing needs to be done on the driver's
side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch (ve/pty: create ptmx device per ve namespace) addresses this.

===
This patch description:

This reverts commit 79b66035f81e1c8996f2524f26af096e44e2ae4b.

Conflicts:
kernel/ve/ve.c

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 kernel/ve/ve.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index bdfa30d..5025149 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -449,10 +449,6 @@ int ve_start_container(struct ve_struct *ve)
if (err)
goto err_legacy_pty;
 
-   err = ve_unix98_pty_init(ve);
-   if (err)
-   goto err_unix98_pty;
-
err = ve_tty_console_init(ve);
if (err)
goto err_tty_console;
@@ -472,8 +468,6 @@ int ve_start_container(struct ve_struct *ve)
 err_iterate:
ve_tty_console_fini(ve);
 err_tty_console:
-   ve_unix98_pty_fini(ve);
-err_unix98_pty:
ve_legacy_pty_fini(ve);
 err_legacy_pty:
ve_stop_umh(ve);
@@ -506,7 +500,6 @@ void ve_stop_ns(struct pid_namespace *pid_ns)
ve->is_running = 0;
 
ve_tty_console_fini(ve);
-   ve_unix98_pty_fini(ve);
ve_legacy_pty_fini(ve);
 
ve_stop_umh(ve);

[Devel] [PATCH RHEL7 COMMIT] Revert pty: split Unix98 init routines

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit ee5a5380520330fedde1a323d5ca3cb5cad20b4f
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:32:03 2015 +0400

Revert "pty: split Unix98 init routines"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already
virtualized on the VFS layer, nothing needs to be done on the driver's
side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch (ve/pty: create ptmx device per ve namespace) addresses this.

===
This patch description:

This reverts commit 3aec66abd43440bc7dd4c6bbe84734adb6d82851.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c | 100 --
 1 file changed, 15 insertions(+), 85 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 56c0a21..bd17a45 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -820,62 +820,25 @@ err_file:
 
 static struct file_operations ptmx_fops;
 
-static void __unix98_unregister_ptmx(void)
-{
-   unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1);
-   cdev_del(&ptmx_cdev);
-}
-
-static int __unix98_register_ptmx(void)
- {
-   int err;
-
-   cdev_init(&ptmx_cdev, &ptmx_fops);
-   err = cdev_add(&ptmx_cdev, MKDEV(TTYAUX_MAJOR, 2), 1);
-   if (err) {
-   printk(KERN_ERR "Couldn't add /dev/ptmx device");
-   return err;
-   }
-   err = register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, "/dev/ptmx");
-   if (err < 0) {
-   printk(KERN_ERR "Couldn't register /dev/ptmx driver");
-   goto err_ptmx_register;
-   }
-   return 0;
-
-err_ptmx_register:
-   cdev_del(&ptmx_cdev);
-   return err;
-}
-
-static int __unix98_pty_init(struct tty_driver **ptm_driver_p,
-   struct tty_driver **pts_driver_p)
+static void __init unix98_pty_init(void)
 {
-   struct tty_driver *ptm_driver, *pts_driver;
-   int err;
-   struct device *dev;
-
ptm_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX,
TTY_DRIVER_RESET_TERMIOS |
TTY_DRIVER_REAL_RAW |
TTY_DRIVER_DYNAMIC_DEV |
TTY_DRIVER_DEVPTS_MEM |
TTY_DRIVER_DYNAMIC_ALLOC);
-   if (IS_ERR(ptm_driver)) {
-   printk(KERN_ERR "Couldn't allocate Unix98 ptm driver");
-   return PTR_ERR(ptm_driver);
-   }
+   if (IS_ERR(ptm_driver))
+   panic("Couldn't allocate Unix98 ptm driver");
pts_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX,
TTY_DRIVER_RESET_TERMIOS |
TTY_DRIVER_REAL_RAW |
TTY_DRIVER_DYNAMIC_DEV |
TTY_DRIVER_DEVPTS_MEM |
TTY_DRIVER_DYNAMIC_ALLOC);
-   if (IS_ERR(pts_driver)) {
-   printk(KERN_ERR "Couldn't allocate Unix98 pts driver");
-   err = PTR_ERR(pts_driver);
-   goto err_pts_alloc;
-   }
+   if (IS_ERR(pts_driver))
+   panic("Couldn't allocate Unix98 pts driver");
+
ptm_driver->driver_name = "pty_master";
ptm_driver->name = "ptm";
ptm_driver->major = UNIX98_PTY_MASTER_MAJOR;
@@ -905,53 +868,20 @@ static int __unix98_pty_init(struct tty_driver **ptm_driver_p,
pts_driver->other = ptm_driver;
tty_set_operations(pts_driver, &pty_unix98_ops);
 
-   err = tty_register_driver(ptm_driver);
-   if (err) {
-   printk(KERN_ERR "Couldn't register Unix98 ptm driver");
-   goto err_ptm_register;
-   }
-   err = tty_register_driver(pts_driver);
-   if (err) {
-   printk(KERN_ERR "Couldn't register Unix98 pts driver");
-   goto err_pts_register;
-   }
+   if (tty_register_driver(ptm_driver))
+   panic("Couldn't register Unix98 ptm driver");
+   if (tty_register_driver(pts_driver))
+   panic("Couldn't register Unix98 pts driver");
 
/* Now create the /dev/ptmx special device */
tty_default_fops(&ptmx_fops);
ptmx_fops.open = ptmx_open;
 
-   err = __unix98_register_ptmx();
-   if (err)
-   goto err_ptmx_register;
-
-   dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR,

[Devel] [PATCH RHEL7 COMMIT] ve/radix-tree: do not account radix_tree_nodes to memcg

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d4b302e64d3523bddf4e300d0a975a7717ac784b
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:44:29 2015 +0400

ve/radix-tree: do not account radix_tree_nodes to memcg

There are two problems if they are accounted.

First, radix_tree_nodes allocated by tcache/tswap for storing their
internal data will be accounted to the container that issued a store,
which is wrong, because they can only get reclaimed on global pressure.
Using __GFP_NOACCOUNT in tcache/tswap wouldn't help due to per cpu
radix_tree_node preloads.

Second, workingset detection logic (see mm/workingset.c) is still not
memory cgroup aware. In particular, this means that shadow
radix_tree_nodes can only be reclaimed on global memory pressure
although they are accounted to a memory cgroup. As a result, after
reading a huge file, all the container's memory can get filled with
shadow entries, which won't be reclaimed on local memory pressure,
making the container unusable.

This is a quick-fix which makes radix_tree_nodes unaccountable. This is
acceptable for now, because we had never accounted radix_tree_nodes
before Vz7 anyway. The true fix would be (a) making radix_tree_node
preloads unaccountable (or per memory cgroup) and (b) making workingset
detection logic memory cgroup aware. This should and will be done
upstream first.
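
The first problem can be pictured with a small userspace model of the
preload pool. Everything here is a sketch under stated assumptions: the
flag bit values, struct memcg, POOL_SIZE and the function names are made up
for the example, and the real allocator has more conditions on when the
pool is used.

```c
#include <stddef.h>

typedef unsigned int gfp_t;

#define GFP_KERNEL      0x1u /* placeholder bit values for this model */
#define __GFP_NOACCOUNT 0x2u

struct memcg {
	size_t charged_nodes; /* nodes currently charged to this cgroup */
};

#define POOL_SIZE 4
static int pool_nr; /* nodes sitting in the (per cpu) preload pool */

/* Charge one node to cg unless the caller opted out of accounting. */
static void charge_node(struct memcg *cg, gfp_t gfp_mask)
{
	if (!(gfp_mask & __GFP_NOACCOUNT))
		cg->charged_nodes++;
}

/* Fill the preload pool; this is where the nodes get charged. */
void preload(struct memcg *current_cg, gfp_t gfp_mask)
{
	while (pool_nr < POOL_SIZE) {
		charge_node(current_cg, gfp_mask);
		pool_nr++;
	}
}

/* Allocate a node: preloaded nodes are handed out first. */
void node_alloc(struct memcg *current_cg, gfp_t gfp_mask)
{
	if (pool_nr > 0) {
		/* The charge was decided in preload(); gfp_mask is ignored. */
		pool_nr--;
		return;
	}
	charge_node(current_cg, gfp_mask);
}
```

In this model a store that passes __GFP_NOACCOUNT still consumes a node
that was charged to the container at preload time, which is why the flag
has to be forced inside the radix tree allocation paths themselves.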

https://jira.sw.ru/browse/PSBM-35205

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 lib/radix-tree.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index dd3347f..4b362cb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -228,7 +228,8 @@ radix_tree_node_alloc(struct radix_tree_root *root)
}
}
if (ret == NULL)
-   ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+   ret = kmem_cache_alloc(radix_tree_node_cachep,
+  gfp_mask | __GFP_NOACCOUNT);
 
BUG_ON(radix_tree_is_indirect_ptr(ret));
return ret;
@@ -279,7 +280,8 @@ static int __radix_tree_preload(gfp_t gfp_mask)
rtp = &__get_cpu_var(radix_tree_preloads);
while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
preempt_enable();
-   node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+   node = kmem_cache_alloc(radix_tree_node_cachep,
+   gfp_mask | __GFP_NOACCOUNT);
if (node == NULL)
goto out;
preempt_disable();

[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0845747ebe2654d1e6e56a0425b21e599a47f4f6
Author: Mel Gorman mgor...@suse.de
Date:   Fri Aug 28 18:50:29 2015 +0400

ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY


This patch fixes memcg overreclaim w/o tswap/zswap as described in:

https://jira.sw.ru/browse/PSBM-35275

Memcg overreclaim still happens if tswap or zswap is used. This case is
to be investigated yet, however, this patch is definitely worth pulling.


Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
ensured that file/anon lists were scanned proportionally for reclaim from
kswapd but ignored it for direct reclaim.  The intent was to minimise
direct reclaim latency but Yuanhan Liu pointed out that it substitutes one
long stall for many small stalls and distorts aging for normal workloads
like streaming readers/writers.  Hugh Dickins pointed out that a
side-effect of the same commit was that when one LRU list dropped to zero
the entirety of the other list was shrunk, leading to excessive
reclaim in memcgs.  This patch scans the file/anon lists proportionally
for direct reclaim to similarly age pages whether reclaimed by kswapd or
direct reclaim but takes care to abort reclaim if one LRU drops to zero
after reclaiming the requested number of pages.
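
The proportional adjustment can be sketched in userspace as follows. This is
a simplified model, not the shrink_lruvec() code: it collapses the active
and inactive lists of each type into one counter, and struct lru_scan and
adjust_scan() are names invented for the example.

```c
/* Scan state for one LRU type (file or anon). */
struct lru_scan {
	unsigned long target; /* pages originally requested to scan */
	unsigned long left;   /* pages still to scan */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Called once the requested number of pages has been reclaimed.
 * Returns 0 if reclaim should simply stop because one list is
 * already exhausted, 1 if the remaining work was rescaled.
 */
int adjust_scan(struct lru_scan *file, struct lru_scan *anon)
{
	struct lru_scan *small, *big;
	unsigned long percentage, scanned;

	/* It is fine to abort when one list has dropped to zero. */
	if (!file->left || !anon->left)
		return 0;

	if (file->left > anon->left) {
		small = anon;
		big = file;
	} else {
		small = file;
		big = anon;
	}

	/* Share of the smaller list's target that is still unscanned. */
	percentage = small->left * 100 / (small->target + 1);

	/* Stop scanning the smaller list. */
	small->left = 0;

	/* Scan the other list up to the same fraction of its target. */
	scanned = big->target - big->left;
	big->left = big->target * (100 - percentage) / 100;
	big->left -= min_ul(big->left, scanned);
	return 1;
}
```

For example, with a file target of 1000 pages (900 left) and an anon target
of 100 pages (68 left), anon scanning stops and 230 file pages remain to be
scanned, so both lists end up having covered roughly a third of their
targets.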

Based on ext4 and using the Intel VM scalability test

                                           3.15.0-rc5            3.15.0-rc5
                                             shrinker            proportion
Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)

The test cases are running multiple dd instances reading sparse files. The
results are within the noise for the small test machine. The impact of the
patch is more noticeable from the vmstats

                            3.15.0-rc5  3.15.0-rc5
                              shrinker  proportion
Minor Faults                     35154       36784
Major Faults                       611        1305
Swap Ins                           394        1651
Swap Outs                         4394        5891
Allocation stalls               118616       44781
Direct pages scanned           4935171     4602313
Kswapd pages scanned          15921292    16258483
Kswapd pages reclaimed        15913301    16248305
Direct pages reclaimed         4933368     4601133
Kswapd efficiency                  99%         99%
Kswapd velocity             670088.047  682555.961
Direct efficiency                  99%         99%
Direct velocity             207709.217  193212.133
Percentage direct scans            23%         22%
Page writes by reclaim        4858.000    6232.000
Page writes file                   464         341
Page writes anon                  4394        5891

Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.
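
The figures quoted above can be cross-checked with a few lines of integer
arithmetic. The sketch assumes that "efficiency" is reclaimed pages as a
truncated percentage of scanned pages; the helper names are invented for
the example.

```c
/* Reclaimed pages as a truncated percentage of scanned pages. */
unsigned long long efficiency(unsigned long long reclaimed,
			      unsigned long long scanned)
{
	return scanned ? reclaimed * 100 / scanned : 100;
}

/* Truncated percentage by which a counter dropped from before to after. */
unsigned long long percent_drop(unsigned long long before,
				unsigned long long after)
{
	if (!before || after >= before)
		return 0;
	return (before - after) * 100 / before;
}
```

With the numbers from the table, efficiency(4933368, 4935171) and
efficiency(4601133, 4602313) both give 99, percent_drop(118616, 44781)
gives 62 for allocation stalls, and percent_drop(4935171, 4602313) gives 6
for direct pages scanned.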

Signed-off-by: Mel Gorman mgor...@suse.de
Cc: Johannes Weiner han...@cmpxchg.org
Cc: Hugh Dickins hu...@google.com
Cc: Tim Chen tim.c.c...@linux.intel.com
Cc: Dave Chinner da...@fromorbit.com
Tested-by: Yuanhan Liu yuanhan@linux.intel.com
Cc: Bob Liu bob@oracle.com
Cc: Jan Kara j...@suse.cz
Cc: Rik van Riel r...@redhat.com
Cc: Al Viro v...@zeniv.linux.org.uk
Signed-off-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Linus Torvalds torva...@linux-foundation.org
(cherry picked from commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/vmscan.c | 36 +---
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b4c98f..2bb62ce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2129,13 +2129,27 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc,
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct blk_plug plug;
-   bool scan_adjusted = false;
+   bool scan_adjusted;
 
get_scan_count(lruvec, sc, nr, lru_pages);
 
/* Record the original scan target for proportional adjustments

[Devel] [PATCH RHEL7 COMMIT] memcg: fix swap_max calculation for nested cgroups

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 338ce9637d706f2bf01ef9153b78953ff65c2efb
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:36:03 2015 +0400

memcg: fix swap_max calculation for nested cgroups

If there is a sub-memcg in a container, its swapout won't update
swap_max of the container's memcg, because we don't ascend the memcg
hierarchy in mem_cgroup_update_swap_max. This patch fixes this issue.
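
A userspace model of the issue, for illustration only: the struct below is a
stand-in for struct mem_cgroup with plain counters instead of res_counters,
and charge_swap() is a hypothetical helper that mimics hierarchical
charging.

```c
#include <stddef.h>

/* Stand-in for struct mem_cgroup (assumed, heavily simplified). */
struct mem_cgroup {
	struct mem_cgroup *parent;       /* NULL at the top of the hierarchy */
	unsigned long long memsw_usage;  /* memory + swap usage */
	unsigned long long mem_usage;    /* memory usage */
	unsigned long long swap_max;     /* swap usage high-water mark */
};

/* Usage is charged hierarchically: every ancestor sees it too. */
void charge_swap(struct mem_cgroup *memcg, unsigned long long bytes)
{
	for (; memcg; memcg = memcg->parent)
		memcg->memsw_usage += bytes;
}

/* Old behaviour: only the cgroup that swapped out is updated. */
void update_swap_max_old(struct mem_cgroup *memcg)
{
	long long swap = memcg->memsw_usage - memcg->mem_usage;

	if (swap > (long long)memcg->swap_max)
		memcg->swap_max = swap;
}

/* Fixed behaviour: walk up so the container's memcg is updated as well. */
void update_swap_max(struct mem_cgroup *memcg)
{
	for (; memcg; memcg = memcg->parent) {
		long long swap = memcg->memsw_usage - memcg->mem_usage;

		if (swap > (long long)memcg->swap_max)
			memcg->swap_max = swap;
	}
}
```

After a swapout charged to a child cgroup, the old variant leaves the
container's swap_max untouched, while the fixed one raises it to the
container's current swap usage.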

Fixes: a74376e2dde13 ("bc/memcg: show correct swap max for beancounters")
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5f3e0ac..7fc2931 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -903,12 +903,14 @@ static void mem_cgroup_update_swap_max(struct mem_cgroup *memcg)
 {
long long swap;
 
-   swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
-   res_counter_read_u64(&memcg->res, RES_USAGE);
+   for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+   swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
+   res_counter_read_u64(&memcg->res, RES_USAGE);
 
-   /* This is racy, but we don't have to be absolutely precise */
-   if (swap > (long long)memcg->swap_max)
-   memcg->swap_max = swap;
+   /* This is racy, but we don't have to be absolutely precise */
+   if (swap > (long long)memcg->swap_max)
+   memcg->swap_max = swap;
+   }
 }
 
 static void mem_cgroup_inc_failcnt(struct mem_cgroup *memcg,

[Devel] [PATCH rh7] Revert diff-writeback-throttle-writer-when-local-BDI-threshold-is-hit bits

2015-08-28 Thread Vladimir Davydov

This was brought by the initial commit 2a8b5de95918, but it is
incomplete - the following hunk patching balance_dirty_pages was lost:

 diff --git a/mm/page-writeback.c b/mm/page-writeback.c
 index 003b68e..a58795c 100644
 --- a/mm/page-writeback.c
 +++ b/mm/page-writeback.c
 @@ -546,7 +546,8 @@ static void balance_dirty_pages(struct address_space *mapping,
   * catch-up. This avoids (excessively) small writeouts
   * when the bdi limits are ramping up.
   */
 - if (nr_reclaimable + nr_writeback <
 + if (bdi_cap_account_writeback(bdi) &&
 + nr_reclaimable + nr_writeback <
   (background_thresh + dirty_thresh) / 2 &&
   ub_dirty + ub_writeback <
   (ub_background_thresh + ub_thresh) / 2)

I've filed a separate issue for porting it:

https://jira.sw.ru/browse/PSBM-39167

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/fs-writeback.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9cdcc28b2ee5..66586a4f32de 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -843,9 +843,6 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 {
unsigned long background_thresh, dirty_thresh;
 
-   if (!bdi_cap_account_writeback(bdi) && bdi->dirty_exceeded)
-   return true;
-
global_dirty_limits(&background_thresh, &dirty_thresh);
 
if (global_page_state(NR_FILE_DIRTY) +
-- 
2.1.4


Re: [Devel] [PATCH 3/3] ve: remove ns_capable(CAP_VE.*)

2015-08-28 Thread Vladimir Davydov

On Fri, Aug 28, 2015 at 05:20:03PM +0400, Andrew Vagin wrote:
> If we use user namespaces, we don't need to have special capabilities.
> 
> Signed-off-by: Andrew Vagin ava...@openvz.org

Lovely :-) Although it'd be even better if you reverted all the patches
tampering with capability checks one-by-one so that it'd be easier to drop
them during the next rebase. Anyway,

Reviewed-by: Vladimir Davydov vdavy...@parallels.com

A couple of notes regarding this patch set.

It seems CAP_VE_ADMIN and CAP_VE_NET_ADMIN are not used anymore. Let's
drop them?

Also, you forgot to revert commit 1875887f263e ("ve: caps: ignore
setting wrong caps with CAP_SETPCAP"), please do it in a separate patch.

[Devel] [PATCH RHEL7 COMMIT] Revert "ve/pty: containerize Unix98 driver"

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit fd19fc2c70ae5da0a0902dea96213f52dc6afbfd
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:31:56 2015 +0400

Revert "ve/pty: containerize Unix98 driver"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already
virtualized on the VFS layer, nothing needs to be done on the driver's
side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

This reverts commit 1b2c1fe8428715c3b5ec0a94d0568b5a5c526032.

Conflicts:
include/linux/ve.h

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c   | 88 ++---
 include/linux/tty.h |  6 ++--
 include/linux/ve.h  |  6 
 3 files changed, 32 insertions(+), 68 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 7afb822..56c0a21 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -23,10 +23,15 @@
 #include <linux/devpts_fs.h>
 #include <linux/slab.h>
 #include <linux/mutex.h>
-#include <linux/ve.h>
 
 #include <bc/misc.h>
 
+#ifdef CONFIG_UNIX98_PTYS
+static struct tty_driver *ptm_driver;
+static struct tty_driver *pts_driver;
+static DEFINE_MUTEX(devpts_mutex);
+#endif
+
 static void pty_close(struct tty_struct *tty, struct file *filp)
 {
 	BUG_ON(!tty);
@@ -53,11 +58,11 @@ static void pty_close(struct tty_struct *tty, struct file *filp)
 	if (tty->driver->subtype == PTY_TYPE_MASTER) {
 		set_bit(TTY_OTHER_CLOSED, &tty->flags);
 #ifdef CONFIG_UNIX98_PTYS
-		if (tty->driver == tty->driver->ve->ptm_driver) {
-			mutex_lock(&tty->driver->ve->devpts_mutex);
+		if (tty->driver == ptm_driver) {
+			mutex_lock(&devpts_mutex);
 			if (tty->link->driver_data)
 				devpts_pty_kill(tty->link->driver_data);
-			mutex_unlock(&tty->driver->ve->devpts_mutex);
+			mutex_unlock(&devpts_mutex);
 		}
 #endif
 		tty_unlock(tty);
@@ -669,9 +674,9 @@ static struct tty_struct *pts_unix98_lookup(struct tty_driver *driver,
 {
 	struct tty_struct *tty;
 
-	mutex_lock(&driver->ve->devpts_mutex);
+	mutex_lock(&devpts_mutex);
 	tty = devpts_get_priv(pts_inode);
-	mutex_unlock(&driver->ve->devpts_mutex);
+	mutex_unlock(&devpts_mutex);
 	/* Master must be open before slave */
 	if (!tty)
 		return ERR_PTR(-EIO);
@@ -748,7 +753,6 @@ static int ptmx_open(struct inode *inode, struct file *filp)
 	struct inode *slave_inode;
 	int retval;
 	int index;
-	struct ve_struct *ve = (inode->i_sb->s_ns) ? : get_exec_env();
 
 	nonseekable_open(inode, filp);
 
@@ -760,18 +764,18 @@ static int ptmx_open(struct inode *inode, struct file *filp)
 		return retval;
 
 	/* find a device that is not in use. */
-	mutex_lock(&ve->devpts_mutex);
+	mutex_lock(&devpts_mutex);
 	index = devpts_new_index(inode);
 	if (index < 0) {
 		retval = index;
-		mutex_unlock(&ve->devpts_mutex);
+		mutex_unlock(&devpts_mutex);
 		goto err_file;
 	}
 
-	mutex_unlock(&ve->devpts_mutex);
+	mutex_unlock(&devpts_mutex);
 
 	mutex_lock(&tty_mutex);
-	tty = tty_init_dev(ve->ptm_driver, index);
+	tty = tty_init_dev(ptm_driver, index);
 
 	if (IS_ERR(tty)) {
 		retval = PTR_ERR(tty);
@@ -796,7 +800,7 @@ static int ptmx_open(struct inode *inode, struct file *filp)
 	}
 	tty->link->driver_data = slave_inode;
 
-	retval = ve->ptm_driver->ops->open(tty, filp);
+	retval = ptm_driver->ops->open(tty, filp);
 	if (retval)
 		goto err_release;
 
@@ -816,22 +820,16 @@ err_file:
 
 static struct file_operations ptmx_fops;
 
-static void __unix98_unregister_ptmx(struct ve_struct *ve)
+static void __unix98_unregister_ptmx(void)
 {
-	if (!ve_is_super(ve))
-		return;
-
 	unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1);
 	cdev_del(&ptmx_cdev);
 }
 
-static int __unix98_register_ptmx(struct ve_struct *ve)
-{
+static int __unix98_register_ptmx(void)
+ {
int

[Devel] [PATCH 2/2] fs: allow to mount devtmpfs in a non-root userns (v2)

2015-08-28 Thread Andrew Vagin

devtmpfs is virtualized, so it should be safe to mount it from a non-root
user namespace.

v2: fix return code

Signed-off-by: Andrew Vagin ava...@openvz.org
---
 drivers/base/devtmpfs.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c28e42c..f21e292 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -58,6 +58,9 @@ __setup("devtmpfs.mount=", mount_param);
 static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
 		      const char *dev_name, void *data)
 {
+	if (get_exec_env()->init_cred->user_ns != current_user_ns())
+		return ERR_PTR(-EPERM);
+
 #ifdef CONFIG_TMPFS
 	return mount_ns(fs_type, flags, data, get_exec_env(), shmem_fill_super);
 #else
@@ -69,7 +72,7 @@ static struct file_system_type dev_fs_type = {
 	.name = "devtmpfs",
 	.mount = dev_mount,
 	.kill_sb = kill_litter_super,
-	.fs_flags = FS_VIRTUALIZED,
+	.fs_flags = FS_VIRTUALIZED | FS_USERNS_MOUNT | FS_USERNS_DEV_MOUNT,
 };
 
 #ifdef CONFIG_BLOCK
-- 
1.7.1
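
A note on the return convention used in the dev_mount() hunk above: because
the function returns a pointer, the error is encoded in the pointer value
itself. The following user-space sketch mimics that convention. All sketch_*
names and SKETCH_MAX_ERRNO are made up for illustration; they are not the
kernel's definitions (those live in include/linux/err.h), and sketch_mount()
is only a hypothetical stand-in for the permission check.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: errno values fit in the top 4095 addresses, as in the kernel. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno as a pointer (assumes long and pointers have the same width). */
void *sketch_err_ptr(long error)
{
	return (void *)error;
}

/* Decode the errno back out of an error pointer. */
long sketch_ptr_err(const void *ptr)
{
	return (long)ptr;
}

/* A pointer is an error if it falls in the last SKETCH_MAX_ERRNO addresses. */
bool sketch_is_err(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/*
 * Hypothetical stand-in for the check added to dev_mount(): refuse with
 * -EPERM when the caller is not in the expected user namespace, otherwise
 * hand back the object that would have been mounted.
 */
void *sketch_mount(bool same_userns, void *sb)
{
	if (!same_userns)
		return sketch_err_ptr(-EPERM);
	return sb;
}
```

A caller would test the result with sketch_is_err() before using it, in the
same way the ptmx hook patch later in this thread tests device_create() with
IS_ERR() and reports PTR_ERR().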


[Devel] [PATCH RHEL7 COMMIT] ve/pty: create ptmx device per ve namespace

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 953017eb9e8237859f63d7b0a2c816b7e7e5a615
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:32:16 2015 +0400

ve/pty: create ptmx device per ve namespace

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already
virtualized on the VFS layer, nothing needs to be done on the driver's
side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

After Unix98 PTY driver virtualization was reverted, we have to
manually set sysfs permissions for ptmx. This, however, is currently
impossible, because tty_class is still virtualized, which makes
ve.sysfs_permissions ignore it (see sysfs_perms_set).

This patch is a quick-fix which simply creates/destroys ptmx device in
ve namespace on container start/stop. It must be dropped when commit
6022450d12653 ("ve/tty: make tty_class VE-namespace aware") is reverted.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index bd17a45..529046b 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -818,6 +818,32 @@ err_file:
 	return retval;
 }
 
+static int ve_unix98_pty_init(void *data)
+{
+	struct ve_struct *ve = data;
+	struct device *dev;
+
+	dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), ve, "ptmx");
+	if (IS_ERR(dev)) {
+		pr_warn("Failed to create ptmx device for ve %s: %ld\n",
+			ve->ve_name, PTR_ERR(dev));
+		return PTR_ERR(dev);
+	}
+	return 0;
+}
+
+static void ve_unix98_pty_fini(void *data)
+{
+	device_destroy_namespace(tty_class, MKDEV(TTYAUX_MAJOR, 2), data);
+}
+
+static struct ve_hook ve_unix98_pty_hook = {
+	.init		= ve_unix98_pty_init,
+	.fini		= ve_unix98_pty_fini,
+	.priority	= HOOK_PRIO_DEFAULT,
+	.owner		= THIS_MODULE,
+};
+
 static struct file_operations ptmx_fops;
 
 static void __init unix98_pty_init(void)
@@ -882,6 +908,7 @@ static void __init unix98_pty_init(void)
 	    register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, "/dev/ptmx") < 0)
 		panic("Couldn't register /dev/ptmx driver");
 	device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), NULL, "ptmx");
+	ve_hook_register(VE_SS_CHAIN, &ve_unix98_pty_hook);
 }
 
 #else

[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: never isolate more pages than necessary

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 703ed09d7ee4d9af6cec3c4970842f282176f5e0
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:50:33 2015 +0400

ms/mm/vmscan: never isolate more pages than necessary


Along with "[PATCH rh7] mm: vmscan: use proportional scanning during
direct reclaim and full scan at DEF_PRIORITY", this should fix

https://jira.sw.ru/browse/PSBM-35275

I submitted this patch upstream (https://lkml.org/lkml/2015/8/3/404) and
it was merged into the mmotm tree. Hopefully, it will get merged into
Linus's tree soon.


If transparent huge pages are enabled, we can isolate many more pages
than we actually need to scan, because we count both single and huge
pages equally in isolate_lru_pages().

Since commit 5bc7b8aca942d ("mm: thp: add split tail pages to shrink
page list in page reclaim"), we scan all the tail pages immediately
after a huge page split (see shrink_page_list()). As a result, we can
reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run!

This is easy to catch on memcg reclaim with zswap enabled. The latter
makes swapout instant so that if we happen to scan an unreferenced huge
page we will evict both its head and tail pages immediately, which is
likely to result in excessive reclaim.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bb62ce..7beadf5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1297,7 +1297,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long scan;
 
-	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
+					!list_empty(src); scan++) {
 		struct page *page;
 		int nr_pages;
 
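
To make the effect of the changed loop condition in the vmscan patch above
easier to see, here is a small user-space model. It is only a sketch: struct
fake_page, isolate_old() and isolate_new() are invented names, the LRU list
is replaced by a plain array, and each entry simply records how many base
pages it accounts for (1 for a small page, 512 assumed for a huge page).

```c
#include <stddef.h>

/* Hypothetical stand-in for an LRU entry. */
struct fake_page {
	unsigned long nr_pages;	/* 1 for a small page, 512 for a huge page */
};

/* Old condition: stop only after nr_to_scan list entries were visited. */
unsigned long isolate_old(const struct fake_page *lru, size_t len,
			  unsigned long nr_to_scan)
{
	unsigned long nr_taken = 0;
	unsigned long scan;

	for (scan = 0; scan < nr_to_scan && scan < len; scan++)
		nr_taken += lru[scan].nr_pages;
	return nr_taken;
}

/* Patched condition: also stop as soon as nr_taken reaches nr_to_scan. */
unsigned long isolate_new(const struct fake_page *lru, size_t len,
			  unsigned long nr_to_scan)
{
	unsigned long nr_taken = 0;
	unsigned long scan;

	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
			scan < len; scan++)
		nr_taken += lru[scan].nr_pages;
	return nr_taken;
}
```

With 32 huge entries and nr_to_scan = 32, the old loop in this model takes
32 * 512 = 16384 base pages, while the patched loop stops after the first
entry with 512. For a list of small pages both variants take 32.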

[Devel] [PATCH RHEL7 COMMIT] ve/net: Fix vlan NETIF_F_VIRTUAL feature initialization

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 3e11f3abe191cb393cd8c025913e6a9b739fcabe
Author: Kirill Tkhai ktk...@odin.com
Date:   Sat Aug 29 02:30:46 2015 +0400

ve/net: Fix vlan NETIF_F_VIRTUAL feature initialization

vlan_setup() is called when dev's net hasn't been set yet:

rtnl_create_link
  alloc_netdev_mqs
    dev_net_set(dev, &init_net)
    vlan_setup
       ...
       if (!ve_is_super(dev_net(dev)->owner_ve))
          dev->features |= NETIF_F_VIRTUAL
       ...
  dev_net_set(dev, net)

So the vlan dev does not get the NETIF_F_VIRTUAL feature, and the later
ve_is_dev_movable() check fails.

This patch sets the feature unconditionally, independent of dev_net().
The flag is only tested later when the ve is not super anyway, and other
drivers (loopback, for example) already set it unconditionally.

https://jira.sw.ru/browse/PSBM-35266

Signed-off-by: Kirill Tkhai ktk...@odin.com
Acked-by: Andrew Vagin ava...@odin.com
---
 net/8021q/vlan_dev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 80fa918..09205c3 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -794,6 +794,5 @@ void vlan_setup(struct net_device *dev)
 	dev->ethtool_ops	= &vlan_ethtool_ops;
 
 	memset(dev->broadcast, 0, ETH_ALEN);
-	if (!ve_is_super(dev_net(dev)->owner_ve))
-		dev->features |= NETIF_F_VIRTUAL;
+	dev->features |= NETIF_F_VIRTUAL;
 }
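
The ordering problem described in the vlan commit message can be modelled
in a few lines of user-space C. This is a sketch under stated assumptions:
fake_net, fake_dev, FAKE_F_VIRTUAL and the functions below are invented for
illustration and only mirror the order of operations from the call chain
(namespace set to init_net, setup callback run, real namespace assigned).

```c
#include <stdbool.h>

#define FAKE_F_VIRTUAL 0x1u	/* stands in for NETIF_F_VIRTUAL */

struct fake_net {
	bool owner_is_super;	/* models ve_is_super(net->owner_ve) */
};

struct fake_dev {
	const struct fake_net *net;
	unsigned int features;
};

/* Old behaviour: set the flag only if the current namespace is not super. */
void setup_conditional(struct fake_dev *dev)
{
	if (!dev->net->owner_is_super)
		dev->features |= FAKE_F_VIRTUAL;
}

/* Patched behaviour: always set the flag. */
void setup_unconditional(struct fake_dev *dev)
{
	dev->features |= FAKE_F_VIRTUAL;
}

/* Mirrors the call order: init_net first, setup, then the target namespace. */
unsigned int create_link(const struct fake_net *init_net,
			 const struct fake_net *target,
			 void (*setup)(struct fake_dev *))
{
	struct fake_dev dev = { .net = init_net, .features = 0 };

	setup(&dev);		/* runs while dev still points at init_net */
	dev.net = target;	/* the real namespace is assigned only now */
	return dev.features;
}
```

In this model, creating a link for a container namespace with
setup_conditional() leaves the flag clear, because the check still sees
init_net; setup_unconditional() sets it regardless of the ordering.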
