[Devel] [PATCH RHEL7 COMMIT] ve/net: Add VE_NF_CONNTRACK check in resolve_normal_ct()

2015-08-27 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.2
--
commit 1ae3e69714effdf80dd8306271096d86607608b1
Author: Kirill Tkhai ktk...@odin.com
Date:   Thu Aug 27 20:32:43 2015 +0400

ve/net: Add VE_NF_CONNTRACK check in resolve_normal_ct()

This is a missed hunk from diff-ve-net-netfilter-combined.

https://jira.sw.ru/browse/PSBM-35154

Signed-off-by: Kirill Tkhai ktk...@odin.com
---
 net/netfilter/nf_conntrack_core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index bcd215d..33a6e9c 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1061,6 +1061,9 @@ resolve_normal_ct(struct net *net, struct nf_conn *tmpl,
 	u16 zone = tmpl ? nf_ct_zone(tmpl) : NF_CT_DEFAULT_ZONE;
 	u32 hash;
 
+	if (!net_ipt_permitted(net, VE_NF_CONNTRACK))
+		return NULL;
+
 	if (!nf_ct_get_tuple(skb, skb_network_offset(skb),
 			     dataoff, l3num, protonum, tuple, l3proto,
 			     l4proto)) {
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_ref_cancel_init()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0873bd8f500347f34f06ddad0fbf024df91f8add
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:24 2015 +0400

ms/percpu-refcount: implement percpu_ref_cancel_init()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Normally, percpu_ref_init() initializes and percpu_ref_kill()
initiates destruction which completes asynchronously.  The
asynchronous destruction can be problematic in init failure path where
the caller wants to destroy half-constructed object - distinguishing
half-constructed objects from the usual release method can be painful
for complex objects.

This patch implements percpu_ref_cancel_init() which synchronously
destroys the percpu_ref without invoking release.  To avoid
unintentional misuses, the function requires the ref to have finished
percpu_ref_init() but never used and triggers WARN otherwise.

v2: Explain the weird name and usage restriction in the function
comment.
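
The intended call pattern can be sketched in plain userspace C. This is a
single-threaded model under stated assumptions, not the kernel code:
model_ref, MODEL_NR_CPUS and MODEL_COUNT_BIAS are hypothetical stand-ins
for struct percpu_ref, the set of possible CPUs and PCPU_COUNT_BIAS, and a
returned warning count replaces WARN_ON_ONCE().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS    4            /* hypothetical fixed CPU count */
#define MODEL_COUNT_BIAS (1U << 31)   /* plays the role of PCPU_COUNT_BIAS */

struct model_ref {
	unsigned count;        /* stands in for the atomic counter */
	unsigned *pcpu_count;  /* one counter per modelled CPU */
};

/* Like percpu_ref_init(), this allocates memory and may fail. */
int model_ref_init(struct model_ref *ref)
{
	assert(ref != NULL);
	ref->count = 1 + MODEL_COUNT_BIAS;
	ref->pcpu_count = calloc(MODEL_NR_CPUS, sizeof(*ref->pcpu_count));
	return ref->pcpu_count ? 0 : -1;
}

/*
 * Synchronous teardown of a ref that finished init but was never used.
 * Returns the number of violated expectations; the kernel would WARN.
 */
int model_ref_cancel_init(struct model_ref *ref)
{
	int cpu, warnings = 0;

	if (ref->count != 1 + MODEL_COUNT_BIAS)
		warnings++;

	if (ref->pcpu_count) {
		for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
			if (ref->pcpu_count[cpu])
				warnings++;
		free(ref->pcpu_count);
		ref->pcpu_count = NULL;
	}
	return warnings;
}
```

A constructor that fails after model_ref_init() would call
model_ref_cancel_init() and free the enclosing object right away, with no
release callback involved and nothing asynchronous to wait for.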

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit bc497bd33b2d6a6f07bc8574b4764edbd7fdffa8)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c   | 31 +++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 8146aa9..6d843d6 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,6 +68,7 @@ struct percpu_ref {
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release);
+void percpu_ref_cancel_init(struct percpu_ref *ref);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS	2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index b35eaac..ebeaac2 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -54,6 +54,37 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 	return 0;
 }
 
+/**
+ * percpu_ref_cancel_init - cancel percpu_ref_init()
+ * @ref: percpu_ref to cancel init for
+ *
+ * Once a percpu_ref is initialized, its destruction is initiated by
+ * percpu_ref_kill() and completes asynchronously, which can be painful to
+ * do when destroying a half-constructed object in init failure path.
+ *
+ * This function destroys @ref without invoking @ref->release and the
+ * memory area containing it can be freed immediately on return.  To
+ * prevent accidental misuse, it's required that @ref has finished
+ * percpu_ref_init(), whether successful or not, but never used.
+ *
+ * The weird name and usage restriction are to prevent people from using
+ * this function by mistake for normal shutdown instead of
+ * percpu_ref_kill().
+ */
+void percpu_ref_cancel_init(struct percpu_ref *ref)
+{
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
+	int cpu;
+
+	WARN_ON_ONCE(atomic_read(&ref->count) != 1 + PCPU_COUNT_BIAS);
+
+	if (pcpu_count) {
+		for_each_possible_cpu(cpu)
+			WARN_ON_ONCE(*per_cpu_ptr(pcpu_count, cpu));
+		free_percpu(ref->pcpu_count);
+	}
+}
+
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);

[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: split cgroup destruction into two steps

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 33f3496e5d1342b4497058d017261d3b3fde0fe1
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:26 2015 +0400

ms/cgroup: split cgroup destruction into two steps

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item.  The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent.  The separation is to allow using percpu refcnt for css.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name.  As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine.  A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both.  This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes.  INIT_WORK() is now performed right before queueing the work
item.
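
The reuse of one work item for both steps can be sketched with a toy,
single-threaded queue in userspace C. Every name here (model_work,
model_queue_work(), model_run_one(), model_cgroup) is hypothetical, and
the model queues the free step directly from the offline step, whereas in
the patch the free step is queued from an RCU callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_work {
	void (*fn)(struct model_work *work);
	struct model_work *next;
};

struct model_cgroup {
	bool dead;                       /* set by step one */
	bool offlined;                   /* set by step two */
	bool freed;                      /* set by the final free */
	struct model_work destroy_work;  /* one item, used for both purposes */
};

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_work *model_queue_head;  /* stand-in for a workqueue */

void model_init_work(struct model_work *work,
		     void (*fn)(struct model_work *work))
{
	work->fn = fn;
	work->next = NULL;
}

void model_queue_work(struct model_work *work)
{
	struct model_work **p = &model_queue_head;

	assert(work->fn != NULL);
	while (*p)
		p = &(*p)->next;
	work->next = NULL;
	*p = work;
}

/* Run the oldest queued item; returns 0 when the queue is empty. */
int model_run_one(void)
{
	struct model_work *work = model_queue_head;

	if (!work)
		return 0;
	model_queue_head = work->next;  /* dequeue first so fn may requeue */
	work->fn(work);
	return 1;
}

void model_cgroup_free_fn(struct model_work *work)
{
	model_container_of(work, struct model_cgroup, destroy_work)->freed = true;
}

/* Step two: offline, then re-arm the same item for the free step. */
void model_cgroup_offline_fn(struct model_work *work)
{
	struct model_cgroup *cgrp =
		model_container_of(work, struct model_cgroup, destroy_work);

	cgrp->offlined = true;
	model_init_work(&cgrp->destroy_work, model_cgroup_free_fn);
	model_queue_work(&cgrp->destroy_work);
}

/* Step one: mark dead and defer the rest to the work item. */
void model_cgroup_destroy(struct model_cgroup *cgrp)
{
	cgrp->dead = true;
	model_init_work(&cgrp->destroy_work, model_cgroup_offline_fn);
	model_queue_work(&cgrp->destroy_work);
}
```

Initializing the item right before each queueing is what lets a single
member carry two different handlers over the object's lifetime.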

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Li Zefan lize...@huawei.com
(cherry picked from commit ea15f8ccdb430af1e8bc9b4e19a230eb4c356777)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
kernel/cgroup.c
---
 include/linux/cgroup.h |  2 +-
 kernel/cgroup.c| 25 -
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 626bc84..d34c42b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -259,7 +259,7 @@ struct cgroup {
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
-	struct work_struct free_work;
+	struct work_struct destroy_work;
 
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 062e0f4..6fd7038 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -213,6 +213,7 @@ static struct cgroup_name root_cgroup_name = { .name = "/" };
  */
 static int need_forkexit_callback __read_mostly;
 
+static void cgroup_offline_fn(struct work_struct *work);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
 			      struct cftype cfts[], bool is_add);
@@ -836,7 +837,7 @@ static struct cgroup_name *cgroup_alloc_name(struct dentry *dentry)
 
 static void cgroup_free_fn(struct work_struct *work)
 {
-	struct cgroup *cgrp = container_of(work, struct cgroup, free_work);
+	struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
 	struct cgroup_subsys *ss;
 
 	mutex_lock(&cgroup_mutex);
@@ -881,7 +882,8 @@ static void cgroup_free_rcu(struct rcu_head *head)
 {
 	struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
 
-	queue_work(cgroup_destroy_wq, &cgrp->free_work);
+	INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
+	queue_work(cgroup_destroy_wq, &cgrp->destroy_work);
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -1416,7 +1418,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->allcg_node);
 	INIT_LIST_HEAD(&cgrp->release_list);
 	INIT_LIST_HEAD(&cgrp->pidlists);
-	INIT_WORK(&cgrp->free_work, cgroup_free_fn);
 	mutex_init(&cgrp->pidlist_mutex);
 	INIT_LIST_HEAD(&cgrp->event_list);
 	spin_lock_init(&cgrp->event_list_lock);
@@ -4355,7 +4356,6 @@ static int

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: Don't use silly cmpxchg()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 337bb797aa4aa5eca030d634d0a9874290511db5
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:21 2015 +0400

ms/percpu-refcount: Don't use silly cmpxchg()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Kent Overstreet koverstr...@google.com

The cmpxchg() was just to ensure the debug check didn't race, which was
a bit excessive. The caller is supposed to do the appropriate
synchronization, which means percpu_ref_kill() can just do a simple
store.
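
The plain store works because the status lives in the low bits of the
counter pointer. A minimal userspace sketch follows, assuming the counters
are at least 4-byte aligned so that two low bits are free; the model_*
names and the flag value 1 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_STATUS_BITS 2
#define MODEL_STATUS_MASK ((1UL << MODEL_STATUS_BITS) - 1)
#define MODEL_REF_DEAD    1UL   /* assumed value of the "dead" status */

struct model_ref {
	unsigned *pcpu_count;   /* counter pointer, status in the low bits */
};

void model_ref_init(struct model_ref *ref, unsigned *counters)
{
	/* The tagging scheme needs the low status bits to start out clear. */
	assert(((uintptr_t)counters & MODEL_STATUS_MASK) == 0);
	ref->pcpu_count = counters;
}

unsigned long model_status(const unsigned *tagged)
{
	return (unsigned long)((uintptr_t)tagged & MODEL_STATUS_MASK);
}

unsigned *model_strip_status(unsigned *tagged)
{
	return (unsigned *)((uintptr_t)tagged & ~(uintptr_t)MODEL_STATUS_MASK);
}

/*
 * Mark the ref dead with a simple store.  Returns 1 if it was already
 * dead, which is the case the kernel reports with WARN_ONCE().
 */
int model_ref_kill(struct model_ref *ref)
{
	int killed_before = model_status(ref->pcpu_count) == MODEL_REF_DEAD;

	ref->pcpu_count =
		(unsigned *)((uintptr_t)ref->pcpu_count | MODEL_REF_DEAD);
	return killed_before;
}
```

Since the caller is expected to serialize kill calls, the check only has
to detect misuse, so no compare-and-swap loop is needed around the store.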

Signed-off-by: Kent Overstreet koverstr...@google.com
Signed-off-by: Tejun Heo t...@kernel.org
(cherry picked from commit c1ae6e9b4db00023b9caed72af49a93abad46452)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 lib/percpu-refcount.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 6f0ffd7..1a17399 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -107,22 +107,11 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
  */
 void percpu_ref_kill(struct percpu_ref *ref)
 {
-	unsigned __percpu *pcpu_count, *old, *new;
+	WARN_ONCE(REF_STATUS(ref->pcpu_count) == PCPU_REF_DEAD,
+		  "percpu_ref_kill() called more than once!\n");
 
-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
-	do {
-		if (REF_STATUS(pcpu_count) == PCPU_REF_DEAD) {
-			WARN(1, "percpu_ref_kill() called more than once!\n");
-			return;
-		}
-
-		old = pcpu_count;
-		new = (unsigned __percpu *)
-			(((unsigned long) pcpu_count)|PCPU_REF_DEAD);
-
-		pcpu_count = cmpxchg(&ref->pcpu_count, old, new);
-	} while (pcpu_count != old);
+	ref->pcpu_count = (unsigned __percpu *)
+		(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);
 
 	call_rcu(&ref->rcu, percpu_ref_kill_rcu);
 }

[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: reorder the operations in cgroup_destroy_locked()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit ce835adec25190f76a26cc97f1a38aadc93a4957
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:25 2015 +0400

ms/cgroup: reorder the operations in cgroup_destroy_locked()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list.  This will be used to make css use
percpu refcnt.

While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Li Zefan lize...@huawei.com
(cherry picked from commit 455050d23e1bfc47ca98e943ad5b2f3a9bbe45fb)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
kernel/cgroup.c
---
 kernel/cgroup.c | 48 ++--
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b073fba..062e0f4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4367,9 +4367,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 
 	/*
 	 * Block new css_tryget() by deactivating refcnt and mark @cgrp
-	 * removed.  This makes future css_tryget() and child creation
-	 * attempts fail thus maintaining the removal conditions verified
-	 * above.
+	 * removed.  This makes future css_tryget() attempts fail which we
+	 * guarantee to ->css_offline() callbacks.
 	 */
 	for_each_subsys(cgrp->root, ss) {
 		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -4379,6 +4378,30 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	}
 	set_bit(CGRP_REMOVED, &cgrp->flags);
 
+	raw_spin_lock(&release_list_lock);
+	if (!list_empty(&cgrp->release_list))
+		list_del_init(&cgrp->release_list);
+	raw_spin_unlock(&release_list_lock);
+
+	/*
+	 * Remove @cgrp directory.  The removal puts the base ref but we
+	 * aren't quite done with @cgrp yet, so hold onto it.
+	 */
+	dget(d);
+	cgroup_d_remove_dir(d);
+
+	/*
+	 * Unregister events and notify userspace.
+	 * Notify userspace about cgroup removing only after rmdir of cgroup
+	 * directory to avoid race between userspace and kernelspace.
+	 */
+	spin_lock(&cgrp->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+		list_del_init(&event->list);
+		schedule_work(&event->remove);
+	}
+	spin_unlock(&cgrp->event_list_lock);
+
 	/* tell subsystems to initate destruction */
 	for_each_subsys(cgrp->root, ss)
 		offline_css(ss, cgrp);
@@ -4393,34 +4416,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	for_each_subsys(cgrp->root, ss)
 		css_put(cgrp->subsys[ss->subsys_id]);
 
-	raw_spin_lock(&release_list_lock);
-	if (!list_empty(&cgrp->release_list))
-		list_del_init(&cgrp->release_list);
-	raw_spin_unlock(&release_list_lock);
-
 	/* delete this cgroup from parent->children */
 	list_del_rcu(&cgrp->sibling);
 	list_del_init(&cgrp->allcg_node);
 
-	dget(d);
-	cgroup_d_remove_dir(d);
 	dput(d);
 
 	set_bit(CGRP_RELEASABLE, &parent->flags);
 	check_for_release(parent);
 
-	/*
-	 * Unregister events and notify userspace.
-	 * Notify userspace about cgroup removing only after rmdir of cgroup
-	 * directory to avoid race between userspace and kernelspace.
-	 */
-	spin_lock(&cgrp->event_list_lock);
-

[Devel] [PATCH RHEL7 COMMIT] ve/devpts: Revert 2c27d20125f5

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 99a71c6ceb41b6c8256620c4db844f7395f2a2c9
Author: Cyrill Gorcunov gorcu...@gmail.com
Date:   Fri Aug 28 14:14:08 2015 +0400

ve/devpts: Revert 2c27d20125f5

Here we revert 2c27d20125f5 ("ve/devpts: cleanup per-VE creation"),
bringing the code closer to the vanilla one. We still tune the devpts code
a bit in the next patch, but in a less intrusive way.

https://jira.sw.ru/browse/PSBM-34931

Signed-off-by: Cyrill Gorcunov gorcu...@virtuozzo.com

CC: Vladimir Davydov vdavy...@virtuozzo.com
CC: Andrey Vagin ava...@virtuozzo.com
CC: Konstantin Khorenko khore...@virtuozzo.com
CC: Pavel Emelyanov xe...@virtuozzo.com
---
 fs/devpts/inode.c | 39 ++-
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 3dcd4da..be0fb74 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -402,6 +402,20 @@ fail:
 }
 
 #ifdef CONFIG_DEVPTS_MULTIPLE_INSTANCES
+static int test_devpts_sb(struct super_block *s, void *p)
+{
+	return get_exec_env()->devpts_sb == s;
+}
+
+static int set_devpts_sb(struct super_block *s, void *p)
+{
+	int error = set_anon_super(s, p);
+	if (!error) {
+		atomic_inc(&s->s_active);
+		get_exec_env()->devpts_sb = s;
+	}
+	return error;
+}
 
 /*
  * devpts_mount()
@@ -436,7 +450,6 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 	int error;
 	struct pts_mount_opts opts;
 	struct super_block *s;
-	struct dentry *root;
 
 	error = parse_mount_options(data, PARSE_MOUNT, &opts);
 	if (error)
@@ -450,29 +463,29 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EINVAL);
 
 	if (opts.newinstance)
-		root = mount_nodev(fs_type, flags, data, devpts_fill_super);
+		s = sget(fs_type, NULL, set_anon_super, flags, NULL);
 	else
-		root = mount_ns(fs_type, flags, data, get_exec_env(), devpts_fill_super);
+		s = sget(fs_type, test_devpts_sb, set_devpts_sb, flags, NULL);
+
+	if (IS_ERR(s))
+		return ERR_CAST(s);
 
-	if (IS_ERR(root))
-		return ERR_CAST(root);
+	if (!s->s_root) {
+		error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+		if (error)
+			goto out_undo_sget;
+		s->s_flags |= MS_ACTIVE;
+	}
 
-	s = root->d_sb;
 	memcpy(&(DEVPTS_SB(s))->mount_opts, &opts, sizeof(opts));
 
 	error = mknod_ptmx(s);
 	if (error)
 		goto out_undo_sget;
 
-	if (!opts.newinstance) {
-		atomic_inc(&s->s_active);
-		get_exec_env()->devpts_sb = s;
-	}
-
-	return root;
+	return dget(s->s_root);
 
 out_undo_sget:
-	dput(root);
 	deactivate_locked_super(s);
 	return ERR_PTR(error);
 }

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 4149fa7beae723cd745672c749ed0a94f7f672a4
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:24 2015 +0400

ms/percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Implement percpu_tryget() which stops giving out references once the
percpu_ref is visible as killed.  Because the refcnt is per-cpu,
different CPUs will start to see a refcnt as killed at different
points in time and tryget() may continue to succeed on subset of cpus
for a while after percpu_ref_kill() returns.

For use cases where it's necessary to know when all CPUs start to see
the refcnt as dead, percpu_ref_kill_and_confirm() is added.  The new
function takes an extra argument @confirm_kill which is invoked when
the refcnt is guaranteed to be viewed as killed on all CPUs.

While this isn't the prettiest interface, it doesn't force synchronous
wait and is much safer than requiring the caller to do its own
call_rcu().

v2: Patch description rephrased to emphasize that tryget() may
continue to succeed on some CPUs after kill() returns as suggested
by Kent.

v3: Function comment in percpu_ref_kill_and_confirm() updated warning
people to not depend on the implied RCU grace period from the
confirm callback as it's an implementation detail.
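
The visibility window described above can be modelled in single-threaded
userspace C. This is a toy under stated assumptions, not the kernel
implementation: each modelled CPU has its own view of the dead flag, the
killing CPU is assumed to see it at once, and model_grace_period_elapsed()
is a hypothetical stand-in for the point at which every CPU is guaranteed
to see it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 2   /* hypothetical CPU count */

struct model_ref;
typedef void (model_ref_func_t)(struct model_ref *ref);

struct model_ref {
	bool seen_dead[MODEL_NR_CPUS];  /* per-CPU view of the dead flag */
	unsigned gets[MODEL_NR_CPUS];   /* references handed out per CPU */
	model_ref_func_t *confirm_kill;
};

int model_confirmed;   /* set by the example callback below */

void model_note_confirm(struct model_ref *ref)
{
	(void)ref;
	model_confirmed = 1;
}

/* Hand out a reference unless this CPU already sees the ref as dead. */
bool model_ref_tryget(struct model_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (ref->seen_dead[cpu])
		return false;
	ref->gets[cpu]++;
	return true;
}

/* Only the killing CPU is assumed to observe the kill immediately. */
void model_ref_kill_and_confirm(struct model_ref *ref, int cpu,
				model_ref_func_t *confirm_kill)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	ref->seen_dead[cpu] = true;
	ref->confirm_kill = confirm_kill;
}

/* From here on every CPU sees the ref as dead, so confirm the kill. */
void model_grace_period_elapsed(struct model_ref *ref)
{
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		ref->seen_dead[cpu] = true;
	if (ref->confirm_kill)
		ref->confirm_kill(ref);
}
```

In the model, a tryget on the other CPU keeps succeeding between the kill
and the confirmation, which is the window the callback exists to close.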

Signed-off-by: Tejun Heo t...@kernel.org
Slightly-Grumpily-Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit dbece3a0f1ef0b19aff1cc6ed0942fec9ab98de1)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 50 -
 lib/percpu-refcount.c   | 23 ++-
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 6d843d6..dd2a086 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -63,13 +63,30 @@ struct percpu_ref {
 	 */
 	unsigned __percpu	*pcpu_count;
 	percpu_ref_func_t	*release;
+	percpu_ref_func_t	*confirm_kill;
 	struct rcu_head		rcu;
 };
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release);
 void percpu_ref_cancel_init(struct percpu_ref *ref);
-void percpu_ref_kill(struct percpu_ref *ref);
+void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
+				 percpu_ref_func_t *confirm_kill);
+
+/**
+ * percpu_ref_kill - drop the initial ref
+ * @ref: percpu_ref to kill
+ *
+ * Must be used to drop the initial ref on a percpu refcount; must be called
+ * precisely once before shutdown.
+ *
+ * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
+ * percpu counters and dropping the initial ref.
+ */
+static inline void percpu_ref_kill(struct percpu_ref *ref)
+{
+	return percpu_ref_kill_and_confirm(ref, NULL);
+}
 
 #define PCPU_STATUS_BITS	2
 #define PCPU_STATUS_MASK	((1 << PCPU_STATUS_BITS) - 1)
@@ -101,6 +118,37 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 }
 
 /**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless it has already been killed.  Returns
+ * %true on success; %false on failure.
+ *
+ * Completion of percpu_ref_kill() in itself doesn't guarantee that tryget
+ * will fail.  For such guarantee, percpu_ref_kill_and_confirm() should be
+ * used.  After the confirm_kill callback is invoked, it's guaranteed that
+ * no new reference will be given out by percpu_ref_tryget().
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+   unsigned

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 82f6802b3f09878172024c57ed12cf2da92cccd3
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:23 2015 +0400

ms/percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

Two small changes.

* Unlike most init functions, percpu_ref_init() allocates memory and
  may fail.  Let's mark it with __must_check in case the caller
  forgets.

* percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
  dereference @ref->pcpu_count, which can be misleading.  The pointer
  is guaranteed to be valid and visible and can't change underneath
  the function.  Drop ACCESS_ONCE().
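
The effect of the first change can be tried in userspace with the GCC and
Clang warn_unused_result attribute, which to my knowledge is what
__must_check expands to. A minimal sketch, where MODEL_MUST_CHECK,
model_ref and model_ref_init() are hypothetical names:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed userspace equivalent of the kernel's __must_check. */
#define MODEL_MUST_CHECK __attribute__((warn_unused_result))

struct model_ref {
	unsigned *pcpu_count;   /* one counter per modelled CPU */
};

/* Allocates, so it can fail; the attribute flags callers that ignore that. */
MODEL_MUST_CHECK int model_ref_init(struct model_ref *ref, size_t nr_cpus)
{
	assert(nr_cpus > 0);
	ref->pcpu_count = calloc(nr_cpus, sizeof(*ref->pcpu_count));
	return ref->pcpu_count ? 0 : -ENOMEM;
}
```

A bare call such as "model_ref_init(&ref, 4);" then draws a compiler
warning, while assigning or testing the return value compiles quietly.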

Signed-off-by: Tejun Heo t...@kernel.org
(cherry picked from commit acac7883ee7bcc32476963bce7baf73d44574dd1)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 3 ++-
 lib/percpu-refcount.c   | 4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index b61bd6f..8146aa9 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -66,7 +66,8 @@ struct percpu_ref {
 	struct rcu_head		rcu;
 };
 
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
+int __must_check percpu_ref_init(struct percpu_ref *ref,
+				 percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS	2
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9a78e55..b35eaac 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -57,12 +57,10 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
-	unsigned __percpu *pcpu_count;
+	unsigned __percpu *pcpu_count = ref->pcpu_count;
 	unsigned count = 0;
 	int cpu;
 
-	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
 	/* Mask out PCPU_REF_DEAD */
 	pcpu_count = (unsigned __percpu *)
 		(((unsigned long) pcpu_count) & ~PCPU_STATUS_MASK);

[Devel] [PATCH RHEL7 COMMIT] ms/percpu: implement generic percpu refcounting

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit b5ec5570459334e56491e564b567cc5bed16181e
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:21 2015 +0400

ms/percpu: implement generic percpu refcounting

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Kent Overstreet koverstr...@google.com

This implements a refcount with similar semantics to
atomic_inc()/atomic_dec_and_test() - but percpu.

It also implements two stage shutdown, as we need it to tear down the
percpu counts.  Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in shutting down mode and
switches back to a single atomic refcount with the appropriate
barriers (synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only
returns true once, so callers don't have to reimplement shutdown
synchronization.
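
The two-stage scheme can be modelled in single-threaded userspace C. The
sketch below is an illustration under stated assumptions, not the code in
this patch: the model_* names are hypothetical, MODEL_COUNT_BIAS mirrors
the role of PCPU_COUNT_BIAS, and RCU and atomics are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_NR_CPUS    4            /* hypothetical CPU count */
#define MODEL_COUNT_BIAS (1U << 31)   /* keeps the shared count away from 0 */

struct model_ref {
	unsigned count;                /* shared counter: 1 + bias at init */
	unsigned pcpu[MODEL_NR_CPUS];  /* per-CPU deltas, each may wrap */
	bool dead;                     /* set once kill has run */
	int released;                  /* set when the count reaches zero */
};

void model_ref_init(struct model_ref *ref)
{
	memset(ref, 0, sizeof(*ref));
	ref->count = 1 + MODEL_COUNT_BIAS;
}

void model_ref_get(struct model_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (!ref->dead)
		ref->pcpu[cpu]++;   /* fast path: no shared write */
	else
		ref->count++;
}

void model_ref_put(struct model_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (!ref->dead)
		ref->pcpu[cpu]--;   /* cannot detect zero in percpu mode */
	else if (--ref->count == 0)
		ref->released = 1;
}

/* Fold the per-CPU deltas into the shared counter and drop the bias. */
void model_ref_kill(struct model_ref *ref)
{
	unsigned sum = 0;
	int cpu;

	ref->dead = true;
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += ref->pcpu[cpu];      /* unsigned wraparound is intended */
	ref->count += sum - MODEL_COUNT_BIAS;
}
```

A get on one CPU and a put on another leave individual slots wrapped, but
their sum modulo 2^32 is still the number of outstanding references, which
is why the fold in model_ref_kill() recovers the right total.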

[a...@linux-foundation.org: fix build]
[a...@linux-foundation.org: coding-style tweak]
Signed-off-by: Kent Overstreet koverstr...@google.com
Cc: Zach Brown z...@redhat.com
Cc: Felipe Balbi ba...@ti.com
Cc: Greg Kroah-Hartman gre...@linuxfoundation.org
Cc: Mark Fasheh mfas...@suse.com
Cc: Joel Becker jl...@evilplan.org
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Jens Axboe ax...@kernel.dk
Cc: Asai Thambi S P asamymuth...@micron.com
Cc: Selvan Mani sm...@micron.com
Cc: Sam Bradshaw sbrads...@micron.com
Cc: Jeff Moyer jmo...@redhat.com
Cc: Al Viro v...@zeniv.linux.org.uk
Cc: Benjamin LaHaise b...@kvack.org
Cc: Tejun Heo t...@kernel.org
Cc: Oleg Nesterov o...@redhat.com
Cc: Christoph Lameter c...@linux-foundation.org
Cc: Ingo Molnar mi...@redhat.com
Reviewed-by: Theodore Ts'o ty...@mit.edu
Signed-off-by: Tejun Heo t...@kernel.org

(cherry picked from commit 215e262f2aeba378aa192da07c30770f9925a4bf)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
lib/Makefile
---
 include/linux/percpu-refcount.h | 122 ++
 lib/Makefile|   2 +-
 lib/percpu-refcount.c   | 128 
 3 files changed, 251 insertions(+), 1 deletion(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 000..24b31ef
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,122 @@
+/*
+ * Percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet koverstr...@google.com
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * The refcount will have a range of 0 to ((1U << 31) - 1), i.e. one bit less
+ * than an atomic_t - this is because of the way shutdown works, see
+ * percpu_ref_kill()/PCPU_COUNT_BIAS.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0.  After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspaces calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu()

[Devel] [PATCH RHEL7 COMMIT] ms/memcg: issue memory.high reclaim after refilling percpu stock

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit c315808e33a89086d0dac4624c1fa6f4fe1f8051
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:22:20 2015 +0400

ms/memcg: issue memory.high reclaim after refilling percpu stock

Currently, we dive into memory.high reclaim before refilling percpu
stock. As a result, if we successfully charge a batch for a percpu stock
while exceeding memory.high, others won't be able to use it until we
finish and will probably have to reclaim themselves, which may lead to
overreclaim. This patch therefore moves memory.high reclaim after
refilling stocks. This is how it works upstream.

I haven't seen any negative effects caused by this backport mistake, but
let's stick to the mainstream behavior anyways.

Fixes: 4038cd0e029dd (ms/memcg: port memory.high)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37e81d3..5f3e0ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2730,10 +2730,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
if (likely(!ret)) {
if (!do_swap_account)
-   goto done;
+   return CHARGE_OK;
ret = res_counter_charge(&memcg->memsw, csize, &fail_res);
if (likely(!ret))
-   goto done;
+   return CHARGE_OK;
 
res_counter_uncharge(&memcg->res, csize);
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
@@ -2790,21 +2790,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return CHARGE_OOM_DIE;
 
return CHARGE_RETRY;
-
-done:
-   if (!(gfp_mask & __GFP_WAIT))
-   goto out;
-   /*
-* If the hierarchy is above the normal consumption range,
-* make the charging task trim their excess contribution.
-*/
-   do {
-   if (res_counter_read_u64(&memcg->res, RES_USAGE) <= memcg->high)
-   continue;
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, false);
-   } while ((memcg = parent_mem_cgroup(memcg)));
-out:
-   return CHARGE_OK;
 }
 
 /*
@@ -2836,7 +2821,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 {
unsigned int batch = max(CHARGE_BATCH, nr_pages);
int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-   struct mem_cgroup *memcg = NULL;
+   struct mem_cgroup *memcg = NULL, *iter;
int ret;
 
/*
@@ -2950,6 +2935,20 @@ again:
 
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
+
+   /*
+* If the hierarchy is above the normal consumption range,
+* make the charging task trim their excess contribution.
+*/
+   iter = memcg;
+   do {
+   if (!(gfp_mask & __GFP_WAIT))
+   break;
+   if (res_counter_read_u64(&iter->res, RES_USAGE) <= iter->high)
+   continue;
+   try_to_free_mem_cgroup_pages(iter, nr_pages, gfp_mask, false);
+   } while ((iter = parent_mem_cgroup(iter)));
+
css_put(&memcg->css);
 done:
*ptr = memcg;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
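
The ordering this patch describes (refill the stock first, then trim every
level that is above its high threshold) can be modelled in userspace. The
following is only a sketch under stated assumptions: struct group, struct
stock, reclaim() and charge() are hypothetical stand-ins invented for the
illustration, not the kernel's mem_cgroup API, and the toy reclaim always
succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup: only what the walk needs. */
struct group {
	struct group *parent;
	unsigned long usage;	/* pages currently charged */
	unsigned long high;	/* high threshold, in pages */
};

/* Hypothetical stand-in for the per-CPU stock of pre-charged pages. */
struct stock {
	unsigned long cached;
};

/* Toy reclaim: frees up to nr_pages from one group, returns pages freed. */
static unsigned long reclaim(struct group *g, unsigned long nr_pages)
{
	unsigned long freed = nr_pages < g->usage ? nr_pages : g->usage;

	g->usage -= freed;
	return freed;
}

/*
 * Charge a batch hierarchically, hand the surplus to the stock, and only
 * then trim the levels that exceed their high threshold.
 * Returns the number of levels that were reclaimed from.
 */
static int charge(struct group *g, struct stock *st, unsigned long nr_pages,
		  unsigned long batch, int may_wait)
{
	struct group *iter;
	int trimmed = 0;

	assert(g != NULL && st != NULL);
	if (batch < nr_pages)
		batch = nr_pages;

	for (iter = g; iter; iter = iter->parent)
		iter->usage += batch;

	/* The surplus becomes usable by others before any reclaim starts. */
	if (batch > nr_pages)
		st->cached += batch - nr_pages;

	if (!may_wait)
		return 0;

	for (iter = g; iter; iter = iter->parent) {
		if (iter->usage <= iter->high)
			continue;
		reclaim(iter, nr_pages);
		trimmed++;
	}
	return trimmed;
}
```

In this model a caller that charges 1 page with a batch of 32 leaves 31 pages
in the stock before the walk up the hierarchy begins, which is the property
the patch restores.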

[Devel] [PATCH RHEL7 COMMIT] ve/vznetstat: Fix potential exit race

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 9a440f22380933dd3547de7d83c553924c6ce284
Author: Cyrill Gorcunov gorcu...@virtuozzo.com
Date:   Fri Aug 28 14:31:18 2015 +0400

ve/vznetstat: Fix potential exit race

When container is exiting another task may be doing operations
with statistics incrementing/decrementing stat counter, which
may lead to situation where counter is not zero, thus we don't
zap @ve->stat member.

Fix it by testing if the net is the last one belonging
to a container.

https://jira.sw.ru/browse/PSBM-35178

Fixes: 505f8aacf95dce27fad66c90d4e1cd64adcb5432
(ve/vznetstat: Don't destroy statistics until explicitly asked)

Signed-off-by: Cyrill Gorcunov gorcu...@virtuozzo.com

CC: Andrey Vagin ava...@virtuozzo.com
CC: Vladimir Davydov vdavy...@virtuozzo.com
CC: Konstantin Khorenko khore...@virtuozzo.com
CC: Pavel Emelyanov xe...@virtuozzo.com
CC: Igor Sukhih i...@parallels.com
---
 kernel/ve/vznetstat/vznetstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/ve/vznetstat/vznetstat.c b/kernel/ve/vznetstat/vznetstat.c
index 9a25dea..99feafb 100644
--- a/kernel/ve/vznetstat/vznetstat.c
+++ b/kernel/ve/vznetstat/vznetstat.c
@@ -1098,7 +1098,7 @@ static void __net_exit net_exit_acct(struct net *net)
 
if (ve->stat) {
venet_acct_put_stat(ve->stat);
-   if (atomic_read(&ve->stat->users) == 0)
+   if (ve->ve_netns == net)
ve->stat = NULL;
}
 }
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
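
A single-threaded sketch of why the counter test can miss, assuming another
task still holds a temporary reference at exit time. The struct and function
names below are hypothetical simplifications of the ones in the diff, and the
concurrency is only modelled by starting the counter at 2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the objects named in the patch. */
struct stat_obj { int users; };
struct netns { int id; };
struct ve {
	struct stat_obj *stat;
	struct netns *ve_netns;	/* the namespace owned by the container */
};

static void put_stat(struct stat_obj *s)
{
	s->users--;
}

/* Old logic: clear the pointer only if the counter happens to reach zero. */
static void net_exit_old(struct ve *ve, struct netns *net)
{
	assert(ve != NULL);
	(void)net;
	if (ve->stat) {
		put_stat(ve->stat);
		if (ve->stat->users == 0)
			ve->stat = NULL;
	}
}

/* New logic: clear the pointer when the container's own netns goes away. */
static void net_exit_new(struct ve *ve, struct netns *net)
{
	assert(ve != NULL);
	if (ve->stat) {
		put_stat(ve->stat);
		if (ve->ve_netns == net)
			ve->stat = NULL;
	}
}
```

With users starting at 2, the old variant leaves the stat pointer set after
exit, while the new variant clears it regardless of the transient count.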

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: use RCU-sched insted of normal RCU

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 932bf29b63b1e7c74669a8847d7c69cc8b8ba919
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:25 2015 +0400

ms/percpu-refcount: use RCU-sched insted of normal RCU

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

percpu-refcount was incorrectly using preempt_disable/enable() for RCU
critical sections against call_rcu().  6a24474da8 (percpu-refcount:
consistently use plain (non-sched) RCU) fixed it by converting the
preemption operations with rcu_read_[un]lock() citing that there isn't
any advantage in using sched-RCU over using the usual one; however,
rcu_read_[un]lock() for the preemptible RCU implementation -
CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT - are slightly
more expensive than preempt_disable/enable().

In a contrived microbench which repeats the following,

 - percpu_ref_get()
 - copy 32 bytes of data into percpu buffer
 - percpu_ref_put()
 - copy 32 bytes of data into percpu buffer

rcu_read_[un]lock() used in percpu_ref_get/put() makes it go slower by
about 15% when compared to using sched-RCU.

As the RCU critical sections are extremely short, using sched-RCU
shouldn't have any latency implications.  Convert to RCU-sched.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
Acked-by: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Michal Hocko mho...@suse.cz
Cc: Rusty Russell ru...@rustcorp.com.au
(cherry picked from commit a4244454df1296e90cc961c1b636b1176ef0d9a0)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 12 ++--
 lib/percpu-refcount.c   |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index dd2a086..95961f0 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -105,7 +105,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   rcu_read_lock();
+   rcu_read_lock_sched();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -114,7 +114,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
else
atomic_inc(&ref->count);
 
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
 }
 
 /**
@@ -134,7 +134,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
unsigned __percpu *pcpu_count;
int ret = false;
 
-   rcu_read_lock();
+   rcu_read_lock_sched();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -143,7 +143,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
ret = true;
}
 
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
 
return ret;
 }
@@ -159,7 +159,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   rcu_read_lock();
+   rcu_read_lock_sched();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -168,7 +168,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
else if (unlikely(atomic_dec_and_test(&ref->count)))
ref->release(ref);
 
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
 }
 
 #endif
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 8bf9e71..7deeb62 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -154,5 +154,5 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
(((unsigned long) ref->pcpu_count)|PCPU_REF_DEAD);
ref->confirm_kill = confirm_kill;
 
-   call_rcu(&ref->rcu, percpu_ref_kill_rcu);
+   call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }
___
Devel mailing list
Devel@openvz.org

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: cosmetic updates

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d6bfd7b559fdbe649d00c272895cb26996d1ee1c
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:22 2015 +0400

ms/percpu-refcount: cosmetic updates

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

* s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have _t
  postfix for types and the type is gonna be used for a different type
  of callback too.

* Add @ARG to function comments.

* Drop unnecessary and unaligned indentation from percpu_ref_init()
  function comment.

Signed-off-by: Tejun Heo t...@kernel.org
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit ac899061a93250c28562f05ad94d5c74603415bc)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8 +---
 lib/percpu-refcount.c   | 7 ---
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index abe1411..b61bd6f 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -51,7 +51,7 @@
 #include <linux/rcupdate.h>
 
 struct percpu_ref;
-typedef void (percpu_ref_release)(struct percpu_ref *);
+typedef void (percpu_ref_func_t)(struct percpu_ref *);
 
 struct percpu_ref {
atomic_t count;
@@ -62,11 +62,11 @@ struct percpu_ref {
 * percpu_ref_kill_rcu())
 */
unsigned __percpu   *pcpu_count;
-   percpu_ref_release  *release;
+   percpu_ref_func_t   *release;
struct rcu_head rcu;
 };
 
-int percpu_ref_init(struct percpu_ref *, percpu_ref_release *);
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
 void percpu_ref_kill(struct percpu_ref *ref);
 
 #define PCPU_STATUS_BITS   2
@@ -78,6 +78,7 @@ void percpu_ref_kill(struct percpu_ref *ref);
 
 /**
  * percpu_ref_get - increment a percpu refcount
+ * @ref: percpu_ref to get
  *
  * Analagous to atomic_inc().
   */
@@ -99,6 +100,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 
 /**
  * percpu_ref_put - decrement a percpu refcount
+ * @ref: percpu_ref to put
  *
  * Decrement the refcount, and if 0, call the release function (which was passed
  * to percpu_ref_init())
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 1a17399..9a78e55 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -33,8 +33,8 @@
 
 /**
  * percpu_ref_init - initialize a percpu refcount
- * @ref:   ref to initialize
- * @release:   function which will be called when refcount hits 0
+ * @ref: percpu_ref to initialize
+ * @release: function which will be called when refcount hits 0
  *
  * Initializes the refcount in single atomic counter mode with a refcount of 1;
  * analagous to atomic_set(ref, 1).
@@ -42,7 +42,7 @@
  * Note that @release must not sleep - it may potentially be called from RCU
  * callback context by percpu_ref_kill().
  */
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release)
+int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release)
 {
atomic_set(&ref->count, 1 + PCPU_COUNT_BIAS);
 
@@ -98,6 +98,7 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 
 /**
  * percpu_ref_kill - safely drop initial ref
+ * @ref: percpu_ref to kill
  *
  * Must be used to drop the initial ref on a percpu refcount; must be called
  * precisely once before shutdown.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

[Devel] [PATCH RHEL7 COMMIT] ploop: dio_fastmap() must refresh bvec_merge_data

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit fc65c834967a14d37ef23348cec6528d18b0a169
Author: Maxim Patlasov mpatla...@openvz.org
Date:   Fri Aug 28 14:18:37 2015 +0400

ploop: dio_fastmap() must refresh bvec_merge_data

q->merge_bvec_fn() may override some fields of bvec_merge_data.
For example, raid0_mergeable_bvec() does so. The blessed way is
to initialize it from scratch before use -- see how __bio_add_page()
prepares bvm for calling q->merge_bvec_fn().

Signed-off-by: Maxim Patlasov mpatla...@openvz.org
Acked-by: Dmitry Monakhov dmonak...@openvz.org
---
 drivers/block/ploop/io_direct.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 793bcc5..0183b0f 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -1487,7 +1487,6 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
struct request_queue * q;
struct extent_map * em;
int i;
-   struct bvec_merge_data bm_data;
 
if (orig_bio->bi_size == 0) {
bio->bi_vcnt   = 0;
@@ -1535,19 +1534,19 @@ dio_fastmap(struct ploop_io * io, struct bio * orig_bio,
bio->bi_size = 0;
bio->bi_vcnt = 0;
 
-   bm_data.bi_bdev = bio->bi_bdev;
-   bm_data.bi_sector = bio->bi_sector;
-   bm_data.bi_size = 0;
-   bm_data.bi_rw = bio->bi_rw;
-
for (i = 0; i < orig_bio->bi_vcnt; i++) {
struct bio_vec * bv = &bio->bi_io_vec[i];
+   struct bvec_merge_data bm_data = {
+   .bi_bdev = bio->bi_bdev,
+   .bi_sector = bio->bi_sector,
+   .bi_size = bio->bi_size,
+   .bi_rw = bio->bi_rw,
+   };
if (q->merge_bvec_fn(q, &bm_data, bv) < bv->bv_len) {
io->plo->st.fast_neg_backing++;
return 1;
}
bio->bi_size += bv->bv_len;
-   bm_data.bi_size = bio->bi_size;
bio->bi_vcnt++;
}
return 0;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
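
The hazard fixed here, a callback that overwrites fields of the struct it is
handed, can be reproduced in a few lines of userspace C. This is a sketch
only: struct merge_data, merge_fn() and the two constants are hypothetical
stand-ins, and the callback's remapping rule is invented to make the drift
visible; it is not what raid0_mergeable_bvec() actually computes.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for struct bvec_merge_data. */
struct merge_data {
	unsigned long sector;
	unsigned int size;
};

#define REMAP_OFFSET	100UL	/* assumed remapping done by the callee */
#define DEVICE_END	250UL	/* assumed limit checked by the callee */

/* Toy merge callback: remaps the sector in place, then checks a limit. */
static unsigned int merge_fn(struct merge_data *bvm, unsigned int len)
{
	assert(bvm != 0);
	bvm->sector += REMAP_OFFSET;	/* callee scribbles on caller's data */
	return bvm->sector < DEVICE_END ? len : 0;
}

/* Reuses one struct across iterations: the remapped sector accumulates. */
static int add_vecs_stale(unsigned long start, int nvecs, unsigned int len)
{
	struct merge_data bvm = { .sector = start, .size = 0 };
	int i, accepted = 0;

	for (i = 0; i < nvecs; i++) {
		if (merge_fn(&bvm, len) < len)
			break;
		accepted++;
		bvm.size = accepted * len;
	}
	return accepted;
}

/* Rebuilds the struct from scratch before every call, as the patch does. */
static int add_vecs_fresh(unsigned long start, int nvecs, unsigned int len)
{
	unsigned int size = 0;
	int i, accepted = 0;

	for (i = 0; i < nvecs; i++) {
		struct merge_data bvm = { .sector = start, .size = size };

		if (merge_fn(&bvm, len) < len)
			break;
		size += len;
		accepted++;
	}
	return accepted;
}
```

Starting at sector 10 with four vectors, the stale variant is rejected on the
third call because the sector has drifted to 310, while the fresh variant
accepts all four.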

[Devel] [PATCH RHEL7 COMMIT] ms/percpu-refcount: consistently use plain (non-sched) RCU

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 41721ced765e1156651d31c8b9deb0111340e984
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:22 2015 +0400

ms/percpu-refcount: consistently use plain (non-sched) RCU

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

percpu_ref_get/put() are using preempt_disable/enable() while
percpu_ref_kill() is using plain call_rcu() instead of
call_rcu_sched().  This is buggy as grace periods of the two may not
match.  Fix it by using plain RCU in percpu_ref_get/put().

(I suggested using sched RCU in the first place but there's no actual
 benefit in doing so unless we're gonna introduce different variants
 of get/put to be called while preemption is already disabled, which we
 definitely shouldn't.)

Signed-off-by: Tejun Heo t...@kernel.org
Reported-by: Rusty Russell ru...@rustcorp.com.au
Acked-by: Kent Overstreet koverstr...@google.com
(cherry picked from commit 6a24474da83ea7c8b7d32f05f858b1259994067a)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 include/linux/percpu-refcount.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 24b31ef..abe1411 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -85,7 +85,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   preempt_disable();
+   rcu_read_lock();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -94,7 +94,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
else
atomic_inc(&ref->count);
 
-   preempt_enable();
+   rcu_read_unlock();
 }
 
 /**
@@ -107,7 +107,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 {
unsigned __percpu *pcpu_count;
 
-   preempt_disable();
+   rcu_read_lock();
 
pcpu_count = ACCESS_ONCE(ref->pcpu_count);
 
@@ -116,7 +116,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
else if (unlikely(atomic_dec_and_test(&ref->count)))
ref->release(ref);
 
-   preempt_enable();
+   rcu_read_unlock();
 }
 
 #endif
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
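
For readers following the percpu_ref series, the get/put/kill lifecycle can
be modelled without any RCU at all. The sketch below is a single-threaded
userspace approximation under stated assumptions: struct pref and the pref_*
functions are hypothetical names, the "CPU" is passed in explicitly, the RCU
grace period that the kernel waits for in kill is reduced to a comment, and
kill is assumed to drop the initial reference itself.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS		4
#define COUNT_BIAS	(1U << 31)	/* keeps the atomic part from hitting 0 early */

/* Hypothetical userspace model of a percpu refcount. */
struct pref {
	unsigned int count;		/* stands in for the atomic_t */
	unsigned int pcpu[NR_CPUS];	/* per-CPU deltas; individual slots may wrap */
	bool percpu_mode;
	bool released;			/* set where ->release() would be called */
};

static void pref_init(struct pref *r)
{
	int cpu;

	r->count = 1 + COUNT_BIAS;	/* the initial reference, plus the bias */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		r->pcpu[cpu] = 0;
	r->percpu_mode = true;
	r->released = false;
}

static void pref_get(struct pref *r, int cpu)
{
	if (r->percpu_mode)
		r->pcpu[cpu]++;
	else
		r->count++;
}

static void pref_put(struct pref *r, int cpu)
{
	if (r->percpu_mode)
		r->pcpu[cpu]--;		/* cannot detect zero in this mode */
	else if (--r->count == 0)
		r->released = true;
}

/* Switch to single-counter mode, fold in the per-CPU deltas, drop initial ref. */
static void pref_kill(struct pref *r)
{
	unsigned int sum = 0;
	int cpu;

	assert(r->percpu_mode);
	r->percpu_mode = false;
	/* The kernel waits for a grace period here so no CPU still uses pcpu[]. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += r->pcpu[cpu];	/* unsigned wraparound cancels out */
	r->count += sum - COUNT_BIAS;
	pref_put(r, 0);
}
```

A get on one CPU and a put on another leave one slot at 1 and another wrapped
around, yet the sum is 0; that is why zero can only be detected after kill
has folded everything into the single counter.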

[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: use percpu refcnt for cgroup_subsys_states

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit b1753091f010a49bcd0a89aa23306ac816302f9c
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 14:49:27 2015 +0400

ms/cgroup: use percpu refcnt for cgroup_subsys_states

Patchset description:

Pulling upstream patches converting css refcnt to percpu_ref.

https://jira.sw.ru/browse/PSBM-34174

Kent Overstreet (2):
  percpu: implement generic percpu refcounting
  percpu-refcount: Don't use silly cmpxchg()

Tejun Heo (9):
  percpu-refcount: consistently use plain (non-sched) RCU
  percpu-refcount: cosmetic updates
  percpu-refcount: add __must_check to percpu_ref_init() and don't use
ACCESS_ONCE() in percpu_ref_kill_rcu()
  percpu-refcount: implement percpu_ref_cancel_init()
  percpu-refcount: implement percpu_tryget() along with
percpu_ref_kill_and_confirm()
  percpu-refcount: use RCU-sched insted of normal RCU
  cgroup: reorder the operations in cgroup_destroy_locked()
  cgroup: split cgroup destruction into two steps
  cgroup: use percpu refcnt for cgroup_subsys_states

===
This patch description:

From: Tejun Heo t...@kernel.org

A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For high-IOPS configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.

In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.

The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved by collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already split destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.

This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.

v2: - init_cgroup_css() was calling percpu_ref_init() without checking
  the return value.  This causes two problems - the obvious lack
  of error handling and percpu_ref_init() being called from
  cgroup_init_subsys() before the allocators are up, which
  triggers warnings but doesn't cause actual problems as the
  refcnt isn't used for roots anyway.  Fix both by moving
  percpu_ref_init() to cgroup_create().

- The base references were put too early by
  percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
  refs one extra time.  This wasn't noticeable because css's go
  through another RCU grace period before being freed.  Update
  cgroup_destroy_locked() to grab an extra reference before
  killing the refcnts.  This problem was noticed by Kent.

Signed-off-by: Tejun Heo t...@kernel.org
Reviewed-by: Kent Overstreet koverstr...@google.com
Acked-by: Li Zefan lize...@huawei.com
Cc: Michal Hocko mho...@suse.cz
Cc: Mike Snitzer snit...@redhat.com
Cc: Vivek Goyal vgo...@redhat.com
Cc: Alasdair G. Kergon a...@redhat.com
Cc: Jens Axboe ax...@kernel.dk
Cc: Mikulas Patocka mpato...@redhat.com
Cc: Glauber Costa glom...@gmail.com
(cherry picked from commit d3daf28da16a30af95bfb303189a634a87606725)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
include/linux/cgroup.h
kernel/cgroup.c
---
 include/linux/cgroup.h |  27 +++-
 kernel/cgroup.c| 166 +++--
 2

[Devel] [PATCH RHEL7 COMMIT] ve/devtmpfs: lightweight virtualization

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 22255fb606cfd53fb98b11c62b854c0de5a4c713
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:59 2015 +0400

ve/devtmpfs: lightweight virtualization

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert "ve/devtmpfs: Create required devices on container startup"
  Revert "ve/devtmpfs: pass proper options string"
  Revert "devtmpfs: containerize it with new obj ns operation"
  Revert "fs: add data pointer to mount_ns()"
  Revert "devtmpfs: per-VE mounts introduced"
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

All this patch does is provide each VE with its own empty single tmpfs
mount, which appears on an attempt to mount devtmpfs. It's up to the
userspace to populate this fs on container start; all kernel requests to
create a device node inside a VE are ignored.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 67 +
 include/linux/ve.h  |  1 +
 kernel/ve/ve.c  |  4 +++
 3 files changed, 72 insertions(+)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f59b798..daf97ee 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -23,6 +23,7 @@
 #include <linux/ramfs.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/ve.h>
 #include "base.h"
 
 static struct task_struct *thread;
@@ -53,9 +54,61 @@ static int __init mount_param(char *str)
 }
 __setup("devtmpfs.mount=", mount_param);
 
+#ifdef CONFIG_VE
+static int ve_test_dev_sb(struct super_block *s, void *p)
+{
+   return get_exec_env()->dev_sb == s;
+}
+
+static int ve_set_dev_sb(struct super_block *s, void *p)
+{
+   struct ve_struct *ve = get_exec_env();
+   int error;
+
+   error = set_anon_super(s, p);
+   if (!error) {
+   BUG_ON(ve->dev_sb);
+   ve->dev_sb = s;
+   atomic_inc(&s->s_active);
+   }
+   return error;
+}
+
+static struct dentry *ve_dev_mount(struct file_system_type *fs_type, int flags,
+ const char *dev_name, void *data)
+{
+   int (*fill_super)(struct super_block *, void *, int);
+   struct super_block *s;
+   int error;
+
+#ifdef CONFIG_TMPFS
+   fill_super = shmem_fill_super;
+#else
+   fill_super = ramfs_fill_super;
+#endif
+   s = sget(fs_type, ve_test_dev_sb, ve_set_dev_sb, flags, NULL);
+   if (IS_ERR(s))
+   return ERR_CAST(s);
+
+   if (!s->s_root) {
+   error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
+   if (error) {
+   deactivate_locked_super(s);
+   return ERR_PTR(error);
+   }
+   s->s_flags |= MS_ACTIVE;
+   }
+   return dget(s->s_root);
+}
+#endif /* CONFIG_VE */
+
 static struct dentry *dev_mount(struct file_system_type *fs_type, int flags,
  const char *dev_name, void *data)
 {
+#ifdef CONFIG_VE
+   if (!ve_is_super(get_exec_env()))
+   return ve_dev_mount(fs_type, flags, dev_name, data);
+#endif
 #ifdef CONFIG_TMPFS
return mount_single(fs_type, flags, data, shmem_fill_super);
 #else
@@ -79,6 +132,16 @@ static inline int is_blockdev(struct device *dev)
 static inline int is_blockdev(struct device *dev) { return 0; }
 #endif
 
+#ifdef CONFIG_VE
+static inline int is_ve_dev(struct device *dev)
+{
+   return dev->class && dev->class->namespace == ve_namespace &&
+   ve_namespace(dev) != get_ve0();
+}
+#else
+static inline int is_ve_dev(struct device *dev) { return 0; }
+#endif
+
 int

[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use sb->s_fs_info

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 17dd96483ff558d44c98c3f8bcb04a86aca843a5
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:43 2015 +0400

ve/binfmt_misc: do not use sb->s_fs_info

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls"
  Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
  Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super"
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

When we virtualized binfmt_misc, we made sb->s_fs_info store a pointer
to binfmt_misc struct. At the same time, we store a pointer to the owner
ve_struct in sb->s_ns and a pointer to the same binfmt_misc struct in
ve_struct->binfmt_misc. That said, we don't actually need to use
s_fs_info, because we can get the binfmt_misc by dereferencing
sb->s_ns->binfmt_misc.

Using sb->s_fs_info instead of sb->s_ns will allow us to revert our
patches introducing sb->s_ns.

This could be merged to 0b0dbb644794 (VE/BINFTM: virtualization).

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 7e760d2..d0cb80c 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,6 +65,8 @@ struct binfmt_misc {
int entry_count;
 };
 
+#define BINFMT_MISC(sb)    (((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+
 /* 
  * Check if we support the binfmt
  * if we do, return the node, else NULL
@@ -541,7 +543,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
Node *e = file_inode(file)->i_private;
int res = parse_command(buffer, count);
struct super_block *sb = file->f_path.dentry->d_sb;
-   struct binfmt_misc *bm_data = sb->s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(sb);
 
switch (res) {
case 1: clear_bit(Enabled, e-flags);
@@ -576,7 +578,7 @@ static ssize_t bm_register_write(struct file *file, const 
char __user *buffer,
struct inode *inode;
struct dentry *root, *dentry;
struct super_block *sb = file-f_path.dentry-d_sb;
-   struct binfmt_misc *bm_data = sb-s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(sb);
int err = 0;
 
e = create_entry(buffer, count);
@@ -641,7 +643,7 @@ static const struct file_operations bm_register_operations 
= {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t 
*ppos)
 {
-   struct binfmt_misc *bm_data = file-f_dentry-d_sb-s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(file-f_dentry-d_sb);
char *s = bm_data-enabled ? enabled\n : disabled\n;
 
return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
@@ -650,7 +652,7 @@ bm_status_read(struct file *file, char __user *buf, size_t 
nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file * file, const char __user * buffer,
size_t count, loff_t *ppos)
 {
-   struct binfmt_misc *bm_data = file-f_dentry-d_sb-s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(file-f_dentry-d_sb);
int res = parse_command(buffer, count);
struct dentry *root;
 
@@ -681,7 +683,7 @@ static const struct file_operations bm_status_operations = {
 
 static void bm_put_super(struct super_block *sb)
 {
-   struct binfmt_misc *bm_data = sb-s_fs_info;
+   struct binfmt_misc *bm_data = BINFMT_MISC(sb);
struct ve_struct *ve = sb-s_ns;
 
bm_data-enabled = 0;
@@ -723,7 +725,6 @@ static int bm_fill_super(struct super_block * sb, void * 
data, int silent)
}
 
sb-s_op = s_ops;
-   sb-s_fs_info = bm_data;
 
bm_data-enabled = 1;
get_ve(ve);

[Devel] [PATCH RHEL7 COMMIT] Revert fs: add data pointer to mount_ns()

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 8d9d5a10d874b4d9f66f1af3fdcabbe9aee396f2
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:58 2015 +0400

Revert fs: add data pointer to mount_ns()

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 69e6ae7f750fc862c9324441130abbff2c8b528e.

This is only needed for per-ns filesystems that can accept user options.
There is only one such filesystem, devtmpfs, which we made per-container.
Since devtmpfs virtualization is going to be dropped, this
patch is not necessary.
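
As a rough illustration of why a separate options pointer is only useful
to per-ns filesystems that take user options, the sketch below models the
lookup that mount_ns() delegates to sget(): one opaque pointer is both the
key used to find an existing superblock and the value stored in a new one.
Everything here (toy_sb, toy_sget, the fixed-size table) is a hypothetical
userspace model, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy superblock: "key" plays the role of the per-namespace pointer. */
struct toy_sb { void *key; int in_use; };

#define TOY_MAX_SB 8
static struct toy_sb toy_table[TOY_MAX_SB];

/* Same idea as ns_test_super(): a superblock matches if its key matches. */
static int toy_test_super(struct toy_sb *sb, void *data)
{
	return sb->key == data;
}

/* Same idea as ns_set_super(): remember the key in a new superblock. */
static int toy_set_super(struct toy_sb *sb, void *data)
{
	sb->key = data;
	return 0;
}

/* Toy sget(): reuse the matching superblock, else claim a free slot. */
static struct toy_sb *toy_sget(void *data)
{
	size_t i;

	for (i = 0; i < TOY_MAX_SB; i++)
		if (toy_table[i].in_use && toy_test_super(&toy_table[i], data))
			return &toy_table[i];

	for (i = 0; i < TOY_MAX_SB; i++) {
		if (!toy_table[i].in_use) {
			toy_table[i].in_use = 1;
			toy_set_super(&toy_table[i], data);
			return &toy_table[i];
		}
	}
	return NULL; /* table full */
}
```

Because the single pointer is consumed as the namespace key, there is no
room left to pass a mount options string unless an extra argument is
added, which is what the reverted commit did.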

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 4 ++--
 fs/binfmt_misc.c| 2 +-
 fs/nfsd/nfsctl.c| 2 +-
 fs/super.c  | 4 ++--
 include/linux/fs.h  | 2 +-
 ipc/mqueue.c| 2 +-
 net/sunrpc/rpc_pipe.c   | 2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 349d6eb..6f4ba37 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -59,9 +59,9 @@ static struct dentry *dev_mount(struct file_system_type 
*fs_type, int flags,
  const char *dev_name, void *data)
 {
 #ifdef CONFIG_TMPFS
-   return mount_ns(fs_type, flags, data, get_exec_env(), shmem_fill_super);
+   return mount_ns(fs_type, flags, data, shmem_fill_super);
 #else
-   return mount_ns(fs_type, flags, data, get_exec_env(), ramfs_fill_super);
+   return mount_ns(fs_type, flags, data, ramfs_fill_super);
 #endif
 }
 
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 460d53f..7e760d2 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -734,7 +734,7 @@ static int bm_fill_super(struct super_block * sb, void * 
data, int silent)
 static struct dentry *bm_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
 {
-   return mount_ns(fs_type, flags, data, get_exec_env(), bm_fill_super);
+   return mount_ns(fs_type, flags, get_exec_env(), bm_fill_super);
 }
 
 static struct linux_binfmt misc_format = {
diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 9b690c9..7411a56 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1126,7 +1126,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 static struct dentry *nfsd_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
-	return mount_ns(fs_type, flags, NULL, current->nsproxy->net_ns, nfsd_fill_super);
+	return mount_ns(fs_type, flags, current->nsproxy->net_ns, nfsd_fill_super);
 }
 
 static void nfsd_umount(struct super_block *sb)
diff --git a/fs/super.c b/fs/super.c
index 7f316e8..c9b47bf 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -890,11 +890,11 @@ static int ns_set_super(struct super_block *sb, void *data)
 }
 
 struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
-	void *data, void *ns, int (*fill_super)(struct super_block *, void *, int))
+	void *data, int (*fill_super)(struct super_block *, void *, int))
 {
struct super_block *sb;
 
-   sb = sget(fs_type, ns_test_super, ns_set_super, flags, ns);
+   sb = sget(fs_type, ns_test_super, ns_set_super, flags, data);
if (IS_ERR(sb))
return

[Devel] [PATCH RHEL7 COMMIT] Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 9b72ce16b191d84da03da83d5ccec29de8854686
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:41 2015 +0400

Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

This reverts commit 9e7411c5c3b53937171ef962ce7381337f125b28.

This patch is no longer needed, because none of the mount_ns users
needs sb->s_fs_info for its own data any more.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/nfs/dns_resolve.c  | 2 +-
 fs/nfsd/nfs4recover.c | 4 ++--
 fs/super.c| 4 ++--
 include/linux/fs.h| 2 --
 ipc/mqueue.c  | 6 +++---
 net/sunrpc/clnt.c | 2 +-
 net/sunrpc/rpc_pipe.c | 4 ++--
 7 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c
index dda6202..d25f10f 100644
--- a/fs/nfs/dns_resolve.c
+++ b/fs/nfs/dns_resolve.c
@@ -415,7 +415,7 @@ static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event,
 			   void *ptr)
 {
 	struct super_block *sb = ptr;
-	struct net *net = sb->s_ns;
+	struct net *net = sb->s_fs_info;
 	struct nfs_net *nn = net_generic(net, nfs_net_id);
 	struct cache_detail *cd = nn->nfs_dns_resolve;
 	int ret = 0;
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index c714602..4c86b18 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -693,7 +693,7 @@ cld_pipe_downcall(struct file *filp, const char __user *src, size_t mlen)
 	struct cld_upcall *tmp, *cup;
 	struct cld_msg __user *cmsg = (struct cld_msg __user *)src;
 	uint32_t xid;
-	struct nfsd_net *nn = net_generic(filp->f_dentry->d_sb->s_ns,
+	struct nfsd_net *nn = net_generic(filp->f_dentry->d_sb->s_fs_info,
 						nfsd_net_id);
 	struct cld_net *cn = nn->cld_net;
 
@@ -1353,7 +1353,7 @@ static int
 rpc_pipefs_event(struct notifier_block *nb, unsigned long event, void *ptr)
 {
 	struct super_block *sb = ptr;
-	struct net *net = sb->s_ns;
+	struct net *net = sb->s_fs_info;
 	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
 	struct cld_net *cn = nn->cld_net;
 	struct dentry *dentry;
diff --git a/fs/super.c b/fs/super.c
index c9b47bf..341650d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -880,12 +880,12 @@ EXPORT_SYMBOL(kill_litter_super);
 
 static int ns_test_super(struct super_block *sb, void *data)
 {
-	return sb->s_ns == data;
+	return sb->s_fs_info == data;
 }
 
 static int ns_set_super(struct super_block *sb, void *data)
 {
-	sb->s_ns = data;
+	sb->s_fs_info = data;
 	return set_anon_super(sb, NULL);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 68cec28..553bca3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1457,8 +1457,6 @@ struct super_block {
 	unsigned int		s_max_links;
 	fmode_t			s_mode;
 
-	void			*s_ns;		/* Pointer to namespace */
-
 	/* Granularity of c/m/atime in ns.
 	   Cannot be worse than a second */
 	u32		   s_time_gran;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 18620cd..c508938 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -104,7 +104,7 @@ static inline struct mqueue_inode_info *MQUEUE_I(struct inode *inode)
  */
 static inline struct ipc_namespace *__get_ns_from_inode(struct inode *inode)
 {
-	return get_ipc_ns(inode->i_sb->s_ns);
+	return get_ipc_ns(inode->i_sb->s_fs_info);
 }
 
 static struct ipc_namespace *get_ns_from_inode(struct inode *inode)
@@ -407,7 +407,7 @@ static void mqueue_evict_inode(struct inode *inode)
 		user->mq_bytes -= mq_bytes;
 		/*
 		 * get_ns_from_inode() ensures that the
-		 * (ipc_ns = sb->s_ns) is either a valid ipc_ns
+		 * (ipc_ns = sb->s_fs_info) is either a valid ipc_ns
 		 * to which we now hold a reference, or it is NULL.
 		 * We can't put it here under mq_lock, though.
 		 */
@@ -1418,7 +1418,7 @@ int mq_init_ns(struct ipc_namespace *ns)
 
 void mq_clear_sbinfo(struct

[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: do not use s_ns

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit a98a90ea907f522f1ae6ff0e1c6e78a39ade2494
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:44 2015 +0400

ve/binfmt_misc: do not use s_ns

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

Since 9e7411c5c3b5 was reverted, we must use sb->s_fs_info for storing a
pointer to the namespace.

This could be merged to 0b0dbb644794 (VE/BINFTM: virtualization).

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index d0cb80c..4487153 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -65,7 +65,7 @@ struct binfmt_misc {
 	int entry_count;
 };
 
-#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_ns)->binfmt_misc)
+#define BINFMT_MISC(sb)	(((struct ve_struct *)(sb)->s_fs_info)->binfmt_misc)
 
 /* 
  * Check if we support the binfmt
@@ -684,7 +684,7 @@ static const struct file_operations bm_status_operations = {
 static void bm_put_super(struct super_block *sb)
 {
 	struct binfmt_misc *bm_data = BINFMT_MISC(sb);
-	struct ve_struct *ve = sb->s_ns;
+	struct ve_struct *ve = sb->s_fs_info;
 
 	bm_data->enabled = 0;
 	put_ve(ve);
@@ -703,7 +703,7 @@ static int bm_fill_super(struct super_block * sb, void * data, int silent)
 		[3] = {"register", &bm_register_operations, S_IWUSR},
 		/* last one */ {""}
 	};
-	struct ve_struct *ve = sb->s_ns;
+	struct ve_struct *ve = data;
 	struct binfmt_misc *bm_data = ve->binfmt_misc;
 	int err;
 

[Devel] [PATCH RHEL7 COMMIT] Revert devtmpfs: containerize it with new obj ns operation

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 968c8efb7981f87f8bc0616741edb6c0bc556d76
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:57 2015 +0400

Revert devtmpfs: containerize it with new obj ns operation

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 53343c3b231ed36d973e6d3ac2ab9ad7b7c87e25.

The whole point of devtmpfs is simplifying the system bootup logic.
There is absolutely no point in virtualizing it, because on container
start we create devices from a hardcoded list (these are ttys, which I'd
prefer not to create at all using ptys instead, but we have to live with
it for compatibility reasons for now). This means that it is enough to
provide the userspace with per VE tmpfs mount called devtmpfs and
teach it to make device nodes from a hardcoded list on container start
instead of implementing devtmpfs virtualization in the kernel. The
kernel part will be done by the following patches.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c| 37 ++---
 fs/sysfs/ve.c  |  9 -
 include/linux/kobject_ns.h |  2 --
 3 files changed, 2 insertions(+), 46 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 0448af8..349d6eb 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -366,46 +366,13 @@ int devtmpfs_mount(const char *mntdir)
 
 static DECLARE_COMPLETION(setup_done);
 
-static struct path set_dev_pwd(struct device *dev)
-{
-	const struct kobj_ns_type_operations *ops;
-	struct path pwd = current->fs->pwd;
-
-	ops = kobj_ns_ops(&dev->kobj);
-	path_get(&pwd);
-
-	if (ops && ops->devtmpfs) {
-		const struct path *devtmpfs_root;
-
-		devtmpfs_root = ops->devtmpfs(&dev->kobj);
-		BUG_ON(!devtmpfs_root);
-		set_fs_pwd(current->fs, devtmpfs_root);
-	}
-	return pwd;
-}
-
-static void drop_dev_pwd(struct path *pwd)
-{
-	set_fs_pwd(current->fs, pwd);
-	path_put(pwd);
-}
-
 static int handle(const char *name, umode_t mode, kuid_t uid, kgid_t gid,
 		  struct device *dev)
 {
-	struct path pwd;
-	int err;
-
-	pwd = set_dev_pwd(dev);
-
 	if (mode)
-		err = handle_create(name, mode, uid, gid, dev);
+		return handle_create(name, mode, uid, gid, dev);
 	else
-		err = handle_remove(name, dev);
-
-	/* Restore kthread pwd */
-	drop_dev_pwd(&pwd);
-	return err;
+		return handle_remove(name, dev);
 }
 
 static int devtmpfsd(void *p)
diff --git a/fs/sysfs/ve.c b/fs/sysfs/ve.c
index 79ad6d5..bb28a4b 100644
--- a/fs/sysfs/ve.c
+++ b/fs/sysfs/ve.c
@@ -43,21 +43,12 @@ const void *ve_namespace(struct device *dev)
 	return (!dev->groups && dev_get_drvdata(dev)) ? dev_get_drvdata(dev) : get_ve0();
 }
 
-static const struct path *ve_devtmpfs(const struct kobject *kobj)
-{
-	struct device *dev = container_of(kobj, struct device, kobj);
-	const struct ve_struct *ve = dev->class->namespace(dev);
-
-	return &ve->devtmpfs_root;
-}
-
 struct kobj_ns_type_operations ve_ns_type_operations = {
.type = KOBJ_NS_TYPE_VE,
.grab_current_ns =

[Devel] [PATCH RHEL7 COMMIT] Revert devtmpfs: per-VE mounts introduced

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 3fd8ef28e629c3ec00144f83249628244903876d
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:58 2015 +0400

Revert devtmpfs: per-VE mounts introduced

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit e85a799b629d5e28c8931ddd9127cf18d501745c.

More devtmpfs virtualization crap to drop. Will be reworked.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com

Conflicts:
include/linux/ve.h
kernel/ve/ve.c
---
 drivers/base/devtmpfs.c | 28 ++--
 include/linux/device.h  |  4 
 include/linux/ve.h  |  3 ---
 kernel/ve/ve.c  |  8 
 4 files changed, 2 insertions(+), 41 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index 6f4ba37..f59b798 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -23,8 +23,6 @@
 #include <linux/ramfs.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
-#include <linux/fs_struct.h>
-#include <linux/ve.h>
 #include "base.h"
 
 static struct task_struct *thread;
@@ -59,9 +57,9 @@ static struct dentry *dev_mount(struct file_system_type 
*fs_type, int flags,
  const char *dev_name, void *data)
 {
 #ifdef CONFIG_TMPFS
-   return mount_ns(fs_type, flags, data, shmem_fill_super);
+   return mount_single(fs_type, flags, data, shmem_fill_super);
 #else
-   return mount_ns(fs_type, flags, data, ramfs_fill_super);
+   return mount_single(fs_type, flags, data, ramfs_fill_super);
 #endif
 }
 
@@ -387,7 +385,6 @@ static int devtmpfsd(void *p)
 		goto out;
 	sys_chdir("/.."); /* will traverse into overmounted root */
 	sys_chroot(".");
-	get_fs_root(current->fs, &get_exec_env()->devtmpfs_root);
 	complete(&setup_done);
 	while (1) {
 		spin_lock(&req_lock);
@@ -408,33 +405,12 @@ static int devtmpfsd(void *p)
 		spin_unlock(&req_lock);
 		schedule();
 	}
-	path_put(&get_exec_env()->devtmpfs_root);
 	return 0;
 out:
 	complete(&setup_done);
 	return *err;
 }
 
-int ve_init_devtmpfs(void *data)
-{
-	struct ve_struct *ve = data;
-	struct vfsmount *mnt;
-
-	mnt = kern_mount_data(&dev_fs_type, ve);
-	if (IS_ERR(mnt))
-		return PTR_ERR(mnt);
-	ve->devtmpfs_root.mnt = mnt;
-	ve->devtmpfs_root.dentry = mnt->mnt_root;
-	return 0;
-}
-
-void ve_fini_devtmpfs(void *data)
-{
-	struct ve_struct *ve = data;
-
-	kern_unmount(ve->devtmpfs_root.mnt);
-}
-
 /*
  * Create devtmpfs instance, driver-core devices will add their device
  * nodes here.
diff --git a/include/linux/device.h b/include/linux/device.h
index df5152f..2c9c764 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1005,14 +1005,10 @@ extern void put_device(struct device *dev);
 extern int devtmpfs_create_node(struct device *dev);
 extern int devtmpfs_delete_node(struct device *dev);
 extern int devtmpfs_mount(const char *mntdir);
-extern int ve_init_devtmpfs(void *data);
-extern void ve_fini_devtmpfs(void *data);
 #else
 static inline int devtmpfs_create_node(struct device *dev) { return 0; }
 static inline int devtmpfs_delete_node(struct device *dev) { return 0; }
 static inline int

[Devel] [PATCH RHEL7 COMMIT] ve/binfmt_misc: destroy all nodes on ve stop

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0ea1f95684407db5892760b5a58a24003571f043
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:44 2015 +0400

ve/binfmt_misc: destroy all nodes on ve stop

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert VE/VFS: use sb->s_ns member to store namespace for mount_ns()
calls
  Revert ve/sunrpc: use correct pointer to net_namespace in auth_gss.c
  Revert nfsd/sunrpc/mqueue: use sb->s_ns instead of data in
fill_super
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

Each registered binfmt_misc node pins binfmt_misc mount point, which in
turn pins the owner ve. This means that if we don't clean up binfmt_misc
nodes on ve stop, the mount point as well as the ve struct will leak.
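
The pinning chain can be pictured with a small userspace model. This is
only a sketch under simplifying assumptions: toy_ve, toy_node and the
mount_pins counter are hypothetical stand-ins for the real ve_struct,
Node and mount references, and the list is singly linked instead of a
kernel list_head.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node: stands in for a registered binfmt_misc entry. */
struct toy_node { struct toy_node *next; };

/* Toy VE: every registered node pins the mount, which pins the VE. */
struct toy_ve {
	struct toy_node *entries;
	int mount_pins;
};

static void toy_register_node(struct toy_ve *ve, struct toy_node *n)
{
	n->next = ve->entries;
	ve->entries = n;
	ve->mount_pins++;
}

/* Unlink the first node and drop the pin it held. */
static void toy_kill_node(struct toy_ve *ve, struct toy_node *n)
{
	assert(ve->entries == n);
	ve->entries = n->next;
	n->next = NULL;
	ve->mount_pins--;
}

/* Same loop shape as ve_binfmt_fini(): drain the list on VE stop. */
static void toy_binfmt_fini(struct toy_ve *ve)
{
	while (ve->entries != NULL)
		toy_kill_node(ve, ve->entries);
}

/* The VE can only go away once nothing pins its mount any more. */
static int toy_ve_can_be_freed(const struct toy_ve *ve)
{
	return ve->mount_pins == 0;
}
```

Without the drain step, mount_pins never reaches zero in this model,
which corresponds to the leak described above.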

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/binfmt_misc.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 4487153..90c306e 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -752,16 +752,42 @@ static struct file_system_type bm_fs_type = {
 };
 MODULE_ALIAS_FS("binfmt_misc");
 
+static void ve_binfmt_fini(void *data)
+{
+	struct ve_struct *ve = data;
+	struct binfmt_misc *bm_data = ve->binfmt_misc;
+
+	if (!bm_data)
+		return;
+
+	/*
+	 * XXX: Note we don't take any locks here. This is safe as long as
+	 * nobody uses binfmt_misc outside the owner ve.
+	 */
+	while (!list_empty(&bm_data->entries))
+		kill_node(bm_data, list_first_entry(
+				&bm_data->entries, Node, list));
+}
+
+static struct ve_hook ve_binfmt_hook = {
+	.fini		= ve_binfmt_fini,
+	.priority	= HOOK_PRIO_DEFAULT,
+	.owner		= THIS_MODULE,
+};
+
 static int __init init_misc_binfmt(void)
 {
 	int err = register_filesystem(&bm_fs_type);
-	if (!err)
+	if (!err) {
 		insert_binfmt(&misc_format);
+		ve_hook_register(VE_SS_CHAIN, &ve_binfmt_hook);
+	}
 	return err;
 }
 
 static void __exit exit_misc_binfmt(void)
 {
+	ve_hook_unregister(&ve_binfmt_hook);
 	unregister_binfmt(&misc_format);
 	unregister_filesystem(&bm_fs_type);
 }

[Devel] [PATCH RHEL7 COMMIT] Revert ve/devtmpfs: Create required devices on container startup

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0cdfb581770d883cea99f30e49e3de1583ab6fc1
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:56 2015 +0400

Revert ve/devtmpfs: Create required devices on container startup

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 5cd1d17ff1b6a8f476ab6f4cd0a6830fbffe43f2.

We don't actually need separate null, zero, and other mem class devices
inside a VE. The patch being reverted added them merely for kdevtmpfs to
create nodes for these devices under /dev. This work can and should be
done by vzctl on container start, so drop this patch.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/char/mem.c | 20 ---
 kernel/ve/ve.c | 56 --
 2 files changed, 76 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index c486c83..a3653f7 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -30,7 +30,6 @@
 #include <linux/io.h>
 #include <linux/aio.h>
 #include <linux/security.h>
-#include <linux/ve.h>
 
 #include <asm/uaccess.h>
 
@@ -924,20 +923,7 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
return NULL;
 }
 
-#ifdef CONFIG_VE
-static struct class mem_class_base = {
-	.name		= "mem",
-	.devnode	= mem_devnode,
-	.ns_type	= &ve_ns_type_operations,
-	.namespace	= ve_namespace,
-	.owner		= THIS_MODULE,
-};
-
-struct class *mem_class = &mem_class_base;
-EXPORT_SYMBOL(mem_class);
-#else
 static struct class *mem_class;
-#endif
 
 static int __init chr_dev_init(void)
 {
@@ -951,17 +937,11 @@ static int __init chr_dev_init(void)
 	if (register_chrdev(MEM_MAJOR, "mem", &memory_fops))
 		printk("unable to get major %d for memory devs\n", MEM_MAJOR);
 
-#ifdef CONFIG_VE
-	err = class_register(&mem_class_base);
-	if (err)
-		return err;
-#else
 	mem_class = class_create(THIS_MODULE, "mem");
 	if (IS_ERR(mem_class))
 		return PTR_ERR(mem_class);
 
 	mem_class->devnode = mem_devnode;
-#endif
 	for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
 		if (!devlist[minor].name)
 			continue;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 4cd1f8b..cdbb342 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -413,55 +413,6 @@ static void ve_drop_context(struct ve_struct *ve)
 	ve->init_cred = NULL;
 }
 
-static const struct {
-	unsigned int	minor;
-	char		*name;
-} ve_mem_class_devices[] = {
-	{3, "null"},
-	{5, "zero"},
-	{7, "full"},
-	{8, "random"},
-	{9, "urandom"},
-};
-
-extern struct class *mem_class;
-
-static int ve_init_mem_class(struct ve_struct *ve)
-{
-	struct device *dev;
-	dev_t devt;
-	size_t i;
-
-	for (i = 0; i < ARRAY_SIZE(ve_mem_class_devices); i++) {
-		devt = MKDEV(MEM_MAJOR, ve_mem_class_devices[i].minor);
-		dev = device_create(mem_class, NULL, devt,
-				    ve, ve_mem_class_devices[i].name);
-		if (IS_ERR(dev)) {
-			pr_err("Can't create %s (%d)\n",
-			       ve_mem_class_devices[i].name,
-

[Devel] [PATCH RHEL7 COMMIT] Revert ve/devtmpfs: pass proper options string

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0ffbb29c45f5ee709f4fa5dfa52f883cbe4a70f1
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:10:57 2015 +0400

Revert ve/devtmpfs: pass proper options string

Patchset description:

Rework devtmpfs virtualization

Currently, we implement full-featured devtmpfs virtualization for VE:
when a device is created in a VE namespace, we send a signal to
kdevtmpfs to create the devnode on devtmpfs mount corresponding to the
VE. This seems to be over-complicated: all this work can be done from
userspace, because we only have a hardcoded list of devices created
exclusively for VE on container start. Those are tty-related stuff and
mem devices, and we only need the latter to create devtmpfs nodes.
Moreover, it is buggy: ve_stop_ns, which destroys VE devtmpfs mount can
be called before a VE tty device is unregistered, resulting in a KP:

https://jira.sw.ru/browse/PSBM-35077

This patch therefore simplifies it. It makes the kernel only provide a
single empty tmpfs mount per VE, which appears on an attempt to mount
devtmpfs from inside a VE. The content of the fs is to be filled by the
userspace on container start, which will be done in the scope of

https://jira.sw.ru/browse/PSBM-35146

Vladimir Davydov (6):
  Revert ve/devtmpfs: Create required devices on container startup
  Revert ve/devtmpfs: pass proper options string
  Revert devtmpfs: containerize it with new obj ns operation
  Revert fs: add data pointer to mount_ns()
  Revert devtmpfs: per-VE mounts introduced
  devtmpfs: lightweight virtualization

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

===
This patch description:

This reverts commit 1c6719b8aa075de4c9528811839d5f2595ef2994.

This is related to devtmpfs virtualization, which I'm going to drop.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/base/devtmpfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c28e42c..0448af8 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -451,10 +451,9 @@ out:
 int ve_init_devtmpfs(void *data)
 {
struct ve_struct *ve = data;
-   char opts[] = "mode=0755";
struct vfsmount *mnt;
 
-   mnt = kern_mount_data(&dev_fs_type, opts);
+   mnt = kern_mount_data(&dev_fs_type, ve);
if (IS_ERR(mnt))
return PTR_ERR(mnt);
ve->devtmpfs_root.mnt = mnt;

[Devel] [PATCH RHEL7 COMMIT] Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d0856fdc15e0b49540c454b42a11ddf2af70cda6
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 16:42:43 2015 +0400

Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"

Patchset description:

zap sb->s_ns + fix memleak in binfmt_misc

Vladimir Davydov (6):
  binfmt_misc: do not use sb->s_fs_info
  Revert "VE/VFS: use sb->s_ns member to store namespace for mount_ns() calls"
  Revert "ve/sunrpc: use correct pointer to net_namespace in auth_gss.c"
  Revert "nfsd/sunrpc/mqueue: use sb->s_ns instead of data in fill_super"
  binfmt_misc: do not use s_ns
  binfmt_misc: destroy all nodes on ve stop

https://jira.sw.ru/browse/PSBM-39154

Reviewed-by: Cyrill Gorcunov gorcu...@virtuozzo.com

==
This patch description:

This reverts commit 610d54ccee1af63b1b361d18ec4ee9fa5230dea8.

Since commit 9e7411c5c3b5 was reverted, this one is no longer needed
either.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 fs/nfsd/nfsctl.c  | 2 +-
 ipc/mqueue.c  | 2 +-
 net/sunrpc/rpc_pipe.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 7411a56..048d61d 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1113,7 +1113,7 @@ static int nfsd_fill_super(struct super_block * sb, void * data, int silent)
 #endif
/* last one */ {""}
};
-   struct net *net = sb->s_ns;
+   struct net *net = data;
int ret;
 
ret = simple_fill_super(sb, 0x6e667364, nfsd_files);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c508938..6a8f37d 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -309,7 +309,7 @@ err:
 static int mqueue_fill_super(struct super_block *sb, void *data, int silent)
 {
struct inode *inode;
-   struct ipc_namespace *ns = sb->s_ns;
+   struct ipc_namespace *ns = data;
 
sb->s_blocksize = PAGE_CACHE_SIZE;
sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index b8f6185..79681e5 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1395,7 +1395,7 @@ rpc_fill_super(struct super_block *sb, void *data, int silent)
 {
struct inode *inode;
struct dentry *root, *gssd_dentry;
-   struct net *net = sb->s_ns;
+   struct net *net = data;
struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);
int err;
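
All three hunks above switch the fill_super callbacks back to taking their namespace from the opaque data argument instead of a field on the superblock. A minimal standalone sketch of that calling convention follows; toy_ns, toy_sb, toy_fill_super and toy_mount_ns are hypothetical names used only for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a namespace object and a superblock. */
struct toy_ns { int id; };
struct toy_sb { struct toy_ns *ns; };

typedef int (*fill_super_t)(struct toy_sb *sb, void *data);

/* The callback recovers its namespace from the opaque pointer. */
static int toy_fill_super(struct toy_sb *sb, void *data)
{
	struct toy_ns *ns = data;

	assert(ns != NULL);
	sb->ns = ns;
	return 0;
}

/* The mount helper only forwards 'data'; it never interprets it. */
static int toy_mount_ns(struct toy_sb *sb, void *data, fill_super_t fill)
{
	return fill(sb, data);
}
```

The caller decides what data points to, so the same helper can serve callbacks that expect different namespace types.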
 

[Devel] [PATCH RHEL7 COMMIT] Revert "ve/pty: containerize Unix98 pty drivers"

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 1ff0db51541d3bf04c228025cb48de284adb78b2
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:31:49 2015 +0400

Revert "ve/pty: containerize Unix98 pty drivers"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, so nothing needs to
be done on the driver's side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

This reverts commit 79b66035f81e1c8996f2524f26af096e44e2ae4b.

Conflicts:
kernel/ve/ve.c

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 kernel/ve/ve.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index bdfa30d..5025149 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -449,10 +449,6 @@ int ve_start_container(struct ve_struct *ve)
if (err)
goto err_legacy_pty;
 
-   err = ve_unix98_pty_init(ve);
-   if (err)
-   goto err_unix98_pty;
-
err = ve_tty_console_init(ve);
if (err)
goto err_tty_console;
@@ -472,8 +468,6 @@ int ve_start_container(struct ve_struct *ve)
 err_iterate:
ve_tty_console_fini(ve);
 err_tty_console:
-   ve_unix98_pty_fini(ve);
-err_unix98_pty:
ve_legacy_pty_fini(ve);
 err_legacy_pty:
ve_stop_umh(ve);
@@ -506,7 +500,6 @@ void ve_stop_ns(struct pid_namespace *pid_ns)
ve->is_running = 0;
 
ve_tty_console_fini(ve);
-   ve_unix98_pty_fini(ve);
ve_legacy_pty_fini(ve);
 
ve_stop_umh(ve);
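
The error path in ve_start_container() follows the usual goto-unwind pattern: each label undoes the steps that completed before the failure, in reverse order, which is why removing ve_unix98_pty_init() also removes its label and its fini call. A minimal standalone sketch of that pattern, assuming hypothetical step_init()/step_fini() helpers in place of the real per-VE routines:

```c
#include <assert.h>

#define MAX_STEPS 4

/* Hypothetical bookkeeping: which steps were torn down, in order. */
static int undone[MAX_STEPS];
static int nr_undone;

/* Hypothetical init helper: fails (returns -1) only for step 'fail_at'. */
static int step_init(int id, int fail_at)
{
	return id == fail_at ? -1 : 0;
}

static void step_fini(int id)
{
	assert(nr_undone < MAX_STEPS);
	undone[nr_undone++] = id;
}

/* Run steps 1..3; on failure undo the completed ones in reverse order. */
static int start_container(int fail_at)
{
	int err;

	nr_undone = 0;

	err = step_init(1, fail_at);
	if (err)
		goto err_step1;
	err = step_init(2, fail_at);
	if (err)
		goto err_step2;
	err = step_init(3, fail_at);
	if (err)
		goto err_step3;
	return 0;

err_step3:
	step_fini(2);
err_step2:
	step_fini(1);
err_step1:
	return err;
}
```

Each label names the step that failed, so control falls through the teardown of everything that succeeded before it.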

[Devel] [PATCH RHEL7 COMMIT] Revert "pty: split Unix98 init routines"

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit ee5a5380520330fedde1a323d5ca3cb5cad20b4f
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:32:03 2015 +0400

Revert "pty: split Unix98 init routines"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, so nothing needs to
be done on the driver's side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

This reverts commit 3aec66abd43440bc7dd4c6bbe84734adb6d82851.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c | 100 --
 1 file changed, 15 insertions(+), 85 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 56c0a21..bd17a45 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -820,62 +820,25 @@ err_file:
 
 static struct file_operations ptmx_fops;
 
-static void __unix98_unregister_ptmx(void)
-{
-   unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1);
-   cdev_del(&ptmx_cdev);
-}
-
-static int __unix98_register_ptmx(void)
- {
-   int err;
-
-   cdev_init(&ptmx_cdev, &ptmx_fops);
-   err = cdev_add(&ptmx_cdev, MKDEV(TTYAUX_MAJOR, 2), 1);
-   if (err) {
-   printk(KERN_ERR "Couldn't add /dev/ptmx device");
-   return err;
-   }
-   err = register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, "/dev/ptmx");
-   if (err < 0) {
-   printk(KERN_ERR "Couldn't register /dev/ptmx driver");
-   goto err_ptmx_register;
-   }
-   return 0;
-
-err_ptmx_register:
-   cdev_del(&ptmx_cdev);
-   return err;
-}
-
-static int __unix98_pty_init(struct tty_driver **ptm_driver_p,
-   struct tty_driver **pts_driver_p)
+static void __init unix98_pty_init(void)
 {
-   struct tty_driver *ptm_driver, *pts_driver;
-   int err;
-   struct device *dev;
-
ptm_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX,
TTY_DRIVER_RESET_TERMIOS |
TTY_DRIVER_REAL_RAW |
TTY_DRIVER_DYNAMIC_DEV |
TTY_DRIVER_DEVPTS_MEM |
TTY_DRIVER_DYNAMIC_ALLOC);
-   if (IS_ERR(ptm_driver)) {
-   printk(KERN_ERR "Couldn't allocate Unix98 ptm driver");
-   return PTR_ERR(ptm_driver);
-   }
+   if (IS_ERR(ptm_driver))
+   panic("Couldn't allocate Unix98 ptm driver");
pts_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX,
TTY_DRIVER_RESET_TERMIOS |
TTY_DRIVER_REAL_RAW |
TTY_DRIVER_DYNAMIC_DEV |
TTY_DRIVER_DEVPTS_MEM |
TTY_DRIVER_DYNAMIC_ALLOC);
-   if (IS_ERR(pts_driver)) {
-   printk(KERN_ERR "Couldn't allocate Unix98 pts driver");
-   err = PTR_ERR(pts_driver);
-   goto err_pts_alloc;
-   }
+   if (IS_ERR(pts_driver))
+   panic("Couldn't allocate Unix98 pts driver");
+
ptm_driver->driver_name = "pty_master";
ptm_driver->name = "ptm";
ptm_driver->major = UNIX98_PTY_MASTER_MAJOR;
@@ -905,53 +868,20 @@ static int __unix98_pty_init(struct tty_driver **ptm_driver_p,
pts_driver->other = ptm_driver;
tty_set_operations(pts_driver, &pty_unix98_ops);
 
-   err = tty_register_driver(ptm_driver);
-   if (err) {
-   printk(KERN_ERR "Couldn't register Unix98 ptm driver");
-   goto err_ptm_register;
-   }
-   err = tty_register_driver(pts_driver);
-   if (err) {
-   printk(KERN_ERR "Couldn't register Unix98 pts driver");
-   goto err_pts_register;
-   }
+   if (tty_register_driver(ptm_driver))
+   panic("Couldn't register Unix98 ptm driver");
+   if (tty_register_driver(pts_driver))
+   panic("Couldn't register Unix98 pts driver");
 
/* Now create the /dev/ptmx special device */
tty_default_fops(&ptmx_fops);
ptmx_fops.open = ptmx_open;
 
-   err = __unix98_register_ptmx();
-   if (err)
-   goto err_ptmx_register;
-
-   dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR,

[Devel] [PATCH RHEL7 COMMIT] ve/radix-tree: do not account radix_tree_nodes to memcg

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit d4b302e64d3523bddf4e300d0a975a7717ac784b
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:44:29 2015 +0400

ve/radix-tree: do not account radix_tree_nodes to memcg

There are two problems if they are accounted.

First, radix_tree_nodes allocated by tcache/tswap for storing their
internal data will be accounted to the container that issued a store,
which is wrong, because they can only get reclaimed on global pressure.
Using __GFP_NOACCOUNT in tcache/tswap wouldn't help due to per cpu
radix_tree_node preloads.

Second, workingset detection logic (see mm/workingset.c) is still not
memory cgroup aware. In particular, this means that shadow
radix_tree_nodes can only be reclaimed on global memory pressure
although they are accounted to a memory cgroup. As a result, after
reading a huge file, all the container's memory can get filled with
shadow entries, which won't be reclaimed on local memory pressure,
making the container unusable.

This is a quick-fix which makes radix_tree_nodes unaccountable. This is
acceptable for now, because we had never accounted radix_tree_nodes
before Vz7 anyway. The true fix would be (a) making radix_tree_node
preloads unaccountable (or per memory cgroup) and (b) making workingset
detection logic memory cgroup aware. This should and will be done
upstream first.

https://jira.sw.ru/browse/PSBM-35205

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 lib/radix-tree.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index dd3347f..4b362cb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -228,7 +228,8 @@ radix_tree_node_alloc(struct radix_tree_root *root)
}
}
if (ret == NULL)
-   ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+   ret = kmem_cache_alloc(radix_tree_node_cachep,
+  gfp_mask | __GFP_NOACCOUNT);
 
BUG_ON(radix_tree_is_indirect_ptr(ret));
return ret;
@@ -279,7 +280,8 @@ static int __radix_tree_preload(gfp_t gfp_mask)
rtp = &__get_cpu_var(radix_tree_preloads);
while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
preempt_enable();
-   node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+   node = kmem_cache_alloc(radix_tree_node_cachep,
+   gfp_mask | __GFP_NOACCOUNT);
if (node == NULL)
goto out;
preempt_disable();

[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 0845747ebe2654d1e6e56a0425b21e599a47f4f6
Author: Mel Gorman mgor...@suse.de
Date:   Fri Aug 28 18:50:29 2015 +0400

ms/mm/vmscan: use proportional scanning during direct reclaim and full scan at DEF_PRIORITY


This patch fixes memcg overreclaim w/o tswap/zswap as described in:

https://jira.sw.ru/browse/PSBM-35275

Memcg overreclaim still happens if tswap or zswap is used. That case has
yet to be investigated; however, this patch is definitely worth pulling.


Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
ensured that file/anon lists were scanned proportionally for reclaim from
kswapd but ignored it for direct reclaim.  The intent was to minimise
direct reclaim latency but Yuanhan Liu pointed out that it substitutes one
long stall for many small stalls and distorts aging for normal workloads
like streaming readers/writers.  Hugh Dickins pointed out that a
side-effect of the same commit was that when one LRU list dropped to zero
that the entirety of the other list was shrunk leading to excessive
reclaim in memcgs.  This patch scans the file/anon lists proportionally
for direct reclaim to similarly age pages whether reclaimed by kswapd or
direct reclaim but takes care to abort reclaim if one LRU drops to zero
after reclaiming the requested number of pages.
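
The proportional adjustment described above can be sketched as follows. This is a simplified two-list model written from the description, not the patch text (the diff in this message is cut short); struct scan_state, adjust_scan() and the field names are assumptions made for illustration.

```c
#include <assert.h>

enum { ANON = 0, FILE_LRU = 1 };

/* Toy model of two LRU lists; names are illustrative only. */
struct scan_state {
	unsigned long target[2];  /* original scan targets */
	unsigned long nr[2];      /* pages still to be scanned */
};

/*
 * Once enough pages have been reclaimed, stop scanning the list with
 * less work left and trim the other list so that both end up scanned
 * in roughly the same proportion of their original targets.
 */
static void adjust_scan(struct scan_state *s)
{
	int small = s->nr[FILE_LRU] > s->nr[ANON] ? ANON : FILE_LRU;
	int big = !small;
	unsigned long percentage, scanned;

	/* Share (in percent) of the smaller list's target still unscanned. */
	percentage = s->nr[small] * 100 / (s->target[small] + 1);

	/* Stop scanning the smaller list entirely. */
	s->nr[small] = 0;

	/* Give the other list the same unscanned share, minus work done. */
	assert(s->nr[big] <= s->target[big]);
	scanned = s->target[big] - s->nr[big];
	s->nr[big] = s->target[big] * (100 - percentage) / 100;
	s->nr[big] -= s->nr[big] < scanned ? s->nr[big] : scanned;
}
```

With targets of 100 anon and 1000 file pages, and 50 and 900 pages left respectively, the sketch zeroes the anon count and leaves 410 file pages, so about half of each original target ends up scanned.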

Based on ext4 and using the Intel VM scalability test

                                              3.15.0-rc5            3.15.0-rc5
                                                shrinker            proportion
Unit  lru-file-readonce    elapsed      5.3500 (  0.00%)      5.4200 ( -1.31%)
Unit  lru-file-readonce time_range      0.2700 (  0.00%)      0.1400 ( 48.15%)
Unit  lru-file-readonce time_stddv      0.1148 (  0.00%)      0.0536 ( 53.33%)
Unit lru-file-readtwice    elapsed      8.1700 (  0.00%)      8.1700 (  0.00%)
Unit lru-file-readtwice time_range      0.4300 (  0.00%)      0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv      0.1650 (  0.00%)      0.0971 ( 41.16%)

The test cases are running multiple dd instances reading sparse files. The
results are within the noise for the small test machine. The impact of the
patch is more noticeable from the vmstats

                            3.15.0-rc5  3.15.0-rc5
                              shrinker  proportion
Minor Faults                     35154       36784
Major Faults                       611        1305
Swap Ins                           394        1651
Swap Outs                         4394        5891
Allocation stalls               118616       44781
Direct pages scanned           4935171     4602313
Kswapd pages scanned          15921292    16258483
Kswapd pages reclaimed        15913301    16248305
Direct pages reclaimed         4933368     4601133
Kswapd efficiency                  99%         99%
Kswapd velocity             670088.047  682555.961
Direct efficiency                  99%         99%
Direct velocity             207709.217  193212.133
Percentage direct scans            23%         22%
Page writes by reclaim        4858.000    6232.000
Page writes file                   464         341
Page writes anon                  4394        5891

Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.

Signed-off-by: Mel Gorman mgor...@suse.de
Cc: Johannes Weiner han...@cmpxchg.org
Cc: Hugh Dickins hu...@google.com
Cc: Tim Chen tim.c.c...@linux.intel.com
Cc: Dave Chinner da...@fromorbit.com
Tested-by: Yuanhan Liu yuanhan@linux.intel.com
Cc: Bob Liu bob@oracle.com
Cc: Jan Kara j...@suse.cz
Cc: Rik van Riel r...@redhat.com
Cc: Al Viro v...@zeniv.linux.org.uk
Signed-off-by: Andrew Morton a...@linux-foundation.org
Signed-off-by: Linus Torvalds torva...@linux-foundation.org
(cherry picked from commit 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813)
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/vmscan.c | 36 +---
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b4c98f..2bb62ce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2129,13 +2129,27 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc,
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct blk_plug plug;
-   bool scan_adjusted = false;
+   bool scan_adjusted;
 
get_scan_count(lruvec, sc, nr, lru_pages);
 
/* Record the original scan target for proportional adjustments

[Devel] [PATCH RHEL7 COMMIT] memcg: fix swap_max calculation for nested cgroups

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 338ce9637d706f2bf01ef9153b78953ff65c2efb
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:36:03 2015 +0400

memcg: fix swap_max calculation for nested cgroups

If there is a sub-memcg in a container, its swapout won't update
swap_max of the container's memcg, because we don't ascend the memcg
hierarchy in mem_cgroup_update_swap_max. This patch fixes this issue.

Fixes: a74376e2dde13 ("bc/memcg: show correct swap max for beancounters")
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/memcontrol.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5f3e0ac..7fc2931 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -903,12 +903,14 @@ static void mem_cgroup_update_swap_max(struct mem_cgroup *memcg)
 {
long long swap;
 
-   swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
-   res_counter_read_u64(&memcg->res, RES_USAGE);
+   for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+   swap = res_counter_read_u64(&memcg->memsw, RES_USAGE) -
+   res_counter_read_u64(&memcg->res, RES_USAGE);
 
-   /* This is racy, but we don't have to be absolutely precise */
-   if (swap > (long long)memcg->swap_max)
-   memcg->swap_max = swap;
+   /* This is racy, but we don't have to be absolutely precise */
+   if (swap > (long long)memcg->swap_max)
+   memcg->swap_max = swap;
+   }
 }
 
 static void mem_cgroup_inc_failcnt(struct mem_cgroup *memcg,
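
The fix amounts to walking from the charged cgroup up to the root and refreshing the recorded peak at every level. A standalone sketch of that walk, using a hypothetical struct toy_memcg with plain integer counters in place of the kernel's struct mem_cgroup and res_counter:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup; field names are illustrative. */
struct toy_memcg {
	struct toy_memcg *parent;  /* NULL for the root */
	long long memsw_usage;     /* memory + swap usage */
	long long mem_usage;       /* memory usage */
	long long swap_max;        /* peak swap usage seen so far */
};

/* Walk from the given cgroup up to the root, refreshing each peak. */
static void update_swap_max(struct toy_memcg *memcg)
{
	for (; memcg; memcg = memcg->parent) {
		long long swap;

		assert(memcg->memsw_usage >= memcg->mem_usage);
		swap = memcg->memsw_usage - memcg->mem_usage;

		if (swap > memcg->swap_max)
			memcg->swap_max = swap;
	}
}
```

Without the loop only the innermost cgroup would be updated, which is the behaviour the commit message describes for a sub-memcg inside a container.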

[Devel] [PATCH RHEL7 COMMIT] Revert "ve/pty: containerize Unix98 driver"

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit fd19fc2c70ae5da0a0902dea96213f52dc6afbfd
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:31:56 2015 +0400

Revert "ve/pty: containerize Unix98 driver"

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, so nothing needs to
be done on the driver's side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

This reverts commit 1b2c1fe8428715c3b5ec0a94d0568b5a5c526032.

Conflicts:
include/linux/ve.h

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c   | 88 ++---
 include/linux/tty.h |  6 ++--
 include/linux/ve.h  |  6 
 3 files changed, 32 insertions(+), 68 deletions(-)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index 7afb822..56c0a21 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -23,10 +23,15 @@
 #include <linux/devpts_fs.h>
 #include <linux/slab.h>
 #include <linux/mutex.h>
-#include <linux/ve.h>
 
 #include <bc/misc.h>
 
+#ifdef CONFIG_UNIX98_PTYS
+static struct tty_driver *ptm_driver;
+static struct tty_driver *pts_driver;
+static DEFINE_MUTEX(devpts_mutex);
+#endif
+
 static void pty_close(struct tty_struct *tty, struct file *filp)
 {
BUG_ON(!tty);
@@ -53,11 +58,11 @@ static void pty_close(struct tty_struct *tty, struct file *filp)
if (tty->driver->subtype == PTY_TYPE_MASTER) {
set_bit(TTY_OTHER_CLOSED, &tty->flags);
 #ifdef CONFIG_UNIX98_PTYS
-   if (tty->driver == tty->driver->ve->ptm_driver) {
-   mutex_lock(&tty->driver->ve->devpts_mutex);
+   if (tty->driver == ptm_driver) {
+   mutex_lock(&devpts_mutex);
if (tty->link->driver_data)
devpts_pty_kill(tty->link->driver_data);
-   mutex_unlock(&tty->driver->ve->devpts_mutex);
+   mutex_unlock(&devpts_mutex);
}
 #endif
tty_unlock(tty);
@@ -669,9 +674,9 @@ static struct tty_struct *pts_unix98_lookup(struct tty_driver *driver,
 {
struct tty_struct *tty;
 
-   mutex_lock(&driver->ve->devpts_mutex);
+   mutex_lock(&devpts_mutex);
tty = devpts_get_priv(pts_inode);
-   mutex_unlock(&driver->ve->devpts_mutex);
+   mutex_unlock(&devpts_mutex);
/* Master must be open before slave */
if (!tty)
return ERR_PTR(-EIO);
@@ -748,7 +753,6 @@ static int ptmx_open(struct inode *inode, struct file *filp)
struct inode *slave_inode;
int retval;
int index;
-   struct ve_struct *ve = (inode->i_sb->s_ns) ? : get_exec_env();
 
nonseekable_open(inode, filp);
 
@@ -760,18 +764,18 @@ static int ptmx_open(struct inode *inode, struct file *filp)
return retval;
 
/* find a device that is not in use. */
-   mutex_lock(&ve->devpts_mutex);
+   mutex_lock(&devpts_mutex);
index = devpts_new_index(inode);
if (index < 0) {
retval = index;
-   mutex_unlock(&ve->devpts_mutex);
+   mutex_unlock(&devpts_mutex);
goto err_file;
}
 
-   mutex_unlock(&ve->devpts_mutex);
+   mutex_unlock(&devpts_mutex);
 
mutex_lock(&tty_mutex);
-   tty = tty_init_dev(ve->ptm_driver, index);
+   tty = tty_init_dev(ptm_driver, index);
 
if (IS_ERR(tty)) {
retval = PTR_ERR(tty);
@@ -796,7 +800,7 @@ static int ptmx_open(struct inode *inode, struct file *filp)
}
tty->link->driver_data = slave_inode;
 
-   retval = ve->ptm_driver->ops->open(tty, filp);
+   retval = ptm_driver->ops->open(tty, filp);
if (retval)
goto err_release;
 
@@ -816,22 +820,16 @@ err_file:
 
 static struct file_operations ptmx_fops;
 
-static void __unix98_unregister_ptmx(struct ve_struct *ve)
+static void __unix98_unregister_ptmx(void)
 {
-   if (!ve_is_super(ve))
-   return;
-
unregister_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1);
cdev_del(&ptmx_cdev);
 }
 
-static int __unix98_register_ptmx(struct ve_struct *ve)
-{
+static int __unix98_register_ptmx(void)
+ {
int

[Devel] [PATCH RHEL7 COMMIT] ve/pty: create ptmx device per ve namespace

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 953017eb9e8237859f63d7b0a2c816b7e7e5a615
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:32:16 2015 +0400

ve/pty: create ptmx device per ve namespace

Patchset description:

Zap Unix98 pty virtualization

Unix98 ptys are already virtualized on the VFS layer, so nothing needs to
be done on the driver's side. We don't even have this in PCS6.

The patch set makes ptmx device system-wide while its class, tty_class, is
still virtualized. Since it's now system-wide, we have to add its sysfs entry
to ve.default_sysfs_permissions, but since its class is virtualized, we won't
be able to do it (see sysfs_perms_set -> sysfs_find_dirent).

As a result, if the container relies on sysfs while creating devnodes,
it will not find ptmx and therefore fall back to legacy ptys, which we
are going to drop.
The last patch ("ve/pty: create ptmx device per ve namespace") addresses this.

===
This patch description:

After Unix98 PTY driver virtualization was reverted, we have to
manually set sysfs permissions for ptmx. This, however, is currently
impossible, because tty_class is still virtualized, which makes
ve.sysfs_permissions ignore it (see sysfs_perms_set).

This patch is a quick-fix which simply creates/destroys ptmx device in
ve namespace on container start/stop. It must be dropped when commit
6022450d12653 ("ve/tty: make tty_class VE-namespace aware") is reverted.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 drivers/tty/pty.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/tty/pty.c b/drivers/tty/pty.c
index bd17a45..529046b 100644
--- a/drivers/tty/pty.c
+++ b/drivers/tty/pty.c
@@ -818,6 +818,32 @@ err_file:
return retval;
 }
 
+static int ve_unix98_pty_init(void *data)
+{
+   struct ve_struct *ve = data;
+   struct device *dev;
+
+   dev = device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), ve, "ptmx");
+   if (IS_ERR(dev)) {
+   pr_warn("Failed to create ptmx device for ve %s: %ld\n",
+   ve->ve_name, PTR_ERR(dev));
+   return PTR_ERR(dev);
+   }
+   return 0;
+}
+
+static void ve_unix98_pty_fini(void *data)
+{
+   device_destroy_namespace(tty_class, MKDEV(TTYAUX_MAJOR, 2), data);
+}
+
+static struct ve_hook ve_unix98_pty_hook = {
+   .init   = ve_unix98_pty_init,
+   .fini   = ve_unix98_pty_fini,
+   .priority   = HOOK_PRIO_DEFAULT,
+   .owner  = THIS_MODULE,
+};
+
 static struct file_operations ptmx_fops;
 
 static void __init unix98_pty_init(void)
@@ -882,6 +908,7 @@ static void __init unix98_pty_init(void)
register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, "/dev/ptmx") < 0)
panic("Couldn't register /dev/ptmx driver");
device_create(tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), NULL, "ptmx");
+   ve_hook_register(VE_SS_CHAIN, &ve_unix98_pty_hook);
 }
 
 #else

[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: never isolate more pages than necessary

2015-08-28 Thread Konstantin Khorenko

The commit is pushed to branch-rh7-3.10.0-229.7.2-ovz and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.3
--
commit 703ed09d7ee4d9af6cec3c4970842f282176f5e0
Author: Vladimir Davydov vdavy...@parallels.com
Date:   Fri Aug 28 18:50:33 2015 +0400

ms/mm/vmscan: never isolate more pages than necessary


Along with "[PATCH rh7] mm: vmscan: use proportional scanning during
direct reclaim and full scan at DEF_PRIORITY" this should fix

https://jira.sw.ru/browse/PSBM-35275

I submitted this patch upstream (https://lkml.org/lkml/2015/8/3/404) and
it was merged into the mmotm tree. Hopefully, it will get merged into
Linus's tree soon.


If transparent huge pages are enabled, we can isolate many more pages
than we actually need to scan, because we count both single and huge
pages equally in isolate_lru_pages().

Since commit 5bc7b8aca942d ("mm: thp: add split tail pages to shrink
page list in page reclaim"), we scan all the tail pages immediately
after a huge page split (see shrink_page_list()). As a result, we can
reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run!

This is easy to catch on memcg reclaim with zswap enabled. The latter
makes swapout instant so that if we happen to scan an unreferenced huge
page we will evict both its head and tail pages immediately, which is
likely to result in excessive reclaim.

Signed-off-by: Vladimir Davydov vdavy...@parallels.com
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bb62ce..7beadf5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1297,7 +1297,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long nr_taken = 0;
unsigned long scan;
 
-   for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+   for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
+   !list_empty(src); scan++) {
struct page *page;
int nr_pages;
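
The effect of the extra nr_taken bound can be seen in a toy model of the isolation loop. In this sketch the LRU is an array whose entries hold the number of base pages behind each list item (1 for a regular page, 512 for a huge page, an illustrative value); isolate() and its bound_taken switch are hypothetical and exist only to compare the two loop conditions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return how many base pages get isolated when scanning up to
 * nr_to_scan list entries. With bound_taken set, the loop also stops
 * as soon as nr_taken reaches nr_to_scan, as in the patched condition.
 */
static unsigned long isolate(const unsigned int *lru, size_t len,
			     unsigned long nr_to_scan, int bound_taken)
{
	unsigned long nr_taken = 0, scan;

	for (scan = 0; scan < nr_to_scan && scan < len; scan++) {
		if (bound_taken && nr_taken >= nr_to_scan)
			break;
		assert(lru[scan] > 0);
		nr_taken += lru[scan];
	}
	return nr_taken;
}
```

With four huge pages on the list and nr_to_scan of 4, the unbounded loop takes all 2048 base pages while the bounded one stops after the first huge page; a list of regular pages behaves the same either way.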
 

Re: [Devel] [PATCH rh7] ve: Add a ability to show ve.mount_opts

2015-08-26 Thread Konstantin Khorenko



On 07/20/2015 10:05 PM, Maxim Patlasov wrote:

On 07/14/2015 01:27 AM, Kirill Tkhai wrote:

On Mon, 13/07/2015 at 12:38 -0700, Maxim Patlasov wrote:

On 07/08/2015 04:50 AM, Kirill Tkhai wrote:

...

Why do we need to show hidden options to CT' user? He/she doesn't see
.balloon file, so it doesn't seem consistent to show balloon_ino=N.

But this way read won't show all written using write. It may
confuse users or vzctl developers.

I think more debug info won't be worse.

Sorry for delay, I somehow missed your reply in my inbox folder. Are
these 'read' and 'write' allowed only from host system (ve0) or inside
CT as well?

It's allowed from inside a CT like other ve cgroup files.
But it's not a problem of this patch, it's a generic problem,
because mounting of ve cgroup from CT is not prohibited for now.
Please, see cgroup_mount() for the details.


OK.


So, let's continue here:
* by default ve cgroup is not visible from inside a CT

* currently it's possible to mount ve cgroup inside a CT, but this is
  temporary, we'll disable this
  https://jira.sw.ru/browse/PSBM-34291

* this patch allows seeing mount options via ve cgroup =>
  after PSBM-34291 is fixed, mount options will be visible only from ve0 (host)

* for host it's OK to see all hidden options

Kirill, Maxim, please ack that I understand the situation correctly here,
and I'll apply the patch.

[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmscan: Do not wait for page writeback for GFP_NOFS allocations

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 32c7d3e46ac5734aa426767243a7e29657141ec3
Author: Michal Hocko 
Date:   Mon Aug 31 18:57:12 2015 +0400

ms/mm/vmscan: Do not wait for page writeback for GFP_NOFS allocations


This patch has not been merged upstream yet, I took it from LKML.
Nevertheless, it has already been committed to mmotm and even taken by
Greg for stable. It is definitely worth backporting if we don't want to
get tasks hung in D-state on memcg reclaim.
vdavydov@


Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: 883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at 815ab152
  #1 schedule at 815ab76e
  #2 schedule_timeout at 815ae5e5
  #3 io_schedule_timeout at 815aad6a
  #4 bit_wait_io at 815abfc6
  #5 __wait_on_bit at 815abda5
  #6 wait_on_page_bit at 8111fd4f
  #7 shrink_page_list at 81135445
  #8 shrink_inactive_list at 81135845
  #9 shrink_lruvec at 81135ead
 #10 shrink_zone at 811360c3
 #11 shrink_zones at 81136eff
 #12 do_try_to_free_pages at 8113712f
 #13 try_to_free_mem_cgroup_pages at 811372be
 #14 try_charge at 81189423
 #15 mem_cgroup_try_charge at 8118c6f5
 #16 __add_to_page_cache_locked at 8112137d
 #17 add_to_page_cache_lru at 81121618
 #18 pagecache_get_page at 8112170b
 #19 grow_dev_page at 811c8297
 #20 __getblk_slow at 811c91d6
 #21 __getblk_gfp at 811c92c1
 #22 ext4_ext_grow_indepth at 8124565c
 #23 ext4_ext_create_new_leaf at 81246ca8
 #24 ext4_ext_insert_extent at 81246f09
 #25 ext4_ext_map_blocks at 8124a848
 #26 ext4_map_blocks at 8121a5b7
 #27 mpage_map_one_extent at 8121b1fa
 #28 mpage_map_and_submit_extent at 8121f07b
 #29 ext4_writepages at 8121f6d5
 #30 do_writepages at 8112c490
 #31 __filemap_fdatawrite_range at 81120199
 #32 filemap_flush at 8112041c
 #33 ext4_alloc_da_blocks at 81219da1
 #34 ext4_rename at 81229b91
 #35 ext4_rename2 at 81229e32
 #36 vfs_rename at 811a08a5
 #37 SYSC_renameat2 at 811a3ffc
 #38 sys_renameat2 at 811a408e
 #39 sys_rename at 8119e51e
 #40 system_call_fastpath at 815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f44fcb0 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.
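
The decision described above can be modelled in a few lines of standalone C. This is only a sketch: the flag values and all model_* names are invented for illustration and are not the kernel's definitions, and the kswapd case is left out. It assumes the function is called only for pages that already have PG_writeback set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

enum model_action {
	MODEL_SKIP_PAGE,	/* mark the page for reclaim and move on */
	MODEL_WAIT_WRITEBACK	/* block until writeback completes */
};

struct model_page {
	bool reclaim;		/* page was already marked for immediate reclaim */
	bool swapcache;		/* page lives in the swap cache */
};

/* Reclaim may enter the fs with __GFP_FS, or with __GFP_IO for swap pages. */
static bool model_may_enter_fs(unsigned int gfp_mask,
			       const struct model_page *page)
{
	return (gfp_mask & MODEL_GFP_FS) ||
	       (page->swapcache && (gfp_mask & MODEL_GFP_IO));
}

/* Decide what reclaim does with a page that is under writeback. */
static enum model_action model_writeback_action(unsigned int gfp_mask,
						bool global_reclaim,
						const struct model_page *page)
{
	assert(page != NULL);

	/* Waiting is only allowed for memcg reclaim that may enter the fs. */
	if (global_reclaim || !page->reclaim ||
	    !model_may_enter_fs(gfp_mask, page))
		return MODEL_SKIP_PAGE;
	return MODEL_WAIT_WRITEBACK;
}
```

With this check a GFP_NOFS-style caller (MODEL_GFP_IO only) never sleeps on a page that the filesystem may not have submitted yet, which is the situation in the backtrace above.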

According to Dave Chinner, xfs has been doing a similar thing since 2.6.15,
so ext4 is not the only affected filesystem.  Moreover, he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: sta...@vger.kernel.org # 3.9+
[ty...@mit.edu:

[Devel] [PATCH RHEL7 COMMIT] ve/sysctl: Introduce proc_doulongvec_minmax_virtual()

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 85474cc55aa11512b45b40bf382c620d78646992
Author: Andrey Ryabinin 
Date:   Mon Aug 31 19:29:17 2015 +0400

ve/sysctl: Introduce proc_doulongvec_minmax_virtual()

proc_doulongvec_minmax_virtual() is the analogue of proc_doulongvec_minmax()
for per-CT sysctls. It will be used for virtualizing aio_nr and aio_max_nr.
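
The patch below relies on a helper called virtual_ptr() whose body is not part of this message. A rough guess at what such a helper does, written as standalone C, is shown here: a pointer into the host structure is rebased onto the same field of the current container's structure. The struct layout and every model_* name are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-container structure. */
struct model_ve {
	unsigned long aio_nr;
	unsigned long aio_max_nr;
};

/* Stand-in for the host container (ve0). */
static struct model_ve model_ve0;

/*
 * If *ptr points inside [base, base + size), rewrite it to point at the
 * same offset inside cur and return true. Otherwise leave it untouched.
 */
static bool model_virtual_ptr(void **ptr, void *base, size_t size, void *cur)
{
	uintptr_t addr, start;

	assert(ptr != NULL);
	addr = (uintptr_t)*ptr;
	start = (uintptr_t)base;

	if (addr < start || addr - start >= size)
		return false;

	*ptr = (char *)cur + (addr - start);
	return true;
}
```

A sysctl handler built this way can keep a table entry that points at the ve0 field and still read or write the field of whichever container issued the request.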

https://jira.sw.ru/browse/PSBM-29017

Signed-off-by: Andrey Ryabinin 
Reviewed-by: Vladimir Davydov 
---
 include/linux/sysctl.h |  2 ++
 kernel/sysctl.c| 11 +++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index bdcf06d..af467dc 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -60,6 +60,8 @@ extern int proc_do_large_bitmap(struct ctl_table *, int,
 
 extern int proc_dointvec_virtual(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
+extern int proc_doulongvec_minmax_virtual(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos);
 extern int proc_dointvec_immutable(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
 extern int proc_dostring_immutable(struct ctl_table *table, int write,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8478a1e..1a568e7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2727,6 +2727,17 @@ int proc_dointvec_virtual(struct ctl_table *table, int write,
return -EINVAL;
 }
 
+int proc_doulongvec_minmax_virtual(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   struct ctl_table tmp = *table;
+
+   if (virtual_ptr(&tmp.data, &ve0, sizeof(ve0), get_exec_env()))
+   return proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
+   return -EINVAL;
+}
+
 static inline bool sysctl_in_container(void)
 {
return !ve_is_super(get_exec_env());

[Devel] [PATCH RHEL7 COMMIT] ve/fs/aio: aio_nr & aio_max_nr variables virtualization

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit d5a0970d86642a4150439d8a599f2f359e75fbf4
Author: Andrey Ryabinin 
Date:   Mon Aug 31 19:38:05 2015 +0400

ve/fs/aio: aio_nr & aio_max_nr variables virtualization

Virtualization of the kernel-global aio_nr and aio_max_nr variables is required
to isolate containers and ve0 from each other when allocating aio request/event
resources.

Each ve, including ve0, has its own aio_nr and aio_max_nr values. ioctx_alloc()
charges the aio_nr value selected by the current ve context.

As a result, one ve cannot exhaust the aio event resources of another ve.
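
The per-ve accounting can be pictured with the standalone C model below. It is a sketch only: struct model_ve and the model_aio_* helpers are invented names, and the spinlock that protects the counters in the real patch is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-container counters (the patch keeps them in ve_struct). */
struct model_ve {
	unsigned long aio_nr;		/* events currently charged to this ve */
	unsigned long aio_max_nr;	/* per-ve limit */
};

/* Charge nr_events to one container; fail on limit or on wraparound. */
static bool model_aio_charge(struct model_ve *ve, unsigned long nr_events)
{
	if (ve->aio_nr + nr_events > ve->aio_max_nr ||
	    ve->aio_nr + nr_events < ve->aio_nr)	/* unsigned overflow */
		return false;
	ve->aio_nr += nr_events;
	return true;
}

/* Return previously charged events to the same container. */
static void model_aio_uncharge(struct model_ve *ve, unsigned long nr_events)
{
	assert(nr_events <= ve->aio_nr);
	ve->aio_nr -= nr_events;
}
```

Because each container owns its own counter pair, filling one container up to its limit leaves the budget of every other container untouched.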

Default per-CT aio_max_nr value == 0x10000, including CT0.

https://jira.sw.ru/browse/PSBM-29017

Signed-off-by: Andrey Ryabinin 
Reviewed-by: Vladimir Davydov 
---
 fs/aio.c| 38 +-
 include/linux/aio.h |  6 ++
 include/linux/ve.h  |  6 ++
 kernel/sysctl.c | 16 
 kernel/ve/ve.c  |  7 +++
 5 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 70a6599..9d700b0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -122,14 +123,9 @@ struct kioctx {
 
struct page *internal_pages[AIO_RING_PAGES];
struct file *aio_ring_file;
+   struct ve_struct*ve;
 };
 
-/*-- sysctl variables*/
-static DEFINE_SPINLOCK(aio_nr_lock);
-unsigned long aio_nr;  /* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
-/*end sysctl variables---*/
-
 static struct kmem_cache   *kiocb_cachep;
 static struct kmem_cache   *kioctx_cachep;
 
@@ -495,6 +491,9 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
 static void free_ioctx_rcu(struct rcu_head *head)
 {
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+   struct ve_struct *ve = ctx->ve;
+
+   put_ve(ve);
kmem_cache_free(kioctx_cachep, ctx);
 }
 
@@ -571,6 +570,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 {
struct mm_struct *mm = current->mm;
struct kioctx *ctx;
+   struct ve_struct *ve = get_exec_env();
int err = -ENOMEM;
 
/* Prevent overflows */
@@ -580,7 +580,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ERR_PTR(-EINVAL);
}
 
-   if (!nr_events || (unsigned long)nr_events > aio_max_nr)
+   if (!nr_events || (unsigned long)nr_events > ve->aio_max_nr)
return ERR_PTR(-EAGAIN);
 
ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
@@ -588,6 +588,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ERR_PTR(-ENOMEM);
 
ctx->max_reqs = nr_events;
+   ctx->ve = get_ve(ve);
 
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
@@ -608,14 +609,14 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
goto out_freectx;
 
/* limit the number of system wide aios */
-   spin_lock(&aio_nr_lock);
-   if (aio_nr + nr_events > aio_max_nr ||
-   aio_nr + nr_events < aio_nr) {
-   spin_unlock(&aio_nr_lock);
+   spin_lock(&ve->aio_nr_lock);
+   if (ve->aio_nr + nr_events > ve->aio_max_nr ||
+   ve->aio_nr + nr_events < ve->aio_nr) {
+   spin_unlock(&ve->aio_nr_lock);
goto out_cleanup;
}
-   aio_nr += ctx->max_reqs;
-   spin_unlock(&aio_nr_lock);
+   ve->aio_nr += ctx->max_reqs;
+   spin_unlock(&ve->aio_nr_lock);
 
/* now link into global list. */
spin_lock(&mm->ioctx_lock);
@@ -633,6 +634,7 @@ out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
 out_freectx:
+   put_ve(ctx->ve);
mutex_unlock(&ctx->ring_lock);
put_aio_ring_file(ctx);
kmem_cache_free(kioctx_cachep, ctx);
@@ -665,6 +667,8 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
struct completion *requests_done)
 {
if (!atomic_xchg(&ctx->dead, 1)) {
+   struct ve_struct *ve = ctx->ve;
+
spin_lock(&mm->ioctx_lock);
hlist_del_rcu(&ctx->list);
spin_unlock(&mm->ioctx_lock);
@@ -676,10 +680,10 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 * -EAGAIN with no ioctxs actually in use (as far as userspace
 *  could tell).
 */
-   spin_lock(&aio_nr_lock);
-   BUG_ON(aio_nr - ctx->max_reqs > aio_nr);
-   aio_nr -= ctx->max_reqs;
-   spin_unlock(&aio_nr_lock);
+   spin_lock(&ve->aio_nr_lock);
+

[Devel] [PATCH RHEL7 COMMIT] pfcache/ext4: fix automatic csum calculation

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 17c90deb8c54a3feef19d557008fcb510bed8cd3
Author: Dmitry Monakhov 
Date:   Mon Aug 31 20:03:06 2015 +0400

pfcache/ext4: fix automatic csum calculation

port from 2.6.32-x: diff-pfcache-ext4-fix-automatic-csum-calculation

Bug#1) https://jira.sw.ru/browse/PSBM-23774
   truncate_data_csum should clear its state unconditionally

Bug#2) BUG_ON fs/jbd2/transaction.c:1033
 The truncate_data_csum call chain looks as follows:
 ->generic_file_buffered_write_iter
   ->ext4_da_write_begin
 ->ext4_journal_start( ,,1) : reserve 1 journal block
   ->ext4_write_end
 ->ext4_update_data_csum
   ->ext4_truncate_data_csum
 ->ext4_xattr_set
   ->ext4_journal_start(,,20): require 20 blocks,
   but since journal already started
   it use existing handle
->jbd2_journal_dirty_metadata
   J_ASSERT_JH(jh, handle->h_buffer_credits > 0) -> FAILURE

 Obviously it is illegal to modify an xattr from a random context.
 To fix this bug it is reasonable to call ext4_truncate_data_csum()
 only from a proper context (one where the journal has not been started yet).
 This patch splits ext4_update_csum in two pieces:
  1) check the csum window position and drop the csum if necessary (called from write_begin)
  2) update the in-memory csum state (called from write_end)
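
Bug#2 can be reproduced in miniature with the toy model below, written as standalone C. Everything here is a simplification with invented model_* names: a nested "journal start" returns the handle that is already active and ignores the requested block count, so the inner caller runs out of credits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a journal handle. */
struct model_handle {
	int credits;	/* metadata blocks this handle may still dirty */
	int refs;	/* nesting depth */
};

/* Models the per-task "current handle" pointer. */
static struct model_handle *model_current;

/* Start a handle; a nested start reuses the active one and ignores nblocks. */
static struct model_handle *model_journal_start(struct model_handle *storage,
						int nblocks)
{
	if (model_current) {
		model_current->refs++;
		return model_current;
	}
	storage->credits = nblocks;
	storage->refs = 1;
	model_current = storage;
	return storage;
}

/* Dirty one metadata block; false marks where the kernel would assert. */
static bool model_dirty_metadata(struct model_handle *h)
{
	if (h->credits <= 0)
		return false;
	h->credits--;
	return true;
}

/* Drop one nesting level; the last stop clears the current handle. */
static void model_journal_stop(struct model_handle *h)
{
	assert(h->refs > 0);
	if (--h->refs == 0)
		model_current = NULL;
}
```

Moving the xattr update to a point before the outer start, as the patch does, corresponds to calling model_journal_start() with 20 blocks while no handle is active, so the request is actually honoured.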

Minor fix: do not calculate csum for empty files.

https://jira.sw.ru/browse/PSBM-39233

Signed-off-by: Dmitry Monakhov 
---
 fs/ext4/ext4.h |  3 ++-
 fs/ext4/inode.c| 13 +
 fs/ext4/pfcache.c  | 41 +++--
 fs/ext4/truncate.h |  3 +++
 4 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7059994..fc9608e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2843,11 +2843,12 @@ extern long ext4_dump_pfcache(struct super_block *sb,
struct pfcache_dump_request __user *dump);
 extern int ext4_load_data_csum(struct inode *inode);
 extern void ext4_start_data_csum(struct inode *inode);
+extern void ext4_check_pos_data_csum(struct inode *inode, loff_t pos);
 extern void ext4_update_data_csum(struct inode *inode, loff_t pos,
  unsigned len, struct page* page);
 extern void ext4_commit_data_csum(struct inode *inode);
 extern void ext4_clear_data_csum(struct inode *inode);
-extern int ext4_truncate_data_csum(struct inode *inode, loff_t end);
+extern void ext4_truncate_data_csum(struct inode *inode, loff_t end);
 extern void ext4_load_dir_csum(struct inode *inode);
 extern void ext4_save_dir_csum(struct inode *inode);
 static inline int ext4_want_data_csum(struct inode *dir)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1b3462c..78fc407 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -238,6 +238,8 @@ void ext4_evict_inode(struct inode *inode)
 * protection against it
 */
sb_start_intwrite(inode->i_sb);
+   if (inode->i_blocks && ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM))
+   ext4_truncate_data_csum(inode, inode->i_size);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE,
ext4_blocks_for_truncate(inode)+3);
if (IS_ERR(handle)) {
@@ -936,6 +938,10 @@ retry_grab:
unlock_page(page);
 
 retry_journal:
+   /* Check csum window position before journal_start */
+   if (ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM))
+   ext4_check_pos_data_csum(inode, pos);
+
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
if (IS_ERR(handle)) {
page_cache_release(page);
@@ -2593,6 +2599,10 @@ retry_grab:
 * of file which has an already mapped buffer.
 */
 retry_journal:
+   /* Check csum window position before journal_start */
+   if (ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM))
+   ext4_check_pos_data_csum(inode, pos);
+
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
ext4_da_write_credits(inode, pos, len));
if (IS_ERR(handle)) {
@@ -4640,6 +4650,9 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
if (error)
goto err_out;
}
+   if (ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM))
+   ext4_truncate_data_csum(inode, attr->ia_size);
+
handle = ext4_journal_start(inode, EXT4_HT_INODE, 3);
if (IS_ERR(handle)) {

[Devel] [PATCH RHEL7 COMMIT] ms/cgroup: remove now unused css_depth()

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 8999360445307b87d687dac0b551d21e02386426
Author: Tejun Heo 
Date:   Wed Jun 12 21:04:48 2013 -0700

ms/cgroup: remove now unused css_depth()

Signed-off-by: Tejun Heo 
Acked-by: Li Zefan 
---
 include/linux/cgroup.h |  1 -
 kernel/cgroup.c| 12 
 2 files changed, 13 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b7eb28f..44b64c9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -884,7 +884,6 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
 
 /* Get id and depth of css */
 unsigned short css_id(struct cgroup_subsys_state *css);
-unsigned short css_depth(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 
 #else /* !CONFIG_CGROUPS */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b5a603c..d96176e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5296,18 +5296,6 @@ unsigned short css_id(struct cgroup_subsys_state *css)
 }
 EXPORT_SYMBOL_GPL(css_id);
 
-unsigned short css_depth(struct cgroup_subsys_state *css)
-{
-   struct css_id *cssid;
-
-   cssid = rcu_dereference_check(css->id, css_refcnt(css));
-
-   if (cssid)
-   return cssid->depth;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(css_depth);
-
 /**
  *  css_is_ancestor - test "root" css is an ancestor of "child"
  * @child: the css to be tested.

[Devel] [PATCH RHEL7 COMMIT] ve/video/logo: show Odin's logo on boot

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 051055ab058cba98ca8e0fe93689162588cb556a
Author: Andrey Ryabinin <aryabi...@odin.com>
Date:   Mon Aug 31 17:22:42 2015 +0400

ve/video/logo: show Odin's logo on boot

Show Odin's logo instead of "tux" when booting kernel with
framebuffer enabled.

https://jira.sw.ru/browse/PSBM-34430

Signed-off-by: Andrey Ryabinin <aryabi...@odin.com>
Cc: Vladimir Davydov <vdavy...@parallels.com>
Cc: Konstantin Khorenko <khore...@virtuozzo.com>

khorenko@ note: unlike in PCS6, Odin's logo is shown by default;
no additional kernel boot option is required
---
 drivers/video/logo/Kconfig   | 5 +
 drivers/video/logo/Makefile  | 1 +
 drivers/video/logo/logo.c| 3 +
 drivers/video/logo/logo_odin_clut224.ppm | 24484 +
 include/linux/linux_logo.h   | 1 +
 5 files changed, 24494 insertions(+)

diff --git a/drivers/video/logo/Kconfig b/drivers/video/logo/Kconfig
index 39ac49e..241a013 100644
--- a/drivers/video/logo/Kconfig
+++ b/drivers/video/logo/Kconfig
@@ -82,4 +82,9 @@ config LOGO_M32R_CLUT224
depends on M32R
default y
 
+config LOGO_ODIN_CLUT224
+   bool "224-color Odin logo"
+   depends on LOGO
+   default y
+
 endif # LOGO
diff --git a/drivers/video/logo/Makefile b/drivers/video/logo/Makefile
index 3b43781..de2215f 100644
--- a/drivers/video/logo/Makefile
+++ b/drivers/video/logo/Makefile
@@ -15,6 +15,7 @@ obj-$(CONFIG_LOGO_SUPERH_MONO)+= logo_superh_mono.o
 obj-$(CONFIG_LOGO_SUPERH_VGA16)+= logo_superh_vga16.o
 obj-$(CONFIG_LOGO_SUPERH_CLUT224)  += logo_superh_clut224.o
 obj-$(CONFIG_LOGO_M32R_CLUT224)+= logo_m32r_clut224.o
+obj-$(CONFIG_LOGO_ODIN_CLUT224)+= logo_odin_clut224.o
 
 obj-$(CONFIG_SPU_BASE) += logo_spe_clut224.o
 
diff --git a/drivers/video/logo/logo.c b/drivers/video/logo/logo.c
index 080c35b..72b1542 100644
--- a/drivers/video/logo/logo.c
+++ b/drivers/video/logo/logo.c
@@ -100,6 +100,9 @@ const struct linux_logo * __init_refok fb_find_logo(int depth)
/* M32R Linux logo */
logo = &logo_m32r_clut224;
 #endif
+#ifdef CONFIG_LOGO_ODIN_CLUT224
+   logo = &logo_odin_clut224;
+#endif
+#endif
}
return logo;
 }
diff --git a/drivers/video/logo/logo_odin_clut224.ppm b/drivers/video/logo/logo_odin_clut224.ppm
new file mode 100644
index 0000000..d9f7399
--- /dev/null
+++ b/drivers/video/logo/logo_odin_clut224.ppm
@@ -0,0 +1,24484 @@
+P3
+# CREATOR: GIMP PNM Filter Version 1.1
+80 102
+255
+[PPM pixel data, truncated in the archive]

Re: [Devel] [PATCH 00/17] oom killer enhancements

2015-08-31 Thread Konstantin Khorenko


Kirill, please review.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/14/2015 08:03 PM, Vladimir Davydov wrote:

  - Patches 1-3 fix stalls on global (1, 2) and local (3) reclaim.
https://jira.sw.ru/browse/PSBM-35155
  - Patches 4-6 revert our code implementing per memcg oom guarantees.
  - Patches 7-10 fix stalls on oom
  - Patch 11 introduces oom timeout
https://jira.sw.ru/browse/PSBM-38581
  - Patches 12, 13 fix oom vs freezer cgroup race
https://jira.sw.ru/browse/PSBM-38758
  - Patches 14, 15 reimplement oom guarantees
https://jira.sw.ru/browse/PSBM-37915
  - Patches 16, 17 resurrect oom berserker mode
https://jira.sw.ru/browse/PSBM-17930

https://jira.sw.ru/browse/PSBM-26973

Cong Wang (1):
   freezer: Do not freeze tasks killed by OOM killer

Lisa Du (1):
   mm: vmscan: fix do_try_to_free_pages() livelock

Michal Hocko (1):
   oom: thaw the OOM victim if it is frozen

Vinayak Menon (1):
   mm: vmscan: fix the page state calculation in too_many_isolated

Vladimir Davydov (13):
   mm: vmscan: do not scan lruvec if it seems to be unreclaimable
   memcg: revert old oom_guarantee logic
   Revert "ve/mm: ignore oom_score_adj of containerized tasks on global
 OOM"
   oom: zap oom_report_invocation proto
   Port diff-sched-introduce-cond_resched_may_throttle
   oom: allow to throttle due to cfs bandwidth while invoking oom
   sched: add sched_boost_task helper
   oom: boost dying tasks on global oom
   oom: introduce oom kill timeout
   mm: take into account ub oom score on global oom
   memcg: forbid setting memory.oom_guarantee from inside a container
   oom: resurrect berserker mode
   oom: do not dump all tasks info on each oom kill

  fs/proc/base.c |   7 +-
  include/bc/beancounter.h   |   5 ++
  include/linux/memcontrol.h |   8 +-
  include/linux/mmzone.h |   2 +-
  include/linux/oom.h|  22 +++---
  include/linux/sched.h  |  20 +
  kernel/bc/beancounter.c|  29 +++
  kernel/freezer.c   |   3 +
  kernel/rtmutex.c   |   5 ++
  kernel/sched/core.c|  19 +++--
  kernel/sched/fair.c|   3 +-
  kernel/sysctl.c|  14 
  mm/internal.h  |   1 +
  mm/memcontrol.c|  99 ++--
  mm/migrate.c   |   2 +-
  mm/oom_kill.c  | 183 +
  mm/page_alloc.c|   6 +-
  mm/swap.c  |   1 +
  mm/vmscan.c| 102 +
  mm/vmstat.c|   4 +-
  20 files changed, 377 insertions(+), 158 deletions(-)



[Devel] [PATCH RHEL7 COMMIT] config.OpenVZ: show Odin's logo on boot

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit eeddf77a216082687f589e408d3f560ced20ab96
Author: Konstantin Khorenko <khore...@virtuozzo.com>
Date:   Mon Aug 31 17:24:47 2015 +0400

config.OpenVZ: show Odin's logo on boot

Show Odin's logo instead of "tux" when booting kernel with
framebuffer enabled.

https://jira.sw.ru/browse/PSBM-34430

    Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

khorenko@ note: unlike in PCS6, Odin's logo is shown by default;
no additional kernel boot option is required
---
 configs/kernel-3.10.0-x86_64-debug.config | 1 +
 configs/kernel-3.10.0-x86_64.config   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/configs/kernel-3.10.0-x86_64-debug.config b/configs/kernel-3.10.0-x86_64-debug.config
index 5c17082..e8ea24f 100644
--- a/configs/kernel-3.10.0-x86_64-debug.config
+++ b/configs/kernel-3.10.0-x86_64-debug.config
@@ -5401,6 +5401,7 @@ CONFIG_TCACHE=y
 CONFIG_TSWAP=y
 
 CONFIG_VZ_IOLIMIT=m
+CONFIG_LOGO_ODIN_CLUT224=y
 
 #
 # User resources
diff --git a/configs/kernel-3.10.0-x86_64.config b/configs/kernel-3.10.0-x86_64.config
index ffc144b..d8b2a97 100644
--- a/configs/kernel-3.10.0-x86_64.config
+++ b/configs/kernel-3.10.0-x86_64.config
@@ -5372,6 +5372,7 @@ CONFIG_TCACHE=y
 CONFIG_TSWAP=y
 
 CONFIG_VZ_IOLIMIT=m
+CONFIG_LOGO_ODIN_CLUT224=y
 
 #
 # User resources

[Devel] [PATCH RHEL7 COMMIT] ve/tty: vt -- Fix nil dereference due to race

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 0fa0a39ad2c644f55f447cef85e5a9a8f06e43b7
Author: Cyrill Gorcunov <gorcu...@virtuozzo.com>
Date:   Mon Aug 31 17:08:30 2015 +0400

ve/tty: vt -- Fix nil dereference due to race

In commit 5571b126368c0153d73eaec0fdf43fbcbae67fd9 we brought
in the stubs for virtual terminals, but they are race sensitive:
all terminals are represented by one per-VE @vz_tty_conm tty peer,
which can be removed and set to nil if an application asks for a new
terminal while the old one is in the "remove" stage. This may
lead to a nil dereference and panic, as Nikita spotted:

 | [  325.357491] BUG: unable to handle kernel NULL pointer dereference at 0004
 | [  325.357816] IP: [] tty_open+0x610/0x6e0
 | [  325.357994] PGD 3b745067 PUD 3b6ff067 PMD 0
 | [  325.358201] Oops: 0002 [#1] SMP
 | [  325.362469] CPU: 1 PID: 2873 Comm: criu ve: 200 Not tainted 3.10.0-123.1.2.vz7.5.29 #1 5.29
 | [  325.362954] task: 88003af56480 ti: 88003bf24000 task.ti: 88003bf24000
 | [  325.363119] RIP: 0010:[]  [] tty_open+0x610/0x6e0
 | [  325.363329] RSP: 0018:88003bf25c00  EFLAGS: 00010202
 | [  325.363454] RAX: 0001 RBX: 88003638c000 RCX: 0001
 | [  325.363614] RDX: 88003bf25c34 RSI: 0002 RDI: 88003bf6
 | [  325.363776] RBP: 88003bf25c68 R08: 000208c0 R09: 88003d803c00
 | [  325.363972] R10: 0002 R11: 0004 R12: 
 | [  325.364145] R13:  R14: 0042 R15: 
 | [  325.364303] FS:  7fe73f0f6740() GS:88003de4() knlGS:
 | [  325.364479] CS:  0010 DS:  ES:  CR0: 80050033
 | [  325.364616] CR2: 0004 CR3: 3c78b000 CR4: 001406e0
 | [  325.364775] DR0:  DR1:  DR2: 
 | [  325.364949] DR3:  DR6: 0ff0 DR7: 0400
 | [  325.365120] Stack:
 | [  325.365198]  88003af56480 00043bf25c68 88003af56480 88003b7a1960
 | [  325.365498]  88003af56480 00428002 0001 b2cf8386
 | [  325.365838]  88003d21a068 88003b7a1960 88003638c000 
 | [  325.366130] Call Trace:
 | [  325.366215]  [] chrdev_open+0xa1/0x1e0
 | [  325.366339]  [] ? cdev_put+0x30/0x30
 | [  325.366472]  [] do_dentry_open.isra.17+0x192/0x290
 | [  325.366625]  [] finish_open+0x1e/0x30
 | [  325.366752]  [] do_last.isra.62+0x36d/0x1020
 | [  325.366956]  [] path_openat.isra.63+0xbe/0x480
 | [  325.367097]  [] do_filp_open+0x4b/0xb0
 | [  325.367226]  [] ? getname_flags+0x2c/0x120
 | [  325.367361]  [] ? __alloc_fd+0xa7/0x130
 | [  325.367490]  [] do_sys_open+0xf3/0x1f0
 | [  325.367623]  [] SyS_openat+0x14/0x20
 | [  325.367758]  [] system_call_fastpath+0x16/0x1b

Let's provide a per-VT tty, as it should be.

Note that the code is being reworked to bring in real virtualization
instead of stubs, so this is rather a fix to avoid blocking migration testing
(that's why I don't remove @vz_tty_conm and @vz_tty_cons from
 struct ve_struct: I've already zapped all this, including
 the file kernel/ve/console.c itself, and once the new version is
 stabilized we will drop all this in one pass).

https://jira.sw.ru/browse/PSBM-37929

Signed-off-by: Cyrill Gorcunov <gorcu...@virtuozzo.com>

CC: Nikita Spiridonov <nspirido...@odin.com>
CC: Vladimir Davydov <vdavy...@virtuozzo.com>
CC: Konstantin Khorenko <khore...@virtuozzo.com>
---
 kernel/ve/console.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/ve/console.c b/kernel/ve/console.c
index 922848a..bc7d752 100644
--- a/kernel/ve/console.c
+++ b/kernel/ve/console.c
@@ -47,7 +47,7 @@ static struct tty_struct *vz_tty_lookup(struct tty_driver *driver,
if (idx != VZ_CON_INDEX || driver == vz_cons_driver)
return ERR_PTR(-EIO);
 
-   return ve->vz_tty_conm;
+   return ve->vz_tty_vt[idx];
 }
 
 static int vz_tty_install(struct tty_driver *driver, struct tty_struct *tty)
@@ -62,7 +62,7 @@ static int vz_tty_install(struct tty_driver *driver, struct tty_struct *tty)
tty_port_init(tty->port);
tty->termios = driver->init_termios;
 
-   ve->vz_tty_conm = tty;
+   ve->vz_tty_vt[tty->index] = tty;
 
tty_driver_kref_get(driver);
tty->count++;
@@ -74,7 +74,7 @@ static void vz_tty_remove(struct tty_driver *driver, struct tty_struct *tty)
struct ve_struct *ve = get_exec_env();
 
BUG_ON(driv

[Devel] [PATCH RHEL7 COMMIT] ms/mm: memcontrol: reclaim at least once for __GFP_NORETRY

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit 8aee40dd982b476c208aee78e19cd756f2dac8a7
Author: Johannes Weiner 
Date:   Mon Aug 31 17:09:47 2015 +0400

ms/mm: memcontrol: reclaim at least once for __GFP_NORETRY

Currently, __GFP_NORETRY tries charging once and gives up before even
trying to reclaim.  Bring the behavior on par with the page allocator
and reclaim at least once before giving up.
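
The reordering can be illustrated with a small standalone C model. It is a sketch under simplifying assumptions: the model_* names are invented, reclaim is reduced to "free whatever is marked reclaimable", and the retry heuristics between reclaim and the OOM path are left out.

```c
#include <assert.h>
#include <stdbool.h>

enum model_result {
	MODEL_CHARGE_OK,	/* charge fit under the limit */
	MODEL_CHARGE_RETRY,	/* reclaim made room, caller should retry */
	MODEL_CHARGE_NOMEM,	/* gave up without invoking OOM */
	MODEL_CHARGE_OOM	/* would continue into OOM handling */
};

/* Toy memory cgroup: all quantities are in pages. */
struct model_memcg {
	unsigned long usage;
	unsigned long limit;
	unsigned long reclaimable;	/* part of usage that reclaim can free */
	unsigned long failcnt;
};

/* One reclaim pass: free everything that is reclaimable. */
static void model_reclaim(struct model_memcg *m)
{
	assert(m->reclaimable <= m->usage);
	m->usage -= m->reclaimable;
	m->reclaimable = 0;
}

/* Charge attempt with the post-patch ordering: reclaim, then NORETRY check. */
static enum model_result model_do_charge(struct model_memcg *m,
					 unsigned long nr_pages, bool noretry)
{
	if (m->usage + nr_pages <= m->limit) {
		m->usage += nr_pages;
		return MODEL_CHARGE_OK;
	}

	model_reclaim(m);
	if (m->limit - m->usage >= nr_pages)
		return MODEL_CHARGE_RETRY;

	m->failcnt++;
	if (noretry)
		return MODEL_CHARGE_NOMEM;
	return MODEL_CHARGE_OOM;
}
```

With the old ordering the first call on a full but reclaimable group would have returned NOMEM immediately for a NORETRY caller; here it reclaims once and reports that a retry can succeed.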

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Hugh Dickins 
Cc: Tejun Heo 
Cc: Vladimir Davydov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 28c34c291e746aab1c2bfd6d6609b2e47fa0978b)
Signed-off-by: Vladimir Davydov 

Conflicts:
mm/memcontrol.c
---
 mm/memcontrol.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fc2931..52c7871 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2754,11 +2754,6 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return CHARGE_WOULDBLOCK;
}
 
-   if (gfp_mask & __GFP_NORETRY) {
-   mem_cgroup_inc_failcnt(mem_over_limit, gfp_mask, nr_pages);
-   return CHARGE_NOMEM;
-   }
-
ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
return CHARGE_RETRY;
@@ -2787,6 +2782,9 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
mem_cgroup_inc_failcnt(mem_over_limit, gfp_mask, nr_pages);
 
+   if (gfp_mask & __GFP_NORETRY)
+   return CHARGE_NOMEM;
+
/* check OOM */
if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
return CHARGE_OOM_DIE;

Re: [Devel] [PATCH rh7] memcg: count all oom kills

2015-08-31 Thread Konstantin Khorenko


Kirill, please review.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/14/2015 12:48 PM, Vladimir Davydov wrote:

We do not count processes killed because they share victim's mm. Fix it.

Fixes: 66053f4201e41 ("memcg: count oom kills")
Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
  include/linux/memcontrol.h |  4 ++--
  mm/memcontrol.c| 15 +--
  mm/oom_kill.c  |  3 ++-
  3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb7ae43a57f9..ac3f16f0ee28 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -122,7 +122,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
  void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
  extern bool mem_cgroup_below_oom_guarantee(struct task_struct *p);
  extern void mem_cgroup_note_oom_kill(struct mem_cgroup *memcg,
-struct mm_struct *mm);
+struct task_struct *task);
  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
  extern void mem_cgroup_replace_page_cache(struct page *oldpage,
@@ -351,7 +351,7 @@ static inline bool mem_cgroup_below_oom_guarantee(struct task_struct *p)
  }

  static inline void
-mem_cgroup_note_oom_kill(struct mem_cgroup *memcg, struct mm_struct *mm)
+mem_cgroup_note_oom_kill(struct mem_cgroup *memcg, struct task_struct *task)
  {
  }

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 52c787165b17..0cb329028a29 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1623,14 +1623,25 @@ bool mem_cgroup_below_oom_guarantee(struct task_struct *p)
  }

  void mem_cgroup_note_oom_kill(struct mem_cgroup *root_memcg,
- struct mm_struct *mm)
+ struct task_struct *task)
  {
struct mem_cgroup *memcg, *memcg_to_put;
+   struct task_struct *p;

if (!root_memcg)
root_memcg = root_mem_cgroup;

-   memcg_to_put = memcg = try_get_mem_cgroup_from_mm(mm);
+   p = find_lock_task_mm(task);
+   if (p) {
+   memcg = try_get_mem_cgroup_from_mm(p->mm);
+   task_unlock(p);
+   } else {
+   rcu_read_lock();
+   memcg = mem_cgroup_from_task(task);
+   css_get(&memcg->css);
+   rcu_read_unlock();
+   }
+   memcg_to_put = memcg;
if (!memcg || !mem_cgroup_same_or_subtree(root_memcg, memcg))
memcg = root_memcg;

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c99a5f559286..70893730524a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -498,7 +498,6 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

/* mm cannot safely be dereferenced after task_unlock(victim) */
mm = victim->mm;
-   mem_cgroup_note_oom_kill(memcg, mm);
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -526,11 +525,13 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
task_pid_nr(p), p->comm);
task_unlock(p);
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+   mem_cgroup_note_oom_kill(memcg, p);
}
rcu_read_unlock();

set_tsk_thread_flag(victim, TIF_MEMDIE);
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+   mem_cgroup_note_oom_kill(memcg, victim);
put_task_struct(victim);
  }
  #undef K


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

[Devel] [PATCH RHEL7 COMMIT] ve/kmod: fix out-of-bounds access in call_modprobe()

2015-08-31 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.4
-->
commit e2164f15d2f004ce076da3aa925b681bd8cde8d8
Author: Andrey Ryabinin <aryabi...@odin.com>
Date:   Mon Aug 31 17:15:30 2015 +0400

ve/kmod: fix out-of-bounds access in call_modprobe()

Commit 18f83b2460e2 ("ve/kmod: Port autoloading from CT") added one more
element to the argv array, but the allocation site was not enlarged to
match, so filling the array wrote past the end of the allocation.
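
The mismatch described above can be ruled out by deriving the allocation size from the same slot list that gets filled in. A minimal userspace sketch, assuming a hypothetical slot layout (the actual argv contents are not shown in this message) and malloc() standing in for kmalloc():

```c
#include <stdlib.h>

/*
 * Hypothetical slot layout. ARG_EXTRA stands in for the element that the
 * autoloading port added; the real contents of argv may differ.
 */
enum {
	ARG_PATH,
	ARG_QUIET,
	ARG_EXTRA,
	ARG_SEP,
	ARG_MODULE,
	ARG_END,	/* NULL terminator */
	ARG_COUNT	/* number of slots, terminator included */
};

/* Allocate and fill an argv whose size always follows the enum. */
char **build_modprobe_argv(char *path, char *extra, char *module)
{
	char **argv = malloc(sizeof(char *[ARG_COUNT]));

	if (!argv)
		return NULL;

	argv[ARG_PATH]   = path;
	argv[ARG_QUIET]  = "-q";
	argv[ARG_EXTRA]  = extra;
	argv[ARG_SEP]    = "--";
	argv[ARG_MODULE] = module;
	argv[ARG_END]    = NULL;
	return argv;
}
```

Adding a slot to the enum grows the allocation automatically, so the count and the stores cannot drift apart the way the literal 5 did.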

https://jira.sw.ru/browse/PSBM-38666

Fixes: 18f83b2460e2 ("ve/kmod: Port autoloading from CT")
Signed-off-by: Andrey Ryabinin <aryabi...@odin.com>
Cc: Konstantin Khorenko <khore...@virtuozzo.com>

Signed-off-by: Andrey Ryabinin <aryabi...@odin.com>
Acked-by: Kirill Tkhai <ktk...@odin.com>
---
 kernel/kmod.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index e0554f8..aa5cb99 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -91,7 +91,7 @@ static int call_modprobe(char *module_name, int wait, int 
blacklist)
NULL
};
 
-   char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
+   char **argv = kmalloc(sizeof(char *[6]), GFP_KERNEL);
if (!argv)
goto out;
 

Re: [Devel] [PATCH rh7] ploop: use GFP_NOIO in ploop_make_request

2015-08-31 Thread Konstantin Khorenko


Maxim, please review.

Do we need the same in PCS6?

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/17/2015 04:30 PM, Vladimir Davydov wrote:

Currently, we use GFP_NOFS, which may result in a deadlock as follows:

filemap_fault
  do_mpage_readpage
   submit_bio
generic_make_request initializes current->bio_list
 calls make_request_fn
 ploop_make_request
  bio_alloc(GFP_NOFS)
   kmem_cache_alloc
memcg_charge_kmem
 try_to_free_mem_cgroup_pages
  swap_writepage
   generic_make_request  puts bio on current->bio_list
 try_to_free_mem_cgroup_pages
  wait_on_page_writeback

The wait_on_page_writeback will never complete then, because the
corresponding bio is on current->bio_list and for it to get to the queue
we must return from ploop_make_request first.
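
The queuing behaviour behind this deadlock can be shown with a toy model. This is only a sketch: the toy_ names are made up, and the real block layer re-dispatches deferred bios through make_request_fn rather than completing them directly. It records what a waiter inside the request handler would observe:

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

struct toy_bio {
	bool done;
};

struct toy_task {
	bool active;			/* models current->bio_list being set */
	struct toy_bio *pending[TOY_MAX_PENDING];
	size_t npending;
	struct toy_bio writeback;	/* bio issued from "reclaim" */
	bool writeback_done_inside;	/* state seen before the handler returns */
};

static void toy_make_request(struct toy_task *t, struct toy_bio *bio);

/* Toy generic_make_request(): nested submissions are only queued. */
void toy_submit(struct toy_task *t, struct toy_bio *bio)
{
	size_t i;

	if (t->active) {
		/* Nested call: defer (silently dropped if the toy queue is full). */
		if (t->npending < TOY_MAX_PENDING)
			t->pending[t->npending++] = bio;
		return;
	}

	t->active = true;
	toy_make_request(t, bio);

	/* Deferred bios make progress only after the handler has returned. */
	for (i = 0; i < t->npending; i++)
		t->pending[i]->done = true;
	t->npending = 0;
	t->active = false;
}

/* Toy request handler whose allocation falls into reclaim. */
static void toy_make_request(struct toy_task *t, struct toy_bio *bio)
{
	/* Reclaim writes a page back: the bio lands on the pending list. */
	toy_submit(t, &t->writeback);

	/* A real waiter would sleep here forever; the sketch records the state. */
	t->writeback_done_inside = t->writeback.done;

	bio->done = true;
}
```

In the model the writeback bio is never complete while the handler is still running, which is the condition wait_on_page_writeback sleeps on. With GFP_NOIO, reclaim is not allowed to start I/O, so the nested submission should not happen in the first place.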

The stack trace of a hung task:

[] sleep_on_page+0xe/0x20
[] wait_on_page_bit+0x86/0xb0
[] shrink_page_list+0x6e2/0xaf0
[] shrink_inactive_list+0x1cb/0x610
[] shrink_lruvec+0x395/0x790
[] shrink_zone+0x181/0x350
[] do_try_to_free_pages+0x170/0x530
[] try_to_free_mem_cgroup_pages+0xb6/0x140
[] __mem_cgroup_try_charge+0x1de/0xd70
[] memcg_charge_kmem+0x9b/0x100
[] __memcg_charge_slab+0x3b/0x90
[] new_slab+0x264/0x3f0
[] __slab_alloc+0x315/0x48f
[] kmem_cache_alloc+0x1cc/0x210
[] mempool_alloc_slab+0x15/0x20
[] mempool_alloc+0x69/0x170
[] bvec_alloc+0x92/0x120
[] bio_alloc_bioset+0x1e8/0x2e0
[] ploop_make_request+0x2a6/0xac0 [ploop]
[] generic_make_request+0xe2/0x130
[] submit_bio+0x77/0x1c0
[] do_mpage_readpage+0x37f/0x6e0
[] mpage_readpages+0xeb/0x160
[] ext4_readpages+0x3c/0x40 [ext4]
[] __do_page_cache_readahead+0x1e0/0x260
[] ra_submit+0x21/0x30
[] filemap_fault+0x321/0x4b0
[] __do_fault+0x8a/0x560
[] handle_mm_fault+0x3d0/0xd80
[] __do_page_fault+0x15e/0x530
[] do_page_fault+0x1a/0x70
[] page_fault+0x28/0x30

https://jira.sw.ru/browse/PSBM-38842

Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>
---
  drivers/block/ploop/dev.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 30eb8a7551e5..f37df4dacf8c 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -717,7 +717,7 @@ preallocate_bio(struct bio * orig_bio, struct ploop_device 
* plo)
}

if (nbio == NULL)
-   nbio = bio_alloc(GFP_NOFS, max(orig_bio->bi_max_vecs, 
block_vecs(plo)));
+   nbio = bio_alloc(GFP_NOIO, max(orig_bio->bi_max_vecs, 
block_vecs(plo)));
return nbio;
  }

@@ -852,7 +852,7 @@ static void ploop_make_request(struct request_queue *q, 
struct bio *bio)

if (!current->io_context) {
struct io_context *ioc;
-   ioc = get_task_io_context(current, GFP_NOFS, NUMA_NO_NODE);
+   ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE);
if (ioc)
put_io_context(ioc);
}



[Devel] [PATCH RHEL7 COMMIT] ve/cgroup: fix mangle root in CT

2015-09-01 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.5
-->
commit 1518ff8ef0a78d8be1b19774506f355424103e9a
Author: Pavel Tikhomirov 
Date:   Tue Sep 1 16:13:30 2015 +0400

ve/cgroup: fix mangle root in CT

cgroups nested more than two levels deep were not mangled inside a
container. This could cause problems with Docker, which was able to
see host-relative paths in /proc/self/cgroup.

The problem is not Docker-specific, though:

CT-103 /# mkdir /sys/fs/cgroup/devices/test.slice
CT-103 /# mkdir /sys/fs/cgroup/devices/test.slice/test.scope
CT-103 /# sleep 1000&
[1] 578
CT-103 /# echo 578 > /sys/fs/cgroup/devices/test.slice/test.scope/tasks

with patch:

CT-103 /# cat /proc/578/cgroup
16:ve:/
15:hugetlb:/
14:perf_event:/
12:net_cls:/
11:freezer:/
10:devices:/test.slice/test.scope
6:name=systemd:/user-0.slice/session-c109.scope
5:cpuset:/
4:cpuacct,cpu:/
3:beancounter:/
2:memory:/
1:blkio:/

without:

CT-103 /# cat /proc/480/cgroup
16:ve:/
15:hugetlb:/
14:perf_event:/
12:net_cls:/
11:freezer:/
10:devices:/103/test.slice/test.scope
6:name=systemd:/user.slice/user-0.slice/session-c2.scope
5:cpuset:/
4:cpuacct,cpu:/
3:beancounter:/
2:memory:/
1:blkio:/
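
The difference between the two listings can be pictured as a right-to-left path walk that stops at the container's own cgroup. A userspace sketch, assuming a simplified hypothetical struct toy_cgroup in place of struct cgroup and an explicit in_container flag in place of the ve_is_super() check:

```c
#include <stddef.h>
#include <string.h>

struct toy_cgroup {
	const char *name;
	const struct toy_cgroup *parent;	/* NULL for the hierarchy root */
};

/* Build the path from the leaf upwards. Returns 0 on success, -1 if buf is too small. */
int toy_cgroup_path(const struct toy_cgroup *cg, char *buf, size_t buflen,
		    int in_container)
{
	char *start;

	if (buflen < 2)
		return -1;
	start = buf + buflen - 1;
	*start = '\0';

	while (cg->parent) {
		size_t len;

		/*
		 * Inside a container, stop at the container's own cgroup
		 * (a direct child of the root) so that it is not printed.
		 */
		if (in_container && !cg->parent->parent)
			break;

		len = strlen(cg->name);
		if ((size_t)(start - buf) < len + 1)
			return -1;
		start -= len;
		memcpy(start, cg->name, len);
		*--start = '/';
		cg = cg->parent;
	}

	/* Nothing was emitted: the path is just "/". */
	if (*start != '/')
		*--start = '/';

	memmove(buf, start, strlen(start) + 1);
	return 0;
}
```

For the chain root -> 103 -> test.slice -> test.scope this yields /test.slice/test.scope with in_container set and /103/test.slice/test.scope without it, matching the two listings above.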

https://jira.sw.ru/browse/PSBM-38634

Signed-off-by: Pavel Tikhomirov 
Reviewed-by: Cyrill Gorcunov 

khorenko@: this fix is quite inflexible: if we move CTs into
machine.slice, we will have to rework it.
But I accept it because we are still not sure about the final
cgroups "virtualization" implementation, so this is less work right now
that can be dropped later.
---
 kernel/cgroup.c | 35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d96176e..a07c4e0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1808,6 +1808,7 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int 
buflen)
 {
int ret = -ENAMETOOLONG;
char *start;
+   struct ve_struct *ve = get_exec_env();
 
if (!cgrp->parent) {
if (strlcpy(buf, "/", buflen) >= buflen)
@@ -1815,21 +1816,6 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, 
int buflen)
return 0;
}
 
-#ifdef CONFIG_VE
-   /*
-* Containers cgroups are bind-mounted from node
-* so they are like '/' from inside, thus we have
-* to mangle cgroup path output.
-*/
-   if (!ve_is_super(get_exec_env())) {
-   if (cgrp->parent && !cgrp->parent->parent) {
-   if (strlcpy(buf, "/", buflen) >= buflen)
-   return -ENAMETOOLONG;
-   return 0;
-   }
-   }
-#endif
-
start = buf + buflen - 1;
*start = '\0';
 
@@ -1838,6 +1824,25 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, 
int buflen)
const char *name = cgroup_name(cgrp);
int len;
 
+#ifdef CONFIG_VE
+   if (!ve_is_super(ve) && cgrp->parent && !cgrp->parent->parent) {
+   /*
+* Containers cgroups are bind-mounted from node
+* so they are like '/' from inside, thus we have
+* to mangle cgroup path output. Effectively it is
+* enough to remove two topmost cgroups from path.
+* e.g. in ct 101: /101/test.slice/test.scope ->
+* /test.slice/test.scope
+*/
+   if (*start != '/') {
+   if (--start < buf)
+   goto out;
+   *start = '/';
+   }
+   break;
+   }
+#endif
+
len = strlen(name);
if ((start -= len) < buf)
goto out;

[Devel] [PATCH RHEL7 COMMIT] mmap: call mmap prep only for regular files

2015-09-01 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.5
-->
commit 1e596ab0358ff8dde342efb6274e08459d08a711
Author: Vladimir Davydov 
Date:   Tue Sep 1 16:16:59 2015 +0400

mmap: call mmap prep only for regular files

Port 2.6.32-x diff-mm-mmap-call-mmap-prep-only-for-regular-files

We forgot to port this patch. This results in a kernel panic on an attempt to
mmap a char device on ext4.


=
Author: Vladimir Davydov
Email: vdavy...@parallels.com
Subject: mmap: call mmap prep only for regular files
Date: Mon, 17 Feb 2014 12:59:36 +0400

To give FS a chance to clear pfcache csum on shared mmap, we issue
->mmap(vma=NULL) for those FS's that want it (FS_HAS_MMAP_PREP) before
taking mmap_sem (we can't do it under mmap_sem due to lockdep, see
PSBM-23133). There we haven't checked arguments properly yet. In
particular, the file can refer to a device, in which case we will
crash, because devices' ->mmap (e.g. /dev/zero) is not supposed to be
called with vma=NULL. Fix this by checking if the file refers to a
regular file before calling mmap prep for it.

https://bugzilla.openvz.org/show_bug.cgi?id=2886
https://jira.sw.ru/browse/PSBM-25031

Signed-off-by: Vladimir Davydov 
Acked-by: Dmitry Monakhov 

=

Reported-by: Andrew Perepechko 
Signed-off-by: Vladimir Davydov 

Cc: Andrew Perepechko 
Cc: Alex Lyashkov 
Cc: Igor Seletskiy 
---
 mm/util.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/util.c b/mm/util.c
index 31cd9d7..e0ac8ae 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -367,6 +367,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned 
long addr,
if (!ret) {
/* Ugly fix for PSBM-23133 vdavydov@ */
if (file && file->f_op && (flag & MAP_TYPE) == MAP_SHARED &&
+   S_ISREG(file_inode(file)->i_mode) &&
(file_inode(file)->i_sb->s_type->fs_flags & 
FS_HAS_MMAP_PREP))
file->f_op->mmap(file, NULL);
down_write(&mm->mmap_sem);

[Devel] [PATCH RHEL7 COMMIT] ms/sched/numa: Fix initialization of sched_domain_topology for NUMA

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 384a4643220fffd9001172e16ea54396a3675ab6
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:30 2015 +0400

ms/sched/numa: Fix initialization of sched_domain_topology for NUMA

https://jira.sw.ru/browse/PSBM-26429

From: Vincent Guittot 

commit c515db8cd311ef77b2dc7cbd6b695022655bb0f3 upstream.

Jet Chen has reported a kernel panic when booting qemu-system-x86_64 with
kvm64 cpu. A panic occurred while building the sched_domain.

In sched_init_numa, we create a new topology table in which both default
levels and numa levels are copied. The last row of the table must have a
null pointer in the mask field.

The current implementation doesn't add this last row in the computation of
the table size. So we add 1 row in the allocation size that will be used as
the last row of the table. The kzalloc will ensure that the mask field is NULL.
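
The sizing rule can be illustrated outside the kernel. A minimal sketch, assuming a hypothetical struct toy_level whose mask is a plain string (in the kernel it is a function pointer) and calloc() standing in for kzalloc():

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct toy_level {
	const char *mask;	/* NULL marks the terminating row */
};

/* Count the rows that precede the terminating NULL-mask row. */
size_t toy_count_levels(const struct toy_level *tl)
{
	size_t n = 0;

	while (tl[n].mask)
		n++;
	return n;
}

/*
 * Merge the default levels and nr_numa extra levels into a new table.
 * The "+ 1" reserves the terminating row; calloc() zeroes it, so its
 * mask is already NULL.
 */
struct toy_level *toy_merge_levels(const struct toy_level *def,
				   const struct toy_level *numa,
				   size_t nr_numa)
{
	size_t nr_def = toy_count_levels(def);
	struct toy_level *tl = calloc(nr_def + nr_numa + 1, sizeof(*tl));

	if (!tl)
		return NULL;

	memcpy(tl, def, nr_def * sizeof(*tl));
	memcpy(tl + nr_def, numa, nr_numa * sizeof(*tl));
	return tl;
}
```

Without the extra row, any later walk of the form for (i = 0; tl[i].mask; i++) would read past the end of the allocation.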

Reported-by: Jet Chen 
Tested-by: Jet Chen 
Signed-off-by: Vincent Guittot 
Signed-off-by: Peter Zijlstra 
Cc: fengguang...@intel.com
Link: http://lkml.kernel.org/r/1399972261-25693-1-git-send-email-vincent.guit...@linaro.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 30f39a25..df63b3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6827,7 +6827,7 @@ static void sched_init_numa(void)
/* Compute default topology size */
for (i = 0; sched_domain_topology[i].mask; i++);
 
-   tl = kzalloc((i + level) *
+   tl = kzalloc((i + level + 1) *
sizeof(struct sched_domain_topology_level), GFP_KERNEL);
if (!tl)
return;

[Devel] [PATCH RHEL7 COMMIT] ms/MIPS: Use NUMA_NO_NODE instead of -1 for node ID.

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 61913332fac411269855bf321d34f87f7a4fb060
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:33 2015 +0400

ms/MIPS: Use NUMA_NO_NODE instead of -1 for node ID.

https://jira.sw.ru/browse/PSBM-26429

From: Ralf Baechle 

commit 761845f0f68cf6eba9cad0a58d977b89f8d4486f upstream.

Original patch by Jianguo Wu .

Signed-off-by: Ralf Baechle 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/mips/kernel/module.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 977a623..2a52568 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -46,7 +47,7 @@ static DEFINE_SPINLOCK(dbe_lock);
 void *module_alloc(unsigned long size)
 {
return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
-   GFP_KERNEL, PAGE_KERNEL, -1,
+   GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 #endif

[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit e79a1d458f45de9a672aefd76753949780b6af16
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:34 2015 +0400

ms/mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 02e72cc61713185013d958baba508288ba2a0157 upstream.

There are two versions of alloc/free hooks now - one for
CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.

I see no reason why calls to other debugging subsystems (LOCKDEP,
DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
All these features should work regardless of SLUB_DEBUG config, as all of
them already have their own Kconfig options.

This also fixes failslab for CONFIG_SLUB_DEBUG=n configuration.  It
simply has not worked before because should_failslab() call was in a
hook hidden under "#ifdef CONFIG_SLUB_DEBUG #else".

Note: There is one concealed change in allocation path for SLUB_DEBUG=n
and all other debugging features disabled.  The might_sleep_if() call
can generate some code even if DEBUG_ATOMIC_SLEEP=n.  For
PREEMPT_VOLUNTARY=y might_sleep() inserts _cond_resched() call, but I
think it should be ok.

Signed-off-by: Andrey Ryabinin 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 mm/slub.c | 90 ---
 1 file changed, 40 insertions(+), 50 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 51772b6..f39e69c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -928,50 +928,6 @@ static void trace(struct kmem_cache *s, struct page *page, 
void *object,
 }
 
 /*
- * Hooks for other subsystems that check memory allocations. In a typical
- * production configuration these hooks all should produce no code at all.
- */
-static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
-{
-   flags &= gfp_allowed_mask;
-   lockdep_trace_alloc(flags);
-   might_sleep_if(flags & __GFP_WAIT);
-   WARN_ON((flags & __GFP_FS) && current->journal_info);
-
-   return should_failslab(s->object_size, flags, s->flags);
-}
-
-static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, 
void *object)
-{
-   flags &= gfp_allowed_mask;
-   kmemcheck_slab_alloc(s, flags, object, slab_ksize(s));
-   kmemleak_alloc_recursive(object, s->object_size, 1, s->flags, flags);
-}
-
-static inline void slab_free_hook(struct kmem_cache *s, void *x)
-{
-   kmemleak_free_recursive(x, s->flags);
-
-   /*
-* Trouble is that we may no longer disable interupts in the fast path
-* So in order to make the debug calls that expect irqs to be
-* disabled we need to disable interrupts temporarily.
-*/
-#if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
-   {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   kmemcheck_slab_free(s, x, s->object_size);
-   debug_check_no_locks_freed(x, s->object_size);
-   local_irq_restore(flags);
-   }
-#endif
-   if (!(s->flags & SLAB_DEBUG_OBJECTS))
-   debug_check_no_obj_freed(x, s->object_size);
-}
-
-/*
  * Tracking of fully allocated slabs for debugging purposes.
  *
  * list_lock must be held.
@@ -1256,16 +1212,50 @@ static inline void inc_slabs_node(struct kmem_cache *s, 
int node,
int objects) {}
 static inline void dec_slabs_node(struct kmem_cache *s, int node,
int objects) {}
-
+#endif /* CONFIG_SLUB_DEBUG */
+/*
+ * Hooks for other subsystems that check memory allocations. In a typical
+ * production configuration these hooks all should produce no code at all.
+ */
 static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
-   { return 0; }
+{
+   flags &= gfp_allowed_mask;
+   lockdep_trace_alloc(flags);
+   might_sleep_if(flags & __GFP_WAIT);
+   WARN_ON((flags & __GFP_FS) && current->journal_info);
 
-static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
-   void *object) {}
+   return should_failslab(s->object_size, flags, s->flags);
+}
 
-static inline void slab_free_hook(struct kmem_cache *s,

[Devel] [PATCH RHEL7 COMMIT] ms/mm/arch: use NUMA_NO_NODE

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 59313b3e6fe7c9ffe5dde09bd8379e3ac8a583e0
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:32 2015 +0400

ms/mm/arch: use NUMA_NO_NODE

https://jira.sw.ru/browse/PSBM-26429

From: Jianguo Wu 

commit 40c3baa7c66f1352521378ee83509fb8f4c465de upstream.

Use more appropriate NUMA_NO_NODE instead of -1 in all archs' module_alloc()

Signed-off-by: Jianguo Wu 
Acked-by: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/arm/kernel/module.c| 2 +-
 arch/arm64/kernel/module.c  | 2 +-
 arch/parisc/kernel/module.c | 2 +-
 arch/s390/kernel/module.c   | 2 +-
 arch/sparc/kernel/module.c  | 2 +-
 arch/x86/kernel/module.c| 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 1e9be5d..be3232f 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -40,7 +40,7 @@
 void *module_alloc(unsigned long size)
 {
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, -1,
+   GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 #endif
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index ca0e3d5..8f898bd 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -29,7 +29,7 @@
 void *module_alloc(unsigned long size)
 {
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, -1,
+   GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index 2a625fb..50dfafc 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -219,7 +219,7 @@ void *module_alloc(unsigned long size)
 * init_data correctly */
return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
GFP_KERNEL | __GFP_HIGHMEM,
-   PAGE_KERNEL_RWX, -1,
+   PAGE_KERNEL_RWX, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 7845e15..b89b591 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -50,7 +50,7 @@ void *module_alloc(unsigned long size)
if (PAGE_ALIGN(size) > MODULES_LEN)
return NULL;
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL, -1,
+   GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 #endif
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index 4435488..97655e0 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -29,7 +29,7 @@ static void *module_map(unsigned long size)
if (PAGE_ALIGN(size) > MODULES_LEN)
return NULL;
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL, -1,
+   GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 #else
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 7c1efc4..958bfb6 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -49,7 +49,7 @@ void *module_alloc(unsigned long size)
return NULL;
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC,
-   -1, __builtin_return_address(0));
+   NUMA_NO_NODE, __builtin_return_address(0));
 }
 
 #ifdef CONFIG_X86_32

[Devel] [PATCH RHEL7 COMMIT] ms/kernel: use the gnu89 standard explicitly

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 0331e712aa16d12ea15c567e25111c3443456479
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:29 2015 +0400

ms/kernel: use the gnu89 standard explicitly

https://jira.sw.ru/browse/PSBM-26429

From: "Kirill A. Shutemov" 

commit 51b97e354ba9fce1890cf38ecc754aa49677fc89 upstream.

Sasha Levin reports:
 "gcc5 changes the default standard to c11, which makes kernel build
  unhappy

  Explicitly define the kernel standard to be gnu89 which should keep
  everything working exactly like it was before gcc5"

There are multiple small issues with the new default, but the biggest
issue seems to be that the old - and very useful - GNU extension to
allow a cast in front of an initializer has gone away.

Patch updated by Kirill:
 "I'm pretty sure all gcc versions you can build kernel with supports
  -std=gnu89.  cc-option is redundant.

  We also need to adjust HOSTCFLAGS otherwise allmodconfig fails for me"

Note by Andrew Pinski:
 "Yes it was reported and both problems relating to this extension has
  been added to gnu99 and gnu11.  Though there are other issues with the
  kernel dealing with extern inline have different semantics between
  gnu89 and gnu99/11"

End result: we may be able to move up to a newer stdc model eventually,
but right now the newer models have some annoying deficiencies, so the
traditional "gnu89" model ends up being the preferred one.

Signed-off-by: Sasha Levin 
Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 Makefile | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 1ccfd12..bfd04ef 100644
--- a/Makefile
+++ b/Makefile
@@ -253,7 +253,7 @@ CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo 
$$BASH; \
 
 HOSTCC   = gcc
 HOSTCXX  = g++
-HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 
-fomit-frame-pointer
+HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 
-fomit-frame-pointer -std=gnu89
 HOSTCXXFLAGS = -O2
 
 # Decide whether to build built-in, modular, or both.
@@ -385,7 +385,8 @@ KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes 
-Wno-trigraphs \
   -fno-strict-aliasing -fno-common \
   -Werror-implicit-function-declaration \
   -Wno-format-security \
-  -fno-delete-null-pointer-checks
+  -fno-delete-null-pointer-checks \
+  -std=gnu89
 
 ifeq ($(KBUILD_EXTMOD),)
 ifneq (,$(filter $(ARCH), x86 x86_64))

[Devel] [PATCH RHEL7 COMMIT] ms/mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit abee218424a2434e8cc576037563de55bec730de
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:31 2015 +0400

ms/mm/vmalloc: use wrapper function get_vm_area_size to caculate size of vm area

https://jira.sw.ru/browse/PSBM-26429

From: Wanpeng Li 

commit 762216ab4e175f49d17bc7ad778c57b9028184e6 upstream.

Use wrapper function get_vm_area_size to calculate size of vm area.
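
Judging by the expressions it replaces in the diff, the wrapper returns area->size minus one page, that is, the size without the trailing guard page. A userspace sketch under that assumption (the toy_ names and the 4 KiB page size are illustrative, not kernel definitions):

```c
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

struct toy_vm_struct {
	unsigned long addr;
	unsigned long size;	/* assumed to include one trailing guard page */
};

/* Usable size of the area: everything except the guard page. */
static inline unsigned long toy_get_vm_area_size(const struct toy_vm_struct *area)
{
	return area->size - TOY_PAGE_SIZE;
}

/* First address past the usable part of the area. */
unsigned long toy_vm_area_end(const struct toy_vm_struct *area)
{
	return area->addr + toy_get_vm_area_size(area);
}

/* Number of pages that actually back the area. */
unsigned long toy_vm_area_nr_pages(const struct toy_vm_struct *area)
{
	return toy_get_vm_area_size(area) >> TOY_PAGE_SHIFT;
}
```

Keeping the subtraction in one helper means callers such as map_vm_area() and vread() no longer repeat the guard-page arithmetic by hand.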

Signed-off-by: Wanpeng Li 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Fengguang Wu 
Cc: Joonsoo Kim 
Cc: Johannes Weiner 
Cc: Tejun Heo 
Cc: Yasuaki Ishimatsu 
Cc: David Rientjes 
Cc: KOSAKI Motohiro 
Cc: Jiri Kosina 
Cc: Wanpeng Li 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 mm/vmalloc.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7fbc92a..0c531e1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1285,7 +1285,7 @@ void unmap_kernel_range(unsigned long addr, unsigned long 
size)
 int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
 {
unsigned long addr = (unsigned long)area->addr;
-   unsigned long end = addr + area->size - PAGE_SIZE;
+   unsigned long end = addr + get_vm_area_size(area);
int err;
 
err = vmap_page_range(addr, end, prot, *pages);
@@ -1605,7 +1605,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
unsigned int nr_pages, array_size, i;
gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 
-   nr_pages = (area->size - PAGE_SIZE) >> PAGE_SHIFT;
+   nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
array_size = (nr_pages * sizeof(struct page *));
 
area->nr_pages = nr_pages;
@@ -2037,7 +2037,7 @@ long vread(char *buf, char *addr, unsigned long count)
 
vm = va->vm;
vaddr = (char *) vm->addr;
-   if (addr >= vaddr + vm->size - PAGE_SIZE)
+   if (addr >= vaddr + get_vm_area_size(vm))
continue;
while (addr < vaddr) {
if (count == 0)
@@ -2047,7 +2047,7 @@ long vread(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
-   n = vaddr + vm->size - PAGE_SIZE - addr;
+   n = vaddr + get_vm_area_size(vm) - addr;
if (n > count)
n = count;
if (!(vm->flags & VM_IOREMAP))
@@ -2119,7 +2119,7 @@ long vwrite(char *buf, char *addr, unsigned long count)
 
vm = va->vm;
vaddr = (char *) vm->addr;
-   if (addr >= vaddr + vm->size - PAGE_SIZE)
+   if (addr >= vaddr + get_vm_area_size(vm))
continue;
while (addr < vaddr) {
if (count == 0)
@@ -2128,7 +2128,7 @@ long vwrite(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
-   n = vaddr + vm->size - PAGE_SIZE - addr;
+   n = vaddr + get_vm_area_size(vm) - addr;
if (n > count)
n = count;
if (!(vm->flags & VM_IOREMAP)) {

[Devel] [PATCH RHEL7 COMMIT] ms/compiler-gcc: integrate the various compiler-gcc[345].h files

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit bd1e2b9bde2a2ec95dbafa3dcf17a29dd3acd63e
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:30 2015 +0400

ms/compiler-gcc: integrate the various compiler-gcc[345].h files

https://jira.sw.ru/browse/PSBM-26429

From: Joe Perches 

commit cb984d101b30eb7478d32df56a0023e4603cba7f upstream.

As gcc major version numbers are going to advance rather rapidly in the
future, there's no real value in separate files for each compiler
version.

Deduplicate some of the macros #defined in each file too.

Neaten comments using normal kernel commenting style.
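
With the per-version headers folded into one file, every feature test keys off a single numeric GCC_VERSION. The kernel header is understood to compute it as major * 10000 + minor * 100 + patchlevel; the sketch below assumes that encoding, and all TOY_/toy_ names are made up for illustration:

```c
/* Assumed encoding: major * 10000 + minor * 100 + patchlevel. */
#define TOY_GCC_VERSION_CODE(major, minor, patch) \
	((major) * 10000 + (minor) * 100 + (patch))

#if defined(__GNUC__)
# define TOY_GCC_VERSION \
	TOY_GCC_VERSION_CODE(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__)
#else
# define TOY_GCC_VERSION 0
#endif

/* One numeric comparison replaces a separate header per major version. */
#if TOY_GCC_VERSION >= 40500
# define toy_unreachable() __builtin_unreachable()
#else
# define toy_unreachable() do { } while (1)
#endif

/* Returns -1, 0 or 1; the final statement is never reached. */
int toy_sign(int x)
{
	if (x < 0)
		return -1;
	if (x >= 0)
		return x > 0;
	toy_unreachable();
}
```

Under this encoding gcc 4.3.0 maps to 40300 and gcc 4.5.0 to 40500, which is why the checks in the diff compare against five-digit constants.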

Signed-off-by: Joe Perches 
Cc: Andi Kleen 
Cc: Michal Marek 
Cc: Segher Boessenkool 
Cc: Sasha Levin 
Cc: Anton Blanchard 
Cc: Alan Modra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/compiler-gcc.h  | 120 --
 include/linux/compiler-gcc3.h |  23 
 include/linux/compiler-gcc4.h |  92 
 include/linux/compiler-gcc5.h |  67 ---
 4 files changed, 116 insertions(+), 186 deletions(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 24545cd..0c5d746 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -97,10 +97,122 @@
 #define __maybe_unused __attribute__((unused))
 #define __always_unused __attribute__((unused))
 
-#define __gcc_header(x) #x
-#define _gcc_header(x) __gcc_header(linux/compiler-gcc##x.h)
-#define gcc_header(x) _gcc_header(x)
-#include gcc_header(__GNUC__)
+/* gcc version specific checks */
+
+#if GCC_VERSION < 30200
+# error Sorry, your compiler is too old - please upgrade it.
+#endif
+
+#if GCC_VERSION < 30300
+# define __used __attribute__((__unused__))
+#else
+# define __used __attribute__((__used__))
+#endif
+
+#ifdef CONFIG_GCOV_KERNEL
+# if GCC_VERSION < 30400
+#   error "GCOV profiling support for gcc versions below 3.4 not included"
+# endif /* __GNUC_MINOR__ */
+#endif /* CONFIG_GCOV_KERNEL */
+
+#if GCC_VERSION >= 30400
+#define __must_check   __attribute__((warn_unused_result))
+#endif
+
+#if GCC_VERSION >= 40000
+
+/* GCC 4.1.[01] miscompiles __weak */
+#ifdef __KERNEL__
+# if GCC_VERSION >= 40100 &&  GCC_VERSION <= 40101
+#  error Your version of gcc miscompiles the __weak directive
+# endif
+#endif
+
+#define __used __attribute__((__used__))
+#define __compiler_offsetof(a, b)  \
+   __builtin_offsetof(a, b)
+
+#if GCC_VERSION >= 40100 && GCC_VERSION < 40600
+# define __compiletime_object_size(obj) __builtin_object_size(obj, 0)
+#endif
+
+#if GCC_VERSION >= 40300
+/* Mark functions as cold. gcc will assume any path leading to a call
+ * to them will be unlikely.  This means a lot of manual unlikely()s
+ * are unnecessary now for any paths leading to the usual suspects
+ * like BUG(), printk(), panic() etc. [but let's keep them for now for
+ * older compilers]
+ *
+ * Early snapshots of gcc 4.3 don't support this and we can't detect this
+ * in the preprocessor, but we can live with this because they're unreleased.
+ * Maketime probing would be overkill here.
+ *
+ * gcc also has a __attribute__((__hot__)) to move hot functions into
+ * a special section, but I don't see any sense in this right now in
+ * the kernel context
+ */
+#define __cold __attribute__((__cold__))
+
+#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
+
+#ifndef __CHECKER__
+# define __compiletime_warning(message) __attribute__((warning(message)))
+# define __compiletime_error(message) __attribute__((error(message)))
+#endif /* __CHECKER__ */
+#endif /* GCC_VERSION >= 40300 */
+
+#if GCC_VERSION >= 40500
+/*
+ * Mark a position in code as unreachable.  This can be used to
+ * suppress control flow warnings after asm blocks that transfer
+ * control elsewhere.
+ *
+ * Early snapshots of gcc 4.5 don't support this and we can't detect
+ * this in the preprocessor, but we can live with this because they're
+ * unreleased.  Really, we need to have autoconf for the kernel.
+ */
+#define unreachable() __builtin_unreachable()
+
+/* Mark a function definition as prohibited from being cloned. */
+#define __noclone  __attribute__((__noclone__))
+
+#endif /* GCC_VERSION >= 40500 */
+
+#if

[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Flush TLBs after switching CR3

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit f1796a8d4debf66ae569701aeaf5e739661808c2
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:51 2015 +0400

ms/x86/kasan: Flush TLBs after switching CR3

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 241d2c54c62fa0939fc9a9512b48ac3434e90a89 upstream.

load_cr3() doesn't cause a TLB flush if PGE is enabled, because global
pages survive a CR3 write.

Stale translations may then cause tons of false positive reports,
spamming the kernel to death.

To fix this, __flush_tlb_all() should be called explicitly
after CR3 is changed.

Signed-off-by: Andrey Ryabinin 
Cc:  # 4.0+
Cc: Alexander Popov 
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1435828178-10975-4-git-send-email-a.ryabi...@samsung.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/x86/mm/kasan_init_64.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index ad0b931..0ada6cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -208,6 +208,7 @@ void __init kasan_init(void)
 
memcpy(early_level4_pgt, init_level4_pgt, sizeof(early_level4_pgt));
load_cr3(early_level4_pgt);
+   __flush_tlb_all();
 
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
@@ -234,5 +235,6 @@ void __init kasan_init(void)
memset(kasan_zero_page, 0, PAGE_SIZE);
 
load_cr3(init_level4_pgt);
+   __flush_tlb_all();
init_task.kasan_depth = 0;
 }

[Devel] [PATCH RHEL7 COMMIT] ms/module: fix types of device tables aliases

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit db7ae5a5dcbe87b199efdb784074ff3597d06d42
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:45 2015 +0400

ms/module: fix types of device tables aliases

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 6301939d97d079f0d3dbe71e750f4daf5d39fc33 upstream.

The MODULE_DEVICE_TABLE() macro is used to create aliases to device tables.
Normally an alias should have the same type as the aliased symbol.

Device tables are arrays, so they have 'struct type##_device_id[x]'
types. An alias created by MODULE_DEVICE_TABLE() will have a non-array type,
'struct type##_device_id'.

This inconsistency confuses the compiler: it can make a wrong assumption
about the variable's size, which leads KASan to produce a false positive
report about an out-of-bounds access.

For every global variable the compiler calls __asan_register_globals(),
passing information about the global variable (address, size, size with
redzone, name ...). __asan_register_globals() poisons the symbol's redzone
to detect possible out-of-bounds accesses.

When a symbol has an alias, __asan_register_globals() will be called for
both the symbol and the alias.  The compiler determines the size of a
variable from the size of the variable's type.  The alias and the symbol
have the same address, so if the alias has the wrong size, part of the
memory that actually belongs to the symbol could be poisoned as the redzone
of the alias symbol.

By fixing the type of the alias symbol we fix its size, so
__asan_register_globals() will not poison valid memory.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/module.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index c3b88d6..40bb478 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -84,7 +84,7 @@ void trim_init_extable(struct module *m);
 
 #ifdef MODULE
 #define MODULE_GENERIC_TABLE(gtype,name)   \
-extern const struct gtype##_id __mod_##gtype##_table   \
+extern const typeof(name) __mod_##gtype##_table \
   __attribute__ ((unused, alias(__stringify(name))))
 
 #else  /* !MODULE */
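
The size mismatch described in the commit message can be reproduced in user space. The following is only a sketch: struct demo_device_id and demo_ids are hypothetical stand-ins for a real driver's table, and __typeof__ is the GNU spelling of the typeof() used in the patch.

```c
#include <stddef.h>

/* Hypothetical device-id type and table, standing in for a real driver's. */
struct demo_device_id {
	unsigned int vendor;
	unsigned int device;
};

static const struct demo_device_id demo_ids[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
	{ 0, 0 },		/* terminating entry */
};

/* Size implied by the old alias type: a single element. */
size_t old_alias_size(void)
{
	return sizeof(struct demo_device_id);
}

/* Size implied by the new alias type: typeof() keeps the array type. */
size_t new_alias_size(void)
{
	return sizeof(__typeof__(demo_ids));
}

/* Real size of the table that the alias refers to. */
size_t table_size(void)
{
	return sizeof(demo_ids);
}
```

With the old type the alias appears to cover one entry, so a redzone placed right after it would land inside the real table. With typeof(name) both sizes agree.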

[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: introduce metadata_access_enable()/metadata_access_disable()

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 5a39d8752462593deb7d24f021f2b5fe5956ad8f
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:39 2015 +0400

ms/mm: slub: introduce metadata_access_enable()/metadata_access_disable()

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit a79316c6178ca419e35feef47d47f50b4e0ee9f2 upstream.

It's ok for slub to access memory that is marked by kasan as inaccessible
(object's metadata).  Kasan shouldn't print a report in that case because
these accesses are valid.  Disabling instrumentation of slub.c code is not
enough to achieve this because slub passes pointers to an object's metadata
into external functions like memchr_inv().

We don't want to disable instrumentation for memchr_inv() because it is
quite a generic function, and we don't want to miss bugs.

metadata_access_enable()/metadata_access_disable() are used to tell KASan
where accesses to metadata start and end, so we can temporarily disable
KASan reports.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 mm/slub.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 306cfc4..d775ccb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -464,12 +465,30 @@ static char *slub_debug_slabs;
 static int disable_higher_order_debug;
 
 /*
+ * slub is about to manipulate internal object metadata.  This memory lies
+ * outside the range of the allocated object, so accessing it would normally
+ * be reported by kasan as a bounds error.  metadata_access_enable() is used
+ * to tell kasan that these accesses are OK.
+ */
+static inline void metadata_access_enable(void)
+{
+   kasan_disable_current();
+}
+
+static inline void metadata_access_disable(void)
+{
+   kasan_enable_current();
+}
+
+/*
  * Object debugging
  */
 static void print_section(char *text, u8 *addr, unsigned int length)
 {
+   metadata_access_enable();
print_hex_dump(KERN_ERR, text, DUMP_PREFIX_ADDRESS, 16, 1, addr,
length, 1);
+   metadata_access_disable();
 }
 
 static struct track *get_track(struct kmem_cache *s, void *object,
@@ -499,7 +518,9 @@ static void set_track(struct kmem_cache *s, void *object,
trace.max_entries = TRACK_ADDRS_COUNT;
trace.entries = p->addrs;
trace.skip = 3;
+   metadata_access_enable();
save_stack_trace(&trace);
+   metadata_access_disable();
 
/* See rant in lockdep.c */
if (trace.nr_entries != 0 &&
@@ -672,7 +693,9 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
u8 *fault;
u8 *end;
 
+   metadata_access_enable();
fault = memchr_inv(start, value, bytes);
+   metadata_access_disable();
if (!fault)
return 1;
 
@@ -765,7 +788,9 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page)
if (!remainder)
return 1;
 
+   metadata_access_enable();
fault = memchr_inv(end - remainder, POISON_INUSE, remainder);
+   metadata_access_disable();
if (!fault)
return 1;
while (end > fault && end[-1] == POISON_INUSE)
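
The enable/disable bracket used in this patch can be modelled with a nesting counter. This is a toy sketch, not the kernel code: kasan_depth stands in for a per-task counter, demo_report_bad_access() is a hypothetical report hook, and it is an assumption here that reports are dropped whenever the counter is non-zero.

```c
#include <stdbool.h>

/* Stand-in for a per-task depth counter; reports are dropped while non-zero. */
static int kasan_depth;
static int reports_printed;

/* Toy versions of the helpers the patch calls. */
void kasan_disable_current(void) { kasan_depth++; }
void kasan_enable_current(void)  { kasan_depth--; }

/* Same wrappers as in the patch, on top of the toy helpers. */
void metadata_access_enable(void)  { kasan_disable_current(); }
void metadata_access_disable(void) { kasan_enable_current(); }

/* Hypothetical report hook: returns true only if a report was printed. */
bool demo_report_bad_access(void)
{
	if (kasan_depth)
		return false;
	reports_printed++;
	return true;
}

int demo_reports_count(void)
{
	return reports_printed;
}

/* Simulates slub reading object metadata through a generic helper. */
void demo_touch_metadata(void)
{
	metadata_access_enable();
	demo_report_bad_access();	/* suppressed: would be a false positive */
	metadata_access_disable();
}
```

Because the helpers count rather than toggle, nested enable/disable pairs keep reports suppressed until the outermost pair is closed.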

[Devel] [PATCH RHEL7 COMMIT] ms/kasan, module: move MODULE_ALIGN macro into <linux/moduleloader.h>

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 5ad7c91c09ff6978e0c5df0fd4b25e7baa6eee87
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:47 2015 +0400

ms/kasan, module: move MODULE_ALIGN macro into <linux/moduleloader.h>

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit d3733e5c98e952d419e77fa721912f09d15a2806 upstream.

include/linux/moduleloader.h is a more suitable place for this macro.
Also change the alignment to PAGE_SIZE for CONFIG_KASAN=n, as such
alignment is already assumed in several places.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Acked-by: Rusty Russell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h| 4 
 include/linux/moduleloader.h | 7 +++
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 5fa48a2..5bb0744 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -50,15 +50,11 @@ void kasan_krealloc(const void *object, size_t new_size);
 void kasan_slab_alloc(struct kmem_cache *s, void *object);
 void kasan_slab_free(struct kmem_cache *s, void *object);
 
-#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
-
 int kasan_module_alloc(void *addr, size_t size);
 void kasan_free_shadow(const struct vm_struct *vm);
 
 #else /* CONFIG_KASAN */
 
-#define MODULE_ALIGN 1
-
 static inline void kasan_unpoison_shadow(const void *address, size_t size) {}
 
 static inline void kasan_enable_current(void) {}
diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 560ca53..8405769 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -80,4 +80,11 @@ int module_finalize(const Elf_Ehdr *hdr,
 /* Any cleanup needed when module leaves. */
 void module_arch_cleanup(struct module *mod);
 
+#ifdef CONFIG_KASAN
+#include 
+#define MODULE_ALIGN (PAGE_SIZE << KASAN_SHADOW_SCALE_SHIFT)
+#else
+#define MODULE_ALIGN PAGE_SIZE
+#endif
+
 #endif
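
A short sketch of why the KASAN variant shifts PAGE_SIZE: if one shadow byte covers 2^KASAN_SHADOW_SCALE_SHIFT bytes, then aligning a module to PAGE_SIZE << shift puts the start of its shadow on a page boundary. The constants below (4096-byte pages, shift of 3) are assumptions for illustration, and the DEMO_ names are hypothetical.

```c
#define DEMO_PAGE_SIZE			4096UL	/* assumed page size */
#define DEMO_KASAN_SHADOW_SCALE_SHIFT	3	/* assumed: 8 bytes per shadow byte */

/* Mirrors the two MODULE_ALIGN definitions added by the patch. */
unsigned long demo_module_align(int kasan_enabled)
{
	if (kasan_enabled)
		return DEMO_PAGE_SIZE << DEMO_KASAN_SHADOW_SCALE_SHIFT;
	return DEMO_PAGE_SIZE;
}

/* Offset into the shadow region for a given offset into module space. */
unsigned long demo_shadow_offset(unsigned long offset)
{
	return offset >> DEMO_KASAN_SHADOW_SCALE_SHIFT;
}
```

Under these assumptions, any module offset that is a multiple of the KASAN alignment maps to a shadow offset that is a multiple of the page size, so shadow for a module can be mapped in whole pages.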

[Devel] [PATCH RHEL7 COMMIT] ms/kernel: add support for .init_array.* constructors

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 498baa3fdd64742725539c281086d23aee327fa4
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:44 2015 +0400

ms/kernel: add support for .init_array.* constructors

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 9ddf82521c86ae07af79dbe5a93c52890f2bab23 upstream.

KASan uses constructors for initializing redzones for global variables.
Globals instrumentation in GCC 4.9.2 produces constructors with priority
(.init_array.00099).

Currently the kernel ignores such constructors.  Only constructors with
default priority are supported (.init_array).

This patch adds support for constructors with priorities.  For the kernel
image we put pointers to constructors between __ctors_start/__ctors_end
and do_ctors() will call them on start up.  For modules we merge
.init_array.* sections into the resulting .init_array.  Module code properly
handles constructors in the .init_array section.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/asm-generic/vmlinux.lds.h | 1 +
 scripts/module-common.lds | 4 
 2 files changed, 5 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 72e4edc..5c90355 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -481,6 +481,7 @@
 #define KERNEL_CTORS() . = ALIGN(8);  \
VMLINUX_SYMBOL(__ctors_start) = .; \
*(.ctors)  \
+   *(SORT(.init_array.*)) \
*(.init_array) \
VMLINUX_SYMBOL(__ctors_end) = .;
 #else
diff --git a/scripts/module-common.lds b/scripts/module-common.lds
index 0865b3e..10fa8bf 100644
--- a/scripts/module-common.lds
+++ b/scripts/module-common.lds
@@ -16,4 +16,8 @@ SECTIONS {
__kcrctab_unused: { *(SORT(___kcrctab_unused+*)) }
__kcrctab_unused_gpl: { *(SORT(___kcrctab_unused_gpl+*)) }
__kcrctab_gpl_future: { *(SORT(___kcrctab_gpl_future+*)) }
+
+
+   . = ALIGN(8);
+   .init_array 0 : { *(SORT(.init_array.*)) *(.init_array) }
 }
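
The effect of placing SORT(.init_array.*) ahead of .init_array can be observed in user space. This is a sketch that assumes a GCC or Clang ELF toolchain, where prioritized constructors are emitted into .init_array.NNNNN sections; the function names and the recording helpers are hypothetical.

```c
#include <stddef.h>

#define MAX_CALLS 8

static int call_order[MAX_CALLS];
static size_t nr_calls;

/* Remember the order in which the constructors were invoked. */
static void record(int id)
{
	if (nr_calls < MAX_CALLS)
		call_order[nr_calls++] = id;
}

/* Constructors with an explicit priority (0..100 are reserved). */
__attribute__((constructor(101))) static void ctor_prio_101(void) { record(101); }
__attribute__((constructor(102))) static void ctor_prio_102(void) { record(102); }

/* Constructor with the default priority, recorded as 0. */
__attribute__((constructor)) static void ctor_default(void) { record(0); }

size_t ctor_calls(void)
{
	return nr_calls;
}

int ctor_call_at(size_t i)
{
	return i < nr_calls ? call_order[i] : -1;
}
```

On such a toolchain the prioritized constructors run in ascending priority order, and the default-priority one runs after them. That is the same ordering the linker script change gives the kernel image.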

[Devel] [PATCH RHEL7 COMMIT] ms/mm: vmalloc: add flag preventing guard hole allocation

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 8db2f73889dbd2a488309474ddc1783d3228a40e
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:43 2015 +0400

ms/mm: vmalloc: add flag preventing guard hole allocation

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 71394fe50146202f2c8d92cf50f5ebc761acf254 upstream.

To instrument global variables, KASan will shadow the memory backing
modules.  So on module loading we will need to allocate memory for the
shadow and map it at the shadow address that corresponds to the address
allocated in module_alloc().

__vmalloc_node_range() could be used for this purpose, except it puts a
guard hole after the allocated area.  A guard hole in shadow memory would
be a problem because at some future point we might need to have shadow
memory at the address occupied by the guard hole.  So we could fail to
allocate shadow for module_alloc().

Add a new vm_struct flag 'VM_NO_GUARD' indicating that a vm area doesn't
have a guard hole.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/vmalloc.h | 9 +++--
 mm/vmalloc.c| 6 ++
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index dd0a2c8..00b9b15 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -16,6 +16,7 @@ struct vm_area_struct; /* vma defining user mapping in mm_types.h */
 #define VM_USERMAP 0x0008  /* suitable for remap_vmalloc_range */
 #define VM_VPAGES  0x0010  /* buffer for pages was vmalloc'ed */
 #define VM_UNLIST  0x0020  /* vm_struct is not listed in vmlist */
+#define VM_NO_GUARD 0x0040  /* don't add guard page */
 /* bits [20..32] reserved for arch specific ioremap internals */
 
 /*
@@ -96,8 +97,12 @@ void vmalloc_sync_all(void);
 
 static inline size_t get_vm_area_size(const struct vm_struct *area)
 {
-   /* return actual size without guard page */
-   return area->size - PAGE_SIZE;
+   if (!(area->flags & VM_NO_GUARD))
+   /* return actual size without guard page */
+   return area->size - PAGE_SIZE;
+   else
+   return area->size;
+
 }
 
 extern struct vm_struct *get_vm_area(unsigned long size, unsigned long flags);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0c531e1..7a0addf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1356,10 +1356,8 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
if (unlikely(!area))
return NULL;
 
-   /*
-* We always allocate a guard page.
-*/
-   size += PAGE_SIZE;
+   if (!(flags & VM_NO_GUARD))
+   size += PAGE_SIZE;
 
va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
if (IS_ERR(va)) {
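
The two hunks have to stay in step: the guard page is added to the reserved size only when the flag is absent, and subtracted again only when the flag is absent. A minimal user-space sketch of that bookkeeping follows; the demo_ names, the 4096-byte page size and the 0x40 flag value are assumptions for illustration.

```c
#define DEMO_PAGE_SIZE		4096UL	/* assumed page size */
#define DEMO_VM_NO_GUARD	0x40UL	/* illustrative flag bit */

struct demo_vm_struct {
	unsigned long size;	/* reserved size, guard page included if any */
	unsigned long flags;
};

/* Mirrors the size adjustment in __get_vm_area_node(). */
unsigned long demo_reserved_size(unsigned long size, unsigned long flags)
{
	if (!(flags & DEMO_VM_NO_GUARD))
		size += DEMO_PAGE_SIZE;
	return size;
}

/* Mirrors get_vm_area_size(): usable size without the guard page. */
unsigned long demo_get_vm_area_size(const struct demo_vm_struct *area)
{
	if (!(area->flags & DEMO_VM_NO_GUARD))
		return area->size - DEMO_PAGE_SIZE;
	return area->size;
}
```

In both cases the usable size reported back equals the size originally requested; only the amount of address space reserved differs.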

[Devel] [PATCH RHEL7 COMMIT] ms/fs: dcache: manually unpoison dname after allocation to shut up kasan's reports

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 6b444b2466dfe34ee64bf03a05c9e8a85c581f0a
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:40 2015 +0400

ms/fs: dcache: manually unpoison dname after allocation to shut up kasan's reports

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit df4c0e36f1b1782b0611a77c52cc240e5c4752dd upstream.

We need to manually unpoison the rounded-up allocation size for dname to
avoid kasan's reports in dentry_string_cmp().  When
CONFIG_DCACHE_WORD_ACCESS=y, dentry_string_cmp() may access a few bytes
beyond the size requested in kmalloc().

dentry_string_cmp() relies on the fact that the dentry name is allocated
using kmalloc() and that kmalloc() internally rounds up the allocation
size.  So this is not a bug, but it makes kasan complain about such
accesses.  To avoid such reports we mark the rounded-up allocation size in
shadow as accessible.

Signed-off-by: Andrey Ryabinin 
Reported-by: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 fs/dcache.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index a341efe..a4f60d1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -43,6 +44,7 @@
 #include "internal.h"
 #include "mount.h"
 
+
 /*
  * Usage:
  * dcache->d_inode->i_lock protects:
@@ -1550,6 +1552,11 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
kmem_cache_free(dentry_cache, dentry); 
return NULL;
}
+   if (IS_ENABLED(CONFIG_DCACHE_WORD_ACCESS))
+   kasan_unpoison_shadow(dname,
+   round_up(name->len + 1,
+   sizeof(unsigned long)));
+
} else  {
dname = dentry->d_iname;
}   
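
The size passed to kasan_unpoison_shadow() is the name length plus its terminating NUL, rounded up to a whole number of machine words. A small sketch of that arithmetic follows; demo_round_up() and dname_unpoison_size() are hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

/* Round x up to a multiple of align; align must be a power of two. */
size_t demo_round_up(size_t x, size_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Bytes to mark accessible for a name of name_len characters plus NUL. */
size_t dname_unpoison_size(size_t name_len)
{
	return demo_round_up(name_len + 1, sizeof(unsigned long));
}
```

A word-at-a-time comparison reads whole unsigned longs, so the last read can extend past the NUL but never past the rounded-up size computed here.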

[Devel] [PATCH RHEL7 COMMIT] ms/mm, mempool: poison elements backed by slab allocator

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit bbeaa6232872bec76a69e7cb6b41606f1cf61ad3
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:48 2015 +0400

ms/mm, mempool: poison elements backed by slab allocator

https://jira.sw.ru/browse/PSBM-26429

From: David Rientjes 

commit bdfedb76f4f5aa5e37380e3b71adee4a39f30fc6 upstream.

Mempools keep elements in a reserved pool for contexts in which allocation
may not be possible.  When an element is allocated from the reserved pool,
its memory contents are the same as when it was added to the reserved pool.

Because of this, elements lack any free poisoning to detect use-after-free
errors.

This patch adds free poisoning for elements backed by the slab allocator.
This is possible because the mempool layer knows the object size of each
element.

When an element is added to the reserved pool, it is poisoned with
POISON_FREE.  When it is removed from the reserved pool, the contents are
checked for POISON_FREE.  If there is a mismatch, a warning is emitted to
the kernel log.

This is only effective for configs with CONFIG_DEBUG_SLAB or
CONFIG_SLUB_DEBUG_ON.

[fabio.este...@freescale.com: use '%zu' for printing 'size_t' variable]
[a...@arndb.de: add missing include]
Signed-off-by: David Rientjes 
Cc: Dave Kleikamp 
Cc: Christoph Hellwig 
Cc: Sebastian Ott 
Cc: Mikulas Patocka 
Cc: Catalin Marinas 
Signed-off-by: Fabio Estevam 
Signed-off-by: Arnd Bergmann 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 mm/mempool.c | 94 ++--
 1 file changed, 92 insertions(+), 2 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 5499047..db146ad 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -6,25 +6,115 @@
  *  extreme VM load.
  *
  *  started by Ingo Molnar, Copyright (C) 2001
+ *  debugging by David Rientjes, Copyright (C) 2015
  */
 
 #include 
 #include 
+
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 
+#if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_SLUB_DEBUG_ON)
+static void poison_error(mempool_t *pool, void *element, size_t size,
+size_t byte)
+{
+   const int nr = pool->curr_nr;
+   const int start = max_t(int, byte - (BITS_PER_LONG / 8), 0);
+   const int end = min_t(int, byte + (BITS_PER_LONG / 8), size);
+   int i;
+
+   pr_err("BUG: mempool element poison mismatch\n");
+   pr_err("Mempool %p size %zu\n", pool, size);
+   pr_err(" nr=%d @ %p: %s0x", nr, element, start > 0 ? "... " : "");
+   for (i = start; i < end; i++)
+   pr_cont("%x ", *(u8 *)(element + i));
+   pr_cont("%s\n", end < size ? "..." : "");
+   dump_stack();
+}
+
+static void __check_element(mempool_t *pool, void *element, size_t size)
+{
+   u8 *obj = element;
+   size_t i;
+
+   for (i = 0; i < size; i++) {
+   u8 exp = (i < size - 1) ? POISON_FREE : POISON_END;
+
+   if (obj[i] != exp) {
+   poison_error(pool, element, size, i);
+   return;
+   }
+   }
+   memset(obj, POISON_INUSE, size);
+}
+
+static void check_element(mempool_t *pool, void *element)
+{
+   /* Mempools backed by slab allocator */
+   if (pool->free == mempool_free_slab || pool->free == mempool_kfree)
+   __check_element(pool, element, ksize(element));
+
+   /* Mempools backed by page allocator */
+   if (pool->free == mempool_free_pages) {
+   int order = (int)(long)pool->pool_data;
+   void *addr = kmap_atomic((struct page *)element);
+
+   __check_element(pool, addr, 1UL << (PAGE_SHIFT + order));
+   kunmap_atomic(addr);
+   }
+}
+
+static void __poison_element(void *element, size_t size)
+{
+   u8 *obj = element;
+
+   memset(obj, POISON_FREE, size - 1);
+   obj[size - 1] = POISON_END;
+}
+
+static void poison_element(mempool_t *pool, void *element)
+{
+   /* Mempools backed by slab allocator */
+   if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
+   __poison_element(element, ksize(element));
+
+   /* Mempools backed by page allocator */
+   if (pool->alloc == mempool_alloc_pages) {
+   int order = (int)(long)pool->pool_data;
+   void
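
The poison-and-check cycle described in the commit message can be sketched in user space as follows. The demo_ functions are hypothetical simplifications of __poison_element() and __check_element(): instead of printing a warning, the check returns the offset of the first mismatching byte. The poison byte values are assumed to match the kernel's include/linux/poison.h.

```c
#include <stddef.h>
#include <string.h>

/* Assumed to match include/linux/poison.h. */
#define POISON_INUSE	0x5a
#define POISON_FREE	0x6b
#define POISON_END	0xa5

/* Fill an element with the free pattern when it enters the reserve. */
void demo_poison_element(void *element, size_t size)
{
	unsigned char *obj = element;

	memset(obj, POISON_FREE, size - 1);
	obj[size - 1] = POISON_END;
}

/*
 * Verify the free pattern when an element leaves the reserve.
 * Returns -1 if intact (and marks the element in use), otherwise the
 * offset of the first corrupted byte.
 */
long demo_check_element(void *element, size_t size)
{
	unsigned char *obj = element;
	size_t i;

	for (i = 0; i < size; i++) {
		unsigned char exp = (i < size - 1) ? POISON_FREE : POISON_END;

		if (obj[i] != exp)
			return (long)i;
	}
	memset(obj, POISON_INUSE, size);
	return -1;
}
```

Any write to the element while it sits in the reserve changes at least one byte of the pattern, which is how a use-after-free shows up at removal time.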

[Devel] [PATCH RHEL7 COMMIT] ms/mm/mempool.c: kasan: poison mempool elements

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 185298f11666838595fb5a2574231e5248178256
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:48 2015 +0400

ms/mm/mempool.c: kasan: poison mempool elements

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 923936157b158f36bd6a3d86496dce82b1a957de upstream.

Mempools keep allocated objects in reserve for situations when an ordinary
allocation may not be possible to satisfy.  These objects shouldn't be
accessed before they leave the pool.

This patch poisons elements when they get into the pool and unpoisons them
when they leave it.  This lets KASan detect use-after-free of a mempool's
elements.

Signed-off-by: Andrey Ryabinin 
Tested-by: David Rientjes 
Cc: Catalin Marinas 
Cc: Dmitry Chernenkov 
Cc: Dmitry Vyukov 
Cc: Alexander Potapenko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h |  2 ++
 mm/kasan/kasan.c  | 13 +
 mm/mempool.c  | 23 +++
 3 files changed, 38 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 5bb0744..5486d77 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -44,6 +44,7 @@ void kasan_poison_object_data(struct kmem_cache *cache, void *object);
 
 void kasan_kmalloc_large(const void *ptr, size_t size);
 void kasan_kfree_large(const void *ptr);
+void kasan_kfree(void *ptr);
 void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size);
 void kasan_krealloc(const void *object, size_t new_size);
 
@@ -71,6 +72,7 @@ static inline void kasan_poison_object_data(struct kmem_cache *cache,
 
 static inline void kasan_kmalloc_large(void *ptr, size_t size) {}
 static inline void kasan_kfree_large(const void *ptr) {}
+static inline void kasan_kfree(void *ptr) {}
 static inline void kasan_kmalloc(struct kmem_cache *s, const void *object,
size_t size) {}
 static inline void kasan_krealloc(const void *object, size_t new_size) {}
diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 936d816..6c513a6 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -389,6 +389,19 @@ void kasan_krealloc(const void *object, size_t size)
kasan_kmalloc(page->slab_cache, object, size);
 }
 
+void kasan_kfree(void *ptr)
+{
+   struct page *page;
+
+   page = virt_to_head_page(ptr);
+
+   if (unlikely(!PageSlab(page)))
+   kasan_poison_shadow(ptr, PAGE_SIZE << compound_order(page),
+   KASAN_FREE_PAGE);
+   else
+   kasan_slab_free(page->slab_cache, ptr);
+}
+
 void kasan_kfree_large(const void *ptr)
 {
struct page *page = virt_to_page(ptr);
diff --git a/mm/mempool.c b/mm/mempool.c
index db146ad..abf8243 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -13,6 +13,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -101,10 +102,31 @@ static inline void poison_element(mempool_t *pool, void *element)
 }
 #endif /* CONFIG_DEBUG_SLAB || CONFIG_SLUB_DEBUG_ON */
 
+static void kasan_poison_element(mempool_t *pool, void *element)
+{
+   if (pool->alloc == mempool_alloc_slab)
+   kasan_slab_free(pool->pool_data, element);
+   if (pool->alloc == mempool_kmalloc)
+   kasan_kfree(element);
+   if (pool->alloc == mempool_alloc_pages)
+   kasan_free_pages(element, (unsigned long)pool->pool_data);
+}
+
+static void kasan_unpoison_element(mempool_t *pool, void *element)
+{
+   if (pool->alloc == mempool_alloc_slab)
+   kasan_slab_alloc(pool->pool_data, element);
+   if (pool->alloc == mempool_kmalloc)
+   kasan_krealloc(element, (size_t)pool->pool_data);
+   if (pool->alloc == mempool_alloc_pages)
+   kasan_alloc_pages(element, (unsigned long)pool->pool_data);
+}
+
 static void add_element(mempool_t *pool, void *element)
 {
BUG_ON(pool->curr_nr >= pool->min_nr);
poison_element(pool, element);
+   kasan_poison_element(pool, element);
pool->elements[pool->curr_nr++] = element;
 }
 
@@ -114,6 +136,7 @@ static void *remove_element(mempool_t *pool)
 
BUG_ON(pool->curr_nr < 0);
check_element(pool, element);
+   kasan_unpoison_element(pool, element);
return element;
 }
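
The add/remove pairing above can be illustrated with a toy shadow map. This is a deliberately simplified model, not KASan's real shadow encoding: one shadow byte covers 8 bytes of a hypothetical 256-byte arena, zero means accessible, and SHADOW_FREE is an arbitrary non-zero marker. All demo_ names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SHADOW_SCALE_SHIFT	3	/* one shadow byte per 8 bytes */
#define ARENA_SIZE		256
#define SHADOW_FREE		0xFF	/* arbitrary "poisoned" marker */

/* Zero-initialised: the whole arena starts out accessible. */
static unsigned char shadow[ARENA_SIZE >> SHADOW_SCALE_SHIFT];

/* off and len are assumed to be multiples of 8 and inside the arena. */
static void demo_poison(size_t off, size_t len)
{
	memset(&shadow[off >> SHADOW_SCALE_SHIFT], SHADOW_FREE,
	       len >> SHADOW_SCALE_SHIFT);
}

static void demo_unpoison(size_t off, size_t len)
{
	memset(&shadow[off >> SHADOW_SCALE_SHIFT], 0,
	       len >> SHADOW_SCALE_SHIFT);
}

/* What an instrumented load or store would consult before the access. */
bool demo_access_ok(size_t off)
{
	return shadow[off >> SHADOW_SCALE_SHIFT] == 0;
}

/* Element enters the reserve: nobody should touch it until removal. */
void demo_add_element(size_t off, size_t len)
{
	demo_poison(off, len);
}

/* Element leaves the reserve: the caller may use it again. */
void demo_remove_element(size_t off, size_t len)
{
	demo_unpoison(off, len);
}
```

Unlike the byte-pattern check, which notices corruption only when the element is removed, a shadow check of this kind flags the offending access at the moment it happens.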
 

[Devel] [PATCH RHEL7 COMMIT] ms/lib: add kasan test module

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 86fbf39cbfaeefd815791b34628f6df8040b4d2f
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:41 2015 +0400

ms/lib: add kasan test module

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 3f15801cdc2379ca4bf507f48bffd788f9e508ae upstream.

This is a test module doing various nasty things like out of bounds
accesses, use after free.  It is useful for testing kernel debugging
features like kernel address sanitizer.

It mostly concentrates on testing of slab allocator, but we might want to
add more different stuff here in future (like stack/global variables out
of bounds accesses and so on).

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 lib/Kconfig.kasan |   8 ++
 lib/Makefile  |   1 +
 lib/test_kasan.c  | 277 ++
 3 files changed, 286 insertions(+)

diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index a11ac02..4d47d87 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -42,4 +42,12 @@ config KASAN_INLINE
 
 endchoice
 
+config TEST_KASAN
+   tristate "Module for testing kasan for bug detection"
+   depends on m && KASAN
+   help
+ This is a test module doing various nasty things like
+ out of bounds accesses, use after free. It is useful for testing
+ kernel debugging features like kernel address sanitizer.
+
 endif
diff --git a/lib/Makefile b/lib/Makefile
index 175face7..d5372fc 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -31,6 +31,7 @@ obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += kstrtox.o
 obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
+obj-$(CONFIG_TEST_KASAN) += test_kasan.o
 
 obj-y += kmapset.o
 
diff --git a/lib/test_kasan.c b/lib/test_kasan.c
new file mode 100644
index 000..098c08e
--- /dev/null
+++ b/lib/test_kasan.c
@@ -0,0 +1,277 @@
+/*
+ *
+ * Copyright (c) 2014 Samsung Electronics Co., Ltd.
+ * Author: Andrey Ryabinin 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#define pr_fmt(fmt) "kasan test: %s " fmt, __func__
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static noinline void __init kmalloc_oob_right(void)
+{
+   char *ptr;
+   size_t size = 123;
+
+   pr_info("out-of-bounds to right\n");
+   ptr = kmalloc(size, GFP_KERNEL);
+   if (!ptr) {
+   pr_err("Allocation failed\n");
+   return;
+   }
+
+   ptr[size] = 'x';
+   kfree(ptr);
+}
+
+static noinline void __init kmalloc_oob_left(void)
+{
+   char *ptr;
+   size_t size = 15;
+
+   pr_info("out-of-bounds to left\n");
+   ptr = kmalloc(size, GFP_KERNEL);
+   if (!ptr) {
+   pr_err("Allocation failed\n");
+   return;
+   }
+
+   *ptr = *(ptr - 1);
+   kfree(ptr);
+}
+
+static noinline void __init kmalloc_node_oob_right(void)
+{
+   char *ptr;
+   size_t size = 4096;
+
+   pr_info("kmalloc_node(): out-of-bounds to right\n");
+   ptr = kmalloc_node(size, GFP_KERNEL, 0);
+   if (!ptr) {
+   pr_err("Allocation failed\n");
+   return;
+   }
+
+   ptr[size] = 0;
+   kfree(ptr);
+}
+
+static noinline void __init kmalloc_large_oob_rigth(void)
+{
+   char *ptr;
+   size_t size = KMALLOC_MAX_CACHE_SIZE + 10;
+
+   pr_info("kmalloc large allocation: out-of-bounds to right\n");
+   ptr = kmalloc(size, GFP_KERNEL);
+   if (!ptr) {
+   pr_err("Allocation failed\n");
+   return;
+   }
+
+   ptr[size] =

[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: share object_err function

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c9f94e82e07bf5bafb4e1afa04875aa444a276f7
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:38 2015 +0400

ms/mm: slub: share object_err function

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 75c66def8d815201aa0386ecc7c66a5c8dbca1ee upstream.

Remove static and add function declarations to linux/slub_def.h so it
could be used by kernel address sanitizer.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/slub_def.h | 3 +++
 mm/slub.c                | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index bd48c92..89bcb9e 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -139,4 +139,7 @@ static inline void *virt_to_obj(struct kmem_cache *s,
return (void *)x - ((x - slab_page) % s->size);
 }
 
+void object_err(struct kmem_cache *s, struct page *page,
+   u8 *object, char *reason);
+
 #endif /* _LINUX_SLUB_DEF_H */
diff --git a/mm/slub.c b/mm/slub.c
index f39e69c..306cfc4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -625,7 +625,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
dump_stack();
 }
 
-static void object_err(struct kmem_cache *s, struct page *page,
+void object_err(struct kmem_cache *s, struct page *page,
u8 *object, char *reason)
 {
slab_bug(s, "%s", reason);

[Devel] [PATCH RHEL7 COMMIT] ms/x86_64: kasan: add interceptors for memset/memmove/memcpy functions

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 100caa44f8eb00f138f374132ff5137d85dc21da
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:42 2015 +0400

ms/x86_64: kasan: add interceptors for memset/memmove/memcpy functions

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 393f203f5fd54421fddb1e2a263f64d3876eeadb upstream.

Recently instrumentation of builtin functions calls was removed from GCC
5.0.  To check the memory accessed by such functions, userspace asan
always uses interceptors for them.

So now we should do this as well.  This patch declares
memset/memmove/memcpy as weak symbols.  In mm/kasan/kasan.c we have our
own implementation of those functions which checks memory before accessing
it.

Default memset/memmove/memcpy now always have aliases with a '__'
prefix.  For files that are built without kasan instrumentation (e.g.
mm/slub.c), the original mem* functions are replaced (via #define) with the
prefixed variants, because we don't want to check memory accesses there.
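
The interceptor idea can be sketched in userspace C.  This is only an
illustration under assumptions: the arena, the one-byte-per-granule shadow
array and the names raw_memcpy(), range_ok() and checked_memcpy() are
hypothetical stand-ins, not the code in mm/kasan/kasan.c.  The checked
variant validates both ranges against the shadow and then defers to the
unchecked copy, which plays the role of the '__'-prefixed alias.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE    8			/* bytes covered by one shadow byte */
#define ARENA_SIZE 64			/* toy stand-in for kernel memory */

static unsigned char arena[ARENA_SIZE];
static uint8_t shadow[ARENA_SIZE / GRANULE];	/* 0 = accessible, else poisoned */
static int reports;			/* number of invalid accesses detected */

/* Unchecked copy; plays the role of the '__'-prefixed alias. */
static void *raw_memcpy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (len--)
		*d++ = *s++;
	return dst;
}

/* Return 1 if every byte of [p, p + len) lies in an accessible granule. */
static int range_ok(const void *p, size_t len)
{
	size_t off = (size_t)((const unsigned char *)p - arena);
	size_t i;

	/* The sketch only understands pointers into the toy arena. */
	assert(off <= ARENA_SIZE && len <= ARENA_SIZE - off);

	for (i = 0; i < len; i++)
		if (shadow[(off + i) / GRANULE])
			return 0;
	return 1;
}

/* Checked copy: validate both ranges, then defer to the unchecked alias. */
static void *checked_memcpy(void *dst, const void *src, size_t len)
{
	if (!range_ok(src, len) || !range_ok(dst, len))
		reports++;
	return raw_memcpy(dst, src, len);
}
```

Code that must not be checked would call raw_memcpy() directly, which is
roughly what the #define replacement achieves for uninstrumented files.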

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/x86/boot/compressed/eboot.c |  5 +++--
 arch/x86/boot/compressed/misc.h  |  1 +
 arch/x86/include/asm/string_64.h | 18 +-
 arch/x86/kernel/x8664_ksyms_64.c | 10 --
 arch/x86/lib/memcpy_64.S         |  6 --
 arch/x86/lib/memmove_64.S        |  4 
 arch/x86/lib/memset_64.S         | 10 ++
 mm/kasan/kasan.c                 | 29 +
 8 files changed, 72 insertions(+), 11 deletions(-)

diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
index dd94e98..dc3694d 100644
--- a/arch/x86/boot/compressed/eboot.c
+++ b/arch/x86/boot/compressed/eboot.c
@@ -7,6 +7,9 @@
  *
  * --- */
 
+#include "misc.h"
+#include 
+#include "../string.h"
 #include 
 #include 
 #include 
@@ -14,8 +17,6 @@
 #include 
 #include 
 
-#undef memcpy  /* Use memcpy from misc.c */
-
 #include "eboot.h"
 
 static efi_system_table_t *sys_table;
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 674019d..768b889 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -7,6 +7,7 @@
  * we just keep it from happening
  */
 #undef CONFIG_PARAVIRT
+#undef CONFIG_KASAN
 #ifdef CONFIG_X86_32
 #define _ASM_X86_DESC_H 1
 #endif
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..e466119 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -27,11 +27,12 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
function. */
 
 #define __HAVE_ARCH_MEMCPY 1
+extern void *__memcpy(void *to, const void *from, size_t len);
+
 #ifndef CONFIG_KMEMCHECK
 #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
 extern void *memcpy(void *to, const void *from, size_t len);
 #else
-extern void *__memcpy(void *to, const void *from, size_t len);
 #define memcpy(dst, src, len)  \
 ({ \
size_t __len = (len);   \
@@ -53,9 +54,11 @@ extern void *__memcpy(void *to, const void *from, size_t len);
 
 #define __HAVE_ARCH_MEMSET
 void *memset(void *s, int c, size_t n);
+void *__memset(void *s, int c, size_t n);
 
 #define __HAVE_ARCH_MEMMOVE
 void *memmove(void *dest, const void *src, size_t count);
+void *__memmove(void *dest, const void *src, size_t count);
 
 int memcmp(const void *cs, const void *ct, size_t count);
 size_t strlen(const char *s);
@@ -63,6 +66,19 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest,

[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: add kernel address sanitizer support for slub allocator

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit f5739fe62cb93cddd8165d0fca93f773d36431b8
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:40 2015 +0400

ms/mm: slub: add kernel address sanitizer support for slub allocator

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 0316bec22ec95ea2faca6406437b0b5950553b7c upstream.

With this patch kasan will be able to catch bugs in memory allocated by
slub.  Initially, all objects in a newly allocated slab page are marked as
redzone.  Later, when a slub object is allocated, the number of bytes
requested by the caller is marked as accessible, and the rest of the object
(including slub's metadata) is marked as redzone (inaccessible).

We also mark an object as accessible if ksize was called for it.
There are some places in the kernel where ksize is called to inquire the
size of the really allocated area.  Such callers may validly access the whole
allocated memory, so it should be marked as accessible.

Code in the slub.c and slab_common.c files may validly access an object's
metadata, so instrumentation for these files is disabled.
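
The marking scheme can be illustrated with a small userspace C sketch.  It
assumes one shadow byte per 8-byte granule, where 0 means fully accessible,
1..7 means only that many leading bytes are accessible, and a negative value
means redzone.  The names poison_slab(), mark_alloc() and byte_accessible()
and the REDZONE value are hypothetical, chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GRANULE    8			/* bytes covered by one shadow byte */
#define SLAB_BYTES 128			/* toy slab page */
#define REDZONE    ((int8_t)-1)		/* hypothetical marker, any negative value */

static int8_t shadow[SLAB_BYTES / GRANULE];

/* A freshly allocated slab page starts out entirely poisoned. */
static void poison_slab(void)
{
	memset(shadow, REDZONE, sizeof(shadow));
}

/*
 * Mark the first `size` bytes of the object at offset `off` as accessible
 * and the rest of the object, up to `obj_size`, as redzone.  `off` and
 * `obj_size` are assumed to be multiples of GRANULE.
 */
static void mark_alloc(size_t off, size_t size, size_t obj_size)
{
	size_t first = off / GRANULE;
	size_t full = size / GRANULE;
	size_t total = obj_size / GRANULE;
	size_t i;

	assert(size <= obj_size && off + obj_size <= SLAB_BYTES);

	for (i = 0; i < total; i++)
		shadow[first + i] = REDZONE;
	for (i = 0; i < full; i++)
		shadow[first + i] = 0;
	if (size % GRANULE)
		shadow[first + full] = (int8_t)(size % GRANULE);
}

/* Return 1 if the byte at slab offset `addr` may be accessed. */
static int byte_accessible(size_t addr)
{
	int8_t s = shadow[addr / GRANULE];

	if (s == 0)
		return 1;
	if (s < 0)
		return 0;
	return (int8_t)(addr % GRANULE) < s;
}
```

For a 13-byte request served from a 32-byte object, the first granule is
fully accessible, the second exposes 5 bytes, and the remaining two stay
redzone.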

Signed-off-by: Andrey Ryabinin 
Signed-off-by: Dmitry Chernenkov 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/kasan.h | 27 ++
 include/linux/slab.h  | 11 --
 lib/Kconfig.kasan     |  1 +
 mm/Makefile           |  3 ++
 mm/kasan/kasan.c      | 98 +++
 mm/kasan/kasan.h      |  5 +++
 mm/kasan/report.c     | 21 +++
 mm/slab_common.c      |  5 ++-
 mm/slub.c             | 31 ++--
 9 files changed, 197 insertions(+), 5 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index f00c15c..d5310ee 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -37,6 +37,18 @@ void kasan_unpoison_shadow(const void *address, size_t size);
 void kasan_alloc_pages(struct page *page, unsigned int order);
 void kasan_free_pages(struct page *page, unsigned int order);
 
+void kasan_poison_slab(struct page *page);
+void kasan_unpoison_object_data(struct kmem_cache *cache, void *object);
+void kasan_poison_object_data(struct kmem_cache *cache, void *object);
+
+void kasan_kmalloc_large(const void *ptr, size_t size);
+void kasan_kfree_large(const void *ptr);
+void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size);
+void kasan_krealloc(const void *object, size_t new_size);
+
+void kasan_slab_alloc(struct kmem_cache *s, void *object);
+void kasan_slab_free(struct kmem_cache *s, void *object);
+
 #else /* CONFIG_KASAN */
 
 static inline void kasan_unpoison_shadow(const void *address, size_t size) {}
@@ -47,6 +59,21 @@ static inline void kasan_disable_current(void) {}
 static inline void kasan_alloc_pages(struct page *page, unsigned int order) {}
 static inline void kasan_free_pages(struct page *page, unsigned int order) {}
 
+static inline void kasan_poison_slab(struct page *page) {}
+static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
+   void *object) {}
+static inline void kasan_poison_object_data(struct kmem_cache *cache,
+   void *object) {}
+
+static inline void kasan_kmalloc_large(void *ptr, size_t size) {}
+static inline void kasan_kfree_large(const void *ptr) {}
+static inline void kasan_kmalloc(struct kmem_cache *s, const void *object,
+   size_t size) {}
+static inline void kasan_krealloc(const void *object, size_t new_size) {}
+
+static inline void kasan_slab_alloc(struct kmem_cache *s, void *object) {}
+static inline void kasan_slab_free(struct kmem_cache *s, void *object) {}
+
 #endif /* CONFIG_KASAN */
 
 #endif /* LINUX_KASAN_H */
diff --git a/include/linux/slab.h b/include/linux/slab.h

[Devel] [PATCH RHEL7 COMMIT] ms/mm: slub: introduce virt_to_obj function

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c79da004858018af5e66fd380014fea3e5d5271d
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:38 2015 +0400

ms/mm: slub: introduce virt_to_obj function

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 912f5fbf1d3060f25d6994aed0265c55b974b2e9 upstream.

virt_to_obj takes kmem_cache address, address of slab page, address x
pointing somewhere inside slab object, and returns address of the
beginning of object.
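
The arithmetic is easy to try out in userspace.  The sketch below mirrors
the expression from the patch but takes the object size directly instead of
a struct kmem_cache; obj_start() and demo_page are hypothetical names used
only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static char demo_page[256];		/* toy slab page */

/*
 * Userspace stand-in for virt_to_obj(): `obj_size` plays the role of
 * s->size.  Objects are assumed to be laid out back to back starting at
 * `slab_page`.
 */
static void *obj_start(size_t obj_size, const void *slab_page, const void *x)
{
	uintptr_t off = (uintptr_t)x - (uintptr_t)slab_page;

	assert(obj_size != 0);
	return (void *)((uintptr_t)x - off % obj_size);
}
```

With 48-byte objects, an address 100 bytes into the page lies 4 bytes into
the third object, so the result is the page address plus 96.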

Signed-off-by: Andrey Ryabinin 
Acked-by: Christoph Lameter 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 include/linux/slub_def.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d7d4571..bd48c92 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -123,4 +123,20 @@ static inline void sysfs_slab_remove(struct kmem_cache *s)
 }
 #endif
 
+
+/**
+ * virt_to_obj - returns address of the beginning of object.
+ * @s: object's kmem_cache
+ * @slab_page: address of slab page
+ * @x: address within object memory range
+ *
+ * Returns address of the beginning of object
+ */
+static inline void *virt_to_obj(struct kmem_cache *s,
+   const void *slab_page,
+   const void *x)
+{
+   return (void *)x - ((x - slab_page) % s->size);
+}
+
 #endif /* _LINUX_SLUB_DEF_H */

[Devel] [PATCH RHEL7 COMMIT] ms/kasan: Makefile: shut up warnings if CONFIG_COMPILE_TEST=y

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 0bc35fb562a57be834fea65d992bb49b94909579
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:49 2015 +0400

ms/kasan: Makefile: shut up warnings if CONFIG_COMPILE_TEST=y

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 6e54abac1b8e0b7febffdbad37b605daef1cfcff upstream.

It might be annoying to constantly see this:

scripts/Makefile.kasan:16: Cannot use CONFIG_KASAN: -fsanitize=kernel-address is not supported by compiler

while performing allmodconfig/allyesconfig build tests.
Disable this warning if CONFIG_COMPILE_TEST=y.

Signed-off-by: Andrey Ryabinin 
Cc: Michal Marek 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 scripts/Makefile.kasan | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.kasan b/scripts/Makefile.kasan
index 631619b..3f874d2 100644
--- a/scripts/Makefile.kasan
+++ b/scripts/Makefile.kasan
@@ -13,12 +13,16 @@ CFLAGS_KASAN := $(call cc-option, -fsanitize=kernel-address \
 --param asan-instrumentation-with-call-threshold=$(call_threshold))
 
 ifeq ($(call cc-option, $(CFLAGS_KASAN_MINIMAL) -Werror),)
+   ifneq ($(CONFIG_COMPILE_TEST),y)
 $(warning Cannot use CONFIG_KASAN: \
 -fsanitize=kernel-address is not supported by compiler)
+   endif
 else
 ifeq ($(CFLAGS_KASAN),)
-$(warning CONFIG_KASAN: compiler does not support all options.\
-Trying minimal configuration)
+ifneq ($(CONFIG_COMPILE_TEST),y)
+$(warning CONFIG_KASAN: compiler does not support all options.\
+Trying minimal configuration)
+endif
 CFLAGS_KASAN := $(CFLAGS_KASAN_MINIMAL)
 endif
 endif

[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Add message about KASAN being initialized

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 4e8e61be6ec2720f7871f8d673c8e806dd93
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:52 2015 +0400

ms/x86/kasan: Add message about KASAN being initialized

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 8515522949951d81fe2d06c0a3292f171f2b8ec4 upstream.

Print informational message to tell user that kernel
runs with KASAN enabled.

Add a "kasan: " prefix to all messages in kasan_init_64.c.
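
The prefix works through string-literal concatenation in the pr_fmt()
macro.  A minimal userspace sketch, assuming snprintf() as a stand-in for
the kernel's pr_info(); format_init_message() and has_kasan_prefix() are
hypothetical helpers for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same shape as the definition added by the patch. */
#define pr_fmt(fmt) "kasan: " fmt

/* Userspace stand-in for pr_info(): formats into buf instead of the log. */
static int format_init_message(char *buf, size_t len)
{
	assert(buf && len > 0);
	return snprintf(buf, len,
			pr_fmt("Kernel address sanitizer initialized\n"));
}

/* Return 1 if the message carries the prefix added by pr_fmt(). */
static int has_kasan_prefix(const char *msg)
{
	return strncmp(msg, "kasan: ", 7) == 0;
}
```

Because the macro must be visible before the printk helpers are declared,
the patch places the #define ahead of the includes.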

Signed-off-by: Andrey Ryabinin 
Cc: Alexander Popov 
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1435828178-10975-6-git-send-email-a.ryabi...@samsung.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/x86/mm/kasan_init_64.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index ef3dea9..f9fb08e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -1,3 +1,4 @@
+#define pr_fmt(fmt) "kasan: " fmt
 #include 
 #include 
 #include 
@@ -237,4 +238,6 @@ void __init kasan_init(void)
load_cr3(init_level4_pgt);
__flush_tlb_all();
init_task.kasan_depth = 0;
+
+   pr_info("Kernel address sanitizer initialized\n");
 }

[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Fix boot crash on AMD processors

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 1551fd8cc2353656479158d30ad46940290098da
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:51 2015 +0400

ms/x86/kasan: Fix boot crash on AMD processors

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit d4f86beacc21d538dc41e1fc75a22e084f547edf upstream.

While populating the zero shadow, wrong bits were used in the upper-level
page tables. __PAGE_KERNEL_RO, which was used for pgd/pud/pmd, has
_PAGE_BIT_GLOBAL set. The global bit is present only in the lowest
level of the page translation hierarchy (ptes), and it should be
zero in upper levels.

This bug doesn't seem to cause any trouble on Intel CPUs, while
on AMD CPUs it causes a kernel crash on boot.

Use _KERNPG_TABLE bits for pgds/puds/pmds to fix this.
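
The difference between the two flag sets can be made concrete with a small
userspace sketch.  The bit positions and the composition of both masks are
written from my understanding of the x86-64 page table layout and of the
kernel macros of this era, so treat them as assumptions; the PG_* names and
make_table_entry() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 page table entry bits (assumed positions). */
#define PG_PRESENT	(1ULL << 0)
#define PG_RW		(1ULL << 1)
#define PG_ACCESSED	(1ULL << 5)
#define PG_DIRTY	(1ULL << 6)
#define PG_GLOBAL	(1ULL << 8)	/* meaningful only in leaf entries */
#define PG_NX		(1ULL << 63)

/* Assumed equivalents of _KERNPG_TABLE and __PAGE_KERNEL_RO. */
#define KERNPG_TABLE	(PG_PRESENT | PG_RW | PG_ACCESSED | PG_DIRTY)
#define PAGE_KERNEL_RO	(PG_PRESENT | PG_ACCESSED | PG_DIRTY | PG_GLOBAL | PG_NX)

static int has_global(uint64_t flags)
{
	return (flags & PG_GLOBAL) != 0;
}

/* Build a pgd/pud/pmd entry pointing at a lower-level table. */
static uint64_t make_table_entry(uint64_t table_phys, uint64_t flags)
{
	assert((table_phys & 0xfffULL) == 0);	/* tables are page aligned */
	assert(!has_global(flags));		/* must be zero above pte level */
	return table_phys | flags;
}
```

Passing PAGE_KERNEL_RO to make_table_entry() trips the second assertion,
which is the condition this patch removes from the zero shadow setup.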

Reported-by: Borislav Petkov 
Signed-off-by: Andrey Ryabinin 
Cc:  # 4.0+
Cc: Alexander Popov 
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Dmitry Vyukov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1435828178-10975-5-git-send-email-a.ryabi...@samsung.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/x86/mm/kasan_init_64.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0ada6cc..ef3dea9 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -85,7 +85,7 @@ static int __init zero_pmd_populate(pud_t *pud, unsigned long addr,
while (IS_ALIGNED(addr, PMD_SIZE) && addr + PMD_SIZE <= end) {
WARN_ON(!pmd_none(*pmd));
set_pmd(pmd, __pmd(__pa_nodebug(kasan_zero_pte)
-   | __PAGE_KERNEL_RO));
+   | _KERNPG_TABLE));
addr += PMD_SIZE;
pmd = pmd_offset(pud, addr);
}
@@ -111,7 +111,7 @@ static int __init zero_pud_populate(pgd_t *pgd, unsigned long addr,
while (IS_ALIGNED(addr, PUD_SIZE) && addr + PUD_SIZE <= end) {
WARN_ON(!pud_none(*pud));
set_pud(pud, __pud(__pa_nodebug(kasan_zero_pmd)
-   | __PAGE_KERNEL_RO));
+   | _KERNPG_TABLE));
addr += PUD_SIZE;
pud = pud_offset(pgd, addr);
}
@@ -136,7 +136,7 @@ static int __init zero_pgd_populate(unsigned long addr, unsigned long end)
while (IS_ALIGNED(addr, PGDIR_SIZE) && addr + PGDIR_SIZE <= end) {
WARN_ON(!pgd_none(*pgd));
set_pgd(pgd, __pgd(__pa_nodebug(kasan_zero_pud)
-   | __PAGE_KERNEL_RO));
+   | _KERNPG_TABLE));
addr += PGDIR_SIZE;
pgd = pgd_offset_k(addr);
}

[Devel] [PATCH RHEL7 COMMIT] ms/kasan: enable stack instrumentation

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit afb61959d53afd934a5117de982401ddd17cf44a
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:43 2015 +0400

ms/kasan: enable stack instrumentation

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit c420f167db8c799d69fe43a801c58a7f02e9d57c upstream.

Stack instrumentation allows detecting out-of-bounds memory accesses for
variables allocated on the stack.  The compiler adds redzones around every
variable on the stack and poisons the redzones in the function's prologue.

This approach significantly increases stack usage, so the sizes of all
in-kernel stacks were doubled.
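
The doubling follows from adding one to each stack order.  A small sketch
of the arithmetic, assuming 4 KiB pages; stack_bytes() and PAGE_SIZE_BYTES
are hypothetical names, while the base orders (2 for thread and IRQ stacks,
0 for exception stacks) come from the page_64_types.h hunk of this patch:

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL		/* assumed 4 KiB pages */

/* Stack size in bytes for a base order, with or without KASAN. */
static unsigned long stack_bytes(int base_order, int kasan_enabled)
{
	int kasan_stack_order = kasan_enabled ? 1 : 0;

	assert(base_order >= 0);
	return PAGE_SIZE_BYTES << (base_order + kasan_stack_order);
}
```

Under these assumptions the thread stack grows from 16 KiB to 32 KiB and
the exception stack from 4 KiB to 8 KiB.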

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/x86/include/asm/page_64_types.h | 12 +---
 arch/x86/kernel/Makefile             |  2 ++
 arch/x86/mm/kasan_init_64.c          | 11 +--
 include/linux/init_task.h            |  8 
 mm/kasan/kasan.h                     |  9 +
 mm/kasan/report.c                    |  6 ++
 scripts/Makefile.kasan               |  1 +
 7 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 735457b..042942c 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -1,17 +1,23 @@
 #ifndef _ASM_X86_PAGE_64_DEFS_H
 #define _ASM_X86_PAGE_64_DEFS_H
 
-#define THREAD_SIZE_ORDER  2
+#ifdef CONFIG_KASAN
+#define KASAN_STACK_ORDER 1
+#else
+#define KASAN_STACK_ORDER 0
+#endif
+
+#define THREAD_SIZE_ORDER  (2 + KASAN_STACK_ORDER)
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
 #define CURRENT_MASK (~(THREAD_SIZE - 1))
 
-#define EXCEPTION_STACK_ORDER 0
+#define EXCEPTION_STACK_ORDER (0 + KASAN_STACK_ORDER)
 #define EXCEPTION_STKSZ (PAGE_SIZE << EXCEPTION_STACK_ORDER)
 
 #define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1)
 #define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER)
 
-#define IRQ_STACK_ORDER 2
+#define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
 #define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)
 
 #define DOUBLEFAULT_STACK 1
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 102a138..4d5df57 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -17,6 +17,8 @@ CFLAGS_REMOVE_early_printk.o = -pg
 endif
 
 KASAN_SANITIZE_head$(BITS).o := n
+KASAN_SANITIZE_dumpstack.o := n
+KASAN_SANITIZE_dumpstack_$(BITS).o := n
 
 CFLAGS_irq.o := -I$(src)/../include/asm/trace
 
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index cf3190a..a0c0dcc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -189,11 +189,18 @@ void __init kasan_init(void)
if (map_range(_mapped[i]))
panic("kasan: unable to allocate shadow!");
}
-
populate_zero_shadow(kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
-   (void *)KASAN_SHADOW_END);
+   kasan_mem_to_shadow((void *)__START_KERNEL_map));
+
+   vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
+   (unsigned long)kasan_mem_to_shadow(_end),
+   NUMA_NO_NODE);
+
+   populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_VADDR),
+   (void *)KASAN_SHADOW_END);
 
memset(kasan_zero_page, 0, PAGE_SIZE);
 
load_cr3(init_level4_pgt);
+   init_task.kasan_depth = 0;
 }
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b1bdeb6..d2cbad0 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -161,6 +161,13 @@ extern struct task_group root_task_group;
 
 #define INIT_TASK_COMM "swapper"
 
+#ifdef CONFIG_KASAN
+# define INIT_KASAN(tsk)

[Devel] [PATCH RHEL7 COMMIT] ms/kmemleak: disable kasan instrumentation for kmemleak

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c15154cbcf8144b7f551b571c0a4a129bfcf999d
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:41 2015 +0400

ms/kmemleak: disable kasan instrumentation for kmemleak

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit e79ed2f13faab8fc9d4ad76d5f5a241724e45836 upstream.

kmalloc internally rounds up the allocation size, and kmemleak uses the
rounded-up size as the object's size.  This makes kasan complain while
kmemleak scans memory or calculates an object's checksum.  The simplest
solution here is to disable kasan.
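
The disable/enable pair can be pictured as a per-task nesting counter that
suppresses reports while it is nonzero.  That reading is an assumption based
on the kasan_depth field touched elsewhere in this series; the toy_* names
below are hypothetical and the sketch is not the kernel implementation.

```c
#include <assert.h>

struct toy_task {
	int kasan_depth;		/* nonzero: reports are suppressed */
};

static struct toy_task current_task;
static int reports;			/* number of reports actually emitted */

static void toy_disable_current(void)
{
	current_task.kasan_depth++;
}

static void toy_enable_current(void)
{
	assert(current_task.kasan_depth > 0);	/* calls must be balanced */
	current_task.kasan_depth--;
}

/* Called when a bad access is detected; stays silent while disabled. */
static void toy_report_bad_access(void)
{
	if (current_task.kasan_depth)
		return;
	reports++;
}
```

A counter rather than a flag lets such sections nest, so an inner
enable does not re-arm reporting for an outer disabled region.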

Signed-off-by: Andrey Ryabinin 
Acked-by: Catalin Marinas 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 mm/kmemleak.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 98e1b34..5fe0a34 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -98,6 +98,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -1077,7 +1078,10 @@ static bool update_checksum(struct kmemleak_object *object)
if (!kmemcheck_is_obj_initialized(object->pointer, object->size))
return false;
 
+   kasan_disable_current();
object->checksum = crc32(0, (void *)object->pointer, object->size);
+   kasan_enable_current();
+
return object->checksum != old_csum;
 }
 
@@ -1128,7 +1132,9 @@ static void scan_block(void *_start, void *_end,
  BYTES_PER_POINTER))
continue;
 
+   kasan_disable_current();
pointer = *ptr;
+   kasan_enable_current();
 
object = find_and_get_object(pointer, 1);
if (!object)

[Devel] [PATCH RHEL7 COMMIT] ms/kasan: enable instrumentation of global variables

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit b3ad5de4e3c0866f2aa2b581f348989eee5bc9df
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:46 2015 +0400

ms/kasan: enable instrumentation of global variables

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit bebf56a1b176c2e1c9efe44e7e6915532cc682cf upstream.

This feature lets us detect out-of-bounds accesses to global variables.
It works both for globals in the kernel image and for globals in modules.
Currently it won't work for symbols in user-specified sections (e.g.
__init, __read_mostly, ...)

The idea is simple.  The compiler enlarges each global variable by the
redzone size and adds constructors invoking the __asan_register_globals()
function.  Information about each global variable (address, size, size with
redzone ...) is passed to __asan_register_globals() so we can poison the
variable's redzone.

This patch also forces module_alloc() to return an 8*PAGE_SIZE aligned
address, making shadow memory handling
(kasan_module_alloc()/kasan_module_free()) simpler.  Such alignment
guarantees that each shadow page backing the modules address space
corresponds to only one module_alloc() allocation.
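
The alignment argument can be checked with a little arithmetic.  The sketch
assumes 4 KiB pages, a shadow scale of one shadow byte per 8 bytes, and a
page-aligned shadow offset (so it can be ignored here); shadow_page_index()
and align_up() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ			4096ULL	/* assumed page size */
#define SHADOW_SHIFT		3	/* one shadow byte per 8 bytes */
#define MODULE_ALIGN_BYTES	(PAGE_SZ << SHADOW_SHIFT)	/* 8 pages */

/* Index of the shadow page that holds the shadow byte for `addr`. */
static uint64_t shadow_page_index(uint64_t addr)
{
	return (addr >> SHADOW_SHIFT) / PAGE_SZ;
}

/* Round `addr` up to a multiple of `align`, which must be a power of two. */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}
```

One shadow page covers 8 pages of module space, so allocations that start
on 8-page boundaries never share a shadow page, whereas allocations that are
merely page aligned can.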

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 Documentation/kasan.txt     |  2 +-
 arch/x86/kernel/module.c    | 12 +--
 arch/x86/mm/kasan_init_64.c |  4 ++--
 include/linux/kasan.h       | 10 +
 kernel/module.c             |  2 ++
 lib/Kconfig.kasan           |  1 +
 mm/kasan/kasan.c            | 52 +
 mm/kasan/kasan.h            | 25 ++
 mm/kasan/report.c           | 22 +++
 scripts/Makefile.kasan      |  2 +-
 10 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/Documentation/kasan.txt b/Documentation/kasan.txt
index f0645a8..092fc10 100644
--- a/Documentation/kasan.txt
+++ b/Documentation/kasan.txt
@@ -9,7 +9,7 @@ a fast and comprehensive solution for finding use-after-free and out-of-bounds
 bugs.
 
 KASan uses compile-time instrumentation for checking every memory access,
-therefore you will need a certain version of GCC >= 4.9.2
+therefore you will need a certain version of GCC > 4.9.2
 
 Currently KASan is supported only for x86_64 architecture and requires that the
 kernel be built with the SLUB allocator.
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 2ce4a9a..5892e83 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -45,11 +46,18 @@ do { \
 
 void *module_alloc(unsigned long size)
 {
+   void *p;
+
if (PAGE_ALIGN(size) > MODULES_LEN)
return NULL;
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+   p =  __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR, MODULES_END,
GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC,
-0, NUMA_NO_NODE, __builtin_return_address(0));
+   0, NUMA_NO_NODE, __builtin_return_address(0));
+   if (p && (kasan_module_alloc(p, size) < 0)) {
+   vfree(p);
+   return NULL;
+   }
+   return p;
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index a0c0dcc..7620537 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -194,9 +194,9 @@ void __init kasan_init(void)
 
vmemmap_populate((unsigned long)kasan_mem_to_shadow(_stext),
(unsigned long)kasan_mem_to_shadow(_end),
-

[Devel] [PATCH RHEL7 COMMIT] ms/kasan: disable memory hotplug

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 2aab6eabec66429c34f23086f82dd793d660b283
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:36 2015 +0400

ms/kasan: disable memory hotplug

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit 786a8959912eb94fc2381c2ae487a96ce55dabca upstream.

Currently memory hotplug won't work with KASan.  As we don't have shadow
for hotplugged memory, kernel will crash on the first access to it.  To
make this work we will need to allocate shadow for new memory.

At some future point proper memory hotplug support will be implemented.
Until then, print a warning at startup and disable memory hot-add.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 mm/kasan/kasan.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/mm/kasan/kasan.c b/mm/kasan/kasan.c
index 6dc1aa7..def8110 100644
--- a/mm/kasan/kasan.c
+++ b/mm/kasan/kasan.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -300,3 +301,23 @@ EXPORT_SYMBOL(__asan_storeN_noabort);
 /* to shut up compiler complaints */
 void __asan_handle_no_return(void) {}
 EXPORT_SYMBOL(__asan_handle_no_return);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static int kasan_mem_notifier(struct notifier_block *nb,
+   unsigned long action, void *data)
+{
+   return (action == MEM_GOING_ONLINE) ? NOTIFY_BAD : NOTIFY_OK;
+}
+
+static int __init kasan_memhotplug_init(void)
+{
+   pr_err("WARNING: KASan doesn't support memory hot-add\n");
+   pr_err("Memory hot-add will be disabled\n");
+
+   hotplug_memory_notifier(kasan_mem_notifier, 0);
+
+   return 0;
+}
+
+module_init(kasan_memhotplug_init);
+#endif

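Editor's note: the hot-add veto in the patch above relies on the memory
notifier chain, where any callback returning NOTIFY_BAD for
MEM_GOING_ONLINE stops the operation. The following is a minimal userspace
sketch of that idea, not kernel code. The enum values, mem_notifier_fn and
mem_action_allowed() are assumptions made for illustration; only the
decision inside kasan_like_notifier() mirrors kasan_mem_notifier() from the
patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's notifier return codes and memory
 * hotplug actions; the numeric values here are illustrative only. */
enum { NOTIFY_OK, NOTIFY_BAD };
enum { MEM_GOING_ONLINE, MEM_ONLINE, MEM_GOING_OFFLINE };

/* Hypothetical callback type; the real one also takes a notifier_block
 * and a data pointer. */
typedef int (*mem_notifier_fn)(unsigned long action);

/* Same decision as kasan_mem_notifier() in the patch: veto only hot-add. */
static int kasan_like_notifier(unsigned long action)
{
	return (action == MEM_GOING_ONLINE) ? NOTIFY_BAD : NOTIFY_OK;
}

/* Hypothetical chain walker: the request proceeds only if nobody objects. */
static bool mem_action_allowed(const mem_notifier_fn *chain, size_t n,
			       unsigned long action)
{
	for (size_t i = 0; i < n; i++)
		if (chain[i](action) == NOTIFY_BAD)
			return false;
	return true;
}
```

With kasan_like_notifier registered in the chain, a MEM_GOING_ONLINE
request is refused while every other action passes through unchanged.
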
[Devel] [PATCH RHEL7 COMMIT] ms/MODULE_DEVICE_TABLE: fix some callsites

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit c2814d62886ad2b1697f7a4152d545662bdb2351
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:34 2015 +0400

ms/MODULE_DEVICE_TABLE: fix some callsites

https://jira.sw.ru/browse/PSBM-26429

From: Andrew Morton 

commit 0f989f749b51ec1fd94bb5a42f8ad10c8b9f73cb upstream.

The patch "module: fix types of device tables aliases" newly requires that
invocations of

MODULE_DEVICE_TABLE(type, name);

come *after* the definition of `name'.  That is reasonable, but some
drivers weren't doing this.  Fix them.

Cc: James Bottomley 
Cc: Andrey Ryabinin 
Cc: David Miller 
Cc: Hans Verkuil 
Acked-by: Mauro Carvalho Chehab 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 drivers/net/ethernet/emulex/benet/be_main.c | 1 -
 drivers/scsi/be2iscsi/be_main.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 167fe08..4e60ee7 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -26,7 +26,6 @@
 #include 
 
 MODULE_VERSION(DRV_VER);
-MODULE_DEVICE_TABLE(pci, be_dev_ids);
 MODULE_DESCRIPTION(DRV_DESC " " DRV_VER);
 MODULE_AUTHOR("Emulex Corporation");
 MODULE_LICENSE("GPL");
diff --git a/drivers/scsi/be2iscsi/be_main.c b/drivers/scsi/be2iscsi/be_main.c
index 6b079d6..f9506b2 100644
--- a/drivers/scsi/be2iscsi/be_main.c
+++ b/drivers/scsi/be2iscsi/be_main.c
@@ -48,7 +48,6 @@ static unsigned int be_iopoll_budget = 10;
 static unsigned int be_max_phys_size = 64;
 static unsigned int enable_msix = 1;
 
-MODULE_DEVICE_TABLE(pci, beiscsi_pci_id_table);
 MODULE_DESCRIPTION(DRV_DESC " " BUILD_STR);
 MODULE_VERSION(BUILD_STR);
 MODULE_AUTHOR("Emulex Corporation");

[Devel] [PATCH RHEL7 COMMIT] ms/x86/init: Clear 'init_level4_pgt' earlier

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 2c3d4203ed393d91ba79a0fa59f5e1ce5fe7627a
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:49 2015 +0400

ms/x86/init: Clear 'init_level4_pgt' earlier

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit d0f77d4d04b222a817925d33ba3589b190bfa863 upstream.

Currently x86_64_start_kernel() has two KASAN related
function calls. The first call maps shadow to early_level4_pgt,
the second maps shadow to init_level4_pgt.

If we move clear_page(init_level4_pgt) earlier, we could hide
KASAN low level detail from generic x86_64 initialization code.
The next patch will do it.

Signed-off-by: Andrey Ryabinin 
Cc:  # 4.0+
Cc: Alexander Popov 
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1435828178-10975-2-git-send-email-a.ryabi...@samsung.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/x86/kernel/head64.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 67df086..357ce8a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -164,6 +164,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
/* clear bss before set_intr_gate with early_idt_handler */
clear_bss();
 
+   clear_page(init_level4_pgt);
+
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handlers[i]);
load_idt((const struct desc_ptr *)&idt_descr);
@@ -178,7 +180,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");
 
-   clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
 

[Devel] [PATCH RHEL7 COMMIT] ms/x86/kasan: Fix KASAN shadow region page tables

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 71bad1e5d1a2aa16ce31dc1413d36d066e73b7e4
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:50 2015 +0400

ms/x86/kasan: Fix KASAN shadow region page tables

https://jira.sw.ru/browse/PSBM-26429

From: Alexander Popov 

commit 5d5aa3cfca5cf74cd928daf3674642e6004328d1 upstream.

Currently the KASAN shadow region page tables are created without
regard to the physical offset (phys_base). This causes a kernel halt
when phys_base is not zero.

So let's initialize KASAN shadow region page tables in
kasan_early_init() using __pa_nodebug() which considers
phys_base.

This patch also separates x86_64_start_kernel() from KASAN low
level details by moving kasan_map_early_shadow(init_level4_pgt)
into kasan_early_init().

Remove the comment before clear_bss(), which no longer helped
readability. Describing all the new ordering dependencies there
would be too verbose.

Signed-off-by: Alexander Popov 
Signed-off-by: Andrey Ryabinin 
Cc:  # 4.0+
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Borislav Petkov 
Cc: Dmitry Vyukov 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1435828178-10975-3-git-send-email-a.ryabi...@samsung.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/x86/include/asm/kasan.h |  8 ++--
 arch/x86/kernel/head64.c |  7 ++-
 arch/x86/kernel/head_64.S| 28 
 arch/x86/mm/kasan_init_64.c  | 36 ++--
 4 files changed, 38 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/kasan.h b/arch/x86/include/asm/kasan.h
index 8b22422..74a2a8d 100644
--- a/arch/x86/include/asm/kasan.h
+++ b/arch/x86/include/asm/kasan.h
@@ -14,15 +14,11 @@
 
 #ifndef __ASSEMBLY__
 
-extern pte_t kasan_zero_pte[];
-extern pte_t kasan_zero_pmd[];
-extern pte_t kasan_zero_pud[];
-
 #ifdef CONFIG_KASAN
-void __init kasan_map_early_shadow(pgd_t *pgd);
+void __init kasan_early_init(void);
 void __init kasan_init(void);
 #else
-static inline void kasan_map_early_shadow(pgd_t *pgd) { }
+static inline void kasan_early_init(void) { }
 static inline void kasan_init(void) { }
 #endif
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 357ce8a..c2dd757 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -159,13 +159,12 @@ void __init x86_64_start_kernel(char * real_mode_data)
/* Kill off the identity-map trampoline */
reset_early_page_tables();
 
-   kasan_map_early_shadow(early_level4_pgt);
-
-   /* clear bss before set_intr_gate with early_idt_handler */
clear_bss();
 
clear_page(init_level4_pgt);
 
+   kasan_early_init();
+
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handlers[i]);
load_idt((const struct desc_ptr *)&idt_descr);
@@ -183,8 +182,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
 
-   kasan_map_early_shadow(init_level4_pgt);
-
x86_64_start_reservations(real_mode_data);
 }
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 4178929..cb5bf29 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -514,22 +514,6 @@ ENTRY(phys_base)
/* This must match the first entry in level2_kernel_pgt */
.quad   0x
 
-#ifdef CONFIG_KASAN
-#define FILL(VAL, COUNT)   \
-   .rept (COUNT) ; \
-   .quad   (VAL) ; \
-   .endr
-
-NEXT_PAGE(kasan_zero_pte)
-   FILL(kasan_zero_page - __START_KERNEL_map + _KERNPG_TABLE, 512)
-NEXT_PAGE(kasan_zero_pmd)
-   FILL(kasan_zero_pte - __START_KERNEL_map + _KERNPG_TABLE, 512)
-NEXT_PAGE(kasan_zero_pud)
-   FILL(kasan_zero_pmd - __START_KERNEL_map + _KERNPG_TABLE, 512)
-
-#undef FILL
-#endif
-
 #include "../../x86/xen/xen-head.S"

.section .bss, "aw", @nobits
@@ -551,15 +535,3 @@ ENTRY(trace_idt_table)
 NEXT_PAGE(empty_zero_page)
.skip PAGE_SIZE
 
-#ifdef CONFIG_KASAN
-/*
- * This page used as early shadow. We don't use empty_zero_page
- * at early stages, stack instrumentation could write some garbage
- * to this

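Editor's note: the message above is cut off in this archive. Its core
point, that the statically built page table entries ignored phys_base, can
be sketched in userspace as follows. This is a simplified illustration
under stated assumptions: the helper names are hypothetical, the page
table flag bits (_KERNPG_TABLE) are left out, and only addresses inside
the kernel image mapping are considered.

```c
#include <stdint.h>

/* Base of the x86_64 kernel image mapping (__START_KERNEL_map). */
#define START_KERNEL_MAP 0xffffffff80000000ULL

/* What the removed FILL() entries effectively encoded: the link-time
 * offset only, which is right only when phys_base is zero. */
static uint64_t pa_ignoring_phys_base(uint64_t vaddr)
{
	return vaddr - START_KERNEL_MAP;
}

/* What a phys_base-aware translation yields for a kernel image symbol:
 * the same offset plus the load offset. */
static uint64_t pa_with_phys_base(uint64_t vaddr, uint64_t phys_base)
{
	return vaddr - START_KERNEL_MAP + phys_base;
}
```

The two helpers agree when phys_base is zero and differ by exactly
phys_base otherwise, which is why the bug only shows up on relocated
kernels.
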
[Devel] [PATCH RHEL7 COMMIT] ms/mm: vmalloc: pass additional vm_flags to __vmalloc_node_range()

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit a670730ea44733529cfb1072b8d50bbf4956858d
Author: Andrey Ryabinin 
Date:   Thu Sep 3 19:27:44 2015 +0400

ms/mm: vmalloc: pass additional vm_flags to __vmalloc_node_range()

https://jira.sw.ru/browse/PSBM-26429

From: Andrey Ryabinin 

commit cb9e3c292d0115499c660028ad35ac5501d722b5 upstream.

To instrument global variables, KASan will shadow the memory that backs
modules.  So on module loading we will need to allocate memory for the
shadow and map it at the shadow address that corresponds to the address
allocated in module_alloc().

__vmalloc_node_range() could be used for this purpose, except that it puts
a guard hole after the allocated area.  A guard hole in shadow memory would
be a problem because at some future point we might need shadow memory at
the address occupied by the guard hole, so we could fail to allocate shadow
for module_alloc().

Now we have the VM_NO_GUARD flag, which disables the guard page, so we need
to pass it into __vmalloc_node_range().  Add a new parameter 'vm_flags' to
the __vmalloc_node_range() function.

Signed-off-by: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Konstantin Serebryany 
Cc: Dmitry Chernenkov 
Signed-off-by: Andrey Konovalov 
Cc: Yuri Gribov 
Cc: Konstantin Khlebnikov 
Cc: Sasha Levin 
Cc: Christoph Lameter 
Cc: Joonsoo Kim 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 

Signed-off-by: Andrey Ryabinin 
---
 arch/arm/kernel/module.c|  2 +-
 arch/arm64/kernel/module.c  |  4 ++--
 arch/mips/kernel/module.c   |  2 +-
 arch/parisc/kernel/module.c |  2 +-
 arch/s390/kernel/module.c   |  2 +-
 arch/sparc/kernel/module.c  |  2 +-
 arch/x86/kernel/module.c|  2 +-
 include/linux/vmalloc.h |  4 +++-
 mm/vmalloc.c| 10 ++
 9 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index be3232f..162c0b3 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -40,7 +40,7 @@
 void *module_alloc(unsigned long size)
 {
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE,
+   GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 #endif
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 8f898bd..c7bc3e6 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -29,8 +29,8 @@
 void *module_alloc(unsigned long size)
 {
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
+   NUMA_NO_NODE, __builtin_return_address(0));
 }
 
 enum aarch64_reloc_op {
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 2a52568..1833f51 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -47,7 +47,7 @@ static DEFINE_SPINLOCK(dbe_lock);
 void *module_alloc(unsigned long size)
 {
return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
-   GFP_KERNEL, PAGE_KERNEL, NUMA_NO_NODE,
+   GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
__builtin_return_address(0));
 }
 #endif
diff --git a/arch/parisc/kernel/module.c b/arch/parisc/kernel/module.c
index 50dfafc..0d498ef 100644
--- a/arch/parisc/kernel/module.c
+++ b/arch/parisc/kernel/module.c
@@ -219,7 +219,7 @@ void *module_alloc(unsigned long size)
 * init_data correctly */
return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
GFP_KERNEL | __GFP_HIGHMEM,
-   PAGE_KERNEL_RWX, NUMA_NO_NODE,
+   PAGE_KERNEL_RWX, 0,

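Editor's note: the diff above is cut off in this archive. The effect of
the guard hole described in the commit message can be sketched in
userspace as follows. This is only an illustration: reserved_span() and
page_align() are hypothetical helpers, and the PAGE_SIZE and VM_NO_GUARD
values are assumptions made for the sketch.

```c
#define PAGE_SIZE   4096UL
#define VM_NO_GUARD 0x40UL	/* flag value assumed for this sketch */

/* Round a size up to a whole number of pages. */
static unsigned long page_align(unsigned long size)
{
	return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Address space consumed by one vmalloc-style area: the page-aligned
 * size plus one trailing guard page, unless the caller opted out. */
static unsigned long reserved_span(unsigned long size, unsigned long vm_flags)
{
	unsigned long span = page_align(size);

	if (!(vm_flags & VM_NO_GUARD))
		span += PAGE_SIZE;
	return span;
}
```

Without the flag, every area claims one extra page of address space. For
shadow memory that extra page may be exactly where the shadow of the next
module has to be mapped, which is the failure the commit message describes.
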
[Devel] [PATCH RHEL7 COMMIT] ve: revise permissions to allow mount smth

2015-09-08 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit 68cf9d3cff9993ae2793c53661721b89d1b2895b
Author: Andrew Vagin 
Date:   Tue Sep 8 12:47:01 2015 +0400

ve: revise permissions to allow mount smth

Reverts commit d492bfa387237 ("ve/vfs: allow mount/umount, pivot_root
with CAP_VE_SYS_ADMIN").

Return to the behavior of the upstream kernel.
We now use mount namespaces and need nothing special here.

https://jira.sw.ru/browse/PSBM-39077

Signed-off-by: Andrew Vagin 
Reviewed-by: Vladimir Davydov 
---
 fs/namespace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 593b262..77a1ede 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1306,9 +1306,7 @@ static int do_umount(struct mount *mnt, int flags)
  */
 static inline bool may_mount(void)
 {
-   return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN) ||
- nsown_capable(CAP_SYS_ADMIN) ||
- nsown_capable(CAP_VE_SYS_ADMIN);
+   return ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN);
 }
 
 /*

[Devel] [PATCH RHEL7 COMMIT] cred: add ve_capable to check capabilities relative to the current VE (v2)

2015-09-08 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit 9c0a32ed2a39800f298cf96308530c805a9188fd
Author: Andrew Vagin 
Date:   Tue Sep 8 12:45:07 2015 +0400

cred: add ve_capable to check capabilities relative to the current VE (v2)

We want to allow a few operations inside a VE. Currently we use
nsown_capable() for that, but this is wrong, because it allows these
operations in any user namespace.

v2: take ve0->cred if the current ve isn't running

https://jira.sw.ru/browse/PSBM-39077

Signed-off-by: Andrew Vagin 
Reviewed-by: Vladimir Davydov 
---
 fs/autofs4/root.c  |  6 ++
 fs/ioprio.c|  2 +-
 fs/namei.c |  2 +-
 include/linux/capability.h |  1 +
 kernel/capability.c| 20 
 kernel/printk.c|  5 ++---
 net/ipv6/sit.c |  2 +-
 net/netfilter/nf_sockopt.c |  2 +-
 security/commoncap.c   |  4 ++--
 security/device_cgroup.c   |  4 ++--
 10 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index 68e3edb..1462d8b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -588,8 +588,7 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
struct autofs_info *p_ino;

/* This allows root to remove symlinks */
-   if (!autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) &&
-   !capable(CAP_VE_SYS_ADMIN))
+   if (!autofs4_oz_mode(sbi) && !ve_capable(CAP_SYS_ADMIN))
return -EPERM;
 
if (atomic_dec_and_test(&ino->count)) {
@@ -837,8 +836,7 @@ static int autofs4_root_ioctl_unlocked(struct inode *inode, struct file *filp,
 _IOC_NR(cmd) - _IOC_NR(AUTOFS_IOC_FIRST) >= AUTOFS_IOC_COUNT)
return -ENOTTY;

-   if (!autofs4_oz_mode(sbi) && !capable(CAP_SYS_ADMIN) &&
-   !capable(CAP_VE_SYS_ADMIN))
+   if (!autofs4_oz_mode(sbi) && !ve_capable(CAP_SYS_ADMIN))
return -EPERM;

switch(cmd) {
diff --git a/fs/ioprio.c b/fs/ioprio.c
index c876fad..f9d9187 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -75,7 +75,7 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
 
switch (class) {
case IOPRIO_CLASS_RT:
-   if (!capable(CAP_VE_ADMIN))
+   if (!ve_capable(CAP_SYS_ADMIN))
return -EPERM;
class = IOPRIO_CLASS_BE;
data = 0;
diff --git a/fs/namei.c b/fs/namei.c
index 8e29a44..e7d9f54 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3397,7 +3397,7 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
if (error)
return error;
 
-   if ((S_ISCHR(mode) || S_ISBLK(mode)) && !nsown_capable(CAP_MKNOD))
+   if ((S_ISCHR(mode) || S_ISBLK(mode)) && !ve_capable(CAP_MKNOD))
return -EPERM;
 
if (!dir->i_op->mknod)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 2b77384..b1131e3 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -217,6 +217,7 @@ extern bool has_ns_capability_noaudit(struct task_struct *t,
 extern bool capable(int cap);
 extern bool ns_capable(struct user_namespace *ns, int cap);
 extern bool nsown_capable(int cap);
+extern bool ve_capable(int cap);
 extern bool inode_capable(const struct inode *inode, int cap);
 extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
 
diff --git a/kernel/capability.c b/kernel/capability.c
index 0a843d5..4a73381 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Leveraged for setting/resetting capabilities
@@ -396,6 +397,25 @@ bool ns_capable(struct user_namespace *ns, int cap)
 }
 EXPORT_SYMBOL(ns_capable);
 
+#if CONFIG_VE
+bool ve_capable(int cap)
+{
+   struct cred *cred = get_exec_env()->init_cred;
+
+   if (cred == NULL) /* ve isn't running */
+   cred = ve0.init_cred;
+
+   return ns_capable(cred->user_ns, cap);
+}
+#else
+bool ve_capable(int cap)
+{
+   return capable(cap);
+}
+#endif
+
+EXPORT_SYMBOL_GPL(ve_capable);
+
 /**
  * file_ns_capable - Determine if the file's opener had a capability in effect
  * @file:  The file we want to check
diff --git a/kernel/printk.c b/kernel/printk.c
index 44b3783..91766fc 100644
--- a/kernel/printk.c
+++ b/kernel/printk.c
@@ -468,14 +468,13 @@ static int check_syslog_permissions(int type, bool from_file)
return 0;
 
if (syslog_action_restricted(type)) {
-   if (nsown_capable(CAP_SYSLOG))
+   if

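Editor's note: the diff above is cut off in this archive. The credential
selection done by ve_capable() in the kernel/capability.c hunk can be
sketched in userspace as follows. The struct layouts are reduced to the
fields the sketch needs, and ve_capable_ns() is a hypothetical helper that
returns the user namespace the capability check would run against. It
does not perform the check itself.

```c
#include <stddef.h>

/* Reduced stand-ins for the kernel structures. */
struct user_ns { int id; };
struct cred { const struct user_ns *user_ns; };
struct ve { const struct cred *init_cred; };

/* Pick the namespace to check against: the current VE's init
 * credentials, or those of ve0 when the VE is not running. */
static const struct user_ns *ve_capable_ns(const struct ve *cur,
					   const struct ve *ve0)
{
	const struct cred *cred = cur->init_cred;

	if (cred == NULL)	/* ve isn't running */
		cred = ve0->init_cred;

	return cred->user_ns;
}
```

The point of the change is that the check is tied to the user namespace
of the VE's init task rather than to whatever user namespace the caller
happens to be in, which is what nsown_capable() used.
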
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/rtnl: allow move network devices into network namespace in CT"

2015-09-08 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit 48d88de87d8ce2bd7e69ba82e5410dd2ec82f602
Author: Andrew Vagin 
Date:   Tue Sep 8 12:50:52 2015 +0400

Revert "ve/rtnl: allow move network devices into network namespace in CT"

This reverts commit b238eaaf8029c022899ee874132814bd1be5551f.

https://jira.sw.ru/browse/PSBM-39077

Signed-off-by: Andrew Vagin 
Reviewed-by: Vladimir Davydov 
---
 net/core/rtnetlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2e8b10f..0d2df96 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1403,8 +1403,7 @@ static int do_setlink(const struct sk_buff *skb,
err = PTR_ERR(net);
goto errout;
}
-   if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN) &&
-   !netlink_ns_capable(skb, net->user_ns, CAP_VE_NET_ADMIN)) {
+   if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN)) {
err = -EPERM;
goto errout;
}

[Devel] [PATCH RHEL7 COMMIT] Revert "ve/net/ioctl: allow change net-device name with CAP_VE_NET_ADMIN"

2015-09-08 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit bf95a05ce5971fa899e169aa27e869f34ac91b72
Author: Andrew Vagin 
Date:   Tue Sep 8 12:50:37 2015 +0400

Revert "ve/net/ioctl: allow change net-device name with CAP_VE_NET_ADMIN"

This reverts commit 9118029490d75eee8ea1c8513412b55b94be92d9.

https://jira.sw.ru/browse/PSBM-39077

Signed-off-by: Andrew Vagin 
Reviewed-by: Vladimir Davydov 
---
 net/core/dev_ioctl.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index 77df687..d407219 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -476,11 +476,8 @@ int dev_ioctl(struct net *net, unsigned int cmd, void __user *arg)
 */
case SIOCGMIIPHY:
case SIOCGMIIREG:
-   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-   return -EPERM;
case SIOCSIFNAME:
-   if (!ns_capable(net->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(net->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
return -EPERM;
dev_load(net, ifr.ifr_name);
rtnl_lock();

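Editor's note: after the revert above, SIOCGMIIPHY, SIOCGMIIREG and
SIOCSIFNAME again fall through to a single CAP_NET_ADMIN check. A minimal
userspace sketch of that control flow follows. The command names, the
has_net_admin flag and the default branch are assumptions made for
illustration and do not reproduce the rest of dev_ioctl().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the ioctl numbers. */
enum cmd { CMD_GMIIPHY, CMD_GMIIREG, CMD_SIFNAME, CMD_OTHER };

/* One shared permission check reached by case fallthrough. */
static int check_cmd(enum cmd c, bool has_net_admin)
{
	switch (c) {
	case CMD_GMIIPHY:
	case CMD_GMIIREG:
	case CMD_SIFNAME:
		if (!has_net_admin)
			return -EPERM;
		return 0;
	default:
		return -ENOTTY;	/* placeholder for all other commands */
	}
}
```
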
[Devel] [PATCH RHEL7 COMMIT] Revert "ve/net: allow containers create bridges with CAP_VE_NET_ADMIN"

2015-09-08 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2.vz7.8.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.8
-->
commit ddcb719bd3e3ea79056bcc74db038c3c5d0e10a1
Author: Andrew Vagin 
Date:   Tue Sep 8 12:50:24 2015 +0400

Revert "ve/net: allow containers create bridges with CAP_VE_NET_ADMIN"

This reverts commit 52b6df12cf62fc92edadcec3860f6418d4d8333e.

https://jira.sw.ru/browse/PSBM-39077

Signed-off-by: Andrew Vagin 
Reviewed-by: Vladimir Davydov 
---
 net/bridge/br_ioctl.c | 33 +++--
 net/core/dev_ioctl.c  |  8 
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 45c4c22..98447b8 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -89,8 +89,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int isadd)
struct net_device *dev;
int ret;
 
-   if (!ns_capable(net->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(net->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
dev = __dev_get_by_index(net, ifindex);
@@ -180,29 +179,25 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
}
 
case BRCTL_SET_BRIDGE_FORWARD_DELAY:
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
return br_set_forward_delay(br, args[1]);
 
case BRCTL_SET_BRIDGE_HELLO_TIME:
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
return br_set_hello_time(br, args[1]);
 
case BRCTL_SET_BRIDGE_MAX_AGE:
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
return br_set_max_age(br, args[1]);
 
case BRCTL_SET_AGEING_TIME:
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
br->ageing_time = clock_t_to_jiffies(args[1]);
@@ -242,16 +237,14 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
}
 
case BRCTL_SET_BRIDGE_STP_STATE:
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
br_stp_set_enabled(br, args[1]);
return 0;
 
case BRCTL_SET_BRIDGE_PRIORITY:
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
spin_lock_bh(&br->lock);
@@ -264,8 +257,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
struct net_bridge_port *p;
int ret;
 
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
spin_lock_bh(&br->lock);
@@ -282,8 +274,7 @@ static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
struct net_bridge_port *p;
int ret;
 
-   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(dev_net(dev)->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
spin_lock_bh(&br->lock);
@@ -340,8 +331,7 @@ static int old_deviceless(struct net *net, void __user *uarg)
{
char buf[IFNAMSIZ];
 
-   if (!ns_capable(net->user_ns, CAP_NET_ADMIN) &&
-   !ns_capable(net->user_ns, CAP_VE_NET_ADMIN))
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
if (copy_from_user(buf, (void

[Devel] [PATCH RHEL7 COMMIT] net: udpv6: release memcg on destroy

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit b9643707fb3f7e18c9681e14c7184d0aa17110a9
Author: Vladimir Davydov 
Date:   Thu Sep 3 13:17:57 2015 +0400

net: udpv6: release memcg on destroy

In case of udpv6 we never release the memcg reference taken in
udpv6_prot->init. This leads to a memcg leak. Fix it by calling
sock_release_memcg from udpv6_prot->destroy.

https://jira.sw.ru/browse/PSBM-39084

Fixes: ee3396bb65bf ("udp: Charge ingress buffers into cg memory")
Signed-off-by: Vladimir Davydov 
---
 net/ipv6/udp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4d3754d..780e823 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1352,6 +1352,7 @@ void udpv6_destroy_sock(struct sock *sk)
}
 
inet6_destroy_sock(sk);
+   sock_release_memcg(sk);
 }
 
 /*

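Editor's note: the leak fixed above is a plain reference imbalance: the
init hook takes a memcg reference and the destroy hook never dropped it. A
minimal userspace sketch of that imbalance follows. All types and function
names are hypothetical stand-ins for the socket and memcg objects, not
kernel APIs.

```c
#include <stddef.h>

/* Hypothetical reduced models of a memcg and a socket. */
struct memcg_model { int refs; };
struct sk_model { struct memcg_model *cg; };

/* Init takes a reference, as udpv6_prot->init does for the memcg. */
static void sk_model_init(struct sk_model *sk, struct memcg_model *cg)
{
	sk->cg = cg;
	cg->refs++;
}

/* Destroy before the fix: the reference is simply forgotten. */
static void sk_model_destroy_leaky(struct sk_model *sk)
{
	sk->cg = NULL;
}

/* Destroy after the fix: the reference is released first. */
static void sk_model_destroy_fixed(struct sk_model *sk)
{
	if (sk->cg) {
		sk->cg->refs--;
		sk->cg = NULL;
	}
}
```

With the leaky variant the counter never returns to zero, so the memcg
can never be freed. The fixed variant restores the balance.
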
[Devel] [PATCH RHEL7 COMMIT] ve/writeback: revert ub dirty limit related stuff

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit fe9db6c5f1e3e58f0ad60caf19c3d9e1a1cca474
Author: Vladimir Davydov 
Date:   Thu Sep 3 14:10:34 2015 +0400

ve/writeback: revert ub dirty limit related stuff

This patch reverts ub dirty limit related hunks brought by the initial
commit 2a8b5de95918. None of them actually works, so this patch
introduces no functional changes. Dirty set control will be
reimplemented in the scope of

https://jira.sw.ru/browse/PSBM-33841

Signed-off-by: Vladimir Davydov 
---
 fs/fs-writeback.c | 39 +++
 include/linux/writeback.h |  4 
 mm/page-writeback.c   |  4 
 3 files changed, 7 insertions(+), 40 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 66586a4..ac8066b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -40,7 +40,6 @@
 struct wb_writeback_work {
long nr_pages;
struct super_block *sb;
-   struct user_beancounter *ub;
unsigned long *older_than_this;
enum writeback_sync_modes sync_mode;
unsigned int tagged_writepages:1;
@@ -130,8 +129,8 @@ out_unlock:
 }
 
 static void
-__bdi_start_writeback(struct backing_dev_info *bdi,
- long nr_pages, bool range_cyclic, enum wb_reason reason)
+__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
+ bool range_cyclic, enum wb_reason reason)
 {
struct wb_writeback_work *work;
 
@@ -150,7 +149,6 @@ __bdi_start_writeback(struct backing_dev_info *bdi,
work->nr_pages  = nr_pages;
work->range_cyclic = range_cyclic;
work->reason= reason;
-   work->ub= NULL;
 
bdi_queue_work(bdi, work);
 }
@@ -673,7 +671,6 @@ static long writeback_sb_inodes(struct super_block *sb,
.range_cyclic   = work->range_cyclic,
.range_start= 0,
.range_end  = LLONG_MAX,
-   .wb_ub  = work->ub,
};
unsigned long start_time = jiffies;
long write_chunk;
@@ -707,14 +704,6 @@ static long writeback_sb_inodes(struct super_block *sb,
 * kind writeout is handled by the freer.
 */
spin_lock(&inode->i_lock);
-   if (wbc.wb_ub && !wb->bdi->dirty_exceeded &&
-   (inode->i_mapping->dirtied_ub != wbc.wb_ub) &&
-   (inode->i_state & I_DIRTY) == I_DIRTY_PAGES &&
-   ub_should_skip_writeback(wbc.wb_ub, inode)) {
-   requeue_io(inode, wb);
-   continue;
-   }
-
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
spin_unlock(&inode->i_lock);
redirty_tail(inode, wb);
@@ -913,12 +902,9 @@ static long wb_writeback(struct bdi_writeback *wb,
 
/*
 * For background writeout, stop when we are below the
-* background dirty threshold. For filtered background
-* writeback we write all inodes dirtied before us,
-* because we cannot dereference this ub pointer.
+* background dirty threshold
 */
-   if (work->for_background && !work->ub &&
-   !over_bground_thresh(wb->bdi))
+   if (work->for_background && !over_bground_thresh(wb->bdi))
break;
 
/*
@@ -1371,7 +1357,7 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
-static void wait_sb_inodes(struct super_block *sb, struct user_beancounter *ub)
+static void wait_sb_inodes(struct super_block *sb)
 {
struct inode *inode, *old_inode = NULL;
 
@@ -1399,11 +1385,6 @@ static void wait_sb_inodes(struct super_block *sb, struct user_beancounter *ub)
spin_unlock(&inode->i_lock);
continue;
}
-   if (ub && (mapping->dirtied_ub != ub) &&
-   (inode->i_state & I_DIRTY) == I_DIRTY_PAGES) {
-   spin_unlock(&inode->i_lock);
-   continue;
-   }
__iget(inode);
spin_unlock(&inode->i_lock);
spin_unlock(&inode_sb_list_lock);
@@ -1522,12 +1503,11 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
  * This function writes and waits on any dirty inode belonging to this
  * super_block.
  */
-void sync_inodes_sb_ub(struct super_block *sb, struct user_beancounter *ub)
+void sync_inodes_sb(struct super_block *sb)
 {
DECLARE_COMPLETION_ONSTACK(done);
struct wb_writeback_work work = {
.sb = sb,
-   .ub = ub,
.sync_mode  = WB_SYNC_ALL,

[Devel] [PATCH RHEL7 COMMIT] ub: zap unused socket accounting bits

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 86ade5f1aad07dbdee7324c78e506f0240cefa18
Author: Vladimir Davydov 
Date:   Thu Sep 3 14:34:11 2015 +0400

ub: zap unused socket accounting bits

It should have been done in the scope of c73bfca7594c ("bc: Rip old
network buffers and sockets accounting").

Signed-off-by: Vladimir Davydov 
---
 include/bc/beancounter.h | 24 
 kernel/bc/beancounter.c  |  8 
 2 files changed, 32 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index 9180f2a..3c32ddf 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -49,26 +49,11 @@
  */
 
 struct task_beancounter;
-struct sock_beancounter;
 
 struct page_private {
unsigned long   ubp_tmpfs_respages;
 };
 
-struct sock_private {
-   unsigned long   ubp_rmem_thres;
-   unsigned long   ubp_wmem_pressure;
-   unsigned long   ubp_maxadvmss;
-   unsigned long   ubp_rmem_pressure;
-   int ubp_tw_count;
-#define UB_RMEM_EXPAND  0
-#define UB_RMEM_KEEP1
-#define UB_RMEM_SHRINK  2
-   struct list_headubp_other_socks;
-   struct list_headubp_tcp_socks;
-   struct percpu_counter   ubp_orphan_count;
-};
-
 struct ub_percpu_struct {
int dirty_pages;
int writeback_pages;
@@ -129,15 +114,6 @@ struct user_beancounter {
 
struct page_private ppriv;
 #define ub_tmpfs_respages  ppriv.ubp_tmpfs_respages
-   struct sock_private spriv;
-#define ub_rmem_thres  spriv.ubp_rmem_thres
-#define ub_maxadvmss   spriv.ubp_maxadvmss
-#define ub_rmem_pressure   spriv.ubp_rmem_pressure
-#define ub_wmem_pressure   spriv.ubp_wmem_pressure
-#define ub_tcp_sk_list spriv.ubp_tcp_socks
-#define ub_other_sk_list   spriv.ubp_other_socks
-#define ub_orphan_countspriv.ubp_orphan_count
-#define ub_tw_countspriv.ubp_tw_count
 
atomic_long_t   dirty_pages;
atomic_long_t   writeback_pages;
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index 6b5ed78..8edef0d 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -347,9 +347,6 @@ static struct user_beancounter *alloc_ub(const char *name)
if (!new_ub->ub_name)
goto fail_name;
 
-   if (percpu_counter_init(&new_ub->ub_orphan_count, 0))
-   goto fail_pcpu;
-
new_ub->ub_percpu = alloc_percpu(struct ub_percpu_struct);
if (new_ub->ub_percpu == NULL)
goto fail_free;
@@ -357,8 +354,6 @@ static struct user_beancounter *alloc_ub(const char *name)
return new_ub;
 
 fail_free:
-   percpu_counter_destroy(&new_ub->ub_orphan_count);
-fail_pcpu:
kfree(new_ub->ub_name);
 fail_name:
kfree(new_ub);
@@ -367,7 +362,6 @@ fail_name:
 
 static inline void free_ub(struct user_beancounter *ub)
 {
-   percpu_counter_destroy(&ub->ub_orphan_count);
free_percpu(ub->ub_percpu);
kfree(ub->ub_store);
kfree(ub->private_data2);
@@ -1068,8 +1062,6 @@ static void init_beancounter_struct(struct user_beancounter *ub)
 {
ub->ub_magic = UB_MAGIC;
spin_lock_init(&ub->ub_lock);
-   INIT_LIST_HEAD(&ub->ub_tcp_sk_list);
-   INIT_LIST_HEAD(&ub->ub_other_sk_list);
 }
 
 static void init_beancounter_nolimits(struct user_beancounter *ub)

[Devel] [PATCH RHEL7 COMMIT] ub: zap ub_tmpfs_respages

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit fb50f65f2c49b6a73e369153a660ff17227223b0
Author: Vladimir Davydov 
Date:   Thu Sep 3 14:34:34 2015 +0400

ub: zap ub_tmpfs_respages

It is always 0 in both Vz7 and PCS6, so drop it.

Signed-off-by: Vladimir Davydov 
---
 include/bc/beancounter.h |  7 ---
 kernel/bc/beancounter.c  |  1 -
 kernel/bc/vm_pages.c | 19 ---
 3 files changed, 27 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index ec2ba18..a6241f6 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -50,10 +50,6 @@
 
 struct task_beancounter;
 
-struct page_private {
-   unsigned long   ubp_tmpfs_respages;
-};
-
 struct ub_percpu_struct {
int dirty_pages;
int writeback_pages;
@@ -106,9 +102,6 @@ struct user_beancounter {
 
struct ratelimit_state  ub_ratelimit;
 
-   struct page_private ppriv;
-#define ub_tmpfs_respages  ppriv.ubp_tmpfs_respages
-
atomic_long_t   dirty_pages;
atomic_long_t   writeback_pages;
atomic_long_t   wb_requests;
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index 8edef0d..d0ab65a 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -492,7 +492,6 @@ static inline int bc_verify_held(struct user_beancounter *ub)
__ub_stat_get(ub, dirty_pages));
clean &= verify_res(ub, "writeback_pages",
__ub_stat_get(ub, writeback_pages));
-   clean &= verify_res(ub, "tmpfs_respages", ub->ub_tmpfs_respages);
 
return clean;
 }
diff --git a/kernel/bc/vm_pages.c b/kernel/bc/vm_pages.c
index 23e8742..7529899 100644
--- a/kernel/bc/vm_pages.c
+++ b/kernel/bc/vm_pages.c
@@ -119,22 +119,6 @@ void ub_lockedshm_uncharge(struct shmem_inode_info *shi, unsigned long size)
uncharge_beancounter(ub, UB_LOCKEDPAGES, size >> PAGE_SHIFT);
 }
 
-static inline void do_ub_tmpfs_respages_sub(struct user_beancounter *ub,
-   unsigned long size)
-{
-   unsigned long flags;
-
-   spin_lock_irqsave(&ub->ub_lock, flags);
-   /* catch possible overflow */
-   if (ub->ub_tmpfs_respages < size) {
-   uncharge_warn(ub, "tmpfs_respages",
-   size, ub->ub_tmpfs_respages);
-   size = ub->ub_tmpfs_respages;
-   }
-   ub->ub_tmpfs_respages -= size;
-   spin_unlock_irqrestore(&ub->ub_lock, flags);
-}
-
 static int bc_fill_sysinfo(struct user_beancounter *ub,
unsigned long meminfo_val, struct sysinfo *si)
 {
@@ -269,9 +253,6 @@ static int bc_vmaux_show(struct seq_file *f, void *v)
 
ub_sync_memcg(ub);
 
-   seq_printf(f, bc_proc_lu_fmt, "tmpfs_respages",
-   ub->ub_tmpfs_respages);
-
seq_printf(f, bc_proc_lu_fmt, "ram", ub->ub_parms[UB_PHYSPAGES].held);
 
return 0;

[Devel] [PATCH RHEL7 COMMIT] ms/crypto/ghash-intel: specify context size for ghash async algorithm

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 94edfce3b5486a560f9a469659038ffa6310a621
Author: Andrey Ryabinin 
Date:   Thu Sep 3 13:25:49 2015 +0400

ms/crypto/ghash-intel: specify context size for ghash async algorithm

Currently the context size (cra_ctxsize) isn't specified for
ghash_async_alg, which means it's zero. Thus crypto_create_tfm()
doesn't allocate the space needed for ghash_async_ctx, so any
read or write to the ctx is invalid.
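
As a rough illustration of the failure mode described above, the following
userspace sketch models a transform allocation whose context area is sized
from the algorithm's declared context size. All names here (model_alg,
model_tfm, model_async_ctx, model_create_tfm, model_ctx_fits) are
hypothetical stand-ins, not kernel APIs, and the real crypto_create_tfm()
is more involved; the only assumption carried over is that the context
area is sized from cra_ctxsize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the algorithm description. */
struct model_alg {
    const char *name;
    size_t ctxsize;          /* plays the role of cra_ctxsize */
};

/* Hypothetical transform: a header followed by the per-transform context. */
struct model_tfm {
    const struct model_alg *alg;
    size_t ctx_bytes;        /* bytes actually reserved for the context */
    unsigned char ctx[];     /* context area, ctx_bytes long */
};

/* Stand-in for ghash_async_ctx; the field is illustrative only. */
struct model_async_ctx {
    void *cryptd_tfm;
};

/* Allocate header + ctxsize bytes; with ctxsize == 0 no context room exists. */
static struct model_tfm *model_create_tfm(const struct model_alg *alg)
{
    struct model_tfm *tfm = calloc(1, sizeof(*tfm) + alg->ctxsize);

    if (!tfm)
        return NULL;
    tfm->alg = alg;
    tfm->ctx_bytes = alg->ctxsize;
    return tfm;
}

/* Returns 1 if a context of 'need' bytes fits inside the transform. */
static int model_ctx_fits(const struct model_tfm *tfm, size_t need)
{
    assert(tfm != NULL);
    return tfm->ctx_bytes >= need;
}
```

In this model, an algorithm declared with ctxsize 0 yields a transform where
model_ctx_fits(tfm, sizeof(struct model_async_ctx)) is 0, while declaring
ctxsize as sizeof(struct model_async_ctx) makes it 1, which mirrors the
one-line fix in this patch.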

https://jira.sw.ru/browse/PSBM-38669

Signed-off-by: Andrey Ryabinin 

khorenko@: the patch is to be sent to mainstream as well.
---
 arch/x86/crypto/ghash-clmulni-intel_glue.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/crypto/ghash-clmulni-intel_glue.c b/arch/x86/crypto/ghash-clmulni-intel_glue.c
index 6759dd1..11e213e 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_glue.c
+++ b/arch/x86/crypto/ghash-clmulni-intel_glue.c
@@ -283,6 +283,7 @@ static struct ahash_alg ghash_async_alg = {
.cra_name   = "ghash",
.cra_driver_name= "ghash-clmulni",
.cra_priority   = 400,
+   .cra_ctxsize= sizeof(struct ghash_async_ctx),
.cra_flags  = CRYPTO_ALG_TYPE_AHASH | CRYPTO_ALG_ASYNC,
.cra_blocksize  = GHASH_BLOCK_SIZE,
.cra_type   = &crypto_ahash_type,

[Devel] [PATCH RHEL7 COMMIT] Revert diff-writeback-throttle-writer-when-local-BDI-threshold-is-hit bits

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 56782e5a6d798078f3523e97f6f2eb8277028b91
Author: Vladimir Davydov 
Date:   Thu Sep 3 13:53:01 2015 +0400

Revert diff-writeback-throttle-writer-when-local-BDI-threshold-is-hit bits

This was brought by the initial commit 2a8b5de95918, but it is
incomplete - the following hunk patching balance_dirty_pages was lost:

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 003b68e..a58795c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -546,7 +546,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>* catch-up. This avoids (excessively) small writeouts
>* when the bdi limits are ramping up.
>*/
> - if (nr_reclaimable + nr_writeback <
> + if (bdi_cap_account_writeback(bdi) &&
> + nr_reclaimable + nr_writeback <
>   (background_thresh + dirty_thresh) / 2 &&
>   ub_dirty + ub_writeback <
>   (ub_background_thresh + ub_thresh) / 2)

I've filed a separate issue for porting it:

https://jira.sw.ru/browse/PSBM-39167

Signed-off-by: Vladimir Davydov 
---
 fs/fs-writeback.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9cdcc28..66586a4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -843,9 +843,6 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
 {
unsigned long background_thresh, dirty_thresh;
 
-   if (!bdi_cap_account_writeback(bdi) && bdi->dirty_exceeded)
-   return true;
-
global_dirty_limits(&background_thresh, &dirty_thresh);
 
if (global_page_state(NR_FILE_DIRTY) +

[Devel] [PATCH RHEL7 COMMIT] ub: drop swapin/swapout stats

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 681603761a5a2c380c8c3c2f2d3c878aa5a267a0
Author: Vladimir Davydov 
Date:   Thu Sep 3 14:34:22 2015 +0400

ub: drop swapin/swapout stats

Swapin/swapout can no longer be accounted by beancounters, because
memory management has moved to memcg. Right now, memcg does not provide
these stats, so this patch simply drops them from /proc/vmstat inside a
container and from /proc/bc/CTID/vmaux on the host. If anybody requests
these counters, they should be reimplemented in the scope of memcg and
brought back.

https://jira.sw.ru/browse/PSBM-39327

Signed-off-by: Vladimir Davydov 
---
 include/bc/beancounter.h |  6 --
 kernel/bc/vm_pages.c | 31 +--
 2 files changed, 1 insertion(+), 36 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index 3c32ddf..ec2ba18 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -63,12 +63,6 @@ struct ub_percpu_struct {
unsigned long fuse_requests;
unsigned long fuse_bytes;
 
-   unsigned long swapin;
-   unsigned long swapout;
-
-   unsigned long vswapin;
-   unsigned long vswapout;
-
 #ifdef CONFIG_BC_IO_ACCOUNTING
unsigned long async_write_complete;
unsigned long async_write_canceled;
diff --git a/kernel/bc/vm_pages.c b/kernel/bc/vm_pages.c
index c52d34f..23e8742 100644
--- a/kernel/bc/vm_pages.c
+++ b/kernel/bc/vm_pages.c
@@ -220,18 +220,7 @@ out:
 
 static int bc_fill_vmstat(struct user_beancounter *ub, unsigned long *stat)
 {
-   int cpu;
-
-   for_each_possible_cpu(cpu) {
-   struct ub_percpu_struct *pcpu = ub_percpu(ub, cpu);
-
-   stat[NR_VM_ZONE_STAT_ITEMS + PSWPIN]+= pcpu->swapin;
-   stat[NR_VM_ZONE_STAT_ITEMS + PSWPOUT]   += pcpu->swapout;
-
-   stat[NR_VM_ZONE_STAT_ITEMS + PSWPIN]+= pcpu->vswapin;
-   stat[NR_VM_ZONE_STAT_ITEMS + PSWPOUT]   += pcpu->vswapout;
-   }
-
+   /* FIXME: show swapin/swapout? */
return NOTIFY_OK;
 }
 
@@ -275,32 +264,14 @@ module_exit(fini_vmguar_notifier);
 static int bc_vmaux_show(struct seq_file *f, void *v)
 {
struct user_beancounter *ub;
-   struct ub_percpu_struct *ub_pcpu;
-   unsigned long swapin, swapout, vswapin, vswapout;
-   int i;
 
ub = seq_beancounter(f);
 
ub_sync_memcg(ub);
 
-   swapin = swapout = vswapin = vswapout = 0;
-   for_each_possible_cpu(i) {
-   ub_pcpu = ub_percpu(ub, i);
-   swapin += ub_pcpu->swapin;
-   swapout += ub_pcpu->swapout;
-   vswapin += ub_pcpu->vswapin;
-   vswapout += ub_pcpu->vswapout;
-   }
-
seq_printf(f, bc_proc_lu_fmt, "tmpfs_respages",
ub->ub_tmpfs_respages);
 
-   seq_printf(f, bc_proc_lu_fmt, "swapin", swapin);
-   seq_printf(f, bc_proc_lu_fmt, "swapout", swapout);
-
-   seq_printf(f, bc_proc_lu_fmt, "vswapin", vswapin);
-   seq_printf(f, bc_proc_lu_fmt, "vswapout", vswapout);
-
seq_printf(f, bc_proc_lu_fmt, "ram", ub->ub_parms[UB_PHYSPAGES].held);
 
return 0;

Re: [Devel] [PATCH rh7 0/4] memcg/kmem: account some non-slab objects

2015-09-03 Thread Konstantin Khorenko


And another patchset for your attention.

--
Best regards,

Konstantin Khorenko,
Virtuozzo Linux Kernel Team

On 08/26/2015 07:28 PM, Vladimir Davydov wrote:

This patch set implements memcg/kmem accounting for vmalloc, pipe
buffers, and page tables. I'll probably try to submit these patches
(slightly modified) upstream after v4.2 has been released.

Vladimir Davydov (4):
   vmalloc: account to memcg/kmem
   fs: account anon pipe buffers to memcg/kmem
   gfp: add __get_free_kmem_pages helper
   arch: x86: charge page tables to memcg/kmem

  arch/x86/include/asm/pgalloc.h | 13 +++--
  arch/x86/mm/pgtable.c  | 24 +++-
  fs/pipe.c  | 13 -
  include/linux/gfp.h|  1 +
  mm/page_alloc.c| 12 
  mm/vmalloc.c   |  6 +++---
  6 files changed, 46 insertions(+), 23 deletions(-)



[Devel] [PATCH RHEL7 COMMIT] ub: zap ub_dirty_pages

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit d6328c465a4b45ab97e55f17b4896da32986c1bd
Author: Vladimir Davydov 
Date:   Thu Sep 3 15:26:24 2015 +0400

ub: zap ub_dirty_pages

It is not used anywhere.

Signed-off-by: Vladimir Davydov 
---
 include/bc/io_acct.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/include/bc/io_acct.h b/include/bc/io_acct.h
index 5b51853..fa7afb1 100644
--- a/include/bc/io_acct.h
+++ b/include/bc/io_acct.h
@@ -56,8 +56,6 @@ extern void ub_io_account_cancel(struct address_space *mapping);
 extern void ub_io_writeback_inc(struct address_space *mapping);
 extern void ub_io_writeback_dec(struct address_space *mapping);
 
-#define ub_dirty_pages(ub) ub_stat_get(ub, dirty_pages)
-
 extern int ub_dirty_limits(unsigned long *pbackground,
   long *pdirty, struct user_beancounter *ub);
 
@@ -101,11 +99,6 @@ static inline void ub_io_writeback_dec(struct address_space *mapping)
 {
 }
 
-static inline unsigned long ub_dirty_pages(struct user_beancounter *ub)
-{
-   return 0;
-}
-
 static inline int ub_dirty_limits(unsigned long *pbackground,
  long *pdirty, struct user_beancounter *ub)
 {

[Devel] [PATCH RHEL7 COMMIT] ub: rename private_data2 to iolimit

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 2b26765194fd30316e5b34aa72a2c5043bd11e8d
Author: Vladimir Davydov 
Date:   Thu Sep 3 15:29:20 2015 +0400

ub: rename private_data2 to iolimit

ub->private_data2 is only used for storing the iolimit housekeeping
struct, so name it accordingly.

Signed-off-by: Vladimir Davydov 
---
 include/bc/beancounter.h |  2 +-
 kernel/bc/beancounter.c  |  2 +-
 kernel/ve/vziolimit.c| 14 +++---
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/bc/beancounter.h b/include/bc/beancounter.h
index a6241f6..5ba999e 100644
--- a/include/bc/beancounter.h
+++ b/include/bc/beancounter.h
@@ -107,7 +107,7 @@ struct user_beancounter {
atomic_long_t   wb_requests;
atomic_long_t   wb_sectors;
 
-   void*private_data2;
+   void*iolimit;
 
/* resources statistic and settings */
struct ubparm   ub_parms[UB_RESOURCES];
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index f9e7fea..90fc1dd 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -364,8 +364,8 @@ static inline void free_ub(struct user_beancounter *ub)
 {
free_percpu(ub->ub_percpu);
kfree(ub->ub_store);
-   kfree(ub->private_data2);
kfree(ub->ub_name);
+   kfree(ub->iolimit);
kfree(ub);
 }
 
diff --git a/kernel/ve/vziolimit.c b/kernel/ve/vziolimit.c
index 1da233d..628ec80 100644
--- a/kernel/ve/vziolimit.c
+++ b/kernel/ve/vziolimit.c
@@ -163,7 +163,7 @@ static int iolimit_virtinfo(struct vnotifier_block *nb,
unsigned long cmd, void *arg, int old_ret)
 {
struct user_beancounter *ub = get_exec_ub();
-   struct iolimit *iolimit = ub->private_data2;
+   struct iolimit *iolimit = ub->iolimit;
unsigned long flags, timeout;
struct request_queue *q;
 
@@ -257,7 +257,7 @@ static void throttle_state(struct user_beancounter *ub,
 
 static struct iolimit *iolimit_get(struct user_beancounter *ub)
 {
-   struct iolimit *iolimit = ub->private_data2;
+   struct iolimit *iolimit = ub->iolimit;
 
if (iolimit)
return iolimit;
@@ -268,11 +268,11 @@ static struct iolimit *iolimit_get(struct user_beancounter *ub)
init_waitqueue_head(&iolimit->wq);
 
spin_lock_irq(&ub->ub_lock);
-   if (ub->private_data2) {
+   if (ub->iolimit) {
kfree(iolimit);
-   iolimit = ub->private_data2;
+   iolimit = ub->iolimit;
} else
-   ub->private_data2 = iolimit;
+   ub->iolimit = iolimit;
spin_unlock_irq(&ub->ub_lock);
 
return iolimit;
@@ -296,7 +296,7 @@ static int iolimit_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
if (!ub)
return -ENOENT;
 
-   iolimit = ub->private_data2;
+   iolimit = ub->iolimit;
 
switch (cmd) {
case VZCTL_SET_IOLIMIT:
@@ -365,7 +365,7 @@ static ssize_t iolimit_cgroup_read(struct cgroup *cg, struct cftype *cft,
  size_t nbytes, loff_t *ppos)
 {
struct user_beancounter *ub = cgroup_ub(cg);
-   struct iolimit *iolimit = ub->private_data2;
+   struct iolimit *iolimit = ub->iolimit;
unsigned long val = 0;
int len;
char str[32];

Re: [Devel] [PATCH RHEL7 COMMIT] ub: drop swapin/swapout stats

2015-09-03 Thread Konstantin Khorenko


If anybody really checks swapin/swapout inside Containers,
please let us know the use case: how do you use these stats?

Thank you.

--
Konstantin

On 09/03/2015 01:34 PM, Konstantin Khorenko wrote:

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit 681603761a5a2c380c8c3c2f2d3c878aa5a267a0
Author: Vladimir Davydov <vdavy...@parallels.com>
Date:   Thu Sep 3 14:34:22 2015 +0400

 ub: drop swapin/swapout stats

 Swapin/swapout cannot be accounted by beancounters anymore, because
 memory management moved to memcg. Right now, these stats are not
 provided by memcg, so this patch simply drops them from /proc/vmstat
 inside container and from /proc/bc/CTID/vmaux on the host. If anybody
 requests these counters, they should be reimplemented in the scope of
 memcg and returned back.

 https://jira.sw.ru/browse/PSBM-39327

 Signed-off-by: Vladimir Davydov <vdavy...@parallels.com>


[Devel] [PATCH RHEL7 COMMIT] ploop: use GFP_NOIO in ploop_make_request

2015-09-03 Thread Konstantin Khorenko

The commit is pushed to "branch-rh7-3.10.0-229.7.2-ovz" and will appear at 
https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-229.7.2.vz7.6.6
-->
commit af3295b412e5b1fefb22857d26e1332cafc186d5
Author: Vladimir Davydov 
Date:   Thu Sep 3 15:37:24 2015 +0400

ploop: use GFP_NOIO in ploop_make_request

Currently, we use GFP_NOFS, which may result in a deadlock as follows:

filemap_fault
 do_mpage_readpage
  submit_bio
   generic_make_request initializes current->bio_list
calls make_request_fn
ploop_make_request
 bio_alloc(GFP_NOFS)
  kmem_cache_alloc
   memcg_charge_kmem
try_to_free_mem_cgroup_pages
 swap_writepage
  generic_make_request  puts bio on current->bio_list
try_to_free_mem_cgroup_pages
 wait_on_page_writeback

The wait_on_page_writeback will never complete then, because the
corresponding bio is on current->bio_list and for it to get to the queue
we must return from ploop_make_request first.
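
The queuing behaviour behind this deadlock can be sketched in userspace C.
This is a simplified model under stated assumptions, not kernel code: every
name prefixed with model_ is hypothetical, model_bio_list stands in for
current->bio_list, and the driver hook stands in for ploop_make_request
triggering reclaim that submits swap I/O. It only shows that a bio
submitted from inside the hook is parked on the list and does not reach
the driver until the hook returns, so waiting for it inside the hook cannot
succeed. GFP_NOIO avoids the situation because, unlike GFP_NOFS, it does
not allow reclaim to start I/O.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_BIOS 8

struct model_bio {
    int id;
    int submitted;   /* set once the driver hook has been called for this bio */
};

/* Stand-in for current->bio_list: non-NULL only while a submission is active. */
static struct model_bio **model_bio_list;
static size_t model_bio_count;

/* Stand-in for the queue's make_request_fn. */
static void (*model_make_request_fn)(struct model_bio *bio);

static void model_generic_make_request(struct model_bio *bio)
{
    struct model_bio *on_stack[MODEL_MAX_BIOS];
    size_t next = 0;

    if (model_bio_list) {
        /* Nested call: only queue the bio; it is handled after the hook returns. */
        assert(model_bio_count < MODEL_MAX_BIOS);
        model_bio_list[model_bio_count++] = bio;
        return;
    }

    model_bio_list = on_stack;
    model_bio_count = 0;
    do {
        bio->submitted = 1;
        model_make_request_fn(bio);
        bio = next < model_bio_count ? on_stack[next++] : NULL;
    } while (bio);
    model_bio_list = NULL;
}

static struct model_bio model_swap_bio = { 2, 0 };
static int model_wait_would_block;

/* Hook modelling a driver whose allocation enters reclaim and submits swap I/O. */
static void model_driver_hook(struct model_bio *bio)
{
    if (bio->id != 1)
        return;
    model_generic_make_request(&model_swap_bio);          /* like swap_writepage */
    /* A wait for the swap bio here would sleep forever: it is still only queued. */
    model_wait_would_block = !model_swap_bio.submitted;
}
```

Setting model_make_request_fn to model_driver_hook and submitting a bio with
id 1 leaves model_wait_would_block set, while the swap bio is marked
submitted only after the hook has returned.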

The stack trace of a hung task:

[] sleep_on_page+0xe/0x20
[] wait_on_page_bit+0x86/0xb0
[] shrink_page_list+0x6e2/0xaf0
[] shrink_inactive_list+0x1cb/0x610
[] shrink_lruvec+0x395/0x790
[] shrink_zone+0x181/0x350
[] do_try_to_free_pages+0x170/0x530
[] try_to_free_mem_cgroup_pages+0xb6/0x140
[] __mem_cgroup_try_charge+0x1de/0xd70
[] memcg_charge_kmem+0x9b/0x100
[] __memcg_charge_slab+0x3b/0x90
[] new_slab+0x264/0x3f0
[] __slab_alloc+0x315/0x48f
[] kmem_cache_alloc+0x1cc/0x210
[] mempool_alloc_slab+0x15/0x20
[] mempool_alloc+0x69/0x170
[] bvec_alloc+0x92/0x120
[] bio_alloc_bioset+0x1e8/0x2e0
[] ploop_make_request+0x2a6/0xac0 [ploop]
[] generic_make_request+0xe2/0x130
[] submit_bio+0x77/0x1c0
[] do_mpage_readpage+0x37f/0x6e0
[] mpage_readpages+0xeb/0x160
[] ext4_readpages+0x3c/0x40 [ext4]
[] __do_page_cache_readahead+0x1e0/0x260
[] ra_submit+0x21/0x30
[] filemap_fault+0x321/0x4b0
[] __do_fault+0x8a/0x560
[] handle_mm_fault+0x3d0/0xd80
[] __do_page_fault+0x15e/0x530
[] do_page_fault+0x1a/0x70
[] page_fault+0x28/0x30

https://jira.sw.ru/browse/PSBM-38842

Signed-off-by: Vladimir Davydov 
---
 drivers/block/ploop/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 97e75a7..7eb9865 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -717,7 +717,7 @@ preallocate_bio(struct bio * orig_bio, struct ploop_device * plo)
}
 
if (nbio == NULL)
-   nbio = bio_alloc(GFP_NOFS, max(orig_bio->bi_max_vecs, block_vecs(plo)));
+   nbio = bio_alloc(GFP_NOIO, max(orig_bio->bi_max_vecs, block_vecs(plo)));
return nbio;
 }
 
@@ -852,7 +852,7 @@ static void ploop_make_request(struct request_queue *q, struct bio *bio)
 
if (!current->io_context) {
struct io_context *ioc;
-   ioc = get_task_io_context(current, GFP_NOFS, NUMA_NO_NODE);
+   ioc = get_task_io_context(current, GFP_NOIO, NUMA_NO_NODE);
if (ioc)
put_io_context(ioc);
}
