[Devel] [PATCH RHEL7 COMMIT] docker: Revert "vfs: take stat's dev from mnt->sb"

2016-08-24 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.28.2.vz7.17.3
-->
commit 11a02aff96d8a6140b5ca71eacaa6e9b995db7d5
Author: Pavel Tikhomirov 
Date:   Wed Aug 24 20:19:01 2016 +0400

docker: Revert "vfs: take stat's dev from mnt->sb"

This reverts commit ecfa1b0f7ba985e200e807941b5838943b266cb3.

For all non-directory objects on overlayfs, stat(2) should report the
st_dev of the lower or upper filesystem that actually provides the
object (see Documentation/filesystems/overlayfs.txt). But in our case a
file on overlayfs reports the st_dev of overlayfs itself:

E.g., on a VZ7 host the device is reported as 57 (0:57, the overlay
mount) but should be 64768 (253:0, the underlying ext4):
mkdir /lower /upper /merged /work
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged
touch /merged/file
stat /merged/file | grep Device | awk '{print $1$2}'
  Device:39h/57d
cat /proc/self/mountinfo
  63 1 253:0 / / rw,relatime shared:1 - ext4
/dev/mapper/virtuozzo_pcs7-root rw,data=ordered
  149 63 0:57 / /merged rw,relatime shared:81 - overlay overlay
rw,lowerdir=/lower,upperdir=/upper,workdir=/work

The reason is in the sys_stat->vfs_stat->vfs_fstatat->vfs_getattr path:
1) Call ovl_getattr()->ovl_path_real() - find real object's path
2) Call ovl_getattr()->vfs_getattr() - find real object's s_dev
3) Replace it with overlay's s_dev, which is wrong.

We do not have simfs anymore, so remove step (3) and stat will be
correct again.

Note: we need this for Docker - when stat and fstat report different
s_dev for the same file, ldconfig in glibc breaks, and with it the
docker-ui tests.
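
As a quick sanity check, here is a minimal userspace sketch (the path is
illustrative) comparing the st_dev reported by stat(2) and fstat(2) for
the same file - the pair that ldconfig relies on being equal:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/merged/file";
	struct stat st1, st2;
	int fd;

	if (stat(path, &st1) < 0) {
		perror("stat");
		return 1;
	}
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (fstat(fd, &st2) < 0) {
		perror("fstat");
		return 1;
	}
	close(fd);

	/* With the revert applied, the two values are expected to match. */
	printf("stat st_dev=%lx fstat st_dev=%lx %s\n",
	       (unsigned long)st1.st_dev, (unsigned long)st2.st_dev,
	       st1.st_dev == st2.st_dev ? "OK" : "MISMATCH");
	return st1.st_dev != st2.st_dev;
}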

https://jira.sw.ru/browse/PSBM-51255

Signed-off-by: Pavel Tikhomirov 
Reviewed-by: Kirill Tkhai 
---
 fs/stat.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index a423f6351e27..d0ea7ef75e26 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -14,7 +14,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -48,14 +47,10 @@ int vfs_getattr(struct path *path, struct kstat *stat)
return retval;
 
if (inode->i_op->getattr)
-   retval = inode->i_op->getattr(path->mnt, path->dentry, stat);
-   else
-   generic_fillattr(inode, stat);
+   return inode->i_op->getattr(path->mnt, path->dentry, stat);
 
-   if (!retval)
-   stat->dev = path->mnt->mnt_sb->s_dev;
-
-   return retval;
+   generic_fillattr(inode, stat);
+   return 0;
 }
 
 EXPORT_SYMBOL(vfs_getattr);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] tty: Fix task hang if one of peers is sitting in read

2016-08-24 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.28.2.vz7.17.3
-->
commit a7fb20c4cd83a1add67efe3edaa8500cc6edc6d1
Author: Cyrill Gorcunov 
Date:   Wed Aug 24 16:01:42 2016 +0400

tty: Fix task hang if one of peers is sitting in read

We reverted the former fix (ae93b8e96941c9ad) in commit 9539e4b2c5eee61f,
but the changes ported by the RH team turned out to be insufficient:
locking the ldisc of both pty peers on hangup can still block forever if
the other peer is sitting in read. So bring ae93b8e96941c9ad back and
lock only the tty being hung up.

https://jira.sw.ru/browse/PSBM-51273

Signed-off-by: Cyrill Gorcunov 
CC: Igor Sukhih 
    CC: Vladimir Davydov 
CC: Konstantin Khorenko 
---
 drivers/tty/tty_ldisc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index fd2b20d6af80..4c82aaad8566 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -685,7 +685,7 @@ void tty_ldisc_hangup(struct tty_struct *tty)
 *
 * Avoid racing set_ldisc or tty_ldisc_release
 */
-   tty_ldisc_lock_pair(tty, tty->link);
+   tty_ldisc_lock(tty, MAX_SCHEDULE_TIMEOUT);
 
if (tty->ldisc) {
 
@@ -707,7 +707,7 @@ void tty_ldisc_hangup(struct tty_struct *tty)
WARN_ON(tty_ldisc_open(tty, tty->ldisc));
}
}
-   tty_ldisc_enable_pair(tty, tty->link);
+   tty_ldisc_unlock(tty);
if (reset)
tty_reset_termios(tty);
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: add memory.numa_migrate file

2016-08-23 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.28.2.vz7.17.2
-->
commit 9c08b7e913cf502ea61ca00814247488bdb1f65f
Author: Vladimir Davydov 
Date:   Tue Aug 23 17:08:57 2016 +0400

mm: memcontrol: add memory.numa_migrate file

The new file is supposed to be used for migrating pages accounted to a
memory cgroup to a particular set of numa nodes. The reason to add it is
that currently there's no API for migrating unmapped file pages used for
storing page cache (neither the migrate_pages syscall nor the cpuset
subsystem provides this functionality).

The file is added to the memory cgroup and has the following format:

  NODELIST[ MAX_SCAN]

where NODELIST is a comma-separated list of ranges N1-N2 specifying the set
of nodes to migrate pages of this cgroup to, and the optional MAX_SCAN
imposes a limit on the number of pages that can be migrated in one go.

The call may be interrupted by a signal, in which case -EINTR is returned.
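
A minimal usage sketch, assuming a hypothetical cgroup at
/sys/fs/cgroup/memory/test (path and numbers are illustrative): ask the
kernel to migrate up to 1000 pages of the cgroup to nodes 0-1. The same
can be done from the shell with echo, as in the softlockup report later
in this thread.

#include <stdio.h>

int main(void)
{
	/* Hypothetical cgroup path; adjust to the actual hierarchy. */
	FILE *f = fopen("/sys/fs/cgroup/memory/test/memory.numa_migrate", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* "0-1" is the NODELIST, "1000" the optional MAX_SCAN. */
	if (fprintf(f, "0-1 1000") < 0)
		perror("fprintf");
	return fclose(f) != 0;
}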

https://jira.sw.ru/browse/PSBM-50875
    
Signed-off-by: Vladimir Davydov 
Reviewed-by: Andrey Ryabinin 
Cc: Igor Redko 
Cc: Konstantin Neumoin 
---
 mm/memcontrol.c | 226 
 1 file changed, 226 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0d0e31e0917e..69189490ed68 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include <linux/migrate.h>
 #include "internal.h"
 #include 
 #include 
@@ -5697,6 +5698,226 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
seq_putc(m, '\n');
return 0;
 }
+
+/*
+ * memcg_numa_migrate_new_page() private argument. @target_nodes specifies the
+ * set of nodes to allocate pages from. @current_node is the current preferable
+ * node, it gets rotated after each allocation.
+ */
+struct memcg_numa_migrate_struct {
+   nodemask_t *target_nodes;
+   int current_node;
+};
+
+/*
+ * Used as an argument for migrate_pages(). Allocated pages are spread evenly
+ * among destination nodes.
+ */
+static struct page *memcg_numa_migrate_new_page(struct page *page,
+   unsigned long private, int **result)
+{
+   struct memcg_numa_migrate_struct *ms = (void *)private;
+   gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN;
+
+   ms->current_node = next_node(ms->current_node, *ms->target_nodes);
+   if (ms->current_node >= MAX_NUMNODES) {
+   ms->current_node = first_node(*ms->target_nodes);
+   VM_BUG_ON(ms->current_node >= MAX_NUMNODES);
+   }
+
+   return __alloc_pages_nodemask(gfp_mask, 0,
+   node_zonelist(ms->current_node, gfp_mask),
+   ms->target_nodes);
+}
+
+/*
+ * Isolate at most @nr_to_scan pages from @lruvec for further migration and
+ * store them in @dst. Returns the number of pages scanned. Return value of 0
+ * means that @lruvec is empty.
+ */
+static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list lru,
+long nr_to_scan, struct list_head *dst)
+{
+   struct list_head *src = &lruvec->lists[lru];
+   struct zone *zone = lruvec_zone(lruvec);
+   long scanned = 0, taken = 0;
+
+   spin_lock_irq(&zone->lru_lock);
+   while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) {
+   struct page *page = list_last_entry(src, struct page, lru);
+   int nr_pages;
+
+   scanned++;
+
+   switch (__isolate_lru_page(page, ISOLATE_ASYNC_MIGRATE)) {
+   case 0:
+   nr_pages = hpage_nr_pages(page);
+   mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
+   list_move(&page->lru, dst);
+   taken += nr_pages;
+   break;
+
+   case -EBUSY:
+   list_move(&page->lru, src);
+   continue;
+
+   default:
+   BUG();
+   }
+   }
+   __mod_zone_page_state(zone, NR_LRU_BASE + lru, -taken);
+   __mod_zone_page_state(zone, NR_ISOLATED_ANON + is_file_lru(lru), taken);
+   spin_unlock_irq(&zone->lru_lock);
+
+   return scanned;
+}
+
+static long __memcg_numa_migrate_pages(struct lruvec *lruvec, enum lru_list lru,
+  nodemask_t *target_nodes, long nr_to_scan)
+{
+   struct memcg_numa_migrate_struct ms = {
+   .target_nodes = target_nodes,
+   .current_node = -1,
+   };
+   LIST_HEAD(pages);
+   long total_scanned = 0;
+
+   /

[Devel] [PATCH rh7 v2] mm: memcontrol: add memory.numa_migrate file

2016-08-23 Thread Vladimir Davydov
The new file is supposed to be used for migrating pages accounted to a
memory cgroup to a particular set of numa nodes. The reason to add it is
that currently there's no API for migrating unmapped file pages used for
storing page cache (neither the migrate_pages syscall nor the cpuset
subsystem provides this functionality).

The file is added to the memory cgroup and has the following format:

  NODELIST[ MAX_SCAN]

where NODELIST is a comma-separated list of ranges N1-N2 specifying the set
of nodes to migrate pages of this cgroup to, and the optional MAX_SCAN
imposes a limit on the number of pages that can be migrated in one go.

The call may be interrupted by a signal, in which case -EINTR is returned.

https://jira.sw.ru/browse/PSBM-50875

Signed-off-by: Vladimir Davydov 
Cc: Andrey Ryabinin 
Cc: Igor Redko 
Cc: Konstantin Neumoin 
---
Changes in v2:
 - break loop if not making any progress (fixes softlockup)
 - drop useless VM_BUG_ON_PAGE in memcg_numa_isolate_pages and replace BUG_ON
   with VM_BUG_ON in memcg_numa_migrate_new_page

 mm/memcontrol.c | 226 
 1 file changed, 226 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3a16b99ccc6..bfb56a649225 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include <linux/migrate.h>
 #include "internal.h"
 #include 
 #include 
@@ -5697,6 +5698,226 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
seq_putc(m, '\n');
return 0;
 }
+
+/*
+ * memcg_numa_migrate_new_page() private argument. @target_nodes specifies the
+ * set of nodes to allocate pages from. @current_node is the current preferable
+ * node, it gets rotated after each allocation.
+ */
+struct memcg_numa_migrate_struct {
+   nodemask_t *target_nodes;
+   int current_node;
+};
+
+/*
+ * Used as an argument for migrate_pages(). Allocated pages are spread evenly
+ * among destination nodes.
+ */
+static struct page *memcg_numa_migrate_new_page(struct page *page,
+   unsigned long private, int **result)
+{
+   struct memcg_numa_migrate_struct *ms = (void *)private;
+   gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN;
+
+   ms->current_node = next_node(ms->current_node, *ms->target_nodes);
+   if (ms->current_node >= MAX_NUMNODES) {
+   ms->current_node = first_node(*ms->target_nodes);
+   VM_BUG_ON(ms->current_node >= MAX_NUMNODES);
+   }
+
+   return __alloc_pages_nodemask(gfp_mask, 0,
+   node_zonelist(ms->current_node, gfp_mask),
+   ms->target_nodes);
+}
+
+/*
+ * Isolate at most @nr_to_scan pages from @lruvec for further migration and
+ * store them in @dst. Returns the number of pages scanned. Return value of 0
+ * means that @lruvec is empty.
+ */
+static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list lru,
+long nr_to_scan, struct list_head *dst)
+{
+   struct list_head *src = &lruvec->lists[lru];
+   struct zone *zone = lruvec_zone(lruvec);
+   long scanned = 0, taken = 0;
+
+   spin_lock_irq(&zone->lru_lock);
+   while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) {
+   struct page *page = list_last_entry(src, struct page, lru);
+   int nr_pages;
+
+   scanned++;
+
+   switch (__isolate_lru_page(page, ISOLATE_ASYNC_MIGRATE)) {
+   case 0:
+   nr_pages = hpage_nr_pages(page);
+   mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
+   list_move(&page->lru, dst);
+   taken += nr_pages;
+   break;
+
+   case -EBUSY:
+   list_move(&page->lru, src);
+   continue;
+
+   default:
+   BUG();
+   }
+   }
+   __mod_zone_page_state(zone, NR_LRU_BASE + lru, -taken);
+   __mod_zone_page_state(zone, NR_ISOLATED_ANON + is_file_lru(lru), taken);
+   spin_unlock_irq(&zone->lru_lock);
+
+   return scanned;
+}
+
+static long __memcg_numa_migrate_pages(struct lruvec *lruvec, enum lru_list lru,
+  nodemask_t *target_nodes, long nr_to_scan)
+{
+   struct memcg_numa_migrate_struct ms = {
+   .target_nodes = target_nodes,
+   .current_node = -1,
+   };
+   LIST_HEAD(pages);
+   long total_scanned = 0;
+
+   /*
+* If no limit on the maximal number of migrated pages is specified,
+* assume the caller wants to migrate them all.
+*/
+   if (nr_to_scan < 0)
+   nr_to_scan = mem_cgroup_get_lru_size(lruvec, lru);
+
+ 

Re: [Devel] [PATCH rh7] mm: memcontrol: add memory.numa_migrate file

2016-08-23 Thread Vladimir Davydov
On Tue, Aug 23, 2016 at 12:57:53PM +0300, Andrey Ryabinin wrote:
...
> echo "0 100" > /sys/fs/cgroup/memory/machine.slice/100/memory.numa_migrate
> 
> [  296.073002] BUG: soft lockup - CPU#1 stuck for 22s! [bash:4028]

Thanks for catching, will fix in v2.

> > +static struct page *memcg_numa_migrate_new_page(struct page *page,
> > +   unsigned long private, int **result)
> > +{
> > +   struct memcg_numa_migrate_struct *ms = (void *)private;
> > +   gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN;
> > +
> > +   ms->current_node = next_node(ms->current_node, *ms->target_nodes);
> > +   if (ms->current_node >= MAX_NUMNODES) {
> > +   ms->current_node = first_node(*ms->target_nodes);
> > +   BUG_ON(ms->current_node >= MAX_NUMNODES);
> 
> Maybe WARN_ON() or VM_BUG_ON() ?

Will replace with VM_BUG_ON.

> > +   }
> > +
> > +   return __alloc_pages_nodemask(gfp_mask, 0,
> > +   node_zonelist(ms->current_node, gfp_mask),
> > +   ms->target_nodes);
> > +}
> > +
> > +/*
> > + * Isolate at most @nr_to_scan pages from @lruvec for further migration and
> > + * store them in @dst. Returns the number of pages scanned. Return value of 0
> > + * means that @lruvec is empty.
> > + */
> > +static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list lru,
> > +long nr_to_scan, struct list_head *dst)
> > +{
> > +   struct list_head *src = &lruvec->lists[lru];
> > +   struct zone *zone = lruvec_zone(lruvec);
> > +   long scanned = 0, taken = 0;
> > +
> > +   spin_lock_irq(&zone->lru_lock);
> > +   while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) {
> > +   struct page *page = list_last_entry(src, struct page, lru);
> > +   int nr_pages;
> > +
> > +   VM_BUG_ON_PAGE(!PageLRU(page), page);
> > +
> 
> __isolate_lru_page() will return -EINVAL for !PageLRU, so either this or the 
> BUG() below is unnecessary.

OK, will remove the VM_BUG_ON_PAGE.

...
> > +static int memcg_numa_migrate_pages(struct mem_cgroup *memcg,
> > +   nodemask_t *target_nodes, long nr_to_scan)
> > +{
> > +   struct mem_cgroup *mi;
> > +   long total_scanned = 0;
> > +
> > +again:
> > +   for_each_mem_cgroup_tree(mi, memcg) {
> > +   struct zone *zone;
> > +
> > +   for_each_populated_zone(zone) {
> > +   struct lruvec *lruvec;
> > +   enum lru_list lru;
> > +   long scanned;
> > +
> > +   if (node_isset(zone_to_nid(zone), *target_nodes))
> > +   continue;
> > +
> > +   lruvec = mem_cgroup_zone_lruvec(zone, mi);
> > +   /*
> > +* For the sake of simplicity, do not attempt to migrate
> > +* unevictable pages. It should be fine as long as there
> > +* aren't too many of them, which is usually true.
> > +*/
> > +   for_each_evictable_lru(lru) {
> > +   scanned = __memcg_numa_migrate_pages(lruvec,
> > +   lru, target_nodes,
> > +   nr_to_scan > 0 ?
> > +   SWAP_CLUSTER_MAX : -1);
> 
>   Shouldn't we just pass nr_to_scan here?

No, I want to migrate memory evenly from all nodes. I.e. if you have 2
source nodes and nr_to_scan=100, there should be ~50 pages migrated from
one node and ~50 from another, not 100-vs-0.
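
A hedged worked illustration of that: with SWAP_CLUSTER_MAX = 32, the
outer loop takes at most 32 pages from one source node's lruvec, then
moves on to the next one, looping until nr_to_scan is consumed - so with
2 source nodes and nr_to_scan=100 each node contributes roughly half,
instead of the first node being drained for all 100 pages.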
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL

2016-08-23 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.28.2.vz7.17.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.28.2.vz7.17.2
-->
commit 96e55a820e144b35e9254e97cdc0afbb4f879b3f
Author: Vladimir Davydov 
Date:   Tue Aug 23 13:13:19 2016 +0400

vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL

For the sake of Docker, we only call vzprivnet rules if the skb comes
from the host [1]. To check that, we look at skb->dev->nd_net->owner_ve.
This works fine when the skb is retransmitted by a device (as is the
case with a bridged network), but it results in a kernel panic when the
skb is sent directly to a veth or venet device (via sendto), provided
the sysctl net.vzpriv_filter_host is enabled:

  BUG: unable to handle kernel NULL pointer dereference at 03e8
  IP: [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet]
  Oops:  [#1] SMP
  CPU: 0 PID: 3669 Comm: sendmail ve: 2ee1e66b-d1d4-4cb9-b65e-56af4cdd60b7 Not tainted 3.10.0-327.28.2.vz7.17.1 #1 17.1
  task: 880039e94c20 ti: 880036428000 task.ti: 880036428000
  RIP: 0010:[]  [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet]
  RSP: 0018:88003642ba78  EFLAGS: 00010202
  RAX:  RBX: 88003642bb08 RCX: 88003a451000
  RDX: 0001 RSI: 0001 RDI: 88000509d000
  RBP: 88003642ba80 R08: 88003642bb08 R09: 0024
  R10: 880035f78360 R11: 0006 R12: 88003642bad0
  R13: 88000509d000 R14: 81a6c1f0 R15: 
  FS:  7fcc702ca840() GS:88003de0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 03e8 CR3: 39c1d000 CR4: 06f0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Stack:
   a03059f2 88003642bac0 81562950 a0308680
   43838e8f 88000509d000 88003642bb08 88000509d000
   0024 88003642baf8 81562a38 a0308680
  Call Trace:
   [] ? vzprivnet_out_hook+0x32/0x40 [ip_vzprivnet]
   [] nf_iterate+0x70/0xb0
   [] nf_hook_slow+0xa8/0x110
   [] __ip_local_out_sk+0xee/0x100
   [] ? ip_make_skb+0x22/0x120
   [] ? ip_forward_options+0x1c0/0x1c0
   [] ip_local_out_sk+0x1b/0x40
   [] ip_send_skb+0x16/0x50
   [] udp_send_skb+0x170/0x380
   [] ? ip_copy_metadata+0x170/0x170
   [] udp_sendmsg+0x2f7/0x9d0
   [] ? link_path_walk+0x81/0x860
   [] inet_sendmsg+0x64/0xb0
   [] ? radix_tree_lookup_slot+0x22/0x50
   [] sock_sendmsg+0x87/0xc0
   [] ? unlock_page+0x2b/0x30
   [] SYSC_sendto+0x121/0x1c0
   [] ? __do_page_fault+0x164/0x450
   [] ? do_page_fault+0x23/0x80
   [] SyS_sendto+0xe/0x10
   [] system_call_fastpath+0x16/0x1b

This happens because in this case skb->dev is NULL (there is no device
this skb arrived on). To avoid the crash there, let's take owner_ve from
the net namespace which the socket is assigned to.
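
A minimal reproducer sketch under the assumptions above (destination
address and port are illustrative): a plain UDP sendto() emits a locally
generated skb that reaches the LOCAL_OUT netfilter hook with
skb->dev == NULL:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port = htons(53),	/* any port will do */
	};

	inet_pton(AF_INET, "10.0.0.1", &dst.sin_addr);
	/* With net.vzpriv_filter_host=1 this used to oops the kernel. */
	sendto(s, "x", 1, 0, (struct sockaddr *)&dst, sizeof(dst));
	close(s);
	return 0;
}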

https://jira.sw.ru/browse/PSBM-51041

Fixes: 32efdd408fad ("vzprivnet: Do not execute vzprivnet_hook inside CT") 
[1]
Signed-off-by: Vladimir Davydov 
Acked-by: Pavel Tikhomirov 
---
 net/ipv4/netfilter/ip_vzprivnet.c  | 7 ++-
 net/ipv6/netfilter/ip6_vzprivnet.c | 7 ++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ip_vzprivnet.c b/net/ipv4/netfilter/ip_vzprivnet.c
index 293b2802b4c2..6e2bbe2d42ef 100644
--- a/net/ipv4/netfilter/ip_vzprivnet.c
+++ b/net/ipv4/netfilter/ip_vzprivnet.c
@@ -250,8 +250,13 @@ static unsigned int vzprivnet_hook(struct sk_buff *skb, int can_be_bridge)
 {
struct dst_entry *dst;
unsigned int pmark = VZPRIV_MARK_UNKNOWN;
+   struct net *src_net;
 
-   if (!ve_is_super(skb->dev->nd_net->owner_ve))
+   if (WARN_ON_ONCE(!skb->dev && !skb->sk))
+   return NF_ACCEPT;
+
+   src_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+   if (!ve_is_super(src_net->owner_ve))
return NF_ACCEPT;
 
dst = skb_dst(skb);
diff --git a/net/ipv6/netfilter/ip6_vzprivnet.c b/net/ipv6/netfilter/ip6_vzprivnet.c
index ee9c3c972637..0a491afd81b6 100644
--- a/net/ipv6/netfilter/ip6_vzprivnet.c
+++ b/net/ipv6/netfilter/ip6_vzprivnet.c
@@ -484,8 +484,13 @@ static unsigned int vzprivnet6_hook(struct sk_buff *skb, int can_be_bridge)
int verdict = NF_DROP;
struct vzprivnet *dst, *src;
struct ipv6hdr *hdr;
+   struct net *src_net;
 
-   if (!ve_is_super(skb->dev->nd_net->owner_ve))
+   if (WARN_ON_ONCE(!skb->dev && !skb->sk))
+   return NF_ACCEPT;
+
+   src_net = skb->dev

[Devel] [PATCH rh7 v2] vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL

2016-08-23 Thread Vladimir Davydov
For the sake of Docker, we only call vzprivnet rules if the skb comes
from the host [1]. To check that, we look at skb->dev->nd_net->owner_ve.
This works fine when the skb is retransmitted by a device (as is the
case with a bridged network), but it results in a kernel panic when the
skb is sent directly to a veth or venet device (via sendto), provided
the sysctl net.vzpriv_filter_host is enabled:

  BUG: unable to handle kernel NULL pointer dereference at 03e8
  IP: [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet]
  Oops:  [#1] SMP
  CPU: 0 PID: 3669 Comm: sendmail ve: 2ee1e66b-d1d4-4cb9-b65e-56af4cdd60b7 Not tainted 3.10.0-327.28.2.vz7.17.1 #1 17.1
  task: 880039e94c20 ti: 880036428000 task.ti: 880036428000
  RIP: 0010:[]  [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet]
  RSP: 0018:88003642ba78  EFLAGS: 00010202
  RAX:  RBX: 88003642bb08 RCX: 88003a451000
  RDX: 0001 RSI: 0001 RDI: 88000509d000
  RBP: 88003642ba80 R08: 88003642bb08 R09: 0024
  R10: 880035f78360 R11: 0006 R12: 88003642bad0
  R13: 88000509d000 R14: 81a6c1f0 R15: 
  FS:  7fcc702ca840() GS:88003de0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 03e8 CR3: 39c1d000 CR4: 06f0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Stack:
   a03059f2 88003642bac0 81562950 a0308680
   43838e8f 88000509d000 88003642bb08 88000509d000
   0024 88003642baf8 81562a38 a0308680
  Call Trace:
   [] ? vzprivnet_out_hook+0x32/0x40 [ip_vzprivnet]
   [] nf_iterate+0x70/0xb0
   [] nf_hook_slow+0xa8/0x110
   [] __ip_local_out_sk+0xee/0x100
   [] ? ip_make_skb+0x22/0x120
   [] ? ip_forward_options+0x1c0/0x1c0
   [] ip_local_out_sk+0x1b/0x40
   [] ip_send_skb+0x16/0x50
   [] udp_send_skb+0x170/0x380
   [] ? ip_copy_metadata+0x170/0x170
   [] udp_sendmsg+0x2f7/0x9d0
   [] ? link_path_walk+0x81/0x860
   [] inet_sendmsg+0x64/0xb0
   [] ? radix_tree_lookup_slot+0x22/0x50
   [] sock_sendmsg+0x87/0xc0
   [] ? unlock_page+0x2b/0x30
   [] SYSC_sendto+0x121/0x1c0
   [] ? __do_page_fault+0x164/0x450
   [] ? do_page_fault+0x23/0x80
   [] SyS_sendto+0xe/0x10
   [] system_call_fastpath+0x16/0x1b

This happens because in this case skb->dev is NULL (there is no device
this skb arrived on). To avoid the crash there, let's take owner_ve from
the net namespace which the socket is assigned to.

https://jira.sw.ru/browse/PSBM-51041

Fixes: 32efdd408fad ("vzprivnet: Do not execute vzprivnet_hook inside CT") [1]
Signed-off-by: Vladimir Davydov 
Cc: Pavel Tikhomirov 
---
Changes in v2:
 - do not crash if both skb->dev and skb->sk turn out to be NULL
   for some reason - just print a warning and accept the packet

 net/ipv4/netfilter/ip_vzprivnet.c  | 7 ++-
 net/ipv6/netfilter/ip6_vzprivnet.c | 7 ++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ip_vzprivnet.c b/net/ipv4/netfilter/ip_vzprivnet.c
index 293b2802b4c2..6e2bbe2d42ef 100644
--- a/net/ipv4/netfilter/ip_vzprivnet.c
+++ b/net/ipv4/netfilter/ip_vzprivnet.c
@@ -250,8 +250,13 @@ static unsigned int vzprivnet_hook(struct sk_buff *skb, int can_be_bridge)
 {
struct dst_entry *dst;
unsigned int pmark = VZPRIV_MARK_UNKNOWN;
+   struct net *src_net;
 
-   if (!ve_is_super(skb->dev->nd_net->owner_ve))
+   if (WARN_ON_ONCE(!skb->dev && !skb->sk))
+   return NF_ACCEPT;
+
+   src_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+   if (!ve_is_super(src_net->owner_ve))
return NF_ACCEPT;
 
dst = skb_dst(skb);
diff --git a/net/ipv6/netfilter/ip6_vzprivnet.c b/net/ipv6/netfilter/ip6_vzprivnet.c
index ee9c3c972637..0a491afd81b6 100644
--- a/net/ipv6/netfilter/ip6_vzprivnet.c
+++ b/net/ipv6/netfilter/ip6_vzprivnet.c
@@ -484,8 +484,13 @@ static unsigned int vzprivnet6_hook(struct sk_buff *skb, int can_be_bridge)
int verdict = NF_DROP;
struct vzprivnet *dst, *src;
struct ipv6hdr *hdr;
+   struct net *src_net;
 
-   if (!ve_is_super(skb->dev->nd_net->owner_ve))
+   if (WARN_ON_ONCE(!skb->dev && !skb->sk))
+   return NF_ACCEPT;
+
+   src_net = skb->dev ? dev_net(skb->dev) : sock_net(skb->sk);
+   if (!ve_is_super(src_net->owner_ve))
return NF_ACCEPT;
 
hdr = ipv6_hdr(skb);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] vzprivnet: vzprivnet_hook: fix crash if skb->dev == NULL

2016-08-19 Thread Vladimir Davydov
For the sake of Docker, we only call vzprivnet rules if the skb comes
from the host [1]. To check that, we look at skb->dev->nd_net->owner_ve.
This works fine when the skb is retransmitted by a device (as is the
case with a bridged network), but it results in a kernel panic when the
skb is sent directly to a veth or venet device (via sendto), provided
the sysctl net.vzpriv_filter_host is enabled:

  BUG: unable to handle kernel NULL pointer dereference at 03e8
  IP: [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet]
  Oops:  [#1] SMP
  CPU: 0 PID: 3669 Comm: sendmail ve: 2ee1e66b-d1d4-4cb9-b65e-56af4cdd60b7 Not tainted 3.10.0-327.28.2.vz7.17.1 #1 17.1
  task: 880039e94c20 ti: 880036428000 task.ti: 880036428000
  RIP: 0010:[]  [] vzprivnet_hook+0x9/0xb0 [ip_vzprivnet]
  RSP: 0018:88003642ba78  EFLAGS: 00010202
  RAX:  RBX: 88003642bb08 RCX: 88003a451000
  RDX: 0001 RSI: 0001 RDI: 88000509d000
  RBP: 88003642ba80 R08: 88003642bb08 R09: 0024
  R10: 880035f78360 R11: 0006 R12: 88003642bad0
  R13: 88000509d000 R14: 81a6c1f0 R15: 
  FS:  7fcc702ca840() GS:88003de0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 03e8 CR3: 39c1d000 CR4: 06f0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Stack:
   a03059f2 88003642bac0 81562950 a0308680
   43838e8f 88000509d000 88003642bb08 88000509d000
   0024 88003642baf8 81562a38 a0308680
  Call Trace:
   [] ? vzprivnet_out_hook+0x32/0x40 [ip_vzprivnet]
   [] nf_iterate+0x70/0xb0
   [] nf_hook_slow+0xa8/0x110
   [] __ip_local_out_sk+0xee/0x100
   [] ? ip_make_skb+0x22/0x120
   [] ? ip_forward_options+0x1c0/0x1c0
   [] ip_local_out_sk+0x1b/0x40
   [] ip_send_skb+0x16/0x50
   [] udp_send_skb+0x170/0x380
   [] ? ip_copy_metadata+0x170/0x170
   [] udp_sendmsg+0x2f7/0x9d0
   [] ? link_path_walk+0x81/0x860
   [] inet_sendmsg+0x64/0xb0
   [] ? radix_tree_lookup_slot+0x22/0x50
   [] sock_sendmsg+0x87/0xc0
   [] ? unlock_page+0x2b/0x30
   [] SYSC_sendto+0x121/0x1c0
   [] ? __do_page_fault+0x164/0x450
   [] ? do_page_fault+0x23/0x80
   [] SyS_sendto+0xe/0x10
   [] system_call_fastpath+0x16/0x1b

This happens because in this case skb->dev is NULL (there is no device
this skb arrived on). To avoid the crash there, let's take owner_ve from
the net namespace which the socket is assigned to.

https://jira.sw.ru/browse/PSBM-51041

Fixes: 32efdd408fad ("vzprivnet: Do not execute vzprivnet_hook inside CT") [1]
Signed-off-by: Vladimir Davydov 
Cc: Pavel Tikhomirov 
---
 net/ipv4/netfilter/ip_vzprivnet.c  | 3 ++-
 net/ipv6/netfilter/ip6_vzprivnet.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ip_vzprivnet.c b/net/ipv4/netfilter/ip_vzprivnet.c
index 293b2802b4c2..798a3ced3ef3 100644
--- a/net/ipv4/netfilter/ip_vzprivnet.c
+++ b/net/ipv4/netfilter/ip_vzprivnet.c
@@ -250,8 +250,9 @@ static unsigned int vzprivnet_hook(struct sk_buff *skb, int can_be_bridge)
 {
struct dst_entry *dst;
unsigned int pmark = VZPRIV_MARK_UNKNOWN;
+   struct net *src_net = skb->dev ? skb->dev->nd_net : sock_net(skb->sk);
 
-   if (!ve_is_super(skb->dev->nd_net->owner_ve))
+   if (!ve_is_super(src_net->owner_ve))
return NF_ACCEPT;
 
dst = skb_dst(skb);
diff --git a/net/ipv6/netfilter/ip6_vzprivnet.c b/net/ipv6/netfilter/ip6_vzprivnet.c
index ee9c3c972637..36cc1d4c5aa0 100644
--- a/net/ipv6/netfilter/ip6_vzprivnet.c
+++ b/net/ipv6/netfilter/ip6_vzprivnet.c
@@ -484,8 +484,9 @@ static unsigned int vzprivnet6_hook(struct sk_buff *skb, int can_be_bridge)
int verdict = NF_DROP;
struct vzprivnet *dst, *src;
struct ipv6hdr *hdr;
+   struct net *src_net = skb->dev ? skb->dev->nd_net : sock_net(skb->sk);
 
-   if (!ve_is_super(skb->dev->nd_net->owner_ve))
+   if (!ve_is_super(src_net->owner_ve))
return NF_ACCEPT;
 
hdr = ipv6_hdr(skb);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm: memcontrol: add memory.numa_migrate file

2016-08-18 Thread Vladimir Davydov
The new file is supposed to be used for migrating pages accounted to a
memory cgroup to a particular set of numa nodes. The reason to add it is
that currently there's no API for migrating unmapped file pages used for
storing page cache (neither the migrate_pages syscall nor the cpuset
subsystem provides this functionality).

The file is added to the memory cgroup and has the following format:

  NODELIST[ MAX_SCAN]

where NODELIST is a comma-separated list of ranges N1-N2 specifying the set
of nodes to migrate pages of this cgroup to, and the optional MAX_SCAN
imposes a limit on the number of pages that can be migrated in one go.

The call may be interrupted by a signal, in which case -EINTR is returned.

https://jira.sw.ru/browse/PSBM-50875

Signed-off-by: Vladimir Davydov 
Cc: Andrey Ryabinin 
Cc: Igor Redko 
Cc: Konstantin Neumoin 
---
 mm/memcontrol.c | 223 
 1 file changed, 223 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3a16b99ccc6..8c6c4fb9c153 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include <linux/migrate.h>
 #include "internal.h"
 #include 
 #include 
@@ -5697,6 +5698,223 @@ static int memcg_numa_stat_show(struct cgroup *cont, struct cftype *cft,
seq_putc(m, '\n');
return 0;
 }
+
+/*
+ * memcg_numa_migrate_new_page() private argument. @target_nodes specifies the
+ * set of nodes to allocate pages from. @current_node is the current preferable
+ * node, it gets rotated after each allocation.
+ */
+struct memcg_numa_migrate_struct {
+   nodemask_t *target_nodes;
+   int current_node;
+};
+
+/*
+ * Used as an argument for migrate_pages(). Allocated pages are spread evenly
+ * among destination nodes.
+ */
+static struct page *memcg_numa_migrate_new_page(struct page *page,
+   unsigned long private, int **result)
+{
+   struct memcg_numa_migrate_struct *ms = (void *)private;
+   gfp_t gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_NORETRY | __GFP_NOWARN;
+
+   ms->current_node = next_node(ms->current_node, *ms->target_nodes);
+   if (ms->current_node >= MAX_NUMNODES) {
+   ms->current_node = first_node(*ms->target_nodes);
+   BUG_ON(ms->current_node >= MAX_NUMNODES);
+   }
+
+   return __alloc_pages_nodemask(gfp_mask, 0,
+   node_zonelist(ms->current_node, gfp_mask),
+   ms->target_nodes);
+}
+
+/*
+ * Isolate at most @nr_to_scan pages from @lruvec for further migration and
+ * store them in @dst. Returns the number of pages scanned. Return value of 0
+ * means that @lruvec is empty.
+ */
+static long memcg_numa_isolate_pages(struct lruvec *lruvec, enum lru_list lru,
+long nr_to_scan, struct list_head *dst)
+{
+   struct list_head *src = &lruvec->lists[lru];
+   struct zone *zone = lruvec_zone(lruvec);
+   long scanned = 0, taken = 0;
+
+   spin_lock_irq(&zone->lru_lock);
+   while (!list_empty(src) && scanned < nr_to_scan && taken < nr_to_scan) {
+   struct page *page = list_last_entry(src, struct page, lru);
+   int nr_pages;
+
+   VM_BUG_ON_PAGE(!PageLRU(page), page);
+
+   scanned++;
+
+   switch (__isolate_lru_page(page, ISOLATE_ASYNC_MIGRATE)) {
+   case 0:
+   nr_pages = hpage_nr_pages(page);
+   mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
+   list_move(&page->lru, dst);
+   taken += nr_pages;
+   break;
+
+   case -EBUSY:
+   list_move(&page->lru, src);
+   continue;
+
+   default:
+   BUG();
+   }
+   }
+   __mod_zone_page_state(zone, NR_LRU_BASE + lru, -taken);
+   __mod_zone_page_state(zone, NR_ISOLATED_ANON + is_file_lru(lru), taken);
+   spin_unlock_irq(&zone->lru_lock);
+
+   return scanned;
+}
+
+static long __memcg_numa_migrate_pages(struct lruvec *lruvec, enum lru_list lru,
+  nodemask_t *target_nodes, long nr_to_scan)
+{
+   struct memcg_numa_migrate_struct ms = {
+   .target_nodes = target_nodes,
+   .current_node = -1,
+   };
+   LIST_HEAD(pages);
+   long total_scanned = 0;
+
+   /*
+* If no limit on the maximal number of migrated pages is specified,
+* assume the caller wants to migrate them all.
+*/
+   if (nr_to_scan < 0)
+   nr_to_scan = mem_cgroup_get_lru_size(lruvec, lru);
+
+   while (total_scanned < nr_to_scan) {
+   int ret;
+   long scanned;
+
+   scanned = memcg_numa_isolate

Re: [Devel] [PATCH 4/4] x86/arch_prctl/vdso: add ARCH_MAP_VDSO_*

2016-08-11 Thread Vladimir Davydov
On Tue, Jul 26, 2016 at 05:25:02PM +0300, Dmitry Safonov wrote:
> Add API to change vdso blob type with arch_prctl.
> As this is useful only for the needs of CRIU, expose
> this interface under CONFIG_CHECKPOINT_RESTORE.
> 
> Cc: Andy Lutomirski 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Cc: "H. Peter Anvin" 
> 
> [Differences to vanilla patches:
>  o API only for 32-bit vDSO mapping
>  o unmap previous vdso just by mm->context.vdso pointer]
> Signed-off-by: Dmitry Safonov 

Reviewed-by: Vladimir Davydov 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] sched: make load balancing more aggressive

2016-08-03 Thread Vladimir Davydov
Currently, we only pull tasks if the destination cpu group load is below
the average over the domain being rebalanced. This sounds reasonable,
but only as long as there are no pinned tasks; otherwise we can get an
unfair task distribution. For instance, suppose the host has 16 cores
and there's a container pinned to two of the cores (either strictly by
using cpumask or indirectly by setting cpulimit). If we start 16 tasks
in the container, then the average load will be 1, so that even if 15
tasks turn out to run on the same cpu (out of 2), no tasks will be
pulled, which is wrong.

To overcome this issue, let's port the following patches from PCS6:

  diff-sched-balance-even-if-load-is-greater-than-average
  
diff-sched-always-try-to-equalize-load-between-this-and-busiest-cpus-when-balancing

They make the balance procedure pull tasks even if the destination is
above average, by setting the imbalance value to be

  (source_load - destination_load) / 2

instead of

  (average_load - destination_load) / 2

This implies decreasing the convergence speed of the balancing
procedure, but PCS6 has worked like that for quite a while, so it should
be fine.
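
A hedged worked example, reusing the container above: with a container
pinned to cpus 0-1 on the 16-core host and 16 runnable tasks, 15 of them
on cpu0 and 1 on cpu1, the domain average load is 16/16 = 1 and cpu1's
load is already 1, so the old check (this->avg_load >= sds.avg_load)
prevents cpu1 from pulling anything. With the new imbalance,
(15 - 1) / 2 = 7 tasks' worth of load may be pulled, converging toward
an even 8/8 split.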

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cedd178f963c..685517597a30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6618,7 +6618,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/* How much load to actually move to equalise the imbalance */
env->imbalance = min(
max_pull * busiest->group_power,
-   (sds->avg_load - this->avg_load) * this->group_power
+   (busiest->avg_load - this->avg_load) * this->group_power
) / SCHED_POWER_SCALE;
 
/*
@@ -6695,13 +6695,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
if (this->avg_load >= busiest->avg_load)
goto out_balanced;
 
-   /*
-* Don't pull any tasks if this group is already above the domain
-* average load.
-*/
-   if (this->avg_load >= sds.avg_load)
-   goto out_balanced;
-
if (env->idle == CPU_IDLE) {
/*
 * This cpu is idle. If the busiest group load doesn't
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] sched: fair: fix dst pinned not set when failing migration due to cpulimit restriction

2016-08-03 Thread Vladimir Davydov
When task migration fails due to the cpulimit restriction, we should
set the dst pinned flag so that the load balancing procedure proceeds to
the next cpu, just like when migration fails due to the affinity mask
(see can_migrate_task). We haven't done that since the rebase to
3.10.0-327.18.2.el7. Fix that.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e39ed4c17464..cedd178f963c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5446,15 +5446,17 @@ static inline int can_migrate_task_cpulimit(struct task_struct *p, struct lb_env
 
schedstat_inc(p, se.statistics.nr_failed_migrations_cpulimit);
 
+   env->flags |= LBF_SOME_PINNED;
+
if (check_cpulimit_spread(tg, env->src_cpu) != 0)
return 0;
 
-   if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+   if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
return 0;
 
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
if (cfs_rq_active(tg->cfs_rq[cpu])) {
-   env->flags |= LBF_SOME_PINNED;
+   env->flags |= LBF_DST_PINNED;
env->new_dst_cpu = cpu;
break;
}
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] sched: debug: show nr_failed_migrations_cpulimit

2016-08-03 Thread Vladimir Davydov
This is our (non-mainstream) counter of how many times task migration
failed due to the cpulimit restriction. For some reason, we don't show
it in proc, although it might be helpful for debugging. Let's fix that.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 55ac5fb78e29..6cf0c2ceedfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -594,6 +594,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
P(se.statistics.nr_migrations_cold);
P(se.statistics.nr_failed_migrations_affine);
P(se.statistics.nr_failed_migrations_running);
+   P(se.statistics.nr_failed_migrations_cpulimit);
P(se.statistics.nr_failed_migrations_hot);
P(se.statistics.nr_forced_migrations);
P(se.statistics.nr_wakeups);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] sched: add WARN_ON's to debug task boosting

2016-08-03 Thread Vladimir Davydov
This patch ports

* diff-sched-add-WARN_ONs-to-debug-task-boosting
Added to 042stab114_2

Assert that we never have a boosted entity under a throttled hierarchy.
Also, do not panic if in set_next_entity we find a boosted entity that
is not on the list - just warn and carry on as if nothing happened.

https://jira.sw.ru/browse/PSBM-44475
https://jira.sw.ru/browse/PSBM-50077

Signed-off-by: Vladimir Davydov 
Reviewed-by: Kirill Tkhai 
---
 kernel/sched/fair.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 515685f77217..e39ed4c17464 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -909,9 +909,10 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #ifdef CONFIG_CFS_BANDWIDTH
 static inline void update_entity_boost(struct sched_entity *se)
 {
-   if (!entity_is_task(se))
+   if (!entity_is_task(se)) {
se->boosted = cfs_rq_has_boosted_entities(group_cfs_rq(se));
-   else {
+   WARN_ON(se->boosted && cfs_rq_throttled(group_cfs_rq(se)));
+   } else {
struct task_struct *p = task_of(se);
 
if (unlikely(p != current))
@@ -943,6 +944,8 @@ static inline void __enqueue_boosted_entity(struct cfs_rq *cfs_rq,
 static inline void __dequeue_boosted_entity(struct cfs_rq *cfs_rq,
struct sched_entity *se)
 {
+   if (WARN_ON(se->boost_node.next == LIST_POISON1))
+   return;
list_del(&se->boost_node);
 }
 
@@ -953,8 +956,11 @@ static int enqueue_boosted_entity(struct cfs_rq *cfs_rq,
if (se != cfs_rq->curr)
__enqueue_boosted_entity(cfs_rq, se);
se->boosted = 1;
+   WARN_ON(!entity_is_task(se) &&
+   cfs_rq_throttled(group_cfs_rq(se)));
return 1;
-   }
+   } else
+   WARN_ON(cfs_rq_throttled(group_cfs_rq(se)));
 
return 0;
 }
@@ -3847,6 +3853,8 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
  */
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 {
+   WARN_ON(cfs_rq_has_boosted_entities(cfs_rq));
+
if (!cfs_bandwidth_used())
return;
 
@@ -4150,8 +4158,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
} else if (boost) {
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
-   if (!enqueue_boosted_entity(cfs_rq, se))
+   if (!enqueue_boosted_entity(cfs_rq, se)) {
+   WARN_ON(throttled_hierarchy(cfs_rq));
break;
+   }
if (cfs_rq_throttled(cfs_rq))
unthrottle_cfs_rq(cfs_rq);
}
@@ -4213,8 +4223,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running--;
 
-   if (cfs_rq_throttled(cfs_rq))
+   if (cfs_rq_throttled(cfs_rq)) {
+   WARN_ON(boosted);
break;
+   }
 
if (boosted)
boosted = dequeue_boosted_entity(cfs_rq, se);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ploop: release plo->ctl_mutex for thaw_bdev in PLOOP_IOC_THAW handler

2016-07-15 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.25
-->
commit b81bf870a6c314b664977ee9b6747cf93e78bcc3
Author: Vladimir Davydov 
Date:   Fri Jul 15 13:34:36 2016 +0400

ploop: release plo->ctl_mutex for thaw_bdev in PLOOP_IOC_THAW handler

A recent patch to ploop, 91a74e3b91a ("ploop: add PLOOP_IOC_FREEZE and
PLOOP_IOC_THAW ioctls"), introduced the following deadlock:

Thread 1:
[] sleep_on_buffer+0xe/0x20
[] __sync_dirty_buffer+0xb8/0xe0
[] sync_dirty_buffer+0x13/0x20
[] ext4_commit_super+0x1b0/0x240 [ext4]
[] ext4_unfreeze+0x2d/0x40 [ext4]
[] thaw_super+0x3f/0xb0
[] thaw_bdev+0x65/0x80
[] ploop_ioctl+0x6d0/0x29f0 [ploop]
[] blkdev_ioctl+0x2df/0x770
[] block_ioctl+0x41/0x50
[] do_vfs_ioctl+0x255/0x4f0
[] SyS_ioctl+0x54/0xa0
[] system_call_fastpath+0x16/0x1b
[] 0x

Thread 2:
[] ploop_pb_get_pending+0x163/0x290 [ploop]
[] ploop_push_backup_io_get.isra.26+0x81/0x1b0 [ploop]
[] ploop_push_backup_io+0x15b/0x260 [ploop]
[] ploop_ioctl+0xe96/0x29f0 [ploop]
[] blkdev_ioctl+0x2df/0x770
[] block_ioctl+0x41/0x50
[] do_vfs_ioctl+0x255/0x4f0
[] SyS_ioctl+0x54/0xa0
[] system_call_fastpath+0x16/0x1b
[] 0x

Here thread 1 is thawing ploop with the PLOOP_IOC_THAW ioctl, which
holds plo->ctl_mutex during its work. To thaw itself, ext4 has to commit
some data. This commit triggers a push backup out-of-order request,
which must be processed and acked by userspace to be completed. But
userspace can't process it, because ploop_pb_get_pending() wants the
same mutex. Thus, deadlock.

Fix the deadlock by releasing the mutex before calling thaw_bdev and
reacquiring it after thaw_bdev is done.

https://jira.sw.ru/browse/PSBM-49699

Reported-by: Pavel Borzenkov 
    Signed-off-by: Vladimir Davydov 
Cc: Maxim Patlasov 
---
 drivers/block/ploop/dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index d52975eaaa36..3dc94ca5c393 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4839,11 +4839,12 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
if (!test_bit(PLOOP_S_FROZEN, &plo->state))
return 0;
 
+   plo->sb = NULL;
+   clear_bit(PLOOP_S_FROZEN, &plo->state);
+
+   mutex_unlock(&plo->ctl_mutex);
err = thaw_bdev(bdev, sb);
-   if (!err) {
-   plo->sb = NULL;
-   clear_bit(PLOOP_S_FROZEN, &plo->state);
-   }
+   mutex_lock(&plo->ctl_mutex);
 
return err;
 }
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] ploop: release plo->ctl_mutex for thaw_bdev in PLOOP_IOC_THAW handler

2016-07-15 Thread Vladimir Davydov
A recent patch to ploop, 91a74e3b91a ("ploop: add PLOOP_IOC_FREEZE and
PLOOP_IOC_THAW ioctls"), introduced the following deadlock:

Thread 1:
[] sleep_on_buffer+0xe/0x20
[] __sync_dirty_buffer+0xb8/0xe0
[] sync_dirty_buffer+0x13/0x20
[] ext4_commit_super+0x1b0/0x240 [ext4]
[] ext4_unfreeze+0x2d/0x40 [ext4]
[] thaw_super+0x3f/0xb0
[] thaw_bdev+0x65/0x80
[] ploop_ioctl+0x6d0/0x29f0 [ploop]
[] blkdev_ioctl+0x2df/0x770
[] block_ioctl+0x41/0x50
[] do_vfs_ioctl+0x255/0x4f0
[] SyS_ioctl+0x54/0xa0
[] system_call_fastpath+0x16/0x1b
[] 0x

Thread 2:
[] ploop_pb_get_pending+0x163/0x290 [ploop]
[] ploop_push_backup_io_get.isra.26+0x81/0x1b0 [ploop]
[] ploop_push_backup_io+0x15b/0x260 [ploop]
[] ploop_ioctl+0xe96/0x29f0 [ploop]
[] blkdev_ioctl+0x2df/0x770
[] block_ioctl+0x41/0x50
[] do_vfs_ioctl+0x255/0x4f0
[] SyS_ioctl+0x54/0xa0
[] system_call_fastpath+0x16/0x1b
[] 0x

Here thread 1 is thawing ploop with the PLOOP_IOC_THAW ioctl, which
holds plo->ctl_mutex during its work. To thaw itself, ext4 has to commit
some data. This commit triggers a push backup out-of-order request,
which must be processed and acked by userspace to be completed. But
userspace can't process it, because ploop_pb_get_pending() wants the
same mutex. Thus, deadlock.

Fix the deadlock by releasing the mutex before calling thaw_bdev and
reacquiring it after thaw_bdev is done.

https://jira.sw.ru/browse/PSBM-49699

Reported-by: Pavel Borzenkov 
Signed-off-by: Vladimir Davydov 
Cc: Maxim Patlasov 
---
 drivers/block/ploop/dev.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index d52975eaaa36..3dc94ca5c393 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4839,11 +4839,12 @@ static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
if (!test_bit(PLOOP_S_FROZEN, &plo->state))
return 0;
 
+   plo->sb = NULL;
+   clear_bit(PLOOP_S_FROZEN, &plo->state);
+
+   mutex_unlock(&plo->ctl_mutex);
err = thaw_bdev(bdev, sb);
-   if (!err) {
-   plo->sb = NULL;
-   clear_bit(PLOOP_S_FROZEN, &plo->state);
-   }
+   mutex_lock(&plo->ctl_mutex);
 
return err;
 }
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] sched: use topmost limited ancestor for cpulimit balancing

2016-07-14 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit 0754d59aeddb72bc69ec10ebbe70dc41f03c16ab
Author: Vladimir Davydov 
Date:   Thu Jul 14 20:46:46 2016 +0400

sched: use topmost limited ancestor for cpulimit balancing

We want to keep all processes of a container's cgroup packed on the
minimal allowed number of cpus, which is set by the cpulimit. Doing this
properly when deep hierarchies are used is tricky, if not impossible,
without introducing tremendous overhead, so initially we implemented
this feature exclusively for top-level cgroups. Now this isn't enough,
as containers can be created in machine.slice. So in this patch we make
cpulimit balancing work for the topmost cgroups that have a cpu limit
set. This way, no matter whether containers are created under the root
or in machine.slice, cpulimit balancing will always be applied to the
container's cgroup, as machine.slice isn't supposed to have a cpu limit
set.
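
A hedged illustration of the inheritance rule implemented below (the
hierarchy and names are made up): each group's topmost_limited_ancestor
(tla) ends up pointing to its highest ancestor that has a limit set, or
to itself if there is none:

  root (unlimited)                 tla = root
    machine.slice (unlimited)      tla = machine.slice
      CT100 (cpu limit set)        tla = CT100
        CT100/sub (unlimited)      tla = CT100 (inherited)

So cpulimit balancing keys off CT100 for everything inside the
container, wherever the container cgroup lives.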

https://jira.sw.ru/browse/PSBM-49203

    Signed-off-by: Vladimir Davydov 
---
 kernel/sched/core.c  | 62 
 kernel/sched/fair.c  | 36 +-
 kernel/sched/sched.h |  2 ++
 3 files changed, 69 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94deef41f05a..657b8e4ba8d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7557,6 +7557,10 @@ void __init sched_init(void)
 
 #endif /* CONFIG_CGROUP_SCHED */
 
+#ifdef CONFIG_CFS_CPULIMIT
+   root_task_group.topmost_limited_ancestor = &root_task_group;
+#endif
+
for_each_possible_cpu(i) {
struct rq *rq;
 
@@ -7882,6 +7886,8 @@ err:
return ERR_PTR(-ENOMEM);
 }
 
+static void tg_update_topmost_limited_ancestor(struct task_group *tg);
+
 void sched_online_group(struct task_group *tg, struct task_group *parent)
 {
unsigned long flags;
@@ -7894,6 +7900,9 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
tg->parent = parent;
INIT_LIST_HEAD(&tg->children);
list_add_rcu(&tg->siblings, &parent->children);
+
+   tg_update_topmost_limited_ancestor(tg);
+
spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
@@ -8428,6 +8437,8 @@ const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
+static void tg_limit_toggled(struct task_group *tg);
+
 /* call with cfs_constraints_mutex held */
 static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
@@ -8485,6 +8496,8 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
}
+   if (runtime_enabled != runtime_was_enabled)
+   tg_limit_toggled(tg);
return ret;
 }
 
@@ -8662,6 +8675,49 @@ static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
 }
 
 #ifdef CONFIG_CFS_CPULIMIT
+static int __tg_update_topmost_limited_ancestor(struct task_group *tg, void *unused)
+{
+   struct task_group *parent = tg->parent;
+
+   /*
+* Parent and none of its ancestors is limited? The task group should
+* become a topmost limited ancestor then, provided it has a limit set.
+* Otherwise inherit topmost limited ancestor from the parent.
+*/
+   if (parent->topmost_limited_ancestor == parent &&
+   parent->cfs_bandwidth.quota == RUNTIME_INF)
+   tg->topmost_limited_ancestor = tg;
+   else
+   tg->topmost_limited_ancestor = parent->topmost_limited_ancestor;
+   return 0;
+}
+
+static void tg_update_topmost_limited_ancestor(struct task_group *tg)
+{
+   __tg_update_topmost_limited_ancestor(tg, NULL);
+}
+
+static void tg_limit_toggled(struct task_group *tg)
+{
+   if (tg->topmost_limited_ancestor != tg) {
+   /*
+* This task group is not a topmost limited ancestor, so both
+* it and all its children must already point to their topmost
+* limited ancestor, and we have nothing to do.
+*/
+   return;
+   }
+
+   /*
+* This task group is a topmost limited ancestor. Walk over all its
+* children and update their pointers to the topmost limited ancestor.
+*/
+
+   spin_lock_irq(&task_group_lock);
+   walk_tg_tree_from(tg, __tg_update_topmost_limited_ancestor, tg_nop, NULL);
+   spin_unlock_irq(&task_group_lock);
+}
+
 static void tg_update_cpu_limit(struct task_group *tg)
 {
long quota, period;
@@ -8736,6 +8792,12 @@ static int nr_cpus_wri

[Devel] [PATCH RHEL7 COMMIT] sched: account task_group->nr_cpus_active for all cgroups

2016-07-14 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit 2d2b5ae36a8af86450d7cdae8dcc981e3430b06d
Author: Vladimir Davydov 
Date:   Thu Jul 14 20:46:41 2016 +0400

sched: account task_group->nr_cpus_active for all cgroups

Currently nr_cpus_active is only accounted for top-level cgroups,
because container cgroups, which are the only users of this counter,
could only be created under the root cgroup.

Now things have changed, and containers can reside either under the root
or under machine.slice or in any other cgroup depending on the host's
config. So we can't preserve this little optimization anymore. Remove
it.

    Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd25e1e8ae5b..70a5861d4166 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3039,7 +3039,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-   if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight)
+   if (!cfs_rq->load.weight)
inc_nr_active_cfs_rqs(cfs_rq);
 
/*
@@ -3163,7 +3163,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_min_vruntime(cfs_rq);
update_cfs_shares(cfs_rq);
 
-   if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight)
+   if (!cfs_rq->load.weight)
dec_nr_active_cfs_rqs(cfs_rq, flags & DEQUEUE_TASK_SLEEP);
 }
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] sched: make check_cpulimit_spread accept tg instead of cfs_rq

2016-07-14 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit 7ae55375b703b6f386447e223df63dc1f217cc6f
Author: Vladimir Davydov 
Date:   Thu Jul 14 20:46:43 2016 +0400

sched: make check_cpulimit_spread accept tg instead of cfs_rq

It only needs cfs_rq->tg, so let's pass it directly. This eases further
modifications.
    
Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 57 -
 1 file changed, 26 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70a5861d4166..52365f6a4e36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -542,9 +542,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
-static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu)
+static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu)
 {
-   struct task_group *tg = cfs_rq->tg;
int nr_cpus_active = atomic_read(&tg->nr_cpus_active);
int nr_cpus_limit = DIV_ROUND_UP(tg->cpu_rate, MAX_CPU_RATE);
 
@@ -579,7 +578,7 @@ static inline enum hrtimer_restart sched_cfs_active_timer(struct hrtimer *timer)
return 0;
 }
 
-static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu)
+static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu)
 {
return 1;
 }
@@ -4717,18 +4716,15 @@ done:
 
 static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu)
 {
-   struct cfs_rq *cfs_rq;
struct task_group *tg;
struct sched_domain *sd;
int prev_cpu = task_cpu(p);
int cpu;
 
-   cfs_rq = top_cfs_rq_of(&p->se);
-   if (check_cpulimit_spread(cfs_rq, *new_cpu) > 0)
+   tg = top_cfs_rq_of(&p->se)->tg;
+   if (check_cpulimit_spread(tg, *new_cpu) > 0)
return false;
 
-   tg = cfs_rq->tg;
-
if (cfs_rq_active(tg->cfs_rq[*new_cpu]))
return true;
 
@@ -5084,7 +5080,7 @@ static int cpulimit_balance_cpu_stop(void *data);
 static inline void trigger_cpulimit_balance(struct task_struct *p)
 {
struct rq *this_rq;
-   struct cfs_rq *cfs_rq;
+   struct task_group *tg;
int this_cpu, cpu, target_cpu = -1;
struct sched_domain *sd;
 
@@ -5094,8 +5090,8 @@ static inline void trigger_cpulimit_balance(struct task_struct *p)
if (!p->se.on_rq || this_rq->active_balance)
return;
 
-   cfs_rq = top_cfs_rq_of(&p->se);
-   if (check_cpulimit_spread(cfs_rq, this_cpu) >= 0)
+   tg = top_cfs_rq_of(&p->se)->tg;
+   if (check_cpulimit_spread(tg, this_cpu) >= 0)
return;
 
rcu_read_lock();
@@ -5105,7 +5101,7 @@ static inline void trigger_cpulimit_balance(struct task_struct *p)
for_each_cpu_and(cpu, sched_domain_span(sd),
 tsk_cpus_allowed(p)) {
if (cpu != this_cpu &&
-   cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) {
+   cfs_rq_active(tg->cfs_rq[cpu])) {
target_cpu = cpu;
goto unlock;
}
@@ -5471,22 +5467,22 @@ static inline bool migrate_degrades_locality(struct task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-   struct cfs_rq *cfs_rq = top_cfs_rq_of(&p->se);
+   struct task_group *tg = top_cfs_rq_of(&p->se)->tg;
int tsk_cache_hot = 0;
 
-   if (check_cpulimit_spread(cfs_rq, env->dst_cpu) < 0) {
+   if (check_cpulimit_spread(tg, env->dst_cpu) < 0) {
int cpu;
 
schedstat_inc(p, se.statistics.nr_failed_migrations_cpulimit);
 
-   if (check_cpulimit_spread(cfs_rq, env->src_cpu) != 0)
+   if (check_cpulimit_spread(tg, env->src_cpu) != 0)
return 0;
 
if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
return 0;
 
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
-   if (cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) {
+   if (cfs_rq_active(tg->cfs_rq[cpu])) {
env->flags |= LBF_SOME_PINNED;
env->new_dst_cpu = cpu;
break;
@@ -5719,7 +5715,7 @@ static int move_task_group(struct cfs_rq *cfs_rq, struct lb_env *env)
 
 static int move_task_groups(struct lb_env *env)
 {
-   struct cfs_rq *cfs_rq, *top_cfs_rq;
+   struct cfs_rq *cfs_rq;
struct

[Devel] [PATCH RHEL7 COMMIT] sched: cleanup !CFS_CPULIMIT code

2016-07-14 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit 5e6211be102fd95bb3bd1c807cf91a235cc10a77
Author: Vladimir Davydov 
Date:   Thu Jul 14 20:46:44 2016 +0400

sched: cleanup !CFS_CPULIMIT code

Let's move all CFS_CPULIMIT-related functions under the CFS_CPULIMIT
ifdef. This will ease further patching.

    Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 39 ++-
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52365f6a4e36..2ff38fc1d600 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -560,11 +560,6 @@ static inline int check_cpulimit_spread(struct task_group 
*tg, int target_cpu)
return cfs_rq_active(tg->cfs_rq[target_cpu]) ? 0 : -1;
 }
 #else /* !CONFIG_CFS_CPULIMIT */
-static inline int cfs_rq_active(struct cfs_rq *cfs_rq)
-{
-   return 1;
-}
-
 static inline void inc_nr_active_cfs_rqs(struct cfs_rq *cfs_rq)
 {
 }
@@ -572,16 +567,6 @@ static inline void inc_nr_active_cfs_rqs(struct cfs_rq 
*cfs_rq)
 static inline void dec_nr_active_cfs_rqs(struct cfs_rq *cfs_rq, int postpone)
 {
 }
-
-static inline enum hrtimer_restart sched_cfs_active_timer(struct hrtimer 
*timer)
-{
-   return 0;
-}
-
-static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu)
-{
-   return 1;
-}
 #endif /* CONFIG_CFS_CPULIMIT */
 
 static __always_inline
@@ -4716,6 +4701,7 @@ done:
 
 static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu)
 {
+#ifdef CONFIG_CFS_CPULIMIT
struct task_group *tg;
struct sched_domain *sd;
int prev_cpu = task_cpu(p);
@@ -4741,6 +4727,7 @@ static inline bool select_runnable_cpu(struct task_struct 
*p, int *new_cpu)
}
}
}
+#endif
return false;
 }
 
@@ -5461,14 +5448,10 @@ static inline bool migrate_degrades_locality(struct 
task_struct *p,
 }
 #endif
 
-/*
- * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
- */
-static
-int can_migrate_task(struct task_struct *p, struct lb_env *env)
+static inline int can_migrate_task_cpulimit(struct task_struct *p, struct 
lb_env *env)
 {
+#ifdef CONFIG_CFS_CPULIMIT
struct task_group *tg = top_cfs_rq_of(&p->se)->tg;
-   int tsk_cache_hot = 0;
 
if (check_cpulimit_spread(tg, env->dst_cpu) < 0) {
int cpu;
@@ -5490,6 +5473,20 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
}
return 0;
}
+#endif
+   return 1;
+}
+
+/*
+ * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ */
+static
+int can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+   int tsk_cache_hot = 0;
+
+   if (!can_migrate_task_cpulimit(p, env))
+   return 0;
 
/*
 * We do not migrate tasks that are:
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] arch: x86: perf_event_intel: do not taint kernel when irq loop is stuck

2016-07-14 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit 098b10ef2ff6088c2e3df8949045a094ab37bf52
Author: Vladimir Davydov 
Date:   Thu Jul 14 20:42:36 2016 +0400

arch: x86: perf_event_intel: do not taint kernel when irq loop is stuck

Presumably, this happens when a perf counter overflows. It might be a
hardware bug that needs a workaround, but we don't have enough knowledge
to fix it or investigate further. Since the issue is rare and can't lead
to a system crash, we can turn a blind eye to it. Nevertheless, it taints
the kernel, which results in test failure. To avoid that, let's replace
WARN with pr_warn.

https://jira.sw.ru/browse/PSBM-49258

Signed-off-by: Vladimir Davydov 
---
 arch/x86/kernel/cpu/perf_event_intel.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index 5106e8378d96..9f2a12e5e553 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1579,8 +1579,13 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 again:
intel_pmu_ack_status(status);
if (++loops > 100) {
-   WARN_ONCE(1, "perfevents: irq loop stuck!\n");
-   perf_event_print_debug();
+   static bool warned = false;
+   if (!warned) {
+   pr_warn("perfevents: irq loop stuck!\n");
+   dump_stack();
+   perf_event_print_debug();
+   warned = true;
+   }
intel_pmu_reset();
goto done;
}
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] arch: x86: perf_event_intel: do not taint kernel when irq loop is stuck

2016-07-14 Thread Vladimir Davydov
Presumably, this happens when a perf counter overflows. It might be a
hardware bug that needs a workaround, but we don't have enough knowledge
to fix it or investigate further. Since the issue is rare and can't lead
to a system crash, we can turn a blind eye to it. Nevertheless, it taints
the kernel, which results in test failure. To avoid that, let's replace
WARN with pr_warn.

https://jira.sw.ru/browse/PSBM-49258

Signed-off-by: Vladimir Davydov 
---
 arch/x86/kernel/cpu/perf_event_intel.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index 5106e8378d96..9f2a12e5e553 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1579,8 +1579,13 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 again:
intel_pmu_ack_status(status);
if (++loops > 100) {
-   WARN_ONCE(1, "perfevents: irq loop stuck!\n");
-   perf_event_print_debug();
+   static bool warned = false;
+   if (!warned) {
+   pr_warn("perfevents: irq loop stuck!\n");
+   dump_stack();
+   perf_event_print_debug();
+   warned = true;
+   }
intel_pmu_reset();
goto done;
}
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] mm: default collapse huge pages if there's at least 1/4th ptes mapped

2016-07-13 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit ee04e199550d1a1b517306e729aaa56d9c045399
Author: Vladimir Davydov 
Date:   Wed Jul 13 20:52:40 2016 +0400

mm: default collapse huge pages if there's at least 1/4th ptes mapped

A huge page may be collapsed by khugepaged if there are no more than
khugepaged_max_ptes_none unmapped ptes (configured via sysfs). The
latter equals 511 (HPAGE_PMD_NR - 1) by default, which results in
noticeable growth in memory footprint if a process has a sparse address
space. Experiments have shown (see bug-id below) that decreasing the
threshold down to 384 (3/4*HPAGE_PMD_NR) results in no performance
degradation for VMs and CTs and at the same time improves test results
for VMs (because qemu has a sparse heap). So let's make it the default.
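
A quick check of the arithmetic (assuming 2M huge pages on x86-64, where
HPAGE_PMD_NR = 512):

	/* old default: max_ptes_none = 511 -> one mapped pte is enough to
	 *		collapse the whole 2M range;
	 * new default: max_ptes_none = 384 -> at least 512 - 384 = 128 ptes,
	 *		i.e. 1/4th of the range, must be mapped.
	 */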

https://jira.sw.ru/browse/PSBM-48885

Signed-off-by: Vladimir Davydov 
---
 mm/huge_memory.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7543156e8d39..3c23df1d3392 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -58,11 +58,10 @@ static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
 /*
- * default collapse hugepages if there is at least one pte mapped like
- * it would have happened if the vma was large enough during page
- * fault.
+ * default collapse hugepages if there is at least 1/4th ptes mapped
+ * to avoid memory footprint growth due to fragmentation
  */
-static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
+static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR*3/4;
 
 static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] fs: make overlayfs disabled in CT by default

2016-07-13 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit af1bf9e1a067c1186501ef5415acdb62a33e8c22
Author: Maxim Patlasov 
Date:   Wed Jul 13 20:52:37 2016 +0400

fs: make overlayfs disabled in CT by default

Overlayfs is in "TECH PREVIEW" state right now. If we let CT users freely
mount and exercise overlayfs, we risk crashing the whole node.

Let's disable it for CT users by default. Customers who need it (e.g. to
run Docker in CT) may enable it like this:

# echo 1 > /proc/sys/fs/experimental_fs_enable

The patch is a temporary (awkward) workaround until we make overlayfs
production-ready. Then we'll roll back the patch.

https://jira.sw.ru/browse/PSBM-49629

Signed-off-by: Maxim Patlasov 
Reviewed-by: Vladimir Davydov 
---
 fs/filesystems.c | 8 +++-
 fs/overlayfs/super.c | 2 +-
 include/linux/fs.h   | 4 
 kernel/sysctl.c  | 7 +++
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/filesystems.c b/fs/filesystems.c
index beaba560979f..670d228e9c56 100644
--- a/fs/filesystems.c
+++ b/fs/filesystems.c
@@ -16,6 +16,9 @@
 #include 
 #include 
 
+/* Affects ability of CT users to mount fs marked as FS_EXPERIMENTAL */
+int sysctl_experimental_fs_enable;
+
 /*
  * Handling of filesystem drivers list.
  * Rules:
@@ -219,7 +222,10 @@ int __init get_filesystem_list(char *buf)
 
 static inline bool filesystem_permitted(const struct file_system_type *fs)
 {
-   return ve_is_super(get_exec_env()) || (fs->fs_flags & FS_VIRTUALIZED);
+   return ve_is_super(get_exec_env()) ||
+   (fs->fs_flags & FS_VIRTUALIZED) ||
+   ((fs->fs_flags & FS_EXPERIMENTAL) &&
+sysctl_experimental_fs_enable);
 }
 
 #ifdef CONFIG_PROC_FS
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index c20cfe977cdf..d5c57b4b5983 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1129,7 +1129,7 @@ static struct file_system_type ovl_fs_type = {
.name   = "overlay",
.mount  = ovl_mount,
.kill_sb= kill_anon_super,
-   .fs_flags   = FS_VIRTUALIZED,
+   .fs_flags   = FS_EXPERIMENTAL,
 };
 MODULE_ALIAS_FS("overlay");
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7203dbadbbf9..f1c3d5be60d8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -59,6 +59,8 @@ extern struct inodes_stat_t inodes_stat;
 extern int leases_enable, lease_break_time;
 extern int sysctl_protected_symlinks;
 extern int sysctl_protected_hardlinks;
+extern int sysctl_experimental_fs_enable;
+
 
 struct buffer_head;
 typedef int (get_block_t)(struct inode *inode, sector_t iblock,
@@ -2108,6 +2110,8 @@ struct file_system_type {
 #define FS_USERNS_MOUNT8   /* Can be mounted by userns 
root */
 #define FS_USERNS_DEV_MOUNT16 /* A userns mount does not imply MNT_NODEV */
 #define FS_VIRTUALIZED 64  /* Can mount this fstype inside ve */
+#define FS_EXPERIMENTAL128 /* Ability to mount this fstype 
inside ve
+* is governed by 
experimental_fs_enable */
 #define FS_HAS_RM_XQUOTA   256 /* KABI: fs has the rm_xquota quota op 
*/
 #define FS_HAS_INVALIDATE_RANGE512 /* FS has new ->invalidatepage 
with length arg */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during 
rename() internally. */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c8f7bc34c590..e59dd3be92dd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1781,6 +1781,13 @@ static struct ctl_table fs_table[] = {
.proc_handler   = &pipe_proc_fn,
.extra1 = &pipe_min_size,
},
+   {
+   .procname   = "experimental_fs_enable",
+   .data   = &sysctl_experimental_fs_enable,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   },
{ }
 };
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls

2016-07-13 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.24
-->
commit 91a74e3b91ab9aa0d4be64b01f86a14c36b3279e
Author: Maxim Patlasov 
Date:   Wed Jul 13 20:52:28 2016 +0400

ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls

The ioctls simply freeze and thaw the ploop bdev.

If no fs is mounted over the ploop bdev being frozen, then the freeze ioctl
just increments bd_fsfreeze_count, which prevents the ploop from being
mounted until it is thawed.

https://jira.sw.ru/browse/PSBM-49091

Caveats:

 1) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one.
 2) The same for thaw.
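
A minimal userspace sketch of the intended usage (the device path is
hypothetical; assumes the PLOOP_IOC_* definitions from ploop_if.h are
available):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/ploop/ploop_if.h>

	int main(void)
	{
		int fd = open("/dev/ploop12345", O_RDONLY); /* hypothetical device */

		if (fd < 0)
			return 1;
		if (ioctl(fd, PLOOP_IOC_FREEZE, 0))	/* freeze (or just pin) the bdev */
			perror("PLOOP_IOC_FREEZE");
		/* ... back up the ploop image while it is quiescent ... */
		if (ioctl(fd, PLOOP_IOC_THAW, 0))
			perror("PLOOP_IOC_THAW");
		return 0;
	}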

[vdavydov@: allow to freeze unmounted ploop]
Signed-off-by: Maxim Patlasov 
Signed-off-by: Vladimir Davydov 
Cc: Pavel Borzenkov 
---
 drivers/block/ploop/dev.c  | 39 +++
 include/linux/ploop/ploop.h|  2 ++
 include/linux/ploop/ploop_if.h |  6 ++
 3 files changed, 47 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e5f010b9aeba..d52975eaaa36 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4815,6 +4815,39 @@ static int ploop_push_backup_stop(struct ploop_device 
*plo, unsigned long arg)
return copy_to_user((void*)arg, &ctl, sizeof(ctl));
 }
 
+static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
+{
+   struct super_block *sb = plo->sb;
+
+   if (test_bit(PLOOP_S_FROZEN, &plo->state))
+   return 0;
+
+   sb = freeze_bdev(bdev);
+   if (sb && IS_ERR(sb))
+   return PTR_ERR(sb);
+
+   plo->sb = sb;
+   set_bit(PLOOP_S_FROZEN, &plo->state);
+   return 0;
+}
+
+static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
+{
+   struct super_block *sb = plo->sb;
+   int err;
+
+   if (!test_bit(PLOOP_S_FROZEN, &plo->state))
+   return 0;
+
+   err = thaw_bdev(bdev, sb);
+   if (!err) {
+   plo->sb = NULL;
+   clear_bit(PLOOP_S_FROZEN, &plo->state);
+   }
+
+   return err;
+}
+
 static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int 
cmd,
   unsigned long arg)
 {
@@ -4928,6 +4961,12 @@ static int ploop_ioctl(struct block_device *bdev, 
fmode_t fmode, unsigned int cm
case PLOOP_IOC_PUSH_BACKUP_STOP:
err = ploop_push_backup_stop(plo, arg);
break;
+   case PLOOP_IOC_FREEZE:
+   err = ploop_freeze(plo, bdev);
+   break;
+   case PLOOP_IOC_THAW:
+   err = ploop_thaw(plo, bdev);
+   break;
default:
err = -EINVAL;
}
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index deee8a78cc96..7864edf17f19 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -61,6 +61,7 @@ enum {
   (for minor mgmt only) */
PLOOP_S_ONCE,   /* An event (e.g. printk once) happened */
PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */
+   PLOOP_S_FROZEN  /* Frozen by PLOOP_IOC_FREEZE */
 };
 
 struct ploop_snapdata
@@ -409,6 +410,7 @@ struct ploop_device
struct block_device *bdev;
struct request_queue*queue;
struct task_struct  *thread;
+   struct super_block  *sb;
struct rb_node  link;
 
/* someone who wants to quiesce state-machine waits
diff --git a/include/linux/ploop/ploop_if.h b/include/linux/ploop/ploop_if.h
index a098ca9d0ef0..302ace984a5a 100644
--- a/include/linux/ploop/ploop_if.h
+++ b/include/linux/ploop/ploop_if.h
@@ -352,6 +352,12 @@ struct ploop_track_extent
 /* Stop push backup */
 #define PLOOP_IOC_PUSH_BACKUP_STOP _IOR(PLOOPCTLTYPE, 31, struct 
ploop_push_backup_stop_ctl)
 
+/* Freeze FS mounted over ploop */
+#define PLOOP_IOC_FREEZE   _IO(PLOOPCTLTYPE, 32)
+
+/* Unfreeze FS mounted over ploop */
+#define PLOOP_IOC_THAW _IO(PLOOPCTLTYPE, 33)
+
 /* Events exposed via /sys/block/ploopN/pstate/event */
 #define PLOOP_EVENT_ABORTED1
 #define PLOOP_EVENT_STOPPED2
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 3/4] sched: cleanup !CFS_CPULIMIT code

2016-07-13 Thread Vladimir Davydov
Let's move all CFS_CPULIMIT-related functions under the CFS_CPULIMIT
ifdef. This will ease further patching.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 39 ++-
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52365f6a4e36..2ff38fc1d600 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -560,11 +560,6 @@ static inline int check_cpulimit_spread(struct task_group 
*tg, int target_cpu)
return cfs_rq_active(tg->cfs_rq[target_cpu]) ? 0 : -1;
 }
 #else /* !CONFIG_CFS_CPULIMIT */
-static inline int cfs_rq_active(struct cfs_rq *cfs_rq)
-{
-   return 1;
-}
-
 static inline void inc_nr_active_cfs_rqs(struct cfs_rq *cfs_rq)
 {
 }
@@ -572,16 +567,6 @@ static inline void inc_nr_active_cfs_rqs(struct cfs_rq 
*cfs_rq)
 static inline void dec_nr_active_cfs_rqs(struct cfs_rq *cfs_rq, int postpone)
 {
 }
-
-static inline enum hrtimer_restart sched_cfs_active_timer(struct hrtimer 
*timer)
-{
-   return 0;
-}
-
-static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu)
-{
-   return 1;
-}
 #endif /* CONFIG_CFS_CPULIMIT */
 
 static __always_inline
@@ -4716,6 +4701,7 @@ done:
 
 static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu)
 {
+#ifdef CONFIG_CFS_CPULIMIT
struct task_group *tg;
struct sched_domain *sd;
int prev_cpu = task_cpu(p);
@@ -4741,6 +4727,7 @@ static inline bool select_runnable_cpu(struct task_struct 
*p, int *new_cpu)
}
}
}
+#endif
return false;
 }
 
@@ -5461,14 +5448,10 @@ static inline bool migrate_degrades_locality(struct 
task_struct *p,
 }
 #endif
 
-/*
- * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
- */
-static
-int can_migrate_task(struct task_struct *p, struct lb_env *env)
+static inline int can_migrate_task_cpulimit(struct task_struct *p, struct 
lb_env *env)
 {
+#ifdef CONFIG_CFS_CPULIMIT
struct task_group *tg = top_cfs_rq_of(&p->se)->tg;
-   int tsk_cache_hot = 0;
 
if (check_cpulimit_spread(tg, env->dst_cpu) < 0) {
int cpu;
@@ -5490,6 +5473,20 @@ int can_migrate_task(struct task_struct *p, struct 
lb_env *env)
}
return 0;
}
+#endif
+   return 1;
+}
+
+/*
+ * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ */
+static
+int can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+   int tsk_cache_hot = 0;
+
+   if (!can_migrate_task_cpulimit(p, env))
+   return 0;
 
/*
 * We do not migrate tasks that are:
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 4/4] sched: use topmost limited ancestor for cpulimit balancing

2016-07-13 Thread Vladimir Davydov
We want to keep all processes of a container's cgroup packed on the
minimal allowed number of cpus, which is set by the cpulimit. Doing this
properly when deep hierarchies are used is tricky, if not impossible, w/o
introducing tremendous overhead, so initially we implemented this
feature exclusively for top-level cgroups. Now this isn't enough, as
containers can be created in machine.slice. So in this patch we make
cpulimit balancing work for the topmost cgroup that has a cpu limit set.
This way, no matter whether containers are created under the root or in
machine.slice, cpulimit balancing will always be applied to the container's
cgroup, as machine.slice isn't supposed to have a cpu limit set.
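
A worked example of the new rule (cgroup names are hypothetical):

	/* machine.slice           (no cpu limit) -> topmost limited ancestor
	 *                                          points to itself
	 * machine.slice/CT101     (cpulimit set) -> CT101
	 * machine.slice/CT101/foo (no limit)     -> CT101
	 *
	 * Balancing for any task in CT101's subtree is thus done against
	 * CT101, just as it used to be done for top-level container cgroups.
	 */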

https://jira.sw.ru/browse/PSBM-49203

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/core.c  | 62 
 kernel/sched/fair.c  | 36 +-
 kernel/sched/sched.h |  2 ++
 3 files changed, 69 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94deef41f05a..657b8e4ba8d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7557,6 +7557,10 @@ void __init sched_init(void)
 
 #endif /* CONFIG_CGROUP_SCHED */
 
+#ifdef CONFIG_CFS_CPULIMIT
+   root_task_group.topmost_limited_ancestor = &root_task_group;
+#endif
+
for_each_possible_cpu(i) {
struct rq *rq;
 
@@ -7882,6 +7886,8 @@ err:
return ERR_PTR(-ENOMEM);
 }
 
+static void tg_update_topmost_limited_ancestor(struct task_group *tg);
+
 void sched_online_group(struct task_group *tg, struct task_group *parent)
 {
unsigned long flags;
@@ -7894,6 +7900,9 @@ void sched_online_group(struct task_group *tg, struct 
task_group *parent)
tg->parent = parent;
INIT_LIST_HEAD(&tg->children);
list_add_rcu(&tg->siblings, &parent->children);
+
+   tg_update_topmost_limited_ancestor(tg);
+
spin_unlock_irqrestore(&task_group_lock, flags);
 }
 
@@ -8428,6 +8437,8 @@ const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 
1ms */
 
 static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
 
+static void tg_limit_toggled(struct task_group *tg);
+
 /* call with cfs_constraints_mutex held */
 static int __tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 {
@@ -8485,6 +8496,8 @@ static int __tg_set_cfs_bandwidth(struct task_group *tg, 
u64 period, u64 quota)
unthrottle_cfs_rq(cfs_rq);
raw_spin_unlock_irq(&rq->lock);
}
+   if (runtime_enabled != runtime_was_enabled)
+   tg_limit_toggled(tg);
return ret;
 }
 
@@ -8662,6 +8675,49 @@ static int cpu_stats_show(struct cgroup *cgrp, struct 
cftype *cft,
 }
 
 #ifdef CONFIG_CFS_CPULIMIT
+static int __tg_update_topmost_limited_ancestor(struct task_group *tg, void 
*unused)
+{
+   struct task_group *parent = tg->parent;
+
+   /*
+* Neither the parent nor any of its ancestors is limited? The task group
+* should become a topmost limited ancestor then, provided it has a limit set.
+* Otherwise inherit the topmost limited ancestor from the parent.
+*/
+   if (parent->topmost_limited_ancestor == parent &&
+   parent->cfs_bandwidth.quota == RUNTIME_INF)
+   tg->topmost_limited_ancestor = tg;
+   else
+   tg->topmost_limited_ancestor = parent->topmost_limited_ancestor;
+   return 0;
+}
+
+static void tg_update_topmost_limited_ancestor(struct task_group *tg)
+{
+   __tg_update_topmost_limited_ancestor(tg, NULL);
+}
+
+static void tg_limit_toggled(struct task_group *tg)
+{
+   if (tg->topmost_limited_ancestor != tg) {
+   /*
+* This task group is not a topmost limited ancestor, so both
+* it and all its children must already point to their topmost
+* limited ancestor, and we have nothing to do.
+*/
+   return;
+   }
+
+   /*
+* This task group is a topmost limited ancestor. Walk over all its
+* children and update their pointers to the topmost limited ancestor.
+*/
+
+   spin_lock_irq(&task_group_lock);
+   walk_tg_tree_from(tg, __tg_update_topmost_limited_ancestor, tg_nop, 
NULL);
+   spin_unlock_irq(&task_group_lock);
+}
+
 static void tg_update_cpu_limit(struct task_group *tg)
 {
long quota, period;
@@ -8736,6 +8792,12 @@ static int nr_cpus_write_u64(struct cgroup *cgrp, struct 
cftype *cftype,
return tg_set_cpu_limit(tg, tg->cpu_rate, nr_cpus);
 }
 #else
+static void tg_update_topmost_limited_ancestor(struct task_group *tg)
+{
+}
+static void tg_limit_toggled(struct task_group *tg)
+{
+}
 static void tg_update_cpu_limit(struct task_group *tg)
 {
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ff38fc1d600..515685f77217 100644
-

[Devel] [PATCH rh7 2/4] sched: make check_cpulimit_spread accept tg instead of cfs_rq

2016-07-13 Thread Vladimir Davydov
It only needs cfs_rq->tg, so let's pass it directly. This eases further
modifications.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 57 -
 1 file changed, 26 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70a5861d4166..52365f6a4e36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -542,9 +542,8 @@ static enum hrtimer_restart sched_cfs_active_timer(struct 
hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
-static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu)
+static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu)
 {
-   struct task_group *tg = cfs_rq->tg;
int nr_cpus_active = atomic_read(&tg->nr_cpus_active);
int nr_cpus_limit = DIV_ROUND_UP(tg->cpu_rate, MAX_CPU_RATE);
 
@@ -579,7 +578,7 @@ static inline enum hrtimer_restart 
sched_cfs_active_timer(struct hrtimer *timer)
return 0;
 }
 
-static inline int check_cpulimit_spread(struct cfs_rq *cfs_rq, int target_cpu)
+static inline int check_cpulimit_spread(struct task_group *tg, int target_cpu)
 {
return 1;
 }
@@ -4717,18 +4716,15 @@ done:
 
 static inline bool select_runnable_cpu(struct task_struct *p, int *new_cpu)
 {
-   struct cfs_rq *cfs_rq;
struct task_group *tg;
struct sched_domain *sd;
int prev_cpu = task_cpu(p);
int cpu;
 
-   cfs_rq = top_cfs_rq_of(&p->se);
-   if (check_cpulimit_spread(cfs_rq, *new_cpu) > 0)
+   tg = top_cfs_rq_of(&p->se)->tg;
+   if (check_cpulimit_spread(tg, *new_cpu) > 0)
return false;
 
-   tg = cfs_rq->tg;
-
if (cfs_rq_active(tg->cfs_rq[*new_cpu]))
return true;
 
@@ -5084,7 +5080,7 @@ static int cpulimit_balance_cpu_stop(void *data);
 static inline void trigger_cpulimit_balance(struct task_struct *p)
 {
struct rq *this_rq;
-   struct cfs_rq *cfs_rq;
+   struct task_group *tg;
int this_cpu, cpu, target_cpu = -1;
struct sched_domain *sd;
 
@@ -5094,8 +5090,8 @@ static inline void trigger_cpulimit_balance(struct 
task_struct *p)
if (!p->se.on_rq || this_rq->active_balance)
return;
 
-   cfs_rq = top_cfs_rq_of(&p->se);
-   if (check_cpulimit_spread(cfs_rq, this_cpu) >= 0)
+   tg = top_cfs_rq_of(&p->se)->tg;
+   if (check_cpulimit_spread(tg, this_cpu) >= 0)
return;
 
rcu_read_lock();
@@ -5105,7 +5101,7 @@ static inline void trigger_cpulimit_balance(struct 
task_struct *p)
for_each_cpu_and(cpu, sched_domain_span(sd),
 tsk_cpus_allowed(p)) {
if (cpu != this_cpu &&
-   cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) {
+   cfs_rq_active(tg->cfs_rq[cpu])) {
target_cpu = cpu;
goto unlock;
}
@@ -5471,22 +5467,22 @@ static inline bool migrate_degrades_locality(struct 
task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-   struct cfs_rq *cfs_rq = top_cfs_rq_of(&p->se);
+   struct task_group *tg = top_cfs_rq_of(&p->se)->tg;
int tsk_cache_hot = 0;
 
-   if (check_cpulimit_spread(cfs_rq, env->dst_cpu) < 0) {
+   if (check_cpulimit_spread(tg, env->dst_cpu) < 0) {
int cpu;
 
schedstat_inc(p, se.statistics.nr_failed_migrations_cpulimit);
 
-   if (check_cpulimit_spread(cfs_rq, env->src_cpu) != 0)
+   if (check_cpulimit_spread(tg, env->src_cpu) != 0)
return 0;
 
if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
return 0;
 
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
-   if (cfs_rq_active(cfs_rq->tg->cfs_rq[cpu])) {
+   if (cfs_rq_active(tg->cfs_rq[cpu])) {
env->flags |= LBF_SOME_PINNED;
env->new_dst_cpu = cpu;
break;
@@ -5719,7 +5715,7 @@ static int move_task_group(struct cfs_rq *cfs_rq, struct 
lb_env *env)
 
 static int move_task_groups(struct lb_env *env)
 {
-   struct cfs_rq *cfs_rq, *top_cfs_rq;
+   struct cfs_rq *cfs_rq;
struct task_group *tg;
unsigned long load;
int cur_pulled, pulled = 0;
@@ -5728,8 +5724,7 @@ static int move_task_groups(struct lb_env *env)
return 0;
 
for_each_leaf_cfs_rq(env->src_rq, cfs_rq) {
-   tg = cfs_rq->tg;
-   if (tg == &root_task_group)
+   if (cfs_rq->tg == &root_task_group)

[Devel] [PATCH rh7 1/4] sched: account task_group->nr_cpus_active for all cgroups

2016-07-13 Thread Vladimir Davydov
Currently nr_cpus_active is only accounted for top-level cgroups,
because container cgroups, which are the only users of this counter,
could only be created under the root cgroup.

Now things have changed, and containers can reside either under the root
or under machine.slice or in any other cgroup depending on the host's
config. So we can't preserve this little optimization anymore. Remove
it.

Signed-off-by: Vladimir Davydov 
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd25e1e8ae5b..70a5861d4166 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3039,7 +3039,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-   if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight)
+   if (!cfs_rq->load.weight)
inc_nr_active_cfs_rqs(cfs_rq);
 
/*
@@ -3163,7 +3163,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
update_min_vruntime(cfs_rq);
update_cfs_shares(cfs_rq);
 
-   if (is_top_cfs_rq(cfs_rq) && !cfs_rq->load.weight)
+   if (!cfs_rq->load.weight)
dec_nr_active_cfs_rqs(cfs_rq, flags & DEQUEUE_TASK_SLEEP);
 }
 
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 0/4] sched: fix degradation caused by moving containers to machine.slice

2016-07-13 Thread Vladimir Davydov
We have a hack in the scheduler that makes containers' processes run on
the minimal allowed number of cpus, which dramatically improves
performance of some tests if a container has 1 or 2 cpus. This hack
depends on the fact that containers are located under the root cgroup,
so moving them to machine.slice broke it. This patch set fixes it.

https://jira.sw.ru/browse/PSBM-49203

Vladimir Davydov (4):
  sched: account task_group->nr_cpus_active for all cgroups
  sched: make check_cpulimit_spread accept tg instead of cfs_rq
  sched: cleanup !CFS_CPULIMIT code
  sched: use topmost limited ancestor for cpulimit balancing

 kernel/sched/core.c  |  62 ++
 kernel/sched/fair.c  | 120 ++-
 kernel/sched/sched.h |   2 +
 3 files changed, 107 insertions(+), 77 deletions(-)

-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 v2] ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls

2016-07-13 Thread Vladimir Davydov
From: Maxim Patlasov 

The ioctls simply freeze and thaw the ploop bdev.

If no fs is mounted over the ploop bdev being frozen, then the freeze ioctl
just increments bd_fsfreeze_count, which prevents the ploop from being
mounted until it is thawed.

https://jira.sw.ru/browse/PSBM-49091

Caveats:

 1) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one.
 2) The same for thaw.

[vdavydov@: allow to freeze unmounted ploop]
Signed-off-by: Maxim Patlasov 
Signed-off-by: Vladimir Davydov 
Cc: Pavel Borzenkov 
---
Changes in v2:
 - avoid patching generic code

 drivers/block/ploop/dev.c  | 39 +++
 include/linux/ploop/ploop.h|  2 ++
 include/linux/ploop/ploop_if.h |  6 ++
 3 files changed, 47 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e5f010b9aeba..d52975eaaa36 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4815,6 +4815,39 @@ static int ploop_push_backup_stop(struct ploop_device 
*plo, unsigned long arg)
return copy_to_user((void*)arg, &ctl, sizeof(ctl));
 }
 
+static int ploop_freeze(struct ploop_device *plo, struct block_device *bdev)
+{
+   struct super_block *sb = plo->sb;
+
+   if (test_bit(PLOOP_S_FROZEN, &plo->state))
+   return 0;
+
+   sb = freeze_bdev(bdev);
+   if (sb && IS_ERR(sb))
+   return PTR_ERR(sb);
+
+   plo->sb = sb;
+   set_bit(PLOOP_S_FROZEN, &plo->state);
+   return 0;
+}
+
+static int ploop_thaw(struct ploop_device *plo, struct block_device *bdev)
+{
+   struct super_block *sb = plo->sb;
+   int err;
+
+   if (!test_bit(PLOOP_S_FROZEN, &plo->state))
+   return 0;
+
+   err = thaw_bdev(bdev, sb);
+   if (!err) {
+   plo->sb = NULL;
+   clear_bit(PLOOP_S_FROZEN, &plo->state);
+   }
+
+   return err;
+}
+
 static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int 
cmd,
   unsigned long arg)
 {
@@ -4928,6 +4961,12 @@ static int ploop_ioctl(struct block_device *bdev, 
fmode_t fmode, unsigned int cm
case PLOOP_IOC_PUSH_BACKUP_STOP:
err = ploop_push_backup_stop(plo, arg);
break;
+   case PLOOP_IOC_FREEZE:
+   err = ploop_freeze(plo, bdev);
+   break;
+   case PLOOP_IOC_THAW:
+   err = ploop_thaw(plo, bdev);
+   break;
default:
err = -EINVAL;
}
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index deee8a78cc96..7864edf17f19 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -61,6 +61,7 @@ enum {
   (for minor mgmt only) */
PLOOP_S_ONCE,   /* An event (e.g. printk once) happened */
PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */
+   PLOOP_S_FROZEN  /* Frozen by PLOOP_IOC_FREEZE */
 };
 
 struct ploop_snapdata
@@ -409,6 +410,7 @@ struct ploop_device
struct block_device *bdev;
struct request_queue*queue;
struct task_struct  *thread;
+   struct super_block  *sb;
struct rb_node  link;
 
/* someone who wants to quiesce state-machine waits
diff --git a/include/linux/ploop/ploop_if.h b/include/linux/ploop/ploop_if.h
index a098ca9d0ef0..302ace984a5a 100644
--- a/include/linux/ploop/ploop_if.h
+++ b/include/linux/ploop/ploop_if.h
@@ -352,6 +352,12 @@ struct ploop_track_extent
 /* Stop push backup */
 #define PLOOP_IOC_PUSH_BACKUP_STOP _IOR(PLOOPCTLTYPE, 31, struct 
ploop_push_backup_stop_ctl)
 
+/* Freeze FS mounted over ploop */
+#define PLOOP_IOC_FREEZE   _IO(PLOOPCTLTYPE, 32)
+
+/* Unfreeze FS mounted over ploop */
+#define PLOOP_IOC_THAW _IO(PLOOPCTLTYPE, 33)
+
 /* Events exposed via /sys/block/ploopN/pstate/event */
 #define PLOOP_EVENT_ABORTED1
 #define PLOOP_EVENT_STOPPED2
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 1/4] fs: do not fail on double freeze bdev w/o sb

2016-07-13 Thread Vladimir Davydov
On Tue, Jul 12, 2016 at 03:02:11PM -0700, Maxim Patlasov wrote:
> Let's not mix apples and oranges. It seems we can easily satisfy the
> push-backup needs by implementing freeze/thaw ploop ioctls without touching
> generic code at all; see the patch in the attachment (unless I missed
> something obvious). And apart from this ploop/push-backup stuff, if you
> think your changes to freeze_bdev() and thaw_bdev() are useful, send them
> upstream, and we'll back-port them later, when they are accepted upstream
> (unless I missed some scenario where those changes matter for us). In other
> words, I think we have to keep our vz7 generic code base closer to ms,
> unless we have a good reason to deviate.

Agreed. Generally, I like your patch more than mine, but I have a concern
about it - see below.

> 
> On 07/12/2016 03:04 AM, Vladimir Davydov wrote:
> >It's possible to freeze a bdev which is not mounted. In this case
> >freeze_bdev() only increments bd_fsfreeze_count in order to prevent the
> >bdev from being mounted and does nothing else. A second freeze attempt
> >on the same device is supposed to increment bd_fsfreeze_count again, but
> >it results in a NULL ptr dereference, because freeze_bdev() doesn't check
> >the return value of get_super(). Fix that.
> >
> >Signed-off-by: Vladimir Davydov 
> >---
> >  fs/block_dev.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/block_dev.c b/fs/block_dev.c
> >index 4575c62d8b0b..325ee7161fbf 100644
> >--- a/fs/block_dev.c
> >+++ b/fs/block_dev.c
> >@@ -227,7 +227,8 @@ struct super_block *freeze_bdev(struct block_device 
> >*bdev)
> >  * thaw_bdev drops it.
> >  */
> > sb = get_super(bdev);
> >-drop_super(sb);
> >+if (sb)
> >+drop_super(sb);
> > mutex_unlock(&bdev->bd_fsfreeze_mutex);
> > return sb;
> > }
> 

> The ioctls simply freeze and thaw ploop bdev.
> 
> Caveats:
> 
> 1) If no fs mounted, the ioctls have no effect.
> 2) No nested freeze: many PLOOP_IOC_FREEZE ioctls have the same effect as one.
> 3) The same for thaw.

I think #2 and #3 are OK. But regarding #1: what if we want to make a
backup of a secondary ploop which is not mounted? We would try to freeze
it and succeed, but it wouldn't actually be frozen, so it could be mounted
and modified while we're backing it up, which is incorrect AFAIU.

What about something like this on top of your patch?

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 9a9cc8b0b934..d52975eaaa36 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4819,16 +4819,15 @@ static int ploop_freeze(struct ploop_device *plo, 
struct block_device *bdev)
 {
struct super_block *sb = plo->sb;
 
-   if (sb)
+   if (test_bit(PLOOP_S_FROZEN, &plo->state))
return 0;
 
sb = freeze_bdev(bdev);
if (sb && IS_ERR(sb))
return PTR_ERR(sb);
-   if (!sb)
-   thaw_bdev(bdev, sb);
 
plo->sb = sb;
+   set_bit(PLOOP_S_FROZEN, &plo->state);
return 0;
 }
 
@@ -4837,12 +4836,14 @@ static int ploop_thaw(struct ploop_device *plo, struct 
block_device *bdev)
struct super_block *sb = plo->sb;
int err;
 
-   if (!sb)
+   if (!test_bit(PLOOP_S_FROZEN, &plo->state))
return 0;
 
err = thaw_bdev(bdev, sb);
-   if (!err)
+   if (!err) {
plo->sb = NULL;
+   clear_bit(PLOOP_S_FROZEN, &plo->state);
+   }
 
return err;
 }
diff --git a/include/linux/ploop/ploop.h b/include/linux/ploop/ploop.h
index 6ae96c4486fe..7864edf17f19 100644
--- a/include/linux/ploop/ploop.h
+++ b/include/linux/ploop/ploop.h
@@ -61,6 +61,7 @@ enum {
   (for minor mgmt only) */
PLOOP_S_ONCE,   /* An event (e.g. printk once) happened */
PLOOP_S_PUSH_BACKUP,/* Push_backup is in progress */
+   PLOOP_S_FROZEN  /* Frozen by PLOOP_IOC_FREEZE */
 };
 
 struct ploop_snapdata
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm: default collapse huge pages if there's at least 1/4th ptes mapped

2016-07-12 Thread Vladimir Davydov
A huge page may be collapsed by khugepaged if there are no more than
khugepaged_max_ptes_none unmapped ptes (configured via sysfs). The
latter equals 511 (HPAGE_PMD_NR - 1) by default, which results in
noticeable growth in memory footprint if a process has a sparse address
space. Experiments have shown (see bug-id below) that decreasing the
threshold down to 384 (3/4*HPAGE_PMD_NR) results in no performance
degradation for VMs and CTs and at the same time improves test results
for VMs (because qemu has a sparse heap). So let's make it the default.

https://jira.sw.ru/browse/PSBM-48885

Signed-off-by: Vladimir Davydov 
---
 mm/huge_memory.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7543156e8d39..3c23df1d3392 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -58,11 +58,10 @@ static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
 /*
- * default collapse hugepages if there is at least one pte mapped like
- * it would have happened if the vma was large enough during page
- * fault.
+ * default collapse hugepages if there is at least 1/4th ptes mapped
+ * to avoid memory footprint growth due to fragmentation
  */
-static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
+static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR*3/4;
 
 static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 3/4] fs: export get_active_super

2016-07-12 Thread Vladimir Davydov
It is required by the next patch.

Signed-off-by: Vladimir Davydov 
---
 fs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/super.c b/fs/super.c
index 50ac29391b96..e52b7db23c8f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -680,6 +680,7 @@ restart:
spin_unlock(&sb_lock);
return NULL;
 }
+EXPORT_SYMBOL(get_active_super);
  
 struct super_block *user_get_super(dev_t dev)
 {
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 4/4] ploop: add PLOOP_IOC_FREEZE and PLOOP_IOC_THAW ioctls

2016-07-12 Thread Vladimir Davydov
These ioctls simply freeze and thaw the ploop bdev, respectively (i.e. the
FS mounted over the ploop device). They are required by ploop push backup
for freezing secondary ploops mounted inside containers. The point is that
these mount points are not shown in the host's /proc/mounts due to mount
namespaces, so there's no easy way for the push backup process to get the
mount point given a device name in order to call the FIFREEZE ioctl.
(Actually, there is a way, using /proc/PID/mounts and /proc/PID/root,
where PID is the pid of a container's process, but it's cumbersome.)
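
For comparison, a sketch of the standard route, which needs a path that is
visible only inside the CT's mount namespace (the path below is
hypothetical):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/* works only if the caller can resolve the CT-private mount point */
	static int freeze_ct_mount(const char *path) /* e.g. "/proc/$PID/root/mnt" */
	{
		int fd = open(path, O_RDONLY);

		return fd < 0 ? -1 : ioctl(fd, FIFREEZE, 0);
	}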

https://jira.sw.ru/browse/PSBM-49091

Signed-off-by: Vladimir Davydov 
Cc: Pavel Borzenkov 
---
 drivers/block/ploop/dev.c  | 28 
 include/linux/ploop/ploop_if.h |  6 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index e5f010b9aeba..d2b3c9fd9176 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4815,6 +4815,28 @@ static int ploop_push_backup_stop(struct ploop_device 
*plo, unsigned long arg)
return copy_to_user((void*)arg, &ctl, sizeof(ctl));
 }
 
+static int ploop_freeze(struct block_device *bdev)
+{
+   struct super_block *sb;
+
+   sb = freeze_bdev(bdev);
+   if (sb && IS_ERR(sb))
+   return PTR_ERR(sb);
+   return 0;
+}
+
+static int ploop_thaw(struct block_device *bdev)
+{
+   struct super_block *sb;
+   int err;
+
+   sb = get_active_super(bdev);
+   err = thaw_bdev(bdev, sb);
+   if (sb)
+   deactivate_super(sb);
+   return err;
+}
+
 static int ploop_ioctl(struct block_device *bdev, fmode_t fmode, unsigned int 
cmd,
   unsigned long arg)
 {
@@ -4928,6 +4950,12 @@ static int ploop_ioctl(struct block_device *bdev, 
fmode_t fmode, unsigned int cm
case PLOOP_IOC_PUSH_BACKUP_STOP:
err = ploop_push_backup_stop(plo, arg);
break;
+   case PLOOP_IOC_FREEZE:
+   err = ploop_freeze(bdev);
+   break;
+   case PLOOP_IOC_THAW:
+   err = ploop_thaw(bdev);
+   break;
default:
err = -EINVAL;
}
diff --git a/include/linux/ploop/ploop_if.h b/include/linux/ploop/ploop_if.h
index a098ca9d0ef0..302ace984a5a 100644
--- a/include/linux/ploop/ploop_if.h
+++ b/include/linux/ploop/ploop_if.h
@@ -352,6 +352,12 @@ struct ploop_track_extent
 /* Stop push backup */
 #define PLOOP_IOC_PUSH_BACKUP_STOP _IOR(PLOOPCTLTYPE, 31, struct 
ploop_push_backup_stop_ctl)
 
+/* Freeze FS mounted over ploop */
+#define PLOOP_IOC_FREEZE   _IO(PLOOPCTLTYPE, 32)
+
+/* Unfreeze FS mounted over ploop */
+#define PLOOP_IOC_THAW _IO(PLOOPCTLTYPE, 33)
+
 /* Events exposed via /sys/block/ploopN/pstate/event */
 #define PLOOP_EVENT_ABORTED1
 #define PLOOP_EVENT_STOPPED2
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 2/4] fs: fix thaw_bdev return value in case bdev is not frozen

2016-07-12 Thread Vladimir Davydov
We should return -EINVAL in this case, but instead we return 0.
Also, remove a duplicate code block while we're here.

Signed-off-by: Vladimir Davydov 
---
 fs/block_dev.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 325ee7161fbf..0310d6402cf5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -274,14 +274,11 @@ int thaw_bdev(struct block_device *bdev, struct 
super_block *sb)
goto out;
 
error = thaw_super(sb);
-   if (error) {
+   if (error)
bdev->bd_fsfreeze_count++;
-   mutex_unlock(&bdev->bd_fsfreeze_mutex);
-   return error;
-   }
 out:
mutex_unlock(&bdev->bd_fsfreeze_mutex);
-   return 0;
+   return error;
 }
 EXPORT_SYMBOL(thaw_bdev);
 
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/4] fs: do not fail on double freeze bdev w/o sb

2016-07-12 Thread Vladimir Davydov
It's possible to freeze a bdev which is not mounted. In this case
freeze_bdev() only increments bd_fsfreeze_count in order to prevent the
bdev from being mounted and does nothing else. A second freeze attempt
on the same device is supposed to increment bd_fsfreeze_count again, but
it results in a NULL ptr dereference, because freeze_bdev() doesn't check
the return value of get_super(). Fix that.

Signed-off-by: Vladimir Davydov 
---
 fs/block_dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4575c62d8b0b..325ee7161fbf 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -227,7 +227,8 @@ struct super_block *freeze_bdev(struct block_device *bdev)
 * thaw_bdev drops it.
 */
sb = get_super(bdev);
-   drop_super(sb);
+   if (sb)
+   drop_super(sb);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
return sb;
}
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default (v2)

2016-07-08 Thread Vladimir Davydov
On Thu, Jul 07, 2016 at 01:00:36PM -0700, Maxim Patlasov wrote:
> Overlayfs is in "TECH PREVIEW" state right now. If we let CT users freely
> mount and exercise overlayfs, we risk crashing the whole node.
> 
> Let's disable it for CT users by default. Customers who need it (e.g. to
> run Docker in CT) may enable it like this:
> 
> # echo 1 > /proc/sys/fs/experimental_fs_enable
> 
> The patch is a temporary (awkward) workaround until we make overlayfs
> production-ready. Then we'll roll back the patch.
> 
> Changed in v2:
>  - let's only leave system-wide sysctl for permitting overlayfs; the sysctl
>is "rw" in ve0, but "ro" inside CT.
> 
> https://jira.sw.ru/browse/PSBM-47981

Reviewed-by: Vladimir Davydov 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: carefully check for user charges while reparenting

2016-07-07 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.23
-->
commit 50feb34e21530c4b27eae2836c4630d2b104476f
Author: Vladimir Davydov 
Date:   Thu Jul 7 18:33:25 2016 +0400

mm: memcontrol: carefully check for user charges while reparenting

kmem is uncharged before res, therefore when checking if there are still
user charges in a memory cgroup, we should read res before kmem,
otherwise a kmem uncharge can get in between the two reads, leading to
a false-positive res <= kmem. Add smp_rmb() to guarantee this never
happens.
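
An illustration of the interleaving the read order rules out (recall that a
kmem uncharge decrements kmem first, then res):

	reparenting CPU				uncharging CPU
	kmem = read(memcg->kmem)	/* old, larger value */
						memcg->kmem -= N
						memcg->res  -= N
	res = read(memcg->res)		/* new, smaller value */

	-> res - kmem underestimates the user charges and may hit zero while
	   user pages are still charged, ending the reparent loop too early.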

Note, since x86 doesn't reorder reads, this patch doesn't actually
introduce any functional changes - it just clarifies the code.

Fixes: 35c0d2a992aa ("mm: memcontrol: fix race between kmem uncharge and 
charge reparenting")
Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f525f27e481..8151d4259c6b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4814,7 +4814,7 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup 
*memcg,
 static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
int node, zid;
-   u64 usage;
+   u64 res, kmem;
 
do {
/* This is for making all *used* pages to be on LRU. */
@@ -4845,10 +4845,17 @@ static void mem_cgroup_reparent_charges(struct 
mem_cgroup *memcg)
 * so the lru seemed empty but the page could have been added
 * right after the check. RES_USAGE should be safe as we always
 * charge before adding to the LRU.
+*
+* Note, we must read memcg->res strictly before memcg->kmem,
+* because otherwise a kmem charge might get uncharged in
+* between the two reads leading to res <= kmem, even though
+* there are still user pages charged to this cgroup out there.
+* (see also comment in memcg_charge_kmem())
 */
-   usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
-   res_counter_read_u64(&memcg->kmem, RES_USAGE);
-   } while (usage > 0);
+   res = res_counter_read_u64(&memcg->res, RES_USAGE);
+   smp_rmb();
+   kmem = res_counter_read_u64(&memcg->kmem, RES_USAGE);
+   } while (res > kmem);
 }
 
 /*
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] dcache: fix dentry leak when shrink races with kill

2016-07-07 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.23
-->
commit 88b8ca0fe3055767d9b92495b3e28f587b923d04
Author: Vladimir Davydov 
Date:   Thu Jul 7 18:33:13 2016 +0400

dcache: fix dentry leak when shrink races with kill

dentry_kill() does not free a dentry in case it is on a shrink list -
see __dentry_kill() -> d_free(). Instead it just marks it
DCACHE_MAY_FREE, which will make the shrinker free it when it's done
with it. This is required to avoid use after free in
shrink_dentry_list(). This logic was back-ported by commit e33cae748d1a
("ms/dcache: dentry_kill(): don't try to remove from shrink list").

When back-porting this commit, I accidentally missed a hunk for
shrink_dentry_list(). The hunk makes shrink_dentry_list() more carefully
check dentry->d_lock_ref.count, i.e. instead of merely checking if it's
0 or not, it makes it check if it's strictly greater than 0. W/o this
check a dentry might leak if shrink races with kill, because before
trying to free a dentry, dentry_kill() first calls
lockref_mark_dead(&dentry->d_lockref), which sets d_lockref.count to
-128, so that shrink_dentry_list() will silently skip the dentry instead
of freeing it.

This patch resurrects the missing hunk.
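
A sketch of why the strict comparison matters (lockref_mark_dead() sets
d_lockref.count to -128, as described above):

	/* dentry being killed concurrently: d_lockref.count == -128 */
	if (dentry->d_lockref.count)		/* old check: true -> dentry
						   skipped and leaked */
		;
	if ((int)dentry->d_lockref.count > 0)	/* new check: false -> dentry
						   freed via DCACHE_MAY_FREE */
		;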

https://jira.sw.ru/browse/PSBM-49321

Fixes: e33cae748d1a ("ms/dcache: dentry_kill(): don't try to remove from 
shrink list")
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 09ed486c9f1d..6433814a02d2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -874,7 +874,7 @@ static void shrink_dentry_list(struct list_head *list)
 * We found an inuse dentry which was not removed from
 * the LRU because of laziness during lookup. Do not free it.
 */
-   if (dentry->d_lockref.count) {
+   if ((int)dentry->d_lockref.count > 0) {
spin_unlock(&dentry->d_lock);
if (parent)
spin_unlock(&parent->d_lock);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] propogate_mnt: Handle the first propogated copy being a slave

2016-07-07 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.23
-->
commit 529c728822cb387ac470cd10969c5b1458467a94
Author: Eric W. Biederman 
Date:   Thu Jul 7 18:33:19 2016 +0400

propogate_mnt: Handle the first propogated copy being a slave

When the first propagated copy was a slave, the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0010
> IP: [] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops:  [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: 8800bb0a8000 ti: 8800bac3c000 task.ti: 8800bac3c000
> RIP: 0010:[]  [] 
propagate_one+0xbe/0x1c0
> RSP: 0018:8800bac3fd38  EFLAGS: 00010283
> RAX:  RBX: 8800bb77ec00 RCX: 0010
> RDX:  RSI: 8800bb58c000 RDI: 8800bb58c480
> RBP: 8800bac3fd48 R08: 0001 R09: 
> R10: 1ca1 R11: 1c9d R12: 
> R13: 8800ba713800 R14: 8800bac3fda0 R15: 8800bb77ec00
> FS:  7f3c0cd9b7e0() GS:8800bfb0() 
knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0010 CR3: bb79d000 CR4: 06e0
> Stack:
>  8800bb77ec00  8800bac3fd88 811fbf85
>  8800bac3fd98 8800bb77f080 8800ba713800 8800bb262b40
>    8800bac3fdd8 811f1da0
> Call Trace:
>  [] propagate_mnt+0x105/0x140
>  [] attach_recursive_mnt+0x120/0x1e0
>  [] graft_tree+0x63/0x70
>  [] do_add_mount+0x9b/0x100
>  [] do_mount+0x2aa/0xdf0
>  [] ? strndup_user+0x4e/0x70
>  [] SyS_mount+0x75/0xc0
>  [] do_syscall_64+0x4b/0xa0
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 
22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 
89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP  [] propagate_one+0xbe/0x1c0
>  RSP 
> CR2: 0010
> ---[ end trace 2725ecd95164f217 ]---

This oops happens with the namespace_sem held and can be triggered by
non-root users.  An all around not pleasant experience.

To avoid this scenario, when finding the appropriate source mount to
copy, stop the walk up the mnt_master chain when the first source mount
is encountered.

Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.

The reason why the first source mount is special is that its
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such, termination conditions based upon the dest_mnt mount propagation
tree do not make sense.

To avoid other kinds of confusion last_dest is not changed when
computing last_source.  last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.

Cc: sta...@vger.kernel.org
fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen 
Reviewed-by: Seth Forshee 
Tested-by: Seth Forshee 
    Signed-off-by: "Eric W. Biederman" 
(cherry picked from commit 5ec0811d30378ae104f250bfc9b3640242d81e3f)
Signed-off-by: Vladimir Davydov 
Fixes: CVE-2016-4581

Conflicts:
fs/pnode.c
---
 fs/pnode.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 74f10c0e7e00..cc9ac074ba00 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -198,7 +198,7 @@ static struct mount *next_group(struct mount *m, struct 
mount *origin)
 
 /* all accesses are serialized by namespace_sem */
 static struct user_namespace *user_ns;
-static struct mount *last_dest, *last_source, *dest_master;
+static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct mountpoint *mp;
 static struct list_head *list;
 
@@ -216,22 +216,23 @@ static int propagate_one(struct mount *m)
type = CL_MAKE_SHARED;
} else {
struct mount *n, *p;
+   bool done;
for (n = m; ; n = p) {
p = n->mnt_master;
-   if (p == dest_master || IS_MNT_MARKED(p)) {
-   while (last_dest->mnt_master != p) {

[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: fix race between user memory reparent and charge

2016-07-07 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.23
-->
commit 9c3a30958ef74ed70bcc92ecf3727d36dbe463a0
Author: Vladimir Davydov 
Date:   Thu Jul 7 18:33:30 2016 +0400

mm: memcontrol: fix race between user memory reparent and charge

When a memory cgroup is destroyed (via rmdir), user memory pages
accounted to it get recharged to the parent cgroup - see
mem_cgroup_css_offline() and mem_cgroup_reparent_charges(). If, for some
reason, a page is left charged to the destroyed cgroup after
mem_cgroup_reparent_charges() was done, we might get a use-after-free,
because user memory charges do not hold a reference to the cgroup.

And it seems to be possible in case reparenting races with
__mem_cgroup_try_charge() as follows:

  __mem_cgroup_try_charge
   get memcg from mm, inc ref
mem_cgroup_css_offline
 mem_cgroup_reparent_charges
   charge page to memcg
   put ref to memcg

To fix this issue, let's make __mem_cgroup_try_charge() cancel charge if
it detects that the cgroup was destroyed.

https://jira.sw.ru/browse/PSBM-49117

    Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8151d4259c6b..e3a16b99ccc6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -314,6 +314,7 @@ struct mem_cgroup {
 * Should the accounting and control be hierarchical, per subtree?
 */
bool use_hierarchy;
+   bool is_offline;
unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */
 
booloom_lock;
@@ -2952,6 +2953,23 @@ again:
}
} while (ret != CHARGE_OK);
 
+   /*
+* Cancel charge in case this cgroup was destroyed while we were here,
+* otherwise we can get a pending user memory charge to an offline
+* cgroup, which might result in use-after-free after the cgroup gets
+* released (see also mem_cgroup_css_offline()).
+*
+* Note, no need to issue an explicit barrier here, because a
+* successful charge implies full memory barrier.
+*/
+   if (unlikely(memcg->is_offline)) {
+   res_counter_uncharge(&memcg->res, batch * PAGE_SIZE);
+   if (do_swap_account)
+   res_counter_uncharge(&memcg->memsw, batch * PAGE_SIZE);
+   css_put(&memcg->css);
+   goto bypass;
+   }
+
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
 
@@ -6657,6 +6675,15 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
 {
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+   /*
+* Mark memory cgroup as offline before going to reparent charges.
+* This guarantees that __mem_cgroup_try_charge() either charges before
+* reparenting starts or doesn't charge at all, hence we won't have
+* pending user memory charges after reparenting is done.
+*/
+   memcg->is_offline = true;
+   smp_mb();
+
memcg_deactivate_kmem(memcg);
 
mem_cgroup_invalidate_reclaim_iterators(memcg);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] overlayfs: verify upper dentry before unlink and rename

2016-07-07 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.23
-->
commit 4791f46af35069ef00b582a7f229974f6b16e5bd
Author: Maxim Patlasov 
Date:   Thu Jul 7 18:32:38 2016 +0400

overlayfs: verify upper dentry before unlink and rename

Without this patch it is easy to crash the node by fiddling
with overlayfs dirs. Backport commit 11f37104 from ms:

From: Miklos Szeredi 

ovl: verify upper dentry before unlink and rename

Unlink and rename in overlayfs checked the upper dentry for staleness by
verifying upper->d_parent against upperdir.  However the dentry can go
stale also by being unhashed, for example.

Expand the verification to actually look up the name again (under parent
lock) and check if it matches the upper dentry.  This matches what the VFS
does before passing the dentry to the filesystem's unlink/rename methods, which
excludes any inconsistency caused by overlayfs.

Signed-off-by: Miklos Szeredi 

https://jira.sw.ru/browse/PSBM-47981
---
 fs/overlayfs/dir.c | 59 +++---
 1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 33c47712d16f..229b9e4be3bd 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -596,21 +596,25 @@ static int ovl_remove_upper(struct dentry *dentry, bool 
is_dir)
 {
struct dentry *upperdir = ovl_dentry_upper(dentry->d_parent);
struct inode *dir = upperdir->d_inode;
-   struct dentry *upper = ovl_dentry_upper(dentry);
+   struct dentry *upper;
int err;
 
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
+   upper = lookup_one_len(dentry->d_name.name, upperdir,
+  dentry->d_name.len);
+   err = PTR_ERR(upper);
+   if (IS_ERR(upper))
+   goto out_unlock;
+
err = -ESTALE;
-   if (upper->d_parent == upperdir) {
-   /* Don't let d_delete() think it can reset d_inode */
-   dget(upper);
+   if (upper == ovl_dentry_upper(dentry)) {
if (is_dir)
err = vfs_rmdir(dir, upper);
else
err = vfs_unlink(dir, upper, NULL);
-   dput(upper);
ovl_dentry_version_inc(dentry->d_parent);
}
+   dput(upper);
 
/*
 * Keeping this dentry hashed would mean having to release
@@ -619,6 +623,7 @@ static int ovl_remove_upper(struct dentry *dentry, bool 
is_dir)
 * now.
 */
d_drop(dentry);
+out_unlock:
mutex_unlock(&dir->i_mutex);
 
return err;
@@ -839,29 +844,39 @@ static int ovl_rename2(struct inode *olddir, struct 
dentry *old,
 
trap = lock_rename(new_upperdir, old_upperdir);
 
-   olddentry = ovl_dentry_upper(old);
-   newdentry = ovl_dentry_upper(new);
-   if (newdentry) {
+
+   olddentry = lookup_one_len(old->d_name.name, old_upperdir,
+  old->d_name.len);
+   err = PTR_ERR(olddentry);
+   if (IS_ERR(olddentry))
+   goto out_unlock;
+
+   err = -ESTALE;
+   if (olddentry != ovl_dentry_upper(old))
+   goto out_dput_old;
+
+   newdentry = lookup_one_len(new->d_name.name, new_upperdir,
+  new->d_name.len);
+   err = PTR_ERR(newdentry);
+   if (IS_ERR(newdentry))
+   goto out_dput_old;
+
+   err = -ESTALE;
+   if (ovl_dentry_upper(new)) {
if (opaquedir) {
-   newdentry = opaquedir;
-   opaquedir = NULL;
+   if (newdentry != opaquedir)
+   goto out_dput;
} else {
-   dget(newdentry);
+   if (newdentry != ovl_dentry_upper(new))
+   goto out_dput;
}
} else {
new_create = true;
-   newdentry = lookup_one_len(new->d_name.name, new_upperdir,
-  new->d_name.len);
-   err = PTR_ERR(newdentry);
-   if (IS_ERR(newdentry))
-   goto out_unlock;
+   if (!d_is_negative(newdentry) &&
+   (!new_opaque || !ovl_is_whiteout(newdentry)))
+   goto out_dput;
}
 
-   err = -ESTALE;
-   if (olddentry->d_parent != old_upperdir)
-   goto out_dput;
-   if (newdentry->d_parent != new_upperdir)
-   goto out_dput;
if (olddentry == trap)
goto out_dput;
if (newdentry == trap)
@@ -917,6 +932,8 @@ static int ovl_rename2(struct inode *olddir, struct dentry 
*old,
 
 out_dput:
dput(newdentry);
+out_dput_old:
+   dput(olddentry);
 out_unlock:
   

[Devel] [PATCH rh7] mm: memcontrol: fix race between user memory reparent and charge

2016-07-07 Thread Vladimir Davydov
When a memory cgroup is destroyed (via rmdir), user memory pages
accounted to it get recharged to the parent cgroup - see
mem_cgroup_css_offline() and mem_cgroup_reparent_charges(). If, for some
reason, a page is left charged to the destroyed cgroup after
mem_cgroup_reparent_charges() was done, we might get a use-after-free,
because user memory charges do not hold a reference to the cgroup.

And it seems to be possible if reparenting races with
__mem_cgroup_try_charge() as follows:

  __mem_cgroup_try_charge
    get memcg from mm, inc ref
                                    mem_cgroup_css_offline
                                      mem_cgroup_reparent_charges
    charge page to memcg
    put ref to memcg

To fix this issue, let's make __mem_cgroup_try_charge() cancel the
charge if it detects that the cgroup was destroyed.

https://jira.sw.ru/browse/PSBM-49117

Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8151d4259c6b..e3a16b99ccc6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -314,6 +314,7 @@ struct mem_cgroup {
 * Should the accounting and control be hierarchical, per subtree?
 */
bool use_hierarchy;
+   bool is_offline;
unsigned long kmem_account_flags; /* See KMEM_ACCOUNTED_*, below */
 
booloom_lock;
@@ -2952,6 +2953,23 @@ again:
}
} while (ret != CHARGE_OK);
 
+   /*
+* Cancel charge in case this cgroup was destroyed while we were here,
+* otherwise we can get a pending user memory charge to an offline
+* cgroup, which might result in use-after-free after the cgroup gets
+* released (see also mem_cgroup_css_offline()).
+*
+* Note, no need to issue an explicit barrier here, because a
+* successful charge implies full memory barrier.
+*/
+   if (unlikely(memcg->is_offline)) {
+   res_counter_uncharge(&memcg->res, batch * PAGE_SIZE);
+   if (do_swap_account)
+   res_counter_uncharge(&memcg->memsw, batch * PAGE_SIZE);
+   css_put(&memcg->css);
+   goto bypass;
+   }
+
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
 
@@ -6657,6 +6675,15 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
 {
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
 
+   /*
+* Mark memory cgroup as offline before going to reparent charges.
+* This guarantees that __mem_cgroup_try_charge() either charges before
+* reparenting starts or doesn't charge at all, hence we won't have
+* pending user memory charges after reparenting is done.
+*/
+   memcg->is_offline = true;
+   smp_mb();
+
memcg_deactivate_kmem(memcg);
 
mem_cgroup_invalidate_reclaim_iterators(memcg);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm: memcontrol: carefully check for user charges while reparenting

2016-07-07 Thread Vladimir Davydov
kmem is uncharged before res, therefore when checking if there are still
user charges in a memory cgroup, we should read res before kmem,
otherwise a kmem uncharge can get in between the two reads, leading to
a false-positive res <= kmem. Add smp_rmb() to guarantee this never
happens.
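
A worked interleaving, with the two reads reordered by the CPU
(illustrative numbers: 1 user page and 10 kmem pages charged,
so res = 11, kmem = 10):

  reparenting loop                     kmem uncharge path
  ----------------                     ------------------
  kmem = read(memcg->kmem)  /* 10 */
                                       uncharge kmem: kmem -> 0
                                       uncharge res:  res  -> 1
  res = read(memcg->res)    /* 1 */
  /* res (1) <= kmem (10): the loop exits although one user page
     is still charged to the cgroup */

Reading res strictly before kmem, with smp_rmb() between the two reads,
rules this scenario out.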

Note, since x86 doesn't reorder reads, this patch doesn't actually
introduce any functional changes - it just clarifies the code.

Fixes: 35c0d2a992aa ("mm: memcontrol: fix race between kmem uncharge and charge 
reparenting")
Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f525f27e481..8151d4259c6b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4814,7 +4814,7 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup 
*memcg,
 static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
 {
int node, zid;
-   u64 usage;
+   u64 res, kmem;
 
do {
/* This is for making all *used* pages to be on LRU. */
@@ -4845,10 +4845,17 @@ static void mem_cgroup_reparent_charges(struct 
mem_cgroup *memcg)
 * so the lru seemed empty but the page could have been added
 * right after the check. RES_USAGE should be safe as we always
 * charge before adding to the LRU.
+*
+* Note, we must read memcg->res strictly before memcg->kmem,
+* because otherwise a kmem charge might get uncharged in
+* between the two reads leading to res <= kmem, even though
+* there are still user pages charged to this cgroup out there.
+* (see also comment in memcg_charge_kmem())
 */
-   usage = res_counter_read_u64(&memcg->res, RES_USAGE) -
-   res_counter_read_u64(&memcg->kmem, RES_USAGE);
-   } while (usage > 0);
+   res = res_counter_read_u64(&memcg->res, RES_USAGE);
+   smp_rmb();
+   kmem = res_counter_read_u64(&memcg->kmem, RES_USAGE);
+   } while (res > kmem);
 }
 
 /*
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-07 Thread Vladimir Davydov
On Wed, Jul 06, 2016 at 10:33:07AM -0700, Maxim Patlasov wrote:
> On 07/06/2016 02:26 AM, Vladimir Davydov wrote:
> 
> >On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote:
> >>Vova,
> >>
> >>
> >>On 07/04/2016 11:03 AM, Maxim Patlasov wrote:
> >>>On 07/04/2016 08:53 AM, Vladimir Davydov wrote:
> >>>
> >>>>On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
> >>>>...
> >>>>>@@ -643,6 +643,7 @@ static struct cgroup_subsys_state
> >>>>>*ve_create(struct cgroup *cg)
> >>>>>ve->odirect_enable = 2;
> >>>>>  ve->fsync_enable = 2;
> >>>>>+ve->experimental_fs_enable = 2;
> >>>>For odirect_enable and fsync_enable, 2 means follow the host's config, 1
> >>>>means enable unconditionally, and 0 means disable unconditionally. But
> >>>>we don't want to allow a user inside a CT to enable this feature, right?
> >>>I thought it's OK to allow user inside CT to enable it if host sysadmin is
> >>>OK about it. The same logic as for odirect: by default
> >>>ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this
> >>>knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the
> >>>feature becomes enabled. If an user wants to voluntarily disable it inside
> >>>CT, that's OK too.
> >>>
> >>>>This is confusing. May be, we'd better add a new VE_FEATURE for the
> >>>>purpose?
> >>>Not sure right now. I'll look at it and let you know later.
> >>Technically, it is very easy to implement new VE_FEATURE for overlayfs. But
> >>this approach is less flexible because we return EPERM from ve_write_64 if
> >>CT is running, and we'll need to involve userspace team to make the feature
> >>configurable and (possibly) persistent. Do you think it's worthy for
> >>something we'll get rid of soon anyway (I mean as soon as PSBM-47981
> >>resolved)?
> >Fair enough, not much point in introducing yet another feature for the
> >purpose, at least right now, sysctl should do for the beginning.
> >
> >Come to think of it, do we really need this sysctl inside containers? I
> >mean, by enabling this sysctl on the host we open a possible system-wide
> >security hole, which a CT admin won't be able to mitigate by disabling
> >overlayfs inside her CT. So what would she need it for? To prevent
> >non-privileged CT users from mounting overlayfs inside a user ns? But
> >overlayfs is not permitted to be mounted by a userns root anyway AFAICS.
> >Maybe just drop the in-CT sysctl then?
> 
> Currently, anyone who can log in to a CT as root may mount overlayfs, then
> try to exploit its weaknesses. This is a problem.
> 
> Until we ensure that overlayfs is production-ready (at least does not have
> obvious breaches), let's disable it by default (of course, if ve != ve0).
> Those who want to play with overlayfs at their own risk will enable it by
> turning on some knob on the host system (ve == ve0).
> 
> I don't think that mixing trusted (overlayfs-enabled) CTs and untrusted
> (overlayfs-disabled) CTs on the same physical node is an important use-case
> for now. So, any simple system-wide knob must work.

> Essentially, the same scheme as with odirect: by default it is '0' in ve0
> and the root inside a CT cannot turn it on; and if it is manually set to
> '1' in ve0, the behavior will depend on the per-CT root's will.

No, that's not how it works. AFAICS (see may_use_odirect),

  ve0 sysctl   ve sysctl   odirect allowed in ve?
  x            0           0
  x            1           1
  x            2           x

i.e. the system-wide sysctl can't be used to disallow odirect inside a VE,
while you want a different behavior AFAIU - you want to enable overlayfs
if both ve0 sysctl and ve sysctl are set. That's why the patch looks
confusing to me.
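
For reference, a minimal sketch of the check semantics the table
describes (field and helper names here are illustrative, not the actual
kernel API; see may_use_odirect() for the real logic):

	/* Sketch: per-VE odirect permission. */
	static bool sketch_may_use_odirect(struct ve_struct *ve)
	{
		if (ve->odirect_enable != 2)  /* 0 or 1: the VE decides alone */
			return ve->odirect_enable;
		return ve0_odirect_enable;    /* 2: follow the host (ve0) config */
	}

Note how a VE value of 0 or 1 never consults the host setting - which is
exactly why the host cannot use its sysctl to disallow odirect in a VE.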

Let's only leave the system-wide sysctl for permitting overlayfs. The VE
sysctl doesn't make any sense - only the root user is allowed to mount
overlayfs inside a CT, and she can set this sysctl anyway.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH] propogate_mnt: Handle the first propogated copy being a slave

2016-07-06 Thread Vladimir Davydov
From: "Eric W. Biederman" 

When the first propagated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0010
> IP: [] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops:  [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: 8800bb0a8000 ti: 8800bac3c000 task.ti: 8800bac3c000
> RIP: 0010:[]  [] propagate_one+0xbe/0x1c0
> RSP: 0018:8800bac3fd38  EFLAGS: 00010283
> RAX:  RBX: 8800bb77ec00 RCX: 0010
> RDX:  RSI: 8800bb58c000 RDI: 8800bb58c480
> RBP: 8800bac3fd48 R08: 0001 R09: 
> R10: 1ca1 R11: 1c9d R12: 
> R13: 8800ba713800 R14: 8800bac3fda0 R15: 8800bb77ec00
> FS:  7f3c0cd9b7e0() GS:8800bfb0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0010 CR3: bb79d000 CR4: 06e0
> Stack:
>  8800bb77ec00  8800bac3fd88 811fbf85
>  8800bac3fd98 8800bb77f080 8800ba713800 8800bb262b40
>    8800bac3fdd8 811f1da0
> Call Trace:
>  [] propagate_mnt+0x105/0x140
>  [] attach_recursive_mnt+0x120/0x1e0
>  [] graft_tree+0x63/0x70
>  [] do_add_mount+0x9b/0x100
>  [] do_mount+0x2aa/0xdf0
>  [] ? strndup_user+0x4e/0x70
>  [] SyS_mount+0x75/0xc0
>  [] do_syscall_64+0x4b/0xa0
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 
> 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 
> 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP  [] propagate_one+0xbe/0x1c0
>  RSP 
> CR2: 0010
> ---[ end trace 2725ecd95164f217 ]---

This oops happens with the namespace_sem held and can be triggered by
non-root users.  An all around not pleasant experience.

To avoid this scenario, when finding the appropriate source mount to
copy, stop the walk up the mnt_master chain when the first source mount
is encountered.

Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.

The reason why the first source mount is special is that its
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such, termination conditions based upon the dest_mnt propagation
tree do not make sense.

To avoid other kinds of confusion, last_dest is not changed when
computing last_source. last_dest is only used once in propagate_one,
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.
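
For reference, here is a sketch of the rewritten walk as it looks in the
upstream commit (the vz7 backport may differ slightly, given the
conflicts noted below):

		struct mount *n, *p;
		bool done;

		/* find the closest dominating mount in the propagation tree */
		for (n = m; ; n = p) {
			p = n->mnt_master;
			if (p == dest_master || IS_MNT_MARKED(p))
				break;
		}
		/* walk last_source up its mnt_master chain, stopping at the
		 * first source mount, whose parent lies outside the tree */
		do {
			struct mount *parent = last_source->mnt_parent;
			if (last_source == first_source)
				break;
			done = parent->mnt_master == p;
			if (done && peers(n, parent))
				break;
			last_source = last_source->mnt_master;
		} while (!done);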

Cc: sta...@vger.kernel.org
fixes: f2ebb3a921c1ca1e2ddd9242e95a1989a50c4c68 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen 
Reviewed-by: Seth Forshee 
Tested-by: Seth Forshee 
Signed-off-by: "Eric W. Biederman" 
(cherry picked from commit 5ec0811d30378ae104f250bfc9b3640242d81e3f)
Signed-off-by: Vladimir Davydov 
Fixes: CVE-2016-4581

Conflicts:
fs/pnode.c
---
 fs/pnode.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/pnode.c b/fs/pnode.c
index 74f10c0e7e00..cc9ac074ba00 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -198,7 +198,7 @@ static struct mount *next_group(struct mount *m, struct 
mount *origin)
 
 /* all accesses are serialized by namespace_sem */
 static struct user_namespace *user_ns;
-static struct mount *last_dest, *last_source, *dest_master;
+static struct mount *last_dest, *first_source, *last_source, *dest_master;
 static struct mountpoint *mp;
 static struct list_head *list;
 
@@ -216,22 +216,23 @@ static int propagate_one(struct mount *m)
type = CL_MAKE_SHARED;
} else {
struct mount *n, *p;
+   bool done;
for (n = m; ; n = p) {
p = n->mnt_master;
-   if (p == dest_master || IS_MNT_MARKED(p)) {
-   while (last_dest->mnt_master != p) {
-   last_source = last_source->mnt_master;
-   last_dest = last_source->mnt_parent;
-   }
-   if (n->mnt_group_id != last_dest->mnt_group_id 
||
-   (!n->mnt_group_id &&
-!last_dest->mnt_group_id)) {
-   last_source = last_source->mnt_master;
-   last_dest = last_source->mnt_parent;
-   }
+   if (p == dest_master || IS_MNT_MARKED(p))

Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-06 Thread Vladimir Davydov
On Tue, Jul 05, 2016 at 04:45:10PM -0700, Maxim Patlasov wrote:
> Vova,
> 
> 
> On 07/04/2016 11:03 AM, Maxim Patlasov wrote:
> >On 07/04/2016 08:53 AM, Vladimir Davydov wrote:
> >
> >>On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
> >>...
> >>>@@ -643,6 +643,7 @@ static struct cgroup_subsys_state
> >>>*ve_create(struct cgroup *cg)
> >>>ve->odirect_enable = 2;
> >>>  ve->fsync_enable = 2;
> >>>+ve->experimental_fs_enable = 2;
> >>For odirect_enable and fsync_enable, 2 means follow the host's config, 1
> >>means enable unconditionally, and 0 means disable unconditionally. But
> >>we don't want to allow a user inside a CT to enable this feature, right?
> >
> >I thought it's OK to allow user inside CT to enable it if host sysadmin is
> >OK about it. The same logic as for odirect: by default
> >ve0->experimental_fs_enable = 0, so whatever user inside CT writes to this
> >knob, the feature is disabled. If sysadmin writes '1' to ve0->..., the
> >feature becomes enabled. If an user wants to voluntarily disable it inside
> >CT, that's OK too.
> >
> >>This is confusing. May be, we'd better add a new VE_FEATURE for the
> >>purpose?
> >
> >Not sure right now. I'll look at it and let you know later.
> 
> Technically, it is very easy to implement new VE_FEATURE for overlayfs. But
> this approach is less flexible because we return EPERM from ve_write_64 if
> CT is running, and we'll need to involve userspace team to make the feature
> configurable and (possibly) persistent. Do you think it's worthy for
> something we'll get rid of soon anyway (I mean as soon as PSBM-47981
> resolved)?

Fair enough, not much point in introducing yet another feature for the
purpose, at least right now, sysctl should do for the beginning.

Come to think of it, do we really need this sysctl inside containers? I
mean, by enabling this sysctl on the host we open a possible system-wide
security hole, which a CT admin won't be able to mitigate by disabling
overlayfs inside her CT. So what would she need it for? To prevent
non-privileged CT users from mounting overlayfs inside a user ns? But
overlayfs is not permitted to be mounted by a userns root anyway AFAICS.
Maybe just drop the in-CT sysctl then?
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] dcache: fix dentry leak when shrink races with kill

2016-07-05 Thread Vladimir Davydov
dentry_kill() does not free a dentry in case it is on a shrink list -
see __dentry_kill() -> d_free(). Instead it just marks it
DCACHE_MAY_FREE, which will make the shrinker free it when it's done
with it. This is required to avoid use after free in
shrink_dentry_list(). This logic was back-ported by commit e33cae748d1a
("ms/dcache: dentry_kill(): don't try to remove from shrink list").

When back-porting this commit I accidentally missed a hunk for
shrink_dentry_list(). The hunk makes shrink_dentry_list() check
dentry->d_lockref.count more carefully, i.e. instead of merely checking
whether it is 0 or not, it checks whether it is strictly greater than 0.
Without this check a dentry might leak if shrink races with kill, because
before trying to free a dentry, dentry_kill() first calls
lockref_mark_dead(&dentry->d_lockref), which sets d_lockref.count to
-128, so that shrink_dentry_list() will silently skip the dentry instead
of freeing it.

This patch resurrects the missing hunk.
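
To illustrate the failure mode (a standalone sketch, not the kernel code):

	/* Sketch: classifying a dentry by its lockref count;
	 * lockref_mark_dead() sets the count to -128 for a dying dentry. */
	int count = -128;
	bool in_use_old = count != 0;     /* true  -> dentry skipped, leaked */
	bool in_use_new = (int)count > 0; /* false -> shrinker may free it  */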

https://jira.sw.ru/browse/PSBM-49321

Fixes: e33cae748d1a ("ms/dcache: dentry_kill(): don't try to remove from shrink 
list")
Signed-off-by: Vladimir Davydov 
---
 fs/dcache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 09ed486c9f1d..6433814a02d2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -874,7 +874,7 @@ static void shrink_dentry_list(struct list_head *list)
 * We found an inuse dentry which was not removed from
 * the LRU because of laziness during lookup. Do not free it.
 */
-   if (dentry->d_lockref.count) {
+   if ((int)dentry->d_lockref.count > 0) {
spin_unlock(&dentry->d_lock);
if (parent)
spin_unlock(&parent->d_lock);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] fs: make overlayfs disabled in CT by default

2016-07-04 Thread Vladimir Davydov
On Tue, Jun 28, 2016 at 03:48:54PM -0700, Maxim Patlasov wrote:
...
> @@ -643,6 +643,7 @@ static struct cgroup_subsys_state *ve_create(struct 
> cgroup *cg)
>  
>   ve->odirect_enable = 2;
>   ve->fsync_enable = 2;
> + ve->experimental_fs_enable = 2;

For odirect_enable and fsync_enable, 2 means follow the host's config, 1
means enable unconditionally, and 0 means disable unconditionally. But
we don't want to allow a user inside a CT to enable this feature, right?
This is confusing. May be, we'd better add a new VE_FEATURE for the
purpose?

>  
>  #ifdef CONFIG_VE_IPTABLES
>   ve->ipt_mask = ve_setup_iptables_mask(VE_IP_DEFAULT);
> 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] config: disable numa balancing by default

2016-07-04 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.22
-->
commit 4699d7601d1d8c567c3f59dcaf4e3b1bde327710
Author: Vladimir Davydov 
Date:   Mon Jul 4 18:11:21 2016 +0400

config: disable numa balancing by default

It results in a degradation on the LAMP DVD Store benchmark.

https://jira.sw.ru/browse/PSBM-49131

Signed-off-by: Vladimir Davydov 
---
 configs/kernel-3.10.0-x86_64-debug.config | 2 ++
 configs/kernel-3.10.0-x86_64.config   | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/configs/kernel-3.10.0-x86_64-debug.config 
b/configs/kernel-3.10.0-x86_64-debug.config
index 4142b41946ce..d65f0ea5ea17 100644
--- a/configs/kernel-3.10.0-x86_64-debug.config
+++ b/configs/kernel-3.10.0-x86_64-debug.config
@@ -5470,6 +5470,8 @@ CONFIG_QUOTA_COMPAT=y
 
 CONFIG_BLK_DEV_NBD=m
 
+CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=n
+
 # disabled in debug RHEL7 config by default
 CONFIG_PANIC_ON_OOPS=y
 CONFIG_PANIC_ON_OOPS_VALUE=1
diff --git a/configs/kernel-3.10.0-x86_64.config 
b/configs/kernel-3.10.0-x86_64.config
index 3be34a6bcea5..53e103bf438c 100644
--- a/configs/kernel-3.10.0-x86_64.config
+++ b/configs/kernel-3.10.0-x86_64.config
@@ -5442,6 +5442,8 @@ CONFIG_QUOTA_COMPAT=y
 
 CONFIG_BLK_DEV_NBD=m
 
+CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=n
+
 #
 # OpenVZ
 #
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] ploop: reloc vs extent_conversion race fix

2016-07-03 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.22
-->
commit bbc7ec7ef8492f736fa0cce85f4a609fb7cb80df
Author: Dmitry Monakhov 
Date:   Sun Jul 3 21:38:12 2016 +0400

ploop: reloc vs extent_conversion race fix

We have fixed most relocation bugs while working on
https://jira.sw.ru/browse/PSBM-47107

Currently reloc_a looks as follows:

 1->read_data_from_old_pos
 2->write_to_new_pos
   ->submit_alloc
     ->submit_pad
     ->post_submit->convert_unwritten
 3->update_index ->write_page with FLUSH|FUA
 4->nullify_old_pos
 5->issue_flush

But at step 3 the extent conversion is not yet stable, because it belongs
to an uncommitted transaction. We MUST call ->fsync inside ->post_submit
as we do for REQ_FUA requests. Let's tag relocation requests as FUA from
the very beginning in order to ensure sync semantics.

https://jira.sw.ru/browse/PSBM-49143
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 40768b6ef2c5..e5f010b9aeba 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4097,7 +4097,7 @@ static void ploop_relocate(struct ploop_device * plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = 0;
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_A);
preq->error = 0;
@@ -4401,7 +4401,7 @@ static void ploop_relocblks_process(struct ploop_device 
*plo)
preq->bl.tail = preq->bl.head = NULL;
preq->req_cluster = ~0U; /* uninitialized */
preq->req_size = 0;
-   preq->req_rw = WRITE_SYNC;
+   preq->req_rw = WRITE_SYNC|REQ_FUA;
preq->eng_state = PLOOP_E_ENTRY;
preq->state = (1 << PLOOP_REQ_SYNC) | (1 << PLOOP_REQ_RELOC_S);
preq->error = 0;
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 2/2] Revert "ve/vmscan: do not throttle kthreads due to too_many_isolated"

2016-06-27 Thread Vladimir Davydov
This reverts commit 5ce7561a6b0a517fcf4fbcd8a1b00dab0ddd4222.

Not needed any longer as the previous patch fixed the issue in a
different way.

Signed-off-by: Vladimir Davydov 
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ac08ddf50b8..06ff6972ef22 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1415,7 +1415,7 @@ static int __too_many_isolated(struct zone *zone, int 
file,
 static int too_many_isolated(struct zone *zone, int file,
 struct scan_control *sc)
 {
-   if (current->flags & PF_KTHREAD)
+   if (current_is_kswapd())
return 0;
 
if (!global_reclaim(sc))
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/2] mm: vmscan: never wait on writeback pages

2016-06-27 Thread Vladimir Davydov
Currently, if memcg reclaim encounters a page under writeback it waits
for the writeback to finish. This is done in order to avoid hitting OOM
when there are a lot of potentially reclaimable pages under writeback,
as memcg lacks dirty pages limit. Although it saves us from premature
OOM, this technique is deadlock prone if writeback is supposed to be
done by a process that might need to allocate memory, as in the case of
vstorage. If the process responsible for writeback tries to allocate a
page, it might get stuck in the too_many_isolated() loop waiting for
processes performing memcg reclaim to put isolated pages back to the
LRU, but memcg reclaim might be stuck waiting for writeback to complete,
resulting in a deadlock.
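
Schematically (vstorage stands for any process responsible for writeback):

  writeback process (e.g. vstorage)      task doing memcg reclaim
  ---------------------------------      ------------------------
  tries to allocate a page               isolates pages from the LRU
  loops in too_many_isolated(),          finds a page under writeback,
  waiting for isolated pages             waits for writeback that only
  to go back to the LRU                  the blocked process can do

Neither side can make progress.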

To avoid this kind of deadlock, let's, instead of waiting for page
writeback directly, call congestion_wait() after returning isolated
pages to the LRU in case writeback pages are recycled through the LRU
before IO can complete. This should still prevent premature memcg OOM
while rendering the deadlock described above impossible.

https://jira.sw.ru/browse/PSBM-48115

Signed-off-by: Vladimir Davydov 
---
 mm/vmscan.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f6ce18df3ed..3ac08ddf50b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -929,11 +929,11 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
 *__GFP_IO|__GFP_FS for this reason); but more thought
 *would probably show more reasons.
 *
-* 3) memcg encounters a page that is not already marked
+* 3) memcg encounters a page that is already marked
 *PageReclaim. memcg does not have any dirty pages
 *throttling so we could easily OOM just because too many
 *pages are in writeback and there is nothing else to
-*reclaim. Wait for the writeback to complete.
+*reclaim. Stall memcg reclaim then.
 */
if (PageWriteback(page)) {
/* Case 1 above */
@@ -954,7 +954,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
 * enough to care.  What we do want is for this
 * page to have PageReclaim set next time memcg
 * reclaim reaches the tests above, so it will
-* then wait_on_page_writeback() to avoid OOM;
+* then stall to avoid OOM;
 * and it's also appropriate in global reclaim.
 */
SetPageReclaim(page);
@@ -964,7 +964,8 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
 
/* Case 3 above */
} else {
-   wait_on_page_writeback(page);
+   nr_immediate++;
+   goto keep_locked;
}
}
 
@@ -1586,10 +1587,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
if (nr_writeback && nr_writeback == nr_taken)
zone_set_flag(zone, ZONE_WRITEBACK);
 
-   /*
-* memcg will stall in page writeback so only consider forcibly
-* stalling for global reclaim
-*/
+   if (!global_reclaim(sc) && nr_immediate)
+   congestion_wait(BLK_RW_ASYNC, HZ/10);
+
if (global_reclaim(sc)) {
/*
 * Tag a zone as congested if all the dirty pages scanned were
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] ub: account memory overcommit failures in UB_PRIVVMPAGES.failcnt

2016-06-27 Thread Vladimir Davydov
On Mon, Jun 27, 2016 at 12:50:10PM +0300, Andrey Ryabinin wrote:
> If an allocation fails due to memory overcommit, fail counters don't change.
> This contradicts userspace expectations.
> With this patch, such failures will be accounted in the fail counter of
> UB_PRIVVMPAGES.
> 
> https://jira.sw.ru/browse/PSBM-48891
> 
> Signed-off-by: Andrey Ryabinin 

Reviewed-by: Vladimir Davydov 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] Resurrect proc fairsched files

2016-06-24 Thread Vladimir Davydov
They are still required by userspace, which checks for their presence.
Leave them empty.

https://jira.sw.ru/browse/PSBM-48824

Signed-off-by: Vladimir Davydov 
---
 kernel/ve/veowner.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/kernel/ve/veowner.c b/kernel/ve/veowner.c
index 86065072a9ca..757dde99ef0f 100644
--- a/kernel/ve/veowner.c
+++ b/kernel/ve/veowner.c
@@ -36,12 +36,38 @@
 struct proc_dir_entry *proc_vz_dir;
 EXPORT_SYMBOL(proc_vz_dir);
 
+static int proc_fairsched_open(struct inode *inode, struct file *file)
+{
+   return 0;
+}
+
+static ssize_t proc_fairsched_read(struct file *file, char __user *buf,
+  size_t size, loff_t *ppos)
+{
+   return 0;
+}
+
+static struct file_operations proc_fairsched_operations = {
+   .open   = proc_fairsched_open,
+   .read   = proc_fairsched_read,
+   .llseek = noop_llseek,
+};
+
 static void prepare_proc(void)
 {
proc_vz_dir = proc_mkdir_mode("vz", S_ISVTX | S_IRUGO | S_IXUGO, NULL);
if (!proc_vz_dir)
panic("Can't create /proc/vz dir\n");
+
+   /* Legacy files. They are not really needed and should be removed
+* sooner or later, but leave the stubs for now as they may be required
+* by userspace */
+
proc_mkdir_mode("container", 0, proc_vz_dir);
+   proc_mkdir_mode("fairsched", 0, proc_vz_dir);
+
+   proc_create("fairsched", S_ISVTX, NULL, &proc_fairsched_operations);
+   proc_create("fairsched2", S_ISVTX, NULL, &proc_fairsched_operations);
 }
 #endif
 
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] ve/fs: namespace -- Don't fail on permissions if @ve->devmnt_list is empty

2016-06-23 Thread Vladimir Davydov
On Thu, Jun 23, 2016 at 01:34:18PM +0300, Cyrill Gorcunov wrote:
> In commit 7eeb5b4afa8db5a2f2e1e47ab6b84e55fc8c5661 I addressed
> the first half of a problem, but I happened to work with a dirty copy
> of libvzctl where the mount_opts cgroup had been c/r'ed manually,
> so I missed the case where @devmnt_list is empty on restore
> (just like it is in vanilla libvzctl). So fix the second half.
> 
> https://jira.sw.ru/browse/PSBM-48188
> 
> Reported-by: Igor Sukhih 
> Signed-off-by: Cyrill Gorcunov 
> CC: Vladimir Davydov 
> CC: Konstantin Khorenko 

Reviewed-by: Vladimir Davydov 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 3/4] net: ipip: fix crash on newlink if VE_FEATURE_IPIP is disabled

2016-06-23 Thread Vladimir Davydov
In this case net_generic returns NULL. We must handle this gracefully.

Signed-off-by: Vladimir Davydov 
---
 net/ipv4/ipip.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 7842dcb2fd65..b1004fb7539c 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -357,6 +357,9 @@ static int ipip_newlink(struct net *src_net, struct 
net_device *dev,
 {
struct ip_tunnel_parm p;
 
+   if (net_generic(dev_net(dev), ipip_net_id) == NULL)
+   return -EACCES;
+
ipip_netlink_parms(data, &p);
return ip_tunnel_newlink(dev, tb, &p);
 }
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 4/4] net: sit: fix crash on newlink if VE_FEATURE_SIT is disabled

2016-06-23 Thread Vladimir Davydov
In this case net_generic returns NULL. We must handle this gracefully.

Signed-off-by: Vladimir Davydov 
---
 net/ipv6/sit.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 2a73b520d3bf..6b1ae3b06be9 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1441,6 +1441,9 @@ static int ipip6_newlink(struct net *src_net, struct 
net_device *dev,
 #endif
int err;
 
+   if (net_generic(net, sit_net_id) == NULL)
+   return -EACCES;
+
nt = netdev_priv(dev);
ipip6_netlink_parms(data, &nt->parms);
 
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/4] net: ipip: enable in container

2016-06-23 Thread Vladimir Davydov
Currently, we fail to init ipip per-net in a ve, because it has neither
NETIF_F_VIRTUAL nor NETIF_F_NETNS_LOCAL:

 ipip_init_net
  ip_tunnel_init_net
   __ip_tunnel_create
register_netdevice
 ve_is_dev_movable

In PCS6 ipip has NETIF_F_NETNS_LOCAL, so everything works fine there,
but this restriction was removed in the RH7 kernel, so we fail to start a
container if ipip is loaded (or load ipip if there are containers
running).

Mark ipip as NETIF_F_VIRTUAL to fix this issue.

https://jira.sw.ru/browse/PSBM-48608

Signed-off-by: Vladimir Davydov 
---
 net/ipv4/ipip.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index e556a1df5a57..7842dcb2fd65 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -301,6 +301,7 @@ static void ipip_tunnel_setup(struct net_device *dev)
netif_keep_dst(dev);
 
dev->features   |= IPIP_FEATURES;
+   dev->features   |= NETIF_F_VIRTUAL;
dev->hw_features|= IPIP_FEATURES;
ip_tunnel_setup(dev, ipip_net_id);
 }
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 2/4] net: ip_vti: skip per net init in ve

2016-06-23 Thread Vladimir Davydov
ip_vti devices lack NETIF_F_VIRTUAL, so they can't be created inside a
container. The problem is that a device of this kind is created on net ns
init if the module is loaded; as a result, container start fails with EPERM.

We could allow ip_vti inside a container (as well as other net devices,
which I would really like to do), but this is insecure and might break
migration, so let's keep it disabled and fix the issue by silently
skipping ip_vti per-net init if running inside a ve.

https://jira.sw.ru/browse/PSBM-48698

Signed-off-by: Vladimir Davydov 
---
 net/ipv4/ip_vti.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index ce80a9a1be9d..3158100646ed 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -58,6 +58,9 @@ static int vti_input(struct sk_buff *skb, int nexthdr, __be32 
spi,
struct net *net = dev_net(skb->dev);
struct ip_tunnel_net *itn = net_generic(net, vti_net_id);
 
+   if (itn == NULL)
+   return -EINVAL;
+
tunnel = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY,
  iph->saddr, iph->daddr, 0);
if (tunnel != NULL) {
@@ -256,6 +259,9 @@ static int vti4_err(struct sk_buff *skb, u32 info)
int protocol = iph->protocol;
struct ip_tunnel_net *itn = net_generic(net, vti_net_id);
 
+   if (itn == NULL)
+   return -1;
+
tunnel = ip_tunnel_lookup(itn, skb->dev->ifindex, TUNNEL_NO_KEY,
  iph->daddr, iph->saddr, 0);
if (!tunnel)
@@ -413,6 +419,9 @@ static int __net_init vti_init_net(struct net *net)
int err;
struct ip_tunnel_net *itn;
 
+   if (!ve_is_super(net->owner_ve))
+   return net_assign_generic(net, vti_net_id, NULL);
+
err = ip_tunnel_init_net(net, vti_net_id, &vti_link_ops, "ip_vti0");
if (err)
return err;
@@ -424,6 +433,9 @@ static int __net_init vti_init_net(struct net *net)
 static void __net_exit vti_exit_net(struct net *net)
 {
struct ip_tunnel_net *itn = net_generic(net, vti_net_id);
+
+   if (itn == NULL)
+   return;
ip_tunnel_delete_net(itn, &vti_link_ops);
 }
 
@@ -473,6 +485,9 @@ static int vti_newlink(struct net *src_net, struct 
net_device *dev,
 {
struct ip_tunnel_parm parms;
 
+   if (net_generic(dev_net(dev), vti_net_id) == NULL)
+   return -EACCES;
+
vti_netlink_parms(data, &parms);
return ip_tunnel_newlink(dev, tb, &parms);
 }
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] ve/cpustat: don't try to update vcpustats for root_task_group

2016-06-22 Thread Vladimir Davydov
On Wed, Jun 22, 2016 at 03:59:05PM +0300, Andrey Ryabinin wrote:
> root_task_group doesn't have vcpu stats. An attempt to update those leads
> to a NULL-ptr deref:
> 
>   BUG: unable to handle kernel NULL pointer dereference at   
> (null)
>   IP: [] cpu_cgroup_update_vcpustat+0x13c/0x620
>   ...
>   Call Trace:
>[] cpu_cgroup_get_stat+0x7b/0x180
>[] ve_get_cpu_stat+0x27/0x70
>[] fill_cpu_stat+0x91/0x1e0 [vzmon]
>[] vzcalls_ioctl+0x2bb/0x430 [vzmon]
>[] vzctl_ioctl+0x45/0x60 [vzdev]
>[] do_vfs_ioctl+0x255/0x4f0
>[] SyS_ioctl+0x54/0xa0
>[] system_call_fastpath+0x16/0x1b
> 
> So, return -ENOENT if we asked for vcpu stats of root_task_group.
> 
> https://jira.sw.ru/browse/PSBM-48721
> 
> Signed-off-by: Andrey Ryabinin 

Reviewed-by: Vladimir Davydov 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 2/2] cgroup: un-export cgroup_kernel_* and zap cgroup_kernel_remove

2016-06-22 Thread Vladimir Davydov
After fairsched is gone, cgroup_kernel_remove is not used any more, so
drop it. The cgroup_kernel_* family of functions is now used only by
beancounters, which is part of the kernel, so un-export them.

Signed-off-by: Vladimir Davydov 
---
 include/linux/cgroup.h |  1 -
 kernel/cgroup.c| 26 --
 2 files changed, 27 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 730ca9091bfb..b34239dcdb52 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -55,7 +55,6 @@ struct cgroup *cgroup_kernel_lookup(struct vfsmount *mnt,
const char *pathname);
 struct cgroup *cgroup_kernel_open(struct cgroup *parent,
enum cgroup_open_flags flags, const char *name);
-int cgroup_kernel_remove(struct cgroup *parent, const char *name);
 int cgroup_kernel_attach(struct cgroup *cgrp, struct task_struct *tsk);
 void cgroup_kernel_close(struct cgroup *cgrp);
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 581924e7af9e..1c047b9bb1fb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5669,13 +5669,11 @@ struct vfsmount *cgroup_kernel_mount(struct 
cgroup_sb_opts *opts)
 {
return kern_mount_data(&cgroup_fs_type, opts);
 }
-EXPORT_SYMBOL(cgroup_kernel_mount);
 
 struct cgroup *cgroup_get_root(struct vfsmount *mnt)
 {
return mnt->mnt_root->d_fsdata;
 }
-EXPORT_SYMBOL(cgroup_get_root);
 
 struct cgroup *cgroup_kernel_lookup(struct vfsmount *mnt,
const char *pathname)
@@ -5698,7 +5696,6 @@ struct cgroup *cgroup_kernel_lookup(struct vfsmount *mnt,
path_put(&path);
return cgrp;
 }
-EXPORT_SYMBOL(cgroup_kernel_lookup);
 
 struct cgroup *cgroup_kernel_open(struct cgroup *parent,
enum cgroup_open_flags flags, const char *name)
@@ -5729,27 +5726,6 @@ out:
mutex_unlock(&parent->dentry->d_inode->i_mutex);
return cgrp;
 }
-EXPORT_SYMBOL(cgroup_kernel_open);
-
-int cgroup_kernel_remove(struct cgroup *parent, const char *name)
-{
-   struct dentry *dentry;
-   int ret;
-
-   mutex_lock_nested(&parent->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
-   dentry = lookup_one_len(name, parent->dentry, strlen(name));
-   ret = PTR_ERR(dentry);
-   if (IS_ERR(dentry))
-   goto out;
-   ret = -ENOENT;
-   if (dentry->d_inode)
-   ret = vfs_rmdir(parent->dentry->d_inode, dentry);
-   dput(dentry);
-out:
-   mutex_unlock(&parent->dentry->d_inode->i_mutex);
-   return ret;
-}
-EXPORT_SYMBOL(cgroup_kernel_remove);
 
 int cgroup_kernel_attach(struct cgroup *cgrp, struct task_struct *tsk)
 {
@@ -5761,7 +5737,6 @@ int cgroup_kernel_attach(struct cgroup *cgrp, struct 
task_struct *tsk)
mutex_unlock(&cgroup_mutex);
return ret;
 }
-EXPORT_SYMBOL(cgroup_kernel_attach);
 
 void cgroup_kernel_close(struct cgroup *cgrp)
 {
@@ -5770,4 +5745,3 @@ void cgroup_kernel_close(struct cgroup *cgrp)
check_for_release(cgrp);
}
 }
-EXPORT_SYMBOL(cgroup_kernel_close);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] Remove container and beancounter directories from /proc/vz

2016-06-22 Thread Vladimir Davydov
In PCS6, cgroups were mounted there. Now they are unused, as all cgroups
are supposed to be mounted by systemd under /sys/fs/cgroup.

Signed-off-by: Vladimir Davydov 
---
 kernel/bc/proc.c| 1 -
 kernel/ve/veowner.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/kernel/bc/proc.c b/kernel/bc/proc.c
index 3a3b4e3f28c8..9f60d9991e0a 100644
--- a/kernel/bc/proc.c
+++ b/kernel/bc/proc.c
@@ -754,7 +754,6 @@ static int __init ub_init_proc(void)
entry = proc_create("user_beancounters",
S_IRUSR|S_ISVTX, NULL, &ub_file_operations);
proc_create("vswap", S_IRUSR, proc_vz_dir, &ub_vswap_fops);
-   proc_mkdir_mode("beancounter", 0, proc_vz_dir);
return 0;
 }
 
diff --git a/kernel/ve/veowner.c b/kernel/ve/veowner.c
index 86065072a9ca..7642191bf517 100644
--- a/kernel/ve/veowner.c
+++ b/kernel/ve/veowner.c
@@ -41,7 +41,6 @@ static void prepare_proc(void)
proc_vz_dir = proc_mkdir_mode("vz", S_ISVTX | S_IRUGO | S_IXUGO, NULL);
if (!proc_vz_dir)
panic("Can't create /proc/vz dir\n");
-   proc_mkdir_mode("container", 0, proc_vz_dir);
 }
 #endif
 
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/2] ve: drop ve_cgroup_open and ve_cgroup_remove

2016-06-22 Thread Vladimir Davydov
Fairsched was the last user of these functions. After it's gone, we
don't need them any longer.

Signed-off-by: Vladimir Davydov 
---
 include/linux/ve_proto.h |  2 --
 kernel/ve/ve.c   | 21 -
 2 files changed, 23 deletions(-)

diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
index 8cc7fe3ba2a3..d2dc12d2f2c2 100644
--- a/include/linux/ve_proto.h
+++ b/include/linux/ve_proto.h
@@ -50,8 +50,6 @@ extern struct list_head ve_list_head;
 #define for_each_ve(ve)list_for_each_entry((ve), &ve_list_head, 
ve_list)
 extern struct mutex ve_list_lock;
 extern struct ve_struct *get_ve_by_id(envid_t);
-extern struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t 
veid);
-extern int ve_cgroup_remove(struct cgroup *root, envid_t veid);
 
 extern int nr_threads_ve(struct ve_struct *ve);
 
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 2459cb53a665..9995dbcd1623 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -156,27 +156,6 @@ const char *ve_name(struct ve_struct *ve)
 }
 EXPORT_SYMBOL(ve_name);
 
-/* Cgroup must be closed with cgroup_kernel_close */
-struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t veid)
-{
-   char name[16];
-   struct cgroup *cgrp;
-
-   snprintf(name, sizeof(name), "%u", veid);
-   cgrp = cgroup_kernel_open(root, flags, name);
-   return cgrp ? cgrp : ERR_PTR(-ENOENT);
-}
-EXPORT_SYMBOL(ve_cgroup_open);
-
-int ve_cgroup_remove(struct cgroup *root, envid_t veid)
-{
-   char name[16];
-
-   snprintf(name, sizeof(name), "%u", veid);
-   return cgroup_kernel_remove(root, name);
-}
-EXPORT_SYMBOL(ve_cgroup_remove);
-
 /* under rcu_read_lock if task != current */
 const char *task_ve_name(struct task_struct *task)
 {
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] Drop CAP_VE_ADMIN and CAP_VE_NET_ADMIN

2016-06-22 Thread Vladimir Davydov
Not needed anymore as we use user ns for capability checking.
Also, move capable_setveid() helper to ve.h so as not to pollute
generic headers.

Signed-off-by: Vladimir Davydov 
---
 include/linux/ve.h  |  3 +++
 include/uapi/linux/capability.h | 55 -
 2 files changed, 3 insertions(+), 55 deletions(-)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index cea3a87cb9c0..247cadb78c06 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -138,6 +138,9 @@ struct ve_devmnt {
 #define VE_MEMINFO_DEFAULT  1   /* default behaviour */
 #define VE_MEMINFO_SYSTEM   0   /* disable meminfo virtualization */
 
+#define capable_setveid() \
+   (ve_is_super(get_exec_env()) && capable(CAP_SYS_ADMIN))
+
 extern int nr_ve;
 extern struct proc_dir_entry *proc_vz_dir;
 extern struct cgroup_subsys ve_subsys;
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index cadbfe6109e8..b3d37bb108b8 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -307,61 +307,6 @@ struct vfs_cap_data {
 
 #define CAP_SETFCAP 31
 
-#ifdef __KERNEL__
-/*
- * Important note: VZ capabilities do intersect with CAP_AUDIT
- * this is due to compatibility reasons. Nothing bad.
- * Both VZ and Audit/SELinux caps are disabled in VPSs.
- */
-
-/* Allow access to all information. In the other case some structures will be
- * hiding to ensure different Virtual Environment non-interaction on the same
- * node (NOW OBSOLETED)
- */
-#define CAP_SETVEID 29
-
-#define capable_setveid()  ({  \
-   ve_is_super(get_exec_env()) &&  \
-   (capable(CAP_SYS_ADMIN) ||  \
-capable(CAP_VE_ADMIN));\
-   })
-
-/*
- * coinsides with CAP_AUDIT_CONTROL but we don't care, since
- * audit is disabled in Virtuozzo
- */
-#define CAP_VE_ADMIN30
-
-#ifdef CONFIG_VE
-
-/* Replacement for CAP_NET_ADMIN:
-   delegated rights to the Virtual environment of its network administration.
-   For now the following rights have been delegated:
-
-   Allow setting arbitrary process / process group ownership on sockets
-   Allow interface configuration
- */
-#define CAP_VE_NET_ADMIN CAP_VE_ADMIN
-
-/* Replacement for CAP_SYS_ADMIN:
-   delegated rights to the Virtual environment of its administration.
-   For now the following rights have been delegated:
- */
-/* Allow mount/umount/remount */
-/* Allow examination and configuration of disk quotas */
-/* Allow removing semaphores */
-/* Used instead of CAP_CHOWN to "chown" IPC message queues, semaphores
-   and shared memory */
-/* Allow locking/unlocking of shared memory segment */
-/* Allow forged pids on socket credentials passing */
-
-#define CAP_VE_SYS_ADMIN CAP_VE_ADMIN
-#else
-#define CAP_VE_NET_ADMIN CAP_NET_ADMIN
-#define CAP_VE_SYS_ADMIN CAP_SYS_ADMIN
-#endif
-#endif
-
 /* Override MAC access.
The base kernel enforces no MAC policy.
An LSM may enforce a MAC policy, and if it does and it chooses
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm: memcontrol: reclaim when shrinking memory.high below usage

2016-06-21 Thread Vladimir Davydov
From: Johannes Weiner 

When setting memory.high below usage, nothing happens until the next
charge comes along, and then it will only reclaim its own charge and not
the now potentially huge excess of the new memory.high.  This can cause
groups to stay in excess of their memory.high indefinitely.

To fix that, when shrinking memory.high, kick off a reclaim cycle that
goes after the delta.

https://jira.sw.ru/browse/PSBM-48546

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Cc: Vladimir Davydov 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
(cherry picked from commit 588083bb37a3cea8533c392370a554417c8f29cb)
Signed-off-by: Vladimir Davydov 

Conflicts:
mm/memcontrol.c
---
 mm/memcontrol.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index de7c36295515..1f525f27e481 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5314,7 +5314,7 @@ static int mem_cgroup_high_write(struct cgroup *cont, 
struct cftype *cft,
 const char *buffer)
 {
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
-   unsigned long long val;
+   unsigned long long val, usage;
int ret;
 
ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -5322,6 +5322,12 @@ static int mem_cgroup_high_write(struct cgroup *cont, 
struct cftype *cft,
return ret;
 
memcg->high = val;
+
+   usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+   if (usage > val)
+   try_to_free_mem_cgroup_pages(memcg,
+(usage - val) >> PAGE_SHIFT,
+GFP_KERNEL, false);
return 0;
 }
 
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] cgroup: fix path mangling for ve cgroups

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 79fa6ee2446a3efe9791378cf9b582bbee0ef7ec
Author: Vladimir Davydov 
Date:   Mon Jun 20 21:07:58 2016 +0400

cgroup: fix path mangling for ve cgroups

Presently, we just cut the first component off the cgroup path when
inside a VE, because all VE cgroups are located at the top level of the
cgroup hierarchy. However, this is going to change - the cgroups are
going to move to machine.slice - so we should introduce a more generic
way of mangling cgroup paths.

This patch does the trick. On a VE start it marks all cgroups the init
task of the VE resides in with a special flag (CGRP_VE_ROOT). Cgroups
marked this way will be treated as root if looked at from inside a VE.
As long as we don't have nested VEs, this should work fine.

Note, we don't need to clear these flags on VE destruction, because
vzctl always creates new cgroups on VE start.

https://jira.sw.ru/browse/PSBM-48629
    
Signed-off-by: Vladimir Davydov 
---
 include/linux/cgroup.h |  3 +++
 kernel/cgroup.c| 27 ---
 kernel/ve/ve.c |  4 
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index aad06e8e0258..730ca9091bfb 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -175,6 +175,9 @@ enum {
CGRP_CPUSET_CLONE_CHILDREN,
/* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */
CGRP_SANE_BEHAVIOR,
+
+   /* The cgroup is root in a VE */
+   CGRP_VE_ROOT,
 };
 
 struct cgroup_name {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dd548853e2eb..581924e7af9e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1791,6 +1791,21 @@ static struct file_system_type cgroup_fs_type = {
 
 static struct kobject *cgroup_kobj;
 
+#ifdef CONFIG_VE
+void cgroup_mark_ve_root(struct ve_struct *ve)
+{
+   struct cgroup *cgrp;
+   struct cgroupfs_root *root;
+
+   mutex_lock(&cgroup_mutex);
+   for_each_active_root(root) {
+   cgrp = task_cgroup_from_root(ve->init_task, root);
+   set_bit(CGRP_VE_ROOT, &cgrp->flags);
+   }
+   mutex_unlock(&cgroup_mutex);
+}
+#endif
+
 /**
  * cgroup_path - generate the path of a cgroup
  * @cgrp: the cgroup in question
@@ -1804,7 +1819,8 @@ static struct kobject *cgroup_kobj;
  * inode's i_mutex, while on the other hand cgroup_path() can be called
  * with some irq-safe spinlocks held.
  */
-int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt)
+static int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen,
+bool virt)
 {
int ret = -ENAMETOOLONG;
char *start;
@@ -1824,14 +1840,11 @@ int __cgroup_path(const struct cgroup *cgrp, char *buf, 
int buflen, bool virt)
int len;
 
 #ifdef CONFIG_VE
-   if (virt && cgrp->parent && !cgrp->parent->parent) {
+   if (virt && test_bit(CGRP_VE_ROOT, &cgrp->flags)) {
/*
 * Containers cgroups are bind-mounted from node
 * so they are like '/' from inside, thus we have
-* to mangle cgroup path output. Effectively it is
-* enough to remove two topmost cgroups from path.
-* e.g. in ct 101: /101/test.slice/test.scope ->
-* /test.slice/test.scope
+* to mangle cgroup path output.
 */
if (*start != '/') {
if (--start < buf)
@@ -2391,7 +2404,7 @@ static ssize_t cgroup_file_write(struct file *file, const 
char __user *buf,
 * inside a container FS.
 */
if (!ve_is_super(get_exec_env())
-   && (!cgrp->parent || !cgrp->parent->parent)
+   && test_bit(CGRP_VE_ROOT, &cgrp->flags)
&& !get_exec_env()->is_pseudosuper
&& !(cft->flags & CFTYPE_VE_WRITABLE))
return -EPERM;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 9904a4ae130e..2459cb53a665 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -452,6 +452,8 @@ static void ve_drop_context(struct ve_struct *ve)
 
 static const struct timespec zero_time = { };
 
+extern void cgroup_mark_ve_root(struct ve_struct *ve);
+
 /* under ve->op_sem write-lock */
 static int ve_start_container(struct ve_struct *ve)
 {
@@ -499,6 +501,8 @@ static int ve_start_container(struct ve_struct *ve)
if (err < 0)
goto err_iterate;
 
+   cgroup_mark_ve_root(ve);
+
   

[Devel] [PATCH RHEL7 COMMIT] Drop vz_compat boot param

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit f8b72e7837625c7de569fefcf3bba05ac2ef6b5e
Author: Vladimir Davydov 
Date:   Mon Jun 20 21:01:36 2016 +0400

Drop vz_compat boot param

It was introduced by commit d7b23ae8a314f ("ve/cgroups: use cgroup
subsystem names only if in vz compat mode") in order to provide a way of
running a pcs6 environment along with the vz7 kernel. It turned out this
is not needed, so drop the option altogether.
    
Signed-off-by: Vladimir Davydov 
---
 include/linux/ve.h  |  4 
 kernel/bc/beancounter.c |  2 --
 kernel/fairsched.c  |  1 -
 kernel/ve/ve.c  | 10 --
 kernel/ve/vecalls.c |  1 -
 5 files changed, 18 deletions(-)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index 813f16d5e825..182a63899a0b 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -153,8 +153,6 @@ extern __u64 ve_setup_iptables_mask(__u64 init_mask);
 #ifdef CONFIG_VE
 #define ve_uevent_seqnum   (get_exec_env()->_uevent_seqnum)
 
-extern int vz_compat;
-
 extern struct kobj_ns_type_operations ve_ns_type_operations;
 extern struct kobject * kobject_create_and_add_ve(const char *name,
struct kobject *parent);
@@ -247,8 +245,6 @@ static inline void ve_mount_nr_dec(void)
 
 #define ve_uevent_seqnum uevent_seqnum
 
-#define vz_compat  (0)
-
 static inline int vz_security_family_check(struct net *net, int family) { 
return 0; }
 static inline int vz_security_protocol_check(struct net *net, int protocol) { 
return 0; }
 
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index f8a397269152..d35ddb3499d4 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -33,7 +33,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -1179,7 +1178,6 @@ void __init ub_init_late(void)
 int __init ub_init_cgroup(void)
 {
struct cgroup_sb_opts blkio_opts = {
-   .name   = vz_compat ? "beancounter" : NULL,
.subsys_mask= (1ul << blkio_subsys_id),
};
struct cgroup_sb_opts mem_opts = {
diff --git a/kernel/fairsched.c b/kernel/fairsched.c
index 959c19f4d7fc..e015cff87a97 100644
--- a/kernel/fairsched.c
+++ b/kernel/fairsched.c
@@ -796,7 +796,6 @@ int __init fairsched_init(void)
 {
struct vfsmount *cpu_mnt, *cpuset_mnt;
struct cgroup_sb_opts cpu_opts = {
-   .name   = vz_compat ? "fairsched" : NULL,
.subsys_mask=
(1ul << cpu_cgroup_subsys_id) |
(1ul << cpuacct_subsys_id),
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 22df66e1b257..d811d4818fa6 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -87,18 +87,8 @@ DEFINE_MUTEX(ve_list_lock);
 int nr_ve = 1; /* One VE always exists. Compatibility with vestat */
 EXPORT_SYMBOL(nr_ve);
 
-int vz_compat;
-EXPORT_SYMBOL(vz_compat);
-
 static DEFINE_IDR(ve_idr);
 
-static int __init vz_compat_setup(char *arg)
-{
-   get_option(&arg, &vz_compat);
-   return 0;
-}
-early_param("vz_compat", vz_compat_setup);
-
 struct ve_struct *get_ve(struct ve_struct *ve)
 {
if (ve)
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 5aa9722d692d..2b8b27998f07 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -309,7 +309,6 @@ static struct vfsmount *ve_cgroup_mnt, *devices_cgroup_mnt;
 static int __init init_vecalls_cgroups(void)
 {
struct cgroup_sb_opts devices_opts = {
-   .name   = vz_compat ? "container" : NULL,
.subsys_mask=
(1ul << devices_subsys_id),
};
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] timers should not get negative argument

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 3788c76811b2b04318c3f4b240f1e83245ad15e5
Author: Vasily Averin 
Date:   Mon Jun 20 20:58:56 2016 +0400

timers should not get negative argument

This patch fixes a 25-second delay on login into systemd-based containers.

A userspace application can set a timer for a time in the past and
expect the timer to expire immediately.

This may not work as expected inside migrated containers: the
translated argument provided to the timer can become negative, and the
corresponding timer will sleep for a very long time.

https://jira.sw.ru/browse/PSBM-48475
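
As a rough userspace illustration of the failure mode (values and error
handling illustrative only; compile with -lrt):

#include <signal.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	timer_t timer;
	struct sigevent sev = {
		.sigev_notify = SIGEV_SIGNAL,
		.sigev_signo  = SIGALRM,
	};
	/* Absolute expiry far in the past: POSIX requires the timer to
	 * expire immediately.  If the ve-translated expiry went negative
	 * in a migrated CT, this turned into a very long sleep. */
	struct itimerspec its = {
		.it_value = { .tv_sec = 1, .tv_nsec = 0 },
	};

	if (timer_create(CLOCK_MONOTONIC, &sev, &timer))
		return 1;
	timer_settime(timer, TIMER_ABSTIME, &its, NULL);
	pause();	/* should be interrupted almost immediately */
	return 0;
}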

    CC: Vladimir Davydov 
CC: Konstantin Khorenko 
Signed-off-by: Vasily Averin 
Acked-by: Cyrill Gorcunov 
---
 kernel/posix-timers.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index b98cfe429d9b..8ebf01827ee6 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -133,6 +133,8 @@ static struct k_clock posix_clocks[MAX_CLOCKS];
 (which_clock) == CLOCK_MONOTONIC_COARSE)
 
 #ifdef CONFIG_VE
+static struct timespec zero_time;
+
 void monotonic_abs_to_ve(clockid_t which_clock, struct timespec *tp)
 {
struct ve_struct *ve = get_exec_env();
@@ -151,6 +153,10 @@ void monotonic_ve_to_abs(clockid_t which_clock, struct 
timespec *tp)
set_normalized_timespec(tp,
tp->tv_sec + ve->start_timespec.tv_sec,
tp->tv_nsec + ve->start_timespec.tv_nsec);
+   if (timespec_compare(tp, &zero_time) <= 0) {
+   tp->tv_sec =  0;
+   tp->tv_nsec = 1;
+   }
 }
 #endif
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 0/6] Support containers in machine.slice

2016-06-20 Thread Vladimir Davydov
The following problems have to be solved if we want to move containers
to machine.slice:

 - CPU stats reporting. Currently, we just open the cgroup by name when
   we need stats corresponding to a VE. This is addressed by patch 3.

 - setdevperms ioctl. The same problem as with CPU stats reporting
   above. Addressed by patch 3 as well.

 - cgroup path mangling (/proc/self/cgroup, mountinfo). This is fixed by
   patches 5 and 6.

With containers moved to machine.slice, the fairsched syscalls and the
VZCTL_ENV_CREATE ioctl get broken and can't be easily fixed, so we just
drop them (patches 1, 2, 4). This should be fine, because libvzctl
switched to the cgroup interface long ago.

https://jira.sw.ru/browse/PSBM-48629

Vladimir Davydov (6):
  Drop vz_compat boot param
  Drop VZCTL_ENV_CREATE
  Use ve init task's css instead of opening cgroup via vfs
  Drop fairsched syscalls
  cgroup: use cgroup_path_ve helper in cgroup_show_path
  cgroup: fix path mangling for ve cgroups

 arch/powerpc/include/asm/systbl.h |  16 +-
 arch/powerpc/include/uapi/asm/unistd.h|   8 -
 arch/x86/syscalls/syscall_32.tbl  |   9 -
 arch/x86/syscalls/syscall_64.tbl  |   8 -
 configs/kernel-3.10.0-x86_64-debug.config |   1 -
 configs/kernel-3.10.0-x86_64.config   |   1 -
 fs/proc/loadavg.c |   3 +-
 fs/proc/stat.c|   3 +-
 fs/proc/uptime.c  |  15 +-
 include/linux/cgroup.h|   3 +
 include/linux/cpuset.h|   5 -
 include/linux/device_cgroup.h |   6 +-
 include/linux/fairsched.h |  88 
 include/linux/sched.h |  21 -
 include/linux/ve.h|  30 +-
 include/linux/ve_proto.h  |   4 -
 include/uapi/linux/Kbuild |   1 -
 include/uapi/linux/fairsched.h|   8 -
 init/Kconfig  |  20 +-
 kernel/Makefile   |   1 -
 kernel/bc/beancounter.c   |   2 -
 kernel/cgroup.c   |  66 ++-
 kernel/cpuset.c   |  26 -
 kernel/fairsched.c| 829 --
 kernel/sched/core.c   |  69 +--
 kernel/sched/cpuacct.h|   2 +
 kernel/sys_ni.c   |  10 -
 kernel/ve/ve.c| 104 +++-
 kernel/ve/vecalls.c   | 505 +-
 security/device_cgroup.c  |  65 +--
 30 files changed, 191 insertions(+), 1738 deletions(-)
 delete mode 100644 include/linux/fairsched.h
 delete mode 100644 include/uapi/linux/fairsched.h
 delete mode 100644 kernel/fairsched.c

-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] Use ve init task's css instead of opening cgroup via vfs

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 083ecd8a5051975639669e3349a17e07d299c299
Author: Vladimir Davydov 
Date:   Mon Jun 20 19:40:13 2016 +0300

Use ve init task's css instead of opening cgroup via vfs

Currently, whenever we need to get cpu or devices cgroup corresponding
to a ve, we open it using cgroup_kernel_open(). This is inflexible,
because it relies on the fact that all container cgroups are located at
a specific location which can never change (at the top level). Since we
want to move container cgroups to machine.slice, we need to rework this.

This patch does the trick. It makes each ve remember its init task at
container start, and uses the css corresponding to the init task whenever
we need to get a corresponding cgroup. Note that after this patch is
applied, we don't need to mount the cpu and devices cgroups in the kernel.

https://jira.sw.ru/browse/PSBM-48629
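
The idea, as a minimal sketch (the helper and field names below are
illustrative, not necessarily the ones introduced by the patch):

/*
 * Resolve a ve's cgroup in the given hierarchy via its init task's css
 * instead of opening the cgroup by path.  Caller must hold
 * rcu_read_lock() or otherwise pin the result.
 */
static struct cgroup *ve_cgroup_of(struct ve_struct *ve, int subsys_id)
{
	return task_cgroup(ve->init_task, subsys_id);
}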
    
Signed-off-by: Vladimir Davydov 
---
 fs/proc/loadavg.c |  3 +-
 fs/proc/stat.c|  3 +-
 fs/proc/uptime.c  | 15 
 include/linux/device_cgroup.h |  5 ++-
 include/linux/fairsched.h | 23 
 include/linux/ve.h| 18 ++
 kernel/fairsched.c| 61 
 kernel/ve/ve.c| 82 ++-
 kernel/ve/vecalls.c   | 67 ---
 security/device_cgroup.c  | 19 +-
 10 files changed, 126 insertions(+), 170 deletions(-)

diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 4cbdeef1aa71..40d8a90b0f13 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -6,7 +6,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define LOAD_INT(x) ((x) >> FSHIFT)
@@ -20,7 +19,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_loadavg(ve_name(ve), m);
+   ret = ve_show_loadavg(ve, m);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e9991db527e0..7f7e87c855e4 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -98,7 +97,7 @@ static int show_stat(struct seq_file *p, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_stat(ve_name(ve), p);
+   ret = ve_show_cpu_stat(ve, p);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 6fd56831c796..8fa578e8a553 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -5,7 +5,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -25,11 +24,11 @@ static inline void get_ve0_idle(struct timespec *idle)
idle->tv_nsec = rem;
 }
 
-static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp)
+static inline void get_veX_idle(struct ve_struct *ve, struct timespec *idle)
 {
struct kernel_cpustat kstat;
 
-   cpu_cgroup_get_stat(cgrp, &kstat);
+   ve_get_cpu_stat(ve, &kstat);
cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle);
 }
 
@@ -37,14 +36,12 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 {
struct timespec uptime;
struct timespec idle;
+   struct ve_struct *ve = get_exec_env();
 
-   if (ve_is_super(get_exec_env()))
+   if (ve_is_super(ve))
get_ve0_idle(&idle);
-   else {
-   rcu_read_lock();
-   get_veX_idle(&idle, task_cgroup(current, cpu_cgroup_subsys_id));
-   rcu_read_unlock();
-   }
+   else
+   get_veX_idle(ve, &idle);
 
do_posix_clock_monotonic_gettime(&uptime);
monotonic_to_bootbased(&uptime);
diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 64c2da27278c..25ea2270aabe 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -16,10 +16,9 @@ extern int devcgroup_device_permission(umode_t mode, dev_t 
dev, int mask);
 extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
-struct cgroup;
-int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
-int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, 
struct seq_file *m);
+int devcgroup_set_perms_ve(struct ve_struct *, unsigned, dev_t, unsigned);
+int devcgroup_seq_show_ve(struct ve_struct *, struct seq_file *);
 
 #else
 static inline int de

[Devel] [PATCH RHEL7 COMMIT] cgroup: use cgroup_path_ve helper in cgroup_show_path

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit df0243406fc27e4af78ca6d9111a0bd30fea00a3
Author: Vladimir Davydov 
Date:   Mon Jun 20 21:07:48 2016 +0400

cgroup: use cgroup_path_ve helper in cgroup_show_path

Presently, it basically duplicates the code used for mangling the cgroup
path shown inside a ve, which is already present in cgroup_path_ve. Let's
reuse it.

    Signed-off-by: Vladimir Davydov 
---
 kernel/cgroup.c | 39 +--
 1 file changed, 9 insertions(+), 30 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5c012f6e94e5..dd548853e2eb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1373,41 +1373,20 @@ static int cgroup_remount(struct super_block *sb, int 
*flags, char *data)
 }
 
 #ifdef CONFIG_VE
-int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
+static int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
 {
-   char *buf;
+   struct cgroup *cgrp = __d_cgrp(dentry);
+   char *buf, *end;
size_t size = seq_get_buf(m, &buf);
-   int res = -1, err = 0;
-
-   if (size) {
-   char *p = dentry_path(dentry, buf, size);
-   if (!IS_ERR(p)) {
-   char *end;
-   if (!ve_is_super(get_exec_env())) {
-   while (*++p != '/') {
-   /*
-* Mangle one level when showing
-* cgroup mount source in container
-* e.g.: "/111" -> "/",
-* "/111/test.slice/test.scope" ->
-* "/test.slice/test.scope"
-*/
-   if (*p == '\0') {
-   *--p = '/';
-   break;
-   }
-   }
-   }
-   end = mangle_path(buf, p, " \t\n\\");
-   if (end)
-   res = end - buf;
-   } else {
-   err = PTR_ERR(p);
-   }
+   int res = -1;
+
+   if (size > 0 && cgroup_path_ve(cgrp, buf, size) == 0) {
+   end = mangle_path(buf, buf, " \t\n\\");
+   res = end - buf;
}
seq_commit(m, res);
 
-   return err;
+   return 0;
 }
 #endif
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] Drop fairsched syscalls

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 13985cb1990d71a321504c58daa16b50ac9a0ec7
Author: Vladimir Davydov 
Date:   Mon Jun 20 19:40:14 2016 +0300

Drop fairsched syscalls

Everything that can be configured via fairsched syscalls is accessible
via cpu cgroup. Since it's getting difficult to maintain the syscalls
due to the upcoming move of containers to machine.slice, drop them.

Also, drop all functions from sched and cpuset which were used only by
fairsched syscalls.

Note: I make CFS_BANDWIDTH select the CFS_CPULIMIT config option,
because otherwise it wouldn't get selected now that its only user, the
VZ_FAIRSCHED config option, is dropped by this patch. I think we need to
merge this option with CFS_BANDWIDTH eventually, but let's leave it as
is for now.

Signed-off-by: Vladimir Davydov 
---
 arch/powerpc/include/asm/systbl.h |  16 +-
 arch/powerpc/include/uapi/asm/unistd.h|   8 -
 arch/x86/syscalls/syscall_32.tbl  |   9 -
 arch/x86/syscalls/syscall_64.tbl  |   8 -
 configs/kernel-3.10.0-x86_64-debug.config |   1 -
 configs/kernel-3.10.0-x86_64.config   |   1 -
 include/linux/cpuset.h|   5 -
 include/linux/fairsched.h |  58 ---
 include/linux/sched.h |  20 -
 include/uapi/linux/Kbuild |   1 -
 include/uapi/linux/fairsched.h|   8 -
 init/Kconfig  |  20 +-
 kernel/Makefile   |   1 -
 kernel/cpuset.c   |  26 --
 kernel/fairsched.c| 705 --
 kernel/sched/core.c   |  69 +--
 kernel/sched/cpuacct.h|   2 +
 kernel/sys_ni.c   |  10 -
 18 files changed, 25 insertions(+), 943 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index ce9d2d7977e5..8a44bbd2bee6 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -374,14 +374,14 @@ SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
-SYSCALL(fairsched_mknod)
-SYSCALL(fairsched_rmnod)
-SYSCALL(fairsched_chwt)
-SYSCALL(fairsched_mvpr)
-SYSCALL(fairsched_rate)
-SYSCALL(fairsched_vcpus)
-SYSCALL(fairsched_cpumask)
-SYSCALL(fairsched_nodemask)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
 SYSCALL(getluid)
 SYSCALL(setluid)
 SYSCALL(setublimit)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index e90207158a12..41fc69c6822b 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -387,14 +387,6 @@
 #define __NR_execveat  362
 #define __NR_switch_endian 363
 
-#define __NR_fairsched_mknod   360
-#define __NR_fairsched_rmnod   361
-#define __NR_fairsched_chwt362
-#define __NR_fairsched_mvpr363
-#define __NR_fairsched_rate364
-#define __NR_fairsched_vcpus   365
-#define __NR_fairsched_cpumask 366
-#define __NR_fairsched_nodemask367
 #define __NR_getluid   368
 #define __NR_setluid   369
 #define __NR_setublimit370
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index e60fd32ebba3..f8ed67d66913 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,15 +360,6 @@
 356i386memfd_createsys_memfd_create
 374i386userfaultfd sys_userfaultfd
 
-500i386fairsched_mknod sys_fairsched_mknod
-501i386fairsched_rmnod sys_fairsched_rmnod
-502i386fairsched_chwt  sys_fairsched_chwt
-503i386fairsched_mvpr  sys_fairsched_mvpr
-504i386fairsched_rate  sys_fairsched_rate
-505i386fairsched_vcpus sys_fairsched_vcpus
-506i386fairsched_cpumask   sys_fairsched_cpumask
-507i386fairsched_nodemask  sys_fairsched_nodemask
-
 510i386getluid sys_getluid
 511i386setluid sys_setluid
 512i386setublimit  sys_setublimit  
compat_sys_setublimit
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 846183e5a9f0..7f009985158e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -325,18 +325,10 @@
 320common  kexec_file_load sys_kexec_file_load
 323common  userfaultfd sys_userfaultfd
 
-49764  fairsched_nodemask  sys_fairsched_nodemask
-49864  fairsched_cpumask   sys_

[Devel] [PATCH RHEL7 COMMIT] Drop VZCTL_ENV_CREATE

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 8d46dca70d92147cf928633f279b9c36deb234c2
Author: Vladimir Davydov 
Date:   Mon Jun 20 19:40:12 2016 +0300

Drop VZCTL_ENV_CREATE

It's getting too difficult to support it. Since we've been using the
cgroup interface for creating VEs for quite a while, let's drop it.

Signed-off-by: Vladimir Davydov 
---
 include/linux/device_cgroup.h |   1 -
 include/linux/fairsched.h |   7 -
 include/linux/sched.h |   1 -
 include/linux/ve.h|   8 -
 include/linux/ve_proto.h  |   4 -
 kernel/fairsched.c|  64 +--
 kernel/ve/ve.c|   8 +-
 kernel/ve/vecalls.c   | 437 +-
 security/device_cgroup.c  |  46 -
 9 files changed, 5 insertions(+), 571 deletions(-)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 32588bb8fb4e..64c2da27278c 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -17,7 +17,6 @@ extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
 struct cgroup;
-int devcgroup_default_perms_ve(struct cgroup *cgroup);
 int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
 int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, 
struct seq_file *m);
diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h
index f3dede236945..b73f51eadabc 100644
--- a/include/linux/fairsched.h
+++ b/include/linux/fairsched.h
@@ -51,10 +51,6 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, 
unsigned int len,
 asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len,
   unsigned long __user *user_mask_ptr);
 
-int fairsched_new_node(int id, unsigned int vcpus);
-int fairsched_move_task(int id, struct task_struct *tsk);
-void fairsched_drop_node(int id, int leave);
-
 int fairsched_get_cpu_stat(const char *name, struct kernel_cpustat *kstat);
 
 int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun);
@@ -71,9 +67,6 @@ int fairsched_show_loadavg(const char *name, struct seq_file 
*p);
 
 #else /* CONFIG_VZ_FAIRSCHED */
 
-static inline int fairsched_new_node(int id, unsigned int vcpus) { return 0; }
-static inline int fairsched_move_task(int id, struct task_struct *tsk) { 
return 0; }
-static inline void fairsched_drop_node(int id, int leave) { }
 static inline int fairsched_show_stat(const char *name, struct seq_file *p) { 
return -ENOSYS; }
 static inline int fairsched_show_loadavg(const char *name, struct seq_file *p) 
{ return -ENOSYS; }
 static inline int fairsched_get_cpu_avenrun(const char *name, unsigned long 
*avenrun) { return -ENOSYS; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21775a21f8ab..84a9888b2483 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,7 +1241,6 @@ struct task_struct {
unsigned in_execve:1;   /* Tell the LSMs that the process is doing an
 * execve */
unsigned in_iowait:1;
-   unsigned did_ve_enter:1;
unsigned no_new_privs:1; /* task may not gain privileges */
unsigned may_throttle:1;
 
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 182a63899a0b..459c8bc581d9 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -41,13 +41,10 @@ struct ve_struct {
struct list_headve_list;
 
envid_t veid;
-   boollegacy; /* created using the legacy API
-  (vzctl ioctl - see do_env_create) */
 
unsigned intclass_id;
struct rw_semaphore op_sem;
int is_running;
-   int is_locked;
int is_pseudosuper;
atomic_tsuspend;
/* see vzcalluser.h for VE_FEATURE_XXX definitions */
@@ -146,10 +143,6 @@ extern struct cgroup_subsys ve_subsys;
 
 extern unsigned int sysctl_ve_mount_nr;
 
-#ifdef CONFIG_VE_IPTABLES
-extern __u64 ve_setup_iptables_mask(__u64 init_mask);
-#endif
-
 #ifdef CONFIG_VE
 #define ve_uevent_seqnum   (get_exec_env()->_uevent_seqnum)
 
@@ -209,7 +202,6 @@ extern void monotonic_ve_to_abs(clockid_t which_clock, 
struct timespec *tp);
 
 void ve_stop_ns(struct pid_namespace *ns);
 void ve_exit_ns(struct pid_namespace *ns);
-int ve_start_container(struct ve_struct *ve);
 
 extern bool current_user_ns_initial(void);
 struct user_namespace *ve_init_user_ns(void);
diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
index 61d80190d0f1..8cc7fe3ba2a3 100644
--- a/include/linux/ve_proto.h
+++ b/include/linux/ve_proto.h
@@ -53,10 +53,6 @@ extern

[Devel] [PATCH RHEL7 COMMIT] mm: memcontrol: fix race between kmem uncharge and charge reparenting

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 35c0d2a992aaa399cccaee2fc9f3ed6879840dd4
Author: Vladimir Davydov 
Date:   Mon Jun 20 20:59:38 2016 +0400

mm: memcontrol: fix race between kmem uncharge and charge reparenting

When a cgroup is destroyed, all user memory pages get recharged to the
parent cgroup. Recharging is done by mem_cgroup_reparent_charges which
keeps looping until res <= kmem. This is supposed to guarantee that by
the time cgroup gets released, no pages is charged to it. However, the
guarantee might be violated in case mem_cgroup_reparent_charges races
with kmem charge or uncharge.

Currently, kmem is charged before res and uncharged after. As a result,
kmem might become greater than res for a short period of time even if
there are still user memory pages charged to the cgroup. In this case
mem_cgroup_reparent_charges will give up prematurely, and the cgroup
might be released though there are still pages charged to it. Uncharge
of such a page will trigger kernel panic:

  general protection fault:  [#1] SMP
  CPU: 0 PID: 972445 Comm: httpd ve: 0 Tainted: G   OE  
   3.10.0-427.10.1.lve1.4.9.el7.x86_64 #1 12.14
  task: 88065d53d8d0 ti: 880224f34000 task.ti: 880224f34000
  RIP: 0010:[]  [] 
mem_cgroup_charge_statistics.isra.16+0x13/0x60
  RSP: 0018:880224f37a80  EFLAGS: 00010202
  RAX:  RBX: 8807b26f0110 RCX: 
  RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8
  RBP: 880224f37a80 R08:  R09: 03808000
  R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440
  R13: 0001 R14:  R15: 8806a5566000
  FS:  () GS:8807d400() 
knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f54289bd74c CR3: 0006638b1000 CR4: 06f0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Stack:
   880224f37ac0 811e9ddf 88060001 ea000c9c0440
   0001 037d1000 880224f37c78 0380
   880224f37ad0 811ee99a 880224f37b08 811b9ec9
  Call Trace:
   [] __mem_cgroup_uncharge_common+0xcf/0x320
   [] mem_cgroup_uncharge_page+0x2a/0x30
   [] page_remove_rmap+0xb9/0x160
   [] ? res_counter_uncharge+0x13/0x20
   [] unmap_page_range+0x460/0x870
   [] unmap_single_vma+0x81/0xf0
   [] unmap_vmas+0x49/0x90
   [] exit_mmap+0xac/0x1a0
   [] mmput+0x6b/0x140
   [] flush_old_exec+0x467/0x8d0
   [] load_elf_binary+0x33c/0xde0
   [] ? get_user_pages+0x52/0x60
   [] ? load_elf_library+0x220/0x220
   [] search_binary_handler+0xd5/0x300
   [] do_execve_common.isra.26+0x657/0x720
   [] SyS_execve+0x29/0x30
   [] stub_execve+0x69/0xa0

To prevent this from happening, let's always charge kmem after res and
uncharge before res.

https://bugs.openvz.org/browse/OVZ-6756
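
To illustrate the required ordering, here is a compilable userspace
model (hypothetical counters, not the actual memcg code; charging
failures ignored):

#include <stdatomic.h>

static atomic_long res, kmem;	/* res: all pages; kmem: kernel pages only */

static void charge_kernel_page(long size)
{
	atomic_fetch_add(&res, size);	/* res first... */
	atomic_fetch_add(&kmem, size);	/* ...kmem second */
}

static void uncharge_kernel_page(long size)
{
	atomic_fetch_sub(&kmem, size);	/* kmem first... */
	atomic_fetch_sub(&res, size);	/* ...res last */
}

/*
 * User pages are charged to res only, so res - kmem bounds the amount
 * of user memory still charged.  With the ordering above kmem can never
 * transiently exceed res, so a reparenting loop of the form "repeat
 * until res <= kmem" cannot exit while user pages remain charged.
 */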

Reported-by: Anatoly Stepanov 
Signed-off-by: Vladimir Davydov 
Reviewed-by: Kirill Tkhai 
---
 mm/memcontrol.c | 44 
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c3fbb2d2c48..de7c36295515 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3163,10 +3163,6 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t 
gfp, u64 size)
int ret = 0;
bool may_oom;
 
-   ret = res_counter_charge(&memcg->kmem, size, &fail_res);
-   if (ret)
-   return ret;
-
/*
 * Conditions under which we can wait for the oom_killer. Those are
 * the same conditions tested by the core page allocator
@@ -3198,8 +3194,33 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t 
gfp, u64 size)
res_counter_charge_nofail(&memcg->memsw, size,
  &fail_res);
ret = 0;
-   } else if (ret)
-   res_counter_uncharge(&memcg->kmem, size);
+   }
+
+   if (ret)
+   return ret;
+
+   /*
+* When a cgroup is destroyed, all user memory pages get recharged to
+* the parent cgroup. Recharging is done by mem_cgroup_reparent_charges
+* which keeps looping until res <= kmem. This is supposed to guarantee
+* that by the time cgroup gets released, no pages is charged to it.
+*
+* If kmem were charged before res or uncharged after, kmem might
+* become grea

[Devel] [PATCH rh7 4/6] Drop fairsched syscalls

2016-06-20 Thread Vladimir Davydov
Everything that can be configured via fairsched syscalls is accessible
via cpu cgroup. Since it's getting difficult to maintain the syscalls
due to the upcoming move of containers to machine.slice, drop them.

Also, drop all functions from sched and cpuset which were used only by
fairsched syscalls.

Note: I make CFS_BANDWIDTH select the CFS_CPULIMIT config option,
because otherwise it wouldn't get selected now that its only user, the
VZ_FAIRSCHED config option, is dropped by this patch. I think we need to
merge this option with CFS_BANDWIDTH eventually, but let's leave it as
is for now.

Signed-off-by: Vladimir Davydov 
---
 arch/powerpc/include/asm/systbl.h |  16 +-
 arch/powerpc/include/uapi/asm/unistd.h|   8 -
 arch/x86/syscalls/syscall_32.tbl  |   9 -
 arch/x86/syscalls/syscall_64.tbl  |   8 -
 configs/kernel-3.10.0-x86_64-debug.config |   1 -
 configs/kernel-3.10.0-x86_64.config   |   1 -
 include/linux/cpuset.h|   5 -
 include/linux/fairsched.h |  58 ---
 include/linux/sched.h |  20 -
 include/uapi/linux/Kbuild |   1 -
 include/uapi/linux/fairsched.h|   8 -
 init/Kconfig  |  20 +-
 kernel/Makefile   |   1 -
 kernel/cpuset.c   |  26 --
 kernel/fairsched.c| 705 --
 kernel/sched/core.c   |  69 +--
 kernel/sched/cpuacct.h|   2 +
 kernel/sys_ni.c   |  10 -
 18 files changed, 25 insertions(+), 943 deletions(-)
 delete mode 100644 include/linux/fairsched.h
 delete mode 100644 include/uapi/linux/fairsched.h
 delete mode 100644 kernel/fairsched.c

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index ce9d2d7977e5..8a44bbd2bee6 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -374,14 +374,14 @@ SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
-SYSCALL(fairsched_mknod)
-SYSCALL(fairsched_rmnod)
-SYSCALL(fairsched_chwt)
-SYSCALL(fairsched_mvpr)
-SYSCALL(fairsched_rate)
-SYSCALL(fairsched_vcpus)
-SYSCALL(fairsched_cpumask)
-SYSCALL(fairsched_nodemask)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
 SYSCALL(getluid)
 SYSCALL(setluid)
 SYSCALL(setublimit)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index e90207158a12..41fc69c6822b 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -387,14 +387,6 @@
 #define __NR_execveat  362
 #define __NR_switch_endian 363
 
-#define __NR_fairsched_mknod   360
-#define __NR_fairsched_rmnod   361
-#define __NR_fairsched_chwt362
-#define __NR_fairsched_mvpr363
-#define __NR_fairsched_rate364
-#define __NR_fairsched_vcpus   365
-#define __NR_fairsched_cpumask 366
-#define __NR_fairsched_nodemask367
 #define __NR_getluid   368
 #define __NR_setluid   369
 #define __NR_setublimit370
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index e60fd32ebba3..f8ed67d66913 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,15 +360,6 @@
 356i386memfd_createsys_memfd_create
 374i386userfaultfd sys_userfaultfd
 
-500i386fairsched_mknod sys_fairsched_mknod
-501i386fairsched_rmnod sys_fairsched_rmnod
-502i386fairsched_chwt  sys_fairsched_chwt
-503i386fairsched_mvpr  sys_fairsched_mvpr
-504i386fairsched_rate  sys_fairsched_rate
-505i386fairsched_vcpus sys_fairsched_vcpus
-506i386fairsched_cpumask   sys_fairsched_cpumask
-507i386fairsched_nodemask  sys_fairsched_nodemask
-
 510i386getluid sys_getluid
 511i386setluid sys_setluid
 512i386setublimit  sys_setublimit  
compat_sys_setublimit
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 846183e5a9f0..7f009985158e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -325,18 +325,10 @@
 320common  kexec_file_load sys_kexec_file_load
 323common  userfaultfd sys_userfaultfd
 
-49764  fairsched_nodemask  sys_fairsched_nodemask
-49864  fairsched_cpumask   sys_fairsched_cpumask
-49964  fairsched_vcpus sys_fairsched_vcpus
 50064  getluid sys_getluid
 50164  setluid sys_setluid
 50264  setublimit  sys_setublimit
 503   

[Devel] [PATCH rh7 1/6] Drop vz_compat boot param

2016-06-20 Thread Vladimir Davydov
It was introduced by commit d7b23ae8a314f ("ve/cgroups: use cgroup
subsystem names only if in vz compat mode") in order to provide a way of
running a pcs6 environment along with a vz7 kernel. It turned out this
is not needed, so drop the option altogether.

Signed-off-by: Vladimir Davydov 
---
 include/linux/ve.h  |  4 
 kernel/bc/beancounter.c |  2 --
 kernel/fairsched.c  |  1 -
 kernel/ve/ve.c  | 10 --
 kernel/ve/vecalls.c |  1 -
 5 files changed, 18 deletions(-)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index 2d0c19ee2d98..a40e219c8bce 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -155,8 +155,6 @@ extern __u64 ve_setup_iptables_mask(__u64 init_mask);
 #ifdef CONFIG_VE
 #define ve_uevent_seqnum   (get_exec_env()->_uevent_seqnum)
 
-extern int vz_compat;
-
 extern struct kobj_ns_type_operations ve_ns_type_operations;
 extern struct kobject * kobject_create_and_add_ve(const char *name,
struct kobject *parent);
@@ -249,8 +247,6 @@ static inline void ve_mount_nr_dec(void)
 
 #define ve_uevent_seqnum uevent_seqnum
 
-#define vz_compat  (0)
-
 static inline int vz_security_family_check(struct net *net, int family) { 
return 0; }
 static inline int vz_security_protocol_check(struct net *net, int protocol) { 
return 0; }
 
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index b26d292e2881..935ca517e1f4 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -35,7 +35,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -1181,7 +1180,6 @@ void __init ub_init_late(void)
 int __init ub_init_cgroup(void)
 {
struct cgroup_sb_opts blkio_opts = {
-   .name   = vz_compat ? "beancounter" : NULL,
.subsys_mask= (1ul << blkio_subsys_id),
};
struct cgroup_sb_opts mem_opts = {
diff --git a/kernel/fairsched.c b/kernel/fairsched.c
index d3d17126a85c..8149076c8cb8 100644
--- a/kernel/fairsched.c
+++ b/kernel/fairsched.c
@@ -796,7 +796,6 @@ int __init fairsched_init(void)
 {
struct vfsmount *cpu_mnt, *cpuset_mnt;
struct cgroup_sb_opts cpu_opts = {
-   .name   = vz_compat ? "fairsched" : NULL,
.subsys_mask=
(1ul << cpu_cgroup_subsys_id) |
(1ul << cpuacct_subsys_id),
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 53fa12dca238..703f97c03cb2 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -89,18 +89,8 @@ DEFINE_MUTEX(ve_list_lock);
 int nr_ve = 1; /* One VE always exists. Compatibility with vestat */
 EXPORT_SYMBOL(nr_ve);
 
-int vz_compat;
-EXPORT_SYMBOL(vz_compat);
-
 static DEFINE_IDR(ve_idr);
 
-static int __init vz_compat_setup(char *arg)
-{
-   get_option(&arg, &vz_compat);
-   return 0;
-}
-early_param("vz_compat", vz_compat_setup);
-
 struct ve_struct *get_ve(struct ve_struct *ve)
 {
if (ve)
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 537fc4aa964b..a690a8faabba 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -309,7 +309,6 @@ static struct vfsmount *ve_cgroup_mnt, *devices_cgroup_mnt;
 static int __init init_vecalls_cgroups(void)
 {
struct cgroup_sb_opts devices_opts = {
-   .name   = vz_compat ? "container" : NULL,
.subsys_mask=
(1ul << devices_subsys_id),
};
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 6/6] cgroup: fix path mangling for ve cgroups

2016-06-20 Thread Vladimir Davydov
Presently, we just cut the first component off the cgroup path when
inside a VE, because all VE cgroups are located at the top level of the
cgroup hierarchy. However, this is going to change - the cgroups are
going to move to machine.slice - so we should introduce a more generic
way of mangling cgroup paths.

This patch does the trick. On a VE start it marks all cgroups the init
task of the VE resides in with a special flag (CGRP_VE_ROOT). Cgroups
marked this way will be treated as root if looked at from inside a VE.
As long as we don't have nested VEs, this should work fine.

Note, we don't need to clear these flags on VE destruction, because
vzctl always creates new cgroups on VE start.

https://jira.sw.ru/browse/PSBM-48629
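
The effect is visible from userspace; a minimal check (the output shown
in the comment is illustrative):

#include <stdio.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return 1;
	/* Inside a CT every entry reads relative to the CGRP_VE_ROOT
	 * cgroup, e.g. "4:cpu,cpuacct:/", no matter whether the CT
	 * cgroup sits at the top level or under machine.slice on the
	 * node. */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}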

Signed-off-by: Vladimir Davydov 
---
 include/linux/cgroup.h |  3 +++
 kernel/cgroup.c| 27 ---
 kernel/ve/ve.c |  4 
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index aad06e8e0258..730ca9091bfb 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -175,6 +175,9 @@ enum {
CGRP_CPUSET_CLONE_CHILDREN,
/* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */
CGRP_SANE_BEHAVIOR,
+
+   /* The cgroup is root in a VE */
+   CGRP_VE_ROOT,
 };
 
 struct cgroup_name {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dd548853e2eb..581924e7af9e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1791,6 +1791,21 @@ static struct file_system_type cgroup_fs_type = {
 
 static struct kobject *cgroup_kobj;
 
+#ifdef CONFIG_VE
+void cgroup_mark_ve_root(struct ve_struct *ve)
+{
+   struct cgroup *cgrp;
+   struct cgroupfs_root *root;
+
+   mutex_lock(&cgroup_mutex);
+   for_each_active_root(root) {
+   cgrp = task_cgroup_from_root(ve->init_task, root);
+   set_bit(CGRP_VE_ROOT, &cgrp->flags);
+   }
+   mutex_unlock(&cgroup_mutex);
+}
+#endif
+
 /**
  * cgroup_path - generate the path of a cgroup
  * @cgrp: the cgroup in question
@@ -1804,7 +1819,8 @@ static struct kobject *cgroup_kobj;
  * inode's i_mutex, while on the other hand cgroup_path() can be called
  * with some irq-safe spinlocks held.
  */
-int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt)
+static int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen,
+bool virt)
 {
int ret = -ENAMETOOLONG;
char *start;
@@ -1824,14 +1840,11 @@ int __cgroup_path(const struct cgroup *cgrp, char *buf, 
int buflen, bool virt)
int len;
 
 #ifdef CONFIG_VE
-   if (virt && cgrp->parent && !cgrp->parent->parent) {
+   if (virt && test_bit(CGRP_VE_ROOT, &cgrp->flags)) {
/*
 * Containers cgroups are bind-mounted from node
 * so they are like '/' from inside, thus we have
-* to mangle cgroup path output. Effectively it is
-* enough to remove two topmost cgroups from path.
-* e.g. in ct 101: /101/test.slice/test.scope ->
-* /test.slice/test.scope
+* to mangle cgroup path output.
 */
if (*start != '/') {
if (--start < buf)
@@ -2391,7 +2404,7 @@ static ssize_t cgroup_file_write(struct file *file, const 
char __user *buf,
 * inside a container FS.
 */
if (!ve_is_super(get_exec_env())
-   && (!cgrp->parent || !cgrp->parent->parent)
+   && test_bit(CGRP_VE_ROOT, &cgrp->flags)
&& !get_exec_env()->is_pseudosuper
&& !(cft->flags & CFTYPE_VE_WRITABLE))
return -EPERM;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 08a15fc02e21..e65130f18bb4 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -454,6 +454,8 @@ static void ve_drop_context(struct ve_struct *ve)
 
 static const struct timespec zero_time = { };
 
+extern void cgroup_mark_ve_root(struct ve_struct *ve);
+
 /* under ve->op_sem write-lock */
 static int ve_start_container(struct ve_struct *ve)
 {
@@ -501,6 +503,8 @@ static int ve_start_container(struct ve_struct *ve)
if (err < 0)
goto err_iterate;
 
+   cgroup_mark_ve_root(ve);
+
ve->is_running = 1;
 
printk(KERN_INFO "CT: %s: started\n", ve_name(ve));
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 5/6] cgroup: use cgroup_path_ve helper in cgroup_show_path

2016-06-20 Thread Vladimir Davydov
Presently, it basically duplicates the code used for mangling the cgroup
path shown inside a ve, which is already present in cgroup_path_ve. Let's
reuse it.

Signed-off-by: Vladimir Davydov 
---
 kernel/cgroup.c | 39 +--
 1 file changed, 9 insertions(+), 30 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5c012f6e94e5..dd548853e2eb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1373,41 +1373,20 @@ static int cgroup_remount(struct super_block *sb, int 
*flags, char *data)
 }
 
 #ifdef CONFIG_VE
-int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
+static int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
 {
-   char *buf;
+   struct cgroup *cgrp = __d_cgrp(dentry);
+   char *buf, *end;
size_t size = seq_get_buf(m, &buf);
-   int res = -1, err = 0;
-
-   if (size) {
-   char *p = dentry_path(dentry, buf, size);
-   if (!IS_ERR(p)) {
-   char *end;
-   if (!ve_is_super(get_exec_env())) {
-   while (*++p != '/') {
-   /*
-* Mangle one level when showing
-* cgroup mount source in container
-* e.g.: "/111" -> "/",
-* "/111/test.slice/test.scope" ->
-* "/test.slice/test.scope"
-*/
-   if (*p == '\0') {
-   *--p = '/';
-   break;
-   }
-   }
-   }
-   end = mangle_path(buf, p, " \t\n\\");
-   if (end)
-   res = end - buf;
-   } else {
-   err = PTR_ERR(p);
-   }
+   int res = -1;
+
+   if (size > 0 && cgroup_path_ve(cgrp, buf, size) == 0) {
+   end = mangle_path(buf, buf, " \t\n\\");
+   res = end - buf;
}
seq_commit(m, res);
 
-   return err;
+   return 0;
 }
 #endif
 
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 2/6] Drop VZCTL_ENV_CREATE

2016-06-20 Thread Vladimir Davydov
It's getting too difficult to support it. Since we've been using the
cgroup interface for creating VEs for quite a while, let's drop it.

Signed-off-by: Vladimir Davydov 
---
 include/linux/device_cgroup.h |   1 -
 include/linux/fairsched.h |   7 -
 include/linux/sched.h |   1 -
 include/linux/ve.h|   8 -
 include/linux/ve_proto.h  |   4 -
 kernel/fairsched.c|  64 +--
 kernel/ve/ve.c|   8 +-
 kernel/ve/vecalls.c   | 437 +-
 security/device_cgroup.c  |  46 -
 9 files changed, 5 insertions(+), 571 deletions(-)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 32588bb8fb4e..64c2da27278c 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -17,7 +17,6 @@ extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
 struct cgroup;
-int devcgroup_default_perms_ve(struct cgroup *cgroup);
 int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
 int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, 
struct seq_file *m);
diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h
index e242c0d4c065..615e88928e25 100644
--- a/include/linux/fairsched.h
+++ b/include/linux/fairsched.h
@@ -51,10 +51,6 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, 
unsigned int len,
 asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len,
   unsigned long __user *user_mask_ptr);
 
-int fairsched_new_node(int id, unsigned int vcpus);
-int fairsched_move_task(int id, struct task_struct *tsk);
-void fairsched_drop_node(int id, int leave);
-
 int fairsched_get_cpu_stat(const char *name, struct kernel_cpustat *kstat);
 
 int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun);
@@ -71,9 +67,6 @@ int fairsched_show_loadavg(const char *name, struct seq_file 
*p);
 
 #else /* CONFIG_VZ_FAIRSCHED */
 
-static inline int fairsched_new_node(int id, unsigned int vcpus) { return 0; }
-static inline int fairsched_move_task(int id, struct task_struct *tsk) { 
return 0; }
-static inline void fairsched_drop_node(int id, int leave) { }
 static inline int fairsched_show_stat(const char *name, struct seq_file *p) { 
return -ENOSYS; }
 static inline int fairsched_show_loadavg(const char *name, struct seq_file *p) 
{ return -ENOSYS; }
 static inline int fairsched_get_cpu_avenrun(const char *name, unsigned long 
*avenrun) { return -ENOSYS; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21775a21f8ab..84a9888b2483 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,7 +1241,6 @@ struct task_struct {
unsigned in_execve:1;   /* Tell the LSMs that the process is doing an
 * execve */
unsigned in_iowait:1;
-   unsigned did_ve_enter:1;
unsigned no_new_privs:1; /* task may not gain privileges */
unsigned may_throttle:1;
 
diff --git a/include/linux/ve.h b/include/linux/ve.h
index a40e219c8bce..878ca284a6ba 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -43,13 +43,10 @@ struct ve_struct {
struct list_headve_list;
 
envid_t veid;
-   boollegacy; /* created using the legacy API
-  (vzctl ioctl - see do_env_create) */
 
unsigned intclass_id;
struct rw_semaphore op_sem;
int is_running;
-   int is_locked;
int is_pseudosuper;
atomic_tsuspend;
/* see vzcalluser.h for VE_FEATURE_XXX definitions */
@@ -148,10 +145,6 @@ extern struct cgroup_subsys ve_subsys;
 
 extern unsigned int sysctl_ve_mount_nr;
 
-#ifdef CONFIG_VE_IPTABLES
-extern __u64 ve_setup_iptables_mask(__u64 init_mask);
-#endif
-
 #ifdef CONFIG_VE
 #define ve_uevent_seqnum   (get_exec_env()->_uevent_seqnum)
 
@@ -211,7 +204,6 @@ extern void monotonic_ve_to_abs(clockid_t which_clock, 
struct timespec *tp);
 
 void ve_stop_ns(struct pid_namespace *ns);
 void ve_exit_ns(struct pid_namespace *ns);
-int ve_start_container(struct ve_struct *ve);
 
 extern bool current_user_ns_initial(void);
 struct user_namespace *ve_init_user_ns(void);
diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
index 153f18bd19b1..5787afe275ce 100644
--- a/include/linux/ve_proto.h
+++ b/include/linux/ve_proto.h
@@ -55,10 +55,6 @@ extern struct ve_struct *get_ve_by_id(envid_t);
 extern struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t 
veid);
 extern int ve_cgroup_remove(struct cgroup *root, envid_t veid);
 
-struct env_create_param3;
-extern int real_env_create(envid_t veid, unsigned flags, u32 class_id,
-  struct env_create_param3 

[Devel] [PATCH rh7 3/6] Use ve init task's css instead of opening cgroup via vfs

2016-06-20 Thread Vladimir Davydov
Currently, whenever we need to get cpu or devices cgroup corresponding
to a ve, we open it using cgroup_kernel_open(). This is inflexible,
because it relies on the fact that all container cgroups are located at
a specific location which can never change (at the top level). Since we
want to move container cgroups to machine.slice, we need to rework this.

This patch does the trick. It makes each ve remember its init task at
container start, and uses the css corresponding to the init task whenever
we need to get a corresponding cgroup. Note that after this patch is
applied, we don't need to mount the cpu and devices cgroups in the kernel.

https://jira.sw.ru/browse/PSBM-48629

Signed-off-by: Vladimir Davydov 
---
 fs/proc/loadavg.c |  3 +-
 fs/proc/stat.c|  3 +-
 fs/proc/uptime.c  | 15 
 include/linux/device_cgroup.h |  5 ++-
 include/linux/fairsched.h | 23 
 include/linux/ve.h| 18 ++
 kernel/fairsched.c| 61 
 kernel/ve/ve.c| 82 ++-
 kernel/ve/vecalls.c   | 67 ---
 security/device_cgroup.c  | 19 +-
 10 files changed, 126 insertions(+), 170 deletions(-)

diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 4cbdeef1aa71..40d8a90b0f13 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -6,7 +6,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define LOAD_INT(x) ((x) >> FSHIFT)
@@ -20,7 +19,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_loadavg(ve_name(ve), m);
+   ret = ve_show_loadavg(ve, m);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e9991db527e0..7f7e87c855e4 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -98,7 +97,7 @@ static int show_stat(struct seq_file *p, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_stat(ve_name(ve), p);
+   ret = ve_show_cpu_stat(ve, p);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 6fd56831c796..8fa578e8a553 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -5,7 +5,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -25,11 +24,11 @@ static inline void get_ve0_idle(struct timespec *idle)
idle->tv_nsec = rem;
 }
 
-static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp)
+static inline void get_veX_idle(struct ve_struct *ve, struct timespec *idle)
 {
struct kernel_cpustat kstat;
 
-   cpu_cgroup_get_stat(cgrp, &kstat);
+   ve_get_cpu_stat(ve, &kstat);
cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle);
 }
 
@@ -37,14 +36,12 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 {
struct timespec uptime;
struct timespec idle;
+   struct ve_struct *ve = get_exec_env();
 
-   if (ve_is_super(get_exec_env()))
+   if (ve_is_super(ve))
get_ve0_idle(&idle);
-   else {
-   rcu_read_lock();
-   get_veX_idle(&idle, task_cgroup(current, cpu_cgroup_subsys_id));
-   rcu_read_unlock();
-   }
+   else
+   get_veX_idle(ve, &idle);
 
do_posix_clock_monotonic_gettime(&uptime);
monotonic_to_bootbased(&uptime);
diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 64c2da27278c..25ea2270aabe 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -16,10 +16,9 @@ extern int devcgroup_device_permission(umode_t mode, dev_t 
dev, int mask);
 extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
-struct cgroup;
-int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
-int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, 
struct seq_file *m);
+int devcgroup_set_perms_ve(struct ve_struct *, unsigned, dev_t, unsigned);
+int devcgroup_seq_show_ve(struct ve_struct *, struct seq_file *);
 
 #else
 static inline int devcgroup_inode_permission(struct inode *inode, int mask)
diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h
index 615e88928e25..b779d2e85b12 100644
--- a/include/linux/fairsched.h
+++ b/include/linux/fairsched.h
@@ -51,31 +51,8 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, 
unsigned int len,
 asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len,

[Devel] test - pls ignore

2016-06-17 Thread Vladimir Davydov

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm: memcontrol: fix race between kmem uncharge and charge reparenting

2016-06-17 Thread Vladimir Davydov
When a cgroup is destroyed, all user memory pages get recharged to the
parent cgroup. Recharging is done by mem_cgroup_reparent_charges which
keeps looping until res <= kmem. This is supposed to guarantee that by
the time the cgroup gets released, no pages are charged to it. However, the
guarantee might be violated in case mem_cgroup_reparent_charges races
with kmem charge or uncharge.

Currently, kmem is charged before res and uncharged after. As a result,
kmem might become greater than res for a short period of time even if
there are still user memory pages charged to the cgroup. In this case
mem_cgroup_reparent_charges will give up prematurely, and the cgroup
might be released though there are still pages charged to it. Uncharge
of such a page will trigger kernel panic:

  general protection fault:  [#1] SMP
  CPU: 0 PID: 972445 Comm: httpd ve: 0 Tainted: G   OE     3.10.0-427.10.1.lve1.4.9.el7.x86_64 #1 12.14
  task: 88065d53d8d0 ti: 880224f34000 task.ti: 880224f34000
  RIP: 0010:[]  [] mem_cgroup_charge_statistics.isra.16+0x13/0x60
  RSP: 0018:880224f37a80  EFLAGS: 00010202
  RAX:  RBX: 8807b26f0110 RCX: 
  RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8
  RBP: 880224f37a80 R08:  R09: 03808000
  R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440
  R13: 0001 R14:  R15: 8806a5566000
  FS:  () GS:8807d400() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f54289bd74c CR3: 0006638b1000 CR4: 06f0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Stack:
   880224f37ac0 811e9ddf 88060001 ea000c9c0440
   0001 037d1000 880224f37c78 0380
   880224f37ad0 811ee99a 880224f37b08 811b9ec9
  Call Trace:
   [] __mem_cgroup_uncharge_common+0xcf/0x320
   [] mem_cgroup_uncharge_page+0x2a/0x30
   [] page_remove_rmap+0xb9/0x160
   [] ? res_counter_uncharge+0x13/0x20
   [] unmap_page_range+0x460/0x870
   [] unmap_single_vma+0x81/0xf0
   [] unmap_vmas+0x49/0x90
   [] exit_mmap+0xac/0x1a0
   [] mmput+0x6b/0x140
   [] flush_old_exec+0x467/0x8d0
   [] load_elf_binary+0x33c/0xde0
   [] ? get_user_pages+0x52/0x60
   [] ? load_elf_library+0x220/0x220
   [] search_binary_handler+0xd5/0x300
   [] do_execve_common.isra.26+0x657/0x720
   [] SyS_execve+0x29/0x30
   [] stub_execve+0x69/0xa0

To prevent this from happening, let's always charge kmem after res and
uncharge before res.
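
To illustrate the ordering (a minimal sketch, not the patch itself; the
res_counter calls are simplified and the memsw counter is left out):

	/*
	 * Reparenting keeps recharging user pages to the parent while
	 * memcg->res usage is above memcg->kmem usage, i.e. while anything
	 * besides kmem is still charged, so a kmem charge must never make
	 * kmem exceed res, even transiently:
	 */
	static int kmem_charge_sketch(struct mem_cgroup *memcg, u64 size)
	{
		struct res_counter *fail_res;

		if (res_counter_charge(&memcg->res, size, &fail_res))
			return -ENOMEM;			/* res first ... */
		if (res_counter_charge(&memcg->kmem, size, &fail_res)) {
			res_counter_uncharge(&memcg->res, size);
			return -ENOMEM;			/* ... kmem second */
		}
		return 0;	/* on free: uncharge kmem first, then res */
	}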

https://bugs.openvz.org/browse/OVZ-6756

Reported-by: Anatoly Stepanov 
Signed-off-by: Vladimir Davydov 
---
 mm/memcontrol.c | 44 
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c3fbb2d2c48..de7c36295515 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3163,10 +3163,6 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
int ret = 0;
bool may_oom;
 
-   ret = res_counter_charge(&memcg->kmem, size, &fail_res);
-   if (ret)
-   return ret;
-
/*
 * Conditions under which we can wait for the oom_killer. Those are
 * the same conditions tested by the core page allocator
@@ -3198,8 +3194,33 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
res_counter_charge_nofail(&memcg->memsw, size,
  &fail_res);
ret = 0;
-   } else if (ret)
-   res_counter_uncharge(&memcg->kmem, size);
+   }
+
+   if (ret)
+   return ret;
+
+   /*
+* When a cgroup is destroyed, all user memory pages get recharged to
+* the parent cgroup. Recharging is done by mem_cgroup_reparent_charges
+* which keeps looping until res <= kmem. This is supposed to guarantee
+* that by the time the cgroup gets released, no pages are charged to it.
+*
+* If kmem were charged before res or uncharged after, kmem might
+* become greater than res for a short period of time even if there
+* were still user memory pages charged to the cgroup. In this case
+* mem_cgroup_reparent_charges would give up prematurely, and the
+* cgroup could be released even though there were still pages charged to
+* it. Uncharge of such a page would trigger kernel panic.
+*
+* To prevent this from happening, kmem must be charged after res and
+* uncharged before res.
+*/
+   ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+   if (ret) {
+   res_counter_uncharge(&

Re: [Devel] memcg: mem_cgroup_uncharge_page() kernel panic/lockup

2016-06-15 Thread Vladimir Davydov
Hi,

Thanks for the report.

Could you please

 - file a bug to bugzilla.openvz.org

 - upload the vmcore at
   rsync://fe.sw.ru/f837d67c8e2ade8cee3367cb0f880268/

On Mon, Jun 13, 2016 at 09:24:33AM +0300, Anatoly Stepanov wrote:
> Hello everyone!
> 
> We encountered an issue with the mem_cgroup_uncharge_page() function;
> it appears quite often on our clients' servers.
> 
> Basically the issue sometimes leads to hard-lockup, sometimes to GP fault.
> 
> Based on bug reports from clients, the problem shows up when a user
> process calls the "execve" or "exit" syscall.
> As we know, in those cases the kernel invokes "uncharging" for every page
> when it's unmapped from all the mm's.
> 
> Kernel dump analysis shows that at the moment of
> mem_cgroup_uncharge_page() the "memcg" pointer
> (taken from page_cgroup) seems to point to some random memory area.
> 
> On the other hand, if we look at current->mm->css, then memcg instance
> exists and is "online".
> 
> This led me to think that "page_cgroup->memcg" may be changed by
> some part of the memcg code in parallel.
> As far as I understand, the only option here is the "reclaim code path"
> (maybe I'm wrong).
> 
> So, I suppose there might be a race between the "memcg uncharge code" and
> the "memcg reclaim code".
> 
> Please give me your thoughts about it.
> Thanks.
> 
> P.S.:
> 
> Additional info:
> 
> Kernel: rh7-3.10.0-327.10.1.vz7.12.14
> 
> *1st
> BT
> 
> PID: 972445  TASK: 88065d53d8d0  CPU: 0   COMMAND: "httpd"
>  #0 [880224f37818] machine_kexec at 8105249b
>  #1 [880224f37878] crash_kexec at 81103532
>  #2 [880224f37948] oops_end at 81641628
>  #3 [880224f37970] die at 810184cb
>  #4 [880224f379a0] do_general_protection at 81640f24
>  #5 [880224f379d0] general_protection at 81640768
> [exception RIP: mem_cgroup_charge_statistics+19]
> RIP: 811e7733  RSP: 880224f37a80  RFLAGS: 00010202
> RAX:   RBX: 8807b26f0110  RCX: 
> RDX: 79726f6765746163  RSI: ea000c9c0440  RDI: 8806a55662f8
> RBP: 880224f37a80   R8:    R9: 03808000
> R10: 00b8  R11: ea001eaa8980  R12: ea000c9c0440
> R13: 0001  R14:   R15: 8806a5566000
> ORIG_RAX:   CS: 0010  SS: 0018
>  #6 [880224f37a88] __mem_cgroup_uncharge_common at 811e9ddf
>  #7 [880224f37ac8] mem_cgroup_uncharge_page at 811ee99a
>  #8 [880224f37ad8] page_remove_rmap at 811b9ec9
>  #9 [880224f37b10] unmap_page_range at 811ab580
> #10 [880224f37bf8] unmap_single_vma at 811aba11
> #11 [880224f37c30] unmap_vmas at 811ace79
> #12 [880224f37c68] exit_mmap at 811b663c
> #13 [880224f37d18] mmput at 8107853b
> #14 [880224f37d38] flush_old_exec at 81202547
> #15 [880224f37d88] load_elf_binary at 8125883c
> #16 [880224f37e58] search_binary_handler at 81201c25
> #17 [880224f37ea0] do_execve_common at 812032b7
> #18 [880224f37f30] sys_execve at 81203619
> #19 [880224f37f50] stub_execve at 81649369
> RIP: 7f54284b3287  RSP: 7ffda57a0698  RFLAGS: 0297
> RAX: 003b  RBX: 037c5fe8  RCX: 
> RDX: 037cf3f8  RSI: 037ce5f8  RDI: 7f5425fcabf1
> RBP: 7ffda57a0750   R8: 0001   R9: 
> 
> 
> ***2nd
> BT**:
> 
> PID: 168440  TASK: 88001e31cc20  CPU: 18  COMMAND: "httpd"
>  #0 [88007255f838] machine_kexec at 8105249b
>  #1 [88007255f898] crash_kexec at 81103532
>  #2 [88007255f968] oops_end at 81641628
>  #3 [88007255f990] no_context at 8163222b
>  #4 [88007255f9e0] __bad_area_nosemaphore at 816322c1
>  #5 [88007255fa30] bad_area_nosemaphore at 8163244a
>  #6 [88007255fa40] __do_page_fault at 8164443e
>  #7 [88007255faa0] trace_do_page_fault at 81644673
>  #8 [88007255fad8] do_async_page_fault at 81643d59
>  #9 [88007255faf0] async_page_fault at 816407f8
> [exception RIP: memcg_check_events+435]
> RIP: 811e9b53  RSP: 88007255fba0  RFLAGS: 00010246
> RAX: f81ef81e  RBX: 8802106d5000  RCX: 
> RDX: f81e  RSI: 0002  RDI: 8807aa2642e8
> RBP: 88007255fbf0   R8: 0202   R9: 
> R10: 0010  R11: 88007255ffd8  R12: 8807aa2642e0
> R13: 0410  R14: 8802073de700  R15: 8802106d5000
> ORIG_RAX:   CS: 0010  SS: 0018
> #10 [88007255fbf8] __mem_cgroup_uncharge_common at 811

Re: [Devel] [PATCH rh7 v3] vtty: Don't free console mapping until no clients left

2016-06-14 Thread Vladimir Davydov
On Tue, Jun 14, 2016 at 12:20:17PM +0300, Cyrill Gorcunov wrote:
> Currently on a container's stop we free the vtty mapping forcibly,
> so if there is an active console hooked from the node it becomes
> unusable from then on. That was easier to work with while we were
> reworking the virtual console code.
> 
> Now let's make the console fully functional as it was in pcs6:
> once opened it must survive the container start/stop cycle
> and checkpoint/restore as well.
> 
> For this sake we:
> 
>  - drop the ve_hook code, it is no longer needed
>  - free the console @map on final close of the last tty opened
> 
> https://jira.sw.ru/browse/PSBM-39463
> 
> Signed-off-by: Cyrill Gorcunov 
> CC: Vladimir Davydov 
> CC: Konstantin Khorenko 
> CC: Igor Sukhih 
> CC: Pavel Emelyanov 

Reviewed-by: Vladimir Davydov 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 v2] vtty: Don't free console mapping until no clients left

2016-06-14 Thread Vladimir Davydov
On Sat, Jun 11, 2016 at 12:35:13PM +0300, Cyrill Gorcunov wrote:
...
> @@ -939,6 +938,7 @@ static vtty_map_t *vtty_map_alloc(envid_
>   lockdep_assert_held(&tty_mutex);
>   if (map) {
>   map->veid = veid;
> + init_completion(&map->work);

Stale hunk?

>   veid = idr_alloc(&vtty_idr, map, veid, veid + 1, GFP_KERNEL);
>   if (veid < 0) {
>   kfree(map);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7 v2] vtty: Allow to wait until container's console appear

2016-06-10 Thread Vladimir Davydov
On Fri, Jun 10, 2016 at 04:34:34PM +0300, Cyrill Gorcunov wrote:
> After the tty code redesign we've been requiring the container to start
> first before being able to connect to it via the vzctl console command.
> Here we instead allow the userspace tool to wait until the container is
> brought to life and then proceed connecting to the console.
> 
> https://jira.sw.ru/browse/PSBM-39463
> 
> Note: when someone tries to open several consoles in offline
> mode (say vzctl console 300 1 and vzctl console 300 2) simultaneously,
> only one is allowed until the VE is up; the second vzctl command will
> exit with -EBUSY.
> 
> v2:
>  - move everything into vtty code
> 
> Signed-off-by: Cyrill Gorcunov 
> CC: Vladimir Davydov 
> CC: Konstantin Khorenko 
> CC: Igor Sukhih 
> CC: Pavel Emelyanov 
> ---
>  drivers/tty/pty.c   |   67 
> 
>  include/linux/ve.h  |2 +
>  kernel/ve/ve.c  |5 +++
>  kernel/ve/vecalls.c |6 ++--
>  4 files changed, 77 insertions(+), 3 deletions(-)
> 
> Index: linux-pcs7.git/drivers/tty/pty.c
> ===
> --- linux-pcs7.git.orig/drivers/tty/pty.c
> +++ linux-pcs7.git/drivers/tty/pty.c
> @@ -1284,8 +1284,64 @@ static int __init vtty_init(void)
>   return 0;
>  }
>  
> +static DECLARE_RWSEM(vtty_console_sem);
> +static DEFINE_IDR(vtty_idr_console);

We already have vtty_idr, may be reuse it?

> +
> +static struct ve_struct *vtty_get_ve_by_id(envid_t veid)
> +{
> + DECLARE_COMPLETION_ONSTACK(console_work);
> + struct ve_struct *ve;
> + int ret;
> +
> + down_write(&vtty_console_sem);
> + ve = get_ve_by_id(veid);
> + if (ve) {
> + up_write(&vtty_console_sem);
> + return ve;
> + }
> +

> + if (idr_find(&vtty_idr_console, veid)) {
> + up_write(&vtty_console_sem);
> + return ERR_PTR(-EBUSY);
> + }

This block is useless - it's handled by ENOSPC check below.
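
(For reference, assuming the usual idr semantics: idr_alloc() allocates
an id from [start, end), so the single-id range used here fails with
-ENOSPC exactly when veid is already registered:

	ret = idr_alloc(&vtty_idr_console, &console_work, veid, veid + 1, GFP_KERNEL);
	if (ret == -ENOSPC)		/* veid taken by another waiter */
		ret = -EBUSY;

which is what makes the preceding idr_find() lookup redundant.)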

> +
> + ret = idr_alloc(&vtty_idr_console, &console_work, veid, veid + 1, GFP_KERNEL);
> + if (ret < 0) {
> + if (ret == -ENOSPC)
> + ret = -EBUSY;
> + } else
> + ret = 0;
> + up_write(&vtty_console_sem);
> +
> + if (!ret)
> + ret = wait_for_completion_interruptible(&console_work);
> +
> + if (!ret)
> + ve = get_ve_by_id(veid);
> + else
> + ve = ERR_PTR(ret);
> +
> + down_write(&vtty_console_sem);
> + if (!ret)
> + idr_remove(&vtty_idr_console, veid);
> + up_write(&vtty_console_sem);
> + return ve;
> +}
> +
> +void vtty_console_notify(struct ve_struct *ve)
> +{
> + struct completion *console_work;
> +
> + down_read(&vtty_console_sem);
> + console_work = idr_find(&vtty_idr_console, ve->veid);
> + if (console_work)
> + complete(console_work);
> + up_read(&vtty_console_sem);
> +}
> +
>  int vtty_open_master(envid_t veid, int idx)
>  {
> + struct ve_struct *ve = NULL;
>   struct tty_struct *tty;
>   struct file *file;
>   char devname[64];
> @@ -1298,6 +1354,16 @@ int vtty_open_master(envid_t veid, int i
>   if (fd < 0)
>   return fd;
>  
> + ve = vtty_get_ve_by_id(veid);
> + if (IS_ERR_OR_NULL(ve)) {
> + if (IS_ERR(ve))
> + ret = PTR_ERR(ve);
> + else
> + ret = -ENOENT;
> + ve = NULL;
> + goto err_put_unused_fd;
> + }
> +

Come to think of it, is this really necessary? Can't we just allocate
vtty_map in vtty_open_master and return master tty w/o open slave? Any
write/read will put the caller to sleep anyway.

>   snprintf(devname, sizeof(devname), "v%utty%d", veid, idx);
>   file = anon_inode_getfile(devname, &vtty_fops, NULL, O_RDWR);
>   if (IS_ERR(file)) {
> @@ -1364,6 +1430,7 @@ int vtty_open_master(envid_t veid, int i
>   mutex_unlock(&tty_mutex);
>   ret = fd;
>  out:
> + put_ve(ve);
>   return ret;
>  
>  err_install:
> Index: linux-pcs7.git/include/linux/ve.h
> ===
> --- linux-pcs7.git.orig/include/linux/ve.h
> +++ linux-pcs7.git/include/linux/ve.h
> @@ -215,6 +215,8 @@ void ve_stop_ns(struct pid_namespace *ns
>  void ve_exit_ns(struct pid_namespace *ns);
>  int ve_start_container(struct ve_struct *ve);
>  
> +void vtty_console_notify(struct ve_struct *ve);
> +
>  extern b

Re: [Devel] [PATCH rh7] vtty: Don't free console mapping until no clients left

2016-06-09 Thread Vladimir Davydov
On Tue, Jun 07, 2016 at 06:18:38PM +0300, Cyrill Gorcunov wrote:
> Currently on a container's stop we free the vtty mapping forcibly,
> so if there is an active console hooked from the node it becomes
> unusable from then on. That was easier to work with while we were
> reworking the virtual console code.
> 
> Now let's make the console fully functional as it was in pcs6:
> once opened it must survive the container start/stop cycle
> and checkpoint/restore as well.
> 
> For this sake we:
> 
>  - drop the ve_hook code, it is no longer needed
>  - free the console @map on final close of the last tty opened
> 
> https://jira.sw.ru/browse/PSBM-39463
> 
> Signed-off-by: Cyrill Gorcunov 
> CC: Vladimir Davydov 
> CC: Konstantin Khorenko 
> CC: Igor Sukhih 
> CC: Pavel Emelyanov 

Reviewed-by: Vladimir Davydov 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] vtty: Allow to wait until container's console appear

2016-06-09 Thread Vladimir Davydov
On Mon, Jun 06, 2016 at 07:26:57PM +0300, Cyrill Gorcunov wrote:
> After the tty code redesign we've been requiring the container to start
> first before being able to connect to it via the vzctl console command.
> Here we instead allow the userspace tool to wait until the container is
> brought to life and then proceed connecting to the console.
> 
> https://jira.sw.ru/browse/PSBM-39463
> 
> Signed-off-by: Cyrill Gorcunov 
> CC: Vladimir Davydov 
> CC: Konstantin Khorenko 
> CC: Igor Sukhih 
> CC: Pavel Emelyanov 
> ---
>  include/linux/ve.h  |2 ++
>  kernel/ve/ve.c  |   48 
>  kernel/ve/vecalls.c |   23 +--
>  3 files changed, 71 insertions(+), 2 deletions(-)
> 
> Index: linux-pcs7.git/include/linux/ve.h
> ===
> --- linux-pcs7.git.orig/include/linux/ve.h
> +++ linux-pcs7.git/include/linux/ve.h
> @@ -215,6 +215,8 @@ void ve_stop_ns(struct pid_namespace *ns
>  void ve_exit_ns(struct pid_namespace *ns);
>  int ve_start_container(struct ve_struct *ve);
>  
> +int ve_console_wait(envid_t veid);
> +
>  extern bool current_user_ns_initial(void);
>  struct user_namespace *ve_init_user_ns(void);
>  
> Index: linux-pcs7.git/kernel/ve/ve.c
> ===
> --- linux-pcs7.git.orig/kernel/ve/ve.c
> +++ linux-pcs7.git/kernel/ve/ve.c
> @@ -260,6 +260,49 @@ struct user_namespace *ve_init_user_ns(v
>  }
>  EXPORT_SYMBOL(ve_init_user_ns);
>  
> +static DEFINE_IDR(ve_idr_console);
> +static DECLARE_RWSEM(ve_console_sem);
> +
> +int ve_console_wait(envid_t veid)
> +{
> + DECLARE_COMPLETION_ONSTACK(console_work);
> + int ret;
> +
> + down_write(&ve_console_sem);
> + if (idr_find(&ve_idr_console, veid)) {
> + up_write(&ve_console_sem);
> + return -EEXIST;
> + }
> +
> + ret = idr_alloc(&ve_idr_console, &console_work, veid, veid + 1, GFP_KERNEL);
> + if (ret < 0) {
> + if (ret == -ENOSPC)
> + ret = -EEXIST;
> + } else
> + ret = 0;
> + downgrade_write(&ve_console_sem);
> +
> + if (!ret) {
> + ret = wait_for_completion_interruptible(&console_work);
> + idr_remove(&ve_idr_console, veid);
> + }
> +
> + up_read(&ve_console_sem);
> + return ret;
> +}
> +EXPORT_SYMBOL(ve_console_wait);
> +
> +static void ve_console_notify(struct ve_struct *ve)
> +{
> + struct completion *console_work;
> +
> + down_read(&ve_console_sem);
> + console_work = idr_find(&ve_idr_console, ve->veid);
> + if (console_work)
> + complete(console_work);
> + up_read(&ve_console_sem);
> +}
> +
>  int nr_threads_ve(struct ve_struct *ve)
>  {
>   return cgroup_task_count(ve->css.cgroup);
> @@ -494,6 +537,11 @@ int ve_start_container(struct ve_struct
>  
>   get_ve(ve); /* for ve_exit_ns() */
>  
> + /*
> +  * Console waiters are to be notified at the very
> +  * end when everything else is ready.
> +  */
> + ve_console_notify(ve);
>   return 0;
>  
>  err_iterate:
> Index: linux-pcs7.git/kernel/ve/vecalls.c
> ===
> --- linux-pcs7.git.orig/kernel/ve/vecalls.c
> +++ linux-pcs7.git/kernel/ve/vecalls.c
> @@ -991,8 +991,27 @@ static int ve_configure(envid_t veid, un
>   int err = -ENOKEY;
>  
>   ve = get_ve_by_id(veid);
> - if (!ve)
> - return -EINVAL;
> + if (!ve) {
> +
> + if (key != VE_CONFIGURE_OPEN_TTY)
> + return -EINVAL;
> + /*
> +  * Offline console management:
> +  * wait until ve is up and proceed.
> +  */

What if a VE is created right here, before we call ve_console_wait()?
Looks like the caller will hang forever...
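
(A sketch of the window, assuming ve_console_notify() only wakes waiters
already registered in the idr:

	/*
	 *   vzctl (ve_configure)          ve start
	 *   --------------------          --------
	 *   get_ve_by_id() -> NULL
	 *                                 ve_start_container()
	 *                                   ve_console_notify(): no waiter yet
	 *   ve_console_wait()
	 *     idr_alloc() + sleep  <---   the only wakeup has already fired,
	 *                                 so the waiter sleeps forever
	 */
)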

> + err = ve_console_wait(veid);
> + if (err)
> + return err;
> +
> + /*
> +  * A container should not exit immediately once
> +  * started but if it does, for any reason, simply
> +  * exit out gracefully.
> +  */
> + ve = get_ve_by_id(veid);
> + if (!ve)
> + return -ENOENT;
> + }

Can't we fold this into vtty_open_master()? The latter doesn't need ve
object, it only needs veid, which is known here.

>  
>   switch(key) {
>   case VE_CONFIGURE_OS_RELEASE:
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] sched/core/cfs: don't reset nr_cpus while setting cpu limits

2016-06-08 Thread Vladimir Davydov
On Tue, Jun 07, 2016 at 04:50:38PM +0300, Andrey Ryabinin wrote:
> Setting cpu limits resets the number of cpus:
> # echo 2 >/sys/fs/cgroup/cpu,cpuacct/101/cpu.nr_cpus
> # vzctl exec 101 cat /proc/cpuinfo |grep -c processor
>  2
> # echo 16 >/sys/fs/cgroup/cpu,cpuacct/101/cpu.cfs_quota_us
> # vzctl exec 101 cat /proc/cpuinfo |grep -c processor
>  4
> # cat /sys/fs/cgroup/cpu,cpuacct/101/cpu.nr_cpus
>  0
> 
> tg_update_cpu_limit() does that without any apparent reason,
> so let's fix it.
> 
> https://jira.sw.ru/browse/PSBM-48061
> 
> Signed-off-by: Andrey Ryabinin 
> ---
>  kernel/sched/core.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2c147c8..51ebed2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8696,7 +8696,6 @@ static void tg_update_cpu_limit(struct task_group *tg)
>   }
>  
>   tg->cpu_rate = rate;
> - tg->nr_cpus = 0;

This is incorrect. Suppose nr_cpus = 2 and you set cfs_quota to
4 * cfs_period. If you don't reset nr_cpus, you'll get cpu limit equal
to 400, although it should be min(nr_cpus * 100, cpu_rate) = 200.
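
(Worked out, under the assumption that cfs_period corresponds to 100% of
one cpu:

	requested rate:  cfs_quota / cfs_period = 4   ->  400% of a cpu
	cpu-count cap:   nr_cpus * 100                =  200%
	expected limit:  min(400, 200)                =  200%

so with a stale nr_cpus the reported limit would be 400 instead of 200.)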

>  }
>  
>  static int tg_set_cpu_limit(struct task_group *tg,
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] net: packet: rework rx/tx ring pages accounting

2016-06-02 Thread Vladimir Davydov
To account tx/rx ring pages to kmemcg, we allocate them with
__GFP_ACCOUNT. After commit 1265d3474391 ("mm: charge/uncharge kmemcg
from generic page allocator paths") this implies that these pages have
PAGE_KMEMCG_MAPCOUNT_VALUE stored in page->_mapcount. This is incorrect
as these pages are supposed to be mapped to userspace:

  BUG: Bad page map in process packet_sock_mma  pte:800241837025 pmd:2428aa067
  page:ea0009060dc0 count:2 mapcount:-255 mapping:  (null) index:0x0
  page flags: 0x2f0004(referenced)
  page dumped because: bad pte
  addr:7f16c9a8c000 vm_flags:18100073 anon_vma:  (null) mapping:880210caed80 index:0
  vma->vm_ops->fault:   (null)
  vma->vm_file->f_op->mmap: sock_mmap+0x0/0x20
  CPU: 2 PID: 6141 Comm: packet_sock_mma ve: e7eccd35-3ea1-4dc1-9a04-dba948120299 Not tainted 3.10.0-327.18.2.vz7.14.10 #1 14.10
  Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
  ea0009060dc0 7be30e48 88024235ba68 81633548
  88024235bab0 811a908f 800241837025 
  8802428aa460 ea0009060dc0 7f16c9a8c000 88024235bc20
  Call Trace:
  [] dump_stack+0x19/0x1b
  [] print_bad_pte+0x1af/0x250
  [] unmap_page_range+0x76b/0x870
  [] unmap_single_vma+0x81/0xf0
  [] unmap_vmas+0x49/0x90
  [] exit_mmap+0xac/0x1a0
  [] mmput+0x6b/0x140
  [] do_exit+0x2ac/0xb10
  [] ? plist_del+0x46/0x70
  [] ? __unqueue_futex+0x32/0x70
  [] ? futex_wait+0x11d/0x280
  [] do_group_exit+0x3f/0xa0
  [] get_signal_to_deliver+0x1d0/0x6d0
  [] do_signal+0x57/0x6c0
  [] ? do_futex+0x15b/0x600
  [] do_notify_resume+0x5f/0xb0
  [] int_signal+0x12/0x17

To fix that, let's charge these pages directly using memcg_charge_kmem()
to the cgroup the packet socket is accounted to (via ->sk_cgrp).
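
(For context, the clash is with the _mapcount marker scheme from the
kmemcg patches further down this archive; a sketch of the encoding,
assuming the usual special values:

	page->_mapcount == -1	plain unmapped page
	page->_mapcount == -256	PageKmemcg marker at the time of this trace
				(later changed to -512, since -256 collides
				with balloon pages)
	page->_mapcount >= 0	page mapped to userspace

mmap()ing a ring page increments _mapcount, turning the marker into a
bogus mapping count - hence "mapcount:-255" in the trace above - which
is why the ring pages must be charged explicitly rather than allocated
with __GFP_ACCOUNT.)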

https://jira.sw.ru/browse/PSBM-47873

Signed-off-by: Vladimir Davydov 
---
 net/packet/af_packet.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index ecb5464c5622..2a1b15a85928 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3712,7 +3712,7 @@ static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
 static char *alloc_one_pg_vec_page(unsigned long order)
 {
char *buffer = NULL;
-   gfp_t gfp_flags = GFP_KERNEL_ACCOUNT | __GFP_COMP |
+   gfp_t gfp_flags = GFP_KERNEL | __GFP_COMP |
  __GFP_ZERO | __GFP_NOWARN | __GFP_NORETRY;
 
buffer = (char *) __get_free_pages(gfp_flags, order);
@@ -3723,7 +3723,7 @@ static char *alloc_one_pg_vec_page(unsigned long order)
/*
 * __get_free_pages failed, fall back to vmalloc
 */
-   buffer = vzalloc_account((1 << order) * PAGE_SIZE);
+   buffer = vzalloc((1 << order) * PAGE_SIZE);
 
if (buffer)
return buffer;
@@ -3770,6 +3770,7 @@ out_free_pgvec:
 static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
int closing, int tx_ring)
 {
+   struct packet_sk_charge *psc = (struct packet_sk_charge *)sk->sk_cgrp;
struct pgv *pg_vec = NULL;
struct packet_sock *po = pkt_sk(sk);
int was_running, order = 0;
@@ -3839,9 +3840,16 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 
err = -ENOMEM;
order = get_order(req->tp_block_size);
+   if (psc && memcg_charge_kmem(psc->memcg, GFP_KERNEL,
+   (PAGE_SIZE << order) * req->tp_block_nr))
+   goto out;
pg_vec = alloc_pg_vec(req, order);
-   if (unlikely(!pg_vec))
+   if (unlikely(!pg_vec)) {
+   if (psc)
+   memcg_uncharge_kmem(psc->memcg,
+   (PAGE_SIZE << order) * req->tp_block_nr);
goto out;
+   }
switch (po->tp_version) {
case TPACKET_V3:
/* Transmit path is not supported. We checked
@@ -3912,8 +3920,12 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
}
release_sock(sk);
 
-   if (pg_vec)
+   if (pg_vec) {
+   if (psc)
+   memcg_uncharge_kmem(psc->memcg,
+   (PAGE_SIZE << order) * req->tp_block_nr);
free_pg_vec(pg_vec, order, req->tp_block_nr);
+   }
 out:
return err;
 }
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm: fix PAGE_KMEMCG_MAPCOUNT_VALUE

2016-06-02 Thread Vladimir Davydov
It should be -512; -256 is already used for balloon pages.
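
(For reference, the special page->_mapcount markers this value must not
collide with, assuming the usual values in this tree:

	-1	plain unmapped page
	-128	PAGE_BUDDY_MAPCOUNT_VALUE (free pages in the buddy allocator)
	-256	PAGE_BALLOON_MAPCOUNT_VALUE (balloon pages)
	-512	PAGE_KMEMCG_MAPCOUNT_VALUE after this fix)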

Fixes: 1265d3474391 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Signed-off-by: Vladimir Davydov 
---
 include/linux/page-flags.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d15d20d84142..731a76613ea4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -523,7 +523,7 @@ static inline void __ClearPageBalloon(struct page *page)
atomic_set(&page->_mapcount, -1);
 }
 
-#define PAGE_KMEMCG_MAPCOUNT_VALUE (-256)
+#define PAGE_KMEMCG_MAPCOUNT_VALUE (-512)
 
 static inline int PageKmemcg(struct page *page)
 {
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] pfcache: abort ext4_pfcache_open if inode already has peer installed

2016-06-01 Thread Vladimir Davydov
Calling ioctl(FS_IOC_PFCACHE_OPEN) on an inode that already has a
pfcache peer installed results in i_peer_list corruption:

  WARNING: at lib/list_debug.c:36 __list_add+0x8a/0xc0()
  list_add double add: new=88009c525d40, prev=880088a5bac0, next=88009c525d40.
  CPU: 5 PID: 1429 Comm: pfcached ve: 0 Not tainted 3.10.0-327.18.2.vz7.14.9 #1 14.9
   0024 85f7231d 88008f153c80 81632bb7
   88008f153cb8 8107b460 88009c525d40 88009c525d40
   880088a5bac0 88009c525cc8 88009c525c90 88008f153d20
  Call Trace:
   [] dump_stack+0x19/0x1b
   [] warn_slowpath_common+0x70/0xb0
   [] warn_slowpath_fmt+0x5c/0x80
   [] __list_add+0x8a/0xc0
   [] open_mapping_peer+0x15c/0x1f0
   [] ext4_open_pfcache+0x155/0x1b0 [ext4]
   [] ext4_ioctl+0xa9/0x15f0 [ext4]
   [] ? handle_mm_fault+0x5b4/0xf50
   [] ? do_filp_open+0x4b/0xb0
   [] do_vfs_ioctl+0x255/0x4f0
   [] ? __do_page_fault+0x164/0x450
   [] SyS_ioctl+0x54/0xa0
   [] system_call_fastpath+0x16/0x1b

https://jira.sw.ru/browse/PSBM-47806

Signed-off-by: Vladimir Davydov 
---
 fs/ext4/pfcache.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ext4/pfcache.c b/fs/ext4/pfcache.c
index fe1296f27eb2..ab2f20c243d1 100644
--- a/fs/ext4/pfcache.c
+++ b/fs/ext4/pfcache.c
@@ -43,6 +43,9 @@ int ext4_open_pfcache(struct inode *inode)
struct path root, path;
int ret;
 
+   if (inode->i_mapping->i_peer_file)
+   return -EBUSY;
+
if (!(ext4_test_inode_state(inode, EXT4_STATE_PFCACHE_CSUM) &&
  EXT4_I(inode)->i_data_csum_end < 0))
return -ENODATA;
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 5/7] af_unix: charge buffers to kmemcg

2016-05-30 Thread Vladimir Davydov
Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.

Since unix socket buffers are always allocated from process context,
all we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock->sk_allocation mask.
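
(A sketch of why the single assignment suffices, assuming all unix
socket buffers go through the generic sock helpers:

	/*
	 * sock_alloc_send_skb() and friends allocate with sk->sk_allocation,
	 * ultimately alloc_skb(size, sk->sk_allocation), so every buffer of
	 * this socket now carries __GFP_ACCOUNT and gets charged to the
	 * allocating task's memcg:
	 */
	skb = sock_alloc_send_skb(sk, size, noblock, &err);
)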

https://jira.sw.ru/browse/PSBM-34562

Signed-off-by: Vladimir Davydov 
---
 net/unix/af_unix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 0e629f509cd0..1da93a400145 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -761,6 +761,7 @@ static struct sock *unix_create1(struct net *net, struct 
socket *sock)
lockdep_set_class(&sk->sk_receive_queue.lock,
&af_unix_sk_receive_queue_lock_key);
 
+   sk->sk_allocation   = GFP_KERNEL_ACCOUNT;
sk->sk_write_space  = unix_write_space;
sk->sk_max_ack_backlog  = net->unx.sysctl_max_dgram_qlen;
sk->sk_destruct = unix_sock_destructor;
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/7] Drop alloc_kmem_pages and friends

2016-05-30 Thread Vladimir Davydov
These functions work exactly like alloc_pages and friends except that
they charge the allocated page to the current memcg if __GFP_ACCOUNT is
passed. In the next patch I'm going to move charge/uncharge to the
generic allocation paths, so these special helpers won't be necessary.
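
Before/after, as a sketch (the helper names match what this patch
removes; the page-based variants are shown):

	/* before: dedicated helpers required for accounted pages */
	page = alloc_kmem_pages(GFP_KERNEL_ACCOUNT, 0);
	__free_kmem_pages(page, 0);

	/* after the series: plain allocator calls, __GFP_ACCOUNT does the rest */
	page = alloc_pages(GFP_KERNEL_ACCOUNT, 0);
	__free_page(page);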

Signed-off-by: Vladimir Davydov 
---
 arch/x86/include/asm/pgalloc.h| 14 -
 arch/x86/kernel/ldt.c |  6 ++--
 arch/x86/mm/pgtable.c | 19 +---
 fs/pipe.c | 11 +++
 include/linux/gfp.h   |  8 -
 kernel/fork.c |  6 ++--
 mm/memcontrol.c   |  1 -
 mm/page_alloc.c   | 65 ---
 mm/slab_common.c  |  2 +-
 mm/slub.c |  4 +--
 mm/vmalloc.c  |  6 ++--
 net/netfilter/nf_conntrack_core.c |  6 ++--
 net/packet/af_packet.c|  8 ++---
 13 files changed, 38 insertions(+), 118 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index f5897582b88c..58e45671d127 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -48,7 +48,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 static inline void pte_free(struct mm_struct *mm, struct page *pte)
 {
pgtable_page_dtor(pte);
-   __free_kmem_pages(pte, 0);
+   __free_page(pte);
 }
 
 extern void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte);
@@ -81,11 +81,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
struct page *page;
-   page = alloc_kmem_pages(GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO, 0);
+   page = alloc_pages(GFP_KERNEL_ACCOUNT | __GFP_REPEAT | __GFP_ZERO, 0);
if (!page)
return NULL;
if (!pgtable_pmd_page_ctor(page)) {
-   __free_kmem_pages(page, 0);
+   __free_page(page);
return NULL;
}
return (pmd_t *)page_address(page);
@@ -95,7 +95,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
pgtable_pmd_page_dtor(virt_to_page(pmd));
-   free_kmem_pages((unsigned long)pmd, 0);
+   free_page((unsigned long)pmd);
 }
 
 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd);
@@ -125,14 +125,14 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)__get_free_kmem_pages(GFP_KERNEL_ACCOUNT|__GFP_REPEAT|
- __GFP_ZERO, 0);
+   return (pud_t *)__get_free_page(GFP_KERNEL_ACCOUNT|__GFP_REPEAT|
+   __GFP_ZERO);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-   free_kmem_pages((unsigned long)pud, 0);
+   free_page((unsigned long)pud);
 }
 
 extern void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud);
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 4a6c8fee47f2..942b0a4e40d5 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -44,7 +44,7 @@ static int alloc_ldt(mm_context_t *pc, int mincount, int reload)
if (mincount * LDT_ENTRY_SIZE > PAGE_SIZE)
newldt = vmalloc_account(mincount * LDT_ENTRY_SIZE);
else
-   newldt = (void *)__get_free_kmem_pages(GFP_KERNEL_ACCOUNT, 0);
+   newldt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
 
if (!newldt)
return -ENOMEM;
@@ -83,7 +83,7 @@ static int alloc_ldt(mm_context_t *pc, int mincount, int reload)
if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
vfree(oldldt);
else
-   __free_kmem_pages(virt_to_page(oldldt), 0);
+   __free_page(virt_to_page(oldldt));
}
return 0;
 }
@@ -138,7 +138,7 @@ void destroy_context(struct mm_struct *mm)
if (mm->context.size * LDT_ENTRY_SIZE > PAGE_SIZE)
vfree(mm->context.ldt);
else
-   __free_kmem_pages(virt_to_page(mm->context.ldt), 0);
+   __free_page(virt_to_page(mm->context.ldt));
mm->context.size = 0;
}
 }
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 02ec6243372d..ba13ef8e651a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -25,11 +25,11 @@ pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long address)
 {
struct page *pte;
 
-   pte = alloc_kmem_pages(__userpte_alloc_gfp, 0);
+   pte = alloc_pages(__userpte_alloc_gfp, 0);
if (!pte)
r

[Devel] [PATCH rh7 4/7] mm: charge/uncharge kmemcg from generic page allocator paths

2016-05-30 Thread Vladimir Davydov
Currently, to charge a non-slab allocation to kmemcg one has to use the
alloc_kmem_pages helper with the __GFP_ACCOUNT flag. A page allocated
with this helper must eventually be freed using free_kmem_pages,
otherwise it won't be uncharged.

This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e. when an allocation is
supposed to be freed with put_page, as is the case with pipe or unix
socket buffers.

To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e. to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends. A charged page will be
automatically uncharged on free.

To make it possible, we need to mark pages charged to kmemcg somehow. To
avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages. Since pages charged to kmemcg are not supposed to be
mapped to userspace, it should work just fine. There are other (ab)users
of page->_mapcount - buddy and balloon pages - but we don't conflict
with them.

In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to the generic page allocator paths. If kmemcg is
used, it adds just one gfp flags check on alloc and one
page->_mapcount check on free, which shouldn't hurt performance because
the data accessed are hot.
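
The resulting flow, as a sketch (function names as in the diff below):

	/*
	 * alloc side: every allocation funnels through
	 *   __alloc_pages_nodemask(gfp, order, ...)
	 *     -> memcg_kmem_newpage_charge(page, gfp, order)
	 *          charges the memcg and sets the PageKmemcg marker
	 *          (only acts if __GFP_ACCOUNT is set in gfp)
	 *
	 * free side: no dedicated free helper needed anymore
	 *   __free_pages(page, order)
	 *     -> free_pages_prepare(page, order)
	 *          -> memcg_kmem_uncharge_pages(page, order)
	 *               a no-op unless PageKmemcg(page)
	 */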

Signed-off-by: Vladimir Davydov 
---
 include/linux/memcontrol.h |  3 ++-
 include/linux/page-flags.h | 19 +++
 mm/memcontrol.c|  4 
 mm/page_alloc.c|  4 
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d26adf10eaa7..48bf2caa008d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -617,7 +618,7 @@ memcg_kmem_newpage_charge(struct page *page, gfp_t gfp, int order)
 static inline void
 memcg_kmem_uncharge_pages(struct page *page, int order)
 {
-   if (memcg_kmem_enabled())
+   if (memcg_kmem_enabled() && PageKmemcg(page))
__memcg_kmem_uncharge_pages(page, order);
 }
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index cdf83ecac8f3..d15d20d84142 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -523,6 +523,25 @@ static inline void __ClearPageBalloon(struct page *page)
atomic_set(&page->_mapcount, -1);
 }
 
+#define PAGE_KMEMCG_MAPCOUNT_VALUE (-256)
+
+static inline int PageKmemcg(struct page *page)
+{
+   return atomic_read(&page->_mapcount) == PAGE_KMEMCG_MAPCOUNT_VALUE;
+}
+
+static inline void __SetPageKmemcg(struct page *page)
+{
+   VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
+   atomic_set(&page->_mapcount, PAGE_KMEMCG_MAPCOUNT_VALUE);
+}
+
+static inline void __ClearPageKmemcg(struct page *page)
+{
+   VM_BUG_ON_PAGE(!PageKmemcg(page), page);
+   atomic_set(&page->_mapcount, -1);
+}
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8eb48071ea22..1c3fbb2d2c48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3555,6 +3555,8 @@ __memcg_kmem_newpage_charge(struct page *page, gfp_t gfp, int order)
SetPageCgroupUsed(pc);
unlock_page_cgroup(pc);
 
+   __SetPageKmemcg(page);
+
return true;
 }
 
@@ -3588,6 +3590,8 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 
VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
+
+   __ClearPageKmemcg(page);
 }
 
 struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f02d8013add..2b04f36ea016 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -748,6 +748,7 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 
if (PageAnon(page))
page->mapping = NULL;
+   memcg_kmem_uncharge_pages(page, order);
for (i = 0; i < (1 << order); i++) {
bad += free_pages_check(page + i);
if (static_key_false(&zero_free_pages))
@@ -2804,6 +2805,9 @@ out:
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
 
+   if (page && !memcg_kmem_newpage_charge(page, gfp_mask, order))
+   __free_pages(page, order);
+
return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 0/7] Some kmemcg related fixes

2016-05-30 Thread Vladimir Davydov
This patch set backports some changes from the following patch set
submitted upstream:

  lkml.kernel.org/r/cover.1464079537.git.vdavy...@virtuozzo.com

[hasn't been merged yet]

namely:
 - move kmemcg charge/uncharge to generic allocator paths
 - fix pipe buffer stealing
 - avoid charging kernel page tables
 - account unix socket buffers to kmemcg (PSBM-34562)

Vladimir Davydov (7):
  Drop alloc_kmem_pages and friends
  mm: memcontrol: drop memcg_kmem_commit_charge
  Move PageBalloon and PageBuddy helpers to page-flags.h
  mm: charge/uncharge kmemcg from generic page allocator paths
  af_unix: charge buffers to kmemcg
  pipe: uncharge page on ->steal
  arch: x86: don't charge kernel page tables to kmemcg

 arch/x86/include/asm/pgalloc.h| 22 +
 arch/x86/kernel/ldt.c |  6 ++--
 arch/x86/mm/pgtable.c | 32 +-
 fs/pipe.c | 28 +++-
 include/linux/gfp.h   |  8 -
 include/linux/memcontrol.h| 39 --
 include/linux/mm.h| 47 --
 include/linux/page-flags.h| 66 +
 kernel/fork.c |  6 ++--
 mm/memcontrol.c   | 31 ++
 mm/page_alloc.c   | 69 +++
 mm/slab_common.c  |  2 +-
 mm/slub.c |  4 +--
 mm/vmalloc.c  |  6 ++--
 net/netfilter/nf_conntrack_core.c |  6 ++--
 net/packet/af_packet.c|  8 ++---
 net/unix/af_unix.c|  1 +
 17 files changed, 157 insertions(+), 224 deletions(-)

-- 
2.1.4

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


  1   2   3   4   5   6   7   8   9   10   >