Re: [Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-20 Thread Maxim Patlasov

Dima,

I agree with the general approach of this patch, but there are some
(easy-to-fix) issues. Please see the inline comments below.


On 06/20/2016 11:58 AM, Dmitry Monakhov wrote:

The barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in the case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI : data written, time to update index
  ->delta->allocate_complete:ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

  ->submit()

BUG#2: currently kaio write_page silently ignores REQ_FUA


Sorry, I can't agree; it actually does not ignore it:


static void
kaio_write_page(struct ploop_io * io, struct ploop_request * preq,
		 struct page * page, sector_t sec, int fua)
{
	/* No FUA in kaio, convert it to fsync */
	if (fua)
		set_bit(PLOOP_REQ_KAIO_FSYNC, &preq->state);




BUG#3: io_direct:dio_submit: if fua_delay is not possible we MUST tag all bios
with REQ_FUA, not just the last one.


No need to tag *all*. See inline comments below.


This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH; currently it is supported only by io_direct
- Fix up FUA handling for dio_submit

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
  drivers/block/ploop/dev.c   |  8 +---
  drivers/block/ploop/io_direct.c | 29 +-
  drivers/block/ploop/io_kaio.c   | 17 ++--
  drivers/block/ploop/map.c   | 45 ++---
  include/linux/ploop/ploop.h |  8 
  5 files changed, 54 insertions(+), 53 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * preq)
 
 	__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+	if (!preq->error) {
+		WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+	}
 	while (preq->bl.head) {
 		struct bio * bio = preq->bl.head;
 		preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
 		top_delta = ploop_top_delta(plo);
 		sbl.head = sbl.tail = preq->aux_bio;
 
-		/* Relocated data write required sync before BAT updatee */
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+		/* Relocated data write required sync before BAT updatee
+		 * this will happen inside index_update */
 		if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
 			preq->eng_state = PLOOP_E_DATA_WBI;
 			plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..d7ecd4a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
 	trace_submit(preq);
 
 	preflush = !!(rw & REQ_FLUSH);
-
-	if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-		preflush = 1;
-
-	if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-		postfua = 1;
-
-	if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+	postfua = !!(rw & REQ_FUA);
+	if (ploop_req_delay_fua_possible(rw, preq)) {
 		/* Mark req that delayed flush required */
-		set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-	} else if (rw & REQ_FUA) {
-		postfua = 1;
+		set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+		postfua = 0;
 	}


"postfua" is a horrible name, let us see if we can get rid of it 
completely. Also, the way ploop_req_delay_fua_possible is implemented
is error-prone (see the issue in kaio_complete_io_state below). Let's
rework it like this:


static inline bool ploop_req_delay_fua_possible(struct ploop_request *preq)
{
	return preq->eng_state == PLOOP_E_DATA_WBI;
}


Then that chunk in dio_submit above might look like this:


/* If we can delay, mark req that delayed flush required */
if ((rw & REQ_FUA) && 

[Devel] [PATCH 3/3] ploop: fixup FORCE_{FLUSH,FUA} handling v2

2016-06-20 Thread Dmitry Monakhov
The barrier code is broken in many ways:
Currently only ->dio_submit() handles PLOOP_REQ_FORCE_{FLUSH,FUA} correctly.
But a request can also go through ->dio_submit_alloc()->dio_submit_pad and
write_page (for indexes).
So in the case of grow_dev we have the following sequence:

E_RELOC_DATA_READ:
  ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
  ->delta->allocate
    ->io->submit_alloc: dio_submit_alloc
      ->dio_submit_pad
E_DATA_WBI : data written, time to update index
  ->delta->allocate_complete:ploop_index_update
    ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
    ->write_page
    ->ploop_map_wb_complete
      ->ploop_wb_complete_post_process
        ->set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
E_RELOC_NULLIFY:

  ->submit()

BUG#2: currently kaio write_page silently ignores REQ_FUA
BUG#3: io_direct:dio_submit: if fua_delay is not possible we MUST tag all bios
with REQ_FUA, not just the last one.
This patch unifies barrier handling as follows:
- Get rid of FORCE_{FLUSH,FUA}
- Introduce DELAYED_FLUSH; currently it is supported only by io_direct
- Fix up FUA handling for dio_submit

This makes the reloc sequence optimal:
io_direct
RELOC_S: R1, W2, WBI:FLUSH|FUA
RELOC_A: R1, W2, WBI:FLUSH|FUA, W1:NULLIFY|FUA
io_kaio
RELOC_S: R1, W2:FUA, WBI:FUA
RELOC_A: R1, W2:FUA, WBI:FUA, W1:NULLIFY|FUA

https://jira.sw.ru/browse/PSBM-47107
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/dev.c   |  8 +---
 drivers/block/ploop/io_direct.c | 29 +-
 drivers/block/ploop/io_kaio.c   | 17 ++--
 drivers/block/ploop/map.c   | 45 ++---
 include/linux/ploop/ploop.h |  8 
 5 files changed, 54 insertions(+), 53 deletions(-)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index 96f7850..fbc5f2f 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -1224,6 +1224,9 @@ static void ploop_complete_request(struct ploop_request * preq)
 
 	__TRACE("Z %p %u\n", preq, preq->req_cluster);
 
+	if (!preq->error) {
+		WARN_ON(test_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state));
+	}
 	while (preq->bl.head) {
 		struct bio * bio = preq->bl.head;
 		preq->bl.head = bio->bi_next;
@@ -2530,9 +2533,8 @@ restart:
 		top_delta = ploop_top_delta(plo);
 		sbl.head = sbl.tail = preq->aux_bio;
 
-		/* Relocated data write required sync before BAT updatee */
-		set_bit(PLOOP_REQ_FORCE_FUA, &preq->state);
-
+		/* Relocated data write required sync before BAT updatee
+		 * this will happen inside index_update */
 		if (test_bit(PLOOP_REQ_RELOC_S, &preq->state)) {
 			preq->eng_state = PLOOP_E_DATA_WBI;
 			plo->st.bio_out++;
diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index a6d83fe..d7ecd4a 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -90,21 +90,12 @@ dio_submit(struct ploop_io *io, struct ploop_request * preq,
 	trace_submit(preq);
 
 	preflush = !!(rw & REQ_FLUSH);
-
-	if (test_and_clear_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state))
-		preflush = 1;
-
-	if (test_and_clear_bit(PLOOP_REQ_FORCE_FUA, &preq->state))
-		postfua = 1;
-
-	if (!postfua && ploop_req_delay_fua_possible(rw, preq)) {
-
+	postfua = !!(rw & REQ_FUA);
+	if (ploop_req_delay_fua_possible(rw, preq)) {
 		/* Mark req that delayed flush required */
-		set_bit(PLOOP_REQ_FORCE_FLUSH, &preq->state);
-	} else if (rw & REQ_FUA) {
-		postfua = 1;
+		set_bit(PLOOP_REQ_DELAYED_FLUSH, &preq->state);
+		postfua = 0;
 	}
-
 	rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
@@ -238,14 +229,15 @@ flush_bio:
rw2 |= REQ_FLUSH;
preflush = 0;
}
-   if (unlikely(postfua && !bl.head))
-   rw2 |= (REQ_FUA | ((bio_num) ? REQ_FLUSH : 0));
+   /* Very unlikely, but correct.
+* TODO: Optimize postfua via DELAY_FLUSH for any req state */
+   if (unlikely(!postfua))
+   rw2 |= REQ_FUA;
 
ploop_acc_ff_out(preq->plo, rw2 | b->bi_rw);
submit_bio(rw2, b);
bio_num++;
}
-
ploop_complete_io_request(preq);
return;
 
@@ -1520,15 +1512,14 @@ dio_read_page(struct ploop_io * io, struct 
ploop_request * preq,
 
 static void
 dio_write_page(struct ploop_io * io, struct ploop_request * preq,
-  struct page * page, sector_t sec, int fua)
+  struct page * page, sector_t sec, unsigned long rw)
 {
if (!(io->files.file->f_mode & FMODE_WRITE)) {

[Devel] [PATCH 1/3] ploop: skip redundant fsync for REQ_FUA in post_submit v2

2016-06-20 Thread Dmitry Monakhov
Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index b844a80..58d7580 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -517,27 +517,31 @@ dio_post_submit(struct ploop_io *io, struct ploop_request * preq)
 	struct ploop_device *plo = preq->plo;
 	sector_t sec = (sector_t)preq->iblock << preq->plo->cluster_log;
 	loff_t clu_siz = 1 << (preq->plo->cluster_log + 9);
+	int force_sync = preq->req_rw & REQ_FUA;
 	int err;
 
 	file_start_write(io->files.file);
 
-	/* Here io->io_count is even ... */
-	spin_lock_irq(&plo->lock);
-	io->io_count++;
-	set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
-	spin_unlock_irq(&plo->lock);
-
+	if (!force_sync) {
+		/* Here io->io_count is even ... */
+		spin_lock_irq(&plo->lock);
+		io->io_count++;
+		set_bit(PLOOP_IO_FSYNC_DELAYED, &io->io_state);
+		spin_unlock_irq(&plo->lock);
+	}
 	err = io->files.file->f_op->fallocate(io->files.file,
 					      FALLOC_FL_CONVERT_UNWRITTEN,
 					      (loff_t)sec << 9, clu_siz);
 
 	/* highly unlikely case: FUA coming to a block not provisioned yet */
-	if (!err && (preq->req_rw & REQ_FUA))
+	if (!err && force_sync)
 		err = io->ops->sync(io);
 
-	spin_lock_irq(&plo->lock);
-	io->io_count++;
-	spin_unlock_irq(&plo->lock);
+	if (!force_sync) {
+		spin_lock_irq(&plo->lock);
+		io->io_count++;
+		spin_unlock_irq(&plo->lock);
+	}
 	/* and here io->io_count is even (+2) again. */
 
 	file_end_write(io->files.file);
-- 
1.8.3.1

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/3] ploop: deadcode cleanup

2016-06-20 Thread Dmitry Monakhov
The (rw & REQ_FUA) branch is unreachable because REQ_FUA is cleared on the
line above. The logic was moved to ploop_req_delay_fua_possible() a long time ago.
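
Why the removed branch can never run can be shown with a few lines of
userspace C. This is only a sketch: the SKETCH_REQ_* bit values and the
function name are made up for illustration and are not the kernel's REQ_*
definitions.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; the kernel's REQ_* values differ. */
#define SKETCH_REQ_FLUSH (1UL << 0)
#define SKETCH_REQ_FUA   (1UL << 1)

/*
 * Mirrors the order of operations in the removed code: the FUA bit is
 * masked out first and tested afterwards, so the test cannot succeed.
 */
static bool fua_branch_reachable(unsigned long rw)
{
	rw &= ~(SKETCH_REQ_FLUSH | SKETCH_REQ_FUA);
	return (rw & SKETCH_REQ_FUA) != 0;
}
```

Whatever value of rw is passed in, the function returns false, which is
why the branch is dead code.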

Signed-off-by: Dmitry Monakhov 
---
 drivers/block/ploop/io_direct.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/block/ploop/io_direct.c b/drivers/block/ploop/io_direct.c
index 58d7580..a6d83fe 100644
--- a/drivers/block/ploop/io_direct.c
+++ b/drivers/block/ploop/io_direct.c
@@ -108,15 +108,6 @@ dio_submit(struct ploop_io *io, struct ploop_request * 
preq,
rw &= ~(REQ_FLUSH | REQ_FUA);
 
 
-   /* In case of eng_state != COMPLETE, we'll do FUA in
-* ploop_index_update(). Otherwise, we should mark
-* last bio as FUA here. */
-   if (rw & REQ_FUA) {
-   rw &= ~REQ_FUA;
-   if (preq->eng_state == PLOOP_E_COMPLETE)
-   postfua = 1;
-   }
-
 	bio_list_init(&bl);
 
if (iblk == PLOOP_ZERO_INDEX)
-- 
1.8.3.1



[Devel] [PATCH RHEL7 COMMIT] cgroup: fix path mangling for ve cgroups

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 79fa6ee2446a3efe9791378cf9b582bbee0ef7ec
Author: Vladimir Davydov 
Date:   Mon Jun 20 21:07:58 2016 +0400

cgroup: fix path mangling for ve cgroups

Presently, we just cut first component off cgroup path when inside a VE,
because all VE cgroups are located at the top level of the cgroup
hierarchy. However, this is going to change - the cgroups are going to
move to machine.slice - so we should introduce a more generic way of
mangling cgroup paths.

This patch does the trick. On a VE start it marks all cgroups the init
task of the VE resides in with a special flag (CGRP_VE_ROOT). Cgroups
marked this way will be treated as root if looked at from inside a VE.
As long as we don't have nested VEs, this should work fine.

Note, we don't need to clear these flags on VE destruction, because
vzctl always creates new cgroups on VE start.
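
The idea of treating a flagged cgroup as the root can be sketched in
userspace C. This is an illustration under stated assumptions, not the
kernel code: struct sketch_cgroup, its ve_root field and
sketch_cgroup_path() are hypothetical stand-ins for struct cgroup, the
CGRP_VE_ROOT flag and __cgroup_path().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct cgroup. */
struct sketch_cgroup {
	const char *name;
	const struct sketch_cgroup *parent;	/* NULL for the real root */
	bool ve_root;				/* plays the role of CGRP_VE_ROOT */
};

/*
 * Build the path of cgrp into buf, walking from the leaf towards the root.
 * When virt is true the walk stops at the first cgroup marked ve_root, so
 * that cgroup is reported as "/". Returns 0 on success, -1 if buf is too
 * small.
 */
static int sketch_cgroup_path(const struct sketch_cgroup *cgrp, char *buf,
			      size_t buflen, bool virt)
{
	char *start;

	if (buflen < 2)
		return -1;
	start = buf + buflen - 1;
	*start = '\0';

	while (cgrp->parent && !(virt && cgrp->ve_root)) {
		size_t len = strlen(cgrp->name);

		/* Need room for the name plus the leading '/'. */
		if ((size_t)(start - buf) < len + 1)
			return -1;
		start -= len;
		memcpy(start, cgrp->name, len);
		*--start = '/';
		cgrp = cgrp->parent;
	}
	/* Nothing was written: the cgroup itself is the (virtual) root. */
	if (*start != '/')
		*--start = '/';

	memmove(buf, start, (size_t)(buf + buflen - start));
	return 0;
}
```

With a hierarchy such as /machine.slice/101/test.slice where 101 is the
marked cgroup, the sketch reports /test.slice when virt is set and the
full path otherwise, regardless of how deep the container cgroup sits.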

https://jira.sw.ru/browse/PSBM-48629

Signed-off-by: Vladimir Davydov 
---
 include/linux/cgroup.h |  3 +++
 kernel/cgroup.c| 27 ---
 kernel/ve/ve.c |  4 
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index aad06e8e0258..730ca9091bfb 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -175,6 +175,9 @@ enum {
CGRP_CPUSET_CLONE_CHILDREN,
/* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */
CGRP_SANE_BEHAVIOR,
+
+   /* The cgroup is root in a VE */
+   CGRP_VE_ROOT,
 };
 
 struct cgroup_name {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dd548853e2eb..581924e7af9e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1791,6 +1791,21 @@ static struct file_system_type cgroup_fs_type = {
 
 static struct kobject *cgroup_kobj;
 
+#ifdef CONFIG_VE
+void cgroup_mark_ve_root(struct ve_struct *ve)
+{
+	struct cgroup *cgrp;
+	struct cgroupfs_root *root;
+
+	mutex_lock(&cgroup_mutex);
+	for_each_active_root(root) {
+		cgrp = task_cgroup_from_root(ve->init_task, root);
+		set_bit(CGRP_VE_ROOT, &cgrp->flags);
+	}
+	mutex_unlock(&cgroup_mutex);
+}
+#endif
+
 /**
  * cgroup_path - generate the path of a cgroup
  * @cgrp: the cgroup in question
@@ -1804,7 +1819,8 @@ static struct kobject *cgroup_kobj;
  * inode's i_mutex, while on the other hand cgroup_path() can be called
  * with some irq-safe spinlocks held.
  */
-int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt)
+static int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen,
+bool virt)
 {
int ret = -ENAMETOOLONG;
char *start;
@@ -1824,14 +1840,11 @@ int __cgroup_path(const struct cgroup *cgrp, char *buf, 
int buflen, bool virt)
int len;
 
 #ifdef CONFIG_VE
-		if (virt && cgrp->parent && !cgrp->parent->parent) {
+		if (virt && test_bit(CGRP_VE_ROOT, &cgrp->flags)) {
/*
 * Containers cgroups are bind-mounted from node
 * so they are like '/' from inside, thus we have
-* to mangle cgroup path output. Effectively it is
-* enough to remove two topmost cgroups from path.
-* e.g. in ct 101: /101/test.slice/test.scope ->
-* /test.slice/test.scope
+* to mangle cgroup path output.
 */
if (*start != '/') {
if (--start < buf)
@@ -2391,7 +2404,7 @@ static ssize_t cgroup_file_write(struct file *file, const 
char __user *buf,
 * inside a container FS.
 */
if (!ve_is_super(get_exec_env())
-   && (!cgrp->parent || !cgrp->parent->parent)
+	    && test_bit(CGRP_VE_ROOT, &cgrp->flags)
&& !get_exec_env()->is_pseudosuper
&& !(cft->flags & CFTYPE_VE_WRITABLE))
return -EPERM;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 9904a4ae130e..2459cb53a665 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -452,6 +452,8 @@ static void ve_drop_context(struct ve_struct *ve)
 
 static const struct timespec zero_time = { };
 
+extern void cgroup_mark_ve_root(struct ve_struct *ve);
+
 /* under ve->op_sem write-lock */
 static int ve_start_container(struct ve_struct *ve)
 {
@@ -499,6 +501,8 @@ static int ve_start_container(struct ve_struct *ve)
if (err < 0)
goto err_iterate;
 
+   cgroup_mark_ve_root(ve);
+
ve->is_running = 1;
 
printk(KERN_INFO "CT: %s: started\n", ve_name(ve));

[Devel] [PATCH RHEL7 COMMIT] Drop vz_compat boot param

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit f8b72e7837625c7de569fefcf3bba05ac2ef6b5e
Author: Vladimir Davydov 
Date:   Mon Jun 20 21:01:36 2016 +0400

Drop vz_compat boot param

It was introduced by commit d7b23ae8a314f ("ve/cgroups: use cgroup
subsystem names only if in vz compat mode") in order to provide a way of
running a pcs6 environment along with a vz7 kernel. It turned out this is
not needed, so drop the option altogether.

Signed-off-by: Vladimir Davydov 
---
 include/linux/ve.h  |  4 
 kernel/bc/beancounter.c |  2 --
 kernel/fairsched.c  |  1 -
 kernel/ve/ve.c  | 10 --
 kernel/ve/vecalls.c |  1 -
 5 files changed, 18 deletions(-)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index 813f16d5e825..182a63899a0b 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -153,8 +153,6 @@ extern __u64 ve_setup_iptables_mask(__u64 init_mask);
 #ifdef CONFIG_VE
 #define ve_uevent_seqnum   (get_exec_env()->_uevent_seqnum)
 
-extern int vz_compat;
-
 extern struct kobj_ns_type_operations ve_ns_type_operations;
 extern struct kobject * kobject_create_and_add_ve(const char *name,
struct kobject *parent);
@@ -247,8 +245,6 @@ static inline void ve_mount_nr_dec(void)
 
 #define ve_uevent_seqnum uevent_seqnum
 
-#define vz_compat  (0)
-
 static inline int vz_security_family_check(struct net *net, int family) { 
return 0; }
 static inline int vz_security_protocol_check(struct net *net, int protocol) { 
return 0; }
 
diff --git a/kernel/bc/beancounter.c b/kernel/bc/beancounter.c
index f8a397269152..d35ddb3499d4 100644
--- a/kernel/bc/beancounter.c
+++ b/kernel/bc/beancounter.c
@@ -33,7 +33,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -1179,7 +1178,6 @@ void __init ub_init_late(void)
 int __init ub_init_cgroup(void)
 {
struct cgroup_sb_opts blkio_opts = {
-   .name   = vz_compat ? "beancounter" : NULL,
.subsys_mask= (1ul << blkio_subsys_id),
};
struct cgroup_sb_opts mem_opts = {
diff --git a/kernel/fairsched.c b/kernel/fairsched.c
index 959c19f4d7fc..e015cff87a97 100644
--- a/kernel/fairsched.c
+++ b/kernel/fairsched.c
@@ -796,7 +796,6 @@ int __init fairsched_init(void)
 {
struct vfsmount *cpu_mnt, *cpuset_mnt;
struct cgroup_sb_opts cpu_opts = {
-   .name   = vz_compat ? "fairsched" : NULL,
.subsys_mask=
(1ul << cpu_cgroup_subsys_id) |
(1ul << cpuacct_subsys_id),
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 22df66e1b257..d811d4818fa6 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -87,18 +87,8 @@ DEFINE_MUTEX(ve_list_lock);
 int nr_ve = 1; /* One VE always exists. Compatibility with vestat */
 EXPORT_SYMBOL(nr_ve);
 
-int vz_compat;
-EXPORT_SYMBOL(vz_compat);
-
 static DEFINE_IDR(ve_idr);
 
-static int __init vz_compat_setup(char *arg)
-{
-	get_option(&arg, &vz_compat);
-   return 0;
-}
-early_param("vz_compat", vz_compat_setup);
-
 struct ve_struct *get_ve(struct ve_struct *ve)
 {
if (ve)
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 5aa9722d692d..2b8b27998f07 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -309,7 +309,6 @@ static struct vfsmount *ve_cgroup_mnt, *devices_cgroup_mnt;
 static int __init init_vecalls_cgroups(void)
 {
struct cgroup_sb_opts devices_opts = {
-   .name   = vz_compat ? "container" : NULL,
.subsys_mask=
(1ul << devices_subsys_id),
};


[Devel] [PATCH RHEL7 COMMIT] timers should not get negative argument

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 3788c76811b2b04318c3f4b240f1e83245ad15e5
Author: Vasily Averin 
Date:   Mon Jun 20 20:58:56 2016 +0400

timers should not get negative argument

This patch fixes a 25-second delay on login into systemd-based containers.

A userspace application can set a timer for a time in the past
and expect the timer to expire immediately.

This may not work as expected inside migrated containers.
The translated argument passed to the timer can become negative,
and the corresponding timer will then sleep for a very long time.
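
The clamping can be illustrated with a small userspace sketch. It assumes,
as the diff below suggests, that the container-relative value is shifted by
the container's start offset and that a non-positive result is replaced
with 1 ns. The helper name and the NSEC constant are hypothetical and not
part of the kernel.

```c
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/*
 * Translate a container-relative time to an absolute one by adding the
 * container start offset, normalize the result, and clamp anything that
 * is zero or negative to the smallest positive value (0 s, 1 ns) so that
 * a timer armed with it fires immediately.
 */
static struct timespec sketch_ve_to_abs(struct timespec t,
					struct timespec ve_start)
{
	struct timespec r;

	r.tv_sec = t.tv_sec + ve_start.tv_sec;
	r.tv_nsec = t.tv_nsec + ve_start.tv_nsec;

	/* Bring tv_nsec back into [0, 1e9). */
	while (r.tv_nsec >= SKETCH_NSEC_PER_SEC) {
		r.tv_nsec -= SKETCH_NSEC_PER_SEC;
		r.tv_sec++;
	}
	while (r.tv_nsec < 0) {
		r.tv_nsec += SKETCH_NSEC_PER_SEC;
		r.tv_sec--;
	}

	/* After normalization, r <= 0 iff tv_sec < 0 or both fields are 0. */
	if (r.tv_sec < 0 || (r.tv_sec == 0 && r.tv_nsec == 0)) {
		r.tv_sec = 0;
		r.tv_nsec = 1;
	}
	return r;
}
```

A migrated container can have a negative start offset relative to the new
host's monotonic clock, which is how a small timer value ends up negative
after translation.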

https://jira.sw.ru/browse/PSBM-48475

CC: Vladimir Davydov 
CC: Konstantin Khorenko 
Signed-off-by: Vasily Averin 
Acked-by: Cyrill Gorcunov 
---
 kernel/posix-timers.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index b98cfe429d9b..8ebf01827ee6 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -133,6 +133,8 @@ static struct k_clock posix_clocks[MAX_CLOCKS];
 (which_clock) == CLOCK_MONOTONIC_COARSE)
 
 #ifdef CONFIG_VE
+static struct timespec zero_time;
+
 void monotonic_abs_to_ve(clockid_t which_clock, struct timespec *tp)
 {
struct ve_struct *ve = get_exec_env();
@@ -151,6 +153,10 @@ void monotonic_ve_to_abs(clockid_t which_clock, struct 
timespec *tp)
set_normalized_timespec(tp,
tp->tv_sec + ve->start_timespec.tv_sec,
tp->tv_nsec + ve->start_timespec.tv_nsec);
+	if (timespec_compare(tp, &zero_time) <= 0) {
+   tp->tv_sec =  0;
+   tp->tv_nsec = 1;
+   }
 }
 #endif
 


[Devel] [PATCH rh7 0/6] Support containers in machine.slice

2016-06-20 Thread Vladimir Davydov
The following problems have to be solved if we want to move containers
to machine.slice:

 - CPU stats reporting. Currently, we just open cgroup by name when we
   need stats corresponding to a VE. This is addressed by patch 3.

 - setdevperms ioctl. The same problem as in case 1. Addressed by patch
   3 as well.

 - cgroup path mangling (/proc/self/cgroup, mountinfo). This is fixed by
   patches 5 and 6.

With containers moved to machine.slice fairsched syscalls and
VZCTL_ENV_CREATE ioctl get broken and can't be easily fixed, so we just
drop them (patches 1, 2, 4). This should be fine, because libvzctl
switched to the cgroup interface long ago.

https://jira.sw.ru/browse/PSBM-48629

Vladimir Davydov (6):
  Drop vz_compat boot param
  Drop VZCTL_ENV_CREATE
  Use ve init task's css instead of opening cgroup via vfs
  Drop fairsched syscalls
  cgroup: use cgroup_path_ve helper in cgroup_show_path
  cgroup: fix path mangling for ve cgroups

 arch/powerpc/include/asm/systbl.h |  16 +-
 arch/powerpc/include/uapi/asm/unistd.h|   8 -
 arch/x86/syscalls/syscall_32.tbl  |   9 -
 arch/x86/syscalls/syscall_64.tbl  |   8 -
 configs/kernel-3.10.0-x86_64-debug.config |   1 -
 configs/kernel-3.10.0-x86_64.config   |   1 -
 fs/proc/loadavg.c |   3 +-
 fs/proc/stat.c|   3 +-
 fs/proc/uptime.c  |  15 +-
 include/linux/cgroup.h|   3 +
 include/linux/cpuset.h|   5 -
 include/linux/device_cgroup.h |   6 +-
 include/linux/fairsched.h |  88 
 include/linux/sched.h |  21 -
 include/linux/ve.h|  30 +-
 include/linux/ve_proto.h  |   4 -
 include/uapi/linux/Kbuild |   1 -
 include/uapi/linux/fairsched.h|   8 -
 init/Kconfig  |  20 +-
 kernel/Makefile   |   1 -
 kernel/bc/beancounter.c   |   2 -
 kernel/cgroup.c   |  66 ++-
 kernel/cpuset.c   |  26 -
 kernel/fairsched.c| 829 --
 kernel/sched/core.c   |  69 +--
 kernel/sched/cpuacct.h|   2 +
 kernel/sys_ni.c   |  10 -
 kernel/ve/ve.c| 104 +++-
 kernel/ve/vecalls.c   | 505 +-
 security/device_cgroup.c  |  65 +--
 30 files changed, 191 insertions(+), 1738 deletions(-)
 delete mode 100644 include/linux/fairsched.h
 delete mode 100644 include/uapi/linux/fairsched.h
 delete mode 100644 kernel/fairsched.c

-- 
2.1.4



[Devel] [PATCH RHEL7 COMMIT] Use ve init task's css instead of opening cgroup via vfs

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 083ecd8a5051975639669e3349a17e07d299c299
Author: Vladimir Davydov 
Date:   Mon Jun 20 19:40:13 2016 +0300

Use ve init task's css instead of opening cgroup via vfs

Currently, whenever we need to get cpu or devices cgroup corresponding
to a ve, we open it using cgroup_kernel_open(). This is inflexible,
because it relies on the fact that all container cgroups are located at
a specific location which can never change (at the top level). Since we
want to move container cgroups to machine.slice, we need to rework this.

This patch does the trick. It makes each ve remember its init task at
container start, and use css corresponding to init task whenever we need
to get a corresponding cgroup. Note, that after this patch is applied,
we don't need to mount cpu and devices cgroup in kernel.
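
The lookup scheme described above can be modelled in a few lines of
userspace C. Everything here is a hypothetical simplification for
illustration: the sketch_* types and functions are not the kernel's
ve_struct, task_struct or css API.

```c
#include <stddef.h>

/* Hypothetical subsystem ids; the kernel uses cpu_cgroup_subsys_id etc. */
enum sketch_subsys { SKETCH_CPU, SKETCH_DEVICES, SKETCH_SUBSYS_COUNT };

/* Simplified stand-ins for the css, the task and the ve. */
struct sketch_css { const char *cgroup_path; };
struct sketch_task { struct sketch_css *css[SKETCH_SUBSYS_COUNT]; };
struct sketch_ve { struct sketch_task *init_task; };

/* At container start the ve remembers its init task. */
static void sketch_ve_start(struct sketch_ve *ve, struct sketch_task *init)
{
	ve->init_task = init;
}

/*
 * Find the css for a subsystem through the remembered init task instead of
 * opening a cgroup by name at a fixed location in the hierarchy.
 */
static struct sketch_css *sketch_ve_get_css(const struct sketch_ve *ve,
					    enum sketch_subsys id)
{
	if (!ve->init_task)
		return NULL;	/* container not started yet */
	return ve->init_task->css[id];
}
```

Because the lookup never builds a path, it keeps working no matter where
the container's cgroups are placed, for example under machine.slice.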

https://jira.sw.ru/browse/PSBM-48629

Signed-off-by: Vladimir Davydov 
---
 fs/proc/loadavg.c |  3 +-
 fs/proc/stat.c|  3 +-
 fs/proc/uptime.c  | 15 
 include/linux/device_cgroup.h |  5 ++-
 include/linux/fairsched.h | 23 
 include/linux/ve.h| 18 ++
 kernel/fairsched.c| 61 
 kernel/ve/ve.c| 82 ++-
 kernel/ve/vecalls.c   | 67 ---
 security/device_cgroup.c  | 19 +-
 10 files changed, 126 insertions(+), 170 deletions(-)

diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 4cbdeef1aa71..40d8a90b0f13 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -6,7 +6,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define LOAD_INT(x) ((x) >> FSHIFT)
@@ -20,7 +19,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_loadavg(ve_name(ve), m);
+   ret = ve_show_loadavg(ve, m);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e9991db527e0..7f7e87c855e4 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -98,7 +97,7 @@ static int show_stat(struct seq_file *p, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_stat(ve_name(ve), p);
+   ret = ve_show_cpu_stat(ve, p);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 6fd56831c796..8fa578e8a553 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -5,7 +5,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -25,11 +24,11 @@ static inline void get_ve0_idle(struct timespec *idle)
idle->tv_nsec = rem;
 }
 
-static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp)
+static inline void get_veX_idle(struct ve_struct *ve, struct timespec *idle)
 {
struct kernel_cpustat kstat;
 
-	cpu_cgroup_get_stat(cgrp, &kstat);
+	ve_get_cpu_stat(ve, &kstat);
cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle);
 }
 
@@ -37,14 +36,12 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 {
struct timespec uptime;
struct timespec idle;
+   struct ve_struct *ve = get_exec_env();
 
-	if (ve_is_super(get_exec_env()))
+	if (ve_is_super(ve))
 		get_ve0_idle(&idle);
-	else {
-		rcu_read_lock();
-		get_veX_idle(&idle, task_cgroup(current, cpu_cgroup_subsys_id));
-		rcu_read_unlock();
-	}
+	else
+		get_veX_idle(ve, &idle);
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	monotonic_to_bootbased(&uptime);
diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 64c2da27278c..25ea2270aabe 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -16,10 +16,9 @@ extern int devcgroup_device_permission(umode_t mode, dev_t 
dev, int mask);
 extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
-struct cgroup;
-int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
-int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, 
struct seq_file *m);
+int devcgroup_set_perms_ve(struct ve_struct *, unsigned, dev_t, unsigned);
+int devcgroup_seq_show_ve(struct ve_struct *, struct seq_file *);
 
 #else
 static inline int devcgroup_inode_permission(struct inode *inode, int 

[Devel] [PATCH RHEL7 COMMIT] Drop fairsched syscalls

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 13985cb1990d71a321504c58daa16b50ac9a0ec7
Author: Vladimir Davydov 
Date:   Mon Jun 20 19:40:14 2016 +0300

Drop fairsched syscalls

Everything that can be configured via fairsched syscalls is accessible
via cpu cgroup. Since it's getting difficult to maintain the syscalls
due to the upcoming move of containers to machine.slice, drop them.

Also, drop all functions from sched and cpuset which were used only by
fairsched syscalls.

Note, I make CFS_BANDWIDTH select the CFS_CPULIMIT config option. Otherwise
it would never get selected, because its only user was the VZ_FAIRSCHED
config option, which is dropped by this patch. I think we need to merge
this option with CFS_BANDWIDTH eventually, but let's leave it as is for
now.

Signed-off-by: Vladimir Davydov 
---
 arch/powerpc/include/asm/systbl.h |  16 +-
 arch/powerpc/include/uapi/asm/unistd.h|   8 -
 arch/x86/syscalls/syscall_32.tbl  |   9 -
 arch/x86/syscalls/syscall_64.tbl  |   8 -
 configs/kernel-3.10.0-x86_64-debug.config |   1 -
 configs/kernel-3.10.0-x86_64.config   |   1 -
 include/linux/cpuset.h|   5 -
 include/linux/fairsched.h |  58 ---
 include/linux/sched.h |  20 -
 include/uapi/linux/Kbuild |   1 -
 include/uapi/linux/fairsched.h|   8 -
 init/Kconfig  |  20 +-
 kernel/Makefile   |   1 -
 kernel/cpuset.c   |  26 --
 kernel/fairsched.c| 705 --
 kernel/sched/core.c   |  69 +--
 kernel/sched/cpuacct.h|   2 +
 kernel/sys_ni.c   |  10 -
 18 files changed, 25 insertions(+), 943 deletions(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index ce9d2d7977e5..8a44bbd2bee6 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -374,14 +374,14 @@ SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
-SYSCALL(fairsched_mknod)
-SYSCALL(fairsched_rmnod)
-SYSCALL(fairsched_chwt)
-SYSCALL(fairsched_mvpr)
-SYSCALL(fairsched_rate)
-SYSCALL(fairsched_vcpus)
-SYSCALL(fairsched_cpumask)
-SYSCALL(fairsched_nodemask)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
 SYSCALL(getluid)
 SYSCALL(setluid)
 SYSCALL(setublimit)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index e90207158a12..41fc69c6822b 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -387,14 +387,6 @@
 #define __NR_execveat		362
 #define __NR_switch_endian	363
 
-#define __NR_fairsched_mknod	360
-#define __NR_fairsched_rmnod	361
-#define __NR_fairsched_chwt	362
-#define __NR_fairsched_mvpr	363
-#define __NR_fairsched_rate	364
-#define __NR_fairsched_vcpus	365
-#define __NR_fairsched_cpumask	366
-#define __NR_fairsched_nodemask	367
 #define __NR_getluid		368
 #define __NR_setluid		369
 #define __NR_setublimit		370
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index e60fd32ebba3..f8ed67d66913 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,15 +360,6 @@
 356	i386	memfd_create		sys_memfd_create
 374	i386	userfaultfd		sys_userfaultfd
 
-500	i386	fairsched_mknod		sys_fairsched_mknod
-501	i386	fairsched_rmnod		sys_fairsched_rmnod
-502	i386	fairsched_chwt		sys_fairsched_chwt
-503	i386	fairsched_mvpr		sys_fairsched_mvpr
-504	i386	fairsched_rate		sys_fairsched_rate
-505	i386	fairsched_vcpus		sys_fairsched_vcpus
-506	i386	fairsched_cpumask	sys_fairsched_cpumask
-507	i386	fairsched_nodemask	sys_fairsched_nodemask
-
 510	i386	getluid			sys_getluid
 511	i386	setluid			sys_setluid
 512	i386	setublimit		sys_setublimit			compat_sys_setublimit
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 846183e5a9f0..7f009985158e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -325,18 +325,10 @@
 320 common  kexec_file_load     sys_kexec_file_load
 323 common  userfaultfd         sys_userfaultfd
 
-497 64      fairsched_nodemask  sys_fairsched_nodemask
-498 64      fairsched_cpumask

[Devel] [PATCH RHEL7 COMMIT] cgroup: use cgroup_path_ve helper in cgroup_show_path

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit df0243406fc27e4af78ca6d9111a0bd30fea00a3
Author: Vladimir Davydov 
Date:   Mon Jun 20 21:07:48 2016 +0400

cgroup: use cgroup_path_ve helper in cgroup_show_path

Presently, cgroup_show_path basically duplicates the code used for
mangling the cgroup path shown inside a ve, which is already present in
cgroup_path_ve. Let's reuse it.

Signed-off-by: Vladimir Davydov 
---
 kernel/cgroup.c | 39 +--
 1 file changed, 9 insertions(+), 30 deletions(-)
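
For readers unfamiliar with the escaping step that follows the path lookup, here is a minimal user-space sketch of the kind of octal escaping that mangle_path() applies to the characters in " \t\n\\". The helper escape_path() and its signature are hypothetical and written only for illustration. It is not the kernel implementation, and unlike the call in the patch it writes into a separate output buffer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical user-space helper (NOT the kernel's mangle_path()):
 * copy src into dst, replacing every character listed in esc with a
 * backslash followed by three octal digits.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * dst is too small.
 */
static int escape_path(char *dst, size_t dstlen, const char *src,
		       const char *esc)
{
	size_t out = 0;

	for (; *src; src++) {
		unsigned char c = (unsigned char)*src;

		if (strchr(esc, c)) {
			/* need room for 4 escape bytes plus the final NUL */
			if (out + 4 >= dstlen)
				return -1;
			dst[out++] = '\\';
			dst[out++] = '0' + ((c >> 6) & 3);
			dst[out++] = '0' + ((c >> 3) & 7);
			dst[out++] = '0' + (c & 7);
		} else {
			/* need room for 1 byte plus the final NUL */
			if (out + 1 >= dstlen)
				return -1;
			dst[out++] = (char)c;
		}
	}
	if (out >= dstlen)
		return -1;
	dst[out] = '\0';
	return (int)out;
}
```

With this helper, a space in a path component comes out as the four characters \040, which keeps whitespace-separated output such as a mounts listing parseable.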

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5c012f6e94e5..dd548853e2eb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1373,41 +1373,20 @@ static int cgroup_remount(struct super_block *sb, int 
*flags, char *data)
 }
 
 #ifdef CONFIG_VE
-int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
+static int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
 {
-   char *buf;
+   struct cgroup *cgrp = __d_cgrp(dentry);
+   char *buf, *end;
    size_t size = seq_get_buf(m, &buf);
-   int res = -1, err = 0;
-
-   if (size) {
-   char *p = dentry_path(dentry, buf, size);
-   if (!IS_ERR(p)) {
-   char *end;
-   if (!ve_is_super(get_exec_env())) {
-   while (*++p != '/') {
-   /*
-* Mangle one level when showing
-* cgroup mount source in container
-* e.g.: "/111" -> "/",
-* "/111/test.slice/test.scope" ->
-* "/test.slice/test.scope"
-*/
-   if (*p == '\0') {
-   *--p = '/';
-   break;
-   }
-   }
-   }
-   end = mangle_path(buf, p, " \t\n\\");
-   if (end)
-   res = end - buf;
-   } else {
-   err = PTR_ERR(p);
-   }
+   int res = -1;
+
+   if (size > 0 && cgroup_path_ve(cgrp, buf, size) == 0) {
+   end = mangle_path(buf, buf, " \t\n\\");
+   res = end - buf;
}
seq_commit(m, res);
 
-   return err;
+   return 0;
 }
 #endif
 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7 COMMIT] Drop fairsched syscalls

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 13985cb1990d71a321504c58daa16b50ac9a0ec7
Author: Vladimir Davydov 
Date:   Mon Jun 20 19:40:14 2016 +0300

Drop fairsched syscalls

Everything that can be configured via fairsched syscalls is accessible
via cpu cgroup. Since it's getting difficult to maintain the syscalls
due to the upcoming move of containers to machine.slice, drop them.

Also, drop all functions from sched and cpuset which were used only by
fairsched syscalls.

Note that I make CFS_BANDWIDTH select the CFS_CPULIMIT config option.
Otherwise CFS_CPULIMIT would never get selected, because its only user
was the VZ_FAIRSCHED config option, which this patch drops. I think we
need to merge this option with CFS_BANDWIDTH eventually, but let's
leave it as is for now.

Signed-off-by: Vladimir Davydov 
---
 arch/powerpc/include/asm/systbl.h |  16 +-
 arch/powerpc/include/uapi/asm/unistd.h|   8 -
 arch/x86/syscalls/syscall_32.tbl  |   9 -
 arch/x86/syscalls/syscall_64.tbl  |   8 -
 configs/kernel-3.10.0-x86_64-debug.config |   1 -
 configs/kernel-3.10.0-x86_64.config   |   1 -
 include/linux/cpuset.h|   5 -
 include/linux/fairsched.h |  58 ---
 include/linux/sched.h |  20 -
 include/uapi/linux/Kbuild |   1 -
 include/uapi/linux/fairsched.h|   8 -
 init/Kconfig  |  20 +-
 kernel/Makefile   |   1 -
 kernel/cpuset.c   |  26 --
 kernel/fairsched.c| 705 --
 kernel/sched/core.c   |  69 +--
 kernel/sched/cpuacct.h|   2 +
 kernel/sys_ni.c   |  10 -
 18 files changed, 25 insertions(+), 943 deletions(-)
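
The powerpc systbl.h hunk below replaces the fairsched entries with ni_syscall instead of deleting them. Because that table is positional, this keeps the numbers of the later entries (getluid and friends) unchanged. A minimal user-space sketch of the idea follows; the table, the demo_* names and the three-slot layout are hypothetical and only model the behaviour.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(void);

/* Stand-in for a "not implemented" entry: always fails with -ENOSYS. */
static long ni_syscall(void)
{
	return -ENOSYS;
}

/* Stand-in for a call that stays in the table. */
static long demo_getluid(void)
{
	return 0;
}

/* Hypothetical positional table: the index is the syscall number. */
static const syscall_fn demo_table[] = {
	ni_syscall,	/* slot 0: never assigned */
	ni_syscall,	/* slot 1: formerly a fairsched call, now a stub */
	demo_getluid,	/* slot 2: keeps its number because slot 1 remains */
};

static long demo_dispatch(size_t nr)
{
	if (nr >= sizeof(demo_table) / sizeof(demo_table[0]))
		return -ENOSYS;
	return demo_table[nr]();
}
```

In this model, a caller that still uses a dropped number gets -ENOSYS rather than silently reaching a different call.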

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index ce9d2d7977e5..8a44bbd2bee6 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -374,14 +374,14 @@ SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
-SYSCALL(fairsched_mknod)
-SYSCALL(fairsched_rmnod)
-SYSCALL(fairsched_chwt)
-SYSCALL(fairsched_mvpr)
-SYSCALL(fairsched_rate)
-SYSCALL(fairsched_vcpus)
-SYSCALL(fairsched_cpumask)
-SYSCALL(fairsched_nodemask)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
 SYSCALL(getluid)
 SYSCALL(setluid)
 SYSCALL(setublimit)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index e90207158a12..41fc69c6822b 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -387,14 +387,6 @@
 #define __NR_execveat  362
 #define __NR_switch_endian 363
 
-#define __NR_fairsched_mknod    360
-#define __NR_fairsched_rmnod    361
-#define __NR_fairsched_chwt     362
-#define __NR_fairsched_mvpr     363
-#define __NR_fairsched_rate     364
-#define __NR_fairsched_vcpus    365
-#define __NR_fairsched_cpumask  366
-#define __NR_fairsched_nodemask 367
 #define __NR_getluid            368
 #define __NR_setluid            369
 #define __NR_setublimit         370
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index e60fd32ebba3..f8ed67d66913 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,15 +360,6 @@
 356 i386    memfd_create        sys_memfd_create
 374 i386    userfaultfd         sys_userfaultfd
 
-500 i386    fairsched_mknod     sys_fairsched_mknod
-501 i386    fairsched_rmnod     sys_fairsched_rmnod
-502 i386    fairsched_chwt      sys_fairsched_chwt
-503 i386    fairsched_mvpr      sys_fairsched_mvpr
-504 i386    fairsched_rate      sys_fairsched_rate
-505 i386    fairsched_vcpus     sys_fairsched_vcpus
-506 i386    fairsched_cpumask   sys_fairsched_cpumask
-507 i386    fairsched_nodemask  sys_fairsched_nodemask
-
 510 i386    getluid             sys_getluid
 511 i386    setluid             sys_setluid
 512 i386    setublimit          sys_setublimit          compat_sys_setublimit
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 846183e5a9f0..7f009985158e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -325,18 +325,10 @@
 320 common  kexec_file_load     sys_kexec_file_load
 323 common  userfaultfd         sys_userfaultfd
 
-497 64      fairsched_nodemask  sys_fairsched_nodemask
-498 64      fairsched_cpumask

[Devel] [PATCH RHEL7 COMMIT] Drop VZCTL_ENV_CREATE

2016-06-20 Thread Vladimir Davydov
The commit is pushed to "branch-rh7-3.10.0-327.18.2.vz7.14.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-327.18.2.vz7.14.16
-->
commit 8d46dca70d92147cf928633f279b9c36deb234c2
Author: Vladimir Davydov 
Date:   Mon Jun 20 19:40:12 2016 +0300

Drop VZCTL_ENV_CREATE

VZCTL_ENV_CREATE is getting too difficult to support. Since we've been
using the cgroup interface to create VEs for quite a while, let's drop it.

Signed-off-by: Vladimir Davydov 
---
 include/linux/device_cgroup.h |   1 -
 include/linux/fairsched.h |   7 -
 include/linux/sched.h |   1 -
 include/linux/ve.h|   8 -
 include/linux/ve_proto.h  |   4 -
 kernel/fairsched.c|  64 +--
 kernel/ve/ve.c|   8 +-
 kernel/ve/vecalls.c   | 437 +-
 security/device_cgroup.c  |  46 -
 9 files changed, 5 insertions(+), 571 deletions(-)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 32588bb8fb4e..64c2da27278c 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -17,7 +17,6 @@ extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
 struct cgroup;
-int devcgroup_default_perms_ve(struct cgroup *cgroup);
 int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
 int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, 
struct seq_file *m);
diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h
index f3dede236945..b73f51eadabc 100644
--- a/include/linux/fairsched.h
+++ b/include/linux/fairsched.h
@@ -51,10 +51,6 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, 
unsigned int len,
 asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len,
   unsigned long __user *user_mask_ptr);
 
-int fairsched_new_node(int id, unsigned int vcpus);
-int fairsched_move_task(int id, struct task_struct *tsk);
-void fairsched_drop_node(int id, int leave);
-
 int fairsched_get_cpu_stat(const char *name, struct kernel_cpustat *kstat);
 
 int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun);
@@ -71,9 +67,6 @@ int fairsched_show_loadavg(const char *name, struct seq_file 
*p);
 
 #else /* CONFIG_VZ_FAIRSCHED */
 
-static inline int fairsched_new_node(int id, unsigned int vcpus) { return 0; }
-static inline int fairsched_move_task(int id, struct task_struct *tsk) { 
return 0; }
-static inline void fairsched_drop_node(int id, int leave) { }
 static inline int fairsched_show_stat(const char *name, struct seq_file *p) { 
return -ENOSYS; }
 static inline int fairsched_show_loadavg(const char *name, struct seq_file *p) 
{ return -ENOSYS; }
 static inline int fairsched_get_cpu_avenrun(const char *name, unsigned long 
*avenrun) { return -ENOSYS; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21775a21f8ab..84a9888b2483 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,7 +1241,6 @@ struct task_struct {
unsigned in_execve:1;   /* Tell the LSMs that the process is doing an
 * execve */
unsigned in_iowait:1;
-   unsigned did_ve_enter:1;
unsigned no_new_privs:1; /* task may not gain privileges */
unsigned may_throttle:1;
 
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 182a63899a0b..459c8bc581d9 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -41,13 +41,10 @@ struct ve_struct {
    struct list_head ve_list;
 
    envid_t veid;
-   bool legacy; /* created using the legacy API
-                   (vzctl ioctl - see do_env_create) */
 
    unsigned int class_id;
    struct rw_semaphore op_sem;
    int is_running;
-   int is_locked;
    int is_pseudosuper;
    atomic_t suspend;
    /* see vzcalluser.h for VE_FEATURE_XXX definitions */
@@ -146,10 +143,6 @@ extern struct cgroup_subsys ve_subsys;
 
 extern unsigned int sysctl_ve_mount_nr;
 
-#ifdef CONFIG_VE_IPTABLES
-extern __u64 ve_setup_iptables_mask(__u64 init_mask);
-#endif
-
 #ifdef CONFIG_VE
 #define ve_uevent_seqnum   (get_exec_env()->_uevent_seqnum)
 
@@ -209,7 +202,6 @@ extern void monotonic_ve_to_abs(clockid_t which_clock, 
struct timespec *tp);
 
 void ve_stop_ns(struct pid_namespace *ns);
 void ve_exit_ns(struct pid_namespace *ns);
-int ve_start_container(struct ve_struct *ve);
 
 extern bool current_user_ns_initial(void);
 struct user_namespace *ve_init_user_ns(void);
diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
index 61d80190d0f1..8cc7fe3ba2a3 100644
--- a/include/linux/ve_proto.h
+++ b/include/linux/ve_proto.h
@@ -53,10 

[Devel] [PATCH rh7 4/6] Drop fairsched syscalls

2016-06-20 Thread Vladimir Davydov
Everything that can be configured via fairsched syscalls is accessible
via cpu cgroup. Since it's getting difficult to maintain the syscalls
due to the upcoming move of containers to machine.slice, drop them.

Also, drop all functions from sched and cpuset which were used only by
fairsched syscalls.

Note that I make CFS_BANDWIDTH select the CFS_CPULIMIT config option.
Otherwise CFS_CPULIMIT would never get selected, because its only user
was the VZ_FAIRSCHED config option, which this patch drops. I think we
need to merge this option with CFS_BANDWIDTH eventually, but let's
leave it as is for now.

Signed-off-by: Vladimir Davydov 
---
 arch/powerpc/include/asm/systbl.h |  16 +-
 arch/powerpc/include/uapi/asm/unistd.h|   8 -
 arch/x86/syscalls/syscall_32.tbl  |   9 -
 arch/x86/syscalls/syscall_64.tbl  |   8 -
 configs/kernel-3.10.0-x86_64-debug.config |   1 -
 configs/kernel-3.10.0-x86_64.config   |   1 -
 include/linux/cpuset.h|   5 -
 include/linux/fairsched.h |  58 ---
 include/linux/sched.h |  20 -
 include/uapi/linux/Kbuild |   1 -
 include/uapi/linux/fairsched.h|   8 -
 init/Kconfig  |  20 +-
 kernel/Makefile   |   1 -
 kernel/cpuset.c   |  26 --
 kernel/fairsched.c| 705 --
 kernel/sched/core.c   |  69 +--
 kernel/sched/cpuacct.h|   2 +
 kernel/sys_ni.c   |  10 -
 18 files changed, 25 insertions(+), 943 deletions(-)
 delete mode 100644 include/linux/fairsched.h
 delete mode 100644 include/uapi/linux/fairsched.h
 delete mode 100644 kernel/fairsched.c

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index ce9d2d7977e5..8a44bbd2bee6 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -374,14 +374,14 @@ SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
 SYSCALL(ni_syscall)
-SYSCALL(fairsched_mknod)
-SYSCALL(fairsched_rmnod)
-SYSCALL(fairsched_chwt)
-SYSCALL(fairsched_mvpr)
-SYSCALL(fairsched_rate)
-SYSCALL(fairsched_vcpus)
-SYSCALL(fairsched_cpumask)
-SYSCALL(fairsched_nodemask)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
+SYSCALL(ni_syscall)
 SYSCALL(getluid)
 SYSCALL(setluid)
 SYSCALL(setublimit)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index e90207158a12..41fc69c6822b 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -387,14 +387,6 @@
 #define __NR_execveat  362
 #define __NR_switch_endian 363
 
-#define __NR_fairsched_mknod    360
-#define __NR_fairsched_rmnod    361
-#define __NR_fairsched_chwt     362
-#define __NR_fairsched_mvpr     363
-#define __NR_fairsched_rate     364
-#define __NR_fairsched_vcpus    365
-#define __NR_fairsched_cpumask  366
-#define __NR_fairsched_nodemask 367
 #define __NR_getluid            368
 #define __NR_setluid            369
 #define __NR_setublimit         370
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index e60fd32ebba3..f8ed67d66913 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,15 +360,6 @@
 356 i386    memfd_create        sys_memfd_create
 374 i386    userfaultfd         sys_userfaultfd
 
-500 i386    fairsched_mknod     sys_fairsched_mknod
-501 i386    fairsched_rmnod     sys_fairsched_rmnod
-502 i386    fairsched_chwt      sys_fairsched_chwt
-503 i386    fairsched_mvpr      sys_fairsched_mvpr
-504 i386    fairsched_rate      sys_fairsched_rate
-505 i386    fairsched_vcpus     sys_fairsched_vcpus
-506 i386    fairsched_cpumask   sys_fairsched_cpumask
-507 i386    fairsched_nodemask  sys_fairsched_nodemask
-
 510 i386    getluid             sys_getluid
 511 i386    setluid             sys_setluid
 512 i386    setublimit          sys_setublimit          compat_sys_setublimit
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 846183e5a9f0..7f009985158e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -325,18 +325,10 @@
 320 common  kexec_file_load     sys_kexec_file_load
 323 common  userfaultfd         sys_userfaultfd
 
-497 64      fairsched_nodemask  sys_fairsched_nodemask
-498 64      fairsched_cpumask   sys_fairsched_cpumask
-499 64      fairsched_vcpus     sys_fairsched_vcpus
 500 64      getluid             sys_getluid
 501 64      setluid             sys_setluid
 502 64      setublimit          sys_setublimit
 503 64      ubstat

[Devel] [PATCH rh7 6/6] cgroup: fix path mangling for ve cgroups

2016-06-20 Thread Vladimir Davydov
Presently, we just cut the first component off the cgroup path when
inside a VE, because all VE cgroups are located at the top level of the
cgroup hierarchy. However, this is going to change: the cgroups are
going to move to machine.slice, so we need a more generic way of
mangling cgroup paths.

This patch does the trick. On VE start it marks every cgroup that the
init task of the VE resides in with a special flag (CGRP_VE_ROOT).
Cgroups marked this way are treated as root when looked at from inside
a VE. As long as we don't have nested VEs, this should work fine.

Note that we don't need to clear these flags on VE destruction, because
vzctl always creates new cgroups on VE start.

https://jira.sw.ru/browse/PSBM-48629

Signed-off-by: Vladimir Davydov 
---
 include/linux/cgroup.h |  3 +++
 kernel/cgroup.c| 27 ---
 kernel/ve/ve.c |  4 
 3 files changed, 27 insertions(+), 7 deletions(-)
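
The flag-based mangling described above can be modelled in user space roughly as follows. This is only a sketch: struct demo_cgroup, its ve_root field (standing in for the CGRP_VE_ROOT bit) and demo_cgroup_path() are hypothetical names, and the real __cgroup_path() differs in locking and naming details.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical stand-in for struct cgroup. */
struct demo_cgroup {
	const char *name;			/* path component */
	const struct demo_cgroup *parent;	/* NULL for the real root */
	bool ve_root;				/* models CGRP_VE_ROOT */
};

/*
 * Build the path of cgrp into buf, right to left. When virt is true,
 * stop at the first cgroup on the way up that is flagged as a VE root
 * and treat it as "/". Returns 0 on success, -1 if buf is too small.
 */
static int demo_cgroup_path(const struct demo_cgroup *cgrp, char *buf,
			    size_t buflen, bool virt)
{
	char *start;

	if (buflen < 2)
		return -1;
	start = buf + buflen - 1;
	*start = '\0';

	while (cgrp->parent && !(virt && cgrp->ve_root)) {
		size_t len = strlen(cgrp->name);

		/* need room for the component plus its leading '/' */
		if ((size_t)(start - buf) < len + 1)
			return -1;
		start -= len;
		memcpy(start, cgrp->name, len);
		*--start = '/';
		cgrp = cgrp->parent;
	}
	/* the (real or virtual) root itself is shown as "/" */
	if (*start != '/') {
		if (start == buf)
			return -1;
		*--start = '/';
	}
	memmove(buf, start, strlen(start) + 1);
	return 0;
}
```

In this model, a cgroup at /machine.slice/101/test.slice with the flag set on 101 is reported as /test.slice from inside the VE, regardless of how deep 101 sits in the host hierarchy.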

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index aad06e8e0258..730ca9091bfb 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -175,6 +175,9 @@ enum {
CGRP_CPUSET_CLONE_CHILDREN,
/* see the comment above CGRP_ROOT_SANE_BEHAVIOR for details */
CGRP_SANE_BEHAVIOR,
+
+   /* The cgroup is root in a VE */
+   CGRP_VE_ROOT,
 };
 
 struct cgroup_name {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dd548853e2eb..581924e7af9e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1791,6 +1791,21 @@ static struct file_system_type cgroup_fs_type = {
 
 static struct kobject *cgroup_kobj;
 
+#ifdef CONFIG_VE
+void cgroup_mark_ve_root(struct ve_struct *ve)
+{
+   struct cgroup *cgrp;
+   struct cgroupfs_root *root;
+
+   mutex_lock(&cgroup_mutex);
+   for_each_active_root(root) {
+       cgrp = task_cgroup_from_root(ve->init_task, root);
+       set_bit(CGRP_VE_ROOT, &cgrp->flags);
+   }
+   mutex_unlock(&cgroup_mutex);
+}
+#endif
+
 /**
  * cgroup_path - generate the path of a cgroup
  * @cgrp: the cgroup in question
@@ -1804,7 +1819,8 @@ static struct kobject *cgroup_kobj;
  * inode's i_mutex, while on the other hand cgroup_path() can be called
  * with some irq-safe spinlocks held.
  */
-int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen, bool virt)
+static int __cgroup_path(const struct cgroup *cgrp, char *buf, int buflen,
+bool virt)
 {
int ret = -ENAMETOOLONG;
char *start;
@@ -1824,14 +1840,11 @@ int __cgroup_path(const struct cgroup *cgrp, char *buf, 
int buflen, bool virt)
int len;
 
 #ifdef CONFIG_VE
-   if (virt && cgrp->parent && !cgrp->parent->parent) {
+   if (virt && test_bit(CGRP_VE_ROOT, &cgrp->flags)) {
/*
 * Containers cgroups are bind-mounted from node
 * so they are like '/' from inside, thus we have
-* to mangle cgroup path output. Effectively it is
-* enough to remove two topmost cgroups from path.
-* e.g. in ct 101: /101/test.slice/test.scope ->
-* /test.slice/test.scope
+* to mangle cgroup path output.
 */
if (*start != '/') {
if (--start < buf)
@@ -2391,7 +2404,7 @@ static ssize_t cgroup_file_write(struct file *file, const 
char __user *buf,
 * inside a container FS.
 */
if (!ve_is_super(get_exec_env())
-   && (!cgrp->parent || !cgrp->parent->parent)
+   && test_bit(CGRP_VE_ROOT, &cgrp->flags)
&& !get_exec_env()->is_pseudosuper
&& !(cft->flags & CFTYPE_VE_WRITABLE))
return -EPERM;
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 08a15fc02e21..e65130f18bb4 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -454,6 +454,8 @@ static void ve_drop_context(struct ve_struct *ve)
 
 static const struct timespec zero_time = { };
 
+extern void cgroup_mark_ve_root(struct ve_struct *ve);
+
 /* under ve->op_sem write-lock */
 static int ve_start_container(struct ve_struct *ve)
 {
@@ -501,6 +503,8 @@ static int ve_start_container(struct ve_struct *ve)
if (err < 0)
goto err_iterate;
 
+   cgroup_mark_ve_root(ve);
+
ve->is_running = 1;
 
printk(KERN_INFO "CT: %s: started\n", ve_name(ve));
-- 
2.1.4


[Devel] [PATCH rh7 5/6] cgroup: use cgroup_path_ve helper in cgroup_show_path

2016-06-20 Thread Vladimir Davydov
Presently, cgroup_show_path basically duplicates the code used for
mangling the cgroup path shown inside a ve, which is already present in
cgroup_path_ve. Let's reuse it.

Signed-off-by: Vladimir Davydov 
---
 kernel/cgroup.c | 39 +--
 1 file changed, 9 insertions(+), 30 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5c012f6e94e5..dd548853e2eb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1373,41 +1373,20 @@ static int cgroup_remount(struct super_block *sb, int 
*flags, char *data)
 }
 
 #ifdef CONFIG_VE
-int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
+static int cgroup_show_path(struct seq_file *m, struct dentry *dentry)
 {
-   char *buf;
+   struct cgroup *cgrp = __d_cgrp(dentry);
+   char *buf, *end;
    size_t size = seq_get_buf(m, &buf);
-   int res = -1, err = 0;
-
-   if (size) {
-   char *p = dentry_path(dentry, buf, size);
-   if (!IS_ERR(p)) {
-   char *end;
-   if (!ve_is_super(get_exec_env())) {
-   while (*++p != '/') {
-   /*
-* Mangle one level when showing
-* cgroup mount source in container
-* e.g.: "/111" -> "/",
-* "/111/test.slice/test.scope" ->
-* "/test.slice/test.scope"
-*/
-   if (*p == '\0') {
-   *--p = '/';
-   break;
-   }
-   }
-   }
-   end = mangle_path(buf, p, " \t\n\\");
-   if (end)
-   res = end - buf;
-   } else {
-   err = PTR_ERR(p);
-   }
+   int res = -1;
+
+   if (size > 0 && cgroup_path_ve(cgrp, buf, size) == 0) {
+   end = mangle_path(buf, buf, " \t\n\\");
+   res = end - buf;
}
seq_commit(m, res);
 
-   return err;
+   return 0;
 }
 #endif
 
-- 
2.1.4


[Devel] [PATCH rh7 2/6] Drop VZCTL_ENV_CREATE

2016-06-20 Thread Vladimir Davydov
VZCTL_ENV_CREATE is getting too difficult to support. Since we've been
using the cgroup interface to create VEs for quite a while, let's drop it.

Signed-off-by: Vladimir Davydov 
---
 include/linux/device_cgroup.h |   1 -
 include/linux/fairsched.h |   7 -
 include/linux/sched.h |   1 -
 include/linux/ve.h|   8 -
 include/linux/ve_proto.h  |   4 -
 kernel/fairsched.c|  64 +--
 kernel/ve/ve.c|   8 +-
 kernel/ve/vecalls.c   | 437 +-
 security/device_cgroup.c  |  46 -
 9 files changed, 5 insertions(+), 571 deletions(-)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 32588bb8fb4e..64c2da27278c 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -17,7 +17,6 @@ extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
 struct cgroup;
-int devcgroup_default_perms_ve(struct cgroup *cgroup);
 int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
 int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, 
struct seq_file *m);
diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h
index e242c0d4c065..615e88928e25 100644
--- a/include/linux/fairsched.h
+++ b/include/linux/fairsched.h
@@ -51,10 +51,6 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, 
unsigned int len,
 asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len,
   unsigned long __user *user_mask_ptr);
 
-int fairsched_new_node(int id, unsigned int vcpus);
-int fairsched_move_task(int id, struct task_struct *tsk);
-void fairsched_drop_node(int id, int leave);
-
 int fairsched_get_cpu_stat(const char *name, struct kernel_cpustat *kstat);
 
 int cpu_cgroup_get_avenrun(struct cgroup *cgrp, unsigned long *avenrun);
@@ -71,9 +67,6 @@ int fairsched_show_loadavg(const char *name, struct seq_file 
*p);
 
 #else /* CONFIG_VZ_FAIRSCHED */
 
-static inline int fairsched_new_node(int id, unsigned int vcpus) { return 0; }
-static inline int fairsched_move_task(int id, struct task_struct *tsk) { 
return 0; }
-static inline void fairsched_drop_node(int id, int leave) { }
 static inline int fairsched_show_stat(const char *name, struct seq_file *p) { 
return -ENOSYS; }
 static inline int fairsched_show_loadavg(const char *name, struct seq_file *p) 
{ return -ENOSYS; }
 static inline int fairsched_get_cpu_avenrun(const char *name, unsigned long 
*avenrun) { return -ENOSYS; }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21775a21f8ab..84a9888b2483 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1241,7 +1241,6 @@ struct task_struct {
unsigned in_execve:1;   /* Tell the LSMs that the process is doing an
 * execve */
unsigned in_iowait:1;
-   unsigned did_ve_enter:1;
unsigned no_new_privs:1; /* task may not gain privileges */
unsigned may_throttle:1;
 
diff --git a/include/linux/ve.h b/include/linux/ve.h
index a40e219c8bce..878ca284a6ba 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -43,13 +43,10 @@ struct ve_struct {
    struct list_head ve_list;
 
    envid_t veid;
-   bool legacy; /* created using the legacy API
-                   (vzctl ioctl - see do_env_create) */
 
    unsigned int class_id;
    struct rw_semaphore op_sem;
    int is_running;
-   int is_locked;
    int is_pseudosuper;
    atomic_t suspend;
    /* see vzcalluser.h for VE_FEATURE_XXX definitions */
@@ -148,10 +145,6 @@ extern struct cgroup_subsys ve_subsys;
 
 extern unsigned int sysctl_ve_mount_nr;
 
-#ifdef CONFIG_VE_IPTABLES
-extern __u64 ve_setup_iptables_mask(__u64 init_mask);
-#endif
-
 #ifdef CONFIG_VE
 #define ve_uevent_seqnum   (get_exec_env()->_uevent_seqnum)
 
@@ -211,7 +204,6 @@ extern void monotonic_ve_to_abs(clockid_t which_clock, 
struct timespec *tp);
 
 void ve_stop_ns(struct pid_namespace *ns);
 void ve_exit_ns(struct pid_namespace *ns);
-int ve_start_container(struct ve_struct *ve);
 
 extern bool current_user_ns_initial(void);
 struct user_namespace *ve_init_user_ns(void);
diff --git a/include/linux/ve_proto.h b/include/linux/ve_proto.h
index 153f18bd19b1..5787afe275ce 100644
--- a/include/linux/ve_proto.h
+++ b/include/linux/ve_proto.h
@@ -55,10 +55,6 @@ extern struct ve_struct *get_ve_by_id(envid_t);
 extern struct cgroup *ve_cgroup_open(struct cgroup *root, int flags, envid_t 
veid);
 extern int ve_cgroup_remove(struct cgroup *root, envid_t veid);
 
-struct env_create_param3;
-extern int real_env_create(envid_t veid, unsigned flags, u32 class_id,
-  struct env_create_param3 

[Devel] [PATCH rh7 3/6] Use ve init task's css instead of opening cgroup via vfs

2016-06-20 Thread Vladimir Davydov
Currently, whenever we need to get the cpu or devices cgroup
corresponding to a ve, we open it using cgroup_kernel_open(). This is
inflexible, because it relies on all container cgroups being located at
a fixed place that can never change (the top level). Since we want to
move container cgroups to machine.slice, we need to rework this.

This patch does the trick. It makes each ve remember its init task at
container start and use the css corresponding to the init task whenever
we need to get the corresponding cgroup. Note that after this patch is
applied, we no longer need to mount the cpu and devices cgroups in the
kernel.

https://jira.sw.ru/browse/PSBM-48629

Signed-off-by: Vladimir Davydov 
---
 fs/proc/loadavg.c |  3 +-
 fs/proc/stat.c|  3 +-
 fs/proc/uptime.c  | 15 
 include/linux/device_cgroup.h |  5 ++-
 include/linux/fairsched.h | 23 
 include/linux/ve.h| 18 ++
 kernel/fairsched.c| 61 
 kernel/ve/ve.c| 82 ++-
 kernel/ve/vecalls.c   | 67 ---
 security/device_cgroup.c  | 19 +-
 10 files changed, 126 insertions(+), 170 deletions(-)
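
The approach described above, looking a cgroup up through the remembered init task rather than through a fixed path, can be sketched in user space like this. All demo_* types and functions are hypothetical simplifications; the kernel code works with task_struct, css pointers and proper locking, none of which is modelled here.

```c
#include <stddef.h>

enum demo_subsys { DEMO_CPU, DEMO_DEVICES, DEMO_SUBSYS_COUNT };

/* Hypothetical stand-in for a cgroup subsystem state. */
struct demo_css {
	const char *cgroup_path;
};

struct demo_task {
	/* one css per subsystem, wherever the task's cgroups live */
	struct demo_css *css[DEMO_SUBSYS_COUNT];
};

struct demo_ve {
	struct demo_task *init_task;	/* remembered at container start */
};

/* Called once at container start. */
static void demo_ve_start(struct demo_ve *ve, struct demo_task *init)
{
	ve->init_task = init;
}

/* Find the VE's css without knowing where its cgroup is located. */
static struct demo_css *demo_ve_css(const struct demo_ve *ve,
				    enum demo_subsys id)
{
	if (!ve->init_task)
		return NULL;	/* container not started */
	return ve->init_task->css[id];
}
```

The point of the design is that nothing in the lookup depends on the cgroup's position in the hierarchy, so moving containers under machine.slice does not affect it.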

diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 4cbdeef1aa71..40d8a90b0f13 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -6,7 +6,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #define LOAD_INT(x) ((x) >> FSHIFT)
@@ -20,7 +19,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_loadavg(ve_name(ve), m);
+   ret = ve_show_loadavg(ve, m);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e9991db527e0..7f7e87c855e4 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -98,7 +97,7 @@ static int show_stat(struct seq_file *p, void *v)
ve = get_exec_env();
if (!ve_is_super(ve)) {
int ret;
-   ret = fairsched_show_stat(ve_name(ve), p);
+   ret = ve_show_cpu_stat(ve, p);
if (ret != -ENOSYS)
return ret;
}
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 6fd56831c796..8fa578e8a553 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -5,7 +5,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -25,11 +24,11 @@ static inline void get_ve0_idle(struct timespec *idle)
idle->tv_nsec = rem;
 }
 
-static inline void get_veX_idle(struct timespec *idle, struct cgroup* cgrp)
+static inline void get_veX_idle(struct ve_struct *ve, struct timespec *idle)
 {
struct kernel_cpustat kstat;
 
-   cpu_cgroup_get_stat(cgrp, );
+   ve_get_cpu_stat(ve, );
cputime_to_timespec(kstat.cpustat[CPUTIME_IDLE], idle);
 }
 
@@ -37,14 +36,12 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 {
struct timespec uptime;
struct timespec idle;
+   struct ve_struct *ve = get_exec_env();
 
-   if (ve_is_super(get_exec_env()))
+   if (ve_is_super(ve))
get_ve0_idle();
-   else {
-   rcu_read_lock();
-   get_veX_idle(, task_cgroup(current, cpu_cgroup_subsys_id));
-   rcu_read_unlock();
-   }
+   else
+   get_veX_idle(ve, );
 
do_posix_clock_monotonic_gettime();
monotonic_to_bootbased();
diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 64c2da27278c..25ea2270aabe 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -16,10 +16,9 @@ extern int devcgroup_device_permission(umode_t mode, dev_t dev, int mask);
 extern int devcgroup_device_visible(umode_t mode, int major,
int start_minor, int nr_minors);
 
-struct cgroup;
-int devcgroup_set_perms_ve(struct cgroup *cgroup, unsigned, dev_t, unsigned);
 struct ve_struct;
-int devcgroup_seq_show_ve(struct cgroup *devices_root, struct ve_struct *ve, struct seq_file *m);
+int devcgroup_set_perms_ve(struct ve_struct *, unsigned, dev_t, unsigned);
+int devcgroup_seq_show_ve(struct ve_struct *, struct seq_file *);
 
 #else
 static inline int devcgroup_inode_permission(struct inode *inode, int mask)
diff --git a/include/linux/fairsched.h b/include/linux/fairsched.h
index 615e88928e25..b779d2e85b12 100644
--- a/include/linux/fairsched.h
+++ b/include/linux/fairsched.h
@@ -51,31 +51,8 @@ asmlinkage long sys_fairsched_cpumask(unsigned int id, unsigned int len,
 asmlinkage long sys_fairsched_nodemask(unsigned int id, unsigned int len,
   unsigned long __user *user_mask_ptr);
 
-int 

Re: [Devel] [PATCH rh7] mm: memcontrol: fix race between kmem uncharge and charge reparenting

2016-06-20 Thread Kirill Tkhai


On 17.06.2016 13:35, Vladimir Davydov wrote:
> When a cgroup is destroyed, all user memory pages get recharged to the
> parent cgroup. Recharging is done by mem_cgroup_reparent_charges, which
> keeps looping until res <= kmem. This is supposed to guarantee that by
> the time the cgroup gets released, no pages are charged to it. However,
> the guarantee might be violated if mem_cgroup_reparent_charges races
> with a kmem charge or uncharge.
> 
> Currently, kmem is charged before res and uncharged after. As a result,
> kmem might become greater than res for a short period of time even if
> there are still user memory pages charged to the cgroup. In this case
> mem_cgroup_reparent_charges will give up prematurely, and the cgroup
> might be released even though pages are still charged to it. Uncharging
> such a page will trigger a kernel panic:
> 
>   general protection fault:  [#1] SMP
>   CPU: 0 PID: 972445 Comm: httpd ve: 0 Tainted: G   OE    
>  3.10.0-427.10.1.lve1.4.9.el7.x86_64 #1 12.14
>   task: 88065d53d8d0 ti: 880224f34000 task.ti: 880224f34000
>   RIP: 0010:[]  [] 
> mem_cgroup_charge_statistics.isra.16+0x13/0x60
>   RSP: 0018:880224f37a80  EFLAGS: 00010202
>   RAX:  RBX: 8807b26f0110 RCX: 
>   RDX: 79726f6765746163 RSI: ea000c9c0440 RDI: 8806a55662f8
>   RBP: 880224f37a80 R08:  R09: 03808000
>   R10: 00b8 R11: ea001eaa8980 R12: ea000c9c0440
>   R13: 0001 R14:  R15: 8806a5566000
>   FS:  () GS:8807d400() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: 7f54289bd74c CR3: 0006638b1000 CR4: 06f0
>   DR0:  DR1:  DR2: 
>   DR3:  DR6: 0ff0 DR7: 0400
>   Stack:
>880224f37ac0 811e9ddf 88060001 ea000c9c0440
>0001 037d1000 880224f37c78 0380
>880224f37ad0 811ee99a 880224f37b08 811b9ec9
>   Call Trace:
>[] __mem_cgroup_uncharge_common+0xcf/0x320
>[] mem_cgroup_uncharge_page+0x2a/0x30
>[] page_remove_rmap+0xb9/0x160
>[] ? res_counter_uncharge+0x13/0x20
>[] unmap_page_range+0x460/0x870
>[] unmap_single_vma+0x81/0xf0
>[] unmap_vmas+0x49/0x90
>[] exit_mmap+0xac/0x1a0
>[] mmput+0x6b/0x140
>[] flush_old_exec+0x467/0x8d0
>[] load_elf_binary+0x33c/0xde0
>[] ? get_user_pages+0x52/0x60
>[] ? load_elf_library+0x220/0x220
>[] search_binary_handler+0xd5/0x300
>[] do_execve_common.isra.26+0x657/0x720
>[] SyS_execve+0x29/0x30
>[] stub_execve+0x69/0xa0
> 
> To prevent this from happening, let's always charge kmem after res and
> uncharge before res.
> 
> https://bugs.openvz.org/browse/OVZ-6756
> 
> Reported-by: Anatoly Stepanov 
> Signed-off-by: Vladimir Davydov 

Reviewed-by: Kirill Tkhai 
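
The ordering argument in the quoted description can be checked with a
small userspace model. This is only a sketch under assumptions: res is
modeled as user plus kernel memory, kmem as kernel memory only, and the
reparenting loop is reduced to its exit test (res <= kmem). The struct
and function names are hypothetical and do not come from mm/memcontrol.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: res counts user + kernel memory, kmem only kernel. */
struct counters {
	uint64_t res;
	uint64_t kmem;
};

/* Which counter a two-step charge or uncharge touches first. */
enum first { FIRST_KMEM, FIRST_RES };

/* Exit test of the reparenting loop as described above: res <= kmem. */
static bool reparent_would_stop(const struct counters *c)
{
	return c->res <= c->kmem;
}

/*
 * Charge kernel memory in two steps and report whether a reparenting
 * loop that observed the intermediate state would have stopped.
 */
static bool charge_midpoint_stops(struct counters *c, uint64_t size,
				  enum first f)
{
	bool stops;

	if (f == FIRST_KMEM) {
		c->kmem += size;
		stops = reparent_would_stop(c);
		c->res += size;
	} else {
		c->res += size;
		stops = reparent_would_stop(c);
		c->kmem += size;
	}
	return stops;
}

/* Same as above, for uncharging kernel memory. */
static bool uncharge_midpoint_stops(struct counters *c, uint64_t size,
				    enum first f)
{
	bool stops;

	if (f == FIRST_RES) {
		c->res -= size;
		stops = reparent_would_stop(c);
		c->kmem -= size;
	} else {
		c->kmem -= size;
		stops = reparent_would_stop(c);
		c->res -= size;
	}
	return stops;
}

/* Usage check: one user page stays charged throughout (res starts at 1). */
static void self_check(void)
{
	struct counters old = { 1, 0 };
	struct counters fixed = { 1, 0 };

	/* Old order: charge kmem first, uncharge res first. */
	assert(charge_midpoint_stops(&old, 1, FIRST_KMEM));
	assert(uncharge_midpoint_stops(&old, 1, FIRST_RES));

	/* New order: charge res first, uncharge kmem first. */
	assert(!charge_midpoint_stops(&fixed, 1, FIRST_RES));
	assert(!uncharge_midpoint_stops(&fixed, 1, FIRST_KMEM));
	assert(fixed.res == 1 && fixed.kmem == 0);
}
```

With the old order the intermediate state satisfies res <= kmem while a
user page is still charged; with the new order res - kmem never drops
below the user charge in this model.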

> ---
>  mm/memcontrol.c | 44 
>  1 file changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1c3fbb2d2c48..de7c36295515 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3163,10 +3163,6 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>   int ret = 0;
>   bool may_oom;
>  
> - ret = res_counter_charge(>kmem, size, _res);
> - if (ret)
> - return ret;
> -
>   /*
>* Conditions under which we can wait for the oom_killer. Those are
>* the same conditions tested by the core page allocator
> @@ -3198,8 +3194,33 @@ int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>   res_counter_charge_nofail(>memsw, size,
> _res);
>   ret = 0;
> - } else if (ret)
> - res_counter_uncharge(>kmem, size);
> + }
> +
> + if (ret)
> + return ret;
> +
> + /*
> +  * When a cgroup is destroyed, all user memory pages get recharged to
> +  * the parent cgroup. Recharging is done by mem_cgroup_reparent_charges
> +  * which keeps looping until res <= kmem. This is supposed to guarantee
> +  * that by the time cgroup gets released, no pages is charged to it.
> +  *
> +  * If kmem were charged before res or uncharged after, kmem might
> +  * become greater than res for a short period of time even if there
> +  * were still user memory pages charged to the cgroup. In this case
> +  * mem_cgroup_reparent_charges would give up prematurely, and the
> +  * cgroup could be released though there were still pages charged to
> +  * it. Uncharge of such a page would trigger kernel panic.
> +  *
> +  * To prevent this from happening, kmem must be charged 

[Devel] [PATCH rh7] locks: check for fl->fl_owner != filp in show_fd_locks

2016-06-20 Thread Stanislav Kinsburskiy
NFS emulates flocks via a POSIX lock on the server, and in that case
fl->fl_owner is set to filp. Make show_fd_locks accept such locks too.

Signed-off-by: Stanislav Kinsburskiy 
---
 fs/locks.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/locks.c b/fs/locks.c
index cb7da61..a5ab0c0 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2503,6 +2503,7 @@ void show_fd_locks(struct seq_file *f,
 * matches ->fl_file.
 */
if (fl->fl_owner != files &&
+   fl->fl_owner != (fl_owner_t)filp &&
fl->fl_owner != NULL)
continue;
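
The resulting filter can be read as a standalone predicate. The sketch
below is a userspace illustration only: owner_t and skip_lock are
hypothetical names, and the pointers are opaque stand-ins for the
files, filp and fl_owner values compared in the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void *owner_t;	/* hypothetical stand-in for fl_owner_t */

/*
 * Returns true when a lock should be skipped: its owner is neither the
 * files table, nor the file itself (the NFS flock emulation case), nor
 * NULL.
 */
static bool skip_lock(owner_t fl_owner, void *files, void *filp)
{
	return fl_owner != files &&
	       fl_owner != (owner_t)filp &&
	       fl_owner != NULL;
}

/* Usage check with three distinct dummy objects as pointer identities. */
static void self_check(void)
{
	int files, filp, other;

	assert(!skip_lock(&files, &files, &filp));	/* owned by files */
	assert(!skip_lock(&filp, &files, &filp));	/* owner set to filp */
	assert(!skip_lock(NULL, &files, &filp));	/* no owner */
	assert(skip_lock(&other, &files, &filp));	/* unrelated owner */
}
```

Only the second case changes behaviour compared with the check before
this patch.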
 

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel