[Devel] [PATCH RHEL7 COMMIT] vt: selection, push console lock down

2020-10-14 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.37
-->
commit a1e4a3ee0b77f42ecd94b83527e0df27502f54d1
Author: Jiri Slaby 
Date:   Wed Oct 14 07:38:36 2020 +0300

vt: selection, push console lock down

We need to nest the console lock in sel_lock, so we have to push it down
a bit. Fortunately, the callers of set_selection_* just lock the console
lock around the function call. So moving it down is easy.

In the next patch, we switch the order.

Signed-off-by: Jiri Slaby 
Fixes: 07e6124a1a46 ("vt: selection, close sel_buffer race")
Cc: stable 
Link: https://lore.kernel.org/r/20200228115406.5735-1-jsl...@suse.cz
Signed-off-by: Greg Kroah-Hartman 

https://jira.sw.ru/browse/PSBM-121234

This is a backport of mainline commit 
4b70dd57a15d2f4685ac6e38056bad93e81e982f:

* speakup-related hunk was dropped because that driver does not use
set_selection(): it is not exported in this kernel version;

* the affected code is in set_selection() rather than 
set_selection_kernel().

Signed-off-by: Evgenii Shatokhin 
---
 drivers/tty/vt/selection.c | 13 ++++++++++++-
 drivers/tty/vt/vt.c        |  2 --
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 145cf96..1c31746 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -157,7 +157,7 @@ static int store_utf8(u16 c, char *p)
  * The entire selection process is managed under the console_lock. It's
  *  a lot under the lock but its hardly a performance path
  */
-int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *tty)
+static int __set_selection(const struct tiocl_selection __user *sel, struct tty_struct *tty)
 {
struct vc_data *vc = vc_cons[fg_console].d;
int sel_mode, new_sel_start, new_sel_end, spc;
@@ -333,6 +333,17 @@ unlock:
return ret;
 }
 
+int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *tty)
+{
+   int ret;
+
+   console_lock();
+   ret = __set_selection(sel, tty);
+   console_unlock();
+
+   return ret;
+}
+
 /* Insert the contents of the selection buffer into the
  * queue of the tty associated with the current console.
  * Invoked by ioctl().
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 795d786..07078cf 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2603,9 +2603,7 @@ int tioclinux(struct tty_struct *tty, unsigned long arg)
switch (type)
{
case TIOCL_SETSEL:
-   console_lock();
    ret = set_selection((struct tiocl_selection __user *)(p+1), tty);
-   console_unlock();
break;
case TIOCL_PASTESEL:
ret = paste_selection(tty);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
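
For illustration only, outside the patch itself: the change above follows a common pattern in which the locked body moves into a static helper and the public function takes the lock on its own. Below is a minimal userspace analogy in C, assuming a pthread mutex as a stand-in for the console lock. The names demo_console_mutex, demo_set_selection_locked() and demo_set_selection() are hypothetical and are not kernel symbols.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the console lock. */
static pthread_mutex_t demo_console_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool demo_console_held;	/* true only while the mutex is held */
static int demo_selection;	/* state protected by the mutex */

/* Inner helper, analogous to __set_selection(): caller must hold the lock. */
static int demo_set_selection_locked(int value)
{
	if (!demo_console_held)
		return -1;	/* called without the lock */
	demo_selection = value;
	return 0;
}

/*
 * Public entry point, analogous to set_selection() after the patch:
 * it takes and releases the lock itself, so callers no longer do.
 */
static int demo_set_selection(int value)
{
	int ret;

	pthread_mutex_lock(&demo_console_mutex);
	demo_console_held = true;
	ret = demo_set_selection_locked(value);
	demo_console_held = false;
	pthread_mutex_unlock(&demo_console_mutex);

	return ret;
}
```

A caller simply invokes demo_set_selection(7) with no locking of its own, which mirrors how the TIOCL_SETSEL case in tioclinux() drops its console_lock()/console_unlock() pair. Having the lock inside the wrapper is what lets a later change take another lock around it.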


[Devel] [PATCH RHEL7 COMMIT] vt: selection, push sel_lock up

2020-10-14 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.37
-->
commit 1a8d1db4502fd1419b103b8530e1e513ddca1d7c
Author: Jiri Slaby 
Date:   Wed Oct 14 07:38:42 2020 +0300

vt: selection, push sel_lock up

sel_lock cannot nest in the console lock. Thanks to syzkaller, the
kernel states firmly:

> WARNING: possible circular locking dependency detected
> 5.6.0-rc3-syzkaller #0 Not tainted
> --
> syz-executor.4/20336 is trying to acquire lock:
> 8880a2e952a0 (&tty->termios_rwsem){}, at: tty_unthrottle+0x22/0x100 drivers/tty/tty_ioctl.c:136
>
> but task is already holding lock:
> 89462e70 (sel_lock){+.+.}, at: paste_selection+0x118/0x470 drivers/tty/vt/selection.c:374
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (sel_lock){+.+.}:
>mutex_lock_nested+0x1b/0x30 kernel/locking/mutex.c:1118
>set_selection_kernel+0x3b8/0x18a0 drivers/tty/vt/selection.c:217
>set_selection_user+0x63/0x80 drivers/tty/vt/selection.c:181
>tioclinux+0x103/0x530 drivers/tty/vt/vt.c:3050
>vt_ioctl+0x3f1/0x3a30 drivers/tty/vt/vt_ioctl.c:364

This is ioctl(TIOCL_SETSEL).
Locks held on the path: console_lock -> sel_lock

> -> #1 (console_lock){+.+.}:
>console_lock+0x46/0x70 kernel/printk/printk.c:2289
>con_flush_chars+0x50/0x650 drivers/tty/vt/vt.c:3223
>n_tty_write+0xeae/0x1200 drivers/tty/n_tty.c:2350
>do_tty_write drivers/tty/tty_io.c:962 [inline]
>tty_write+0x5a1/0x950 drivers/tty/tty_io.c:1046

This is write().
Locks held on the path: termios_rwsem -> console_lock

> -> #0 (&tty->termios_rwsem){}:
>down_write+0x57/0x140 kernel/locking/rwsem.c:1534
>tty_unthrottle+0x22/0x100 drivers/tty/tty_ioctl.c:136
>mkiss_receive_buf+0x12aa/0x1340 drivers/net/hamradio/mkiss.c:902
>tty_ldisc_receive_buf+0x12f/0x170 drivers/tty/tty_buffer.c:465
>paste_selection+0x346/0x470 drivers/tty/vt/selection.c:389
>tioclinux+0x121/0x530 drivers/tty/vt/vt.c:3055
>vt_ioctl+0x3f1/0x3a30 drivers/tty/vt/vt_ioctl.c:364

This is ioctl(TIOCL_PASTESEL).
Locks held on the path: sel_lock -> termios_rwsem

> other info that might help us debug this:
>
> Chain exists of:
>   &tty->termios_rwsem --> console_lock --> sel_lock

Clearly. From the above, we have:
 console_lock -> sel_lock
 sel_lock -> termios_rwsem
 termios_rwsem -> console_lock

Fix this by reversing the console_lock -> sel_lock dependency in
ioctl(TIOCL_SETSEL). First, lock sel_lock, then console_lock.

Signed-off-by: Jiri Slaby 
Reported-by: syzbot+26183d9746e62da32...@syzkaller.appspotmail.com
Fixes: 07e6124a1a46 ("vt: selection, close sel_buffer race")
Cc: stable 
Link: https://lore.kernel.org/r/20200228115406.5735-2-jsl...@suse.cz
Signed-off-by: Greg Kroah-Hartman 

https://jira.sw.ru/browse/PSBM-121234

This is a backport of mainline commit 
e8c75a30a23c6ba63f4ef6895cbf41fd42f21aa2:
the affected code is in set_selection() rather than set_selection_kernel().

Signed-off-by: Evgenii Shatokhin 
---
 drivers/tty/vt/selection.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 1c31746..1a2146c 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -164,7 +164,7 @@ static int __set_selection(const struct tiocl_selection __user *sel, struct tty_
char *bp, *obp;
int i, ps, pe, multiplier;
u16 c;
-   int mode, ret = 0;
+   int mode;
 
poke_blanked_console();
 
@@ -204,7 +204,6 @@ static int __set_selection(const struct tiocl_selection __user *sel, struct tty_
pe = tmp;
}
 
-   mutex_lock(&sel_lock);
if (sel_cons != vc_cons[fg_console].d) {
clear_selection();
sel_cons = vc_cons[fg_console].d;
@@ -250,10 +249,9 @@ static int __set_selection(const struct tiocl_selection __user *sel, struct tty_
break;
case TIOCL_SELPOINTER:
highlight_pointer(pe);
-   goto unlock;
+   return 0;
default:
-   ret = -EINVAL;
-   goto unlock;
+   return -EINVAL;
}
 
/* remove the pointer */
@@ -275,7 +273,7 @@ static int __set_selection(const struct tiocl_selection __user *sel, struct tty_
else if (new_sel_start == sel_start)
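
For illustration only, outside the patch itself: the three dependencies listed in the commit message form a cycle, and reversing the TIOCL_SETSEL edge breaks it. The standalone C sketch below checks this with a depth-first search over a tiny dependency graph. It is a toy model, not how lockdep is implemented, and all type and function names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_id { CONSOLE_LOCK, SEL_LOCK, TERMIOS_RWSEM, NR_LOCKS };

/* dep[a][b] is true when some path takes lock b while holding lock a. */
struct lock_graph {
	bool dep[NR_LOCKS][NR_LOCKS];
};

/* Depth-first search: is 'target' reachable from 'from' via dep edges? */
static bool reaches(const struct lock_graph *g, int from, int target,
		    bool *visited)
{
	visited[from] = true;
	for (int next = 0; next < NR_LOCKS; next++) {
		if (!g->dep[from][next])
			continue;
		if (next == target)
			return true;
		if (!visited[next] && reaches(g, next, target, visited))
			return true;
	}
	return false;
}

/* A cycle exists if any lock can reach itself through the edges. */
static bool has_cycle(const struct lock_graph *g)
{
	for (int start = 0; start < NR_LOCKS; start++) {
		bool visited[NR_LOCKS] = { false };

		if (reaches(g, start, start, visited))
			return true;
	}
	return false;
}

/* Edges as listed in the commit message, before the fix. */
static struct lock_graph graph_before_fix(void)
{
	struct lock_graph g = { 0 };

	g.dep[CONSOLE_LOCK][SEL_LOCK] = true;       /* ioctl(TIOCL_SETSEL) */
	g.dep[SEL_LOCK][TERMIOS_RWSEM] = true;      /* ioctl(TIOCL_PASTESEL) */
	g.dep[TERMIOS_RWSEM][CONSOLE_LOCK] = true;  /* write() */
	return g;
}

/* Same edges with the TIOCL_SETSEL order reversed, as the patch does. */
static struct lock_graph graph_after_fix(void)
{
	struct lock_graph g = graph_before_fix();

	g.dep[CONSOLE_LOCK][SEL_LOCK] = false;
	g.dep[SEL_LOCK][CONSOLE_LOCK] = true;
	return g;
}
```

With the original edges has_cycle() reports a cycle. After the reversal the remaining edges are sel_lock -> console_lock, sel_lock -> termios_rwsem and termios_rwsem -> console_lock, so sel_lock is always the outermost lock and no path leads back to it.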
   

Re: [Devel] [PATCH RH7] overlayfs: avoid permission check for priveleged processes

2020-10-14 Thread Vasily Averin
Pavel,
please review

On 10/14/20 2:05 AM, Andrey Zhadchenko wrote:
> Overlayfs temporarily overrides credentials in the copy_up function with the
> ones that were used to create the mount. Unfortunately, vfs_setxattr requires
> the CAP_SYS_ADMIN capability in the current user namespace. This leads to
> strange situations.
> For example, if an overlayfs mount was made inside a ve, it is impossible to
> use copy_up from init_user_ns even with CAP_SYS_ADMIN, because the overridden
> credentials are not sufficient in init_user_ns to set an xattr on a file.
> The fix is also needed for criu, since copy_up can be triggered at the dump
> stage: reading an inotify fhandle from /proc may start copy_up.
> 
> Add an option to skip the vfs_setxattr CAP_SYS_ADMIN check if the current
> credentials have CAP_SYS_ADMIN in the namespace that is recorded in the
> overlayfs mount superblock.
> 
> https://jira.sw.ru/browse/PSBM-108122
> Signed-off-by: Andrey Zhadchenko 
> ---
>  fs/overlayfs/copy_up.c   | 25 +++--
>  fs/overlayfs/overlayfs.h | 39 ++-
>  fs/overlayfs/util.c  | 32 
>  fs/xattr.c   |  2 +-
>  4 files changed, 74 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 1564a35..d6b285f 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "overlayfs.h"
>  
>  #define OVL_COPY_UP_CHUNK_SIZE (1 << 20)
> @@ -321,8 +322,8 @@ out:
>   return fh;
>  }
>  
> -int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
> -struct dentry *upper)
> +int ovl_set_origin_ext(struct dentry *dentry, struct dentry *lower,
> +struct dentry *upper, int propagate_cap)
>  {
>   const struct ovl_fh *fh = NULL;
>   int err;
> @@ -341,8 +342,8 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
>   /*
>* Do not fail when upper doesn't support xattrs.
>*/
> - err = ovl_check_setxattr(dentry, upper, OVL_XATTR_ORIGIN, fh,
> -  fh ? fh->len : 0, 0);
> + err = ovl_check_setxattr_ext(dentry, upper, OVL_XATTR_ORIGIN, fh,
> +  fh ? fh->len : 0, 0, propagate_cap);
>   kfree(fh);
>  
>   return err;
> @@ -433,6 +434,7 @@ struct ovl_copy_up_ctx {
>   struct dentry *destdir;
>   struct qstr destname;
>   struct dentry *workdir;
> + int propagate_cap;
>   bool tmpfile;
>   bool origin;
>   bool indexed;
> @@ -711,7 +713,7 @@ out:
>  }
>  
>  static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
> -int flags)
> +int flags, int propagate_cap)
>  {
>   int err;
>   struct path parentpath;
> @@ -719,6 +721,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
>   .parent = parent,
>   .dentry = dentry,
>   .workdir = ovl_workdir(dentry),
> + .propagate_cap = propagate_cap,
>   };
>  
>   if (WARN_ON(!ctx.workdir))
> @@ -768,9 +771,19 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
>   return err;
>  }
>  
> +static int ovl_can_propagate_cap(struct dentry *dentry)
> +{
> + struct super_block *sb = dentry->d_sb;
> + struct ovl_fs *ofs = sb->s_fs_info;
> + struct user_namespace *ovl_ns = ofs->creator_cred->user_ns;
> +
> + return ns_capable(ovl_ns, CAP_SYS_ADMIN);
> +}
> +
>  int ovl_copy_up_flags(struct dentry *dentry, int flags)
>  {
>   int err = 0;
> + int propagate_cap = ovl_can_propagate_cap(dentry);
>   const struct cred *old_cred = ovl_override_creds(dentry->d_sb);
>   bool disconnected = (dentry->d_flags & DCACHE_DISCONNECTED);
>  
> @@ -815,7 +828,7 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)
>   next = parent;
>   }
>  
> - err = ovl_copy_up_one(parent, next, flags);
> + err = ovl_copy_up_one(parent, next, flags, propagate_cap);
>  
>   dput(parent);
>   dput(next);
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 7052938..6917acd 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -149,15 +149,6 @@ static inline int ovl_do_symlink(struct inode *dir, struct dentry *dentry,
>   return err;
>  }
>  
> -static inline int ovl_do_setxattr(struct dentry *dentry, const char *name,
> -   const void *value, size_t size, int flags)
> -{
> - int err = vfs_setxattr(dentry, name, value, size, flags);
> - pr_debug("setxattr(%pd2, \"%s\", \"%*pE\", %zu, 0x%x) = %i\n",
> -  dentry, name, min((int)size, 48), value, size, flags, err);
> - return err;
> -}
> -
>  static inline int ovl_do_removexattr(struct dentry *dentry, const char *name)
>  {
>   int err = vfs_rem
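
For illustration only, outside the quoted patch: ovl_can_propagate_cap() asks whether the caller has CAP_SYS_ADMIN in the user namespace recorded in the overlayfs creator credentials. The toy C model below sketches that kind of check. It assumes that a capability held in a namespace also counts in that namespace's descendants, and it leaves out the namespace-owner shortcut of the real kernel check. Every name in it (toy_userns, toy_cred, toy_ns_capable, toy_can_propagate_cap) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user namespace: only the parent link matters here. */
struct toy_userns {
	const struct toy_userns *parent;
};

/* Toy credentials: home namespace plus a single capability bit. */
struct toy_cred {
	const struct toy_userns *ns;
	bool cap_sys_admin;
};

/*
 * Simplified stand-in for ns_capable(target, CAP_SYS_ADMIN): walk from
 * the target namespace up to the root; the capability counts only if
 * the credential's own namespace is found on that path.
 */
static bool toy_ns_capable(const struct toy_cred *cred,
			   const struct toy_userns *target)
{
	for (const struct toy_userns *ns = target; ns; ns = ns->parent) {
		if (ns == cred->ns)
			return cred->cap_sys_admin;
	}
	return false;
}

/* Mirrors the shape of ovl_can_propagate_cap() from the quoted patch. */
static bool toy_can_propagate_cap(const struct toy_cred *current_cred,
				  const struct toy_cred *creator_cred)
{
	return toy_ns_capable(current_cred, creator_cred->ns);
}
```

In this model a host administrator in the root namespace passes the check against a mount created inside a child (container) namespace, while the mount creator's credentials do not count as capable in the root namespace. That asymmetry is the situation the commit message describes.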

[Devel] [PATCH rh7] ms/aio: Kill aio_rw_vect_retry()

2020-10-14 Thread Andrey Ryabinin
From: Kent Overstreet 

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).

Signed-off-by: Kent Overstreet 
Cc: Zach Brown 
Cc: Felipe Balbi 
Cc: Greg Kroah-Hartman 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Rusty Russell 
Cc: Jens Axboe 
Cc: Asai Thambi S P 
Cc: Selvan Mani 
Cc: Sam Bradshaw 
Cc: Jeff Moyer 
Cc: Al Viro 
Cc: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 

https://jira.sw.ru/browse/PSBM-121197
(cherry picked from commit 73a7075e3f6ec63dc359064eea6fd84f406cf2a5)
Signed-off-by: Andrey Ryabinin 
---
 drivers/staging/android/logger.c |  2 +-
 drivers/usb/gadget/inode.c   |  6 +--
 fs/aio.c | 92 +++-
 fs/block_dev.c   |  2 +-
 fs/nfs/direct.c  |  1 -
 fs/ocfs2/file.c  |  6 +--
 fs/read_write.c  |  3 --
 fs/udf/file.c|  2 +-
 include/linux/aio.h  |  2 -
 mm/page_io.c |  1 -
 net/socket.c |  2 +-
 11 files changed, 28 insertions(+), 91 deletions(-)

diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c
index 34519ea14b54..16a6c3179625 100644
--- a/drivers/staging/android/logger.c
+++ b/drivers/staging/android/logger.c
@@ -481,7 +481,7 @@ static ssize_t logger_aio_write(struct kiocb *iocb, const struct iovec *iov,
header.sec = now.tv_sec;
header.nsec = now.tv_nsec;
header.euid = current_euid();
-   header.len = min_t(size_t, iocb->ki_left, LOGGER_ENTRY_MAX_PAYLOAD);
+   header.len = min_t(size_t, iocb->ki_nbytes, LOGGER_ENTRY_MAX_PAYLOAD);
header.hdr_size = sizeof(struct logger_entry);
 
/* null writes succeed, return zero */
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 570c005062ab..09aae3c48d2c 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -709,11 +709,11 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(usb_endpoint_dir_in(&epdata->desc)))
return -EINVAL;
 
-   buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+   buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
-   return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
+   return ep_aio_rwtail(iocb, buf, iocb->ki_nbytes, epdata, iov, nr_segs);
 }
 
 static ssize_t
@@ -728,7 +728,7 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(!usb_endpoint_dir_in(&epdata->desc)))
return -EINVAL;
 
-   buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+   buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
diff --git a/fs/aio.c b/fs/aio.c
index c7e23a5832aa..f1b27fc5defb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -879,7 +879,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
goto out_put;
 
-   atomic_set(&req->ki_users, 2);
+   atomic_set(&req->ki_users, 1);
req->ki_ctx = ctx;
 
return req;
@@ -1279,75 +1279,9 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
return -EINVAL;
 }
 
-static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret)
-{
-   struct iovec *iov = &iocb->ki_iovec[iocb->ki_cur_seg];
-
-   BUG_ON(ret <= 0);
-
-   while (iocb->ki_cur_seg < iocb->ki_nr_segs && ret > 0) {
-   ssize_t this = min((ssize_t)iov->iov_len, ret);
-   iov->iov_base += this;
-   iov->iov_len -= this;
-   iocb->ki_left -= this;
-   ret -= this;
-   if (iov->iov_len == 0) {
-   iocb->ki_cur_seg++;
-   iov++;
-   }
-   }
-
-   /* the caller should not have done more io than what fit in
-* the remaining iovecs */
-   BUG_ON(ret > 0 && iocb->ki_left == 0);
-}
-
 typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);
 
-static ssize_t aio_rw_vect_retry(struct kiocb *iocb, int rw, aio_rw_op *rw_op)
-{
-   struct file *file = iocb->ki_filp;
-   struct address_space *mapping = file->f_mapping;
-   struct inode *inode = mapping->host;
-   ssize_t ret = 0;
-
-   /* This matches the pread()/pwrite() logic */
-   if (iocb->ki_pos < 0)
-   return -EINVAL;
-
-   if (rw == WRITE)
-   file_start_write(file);
-   do {
-   ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
-   iocb->ki_nr_segs - iocb->ki_cur_seg,
-   iocb->ki_pos);
-   if (ret > 0)
-   

Re: [Devel] [PATCH RH7] overlayfs: avoid permission check for priveleged processes

2020-10-14 Thread Pavel Tikhomirov

On 10/14/20 2:05 AM, Andrey Zhadchenko wrote:
> Overlayfs temporarily overrides credentials in the copy_up function with the
> ones that were used to create the mount.
>
> Unfortunately, vfs_setxattr requires the CAP_SYS_ADMIN capability in the
> current user namespace.

No, if that were so, there would be no error =) To be correct, we should say:

Function vfs_setxattr for "trusted." attrs requires CAP_SYS_ADMIN in the
current ve's userns.

This is done to mimic mainstream behaviour for containers (ves), so that a
container user can't set "trusted." xattrs if it is in a non-root container
userns.

> This leads to strange situations.
> For example, if an overlayfs mount was made inside a ve, it is impossible to
> use copy_up from init_user_ns even with CAP_SYS_ADMIN, because the overridden
> credentials are not sufficient in init_user_ns to set an xattr on a file.
> The fix is also needed for criu, since copy_up can be triggered at the dump
> stage: reading an inotify fhandle from /proc may start copy_up.

I hope that overlayfs overrides credentials exactly to be able to pass
those checks. In the mainstream kernel overlayfs can be used from any userns,
but can only be mounted from init_user_ns, so credentials always change
to "more permissive". So it should be safe to skip the override in case we
are already "more permissive" than the superblock's credentials.

> Add an option to skip the vfs_setxattr CAP_SYS_ADMIN check if the current
> credentials have CAP_SYS_ADMIN in the namespace that is recorded in the
> overlayfs mount superblock.

Sorry, but looking at the code I don't see how it works... There are only
three codepaths here:


  +-< ovl_do_setxattr_ext
+-< ovl_do_setxattr #1 sets propagate_cap to false
+-< ovl_check_setxattr_ext
| +-< ovl_set_origin_ext
| | +-< ovl_set_origin #2 sets propagate_cap to false
| +-< ovl_check_setxattr #3 sets propagate_cap to false

And on all of them we don't "propagate_cap". Probably I'm missing 
something though.




https://jira.sw.ru/browse/PSBM-108122
Signed-off-by: Andrey Zhadchenko 
---
  fs/overlayfs/copy_up.c   | 25 +++--
  fs/overlayfs/overlayfs.h | 39 ++-
  fs/overlayfs/util.c  | 32 
  fs/xattr.c   |  2 +-
  4 files changed, 74 insertions(+), 24 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 1564a35..d6b285f 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "overlayfs.h"
  
  #define OVL_COPY_UP_CHUNK_SIZE (1 << 20)

@@ -321,8 +322,8 @@ out:
return fh;
  }
  
-int ovl_set_origin(struct dentry *dentry, struct dentry *lower,

-  struct dentry *upper)
+int ovl_set_origin_ext(struct dentry *dentry, struct dentry *lower,
+  struct dentry *upper, int propagate_cap)
  {
const struct ovl_fh *fh = NULL;
int err;
@@ -341,8 +342,8 @@ int ovl_set_origin(struct dentry *dentry, struct dentry 
*lower,
/*
 * Do not fail when upper doesn't support xattrs.
 */
-   err = ovl_check_setxattr(dentry, upper, OVL_XATTR_ORIGIN, fh,
-fh ? fh->len : 0, 0);
+   err = ovl_check_setxattr_ext(dentry, upper, OVL_XATTR_ORIGIN, fh,
+fh ? fh->len : 0, 0, propagate_cap);
kfree(fh);
  
  	return err;

@@ -433,6 +434,7 @@ struct ovl_copy_up_ctx {
struct dentry *destdir;
struct qstr destname;
struct dentry *workdir;
+   int propagate_cap;
bool tmpfile;
bool origin;
bool indexed;
@@ -711,7 +713,7 @@ out:
  }
  
  static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,

-  int flags)
+  int flags, int propagate_cap)
  {
int err;
struct path parentpath;
@@ -719,6 +721,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct 
dentry *dentry,
.parent = parent,
.dentry = dentry,
.workdir = ovl_workdir(dentry),
+   .propagate_cap = propagate_cap,
};
  
  	if (WARN_ON(!ctx.workdir))

@@ -768,9 +771,19 @@ static int ovl_copy_up_one(struct dentry *parent, struct 
dentry *dentry,
return err;
  }
  
+static int ovl_can_propagate_cap(struct dentry *dentry)

+{
+   struct super_block *sb = dentry->d_sb;
+   struct ovl_fs *ofs = sb->s_fs_info;
+   struct user_namespace *ovl_ns = ofs->creator_cred->user_ns;
+
+   return ns_capable(ovl_ns, CAP_SYS_ADMIN);
+}
+
  int ovl_copy_up_flags(struct dentry *dentry, int flags)
  {
int err = 0;
+   int propagate_cap = ovl_can_propagate_cap(dentry);
const struct cred *old_cred = ovl_override_creds(dentry->d_sb);
bool disconnected = (dentry->d_flags & DCACHE_DISCONNECTED);
  
@@ -815,7 +828,7 @@ int ovl_copy_up_flags(struct dentry *dentry, int flags)

next = parent;
}
  
-		e

[Devel] [PATCH rh8] ve/futex/timeout: adjust futex timeout to absolule

2020-10-14 Thread Konstantin Khorenko
From: Kirill Tkhai 

This converts ve-absolute-monotonic time to global-absolute-monotonic time.

https://jira.sw.ru/browse/PSBM-14471

diff-futex-reference-ct-monotonic-clock-from-ct-start

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Kirill Tkhai 

(cherry picked from vz7 commit 14a4db52ee8c862eb7a9dec740b15c646e0b59aa)
Signed-off-by: Konstantin Khorenko 
---
 kernel/futex.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5282b74fc31b..9947cb4db384 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -3618,6 +3619,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 {
int cmd = op & FUTEX_CMD_MASK;
unsigned int flags = 0;
+   ktime_t abs_time;
 
if (!(op & FUTEX_PRIVATE_FLAG))
flags |= FLAGS_SHARED;
@@ -3627,6 +3629,12 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
if (cmd != FUTEX_WAIT && cmd != FUTEX_WAIT_BITSET && \
cmd != FUTEX_WAIT_REQUEUE_PI)
return -ENOSYS;
+   } else if (timeout) {
+   if (cmd == FUTEX_WAIT_BITSET || cmd == FUTEX_WAIT_REQUEUE_PI) {
+   abs_time = ktime_add(*timeout, ns_to_ktime(
+get_exec_env()->start_time));
+   timeout = &abs_time;
+   }
}
 
switch (cmd) {
-- 
2.28.0
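
For illustration only, outside the patch itself: the conversion described above is a single addition. A deadline expressed on the container's monotonic clock is shifted by the container's start time to obtain a deadline on the host's monotonic clock. The C sketch below assumes plain 64-bit nanosecond values in place of ktime_t. The names toy_ve, toy_cmd, ve_abs_to_host_abs() and adjust_deadline() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical container descriptor: start_ns is the host monotonic
 * time, in nanoseconds, at which the container was started.
 */
struct toy_ve {
	int64_t start_ns;
};

/* Toy command codes standing in for the futex commands in the diff. */
enum toy_cmd { TOY_WAIT, TOY_WAIT_BITSET, TOY_WAIT_REQUEUE_PI };

/* Container-absolute monotonic deadline -> host-absolute deadline. */
static int64_t ve_abs_to_host_abs(const struct toy_ve *ve,
				  int64_t ve_deadline_ns)
{
	return ve_deadline_ns + ve->start_ns;
}

/*
 * Apply the shift only for the two commands the patch handles;
 * other commands keep their timeout value unchanged.
 */
static int64_t adjust_deadline(const struct toy_ve *ve, enum toy_cmd cmd,
			       int64_t deadline_ns)
{
	if (cmd == TOY_WAIT_BITSET || cmd == TOY_WAIT_REQUEUE_PI)
		return ve_abs_to_host_abs(ve, deadline_ns);
	return deadline_ns;
}
```

For a container started 5 s after host boot, a container deadline of 2 s maps to a host deadline of 7 s. As in the patch, the adjusted value goes into a local variable and the caller's timeout is not modified in place.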



[Devel] [PATCH RHEL8 COMMIT] ms/vt: selection, push console lock down #PSBM-120640

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 2305a61979a0e62c63a23c48c6cd250da9516cfa
Author: Jiri Slaby 
Date:   Wed Oct 14 15:32:53 2020 +0300

ms/vt: selection, push console lock down #PSBM-120640

We need to nest the console lock in sel_lock, so we have to push it down
a bit. Fortunately, the callers of set_selection_* just lock the console
lock around the function call. So moving it down is easy.

In the next patch, we switch the order.

Signed-off-by: Jiri Slaby 
Fixes: 07e6124a1a46 ("vt: selection, close sel_buffer race")
Cc: stable 
Link: https://lore.kernel.org/r/20200228115406.5735-1-jsl...@suse.cz
Signed-off-by: Greg Kroah-Hartman 

https://jira.sw.ru/browse/PSBM-120640

This is a backport of mainline commit 
4b70dd57a15d2f4685ac6e38056bad93e81e982f:

* speakup-related hunk was dropped because that driver does not use
set_selection(): it is not exported in this kernel version;

* the affected code is in set_selection() rather than 
set_selection_kernel().

Signed-off-by: Evgenii Shatokhin 
---
 drivers/tty/vt/selection.c | 13 ++++++++++++-
 drivers/tty/vt/vt.c        |  2 --
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 2a68d6fdb7b1..2f378a7cd1fe 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -155,7 +155,7 @@ static int store_utf8(u16 c, char *p)
  * The entire selection process is managed under the console_lock. It's
  *  a lot under the lock but its hardly a performance path
  */
-int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *tty)
+static int __set_selection(const struct tiocl_selection __user *sel, struct tty_struct *tty)
 {
struct vc_data *vc = vc_cons[fg_console].d;
int new_sel_start, new_sel_end, spc;
@@ -320,6 +320,17 @@ int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *t
return ret;
 }
 
+int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *tty)
+{
+   int ret;
+
+   console_lock();
+   ret = __set_selection(sel, tty);
+   console_unlock();
+
+   return ret;
+}
+
 /* Insert the contents of the selection buffer into the
  * queue of the tty associated with the current console.
  * Invoked by ioctl().
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 29cf1cd7aff0..440a2d085729 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -2694,9 +2694,7 @@ int tioclinux(struct tty_struct *tty, unsigned long arg)
switch (type)
{
case TIOCL_SETSEL:
-   console_lock();
    ret = set_selection((struct tiocl_selection __user *)(p+1), tty);
-   console_unlock();
break;
case TIOCL_PASTESEL:
ret = paste_selection(tty);


[Devel] [PATCH RHEL8 COMMIT] ms/vt: selection, close sel_buffer race #PSBM-120640

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit c69dd934d249b86e9b493bd89261b03f70f2357b
Author: Jiri Slaby 
Date:   Wed Oct 14 15:32:52 2020 +0300

ms/vt: selection, close sel_buffer race #PSBM-120640

syzkaller reported this UAF:
BUG: KASAN: use-after-free in n_tty_receive_buf_common+0x2481/0x2940 drivers/tty/n_tty.c:1741
Read of size 1 at addr 8880089e40e9 by task syz-executor.1/13184

CPU: 0 PID: 13184 Comm: syz-executor.1 Not tainted 5.4.7 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
...
 kasan_report+0xe/0x20 mm/kasan/common.c:634
 n_tty_receive_buf_common+0x2481/0x2940 drivers/tty/n_tty.c:1741
 tty_ldisc_receive_buf+0xac/0x190 drivers/tty/tty_buffer.c:461
 paste_selection+0x297/0x400 drivers/tty/vt/selection.c:372
 tioclinux+0x20d/0x4e0 drivers/tty/vt/vt.c:3044
 vt_ioctl+0x1bcf/0x28d0 drivers/tty/vt/vt_ioctl.c:364
 tty_ioctl+0x525/0x15a0 drivers/tty/tty_io.c:2657
 vfs_ioctl fs/ioctl.c:47 [inline]

It is due to a race between parallel paste_selection (TIOCL_PASTESEL)
and set_selection_user (TIOCL_SETSEL) invocations. One uses sel_buffer,
while the other frees it and reallocates a new one for another
selection. Add a mutex to close this race.

The mutex properly protects only sel_buffer and sel_buffer_lth. The
other selection global variables (like sel_start, sel_end, and sel_cons)
are protected only in set_selection_user. Closing the races on those
variables in the other functions needs considerably more work, which is
going to happen later.

This likely fixes (I am unsure as there is no reproducer provided) bug
206361 too. It was marked as CVE-2020-8648.

Signed-off-by: Jiri Slaby 
Reported-by: syzbot+59997e8d5cbdc486e...@syzkaller.appspotmail.com
References: https://bugzilla.kernel.org/show_bug.cgi?id=206361
Cc: stable 
Link: https://lore.kernel.org/r/20200210081131.23572-2-jsl...@suse.cz
Signed-off-by: Greg Kroah-Hartman 

https://jira.sw.ru/browse/PSBM-120640

This is a backport of mainline commit 
07e6124a1a46b4b5a9b3cacc0c306b50da87abf5:
the affected code is in set_selection() rather than set_selection_kernel().

Signed-off-by: Evgenii Shatokhin 
---
 drivers/tty/vt/selection.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 90ea1cc52b7a..2a68d6fdb7b1 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -41,6 +42,7 @@ static volatile int sel_start = -1;   /* cleared by clear_selection */
 static int sel_end;
 static int sel_buffer_lth;
 static char *sel_buffer;
+static DEFINE_MUTEX(sel_lock);
 
 /* clear_selection, highlight and highlight_pointer can be called
from interrupt (via scrollback/front) */
@@ -161,7 +163,7 @@ int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *t
char *bp, *obp;
int i, ps, pe, multiplier;
u16 c;
-   int mode;
+   int mode, ret = 0;
 
poke_blanked_console();
if (copy_from_user(&v, sel, sizeof(*sel)))
@@ -188,6 +190,7 @@ int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *t
if (ps > pe)/* make sel_start <= sel_end */
swap(ps, pe);
 
+   mutex_lock(&sel_lock);
if (sel_cons != vc_cons[fg_console].d) {
clear_selection();
sel_cons = vc_cons[fg_console].d;
@@ -233,9 +236,10 @@ int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *t
break;
case TIOCL_SELPOINTER:
highlight_pointer(pe);
-   return 0;
+   goto unlock;
default:
-   return -EINVAL;
+   ret = -EINVAL;
+   goto unlock;
}
 
/* remove the pointer */
@@ -257,7 +261,7 @@ int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *t
else if (new_sel_start == sel_start)
{
if (new_sel_end == sel_end) /* no action required */
-   return 0;
+   goto unlock;
else if (new_sel_end > sel_end) /* extend to right */
highlight(sel_end + 2, new_sel_end);
else/* contract from right */
@@ -285,7 +289,8 @@ int set_selection(const struct tiocl_selection __user *sel, struct tty_struct *t
if (!bp) {
printk(KERN_WARNING "selection: kmalloc() failed\n");
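
For illustration only, outside the patch itself: the race described in the commit message is between one path that reads the selection buffer and another that frees and replaces it. The userspace C sketch below shows one way a single mutex closes such a race, with a pthread mutex standing in for sel_lock. The names demo_sel_mutex, demo_set_selection_text() and demo_paste_selection() are hypothetical and the code is an analogy, not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t demo_sel_mutex = PTHREAD_MUTEX_INITIALIZER;
static char *demo_sel_buf;	/* current selection, guarded by the mutex */
static size_t demo_sel_len;

/*
 * Replace the selection: build the new buffer first, then swap it in
 * and free the old one while holding the mutex.
 */
static int demo_set_selection_text(const char *text)
{
	size_t len = strlen(text);
	char *copy = malloc(len + 1);

	if (!copy)
		return -1;
	memcpy(copy, text, len + 1);

	pthread_mutex_lock(&demo_sel_mutex);
	free(demo_sel_buf);
	demo_sel_buf = copy;
	demo_sel_len = len;
	pthread_mutex_unlock(&demo_sel_mutex);
	return 0;
}

/*
 * Paste: copy the text out under the same mutex, so the buffer cannot
 * be freed while it is being read. Returns the number of bytes copied.
 */
static size_t demo_paste_selection(char *out, size_t out_size)
{
	size_t n = 0;

	if (out_size == 0)
		return 0;
	out[0] = '\0';

	pthread_mutex_lock(&demo_sel_mutex);
	if (demo_sel_buf) {
		n = demo_sel_len < out_size - 1 ? demo_sel_len : out_size - 1;
		memcpy(out, demo_sel_buf, n);
		out[n] = '\0';
	}
	pthread_mutex_unlock(&demo_sel_mutex);
	return n;
}
```

Because both functions take the same mutex, a reader never observes a pointer to memory that the writer has already freed. As the commit message notes for the real code, this covers only the buffer and its length, not the other selection state.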
 

[Devel] [PATCH RHEL8 COMMIT] ms/vt: selection, push sel_lock up #PSBM-120640

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit daab9ab9f8ec04454be9c6cdbef8ffba583eaea9
Author: Jiri Slaby 
Date:   Wed Oct 14 15:32:53 2020 +0300

ms/vt: selection, push sel_lock up #PSBM-120640

sel_lock cannot nest in the console lock. Thanks to syzkaller, the
kernel states firmly:

> WARNING: possible circular locking dependency detected
> 5.6.0-rc3-syzkaller #0 Not tainted
> --
> syz-executor.4/20336 is trying to acquire lock:
> 8880a2e952a0 (&tty->termios_rwsem){}, at: tty_unthrottle+0x22/0x100 drivers/tty/tty_ioctl.c:136
>
> but task is already holding lock:
> 89462e70 (sel_lock){+.+.}, at: paste_selection+0x118/0x470 drivers/tty/vt/selection.c:374
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (sel_lock){+.+.}:
>mutex_lock_nested+0x1b/0x30 kernel/locking/mutex.c:1118
>set_selection_kernel+0x3b8/0x18a0 drivers/tty/vt/selection.c:217
>set_selection_user+0x63/0x80 drivers/tty/vt/selection.c:181
>tioclinux+0x103/0x530 drivers/tty/vt/vt.c:3050
>vt_ioctl+0x3f1/0x3a30 drivers/tty/vt/vt_ioctl.c:364

This is ioctl(TIOCL_SETSEL).
Locks held on the path: console_lock -> sel_lock

> -> #1 (console_lock){+.+.}:
>console_lock+0x46/0x70 kernel/printk/printk.c:2289
>con_flush_chars+0x50/0x650 drivers/tty/vt/vt.c:3223
>n_tty_write+0xeae/0x1200 drivers/tty/n_tty.c:2350
>do_tty_write drivers/tty/tty_io.c:962 [inline]
>tty_write+0x5a1/0x950 drivers/tty/tty_io.c:1046

This is write().
Locks held on the path: termios_rwsem -> console_lock

> -> #0 (&tty->termios_rwsem){}:
>down_write+0x57/0x140 kernel/locking/rwsem.c:1534
>tty_unthrottle+0x22/0x100 drivers/tty/tty_ioctl.c:136
>mkiss_receive_buf+0x12aa/0x1340 drivers/net/hamradio/mkiss.c:902
>tty_ldisc_receive_buf+0x12f/0x170 drivers/tty/tty_buffer.c:465
>paste_selection+0x346/0x470 drivers/tty/vt/selection.c:389
>tioclinux+0x121/0x530 drivers/tty/vt/vt.c:3055
>vt_ioctl+0x3f1/0x3a30 drivers/tty/vt/vt_ioctl.c:364

This is ioctl(TIOCL_PASTESEL).
Locks held on the path: sel_lock -> termios_rwsem

> other info that might help us debug this:
>
> Chain exists of:
>   &tty->termios_rwsem --> console_lock --> sel_lock

Clearly. From the above, we have:
 console_lock -> sel_lock
 sel_lock -> termios_rwsem
 termios_rwsem -> console_lock

Fix this by reversing the console_lock -> sel_lock dependency in
ioctl(TIOCL_SETSEL). First, lock sel_lock, then console_lock.
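
The cycle, and the effect of reversing one edge, can be checked mechanically.
What follows is a minimal userspace sketch, not kernel code: it models the
three locks from the report as graph nodes, records "B is taken while A is
held" as an edge, and looks for a cycle with a depth-first search. All names
in it (lock_graph, add_dependency, has_cycle and so on) are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model: one node per lock named in the report. */
enum { CONSOLE_LOCK, SEL_LOCK, TERMIOS_RWSEM, NR_LOCKS };

struct lock_graph {
	/* edge[a][b] is true when b is acquired while a is held */
	bool edge[NR_LOCKS][NR_LOCKS];
};

static void lock_graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

static void add_dependency(struct lock_graph *g, int held, int taken)
{
	assert(held >= 0 && held < NR_LOCKS);
	assert(taken >= 0 && taken < NR_LOCKS);
	g->edge[held][taken] = true;
}

/* state: 0 = unvisited, 1 = on the DFS stack, 2 = fully explored */
static bool dfs(const struct lock_graph *g, int node, int *state)
{
	state[node] = 1;
	for (int next = 0; next < NR_LOCKS; next++) {
		if (!g->edge[node][next])
			continue;
		if (state[next] == 1)
			return true;	/* back edge: circular dependency */
		if (state[next] == 0 && dfs(g, next, state))
			return true;
	}
	state[node] = 2;
	return false;
}

static bool has_cycle(const struct lock_graph *g)
{
	int state[NR_LOCKS] = { 0 };

	for (int n = 0; n < NR_LOCKS; n++)
		if (state[n] == 0 && dfs(g, n, state))
			return true;
	return false;
}

/* Ordering before the patch, as listed in the report. */
static bool order_before_fix_has_cycle(void)
{
	struct lock_graph g;

	lock_graph_init(&g);
	add_dependency(&g, CONSOLE_LOCK, SEL_LOCK);	/* ioctl(TIOCL_SETSEL) */
	add_dependency(&g, SEL_LOCK, TERMIOS_RWSEM);	/* ioctl(TIOCL_PASTESEL) */
	add_dependency(&g, TERMIOS_RWSEM, CONSOLE_LOCK);	/* write() */
	return has_cycle(&g);
}

/* Ordering after the patch: TIOCL_SETSEL takes sel_lock first. */
static bool order_after_fix_has_cycle(void)
{
	struct lock_graph g;

	lock_graph_init(&g);
	add_dependency(&g, SEL_LOCK, CONSOLE_LOCK);	/* ioctl(TIOCL_SETSEL) */
	add_dependency(&g, SEL_LOCK, TERMIOS_RWSEM);	/* ioctl(TIOCL_PASTESEL) */
	add_dependency(&g, TERMIOS_RWSEM, CONSOLE_LOCK);	/* write() */
	return has_cycle(&g);
}
```

In this model the first ordering contains a cycle and the second does not,
which is the whole point of taking sel_lock before console_lock.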

Signed-off-by: Jiri Slaby 
Reported-by: syzbot+26183d9746e62da32...@syzkaller.appspotmail.com
Fixes: 07e6124a1a46 ("vt: selection, close sel_buffer race")
Cc: stable 
Link: https://lore.kernel.org/r/20200228115406.5735-2-jsl...@suse.cz
Signed-off-by: Greg Kroah-Hartman 

https://jira.sw.ru/browse/PSBM-120640

This is a backport of mainline commit 
e8c75a30a23c6ba63f4ef6895cbf41fd42f21aa2:
the affected code is in set_selection() rather than set_selection_kernel().

Signed-off-by: Evgenii Shatokhin 
---
 drivers/tty/vt/selection.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 2f378a7cd1fe..c50a05b57470 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -163,7 +163,7 @@ static int __set_selection(const struct tiocl_selection 
__user *sel, struct tty_
char *bp, *obp;
int i, ps, pe, multiplier;
u16 c;
-   int mode, ret = 0;
+   int mode;
 
poke_blanked_console();
if (copy_from_user(&v, sel, sizeof(*sel)))
@@ -190,7 +190,6 @@ static int __set_selection(const struct tiocl_selection 
__user *sel, struct tty_
if (ps > pe)/* make sel_start <= sel_end */
swap(ps, pe);
 
-   mutex_lock(&sel_lock);
if (sel_cons != vc_cons[fg_console].d) {
clear_selection();
sel_cons = vc_cons[fg_console].d;
@@ -236,10 +235,9 @@ static int __set_selection(const struct tiocl_selection 
__user *sel, struct tty_
break;
case TIOCL_SELPOINTER:
highlight_pointer(pe);
-   goto unlock;
+   return 0;
default:
-   ret = -EINVAL;
-   goto unlock;
+   return -EINVAL;
}
 
/* remove the pointer */
@@ -261,7 +259,7 @@ static int __

[Devel] [PATCH RHEL8 COMMIT] memcg: Fix missing memcg->cache charges during page

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 6fc29b4923f41ef5597536d5a590ef83c59d26a5
Author: Andrey Ryabinin 
Date:   Wed Oct 14 15:38:33 2020 +0300

memcg: Fix missing memcg->cache charges during page migration #PSBM-120653
Date: Fri,  9 Oct 2020 12:52:03 +0300
Message-Id: <20201009095203.12533-1-aryabi...@virtuozzo.com>

Since 44b7a8d33d66 ("mm: memcontrol: do not uncharge old page in
page cache replacement"), mem_cgroup_migrate() charges newpage,
but the ->cache charge is missing there. Add it to fix negative ->cache
values, which lead to WARNINGs like the one below and to soft lockups.
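
How the missing charge turns into a negative counter can be shown with a
small userspace model. This is only a sketch under stated assumptions: the
types and helpers (struct counter, model_migrate and so on) are hypothetical
stand-ins, and it assumes that freeing a page-cache page uncharges both
->memory and ->cache by the page's size.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct page_counter. */
struct counter {
	long usage;
};

static void counter_charge(struct counter *c, long nr_pages)
{
	assert(nr_pages >= 0);
	c->usage += nr_pages;
}

/* Returns false when usage drops below zero (the kernel WARNs there). */
static bool counter_uncharge(struct counter *c, long nr_pages)
{
	assert(nr_pages >= 0);
	c->usage -= nr_pages;
	return c->usage >= 0;
}

struct memcg_model {
	struct counter memory;
	struct counter cache;
};

/* Charge the new page on migration; charge_cache selects the fixed path. */
static void model_migrate(struct memcg_model *m, long nr_pages,
			  bool is_page_cache, bool charge_cache)
{
	counter_charge(&m->memory, nr_pages);
	if (charge_cache && is_page_cache)
		counter_charge(&m->cache, nr_pages);
}

/* Freeing a page-cache page uncharges both counters in this model. */
static bool model_uncharge_cache_page(struct memcg_model *m, long nr_pages)
{
	bool ok = counter_uncharge(&m->memory, nr_pages);

	return counter_uncharge(&m->cache, nr_pages) && ok;
}

/* Returns true if any counter went negative during the scenario. */
static bool scenario_underflows(bool charge_cache_on_migrate)
{
	struct memcg_model m = { { 0 }, { 0 } };
	bool ok = true;

	/* the old page enters the page cache */
	counter_charge(&m.memory, 1);
	counter_charge(&m.cache, 1);
	/* migration charges the new page; the old one is not uncharged here */
	model_migrate(&m, 1, true, charge_cache_on_migrate);
	/* both pages are freed eventually */
	ok &= model_uncharge_cache_page(&m, 1);
	ok &= model_uncharge_cache_page(&m, 1);
	return !ok;
}
```

Without the ->cache charge on migration the second free drives the model's
cache counter below zero; with it, both counters return to zero.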

 WARNING: CPU: 14 PID: 1372 at mm/page_counter.c:62 page_counter_cancel+0x26/0x30

 Call Trace:
  page_counter_uncharge+0x1d/0x30
  uncharge_batch+0x25c/0x2e0
  mem_cgroup_uncharge_list+0x64/0x90
  release_pages+0x33e/0x3c0
  __pagevec_release+0x1b/0x40
  truncate_inode_pages_range+0x358/0x8b0
  ext4_evict_inode+0x167/0x580 [ext4]
  evict+0xd2/0x1a0
  do_unlinkat+0x250/0x2e0
  do_syscall_64+0x5b/0x1a0
  entry_SYSCALL_64_after_hwframe+0x65/0xca

https://jira.sw.ru/browse/PSBM-120653
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df70c3bdd444..134cb27307f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6867,6 +6867,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page 
*newpage)
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
+   if (!PageAnon(newpage) && !PageSwapBacked(newpage))
+   page_counter_charge(&memcg->cache, nr_pages);
css_get_many(&memcg->css, nr_pages);
 
commit_charge(newpage, memcg, false);
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL8 COMMIT] kernel/cgroup: Remove unnecessary cgroup_mutex lock. #PSBM-120670

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 59490dde6e535e0447a8b27857b20bdd4b01e012
Author: Andrey Ryabinin 
Date:   Wed Oct 14 15:41:00 2020 +0300

kernel/cgroup: Remove unnecessary cgroup_mutex lock. #PSBM-120670

Stopping a container makes lockdep complain (see the report below).
We can avoid this simply by removing the cgroup_mutex lock from
cgroup_mark_ve_root(). I believe it is not needed there; it seems to have
been added just in case.

 WARNING: possible circular locking dependency detected
 4.18.0-193.6.3.vz8.4.6+debug #1 Not tainted
 --
 vzctl/36606 is trying to acquire lock:
 88814b195ca0 (kn->count#338){}, at: kernfs_remove_by_name_ns+0x40/0x80

 but task is already holding lock:
 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390

 which lock already depends on the new lock.
 the existing dependency chain (in reverse order) is:

 -> #2 (cgroup_mutex){+.+.}:
__mutex_lock+0x163/0x13d0
cgroup_mark_ve_root+0x1d/0x2e0
ve_state_write+0xb81/0xdc0
cgroup_file_write+0x2da/0x7a0
kernfs_fop_write+0x255/0x410
vfs_write+0x157/0x460
ksys_write+0xb8/0x170
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 -> #1 (&ve->op_sem){}:
down_write+0xa0/0x3d0
ve_state_write+0x6b/0xdc0
cgroup_file_write+0x2da/0x7a0
kernfs_fop_write+0x255/0x410
vfs_write+0x157/0x460
ksys_write+0xb8/0x170
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 -> #0 (kn->count#338){}:
__lock_acquire+0x22cb/0x48c0
lock_acquire+0x14f/0x3b0
__kernfs_remove+0x61e/0x810
kernfs_remove_by_name_ns+0x40/0x80
cgroup_addrm_files+0x531/0x940
css_clear_dir+0xfb/0x200
kill_css+0x8f/0x120
cgroup_destroy_locked+0x246/0x5e0
cgroup_rmdir+0x2f/0x2c0
kernfs_iop_rmdir+0x131/0x1b0
vfs_rmdir+0x142/0x3c0
do_rmdir+0x2b2/0x340
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 other info that might help us debug this:

 Chain exists of:
   kn->count#338 --> &ve->op_sem --> cgroup_mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(cgroup_mutex);
                                lock(&ve->op_sem);
                                lock(cgroup_mutex);
   lock(kn->count#338);

  *** DEADLOCK ***

 4 locks held by vzctl/36606:
  #0: 88813c02c890 (sb_writers#7){.+.+}, at: mnt_want_write+0x3c/0xa0
  #1: 88814414ad48 (&type->i_mutex_dir_key#5/1){+.+.}, at: do_rmdir+0x23c/0x340
  #2: 88811d3054e8 (&type->i_mutex_dir_key#5){}, at: vfs_rmdir+0xb6/0x3c0
  #3: 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390

 Call Trace:
  dump_stack+0x9a/0xf0
  check_noncircular+0x317/0x3c0
  __lock_acquire+0x22cb/0x48c0
  lock_acquire+0x14f/0x3b0
  __kernfs_remove+0x61e/0x810
  kernfs_remove_by_name_ns+0x40/0x80
  cgroup_addrm_files+0x531/0x940
  css_clear_dir+0xfb/0x200
  kill_css+0x8f/0x120
  cgroup_destroy_locked+0x246/0x5e0
  cgroup_rmdir+0x2f/0x2c0
  kernfs_iop_rmdir+0x131/0x1b0
  vfs_rmdir+0x142/0x3c0
  do_rmdir+0x2b2/0x340
  do_syscall_64+0xa5/0x4d0
  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

https://jira.sw.ru/browse/PSBM-120670
Signed-off-by: Andrey Ryabinin 
---
 kernel/cgroup/cgroup.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8420f3547f1a..08137d43f3ab 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1883,7 +1883,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve)
struct css_set *cset;
struct cgroup *cgrp;
 
-   mutex_lock(&cgroup_mutex);
spin_lock_irq(&css_set_lock);
 
rcu_read_lock();
@@ -1899,7 +1898,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve)
rcu_read_unlock();
 
spin_unlock_irq(&css_set_lock);
-   mutex_unlock(&cgroup_mutex);
 }
 
 static struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp)


[Devel] [PATCH RHEL8 COMMIT] mm/memcg: Use per-cpu stock charges for ->kmem and ->cache counters #PSBM-101300

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 84476bf2ce81c2fb9454adc048fbc9e9a0704538
Author: Andrey Ryabinin 
Date:   Wed Oct 14 15:45:51 2020 +0300

mm/memcg: Use per-cpu stock charges for ->kmem and ->cache counters #PSBM-101300

Currently we use per-cpu stocks to precharge the ->memory and ->memsw
counters. Do the same for ->kmem and ->cache to decrease contention
on these counters as well.
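
The idea behind a precharge stock can be sketched in plain userspace C. The
model below is single-threaded and tracks only one counter and one stock, so
it illustrates the batching, not the per-cpu or IRQ handling. CHARGE_BATCH,
struct shared_counter and the helper names are assumptions for this sketch;
the "updates" field merely counts how often the shared counter is touched.

```c
#include <assert.h>
#include <stdbool.h>

#define CHARGE_BATCH 32U	/* illustrative; stands in for MEMCG_CHARGE_BATCH */

struct shared_counter {
	unsigned long usage;
	unsigned long updates;	/* how often the shared counter was touched */
};

struct stock {
	unsigned int nr_pages;	/* pages precharged but not yet used */
};

static void shared_charge(struct shared_counter *c, unsigned long nr)
{
	c->usage += nr;
	c->updates++;
}

static void shared_uncharge(struct shared_counter *c, unsigned long nr)
{
	assert(c->usage >= nr);
	c->usage -= nr;
	c->updates++;
}

/* Take nr pages from the local stock if it holds enough. */
static bool consume_stock(struct stock *s, unsigned int nr)
{
	if (s->nr_pages < nr)
		return false;
	s->nr_pages -= nr;
	return true;
}

/* Return everything left in the stock to the shared counter. */
static void drain_stock(struct stock *s, struct shared_counter *c)
{
	if (s->nr_pages) {
		shared_uncharge(c, s->nr_pages);
		s->nr_pages = 0;
	}
}

/* Keep surplus pages locally; drain if the stock grows past one batch. */
static void refill_stock(struct stock *s, struct shared_counter *c,
			 unsigned int nr)
{
	s->nr_pages += nr;
	if (s->nr_pages > CHARGE_BATCH)
		drain_stock(s, c);
}

/* Charge nr pages, touching the shared counter only on a stock miss. */
static void charge(struct stock *s, struct shared_counter *c, unsigned int nr)
{
	unsigned int batch = nr > CHARGE_BATCH ? nr : CHARGE_BATCH;

	if (consume_stock(s, nr))
		return;
	shared_charge(c, batch);
	if (batch > nr)
		refill_stock(s, c, batch - nr);
}

/* Number of shared-counter updates needed for count one-page charges. */
static unsigned long updates_for_single_page_charges(unsigned int count)
{
	struct shared_counter c = { 0, 0 };
	struct stock s = { 0 };

	for (unsigned int i = 0; i < count; i++)
		charge(&s, &c, 1);
	return c.updates;
}
```

With a batch of 32, each stock miss pays for the following 31 one-page
charges, so the shared counter is updated once per batch instead of once per
page.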

https://jira.sw.ru/browse/PSBM-101300
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 75 +++--
 1 file changed, 51 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 134cb27307f2..b3f97309ca39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2023,6 +2023,8 @@ EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
+   unsigned int cache_nr_pages;
+   unsigned int kmem_nr_pages;
struct work_struct work;
unsigned long flags;
 #define FLUSHING_CACHED_CHARGE 0
@@ -2041,7 +2043,8 @@ static DEFINE_MUTEX(percpu_charge_mutex);
  *
  * returns true if successful, false otherwise.
  */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+   bool cache, bool kmem)
 {
struct memcg_stock_pcp *stock;
unsigned long flags;
@@ -2053,9 +2056,19 @@ static bool consume_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
local_irq_save(flags);
 
stock = this_cpu_ptr(&memcg_stock);
-   if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
-   stock->nr_pages -= nr_pages;
-   ret = true;
+   if (memcg == stock->cached) {
+   if (cache && stock->cache_nr_pages >= nr_pages) {
+   stock->cache_nr_pages -= nr_pages;
+   ret = true;
+   }
+   if (kmem && stock->kmem_nr_pages >= nr_pages) {
+   stock->kmem_nr_pages -= nr_pages;
+   ret = true;
+   }
+   if (!cache && !kmem && stock->nr_pages >= nr_pages) {
+   stock->nr_pages -= nr_pages;
+   ret = true;
+   }
}
 
local_irq_restore(flags);
@@ -2069,13 +2082,21 @@ static bool consume_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
 static void drain_stock(struct memcg_stock_pcp *stock)
 {
struct mem_cgroup *old = stock->cached;
+   unsigned long nr_pages = stock->nr_pages + stock->cache_nr_pages + 
stock->kmem_nr_pages;
+
+   if (stock->cache_nr_pages)
+   page_counter_uncharge(&old->cache, stock->cache_nr_pages);
+   if (stock->kmem_nr_pages)
+   page_counter_uncharge(&old->kmem, stock->kmem_nr_pages);
 
-   if (stock->nr_pages) {
-   page_counter_uncharge(&old->memory, stock->nr_pages);
+   if (nr_pages) {
+   page_counter_uncharge(&old->memory, nr_pages);
if (do_memsw_account())
-   page_counter_uncharge(&old->memsw, stock->nr_pages);
+   page_counter_uncharge(&old->memsw, nr_pages);
css_put_many(&old->css, stock->nr_pages);
stock->nr_pages = 0;
+   stock->kmem_nr_pages = 0;
+   stock->cache_nr_pages = 0;
}
stock->cached = NULL;
 }
@@ -2102,10 +2123,12 @@ static void drain_local_stock(struct work_struct *dummy)
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+   bool cache, bool kmem)
 {
struct memcg_stock_pcp *stock;
unsigned long flags;
+   unsigned long stock_nr_pages;
 
local_irq_save(flags);
 
@@ -2114,9 +2137,17 @@ static void refill_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
drain_stock(stock);
stock->cached = memcg;
}
-   stock->nr_pages += nr_pages;
 
-   if (stock->nr_pages > MEMCG_CHARGE_BATCH)
+   if (cache)
+   stock->cache_nr_pages += nr_pages;
+   else if (kmem)
+   stock->kmem_nr_pages += nr_pages;
+   else
+   stock->nr_pages += nr_pages;
+
+   stock_nr_pages = stock->nr_pages + stock->cache_nr_pages +
+   stock->kmem_nr_pages;
+   if (nr_pages > MEMCG_CHARGE_BATCH)
drain_stock(stock);
 
local_irq_restore(flags);
@@ -2143,9 +2174,11 @@ static void drain_all_stock(str

[Devel] [PATCH RHEL8 COMMIT] sched: Account task_group::start_time

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 52becf1c4e0370bec3a93df03c0b5b920e6fa5e8
Author: Kirill Tkhai 
Date:   Tue Nov 28 16:13:48 2017 +0300

sched: Account task_group::start_time

Extracted from "Initial patch".

Signed-off-by: Kirill Tkhai 

(cherry picked from vz7 commit bad04073f185d257f6a3290523ca02c095837e8b)
Signed-off-by: Konstantin Khorenko 

Rebase to vz8 notes:
* moved from struct timespec to u64 (nsec)
---
 kernel/sched/core.c  | 4 
 kernel/sched/sched.h | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 92062773e632..8a57956d64d6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6095,6 +6095,7 @@ void __init sched_init(void)
 #ifdef CONFIG_CFS_CPULIMIT
root_task_group.topmost_limited_ancestor = &root_task_group;
 #endif
+   root_task_group.start_time = 0;
 #endif /* CONFIG_CGROUP_SCHED */
 
for_each_possible_cpu(i) {
@@ -6413,6 +6414,9 @@ struct task_group *sched_create_group(struct task_group 
*parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;
 
+   /* start_timespec is saved CT0 uptime */
+   tg->start_time = ktime_get_boot_ns();
+
return tg;
 
 err:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4dbf03a3242f..b2f0c26b2c50 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -404,6 +404,9 @@ struct task_group {
struct autogroup*autogroup;
 #endif
 
+   /* Monotonic time in nsecs: */
+   u64 start_time;
+
struct cfs_bandwidthcfs_bandwidth;
 #ifdef CONFIG_CFS_CPULIMIT
 #define MAX_CPU_RATE 1024


[Devel] [PATCH RHEL8 COMMIT] ve: Virtualize sysinfo

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 688c65f8eaf161a15fc2a644b511d7c245d45890
Author: Kirill Tkhai 
Date:   Tue Nov 28 15:15:16 2017 +0300

ve: Virtualize sysinfo

Extracted from "Initial patch".

Signed-off-by: Kirill Tkhai 

(cherry picked from vz7 commit e55cd51304b3271a2adaf43de9b9a5a7be34541e)
Signed-off-by: Konstantin Khorenko 

Port to vz8 notes:
* virtinfo_notifier_call (bc_fill_sysinfo()) is replaced by a
  direct call to si_meminfo_ve(), made only for containers other than VE0.
* "avenrun" is not virtualized yet - need to port first commit
  715f311fdb4a ("sched: Account task_group::cpustat,taskstats,avenrun")
* ve_struct.real_start_time is u64 now instead of timespec
---
 kernel/sys.c | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index cfde07d0ba9f..2644090f8d4b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2545,6 +2545,8 @@ SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned 
__user *, nodep,
return err ? -EFAULT : 0;
 }
 
+extern void si_meminfo_ve(struct sysinfo *si, struct ve_struct *ve);
+
 /**
  * do_sysinfo - fill in sysinfo struct
  * @info: pointer to buffer to fill
@@ -2554,18 +2556,34 @@ static int do_sysinfo(struct sysinfo *info)
unsigned long mem_total, sav_total;
unsigned int mem_unit, bitcount;
struct timespec tp;
+   struct ve_struct *ve;
 
memset(info, 0, sizeof(struct sysinfo));
 
+   si_meminfo(info);
+   si_swapinfo(info);
+
get_monotonic_boottime(&tp);
-   info->uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0);
 
-   get_avenrun(info->loads, 0, SI_LOAD_SHIFT - FSHIFT);
+   ve = get_exec_env();
+   if (ve_is_super(ve)) {
+   info->uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0);
+   get_avenrun(info->loads, 0, SI_LOAD_SHIFT - FSHIFT);
+
+   info->procs = nr_threads;
+   } else {
+   si_meminfo_ve(info, ve);
+   info->uptime = tp.tv_sec + (tp.tv_nsec ? 1 : 0) -
+  ve->real_start_time / NSEC_PER_SEC;
 
-   info->procs = nr_threads;
+   info->procs = nr_threads_ve(ve);
 
-   si_meminfo(info);
-   si_swapinfo(info);
+#if 0
+FIXME after
+715f311fdb4a ("sched: Account task_group::cpustat,taskstats,avenrun") is ported
+   get_avenrun_ve(info->loads, 0, SI_LOAD_SHIFT - FSHIFT);
+#endif
+   }
 
/*
 * If the sum of all the available memory (i.e. ram + swap)


[Devel] [PATCH RHEL8 COMMIT] ve/time: Add comment in ve_start_container() on start time initialization

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 957f72c6b846e1d3fdf9072080fecbd10d08d5ad
Author: Konstantin Khorenko 
Date:   Tue Oct 6 19:20:45 2020 +0300

ve/time: Add comment in ve_start_container() on start time initialization

Fixes: e931118f8139 ("ve: Add ve cgroup and ve_hook subsys")

Signed-off-by: Konstantin Khorenko 
---
 kernel/ve/ve.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 1688407562d4..ac2252445841 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -398,6 +398,11 @@ static int ve_start_container(struct ve_struct *ve)
if (task_active_pid_ns(tsk) != tsk->nsproxy->pid_ns_for_children)
return -ECHILD;
 
+   /*
+* Setup uptime for new containers only, if restored
+* the value won't be zero here already but setup from
+* cgroup write while resuming the container.
+*/
if (ve->start_time == 0) {
ve->start_time = tsk->start_time;
ve->real_start_time = tsk->real_start_time;


[Devel] [PATCH RHEL8 COMMIT] ve/time: Fix VE uptime virtualization to use u64 start_time

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 0dc3f0f110adc2356506e976b8d359744b0ba2f9
Author: Cyrill Gorcunov 
Date:   Thu Feb 11 13:02:13 2016 +0400

ve/time: Fix VE uptime virtualization to use u64 start_time

Fixes: a3c4d1d8f383 ("ve/time: Customize VE uptime")

Signed-off-by: Konstantin Khorenko 

+++
ve: Use @real_start_timespec in uptime_proc_show

uptime_proc_show uses boot-based clocks, so we should use
@real_start_timespec here instead. This seems to have been a typo
made while converting from the pcs6 code.

In scope of
https://jira.sw.ru/browse/PSBM-41406

Signed-off-by: Cyrill Gorcunov 
Reviewed-by: Vladimir Davydov 

vdavydov@:
This hunk was a part of
  diff-cpt-record-ct-boot-based-start-time-to-show-correct-uptime
which was skipped during rebase to RH7 because it was considered cpt-related.

(cherry picked from vz7 commit 55b9202e39282f2a21773fd1fd99317bc6e07ddd)
Signed-off-by: Konstantin Khorenko 
---
 fs/proc/uptime.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 9a08b8a92c13..bc07d42ce9f5 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -38,7 +38,7 @@ FIXME:to be reworked anyway in
 
 static int uptime_proc_show(struct seq_file *m, void *v)
 {
-   struct timespec uptime;
+   struct timespec uptime, offset;
struct timespec64 idle;
 
if (ve_is_super(get_exec_env()))
@@ -58,9 +58,10 @@ FIXME:  to be reworked anyway in
get_monotonic_boottime(&uptime);
 #ifdef CONFIG_VE
if (!ve_is_super(get_exec_env())) {
+   offset = ns_to_timespec(get_exec_env()->real_start_time);
set_normalized_timespec(&uptime,
-   uptime.tv_sec - get_exec_env()->start_timespec.tv_sec,
-   uptime.tv_nsec - 
get_exec_env()->start_timespec.tv_nsec);
+uptime.tv_sec - offset.tv_sec,
+uptime.tv_nsec - offset.tv_nsec);
}
 #endif
seq_printf(m, "%lu.%02lu %lu.%02lu\n",


[Devel] [PATCH RHEL8 COMMIT] ve: Add interface for ve::clock_[monotonic|bootbased] adjustment

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 25cab3041305ed84af1340396427c299025abcc4
Author: Cyrill Gorcunov 
Date:   Thu Feb 11 12:59:24 2016 +0400

ve: Add interface for ve::clock_[monotonic|bootbased] adjustment

These two members represent the monotonic and boot-based clocks for a
container's uptime. While a container is suspended (or is being
moved to another node) we treat the monotonic and boot-based
clocks as stopped, so we need to account for the time delta
on restore and adjust these members.

Moreover, these timestamps are involved in posix-timers
setup, so when an application sets up monotonic clocks
after the restore (with an absolute time specification), we
adjust the values as well.

The application that migrates a container must fetch
the current settings from /sys/fs/cgroup/ve/$VE/ve.real_start_timespec
and /sys/fs/cgroup/ve/$VE/ve.start_timespec, then write them
back on restore.
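
The read/write arithmetic of such an interface can be sketched in userspace.
The model below assumes what the handlers in this patch do: a read reports
"now - start" and a write stores "now - value" as the new start. The struct
and function names are hypothetical, and time is passed in explicitly instead
of being read from a kernel clock.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct ve_clock_model {
	uint64_t start_time;	/* host clock value at container start, in ns */
};

/* Container uptime as seen at host time now_ns. */
static uint64_t ve_clock_read(const struct ve_clock_model *ve, uint64_t now_ns)
{
	return now_ns - ve->start_time;
}

/* Set the start so that a read at now_ns yields uptime_ns. */
static void ve_clock_write(struct ve_clock_model *ve, uint64_t now_ns,
			   uint64_t uptime_ns)
{
	/* May wrap when the container is "older" than the host; that is fine
	 * because the read subtracts with the same unsigned arithmetic. */
	ve->start_time = now_ns - uptime_ns;
}

/* Save uptime on the source host, restore it on the destination host. */
static uint64_t uptime_after_migration(uint64_t src_start_ns,
				       uint64_t src_now_ns,
				       uint64_t dst_now_ns,
				       uint64_t dst_elapsed_ns)
{
	struct ve_clock_model src = { src_start_ns };
	struct ve_clock_model dst = { 0 };
	uint64_t saved = ve_clock_read(&src, src_now_ns);

	ve_clock_write(&dst, dst_now_ns, saved);
	assert(ve_clock_read(&dst, dst_now_ns) == saved);
	return ve_clock_read(&dst, dst_now_ns + dst_elapsed_ns);
}
```

A container with 600 seconds of uptime restored on a host that has been up
for only 50 seconds keeps counting from 600 seconds in this model, even
though the stored start value wraps around.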

https://jira.sw.ru/browse/PSBM-41311
https://jira.sw.ru/browse/PSBM-41406

v2:
 - use clock_[monotonic|bootbased] for cgroup entry names instead

Original-by: Andrew Vagin 
Signed-off-by: Cyrill Gorcunov 
Reviewed-by: Vladimir Davydov 

(cherry picked from vz7 commit 43f4b0c752abd84aa1b346373d152941123d2446
("ve: Add interface for @start_timespec and @real_start_timespec
adjustmen"))

Signed-off-by: Konstantin Khorenko 
---
 kernel/ve/ve.c | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index ac2252445841..cc26d3b2fa9b 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -925,6 +925,66 @@ static ssize_t ve_os_release_write(struct kernfs_open_file 
*of, char *buf,
return ret ? ret : nbytes;
 }
 
+enum {
+   VE_CF_CLOCK_MONOTONIC,
+   VE_CF_CLOCK_BOOTBASED,
+};
+
+static int ve_ts_read(struct seq_file *sf, void *v)
+{
+   struct ve_struct *ve = css_to_ve(seq_css(sf));
+   struct timespec ts;
+   u64 now, delta;
+
+   switch (seq_cft(sf)->private) {
+   case VE_CF_CLOCK_MONOTONIC:
+   now = ktime_get_ns();
+   delta = ve->start_time;
+   break;
+   case VE_CF_CLOCK_BOOTBASED:
+   now = ktime_get_boot_ns();
+   delta = ve->real_start_time;
+   break;
+   default:
+   now = delta = 0;
+   WARN_ON_ONCE(1);
+   break;
+   }
+
+   ts = ns_to_timespec(now - delta);
+   seq_printf(sf, "%ld %ld", ts.tv_sec, ts.tv_nsec);
+   return 0;
+}
+
+static ssize_t ve_ts_write(struct kernfs_open_file *of, char *buf,
+  size_t nbytes, loff_t off)
+{
+   struct ve_struct *ve = css_to_ve(of_css(of));
+   struct timespec delta;
+   u64 delta_ns, now, *target;
+
+   if (sscanf(buf, "%ld %ld", &delta.tv_sec, &delta.tv_nsec) != 2)
+   return -EINVAL;
+   delta_ns = timespec_to_ns(&delta);
+
+   switch (of_cft(of)->private) {
+   case VE_CF_CLOCK_MONOTONIC:
+   now = ktime_get_ns();
+   target = &ve->start_time;
+   break;
+   case VE_CF_CLOCK_BOOTBASED:
+   now = ktime_get_boot_ns();
+   target = &ve->real_start_time;
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+
+   *target = now - delta_ns;
+   return nbytes;
+}
+
 static u64 ve_netns_max_nr_read(struct cgroup_subsys_state *css, struct cftype 
*cft)
 {
return css_to_ve(css)->netns_max_nr;
@@ -994,6 +1054,20 @@ static struct cftype ve_cftypes[] = {
.read_u64   = ve_iptables_mask_read,
.write_u64  = ve_iptables_mask_write,
},
+   {
+   .name   = "clock_monotonic",
+   .flags  = CFTYPE_NOT_ON_ROOT,
+   .seq_show   = ve_ts_read,
+   .write  = ve_ts_write,
+   .private= VE_CF_CLOCK_MONOTONIC,
+   },
+   {
+   .name   = "clock_bootbased",
+   .flags  = CFTYPE_NOT_ON_ROOT,
+   .seq_show   = ve_ts_read,
+   .write  = ve_ts_write,
+   .private= VE_CF_CLOCK_BOOTBASED,
+   },
 #endif
{
.name   = "netns_max_nr",

[Devel] [PATCH RHEL8 COMMIT] ve/time: rework times() syscall and /proc/[pid]/stat to handle u64 time offsets

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit dcfbbefff2072fa969eb5ec103748a6b74d077c6
Author: Konstantin Khorenko 
Date:   Tue Oct 6 19:04:32 2020 +0300

ve/time: rework times() syscall and /proc/[pid]/stat to handle u64 time offsets

ve_struct.{start_time,real_start_time} are u64 now, change the code
correspondingly.

Drop duplicated fields start_timespec/real_start_timespec in ve_struct.
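
With a u64 start time the relative clock reduces to a saturating subtraction
followed by a unit conversion. A minimal userspace sketch follows; it assumes
100 clock ticks per second (MODEL_USER_HZ is a made-up name for this model),
and the helper names are illustrative rather than the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define MODEL_USER_HZ 100ULL	/* assumption: 100 clock ticks per second */

/* Nanoseconds elapsed since container start, clamped at zero. */
static uint64_t ve_relative_ns(uint64_t now_ns, uint64_t ve_start_ns)
{
	return now_ns > ve_start_ns ? now_ns - ve_start_ns : 0;
}

/* Convert nanoseconds to clock ticks, rounding down. */
static uint64_t ns_to_ticks(uint64_t ns)
{
	return ns / (NSEC_PER_SEC / MODEL_USER_HZ);
}

/* Value a times()-like call would report inside the container. */
static uint64_t ve_times_ticks(uint64_t now_ns, uint64_t ve_start_ns)
{
	uint64_t ticks = ns_to_ticks(ve_relative_ns(now_ns, ve_start_ns));

	/* The container can never look older than the host clock itself. */
	assert(ticks <= ns_to_ticks(now_ns));
	return ticks;
}
```

The clamp matters for the case where the recorded start lies after "now":
the result is 0 rather than a huge wrapped value.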

Fixes: f2716576136d ("ve/time: Use ve_relative_clock in times() syscall
and /proc/[pid]/stat")

Signed-off-by: Konstantin Khorenko 
---
 fs/proc/array.c|  7 ++-
 include/linux/ve.h |  4 
 kernel/sys.c   | 19 ---
 3 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index e85d0caa6efa..e9b6e403858a 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -546,11 +546,8 @@ static int do_task_stat(struct seq_file *m, struct 
pid_namespace *ns,
 
 #ifdef CONFIG_VE
if (!is_super) {
-   struct timespec *ve_start_ts =
-   &get_exec_env()->real_start_timespec;
-   start_time -=
-   (unsigned long long)ve_start_ts->tv_sec * NSEC_PER_SEC
-   + ve_start_ts->tv_nsec;
+   u64 offset = get_exec_env()->real_start_time;
+   start_time -= (unsigned long long)offset;
}
/* tasks inside a CT can have negative start time e.g. if the CT was
 * migrated from another hw node, in which case we will report 0 in
diff --git a/include/linux/ve.h b/include/linux/ve.h
index b659e779cb49..0db98e8e08c1 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -52,10 +52,6 @@ struct ve_struct {
struct net_device   *venet_dev;
 #endif
 
-/* per VE CPU stats*/
-   struct timespec start_timespec; /* monotonic time */
-   struct timespec real_start_timespec;/* boot based time */
-
/* see vzcalluser.h for VE_FEATURE_XXX definitions */
__u64   features;
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 2644090f8d4b..df02329b0e5c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -957,16 +957,13 @@ static void do_sys_times(struct tms *tms)
 }
 
 #ifdef CONFIG_VE
-unsigned long long ve_relative_clock(struct timespec * ts)
+static u64 ve_relative_clock(u64 time)
 {
-   unsigned long long offset = 0;
+   u64 offset = 0;
+   struct ve_struct *ve = get_exec_env();
 
-   if (ts->tv_sec > get_exec_env()->start_timespec.tv_sec ||
-   (ts->tv_sec == get_exec_env()->start_timespec.tv_sec &&
-ts->tv_nsec >= get_exec_env()->start_timespec.tv_nsec))
-   offset = (unsigned long long)(ts->tv_sec -
-   get_exec_env()->start_timespec.tv_sec) * NSEC_PER_SEC
-   + ts->tv_nsec - get_exec_env()->start_timespec.tv_nsec;
+   if (time > ve->start_time)
+   offset = time - ve->start_time;
return nsec_to_clock_t(offset);
 }
 #endif
@@ -974,7 +971,7 @@ unsigned long long ve_relative_clock(struct timespec * ts)
 SYSCALL_DEFINE1(times, struct tms __user *, tbuf)
 {
 #ifdef CONFIG_VE
-   struct timespec now;
+   u64 now;
 #endif
 
if (tbuf) {
@@ -989,9 +986,9 @@ SYSCALL_DEFINE1(times, struct tms __user *, tbuf)
return (long) jiffies_64_to_clock_t(get_jiffies_64());
 #else
/* Compare to calculation in fs/proc/array.c */
-   ktime_get_ts(&now);
+   now = ktime_get_ns();
force_successful_syscall_return();
-   return ve_relative_clock(&now);
+   return (long) ve_relative_clock(now);
 #endif
 }
 


[Devel] [PATCH RHEL8 COMMIT] ve/futex/timeout: adjust futex timeout to absolule

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 1bc875e8a7d673ed9c10e48cd4f0fd6956dc5769
Author: Kirill Tkhai 
Date:   Wed Aug 6 15:37:15 2014 +0400

ve/futex/timeout: adjust futex timeout to absolule

This converts ve-absolute-monotonic time to global-absolute-monotonic time.
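
The conversion rule can be sketched in userspace C. The model below follows
the cases handled in the diff: only commands whose timeout is absolute are
shifted by the container start time, and realtime timeouts are left alone.
The enum values and function names are hypothetical and only mirror those
cases; they are not the kernel's FUTEX_* constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command names; they only mirror the cases in the patch. */
enum model_futex_cmd {
	MODEL_FUTEX_WAIT,		/* relative timeout */
	MODEL_FUTEX_WAIT_BITSET,	/* absolute timeout */
	MODEL_FUTEX_WAIT_REQUEUE_PI,	/* absolute timeout */
};

static bool timeout_is_absolute(enum model_futex_cmd cmd)
{
	return cmd == MODEL_FUTEX_WAIT_BITSET ||
	       cmd == MODEL_FUTEX_WAIT_REQUEUE_PI;
}

/*
 * Translate a monotonic timeout given in container time into host time.
 * clock_realtime models the FUTEX_CLOCK_REALTIME flag: wall-clock timeouts
 * are not container-relative and stay untouched.
 */
static int64_t futex_timeout_to_host_ns(enum model_futex_cmd cmd,
					bool clock_realtime,
					int64_t timeout_ns,
					int64_t ve_start_ns)
{
	assert(timeout_ns >= 0);
	if (clock_realtime || !timeout_is_absolute(cmd))
		return timeout_ns;
	return timeout_ns + ve_start_ns;
}
```

Relative timeouts need no translation because a duration is the same on both
clocks; only a point in time depends on where the clock starts.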

https://jira.sw.ru/browse/PSBM-14471

diff-futex-reference-ct-monotonic-clock-from-ct-start

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Kirill Tkhai 

(cherry picked from vz7 commit 14a4db52ee8c862eb7a9dec740b15c646e0b59aa)
Signed-off-by: Konstantin Khorenko 
---
 kernel/futex.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5282b74fc31b..9947cb4db384 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -3618,6 +3619,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t 
*timeout,
 {
int cmd = op & FUTEX_CMD_MASK;
unsigned int flags = 0;
+   ktime_t abs_time;
 
if (!(op & FUTEX_PRIVATE_FLAG))
flags |= FLAGS_SHARED;
@@ -3627,6 +3629,12 @@ long do_futex(u32 __user *uaddr, int op, u32 val, 
ktime_t *timeout,
if (cmd != FUTEX_WAIT && cmd != FUTEX_WAIT_BITSET && \
cmd != FUTEX_WAIT_REQUEUE_PI)
return -ENOSYS;
+   } else if (timeout) {
+   if (cmd == FUTEX_WAIT_BITSET || cmd == FUTEX_WAIT_REQUEUE_PI) {
+   abs_time = ktime_add(*timeout, ns_to_ktime(
+get_exec_env()->start_time));
+   timeout = &abs_time;
+   }
}
 
switch (cmd) {


[Devel] [PATCH RHEL8 COMMIT] ve/posix-timers: reference ve monotonic clock from ve start (v2)

2020-10-14 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.12
-->
commit 5f2799f1c2d5997984d15ffc901314eedc6b1295
Author: Kirill Tkhai 
Date:   Tue Aug 5 13:22:35 2014 +0400

ve/posix-timers: reference ve monotonic clock from ve start (v2)

So that CLOCK_MONOTONIC will be monotonic even if ve is migrated to
another hw node.

Note, translating ve <-> abs time in clock_settime and timer_settime is
not necessary because (1) clock_settime won't set monotonic clock and
(2) timer_gettime always returns relative time.

https://jira.sw.ru/browse/PSBM-13860

diff-posix_timers-reference-ct-monotonic-clock-from-ct-start

Signed-off-by: Vladimir Davydov 

Acked-by: Pavel Emelyanov 
Signed-off-by: Kirill Tkhai 

+++
ve/posix-timers: reference ve monotonic clock from start in clock_nanosleep

This is an addition to diff-posix_timers-reference-ct-monotonic-clock-from-ct-start

Otherwise, apps that use sys_clock_nanosleep() to suspend their
execution can hang after ve migration.


diff-posix-timers-reference-ve-monotonic-clock-from-ve-start-in-clock_nanosleep

Signed-off-by: Vladimir Davydov 

Acked-by: Konstantin Khlebnikov 
Acked-by: Pavel Emelyanov 
Signed-off-by: Kirill Tkhai 

+++
timers: Port diff-ve-timers-convert-ve-monotonic-to-abs-time-when-setting-timerfd-2

We need this for Docker, because systemd-tmpfiles-clean.timer inside
a PCS7 CT sometimes spams dbus with requests to start the corresponding
service, while Docker is trying to create a cgroup for a container and
attach it to hierarchies such as memory and blkio.

This happens because the systemd timer is armed through a non-virtualized
timerfd, which uses the plain host clock, while the check that the timer
has really expired uses the virtualized clock_gettime and does not pass
until the proper (in-container) expiry time. So the timer fires again and
again, and starting the service turns into a busy loop.

https://jira.sw.ru/browse/PSBM-34017

v2: move the stubs to ve.h

Port the following RH6 commit:

  Author: Vladimir Davydov
  Email: vdavy...@parallels.com
  Subject: fs: convert ve monotonic to abs time when setting timerfd
  Date: Fri, 15 Feb 2013 11:57:09 +0400

  * [timers] corrected TFD_TIMER_ABSTIME timer handling,
the issue led to high cpu usage inside a Fedora 18 CT
by 'init' process (PSBM-18284)

  Monotonic time inside a container, as obtained through various
  system calls such as clock_gettime, is reported since the start of the
  container, not since the start of the whole system. This was done to
  avoid time issues when a container is migrated between different physical
  hosts, but it also introduced a lot of problems in time-related system
  calls, because absolute monotonic time passed to those system calls is in
  fact relative to the container and must be converted to system-wide
  monotonic time, which is used by kernel hrtimers.

  One of those buggy system calls is timerfd_settime which accepts as an
  argument absolute time if flag TFD_TIMER_ABSTIME is specified.

  The patch fixes it by converting container monotonic time to
  system-wide monotonic time using the monotonic_ve_to_abs() function,
  which was introduced earlier and is now exported for that reason.
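
  As an illustration only, the relation between the two time bases can be
  sketched in userspace C. The struct and function names below are
  hypothetical and do not reproduce the signature of the kernel's
  monotonic_ve_to_abs(); the sketch assumes the container clock is simply
  the host monotonic clock minus a fixed per-container start offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: host monotonic time (ns) at which the container's
 * monotonic clock reads zero. This is not a kernel structure. */
struct ve_clock {
	int64_t start_ns;
};

/* What clock_gettime(CLOCK_MONOTONIC) would report inside the container. */
static int64_t host_to_ve_ns(const struct ve_clock *ve, int64_t host_ns)
{
	assert(ve != NULL);
	return host_ns - ve->start_ns;
}

/* What an absolute deadline given in container time (for example with
 * TFD_TIMER_ABSTIME) has to become before it is handed to a timer that
 * runs on the host-wide clock. */
static int64_t ve_to_host_ns(const struct ve_clock *ve, int64_t ve_ns)
{
	assert(ve != NULL);
	return ve_ns + ve->start_ns;
}
```

  If the second conversion is skipped, a deadline expressed in container
  time is compared against the host clock, which is the mismatch described
  above.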

  https://jira.sw.ru/browse/PSBM-18284

  Signed-off-by: Vladimir Davydov 

Signed-off-by: Pavel Tikhomirov 
Signed-off-by: Kirill Tkhai 
Reviewed-by: Vladimir Davydov 

(cherry picked from vz7 commit 869542c24c41c0578b47d2ef83cfa63427e0e5e1)
Signed-off-by: Konstantin Khorenko 

+++
timers should not get negative argument

This patch fixes a 25-second delay on login to systemd-based containers.

A userspace application can set a timer for a time in the past
and expect the timer to expire immediately.

This may not work as expected inside migrated containers.
The translated argument passed to the timer can become negative,
and the corresponding timer will then sleep for a very long time.
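
A minimal sketch of the failure mode, in userspace C with hypothetical
names. It assumes that translation adds a per-container offset to the
requested deadline and that, after migration, this offset may be negative.
Clamping the result to zero is shown as one plausible remedy, not
necessarily what the patch itself does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel function. Translates a deadline given
 * in container monotonic time to host monotonic time. Assumes the sum
 * fits in int64_t. */
static int64_t translate_deadline_ns(int64_t ve_deadline_ns, int64_t offset_ns)
{
	int64_t host_ns;

	/* Callers are assumed to have rejected negative user input already. */
	assert(ve_deadline_ns >= 0);

	host_ns = ve_deadline_ns + offset_ns;	/* may drop below zero */

	/* A negative result means the deadline is already in the past,
	 * so report it as zero and let the timer expire at once. */
	return host_ns < 0 ? 0 : host_ns;
}
```

With an offset of -100 ns, a deadline of 5 ns translates to 0 rather
than -95, so the timer is treated as already expired.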

https://jira.sw.ru/browse/PSBM-48475

CC: Vladimir Davydov 
CC: Konstantin Khorenko 
Signed-off-by: Vasily Averin 
Acked-by: Cyrill Gorcunov 

(cherry picked from vz7 commit a71fa19facb00472e47760255ab2e6fa16885732)
Signed-off-by: Konstantin Khorenko 
---
 fs/timerfd.c   |  8 +--
 include/linux/ve.h |  8 +++
 kernel/time/posix-timers.c | 54 --
 3 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index cdad49da3ff7..59ed38c29941 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -26,6 +26,7 @@
 #include 
 #i

Re: [Devel] [PATCH RH7] overlayfs: avoid permission check for priveleged processes

2020-10-14 Thread Andrey Zhadchenko
On Wed, 14 Oct 2020 12:33:53 +0300
Pavel Tikhomirov  wrote:

> On 10/14/20 2:05 AM, Andrey Zhadchenko wrote:
> > Overlayfs temporarily overrides credentials in the copy_up function
> > with the ones that were used to create the mount.  
> 
> > Unfortunately vfs_setxattr requires CAP_SYS_ADMIN
> > capability in current user namespace.  
> 
> No, if it was so, it would be no error =) To be correct we should say:
> 
> Function vfs_setxattr for "trusted." attrs requires CAP_SYS_ADMIN in 
> current ve's userns.
> 

Yes, it is exactly as you say. "trusted" prefix requires CAP_SYS_ADMIN.
I will fix commit message.

> It is done so to mimic mainstream behaviour for containers(ves), so
> that container user can't set "trusted." xattrs if it is in non-root 
> container userns.
> 
> > This leads to strange situations.
> > For example, if overlayfs mount was made inside ve it is impossible
> > to use copy_up from init_user_ns even with CAP_SYS_ADMIN. This is
> > because overridden credentials are not sufficient in init_user_ns to
> > set xattr to file. This is also required for criu since copy_up can
> > be triggered on dump stage: reading inotify fhandle from /proc may
> > start copy_up.  
> 
> I hope that overlayfs overrides credentials exactly to be able to
> pass those checks. In mainstream kernel overlayfs can be used from any
> userns, but can be only mounted from init_user_ns, so credentials
> always change to "more permissive". So it should be safe to skip
> override in case we are already "more permissive" than the superblock's
> credentials.
> 

At least in vz7 you can currently unshare and mount overlayfs.
But this disables nfs_export (dmesg below), since overlayfs won't be
able to set trusted xattrs.

unshare -m -p -f -U -r bash
mkdir lower upper merged work
mount -t overlay overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work ./merged
dmesg | tail -n 10
[708824.300397] overlayfs: upper fs does not support xattr, falling back to index=off.
[708824.300399] overlayfs: NFS export requires "index=on", falling back to nfs_export=off.

> > 
> > Add an option to avoid vfs_setxattr CAP_SYS_ADMIN check if current
> > credentials have CAP_SYS_ADMIN in namespace that is recorded in
> > overlayfs mount superblock.  
> 
> Sorry, but looking at the code I don't see how it works... There are
> only three codepaths here:
> 
>+-< ovl_do_setxattr_ext
>  +-< ovl_do_setxattr #1 sets propagate_cap to false
>  +-< ovl_check_setxattr_ext
>  | +-< ovl_set_origin_ext
>  | | +-< ovl_set_origin #2 sets propagate_cap to false
>  | +-< ovl_check_setxattr #3 sets propagate_cap to false
> 
> And on all of them we don't "propagate_cap". Probably I'm missing 
> something though.
> 

Yes, I lost this during polishing.
It should look like this:

+-< ovl_copy_up_flags #checks for propagate_cap, overrides creds
  +-< ovl_copy_up_one #fills ctx with propagate_cap
  | +-< ovl_copy_up_locked
  | | +-< ovl_copy_up_inode # calls ovl_do_setxattr_ext with propagate_cap from ctx <- I missed that
  | | | +-< ovl_do_setxattr_ext

> > 
> > https://jira.sw.ru/browse/PSBM-108122
> > Signed-off-by: Andrey Zhadchenko 
> > ---
> >   fs/overlayfs/copy_up.c   | 25 +++--
> >   fs/overlayfs/overlayfs.h | 39 ++-
> >   fs/overlayfs/util.c      | 32 
> >   fs/xattr.c               |  2 +-
> >   4 files changed, 74 insertions(+), 24 deletions(-)
> > 
> > diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> > index 1564a35..d6b285f 100644
> > --- a/fs/overlayfs/copy_up.c
> > +++ b/fs/overlayfs/copy_up.c
> > @@ -20,6 +20,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include "overlayfs.h"
> >   
> >   #define OVL_COPY_UP_CHUNK_SIZE (1 << 20)
> > @@ -321,8 +322,8 @@ out:
> > return fh;
> >   }
> >   
> > -int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
> > -  struct dentry *upper)
> > +int ovl_set_origin_ext(struct dentry *dentry, struct dentry *lower,
> > +  struct dentry *upper, int propagate_cap)
> >   {
> > const struct ovl_fh *fh = NULL;
> > int err;
> > @@ -341,8 +342,8 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
> > /*
> >  * Do not fail when upper doesn't support xattrs.
> >  */
> > -   err = ovl_check_setxattr(dentry, upper, OVL_XATTR_ORIGIN, fh,
> > -fh ? fh->len : 0, 0);
> > +   err = ovl_check_setxattr_ext(dentry, upper, OVL_XATTR_ORIGIN, fh,
> > +fh ? fh->len : 0, 0, propagate_cap);
> > kfree(fh);
> >   
> > return err;
> > @@ -433,6 +434,7 @@ struct ovl_copy_up_ctx {
> > struct dentry *destdir;
> > struct qstr destname;
> > struct dentry *workdir;
> > +   int propagate_cap;
> > bool tmpfile;
> > bool origin;
> > bool indexed;
> > @@ -711,7 +713,7 @@ out:
> >   }
> >   
> >   static int ovl_copy_up_one(struct dentry *parent, struct dentry
> > *dentry,