Re: [Devel] [PATCH RHEL7] ploop: zero-out block device statistics at ploop_stop

2020-09-22 Thread Kirill Tkhai
On 22.09.2020 18:57, Valeriy Vdovin wrote:
> A ploop block device is represented by a block device file in /dev, but
> its lifecycle is separated from the file itself by the PLOOP_IOC_START and
> PLOOP_IOC_STOP ioctls. This way the ploop file in /dev can be an empty
> placeholder after a PLOOP_IOC_STOP ioctl and be reinitialized later by
> PLOOP_IOC_START. Because of that, some important data structures
> stay allocated after stop and keep their old values until and after restart.
> The same holds for block device statistics, which remain unchanged
> after the end of the ploop device lifecycle. A freshly started ploop device
> should be a new entity with stats equal to zero, so we zero out the stats at
> ploop_stop.
> 
> https://jira.sw.ru/browse/PSBM-95605
> 
> Signed-off-by: Valeriy.Vdovin 

Reviewed-by: Kirill Tkhai 

> ---
>  drivers/block/ploop/dev.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index ac4d142..c54ff90 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -4373,6 +4373,9 @@ static int ploop_stop(struct ploop_device * plo, struct block_device *bdev)
>  
>  	clear_bit(PLOOP_S_RUNNING, &plo->state);
>  
> +	part_stat_set_all(&plo->disk->part0, 0);
> +	memset(&plo->st, 0, sizeof(plo->st));
> +
>  	del_timer_sync(&plo->mitigation_timer);
>  	del_timer_sync(&plo->freeze_timer);
>  
> 

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RHEL7] ploop: zero-out block device statistics at ploop_stop

2020-09-22 Thread Valeriy Vdovin
A ploop block device is represented by a block device file in /dev, but
its lifecycle is separated from the file itself by the PLOOP_IOC_START and
PLOOP_IOC_STOP ioctls. This way the ploop file in /dev can be an empty
placeholder after a PLOOP_IOC_STOP ioctl and be reinitialized later by
PLOOP_IOC_START. Because of that, some important data structures
stay allocated after stop and keep their old values until and after restart.
The same holds for block device statistics, which remain unchanged
after the end of the ploop device lifecycle. A freshly started ploop device
should be a new entity with stats equal to zero, so we zero out the stats at
ploop_stop.

https://jira.sw.ru/browse/PSBM-95605

Signed-off-by: Valeriy.Vdovin 
---
 drivers/block/ploop/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index ac4d142..c54ff90 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4373,6 +4373,9 @@ static int ploop_stop(struct ploop_device * plo, struct block_device *bdev)
 
 	clear_bit(PLOOP_S_RUNNING, &plo->state);
 
+	part_stat_set_all(&plo->disk->part0, 0);
+	memset(&plo->st, 0, sizeof(plo->st));
+
 	del_timer_sync(&plo->mitigation_timer);
 	del_timer_sync(&plo->freeze_timer);
 
-- 
1.8.3.1
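
For readers outside the kernel tree, the idea of the patch can be sketched in
plain userspace C. This is only a model under stated assumptions: fake_dev,
dev_stats and the fake_* helpers are hypothetical names, not ploop code. It
shows a device object that outlives start/stop cycles and whose counters are
wiped at stop, so the next start begins from zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the per-device statistics block. */
struct dev_stats {
	unsigned long ios[2];		/* completed reads / writes */
	unsigned long sectors[2];	/* sectors read / written */
};

/* Hypothetical device object that stays allocated across start/stop. */
struct fake_dev {
	bool running;
	struct dev_stats st;
};

static void fake_start(struct fake_dev *dev)
{
	dev->running = true;
}

/* Account one write request; ignored while the device is stopped. */
static void fake_account_write(struct fake_dev *dev, unsigned long sectors)
{
	if (!dev->running)
		return;
	dev->st.ios[1]++;
	dev->st.sectors[1] += sectors;
}

/* Stop the device and zero its counters, mirroring the memset() in the patch. */
static void fake_stop(struct fake_dev *dev)
{
	dev->running = false;
	memset(&dev->st, 0, sizeof(dev->st));
}
```

Without the memset() in fake_stop(), a second fake_start() would expose the
counters accumulated during the previous lifecycle, which is the behaviour the
patch removes.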



[Devel] [PATCH RHEL8 COMMIT] ms/teach move_mount(2) to work with OPEN_TREE_CLONE

2020-09-22 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit d16c52de7d40f97b428cc00eb931cc0e1912ecdd
Author: Pavel Tikhomirov 
Date:   Tue Sep 22 18:56:58 2020 +0300

ms/teach move_mount(2) to work with OPEN_TREE_CLONE

Patchset description:
These syscalls were added as a preparation step for the new mount API (fsopen,
fsconfig, fsmount and fspick will be ported separately).

We can use them to implement "cross-namespace bind-mounting" like this:

fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
setns(nsfd, CLONE_NEWNS);
move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

This will allow us to add bind mounts to a running container
instead of relying on unreliable external propagation.

Version for VZ8 is slightly different from VZ7 version.

https://jira.sw.ru/browse/PSBM-107263

Current patch description:
From: David Howells 

Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
attached by move_mount(2).

If by the time of final fput() of OPEN_TREE_CLONE-opened file its tree is
not detached anymore, it won't be dissolved.  move_mount(2) is adjusted
to handle detached source.

That gives us equivalents of mount --bind and mount --rbind.
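
The three-call sequence from the patchset description can be wrapped into one
userspace helper. This is a hedged sketch, not code from the series:
bind_into_mnt_ns() is a hypothetical name, the calls go through syscall(2) on
the assumption that libc has no wrappers, and the fallback numbers 428 and 429
are the ones added to the x86 syscall tables by these patches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback values from the uapi headers, used only if libc lacks them. */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif

/*
 * Hypothetical helper: bind-mount src (resolved in the caller's mount
 * namespace) at dst inside the mount namespace referred to by nsfd.
 * Returns 0 on success, -1 with errno set on failure.
 * On success the calling thread remains in the target namespace.
 */
static int bind_into_mnt_ns(const char *src, int nsfd, const char *dst)
{
	int ret = -1;
	int saved_errno;
	int fd;

	/* Create a detached copy of the mount at src. */
	fd = (int)syscall(SYS_open_tree, AT_FDCWD, src, OPEN_TREE_CLONE);
	if (fd < 0)
		return -1;

	/* Switch namespaces, then attach the detached tree at dst. */
	if (setns(nsfd, CLONE_NEWNS) == 0)
		ret = (int)syscall(SYS_move_mount, fd, "", AT_FDCWD, dst,
				   MOVE_MOUNT_F_EMPTY_PATH);

	/* If the tree was attached it survives this close; otherwise it is dissolved. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return ret;
}
```

A caller would typically obtain nsfd by opening /proc/PID/ns/mnt of a process
in the container and needs the same privileges as for mount --bind.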

Thanks also to Alan Jenkins for
providing a whole bunch of ways to break things using this interface.

Signed-off-by: Al Viro 
Signed-off-by: David Howells 
Signed-off-by: Al Viro 

teach move_mount(2) to work with OPEN_TREE_CLONE
(cherry-picked from commit 44dfd84a6d54a675e35ab618d9fab47b36cb78cd)
do_move_mount(): fix an unsafe use of is_anon_ns()
(cherry-picked from commit 05883eee857eab4693e7d13ebab06716475c5754)
vfs: move_mount: reject moving kernel internal mounts
(cherry-picked from commit 570d7a98e7d6d5d8706d94ffd2d40adeaa318332)

https://jira.sw.ru/browse/PSBM-107263
Signed-off-by: Pavel Tikhomirov 
---
 fs/namespace.c | 63 +++---
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 51cacd439590..d355b5921d1e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1789,10 +1789,16 @@ void dissolve_on_fput(struct vfsmount *mnt)
namespace_lock();
lock_mount_hash();
ns = real_mount(mnt)->mnt_ns;
-   umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+   if (ns) {
+   if (is_anon_ns(ns))
+   umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+   else
+   ns = NULL;
+   }
unlock_mount_hash();
namespace_unlock();
-   free_mnt_ns(ns);
+   if (ns)
+   free_mnt_ns(ns);
 }
 
 void drop_collected_mounts(struct vfsmount *mnt)
@@ -2000,6 +2006,10 @@ static int attach_recursive_mnt(struct mount *source_mnt,
attach_mnt(source_mnt, dest_mnt, dest_mp);
touch_mnt_namespace(source_mnt->mnt_ns);
} else {
+   if (source_mnt->mnt_ns) {
+   /* move from anon - the caller will destroy */
+   list_del_init(&source_mnt->mnt_ns->list);
+   }
mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
commit_tree(source_mnt);
}
@@ -2467,13 +2477,37 @@ static int do_set_group(struct path *path, const char *sibling_name)
return err;
 }
 
+/*
+ * Check that there aren't references to earlier/same mount namespaces in the
+ * specified subtree.  Such references can act as pins for mount namespaces
+ * that aren't checked by the mount-cycle checking code, thereby allowing
+ * cycles to be made.
+ */
+static bool check_for_nsfs_mounts(struct mount *subtree)
+{
+   struct mount *p;
+   bool ret = false;
+
+   lock_mount_hash();
+   for (p = subtree; p; p = next_mnt(p, subtree))
+   if (mnt_ns_loop(p->mnt.mnt_root))
+   goto out;
+
+   ret = true;
+out:
+   unlock_mount_hash();
+   return ret;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
struct path parent_path = {.mnt = NULL, .dentry = NULL};
+   struct mnt_namespace *ns;
struct mount *p;
struct mount *old;
struct mountpoint *mp;
int err;
+   bool attached;
 
mp = lock_mount(new_path);
if (IS_ERR(mp))
@@ -2481,12 +2515,20 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 
old = real_mount(old_path->mnt);
p = real_mount(new_path->mnt);
+   attached = mnt_has_parent(old);
+   ns = old->mnt_ns;
 
err = -EINVAL;
-   if (!check_mnt(p) || !check_mnt(old))
+   /* The mountpoint must be in our namespace. */
+   if (!check_mnt(p))
 

[Devel] [PATCH RHEL8 COMMIT] ms/vfs: syscall: Add move_mount(2) to move mounts around

2020-09-22 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit d64c6358efec5c13a55bc4dddf9d614634d6d892
Author: David Howells 
Date:   Tue Sep 22 18:56:57 2020 +0300

ms/vfs: syscall: Add move_mount(2) to move mounts around

Patchset description:
These syscalls were added as a preparation step for the new mount API (fsopen,
fsconfig, fsmount and fspick will be ported separately).

We can use them to implement "cross-namespace bind-mounting" like this:

fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
setns(nsfd, CLONE_NEWNS);
move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

This will allow us to add bind mounts to a running container
instead of relying on unreliable external propagation.

Version for VZ8 is slightly different from VZ7 version.

https://jira.sw.ru/browse/PSBM-107263

Current patch description:
From: David Howells 

Add a move_mount() system call that will move a mount from one place to
another and, in the next commit, allow attaching an unattached mount tree.

The new system call looks like the following:

int move_mount(int from_dfd, const char *from_path,
               int to_dfd, const char *to_path,
               unsigned int flags);

Signed-off-by: David Howells 
cc: linux-...@vger.kernel.org
Signed-off-by: Al Viro 

vfs: syscall: Add move_mount(2) to move mounts around
(cherry-picked from commit 2db154b3ea8e14b04fee23e3fdfd5e9d17fbc6ae)
uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
(cherry-picked from commit 9c8ad7a2ff0bfe58f019ec0abc1fb965114dde7d)
selinux: fix regression introduced by move_mount(2) syscall
(cherry-picked from commit 98aa00345de54b8340dc2ddcd87f446d33387b5e)

https://jira.sw.ru/browse/PSBM-107263
Signed-off-by: Pavel Tikhomirov 
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/namespace.c | 126 +
 include/linux/lsm_hooks.h  |   6 ++
 include/linux/security.h   |   7 ++
 include/linux/syscalls.h   |   3 +
 include/uapi/linux/fs.h|  11 +++
 security/security.c|   5 ++
 security/selinux/hooks.c   |  10 +++
 9 files changed, 139 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 103079ec2891..ec3e619444ee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -402,3 +402,4 @@
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
 428	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
+429	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5772d5b0f1a6..640ff4463a21 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,7 @@
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
 428	common	open_tree		__x64_sys_open_tree
+429	common	move_mount		__x64_sys_move_mount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index a669502c450b..51cacd439590 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2467,72 +2467,81 @@ static int do_set_group(struct path *path, const char *sibling_name)
return err;
 }
 
-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path)
 {
-   struct path old_path, parent_path;
+   struct path parent_path = {.mnt = NULL, .dentry = NULL};
struct mount *p;
struct mount *old;
struct mountpoint *mp;
int err;
-   if (!old_name || !*old_name)
-   return -EINVAL;
-   err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
-   if (err)
-   return err;
 
-   mp = lock_mount(path);
-   err = PTR_ERR(mp);
+   mp = lock_mount(new_path);
if (IS_ERR(mp))
-   goto out;
+   return PTR_ERR(mp);
 
-   old = real_mount(old_path.mnt);
-   p = real_mount(path->mnt);
+   old = real_mount(old_path->mnt);
+   p = real_mount(new_path->mnt);
 
err = -EINVAL;
if (!check_mnt(p) || !check_mnt(old))
-   goto out1;
+   

[Devel] [PATCH RHEL8 COMMIT] ms/vfs: syscall: Add open_tree(2) to reference or clone a mount

2020-09-22 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit cd9f3d61b5bbf7ad81dbcba920834369b878b1a9
Author: Pavel Tikhomirov 
Date:   Tue Sep 22 16:02:16 2020 +0300

ms/vfs: syscall: Add open_tree(2) to reference or clone a mount

Patchset description:
These syscalls were added as a preparation step for the new mount API (fsopen,
fsconfig, fsmount and fspick will be ported separately).

We can use them to implement "cross-namespace bind-mounting" like this:

fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
setns(nsfd, CLONE_NEWNS);
move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

This will allow us to add bind mounts to a running container
instead of relying on unreliable external propagation.

Version for VZ8 is slightly different from VZ7 version.

https://jira.sw.ru/browse/PSBM-107263

Current patch description:
From: Al Viro 

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)).  flags should be an OR of
some of the following:
* AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname.  With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within the mount containing the
location in question.  In other words, the same as mount --rbind
or mount --bind would've taken.  The detached tree will be
dissolved on the final close of obtained file.  Creation of such
detached trees requires the same capabilities as doing mount --bind.

Signed-off-by: Al Viro 
Signed-off-by: David Howells 
cc: linux-...@vger.kernel.org
Signed-off-by: Al Viro 

vfs: syscall: Add open_tree(2) to reference or clone a mount
(cherry-picked from commit a07b20004793d8926f78d63eb5980559f7813404)
uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
(cherry-picked from commit 9c8ad7a2ff0bfe58f019ec0abc1fb965114dde7d)
fs/namespace: add __user to open_tree and move_mount syscalls
(cherry-picked from commit 2658ce095df583cdf9ede475ec4da0b3cc7f7b05)

https://jira.sw.ru/browse/PSBM-107263
Signed-off-by: Pavel Tikhomirov 
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/file_table.c|   9 +-
 fs/internal.h  |   1 +
 fs/namespace.c | 157 -
 include/linux/fs.h |   3 +
 include/linux/syscalls.h   |   1 +
 include/uapi/linux/fcntl.h |   1 +
 include/uapi/linux/fs.h|   6 ++
 9 files changed, 155 insertions(+), 25 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2eefd2a7c1ce..103079ec2891 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 65c026185e61..5772d5b0f1a6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	open_tree		__x64_sys_open_tree
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index 2931252f47ae..4c8a5d845a1c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -245,6 +245,7 @@ static void __fput(struct file *file)
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = file->f_inode;
+   fmode_t mode = file->f_mode;
 
if (unlikely(!(file->f_mode & FMODE_OPENED)))
goto out;
@@ -267,18 +268,20 @@ static void __fput(struct 

[Devel] [PATCH RHEL8 COMMIT] ms/saner handling of temporary namespaces

2020-09-22 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit 51865b4696675b8be4bdefd61e7b2a3310b6a45d
Author: Al Viro 
Date:   Tue Sep 22 15:37:58 2020 +0300

ms/saner handling of temporary namespaces

Patchset description:
These syscalls were added as a preparation step for the new mount API (fsopen,
fsconfig, fsmount and fspick will be ported separately).

We can use them to implement "cross-namespace bind-mounting" like this:

fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
setns(nsfd, CLONE_NEWNS);
move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

This will allow us to add bind mounts to a running container
instead of relying on unreliable external propagation.

Version for VZ8 is slightly different from VZ7 version.

https://jira.sw.ru/browse/PSBM-107263

Current patch description:
From: Al Viro 

mount_subtree() creates (and soon destroys) a temporary namespace,
so that automounts could function normally.  These beasts should
never become anyone's current namespaces; they don't, but it would
be better to make prevention of that more straightforward.  And
since they don't become anyone's current namespace, we don't need
to bother with reserving procfs inums for those.

Teach alloc_mnt_ns() to skip inum allocation if told so, adjust
put_mnt_ns() accordingly, make mount_subtree() use temporary
(anon) namespace.  is_anon_ns() checks if a namespace is such.
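
The allocation scheme described above can be modelled in userspace C. This is
a simplified sketch, not the kernel code: fake_ns and the fake_* helpers are
hypothetical, calloc() stands in for kzalloc(), and a plain counter stands in
for the procfs inum allocator. It shows why zeroed memory plus "skip the
sequence number" is enough to mark a namespace as anonymous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced model of a mount namespace. */
struct fake_ns {
	uint64_t seq;		/* 0 marks an anonymous (temporary) namespace */
	unsigned int inum;	/* 0 means no inode number was reserved */
};

static uint64_t fake_ns_seq = 1;		/* last sequence number handed out */
static unsigned int fake_next_inum = 100;	/* next free inode number */
static int fake_inums_in_use;			/* currently reserved inode numbers */

static bool fake_is_anon(const struct fake_ns *ns)
{
	return ns->seq == 0;
}

static struct fake_ns *fake_alloc_ns(bool anon)
{
	/* Zeroed allocation: an anon namespace keeps seq == 0 and inum == 0. */
	struct fake_ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	if (!anon) {
		ns->inum = fake_next_inum++;
		fake_inums_in_use++;
		ns->seq = ++fake_ns_seq;
	}
	return ns;
}

static void fake_free_ns(struct fake_ns *ns)
{
	/* Only regular namespaces hold an inode number to give back. */
	if (!fake_is_anon(ns))
		fake_inums_in_use--;
	free(ns);
}
```

Because regular namespaces always get a sequence number of at least 2 in this
model, seq == 0 can only come from the zeroed allocation of an anon namespace.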

Signed-off-by: Al Viro 

(cherry-picked from commit 74e831221cfd79460ec11c1b641093863f0ef3ce)
https://jira.sw.ru/browse/PSBM-107263
Signed-off-by: Pavel Tikhomirov 
---
 fs/mount.h |  5 
 fs/namespace.c | 74 +++---
 2 files changed, 40 insertions(+), 39 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index f39bc9da4d73..6250de544760 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -146,3 +146,8 @@ static inline bool is_local_mountpoint(struct dentry *dentry)
 
return __is_local_mountpoint(dentry);
 }
+
+static inline bool is_anon_ns(struct mnt_namespace *ns)
+{
+   return ns->seq == 0;
+}
diff --git a/fs/namespace.c b/fs/namespace.c
index 1018ae0efa06..22589d59f476 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2862,7 +2862,8 @@ static void dec_mnt_namespaces(struct ucounts *ucounts)
 
 static void free_mnt_ns(struct mnt_namespace *ns)
 {
-   ns_free_inum(&ns->ns);
+   if (!is_anon_ns(ns))
+   ns_free_inum(&ns->ns);
dec_mnt_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
kfree(ns);
@@ -2877,7 +2878,7 @@ static void free_mnt_ns(struct mnt_namespace *ns)
  */
 static atomic64_t mnt_ns_seq = ATOMIC64_INIT(1);
 
-static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
+static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool anon)
 {
struct mnt_namespace *new_ns;
struct ucounts *ucounts;
@@ -2887,28 +2888,27 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
if (!ucounts)
return ERR_PTR(-ENOSPC);
 
-   new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+   new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
if (!new_ns) {
dec_mnt_namespaces(ucounts);
return ERR_PTR(-ENOMEM);
}
-   ret = ns_alloc_inum(&new_ns->ns);
-   if (ret) {
-   kfree(new_ns);
-   dec_mnt_namespaces(ucounts);
-   return ERR_PTR(ret);
+   if (!anon) {
+   ret = ns_alloc_inum(&new_ns->ns);
+   if (ret) {
+   kfree(new_ns);
+   dec_mnt_namespaces(ucounts);
+   return ERR_PTR(ret);
+   }
}
new_ns->ns.ops = &mntns_operations;
-   new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
+   if (!anon)
+   new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
atomic_set(&new_ns->count, 1);
-   new_ns->root = NULL;
INIT_LIST_HEAD(&new_ns->list);
init_waitqueue_head(&new_ns->poll);
-   new_ns->event = 0;
new_ns->user_ns = get_user_ns(user_ns);
new_ns->ucounts = ucounts;
-   new_ns->mounts = 0;
-   new_ns->pending_mounts = 0;
return new_ns;
 }
 
@@ -2932,7 +2932,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
 
old = ns->root;
 
-   new_ns = alloc_mnt_ns(user_ns);
+   new_ns = alloc_mnt_ns(user_ns, false);
if (IS_ERR(new_ns))
return new_ns;
 
@@ -2987,37 +2987,25 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
return new_ns;
 }
 
-/**
- * create_mnt_ns - creates a private namespace and adds a 

[Devel] [PATCH RHEL8 COMMIT] modules: use kvmalloc when creating sysfs attributes for ELF sections

2020-09-22 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit 9175123cf76678492bf559facbce177e90e5c07e
Author: Evgenii Shatokhin 
Date:   Tue Sep 22 15:31:59 2020 +0300

modules: use kvmalloc when creating sysfs attributes for ELF sections

A kernel module containing a ReadyKernel patch could have lots of ELF
sections: one per each new or patched function, one per each new static
or global variable, etc.

The kernel creates a sysfs file
/sys/module//sections/ for each loaded
section when the patch module is being loaded, see add_sect_attrs() in
kernel/module.c. A big chunk of memory is allocated for all these with
kzalloc.

For the ReadyKernel patches we have already released, the amount of
memory could be as high as 34528 bytes (48 + 80 * 431 loaded sections),
which is a 4th order allocation. 3rd order allocations are also common
here, see https://jira.sw.ru/browse/PSBM-95050.
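
The arithmetic behind the "4th order" figure can be checked with a few lines
of C. This is a sketch under assumptions: 4 KiB pages, and the 48-byte header
and 80-byte per-section sizes are simply the numbers quoted above;
sect_attrs_bytes() and alloc_order() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Bytes needed for nloaded sections, using the sizes quoted in the message. */
static size_t sect_attrs_bytes(size_t nloaded)
{
	return 48 + 80 * nloaded;
}

/* Smallest order such that (page size << order) covers the request. */
static unsigned int alloc_order(size_t bytes)
{
	unsigned int order = 0;

	while (((size_t)ASSUMED_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}
```

For 431 sections this gives 34528 bytes, which needs 9 pages and is therefore
rounded up to 16 physically contiguous pages (order 4). kvzalloc() avoids that
requirement because it can fall back to vmalloc-style, virtually contiguous
memory when the contiguous allocation fails.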

Not only is this wasteful (contiguous memory is not needed there), but the
allocation may also fail when memory is fragmented. ReadyKernel
patches are often used on systems with rather long uptimes, so
fragmentation is possible.

It could be better if the patch modules did not use too many ELF
sections. However, the KPatch maintainers pointed out
(https://github.com/dynup/kpatch/pull/1131) that the same problem would
affect regular kernel modules as well after FGKASLR has been merged into
the mainline kernel. Combining the sections of the kernel modules
destroys the purpose of FGKASLR, so, it was agreed that we probably
should just switch to kvmalloc+kvfree in add_sect_attrs/free_sect_attrs.

Details and discussion:
https://www.spinics.net/lists/live-patching/msg06364.html

https://jira.sw.ru/browse/PSBM-108017
Signed-off-by: Evgenii Shatokhin 
---
 kernel/module.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index d4702f0d711a..e58ad01de8bf 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1492,7 +1492,7 @@ static void free_sect_attrs(struct module_sect_attrs *sect_attrs)
 
for (section = 0; section < sect_attrs->nsections; section++)
kfree(sect_attrs->attrs[section].name);
-   kfree(sect_attrs);
+   kvfree(sect_attrs);
 }
 
 static void add_sect_attrs(struct module *mod, const struct load_info *info)
@@ -1510,7 +1510,7 @@ static void add_sect_attrs(struct module *mod, const struct load_info *info)
+ nloaded * sizeof(sect_attrs->attrs[0]),
sizeof(sect_attrs->grp.attrs[0]));
size[1] = (nloaded + 1) * sizeof(sect_attrs->grp.attrs[0]);
-   sect_attrs = kzalloc(size[0] + size[1], GFP_KERNEL);
+   sect_attrs = kvzalloc(size[0] + size[1], GFP_KERNEL);
if (sect_attrs == NULL)
return;
 


[Devel] [PATCH RH7] mm, memcg: add oom counter to memory.stat memcgroup file #PSBM-107731

2020-09-22 Thread Andrey Ryabinin
Add an oom counter to the memory.stat file. oom shows the number of OOM kills
triggered by the cgroup's own memory limit. total_oom shows the total number of
OOM kills triggered by the memory limits of the cgroup and its sub-groups.

memory.stat in the root cgroup counts global oom kills.

E.g:
 # mkdir /sys/fs/cgroup/memory/test/
 # echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # echo 100M > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo $$ > /sys/fs/cgroup/memory/test/tasks
 # ./vm-scalability/usemem -O 200M
 # grep oom /sys/fs/cgroup/memory/test/memory.stat
   oom 1
   total_oom 1
 # echo -1 > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo -1 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # ./vm-scalability/usemem -O 1000G
 # grep oom /sys/fs/cgroup/memory/memory.stat
   oom 1
   total_oom 2

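The relation between oom and total_oom in the example above can be modelled as
a sum over the cgroup subtree. This is a userspace sketch with hypothetical
names (fake_memcg, fake_total_oom), not the memcg code, and the fixed-size
child array is an assumption made to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model of a memory cgroup. */
struct fake_memcg {
	unsigned long oom;		/* kills caused by this group's own limit */
	struct fake_memcg *children[4];
	size_t nr_children;
};

/* total_oom: this group's counter plus the counters of all descendants. */
static unsigned long fake_total_oom(const struct fake_memcg *cg)
{
	unsigned long total = cg->oom;
	size_t i;

	for (i = 0; i < cg->nr_children; i++)
		total += fake_total_oom(cg->children[i]);
	return total;
}
```

With one kill charged to the test group and one global kill charged to the
root, the root reports oom 1 and total_oom 2, matching the shell session.
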
https://jira.sw.ru/browse/PSBM-107731
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6587cc2ef019..fe06c7db2ad3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,6 +400,7 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu __percpu *stat;
struct mem_cgroup_stat2_cpu stat2;
spinlock_t pcp_counter_lock;
+   atomic_long_t   oom;
 
atomic_t   dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -2005,6 +2006,7 @@ void mem_cgroup_note_oom_kill(struct mem_cgroup *root_memcg,
if (memcg == root_memcg)
break;
}
+   atomic_long_inc(&root_memcg->oom);
 
if (memcg_to_put)
css_put(&memcg_to_put->css);
@@ -5691,6 +5693,7 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
seq_printf(m, "%s %lu\n", mem_cgroup_events_names[i],
   mem_cgroup_read_events(memcg, i));
+   seq_printf(m, "oom %lu\n", atomic_long_read(&memcg->oom));
 
for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i],
@@ -5733,6 +5736,12 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
seq_printf(m, "total_%s %llu\n",
   mem_cgroup_events_names[i], val);
}
+   {
+   unsigned long val = 0;
+   for_each_mem_cgroup_tree(mi, memcg)
+   val += atomic_long_read(&mi->oom);
+   seq_printf(m, "total_oom %lu\n", val);
+   }
 
for (i = 0; i < NR_LRU_LISTS; i++) {
unsigned long long val = 0;
-- 
2.26.2



[Devel] [PATCH RHEL8 COMMIT] ve/perf: forbid perf events syscall in containers

2020-09-22 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit ed1fc404e6904702b4caded62e1e70e7420e12be
Author: Pavel Tikhomirov 
Date:   Tue Sep 22 15:27:06 2020 +0300

ve/perf: forbid perf events syscall in containers

If a process holds a perf_event_open fd, it can monitor various (kernel,
hardware, etc.) performance counters through it. This fd is configured
through perf_event_attr, which has more than 30 fields. There is
currently no kernel interface to read back the configuration of an existing
perf event fd, so to dump such an fd with CRIU we would have to add one.

ovs-vswitchd opens a perf event fd and, according to its comments, does
nothing with it; the fd is only used if someone uses the PERF() macros to
debug some code paths, which implies recompiling ovs. It is still a problem
for migration, because CRIU detects this fd and fails. ovs also copes when it
cannot open a perf event fd and falls back gracefully to working without it.

So (at least for now) we should forbid this interface to fix the problems
with ovs daemon migration.

https://jira.sw.ru/browse/PSBM-107217
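
From the userspace side, the graceful fallback mentioned above can look like
the following. This is a hedged sketch, not the ovs code:
try_open_cycle_counter() is a hypothetical helper that asks for a CPU cycle
counter on the calling thread and treats any failure, including the -EACCES
returned inside a container by this patch, as "run without perf".

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open a CPU cycle counter for this thread.
 * Returns a file descriptor, or -1 if perf events are unavailable for any
 * reason (forbidden, unsupported, no hardware counter, ...).
 */
static int try_open_cycle_counter(void)
{
	struct perf_event_attr attr;
	long fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	/* pid 0 = calling thread, cpu -1 = any CPU, no group leader, no flags */
	fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
	return fd < 0 ? -1 : (int)fd;
}
```

A caller keeps the returned value and simply skips the perf-based code paths
when it is -1, so the restriction costs it nothing but the optional counters.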

Signed-off-by: Pavel Tikhomirov 
---
 kernel/events/core.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 61b0e1dfdebe..17066990a235 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -10874,6 +10875,9 @@ SYSCALL_DEFINE5(perf_event_open,
if (flags & ~PERF_FLAG_ALL)
return -EINVAL;
 
+   if (!ve_is_super(get_exec_env()))
+   return -EACCES;
+
err = perf_copy_attr(attr_uptr, &attr);
if (err)
return err;


[Devel] [PATCH RHEL8 COMMIT] net: openvswitch: add capability to specify ifindex of new links

2020-09-22 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit 5e825b7e975d0593a648f467c585bcaded7036b3
Author: Andrey Zhadchenko 
Date:   Tue Sep 22 15:07:25 2020 +0300

net: openvswitch: add capability to specify ifindex of new links

CRIU preserves the ifindexes of net devices across restore, but the
current Open vSwitch API cannot do that. So we need to
modify it, because:

- Restoring net devices with a random ifindex leads to extra
  work to restore the master relationship.

- An OVS device taking another link's ifindex is likely to cause problems
  and may aggravate the previous point.

- Although the OVS daemon is not supported yet, it holds some tables
  that contain ifindexes.

Open vSwitch creates several net devices, but unlike the rtnetlink API there is
no option to specify the ifindex of a link. This is crucial for CRIU during
the restore stage.
Use ovs_header->dp_ifindex during OVS_DP_CMD_NEW as the desired ifindex.
Use OVS_VPORT_ATTR_IFINDEX during OVS_VPORT_CMD_NEW to specify the new netdev's
ifindex. Neither value was relevant for the corresponding request before, so
existing software is unaffected.

https://jira.sw.ru/browse/PSBM-105844
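
The attribute handling added to ovs_vport_cmd_new() can be summarised by a
small decision function. This is a userspace sketch with hypothetical names
(fake_vport_type, fake_pick_ifindex), not the datapath code; it assumes that a
desired ifindex of 0 means "let the kernel pick one", which is also what the
sketch reports for non-internal ports.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced set of vport types. */
enum fake_vport_type { FAKE_VPORT_NETDEV, FAKE_VPORT_INTERNAL };

/*
 * Decide which ifindex to request for a new vport.
 * Returns 0 and stores the result in *desired, or -EOPNOTSUPP when an
 * ifindex attribute is supplied for a port type that cannot honour it.
 */
static int fake_pick_ifindex(enum fake_vport_type type, bool has_attr,
			     int attr_ifindex, int *desired)
{
	if (has_attr && type != FAKE_VPORT_INTERNAL)
		return -EOPNOTSUPP;

	*desired = 0;	/* 0: no preference, an index is assigned automatically */
	if (type == FAKE_VPORT_INTERNAL && has_attr)
		*desired = attr_ifindex;
	return 0;
}
```

Only internal ports create a net device of their own, which is why the
attribute is accepted for them and rejected for every other type.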
Signed-off-by: Andrey Zhadchenko 
---
 net/openvswitch/datapath.c   | 16 ++--
 net/openvswitch/vport-internal_dev.c |  1 +
 net/openvswitch/vport.h  |  2 ++
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 900b16e668a1..a720430daaa3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1600,6 +1600,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
struct vport *vport;
struct ovs_net *ovs_net;
int err, i;
+   struct ovs_header *ovs_header = info->userhdr;
 
err = -EINVAL;
if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID])
@@ -1649,6 +1650,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
parms.dp = dp;
parms.port_no = OVSP_LOCAL;
parms.upcall_portids = a[OVS_DP_ATTR_UPCALL_PID];
+   parms.desired_ifindex = ovs_header->dp_ifindex;
 
err = ovs_dp_change(dp, a);
if (err)
@@ -2044,7 +2046,10 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] ||
!a[OVS_VPORT_ATTR_UPCALL_PID])
return -EINVAL;
-   if (a[OVS_VPORT_ATTR_IFINDEX])
+
+   parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
+
+   if (a[OVS_VPORT_ATTR_IFINDEX] && parms.type != OVS_VPORT_TYPE_INTERNAL)
return -EOPNOTSUPP;
 
port_no = a[OVS_VPORT_ATTR_PORT_NO]
@@ -2081,12 +2086,19 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
}
 
parms.name = nla_data(a[OVS_VPORT_ATTR_NAME]);
-   parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
parms.options = a[OVS_VPORT_ATTR_OPTIONS];
parms.dp = dp;
parms.port_no = port_no;
parms.upcall_portids = a[OVS_VPORT_ATTR_UPCALL_PID];
 
+   if (parms.type == OVS_VPORT_TYPE_INTERNAL) {
+   if (a[OVS_VPORT_ATTR_IFINDEX])
+   parms.desired_ifindex =
+   nla_get_u32(a[OVS_VPORT_ATTR_IFINDEX]);
+   else
+   parms.desired_ifindex = 0;
+   }
+
vport = new_vport(&parms);
err = PTR_ERR(vport);
if (IS_ERR(vport)) {
diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 3ebf8ba7c389..a9bb6e5e11ad 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -200,6 +200,7 @@ static struct vport *internal_dev_create(const struct vport_parms *parms)
if (vport->port_no == OVSP_LOCAL)
vport->dev->features |= NETIF_F_NETNS_LOCAL;
 
+   dev->ifindex = parms->desired_ifindex;
rtnl_lock();
err = register_netdevice(vport->dev);
if (err)
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index cda66c26ad08..c5281b52f489 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -109,6 +109,8 @@ struct vport_parms {
enum ovs_vport_type type;
struct nlattr *options;
 
+   int desired_ifindex;
+
/* For ovs_vport_alloc(). */
struct datapath *dp;
u16 port_no;
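
The ifindex handling in the ovs_vport_cmd_new() hunks above can be read as one small decision: the attribute is rejected for non-internal ports, and an internal port gets either the requested value or 0. A minimal user-space sketch of that decision follows. The enum, the function name and its parameters are hypothetical stand-ins, not kernel symbols, and the reading of 0 as "no preference" is an assumption about how the netdevice is registered.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OVS vport types; not the kernel enum. */
enum sketch_vport_type {
    SKETCH_VPORT_NETDEV = 1,
    SKETCH_VPORT_INTERNAL = 2,
};

/*
 * Mirrors the decision in the hunks above: a requested ifindex is only
 * honoured for internal ports. A result of 0 is assumed to mean
 * "no preference". Returns 0 on success or -EOPNOTSUPP.
 */
static int sketch_pick_ifindex(enum sketch_vport_type type,
                               bool has_ifindex_attr, int requested,
                               int *desired)
{
    assert(desired != NULL);

    /* The attribute is refused for every port type except internal. */
    if (has_ifindex_attr && type != SKETCH_VPORT_INTERNAL)
        return -EOPNOTSUPP;

    *desired = 0;
    if (type == SKETCH_VPORT_INTERNAL && has_ifindex_attr)
        *desired = requested;
    return 0;
}
```

Calling sketch_pick_ifindex(SKETCH_VPORT_INTERNAL, true, 42, &d) stores 42 in d, while the same request for a non-internal type returns -EOPNOTSUPP and leaves d untouched.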


[Devel] [PATCH RHEL7 COMMIT]

2020-09-22 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.27
-->



Re: [Devel] [PATCH RHEL7] cgroup: fixed NULL-pointer dereference in cgroup_release_agent

2020-09-22 Thread Kirill Tkhai
On 21.09.2020 12:18, Valeriy Vdovin wrote:
> The fix checks that ve->init_task is not referenced during warning
> message decision if ve == ve0, because ve0 init_task is always NULL.
> 
> https://jira.sw.ru/browse/PSBM-107673
> Signed-off-by: Valeriy Vdovin 

Reviewed-by: Kirill Tkhai 

> ---
>  kernel/cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 691505c..27d7a5e 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -5934,7 +5934,7 @@ void cgroup_release_agent(struct work_struct *work)
>   envp, UMH_WAIT_EXEC, NULL, NULL, NULL);
>  
>   ve_task = ve->init_task;
> - if (err < 0 && (!(ve_task->flags & PF_EXITING)))
> + if (err < 0 && (ve == &ve0 || !(ve_task->flags & PF_EXITING)))
>   pr_warn_ratelimited("cgroup release_agent "
>   "%s %s failed: %d\n",
>   agentbuf, pathbuf, err);
> 



[Devel] [PATCH RHEL7] KVM: LAPIC: Fix pv ipis use-before-initialization

2020-09-22 Thread Valeriy Vdovin
From: Wanpeng Li 

Reported by syzkaller:

 BUG: unable to handle kernel NULL pointer dereference at 0014
 PGD 80040410c067 P4D 80040410c067 PUD 40410d067 PMD 0
 Oops:  [#1] PREEMPT SMP PTI
 CPU: 3 PID: 2567 Comm: poc Tainted: G   OE 4.19.0-rc5 #16
 RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm]
 Call Trace:
  kvm_emulate_hypercall+0x3cc/0x700 [kvm]
  handle_vmcall+0xe/0x10 [kvm_intel]
  vmx_handle_exit+0xc1/0x11b0 [kvm_intel]
  vcpu_enter_guest+0x9fb/0x1910 [kvm]
  kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
  kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
  do_vfs_ioctl+0xa5/0x690
  ksys_ioctl+0x6d/0x80
  __x64_sys_ioctl+0x1a/0x20
  do_syscall_64+0x83/0x6e0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The reason is that the apic map has not yet been initialized when the testcase
triggers the pv_send_ipi interface by vmcall, so kvm->arch.apic_map is
dereferenced while still NULL. This patch fixes it by checking whether or not
the apic map is NULL and bailing out immediately if that is the case.

Fixes: 4180bf1b65 (KVM: X86: Implement "send IPI" hypercall)
Reported-by: Wei Wu 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Wei Wu 
Signed-off-by: Wanpeng Li 
Cc: sta...@vger.kernel.org
Signed-off-by: Paolo Bonzini 

(cherry-picked from commit 38ab012f109caf10f471db1adf284e620dd8d701)
https://jira.sw.ru/browse/PSBM-107931

Signed-off-by: Valeriy.Vdovin 
---
 arch/x86/kvm/lapic.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 740be89..f433199 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -566,6 +566,11 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
rcu_read_lock();
map = rcu_dereference(kvm->arch.apic_map);
 
+   if (unlikely(!map)) {
+   count = -EOPNOTSUPP;
+   goto out;
+   }
+
if (min > map->max_apic_id)
goto out;
/* Bits above cluster_size are masked in the caller.  */
-- 
1.8.3.1
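
The guard added by this patch can be sketched in isolation. The snippet below is a simplified user-space model, not the kernel function: struct sketch_apic_map and sketch_send_ipi() are hypothetical names, and the return value 1 merely stands in for "IPIs delivered". It shows why the NULL check has to come before the first access to map->max_apic_id.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the APIC map. */
struct sketch_apic_map {
    unsigned int max_apic_id;
};

/*
 * Returns -EOPNOTSUPP when the map has not been set up yet, 0 when the
 * first destination is out of range, and 1 otherwise.
 */
static int sketch_send_ipi(const struct sketch_apic_map *map, unsigned int min)
{
    int count = 0;

    if (map == NULL) {              /* the guard the patch adds */
        count = -EOPNOTSUPP;
        goto out;
    }

    if (min > map->max_apic_id)     /* safe: map is known non-NULL here */
        goto out;

    count = 1;
out:
    return count;
}
```

With a NULL map the function returns -EOPNOTSUPP instead of faulting on the max_apic_id read, which matches the behaviour described in the commit message.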



[Devel] [PATCH RHEL7 COMMIT] KVM: LAPIC: Fix pv ipis out-of-bounds access

2020-09-22 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.27
-->
commit 69be0b6e4ce2dee6e42bf89a7497d1fef2c4e2d0
Author: Wanpeng Li 
Date:   Tue Sep 22 10:58:11 2020 +0300

KVM: LAPIC: Fix pv ipis out-of-bounds access

Dan Carpenter reported that the untrusted data returned from
kvm_register_read() results in the following static checker warning:
  arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
  error: buffer underflow 'map->phys_map' 's32min-s32max'

A KVM guest can easily trigger this by executing the following assembly
sequence in Ring0:

mov $10, %rax
mov $0x, %rbx
mov $0x, %rdx
mov $0, %rsi
vmcall

This causes KVM to execute the following code path:
vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() ->
kvm_pv_send_ipi()
which performs an out-of-bounds access.

This patch fixes it by adding a check to kvm_pv_send_ipi() against
map->max_apic_id, ignoring destinations that are not present and delivering
the rest. We also check whether or not map->phys_map[min + i] is NULL: since
max_apic_id is set to the maximum apic id, some phys_map entries may be NULL
when apic ids are sparse, especially because kvm unconditionally sets
max_apic_id to 255 to reserve enough space for any xAPIC ID.

Reported-by: Dan Carpenter 
Reviewed-by: Liran Alon 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Liran Alon 
Cc: Dan Carpenter 
Signed-off-by: Wanpeng Li 
[Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
Signed-off-by: Radim Krčmář 

(cherry picked from commit bdf7ffc89922a52a4f08a12f7421ea24bb7626a0)
https://jira.sw.ru/browse/PSBM-107931

Signed-off-by: Valeriy Vdovin 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/lapic.c            | 27 ++++++++++++++++++++-------
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 50817dc3..e9ee080 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1434,7 +1434,7 @@ void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
 
 u64 kvm_get_arch_capabilities(void);
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
-   unsigned long ipi_bitmap_high, int min,
+   unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit);
 
 void kvm_define_shared_msr(unsigned index, u32 msr);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1487fe2..740be89 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -543,7 +543,7 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
 }
 
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
-   unsigned long ipi_bitmap_high, int min,
+   unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit)
 {
int i;
@@ -566,18 +566,31 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
rcu_read_lock();
map = rcu_dereference(kvm->arch.apic_map);
 
+   if (min > map->max_apic_id)
+   goto out;
/* Bits above cluster_size are masked in the caller.  */
-   for_each_set_bit(i, &ipi_bitmap_low, BITS_PER_LONG) {
-   vcpu = map->phys_map[min + i]->vcpu;
-   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+   for_each_set_bit(i, &ipi_bitmap_low,
+   min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
+   if (map->phys_map[min + i]) {
+   vcpu = map->phys_map[min + i]->vcpu;
+   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+   }
}
 
min += cluster_size;
-   for_each_set_bit(i, &ipi_bitmap_high, BITS_PER_LONG) {
-   vcpu = map->phys_map[min + i]->vcpu;
-   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+
+   if (min > map->max_apic_id)
+   goto out;
+
+   for_each_set_bit(i, &ipi_bitmap_high,
+   min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
+   if (map->phys_map[min + i]) {
+   vcpu = map->phys_map[min + i]->vcpu;
+   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+   }
}
 
+out:
rcu_read_unlock();
return count;
 }
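
The three ingredients of this fix (an unsigned min, a clamped iteration limit, and a NULL check per slot) can be modelled outside the kernel. The sketch below is only an illustration under stated assumptions: sketch_count_targets(), its parameters and the SKETCH_BITS_PER_LONG constant are hypothetical, a plain loop replaces for_each_set_bit(), and counting present slots stands in for delivering IPIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITS_PER_LONG 64u

/*
 * Count how many set bits in `bitmap` name a present slot, visiting only
 * indices in [min, max_id]. `present` must have max_id + 1 entries; a NULL
 * entry models a sparse APIC ID with no vCPU behind it.
 */
static int sketch_count_targets(uint64_t bitmap, uint32_t min, uint32_t max_id,
                                const int *const *present)
{
    uint32_t span, limit, i;
    int count = 0;

    assert(present != NULL);

    /* Unsigned compare: a huge guest-supplied min bails out here. */
    if (min > max_id)
        return 0;

    /* Clamp the walk so that min + i never exceeds max_id. */
    span = max_id - min;
    limit = span < SKETCH_BITS_PER_LONG ? span + 1 : SKETCH_BITS_PER_LONG;

    for (i = 0; i < limit; i++) {
        if (!(bitmap & (UINT64_C(1) << i)))
            continue;
        if (present[min + i] != NULL)   /* skip holes in the map */
            count++;
    }
    return count;
}
```

With a signed min, a negative value would pass the `min > max_id` test and index below the array, which is the underflow the static checker reported; declaring it unsigned turns the same bit pattern into a value that fails the range check.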


[Devel] [PATCH] KVM: LAPIC: Fix pv ipis out-of-bounds access

2020-09-22 Thread Valeriy Vdovin
From: Wanpeng Li 

Dan Carpenter reported that the untrusted data returned from kvm_register_read()
results in the following static checker warning:
  arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
  error: buffer underflow 'map->phys_map' 's32min-s32max'

A KVM guest can easily trigger this by executing the following assembly sequence
in Ring0:

mov $10, %rax
mov $0x, %rbx
mov $0x, %rdx
mov $0, %rsi
vmcall

This causes KVM to execute the following code path:
vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() ->
kvm_pv_send_ipi()
which performs an out-of-bounds access.

This patch fixes it by adding a check to kvm_pv_send_ipi() against
map->max_apic_id, ignoring destinations that are not present and delivering
the rest. We also check whether or not map->phys_map[min + i] is NULL: since
max_apic_id is set to the maximum apic id, some phys_map entries may be NULL
when apic ids are sparse, especially because kvm unconditionally sets
max_apic_id to 255 to reserve enough space for any xAPIC ID.

Reported-by: Dan Carpenter 
Reviewed-by: Liran Alon 
Cc: Paolo Bonzini 
Cc: Radim Krčmář 
Cc: Liran Alon 
Cc: Dan Carpenter 
Signed-off-by: Wanpeng Li 
[Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
Signed-off-by: Radim Krčmář 

(cherry picked from commit bdf7ffc89922a52a4f08a12f7421ea24bb7626a0)
https://jira.sw.ru/browse/PSBM-107931

Signed-off-by: Valeriy Vdovin 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/lapic.c            | 27 ++++++++++++++++++++-------
 2 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 50817dc3..e9ee080 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1434,7 +1434,7 @@ void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
 
 u64 kvm_get_arch_capabilities(void);
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
-   unsigned long ipi_bitmap_high, int min,
+   unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit);
 
 void kvm_define_shared_msr(unsigned index, u32 msr);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1487fe2..740be89 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -543,7 +543,7 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
 }
 
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
-   unsigned long ipi_bitmap_high, int min,
+   unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit)
 {
int i;
@@ -566,18 +566,31 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
rcu_read_lock();
map = rcu_dereference(kvm->arch.apic_map);
 
+   if (min > map->max_apic_id)
+   goto out;
/* Bits above cluster_size are masked in the caller.  */
-   for_each_set_bit(i, &ipi_bitmap_low, BITS_PER_LONG) {
-   vcpu = map->phys_map[min + i]->vcpu;
-   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+   for_each_set_bit(i, &ipi_bitmap_low,
+   min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
+   if (map->phys_map[min + i]) {
+   vcpu = map->phys_map[min + i]->vcpu;
+   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+   }
}
 
min += cluster_size;
-   for_each_set_bit(i, &ipi_bitmap_high, BITS_PER_LONG) {
-   vcpu = map->phys_map[min + i]->vcpu;
-   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+
+   if (min > map->max_apic_id)
+   goto out;
+
+   for_each_set_bit(i, &ipi_bitmap_high,
+   min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
+   if (map->phys_map[min + i]) {
+   vcpu = map->phys_map[min + i]->vcpu;
+   count += kvm_apic_set_irq(vcpu, &irq, NULL);
+   }
}
 
+out:
rcu_read_unlock();
return count;
 }
-- 
1.8.3.1



[Devel] [PATCH RHEL7 COMMIT] cgroup: fixed NULL-pointer dereference in cgroup_release_agent

2020-09-22 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.27
-->
commit 9c144d0325cefda39adb3736d6f5538e45e778a4
Author: Valeriy Vdovin 
Date:   Tue Sep 22 10:32:30 2020 +0300

cgroup: fixed NULL-pointer dereference in cgroup_release_agent

The fix checks that ve->init_task is not referenced during warning
message decision if ve == ve0, because ve0 init_task is always NULL.

https://jira.sw.ru/browse/PSBM-107673
Signed-off-by: Valeriy Vdovin 
---
 kernel/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 691505c..27d7a5e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5934,7 +5934,7 @@ void cgroup_release_agent(struct work_struct *work)
envp, UMH_WAIT_EXEC, NULL, NULL, NULL);
 
ve_task = ve->init_task;
-   if (err < 0 && (!(ve_task->flags & PF_EXITING)))
+   if (err < 0 && (ve == &ve0 || !(ve_task->flags & PF_EXITING)))
pr_warn_ratelimited("cgroup release_agent "
"%s %s failed: %d\n",
agentbuf, pathbuf, err);
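
The commit message describes skipping the init_task dereference when ve is ve0. A minimal user-space sketch of that guard is shown below. All names (sketch_ve, sketch_task, sketch_ve0, sketch_should_warn) and the flag value are hypothetical stand-ins, not the kernel definitions. The point is that the left operand of `||` is evaluated first, so the NULL init_task of the host container is never touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PF_EXITING 0x4u  /* illustrative flag value */

struct sketch_task {
    unsigned int flags;
};

struct sketch_ve {
    struct sketch_task *init_task;
};

/* Stand-in for the host container, whose init_task is always NULL. */
static struct sketch_ve sketch_ve0 = { .init_task = NULL };

/* Decide whether the release_agent failure should be logged. */
static bool sketch_should_warn(int err, const struct sketch_ve *ve)
{
    const struct sketch_task *ve_task;

    assert(ve != NULL);
    ve_task = ve->init_task;

    /* `||` short-circuits: ve_task is only read when ve is not ve0. */
    return err < 0 &&
           (ve == &sketch_ve0 || !(ve_task->flags & SKETCH_PF_EXITING));
}
```

For the host container the function returns true on any negative err without reading init_task; for other containers the warning is suppressed only while their init task is exiting.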


[Devel] [PATCH RHEL7 COMMIT] x86_64: fix crashes due to bogus iret traps handling #PSBM-107794

2020-09-22 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.27
-->
commit 32457eef8a6680864624049df7ebdbcf53676a93
Author: Andrey Ryabinin 
Date:   Tue Sep 22 10:32:22 2020 +0300

x86_64: fix crashes due to bogus iret traps handling #PSBM-107794

Our handling of bad irets seems to have been broken since the Meltdown fix.
When an interrupt return to userspace fails, we are running with the user CR3
and thus fault in error_sti on access to the 'kernel_stack' variable.
This continues with a series of faults in the page fault handler until
we run out of stack and end up with:

PANIC: double fault, error_code: 0x0
RIP: 0010:[]  [] async_page_fault+0xd/0x30
Call Trace:

 ? smp_apic_timer_interrupt+0x48/0x60
 ? apic_timer_interrupt+0x16a/0x170

 ? bad_area+0x49/0x50
 ? __do_page_fault+0x477/0x500
 ? trace_do_page_fault+0x56/0x150
 ? do_async_page_fault+0x22/0xf0
 ? async_page_fault+0x28/0x30
 ? .E_write_words+0x5c/0x641
 ? putname+0x3d/0x60
 ? timerqueue_add+0x60/0xb0
 ? enqueue_hrtimer+0x25/0x80
 ? hrtimer_start_range_ns+0x1fd/0x3c0
 ? recalc_sigpending+0x1b/0x70
 ? __set_task_blocked+0x41/0xa0
 ? restore_altstack+0x18/0x30
 ? sys_rt_sigreturn+0xe8/0x100
 ? stub_rt_sigreturn+0x48/0x90

Backport the fix for this from RHEL 7.9 beta.

https://jira.sw.ru/browse/PSBM-107794
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/kernel/entry_64.S | 49 ++
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3e67d18..91e5503 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -949,12 +949,42 @@ irq_return:
 * when returning from IPI handler.
 */
INTERRUPT_RETURN
+   _ASM_EXTABLE(irq_return, bad_iret)
 
 #ifdef CONFIG_PARAVIRT
 ENTRY(native_iret)
iretq
+   _ASM_EXTABLE(native_iret, bad_iret)
 #endif
 
+   .section .fixup,"ax"
+bad_iret:
+   /*
+* The iret traps when the %cs or %ss being restored is bogus.
+* We've lost the original trap vector and error code.
+* #GPF is the most likely one to get for an invalid selector.
+* So pretend we completed the iret and took the #GPF in user mode.
+*
+* We are now running with the kernel GS after exception recovery.
+* But error_entry expects us to have user GS to match the user %cs,
+* so swap back.
+*/
+   pushq $0
+
+   /*
+* If a kernel bug clears user CS bit and in turn we'll skip SWAPGS in
+* general_protection, skip the SWAPGS here as well so we won't hard reboot.
+* This increases robustness of bad_iret to kernel bugs as well.
+*/
+   testl $3, 8*2(%rsp)
+   je 1f
+   SWAPGS
+1:
+
+   jmp general_protection
+
+   .previous
+
/* edi: workmask, edx: work */
 retint_careful:
CFI_RESTORE_STATE
@@ -1550,15 +1580,16 @@ error_sti:
 
 /*
  * There are two places in the kernel that can potentially fault with
- * usergs. Handle them here.  B stepping K8s sometimes report a
- * truncated RIP for IRET exceptions returning to compat mode. Check
- * for these here too.
+ * usergs. Handle them here. The exception handlers after iret run with
+ * kernel gs again, so don't set the user space flag. B stepping K8s
+ * sometimes report an truncated RIP for IRET exceptions returning to
+ * compat mode. Check for these here too.
  */
 error_kernelspace:
incl %ebx
leaq irq_return(%rip),%rcx
cmpq %rcx,RIP+8(%rsp)
-   je error_bad_iret
+   je error_swapgs
movl %ecx,%eax  /* zero extend */
cmpq %rax,RIP+8(%rsp)
je bstep_iret
@@ -1570,15 +1601,7 @@ error_kernelspace:
 bstep_iret:
/* Fix truncated RIP */
movq %rcx,RIP+8(%rsp)
-   /* fall through */
-
-error_bad_iret:
-   SWAPGS
-   mov %rsp,%rdi
-   call fixup_bad_iret
-   mov %rax,%rsp
-   decl %ebx   /* Return to usergs */
-   jmp error_sti
+   jmp error_swapgs
CFI_ENDPROC
 END(error_entry)
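
The `testl $3, 8*2(%rsp)` check in bad_iret can be pictured with a small C model. This is only a sketch under the assumption that, right after `pushq $0`, the stack holds the fake error code, then the saved RIP, then the saved CS. The struct and function names are hypothetical and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the stack right after `pushq $0` in bad_iret. */
struct sketch_iret_frame {
    uint64_t error_code;    /* 0(%rsp): the fake error code just pushed */
    uint64_t rip;           /* 8(%rsp) */
    uint64_t cs;            /* 16(%rsp), i.e. 8*2(%rsp) */
};

/*
 * The low two bits of a code segment selector carry the privilege level.
 * Non-zero means the frame returns to user mode, so SWAPGS is performed;
 * zero takes the `je 1f` branch and skips it.
 */
static bool sketch_needs_swapgs(const struct sketch_iret_frame *f)
{
    assert(f != NULL);
    return (f->cs & 3) != 0;    /* same test as `testl $3, 8*2(%rsp)` */
}
```

A frame whose saved CS has privilege bits 3 yields true, while one with both low bits clear yields false, matching the skip path described in the comment added by the patch.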
 