Re: [Devel] [PATCH RHEL7] ploop: zero-out block device statistics at ploop_stop
On 22.09.2020 18:57, Valeriy Vdovin wrote:
> ploop block device is represented by a block device file in /dev, but
> its lifecycle is separated from the file itself by PLOOP_IOC_START and
> PLOOP_IOC_STOP ioctls. This way ploop file in /dev can be an empty
> placeholder after PLOOP_IOC_STOP ioctl and reinitialized later by a
> PLOOP_IOC_START. Because of that some of the important data structures
> stay allocated after stop and maintain old values until and after restart.
> This situation is also true for block device statistics that remain unchanged
> after end of ploop device lifecycle. Fresh-started ploop device is considered
> a new entity with stats equal to zero. For that we zero out stats at
> ploop_stop.
>
> https://jira.sw.ru/browse/PSBM-95605
>
> Signed-off-by: Valeriy.Vdovin

Reviewed-by: Kirill Tkhai

> ---
>  drivers/block/ploop/dev.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
> index ac4d142..c54ff90 100644
> --- a/drivers/block/ploop/dev.c
> +++ b/drivers/block/ploop/dev.c
> @@ -4373,6 +4373,9 @@ static int ploop_stop(struct ploop_device * plo, struct block_device *bdev)
>  
>  	clear_bit(PLOOP_S_RUNNING, &plo->state);
>  
> +	part_stat_set_all(&plo->disk->part0, 0);
> +	memset(&plo->st, 0, sizeof(plo->st));
> +
>  	del_timer_sync(&plo->mitigation_timer);
>  	del_timer_sync(&plo->freeze_timer);
>  
>
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7] ploop: zero-out block device statistics at ploop_stop
A ploop block device is represented by a block device file in /dev, but
its lifecycle is decoupled from the file itself by the PLOOP_IOC_START and
PLOOP_IOC_STOP ioctls. The ploop file in /dev can therefore be an empty
placeholder after PLOOP_IOC_STOP and be reinitialized later by
PLOOP_IOC_START. As a result, some important data structures stay allocated
after stop and keep their old values until and after the restart.
The same holds for the block device statistics, which remain unchanged
after the end of the ploop device lifecycle. A freshly started ploop device
is considered a new entity with all stats equal to zero, so we zero out the
stats at ploop_stop.

https://jira.sw.ru/browse/PSBM-95605

Signed-off-by: Valeriy.Vdovin
---
 drivers/block/ploop/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/ploop/dev.c b/drivers/block/ploop/dev.c
index ac4d142..c54ff90 100644
--- a/drivers/block/ploop/dev.c
+++ b/drivers/block/ploop/dev.c
@@ -4373,6 +4373,9 @@ static int ploop_stop(struct ploop_device * plo, struct block_device *bdev)
 
 	clear_bit(PLOOP_S_RUNNING, &plo->state);
 
+	part_stat_set_all(&plo->disk->part0, 0);
+	memset(&plo->st, 0, sizeof(plo->st));
+
 	del_timer_sync(&plo->mitigation_timer);
 	del_timer_sync(&plo->freeze_timer);
 
-- 
1.8.3.1
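The stop/start behaviour described in the patch can be illustrated with a small userspace sketch. Everything named demo_* here is hypothetical and only stands in for the driver's structures; it is not the ploop code. It assumes a device object that outlives stop/start cycles and shows why the counters have to be cleared explicitly at stop.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the driver's structures; not the real ploop types. */
struct demo_stats {
	unsigned long reads;
	unsigned long writes;
};

struct demo_device {
	bool running;          /* plays the role of the PLOOP_S_RUNNING bit */
	struct demo_stats st;  /* stays allocated across stop/start, like plo->st */
};

void demo_start(struct demo_device *dev)
{
	assert(dev != NULL);
	dev->running = true;
}

/* Account one I/O request against the running device. */
void demo_io(struct demo_device *dev, bool is_write)
{
	assert(dev != NULL && dev->running);
	if (is_write)
		dev->st.writes++;
	else
		dev->st.reads++;
}

/* Stop the device and zero its counters so the next start begins from zero. */
void demo_stop(struct demo_device *dev)
{
	assert(dev != NULL);
	dev->running = false;
	memset(&dev->st, 0, sizeof(dev->st));
}
```

Without the memset() in demo_stop(), a second demo_start() on the same object would report the previous lifecycle's counters, which is the situation the patch addresses.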
[Devel] [PATCH RHEL8 COMMIT] ms/teach move_mount(2) to work with OPEN_TREE_CLONE
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit d16c52de7d40f97b428cc00eb931cc0e1912ecdd
Author: Pavel Tikhomirov
Date: Tue Sep 22 18:56:58 2020 +0300

    ms/teach move_mount(2) to work with OPEN_TREE_CLONE

    Patchset description:

    These syscalls were added as a preparation step for the new mount API
    (fsopen, fsconfig, fsmount and fspick will be ported separately).

    We can use them to implement "cross-namespace bind-mounting" like this:

      fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
      setns(nsfd, CLONE_NEWNS);
      move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

    This will allow us to implement adding bind mounts to a running
    container instead of relying on unreliable external propagation.

    The version for VZ8 is slightly different from the VZ7 version.

    https://jira.sw.ru/browse/PSBM-107263

    Current patch description:

    From: David Howells

    Allow a detached tree created by open_tree(..., OPEN_TREE_CLONE) to be
    attached by move_mount(2).

    If by the time of the final fput() of an OPEN_TREE_CLONE-opened file its
    tree is not detached anymore, it won't be dissolved. move_mount(2) is
    adjusted to handle a detached source.

    That gives us equivalents of mount --bind and mount --rbind.

    Thanks also to Alan Jenkins for providing a whole bunch of ways to
    break things using this interface.

    Signed-off-by: Al Viro
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    teach move_mount(2) to work with OPEN_TREE_CLONE
    (cherry-picked from commit 44dfd84a6d54a675e35ab618d9fab47b36cb78cd)

    do_move_mount(): fix an unsafe use of is_anon_ns()
    (cherry-picked from commit 05883eee857eab4693e7d13ebab06716475c5754)

    vfs: move_mount: reject moving kernel internal mounts
    (cherry-picked from commit 570d7a98e7d6d5d8706d94ffd2d40adeaa318332)

    https://jira.sw.ru/browse/PSBM-107263

    Signed-off-by: Pavel Tikhomirov
---
 fs/namespace.c | 63 +++---
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 51cacd439590..d355b5921d1e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1789,10 +1789,16 @@ void dissolve_on_fput(struct vfsmount *mnt)
 	namespace_lock();
 	lock_mount_hash();
 	ns = real_mount(mnt)->mnt_ns;
-	umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+	if (ns) {
+		if (is_anon_ns(ns))
+			umount_tree(real_mount(mnt), UMOUNT_CONNECTED);
+		else
+			ns = NULL;
+	}
 	unlock_mount_hash();
 	namespace_unlock();
-	free_mnt_ns(ns);
+	if (ns)
+		free_mnt_ns(ns);
 }
 
 void drop_collected_mounts(struct vfsmount *mnt)
@@ -2000,6 +2006,10 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 		attach_mnt(source_mnt, dest_mnt, dest_mp);
 		touch_mnt_namespace(source_mnt->mnt_ns);
 	} else {
+		if (source_mnt->mnt_ns) {
+			/* move from anon - the caller will destroy */
+			list_del_init(&source_mnt->mnt_ns->list);
+		}
 		mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
 		commit_tree(source_mnt);
 	}
@@ -2467,13 +2477,37 @@ static int do_set_group(struct path *path, const char *sibling_name)
 	return err;
 }
 
+/*
+ * Check that there aren't references to earlier/same mount namespaces in the
+ * specified subtree. Such references can act as pins for mount namespaces
+ * that aren't checked by the mount-cycle checking code, thereby allowing
+ * cycles to be made.
+ */
+static bool check_for_nsfs_mounts(struct mount *subtree)
+{
+	struct mount *p;
+	bool ret = false;
+
+	lock_mount_hash();
+	for (p = subtree; p; p = next_mnt(p, subtree))
+		if (mnt_ns_loop(p->mnt.mnt_root))
+			goto out;
+
+	ret = true;
+out:
+	unlock_mount_hash();
+	return ret;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct path parent_path = {.mnt = NULL, .dentry = NULL};
+	struct mnt_namespace *ns;
 	struct mount *p;
 	struct mount *old;
 	struct mountpoint *mp;
 	int err;
+	bool attached;
 
 	mp = lock_mount(new_path);
 	if (IS_ERR(mp))
@@ -2481,12 +2515,20 @@ static int do_move_mount(struct path *old_path, struct path *new_path)
 
 	old = real_mount(old_path->mnt);
 	p = real_mount(new_path->mnt);
+	attached = mnt_has_parent(old);
+	ns = old->mnt_ns;
 
 	err = -EINVAL;
-	if (!check_mnt(p) || !check_mnt(old))
+	/* The mountpoint must be in our namespace. */
+	if (!check_mnt(p))
[Devel] [PATCH RHEL8 COMMIT] ms/vfs: syscall: Add move_mount(2) to move mounts around
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit d64c6358efec5c13a55bc4dddf9d614634d6d892
Author: David Howells
Date: Tue Sep 22 18:56:57 2020 +0300

    ms/vfs: syscall: Add move_mount(2) to move mounts around

    Patchset description:

    These syscalls were added as a preparation step for the new mount API
    (fsopen, fsconfig, fsmount and fspick will be ported separately).

    We can use them to implement "cross-namespace bind-mounting" like this:

      fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
      setns(nsfd, CLONE_NEWNS);
      move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

    This will allow us to implement adding bind mounts to a running
    container instead of relying on unreliable external propagation.

    The version for VZ8 is slightly different from the VZ7 version.

    https://jira.sw.ru/browse/PSBM-107263

    Current patch description:

    From: David Howells

    Add a move_mount() system call that will move a mount from one place to
    another and, in the next commit, allow attaching an unattached mount tree.

    The new system call looks like the following:

      int move_mount(int from_dfd, const char *from_path,
                     int to_dfd, const char *to_path,
                     unsigned int flags);

    Signed-off-by: David Howells
    cc: linux-...@vger.kernel.org
    Signed-off-by: Al Viro

    vfs: syscall: Add move_mount(2) to move mounts around
    (cherry-picked from commit 2db154b3ea8e14b04fee23e3fdfd5e9d17fbc6ae)

    uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
    (cherry-picked from commit 9c8ad7a2ff0bfe58f019ec0abc1fb965114dde7d)

    selinux: fix regression introduced by move_mount(2) syscall
    (cherry-picked from commit 98aa00345de54b8340dc2ddcd87f446d33387b5e)

    https://jira.sw.ru/browse/PSBM-107263

    Signed-off-by: Pavel Tikhomirov
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/namespace.c                         | 126 +
 include/linux/lsm_hooks.h              |   6 ++
 include/linux/security.h               |   7 ++
 include/linux/syscalls.h               |   3 +
 include/uapi/linux/fs.h                |  11 +++
 security/security.c                    |   5 ++
 security/selinux/hooks.c               |  10 +++
 9 files changed, 139 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 103079ec2891..ec3e619444ee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -402,3 +402,4 @@
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
 428	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
+429	i386	move_mount		sys_move_mount			__ia32_sys_move_mount
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5772d5b0f1a6..640ff4463a21 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,7 @@
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
 428	common	open_tree		__x64_sys_open_tree
+429	common	move_mount		__x64_sys_move_mount
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/namespace.c b/fs/namespace.c
index a669502c450b..51cacd439590 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2467,72 +2467,81 @@ static int do_set_group(struct path *path, const char *sibling_name)
 	return err;
 }
 
-static int do_move_mount(struct path *path, const char *old_name)
+static int do_move_mount(struct path *old_path, struct path *new_path)
 {
-	struct path old_path, parent_path;
+	struct path parent_path = {.mnt = NULL, .dentry = NULL};
 	struct mount *p;
 	struct mount *old;
 	struct mountpoint *mp;
 	int err;
-	if (!old_name || !*old_name)
-		return -EINVAL;
-	err = kern_path(old_name, LOOKUP_FOLLOW, &old_path);
-	if (err)
-		return err;
 
-	mp = lock_mount(path);
-	err = PTR_ERR(mp);
+	mp = lock_mount(new_path);
 	if (IS_ERR(mp))
-		goto out;
+		return PTR_ERR(mp);
 
-	old = real_mount(old_path.mnt);
-	p = real_mount(path->mnt);
+	old = real_mount(old_path->mnt);
+	p = real_mount(new_path->mnt);
 
 	err = -EINVAL;
 	if (!check_mnt(p) || !check_mnt(old))
-		goto out1;
+
[Devel] [PATCH RHEL8 COMMIT] ms/vfs: syscall: Add open_tree(2) to reference or clone a mount
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit cd9f3d61b5bbf7ad81dbcba920834369b878b1a9
Author: Pavel Tikhomirov
Date: Tue Sep 22 16:02:16 2020 +0300

    ms/vfs: syscall: Add open_tree(2) to reference or clone a mount

    Patchset description:

    These syscalls were added as a preparation step for the new mount API
    (fsopen, fsconfig, fsmount and fspick will be ported separately).

    We can use them to implement "cross-namespace bind-mounting" like this:

      fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
      setns(nsfd, CLONE_NEWNS);
      move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

    This will allow us to implement adding bind mounts to a running
    container instead of relying on unreliable external propagation.

    The version for VZ8 is slightly different from the VZ7 version.

    https://jira.sw.ru/browse/PSBM-107263

    Current patch description:

    From: Al Viro

    open_tree(dfd, pathname, flags)

    Returns an O_PATH-opened file descriptor or an error.
    dfd and pathname specify the location to open, in the usual
    fashion (see e.g. fstatat(2)). flags should be an OR of
    some of the following:
      * AT_EMPTY_PATH, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
        same meanings as usual
      * OPEN_TREE_CLOEXEC - make the resulting descriptor
        close-on-exec
      * OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
        instead of opening the location in question, create a detached
        mount tree matching the subtree rooted at the location specified by
        dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
        without it only the part within the mount containing the
        location in question. In other words, the same as mount --rbind
        or mount --bind would have taken. The detached tree will be
        dissolved on the final close of the obtained file. Creation of such
        detached trees requires the same capabilities as doing mount --bind.

    Signed-off-by: Al Viro
    Signed-off-by: David Howells
    cc: linux-...@vger.kernel.org
    Signed-off-by: Al Viro

    vfs: syscall: Add open_tree(2) to reference or clone a mount
    (cherry-picked from commit a07b20004793d8926f78d63eb5980559f7813404)

    uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
    (cherry-picked from commit 9c8ad7a2ff0bfe58f019ec0abc1fb965114dde7d)

    fs/namespace: add __user to open_tree and move_mount syscalls
    (cherry-picked from commit 2658ce095df583cdf9ede475ec4da0b3cc7f7b05)

    https://jira.sw.ru/browse/PSBM-107263

    Signed-off-by: Pavel Tikhomirov
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/file_table.c                        |   9 +-
 fs/internal.h                          |   1 +
 fs/namespace.c                         | 157 -
 include/linux/fs.h                     |   3 +
 include/linux/syscalls.h               |   1 +
 include/uapi/linux/fcntl.h             |   1 +
 include/uapi/linux/fs.h                |   6 ++
 9 files changed, 155 insertions(+), 25 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2eefd2a7c1ce..103079ec2891 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -401,3 +401,4 @@
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
 427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
+428	i386	open_tree		sys_open_tree			__ia32_sys_open_tree
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 65c026185e61..5772d5b0f1a6 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -346,6 +346,7 @@
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
 427	common	io_uring_register	__x64_sys_io_uring_register
+428	common	open_tree		__x64_sys_open_tree
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/file_table.c b/fs/file_table.c
index 2931252f47ae..4c8a5d845a1c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -245,6 +245,7 @@ static void __fput(struct file *file)
 	struct dentry *dentry = file->f_path.dentry;
 	struct vfsmount *mnt = file->f_path.mnt;
 	struct inode *inode = file->f_inode;
+	fmode_t mode = file->f_mode;
 
 	if (unlikely(!(file->f_mode & FMODE_OPENED)))
 		goto out;
@@ -267,18 +268,20 @@ static void __fput(struct
[Devel] [PATCH RHEL8 COMMIT] ms/saner handling of temporary namespaces
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit 51865b4696675b8be4bdefd61e7b2a3310b6a45d
Author: Al Viro
Date: Tue Sep 22 15:37:58 2020 +0300

    ms/saner handling of temporary namespaces

    Patchset description:

    These syscalls were added as a preparation step for the new mount API
    (fsopen, fsconfig, fsmount and fspick will be ported separately).

    We can use them to implement "cross-namespace bind-mounting" like this:

      fd = open_tree(AT_FDCWD, "/mnt", OPEN_TREE_CLONE);
      setns(nsfd, CLONE_NEWNS);
      move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);

    This will allow us to implement adding bind mounts to a running
    container instead of relying on unreliable external propagation.

    The version for VZ8 is slightly different from the VZ7 version.

    https://jira.sw.ru/browse/PSBM-107263

    Current patch description:

    From: Al Viro

    mount_subtree() creates (and soon destroys) a temporary namespace,
    so that automounts could function normally. These beasts should
    never become anyone's current namespaces; they don't, but it would
    be better to make prevention of that more straightforward. And
    since they don't become anyone's current namespace, we don't need
    to bother with reserving procfs inums for those.

    Teach alloc_mnt_ns() to skip inum allocation if told so, adjust
    put_mnt_ns() accordingly, make mount_subtree() use a temporary
    (anon) namespace. is_anon_ns() checks if a namespace is such.

    Signed-off-by: Al Viro

    (cherry-picked from commit 74e831221cfd79460ec11c1b641093863f0ef3ce)
    https://jira.sw.ru/browse/PSBM-107263
    Signed-off-by: Pavel Tikhomirov
---
 fs/mount.h     |  5
 fs/namespace.c | 74 +++---
 2 files changed, 40 insertions(+), 39 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index f39bc9da4d73..6250de544760 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -146,3 +146,8 @@ static inline bool is_local_mountpoint(struct dentry *dentry)
 
 	return __is_local_mountpoint(dentry);
 }
+
+static inline bool is_anon_ns(struct mnt_namespace *ns)
+{
+	return ns->seq == 0;
+}
diff --git a/fs/namespace.c b/fs/namespace.c
index 1018ae0efa06..22589d59f476 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2862,7 +2862,8 @@ static void dec_mnt_namespaces(struct ucounts *ucounts)
 
 static void free_mnt_ns(struct mnt_namespace *ns)
 {
-	ns_free_inum(&ns->ns);
+	if (!is_anon_ns(ns))
+		ns_free_inum(&ns->ns);
 	dec_mnt_namespaces(ns->ucounts);
 	put_user_ns(ns->user_ns);
 	kfree(ns);
@@ -2877,7 +2878,7 @@ static void free_mnt_ns(struct mnt_namespace *ns)
  */
 static atomic64_t mnt_ns_seq = ATOMIC64_INIT(1);
 
-static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
+static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns, bool anon)
 {
 	struct mnt_namespace *new_ns;
 	struct ucounts *ucounts;
@@ -2887,28 +2888,27 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
 	if (!ucounts)
 		return ERR_PTR(-ENOSPC);
 
-	new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
+	new_ns = kzalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
 	if (!new_ns) {
 		dec_mnt_namespaces(ucounts);
 		return ERR_PTR(-ENOMEM);
 	}
-	ret = ns_alloc_inum(&new_ns->ns);
-	if (ret) {
-		kfree(new_ns);
-		dec_mnt_namespaces(ucounts);
-		return ERR_PTR(ret);
+	if (!anon) {
+		ret = ns_alloc_inum(&new_ns->ns);
+		if (ret) {
+			kfree(new_ns);
+			dec_mnt_namespaces(ucounts);
+			return ERR_PTR(ret);
+		}
 	}
 	new_ns->ns.ops = &mntns_operations;
-	new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
+	if (!anon)
+		new_ns->seq = atomic64_add_return(1, &mnt_ns_seq);
 	atomic_set(&new_ns->count, 1);
-	new_ns->root = NULL;
 	INIT_LIST_HEAD(&new_ns->list);
 	init_waitqueue_head(&new_ns->poll);
-	new_ns->event = 0;
 	new_ns->user_ns = get_user_ns(user_ns);
 	new_ns->ucounts = ucounts;
-	new_ns->mounts = 0;
-	new_ns->pending_mounts = 0;
 	return new_ns;
 }
 
@@ -2932,7 +2932,7 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
 
 	old = ns->root;
 
-	new_ns = alloc_mnt_ns(user_ns);
+	new_ns = alloc_mnt_ns(user_ns, false);
 	if (IS_ERR(new_ns))
 		return new_ns;
 
@@ -2987,37 +2987,25 @@ struct mnt_namespace *copy_mnt_ns(unsigned long flags, struct mnt_namespace *ns,
 	return new_ns;
 }
 
-/**
- * create_mnt_ns - creates a private namespace and adds a
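The convention this patch introduces (a zero sequence number marks an anonymous namespace, and only non-anonymous namespaces own an inode number) can be modelled in userspace. The sketch below is illustrative only: all demo_* names are hypothetical, the counters are plain integers rather than atomics, and the inum is simulated by a usage counter instead of ns_alloc_inum().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a mount namespace; not the kernel structure. */
struct demo_ns {
	uint64_t seq;      /* 0 marks an anonymous (temporary) namespace */
	unsigned int inum; /* only assigned to non-anonymous namespaces */
};

uint64_t demo_ns_seq = 1;        /* starts at 1, like mnt_ns_seq */
unsigned int demo_next_inum = 100;
unsigned int demo_inums_in_use;  /* simulated inum allocator state */

bool demo_is_anon_ns(const struct demo_ns *ns)
{
	return ns->seq == 0;
}

struct demo_ns *demo_alloc_ns(bool anon)
{
	/* calloc() zeroes the object, so an anon namespace keeps seq == 0. */
	struct demo_ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	if (!anon) {
		ns->inum = demo_next_inum++;
		demo_inums_in_use++;
		ns->seq = ++demo_ns_seq; /* never 0 for a regular namespace */
	}
	return ns;
}

void demo_free_ns(struct demo_ns *ns)
{
	assert(ns != NULL);
	/* Anonymous namespaces never had an inum, so there is none to release. */
	if (!demo_is_anon_ns(ns))
		demo_inums_in_use--;
	free(ns);
}
```

The switch from kmalloc() to kzalloc() in the patch plays the same role as calloc() here: the "seq is zero" test is only reliable if the field starts out zeroed.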
[Devel] [PATCH RHEL8 COMMIT] modules: use kvmalloc when creating sysfs attributes for ELF sections
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit 9175123cf76678492bf559facbce177e90e5c07e
Author: Evgenii Shatokhin
Date: Tue Sep 22 15:31:59 2020 +0300

    modules: use kvmalloc when creating sysfs attributes for ELF sections

    A kernel module containing a ReadyKernel patch could have lots of ELF
    sections: one for each new or patched function, one for each new static
    or global variable, etc.

    The kernel creates a sysfs file /sys/module//sections/ for each loaded
    section when the patch module is being loaded, see add_sect_attrs() in
    kernel/module.c. A big chunk of memory is allocated for all these with
    kzalloc.

    For the ReadyKernel patches we have already released, the amount of
    memory could be as high as 34528 bytes (48 + 80 * 431 loaded sections),
    which is a 4th order allocation. 3rd order allocations are also common
    here, see https://jira.sw.ru/browse/PSBM-95050.

    Not only is this a waste (contiguous memory is not needed there), but
    the allocation may also fail when the memory is fragmented. ReadyKernel
    patches are often used on systems with rather significant uptime, so
    fragmentation is possible.

    It would be better if the patch modules did not use too many ELF
    sections. However, the KPatch maintainers pointed out
    (https://github.com/dynup/kpatch/pull/1131) that the same problem would
    affect regular kernel modules as well after FGKASLR has been merged
    into the mainline kernel.

    Combining the sections of the kernel modules defeats the purpose of
    FGKASLR, so it was agreed that we probably should just switch to
    kvmalloc+kvfree in add_sect_attrs/free_sect_attrs.

    Details and discussion:
    https://www.spinics.net/lists/live-patching/msg06364.html

    https://jira.sw.ru/browse/PSBM-108017
    Signed-off-by: Evgenii Shatokhin
---
 kernel/module.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index d4702f0d711a..e58ad01de8bf 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1492,7 +1492,7 @@ static void free_sect_attrs(struct module_sect_attrs *sect_attrs)
 
 	for (section = 0; section < sect_attrs->nsections; section++)
 		kfree(sect_attrs->attrs[section].name);
-	kfree(sect_attrs);
+	kvfree(sect_attrs);
 }
 
 static void add_sect_attrs(struct module *mod, const struct load_info *info)
@@ -1510,7 +1510,7 @@ static void add_sect_attrs(struct module *mod, const struct load_info *info)
 			+ nloaded * sizeof(sect_attrs->attrs[0]),
 			sizeof(sect_attrs->grp.attrs[0]));
 	size[1] = (nloaded + 1) * sizeof(sect_attrs->grp.attrs[0]);
-	sect_attrs = kzalloc(size[0] + size[1], GFP_KERNEL);
+	sect_attrs = kvzalloc(size[0] + size[1], GFP_KERNEL);
 	if (sect_attrs == NULL)
 		return;
 
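The "4th order" figure in the commit message follows from the size arithmetic. The sketch below reproduces it in userspace under stated assumptions: 4 KiB pages, and the 48-byte header and 80 bytes per section quoted in the message. The demo_* helpers are hypothetical; demo_get_order() only mimics what the kernel's get_order() computes.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u /* assumption: 4 KiB pages, as on x86 */

/* Smallest order such that (DEMO_PAGE_SIZE << order) >= size. */
unsigned int demo_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = DEMO_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Size quoted in the commit message: 48 bytes plus 80 bytes per loaded section. */
size_t demo_sect_attrs_size(size_t nloaded)
{
	return 48 + 80 * nloaded;
}
```

For 431 loaded sections this gives 34528 bytes, which needs 9 pages and therefore rounds up to a 16-page (order 4) contiguous block with kzalloc(). Anything from 16385 to 32768 bytes lands in order 3. kvzalloc() can fall back to vmalloc, so it does not depend on finding such a contiguous block.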
[Devel] [PATCH RH7] mm, memcg: add oom counter to memory.stat memcgroup file #PSBM-107731
Add an oom counter to the memory.stat file. oom shows the number of oom
kills triggered by the cgroup's own memory limit. total_oom shows the
total number of oom kills triggered by the memory limits of the cgroup
and its sub-groups. memory.stat in the root cgroup counts global oom kills.

E.g.:
 # mkdir /sys/fs/cgroup/memory/test/
 # echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # echo 100M > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo $$ > /sys/fs/cgroup/memory/test/tasks
 # ./vm-scalability/usemem -O 200M
 # grep oom /sys/fs/cgroup/memory/test/memory.stat
 oom 1
 total_oom 1
 # echo -1 > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo -1 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # ./vm-scalability/usemem -O 1000G
 # grep oom /sys/fs/cgroup/memory/memory.stat
 oom 1
 total_oom 2

https://jira.sw.ru/browse/PSBM-107731
Signed-off-by: Andrey Ryabinin
---
 mm/memcontrol.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6587cc2ef019..fe06c7db2ad3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,6 +400,7 @@ struct mem_cgroup {
 	struct mem_cgroup_stat_cpu __percpu *stat;
 	struct mem_cgroup_stat2_cpu stat2;
 	spinlock_t pcp_counter_lock;
+	atomic_long_t oom;
 
 	atomic_t	dead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -2005,6 +2006,7 @@ void mem_cgroup_note_oom_kill(struct mem_cgroup *root_memcg,
 		if (memcg == root_memcg)
 			break;
 	}
+	atomic_long_inc(&root_memcg->oom);
 
 	if (memcg_to_put)
 		css_put(&memcg_to_put->css);
@@ -5691,6 +5693,7 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 	for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
 		seq_printf(m, "%s %lu\n", mem_cgroup_events_names[i],
 			   mem_cgroup_read_events(memcg, i));
+	seq_printf(m, "oom %lu\n", atomic_long_read(&memcg->oom));
 
 	for (i = 0; i < NR_LRU_LISTS; i++)
 		seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i],
@@ -5733,6 +5736,12 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 		seq_printf(m, "total_%s %llu\n",
 			   mem_cgroup_events_names[i], val);
 	}
+	{
+		unsigned long val = 0;
+		for_each_mem_cgroup_tree(mi, memcg)
+			val += atomic_long_read(&mi->oom);
+		seq_printf(m, "total_oom %lu\n", val);
+	}
 
 	for (i = 0; i < NR_LRU_LISTS; i++) {
 		unsigned long long val = 0;
-- 
2.26.2
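The relation between oom and total_oom can be shown with a small tree model. This is a userspace sketch with hypothetical demo_* names, not the memcg code: each group carries its own counter, and the total is the sum over the group and all of its descendants, which is what the for_each_mem_cgroup_tree() loop in the patch computes.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_CHILDREN 4

/* Hypothetical model of a memory cgroup with a per-group oom counter. */
struct demo_memcg {
	unsigned long oom; /* kills triggered by this group's own limit */
	struct demo_memcg *children[DEMO_MAX_CHILDREN];
	size_t nr_children;
};

/* Record one oom kill against the group whose limit triggered it. */
void demo_note_oom_kill(struct demo_memcg *cg)
{
	assert(cg != NULL);
	cg->oom++;
}

/* Sum the counter over the group and its whole subtree ("total_oom"). */
unsigned long demo_total_oom(const struct demo_memcg *cg)
{
	unsigned long val = cg->oom;

	for (size_t i = 0; i < cg->nr_children; i++)
		val += demo_total_oom(cg->children[i]);
	return val;
}
```

With a root group that has one child, one kill recorded on the child and one on the root give oom 1 and total_oom 1 for the child, and oom 1 and total_oom 2 for the root, matching the shell session in the commit message.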
[Devel] [PATCH RHEL8 COMMIT] ve/perf: forbid perf events syscall in containers
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit ed1fc404e6904702b4caded62e1e70e7420e12be
Author: Pavel Tikhomirov
Date: Tue Sep 22 15:27:06 2020 +0300

    ve/perf: forbid perf events syscall in containers

    If a process has a perf_event_open fd, it can monitor different
    (kernel, hardware, etc.) performance counters through it. This fd is
    configured through perf_event_attr, which has more than 30 fields.
    There is currently no kernel interface to get the configuration of an
    existing perf event fd, so to dump such an fd with CRIU we would have
    to add this interface.

    We have ovs-vswitchd, which opens a perf event fd and, according to its
    comments, does nothing with it: it is only used when someone uses the
    PERF() macros to debug some code parts, which implies recompiling ovs.
    It is still a problem on migration, because CRIU detects this fd and
    fails. ovs also handles a failure to open the perf event fd and falls
    back gracefully to working without it.

    So (at least for now) we should forbid this interface, to fix problems
    with ovs daemon migration.

    https://jira.sw.ru/browse/PSBM-107217
    Signed-off-by: Pavel Tikhomirov
---
 kernel/events/core.c | 4
 1 file changed, 4 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 61b0e1dfdebe..17066990a235 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 
 #include "internal.h"
 
@@ -10874,6 +10875,9 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (flags & ~PERF_FLAG_ALL)
 		return -EINVAL;
 
+	if (!ve_is_super(get_exec_env()))
+		return -EACCES;
+
 	err = perf_copy_attr(attr_uptr, &attr);
 	if (err)
 		return err;
[Devel] [PATCH RHEL8 COMMIT] net: openvswitch: add capability to specify ifindex of new links
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.6
-->
commit 5e825b7e975d0593a648f467c585bcaded7036b3
Author: Andrey Zhadchenko
Date: Tue Sep 22 15:07:25 2020 +0300

    net: openvswitch: add capability to specify ifindex of new links

    CRIU preserves the ifindexes of net devices across restore, but the
    current Open vSwitch API cannot do that. We need to modify it, because:
    - Restoring net devices with a random ifindex leads to extra work
      to restore master relationships.
    - An OVS device taking another link's ifindex will likely cause
      problems and may compound the previous point.
    - Although the OVS daemon is not supported yet, it holds some tables
      that contain ifindexes.

    Open vSwitch creates several net devices, but unlike the rtnetlink API
    it has no option to specify the ifindex of a link. This is crucial for
    CRIU during the restore stage.

    Use ovs_header->dp_ifindex during OVS_DP_CMD_NEW as the desired ifindex.
    Use OVS_VPORT_ATTR_IFINDEX during OVS_VPORT_CMD_NEW to specify the new
    netdev ifindex. Neither value was relevant for the corresponding
    request before, so existing software won't be affected.

    https://jira.sw.ru/browse/PSBM-105844
    Signed-off-by: Andrey Zhadchenko
---
 net/openvswitch/datapath.c           | 16 ++--
 net/openvswitch/vport-internal_dev.c |  1 +
 net/openvswitch/vport.h              |  2 ++
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 900b16e668a1..a720430daaa3 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1600,6 +1600,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	struct vport *vport;
 	struct ovs_net *ovs_net;
 	int err, i;
+	struct ovs_header *ovs_header = info->userhdr;
 
 	err = -EINVAL;
 	if (!a[OVS_DP_ATTR_NAME] || !a[OVS_DP_ATTR_UPCALL_PID])
@@ -1649,6 +1650,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	parms.dp = dp;
 	parms.port_no = OVSP_LOCAL;
 	parms.upcall_portids = a[OVS_DP_ATTR_UPCALL_PID];
+	parms.desired_ifindex = ovs_header->dp_ifindex;
 
 	err = ovs_dp_change(dp, a);
 	if (err)
@@ -2044,7 +2046,10 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	if (!a[OVS_VPORT_ATTR_NAME] || !a[OVS_VPORT_ATTR_TYPE] ||
 	    !a[OVS_VPORT_ATTR_UPCALL_PID])
 		return -EINVAL;
-	if (a[OVS_VPORT_ATTR_IFINDEX])
+
+	parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
+
+	if (a[OVS_VPORT_ATTR_IFINDEX] && parms.type != OVS_VPORT_TYPE_INTERNAL)
 		return -EOPNOTSUPP;
 
 	port_no = a[OVS_VPORT_ATTR_PORT_NO]
@@ -2081,12 +2086,19 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	}
 
 	parms.name = nla_data(a[OVS_VPORT_ATTR_NAME]);
-	parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
 	parms.options = a[OVS_VPORT_ATTR_OPTIONS];
 	parms.dp = dp;
 	parms.port_no = port_no;
 	parms.upcall_portids = a[OVS_VPORT_ATTR_UPCALL_PID];
 
+	if (parms.type == OVS_VPORT_TYPE_INTERNAL) {
+		if (a[OVS_VPORT_ATTR_IFINDEX])
+			parms.desired_ifindex =
+				nla_get_u32(a[OVS_VPORT_ATTR_IFINDEX]);
+		else
+			parms.desired_ifindex = 0;
+	}
+
 	vport = new_vport(&parms);
 	err = PTR_ERR(vport);
 	if (IS_ERR(vport)) {
diff --git a/net/openvswitch/vport-internal_dev.c b/net/openvswitch/vport-internal_dev.c
index 3ebf8ba7c389..a9bb6e5e11ad 100644
--- a/net/openvswitch/vport-internal_dev.c
+++ b/net/openvswitch/vport-internal_dev.c
@@ -200,6 +200,7 @@ static struct vport *internal_dev_create(const struct vport_parms *parms)
 	if (vport->port_no == OVSP_LOCAL)
 		vport->dev->features |= NETIF_F_NETNS_LOCAL;
 
+	dev->ifindex = parms->desired_ifindex;
 	rtnl_lock();
 	err = register_netdevice(vport->dev);
 	if (err)
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index cda66c26ad08..c5281b52f489 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -109,6 +109,8 @@ struct vport_parms {
 	enum ovs_vport_type type;
 	struct nlattr *options;
 
+	int desired_ifindex;
+
 	/* For ovs_vport_alloc(). */
 	struct datapath *dp;
 	u16 port_no;
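The decision made in ovs_vport_cmd_new() can be condensed into one function for illustration. This is a userspace sketch with hypothetical demo_* names and enum values, not the OVS code. It assumes that a desired ifindex of 0 means "let the kernel choose", which is how the patch treats a missing attribute.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OVS vport types; values are arbitrary. */
enum demo_vport_type {
	DEMO_VPORT_NETDEV = 1,
	DEMO_VPORT_INTERNAL = 2,
};

/*
 * Decide which ifindex to request for a new vport.
 * Returns 0 and stores the result in *desired, or -EOPNOTSUPP when an
 * ifindex is supplied for a vport type that cannot honour it.
 */
int demo_pick_ifindex(enum demo_vport_type type, bool has_ifindex_attr,
		      int ifindex_attr, int *desired)
{
	assert(desired != NULL);

	/* Only internal ports create their own netdev, so only they may ask. */
	if (has_ifindex_attr && type != DEMO_VPORT_INTERNAL)
		return -EOPNOTSUPP;

	if (type == DEMO_VPORT_INTERNAL && has_ifindex_attr)
		*desired = ifindex_attr;
	else
		*desired = 0; /* assumption: 0 leaves the choice to the kernel */
	return 0;
}
```

Rejecting the attribute for non-internal types keeps the old behaviour for those requests, while internal ports gain the optional ifindex that restore needs.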
[Devel] [PATCH RHEL7 COMMIT]
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.27
-->
Re: [Devel] [PATCH RHEL7] cgroup: fixed NULL-pointer dereference in cgroup_release_agent
On 21.09.2020 12:18, Valeriy Vdovin wrote: > The fix checks that ve->init_task is not referenced during warning > message decision if ve == ve0, because ve0 init_task is always NULL. > > https://jira.sw.ru/browse/PSBM-107673 > Signed-off-by: Valeriy Vdovin Reviewed-by: Kirill Tkhai > --- > kernel/cgroup.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/cgroup.c b/kernel/cgroup.c > index 691505c..27d7a5e 100644 > --- a/kernel/cgroup.c > +++ b/kernel/cgroup.c > @@ -5934,7 +5934,7 @@ void cgroup_release_agent(struct work_struct *work) > envp, UMH_WAIT_EXEC, NULL, NULL, NULL); > > ve_task = ve->init_task; > - if (err < 0 && (!(ve_task->flags & PF_EXITING))) > + if (err < 0 && (ve == &ve0 || !(ve_task->flags & PF_EXITING))) > pr_warn_ratelimited("cgroup release_agent " > "%s %s failed: %d\n", > agentbuf, pathbuf, err); > ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7] KVM: LAPIC: Fix pv ipis use-before-initialization
From: Wanpeng Li Reported by syzkaller: BUG: unable to handle kernel NULL pointer dereference at 0014 PGD 80040410c067 P4D 80040410c067 PUD 40410d067 PMD 0 Oops: [#1] PREEMPT SMP PTI CPU: 3 PID: 2567 Comm: poc Tainted: G OE 4.19.0-rc5 #16 RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm] Call Trace: kvm_emulate_hypercall+0x3cc/0x700 [kvm] handle_vmcall+0xe/0x10 [kvm_intel] vmx_handle_exit+0xc1/0x11b0 [kvm_intel] vcpu_enter_guest+0x9fb/0x1910 [kvm] kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm] kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm] do_vfs_ioctl+0xa5/0x690 ksys_ioctl+0x6d/0x80 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x83/0x6e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe The reason is that the apic map has not yet been initialized, the testcase triggers pv_send_ipi interface by vmcall which results in kvm->arch.apic_map is dereferenced. This patch fixes it by checking whether or not apic map is NULL and bailing out immediately if that is the case. Fixes: 4180bf1b65 (KVM: X86: Implement "send IPI" hypercall) Reported-by: Wei Wu Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Wei Wu Signed-off-by: Wanpeng Li Cc: sta...@vger.kernel.org Signed-off-by: Paolo Bonzini (cherry-picked from commit 38ab012f109caf10f471db1adf284e620dd8d701) https://jira.sw.ru/browse/PSBM-107931 Signed-off-by: Valeriy.Vdovin --- arch/x86/kvm/lapic.c | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 740be89..f433199 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -566,6 +566,11 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low, rcu_read_lock(); map = rcu_dereference(kvm->arch.apic_map); + if (unlikely(!map)) { + count = -EOPNOTSUPP; + goto out; + } + if (min > map->max_apic_id) goto out; /* Bits above cluster_size are masked in the caller. */ -- 1.8.3.1 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
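The early bail-out in the patch above can be pictured with a small user-space model. This is only a sketch: apic_map_guard_model and send_ipi_guard_model are hypothetical names, and the "delivery" is reduced to returning 1.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's apic map; only the bound is kept. */
struct apic_map_guard_model {
	unsigned int max_apic_id;
};

/*
 * Model of the guard order in kvm_pv_send_ipi() after the patch:
 * first refuse to touch a map that has not been set up yet,
 * then apply the existing range check, and only then "deliver".
 */
static int send_ipi_guard_model(const struct apic_map_guard_model *map,
				unsigned int min)
{
	if (!map)
		return -EOPNOTSUPP;	/* map not initialized yet */

	if (min > map->max_apic_id)
		return 0;		/* nothing in range, nothing delivered */

	return 1;			/* placeholder for one delivered IPI */
}
```

The point of the ordering is that map->max_apic_id is itself a dereference, so the NULL test has to come before the range check.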
[Devel] [PATCH RHEL7 COMMIT] KVM: LAPIC: Fix pv ipis out-of-bounds access
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-1127.18.2.vz7.163.27 --> commit 69be0b6e4ce2dee6e42bf89a7497d1fef2c4e2d0 Author: Wanpeng Li Date: Tue Sep 22 10:58:11 2020 +0300 KVM: LAPIC: Fix pv ipis out-of-bounds access Dan Carpenter reported that the untrusted data returns from kvm_register_read() results in the following static checker warning: arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi() error: buffer underflow 'map->phys_map' 's32min-s32max' KVM guest can easily trigger this by executing the following assembly sequence in Ring0: mov $10, %rax mov $0x, %rbx mov $0x, %rdx mov $0, %rsi vmcall As this will cause KVM to execute the following code-path: vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi() which will reach out-of-bounds access. This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id, ignoring destinations that are not present and delivering the rest. We also check whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to the max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC ID. Reported-by: Dan Carpenter Reviewed-by: Liran Alon Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Liran Alon Cc: Dan Carpenter Signed-off-by: Wanpeng Li [Add second "if (min > map->max_apic_id)" to complete the fix.
-Radim] Signed-off-by: Radim Krčmář (cherry picked from commit bdf7ffc89922a52a4f08a12f7421ea24bb7626a0) https://jira.sw.ru/browse/PSBM-107931 Signed-off-by: Valeriy Vdovin --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/lapic.c| 27 --- 2 files changed, 21 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 50817dc3..e9ee080 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1434,7 +1434,7 @@ void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm, u64 kvm_get_arch_capabilities(void); int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low, - unsigned long ipi_bitmap_high, int min, + unsigned long ipi_bitmap_high, u32 min, unsigned long icr, int op_64_bit); void kvm_define_shared_msr(unsigned index, u32 msr); diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 1487fe2..740be89 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -543,7 +543,7 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq, } int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low, - unsigned long ipi_bitmap_high, int min, + unsigned long ipi_bitmap_high, u32 min, unsigned long icr, int op_64_bit) { int i; @@ -566,18 +566,31 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low, rcu_read_lock(); map = rcu_dereference(kvm->arch.apic_map); + if (min > map->max_apic_id) + goto out; /* Bits above cluster_size are masked in the caller.
*/ - for_each_set_bit(i, &ipi_bitmap_low, BITS_PER_LONG) { - vcpu = map->phys_map[min + i]->vcpu; - count += kvm_apic_set_irq(vcpu, &irq, NULL); + for_each_set_bit(i, &ipi_bitmap_low, + min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) { + if (map->phys_map[min + i]) { + vcpu = map->phys_map[min + i]->vcpu; + count += kvm_apic_set_irq(vcpu, &irq, NULL); + } } min += cluster_size; - for_each_set_bit(i, &ipi_bitmap_high, BITS_PER_LONG) { - vcpu = map->phys_map[min + i]->vcpu; - count += kvm_apic_set_irq(vcpu, &irq, NULL); + + if (min > map->max_apic_id) + goto out; + + for_each_set_bit(i, &ipi_bitmap_high, + min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) { + if (map->phys_map[min + i]) { + vcpu = map->phys_map[min + i]->vcpu; + count += kvm_apic_set_irq(vcpu, &irq, NULL); + } } +out: rcu_read_unlock(); return count; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
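To make the bounds logic of this fix easier to follow, here is a user-space sketch of one bitmap walk. It is an illustration only: phys_map_model, deliver_cluster_model and MODEL_BITS_PER_LONG are hypothetical names, the bitmap is fixed at 64 bits, and "delivery" is reduced to counting the destinations that exist.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_BITS_PER_LONG 64u	/* assumption: 64-bit long, as on x86_64 */

/* Hypothetical model: present[id] is non-zero when an APIC with that id exists. */
struct phys_map_model {
	uint32_t max_apic_id;
	const unsigned char *present;	/* max_apic_id + 1 entries */
};

/*
 * Walk the set bits of one bitmap, where bit i addresses APIC id min + i.
 * The walk is clamped so that min + i never exceeds max_apic_id, and
 * absent (sparse) ids are skipped instead of being dereferenced.
 */
static int deliver_cluster_model(const struct phys_map_model *map,
				 uint64_t bitmap, uint32_t min)
{
	uint32_t limit, i;
	int count = 0;

	if (min > map->max_apic_id)
		return 0;			/* whole cluster is out of range */

	limit = map->max_apic_id - min + 1;	/* ids still available from min */
	if (limit > MODEL_BITS_PER_LONG)
		limit = MODEL_BITS_PER_LONG;

	for (i = 0; i < limit; i++) {
		if (!(bitmap & (UINT64_C(1) << i)))
			continue;		/* destination not requested */
		if (map->present[min + i])
			count++;		/* destination exists: deliver */
	}
	return count;
}
```

Because min is unsigned in this model, as it is after the patch changes the parameter to u32, a guest-supplied negative value becomes a large number and is caught by the first range check.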
[Devel] [PATCH] KVM: LAPIC: Fix pv ipis out-of-bounds access
From: Wanpeng Li Dan Carpenter reported that the untrusted data returns from kvm_register_read() results in the following static checker warning: arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi() error: buffer underflow 'map->phys_map' 's32min-s32max' KVM guest can easily trigger this by executing the following assembly sequence in Ring0: mov $10, %rax mov $0x, %rbx mov $0x, %rdx mov $0, %rsi vmcall As this will cause KVM to execute the following code-path: vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi() which will reach out-of-bounds access. This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id, ignoring destinations that are not present and delivering the rest. We also check whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to the max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC ID. Reported-by: Dan Carpenter Reviewed-by: Liran Alon Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Liran Alon Cc: Dan Carpenter Signed-off-by: Wanpeng Li [Add second "if (min > map->max_apic_id)" to complete the fix. 
-Radim] Signed-off-by: Radim Krčmář (cherry picked from commit bdf7ffc89922a52a4f08a12f7421ea24bb7626a0) https://jira.sw.ru/browse/PSBM-107931 Signed-off-by: Valeriy Vdovin --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/lapic.c| 27 --- 2 files changed, 21 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 50817dc3..e9ee080 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1434,7 +1434,7 @@ void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm, u64 kvm_get_arch_capabilities(void); int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low, - unsigned long ipi_bitmap_high, int min, + unsigned long ipi_bitmap_high, u32 min, unsigned long icr, int op_64_bit); void kvm_define_shared_msr(unsigned index, u32 msr); diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 1487fe2..740be89 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -543,7 +543,7 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq, } int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low, - unsigned long ipi_bitmap_high, int min, + unsigned long ipi_bitmap_high, u32 min, unsigned long icr, int op_64_bit) { int i; @@ -566,18 +566,31 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low, rcu_read_lock(); map = rcu_dereference(kvm->arch.apic_map); + if (min > map->max_apic_id) + goto out; /* Bits above cluster_size are masked in the caller. 
*/ - for_each_set_bit(i, &ipi_bitmap_low, BITS_PER_LONG) { - vcpu = map->phys_map[min + i]->vcpu; - count += kvm_apic_set_irq(vcpu, &irq, NULL); + for_each_set_bit(i, &ipi_bitmap_low, + min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) { + if (map->phys_map[min + i]) { + vcpu = map->phys_map[min + i]->vcpu; + count += kvm_apic_set_irq(vcpu, &irq, NULL); + } } min += cluster_size; - for_each_set_bit(i, &ipi_bitmap_high, BITS_PER_LONG) { - vcpu = map->phys_map[min + i]->vcpu; - count += kvm_apic_set_irq(vcpu, &irq, NULL); + + if (min > map->max_apic_id) + goto out; + + for_each_set_bit(i, &ipi_bitmap_high, + min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) { + if (map->phys_map[min + i]) { + vcpu = map->phys_map[min + i]->vcpu; + count += kvm_apic_set_irq(vcpu, &irq, NULL); + } } +out: rcu_read_unlock(); return count; } -- 1.8.3.1 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RHEL7 COMMIT] cgroup: fixed NULL-pointer dereference in cgroup_release_agent
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-1127.18.2.vz7.163.27 --> commit 9c144d0325cefda39adb3736d6f5538e45e778a4 Author: Valeriy Vdovin Date: Tue Sep 22 10:32:30 2020 +0300 cgroup: fixed NULL-pointer dereference in cgroup_release_agent The fix checks that ve->init_task is not referenced during warning message decision if ve == ve0, because ve0 init_task is always NULL. https://jira.sw.ru/browse/PSBM-107673 Signed-off-by: Valeriy Vdovin --- kernel/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 691505c..27d7a5e 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -5934,7 +5934,7 @@ void cgroup_release_agent(struct work_struct *work) envp, UMH_WAIT_EXEC, NULL, NULL, NULL); ve_task = ve->init_task; - if (err < 0 && (!(ve_task->flags & PF_EXITING))) + if (err < 0 && (ve == &ve0 || !(ve_task->flags & PF_EXITING))) pr_warn_ratelimited("cgroup release_agent " "%s %s failed: %d\n", agentbuf, pathbuf, err); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
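The fix relies on the short-circuit evaluation of ||: when ve is the host environment, the right-hand operand that reads init_task->flags is never evaluated. A minimal user-space sketch of that condition follows. All names in it (ve_model, task_model, host_ve_model, MODEL_PF_EXITING and its value) are hypothetical stand-ins for the kernel structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PF_EXITING 0x4u	/* hypothetical flag value for this sketch */

struct task_model { unsigned int flags; };
struct ve_model { struct task_model *init_task; };

/* Stands in for ve0: its init_task is always NULL. */
static struct ve_model host_ve_model;

/*
 * Decide whether a failed release_agent run should be reported.
 * The identity test comes first, so for the host environment the NULL
 * init_task is never dereferenced; for a container the warning is
 * suppressed while its init task is exiting.
 */
static bool should_warn_model(const struct ve_model *ve, int err)
{
	return err < 0 &&
	       (ve == &host_ve_model ||
		!(ve->init_task->flags & MODEL_PF_EXITING));
}
```

In this model a failure in the host environment is always reported, which matches the intent stated in the commit message.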
[Devel] [PATCH RHEL7 COMMIT] x86_64: fix crashes due to bogus iret traps handling #PSBM-107794
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git after rh7-3.10.0-1127.18.2.vz7.163.27 --> commit 32457eef8a6680864624049df7ebdbcf53676a93 Author: Andrey Ryabinin Date: Tue Sep 22 10:32:22 2020 +0300 x86_64: fix crashes due to bogus iret traps handling #PSBM-107794 Our handling of bad irets seems to be broken since meltdown fix. When interrupt return to userspace fails we running with user CR3 thus faulting in error_sti on access to 'kernel_stack' variable. This continues with series of faults in page fault handler until we run out of stack and end up with: PANIC: double fault, error_code: 0x0 RIP: 0010:[] [] async_page_fault+0xd/0x30 Call Trace: ? smp_apic_timer_interrupt+0x48/0x60 ? apic_timer_interrupt+0x16a/0x170 ? bad_area+0x49/0x50 ? __do_page_fault+0x477/0x500 ? trace_do_page_fault+0x56/0x150 ? do_async_page_fault+0x22/0xf0 ? async_page_fault+0x28/0x30 ? .E_write_words+0x5c/0x641 ? putname+0x3d/0x60 ? timerqueue_add+0x60/0xb0 ? enqueue_hrtimer+0x25/0x80 ? hrtimer_start_range_ns+0x1fd/0x3c0 ? recalc_sigpending+0x1b/0x70 ? __set_task_blocked+0x41/0xa0 ? restore_altstack+0x18/0x30 ? sys_rt_sigreturn+0xe8/0x100 ? stub_rt_sigreturn+0x48/0x90 Backport the fix for this from RHEL 7.9 beta https://jira.sw.ru/browse/PSBM-107794 Signed-off-by: Andrey Ryabinin --- arch/x86/kernel/entry_64.S | 49 ++ 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 3e67d18..91e5503 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -949,12 +949,42 @@ irq_return: * when returning from IPI handler. */ INTERRUPT_RETURN + _ASM_EXTABLE(irq_return, bad_iret) #ifdef CONFIG_PARAVIRT ENTRY(native_iret) iretq + _ASM_EXTABLE(native_iret, bad_iret) #endif + .section .fixup,"ax" +bad_iret: + /* +* The iret traps when the %cs or %ss being restored is bogus. +* We've lost the original trap vector and error code. 
+* #GPF is the most likely one to get for an invalid selector. +* So pretend we completed the iret and took the #GPF in user mode. +* +* We are now running with the kernel GS after exception recovery. +* But error_entry expects us to have user GS to match the user %cs, +* so swap back. +*/ + pushq $0 + + /* +* If a kernel bug clears user CS bit and in turn we'll skip SWAPGS in +* general_protection, skip the SWAPGS here as well so we won't hard reboot. +* This increases robustness of bad_iret to kernel bugs as well. +*/ + testl $3, 8*2(%rsp) + je 1f + SWAPGS +1: + + jmp general_protection + + .previous + /* edi: workmask, edx: work */ retint_careful: CFI_RESTORE_STATE @@ -1550,15 +1580,16 @@ error_sti: /* * There are two places in the kernel that can potentially fault with - * usergs. Handle them here. B stepping K8s sometimes report a - * truncated RIP for IRET exceptions returning to compat mode. Check - * for these here too. + * usergs. Handle them here. The exception handlers after iret run with + * kernel gs again, so don't set the user space flag. B stepping K8s + * sometimes report an truncated RIP for IRET exceptions returning to + * compat mode. Check for these here too. */ error_kernelspace: incl %ebx leaq irq_return(%rip),%rcx cmpq %rcx,RIP+8(%rsp) - je error_bad_iret + je error_swapgs movl %ecx,%eax /* zero extend */ cmpq %rax,RIP+8(%rsp) je bstep_iret @@ -1570,15 +1601,7 @@ error_kernelspace: bstep_iret: /* Fix truncated RIP */ movq %rcx,RIP+8(%rsp) - /* fall through */ - -error_bad_iret: - SWAPGS - mov %rsp,%rdi - call fixup_bad_iret - mov %rax,%rsp - decl %ebx /* Return to usergs */ - jmp error_sti + jmp error_swapgs CFI_ENDPROC END(error_entry) ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel