[Devel] [PATCH rh7 2/2] mm/vmscan: add cond_resched() to loop in shrink_slab_memcg()
shrink_slab_memcg() may iterate for a long time without rescheduling if we
have many memcgs with a small number of objects. Add cond_resched() to avoid
a potential softlockup.

https://jira.sw.ru/browse/PSBM-125095
Signed-off-by: Andrey Ryabinin
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 080500f4e366..17a7ed60f525 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -527,6 +527,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		struct shrinker *shrinker;
 		bool is_nfs;
 
+		cond_resched();
+
 		shrinker = idr_find(&shrinker_idr, i);
 		if (unlikely(!shrinker)) {
 			clear_bit(i, map->map);
-- 
2.26.2
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
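The pattern the patch applies — yielding once per iteration so that a long walk over mostly-empty shrinker slots cannot monopolize the CPU — can be sketched in user space. This is only an illustration: sched_yield() stands in for cond_resched(), and all names here are made up for the example.

```c
#include <sched.h>
#include <stddef.h>

/* Iterate over an idr-like table of (possibly NULL) shrinker slots,
 * yielding the CPU on every iteration the way cond_resched() does in
 * shrink_slab_memcg(). Returns how many real shrinkers were visited. */
static int walk_shrinkers(void *const *slots, int n)
{
	int invoked = 0;
	int i;

	for (i = 0; i < n; i++) {
		sched_yield();	/* user-space stand-in for cond_resched() */
		if (!slots[i])	/* empty slot: nothing to shrink, keep going */
			continue;
		invoked++;
	}
	return invoked;
}

/* a table with gaps, like a sparsely populated shrinker_idr */
static int demo(void)
{
	void *slots[5] = { "a", NULL, "b", NULL, "c" };

	return walk_shrinkers(slots, 5);
}
```

The point is that the yield happens even for empty slots: with many memcgs and few objects, most iterations do no useful work, yet without the yield they would still accumulate into one unbroken stretch of CPU time.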
[Devel] [PATCH rh7 1/2] mm: memcg: fix memcg reclaim soft lockup
From: Xunlei Pang

We've met a softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg
doesn't have any reclaimable memory. It can be easily reproduced as below:

  watchdog: BUG: soft lockup - CPU#0 stuck for 111s! [memcg_test:2204]
  CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
  Call Trace:
    shrink_lruvec+0x49f/0x640
    shrink_node+0x2a6/0x6f0
    do_try_to_free_pages+0xe9/0x3e0
    try_to_free_mem_cgroup_pages+0xef/0x1f0
    try_charge+0x2c1/0x750
    mem_cgroup_charge+0xd7/0x240
    __add_to_page_cache_locked+0x2fd/0x370
    add_to_page_cache_lru+0x4a/0xc0
    pagecache_get_page+0x10b/0x2f0
    filemap_fault+0x661/0xad0
    ext4_filemap_fault+0x2c/0x40
    __do_fault+0x4d/0xf9
    handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance for the
oom reaper to run to reclaim the to-be-killed process.

Add a cond_resched() at the upper shrink_node_memcgs() to solve this issue.
This means we will get a scheduling point for each memcg in the reclaimed
hierarchy without any dependency on the reclaimable memory in that memcg,
thus making it more predictable.

Suggested-by: Michal Hocko
Signed-off-by: Xunlei Pang
Signed-off-by: Andrew Morton
Acked-by: Chris Down
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlp...@linux.alibaba.com
Signed-off-by: Linus Torvalds

https://jira.sw.ru/browse/PSBM-125095
(cherry picked from commit e3336cab2579012b1e72b5265adf98e2d6e244ad)
Signed-off-by: Andrey Ryabinin
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85622f235e78..080500f4e366 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2684,6 +2684,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			unsigned long lru_pages, scanned;
 
+			/*
+			 * This loop can become CPU-bound when target memcgs
+			 * aren't eligible for reclaim - either because they
+			 * don't have any reclaimable pages, or because their
+			 * memory is explicitly protected. Avoid soft lockups.
+			 */
+			cond_resched();
+
 			if (!sc->may_thrash && mem_cgroup_low(root, memcg))
 				continue;
-- 
2.26.2
[Devel] [PATCH vz8 2/2] jbd2: raid amnesia protection for the journal
From: Dmitry Monakhov

https://jira.sw.ru/browse/PSBM-15484

Some block devices can return different data on read requests from the same
block after a power failure (for example, a mirrored raid that is out of
sync while resync is in progress). In that case the following situation is
possible:

  A power failure happens after the commit block for transaction 'D' was
  issued; on the next boot the first disk will have the commit block, but
  the second one will not:
    mirror1: journal={Ac-Bc-Cc-Dc }
    mirror2: journal={Ac-Bc-Cc-D  }

Now let's assume that we read from mirror1 and find that 'D' has a valid
commit block, so journal_replay will replay that transaction. But a second
power failure may happen before journal_reset(), so the next
journal_replay() may read from mirror2 and find that 'C' is the last valid
transaction. This results in corruption because we already replayed
transaction 'D'.

In order to avoid such ambiguity we should perform a 'stabilize write':
 1) Read and rewrite the latest commit block.
 2) Invalidate the next block in order to guarantee that the journal head
    becomes stable.

Signed-off-by: Dmitry Monakhov
Signed-off-by: Andrey Ryabinin
---
 fs/jbd2/recovery.c | 77 +-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index a4967b27ffb6..78e7d2fed069 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -33,6 +33,9 @@ struct recovery_info
 	int		nr_replays;
 	int		nr_revokes;
 	int		nr_revoke_hits;
+
+	unsigned int		last_log_block;
+	struct buffer_head	*last_commit_bh;
 };
 
 enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
@@ -229,6 +232,71 @@ do {								\
 		var -= ((journal)->j_last - (journal)->j_first);	\
 } while (0)
 
+/*
+ * The 'Raid amnesia' effect protection: https://jira.sw.ru/browse/PSBM-15484
+ *
+ * Some block devices can return different data on read requests from the
+ * same block after a power failure (for example, a mirrored raid that is
+ * out of sync while resync is in progress). In that case the following
+ * situation is possible:
+ *
+ * A power failure happens after the commit block for transaction 'D' was
+ * issued; on the next boot the first disk will have the commit block, but
+ * the second one will not:
+ *	mirror1: journal={Ac-Bc-Cc-Dc }
+ *	mirror2: journal={Ac-Bc-Cc-D  }
+ * Now let's assume that we read from mirror1 and find that 'D' has a valid
+ * commit block, so journal_replay will replay that transaction. But a
+ * second power failure may happen before journal_reset(), so the next
+ * journal_replay() may read from mirror2 and find that 'C' is the last
+ * valid transaction. This results in corruption because we already
+ * replayed transaction 'D'.
+ * In order to avoid such ambiguity we should perform a 'stabilize write':
+ * 1) Read and rewrite the latest commit block.
+ * 2) Invalidate the next block in order to guarantee that the journal
+ *    head becomes stable.
+ * Yes, I know that the 'stabilize write' approach is ugly, but this is the
+ * only way to run a filesystem on block devices with the 'raid amnesia'
+ * effect.
+ */
+static int stabilize_journal_head(journal_t *journal, struct recovery_info *info)
+{
+	struct buffer_head *bh[2] = {NULL, NULL};
+	int err, err2, i;
+
+	if (!info->last_commit_bh)
+		return 0;
+
+	bh[0] = info->last_commit_bh;
+	info->last_commit_bh = NULL;
+
+	err = jread(&bh[1], journal, info->last_log_block);
+	if (err)
+		goto out;
+
+	for (i = 0; i < 2; i++) {
+		lock_buffer(bh[i]);
+		/* Explicitly invalidate block beyond last commit block */
+		if (i == 1)
+			memset(bh[i]->b_data, 0, journal->j_blocksize);
+
+		BUFFER_TRACE(bh[i], "marking dirty");
+		set_buffer_uptodate(bh[i]);
+		mark_buffer_dirty(bh[i]);
+		BUFFER_TRACE(bh[i], "marking uptodate");
+		unlock_buffer(bh[i]);
+	}
+	err = sync_blockdev(journal->j_dev);
+	/* Make sure data is on permanent storage */
+	if (journal->j_flags & JBD2_BARRIER) {
+		err2 = blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);
+		if (!err)
+			err = err2;
+	}
+out:
+	brelse(bh[0]);
+	brelse(bh[1]);
+	return err;
+}
+
 /**
  * jbd2_journal_recover - recovers a on-disk journal
  * @journal: the journal to recover
@@ -265,6 +333,8 @@ int jbd2_journal_recover(journal_t *journal)
 	}
 
 	err = do_one_pass(journal, &info, PASS_SCAN);
+	if (!err)
+		err = stabilize_journal_head(journal, &info);
 	if (!err)
 		err = do_one_pass(journal, &info, PASS_REVOKE);
 	if (!err)
@@ -315,6 +385,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
 	memset(&info, 0, sizeof(info));
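The 'stabilize write' idea — re-issue the chosen last commit block to every replica and zero the block after it, so any later read sees the same journal head regardless of which replica answers — can be modeled with two in-memory "mirrors". Block and replica sizes below are illustrative, not jbd2 values.

```c
#include <string.h>

#define BLK  8	/* bytes per journal block (illustrative) */
#define NBLK 6	/* blocks per replica (illustrative) */

/* two replicas of the same journal area, as on a 2-way mirror */
static unsigned char mirror[2][NBLK][BLK];

/* 'stabilize write': rewrite the chosen last commit block on every
 * replica and invalidate the block right after it, so the journal head
 * is identical no matter which replica a later read hits */
static void stabilize_head(int last)
{
	int m;

	for (m = 0; m < 2; m++) {
		memcpy(mirror[m][last], mirror[0][last], BLK);
		memset(mirror[m][last + 1], 0, BLK);
	}
}

/* seed the divergence from the commit message (mirror1 has D's commit
 * block, mirror2 lost it) and check that stabilization makes the two
 * replicas agree on the head and on the block beyond it */
static int demo(void)
{
	memset(mirror, 0, sizeof(mirror));
	memcpy(mirror[0][3], "Dcommit", 8);	/* mirror1: commit block made it */
	memcpy(mirror[1][3], "Dxxxxxx", 8);	/* mirror2: torn/partial write */
	mirror[1][4][0] = 0x5a;			/* stale data beyond the head */

	stabilize_head(3);

	return memcmp(mirror[0][3], mirror[1][3], BLK) == 0 &&
	       memcmp(mirror[0][4], mirror[1][4], BLK) == 0;
}
```

After stabilization a second crash can no longer flip the recovery decision between 'C' and 'D': both replicas carry the same commit block and the same zeroed successor.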
[Devel] [PATCH vz8 1/2] ve/ext4: treat panic_on_errors as remount-ro_on_errors in CTs
From: Dmitry Monakhov

This is a port from 2.6.32-x of:
 * diff-ext4-in-containers-treat-panic_on_errors-as-remount-ro_on_errors

ext4: in containers treat errors=panic as errors=remount-ro

A Container can explode the whole node if it remounts its ploop with the
option 'errors=panic' and triggers an abort after that.

Signed-off-by: Konstantin Khlebnikov
Acked-by: Maxim V. Patlasov
Signed-off-by: Dmitry Monakhov

khorenko@: currently we have devmnt->allowed_options, which are configured
via userspace, and currently vzctl provides an empty list. This is an
additional check - just in case someone gets a secondary ploop image with
the 'errors=panic' mount option saved in the image and mounts it from
inside a CT.

Signed-off-by: Andrey Ryabinin
---
 fs/ext4/super.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 60c9fb110be3..f6feb495e8b0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1845,6 +1845,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_NO_EXT3	0x0200
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
+#define MOPT_WANT_SYS_ADMIN	0x0800
 
 static const struct mount_opts {
 	int	token;
@@ -1877,7 +1878,7 @@ static const struct mount_opts {
 				      EXT4_MOUNT_JOURNAL_CHECKSUM),
 	 MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
 	{Opt_noload, EXT4_MOUNT_NOLOAD, MOPT_NO_EXT2 | MOPT_SET},
-	{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR},
+	{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR|MOPT_WANT_SYS_ADMIN},
 	{Opt_err_ro, EXT4_MOUNT_ERRORS_RO, MOPT_SET | MOPT_CLEAR_ERR},
 	{Opt_err_cont, EXT4_MOUNT_ERRORS_CONT, MOPT_SET | MOPT_CLEAR_ERR},
 	{Opt_data_err_abort, EXT4_MOUNT_DATA_ERR_ABORT,
@@ -2019,6 +2020,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 	}
 	if (m->flags & MOPT_CLEAR_ERR)
 		clear_opt(sb, ERRORS_MASK);
+	if (m->flags & MOPT_WANT_SYS_ADMIN && !capable(CAP_SYS_ADMIN))
+		return 1;
+
 	if (token == Opt_noquota && sb_any_quota_loaded(sb)) {
 		ext4_msg(sb, KERN_ERR, "Cannot change quota "
 			 "options when quota turned on");
@@ -3892,8 +3896,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
 		set_opt(sb, WRITEBACK_DATA);
 
-	if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
-		set_opt(sb, ERRORS_PANIC);
+	if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC) {
+		if (capable(CAP_SYS_ADMIN))
+			set_opt(sb, ERRORS_PANIC);
+		else
+			set_opt(sb, ERRORS_RO);
+	}
 	else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_CONTINUE)
 		set_opt(sb, ERRORS_CONT);
 	else
-- 
2.26.2
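The policy the patch enforces — a privileged caller may keep errors=panic, an unprivileged (in-Container) caller silently gets errors=remount-ro — reduces to a small pure function. The flag values here are illustrative stand-ins, not the real EXT4_MOUNT_* bits.

```c
#include <stdbool.h>

/* illustrative stand-ins for the EXT4_MOUNT_ERRORS_* flags */
#define ERRORS_RO    0x1u
#define ERRORS_CONT  0x2u
#define ERRORS_PANIC 0x4u

/* Decide the effective 'errors=' policy: without CAP_SYS_ADMIN (i.e.
 * inside a Container) errors=panic degrades to errors=remount-ro, so a
 * container-triggered fs abort cannot take down the whole node. */
static unsigned int effective_errors(unsigned int requested, bool cap_sys_admin)
{
	if (requested == ERRORS_PANIC && !cap_sys_admin)
		return ERRORS_RO;
	return requested;	/* everything else passes through unchanged */
}
```

Both paths in the patch (the mount-option parser and the on-disk s_errors default) implement this same decision.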
[Devel] [PATCH v2] fs/overlayfs: Fix crash on overlayfs mount
Kdump kernel fails to load because of a crash on mount of overlayfs:

  BUG: unable to handle kernel NULL pointer dereference at 0060
  Call Trace:
    seq_path+0x64/0xb0
    print_paths_option+0x79/0xa0
    ovl_show_options+0x3a/0x320
    show_mountinfo+0x1ee/0x290
    seq_read+0x2f8/0x400
    vfs_read+0x9d/0x150
    ksys_read+0x4f/0xb0
    do_syscall_64+0x5b/0x1a0

This is caused by an out-of-bounds access of ofs->lowerpaths. We pass
ofs->numlayer to print_paths_option() as the size of the ->lowerpaths
array, but that is not its size. The correct number of lowerpaths elements
is ->numlower in 'struct ovl_entry'. So move lowerpaths there and use
oe->numlower as the array size.

Fixes: 17fc61697f73 ("overlayfs: add dynamic path resolving in mount options")
Fixes: 2191d729083d ("overlayfs: add mnt_id paths options")
https://jira.sw.ru/browse/PSBM-123508
Signed-off-by: Andrey Ryabinin
---
 fs/overlayfs/ovl_entry.h |  2 +-
 fs/overlayfs/super.c     | 37 -
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index ea1906448ec5..2315089a0211 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -54,7 +54,6 @@ struct ovl_fs {
 	unsigned int numlayer;
 	/* Number of unique fs among layers including upper fs */
 	unsigned int numfs;
-	struct path *lowerpaths;
 	const struct ovl_layer *layers;
 	struct ovl_sb *fs;
 	/* workbasepath is the path at workdir= mount option */
@@ -98,6 +97,7 @@ struct ovl_entry {
 		struct rcu_head rcu;
 	};
 	unsigned numlower;
+	struct path *lowerpaths;
 	struct ovl_path lowerstack[];
 };
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 3755f280036f..fb419617564c 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -70,8 +70,12 @@ static void ovl_entry_stack_free(struct ovl_entry *oe)
 {
 	unsigned int i;
 
-	for (i = 0; i < oe->numlower; i++)
+	for (i = 0; i < oe->numlower; i++) {
 		dput(oe->lowerstack[i].dentry);
+		if (oe->lowerpaths)
+			path_put(&oe->lowerpaths[i]);
+	}
+	kfree(oe->lowerpaths);
 }
 
 static bool ovl_metacopy_def = IS_ENABLED(CONFIG_OVERLAY_FS_METACOPY);
@@ -241,11 +245,6 @@ static void ovl_free_fs(struct ovl_fs *ofs)
 		ovl_inuse_unlock(ofs->upper_mnt->mnt_root);
 	mntput(ofs->upper_mnt);
 	path_put(&ofs->upperpath);
-	if (ofs->lowerpaths) {
-		for (i = 0; i < ofs->numlayer; i++)
-			path_put(&ofs->lowerpaths[i]);
-		kfree(ofs->lowerpaths);
-	}
 	for (i = 1; i < ofs->numlayer; i++) {
 		iput(ofs->layers[i].trap);
 		mntput(ofs->layers[i].mnt);
@@ -359,9 +358,10 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 {
 	struct super_block *sb = dentry->d_sb;
 	struct ovl_fs *ofs = sb->s_fs_info;
+	struct ovl_entry *oe = OVL_E(dentry);
 
 	if (ovl_dyn_path_opts) {
-		print_paths_option(m, "lowerdir", ofs->lowerpaths, ofs->numlayer);
+		print_paths_option(m, "lowerdir", oe->lowerpaths, oe->numlower);
 		if (ofs->config.upperdir) {
 			print_path_option(m, "upperdir", &ofs->upperpath);
 			print_path_option(m, "workdir", &ofs->workbasepath);
@@ -375,7 +375,7 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	}
 
 	if (ovl_mnt_id_path_opts) {
-		print_mnt_ids_option(m, "lowerdir_mnt_id", ofs->lowerpaths, ofs->numlayer);
+		print_mnt_ids_option(m, "lowerdir_mnt_id", oe->lowerpaths, oe->numlower);
 		/*
 		 * We don't need to show mnt_id for workdir because it
 		 * on the same mount as upperdir.
@@ -1626,6 +1626,7 @@ static struct ovl_entry *ovl_get_lowerstack(struct super_block *sb,
 	int err;
 	char *lowertmp, *lower;
 	unsigned int stacklen, numlower = 0, i;
+	struct path *stack = NULL;
 	struct ovl_entry *oe;
 
 	err = -ENOMEM;
@@ -1649,14 +1650,14 @@ static struct ovl_entry *ovl_get_lowerstack(struct super_block *sb,
 	}
 
 	err = -ENOMEM;
-	ofs->lowerpaths = kcalloc(stacklen, sizeof(struct path), GFP_KERNEL);
-	if (!ofs->lowerpaths)
+	stack = kcalloc(stacklen, sizeof(struct path), GFP_KERNEL);
+	if (!stack)
		goto out_err;
 
 	err = -EINVAL;
 	lower = lowertmp;
 	for (numlower = 0; numlower < stacklen; numlower++) {
-		err = ovl_lower_dir(lower, &ofs->lowerpaths[numlower], ofs,
+		err = ovl_lower_dir(lower, &stack[numlower], ofs,
[Devel] [PATCH vz8] fs/overlayfs: Fix crash on overlayfs mount
Kdump kernel fails to load because of a crash on mount of overlayfs:

  BUG: unable to handle kernel NULL pointer dereference at 0060
  Call Trace:
    seq_path+0x64/0xb0
    print_paths_option+0x79/0xa0
    ovl_show_options+0x3a/0x320
    show_mountinfo+0x1ee/0x290
    seq_read+0x2f8/0x400
    vfs_read+0x9d/0x150
    ksys_read+0x4f/0xb0
    do_syscall_64+0x5b/0x1a0

This is caused by an out-of-bounds access of ofs->lowerpaths. We pass
ofs->numlayer to print_paths_option() as the size of the ->lowerpaths
array, but that is not its size. We could probably pass 'ofs->numlayer - 1'
as the number of lower layers/paths, but it's better to remove lowerpaths
completely. All the necessary information is already contained in
'struct ovl_entry'. Use it to print the paths instead.

Fixes: 17fc61697f73 ("overlayfs: add dynamic path resolving in mount options")
Fixes: 2191d729083d ("overlayfs: add mnt_id paths options")
https://jira.sw.ru/browse/PSBM-123508
Signed-off-by: Andrey Ryabinin
---
 fs/overlayfs/overlayfs.h |  4 ++--
 fs/overlayfs/ovl_entry.h |  1 -
 fs/overlayfs/super.c     | 30 ++
 fs/overlayfs/util.c      | 13 +
 4 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 7e103d002819..a708ebbd2e21 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -313,10 +313,10 @@ ssize_t ovl_getxattr(struct dentry *dentry, char *name, char **value,
 void print_path_option(struct seq_file *m, const char *name, struct path *path);
 void print_paths_option(struct seq_file *m, const char *name,
-			struct path *paths, unsigned int num);
+			struct ovl_path *paths, unsigned int num);
 void print_mnt_id_option(struct seq_file *m, const char *name, struct path *path);
 void print_mnt_ids_option(struct seq_file *m, const char *name,
-			struct path *paths, unsigned int num);
+			struct ovl_path *paths, unsigned int num);
 
 static inline bool ovl_is_impuredir(struct dentry *dentry)
 {
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index ea1906448ec5..4e7272c7e4dd 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -54,7 +54,6 @@ struct ovl_fs {
 	unsigned int numlayer;
 	/* Number of unique fs among layers including upper fs */
 	unsigned int numfs;
-	struct path *lowerpaths;
 	const struct ovl_layer *layers;
 	struct ovl_sb *fs;
 	/* workbasepath is the path at workdir= mount option */
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 3755f280036f..069d365a609d 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -241,11 +241,6 @@ static void ovl_free_fs(struct ovl_fs *ofs)
 		ovl_inuse_unlock(ofs->upper_mnt->mnt_root);
 	mntput(ofs->upper_mnt);
 	path_put(&ofs->upperpath);
-	if (ofs->lowerpaths) {
-		for (i = 0; i < ofs->numlayer; i++)
-			path_put(&ofs->lowerpaths[i]);
-		kfree(ofs->lowerpaths);
-	}
 	for (i = 1; i < ofs->numlayer; i++) {
 		iput(ofs->layers[i].trap);
 		mntput(ofs->layers[i].mnt);
@@ -359,9 +354,10 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 {
 	struct super_block *sb = dentry->d_sb;
 	struct ovl_fs *ofs = sb->s_fs_info;
+	struct ovl_entry *oe = OVL_E(dentry);
 
 	if (ovl_dyn_path_opts) {
-		print_paths_option(m, "lowerdir", ofs->lowerpaths, ofs->numlayer);
+		print_paths_option(m, "lowerdir", oe->lowerstack, oe->numlower);
 		if (ofs->config.upperdir) {
 			print_path_option(m, "upperdir", &ofs->upperpath);
 			print_path_option(m, "workdir", &ofs->workbasepath);
@@ -375,7 +371,8 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
 	}
 
 	if (ovl_mnt_id_path_opts) {
-		print_mnt_ids_option(m, "lowerdir_mnt_id", ofs->lowerpaths, ofs->numlayer);
+		print_mnt_ids_option(m, "lowerdir_mnt_id", oe->lowerstack, oe->numlower);
+
 		/*
 		 * We don't need to show mnt_id for workdir because it
 		 * on the same mount as upperdir.
@@ -1625,6 +1622,7 @@ static struct ovl_entry *ovl_get_lowerstack(struct super_block *sb,
 {
 	int err;
 	char *lowertmp, *lower;
+	struct path *stack = NULL;
 	unsigned int stacklen, numlower = 0, i;
 	struct ovl_entry *oe;
 
@@ -1649,14 +1647,14 @@ static struct ovl_entry *ovl_get_lowerstack(struct super_block *sb,
 	}
 
 	err = -ENOMEM;
-	ofs->lowerpaths = kcalloc(stacklen, sizeof(struct path), GFP_KERNEL);
-	if (!ofs->lowerpaths)
+	stack = kcalloc(stacklen, sizeof(struct path), GFP_KERNEL);
[Devel] [PATCH vz8] netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports
From: Jozsef Kadlecsik

In the case of huge hash:* types of sets, due to the single spinlock of a
set, the processing of the whole set under spinlock protection could take
too long.

There were four places where the whole hash table of the set was processed
from bucket to bucket under the spinlock:

- During resizing a set, the original set was locked to exclude kernel-side
  add/del element operations (userspace add/del is excluded by the
  nfnetlink mutex). The original set is actually just read during the
  resize, so the spinlocking is replaced with rcu locking of regions.
  However, this means there can be parallel kernel-side add/del of entries.
  In order not to lose those operations, a backlog is added and replayed
  after the successful resize.

- Garbage collection of timed-out entries was also protected by the
  spinlock. In order not to lock too long, region locking is introduced
  and a single region is processed in one gc go. Also, the simple timer
  based gc running is replaced with a workqueue based solution. The
  internal book-keeping (number of elements, size of extensions) is moved
  to region level due to the region locking.

- Adding elements: when the max number of the elements is reached, the gc
  was called to evict the timed-out entries. The new approach is that the
  gc is called just for the matching region, assuming that if the region
  (proportionally) seems to be full, then the whole set does. We could
  scan the other regions to check every entry under rcu locking, but for
  huge sets it'd mean a slowdown at adding elements.

- Listing the set header data: when the set was defined with timeout
  support, the garbage collector was called to clean up timed-out entries
  to get the correct element numbers and set size values. Now the set is
  scanned to check non-timed-out entries, without actually calling the gc
  for the whole set.

Thanks to Florian Westphal for helping me to solve the SOFTIRQ-safe ->
SOFTIRQ-unsafe lock order issues during working on the patch.

Reported-by: syzbot+4b0e9d4ff3cf11783...@syzkaller.appspotmail.com
Reported-by: syzbot+c27b8d5010f45c666...@syzkaller.appspotmail.com
Reported-by: syzbot+68a806795ac89df3a...@syzkaller.appspotmail.com
Fixes: 23c42a403a9c ("netfilter: ipset: Introduction of new commands and protocol version 7")
Signed-off-by: Jozsef Kadlecsik

https://jira.sw.ru/browse/PSBM-123524
(cherry picked from commit f66ee0410b1c3481ee75e5db9b34547b4d582465)
Signed-off-by: Andrey Ryabinin
---
 include/linux/netfilter/ipset/ip_set.h |  11 +-
 net/netfilter/ipset/ip_set_core.c      |  34 +-
 net/netfilter/ipset/ip_set_hash_gen.h  | 633 +
 3 files changed, 472 insertions(+), 206 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index e499d170f12d..3c49b540c701 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -124,6 +124,7 @@ struct ip_set_ext {
 	u32 timeout;
 	u8 packets_op;
 	u8 bytes_op;
+	bool target;
 };
 
 struct ip_set;
@@ -190,6 +191,14 @@ struct ip_set_type_variant {
 	/* Return true if "b" set is the same as "a"
 	 * according to the create set parameters */
 	bool (*same_set)(const struct ip_set *a, const struct ip_set *b);
+	/* Region-locking is used */
+	bool region_lock;
+};
+
+struct ip_set_region {
+	spinlock_t lock;	/* Region lock */
+	size_t ext_size;	/* Size of the dynamic extensions */
+	u32 elements;		/* Number of elements vs timeout */
 };
 
 /* The core set type structure */
@@ -461,7 +470,7 @@ bitmap_bytes(u32 a, u32 b)
 #include
 
 #define IP_SET_INIT_KEXT(skb, opt, set)			\
-	{ .bytes = (skb)->len, .packets = 1,		\
+	{ .bytes = (skb)->len, .packets = 1, .target = true,\
	  .timeout = ip_set_adt_opt_timeout(opt, set) }
 
 #define IP_SET_INIT_UEXT(set)				\
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 56b59904feca..615b5791edf2 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -558,6 +558,20 @@ ip_set_rcu_get(struct net *net, ip_set_id_t index)
 	return set;
 }
 
+static inline void
+ip_set_lock(struct ip_set *set)
+{
+	if (!set->variant->region_lock)
+		spin_lock_bh(&set->lock);
+}
+
+static inline void
+ip_set_unlock(struct ip_set *set)
+{
+	if (!set->variant->region_lock)
+		spin_unlock_bh(&set->lock);
+}
+
 int
 ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
 	    const struct xt_action_param *par, struct ip_set_adt_opt *opt)
@@ -579,9 +593,9 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
 	if (ret == -EAGAIN) {
 		/* Type requests element to be completed */
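The ip_set_lock()/ip_set_unlock() wrappers introduced above make the set-wide spinlock conditional: set types that carry their own per-region locks skip it. A toy single-threaded model (a counter standing in for the spinlock; all names illustrative) shows the dispatch:

```c
#include <stdbool.h>

/* toy model of the patch's ip_set_lock()/ip_set_unlock() wrappers */
struct set_model {
	bool region_lock;	/* set type locks per hash-table region */
	int  lock_depth;	/* stand-in for the set-wide spinlock */
};

static void set_lock(struct set_model *s)
{
	if (!s->region_lock)	/* only non-region types take the big lock */
		s->lock_depth++;
}

static void set_unlock(struct set_model *s)
{
	if (!s->region_lock)
		s->lock_depth--;
}

static int demo(void)
{
	struct set_model plain = { false, 0 };	/* e.g. bitmap:* types */
	struct set_model hashy = { true, 0 };	/* hash:* with regions */
	int held, skipped;

	set_lock(&plain);
	set_lock(&hashy);
	held = plain.lock_depth;	/* plain sets still serialize: 1 */
	skipped = hashy.lock_depth;	/* region-locked sets skip it: 0 */
	set_unlock(&plain);
	set_unlock(&hashy);

	return held * 10 + skipped;
}
```

This is why the patch can keep one pair of call sites in ip_set_test()/ip_set_add()/ip_set_del() while the hash types internally take only the small per-region spinlock for the bucket being touched.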
[Devel] [PATCH vz8] ptrace: fix task_join_group_stop() for the case when current is traced
From: Oleg Nesterov

This testcase

	#include <stdio.h>
	#include <unistd.h>
	#include <signal.h>
	#include <sys/wait.h>
	#include <sys/ptrace.h>
	#include <pthread.h>
	#include <assert.h>

	void *tf(void *arg)
	{
		return NULL;
	}

	int main(void)
	{
		int pid = fork();
		if (!pid) {
			kill(getpid(), SIGSTOP);

			pthread_t th;
			pthread_create(&th, NULL, tf, NULL);

			return 0;
		}

		waitpid(pid, NULL, WSTOPPED);

		ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
		waitpid(pid, NULL, 0);

		ptrace(PTRACE_CONT, pid, 0, 0);
		waitpid(pid, NULL, 0);

		int status;
		int thread = waitpid(-1, &status, 0);
		assert(thread > 0 && thread != pid);
		assert(status == 0x80137f);

		return 0;
	}

fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().

This is because task_join_group_stop() has 2 problems when current is
traced:

	1. We can't rely on the "JOBCTL_STOP_PENDING" check: a stopped tracee
	   can be woken up by the debugger, and it can clone another thread
	   which should join the group-stop.

	   We need to check group_stop_count || SIGNAL_STOP_STOPPED.

	2. If SIGNAL_STOP_STOPPED is already set, we should not increment
	   sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread
	   should stop without another do_notify_parent_cldstop() report.

To clarify, the problem is very old and we should blame ptrace_init_task().
But now that we have task_join_group_stop() it makes more sense to fix this
helper to avoid the code duplication.

Reported-by: syzbot+3485e3773f7da290e...@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov
Signed-off-by: Andrew Morton
Cc: Jens Axboe
Cc: Christian Brauner
Cc: "Eric W. Biederman"
Cc: Zhiqiang Liu
Cc: Tejun Heo
Cc:
Link: https://lkml.kernel.org/r/20201019134237.ga18...@redhat.com
Signed-off-by: Linus Torvalds

https://jira.sw.ru/browse/PSBM-123525
(cherry picked from commit 7b3c36fc4c231ca532120bbc0df67a12f09c1d96)
Signed-off-by: Andrey Ryabinin
---
 kernel/signal.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 177cd7f04acb..171f7496f811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -388,16 +388,17 @@ static bool task_participate_group_stop(struct task_struct *task)
 
 void task_join_group_stop(struct task_struct *task)
 {
+	unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK;
+	struct signal_struct *sig = current->signal;
+
+	if (sig->group_stop_count) {
+		sig->group_stop_count++;
+		mask |= JOBCTL_STOP_CONSUME;
+	} else if (!(sig->flags & SIGNAL_STOP_STOPPED))
+		return;
+
 	/* Have the new thread join an on-going signal group stop */
-	unsigned long jobctl = current->jobctl;
-
-	if (jobctl & JOBCTL_STOP_PENDING) {
-		struct signal_struct *sig = current->signal;
-		unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK;
-		unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
-
-		if (task_set_jobctl_pending(task, signr | gstop)) {
-			sig->group_stop_count++;
-		}
-	}
+	task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING);
 }
 
 /*
-- 
2.26.2
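The three cases the fixed helper distinguishes can be captured as a pure-logic model (names illustrative, not kernel API): a stop still in progress is joined and counted, an already-completed stop is joined silently without touching the count, and no stop at all means nothing to do.

```c
#include <stdbool.h>

/* pure-logic model of the fixed task_join_group_stop() */
struct sig_model {
	int  group_stop_count;	/* threads still to stop */
	bool stop_stopped;	/* SIGNAL_STOP_STOPPED already set */
};

/* Returns true if the new thread must join the group stop; *consume says
 * whether it also consumes a stop count (and so will be reported via
 * do_notify_parent_cldstop() when it stops). */
static bool join_group_stop(struct sig_model *sig, bool *consume)
{
	*consume = false;
	if (sig->group_stop_count) {
		/* stop in progress: join it and take part in the count */
		sig->group_stop_count++;
		*consume = true;
		return true;
	}
	if (!sig->stop_stopped)
		return false;	/* no group stop at all */
	/* stop already completed: join silently, no extra notification */
	return true;
}

static int demo(void)
{
	struct sig_model in_progress = { 1, false };
	struct sig_model finished    = { 0, true };
	struct sig_model running     = { 0, false };
	bool consume;
	int r = 0;

	if (join_group_stop(&in_progress, &consume) && consume &&
	    in_progress.group_stop_count == 2)
		r |= 1;
	if (join_group_stop(&finished, &consume) && !consume &&
	    finished.group_stop_count == 0)
		r |= 2;
	if (!join_group_stop(&running, &consume))
		r |= 4;

	return r;	/* 7 when all three cases behave as intended */
}
```

The testcase in the commit message exercised exactly the second case: the tracee was already SIGNAL_STOP_STOPPED when the new thread was cloned, so the old code wrongly bumped group_stop_count and set JOBCTL_STOP_CONSUME.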
[Devel] [PATCH vz8] vdso: fix VM_BUG_ON_PAGE(PageSlab(page)) on unmap
vdso_data is mapped to userspace, which means that we can't use kmalloc()
to allocate it. kmalloc() doesn't even guarantee that we will get
page-aligned memory.

  kernel BUG at include/linux/mm.h:693!
  RIP: 0010:unmap_page_range+0x15f2/0x2630
  Call Trace:
    unmap_vmas+0x11e/0x1d0
    exit_mmap+0x215/0x420
    mmput+0x10a/0x400
    do_exit+0x98f/0x2d00
    do_group_exit+0xec/0x2b0
    __x64_sys_exit_group+0x3a/0x50
    do_syscall_64+0xa5/0x4d0
    entry_SYSCALL_64_after_hwframe+0x6a/0xdf

Use alloc_pages_exact() to allocate it. We can't use alloc_pages() or
__get_free_pages() here, since vdso_fault() needs to perform get_page() on
individual sub-pages, and alloc_pages() doesn't initialize sub-pages.

https://jira.sw.ru/browse/PSBM-123551
Signed-off-by: Andrey Ryabinin
---
 kernel/ve/ve.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index b114e2918bb7..0c6630c6616a 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -568,7 +568,7 @@ static int copy_vdso(struct vdso_image **vdso_dst, const struct vdso_image *vdso
 	if (!vdso)
 		return -ENOMEM;
 
-	vdso_data = kmalloc(vdso_src->size, GFP_KERNEL);
+	vdso_data = alloc_pages_exact(vdso_src->size, GFP_KERNEL);
 	if (!vdso_data) {
 		kfree(vdso);
 		return -ENOMEM;
@@ -585,11 +585,11 @@ static int copy_vdso(struct vdso_image **vdso_dst, const struct vdso_image *vdso
 static void ve_free_vdso(struct ve_struct *ve)
 {
 	if (ve->vdso_64 && ve->vdso_64 != &vdso_image_64) {
-		kfree(ve->vdso_64->data);
+		free_pages_exact(ve->vdso_64->data, ve->vdso_64->size);
 		kfree(ve->vdso_64);
 	}
 	if (ve->vdso_32 && ve->vdso_32 != &vdso_image_32) {
-		kfree(ve->vdso_32->data);
+		free_pages_exact(ve->vdso_32->data, ve->vdso_32->size);
 		kfree(ve->vdso_32);
 	}
 }
-- 
2.26.2
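The alignment distinction the patch rests on has a direct user-space analogue: plain malloc() (like kmalloc()) gives no page-alignment guarantee, while posix_memalign() (like alloc_pages_exact()) returns memory that starts on a page boundary, which is a precondition for mapping it into an address space. Function names below are made up for the sketch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* allocate 'size' bytes guaranteed to start on a page boundary, the
 * property code that maps a buffer to userspace must rely on */
static void *alloc_page_exact_like(size_t size)
{
	void *p = NULL;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	if (posix_memalign(&p, page, size) != 0)
		return NULL;	/* allocation failed */
	return p;
}

static int is_page_aligned(const void *p)
{
	return p != NULL &&
	       ((uintptr_t)p % (uintptr_t)sysconf(_SC_PAGESIZE)) == 0;
}
```

In the kernel the requirement is stricter still: vdso_fault() takes per-page references on the buffer, so the allocation must consist of individually initialized page structs, which is exactly what alloc_pages_exact() provides.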
[Devel] [PATCH vz8] mm, memcg: Fix "add oom counter to memory.stat memcgroup file"
Fix rebase of commit

  commit 3f10e0c1a0df12a2a503d0d9a3ec7b4f3ac3a467
  Author: Andrey Ryabinin
  Date:   Mon Oct 5 13:18:40 2020 +0300

      mm, memcg: add oom counter to memory.stat memcgroup file

https://jira.sw.ru/browse/PSBM-123537
Signed-off-by: Andrey Ryabinin
---
 mm/memcontrol.c | 38 +-
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 01c012b11243..c0f825a4c43e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4141,12 +4141,28 @@ static const unsigned int memcg1_events[] = {
 	PSWPOUT,
 };
 
+static void accumulate_ooms(struct mem_cgroup *memcg, unsigned long *total_oom,
+			    unsigned long *total_oom_kill)
+{
+	struct mem_cgroup *mi;
+
+	*total_oom_kill = *total_oom = 0;
+
+	for_each_mem_cgroup_tree(mi, memcg) {
+		*total_oom += atomic_long_read(&mi->memory_events[MEMCG_OOM]);
+		*total_oom_kill += atomic_long_read(&mi->memory_events[MEMCG_OOM_KILL]);
+
+		cond_resched();
+	}
+}
+
 static int memcg_stat_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 	unsigned long memory, memsw;
 	struct mem_cgroup *mi;
 	unsigned int i;
+	unsigned long total_oom = 0, total_oom_kill = 0;
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
@@ -4162,21 +4178,19 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 		seq_printf(m, "%s %lu\n", vm_event_name(memcg1_events[i]),
 			   memcg_events_local(memcg, memcg1_events[i]));
 
+	accumulate_ooms(memcg, &total_oom, &total_oom_kill);
+
 	/*
 	 * For root_mem_cgroup we want to account global ooms as well.
 	 * The diff between all MEMCG_OOM_KILL and MEMCG_OOM events
 	 * should give us the global ooms count.
 	 */
-	if (memcg == root_mem_cgroup) {
-		unsigned long glob_ooms;
-
-		glob_ooms = memcg_events(memcg, memcg1_events[MEMCG_OOM_KILL]) -
-			memcg_events(memcg, memcg1_events[MEMCG_OOM]);
-		seq_printf(m, "oom %lu\n", glob_ooms +
-			memcg_events_local(memcg, memcg1_events[MEMCG_OOM]));
-	} else
+	if (memcg == root_mem_cgroup)
+		seq_printf(m, "oom %lu\n", total_oom_kill - total_oom);
+	else
 		seq_printf(m, "oom %lu\n",
-			memcg_events_local(memcg, memcg1_events[MEMCG_OOM]));
+			atomic_long_read(&memcg->memory_events[MEMCG_OOM]));
 
 	for (i = 0; i < NR_LRU_LISTS; i++)
 		seq_printf(m, "%s %lu\n", lru_list_name(i),
@@ -4209,11 +4223,9 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 			   (u64)memcg_events(memcg, memcg1_events[i]));
 
 	if (memcg == root_mem_cgroup)
-		seq_printf(m, "total_oom %lu\n",
-			memcg_events(memcg, memcg1_events[MEMCG_OOM_KILL]));
+		seq_printf(m, "total_oom %lu\n", total_oom_kill);
 	else
-		seq_printf(m, "total_oom %lu\n",
-			memcg_events(memcg, memcg1_events[MEMCG_OOM]));
+		seq_printf(m, "total_oom %lu\n", total_oom);
 
 	for (i = 0; i < NR_LRU_LISTS; i++)
 		seq_printf(m, "total_%s %llu\n", lru_list_name(i),
-- 
2.26.2
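The shape of accumulate_ooms() — walk a cgroup-like tree and add every node's counters into the caller's totals through out-pointers — can be modeled in plain C. Note the `*total += ...` dereference: forgetting the `*` would do pointer arithmetic instead of accumulation. Structure and numbers here are illustrative.

```c
#include <stddef.h>

/* minimal model of a memcg hierarchy node */
struct node {
	unsigned long oom, oom_kill;
	const struct node *child, *sibling;
};

/* walk the subtree rooted at n (plus its siblings), accumulating every
 * node's counters into the caller's totals through out-pointers */
static void accumulate(const struct node *n,
		       unsigned long *oom, unsigned long *oom_kill)
{
	for (; n; n = n->sibling) {
		*oom += n->oom;
		*oom_kill += n->oom_kill;
		accumulate(n->child, oom, oom_kill);
	}
}

static unsigned long demo(void)
{
	const struct node leaf = { 2, 3, NULL, NULL };
	const struct node root = { 1, 5, &leaf, NULL };
	unsigned long oom = 0, oom_kill = 0;

	accumulate(&root, &oom, &oom_kill);
	/* 1+2 ooms, 5+3 oom-kills -> encode both in one value */
	return oom * 100 + oom_kill;
}
```

This mirrors what for_each_mem_cgroup_tree() does in the patch, with cond_resched() added per node because the hierarchy can be large.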
[Devel] [PATCH rh7 4.1/8] ms/mm/rmap: share the i_mmap_rwsem fix
Use down_read_nested() to avoid lockdep complaints.

Signed-off-by: Andrey Ryabinin
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 523957450d20..90cf61e209ac 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1724,7 +1724,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 		return ret;
 
 	pgoff = page_to_pgoff(page);
-	i_mmap_lock_read(mapping);
+	down_read_nested(&mapping->i_mmap_rwsem, SINGLE_DEPTH_NESTING);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
-- 
2.26.2
[Devel] [PATCH rh7 8/8] ms/mm/memory.c: share the i_mmap_rwsem
From: Davidlohr Bueso The unmap_mapping_range family of functions do the unmapping of user pages (ultimately via zap_page_range_single) without touching the actual interval tree, thus share the lock. Signed-off-by: Davidlohr Bueso Cc: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Cc: Peter Zijlstra (Intel) Cc: Rik van Riel Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit c8475d144abb1e62958cc5ec281d2a9e161c1946) Signed-off-by: Andrey Ryabinin --- mm/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 7e66dea08f3f..3e5124d14996 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2712,10 +2712,10 @@ void unmap_mapping_range(struct address_space *mapping, if (details.last_index < details.first_index) details.last_index = ULONG_MAX; - i_mmap_lock_write(mapping); + i_mmap_lock_read(mapping); if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap))) unmap_mapping_range_tree(&mapping->i_mmap, &details); - i_mmap_unlock_write(mapping); + i_mmap_unlock_read(mapping); } EXPORT_SYMBOL(unmap_mapping_range); -- 2.26.2
[Devel] [PATCH rh7 7/8] ms/mm/nommu: share the i_mmap_rwsem
From: Davidlohr Bueso Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify that any shared mappings of the inode in question aren't broken (dead zone). afaict the only user being ramfs to handle the size change attribute. Pretty much a no-brainer to share the lock. Signed-off-by: Davidlohr Bueso Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Rik van Riel Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit 1acf2e040721564d579297646862b8ea3dd4511b) Signed-off-by: Andrey Ryabinin --- mm/nommu.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/mm/nommu.c b/mm/nommu.c index f994621e52f0..290fe3031147 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -2134,14 +2134,14 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size, high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; down_write(&nommu_region_sem); - i_mmap_lock_write(inode->i_mapping); + i_mmap_lock_read(inode->i_mapping); /* search for VMAs that fall within the dead zone */ vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) { /* found one - only interested if it's shared out of the page * cache */ if (vma->vm_flags & VM_SHARED) { - i_mmap_unlock_write(inode->i_mapping); + i_mmap_unlock_read(inode->i_mapping); up_write(&nommu_region_sem); return -ETXTBSY; /* not quite true, but near enough */ } @@ -2153,8 +2153,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size, * we don't check for any regions that start beyond the EOF as there * shouldn't be any */ - vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, - 0, ULONG_MAX) { + vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 0, ULONG_MAX) { if (!(vma->vm_flags & VM_SHARED)) continue; @@ -2169,7 +2168,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size, } } - i_mmap_unlock_write(inode->i_mapping); +
i_mmap_unlock_read(inode->i_mapping); up_write(&nommu_region_sem); return 0; } -- 2.26.2
[Devel] [PATCH rh7 5/8] ms/uprobes: share the i_mmap_rwsem
From: Davidlohr Bueso Both register and unregister call build_map_info() in order to create the list of mappings before installing or removing breakpoints for every mm which maps file backed memory. As such, there is no reason to hold the i_mmap_rwsem exclusively, so share it and allow concurrent readers to build the mapping data. Signed-off-by: Davidlohr Bueso Acked-by: Srikar Dronamraju Acked-by: "Kirill A. Shutemov" Cc: Oleg Nesterov Acked-by: Hugh Dickins Acked-by: Peter Zijlstra (Intel) Cc: Rik van Riel Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit 4a23717a236b2ab31efb1651f586126789fc997f) Signed-off-by: Andrey Ryabinin --- kernel/events/uprobes.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 9f312227a769..be501d8d9704 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -690,7 +690,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register) int more = 0; again: - i_mmap_lock_write(mapping); + i_mmap_lock_read(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { if (!valid_vma(vma, is_register)) continue; @@ -721,7 +721,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register) info->mm = vma->vm_mm; info->vaddr = offset_to_vaddr(vma, offset); } - i_mmap_unlock_write(mapping); + i_mmap_unlock_read(mapping); if (!more) goto out; -- 2.26.2
[Devel] [PATCH rh7 6/8] ms/mm/memory-failure: share the i_mmap_rwsem
From: Davidlohr Bueso No brainer conversion: collect_procs_file() only schedules a process for later kill, share the lock, similarly to the anon vma variant. Signed-off-by: Davidlohr Bueso Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Rik van Riel Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit d28eb9c861f41aa2af4cfcc5eeeddff42b13d31e) Signed-off-by: Andrey Ryabinin --- mm/memory-failure.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index da1ef2edd5dd..a5f5e604c0b8 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -497,7 +497,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, struct task_struct *tsk; struct address_space *mapping = page->mapping; - i_mmap_lock_write(mapping); + i_mmap_lock_read(mapping); qread_lock(&tasklist_lock); for_each_process(tsk) { pgoff_t pgoff = page_to_pgoff(page); @@ -519,7 +519,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, } } qread_unlock(&tasklist_lock); - i_mmap_unlock_write(mapping); + i_mmap_unlock_read(mapping); } /* -- 2.26.2
[Devel] [PATCH rh7 1/8] ms/mm, fs: introduce helpers around the i_mmap_mutex
afe. This patchset has been tested with: postgres 9.4 (with brand new hugetlb support), hugetlbfs test suite (all tests pass, in fact more tests pass with these changes than with an upstream kernel), ltp, aim7 benchmarks, memcached and iozone with the -B option for mmap'ing. *Untested* paths are nommu, memory-failure, uprobes and xip. This patch (of 8): Various parts of the kernel acquire and release this mutex, so add i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will encapsulate this logic. The next patch will make use of these. Signed-off-by: Davidlohr Bueso Reviewed-by: Rik van Riel Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit 8b28f621bea6f84d44adf7e804b73aff1e09105b) Signed-off-by: Andrey Ryabinin --- include/linux/fs.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/fs.h b/include/linux/fs.h index 55a92ce36e94..e32cb9b71042 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -698,6 +698,16 @@ struct block_device { int mapping_tagged(struct address_space *mapping, int tag); +static inline void i_mmap_lock_write(struct address_space *mapping) +{ + mutex_lock(&mapping->i_mmap_mutex); +} + +static inline void i_mmap_unlock_write(struct address_space *mapping) +{ + mutex_unlock(&mapping->i_mmap_mutex); +} + /* * Might pages of this file be mapped into userspace? */ -- 2.26.2
[Devel] [PATCH rh7 4/8] ms/mm/rmap: share the i_mmap_rwsem
From: Davidlohr Bueso Similarly to the anon memory counterpart, we can share the mapping's lock ownership as the interval tree is not modified when doing the walk, only the file page. Signed-off-by: Davidlohr Bueso Acked-by: Rik van Riel Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit 3dec0ba0be6a532cac949e02b853021bf6d57dad) Signed-off-by: Andrey Ryabinin --- include/linux/fs.h | 10 ++ mm/rmap.c | 9 + 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index f422b0f7b02a..acedffc46fe4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -709,6 +709,16 @@ static inline void i_mmap_unlock_write(struct address_space *mapping) up_write(&mapping->i_mmap_rwsem); } +static inline void i_mmap_lock_read(struct address_space *mapping) +{ + down_read(&mapping->i_mmap_rwsem); +} + +static inline void i_mmap_unlock_read(struct address_space *mapping) +{ + up_read(&mapping->i_mmap_rwsem); +} + /* * Might pages of this file be mapped into userspace?
*/ diff --git a/mm/rmap.c b/mm/rmap.c index e72be32c3dae..523957450d20 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1723,7 +1723,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) if (!mapping) return ret; pgoff = page_to_pgoff(page); - down_write_nested(&mapping->i_mmap_rwsem, SINGLE_DEPTH_NESTING); + + i_mmap_lock_read(mapping); vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { unsigned long address = vma_address(page, vma); @@ -1748,7 +1749,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) if (!mapping_mapped(peer)) continue; - i_mmap_lock_write(peer); + i_mmap_lock_read(peer); vma_interval_tree_foreach(vma, &peer->i_mmap, pgoff, pgoff) { unsigned long address = vma_address(page, vma); @@ -1764,7 +1765,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) cond_resched(); } - i_mmap_unlock_write(peer); + i_mmap_unlock_read(peer); if (ret != SWAP_AGAIN) goto done; @@ -1772,7 +1773,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc) goto done; } done: - i_mmap_unlock_write(mapping); + i_mmap_unlock_read(mapping); return ret; } -- 2.26.2
[Devel] [PATCH rh7 3/8] ms/mm: convert i_mmap_mutex to rwsem
From: Davidlohr Bueso The i_mmap_mutex is a close cousin of the anon vma lock, both protecting similar data, one for file backed pages and the other for anon memory. To this end, this lock can also be a rwsem. In addition, there are some important opportunities to share the lock when there are no tree modifications. This conversion is straightforward. For now, all users take the write lock. [s...@canb.auug.org.au: update fremap.c] Signed-off-by: Davidlohr Bueso Reviewed-by: Rik van Riel Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Stephen Rothwell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit c8c06efa8b552608493b7066c234cfa82c47fcea) Signed-off-by: Andrey Ryabinin --- Documentation/vm/locking | 2 +- fs/hugetlbfs/inode.c | 10 +- fs/inode.c | 2 +- include/linux/fs.h | 7 --- include/linux/mmu_notifier.h | 2 +- kernel/events/uprobes.c | 2 +- mm/filemap.c | 10 +- mm/hugetlb.c | 10 +- mm/memory.c | 2 +- mm/mmap.c| 6 +++--- mm/mremap.c | 2 +- mm/rmap.c| 4 ++-- 12 files changed, 30 insertions(+), 29 deletions(-) diff --git a/Documentation/vm/locking b/Documentation/vm/locking index f61228bd6395..fb6402884062 100644 --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is modified by expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_mutex and the kmem cache +The page_table_lock nests with the inode i_mmap_rwsem and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock. 
The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index fb40a55cc8f1..68f8f2f0eaf5 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -757,12 +757,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb, } /* - * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never + * Hugetlbfs is not reclaimable; therefore its i_mmap_rwsem will never * be taken from reclaim -- unlike regular filesystems. This needs an * annotation because huge_pmd_share() does an allocation under hugetlb's - * i_mmap_mutex. + * i_mmap_rwsem. */ -struct lock_class_key hugetlbfs_i_mmap_mutex_key; +static struct lock_class_key hugetlbfs_i_mmap_rwsem_key; static struct inode *hugetlbfs_get_inode(struct super_block *sb, struct inode *dir, @@ -779,8 +779,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, if (inode) { inode->i_ino = get_next_ino(); inode_init_owner(inode, dir, mode); - lockdep_set_class(>i_mapping->i_mmap_mutex, - _i_mmap_mutex_key); + lockdep_set_class(>i_mapping->i_mmap_rwsem, + _i_mmap_rwsem_key); inode->i_mapping->a_ops = _aops; inode->i_mapping->backing_dev_info =_backing_dev_info; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; diff --git a/fs/inode.c b/fs/inode.c index 5253272c3742..2423a30dda1b 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -356,7 +356,7 @@ void address_space_init_once(struct address_space *mapping) memset(mapping, 0, sizeof(*mapping)); INIT_RADIX_TREE(>page_tree, GFP_ATOMIC); spin_lock_init(>tree_lock); - mutex_init(>i_mmap_mutex); + init_rwsem(>i_mmap_rwsem); INIT_LIST_HEAD(>private_list); spin_lock_init(>private_lock); mapping->i_mmap = RB_ROOT; diff --git a/include/linux/fs.h b/include/linux/fs.h index e32cb9b71042..f422b0f7b02a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -17,6 +17,7 @@ #include #include #include +#include #include 
#include #include @@ -626,7 +627,7 @@ struct address_space { RH_KABI_REPLACE(unsigned int i_mmap_writable, atomic_t i_mmap_writable) /* count VM_SHARED mappings */ struct rb_root i_mmap; /* tree of private and shared mappings */ - struct mutexi_mmap_mutex; /* protect tree, count, list */ + struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */ /* Protected by tree_lock together with the radix tree */ unsigned long nrpages;
[Devel] [PATCH rh7 2/8] ms/mm: use new helper functions around the i_mmap_mutex
From: Davidlohr Bueso Convert all open coded mutex_lock/unlock calls to the i_mmap_[lock/unlock]_write() helpers. Signed-off-by: Davidlohr Bueso Acked-by: Rik van Riel Acked-by: "Kirill A. Shutemov" Acked-by: Hugh Dickins Cc: Oleg Nesterov Acked-by: Peter Zijlstra (Intel) Cc: Srikar Dronamraju Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-122663 (cherry picked from commit 83cde9e8ba95d180eaefefe834958fbf7008cf39) Signed-off-by: Andrey Ryabinin --- fs/dax.c| 4 ++-- fs/hugetlbfs/inode.c| 12 ++-- kernel/events/uprobes.c | 4 ++-- kernel/fork.c | 4 ++-- mm/hugetlb.c| 12 ++-- mm/memory-failure.c | 4 ++-- mm/memory.c | 28 ++-- mm/mmap.c | 14 +++--- mm/mremap.c | 4 ++-- mm/nommu.c | 14 +++--- mm/rmap.c | 6 +++--- 11 files changed, 53 insertions(+), 53 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index f22e3b32b6cc..7a18745acf01 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -909,7 +909,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping, spinlock_t *ptl; bool changed; - mutex_lock(>i_mmap_mutex); + i_mmap_lock_write(mapping); vma_interval_tree_foreach(vma, >i_mmap, index, index) { unsigned long address; @@ -960,7 +960,7 @@ unlock_pte: if (changed) mmu_notifier_invalidate_page(vma->vm_mm, address); } - mutex_unlock(>i_mmap_mutex); + i_mmap_unlock_write(mapping); } static int dax_writeback_one(struct dax_device *dax_dev, diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index bdd5c7827391..fb40a55cc8f1 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -493,11 +493,11 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, if (unlikely(page_mapped(page))) { BUG_ON(truncate_op); - mutex_lock(>i_mmap_mutex); + i_mmap_lock_write(mapping); hugetlb_vmdelete_list(>i_mmap, next * pages_per_huge_page(h), (next + 1) * pages_per_huge_page(h)); - mutex_unlock(>i_mmap_mutex); + i_mmap_unlock_write(mapping); } lock_page(page); @@ -553,10 +553,10 @@ static int 
hugetlb_vmtruncate(struct inode *inode, loff_t offset) pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - mutex_lock(>i_mmap_mutex); + i_mmap_lock_write(mapping); if (!RB_EMPTY_ROOT(>i_mmap)) hugetlb_vmdelete_list(>i_mmap, pgoff, 0); - mutex_unlock(>i_mmap_mutex); + i_mmap_unlock_write(mapping); remove_inode_hugepages(inode, offset, LLONG_MAX); return 0; } @@ -578,12 +578,12 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) struct address_space *mapping = inode->i_mapping; mutex_lock(>i_mutex); - mutex_lock(>i_mmap_mutex); + i_mmap_lock_write(mapping); if (!RB_EMPTY_ROOT(>i_mmap)) hugetlb_vmdelete_list(>i_mmap, hole_start >> PAGE_SHIFT, hole_end >> PAGE_SHIFT); - mutex_unlock(>i_mmap_mutex); + i_mmap_unlock_write(mapping); remove_inode_hugepages(inode, hole_start, hole_end); mutex_unlock(>i_mutex); } diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index a5a59cc93fb6..816ad8e3d92f 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -690,7 +690,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register) int more = 0; again: - mutex_lock(>i_mmap_mutex); + i_mmap_lock_write(mapping); vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) { if (!valid_vma(vma, is_register)) continue; @@ -721,7 +721,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register) info->mm = vma->vm_mm; info->vaddr = offset_to_vaddr(vma, offset); } - mutex_unlock(>i_mmap_mutex); + i_mmap_unlock_write(mapping); if (!more) goto out; diff --git a/kernel/fork.c b/kernel/fork.c index 9467e21a8fa4..b6a5279403be 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -504,7 +504,7 @@ static int dup_mmap(struct mm_struct *mm, struct
[Devel] [PATCH rh7] Revert "mm: Port diff-mm-vmscan-disable-fs-related-activity-for-direct-direct-reclaim"
This reverts commit 50fb388878b646872b78143de3c1bf3fa6f7f148. Sometimes we can see a lot of reclaimable dcache and no other reclaimable memory. It looks like kswapd can't keep up reclaiming dcache fast enough. Commit 50fb388878b6 forbids reclaiming dcache in direct reclaim to prevent potential deadlocks that might happen due to bugs in other subsystems. Revert it to allow more aggressive dcache reclaim. It's unlikely to cause any problems since we already directly reclaim dcache in memcg reclaim, so let's do the same for the global one. https://jira.sw.ru/browse/PSBM-122663 Signed-off-by: Andrey Ryabinin --- mm/vmscan.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 85622f235e78..240435eb6d84 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2653,15 +2653,9 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc, { struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long nr_reclaimed, nr_scanned; - gfp_t slab_gfp = sc->gfp_mask; bool slab_only = sc->slab_only; bool retry; - /* Disable fs-related IO for direct reclaim */ - if (!sc->target_mem_cgroup && - (current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC) - slab_gfp &= ~__GFP_FS; - do { struct mem_cgroup *root = sc->target_mem_cgroup; struct mem_cgroup_reclaim_cookie reclaim = { @@ -2695,7 +2689,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc, } if (is_classzone) { - shrink_slab(slab_gfp, zone_to_nid(zone), + shrink_slab(sc->gfp_mask, zone_to_nid(zone), memcg, sc->priority, false); if (reclaim_state) { sc->nr_reclaimed += reclaim_state->reclaimed_slab; -- 2.26.2
[Devel] [PATCH rh7] mm, page_alloc: move_freepages should not examine struct page of reserved memory
From: David Rientjes After commit 907ec5fca3dc ("mm: zero remaining unavailable struct pages"), struct page of reserved memory is zeroed. This causes page->flags to be 0 and fixes issues related to reading /proc/kpageflags, for example, of reserved memory. The VM_BUG_ON() in move_freepages_block(), however, assumes that page_zone() is meaningful even for reserved memory. That assumption is no longer true after the aforementioned commit. There's no reason why move_freepages_block() should be testing the legitimacy of page_zone() for reserved memory; its scope is limited only to pages on the zone's freelist. Note that pfn_valid() can be true for reserved memory: there is a backing struct page. The check for page_to_nid(page) is also buggy but reserved memory normally only appears on node 0 so the zeroing doesn't affect this. Move the debug checks to after verifying PageBuddy is true. This isolates the scope of the checks to only be for buddy pages which are on the zone's freelist which move_freepages_block() is operating on. In this case, an incorrect node or zone is a bug worthy of being warned about (and the examination of struct page is acceptable because this memory is not reserved). Why does move_freepages_block() get called on reserved memory? It's simply math after finding a valid free page from the per-zone free area to use as fallback. We find the beginning and end of the pageblock of the valid page and that can bring us into memory that was reserved per the e820. pfn_valid() is still true (it's backed by a struct page), but since it's zero'd we shouldn't make any inferences here about comparing its node or zone. The current node check just happens to succeed most of the time by luck because reserved memory typically appears on node 0. The fix here is to validate that we actually have buddy pages before testing if there's any type of zone or node strangeness going on. We noticed it almost immediately after bringing 907ec5fca3dc in on CONFIG_DEBUG_VM builds.
It depends on finding specific free pages in the per-zone free area where the math in move_freepages() will bring the start or end pfn into reserved memory and wanting to claim that entire pageblock as a new migratetype. So the path will be rare, require CONFIG_DEBUG_VM, and require fallback to a different migratetype. Some struct pages were already zeroed from reserve pages before 907ec5fca3dc so it theoretically could trigger before this commit. I think it's rare enough under a config option that most people don't run that others may not have noticed. I wouldn't argue against a stable tag and the backport should be easy enough, but probably wouldn't single out a commit that this is fixing. Mel said: : The overhead of the debugging check is higher with this patch although : it'll only affect debug builds and the path is not particularly hot. : If this was a concern, I think it would be reasonable to simply remove : the debugging check as the zone boundaries are checked in : move_freepages_block and we never expect a zone/node to be smaller than : a pageblock and stuck in the middle of another zone. Link: http://lkml.kernel.org/r/alpine.deb.2.21.1908122036560.10...@chino.kir.corp.google.com Signed-off-by: David Rientjes Acked-by: Mel Gorman Cc: Naoya Horiguchi Cc: Masayoshi Mizuma Cc: Oscar Salvador Cc: Pavel Tatashin Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-123085 (cherry-picked from commit cd961038381f392b364a7c4a040f4576ca415b1a) [Note: we don't have commit 907ec5fca3dc, but as changelog says this could trigger before it. And we have all other symptoms - reserved page from NUMA node 1 with zeroed struct page, so page_zone() gives us wrong zone, hence BUG_ON()].
Signed-off-by: Andrey Ryabinin --- mm/page_alloc.c | 19 +++ 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fe9b18fef7d..3a147749e528 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1668,23 +1668,7 @@ int move_freepages(struct zone *zone, unsigned long order; int pages_moved = 0; -#ifndef CONFIG_HOLES_IN_ZONE - /* -* page_zone is not safe to call in this context when -* CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant -* anyway as we check zone boundaries in move_freepages_block(). -* Remove at a later date when no bug reports exist related to -* grouping pages by mobility -*/ - BUG_ON(pfn_valid(page_to_pfn(start_page)) && - pfn_valid(page_to_pfn(end_page)) && - page_zone(start_page) != page_zone(end_page)); -#endif - for (page = start_page; page <= end_page;) { - /* Make sure we are not inadvertently changing nodes */ - VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page); - if (!p
[Devel] [PATCH rh7] mm/memcg: cleanup vmpressure from mem_cgroup_css_free()
Cleaning up vmpressure from mem_cgroup_css_offline() doesn't look safe. It looks like mem_cgroup_css_offline() might race with reclaim, which will queue vmpressure work after the flush. Put vmpressure_cleanup() in mem_cgroup_css_free() where we have exclusive access to memcg. It was originally there, see https://jira.sw.ru/browse/PSBM-93884 but moved in a process of rebase. https://jira.sw.ru/browse/PSBM-122653 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e36ad592b3c7..803273a4d9cb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6822,8 +6822,6 @@ static void mem_cgroup_css_offline(struct cgroup *cont) mem_cgroup_free_all(memcg); mem_cgroup_reparent_charges(memcg); - vmpressure_cleanup(&memcg->vmpressure); - /* * A cgroup can be destroyed while somebody is waiting for its * oom context, in which case the context will never be unlocked @@ -6878,7 +6876,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) mem_cgroup_reparent_charges(memcg); cancel_work_sync(&memcg->high_work); - + vmpressure_cleanup(&memcg->vmpressure); memcg_destroy_kmem(memcg); memcg_free_shrinker_maps(memcg); __mem_cgroup_free(memcg); -- 2.26.2
[Devel] [PATCH rh7] mm/memcg: cleanup vmpressure from mem_cgroup_css_free()
Cleaning up vmpressure from mem_cgroup_css_offline() doesn't look safe. It looks like mem_cgroup_css_offline() might race with reclaim, which will queue vmpressure work after the flush. Put vmpressure_cleanup() in mem_cgroup_css_free() where we have exclusive access to memcg. It was originally there, see https://jira.sw.ru/browse/PSBM-93884 but moved in a process of rebase. https://jira.sw.ru/browse/PSBM-122655 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e36ad592b3c7..803273a4d9cb 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6822,8 +6822,6 @@ static void mem_cgroup_css_offline(struct cgroup *cont) mem_cgroup_free_all(memcg); mem_cgroup_reparent_charges(memcg); - vmpressure_cleanup(&memcg->vmpressure); - /* * A cgroup can be destroyed while somebody is waiting for its * oom context, in which case the context will never be unlocked @@ -6878,7 +6876,7 @@ static void mem_cgroup_css_free(struct cgroup *cont) mem_cgroup_reparent_charges(memcg); cancel_work_sync(&memcg->high_work); - + vmpressure_cleanup(&memcg->vmpressure); memcg_destroy_kmem(memcg); memcg_free_shrinker_maps(memcg); __mem_cgroup_free(memcg); -- 2.26.2
[Devel] [PATCH vz8 2/3] oom: resurrect berserker mode
From: Vladimir Davydov The logic behind the OOM berserker is the same as in PCS6: if processes are killed by oom killer too often (< sysctl vm.oom_relaxation, 1 sec by default), we increase "rage" (min -10, max 20) and kill 1 << "rage" youngest worst processes if "rage" >= 0. https://jira.sw.ru/browse/PSBM-17930 Signed-off-by: Vladimir Davydov [aryabinin: vz8 rebase] Signed-off-by: Andrey Ryabinin --- include/linux/memcontrol.h | 6 +++ include/linux/oom.h| 4 ++ mm/oom_kill.c | 97 ++ 3 files changed, 107 insertions(+) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index c26041c681f2..0efabad868ce 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -260,6 +260,12 @@ struct mem_cgroup { /* OOM-Killer disable */ int oom_kill_disable; + int oom_rage; + spinlock_t oom_rage_lock; + unsigned long prev_oom_time; + unsigned long oom_time; + + /* memory.events */ struct cgroup_file events_file; diff --git a/include/linux/oom.h b/include/linux/oom.h index b0ee726c1672..9a6d16a1ace5 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -15,6 +15,9 @@ struct notifier_block; struct mem_cgroup; struct task_struct; +#define OOM_BASE_RAGE -10 +#define OOM_MAX_RAGE 20 + /* * Details of the page allocation that triggered the oom killer that are used to * determine what should be killed. @@ -44,6 +47,7 @@ struct oom_control { unsigned long totalpages; struct task_struct *chosen; unsigned long chosen_points; + unsigned long overdraft; }; extern struct mutex oom_lock; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index ab436d94ae5d..e746b41d558c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -53,6 +53,7 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks; +int sysctl_oom_relaxation = HZ; DEFINE_MUTEX(oom_lock); @@ -947,6 +948,101 @@ static int oom_kill_memcg_member(struct task_struct *task, void *message) return 0; } +/* + * Kill more processes if oom happens too often in this context. 
+ */ +static void oom_berserker(struct oom_control *oc) +{ + static DEFINE_RATELIMIT_STATE(berserker_rs, + DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + struct task_struct *p; + struct mem_cgroup *memcg; + unsigned long now = jiffies; + int rage; + int killed = 0; + + memcg = oc->memcg ?: root_mem_cgroup; + + spin_lock(&memcg->oom_rage_lock); + memcg->prev_oom_time = memcg->oom_time; + memcg->oom_time = now; + /* +* Increase rage if oom happened recently in this context, reset +* rage otherwise. +* +* previous oom   this oom (unfinished) +* + +*^^ +* prev_oom_time <> oom_time +*/ + if (time_after(now, memcg->prev_oom_time + sysctl_oom_relaxation)) + memcg->oom_rage = OOM_BASE_RAGE; + else if (memcg->oom_rage < OOM_MAX_RAGE) + memcg->oom_rage++; + rage = memcg->oom_rage; + spin_unlock(&memcg->oom_rage_lock); + + if (rage < 0) + return; + + /* +* So, we are in rage. Kill (1 << rage) youngest tasks that are +* as bad as the victim. +*/ + read_lock(&tasklist_lock); + list_for_each_entry_reverse(p, &init_task.tasks, tasks) { + unsigned long tsk_points; + unsigned long tsk_overdraft; + + if (!p->mm || test_tsk_thread_flag(p, TIF_MEMDIE) || + fatal_signal_pending(p) || p->flags & PF_EXITING || + oom_unkillable_task(p, oc->memcg, oc->nodemask)) + continue; + + tsk_points = oom_badness(p, oc->memcg, oc->nodemask, + oc->totalpages, &tsk_overdraft); + if (tsk_overdraft < oc->overdraft) + continue; + + /* +* oom_badness never returns a negative value, even if +* oom_score_adj would make badness so, instead it +* returns 1. So we do not kill task with badness 1 if +* the victim has badness > 1 so as not to risk killing +* protected tasks. +*/ + if (tsk_points <= 1 && oc->chosen_points > 1) + continue; + + /* +* Consider tasks as equally bad if they have equal +* normalized scores. +*/ +
[Devel] [PATCH vz8 3/3] oom: make berserker more aggressive
From: Vladimir Davydov

In the berserker mode we kill a bunch of tasks that are as bad as the
selected victim. We assume two tasks to be equally bad if they consume
the same permille of memory. With such a strict check, it might turn
out that oom berserker won't kill any tasks in case a fork bomb is
running inside a container, while the effect of killing a task eating
<=1/1000th of memory won't be enough to cope with memory shortage.

Let's loosen this check and use percentage instead of permille. In this
case, it might still happen that berserker won't kill anyone, but then
the regular oom should free at least 1/100th of memory, which should be
enough even for small containers.

Also, check berserker mode even if the victim has already exited by the
time we are about to send SIGKILL to it. Rationale: when the berserker
is in rage, it might kill hundreds of tasks so that the next oom kill
is likely to select an exiting task. Not triggering berserker in this
case will result in oom stalls.

Signed-off-by: Vladimir Davydov
[aryabinin: rh8 rebase]
Signed-off-by: Andrey Ryabinin
---
 mm/oom_kill.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e746b41d558c..1cf75939aba6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1016,11 +1016,11 @@ static void oom_berserker(struct oom_control *oc)
 			continue;
 
 		/*
-		 * Consider tasks as equally bad if they have equal
-		 * normalized scores.
+		 * Consider tasks as equally bad if they occupy equal
+		 * percentage of available memory.
 		 */
-		if (tsk_points * 1000 / oc->totalpages <
-		    oc->chosen_points * 1000 / oc->totalpages)
+		if (tsk_points * 100 / oc->totalpages <
+		    oc->chosen_points * 100 / oc->totalpages)
 			continue;
 
 		if (__ratelimit(&berserker_rs)) {
@@ -1061,6 +1061,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		wake_oom_reaper(victim);
 		task_unlock(victim);
 		put_task_struct(victim);
+		oom_berserker(oc);
 		return;
 	}
 	task_unlock(victim);
-- 
2.26.2

_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
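The effect of widening the "equally bad" buckets from permille to percent can be seen with a small userspace model of the check; the function name and all the numbers below are illustrative, not taken from a real system.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when a task falls in the same (or a worse) normalized
 * score bucket as the chosen victim, i.e. the berserker would kill it.
 * scale = 1000 models the old permille check, scale = 100 the new
 * percent check.
 */
static bool equally_bad(unsigned long tsk_points, unsigned long chosen_points,
                        unsigned long totalpages, unsigned long scale)
{
    return tsk_points * scale / totalpages >=
           chosen_points * scale / totalpages;
}
```

With totalpages = 100000, a fork-bomb child at 1100 points (1.1%) is spared under permille buckets (11 < 15) but killed alongside a 1500-point (1.5%) victim under percent buckets, since both round down to 1%.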
[Devel] [PATCH vz8 1/3] proc, memcg: use memcg limits for showing oom_score inside CT
Use the memcg limits of the task to show /proc/<pid>/oom_score.

Note: in vz7 we had different behavior: it showed 'oom_score' based on
the 've->memcg' limits of the process *reading* oom_score. Now we look
at the memcg of the target process and don't care about the current
one. This seems to be the more correct behaviour.

Signed-off-by: Andrey Ryabinin
---
 fs/proc/base.c             |  8 +++++++-
 include/linux/memcontrol.h | 11 +++++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 85fee7396e90..cb417426dd92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -525,8 +525,14 @@ static const struct file_operations proc_lstats_operations = {
 static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 			  struct pid *pid, struct task_struct *task)
 {
-	unsigned long totalpages = totalram_pages + total_swap_pages;
+	unsigned long totalpages;
 	unsigned long points = 0;
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	totalpages = mem_cgroup_total_pages(memcg);
+	rcu_read_unlock();
 
 	points = oom_badness(task, NULL, NULL, totalpages, NULL) *
 		 1000 / totalpages;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb8634128a81..c26041c681f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -581,6 +581,17 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 	return mz->lru_zone_size[zone_idx][lru];
 }
 
+static inline unsigned long mem_cgroup_total_pages(struct mem_cgroup *memcg)
+{
+	unsigned long ram, ram_swap;
+	extern long total_swap_pages;
+
+	ram = min_t(unsigned long, totalram_pages, memcg->memory.max);
+	ram_swap = min_t(unsigned long, memcg->memsw.max, ram + total_swap_pages);
+
+	return ram_swap;
+}
+
 void mem_cgroup_handle_over_high(void);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
-- 
2.26.2
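The clamping done by mem_cgroup_total_pages() is just two min() operations: RAM is capped by memory.max, and RAM+swap by memsw.max. A userspace sketch of the same arithmetic (all page counts below are made up for illustration):

```c
#include <assert.h>

/*
 * Userspace model of mem_cgroup_total_pages(): values are in pages.
 * memory_max / memsw_max play the role of the memcg limits.
 */
static unsigned long memcg_total_pages(unsigned long totalram_pages,
                                       unsigned long total_swap_pages,
                                       unsigned long memory_max,
                                       unsigned long memsw_max)
{
    /* RAM available to the group: host RAM clamped by memory.max */
    unsigned long ram = totalram_pages < memory_max ? totalram_pages
                                                    : memory_max;
    /* RAM+swap clamped by memsw.max */
    unsigned long ram_swap = ram + total_swap_pages;

    return ram_swap < memsw_max ? ram_swap : memsw_max;
}
```

For a host with 1000000 RAM pages and 250000 swap pages, a container limited to 100000/150000 (mem/memsw) gets oom_score normalized against 150000 pages; with no limits set the result degenerates to host RAM plus swap.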
Re: [Devel] [PATCH rh8 0/3] vecalls: Implement VZCTL_GET_CPU_STAT ioctl
On 11/10/20 12:44 PM, Konstantin Khorenko wrote:
> Used by vzstat/dispatcher/libvirt.
> Faster than parsing Container's cpu cgroup files.
>
> Konstantin Khorenko (3):
>   vecalls: Add cpu stat measurement units comments to header
>   ve/sched/loadavg: Provide task_group parameter to get_avenrun_ve()
>   vecalls: Introduce VZCTL_GET_CPU_STAT ioctl
>
>  include/linux/sched/loadavg.h   |  2 -
>  include/linux/ve.h              |  2 +
>  include/uapi/linux/vzcalluser.h | 14 +++
>  kernel/sched/loadavg.c          | 12 +-
>  kernel/sys.c                    |  6 ++-
>  kernel/time/time.c              |  1 +
>  kernel/ve/ve.c                  | 18 +
>  kernel/ve/vecalls.c             | 66 +
>  8 files changed, 109 insertions(+), 12 deletions(-)
>

Reviewed-by: Andrey Ryabinin
Re: [Devel] [PATCH RH7] ploop: Fix crash in purge_lru_warn()
On 11/10/20 5:47 PM, Kirill Tkhai wrote:
> do_div() works wrong in case of the second argument is long.
> We don't need remainder, so we don't need do_div() at all.
>
> https://jira.sw.ru/browse/PSBM-122035
>
> Reported-by: Evgenii Shatokhin
> Signed-off-by: Kirill Tkhai

Reviewed-by: Andrey Ryabinin
[Devel] [PATCH vz8] vdso, vclock_gettime: fix linking with old linkers
On some old linkers vdso fails to build because of a dynamic relocation
of the 've_start_time' symbol:

  VDSO2C  arch/x86/entry/vdso/vdso-image-64.c
  Error: vdso image contains dynamic relocations

I wasn't able to figure out why new linkers don't generate the
relocation while old ones do, but I did find out that the
visibility("hidden") attribute on 've_start_time' cures the problem.

https://jira.sw.ru/browse/PSBM-121668
Signed-off-by: Andrey Ryabinin
---
 arch/x86/entry/vdso/vclock_gettime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 224dbe80da66..b2f1f19319d8 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -24,7 +24,7 @@
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
-u64 ve_start_time;
+u64 ve_start_time __attribute__((visibility("hidden")));
 
 extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
 extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
-- 
2.26.2
[Devel] [PATCH vz8 2/2] ext4: send abort uevent on ext4 journal abort
From: Dmitry Monakhov

Currently an error from the device results in ext4_abort(), but the
uevent is not generated because the context of ext4_abort() callers
does not allow GFP_KERNEL memory allocation. Let's relax the submission
context requirement and defer actual uevent submission to a workqueue.
It can be any workqueue; I've picked rsv_conversion_wq because it
already exists.

khorenko@: "system_wq" does not fit here because at the moment of work
execution sb can be already destroyed. "EXT4_SB(sb)->rsv_conversion_wq"
is flushed before sb is destroyed.

Signed-off-by: Dmitry Monakhov
[aryabinin: rh8 rebase]
Signed-off-by: Andrey Ryabinin
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/super.c | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 228492c9518f..bbdd7efc8447 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1499,6 +1499,7 @@ struct ext4_sb_info {
 	__u32 s_csum_seed;
 
 	bool s_err_event_sent;
+	bool s_abrt_event_sent;
 
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
@@ -3126,6 +3127,7 @@ enum ext4_event_type {
 	EXT4_UA_UMOUNT,
 	EXT4_UA_REMOUNT,
 	EXT4_UA_ERROR,
+	EXT4_UA_ABORT,
 	EXT4_UA_FREEZE,
 	EXT4_UA_UNFREEZE,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3cc979825ec8..00619f45b1c3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -420,6 +420,9 @@ static void ext4_send_uevent_work(struct work_struct *w)
 	case EXT4_UA_ERROR:
 		ret = add_uevent_var(env, "FS_ACTION=%s", "ERROR");
 		break;
+	case EXT4_UA_ABORT:
+		ret = add_uevent_var(env, "FS_ACTION=%s", "ABORT");
+		break;
 	case EXT4_UA_FREEZE:
 		ret = add_uevent_var(env, "FS_ACTION=%s", "FREEZE");
 		break;
@@ -576,6 +579,9 @@ static void ext4_handle_error(struct super_block *sb)
 	if (!test_opt(sb, ERRORS_CONT)) {
 		journal_t *journal = EXT4_SB(sb)->s_journal;
 
+		if (!xchg(&EXT4_SB(sb)->s_abrt_event_sent, 1))
+			ext4_send_uevent(sb, EXT4_UA_ABORT);
+
 		EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
 		if (journal)
 			jbd2_journal_abort(journal, -EIO);
@@ -801,6 +807,10 @@ void __ext4_abort(struct super_block *sb, const char *function,
 	if (sb_rdonly(sb) == 0) {
 		ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
+
+		if (!xchg(&EXT4_SB(sb)->s_abrt_event_sent, 1))
+			ext4_send_uevent(sb, EXT4_UA_ABORT);
+
 		EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
 		/*
 		 * Make sure updated value of ->s_mount_flags will be visible
-- 
2.26.2
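The `xchg()` guard on `s_abrt_event_sent` guarantees the ABORT uevent is emitted at most once per superblock, no matter which abort path runs first. A userspace equivalent of that idiom, sketched with C11 atomics instead of the kernel's xchg() (the function name is made up):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool abort_event_sent;

/*
 * First caller flips the flag 0 -> 1 and is told to send the event;
 * every later caller (including concurrent ones) sees 1 and skips it.
 */
static bool should_send_abort_event(void)
{
    return !atomic_exchange(&abort_event_sent, true);
}
```

Unlike a plain `if (!flag) flag = 1;`, the exchange is atomic, so two racing abort paths cannot both win.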
[Devel] [PATCH vz8 1/2] ext4: add generic uevent infrastructure
From: Dmitry Monakhov

*Purpose:
It is reasonable to announce fs-related events via the uevent
infrastructure. This patch implements only the ext4 part, but IMHO this
should be useful for any generic filesystem.

Example: a runtime fs error is a purely async event. Currently there is
no good way to handle this situation and inform user-space about it.

*Implementation:
Add uevent infrastructure similar to dm uevent:
FS_ACTION = {MOUNT|UMOUNT|REMOUNT|ERROR|FREEZE|UNFREEZE}
FS_UUID
FS_NAME
FS_TYPE

Signed-off-by: Dmitry Monakhov
[aryabinin: add error event, rh8 rebase]
Signed-off-by: Andrey Ryabinin
---
 fs/ext4/ext4.h  |  11 +++
 fs/ext4/super.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 028832d858fc..228492c9518f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1498,6 +1498,8 @@ struct ext4_sb_info {
 	/* Precomputed FS UUID checksum for seeding other checksums */
 	__u32 s_csum_seed;
 
+	bool s_err_event_sent;
+
 	/* Reclaim extents from extent status tree */
 	struct shrinker s_es_shrinker;
 	struct list_head s_es_list;	/* List of inodes with reclaimable extents */
@@ -3119,6 +3121,15 @@ extern int ext4_check_blockref(const char *, unsigned int,
 struct ext4_ext_path;
 struct ext4_extent;
 
+enum ext4_event_type {
+	EXT4_UA_MOUNT,
+	EXT4_UA_UMOUNT,
+	EXT4_UA_REMOUNT,
+	EXT4_UA_ERROR,
+	EXT4_UA_FREEZE,
+	EXT4_UA_UNFREEZE,
+};
+
 /*
  * Maximum number of logical blocks in a file; ext4_extent's ee_block is
  * __le32.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7fc5ad243953..3cc979825ec8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -354,6 +354,117 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
 #define ext4_get_tstamp(es, tstamp) \
 	__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
 
+static int ext4_uuid_valid(const u8 *uuid)
+{
+	int i;
+
+	for (i = 0; i < 16; i++) {
+		if (uuid[i])
+			return 1;
+	}
+	return 0;
+}
+
+struct ext4_uevent {
+	struct super_block *sb;
+	enum ext4_event_type action;
+	struct work_struct work;
+};
+
+/**
+ * ext4_send_uevent_work - prepare and send uevent
+ *
+ * @w:		work_struct embedded in struct ext4_uevent
+ *
+ */
+static void ext4_send_uevent_work(struct work_struct *w)
+{
+	struct ext4_uevent *e = container_of(w, struct ext4_uevent, work);
+	struct super_block *sb = e->sb;
+	struct kobj_uevent_env *env;
+	const u8 *uuid = EXT4_SB(sb)->s_es->s_uuid;
+	enum kobject_action kaction = KOBJ_CHANGE;
+	int ret;
+
+	env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
+	if (!env) {
+		kfree(e);
+		return;
+	}
+	ret = add_uevent_var(env, "FS_TYPE=%s", sb->s_type->name);
+	if (ret)
+		goto out;
+	ret = add_uevent_var(env, "FS_NAME=%s", sb->s_id);
+	if (ret)
+		goto out;
+
+	if (ext4_uuid_valid(uuid)) {
+		ret = add_uevent_var(env, "UUID=%pUB", uuid);
+		if (ret)
+			goto out;
+	}
+
+	switch (e->action) {
+	case EXT4_UA_MOUNT:
+		kaction = KOBJ_ONLINE;
+		ret = add_uevent_var(env, "FS_ACTION=%s", "MOUNT");
+		break;
+	case EXT4_UA_UMOUNT:
+		kaction = KOBJ_OFFLINE;
+		ret = add_uevent_var(env, "FS_ACTION=%s", "UMOUNT");
+		break;
+	case EXT4_UA_REMOUNT:
+		ret = add_uevent_var(env, "FS_ACTION=%s", "REMOUNT");
+		break;
+	case EXT4_UA_ERROR:
+		ret = add_uevent_var(env, "FS_ACTION=%s", "ERROR");
+		break;
+	case EXT4_UA_FREEZE:
+		ret = add_uevent_var(env, "FS_ACTION=%s", "FREEZE");
+		break;
+	case EXT4_UA_UNFREEZE:
+		ret = add_uevent_var(env, "FS_ACTION=%s", "UNFREEZE");
+		break;
+	default:
+		ret = -EINVAL;
+	}
+	if (ret)
+		goto out;
+	ret = kobject_uevent_env(&(EXT4_SB(sb)->s_kobj), kaction, env->envp);
+out:
+	kfree(env);
+	kfree(e);
+}
+
+/**
+ * ext4_send_uevent - prepare and schedule event submission
+ *
+ * @sb:		super_block
+ * @action:	action type
+ *
+ */
+void ext4_send_uevent(struct super_block *sb, enum ext4_event_type action)
+{
+	struct ext4_uevent *e;
+
+	/*
+	 * May happen if called from ext4_put_super() -> __ext4_abort()
+	 * -> ext4_send_uevent()
+	 */
+	if (!EXT4_SB(sb)->rsv_conversion_wq)
+		ret
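The split this infrastructure implements — record a tiny event struct in the restricted context, then do the allocations and string formatting later from a worker — can be sketched in a single-threaded userspace model. All names and types below are invented for illustration; the real code queues a work item on an ext4 workqueue instead of a hand-rolled list.

```c
#include <assert.h>
#include <stdlib.h>

enum fs_event { EV_MOUNT, EV_UMOUNT, EV_ERROR };

struct fs_uevent {
    enum fs_event action;
    struct fs_uevent *next;
};

static struct fs_uevent *pending;

/* "Restricted context" side: only a tiny allocation, no formatting. */
static int fs_send_uevent(enum fs_event action)
{
    struct fs_uevent *e = malloc(sizeof(*e));

    if (!e)
        return -1;
    e->action = action;
    e->next = pending;
    pending = e;
    return 0;
}

/* "Worker" side: safe place for the heavy lifting; returns events handled. */
static int fs_uevent_work(void)
{
    int handled = 0;

    while (pending) {
        struct fs_uevent *e = pending;

        pending = e->next;
        /* here the real code kzalloc()s the env and adds FS_ACTION=... */
        free(e);
        handled++;
    }
    return handled;
}
```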
[Devel] [PATCH vz8] x86_64, vclock_gettime: Use standard division instead of __iter_div_u64_rem()
timespec_sub_ns() historically uses __iter_div_u64_rem() for division.
Probably it's supposed to be faster:

	/*
	 * Iterative div/mod for use when dividend is not expected to be much
	 * bigger than divisor.
	 */
	u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)

However in our case ve_start_time may make the dividend much bigger
than the divisor. So let's use the standard "/" instead of the
iterative one. With 0 ve_start_time I wasn't able to see a measurable
difference, however with a big ve_start_time the difference is rather
significant:

 # time ./clock_iter_div
 real	1m30.224s
 user	1m30.343s
 sys	0m0.008s

 # time taskset ./clock_div
 real	0m2.757s
 user	0m1.730s
 sys	0m0.066s

The 32-bit vdso doesn't like 64-bit division and doesn't link. I think
it needs __udivsi3(). So just fall back to __iter_div_u64_rem() on
32-bit.

https://jira.sw.ru/browse/PSBM-121856
Signed-off-by: Andrey Ryabinin
---
 arch/x86/entry/vdso/vclock_gettime.c | 18 ++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index be1de6c4cafa..224dbe80da66 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -229,13 +229,27 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 	return mode;
 }
 
+static inline u64 divu64(u64 dividend, u32 divisor, u64 *remainder)
+{
+	/* 32-bit wants __udivsi3() and fails to link, so fallback to iter */
+#ifndef BUILD_VDSO32
+	u64 res;
+
+	res = dividend/divisor;
+	*remainder = dividend % divisor;
+	return res;
+#else
+	return __iter_div_u64_rem(dividend, divisor, remainder);
+#endif
+}
+
 static inline void timespec_sub_ns(struct timespec *ts, u64 ns)
 {
 	if ((s64)ns <= 0) {
-		ts->tv_sec += __iter_div_u64_rem(-ns, NSEC_PER_SEC, &ns);
+		ts->tv_sec += divu64(-ns, NSEC_PER_SEC, &ns);
 		ts->tv_nsec = ns;
 	} else {
-		ts->tv_sec -= __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+		ts->tv_sec -= divu64(ns, NSEC_PER_SEC, &ns);
 		if (ns) {
 			ts->tv_sec--;
 			ns = NSEC_PER_SEC - ns;
-- 
2.26.2
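The timespec_sub_ns() logic with plain "/" and "%" is easy to check in userspace. This is a sketch for illustration only; the kernel version keeps __iter_div_u64_rem() for 32-bit vDSO builds, and the kernel types (s64/u64) are replaced with stdint equivalents here.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Userspace rendering of the patched timespec_sub_ns(). */
static void timespec_sub_ns(struct timespec *ts, uint64_t ns)
{
    if ((int64_t)ns <= 0) {
        /* "negative" ns actually means add -ns nanoseconds */
        ts->tv_sec += -ns / NSEC_PER_SEC;
        ts->tv_nsec = -ns % NSEC_PER_SEC;
    } else {
        ts->tv_sec -= ns / NSEC_PER_SEC;
        ns = ns % NSEC_PER_SEC;
        if (ns) {
            /* borrow one second to keep tv_nsec non-negative */
            ts->tv_sec--;
            ns = NSEC_PER_SEC - ns;
        }
        ts->tv_nsec = ns;
    }
}
```

Subtracting 1.5 s from 10 s yields 8 s + 500000000 ns, matching the borrow logic above; note tv_nsec is assigned (not accumulated), exactly as in the kernel function.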
[Devel] [PATCH vz8 v3 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read
If several threads read /proc/cpuinfo, some can see in 'flags' values
from c->x86_capability before __do_cpuid_fault() is called and the
masks are applied. Fix this by forming 'flags' on the stack first and
copying them into per_cpu(cpu_flags, cpu) as a last step.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin
---
Changes since v1:
 - none
Changes since v2:
 - add spinlock, use temporary ve_flags in show_cpuinfo()

 arch/x86/kernel/cpu/proc.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4fe1577d5e6f..08fd7ff9a55b 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -65,15 +65,16 @@ struct cpu_flags {
 };
 
 static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
+static DEFINE_SPINLOCK(cpu_flags_lock);
 
 static void init_cpu_flags(void *dummy)
 {
 	int cpu = smp_processor_id();
-	struct cpu_flags *flags = &per_cpu(cpu_flags, cpu);
+	struct cpu_flags flags;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	unsigned int eax, ebx, ecx, edx;
 
-	memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
+	memcpy(&flags, c->x86_capability, sizeof(flags));
 
 	/*
 	 * Clear feature bits masked using cpuid masking/faulting.
@@ -81,26 +82,30 @@ static void init_cpu_flags(void *dummy)
 
 	if (c->cpuid_level >= 0x00000001) {
 		__do_cpuid_fault(0x00000001, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[4] &= ecx;
-		flags->val[0] &= edx;
+		flags.val[4] &= ecx;
+		flags.val[0] &= edx;
 	}
 
 	if (c->cpuid_level >= 0x00000007) {
 		__do_cpuid_fault(0x00000007, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[9] &= ebx;
+		flags.val[9] &= ebx;
 	}
 
 	if ((c->extended_cpuid_level & 0xffff0000) == 0x80000000 &&
 	    c->extended_cpuid_level >= 0x80000001) {
 		__do_cpuid_fault(0x80000001, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[6] &= ecx;
-		flags->val[1] &= edx;
+		flags.val[6] &= ecx;
+		flags.val[1] &= edx;
 	}
 
 	if (c->cpuid_level >= 0x0000000d) {
 		__do_cpuid_fault(0x0000000d, 1, &eax, &ebx, &ecx, &edx);
-		flags->val[10] &= eax;
+		flags.val[10] &= eax;
 	}
+
+	spin_lock(&cpu_flags_lock);
+	memcpy(&per_cpu(cpu_flags, cpu), &flags, sizeof(flags));
+	spin_unlock(&cpu_flags_lock);
 }
 
 static int show_cpuinfo(struct seq_file *m, void *v)
@@ -108,6 +113,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 	struct cpuinfo_x86 *c = v;
 	unsigned int cpu;
 	int is_super = ve_is_super(get_exec_env());
+	struct cpu_flags ve_flags;
 	int i;
 
 	cpu = c->cpu_index;
@@ -147,12 +153,19 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 		show_cpuinfo_core(m, c, cpu);
 	show_cpuinfo_misc(m, c);
 
+	if (!is_super) {
+		spin_lock_irq(&cpu_flags_lock);
+		memcpy(&ve_flags, &per_cpu(cpu_flags, cpu), sizeof(ve_flags));
+		spin_unlock_irq(&cpu_flags_lock);
+	}
+
 	seq_puts(m, "flags\t\t:");
 	for (i = 0; i < 32*NCAPINTS; i++)
 		if (x86_cap_flags[i] != NULL &&
 		    ((is_super && cpu_has(c, i)) ||
 		     (!is_super && test_bit(i, (unsigned long *)
-					    &per_cpu(cpu_flags, cpu)))))
+					    &ve_flags))))
 			seq_printf(m, " %s", x86_cap_flags[i]);
 
 	seq_puts(m, "\nbugs\t\t:");
-- 
2.26.2
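The core pattern of this fix — mask a private stack copy, then publish it in a single final step under a lock — can be shown in a userspace sketch. The capability words, masks and names below are invented; real readers would also take the lock, and the point is only that `published` never holds a half-masked state.

```c
#include <assert.h>
#include <string.h>

#define NWORDS 4

struct cpu_flags {
    unsigned int val[NWORDS];
};

static struct cpu_flags published;

static void update_flags(const unsigned int raw[NWORDS], unsigned int mask)
{
    struct cpu_flags flags;

    memcpy(&flags, raw, sizeof(flags));        /* work on a stack copy... */
    for (int i = 0; i < NWORDS; i++)
        flags.val[i] &= mask;                  /* ...apply masks in private... */
    memcpy(&published, &flags, sizeof(flags)); /* ...publish as the last step */
}
```

Had the masking been applied directly to `published`, a concurrent reader could observe some words masked and others still raw — which is exactly the /proc/cpuinfo race being fixed.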
[Devel] [PATCH vz8 v3 2/2] x86: don't enable cpuid faults if /proc/vz/cpuid_override unused
We don't need to enable cpuid faults if /proc/vz/cpuid_override was
never used. If a task was attached to a ve before a write to
'cpuid_override', it will not get cpuid faults now. It shouldn't be a
problem since the proper use of 'cpuid_override' requires stopping all
containers.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin
Reviewed-by: Kirill Tkhai
---
Changes since v1:
 - git add include/linux/cpuid_override.h
Changes since v2:
 - add review tag

 arch/x86/kernel/cpuid_fault.c  | 21 ++-------------------
 include/linux/cpuid_override.h | 30 ++++++++++++++++++++++++++++++
 kernel/ve/ve.c                 |  5 ++++-
 3 files changed, 36 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/cpuid_override.h

diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
index 1e8ffacc4412..cb6c2216fa8a 100644
--- a/arch/x86/kernel/cpuid_fault.c
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -1,3 +1,4 @@
+#include
 #include
 #include
 #include
@@ -9,25 +10,7 @@
 #include
 #include
 
-struct cpuid_override_entry {
-	unsigned int op;
-	unsigned int count;
-	bool has_count;
-	unsigned int eax;
-	unsigned int ebx;
-	unsigned int ecx;
-	unsigned int edx;
-};
-
-#define MAX_CPUID_OVERRIDE_ENTRIES	16
-
-struct cpuid_override_table {
-	struct rcu_head rcu_head;
-	int size;
-	struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
-};
-
-static struct cpuid_override_table __rcu *cpuid_override __read_mostly;
+struct cpuid_override_table __rcu *cpuid_override __read_mostly;
 static DEFINE_SPINLOCK(cpuid_override_lock);
 
 static void cpuid_override_update(struct cpuid_override_table *new_table)
diff --git a/include/linux/cpuid_override.h b/include/linux/cpuid_override.h
new file mode 100644
index 000000000000..ea0fa7af3d3c
--- /dev/null
+++ b/include/linux/cpuid_override.h
@@ -0,0 +1,30 @@
+#ifndef __CPUID_OVERRIDE_H
+#define __CPUID_OVERRIDE_H
+
+#include
+
+struct cpuid_override_entry {
+	unsigned int op;
+	unsigned int count;
+	bool has_count;
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+};
+
+#define MAX_CPUID_OVERRIDE_ENTRIES	16
+
+struct cpuid_override_table {
+	struct rcu_head rcu_head;
+	int size;
+	struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
+};
+
+extern struct cpuid_override_table __rcu *cpuid_override;
+
+static inline bool cpuid_override_on(void)
+{
+	return rcu_access_pointer(cpuid_override);
+}
+#endif
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index aad8ce69ca1f..0d4d0ab70369 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -9,6 +9,7 @@
  * 've.c' helper file performing VE sub-system initialization
  */
 
+#include
 #include
 #include
 #include
@@ -801,6 +802,7 @@ static void ve_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
 	struct task_struct *task;
+	extern struct cpuid_override_table __rcu *cpuid_override;
 
 	cgroup_taskset_for_each(task, css, tset) {
 		struct ve_struct *ve = css_to_ve(css);
@@ -816,7 +818,8 @@ static void ve_attach(struct cgroup_taskset *tset)
 		/* Leave parent exec domain */
 		task->parent_exec_id--;
 
-		set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
+		if (cpuid_override_on())
+			set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
 
 		task->task_ve = ve;
 	}
 }
-- 
2.26.2
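The cpuid_override_on() check makes CPUID faulting pay-for-use: the per-task flag is set only once the override table pointer has been published. A simplified userspace model of that "enable only after first publication" pattern, with a plain pointer instead of RCU (all names here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct override_table { int size; };

/* NULL until the first write to the (modeled) cpuid_override file. */
static struct override_table *override_table;

static bool override_on(void)
{
    return override_table != NULL;
}

/* Models ve_attach(): returns whether the task gets the faulting flag. */
static bool attach_task(void)
{
    return override_on();
}
```

Tasks attached before the table exists never get the flag — which is exactly the behavior the commit message calls out as acceptable, since proper use of the override requires stopping all containers first.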
Re: [Devel] [PATCH vz8 v2 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read
On 11/3/20 2:28 PM, Kirill Tkhai wrote:
> On 02.11.2020 20:13, Andrey Ryabinin wrote:
>> If several threads read /proc/cpuinfo some can see in 'flags'
>> values from c->x86_capability, before __do_cpuid_fault() called
>> and masks applied. Fix this by forming 'flags' on stack first
>> and copy them in per_cpu(cpu_flags, cpu) as a last step.
>>
>> https://jira.sw.ru/browse/PSBM-121823
>> Signed-off-by: Andrey Ryabinin
>> ---
>> Changes since v1:
>>  - none
>>
>>  arch/x86/kernel/cpu/proc.c | 17 +++++++++--------
>>  1 file changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
>> index 4fe1577d5e6f..4cc2951e34fb 100644
>> --- a/arch/x86/kernel/cpu/proc.c
>> +++ b/arch/x86/kernel/cpu/proc.c
>> @@ -69,11 +69,11 @@ static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
>>  static void init_cpu_flags(void *dummy)
>>  {
>>  	int cpu = smp_processor_id();
>> -	struct cpu_flags *flags = &per_cpu(cpu_flags, cpu);
>> +	struct cpu_flags flags;
>>  	struct cpuinfo_x86 *c = &cpu_data(cpu);
>>  	unsigned int eax, ebx, ecx, edx;
>>
>> -	memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
>> +	memcpy(&flags, c->x86_capability, sizeof(flags));
>>
>>  	/*
>>  	 * Clear feature bits masked using cpuid masking/faulting.
>> @@ -81,26 +81,27 @@ static void init_cpu_flags(void *dummy)
>>
>>  	if (c->cpuid_level >= 0x00000001) {
>>  		__do_cpuid_fault(0x00000001, 0, &eax, &ebx, &ecx, &edx);
>> -		flags->val[4] &= ecx;
>> -		flags->val[0] &= edx;
>> +		flags.val[4] &= ecx;
>> +		flags.val[0] &= edx;
>>  	}
>>
>>  	if (c->cpuid_level >= 0x00000007) {
>>  		__do_cpuid_fault(0x00000007, 0, &eax, &ebx, &ecx, &edx);
>> -		flags->val[9] &= ebx;
>> +		flags.val[9] &= ebx;
>>  	}
>>
>>  	if ((c->extended_cpuid_level & 0xffff0000) == 0x80000000 &&
>>  	    c->extended_cpuid_level >= 0x80000001) {
>>  		__do_cpuid_fault(0x80000001, 0, &eax, &ebx, &ecx, &edx);
>> -		flags->val[6] &= ecx;
>> -		flags->val[1] &= edx;
>> +		flags.val[6] &= ecx;
>> +		flags.val[1] &= edx;
>>  	}
>>
>>  	if (c->cpuid_level >= 0x0000000d) {
>>  		__do_cpuid_fault(0x0000000d, 1, &eax, &ebx, &ecx, &edx);
>> -		flags->val[10] &= eax;
>> +		flags.val[10] &= eax;
>>  	}
>> +	memcpy(&per_cpu(cpu_flags, cpu), &flags, sizeof(flags));
>
> This is still racy, since memcpy() is not atomic. Maybe we should add
> some lock on top of this?

This race shouldn't be a problem since the flags are not supposed to
change during the ve lifetime, so we would be overwriting the same
values. But I don't mind adding spinlock protection.
[Devel] [PATCH vz8 v2 2/2] x86: don't enable cpuid faults if /proc/vz/cpuid_override unused
We don't need to enable cpuid faults if /proc/vz/cpuid_override was
never used. If a task was attached to a ve before a write to
'cpuid_override', it will not get cpuid faults now. It shouldn't be a
problem since the proper use of 'cpuid_override' requires stopping all
containers.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin
---
Changes since v1:
 - git add include/linux/cpuid_override.h

 arch/x86/kernel/cpuid_fault.c  | 21 ++-------------------
 include/linux/cpuid_override.h | 30 ++++++++++++++++++++++++++++++
 kernel/ve/ve.c                 |  5 ++++-
 3 files changed, 36 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/cpuid_override.h

diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
index 1e8ffacc4412..cb6c2216fa8a 100644
--- a/arch/x86/kernel/cpuid_fault.c
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -1,3 +1,4 @@
+#include
 #include
 #include
 #include
@@ -9,25 +10,7 @@
 #include
 #include
 
-struct cpuid_override_entry {
-	unsigned int op;
-	unsigned int count;
-	bool has_count;
-	unsigned int eax;
-	unsigned int ebx;
-	unsigned int ecx;
-	unsigned int edx;
-};
-
-#define MAX_CPUID_OVERRIDE_ENTRIES	16
-
-struct cpuid_override_table {
-	struct rcu_head rcu_head;
-	int size;
-	struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
-};
-
-static struct cpuid_override_table __rcu *cpuid_override __read_mostly;
+struct cpuid_override_table __rcu *cpuid_override __read_mostly;
 static DEFINE_SPINLOCK(cpuid_override_lock);
 
 static void cpuid_override_update(struct cpuid_override_table *new_table)
diff --git a/include/linux/cpuid_override.h b/include/linux/cpuid_override.h
new file mode 100644
index 000000000000..ea0fa7af3d3c
--- /dev/null
+++ b/include/linux/cpuid_override.h
@@ -0,0 +1,30 @@
+#ifndef __CPUID_OVERRIDE_H
+#define __CPUID_OVERRIDE_H
+
+#include
+
+struct cpuid_override_entry {
+	unsigned int op;
+	unsigned int count;
+	bool has_count;
+	unsigned int eax;
+	unsigned int ebx;
+	unsigned int ecx;
+	unsigned int edx;
+};
+
+#define MAX_CPUID_OVERRIDE_ENTRIES	16
+
+struct cpuid_override_table {
+	struct rcu_head rcu_head;
+	int size;
+	struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
+};
+
+extern struct cpuid_override_table __rcu *cpuid_override;
+
+static inline bool cpuid_override_on(void)
+{
+	return rcu_access_pointer(cpuid_override);
+}
+#endif
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index aad8ce69ca1f..0d4d0ab70369 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -9,6 +9,7 @@
  * 've.c' helper file performing VE sub-system initialization
  */
 
+#include
 #include
 #include
 #include
@@ -801,6 +802,7 @@ static void ve_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
 	struct task_struct *task;
+	extern struct cpuid_override_table __rcu *cpuid_override;
 
 	cgroup_taskset_for_each(task, css, tset) {
 		struct ve_struct *ve = css_to_ve(css);
@@ -816,7 +818,8 @@ static void ve_attach(struct cgroup_taskset *tset)
 		/* Leave parent exec domain */
 		task->parent_exec_id--;
 
-		set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
+		if (cpuid_override_on())
+			set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
 
 		task->task_ve = ve;
 	}
 }
-- 
2.26.2
[Devel] [PATCH vz8 v2 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read
If several threads read /proc/cpuinfo some can see in 'flags'
values from c->x86_capability, before __do_cpuid_fault() called
and masks applied. Fix this by forming 'flags' on stack first
and copy them in per_cpu(cpu_flags, cpu) as a last step.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin
---
Changes since v1:
 - none

 arch/x86/kernel/cpu/proc.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4fe1577d5e6f..4cc2951e34fb 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -69,11 +69,11 @@ static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
 static void init_cpu_flags(void *dummy)
 {
 	int cpu = smp_processor_id();
-	struct cpu_flags *flags = &per_cpu(cpu_flags, cpu);
+	struct cpu_flags flags;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	unsigned int eax, ebx, ecx, edx;
 
-	memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
+	memcpy(&flags, c->x86_capability, sizeof(flags));
 
 	/*
 	 * Clear feature bits masked using cpuid masking/faulting.
@@ -81,26 +81,27 @@ static void init_cpu_flags(void *dummy)
 
 	if (c->cpuid_level >= 0x00000001) {
 		__do_cpuid_fault(0x00000001, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[4] &= ecx;
-		flags->val[0] &= edx;
+		flags.val[4] &= ecx;
+		flags.val[0] &= edx;
 	}
 
 	if (c->cpuid_level >= 0x00000007) {
 		__do_cpuid_fault(0x00000007, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[9] &= ebx;
+		flags.val[9] &= ebx;
 	}
 
 	if ((c->extended_cpuid_level & 0xffff0000) == 0x80000000 &&
 	    c->extended_cpuid_level >= 0x80000001) {
 		__do_cpuid_fault(0x80000001, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[6] &= ecx;
-		flags->val[1] &= edx;
+		flags.val[6] &= ecx;
+		flags.val[1] &= edx;
 	}
 
 	if (c->cpuid_level >= 0x0000000d) {
 		__do_cpuid_fault(0x0000000d, 1, &eax, &ebx, &ecx, &edx);
-		flags->val[10] &= eax;
+		flags.val[10] &= eax;
 	}
+	memcpy(&per_cpu(cpu_flags, cpu), &flags, sizeof(flags));
 }
 
 static int show_cpuinfo(struct seq_file *m, void *v)
-- 
2.26.2
[Devel] [PATCH vz8 2/2] x86: don't enable cpuid faults if /proc/vz/cpuid_override unused
We don't need to enable cpuid faults if /proc/vz/cpuid_override was
never used. If a task was attached to a ve before a write to
'cpuid_override', it will not get cpuid faults now. It shouldn't be a
problem since the proper use of 'cpuid_override' requires stopping all
containers.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin
---
 arch/x86/kernel/cpuid_fault.c | 21 ++-------------------
 kernel/ve/ve.c                |  5 ++++-
 2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
index 1e8ffacc4412..cb6c2216fa8a 100644
--- a/arch/x86/kernel/cpuid_fault.c
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -1,3 +1,4 @@
+#include
 #include
 #include
 #include
@@ -9,25 +10,7 @@
 #include
 #include
 
-struct cpuid_override_entry {
-	unsigned int op;
-	unsigned int count;
-	bool has_count;
-	unsigned int eax;
-	unsigned int ebx;
-	unsigned int ecx;
-	unsigned int edx;
-};
-
-#define MAX_CPUID_OVERRIDE_ENTRIES	16
-
-struct cpuid_override_table {
-	struct rcu_head rcu_head;
-	int size;
-	struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
-};
-
-static struct cpuid_override_table __rcu *cpuid_override __read_mostly;
+struct cpuid_override_table __rcu *cpuid_override __read_mostly;
 static DEFINE_SPINLOCK(cpuid_override_lock);
 
 static void cpuid_override_update(struct cpuid_override_table *new_table)
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index aad8ce69ca1f..0d4d0ab70369 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -9,6 +9,7 @@
  * 've.c' helper file performing VE sub-system initialization
  */
 
+#include
 #include
 #include
 #include
@@ -801,6 +802,7 @@ static void ve_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
 	struct task_struct *task;
+	extern struct cpuid_override_table __rcu *cpuid_override;
 
 	cgroup_taskset_for_each(task, css, tset) {
 		struct ve_struct *ve = css_to_ve(css);
@@ -816,7 +818,8 @@ static void ve_attach(struct cgroup_taskset *tset)
 		/* Leave parent exec domain */
 		task->parent_exec_id--;
 
-		set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
+		if (cpuid_override_on())
+			set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
 
 		task->task_ve = ve;
 	}
 }
-- 
2.26.2
[Devel] [PATCH vz8 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read
If several threads read /proc/cpuinfo some can see in 'flags'
values from c->x86_capability, before __do_cpuid_fault() called
and masks applied. Fix this by forming 'flags' on stack first
and copy them in per_cpu(cpu_flags, cpu) as a last step.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin
---
 arch/x86/kernel/cpu/proc.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4fe1577d5e6f..4cc2951e34fb 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -69,11 +69,11 @@ static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
 static void init_cpu_flags(void *dummy)
 {
 	int cpu = smp_processor_id();
-	struct cpu_flags *flags = &per_cpu(cpu_flags, cpu);
+	struct cpu_flags flags;
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 	unsigned int eax, ebx, ecx, edx;
 
-	memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
+	memcpy(&flags, c->x86_capability, sizeof(flags));
 
 	/*
 	 * Clear feature bits masked using cpuid masking/faulting.
@@ -81,26 +81,27 @@ static void init_cpu_flags(void *dummy)
 
 	if (c->cpuid_level >= 0x00000001) {
 		__do_cpuid_fault(0x00000001, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[4] &= ecx;
-		flags->val[0] &= edx;
+		flags.val[4] &= ecx;
+		flags.val[0] &= edx;
 	}
 
 	if (c->cpuid_level >= 0x00000007) {
 		__do_cpuid_fault(0x00000007, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[9] &= ebx;
+		flags.val[9] &= ebx;
 	}
 
 	if ((c->extended_cpuid_level & 0xffff0000) == 0x80000000 &&
 	    c->extended_cpuid_level >= 0x80000001) {
 		__do_cpuid_fault(0x80000001, 0, &eax, &ebx, &ecx, &edx);
-		flags->val[6] &= ecx;
-		flags->val[1] &= edx;
+		flags.val[6] &= ecx;
+		flags.val[1] &= edx;
 	}
 
 	if (c->cpuid_level >= 0x0000000d) {
 		__do_cpuid_fault(0x0000000d, 1, &eax, &ebx, &ecx, &edx);
-		flags->val[10] &= eax;
+		flags.val[10] &= eax;
 	}
+	memcpy(&per_cpu(cpu_flags, cpu), &flags, sizeof(flags));
 }
 
 static int show_cpuinfo(struct seq_file *m, void *v)
-- 
2.26.2
Re: [Devel] [PATCH rh8 3/3] ve/vestat: Introduce /proc/vz/vestat
On 10/30/20 4:08 PM, Konstantin Khorenko wrote: > The patch is based on following vz7 commits: > > f997bf6c613a ("ve: initial patch") > 75fc174adc36 ("sched: Port cpustat related patches") > 09e1cb4a7d4d ("ve/proc: restricted proc-entries scope") > a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs") > > Signed-off-by: Konstantin Khorenko > --- Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh8 2/3] ve/time/stat: idle time virtualization in /proc/loadavg
On 10/30/20 4:08 PM, Konstantin Khorenko wrote: > The patch is based on following vz7 commits: > a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs") > 75fc174adc36 ("sched: Port cpustat related patches") > > Fixes: a3c4d1d8f383 ("ve/time: Customize VE uptime") > > TODO: to separate FIXME hunks from a3c4d1d8f383 ("ve/time: Customize VE > uptime") and merge them into this commit > > Signed-off-by: Konstantin Khorenko > --- Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh8 1/3] ve/sched/stat: Introduce handler for getting CT cpu statistics
On 10/30/20 4:08 PM, Konstantin Khorenko wrote: > It will be used later in > * idle cpu stat virtualization in /proc/loadavg > * /proc/vz/vestat output > * VZCTL_GET_CPU_STAT ioctl > > The patch is based on following vz7 commits: > ecdce58b214c ("sched: Export per task_group statistics_work") > 75fc174adc36 ("sched: Port cpustat related patches") > a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs") > > Signed-off-by: Konstantin Khorenko Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh8 0/8] ve/proc/sched/stat: Virtualize /proc/stat in a Container
On 10/28/20 6:57 PM, Konstantin Khorenko wrote: > This patchset contains of parts of following vz7 commits: > > a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs") > ecdce58b214c ("sched: Export per task_group statistics_work") > fc24d1785a28 ("fs/proc: print fairshed stat") > 75fc174adc36 ("sched: Port cpustat related patches") > 3c7b1e52294c ("ve/sched: Hide steal time from inside CT") > 3d34f0d3b529 ("proc/cpu/cgroup: make boottime in CT reveal the real start > time") > 715f311fdb4a ("sched: Account task_group::cpustat,taskstats,avenrun") > > Known issues: > - context switches ("ctxt") and number of forks ("processes") >virtualization is TBD > - "procs_blocked" reported is incorrect, to be fixed by later patches > > Konstantin Khorenko (8): > ve/cgroup: export cgroup_get_ve_root1() + cleanup > kernel/stat: Introduce kernel_cpustat operation wrappers > ve/sched/stat: Add basic infrastructure for vcpu statistics > ve/sched/stat: Introduce functions to calculate vcpustat data > ve/proc/stat: Introduce /proc/stat virtualized handler for Containers > ve/proc/stat: Wire virtualized /proc/stat handler > ve/proc/stat: Introduce CPUTIME_USED field in cpustat statistic > sched: Fix task_group "iowait_sum" statistic accounting > > fs/proc/stat.c | 10 + > include/linux/kernel_stat.h | 37 > include/linux/ve.h | 8 + > kernel/cgroup/cgroup.c | 6 +- > kernel/sched/core.c | 17 +- > kernel/sched/cpuacct.c | 377 ++++ > kernel/sched/fair.c | 3 +- > kernel/sched/sched.h| 5 + > kernel/ve/ve.c | 17 ++ > 9 files changed, 475 insertions(+), 5 deletions(-) > Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh8] sched/stat: account ctxsw per task group
On 10/29/20 6:46 PM, Konstantin Khorenko wrote: > From: Vladimir Davydov > > This is a backport of diff-sched-account-ctxsw-per-task-group: > > Subject: sched: account ctxsw per task group > Date: Fri, 28 Dec 2012 15:09:45 +0400 > > * [sched] the number of context switches should be reported correctly > inside a CT in /proc/stat (PSBM-18113) > > For /proc/stat:ctxt to be correct inside containers. > > https://jira.sw.ru/browse/PSBM-18113 > > Signed-off-by: Vladimir Davydov > > (cherry picked from vz7 commit d388f0bf64adb74cd62c4deff58e181bd63d62ac) > Signed-off-by: Konstantin Khorenko > --- Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8 3/3] x86: Show vcpu cpuflags in cpuinfo
From: Kirill Tkhai Show cpu_i flags as flags of vcpu_i. Extracted from "Initial patch". Merged several reworks. TODO: Maybe replace/rework on_each_cpu() with smp_call_function_single(). Then we won't need to split c_start() in the previous patch (as the call function will be called right before a specific cpu is prepared for showing). This should be rather easy. [aryabinin: Don't see what it buys us, so I didn't try to implement it] Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-121823 [aryabinin: vz8 rebase] Signed-off-by: Andrey Ryabinin --- arch/x86/kernel/cpu/proc.c | 63 +++--- 1 file changed, 59 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c index d6b17a60acf6..4fe1577d5e6f 100644 --- a/arch/x86/kernel/cpu/proc.c +++ b/arch/x86/kernel/cpu/proc.c @@ -4,6 +4,8 @@ #include #include #include +#include +#include #include "cpu.h" @@ -58,10 +60,54 @@ extern void __do_cpuid_fault(unsigned int op, unsigned int count, unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx); +struct cpu_flags { + u32 val[NCAPINTS]; +}; + +static DEFINE_PER_CPU(struct cpu_flags, cpu_flags); + +static void init_cpu_flags(void *dummy) +{ + int cpu = smp_processor_id(); + struct cpu_flags *flags = &per_cpu(cpu_flags, cpu); + struct cpuinfo_x86 *c = &cpu_data(cpu); + unsigned int eax, ebx, ecx, edx; + + memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32)); + + /* +* Clear feature bits masked using cpuid masking/faulting. +*/ + + if (c->cpuid_level >= 0x00000001) { + __do_cpuid_fault(0x00000001, 0, &eax, &ebx, &ecx, &edx); + flags->val[4] &= ecx; + flags->val[0] &= edx; + } + + if (c->cpuid_level >= 0x00000007) { + __do_cpuid_fault(0x00000007, 0, &eax, &ebx, &ecx, &edx); + flags->val[9] &= ebx; + } + + if ((c->extended_cpuid_level & 0xffff0000) == 0x80000000 && + c->extended_cpuid_level >= 0x80000001) { + __do_cpuid_fault(0x80000001, 0, &eax, &ebx, &ecx, &edx); + flags->val[6] &= ecx; + flags->val[1] &= edx; + } + + if (c->cpuid_level >= 0x0000000d) { + __do_cpuid_fault(0x0000000d, 1, &eax, &ebx, &ecx, &edx); + flags->val[10] &= eax; + } +} + static int show_cpuinfo(struct seq_file *m, void *v) { struct cpuinfo_x86 *c = v; unsigned int cpu; + int is_super = ve_is_super(get_exec_env()); int i; cpu = c->cpu_index; @@ -103,7 +149,10 @@ static int show_cpuinfo(struct seq_file *m, void *v) seq_puts(m, "flags\t\t:"); for (i = 0; i < 32*NCAPINTS; i++) - if (cpu_has(c, i) && x86_cap_flags[i] != NULL) + if (x86_cap_flags[i] != NULL && + ((is_super && cpu_has(c, i)) || +(!is_super && test_bit(i, (unsigned long *) + &per_cpu(cpu_flags, cpu))))) seq_printf(m, " %s", x86_cap_flags[i]); seq_puts(m, "\nbugs\t\t:"); @@ -145,18 +194,24 @@ static int show_cpuinfo(struct seq_file *m, void *v) return 0; } -static void *c_start(struct seq_file *m, loff_t *pos) +static void *__c_start(struct seq_file *m, loff_t *pos) { *pos = cpumask_next(*pos - 1, cpu_online_mask); - if ((*pos) < nr_cpu_ids) + if (bitmap_weight(cpumask_bits(cpu_online_mask), *pos) < num_online_vcpus()) return &cpu_data(*pos); return NULL; } +static void *c_start(struct seq_file *m, loff_t *pos) +{ + on_each_cpu(init_cpu_flags, NULL, 1); + return __c_start(m, pos); +} + static void *c_next(struct seq_file *m, void *v, loff_t *pos) { (*pos)++; - return c_start(m, pos); + return __c_start(m, pos); } static void c_stop(struct seq_file *m, void *v) -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
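Inside a container the show path above tests bit i of the per-cpu flags words instead of calling cpu_has(). The indexing convention -- bit i lives in 32-bit word i/32 at position i%32, the same word/bit split x86_capability uses -- can be sketched in userspace (`cap_test_bit` is a hypothetical helper, not a kernel API):

```c
/* Test capability bit `bit` in an array of 32-bit capability words,
 * using the word/bit split that x86_capability and cpu_has() follow:
 * word index = bit / 32, position within the word = bit % 32. */
int cap_test_bit(const unsigned int *caps, int bit)
{
    return (caps[bit / 32] >> (bit % 32)) & 1;
}
```

The kernel's test_bit() operates on unsigned long words instead, which is why the patch casts the per-cpu u32 array; on little-endian x86-64 the bit numbering comes out the same.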
[Devel] [PATCH vz8 2/3] x86: make ARCH_[SET|GET]_CPUID friends with /proc/vz/cpuid_override
We are using cpuid faults to emulate cpuid in containers. This conflicts with arch_prctl(ARCH_SET_CPUID, 0) which allows to enable cpuid faulting so that cpuid instruction causes SIGSEGV. Add TIF_CPUID_OVERRIDE thread info flag which is added on all !ve0 tasks. And check this flag along with TIF_NOCPUID to decide whether we need to enable/disable cpuid faults or not. https://jira.sw.ru/browse/PSBM-121823 Signed-off-by: Andrey Ryabinin --- arch/x86/include/asm/thread_info.h | 4 +++- arch/x86/kernel/cpuid_fault.c | 3 ++- arch/x86/kernel/process.c | 13 + arch/x86/kernel/traps.c| 3 +++ kernel/ve/ve.c | 1 + 5 files changed, 18 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index c0da378eed8b..6ffb64d25383 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -92,6 +92,7 @@ struct thread_info { #define TIF_NOCPUID15 /* CPUID is not accessible in userland */ #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ +#define TIF_CPUID_OVERRIDE 18 /* CPUID emulation enabled */ #define TIF_NOHZ 19 /* in adaptive nohz mode */ #define TIF_MEMDIE 20 /* is terminating due to OOM killer */ #define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */ @@ -122,6 +123,7 @@ struct thread_info { #define _TIF_NOCPUID (1 << TIF_NOCPUID) #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32) +#define _TIF_CPUID_OVERRIDE(1 << TIF_CPUID_OVERRIDE) #define _TIF_NOHZ (1 << TIF_NOHZ) #define _TIF_POLLING_NRFLAG(1 << TIF_POLLING_NRFLAG) #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP) @@ -153,7 +155,7 @@ struct thread_info { /* flags to check in __switch_to() */ #define _TIF_WORK_CTXSW_BASE \ (_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP| \ -_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE) +_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE | _TIF_CPUID_OVERRIDE) /* * Avoid calls to __switch_to_xtra() on UP as STIBP is not evaluated. 
diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c index 339e2638c3b8..1e8ffacc4412 100644 --- a/arch/x86/kernel/cpuid_fault.c +++ b/arch/x86/kernel/cpuid_fault.c @@ -6,7 +6,8 @@ #include #include #include -#include +#include +#include struct cpuid_override_entry { unsigned int op; diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index e5c5b1d724ab..788b9b8f8f9c 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -209,7 +209,8 @@ static void set_cpuid_faulting(bool on) static void disable_cpuid(void) { preempt_disable(); - if (!test_and_set_thread_flag(TIF_NOCPUID)) { + if (!test_and_set_thread_flag(TIF_NOCPUID) || + test_thread_flag(TIF_CPUID_OVERRIDE)) { /* * Must flip the CPU state synchronously with * TIF_NOCPUID in the current running context. @@ -222,7 +223,8 @@ static void disable_cpuid(void) static void enable_cpuid(void) { preempt_disable(); - if (test_and_clear_thread_flag(TIF_NOCPUID)) { + if (test_and_clear_thread_flag(TIF_NOCPUID) && + !test_thread_flag(TIF_CPUID_OVERRIDE)) { /* * Must flip the CPU state synchronously with * TIF_NOCPUID in the current running context. 
@@ -505,6 +507,7 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev, *next; unsigned long tifp, tifn; + bool prev_cpuid, next_cpuid; prev = &prev_p->thread; next = &next_p->thread; @@ -529,8 +532,10 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p) if ((tifp ^ tifn) & _TIF_NOTSC) cr4_toggle_bits_irqsoff(X86_CR4_TSD); - if ((tifp ^ tifn) & _TIF_NOCPUID) - set_cpuid_faulting(!!(tifn & _TIF_NOCPUID)); + prev_cpuid = (tifp & _TIF_NOCPUID) || (tifp & _TIF_CPUID_OVERRIDE); + next_cpuid = (tifn & _TIF_NOCPUID) || (tifn & _TIF_CPUID_OVERRIDE); + if (prev_cpuid != next_cpuid) + set_cpuid_faulting(next_cpuid); if (likely(!((tifp | tifn) & _TIF_SPEC_FORCE_UPDATE))) { __speculation_ctrl_update(tifp, tifn); diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index c43e3b80e50f..d0b379cf0484 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -526,6 +526,9 @@ static int check_cpuid_fault(struct pt_regs *regs, long error_code) if (error_code != 0)
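The __switch_to_xtra() hunk above collapses the two thread flags into one "faulting needed" predicate and toggles the CPUID-faulting MSR only when that predicate changes across a context switch. The decision logic in isolation (flag bit positions as declared in the patch):

```c
#define _TIF_NOCPUID        (1u << 15)  /* CPUID not accessible in userland */
#define _TIF_CPUID_OVERRIDE (1u << 18)  /* CPUID emulation enabled */

/* CPUID faulting must be on if either flag is set. */
int needs_cpuid_faulting(unsigned int tif)
{
    return !!(tif & (_TIF_NOCPUID | _TIF_CPUID_OVERRIDE));
}

/* Toggle the faulting MSR only when the combined state differs
 * between the previous (tifp) and next (tifn) task, matching the
 * prev_cpuid != next_cpuid check in the patch. */
int should_toggle_faulting(unsigned int tifp, unsigned int tifn)
{
    return needs_cpuid_faulting(tifp) != needs_cpuid_faulting(tifn);
}
```

Note how switching between a TIF_NOCPUID task and a TIF_CPUID_OVERRIDE task correctly leaves faulting untouched, which the old single-flag XOR test could not express.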
[Devel] [PATCH vz8 1/3] arch/x86: introduce cpuid override
From: Vladimir Davydov Port diff-arch-x86-introduce-cpuid-override Recent Intel CPUs rejected CPUID masking, which is required for flex migration, in favor of CPUID faulting. So we need to support it in the kernel. This patch adds a user-writable file /proc/vz/cpuid_override, which contains the CPUID override table. Each table entry must have the following format: op[ count]: eax ebx ecx edx where @op and optional @count define a CPUID function, whose output one would like to override (@op and @count are loaded to EAX and ECX registers respectively before calling CPUID); @eax, @ebx, @ecx, @edx - the desired CPUID output for the specified function. All values must be in HEX, the 0x prefix is optional. Notes: - the file is only present on hosts that support CPUID faulting; - CPUID faulting is always enabled if it is supported; - CPUID output is overridden on all present CPUs; - the maximal number of entries one can override equals 16; - each write(2) to the file removes all existing entries before adding new ones, so the whole table must be written in one write(2); in particular, writing an empty line to the file removes all existing rules. Example: Suppose we want to mask out SSE2 (CPUID.01H:EDX:26) and RDTSCP (CPUID.80000001H:EDX:27). 
Then we should execute the following sequence: - get the current cpuid value: # cpuid -r | grep -e '^\s*0x0001' -e '^\s*0x8001' | head -n 2 0x0001 0x00: eax=0x000306e4 ebx=0x00200800 ecx=0x7fbee3ff edx=0xbfebfbff 0x8001 0x00: eax=0x ebx=0x ecx=0x0001 edx=0x2c100800 - clear the feature bits we want to mask out and write the result to /proc/vz/cpuid_override: # cat >/proc/vz/cpuid_override <https://jira.sw.ru/browse/PSBM-28682 Signed-off-by: Vladimir Davydov Acked-by: Cyrill Gorcunov = https://jira.sw.ru/browse/PSBM-33638 Signed-off-by: Vladimir Davydov Rebase: Signed-off-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-121823 [aryabinin: vz8 rebase] Signed-off-by: Andrey Ryabinin --- arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/asm/traps.h | 2 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/cpu/proc.c | 4 + arch/x86/kernel/cpuid_fault.c| 258 +++ arch/x86/kernel/traps.c | 24 +++ 6 files changed, 290 insertions(+) create mode 100644 arch/x86/kernel/cpuid_fault.c diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 6a21c227775c..9668ec6a064d 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -114,6 +114,7 @@ #define MSR_IA32_BBL_CR_CTL0x0119 #define MSR_IA32_BBL_CR_CTL3 0x011e +#define MSR_MISC_FEATURES_ENABLES 0x0140 #define MSR_IA32_TSX_CTRL 0x0122 #define TSX_CTRL_RTM_DISABLE BIT(0) /* Disable RTM feature */ diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h index 0ae298ea01a1..0282c81719e7 100644 --- a/arch/x86/include/asm/traps.h +++ b/arch/x86/include/asm/traps.h @@ -124,6 +124,8 @@ void __noreturn handle_stack_overflow(const char *message, unsigned long fault_address); #endif +void do_cpuid_fault(struct pt_regs *); + /* Interrupts/Exceptions */ enum { X86_TRAP_DE = 0,/* 0, Divide-by-zero */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 431d8c6e641d..b9451b653b04 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ 
-63,6 +63,7 @@ obj-y += pci-iommu_table.o obj-y += resource.o obj-y += irqflags.o obj-y += spec_ctrl.o +obj-y += cpuid_fault.o obj-y += process.o obj-y += fpu/ diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c index 2c8522a39ed5..d6b17a60acf6 100644 --- a/arch/x86/kernel/cpu/proc.c +++ b/arch/x86/kernel/cpu/proc.c @@ -54,6 +54,10 @@ static void show_cpuinfo_misc(struct seq_file *m, struct cpuinfo_x86 *c) } #endif +extern void __do_cpuid_fault(unsigned int op, unsigned int count, +unsigned int *eax, unsigned int *ebx, +unsigned int *ecx, unsigned int *edx); + static int show_cpuinfo(struct seq_file *m, void *v) { struct cpuinfo_x86 *c = v; diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c new file mode 100644 index ..339e2638c3b8 --- /dev/null +++ b/arch/x86/kernel/cpuid_fault.c @@ -0,0 +1,258 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct cpuid_override_entry { + unsigned int op; + unsigned int count; + bool has_count; + unsigned int eax; + unsigned int ebx; + unsigned i
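Each table entry written to /proc/vz/cpuid_override follows "op[ count]: eax ebx ecx edx", all hex with an optional 0x prefix. A sketch of parsing one such line with a two-format sscanf() fallback -- the struct matches the one declared in the patch, while `parse_override_entry` itself is an illustrative userspace helper, not the kernel's parser:

```c
#include <stdbool.h>
#include <stdio.h>

struct cpuid_override_entry {
    unsigned int op;
    unsigned int count;
    bool has_count;
    unsigned int eax, ebx, ecx, edx;
};

/* Parse "op[ count]: eax ebx ecx edx"; all values hex, "0x" optional.
 * Returns 0 on success, -1 on a malformed line. */
int parse_override_entry(const char *line, struct cpuid_override_entry *e)
{
    /* Try the 6-field form with an explicit count first... */
    if (sscanf(line, "%x %x: %x %x %x %x",
               &e->op, &e->count, &e->eax, &e->ebx, &e->ecx, &e->edx) == 6) {
        e->has_count = true;
        return 0;
    }
    /* ...then fall back to the 5-field form without a count; the first
     * format fails cleanly because ':' cannot start a %x conversion. */
    if (sscanf(line, "%x: %x %x %x %x",
               &e->op, &e->eax, &e->ebx, &e->ecx, &e->edx) == 5) {
        e->has_count = false;
        return 0;
    }
    return -1;
}
```

For example, "0x0000000d 1: ..." takes the first branch (count = 1 selects the ECX sub-leaf), while "0x80000001: ..." takes the second.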
[Devel] [PATCH vz8] userns: associate user_struct with the user_namespace
user_struct contains per-user counters like processes, files, sigpending etc which we wouldn't like to share across different namespaces. Make per-userns uid hastable instead of global. This is partial revert of the 7b44ab978b77a ("userns: Disassociate user_struct from the user_namespace.") Signed-off-by: Andrey Ryabinin --- include/linux/sched/user.h | 1 + include/linux/user_namespace.h | 4 kernel/user.c | 22 +- kernel/user_namespace.c| 13 + 4 files changed, 31 insertions(+), 9 deletions(-) diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h index 9a9536fd4fe3..4bf5a723f138 100644 --- a/include/linux/sched/user.h +++ b/include/linux/sched/user.h @@ -60,6 +60,7 @@ extern struct user_struct *find_user(kuid_t); extern struct user_struct root_user; #define INIT_USER (_user) +extern struct user_struct * alloc_uid_ns(struct user_namespace *ns, kuid_t); /* per-UID process charging. */ extern struct user_struct * alloc_uid(kuid_t); diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index b004d5aeba1f..30493179b756 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -15,6 +15,9 @@ #define UID_GID_MAP_MAX_BASE_EXTENTS 5 #define UID_GID_MAP_MAX_EXTENTS 340 +#define UIDHASH_BITS (CONFIG_BASE_SMALL ? 3 : 7) +#define UIDHASH_SZ (1 << UIDHASH_BITS) + struct uid_gid_extent { u32 first; u32 lower_first; @@ -73,6 +76,7 @@ struct user_namespace { struct uid_gid_map gid_map; struct uid_gid_map projid_map; atomic_tcount; + struct hlist_head uidhash_table[UIDHASH_SZ]; struct user_namespace *parent; int level; kuid_t owner; diff --git a/kernel/user.c b/kernel/user.c index 0df9b1640b2a..f9f540484499 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -8,6 +8,7 @@ * able to have per-user limits for system resources. */ +#include #include #include #include @@ -74,14 +75,11 @@ EXPORT_SYMBOL_GPL(init_user_ns); * when changing user ID's (ie setuid() and friends). */ -#define UIDHASH_BITS (CONFIG_BASE_SMALL ? 
3 : 7) -#define UIDHASH_SZ (1 << UIDHASH_BITS) #define UIDHASH_MASK (UIDHASH_SZ - 1) #define __uidhashfn(uid) (((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK) -#define uidhashentry(uid) (uidhash_table + __uidhashfn((__kuid_val(uid +#define uidhashentry(ns, uid) ((ns)->uidhash_table + __uidhashfn((__kuid_val(uid static struct kmem_cache *uid_cachep; -struct hlist_head uidhash_table[UIDHASH_SZ]; /* * The uidhash_lock is mostly taken from process context, but it is @@ -155,9 +153,10 @@ struct user_struct *find_user(kuid_t uid) { struct user_struct *ret; unsigned long flags; + struct user_namespace *ns = current_user_ns(); spin_lock_irqsave(_lock, flags); - ret = uid_hash_find(uid, uidhashentry(uid)); + ret = uid_hash_find(uid, uidhashentry(ns, uid)); spin_unlock_irqrestore(_lock, flags); return ret; } @@ -173,9 +172,9 @@ void free_uid(struct user_struct *up) free_user(up, flags); } -struct user_struct *alloc_uid(kuid_t uid) +struct user_struct *alloc_uid_ns(struct user_namespace *ns, kuid_t uid) { - struct hlist_head *hashent = uidhashentry(uid); + struct hlist_head *hashent = uidhashentry(ns, uid); struct user_struct *up, *new; spin_lock_irq(_lock); @@ -215,6 +214,11 @@ struct user_struct *alloc_uid(kuid_t uid) return NULL; } +struct user_struct *alloc_uid(kuid_t uid) +{ + return alloc_uid_ns(current_user_ns(), uid); +} + static int __init uid_cache_init(void) { int n; @@ -223,11 +227,11 @@ static int __init uid_cache_init(void) 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); for(n = 0; n < UIDHASH_SZ; ++n) - INIT_HLIST_HEAD(uidhash_table + n); + INIT_HLIST_HEAD(init_user_ns.uidhash_table + n); /* Insert the root user immediately (init already runs as root) */ spin_lock_irq(_lock); - uid_hash_insert(_user, uidhashentry(GLOBAL_ROOT_UID)); + uid_hash_insert(_user, uidhashentry(_user_ns, GLOBAL_ROOT_UID)); spin_unlock_irq(_lock); return 0; diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 243fb390d744..459b88044c62 100644 --- a/kernel/user_namespace.c +++ 
b/kernel/user_namespace.c @@ -74,6 +74,7 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) int create_user_ns(struct cred *new) { struct user_namespace *ns, *parent_ns = new->user_ns; + struct user_struct *new_user; kuid_t owner = new->euid; kgid_t group = new->egid; struct ucounts *ucounts; @@ -116,6 +117,17 @@ int create_user_ns(s
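The per-namespace change keeps the original hash function; each user_namespace simply carries its own array of UIDHASH_SZ buckets. The bucket math from the patch, lifted into a standalone sketch (UIDHASH_BITS fixed at the CONFIG_BASE_SMALL=0 value of 7):

```c
#define UIDHASH_BITS 7                    /* CONFIG_BASE_SMALL=0 case */
#define UIDHASH_SZ   (1 << UIDHASH_BITS)  /* 128 buckets */
#define UIDHASH_MASK (UIDHASH_SZ - 1)

/* Fold a uid onto UIDHASH_SZ buckets, adding the bits above the mask
 * into the low bits before masking -- the kernel's __uidhashfn(). */
unsigned int uidhashfn(unsigned int uid)
{
    return ((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK;
}
```

With the patch applied, uidhashentry(ns, uid) indexes ns->uidhash_table with this value, so identical uids in different namespaces land in different tables rather than sharing one global user_struct.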
[Devel] [PATCH vz8 2/2] ve/fs/devmnt: process mount options
From: Kirill Tkhai Port patch diff-ve-fs-process-mount-options-check-and-insert by Maxim Patlasov: The patch implements two kinds of processing mount options: check and insert. Check is OK if and only if each option supplied by CT-user is present among options listed in allowed_options. Insert transforms mount options supplied by CT-user like this: = + Check is performed both for mount and remount. Insert - only for mount. All this happens only for mount/remount inside CT and if proper ve_devmnt struct is found in ve->devmnt_list (searched by 'dev'). https://jira.sw.ru/browse/PSBM-32273 Signed-off-by: Kirill Tkhai Acked-by: Maxim Patlasov +++ ve/fs/devmnt: allow more than one mount option inside a CT strsep() changes provided string: puts '\0' instead of separators, thus after successful call to ve_devmnt_check() we insert only first provided mount options, ignoring others. mFixes: bc4143b ("ve/fs/devmnt: process mount options") Found during implementation of https://jira.sw.ru/browse/PSBM-40075 Signed-off-by: Konstantin Khorenko Reviewed-by: Kirill Tkhai https://jira.sw.ru/browse/PSBM-108196 Signed-off-by: Andrey Ryabinin --- fs/namespace.c | 146 - fs/super.c | 16 + include/linux/fs.h | 2 + 3 files changed, 163 insertions(+), 1 deletion(-) diff --git a/fs/namespace.c b/fs/namespace.c index d355b5921d1e..c24ab7597a39 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -28,6 +28,8 @@ #include #include +#include + #include "pnode.h" #include "internal.h" @@ -2344,6 +2346,148 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags) return error; } +#ifdef CONFIG_VE +/* + * Returns first occurrence of needle in haystack separated by sep, + * or NULL if not found + */ +static char *strstr_separated(char *haystack, char *needle, char sep) +{ + int needle_len = strlen(needle); + + while (haystack) { + if (!strncmp(haystack, needle, needle_len) && + (haystack[needle_len] == 0 || /* end-of-line or */ +haystack[needle_len] == sep)) /* separator */ + return 
haystack; + + haystack = strchr(haystack, sep); + if (haystack) + haystack++; + } + + return NULL; +} + +static int ve_devmnt_check(char *options, char *allowed) +{ + char *p; + char *tmp_options; + + if (!options || !*options) + return 0; + + if (!allowed) + return -EPERM; + + /* strsep() changes provided string: puts '\0' instead of separators */ + tmp_options = kstrdup(options, GFP_KERNEL); + if (!tmp_options) + return -ENOMEM; + + while ((p = strsep(_options, ",")) != NULL) { + if (!*p) + continue; + + if (!strstr_separated(allowed, p, ',')) { + kfree(tmp_options); + return -EPERM; + } + } + + kfree(tmp_options); + return 0; +} + +static int ve_devmnt_insert(char *options, char *hidden) +{ + int options_len; + int hidden_len; + + if (!hidden) + return 0; + + if (!options) + return -EAGAIN; + + options_len = strlen(options); + hidden_len = strlen(hidden); + + if (hidden_len + options_len + 2 > PAGE_SIZE) + return -EPERM; + + memmove(options + hidden_len + 1, options, options_len); + memcpy(options, hidden, hidden_len); + + options[hidden_len] = ','; + options[hidden_len + options_len + 1] = 0; + + return 0; +} + +int ve_devmnt_process(struct ve_struct *ve, dev_t dev, void **data_pp, int remount) +{ + void *data = *data_pp; + struct ve_devmnt *devmnt; + int err; +again: + err = 1; + mutex_lock(>devmnt_mutex); + list_for_each_entry(devmnt, >devmnt_list, link) { + if (devmnt->dev == dev) { + err = ve_devmnt_check(data, devmnt->allowed_options); + + if (!err && !remount) + err = ve_devmnt_insert(data, devmnt->hidden_options); + + break; + } + } + mutex_unlock(>devmnt_mutex); + + switch (err) { + case -EAGAIN: + if (!(data = (void *)__get_free_page(GFP_KERNEL))) + return -ENOMEM; + *(char *)data = 0; /* the string must be zero-terminated */ + goto again; + case 1: + if (*data_pp) { + ve_printk(VE_LOG_BOTH, KERN_WARNING "VE%s: no allowed " + "mount options found for device %u:%u\n", + ve->ve_name, MAJOR(dev), MI
[Devel] [PATCH vz8 1/2] ve/devmnt: Introduce ve::devmnt list #PSBM-108196
From: Kirill Tkhai 1)Porting patch "ve: mount option list" by Maxim Patlasov: The patch adds new fields to ve_struct: devmnt_list and devmnt_mutex. devmnt_list is the head of list of ve_devmnt structs. Each host block device visible from CT can have no more than one struct ve_devmnt linked in ve->devmnt_list. If ve_devmnt is present, it can be found by 'dev' field. Each ve_devmnt struct may bear two strings: hidden and allowed options. hidden_options will be automatically added to CT-user-supplied mount options after checking allowed_options. Only options listed in allowed_options are allowed. devmnt_mutex is to protect operations on the list of ve_devmnt structs. 2)Porting patch "vecalls: VE_CONFIGURE_MOUNT_OPTIONS" by Maxim Patlasov. Reworking the interface using cgroups. Each CT now has a file: [ve_cgroup_mnt_pnt]/[CTID]/ve.mount_opts for configuring permittions for a block device. Below is permittions line example: "0 major:minor;1 balloon_ino=12,pfcache_csum,pfcache=/vz/pfcache;2 barrier=1" Here, major:minor is a device, '1' starts comma-separated list of hidden options, and '2' is allowed ones. https://jira.sw.ru/browse/PSBM-32273 Signed-off-by: Kirill Tkhai Acked-by: Maxim Patlasov +++ ve/cgroups: Align ve_cftypes assignments For readability sake. We've other aligned already. Signed-off-by: Cyrill Gorcunov Rebase: ktkhai@: Merged "ve: increase max length of ve.mount_opts string" ve/devmnt: Add a ability to show ve.mount_opts A user may want to see allowed mount options. This patch allows that. 
khorenko@: * by default ve cgroup is not visible from inside a CT * currently it's possible to mount ve cgroup inside a CT, but this is temporarily, we'll disable this in the scope of https://jira.sw.ru/browse/PSBM-34291 * this patch allows to see mount options via ve cgroup => after PSBM-34291 is fixed, mount options will be visible only from ve0 (host) * for host it's OK to see all hidden options Signed-off-by: Kirill Tkhai Rebase: ktkhai@: Merged "ve: Strip unset options in ve.mount_opts" [aryabinin: vz8 rebase] https://jira.sw.ru/browse/PSBM-108196 Signed-off-by: Andrey Ryabinin --- include/linux/ve.h | 11 +++ kernel/ve/ve.c | 175 + 2 files changed, 186 insertions(+) diff --git a/include/linux/ve.h b/include/linux/ve.h index 5b1962ff4c66..1b6317275ca2 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -96,6 +96,17 @@ struct ve_struct { #endif struct vdso_image *vdso_64; struct vdso_image *vdso_32; + + struct list_headdevmnt_list; + struct mutexdevmnt_mutex; +}; + +struct ve_devmnt { + struct list_headlink; + + dev_t dev; + char*allowed_options; + char*hidden_options; /* balloon_ino, etc. 
*/ }; #define VE_MEMINFO_DEFAULT 1 /* default behaviour */ diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index ac3dda55e9ae..935e13340051 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -9,6 +9,7 @@ * 've.c' helper file performing VE sub-system initialization */ +#include #include #include #include @@ -643,6 +644,8 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_ #ifdef CONFIG_COREDUMP strcpy(ve->core_pattern, "core"); #endif + INIT_LIST_HEAD(>devmnt_list); + mutex_init(>devmnt_mutex); return >css; @@ -687,10 +690,33 @@ static void ve_offline(struct cgroup_subsys_state *css) ve->ve_name = NULL; } +static void ve_devmnt_free(struct ve_devmnt *devmnt) +{ + if (!devmnt) + return; + + kfree(devmnt->allowed_options); + kfree(devmnt->hidden_options); + kfree(devmnt); +} + +static void free_ve_devmnts(struct ve_struct *ve) +{ + while (!list_empty(>devmnt_list)) { + struct ve_devmnt *devmnt; + + devmnt = list_first_entry(>devmnt_list, struct ve_devmnt, link); + list_del(>link); + ve_devmnt_free(devmnt); + } +} + static void ve_destroy(struct cgroup_subsys_state *css) { struct ve_struct *ve = css_to_ve(css); + free_ve_devmnts(ve); + kmapset_unlink(>sysfs_perms_key, _ve_perms_set); ve_log_destroy(ve); ve_free_vdso(ve); @@ -1085,6 +,148 @@ static u64 ve_netns_avail_nr_read(struct cgroup_subsys_state *css, struct cftype return atomic_read(_to_ve(css)->netns_avail_nr); } +static int ve_mount_opts_read(struct seq_file *sf, void *v) +{ + struct ve_struct *ve = css_to_ve(seq_css(sf)); + struct ve_devmnt *devmnt; + + if (ve_is_super(ve)) + return -ENODEV; + + mutex_lock(>devmnt_mutex); + list_for_each_entry(devmnt, >devmnt_list, link) { + dev_t dev = d
[Devel] [PATCH vz8 2/4] ia32: add 32-bit vdso virtualization.
Similarly to the 64-bit vdso, make 32-bit vdso mapping per-ve. This will allow per container modification of the linux version xin .note section of vdso and monotonic time. https://jira.sw.ru/browse/PSBM-121668 Signed-off-by: Andrey Ryabinin --- arch/x86/entry/vdso/vma.c| 4 ++-- arch/x86/kernel/process_64.c | 2 +- include/linux/ve.h | 1 + kernel/ve/ve.c | 35 +-- 4 files changed, 25 insertions(+), 17 deletions(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index c48deffc1473..538c6730f436 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -56,7 +56,7 @@ static void vdso_fix_landing(const struct vdso_image *image, struct vm_area_struct *new_vma) { #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION - if (in_ia32_syscall() && image == _image_32) { + if (in_ia32_syscall() && image == get_exec_env()->vdso_32) { struct pt_regs *regs = current_pt_regs(); unsigned long vdso_land = image->sym_int80_landing_pad; unsigned long old_land_addr = vdso_land + @@ -281,7 +281,7 @@ static int load_vdso32(void) if (vdso32_enabled != 1) /* Other values all mean "disabled" */ return 0; - return map_vdso(_image_32, 0); + return map_vdso(get_exec_env()->vdso_32, 0); } #endif diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index a010d4b9d126..22215141 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -686,7 +686,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2) # endif # if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION case ARCH_MAP_VDSO_32: - return prctl_map_vdso(_image_32, arg2); + return prctl_map_vdso(get_exec_env()->vdso_32, arg2); # endif case ARCH_MAP_VDSO_64: return prctl_map_vdso(get_exec_env()->vdso_64, arg2); diff --git a/include/linux/ve.h b/include/linux/ve.h index 0e85a4032c3a..5b1962ff4c66 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -95,6 +95,7 @@ struct ve_struct { struct cn_private *cn; #endif struct 
vdso_image *vdso_64; + struct vdso_image *vdso_32; }; #define VE_MEMINFO_DEFAULT 1 /* default behaviour */ diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 186deb3f88f4..03b8d126a0ed 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -58,6 +58,7 @@ struct ve_struct ve0 = { .netns_max_nr = INT_MAX, .meminfo_val= VE_MEMINFO_SYSTEM, .vdso_64= (struct vdso_image*)_image_64, + .vdso_32= (struct vdso_image*)_image_32, }; EXPORT_SYMBOL(ve0); @@ -540,13 +541,12 @@ static __u64 ve_setup_iptables_mask(__u64 init_mask) } #endif -static int copy_vdso(struct ve_struct *ve) +static int copy_vdso(struct vdso_image **vdso_dst, const struct vdso_image *vdso_src) { - const struct vdso_image *vdso_src = _image_64; struct vdso_image *vdso; void *vdso_data; - if (ve->vdso_64) + if (*vdso_dst) return 0; vdso = kmemdup(vdso_src, sizeof(*vdso), GFP_KERNEL); @@ -563,10 +563,22 @@ static int copy_vdso(struct ve_struct *ve) vdso->data = vdso_data; - ve->vdso_64 = vdso; + *vdso_dst = vdso; return 0; } +static void ve_free_vdso(struct ve_struct *ve) +{ + if (ve->vdso_64 && ve->vdso_64 != _image_64) { + kfree(ve->vdso_64->data); + kfree(ve->vdso_64); + } + if (ve->vdso_32 && ve->vdso_32 != _image_32) { + kfree(ve->vdso_32->data); + kfree(ve->vdso_32); + } +} + static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_css) { struct ve_struct *ve = @@ -592,7 +604,10 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_ if (err) goto err_log; - if (copy_vdso(ve)) + if (copy_vdso(>vdso_64, _image_64)) + goto err_vdso; + + if (copy_vdso(>vdso_32, _image_32)) goto err_vdso; ve->features = VE_FEATURES_DEF; @@ -619,6 +634,7 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_ return >css; err_vdso: + ve_free_vdso(ve); ve_log_destroy(ve); err_log: free_percpu(ve->sched_lat_ve.cur); @@ -658,15 +674,6 @@ static void ve_offline(struct cgroup_subsys_state *css) ve->ve_name = NULL; } -static void ve_free_vdso(struct 
ve_struct *ve) -{ - if (ve->vdso_64 == _image_64) - return; - - kfree(ve->vdso_64->data); - kfree(ve->vdso_6
[Devel] [PATCH vz8 3/4] ve: patch linux_version_code in vdso
On the write to ve.os_release file patch the linux_version_code in the .note section of vdso. https://jira.sw.ru/browse/PSBM-121668 Signed-off-by: Andrey Ryabinin --- arch/x86/entry/vdso/vdso-note.S | 2 ++ arch/x86/entry/vdso/vdso2c.c | 1 + arch/x86/entry/vdso/vdso32/note.S | 2 ++ arch/x86/include/asm/vdso.h | 1 + kernel/ve/ve.c| 7 +++ 5 files changed, 13 insertions(+) diff --git a/arch/x86/entry/vdso/vdso-note.S b/arch/x86/entry/vdso/vdso-note.S index 79a071e4357e..c0e6e65f9fec 100644 --- a/arch/x86/entry/vdso/vdso-note.S +++ b/arch/x86/entry/vdso/vdso-note.S @@ -7,6 +7,8 @@ #include #include + .globl linux_version_code ELFNOTE_START(Linux, 0, "a") +linux_version_code: .long LINUX_VERSION_CODE ELFNOTE_END diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 4674f58581a1..7fab0bd96ac1 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -109,6 +109,7 @@ struct vdso_sym required_syms[] = { {"__kernel_sigreturn", true}, {"__kernel_rt_sigreturn", true}, {"int80_landing_pad", true}, + {"linux_version_code", true}, }; __attribute__((format(printf, 1, 2))) __attribute__((noreturn)) diff --git a/arch/x86/entry/vdso/vdso32/note.S b/arch/x86/entry/vdso/vdso32/note.S index 9fd51f206314..096b62f14863 100644 --- a/arch/x86/entry/vdso/vdso32/note.S +++ b/arch/x86/entry/vdso/vdso32/note.S @@ -10,7 +10,9 @@ /* Ideally this would use UTS_NAME, but using a quoted string here doesn't work. Remember to change this when changing the kernel's name. 
*/ + .globl linux_version_code ELFNOTE_START(Linux, 0, "a") +linux_version_code: .long LINUX_VERSION_CODE ELFNOTE_END diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h index 27566e57e87d..92c7ac06828e 100644 --- a/arch/x86/include/asm/vdso.h +++ b/arch/x86/include/asm/vdso.h @@ -27,6 +27,7 @@ struct vdso_image { long sym___kernel_rt_sigreturn; long sym___kernel_vsyscall; long sym_int80_landing_pad; + long sym_linux_version_code; }; #ifdef CONFIG_X86_64 diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 03b8d126a0ed..98c2e7e3d2c6 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -954,6 +954,7 @@ static ssize_t ve_os_release_write(struct kernfs_open_file *of, char *buf, { struct cgroup_subsys_state *css = of_css(of); struct ve_struct *ve = css_to_ve(css); + int n1, n2, n3, new_version; char *release; int ret = 0; @@ -964,6 +965,12 @@ static ssize_t ve_os_release_write(struct kernfs_open_file *of, char *buf, goto up_opsem; } + if (sscanf(buf, "%d.%d.%d", , , ) == 3) { + new_version = ((n1 << 16) + (n2 << 8)) + n3; + *((int *)(ve->vdso_64->data + ve->vdso_64->sym_linux_version_code)) = new_version; + *((int *)(ve->vdso_32->data + ve->vdso_32->sym_linux_version_code)) = new_version; + } + down_write(_sem); release = ve->ve_ns->uts_ns->name.release; strncpy(release, buf, __NEW_UTS_LEN); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8 1/4] ve, x86_64: add per-ve vdso mapping.
Make vdso mapping per-ve. This will allow per container modification of the linux version in .note section of vdso and monotonic time. https://jira.sw.ru/browse/PSBM-121668 Signed-off-by: Andrey Ryabinin --- arch/x86/entry/vdso/vma.c| 3 ++- arch/x86/kernel/process_64.c | 2 +- include/linux/ve.h | 2 ++ kernel/ve/ve.c | 43 4 files changed, 48 insertions(+), 2 deletions(-) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index eb3d85f87884..c48deffc1473 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -291,7 +291,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) if (!vdso64_enabled) return 0; - return map_vdso_randomized(_image_64); + + return map_vdso_randomized(get_exec_env()->vdso_64); } #ifdef CONFIG_COMPAT diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index c1c8d66cbe70..a010d4b9d126 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -689,7 +689,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2) return prctl_map_vdso(_image_32, arg2); # endif case ARCH_MAP_VDSO_64: - return prctl_map_vdso(_image_64, arg2); + return prctl_map_vdso(get_exec_env()->vdso_64, arg2); #endif default: diff --git a/include/linux/ve.h b/include/linux/ve.h index ec7dc522ac1f..0e85a4032c3a 100644 --- a/include/linux/ve.h +++ b/include/linux/ve.h @@ -15,6 +15,7 @@ #include #include #include +#include struct nsproxy; struct veip_struct; @@ -93,6 +94,7 @@ struct ve_struct { #ifdef CONFIG_CONNECTOR struct cn_private *cn; #endif + struct vdso_image *vdso_64; }; #define VE_MEMINFO_DEFAULT 1 /* default behaviour */ diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index cc26d3b2fa9b..186deb3f88f4 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -57,6 +57,7 @@ struct ve_struct ve0 = { .netns_avail_nr = ATOMIC_INIT(INT_MAX), .netns_max_nr = INT_MAX, .meminfo_val= VE_MEMINFO_SYSTEM, + .vdso_64= (struct vdso_image*)_image_64, }; 
EXPORT_SYMBOL(ve0); @@ -539,6 +540,33 @@ static __u64 ve_setup_iptables_mask(__u64 init_mask) } #endif +static int copy_vdso(struct ve_struct *ve) +{ + const struct vdso_image *vdso_src = _image_64; + struct vdso_image *vdso; + void *vdso_data; + + if (ve->vdso_64) + return 0; + + vdso = kmemdup(vdso_src, sizeof(*vdso), GFP_KERNEL); + if (!vdso) + return -ENOMEM; + + vdso_data = kmalloc(vdso_src->size, GFP_KERNEL); + if (!vdso_data) { + kfree(vdso); + return -ENOMEM; + } + + memcpy(vdso_data, vdso_src->data, vdso_src->size); + + vdso->data = vdso_data; + + ve->vdso_64 = vdso; + return 0; +} + static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_css) { struct ve_struct *ve = @@ -564,6 +592,9 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_ if (err) goto err_log; + if (copy_vdso(ve)) + goto err_vdso; + ve->features = VE_FEATURES_DEF; ve->_randomize_va_space = ve0._randomize_va_space; @@ -587,6 +618,8 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_ return >css; +err_vdso: + ve_log_destroy(ve); err_log: free_percpu(ve->sched_lat_ve.cur); err_lat: @@ -625,12 +658,22 @@ static void ve_offline(struct cgroup_subsys_state *css) ve->ve_name = NULL; } +static void ve_free_vdso(struct ve_struct *ve) +{ + if (ve->vdso_64 == _image_64) + return; + + kfree(ve->vdso_64->data); + kfree(ve->vdso_64); +} + static void ve_destroy(struct cgroup_subsys_state *css) { struct ve_struct *ve = css_to_ve(css); kmapset_unlink(>sysfs_perms_key, _ve_perms_set); ve_log_destroy(ve); + ve_free_vdso(ve); #if IS_ENABLED(CONFIG_BINFMT_MISC) kfree(ve->binfmt_misc); #endif -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8 4/4] ve: add per-ve CLOCK_MONOTONIC time via __vdso_clock_gettime()
Make possible to read virtualized container's CLOCK_MONOTONIC time via __vclock_getttime(). Record containers start time in per-ve vdso and substruct it from the host's time on clock read. https://jira.sw.ru/browse/PSBM-121668 Signed-off-by: Andrey Ryabinin --- arch/x86/entry/vdso/vclock_gettime.c | 27 +++ arch/x86/entry/vdso/vdso2c.c | 1 + arch/x86/include/asm/vdso.h | 1 + kernel/ve/ve.c | 14 ++ 4 files changed, 39 insertions(+), 4 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index e48ca3afa091..be1de6c4cafa 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -24,6 +24,8 @@ #define gtod ((vsyscall_gtod_data)) +u64 ve_start_time; + extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts); extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz); extern time_t __vdso_time(time_t *t); @@ -227,6 +229,21 @@ notrace static int __always_inline do_realtime(struct timespec *ts) return mode; } +static inline void timespec_sub_ns(struct timespec *ts, u64 ns) +{ + if ((s64)ns <= 0) { + ts->tv_sec += __iter_div_u64_rem(-ns, NSEC_PER_SEC, ); + ts->tv_nsec = ns; + } else { + ts->tv_sec -= __iter_div_u64_rem(ns, NSEC_PER_SEC, ); + if (ns) { + ts->tv_sec--; + ns = NSEC_PER_SEC - ns; + } + ts->tv_nsec = ns; + } +} + notrace static int __always_inline do_monotonic(struct timespec *ts) { unsigned long seq; @@ -242,9 +259,7 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) ns >>= gtod->shift; } while (unlikely(gtod_read_retry(gtod, seq))); - ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, ); - ts->tv_nsec = ns; - + timespec_sub_ns(ts, ve_start_time - ns); return mode; } @@ -260,12 +275,16 @@ notrace static void do_realtime_coarse(struct timespec *ts) notrace static void do_monotonic_coarse(struct timespec *ts) { + u64 ns; unsigned long seq; + do { seq = gtod_read_begin(gtod); ts->tv_sec = gtod->monotonic_time_coarse_sec; - 
ts->tv_nsec = gtod->monotonic_time_coarse_nsec; + ns = gtod->monotonic_time_coarse_nsec; } while (unlikely(gtod_read_retry(gtod, seq))); + + timespec_sub_ns(ts, ve_start_time - ns); } notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 7fab0bd96ac1..c76141e9ca16 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -110,6 +110,7 @@ struct vdso_sym required_syms[] = { {"__kernel_rt_sigreturn", true}, {"int80_landing_pad", true}, {"linux_version_code", true}, + {"ve_start_time", true}, }; __attribute__((format(printf, 1, 2))) __attribute__((noreturn)) diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h index 92c7ac06828e..9c265f79a126 100644 --- a/arch/x86/include/asm/vdso.h +++ b/arch/x86/include/asm/vdso.h @@ -28,6 +28,7 @@ struct vdso_image { long sym___kernel_vsyscall; long sym_int80_landing_pad; long sym_linux_version_code; + long sym_ve_start_time; }; #ifdef CONFIG_X86_64 diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c index 98c2e7e3d2c6..ac3dda55e9ae 100644 --- a/kernel/ve/ve.c +++ b/kernel/ve/ve.c @@ -374,6 +374,17 @@ static int ve_start_kthreadd(struct ve_struct *ve) return err; } +static void ve_set_vdso_time(struct ve_struct *ve, u64 time) +{ + u64 *vdso_start_time; + + vdso_start_time = ve->vdso_64->data + ve->vdso_64->sym_ve_start_time; + *vdso_start_time = time; + + vdso_start_time = ve->vdso_32->data + ve->vdso_32->sym_ve_start_time; + *vdso_start_time = time; +} + /* under ve->op_sem write-lock */ static int ve_start_container(struct ve_struct *ve) { @@ -408,6 +419,8 @@ static int ve_start_container(struct ve_struct *ve) if (ve->start_time == 0) { ve->start_time = tsk->start_time; ve->real_start_time = tsk->real_start_time; + + ve_set_vdso_time(ve, ve->start_time); } /* The value is wrong, but it is never compared to process * start times */ @@ -1028,6 +1041,7 @@ static ssize_t ve_ts_write(struct kernfs_open_file 
*of, char *buf, case VE_CF_CLOCK_MONOTONIC: now = ktime_get_ns(); target = >start_time; + ve_set_vdso_time(ve, now - delta_ns);
Re: [Devel] [PATCH rh8] mm/swap: activate swapped in pages on fault
On 10/19/20 7:32 PM, Konstantin Khorenko wrote: > From: Andrey Ryabinin > > Move swapped in anon pages directly to active list. This should > help us to prevent anon thrashing. Recently swapped in pages > have more chances to stay in memory. > > https://pmc.acronis.com/browse/VSTOR-20859 > Signed-off-by: Andrey Ryabinin > [VvS RHEL7.8 rebase] context changes > > (cherry picked from vz7 commit 134cd9b20a914080539e6310f76fe3f7b32bc710) > Signed-off-by: Konstantin Khorenko Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh8] ve: Virtualize /proc/swaps to watch from inside CT
On 10/19/20 5:27 PM, Konstantin Khorenko wrote: > From: Kirill Tkhai > > Customize /proc/swaps when showing from !ve_is_super. > Extracted from "Initial patch". > > Signed-off-by: Kirill Tkhai > > (cherry picked from vz7 commit 88c087f1fdb4b0f7934804269df36035ab6b83eb) > Signed-off-by: Konstantin Khorenko Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] ms/aio: Kill aio_rw_vect_retry()
From: Kent Overstreet This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: Kent Overstreet Cc: Zach Brown Cc: Felipe Balbi Cc: Greg Kroah-Hartman Cc: Mark Fasheh Cc: Joel Becker Cc: Rusty Russell Cc: Jens Axboe Cc: Asai Thambi S P Cc: Selvan Mani Cc: Sam Bradshaw Cc: Jeff Moyer Cc: Al Viro Cc: Benjamin LaHaise Signed-off-by: Benjamin LaHaise https://jira.sw.ru/browse/PSBM-121197 (cherry picked from commit 73a7075e3f6ec63dc359064eea6fd84f406cf2a5) Signed-off-by: Andrey Ryabinin --- drivers/staging/android/logger.c | 2 +- drivers/usb/gadget/inode.c | 6 +-- fs/aio.c | 92 +++- fs/block_dev.c | 2 +- fs/nfs/direct.c | 1 - fs/ocfs2/file.c | 6 +-- fs/read_write.c | 3 -- fs/udf/file.c| 2 +- include/linux/aio.h | 2 - mm/page_io.c | 1 - net/socket.c | 2 +- 11 files changed, 28 insertions(+), 91 deletions(-) diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c index 34519ea14b54..16a6c3179625 100644 --- a/drivers/staging/android/logger.c +++ b/drivers/staging/android/logger.c @@ -481,7 +481,7 @@ static ssize_t logger_aio_write(struct kiocb *iocb, const struct iovec *iov, header.sec = now.tv_sec; header.nsec = now.tv_nsec; header.euid = current_euid(); - header.len = min_t(size_t, iocb->ki_left, LOGGER_ENTRY_MAX_PAYLOAD); + header.len = min_t(size_t, iocb->ki_nbytes, LOGGER_ENTRY_MAX_PAYLOAD); header.hdr_size = sizeof(struct logger_entry); /* null writes succeed, return zero */ diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c index 570c005062ab..09aae3c48d2c 100644 --- a/drivers/usb/gadget/inode.c +++ b/drivers/usb/gadget/inode.c @@ -709,11 +709,11 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov, if (unlikely(usb_endpoint_dir_in(>desc))) return -EINVAL; - 
buf = kmalloc(iocb->ki_left, GFP_KERNEL); + buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; - return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs); + return ep_aio_rwtail(iocb, buf, iocb->ki_nbytes, epdata, iov, nr_segs); } static ssize_t @@ -728,7 +728,7 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov, if (unlikely(!usb_endpoint_dir_in(>desc))) return -EINVAL; - buf = kmalloc(iocb->ki_left, GFP_KERNEL); + buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; diff --git a/fs/aio.c b/fs/aio.c index c7e23a5832aa..f1b27fc5defb 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -879,7 +879,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx) if (unlikely(!req)) goto out_put; - atomic_set(>ki_users, 2); + atomic_set(>ki_users, 1); req->ki_ctx = ctx; return req; @@ -1279,75 +1279,9 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx) return -EINVAL; } -static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret) -{ - struct iovec *iov = >ki_iovec[iocb->ki_cur_seg]; - - BUG_ON(ret <= 0); - - while (iocb->ki_cur_seg < iocb->ki_nr_segs && ret > 0) { - ssize_t this = min((ssize_t)iov->iov_len, ret); - iov->iov_base += this; - iov->iov_len -= this; - iocb->ki_left -= this; - ret -= this; - if (iov->iov_len == 0) { - iocb->ki_cur_seg++; - iov++; - } - } - - /* the caller should not have done more io than what fit in -* the remaining iovecs */ - BUG_ON(ret > 0 && iocb->ki_left == 0); -} - typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *, unsigned long, loff_t); -static ssize_t aio_rw_vect_retry(struct kiocb *iocb, int rw, aio_rw_op *rw_op) -{ - struct file *file = iocb->ki_filp; - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; - ssize_t ret = 0; - - /* This matches the pread()/pwrite() logic */ - if (iocb->ki_pos < 0) - return -EINVAL; - - if (rw == WRITE) - file_start_write(file); - do { - ret = rw_op(iocb, >ki_iovec[iocb->ki_cur_seg], 
- iocb->ki_nr_segs - iocb-&g
[Devel] [PATCH vz8] mm/memcg: Use per-cpu stock charges for ->kmem and ->cache counters
Currently we use per-cpu stocks to do precharges of the ->memory and ->memsw counters. Do this for the ->kmem and ->cache as well to decrease contention on these counters as well. https://jira.sw.ru/browse/PSBM-101300 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 75 + 1 file changed, 51 insertions(+), 24 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 134cb27307f2..b3f97309ca39 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2023,6 +2023,8 @@ EXPORT_SYMBOL(unlock_page_memcg); struct memcg_stock_pcp { struct mem_cgroup *cached; /* this never be root cgroup */ unsigned int nr_pages; + unsigned int cache_nr_pages; + unsigned int kmem_nr_pages; struct work_struct work; unsigned long flags; #define FLUSHING_CACHED_CHARGE 0 @@ -2041,7 +2043,8 @@ static DEFINE_MUTEX(percpu_charge_mutex); * * returns true if successful, false otherwise. */ -static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) +static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, + bool cache, bool kmem) { struct memcg_stock_pcp *stock; unsigned long flags; @@ -2053,9 +2056,19 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) local_irq_save(flags); stock = this_cpu_ptr(_stock); - if (memcg == stock->cached && stock->nr_pages >= nr_pages) { - stock->nr_pages -= nr_pages; - ret = true; + if (memcg == stock->cached) { + if (cache && stock->cache_nr_pages >= nr_pages) { + stock->cache_nr_pages -= nr_pages; + ret = true; + } + if (kmem && stock->kmem_nr_pages >= nr_pages) { + stock->kmem_nr_pages -= nr_pages; + ret = true; + } + if (!cache && !kmem && stock->nr_pages >= nr_pages) { + stock->nr_pages -= nr_pages; + ret = true; + } } local_irq_restore(flags); @@ -2069,13 +2082,21 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages) static void drain_stock(struct memcg_stock_pcp *stock) { struct mem_cgroup *old = stock->cached; + unsigned long nr_pages = stock->nr_pages + 
stock->cache_nr_pages + stock->kmem_nr_pages; + + if (stock->cache_nr_pages) + page_counter_uncharge(>cache, stock->cache_nr_pages); + if (stock->kmem_nr_pages) + page_counter_uncharge(>kmem, stock->kmem_nr_pages); - if (stock->nr_pages) { - page_counter_uncharge(>memory, stock->nr_pages); + if (nr_pages) { + page_counter_uncharge(>memory, nr_pages); if (do_memsw_account()) - page_counter_uncharge(>memsw, stock->nr_pages); + page_counter_uncharge(>memsw, nr_pages); css_put_many(>css, stock->nr_pages); stock->nr_pages = 0; + stock->kmem_nr_pages = 0; + stock->cache_nr_pages = 0; } stock->cached = NULL; } @@ -2102,10 +2123,12 @@ static void drain_local_stock(struct work_struct *dummy) * Cache charges(val) to local per_cpu area. * This will be consumed by consume_stock() function, later. */ -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages, + bool cache, bool kmem) { struct memcg_stock_pcp *stock; unsigned long flags; + unsigned long stock_nr_pages; local_irq_save(flags); @@ -2114,9 +2137,17 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) drain_stock(stock); stock->cached = memcg; } - stock->nr_pages += nr_pages; - if (stock->nr_pages > MEMCG_CHARGE_BATCH) + if (cache) + stock->cache_nr_pages += nr_pages; + else if (kmem) + stock->kmem_nr_pages += nr_pages; + else + stock->nr_pages += nr_pages; + + stock_nr_pages = stock->nr_pages + stock->cache_nr_pages + + stock->kmem_nr_pages; + if (nr_pages > MEMCG_CHARGE_BATCH) drain_stock(stock); local_irq_restore(flags); @@ -2143,9 +2174,11 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) for_each_online_cpu(cpu) { struct memcg_stock_pcp *stock = _cpu(memcg_stock, cpu); struct mem_cgroup *memcg; + unsigned long nr_pages = stock->nr_pages + stock-&g
[Devel] [PATCH rh7] mm/memcg: optimize mem_cgroup_enough_memory()
mem_cgroup_enough_memory() iterates memcg's subtree to account 'MEM_CGROUP_STAT_CACHE - MEM_CGROUP_STAT_SHMEM'. Fortunately we can just read memcg->cache counter instead as it's hierarchical (includes subgroups) and doesn't account shmem. https://jira.sw.ru/browse/PSBM-120968 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6587cc2ef019..e36ad592b3c7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4721,11 +4721,7 @@ int mem_cgroup_enough_memory(struct mem_cgroup *memcg, long pages) free += page_counter_read(>dcache); /* assume file cache is reclaimable */ - free += mem_cgroup_recursive_stat2(memcg, MEM_CGROUP_STAT_CACHE); - - /* but do not count shmem pages as they can't be purged, -* only swapped out */ - free -= mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SHMEM); + free += page_counter_read(>cache); return free < pages ? -ENOMEM : 0; } -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8] kernel/cgroup: remove unnecessary cgroup_mutex lock.
Stopping container causes the lockdep to complain (see report bellow). We can avoid it simply by removing cgroup_mutex lock from cgroup_mark_ve_root(). I believe it's not needed there, it seems to be added just in case. WARNING: possible circular locking dependency detected 4.18.0-193.6.3.vz8.4.6+debug #1 Not tainted -- vzctl/36606 is trying to acquire lock: 88814b195ca0 (kn->count#338){}, at: kernfs_remove_by_name_ns+0x40/0x80 but task is already holding lock: 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (cgroup_mutex){+.+.}: __mutex_lock+0x163/0x13d0 cgroup_mark_ve_root+0x1d/0x2e0 ve_state_write+0xb81/0xdc0 cgroup_file_write+0x2da/0x7a0 kernfs_fop_write+0x255/0x410 vfs_write+0x157/0x460 ksys_write+0xb8/0x170 do_syscall_64+0xa5/0x4d0 entry_SYSCALL_64_after_hwframe+0x6a/0xdf -> #1 (>op_sem){}: down_write+0xa0/0x3d0 ve_state_write+0x6b/0xdc0 cgroup_file_write+0x2da/0x7a0 kernfs_fop_write+0x255/0x410 vfs_write+0x157/0x460 ksys_write+0xb8/0x170 do_syscall_64+0xa5/0x4d0 entry_SYSCALL_64_after_hwframe+0x6a/0xdf -> #0 (kn->count#338){}: __lock_acquire+0x22cb/0x48c0 lock_acquire+0x14f/0x3b0 __kernfs_remove+0x61e/0x810 kernfs_remove_by_name_ns+0x40/0x80 cgroup_addrm_files+0x531/0x940 css_clear_dir+0xfb/0x200 kill_css+0x8f/0x120 cgroup_destroy_locked+0x246/0x5e0 cgroup_rmdir+0x2f/0x2c0 kernfs_iop_rmdir+0x131/0x1b0 vfs_rmdir+0x142/0x3c0 do_rmdir+0x2b2/0x340 do_syscall_64+0xa5/0x4d0 entry_SYSCALL_64_after_hwframe+0x6a/0xdf other info that might help us debug this: Chain exists of: kn->count#338 --> >op_sem --> cgroup_mutex Possible unsafe locking scenario: CPU0CPU1 lock(cgroup_mutex); lock(>op_sem); lock(cgroup_mutex); lock(kn->count#338); *** DEADLOCK *** 4 locks held by vzctl/36606: #0: 88813c02c890 (sb_writers#7){.+.+}, at: mnt_want_write+0x3c/0xa0 #1: 88814414ad48 (>i_mutex_dir_key#5/1){+.+.}, at: do_rmdir+0x23c/0x340 #2: 88811d3054e8 
(>i_mutex_dir_key#5){}, at: vfs_rmdir+0xb6/0x3c0 #3: 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390 Call Trace: dump_stack+0x9a/0xf0 check_noncircular+0x317/0x3c0 __lock_acquire+0x22cb/0x48c0 lock_acquire+0x14f/0x3b0 __kernfs_remove+0x61e/0x810 kernfs_remove_by_name_ns+0x40/0x80 cgroup_addrm_files+0x531/0x940 css_clear_dir+0xfb/0x200 kill_css+0x8f/0x120 cgroup_destroy_locked+0x246/0x5e0 cgroup_rmdir+0x2f/0x2c0 kernfs_iop_rmdir+0x131/0x1b0 vfs_rmdir+0x142/0x3c0 do_rmdir+0x2b2/0x340 do_syscall_64+0xa5/0x4d0 entry_SYSCALL_64_after_hwframe+0x6a/0xdf https://jira.sw.ru/browse/PSBM-120670 Signed-off-by: Andrey Ryabinin --- kernel/cgroup/cgroup.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 8420f3547f1a..08137d43f3ab 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1883,7 +1883,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve) struct css_set *cset; struct cgroup *cgrp; - mutex_lock(_mutex); spin_lock_irq(_set_lock); rcu_read_lock(); @@ -1899,7 +1898,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve) rcu_read_unlock(); spin_unlock_irq(_set_lock); - mutex_unlock(_mutex); } static struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp) -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8] memcg: Fix missing memcg->cache charges during page migration
Since 44b7a8d33d66 ("mm: memcontrol: do not uncharge old page in page cache replacement") the mem_cgroup_migrate() charges newpage, but the ->cache charge is missing here. Add it to fix negative ->cache values, which leads to WARNING like bellow and softlockups. WARNING: CPU: 14 PID: 1372 at mm/page_counter.c:62 page_counter_cancel+0x26/0x30 Call Trace: page_counter_uncharge+0x1d/0x30 uncharge_batch+0x25c/0x2e0 mem_cgroup_uncharge_list+0x64/0x90 release_pages+0x33e/0x3c0 __pagevec_release+0x1b/0x40 truncate_inode_pages_range+0x358/0x8b0 ext4_evict_inode+0x167/0x580 [ext4] evict+0xd2/0x1a0 do_unlinkat+0x250/0x2e0 do_syscall_64+0x5b/0x1a0 entry_SYSCALL_64_after_hwframe+0x65/0xca https://jira.sw.ru/browse/PSBM-120653 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df70c3bdd444..134cb27307f2 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6867,6 +6867,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) page_counter_charge(>memory, nr_pages); if (do_memsw_account()) page_counter_charge(>memsw, nr_pages); + if (!PageAnon(newpage) && !PageSwapBacked(newpage)) + page_counter_charge(>cache, nr_pages); css_get_many(>css, nr_pages); commit_charge(newpage, memcg, false); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] vmscan: don't report reclaim progress if there was no progress.
On 10/9/20 10:22 AM, Vasily Averin wrote: > Andrey, > could you please clarify, is it required for vz8 too? > vz8 doesn't need this. This part was removed by commit 0a0337e0d1 ("mm, oom: rework oom detection") ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] mm/filemap: fix potential memcg->cache charge leak
On 10/9/20 10:14 AM, Vasily Averin wrote: > vz8 is affected too, please cherry-pick > vz7 commit 79a5642e9d9a6bdbb56d9e0ee990fd96b7c8625c > vz8 is not affected ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] mm/filemap: fix potential memcg->cache charge leak
__add_to_page_cache_locked() after mem_cgroup_try_charge_cache() uses mem_cgroup_cancel_charge() in one of the error paths. This may lead to leaking a few memcg->cache charges. Use mem_cgroup_cancel_cache_charge() to fix this. https://jira.sw.ru/browse/PSBM-121046 Signed-off-by: Andrey Ryabinin --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index 53db13f236da..2bd5ca4e7528 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -732,7 +732,7 @@ static int __add_to_page_cache_locked(struct page *page, error = radix_tree_maybe_preload(gfp_mask & GFP_RECLAIM_MASK); if (error) { if (!huge) - mem_cgroup_cancel_charge(page, memcg); + mem_cgroup_cancel_cache_charge(page, memcg); return error; } -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH rh7] tun: Silence allocation failure if user asked for too big header
On 10/6/20 11:17 AM, Konstantin Khorenko wrote: > On 10/05/2020 04:42 PM, Andrey Ryabinin wrote: >> Userspace may ask tun device to send packet with ridiculously >> big header and trigger this: >> >> [ cut here ] >> WARNING: CPU: 1 PID: 15366 at mm/page_alloc.c:3548 >> __alloc_pages_nodemask+0x537/0x1200 >> order 19 >= 11, gfp 0x2044d0 >> Call Trace: >> dump_stack+0x19/0x1b >> __warn+0x17f/0x1c0 >> warn_slowpath_fmt+0xad/0xe0 >> __alloc_pages_nodemask+0x537/0x1200 >> kmalloc_large_node+0x5f/0xd0 >> __kmalloc_node_track_caller+0x425/0x630 >> __kmalloc_reserve.isra.33+0x47/0xd0 >> __alloc_skb+0xdd/0x5f0 >> alloc_skb_with_frags+0x8f/0x540 >> sock_alloc_send_pskb+0x5e5/0x940 >> tun_get_user+0x38b/0x24a0 [tun] >> tun_chr_aio_write+0x13a/0x250 [tun] >> do_sync_readv_writev+0xdf/0x1c0 >> do_readv_writev+0x1a5/0x850 >> vfs_writev+0xba/0x190 >> SyS_writev+0x17c/0x340 >> system_call_fastpath+0x25/0x2a >> >> Just add __GFP_NOWARN and silently return -ENOMEM to fix this. >> >> https://jira.sw.ru/browse/PSBM-103639 >> Signed-off-by: Andrey Ryabinin >> --- >> drivers/net/tun.c | 4 ++-- >> include/net/sock.h | 7 +++ >> net/core/sock.c | 9 + >> 3 files changed, 18 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c >> index e95a89ba48b7..c0879c6a9703 100644 >> --- a/drivers/net/tun.c >> +++ b/drivers/net/tun.c >> @@ -1142,8 +1142,8 @@ static struct sk_buff *tun_alloc_skb(struct tun_file >> *tfile, >> if (prepad + len < PAGE_SIZE || !linear) >> linear = len; >> >> - skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock, >> - , 0); >> + skb = sock_alloc_send_pskb_flags(sk, prepad + linear, len - linear, >> noblock, >> + , 0, __GFP_NOWARN); > > May be __GFP_ORDER_NOWARN ? > __GFP_ORDER_NOWARN doesn't silence the WARN triggered here: if (order >= MAX_ORDER) { WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN)); return NULL; } ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] tun: Silence allocation failure if user asked for too big header
Userspace may ask tun device to send packet with ridiculously big header and trigger this: [ cut here ] WARNING: CPU: 1 PID: 15366 at mm/page_alloc.c:3548 __alloc_pages_nodemask+0x537/0x1200 order 19 >= 11, gfp 0x2044d0 Call Trace: dump_stack+0x19/0x1b __warn+0x17f/0x1c0 warn_slowpath_fmt+0xad/0xe0 __alloc_pages_nodemask+0x537/0x1200 kmalloc_large_node+0x5f/0xd0 __kmalloc_node_track_caller+0x425/0x630 __kmalloc_reserve.isra.33+0x47/0xd0 __alloc_skb+0xdd/0x5f0 alloc_skb_with_frags+0x8f/0x540 sock_alloc_send_pskb+0x5e5/0x940 tun_get_user+0x38b/0x24a0 [tun] tun_chr_aio_write+0x13a/0x250 [tun] do_sync_readv_writev+0xdf/0x1c0 do_readv_writev+0x1a5/0x850 vfs_writev+0xba/0x190 SyS_writev+0x17c/0x340 system_call_fastpath+0x25/0x2a Just add __GFP_NOWARN and silently return -ENOMEM to fix this. https://jira.sw.ru/browse/PSBM-103639 Signed-off-by: Andrey Ryabinin --- drivers/net/tun.c | 4 ++-- include/net/sock.h | 7 +++ net/core/sock.c| 9 + 3 files changed, 18 insertions(+), 2 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index e95a89ba48b7..c0879c6a9703 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1142,8 +1142,8 @@ static struct sk_buff *tun_alloc_skb(struct tun_file *tfile, if (prepad + len < PAGE_SIZE || !linear) linear = len; - skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock, - , 0); + skb = sock_alloc_send_pskb_flags(sk, prepad + linear, len - linear, noblock, + , 0, __GFP_NOWARN); if (!skb) return ERR_PTR(err); diff --git a/include/net/sock.h b/include/net/sock.h index 4136d2c3080c..1912d85ecc4d 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1626,6 +1626,13 @@ extern struct sk_buff *sock_alloc_send_pskb(struct sock *sk, int noblock, int *errcode, int max_page_order); +extern struct sk_buff *sock_alloc_send_pskb_flags(struct sock *sk, + unsigned long header_len, + unsigned long data_len, + int noblock, + int *errcode, + int max_page_order, + gfp_t extra_flags); extern void *sock_kmalloc(struct sock *sk, 
int size, gfp_t priority); extern void sock_kfree_s(struct sock *sk, void *mem, int size); diff --git a/net/core/sock.c b/net/core/sock.c index 508fc6093a26..07ea42f976cf 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1964,6 +1964,15 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len, } EXPORT_SYMBOL(sock_alloc_send_pskb); +struct sk_buff *sock_alloc_send_pskb_flags(struct sock *sk, unsigned long header_len, +unsigned long data_len, int noblock, +int *errcode, int max_page_order, gfp_t extra_flags) +{ + return __sock_alloc_send_pskb(sk, header_len, data_len, noblock, + errcode, max_page_order, extra_flags); +} +EXPORT_SYMBOL(sock_alloc_send_pskb_flags); + struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size, int noblock, int *errcode) { -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] vmscan: don't report reclaim progress if there was no progress.
__alloc_pages_slowpath relies on the direct reclaim and did_some_progress as an indicator that it makes sense to retry allocation rather than declaring OOM. shrink_zones checks whether all zones are reclaimable, and if shrink_zone didn't make any progress it prevents a premature OOM killer invocation by reporting progress. This might happen if the LRU is full of dirty or writeback pages and direct reclaim cannot clean those up. zone_reclaimable allows rescanning the reclaimable lists several times and restarting if a page is freed. This is really subtle behavior and it might lead to a livelock when a single freed page keeps the allocator looping but the current task will not be able to allocate that single page. The OOM killer would be more appropriate than looping without any progress for an unbounded amount of time. Report no progress even if zones are reclaimable, as OOM is more appropriate in that case. https://jira.sw.ru/browse/PSBM-104900 Signed-off-by: Andrey Ryabinin --- mm/vmscan.c | 24 1 file changed, 24 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 13ae9bd1e92e..85622f235e78 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2952,26 +2952,6 @@ static void snapshot_refaults(struct mem_cgroup *root_memcg, struct zone *zone) } while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL))); } -/* All zones in zonelist are unreclaimable? */ -static bool all_unreclaimable(struct zonelist *zonelist, - struct scan_control *sc) -{ - struct zoneref *z; - struct zone *zone; - - for_each_zone_zonelist_nodemask(zone, z, zonelist, - gfp_zone(sc->gfp_mask), sc->nodemask) { - if (!populated_zone(zone)) - continue; - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) - continue; - if (zone_reclaimable(zone)) - return false; - } - - return true; -} - static void shrink_tcrutches(struct scan_control *scan_ctrl) { int nid; @@ -3097,10 +3077,6 @@ out: goto retry; } - /* top priority shrink_zones still had more to do? 
don't OOM, then */ - if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc)) - return 1; - return 0; } -- 2.26.2
[Devel] [PATCH vz8] kernel/sched/fair: Fix 'releasing a pinned lock'
Lockdep complains that after rq_repin_lock() the lock wasn't unpinned before rq->lock release. [ cut here ] releasing a pinned lock WARNING: CPU: 0 PID: 24 at kernel/locking/lockdep.c:4271 lock_release+0x939/0xee0 Call Trace: _raw_spin_unlock+0x1c/0x30 load_balance+0x1472/0x2e30 pick_next_task_fair+0x62c/0x2300 __schedule+0x481/0x1600 schedule+0xbf/0x240 worker_thread+0x1d5/0xb50 kthread+0x30e/0x3d0 ret_from_fork+0x3a/0x50 Add an rq_unpin_lock() call to fix this. Also, for consistency, use 'busiest' instead of 'env.src_rq', which is the same. https://jira.sw.ru/browse/PSBM-120800 Signed-off-by: Andrey Ryabinin --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fc87dee4fd0e..23a2f2452474 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9178,9 +9178,10 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop = 0; local_irq_save(rf.flags); double_rq_lock(env.dst_rq, busiest); - rq_repin_lock(env.src_rq, &rf); + rq_repin_lock(busiest, &rf); update_rq_clock(env.dst_rq); cur_ld_moved = ld_moved = move_task_groups(&env); + rq_unpin_lock(busiest, &rf); double_rq_unlock(env.dst_rq, busiest); local_irq_restore(rf.flags); } -- 2.26.2
[Devel] [PATCH vz8] mm, memcg: add oom counter to memory.stat memcgroup file
Add oom counter to memory.stat file. oom shows the number of oom kills triggered due to the cgroup's memory limit. total_oom shows the total number of oom kills triggered due to the cgroup's and its sub-groups' memory limits. memory.stat in the root cgroup counts global oom kills. E.g: # mkdir /sys/fs/cgroup/memory/test/ # echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes # echo 100M > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes # echo $$ > /sys/fs/cgroup/memory/test/tasks # ./vm-scalability/usemem -O 200M # grep oom /sys/fs/cgroup/memory/test/memory.stat oom 1 total_oom 1 # echo -1 > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes # echo -1 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes # ./vm-scalability/usemem -O 1000G # grep oom /sys/fs/cgroup/memory/memory.stat oom 1 total_oom 2 https://jira.sw.ru/browse/PSBM-108287 Signed-off-by: Andrey Ryabinin --- include/linux/memcontrol.h | 2 ++ mm/memcontrol.c| 33 ++--- 2 files changed, 28 insertions(+), 7 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b097f137a3df..eb8634128a81 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -75,6 +75,8 @@ struct accumulated_stats { unsigned long stat[MEMCG_NR_STAT]; unsigned long events[NR_VM_EVENT_ITEMS]; unsigned long lru_pages[NR_LRU_LISTS]; + unsigned long oom; + unsigned long oom_kill; const unsigned int *stats_array; const unsigned int *events_array; int stats_size; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 37d4df653f39..ca3a07543416 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3144,6 +3144,8 @@ void accumulate_memcg_tree(struct mem_cgroup *memcg, for (i = 0; i < NR_LRU_LISTS; i++) acc->lru_pages[i] += mem_cgroup_nr_lru_pages(mi, BIT(i)); + acc->oom += atomic_long_read(&mi->memory_events[MEMCG_OOM]); + acc->oom_kill += atomic_long_read(&mi->memory_events[MEMCG_OOM_KILL]); cond_resched(); } @@ -3899,6 +3901,13 @@ static int memcg_stat_show(struct seq_file *m, void *v) 
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS); + memset(&acc, 0, sizeof(acc)); + acc.stats_size = ARRAY_SIZE(memcg1_stats); + acc.stats_array = memcg1_stats; + acc.events_size = ARRAY_SIZE(memcg1_events); + acc.events_array = memcg1_events; + accumulate_memcg_tree(memcg, &acc); + for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account()) continue; @@ -3911,6 +3920,18 @@ static int memcg_stat_show(struct seq_file *m, void *v) seq_printf(m, "%s %lu\n", memcg1_event_names[i], memcg_sum_events(memcg, memcg1_events[i])); + /* +* For root_mem_cgroup we want to account global ooms as well. +* The diff between all MEMCG_OOM_KILL and MEMCG_OOM events +* should give us the global ooms count. +*/ + if (memcg == root_mem_cgroup) + seq_printf(m, "oom %lu\n", acc.oom_kill - acc.oom + + atomic_long_read(&memcg->memory_events[MEMCG_OOM])); + else + seq_printf(m, "oom %lu\n", + atomic_long_read(&memcg->memory_events[MEMCG_OOM])); + for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i], mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE); @@ -3927,13 +3948,6 @@ static int memcg_stat_show(struct seq_file *m, void *v) seq_printf(m, "hierarchical_memsw_limit %llu\n", (u64)memsw * PAGE_SIZE); - memset(&acc, 0, sizeof(acc)); - acc.stats_size = ARRAY_SIZE(memcg1_stats); - acc.stats_array = memcg1_stats; - acc.events_size = ARRAY_SIZE(memcg1_events); - acc.events_array = memcg1_events; - accumulate_memcg_tree(memcg, &acc); - for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account()) continue; @@ -3945,6 +3959,11 @@ static int memcg_stat_show(struct seq_file *m, void *v) seq_printf(m, "total_%s %llu\n", memcg1_event_names[i], (u64)acc.events[i]); + if (memcg == root_mem_cgroup) + seq_printf(m, "total_oom %lu\n", acc.oom_kill); + else + seq_printf(m, "total_oom %lu\n", acc.oom); + for (i = 0; i < 
NR_LRU_LISTS; i++) seq_printf(m, "total_%s %llu\n"
[Devel] [PATCH vz8 2/2] mm/memcg: fix cache growth above cache.limit_in_bytes
Exceeding cache above cache.limit_in_bytes schedules high_work_func() which tries to reclaim 32 pages. If cache is generated fast enough, this allows the cgroup to steadily grow above cache.limit_in_bytes because we don't reclaim enough. Try to reclaim the exceeded amount of cache instead. https://jira.sw.ru/browse/PSBM-106384 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c30150b8732d..37d4df653f39 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2213,14 +2213,18 @@ static void reclaim_high(struct mem_cgroup *memcg, { do { + long cache_overused; if (page_counter_read(&memcg->memory) > memcg->high) { memcg_memory_event(memcg, MEMCG_HIGH); try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); } - if (page_counter_read(&memcg->cache) > memcg->cache.max) - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, false); + cache_overused = page_counter_read(&memcg->cache) - + memcg->cache.max; + + if (cache_overused > 0) + try_to_free_mem_cgroup_pages(memcg, cache_overused, gfp_mask, false); } while ((memcg = parent_mem_cgroup(memcg))); } -- 2.26.2
[Devel] [PATCH vz8 1/2] mm/memcg: reclaim memory.cache.limit_in_bytes from background
Reclaiming memory above memory.cache.limit_in_bytes always in direct reclaim mode adds too much of a cost for vstorage. Instead of reclaiming directly, allow memory.cache.limit_in_bytes to be exceeded and launch the reclaim in a background task. https://pmc.acronis.com/browse/VSTOR-24395 https://jira.sw.ru/browse/PSBM-94761 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 42 ++ 1 file changed, 18 insertions(+), 24 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 68242a72be4d..c30150b8732d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2211,11 +2211,16 @@ static void reclaim_high(struct mem_cgroup *memcg, unsigned int nr_pages, gfp_t gfp_mask) { + do { - if (page_counter_read(&memcg->memory) <= memcg->high) - continue; - memcg_memory_event(memcg, MEMCG_HIGH); - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); + + if (page_counter_read(&memcg->memory) > memcg->high) { + memcg_memory_event(memcg, MEMCG_HIGH); + try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); + } + + if (page_counter_read(&memcg->cache) > memcg->cache.max) + try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, false); } while ((memcg = parent_mem_cgroup(memcg))); } @@ -2270,13 +2275,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool kmem_charge refill_stock(memcg, nr_pages); goto charge; } - - if (cache_charge && !page_counter_try_charge( - &memcg->cache, nr_pages, &counter)) { - refill_stock(memcg, nr_pages); - goto charge; - } - return 0; + css_get_many(&memcg->css, batch); + goto done; } charge: @@ -2301,19 +2301,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool kmem_charge } } - if (!mem_over_limit && cache_charge) { - if (page_counter_try_charge(&memcg->cache, nr_pages, &counter)) - goto done_restock; - - may_swap = false; - mem_over_limit = mem_cgroup_from_counter(counter, cache); - page_counter_uncharge(&memcg->memory, batch); - if (do_memsw_account()) - page_counter_uncharge(&memcg->memsw, batch); - if (kmem_charge) - page_counter_uncharge(&memcg->kmem, nr_pages); - } - if 
(!mem_over_limit) goto done_restock; @@ -2437,6 +2424,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool kmem_charge css_get_many(>css, batch); if (batch > nr_pages) refill_stock(memcg, batch - nr_pages); +done: + if (cache_charge) + page_counter_charge(>cache, nr_pages); /* * If the hierarchy is above the normal consumption range, schedule @@ -2457,7 +2447,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask, bool kmem_charge current->memcg_nr_pages_over_high += batch; set_notify_resume(current); break; + } else if (page_counter_read(>cache) > memcg->cache.max) { + if (!work_pending(>high_work)) + schedule_work(>high_work); } + } while ((memcg = parent_mem_cgroup(memcg))); return 0; -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RH8] mm/tcache: restore missing rcu_read_lock() in tcache_detach_page()
On 10/2/20 5:13 PM, Evgenii Shatokhin wrote: > Looks like rcu_read_lock() was lost in "out:" path of tcache_detach_page() > when tcache was ported to VZ8. As a result, Syzkaller was able to hit > the following warning: > > WARNING: bad unlock balance detected! > 4.18.0-193.6.3.vz8.4.7.syz+debug #1 Tainted: GW-r- > - > - > vcmmd/926 is trying to release lock (rcu_read_lock) at: > [] tcache_detach_page+0x530/0x750 > but there are no more locks to release! > > other info that might help us debug this: > 2 locks held by vcmmd/926: >#0: 888036331f30 (>mmap_sem){}, at: __do_page_fault+0x157/0x550 >#1: 8880567295f8 (>i_mmap_sem){}, at: > ext4_filemap_fault+0x82/0xc0 [ext4] > > stack backtrace: > CPU: 0 PID: 926 Comm: vcmmd ve: / >Tainted: GW-r- - > 4.18.0-193.6.3.vz8.4.7.syz+debug #1 4.7 > Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014 > Call Trace: >dump_stack+0xd2/0x148 >print_unlock_imbalance_bug.cold.40+0xc8/0xd4 >lock_release+0x5e3/0x1360 >tcache_detach_page+0x559/0x750 >tcache_cleancache_get_page+0xe9/0x780 >__cleancache_get_page+0x212/0x320 >ext4_mpage_readpages+0x165d/0x1b90 [ext4] >ext4_readpages+0xd6/0x110 [ext4] >read_pages+0xff/0x5b0 >__do_page_cache_readahead+0x3fc/0x5b0 >filemap_fault+0x912/0x1b80 >ext4_filemap_fault+0x8a/0xc0 [ext4] >__do_fault+0x110/0x410 >do_fault+0x622/0x1010 >__handle_mm_fault+0x980/0x1120 >handle_mm_fault+0x17f/0x610 >__do_page_fault+0x25d/0x550 >do_page_fault+0x38/0x290 >do_async_page_fault+0x5b/0xe0 >async_page_fault+0x1e/0x30 > > Let us restore rcu_read_lock(). > > https://jira.sw.ru/browse/PSBM-120802 > Signed-off-by: Evgenii Shatokhin Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8 v2] kernel/sched/fair.c: Add more missing update_rq_clock() calls
Add update_rq_clock() for 'target_rq' to avoid WARN() coming from attach_task(). Also add rq_repin_lock(busiest, ); in load_balance() for detach_task(). The update_rq_clock() isn't necessary since it was updated before, but we need the repin since rq lock was released after update. https://jira.sw.ru/browse/PSBM-108013 Reported-by: Kirill Tkhai Signed-off-by: Andrey Ryabinin --- kernel/sched/fair.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e6dc21d5fa03..fc87dee4fd0e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7817,6 +7817,7 @@ static int cpulimit_balance_cpu_stop(void *data) schedstat_inc(sd->clb_count); update_rq_clock(rq); + update_rq_clock(target_rq); if (do_cpulimit_balance()) schedstat_inc(sd->clb_pushed); else @@ -9177,6 +9178,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop = 0; local_irq_save(rf.flags); double_rq_lock(env.dst_rq, busiest); + rq_repin_lock(env.src_rq, ); update_rq_clock(env.dst_rq); cur_ld_moved = ld_moved = move_task_groups(); double_rq_unlock(env.dst_rq, busiest); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8] kernel/sched/fair.c: Add more missing update_rq_clock() calls
Add update_rq_clock() for 'target_rq' to avoid WARN() coming from attach_task(). Also add update_rq_clock(env.src_rq); in load_balance() for detach_task(). https://jira.sw.ru/browse/PSBM-108013 Reported-by: Kirill Tkhai Signed-off-by: Andrey Ryabinin --- kernel/sched/fair.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e6dc21d5fa03..99dcb9e77efd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7817,6 +7817,7 @@ static int cpulimit_balance_cpu_stop(void *data) schedstat_inc(sd->clb_count); update_rq_clock(rq); + update_rq_clock(target_rq); if (do_cpulimit_balance()) schedstat_inc(sd->clb_pushed); else @@ -9177,6 +9178,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop = 0; local_irq_save(rf.flags); double_rq_lock(env.dst_rq, busiest); + update_rq_clock(env.src_rq); update_rq_clock(env.dst_rq); cur_ld_moved = ld_moved = move_task_groups(); double_rq_unlock(env.dst_rq, busiest); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH v2 vz8] kernel/sched/fair.c: Add missing update_rq_clock() calls
On 9/29/20 11:24 AM, Kirill Tkhai wrote: > On 28.09.2020 15:03, Andrey Ryabinin wrote: >> We've got a hard lockup which seems to be caused by mgag200 >> console printk code calling to schedule_work from scheduler >> with rq->lock held: >> #5 [b79e034239a8] native_queued_spin_lock_slowpath at 8b50c6c6 >> #6 [b79e034239a8] _raw_spin_lock at 8bc96e5c >> #7 [b79e034239b0] try_to_wake_up at 8b4e26ff >> #8 [b79e03423a10] __queue_work at 8b4ce3f3 >> #9 [b79e03423a58] queue_work_on at 8b4ce714 >> #10 [b79e03423a68] mga_imageblit at c026d666 [mgag200] >> #11 [b79e03423a80] soft_cursor at 8b8a9d84 >> #12 [b79e03423ad8] bit_cursor at 8b8a99b2 >> #13 [b79e03423ba0] hide_cursor at 8b93bc7a >> #14 [b79e03423bb0] vt_console_print at 8b93e07d >> #15 [b79e03423c18] console_unlock at 8b518f0e >> #16 [b79e03423c68] vprintk_emit_log at 8b51acf7 >> #17 [b79e03423cc0] vprintk_default at 8b51adcd >> #18 [b79e03423cd0] printk at 8b51b3d6 >> #19 [b79e03423d30] __warn_printk at 8b4b13a0 >> #20 [b79e03423d98] assert_clock_updated at 8b4dd293 >> #21 [b79e03423da0] deactivate_task at 8b4e12d1 >> #22 [b79e03423dc8] move_task_group at 8b4eaa5b >> #23 [b79e03423e00] cpulimit_balance_cpu_stop at 8b4f02f3 >> #24 [b79e03423eb0] cpu_stopper_thread at 8b576b67 >> #25 [b79e03423ee8] smpboot_thread_fn at 8b4d9125 >> #26 [b79e03423f10] kthread at 8b4d4fc2 >> #27 [b79e03423f50] ret_from_fork at 8be00255 >> >> The printk called because assert_clock_updated() triggered >> SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP); >> >> This means that we missing necessary update_rq_clock() call. >> Add one to cpulimit_balance_cpu_stop() to fix the warning. >> Also add one in load_balance() before move_task_groups() call. >> It seems to be another place missing this call. 
>> >> https://jira.sw.ru/browse/PSBM-108013 >> Signed-off-by: Andrey Ryabinin >> --- >> kernel/sched/fair.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 5d3556b15e70..e6dc21d5fa03 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -7816,6 +7816,7 @@ static int cpulimit_balance_cpu_stop(void *data) >> >> schedstat_inc(sd->clb_count); >> >> +update_rq_clock(rq); > > Shouldn't we also add the same for target_rq to avoid WARN() coming from > attach_task()? > It seems like we should. ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH v2 vz8] kernel/sched/fair.c: Add missing update_rq_clock() calls
We've got a hard lockup which seems to be caused by mgag200 console printk code calling to schedule_work from scheduler with rq->lock held: #5 [b79e034239a8] native_queued_spin_lock_slowpath at 8b50c6c6 #6 [b79e034239a8] _raw_spin_lock at 8bc96e5c #7 [b79e034239b0] try_to_wake_up at 8b4e26ff #8 [b79e03423a10] __queue_work at 8b4ce3f3 #9 [b79e03423a58] queue_work_on at 8b4ce714 #10 [b79e03423a68] mga_imageblit at c026d666 [mgag200] #11 [b79e03423a80] soft_cursor at 8b8a9d84 #12 [b79e03423ad8] bit_cursor at 8b8a99b2 #13 [b79e03423ba0] hide_cursor at 8b93bc7a #14 [b79e03423bb0] vt_console_print at 8b93e07d #15 [b79e03423c18] console_unlock at 8b518f0e #16 [b79e03423c68] vprintk_emit_log at 8b51acf7 #17 [b79e03423cc0] vprintk_default at 8b51adcd #18 [b79e03423cd0] printk at 8b51b3d6 #19 [b79e03423d30] __warn_printk at 8b4b13a0 #20 [b79e03423d98] assert_clock_updated at 8b4dd293 #21 [b79e03423da0] deactivate_task at 8b4e12d1 #22 [b79e03423dc8] move_task_group at 8b4eaa5b #23 [b79e03423e00] cpulimit_balance_cpu_stop at 8b4f02f3 #24 [b79e03423eb0] cpu_stopper_thread at 8b576b67 #25 [b79e03423ee8] smpboot_thread_fn at 8b4d9125 #26 [b79e03423f10] kthread at 8b4d4fc2 #27 [b79e03423f50] ret_from_fork at 8be00255 The printk called because assert_clock_updated() triggered SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP); This means that we missing necessary update_rq_clock() call. Add one to cpulimit_balance_cpu_stop() to fix the warning. Also add one in load_balance() before move_task_groups() call. It seems to be another place missing this call. 
https://jira.sw.ru/browse/PSBM-108013 Signed-off-by: Andrey Ryabinin --- kernel/sched/fair.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5d3556b15e70..e6dc21d5fa03 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7816,6 +7816,7 @@ static int cpulimit_balance_cpu_stop(void *data) schedstat_inc(sd->clb_count); + update_rq_clock(rq); if (do_cpulimit_balance()) schedstat_inc(sd->clb_pushed); else @@ -9176,6 +9177,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop = 0; local_irq_save(rf.flags); double_rq_lock(env.dst_rq, busiest); + update_rq_clock(env.dst_rq); cur_ld_moved = ld_moved = move_task_groups(); double_rq_unlock(env.dst_rq, busiest); local_irq_restore(rf.flags); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8] kernel/sched/fair.c: Add missing update_rq_clock() calls
We've got a hard lockup which seems to be caused by mgag200 console printk code calling to schedule_work from scheduler with rq->lock held: #5 [b79e034239a8] native_queued_spin_lock_slowpath at 8b50c6c6 #6 [b79e034239a8] _raw_spin_lock at 8bc96e5c #7 [b79e034239b0] try_to_wake_up at 8b4e26ff #8 [b79e03423a10] __queue_work at 8b4ce3f3 #9 [b79e03423a58] queue_work_on at 8b4ce714 The printk called because assert_clock_updated() triggered SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP); This means that we missing necessary update_rq_clock() call. Add one to cpulimit_balance_cpu_stop() to fix the warning. Also add one in load_balance() before move_task_groups() call. It seems to be another place missing this call. https://jira.sw.ru/browse/PSBM-108013 Signed-off-by: Andrey Ryabinin --- kernel/sched/fair.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5d3556b15e70..e6dc21d5fa03 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7816,6 +7816,7 @@ static int cpulimit_balance_cpu_stop(void *data) schedstat_inc(sd->clb_count); + update_rq_clock(rq); if (do_cpulimit_balance()) schedstat_inc(sd->clb_pushed); else @@ -9176,6 +9177,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop = 0; local_irq_save(rf.flags); double_rq_lock(env.dst_rq, busiest); + update_rq_clock(env.dst_rq); cur_ld_moved = ld_moved = move_task_groups(); double_rq_unlock(env.dst_rq, busiest); local_irq_restore(rf.flags); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] keys, user: fix NULL-ptr dereference in user_destroy() #PSBM-108198
key->payload.data could be NULL BUG: unable to handle kernel NULL pointer dereference at 0010 IP: user_destroy+0x13/0x30 Call Trace: key_gc_unused_keys.constprop.1+0xfd/0x110 key_garbage_collector+0x1d7/0x390 process_one_work+0x185/0x440 worker_thread+0x126/0x3c0 kthread+0xd1/0xe0 ret_from_fork_nospec_begin+0x7/0x21 Add the necessary check to fix this. https://jira.sw.ru/browse/PSBM-108198 Fixes: 499126f3b029 ("keys, user: Fix high order allocation in user_instantiate()") Signed-off-by: Andrey Ryabinin --- security/keys/user_defined.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/security/keys/user_defined.c b/security/keys/user_defined.c index b13d70b69069..c3196db50e30 100644 --- a/security/keys/user_defined.c +++ b/security/keys/user_defined.c @@ -184,8 +184,10 @@ void user_destroy(struct key *key) { struct user_key_payload *upayload = key->payload.data; - memset(upayload, 0, sizeof(*upayload) + upayload->datalen); - kvfree(upayload); + if (upayload) { + memset(upayload, 0, sizeof(*upayload) + upayload->datalen); + kvfree(upayload); + } } EXPORT_SYMBOL_GPL(user_destroy); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7] mm, memcg: add oom counter to memory.stat memcgroup file #PSBM-107731
Add oom counter to memory.stat file. oom shows the number of oom kills triggered due to the cgroup's memory limit. total_oom shows the total number of oom kills triggered due to the cgroup's and its sub-groups' memory limits. memory.stat in the root cgroup counts global oom kills. E.g: # mkdir /sys/fs/cgroup/memory/test/ # echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes # echo 100M > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes # echo $$ > /sys/fs/cgroup/memory/test/tasks # ./vm-scalability/usemem -O 200M # grep oom /sys/fs/cgroup/memory/test/memory.stat oom 1 total_oom 1 # echo -1 > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes # echo -1 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes # ./vm-scalability/usemem -O 1000G # grep oom /sys/fs/cgroup/memory/memory.stat oom 1 total_oom 2 https://jira.sw.ru/browse/PSBM-107731 Signed-off-by: Andrey Ryabinin --- mm/memcontrol.c | 9 + 1 file changed, 9 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6587cc2ef019..fe06c7db2ad3 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -400,6 +400,7 @@ struct mem_cgroup { struct mem_cgroup_stat_cpu __percpu *stat; struct mem_cgroup_stat2_cpu stat2; spinlock_t pcp_counter_lock; + atomic_long_t oom; atomic_t dead_count; #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) @@ -2005,6 +2006,7 @@ void mem_cgroup_note_oom_kill(struct mem_cgroup *root_memcg, if (memcg == root_memcg) break; } + atomic_long_inc(&root_memcg->oom); if (memcg_to_put) css_put(&memcg_to_put->css); @@ -5691,6 +5693,7 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft, for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) seq_printf(m, "%s %lu\n", mem_cgroup_events_names[i], mem_cgroup_read_events(memcg, i)); + seq_printf(m, "oom %lu\n", atomic_long_read(&memcg->oom)); for (i = 0; i < NR_LRU_LISTS; i++) seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i], @@ -5733,6 +5736,12 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft, seq_printf(m, "total_%s %llu\n", 
mem_cgroup_events_names[i], val); } + { + unsigned long val = 0; + for_each_mem_cgroup_tree(mi, memcg) + val += atomic_long_read(&mi->oom); + seq_printf(m, "total_oom %lu\n", val); + } for (i = 0; i < NR_LRU_LISTS; i++) { unsigned long long val = 0; -- 2.26.2
[Devel] [PATCH RH7 v2 4/4] bcache: fix cache_set_flush() NULL pointer dereference on OOM #PSBM-106785
From: Eric Wheeler When bch_cache_set_alloc() fails to kzalloc the cache_set, the asynchronous closure handling tries to dereference a cache_set that hadn't yet been allocated inside of cache_set_flush(), which is called by __cache_set_unregister() during cleanup. This appears to happen only during an OOM condition on bcache_register. Signed-off-by: Eric Wheeler Cc: sta...@vger.kernel.org https://jira.sw.ru/browse/PSBM-106785 (cherry picked from commit f8b11260a445169989d01df75d35af0f56178f95) Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/super.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 88a008577dc0..f06212f856c6 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1295,6 +1295,9 @@ static void cache_set_flush(struct closure *cl) set_bit(CACHE_SET_STOPPING_2, &c->flags); wake_up(&c->alloc_wait); + if (!c) + closure_return(cl); + bch_cache_accounting_destroy(&c->accounting); kobject_put(&c->internal); -- 2.26.2
[Devel] [PATCH RH7 v2 3/4] bcache: unregister reboot notifier if bcache fails to unregister device #PSBM-106785
From: Zheng Liu In bcache_init() function it forgot to unregister reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu Tested-by: Joshua Schmid Tested-by: Eric Wheeler Cc: Kent Overstreet Cc: sta...@vger.kernel.org Signed-off-by: Jens Axboe https://jira.sw.ru/browse/PSBM-106785 (cherry picked from commit 2ecf0cdb2b437402110ab57546e02abfa68a716b) Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/super.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 0fccdc395ebe..88a008577dc0 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1959,8 +1959,10 @@ static int __init bcache_init(void) closure_debug_init(); bcache_major = register_blkdev(0, "bcache"); - if (bcache_major < 0) + if (bcache_major < 0) { + unregister_reboot_notifier(); return bcache_major; + } if (!(bcache_wq = create_workqueue("bcache")) || !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) || -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 v2 2/4] bcache: Data corruption fix #PSBM-106785
From: Kent Overstreet The code that handles overlapping extents that we've just read back in from disk was depending on the behaviour of the code that handles overlapping extents as we're inserting into a btree node, in the case of an insert that forced an existing extent to be split: on insert, if we had to split we'd also insert a new extent to represent the top part of the old extent - and then that new extent would get written out. The code that read the extents back in thus did not bother with splitting extents - if it saw an extent that overlapped in the middle of an older extent, it would trim the old extent to only represent the bottom part, assuming that the original insert would've inserted a new extent to represent the top part. I still haven't figured out _how_ it can happen, but I'm now pretty convinced (and testing has confirmed) that there's some kind of an obscure corner case (probably involving extent merging, and multiple overwrites in different sets) that breaks this. The fix is to change the mergesort fixup code to split extents itself when required. 
Signed-off-by: Kent Overstreet Cc: linux-stable # >= v3.10 https://jira.sw.ru/browse/PSBM-106785 (cherry picked from commit ef71ec2d92a08eb27e9d036e3d48835b6597) Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/bset.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index 14032e8c7731..1b27cbd822e1 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -927,7 +927,7 @@ static void sort_key_next(struct btree_iter *iter, *i = iter->data[--iter->used]; } -static void btree_sort_fixup(struct btree_iter *iter) +static struct bkey *btree_sort_fixup(struct btree_iter *iter, struct bkey *tmp) { while (iter->used > 1) { struct btree_iter_set *top = iter->data, *i = top + 1; @@ -955,9 +955,22 @@ static void btree_sort_fixup(struct btree_iter *iter) } else { /* can't happen because of comparison func */ BUG_ON(!bkey_cmp(_KEY(top->k), _KEY(i->k))); - bch_cut_back(_KEY(i->k), top->k); + + if (bkey_cmp(i->k, top->k) < 0) { + bkey_copy(tmp, top->k); + + bch_cut_back(_KEY(i->k), tmp); + bch_cut_front(i->k, top->k); + heap_sift(iter, 0, btree_iter_cmp); + + return tmp; + } else { + bch_cut_back(_KEY(i->k), top->k); + } } } + + return NULL; } static void btree_mergesort(struct btree *b, struct bset *out, @@ -965,15 +978,20 @@ static void btree_mergesort(struct btree *b, struct bset *out, bool fixup, bool remove_stale) { struct bkey *k, *last = NULL; + BKEY_PADDED(k) tmp; bool (*bad)(struct btree *, const struct bkey *) = remove_stale ? bch_ptr_bad : bch_ptr_invalid; while (!btree_iter_end(iter)) { if (fixup && !b->level) - btree_sort_fixup(iter); + k = btree_sort_fixup(iter, ); + else + k = NULL; + + if (!k) + k = bch_btree_iter_next(iter); - k = bch_btree_iter_next(iter); if (bad(b, k)) continue; -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH RH7 v2 1/4] bcache: Fix crashes of bcache used with raid1 #PSBM-106785
When bcache is built on top of raid1 devices, the following warning happens: WARNING: CPU: 2 PID: 8138 at include/linux/bio.h:559 raid1_write_request+0x994/0xba0 [raid1] Call Trace: dump_stack+0x19/0x1b __warn+0xd8/0x100 warn_slowpath_null+0x1d/0x20 raid1_write_request+0x994/0xba0 [raid1] raid1_make_request+0x8a/0x5b0 [raid1] md_handle_request+0xd0/0x150 md_make_request+0x79/0x190 generic_make_request+0x147/0x380 bch_generic_make_request_hack+0x2a/0xc0 [bcache] bch_generic_make_request+0x3d/0x190 [bcache] write_dirty+0x7e/0x110 [bcache] process_one_work+0x185/0x440 worker_thread+0x126/0x3c0 kthread+0xd1/0xe0 ret_from_fork_nospec_begin+0x21/0x21 And immediately followed by the crash: kernel BUG at drivers/md/bcache/closure.c:53! Call Trace: dirty_endio+0x28/0x30 [bcache] bio_endio+0x8c/0x130 call_bio_endio+0x2f/0x40 [raid1] raid_end_bio_io+0x2e/0x90 [raid1] r1_bio_write_done+0x35/0x50 [raid1] raid1_end_write_request+0x118/0x2f0 [raid1] bio_endio+0x8c/0x130 blk_update_request+0x90/0x370 blk_mq_end_request+0x1a/0x90 virtblk_request_done+0x3f/0x70 [virtio_blk] __blk_mq_complete_request_remote+0x19/0x20 flush_smp_call_function_queue+0x63/0x130 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x2d/0x40 call_function_single_interrupt+0x16a/0x170 So this happens because bcache doesn't allocate & initialize 'bio_aux' structure needed by raid1 device. Add 'bio_aux' to 'dirty_io' struct and initialize it along with the 'bio' in dirty_init() to fix this. 
https://jira.sw.ru/browse/PSBM-106785 Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/writeback.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 841f0490d4ef..c2bda701bf9d 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -17,6 +17,7 @@ static void read_dirty(struct closure *); struct dirty_io { struct closure cl; struct cached_dev *dc; + struct bio_aux bio_aux; struct bio bio; }; @@ -122,6 +123,7 @@ static void dirty_init(struct keybuf_key *w) bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS); bio->bi_private = w; bio->bi_io_vec = bio->bi_inline_vecs; + bio_init_aux(&io->bio, &io->bio_aux); bch_bio_map(bio, NULL); } -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 v2] keys, user: Fix high order allocation in user_instantiate() #PSBM-107794
Adding user key might trigger 4-order allocation which is unreliable in case of fragmented memory: [ cut here ] WARNING: CPU: 3 PID: 134927 at mm/page_alloc.c:3533 __alloc_pages_nodemask+0x1b1/0x600 order 4 >= 3, gfp 0x40d0 Kernel panic - not syncing: panic_on_warn set ... CPU: 3 PID: 134927 Comm: add_key01 ve: 0 Kdump: loaded Tainted: G OE 3.10.0-1127.18.2.vz7.163.15 #1 163.15 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014 Call Trace: dump_stack+0x19/0x1b panic+0xe8/0x21f __warn+0xfa/0x100 warn_slowpath_fmt+0x5f/0x80 __alloc_pages_nodemask+0x1b1/0x600 alloc_pages_current+0x98/0x110 kmalloc_order+0x18/0x40 kmalloc_order_trace+0x26/0xa0 __kmalloc+0x281/0x2a0 user_instantiate+0x47/0x90 __key_instantiate_and_link+0x54/0x100 key_create_or_update+0x398/0x490 SyS_add_key+0x12c/0x220 system_call_fastpath+0x25/0x2a Use kvmalloc() to avoid potential -ENOMEM due to fragmentation. https://jira.sw.ru/browse/PSBM-107794 Signed-off-by: Andrey Ryabinin --- Changes since v1: - Add #PSBM-107794 to subject security/keys/user_defined.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/security/keys/user_defined.c b/security/keys/user_defined.c index bc8d3227dc4b..b13d70b69069 100644 --- a/security/keys/user_defined.c +++ b/security/keys/user_defined.c @@ -9,6 +9,7 @@ * 2 of the License, or (at your option) any later version. 
*/ +#include #include #include #include @@ -75,7 +76,7 @@ int user_instantiate(struct key *key, struct key_preparsed_payload *prep) goto error; ret = -ENOMEM; - upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); + upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); if (!upayload) goto error; @@ -96,7 +97,8 @@ static void user_free_payload_rcu(struct rcu_head *head) struct user_key_payload *payload; payload = container_of(head, struct user_key_payload, rcu); - kzfree(payload); + memset(payload, 0, sizeof(*payload) + payload->datalen); + kvfree(payload); } /* @@ -115,7 +117,7 @@ int user_update(struct key *key, struct key_preparsed_payload *prep) /* construct a replacement payload */ ret = -ENOMEM; - upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); + upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); if (!upayload) goto error; @@ -182,7 +184,8 @@ void user_destroy(struct key *key) { struct user_key_payload *upayload = key->payload.data; - kzfree(upayload); + memset(upayload, 0, sizeof(*upayload) + upayload->datalen); + kvfree(upayload); } EXPORT_SYMBOL_GPL(user_destroy); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
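[Editor's note] For context on the "order 4" in the warning above: with 4 KiB pages, a single kmalloc() of sizeof(*upayload) + datalen must come from one physically contiguous block, and a user key payload near its 32767-byte limit pushes the request past 32 KiB, i.e. 16 contiguous pages. A hedged userspace sketch of that order computation (PAGE_SIZE assumed 4096; mirrors what the kernel's get_order() returns):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL

/*
 * Smallest 'order' such that (SKETCH_PAGE_SIZE << order) >= size.
 * The page allocator hands out blocks of 2^order contiguous pages,
 * and typically warns/fails more readily above order 3 when memory
 * is fragmented - which is what kvmalloc() sidesteps by falling
 * back to vmalloc() for large requests.
 */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

So a payload header plus ~32 KiB of key data lands in an order-4 request, matching the `order 4 >= 3` warning in the report.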
[Devel] [PATCH rh7] keys, user: Fix high order allocation in user_instantiate()
Adding user key might trigger 4-order allocation which is unreliable in case of fragmented memory: [ cut here ] WARNING: CPU: 3 PID: 134927 at mm/page_alloc.c:3533 __alloc_pages_nodemask+0x1b1/0x600 order 4 >= 3, gfp 0x40d0 Kernel panic - not syncing: panic_on_warn set ... CPU: 3 PID: 134927 Comm: add_key01 ve: 0 Kdump: loaded Tainted: G OE 3.10.0-1127.18.2.vz7.163.15 #1 163.15 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014 Call Trace: dump_stack+0x19/0x1b panic+0xe8/0x21f __warn+0xfa/0x100 warn_slowpath_fmt+0x5f/0x80 __alloc_pages_nodemask+0x1b1/0x600 alloc_pages_current+0x98/0x110 kmalloc_order+0x18/0x40 kmalloc_order_trace+0x26/0xa0 __kmalloc+0x281/0x2a0 user_instantiate+0x47/0x90 __key_instantiate_and_link+0x54/0x100 key_create_or_update+0x398/0x490 SyS_add_key+0x12c/0x220 system_call_fastpath+0x25/0x2a Use kvmalloc() to avoid potential -ENOMEM due to fragmentation. https://jira.sw.ru/browse/PSBM-107794 Signed-off-by: Andrey Ryabinin --- security/keys/user_defined.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/security/keys/user_defined.c b/security/keys/user_defined.c index bc8d3227dc4b..b13d70b69069 100644 --- a/security/keys/user_defined.c +++ b/security/keys/user_defined.c @@ -9,6 +9,7 @@ * 2 of the License, or (at your option) any later version. 
*/ +#include #include #include #include @@ -75,7 +76,7 @@ int user_instantiate(struct key *key, struct key_preparsed_payload *prep) goto error; ret = -ENOMEM; - upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); + upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); if (!upayload) goto error; @@ -96,7 +97,8 @@ static void user_free_payload_rcu(struct rcu_head *head) struct user_key_payload *payload; payload = container_of(head, struct user_key_payload, rcu); - kzfree(payload); + memset(payload, 0, sizeof(*payload) + payload->datalen); + kvfree(payload); } /* @@ -115,7 +117,7 @@ int user_update(struct key *key, struct key_preparsed_payload *prep) /* construct a replacement payload */ ret = -ENOMEM; - upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); + upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); if (!upayload) goto error; @@ -182,7 +184,8 @@ void user_destroy(struct key *key) { struct user_key_payload *upayload = key->payload.data; - kzfree(upayload); + memset(upayload, 0, sizeof(*upayload) + upayload->datalen); + kvfree(upayload); } EXPORT_SYMBOL_GPL(user_destroy); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 4/4] bcache: fix cache_set_flush() NULL pointer dereference on OOM
From: Eric Wheeler When bch_cache_set_alloc() fails to kzalloc the cache_set, the asynchronous closure handling tries to dereference a cache_set that hadn't yet been allocated inside of cache_set_flush() which is called by __cache_set_unregister() during cleanup. This appears to happen only during an OOM condition on bcache_register. Signed-off-by: Eric Wheeler Cc: sta...@vger.kernel.org https://jira.sw.ru/browse/PSBM-106785 (cherry picked from commit f8b11260a445169989d01df75d35af0f56178f95) Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/super.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 88a008577dc0..f06212f856c6 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1295,6 +1295,9 @@ static void cache_set_flush(struct closure *cl) set_bit(CACHE_SET_STOPPING_2, &c->flags); wake_up(&c->alloc_wait); + if (!c) + closure_return(cl); + bch_cache_accounting_destroy(&c->accounting); kobject_put(&c->internal); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 3/4] bcache: unregister reboot notifier if bcache fails to unregister device
From: Zheng Liu In the bcache_init() function it forgot to unregister the reboot notifier if bcache fails to unregister a block device. This commit fixes this. Signed-off-by: Zheng Liu Tested-by: Joshua Schmid Tested-by: Eric Wheeler Cc: Kent Overstreet Cc: sta...@vger.kernel.org Signed-off-by: Jens Axboe https://jira.sw.ru/browse/PSBM-106785 (cherry picked from commit 2ecf0cdb2b437402110ab57546e02abfa68a716b) Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/super.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 0fccdc395ebe..88a008577dc0 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1959,8 +1959,10 @@ static int __init bcache_init(void) closure_debug_init(); bcache_major = register_blkdev(0, "bcache"); - if (bcache_major < 0) + if (bcache_major < 0) { + unregister_reboot_notifier(&reboot); return bcache_major; + } if (!(bcache_wq = create_workqueue("bcache")) || !(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) || -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
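[Editor's note] The bug class fixed above is a missed unwind step on an init error path: a resource registered before the failure point leaks when the function bails out. A generic userspace sketch of the rule the patch enforces (the register/unregister pair and the failure flag are illustrative, not the bcache API):

```c
#include <assert.h>

static int notifier_registered;

static void register_notifier(void)   { notifier_registered = 1; }
static void unregister_notifier(void) { notifier_registered = 0; }

/* 'fail' simulates register_blkdev() returning an error. */
static int module_init_sketch(int fail)
{
	register_notifier();

	if (fail) {
		/*
		 * The fix: undo everything registered so far before
		 * returning the error, otherwise the notifier leaks
		 * (and in the kernel case points at unloaded code).
		 */
		unregister_notifier();
		return -1;
	}
	return 0;
}
```

In larger init functions this is usually structured as a goto-based unwind ladder, so each new failure point only needs to jump to the right label.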
[Devel] [PATCH 2/4] bcache: Data corruption fix
From: Kent Overstreet The code that handles overlapping extents that we've just read back in from disk was depending on the behaviour of the code that handles overlapping extents as we're inserting into a btree node in the case of an insert that forced an existing extent to be split: on insert, if we had to split we'd also insert a new extent to represent the top part of the old extent - and then that new extent would get written out. The code that read the extents back in thus did not bother with splitting extents - if it saw an extent that overlapped in the middle of an older extent, it would trim the old extent to only represent the bottom part, assuming that the original insert would've inserted a new extent to represent the top part. I still haven't figured out _how_ it can happen, but I'm now pretty convinced (and testing has confirmed) that there's some kind of an obscure corner case (probably involving extent merging, and multiple overwrites in different sets) that breaks this. The fix is to change the mergesort fixup code to split extents itself when required. 
Signed-off-by: Kent Overstreet Cc: linux-stable # >= v3.10 https://jira.sw.ru/browse/PSBM-106785 (cherry picked from commit ef71ec2d92a08eb27e9d036e3d48835b6597) Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/bset.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c index 14032e8c7731..1b27cbd822e1 100644 --- a/drivers/md/bcache/bset.c +++ b/drivers/md/bcache/bset.c @@ -927,7 +927,7 @@ static void sort_key_next(struct btree_iter *iter, *i = iter->data[--iter->used]; } -static void btree_sort_fixup(struct btree_iter *iter) +static struct bkey *btree_sort_fixup(struct btree_iter *iter, struct bkey *tmp) { while (iter->used > 1) { struct btree_iter_set *top = iter->data, *i = top + 1; @@ -955,9 +955,22 @@ static void btree_sort_fixup(struct btree_iter *iter) } else { /* can't happen because of comparison func */ BUG_ON(!bkey_cmp(&START_KEY(top->k), &START_KEY(i->k))); - bch_cut_back(&START_KEY(i->k), top->k); + + if (bkey_cmp(i->k, top->k) < 0) { + bkey_copy(tmp, top->k); + + bch_cut_back(&START_KEY(i->k), tmp); + bch_cut_front(i->k, top->k); + heap_sift(iter, 0, btree_iter_cmp); + + return tmp; + } else { + bch_cut_back(&START_KEY(i->k), top->k); + } } } + + return NULL; } static void btree_mergesort(struct btree *b, struct bset *out, @@ -965,15 +978,20 @@ static void btree_mergesort(struct btree *b, struct bset *out, bool fixup, bool remove_stale) { struct bkey *k, *last = NULL; + BKEY_PADDED(k) tmp; bool (*bad)(struct btree *, const struct bkey *) = remove_stale ? bch_ptr_bad : bch_ptr_invalid; while (!btree_iter_end(iter)) { if (fixup && !b->level) - btree_sort_fixup(iter); + k = btree_sort_fixup(iter, &tmp.k); + else + k = NULL; + + if (!k) + k = bch_btree_iter_next(iter); - k = bch_btree_iter_next(iter); if (bad(b, k)) continue; -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH 1/4] bcache: Fix crashes of bcache used with raid1
When bcache is built on top of raid1 devices, the following warning happens: WARNING: CPU: 2 PID: 8138 at include/linux/bio.h:559 raid1_write_request+0x994/0xba0 [raid1] Call Trace: dump_stack+0x19/0x1b __warn+0xd8/0x100 warn_slowpath_null+0x1d/0x20 raid1_write_request+0x994/0xba0 [raid1] raid1_make_request+0x8a/0x5b0 [raid1] md_handle_request+0xd0/0x150 md_make_request+0x79/0x190 generic_make_request+0x147/0x380 bch_generic_make_request_hack+0x2a/0xc0 [bcache] bch_generic_make_request+0x3d/0x190 [bcache] write_dirty+0x7e/0x110 [bcache] process_one_work+0x185/0x440 worker_thread+0x126/0x3c0 kthread+0xd1/0xe0 ret_from_fork_nospec_begin+0x21/0x21 And immediately followed by the crash: kernel BUG at drivers/md/bcache/closure.c:53! Call Trace: dirty_endio+0x28/0x30 [bcache] bio_endio+0x8c/0x130 call_bio_endio+0x2f/0x40 [raid1] raid_end_bio_io+0x2e/0x90 [raid1] r1_bio_write_done+0x35/0x50 [raid1] raid1_end_write_request+0x118/0x2f0 [raid1] bio_endio+0x8c/0x130 blk_update_request+0x90/0x370 blk_mq_end_request+0x1a/0x90 virtblk_request_done+0x3f/0x70 [virtio_blk] __blk_mq_complete_request_remote+0x19/0x20 flush_smp_call_function_queue+0x63/0x130 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x2d/0x40 call_function_single_interrupt+0x16a/0x170 So this happens because bcache doesn't allocate & initialize 'bio_aux' structure needed by raid1 device. Add 'bio_aux' to 'dirty_io' struct and initialize it along with the 'bio' in dirty_init() to fix this. 
https://jira.sw.ru/browse/PSBM-106785 Signed-off-by: Andrey Ryabinin --- drivers/md/bcache/writeback.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index 841f0490d4ef..c2bda701bf9d 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c @@ -17,6 +17,7 @@ static void read_dirty(struct closure *); struct dirty_io { struct closure cl; struct cached_dev *dc; + struct bio_aux bio_aux; struct bio bio; }; @@ -122,6 +123,7 @@ static void dirty_init(struct keybuf_key *w) bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS); bio->bi_private = w; bio->bi_io_vec = bio->bi_inline_vecs; + bio_init_aux(&io->bio, &io->bio_aux); bch_bio_map(bio, NULL); } -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] cgroup: add missing dput() in cgroup_unmark_ve_roots()
cgroup_unmark_ve_roots() calls dget() on the cgroup's dentry but doesn't have the corresponding dput() call. This leads to leaking cgroups. Add the missing dput() to fix this. https://jira.sw.ru/browse/PSBM-107328 Fixes: 1ac69e183447 ("ve/cgroup: added release_agent to each container root cgroup.") Signed-off-by: Andrey Ryabinin --- kernel/cgroup.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 55713a0071ce..5f3111805eba 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -4719,6 +4719,7 @@ void cgroup_unmark_ve_roots(struct ve_struct *ve) mutex_lock(>i_mutex); mutex_lock(&cgroup_mutex); cgroup_rm_file(cgrp, cft); + dput(cgrp->dentry); BUG_ON(!rcu_dereference_protected(cgrp->ve_owner, lockdep_is_held(&cgroup_mutex))); rcu_assign_pointer(cgrp->ve_owner, NULL); -- 2.26.2 ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] ms/kernel/kmod: fix use-after-free of the sub_info structure
From: Martin Schwidefsky Found this in the message log on a s390 system: BUG kmalloc-192 (Not tainted): Poison overwritten Disabling lock debugging due to kernel taint INFO: 0x684761f4-0x684761f7. First byte 0xff instead of 0x6b INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648 __slab_alloc.isra.47.constprop.56+0x5f6/0x658 kmem_cache_alloc_trace+0x106/0x408 call_usermodehelper_setup+0x70/0x128 call_usermodehelper+0x62/0x90 cgroup_release_agent+0x178/0x1c0 process_one_work+0x36e/0x680 worker_thread+0x2f0/0x4f8 kthread+0x10a/0x120 kernel_thread_starter+0x6/0xc kernel_thread_starter+0x0/0xc INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648 __slab_free+0x94/0x560 kfree+0x364/0x3e0 call_usermodehelper_exec+0x110/0x1b8 cgroup_release_agent+0x178/0x1c0 process_one_work+0x36e/0x680 worker_thread+0x2f0/0x4f8 kthread+0x10a/0x120 kernel_thread_starter+0x6/0xc kernel_thread_starter+0x0/0xc There is a use-after-free bug on the subprocess_info structure allocated by the user mode helper. In case do_execve() returns with an error call_usermodehelper() stores the error code to sub_info->retval, but sub_info can already have been freed. Regarding UMH_NO_WAIT, the sub_info structure can be freed by __call_usermodehelper() before the worker thread returns from do_execve(), allowing memory corruption when do_execve() failed after exec_mmap() is called. Regarding UMH_WAIT_EXEC, the call to umh_complete() allows call_usermodehelper_exec() to continue which then frees sub_info. To fix this race the code needs to make sure that the call to call_usermodehelper_freeinfo() is always done after the last store to sub_info->retval. 
Signed-off-by: Martin Schwidefsky Reviewed-by: Oleg Nesterov Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds https://jira.sw.ru/browse/PSBM-107061 (cherry-picked from commit 0baf2a4dbf75abb7c186fd6c8d55d27aaa354a29) Signed-off-by: Andrey Ryabinin --- kernel/kmod.c | 76 +-- 1 file changed, 37 insertions(+), 39 deletions(-) diff --git a/kernel/kmod.c b/kernel/kmod.c index 7fc0ba9e3216..de2bcfdc94d0 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -585,12 +585,34 @@ int __request_module(bool wait, const char *fmt, ...) EXPORT_SYMBOL(__request_module); #endif /* CONFIG_MODULES */ +static void call_usermodehelper_freeinfo(struct subprocess_info *info) +{ + if (info->cleanup) + (*info->cleanup)(info); + kfree(info); +} + +static void umh_complete(struct subprocess_info *sub_info) +{ + struct completion *comp = xchg(_info->complete, NULL); + /* +* See call_usermodehelper_exec(). If xchg() returns NULL +* we own sub_info, the UMH_KILLABLE caller has gone away +* or the caller used UMH_NO_WAIT. 
+*/ + if (comp) + complete(comp); + else + call_usermodehelper_freeinfo(sub_info); +} + /* * This is the task which runs the usermode application */ static int call_usermodehelper(void *data) { struct subprocess_info *sub_info = data; + int wait = sub_info->wait & ~UMH_KILLABLE; struct cred *new; int retval; @@ -607,7 +629,7 @@ static int call_usermodehelper(void *data) retval = -ENOMEM; new = prepare_kernel_cred(current); if (!new) - goto fail; + goto out; spin_lock(_sysctl_lock); new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset); @@ -619,7 +641,7 @@ static int call_usermodehelper(void *data) retval = sub_info->init(sub_info, new); if (retval) { abort_creds(new); - goto fail; + goto out; } } @@ -628,12 +650,13 @@ static int call_usermodehelper(void *data) retval = do_execve(getname_kernel(sub_info->path), (const char __user *const __user *)sub_info->argv, (const char __user *const __user *)sub_info->envp); +out: + sub_info->retval = retval; + /* wait_for_helper() will call umh_complete if UHM_WAIT_PROC. */ + if (wait != UMH_WAIT_PROC) + umh_complete(sub_info); if (!retval) return 0; - - /* Exec failed? */ -fail: - sub_info->retval = retval; do_exit(0); } @@ -644,26 +667,6 @@ static int call_helper(void *data) return call_usermodehelper(data); } -static void call_usermodehelper_freeinfo(struct subprocess_info *info) -{ - if (info->cleanup) - (*info->cleanup)(info); - kfree(info); -} - -static void umh_complete(struct subprocess_info *sub_i
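[Editor's note] The core of the fix above is the xchg() on sub_info->complete: whichever of the two racing parties swaps the pointer to NULL first transfers ownership, and the party that reads back NULL knows the other side is gone and must free the structure itself. A hedged userspace sketch of that ownership-handoff pattern with C11 atomics - the struct and flags are illustrative stand-ins, not the kmod API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sub_info_sketch {
	_Atomic(int *) complete; /* stands in for 'struct completion *' */
	int completed;           /* a waiter was signalled */
	int freed;               /* we owned the info and freed it */
};

/*
 * Called by the helper on exit and, on kill/timeout or UMH_NO_WAIT,
 * by the waiter side too. atomic_exchange() guarantees exactly one
 * caller sees the non-NULL pointer (and signals the waiter); the
 * other sees NULL, meaning it now exclusively owns the structure
 * and is the one responsible for freeing it - so no access can
 * happen after the free.
 */
static void umh_complete_sketch(struct sub_info_sketch *s)
{
	int *comp = atomic_exchange(&s->complete, NULL);

	if (comp)
		s->completed = 1; /* waiter still there: wake it up */
	else
		s->freed = 1;     /* waiter gone: free sub_info ourselves */
}
```

The pre-fix bug was storing to sub_info->retval after this handoff could already have let the other side free sub_info; the patch moves the store before umh_complete().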
Re: [Devel] [PATCH RHEL v2] mm: Reduce access frequency to shrinker_rwsem during shrink_slab
On 8/20/20 5:51 PM, Valeriy Vdovin wrote: > Bug https://jira.sw.ru/browse/PSBM-99181 has introduced a problem: when > the kernel has opened NFS delegations and NFS server is not accessible > at the time when NFS shrinker is called, the whole shrinker list > execution gets stuck until NFS server is back. Being a problem in itself > it also introduces a bigger problem - during that hang, the shrinker_rwsem > also gets locked, consequently no new mounts can be done at that time > because new superblock tries to register its own shrinker and also gets > stuck at acquiring shrinker_rwsem. > > Commit 9e9e35d050955648449498827deb2d43be0564e1 is a workaround for that > problem. It is known that during single shrinker execution we do not > actually need to hold shrinker_rwsem so we release and reacquire the > rwsem for each shrinker in the list. > > Because of this workaround the shrink_slab function now experiences a major > slowdown, because shrinker_rwsem gets accessed for each shrinker in the > list twice. On an idle fresh-booted system shrinker_list could be > iterated up to 1600 times a second, although originally the problem was > local to only one NFS shrinker. > > This patch fixes commit 9e9e35d050955648449498827deb2d43be0564e1 in a > way that before calling for up_read for shrinker_rwsem, we check that > this is really an NFS shrinker by checking NFS magic in superblock, if > it is accessible from shrinker. > > https://jira.sw.ru/browse/PSBM-99181 > > Co-authored-by: Andrey Ryabinin > Signed-off-by: Valeriy Vdovin > > Changes: > v2: Added missing 'rwsem_is_contended' check > --- Reviewed-by: Andrey Ryabinin ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH RHEL7] mm: Reduce access frequency to shrinker_rwsem during shrink_slab
On 8/20/20 11:32 AM, Valeriy Vdovin wrote: > @@ -565,14 +588,16 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, > int nid, >* memcg_expand_one_shrinker_map if new shrinkers >* were registered in the meanwhile. >*/ > - if (!down_read_trylock(&shrinker_rwsem)) { > - freed = freed ? : 1; > + if (is_nfs) { > + if (!down_read_trylock(&shrinker_rwsem)) { > + freed = freed ? : 1; > + put_shrinker(shrinker); > + return freed; > + } > put_shrinker(shrinker); > - return freed; > + map = memcg_nid_shrinker_map(memcg, nid); > + nr_max = min(shrinker_nr_max, map->nr_max); > } Need to add the rwsem_is_contended() check back. It was here before commit 9e9e35d05: else if (rwsem_is_contended(&shrinker_rwsem)) { freed = freed ? : 1; break; } > - put_shrinker(shrinker); > - map = memcg_nid_shrinker_map(memcg, nid); > - nr_max = min(shrinker_nr_max, map->nr_max); > } > unlock: > up_read(&shrinker_rwsem); ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7] ms/vt: vt_ioctl: fix VT_DISALLOCATE freeing in-use virtual console
From: Eric Biggers The VT_DISALLOCATE ioctl can free a virtual console while tty_release() is still running, causing a use-after-free in con_shutdown(). This occurs because VT_DISALLOCATE considers a virtual console's 'struct vc_data' to be unused as soon as the corresponding tty's refcount hits 0. But actually it may be still being closed. Fix this by making vc_data be reference-counted via the embedded 'struct tty_port'. A newly allocated virtual console has refcount 1. Opening it for the first time increments the refcount to 2. Closing it for the last time decrements the refcount (in tty_operations::cleanup() so that it happens late enough), as does VT_DISALLOCATE. Reproducer: #include #include #include #include int main() { if (fork()) { for (;;) close(open("/dev/tty5", O_RDWR)); } else { int fd = open("/dev/tty10", O_RDWR); for (;;) ioctl(fd, VT_DISALLOCATE, 5); } } KASAN report: BUG: KASAN: use-after-free in con_shutdown+0x76/0x80 drivers/tty/vt/vt.c:3278 Write of size 8 at addr 88806a4ec108 by task syz_vt/129 CPU: 0 PID: 129 Comm: syz_vt Not tainted 5.6.0-rc2 #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014 Call Trace: [...] con_shutdown+0x76/0x80 drivers/tty/vt/vt.c:3278 release_tty+0xa8/0x410 drivers/tty/tty_io.c:1514 tty_release_struct+0x34/0x50 drivers/tty/tty_io.c:1629 tty_release+0x984/0xed0 drivers/tty/tty_io.c:1789 [...] Allocated by task 129: [...] kzalloc include/linux/slab.h:669 [inline] vc_allocate drivers/tty/vt/vt.c:1085 [inline] vc_allocate+0x1ac/0x680 drivers/tty/vt/vt.c:1066 con_install+0x4d/0x3f0 drivers/tty/vt/vt.c:3229 tty_driver_install_tty drivers/tty/tty_io.c:1228 [inline] tty_init_dev+0x94/0x350 drivers/tty/tty_io.c:1341 tty_open_by_driver drivers/tty/tty_io.c:1987 [inline] tty_open+0x3ca/0xb30 drivers/tty/tty_io.c:2035 [...] Freed by task 130: [...] 
kfree+0xbf/0x1e0 mm/slab.c:3757 vt_disallocate drivers/tty/vt/vt_ioctl.c:300 [inline] vt_ioctl+0x16dc/0x1e30 drivers/tty/vt/vt_ioctl.c:818 tty_ioctl+0x9db/0x11b0 drivers/tty/tty_io.c:2660 [...] Fixes: 4001d7b7fc27 ("vt: push down the tty lock so we can see what is left to tackle") Cc: # v3.4+ Reported-by: syzbot+522643ab5729b0421...@syzkaller.appspotmail.com Acked-by: Jiri Slaby Signed-off-by: Eric Biggers Link: https://lore.kernel.org/r/20200322034305.210082-2-ebigg...@kernel.org Signed-off-by: Greg Kroah-Hartman https://jira.sw.ru/browse/PSBM-106391 (cherry-picked from commit ca4463bf8438b403596edd0ec961ca0d4fbe0220) Signed-off-by: Andrey Ryabinin --- drivers/tty/vt/vt.c | 23 ++- drivers/tty/vt/vt_ioctl.c | 12 2 files changed, 26 insertions(+), 9 deletions(-) diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c index 0ee0cd507522..795d7867ac24 100644 --- a/drivers/tty/vt/vt.c +++ b/drivers/tty/vt/vt.c @@ -767,6 +767,17 @@ static void visual_deinit(struct vc_data *vc) module_put(vc->vc_sw->owner); } +static void vc_port_destruct(struct tty_port *port) +{ + struct vc_data *vc = container_of(port, struct vc_data, port); + + kfree(vc); +} + +static const struct tty_port_operations vc_port_ops = { + .destruct = vc_port_destruct, +}; + int vc_allocate(unsigned int currcons) /* return 0 on success */ { struct vt_notifier_param param; @@ -792,6 +803,7 @@ int vc_allocate(unsigned int currcons) /* return 0 on success */ vc_cons[currcons].d = vc; tty_port_init(>port); + vc->port.ops = _port_ops; INIT_WORK(_cons[currcons].SAK_work, vc_SAK); visual_init(vc, currcons, 1); @@ -2799,6 +2811,7 @@ static int con_install(struct tty_driver *driver, struct tty_struct *tty) tty->driver_data = vc; vc->port.tty = tty; + tty_port_get(>port); if (!tty->winsize.ws_row && !tty->winsize.ws_col) { tty->winsize.ws_row = vc_cons[currcons].d->vc_rows; @@ -2834,6 +2847,13 @@ static void con_shutdown(struct tty_struct *tty) console_unlock(); } +static void con_cleanup(struct tty_struct *tty) +{ 
+ struct vc_data *vc = tty->driver_data; + + tty_port_put(>port); +} + static int default_italic_color= 2; // green (ASCII) static int default_underline_color = 3; // cyan (ASCII) module_param_named(italic, default_italic_color, int, S_IRUGO | S_IWUSR); @@ -2956,7 +2976,8 @@ static const struct tty_operations con_o
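[Editor's note] The fix above ties vc_data's lifetime to its embedded tty_port: allocation takes reference 1, the first open raises it to 2, and both the final close and VT_DISALLOCATE each drop one, so the last tty_port_put() runs the port's ->destruct() and frees the structure no matter which side finishes last. A minimal userspace sketch of that reference-counting shape (hypothetical names, plain int counter for single-threaded clarity; the kernel uses a kref):

```c
#include <assert.h>
#include <stddef.h>

struct port_sketch {
	int refcount;
	void (*destruct)(struct port_sketch *);
};

static int destructed;

static void port_get(struct port_sketch *p)
{
	p->refcount++;
}

static void port_put(struct port_sketch *p)
{
	/* Last reference gone: run the destructor, which in the
	 * kernel case kfree()s the containing vc_data. */
	if (--p->refcount == 0)
		p->destruct(p);
}

static void sketch_destruct(struct port_sketch *p)
{
	(void)p;
	destructed = 1;
}
```

The pre-fix bug was exactly the absence of this: VT_DISALLOCATE freed the vc_data unconditionally once the tty refcount hit 0, even though tty_release() could still be using it.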
Re: [Devel] [PATCH rh7 v4] mm/memcg: fix cache growth above cache.limit_in_bytes
On 7/30/20 6:52 PM, Evgenii Shatokhin wrote: > Hi, > > On 30.07.2020 18:02, Andrey Ryabinin wrote: >> Exceeding cache above cache.limit_in_bytes schedules high_work_func() >> which tries to reclaim 32 pages. If cache generated fast enough or it allows >> cgroup to steadily grow above cache.limit_in_bytes because we don't reclaim >> enough. Try to reclaim exceeded amount of cache instead. >> >> https://jira.sw.ru/browse/PSBM-106384 >> Signed-off-by: Andrey Ryabinin >> --- >> >> - Changes since v1: add bug link to changelog >> - Changes since v2: Fix cache_overused check (We should check if it's >> positive). >> Made this stupid bug during cleanup, patch was tested without bogus >> cleanup, >> so it shoud work. >> - Chnages since v3: Compilation fixes, properly tested now. >> >> mm/memcontrol.c | 10 +++--- >> 1 file changed, 7 insertions(+), 3 deletions(-) >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index 3cf200f506c3..16cbd451a588 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg, >> { >> do { >> + long cache_overused; >> + >> if (page_counter_read(>memory) > memcg->high) >> try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 0); >> - if (page_counter_read(>cache) > memcg->cache.limit) >> - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, >> - MEM_CGROUP_RECLAIM_NOSWAP); >> + cache_overused = page_counter_read(>cache) - >> + memcg->cache.limit; > > If cache_overused is less than 32 pages, the kernel would try to reclaim less > than before the patch. It it OK, or should it try to reclaim at least 32 > pages? It's ok, try_to_free_mem_cgroup_pages will increase it: unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, int flags) .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), ___ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH rh7 v4] mm/memcg: fix cache growth above cache.limit_in_bytes
Exceeding cache above cache.limit_in_bytes schedules high_work_func(), which tries to reclaim 32 pages. If cache is generated fast enough, this allows the cgroup to steadily grow above cache.limit_in_bytes because we don't reclaim enough. Try to reclaim the exceeded amount of cache instead.

https://jira.sw.ru/browse/PSBM-106384
Signed-off-by: Andrey Ryabinin
---

- Changes since v1: add bug link to changelog
- Changes since v2: fix the cache_overused check (we should check that it
  is positive). Made this silly bug during cleanup; the patch was tested
  without the bogus cleanup, so it should work.
- Changes since v3: compilation fixes, properly tested now.

 mm/memcontrol.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3cf200f506c3..16cbd451a588 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
 {
 	do {
+		long cache_overused;
+
 		if (page_counter_read(&memcg->memory) > memcg->high)
 			try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 0);

-		if (page_counter_read(&memcg->cache) > memcg->cache.limit)
-			try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
-					MEM_CGROUP_RECLAIM_NOSWAP);
+		cache_overused = page_counter_read(&memcg->cache) -
+				 memcg->cache.limit;
+		if (cache_overused > 0)
+			try_to_free_mem_cgroup_pages(memcg, cache_overused,
+					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 }
--
2.26.2
[Devel] [PATCH rh7 v3] mm/memcg: fix cache growth above cache.limit_in_bytes
Exceeding cache above cache.limit_in_bytes schedules high_work_func(), which tries to reclaim 32 pages. If cache is generated fast enough, this allows the cgroup to steadily grow above cache.limit_in_bytes because we don't reclaim enough. Try to reclaim the exceeded amount of cache instead.

https://jira.sw.ru/browse/PSBM-106384
Signed-off-by: Andrey Ryabinin
---
Changes since v1: add bug link to changelog
Changes since v2: fix the cache_overused check (we should check that it
is positive). Made this silly bug during cleanup; the patch was tested
without the bogus cleanup, so it should work.

 mm/memcontrol.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3cf200f506c3..e23e546fd00f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
 {
 	do {
+		long cache_overused;
+
 		if (page_counter_read(&memcg->memory) > memcg->high)
 			try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 0);

-		if (page_counter_read(&memcg->cache) > memcg->cache.limit)
-			try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
-					MEM_CGROUP_RECLAIM_NOSWAP);
+		cache_overused = page_counter_read(&memcg->cache) -
+				 memcg->cache.limit;
+		if (cache_overused > 0)
+			try_to_free_mem_cgroup_pages(memcg, max(CHARGE_BATCH, cache_overused,
+					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 }
--
2.26.2
[Devel] [PATCH rh7 v2] mm/memcg: fix cache growth above cache.limit_in_bytes
Exceeding cache above cache.limit_in_bytes schedules high_work_func(), which tries to reclaim 32 pages. If cache is generated fast enough, this allows the cgroup to steadily grow above cache.limit_in_bytes because we don't reclaim enough. Try to reclaim the exceeded amount of cache instead.

https://jira.sw.ru/browse/PSBM-106384
Signed-off-by: Andrey Ryabinin
---
Changes since v1: add bug link to changelog

 mm/memcontrol.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3cf200f506c3..e5adb0e81cbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
 {
 	do {
+		unsigned long cache_overused;
+
 		if (page_counter_read(&memcg->memory) > memcg->high)
 			try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 0);

-		if (page_counter_read(&memcg->cache) > memcg->cache.limit)
-			try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
-					MEM_CGROUP_RECLAIM_NOSWAP);
+		cache_overused = page_counter_read(&memcg->cache) -
+				 memcg->cache.limit;
+		if (cache_overused)
+			try_to_free_mem_cgroup_pages(memcg, cache_overused,
+					gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 	} while ((memcg = parent_mem_cgroup(memcg)));
 }
--
2.26.2