[Devel] [PATCH rh7 2/2] mm/vmscan: add cond_resched() to loop in shrink_slab_memcg()
shrink_slab_memcg() may iterate for a long time without rescheduling if there are many memcgs, each with only a small number of objects. Add cond_resched() to avoid a potential soft lockup.

https://jira.sw.ru/browse/PSBM-125095
Signed-off-by: Andrey Ryabinin
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 080500f4e366..17a7ed60f525 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -527,6 +527,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		struct shrinker *shrinker;
 		bool is_nfs;
 
+		cond_resched();
+
 		shrinker = idr_find(&shrinker_idr, i);
 		if (unlikely(!shrinker)) {
 			clear_bit(i, map->map);
--
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
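For readers outside the kernel tree, the pattern in this patch can be sketched in userspace C. This is only an analogy under stated assumptions: sched_yield() stands in for cond_resched(), and struct slot, resched_point() and scan_slots() are hypothetical names that do not exist in mm/vmscan.c. The point it shows is that the scheduling point sits at the top of the loop body, so iterations that bail out early with "continue" still reach it.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one shrinker slot of a memcg. */
struct slot {
	bool present;          /* false models idr_find() returning NULL */
	unsigned long objects; /* objects this slot could free */
};

/* Counts how often the loop offered to give up the CPU. */
static unsigned long yield_points;

/* Userspace analogue of cond_resched(); here it always yields. */
static void resched_point(void)
{
	yield_points++;
	sched_yield();
}

static unsigned long scan_slots(const struct slot *slots, size_t n)
{
	unsigned long freed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Yield first, so skipped slots still hit a scheduling point. */
		resched_point();

		if (!slots[i].present)
			continue;
		freed += slots[i].objects;
	}
	return freed;
}
```

Calling scan_slots() on an array where most slots are empty still passes through resched_point() once per slot, which is the property the patch relies on to avoid the soft lockup.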
[Devel] [PATCH rh7 1/2] mm: memcg: fix memcg reclaim soft lockup
From: Xunlei Pang

We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg doesn't have any reclaimable memory.

It can be easily reproduced as below:

 watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
 CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
 Call Trace:
  shrink_lruvec+0x49f/0x640
  shrink_node+0x2a6/0x6f0
  do_try_to_free_pages+0xe9/0x3e0
  try_to_free_mem_cgroup_pages+0xef/0x1f0
  try_charge+0x2c1/0x750
  mem_cgroup_charge+0xd7/0x240
  __add_to_page_cache_locked+0x2fd/0x370
  add_to_page_cache_lru+0x4a/0xc0
  pagecache_get_page+0x10b/0x2f0
  filemap_fault+0x661/0xad0
  ext4_filemap_fault+0x2c/0x40
  __do_fault+0x4d/0xf9
  handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance for oom reaper to run to reclaim the to-be-killed process.

Add a cond_resched() at the upper shrink_node_memcgs() to solve this issue, this will mean that we will get a scheduling point for each memcg in the reclaimed hierarchy without any dependency on the reclaimable memory in that memcg thus making it more predictable.

Suggested-by: Michal Hocko
Signed-off-by: Xunlei Pang
Signed-off-by: Andrew Morton
Acked-by: Chris Down
Acked-by: Michal Hocko
Acked-by: Johannes Weiner
Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlp...@linux.alibaba.com
Signed-off-by: Linus Torvalds

https://jira.sw.ru/browse/PSBM-125095
(cherry picked from commit e3336cab2579012b1e72b5265adf98e2d6e244ad)
Signed-off-by: Andrey Ryabinin
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85622f235e78..080500f4e366 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2684,6 +2684,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			unsigned long lru_pages, scanned;
 
+			/*
+			 * This loop can become CPU-bound when target memcgs
+			 * aren't eligible for reclaim - either because they
+			 * don't have any reclaimable pages, or because their
+			 * memory is explicitly protected. Avoid soft lockups.
+			 */
+			cond_resched();
+
 			if (!sc->may_thrash && mem_cgroup_low(root, memcg))
 				continue;
 
--
2.26.2
[Devel] [PATCH vz8 2/2] jbd2: raid amnesia protection for the journal
From: Dmitry Monakhov

https://jira.sw.ru/browse/PSBM-15484

Some block devices can return different data on read requests for the same block after a power failure (for example, a mirrored RAID is out of sync and a resync is in progress). In that case the following situation is possible:

A power failure happens after the commit block was issued for transaction 'D'; on the next boot the first disk will have the commit block, but the second one will not.

 mirror1: journal={Ac-Bc-Cc-Dc }
 mirror2: journal={Ac-Bc-Cc-D }

Now let's assume that we read from mirror1 and find that 'D' has a valid commit block, so journal_replay will replay that transaction. A second power failure may happen before journal_reset(), so the next journal_replay() may read from mirror2 and find that 'C' is the last valid transaction. This results in corruption because we already replayed transaction 'D'.

In order to avoid such ambiguity we should perform a 'stabilize write':
1) Read and rewrite the latest commit id block
2) Invalidate the next block in order to guarantee that the journal head becomes stable.

Signed-off-by: Dmitry Monakhov
Signed-off-by: Andrey Ryabinin
---
 fs/jbd2/recovery.c | 77 +-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index a4967b27ffb6..78e7d2fed069 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -33,6 +33,9 @@ struct recovery_info
 	int		nr_replays;
 	int		nr_revokes;
 	int		nr_revoke_hits;
+
+	unsigned int		last_log_block;
+	struct buffer_head	*last_commit_bh;
 };
 
 enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
@@ -229,6 +232,71 @@ do {									\
 		var -= ((journal)->j_last - (journal)->j_first);	\
 } while (0)
 
+/*
+ * The 'Raid amnesia' effect protection: https://jira.sw.ru/browse/PSBM-15484
+ *
+ * Some block devices can return different data on read requests for the same
+ * block after a power failure (for example, a mirrored RAID is out of sync and
+ * a resync is in progress). In that case the following situation is possible:
+ *
+ * A power failure happens after the commit block was issued for
+ * transaction 'D'; on the next boot the first disk will have the commit
+ * block, but the second one will not.
+ * mirror1: journal={Ac-Bc-Cc-Dc }
+ * mirror2: journal={Ac-Bc-Cc-D }
+ * Now let's assume that we read from mirror1 and find that 'D' has a
+ * valid commit block, so journal_replay will replay that transaction, but
+ * a second power failure may happen before journal_reset(), so the next
+ * journal_replay() may read from mirror2 and find that 'C' is the last valid
+ * transaction. This results in corruption because we already replayed
+ * transaction 'D'.
+ * In order to avoid such ambiguity we should perform a 'stabilize write'.
+ * 1) Read and rewrite the latest commit id block
+ * 2) Invalidate the next block in
+ * order to guarantee that the journal head becomes stable.
+ * Yes, I know that the 'stabilize write' approach is ugly, but this is the only
+ * way to run a filesystem on block devices with the 'raid amnesia' effect
+ */
+static int stabilize_journal_head(journal_t *journal, struct recovery_info *info)
+{
+	struct buffer_head *bh[2] = {NULL, NULL};
+	int err, err2, i;
+
+	if (!info->last_commit_bh)
+		return 0;
+
+	bh[0] = info->last_commit_bh;
+	info->last_commit_bh = NULL;
+
+	err = jread(&bh[1], journal, info->last_log_block);
+	if (err)
+		goto out;
+
+	for (i = 0; i < 2; i++) {
+		lock_buffer(bh[i]);
+		/* Explicitly invalidate block beyond last commit block */
+		if (i == 1)
+			memset(bh[i]->b_data, 0, journal->j_blocksize);
+
+		BUFFER_TRACE(bh[i], "marking dirty");
+		set_buffer_uptodate(bh[i]);
+		mark_buffer_dirty(bh[i]);
+		BUFFER_TRACE(bh[i], "marking uptodate");
+		unlock_buffer(bh[i]);
+	}
+	err = sync_blockdev(journal->j_dev);
+	/* Make sure data is on permanent storage */
+	if (journal->j_flags & JBD2_BARRIER) {
+		err2 = blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);
+		if (!err)
+			err = err2;
+	}
+out:
+	brelse(bh[0]);
+	brelse(bh[1]);
+	return err;
+}
+
 /**
  * jbd2_journal_recover - recovers a on-disk journal
  * @journal: the journal to recover
@@ -265,6 +333,8 @@ int jbd2_journal_recover(journal_t *journal)
 	}
 
 	err = do_one_pass(journal, &info, PASS_SCAN);
+	if (!err)
+		err = stabilize_journal_head(journal, &info);
 	if (!err)
 		err = do_one_pass(journal, &info, PASS_REVOKE);
 	if (!err)
@@ -315,6 +385,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
 	memset (&info, 0, sizeof(info));
 	err = do_one_pass(journal, &info, PASS_SCAN);
+
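The two-step 'stabilize write' described in the jbd2 patch above can be modelled with a toy in-memory mirror. This is a sketch under loud assumptions, not the jbd2 implementation: struct mirror_pair, mirror_read(), mirror_write() and stabilize_head() are hypothetical names, the block size and journal length are arbitrary, and journal wrap-around is ignored. It only illustrates why rewriting the commit block and zeroing the block after it makes later reads independent of which mirror leg answers.

```c
#include <assert.h>
#include <string.h>

#define BLKSZ   8 /* toy block size, an assumption */
#define NBLOCKS 4 /* toy journal length, an assumption */

/* Hypothetical two-way mirror; each leg holds its own copy of the journal. */
struct mirror_pair {
	char leg[2][NBLOCKS][BLKSZ];
};

/* A read is served by a single leg; an out-of-sync pair may disagree. */
static void mirror_read(const struct mirror_pair *m, int leg, int blk, char *buf)
{
	memcpy(buf, m->leg[leg][blk], BLKSZ);
}

/* A write goes to both legs, which brings them back into agreement. */
static void mirror_write(struct mirror_pair *m, int blk, const char *buf)
{
	memcpy(m->leg[0][blk], buf, BLKSZ);
	memcpy(m->leg[1][blk], buf, BLKSZ);
}

/*
 * Toy "stabilize write": rewrite the last commit block as seen by the leg
 * that was scanned, then zero the block after it on both legs.
 */
static void stabilize_head(struct mirror_pair *m, int scanned_leg, int commit_blk)
{
	char buf[BLKSZ];
	char zero[BLKSZ] = {0};

	mirror_read(m, scanned_leg, commit_blk, buf);
	mirror_write(m, commit_blk, buf);
	if (commit_blk + 1 < NBLOCKS) /* no wrap-around in this toy model */
		mirror_write(m, commit_blk + 1, zero);
}
```

After stabilize_head() runs, both legs hold the same commit block and the same zeroed successor, so a second scan reaches the same journal head regardless of the leg it reads from.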
[Devel] [PATCH vz8 1/2] ve/ext4: treat panic_on_errors as remount-ro_on_errors in CTs
From: Dmitry Monakhov

This is a port from 2.6.32-x of:

* diff-ext4-in-containers-treat-panic_on_errors-as-remount-ro_on_errors

ext4: in containers treat errors=panic as

A container can bring down the whole node if it remounts its ploop with the option 'errors=panic' and triggers an abort after that.

Signed-off-by: Konstantin Khlebnikov
Acked-by: Maxim V. Patlasov
Signed-off-by: Dmitry Monakhov

khorenko@: currently we have devmnt->allowed_options, which are configured from userspace, and vzctl currently provides an empty list. This is an additional check, just in case someone gets a secondary ploop image with the 'errors=panic' mount option saved in the image and mounts it from inside a CT.

Signed-off-by: Andrey Ryabinin
---
 fs/ext4/super.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 60c9fb110be3..f6feb495e8b0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1845,6 +1845,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_NO_EXT3	0x0200
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
+#define MOPT_WANT_SYS_ADMIN	0x0800
 
 static const struct mount_opts {
 	int	token;
@@ -1877,7 +1878,7 @@ static const struct mount_opts {
 				    EXT4_MOUNT_JOURNAL_CHECKSUM),
 	 MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
 	{Opt_noload, EXT4_MOUNT_NOLOAD, MOPT_NO_EXT2 | MOPT_SET},
-	{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR},
+	{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR|MOPT_WANT_SYS_ADMIN},
 	{Opt_err_ro, EXT4_MOUNT_ERRORS_RO, MOPT_SET | MOPT_CLEAR_ERR},
 	{Opt_err_cont, EXT4_MOUNT_ERRORS_CONT, MOPT_SET | MOPT_CLEAR_ERR},
 	{Opt_data_err_abort, EXT4_MOUNT_DATA_ERR_ABORT,
@@ -2019,6 +2020,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 	}
 	if (m->flags & MOPT_CLEAR_ERR)
 		clear_opt(sb, ERRORS_MASK);
+	if (m->flags & MOPT_WANT_SYS_ADMIN && !capable(CAP_SYS_ADMIN))
+		return 1;
+
 	if (token == Opt_noquota && sb_any_quota_loaded(sb)) {
 		ext4_msg(sb, KERN_ERR, "Cannot change quota "
 			 "options when quota turned on");
@@ -3892,8 +3896,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
 		set_opt(sb, WRITEBACK_DATA);
 
-	if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
-		set_opt(sb, ERRORS_PANIC);
+	if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC) {
+		if (capable(CAP_SYS_ADMIN))
+			set_opt(sb, ERRORS_PANIC);
+		else
+			set_opt(sb, ERRORS_RO);
+	}
 	else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_CONTINUE)
 		set_opt(sb, ERRORS_CONT);
 	else
--
2.26.2
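The ext4_fill_super() hunk above boils down to a small decision on the error policy stored in the superblock. The following is a minimal sketch of that decision only, with hypothetical names (enum errors_policy, effective_policy()) and a plain boolean standing in for capable(CAP_SYS_ADMIN); it does not model the mount-option path in handle_mount_opt().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the on-disk s_errors values. */
enum errors_policy { ERRORS_CONTINUE, ERRORS_RO, ERRORS_PANIC };

/*
 * Pick the policy that actually takes effect: an on-disk "panic" setting is
 * honoured only for a privileged mounter and is downgraded to "remount
 * read-only" otherwise. All other settings pass through unchanged.
 */
static enum errors_policy effective_policy(enum errors_policy on_disk,
					   bool has_sys_admin)
{
	if (on_disk == ERRORS_PANIC && !has_sys_admin)
		return ERRORS_RO;
	return on_disk;
}
```

With this shape, an image carrying errors=panic that is mounted without the capability ends up with the read-only policy, which appears to be the intent of the patch.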
[Devel] [PATCH RH7] ms/netfilter: nf_nat: don't bug when mapping already exists
It seems preferable to limp along if we have a conflicting mapping; it's certainly better than a BUG().

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso
(cherry picked from commit 75c2631468e8af554057246b2413e738dd96af3d)

This patch fixes a host crash when restarting the firewalld service.

https://jira.sw.ru/browse/PSBM-124668
Signed-off-by: Vasily Averin
---
 net/netfilter/nf_nat_core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 5a48480..790951c 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -402,7 +402,9 @@ nf_nat_setup_info(struct nf_conn *ct,
 
 	NF_CT_ASSERT(maniptype == NF_NAT_MANIP_SRC ||
 		     maniptype == NF_NAT_MANIP_DST);
-	BUG_ON(nf_nat_initialized(ct, maniptype));
+
+	if (WARN_ON(nf_nat_initialized(ct, maniptype)))
+		return NF_DROP;
 
 	/* What we've got will look like inverse of reply. Normally
 	 * this is what is in the conntrack, except for prior
--
1.8.3.1
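The change from BUG_ON() to WARN_ON() plus an early return can be illustrated outside the kernel. This is a userspace sketch with hypothetical names (struct conn, warn_on(), setup_nat(), enum verdict); it is not the netfilter API. It shows the general shape: report the unexpected state, refuse the single operation, and keep the rest of the system running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical verdicts, loosely modelled on drop/accept. */
enum verdict { VERDICT_DROP, VERDICT_ACCEPT };

/* Hypothetical connection state: has a NAT mapping been set up already? */
struct conn {
	bool nat_done;
};

/* Number of warnings emitted so far. */
static unsigned int warnings;

/* Userspace analogue of WARN_ON(): log, count, and return the condition. */
static bool warn_on(bool cond, const char *what)
{
	if (cond) {
		warnings++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

static enum verdict setup_nat(struct conn *ct)
{
	/* Instead of aborting, complain and drop just this packet. */
	if (warn_on(ct->nat_done, "NAT mapping already set up"))
		return VERDICT_DROP;

	ct->nat_done = true;
	return VERDICT_ACCEPT;
}
```

A second setup_nat() call on the same connection returns VERDICT_DROP and logs one warning, where an assertion-style check would have taken the whole process down.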