[Devel] [PATCH rh7 2/2] mm/vmscan: add cond_resched() to loop in shrink_slab_memcg()

2021-02-01 Thread Andrey Ryabinin
shrink_slab_memcg() may iterate for a long time without rescheduling if
there are many memcgs, each with a small number of objects. Add
cond_resched() to avoid a potential soft lockup.

https://jira.sw.ru/browse/PSBM-125095
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 080500f4e366..17a7ed60f525 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -527,6 +527,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 		struct shrinker *shrinker;
 		bool is_nfs;
 
+		cond_resched();
+
 		shrinker = idr_find(&shrinker_idr, i);
 		if (unlikely(!shrinker)) {
 			clear_bit(i, map->map);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/2] mm: memcg: fix memcg reclaim soft lockup

2021-02-01 Thread Andrey Ryabinin
From: Xunlei Pang 

We've hit a soft lockup with "CONFIG_PREEMPT_NONE=y" when the target memcg
doesn't have any reclaimable memory.

It can be easily reproduced as below:

  watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
  CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
  Call Trace:
    shrink_lruvec+0x49f/0x640
    shrink_node+0x2a6/0x6f0
    do_try_to_free_pages+0xe9/0x3e0
    try_to_free_mem_cgroup_pages+0xef/0x1f0
    try_charge+0x2c1/0x750
    mem_cgroup_charge+0xd7/0x240
    __add_to_page_cache_locked+0x2fd/0x370
    add_to_page_cache_lru+0x4a/0xc0
    pagecache_get_page+0x10b/0x2f0
    filemap_fault+0x661/0xad0
    ext4_filemap_fault+0x2c/0x40
    __do_fault+0x4d/0xf9
    handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance for
the OOM reaper to run and reclaim the to-be-killed process.

Add a cond_resched() in the upper-level shrink_node_memcgs() to solve this
issue. This gives us a scheduling point for each memcg in the reclaimed
hierarchy, with no dependency on the amount of reclaimable memory in that
memcg, which makes reclaim more predictable.
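
As a rough userspace illustration of the idea (not kernel code), the sketch
below places the yield at the top of the loop body so that groups with nothing
to reclaim still provide a scheduling point. `struct group`, `yield_point()`
and `reclaim_all()` are hypothetical names, and sched_yield() only stands in
for cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg: only tracks reclaimable pages. */
struct group {
	unsigned long reclaimable;
};

/* Number of scheduling points taken so far. */
static unsigned long yield_count;

/* Userspace analogue of cond_resched(); an assumption, not the kernel API. */
static void yield_point(void)
{
	yield_count++;
	sched_yield();
}

/*
 * Walk all groups and "reclaim" from those that have pages. The yield is
 * the first statement of the loop body, so skipped groups still yield.
 */
static unsigned long reclaim_all(struct group *groups, size_t n)
{
	unsigned long reclaimed = 0;
	size_t i;

	assert(groups != NULL || n == 0);

	for (i = 0; i < n; i++) {
		yield_point();

		if (groups[i].reclaimable == 0)
			continue;	/* nothing to do, but we already yielded */

		reclaimed += groups[i].reclaimable;
		groups[i].reclaimable = 0;
	}
	return reclaimed;
}
```

Had the yield been placed after the `continue`, a long run of empty groups
would loop without ever giving up the CPU, which is the pattern both patches
in this series remove.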

Suggested-by: Michal Hocko 
Signed-off-by: Xunlei Pang 
Signed-off-by: Andrew Morton 
Acked-by: Chris Down 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlp...@linux.alibaba.com
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-125095
(cherry picked from commit e3336cab2579012b1e72b5265adf98e2d6e244ad)
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85622f235e78..080500f4e366 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2684,6 +2684,14 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc,
 		do {
 			unsigned long lru_pages, scanned;
 
+			/*
+			 * This loop can become CPU-bound when target memcgs
+			 * aren't eligible for reclaim - either because they
+			 * don't have any reclaimable pages, or because their
+			 * memory is explicitly protected. Avoid soft lockups.
+			 */
+			cond_resched();
+
 			if (!sc->may_thrash && mem_cgroup_low(root, memcg))
 				continue;
 
-- 
2.26.2



[Devel] [PATCH vz8 2/2] jbd2: raid amnesia protection for the journal

2021-02-01 Thread Andrey Ryabinin
From: Dmitry Monakhov 

https://jira.sw.ru/browse/PSBM-15484

Some block devices can return different data on read requests from the same
block after a power failure (for example, a mirrored RAID is out of sync and
a resync is in progress). In that case the following situation is possible:

A power failure happens after the transaction commit log was issued for
transaction 'D'. On the next boot the first disk will have the commit block,
but the second one will not.
mirror1: journal={Ac-Bc-Cc-Dc }
mirror2: journal={Ac-Bc-Cc-D  }
Now let's assume that we read from mirror1 and find that 'D' has a
valid commit block, so journal_replay will replay that transaction. A
second power failure may happen before journal_reset(), so the next
journal_replay() may read from mirror2 and find that 'C' is the last valid
transaction. This results in corruption because we have already replayed
transaction 'D'.
In order to avoid such ambiguity we should perform a 'stabilize write':
1) Read and rewrite the latest commit id block
2) Invalidate the next block
in order to guarantee that the journal head becomes stable.
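
The scenario can be sketched with a toy two-way mirror in plain userspace C.
This is only an illustration under simplifying assumptions: one byte per
journal block ('d' for data, 'c' for a commit block, 0 for empty), and every
name here (`struct mirror`, `last_commit()`, `stabilize_head()`) is
hypothetical rather than part of jbd2.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 10

/* Toy mirror: a write goes to both copies, a read is served by one. */
struct mirror {
	char copy[2][NBLOCKS];	/* 'c' = commit, 'd' = data, 0 = empty */
};

static char mirror_read(const struct mirror *m, int which, int blk)
{
	assert(which == 0 || which == 1);
	assert(blk >= 0 && blk < NBLOCKS);
	return m->copy[which][blk];
}

static void mirror_write(struct mirror *m, int blk, char val)
{
	assert(blk >= 0 && blk < NBLOCKS);
	m->copy[0][blk] = val;
	m->copy[1][blk] = val;
}

/* Index of the last commit block as seen through one copy, or -1. */
static int last_commit(const struct mirror *m, int which)
{
	int blk, last = -1;

	for (blk = 0; blk < NBLOCKS; blk++) {
		char v = mirror_read(m, which, blk);

		if (v == 0)
			break;		/* end of the valid log on this copy */
		if (v == 'c')
			last = blk;
	}
	return last;
}

/*
 * 'Stabilize write': rewrite the last commit block that was read and zero
 * the block after it, so that both copies agree on the journal head.
 */
static void stabilize_head(struct mirror *m, int which)
{
	int last = last_commit(m, which);

	if (last < 0)
		return;
	mirror_write(m, last, mirror_read(m, which, last));
	if (last + 1 < NBLOCKS)
		mirror_write(m, last + 1, 0);
}
```

Before the stabilize write the two copies can disagree about the last
committed transaction; afterwards a scan through either copy ends at the same
commit block.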

Signed-off-by: Dmitry Monakhov 
Signed-off-by: Andrey Ryabinin 
---
 fs/jbd2/recovery.c | 77 +-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index a4967b27ffb6..78e7d2fed069 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -33,6 +33,9 @@ struct recovery_info
 	int		nr_replays;
 	int		nr_revokes;
 	int		nr_revoke_hits;
+
+	unsigned int		last_log_block;
+	struct buffer_head	*last_commit_bh;
 };
 
 enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
@@ -229,6 +232,71 @@ do {									\
 		var -= ((journal)->j_last - (journal)->j_first);	\
 } while (0)
 
+/*
+ * The 'Raid amnesia' effect protection: https://jira.sw.ru/browse/PSBM-15484
+ *
+ * Some block devices can return different data on read requests from the same
+ * block after a power failure (for example, a mirrored RAID is out of sync and
+ * a resync is in progress). In that case the following situation is possible:
+ *
+ * A power failure happens after the transaction commit log was issued for
+ * transaction 'D'. On the next boot the first disk will have the commit
+ * block, but the second one will not.
+ * mirror1: journal={Ac-Bc-Cc-Dc }
+ * mirror2: journal={Ac-Bc-Cc-D  }
+ * Now let's assume that we read from mirror1 and find that 'D' has a
+ * valid commit block, so journal_replay will replay that transaction. A
+ * second power failure may happen before journal_reset(), so the next
+ * journal_replay() may read from mirror2 and find that 'C' is the last valid
+ * transaction. This results in corruption because we have already replayed
+ * transaction 'D'.
+ * In order to avoid such ambiguity we should perform a 'stabilize write':
+ * 1) Read and rewrite the latest commit id block
+ * 2) Invalidate the next block
+ * in order to guarantee that the journal head becomes stable.
+ * Yes, I know that the 'stabilize write' approach is ugly, but this is the
+ * only way to run a filesystem on block devices with the 'raid amnesia' effect.
+ */
+static int stabilize_journal_head(journal_t *journal, struct recovery_info *info)
+{
+	struct buffer_head *bh[2] = {NULL, NULL};
+	int err, err2, i;
+
+	if (!info->last_commit_bh)
+		return 0;
+
+	bh[0] = info->last_commit_bh;
+	info->last_commit_bh = NULL;
+
+	err = jread(&bh[1], journal, info->last_log_block);
+	if (err)
+		goto out;
+
+	for (i = 0; i < 2; i++) {
+		lock_buffer(bh[i]);
+		/* Explicitly invalidate block beyond last commit block */
+		if (i == 1)
+			memset(bh[i]->b_data, 0, journal->j_blocksize);
+
+		BUFFER_TRACE(bh[i], "marking dirty");
+		set_buffer_uptodate(bh[i]);
+		mark_buffer_dirty(bh[i]);
+		BUFFER_TRACE(bh[i], "marking uptodate");
+		unlock_buffer(bh[i]);
+	}
+	err = sync_blockdev(journal->j_dev);
+	/* Make sure data is on permanent storage */
+	if (journal->j_flags & JBD2_BARRIER) {
+		err2 = blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);
+		if (!err)
+			err = err2;
+	}
+out:
+	brelse(bh[0]);
+	brelse(bh[1]);
+	return err;
+}
+
 /**
  * jbd2_journal_recover - recovers a on-disk journal
  * @journal: the journal to recover
@@ -265,6 +333,8 @@ int jbd2_journal_recover(journal_t *journal)
 	}
 
 	err = do_one_pass(journal, &info, PASS_SCAN);
+	if (!err)
+		err = stabilize_journal_head(journal, &info);
 	if (!err)
 		err = do_one_pass(journal, &info, PASS_REVOKE);
 	if (!err)
@@ -315,6 +385,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
 	memset (&info, 0, sizeof(info));
 
 	err = do_one_pass(journal, &info, PASS_SCAN);
+  

[Devel] [PATCH vz8 1/2] ve/ext4: treat panic_on_errors as remount-ro_on_errors in CTs

2021-02-01 Thread Andrey Ryabinin
From: Dmitry Monakhov 

This is a port from 2.6.32-x of:

* diff-ext4-in-containers-treat-panic_on_errors-as-remount-ro_on_errors

ext4: in containers treat errors=panic as

A container can crash the whole node if it remounts its ploop
with the option 'errors=panic' and triggers an abort after that.

Signed-off-by: Konstantin Khlebnikov 
Acked-by: Maxim V. Patlasov 

Signed-off-by: Dmitry Monakhov 

khorenko@: currently we have devmnt->allowed_options, which is
configured from userspace, and vzctl currently provides an empty list.
This is an additional check, just in case someone gets a secondary
ploop image with the 'errors=panic' mount option saved in the image
and mounts it from inside a CT.
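
The policy for the on-disk default can be summarized with a small userspace
sketch. It is a simplification under stated assumptions: the enum and
`effective_policy()` are hypothetical names, and the `privileged` flag stands
in for the capable(CAP_SYS_ADMIN) check made in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the three ext4 error behaviours. */
enum on_error {
	ON_ERROR_CONTINUE,
	ON_ERROR_REMOUNT_RO,
	ON_ERROR_PANIC,
};

/*
 * Decide which behaviour actually takes effect. An unprivileged mounter
 * asking for "panic" gets "remount read-only" instead; every other
 * request is honoured unchanged.
 */
static enum on_error effective_policy(enum on_error requested, bool privileged)
{
	assert(requested == ON_ERROR_CONTINUE ||
	       requested == ON_ERROR_REMOUNT_RO ||
	       requested == ON_ERROR_PANIC);

	if (requested == ON_ERROR_PANIC && !privileged)
		return ON_ERROR_REMOUNT_RO;
	return requested;
}
```

This models only the superblock-default path in ext4_fill_super(); the
explicit mount option is handled separately in handle_mount_opt().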

Signed-off-by: Andrey Ryabinin 
---
 fs/ext4/super.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 60c9fb110be3..f6feb495e8b0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1845,6 +1845,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_NO_EXT3	0x0200
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
+#define MOPT_WANT_SYS_ADMIN	0x0800
 
 static const struct mount_opts {
 	int	token;
@@ -1877,7 +1878,7 @@ static const struct mount_opts {
 				    EXT4_MOUNT_JOURNAL_CHECKSUM),
 	 MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
 	{Opt_noload, EXT4_MOUNT_NOLOAD, MOPT_NO_EXT2 | MOPT_SET},
-	{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR},
+	{Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR|MOPT_WANT_SYS_ADMIN},
 	{Opt_err_ro, EXT4_MOUNT_ERRORS_RO, MOPT_SET | MOPT_CLEAR_ERR},
 	{Opt_err_cont, EXT4_MOUNT_ERRORS_CONT, MOPT_SET | MOPT_CLEAR_ERR},
 	{Opt_data_err_abort, EXT4_MOUNT_DATA_ERR_ABORT,
@@ -2019,6 +2020,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 	}
 	if (m->flags & MOPT_CLEAR_ERR)
 		clear_opt(sb, ERRORS_MASK);
+	if (m->flags & MOPT_WANT_SYS_ADMIN && !capable(CAP_SYS_ADMIN))
+		return 1;
+
 	if (token == Opt_noquota && sb_any_quota_loaded(sb)) {
 		ext4_msg(sb, KERN_ERR, "Cannot change quota "
 			 "options when quota turned on");
@@ -3892,8 +3896,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
 		set_opt(sb, WRITEBACK_DATA);
 
-	if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
-		set_opt(sb, ERRORS_PANIC);
+	if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC) {
+		if (capable(CAP_SYS_ADMIN))
+			set_opt(sb, ERRORS_PANIC);
+		else
+			set_opt(sb, ERRORS_RO);
+	}
 	else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_CONTINUE)
 		set_opt(sb, ERRORS_CONT);
 	else
-- 
2.26.2



[Devel] [PATCH RH7] ms/netfilter: nf_nat: don't bug when mapping already exists

2021-02-01 Thread Vasily Averin
It seems preferable to limp along if we have a conflicting mapping;
it's certainly better than a BUG().
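
The difference between the two behaviours can be shown with a userspace
sketch: instead of aborting on an unexpected state, log a warning and reject
only the offending item. All names here (`struct conn`, `warn_on()`,
`setup_mapping()`, the `verdict` enum) are hypothetical stand-ins, not the
netfilter API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum verdict { VERDICT_DROP, VERDICT_ACCEPT };

/* Hypothetical connection state: has a mapping already been set up? */
struct conn {
	bool nat_done;
};

/* Log when the condition holds and hand it back, like a WARN_ON-style check. */
static bool warn_on(bool cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/*
 * Set up a mapping for the connection. A second attempt on the same
 * connection is reported and dropped rather than treated as fatal.
 */
static enum verdict setup_mapping(struct conn *c)
{
	assert(c != NULL);

	if (warn_on(c->nat_done, "mapping already exists"))
		return VERDICT_DROP;

	c->nat_done = true;
	return VERDICT_ACCEPT;
}
```

The caller loses one packet in the conflicting case, while the rest of the
system keeps running.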

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
(cherry picked from commit 75c2631468e8af554057246b2413e738dd96af3d)

This patch fixes a host crash during a restart of the firewalld service:
https://jira.sw.ru/browse/PSBM-124668
Signed-off-by: Vasily Averin 
---
 net/netfilter/nf_nat_core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 5a48480..790951c 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -402,7 +402,9 @@ nf_nat_setup_info(struct nf_conn *ct,
 
 	NF_CT_ASSERT(maniptype == NF_NAT_MANIP_SRC ||
 		     maniptype == NF_NAT_MANIP_DST);
-	BUG_ON(nf_nat_initialized(ct, maniptype));
+
+	if (WARN_ON(nf_nat_initialized(ct, maniptype)))
+		return NF_DROP;
 
 	/* What we've got will look like inverse of reply. Normally
 	 * this is what is in the conntrack, except for prior
-- 
1.8.3.1
