[Devel] [PATCH rh7 2/2] mm/vmscan: add cond_resched() to loop in shrink_slab_memcg()

2021-02-01 Thread Andrey Ryabinin
shrink_slab_memcg() may iterate for a long time without rescheduling if we
have many memcgs with a small number of objects. Add cond_resched() to
avoid a potential softlockup.

https://jira.sw.ru/browse/PSBM-125095
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 080500f4e366..17a7ed60f525 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -527,6 +527,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int 
nid,
struct shrinker *shrinker;
bool is_nfs;
 
+   cond_resched();
+
shrinker = idr_find(_idr, i);
if (unlikely(!shrinker)) {
clear_bit(i, map->map);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/2] mm: memcg: fix memcg reclaim soft lockup

2021-02-01 Thread Andrey Ryabinin
From: Xunlei Pang 

We've met a softlockup with "CONFIG_PREEMPT_NONE=y" when the target memcg
doesn't have any reclaimable memory.

It can be easily reproduced as below:

  watchdog: BUG: soft lockup - CPU#0 stuck for 111s![memcg_test:2204]
  CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
  Call Trace:
shrink_lruvec+0x49f/0x640
shrink_node+0x2a6/0x6f0
do_try_to_free_pages+0xe9/0x3e0
try_to_free_mem_cgroup_pages+0xef/0x1f0
try_charge+0x2c1/0x750
mem_cgroup_charge+0xd7/0x240
__add_to_page_cache_locked+0x2fd/0x370
add_to_page_cache_lru+0x4a/0xc0
pagecache_get_page+0x10b/0x2f0
filemap_fault+0x661/0xad0
ext4_filemap_fault+0x2c/0x40
__do_fault+0x4d/0xf9
handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance for the
oom reaper to run and reclaim the memory of the to-be-killed process.

Add a cond_resched() in the upper shrink_node_memcgs() loop to solve this
issue. This gives us a scheduling point for each memcg in the reclaimed
hierarchy, with no dependency on the amount of reclaimable memory in that
memcg, which makes the behaviour more predictable.
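
For illustration, a condensed sketch of the loop shape after the change
(simplified, with a hypothetical per-memcg shrink helper; the actual hunk is
in the diff below):

	do {
		/*
		 * Scheduling point for every memcg in the hierarchy, taken
		 * even when the memcg has nothing reclaimable, so skipped
		 * memcgs can no longer starve the CPU.
		 */
		cond_resched();

		if (!sc->may_thrash && mem_cgroup_low(root, memcg))
			continue;

		shrink_memcg_lruvecs(zone, memcg, sc);	/* hypothetical helper */
	} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));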

Suggested-by: Michal Hocko 
Signed-off-by: Xunlei Pang 
Signed-off-by: Andrew Morton 
Acked-by: Chris Down 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Link: 
http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlp...@linux.alibaba.com
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-125095
(cherry picked from commit e3336cab2579012b1e72b5265adf98e2d6e244ad)
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85622f235e78..080500f4e366 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2684,6 +2684,14 @@ static void shrink_zone(struct zone *zone, struct 
scan_control *sc,
do {
unsigned long lru_pages, scanned;
 
+   /*
+* This loop can become CPU-bound when target memcgs
+* aren't eligible for reclaim - either because they
+* don't have any reclaimable pages, or because their
+* memory is explicitly protected. Avoid soft lockups.
+*/
+   cond_resched();
+
if (!sc->may_thrash && mem_cgroup_low(root, memcg))
continue;
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 2/2] jbd2: raid amnesia protection for the journal

2021-02-01 Thread Andrey Ryabinin
From: Dmitry Monakhov 

https://jira.sw.ru/browse/PSBM-15484

Some block devices can return different data on read requests from the same
block after a power failure (for example, a mirrored RAID that is out of sync
while a resync is in progress). In that case the following situation is
possible:

A power failure happens after the commit block for transaction 'D' was
issued; on the next boot the first disk will have the commit block, but
the second one will not.
mirror1: journal={Ac-Bc-Cc-Dc }
mirror2: journal={Ac-Bc-Cc-D  }
Now assume that we read from mirror1 and find that 'D' has a valid commit
block, so journal_replay() will replay that transaction. A second power
failure may happen before journal_reset(), so the next journal_replay() may
read from mirror2 and find that 'C' is the last valid transaction. This
results in corruption because we have already replayed transaction 'D'.
To avoid such ambiguity we perform a 'stabilize write':
1) Read and rewrite the latest commit block.
2) Invalidate the next block in order to guarantee that the journal head
   becomes stable.
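
For illustration only, a small self-contained user-space sketch of the same
'stabilize write' sequence against an ordinary file standing in for the
journal (block size, file name and block number are made-up assumptions;
this is not the jbd2 code):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/types.h>
	#include <unistd.h>

	#define BLKSZ 4096

	/* Rewrite the last commit block, zero the block after it, then flush. */
	static int stabilize_head(int fd, off_t last_commit_blk)
	{
		char buf[BLKSZ];
		char zero[BLKSZ];

		memset(zero, 0, sizeof(zero));

		/* 1) read the latest commit block back ... */
		if (pread(fd, buf, BLKSZ, last_commit_blk * BLKSZ) != BLKSZ)
			return -1;
		/* ... and write the very same data out again, "stabilizing" it */
		if (pwrite(fd, buf, BLKSZ, last_commit_blk * BLKSZ) != BLKSZ)
			return -1;
		/* 2) invalidate the next block so no stale commit can follow it */
		if (pwrite(fd, zero, BLKSZ, (last_commit_blk + 1) * BLKSZ) != BLKSZ)
			return -1;
		/* make sure both writes reach stable storage on every mirror */
		return fsync(fd);
	}

	int main(void)
	{
		int fd = open("journal.img", O_RDWR);

		if (fd < 0 || stabilize_head(fd, 3 /* assumed last commit block */)) {
			perror("stabilize");
			return 1;
		}
		return close(fd);
	}

After this, every mirror holds the same latest commit block and a zeroed block
behind it, so a later replay cannot see a newer head on one leg and an older
one on the other.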

Signed-off-by: Dmitry Monakhov 
Signed-off-by: Andrey Ryabinin 
---
 fs/jbd2/recovery.c | 77 +-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index a4967b27ffb6..78e7d2fed069 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -33,6 +33,9 @@ struct recovery_info
int nr_replays;
int nr_revokes;
int nr_revoke_hits;
+
+   unsigned intlast_log_block;
+   struct buffer_head  *last_commit_bh;
 };
 
 enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
@@ -229,6 +232,71 @@ do {   
\
var -= ((journal)->j_last - (journal)->j_first);\
 } while (0)
 
+/*
+ * The 'raid amnesia' effect protection: https://jira.sw.ru/browse/PSBM-15484
+ *
+ * Some block devices can return different data on read requests from the same
+ * block after a power failure (for example, a mirrored RAID that is out of
+ * sync while a resync is in progress). In that case the following situation
+ * is possible:
+ *
+ * A power failure happens after the commit block for transaction 'D' was
+ * issued; on the next boot the first disk will have the commit block, but
+ * the second one will not.
+ * mirror1: journal={Ac-Bc-Cc-Dc }
+ * mirror2: journal={Ac-Bc-Cc-D  }
+ * Now assume that we read from mirror1 and find that 'D' has a valid commit
+ * block, so journal_replay() will replay that transaction. A second power
+ * failure may happen before journal_reset(), so the next journal_replay()
+ * may read from mirror2 and find that 'C' is the last valid transaction.
+ * This results in corruption because we have already replayed transaction 'D'.
+ * To avoid such ambiguity we perform a 'stabilize write':
+ * 1) Read and rewrite the latest commit block.
+ * 2) Invalidate the next block in order to guarantee that the journal head
+ *    becomes stable.
+ * Yes, I know that the 'stabilize write' approach is ugly, but this is the
+ * only way to run a filesystem on block devices with the 'raid amnesia'
+ * effect.
+ */
+static int stabilize_journal_head(journal_t *journal, struct recovery_info 
*info)
+{
+   struct buffer_head *bh[2] = {NULL, NULL};
+   int err, err2, i;
+
+   if (!info->last_commit_bh)
+   return 0;
+
+   bh[0] = info->last_commit_bh;
+   info->last_commit_bh = NULL;
+
+   err = jread([1], journal, info->last_log_block);
+   if (err)
+   goto out;
+
+   for (i = 0; i < 2; i++) {
+   lock_buffer(bh[i]);
+   /* Explicitly invalidate block beyond last commit block */
+   if (i == 1)
+   memset(bh[i]->b_data, 0, journal->j_blocksize);
+
+   BUFFER_TRACE(bh[i], "marking dirty");
+   set_buffer_uptodate(bh[i]);
+   mark_buffer_dirty(bh[i]);
+   BUFFER_TRACE(bh[i], "marking uptodate");
+   unlock_buffer(bh[i]);
+   }
+   err = sync_blockdev(journal->j_dev);
+   /* Make sure data is on permanent storage */
+   if (journal->j_flags & JBD2_BARRIER) {
+   err2 = blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);
+   if (!err)
+   err = err2;
+   }
+out:
+   brelse(bh[0]);
+   brelse(bh[1]);
+   return err;
+}
+
 /**
  * jbd2_journal_recover - recovers a on-disk journal
  * @journal: the journal to recover
@@ -265,6 +333,8 @@ int jbd2_journal_recover(journal_t *journal)
}
 
err = do_one_pass(journal, , PASS_SCAN);
+   if (!err)
+   err = stabilize_journal_head(journal, );
if (!err)
err = do_one_pass(journal, , PASS_REVOKE);
if (!err)
@@ -315,6 +385,7 @@ int jbd2_journal_skip_recovery(journal_t *journal)
memset (, 0, siz

[Devel] [PATCH vz8 1/2] ve/ext4: treat panic_on_errors as remount-ro_on_errors in CTs

2021-02-01 Thread Andrey Ryabinin
From: Dmitry Monakhov 

This is a port from 2.6.32-x of:

* diff-ext4-in-containers-treat-panic_on_errors-as-remount-ro_on_errors

ext4: in containers treat errors=panic as errors=remount-ro

A container can bring down the whole node if it remounts its ploop
with the 'errors=panic' option and triggers an abort after that.

Signed-off-by: Konstantin Khlebnikov 
Acked-by: Maxim V. Patlasov 

Signed-off-by: Dmitry Monakhov 

khorenko@: currently we have devmnt->allowed_options, which are
configured from userspace, and vzctl currently provides an empty list.
This is an additional check - just in case someone gets a secondary
ploop image with the 'errors=panic' mount option saved in the image
and mounts it from inside a CT.
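
Condensed, the two checks described above amount to the following (a sketch
abridged from the diff below, not a complete change):

	/* handle_mount_opt(): inside a CT, accept but ignore errors=panic */
	if ((m->flags & MOPT_WANT_SYS_ADMIN) && !capable(CAP_SYS_ADMIN))
		return 1;	/* option is "parsed", the flag is never set */

	/* ext4_fill_super(): downgrade the on-disk default for unprivileged mounts */
	if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC) {
		if (capable(CAP_SYS_ADMIN))
			set_opt(sb, ERRORS_PANIC);
		else
			set_opt(sb, ERRORS_RO);
	}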

Signed-off-by: Andrey Ryabinin 
---
 fs/ext4/super.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 60c9fb110be3..f6feb495e8b0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1845,6 +1845,7 @@ static int clear_qf_name(struct super_block *sb, int 
qtype)
 #define MOPT_NO_EXT3   0x0200
 #define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING0x0400
+#define MOPT_WANT_SYS_ADMIN0x0800
 
 static const struct mount_opts {
int token;
@@ -1877,7 +1878,7 @@ static const struct mount_opts {
EXT4_MOUNT_JOURNAL_CHECKSUM),
 MOPT_EXT4_ONLY | MOPT_SET | MOPT_EXPLICIT},
{Opt_noload, EXT4_MOUNT_NOLOAD, MOPT_NO_EXT2 | MOPT_SET},
-   {Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | MOPT_CLEAR_ERR},
+   {Opt_err_panic, EXT4_MOUNT_ERRORS_PANIC, MOPT_SET | 
MOPT_CLEAR_ERR|MOPT_WANT_SYS_ADMIN},
{Opt_err_ro, EXT4_MOUNT_ERRORS_RO, MOPT_SET | MOPT_CLEAR_ERR},
{Opt_err_cont, EXT4_MOUNT_ERRORS_CONT, MOPT_SET | MOPT_CLEAR_ERR},
{Opt_data_err_abort, EXT4_MOUNT_DATA_ERR_ABORT,
@@ -2019,6 +2020,9 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
}
if (m->flags & MOPT_CLEAR_ERR)
clear_opt(sb, ERRORS_MASK);
+   if (m->flags & MOPT_WANT_SYS_ADMIN && !capable(CAP_SYS_ADMIN))
+   return 1;
+
if (token == Opt_noquota && sb_any_quota_loaded(sb)) {
ext4_msg(sb, KERN_ERR, "Cannot change quota "
 "options when quota turned on");
@@ -3892,8 +3896,12 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
else if ((def_mount_opts & EXT4_DEFM_JMODE) == EXT4_DEFM_JMODE_WBACK)
set_opt(sb, WRITEBACK_DATA);
 
-   if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC)
-   set_opt(sb, ERRORS_PANIC);
+   if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_PANIC) {
+   if (capable(CAP_SYS_ADMIN))
+   set_opt(sb, ERRORS_PANIC);
+   else
+   set_opt(sb, ERRORS_RO);
+   }
else if (le16_to_cpu(sbi->s_es->s_errors) == EXT4_ERRORS_CONTINUE)
set_opt(sb, ERRORS_CONT);
else
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 v2] fs/overlayfs: Fix crash on overlayfs mount

2021-01-18 Thread Andrey Ryabinin
Kdump kernel fails to load because of crash on mount of overlayfs:

 BUG: unable to handle kernel NULL pointer dereference at 0060

 Call Trace:
  seq_path+0x64/0xb0
  print_paths_option+0x79/0xa0
  ovl_show_options+0x3a/0x320
  show_mountinfo+0x1ee/0x290
  seq_read+0x2f8/0x400
  vfs_read+0x9d/0x150
  ksys_read+0x4f/0xb0
  do_syscall_64+0x5b/0x1a0

This is caused by an out-of-bounds access of ofs->lowerpaths.
We pass ofs->numlayer to print_paths_option() as the size of the ->lowerpaths
array, but that is not what it is.

The correct number of lowerpaths elements is ->numlower in 'struct ovl_entry'.
So move lowerpaths there and use oe->numlower as the array size.
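
A minimal sketch of the resulting layout (abridged; the full change is in the
diff below):

	struct ovl_entry {
		/* ... */
		unsigned numlower;		/* number of lower layers for this entry */
		struct path *lowerpaths;	/* array of exactly numlower elements */
		struct ovl_path lowerstack[];
	};

	/* ovl_show_options() then iterates oe->numlower entries, never ofs->numlayer */
	print_paths_option(m, "lowerdir", oe->lowerpaths, oe->numlower);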

Fixes: 17fc61697f73 ("overlayfs: add dynamic path resolving in mount options")
Fixes: 2191d729083d ("overlayfs: add mnt_id paths options")

https://jira.sw.ru/browse/PSBM-123508
Signed-off-by: Andrey Ryabinin 
---
 fs/overlayfs/ovl_entry.h |  2 +-
 fs/overlayfs/super.c | 37 -
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index ea1906448ec5..2315089a0211 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -54,7 +54,6 @@ struct ovl_fs {
unsigned int numlayer;
/* Number of unique fs among layers including upper fs */
unsigned int numfs;
-   struct path *lowerpaths;
const struct ovl_layer *layers;
struct ovl_sb *fs;
/* workbasepath is the path at workdir= mount option */
@@ -98,6 +97,7 @@ struct ovl_entry {
struct rcu_head rcu;
};
unsigned numlower;
+   struct path *lowerpaths;
struct ovl_path lowerstack[];
 };
 
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 3755f280036f..fb419617564c 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -70,8 +70,12 @@ static void ovl_entry_stack_free(struct ovl_entry *oe)
 {
unsigned int i;
 
-   for (i = 0; i < oe->numlower; i++)
+   for (i = 0; i < oe->numlower; i++) {
dput(oe->lowerstack[i].dentry);
+   if (oe->lowerpaths)
+   path_put(>lowerpaths[i]);
+   }
+   kfree(oe->lowerpaths);
 }
 
 static bool ovl_metacopy_def = IS_ENABLED(CONFIG_OVERLAY_FS_METACOPY);
@@ -241,11 +245,6 @@ static void ovl_free_fs(struct ovl_fs *ofs)
ovl_inuse_unlock(ofs->upper_mnt->mnt_root);
mntput(ofs->upper_mnt);
path_put(>upperpath);
-   if (ofs->lowerpaths) {
-   for (i = 0; i < ofs->numlayer; i++)
-   path_put(>lowerpaths[i]);
-   kfree(ofs->lowerpaths);
-   }
for (i = 1; i < ofs->numlayer; i++) {
iput(ofs->layers[i].trap);
mntput(ofs->layers[i].mnt);
@@ -359,9 +358,10 @@ static int ovl_show_options(struct seq_file *m, struct 
dentry *dentry)
 {
struct super_block *sb = dentry->d_sb;
struct ovl_fs *ofs = sb->s_fs_info;
+   struct ovl_entry *oe = OVL_E(dentry);
 
if (ovl_dyn_path_opts) {
-   print_paths_option(m, "lowerdir", ofs->lowerpaths, 
ofs->numlayer);
+   print_paths_option(m, "lowerdir", oe->lowerpaths, oe->numlower);
if (ofs->config.upperdir) {
print_path_option(m, "upperdir", >upperpath);
print_path_option(m, "workdir", >workbasepath);
@@ -375,7 +375,7 @@ static int ovl_show_options(struct seq_file *m, struct 
dentry *dentry)
}
 
if (ovl_mnt_id_path_opts) {
-   print_mnt_ids_option(m, "lowerdir_mnt_id", ofs->lowerpaths, 
ofs->numlayer);
+   print_mnt_ids_option(m, "lowerdir_mnt_id", oe->lowerpaths, 
oe->numlower);
/*
 * We don't need to show mnt_id for workdir because it
 * on the same mount as upperdir.
@@ -1626,6 +1626,7 @@ static struct ovl_entry *ovl_get_lowerstack(struct 
super_block *sb,
int err;
char *lowertmp, *lower;
unsigned int stacklen, numlower = 0, i;
+   struct path *stack = NULL;
struct ovl_entry *oe;
 
err = -ENOMEM;
@@ -1649,14 +1650,14 @@ static struct ovl_entry *ovl_get_lowerstack(struct 
super_block *sb,
}
 
err = -ENOMEM;
-   ofs->lowerpaths = kcalloc(stacklen, sizeof(struct path), GFP_KERNEL);
-   if (!ofs->lowerpaths)
+   stack = kcalloc(stacklen, sizeof(struct path), GFP_KERNEL);
+   if (!stack)
goto out_err;
 
err = -EINVAL;
lower = lowertmp;
for (numlower = 0; numlower < stacklen; numlower++) {
-   err = ovl_lower_dir(lower, >lowerpaths[numlower], ofs,
+   err = ovl_lower_dir(lower, [numlower], ofs,
  

[Devel] [PATCH vz8] fs/overlayfs: Fix crash on overlayfs mount

2021-01-13 Thread Andrey Ryabinin
Kdump kernel fails to load because of crash on mount of overlayfs:

 BUG: unable to handle kernel NULL pointer dereference at 0060

 Call Trace:
  seq_path+0x64/0xb0
  print_paths_option+0x79/0xa0
  ovl_show_options+0x3a/0x320
  show_mountinfo+0x1ee/0x290
  seq_read+0x2f8/0x400
  vfs_read+0x9d/0x150
  ksys_read+0x4f/0xb0
  do_syscall_64+0x5b/0x1a0

This is caused by an out-of-bounds access of ofs->lowerpaths.
We pass ofs->numlayer to print_paths_option() as the size of the ->lowerpaths
array, but that is not what it is. We could probably pass 'ofs->numlayer - 1'
as the number of lower layers/paths, but it's better to remove lowerpaths
completely. All the necessary information is already contained in
'struct ovl_entry'. Use it to print the paths instead.

Fixes: 17fc61697f73 ("overlayfs: add dynamic path resolving in mount options")
Fixes: 2191d729083d ("overlayfs: add mnt_id paths options")

https://jira.sw.ru/browse/PSBM-123508
Signed-off-by: Andrey Ryabinin 
---
 fs/overlayfs/overlayfs.h |  4 ++--
 fs/overlayfs/ovl_entry.h |  1 -
 fs/overlayfs/super.c | 30 ++
 fs/overlayfs/util.c  | 13 +
 4 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 7e103d002819..a708ebbd2e21 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -313,10 +313,10 @@ ssize_t ovl_getxattr(struct dentry *dentry, char *name, 
char **value,
 
 void print_path_option(struct seq_file *m, const char *name, struct path 
*path);
 void print_paths_option(struct seq_file *m, const char *name,
-   struct path *paths, unsigned int num);
+   struct ovl_path *paths, unsigned int num);
 void print_mnt_id_option(struct seq_file *m, const char *name, struct path 
*path);
 void print_mnt_ids_option(struct seq_file *m, const char *name,
-   struct path *paths, unsigned int num);
+   struct ovl_path *paths, unsigned int num);
 
 static inline bool ovl_is_impuredir(struct dentry *dentry)
 {
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index ea1906448ec5..4e7272c7e4dd 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -54,7 +54,6 @@ struct ovl_fs {
unsigned int numlayer;
/* Number of unique fs among layers including upper fs */
unsigned int numfs;
-   struct path *lowerpaths;
const struct ovl_layer *layers;
struct ovl_sb *fs;
/* workbasepath is the path at workdir= mount option */
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 3755f280036f..069d365a609d 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -241,11 +241,6 @@ static void ovl_free_fs(struct ovl_fs *ofs)
ovl_inuse_unlock(ofs->upper_mnt->mnt_root);
mntput(ofs->upper_mnt);
path_put(>upperpath);
-   if (ofs->lowerpaths) {
-   for (i = 0; i < ofs->numlayer; i++)
-   path_put(>lowerpaths[i]);
-   kfree(ofs->lowerpaths);
-   }
for (i = 1; i < ofs->numlayer; i++) {
iput(ofs->layers[i].trap);
mntput(ofs->layers[i].mnt);
@@ -359,9 +354,10 @@ static int ovl_show_options(struct seq_file *m, struct 
dentry *dentry)
 {
struct super_block *sb = dentry->d_sb;
struct ovl_fs *ofs = sb->s_fs_info;
+   struct ovl_entry *oe = OVL_E(dentry);
 
if (ovl_dyn_path_opts) {
-   print_paths_option(m, "lowerdir", ofs->lowerpaths, 
ofs->numlayer);
+   print_paths_option(m, "lowerdir", oe->lowerstack, oe->numlower);
if (ofs->config.upperdir) {
print_path_option(m, "upperdir", >upperpath);
print_path_option(m, "workdir", >workbasepath);
@@ -375,7 +371,8 @@ static int ovl_show_options(struct seq_file *m, struct 
dentry *dentry)
}
 
if (ovl_mnt_id_path_opts) {
-   print_mnt_ids_option(m, "lowerdir_mnt_id", ofs->lowerpaths, 
ofs->numlayer);
+   print_mnt_ids_option(m, "lowerdir_mnt_id", oe->lowerstack, 
oe->numlower);
+
/*
 * We don't need to show mnt_id for workdir because it
 * on the same mount as upperdir.
@@ -1625,6 +1622,7 @@ static struct ovl_entry *ovl_get_lowerstack(struct 
super_block *sb,
 {
int err;
char *lowertmp, *lower;
+   struct path *stack = NULL;
unsigned int stacklen, numlower = 0, i;
struct ovl_entry *oe;
 
@@ -1649,14 +1647,14 @@ static struct ovl_entry *ovl_get_lowerstack(struct 
super_block *sb,
}
 
err = -ENOMEM;
-   ofs->lowerpaths = kcalloc(stacklen, sizeof(struct path), GFP_KERNEL);
-   if (!ofs->lowerpaths)
+   stack = kcalloc(stacklen, sizeo

[Devel] [PATCH vz8] netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports

2020-12-17 Thread Andrey Ryabinin
From: Jozsef Kadlecsik 

In the case of huge hash:* types of sets, due to the single spinlock of
a set the processing of the whole set under spinlock protection could take
too long.

There were four places where the whole hash table of the set was processed
from bucket to bucket while holding the spinlock:

- During resizing a set, the original set was locked to exclude kernel side
  add/del element operations (userspace add/del is excluded by the
  nfnetlink mutex). The original set is actually just read during the
  resize, so the spinlocking is replaced with rcu locking of regions.
  However, this means there can be parallel kernel side add/del of entries.
  In order not to lose those operations, a backlog is added and replayed
  after the successful resize.
- Garbage collection of timed out entries was also protected by the spinlock.
  In order not to lock too long, region locking is introduced and a single
  region is processed in one gc go. Also, the simple timer based gc running
  is replaced with a workqueue based solution. The internal book-keeping
  (number of elements, size of extensions) is moved to region level due to
  the region locking.
- Adding elements: when the max number of the elements is reached, the gc
  was called to evict the timed out entries. The new approach is that the gc
  is called just for the matching region, assuming that if the region
  (proportionally) seems to be full, then the whole set probably is too. We
  could scan the other regions to check every entry under rcu locking, but
  for huge sets it'd mean a slowdown when adding elements.
- Listing the set header data: when the set was defined with timeout
  support, the garbage collector was called to clean up timed out entries
  to get the correct element numbers and set size values. Now the set is
  scanned to check non-timed out entries, without actually calling the gc
  for the whole set.

Thanks to Florian Westphal for helping me to solve the SOFTIRQ-safe ->
SOFTIRQ-unsafe lock order issues during working on the patch.
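
The core of the locking split can be seen in the following fragment,
reconstructed from the diff below with comments added (the per-region hash
machinery itself is much larger):

	struct ip_set_region {
		spinlock_t lock;	/* protects one slice of the hash buckets */
		size_t ext_size;	/* size of the dynamic extensions in this region */
		u32 elements;		/* number of elements vs timeout */
	};

	/* The set-wide lock is only taken for set types without region locking. */
	static inline void
	ip_set_lock(struct ip_set *set)
	{
		if (!set->variant->region_lock)
			spin_lock_bh(&set->lock);
	}

	static inline void
	ip_set_unlock(struct ip_set *set)
	{
		if (!set->variant->region_lock)
			spin_unlock_bh(&set->lock);
	}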

Reported-by: syzbot+4b0e9d4ff3cf11783...@syzkaller.appspotmail.com
Reported-by: syzbot+c27b8d5010f45c666...@syzkaller.appspotmail.com
Reported-by: syzbot+68a806795ac89df3a...@syzkaller.appspotmail.com
Fixes: 23c42a403a9c ("netfilter: ipset: Introduction of new commands and 
protocol version 7")
Signed-off-by: Jozsef Kadlecsik 

https://jira.sw.ru/browse/PSBM-123524
(cherry picked from commit f66ee0410b1c3481ee75e5db9b34547b4d582465)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/netfilter/ipset/ip_set.h |  11 +-
 net/netfilter/ipset/ip_set_core.c  |  34 +-
 net/netfilter/ipset/ip_set_hash_gen.h  | 633 +
 3 files changed, 472 insertions(+), 206 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h 
b/include/linux/netfilter/ipset/ip_set.h
index e499d170f12d..3c49b540c701 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -124,6 +124,7 @@ struct ip_set_ext {
u32 timeout;
u8 packets_op;
u8 bytes_op;
+   bool target;
 };
 
 struct ip_set;
@@ -190,6 +191,14 @@ struct ip_set_type_variant {
/* Return true if "b" set is the same as "a"
 * according to the create set parameters */
bool (*same_set)(const struct ip_set *a, const struct ip_set *b);
+   /* Region-locking is used */
+   bool region_lock;
+};
+
+struct ip_set_region {
+   spinlock_t lock;/* Region lock */
+   size_t ext_size;/* Size of the dynamic extensions */
+   u32 elements;   /* Number of elements vs timeout */
 };
 
 /* The core set type structure */
@@ -461,7 +470,7 @@ bitmap_bytes(u32 a, u32 b)
 #include 
 
 #define IP_SET_INIT_KEXT(skb, opt, set)\
-   { .bytes = (skb)->len, .packets = 1,\
+   { .bytes = (skb)->len, .packets = 1, .target = true,\
  .timeout = ip_set_adt_opt_timeout(opt, set) }
 
 #define IP_SET_INIT_UEXT(set)  \
diff --git a/net/netfilter/ipset/ip_set_core.c 
b/net/netfilter/ipset/ip_set_core.c
index 56b59904feca..615b5791edf2 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -558,6 +558,20 @@ ip_set_rcu_get(struct net *net, ip_set_id_t index)
return set;
 }
 
+static inline void
+ip_set_lock(struct ip_set *set)
+{
+   if (!set->variant->region_lock)
+   spin_lock_bh(>lock);
+}
+
+static inline void
+ip_set_unlock(struct ip_set *set)
+{
+   if (!set->variant->region_lock)
+   spin_unlock_bh(>lock);
+}
+
 int
 ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
const struct xt_action_param *par, struct ip_set_adt_opt *opt)
@@ -579,9 +593,9 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
if (ret == -EAGAIN) {
/* Type requests element to be completed */
  

[Devel] [PATCH vz8] ptrace: fix task_join_group_stop() for the case when current is traced

2020-12-17 Thread Andrey Ryabinin
From: Oleg Nesterov 

This testcase

#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/ptrace.h>
#include <pthread.h>
#include <assert.h>

void *tf(void *arg)
{
return NULL;
}

int main(void)
{
int pid = fork();
if (!pid) {
kill(getpid(), SIGSTOP);

pthread_t th;
pthread_create(&th, NULL, tf, NULL);

return 0;
}

waitpid(pid, NULL, WSTOPPED);

ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE);
waitpid(pid, NULL, 0);

ptrace(PTRACE_CONT, pid, 0,0);
waitpid(pid, NULL, 0);

int status;
int thread = waitpid(-1, &status, 0);
assert(thread > 0 && thread != pid);
assert(status == 0x80137f);

return 0;
}

fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap().

This is because task_join_group_stop() has 2 problems when current is traced:

1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee
   can be woken up by debugger and it can clone another thread which
   should join the group-stop.

   We need to check group_stop_count || SIGNAL_STOP_STOPPED.

2. If SIGNAL_STOP_STOPPED is already set, we should not increment
   sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread
   should stop without another do_notify_parent_cldstop() report.

To clarify, the problem is very old and we should blame
ptrace_init_task().  But now that we have task_join_group_stop() it makes
more sense to fix this helper to avoid the code duplication.
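
Condensed from the diff below, the fixed helper handles both cases explicitly
(comments added):

	void task_join_group_stop(struct task_struct *task)
	{
		unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK;
		struct signal_struct *sig = current->signal;

		/* (1) a group stop is in progress: the new thread must consume it too */
		if (sig->group_stop_count) {
			sig->group_stop_count++;
			mask |= JOBCTL_STOP_CONSUME;
		/* (2) group stop already completed: stop, but don't re-notify the parent */
		} else if (!(sig->flags & SIGNAL_STOP_STOPPED))
			return;

		/* Have the new thread join the on-going signal group stop */
		task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING);
	}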

Reported-by: syzbot+3485e3773f7da290e...@syzkaller.appspotmail.com
Signed-off-by: Oleg Nesterov 
Signed-off-by: Andrew Morton 
Cc: Jens Axboe 
Cc: Christian Brauner 
Cc: "Eric W . Biederman" 
Cc: Zhiqiang Liu 
Cc: Tejun Heo 
Cc: 
Link: https://lkml.kernel.org/r/20201019134237.ga18...@redhat.com
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-123525
(cherry picked from commit 7b3c36fc4c231ca532120bbc0df67a12f09c1d96)
Signed-off-by: Andrey Ryabinin 
---
 kernel/signal.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 177cd7f04acb..171f7496f811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -388,16 +388,17 @@ static bool task_participate_group_stop(struct 
task_struct *task)
 
 void task_join_group_stop(struct task_struct *task)
 {
+   unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK;
+   struct signal_struct *sig = current->signal;
+
+   if (sig->group_stop_count) {
+   sig->group_stop_count++;
+   mask |= JOBCTL_STOP_CONSUME;
+   } else if (!(sig->flags & SIGNAL_STOP_STOPPED))
+   return;
+
/* Have the new thread join an on-going signal group stop */
-   unsigned long jobctl = current->jobctl;
-   if (jobctl & JOBCTL_STOP_PENDING) {
-   struct signal_struct *sig = current->signal;
-   unsigned long signr = jobctl & JOBCTL_STOP_SIGMASK;
-   unsigned long gstop = JOBCTL_STOP_PENDING | JOBCTL_STOP_CONSUME;
-   if (task_set_jobctl_pending(task, signr | gstop)) {
-   sig->group_stop_count++;
-   }
-   }
+   task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING);
 }
 
 /*
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] vdso: fix VM_BUG_ON_PAGE(PageSlab(page)) on unmap

2020-12-15 Thread Andrey Ryabinin
vdso_data is mapped to userspace, which means that we can't
use kmalloc() to allocate it. kmalloc() doesn't even guarantee
that we will get page-aligned memory.

 kernel BUG at include/linux/mm.h:693!
 RIP: 0010:unmap_page_range+0x15f2/0x2630
 Call Trace:
  unmap_vmas+0x11e/0x1d0
  exit_mmap+0x215/0x420
  mmput+0x10a/0x400
  do_exit+0x98f/0x2d00
  do_group_exit+0xec/0x2b0
  __x64_sys_exit_group+0x3a/0x50
  do_syscall_64+0xa5/0x4d0
  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

Use alloc_pages_exact() to allocate it. We can't use
alloc_pages() or __get_free_pages() here since vdso_fault()
needs to perform get_page() on individual sub-pages, and alloc_pages()
doesn't initialize sub-pages.
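
A condensed sketch of the allocation change (hypothetical wrapper names; the
real hunks are in the diff below):

	/*
	 * The buffer is later mmapped into userspace and get_page()'d per
	 * sub-page, so it must consist of whole, individually initialized
	 * pages; kmalloc() guarantees neither page alignment nor that.
	 */
	static void *alloc_ve_vdso_copy(size_t size)
	{
		return alloc_pages_exact(size, GFP_KERNEL);
	}

	static void free_ve_vdso_copy(void *data, size_t size)
	{
		free_pages_exact(data, size);
	}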

https://jira.sw.ru/browse/PSBM-123551
Signed-off-by: Andrey Ryabinin 
---
 kernel/ve/ve.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index b114e2918bb7..0c6630c6616a 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -568,7 +568,7 @@ static int copy_vdso(struct vdso_image **vdso_dst, const 
struct vdso_image *vdso
if (!vdso)
return -ENOMEM;
 
-   vdso_data = kmalloc(vdso_src->size, GFP_KERNEL);
+   vdso_data = alloc_pages_exact(vdso_src->size, GFP_KERNEL);
if (!vdso_data) {
kfree(vdso);
return -ENOMEM;
@@ -585,11 +585,11 @@ static int copy_vdso(struct vdso_image **vdso_dst, const 
struct vdso_image *vdso
 static void ve_free_vdso(struct ve_struct *ve)
 {
if (ve->vdso_64 && ve->vdso_64 != _image_64) {
-   kfree(ve->vdso_64->data);
+   free_pages_exact(ve->vdso_64->data, ve->vdso_64->size);
kfree(ve->vdso_64);
}
if (ve->vdso_32 && ve->vdso_32 != _image_32) {
-   kfree(ve->vdso_32->data);
+   free_pages_exact(ve->vdso_32->data, ve->vdso_32->size);
kfree(ve->vdso_32);
}
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] mm, memcg: Fix "add oom counter to memory.stat memcgroup file"

2020-12-08 Thread Andrey Ryabinin
Fix rebase of commit 3f10e0c1a0df12a2a503d0d9a3ec7b4f3ac3a467
Author: Andrey Ryabinin 
Date: Mon Oct 5 13:18:40 2020 +0300

mm, memcg: add oom counter to memory.stat memcgroup file

https://jira.sw.ru/browse/PSBM-123537
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 38 +-
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 01c012b11243..c0f825a4c43e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4141,12 +4141,28 @@ static const unsigned int memcg1_events[] = {
PSWPOUT,
 };
 
+static void accumulate_ooms(struct mem_cgroup *memcg, unsigned long *total_oom,
+   unsigned long *total_oom_kill)
+{
+   struct mem_cgroup *mi;
+
+   total_oom_kill = total_oom = 0;
+
+   for_each_mem_cgroup_tree(mi, memcg) {
+   total_oom += atomic_long_read(>memory_events[MEMCG_OOM]);
+   total_oom_kill += 
atomic_long_read(>memory_events[MEMCG_OOM_KILL]);
+
+   cond_resched();
+   }
+}
+
 static int memcg_stat_show(struct seq_file *m, void *v)
 {
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
unsigned long memory, memsw;
struct mem_cgroup *mi;
unsigned int i;
+   unsigned long total_oom = 0, total_oom_kill = 0;
 
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
@@ -4162,21 +4178,19 @@ static int memcg_stat_show(struct seq_file *m, void *v)
seq_printf(m, "%s %lu\n", vm_event_name(memcg1_events[i]),
   memcg_events_local(memcg, memcg1_events[i]));
 
+
+   accumulate_ooms(memcg, _oom, _oom_kill);
+
/*
 * For root_mem_cgroup we want to account global ooms as well.
 * The diff between all MEMCG_OOM_KILL and MEMCG_OOM events
 * should give us the global ooms count.
 */
-   if (memcg == root_mem_cgroup) {
-   unsigned long glob_ooms;
-
-   glob_ooms = memcg_events(memcg, memcg1_events[MEMCG_OOM_KILL]) -
-   memcg_events(memcg, memcg1_events[MEMCG_OOM]);
-   seq_printf(m, "oom %lu\n", glob_ooms +
-   memcg_events_local(memcg, memcg1_events[MEMCG_OOM]));
-   } else
+   if (memcg == root_mem_cgroup)
+   seq_printf(m, "oom %lu\n", total_oom_kill - total_oom);
+   else
seq_printf(m, "oom %lu\n",
-   memcg_events_local(memcg, memcg1_events[MEMCG_OOM]));
+   atomic_long_read(>memory_events[MEMCG_OOM]));
 
for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "%s %lu\n", lru_list_name(i),
@@ -4209,11 +4223,9 @@ static int memcg_stat_show(struct seq_file *m, void *v)
   (u64)memcg_events(memcg, memcg1_events[i]));
 
if (memcg == root_mem_cgroup)
-   seq_printf(m, "total_oom %lu\n",
-   memcg_events(memcg, memcg1_events[MEMCG_OOM_KILL]));
+   seq_printf(m, "total_oom %lu\n", total_oom_kill);
else
-   seq_printf(m, "total_oom %lu\n",
-   memcg_events(memcg, memcg1_events[MEMCG_OOM]));
+   seq_printf(m, "total_oom %lu\n", total_oom);
 
for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "total_%s %llu\n", lru_list_name(i),
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 4.1/8] ms/mm/rmap: share the i_mmap_rwsem fix

2020-12-01 Thread Andrey Ryabinin
Use down_read_nested() to avoid a lockdep complaint.

Signed-off-by: Andrey Ryabinin 
---
 mm/rmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 523957450d20..90cf61e209ac 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1724,7 +1724,7 @@ static int rmap_walk_file(struct page *page, struct 
rmap_walk_control *rwc)
return ret;
pgoff = page_to_pgoff(page);
 
-   i_mmap_lock_read(mapping);
+   down_read_nested(>i_mmap_rwsem, SINGLE_DEPTH_NESTING);
vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 8/8] ms/mm/memory.c: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

The unmap_mapping_range family of functions do the unmapping of user pages
(ultimately via zap_page_range_single) without touching the actual
interval tree, thus share the lock.

Signed-off-by: Davidlohr Bueso 
Cc: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit c8475d144abb1e62958cc5ec281d2a9e161c1946)
Signed-off-by: Andrey Ryabinin 
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e66dea08f3f..3e5124d14996 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2712,10 +2712,10 @@ void unmap_mapping_range(struct address_space *mapping,
if (details.last_index < details.first_index)
details.last_index = ULONG_MAX;
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
if (unlikely(!RB_EMPTY_ROOT(>i_mmap)))
unmap_mapping_range_tree(>i_mmap, );
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 7/8] ms/mm/nommu: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Shrinking/truncate logic can call nommu_shrink_inode_mappings() to verify
that any shared mappings of the inode in question aren't broken (dead
zone).  afaict the only user being ramfs to handle the size change
attribute.

Pretty much a no-brainer to share the lock.

Signed-off-by: Davidlohr Bueso 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 1acf2e040721564d579297646862b8ea3dd4511b)
Signed-off-by: Andrey Ryabinin 
---
 mm/nommu.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index f994621e52f0..290fe3031147 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -2134,14 +2134,14 @@ int nommu_shrink_inode_mappings(struct inode *inode, 
size_t size,
high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
down_write(_region_sem);
-   i_mmap_lock_write(inode->i_mapping);
+   i_mmap_lock_read(inode->i_mapping);
 
/* search for VMAs that fall within the dead zone */
vma_interval_tree_foreach(vma, >i_mapping->i_mmap, low, high) {
/* found one - only interested if it's shared out of the page
 * cache */
if (vma->vm_flags & VM_SHARED) {
-   i_mmap_unlock_write(inode->i_mapping);
+   i_mmap_unlock_read(inode->i_mapping);
up_write(_region_sem);
return -ETXTBSY; /* not quite true, but near enough */
}
@@ -2153,8 +2153,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, 
size_t size,
 * we don't check for any regions that start beyond the EOF as there
 * shouldn't be any
 */
-   vma_interval_tree_foreach(vma, >i_mapping->i_mmap,
- 0, ULONG_MAX) {
+   vma_interval_tree_foreach(vma, >i_mapping->i_mmap, 0, ULONG_MAX) 
{
if (!(vma->vm_flags & VM_SHARED))
continue;
 
@@ -2169,7 +2168,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, 
size_t size,
}
}
 
-   i_mmap_unlock_write(inode->i_mapping);
+   i_mmap_unlock_read(inode->i_mapping);
up_write(_region_sem);
return 0;
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 5/8] ms/uprobes: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Both register and unregister call build_map_info() in order to create the
list of mappings before installing or removing breakpoints for every mm
which maps file backed memory.  As such, there is no reason to hold the
i_mmap_rwsem exclusively, so share it and allow concurrent readers to
build the mapping data.

Signed-off-by: Davidlohr Bueso 
Acked-by: Srikar Dronamraju 
Acked-by: "Kirill A. Shutemov" 
Cc: Oleg Nesterov 
Acked-by: Hugh Dickins 
Acked-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 4a23717a236b2ab31efb1651f586126789fc997f)
Signed-off-by: Andrey Ryabinin 
---
 kernel/events/uprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 9f312227a769..be501d8d9704 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -690,7 +690,7 @@ build_map_info(struct address_space *mapping, loff_t 
offset, bool is_register)
int more = 0;
 
  again:
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
@@ -721,7 +721,7 @@ build_map_info(struct address_space *mapping, loff_t 
offset, bool is_register)
info->mm = vma->vm_mm;
info->vaddr = offset_to_vaddr(vma, offset);
}
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
 
if (!more)
goto out;
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 6/8] ms/mm/memory-failure: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

No brainer conversion: collect_procs_file() only schedules a process for
later kill, share the lock, similarly to the anon vma variant.

Signed-off-by: Davidlohr Bueso 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit d28eb9c861f41aa2af4cfcc5eeeddff42b13d31e)
Signed-off-by: Andrey Ryabinin 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index da1ef2edd5dd..a5f5e604c0b8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -497,7 +497,7 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
struct task_struct *tsk;
struct address_space *mapping = page->mapping;
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
qread_lock(_lock);
for_each_process(tsk) {
pgoff_t pgoff = page_to_pgoff(page);
@@ -519,7 +519,7 @@ static void collect_procs_file(struct page *page, struct 
list_head *to_kill,
}
}
qread_unlock(_lock);
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
 }
 
 /*
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 1/8] ms/mm, fs: introduce helpers around the i_mmap_mutex

2020-11-30 Thread Andrey Ryabinin
afe.

This patchset has been tested with: postgres 9.4 (with brand new hugetlb
support), hugetlbfs test suite (all tests pass, in fact more tests pass
with these changes than with an upstream kernel), ltp, aim7 benchmarks,
memcached and iozone with the -B option for mmap'ing.  *Untested* paths
are nommu, memory-failure, uprobes and xip.

This patch (of 8):

Various parts of the kernel acquire and release this mutex, so add
i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will
encapsulate this logic.  The next patch will make use of these.
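
Callers introduced by the next patch then read like this (illustrative pair,
not taken verbatim from the series):

	i_mmap_lock_write(mapping);	/* was: mutex_lock(&mapping->i_mmap_mutex) */
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff)
		handle_vma(vma);	/* hypothetical per-vma work */
	i_mmap_unlock_write(mapping);	/* was: mutex_unlock(&mapping->i_mmap_mutex) */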

Signed-off-by: Davidlohr Bueso 
Reviewed-by: Rik van Riel 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 8b28f621bea6f84d44adf7e804b73aff1e09105b)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/fs.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 55a92ce36e94..e32cb9b71042 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -698,6 +698,16 @@ struct block_device {
 
 int mapping_tagged(struct address_space *mapping, int tag);
 
+static inline void i_mmap_lock_write(struct address_space *mapping)
+{
+   mutex_lock(>i_mmap_mutex);
+}
+
+static inline void i_mmap_unlock_write(struct address_space *mapping)
+{
+   mutex_unlock(>i_mmap_mutex);
+}
+
 /*
  * Might pages of this file be mapped into userspace?
  */
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 4/8] ms/mm/rmap: share the i_mmap_rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Similarly to the anon memory counterpart, we can share the mapping's lock
ownership as the interval tree is not modified when doing the walk, only
the file page.

Signed-off-by: Davidlohr Bueso 
Acked-by: Rik van Riel 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 3dec0ba0be6a532cac949e02b853021bf6d57dad)
Signed-off-by: Andrey Ryabinin 
---
 include/linux/fs.h | 10 ++
 mm/rmap.c  |  9 +
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f422b0f7b02a..acedffc46fe4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -709,6 +709,16 @@ static inline void i_mmap_unlock_write(struct 
address_space *mapping)
up_write(>i_mmap_rwsem);
 }
 
+static inline void i_mmap_lock_read(struct address_space *mapping)
+{
+   down_read(>i_mmap_rwsem);
+}
+
+static inline void i_mmap_unlock_read(struct address_space *mapping)
+{
+   up_read(>i_mmap_rwsem);
+}
+
 /*
  * Might pages of this file be mapped into userspace?
  */
diff --git a/mm/rmap.c b/mm/rmap.c
index e72be32c3dae..523957450d20 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1723,7 +1723,8 @@ static int rmap_walk_file(struct page *page, struct 
rmap_walk_control *rwc)
if (!mapping)
return ret;
pgoff = page_to_pgoff(page);
-   down_write_nested(>i_mmap_rwsem, SINGLE_DEPTH_NESTING);
+
+   i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
 
@@ -1748,7 +1749,7 @@ static int rmap_walk_file(struct page *page, struct 
rmap_walk_control *rwc)
if (!mapping_mapped(peer))
continue;
 
-   i_mmap_lock_write(peer);
+   i_mmap_lock_read(peer);
 
vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
@@ -1764,7 +1765,7 @@ static int rmap_walk_file(struct page *page, struct 
rmap_walk_control *rwc)
 
cond_resched();
}
-   i_mmap_unlock_write(peer);
+   i_mmap_unlock_read(peer);
 
if (ret != SWAP_AGAIN)
goto done;
@@ -1772,7 +1773,7 @@ static int rmap_walk_file(struct page *page, struct 
rmap_walk_control *rwc)
goto done;
}
 done:
-   i_mmap_unlock_write(mapping);
+   i_mmap_unlock_read(mapping);
return ret;
 }
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 3/8] ms/mm: convert i_mmap_mutex to rwsem

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory.  To
this end, this lock can also be a rwsem.  In addition, there are some
important opportunities to share the lock when there are no tree
modifications.

This conversion is straightforward.  For now, all users take the write
lock.

[s...@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso 
Reviewed-by: Rik van Riel 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit c8c06efa8b552608493b7066c234cfa82c47fcea)
Signed-off-by: Andrey Ryabinin 
---
 Documentation/vm/locking |  2 +-
 fs/hugetlbfs/inode.c | 10 +-
 fs/inode.c   |  2 +-
 include/linux/fs.h   |  7 ---
 include/linux/mmu_notifier.h |  2 +-
 kernel/events/uprobes.c  |  2 +-
 mm/filemap.c | 10 +-
 mm/hugetlb.c | 10 +-
 mm/memory.c  |  2 +-
 mm/mmap.c|  6 +++---
 mm/mremap.c  |  2 +-
 mm/rmap.c|  4 ++--
 12 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
index f61228bd6395..fb6402884062 100644
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is 
modified by
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
+The page_table_lock nests with the inode i_mmap_rwsem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fb40a55cc8f1..68f8f2f0eaf5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -757,12 +757,12 @@ static struct inode *hugetlbfs_get_root(struct 
super_block *sb,
 }
 
 /*
- * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
+ * Hugetlbfs is not reclaimable; therefore its i_mmap_rwsem will never
  * be taken from reclaim -- unlike regular filesystems. This needs an
  * annotation because huge_pmd_share() does an allocation under hugetlb's
- * i_mmap_mutex.
+ * i_mmap_rwsem.
  */
-struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
 
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct inode *dir,
@@ -779,8 +779,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block 
*sb,
if (inode) {
inode->i_ino = get_next_ino();
inode_init_owner(inode, dir, mode);
-   lockdep_set_class(>i_mapping->i_mmap_mutex,
-   _i_mmap_mutex_key);
+   lockdep_set_class(>i_mapping->i_mmap_rwsem,
+   _i_mmap_rwsem_key);
inode->i_mapping->a_ops = _aops;
inode->i_mapping->backing_dev_info =_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/inode.c b/fs/inode.c
index 5253272c3742..2423a30dda1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,7 +356,7 @@ void address_space_init_once(struct address_space *mapping)
memset(mapping, 0, sizeof(*mapping));
INIT_RADIX_TREE(>page_tree, GFP_ATOMIC);
spin_lock_init(>tree_lock);
-   mutex_init(>i_mmap_mutex);
+   init_rwsem(>i_mmap_rwsem);
INIT_LIST_HEAD(>private_list);
spin_lock_init(>private_lock);
mapping->i_mmap = RB_ROOT;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e32cb9b71042..f422b0f7b02a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -626,7 +627,7 @@ struct address_space {
RH_KABI_REPLACE(unsigned int i_mmap_writable,
 atomic_t i_mmap_writable) /* count VM_SHARED mappings 
*/
struct rb_root  i_mmap; /* tree of private and shared 
mappings */
-   struct mutexi_mmap_mutex;   /* protect tree, count, list */
+   struct rw_semaphore i_mmap_rwsem;   /* protect tree, count, list */
/* Protected by tree_lock together with the radix tree */
unsigned long   nrpages;   

[Devel] [PATCH rh7 2/8] ms/mm: use new helper functions around the i_mmap_mutex

2020-11-30 Thread Andrey Ryabinin
From: Davidlohr Bueso 

Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.

Signed-off-by: Davidlohr Bueso 
Acked-by: Rik van Riel 
Acked-by: "Kirill A. Shutemov" 
Acked-by: Hugh Dickins 
Cc: Oleg Nesterov 
Acked-by: Peter Zijlstra (Intel) 
Cc: Srikar Dronamraju 
Acked-by: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-122663
(cherry picked from commit 83cde9e8ba95d180eaefefe834958fbf7008cf39)
Signed-off-by: Andrey Ryabinin 
---
 fs/dax.c|  4 ++--
 fs/hugetlbfs/inode.c| 12 ++--
 kernel/events/uprobes.c |  4 ++--
 kernel/fork.c   |  4 ++--
 mm/hugetlb.c| 12 ++--
 mm/memory-failure.c |  4 ++--
 mm/memory.c | 28 ++--
 mm/mmap.c   | 14 +++---
 mm/mremap.c |  4 ++--
 mm/nommu.c  | 14 +++---
 mm/rmap.c   |  6 +++---
 11 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f22e3b32b6cc..7a18745acf01 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -909,7 +909,7 @@ static void dax_mapping_entry_mkclean(struct address_space 
*mapping,
spinlock_t *ptl;
bool changed;
 
-   mutex_lock(>i_mmap_mutex);
+   i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, >i_mmap, index, index) {
unsigned long address;
 
@@ -960,7 +960,7 @@ unlock_pte:
if (changed)
mmu_notifier_invalidate_page(vma->vm_mm, address);
}
-   mutex_unlock(>i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
 }
 
 static int dax_writeback_one(struct dax_device *dax_dev,
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index bdd5c7827391..fb40a55cc8f1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -493,11 +493,11 @@ static void remove_inode_hugepages(struct inode *inode, 
loff_t lstart,
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
 
-   mutex_lock(>i_mmap_mutex);
+   i_mmap_lock_write(mapping);
hugetlb_vmdelete_list(>i_mmap,
next * pages_per_huge_page(h),
(next + 1) * pages_per_huge_page(h));
-   mutex_unlock(>i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
}
 
lock_page(page);
@@ -553,10 +553,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t 
offset)
pgoff = offset >> PAGE_SHIFT;
 
i_size_write(inode, offset);
-   mutex_lock(>i_mmap_mutex);
+   i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(>i_mmap))
hugetlb_vmdelete_list(>i_mmap, pgoff, 0);
-   mutex_unlock(>i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, offset, LLONG_MAX);
return 0;
 }
@@ -578,12 +578,12 @@ static long hugetlbfs_punch_hole(struct inode *inode, 
loff_t offset, loff_t len)
struct address_space *mapping = inode->i_mapping;
 
mutex_lock(>i_mutex);
-   mutex_lock(>i_mmap_mutex);
+   i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(>i_mmap))
hugetlb_vmdelete_list(>i_mmap,
hole_start >> PAGE_SHIFT,
hole_end  >> PAGE_SHIFT);
-   mutex_unlock(>i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
remove_inode_hugepages(inode, hole_start, hole_end);
mutex_unlock(>i_mutex);
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a5a59cc93fb6..816ad8e3d92f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -690,7 +690,7 @@ build_map_info(struct address_space *mapping, loff_t 
offset, bool is_register)
int more = 0;
 
  again:
-   mutex_lock(>i_mmap_mutex);
+   i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, >i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
@@ -721,7 +721,7 @@ build_map_info(struct address_space *mapping, loff_t 
offset, bool is_register)
info->mm = vma->vm_mm;
info->vaddr = offset_to_vaddr(vma, offset);
}
-   mutex_unlock(>i_mmap_mutex);
+   i_mmap_unlock_write(mapping);
 
if (!more)
goto out;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9467e21a8fa4..b6a5279403be 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -504,7 +504,7 @@ static int dup_mmap(struct mm_struct *mm, struct 

[Devel] [PATCH rh7] Revert "mm: Port diff-mm-vmscan-disable-fs-related-activity-for-direct-direct-reclaim"

2020-11-30 Thread Andrey Ryabinin
This reverts commit 50fb388878b646872b78143de3c1bf3fa6f7f148.
Sometimes we can see a lot of reclaimable dcache and no other reclaimable
memory. It looks like kswapd can't keep up reclaiming dcache fast enough.

Commit 50fb388878b6 forbids to reclaim dcache in direct reclaim to prevent
potential deadlocks that might happen due to bugs in other subsystems.
Revert it to allow more aggressive dcache reclaim. It's unlikely to cause
any problems since we already directly reclaim dcache in memcg reclaim,
so let's do the same for the global one.

https://jira.sw.ru/browse/PSBM-122663
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85622f235e78..240435eb6d84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2653,15 +2653,9 @@ static void shrink_zone(struct zone *zone, struct 
scan_control *sc,
 {
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
-   gfp_t slab_gfp = sc->gfp_mask;
bool slab_only = sc->slab_only;
bool retry;
 
-   /* Disable fs-related IO for direct reclaim */
-   if (!sc->target_mem_cgroup &&
-   (current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
-   slab_gfp &= ~__GFP_FS;
-
do {
struct mem_cgroup *root = sc->target_mem_cgroup;
struct mem_cgroup_reclaim_cookie reclaim = {
@@ -2695,7 +2689,7 @@ static void shrink_zone(struct zone *zone, struct 
scan_control *sc,
}
 
if (is_classzone) {
-   shrink_slab(slab_gfp, zone_to_nid(zone),
+   shrink_slab(sc->gfp_mask, zone_to_nid(zone),
memcg, sc->priority, false);
if (reclaim_state) {
sc->nr_reclaimed += 
reclaim_state->reclaimed_slab;
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm, page_alloc: move_freepages should not examine struct page of reserved memory

2020-11-27 Thread Andrey Ryabinin
From: David Rientjes 

After commit 907ec5fca3dc ("mm: zero remaining unavailable struct
pages"), struct page of reserved memory is zeroed.  This causes
page->flags to be 0 and fixes issues related to reading
/proc/kpageflags, for example, of reserved memory.

The VM_BUG_ON() in move_freepages_block(), however, assumes that
page_zone() is meaningful even for reserved memory.  That assumption is
no longer true after the aforementioned commit.

There's no reason why move_freepages_block() should be testing the
legitimacy of page_zone() for reserved memory; its scope is limited only
to pages on the zone's freelist.

Note that pfn_valid() can be true for reserved memory: there is a
backing struct page.  The check for page_to_nid(page) is also buggy but
reserved memory normally only appears on node 0 so the zeroing doesn't
affect this.

Move the debug checks to after verifying PageBuddy is true.  This
isolates the scope of the checks to only be for buddy pages which are on
the zone's freelist which move_freepages_block() is operating on.  In
this case, an incorrect node or zone is a bug worthy of being warned
about (and the examination of struct page is acceptable bcause this
memory is not reserved).

Why does move_freepages_block() gets called on reserved memory? It's
simply math after finding a valid free page from the per-zone free area
to use as fallback.  We find the beginning and end of the pageblock of
the valid page and that can bring us into memory that was reserved per
the e820.  pfn_valid() is still true (it's backed by a struct page), but
since it's zero'd we shouldn't make any inferences here about comparing
its node or zone.  The current node check just happens to succeed most
of the time by luck because reserved memory typically appears on node 0.
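
Roughly, the range is derived like this (a simplified sketch of the
pageblock rounding, not the patched code itself):

        pfn = page_to_pfn(page);
        start_pfn = pfn & ~(pageblock_nr_pages - 1);
        end_pfn = start_pfn + pageblock_nr_pages - 1;

so whenever an e820-reserved region shares a pageblock with the valid free
page, start_pfn or end_pfn lands on reserved struct pages.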

The fix here is to validate that we actually have buddy pages before
testing if there's any type of zone or node strangeness going on.
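
Roughly, the loop body ends up looking like this (a sketch based on the
upstream commit cd961038381f; the rh7 hunk may differ in details):

        for (page = start_page; page <= end_page;) {
                if (!pfn_valid_within(page_to_pfn(page))) {
                        page++;
                        continue;
                }

                if (!PageBuddy(page)) {
                        page++;
                        continue;
                }

                /* Only now is it safe to look at the zone/node of the page */
                VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
                VM_BUG_ON_PAGE(page_zone(page) != zone, page);

                order = page_order(page);
                list_move(&page->lru,
                          &zone->free_area[order].free_list[migratetype]);
                page += 1 << order;
                pages_moved += 1 << order;
        }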

We noticed it almost immediately after bringing 907ec5fca3dc in on
CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
the per-zone free area where the math in move_freepages() will bring the
start or end pfn into reserved memory and wanting to claim that entire
pageblock as a new migratetype.  So the path will be rare, require
CONFIG_DEBUG_VM, and require fallback to a different migratetype.

Some struct pages were already zeroed from reserved pages before
907ec5fca3dc so it theoretically could trigger before this commit.  I
think it's rare enough under a config option that most people don't run
that others may not have noticed.  I wouldn't argue against a stable tag
and the backport should be easy enough, but probably wouldn't single out
a commit that this is fixing.

Mel said:

: The overhead of the debugging check is higher with this patch although
: it'll only affect debug builds and the path is not particularly hot.
: If this was a concern, I think it would be reasonable to simply remove
: the debugging check as the zone boundaries are checked in
: move_freepages_block and we never expect a zone/node to be smaller than
: a pageblock and stuck in the middle of another zone.

Link: 
http://lkml.kernel.org/r/alpine.deb.2.21.1908122036560.10...@chino.kir.corp.google.com
Signed-off-by: David Rientjes 
Acked-by: Mel Gorman 
Cc: Naoya Horiguchi 
Cc: Masayoshi Mizuma 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-123085
(cherry-picked from commit cd961038381f392b364a7c4a040f4576ca415b1a)

[Note: we don't have commit 907ec5fca3dc, but as changelog says
this could trigger before it. And we have all other symptoms - reserved
page from NUMA node 1 with zeroed struct page, so page_zone() gives us
wrong zone, hence BUG_ON()].

Signed-off-by: Andrey Ryabinin 
---
 mm/page_alloc.c | 19 +++
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fe9b18fef7d..3a147749e528 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1668,23 +1668,7 @@ int move_freepages(struct zone *zone,
unsigned long order;
int pages_moved = 0;
 
-#ifndef CONFIG_HOLES_IN_ZONE
-   /*
-* page_zone is not safe to call in this context when
-* CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
-* anyway as we check zone boundaries in move_freepages_block().
-* Remove at a later date when no bug reports exist related to
-* grouping pages by mobility
-*/
-   BUG_ON(pfn_valid(page_to_pfn(start_page)) &&
-  pfn_valid(page_to_pfn(end_page)) &&
-  page_zone(start_page) != page_zone(end_page));
-#endif
-
for (page = start_page; page <= end_page;) {
-   /* Make sure we are not inadvertently changing nodes */
-   VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
-
if (!p

[Devel] [PATCH rh7] mm/memcg: cleanup vmpressure from mem_cgroup_css_free()

2020-11-20 Thread Andrey Ryabinin
Cleaning up vmpressure from mem_cgroup_css_offline() doesn't look
safe. It looks like mem_cgroup_css_offline() might race with reclaim
which will queue vmpressure work after the flush.

Put vmpressure_cleanup() in mem_cgroup_css_free() where we have
exclusive access to memcg. It was originally there, see
https://jira.sw.ru/browse/PSBM-93884 but was moved in the process of rebase.

https://jira.sw.ru/browse/PSBM-122653
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e36ad592b3c7..803273a4d9cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6822,8 +6822,6 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
mem_cgroup_free_all(memcg);
mem_cgroup_reparent_charges(memcg);
 
-   vmpressure_cleanup(>vmpressure);
-
/*
 * A cgroup can be destroyed while somebody is waiting for its
 * oom context, in which case the context will never be unlocked
@@ -6878,7 +6876,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
mem_cgroup_reparent_charges(memcg);
 
cancel_work_sync(>high_work);
-
+   vmpressure_cleanup(>vmpressure);
memcg_destroy_kmem(memcg);
memcg_free_shrinker_maps(memcg);
__mem_cgroup_free(memcg);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm/memcg: cleanup vmpressure from mem_cgroup_css_free()

2020-11-20 Thread Andrey Ryabinin
Cleaning up vmpressure from mem_cgroup_css_offline() doesn't look
safe. It looks like mem_cgroup_css_offline() might race with reclaim
which will queue vmpressure work after the flush.

Put vmpressure_cleanup() in mem_cgroup_css_free() where we have
exclusive access to memcg. It was originally there, see
https://jira.sw.ru/browse/PSBM-93884 but was moved in the process of rebase.

https://jira.sw.ru/browse/PSBM-122655
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e36ad592b3c7..803273a4d9cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6822,8 +6822,6 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
mem_cgroup_free_all(memcg);
mem_cgroup_reparent_charges(memcg);
 
-   vmpressure_cleanup(>vmpressure);
-
/*
 * A cgroup can be destroyed while somebody is waiting for its
 * oom context, in which case the context will never be unlocked
@@ -6878,7 +6876,7 @@ static void mem_cgroup_css_free(struct cgroup *cont)
mem_cgroup_reparent_charges(memcg);
 
cancel_work_sync(>high_work);
-
+   vmpressure_cleanup(>vmpressure);
memcg_destroy_kmem(memcg);
memcg_free_shrinker_maps(memcg);
__mem_cgroup_free(memcg);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 2/3] oom: resurrect berserker mode

2020-11-11 Thread Andrey Ryabinin
From: Vladimir Davydov 

The logic behind the OOM berserker is the same as in PCS6: if processes
are killed by the OOM killer too often (< sysctl vm.oom_relaxation, 1 sec by
default), we increase "rage" (min -10, max 20) and kill 1 << "rage"
youngest worst processes if "rage" >= 0.
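
For illustration, with the default sysctl and the constants below: the first
OOM kill after a quiet period resets rage to OOM_BASE_RAGE (-10); ten more
kills, each arriving within oom_relaxation (1 sec) of the previous one, bring
rage up to 0, at which point one extra task is killed; every further
back-to-back OOM doubles the number of extra victims (1 << rage), saturating
at 1 << OOM_MAX_RAGE.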

https://jira.sw.ru/browse/PSBM-17930

Signed-off-by: Vladimir Davydov 
[aryabinin: vz8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 include/linux/memcontrol.h |  6 +++
 include/linux/oom.h|  4 ++
 mm/oom_kill.c  | 97 ++
 3 files changed, 107 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c26041c681f2..0efabad868ce 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -260,6 +260,12 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   int oom_rage;
+   spinlock_t  oom_rage_lock;
+   unsigned long   prev_oom_time;
+   unsigned long   oom_time;
+
+
/* memory.events */
struct cgroup_file events_file;
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b0ee726c1672..9a6d16a1ace5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -15,6 +15,9 @@ struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
 
+#define OOM_BASE_RAGE  -10
+#define OOM_MAX_RAGE   20
+
 /*
  * Details of the page allocation that triggered the oom killer that are used 
to
  * determine what should be killed.
@@ -44,6 +47,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+   unsigned long overdraft;
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab436d94ae5d..e746b41d558c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,6 +53,7 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
+int sysctl_oom_relaxation = HZ;
 
 DEFINE_MUTEX(oom_lock);
 
@@ -947,6 +948,101 @@ static int oom_kill_memcg_member(struct task_struct 
*task, void *message)
return 0;
 }
 
+/*
+ * Kill more processes if oom happens too often in this context.
+ */
+static void oom_berserker(struct oom_control *oc)
+{
+   static DEFINE_RATELIMIT_STATE(berserker_rs,
+   DEFAULT_RATELIMIT_INTERVAL,
+   DEFAULT_RATELIMIT_BURST);
+   struct task_struct *p;
+   struct mem_cgroup *memcg;
+   unsigned long now = jiffies;
+   int rage;
+   int killed = 0;
+
+   memcg = oc->memcg ?: root_mem_cgroup;
+
+   spin_lock(>oom_rage_lock);
+   memcg->prev_oom_time = memcg->oom_time;
+   memcg->oom_time = now;
+   /*
+* Increase rage if oom happened recently in this context, reset
+* rage otherwise.
+*
+* previous oomthis oom (unfinished)
+* +
+*^^
+*  prev_oom_time  <>  oom_time
+*/
+   if (time_after(now, memcg->prev_oom_time + sysctl_oom_relaxation))
+   memcg->oom_rage = OOM_BASE_RAGE;
+   else if (memcg->oom_rage < OOM_MAX_RAGE)
+   memcg->oom_rage++;
+   rage = memcg->oom_rage;
+   spin_unlock(>oom_rage_lock);
+
+   if (rage < 0)
+   return;
+
+   /*
+* So, we are in rage. Kill (1 << rage) youngest tasks that are
+* as bad as the victim.
+*/
+   read_lock(_lock);
+   list_for_each_entry_reverse(p, _task.tasks, tasks) {
+   unsigned long tsk_points;
+   unsigned long tsk_overdraft;
+
+   if (!p->mm || test_tsk_thread_flag(p, TIF_MEMDIE) ||
+   fatal_signal_pending(p) || p->flags & PF_EXITING ||
+   oom_unkillable_task(p, oc->memcg, oc->nodemask))
+   continue;
+
+   tsk_points = oom_badness(p, oc->memcg, oc->nodemask,
+   oc->totalpages, _overdraft);
+   if (tsk_overdraft < oc->overdraft)
+   continue;
+
+   /*
+* oom_badness never returns a negative value, even if
+* oom_score_adj would make badness so, instead it
+* returns 1. So we do not kill task with badness 1 if
+* the victim has badness > 1 so as not to risk killing
+* protected tasks.
+*/
+   if (tsk_points <= 1 && oc->chosen_points > 1)
+   continue;
+
+   /*
+* Consider tasks as equally bad if they have equal
+* normalized scores.
+*/
+   

[Devel] [PATCH vz8 3/3] oom: make berserker more aggressive

2020-11-11 Thread Andrey Ryabinin
From: Vladimir Davydov 

In the berserker mode we kill a bunch of tasks that are as bad as the
selected victim. We assume two tasks to be equally bad if they consume
the same permille of memory. With such a strict check, it might turn out
that oom berserker won't kill any tasks in case a fork bomb is running
inside a container while the effect of killing a task eating <=1/1000th
of memory won't be enough to cope with memory shortage. Let's loosen
this check and use percentage instead of permille. In this case, it
might still happen that berserker won't kill anyone, but in this case
the regular oom should free at least 1/100th of memory, which should be
enough even for small containers.
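
A made-up example: in a container with totalpages = 1,000,000, a victim using
3,000 pages scores 3 permille but 0 percent, while a fork-bomb child holding
500 pages scores 0 in both. The old permille check skips the child (0 < 3),
the percent check does not (0 < 0 is false), so the berserker can now cull
the fork bomb's children as well.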

Also, check berserker mode even if the victim has already exited by the
time we are about to send SIGKILL to it. Rationale: when the berserker
is in rage, it might kill hundreds of tasks so that the next oom kill is
likely to select an exiting task. Not triggering berserker in this case
will result in oom stalls.

Signed-off-by: Vladimir Davydov 

[aryabinin: rh8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 mm/oom_kill.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e746b41d558c..1cf75939aba6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1016,11 +1016,11 @@ static void oom_berserker(struct oom_control *oc)
continue;
 
/*
-* Consider tasks as equally bad if they have equal
-* normalized scores.
+* Consider tasks as equally bad if they occupy equal
+* percentage of available memory.
 */
-   if (tsk_points * 1000 / oc->totalpages <
-   oc->chosen_points * 1000 / oc->totalpages)
+   if (tsk_points * 100 / oc->totalpages <
+   oc->chosen_points * 100 / oc->totalpages)
continue;
 
if (__ratelimit(_rs)) {
@@ -1061,6 +1061,7 @@ static void oom_kill_process(struct oom_control *oc, 
const char *message)
wake_oom_reaper(victim);
task_unlock(victim);
put_task_struct(victim);
+   oom_berserker(oc);
return;
}
task_unlock(victim);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 1/3] proc, memcg: use memcg limits for showing oom_score inside CT

2020-11-11 Thread Andrey Ryabinin
Use the memcg limits of the task to show /proc/<pid>/oom_score.
Note: in vz7 we had different behavior. It showed 'oom_score'
based on the 've->memcg' limits of the process reading oom_score.
Now we look at the memcg of the target process and don't care about the
current one. This seems to be more correct behaviour.
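
A made-up example of what mem_cgroup_total_pages() (added below) yields: on a
64 GiB host with 8 GiB of swap, a container limited to memory.max = 2 GiB and
memsw.max = 3 GiB gets ram = min(64 GiB, 2 GiB) = 2 GiB and
totalpages = min(3 GiB, 2 GiB + 8 GiB) = 3 GiB, so oom_score is normalized
against the container's own limits instead of the whole host.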

Signed-off-by: Andrey Ryabinin 
---
 fs/proc/base.c |  8 +++-
 include/linux/memcontrol.h | 11 +++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 85fee7396e90..cb417426dd92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -525,8 +525,14 @@ static const struct file_operations proc_lstats_operations 
= {
 static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
  struct pid *pid, struct task_struct *task)
 {
-   unsigned long totalpages = totalram_pages + total_swap_pages;
+   unsigned long totalpages;
unsigned long points = 0;
+   struct mem_cgroup *memcg;
+
+   rcu_read_lock();
+   memcg = mem_cgroup_from_task(task);
+   totalpages = mem_cgroup_total_pages(memcg);
+   rcu_read_unlock();
 
points = oom_badness(task, NULL, NULL, totalpages, NULL) *
1000 / totalpages;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb8634128a81..c26041c681f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -581,6 +581,17 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec 
*lruvec,
return mz->lru_zone_size[zone_idx][lru];
 }
 
+static inline unsigned long mem_cgroup_total_pages(struct mem_cgroup *memcg)
+{
+   unsigned long ram, ram_swap;
+   extern long total_swap_pages;
+
+   ram = min_t(unsigned long, totalram_pages, memcg->memory.max);
+   ram_swap = min_t(unsigned long, memcg->memsw.max, ram + 
total_swap_pages);
+
+   return ram_swap;
+}
+
 void mem_cgroup_handle_over_high(void);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh8 0/3] vecalls: Implement VZCTL_GET_CPU_STAT ioctl

2020-11-10 Thread Andrey Ryabinin



On 11/10/20 12:44 PM, Konstantin Khorenko wrote:
> Used by vzstat/dispatcher/libvirt.
> Faster than parsing Container's cpu cgroup files.
> 
> Konstantin Khorenko (3):
>   vecalls: Add cpu stat measurement units comments to header
>   ve/sched/loadavg: Provide task_group parameter to get_avenrun_ve()
>   vecalls: Introduce VZCTL_GET_CPU_STAT ioctl
> 
>  include/linux/sched/loadavg.h   |  2 -
>  include/linux/ve.h  |  2 +
>  include/uapi/linux/vzcalluser.h | 14 +++
>  kernel/sched/loadavg.c  | 12 +-
>  kernel/sys.c|  6 ++-
>  kernel/time/time.c  |  1 +
>  kernel/ve/ve.c  | 18 +
>  kernel/ve/vecalls.c | 66 +
>  8 files changed, 109 insertions(+), 12 deletions(-)
> 
Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH7] ploop: Fix crash in purge_lru_warn()

2020-11-10 Thread Andrey Ryabinin



On 11/10/20 5:47 PM, Kirill Tkhai wrote:
> do_div() works wrong in case of the second argument is long.
> We don't need remainder, so we don't need do_div() at all.
> 
> https://jira.sw.ru/browse/PSBM-122035
> 
> Reported-by: Evgenii Shatokhin 
> Signed-off-by: Kirill Tkhai 

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] vdso, vclock_gettime: fix linking with old linkers

2020-11-09 Thread Andrey Ryabinin
On some old linkers vdso fails to build because of
dynamic relocation of the 've_start_time' symbol:
VDSO2C  arch/x86/entry/vdso/vdso-image-64.c
Error: vdso image contains dynamic relocations

I wasn't able to figure out why new linkers don't generate the relocation
while old ones do, but I did find out that the visibility("hidden")
attribute on 've_start_time' cures the problem.

https://jira.sw.ru/browse/PSBM-121668
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/entry/vdso/vclock_gettime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 224dbe80da66..b2f1f19319d8 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -24,7 +24,7 @@
 
 #define gtod ((vsyscall_gtod_data))
 
-u64 ve_start_time;
+u64 ve_start_time  __attribute__((visibility("hidden")));
 
 extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
 extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 2/2] ext4: send abort uevent on ext4 journal abort

2020-11-05 Thread Andrey Ryabinin
From: Dmitry Monakhov 

Currently an error from the device results in ext4_abort(), but the uevent is
not generated because the ext4_abort() caller's context does not allow
GFP_KERNEL memory allocation.
Let's relax the submission context requirement and defer the actual uevent
submission to a workqueue. It can be any workqueue; I've picked
rsv_conversion_wq because it already exists.

khorenko@: "system_wq" does not fit here because at the moment of
work execution the sb can already be destroyed.
"EXT4_SB(sb)->rsv_conversion_wq" is flushed before sb is destroyed.

Signed-off-by: Dmitry Monakhov 

[aryabinin rh8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 fs/ext4/ext4.h  |  2 ++
 fs/ext4/super.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 228492c9518f..bbdd7efc8447 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1499,6 +1499,7 @@ struct ext4_sb_info {
__u32 s_csum_seed;
 
bool s_err_event_sent;
+   bool s_abrt_event_sent;
 
/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
@@ -3126,6 +3127,7 @@ enum ext4_event_type {
  EXT4_UA_UMOUNT,
  EXT4_UA_REMOUNT,
  EXT4_UA_ERROR,
+ EXT4_UA_ABORT,
  EXT4_UA_FREEZE,
  EXT4_UA_UNFREEZE,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3cc979825ec8..00619f45b1c3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -420,6 +420,9 @@ static void ext4_send_uevent_work(struct work_struct *w)
case EXT4_UA_ERROR:
ret = add_uevent_var(env, "FS_ACTION=%s", "ERROR");
break;
+   case EXT4_UA_ABORT:
+   ret = add_uevent_var(env, "FS_ACTION=%s", "ABORT");
+   break;
case EXT4_UA_FREEZE:
ret = add_uevent_var(env, "FS_ACTION=%s", "FREEZE");
break;
@@ -576,6 +579,9 @@ static void ext4_handle_error(struct super_block *sb)
if (!test_opt(sb, ERRORS_CONT)) {
journal_t *journal = EXT4_SB(sb)->s_journal;
 
+   if (!xchg(_SB(sb)->s_abrt_event_sent, 1))
+   ext4_send_uevent(sb, EXT4_UA_ABORT);
+
EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
if (journal)
jbd2_journal_abort(journal, -EIO);
@@ -801,6 +807,10 @@ void __ext4_abort(struct super_block *sb, const char 
*function,
 
if (sb_rdonly(sb) == 0) {
ext4_msg(sb, KERN_CRIT, "Remounting filesystem read-only");
+
+   if (!xchg(_SB(sb)->s_abrt_event_sent, 1))
+   ext4_send_uevent(sb, EXT4_UA_ABORT);
+
EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
/*
 * Make sure updated value of ->s_mount_flags will be visible
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 1/2] ext4: add generic uevent infrastructure

2020-11-05 Thread Andrey Ryabinin
From: Dmitry Monakhov 

*Purpose:
It is reasonable to announce fs-related events via the uevent infrastructure.
This patch implements only the ext4 part, but IMHO this should be useful for
any generic filesystem.

Example: a runtime fs error is a purely asynchronous event. Currently there is
no good way to handle this situation and inform user space about it.

*Implementation:
 Add uevent infrastructure similar to dm uevent
 FS_ACTION = {MOUNT|UMOUNT|REMOUNT|ERROR|FREEZE|UNFREEZE}
 FS_UUID
 FS_NAME
 FS_TYPE
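
For instance, an ERROR event produced by this code would reach userspace with
an environment roughly like the following (device name and UUID invented for
illustration; note that the UUID key is emitted as plain "UUID=", not
"FS_UUID="):

   FS_TYPE=ext4
   FS_NAME=dm-3
   UUID=D8F4E2A0-33C1-44B7-9A52-6E1B0C7F8D21
   FS_ACTION=ERROR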

Signed-off-by: Dmitry Monakhov 

[aryabinin: add error event, rh8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 fs/ext4/ext4.h  |  11 +
 fs/ext4/super.c | 128 +++-
 2 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 028832d858fc..228492c9518f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1498,6 +1498,8 @@ struct ext4_sb_info {
/* Precomputed FS UUID checksum for seeding other checksums */
__u32 s_csum_seed;
 
+   bool s_err_event_sent;
+
/* Reclaim extents from extent status tree */
struct shrinker s_es_shrinker;
struct list_head s_es_list; /* List of inodes with reclaimable 
extents */
@@ -3119,6 +3121,15 @@ extern int ext4_check_blockref(const char *, unsigned 
int,
 struct ext4_ext_path;
 struct ext4_extent;
 
+enum ext4_event_type {
+ EXT4_UA_MOUNT,
+ EXT4_UA_UMOUNT,
+ EXT4_UA_REMOUNT,
+ EXT4_UA_ERROR,
+ EXT4_UA_FREEZE,
+ EXT4_UA_UNFREEZE,
+};
+
 /*
  * Maximum number of logical blocks in a file; ext4_extent's ee_block is
  * __le32.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7fc5ad243953..3cc979825ec8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -354,6 +354,117 @@ static time64_t __ext4_get_tstamp(__le32 *lo, __u8 *hi)
 #define ext4_get_tstamp(es, tstamp) \
__ext4_get_tstamp(&(es)->tstamp, &(es)->tstamp ## _hi)
 
+static int ext4_uuid_valid(const u8 *uuid)
+{
+   int i;
+
+   for (i = 0; i < 16; i++) {
+   if (uuid[i])
+   return 1;
+   }
+   return 0;
+}
+
+struct ext4_uevent {
+   struct super_block *sb;
+   enum ext4_event_type action;
+   struct work_struct work;
+};
+
+/**
+ * ext4_send_uevent - prepare and send uevent
+ *
+ * @sb:super_block
+ * @action:action type
+ *
+ */
+static void ext4_send_uevent_work(struct work_struct *w)
+{
+   struct ext4_uevent *e = container_of(w, struct ext4_uevent, work);
+   struct super_block *sb = e->sb;
+   struct kobj_uevent_env *env;
+   const u8 *uuid = EXT4_SB(sb)->s_es->s_uuid;
+   enum kobject_action kaction = KOBJ_CHANGE;
+   int ret;
+
+   env = kzalloc(sizeof(struct kobj_uevent_env), GFP_KERNEL);
+   if (!env){
+   kfree(e);
+   return;
+   }
+   ret = add_uevent_var(env, "FS_TYPE=%s", sb->s_type->name);
+   if (ret)
+   goto out;
+   ret = add_uevent_var(env, "FS_NAME=%s", sb->s_id);
+   if (ret)
+   goto out;
+
+   if (ext4_uuid_valid(uuid)) {
+   ret = add_uevent_var(env, "UUID=%pUB", uuid);
+   if (ret)
+   goto out;
+   }
+
+   switch (e->action) {
+   case EXT4_UA_MOUNT:
+   kaction = KOBJ_ONLINE;
+   ret = add_uevent_var(env, "FS_ACTION=%s", "MOUNT");
+   break;
+   case EXT4_UA_UMOUNT:
+   kaction = KOBJ_OFFLINE;
+   ret = add_uevent_var(env, "FS_ACTION=%s", "UMOUNT");
+   break;
+   case EXT4_UA_REMOUNT:
+   ret = add_uevent_var(env, "FS_ACTION=%s", "REMOUNT");
+   break;
+   case EXT4_UA_ERROR:
+   ret = add_uevent_var(env, "FS_ACTION=%s", "ERROR");
+   break;
+   case EXT4_UA_FREEZE:
+   ret = add_uevent_var(env, "FS_ACTION=%s", "FREEZE");
+   break;
+   case EXT4_UA_UNFREEZE:
+   ret = add_uevent_var(env, "FS_ACTION=%s", "UNFREEZE");
+   break;
+   default:
+   ret = -EINVAL;
+   }
+   if (ret)
+   goto out;
+   ret = kobject_uevent_env(&(EXT4_SB(sb)->s_kobj), kaction, env->envp);
+out:
+   kfree(env);
+   kfree(e);
+}
+
+/**
+ * ext4_send_uevent - prepare and schedule event submission
+ *
+ * @sb:super_block
+ * @action:action type
+ *
+ */
+void ext4_send_uevent(struct super_block *sb, enum ext4_event_type action)
+{
+   struct ext4_uevent *e;
+
+   /*
+* May happen if called from ext4_put_super() -> __ext4_abort()
+* -> ext4_send_uevent()
+*/
+   if (!EXT4_SB(sb)->rsv_conversion_wq)
+   ret

[Devel] [PATCH vz8] x86_64, vclock_gettime: Use standart division instead of __iter_div_u64_rem()

2020-11-03 Thread Andrey Ryabinin
timespec_sub_ns() historically uses __iter_div_u64_rem() for division.
Probably it's supposed to be faster:

/*
 * Iterative div/mod for use when dividend is not expected to be much
 * bigger than divisor.
 */
u32 iter_div_u64_rem(u64 dividend, u32 divisor, u64 *remainder)

However, in our case ve_start_time may make the dividend much bigger than the
divisor. So let's use the standard "/" instead of the iterative one. With a
zero ve_start_time I wasn't able to see a measurable difference, however with
a big ve_start_time the difference is rather significant:

 # time ./clock_iter_div
 real   1m30.224s
 user   1m30.343s
 sys    0m0.008s

 # time taskset ./clock_div
 real   0m2.757s
 user   0m1.730s
 sys    0m0.066s
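
For reference, __iter_div_u64_rem() is roughly a subtract-in-a-loop helper (a
simplified sketch of the include/linux/math64.h version, compiler barrier
omitted):

        static inline u32 iter_div_sketch(u64 dividend, u32 divisor, u64 *rem)
        {
                u32 ret = 0;

                while (dividend >= divisor) {
                        dividend -= divisor;    /* one pass per whole divisor */
                        ret++;
                }
                *rem = dividend;
                return ret;
        }

so the iteration count grows linearly with dividend/divisor; with
ve_start_time pushing ns up to host-uptime scale, a single call can spin for
tens of millions of iterations.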

The 32-bit vdso doesn't like 64-bit division and doesn't link.
I think it needs __udivsi3(). So just fall back to __iter_div_u64_rem()
on 32-bit.

https://jira.sw.ru/browse/PSBM-121856
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/entry/vdso/vclock_gettime.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index be1de6c4cafa..224dbe80da66 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -229,13 +229,27 @@ notrace static int __always_inline do_realtime(struct 
timespec *ts)
return mode;
 }
 
+static inline u64 divu64(u64 dividend, u32 divisor, u64 *remainder)
+{
+   /* 32-bit wants __udivsi3() and fails to link, so fallback to iter */
+#ifndef BUILD_VDSO32
+   u64 res;
+
+   res = dividend/divisor;
+   *remainder = dividend % divisor;
+   return res;
+#else
+   return __iter_div_u64_rem(dividend, divisor, remainder);
+#endif
+}
+
 static inline void timespec_sub_ns(struct timespec *ts, u64 ns)
 {
if ((s64)ns <= 0) {
-   ts->tv_sec += __iter_div_u64_rem(-ns, NSEC_PER_SEC, );
+   ts->tv_sec += divu64(-ns, NSEC_PER_SEC, );
ts->tv_nsec = ns;
} else {
-   ts->tv_sec -= __iter_div_u64_rem(ns, NSEC_PER_SEC, );
+   ts->tv_sec -= divu64(ns, NSEC_PER_SEC, );
if (ns) {
ts->tv_sec--;
ns = NSEC_PER_SEC - ns;
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 v3 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read

2020-11-03 Thread Andrey Ryabinin
If several threads read /proc/cpuinfo, some can see in 'flags' the raw
values from c->x86_capability, before __do_cpuid_fault() is called
and the masks are applied. Fix this by forming 'flags' on the stack first
and copying them into per_cpu(cpu_flags, cpu) as the last step.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin 
---

Changes since v1:
 - none

Changes since v2:
 - add spinlock, use temporary ve_flags in show_cpuinfo()
 
 arch/x86/kernel/cpu/proc.c | 31 ++-
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4fe1577d5e6f..08fd7ff9a55b 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -65,15 +65,16 @@ struct cpu_flags {
 };
 
 static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
+static DEFINE_SPINLOCK(cpu_flags_lock);
 
 static void init_cpu_flags(void *dummy)
 {
int cpu = smp_processor_id();
-   struct cpu_flags *flags = _cpu(cpu_flags, cpu);
+   struct cpu_flags flags;
struct cpuinfo_x86 *c = _data(cpu);
unsigned int eax, ebx, ecx, edx;
 
-   memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
+   memcpy(, c->x86_capability, sizeof(flags));
 
/*
 * Clear feature bits masked using cpuid masking/faulting.
@@ -81,26 +82,30 @@ static void init_cpu_flags(void *dummy)
 
if (c->cpuid_level >= 0x0001) {
__do_cpuid_fault(0x0001, 0, , , , );
-   flags->val[4] &= ecx;
-   flags->val[0] &= edx;
+   flags.val[4] &= ecx;
+   flags.val[0] &= edx;
}
 
if (c->cpuid_level >= 0x0007) {
__do_cpuid_fault(0x0007, 0, , , , );
-   flags->val[9] &= ebx;
+   flags.val[9] &= ebx;
}
 
if ((c->extended_cpuid_level & 0x) == 0x8000 &&
c->extended_cpuid_level >= 0x8001) {
__do_cpuid_fault(0x8001, 0, , , , );
-   flags->val[6] &= ecx;
-   flags->val[1] &= edx;
+   flags.val[6] &= ecx;
+   flags.val[1] &= edx;
}
 
if (c->cpuid_level >= 0x000d) {
__do_cpuid_fault(0x000d, 1, , , , );
-   flags->val[10] &= eax;
+   flags.val[10] &= eax;
}
+
+   spin_lock(_flags_lock);
+   memcpy(_cpu(cpu_flags, cpu), , sizeof(flags));
+   spin_unlock(_flags_lock);
 }
 
 static int show_cpuinfo(struct seq_file *m, void *v)
@@ -108,6 +113,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
struct cpuinfo_x86 *c = v;
unsigned int cpu;
int is_super = ve_is_super(get_exec_env());
+   struct cpu_flags ve_flags;
int i;
 
cpu = c->cpu_index;
@@ -147,12 +153,19 @@ static int show_cpuinfo(struct seq_file *m, void *v)
show_cpuinfo_core(m, c, cpu);
show_cpuinfo_misc(m, c);
 
+   if (!is_super) {
+   spin_lock_irq(_flags_lock);
+   memcpy(_flags, _cpu(cpu_flags, cpu), sizeof(ve_flags));
+   spin_unlock_irq(_flags_lock);
+   }
+
+
seq_puts(m, "flags\t\t:");
for (i = 0; i < 32*NCAPINTS; i++)
if (x86_cap_flags[i] != NULL &&
((is_super && cpu_has(c, i)) ||
 (!is_super && test_bit(i, (unsigned long *)
-   _cpu(cpu_flags, 
cpu)
+   _flags
seq_printf(m, " %s", x86_cap_flags[i]);
 
seq_puts(m, "\nbugs\t\t:");
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 v3 2/2] x86: don't enable cpuid faults if /proc/vz/cpuid_override unused

2020-11-03 Thread Andrey Ryabinin
We don't need to enable cpuid faults if /proc/vz/cpuid_override
was never used. If a task was attached to a ve before a write to
'cpuid_override', it will not get cpuid faults now. It shouldn't
be a problem since the proper use of 'cpuid_override' requires
stopping all containers.
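
In practice this means a node where /proc/vz/cpuid_override is never written
runs container tasks without cpuid faulting at all (and without the
fault/emulation overhead), while after the table is written only tasks
attached to a ve from that point on get TIF_CPUID_OVERRIDE, which is why the
containers need to be restarted for the override to apply.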

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin 
Reviewed-by: Kirill Tkhai 
---

Changes since v1:
 - git add include/linux/cpuid_override.h 
Changes since v2:
 - add review tag

 arch/x86/kernel/cpuid_fault.c  | 21 ++---
 include/linux/cpuid_override.h | 30 ++
 kernel/ve/ve.c |  5 -
 3 files changed, 36 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/cpuid_override.h

diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
index 1e8ffacc4412..cb6c2216fa8a 100644
--- a/arch/x86/kernel/cpuid_fault.c
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -1,3 +1,4 @@
+#include 
 #include 
 #include 
 #include 
@@ -9,25 +10,7 @@
 #include 
 #include 
 
-struct cpuid_override_entry {
-   unsigned int op;
-   unsigned int count;
-   bool has_count;
-   unsigned int eax;
-   unsigned int ebx;
-   unsigned int ecx;
-   unsigned int edx;
-};
-
-#define MAX_CPUID_OVERRIDE_ENTRIES 16
-
-struct cpuid_override_table {
-   struct rcu_head rcu_head;
-   int size;
-   struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
-};
-
-static struct cpuid_override_table __rcu *cpuid_override __read_mostly;
+struct cpuid_override_table __rcu *cpuid_override __read_mostly;
 static DEFINE_SPINLOCK(cpuid_override_lock);
 
 static void cpuid_override_update(struct cpuid_override_table *new_table)
diff --git a/include/linux/cpuid_override.h b/include/linux/cpuid_override.h
new file mode 100644
index ..ea0fa7af3d3c
--- /dev/null
+++ b/include/linux/cpuid_override.h
@@ -0,0 +1,30 @@
+#ifndef __CPUID_OVERRIDE_H
+#define __CPUID_OVERRIDE_H
+
+#include 
+
+struct cpuid_override_entry {
+   unsigned int op;
+   unsigned int count;
+   bool has_count;
+   unsigned int eax;
+   unsigned int ebx;
+   unsigned int ecx;
+   unsigned int edx;
+};
+
+#define MAX_CPUID_OVERRIDE_ENTRIES 16
+
+struct cpuid_override_table {
+   struct rcu_head rcu_head;
+   int size;
+   struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
+};
+
+extern struct cpuid_override_table __rcu *cpuid_override;
+
+static inline bool cpuid_override_on(void)
+{
+   return rcu_access_pointer(cpuid_override);
+}
+#endif
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index aad8ce69ca1f..0d4d0ab70369 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -9,6 +9,7 @@
  * 've.c' helper file performing VE sub-system initialization
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -801,6 +802,7 @@ static void ve_attach(struct cgroup_taskset *tset)
 {
struct cgroup_subsys_state *css;
struct task_struct *task;
+   extern struct cpuid_override_table __rcu *cpuid_override;
 
cgroup_taskset_for_each(task, css, tset) {
struct ve_struct *ve = css_to_ve(css);
@@ -816,7 +818,8 @@ static void ve_attach(struct cgroup_taskset *tset)
/* Leave parent exec domain */
task->parent_exec_id--;
 
-   set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
+   if (cpuid_override_on())
+   set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
task->task_ve = ve;
}
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH vz8 v2 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read

2020-11-03 Thread Andrey Ryabinin



On 11/3/20 2:28 PM, Kirill Tkhai wrote:
> On 02.11.2020 20:13, Andrey Ryabinin wrote:
>> If several threads read /proc/cpuinfo some can see in 'flags'
>> values from c->x86_capability, before __do_cpuid_fault() called
>> and masks applied. Fix this by forming 'flags' on stack first
>> and copy them in per_cpu(cpu_flags, cpu) as a last step.
>>
>> https://jira.sw.ru/browse/PSBM-121823
>> Signed-off-by: Andrey Ryabinin 
>> ---
>> Changes since v1:
>>  - none
>>  
>>  arch/x86/kernel/cpu/proc.c | 17 +
>>  1 file changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
>> index 4fe1577d5e6f..4cc2951e34fb 100644
>> --- a/arch/x86/kernel/cpu/proc.c
>> +++ b/arch/x86/kernel/cpu/proc.c
>> @@ -69,11 +69,11 @@ static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
>>  static void init_cpu_flags(void *dummy)
>>  {
>>  int cpu = smp_processor_id();
>> -struct cpu_flags *flags = _cpu(cpu_flags, cpu);
>> +struct cpu_flags flags;
>>  struct cpuinfo_x86 *c = _data(cpu);
>>  unsigned int eax, ebx, ecx, edx;
>>  
>> -memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
>> +memcpy(, c->x86_capability, sizeof(flags));
>>  
>>  /*
>>   * Clear feature bits masked using cpuid masking/faulting.
>> @@ -81,26 +81,27 @@ static void init_cpu_flags(void *dummy)
>>  
>>  if (c->cpuid_level >= 0x0001) {
>>  __do_cpuid_fault(0x0001, 0, , , , );
>> -flags->val[4] &= ecx;
>> -flags->val[0] &= edx;
>> +flags.val[4] &= ecx;
>> +flags.val[0] &= edx;
>>  }
>>  
>>  if (c->cpuid_level >= 0x0007) {
>>  __do_cpuid_fault(0x0007, 0, , , , );
>> -flags->val[9] &= ebx;
>> +flags.val[9] &= ebx;
>>  }
>>  
>>  if ((c->extended_cpuid_level & 0x) == 0x8000 &&
>>  c->extended_cpuid_level >= 0x8001) {
>>  __do_cpuid_fault(0x8001, 0, , , , );
>> -flags->val[6] &= ecx;
>> -flags->val[1] &= edx;
>> +flags.val[6] &= ecx;
>> +flags.val[1] &= edx;
>>  }
>>  
>>  if (c->cpuid_level >= 0x000d) {
>>  __do_cpuid_fault(0x000d, 1, , , , );
>> -flags->val[10] &= eax;
>> +flags.val[10] &= eax;
>>  }
>> +memcpy(_cpu(cpu_flags, cpu), , sizeof(flags));
> 
> This is still racy, since memcpy() is not atomic. Maybe we should add some 
> lock on top of this?
> 

This race shouldn't be a problem since the flags are not supposed to change
during the ve lifetime.
So we are overwriting the same values. But I don't mind adding spinlock protection.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 v2 2/2] x86: don't enable cpuid faults if /proc/vz/cpuid_override unused

2020-11-02 Thread Andrey Ryabinin
We don't need to enable cpuid faults if /proc/vz/cpuid_override
was never used. If a task was attached to a ve before a write to
'cpuid_override', it will not get cpuid faults now. It shouldn't
be a problem since the proper use of 'cpuid_override' requires
stopping all containers.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin 
---

Changes since v1:
 - git add include/linux/cpuid_override.h 
 
 arch/x86/kernel/cpuid_fault.c  | 21 ++---
 include/linux/cpuid_override.h | 30 ++
 kernel/ve/ve.c |  5 -
 3 files changed, 36 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/cpuid_override.h

diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
index 1e8ffacc4412..cb6c2216fa8a 100644
--- a/arch/x86/kernel/cpuid_fault.c
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -1,3 +1,4 @@
+#include 
 #include 
 #include 
 #include 
@@ -9,25 +10,7 @@
 #include 
 #include 
 
-struct cpuid_override_entry {
-   unsigned int op;
-   unsigned int count;
-   bool has_count;
-   unsigned int eax;
-   unsigned int ebx;
-   unsigned int ecx;
-   unsigned int edx;
-};
-
-#define MAX_CPUID_OVERRIDE_ENTRIES 16
-
-struct cpuid_override_table {
-   struct rcu_head rcu_head;
-   int size;
-   struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
-};
-
-static struct cpuid_override_table __rcu *cpuid_override __read_mostly;
+struct cpuid_override_table __rcu *cpuid_override __read_mostly;
 static DEFINE_SPINLOCK(cpuid_override_lock);
 
 static void cpuid_override_update(struct cpuid_override_table *new_table)
diff --git a/include/linux/cpuid_override.h b/include/linux/cpuid_override.h
new file mode 100644
index ..ea0fa7af3d3c
--- /dev/null
+++ b/include/linux/cpuid_override.h
@@ -0,0 +1,30 @@
+#ifndef __CPUID_OVERRIDE_H
+#define __CPUID_OVERRIDE_H
+
+#include 
+
+struct cpuid_override_entry {
+   unsigned int op;
+   unsigned int count;
+   bool has_count;
+   unsigned int eax;
+   unsigned int ebx;
+   unsigned int ecx;
+   unsigned int edx;
+};
+
+#define MAX_CPUID_OVERRIDE_ENTRIES 16
+
+struct cpuid_override_table {
+   struct rcu_head rcu_head;
+   int size;
+   struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
+};
+
+extern struct cpuid_override_table __rcu *cpuid_override;
+
+static inline bool cpuid_override_on(void)
+{
+   return rcu_access_pointer(cpuid_override);
+}
+#endif
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index aad8ce69ca1f..0d4d0ab70369 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -9,6 +9,7 @@
  * 've.c' helper file performing VE sub-system initialization
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -801,6 +802,7 @@ static void ve_attach(struct cgroup_taskset *tset)
 {
struct cgroup_subsys_state *css;
struct task_struct *task;
+   extern struct cpuid_override_table __rcu *cpuid_override;
 
cgroup_taskset_for_each(task, css, tset) {
struct ve_struct *ve = css_to_ve(css);
@@ -816,7 +818,8 @@ static void ve_attach(struct cgroup_taskset *tset)
/* Leave parent exec domain */
task->parent_exec_id--;
 
-   set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
+   if (cpuid_override_on())
+   set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
task->task_ve = ve;
}
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 v2 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read

2020-11-02 Thread Andrey Ryabinin
If several threads read /proc/cpuinfo, some can see in 'flags' the raw
values from c->x86_capability, before __do_cpuid_fault() is called
and the masks are applied. Fix this by forming 'flags' on the stack first
and copying them into per_cpu(cpu_flags, cpu) as the last step.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin 
---
Changes since v1:
 - none
 
 arch/x86/kernel/cpu/proc.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4fe1577d5e6f..4cc2951e34fb 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -69,11 +69,11 @@ static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
 static void init_cpu_flags(void *dummy)
 {
int cpu = smp_processor_id();
-   struct cpu_flags *flags = _cpu(cpu_flags, cpu);
+   struct cpu_flags flags;
struct cpuinfo_x86 *c = _data(cpu);
unsigned int eax, ebx, ecx, edx;
 
-   memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
+   memcpy(, c->x86_capability, sizeof(flags));
 
/*
 * Clear feature bits masked using cpuid masking/faulting.
@@ -81,26 +81,27 @@ static void init_cpu_flags(void *dummy)
 
if (c->cpuid_level >= 0x0001) {
__do_cpuid_fault(0x0001, 0, , , , );
-   flags->val[4] &= ecx;
-   flags->val[0] &= edx;
+   flags.val[4] &= ecx;
+   flags.val[0] &= edx;
}
 
if (c->cpuid_level >= 0x0007) {
__do_cpuid_fault(0x0007, 0, , , , );
-   flags->val[9] &= ebx;
+   flags.val[9] &= ebx;
}
 
if ((c->extended_cpuid_level & 0x) == 0x8000 &&
c->extended_cpuid_level >= 0x8001) {
__do_cpuid_fault(0x8001, 0, , , , );
-   flags->val[6] &= ecx;
-   flags->val[1] &= edx;
+   flags.val[6] &= ecx;
+   flags.val[1] &= edx;
}
 
if (c->cpuid_level >= 0x000d) {
__do_cpuid_fault(0x000d, 1, , , , );
-   flags->val[10] &= eax;
+   flags.val[10] &= eax;
}
+   memcpy(_cpu(cpu_flags, cpu), , sizeof(flags));
 }
 
 static int show_cpuinfo(struct seq_file *m, void *v)
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 2/2] x86: don't enable cpuid faults if /proc/vz/cpuid_override unused

2020-11-02 Thread Andrey Ryabinin
We don't need to enable cpuid faults if /proc/vz/cpuid_override
was never used. If a task was attached to a ve before a write to
'cpuid_override', it will not get cpuid faults now. It shouldn't
be a problem since the proper use of 'cpuid_override' requires
stopping all containers.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/kernel/cpuid_fault.c | 21 ++---
 kernel/ve/ve.c|  5 -
 2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
index 1e8ffacc4412..cb6c2216fa8a 100644
--- a/arch/x86/kernel/cpuid_fault.c
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -1,3 +1,4 @@
+#include 
 #include 
 #include 
 #include 
@@ -9,25 +10,7 @@
 #include 
 #include 
 
-struct cpuid_override_entry {
-   unsigned int op;
-   unsigned int count;
-   bool has_count;
-   unsigned int eax;
-   unsigned int ebx;
-   unsigned int ecx;
-   unsigned int edx;
-};
-
-#define MAX_CPUID_OVERRIDE_ENTRIES 16
-
-struct cpuid_override_table {
-   struct rcu_head rcu_head;
-   int size;
-   struct cpuid_override_entry entries[MAX_CPUID_OVERRIDE_ENTRIES];
-};
-
-static struct cpuid_override_table __rcu *cpuid_override __read_mostly;
+struct cpuid_override_table __rcu *cpuid_override __read_mostly;
 static DEFINE_SPINLOCK(cpuid_override_lock);
 
 static void cpuid_override_update(struct cpuid_override_table *new_table)
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index aad8ce69ca1f..0d4d0ab70369 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -9,6 +9,7 @@
  * 've.c' helper file performing VE sub-system initialization
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -801,6 +802,7 @@ static void ve_attach(struct cgroup_taskset *tset)
 {
struct cgroup_subsys_state *css;
struct task_struct *task;
+   extern struct cpuid_override_table __rcu *cpuid_override;
 
cgroup_taskset_for_each(task, css, tset) {
struct ve_struct *ve = css_to_ve(css);
@@ -816,7 +818,8 @@ static void ve_attach(struct cgroup_taskset *tset)
/* Leave parent exec domain */
task->parent_exec_id--;
 
-   set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
+   if (cpuid_override_on())
+   set_tsk_thread_flag(task, TIF_CPUID_OVERRIDE);
task->task_ve = ve;
}
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 1/2] x86, cpuinfo: Fix race on parallel /proc/cpuinfo read

2020-11-02 Thread Andrey Ryabinin
If several threads read /proc/cpuinfo, some can see in 'flags' the raw
values from c->x86_capability, before __do_cpuid_fault() is called
and the masks are applied. Fix this by forming 'flags' on the stack first
and copying them into per_cpu(cpu_flags, cpu) as the last step.

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/kernel/cpu/proc.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4fe1577d5e6f..4cc2951e34fb 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -69,11 +69,11 @@ static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
 static void init_cpu_flags(void *dummy)
 {
int cpu = smp_processor_id();
-   struct cpu_flags *flags = _cpu(cpu_flags, cpu);
+   struct cpu_flags flags;
struct cpuinfo_x86 *c = _data(cpu);
unsigned int eax, ebx, ecx, edx;
 
-   memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
+   memcpy(, c->x86_capability, sizeof(flags));
 
/*
 * Clear feature bits masked using cpuid masking/faulting.
@@ -81,26 +81,27 @@ static void init_cpu_flags(void *dummy)
 
if (c->cpuid_level >= 0x0001) {
__do_cpuid_fault(0x0001, 0, , , , );
-   flags->val[4] &= ecx;
-   flags->val[0] &= edx;
+   flags.val[4] &= ecx;
+   flags.val[0] &= edx;
}
 
if (c->cpuid_level >= 0x0007) {
__do_cpuid_fault(0x0007, 0, , , , );
-   flags->val[9] &= ebx;
+   flags.val[9] &= ebx;
}
 
if ((c->extended_cpuid_level & 0x) == 0x8000 &&
c->extended_cpuid_level >= 0x8001) {
__do_cpuid_fault(0x8001, 0, , , , );
-   flags->val[6] &= ecx;
-   flags->val[1] &= edx;
+   flags.val[6] &= ecx;
+   flags.val[1] &= edx;
}
 
if (c->cpuid_level >= 0x000d) {
__do_cpuid_fault(0x000d, 1, , , , );
-   flags->val[10] &= eax;
+   flags.val[10] &= eax;
}
+   memcpy(_cpu(cpu_flags, cpu), , sizeof(flags));
 }
 
 static int show_cpuinfo(struct seq_file *m, void *v)
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh8 3/3] ve/vestat: Introduce /proc/vz/vestat

2020-10-30 Thread Andrey Ryabinin



On 10/30/20 4:08 PM, Konstantin Khorenko wrote:
> The patch is based on following vz7 commits:
> 
>   f997bf6c613a ("ve: initial patch")
>   75fc174adc36 ("sched: Port cpustat related patches")
>   09e1cb4a7d4d ("ve/proc: restricted proc-entries scope")
>   a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs")
> 
> Signed-off-by: Konstantin Khorenko 
> ---

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh8 2/3] ve/time/stat: idle time virtualization in /proc/loadavg

2020-10-30 Thread Andrey Ryabinin



On 10/30/20 4:08 PM, Konstantin Khorenko wrote:
> The patch is based on following vz7 commits:
>   a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs")
>   75fc174adc36 ("sched: Port cpustat related patches")
> 
> Fixes: a3c4d1d8f383 ("ve/time: Customize VE uptime")
> 
> TODO: to separate FIXME hunks from a3c4d1d8f383 ("ve/time: Customize VE
> uptime") and merge them into this commit
> 
> Signed-off-by: Konstantin Khorenko 
> ---
Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh8 1/3] ve/sched/stat: Introduce handler for getting CT cpu statistics

2020-10-30 Thread Andrey Ryabinin



On 10/30/20 4:08 PM, Konstantin Khorenko wrote:
> It will be used later in
>   * idle cpu stat virtualization in /proc/loadavg
>   * /proc/vz/vestat output
>   * VZCTL_GET_CPU_STAT ioctl
> 
> The patch is based on following vz7 commits:
>   ecdce58b214c ("sched: Export per task_group statistics_work")
>   75fc174adc36 ("sched: Port cpustat related patches")
>   a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs")
> 
> Signed-off-by: Konstantin Khorenko 

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh8 0/8] ve/proc/sched/stat: Virtualize /proc/stat in a Container

2020-10-30 Thread Andrey Ryabinin



On 10/28/20 6:57 PM, Konstantin Khorenko wrote:
> This patchset contains of parts of following vz7 commits:
> 
>   a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs")
>   ecdce58b214c ("sched: Export per task_group statistics_work")
>   fc24d1785a28 ("fs/proc: print fairshed stat")
>   75fc174adc36 ("sched: Port cpustat related patches")
>   3c7b1e52294c ("ve/sched: Hide steal time from inside CT")
>   3d34f0d3b529 ("proc/cpu/cgroup: make boottime in CT reveal the real start 
> time")
>   715f311fdb4a ("sched: Account task_group::cpustat,taskstats,avenrun")
> 
> Known issues:
>  - context switches ("ctxt") and number of forks ("processes")
>virtualization is TBD
>  - "procs_blocked" reported is incorrect, to be fixed by later patches
> 
> Konstantin Khorenko (8):
>   ve/cgroup: export cgroup_get_ve_root1() + cleanup
>   kernel/stat: Introduce kernel_cpustat operation wrappers
>   ve/sched/stat: Add basic infrastructure for vcpu statistics
>   ve/sched/stat: Introduce functions to calculate vcpustat data
>   ve/proc/stat: Introduce /proc/stat virtualized handler for Containers
>   ve/proc/stat: Wire virtualized /proc/stat handler
>   ve/proc/stat: Introduce CPUTIME_USED field in cpustat statistic
>   sched: Fix task_group "iowait_sum" statistic accounting
> 
>  fs/proc/stat.c  |  10 +
>  include/linux/kernel_stat.h |  37 
>  include/linux/ve.h  |   8 +
>  kernel/cgroup/cgroup.c  |   6 +-
>  kernel/sched/core.c |  17 +-
>  kernel/sched/cpuacct.c  | 377 ++++
>  kernel/sched/fair.c |   3 +-
>  kernel/sched/sched.h|   5 +
>  kernel/ve/ve.c  |  17 ++
>  9 files changed, 475 insertions(+), 5 deletions(-)
> 

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh8] sched/stat: account ctxsw per task group

2020-10-30 Thread Andrey Ryabinin



On 10/29/20 6:46 PM, Konstantin Khorenko wrote:
> From: Vladimir Davydov 
> 
> This is a backport of diff-sched-account-ctxsw-per-task-group:
> 
>  Subject: sched: account ctxsw per task group
>  Date: Fri, 28 Dec 2012 15:09:45 +0400
> 
> * [sched] the number of context switches should be reported correctly
> inside a CT in /proc/stat (PSBM-18113)
> 
> For /proc/stat:ctxt to be correct inside containers.
> 
> https://jira.sw.ru/browse/PSBM-18113
> 
> Signed-off-by: Vladimir Davydov 
> 
> (cherry picked from vz7 commit d388f0bf64adb74cd62c4deff58e181bd63d62ac)
> Signed-off-by: Konstantin Khorenko 
> ---

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 3/3] x86: Show vcpu cpuflags in cpuinfo

2020-10-30 Thread Andrey Ryabinin
From: Kirill Tkhai 

Show cpu_i flags as flags of vcpu_i.

Extracted from "Initial patch". Merged several reworks.

TODO: Maybe replace/rework on_each_cpu() with smp_call_function_single().
Then we won't need to split c_start() in the previous patch (as the call
function will be called right before a specific cpu is prepared
to be shown). This should be rather easy.
[aryabinin: Don't see what it buys us, so I didn't try to implement it]

Signed-off-by: Kirill Tkhai 

https://jira.sw.ru/browse/PSBM-121823
[aryabinin:vz8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/kernel/cpu/proc.c | 63 +++---
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index d6b17a60acf6..4fe1577d5e6f 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -4,6 +4,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "cpu.h"
 
@@ -58,10 +60,54 @@ extern void __do_cpuid_fault(unsigned int op, unsigned int 
count,
 unsigned int *eax, unsigned int *ebx,
 unsigned int *ecx, unsigned int *edx);
 
+struct cpu_flags {
+   u32 val[NCAPINTS];
+};
+
+static DEFINE_PER_CPU(struct cpu_flags, cpu_flags);
+
+static void init_cpu_flags(void *dummy)
+{
+   int cpu = smp_processor_id();
+   struct cpu_flags *flags = _cpu(cpu_flags, cpu);
+   struct cpuinfo_x86 *c = _data(cpu);
+   unsigned int eax, ebx, ecx, edx;
+
+   memcpy(flags->val, c->x86_capability, NCAPINTS * sizeof(u32));
+
+   /*
+* Clear feature bits masked using cpuid masking/faulting.
+*/
+
+   if (c->cpuid_level >= 0x0001) {
+   __do_cpuid_fault(0x0001, 0, , , , );
+   flags->val[4] &= ecx;
+   flags->val[0] &= edx;
+   }
+
+   if (c->cpuid_level >= 0x0007) {
+   __do_cpuid_fault(0x0007, 0, , , , );
+   flags->val[9] &= ebx;
+   }
+
+   if ((c->extended_cpuid_level & 0x) == 0x8000 &&
+   c->extended_cpuid_level >= 0x8001) {
+   __do_cpuid_fault(0x8001, 0, , , , );
+   flags->val[6] &= ecx;
+   flags->val[1] &= edx;
+   }
+
+   if (c->cpuid_level >= 0x000d) {
+   __do_cpuid_fault(0x000d, 1, , , , );
+   flags->val[10] &= eax;
+   }
+}
+
 static int show_cpuinfo(struct seq_file *m, void *v)
 {
struct cpuinfo_x86 *c = v;
unsigned int cpu;
+   int is_super = ve_is_super(get_exec_env());
int i;
 
cpu = c->cpu_index;
@@ -103,7 +149,10 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 
seq_puts(m, "flags\t\t:");
for (i = 0; i < 32*NCAPINTS; i++)
-   if (cpu_has(c, i) && x86_cap_flags[i] != NULL)
+   if (x86_cap_flags[i] != NULL &&
+   ((is_super && cpu_has(c, i)) ||
+(!is_super && test_bit(i, (unsigned long *)
+   _cpu(cpu_flags, 
cpu)
seq_printf(m, " %s", x86_cap_flags[i]);
 
seq_puts(m, "\nbugs\t\t:");
@@ -145,18 +194,24 @@ static int show_cpuinfo(struct seq_file *m, void *v)
return 0;
 }
 
-static void *c_start(struct seq_file *m, loff_t *pos)
+static void *__c_start(struct seq_file *m, loff_t *pos)
 {
*pos = cpumask_next(*pos - 1, cpu_online_mask);
-   if ((*pos) < nr_cpu_ids)
+   if (bitmap_weight(cpumask_bits(cpu_online_mask), *pos) < 
num_online_vcpus())
return _data(*pos);
return NULL;
 }
 
+static void *c_start(struct seq_file *m, loff_t *pos)
+{
+   on_each_cpu(init_cpu_flags, NULL, 1);
+   return __c_start(m, pos);
+}
+
 static void *c_next(struct seq_file *m, void *v, loff_t *pos)
 {
(*pos)++;
-   return c_start(m, pos);
+   return __c_start(m, pos);
 }
 
 static void c_stop(struct seq_file *m, void *v)
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 2/3] x86: make ARCH_[SET|GET]_CPUID friends with /proc/vz/cpuid_override

2020-10-30 Thread Andrey Ryabinin
We are using cpuid faults to emulate cpuid in containers. This
conflicts with arch_prctl(ARCH_SET_CPUID, 0), which allows enabling
cpuid faulting so that the cpuid instruction causes SIGSEGV.

Add a TIF_CPUID_OVERRIDE thread info flag which is set on all
!ve0 tasks, and check this flag along with TIF_NOCPUID to
decide whether we need to enable/disable cpuid faults or not.
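
In effect, cpuid faulting stays on whenever either flag is set: a container
task calling arch_prctl(ARCH_SET_CPUID, 1) clears TIF_NOCPUID, but
enable_cpuid() no longer switches faulting off because TIF_CPUID_OVERRIDE is
still set, so the /proc/vz/cpuid_override emulation keeps working underneath.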

https://jira.sw.ru/browse/PSBM-121823
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/include/asm/thread_info.h |  4 +++-
 arch/x86/kernel/cpuid_fault.c  |  3 ++-
 arch/x86/kernel/process.c  | 13 +
 arch/x86/kernel/traps.c|  3 +++
 kernel/ve/ve.c |  1 +
 5 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index c0da378eed8b..6ffb64d25383 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -92,6 +92,7 @@ struct thread_info {
 #define TIF_NOCPUID15  /* CPUID is not accessible in userland 
*/
 #define TIF_NOTSC  16  /* TSC is not accessible in userland */
 #define TIF_IA32   17  /* IA32 compatibility process */
+#define TIF_CPUID_OVERRIDE 18  /* CPUID emulation enabled */
 #define TIF_NOHZ   19  /* in adaptive nohz mode */
 #define TIF_MEMDIE 20  /* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG 21  /* idle is polling for TIF_NEED_RESCHED 
*/
@@ -122,6 +123,7 @@ struct thread_info {
 #define _TIF_NOCPUID   (1 << TIF_NOCPUID)
 #define _TIF_NOTSC (1 << TIF_NOTSC)
 #define _TIF_IA32  (1 << TIF_IA32)
+#define _TIF_CPUID_OVERRIDE(1 << TIF_CPUID_OVERRIDE)
 #define _TIF_NOHZ  (1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
@@ -153,7 +155,7 @@ struct thread_info {
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE   \
(_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP| \
-_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE)
+_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE | _TIF_CPUID_OVERRIDE)
 
 /*
  * Avoid calls to __switch_to_xtra() on UP as STIBP is not evaluated.
diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
index 339e2638c3b8..1e8ffacc4412 100644
--- a/arch/x86/kernel/cpuid_fault.c
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -6,7 +6,8 @@
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
 
 struct cpuid_override_entry {
unsigned int op;
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e5c5b1d724ab..788b9b8f8f9c 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -209,7 +209,8 @@ static void set_cpuid_faulting(bool on)
 static void disable_cpuid(void)
 {
preempt_disable();
-   if (!test_and_set_thread_flag(TIF_NOCPUID)) {
+   if (!test_and_set_thread_flag(TIF_NOCPUID) ||
+   test_thread_flag(TIF_CPUID_OVERRIDE)) {
/*
 * Must flip the CPU state synchronously with
 * TIF_NOCPUID in the current running context.
@@ -222,7 +223,8 @@ static void disable_cpuid(void)
 static void enable_cpuid(void)
 {
preempt_disable();
-   if (test_and_clear_thread_flag(TIF_NOCPUID)) {
+   if (test_and_clear_thread_flag(TIF_NOCPUID) &&
+   !test_thread_flag(TIF_CPUID_OVERRIDE)) {
/*
 * Must flip the CPU state synchronously with
 * TIF_NOCPUID in the current running context.
@@ -505,6 +507,7 @@ void __switch_to_xtra(struct task_struct *prev_p, struct 
task_struct *next_p)
 {
struct thread_struct *prev, *next;
unsigned long tifp, tifn;
+   bool prev_cpuid, next_cpuid;
 
prev = &prev_p->thread;
next = &next_p->thread;
@@ -529,8 +532,10 @@ void __switch_to_xtra(struct task_struct *prev_p, struct 
task_struct *next_p)
if ((tifp ^ tifn) & _TIF_NOTSC)
cr4_toggle_bits_irqsoff(X86_CR4_TSD);
 
-   if ((tifp ^ tifn) & _TIF_NOCPUID)
-   set_cpuid_faulting(!!(tifn & _TIF_NOCPUID));
+   prev_cpuid = (tifp & _TIF_NOCPUID) || (tifp & _TIF_CPUID_OVERRIDE);
+   next_cpuid = (tifn & _TIF_NOCPUID) || (tifn & _TIF_CPUID_OVERRIDE);
+   if (prev_cpuid != next_cpuid)
+   set_cpuid_faulting(next_cpuid);
 
if (likely(!((tifp | tifn) & _TIF_SPEC_FORCE_UPDATE))) {
__speculation_ctrl_update(tifp, tifn);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c43e3b80e50f..d0b379cf0484 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -526,6 +526,9 @@ static int check_cpuid_fault(struct pt_regs *regs, long 
error_code)
if (error_code != 0)
 

[Devel] [PATCH vz8 1/3] arch/x86: introduce cpuid override

2020-10-30 Thread Andrey Ryabinin
From: Vladimir Davydov 

Port diff-arch-x86-introduce-cpuid-override

Recent Intel CPUs dropped CPUID masking, which is required for flex
migration, in favor of CPUID faulting. So we need to support the latter
in the kernel.

This patch adds a user-writable file, /proc/vz/cpuid_override, which
contains the CPUID override table. Each table entry must have the
following format:

  op[ count]: eax ebx ecx edx

where @op and optional @count define a CPUID function, whose output one
would like to override (@op and @count are loaded to EAX and ECX
registers respectively before calling CPUID); @eax, @ebx, @ecx, @edx -
the desired CPUID output for the specified function. All values must be
in HEX, 0x prefix is optional.
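
Roughly, an entry applies when @op matches and, if @count was given, it
matches as well; a sketch of that check using the struct fields from
cpuid_fault.c below (illustrative only):

	static bool entry_matches(const struct cpuid_override_entry *e,
				  unsigned int op, unsigned int count)
	{
		return e->op == op && (!e->has_count || e->count == count);
	}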

Notes:

 - the file is only present on hosts that support CPUID faulting;
 - CPUID faulting is always enabled if it is supported;
 - CPUID output is overridden on all present CPUs;
 - the maximal number of entries one can override equals 16;
 - each write(2) to the file removes all existing entries before adding
   new ones, so the whole table must be written in one write(2); in
   particular writing an empty line to the file removes all existing
   rules.

Example:

Suppose we want to mask out SSE2 (CPUID.01H:EDX:26) and RDTSCP
(CPUID.80000001H:EDX:27). Then we should execute the following sequence:

 - get the current cpuid value:

   # cpuid -r | grep -e '^\s*0x00000001' -e '^\s*0x80000001' | head -n 2
  0x00000001 0x00: eax=0x000306e4 ebx=0x00200800 ecx=0x7fbee3ff 
edx=0xbfebfbff
  0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000001 
edx=0x2c100800

 - clear the feature bits we want to mask out and write the result to
   /proc/vz/cpuid_override:

   # cat >/proc/vz/cpuid_override <<EOF
   ...
   EOF

https://jira.sw.ru/browse/PSBM-28682

Signed-off-by: Vladimir Davydov 

Acked-by: Cyrill Gorcunov 
=

https://jira.sw.ru/browse/PSBM-33638

Signed-off-by: Vladimir Davydov 
Rebase:
Signed-off-by: Kirill Tkhai 

https://jira.sw.ru/browse/PSBM-121823
[aryabinin: vz8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/include/asm/msr-index.h |   1 +
 arch/x86/include/asm/traps.h |   2 +
 arch/x86/kernel/Makefile |   1 +
 arch/x86/kernel/cpu/proc.c   |   4 +
 arch/x86/kernel/cpuid_fault.c| 258 +++
 arch/x86/kernel/traps.c  |  24 +++
 6 files changed, 290 insertions(+)
 create mode 100644 arch/x86/kernel/cpuid_fault.c

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 6a21c227775c..9668ec6a064d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -114,6 +114,7 @@
 
 #define MSR_IA32_BBL_CR_CTL    0x0119
 #define MSR_IA32_BBL_CR_CTL3   0x011e
+#define MSR_MISC_FEATURES_ENABLES  0x0140
 
 #define MSR_IA32_TSX_CTRL  0x0122
 #define TSX_CTRL_RTM_DISABLE   BIT(0)  /* Disable RTM feature */
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 0ae298ea01a1..0282c81719e7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -124,6 +124,8 @@ void __noreturn handle_stack_overflow(const char *message,
  unsigned long fault_address);
 #endif
 
+void do_cpuid_fault(struct pt_regs *);
+
 /* Interrupts/Exceptions */
 enum {
X86_TRAP_DE = 0,/*  0, Divide-by-zero */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 431d8c6e641d..b9451b653b04 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -63,6 +63,7 @@ obj-y += pci-iommu_table.o
 obj-y  += resource.o
 obj-y  += irqflags.o
 obj-y  += spec_ctrl.o
+obj-y  += cpuid_fault.o
 
 obj-y  += process.o
 obj-y  += fpu/
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 2c8522a39ed5..d6b17a60acf6 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -54,6 +54,10 @@ static void show_cpuinfo_misc(struct seq_file *m, struct 
cpuinfo_x86 *c)
 }
 #endif
 
+extern void __do_cpuid_fault(unsigned int op, unsigned int count,
+unsigned int *eax, unsigned int *ebx,
+unsigned int *ecx, unsigned int *edx);
+
 static int show_cpuinfo(struct seq_file *m, void *v)
 {
struct cpuinfo_x86 *c = v;
diff --git a/arch/x86/kernel/cpuid_fault.c b/arch/x86/kernel/cpuid_fault.c
new file mode 100644
index ..339e2638c3b8
--- /dev/null
+++ b/arch/x86/kernel/cpuid_fault.c
@@ -0,0 +1,258 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct cpuid_override_entry {
+   unsigned int op;
+   unsigned int count;
+   bool has_count;
+   unsigned int eax;
+   unsigned int ebx;
+   unsigned i

[Devel] [PATCH vz8] userns: associate user_struct with the user_namespace

2020-10-27 Thread Andrey Ryabinin
user_struct contains per-user counters like processes, files,
sigpending, etc., which we wouldn't like to share across different
namespaces.
Make the uid hashtable per-userns instead of global.
This is a partial revert of 7b44ab978b77a
 ("userns: Disassociate user_struct from the user_namespace.")

Signed-off-by: Andrey Ryabinin 
---
 include/linux/sched/user.h |  1 +
 include/linux/user_namespace.h |  4 
 kernel/user.c  | 22 +-
 kernel/user_namespace.c| 13 +
 4 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 9a9536fd4fe3..4bf5a723f138 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -60,6 +60,7 @@ extern struct user_struct *find_user(kuid_t);
 extern struct user_struct root_user;
 #define INIT_USER (&root_user)
 
+extern struct user_struct * alloc_uid_ns(struct user_namespace *ns, kuid_t);
 
 /* per-UID process charging. */
 extern struct user_struct * alloc_uid(kuid_t);
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index b004d5aeba1f..30493179b756 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -15,6 +15,9 @@
 #define UID_GID_MAP_MAX_BASE_EXTENTS 5
 #define UID_GID_MAP_MAX_EXTENTS 340
 
+#define UIDHASH_BITS   (CONFIG_BASE_SMALL ? 3 : 7)
+#define UIDHASH_SZ (1 << UIDHASH_BITS)
+
 struct uid_gid_extent {
u32 first;
u32 lower_first;
@@ -73,6 +76,7 @@ struct user_namespace {
struct uid_gid_map  gid_map;
struct uid_gid_map  projid_map;
atomic_tcount;
+   struct hlist_head   uidhash_table[UIDHASH_SZ];
struct user_namespace   *parent;
int level;
kuid_t  owner;
diff --git a/kernel/user.c b/kernel/user.c
index 0df9b1640b2a..f9f540484499 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -8,6 +8,7 @@
  * able to have per-user limits for system resources. 
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -74,14 +75,11 @@ EXPORT_SYMBOL_GPL(init_user_ns);
  * when changing user ID's (ie setuid() and friends).
  */
 
-#define UIDHASH_BITS   (CONFIG_BASE_SMALL ? 3 : 7)
-#define UIDHASH_SZ (1 << UIDHASH_BITS)
 #define UIDHASH_MASK   (UIDHASH_SZ - 1)
 #define __uidhashfn(uid)   (((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK)
-#define uidhashentry(uid)  (uidhash_table + __uidhashfn((__kuid_val(uid))))
+#define uidhashentry(ns, uid)  ((ns)->uidhash_table + 
__uidhashfn((__kuid_val(uid))))
 
 static struct kmem_cache *uid_cachep;
-struct hlist_head uidhash_table[UIDHASH_SZ];
 
 /*
  * The uidhash_lock is mostly taken from process context, but it is
@@ -155,9 +153,10 @@ struct user_struct *find_user(kuid_t uid)
 {
struct user_struct *ret;
unsigned long flags;
+   struct user_namespace *ns = current_user_ns();
 
spin_lock_irqsave(&uidhash_lock, flags);
-   ret = uid_hash_find(uid, uidhashentry(uid));
+   ret = uid_hash_find(uid, uidhashentry(ns, uid));
spin_unlock_irqrestore(&uidhash_lock, flags);
return ret;
 }
@@ -173,9 +172,9 @@ void free_uid(struct user_struct *up)
free_user(up, flags);
 }
 
-struct user_struct *alloc_uid(kuid_t uid)
+struct user_struct *alloc_uid_ns(struct user_namespace *ns, kuid_t uid)
 {
-   struct hlist_head *hashent = uidhashentry(uid);
+   struct hlist_head *hashent = uidhashentry(ns, uid);
struct user_struct *up, *new;
 
spin_lock_irq(&uidhash_lock);
@@ -215,6 +214,11 @@ struct user_struct *alloc_uid(kuid_t uid)
return NULL;
 }
 
+struct user_struct *alloc_uid(kuid_t uid)
+{
+   return alloc_uid_ns(current_user_ns(), uid);
+}
+
 static int __init uid_cache_init(void)
 {
int n;
@@ -223,11 +227,11 @@ static int __init uid_cache_init(void)
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
 
for(n = 0; n < UIDHASH_SZ; ++n)
-   INIT_HLIST_HEAD(uidhash_table + n);
+   INIT_HLIST_HEAD(init_user_ns.uidhash_table + n);
 
/* Insert the root user immediately (init already runs as root) */
spin_lock_irq(&uidhash_lock);
-   uid_hash_insert(&root_user, uidhashentry(GLOBAL_ROOT_UID));
+   uid_hash_insert(&root_user, uidhashentry(&init_user_ns, 
GLOBAL_ROOT_UID));
spin_unlock_irq(&uidhash_lock);
 
return 0;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 243fb390d744..459b88044c62 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -74,6 +74,7 @@ static void set_cred_user_ns(struct cred *cred, struct 
user_namespace *user_ns)
 int create_user_ns(struct cred *new)
 {
struct user_namespace *ns, *parent_ns = new->user_ns;
+   struct user_struct *new_user;
kuid_t owner = new->euid;
kgid_t group = new->egid;
struct ucounts *ucounts;
@@ -116,6 +117,17 @@ int create_user_ns(s

[Devel] [PATCH vz8 2/2] ve/fs/devmnt: process mount options

2020-10-26 Thread Andrey Ryabinin
From: Kirill Tkhai 

Port patch diff-ve-fs-process-mount-options-check-and-insert by Maxim Patlasov:

The patch implements two kinds of mount option processing: check and insert.
Check succeeds if and only if each option supplied by the CT user is present
among the options listed in allowed_options.

Insert transforms mount options supplied by the CT user like this:

 new options = hidden options + options supplied by the CT user

Check is performed both for mount and remount; insert only for mount. All
this happens only for mount/remount inside a CT, and only if a matching
ve_devmnt struct is found in ve->devmnt_list (looked up by 'dev').
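
For illustration (made-up option strings): with allowed_options
"barrier=1,data=ordered" and hidden_options "balloon_ino=12", a CT mount
with "barrier=1" passes the check and is rewritten to
"balloon_ino=12,barrier=1", while a mount with "errors=continue" fails the
check with -EPERM.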

https://jira.sw.ru/browse/PSBM-32273

Signed-off-by: Kirill Tkhai 
Acked-by: Maxim Patlasov 

+++
ve/fs/devmnt: allow more than one mount option inside a CT

strsep() changes provided string: puts '\0' instead of separators,
thus after successful call to ve_devmnt_check() we insert
only first provided mount options, ignoring others.

mFixes: bc4143b ("ve/fs/devmnt: process mount options")

Found during implementation of
https://jira.sw.ru/browse/PSBM-40075

Signed-off-by: Konstantin Khorenko 
Reviewed-by: Kirill Tkhai 

https://jira.sw.ru/browse/PSBM-108196
Signed-off-by: Andrey Ryabinin 
---
 fs/namespace.c | 146 -
 fs/super.c |  16 +
 include/linux/fs.h |   2 +
 3 files changed, 163 insertions(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d355b5921d1e..c24ab7597a39 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 
+#include 
+
 #include "pnode.h"
 #include "internal.h"
 
@@ -2344,6 +2346,148 @@ static int change_mount_flags(struct vfsmount *mnt, int 
ms_flags)
return error;
 }
 
+#ifdef CONFIG_VE
+/*
+ * Returns first occurrence of needle in haystack separated by sep,
+ * or NULL if not found
+ */
+static char *strstr_separated(char *haystack, char *needle, char sep)
+{
+   int needle_len = strlen(needle);
+
+   while (haystack) {
+   if (!strncmp(haystack, needle, needle_len) &&
+   (haystack[needle_len] == 0 || /* end-of-line or */
+haystack[needle_len] == sep)) /* separator */
+   return haystack;
+
+   haystack = strchr(haystack, sep);
+   if (haystack)
+   haystack++;
+   }
+
+   return NULL;
+}
+
+static int ve_devmnt_check(char *options, char *allowed)
+{
+   char *p;
+   char *tmp_options;
+
+   if (!options || !*options)
+   return 0;
+
+   if (!allowed)
+   return -EPERM;
+
+   /* strsep() changes provided string: puts '\0' instead of separators */
+   tmp_options = kstrdup(options, GFP_KERNEL);
+   if (!tmp_options)
+   return -ENOMEM;
+
+   while ((p = strsep(&tmp_options, ",")) != NULL) {
+   if (!*p)
+   continue;
+
+   if (!strstr_separated(allowed, p, ',')) {
+   kfree(tmp_options);
+   return -EPERM;
+   }
+   }
+
+   kfree(tmp_options);
+   return 0;
+}
+
+static int ve_devmnt_insert(char *options, char *hidden)
+{
+   int options_len;
+   int hidden_len;
+
+   if (!hidden)
+   return 0;
+
+   if (!options)
+   return -EAGAIN;
+
+   options_len = strlen(options);
+   hidden_len = strlen(hidden);
+
+   if (hidden_len + options_len + 2 > PAGE_SIZE)
+   return -EPERM;
+
+   memmove(options + hidden_len + 1, options, options_len);
+   memcpy(options, hidden, hidden_len);
+
+   options[hidden_len] = ',';
+   options[hidden_len + options_len + 1] = 0;
+
+   return 0;
+}
+
+int ve_devmnt_process(struct ve_struct *ve, dev_t dev, void **data_pp, int 
remount)
+{
+   void *data = *data_pp;
+   struct ve_devmnt *devmnt;
+   int err;
+again:
+   err = 1;
+   mutex_lock(&ve->devmnt_mutex);
+   list_for_each_entry(devmnt, &ve->devmnt_list, link) {
+   if (devmnt->dev == dev) {
+   err = ve_devmnt_check(data, devmnt->allowed_options);
+
+   if (!err && !remount)
+   err = ve_devmnt_insert(data, 
devmnt->hidden_options);
+
+   break;
+   }
+   }
+   mutex_unlock(&ve->devmnt_mutex);
+
+   switch (err) {
+   case -EAGAIN:
+   if (!(data = (void *)__get_free_page(GFP_KERNEL)))
+   return -ENOMEM;
+   *(char *)data = 0; /* the string must be zero-terminated */
+   goto again;
+   case 1:
+   if (*data_pp) {
+   ve_printk(VE_LOG_BOTH, KERN_WARNING "VE%s: no allowed "
+ "mount options found for device %u:%u\n",
+ ve->ve_name, MAJOR(dev), MI

[Devel] [PATCH vz8 1/2] ve/devmnt: Introduce ve::devmnt list #PSBM-108196

2020-10-26 Thread Andrey Ryabinin
From: Kirill Tkhai 

1)Porting patch "ve: mount option list" by Maxim Patlasov:

The patch adds new fields to ve_struct: devmnt_list and devmnt_mutex.
devmnt_list is the head of list of ve_devmnt structs. Each host block device
visible from CT can have no more than one struct ve_devmnt linked in
ve->devmnt_list. If ve_devmnt is present, it can be found by 'dev' field.

Each ve_devmnt struct may bear two strings: hidden and allowed options.
hidden_options will be automatically added to CT-user-supplied mount options
after checking allowed_options. Only options listed in allowed_options are
allowed.

devmnt_mutex is to protect operations on the list of ve_devmnt structs.

2)Porting patch "vecalls: VE_CONFIGURE_MOUNT_OPTIONS" by Maxim Patlasov.

Reworking the interface using cgroups. Each CT now has a file:

[ve_cgroup_mnt_pnt]/[CTID]/ve.mount_opts

for configuring permissions for a block device. Below is an example
permissions line:

"0 major:minor;1 balloon_ino=12,pfcache_csum,pfcache=/vz/pfcache;2 barrier=1"

Here, major:minor is the device, '1' starts the comma-separated list of
hidden options, and '2' starts the list of allowed ones.
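
A hypothetical write following that format (device numbers, CTID and the
cgroup mount point are made up; assuming the ve cgroup is mounted at
/sys/fs/cgroup/ve):

  # echo "0 253:1;1 balloon_ino=12;2 barrier=1,data=ordered" \
        > /sys/fs/cgroup/ve/100/ve.mount_opts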

https://jira.sw.ru/browse/PSBM-32273

Signed-off-by: Kirill Tkhai 
Acked-by: Maxim Patlasov 

+++
ve/cgroups: Align ve_cftypes assignments

For readability's sake; the other assignments are aligned already.

Signed-off-by: Cyrill Gorcunov 
Rebase: ktkhai@: Merged "ve: increase max length of ve.mount_opts string"

ve/devmnt: Add a ability to show ve.mount_opts

A user may want to see allowed mount options.
This patch allows that.

khorenko@:
* by default ve cgroup is not visible from inside a CT

* currently it's possible to mount ve cgroup inside a CT, but this is
  temporarily, we'll disable this in the scope of
  https://jira.sw.ru/browse/PSBM-34291

* this patch allows to see mount options via ve cgroup =>
  after PSBM-34291 is fixed, mount options will be visible only from ve0 (host)

* for host it's OK to see all hidden options

Signed-off-by: Kirill Tkhai 
Rebase: ktkhai@: Merged "ve: Strip unset options in ve.mount_opts"

[aryabinin: vz8 rebase]
https://jira.sw.ru/browse/PSBM-108196
Signed-off-by: Andrey Ryabinin 
---
 include/linux/ve.h |  11 +++
 kernel/ve/ve.c | 175 +
 2 files changed, 186 insertions(+)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index 5b1962ff4c66..1b6317275ca2 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -96,6 +96,17 @@ struct ve_struct {
 #endif
struct vdso_image   *vdso_64;
struct vdso_image   *vdso_32;
+
+   struct list_head    devmnt_list;
+   struct mutex        devmnt_mutex;
+};
+
+struct ve_devmnt {
+   struct list_head    link;
+
+   dev_t   dev;
+   char    *allowed_options;
+   char    *hidden_options; /* balloon_ino, etc. */
 };
 
 #define VE_MEMINFO_DEFAULT 1   /* default behaviour */
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index ac3dda55e9ae..935e13340051 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -9,6 +9,7 @@
  * 've.c' helper file performing VE sub-system initialization
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -643,6 +644,8 @@ static struct cgroup_subsys_state *ve_create(struct 
cgroup_subsys_state *parent_
 #ifdef CONFIG_COREDUMP
strcpy(ve->core_pattern, "core");
 #endif
+   INIT_LIST_HEAD(&ve->devmnt_list);
+   mutex_init(&ve->devmnt_mutex);
 
return &ve->css;
 
@@ -687,10 +690,33 @@ static void ve_offline(struct cgroup_subsys_state *css)
ve->ve_name = NULL;
 }
 
+static void ve_devmnt_free(struct ve_devmnt *devmnt)
+{
+   if (!devmnt)
+   return;
+
+   kfree(devmnt->allowed_options);
+   kfree(devmnt->hidden_options);
+   kfree(devmnt);
+}
+
+static void free_ve_devmnts(struct ve_struct *ve)
+{
+   while (!list_empty(&ve->devmnt_list)) {
+   struct ve_devmnt *devmnt;
+
+       devmnt = list_first_entry(&ve->devmnt_list, struct ve_devmnt, 
link);
+       list_del(&devmnt->link);
+   ve_devmnt_free(devmnt);
+   }
+}
+
 static void ve_destroy(struct cgroup_subsys_state *css)
 {
struct ve_struct *ve = css_to_ve(css);
 
+   free_ve_devmnts(ve);
+
kmapset_unlink(&ve->sysfs_perms_key, &sysfs_ve_perms_set);
ve_log_destroy(ve);
ve_free_vdso(ve);
@@ -1085,6 +1111,148 @@ static u64 ve_netns_avail_nr_read(struct 
cgroup_subsys_state *css, struct cftype
return atomic_read(&css_to_ve(css)->netns_avail_nr);
 }
 
+static int ve_mount_opts_read(struct seq_file *sf, void *v)
+{
+   struct ve_struct *ve = css_to_ve(seq_css(sf));
+   struct ve_devmnt *devmnt;
+
+   if (ve_is_super(ve))
+   return -ENODEV;
+
+   mutex_lock(&ve->devmnt_mutex);
+   list_for_each_entry(devmnt, &ve->devmnt_list, link) {
+   dev_t dev = d

[Devel] [PATCH vz8 2/4] ia32: add 32-bit vdso virtualization.

2020-10-22 Thread Andrey Ryabinin
Similarly to the 64-bit vdso, make the 32-bit vdso mapping per-ve.
This will allow per-container modification of the linux version
in the .note section of vdso and of the monotonic time.

https://jira.sw.ru/browse/PSBM-121668
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/entry/vdso/vma.c|  4 ++--
 arch/x86/kernel/process_64.c |  2 +-
 include/linux/ve.h   |  1 +
 kernel/ve/ve.c   | 35 +--
 4 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index c48deffc1473..538c6730f436 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -56,7 +56,7 @@ static void vdso_fix_landing(const struct vdso_image *image,
struct vm_area_struct *new_vma)
 {
 #if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
-   if (in_ia32_syscall() && image == &vdso_image_32) {
+   if (in_ia32_syscall() && image == get_exec_env()->vdso_32) {
struct pt_regs *regs = current_pt_regs();
unsigned long vdso_land = image->sym_int80_landing_pad;
unsigned long old_land_addr = vdso_land +
@@ -281,7 +281,7 @@ static int load_vdso32(void)
if (vdso32_enabled != 1)  /* Other values all mean "disabled" */
return 0;
 
-   return map_vdso(&vdso_image_32, 0);
+   return map_vdso(get_exec_env()->vdso_32, 0);
 }
 #endif
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a010d4b9d126..22215141 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -686,7 +686,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, 
unsigned long arg2)
 # endif
 # if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
case ARCH_MAP_VDSO_32:
-   return prctl_map_vdso(&vdso_image_32, arg2);
+   return prctl_map_vdso(get_exec_env()->vdso_32, arg2);
 # endif
case ARCH_MAP_VDSO_64:
return prctl_map_vdso(get_exec_env()->vdso_64, arg2);
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 0e85a4032c3a..5b1962ff4c66 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -95,6 +95,7 @@ struct ve_struct {
struct cn_private   *cn;
 #endif
struct vdso_image   *vdso_64;
+   struct vdso_image   *vdso_32;
 };
 
 #define VE_MEMINFO_DEFAULT 1   /* default behaviour */
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 186deb3f88f4..03b8d126a0ed 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -58,6 +58,7 @@ struct ve_struct ve0 = {
.netns_max_nr   = INT_MAX,
.meminfo_val= VE_MEMINFO_SYSTEM,
.vdso_64= (struct vdso_image*)&vdso_image_64,
+   .vdso_32= (struct vdso_image*)&vdso_image_32,
 };
 EXPORT_SYMBOL(ve0);
 
@@ -540,13 +541,12 @@ static __u64 ve_setup_iptables_mask(__u64 init_mask)
 }
 #endif
 
-static int copy_vdso(struct ve_struct *ve)
+static int copy_vdso(struct vdso_image **vdso_dst, const struct vdso_image 
*vdso_src)
 {
-   const struct vdso_image *vdso_src = &vdso_image_64;
struct vdso_image *vdso;
void *vdso_data;
 
-   if (ve->vdso_64)
+   if (*vdso_dst)
return 0;
 
vdso = kmemdup(vdso_src, sizeof(*vdso), GFP_KERNEL);
@@ -563,10 +563,22 @@ static int copy_vdso(struct ve_struct *ve)
 
vdso->data = vdso_data;
 
-   ve->vdso_64 = vdso;
+   *vdso_dst = vdso;
return 0;
 }
 
+static void ve_free_vdso(struct ve_struct *ve)
+{
+   if (ve->vdso_64 && ve->vdso_64 != &vdso_image_64) {
+       kfree(ve->vdso_64->data);
+       kfree(ve->vdso_64);
+   }
+   if (ve->vdso_32 && ve->vdso_32 != &vdso_image_32) {
+   kfree(ve->vdso_32->data);
+   kfree(ve->vdso_32);
+   }
+}
+
 static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state 
*parent_css)
 {
struct ve_struct *ve = &ve0;
@@ -592,7 +604,10 @@ static struct cgroup_subsys_state *ve_create(struct 
cgroup_subsys_state *parent_
if (err)
goto err_log;
 
-   if (copy_vdso(ve))
+   if (copy_vdso(&ve->vdso_64, &vdso_image_64))
+       goto err_vdso;
+
+   if (copy_vdso(&ve->vdso_32, &vdso_image_32))
goto err_vdso;
 
ve->features = VE_FEATURES_DEF;
@@ -619,6 +634,7 @@ static struct cgroup_subsys_state *ve_create(struct 
cgroup_subsys_state *parent_
return &ve->css;
 
 err_vdso:
+   ve_free_vdso(ve);
ve_log_destroy(ve);
 err_log:
free_percpu(ve->sched_lat_ve.cur);
@@ -658,15 +674,6 @@ static void ve_offline(struct cgroup_subsys_state *css)
ve->ve_name = NULL;
 }
 
-static void ve_free_vdso(struct ve_struct *ve)
-{
-   if (ve->vdso_64 == &vdso_image_64)
-   return;
-
-   kfree(ve->vdso_64->data);
-   kfree(ve->vdso_6

[Devel] [PATCH vz8 3/4] ve: patch linux_version_code in vdso

2020-10-22 Thread Andrey Ryabinin
On a write to the ve.os_release file, patch the linux_version_code
in the .note section of vdso.
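
The stored value follows the usual KERNEL_VERSION packing used in the patch,
i.e. (a << 16) + (b << 8) + c for a release string "a.b.c"; for example
(made-up version) "3.10.0" becomes 0x030a00.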

https://jira.sw.ru/browse/PSBM-121668
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/entry/vdso/vdso-note.S   | 2 ++
 arch/x86/entry/vdso/vdso2c.c  | 1 +
 arch/x86/entry/vdso/vdso32/note.S | 2 ++
 arch/x86/include/asm/vdso.h   | 1 +
 kernel/ve/ve.c| 7 +++
 5 files changed, 13 insertions(+)

diff --git a/arch/x86/entry/vdso/vdso-note.S b/arch/x86/entry/vdso/vdso-note.S
index 79a071e4357e..c0e6e65f9fec 100644
--- a/arch/x86/entry/vdso/vdso-note.S
+++ b/arch/x86/entry/vdso/vdso-note.S
@@ -7,6 +7,8 @@
 #include 
 #include 
 
+   .globl linux_version_code
 ELFNOTE_START(Linux, 0, "a")
+linux_version_code:
.long LINUX_VERSION_CODE
 ELFNOTE_END
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 4674f58581a1..7fab0bd96ac1 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -109,6 +109,7 @@ struct vdso_sym required_syms[] = {
{"__kernel_sigreturn", true},
{"__kernel_rt_sigreturn", true},
{"int80_landing_pad", true},
+   {"linux_version_code", true},
 };
 
 __attribute__((format(printf, 1, 2))) __attribute__((noreturn))
diff --git a/arch/x86/entry/vdso/vdso32/note.S 
b/arch/x86/entry/vdso/vdso32/note.S
index 9fd51f206314..096b62f14863 100644
--- a/arch/x86/entry/vdso/vdso32/note.S
+++ b/arch/x86/entry/vdso/vdso32/note.S
@@ -10,7 +10,9 @@
 /* Ideally this would use UTS_NAME, but using a quoted string here
doesn't work. Remember to change this when changing the
kernel's name. */
+   .globl linux_version_code
 ELFNOTE_START(Linux, 0, "a")
+linux_version_code:
.long LINUX_VERSION_CODE
 ELFNOTE_END
 
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 27566e57e87d..92c7ac06828e 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -27,6 +27,7 @@ struct vdso_image {
long sym___kernel_rt_sigreturn;
long sym___kernel_vsyscall;
long sym_int80_landing_pad;
+   long sym_linux_version_code;
 };
 
 #ifdef CONFIG_X86_64
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 03b8d126a0ed..98c2e7e3d2c6 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -954,6 +954,7 @@ static ssize_t ve_os_release_write(struct kernfs_open_file 
*of, char *buf,
 {
struct cgroup_subsys_state *css = of_css(of);
struct ve_struct *ve = css_to_ve(css);
+   int n1, n2, n3, new_version;
char *release;
int ret = 0;
 
@@ -964,6 +965,12 @@ static ssize_t ve_os_release_write(struct kernfs_open_file 
*of, char *buf,
goto up_opsem;
}
 
+   if (sscanf(buf, "%d.%d.%d", &n1, &n2, &n3) == 3) {
+   new_version = ((n1 << 16) + (n2 << 8)) + n3;
+   *((int *)(ve->vdso_64->data + 
ve->vdso_64->sym_linux_version_code)) = new_version;
+   *((int *)(ve->vdso_32->data + 
ve->vdso_32->sym_linux_version_code)) = new_version;
+   }
+
down_write(&uts_sem);
release = ve->ve_ns->uts_ns->name.release;
strncpy(release, buf, __NEW_UTS_LEN);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 1/4] ve, x86_64: add per-ve vdso mapping.

2020-10-22 Thread Andrey Ryabinin
Make the vdso mapping per-ve. This will allow per-container modification
of the linux version in the .note section of vdso and of the monotonic time.
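
Conceptually (a sketch using names from the patch below, not the literal
implementation): each VE gets its own writable copy of the image and its
data, and exec maps the per-VE copy instead of the global one:

	ve->vdso_64 = kmemdup(&vdso_image_64, sizeof(*ve->vdso_64), GFP_KERNEL);
	ve->vdso_64->data = kmemdup(vdso_image_64.data, vdso_image_64.size,
				    GFP_KERNEL);
	/* later, at exec time: */
	map_vdso_randomized(get_exec_env()->vdso_64);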

https://jira.sw.ru/browse/PSBM-121668
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/entry/vdso/vma.c|  3 ++-
 arch/x86/kernel/process_64.c |  2 +-
 include/linux/ve.h   |  2 ++
 kernel/ve/ve.c   | 43 
 4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index eb3d85f87884..c48deffc1473 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -291,7 +291,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, 
int uses_interp)
if (!vdso64_enabled)
return 0;
 
-   return map_vdso_randomized(_image_64);
+
+   return map_vdso_randomized(get_exec_env()->vdso_64);
 }
 
 #ifdef CONFIG_COMPAT
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index c1c8d66cbe70..a010d4b9d126 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -689,7 +689,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, 
unsigned long arg2)
return prctl_map_vdso(&vdso_image_32, arg2);
 # endif
case ARCH_MAP_VDSO_64:
-   return prctl_map_vdso(&vdso_image_64, arg2);
+   return prctl_map_vdso(get_exec_env()->vdso_64, arg2);
 #endif
 
default:
diff --git a/include/linux/ve.h b/include/linux/ve.h
index ec7dc522ac1f..0e85a4032c3a 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct nsproxy;
 struct veip_struct;
@@ -93,6 +94,7 @@ struct ve_struct {
 #ifdef CONFIG_CONNECTOR
struct cn_private   *cn;
 #endif
+   struct vdso_image   *vdso_64;
 };
 
 #define VE_MEMINFO_DEFAULT 1   /* default behaviour */
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index cc26d3b2fa9b..186deb3f88f4 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -57,6 +57,7 @@ struct ve_struct ve0 = {
.netns_avail_nr = ATOMIC_INIT(INT_MAX),
.netns_max_nr   = INT_MAX,
.meminfo_val= VE_MEMINFO_SYSTEM,
+   .vdso_64= (struct vdso_image*)&vdso_image_64,
 };
 EXPORT_SYMBOL(ve0);
 
@@ -539,6 +540,33 @@ static __u64 ve_setup_iptables_mask(__u64 init_mask)
 }
 #endif
 
+static int copy_vdso(struct ve_struct *ve)
+{
+   const struct vdso_image *vdso_src = &vdso_image_64;
+   struct vdso_image *vdso;
+   void *vdso_data;
+
+   if (ve->vdso_64)
+   return 0;
+
+   vdso = kmemdup(vdso_src, sizeof(*vdso), GFP_KERNEL);
+   if (!vdso)
+   return -ENOMEM;
+
+   vdso_data = kmalloc(vdso_src->size, GFP_KERNEL);
+   if (!vdso_data) {
+   kfree(vdso);
+   return -ENOMEM;
+   }
+
+   memcpy(vdso_data, vdso_src->data, vdso_src->size);
+
+   vdso->data = vdso_data;
+
+   ve->vdso_64 = vdso;
+   return 0;
+}
+
 static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state 
*parent_css)
 {
struct ve_struct *ve = &ve0;
@@ -564,6 +592,9 @@ static struct cgroup_subsys_state *ve_create(struct 
cgroup_subsys_state *parent_
if (err)
goto err_log;
 
+   if (copy_vdso(ve))
+   goto err_vdso;
+
ve->features = VE_FEATURES_DEF;
ve->_randomize_va_space = ve0._randomize_va_space;
 
@@ -587,6 +618,8 @@ static struct cgroup_subsys_state *ve_create(struct 
cgroup_subsys_state *parent_
 
return &ve->css;
 
+err_vdso:
+   ve_log_destroy(ve);
 err_log:
free_percpu(ve->sched_lat_ve.cur);
 err_lat:
@@ -625,12 +658,22 @@ static void ve_offline(struct cgroup_subsys_state *css)
ve->ve_name = NULL;
 }
 
+static void ve_free_vdso(struct ve_struct *ve)
+{
-   if (ve->vdso_64 == &vdso_image_64)
+   return;
+
+   kfree(ve->vdso_64->data);
+   kfree(ve->vdso_64);
+}
+
 static void ve_destroy(struct cgroup_subsys_state *css)
 {
struct ve_struct *ve = css_to_ve(css);
 
kmapset_unlink(&ve->sysfs_perms_key, &sysfs_ve_perms_set);
ve_log_destroy(ve);
+   ve_free_vdso(ve);
 #if IS_ENABLED(CONFIG_BINFMT_MISC)
kfree(ve->binfmt_misc);
 #endif
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 4/4] ve: add per-ve CLOCK_MONOTONIC time via __vclock_gettime()

2020-10-22 Thread Andrey Ryabinin
Make it possible to read the virtualized container's CLOCK_MONOTONIC time
via __vclock_gettime(). Record the container's start time in the per-ve
vdso and subtract it from the host's time on clock read.
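
In effect the clock seen inside a CT becomes host_monotonic - ve_start_time;
with made-up numbers, if the host monotonic clock reads 1000s and the CT was
started when it read 400s, a reader inside the CT gets 600s. When
ve_start_time is still 0 (the host, or a CT that was never started) the
value is unchanged.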

https://jira.sw.ru/browse/PSBM-121668
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/entry/vdso/vclock_gettime.c | 27 +++
 arch/x86/entry/vdso/vdso2c.c |  1 +
 arch/x86/include/asm/vdso.h  |  1 +
 kernel/ve/ve.c   | 14 ++
 4 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index e48ca3afa091..be1de6c4cafa 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -24,6 +24,8 @@
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
+u64 ve_start_time;
+
 extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
 extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
 extern time_t __vdso_time(time_t *t);
@@ -227,6 +229,21 @@ notrace static int __always_inline do_realtime(struct 
timespec *ts)
return mode;
 }
 
+static inline void timespec_sub_ns(struct timespec *ts, u64 ns)
+{
+   if ((s64)ns <= 0) {
+       ts->tv_sec += __iter_div_u64_rem(-ns, NSEC_PER_SEC, &ns);
+       ts->tv_nsec = ns;
+   } else {
+       ts->tv_sec -= __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
+   if (ns) {
+   ts->tv_sec--;
+   ns = NSEC_PER_SEC - ns;
+   }
+   ts->tv_nsec = ns;
+   }
+}
+
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
unsigned long seq;
@@ -242,9 +259,7 @@ notrace static int __always_inline do_monotonic(struct 
timespec *ts)
ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));
 
-   ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-   ts->tv_nsec = ns;
-
+   timespec_sub_ns(ts, ve_start_time - ns);
return mode;
 }
 
@@ -260,12 +275,16 @@ notrace static void do_realtime_coarse(struct timespec 
*ts)
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
+   u64 ns;
unsigned long seq;
+
do {
seq = gtod_read_begin(gtod);
ts->tv_sec = gtod->monotonic_time_coarse_sec;
-   ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+   ns = gtod->monotonic_time_coarse_nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
+
+   timespec_sub_ns(ts, ve_start_time - ns);
 }
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 7fab0bd96ac1..c76141e9ca16 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -110,6 +110,7 @@ struct vdso_sym required_syms[] = {
{"__kernel_rt_sigreturn", true},
{"int80_landing_pad", true},
{"linux_version_code", true},
+   {"ve_start_time", true},
 };
 
 __attribute__((format(printf, 1, 2))) __attribute__((noreturn))
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 92c7ac06828e..9c265f79a126 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -28,6 +28,7 @@ struct vdso_image {
long sym___kernel_vsyscall;
long sym_int80_landing_pad;
long sym_linux_version_code;
+   long sym_ve_start_time;
 };
 
 #ifdef CONFIG_X86_64
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 98c2e7e3d2c6..ac3dda55e9ae 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -374,6 +374,17 @@ static int ve_start_kthreadd(struct ve_struct *ve)
return err;
 }
 
+static void ve_set_vdso_time(struct ve_struct *ve, u64 time)
+{
+   u64 *vdso_start_time;
+
+   vdso_start_time = ve->vdso_64->data + ve->vdso_64->sym_ve_start_time;
+   *vdso_start_time = time;
+
+   vdso_start_time = ve->vdso_32->data + ve->vdso_32->sym_ve_start_time;
+   *vdso_start_time = time;
+}
+
 /* under ve->op_sem write-lock */
 static int ve_start_container(struct ve_struct *ve)
 {
@@ -408,6 +419,8 @@ static int ve_start_container(struct ve_struct *ve)
if (ve->start_time == 0) {
ve->start_time = tsk->start_time;
ve->real_start_time = tsk->real_start_time;
+
+   ve_set_vdso_time(ve, ve->start_time);
}
/* The value is wrong, but it is never compared to process
 * start times */
@@ -1028,6 +1041,7 @@ static ssize_t ve_ts_write(struct kernfs_open_file *of, 
char *buf,
case VE_CF_CLOCK_MONOTONIC:
now = ktime_get_ns();
target = &ve->start_time;
+   ve_set_vdso_time(ve, now - delta_ns);

Re: [Devel] [PATCH rh8] mm/swap: activate swapped in pages on fault

2020-10-22 Thread Andrey Ryabinin



On 10/19/20 7:32 PM, Konstantin Khorenko wrote:
> From: Andrey Ryabinin 
> 
> Move swapped in anon pages directly to active list. This should
> help us to prevent anon thrashing. Recently swapped in pages
> has more chances to stay in memory.
> 
> https://pmc.acronis.com/browse/VSTOR-20859
> Signed-off-by: Andrey Ryabinin 
> [VvS RHEL7.8 rebase] context changes
> 
> (cherry picked from vz7 commit 134cd9b20a914080539e6310f76fe3f7b32bc710)
> Signed-off-by: Konstantin Khorenko 

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh8] ve: Virtualize /proc/swaps to watch from inside CT

2020-10-22 Thread Andrey Ryabinin



On 10/19/20 5:27 PM, Konstantin Khorenko wrote:
> From: Kirill Tkhai 
> 
> Customize /proc/swaps when showing from !ve_is_super.
> Extracted from "Initial patch".
> 
> Signed-off-by: Kirill Tkhai 
> 
> (cherry picked from vz7 commit 88c087f1fdb4b0f7934804269df36035ab6b83eb)
> Signed-off-by: Konstantin Khorenko 


Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] ms/aio: Kill aio_rw_vect_retry()

2020-10-14 Thread Andrey Ryabinin
From: Kent Overstreet 

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).

Signed-off-by: Kent Overstreet 
Cc: Zach Brown 
Cc: Felipe Balbi 
Cc: Greg Kroah-Hartman 
Cc: Mark Fasheh 
Cc: Joel Becker 
Cc: Rusty Russell 
Cc: Jens Axboe 
Cc: Asai Thambi S P 
Cc: Selvan Mani 
Cc: Sam Bradshaw 
Cc: Jeff Moyer 
Cc: Al Viro 
Cc: Benjamin LaHaise 
Signed-off-by: Benjamin LaHaise 

https://jira.sw.ru/browse/PSBM-121197
(cherry picked from commit 73a7075e3f6ec63dc359064eea6fd84f406cf2a5)
Signed-off-by: Andrey Ryabinin 
---
 drivers/staging/android/logger.c |  2 +-
 drivers/usb/gadget/inode.c   |  6 +--
 fs/aio.c | 92 +++-
 fs/block_dev.c   |  2 +-
 fs/nfs/direct.c  |  1 -
 fs/ocfs2/file.c  |  6 +--
 fs/read_write.c  |  3 --
 fs/udf/file.c|  2 +-
 include/linux/aio.h  |  2 -
 mm/page_io.c |  1 -
 net/socket.c |  2 +-
 11 files changed, 28 insertions(+), 91 deletions(-)

diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c
index 34519ea14b54..16a6c3179625 100644
--- a/drivers/staging/android/logger.c
+++ b/drivers/staging/android/logger.c
@@ -481,7 +481,7 @@ static ssize_t logger_aio_write(struct kiocb *iocb, const 
struct iovec *iov,
header.sec = now.tv_sec;
header.nsec = now.tv_nsec;
header.euid = current_euid();
-   header.len = min_t(size_t, iocb->ki_left, LOGGER_ENTRY_MAX_PAYLOAD);
+   header.len = min_t(size_t, iocb->ki_nbytes, LOGGER_ENTRY_MAX_PAYLOAD);
header.hdr_size = sizeof(struct logger_entry);
 
/* null writes succeed, return zero */
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 570c005062ab..09aae3c48d2c 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -709,11 +709,11 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(usb_endpoint_dir_in(>desc)))
return -EINVAL;
 
-   buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+   buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
-   return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
+   return ep_aio_rwtail(iocb, buf, iocb->ki_nbytes, epdata, iov, nr_segs);
 }
 
 static ssize_t
@@ -728,7 +728,7 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(!usb_endpoint_dir_in(>desc)))
return -EINVAL;
 
-   buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+   buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
diff --git a/fs/aio.c b/fs/aio.c
index c7e23a5832aa..f1b27fc5defb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -879,7 +879,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
goto out_put;
 
-   atomic_set(&req->ki_users, 2);
+   atomic_set(&req->ki_users, 1);
req->ki_ctx = ctx;
 
return req;
@@ -1279,75 +1279,9 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
return -EINVAL;
 }
 
-static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret)
-{
-   struct iovec *iov = &iocb->ki_iovec[iocb->ki_cur_seg];
-
-   BUG_ON(ret <= 0);
-
-   while (iocb->ki_cur_seg < iocb->ki_nr_segs && ret > 0) {
-   ssize_t this = min((ssize_t)iov->iov_len, ret);
-   iov->iov_base += this;
-   iov->iov_len -= this;
-   iocb->ki_left -= this;
-   ret -= this;
-   if (iov->iov_len == 0) {
-   iocb->ki_cur_seg++;
-   iov++;
-   }
-   }
-
-   /* the caller should not have done more io than what fit in
-* the remaining iovecs */
-   BUG_ON(ret > 0 && iocb->ki_left == 0);
-}
-
 typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);
 
-static ssize_t aio_rw_vect_retry(struct kiocb *iocb, int rw, aio_rw_op *rw_op)
-{
-   struct file *file = iocb->ki_filp;
-   struct address_space *mapping = file->f_mapping;
-   struct inode *inode = mapping->host;
-   ssize_t ret = 0;
-
-   /* This matches the pread()/pwrite() logic */
-   if (iocb->ki_pos < 0)
-   return -EINVAL;
-
-   if (rw == WRITE)
-   file_start_write(file);
-   do {
-       ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
-   iocb->ki_nr_segs - iocb-&g

[Devel] [PATCH vz8] mm/memcg: Use per-cpu stock charges for ->kmem and ->cache counters

2020-10-13 Thread Andrey Ryabinin
Currently we use per-cpu stocks to precharge the ->memory and ->memsw
counters. Do the same for the ->kmem and ->cache counters to decrease
contention on them.
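
The fast path keeps the same shape as for ->memory, just with a cached
amount per counter; roughly (sketch, field names as in the patch):

	if (memcg == stock->cached && stock->cache_nr_pages >= nr_pages) {
		stock->cache_nr_pages -= nr_pages;  /* served from the stock */
		return true;
	}
	/* otherwise charge the page counter and refill the stock */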

https://jira.sw.ru/browse/PSBM-101300
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 75 +
 1 file changed, 51 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 134cb27307f2..b3f97309ca39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2023,6 +2023,8 @@ EXPORT_SYMBOL(unlock_page_memcg);
 struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
+   unsigned int cache_nr_pages;
+   unsigned int kmem_nr_pages;
struct work_struct work;
unsigned long flags;
 #define FLUSHING_CACHED_CHARGE 0
@@ -2041,7 +2043,8 @@ static DEFINE_MUTEX(percpu_charge_mutex);
  *
  * returns true if successful, false otherwise.
  */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+   bool cache, bool kmem)
 {
struct memcg_stock_pcp *stock;
unsigned long flags;
@@ -2053,9 +2056,19 @@ static bool consume_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
local_irq_save(flags);
 
stock = this_cpu_ptr(&memcg_stock);
-   if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
-   stock->nr_pages -= nr_pages;
-   ret = true;
+   if (memcg == stock->cached) {
+   if (cache && stock->cache_nr_pages >= nr_pages) {
+   stock->cache_nr_pages -= nr_pages;
+   ret = true;
+   }
+   if (kmem && stock->kmem_nr_pages >= nr_pages) {
+   stock->kmem_nr_pages -= nr_pages;
+   ret = true;
+   }
+   if (!cache && !kmem && stock->nr_pages >= nr_pages) {
+   stock->nr_pages -= nr_pages;
+   ret = true;
+   }
}
 
local_irq_restore(flags);
@@ -2069,13 +2082,21 @@ static bool consume_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
 static void drain_stock(struct memcg_stock_pcp *stock)
 {
struct mem_cgroup *old = stock->cached;
+   unsigned long nr_pages = stock->nr_pages + stock->cache_nr_pages + 
stock->kmem_nr_pages;
+
+   if (stock->cache_nr_pages)
+       page_counter_uncharge(&old->cache, stock->cache_nr_pages);
+   if (stock->kmem_nr_pages)
+       page_counter_uncharge(&old->kmem, stock->kmem_nr_pages);
 
-   if (stock->nr_pages) {
-       page_counter_uncharge(&old->memory, stock->nr_pages);
+   if (nr_pages) {
+       page_counter_uncharge(&old->memory, nr_pages);
if (do_memsw_account())
-           page_counter_uncharge(&old->memsw, stock->nr_pages);
+           page_counter_uncharge(&old->memsw, nr_pages);
css_put_many(&old->css, stock->nr_pages);
stock->nr_pages = 0;
+   stock->kmem_nr_pages = 0;
+   stock->cache_nr_pages = 0;
}
stock->cached = NULL;
 }
@@ -2102,10 +2123,12 @@ static void drain_local_stock(struct work_struct *dummy)
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+   bool cache, bool kmem)
 {
struct memcg_stock_pcp *stock;
unsigned long flags;
+   unsigned long stock_nr_pages;
 
local_irq_save(flags);
 
@@ -2114,9 +2137,17 @@ static void refill_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
drain_stock(stock);
stock->cached = memcg;
}
-   stock->nr_pages += nr_pages;
 
-   if (stock->nr_pages > MEMCG_CHARGE_BATCH)
+   if (cache)
+   stock->cache_nr_pages += nr_pages;
+   else if (kmem)
+   stock->kmem_nr_pages += nr_pages;
+   else
+   stock->nr_pages += nr_pages;
+
+   stock_nr_pages = stock->nr_pages + stock->cache_nr_pages +
+   stock->kmem_nr_pages;
+   if (nr_pages > MEMCG_CHARGE_BATCH)
drain_stock(stock);
 
local_irq_restore(flags);
@@ -2143,9 +2174,11 @@ static void drain_all_stock(struct mem_cgroup 
*root_memcg)
for_each_online_cpu(cpu) {
struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
struct mem_cgroup *memcg;
+   unsigned long nr_pages = stock->nr_pages + stock-&g

[Devel] [PATCH rh7] mm/memcg: optimize mem_cgroup_enough_memory()

2020-10-12 Thread Andrey Ryabinin
mem_cgroup_enough_memory() iterates memcg's subtree to account
'MEM_CGROUP_STAT_CACHE - MEM_CGROUP_STAT_SHMEM'.

Fortunately we can just read the memcg->cache counter instead,
as it's hierarchical (it includes subgroups) and doesn't account
shmem.
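
In other words, the per-subtree sum of (page cache - shmem) that the old
code computed is exactly what the hierarchical ->cache page counter already
tracks, so a single page_counter_read() is enough.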

https://jira.sw.ru/browse/PSBM-120968
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6587cc2ef019..e36ad592b3c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4721,11 +4721,7 @@ int mem_cgroup_enough_memory(struct mem_cgroup *memcg, 
long pages)
free += page_counter_read(&memcg->dcache);
 
/* assume file cache is reclaimable */
-   free += mem_cgroup_recursive_stat2(memcg, MEM_CGROUP_STAT_CACHE);
-
-   /* but do not count shmem pages as they can't be purged,
-* only swapped out */
-   free -= mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SHMEM);
+   free += page_counter_read(&memcg->cache);
 
return free < pages ? -ENOMEM : 0;
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] kernel/cgroup: remove unnecessary cgroup_mutex lock.

2020-10-09 Thread Andrey Ryabinin
Stopping a container causes lockdep to complain (see the report below).
We can avoid it simply by removing the cgroup_mutex lock from
cgroup_mark_ve_root(). I believe it's not needed there; it seems to have
been added just in case.

 WARNING: possible circular locking dependency detected
 4.18.0-193.6.3.vz8.4.6+debug #1 Not tainted
 --
 vzctl/36606 is trying to acquire lock:
 88814b195ca0 (kn->count#338){}, at: kernfs_remove_by_name_ns+0x40/0x80

 but task is already holding lock:
 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390

 which lock already depends on the new lock.
 the existing dependency chain (in reverse order) is:

 -> #2 (cgroup_mutex){+.+.}:
__mutex_lock+0x163/0x13d0
cgroup_mark_ve_root+0x1d/0x2e0
ve_state_write+0xb81/0xdc0
cgroup_file_write+0x2da/0x7a0
kernfs_fop_write+0x255/0x410
vfs_write+0x157/0x460
ksys_write+0xb8/0x170
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 -> #1 (&ve->op_sem){}:
down_write+0xa0/0x3d0
ve_state_write+0x6b/0xdc0
cgroup_file_write+0x2da/0x7a0
kernfs_fop_write+0x255/0x410
vfs_write+0x157/0x460
ksys_write+0xb8/0x170
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 -> #0 (kn->count#338){}:
__lock_acquire+0x22cb/0x48c0
lock_acquire+0x14f/0x3b0
__kernfs_remove+0x61e/0x810
kernfs_remove_by_name_ns+0x40/0x80
cgroup_addrm_files+0x531/0x940
css_clear_dir+0xfb/0x200
kill_css+0x8f/0x120
cgroup_destroy_locked+0x246/0x5e0
cgroup_rmdir+0x2f/0x2c0
kernfs_iop_rmdir+0x131/0x1b0
vfs_rmdir+0x142/0x3c0
do_rmdir+0x2b2/0x340
do_syscall_64+0xa5/0x4d0
entry_SYSCALL_64_after_hwframe+0x6a/0xdf

 other info that might help us debug this:

 Chain exists of:
   kn->count#338 --> &ve->op_sem --> cgroup_mutex

  Possible unsafe locking scenario:

CPU0CPU1

   lock(cgroup_mutex);
lock(&ve->op_sem);
lock(cgroup_mutex);
   lock(kn->count#338);

*** DEADLOCK ***

 4 locks held by vzctl/36606:
  #0: 88813c02c890 (sb_writers#7){.+.+}, at: mnt_want_write+0x3c/0xa0
  #1: 88814414ad48 (&type->i_mutex_dir_key#5/1){+.+.}, at: 
do_rmdir+0x23c/0x340
  #2: 88811d3054e8 (&type->i_mutex_dir_key#5){}, at: 
vfs_rmdir+0xb6/0x3c0
  #3: 9cf75a90 (cgroup_mutex){+.+.}, at: cgroup_kn_lock_live+0x106/0x390

 Call Trace:
  dump_stack+0x9a/0xf0
  check_noncircular+0x317/0x3c0
  __lock_acquire+0x22cb/0x48c0
  lock_acquire+0x14f/0x3b0
  __kernfs_remove+0x61e/0x810
  kernfs_remove_by_name_ns+0x40/0x80
  cgroup_addrm_files+0x531/0x940
  css_clear_dir+0xfb/0x200
  kill_css+0x8f/0x120
  cgroup_destroy_locked+0x246/0x5e0
  cgroup_rmdir+0x2f/0x2c0
  kernfs_iop_rmdir+0x131/0x1b0
  vfs_rmdir+0x142/0x3c0
  do_rmdir+0x2b2/0x340
  do_syscall_64+0xa5/0x4d0
  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

https://jira.sw.ru/browse/PSBM-120670
Signed-off-by: Andrey Ryabinin 
---
 kernel/cgroup/cgroup.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8420f3547f1a..08137d43f3ab 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1883,7 +1883,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve)
struct css_set *cset;
struct cgroup *cgrp;
 
-   mutex_lock(&cgroup_mutex);
spin_lock_irq(&css_set_lock);
 
rcu_read_lock();
@@ -1899,7 +1898,6 @@ void cgroup_mark_ve_root(struct ve_struct *ve)
rcu_read_unlock();
 
spin_unlock_irq(&css_set_lock);
-   mutex_unlock(&cgroup_mutex);
 }
 
 static struct cgroup *cgroup_get_ve_root1(struct cgroup *cgrp)
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] memcg: Fix missing memcg->cache charges during page migration

2020-10-09 Thread Andrey Ryabinin
Since 44b7a8d33d66 ("mm: memcontrol: do not uncharge old page in
 page cache replacement") mem_cgroup_migrate() charges the newpage,
but the ->cache charge is missing there. Add it to fix negative ->cache
values, which lead to WARNINGs like the one below and to softlockups.

 WARNING: CPU: 14 PID: 1372 at mm/page_counter.c:62 
page_counter_cancel+0x26/0x30

 Call Trace:
  page_counter_uncharge+0x1d/0x30
  uncharge_batch+0x25c/0x2e0
  mem_cgroup_uncharge_list+0x64/0x90
  release_pages+0x33e/0x3c0
  __pagevec_release+0x1b/0x40
  truncate_inode_pages_range+0x358/0x8b0
  ext4_evict_inode+0x167/0x580 [ext4]
  evict+0xd2/0x1a0
  do_unlinkat+0x250/0x2e0
  do_syscall_64+0x5b/0x1a0
  entry_SYSCALL_64_after_hwframe+0x65/0xca

https://jira.sw.ru/browse/PSBM-120653
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df70c3bdd444..134cb27307f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6867,6 +6867,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page 
*newpage)
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
+   if (!PageAnon(newpage) && !PageSwapBacked(newpage))
+       page_counter_charge(&memcg->cache, nr_pages);
css_get_many(&memcg->css, nr_pages);
 
commit_charge(newpage, memcg, false);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] vmscan: don't report reclaim progress if there was no progress.

2020-10-09 Thread Andrey Ryabinin



On 10/9/20 10:22 AM, Vasily Averin wrote:
> Andrey,
> could you please clarify, is it required for vz8 too?
> 

vz8 doesn't need this. This part was removed by commit 0a0337e0d1 ("mm, oom: 
rework oom detection").

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] mm/filemap: fix potential memcg->cache charge leak

2020-10-09 Thread Andrey Ryabinin



On 10/9/20 10:14 AM, Vasily Averin wrote:
> vz8 is affected too, please cherry-pick 
> vz7 commit 79a5642e9d9a6bdbb56d9e0ee990fd96b7c8625c
> 

vz8 is not affected
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] mm/filemap: fix potential memcg->cache charge leak

2020-10-08 Thread Andrey Ryabinin
__add_to_page_cache_locked() after mem_cgroup_try_charge_cache()
uses mem_cgroup_cancel_charge() in one of the error paths.
This may lead to leaking a few memcg->cache charges.

Use mem_cgroup_cancel_cache_charge() to fix this.

https://jira.sw.ru/browse/PSBM-121046
Signed-off-by: Andrey Ryabinin 
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 53db13f236da..2bd5ca4e7528 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -732,7 +732,7 @@ static int __add_to_page_cache_locked(struct page *page,
error = radix_tree_maybe_preload(gfp_mask & GFP_RECLAIM_MASK);
if (error) {
if (!huge)
-   mem_cgroup_cancel_charge(page, memcg);
+   mem_cgroup_cancel_cache_charge(page, memcg);
return error;
}
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH rh7] tun: Silence allocation failure if user asked for too big header

2020-10-06 Thread Andrey Ryabinin


On 10/6/20 11:17 AM, Konstantin Khorenko wrote:
> On 10/05/2020 04:42 PM, Andrey Ryabinin wrote:
>> Userspace may ask tun device to send packet with ridiculously
>> big header and trigger this:
>>
>>  [ cut here ]
>>  WARNING: CPU: 1 PID: 15366 at mm/page_alloc.c:3548 
>> __alloc_pages_nodemask+0x537/0x1200
>>  order 19 >= 11, gfp 0x2044d0
>>  Call Trace:
>>    dump_stack+0x19/0x1b
>>    __warn+0x17f/0x1c0
>>    warn_slowpath_fmt+0xad/0xe0
>>    __alloc_pages_nodemask+0x537/0x1200
>>    kmalloc_large_node+0x5f/0xd0
>>    __kmalloc_node_track_caller+0x425/0x630
>>    __kmalloc_reserve.isra.33+0x47/0xd0
>>    __alloc_skb+0xdd/0x5f0
>>    alloc_skb_with_frags+0x8f/0x540
>>    sock_alloc_send_pskb+0x5e5/0x940
>>    tun_get_user+0x38b/0x24a0 [tun]
>>    tun_chr_aio_write+0x13a/0x250 [tun]
>>    do_sync_readv_writev+0xdf/0x1c0
>>    do_readv_writev+0x1a5/0x850
>>    vfs_writev+0xba/0x190
>>    SyS_writev+0x17c/0x340
>>    system_call_fastpath+0x25/0x2a
>>
>> Just add __GFP_NOWARN and silently return -ENOMEM to fix this.
>>
>> https://jira.sw.ru/browse/PSBM-103639
>> Signed-off-by: Andrey Ryabinin 
>> ---
>>  drivers/net/tun.c  | 4 ++--
>>  include/net/sock.h | 7 +++
>>  net/core/sock.c    | 9 +
>>  3 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index e95a89ba48b7..c0879c6a9703 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1142,8 +1142,8 @@ static struct sk_buff *tun_alloc_skb(struct tun_file 
>> *tfile,
>>  if (prepad + len < PAGE_SIZE || !linear)
>>  linear = len;
>>
>> -    skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
>> -   &err, 0);
>> +    skb = sock_alloc_send_pskb_flags(sk, prepad + linear, len - linear, 
>> noblock,
>> +    &err, 0, __GFP_NOWARN);
> 
> May be __GFP_ORDER_NOWARN ?
> 

__GFP_ORDER_NOWARN doesn't silence the WARN triggered here:
if (order >= MAX_ORDER) {
WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
return NULL;
}




___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] tun: Silence allocation failure if user asked for too big header

2020-10-05 Thread Andrey Ryabinin
Userspace may ask the tun device to send a packet with a ridiculously
big header and trigger this:

 [ cut here ]
 WARNING: CPU: 1 PID: 15366 at mm/page_alloc.c:3548 
__alloc_pages_nodemask+0x537/0x1200
 order 19 >= 11, gfp 0x2044d0
 Call Trace:
   dump_stack+0x19/0x1b
   __warn+0x17f/0x1c0
   warn_slowpath_fmt+0xad/0xe0
   __alloc_pages_nodemask+0x537/0x1200
   kmalloc_large_node+0x5f/0xd0
   __kmalloc_node_track_caller+0x425/0x630
   __kmalloc_reserve.isra.33+0x47/0xd0
   __alloc_skb+0xdd/0x5f0
   alloc_skb_with_frags+0x8f/0x540
   sock_alloc_send_pskb+0x5e5/0x940
   tun_get_user+0x38b/0x24a0 [tun]
   tun_chr_aio_write+0x13a/0x250 [tun]
   do_sync_readv_writev+0xdf/0x1c0
   do_readv_writev+0x1a5/0x850
   vfs_writev+0xba/0x190
   SyS_writev+0x17c/0x340
   system_call_fastpath+0x25/0x2a

Just add __GFP_NOWARN and silently return -ENOMEM to fix this.
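
For scale: an allocation of order 19 means 2^19 contiguous pages, i.e. about
2 GiB with 4 KiB pages, far above MAX_ORDER (11); it can only come from an
absurd user-supplied header length, so warning about it is pointless.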

https://jira.sw.ru/browse/PSBM-103639
Signed-off-by: Andrey Ryabinin 
---
 drivers/net/tun.c  | 4 ++--
 include/net/sock.h | 7 +++
 net/core/sock.c| 9 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e95a89ba48b7..c0879c6a9703 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1142,8 +1142,8 @@ static struct sk_buff *tun_alloc_skb(struct tun_file 
*tfile,
if (prepad + len < PAGE_SIZE || !linear)
linear = len;
 
-   skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
-  &err, 0);
+   skb = sock_alloc_send_pskb_flags(sk, prepad + linear, len - linear, 
noblock,
+   &err, 0, __GFP_NOWARN);
if (!skb)
return ERR_PTR(err);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 4136d2c3080c..1912d85ecc4d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1626,6 +1626,13 @@ extern struct sk_buff
*sock_alloc_send_pskb(struct sock *sk,
  int noblock,
  int *errcode,
  int max_page_order);
+extern struct sk_buff  *sock_alloc_send_pskb_flags(struct sock *sk,
+ unsigned long header_len,
+ unsigned long data_len,
+ int noblock,
+ int *errcode,
+ int max_page_order,
+ gfp_t extra_flags);
 extern void *sock_kmalloc(struct sock *sk, int size,
  gfp_t priority);
 extern void sock_kfree_s(struct sock *sk, void *mem, int size);
diff --git a/net/core/sock.c b/net/core/sock.c
index 508fc6093a26..07ea42f976cf 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1964,6 +1964,15 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, 
unsigned long header_len,
 }
 EXPORT_SYMBOL(sock_alloc_send_pskb);
 
+struct sk_buff *sock_alloc_send_pskb_flags(struct sock *sk, unsigned long 
header_len,
+unsigned long data_len, int noblock,
+int *errcode, int max_page_order, gfp_t 
extra_flags)
+{
+   return __sock_alloc_send_pskb(sk, header_len, data_len, noblock,
+   errcode, max_page_order, extra_flags);
+}
+EXPORT_SYMBOL(sock_alloc_send_pskb_flags);
+
 struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
int noblock, int *errcode)
 {
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] vmscan: don't report reclaim progress if there was no progress.

2020-10-05 Thread Andrey Ryabinin
__alloc_pages_slowpath() relies on direct reclaim and did_some_progress
as an indicator that it makes sense to retry the allocation rather than
declaring OOM. shrink_zones() checks whether all zones are reclaimable and,
if shrink_zone() didn't make any progress, it prevents a premature OOM
killer invocation by reporting progress anyway.
This might happen if the LRU is full of dirty or writeback pages
and direct reclaim cannot clean those up.

zone_reclaimable allows to rescan the reclaimable lists several times
and restart if a page is freed.  This is really subtle behavior and it
might lead to a livelock when a single freed page keeps allocator
looping but the current task will not be able to allocate that single
page.  OOM killer would be more appropriate than looping without any
progress for unbounded amount of time.

Report no progress even if zones are reclaimable, as OOM is more appropriate
in that case.
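
For context, the allocator-side logic this interacts with looks roughly like
this (heavily simplified sketch, not the exact RHEL7 code):

    /* __alloc_pages_slowpath(), condensed */
    page = __alloc_pages_direct_reclaim(gfp_mask, order, ..., &did_some_progress);
    if (page)
            return page;

    if (did_some_progress) {
            /* reclaim claims it made progress - retry instead of OOM */
            goto retry;
    }

    /* no progress reported - invoking the OOM killer is allowed */
    page = __alloc_pages_may_oom(gfp_mask, order, ...);

Returning 1 from do_try_to_free_pages() just because zone_reclaimable() is true
keeps feeding "progress" into this retry loop, which is how the livelock
described above can happen.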

https://jira.sw.ru/browse/PSBM-104900
Signed-off-by: Andrey Ryabinin 
---
 mm/vmscan.c | 24 
 1 file changed, 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13ae9bd1e92e..85622f235e78 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2952,26 +2952,6 @@ static void snapshot_refaults(struct mem_cgroup 
*root_memcg, struct zone *zone)
} while ((memcg = mem_cgroup_iter(root_memcg, memcg, NULL)));
 }
 
-/* All zones in zonelist are unreclaimable? */
-static bool all_unreclaimable(struct zonelist *zonelist,
-   struct scan_control *sc)
-{
-   struct zoneref *z;
-   struct zone *zone;
-
-   for_each_zone_zonelist_nodemask(zone, z, zonelist,
-   gfp_zone(sc->gfp_mask), sc->nodemask) {
-   if (!populated_zone(zone))
-   continue;
-   if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-   continue;
-   if (zone_reclaimable(zone))
-   return false;
-   }
-
-   return true;
-}
-
 static void shrink_tcrutches(struct scan_control *scan_ctrl)
 {
int nid;
@@ -3097,10 +3077,6 @@ out:
goto retry;
}
 
-   /* top priority shrink_zones still had more to do? don't OOM, then */
-   if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
-   return 1;
-
return 0;
 }
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] kernel/sched/fair: Fix 'releasing a pinned lock'

2020-10-05 Thread Andrey Ryabinin
Lockdep complains that after rq_repin_lock() the lock wasn't unpinned
before rq->lock release.

[ cut here ]
releasing a pinned lock
WARNING: CPU: 0 PID: 24 at kernel/locking/lockdep.c:4271 
lock_release+0x939/0xee0
Call Trace:
 _raw_spin_unlock+0x1c/0x30
 load_balance+0x1472/0x2e30
 pick_next_task_fair+0x62c/0x2300
 __schedule+0x481/0x1600
 schedule+0xbf/0x240
 worker_thread+0x1d5/0xb50
 kthread+0x30e/0x3d0
 ret_from_fork+0x3a/0x50

Add an rq_unpin_lock() call to fix this. Also, for consistency, use 'busiest'
instead of 'env.src_rq', which is the same thing.
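
For reference, the pairing lockdep expects looks roughly like this (a sketch of
the load_balance() path after this patch, not a verbatim copy):

    double_rq_lock(env.dst_rq, busiest);
    rq_repin_lock(busiest, &rf);            /* rq->lock is marked pinned again  */
    update_rq_clock(env.dst_rq);
    cur_ld_moved = ld_moved = move_task_groups();
    rq_unpin_lock(busiest, &rf);            /* drop the pin ...                 */
    double_rq_unlock(env.dst_rq, busiest);  /* ... before releasing rq->lock    */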

https://jira.sw.ru/browse/PSBM-120800
Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc87dee4fd0e..23a2f2452474 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9178,9 +9178,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.loop = 0;
local_irq_save(rf.flags);
double_rq_lock(env.dst_rq, busiest);
-   rq_repin_lock(env.src_rq, &rf);
+   rq_repin_lock(busiest, &rf);
update_rq_clock(env.dst_rq);
cur_ld_moved = ld_moved = move_task_groups();
+   rq_unpin_lock(busiest, &rf);
double_rq_unlock(env.dst_rq, busiest);
local_irq_restore(rf.flags);
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] mm, memcg: add oom counter to memory.stat memcgroup file

2020-10-02 Thread Andrey Ryabinin
Add an oom counter to the memory.stat file. oom shows the number of oom kills
triggered due to the cgroup's memory limit. total_oom shows the total sum of
oom kills triggered due to the cgroup's and its sub-groups' memory limits.

memory.stat in the root cgroup counts global oom kills.

E.g:
 # mkdir /sys/fs/cgroup/memory/test/
 # echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # echo 100M > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo $$ > /sys/fs/cgroup/memory/test/tasks
 # ./vm-scalability/usemem -O 200M
 # grep oom /sys/fs/cgroup/memory/test/memory.stat
   oom 1
   total_oom 1
 # echo -1 > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo -1 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # ./vm-scalability/usemem -O 1000G
 # grep oom /sys/fs/cgroup/memory/memory.stat
oom 1
total_oom 2

https://jira.sw.ru/browse/PSBM-108287
Signed-off-by: Andrey Ryabinin 
---
 include/linux/memcontrol.h |  2 ++
 mm/memcontrol.c| 33 ++---
 2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b097f137a3df..eb8634128a81 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -75,6 +75,8 @@ struct accumulated_stats {
unsigned long stat[MEMCG_NR_STAT];
unsigned long events[NR_VM_EVENT_ITEMS];
unsigned long lru_pages[NR_LRU_LISTS];
+   unsigned long oom;
+   unsigned long oom_kill;
const unsigned int *stats_array;
const unsigned int *events_array;
int stats_size;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37d4df653f39..ca3a07543416 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3144,6 +3144,8 @@ void accumulate_memcg_tree(struct mem_cgroup *memcg,
for (i = 0; i < NR_LRU_LISTS; i++)
acc->lru_pages[i] +=
mem_cgroup_nr_lru_pages(mi, BIT(i));
+   acc->oom += atomic_long_read(&mi->memory_events[MEMCG_OOM]);
+   acc->oom_kill += 
atomic_long_read(&mi->memory_events[MEMCG_OOM_KILL]);
 
cond_resched();
}
@@ -3899,6 +3901,13 @@ static int memcg_stat_show(struct seq_file *m, void *v)
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
 
+   memset(&acc, 0, sizeof(acc));
+   acc.stats_size = ARRAY_SIZE(memcg1_stats);
+   acc.stats_array = memcg1_stats;
+   acc.events_size = ARRAY_SIZE(memcg1_events);
+   acc.events_array = memcg1_events;
+   accumulate_memcg_tree(memcg, &acc);
+
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
continue;
@@ -3911,6 +3920,18 @@ static int memcg_stat_show(struct seq_file *m, void *v)
seq_printf(m, "%s %lu\n", memcg1_event_names[i],
   memcg_sum_events(memcg, memcg1_events[i]));
 
+   /*
+* For root_mem_cgroup we want to account global ooms as well.
+* The diff between all MEMCG_OOM_KILL and MEMCG_OOM events
+* should give us the global ooms count.
+*/
+   if (memcg == root_mem_cgroup)
+   seq_printf(m, "oom %lu\n", acc.oom_kill - acc.oom +
+   atomic_long_read(&memcg->memory_events[MEMCG_OOM]));
+   else
+   seq_printf(m, "oom %lu\n",
+   atomic_long_read(&memcg->memory_events[MEMCG_OOM]));
+
for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i],
   mem_cgroup_nr_lru_pages(memcg, BIT(i)) * PAGE_SIZE);
@@ -3927,13 +3948,6 @@ static int memcg_stat_show(struct seq_file *m, void *v)
seq_printf(m, "hierarchical_memsw_limit %llu\n",
   (u64)memsw * PAGE_SIZE);
 
-   memset(&acc, 0, sizeof(acc));
-   acc.stats_size = ARRAY_SIZE(memcg1_stats);
-   acc.stats_array = memcg1_stats;
-   acc.events_size = ARRAY_SIZE(memcg1_events);
-   acc.events_array = memcg1_events;
-   accumulate_memcg_tree(memcg, &acc);
-
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
continue;
@@ -3945,6 +3959,11 @@ static int memcg_stat_show(struct seq_file *m, void *v)
seq_printf(m, "total_%s %llu\n", memcg1_event_names[i],
   (u64)acc.events[i]);
 
+   if (memcg == root_mem_cgroup)
+   seq_printf(m, "total_oom %lu\n", acc.oom_kill);
+   else
+   seq_printf(m, "total_oom %lu\n", acc.oom);
+
for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "total_%s %llu\n"

[Devel] [PATCH vz8 2/2] mm/memcg: fix cache growth above cache.limit_in_bytes

2020-10-02 Thread Andrey Ryabinin
Exceeding cache above cache.limit_in_bytes schedules high_work_func(),
which tries to reclaim only 32 pages. If cache is generated fast enough,
this allows the cgroup to steadily grow above cache.limit_in_bytes because
we don't reclaim enough: e.g. if the workload adds a few hundred cache pages
between two runs of the work item, reclaiming 32 of them never catches up.
Try to reclaim the exceeded amount of cache instead.

https://jira.sw.ru/browse/PSBM-106384
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c30150b8732d..37d4df653f39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2213,14 +2213,18 @@ static void reclaim_high(struct mem_cgroup *memcg,
 {
 
do {
+   long cache_overused;
 
if (page_counter_read(&memcg->memory) > memcg->high) {
memcg_memory_event(memcg, MEMCG_HIGH);
try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 
true);
}
 
-   if (page_counter_read(&memcg->cache) > memcg->cache.max)
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 
false);
+   cache_overused = page_counter_read(&memcg->cache) -
+   memcg->cache.max;
+
+   if (cache_overused > 0)
+   try_to_free_mem_cgroup_pages(memcg, cache_overused, 
gfp_mask, false);
} while ((memcg = parent_mem_cgroup(memcg)));
 }
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 1/2] mm/memcg: reclaim memory.cache.limit_in_bytes from background

2020-10-02 Thread Andrey Ryabinin
Reclaiming memory above memory.cache.limit_in_bytes always in direct
reclaim mode adds too much of a cost for vstorage. Instead of direct
reclaim, allow memory.cache.limit_in_bytes to be exceeded temporarily and
launch the reclaim in a background task.

https://pmc.acronis.com/browse/VSTOR-24395
https://jira.sw.ru/browse/PSBM-94761
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 42 ++
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 68242a72be4d..c30150b8732d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2211,11 +2211,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
 unsigned int nr_pages,
 gfp_t gfp_mask)
 {
+
do {
-   if (page_counter_read(&memcg->memory) <= memcg->high)
-   continue;
-   memcg_memory_event(memcg, MEMCG_HIGH);
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+
+   if (page_counter_read(&memcg->memory) > memcg->high) {
+   memcg_memory_event(memcg, MEMCG_HIGH);
+   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 
true);
+   }
+
+   if (page_counter_read(&memcg->cache) > memcg->cache.max)
+   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 
false);
} while ((memcg = parent_mem_cgroup(memcg)));
 }
 
@@ -2270,13 +2275,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask, bool kmem_charge
refill_stock(memcg, nr_pages);
goto charge;
}
-
-   if (cache_charge && !page_counter_try_charge(
-   &memcg->cache, nr_pages, &counter)) {
-   refill_stock(memcg, nr_pages);
-   goto charge;
-   }
-   return 0;
+   css_get_many(&memcg->css, batch);
+   goto done;
}
 
 charge:
@@ -2301,19 +2301,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask, bool kmem_charge
}
}
 
-   if (!mem_over_limit && cache_charge) {
-   if (page_counter_try_charge(&memcg->cache, nr_pages, &counter))
-   goto done_restock;
-
-   may_swap = false;
-   mem_over_limit = mem_cgroup_from_counter(counter, cache);
-   page_counter_uncharge(&memcg->memory, batch);
-   if (do_memsw_account())
-   page_counter_uncharge(&memcg->memsw, batch);
-   if (kmem_charge)
-   page_counter_uncharge(&memcg->kmem, nr_pages);
-   }
-
if (!mem_over_limit)
goto done_restock;
 
@@ -2437,6 +2424,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask, bool kmem_charge
css_get_many(&memcg->css, batch);
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
+done:
+   if (cache_charge)
+   page_counter_charge(&memcg->cache, nr_pages);
 
/*
 * If the hierarchy is above the normal consumption range, schedule
@@ -2457,7 +2447,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask, bool kmem_charge
current->memcg_nr_pages_over_high += batch;
set_notify_resume(current);
break;
+   } else if (page_counter_read(&memcg->cache) > memcg->cache.max) 
{
+   if (!work_pending(&memcg->high_work))
+   schedule_work(&memcg->high_work);
}
+
} while ((memcg = parent_mem_cgroup(memcg)));
 
return 0;
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RH8] mm/tcache: restore missing rcu_read_lock() in tcache_detach_page()

2020-10-02 Thread Andrey Ryabinin



On 10/2/20 5:13 PM, Evgenii Shatokhin wrote:
> Looks like rcu_read_lock() was lost in "out:" path of tcache_detach_page()
> when tcache was ported to VZ8. As a result, Syzkaller was able to hit
> the following warning:
> 
>   WARNING: bad unlock balance detected!
>   4.18.0-193.6.3.vz8.4.7.syz+debug #1 Tainted: GW-r-  
> -
>   -
>   vcmmd/926 is trying to release lock (rcu_read_lock) at:
>   [] tcache_detach_page+0x530/0x750
>   but there are no more locks to release!
> 
>   other info that might help us debug this:
>   2 locks held by vcmmd/926:
>#0: 888036331f30 (>mmap_sem){}, at: __do_page_fault+0x157/0x550
>#1: 8880567295f8 (>i_mmap_sem){}, at: 
> ext4_filemap_fault+0x82/0xc0 [ext4]
> 
>   stack backtrace:
>   CPU: 0 PID: 926 Comm: vcmmd ve: /
>Tainted: GW-r-  - 
> 4.18.0-193.6.3.vz8.4.7.syz+debug #1 4.7
>   Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014
>   Call Trace:
>dump_stack+0xd2/0x148
>print_unlock_imbalance_bug.cold.40+0xc8/0xd4
>lock_release+0x5e3/0x1360
>tcache_detach_page+0x559/0x750
>tcache_cleancache_get_page+0xe9/0x780
>__cleancache_get_page+0x212/0x320
>ext4_mpage_readpages+0x165d/0x1b90 [ext4]
>ext4_readpages+0xd6/0x110 [ext4]
>read_pages+0xff/0x5b0
>__do_page_cache_readahead+0x3fc/0x5b0
>filemap_fault+0x912/0x1b80
>ext4_filemap_fault+0x8a/0xc0 [ext4]
>__do_fault+0x110/0x410
>do_fault+0x622/0x1010
>__handle_mm_fault+0x980/0x1120
>handle_mm_fault+0x17f/0x610
>__do_page_fault+0x25d/0x550
>do_page_fault+0x38/0x290
>do_async_page_fault+0x5b/0xe0
>async_page_fault+0x1e/0x30
> 
> Let us restore rcu_read_lock().
> 
> https://jira.sw.ru/browse/PSBM-120802
> Signed-off-by: Evgenii Shatokhin 

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 v2] kernel/sched/fair.c: Add more missing update_rq_clock() calls

2020-09-29 Thread Andrey Ryabinin
Add update_rq_clock() for 'target_rq' to avoid a WARN() coming
from attach_task(). Also add rq_repin_lock(busiest, &rf); in
load_balance() for detach_task(). The update_rq_clock() isn't
necessary there since the clock was updated before, but we need the repin
since the rq lock was released after the update.

https://jira.sw.ru/browse/PSBM-108013
Reported-by: Kirill Tkhai 
Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6dc21d5fa03..fc87dee4fd0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7817,6 +7817,7 @@ static int cpulimit_balance_cpu_stop(void *data)
schedstat_inc(sd->clb_count);
 
update_rq_clock(rq);
+   update_rq_clock(target_rq);
if (do_cpulimit_balance())
schedstat_inc(sd->clb_pushed);
else
@@ -9177,6 +9178,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.loop = 0;
local_irq_save(rf.flags);
double_rq_lock(env.dst_rq, busiest);
+   rq_repin_lock(env.src_rq, &rf);
update_rq_clock(env.dst_rq);
cur_ld_moved = ld_moved = move_task_groups();
double_rq_unlock(env.dst_rq, busiest);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] kernel/sched/fair.c: Add more missing update_rq_clock() calls

2020-09-29 Thread Andrey Ryabinin
Add update_rq_clock() for 'target_rq' to avoid WARN() coming
from attach_task(). Also add update_rq_clock(env.src_rq); in
load_balance() for detach_task().

https://jira.sw.ru/browse/PSBM-108013
Reported-by: Kirill Tkhai 
Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e6dc21d5fa03..99dcb9e77efd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7817,6 +7817,7 @@ static int cpulimit_balance_cpu_stop(void *data)
schedstat_inc(sd->clb_count);
 
update_rq_clock(rq);
+   update_rq_clock(target_rq);
if (do_cpulimit_balance())
schedstat_inc(sd->clb_pushed);
else
@@ -9177,6 +9178,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.loop = 0;
local_irq_save(rf.flags);
double_rq_lock(env.dst_rq, busiest);
+   update_rq_clock(env.src_rq);
update_rq_clock(env.dst_rq);
cur_ld_moved = ld_moved = move_task_groups();
double_rq_unlock(env.dst_rq, busiest);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH v2 vz8] kernel/sched/fair.c: Add missing update_rq_clock() calls

2020-09-29 Thread Andrey Ryabinin



On 9/29/20 11:24 AM, Kirill Tkhai wrote:
> On 28.09.2020 15:03, Andrey Ryabinin wrote:
>> We've got a hard lockup which seems to be caused by mgag200
>> console printk code calling to schedule_work from scheduler
>> with rq->lock held:
>>   #5 [b79e034239a8] native_queued_spin_lock_slowpath at 8b50c6c6
>>   #6 [b79e034239a8] _raw_spin_lock at 8bc96e5c
>>   #7 [b79e034239b0] try_to_wake_up at 8b4e26ff
>>   #8 [b79e03423a10] __queue_work at 8b4ce3f3
>>   #9 [b79e03423a58] queue_work_on at 8b4ce714
>>  #10 [b79e03423a68] mga_imageblit at c026d666 [mgag200]
>>  #11 [b79e03423a80] soft_cursor at 8b8a9d84
>>  #12 [b79e03423ad8] bit_cursor at 8b8a99b2
>>  #13 [b79e03423ba0] hide_cursor at 8b93bc7a
>>  #14 [b79e03423bb0] vt_console_print at 8b93e07d
>>  #15 [b79e03423c18] console_unlock at 8b518f0e
>>  #16 [b79e03423c68] vprintk_emit_log at 8b51acf7
>>  #17 [b79e03423cc0] vprintk_default at 8b51adcd
>>  #18 [b79e03423cd0] printk at 8b51b3d6
>>  #19 [b79e03423d30] __warn_printk at 8b4b13a0
>>  #20 [b79e03423d98] assert_clock_updated at 8b4dd293
>>  #21 [b79e03423da0] deactivate_task at 8b4e12d1
>>  #22 [b79e03423dc8] move_task_group at 8b4eaa5b
>>  #23 [b79e03423e00] cpulimit_balance_cpu_stop at 8b4f02f3
>>  #24 [b79e03423eb0] cpu_stopper_thread at 8b576b67
>>  #25 [b79e03423ee8] smpboot_thread_fn at 8b4d9125
>>  #26 [b79e03423f10] kthread at 8b4d4fc2
>>  #27 [b79e03423f50] ret_from_fork at 8be00255
>>
>> The printk called because assert_clock_updated() triggered
>>  SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP);
>>
>> This means that we missing necessary update_rq_clock() call.
>> Add one to cpulimit_balance_cpu_stop() to fix the warning.
>> Also add one in load_balance() before move_task_groups() call.
>> It seems to be another place missing this call.
>>
>> https://jira.sw.ru/browse/PSBM-108013
>> Signed-off-by: Andrey Ryabinin 
>> ---
>>  kernel/sched/fair.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5d3556b15e70..e6dc21d5fa03 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7816,6 +7816,7 @@ static int cpulimit_balance_cpu_stop(void *data)
>>  
>>  schedstat_inc(sd->clb_count);
>>  
>> +update_rq_clock(rq);
> 
> Shouldn't we also add the same for target_rq to avoid WARN() coming from 
> attach_task()?
> 

It seems like we should.
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH v2 vz8] kernel/sched/fair.c: Add missing update_rq_clock() calls

2020-09-28 Thread Andrey Ryabinin
We've got a hard lockup which seems to be caused by the mgag200
console printk code calling schedule_work() from the scheduler
with rq->lock held:
  #5 [b79e034239a8] native_queued_spin_lock_slowpath at 8b50c6c6
  #6 [b79e034239a8] _raw_spin_lock at 8bc96e5c
  #7 [b79e034239b0] try_to_wake_up at 8b4e26ff
  #8 [b79e03423a10] __queue_work at 8b4ce3f3
  #9 [b79e03423a58] queue_work_on at 8b4ce714
 #10 [b79e03423a68] mga_imageblit at c026d666 [mgag200]
 #11 [b79e03423a80] soft_cursor at 8b8a9d84
 #12 [b79e03423ad8] bit_cursor at 8b8a99b2
 #13 [b79e03423ba0] hide_cursor at 8b93bc7a
 #14 [b79e03423bb0] vt_console_print at 8b93e07d
 #15 [b79e03423c18] console_unlock at 8b518f0e
 #16 [b79e03423c68] vprintk_emit_log at 8b51acf7
 #17 [b79e03423cc0] vprintk_default at 8b51adcd
 #18 [b79e03423cd0] printk at 8b51b3d6
 #19 [b79e03423d30] __warn_printk at 8b4b13a0
 #20 [b79e03423d98] assert_clock_updated at 8b4dd293
 #21 [b79e03423da0] deactivate_task at 8b4e12d1
 #22 [b79e03423dc8] move_task_group at 8b4eaa5b
 #23 [b79e03423e00] cpulimit_balance_cpu_stop at 8b4f02f3
 #24 [b79e03423eb0] cpu_stopper_thread at 8b576b67
 #25 [b79e03423ee8] smpboot_thread_fn at 8b4d9125
 #26 [b79e03423f10] kthread at 8b4d4fc2
 #27 [b79e03423f50] ret_from_fork at 8be00255

The printk was called because assert_clock_updated() triggered
SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP);

This means that we are missing a necessary update_rq_clock() call.
Add one to cpulimit_balance_cpu_stop() to fix the warning.
Also add one in load_balance() before the move_task_groups() call,
which seems to be another place missing this call.
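
The underlying rule, roughly: once rq->lock is taken, update_rq_clock() has to
run before anything that consumes rq_clock(), e.g. deactivate_task() or
attach_task(). A sketch of the expected pattern (not the exact code):

    raw_spin_lock(&rq->lock);       /* or rq_lock() / double_rq_lock()         */
    update_rq_clock(rq);            /* refresh the clock, satisfies the assert */
    deactivate_task(rq, p, 0);      /* clock consumers come only after that    */
    ...
    raw_spin_unlock(&rq->lock);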

https://jira.sw.ru/browse/PSBM-108013
Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d3556b15e70..e6dc21d5fa03 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7816,6 +7816,7 @@ static int cpulimit_balance_cpu_stop(void *data)
 
schedstat_inc(sd->clb_count);
 
+   update_rq_clock(rq);
if (do_cpulimit_balance())
schedstat_inc(sd->clb_pushed);
else
@@ -9176,6 +9177,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.loop = 0;
local_irq_save(rf.flags);
double_rq_lock(env.dst_rq, busiest);
+   update_rq_clock(env.dst_rq);
cur_ld_moved = ld_moved = move_task_groups();
double_rq_unlock(env.dst_rq, busiest);
local_irq_restore(rf.flags);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8] kernel/sched/fair.c: Add missing update_rq_clock() calls

2020-09-28 Thread Andrey Ryabinin
We've got a hard lockup which seems to be caused by the mgag200
console printk code calling schedule_work() from the scheduler
with rq->lock held:

 #5 [b79e034239a8] native_queued_spin_lock_slowpath at 8b50c6c6
 #6 [b79e034239a8] _raw_spin_lock at 8bc96e5c
 #7 [b79e034239b0] try_to_wake_up at 8b4e26ff
 #8 [b79e03423a10] __queue_work at 8b4ce3f3
 #9 [b79e03423a58] queue_work_on at 8b4ce714

The printk was called because assert_clock_updated() triggered
SCHED_WARN_ON(rq->clock_update_flags < RQCF_ACT_SKIP);

This means that we are missing a necessary update_rq_clock() call.
Add one to cpulimit_balance_cpu_stop() to fix the warning.
Also add one in load_balance() before the move_task_groups() call,
which seems to be another place missing this call.

https://jira.sw.ru/browse/PSBM-108013
Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d3556b15e70..e6dc21d5fa03 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7816,6 +7816,7 @@ static int cpulimit_balance_cpu_stop(void *data)
 
schedstat_inc(sd->clb_count);
 
+   update_rq_clock(rq);
if (do_cpulimit_balance())
schedstat_inc(sd->clb_pushed);
else
@@ -9176,6 +9177,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
env.loop = 0;
local_irq_save(rf.flags);
double_rq_lock(env.dst_rq, busiest);
+   update_rq_clock(env.dst_rq);
cur_ld_moved = ld_moved = move_task_groups();
double_rq_unlock(env.dst_rq, busiest);
local_irq_restore(rf.flags);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] keys, user: fix NULL-ptr dereference in user_destroy() #PSBM-108198

2020-09-23 Thread Andrey Ryabinin
key->payload.data could be NULL

BUG: unable to handle kernel NULL pointer dereference at 0010
IP: user_destroy+0x13/0x30

Call Trace:
  key_gc_unused_keys.constprop.1+0xfd/0x110
  key_garbage_collector+0x1d7/0x390
  process_one_work+0x185/0x440
  worker_thread+0x126/0x3c0
  kthread+0xd1/0xe0
  ret_from_fork_nospec_begin+0x7/0x21

Add the necessary check to fix this.

https://jira.sw.ru/browse/PSBM-108198
Fixes: 499126f3b029 ("keys, user: Fix high order allocation in 
user_instantiate()")
Signed-off-by: Andrey Ryabinin 
---
 security/keys/user_defined.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/security/keys/user_defined.c b/security/keys/user_defined.c
index b13d70b69069..c3196db50e30 100644
--- a/security/keys/user_defined.c
+++ b/security/keys/user_defined.c
@@ -184,8 +184,10 @@ void user_destroy(struct key *key)
 {
struct user_key_payload *upayload = key->payload.data;
 
-   memset(upayload, 0, sizeof(*upayload) + upayload->datalen);
-   kvfree(upayload);
+   if (upayload) {
+   memset(upayload, 0, sizeof(*upayload) + upayload->datalen);
+   kvfree(upayload);
+   }
 }
 
 EXPORT_SYMBOL_GPL(user_destroy);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7] mm, memcg: add oom counter to memory.stat memcgroup file #PSBM-107731

2020-09-22 Thread Andrey Ryabinin
Add an oom counter to the memory.stat file. oom shows the number of oom kills
triggered due to the cgroup's memory limit. total_oom shows the total sum of
oom kills triggered due to the cgroup's and its sub-groups' memory limits.

memory.stat in the root cgroup counts global oom kills.

E.g:
 # mkdir /sys/fs/cgroup/memory/test/
 # echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # echo 100M > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo $$ > /sys/fs/cgroup/memory/test/tasks
 # ./vm-scalability/usemem -O 200M
 # grep oom /sys/fs/cgroup/memory/test/memory.stat
   oom 1
   total_oom 1
 # echo -1 > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
 # echo -1 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 # ./vm-scalability/usemem -O 1000G
 # grep oom /sys/fs/cgroup/memory/memory.stat
oom 1
total_oom 2

https://jira.sw.ru/browse/PSBM-107731
Signed-off-by: Andrey Ryabinin 
---
 mm/memcontrol.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6587cc2ef019..fe06c7db2ad3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -400,6 +400,7 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu __percpu *stat;
struct mem_cgroup_stat2_cpu stat2;
spinlock_t pcp_counter_lock;
+   atomic_long_t   oom;
 
atomic_tdead_count;
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -2005,6 +2006,7 @@ void mem_cgroup_note_oom_kill(struct mem_cgroup 
*root_memcg,
if (memcg == root_memcg)
break;
}
+   atomic_long_inc(&root_memcg->oom);
 
if (memcg_to_put)
css_put(&memcg_to_put->css);
@@ -5691,6 +5693,7 @@ static int memcg_stat_show(struct cgroup *cont, struct 
cftype *cft,
for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
seq_printf(m, "%s %lu\n", mem_cgroup_events_names[i],
   mem_cgroup_read_events(memcg, i));
+   seq_printf(m, "oom %lu\n", atomic_long_read(&memcg->oom));
 
for (i = 0; i < NR_LRU_LISTS; i++)
seq_printf(m, "%s %lu\n", mem_cgroup_lru_names[i],
@@ -5733,6 +5736,12 @@ static int memcg_stat_show(struct cgroup *cont, struct 
cftype *cft,
seq_printf(m, "total_%s %llu\n",
   mem_cgroup_events_names[i], val);
}
+   {
+   unsigned long val = 0;
+   for_each_mem_cgroup_tree(mi, memcg)
+   val += atomic_long_read(&mi->oom);
+   seq_printf(m, "total_oom %lu\n", val);
+   }
 
for (i = 0; i < NR_LRU_LISTS; i++) {
unsigned long long val = 0;
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 v2 4/4] bcache: fix cache_set_flush() NULL pointer dereference on OOM #PSBM-106785

2020-09-14 Thread Andrey Ryabinin
From: Eric Wheeler 

When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asyncronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush() which is called
by __cache_set_unregister() during cleanup.  This appears to happen only
during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler 
Cc: sta...@vger.kernel.org

https://jira.sw.ru/browse/PSBM-106785
(cherry picked from commit f8b11260a445169989d01df75d35af0f56178f95)
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/super.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 88a008577dc0..f06212f856c6 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1295,6 +1295,9 @@ static void cache_set_flush(struct closure *cl)
set_bit(CACHE_SET_STOPPING_2, &c->flags);
wake_up(&c->alloc_wait);
 
+   if (!c)
+   closure_return(cl);
+
bch_cache_accounting_destroy(&c->accounting);
 
kobject_put(&c->internal);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 v2 3/4] bcache: unregister reboot notifier if bcache fails to unregister device #PSBM-106785

2020-09-14 Thread Andrey Ryabinin
From: Zheng Liu 

The bcache_init() function forgot to unregister the reboot notifier if
bcache fails to register a block device.  This commit fixes this.

Signed-off-by: Zheng Liu 
Tested-by: Joshua Schmid 
Tested-by: Eric Wheeler 
Cc: Kent Overstreet 
Cc: sta...@vger.kernel.org
Signed-off-by: Jens Axboe 

https://jira.sw.ru/browse/PSBM-106785
(cherry picked from commit 2ecf0cdb2b437402110ab57546e02abfa68a716b)
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0fccdc395ebe..88a008577dc0 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1959,8 +1959,10 @@ static int __init bcache_init(void)
closure_debug_init();
 
bcache_major = register_blkdev(0, "bcache");
-   if (bcache_major < 0)
+   if (bcache_major < 0) {
+   unregister_reboot_notifier();
return bcache_major;
+   }
 
if (!(bcache_wq = create_workqueue("bcache")) ||
!(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 v2 2/4] bcache: Data corruption fix #PSBM-106785

2020-09-14 Thread Andrey Ryabinin
From: Kent Overstreet 

The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.

The code that read the extents back in thus did not bother with splitting extents -
if it saw an extent that overlapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.

I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.
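
To make the overlap case concrete (a sketch; in bcache a key denotes the end
offset of its extent and START_KEY() gives the start):

    /*  older key (top->k):  |-----------------------|
     *  newer key (i->k)  :         |---------|
     *  after the split   :  |------|         |------|
     */
    bkey_copy(tmp, top->k);                 /* tmp will become the bottom piece  */
    bch_cut_back(&START_KEY(i->k), tmp);    /* tmp now ends where i->k starts    */
    bch_cut_front(i->k, top->k);            /* top->k keeps only the upper piece */

This mirrors what the reworked btree_sort_fixup() below does before returning
the bottom piece to the caller.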

Signed-off-by: Kent Overstreet 
Cc: linux-stable  # >= v3.10

https://jira.sw.ru/browse/PSBM-106785
(cherry picked from commit ef71ec2d92a08eb27e9d036e3d48835b6597)
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/bset.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 14032e8c7731..1b27cbd822e1 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -927,7 +927,7 @@ static void sort_key_next(struct btree_iter *iter,
*i = iter->data[--iter->used];
 }
 
-static void btree_sort_fixup(struct btree_iter *iter)
+static struct bkey *btree_sort_fixup(struct btree_iter *iter, struct bkey *tmp)
 {
while (iter->used > 1) {
struct btree_iter_set *top = iter->data, *i = top + 1;
@@ -955,9 +955,22 @@ static void btree_sort_fixup(struct btree_iter *iter)
} else {
/* can't happen because of comparison func */
BUG_ON(!bkey_cmp(&START_KEY(top->k), &START_KEY(i->k)));
-   bch_cut_back(&START_KEY(i->k), top->k);
+
+   if (bkey_cmp(i->k, top->k) < 0) {
+   bkey_copy(tmp, top->k);
+
+   bch_cut_back(&START_KEY(i->k), tmp);
+   bch_cut_front(i->k, top->k);
+   heap_sift(iter, 0, btree_iter_cmp);
+
+   return tmp;
+   } else {
+   bch_cut_back(&START_KEY(i->k), top->k);
+   }
}
}
+
+   return NULL;
 }
 
 static void btree_mergesort(struct btree *b, struct bset *out,
@@ -965,15 +978,20 @@ static void btree_mergesort(struct btree *b, struct bset 
*out,
bool fixup, bool remove_stale)
 {
struct bkey *k, *last = NULL;
+   BKEY_PADDED(k) tmp;
bool (*bad)(struct btree *, const struct bkey *) = remove_stale
? bch_ptr_bad
: bch_ptr_invalid;
 
while (!btree_iter_end(iter)) {
if (fixup && !b->level)
-   btree_sort_fixup(iter);
+   k = btree_sort_fixup(iter, &tmp.k);
+   else
+   k = NULL;
+
+   if (!k)
+   k = bch_btree_iter_next(iter);
 
-   k = bch_btree_iter_next(iter);
if (bad(b, k))
continue;
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH RH7 v2 1/4] bcache: Fix crashes of bcache used with raid1 #PSBM-106785

2020-09-14 Thread Andrey Ryabinin
When bcache is built on top of raid1 devices, the following
warning happens:

 WARNING: CPU: 2 PID: 8138 at include/linux/bio.h:559 
raid1_write_request+0x994/0xba0 [raid1]
 Call Trace:
  dump_stack+0x19/0x1b
  __warn+0xd8/0x100
  warn_slowpath_null+0x1d/0x20
  raid1_write_request+0x994/0xba0 [raid1]
  raid1_make_request+0x8a/0x5b0 [raid1]
  md_handle_request+0xd0/0x150
  md_make_request+0x79/0x190
  generic_make_request+0x147/0x380
  bch_generic_make_request_hack+0x2a/0xc0 [bcache]
  bch_generic_make_request+0x3d/0x190 [bcache]
  write_dirty+0x7e/0x110 [bcache]
  process_one_work+0x185/0x440
  worker_thread+0x126/0x3c0
  kthread+0xd1/0xe0
  ret_from_fork_nospec_begin+0x21/0x21

And immediately followed by the crash:
 kernel BUG at drivers/md/bcache/closure.c:53!
 Call Trace:
  dirty_endio+0x28/0x30 [bcache]
  bio_endio+0x8c/0x130
  call_bio_endio+0x2f/0x40 [raid1]
  raid_end_bio_io+0x2e/0x90 [raid1]
  r1_bio_write_done+0x35/0x50 [raid1]
  raid1_end_write_request+0x118/0x2f0 [raid1]
  bio_endio+0x8c/0x130
  blk_update_request+0x90/0x370
  blk_mq_end_request+0x1a/0x90
  virtblk_request_done+0x3f/0x70 [virtio_blk]
  __blk_mq_complete_request_remote+0x19/0x20
  flush_smp_call_function_queue+0x63/0x130
  generic_smp_call_function_single_interrupt+0x13/0x30
  smp_call_function_single_interrupt+0x2d/0x40
  call_function_single_interrupt+0x16a/0x170

So this happens because bcache doesn't allocate & initialize 'bio_aux'
structure needed by raid1 device. Add 'bio_aux' to 'dirty_io' struct
and initialize it along with the 'bio' in dirty_init() to fix this.

https://jira.sw.ru/browse/PSBM-106785
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/writeback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 841f0490d4ef..c2bda701bf9d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -17,6 +17,7 @@ static void read_dirty(struct closure *);
 struct dirty_io {
struct closure  cl;
struct cached_dev   *dc;
+   struct bio_aux  bio_aux;
struct bio  bio;
 };
 
@@ -122,6 +123,7 @@ static void dirty_init(struct keybuf_key *w)
bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS);
bio->bi_private = w;
bio->bi_io_vec  = bio->bi_inline_vecs;
+   bio_init_aux(&io->bio, &io->bio_aux);
bch_bio_map(bio, NULL);
 }
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 v2] keys, user: Fix high order allocation in user_instantiate() #PSBM-107794

2020-09-14 Thread Andrey Ryabinin
Adding a user key might trigger an order-4 allocation, which is unreliable
when memory is fragmented:

 [ cut here ]
 WARNING: CPU: 3 PID: 134927 at mm/page_alloc.c:3533 
__alloc_pages_nodemask+0x1b1/0x600
 order 4 >= 3, gfp 0x40d0
 Kernel panic - not syncing: panic_on_warn set ...
 CPU: 3 PID: 134927 Comm: add_key01 ve: 0 Kdump: loaded Tainted: G   OE 
    3.10.0-1127.18.2.vz7.163.15 #1 163.15
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014
 Call Trace:
  dump_stack+0x19/0x1b
  panic+0xe8/0x21f
  __warn+0xfa/0x100
  warn_slowpath_fmt+0x5f/0x80
  __alloc_pages_nodemask+0x1b1/0x600
  alloc_pages_current+0x98/0x110
  kmalloc_order+0x18/0x40
  kmalloc_order_trace+0x26/0xa0
  __kmalloc+0x281/0x2a0
  user_instantiate+0x47/0x90
  __key_instantiate_and_link+0x54/0x100
  key_create_or_update+0x398/0x490
  SyS_add_key+0x12c/0x220
  system_call_fastpath+0x25/0x2a

Use kvmalloc() to avoid potential -ENOMEM due to fragmentation.
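
For illustration, a userspace sketch of the trigger (the key description, and
the assumption that "user" keys allow payloads up to 32767 bytes, are only for
the sketch; the point is a payload large enough to need a high-order kmalloc):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <keyutils.h>

    int main(void)
    {
            size_t len = 32767;               /* near the assumed "user" key maximum */
            char *payload = malloc(len);

            memset(payload, 'a', len);
            /* with kmalloc() this may fail (or warn) on fragmented memory,
             * with kvmalloc() it can fall back to vmalloc() instead */
            if (add_key("user", "test-key", payload, len, KEY_SPEC_SESSION_KEYRING) < 0)
                    perror("add_key");
            return 0;
    }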

https://jira.sw.ru/browse/PSBM-107794
Signed-off-by: Andrey Ryabinin 
---

Changes since v1:
  - Add #PSBM-107794 to subject

 security/keys/user_defined.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/security/keys/user_defined.c b/security/keys/user_defined.c
index bc8d3227dc4b..b13d70b69069 100644
--- a/security/keys/user_defined.c
+++ b/security/keys/user_defined.c
@@ -9,6 +9,7 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -75,7 +76,7 @@ int user_instantiate(struct key *key, struct 
key_preparsed_payload *prep)
goto error;
 
ret = -ENOMEM;
-   upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
+   upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
if (!upayload)
goto error;
 
@@ -96,7 +97,8 @@ static void user_free_payload_rcu(struct rcu_head *head)
struct user_key_payload *payload;
 
payload = container_of(head, struct user_key_payload, rcu);
-   kzfree(payload);
+   memset(payload, 0, sizeof(*payload) + payload->datalen);
+   kvfree(payload);
 }
 
 /*
@@ -115,7 +117,7 @@ int user_update(struct key *key, struct 
key_preparsed_payload *prep)
 
/* construct a replacement payload */
ret = -ENOMEM;
-   upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
+   upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
if (!upayload)
goto error;
 
@@ -182,7 +184,8 @@ void user_destroy(struct key *key)
 {
struct user_key_payload *upayload = key->payload.data;
 
-   kzfree(upayload);
+   memset(upayload, 0, sizeof(*upayload) + upayload->datalen);
+   kvfree(upayload);
 }
 
 EXPORT_SYMBOL_GPL(user_destroy);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] keys, user: Fix high order allocation in user_instantiate()

2020-09-14 Thread Andrey Ryabinin
Adding a user key might trigger an order-4 allocation, which is unreliable
when memory is fragmented:

 [ cut here ]
 WARNING: CPU: 3 PID: 134927 at mm/page_alloc.c:3533 
__alloc_pages_nodemask+0x1b1/0x600
 order 4 >= 3, gfp 0x40d0
 Kernel panic - not syncing: panic_on_warn set ...
 CPU: 3 PID: 134927 Comm: add_key01 ve: 0 Kdump: loaded Tainted: G   OE 
    3.10.0-1127.18.2.vz7.163.15 #1 163.15
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.2 04/01/2014
 Call Trace:
  dump_stack+0x19/0x1b
  panic+0xe8/0x21f
  __warn+0xfa/0x100
  warn_slowpath_fmt+0x5f/0x80
  __alloc_pages_nodemask+0x1b1/0x600
  alloc_pages_current+0x98/0x110
  kmalloc_order+0x18/0x40
  kmalloc_order_trace+0x26/0xa0
  __kmalloc+0x281/0x2a0
  user_instantiate+0x47/0x90
  __key_instantiate_and_link+0x54/0x100
  key_create_or_update+0x398/0x490
  SyS_add_key+0x12c/0x220
  system_call_fastpath+0x25/0x2a

Use kvmalloc() to avoid potential -ENOMEM due to fragmentation.

https://jira.sw.ru/browse/PSBM-107794
Signed-off-by: Andrey Ryabinin 
---
 security/keys/user_defined.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/security/keys/user_defined.c b/security/keys/user_defined.c
index bc8d3227dc4b..b13d70b69069 100644
--- a/security/keys/user_defined.c
+++ b/security/keys/user_defined.c
@@ -9,6 +9,7 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -75,7 +76,7 @@ int user_instantiate(struct key *key, struct 
key_preparsed_payload *prep)
goto error;
 
ret = -ENOMEM;
-   upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
+   upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
if (!upayload)
goto error;
 
@@ -96,7 +97,8 @@ static void user_free_payload_rcu(struct rcu_head *head)
struct user_key_payload *payload;
 
payload = container_of(head, struct user_key_payload, rcu);
-   kzfree(payload);
+   memset(payload, 0, sizeof(*payload) + payload->datalen);
+   kvfree(payload);
 }
 
 /*
@@ -115,7 +117,7 @@ int user_update(struct key *key, struct 
key_preparsed_payload *prep)
 
/* construct a replacement payload */
ret = -ENOMEM;
-   upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
+   upayload = kvmalloc(sizeof(*upayload) + datalen, GFP_KERNEL);
if (!upayload)
goto error;
 
@@ -182,7 +184,8 @@ void user_destroy(struct key *key)
 {
struct user_key_payload *upayload = key->payload.data;
 
-   kzfree(upayload);
+   memset(upayload, 0, sizeof(*upayload) + upayload->datalen);
+   kvfree(upayload);
 }
 
 EXPORT_SYMBOL_GPL(user_destroy);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 4/4] bcache: fix cache_set_flush() NULL pointer dereference on OOM

2020-09-14 Thread Andrey Ryabinin
From: Eric Wheeler 

When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asynchronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush() which is called
by __cache_set_unregister() during cleanup.  This appears to happen only
during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler 
Cc: sta...@vger.kernel.org

https://jira.sw.ru/browse/PSBM-106785
(cherry picked from commit f8b11260a445169989d01df75d35af0f56178f95)
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/super.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 88a008577dc0..f06212f856c6 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1295,6 +1295,9 @@ static void cache_set_flush(struct closure *cl)
set_bit(CACHE_SET_STOPPING_2, &c->flags);
wake_up(&c->alloc_wait);
 
+   if (!c)
+   closure_return(cl);
+
bch_cache_accounting_destroy(&c->accounting);
 
kobject_put(&c->internal);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/4] bcache: unregister reboot notifier if bcache fails to unregister device

2020-09-14 Thread Andrey Ryabinin
From: Zheng Liu 

The bcache_init() function forgot to unregister the reboot notifier if
bcache fails to register a block device.  This commit fixes this.

Signed-off-by: Zheng Liu 
Tested-by: Joshua Schmid 
Tested-by: Eric Wheeler 
Cc: Kent Overstreet 
Cc: sta...@vger.kernel.org
Signed-off-by: Jens Axboe 

https://jira.sw.ru/browse/PSBM-106785
(cherry picked from commit 2ecf0cdb2b437402110ab57546e02abfa68a716b)
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0fccdc395ebe..88a008577dc0 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1959,8 +1959,10 @@ static int __init bcache_init(void)
closure_debug_init();
 
bcache_major = register_blkdev(0, "bcache");
-   if (bcache_major < 0)
+   if (bcache_major < 0) {
+   unregister_reboot_notifier();
return bcache_major;
+   }
 
if (!(bcache_wq = create_workqueue("bcache")) ||
!(bcache_kobj = kobject_create_and_add("bcache", fs_kobj)) ||
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/4] bcache: Data corruption fix

2020-09-14 Thread Andrey Ryabinin
From: Kent Overstreet 

The code that handles overlapping extents that we've just read back in from disk
was depending on the behaviour of the code that handles overlapping extents as
we're inserting into a btree node in the case of an insert that forced an
existing extent to be split: on insert, if we had to split we'd also insert a
new extent to represent the top part of the old extent - and then that new
extent would get written out.

The code that read the extents back in thus did not bother with splitting extents -
if it saw an extent that overlapped in the middle of an older extent, it would
trim the old extent to only represent the bottom part, assuming that the
original insert would've inserted a new extent to represent the top part.

I still haven't figured out _how_ it can happen, but I'm now pretty convinced
(and testing has confirmed) that there's some kind of an obscure corner case
(probably involving extent merging, and multiple overwrites in different sets)
that breaks this. The fix is to change the mergesort fixup code to split extents
itself when required.

Signed-off-by: Kent Overstreet 
Cc: linux-stable  # >= v3.10

https://jira.sw.ru/browse/PSBM-106785
(cherry picked from commit ef71ec2d92a08eb27e9d036e3d48835b6597)
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/bset.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 14032e8c7731..1b27cbd822e1 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -927,7 +927,7 @@ static void sort_key_next(struct btree_iter *iter,
*i = iter->data[--iter->used];
 }
 
-static void btree_sort_fixup(struct btree_iter *iter)
+static struct bkey *btree_sort_fixup(struct btree_iter *iter, struct bkey *tmp)
 {
while (iter->used > 1) {
struct btree_iter_set *top = iter->data, *i = top + 1;
@@ -955,9 +955,22 @@ static void btree_sort_fixup(struct btree_iter *iter)
} else {
/* can't happen because of comparison func */
BUG_ON(!bkey_cmp(&START_KEY(top->k), &START_KEY(i->k)));
-   bch_cut_back(&START_KEY(i->k), top->k);
+
+   if (bkey_cmp(i->k, top->k) < 0) {
+   bkey_copy(tmp, top->k);
+
+   bch_cut_back(&START_KEY(i->k), tmp);
+   bch_cut_front(i->k, top->k);
+   heap_sift(iter, 0, btree_iter_cmp);
+
+   return tmp;
+   } else {
+   bch_cut_back(&START_KEY(i->k), top->k);
+   }
}
}
+
+   return NULL;
 }
 
 static void btree_mergesort(struct btree *b, struct bset *out,
@@ -965,15 +978,20 @@ static void btree_mergesort(struct btree *b, struct bset 
*out,
bool fixup, bool remove_stale)
 {
struct bkey *k, *last = NULL;
+   BKEY_PADDED(k) tmp;
bool (*bad)(struct btree *, const struct bkey *) = remove_stale
? bch_ptr_bad
: bch_ptr_invalid;
 
while (!btree_iter_end(iter)) {
if (fixup && !b->level)
-   btree_sort_fixup(iter);
+   k = btree_sort_fixup(iter, &tmp.k);
+   else
+   k = NULL;
+
+   if (!k)
+   k = bch_btree_iter_next(iter);
 
-   k = bch_btree_iter_next(iter);
if (bad(b, k))
continue;
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/4] bcache: Fix crashes of bcache used with raid1

2020-09-14 Thread Andrey Ryabinin
When bcache is built on top of raid1 devices, the following
warning happens:

 WARNING: CPU: 2 PID: 8138 at include/linux/bio.h:559 
raid1_write_request+0x994/0xba0 [raid1]
 Call Trace:
  dump_stack+0x19/0x1b
  __warn+0xd8/0x100
  warn_slowpath_null+0x1d/0x20
  raid1_write_request+0x994/0xba0 [raid1]
  raid1_make_request+0x8a/0x5b0 [raid1]
  md_handle_request+0xd0/0x150
  md_make_request+0x79/0x190
  generic_make_request+0x147/0x380
  bch_generic_make_request_hack+0x2a/0xc0 [bcache]
  bch_generic_make_request+0x3d/0x190 [bcache]
  write_dirty+0x7e/0x110 [bcache]
  process_one_work+0x185/0x440
  worker_thread+0x126/0x3c0
  kthread+0xd1/0xe0
  ret_from_fork_nospec_begin+0x21/0x21

And immediately followed by the crash:
 kernel BUG at drivers/md/bcache/closure.c:53!
 Call Trace:
  dirty_endio+0x28/0x30 [bcache]
  bio_endio+0x8c/0x130
  call_bio_endio+0x2f/0x40 [raid1]
  raid_end_bio_io+0x2e/0x90 [raid1]
  r1_bio_write_done+0x35/0x50 [raid1]
  raid1_end_write_request+0x118/0x2f0 [raid1]
  bio_endio+0x8c/0x130
  blk_update_request+0x90/0x370
  blk_mq_end_request+0x1a/0x90
  virtblk_request_done+0x3f/0x70 [virtio_blk]
  __blk_mq_complete_request_remote+0x19/0x20
  flush_smp_call_function_queue+0x63/0x130
  generic_smp_call_function_single_interrupt+0x13/0x30
  smp_call_function_single_interrupt+0x2d/0x40
  call_function_single_interrupt+0x16a/0x170

So this happens because bcache doesn't allocate & initialize 'bio_aux'
structure needed by raid1 device. Add 'bio_aux' to 'dirty_io' struct
and initialize it along with the 'bio' in dirty_init() to fix this.

https://jira.sw.ru/browse/PSBM-106785
Signed-off-by: Andrey Ryabinin 
---
 drivers/md/bcache/writeback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 841f0490d4ef..c2bda701bf9d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -17,6 +17,7 @@ static void read_dirty(struct closure *);
 struct dirty_io {
struct closure  cl;
struct cached_dev   *dc;
+   struct bio_aux  bio_aux;
struct bio  bio;
 };
 
@@ -122,6 +123,7 @@ static void dirty_init(struct keybuf_key *w)
bio->bi_max_vecs = DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS);
bio->bi_private = w;
bio->bi_io_vec  = bio->bi_inline_vecs;
+   bio_init_aux(&io->bio, &io->bio_aux);
bch_bio_map(bio, NULL);
 }
 
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] cgroup: add missing dput() in cgroup_unmark_ve_roots()

2020-08-28 Thread Andrey Ryabinin
cgroup_unmark_ve_roots() calls dget() on the cgroup's dentry but doesn't
have the corresponding dput() call. This leads to leaked cgroups.

Add missing dput() to fix this.

https://jira.sw.ru/browse/PSBM-107328
Fixes: 1ac69e183447 ("ve/cgroup: added release_agent to each container root 
cgroup.")
Signed-off-by: Andrey Ryabinin 
---
 kernel/cgroup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 55713a0071ce..5f3111805eba 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4719,6 +4719,7 @@ void cgroup_unmark_ve_roots(struct ve_struct *ve)
mutex_lock(>i_mutex);
mutex_lock(&cgroup_mutex);
cgroup_rm_file(cgrp, cft);
+   dput(cgrp->dentry);
BUG_ON(!rcu_dereference_protected(cgrp->ve_owner,
lockdep_is_held(&cgroup_mutex)));
rcu_assign_pointer(cgrp->ve_owner, NULL);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] ms/kernel/kmod: fix use-after-free of the sub_info structure

2020-08-24 Thread Andrey Ryabinin
From: Martin Schwidefsky 

Found this in the message log on a s390 system:

BUG kmalloc-192 (Not tainted): Poison overwritten
Disabling lock debugging due to kernel taint
INFO: 0x684761f4-0x684761f7. First byte 0xff instead of 0x6b
INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
 __slab_alloc.isra.47.constprop.56+0x5f6/0x658
 kmem_cache_alloc_trace+0x106/0x408
 call_usermodehelper_setup+0x70/0x128
 call_usermodehelper+0x62/0x90
 cgroup_release_agent+0x178/0x1c0
 process_one_work+0x36e/0x680
 worker_thread+0x2f0/0x4f8
 kthread+0x10a/0x120
 kernel_thread_starter+0x6/0xc
 kernel_thread_starter+0x0/0xc
INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
 __slab_free+0x94/0x560
 kfree+0x364/0x3e0
 call_usermodehelper_exec+0x110/0x1b8
 cgroup_release_agent+0x178/0x1c0
 process_one_work+0x36e/0x680
 worker_thread+0x2f0/0x4f8
 kthread+0x10a/0x120
 kernel_thread_starter+0x6/0xc
 kernel_thread_starter+0x0/0xc

There is a use-after-free bug on the subprocess_info structure allocated
by the user mode helper.  In case do_execve() returns with an error
call_usermodehelper() stores the error code to sub_info->retval, but
sub_info can already have been freed.

Regarding UMH_NO_WAIT, the sub_info structure can be freed by
__call_usermodehelper() before the worker thread returns from
do_execve(), allowing memory corruption when do_execve() failed after
exec_mmap() is called.

Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
call_usermodehelper_exec() to continue which then frees sub_info.

To fix this race the code needs to make sure that the call to
call_usermodehelper_freeinfo() is always done after the last store to
sub_info->retval.

Signed-off-by: Martin Schwidefsky 
Reviewed-by: Oleg Nesterov 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

https://jira.sw.ru/browse/PSBM-107061
(cherry-picked from commit 0baf2a4dbf75abb7c186fd6c8d55d27aaa354a29)
Signed-off-by: Andrey Ryabinin 
---
 kernel/kmod.c | 76 +--
 1 file changed, 37 insertions(+), 39 deletions(-)

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 7fc0ba9e3216..de2bcfdc94d0 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -585,12 +585,34 @@ int __request_module(bool wait, const char *fmt, ...)
 EXPORT_SYMBOL(__request_module);
 #endif /* CONFIG_MODULES */
 
+static void call_usermodehelper_freeinfo(struct subprocess_info *info)
+{
+   if (info->cleanup)
+   (*info->cleanup)(info);
+   kfree(info);
+}
+
+static void umh_complete(struct subprocess_info *sub_info)
+{
+   struct completion *comp = xchg(&sub_info->complete, NULL);
+   /*
+* See call_usermodehelper_exec(). If xchg() returns NULL
+* we own sub_info, the UMH_KILLABLE caller has gone away
+* or the caller used UMH_NO_WAIT.
+*/
+   if (comp)
+   complete(comp);
+   else
+   call_usermodehelper_freeinfo(sub_info);
+}
+
 /*
  * This is the task which runs the usermode application
  */
 static int call_usermodehelper(void *data)
 {
struct subprocess_info *sub_info = data;
+   int wait = sub_info->wait & ~UMH_KILLABLE;
struct cred *new;
int retval;
 
@@ -607,7 +629,7 @@ static int call_usermodehelper(void *data)
retval = -ENOMEM;
new = prepare_kernel_cred(current);
if (!new)
-   goto fail;
+   goto out;
 
spin_lock(&umh_sysctl_lock);
new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
@@ -619,7 +641,7 @@ static int call_usermodehelper(void *data)
retval = sub_info->init(sub_info, new);
if (retval) {
abort_creds(new);
-   goto fail;
+   goto out;
}
}
 
@@ -628,12 +650,13 @@ static int call_usermodehelper(void *data)
retval = do_execve(getname_kernel(sub_info->path),
   (const char __user *const __user *)sub_info->argv,
   (const char __user *const __user *)sub_info->envp);
+out:
+   sub_info->retval = retval;
+   /* wait_for_helper() will call umh_complete if UHM_WAIT_PROC. */
+   if (wait != UMH_WAIT_PROC)
+   umh_complete(sub_info);
if (!retval)
return 0;
-
-   /* Exec failed? */
-fail:
-   sub_info->retval = retval;
do_exit(0);
 }
 
@@ -644,26 +667,6 @@ static int call_helper(void *data)
return call_usermodehelper(data);
 }
 
-static void call_usermodehelper_freeinfo(struct subprocess_info *info)
-{
-   if (info->cleanup)
-   (*info->cleanup)(info);
-   kfree(info);
-}
-
-static void umh_complete(struct subprocess_info *sub_i

Re: [Devel] [PATCH RHEL v2] mm: Reduce access frequency to shrinker_rwsem during shrink_slab

2020-08-21 Thread Andrey Ryabinin



On 8/20/20 5:51 PM, Valeriy Vdovin wrote:
> Bug https://jira.sw.ru/browse/PSBM-99181 has introduced a problem: when
> the kernel has opened NFS delegations and NFS server is not accessible
> at the time when NFS shrinker is called, the whole shrinker list
> execution gets stuck until NFS server is back. Being a problem in itself
> it also introduces bigger problem - during that hang, the shrinker_rwsem
> also gets locked, consequently no new mounts can be done at that time
> because new superblock tries to register it's own shrinker and also gets
> stuck at aquiring shrinker_rwsem.
> 
> Commit 9e9e35d050955648449498827deb2d43be0564e1 is a workaround for that
> problem. It is known that during single shrinker execution we do not
> actually need to hold shrinker_rwsem so we release and reacquire the
> rwsem for each shrinker in the list.
> 
> Because of this workaround shrink_slab function now experiences a major
> slowdown, because shrinker_rwsem gets accessed for each shrinker in the
> list twice. On an idle fresh-booted system shrinker_list could be
> iterated up to 1600 times a second, although originally the problem was
> local to only one NFS shrinker.
> 
> This patch fixes commit 9e9e35d050955648449498827deb2d43be0564e1 so that
> before calling up_read() on shrinker_rwsem, we check that this is
> really an NFS shrinker by checking the NFS magic in the superblock, if
> it is accessible from the shrinker.
> 
> https://jira.sw.ru/browse/PSBM-99181
> 
> Co-authored-by: Andrey Ryabinin 
> Signed-off-by: Valeriy Vdovin 
> 
> Changes:
>   v2: Added missing 'rwsem_is_contented' check
> ---

Reviewed-by: Andrey Ryabinin 
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH RHEL7] mm: Reduce access frequency to shrinker_rwsem during shrink_slab

2020-08-20 Thread Andrey Ryabinin



On 8/20/20 11:32 AM, Valeriy Vdovin wrote:

> @@ -565,14 +588,16 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>* memcg_expand_one_shrinker_map if new shrinkers
>* were registred in the meanwhile.
>*/
> - if (!down_read_trylock(&shrinker_rwsem)) {
> - freed = freed ? : 1;
> + if (is_nfs) {
> + if (!down_read_trylock(&shrinker_rwsem)) {
> + freed = freed ? : 1;
> + put_shrinker(shrinker);
> + return freed;
> + }
>   put_shrinker(shrinker);
> - return freed;
> + map = memcg_nid_shrinker_map(memcg, nid);
> + nr_max = min(shrinker_nr_max, map->nr_max);
>   }

Need to add rwsem_is_contended() check back. It was here before commit 9e9e35d05

else if (rwsem_is_contended(&shrinker_rwsem)) {
freed = freed ? : 1;
break;
}



> - put_shrinker(shrinker);
> - map = memcg_nid_shrinker_map(memcg, nid);
> - nr_max = min(shrinker_nr_max, map->nr_max);
>   }
>  unlock:
>   up_read(&shrinker_rwsem);
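
For v2 this would give something like the following loop tail, combining the
quoted hunk with the check above (a sketch of the intent only, not the exact
resulting patch):

	if (is_nfs) {
		if (!down_read_trylock(&shrinker_rwsem)) {
			freed = freed ? : 1;
			put_shrinker(shrinker);
			return freed;
		}
		put_shrinker(shrinker);
		map = memcg_nid_shrinker_map(memcg, nid);
		nr_max = min(shrinker_nr_max, map->nr_max);
	} else if (rwsem_is_contended(&shrinker_rwsem)) {
		freed = freed ? : 1;
		break;
	}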
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7] ms/vt: vt_ioctl: fix VT_DISALLOCATE freeing in-use virtual console

2020-08-19 Thread Andrey Ryabinin
From: Eric Biggers 

The VT_DISALLOCATE ioctl can free a virtual console while tty_release()
is still running, causing a use-after-free in con_shutdown().  This
occurs because VT_DISALLOCATE considers a virtual console's
'struct vc_data' to be unused as soon as the corresponding tty's
refcount hits 0.  But actually it may be still being closed.

Fix this by making vc_data be reference-counted via the embedded
'struct tty_port'.  A newly allocated virtual console has refcount 1.
Opening it for the first time increments the refcount to 2.  Closing it
for the last time decrements the refcount (in tty_operations::cleanup()
so that it happens late enough), as does VT_DISALLOCATE.

Reproducer:
#include <fcntl.h>
#include <linux/vt.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main()
{
if (fork()) {
for (;;)
close(open("/dev/tty5", O_RDWR));
} else {
int fd = open("/dev/tty10", O_RDWR);

for (;;)
ioctl(fd, VT_DISALLOCATE, 5);
}
}

KASAN report:
BUG: KASAN: use-after-free in con_shutdown+0x76/0x80 
drivers/tty/vt/vt.c:3278
Write of size 8 at addr 88806a4ec108 by task syz_vt/129

CPU: 0 PID: 129 Comm: syz_vt Not tainted 5.6.0-rc2 #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
?-20191223_100556-anatol 04/01/2014
Call Trace:
 [...]
 con_shutdown+0x76/0x80 drivers/tty/vt/vt.c:3278
 release_tty+0xa8/0x410 drivers/tty/tty_io.c:1514
 tty_release_struct+0x34/0x50 drivers/tty/tty_io.c:1629
 tty_release+0x984/0xed0 drivers/tty/tty_io.c:1789
 [...]

Allocated by task 129:
 [...]
 kzalloc include/linux/slab.h:669 [inline]
 vc_allocate drivers/tty/vt/vt.c:1085 [inline]
 vc_allocate+0x1ac/0x680 drivers/tty/vt/vt.c:1066
 con_install+0x4d/0x3f0 drivers/tty/vt/vt.c:3229
 tty_driver_install_tty drivers/tty/tty_io.c:1228 [inline]
 tty_init_dev+0x94/0x350 drivers/tty/tty_io.c:1341
 tty_open_by_driver drivers/tty/tty_io.c:1987 [inline]
 tty_open+0x3ca/0xb30 drivers/tty/tty_io.c:2035
 [...]

Freed by task 130:
 [...]
 kfree+0xbf/0x1e0 mm/slab.c:3757
 vt_disallocate drivers/tty/vt/vt_ioctl.c:300 [inline]
 vt_ioctl+0x16dc/0x1e30 drivers/tty/vt/vt_ioctl.c:818
 tty_ioctl+0x9db/0x11b0 drivers/tty/tty_io.c:2660
 [...]

Fixes: 4001d7b7fc27 ("vt: push down the tty lock so we can see what is left to 
tackle")
Cc:  # v3.4+
Reported-by: syzbot+522643ab5729b0421...@syzkaller.appspotmail.com
Acked-by: Jiri Slaby 
Signed-off-by: Eric Biggers 
Link: https://lore.kernel.org/r/20200322034305.210082-2-ebigg...@kernel.org
Signed-off-by: Greg Kroah-Hartman 

https://jira.sw.ru/browse/PSBM-106391
(cherry-picked from commit ca4463bf8438b403596edd0ec961ca0d4fbe0220)
Signed-off-by: Andrey Ryabinin 
---
 drivers/tty/vt/vt.c   | 23 ++-
 drivers/tty/vt/vt_ioctl.c | 12 
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 0ee0cd507522..795d7867ac24 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -767,6 +767,17 @@ static void visual_deinit(struct vc_data *vc)
module_put(vc->vc_sw->owner);
 }
 
+static void vc_port_destruct(struct tty_port *port)
+{
+   struct vc_data *vc = container_of(port, struct vc_data, port);
+
+   kfree(vc);
+}
+
+static const struct tty_port_operations vc_port_ops = {
+   .destruct = vc_port_destruct,
+};
+
 int vc_allocate(unsigned int currcons) /* return 0 on success */
 {
struct vt_notifier_param param;
@@ -792,6 +803,7 @@ int vc_allocate(unsigned int currcons)  /* return 0 on 
success */
 
vc_cons[currcons].d = vc;
tty_port_init(&vc->port);
+   vc->port.ops = &vc_port_ops;
INIT_WORK(&vc_cons[currcons].SAK_work, vc_SAK);
 
visual_init(vc, currcons, 1);
@@ -2799,6 +2811,7 @@ static int con_install(struct tty_driver *driver, struct 
tty_struct *tty)
 
tty->driver_data = vc;
vc->port.tty = tty;
+   tty_port_get(&vc->port);
 
if (!tty->winsize.ws_row && !tty->winsize.ws_col) {
tty->winsize.ws_row = vc_cons[currcons].d->vc_rows;
@@ -2834,6 +2847,13 @@ static void con_shutdown(struct tty_struct *tty)
console_unlock();
 }
 
+static void con_cleanup(struct tty_struct *tty)
+{
+   struct vc_data *vc = tty->driver_data;
+
+   tty_port_put(&vc->port);
+}
+
 static int default_italic_color= 2; // green (ASCII)
 static int default_underline_color = 3; // cyan (ASCII)
 module_param_named(italic, default_italic_color, int, S_IRUGO | S_IWUSR);
@@ -2956,7 +2976,8 @@ static const struct tty_operations con_o

Re: [Devel] [PATCH rh7 v4] mm/memcg: fix cache growth above cache.limit_in_bytes

2020-07-30 Thread Andrey Ryabinin


On 7/30/20 6:52 PM, Evgenii Shatokhin wrote:
> Hi,
> 
> On 30.07.2020 18:02, Andrey Ryabinin wrote:
>> Exceeding cache above cache.limit_in_bytes schedules high_work_func(),
>> which tries to reclaim 32 pages. If cache is generated fast enough, this
>> allows the cgroup to steadily grow above cache.limit_in_bytes because we
>> don't reclaim enough. Try to reclaim the exceeded amount of cache instead.
>>
>> https://jira.sw.ru/browse/PSBM-106384
>> Signed-off-by: Andrey Ryabinin 
>> ---
>>
>>   - Changes since v1: add bug link to changelog
>>   - Changes since v2: Fix cache_overused check (We should check if it's 
>> positive).
>>  Made this stupid bug during cleanup, patch was tested without bogus 
>> cleanup,
>>  so it should work.
>>   - Changes since v3: Compilation fixes, properly tested now.
>>
>>   mm/memcontrol.c | 10 +++---
>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 3cf200f506c3..16cbd451a588 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
>>   {
>>     do {
>> +    long cache_overused;
>> +
>>   if (page_counter_read(&memcg->memory) > memcg->high)
>>   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 0);
>>   -    if (page_counter_read(&memcg->cache) > memcg->cache.limit)
>> -    try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
>> -    MEM_CGROUP_RECLAIM_NOSWAP);
>> +    cache_overused = page_counter_read(&memcg->cache) -
>> +    memcg->cache.limit;
> 
> If cache_overused is less than 32 pages, the kernel would try to reclaim less 
> than before the patch. Is it OK, or should it try to reclaim at least 32 
> pages?

It's ok, try_to_free_mem_cgroup_pages will increase it:

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
   unsigned long nr_pages,
   gfp_t gfp_mask,
   int flags)


.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
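
So with hypothetical numbers: if the cache counter is at 1010 pages and the
cache limit is 1000 pages, cache_overused is 10 and the actual reclaim target
becomes max(10, SWAP_CLUSTER_MAX) = 32 pages, i.e. never less than before the
patch.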


___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 v4] mm/memcg: fix cache growth above cache.limit_in_bytes

2020-07-30 Thread Andrey Ryabinin
Exceeding cache above cache.limit_in_bytes schedules high_work_func(),
which tries to reclaim 32 pages. If cache is generated fast enough, this
allows the cgroup to steadily grow above cache.limit_in_bytes because we
don't reclaim enough. Try to reclaim the exceeded amount of cache instead.
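
For example (hypothetical numbers): with the cache limit at 1000 pages and
the cache counter at 1500 pages, the old code asked to reclaim only 32 pages
per high_work_func() run, while with this patch the request is the whole
500-page excess (try_to_free_mem_cgroup_pages() still clamps the request to
at least SWAP_CLUSTER_MAX).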

https://jira.sw.ru/browse/PSBM-106384
Signed-off-by: Andrey Ryabinin 
---

 - Changes since v1: add bug link to changelog
 - Changes since v2: Fix cache_overused check (We should check if it's 
positive).
Made this stupid bug during cleanup, patch was tested without bogus cleanup,
so it should work.
 - Changes since v3: Compilation fixes, properly tested now.

 mm/memcontrol.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3cf200f506c3..16cbd451a588 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
 {
 
do {
+   long cache_overused;
+
if (page_counter_read(&memcg->memory) > memcg->high)
try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 
0);
 
-   if (page_counter_read(&memcg->cache) > memcg->cache.limit)
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
-   MEM_CGROUP_RECLAIM_NOSWAP);
+   cache_overused = page_counter_read(&memcg->cache) -
+   memcg->cache.limit;
+   if (cache_overused > 0)
+   try_to_free_mem_cgroup_pages(memcg, cache_overused,
+   gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 
} while ((memcg = parent_mem_cgroup(memcg)));
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 v3] mm/memcg: fix cache growth above cache.limit_in_bytes

2020-07-30 Thread Andrey Ryabinin
Exceeding cache above cache.limit_in_bytes schedules high_work_func(),
which tries to reclaim 32 pages. If cache is generated fast enough, this
allows the cgroup to steadily grow above cache.limit_in_bytes because we
don't reclaim enough. Try to reclaim the exceeded amount of cache instead.

https://jira.sw.ru/browse/PSBM-106384
Signed-off-by: Andrey Ryabinin 
---

 Changes since v1: add bug link to changelog
 Changes since v2: Fix cache_overused check (We should check if it's positive).
Made this stupid bug during cleanup, patch was tested without bogus cleanup,
so it should work.

 mm/memcontrol.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3cf200f506c3..e23e546fd00f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
 {
 
do {
+   long cache_overused;
+
if (page_counter_read(&memcg->memory) > memcg->high)
try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 
0);
 
-   if (page_counter_read(&memcg->cache) > memcg->cache.limit)
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
-   MEM_CGROUP_RECLAIM_NOSWAP);
+   cache_overused = page_counter_read(&memcg->cache) -
+   memcg->cache.limit;
+   if (cache_overused > 0)
+   try_to_free_mem_cgroup_pages(memcg, max(CHARGE_BATCH, 
cache_overused,
+   gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 
} while ((memcg = parent_mem_cgroup(memcg)));
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH rh7 v2] mm/memcg: fix cache growth above cache.limit_in_bytes

2020-07-30 Thread Andrey Ryabinin
Exceeding cache above cache.limit_in_bytes schedules high_work_func(),
which tries to reclaim 32 pages. If cache is generated fast enough, this
allows the cgroup to steadily grow above cache.limit_in_bytes because we
don't reclaim enough. Try to reclaim the exceeded amount of cache instead.

https://jira.sw.ru/browse/PSBM-106384
Signed-off-by: Andrey Ryabinin 
---

Changes since v1: add bug link to changelog

 mm/memcontrol.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3cf200f506c3..e5adb0e81cbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,12 +3080,16 @@ static void reclaim_high(struct mem_cgroup *memcg,
 {
 
do {
+   unsigned long cache_overused;
+
if (page_counter_read(&memcg->memory) > memcg->high)
try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 
0);
 
-   if (page_counter_read(&memcg->cache) > memcg->cache.limit)
-   try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
-   MEM_CGROUP_RECLAIM_NOSWAP);
+   cache_overused = page_counter_read(&memcg->cache) -
+   memcg->cache.limit;
+   if (cache_overused)
+   try_to_free_mem_cgroup_pages(memcg, cache_overused,
+   gfp_mask, MEM_CGROUP_RECLAIM_NOSWAP);
 
} while ((memcg = parent_mem_cgroup(memcg)));
 }
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

