[Devel] [PATCH vz8 2/3] oom: resurrect berserker mode

2020-11-11 Thread Andrey Ryabinin
From: Vladimir Davydov 

The logic behind the OOM berserker is the same as in PCS6: if processes
are killed by the oom killer too often (less than sysctl vm.oom_relaxation
apart, 1 sec by default), we increase the "rage" (min -10, max 20) and,
once "rage" >= 0, additionally kill the 1 << "rage" youngest worst processes.

https://jira.sw.ru/browse/PSBM-17930

Signed-off-by: Vladimir Davydov 
[aryabinin: vz8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 include/linux/memcontrol.h |  6 +++
 include/linux/oom.h        |  4 ++
 mm/oom_kill.c              | 97 ++
 3 files changed, 107 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c26041c681f2..0efabad868ce 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -260,6 +260,12 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   int oom_rage;
+   spinlock_t  oom_rage_lock;
+   unsigned long   prev_oom_time;
+   unsigned long   oom_time;
+
+
/* memory.events */
struct cgroup_file events_file;
 
diff --git a/include/linux/oom.h b/include/linux/oom.h
index b0ee726c1672..9a6d16a1ace5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -15,6 +15,9 @@ struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
 
+#define OOM_BASE_RAGE  -10
+#define OOM_MAX_RAGE   20
+
 /*
 * Details of the page allocation that triggered the oom killer that are used to
  * determine what should be killed.
@@ -44,6 +47,7 @@ struct oom_control {
unsigned long totalpages;
struct task_struct *chosen;
unsigned long chosen_points;
+   unsigned long overdraft;
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab436d94ae5d..e746b41d558c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,6 +53,7 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
+int sysctl_oom_relaxation = HZ;
 
 DEFINE_MUTEX(oom_lock);
 
@@ -947,6 +948,101 @@ static int oom_kill_memcg_member(struct task_struct *task, void *message)
return 0;
 }
 
+/*
+ * Kill more processes if oom happens too often in this context.
+ */
+static void oom_berserker(struct oom_control *oc)
+{
+   static DEFINE_RATELIMIT_STATE(berserker_rs,
+   DEFAULT_RATELIMIT_INTERVAL,
+   DEFAULT_RATELIMIT_BURST);
+   struct task_struct *p;
+   struct mem_cgroup *memcg;
+   unsigned long now = jiffies;
+   int rage;
+   int killed = 0;
+
+   memcg = oc->memcg ?: root_mem_cgroup;
+
+   spin_lock(&memcg->oom_rage_lock);
+   memcg->prev_oom_time = memcg->oom_time;
+   memcg->oom_time = now;
+   /*
+    * Increase rage if oom happened recently in this context, reset
+    * rage otherwise.
+    *
+    * previous oom                          this oom (unfinished)
+    *    +-------------------------------------+
+    *    ^                                     ^
+    *  prev_oom_time  <-- oom_relaxation -->  oom_time
+    */
+   if (time_after(now, memcg->prev_oom_time + sysctl_oom_relaxation))
+   memcg->oom_rage = OOM_BASE_RAGE;
+   else if (memcg->oom_rage < OOM_MAX_RAGE)
+   memcg->oom_rage++;
+   rage = memcg->oom_rage;
+   spin_unlock(&memcg->oom_rage_lock);
+
+   if (rage < 0)
+   return;
+
+   /*
+* So, we are in rage. Kill (1 << rage) youngest tasks that are
+* as bad as the victim.
+*/
+   read_lock(&tasklist_lock);
+   list_for_each_entry_reverse(p, &init_task.tasks, tasks) {
+   unsigned long tsk_points;
+   unsigned long tsk_overdraft;
+
+   if (!p->mm || test_tsk_thread_flag(p, TIF_MEMDIE) ||
+   fatal_signal_pending(p) || p->flags & PF_EXITING ||
+   oom_unkillable_task(p, oc->memcg, oc->nodemask))
+   continue;
+
+   tsk_points = oom_badness(p, oc->memcg, oc->nodemask,
+   oc->totalpages, &tsk_overdraft);
+   if (tsk_overdraft < oc->overdraft)
+   continue;
+
+   /*
+* oom_badness never returns a negative value, even if
+* oom_score_adj would make badness so, instead it
+* returns 1. So we do not kill task with badness 1 if
+* the victim has badness > 1 so as not to risk killing
+* protected tasks.
+*/
+   if (tsk_points <= 1 && oc->chosen_points > 1)
+   continue;
+
+   /*
+* Consider tasks as equally bad if they have equal
+* normalized scores.
+*/
+   if (tsk_points * 1000 / oc->totalpages <
+   oc->chosen_points * 1000 / oc->totalpages)
+   continue;
+
+

[Devel] [PATCH vz8 3/3] oom: make berserker more aggressive

2020-11-11 Thread Andrey Ryabinin
From: Vladimir Davydov 

In the berserker mode we kill a bunch of tasks that are as bad as the
selected victim. We assume two tasks to be equally bad if they consume
the same permille of memory. With such a strict check, it might turn out
that the oom berserker won't kill any tasks while a fork bomb is running
inside a container, even though killing a task eating <=1/1000th of
memory won't be enough to cope with the memory shortage. Let's loosen
this check and use percentage instead of permille. It might still happen
that the berserker won't kill anyone, but in that case the regular oom
kill frees at least 1/100th of memory, which should be enough even for
small containers.

Also, check berserker mode even if the victim has already exited by the
time we are about to send SIGKILL to it. Rationale: when the berserker
is in rage, it might kill hundreds of tasks so that the next oom kill is
likely to select an exiting task. Not triggering berserker in this case
will result in oom stalls.

Signed-off-by: Vladimir Davydov 

[aryabinin: rh8 rebase]
Signed-off-by: Andrey Ryabinin 
---
 mm/oom_kill.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e746b41d558c..1cf75939aba6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1016,11 +1016,11 @@ static void oom_berserker(struct oom_control *oc)
continue;
 
/*
-* Consider tasks as equally bad if they have equal
-* normalized scores.
+* Consider tasks as equally bad if they occupy equal
+* percentage of available memory.
 */
-   if (tsk_points * 1000 / oc->totalpages <
-   oc->chosen_points * 1000 / oc->totalpages)
+   if (tsk_points * 100 / oc->totalpages <
+   oc->chosen_points * 100 / oc->totalpages)
continue;
 
if (__ratelimit(&berserker_rs)) {
@@ -1061,6 +1061,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
wake_oom_reaper(victim);
task_unlock(victim);
put_task_struct(victim);
+   oom_berserker(oc);
return;
}
task_unlock(victim);
-- 
2.26.2

___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel


[Devel] [PATCH vz8 1/3] proc, memcg: use memcg limits for showing oom_score inside CT

2020-11-11 Thread Andrey Ryabinin
Use the memcg limits of the task to show /proc/<pid>/oom_score.
Note: in vz7 we had different behavior: oom_score was shown based
on the 've->memcg' limits of the process reading it. Now we look at
the memcg of the target process and don't care about the current one,
which seems like more correct behaviour.

Signed-off-by: Andrey Ryabinin 
---
 fs/proc/base.c |  8 +++-
 include/linux/memcontrol.h | 11 +++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 85fee7396e90..cb417426dd92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -525,8 +525,14 @@ static const struct file_operations proc_lstats_operations = {
 static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
  struct pid *pid, struct task_struct *task)
 {
-   unsigned long totalpages = totalram_pages + total_swap_pages;
+   unsigned long totalpages;
unsigned long points = 0;
+   struct mem_cgroup *memcg;
+
+   rcu_read_lock();
+   memcg = mem_cgroup_from_task(task);
+   totalpages = mem_cgroup_total_pages(memcg);
+   rcu_read_unlock();
 
points = oom_badness(task, NULL, NULL, totalpages, NULL) *
1000 / totalpages;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb8634128a81..c26041c681f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -581,6 +581,17 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
return mz->lru_zone_size[zone_idx][lru];
 }
 
+static inline unsigned long mem_cgroup_total_pages(struct mem_cgroup *memcg)
+{
+   unsigned long ram, ram_swap;
+   extern long total_swap_pages;
+
+   ram = min_t(unsigned long, totalram_pages, memcg->memory.max);
+   ram_swap = min_t(unsigned long, memcg->memsw.max, ram + total_swap_pages);
+
+   return ram_swap;
+}
+
 void mem_cgroup_handle_over_high(void);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
-- 
2.26.2



[Devel] [PATCH RHEL8 COMMIT] vecalls: Introduce VZCTL_GET_CPU_STAT ioctl

2020-11-11 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
-->
commit 264e1b6d6450baab509faa667f4ac72606d84940
Author: Konstantin Khorenko 
Date:   Wed Nov 11 16:00:04 2020 +0300

vecalls: Introduce VZCTL_GET_CPU_STAT ioctl

This vzctl ioctl is still used by the vzstat utility and by
dispatcher/libvirt statistics reporting.
From one point of view, almost all of this data can be obtained from
the cpu cgroup of a Container (missing data can be exported
additionally), but statistics are gathered often and the ioctl is
faster and requires less cpu power => let it be for now.

The current patch is based on following vz7 commits:
  ecdce58b214c ("sched: Export per task_group statistics_work")
  a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs")
  75fc174adc36 ("sched: Port cpustat related patches")

Signed-off-by: Konstantin Khorenko 
Reviewed-by: Andrey Ryabinin 
---
 include/linux/ve.h  |  2 ++
 kernel/time/time.c  |  1 +
 kernel/ve/ve.c  | 18 +++
 kernel/ve/vecalls.c | 66 +
 4 files changed, 87 insertions(+)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index 656ee43e383e..7cb416f342e7 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -201,10 +201,12 @@ struct seq_file;
 #if defined(CONFIG_VE) && defined(CONFIG_CGROUP_SCHED)
 int ve_show_cpu_stat(struct ve_struct *ve, struct seq_file *p);
 int ve_show_loadavg(struct ve_struct *ve, struct seq_file *p);
+int ve_get_cpu_avenrun(struct ve_struct *ve, unsigned long *avenrun);
 int ve_get_cpu_stat(struct ve_struct *ve, struct kernel_cpustat *kstat);
 #else
 static inline int ve_show_cpu_stat(struct ve_struct *ve, struct seq_file *p) { return -ENOSYS; }
 static inline int ve_show_loadavg(struct ve_struct *ve, struct seq_file *p) { return -ENOSYS; }
+static inline int ve_get_cpu_avenrun(struct ve_struct *ve, unsigned long *avenrun) { return -ENOSYS; }
 static inline int ve_get_cpu_stat(struct ve_struct *ve, struct kernel_cpustat *kstat) { return -ENOSYS; }
 #endif
 
diff --git a/kernel/time/time.c b/kernel/time/time.c
index 2b41e8e2d31d..ff1db0ba0c39 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -770,6 +770,7 @@ u64 nsec_to_clock_t(u64 x)
return div_u64(x * 9, (9ull * NSEC_PER_SEC + (USER_HZ / 2)) / USER_HZ);
 #endif
 }
+EXPORT_SYMBOL(nsec_to_clock_t);
 
 u64 jiffies64_to_nsecs(u64 j)
 {
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index a9afefc5b9de..29e98e6396dc 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -1430,6 +1430,24 @@ int ve_show_loadavg(struct ve_struct *ve, struct seq_file *p)
return err;
 }
 
+inline struct task_group *css_tg(struct cgroup_subsys_state *css);
+int get_avenrun_tg(struct task_group *tg, unsigned long *loads,
+  unsigned long offset, int shift);
+
+int ve_get_cpu_avenrun(struct ve_struct *ve, unsigned long *avnrun)
+{
+   struct cgroup_subsys_state *css;
+   struct task_group *tg;
+   int err;
+
+   css = ve_get_init_css(ve, cpu_cgrp_id);
+   tg = css_tg(css);
+   err = get_avenrun_tg(tg, avnrun, 0, 0);
+   css_put(css);
+   return err;
+}
+EXPORT_SYMBOL(ve_get_cpu_avenrun);
+
 int cpu_cgroup_get_stat(struct cgroup_subsys_state *cpu_css,
struct cgroup_subsys_state *cpuacct_css,
struct kernel_cpustat *kstat);
diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 3258b49b15b2..786a743faa1a 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 #include 
@@ -35,6 +37,62 @@ static u64 ve_get_uptime(struct ve_struct *ve)
return ktime_get_boot_ns() - ve->real_start_time;
 }
 
+static int fill_cpu_stat(envid_t veid, struct vz_cpu_stat __user *buf)
+{
+   struct ve_struct *ve;
+   struct vz_cpu_stat *vstat;
+   int retval;
+   int i;
+   unsigned long tmp;
+   unsigned long avnrun[3];
+   struct kernel_cpustat kstat;
+
+   if (!ve_is_super(get_exec_env()) && (veid != get_exec_env()->veid))
+   return -EPERM;
+   ve = get_ve_by_id(veid);
+   if (!ve)
+   return -ESRCH;
+
+   retval = -ENOMEM;
+   vstat = kzalloc(sizeof(*vstat), GFP_KERNEL);
+   if (!vstat)
+   goto out_put_ve;
+
+   retval = ve_get_cpu_stat(ve, &kstat);
+   if (retval)
+   goto out_free;
+
+   retval = ve_get_cpu_avenrun(ve, avnrun);
+   if (retval)
+   goto out_free;
+
+   vstat->user_jif   = (unsigned long)nsec_to_clock_t(
+  kstat.cpustat[CPUTIME_USER]);
+   vstat->nice_jif   = (unsigned long)nsec_to_clock_t(
+  kstat.cpustat[CPUTIME_NICE]);
+   vstat->system_jif = (unsigned 

[Devel] [PATCH RHEL8 COMMIT] ve/sched/loadavg: Provide task_group parameter to get_avenrun_ve()

2020-11-11 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
-->
commit 2e7bc3486fb7bd25dbc3a6a4530ece030aa8456c
Author: Konstantin Khorenko 
Date:   Wed Nov 11 16:00:03 2020 +0300

ve/sched/loadavg: Provide task_group parameter to get_avenrun_ve()

Rename get_avenrun_ve() to get_avenrun_tg() and provide it with a
task_group argument so that it can later be used for any VE, not only
for the current one.

Fixes: f52cf2752bca ("ve/sched/loadavg: Calculate avenrun for Containers
root cpu cgroups")

Signed-off-by: Konstantin Khorenko 
Reviewed-by: Andrey Ryabinin 
---
 include/linux/sched/loadavg.h |  2 --
 kernel/sched/loadavg.c| 12 ++--
 kernel/sys.c  |  6 +-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 1da5768389b7..25fb3344cdbf 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -16,8 +16,6 @@
  */
 extern unsigned long avenrun[];/* Load averages */
 extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
-extern void get_avenrun_ve(unsigned long *loads,
-  unsigned long offset, int shift);
 
 #define FSHIFT 11  /* nr of bits of precision */
 #define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */
[...]
	loads[1] = (tg->avenrun[1] + offset) << shift;
loads[2] = (tg->avenrun[2] + offset) << shift;
+
+   return 0;
 }
 
 long calc_load_fold_active(struct rq *this_rq, long adjust)
diff --git a/kernel/sys.c b/kernel/sys.c
index e7e07ea8d7ef..8560e5bcb6c2 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2543,6 +2543,8 @@ SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned 
__user *, nodep,
 }
 
 extern void si_meminfo_ve(struct sysinfo *si, struct ve_struct *ve);
+extern int get_avenrun_tg(struct task_group *tg, unsigned long *loads,
+ unsigned long offset, int shift);
 
 /**
  * do_sysinfo - fill in sysinfo struct
@@ -2575,7 +2577,9 @@ static int do_sysinfo(struct sysinfo *info)
 
info->procs = nr_threads_ve(ve);
 
-   get_avenrun_ve(info->loads, 0, SI_LOAD_SHIFT - FSHIFT);
+   /* does not fail on non-VE0 task group */
+   (void)get_avenrun_tg(NULL, info->loads,
+0, SI_LOAD_SHIFT - FSHIFT);
}
 
/*


[Devel] [PATCH RHEL8 COMMIT] vecalls: Add cpu stat measurement units comments to header

2020-11-11 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
-->
commit 3ed7a174b687c3dd2fab0ee255731b8efa3b44d9
Author: Konstantin Khorenko 
Date:   Wed Nov 11 16:00:03 2020 +0300

vecalls: Add cpu stat measurement units comments to header

It's not obvious that, say, the "user_jif" field does not contain
time in jiffies, so add clarifying comments.

Fixes: 248ed6b2a193 ("ve: Add vecalls")

Signed-off-by: Konstantin Khorenko 
Reviewed-by: Andrey Ryabinin 
---
 include/uapi/linux/vzcalluser.h | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/vzcalluser.h b/include/uapi/linux/vzcalluser.h
index f2584b4b284f..8a4ff0015e40 100644
--- a/include/uapi/linux/vzcalluser.h
+++ b/include/uapi/linux/vzcalluser.h
@@ -55,13 +55,13 @@ struct vz_load_avg {
 };
 
 struct vz_cpu_stat {
-   unsigned long   user_jif;
-   unsigned long   nice_jif;
-   unsigned long   system_jif;
-   unsigned long   uptime_jif;
-   __u64   idle_clk;
-   __u64   strv_clk;
-   __u64   uptime_clk;
+   unsigned long   user_jif;   /* clock_t */
+   unsigned long   nice_jif;   /* clock_t */
+   unsigned long   system_jif; /* clock_t */
+   unsigned long   uptime_jif; /* clock_t */
+   __u64   idle_clk;   /* ns */
+   __u64   strv_clk;   /* deprecated */
+   __u64   uptime_clk; /* ns */
struct vz_load_avg  avenrun[3]; /* loadavg data */
 };
 


[Devel] [PATCH RHEL8 COMMIT] vdso, vclock_gettime: fix linking with old linkers

2020-11-11 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
-->
commit 998f47870c4a498d1a53c191816c3c03dda480f9
Author: Andrey Ryabinin 
Date:   Wed Nov 11 15:54:08 2020 +0300

vdso, vclock_gettime: fix linking with old linkers

On some old linkers the vdso fails to build because of a
dynamic relocation of the 've_start_time' symbol:
VDSO2C  arch/x86/entry/vdso/vdso-image-64.c
Error: vdso image contains dynamic relocations

I wasn't able to figure out why new linkers don't generate the
relocation while old ones do, but I did find out that the
visibility("hidden") attribute on 've_start_time' cures the problem.

Fixes: af2c78f571e6 ("ve: add per-ve CLOCK_MONOTONIC time via
__vdso_gettimeofday()")

https://jira.sw.ru/browse/PSBM-121668
Signed-off-by: Andrey Ryabinin 
---
 arch/x86/entry/vdso/vclock_gettime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 224dbe80da66..b2f1f19319d8 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -24,7 +24,7 @@
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
-u64 ve_start_time;
+u64 ve_start_time  __attribute__((visibility("hidden")));
 
 extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
 extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);


[Devel] [PATCH RHEL8 COMMIT] sched/stat: account forks per task group

2020-11-11 Thread Konstantin Khorenko
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and will appear 
at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
-->
commit 372ecbe6241acf33e718ad826670d5a03ed6efaa
Author: Vladimir Davydov 
Date:   Thu Mar 14 21:00:53 2013 +0400

sched/stat: account forks per task group

This is a backport of diff-sched-account-forks-per-task-group:

 Subject: sched: account forks per task group
 Date: Fri, 28 Dec 2012 15:09:46 +0400

* [sched] the number of processes should be reported correctly
inside a CT in /proc/stat (PSBM-18113)

For /proc/stat:processes to be correct inside containers.

https://jira.sw.ru/browse/PSBM-18113

Signed-off-by: Vladimir Davydov 

(cherry picked from vz7 commit 0a927bf02fd873f4e9bad7c4df0c201bf9b48274)
Signed-off-by: Konstantin Khorenko 
---
 kernel/sched/cpuacct.c | 4 +++-
 kernel/sched/fair.c| 1 +
 kernel/sched/sched.h   | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 646bbd257110..df5fe01c8f24 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -652,6 +652,7 @@ int cpu_cgroup_proc_stat(struct cgroup_subsys_state *cpu_css,
unsigned long tg_nr_running = 0;
unsigned long tg_nr_iowait = 0;
unsigned long long tg_nr_switches = 0;
+   unsigned long tg_nr_forks = 0;
 
 getboottime64(&boot);
 
@@ -671,6 +672,7 @@ int cpu_cgroup_proc_stat(struct cgroup_subsys_state *cpu_css,
tg_nr_running += tg->cfs_rq[i]->h_nr_running;
tg_nr_iowait  += tg->cfs_rq[i]->nr_iowait;
tg_nr_switches += tg->cfs_rq[i]->nr_switches;
+   tg_nr_forks   += tg->cfs_rq[i]->nr_forks;
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
tg_nr_running += tg->rt_rq[i]->rt_nr_running;
@@ -746,7 +748,7 @@ int cpu_cgroup_proc_stat(struct cgroup_subsys_state *cpu_css,
   "procs_blocked %lu\n",
   tg_nr_switches,
   (unsigned long long)boot_sec,
-  total_forks,
+  tg_nr_forks,
   tg_nr_running,
   tg_nr_iowait);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b9bb108625a..892329471df1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10300,6 +10300,7 @@ static void task_fork_fair(struct task_struct *p)
}
 
se->vruntime -= cfs_rq->min_vruntime;
+   cfs_rq->nr_forks++;
 rq_unlock(rq, &rf);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d55b45f1ea6..ccd8ad478a08 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -545,6 +545,7 @@ struct cfs_rq {
struct sched_entity *prev;
 
u64 nr_switches;
+   unsigned long nr_forks;
 
 #ifdef CONFIG_SCHED_DEBUG
 unsigned int   nr_spread_over;


[Devel] [PATCH rh7 v2] cgroup: rework reference acquisition for cgroup_find_inode

2020-11-11 Thread Andrey Zhadchenko
Use the more generic igrab() instead of an atomic increment. Move
cgroup_hash_del() to the eviction stage to avoid a deadlock.

Signed-off-by: Andrey Zhadchenko 
---

v2: adjusted function call order in cgroup_evict_inode to match existing code

 kernel/cgroup.c | 25 -
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 27d7a5e..8c2cef8 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1522,21 +1522,10 @@ static struct inode *cgroup_find_inode(unsigned long fh[2], char take_ref)
struct inode *ret = NULL;
 
 spin_lock(&cgroup_inode_table_lock);
-   item = cgroup_find_item_no_lock(fh);
 
-   /*
-* If we need to increase refcount, we should be aware of possible
-* deadlock. Another thread may have started deleting this inode:
-* iput->iput_final->cgroup_delete_inode->cgroup_hash_del
-* If we just call igrab, it will try to take i_lock and this will
-* result in deadlock, because deleting thread has already taken it
-* and waits on cgroup_inode_table_lock to find inode in hashtable.
-*
-* If i_count is zero, someone is deleting it -> skip.
-*/
-   if (take_ref && item)
-   if (!atomic_inc_not_zero(&item->inode->i_count))
-   item = NULL;
+   item = cgroup_find_item_no_lock(fh);
+   if (item && take_ref && !igrab(item->inode))
+   item = NULL;
 
 spin_unlock(&cgroup_inode_table_lock);
 
@@ -1634,15 +1623,17 @@ static const struct export_operations cgroup_export_ops = {
.fh_to_dentry   = cgroup_fh_to_dentry,
 };
 
-static int cgroup_delete_inode(struct inode *inode)
+static void cgroup_evict_inode(struct inode *inode)
 {
+   truncate_inode_pages_final(&inode->i_data);
+   clear_inode(inode);
cgroup_hash_del(inode);
-   return generic_delete_inode(inode);
 }
 
 static const struct super_operations cgroup_ops = {
.statfs = simple_statfs,
-   .drop_inode = cgroup_delete_inode,
+   .drop_inode = generic_delete_inode,
+   .evict_inode = cgroup_evict_inode,
.show_options = cgroup_show_options,
 #ifdef CONFIG_VE
.show_path = cgroup_show_path,
-- 
1.8.3.1



[Devel] [PATCH rh8 1/2] netlink: protect NETLINK_REPAIR

2020-11-11 Thread Andrey Zhadchenko
Prevent using netlink repair mode from containers.

Signed-off-by: Andrey Zhadchenko 
---
 net/netlink/af_netlink.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 46c2dbd..2b9e9c7 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1670,6 +1670,13 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 
switch (optname) {
case NETLINK_REPAIR:
+#ifdef CONFIG_VE
+   {
+   struct ve_struct *ve = get_exec_env();
+   if (!ve_is_super(ve) && !ve->is_pseudosuper)
+   return -ENOPROTOOPT;
+   }
+#endif
if (val)
nlk->flags |= NETLINK_F_REPAIR;
else
-- 
1.8.3.1



[Devel] [PATCH rh8 2/2] netlink: add an option to set sk->err from userspace

2020-11-11 Thread Andrey Zhadchenko
Sometimes during dump, criu can encounter sockets with an overflown
kernel buffer, which results in an ENOBUFS error on the next read.
We need a reliable way to restore sk->sk_err.

https://jira.sw.ru/browse/PSBM-120976
Signed-off-by: Andrey Zhadchenko 
---
 include/uapi/linux/netlink.h |  1 +
 net/netlink/af_netlink.c | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 67ea114..4360186 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -157,6 +157,7 @@ enum nlmsgerr_attrs {
 #define NETLINK_EXT_ACK11
 #define NETLINK_GET_STRICT_CHK 12
 #define NETLINK_REPAIR 127
+#define NETLINK_SETERR 128
 
 struct nl_pktinfo {
__u32   group;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2b9e9c7..c372555 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1683,6 +1683,16 @@ static int netlink_setsockopt(struct socket *sock, int 
level, int optname,
nlk->flags &= ~NETLINK_F_REPAIR;
err = 0;
break;
+   case NETLINK_SETERR:
+   err = -ENOPROTOOPT;
+   if (nlk->flags & NETLINK_F_REPAIR) {
+   if (!val || val > MAX_ERRNO)
+   return -EINVAL;
+   sk->sk_err = val;
+   sk->sk_error_report(sk);
+   err = 0;
+   }
+   break;
case NETLINK_PKTINFO:
if (val)
nlk->flags |= NETLINK_F_RECV_PKTINFO;
-- 
1.8.3.1



[Devel] [PATCH RHEL7 COMMIT] ploop: Fix crash in purge_lru_warn()

2020-11-11 Thread Vasily Averin
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz" and will 
appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.43
-->
commit a2ed38f07f597d7a92dcff9e1489eebc2938d325
Author: Kirill Tkhai 
Date:   Wed Nov 11 09:46:22 2020 +0300

ploop: Fix crash in purge_lru_warn()

do_div() works incorrectly when its second argument is a long.
We don't need the remainder, so we don't need do_div() at all.

https://jira.sw.ru/browse/PSBM-122035

Reported-by: Evgenii Shatokhin 
Signed-off-by: Kirill Tkhai 
Reviewed-by: Evgenii Shatokhin 
Reviewed-by: Andrey Ryabinin 
---
 drivers/block/ploop/io_direct_map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/ploop/io_direct_map.c b/drivers/block/ploop/io_direct_map.c
index 5528e86..8f09ab0 100644
--- a/drivers/block/ploop/io_direct_map.c
+++ b/drivers/block/ploop/io_direct_map.c
@@ -377,7 +377,7 @@ static inline void purge_lru_warn(struct extent_map_tree *tree)
loff_t ratio = i_size_read(tree->mapping->host) * 100;
 long images_size = atomic_long_read(&ploop_io_images_size) ? : 1;
 
-   do_div(ratio, images_size);
+   ratio /= images_size;
 
printk(KERN_WARNING "Purging lru entry from extent tree for inode %ld "
   "(map_size=%d ratio=%lld%%)\n",