[Devel] [PATCH vz8 2/3] oom: resurrect berserker mode
From: Vladimir Davydov

The logic behind the OOM berserker is the same as in PCS6: if processes
are killed by the oom killer too often (less than sysctl
vm.oom_relaxation apart, 1 sec by default), we increase "rage"
(min -10, max 20) and kill 1 << "rage" youngest worst processes if
"rage" >= 0.

https://jira.sw.ru/browse/PSBM-17930

Signed-off-by: Vladimir Davydov
[aryabinin: vz8 rebase]
Signed-off-by: Andrey Ryabinin
---
 include/linux/memcontrol.h |  6 ++++++
 include/linux/oom.h        |  4 ++++
 mm/oom_kill.c              | 97 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 107 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c26041c681f2..0efabad868ce 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -260,6 +260,12 @@ struct mem_cgroup {
 	/* OOM-Killer disable */
 	int oom_kill_disable;
 
+	int oom_rage;
+	spinlock_t oom_rage_lock;
+	unsigned long prev_oom_time;
+	unsigned long oom_time;
+
+
 	/* memory.events */
 	struct cgroup_file events_file;

diff --git a/include/linux/oom.h b/include/linux/oom.h
index b0ee726c1672..9a6d16a1ace5 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -15,6 +15,9 @@ struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
 
+#define OOM_BASE_RAGE	-10
+#define OOM_MAX_RAGE	20
+
 /*
  * Details of the page allocation that triggered the oom killer that are used to
  * determine what should be killed.
@@ -44,6 +47,7 @@ struct oom_control {
 	unsigned long totalpages;
 	struct task_struct *chosen;
 	unsigned long chosen_points;
+	unsigned long overdraft;
 };
 
 extern struct mutex oom_lock;

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab436d94ae5d..e746b41d558c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -53,6 +53,7 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
+int sysctl_oom_relaxation = HZ;
 
 DEFINE_MUTEX(oom_lock);
 
@@ -947,6 +948,101 @@ static int oom_kill_memcg_member(struct task_struct *task, void *message)
 	return 0;
 }
 
+/*
+ * Kill more processes if oom happens too often in this context.
+ */
+static void oom_berserker(struct oom_control *oc)
+{
+	static DEFINE_RATELIMIT_STATE(berserker_rs,
+				      DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+	struct task_struct *p;
+	struct mem_cgroup *memcg;
+	unsigned long now = jiffies;
+	int rage;
+	int killed = 0;
+
+	memcg = oc->memcg ?: root_mem_cgroup;
+
+	spin_lock(&memcg->oom_rage_lock);
+	memcg->prev_oom_time = memcg->oom_time;
+	memcg->oom_time = now;
+	/*
+	 * Increase rage if oom happened recently in this context, reset
+	 * rage otherwise.
+	 *
+	 * previous oom                          this oom (unfinished)
+	 *     +                                     +
+	 *     ^                                     ^
+	 * prev_oom_time <-----------------------> oom_time
+	 */
+	if (time_after(now, memcg->prev_oom_time + sysctl_oom_relaxation))
+		memcg->oom_rage = OOM_BASE_RAGE;
+	else if (memcg->oom_rage < OOM_MAX_RAGE)
+		memcg->oom_rage++;
+	rage = memcg->oom_rage;
+	spin_unlock(&memcg->oom_rage_lock);
+
+	if (rage < 0)
+		return;
+
+	/*
+	 * So, we are in rage. Kill (1 << rage) youngest tasks that are
+	 * as bad as the victim.
+	 */
+	read_lock(&tasklist_lock);
+	list_for_each_entry_reverse(p, &init_task.tasks, tasks) {
+		unsigned long tsk_points;
+		unsigned long tsk_overdraft;
+
+		if (!p->mm || test_tsk_thread_flag(p, TIF_MEMDIE) ||
+		    fatal_signal_pending(p) || p->flags & PF_EXITING ||
+		    oom_unkillable_task(p, oc->memcg, oc->nodemask))
+			continue;
+
+		tsk_points = oom_badness(p, oc->memcg, oc->nodemask,
+					 oc->totalpages, &tsk_overdraft);
+		if (tsk_overdraft < oc->overdraft)
+			continue;
+
+		/*
+		 * oom_badness never returns a negative value, even if
+		 * oom_score_adj would make badness so, instead it
+		 * returns 1. So we do not kill task with badness 1 if
+		 * the victim has badness > 1 so as not to risk killing
+		 * protected tasks.
+		 */
+		if (tsk_points <= 1 && oc->chosen_points > 1)
+			continue;
+
+		/*
+		 * Consider tasks as equally bad if they have equal
+		 * normalized scores.
+		 */
+		if (tsk_points * 1000 / oc->totalpages <
+		    oc->chosen_points * 1000 / oc->totalpages)
+			continue;
+
+
[Devel] [PATCH vz8 3/3] oom: make berserker more aggressive
From: Vladimir Davydov

In the berserker mode we kill a bunch of tasks that are as bad as the
selected victim. We assume two tasks to be equally bad if they consume
the same permille of memory. With such a strict check, it might turn
out that the oom berserker won't kill any tasks when a fork bomb is
running inside a container, while the effect of killing a task eating
<= 1/1000th of memory won't be enough to cope with the memory shortage.

Let's loosen this check and use percentage instead of permille. It
might still happen that the berserker won't kill anyone, but then the
regular oom kill should free at least 1/100th of memory, which should
be enough even for small containers.

Also, check berserker mode even if the victim has already exited by the
time we are about to send SIGKILL to it. Rationale: when the berserker
is in rage, it might kill hundreds of tasks, so that the next oom kill
is likely to select an exiting task. Not triggering the berserker in
this case would result in oom stalls.

Signed-off-by: Vladimir Davydov
[aryabinin: rh8 rebase]
Signed-off-by: Andrey Ryabinin
---
 mm/oom_kill.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e746b41d558c..1cf75939aba6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1016,11 +1016,11 @@ static void oom_berserker(struct oom_control *oc)
 			continue;
 
 		/*
-		 * Consider tasks as equally bad if they have equal
-		 * normalized scores.
+		 * Consider tasks as equally bad if they occupy equal
+		 * percentage of available memory.
 		 */
-		if (tsk_points * 1000 / oc->totalpages <
-		    oc->chosen_points * 1000 / oc->totalpages)
+		if (tsk_points * 100 / oc->totalpages <
+		    oc->chosen_points * 100 / oc->totalpages)
 			continue;
 
 		if (__ratelimit(&berserker_rs)) {
@@ -1061,6 +1061,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		wake_oom_reaper(victim);
 		task_unlock(victim);
 		put_task_struct(victim);
+		oom_berserker(oc);
 		return;
 	}
 	task_unlock(victim);
-- 
2.26.2
___
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
[Devel] [PATCH vz8 1/3] proc, memcg: use memcg limits for showing oom_score inside CT
Use memcg's limits of the task to show /proc/<pid>/oom_score.

Note: in vz7 we had different behavior. It showed 'oom_score' based on
the 've->memcg' limits of the process *reading* oom_score. Now we look
at the memcg of the target process and don't care about the current
one. This seems like the more correct behaviour.

Signed-off-by: Andrey Ryabinin
---
 fs/proc/base.c             |  8 +++++++-
 include/linux/memcontrol.h | 11 +++++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 85fee7396e90..cb417426dd92 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -525,8 +525,14 @@ static const struct file_operations proc_lstats_operations = {
 static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 			  struct pid *pid, struct task_struct *task)
 {
-	unsigned long totalpages = totalram_pages + total_swap_pages;
+	unsigned long totalpages;
 	unsigned long points = 0;
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	totalpages = mem_cgroup_total_pages(memcg);
+	rcu_read_unlock();
 
 	points = oom_badness(task, NULL, NULL, totalpages, NULL) *
 					1000 / totalpages;

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eb8634128a81..c26041c681f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -581,6 +581,17 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 	return mz->lru_zone_size[zone_idx][lru];
 }
 
+static inline unsigned long mem_cgroup_total_pages(struct mem_cgroup *memcg)
+{
+	unsigned long ram, ram_swap;
+	extern long total_swap_pages;
+
+	ram = min_t(unsigned long, totalram_pages, memcg->memory.max);
+	ram_swap = min_t(unsigned long, memcg->memsw.max, ram + total_swap_pages);
+
+	return ram_swap;
+}
+
 void mem_cgroup_handle_over_high(void);
 
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
-- 
2.26.2
[Devel] [PATCH RHEL8 COMMIT] vecalls: Introduce VZCTL_GET_CPU_STAT ioctl
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and
will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
--> commit 264e1b6d6450baab509faa667f4ac72606d84940
Author: Konstantin Khorenko
Date:   Wed Nov 11 16:00:04 2020 +0300

    vecalls: Introduce VZCTL_GET_CPU_STAT ioctl

    This vzctl ioctl is still used by the vzstat utility and by
    dispatcher/libvirt statistics reporting.

    From one point of view, almost all the data can be obtained from the
    cpu cgroup of a Container (missing data can be exported additionally),
    but statistics are gathered often, and the ioctl is faster and
    requires less cpu power => let it be for now.

    The current patch is based on the following vz7 commits:
      ecdce58b214c ("sched: Export per task_group statistics_work")
      a58fb58bff1c ("Use ve init task's css instead of opening cgroup via vfs")
      75fc174adc36 ("sched: Port cpustat related patches")

    Signed-off-by: Konstantin Khorenko
    Reviewed-by: Andrey Ryabinin
---
 include/linux/ve.h  |  2 ++
 kernel/time/time.c  |  1 +
 kernel/ve/ve.c      | 18 ++++++++++++++++++
 kernel/ve/vecalls.c | 66 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 87 insertions(+)

diff --git a/include/linux/ve.h b/include/linux/ve.h
index 656ee43e383e..7cb416f342e7 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -201,10 +201,12 @@ struct seq_file;
 #if defined(CONFIG_VE) && defined(CONFIG_CGROUP_SCHED)
 int ve_show_cpu_stat(struct ve_struct *ve, struct seq_file *p);
 int ve_show_loadavg(struct ve_struct *ve, struct seq_file *p);
+int ve_get_cpu_avenrun(struct ve_struct *ve, unsigned long *avenrun);
 int ve_get_cpu_stat(struct ve_struct *ve, struct kernel_cpustat *kstat);
 #else
 static inline int ve_show_cpu_stat(struct ve_struct *ve, struct seq_file *p) { return -ENOSYS; }
 static inline int ve_show_loadavg(struct ve_struct *ve, struct seq_file *p) { return -ENOSYS; }
+static inline int ve_get_cpu_avenrun(struct ve_struct *ve, unsigned long *avenrun) { return -ENOSYS; }
 static inline int ve_get_cpu_stat(struct ve_struct *ve, struct kernel_cpustat *kstat) { return -ENOSYS; }
 #endif

diff --git a/kernel/time/time.c b/kernel/time/time.c
index 2b41e8e2d31d..ff1db0ba0c39 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -770,6 +770,7 @@ u64 nsec_to_clock_t(u64 x)
 	return div_u64(x * 9, (9ull * NSEC_PER_SEC + (USER_HZ / 2)) / USER_HZ);
 #endif
 }
+EXPORT_SYMBOL(nsec_to_clock_t);
 
 u64 jiffies64_to_nsecs(u64 j)
 {

diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index a9afefc5b9de..29e98e6396dc 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -1430,6 +1430,24 @@ int ve_show_loadavg(struct ve_struct *ve, struct seq_file *p)
 	return err;
 }
 
+inline struct task_group *css_tg(struct cgroup_subsys_state *css);
+int get_avenrun_tg(struct task_group *tg, unsigned long *loads,
+		   unsigned long offset, int shift);
+
+int ve_get_cpu_avenrun(struct ve_struct *ve, unsigned long *avnrun)
+{
+	struct cgroup_subsys_state *css;
+	struct task_group *tg;
+	int err;
+
+	css = ve_get_init_css(ve, cpu_cgrp_id);
+	tg = css_tg(css);
+	err = get_avenrun_tg(tg, avnrun, 0, 0);
+	css_put(css);
+	return err;
+}
+EXPORT_SYMBOL(ve_get_cpu_avenrun);
+
 int cpu_cgroup_get_stat(struct cgroup_subsys_state *cpu_css,
 			struct cgroup_subsys_state *cpuacct_css,
 			struct kernel_cpustat *kstat);

diff --git a/kernel/ve/vecalls.c b/kernel/ve/vecalls.c
index 3258b49b15b2..786a743faa1a 100644
--- a/kernel/ve/vecalls.c
+++ b/kernel/ve/vecalls.c
@@ -22,6 +22,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
@@ -35,6 +37,62 @@
 static u64 ve_get_uptime(struct ve_struct *ve)
 {
 	return ktime_get_boot_ns() - ve->real_start_time;
 }
 
+static int fill_cpu_stat(envid_t veid, struct vz_cpu_stat __user *buf)
+{
+	struct ve_struct *ve;
+	struct vz_cpu_stat *vstat;
+	int retval;
+	int i;
+	unsigned long tmp;
+	unsigned long avnrun[3];
+	struct kernel_cpustat kstat;
+
+	if (!ve_is_super(get_exec_env()) && (veid != get_exec_env()->veid))
+		return -EPERM;
+	ve = get_ve_by_id(veid);
+	if (!ve)
+		return -ESRCH;
+
+	retval = -ENOMEM;
+	vstat = kzalloc(sizeof(*vstat), GFP_KERNEL);
+	if (!vstat)
+		goto out_put_ve;
+
+	retval = ve_get_cpu_stat(ve, &kstat);
+	if (retval)
+		goto out_free;
+
+	retval = ve_get_cpu_avenrun(ve, avnrun);
+	if (retval)
+		goto out_free;
+
+	vstat->user_jif = (unsigned long)nsec_to_clock_t(
+				kstat.cpustat[CPUTIME_USER]);
+	vstat->nice_jif = (unsigned long)nsec_to_clock_t(
+				kstat.cpustat[CPUTIME_NICE]);
+	vstat->system_jif = (unsigned
[Devel] [PATCH RHEL8 COMMIT] ve/sched/loadavg: Provide task_group parameter to get_avenrun_ve()
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and
will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
--> commit 2e7bc3486fb7bd25dbc3a6a4530ece030aa8456c
Author: Konstantin Khorenko
Date:   Wed Nov 11 16:00:03 2020 +0300

    ve/sched/loadavg: Provide task_group parameter to get_avenrun_ve()

    Rename get_avenrun_ve() to get_avenrun_tg() and provide it the
    task_group argument to use it later for any VE, not only for the
    current one.

    Fixes: f52cf2752bca ("ve/sched/loadavg: Calculate avenrun for Containers root cpu cgroups")

    Signed-off-by: Konstantin Khorenko
    Reviewed-by: Andrey Ryabinin
---
 include/linux/sched/loadavg.h |  2 --
 kernel/sched/loadavg.c        | 12 ++++++++--
 kernel/sys.c                  |  6 +++++-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/loadavg.h b/include/linux/sched/loadavg.h
index 1da5768389b7..25fb3344cdbf 100644
--- a/include/linux/sched/loadavg.h
+++ b/include/linux/sched/loadavg.h
@@ -16,8 +16,6 @@
  */
 extern unsigned long avenrun[];		/* Load averages */
 extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
-extern void get_avenrun_ve(unsigned long *loads,
-			   unsigned long offset, int shift);
 
 #define FSHIFT		11		/* nr of bits of precision */
 #define FIXED_1		(1<<FSHIFT)

[...]

diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
[...]
 	loads[1] = (tg->avenrun[1] + offset) << shift;
 	loads[2] = (tg->avenrun[2] + offset) << shift;
+
+	return 0;
 }
 
 long calc_load_fold_active(struct rq *this_rq, long adjust)

diff --git a/kernel/sys.c b/kernel/sys.c
index e7e07ea8d7ef..8560e5bcb6c2 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2543,6 +2543,8 @@ SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep,
 }
 
 extern void si_meminfo_ve(struct sysinfo *si, struct ve_struct *ve);
+extern int get_avenrun_tg(struct task_group *tg, unsigned long *loads,
+			  unsigned long offset, int shift);
 
 /**
  * do_sysinfo - fill in sysinfo struct
@@ -2575,7 +2577,9 @@ static int do_sysinfo(struct sysinfo *info)
 
 		info->procs = nr_threads_ve(ve);
 
-		get_avenrun_ve(info->loads, 0, SI_LOAD_SHIFT - FSHIFT);
+		/* does not fail on non-VE0 task group */
+		(void)get_avenrun_tg(NULL, info->loads,
+				     0, SI_LOAD_SHIFT - FSHIFT);
 	}
 
 /*
[Devel] [PATCH RHEL8 COMMIT] vecalls: Add cpu stat measurement units comments to header
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and
will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
--> commit 3ed7a174b687c3dd2fab0ee255731b8efa3b44d9
Author: Konstantin Khorenko
Date:   Wed Nov 11 16:00:03 2020 +0300

    vecalls: Add cpu stat measurement units comments to header

    It's not obvious why, say, the "user_jif" field does not contain time
    in jiffies, so add clarification comments.

    Fixes: 248ed6b2a193 ("ve: Add vecalls")

    Signed-off-by: Konstantin Khorenko
    Reviewed-by: Andrey Ryabinin
---
 include/uapi/linux/vzcalluser.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/vzcalluser.h b/include/uapi/linux/vzcalluser.h
index f2584b4b284f..8a4ff0015e40 100644
--- a/include/uapi/linux/vzcalluser.h
+++ b/include/uapi/linux/vzcalluser.h
@@ -55,13 +55,13 @@ struct vz_load_avg {
 };
 
 struct vz_cpu_stat {
-	unsigned long		user_jif;
-	unsigned long		nice_jif;
-	unsigned long		system_jif;
-	unsigned long		uptime_jif;
-	__u64			idle_clk;
-	__u64			strv_clk;
-	__u64			uptime_clk;
+	unsigned long		user_jif;	/* clock_t */
+	unsigned long		nice_jif;	/* clock_t */
+	unsigned long		system_jif;	/* clock_t */
+	unsigned long		uptime_jif;	/* clock_t */
+	__u64			idle_clk;	/* ns */
+	__u64			strv_clk;	/* deprecated */
+	__u64			uptime_clk;	/* ns */
 	struct vz_load_avg	avenrun[3];	/* loadavg data */
 };
[Devel] [PATCH RHEL8 COMMIT] vdso, vclock_gettime: fix linking with old linkers
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and
will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
--> commit 998f47870c4a498d1a53c191816c3c03dda480f9
Author: Andrey Ryabinin
Date:   Wed Nov 11 15:54:08 2020 +0300

    vdso, vclock_gettime: fix linking with old linkers

    On some old linkers the vdso fails to build because of a dynamic
    relocation of the 've_start_time' symbol:

        VDSO2C  arch/x86/entry/vdso/vdso-image-64.c
        Error: vdso image contains dynamic relocations

    I was not able to figure out why new linkers don't generate the
    relocation while old ones do, but I did find out that the
    visibility("hidden") attribute on 've_start_time' cures the problem.

    Fixes: af2c78f571e6 ("ve: add per-ve CLOCK_MONOTONIC time via __vdso_gettimeofday()")
    https://jira.sw.ru/browse/PSBM-121668

    Signed-off-by: Andrey Ryabinin
---
 arch/x86/entry/vdso/vclock_gettime.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 224dbe80da66..b2f1f19319d8 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -24,7 +24,7 @@
 
 #define gtod (&vsyscall_gtod_data)
 
-u64 ve_start_time;
+u64 ve_start_time __attribute__((visibility("hidden")));
 
 extern int __vdso_clock_gettime(clockid_t clock, struct timespec *ts);
 extern int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz);
[Devel] [PATCH RHEL8 COMMIT] sched/stat: account forks per task group
The commit is pushed to "branch-rh8-4.18.0-193.6.3.vz8.4.x-ovz" and
will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-193.6.3.vz8.4.17
--> commit 372ecbe6241acf33e718ad826670d5a03ed6efaa
Author: Vladimir Davydov
Date:   Thu Mar 14 21:00:53 2013 +0400

    sched/stat: account forks per task group

    This is a backport of diff-sched-account-forks-per-task-group:

      Subject: sched: account forks per task group
      Date: Fri, 28 Dec 2012 15:09:46 +0400

      * [sched] the number of processes should be reported correctly
        inside a CT in /proc/stat (PSBM-18113)

      For /proc/stat:processes to be correct inside containers.

    https://jira.sw.ru/browse/PSBM-18113

    Signed-off-by: Vladimir Davydov

    (cherry picked from vz7 commit 0a927bf02fd873f4e9bad7c4df0c201bf9b48274)
    Signed-off-by: Konstantin Khorenko
---
 kernel/sched/cpuacct.c | 4 +++-
 kernel/sched/fair.c    | 1 +
 kernel/sched/sched.h   | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 646bbd257110..df5fe01c8f24 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -652,6 +652,7 @@ int cpu_cgroup_proc_stat(struct cgroup_subsys_state *cpu_css,
 	unsigned long tg_nr_running = 0;
 	unsigned long tg_nr_iowait = 0;
 	unsigned long long tg_nr_switches = 0;
+	unsigned long tg_nr_forks = 0;
 
 	getboottime64(&boottime);
 
@@ -671,6 +672,7 @@ int cpu_cgroup_proc_stat(struct cgroup_subsys_state *cpu_css,
 			tg_nr_running += tg->cfs_rq[i]->h_nr_running;
 			tg_nr_iowait += tg->cfs_rq[i]->nr_iowait;
 			tg_nr_switches += tg->cfs_rq[i]->nr_switches;
+			tg_nr_forks += tg->cfs_rq[i]->nr_forks;
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 			tg_nr_running += tg->rt_rq[i]->rt_nr_running;
@@ -746,7 +748,7 @@ int cpu_cgroup_proc_stat(struct cgroup_subsys_state *cpu_css,
 		   "procs_blocked %lu\n",
 		   tg_nr_switches,
 		   (unsigned long long)boot_sec,
-		   total_forks,
+		   tg_nr_forks,
 		   tg_nr_running,
 		   tg_nr_iowait);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b9bb108625a..892329471df1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10300,6 +10300,7 @@ static void task_fork_fair(struct task_struct *p)
 	}
 
 	se->vruntime -= cfs_rq->min_vruntime;
+	cfs_rq->nr_forks++;
 
 	rq_unlock(rq, &rf);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d55b45f1ea6..ccd8ad478a08 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -545,6 +545,7 @@ struct cfs_rq {
 	struct sched_entity *prev;
 
 	u64 nr_switches;
+	unsigned long nr_forks;
 
 #ifdef CONFIG_SCHED_DEBUG
 	unsigned int	nr_spread_over;
[Devel] [PATCH rh7 v2] cgroup: rework reference acquisition for cgroup_find_inode
Use the more generic igrab() instead of a bare atomic increment. Move
cgroup_hash_del() to the eviction stage to avoid a deadlock.

Signed-off-by: Andrey Zhadchenko
---
v2: adjusted function call order in cgroup_evict_inode to match
existing code

 kernel/cgroup.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 27d7a5e..8c2cef8 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1522,21 +1522,10 @@ static struct inode *cgroup_find_inode(unsigned long fh[2], char take_ref)
 	struct inode *ret = NULL;
 
 	spin_lock(&cgroup_inode_table_lock);
-	item = cgroup_find_item_no_lock(fh);
-
-	/*
-	 * If we need to increase refcount, we should be aware of possible
-	 * deadlock. Another thread may have started deleting this inode:
-	 * iput->iput_final->cgroup_delete_inode->cgroup_hash_del
-	 * If we just call igrab, it will try to take i_lock and this will
-	 * result in deadlock, because deleting thread has already taken it
-	 * and waits on cgroup_inode_table_lock to find inode in hashtable.
-	 *
-	 * If i_count is zero, someone is deleting it -> skip.
-	 */
-	if (take_ref && item)
-		if (!atomic_inc_not_zero(&item->inode->i_count))
-			item = NULL;
+	item = cgroup_find_item_no_lock(fh);
+	if (item && take_ref && !igrab(item->inode))
+		item = NULL;
 
 	spin_unlock(&cgroup_inode_table_lock);
 
@@ -1634,15 +1623,17 @@ static const struct export_operations cgroup_export_ops = {
 	.fh_to_dentry = cgroup_fh_to_dentry,
 };
 
-static int cgroup_delete_inode(struct inode *inode)
+static void cgroup_evict_inode(struct inode *inode)
 {
+	truncate_inode_pages_final(&inode->i_data);
+	clear_inode(inode);
 	cgroup_hash_del(inode);
-	return generic_delete_inode(inode);
 }
 
 static const struct super_operations cgroup_ops = {
 	.statfs = simple_statfs,
-	.drop_inode = cgroup_delete_inode,
+	.drop_inode = generic_delete_inode,
+	.evict_inode = cgroup_evict_inode,
 	.show_options = cgroup_show_options,
 #ifdef CONFIG_VE
 	.show_path = cgroup_show_path,
-- 
1.8.3.1
[Devel] [PATCH rh8 1/2] netlink: protect NETLINK_REPAIR
Prevent using netlink repair mode from inside containers.

Signed-off-by: Andrey Zhadchenko
---
 net/netlink/af_netlink.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 46c2dbd..2b9e9c7 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1670,6 +1670,13 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case NETLINK_REPAIR:
+#ifdef CONFIG_VE
+	{
+		struct ve_struct *ve = get_exec_env();
+		if (!ve_is_super(ve) && !ve->is_pseudosuper)
+			return -ENOPROTOOPT;
+	}
+#endif
 		if (val)
 			nlk->flags |= NETLINK_F_REPAIR;
 		else
-- 
1.8.3.1
[Devel] [PATCH rh8 2/2] netlink: add an option to set sk->err from userspace
Sometimes during dump criu can encounter sockets with an overflowed
kernel buffer, which results in an ENOBUFS error on the next read. We
need a reliable way to restore sk->sk_err.

https://jira.sw.ru/browse/PSBM-120976

Signed-off-by: Andrey Zhadchenko
---
 include/uapi/linux/netlink.h |  1 +
 net/netlink/af_netlink.c     | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
index 67ea114..4360186 100644
--- a/include/uapi/linux/netlink.h
+++ b/include/uapi/linux/netlink.h
@@ -157,6 +157,7 @@ enum nlmsgerr_attrs {
 #define NETLINK_EXT_ACK			11
 #define NETLINK_GET_STRICT_CHK		12
 #define NETLINK_REPAIR			127
+#define NETLINK_SETERR			128
 
 struct nl_pktinfo {
 	__u32	group;

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2b9e9c7..c372555 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1683,6 +1683,16 @@ static int netlink_setsockopt(struct socket *sock, int level, int optname,
 			nlk->flags &= ~NETLINK_F_REPAIR;
 		err = 0;
 		break;
+	case NETLINK_SETERR:
+		err = -ENOPROTOOPT;
+		if (nlk->flags & NETLINK_F_REPAIR) {
+			if (!val || val > MAX_ERRNO)
+				return -EINVAL;
+			sk->sk_err = val;
+			sk->sk_error_report(sk);
+			err = 0;
+		}
+		break;
 	case NETLINK_PKTINFO:
 		if (val)
 			nlk->flags |= NETLINK_F_RECV_PKTINFO;
-- 
1.8.3.1
[Devel] [PATCH RHEL7 COMMIT] ploop: Fix crash in purge_lru_warn()
The commit is pushed to "branch-rh7-3.10.0-1127.18.2.vz7.163.x-ovz"
and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-1127.18.2.vz7.163.43
--> commit a2ed38f07f597d7a92dcff9e1489eebc2938d325
Author: Kirill Tkhai
Date:   Wed Nov 11 09:46:22 2020 +0300

    ploop: Fix crash in purge_lru_warn()

    do_div() works wrongly when its second argument is long. We don't
    need the remainder, so we don't need do_div() at all.

    https://jira.sw.ru/browse/PSBM-122035

    Reported-by: Evgenii Shatokhin
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Evgenii Shatokhin
    Reviewed-by: Andrey Ryabinin
---
 drivers/block/ploop/io_direct_map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/ploop/io_direct_map.c b/drivers/block/ploop/io_direct_map.c
index 5528e86..8f09ab0 100644
--- a/drivers/block/ploop/io_direct_map.c
+++ b/drivers/block/ploop/io_direct_map.c
@@ -377,7 +377,7 @@ static inline void purge_lru_warn(struct extent_map_tree *tree)
 	loff_t ratio = i_size_read(tree->mapping->host) * 100;
 	long images_size = atomic_long_read(&ploop_io_images_size) ? : 1;
 
-	do_div(ratio, images_size);
+	ratio /= images_size;
 
 	printk(KERN_WARNING "Purging lru entry from extent tree for inode %ld "
 	       "(map_size=%d ratio=%lld%%)\n",