[lxc-devel] [lxcfs/master] fixed a problem with server stability degradation caused by cgroup dy…

yinhongbo on Github Tue, 08 Oct 2019 20:26:03 -0700

The following pull request was submitted through Github.
It can be accessed and reviewed at: https://github.com/lxc/lxcfs/pull/308

This e-mail was sent by the LXC bot, direct replies will not reach the author
unless they happen to be subscribed to this list.

=== Description (from pull-request) ===
While troubleshooting network jitter on a server, we traced the cause to the kernel traversing memory cgroups, with most of the time spent in the memcg_stat_show function. By capturing kernel stacks, our kernel engineer found that lxcfs was triggering the problem. After analysing the lxcfs code we located the cause and confirmed that the problem disappears after reloading lxcfs.

Kernel stack:
```
COMMAND: lxcfs PID: 3223863 LATENCY: 55ms
trace_irqoff_record+0x12b/0x1b0 [trace_irqoff]
trace_irqoff_hrtimer_handler+0x97/0x99 [trace_irqoff]
__hrtimer_run_queues+0xdc/0x220
hrtimer_interrupt+0xa6/0x1f0
smp_apic_timer_interrupt+0x62/0x120
apic_timer_interrupt+0x7d/0x90
memcg_stat_show+0x27a/0x460
seq_read+0x11f/0x3f0
__vfs_read+0x33/0x160
vfs_read+0x91/0x130
SyS_read+0x52/0xc0
do_syscall_64+0x68/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
```

This PR fixes a server stability degradation caused by the traversal of dying cgroups, which is a serious problem.
When initpid == 1, reading meminfo traverses every memory cgroup.
With some probability this degrades server stability, for example as network jitter.
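
The branch this PR adds can be restated in isolation as a small decision function. This is only a sketch: `choose_meminfo_source`, the `meminfo_source` enum, and the `has_memory_cgroup` flag are hypothetical names used for illustration and do not exist in lxcfs, where the check lives inline in `proc_meminfo_read`.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical names, for illustration only; not part of lxcfs. */
enum meminfo_source {
	MEMINFO_FROM_HOST,	/* pass /proc/meminfo through unchanged */
	MEMINFO_FROM_CGROUP,	/* build the view from the memory cgroup */
};

/*
 * Decide where meminfo should come from for a reader whose init
 * process has already been resolved to `initpid`.
 */
static enum meminfo_source choose_meminfo_source(pid_t initpid,
						 bool has_memory_cgroup)
{
	/* Reader lives in the host pid namespace: skip the cgroup path. */
	if (initpid == 1)
		return MEMINFO_FROM_HOST;

	/* No memory cgroup found: same fallback as the existing code. */
	if (!has_memory_cgroup)
		return MEMINFO_FROM_HOST;

	return MEMINFO_FROM_CGROUP;
}
```

Returning early for initpid == 1 means the root memory cgroup's statistics are never read on behalf of host processes, which is the read that shows up in the stack trace above.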

An example of dying cgroups follows: userspace sees 1810 memory cgroups, but the kernel tracks 110588.

```
root@XX-YY-ZZ-AA:~# lscgroup | grep memory | wc -l
1810
root@XX-YY-ZZ-AA:~# grep memory /proc/cgroups
memory 7 110588 1
```
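
The same comparison can be made programmatically. The following is a minimal sketch, assuming the usual `/proc/cgroups` line layout (`subsys_name hierarchy num_cgroups enabled`); the helper names `kernel_cgroup_count` and `dying_cgroup_estimate` are hypothetical and not part of lxcfs, and the difference is only a rough estimate of the number of dying cgroups.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the num_cgroups column for `subsys` from a stream formatted
 * like /proc/cgroups, or -1 if the subsystem is not listed.
 */
static long kernel_cgroup_count(FILE *stream, const char *subsys)
{
	char line[256];
	char name[64];
	int hierarchy, enabled;
	long num_cgroups;

	assert(stream != NULL && subsys != NULL);

	while (fgets(line, sizeof(line), stream)) {
		if (line[0] == '#')	/* skip the header line */
			continue;
		if (sscanf(line, "%63s %d %ld %d", name, &hierarchy,
			   &num_cgroups, &enabled) != 4)
			continue;	/* ignore malformed lines */
		if (strcmp(name, subsys) == 0)
			return num_cgroups;
	}
	return -1;
}

/*
 * Cgroups the kernel still tracks but userspace can no longer see.
 * `visible_count` would come from something like the lscgroup output.
 */
static long dying_cgroup_estimate(long kernel_count, long visible_count)
{
	return kernel_count > visible_count ? kernel_count - visible_count : 0;
}
```

With the figures above, `dying_cgroup_estimate(110588, 1810)` gives 108778, so roughly 98% of the memory cgroups the kernel still tracks on this host are invisible to userspace.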

For more background on dying cgroups, see https://lwn.net/Articles/787614/?spm=a2c4e.10696291.0.0.105919a4uX5P3F

Signed-off-by: Hongbo Yin <yinhon...@bytedance.com>

From 5e0117a1c8eee3b0a844d7065de44daaaa81c4c0 Mon Sep 17 00:00:00 2001
From: Hongbo Yin <yinhon...@bytedance.com>
Date: Wed, 9 Oct 2019 11:11:12 +0800
Subject: [PATCH] fixed a problem with server stability degradation caused by
 cgroup dying traversal.

Signed-off-by: Hongbo Yin <yinhon...@bytedance.com>
---
 bindings.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/bindings.c b/bindings.c
index 1811955..9a5a3df 100644
--- a/bindings.c
+++ b/bindings.c
@@ -3452,6 +3452,18 @@ static int proc_meminfo_read(char *buf, size_t size, off_t offset,
        pid_t initpid = lookup_initpid_in_store(fc->pid);
        if (initpid <= 0)
                initpid = fc->pid;
+
+       /*
+        * fixed a problem with server stability degradation caused by cgroup dying traversal. This is a very bad problem.
+        * when initpid == 1, getting meminfo will traverses all the memory cgroup.
+        * a certain probability causes server stability to drop, such as network jitter.
+        *
+        * more questions about cgroup dying.
+        * Can refer to https://lwn.net/Articles/787614/?spm=a2c4e.10696291.0.0.105919a4uX5P3F
+        */
+       if (initpid == 1)
+               return read_file("/proc/meminfo", buf, size, d);
+
        cg = get_pid_cgroup(initpid, "memory");
        if (!cg)
                return read_file("/proc/meminfo", buf, size, d);

_______________________________________________
lxc-devel mailing list
lxc-devel@lists.linuxcontainers.org
http://lists.linuxcontainers.org/listinfo/lxc-devel
