On Mon, 10 Aug 2020 17:41:32 +0200
Greg KH <gre...@linuxfoundation.org> wrote:

> On Tue, Aug 11, 2020 at 01:27:00AM +1000, Eugene Lubarsky wrote:
> > On Mon, 10 Aug 2020 17:04:53 +0200
> > Greg KH <gre...@linuxfoundation.org> wrote:  
> And have you benchmarked any of this?  Try working with the common
> tools that want this information and see if it actually is noticeable
> (hint, I have been doing that with the readfile work and it's
> surprising what the results are in places...)

Apologies for the delay. Here are some benchmarks with atop.

Patch to atop at: https://github.com/eug48/atop/commits/proc-all
Patch to add /proc/all/schedstat & cpuset below.
atop is not collecting threads & cmdline, as /proc/all/ doesn't support them.
10,000 processes, kernel 5.8, nested KVM, 2 cores of i7-6700HQ @ 2.60GHz

# USE_PROC_ALL=0 ./atop -w test 1 &
# pidstat -p $(pidof atop) 1

01:33:05   %usr %system  %guest   %wait    %CPU   CPU  Command
01:33:06  33.66   33.66    0.00    0.99   67.33     1  atop
01:33:07  33.00   32.00    0.00    2.00   65.00     0  atop
01:33:08  34.00   31.00    0.00    1.00   65.00     0  atop
...
Average:  33.15   32.79    0.00    1.09   65.94     -  atop


# USE_PROC_ALL=1 ./atop -w test 1 &
# pidstat -p $(pidof atop) 1

01:33:33   %usr %system  %guest   %wait    %CPU   CPU  Command
01:33:34  28.00   14.00    0.00    1.00   42.00     1  atop
01:33:35  28.00   14.00    0.00    0.00   42.00     1  atop
01:33:36  26.00   13.00    0.00    0.00   39.00     1  atop
...
Average:  27.08   12.86    0.00    0.35   39.94     -  atop

So CPU usage goes down from ~66% to ~40%, mostly because %system drops
from ~33% to ~13%.

Data collection times in milliseconds are:

# xsv cat columns proc.csv procall.csv \
> | xsv stats \
> | xsv select field,min,max,mean,stddev \
> | xsv table
field           min  max  mean     stddev
/proc time      558  625  586.59   18.29
/proc/all time  231  262  243.56   8.02

Much performance optimisation can still be done, e.g. the modified atop
uses fgets, which reads 1KB at a time, and seq_file seems to return at
most one 4KB page per read. task_diag should be much faster still.
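
To make the buffering point concrete, here is a minimal sketch (not the
code in the atop patch) that drains a file with read(2) and counts how
many calls it takes for a given buffer size. The name count_reads is
made up for illustration. Note that if seq_file really caps each read
at one page, a larger userspace buffer alone would not get below one
syscall per 4KB on /proc/all/ files.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Drain 'path' with read(2) using a 'bufsize'-byte buffer.
 * Returns the number of read() calls that returned data,
 * or -1 on error. (Hypothetical helper, for illustration only.)
 */
long count_reads(const char *path, size_t bufsize)
{
	char *buf = malloc(bufsize);
	long calls = 0;
	ssize_t n;
	int fd;

	if (!buf)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0) {
		free(buf);
		return -1;
	}

	/* Each iteration is one syscall; short reads are counted too. */
	while ((n = read(fd, buf, bufsize)) > 0)
		calls++;

	close(fd);
	free(buf);
	return n < 0 ? -1 : calls;
}
```

Comparing count_reads("/proc/all/stat", 1024) against a 4096-byte or
larger buffer on a patched kernel would show how much of the remaining
cost is just read granularity.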

I'd imagine this sort of thing would be useful for daemons monitoring
large numbers of processes. I don't run such systems myself; my initial
motivation was frustration with the Kubernetes kubelet having ~2-4% CPU
usage even with a couple of containers. Basic profiling suggests syscalls
have a lot to do with it - it's actually reading loads of tiny cgroup files
and enumerating many directories every 10 seconds, but /proc has similar
issues and seemed easier to start with.

Anyway, I've read that io_uring could also help here in the near future,
which would be really cool, especially if there was a way to enumerate
directories and read many files regex-style in a single operation,
e.g. /proc/[0-9]+/(stat|statm|io)
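
For comparison, this is roughly what the conventional per-PID walk
looks like today; a simplified sketch, not atop's actual code, with
made-up names is_pid_name and walk_stat. Every file costs an open, a
read and a close, so with 10,000 processes and three files each (stat,
statm, io) that is on the order of 90,000 syscalls per sample, before
counting the directory enumeration itself.

```c
#include <ctype.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Return 1 if 'name' consists only of decimal digits (a PID directory). */
int is_pid_name(const char *name)
{
	if (!*name)
		return 0;
	for (; *name; name++)
		if (!isdigit((unsigned char)*name))
			return 0;
	return 1;
}

/*
 * Read <root>/<pid>/stat for every PID directory under 'root'.
 * Returns the number of stat files read, or -1 if 'root' cannot
 * be opened. (Hypothetical helper, for illustration only.)
 */
long walk_stat(const char *root)
{
	char path[512], buf[4096];
	struct dirent *de;
	long nread = 0;
	DIR *dir = opendir(root);

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		int fd;

		if (!is_pid_name(de->d_name))
			continue;

		snprintf(path, sizeof(path), "%s/%s/stat", root, de->d_name);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			continue;	/* the process may already have exited */

		if (read(fd, buf, sizeof(buf)) > 0)
			nread++;
		close(fd);
	}

	closedir(dir);
	return nread;
}
```

walk_stat("/proc") does with three syscalls per process what a single
sequential read of /proc/all/stat does in the patched kernel.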

> > Currently I'm trying to re-use the existing code in fs/proc that
> > controls which PIDs are visible, but may well be missing
> > something..  
> 
> Try it out and see if it works correctly.  And pid namespaces are not
> the only thing these days from what I recall :)
> 
I've tried `unshare --fork --pid --mount-proc cat /proc/all/stat`
which seems to behave correctly. ptrace flags are handled by the
existing code.


Best Wishes,
Eugene


From 2ffc2e388f7ce4e3f182c2442823e5f13bae03dd Mon Sep 17 00:00:00 2001
From: Eugene Lubarsky <elubarsky.li...@gmail.com>
Date: Tue, 25 Aug 2020 12:36:41 +1000
Subject: [RFC PATCH] fs/proc: /proc/all: add schedstat and cpuset

Signed-off-by: Eugene Lubarsky <elubarsky.li...@gmail.com>
---
 fs/proc/base.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0bba4b3a985e..44d73f1ade4a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3944,6 +3944,36 @@ static int proc_all_io(struct seq_file *m, void *v)
 }
 #endif
 
+#ifdef CONFIG_PROC_PID_CPUSET
+static int proc_all_cpuset(struct seq_file *m, void *v)
+{
+       struct all_iter *iter = (struct all_iter *) v;
+       struct pid_namespace *ns = iter->ns;
+       struct task_struct *task = iter->tgid_iter.task;
+       struct pid *pid = task->thread_pid;
+
+       seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
+       seq_puts(m, " ");
+
+       return proc_cpuset_show(m, ns, pid, task);
+}
+#endif
+
+#ifdef CONFIG_SCHED_INFO
+static int proc_all_schedstat(struct seq_file *m, void *v)
+{
+       struct all_iter *iter = (struct all_iter *) v;
+       struct pid_namespace *ns = iter->ns;
+       struct task_struct *task = iter->tgid_iter.task;
+       struct pid *pid = task->thread_pid;
+
+       seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns));
+       seq_puts(m, " ");
+
+       return proc_pid_schedstat(m, ns, pid, task);
+}
+#endif
+
 static int proc_all_statx(struct seq_file *m, void *v)
 {
        struct all_iter *iter = (struct all_iter *) v;
@@ -3990,6 +4020,12 @@ PROC_ALL_OPS(status);
 #ifdef CONFIG_TASK_IO_ACCOUNTING
        PROC_ALL_OPS(io);
 #endif
+#ifdef CONFIG_SCHED_INFO
+       PROC_ALL_OPS(schedstat);
+#endif
+#ifdef CONFIG_PROC_PID_CPUSET
+       PROC_ALL_OPS(cpuset);
+#endif
 
 #define PROC_ALL_CREATE(NAME) \
        do { \
@@ -4011,4 +4047,10 @@ void __init proc_all_init(void)
 #ifdef CONFIG_TASK_IO_ACCOUNTING
        PROC_ALL_CREATE(io);
 #endif
+#ifdef CONFIG_SCHED_INFO
+       PROC_ALL_CREATE(schedstat);
+#endif
+#ifdef CONFIG_PROC_PID_CPUSET
+       PROC_ALL_CREATE(cpuset);
+#endif
 }
-- 
2.25.1
