Re: [RFC PATCH] samples:bpf: introduce task detector

2020-05-28 Thread
On 2020/5/29 2:34 AM, Andrii Nakryiko wrote: [snip] >>> With CO-RE, it also will allow to compile this tool once and run it on >>> many different kernels without recompilation. Please do take a look >>> and submit a PR there, it will be a good addition to the toolkit (and >>> will force you

Re: [RFC PATCH] samples:bpf: introduce task detector

2020-05-28 Thread
Hi, Andrii Thanks for your comments :-) On 2020/5/28 2:36 PM, Andrii Nakryiko wrote: [snip] >> --- > I haven't looked through the implementation thoroughly yet, but I have a few > general remarks. > This looks like a useful and generic tool. I think it will get most > attention and be most useful

[RFC PATCH] samples:bpf: introduce task detector

2020-05-27 Thread
This is a tool to trace the schedule events related to a specified task, e.g. migration, sched in/out, wakeup and sleep/block. The events are translated into sentences to be more readable; by executing the command 'task_detector -p 49870' we continually trace the schedule events related to 'top'

Re: [PATCH v2 0/4] per-cgroup numa suite

2019-08-05 Thread
Hi, Folks Please feel free to comment if you have any concerns :-) Hi, Peter What do you think about this version? Please let us know if it's still not good enough to be accepted :-) Regards, Michael Wang On 2019/7/16 11:38 AM, 王贇 wrote: > During our torture testing on numa stuff, we found problems

Re: [PATCH v2 0/4] per-cgroup numa suite

2019-07-24 Thread
:-) Regards, Michael Wang On 2019/7/16 11:38 AM, 王贇 wrote: > During our torture testing on numa stuff, we found problems like: > * missing per-cgroup information about the per-node execution status > * missing per-cgroup information about numa locality > That is, when we

Re: [PATCH 4/4] numa: introduce numa cling feature

2019-07-21 Thread
On 2019/7/12 4:58 PM, 王贇 wrote: [snip] > I see, we should not override the decision of select_idle_sibling(). > Actually the original design we try to achieve is: > let wake affine select the target > try to find an idle sibling of the target > if we got one >

[PATCH v5 4/4] numa: introduce numa cling feature

2019-07-21 Thread
Although we paid so much effort to settle a task down on a particular node, there are still chances for a task to leave its preferred node, that is by wakeup, numa swap migrations or load balance. When we are using the cpu cgroup in a shared way, since all the workloads see all the cpus, it could be

Re: [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat

2019-07-21 Thread
On 2019/7/20 12:39 AM, Michal Koutný wrote: > On Tue, Jul 16, 2019 at 11:40:35AM +0800, 王贇 wrote: >> By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see a new >> output line heading with 'exectime', like: >> exectime 311900 407166 > What

[PATCH v4 4/4] numa: introduce numa cling feature

2019-07-15 Thread
Although we paid so much effort to settle a task down on a particular node, there are still chances for a task to leave its preferred node, that is by wakeup, numa swap migrations or load balance. When we are using the cpu cgroup in a shared way, since all the workloads see all the cpus, it could be

[PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat

2019-07-15 Thread
This patch introduces numa execution time information, to indicate numa efficiency. By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see a new output line heading with 'exectime', like: exectime 311900 407166 which means the tasks of this cgroup executed 311900 microseconds on
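For illustration, a minimal userspace sketch (not from the patch) that consumes this line; the file path and the one-value-per-node, microseconds layout are assumptions taken from the example above, and CGROUP_PATH is a placeholder:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
        /* CGROUP_PATH is a placeholder, as in the example above. */
        const char *path = argc > 1 ? argv[1] :
                "/sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat";
        char line[4096];
        FILE *fp = fopen(path, "r");

        if (!fp) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), fp)) {
                if (strncmp(line, "exectime ", 9))
                        continue;

                /* Assumed layout: one value per NUMA node, in microseconds. */
                unsigned long long val[64], total = 0;
                int nodes = 0;
                char *tok = strtok(line + 9, " \n");

                while (tok && nodes < 64) {
                        val[nodes] = strtoull(tok, NULL, 10);
                        total += val[nodes++];
                        tok = strtok(NULL, " \n");
                }
                for (int i = 0; i < nodes; i++)
                        printf("node %d: %llu us (%.1f%%)\n", i, val[i],
                               total ? 100.0 * val[i] / total : 0.0);
        }
        fclose(fp);
        return 0;
}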

[PATCH v2 3/4] numa: introduce numa group per task group

2019-07-15 Thread
By tracing numa page faults, we recognize tasks sharing the same page, and try to pack them together into a single numa group. However when two tasks share lots of cache pages but not many anonymous pages, since numa balancing does not trace cache pages, they have no chance to join the same

[PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic

2019-07-15 Thread
This patch introduces a numa locality statistic, which tries to indicate the numa balancing efficiency per memory cgroup. On numa balancing, we trace the local page accessing ratio of tasks, which we call the locality. By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see an output line

[PATCH v2 0/4] per-cgroup numa suite

2019-07-15 Thread
During our torture testing on numa stuff, we found problems like: * missing per-cgroup information about the per-node execution status * missing per-cgroup information about numa locality That is, when we have a cpu cgroup running with a bunch of tasks, there is no good way to tell how its tasks are

Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-15 Thread
Hi Michal, Thanks for the comments :-) On 2019/7/15 8:10 PM, Michal Koutný wrote: > Hello Yun. > On Fri, Jul 12, 2019 at 06:10:24PM +0800, 王贇 wrote: >> Forgive me but I have no idea on how to combine this >> with memory cgroup's locality hierarchical update...

Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-14 Thread
On 2019/7/12 6:10 PM, 王贇 wrote: [snip] >> Documentation/cgroup-v1/cpusets.txt >> Look for mems_allowed. > This is the attribute belonging to the cpuset cgroup, isn't it? > Forgive me but I have no idea on how to combine this > with memory cgroup's locality

Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-12 Thread
On 2019/7/12 5:42 PM, Peter Zijlstra wrote: > On Fri, Jul 12, 2019 at 05:11:25PM +0800, 王贇 wrote: >> On 2019/7/12 3:58 PM, Peter Zijlstra wrote: >> [snip] >>>>> Then our task t1 should be accounted to B (as you do), but also

Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-12 Thread
On 2019/7/12 3:58 PM, Peter Zijlstra wrote: [snip] >>> Then our task t1 should be accounted to B (as you do), but also to A and >>> R. >> I get the point but am not quite sure about this... >> Unlike pages, there is no hierarchical limitation on locality, also tasks > You can use

Re: [PATCH 4/4] numa: introduce numa cling feature

2019-07-12 Thread
On 2019/7/12 3:53 PM, Peter Zijlstra wrote: [snip] return target; } >>> Select idle sibling should never cross node boundaries and is thus the >>> entirely wrong place to fix anything. >> Hmm.. in our early testing the printk showed both select_task_rq_fair() and >>

Re: [PATCH 3/4] numa: introduce numa group per task group

2019-07-11 Thread
On 2019/7/11 10:10 PM, Peter Zijlstra wrote: > On Wed, Jul 03, 2019 at 11:32:32AM +0800, 王贇 wrote: >> By tracing numa page faults, we recognize tasks sharing the same page, >> and try to pack them together into a single numa group. >> However when two tasks share

Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-11 Thread
On 2019/7/11 9:47 PM, Peter Zijlstra wrote: [snip] >> + rcu_read_lock(); >> + memcg = mem_cgroup_from_task(p); >> + if (idx != -1) >> + this_cpu_inc(memcg->stat_numa->locality[idx]); > I thought cgroups were supposed to be hierarchical. That is, if we have: >
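For illustration, a tiny self-contained model (not kernel code) of the hierarchical accounting being asked for here: charging a sample in the leaf group B also charges its ancestors A and R, matching the t1/B/A/R example discussed in this thread. The struct and function names are invented for this sketch.

#include <stdio.h>

struct group {
        const char *name;
        struct group *parent;
        unsigned long locality[8];
};

/* Charge one locality sample to a group and every ancestor above it. */
static void account_locality(struct group *g, int idx)
{
        for (; g; g = g->parent)
                g->locality[idx]++;
}

int main(void)
{
        struct group R = { "R", NULL, { 0 } };
        struct group A = { "A", &R,   { 0 } };
        struct group B = { "B", &A,   { 0 } };

        account_locality(&B, 3);        /* a task in B records one sample */
        printf("B=%lu A=%lu R=%lu\n",
               B.locality[3], A.locality[3], R.locality[3]);
        return 0;
}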

Re: [PATCH 2/4] numa: append per-node execution info in memory.numa_stat

2019-07-11 Thread
On 2019/7/11 9:45 PM, Peter Zijlstra wrote: > On Wed, Jul 03, 2019 at 11:29:15AM +0800, 王贇 wrote: >> +++ b/include/linux/memcontrol.h >> @@ -190,6 +190,7 @@ enum memcg_numa_locality_interval { >> struct memcg_stat_numa { >> u64 locality[N

Re: [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-11 Thread
On 2019/7/11 9:43 PM, Peter Zijlstra wrote: > On Wed, Jul 03, 2019 at 11:28:10AM +0800, 王贇 wrote: >> +#ifdef CONFIG_NUMA_BALANCING >> + >> +enum memcg_numa_locality_interval { >> + PERCENT_0_29, >> + PERCENT_30_39, >> + PERCENT_40_49,
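For illustration, a standalone sketch of how a locality percentage might map onto the buckets named in this enum. Only the first few bucket names survive in the preview (the 'locality' output elsewhere in this series shows buckets up to 60%~69%); the remaining boundaries up to 100% are assumed here just to make the example complete.

#include <assert.h>

/* Bucket names as in the quoted enum; boundaries past 40_49 are assumed. */
enum memcg_numa_locality_interval {
        PERCENT_0_29,
        PERCENT_30_39,
        PERCENT_40_49,
        PERCENT_50_59,
        PERCENT_60_69,
        PERCENT_70_79,
        PERCENT_80_89,
        PERCENT_90_100,
        NR_NL_INTERVAL
};

/* Map a locality percentage (0..100) to its bucket index. */
static int locality_bucket(unsigned int pct)
{
        if (pct < 30)
                return PERCENT_0_29;
        if (pct >= 90)
                return PERCENT_90_100;
        return PERCENT_30_39 + (pct - 30) / 10;
}

int main(void)
{
        assert(locality_bucket(0)   == PERCENT_0_29);
        assert(locality_bucket(35)  == PERCENT_30_39);
        assert(locality_bucket(69)  == PERCENT_60_69);
        assert(locality_bucket(100) == PERCENT_90_100);
        return 0;
}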

Re: [PATCH 4/4] numa: introduce numa cling feature

2019-07-11 Thread
On 2019/7/11 10:27 PM, Peter Zijlstra wrote: [snip] >> Thus we introduce numa cling, which tries to prevent tasks leaving >> the preferred node on the wakeup fast path. >> @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) >> if

Re: [PATCH 0/4] per cgroup numa suite

2019-07-11 Thread
Hi folks, What do you think about these patches? During most of our tests the results show stable improvements, thus we consider this a generic problem and propose this solution, hoping to help address the issue. Comments are sincerely welcome :-) Regards, Michael Wang On 2019/7/3 11:26 AM, 王贇

[PATCH v3 4/4] numa: introduce numa cling feature

2019-07-08 Thread
Although we paid so much effort to settle a task down on a particular node, there are still chances for a task to leave its preferred node, that is by wakeup, numa swap migrations or load balance. When we are using the cpu cgroup in a shared way, since all the workloads see all the cpus, it could be

Re: [PATCH v2 4/4] numa: introduce numa cling feature

2019-07-08 Thread
On 2019/7/8 4:07 PM, Hillf Danton wrote: > On Mon, 8 Jul 2019 10:25:27 +0800 Michael Wang wrote: >> /* Attempt to migrate a task to a CPU on the preferred node. */ >> static void numa_migrate_preferred(struct task_struct *p) >> { >> + bool failed, target; >> unsigned long interval = HZ;

[PATCH v2 4/4] numa: introduce numa cling feature

2019-07-07 Thread
Although we paid so much effort to settle a task down on a particular node, there are still chances for a task to leave its preferred node, that is by wakeup, numa swap migrations or load balance. When we are using the cpu cgroup in a shared way, since all the workloads see all the cpus, it could be

[PATCH 4/4] numa: introduce numa cling feature

2019-07-02 Thread
Although we paid so much effort to settle a task down on a particular node, there are still chances for a task to leave its preferred node, that is by wakeup, numa swap migrations or load balance. When we are using the cpu cgroup in a shared way, since all the workloads see all the cpus, it could be

[PATCH 3/4] numa: introduce numa group per task group

2019-07-02 Thread
By tracing numa page faults, we recognize tasks sharing the same page, and try to pack them together into a single numa group. However when two tasks share lots of cache pages but not many anonymous pages, since numa balancing does not trace cache pages, they have no chance to join the same

[PATCH 2/4] numa: append per-node execution info in memory.numa_stat

2019-07-02 Thread
This patch introduces numa execution information, to indicate numa efficiency. By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we see a new output line heading with 'exectime', like: exectime 311900 407166 which means the tasks of this cgroup executed 311900 microseconds on

[PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic

2019-07-02 Thread
This patch introduces a numa locality statistic, which tries to indicate the numa balancing efficiency per memory cgroup. By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we see a new output line heading with 'locality'; the format is: locality 0%~29% 30%~39% 40%~49% 50%~59% 60%~69%

[PATCH 0/4] per cpu cgroup numa suite

2019-07-02 Thread
During our torture testing on numa stuff, we found problems like: * missing per-cgroup information about the per-node execution status * missing per-cgroup information about numa locality That is, when we have a cpu cgroup running with a bunch of tasks, there is no good way to tell how its tasks are

Re: [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat

2019-04-23 Thread
On 2019/4/23 5:46 PM, Peter Zijlstra wrote: > On Tue, Apr 23, 2019 at 05:36:25PM +0800, 王贇 wrote: >> On 2019/4/23 4:52 PM, Peter Zijlstra wrote: >>> On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote: >>>> This patch introduced numa execution info

Re: [RFC PATCH 5/5] numa: numa balancer

2019-04-23 Thread
On 2019/4/23 5:05 PM, Peter Zijlstra wrote: [snip] >> TODO: >> * improve the logic to address the regression cases >> * find a way, maybe, to handle the page cache left on the remote node >> * find more scenarios which could gain benefit >> Signed-off-by: Michael Wang >> --- >>

Re: [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node

2019-04-23 Thread
On 2019/4/23 4:55 PM, Peter Zijlstra wrote: > On Mon, Apr 22, 2019 at 10:13:36AM +0800, 王贇 wrote: >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index af171ccb56a2..6513504373b4 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -2031,6 +2031,10

Re: [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat

2019-04-23 Thread
On 2019/4/23 4:52 PM, Peter Zijlstra wrote: > On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote: >> This patch introduced numa execution information, to imply the numa >> efficiency. >> By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we >

Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic

2019-04-23 Thread
On 2019/4/23 4:47 PM, Peter Zijlstra wrote: > On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote: >> + p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages; > Possibly: 3 + !!(mem_node == numa_node_id()), generates better code. Sounds good~ will app
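For illustration, a tiny standalone check that the ternary index and the suggested branch-free form pick the same slot; mem_node and node_id here simply stand in for mem_node and numa_node_id().

#include <assert.h>

static int idx_ternary(int mem_node, int node_id)
{
        return mem_node == node_id ? 4 : 3;
}

static int idx_branchless(int mem_node, int node_id)
{
        return 3 + !!(mem_node == node_id);
}

int main(void)
{
        for (int mem_node = 0; mem_node < 4; mem_node++)
                for (int node_id = 0; node_id < 4; node_id++)
                        assert(idx_ternary(mem_node, node_id) ==
                               idx_branchless(mem_node, node_id));
        return 0;
}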

Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic

2019-04-23 Thread
On 2019/4/23 4:46 PM, Peter Zijlstra wrote: > On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote: >> + * 0 -- remote faults >> + * 1 -- local faults >> + * 2 -- page migration failure >> + * 3 -- remote page accessing after page migration >> +

Re: [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic

2019-04-23 Thread
On 2019/4/23 4:44 PM, Peter Zijlstra wrote: > On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote: >> +#ifdef CONFIG_NUMA_BALANCING >> + >> +enum memcg_numa_locality_interval { >> + PERCENT_0_9, >> + PERCENT_10_19, >> + PERCENT_20_29,

Re: [RFC PATCH 0/5] NUMA Balancer Suite

2019-04-22 Thread
t's the problem is, we'll try to address them :-) Regards, Michael Wang > Thanks > Wind > 王贇 <yun.w...@linux.alibaba.com> wrote on Mon, 2019/4/22 at 10:13 AM: > We have the NUMA Balancing feature which always tries to move pages > of a task t

[RFC PATCH 5/5] numa: numa balancer

2019-04-21 Thread
numa balancer is a module which tries to automatically adjust numa balancing settings to gain as much numa bonus as possible. For each memory cgroup, we process the work in two stages: in stage 1 we check the cgroup's exectime and memory topology to see if there is a candidate node for it to settle down on,

[RFC PATCH 4/5] numa: introduce numa balancer infrastructure

2019-04-21 Thread
Now that we have a way to estimate and adjust the numa preferred node for each memcg, the next problem is how to use it. Usually one will bind workloads with cpuset.cpus, combined with cpuset.mems or, maybe better, the memory policy to achieve the numa bonus; however in complicated scenarios like combined types of

[RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node

2019-04-21 Thread
This patch adds a new entry 'numa_preferred' for each memory cgroup, by which we can now override the memory policy of the tasks inside a particular cgroup; combined with numa balancing, we are now able to migrate the workloads of a cgroup to the specified numa node in a gentle way. The load
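For illustration, a hypothetical userspace helper that writes a node id into such an entry. The exact file name and location are assumptions based on the description above ('numa_preferred' under the memory cgroup), and CGROUP_PATH is a placeholder.

#include <stdio.h>

/* Hypothetical: write a preferred node id into the new cgroup entry. */
static int set_preferred_node(const char *cgroup, int node)
{
        char path[256];
        FILE *fp;

        /* File name/location assumed from the description above. */
        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/memory/%s/memory.numa_preferred", cgroup);
        fp = fopen(path, "w");
        if (!fp)
                return -1;
        fprintf(fp, "%d\n", node);
        return fclose(fp);
}

int main(void)
{
        /* Prefer node 1 for the tasks in CGROUP_PATH (placeholder). */
        return set_preferred_node("CGROUP_PATH", 1) ? 1 : 0;
}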

[RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat

2019-04-21 Thread
This patch introduces numa execution information, to indicate numa efficiency. By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we see a new output line heading with 'exectime', like: exectime 24399843 27865444 which means the tasks of this cgroup executed 24399843 ticks on

[RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic

2019-04-21 Thread
This patch introduces a numa locality statistic, which tries to indicate the numa balancing efficiency per memory cgroup. By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we see a new output line heading with 'locality'; the format is: locality 0~9% 10%~19% 20%~29% 30%~39% 40%~49%

[RFC PATCH 0/5] NUMA Balancer Suite

2019-04-21 Thread
We have the NUMA Balancing feature which always tries to move the pages of a task to the node it executes on more, but it still has issues: * page cache can't be handled * no cgroup-level balancing Suppose we have a box with 4 cpus and two cgroups A & B each running 4 tasks; the scenario below could easily be

Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group

2019-01-29 Thread
On 2019/1/28 3:21 PM, 禹舟键 wrote: [snip] No offense, but I'm afraid you misunderstand the problem we are trying to solve with wait_sum. If your purpose is to have a way to tell whether there is sufficient CPU inside a container, please try lxcfs + top; if there is almost no idle time and the load is high, then

Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group

2019-01-27 Thread
for calculating the hierarchy wait_sum is traversing the cfs_rq's se from the target task's se to the root_task_group children's se. Regards, Yuzhoujian 王贇 <yun.w...@linux.alibaba.com> wrote on Fri, 2019/1/25 at 11:12 AM: On 2019/1/23 5:46 PM, ufo19890...@gmail.com wrote:

Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group

2019-01-24 Thread
On 2019/1/23 5:46 PM, ufo19890...@gmail.com wrote: From: yuzhoujian We can monitor the sum wait time of a task group since commit 3d6c50c27bd6 ("sched/debug: Show the sum wait time of a task group"). However this wait_sum just represents the conflict between different task groups, since it

[PATCH v3] tg: show the sum wait time of an task group

2018-07-23 Thread
Although we can rely on cpuacct to present the cpu usage of a task group, it is hard to tell how intensely these groups compete for cpu resources. Monitoring the wait time of each process or sched_debug could cost too much, and there is no good way to accurately represent the
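For illustration, a minimal sketch (not from the patch) of how the proposed wait_sum counter could be used: sample it twice and treat the growth rate as a contention indicator. The cpu.stat location is an assumption based on the cpu_cfs_stat_show() context quoted in the replies to this series; CGROUP_PATH is a placeholder.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the wait_sum value from a cgroup cpu.stat-style file, or -1. */
static long long read_wait_sum(const char *path)
{
        char key[64];
        long long val;
        FILE *fp = fopen(path, "r");

        if (!fp)
                return -1;
        while (fscanf(fp, "%63s %lld", key, &val) == 2) {
                if (!strcmp(key, "wait_sum")) {
                        fclose(fp);
                        return val;
                }
        }
        fclose(fp);
        return -1;
}

int main(void)
{
        /* CGROUP_PATH is a placeholder, as in the examples above. */
        const char *path = "/sys/fs/cgroup/cpu/CGROUP_PATH/cpu.stat";
        long long before = read_wait_sum(path);

        if (before < 0) {
                fprintf(stderr, "cannot read %s\n", path);
                return 1;
        }
        sleep(1);
        printf("wait_sum grew by %lld over 1s\n", read_wait_sum(path) - before);
        return 0;
}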

Re: [PATCH v2] tg: show the sum wait time of an task group

2018-07-23 Thread
On 2018/7/23 5:31 PM, Peter Zijlstra wrote: On Wed, Jul 04, 2018 at 11:27:27AM +0800, 王贇 wrote: @@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v) seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled); seq_printf(sf, "thr

Re: [PATCH v2] tg: show the sum wait time of an task group

2018-07-16 Thread
Hi, folks On 2018/7/4 11:27 AM, 王贇 wrote: Although we can rely on cpuacct to present the cpu usage of a task group, it is hard to tell how intensely these groups compete for cpu resources. Monitoring the wait time of each process or sched_debug could cost too much

Re: [PATCH v2] tg: show the sum wait time of an task group

2018-07-09 Thread
On 2018/7/4 11:27 AM, 王贇 wrote: Although we can rely on cpuacct to present the cpu usage of a task group, it is hard to tell how intensely these groups compete for cpu resources. Monitoring the wait time of each process or sched_debug could cost too much, and there is no good

[PATCH v2] tg: show the sum wait time of an task group

2018-07-03 Thread
Although we can rely on cpuacct to present the cpu usage of a task group, it is hard to tell how intensely these groups compete for cpu resources. Monitoring the wait time of each process or sched_debug could cost too much, and there is no good way to accurately represent the

[PATCH] tg: show the sum wait time of an task group

2018-07-02 Thread
Although we can rely on cpuacct to present the cpu usage of a task group, it is hard to tell how intensely these groups compete for cpu resources. Monitoring the wait time of each process or sched_debug could cost too much, and there is no good way to accurately represent the

Re: [RFC PATCH] tg: count the sum wait time of an task group

2018-07-02 Thread
Hi, Peter On 2018/7/2 8:03 PM, Peter Zijlstra wrote: On Mon, Jul 02, 2018 at 03:29:39PM +0800, 王贇 wrote: diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1866e64..ef82ceb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -862,6 +862,7 @@ static void update_curr_fair

[RFC PATCH] tg: count the sum wait time of an task group

2018-07-02 Thread
Although we can rely on cpuacct to present the cpu usage of a task group, it is hard to tell how intensely these groups compete for cpu resources. Monitoring the wait time of each process could cost too much, and there is no good way to accurately represent the conflict with these
