Hi,

On 12.12.2018 15:15, Jiri Olsa wrote:
> On Wed, Dec 12, 2018 at 10:40:22AM +0300, Alexey Budankov wrote:
>>
>> Build node cpu masks for mmap data buffers. Bind AIO data buffers
>> to nodes according to kernel data buffers location. Apply node cpu
>> masks to trace reading thread every time it references memory cross
>> node or cross cpu.
>>
>> Signed-off-by: Alexey Budankov <alexey.budan...@linux.intel.com>
>> ---
>>  tools/perf/builtin-record.c |  9 +++++++++
>>  tools/perf/util/evlist.c    |  6 +++++-
>>  tools/perf/util/mmap.c      | 38 ++++++++++++++++++++++++++++++++++++-
>>  tools/perf/util/mmap.h      |  1 +
>>  4 files changed, 52 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 4979719e54ae..1a1438c73f96 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -532,6 +532,9 @@ static int record__mmap_evlist(struct record *rec,
>>  	struct record_opts *opts = &rec->opts;
>>  	char msg[512];
>>
>> +	if (opts->affinity != PERF_AFFINITY_SYS)
>> +		cpu__setup_cpunode_map();
>> +
>>  	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
>>  				 opts->auxtrace_mmap_pages,
>>  				 opts->auxtrace_snapshot_mode,
>> @@ -751,6 +754,12 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
>>  		struct perf_mmap *map = &maps[i];
>>
>>  		if (map->base) {
>> +			if (rec->opts.affinity != PERF_AFFINITY_SYS &&
>> +			    !CPU_EQUAL(&rec->affinity_mask, &map->affinity_mask)) {
>> +				CPU_ZERO(&rec->affinity_mask);
>> +				CPU_OR(&rec->affinity_mask, &rec->affinity_mask, &map->affinity_mask);
>> +				sched_setaffinity(0, sizeof(rec->affinity_mask), &rec->affinity_mask);
>> +			}
>
> hum, so you change affinity every time you read different map?

That is exactly what happens with --affinity=cpu. With --affinity=node
the thread affinity changes only when the thread gets an mmap buffer
allocated on a remote node. On a dual socket machine that is at most
twice per loop execution.

> I'm surprised this is actually faster..

Imagine that an app's thread running on cpu 0 of node 1 generates
samples into a kernel buffer that is also allocated on node 1. The tool
thread running on cpu 0 of node 0 takes the buffer and passes part of
it to the write syscall. That can cause a cross node memory move (from
the kernel buffer into fs cache buffers, with that portion of the write
syscall code executing on cpu 0 of node 0) and induce collection
overhead.

>
> anyway this patch is doing 2 things.. binding the memory allocation
> to nodes and setting the process affinity, please separate those and
> explain the logic behind

Separated in v2. Binding is implemented for AIO user space buffers
only, to map them to the same nodes the kernel buffers are mapped to.
Tool thread affinity mask bouncing is implemented for, and applies to,
both serial and AIO streaming. AIO streaming without binding can result
in cross node memory moves from kernel buffers to AIO ones.

Thanks,
Alexey

>
> thanks,
> jirka
>
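
To make the difference between --affinity=cpu and --affinity=node in the reading loop more concrete, here is a toy model of the mask comparison only. It is a sketch under assumptions, not the perf code: masks are plain 64-bit bitmaps instead of cpu_set_t, all toy_* names are hypothetical, the topology (4 cpus, 2 per node) is made up, and the place where the real loop would call sched_setaffinity() is only counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for cpu_set_t: one bit per cpu, at most 64 cpus. */
typedef uint64_t toy_mask_t;

/* --affinity=cpu model: buffer i may only be read from cpu i. */
static void toy_fill_cpu_masks(toy_mask_t *masks, size_t nr_cpus)
{
	assert(nr_cpus <= 64);
	for (size_t i = 0; i < nr_cpus; i++)
		masks[i] = (toy_mask_t)1 << i;
}

/* --affinity=node model: buffer i may be read from any cpu of its node. */
static void toy_fill_node_masks(toy_mask_t *masks, size_t nr_cpus,
				size_t cpus_per_node)
{
	assert(nr_cpus <= 64 && cpus_per_node > 0 && cpus_per_node < 64);
	toy_mask_t node_bits = ((toy_mask_t)1 << cpus_per_node) - 1;

	for (size_t i = 0; i < nr_cpus; i++) {
		size_t node = i / cpus_per_node;

		masks[i] = node_bits << (node * cpus_per_node);
	}
}

/*
 * One pass over all buffers. The thread mask is replaced only when it
 * differs from the buffer mask; each replacement stands for one
 * sched_setaffinity() call in the real loop. Returns the number of
 * replacements and leaves the last applied mask in *thread_mask.
 */
static size_t toy_count_switches(toy_mask_t *thread_mask,
				 const toy_mask_t *map_masks, size_t nr_maps)
{
	size_t switches = 0;

	for (size_t i = 0; i < nr_maps; i++) {
		if (*thread_mask != map_masks[i]) {
			*thread_mask = map_masks[i];
			switches++;
		}
	}
	return switches;
}
```

With 4 buffers and the thread starting on the mask of buffer 0, the cpu masks give 3 switches on the first pass and 4 on every later pass (one per buffer), while the node masks give 1 switch on the first pass and 2 on every later pass, which matches the "at most twice per loop execution" figure for two nodes.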