Re: [PATCH v2] perf ordered_events: Optimise event object reuse

2020-05-28 Thread Matt Fleming
On Wed, 27 May, at 12:25:33PM, Jiri Olsa wrote:
> On Tue, May 26, 2020 at 02:59:28PM +0100, Matt Fleming wrote:
> > +/*
> > + * Allocate a new event object from the free event cache.
> > + *
> > + * Find the first address range in the cache and carve out enough bytes
> > + * for an ordered_event object. The object with the lowest address is
> > + * always returned so that subsequent allocations benefit from
> > + * contiguous memory accesses (spatial locality).
> > + */
> > +static struct ordered_event *free_event_get_tree(struct ordered_events *oe)
> > +{
> > +        struct interval_tree_node *it;
> > +        struct ordered_event *new;
> > +        size_t bytes = sizeof(*new);
> > +
> > +        it = interval_tree_iter_first(&oe->cache.rb, 0, ULONG_MAX);
> > +        if (!it)
> > +                return NULL;
> > +
> > +        /* Has the cache memory been exhausted? */
> > +        assert(cache_region_size(it) >= bytes);
> > +
> > +        new = (void *)it->start;
> > +
> > +        if (cache_region_size(it) == bytes) {
> > +                interval_tree_remove(it, &oe->cache.rb);
> > +                free(it);
> > +        }
> > +
> > +        it->start += bytes;
> 
> this does not look right.. should this go to else path in above condition?

Urgh, yes. You're right -- this should be in the else path.
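
For reference, a minimal sketch of the corrected function, with the range
shrink moved into the else path as agreed above (helper names such as
cache_region_size() are taken from the quoted patch; that helper's body is
not shown in this thread):

static struct ordered_event *free_event_get_tree(struct ordered_events *oe)
{
        struct interval_tree_node *it;
        struct ordered_event *new;
        size_t bytes = sizeof(*new);

        it = interval_tree_iter_first(&oe->cache.rb, 0, ULONG_MAX);
        if (!it)
                return NULL;

        /* Has the cache memory been exhausted? */
        assert(cache_region_size(it) >= bytes);

        new = (void *)it->start;

        if (cache_region_size(it) == bytes) {
                /* Region fully consumed: remove and free the node. */
                interval_tree_remove(it, &oe->cache.rb);
                free(it);
        } else {
                /*
                 * Region only partially consumed: shrink it. Doing this
                 * here also avoids touching 'it' after the free() above.
                 */
                it->start += bytes;
        }

        return new;
}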


Re: [PATCH v2] perf ordered_events: Optimise event object reuse

2020-05-27 Thread Jiri Olsa
On Tue, May 26, 2020 at 02:59:28PM +0100, Matt Fleming wrote:

SNIP

> +
> +/*
> + * Allocate a new event object from the free event cache.
> + *
> + * Find the first address range in the cache and carve out enough bytes
> + * for an ordered_event object. The object with the lowest address is
> + * always returned so that subsequent allocations benefit from
> + * contiguous memory accesses (spatial locality).
> + */
> +static struct ordered_event *free_event_get_tree(struct ordered_events *oe)
> +{
> +        struct interval_tree_node *it;
> +        struct ordered_event *new;
> +        size_t bytes = sizeof(*new);
> +
> +        it = interval_tree_iter_first(&oe->cache.rb, 0, ULONG_MAX);
> +        if (!it)
> +                return NULL;
> +
> +        /* Has the cache memory been exhausted? */
> +        assert(cache_region_size(it) >= bytes);
> +
> +        new = (void *)it->start;
> +
> +        if (cache_region_size(it) == bytes) {
> +                interval_tree_remove(it, &oe->cache.rb);
> +                free(it);
> +        }
> +
> +        it->start += bytes;

this does not look right.. should this go to else path in above condition?

jirka

> +        return new;
> +}
> +

SNIP



Re: [PATCH v2] perf ordered_events: Optimise event object reuse

2020-05-26 Thread Arnaldo Carvalho de Melo
Adding a few more folks that worked on the ordering of events over the
years.

Some minor nits at the end of the message.

Thanks!

- Arnaldo

On Tue, May 26, 2020 at 02:59:28PM +0100, Matt Fleming wrote:
> ordered_event objects can be placed on the free event list in any order,
> which means future allocations may not return objects at sequential
> locations in memory. Getting non-contiguous objects from the free list
> has bad consequences when later iterating over those objects in
> ordered_events__queue().
> 
> For example, large perf.data files can contain trillions of events and
> since objects that are next to each other in the free linked-list can
> point to pretty much anywhere in the object address space, lots of
> cycles in ordered_events__queue() are spent servicing DTLB misses.
> 
> Implement the free event cache using the in-kernel implementation of
> interval trees so that objects can always be allocated from the free
> object cache in sequential order, improving spatial locality and
> reducing DTLB misses.
> 
> Since the existing linked-list implementation is faster if you're
> working with fewer events, add a --free-event-list option to force the
> use of the linked-list implementation. However, most users will benefit
> from the new interval tree code.
> 
> Here are some numbers showing the speed up (reduction in execution time)
> when running perf sched latency on sched events data and perf report on
> HW_CPU_CYCLES.
> 
>  $ perf stat --null -r 10 -- bash -c \
>   "export PAGER=cat ; perf sched latency -i $file --stdio &>/dev/null"
> 
>       Nr events   File Size   Before (s)   After (s)   Speed up
>  --------------   ---------   ----------   ---------   --------
>    123318457470        29MB       0.2149      0.2440     -13.5%
>   1651500885357       260MB       3.1205      1.9855     +36.4%
>   3470519751785       506MB       6.1821      3.8941     +37.0%
>   8213765551679      1100MB      13.4875      8.5079     +36.9%
>  15900515973759      1946MB      23.4069     15.0960     +35.5%
> 
> and HW_CPU_CYCLES events:
> 
>  $ perf stat --null -r 10 -- bash -c \
>   "export PAGER=cat ; perf report -i $file --stdio &>/dev/null"
> 
>       Nr events   File Size   Before (s)   After (s)   Speed up
>  --------------   ---------   ----------   ---------   --------
>    328805166262        29MB       1.637       1.587       +3.0%
>   3280413919096       253MB      13.381      12.349       +7.7%
>   6475305954370       500MB      25.648      23.753       +7.4%
>  14218430569416      1000MB      52.800      49.036       +7.1%
>  26760562279427      2000MB      97.169      90.129       +7.2%
> 
> Signed-off-by: Matt Fleming 
> ---
> 
> Changes in v2:
> 
>  - Add --free-event-list parameter to fall back to the linked-list
>    implementation of the free event cache.
> 
>  - Rename the free_cache_* API to free_event_* now that we have both
>    the old linked-list version and the newer _tree() calls. Also make
>    the API public so I can drop the #define gunk in the tests.
> 
>  - Update check-headers.sh for the interval tree files.
> 
>  - Remove leftover struct cache_region type that accidentally snuck into v1.
> 
>  tools/include/linux/interval_tree.h |  30 +++
>  tools/include/linux/interval_tree_generic.h | 187 ++
>  tools/lib/interval_tree.c   |  12 ++
>  tools/perf/MANIFEST |   1 +
>  tools/perf/check-headers.sh |   2 +
>  tools/perf/perf.c   |   6 +
>  tools/perf/tests/Build  |   1 +
>  tools/perf/tests/builtin-test.c |   4 +
>  tools/perf/tests/free-event-tree.c  | 201 
>  tools/perf/tests/tests.h|   2 +
>  tools/perf/util/Build   |   6 +
>  tools/perf/util/ordered-events.c| 174 -
>  tools/perf/util/ordered-events.h|  16 +-
>  13 files changed, 634 insertions(+), 8 deletions(-)
>  create mode 100644 tools/include/linux/interval_tree.h
>  create mode 100644 tools/include/linux/interval_tree_generic.h
>  create mode 100644 tools/lib/interval_tree.c
>  create mode 100644 tools/perf/tests/free-event-tree.c
> 
> diff --git a/tools/include/linux/interval_tree.h b/tools/include/linux/interval_tree.h
> new file mode 100644
> index ..288c26f50732
> --- /dev/null
> +++ b/tools/include/linux/interval_tree.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_INTERVAL_TREE_H
> +#define _LINUX_INTERVAL_TREE_H
> +
> +#include <linux/rbtree.h>
> +
> +struct interval_tree_node {
> +        struct rb_node rb;
> +        unsigned long start;    /* Start of interval */
> +        unsigned long last;     /* Last location _in_ interval */
> +        unsigned long __subtree_last;
> +};
> +
> +extern void
> +interval_tree_insert(struct interval_tree_node *node,
> +                     struct rb_root_cached *root);
> +
> +extern void
> +interval_tree_remove(struct interval_tree_node *node,
> +                     struct rb_root_cached *root);
> +
> +extern struct interval_tree_node

[PATCH v2] perf ordered_events: Optimise event object reuse

2020-05-26 Thread Matt Fleming
ordered_event objects can be placed on the free event list in any order,
which means future allocations may not return objects at sequential
locations in memory. Getting non-contiguous objects from the free list
has bad consequences when later iterating over those objects in
ordered_events__queue().

For example, large perf.data files can contain trillions of events and
since objects that are next to each other in the free linked-list can
point to pretty much anywhere in the object address space, lots of
cycles in ordered_events__queue() are spent servicing DTLB misses.

Implement the free event cache using the in-kernel implementation of
interval trees so that objects can always be allocated from the free
object cache in sequential order, improving spatial locality and
reducing DTLB misses.
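
The put half of the cache is not quoted anywhere in this thread, so purely
as a sketch of the idea: when an event object is freed, its byte range goes
back into the tree and is merged with an adjacent free range where possible,
so that later gets can carve out sequential objects. Something like the
following, where the function name and the backward-only merge policy are
assumptions and only oe->cache.rb comes from the quoted code:

static void free_event_put_tree(struct ordered_events *oe,
                                struct ordered_event *event)
{
        unsigned long start = (unsigned long)event;
        unsigned long last = start + sizeof(*event) - 1;
        struct interval_tree_node *it;

        /* Does a free range end immediately before this object? */
        it = interval_tree_iter_first(&oe->cache.rb, start - 1, start - 1);
        if (it) {
                /* Grow the neighbour rather than allocate a new node. */
                interval_tree_remove(it, &oe->cache.rb);
                it->last = last;
                interval_tree_insert(it, &oe->cache.rb);
                return;
        }

        it = malloc(sizeof(*it));
        if (!it)
                return; /* object simply won't be reused */
        it->start = start;
        it->last = last;
        interval_tree_insert(it, &oe->cache.rb);
}

(Forward merging with a range that starts at last + 1 is omitted for
brevity.)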

Since the existing linked-list implementation is faster if you're
working with fewer events, add a --free-event-list option to force the
use of the linked-list implementation. However, most users will benefit
from the new interval tree code.

Here are some numbers showing the speed up (reduction in execution time)
when running perf sched latency on sched events data and perf report on
HW_CPU_CYCLES.

 $ perf stat --null -r 10 -- bash -c \
"export PAGER=cat ; perf sched latency -i $file --stdio &>/dev/null"

      Nr events   File Size   Before (s)   After (s)   Speed up
 --------------   ---------   ----------   ---------   --------
   123318457470        29MB       0.2149      0.2440     -13.5%
  1651500885357       260MB       3.1205      1.9855     +36.4%
  3470519751785       506MB       6.1821      3.8941     +37.0%
  8213765551679      1100MB      13.4875      8.5079     +36.9%
 15900515973759      1946MB      23.4069     15.0960     +35.5%

and HW_CPU_CYCLES events:

 $ perf stat --null -r 10 -- bash -c \
"export PAGER=cat ; perf report -i $file --stdio &>/dev/null"

      Nr events   File Size   Before (s)   After (s)   Speed up
 --------------   ---------   ----------   ---------   --------
   328805166262        29MB       1.637       1.587       +3.0%
  3280413919096       253MB      13.381      12.349       +7.7%
  6475305954370       500MB      25.648      23.753       +7.4%
 14218430569416      1000MB      52.800      49.036       +7.1%
 26760562279427      2000MB      97.169      90.129       +7.2%

Signed-off-by: Matt Fleming 
---

Changes in v2:

 - Add --free-event-list parameter to fall back to the linked-list
   implementation of the free event cache.

 - Rename the free_cache_* API to free_event_* now that we have both
   the old linked-list version and the newer _tree() calls. Also make
   the API public so I can drop the #define gunk in the tests.

 - Update check-headers.sh for the interval tree files.

 - Remove leftover struct cache_region type that accidentally snuck into v1.

 tools/include/linux/interval_tree.h |  30 +++
 tools/include/linux/interval_tree_generic.h | 187 ++
 tools/lib/interval_tree.c   |  12 ++
 tools/perf/MANIFEST |   1 +
 tools/perf/check-headers.sh |   2 +
 tools/perf/perf.c   |   6 +
 tools/perf/tests/Build  |   1 +
 tools/perf/tests/builtin-test.c |   4 +
 tools/perf/tests/free-event-tree.c  | 201 
 tools/perf/tests/tests.h|   2 +
 tools/perf/util/Build   |   6 +
 tools/perf/util/ordered-events.c| 174 -
 tools/perf/util/ordered-events.h|  16 +-
 13 files changed, 634 insertions(+), 8 deletions(-)
 create mode 100644 tools/include/linux/interval_tree.h
 create mode 100644 tools/include/linux/interval_tree_generic.h
 create mode 100644 tools/lib/interval_tree.c
 create mode 100644 tools/perf/tests/free-event-tree.c

diff --git a/tools/include/linux/interval_tree.h b/tools/include/linux/interval_tree.h
new file mode 100644
index ..288c26f50732
--- /dev/null
+++ b/tools/include/linux/interval_tree.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INTERVAL_TREE_H
+#define _LINUX_INTERVAL_TREE_H
+
+#include <linux/rbtree.h>
+
+struct interval_tree_node {
+        struct rb_node rb;
+        unsigned long start;    /* Start of interval */
+        unsigned long last;     /* Last location _in_ interval */
+        unsigned long __subtree_last;
+};
+
+extern void
+interval_tree_insert(struct interval_tree_node *node,
+                     struct rb_root_cached *root);
+
+extern void
+interval_tree_remove(struct interval_tree_node *node,
+                     struct rb_root_cached *root);
+
+extern struct interval_tree_node *
+interval_tree_iter_first(struct rb_root_cached *root,
+                         unsigned long start, unsigned long last);
+
+extern struct interval_tree_node *
+interval_tree_iter_next(struct interval_tree_node *node,
+                        unsigned long start, unsigned long last);
+
+#endif /* _LINUX_INTERVAL_TREE_H */
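
To make the API above concrete, a minimal standalone usage sketch (not part
of the patch; it assumes the tools/include rbtree provides RB_ROOT_CACHED,
as its kernel counterpart does):

#include <stdio.h>
#include <linux/interval_tree.h>

int main(void)
{
        struct rb_root_cached root = RB_ROOT_CACHED;
        struct interval_tree_node a = { .start = 0x1000, .last = 0x1fff };
        struct interval_tree_node b = { .start = 0x4000, .last = 0x4fff };
        struct interval_tree_node *it;

        interval_tree_insert(&a, &root);
        interval_tree_insert(&b, &root);

        /*
         * Visit every node overlapping [0, ULONG_MAX], lowest start
         * first -- the property free_event_get_tree() relies on.
         */
        for (it = interval_tree_iter_first(&root, 0, ~0UL); it;
             it = interval_tree_iter_next(it, 0, ~0UL))
                printf("free range [0x%lx, 0x%lx]\n", it->start, it->last);

        return 0;
}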
diff --git a/tools/include/linux/interval_tree_generic.h b/tools/include/linux/interval_tre