Re: [PATCH] livepatch: introduce klp_func called interface

2024-06-07 Thread Song Liu
Hi Miroslav,

On Fri, Jun 7, 2024 at 2:07 AM Miroslav Benes  wrote:
>
> Hi,
>
> On Tue, 4 Jun 2024, Song Liu wrote:
>
> > On Tue, May 21, 2024 at 1:04 AM Petr Mladek  wrote:
> > [...]
> > > >
> > > > Yes, but the information you get is limited compared to what is 
> > > > available
> > > > now. You would obtain the information that a patched function was called
> > > > but ftrace could also give you the context and more.
> > >
> > > Another motivation to use ftrace for testing is that it does not
> > > affect the performance in production.
> > >
> > > We should keep klp_ftrace_handler() as fast as possible so that we
> > > could livepatch also performance sensitive functions.
> >
> > At LPC last year, we discussed adding a counter to each
> > klp_func, like:
> >
> > struct klp_func {
> > ...
> > u64 __percpu counter;
> > ...
> > };
> >
> > With some static_key (+ sysctl), this should give us a way to estimate
> > the overhead of livepatch. If we have the counter, this patch is not
> > needed any more. Does this (adding the counter) sound like
> > something we still want to pursue?
>
> It would be better than this patch but given what was mentioned in the
> thread I wonder if it is possible to use ftrace even for this. See
> /sys/kernel/tracing/trace_stat/function*. It already gathers the number of
> hits.

I didn't know about the trace_stat API until today. :) For some reason it
doesn't exist on some older kernels. (I haven't debugged why.)
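For reference, this seems to be the ftrace function profiler (it depends on
CONFIG_FUNCTION_PROFILER, which may be why it is missing on some kernels).
IIUC it can be used roughly like this:

  echo 1 > /sys/kernel/tracing/function_profile_enabled
  cat /sys/kernel/tracing/trace_stat/function0    # per-CPU hit counts

so the hit counts are already there without touching klp_ftrace_handler().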

> Would it be sufficient for you? I guess it depends on what the intention
> is. If there is no time limit, klp_func.counter might be better to provide
> some kind of overall statistics (but I am not sure if it has any value)
> and to avoid having ftrace registered on a live patched function for
> infinite period of time. If the intention is to gather data for some
> limited period, trace_stat sounds like much better approach to me.

We don't have a very urgent use for this. As we discussed, various tracing
tools are sufficient in most cases. I brought this up in the context of the
"called" entry: if we are really adding a new entry, let's make it "counter"
instead of "called".

Thanks,
Song



Re: [PATCH] kallsyms, livepatch: Fix livepatch with CONFIG_LTO_CLANG

2024-06-07 Thread Song Liu
Hi Miroslav,

Thanks for reviewing the patch!

On Fri, Jun 7, 2024 at 6:06 AM Miroslav Benes  wrote:
>
> Hi,
>
> On Tue, 4 Jun 2024, Song Liu wrote:
>
> > With CONFIG_LTO_CLANG, the compiler may postfix symbols with .llvm.
> > to avoid symbol duplication. scripts/kallsyms.c sorted the symbols
> > without these postfixes. The default symbol lookup also removes these
> > postfixes before comparing symbols.
> >
> > On the other hand, livepatch needs to look up symbols with the full names.
> > However, calling kallsyms_on_each_match_symbol with full name (with the
> > postfix) cannot find the symbol(s). As a result, we cannot livepatch
> > kernel functions with .llvm. postfix or kernel functions that use
> > relocation information to symbols with .llvm. postfixes.
> >
> > Fix this by calling kallsyms_on_each_match_symbol without the postfix;
> > and then match the full name (with postfix) in klp_match_callback.
> >
> > Signed-off-by: Song Liu 
> > ---
> >  include/linux/kallsyms.h | 13 +
> >  kernel/kallsyms.c| 21 -
> >  kernel/livepatch/core.c  | 32 +++-
> >  3 files changed, 60 insertions(+), 6 deletions(-)
>
> I do not like much that something which seems to be kallsyms-internal is
> leaked out. You need to export cleanup_symbol_name() and there is now a
> lot of code outside. I would feel much more comfortable if it is all
> hidden from kallsyms users and kept there. Would it be possible?

I think it is possible. Currently, kallsyms_on_each_match_symbol matches
symbols without the postfix. We can add a variation or a parameter, so
that it matches the full name with the postfix.
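Something along these lines (the extra parameter is only an illustration of
the idea, not a proposed interface):

  /* hypothetical variant that compares the full name, .llvm. postfix included */
  int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
                                    const char *name, void *data,
                                    bool match_full_name);

klp_find_object_symbol() could then pass match_full_name = true and
klp_match_callback() would not need the extra lookup.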

> Moreover, isn't there a similar problem for ftrace, kprobes, ebpf,...?

Yes, there is a similar problem with tracing use cases. But the requirements
are not the same:

For livepatch, we have to point to the exact symbol we want to patch or
relocate to. We have the sympos API defined to differentiate symbols
with the same name.
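For reference, this is roughly how sympos is used today (the symbol and
function names below are made up):

  static struct klp_func funcs[] = {
          {
                  .old_name = "duplicated_func",
                  .new_func = livepatch_duplicated_func,
                  /* patch the 2nd occurrence of the name, in kallsyms order */
                  .old_sympos = 2,
          }, { }
  };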

For tracing, some discrepancy is acceptable. AFAICT, there isn't an API
similar to sympos yet. Also, we can play some tricks with tracing. For
example, we can use "unique symbol + offset" to point a kprobe to one of
the duplicated symbols.

Given that livepatch has a well-defined API, while the APIs on the tracing
side may still change, we can change kallsyms to make sure the livepatch
side works. Work on the tracing side can wait.

Thanks,
Song



Re: [PATCH] livepatch: introduce klp_func called interface

2024-06-04 Thread Song Liu
On Tue, May 21, 2024 at 1:04 AM Petr Mladek  wrote:
[...]
> >
> > Yes, but the information you get is limited compared to what is available
> > now. You would obtain the information that a patched function was called
> > but ftrace could also give you the context and more.
>
> Another motivation to use ftrace for testing is that it does not
> affect the performance in production.
>
> We should keep klp_ftrace_handler() as fast as possible so that we
> could livepatch also performance sensitive functions.

At LPC last year, we discussed adding a counter to each
klp_func, like:

struct klp_func {
        ...
        u64 __percpu counter;
        ...
};

With some static_key (+ sysctl), this should give us a way to estimate
the overhead of livepatch. If we have the counter, this patch is not
needed any more. Does this (adding the counter) sound like
something we still want to pursue?
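To make it concrete, the fast path could look roughly like the sketch below.
(The names are made up, and the counter would be a percpu pointer set up with
alloc_percpu() when the patch is loaded.)

  DEFINE_STATIC_KEY_FALSE(klp_count_calls);       /* toggled via a sysctl */

  static void notrace klp_ftrace_handler(...)
  {
          ...
          if (static_branch_unlikely(&klp_count_calls))
                  this_cpu_inc(*func->counter);
          ...
  }

With the key disabled this is essentially a single nop in the handler.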

Thanks,
Song



[PATCH] kallsyms, livepatch: Fix livepatch with CONFIG_LTO_CLANG

2024-06-04 Thread Song Liu
With CONFIG_LTO_CLANG, the compiler may postfix symbols with .llvm.
to avoid symbol duplication. scripts/kallsyms.c sorted the symbols
without these postfixes. The default symbol lookup also removes these
postfixes before comparing symbols.

On the other hand, livepatch needs to look up symbols with the full names.
However, calling kallsyms_on_each_match_symbol with full name (with the
postfix) cannot find the symbol(s). As a result, we cannot livepatch
kernel functions with .llvm. postfix or kernel functions that use
relocation information to symbols with .llvm. postfixes.

Fix this by calling kallsyms_on_each_match_symbol without the postfix;
and then match the full name (with postfix) in klp_match_callback.

Signed-off-by: Song Liu 
---
 include/linux/kallsyms.h | 13 +
 kernel/kallsyms.c| 21 -
 kernel/livepatch/core.c  | 32 +++-
 3 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h
index c3f075e8f60c..d7d07a134ae4 100644
--- a/include/linux/kallsyms.h
+++ b/include/linux/kallsyms.h
@@ -97,6 +97,10 @@ extern int sprint_backtrace_build_id(char *buffer, unsigned long address);
 
 int lookup_symbol_name(unsigned long addr, char *symname);
 
+int kallsyms_lookup_symbol_full_name(unsigned long addr, char *symname);
+
+void kallsyms_cleanup_symbol_name(char *s);
+
 #else /* !CONFIG_KALLSYMS */
 
 static inline unsigned long kallsyms_lookup_name(const char *name)
@@ -165,6 +169,15 @@ static inline int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long)
 {
return -EOPNOTSUPP;
 }
+
+static inline int kallsyms_lookup_symbol_full_name(unsigned long addr, char *symname)
+{
+   return -ERANGE;
+}
+
+static inline void kallsyms_cleanup_symbol_name(char *s)
+{
+}
 #endif /*CONFIG_KALLSYMS*/
 
 static inline void print_ip_sym(const char *loglvl, unsigned long ip)
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 22ea19a36e6e..f0d04fa1bbb4 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -163,7 +163,7 @@ unsigned long kallsyms_sym_address(int idx)
return kallsyms_relative_base - 1 - kallsyms_offsets[idx];
 }
 
-static void cleanup_symbol_name(char *s)
+void kallsyms_cleanup_symbol_name(char *s)
 {
char *res;
 
@@ -191,7 +191,7 @@ static int compare_symbol_name(const char *name, char *namebuf)
 * To ensure correct bisection in kallsyms_lookup_names(), do
 * cleanup_symbol_name(namebuf) before comparing name and namebuf.
 */
-   cleanup_symbol_name(namebuf);
+   kallsyms_cleanup_symbol_name(namebuf);
return strcmp(name, namebuf);
 }
 
@@ -426,7 +426,7 @@ static const char *kallsyms_lookup_buildid(unsigned long addr,
offset, modname, namebuf);
 
 found:
-   cleanup_symbol_name(namebuf);
+   kallsyms_cleanup_symbol_name(namebuf);
return ret;
 }
 
@@ -446,7 +446,7 @@ const char *kallsyms_lookup(unsigned long addr,
   NULL, namebuf);
 }
 
-int lookup_symbol_name(unsigned long addr, char *symname)
+static int __lookup_symbol_name(unsigned long addr, char *symname, bool cleanup)
 {
int res;
 
@@ -468,10 +468,21 @@ int lookup_symbol_name(unsigned long addr, char *symname)
return res;
 
 found:
-   cleanup_symbol_name(symname);
+   if (cleanup)
+   kallsyms_cleanup_symbol_name(symname);
return 0;
 }
 
+int lookup_symbol_name(unsigned long addr, char *symname)
+{
+   return __lookup_symbol_name(addr, symname, true);
+}
+
+int kallsyms_lookup_symbol_full_name(unsigned long addr, char *symname)
+{
+   return __lookup_symbol_name(addr, symname, false);
+}
+
 /* Look up a kernel symbol and return it in a text buffer. */
 static int __sprint_symbol(char *buffer, unsigned long address,
   int symbol_offset, int add_offset, int add_buildid)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 52426665eecc..284220e04801 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -128,6 +128,19 @@ struct klp_find_arg {
 static int klp_match_callback(void *data, unsigned long addr)
 {
struct klp_find_arg *args = data;
+#ifdef CONFIG_LTO_CLANG
+   char full_name[KSYM_NAME_LEN];
+
+   /*
+* With CONFIG_LTO_CLANG, we need to compare the full name of the
+* symbol (with .llvm. postfix).
+*/
+   if (kallsyms_lookup_symbol_full_name(addr, full_name))
+   return 0;
+
+   if (strcmp(args->name, full_name))
+   return 0;
+#endif
 
args->addr = addr;
args->count++;
@@ -145,10 +158,12 @@ static int klp_match_callback(void *data, unsigned long addr)
 
 static int klp_find_callback(void *data, const char *name, unsigned long addr)
 {
+#ifndef CONFIG_LTO_CLANG
struct klp_find_arg *args = data;
 
if (strcmp(a

Re: [PATCH v6 08/16] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Extend execmem parameters to accommodate more complex overrides of
> module_alloc() by architectures.
>
> This includes specification of a fallback range required by arm, arm64
> and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
> allocation of KASAN shadow required by s390 and x86 and support for
> late initialization of execmem required by arm64.
>
> The core implementation of execmem_alloc() takes care of suppressing
> warnings when the initial allocation fails but there is a fallback range
> defined.
>
> Signed-off-by: Mike Rapoport (IBM) 
> Acked-by: Will Deacon 

nit: We should probably move the logic for ARCH_WANTS_EXECMEM_LATE
to a separate patch.

Otherwise,

Acked-by: Song Liu 



Re: [PATCH v6 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
>
> Provide a generic implementation in execmem that uses the parameters for
> address space ranges, required alignment and page protections provided
> by architectures.
>
> The architectures must fill execmem_info structure and implement
> execmem_arch_setup() that returns a pointer to that structure. This way the
> execmem initialization won't be called from every architecture, but rather
> from a central place, namely a core_initcall() in execmem.
>
> The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
> with the parameters defined by the architectures.  If an architecture does
> not implement execmem_arch_setup(), execmem_alloc() will fall back to
> module_alloc().
>
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Song Liu 



Re: [PATCH v6 06/16] mm: introduce execmem_alloc() and execmem_free()

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
>
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
>
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
>
> No functional changes.
>
> Signed-off-by: Mike Rapoport (IBM) 
> Acked-by: Masami Hiramatsu (Google) 

Acked-by: Song Liu 



Re: [PATCH v6 05/16] module: make module_memory_{alloc,free} more self-contained

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Move the logic related to the memory allocation and freeing into
> module_memory_alloc() and module_memory_free().
>
> Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Song Liu 



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-22 Thread Song Liu
Hi Masami and Mike,

On Sat, Apr 20, 2024 at 2:11 AM Masami Hiramatsu  wrote:
[...]
> > >
> > > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> > > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> > > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> > > the "range" object, we will have to compare different range parameters to 
> > > check
> > > we can share cached pages between module text and kprobe, which is not
> > > efficient. Did I miss something?
>
> Song, thanks for trying to explain. I think I need to explain why I used
> module_alloc() originally.
>
> This depends on how kprobe features are implemented on the architecture, and
> how many features are supported by kprobes.
>
> Because kprobe jump optimization and kprobe jump-back optimization need to
> use a jump instruction to jump into the trampoline and jump back from the
> trampoline directly, if the architecture's jmp instruction supports a +-2GB
> range like x86, it needs to allocate the trampoline buffer inside such
> address space.
> This requirement is similar to the modules (because module function needs to
> call other functions in the kernel etc.), at least kprobes on x86 used
> module_alloc().
>
> However, if an architecture only supports breakpoint/trap based kprobe,
> it does not need to consider whether the execmem is allocated.
>
> >
> > We can always share large ROX pages as long as they are within the correct
> > address space. The permissions for them are ROX and the alignment
> > differences are due to KASAN and this is handled during allocation of the
> > large page to refill the cache. __execmem_cache_alloc() only needs to limit
> > the search for the address space of the range.
>
> So I don't think EXECMEM_KPROBE is always the same as EXECMEM_MODULE_TEXT;
> it should be configured for each arch. Especially if it is only used as a
> search parameter, it looks OK to me.

Thanks for the explanation!

I was thinking "we can have EXECMEM_KPROBE share the same parameters as
EXECMEM_MODULE_TEXT for all architectures". But this thought is built on top
of assumptions on future changes/improvements within multiple sub systems.
At this moment, I have no objections moving forward with current execmem APIs.

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-19 Thread Song Liu
On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport  wrote:
>
> On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport  wrote:
> > [...]
> > > > >
> > > > > [1] 
> > > > > https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
> > > >
> > > > For the ROX to work, we need different users (module text, kprobe, 
> > > > etc.) to have
> > > > the same execmem_range. From [1]:
> > > >
> > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t 
> > > > size)
> > > > {
> > > > ...
> > > >p = __execmem_cache_alloc(size);
> > > >if (p)
> > > >return p;
> > > >   err = execmem_cache_populate(range, size);
> > > > ...
> > > > }
> > > >
> > > > We are calling __execmem_cache_alloc() without range. For this to work,
> > > > we can only call execmem_cache_alloc() with one execmem_range.
> > >
> > > Actually, on x86 this will "just work" because everything shares the same
> > > address space :)
> > >
> > > The 2M pages in the cache will be in the modules space, so
> > > __execmem_cache_alloc() will always return memory from that address space.
> > >
> > > For other architectures this indeed needs to be fixed with passing the
> > > range to __execmem_cache_alloc() and limiting search in the cache for that
> > > range.
> >
> > I think we at least need the "map to" concept (initially proposed by Thomas)
> > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> > maps to EXECMEM_MODULE_TEXT, so that all these actually share
> > the same range.
>
> Why?

IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
input. Module text will use the "range" for EXECMEM_MODULE_TEXT, while kprobe
will use the "range" for EXECMEM_KPROBE. Without the "map to" concept or
sharing the "range" object, we will have to compare different range
parameters to check whether we can share cached pages between module text
and kprobe, which is not efficient. Did I miss something?
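(To spell out the "compare different range parameters" part, I mean something
roughly like the check below; the field names are the ones from the proposed
struct execmem_range.)

  static bool execmem_ranges_compatible(struct execmem_range *a,
                                        struct execmem_range *b)
  {
          return a->start == b->start && a->end == b->end &&
                 pgprot_val(a->pgprot) == pgprot_val(b->pgprot) &&
                 a->alignment == b->alignment;
  }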

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-19 Thread Song Liu
On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport  wrote:
[...]
> > >
> > > [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
> >
> > For the ROX to work, we need different users (module text, kprobe, etc.) to 
> > have
> > the same execmem_range. From [1]:
> >
> > static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> > {
> > ...
> >p = __execmem_cache_alloc(size);
> >if (p)
> >return p;
> >   err = execmem_cache_populate(range, size);
> > ...
> > }
> >
> > We are calling __execmem_cache_alloc() without range. For this to work,
> > we can only call execmem_cache_alloc() with one execmem_range.
>
> Actually, on x86 this will "just work" because everything shares the same
> address space :)
>
> The 2M pages in the cache will be in the modules space, so
> __execmem_cache_alloc() will always return memory from that address space.
>
> For other architectures this indeed needs to be fixed with passing the
> range to __execmem_cache_alloc() and limiting search in the cache for that
> range.

I think we at least need the "map to" concept (initially proposed by Thomas)
to get this to work. For example, EXECMEM_BPF and EXECMEM_KPROBE map to
EXECMEM_MODULE_TEXT, so that all of these actually share the same range.
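For illustration, the mapping could be as simple as something like this
(just a sketch of the idea, not a proposal for the actual interface):

  /* fold "compatible" types onto one canonical type/range */
  static enum execmem_type execmem_canonical_type(enum execmem_type type)
  {
          switch (type) {
          case EXECMEM_KPROBES:
          case EXECMEM_BPF:
                  return EXECMEM_MODULE_TEXT;
          default:
                  return type;
          }
  }

and the large-page cache would then be keyed by the canonical type, so
kprobes, BPF and module text naturally share the same 2M pages.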

Does this make sense?

Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-19 Thread Song Liu
On Thu, Apr 18, 2024 at 11:56 PM Mike Rapoport  wrote:
>
> On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
> > >
> > > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > > > >
> > > > > > > I'm looking at execmem_types more as definition of the consumers, 
> > > > > > > maybe I
> > > > > > > should have named the enum execmem_consumer at the first place.
> > > > > >
> > > > > > I think looking at execmem_type from consumers' point of view adds
> > > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > > > kprobe,
> > > > > > and bpf (and maybe also module text) all have the same requirements.
> > > > > > Did I miss something?
> > > > >
> > > > > It's enough to have one architecture with different constraints for
> > > > > kprobes and bpf to warrant a type for each.
> > > >
> > > > AFAICT, some of these constraints can be changed without too much work.
> > >
> > > But why?
> > > I honestly don't understand what are you trying to optimize here. A few
> > > lines of initialization in execmem_info?
> >
> > IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
> > harder for bpf and kprobe to share the same ROX page. In many use cases,
> > a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
> > module text. It is not efficient if we have to allocate separate pages for 
> > each
> > of these use cases. If this is not a problem, the current approach works.
>
> The caching of large ROX pages does not need to be per type.
>
> In the POC I've posted for caching of large ROX pages on x86 [1], the cache is
> global and to make kprobes and bpf use it it's enough to set a flag in
> execmem_info.
>
> [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org

For the ROX to work, we need different users (module text, kprobe, etc.) to have
the same execmem_range. From [1]:

static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
{
        ...
        p = __execmem_cache_alloc(size);
        if (p)
                return p;
        err = execmem_cache_populate(range, size);
        ...
}

We are calling __execmem_cache_alloc() without range. For this to work,
we can only call execmem_cache_alloc() with one execmem_range.

Did I miss something?

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
>
> On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > >
> > > > > I'm looking at execmem_types more as definition of the consumers, 
> > > > > maybe I
> > > > > should have named the enum execmem_consumer at the first place.
> > > >
> > > > I think looking at execmem_type from consumers' point of view adds
> > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > kprobe,
> > > > and bpf (and maybe also module text) all have the same requirements.
> > > > Did I miss something?
> > >
> > > It's enough to have one architecture with different constraints for kprobes
> > > and bpf to warrant a type for each.
> >
> > AFAICT, some of these constraints can be changed without too much work.
>
> But why?
> I honestly don't understand what are you trying to optimize here. A few
> lines of initialization in execmem_info?

IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
harder for bpf and kprobe to share the same ROX page. In many use cases,
a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
module text. It is not efficient if we have to allocate separate pages for each
of these use cases. If this is not a problem, the current approach works.

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
>
[...]
> >
> > Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> > really need
> > EXECMEM_ANYWHERE below?
> >
> > > >
> > > > * I'm not sure about BPF's requirements; it seems happy doing the same 
> > > > as
> > > >   modules.
> > >
> > > BPF are happy with vmalloc().
> > >
> > > > So if we *must* use a common execmem allocator, what we'd reall want is 
> > > > our own
> > > > types, e.g.
> > > >
> > > >   EXECMEM_ANYWHERE
> > > >   EXECMEM_NOPLT
> > > >   EXECMEM_PREL32
> > > >
> > > > ... and then we use those in arch code to implement module_alloc() and 
> > > > friends.
> > >
> > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > should have named the enum execmem_consumer at the first place.
> >
> > I think looking at execmem_type from consumers' point of view adds
> > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > and bpf (and maybe also module text) all have the same requirements.
> > Did I miss something?
>
> It's enough to have one architecture with different constraints for kprobes
> and bpf to warrant a type for each.
>

AFAICT, some of these constraints can be changed without too much work.

> Where do you see unnecessary complexity?
>
> > IOW, we have
> >
> > enum execmem_type {
> > EXECMEM_DEFAULT,
> > EXECMEM_TEXT,
> > EXECMEM_KPROBES = EXECMEM_TEXT,
> > EXECMEM_FTRACE = EXECMEM_TEXT,
> > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > _KPROBE, _FTRACE, _BPF */
> > EXECMEM_DATA,  /* rw */
> > EXECMEM_RO_DATA,
> > EXECMEM_RO_AFTER_INIT,
> > EXECMEM_TYPE_MAX,
> > };
> >
> > Does this make sense?
>
> How do you suggest to deal with e.g. riscv that has separate address spaces
> for modules, kprobes and bpf?

IIUC, modules and bpf use the same address space on riscv, while kprobes use
the vmalloc address space. I haven't tried this yet, but I think we can let
kprobes use the same space as modules and bpf, which is:

 |  -4 GB | 7fff |2 GB | modules, BPF

Did I get this right?

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-17 Thread Song Liu
On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport  wrote:
>
> On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > +/**
> > > > + * enum execmem_type - types of executable memory ranges
> > > > + *
> > > > + * There are several subsystems that allocate executable memory.
> > > > + * Architectures define different restrictions on placement,
> > > > + * permissions, alignment and other parameters for memory that can be 
> > > > used
> > > > + * by these subsystems.
> > > > + * Types in this enum identify subsystems that allocate executable 
> > > > memory
> > > > + * and let architectures define parameters for ranges suitable for
> > > > + * allocations by each subsystem.
> > > > + *
> > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types 
> > > > that
> > > > + * are not explcitly defined.
> > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > + * @EXECMEM_BPF: parameters for BPF
> > > > + * @EXECMEM_TYPE_MAX:
> > > > + */
> > > > +enum execmem_type {
> > > > + EXECMEM_DEFAULT,
> > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > + EXECMEM_KPROBES,
> > > > + EXECMEM_FTRACE,
> > > > + EXECMEM_BPF,
> > > > + EXECMEM_TYPE_MAX,
> > > > +};
> > >
> > > Can we please get a break-down of how all these types are actually
> > > different from one another?
> > >
> > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > mind) and has less strict placement constraints for some of them?
> >
> > Yeah, and really I'd *much* rather deal with that in arch code, as I have 
> > said
> > several times.
> >
> > For arm64 we have two basic restrictions:
> >
> > 1) Direct branches can go +/-128M
> >We can expand this range by having direct branches go to PLTs, at a
> >performance cost.
> >
> > 2) PREL32 relocations can go +/-2G
> >We cannot expand this further.
> >
> > * We don't need to allocate memory for ftrace. We do not use trampolines.
> >
> > * Kprobes XOL areas don't care about either of those; we don't place any
> >   PC-relative instructions in those. Maybe we want to in future.
> >
> > * Modules care about both; we'd *prefer* to place them within +/-128M of all
> >   other kernel/module code, but if there's no space we can use PLTs and expand
> >   that to +/-2G. Since modules can reference other modules, that ends up
> >   actually being halved, and modules have to fit within some 2G window that
> >   also covers the kernel.

Is +/- 2G enough for all realistic use cases? If so, I guess we don't really
need EXECMEM_ANYWHERE below?

> >
> > * I'm not sure about BPF's requirements; it seems happy doing the same as
> >   modules.
>
> BPF are happy with vmalloc().
>
> > So if we *must* use a common execmem allocator, what we'd reall want is our 
> > own
> > types, e.g.
> >
> >   EXECMEM_ANYWHERE
> >   EXECMEM_NOPLT
> >   EXECMEM_PREL32
> >
> > ... and then we use those in arch code to implement module_alloc() and 
> > friends.
>
> I'm looking at execmem_types more as definition of the consumers, maybe I
> should have named the enum execmem_consumer at the first place.

I think looking at execmem_type from consumers' point of view adds
unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
and bpf (and maybe also module text) all have the same requirements.
Did I miss something?

IOW, we have

enum execmem_type {
        EXECMEM_DEFAULT,
        EXECMEM_TEXT,
        EXECMEM_KPROBES = EXECMEM_TEXT,
        EXECMEM_FTRACE = EXECMEM_TEXT,
        EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without _KPROBE, _FTRACE, _BPF */
        EXECMEM_DATA,   /* rw */
        EXECMEM_RO_DATA,
        EXECMEM_RO_AFTER_INIT,
        EXECMEM_TYPE_MAX,
};

Does this make sense?

Thanks,
Song



Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-06 Thread Song Liu
Hi Calvin,

It is great to hear from you! :)

On Wed, Mar 6, 2024 at 3:23 PM Calvin Owens  wrote:
>
> On Wednesday 03/06 at 13:34 -0800, Luis Chamberlain wrote:
> > On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote:
> > > Hello all,
> > >
> > > This patchset makes it possible to use bpftrace with kprobes on kernels
> > > built without loadable module support.
> >
> > This is a step in the right direction for another reason: clearly the
> > module_alloc() is not about modules, and we have special reasons for it
> > now beyond modules. The effort to share a generalized huge page for
> > these things is also another reason for some of this but that is more
> > long term.
> >
> > I'm all for minor changes here so to avoid regressions but it seems a
> > rename is in order -- if we're going to all this might as well do it
> > now. And for that I'd just like to ask you paint the bikeshed with
> > Song Liu as he's been the one slowly making way to help us get there
> > with the "module: replace module_layout with module_memory",
> > and Mike Rapoport as he's had some follow up attempts [0]. As I see it,
> > the EXECMEM stuff would be what we use instead then. Mike kept the
> > module_alloc() and the execmem was just a wrapper but your move of the
> > arch stuff makes sense as well and I think would complement his series
> > nicely.
>
> I apologize for missing that. I think these are the four most recent
> versions of the different series referenced from that LWN link:
>
>   a) https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org/
>   b) https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org/
>   c) https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/
>   d) 
> https://lore.kernel.org/all/20201120202426.18009-1-rick.p.edgeco...@intel.com/
>
> Song and Mike, please correct me if I'm wrong, but I think what I've
> done here (see [1], sorry for not adding you initially) is compatible
> with everything both of you have recently proposed above. How do you
> feel about this as a first step?

I agree that the work here is compatible with other efforts. I have no
objection to making this the first step.

>
> For naming, execmem_alloc() seems reasonable to me? I have no strong
> feelings at all, I'll just use that going forward unless somebody else
> expresses an opinion.

I am not good at naming things. No objection from me to "execmem_alloc".

Thanks,
Song



Re: [PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well

2023-10-27 Thread Song Liu
On Fri, Oct 27, 2023 at 4:31 PM Andrii Nakryiko  wrote:
>
> Recent changes to count number of matching symbols when creating
> a kprobe event failed to take into account kernel modules. As such, it
> breaks kprobes on kernel module symbols, by assuming there is no match.
>
> Fix this by calling module_kallsyms_on_each_symbol() in addition to
> kallsyms_on_each_match_symbol() to perform a proper counting.
>
> Cc: Francis Laniel 
> Cc: sta...@vger.kernel.org
> Cc: Masami Hiramatsu 
> Cc: Steven Rostedt 
> Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches 
> several symbols")
> Signed-off-by: Andrii Nakryiko 

Acked-by: Song Liu 



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-23 Thread Song Liu
On Sat, Sep 23, 2023 at 8:39 AM Mike Rapoport  wrote:
>
> On Thu, Sep 21, 2023 at 03:34:18PM -0700, Song Liu wrote:
> > On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
> > >
> >
> > [...]
> >
> > > diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> > > index 42215f9404af..db5561d0c233 100644
> > > --- a/arch/s390/kernel/module.c
> > > +++ b/arch/s390/kernel/module.c
> > > @@ -21,6 +21,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
> > >  #ifdef CONFIG_FUNCTION_TRACER
> > >  void module_arch_cleanup(struct module *mod)
> > >  {
> > > -   module_memfree(mod->arch.trampolines_start);
> > > +   execmem_free(mod->arch.trampolines_start);
> > >  }
> > >  #endif
> > >
> > > @@ -510,7 +511,7 @@ static int 
> > > module_alloc_ftrace_hotpatch_trampolines(struct module *me,
> > >
> > > size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
> > > numpages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > -   start = module_alloc(numpages * PAGE_SIZE);
> > > +   start = execmem_text_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);
> >
> > This should be EXECMEM_MODULE_TEXT?
>
> This is an ftrace trampoline, so I think it should be FTRACE type of
> allocation.

Yeah, I was aware of the ftrace trampoline. My point was that the ftrace
trampoline doesn't seem to have any special requirements. Therefore, it is
probably not necessary to have a separate type just for it.

AFAICT, kprobe, ftrace, and BPF (JIT and trampoline) can share the same
execmem_type. We may need some work for some archs, but nothing is
fundamentally different among these.

Thanks,
Song



Re: [PATCH v3 06/13] mm/execmem: introduce execmem_data_alloc()

2023-09-22 Thread Song Liu
On Fri, Sep 22, 2023 at 12:17 AM Christophe Leroy
 wrote:
>
>
>
> On 22/09/2023 at 00:52, Song Liu wrote:
> > On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
> >>
> > [...]
> >> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> >> index 519bdfdca595..09d45ac786e9 100644
> >> --- a/include/linux/execmem.h
> >> +++ b/include/linux/execmem.h
> >> @@ -29,6 +29,7 @@
> >>* @EXECMEM_KPROBES: parameters for kprobes
> >>* @EXECMEM_FTRACE: parameters for ftrace
> >>* @EXECMEM_BPF: parameters for BPF
> >> + * @EXECMEM_MODULE_DATA: parameters for module data sections
> >>* @EXECMEM_TYPE_MAX:
> >>*/
> >>   enum execmem_type {
> >> @@ -37,6 +38,7 @@ enum execmem_type {
> >>  EXECMEM_KPROBES,
> >>  EXECMEM_FTRACE,
> >
> > In longer term, I think we can improve the JITed code and merge
> > kprobe/ftrace/bpf. to use the same ranges. Also, do we need special
> > setting for FTRACE? If not, let's just remove it.
>
> How can we do that? Some platforms like powerpc require executable
> memory for BPF and non-exec mem for KPROBE so it can't be in the same
> area/ranges.

Hmm... non-exec mem for kprobes?

        if (strict_module_rwx_enabled())
                execmem_params.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
        else
                execmem_params.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_EXEC;

Do you mean the latter case?

Thanks,
Song



Re: [PATCH v3 06/13] mm/execmem: introduce execmem_data_alloc()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
>
[...]
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> index 519bdfdca595..09d45ac786e9 100644
> --- a/include/linux/execmem.h
> +++ b/include/linux/execmem.h
> @@ -29,6 +29,7 @@
>   * @EXECMEM_KPROBES: parameters for kprobes
>   * @EXECMEM_FTRACE: parameters for ftrace
>   * @EXECMEM_BPF: parameters for BPF
> + * @EXECMEM_MODULE_DATA: parameters for module data sections
>   * @EXECMEM_TYPE_MAX:
>   */
>  enum execmem_type {
> @@ -37,6 +38,7 @@ enum execmem_type {
> EXECMEM_KPROBES,
> EXECMEM_FTRACE,

In the longer term, I think we can improve the JITed code and merge
kprobe/ftrace/bpf to use the same ranges. Also, do we need a special
setting for FTRACE? If not, let's just remove it.

> EXECMEM_BPF,
> +   EXECMEM_MODULE_DATA,
> EXECMEM_TYPE_MAX,
>  };

Overall, it is great that kprobe/ftrace/bpf no longer depend on modules.

OTOH, I think we should merge execmem_type and the existing mod_mem_type.
Otherwise, we still need to handle page permissions in multiple places.
What is our plan for that?
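One possible direction, just as a sketch (this assumes the mod_mem_type enum
from the module_memory rework, and glosses over the RO/RO_AFTER_INIT cases):

  static enum execmem_type mod_mem_to_execmem(enum mod_mem_type type)
  {
          switch (type) {
          case MOD_TEXT:
          case MOD_INIT_TEXT:
                  return EXECMEM_MODULE_TEXT;
          default:
                  /* DATA/RODATA/RO_AFTER_INIT need more than one bucket */
                  return EXECMEM_MODULE_DATA;
          }
  }

but that only really works once execmem knows about the different
permissions, which is what I am asking about.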

Thanks,
Song


>
> @@ -107,6 +109,23 @@ struct execmem_params *execmem_arch_params(void);
>   */
>  void *execmem_text_alloc(enum execmem_type type, size_t size);
>
> +/**
> + * execmem_data_alloc - allocate memory for data coupled to code
> + * @type: type of the allocation
> + * @size: how many bytes of memory are required
> + *
> + * Allocates memory that will contain data coupled with executable code,
> + * like data sections in kernel modules.
> + *
> + * The memory will have protections defined by architecture.
> + *
> + * The allocated memory will reside in an area that does not impose
> + * restrictions on the addressing modes.
> + *
> + * Return: a pointer to the allocated memory or %NULL
> + */
> +void *execmem_data_alloc(enum execmem_type type, size_t size);
> +
>  /**
>   * execmem_free - free executable memory
>   * @ptr: pointer to the memory that should be freed
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index c4146bfcd0a7..2ae83a6abf66 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1188,25 +1188,16 @@ void __weak module_arch_freeing_init(struct module 
> *mod)
>  {
>  }
>
> -static bool mod_mem_use_vmalloc(enum mod_mem_type type)
> -{
> -   return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
> -   mod_mem_type_is_core_data(type);
> -}
> -
>  static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
>  {
> -   if (mod_mem_use_vmalloc(type))
> -   return vzalloc(size);
> +   if (mod_mem_type_is_data(type))
> +   return execmem_data_alloc(EXECMEM_MODULE_DATA, size);
> return execmem_text_alloc(EXECMEM_MODULE_TEXT, size);
>  }
>
>  static void module_memory_free(void *ptr, enum mod_mem_type type)
>  {
> -   if (mod_mem_use_vmalloc(type))
> -   vfree(ptr);
> -   else
> -   execmem_free(ptr);
> +   execmem_free(ptr);
>  }
>
>  static void free_mod_mem(struct module *mod)
> diff --git a/mm/execmem.c b/mm/execmem.c
> index abcbd07e05ac..aeff85261360 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -53,11 +53,23 @@ static void *execmem_alloc(size_t size, struct 
> execmem_range *range)
> return kasan_reset_tag(p);
>  }
>
> +static inline bool execmem_range_is_data(enum execmem_type type)
> +{
> +   return type == EXECMEM_MODULE_DATA;
> +}
> +
>  void *execmem_text_alloc(enum execmem_type type, size_t size)
>  {
>        return execmem_alloc(size, &execmem_params.ranges[type]);
>  }
>
> +void *execmem_data_alloc(enum execmem_type type, size_t size)
> +{
> +   WARN_ON_ONCE(!execmem_range_is_data(type));
> +
> +       return execmem_alloc(size, &execmem_params.ranges[type]);
> +}
> +
>  void execmem_free(void *ptr)
>  {
> /*
> @@ -93,7 +105,10 @@ static void execmem_init_missing(struct execmem_params *p)
>                struct execmem_range *r = &p->ranges[i];
>
> if (!r->start) {
> -   r->pgprot = default_range->pgprot;
> +   if (execmem_range_is_data(i))
> +   r->pgprot = PAGE_KERNEL;
> +   else
> +   r->pgprot = default_range->pgprot;
> r->alignment = default_range->alignment;
> r->start = default_range->start;
> r->end = default_range->end;
> --
> 2.39.2
>



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
>

[...]

> diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> index 42215f9404af..db5561d0c233 100644
> --- a/arch/s390/kernel/module.c
> +++ b/arch/s390/kernel/module.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
>  #ifdef CONFIG_FUNCTION_TRACER
>  void module_arch_cleanup(struct module *mod)
>  {
> -   module_memfree(mod->arch.trampolines_start);
> +   execmem_free(mod->arch.trampolines_start);
>  }
>  #endif
>
> @@ -510,7 +511,7 @@ static int 
> module_alloc_ftrace_hotpatch_trampolines(struct module *me,
>
> size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
> numpages = DIV_ROUND_UP(size, PAGE_SIZE);
> -   start = module_alloc(numpages * PAGE_SIZE);
> +   start = execmem_text_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);

This should be EXECMEM_MODULE_TEXT?

Thanks,
Song

> if (!start)
> return -ENOMEM;
> set_memory_rox((unsigned long)start, numpages);
[...]



Re: [PATCH v3 09/13] powerpc: extend execmem_params for kprobes allocations

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
>
[...]
> @@ -135,5 +138,13 @@ struct execmem_params __init *execmem_arch_params(void)
>
> range->pgprot = prot;
>
> +   execmem_params.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
> +   execmem_params.ranges[EXECMEM_KPROBES].start = VMALLOC_END;

.end = VMALLOC_END.

Thanks,
Song

> +
> +   if (strict_module_rwx_enabled())
> +   execmem_params.ranges[EXECMEM_KPROBES].pgprot = 
> PAGE_KERNEL_ROX;
> +   else
> +   execmem_params.ranges[EXECMEM_KPROBES].pgprot = 
> PAGE_KERNEL_EXEC;
> +
>        return &execmem_params;
>  }
> --
> 2.39.2
>
>



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
>
[...]
> +
> +/**
> + * enum execmem_type - types of executable memory ranges
> + *
> + * There are several subsystems that allocate executable memory.
> + * Architectures define different restrictions on placement,
> + * permissions, alignment and other parameters for memory that can be used
> + * by these subsystems.
> + * Types in this enum identify subsystems that allocate executable memory
> + * and let architectures define parameters for ranges suitable for
> + * allocations by each subsystem.
> + *
> + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> + * are not explcitly defined.
> + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> + * @EXECMEM_KPROBES: parameters for kprobes
> + * @EXECMEM_FTRACE: parameters for ftrace
> + * @EXECMEM_BPF: parameters for BPF
> + * @EXECMEM_TYPE_MAX:
> + */
> +enum execmem_type {
> +   EXECMEM_DEFAULT,

I found EXECMEM_DEFAULT more confusing than helpful.

Song

> +   EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> +   EXECMEM_KPROBES,
> +   EXECMEM_FTRACE,
> +   EXECMEM_BPF,
> +   EXECMEM_TYPE_MAX,
> +};
> +
[...]



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
>
[...]
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static void *execmem_alloc(size_t size)
> +{
> +   return module_alloc(size);
> +}
> +
> +void *execmem_text_alloc(enum execmem_type type, size_t size)
> +{
> +   return execmem_alloc(size);
> +}

execmem_text_alloc (and later execmem_data_alloc) both take "type" as
input. I guess we can just use execmem_alloc(type, size) for everything?

Thanks,
Song

> +
> +void execmem_free(void *ptr)
> +{
> +   /*
> +* This memory may be RO, and freeing RO memory in an interrupt is not
> +* supported by vmalloc.
> +*/
> +   WARN_ON(in_interrupt());
> +   vfree(ptr);
> +}
> --
> 2.39.2
>



Re: [PATCH v4 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-20 Thread Song Liu



> On Apr 20, 2021, at 10:31 AM, Jiri Olsa  wrote:
> 
> On Mon, Apr 19, 2021 at 01:36:48PM -0700, Song Liu wrote:
> 
> SNIP
> 
>>  if (stat_config.initial_delay < 0) {
>> @@ -784,11 +790,11 @@ static int __run_perf_stat(int argc, const char 
>> **argv, int run_idx)
>>  if (affinity__setup() < 0)
>>  return -1;
>> 
>> -if (target__has_bpf()) {
>> -evlist__for_each_entry(evsel_list, counter) {
>> -if (bpf_counter__load(counter, ))
>> -return -1;
>> -}
>> +evlist__for_each_entry(evsel_list, counter) {
>> +if (bpf_counter__load(counter, ))
>> +return -1;
>> +if (!evsel__is_bpf(counter))
>> +all_counters_use_bpf = false;
> 
> could be done in bpf_counter__load, check below:
> 
>>  }
>> 
>>  evlist__for_each_cpu (evsel_list, i, cpu) {
>> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
>> index 5de991ab46af9..33b1888103dfa 100644
>> --- a/tools/perf/util/bpf_counter.c
>> +++ b/tools/perf/util/bpf_counter.c
>> @@ -790,7 +790,8 @@ int bpf_counter__load(struct evsel *evsel, struct target 
>> *target)
>> {
>>  if (target->bpf_str)
>>  evsel->bpf_counter_ops = _program_profiler_ops;
>> -else if (target->use_bpf)
>> +else if (target->use_bpf ||
>> + evsel__match_bpf_counter_events(evsel->name))
>>  evsel->bpf_counter_ops = _ops;
> 
> with:
>   else
>   all_counters_use_bpf = false;
> 
> I was also thinking of moving it to evlist, but it's stat specific,
> so I think it's good as static.. thanks for changing the implementation

Hmm... then we need to somehow make all_counters_use_bpf visible in
bpf_counter.c, which won't be very clean. Also, since this is stat 
specific, I guess it is better to keep it inside builtin-stat.c?
The runtime overhead should be minimal. 

Thanks,
Song



[PATCH v4 4/4] perf-stat: introduce ':b' modifier

2021-04-19 Thread Song Liu
Introduce the 'b' modifier to the event parser, which means: use a BPF
program to manage this event. This is the same as the --bpf-counters option,
but it only applies to this event. For example,

  perf stat -e cycles:b,cs   # use bpf for cycles, but not cs
  perf stat -e cycles,cs --bpf-counters  # use bpf for both cycles and cs

Suggested-by: Jiri Olsa 
Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c  | 2 +-
 tools/perf/util/evsel.h| 1 +
 tools/perf/util/parse-events.c | 8 +++-
 tools/perf/util/parse-events.l | 2 +-
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 33b1888103dfa..f179f57430253 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -790,7 +790,7 @@ int bpf_counter__load(struct evsel *evsel, struct target *target)
 {
        if (target->bpf_str)
                evsel->bpf_counter_ops = &bpf_program_profiler_ops;
-       else if (target->use_bpf ||
+       else if (target->use_bpf || evsel->bpf_counter ||
                 evsel__match_bpf_counter_events(evsel->name))
                evsel->bpf_counter_ops = &bperf_ops;
 
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ce4b629d659c2..8f66cdcb338d0 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -82,6 +82,7 @@ struct evsel {
boolauto_merge_stats;
boolcollect_stat;
boolweak_group;
+   boolbpf_counter;
int bpf_fd;
struct bpf_object   *bpf_obj;
};
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 8123d218ad17c..46ebd269a98d1 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1804,6 +1804,7 @@ struct event_modifier {
int pinned;
int weak;
int exclusive;
+   int bpf_counter;
 };
 
 static int get_event_modifier(struct event_modifier *mod, char *str,
@@ -1824,6 +1825,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
int exclude = eu | ek | eh;
int exclude_GH = evsel ? evsel->exclude_GH : 0;
int weak = 0;
+   int bpf_counter = 0;
 
memset(mod, 0, sizeof(*mod));
 
@@ -1867,6 +1869,8 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
exclusive = 1;
} else if (*str == 'W') {
weak = 1;
+   } else if (*str == 'b') {
+   bpf_counter = 1;
} else
break;
 
@@ -1898,6 +1902,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
mod->sample_read = sample_read;
mod->pinned = pinned;
mod->weak = weak;
+   mod->bpf_counter = bpf_counter;
mod->exclusive = exclusive;
 
return 0;
@@ -1912,7 +1917,7 @@ static int check_modifier(char *str)
char *p = str;
 
/* The sizeof includes 0 byte as well. */
-   if (strlen(str) > (sizeof("ukhGHpppPSDIWe") - 1))
+   if (strlen(str) > (sizeof("ukhGHpppPSDIWeb") - 1))
return -1;
 
while (*p) {
@@ -1953,6 +1958,7 @@ int parse_events__modifier_event(struct list_head *list, 
char *str, bool add)
evsel->sample_read = mod.sample_read;
evsel->precise_max = mod.precise_max;
evsel->weak_group  = mod.weak;
+   evsel->bpf_counter = mod.bpf_counter;
 
if (evsel__is_group_leader(evsel)) {
evsel->core.attr.pinned = mod.pinned;
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 0b36285a9435d..fb8646cc3e834 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -210,7 +210,7 @@ name_tag
[\'][a-zA-Z_*?\[\]][a-zA-Z0-9_*?\-,\.\[\]:=]*[\']
 name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?.:]*
 drv_cfg_term   [a-zA-Z0-9_\.]+(=[a-zA-Z0-9_*?\.:]+)?
 /* If you add a modifier you need to update check_modifier() */
-modifier_event [ukhpPGHSDIWe]+
+modifier_event [ukhpPGHSDIWeb]+
 modifier_bp[rwx]{1,3}
 
 %%
-- 
2.30.2



[PATCH v4 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-19 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses the
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. Events whose name is in the option will use
BPF.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 40 +++---
 tools/perf/util/bpf_counter.c  |  3 +-
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 ++
 tools/perf/util/evsel.h|  8 ++
 tools/perf/util/target.h   |  5 
 7 files changed, 61 insertions(+), 23 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 6ec5960b08c3d..f10e24da23e90 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2a2c15cac80a3..157105e792eaf 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -160,6 +160,7 @@ static const char *smi_cost_attrs = {
 };
 
 static struct evlist   *evsel_list;
+static bool all_counters_use_bpf = true;
 
 static struct target target = {
.uid= UINT_MAX,
@@ -399,6 +400,9 @@ static int read_affinity_counters(struct timespec *rs)
struct affinity affinity;
int i, ncpus, cpu;
 
+   if (all_counters_use_bpf)
+   return 0;
+
if (affinity__setup() < 0)
return -1;
 
@@ -413,6 +417,8 @@ static int read_affinity_counters(struct timespec *rs)
evlist__for_each_entry(evsel_list, counter) {
if (evsel__cpu_iter_skip(counter, cpu))
continue;
+   if (evsel__is_bpf(counter))
+   continue;
if (!counter->err) {
counter->err = read_counter_cpu(counter, rs,

counter->cpu_iter - 1);
@@ -429,6 +435,9 @@ static int read_bpf_map_counters(void)
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!evsel__is_bpf(counter))
+   continue;
+
err = bpf_counter__read(counter);
if (err)
return err;
@@ -439,14 +448,10 @@ static int read_bpf_map_counters(void)
 static void read_counters(struct timespec *rs)
 {
struct evsel *counter;
-   int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf())
-   err = read_bpf_map_counters();
-   else
-   err = read_affinity_counters(rs);
-   if (err < 0)
+   if (read_bpf_map_counters() ||
+   read_affinity_counters(rs))
return;
}
 
@@ -535,12 +540,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf()) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel__is_bpf(evsel))
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +790,11 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup() < 0)
return -1;
 
-   if (target__has_bpf()) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, &target))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, &target))
+   return -1;
+   if (!evsel__is_bpf(counter))
+   all_counters_use_bpf = false;
}
 

[PATCH v4 2/4] perf bpf: check perf_attr_map is compatible with the perf binary

2021-04-19 Thread Song Liu
perf_attr_map could be shared among different versions of the perf binary. Add
bperf_attr_map_compatible() to check whether the existing attr_map is
compatible with the current perf binary.

Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index be484ddbbd5be..5de991ab46af9 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -312,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, &map_info, &map_info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -346,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



[PATCH v4 1/4] perf util: move bpf_perf definitions to a libperf header

2021-04-19 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPF_PERF_DEFAULT_ATTR_MAP_PATH to
bpf_perf.h for other tools to use.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/util/bpf_counter.c  | 27 +++---
 2 files changed, 34 insertions(+), 24 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

diff --git a/tools/lib/perf/include/perf/bpf_perf.h 
b/tools/lib/perf/include/perf/bpf_perf.h
new file mode 100644
index 0..e7cf6ba7b674b
--- /dev/null
+++ b/tools/lib/perf/include/perf/bpf_perf.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPF_PERF_H
+#define __LIBPERF_BPF_PERF_H
+
+#include   /* for __u32 */
+
+/*
+ * bpf_perf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* default attr_map name */
+#define BPF_PERF_DEFAULT_ATTR_MAP_PATH "perf_attr_map"
+
+#endif /* __LIBPERF_BPF_PERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..be484ddbbd5be 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -341,8 +320,8 @@ static int bperf_lock_attr_map(struct target *target)
if (target->attr_map) {
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
-   scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+   scnprintf(path, PATH_MAX, "%s/fs/bpf/%s", sysfs__mountpoint(),
+ BPF_PERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
-- 
2.30.2
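
For reference, a minimal sketch of how an external tool could follow this
protocol once the definitions are exported in perf/bpf_perf.h: take the
flock() on the pinned attr_map and look up the entry keyed by struct
perf_event_attr. This is an illustration only, not code from the patch; the
"/sys/fs/bpf/perf_attr_map" path assumes sysfs is mounted at /sys, and real
code would also create the leader/follower programs on a lookup miss.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/file.h>
#include <linux/perf_event.h>
#include <bpf/bpf.h>
#include <perf/bpf_perf.h>

static int lookup_leader(struct perf_event_attr *attr,
                         struct perf_event_attr_map_entry *entry)
{
        /* assumed pin path: sysfs mount + "/fs/bpf/" + BPF_PERF_DEFAULT_ATTR_MAP_PATH */
        int fd = bpf_obj_get("/sys/fs/bpf/perf_attr_map");
        int err;

        if (fd < 0)
                return fd;
        if (flock(fd, LOCK_EX)) {       /* serialize access to the attr_map, as perf does */
                close(fd);
                return -1;
        }
        /* the key must match the attr the leader prog was set up with */
        err = bpf_map_lookup_elem(fd, attr, entry);
        flock(fd, LOCK_UN);
        close(fd);
        return err;
}

int main(void)
{
        struct perf_event_attr attr = {
                .size   = sizeof(attr),
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
        };
        struct perf_event_attr_map_entry entry;

        if (!lookup_leader(&attr, &entry))
                printf("leader link id %u, diff map id %u\n",
                       entry.link_id, entry.diff_map_id);
        return 0;
}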



[PATCH v4 0/4] perf util: bpf perf improvements

2021-04-19 Thread Song Liu
This patch set improves bpf_perf (perf-stat --bpf-counter) by
 1) exposing key definitions to a libperf header;
 2) adding compatibility check for perf_attr_map;
 3) introducing config stat.bpf-counter-events.
 4) introducing the 'b' modifier to the event parser.

Changes v3 => v4:
1. Improve the logic that decides when to skip read_affinity_counters().
   (Jiri)
2. Clean up a condition in bpf_counters.c:read_counters(). (Jiri)

Changes v2 => v3:
1. Add 'b' modifier. (Jiri)
2. Allow configuring stat.bpf-counter-events with any event name (instead
  of limiting to hardware events). (Jiri)

Changes v1 => v2:
1. Separate 2/3 from 1/3. (Jiri)
2. Rename bperf.h to bpf_perf.h. (Jiri)
3. Other small fixes/optimizations. (Jiri)

Song Liu (4):
  perf util: move bpf_perf definitions to a libperf header
  perf bpf: check perf_attr_map is compatible with the perf binary
  perf-stat: introduce config stat.bpf-counter-events
  perf-stat: introduce ':b' modifier

 tools/lib/perf/include/perf/bpf_perf.h | 31 
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 40 -
 tools/perf/util/bpf_counter.c  | 49 +-
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 
 tools/perf/util/evsel.h|  9 +
 tools/perf/util/parse-events.c |  8 -
 tools/perf/util/parse-events.l |  2 +-
 tools/perf/util/target.h   |  5 ---
 10 files changed, 123 insertions(+), 49 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

--
2.30.2


Re: [PATCH v3 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-19 Thread Song Liu



> On Apr 19, 2021, at 7:26 AM, Jiri Olsa  wrote:
> 
> On Fri, Apr 16, 2021 at 03:13:24PM -0700, Song Liu wrote:
> 
> SNIP
> 
>> +/*
>> + * Returns:
>> + * 0   if all events use BPF;
>> + * 1   if some events do NOT use BPF;
>> + * < 0 on errors;
>> + */
>> static int read_bpf_map_counters(void)
>> {
>> +bool has_none_bpf_events = false;
>>  struct evsel *counter;
>>  int err;
>> 
>>  evlist__for_each_entry(evsel_list, counter) {
>> +if (!evsel__is_bpf(counter)) {
>> +has_none_bpf_events = true;
>> +continue;
>> +}
>>  err = bpf_counter__read(counter);
>>  if (err)
>>  return err;
>>  }
>> -return 0;
>> +return has_none_bpf_events ? 1 : 0;
>> }
>> 
>> static void read_counters(struct timespec *rs)
>> @@ -442,9 +455,10 @@ static void read_counters(struct timespec *rs)
>>  int err;
>> 
>>  if (!stat_config.stop_read_counter) {
>> -if (target__has_bpf())
>> -err = read_bpf_map_counters();
>> -else
>> +err = read_bpf_map_counters();
>> +if (err < 0)
>> +return;
>> +if (err)
>>  err = read_affinity_counters(rs);
> 
> this part is confusing for me.. I understand we don't want to enter
> read_affinity_counters when there's no bpf counter, so we don't set
> affinities in vain.. but there must be better way ;-)
> 
>> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
>> index 5de991ab46af9..3189b63714371 100644
>> --- a/tools/perf/util/bpf_counter.c
>> +++ b/tools/perf/util/bpf_counter.c
>> @@ -792,6 +792,8 @@ int bpf_counter__load(struct evsel *evsel, struct target 
>> *target)
>>  evsel->bpf_counter_ops = _program_profiler_ops;
>>  else if (target->use_bpf)
>>  evsel->bpf_counter_ops = _ops;
>> +else if (evsel__match_bpf_counter_events(evsel->name))
>> +evsel->bpf_counter_ops = _ops;
> 
> please put this with the target->use_bpf check,
> it seems like it's another thing

Thanks for the review. Fixing these in v4. 

Song



Re: [PATCH v3 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-19 Thread Song Liu



> On Apr 17, 2021, at 9:45 AM, Namhyung Kim  wrote:
> 
> Hi Song,
> 
> On Sat, Apr 17, 2021 at 7:13 AM Song Liu  wrote:
>> 
>> Currently, to use BPF to aggregate perf event counters, the user uses
>> --bpf-counters option. Enable "use bpf by default" events with a config
>> option, stat.bpf-counter-events. Events with name in the option will use
>> BPF.
>> 
>> This also enables mixed BPF event and regular event in the same session.
>> For example:
>> 
>>   perf config stat.bpf-counter-events=instructions
>>   perf stat -e instructions,cs
>> 
>> The second command will use BPF for "instructions" but not "cs".
>> 
>> Signed-off-by: Song Liu 
>> ---
>> @@ -535,12 +549,13 @@ static int enable_counters(void)
>>struct evsel *evsel;
>>int err;
>> 
>> -   if (target__has_bpf()) {
>> -   evlist__for_each_entry(evsel_list, evsel) {
>> -   err = bpf_counter__enable(evsel);
>> -   if (err)
>> -   return err;
>> -   }
>> +   evlist__for_each_entry(evsel_list, evsel) {
>> +   if (!evsel__is_bpf(evsel))
>> +   continue;
>> +
>> +   err = bpf_counter__enable(evsel);
>> +   if (err)
>> +   return err;
> 
> I just realized it doesn't have a disable counterpart.

I guess it is not really necessary for perf-stat? It is probably good
to have it though. How about we do it in a follow-up patch?

[...]

>> +   if (!evsel__bpf_counter_events)
>> +   return false;
>> +
>> +   ptr = strstr(evsel__bpf_counter_events, name);
>> +   name_len = strlen(name);
>> +
>> +   /* check name matches a full token in evsel__bpf_counter_events */
>> +   match = (ptr != NULL) &&
>> +   ((ptr == evsel__bpf_counter_events) || (*(ptr - 1) == ',')) 
>> &&
>> +   ((*(ptr + name_len) == ',') || (*(ptr + name_len) == '\0'));
> 
> I'm not sure we have an event name which is a substring of another.
> Maybe it can retry if it fails to match.

We have ref-cycles and cycles. And some raw events may be substrings of others?

Thanks,
Song

[...]
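
A standalone illustration of the full-token check being discussed (the helper
and main() below are made up for this example; the real check lives in
evsel__match_bpf_counter_events()). "cycles" must not match inside
"ref-cycles", so a strstr() hit only counts when it is bounded by the start of
the string or a ',' on the left and a ',' or '\0' on the right:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* toy version of the quoted logic: match only whole ','-separated tokens */
static bool match_full_token(const char *list, const char *name)
{
        size_t name_len = strlen(name);
        const char *ptr = list ? strstr(list, name) : NULL;

        /* like the patch, this only looks at the first strstr() hit (no retry) */
        return ptr &&
               (ptr == list || *(ptr - 1) == ',') &&
               (*(ptr + name_len) == ',' || *(ptr + name_len) == '\0');
}

int main(void)
{
        const char *events = "ref-cycles,instructions";

        printf("%d %d\n",
               match_full_token(events, "cycles"),      /* 0: substring only */
               match_full_token(events, "ref-cycles")); /* 1: full token */
        return 0;
}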



[PATCH v3 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-16 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. Events with name in the option will use
BPF.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 43 +-
 tools/perf/util/bpf_counter.c  |  2 ++
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 +
 tools/perf/util/evsel.h|  8 +
 tools/perf/util/target.h   |  5 ---
 7 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 6ec5960b08c3d..78afe13cd7d47 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common hardware events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2a2c15cac80a3..0c76271f3ef53 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -413,6 +413,8 @@ static int read_affinity_counters(struct timespec *rs)
evlist__for_each_entry(evsel_list, counter) {
if (evsel__cpu_iter_skip(counter, cpu))
continue;
+   if (evsel__is_bpf(counter))
+   continue;
if (!counter->err) {
counter->err = read_counter_cpu(counter, rs,

counter->cpu_iter - 1);
@@ -423,17 +425,28 @@ static int read_affinity_counters(struct timespec *rs)
return 0;
 }
 
+/*
+ * Returns:
+ * 0   if all events use BPF;
+ * 1   if some events do NOT use BPF;
+ * < 0 on errors;
+ */
 static int read_bpf_map_counters(void)
 {
+   bool has_none_bpf_events = false;
struct evsel *counter;
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!evsel__is_bpf(counter)) {
+   has_none_bpf_events = true;
+   continue;
+   }
err = bpf_counter__read(counter);
if (err)
return err;
}
-   return 0;
+   return has_none_bpf_events ? 1 : 0;
 }
 
 static void read_counters(struct timespec *rs)
@@ -442,9 +455,10 @@ static void read_counters(struct timespec *rs)
int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf())
-   err = read_bpf_map_counters();
-   else
+   err = read_bpf_map_counters();
+   if (err < 0)
+   return;
+   if (err)
err = read_affinity_counters(rs);
if (err < 0)
return;
@@ -535,12 +549,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf()) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel__is_bpf(evsel))
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +799,9 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup() < 0)
return -1;
 
-   if (target__has_bpf()) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, ))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, ))
+   return -1;
}
 
evlist__for_each_cpu (evsel_list, i, cpu) {
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 5de

[PATCH v3 4/4] perf-stat: introduce ':b' modifier

2021-04-16 Thread Song Liu
Introduce 'b' modifier to event parser, which means use BPF program to
manage this event. This is the same as --bpf-counters option, but only
applies to this event. For example,

  perf stat -e cycles:b,cs   # use bpf for cycles, but not cs
  perf stat -e cycles,cs --bpf-counters  # use bpf for both cycles and cs

Suggested-by: Jiri Olsa 
Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c  | 2 +-
 tools/perf/util/evsel.h| 1 +
 tools/perf/util/parse-events.c | 8 +++-
 tools/perf/util/parse-events.l | 2 +-
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 3189b63714371..7b4d511c7e6ec 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -790,7 +790,7 @@ int bpf_counter__load(struct evsel *evsel, struct target 
*target)
 {
if (target->bpf_str)
evsel->bpf_counter_ops = _program_profiler_ops;
-   else if (target->use_bpf)
+   else if (target->use_bpf || evsel->bpf_counter)
evsel->bpf_counter_ops = _ops;
else if (evsel__match_bpf_counter_events(evsel->name))
evsel->bpf_counter_ops = _ops;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ce4b629d659c2..8f66cdcb338d0 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -82,6 +82,7 @@ struct evsel {
boolauto_merge_stats;
boolcollect_stat;
boolweak_group;
+   boolbpf_counter;
int bpf_fd;
struct bpf_object   *bpf_obj;
};
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 8123d218ad17c..46ebd269a98d1 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1804,6 +1804,7 @@ struct event_modifier {
int pinned;
int weak;
int exclusive;
+   int bpf_counter;
 };
 
 static int get_event_modifier(struct event_modifier *mod, char *str,
@@ -1824,6 +1825,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
int exclude = eu | ek | eh;
int exclude_GH = evsel ? evsel->exclude_GH : 0;
int weak = 0;
+   int bpf_counter = 0;
 
memset(mod, 0, sizeof(*mod));
 
@@ -1867,6 +1869,8 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
exclusive = 1;
} else if (*str == 'W') {
weak = 1;
+   } else if (*str == 'b') {
+   bpf_counter = 1;
} else
break;
 
@@ -1898,6 +1902,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
mod->sample_read = sample_read;
mod->pinned = pinned;
mod->weak = weak;
+   mod->bpf_counter = bpf_counter;
mod->exclusive = exclusive;
 
return 0;
@@ -1912,7 +1917,7 @@ static int check_modifier(char *str)
char *p = str;
 
/* The sizeof includes 0 byte as well. */
-   if (strlen(str) > (sizeof("ukhGHpppPSDIWe") - 1))
+   if (strlen(str) > (sizeof("ukhGHpppPSDIWeb") - 1))
return -1;
 
while (*p) {
@@ -1953,6 +1958,7 @@ int parse_events__modifier_event(struct list_head *list, 
char *str, bool add)
evsel->sample_read = mod.sample_read;
evsel->precise_max = mod.precise_max;
evsel->weak_group  = mod.weak;
+   evsel->bpf_counter = mod.bpf_counter;
 
if (evsel__is_group_leader(evsel)) {
evsel->core.attr.pinned = mod.pinned;
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 0b36285a9435d..fb8646cc3e834 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -210,7 +210,7 @@ name_tag
[\'][a-zA-Z_*?\[\]][a-zA-Z0-9_*?\-,\.\[\]:=]*[\']
 name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?.:]*
 drv_cfg_term   [a-zA-Z0-9_\.]+(=[a-zA-Z0-9_*?\.:]+)?
 /* If you add a modifier you need to update check_modifier() */
-modifier_event [ukhpPGHSDIWe]+
+modifier_event [ukhpPGHSDIWeb]+
 modifier_bp[rwx]{1,3}
 
 %%
-- 
2.30.2



[PATCH v3 2/4] perf bpf: check perf_attr_map is compatible with the perf binary

2021-04-16 Thread Song Liu
perf_attr_map could be shared among different versions of the perf binary. Add
bperf_attr_map_compatible() to check whether the existing attr_map is
compatible with the current perf binary.

Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index be484ddbbd5be..5de991ab46af9 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -312,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, _info, _info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -346,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



[PATCH v3 1/4] perf util: move bpf_perf definitions to a libperf header

2021-04-16 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPF_PERF_DEFAULT_ATTR_MAP_PATH to
bpf_perf.h for other tools to use.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/util/bpf_counter.c  | 27 +++---
 2 files changed, 34 insertions(+), 24 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

diff --git a/tools/lib/perf/include/perf/bpf_perf.h 
b/tools/lib/perf/include/perf/bpf_perf.h
new file mode 100644
index 0..e7cf6ba7b674b
--- /dev/null
+++ b/tools/lib/perf/include/perf/bpf_perf.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPF_PERF_H
+#define __LIBPERF_BPF_PERF_H
+
+#include   /* for __u32 */
+
+/*
+ * bpf_perf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* default attr_map name */
+#define BPF_PERF_DEFAULT_ATTR_MAP_PATH "perf_attr_map"
+
+#endif /* __LIBPERF_BPF_PERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..be484ddbbd5be 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -341,8 +320,8 @@ static int bperf_lock_attr_map(struct target *target)
if (target->attr_map) {
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
-   scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+   scnprintf(path, PATH_MAX, "%s/fs/bpf/%s", sysfs__mountpoint(),
+ BPF_PERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
-- 
2.30.2



[PATCH v3 0/4] perf util: bpf perf improvements

2021-04-16 Thread Song Liu
This patch set improves bpf_perf (perf-stat --bpf-counter) by
  1) exposing key definitions to a libperf header;
  2) adding compatibility check for perf_attr_map;
  3) introducing config stat.bpf-counter-events.
  4) introducing the 'b' modifier to the event parser.

Changes v2 => v3:
1. Add 'b' modifier. (Jiri)
2. Allow configuring stat.bpf-counter-events with any event name (instead
   of limiting to hardware events). (Jiri)

Changes v1 => v2:
1. Separate 2/3 from 1/3. (Jiri)
2. Rename bperf.h to bpf_perf.h. (Jiri)
3. Other small fixes/optimizations. (Jiri)

Song Liu (4):
  perf util: move bpf_perf definitions to a libperf header
  perf bpf: check perf_attr_map is compatible with the perf binary
  perf-stat: introduce config stat.bpf-counter-events
  perf-stat: introduce ':b' modifier

 tools/lib/perf/include/perf/bpf_perf.h | 31 
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 43 ++
 tools/perf/util/bpf_counter.c  | 50 +-
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 
 tools/perf/util/evsel.h|  9 +
 tools/perf/util/parse-events.c |  8 -
 tools/perf/util/parse-events.l |  2 +-
 tools/perf/util/target.h   |  5 ---
 10 files changed, 129 insertions(+), 47 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

--
2.30.2


Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 11:24 AM, Jiri Olsa  wrote:
> 
> On Thu, Apr 08, 2021 at 06:08:20PM +0000, Song Liu wrote:
>> 
>> 
>>> On Apr 8, 2021, at 10:45 AM, Jiri Olsa  wrote:
>>> 
>>> On Thu, Apr 08, 2021 at 05:28:10PM +, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
>>>>> 
>>>>> On Thu, Apr 08, 2021 at 04:39:33PM +, Song Liu wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>>>>>> 
>>>>>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>>>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>>>>>> --bpf-counters option. Enable "use bpf by default" events with a config
>>>>>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>>>>>> evsel__hw_names.
>>>>>>>> 
>>>>>>>> This also enables mixed BPF event and regular event in the same 
>>>>>>>> session.
>>>>>>>> For example:
>>>>>>>> 
>>>>>>>> perf config stat.bpf-counter-events=instructions
>>>>>>>> perf stat -e instructions,cs
>>>>>>>> 
>>>>>>> 
>>>>>>> so if we are mixing events now, how about using modifier for bpf 
>>>>>>> counters,
>>>>>>> instead of configuring .perfconfig list we could use:
>>>>>>> 
>>>>>>> perf stat -e instructions:b,cs
>>>>>>> 
>>>>>>> thoughts?
>>>>>>> 
>>>>>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>>>>>> feel free to use it
>>>>>> 
>>>>>> I think we will need both 'b' modifier and .perfconfig configuration. 
>>>>>> For systems with BPF-managed perf events running in the background, 
>>>>> 
>>>>> hum, I'm not sure I understand what that means.. you mean there
>>>>> are tools that run perf stat so you don't want to change them?
>>>> 
>>>> We have tools that do perf_event_open(). I will change them to use 
>>>> BPF managed perf events for "cycles" and "instructions". Since these 
>>>> tools are running 24/7, perf-stat on the system should use BPF managed
>>>> "cycles" and "instructions" by default. 
>>> 
>>> well if you are already changing the tools why not change them to add
>>> modifier.. but I don't mind adding that .perfconfig stuff if you need
>>> that
>> 
>> The tools I mentioned here don't use perf-stat, they just use 
>> perf_event_open() and read the perf events fds. We want a config to make
> 
> just curious, how those tools use perf_event_open?
> 
>> "cycles" to use BPF by default, so that when the user (not these tools)
>> runs perf-stat, it will share PMCs with those events by default. 
> 
> I'm sorry but I still don't see the usecase.. if you need to change both 
> tools,
> you can change them to use bpf-managed event, why bother with the list?
> 
>>> 
>>>> 
>>>>> 
>>>>>> .perfconfig makes sure perf-stat sessions will share PMCs with these 
>>>>>> background monitoring tools. 'b' modifier, on the other hand, is useful
>>>>>> when the user knows there is opportunity to share the PMCs. 
>>>>>> 
>>>>>> Does this make sense? 
>>>>> 
>>>>> if there's reason for that then sure.. but let's not limit that just
>>>>> on HARDWARE events only.. there are RAW events with the same demand
>>>>> for this feature.. why don't we let user define any event for this?
>>>> 
>>>> I haven't found a good way to config RAW events. I guess RAW events 
>>>> could use 'b' modifier? 
>>> any event uing the pmu notation like cpu/instructions/
>> 
>> Can we do something like "perf config stat.bpf-counter-events=cpu/*" means 
>> all "cpu/xx" events use BPF by default?
> 
> I think there's misundestanding, all I'm saying is that IIUC you check
> events stat.bpf-counter-events to be HARDWARE type, which I don't think
> is necessary and we can allow any event in there

From what I see, most of the opportunity for sharing comes from a few common
events, like cycles and instructions. The second reason is that the config
implementation is easy and straightforward. We can certainly extend the config
to other events. Before that, the 'b' modifier should be good for those use
cases.

Thanks,
Song



Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 11:50 AM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Thu, Apr 08, 2021 at 08:24:47PM +0200, Jiri Olsa escreveu:
>> On Thu, Apr 08, 2021 at 06:08:20PM +, Song Liu wrote:
>>> 
>>> 
>>>> On Apr 8, 2021, at 10:45 AM, Jiri Olsa  wrote:
>>>> 
>>>> On Thu, Apr 08, 2021 at 05:28:10PM +, Song Liu wrote:
>>>>> 
>>>>> 
>>>>>> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
>>>>>> 
>>>>>> On Thu, Apr 08, 2021 at 04:39:33PM +0000, Song Liu wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>>>>>>> 
>>>>>>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>>>>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>>>>>>> --bpf-counters option. Enable "use bpf by default" events with a 
>>>>>>>>> config
>>>>>>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>>>>>>> evsel__hw_names.
>>>>>>>>> 
>>>>>>>>> This also enables mixed BPF event and regular event in the same 
>>>>>>>>> session.
>>>>>>>>> For example:
>>>>>>>>> 
>>>>>>>>> perf config stat.bpf-counter-events=instructions
>>>>>>>>> perf stat -e instructions,cs
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> so if we are mixing events now, how about using modifier for bpf 
>>>>>>>> counters,
>>>>>>>> instead of configuring .perfconfig list we could use:
>>>>>>>> 
>>>>>>>> perf stat -e instructions:b,cs
>>>>>>>> 
>>>>>>>> thoughts?
>>>>>>>> 
>>>>>>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>>>>>>> feel free to use it
>>>>>>> 
>>>>>>> I think we will need both 'b' modifier and .perfconfig configuration. 
>>>>>>> For systems with BPF-managed perf events running in the background, 
>>>>>> 
>>>>>> hum, I'm not sure I understand what that means.. you mean there
>>>>>> are tools that run perf stat so you don't want to change them?
>>>>> 
>>>>> We have tools that do perf_event_open(). I will change them to use 
>>>>> BPF managed perf events for "cycles" and "instructions". Since these 
>>>>> tools are running 24/7, perf-stat on the system should use BPF managed
>>>>> "cycles" and "instructions" by default. 
>>>> 
>>>> well if you are already changing the tools why not change them to add
>>>> modifier.. but I don't mind adding that .perfconfig stuff if you need
>>>> that
>>> 
>>> The tools I mentioned here don't use perf-stat, they just use 
>>> perf_event_open() and read the perf events fds. We want a config to make
>> 
>> just curious, how those tools use perf_event_open?
> 
> I.e. do they use tools/lib/perf/? :-)

Not right now. I do hope we can eventually let them use libperf. But I 
haven't figured out the best path forward. 

> 
> I guess they will use it now for getting that "struct 
> perf_event_attr_map_entry" and
> the map name define.
> 
>>> "cycles" to use BPF by default, so that when the user (not these tools)
>>> runs perf-stat, it will share PMCs with those events by default. 
> 
>> I'm sorry but I still don't see the usecase.. if you need to change both 
>> tools,
>> you can change them to use bpf-managed event, why bother with the list?
> 
> He wants users not to bother if they are using bpf based counters, this will 
> happen
> automagically after they set their ~/.perfconfig with some command line Song 
> provides.
> 
> Then they will be using bpf counters that won't get exclusive access to those
> scarce counters, the tooling they are using will use bpf-counters and all will
> be well.
> 
> Right Song?

Yes, exactly. The config automatically switches ad-hoc perf-stat runs (for
debugging, performance tuning, etc.) to BPF-managed counters.

Thanks,
Song



Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 10:45 AM, Jiri Olsa  wrote:
> 
> On Thu, Apr 08, 2021 at 05:28:10PM +0000, Song Liu wrote:
>> 
>> 
>>> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
>>> 
>>> On Thu, Apr 08, 2021 at 04:39:33PM +, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>>>> 
>>>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>>>> --bpf-counters option. Enable "use bpf by default" events with a config
>>>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>>>> evsel__hw_names.
>>>>>> 
>>>>>> This also enables mixed BPF event and regular event in the same session.
>>>>>> For example:
>>>>>> 
>>>>>> perf config stat.bpf-counter-events=instructions
>>>>>> perf stat -e instructions,cs
>>>>>> 
>>>>> 
>>>>> so if we are mixing events now, how about using modifier for bpf counters,
>>>>> instead of configuring .perfconfig list we could use:
>>>>> 
>>>>> perf stat -e instructions:b,cs
>>>>> 
>>>>> thoughts?
>>>>> 
>>>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>>>> feel free to use it
>>>> 
>>>> I think we will need both 'b' modifier and .perfconfig configuration. 
>>>> For systems with BPF-managed perf events running in the background, 
>>> 
>>> hum, I'm not sure I understand what that means.. you mean there
>>> are tools that run perf stat so you don't want to change them?
>> 
>> We have tools that do perf_event_open(). I will change them to use 
>> BPF managed perf events for "cycles" and "instructions". Since these 
>> tools are running 24/7, perf-stat on the system should use BPF managed
>> "cycles" and "instructions" by default. 
> 
> well if you are already changing the tools why not change them to add
> modifier.. but I don't mind adding that .perfconfig stuff if you need
> that

The tools I mentioned here don't use perf-stat, they just use 
perf_event_open() and read the perf events fds. We want a config to make
"cycles" to use BPF by default, so that when the user (not these tools)
runs perf-stat, it will share PMCs with those events by default. 

> 
>> 
>>> 
>>>> .perfconfig makes sure perf-stat sessions will share PMCs with these 
>>>> background monitoring tools. 'b' modifier, on the other hand, is useful
>>>> when the user knows there is opportunity to share the PMCs. 
>>>> 
>>>> Does this make sense? 
>>> 
>>> if there's reason for that then sure.. but let's not limit that just
>>> on HARDWARE events only.. there are RAW events with the same demand
>>> for this feature.. why don't we let user define any event for this?
>> 
>> I haven't found a good way to config RAW events. I guess RAW events 
>> could use 'b' modifier? 
> any event uing the pmu notation like cpu/instructions/

Can we do something like "perf config stat.bpf-counter-events=cpu/*", meaning
all "cpu/xx" events use BPF by default?

Thanks,
Song

> 
> we can allow any event to be BPF-managed, right? IIUC we don't care,
> the code will work with any event

Yes, the code works with any event. 

Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
> 
> On Thu, Apr 08, 2021 at 04:39:33PM +0000, Song Liu wrote:
>> 
>> 
>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>> 
>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>> --bpf-counters option. Enable "use bpf by default" events with a config
>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>> evsel__hw_names.
>>>> 
>>>> This also enables mixed BPF event and regular event in the same session.
>>>> For example:
>>>> 
>>>>  perf config stat.bpf-counter-events=instructions
>>>>  perf stat -e instructions,cs
>>>> 
>>> 
>>> so if we are mixing events now, how about using modifier for bpf counters,
>>> instead of configuring .perfconfig list we could use:
>>> 
>>> perf stat -e instructions:b,cs
>>> 
>>> thoughts?
>>> 
>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>> feel free to use it
>> 
>> I think we will need both 'b' modifier and .perfconfig configuration. 
>> For systems with BPF-managed perf events running in the background, 
> 
> hum, I'm not sure I understand what that means.. you mean there
> are tools that run perf stat so you don't want to change them?

We have tools that do perf_event_open(). I will change them to use 
BPF managed perf events for "cycles" and "instructions". Since these 
tools are running 24/7, perf-stat on the system should use BPF managed
"cycles" and "instructions" by default. 

> 
>> .perfconfig makes sure perf-stat sessions will share PMCs with these 
>> background monitoring tools. 'b' modifier, on the other hand, is useful
>> when the user knows there is opportunity to share the PMCs. 
>> 
>> Does this make sense? 
> 
> if there's reason for that then sure.. but let's not limit that just
> on HARDWARE events only.. there are RAW events with the same demand
> for this feature.. why don't we let user define any event for this?

I haven't found a good way to config RAW events. I guess RAW events 
could use 'b' modifier? 

Thanks,
Song

Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
> 
> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>> Currently, to use BPF to aggregate perf event counters, the user uses
>> --bpf-counters option. Enable "use bpf by default" events with a config
>> option, stat.bpf-counter-events. This is limited to hardware events in
>> evsel__hw_names.
>> 
>> This also enables mixed BPF event and regular event in the same session.
>> For example:
>> 
>>   perf config stat.bpf-counter-events=instructions
>>   perf stat -e instructions,cs
>> 
> 
> so if we are mixing events now, how about using modifier for bpf counters,
> instead of configuring .perfconfig list we could use:
> 
>  perf stat -e instructions:b,cs
> 
> thoughts?
> 
> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
> feel free to use it

I think we will need both 'b' modifier and .perfconfig configuration. 
For systems with BPF-managed perf events running in the background, 
.perfconfig makes sure perf-stat sessions will share PMCs with these 
background monitoring tools. 'b' modifier, on the other hand, is useful
when the user knows there is opportunity to share the PMCs. 

Does this make sense? 

Thanks,
Song

> 
> jirka
> 
> 
> ---
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index ca52581f1b17..c55e4e58d1dc 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -82,6 +82,7 @@ struct evsel {
>   boolauto_merge_stats;
>   boolcollect_stat;
>   boolweak_group;
> + boolbpf_counter;
>   int bpf_fd;
>   struct bpf_object   *bpf_obj;
>   };
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 9ecb45bea948..b5850f1ea90b 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -1801,6 +1801,7 @@ struct event_modifier {
>   int pinned;
>   int weak;
>   int exclusive;
> + int bpf_counter;
> };
> 
> static int get_event_modifier(struct event_modifier *mod, char *str,
> @@ -1821,6 +1822,7 @@ static int get_event_modifier(struct event_modifier 
> *mod, char *str,
>   int exclude = eu | ek | eh;
>   int exclude_GH = evsel ? evsel->exclude_GH : 0;
>   int weak = 0;
> + int bpf_counter = 0;
> 
>   memset(mod, 0, sizeof(*mod));
> 
> @@ -1864,6 +1866,8 @@ static int get_event_modifier(struct event_modifier 
> *mod, char *str,
>   exclusive = 1;
>   } else if (*str == 'W') {
>   weak = 1;
> + } else if (*str == 'b') {
> + bpf_counter = 1;
>   } else
>   break;
> 
> @@ -1895,6 +1899,7 @@ static int get_event_modifier(struct event_modifier 
> *mod, char *str,
>   mod->sample_read = sample_read;
>   mod->pinned = pinned;
>   mod->weak = weak;
> + mod->bpf_counter = bpf_counter;
>   mod->exclusive = exclusive;
> 
>   return 0;
> @@ -1909,7 +1914,7 @@ static int check_modifier(char *str)
>   char *p = str;
> 
>   /* The sizeof includes 0 byte as well. */
> - if (strlen(str) > (sizeof("ukhGHpppPSDIWe") - 1))
> + if (strlen(str) > (sizeof("ukhGHpppPSDIWeb") - 1))
>   return -1;
> 
>   while (*p) {
> @@ -1950,6 +1955,7 @@ int parse_events__modifier_event(struct list_head 
> *list, char *str, bool add)
>   evsel->sample_read = mod.sample_read;
>   evsel->precise_max = mod.precise_max;
>   evsel->weak_group  = mod.weak;
> + evsel->bpf_counter = mod.bpf_counter;
> 
>   if (evsel__is_group_leader(evsel)) {
>   evsel->core.attr.pinned = mod.pinned;
> diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
> index 0b36285a9435..fb8646cc3e83 100644
> --- a/tools/perf/util/parse-events.l
> +++ b/tools/perf/util/parse-events.l
> @@ -210,7 +210,7 @@ name_tag  
> [\'][a-zA-Z_*?\[\]][a-zA-Z0-9_*?\-,\.\[\]:=]*[\']
> name_minus[a-zA-Z_*?][a-zA-Z0-9\-_*?.:]*
> drv_cfg_term  [a-zA-Z0-9_\.]+(=[a-zA-Z0-9_*?\.:]+)?
> /* If you add a modifier you need to update check_modifier() */
> -modifier_event   [ukhpPGHSDIWe]+
> +modifier_event   [ukhpPGHSDIWeb]+
> modifier_bp   [rwx]{1,3}
> 
> %%
> 



[PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-06 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. This is limited to hardware events in
evsel__hw_names.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 43 +-
 tools/perf/util/bpf_counter.c  | 11 +++
 tools/perf/util/config.c   | 32 +++
 tools/perf/util/evsel.c|  2 ++
 tools/perf/util/evsel.h|  6 
 tools/perf/util/target.h   |  5 ---
 7 files changed, 81 insertions(+), 20 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 744211fa8c186..6d4733eaac170 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common hardware events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 4bb48c6b66980..7c26e627db0ef 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -413,6 +413,8 @@ static int read_affinity_counters(struct timespec *rs)
evlist__for_each_entry(evsel_list, counter) {
if (evsel__cpu_iter_skip(counter, cpu))
continue;
+   if (evsel__is_bpf(counter))
+   continue;
if (!counter->err) {
counter->err = read_counter_cpu(counter, rs,

counter->cpu_iter - 1);
@@ -423,17 +425,28 @@ static int read_affinity_counters(struct timespec *rs)
return 0;
 }
 
+/*
+ * Returns:
+ * 0   if all events use BPF;
+ * 1   if some events do NOT use BPF;
+ * < 0 on errors;
+ */
 static int read_bpf_map_counters(void)
 {
+   bool has_none_bpf_events = false;
struct evsel *counter;
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!evsel__is_bpf(counter)) {
+   has_none_bpf_events = true;
+   continue;
+   }
err = bpf_counter__read(counter);
if (err)
return err;
}
-   return 0;
+   return has_none_bpf_events ? 1 : 0;
 }
 
 static void read_counters(struct timespec *rs)
@@ -442,9 +455,10 @@ static void read_counters(struct timespec *rs)
int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf())
-   err = read_bpf_map_counters();
-   else
+   err = read_bpf_map_counters();
+   if (err < 0)
+   return;
+   if (err)
err = read_affinity_counters(rs);
if (err < 0)
return;
@@ -535,12 +549,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf()) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel__is_bpf(evsel))
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +799,9 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup() < 0)
return -1;
 
-   if (target__has_bpf()) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, ))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, ))
+   return -1;
}
 
evlist__for_each_cpu (evsel_list, i, cpu) {
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf

[PATCH v2 1/3] perf util: move bpf_perf definitions to a libperf header

2021-04-06 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPF_PERF_DEFAULT_ATTR_MAP_PATH to
bpf_perf.h for other tools to use.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/util/bpf_counter.c  | 27 +++---
 2 files changed, 34 insertions(+), 24 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

diff --git a/tools/lib/perf/include/perf/bpf_perf.h 
b/tools/lib/perf/include/perf/bpf_perf.h
new file mode 100644
index 0..e7cf6ba7b674b
--- /dev/null
+++ b/tools/lib/perf/include/perf/bpf_perf.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPF_PERF_H
+#define __LIBPERF_BPF_PERF_H
+
+#include   /* for __u32 */
+
+/*
+ * bpf_perf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* default attr_map name */
+#define BPF_PERF_DEFAULT_ATTR_MAP_PATH "perf_attr_map"
+
+#endif /* __LIBPERF_BPF_PERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..be484ddbbd5be 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -341,8 +320,8 @@ static int bperf_lock_attr_map(struct target *target)
if (target->attr_map) {
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
-   scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+   scnprintf(path, PATH_MAX, "%s/fs/bpf/%s", sysfs__mountpoint(),
+ BPF_PERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
-- 
2.30.2



[PATCH v2 2/3] perf bpf: check perf_attr_map is compatible with the perf binary

2021-04-06 Thread Song Liu
perf_attr_map could be shared among different versions of the perf binary. Add
bperf_attr_map_compatible() to check whether the existing attr_map is
compatible with the current perf binary.

Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index be484ddbbd5be..5de991ab46af9 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -312,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, _info, _info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -346,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



[PATCH v2 0/3] perf util: bpf perf improvements

2021-04-06 Thread Song Liu
This patch set improves bpf_perf (perf-stat --bpf-counter) by
  1) exposing key definitions to a libperf header;
  2) adding compatibility check for perf_attr_map;
  3) introducing config stat.bpf-counter-events.

Changes v1 => v2:
1. Separate 2/3 from 1/3. (Jiri)
2. Rename bperf.h to bpf_perf.h. (Jiri)
3. Other small fixes/optimizations. (Jiri)

Song Liu (3):
  perf util: move bpf_perf definitions to a libperf header
  perf bpf: check perf_attr_map is compatible with the perf binary
  perf-stat: introduce config stat.bpf-counter-events

 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/Documentation/perf-stat.txt |  2 +
 tools/perf/builtin-stat.c  | 43 ---
 tools/perf/util/bpf_counter.c  | 57 +++---
 tools/perf/util/config.c   | 32 +++
 tools/perf/util/evsel.c|  2 +
 tools/perf/util/evsel.h|  6 +++
 tools/perf/util/target.h   |  5 ---
 8 files changed, 134 insertions(+), 44 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

--
2.30.2


Re: [PATCH 2/2] perf-stat: introduce config stat.bpf-counter-events

2021-04-06 Thread Song Liu



> On Apr 6, 2021, at 7:21 AM, Jiri Olsa  wrote:
> 
> On Fri, Apr 02, 2021 at 05:29:38PM -0700, Song Liu wrote:
>> Currently, to use BPF to aggregate perf event counters, the user uses
>> --bpf-counters option. Enable "use bpf by default" events with a config
>> option, stat.bpf-counter-events. This is limited to hardware events in
>> evsel__hw_names.
>> 
>> This also enables mixed BPF event and regular event in the same session.
>> For example:
>> 
>>   perf config stat.bpf-counter-events=instructions
>>   perf stat -e instructions,cs
> 
> hum, so this will effectively allow to mix 'bpf-shared' counters
> with normals ones.. I don't think we're ready for that ;-)

I think we are ready. :) all bpf_counter stuff is within evsel, so mixing 
them doesn't need much work. 

> 
>> 
>> The second command will use BPF for "instructions" but not "cs".
>> 
>> Signed-off-by: Song Liu 
>> ---
>> tools/perf/Documentation/perf-stat.txt |  2 ++
>> tools/perf/builtin-stat.c  | 41 --
>> tools/perf/util/bpf_counter.c  | 11 +++
>> tools/perf/util/config.c   | 32 
>> tools/perf/util/evsel.c|  2 ++
>> tools/perf/util/evsel.h|  1 +
>> tools/perf/util/target.h   |  5 
>> 7 files changed, 74 insertions(+), 20 deletions(-)
>> 
>> diff --git a/tools/perf/Documentation/perf-stat.txt 
>> b/tools/perf/Documentation/perf-stat.txt
>> index 744211fa8c186..6d4733eaac170 100644
>> --- a/tools/perf/Documentation/perf-stat.txt
>> +++ b/tools/perf/Documentation/perf-stat.txt
>> @@ -97,6 +97,8 @@ report::
>>  Use BPF programs to aggregate readings from perf_events.  This
>>  allows multiple perf-stat sessions that are counting the same metric 
>> (cycles,
>>  instructions, etc.) to share hardware counters.
>> +To use BPF programs on common hardware events by default, use
>> +"perf config stat.bpf-counter-events=".
>> 
>> --bpf-attr-map::
>>  With option "--bpf-counters", different perf-stat sessions share
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 4bb48c6b66980..5adfa708ffe68 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -423,17 +423,28 @@ static int read_affinity_counters(struct timespec *rs)
>>  return 0;
>> }
>> 
>> +/*
>> + * Returns:
>> + * 0   if all events use BPF;
>> + * 1   if some events do NOT use BPF;
>> + * < 0 on errors;
>> + */
>> static int read_bpf_map_counters(void)
>> {
>> +bool has_none_bpf_events = false;
>>  struct evsel *counter;
>>  int err;
>> 
>>  evlist__for_each_entry(evsel_list, counter) {
>> +if (!counter->bpf_counter_ops) {
>> +has_none_bpf_events = true;
>> +continue;
>> +}
>>  err = bpf_counter__read(counter);
>>  if (err)
>>  return err;
>>  }
>> -return 0;
>> +return has_none_bpf_events ? 1 : 0;
>> }
>> 
>> static void read_counters(struct timespec *rs)
>> @@ -442,9 +453,10 @@ static void read_counters(struct timespec *rs)
>>  int err;
>> 
>>  if (!stat_config.stop_read_counter) {
>> -if (target__has_bpf())
>> -err = read_bpf_map_counters();
>> -else
>> +err = read_bpf_map_counters();
>> +if (err < 0)
>> +return;
>> +if (err)
>>  err = read_affinity_counters(rs);
> 
> so read_affinity_counters will read also 'bpf-shared' counters no?
> as long as it was separated, I did not see a problem, now we have
> counters that either have bpf ops set or have not
> 
> it'd be great to do some generic separation.. I was thinking to move
> bpf_counter_ops into some generic counter ops and we would just fill
> in the proper ops for the counter.. buuut the affinity readings are
> not compatible with what we are doing in bperf_read and the profiler
> bpf read
> 
> so I think the solution will be just to skip those events in
> read_affinity_counters and all the other code, and have some
> helper like:
> 
>   bool evsel__is_bpf(evsel)
> 
> so it's clear why it's skipped

Yes, this will be better! The current version does have the problem of an
extra read in read_affinity_counters(). Will fix this.

Thanks,
Song
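
A rough sketch of the helper Jiri suggests; it only needs to wrap the
bpf_counter_ops check that the patch open-codes above. The struct definitions
below are toy stand-ins just for this example; the real ones live in perf's
util/evsel.h:

#include <stdbool.h>
#include <stddef.h>

struct bpf_counter_ops;
struct evsel {
        struct bpf_counter_ops *bpf_counter_ops;  /* set when the event is BPF-managed */
};

/* true when the counter is managed (read/enabled) through a BPF program */
static inline bool evsel__is_bpf(struct evsel *evsel)
{
        return evsel->bpf_counter_ops != NULL;
}

int main(void)
{
        struct evsel e = { .bpf_counter_ops = NULL };

        return evsel__is_bpf(&e);       /* 0: a plain perf_event counter */
}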

Re: [PATCH 1/2] perf util: move bperf definitions to a libperf header

2021-04-06 Thread Song Liu



> On Apr 6, 2021, at 7:21 AM, Jiri Olsa  wrote:
> 
> On Fri, Apr 02, 2021 at 05:29:37PM -0700, Song Liu wrote:
>> By following the same protocol, other tools can share hardware PMCs with
>> perf. Move perf_event_attr_map_entry and BPERF_DEFAULT_ATTR_MAP_PATH to
>> bperf.h for other tools to use.
> 
> hi,
> so is this necessary for some other tool now?

We have monitoring tools that do perf_event_open(). I would like to migrate these
to bperf. 

> 
>> 
>> Also add bperf_attr_map_compatible() to check whether existing attr_map
>> is compatible with current perf binary.
> 
> please separate this change

Will do. 

> 
>> 
>> Signed-off-by: Song Liu 
>> ---
>> tools/lib/perf/include/perf/bperf.h | 29 +++
>> tools/perf/util/bpf_counter.c   | 44 ++---
>> 2 files changed, 50 insertions(+), 23 deletions(-)
>> create mode 100644 tools/lib/perf/include/perf/bperf.h
>> 
>> diff --git a/tools/lib/perf/include/perf/bperf.h 
>> b/tools/lib/perf/include/perf/bperf.h
>> new file mode 100644
>> index 0..02b2fd5e50c75
>> --- /dev/null
>> +++ b/tools/lib/perf/include/perf/bperf.h
> 
> I wonder we want to call it bpf_perf.h to be more generic?
> or best just bpf.h ... but that might give us some conflict
> headache in future ;-)

I would rather avoid bpf.h... I am ok with bpf_perf.h or bperf.h

> 
>> @@ -0,0 +1,29 @@
>> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
>> +#ifndef __LIBPERF_BPERF_H
>> +#define __LIBPERF_BPERF_H
>> +
>> +/*
>> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
>> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
>> + * no concurrent access to the attr_map.  The key of attr_map is struct
>> + * perf_event_attr, and the value is struct perf_event_attr_map_entry.
>> + *
>> + * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
>> + * leader prog, and the diff_map. Each perf-stat session holds a reference
>> + * to the bpf_link to make sure the leader prog is attached to sched_switch
>> + * tracepoint.
>> + *
>> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
>> + * does not hold any references to the leader program. Once all perf-stat
>> + * sessions of these events exit, the leader prog, its maps, and the
>> + * perf_events will be freed.
>> + */
>> +struct perf_event_attr_map_entry {
>> +__u32 link_id;
>> +__u32 diff_map_id;
>> +};
> 
> this header file should be self contained,
> so you need __u32 definitions

Will add. 

> 
> 
>> +
>> +/* pin the map at sysfs__mountpoint()/BPERF_DEFAULT_ATTR_MAP_PATH */
>> +#define BPERF_DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
> 
> if we are going to expose this, I think we should expose just
> "perf_attr_map" ... without the 'fs/bpf' part, because that
> could be mounted anywhere

Will fix this. 

Thanks,
Song



[PATCH 2/2] perf-stat: introduce config stat.bpf-counter-events

2021-04-02 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. This is limited to hardware events in
evsel__hw_names.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 41 --
 tools/perf/util/bpf_counter.c  | 11 +++
 tools/perf/util/config.c   | 32 
 tools/perf/util/evsel.c|  2 ++
 tools/perf/util/evsel.h|  1 +
 tools/perf/util/target.h   |  5 
 7 files changed, 74 insertions(+), 20 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 744211fa8c186..6d4733eaac170 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common hardware events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 4bb48c6b66980..5adfa708ffe68 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -423,17 +423,28 @@ static int read_affinity_counters(struct timespec *rs)
return 0;
 }
 
+/*
+ * Returns:
+ * 0   if all events use BPF;
+ * 1   if some events do NOT use BPF;
+ * < 0 on errors;
+ */
 static int read_bpf_map_counters(void)
 {
+   bool has_none_bpf_events = false;
struct evsel *counter;
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!counter->bpf_counter_ops) {
+   has_none_bpf_events = true;
+   continue;
+   }
err = bpf_counter__read(counter);
if (err)
return err;
}
-   return 0;
+   return has_none_bpf_events ? 1 : 0;
 }
 
 static void read_counters(struct timespec *rs)
@@ -442,9 +453,10 @@ static void read_counters(struct timespec *rs)
int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf(&target))
-   err = read_bpf_map_counters();
-   else
+   err = read_bpf_map_counters();
+   if (err < 0)
+   return;
+   if (err)
err = read_affinity_counters(rs);
if (err < 0)
return;
@@ -535,12 +547,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel->bpf_counter_ops)
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +797,9 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup() < 0)
return -1;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, &target))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, &target))
+   return -1;
}
 
evlist__for_each_cpu (evsel_list, i, cpu) {
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index df70c8dcf7850..ea45914af1693 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -792,6 +792,17 @@ int bpf_counter__load(struct evsel *evsel, struct target 
*target)
evsel->bpf_counter_ops = &bpf_program_profiler_ops;
else if (target->use_bpf)
evsel->bpf_counter_ops = &bperf_ops;
+   else {
+   int i;
+
+   for (i = 0; i < PERF_COUNT_HW_MAX; i++) {
+   if (!strcmp(evsel->name

[PATCH 1/2] perf util: move bperf definitions to a libperf header

2021-04-02 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPERF_DEFAULT_ATTR_MAP_PATH to
bperf.h for other tools to use.

Also add bperf_attr_map_compatible() to check whether existing attr_map
is compatible with current perf binary.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bperf.h | 29 +++
 tools/perf/util/bpf_counter.c   | 44 ++---
 2 files changed, 50 insertions(+), 23 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bperf.h

diff --git a/tools/lib/perf/include/perf/bperf.h 
b/tools/lib/perf/include/perf/bperf.h
new file mode 100644
index 0..02b2fd5e50c75
--- /dev/null
+++ b/tools/lib/perf/include/perf/bperf.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPERF_H
+#define __LIBPERF_BPERF_H
+
+/*
+ * bperf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* pin the map at sysfs__mountpoint()/BPERF_DEFAULT_ATTR_MAP_PATH */
+#define BPERF_DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
+
+#endif /* __LIBPERF_BPERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..df70c8dcf7850 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -333,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, &map_info, &map_info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -342,7 +335,7 @@ static int bperf_lock_attr_map(struct target *target)
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+ BPERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
@@ -367,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



Re: [PATCH -next] libbpf: remove redundant semi-colon

2021-04-01 Thread Song Liu
On Thu, Apr 1, 2021 at 10:58 AM Yang Yingliang  wrote:
>
> Signed-off-by: Yang Yingliang 

Please add a short commit log.

Thanks,
Song

> ---
>  tools/lib/bpf/linker.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/linker.c b/tools/lib/bpf/linker.c
> index 46b16cbdcda3..4e08bc07e635 100644
> --- a/tools/lib/bpf/linker.c
> +++ b/tools/lib/bpf/linker.c
> @@ -1895,7 +1895,7 @@ static int finalize_btf_ext(struct bpf_linker *linker)
> hdr->func_info_len = funcs_sz;
> hdr->line_info_off = funcs_sz;
> hdr->line_info_len = lines_sz;
> -   hdr->core_relo_off = funcs_sz + lines_sz;;
> +   hdr->core_relo_off = funcs_sz + lines_sz;
> hdr->core_relo_len = core_relos_sz;
>
> if (funcs_sz) {
> --
> 2.25.1
>


Re: [PATCH] linux/bpf.h: Remove repeated struct declaration

2021-04-01 Thread Song Liu
On Thu, Apr 1, 2021 at 12:22 AM Wan Jiabing  wrote:
>
> struct btf_type is declared twice. One is declared at the 35th line.
> The below one is not needed. Remove the duplicate.
>
> Signed-off-by: Wan Jiabing 

Acked-by: Song Liu 


Re: [PATCH] linux/bpf-cgroup.h: Delete repeated struct declaration

2021-04-01 Thread Song Liu



> On Mar 31, 2021, at 11:46 PM, Wan Jiabing  wrote:
> 
> struct bpf_prog is declared twice. There is one declaration
> which is independent of the MACRO at the 18th line.
> So the below one is not needed. Remove the duplicate.
> 
> Signed-off-by: Wan Jiabing 

Acked-by: Song Liu 

> ---
> include/linux/bpf-cgroup.h | 1 -
> 1 file changed, 1 deletion(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index c42e02b4d84b..57b4d4b980e7 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -418,7 +418,6 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr,
> union bpf_attr __user *uattr);
> #else
> 
> -struct bpf_prog;
> struct cgroup_bpf {};
> static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
> static inline void cgroup_bpf_offline(struct cgroup *cgrp) {}
> -- 
> 2.25.1
> 



Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-04-01 Thread Song Liu



> On Mar 30, 2021, at 8:11 AM, Namhyung Kim  wrote:
> 
> On Tue, Mar 30, 2021 at 3:33 PM Song Liu  wrote:
>>> On Mar 29, 2021, at 4:33 AM, Namhyung Kim  wrote:
>>> 
>>> On Mon, Mar 29, 2021 at 2:17 AM Song Liu  wrote:
>>>>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
>>>>> 
>>>>> As we can run many jobs (in container) on a big machine, we want to
>>>>> measure each job's performance during the run.  To do that, the
>>>>> perf_event can be associated to a cgroup to measure it only.
>>>>> 
>> 
>> [...]
>> 
>>>>> + return 0;
>>>>> +}
>>>> 
>>>> Could you please explain why we need this logic in can_attach?
>>> 
>>> IIUC the ss->attach() is called after a task's cgroup membership
>>> is changed.  But we want to collect the performance numbers for
>>> the old cgroup just before the change.  As the logic merely checks
>>> the current task's cgroup, it should be done in the can_attach()
>>> which is called before the cgroup change.
>> 
>> Thanks for the explanations.
>> 
>> Overall, I really like the core idea, especially that the overhead on
>> context switch is bounded (by the depth of cgroup tree).
> 
> Thanks!
> 
>> 
>> Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
>> Specifically, if we can have
>> 
>>  PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list
>>  PERF_EVENT_IOC_DEL_CGROUP delete a cgroup from the list
>> 
>> we can probably share these events among multiple processes, and
>> these processes don't need to know others' cgroup list. I think
>> this will be useful for users to build customized monitoring in
>> its own container.
>> 
>> Does this make sense?
> 
> Maybe we can add ADD/DEL interface for more flexible monitoring
> but I'm not sure which use cases it'll be used actually.
> 
> For your multi-process sharing case, the original events' file
> descriptors should be shared first.  

Yes, we will need some other work to make the ADD/DEL interface 
work properly. Let's worry about that later. 

For both patches:

Acked-by: Song Liu 

Thanks,
Song

[...]


Re: [syzbot] possible deadlock in register_for_each_vma

2021-03-31 Thread Song Liu



> On Mar 31, 2021, at 9:59 AM, Oleg Nesterov  wrote:
> 
> On 03/28, Hillf Danton wrote:
>> 
>> On Sat, 27 Mar 2021 18:53:08 Oleg Nesterov wrote:
>>> Hi Hillf,
>>> 
>>> it seems that you already understand the problem ;) I don't.
>> 
>> It is simpler than you thought - I always blindly believe what syzbot
>> reported is true before it turns out false as I am not smarter than it.
>> Feel free to laugh loud.
> 
> I am not going to laugh. I too think that lockdep is more clever than me.
> 
>>> Could you explain in details how double __register is possible ? and how
>> 
>> Taking another look at the report over five minutes may help more?
> 
> No. I spent much, much more time and I still can't understand your
> patch which adds UPROBE_REGISTERING. Quite possibly your patch is fine,
> just I am not smart enough.
> 
> And I am a bit surprised you refused to help me.
> 
>>> it connects to this lockdep report?
>> 
>> Feel free to show the report is false and ignore my noise.
> 
> Well, this particular report looks correct but false-positive to me,
> _free_event() is not possible, but I can be easily wrong and we need
> to shut up lockdep anyway...
> 
> 
> ---
> Add more CC's. So, we have the following trace
> 
>   -> #0 (dup_mmap_sem){}-{0:0}:
>check_prev_add kernel/locking/lockdep.c:2936 [inline]
>check_prevs_add kernel/locking/lockdep.c:3059 [inline]
>validate_chain kernel/locking/lockdep.c:3674 [inline]
>__lock_acquire+0x2b14/0x54c0 kernel/locking/lockdep.c:4900
>lock_acquire kernel/locking/lockdep.c:5510 [inline]
>lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
>percpu_down_write+0x95/0x440 kernel/locking/percpu-rwsem.c:217
>register_for_each_vma+0x2c/0xc10 kernel/events/uprobes.c:1040
>__uprobe_register+0x5c2/0x850 kernel/events/uprobes.c:1181
>trace_uprobe_enable kernel/trace/trace_uprobe.c:1065 [inline]
>probe_event_enable+0x357/0xa00 kernel/trace/trace_uprobe.c:1134
>trace_uprobe_register+0x443/0x880 kernel/trace/trace_uprobe.c:1461
>perf_trace_event_reg kernel/trace/trace_event_perf.c:129 [inline]
>perf_trace_event_init+0x549/0xa20 kernel/trace/trace_event_perf.c:204
>perf_uprobe_init+0x16f/0x210 kernel/trace/trace_event_perf.c:336
>perf_uprobe_event_init+0xff/0x1c0 kernel/events/core.c:9754
>perf_try_init_event+0x12a/0x560 kernel/events/core.c:11071
>perf_init_event kernel/events/core.c:11123 [inline]
>perf_event_alloc.part.0+0xe3b/0x3960 kernel/events/core.c:11403
>perf_event_alloc kernel/events/core.c:11785 [inline]
>__do_sys_perf_event_open+0x647/0x2e60 kernel/events/core.c:11883
>do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> 
> which shows that this path takes
> 
>   event_mutex -> uprobe.register_rwsem -> dup_mmap_sem -> mm.mmap_lock
> 
> Not good. If nothing else, perf_mmap_close() path can take event_mutex under
> mm.mmap_lock, so lockdep complains correctly.
> 
> But why does perf_uprobe_init() take event_mutex? The comment mentions
> uprobe_buffer_enable().
> 
> If this is the only reason, then why uprobe_buffer_enable/disable abuse
> event_mutex?
> 
> IOW, can something like the stupid patch below work? (Just in case... yes
> it is very suboptimal, I am just trying to understand the problem).
> 
> Song, Namhyung, Peter, what do you think?
> 
> Oleg.

I think the following patch works well. I haven't tested it though. 

Thanks,
Song


> 
> --- x/kernel/trace/trace_event_perf.c
> +++ x/kernel/trace/trace_event_perf.c
> @@ -327,16 +327,9 @@ int perf_uprobe_init(struct perf_event *p_event,
>   goto out;
>   }
> 
> - /*
> -  * local trace_uprobe need to hold event_mutex to call
> -  * uprobe_buffer_enable() and uprobe_buffer_disable().
> -  * event_mutex is not required for local trace_kprobes.
> -  */
> - mutex_lock(&event_mutex);
>   ret = perf_trace_event_init(tp_event, p_event);
>   if (ret)
>   destroy_local_trace_uprobe(tp_event);
> - mutex_unlock(&event_mutex);
> out:
>   kfree(path);
>   return ret;
> --- x/kernel/trace/trace_uprobe.c
> +++ x/kernel/trace/trace_uprobe.c
> @@ -857,6 +857,7 @@ struct uprobe_cpu_buffer {
> };
> static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
> static int uprobe_buffer_refcnt;
> +static DEFINE_MUTEX(uprobe_buffer_mutex);
> 
> static int uprobe_buffer_init(void)
> {
> @@ -894,13 +895,13 @@ static int uprobe_buffer_enable(void)
> {
>   int ret = 0;
> 
> - BUG_ON(!mutex_is_locked(&event_mutex));
> -
> + mutex_lock(&uprobe_buffer_mutex);
>   if (uprobe_buffer_refcnt++ == 0) {
>   ret = uprobe_buffer_init();
>   if (ret < 0)
>   uprobe_buffer_refcnt--;
>   }
> + mutex_unlock(&uprobe_buffer_mutex);
> 
>   return 

Re: [Patch bpf-next] bpf: remove unused parameter from ___bpf_prog_run

2021-03-31 Thread Song Liu
On Wed, Mar 31, 2021 at 12:14 AM He Fengqing  wrote:
>
> 'stack' parameter is not used in ___bpf_prog_run,
> the base address have been set to FP reg. So consequently remove it.
>
> Signed-off-by: He Fengqing 

Acked-by: Song Liu 


Re: [PATCH bpf-next v2] bpf: check flags in 'bpf_ringbuf_discard()' and 'bpf_ringbuf_submit()'

2021-03-30 Thread Song Liu



> On Mar 30, 2021, at 3:37 PM, Pedro Tammela  wrote:
> 
> The current code only checks flags in 'bpf_ringbuf_output()'.
> 
> Signed-off-by: Pedro Tammela 

Acked-by: Song Liu 



Re: [PATCH bpf-next] bpf: check flags in 'bpf_ringbuf_discard()' and 'bpf_ringbuf_submit()'

2021-03-30 Thread Song Liu


> On Mar 30, 2021, at 7:22 AM, Pedro Tammela  wrote:
> 
> On Mon, Mar 29, 2021 at 13:10, Song Liu 
> wrote:
>> 
>> 
>> 
>>> On Mar 28, 2021, at 9:10 AM, Pedro Tammela  wrote:
>>> 
>>> The current code only checks flags in 'bpf_ringbuf_output()'.
>>> 
>>> Signed-off-by: Pedro Tammela 
>>> ---
>>> include/uapi/linux/bpf.h   |  8 
>>> kernel/bpf/ringbuf.c   | 13 +++--
>>> tools/include/uapi/linux/bpf.h |  8 
>>> 3 files changed, 19 insertions(+), 10 deletions(-)
>>> 
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 100cb2e4c104..232b5e5dd045 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -4073,7 +4073,7 @@ union bpf_attr {
>>> *Valid pointer with *size* bytes of memory available; NULL,
>>> *otherwise.
>>> *
>>> - * void bpf_ringbuf_submit(void *data, u64 flags)
>>> + * int bpf_ringbuf_submit(void *data, u64 flags)
>> 
>> This should be "long" instead of "int".
>> 
>>> *Description
>>> *Submit reserved ring buffer sample, pointed to by *data*.
>>> *If **BPF_RB_NO_WAKEUP** is specified in *flags*, no 
>>> notification
>>> @@ -4083,9 +4083,9 @@ union bpf_attr {
>>> *If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, 
>>> notification
>>> *of new data availability is sent unconditionally.
>>> *Return
>>> - *   Nothing. Always succeeds.
>>> + *   0 on success, or a negative error in case of failure.
>>> *
>>> - * void bpf_ringbuf_discard(void *data, u64 flags)
>>> + * int bpf_ringbuf_discard(void *data, u64 flags)
>> 
>> Ditto. And same for tools/include/uapi/linux/bpf.h
>> 
>>> *Description
>>> *Discard reserved ring buffer sample, pointed to by *data*.
>>> *If **BPF_RB_NO_WAKEUP** is specified in *flags*, no 
>>> notification
>>> @@ -4095,7 +4095,7 @@ union bpf_attr {
>>> *If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, 
>>> notification
>>> *of new data availability is sent unconditionally.
>>> *Return
>>> - *   Nothing. Always succeeds.
>>> + *   0 on success, or a negative error in case of failure.
>>> *
>>> * u64 bpf_ringbuf_query(void *ringbuf, u64 flags)
>>> *Description
>>> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
>>> index f25b719ac786..f76dafe2427e 100644
>>> --- a/kernel/bpf/ringbuf.c
>>> +++ b/kernel/bpf/ringbuf.c
>>> @@ -397,26 +397,35 @@ static void bpf_ringbuf_commit(void *sample, u64 
>>> flags, bool discard)
>>> 
>>> BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
>>> {
>>> + if (unlikely(flags & ~(BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP)))
>>> + return -EINVAL;
>> 
>> We can move this check to bpf_ringbuf_commit().
> 
> I don't believe we can because in 'bpf_ringbuf_output()' the flag
> checking in 'bpf_ringbuf_commit()' is already
> too late.

I see. Let's keep it in the current functions then. 

Thanks,
Song



Re: [Patch bpf-next] net: filter: Remove unused bpf_load_pointer

2021-03-30 Thread Song Liu
On Mon, Mar 29, 2021 at 7:11 PM He Fengqing  wrote:
>
> Remove unused bpf_load_pointer function in filter.h
>
> Signed-off-by: He Fengqing 

Acked-by: Song Liu 

[...]


Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-30 Thread Song Liu



> On Mar 29, 2021, at 4:33 AM, Namhyung Kim  wrote:
> 
> On Mon, Mar 29, 2021 at 2:17 AM Song Liu  wrote:
>>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
>>> 
>>> As we can run many jobs (in container) on a big machine, we want to
>>> measure each job's performance during the run.  To do that, the
>>> perf_event can be associated to a cgroup to measure it only.
>>> 

[...]

>>> + return 0;
>>> +}
>> 
>> Could you please explain why we need this logic in can_attach?
> 
> IIUC the ss->attach() is called after a task's cgroup membership
> is changed.  But we want to collect the performance numbers for
> the old cgroup just before the change.  As the logic merely checks
> the current task's cgroup, it should be done in the can_attach()
> which is called before the cgroup change.

Thanks for the explanations. 

Overall, I really like the core idea, especially that the overhead on 
context switch is bounded (by the depth of cgroup tree). 

Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible? 
Specifically, if we can have
 
  PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list 
  PERF_EVENT_IOC_DEL_CGROUP delete a cgroup from the list

we can probably share these events among multiple processes, and 
these processes don't need to know others' cgroup list. I think 
this will be useful for users to build customized monitoring in 
their own containers. 
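
Something like the following from each monitoring process (just a rough
sketch of the intended usage; the ADD/DEL ioctls are only a proposal here
and the exact argument format is not decided yet):

    __u64 cgrp_id = ...;    /* id of the cgroup this process cares about */

    /* manage only our own cgroups, independent of other processes */
    ioctl(perf_fd, PERF_EVENT_IOC_ADD_CGROUP, &cgrp_id);
    /* ... read counters with PERF_EVENT_IOC_READ_CGROUP ... */
    ioctl(perf_fd, PERF_EVENT_IOC_DEL_CGROUP, &cgrp_id);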

Does this make sense?

Thanks,
Song



Re: [PATCH] selftests/bpf: add LDFLAGS when building test_verifier

2021-03-29 Thread Song Liu
On Mon, Mar 29, 2021 at 11:13 AM Jisheng Zhang
 wrote:
>
> From: Jisheng Zhang 
>
> This is useful for cross compile process to point linker to the
> correct libelf, libcap, libz path.

LGTM:
Acked-by: Song Liu 

btw: Do we also need LDFLAGS for some other binaries, like test_cpp,
TRUNNER_BINARY, etc?

Thanks,
Song

>
> Signed-off-by: Jisheng Zhang 
> ---
>  tools/testing/selftests/bpf/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/Makefile 
> b/tools/testing/selftests/bpf/Makefile
> index 044bfdcf5b74..dac1c5094e28 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -447,7 +447,7 @@ verifier/tests.h: verifier/*.c
> ) > verifier/tests.h)
>  $(OUTPUT)/test_verifier: test_verifier.c verifier/tests.h $(BPFOBJ) | 
> $(OUTPUT)
> $(call msg,BINARY,,$@)
> -   $(Q)$(CC) $(CFLAGS) $(filter %.a %.o %.c,$^) $(LDLIBS) -o $@
> +   $(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o %.c,$^) $(LDLIBS) -o 
> $@
>
>  # Make sure we are able to include and link libbpf against c++.
>  $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
> --
> 2.31.0
>
>


Re: [PATCH bpf-next] libbpf: Add '_wait()' and '_nowait()' macros for 'bpf_ring_buffer__poll()'

2021-03-29 Thread Song Liu



> On Mar 28, 2021, at 9:10 AM, Pedro Tammela  wrote:
> 
> 'bpf_ring_buffer__poll()' abstracts the polling method, so abstract the
> constants that make the implementation not wait or wait indefinitely
> for data.
> 
> Signed-off-by: Pedro Tammela 
> ---
> tools/lib/bpf/libbpf.h | 3 +++
> tools/testing/selftests/bpf/benchs/bench_ringbufs.c| 2 +-
> tools/testing/selftests/bpf/prog_tests/ringbuf.c   | 6 +++---
> tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c | 4 ++--
> 4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index f500621d28e5..3817d84f91c6 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -540,6 +540,9 @@ LIBBPF_API int ring_buffer__poll(struct ring_buffer *rb, 
> int timeout_ms);
> LIBBPF_API int ring_buffer__consume(struct ring_buffer *rb);
> LIBBPF_API int ring_buffer__epoll_fd(const struct ring_buffer *rb);
> 
> +#define ring_buffer__poll_wait(rb) ring_buffer__poll(rb, -1)
> +#define ring_buffer__poll_nowait(rb) ring_buffer__poll(rb, 0)

I think we don't need ring_buffer__poll_wait() as ring_buffer__poll() already 
means "wait for timeout_ms". 

Actually, I think ring_buffer__poll() is enough. ring_buffer__poll_nowait() 
is not that useful either. 
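
In other words, callers can just pass the timeout they want, e.g. (a
minimal sketch, assuming rb is an initialized struct ring_buffer *):

    /* block until data (or an error) arrives */
    err = ring_buffer__poll(rb, -1 /* timeout_ms */);

    /* consume whatever is available and return immediately */
    err = ring_buffer__poll(rb, 0 /* timeout_ms */);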

Thanks,
Song



Re: [PATCH bpf-next] bpf: check flags in 'bpf_ringbuf_discard()' and 'bpf_ringbuf_submit()'

2021-03-29 Thread Song Liu



> On Mar 28, 2021, at 9:10 AM, Pedro Tammela  wrote:
> 
> The current code only checks flags in 'bpf_ringbuf_output()'.
> 
> Signed-off-by: Pedro Tammela 
> ---
> include/uapi/linux/bpf.h   |  8 
> kernel/bpf/ringbuf.c   | 13 +++--
> tools/include/uapi/linux/bpf.h |  8 
> 3 files changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 100cb2e4c104..232b5e5dd045 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4073,7 +4073,7 @@ union bpf_attr {
>  *Valid pointer with *size* bytes of memory available; NULL,
>  *otherwise.
>  *
> - * void bpf_ringbuf_submit(void *data, u64 flags)
> + * int bpf_ringbuf_submit(void *data, u64 flags)

This should be "long" instead of "int". 

>  *Description
>  *Submit reserved ring buffer sample, pointed to by *data*.
>  *If **BPF_RB_NO_WAKEUP** is specified in *flags*, no notification
> @@ -4083,9 +4083,9 @@ union bpf_attr {
>  *If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, notification
>  *of new data availability is sent unconditionally.
>  *Return
> - *   Nothing. Always succeeds.
> + *   0 on success, or a negative error in case of failure.
>  *
> - * void bpf_ringbuf_discard(void *data, u64 flags)
> + * int bpf_ringbuf_discard(void *data, u64 flags)

Ditto. And same for tools/include/uapi/linux/bpf.h

>  *Description
>  *Discard reserved ring buffer sample, pointed to by *data*.
>  *If **BPF_RB_NO_WAKEUP** is specified in *flags*, no notification
> @@ -4095,7 +4095,7 @@ union bpf_attr {
>  *If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, notification
>  *of new data availability is sent unconditionally.
>  *Return
> - *   Nothing. Always succeeds.
> + *   0 on success, or a negative error in case of failure.
>  *
>  * u64 bpf_ringbuf_query(void *ringbuf, u64 flags)
>  *Description
> diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c
> index f25b719ac786..f76dafe2427e 100644
> --- a/kernel/bpf/ringbuf.c
> +++ b/kernel/bpf/ringbuf.c
> @@ -397,26 +397,35 @@ static void bpf_ringbuf_commit(void *sample, u64 flags, 
> bool discard)
> 
> BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
> {
> + if (unlikely(flags & ~(BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP)))
> + return -EINVAL;

We can move this check to bpf_ringbuf_commit(). 

Thanks,
Song

[...]

Re: [PATCH bpf-next] bpf: add 'BPF_RB_MAY_WAKEUP' flag

2021-03-29 Thread Song Liu



> On Mar 28, 2021, at 9:10 AM, Pedro Tammela  wrote:
> 
> The current way to provide a no-op flag to 'bpf_ringbuf_submit()',
> 'bpf_ringbuf_discard()' and 'bpf_ringbuf_output()' is to provide a '0'
> value.
> 
> A '0' value might notify the consumer if it already caught up in processing,
> so let's provide a more descriptive notation for this value.
> 
> Signed-off-by: Pedro Tammela 

Acked-by: Song Liu 
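
This does read better, e.g. (assuming the new flag is simply a named alias
for the current '0' value):

    bpf_ringbuf_submit(sample, 0);                 /* intent is not obvious */
    bpf_ringbuf_submit(sample, BPF_RB_MAY_WAKEUP); /* adaptive wakeup, may notify */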



Re: [PATCH 2/2] perf/core: Support reading group events with shared cgroups

2021-03-28 Thread Song Liu



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> 
> This enables reading event group's counter values together with a
> PERF_EVENT_IOC_READ_CGROUP command like we do in the regular read().
> Users should give a correct size of buffer to be read.
> 
> Signed-off-by: Namhyung Kim 
> ---
> kernel/events/core.c | 119 +--
> 1 file changed, 116 insertions(+), 3 deletions(-)
> 

[...]

> +}
> +
> +static int perf_event_read_cgrp_node_group(struct perf_event *event, u64 
> cgrp_id,
> +char __user *buf)
> +{
> + struct perf_cgroup_node *cgrp;
> + struct perf_event_context *ctx = event->ctx;
> + struct perf_event *sibling;
> + u64 read_format = event->attr.read_format;
> + unsigned long flags;
> + u64 *values;
> + int n = 1;
> + int ret;
> +
> + values = kzalloc(event->read_size, GFP_KERNEL);
> + if (!values)
> + return -ENOMEM;
> +
> + values[0] = 1 + event->nr_siblings;
> +
> + /* update event count and times (possibly run on other cpu) */
> + (void)perf_event_read(event, true);
> +
> + raw_spin_lock_irqsave(&ctx->lock, flags);
> +
> + cgrp = find_cgroup_node(event, cgrp_id);
> + if (cgrp == NULL) {
> + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> + kfree(values);
> + return -ENOENT;
> + }
> +
> + if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
> + values[n++] = cgrp->time_enabled;
> + if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> + values[n++] = cgrp->time_running;
> +
> + values[n++] = cgrp->count;
> + if (read_format & PERF_FORMAT_ID)
> + values[n++] = primary_event_id(event);
> +
> + for_each_sibling_event(sibling, event) {
> + n += perf_event_read_cgrp_node_sibling(sibling, read_format,
> +cgrp_id, &values[n]);
> + }
> +
> + raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + ret = copy_to_user(buf, values, n * sizeof(u64));
> + kfree(values);
> + if (ret)
> + return -EFAULT;
> +
> + return n * sizeof(u64);
> +}
> +
> +static int perf_event_read_cgroup_node(struct perf_event *event, u64 
> read_size,
> +u64 cgrp_id, char __user *buf)
> +{
> + u64 read_format = event->attr.read_format;
> +
> + if (read_size < event->read_size + 2 * sizeof(u64))

Why do we need read_size + 2 u64 here? 

Thanks,
Song

[...]



Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-28 Thread Song Liu



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> 
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run.  To do that, the
> perf_event can be associated to a cgroup to measure it only.
> 
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
> 
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups.  All we need is a separate counter (and
> two timing variables) for each cgroup.  I added a hash table to map
> from cgroup id to the attached cgroups.
> 
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different.  And it attributes the delta to the current task's cgroup.
> 
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
> 
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> 64-bit array to attach given cgroups.  The first element is a
> number of cgroups in the buffer, and the rest is a list of cgroup
> ids to add a cgroup info to the given event.
> 
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> array to get the event counter values.  The first element is size
> of the array in byte, and the second element is a cgroup id to
> read.  The rest is to save the counter value and timings.
> 
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple.  The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.
> 
> Cc: Tejun Heo 
> Signed-off-by: Namhyung Kim 
> ---
> include/linux/perf_event.h  |  22 ++
> include/uapi/linux/perf_event.h |   2 +
> kernel/events/core.c| 474 ++--
> 3 files changed, 471 insertions(+), 27 deletions(-)

[...]

> @@ -4461,6 +4787,8 @@ static void __perf_event_init_context(struct 
> perf_event_context *ctx)
>   INIT_LIST_HEAD(&ctx->event_list);
>   INIT_LIST_HEAD(&ctx->pinned_active);
>   INIT_LIST_HEAD(&ctx->flexible_active);
> + INIT_LIST_HEAD(&ctx->cgrp_ctx_entry);
> + INIT_LIST_HEAD(&ctx->cgrp_node_list);

I guess we need ifdef CONFIG_CGROUP_PERF here?

>   refcount_set(&ctx->refcount, 1);
> }
> 
> @@ -4851,6 +5179,8 @@ static void _free_event(struct perf_event *event)
>   if (is_cgroup_event(event))
>   perf_detach_cgroup(event);
> 
> + perf_event_destroy_cgroup_nodes(event);
> +
>   if (!event->parent) {
>   if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
>   put_callchain_buffers();

[...]

> +static void perf_sched_enable(void)
> +{
> + /*
> +  * We need the mutex here because static_branch_enable()
> +  * must complete *before* the perf_sched_count increment
> +  * becomes visible.
> +  */
> + if (atomic_inc_not_zero(&perf_sched_count))
> + return;

Why don't we use perf_cgroup_events for the new use case? 

> +
> + mutex_lock(&perf_sched_mutex);
> + if (!atomic_read(&perf_sched_count)) {
> + static_branch_enable(&perf_sched_events);
> + /*
> +  * Guarantee that all CPUs observe they key change and
> +  * call the perf scheduling hooks before proceeding to
> +  * install events that need them.
> +  */
> + synchronize_rcu();
> + }
> + /*
> +  * Now that we have waited for the sync_sched(), allow further
> +  * increments to by-pass the mutex.
> +  */
> + atomic_inc(&perf_sched_count);
> + mutex_unlock(&perf_sched_mutex);
> +}
> +
> static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> {
>   struct perf_event *event = file->private_data;
>   struct perf_event_context *ctx;
> + bool do_sched_enable = false;
>   long ret;
> 
>   /* Treat ioctl like writes as it is likely a mutating operation. */
> @@ -5595,9 +6006,19 @@ static long perf_ioctl(struct file *file, unsigned int 
> cmd, unsigned long arg)
>   return ret;
> 
>   ctx = perf_event_ctx_lock(event);
> + /* ATTACH_CGROUP requires context switch callback */
> + if (cmd == PERF_EVENT_IOC_ATTACH_CGROUP && 
> !event_has_cgroup_node(event))
> + do_sched_enable = true;
>   ret = _perf_ioctl(event, cmd, arg);
>   perf_event_ctx_unlock(event, ctx);
> 
> + /*
> +  * Due to the circular lock dependency, it cannot call
> +  * static_branch_enable() under the ctx->mutex.
> +  */
> + if (do_sched_enable && ret >= 0)
> + perf_sched_enable();
> +
>   return ret;
> }
> 
> @@ -11240,33 +11661,8 @@ static void account_event(struct perf_event *event)
>   if 

Re: [PATCH] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs

2021-03-28 Thread Song Liu



> On Mar 23, 2021, at 10:13 AM, Collin Fijalkovich  
> wrote:
> 
> Question: when we use this on shared library, the library is still
> writable. When the
> shared library is opened for write, these pages will refault in as 4kB
> pages, right? 
> 
> That's correct, while a file is opened for write it will refault into 4kB 
> pages and block use of THPs. Once all writers complete (i_writecount <=0), 
> the file can fault into THPs again and khugepaged can collapse existing page 
> ranges provided that it can successfully allocate new huge pages.

Will it be a problem if a slow writer (say a slow scp) writes to the 
shared library while the shared library is in use? 

Thanks,
Song

> 
> From,
> Collin 
> 
> On Mon, Mar 22, 2021 at 4:55 PM Song Liu  wrote:
> On Mon, Mar 22, 2021 at 3:00 PM Collin Fijalkovich
>  wrote:
> >
> > Transparent huge pages are supported for read-only non-shmem filesystems,
> > but are only used for vmas with VM_DENYWRITE. This condition ensures that
> > file THPs are protected from writes while an application is running
> > (ETXTBSY).  Any existing file THPs are then dropped from the page cache
> > when a file is opened for write in do_dentry_open(). Since sys_mmap
> > ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
> > produced by execve().
> >
> > Systems that make heavy use of shared libraries (e.g. Android) are unable
> > to apply VM_DENYWRITE through the dynamic linker, preventing them from
> > benefiting from the resultant reduced contention on the TLB.
> >
> > This patch reduces the constraint on file THPs allowing use with any
> > executable mapping from a file not opened for write (see
> > inode_is_open_for_write()). It also introduces additional conditions to
> > ensure that files opened for write will never be backed by file THPs.
> 
> Thanks for working on this. We could also use this in many data center
> workloads.
> 
> Question: when we use this on shared library, the library is still
> writable. When the
> shared library is opened for write, these pages will refault in as 4kB
> pages, right?
> 
> Thanks,
> Song



Re: [PATCH] bpf: remove redundant assignment of variable id

2021-03-26 Thread Song Liu
On Fri, Mar 26, 2021 at 12:45 PM Colin King  wrote:
>
> From: Colin Ian King 
>
> The variable id is being assigned a value that is never
> read, the assignment is redundant and can be removed.
>
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Acked-by: Song Liu 

For future patches, please prefix it as [PATCH bpf-next] or
[PATCH bpf], based on which tree the patch should apply to.

> ---
>  kernel/bpf/btf.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 369faeddf1df..b22fb29347c0 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -789,7 +789,6 @@ static const struct btf_type 
> *btf_type_skip_qualifiers(const struct btf *btf,
>
> while (btf_type_is_modifier(t) &&
>BTF_INFO_KIND(t->info) != BTF_KIND_TYPEDEF) {
> -   id = t->type;
> t = btf_type_by_id(btf, t->type);
> }
>
> --
> 2.30.2
>


Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-24 Thread Song Liu



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> 
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run.  To do that, the
> perf_event can be associated to a cgroup to measure it only.
> 
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
> 
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups.  All we need is a separate counter (and
> two timing variables) for each cgroup.  I added a hash table to map
> from cgroup id to the attached cgroups.
> 
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different.  And it attributes the delta to the current task's cgroup.
> 
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
> 
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> 64-bit array to attach given cgroups.  The first element is a
> number of cgroups in the buffer, and the rest is a list of cgroup
> ids to add a cgroup info to the given event.
> 
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> array to get the event counter values.  The first element is size
> of the array in byte, and the second element is a cgroup id to
> read.  The rest is to save the counter value and timings.
> 
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple.  The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.
> 
> Cc: Tejun Heo 
> Signed-off-by: Namhyung Kim 
> ---
> include/linux/perf_event.h  |  22 ++
> include/uapi/linux/perf_event.h |   2 +
> kernel/events/core.c| 474 ++--
> 3 files changed, 471 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3f7f89ea5e51..2760f3b07534 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -771,6 +771,18 @@ struct perf_event {
> 
> #ifdef CONFIG_CGROUP_PERF
>   struct perf_cgroup  *cgrp; /* cgroup event is attach to */
> +
> + /* to share an event for multiple cgroups */
> + struct hlist_head   *cgrp_node_hash;
> + struct perf_cgroup_node *cgrp_node_entries;
> + int nr_cgrp_nodes;
> + int cgrp_node_hash_bits;
> +
> + struct list_head    cgrp_node_entry;
> +
> + u64 cgrp_node_count;
> + u64 cgrp_node_time_enabled;
> + u64 cgrp_node_time_running;

A comment saying the above values are from the previous reading would be helpful. 

> #endif
> 
> #ifdef CONFIG_SECURITY
> @@ -780,6 +792,14 @@ struct perf_event {
> #endif /* CONFIG_PERF_EVENTS */
> };
> 
> +struct perf_cgroup_node {
> + struct hlist_node   node;
> + u64 id;
> + u64 count;
> + u64 time_enabled;
> + u64 time_running;
> + u64 padding[2];

Do we really need the padding? For cache line alignment? 

> +};
> 
> struct perf_event_groups {
>   struct rb_root  tree;
> @@ -843,6 +863,8 @@ struct perf_event_context {
>   int pin_count;
> #ifdef CONFIG_CGROUP_PERF
>   int nr_cgroups;  /* cgroup evts */
> + struct list_head    cgrp_node_list;
> + struct list_head    cgrp_ctx_entry;
> #endif
>   void*task_ctx_data; /* pmu specific data */
>   struct rcu_head rcu_head;
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index ad15e40d7f5d..06bc7ab13616 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -479,6 +479,8 @@ struct perf_event_query_bpf {
> #define PERF_EVENT_IOC_PAUSE_OUTPUT   _IOW('$', 9, __u32)
> #define PERF_EVENT_IOC_QUERY_BPF  _IOWR('$', 10, struct 
> perf_event_query_bpf *)
> #define PERF_EVENT_IOC_MODIFY_ATTRIBUTES  _IOW('$', 11, struct 
> perf_event_attr *)
> +#define PERF_EVENT_IOC_ATTACH_CGROUP _IOW('$', 12, __u64 *)
> +#define PERF_EVENT_IOC_READ_CGROUP   _IOWR('$', 13, __u64 *)
> 
> enum perf_event_ioc_flags {
>   PERF_IOC_FLAG_GROUP = 1U << 0,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index f07943183041..38c26a23418a 100644
> --- a/kernel/events/core.c
> +++ 

Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-23 Thread Song Liu



> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
> 
> As we can run many jobs (in container) on a big machine, we want to
> measure each job's performance during the run.  To do that, the
> perf_event can be associated to a cgroup to measure it only.
> 
> However such cgroup events need to be opened separately and it causes
> significant overhead in event multiplexing during the context switch
> as well as resource consumption like in file descriptors and memory
> footprint.
> 
> As a cgroup event is basically a cpu event, we can share a single cpu
> event for multiple cgroups.  All we need is a separate counter (and
> two timing variables) for each cgroup.  I added a hash table to map
> from cgroup id to the attached cgroups.
> 
> With this change, the cpu event needs to calculate a delta of event
> counter values when the cgroups of current and the next task are
> different.  And it attributes the delta to the current task's cgroup.
> 
> This patch adds two new ioctl commands to perf_event for light-weight
> cgroup event counting (i.e. perf stat).
> 
> * PERF_EVENT_IOC_ATTACH_CGROUP - it takes a buffer consists of a
> 64-bit array to attach given cgroups.  The first element is a
> number of cgroups in the buffer, and the rest is a list of cgroup
> ids to add a cgroup info to the given event.
> 
> * PERF_EVENT_IOC_READ_CGROUP - it takes a buffer consists of a 64-bit
> array to get the event counter values.  The first element is size
> of the array in byte, and the second element is a cgroup id to
> read.  The rest is to save the counter value and timings.
> 
> This attaches all cgroups in a single syscall and I didn't add the
> DETACH command deliberately to make the implementation simple.  The
> attached cgroup nodes would be deleted when the file descriptor of the
> perf_event is closed.

This is a very interesting idea! 

Could you please add some description of the relationship between
perf_event and the contexts? The code is a little confusing. For example,
why do we need cgroup_ctx_list? 

Thanks,
Song 

[...]



Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-23 Thread Song Liu



> On Mar 23, 2021, at 2:10 PM, Arnaldo Carvalho de Melo  wrote:
> 
> On Fri, Mar 19, 2021 at 04:14:42PM +0000, Song Liu wrote:
>>> On Mar 19, 2021, at 8:58 AM, Namhyung Kim  wrote:
>>> On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo  
>>> wrote:
>>>> On Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim wrote:
>>>>> On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
>>>>>>> On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
>>>>>>> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
>>>>>>> wrote:
>>>>>>>> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
>>>>>>>>> perf stat -C 1,3,5  107.063 [sec]
>>>>>>>>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
> 
>>>>>>>> I can't see why it's actualy faster than normal perf ;-)
>>>>>>>> would be worth to find out
> 
>>>>>>> Isn't this all about contended cases?
> 
>>>>>> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
>>>>>> doesn't need it.
> 
>>>>> Yep, so for uncontended cases, normal perf should be the same as the
>>>>> baseline (faster than the bperf).  But for contended cases, the bperf
>>>>> works faster.
> 
>>>> The difference should be small enough that for people that use this in a
>>>> machine where contention happens most of the time, setting a
>>>> ~/.perfconfig to use it by default should be advantageous, i.e. no need
>>>> to use --bpf-counters on the command line all the time.
> 
>>>> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
>>>> a look again now but I want to have this merged on perf/core so that I
>>>> can work on a new BPF SKEL to use this:
> 
>>> I have a concern for the per cpu target, but it can be done later, so
> 
>>> Acked-by: Namhyung Kim 
> 
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable
> 
>>> Interesting!  Actually I was thinking about the similar too. :)
>> 
>> Hi Namhyung, Jiri, and Arnaldo,
>> 
>> Thanks a lot for your kind review. 
>> 
>> Here is updated 3/3, where we use perf-bench instead of stressapptest.
> 
> I had to apply this updated 3/3 manually, as there was some munging, its
> all now at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/core
> 
> Please take a look at the "Committer testing" section I added to the
> main patch, introducing bperf:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=tmp.perf/core=7fac83aaf2eecc9e7e7b72da694c49bb4ce7fdfc
> 
> And check if I made any mistake or if something else could be added.
> 
> It'll move to perf/core after my set of automated tests finishes.

Thanks Arnaldo! Looks great!

Song




Re: [PATCH v2 1/3] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-22 Thread Song Liu



> On Mar 19, 2021, at 11:41 AM, Arnaldo Carvalho de Melo  
> wrote:
> 
> On Thu, Mar 18, 2021 at 10:15:13PM +0100, Jiri Olsa wrote:
>> On Tue, Mar 16, 2021 at 02:18:35PM -0700, Song Liu wrote:
>>> bperf is off by default. To enable it, pass --bpf-counters option to
>>> perf-stat. bperf uses a BPF hashmap to share information about BPF
>>> programs and maps used by bperf. This map is pinned to bpffs. The default
>>> path is /sys/fs/bpf/perf_attr_map. The user could change the path with
>>> option --bpf-attr-map.
>>> 
>>> Signed-off-by: Song Liu 
>> 
>> Reviewed-by: Jiri Olsa 
> 
> After applying just this first patch in the series I'm getting this
> after a 'make -C tools/ clean', now I'm checking if I need some new
> clang, ideas?
> 
> - Arnaldo

Hi Arnaldo, 

Are you still getting this error? 

Thanks,
Song

> 
> [acme@quaco perf]$ make O=/tmp/build/perf -C tools/perf BUILD_BPF_SKEL=1 
> PYTHON=python3 install-bin
> make: Entering directory '/home/acme/git/perf/tools/perf'
>  BUILD:   Doing 'make -j8' parallel build
> Warning: Kernel ABI header at 'tools/include/uapi/linux/kvm.h' differs from 
> latest version at 'include/uapi/linux/kvm.h'
> diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
> Warning: Kernel ABI header at 
> 'tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl' differs from latest 
> version at 'arch/mips/kernel/syscalls/syscall_n64.tbl'
> diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl 
> arch/mips/kernel/syscalls/syscall_n64.tbl
> 
> Auto-detecting system features:
> ... dwarf: [ on  ]
> ...dwarf_getlocations: [ on  ]
> ... glibc: [ on  ]
> ...libbfd: [ on  ]
> ...libbfd-buildid: [ on  ]
> ...libcap: [ on  ]
> ...libelf: [ on  ]
> ...   libnuma: [ on  ]
> ...numa_num_possible_cpus: [ on  ]
> ...   libperl: [ on  ]
> ... libpython: [ on  ]
> ... libcrypto: [ on  ]
> ... libunwind: [ on  ]
> ...libdw-dwarf-unwind: [ on  ]
> ...  zlib: [ on  ]
> ...  lzma: [ on  ]
> ... get_cpuid: [ on  ]
> ...   bpf: [ on  ]
> ...libaio: [ on  ]
> ...   libzstd: [ on  ]
> ...disassembler-four-args: [ on  ]
> 
>  GEN  /tmp/build/perf/common-cmds.h
>  CC   /tmp/build/perf/exec-cmd.o
>  MKDIR/tmp/build/perf/fd/
>  MKDIR/tmp/build/perf/fs/
>  CC   /tmp/build/perf/fs/fs.o
>  CC   /tmp/build/perf/event-parse.o
>  CC   /tmp/build/perf/fd/array.o
>  CC   /tmp/build/perf/core.o
>  GEN  /tmp/build/perf/bpf_helper_defs.h
>  CC   /tmp/build/perf/event-plugin.o
>  MKDIR/tmp/build/perf/staticobjs/
>  PERF_VERSION = 5.12.rc2.g3df07f57f205
>  CC   /tmp/build/perf/staticobjs/libbpf.o
>  CC   /tmp/build/perf/cpu.o
>  LD   /tmp/build/perf/fd/libapi-in.o
>  CC   /tmp/build/perf/cpumap.o
>  CC   /tmp/build/perf/help.o
>  MKDIR/tmp/build/perf/fs/
>  CC   /tmp/build/perf/fs/tracing_path.o
>  CC   /tmp/build/perf/fs/cgroup.o
>  CC   /tmp/build/perf/trace-seq.o
>  CC   /tmp/build/perf/pager.o
>  CC   /tmp/build/perf/parse-options.o
>  LD   /tmp/build/perf/fs/libapi-in.o
>  CC   /tmp/build/perf/debug.o
>  CC   /tmp/build/perf/str_error_r.o
>  CC   /tmp/build/perf/run-command.o
>  CC   /tmp/build/perf/sigchain.o
>  LD   /tmp/build/perf/libapi-in.o
>  AR   /tmp/build/perf/libapi.a
>  CC   /tmp/build/perf/subcmd-config.o
>  CC   /tmp/build/perf/threadmap.o
>  CC   /tmp/build/perf/evsel.o
>  CC   /tmp/build/perf/parse-filter.o
>  MKDIR/tmp/build/perf/staticobjs/
>  CC   /tmp/build/perf/staticobjs/bpf.o
>  CC   /tmp/build/perf/evlist.o
>  CC   /tmp/build/perf/parse-utils.o
>  CC   /tmp/build/perf/kbuffer-parse.o
>  CC   /tmp/build/perf/tep_strerror.o
>  CC   /tmp/build/perf/mmap.o
>  CC   /tmp/build/perf/zalloc.o
>  CC   /tmp/build/perf/event-parse-api.o
>  LD   /tmp/build/perf/libsubcmd-in.o
>  AR   /tmp/build/perf/libsubcmd.a
>  CC   /tmp/build/perf/xyarray.o
>  LD   /tmp/build/perf/libtraceevent-in.o
>  LINK /tmp/build/perf/libtraceevent.a
>  CC   /tmp/build/perf/staticobjs/nlattr.o
>  CC   /tmp/build/perf/staticobjs/btf.o
>  CC   /tmp/build/perf/lib.o
>  CC   /tmp/build/perf/staticobjs/libbpf_errno.o
>  CC   /tmp/

Re: [PATCH] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs

2021-03-22 Thread Song Liu
On Mon, Mar 22, 2021 at 3:00 PM Collin Fijalkovich
 wrote:
>
> Transparent huge pages are supported for read-only non-shmem filesystems,
> but are only used for vmas with VM_DENYWRITE. This condition ensures that
> file THPs are protected from writes while an application is running
> (ETXTBSY).  Any existing file THPs are then dropped from the page cache
> when a file is opened for write in do_dentry_open(). Since sys_mmap
> ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
> produced by execve().
>
> Systems that make heavy use of shared libraries (e.g. Android) are unable
> to apply VM_DENYWRITE through the dynamic linker, preventing them from
> benefiting from the resultant reduced contention on the TLB.
>
> This patch reduces the constraint on file THPs allowing use with any
> executable mapping from a file not opened for write (see
> inode_is_open_for_write()). It also introduces additional conditions to
> ensure that files opened for write will never be backed by file THPs.

Thanks for working on this. We could also use this in many data center
workloads.

Question: when we use this on a shared library, the library is still
writable. When the
shared library is opened for write, these pages will refault in as 4kB
pages, right?

Thanks,
Song


Re: [PATCH] md: Fix missing unused status line of /proc/mdstat

2021-03-22 Thread Song Liu
On Wed, Mar 17, 2021 at 7:05 AM Jan Glauber  wrote:
>
> Reading /proc/mdstat with a read buffer size that would not
> fit the unused status line in the first read will skip this
> line from the output.
>
> So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
> like: unused devices: 
>
> Don't return NULL immediately in start() for v=2 but call
> show() once to print the status line also for multiple reads.
>
> Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and 
> interface")
> Signed-off-by: Jan Glauber 

Applied to md-next. Thanks!
Song


Re: [PATCH v2 1/3] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-19 Thread Song Liu



> On Mar 19, 2021, at 11:55 AM, Jiri Olsa  wrote:
> 
> On Fri, Mar 19, 2021 at 03:41:57PM -0300, Arnaldo Carvalho de Melo wrote:
> 
> SNIP
> 
>>  LD   
>> /tmp/build/perf/util/bpf_skel/.tmp//bootstrap/libbpf/staticobjs/libbpf-in.o
>>  LINK /tmp/build/perf/util/bpf_skel/.tmp//bootstrap/libbpf/libbpf.a
>>  LINK /tmp/build/perf/util/bpf_skel/.tmp//bootstrap/bpftool
>>  GEN-SKEL /tmp/build/perf/util/bpf_skel/bpf_prog_profiler.skel.h
>>  GEN-SKEL /tmp/build/perf/util/bpf_skel/bperf_leader.skel.h
>>  GEN-SKEL /tmp/build/perf/util/bpf_skel/bperf_follower.skel.h
>> libbpf: map 'prev_readings': unexpected def kind var.
>> Error: failed to open BPF object file: Invalid argument
>> libbpf: map 'diff_readings': unexpected def kind var.
>> Error: failed to open BPF object file: Invalid argument
> 
> I'm getting clean build for the same options,
> could you please send the same output also with 'JOBS=1 V=1'
> 
> 
>> make[2]: *** [Makefile.perf:1029: 
>> /tmp/build/perf/util/bpf_skel/bperf_leader.skel.h] Error 255
>> make[2]: *** Waiting for unfinished jobs
>> make[2]: *** [Makefile.perf:1029: 
>> /tmp/build/perf/util/bpf_skel/bperf_follower.skel.h] Error 255
>> make[1]: *** [Makefile.perf:236: sub-make] Error 2
>> make: *** [Makefile:110: install-bin] Error 2
>> make: Leaving directory '/home/acme/git/perf/tools/perf'
>> [acme@quaco perf]$ clang -v
>> clang version 11.0.0 (https://github.com/llvm/llvm-project 
>> 67420f1b0e9c673ee638f2680fa83f468019004f)
>> Target: x86_64-unknown-linux-gnu
>> Thread model: posix
>> InstalledDir: /usr/local/bin
>> Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/10
>> Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/10
>> Candidate multilib: .;@m64
>> Candidate multilib: 32;@m32
>> Selected multilib: .;@m64
>> [acme@quaco perf]$
>> 
> 
> I have:
> 
> [jolsa@dell-r440-01 linux-perf]$ clang --version
> clang version 11.0.0 (Fedora 11.0.0-2.fc33)
> Target: x86_64-unknown-linux-gnu
> Thread model: posix
> InstalledDir: /usr/bin

I am not able to repro this error either. I tried two versions of clang:

clang version 11.0.0 (Red Hat 11.0.0-0.2.rc2.module_el8.4.0+533+50191577)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /bin

clang version 12.0.0 (https://github.com/llvm/llvm-project.git 
07f1e1f44c87d1ee84caf13d6e5aa64eb7e1b068)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin

Thanks,
Song



Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-19 Thread Song Liu



> On Mar 19, 2021, at 8:58 AM, Namhyung Kim  wrote:
> 
> Hi Arnaldo,
> 
> On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo
>  wrote:
>> 
>> On Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim wrote:
>>> On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
>>>>> On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
>>>>> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
>>>>> wrote:
>>>>>> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
>>>>>>> perf stat -C 1,3,5  107.063 [sec]
>>>>>>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>> 
>>>>>> I can't see why it's actualy faster than normal perf ;-)
>>>>>> would be worth to find out
>> 
>>>>> Isn't this all about contended cases?
>> 
>>>> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
>>>> doesn't need it.
>> 
>>> Yep, so for uncontended cases, normal perf should be the same as the
>>> baseline (faster than the bperf).  But for contended cases, the bperf
>>> works faster.
>> 
>> The difference should be small enough that for people that use this in a
>> machine where contention happens most of the time, setting a
>> ~/.perfconfig to use it by default should be advantageous, i.e. no need
>> to use --bpf-counters on the command line all the time.
>> 
>> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
>> a look again now but I want to have this merged on perf/core so that I
>> can work on a new BPF SKEL to use this:
> 
> I have a concern for the per cpu target, but it can be done later, so
> 
> Acked-by: Namhyung Kim 
> 
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable
> 
> Interesting!  Actually I was thinking about the similar too. :)

Hi Namhyung, Jiri, and Arnaldo,

Thanks a lot for your kind review. 

Here is the updated 3/3, where we use perf-bench instead of stressapptest.

Thanks,
Song


From cc79d161be9c9d24198f7e35b50058a6e15076fd Mon Sep 17 00:00:00 2001
From: Song Liu 
Date: Tue, 16 Mar 2021 00:19:53 -0700
Subject: [PATCH v3 3/3] perf-test: add a test for perf-stat --bpf-counters
 option

Add a test to compare the output of perf-stat with and without the
--bpf-counters option. If the difference is more than 10%, the test is
considered failed.

Signed-off-by: Song Liu 
---
 tools/perf/tests/shell/stat_bpf_counters.sh | 31 +
 1 file changed, 31 insertions(+)
 create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh

diff --git a/tools/perf/tests/shell/stat_bpf_counters.sh b/tools/perf/tests/shell/stat_bpf_counters.sh
new file mode 100755
index 0..7aabf177ce8d1
--- /dev/null
+++ b/tools/perf/tests/shell/stat_bpf_counters.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+# perf stat --bpf-counters test
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+# check whether $2 is within +/- 10% of $1
+compare_number()
+{
+   first_num=$1
+   second_num=$2
+
+   # upper bound is first_num * 110%
+   upper=$(( $first_num + $first_num / 10 ))
+   # lower bound is first_num * 90%
+   lower=$(( $first_num - $first_num / 10 ))
+
+   if [ $second_num -gt $upper ] || [ $second_num -lt $lower ]; then
+   echo "The difference between $first_num and $second_num are 
greater than 10%."
+   exit 1
+   fi
+}
+
+# skip if --bpf-counters is not supported
+perf stat --bpf-counters true > /dev/null 2>&1 || exit 2
+
+base_cycles=$(perf stat --no-big-num -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')
+bpf_cycles=$(perf stat --no-big-num --bpf-counters -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')
+
+compare_number $base_cycles $bpf_cycles
+exit 0
--
2.30.2




Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Song Liu



> On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> 
> 
> 
> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  wrote:
>> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
>>> 
>>> 
>>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo
>>  wrote:
>>>> 
>>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>>>>> Hi Song,
>>>>> 
>>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu 
>> wrote:
>>>>>> 
>>>>>> perf uses performance monitoring counters (PMCs) to monitor
>> system
>>>>>> performance. The PMCs are limited hardware resources. For
>> example,
>>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>>>> 
>>>>>> Modern data center systems use these PMCs in many different ways:
>>>>>> system level monitoring, (maybe nested) container level
>> monitoring, per
>>>>>> process monitoring, profiling (in sample mode), etc. In some
>> cases,
>>>>>> there are more active perf_events than available hardware PMCs.
>> To allow
>>>>>> all perf_events to have a chance to run, it is necessary to do
>> expensive
>>>>>> time multiplexing of events.
>>>>>> 
>>>>>> On the other hand, many monitoring tools count the common metrics
>> (cycles,
>>>>>> instructions). It is a waste to have multiple tools create
>> multiple
>>>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>>> 
>>>>> Right, it'd be really helpful when the PMCs are frequently or
>> mostly shared.
>>>>> But it'd also increase the overhead for uncontended cases as BPF
>> programs
>>>>> need to run on every context switch.  Depending on the workload,
>> it may
>>>>> cause a non-negligible performance impact.  So users should be
>> aware of it.
>>>> 
>>>> Would be interesting to, humm, measure both cases to have a firm
>> number
>>>> of the impact, how many instructions are added when sharing using
>>>> --bpf-counters?
>>>> 
>>>> I.e. compare the "expensive time multiplexing of events" with its
>>>> avoidance by using --bpf-counters.
>>>> 
>>>> Song, have you perfmormed such measurements?
>>> 
>>> I have got some measurements with perf-bench-sched-messaging:
>>> 
>>> The system: x86_64 with 23 cores (46 HT)
>>> 
>>> The perf-stat command:
>>> perf stat -e
>> cycles,cycles,instructions,instructions,ref-cycles,ref-cycles > etc.>
>>> 
>>> The benchmark command and output:
>>> ./perf bench sched messaging -g 40 -l 5 -t
>>> # Running 'sched/messaging' benchmark:
>>> # 20 sender and receiver threads per group
>>> # 40 groups == 1600 threads run
>>> Total time: 10X.XXX [sec]
>>> 
>>> 
>>> I use the "Total time" as measurement, so smaller number is better. 
>>> 
>>> For each condition, I run the command 5 times, and took the median of
>> 
>>> "Total time". 
>>> 
>>> Baseline (no perf-stat) 104.873 [sec]
>>> # global
>>> perf stat -a107.887 [sec]
>>> perf stat -a --bpf-counters 106.071 [sec]
>>> # per task
>>> perf stat   106.314 [sec]
>>> perf stat --bpf-counters105.965 [sec]
>>> # per cpu
>>> perf stat -C 1,3,5  107.063 [sec]
>>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>> 
>> I can't see why it's actualy faster than normal perf ;-)
>> would be worth to find out
> 
> Isn't this all about contended cases?

Yeah, the normal perf is doing time multiplexing; while --bpf-counters 
doesn't need it. 

Thanks,
Song



Re: [PATCH v2 1/3] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-18 Thread Song Liu



> On Mar 18, 2021, at 6:49 AM, Namhyung Kim  wrote:
> 
> On Thu, Mar 18, 2021 at 4:22 PM Song Liu  wrote:
>> 
>> 
>> 
>>> On Mar 17, 2021, at 10:54 PM, Namhyung Kim  wrote:
>>> 
>> 
>> [...]
>> 
>>>> +
>>>> +static int bperf_reload_leader_program(struct evsel *evsel, int 
>>>> attr_map_fd,
>>>> +  struct perf_event_attr_map_entry 
>>>> *entry)
>>>> +{
>>>> +   struct bperf_leader_bpf *skel = bperf_leader_bpf__open();
>>>> +   int link_fd, diff_map_fd, err;
>>>> +   struct bpf_link *link = NULL;
>>>> +
>>>> +   if (!skel) {
>>>> +   pr_err("Failed to open leader skeleton\n");
>>>> +   return -1;
>>>> +   }
>>>> +
>>>> +   bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
>>>> +   err = bperf_leader_bpf__load(skel);
>>>> +   if (err) {
>>>> +   pr_err("Failed to load leader skeleton\n");
>>>> +   goto out;
>>>> +   }
>>>> +
>>>> +   err = -1;
>>>> +   link = bpf_program__attach(skel->progs.on_switch);
>>>> +   if (!link) {
>>>> +   pr_err("Failed to attach leader program\n");
>>>> +   goto out;
>>>> +   }
>>>> +
>>>> +   link_fd = bpf_link__fd(link);
>>>> +   diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
>>>> +   entry->link_id = bpf_link_get_id(link_fd);
>>>> +   entry->diff_map_id = bpf_map_get_id(diff_map_fd);
>>>> +   err = bpf_map_update_elem(attr_map_fd, >core.attr, entry, 
>>>> BPF_ANY);
>>>> +   assert(err == 0);
>>>> +
>>>> +   evsel->bperf_leader_link_fd = 
>>>> bpf_link_get_fd_by_id(entry->link_id);
>>>> +   assert(evsel->bperf_leader_link_fd >= 0);
>>> 
>>> Isn't it the same as link_fd?
>> 
>> This is a different fd on the same link.
> 
> Ok
> 
>> 
>>> 
>>>> +
>>>> +   /*
>>>> +* save leader_skel for install_pe, which is called within
>>>> +* following evsel__open_per_cpu call
>>>> +*/
>>>> +   evsel->leader_skel = skel;
>>>> +   evsel__open_per_cpu(evsel, all_cpu_map, -1);
>>>> +
>>>> +out:
>>>> +   bperf_leader_bpf__destroy(skel);
>>>> +   bpf_link__destroy(link);
>>> 
>>> Why do we destroy it?  Is it because we get an another reference?
>> 
>> Yes. We only need evsel->bperf_leader_link_fd to keep the whole
>> skeleton attached.
>> 
>> When multiple perf-stat sessions are sharing the leader skeleton,
>> only the first one loads the leader skeleton, by calling
>> bperf_reload_leader_program(). Other sessions simply hold a fd to
>> the bpf_link. More explanation in bperf__load() below.
> 
> Ok.
> 
>> 
>> 
>>> 
>>>> +   return err;
>>>> +}
>>>> +
>>>> +static int bperf__load(struct evsel *evsel, struct target *target)
>>>> +{
>>>> +   struct perf_event_attr_map_entry entry = {0x, 0x};
>>>> +   int attr_map_fd, diff_map_fd = -1, err;
>>>> +   enum bperf_filter_type filter_type;
>>>> +   __u32 filter_entry_cnt, i;
>>>> +
>>>> +   if (bperf_check_target(evsel, target, _type, 
>>>> _entry_cnt))
>>>> +   return -1;
>>>> +
>>>> +   if (!all_cpu_map) {
>>>> +   all_cpu_map = perf_cpu_map__new(NULL);
>>>> +   if (!all_cpu_map)
>>>> +   return -1;
>>>> +   }
>>>> +
>>>> +   evsel->bperf_leader_prog_fd = -1;
>>>> +   evsel->bperf_leader_link_fd = -1;
>>>> +
>>>> +   /*
>>>> +* Step 1: hold a fd on the leader program and the bpf_link, if
>>>> +* the program is not already gone, reload the program.
>>>> +* Use flock() to ensure exclusive access to the perf_event_attr
>>>> +* map.
>>>> +*/
>>>> +   attr_map_fd = bperf_lock_attr_map(target);
>>&

Re: [PATCH v2 3/3] perf-test: add a test for perf-stat --bpf-counters option

2021-03-18 Thread Song Liu



> On Mar 17, 2021, at 11:07 PM, Namhyung Kim  wrote:
> 
> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
>> 
>> Add a test to compare the output of perf-stat with and without option
>> --bpf-counters. If the difference is more than 10%, the test is considered
>> as failed.
>> 
>> For stable results between two runs (w/ and w/o --bpf-counters), the test
>> program should: 1) be long enough for better signal-noise-ratio; 2) not
>> depend on the behavior of IO subsystem (for less noise from caching). So
>> far, the best option we found is stressapptest.
>> 
>> Signed-off-by: Song Liu 
>> ---
>> tools/perf/tests/shell/stat_bpf_counters.sh | 34 +
>> 1 file changed, 34 insertions(+)
>> create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
>> 
>> diff --git a/tools/perf/tests/shell/stat_bpf_counters.sh 
>> b/tools/perf/tests/shell/stat_bpf_counters.sh
>> new file mode 100755
>> index 0..c0bcb38d6b53c
>> --- /dev/null
>> +++ b/tools/perf/tests/shell/stat_bpf_counters.sh
>> @@ -0,0 +1,34 @@
>> +#!/bin/sh
>> +# perf stat --bpf-counters test
>> +# SPDX-License-Identifier: GPL-2.0
>> +
>> +set -e
>> +
>> +# check whether $2 is within +/- 10% of $1
>> +compare_number()
>> +{
>> +   first_num=$1
>> +   second_num=$2
>> +
>> +   # upper bound is first_num * 110%
>> +   upper=$(( $first_num + $first_num / 10 ))
>> +   # lower bound is first_num * 90%
>> +   lower=$(( $first_num - $first_num / 10 ))
>> +
>> +   if [ $second_num -gt $upper ] || [ $second_num -lt $lower ]; then
>> +   echo "The difference between $first_num and $second_num are 
>> greater than 10%."
>> +   exit 1
>> +   fi
>> +}
>> +
>> +# skip if --bpf-counters is not supported
>> +perf stat --bpf-counters true > /dev/null 2>&1 || exit 2
>> +
>> +# skip if stressapptest is not available
>> +stressapptest -s 1 -M 100 -m 1 > /dev/null 2>&1 || exit 2
> 
> I don't know how popular it is, but we can print some info
> in case we miss it.

I just realized that perf-bench-sched-messaging is a good test to use,
so we don't need stressapptest. I attached the updated version below.

> 
>> +
>> +base_cycles=$(perf stat --no-big-num -e cycles -- stressapptest -s 3 -M 100 
>> -m 1 2>&1 | grep -e cycles | awk '{print $1}')
>> +bpf_cycles=$(perf stat --no-big-num --bpf-counters -e cycles -- 
>> stressapptest -s 3 -M 100 -m 1 2>&1 | grep -e cycles | awk '{print $1}')
> 
> I think just awk '/cycles/ {print $1}' should work.

Thanks! Fixed in the new version. 

Song




cat tools/perf/tests/shell/stat_bpf_counters.sh
#!/bin/sh
# perf stat --bpf-counters test
# SPDX-License-Identifier: GPL-2.0

set -e

# check whether $2 is within +/- 10% of $1
compare_number()
{
        first_num=$1
        second_num=$2

        # upper bound is first_num * 110%
        upper=$(( $first_num + $first_num / 10 ))
        # lower bound is first_num * 90%
        lower=$(( $first_num - $first_num / 10 ))

        if [ $second_num -gt $upper ] || [ $second_num -lt $lower ]; then
                echo "The difference between $first_num and $second_num are greater than 10%."
                exit 1
        fi
}

# skip if --bpf-counters is not supported
perf stat --bpf-counters true > /dev/null 2>&1 || exit 2

base_cycles=$(perf stat --no-big-num -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')
bpf_cycles=$(perf stat --no-big-num --bpf-counters -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')

compare_number $base_cycles $bpf_cycles
exit 0







Re: [PATCH v2 1/3] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-18 Thread Song Liu



> On Mar 17, 2021, at 10:54 PM, Namhyung Kim  wrote:
> 

[...]

>> +
>> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
>> +  struct perf_event_attr_map_entry 
>> *entry)
>> +{
>> +   struct bperf_leader_bpf *skel = bperf_leader_bpf__open();
>> +   int link_fd, diff_map_fd, err;
>> +   struct bpf_link *link = NULL;
>> +
>> +   if (!skel) {
>> +   pr_err("Failed to open leader skeleton\n");
>> +   return -1;
>> +   }
>> +
>> +   bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
>> +   err = bperf_leader_bpf__load(skel);
>> +   if (err) {
>> +   pr_err("Failed to load leader skeleton\n");
>> +   goto out;
>> +   }
>> +
>> +   err = -1;
>> +   link = bpf_program__attach(skel->progs.on_switch);
>> +   if (!link) {
>> +   pr_err("Failed to attach leader program\n");
>> +   goto out;
>> +   }
>> +
>> +   link_fd = bpf_link__fd(link);
>> +   diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
>> +   entry->link_id = bpf_link_get_id(link_fd);
>> +   entry->diff_map_id = bpf_map_get_id(diff_map_fd);
>> +   err = bpf_map_update_elem(attr_map_fd, >core.attr, entry, 
>> BPF_ANY);
>> +   assert(err == 0);
>> +
>> +   evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id);
>> +   assert(evsel->bperf_leader_link_fd >= 0);
> 
> Isn't it the same as link_fd?

This is a different fd on the same link. 

> 
>> +
>> +   /*
>> +* save leader_skel for install_pe, which is called within
>> +* following evsel__open_per_cpu call
>> +*/
>> +   evsel->leader_skel = skel;
>> +   evsel__open_per_cpu(evsel, all_cpu_map, -1);
>> +
>> +out:
>> +   bperf_leader_bpf__destroy(skel);
>> +   bpf_link__destroy(link);
> 
> Why do we destroy it?  Is it because we get an another reference?

Yes. We only need evsel->bperf_leader_link_fd to keep the whole 
skeleton attached. 

When multiple perf-stat sessions are sharing the leader skeleton, 
only the first one loads the leader skeleton, by calling 
bperf_reload_leader_program(). Other sessions simply hold a fd to 
the bpf_link. More explanation in bperf__load() below.  
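
Roughly, the sharing boils down to something like this (just a sketch of
the idea, not the exact patch code):

#include <bpf/bpf.h>

/* A later perf-stat session re-opens the leader's bpf_link by id. Holding
 * this fd is enough to keep the leader program and its maps alive, even
 * though the session that originally loaded the skeleton has destroyed it. */
static int reuse_leader_link(__u32 link_id)
{
        int link_fd = bpf_link_get_fd_by_id(link_id);

        if (link_fd < 0)
                return -1;      /* leader is gone; the caller has to reload it */
        return link_fd;         /* keep this fd open for the whole session */
}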


> 
>> +   return err;
>> +}
>> +
>> +static int bperf__load(struct evsel *evsel, struct target *target)
>> +{
>> +   struct perf_event_attr_map_entry entry = {0x, 0x};
>> +   int attr_map_fd, diff_map_fd = -1, err;
>> +   enum bperf_filter_type filter_type;
>> +   __u32 filter_entry_cnt, i;
>> +
>> +   if (bperf_check_target(evsel, target, _type, 
>> _entry_cnt))
>> +   return -1;
>> +
>> +   if (!all_cpu_map) {
>> +   all_cpu_map = perf_cpu_map__new(NULL);
>> +   if (!all_cpu_map)
>> +   return -1;
>> +   }
>> +
>> +   evsel->bperf_leader_prog_fd = -1;
>> +   evsel->bperf_leader_link_fd = -1;
>> +
>> +   /*
>> +* Step 1: hold a fd on the leader program and the bpf_link, if
>> +* the program is not already gone, reload the program.
>> +* Use flock() to ensure exclusive access to the perf_event_attr
>> +* map.
>> +*/
>> +   attr_map_fd = bperf_lock_attr_map(target);
>> +   if (attr_map_fd < 0) {
>> +   pr_err("Failed to lock perf_event_attr map\n");
>> +   return -1;
>> +   }
>> +
>> +   err = bpf_map_lookup_elem(attr_map_fd, >core.attr, );
>> +   if (err) {
>> +   err = bpf_map_update_elem(attr_map_fd, >core.attr, 
>> , BPF_ANY);
>> +   if (err)
>> +   goto out;
>> +   }
>> +
>> +   evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry.link_id);
>> +   if (evsel->bperf_leader_link_fd < 0 &&
>> +   bperf_reload_leader_program(evsel, attr_map_fd, ))
>> +   goto out;

Continuing the previous explanation: in bperf_reload_leader_program(), we
open another reference to the link and destroy the skeleton. This brings
the code to the same state as the evsel->bperf_leader_link_fd >= 0
condition above.

>> +
>> +   /*
>> +* The bpf_link holds reference to the leader program, and the
>> +* leader program holds reference to the maps. Therefore, if
>> +* link_id is valid, diff_map_id should also be valid.
>> +*/
>> +   evsel->bperf_leader_prog_fd = bpf_prog_get_fd_by_id(
>> +   bpf_link_get_prog_id(evsel->bperf_leader_link_fd));
>> +   assert(evsel->bperf_leader_prog_fd >= 0);
>> +
>> +   diff_map_fd = bpf_map_get_fd_by_id(entry.diff_map_id);
>> +   assert(diff_map_fd >= 0);
>> +

[...]

>> +static int bperf__read(struct evsel *evsel)
>> +{
>> +   struct bperf_follower_bpf *skel = evsel->follower_skel;
>> +   __u32 num_cpu_bpf = cpu__max_cpu();
>> +   struct bpf_perf_event_value values[num_cpu_bpf];
>> +  

Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Song Liu



> On Mar 17, 2021, at 9:32 PM, Namhyung Kim  wrote:
> 
> On Thu, Mar 18, 2021 at 12:52 PM Song Liu  wrote:
>> 
>> 
>> 
>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo  
>>> wrote:
>>> 
>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>>>> Hi Song,
>>>> 
>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
>>>>> 
>>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>>> performance. The PMCs are limited hardware resources. For example,
>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>>> 
>>>>> Modern data center systems use these PMCs in many different ways:
>>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>>> there are more active perf_events than available hardware PMCs. To allow
>>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>>> time multiplexing of events.
>>>>> 
>>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>>> instructions). It is a waste to have multiple tools create multiple
>>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>> 
>>>> Right, it'd be really helpful when the PMCs are frequently or mostly 
>>>> shared.
>>>> But it'd also increase the overhead for uncontended cases as BPF programs
>>>> need to run on every context switch.  Depending on the workload, it may
>>>> cause a non-negligible performance impact.  So users should be aware of it.
>>> 
>>> Would be interesting to, humm, measure both cases to have a firm number
>>> of the impact, how many instructions are added when sharing using
>>> --bpf-counters?
>>> 
>>> I.e. compare the "expensive time multiplexing of events" with its
>>> avoidance by using --bpf-counters.
>>> 
>>> Song, have you perfmormed such measurements?
>> 
>> I have got some measurements with perf-bench-sched-messaging:
>> 
>> The system: x86_64 with 23 cores (46 HT)
>> 
>> The perf-stat command:
>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles 
>> 
>> 
>> The benchmark command and output:
>> ./perf bench sched messaging -g 40 -l 5 -t
>> # Running 'sched/messaging' benchmark:
>> # 20 sender and receiver threads per group
>> # 40 groups == 1600 threads run
>> Total time: 10X.XXX [sec]
>> 
>> 
>> I use the "Total time" as measurement, so smaller number is better.
>> 
>> For each condition, I run the command 5 times, and took the median of
>> "Total time".
>> 
>> Baseline (no perf-stat) 104.873 [sec]
>> # global
>> perf stat -a107.887 [sec]
>> perf stat -a --bpf-counters 106.071 [sec]
>> # per task
>> perf stat   106.314 [sec]
>> perf stat --bpf-counters105.965 [sec]
>> # per cpu
>> perf stat -C 1,3,5  107.063 [sec]
>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>> 
>> From the data, --bpf-counters is slightly better than the regular event
>> for all targets. I noticed that the results are not very stable. There
>> are a couple 108.xx runs in some of the conditions (w/ and w/o
>> --bpf-counters).
> 
> Hmm.. so this result is when multiplexing happened, right?
> I wondered how/why the regular perf stat is slower..

I should have made this clearer. This is when regular perf-stat does time
multiplexing (2x ref-cycles on Intel). OTOH, bpf-counters enables sharing,
so there is no time multiplexing. IOW, this is the overhead of BPF vs. the
overhead of time multiplexing.

Thanks,
Song

Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-17 Thread Song Liu



> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>> Hi Song,
>> 
>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
>>> 
>>> perf uses performance monitoring counters (PMCs) to monitor system
>>> performance. The PMCs are limited hardware resources. For example,
>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>> 
>>> Modern data center systems use these PMCs in many different ways:
>>> system level monitoring, (maybe nested) container level monitoring, per
>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>> there are more active perf_events than available hardware PMCs. To allow
>>> all perf_events to have a chance to run, it is necessary to do expensive
>>> time multiplexing of events.
>>> 
>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>> instructions). It is a waste to have multiple tools create multiple
>>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
>> But it'd also increase the overhead for uncontended cases as BPF programs
>> need to run on every context switch.  Depending on the workload, it may
>> cause a non-negligible performance impact.  So users should be aware of it.
> 
> Would be interesting to, humm, measure both cases to have a firm number
> of the impact, how many instructions are added when sharing using
> --bpf-counters?
> 
> I.e. compare the "expensive time multiplexing of events" with its
> avoidance by using --bpf-counters.
> 
> Song, have you perfmormed such measurements?

I have got some measurements with perf-bench-sched-messaging:

The system: x86_64 with 23 cores (46 HT)

The perf-stat command:
perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles 


The benchmark command and output:
./perf bench sched messaging -g 40 -l 5 -t
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 40 groups == 1600 threads run
 Total time: 10X.XXX [sec]


I use the "Total time" as measurement, so smaller number is better. 

For each condition, I ran the command 5 times and took the median of
"Total time".

Baseline (no perf-stat)             104.873 [sec]
# global
perf stat -a                        107.887 [sec]
perf stat -a --bpf-counters         106.071 [sec]
# per task
perf stat                           106.314 [sec]
perf stat --bpf-counters            105.965 [sec]
# per cpu
perf stat -C 1,3,5                  107.063 [sec]
perf stat -C 1,3,5 --bpf-counters   106.406 [sec]

From the data, --bpf-counters is slightly better than the regular event
for all targets. I noticed that the results are not very stable. There
are a couple of 108.xx runs in some of the conditions (w/ and w/o
--bpf-counters).


I also measured the average runtime of the BPF programs, with 

sysctl kernel.bpf_stats_enabled=1

For each event, if we have one leader and two followers, the total run
time is about 340ns. IOW, about 340ns for two perf-stat sessions reading
instructions, 340ns for two sessions reading cycles, etc.
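
For reference, a minimal sketch of how these per-program numbers can be
read back from bpf_prog_info once the sysctl is set (the program id
argument is whatever bpftool prog show reports):

// Sketch: print the average BPF program run time, available when
// kernel.bpf_stats_enabled=1.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <bpf/bpf.h>

int main(int argc, char **argv)
{
        struct bpf_prog_info info = {};
        __u32 info_len = sizeof(info);
        int fd;

        if (argc < 2)
                return 1;
        fd = bpf_prog_get_fd_by_id((__u32)atoi(argv[1]));
        if (fd < 0 || bpf_obj_get_info_by_fd(fd, &info, &info_len))
                return 1;
        if (info.run_cnt)
                printf("avg run time: %llu ns over %llu runs\n",
                       (unsigned long long)(info.run_time_ns / info.run_cnt),
                       (unsigned long long)info.run_cnt);
        close(fd);
        return 0;
}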

Thanks,
Song

[PATCH v2 3/3] perf-test: add a test for perf-stat --bpf-counters option

2021-03-16 Thread Song Liu
Add a test to compare the output of perf-stat with and without the
--bpf-counters option. If the difference is more than 10%, the test is
considered failed.

For stable results between two runs (w/ and w/o --bpf-counters), the test
program should: 1) be long enough for a better signal-to-noise ratio; 2) not
depend on the behavior of the IO subsystem (for less noise from caching). So
far, the best option we found is stressapptest.

Signed-off-by: Song Liu 
---
 tools/perf/tests/shell/stat_bpf_counters.sh | 34 +
 1 file changed, 34 insertions(+)
 create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh

diff --git a/tools/perf/tests/shell/stat_bpf_counters.sh b/tools/perf/tests/shell/stat_bpf_counters.sh
new file mode 100755
index 0..c0bcb38d6b53c
--- /dev/null
+++ b/tools/perf/tests/shell/stat_bpf_counters.sh
@@ -0,0 +1,34 @@
+#!/bin/sh
+# perf stat --bpf-counters test
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+# check whether $2 is within +/- 10% of $1
+compare_number()
+{
+   first_num=$1
+   second_num=$2
+
+   # upper bound is first_num * 110%
+   upper=$(( $first_num + $first_num / 10 ))
+   # lower bound is first_num * 90%
+   lower=$(( $first_num - $first_num / 10 ))
+
+   if [ $second_num -gt $upper ] || [ $second_num -lt $lower ]; then
+   echo "The difference between $first_num and $second_num are 
greater than 10%."
+   exit 1
+   fi
+}
+
+# skip if --bpf-counters is not supported
+perf stat --bpf-counters true > /dev/null 2>&1 || exit 2
+
+# skip if stressapptest is not available
+stressapptest -s 1 -M 100 -m 1 > /dev/null 2>&1 || exit 2
+
+base_cycles=$(perf stat --no-big-num -e cycles -- stressapptest -s 3 -M 100 -m 1 2>&1 | grep -e cycles | awk '{print $1}')
+bpf_cycles=$(perf stat --no-big-num --bpf-counters -e cycles -- stressapptest -s 3 -M 100 -m 1 2>&1 | grep -e cycles | awk '{print $1}')
+
+compare_number $base_cycles $bpf_cycles
+exit 0
-- 
2.30.2



[PATCH v2 2/3] perf-stat: measure t0 and ref_time after enable_counters()

2021-03-16 Thread Song Liu
Take measurements of t0 and ref_time after enable_counters(), so that
they only measure the time consumed when the counters are enabled.

Signed-off-by: Song Liu 
---
 tools/perf/builtin-stat.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 92696373da994..d030c3a49a8e4 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -931,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
/*
 * Enable counters and exec the command:
 */
-   t0 = rdclock();
-   clock_gettime(CLOCK_MONOTONIC, _time);
-
if (forks) {
evlist__start_workload(evsel_list);
err = enable_counters();
if (err)
return -1;
 
+   t0 = rdclock();
+   clock_gettime(CLOCK_MONOTONIC, _time);
+
if (interval || timeout || 
evlist__ctlfd_initialized(evsel_list))
status = dispatch_events(forks, timeout, interval, 
);
if (child_pid != -1) {
@@ -960,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
err = enable_counters();
if (err)
return -1;
+
+   t0 = rdclock();
+   clock_gettime(CLOCK_MONOTONIC, _time);
+
status = dispatch_events(forks, timeout, interval, );
}
 
-- 
2.30.2



[PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-16 Thread Song Liu
perf uses performance monitoring counters (PMCs) to monitor system
performance. The PMCs are limited hardware resources. For example,
Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways:
system level monitoring, (maybe nested) container level monitoring, per
process monitoring, profiling (in sample mode), etc. In some cases,
there are more active perf_events than available hardware PMCs. To allow
all perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics (cycles,
instructions). It is a waste to have multiple tools create multiple
perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such waste by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate the readings into BPF
maps. Then, the perf-stat session(s) read the values from these BPF maps.
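
To make the data flow concrete, here is a minimal sketch of a leader
program along these lines. This is an illustration only, not the actual
bperf_leader.bpf.c from this series; the attach point and map shapes are
assumptions based on the description above.

// Illustrative leader sketch: on every context switch, read the shared
// hardware event and publish the delta since the last reading.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(int));
} events SEC(".maps");                  /* resized to nr_cpus before load */

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct bpf_perf_event_value);
} prev_readings SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct bpf_perf_event_value);
} diff_readings SEC(".maps");

SEC("raw_tp/sched_switch")              /* attach point is an assumption */
int on_switch(__u64 *ctx)
{
        struct bpf_perf_event_value val, *prev, *diff;
        __u32 zero = 0;

        prev = bpf_map_lookup_elem(&prev_readings, &zero);
        diff = bpf_map_lookup_elem(&diff_readings, &zero);
        if (!prev || !diff)
                return 0;

        /* read the hardware counter shared by all perf-stat sessions */
        if (bpf_perf_event_read_value(&events, BPF_F_CURRENT_CPU,
                                      &val, sizeof(val)))
                return 0;

        diff->counter = val.counter - prev->counter;
        diff->enabled = val.enabled - prev->enabled;
        diff->running = val.running - prev->running;
        *prev = val;
        return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

A follower program would then fold diff_readings into its own accumulation
(per task, per cpu, etc.), which is what the perf-stat session eventually
reads.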

Changes v1 => v2:
  1. Add documentation.
  2. Add a shell test.
  3. Rename options, default path of the atto-map, and some variables.
  4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
 to after enable_counters().
  5. Make perf_cpu_map for all cpus a global variable.
  6. Use sysfs__mountpoint() for default attr-map path.
  7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
  8. Add flag "enabled" to the follower program. Then move follower attach
 to bperf__load() and simplify bperf__enable().

Song Liu (3):
  perf-stat: introduce bperf, share hardware PMCs with BPF
  perf-stat: measure t0 and ref_time after enable_counters()
  perf-test: add a test for perf-stat --bpf-counters option

 tools/perf/Documentation/perf-stat.txt|  11 +
 tools/perf/Makefile.perf  |   1 +
 tools/perf/builtin-stat.c |  20 +-
 tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
 tools/perf/util/bpf_counter.c | 519 +-
 tools/perf/util/bpf_skel/bperf.h  |  14 +
 tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
 tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
 tools/perf/util/bpf_skel/bperf_u.h|  14 +
 tools/perf/util/evsel.h   |  20 +-
 tools/perf/util/target.h  |   4 +-
 11 files changed, 742 insertions(+), 10 deletions(-)
 create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
 create mode 100644 tools/perf/util/bpf_skel/bperf.h
 create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_u.h

--
2.30.2


[PATCH v2 1/3] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-16 Thread Song Liu
perf uses performance monitoring counters (PMCs) to monitor system
performance. The PMCs are limited hardware resources. For example,
Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways:
system level monitoring, (maybe nested) container level monitoring, per
process monitoring, profiling (in sample mode), etc. In some cases,
there are more active perf_events than available hardware PMCs. To allow
all perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics (cycles,
instructions). It is a waste to have multiple tools create multiple
perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such waste by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate the readings into BPF
maps. Then, the perf-stat session(s) read the values from these BPF maps.

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --bpf-counters option to
perf-stat. bperf uses a BPF hashmap to share information about BPF
programs and maps used by bperf. This map is pinned to bpffs. The default
path is /sys/fs/bpf/perf_attr_map. The user could change the path with
option --bpf-attr-map.

Signed-off-by: Song Liu 

---
Known limitations:
1. Do not support per cgroup events;
2. Do not support monitoring of BPF program (perf-stat -b);
3. Do not support event groups;
4. Do not support inherit events during fork().

The following commands have been tested:

   perf stat --bpf-counters -e cycles,ref-cycles -a
   perf stat --bpf-counters -e cycles,instructions -C 1,3,4
   perf stat --bpf-counters -e cycles -p 123
   perf stat --bpf-counters -e cycles -t 100,101
   perf stat --bpf-counters -e cycles,ref-cycles -- stressapptest ...
---
 tools/perf/Documentation/perf-stat.txt|  11 +
 tools/perf/Makefile.perf  |   1 +
 tools/perf/builtin-stat.c |  10 +
 tools/perf/util/bpf_counter.c | 519 +-
 tools/perf/util/bpf_skel/bperf.h  |  14 +
 tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
 tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
 tools/perf/util/bpf_skel/bperf_u.h|  14 +
 tools/perf/util/evsel.h   |  20 +-
 tools/perf/util/target.h  |   4 +-
 10 files changed, 701 insertions(+), 7 deletions(-)
 create mode 100644 tools/perf/util/bpf_skel/bperf.h
 create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_u.h

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 08a1714494f87..d2e7656b5ef81 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -93,6 +93,17 @@ report::
 
 1.102235068 seconds time elapsed
 
+--bpf-counters::
+   Use BPF programs to aggregate readings from perf_events.  This
+   allows multiple perf-stat sessions that are counting the same metric 
(cycles,
+   instructions, etc.) to share hardware counters.
+
+--bpf-attr-map::
+   With option "--bpf-counters", different perf-stat sessions share
+   information about shared BPF programs and maps via a pinned hashmap.
+   Use "--bpf-attr-map" to specify the path of this pinned hashmap.
+   The default path is /sys/fs/bpf/perf_attr_map.
+
 ifdef::HAVE_LIBPFM[]
 --pfm-events events::
 Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index f6e609673de2b..ca9aa08e85a1f 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1007,6 +1007,7 @@ python-clean:
 SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
 SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
+SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 
 ifdef BUILD_BPF_SKEL
 BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2e2e4a8345ea2..92696373da994 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
}
 
evlist__for_each_cpu (evsel_list, i, cpu) {
+   /*
+* bperf calls evsel__open_per_cpu() in bperf__load(), so
+* no need to call it again here.
+*/
+   if (target.use_bpf)
+   

Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-15 Thread Song Liu



> On Mar 15, 2021, at 8:34 AM, Andi Kleen  wrote:
> 
>> I still think that there is value in taking those measurements after we
>> enable the counters, as, for instance, for interval mode we want
>> measurements with the counters enabled, whatever happens before the
>> counters are enabled is just startup time, etc. Jiri, Andi?
> 
> Yes I agree. Only the time with counters enabled matters.

I see. This change itself is a valid fix. I will create a separate patch.

Thanks,
Song


Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-15 Thread Song Liu



> On Mar 15, 2021, at 7:09 AM, Jiri Olsa  wrote:
> 
> On Mon, Mar 15, 2021 at 07:51:11AM +0000, Song Liu wrote:
> 
> SNIP

[...]

>> 
>> It is mostly to cover corner cases, like something else used the same 
>> name. 
> 
> about that.. we just take the object fd assuming it's map,
> should we test it somehow?
> 
>  map_fd = bpf_obj_get(path);
> 
> if it's not the map we expect, I think we should generate
> another name without forcing user to run again with --attr-map
> 
> but still warn, so other perf session can use the new name

The auto-failover is an interesting idea, but I guess we still need
--attr-map. Another use case is when the user mounts bpffs at a
different path. Alternatively, maybe we can teach perf to search all
bpffs mount points for perf_attr_map?
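
Something along these lines, as a rough sketch (perf would presumably go
through its own fs helpers instead):

// Sketch: pick a path for perf_attr_map by scanning bpffs mounts,
// preferring a mount that already has the map pinned.
#include <limits.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int find_attr_map_path(char *buf, size_t size)
{
        FILE *fp = setmntent("/proc/mounts", "r");
        struct mntent *mnt;
        char first[PATH_MAX] = "";
        int found = -1;

        if (!fp)
                return -1;
        while ((mnt = getmntent(fp)) != NULL) {
                if (strcmp(mnt->mnt_type, "bpf"))
                        continue;               /* not a bpffs mount */
                snprintf(buf, size, "%s/perf_attr_map", mnt->mnt_dir);
                if (!first[0])
                        snprintf(first, sizeof(first), "%s", buf);
                if (access(buf, F_OK) == 0) {   /* reuse an existing map */
                        found = 0;
                        break;
                }
        }
        endmntent(fp);
        if (found && first[0]) {    /* no existing map: default to first bpffs mount */
                snprintf(buf, size, "%s", first);
                found = 0;
        }
        return found;
}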

> 
> SNIP
> 
>>>> +static int bperf_sync_counters(struct evsel *evsel)
>>>> +{
>>>> +  struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
>>>> +  int num_cpu, i, cpu;
>>>> +
>>>> +  if (!all_cpus)
>>>> +  return -1;
>>>> +
>>>> +  num_cpu = all_cpus->nr;
>>>> +  for (i = 0; i < num_cpu; i++) {
>>>> +  cpu = all_cpus->map[i];
>>>> +  bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
>>>> +  }
>>>> +  return 0;
>>>> +}
>>>> +
>>>> +static int bperf__enable(struct evsel *evsel)
>>>> +{
>>>> +  struct bperf_follower_bpf *skel = evsel->follower_skel;
>>>> +  __u32 num_cpu_bpf = libbpf_num_possible_cpus();
>>> 
>>> we have cpu__max_cpu for that
>> 
>> libbpf calls for percpu array use libbpf_num_possible_cpus. So I guess it 
>> is better to use the same here. The two are identical at the moment though.
> 
> then in the bperf__read you take that array and update
> perf_counts, which is based on perf's cpus, so they mix
> anyway
> 
> I'd like to keep perf code using perf's cpus api.. could
> we just check at the begining that libbpf_num_possible_cpus
> returns same number as cpu__max_cpu (if not, we have a
> problem anyway) and use perf's cpu api

Let me try cpu__max_cpu. 

Thanks,
Song



Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-15 Thread Song Liu



> On Mar 14, 2021, at 3:48 PM, Jiri Olsa  wrote:
> 
> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
> 
> isn't that another filter via bpf_get_current_cgroup_id ?

This is tricky with nested cgroups. We also have
bpf_get_current_ancestor_cgroup_id, but that's not exactly what we need
either. We may need a new helper.

Also, we are limited to 38 follower programs per leader program, which
might be too few for Namhyung's use case. We can use some logic in the
follower program to count events for many cgroups in one fexit program.
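
As a rough sketch of that last idea (map shapes, sizes, and the attach
point below are hypothetical, not code from this series):

// Hypothetical follower sketch: bucket the leader's per-cpu delta by the
// current cgroup id instead of by task/cpu.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct bpf_perf_event_value);
} diff_readings SEC(".maps");           /* assumed shared with the leader */

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);      /* illustrative size */
        __type(key, __u64);             /* cgroup id */
        __type(value, __u64);           /* accumulated counter delta */
} cgroup_counts SEC(".maps");

SEC("fexit/XXX")                        /* placeholder attach point */
int collect_cgroup(void *ctx)
{
        __u64 cgrp = bpf_get_current_cgroup_id();
        struct bpf_perf_event_value *diff;
        __u64 *acc, init = 0;
        __u32 zero = 0;

        diff = bpf_map_lookup_elem(&diff_readings, &zero);
        if (!diff)
                return 0;

        acc = bpf_map_lookup_elem(&cgroup_counts, &cgrp);
        if (!acc) {
                bpf_map_update_elem(&cgroup_counts, &cgrp, &init, BPF_NOEXIST);
                acc = bpf_map_lookup_elem(&cgroup_counts, &cgrp);
                if (!acc)
                        return 0;
        }
        __sync_fetch_and_add(acc, diff->counter);
        return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Nested cgroups would still need either a walk over ancestor ids or a new
helper, as mentioned above.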

> 
>> 2. Do not support monitoring of BPF program (perf-stat -b);
> 
> we'd need to call the leadr on fentry/fexit of the monitored bpf
> program, right?

My current plan is to let perf-stat -b share the same perf_event map with
bperf, but not the leader program. 

> 
>> 3. Do not support event groups.
> 
> I guess for group support you'll need to load 'leaders' for each group member

I am not sure how this would work. Say the user started the following in 
parallel:

#1  perf stat --bpf-counters -e cycles -a &
#2  perf stat --bpf-counters -e instructions -C 1,2,3 &
#3  perf stat --bpf-counters -e {cycles,instructions} -p 123 

Event "cycles" is loaded by #1; while event "instruction" is loaded by #2.
If #3 reuses these events, it is tricky (or impossible?) to make sure the
event group works as expected, right?

> 
>> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
> 
> works good with that file you sent.. I'll check/test more,
> so far some quick comments below
> 
> thanks,
> jirka
> 
> 
> 
> SNIP
> 
>> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
>> #ifdef HAVE_BPF_SKEL
>>  OPT_STRING('b', "bpf-prog", _str, "bpf-prog-id",
>> "stat events on existing bpf program id"),
>> +OPT_BOOLEAN(0, "use-bpf", _bpf,
>> +"use bpf program to count events"),
>> +OPT_STRING(0, "attr-map", _map, "attr-map-path",
>> +   "path to perf_event_attr map"),
> 
> what's the point of allowing another name? just debug purpose?

It is mostly to cover corner cases, like something else using the same
name.

> 
> SNIP
> 
>> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
>> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
>> + * no concurrent access to the attr_map.  The key of attr_map is struct
>> + * perf_event_attr, and the value is struct bperf_attr_map_entry.
>> + *
>> + * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the

Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-13 Thread Song Liu



> On Mar 13, 2021, at 2:06 PM, Jiri Olsa  wrote:
> 
> On Fri, Mar 12, 2021 at 04:09:53PM +0000, Song Liu wrote:
>> 
>> 
>>> On Mar 12, 2021, at 7:45 AM, Song Liu  wrote:
>>> 
>>> 
>>> 
>>>> On Mar 12, 2021, at 4:12 AM, Jiri Olsa  wrote:
>>>> 
>>>> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>>> performance. The PMCs are limited hardware resources. For example,
>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>>> 
>>>>> Modern data center systems use these PMCs in many different ways:
>>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>>> there are more active perf_events than available hardware PMCs. To allow
>>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>>> time multiplexing of events.
>>>>> 
>>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>>> instructions). It is a waste to have multiple tools create multiple
>>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>>> 
>>>>> bperf tries to reduce such wastes by allowing multiple perf_events of
>>>>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>>>>> of having each perf-stat session to read its own perf_events, bperf uses
>>>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>>>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>>>> 
>>>>> Please refer to the comment before the definition of bperf_ops for the
>>>>> description of bperf architecture.
>>>>> 
>>>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>>>> used by bperf. This map is pinned to bpffs. The default address is
>>>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>>>> --attr-map.
>>>> 
>>>> nice, I recall the presentation about that and was wondering
>>>> when this will come up ;-)
>>> 
>>> The progress is slower than I expected. But I finished some dependencies of 
>>> this in the last year: 
>>> 
>>> 1. BPF_PROG_TEST_RUN for raw_tp event;
>>> 2. perf-stat -b, which introduced skeleton and bpf_counter;
>>> 3. BPF task local storage, I didn't use it in this version, but it could,
>>>help optimize bperf in the future. 
>>> 
>>>> 
>>>>> 
>>>>> ---
>>>>> Known limitations:
>>>>> 1. Do not support per cgroup events;
>>>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>>>> 3. Do not support event groups.
>>>>> 
>>>>> The following commands have been tested:
>>>>> 
>>>>> perf stat --use-bpf -e cycles -a
>>>>> perf stat --use-bpf -e cycles -C 1,3,4
>>>>> perf stat --use-bpf -e cycles -p 123
>>>>> perf stat --use-bpf -e cycles -t 100,101
>>>> 
>>>> I assume the output is same as standard perf?
>> 
>> Btw, please give it a try. :) 
>> 
>> It worked pretty well in my tests. If it doesn't work for some combination 
>> of options, please let me know. 
> 
> heya, can't compile
> 
>  CLANG
> /home/jolsa/linux-perf/tools/perf/util/bpf_skel/.tmp/bperf_follower.bpf.o
> util/bpf_skel/bperf_follower.bpf.c:8:10: fatal error: 'bperf_u.h' file not 
> found
> #include "bperf_u.h"
> ^~~

Oops, I forgot git-add. :( 

The file is very simple:

tools/perf/util/bpf_skel/bperf_u.h:


// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
// Copyright (c) 2021 Facebook

#ifndef __BPERF_STAT_U_H
#define __BPERF_STAT_U_H

enum bperf_filter_type {
BPERF_FILTER_GLOBAL = 1,
BPERF_FILTER_CPU,
BPERF_FILTER_PID,
BPERF_FILTER_TGID,
};

#endif /* __BPERF_STAT_U_H */

Thanks,
Song



Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-13 Thread Song Liu



> On Mar 12, 2021, at 6:47 PM, Namhyung Kim  wrote:
> 
> On Sat, Mar 13, 2021 at 12:38 AM Song Liu  wrote:
>> 
>> 
>> 
>>> On Mar 12, 2021, at 12:36 AM, Namhyung Kim  wrote:
>>> 
>>> Hi,
>>> 
>>> On Fri, Mar 12, 2021 at 11:03 AM Song Liu  wrote:
>>>> 
>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>> performance. The PMCs are limited hardware resources. For example,
>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>> 
>>>> Modern data center systems use these PMCs in many different ways:
>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>> there are more active perf_events than available hardware PMCs. To allow
>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>> time multiplexing of events.
>>>> 
>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>> instructions). It is a waste to have multiple tools create multiple
>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>> 
>>>> bperf tries to reduce such wastes by allowing multiple perf_events of
>>>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>>>> of having each perf-stat session to read its own perf_events, bperf uses
>>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>>> 
>>>> Please refer to the comment before the definition of bperf_ops for the
>>>> description of bperf architecture.
>>> 
>>> Interesting!  Actually I thought about something similar before,
>>> but my BPF knowledge is outdated.  So I need to catch up but
>>> failed to have some time for it so far. ;-)
>>> 
>>>> 
>>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>>> used by bperf. This map is pinned to bpffs. The default address is
>>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>>> --attr-map.
>>>> 
>>>> ---
>>>> Known limitations:
>>>> 1. Do not support per cgroup events;
>>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>>> 3. Do not support event groups.
>>> 
>>> In my case, per cgroup event counting is very important.
>>> And I'd like to do that with lots of cpus and cgroups.
>> 
>> We can easily extend this approach to support cgroups events. I didn't
>> implement it to keep the first version simple.
> 
> OK.
> 
>> 
>>> So I'm working on an in-kernel solution (without BPF),
>>> I hope to share it soon.
>> 
>> This is interesting! I cannot wait to see how it looks like. I spent
>> quite some time try to enable in kernel sharing (not just cgroup
>> events), but finally decided to try BPF approach.
> 
> Well I found it hard to support generic event sharing that works
> for all use cases.  So I'm focusing on the per cgroup case only.
> 
>> 
>>> 
>>> And for event groups, it seems the current implementation
>>> cannot handle more than one event (not even in a group).
>>> That could be a serious limitation..
>> 
>> It supports multiple events. Multiple events are independent, i.e.,
>> "cycles" and "instructions" would use two independent leader programs.
> 
> OK, then do you need multiple bperf_attr_maps?  Does it work
> for an arbitrary number of events?

The bperf_attr_map (or perf_attr_map) is shared among different events. 
It is a hash map with perf_event_attr as the key. Currently, I hard coded
its size to 16. We can introduce more flexible management of this map. 
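
For context, creating and pinning such a map could look roughly like this
(a sketch, not the exact perf code; the entry layout follows the
link_id/diff_map_id fields discussed earlier in the thread):

// Sketch: a pinned hash map keyed by perf_event_attr, so that independent
// perf-stat sessions can find the shared leader for a given event.
#include <linux/perf_event.h>
#include <unistd.h>
#include <bpf/bpf.h>

struct perf_event_attr_map_entry {
        __u32 link_id;          /* bpf_link id of the leader program */
        __u32 diff_map_id;      /* id of the leader's diff_readings map */
};

#define ATTR_MAP_SIZE 16        /* the hard-coded size mentioned above */

static int create_attr_map(const char *path)  /* e.g. /sys/fs/bpf/perf_attr_map */
{
        int fd = bpf_create_map(BPF_MAP_TYPE_HASH,
                                sizeof(struct perf_event_attr),
                                sizeof(struct perf_event_attr_map_entry),
                                ATTR_MAP_SIZE, 0);

        if (fd < 0)
                return -1;
        /* pin to bpffs so later sessions can bpf_obj_get() it */
        if (bpf_obj_pin(fd, path)) {
                close(fd);
                return -1;
        }
        return fd;
}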

Thanks,
Song



Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-12 Thread Song Liu



> On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
> 
> Cool stuff, but I think you can break this up into more self contained
> patches, see below.
> 
> Apart from that, some suggestions/requests:
> 
> We need a shell 'perf test' that uses some synthetic workload so that we
> can count events with/without --use-bpf (--bpf-counters is my
> alternative name, see below), and then compare if the difference is
> under some acceptable range.
> 
> As a followup patch we could have something like:
> 
> perf config stat.bpf-counters=yes
> 
> That would make 'perf stat' use BPF counters for what it can, using the
> default method for the non-supported targets, emitting some 'perf stat
> -v' visible warning (i.e. a debug message), i.e. make it opt-in that the
> user wants to use BPF counters for all that is possible at that point in
> time.o
> 
> Thanks for working on this,
> 
> - Arnaldo
> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
>> 
>> Signed-off-by: Song Liu 
>> ---
>> tools/perf/Makefile.perf  |   1 +
>> tools/perf/builtin-stat.c |  20 +-
>> tools/perf/util/bpf_counter.c | 552 +-
>> tools/perf/util/bpf_skel/bperf.h  |  14 +
>> tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
>> tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>> tools/perf/util/evsel.h   |  20 +-
>> tools/perf/util/target.h  |   4 +-
>> 8 files changed, 712 insertions(+), 10 deletions(-)
>> create mode 100644 tools/perf/util/bpf_skel/bperf.h
>> create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>> create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>> 
>> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
>> index f6e609673de2b..ca9aa08e85a1f 100644
>> --- a/tools/perf/Makefile.perf
>> +++ b/tools/perf/Makefile.perf
>> @@ -1007,6 +1007,7 @@ python-clean:
>> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>> SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>> +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h 
>> $(SKEL_OUT)/bperf_follower.skel.h
>> 
>> ifdef BUILD_BPF_SKEL
>> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 2e2e4a8345ea2..34df713a8eea9 100644
>> --- a/tools/perf/builtin-stat.c
&

Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-12 Thread Song Liu



> On Mar 12, 2021, at 7:45 AM, Song Liu  wrote:
> 
> 
> 
>> On Mar 12, 2021, at 4:12 AM, Jiri Olsa  wrote:
>> 
>> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>>> perf uses performance monitoring counters (PMCs) to monitor system
>>> performance. The PMCs are limited hardware resources. For example,
>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>> 
>>> Modern data center systems use these PMCs in many different ways:
>>> system level monitoring, (maybe nested) container level monitoring, per
>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>> there are more active perf_events than available hardware PMCs. To allow
>>> all perf_events to have a chance to run, it is necessary to do expensive
>>> time multiplexing of events.
>>> 
>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>> instructions). It is a waste to have multiple tools create multiple
>>> perf_events of "cycles" and occupy multiple PMCs.
>>> 
>>> bperf tries to reduce such wastes by allowing multiple perf_events of
>>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>>> of having each perf-stat session to read its own perf_events, bperf uses
>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>> 
>>> Please refer to the comment before the definition of bperf_ops for the
>>> description of bperf architecture.
>>> 
>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>> used by bperf. This map is pinned to bpffs. The default address is
>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>> --attr-map.
>> 
>> nice, I recall the presentation about that and was wondering
>> when this will come up ;-)
> 
> The progress is slower than I expected. But I finished some dependencies of 
> this in the last year: 
> 
>  1. BPF_PROG_TEST_RUN for raw_tp event;
>  2. perf-stat -b, which introduced skeleton and bpf_counter;
>  3. BPF task local storage, I didn't use it in this version, but it could,
> help optimize bperf in the future. 
> 
>> 
>>> 
>>> ---
>>> Known limitations:
>>> 1. Do not support per cgroup events;
>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>> 3. Do not support event groups.
>>> 
>>> The following commands have been tested:
>>> 
>>>  perf stat --use-bpf -e cycles -a
>>>  perf stat --use-bpf -e cycles -C 1,3,4
>>>  perf stat --use-bpf -e cycles -p 123
>>>  perf stat --use-bpf -e cycles -t 100,101
>> 
>> I assume the output is same as standard perf?

Btw, please give it a try. :) 

It worked pretty well in my tests. If it doesn't work for some combination 
of options, please let me know. 

Thanks,
Song

> 
> Yes, the output is identical to that without --use-bpf option. 





Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-12 Thread Song Liu



> On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
> 
> Cool stuff, but I think you can break this up into more self contained
> patches, see below.
> 
> Apart from that, some suggestions/requests:
> 
> We need a shell 'perf test' that uses some synthetic workload so that we
> can count events with/without --use-bpf (--bpf-counters is my
> alternative name, see below), and then compare if the difference is
> under some acceptable range.

Yes, "perf test" makes sense. Would this be an extension of the current
perf-test command, or a new set of tests?

> 
> As a followup patch we could have something like:
> 
> perf config stat.bpf-counters=yes
> 
> That would make 'perf stat' use BPF counters for what it can, using the
> default method for the non-supported targets, emitting some 'perf stat
> -v' visible warning (i.e. a debug message), i.e. make it opt-in that the
> user wants to use BPF counters for all that is possible at that point in
> time.

Yes, the fallback mechanism will be very helpful. I also have an idea of
keeping a list of "common events" and only using BPF for those. Events
that are not on the list would just use the original method.
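
A very rough sketch of the idea (the list and the helper name below are
made up, just to illustrate the fallback):

#include <stdbool.h>
#include <string.h>

/* Hypothetical allow-list: only well-known events go through bperf. */
static const char * const bperf_common_events[] = {
	"cycles", "instructions", "ref-cycles", NULL,
};

/* Hypothetical helper: events not on the list fall back to the
 * regular perf_event read path. */
static bool bperf__is_common_event(const char *event_name)
{
	int i;

	for (i = 0; bperf_common_events[i]; i++) {
		if (!strcmp(event_name, bperf_common_events[i]))
			return true;
	}
	return false;
}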

> 
> Thanks for working on this,
> 
> - Arnaldo
> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
>> 
>> Signed-off-by: Song Liu 
>> ---
>> tools/perf/Makefile.perf  |   1 +
>> tools/perf/builtin-stat.c |  20 +-
>> tools/perf/util/bpf_counter.c | 552 +-
>> tools/perf/util/bpf_skel/bperf.h  |  14 +
>> tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
>> tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>> tools/perf/util/evsel.h   |  20 +-
>> tools/perf/util/target.h  |   4 +-
>> 8 files changed, 712 insertions(+), 10 deletions(-)
>> create mode 100644 tools/perf/util/bpf_skel/bperf.h
>> create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>> create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>> 
>> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
>> index f6e609673de2b..ca9aa08e85a1f 100644
>> --- a/tools/perf/Makefile.perf
>> +++ b/tools/perf/Makefile.perf
>> @@ -1007,6 +1007,7 @@ python-clean:
>> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>> SKELETONS := $(SKEL_OUT)/bpf_prog

Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-12 Thread Song Liu



> On Mar 12, 2021, at 4:12 AM, Jiri Olsa  wrote:
> 
> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
> 
> nice, I recall the presentation about that and was wondering
> when this will come up ;-)

The progress is slower than I expected. But I finished some dependencies of 
this in the last year: 

  1. BPF_PROG_TEST_RUN for raw_tp event;
  2. perf-stat -b, which introduced skeleton and bpf_counter;
  3. BPF task local storage, I didn't use it in this version, but it could
 help optimize bperf in the future.

> 
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
>> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
> 
> I assume the output is same as standard perf?

Yes, the output is identical to that without --use-bpf option. 

Thanks,
Song



Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-12 Thread Song Liu



> On Mar 12, 2021, at 12:36 AM, Namhyung Kim  wrote:
> 
> Hi,
> 
> On Fri, Mar 12, 2021 at 11:03 AM Song Liu  wrote:
>> 
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such wastes by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session to read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
> 
> Interesting!  Actually I thought about something similar before,
> but my BPF knowledge is outdated.  So I need to catch up but
> failed to have some time for it so far. ;-)
> 
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
> 
> In my case, per cgroup event counting is very important.
> And I'd like to do that with lots of cpus and cgroups.

We can easily extend this approach to support cgroup events. I didn't
implement it to keep the first version simple.
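
To sketch what I mean (illustration only, not code from this patch; the
map names, the sched_switch hook, and the assumption that a leader-style
program publishes a per-cpu delta are all made up):

/* Illustrative follower-style program that aggregates per cgroup. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Per-cpu delta assumed to be published by a leader-style program. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct bpf_perf_event_value);
} diff_readings SEC(".maps");

/* Accumulated readings, keyed by cgroup id. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);
	__type(value, struct bpf_perf_event_value);
} cgroup_accum SEC(".maps");

SEC("raw_tp/sched_switch")
int accumulate_per_cgroup(void *ctx)
{
	struct bpf_perf_event_value *delta, *acc;
	__u64 cgrp = bpf_get_current_cgroup_id();
	__u32 zero = 0;

	delta = bpf_map_lookup_elem(&diff_readings, &zero);
	if (!delta)
		return 0;

	acc = bpf_map_lookup_elem(&cgroup_accum, &cgrp);
	if (!acc) {
		bpf_map_update_elem(&cgroup_accum, &cgrp, delta, BPF_NOEXIST);
		return 0;
	}
	acc->counter += delta->counter;
	acc->enabled += delta->enabled;
	acc->running += delta->running;
	return 0;
}

char LICENSE[] SEC("license") = "GPL";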

> So I'm working on an in-kernel solution (without BPF),
> I hope to share it soon.

This is interesting! I cannot wait to see what it looks like. I spent
quite some time trying to enable in-kernel sharing (not just cgroup
events), but finally decided to try the BPF approach.

> 
> And for event groups, it seems the current implementation
> cannot handle more than one event (not even in a group).
> That could be a serious limitation..

It supports multiple events. Multiple events are independent, i.e.,
"cycles" and "instructions" would use two independent leader programs.

> 
>> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
> 
> Hmm... so it loads both leader and follower programs if needed, right?
> Does it support multiple followers with different targets at the same time?

Yes, the whole idea is to have one leader program and multiple follower
programs. If we only run one of these commands at a time, it will load 
one leader and one follower. If we run multiple of them in parallel, 
they will share the same leader program and load multiple follower 
programs. 

I actually tested more than the commands above. The list just means that
we support -a, -C, -p, and -t.

Currently, this works for multiple events and for parallel perf-stat
sessions. The two commands below will work well in parallel:
  
  perf stat --use-bpf -e ref-cycles,instructions -a
  perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5

Note the use of ref-cycles, which can only use one counter on Intel CPUs.
With this approach, the above two commands will not do time multiplexing
on ref-cycles. 
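
To make the sharing part more concrete, here is a rough user-space sketch
of how a session could find an existing leader through the pinned map. The
entry layout and the function name are made up for illustration; please
refer to the patch for the real code:

#include <string.h>
#include <linux/types.h>
#include <linux/perf_event.h>
#include <bpf/bpf.h>

/* Hypothetical value layout; conceptually it identifies the leader's
 * program and maps for one perf_event_attr. */
struct bperf_attr_entry {
	__u32 leader_prog_id;
	__u32 diff_map_id;
};

/* Return the fd of the pinned attr map; fill *entry if a leader for this
 * event already exists, otherwise zero it. */
static int bperf_lookup_leader(const char *pin_path,
			       struct perf_event_attr *attr,
			       struct bperf_attr_entry *entry)
{
	int map_fd = bpf_obj_get(pin_path); /* e.g. /sys/fs/bpf/bperf_attr_map */

	if (map_fd < 0)
		return map_fd;	/* first session: create, pin, and load the leader */

	/* The shared hash map is keyed by the perf_event_attr, so sessions
	 * counting the same event land on the same leader program. */
	if (bpf_map_lookup_elem(map_fd, attr, entry))
		memset(entry, 0, sizeof(*entry));

	return map_fd;
}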

Thanks,
Song


[PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-11 Thread Song Liu
perf uses performance monitoring counters (PMCs) to monitor system
performance. The PMCs are limited hardware resources. For example,
Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways:
system level monitoring, (maybe nested) container level monitoring, per
process monitoring, profiling (in sample mode), etc. In some cases,
there are more active perf_events than available hardware PMCs. To allow
all perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics (cycles,
instructions). It is a waste to have multiple tools create multiple
perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such waste by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.
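
To give a rough idea of the reading pattern, here is a simplified sketch.
It is not the actual bperf_leader.bpf.c; the sched_switch hook and the map
names below are only for illustration:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* perf_events opened by user space, one per cpu; user space is assumed to
 * resize max_entries to the number of cpus before loading. */
struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(int));
	__uint(max_entries, 1);
} events SEC(".maps");

/* Latest reading on this cpu, to be picked up by followers/user space. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct bpf_perf_event_value);
} readings SEC(".maps");

SEC("raw_tp/sched_switch")
int on_switch(void *ctx)
{
	struct bpf_perf_event_value val, *slot;
	__u32 zero = 0;

	/* Read the hardware counter backing this cpu's perf_event. */
	if (bpf_perf_event_read_value(&events, BPF_F_CURRENT_CPU,
				      &val, sizeof(val)))
		return 0;

	slot = bpf_map_lookup_elem(&readings, &zero);
	if (slot)
		*slot = val;
	return 0;
}

char LICENSE[] SEC("license") = "GPL";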

Please refer to the comment before the definition of bperf_ops for the
description of bperf architecture.

bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
bperf uses a BPF hashmap to share information about BPF programs and maps
used by bperf. This map is pinned to bpffs. The default address is
/sys/fs/bpf/bperf_attr_map. The user could change the address with option
--attr-map.

---
Known limitations:
1. Do not support per cgroup events;
2. Do not support monitoring of BPF program (perf-stat -b);
3. Do not support event groups.

The following commands have been tested:

   perf stat --use-bpf -e cycles -a
   perf stat --use-bpf -e cycles -C 1,3,4
   perf stat --use-bpf -e cycles -p 123
   perf stat --use-bpf -e cycles -t 100,101

Signed-off-by: Song Liu 
---
 tools/perf/Makefile.perf  |   1 +
 tools/perf/builtin-stat.c |  20 +-
 tools/perf/util/bpf_counter.c | 552 +-
 tools/perf/util/bpf_skel/bperf.h  |  14 +
 tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
 tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
 tools/perf/util/evsel.h   |  20 +-
 tools/perf/util/target.h  |   4 +-
 8 files changed, 712 insertions(+), 10 deletions(-)
 create mode 100644 tools/perf/util/bpf_skel/bperf.h
 create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index f6e609673de2b..ca9aa08e85a1f 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -1007,6 +1007,7 @@ python-clean:
 SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
 SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
 SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
+SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
 
 ifdef BUILD_BPF_SKEL
 BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2e2e4a8345ea2..34df713a8eea9 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	}
 
 	evlist__for_each_cpu (evsel_list, i, cpu) {
+		/*
+		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
+		 * no need to call it again here.
+		 */
+		if (target.use_bpf)
+			break;
 		affinity__set(&affinity, cpu);
 
 		evlist__for_each_entry(evsel_list, counter) {
@@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 	/*
 	 * Enable counters and exec the command:
 	 */
-	t0 = rdclock();
-	clock_gettime(CLOCK_MONOTONIC, &ref_time);
-
 	if (forks) {
 		evlist__start_workload(evsel_list);
 		err = enable_counters();
 		if (err)
 			return -1;
 
+		t0 = rdclock();
+		clock_gettime(CLOCK_MONOTONIC, &ref_time);
+
 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
 			status = dispatch_events(forks, timeout, interval, &times);
 		if (child_pid != -1) {
@@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
 		err = enable_counters();
 		if (err)
 			return -1;
+
+		t0 = rdclock();
+		clock_gettime(CLOCK_MONOTONIC, &ref_time);
+
 		status = dispatch_events(forks, timeout, interval, &times);
 	}
 
@@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
 #ifdef HAVE_BPF_SKEL
OPT_STRING('b', 

Re: [PATCH 2/3] perf tool: Enable warnings when compiling BPF programs

2021-03-06 Thread Song Liu



> On Mar 6, 2021, at 12:08 AM, Ian Rogers  wrote:
> 
> Add -Wall -Werror when compiling BPF skeletons.
> 
> Signed-off-by: Ian Rogers 

Acked-by: Song Liu 

> ---
> tools/perf/Makefile.perf | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
> index 5345ac70cd83..f43d2551f3de 100644
> --- a/tools/perf/Makefile.perf
> +++ b/tools/perf/Makefile.perf
> @@ -1029,7 +1029,7 @@ $(BPFTOOL): | $(SKEL_TMP_OUT)
>   OUTPUT=$(SKEL_TMP_OUT)/ bootstrap
> 
> $(SKEL_TMP_OUT)/%.bpf.o: util/bpf_skel/%.bpf.c $(LIBBPF) | $(SKEL_TMP_OUT)
> - $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf $(BPF_INCLUDE) \
> + $(QUIET_CLANG)$(CLANG) -g -O2 -target bpf -Wall -Werror $(BPF_INCLUDE) \
> -c $(filter util/bpf_skel/%.bpf.c,$^) -o $@ && $(LLVM_STRIP) -g $@
> 
> $(SKEL_OUT)/%.skel.h: $(SKEL_TMP_OUT)/%.bpf.o | $(BPFTOOL)
> -- 
> 2.30.1.766.gb4fecdf3b7-goog
> 



Re: [PATCH 3/3] perf bpf: Minor whitespace cleanup.

2021-03-06 Thread Song Liu



> On Mar 6, 2021, at 12:08 AM, Ian Rogers  wrote:
> 
> Missed space after #include.
> 
> Signed-off-by: Ian Rogers 

Acked-by: Song Liu 

> ---
> tools/perf/util/bpf_counter.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/bpf_counter.h b/tools/perf/util/bpf_counter.h
> index 2eca210e5dc1..cb9c532e0a07 100644
> --- a/tools/perf/util/bpf_counter.h
> +++ b/tools/perf/util/bpf_counter.h
> @@ -38,7 +38,7 @@ int bpf_counter__install_pe(struct evsel *evsel, int cpu, 
> int fd);
> 
> #else /* HAVE_BPF_SKEL */
> 
> -#include
> +#include 
> 
> static inline int bpf_counter__load(struct evsel *evsel __maybe_unused,
>   struct target *target __maybe_unused)
> -- 
> 2.30.1.766.gb4fecdf3b7-goog
> 



Re: [PATCH 1/3] perf skel: Remove some unused variables.

2021-03-06 Thread Song Liu



> On Mar 6, 2021, at 12:08 AM, Ian Rogers  wrote:
> 
> Fixes -Wall warnings.
> 
> Signed-off-by: Ian Rogers 

Acked-by: Song Liu 

Thanks for the clean up!

> ---
> tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c 
> b/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
> index c7cec92d0236..ab12b4c4ece2 100644
> --- a/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
> +++ b/tools/perf/util/bpf_skel/bpf_prog_profiler.bpf.c
> @@ -52,7 +52,7 @@ int BPF_PROG(fentry_XXX)
> static inline void
> fexit_update_maps(struct bpf_perf_event_value *after)
> {
> - struct bpf_perf_event_value *before, diff, *accum;
> + struct bpf_perf_event_value *before, diff;
>   __u32 zero = 0;
> 
>   before = bpf_map_lookup_elem(&fentry_readings, &zero);
> @@ -78,7 +78,6 @@ int BPF_PROG(fexit_XXX)
> {
>   struct bpf_perf_event_value reading;
>   __u32 cpu = bpf_get_smp_processor_id();
> - __u32 one = 1, zero = 0;
>   int err;
> 
>   /* read all events before updating the maps, to reduce error */
> -- 
> 2.30.1.766.gb4fecdf3b7-goog
> 


