Re: [RFC 00/31] objtool, livepatch: Livepatch module generation

2024-09-12 Thread Song Liu
Hi Josh,

On Wed, Sep 11, 2024 at 9:20 AM Josh Poimboeuf  wrote:
[...]

> > Do not get me wrong. I do not expect that the upstream variant would
> > be feature complete from the beginning. I just want to get a picture of
> > how far it is. The code will be maintained only when it would have
> > users. And it would have users only when it would be comparable to or
> > better than kpatch.
>
> I agree it needs to be fully functional before merge, but only for x86.
>
> Red Hat (and Meta?) will start using it as soon as x86 support is ready,
> because IBT/LTO support is needed, which kpatch-build can't handle.

While we (Meta) do have a workaround in kpatch to build livepatch modules
for LTO-enabled kernels, we will try to switch to this approach once
the x86 support is ready.

There are also other companies that would like to use the LTO+livepatch
combination.

Thanks,
Song



Re: [RFC 00/31] objtool, livepatch: Livepatch module generation

2024-09-09 Thread Song Liu
On Mon, Sep 9, 2024 at 2:19 PM Josh Poimboeuf  wrote:
>
> On Sat, Sep 07, 2024 at 10:04:25PM -0700, Song Liu wrote:
> > I think gcc doesn't complain, but clang does:
> >
> > $ cat ttt.c
> > static inline void ret(void)
> > {
> >   return;
> > }
> >
> > int main(void)
> > {
> >   return 0;
> > }
>
> Ah...  That's probably why the kernel adds "__maybe_unused" to its
> inline macro (which the tools don't have).
>
> Does this fix?

Yes! It fixes this problem.
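
For reference, something along these lines in the tools headers makes clang
happy with the ttt.c example above (a minimal sketch of the idea only; the
exact macro names and which tools/ header Josh's fix touches are assumptions):

/* Mark everything declared 'inline' as possibly unused, like the kernel
 * proper does, so clang's -Wunused-function stays quiet for unused
 * static inline helpers. */
#define __maybe_unused __attribute__((__unused__))
#define inline inline __maybe_unused

static inline void ret(void)
{
	return;
}

int main(void)
{
	return 0;
}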

Thanks,
Song



Re: [RFC 00/31] objtool, livepatch: Livepatch module generation

2024-09-07 Thread Song Liu
On Sat, Sep 7, 2024 at 1:14 PM Josh Poimboeuf  wrote:
>
> On Sat, Sep 07, 2024 at 10:43:10AM -0700, Song Liu wrote:
> > clang gives the following:
> >
> > elf.c:102:1: error: unused function '__sym_remove' 
> > [-Werror,-Wunused-function]
> >   102 | INTERVAL_TREE_DEFINE(struct symbol, node, unsigned long, 
> > __subtree_last,
> >   | 
> > ^~~~
> >   103 |  __sym_start, __sym_last, static inline, __sym)
> >   |  ~~
> > /data/users/songliubraving/kernel/linux-git/tools/include/linux/interval_tree_generic.h:65:15:
> > note: expanded from macro 'INTERVAL_TREE_DEFINE'
> >65 | ITSTATIC void ITPREFIX ## _remove(ITSTRUCT *node,
> >\
> >   |   ^~~
> > :155:1: note: expanded from here
> >   155 | __sym_remove
> >   | ^~~~
> > 1 error generated.
>
> Here's how __sym_remove() is created:
>
> #define INTERVAL_TREE_DEFINE(ITSTRUCT, ITRB, ITTYPE, ITSUBTREE,   
> \
>  ITSTART, ITLAST, ITSTATIC, ITPREFIX) 
> \
> ...
>
> ITSTATIC void ITPREFIX ## _remove(ITSTRUCT *node, 
> \
>   struct rb_root_cached *root)
> \
>
> INTERVAL_TREE_DEFINE(struct symbol, node, unsigned long, __subtree_last,
>  __sym_start, __sym_last, static inline, __sym)
>
> ITSTATIC is 'static inline' so it shouldn't be complaining about it
> being unused, right?

I think gcc doesn't complain, but clang does:

$ cat ttt.c
static inline void ret(void)
{
  return;
}

int main(void)
{
  return 0;
}

$ gcc  ttt.c  -Werror -Wunused-function
$ clang ttt.c  -Werror -Wunused-function
ttt.c:1:20: error: unused function 'ret' [-Werror,-Wunused-function]
1 | static inline void ret(void)
  |^~~
1 error generated.

>
> If you add -E to the cflags to get preprocessed output, can you confirm
> __sym_remove() is 'static inline'?

Yes, it is 'static inline'.

Song



Re: [RFC 00/31] objtool, livepatch: Livepatch module generation

2024-09-07 Thread Song Liu
On Fri, Sep 6, 2024 at 11:46 PM Josh Poimboeuf  wrote:
>
> On Tue, Sep 03, 2024 at 10:32:00AM -0700, Song Liu wrote:
> > +++ w/tools/objtool/elf.c
> > @@ -468,10 +468,8 @@ static void elf_add_symbol(struct elf *elf,
> > struct symbol *sym)
> >  *
> >  * TODO: is this still true?
> >  */
> > -#if 0
> > -   if (sym->type == STT_NOTYPE && !sym->len)
> > +   if (sym->type == STT_NOTYPE && !sym->len && false)
> > __sym_remove(sym, &sym->sec->symbol_tree);
> > -#endif
>
> Song, can you explain this change?  Was there a warning about
> __sym_remove() not being used?  Not sure how that would be possible
> since it should be static inline:
>
> INTERVAL_TREE_DEFINE(struct symbol, node, unsigned long, __subtree_last,
>  __sym_start, __sym_last, static inline, __sym)
>   ^
>

clang gives the following:

elf.c:102:1: error: unused function '__sym_remove' [-Werror,-Wunused-function]
  102 | INTERVAL_TREE_DEFINE(struct symbol, node, unsigned long, __subtree_last,
  | ^~~~
  103 |  __sym_start, __sym_last, static inline, __sym)
  |  ~~
/data/users/songliubraving/kernel/linux-git/tools/include/linux/interval_tree_generic.h:65:15:
note: expanded from macro 'INTERVAL_TREE_DEFINE'
   65 | ITSTATIC void ITPREFIX ## _remove(ITSTRUCT *node,
   \
  |   ^~~
:155:1: note: expanded from here
  155 | __sym_remove
  | ^~~~
1 error generated.

gcc didn't complain here.

Thanks,
Song



Re: [RFC 00/31] objtool, livepatch: Livepatch module generation

2024-09-05 Thread Song Liu
On Thu, Sep 5, 2024 at 12:13 AM Josh Poimboeuf  wrote:
>
> On Wed, Sep 04, 2024 at 09:14:00PM -0700, Josh Poimboeuf wrote:
>
> > + if (!!nr_objs) {
> ^^
> oops
>
> Fixed version:

The fixed version works for the following cases with gcc-12:

1. no BTF, no IBT;
2. with BTF and CONFIG_MODULE_ALLOW_BTF_MISMATCH, no IBT;
3. with BTF and CONFIG_MODULE_ALLOW_BTF_MISMATCH, with IBT;

There are still some issues with BTF, so we need
CONFIG_MODULE_ALLOW_BTF_MISMATCH here.

But it appears to be mostly working.

OTOH, there are still some issues with LLVM ("symbol 'xxx' already
defined", etc.).

Thanks,
Song



Re: [RFC 00/31] objtool, livepatch: Livepatch module generation

2024-09-04 Thread Song Liu
On Wed, Sep 4, 2024 at 1:59 PM Josh Poimboeuf  wrote:
>
> On Wed, Sep 04, 2024 at 01:23:55PM -0700, Song Liu wrote:
> > Hi Josh,
> >
> > Thanks for the fix! The gcc kernel now compiles.
> >
> > I am now testing with the attached config file (where I disabled
> > CONFIG_DEBUG_INFO_BTF), with the attached patch.
>
> Probably a good idea to disable BTF as I think it's causing some
> confusion.

Agreed. We disabled BTF in kpatch-build, which works well so far.

Thanks,
Song



Re: [RFC 00/31] objtool, livepatch: Livepatch module generation

2024-09-03 Thread Song Liu
Hi Josh,

Thanks for the patchset! We really need this work so that we can undo our
hack for LTO-enabled kernels.

On Mon, Sep 2, 2024 at 9:00 PM Josh Poimboeuf  wrote:
>
> Hi,
>
> Here's a new way to build livepatch modules called klp-build.
>
> I started working on it when I realized that objtool already does 99% of
> the work needed for detecting function changes.
>
> This is similar in concept to kpatch-build, but the implementation is
> much cleaner.
>
> Personally I still have reservations about the "source-based" approach
> (klp-convert and friends), including the fragility and performance
> concerns of -flive-patching.  I would submit that klp-build might be
> considered the "official" way to make livepatch modules.
>
> Please try it out and let me know what you think.  Based on v6.10.
>
> Also available at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git 
> klp-build-rfc

I tried to compile the code in this branch with gcc-12 and llvm-18 and hit
a number of errors. Some of them are easy to fix (fixes attached below), but
some are trickier, for example:

with gcc-12:
  ...
  BTFIDS  vmlinux
  NM  System.map
  SORTTAB vmlinux
incomplete ORC unwind tables in file: vmlinux
Failed to sort kernel tables

with clang-18:

:4:1: error: symbol '__alt_0' is already defined
4 | __alt_0:
  | ^
:4:1: error: symbol '__alt_1' is already defined
4 | __alt_1:
  | ^

Thanks,
Song

Fix/hack I have on top of this branch:

diff --git i/tools/objtool/check.c w/tools/objtool/check.c
index f55dec2932de..5c4152d60780 100644
--- i/tools/objtool/check.c
+++ w/tools/objtool/check.c
@@ -2,7 +2,7 @@
 /*
  * Copyright (C) 2015-2017 Josh Poimboeuf 
  */
-
+#define _GNU_SOURCE
 #include 
 #include 
 #include 
@@ -1519,7 +1519,7 @@ static void add_jump_destinations(struct
objtool_file *file)
struct reloc *reloc;

for_each_insn(file, insn) {
-   struct instruction *dest_insn;
+   struct instruction *dest_insn = NULL;
struct section *dest_sec = NULL;
struct symbol *dest_sym = NULL;
unsigned long dest_off;
diff --git i/tools/objtool/elf.c w/tools/objtool/elf.c
index 7960921996bd..462ce897ff29 100644
--- i/tools/objtool/elf.c
+++ w/tools/objtool/elf.c
@@ -468,10 +468,8 @@ static void elf_add_symbol(struct elf *elf,
struct symbol *sym)
 *
 * TODO: is this still true?
 */
-#if 0
-   if (sym->type == STT_NOTYPE && !sym->len)
+   if (sym->type == STT_NOTYPE && !sym->len && false)
__sym_remove(sym, &sym->sec->symbol_tree);
-#endif

sym->demangled_name = demangle_name(sym);
 }
diff --git i/tools/objtool/klp-diff.c w/tools/objtool/klp-diff.c
index 76296e38f9ff..4a3f4172f4a5 100644
--- i/tools/objtool/klp-diff.c
+++ w/tools/objtool/klp-diff.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (C) 2024 Josh Poimboeuf 
  */
+#define _GNU_SOURCE
 #include 
 #include 
 #include 
@@ -1109,4 +1110,3 @@ int cmd_klp_diff(int argc, const char **argv)
elf_write(elf_out);
return 0;
 }
-



Re: [PATCH v3 0/2] Fix kallsyms with CONFIG_LTO_CLANG

2024-08-15 Thread Song Liu
Hi Kees,

> On Aug 15, 2024, at 9:05 AM, Kees Cook  wrote:
> 
> On Mon, Aug 12, 2024 at 06:13:22PM +0000, Song Liu wrote:
>> Hi Luis,
>> 
>>> On Aug 12, 2024, at 9:57 AM, Luis Chamberlain  wrote:
>>> 
>>> On Mon, Aug 12, 2024 at 09:21:02AM -0700, Song Liu wrote:
>>>> Hi folks,
>>>> 
>>>> Do we have more concerns and/or suggestions with this set? If not,
>>>> what would be the next step for it?
>>> 
>>> I'm all for simplifying things, and this does just that, however,
>>> I'm not the one you need to convince, the folks who added the original
>>> hacks should provide their Reviewed-by / Tested-by not just for 
>>> CONFIG_LTO_CLANG
>>> but also given this provides an alternative fix, don't we want to invert
>>> the order so we don't regress CONFIG_LTO_CLANG ? And shouldn't the patches
>>> also have their respective Fixes tag?
>> 
>> kallsyms has got quite a few changes/improvements in the past few years:
>> 
>> 1. Sami added logic to trim LTO hash in 2021 [1];
>> 2. Zhen added logic to sort kallsyms in 2022 [2];
>> 3. Yonghong changed cleanup_symbol_name() in 2023 [3]. 
>> 
>> In this set, we are undoing 1 and 3, but we keep 2. Shall we point Fixes
>> tag to [1] or [3]? The patch won't apply to a kernel with only [1] 
>> (without [2] and [3]); while this set is not just fixing [3]. So I think
>> it is not accurate either way. OTOH, the combination of CONFIG_LTO_CLANG
>> and livepatching is probably not used by a lot of users, so I guess we 
>> are OK without Fixes tags? I personally don't have a strong preference 
>> either way. 
>> 
>> It is not necessary to invert the order of the two patches. Only applying
>> one of the two patches won't cause more issues than what we have today.
> 
> Which tree should carry this series?

I was looking through the commit log of kernel/kallsyms.c just now, and
found that you took most of the recent kallsyms patches. Could you please
take this set as well?

Thanks,
Song



Re: [PATCH v3 0/2] Fix kallsyms with CONFIG_LTO_CLANG

2024-08-13 Thread Song Liu
Hi Masami, 

Thanks for your review and test!

@Sami, could you please also review the set?

@Luis, I replied to 1/2 and 2/2 with Fixes tags that I think make the most
sense. Please let me know if we need changes to the set or more reviews
and tests.

Thanks,
Song

> On Aug 12, 2024, at 9:29 PM, Masami Hiramatsu  wrote:
> 
> On Wed,  7 Aug 2024 15:05:11 -0700
> Song Liu  wrote:
> 
>> With CONFIG_LTO_CLANG, the compiler/linker adds .llvm. suffix to
>> local symbols to avoid duplications. Existing scripts/kallsyms sorts
>> symbols without .llvm. suffix. However, this causes quite some
>> issues later on. Some users of kallsyms, such as livepatch, have to match
>> symbols exactly.
>> 
>> Address this by sorting full symbols at build time, and let kallsyms
>> lookup APIs to match the symbols exactly.
>> 
> 
> I've tested this series and confirmed it makes kprobes work with llvm suffixed
> symbols.
> 
> /sys/kernel/tracing # echo "p c_start.llvm.8011538628216713357" >> 
> kprobe_events
> /sys/kernel/tracing # cat kprobe_events 
> p:kprobes/p_c_start_llvm_8011538628216713357_0 
> c_start.llvm.8011538628216713357
> /sys/kernel/tracing # echo "p c_start" >> kprobe_events 
> /sys/kernel/tracing # cat kprobe_events 
> p:kprobes/p_c_start_llvm_8011538628216713357_0 
> c_start.llvm.8011538628216713357
> p:kprobes/p_c_start_0 c_start
> 
> And ftrace too.
> 
> /sys/kernel/tracing # grep ^c_start available_filter_functions
> c_start.llvm.8011538628216713357
> c_start
> c_start.llvm.17132674095431275852
> 
> Tested-by: Masami Hiramatsu (Google) 
> Reviewed-by: Masami Hiramatsu (Google) 
> 
> for this series.
> 
>> Changes v2 => v3:
>> 1. Remove the _without_suffix APIs, as kprobe will not use them.
>>   (Masami Hiramatsu)
>> 
>> v2: 
>> https://lore.kernel.org/live-patching/20240802210836.2210140-1-s...@kernel.org/T/#u
>> 
>> Changes v1 => v2:
>> 1. Update the APIs to remove all .XXX suffixes (v1 only removes .llvm.*).
>> 2. Rename the APIs as *_without_suffix. (Masami Hiramatsu)
>> 3. Fix another user from kprobe. (Masami Hiramatsu)
>> 4. Add tests for the new APIs in kallsyms_selftests.
>> 
>> v1: 
>> https://lore.kernel.org/live-patching/20240730005433.3559731-1-s...@kernel.org/T/#u
>> 
>> Song Liu (2):
>>  kallsyms: Do not cleanup .llvm. suffix before sorting symbols
>>  kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
>> 
>> kernel/kallsyms.c  | 55 +-
>> kernel/kallsyms_selftest.c | 22 +--
>> scripts/kallsyms.c | 31 ++---
>> scripts/link-vmlinux.sh|  4 ---
>> 4 files changed, 9 insertions(+), 103 deletions(-)
>> 
>> --
>> 2.43.5
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH v3 2/2] kallsyms: Match symbols exactly with CONFIG_LTO_CLANG

2024-08-13 Thread Song Liu


> On Aug 7, 2024, at 3:05 PM, Song Liu  wrote:
> 
> With CONFIG_LTO_CLANG=y, the compiler may add .llvm. suffix to
> function names to avoid duplication. APIs like kallsyms_lookup_name()
> and kallsyms_on_each_match_symbol() tries to match these symbol names
> without the .llvm. suffix, e.g., match "c_stop" with symbol
> c_stop.llvm.17132674095431275852. This turned out to be problematic
> for use cases that require exact match, for example, livepatch.
> 
> Fix this by making the APIs to match symbols exactly.
> 
> Also cleanup kallsyms_selftests accordingly.
> 
> Signed-off-by: Song Liu 

Fixes: 8cc32a9bbf29 ("kallsyms: strip LTO-only suffixes from promoted global functions")



Re: [PATCH v3 1/2] kallsyms: Do not cleanup .llvm. suffix before sorting symbols

2024-08-13 Thread Song Liu


> On Aug 7, 2024, at 3:05 PM, Song Liu  wrote:
> 
> Cleaning up the symbols causes various issues afterwards. Let's sort
> the list based on original name.
> 
> Signed-off-by: Song Liu 

Fixes: 8cc32a9bbf29 ("kallsyms: strip LTO-only suffixes from promoted global functions")




Re: [PATCH v3 0/2] Fix kallsyms with CONFIG_LTO_CLANG

2024-08-12 Thread Song Liu
Hi Luis,

> On Aug 12, 2024, at 9:57 AM, Luis Chamberlain  wrote:
> 
> On Mon, Aug 12, 2024 at 09:21:02AM -0700, Song Liu wrote:
>> Hi folks,
>> 
>> Do we have more concerns and/or suggestions with this set? If not,
>> what would be the next step for it?
> 
> I'm all for simplifying things, and this does just that, however,
> I'm not the one you need to convince, the folks who added the original
> hacks should provide their Reviewed-by / Tested-by not just for 
> CONFIG_LTO_CLANG
> but also given this provides an alternative fix, don't we want to invert
> the order so we don't regress CONFIG_LTO_CLANG ? And shouldn't the patches
> also have their respective Fixes tag?

kallsyms has got quite a few changes/improvements in the past few years:

1. Sami added logic to trim LTO hash in 2021 [1];
2. Zhen added logic to sort kallsyms in 2022 [2];
3. Yonghong changed cleanup_symbol_name() in 2023 [3]. 

In this set, we are undoing 1 and 3, but we keep 2. Shall we point Fixes
tag to [1] or [3]? The patch won't apply to a kernel with only [1] 
(without [2] and [3]); while this set is not just fixing [3]. So I think
it is not accurate either way. OTOH, the combination of CONFIG_LTO_CLANG
and livepatching is probably not used by a lot of users, so I guess we 
are OK without Fixes tags? I personally don't have a strong preference 
either way. 

It is not necessary to invert the order of the two patches. Only applying
one of the two patches won't cause more issues than what we have today. 

Thanks,
Song


> 
> Provided the commit logs are extended with Fixes and order is maintained
> to be able to bisect correctly:
> 
> Reviewed-by: Luis Chamberlain 
> 
>  Luis


[1] 8b8e6b5d3b01 ("kallsyms: strip ThinLTO hashes from static functions")
[2] 60443c88f3a8 ("kallsyms: Improve the performance of kallsyms_lookup_name()")
[3] 8cc32a9bbf29 ("kallsyms: strip LTO-only suffixes from promoted global functions")

Re: [PATCH v3 0/2] Fix kallsyms with CONFIG_LTO_CLANG

2024-08-12 Thread Song Liu
Hi folks,

Do we have more concerns and/or suggestions with this set? If not,
what would be the next step for it?

Thanks,
Song

On Wed, Aug 7, 2024 at 3:05 PM Song Liu  wrote:
>
> With CONFIG_LTO_CLANG, the compiler/linker adds .llvm. suffix to
> local symbols to avoid duplications. Existing scripts/kallsyms sorts
> symbols without .llvm. suffix. However, this causes quite some
> issues later on. Some users of kallsyms, such as livepatch, have to match
> symbols exactly.
>
> Address this by sorting full symbols at build time, and let kallsyms
> lookup APIs to match the symbols exactly.
>
> Changes v2 => v3:
> 1. Remove the _without_suffix APIs, as kprobe will not use them.
>(Masami Hiramatsu)
>
> v2: 
> https://lore.kernel.org/live-patching/20240802210836.2210140-1-s...@kernel.org/T/#u
>
> Changes v1 => v2:
> 1. Update the APIs to remove all .XXX suffixes (v1 only removes .llvm.*).
> 2. Rename the APIs as *_without_suffix. (Masami Hiramatsu)
> 3. Fix another user from kprobe. (Masami Hiramatsu)
> 4. Add tests for the new APIs in kallsyms_selftests.
>
> v1: 
> https://lore.kernel.org/live-patching/20240730005433.3559731-1-s...@kernel.org/T/#u
>
> Song Liu (2):
>   kallsyms: Do not cleanup .llvm. suffix before sorting symbols
>   kallsyms: Match symbols exactly with CONFIG_LTO_CLANG
>
>  kernel/kallsyms.c  | 55 +-
>  kernel/kallsyms_selftest.c | 22 +--
>  scripts/kallsyms.c | 31 ++---
>  scripts/link-vmlinux.sh|  4 ---
>  4 files changed, 9 insertions(+), 103 deletions(-)
>
> --
> 2.43.5



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-09 Thread Song Liu
Hi Alessandro,

> On Aug 8, 2024, at 11:20 PM, Alessandro Carminati 
>  wrote:
> 
> Hello,
> sorry to join late at the party.
> 
> Il giorno gio 8 ago 2024 alle ore 11:59 Petr Mladek 
> ha scritto:
>> 
>> On Wed 2024-08-07 20:48:48, Song Liu wrote:
>>> 
>>> 
>>>> On Aug 7, 2024, at 8:33 AM, Sami Tolvanen  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> On Wed, Aug 7, 2024 at 3:08 AM Masami Hiramatsu  
>>>> wrote:
>>>>> 
>>>>> On Wed, 7 Aug 2024 00:19:20 +
>>>>> Song Liu  wrote:
>>>>> 
>>>>>> Do you mean we do not want patch 3/3, but would like to keep 1/3 and part
>>>>>> of 2/3 (remove the _without_suffix APIs)? If this is the case, we are
>>>>>> undoing the change by Sami in [1], and thus may break some tracing tools.
>>>>> 
>>>>> What tracing tools may be broke and why?
>>>> 
>>>> This was a few years ago when we were first adding LTO support, but
>>>> the unexpected suffixes in tracing output broke systrace in Android,
>>>> presumably because the tools expected to find specific function names
>>>> without suffixes. I'm not sure if systrace would still be a problem
>>>> today, but other tools might still make assumptions about the function
>>>> name format. At the time, we decided to filter out the suffixes in all
>>>> user space visible output to avoid these issues.
>>>> 
>>>>> For this suffix problem, I would like to add another patch to allow 
>>>>> probing on
>>>>> suffixed symbols. (It seems suffixed symbols are not available at this 
>>>>> point)
>>>>> 
>>>>> The problem is that the suffixed symbols maybe a "part" of the original 
>>>>> function,
>>>>> thus user has to carefully use it.
>>>>> 
>>>>>> 
>>>>>> Sami, could you please share your thoughts on this?
>>>>> 
>>>>> Sami, I would like to know what problem you have on kprobes.
>>>> 
>>>> The reports we received back then were about registering kprobes for
>>>> static functions, which obviously failed if the compiler added a
>>>> suffix to the function name. This was more of a problem with ThinLTO
>>>> and Clang CFI at the time because the compiler used to rename _all_
>>>> static functions, but one can obviously run into the same issue with
>>>> just LTO.
>>> 
>>> I think newer LLVM/clang no longer add suffixes to all static functions
>>> with LTO and CFI. So this may not be a real issue any more?
>>> 
>>> If we still need to allow tracing without suffix, I think the approach
>>> in this patch set is correct (sort syms based on full name,
>> 
>> Yes, we should allow to find the symbols via the full name, definitely.
>> 
>>> remove suffixes in special APIs during lookup).
>> 
>> Just an idea. Alternative solution would be to make make an alias
>> without the suffix when there is only one symbol with the same
>> name.
>> 
>> It would be complementary with the patch adding aliases for symbols
>> with the same name, see
>> https://lore.kernel.org/r/20231204214635.2916691-1-alessandro.carmin...@gmail.com
>> 
>> I would allow to find the symbols with and without the suffix using
>> a single API.
> 
> kas_alias isn't handling LTO as effectively as it should.
> This is something I plan to address in the next patch version.
> Introducing aliases is the best approach I found to preserve current
> tools behavior while adding this new feature.
> While I believe it will deliver the promised benefits, there is a trade-off,
> particularly affecting features like live patching, which rely on handling
> duplicate symbols.
> For instance, kallsyms_lookup_names typically returns the last occurrence
> of a symbol when the end argument is not NULL, but introducing aliases
> disrupts this behavior.

Do you think live patching should be fine with v3 of this set [1]? The
idea is to let kallsyms_lookup_names() do a full-name match, so that live
patching can find the right symbol with the symbol name + old_sympos.
Did I miss any cases?
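
For reference, a simplified model of the name + old_sympos rule (a standalone
sketch for illustration only, not the actual livepatch or kallsyms code):

#include <stdio.h>
#include <string.h>

struct sym { const char *name; unsigned long addr; };

/*
 * sympos == 0 means "the name must be unique"; sympos == n means "take the
 * n-th occurrence of this exact name".  Exact matching is what makes the
 * occurrence counting reliable.
 */
static unsigned long resolve(const struct sym *tab, int n,
			     const char *name, int sympos)
{
	unsigned long addr = 0;
	int count = 0;

	for (int i = 0; i < n; i++) {
		if (strcmp(tab[i].name, name))	/* exact match only */
			continue;
		count++;
		if (sympos == 0 || count == sympos)
			addr = tab[i].addr;
	}

	if (sympos == 0 && count > 1)
		return 0;	/* ambiguous: the caller must pass a position */
	return addr;
}

int main(void)
{
	const struct sym tab[] = {
		{ "c_next.llvm.7567888411731313343",  0x1000 },
		{ "c_next",                           0x2000 },
		{ "c_next.llvm.14337844803752139461", 0x3000 },
	};

	/* The full names are unique, so sympos 0 resolves each of them. */
	printf("%#lx\n", resolve(tab, 3, "c_next", 0));
	return 0;
}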

> I'm working on a solution to manage duplicate symbols, ensuring compatibility
> with both LTO and kallsyms_lookup_names.

Thanks,
Song

[1] 
https://lore.kernel.org/live-patching/20240807220513.3100483-1-s...@kernel.org/T/#u



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-09 Thread Song Liu


> On Aug 9, 2024, at 8:40 AM, Petr Mladek  wrote:
> 
> On Thu 2024-08-08 15:20:26, Song Liu wrote:
>> 
>> 
>>> On Aug 8, 2024, at 2:59 AM, Petr Mladek  wrote:
>>> 
>>> On Wed 2024-08-07 20:48:48, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Aug 7, 2024, at 8:33 AM, Sami Tolvanen  wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> On Wed, Aug 7, 2024 at 3:08 AM Masami Hiramatsu  
>>>>> wrote:
>>>>>> 
>>>>>> On Wed, 7 Aug 2024 00:19:20 +
>>>>>> Song Liu  wrote:
>>>>>> 
>>>>>>> Do you mean we do not want patch 3/3, but would like to keep 1/3 and 
>>>>>>> part
>>>>>>> of 2/3 (remove the _without_suffix APIs)? If this is the case, we are
>>>>>>> undoing the change by Sami in [1], and thus may break some tracing 
>>>>>>> tools.
>>>>>> 
>>>>>> What tracing tools may be broke and why?
>>>>> 
>>>>> This was a few years ago when we were first adding LTO support, but
>>>>> the unexpected suffixes in tracing output broke systrace in Android,
>>>>> presumably because the tools expected to find specific function names
>>>>> without suffixes. I'm not sure if systrace would still be a problem
>>>>> today, but other tools might still make assumptions about the function
>>>>> name format. At the time, we decided to filter out the suffixes in all
>>>>> user space visible output to avoid these issues.
>>>>> 
>>>>>> For this suffix problem, I would like to add another patch to allow 
>>>>>> probing on
>>>>>> suffixed symbols. (It seems suffixed symbols are not available at this 
>>>>>> point)
>>>>>> 
>>>>>> The problem is that the suffixed symbols maybe a "part" of the original 
>>>>>> function,
>>>>>> thus user has to carefully use it.
>>>>>> 
>>>>>>> 
>>>>>>> Sami, could you please share your thoughts on this?
>>>>>> 
>>>>>> Sami, I would like to know what problem you have on kprobes.
>>>>> 
>>>>> The reports we received back then were about registering kprobes for
>>>>> static functions, which obviously failed if the compiler added a
>>>>> suffix to the function name. This was more of a problem with ThinLTO
>>>>> and Clang CFI at the time because the compiler used to rename _all_
>>>>> static functions, but one can obviously run into the same issue with
>>>>> just LTO.
>>>> 
>>>> I think newer LLVM/clang no longer add suffixes to all static functions
>>>> with LTO and CFI. So this may not be a real issue any more?
>>>> 
>>>> If we still need to allow tracing without suffix, I think the approach
>>>> in this patch set is correct (sort syms based on full name,
>>> 
>>> Yes, we should allow to find the symbols via the full name, definitely.
>>> 
>>>> remove suffixes in special APIs during lookup).
>>> 
>>> Just an idea. Alternative solution would be to make make an alias
>>> without the suffix when there is only one symbol with the same
>>> name.
>>> 
>>> It would be complementary with the patch adding aliases for symbols
>>> with the same name, see
>>> https://lore.kernel.org/r/20231204214635.2916691-1-alessandro.carmin...@gmail.com
>> 
>> I guess v3 plus this work may work well together.  
>> 
>>> I would allow to find the symbols with and without the suffix using
>>> a single API.
>> 
>> Could you please describe how this API would work? I tried some 
>> idea in v1, but it turned out to be quite confusing. So I decided 
>> to leave this logic to the users of kallsyms APIs in v2.
> 
> If we create an alias without the suffix but only when there is only
> one symbol with such a name then we have, for example:
> 
>  klp_complete_transition.lwn.123456
>  klp_complete_transition [alias]
> 
>  init_once.lwn.2131221
>  init_once.lwn.3443243
>  init_once.lwn.4324322
>  init_once.lwn.5214121
>  init_once.lwn.2153121
>  init_once.lwn.4342343
> 
> This way, it will be possible to find the static symbol
> "klp_complete_transition" without the suffix via the alias.
> It will have the alias because it has an unique name.
> 
> While "init_once" symbol must always be searched with the suffix
> because it is not unique.
> 
> It looks like >99% of static symbols have unique name.

Got it. The idea is to generate the alias at boot time. I think
this will indeed work. 

IIUC, v3 of this set with Alessandro's work (maybe with some 
variations) should do this. 
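
For illustration, the "alias only when the base name is unique" rule could
look roughly like this (a standalone sketch of the idea; Alessandro's actual
kas_alias implementation may well differ):

#include <stdio.h>
#include <string.h>

/* Truncate a symbol name at the first '.'; '.' cannot appear in a C
 * identifier, so everything after it is a compiler-generated suffix. */
static void base_name(const char *sym, char *out, size_t len)
{
	const char *dot = strchr(sym, '.');
	size_t n = dot ? (size_t)(dot - sym) : strlen(sym);

	if (n >= len)
		n = len - 1;
	memcpy(out, sym, n);
	out[n] = '\0';
}

int main(void)
{
	const char *syms[] = {
		"klp_complete_transition.llvm.123456",
		"init_once.llvm.2131221",
		"init_once.llvm.3443243",
	};
	const int n = sizeof(syms) / sizeof(syms[0]);
	char base[128], other[128];

	for (int i = 0; i < n; i++) {
		int count = 0;

		base_name(syms[i], base, sizeof(base));
		for (int j = 0; j < n; j++) {
			base_name(syms[j], other, sizeof(other));
			if (!strcmp(base, other))
				count++;
		}
		/* Emit an alias only for suffixed symbols whose base name
		 * occurs exactly once. */
		if (count == 1 && strchr(syms[i], '.'))
			printf("alias: %s -> %s\n", base, syms[i]);
	}
	return 0;
}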


Thanks, 
Song



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-08 Thread Song Liu


> On Aug 8, 2024, at 2:59 AM, Petr Mladek  wrote:
> 
> On Wed 2024-08-07 20:48:48, Song Liu wrote:
>> 
>> 
>>> On Aug 7, 2024, at 8:33 AM, Sami Tolvanen  wrote:
>>> 
>>> Hi,
>>> 
>>> On Wed, Aug 7, 2024 at 3:08 AM Masami Hiramatsu  wrote:
>>>> 
>>>> On Wed, 7 Aug 2024 00:19:20 +
>>>> Song Liu  wrote:
>>>> 
>>>>> Do you mean we do not want patch 3/3, but would like to keep 1/3 and part
>>>>> of 2/3 (remove the _without_suffix APIs)? If this is the case, we are
>>>>> undoing the change by Sami in [1], and thus may break some tracing tools.
>>>> 
>>>> What tracing tools may be broke and why?
>>> 
>>> This was a few years ago when we were first adding LTO support, but
>>> the unexpected suffixes in tracing output broke systrace in Android,
>>> presumably because the tools expected to find specific function names
>>> without suffixes. I'm not sure if systrace would still be a problem
>>> today, but other tools might still make assumptions about the function
>>> name format. At the time, we decided to filter out the suffixes in all
>>> user space visible output to avoid these issues.
>>> 
>>>> For this suffix problem, I would like to add another patch to allow 
>>>> probing on
>>>> suffixed symbols. (It seems suffixed symbols are not available at this 
>>>> point)
>>>> 
>>>> The problem is that the suffixed symbols maybe a "part" of the original 
>>>> function,
>>>> thus user has to carefully use it.
>>>> 
>>>>> 
>>>>> Sami, could you please share your thoughts on this?
>>>> 
>>>> Sami, I would like to know what problem you have on kprobes.
>>> 
>>> The reports we received back then were about registering kprobes for
>>> static functions, which obviously failed if the compiler added a
>>> suffix to the function name. This was more of a problem with ThinLTO
>>> and Clang CFI at the time because the compiler used to rename _all_
>>> static functions, but one can obviously run into the same issue with
>>> just LTO.
>> 
>> I think newer LLVM/clang no longer add suffixes to all static functions
>> with LTO and CFI. So this may not be a real issue any more?
>> 
>> If we still need to allow tracing without suffix, I think the approach
>> in this patch set is correct (sort syms based on full name,
> 
> Yes, we should allow to find the symbols via the full name, definitely.
> 
>> remove suffixes in special APIs during lookup).
> 
> Just an idea. An alternative solution would be to make an alias
> without the suffix when there is only one symbol with the same
> name.
> 
> It would be complementary with the patch adding aliases for symbols
> with the same name, see
> https://lore.kernel.org/r/20231204214635.2916691-1-alessandro.carmin...@gmail.com

I guess v3 plus this work may work well together.  

> It would allow finding the symbols with and without the suffix using
> a single API.

Could you please describe how this API would work? I tried some
ideas in v1, but they turned out to be quite confusing. So I decided
to leave this logic to the users of the kallsyms APIs in v2.

Thanks,
Song




Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-08 Thread Song Liu


> On Aug 8, 2024, at 2:48 AM, Petr Mladek  wrote:
> 
> On Wed 2024-08-07 19:46:31, Song Liu wrote:
>> 
>> 
>>> On Aug 7, 2024, at 7:58 AM, zhang warden  wrote:
>>> 
>>> 
>>>> In my GCC built, we have suffixes like ".constprop.0", ".part.0", 
>>>> ".isra.0", 
>>>> and ".isra.0.cold".
>>> 
>>> With a fresher's eyes, I sometimes ran into mistakes caused by the
>>> ".constprop.0" and ".part.0" suffixes generated by GCC when trying to
>>> build a livepatch module.
>>> 
>>> These sections with such suffixes are special, and sometimes the symbol
>>> st_value is quite different. What are these kinds of sections (or
>>> symbols) used for?
>> 
>> 
>> IIUC, constprop means const propagation. For example, function 
>> "foo(int a, int b)" that is called as "foo(a, 10)" will be come 
>> "foo(int a)" with a hard-coded b = 10 inside. 
>> 
>> .part.0 is part of the function, as the other part is inlined in 
>> the caller.
> 
> Hmm, we should not remove the suffixes like .constprop*, .part*,
> .isra*. They implement a special optimized variant of the function.
> It is no longer the original full-featured one.
> 
> This is a difference against adding a suffix for a static function.
> Such a symbol implements the original full-featured function.

Allowing tracing without the .llvm. suffix may target a different
function with the same name, e.g., func_a.llvm.1 vs. func_a.llvm.2. We
can probably detect and report this in the kernel. However, I would
rather we just disallow tracing without suffixes. I think Masami
also agrees with this.
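
If we did want the kernel to detect and report the ambiguity, the check
would be along these lines (a sketch only, with a made-up symbol list; not
actual kernel code):

#include <stdio.h>
#include <string.h>

/* A symbol matches a suffix-less request when it is either the exact name
 * or the name followed by a '.' suffix. */
static int matches_base(const char *sym, const char *base)
{
	size_t n = strlen(base);

	return !strncmp(sym, base, n) && (sym[n] == '\0' || sym[n] == '.');
}

int main(void)
{
	const char *kallsyms[] = {
		"func_a.llvm.1",
		"func_a.llvm.2",
		"func_b",
	};
	const char *request = "func_a";
	int hits = 0;

	for (unsigned int i = 0; i < sizeof(kallsyms) / sizeof(kallsyms[0]); i++)
		hits += matches_base(kallsyms[i], request);

	if (hits > 1)
		printf("'%s' is ambiguous (%d candidates); refusing to probe\n",
		       request, hits);
	return 0;
}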

Thanks,
Song 

Re: [PATCH v2 2/3] kallsyms: Add APIs to match symbol without .XXXX suffix.

2024-08-08 Thread Song Liu


> On Aug 8, 2024, at 3:20 AM, Petr Mladek  wrote:
> 
> On Fri 2024-08-02 14:08:34, Song Liu wrote:
>> With CONFIG_LTO_CLANG=y, the compiler may add suffix to function names
>> to avoid duplication. This causes confusion with users of kallsyms.
>> On one hand, users like livepatch are required to match the symbols
>> exactly. On the other hand, users like kprobe would like to match to
>> original function names.
>> 
>> Solve this by splitting kallsyms APIs. Specifically, existing APIs now
>> should match the symbols exactly. Add two APIs that match only the part
>> without . suffix. Specifically, the following two APIs are added.
>> 
>> 1. kallsyms_lookup_name_without_suffix()
>> 2. kallsyms_on_each_match_symbol_without_suffix()
>> 
>> These APIs will be used by kprobe.
>> 
>> Also cleanup some code and update kallsyms_selftests accordingly.
>> 
>> --- a/kernel/kallsyms.c
>> +++ b/kernel/kallsyms.c
>> @@ -164,30 +164,27 @@ static void cleanup_symbol_name(char *s)
>> {
>> char *res;
>> 
>> - if (!IS_ENABLED(CONFIG_LTO_CLANG))
>> - return;
>> -
>> /*
>> * LLVM appends various suffixes for local functions and variables that
>> * must be promoted to global scope as part of LTO.  This can break
>> * hooking of static functions with kprobes. '.' is not a valid
>> - * character in an identifier in C. Suffixes only in LLVM LTO observed:
>> - * - foo.llvm.[0-9a-f]+
>> + * character in an identifier in C, so we can just remove the
>> + * suffix.
>> */
>> - res = strstr(s, ".llvm.");
>> + res = strstr(s, ".");
> 
> IMHO, we should not remove the suffixes like .constprop*, .part*,
> .isra*. They implement a special optimized variant of the function.
> It is no longer the original full-featured one.
> 
>> if (res)
>> *res = '\0';
>> 
>> return;
>> }
>> 
>> -static int compare_symbol_name(const char *name, char *namebuf)
>> +static int compare_symbol_name(const char *name, char *namebuf, bool 
>> exact_match)
>> {
>> - /* The kallsyms_seqs_of_names is sorted based on names after
>> - * cleanup_symbol_name() (see scripts/kallsyms.c) if clang lto is enabled.
>> - * To ensure correct bisection in kallsyms_lookup_names(), do
>> - * cleanup_symbol_name(namebuf) before comparing name and namebuf.
>> - */
>> + int ret = strcmp(name, namebuf);
>> +
>> + if (exact_match || !ret)
>> + return ret;
>> +
>> cleanup_symbol_name(namebuf);
>> return strcmp(name, namebuf);
>> }
>> @@ -204,13 +201,17 @@ static unsigned int get_symbol_seq(int index)
>> 
>> static int kallsyms_lookup_names(const char *name,
>> unsigned int *start,
>> - unsigned int *end)
>> + unsigned int *end,
>> + bool exact_match)
>> {
>> int ret;
>> int low, mid, high;
>> unsigned int seq, off;
>> char namebuf[KSYM_NAME_LEN];
>> 
>> + if (!IS_ENABLED(CONFIG_LTO_CLANG))
>> + exact_match = true;
> 
> IMHO, this is a very bad design. It causes
> 
> kallsyms_on_each_match_symbol_without_suffix(,,, false);
> 
> to no longer work as expected. It creates hard-to-maintain
> code. The code does not do what it looks like it does.

Indeed. It actually caused an issue with a GCC-built kernel in my
tests after submitting v2.

Thanks,
Song



[PATCH v3 2/2] kallsyms: Match symbols exactly with CONFIG_LTO_CLANG

2024-08-07 Thread Song Liu
With CONFIG_LTO_CLANG=y, the compiler may add a .llvm. suffix to
function names to avoid duplication. APIs like kallsyms_lookup_name()
and kallsyms_on_each_match_symbol() try to match these symbol names
without the .llvm. suffix, e.g., match "c_stop" with the symbol
c_stop.llvm.17132674095431275852. This turned out to be problematic
for use cases that require an exact match, for example, livepatch.

Fix this by making the APIs match symbols exactly.

Also clean up kallsyms_selftest accordingly.

Signed-off-by: Song Liu 
---
 kernel/kallsyms.c  | 55 +-
 kernel/kallsyms_selftest.c | 22 +--
 2 files changed, 7 insertions(+), 70 deletions(-)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index fb2c77368d18..a9a0ca605d4a 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -160,38 +160,6 @@ unsigned long kallsyms_sym_address(int idx)
return kallsyms_relative_base - 1 - kallsyms_offsets[idx];
 }
 
-static void cleanup_symbol_name(char *s)
-{
-   char *res;
-
-   if (!IS_ENABLED(CONFIG_LTO_CLANG))
-   return;
-
-   /*
-* LLVM appends various suffixes for local functions and variables that
-* must be promoted to global scope as part of LTO.  This can break
-* hooking of static functions with kprobes. '.' is not a valid
-* character in an identifier in C. Suffixes only in LLVM LTO observed:
-* - foo.llvm.[0-9a-f]+
-*/
-   res = strstr(s, ".llvm.");
-   if (res)
-   *res = '\0';
-
-   return;
-}
-
-static int compare_symbol_name(const char *name, char *namebuf)
-{
-   /* The kallsyms_seqs_of_names is sorted based on names after
-* cleanup_symbol_name() (see scripts/kallsyms.c) if clang lto is 
enabled.
-* To ensure correct bisection in kallsyms_lookup_names(), do
-* cleanup_symbol_name(namebuf) before comparing name and namebuf.
-*/
-   cleanup_symbol_name(namebuf);
-   return strcmp(name, namebuf);
-}
-
 static unsigned int get_symbol_seq(int index)
 {
unsigned int i, seq = 0;
@@ -219,7 +187,7 @@ static int kallsyms_lookup_names(const char *name,
seq = get_symbol_seq(mid);
off = get_symbol_offset(seq);
kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
-   ret = compare_symbol_name(name, namebuf);
+   ret = strcmp(name, namebuf);
if (ret > 0)
low = mid + 1;
else if (ret < 0)
@@ -236,7 +204,7 @@ static int kallsyms_lookup_names(const char *name,
seq = get_symbol_seq(low - 1);
off = get_symbol_offset(seq);
kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
-   if (compare_symbol_name(name, namebuf))
+   if (strcmp(name, namebuf))
break;
low--;
}
@@ -248,7 +216,7 @@ static int kallsyms_lookup_names(const char *name,
seq = get_symbol_seq(high + 1);
off = get_symbol_offset(seq);
kallsyms_expand_symbol(off, namebuf, 
ARRAY_SIZE(namebuf));
-   if (compare_symbol_name(name, namebuf))
+   if (strcmp(name, namebuf))
break;
high++;
}
@@ -407,8 +375,7 @@ static int kallsyms_lookup_buildid(unsigned long addr,
if (modbuildid)
*modbuildid = NULL;
 
-   ret = strlen(namebuf);
-   goto found;
+   return strlen(namebuf);
}
 
/* See if it's in a module or a BPF JITed image. */
@@ -422,8 +389,6 @@ static int kallsyms_lookup_buildid(unsigned long addr,
ret = ftrace_mod_address_lookup(addr, symbolsize,
offset, modname, namebuf);
 
-found:
-   cleanup_symbol_name(namebuf);
return ret;
 }
 
@@ -450,8 +415,6 @@ const char *kallsyms_lookup(unsigned long addr,
 
 int lookup_symbol_name(unsigned long addr, char *symname)
 {
-   int res;
-
symname[0] = '\0';
symname[KSYM_NAME_LEN - 1] = '\0';
 
@@ -462,16 +425,10 @@ int lookup_symbol_name(unsigned long addr, char *symname)
/* Grab name */
kallsyms_expand_symbol(get_symbol_offset(pos),
   symname, KSYM_NAME_LEN);
-   goto found;
+   return 0;
}
/* See if it's in a module. */
-   res = lookup_module_symbol_name(addr, symname);
-   if (res)
-   return res;
-
-found:
-   cleanup_symbol_name(symname);
-   return 0;
+   return lookup_module_symbol_name(addr, symname);
 }
 
 /* Look up a kernel symbol and return it in a text buffer. */
diff --git 

[PATCH v3 1/2] kallsyms: Do not cleanup .llvm. suffix before sorting symbols

2024-08-07 Thread Song Liu
Cleaning up the symbols causes various issues afterwards. Let's sort
the list based on the original names.

Signed-off-by: Song Liu 
---
 scripts/kallsyms.c  | 31 ++-
 scripts/link-vmlinux.sh |  4 
 2 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 0ed873491bf5..123dab0572f8 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -5,8 +5,7 @@
  * This software may be used and distributed according to the terms
  * of the GNU General Public License, incorporated herein by reference.
  *
- * Usage: kallsyms [--all-symbols] [--absolute-percpu]
- * [--lto-clang] in.map > out.S
+ * Usage: kallsyms [--all-symbols] [--absolute-percpu]  in.map > out.S
  *
  *  Table compression uses all the unused char codes on the symbols and
  *  maps these to the most used substrings (tokens). For instance, it might
@@ -62,7 +61,6 @@ static struct sym_entry **table;
 static unsigned int table_size, table_cnt;
 static int all_symbols;
 static int absolute_percpu;
-static int lto_clang;
 
 static int token_profit[0x1];
 
@@ -73,8 +71,7 @@ static unsigned char best_table_len[256];
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] "
-   "[--lto-clang] in.map > out.S\n");
+   fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] 
in.map > out.S\n");
exit(1);
 }
 
@@ -344,25 +341,6 @@ static bool symbol_absolute(const struct sym_entry *s)
return s->percpu_absolute;
 }
 
-static void cleanup_symbol_name(char *s)
-{
-   char *p;
-
-   /*
-* ASCII[.]   = 2e
-* ASCII[0-9] = 30,39
-* ASCII[A-Z] = 41,5a
-* ASCII[_]   = 5f
-* ASCII[a-z] = 61,7a
-*
-* As above, replacing the first '.' in ".llvm." with '\0' does not
-* affect the main sorting, but it helps us with subsorting.
-*/
-   p = strstr(s, ".llvm.");
-   if (p)
-   *p = '\0';
-}
-
 static int compare_names(const void *a, const void *b)
 {
int ret;
@@ -526,10 +504,6 @@ static void write_src(void)
output_address(relative_base);
printf("\n");
 
-   if (lto_clang)
-   for (i = 0; i < table_cnt; i++)
-   cleanup_symbol_name((char *)table[i]->sym);
-
sort_symbols_by_name();
output_label("kallsyms_seqs_of_names");
for (i = 0; i < table_cnt; i++)
@@ -807,7 +781,6 @@ int main(int argc, char **argv)
static const struct option long_options[] = {
{"all-symbols", no_argument, &all_symbols, 1},
{"absolute-percpu", no_argument, &absolute_percpu, 1},
-   {"lto-clang",   no_argument, <o_clang,   1},
{},
};
 
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index f7b2503cdba9..22d0bc843986 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -156,10 +156,6 @@ kallsyms()
kallsymopt="${kallsymopt} --absolute-percpu"
fi
 
-   if is_enabled CONFIG_LTO_CLANG; then
-   kallsymopt="${kallsymopt} --lto-clang"
-   fi
-
info KSYMS "${2}.S"
scripts/kallsyms ${kallsymopt} "${1}" > "${2}.S"
 
-- 
2.43.5




[PATCH v3 0/2] Fix kallsyms with CONFIG_LTO_CLANG

2024-08-07 Thread Song Liu
With CONFIG_LTO_CLANG, the compiler/linker adds a .llvm. suffix to
local symbols to avoid duplication. Existing scripts/kallsyms sorts
symbols without the .llvm. suffix. However, this causes quite a few
issues later on. Some users of kallsyms, such as livepatch, have to match
symbols exactly.

Address this by sorting the full symbol names at build time, and by letting
the kallsyms lookup APIs match the symbols exactly.

Changes v2 => v3:
1. Remove the _without_suffix APIs, as kprobe will not use them.
   (Masami Hiramatsu)

v2: 
https://lore.kernel.org/live-patching/20240802210836.2210140-1-s...@kernel.org/T/#u

Changes v1 => v2:
1. Update the APIs to remove all .XXX suffixes (v1 only removes .llvm.*).
2. Rename the APIs as *_without_suffix. (Masami Hiramatsu)
3. Fix another user from kprobe. (Masami Hiramatsu)
4. Add tests for the new APIs in kallsyms_selftests.

v1: 
https://lore.kernel.org/live-patching/20240730005433.3559731-1-s...@kernel.org/T/#u

Song Liu (2):
  kallsyms: Do not cleanup .llvm. suffix before sorting symbols
  kallsyms: Match symbols exactly with CONFIG_LTO_CLANG

 kernel/kallsyms.c  | 55 +-
 kernel/kallsyms_selftest.c | 22 +--
 scripts/kallsyms.c | 31 ++---
 scripts/link-vmlinux.sh|  4 ---
 4 files changed, 9 insertions(+), 103 deletions(-)

--
2.43.5



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-07 Thread Song Liu


> On Aug 7, 2024, at 1:55 PM, Masami Hiramatsu  wrote:
> 
> On Wed, 7 Aug 2024 00:19:20 +0000
> Song Liu  wrote:
> 
>> 
>> 
>>> On Aug 6, 2024, at 5:01 PM, Masami Hiramatsu  wrote:
>>> 
>>> On Tue, 6 Aug 2024 20:12:55 +
>>> Song Liu  wrote:
>>> 
>>>> 
>>>> 
>>>>> On Aug 6, 2024, at 1:01 PM, Steven Rostedt  wrote:
>>>>> 
>>>>> On Tue, 6 Aug 2024 16:00:49 -0400
>>>>> Steven Rostedt  wrote:
>>>>> 
>>>>>>>>> + if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
>>>>>>>>> + addr = kallsyms_lookup_name_without_suffix(trace_kprobe_symbol(tk));
>>>>>>>>> +
>>>>>>>> 
>>>>>>>> So you do the lookup twice if this is enabled?
>>>>>>>> 
>>>>>>>> Why not just use "kallsyms_lookup_name_without_suffix()" the entire 
>>>>>>>> time,
>>>>>>>> and it should work just the same as "kallsyms_lookup_name()" if it's 
>>>>>>>> not
>>>>>>>> needed?
>>>>>>> 
>>>>>>> We still want to give priority to full match. For example, we have:
>>>>>>> 
>>>>>>> [root@~]# grep c_next /proc/kallsyms
>>>>>>> 81419dc0 t c_next.llvm.7567888411731313343
>>>>>>> 81680600 t c_next
>>>>>>> 81854380 t c_next.llvm.14337844803752139461
>>>>>>> 
>>>>>>> If the goal is to explicitly trace c_next.llvm.7567888411731313343, the
>>>>>>> user can provide the full name. If we always match _without_suffix, all
>>>>>>> of the 3 will match to the first one. 
>>>>>>> 
>>>>>>> Does this make sense?  
>>>>>> 
>>>>>> Yes. Sorry, I missed the "&& !addr)" after the "IS_ENABLED()", which 
>>>>>> looked
>>>>>> like you did the command twice.
>>>>> 
>>>>> But that said, does this only have to be for llvm? Or should we do this 
>>>>> for
>>>>> even gcc? As I believe gcc can give strange symbols too.
>>>> 
>>>> I think most of the issue comes with LTO, as LTO promotes local static
>>>> functions to global functions. IIUC, we don't have GCC built, LTO enabled
>>>> kernel yet.
>>>> 
>>>> In my GCC built, we have suffixes like ".constprop.0", ".part.0", 
>>>> ".isra.0", 
>>>> and ".isra.0.cold". We didn't do anything about these before this set. So 
>>>> I 
>>>> think we are OK not handling them now. We sure can enable it for GCC built
>>>> kernel in the future.
>>> 
>>> Hmm, I think it should be handled as it is. This means it should do as
>>> livepatch does. Since I expected user will check kallsyms if gets error,
>>> we should keep this as it is. (if a symbol has suffix, it should accept
>>> symbol with suffix, or user will get confused because they can not find
>>> which symbol is kprobed.)
>>> 
>>> Sorry about the conclusion (so I NAK this), but this is a good discussion.
>> 
>> Do you mean we do not want patch 3/3, but would like to keep 1/3 and part 
>> of 2/3 (remove the _without_suffix APIs)? If this is the case, we are 
>> undoing the change by Sami in [1], and thus may break some tracing tools.
> 
> BTW, I confirmed that PATCH 1/3 and 2/3 fix kprobes to probe on suffixed
> symbols correctly. (because 1/3 allows searching for suffixed symbols)
> 
> /sys/kernel/tracing # cat dynamic_events 
> p:kprobes/p_c_stop_llvm_17132674095431275852_0 
> c_stop.llvm.17132674095431275852
> p:kprobes/p_c_stop_llvm_8011538628216713357_0 c_stop.llvm.8011538628216713357
> p:kprobes/p_c_stop_0 c_stop

Thanks for confirming this works. 

I will clean up the code and send v3. I plan to remove the _without_suffix APIs
in v3. We can add them back if we need them to get systrace working. 

Song



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-07 Thread Song Liu


> On Aug 7, 2024, at 8:33 AM, Sami Tolvanen  wrote:
> 
> Hi,
> 
> On Wed, Aug 7, 2024 at 3:08 AM Masami Hiramatsu  wrote:
>> 
>> On Wed, 7 Aug 2024 00:19:20 +
>> Song Liu  wrote:
>> 
>>> Do you mean we do not want patch 3/3, but would like to keep 1/3 and part
>>> of 2/3 (remove the _without_suffix APIs)? If this is the case, we are
>>> undoing the change by Sami in [1], and thus may break some tracing tools.
>> 
>> What tracing tools may break, and why?
> 
> This was a few years ago when we were first adding LTO support, but
> the unexpected suffixes in tracing output broke systrace in Android,
> presumably because the tools expected to find specific function names
> without suffixes. I'm not sure if systrace would still be a problem
> today, but other tools might still make assumptions about the function
> name format. At the time, we decided to filter out the suffixes in all
> user space visible output to avoid these issues.
> 
>> For this suffix problem, I would like to add another patch to allow
>> probing on suffixed symbols. (It seems suffixed symbols are not
>> available at this point.)
>> 
>> The problem is that the suffixed symbols may be a "part" of the original
>> function, thus the user has to use it carefully.
>> 
>>> 
>>> Sami, could you please share your thoughts on this?
>> 
>> Sami, I would like to know what problem you have on kprobes.
> 
> The reports we received back then were about registering kprobes for
> static functions, which obviously failed if the compiler added a
> suffix to the function name. This was more of a problem with ThinLTO
> and Clang CFI at the time because the compiler used to rename _all_
> static functions, but one can obviously run into the same issue with
> just LTO.

I think newer LLVM/clang no longer add suffixes to all static functions
with LTO and CFI. So this may not be a real issue any more?

If we still need to allow tracing without suffix, I think the approach
in this patch set is correct (sort syms based on full name, remove
suffixes in special APIs during lookup). 

Thanks,
Song



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-07 Thread Song Liu


> On Aug 7, 2024, at 1:40 PM, Song Liu  wrote:
> 
> 
> 
>> On Aug 7, 2024, at 1:08 PM, Steven Rostedt  wrote:
>> 
>> On Wed, 7 Aug 2024 19:41:11 +
>> Song Liu  wrote:
>> 
>> 
>>> It appears there are multiple APIs that may need change. For example, on gcc
>>> built kernel, /sys/kernel/tracing/available_filter_functions does not show 
>>> the suffix: 
>>> 
>>> [root@(none)]# grep cmos_irq_enable /proc/kallsyms
>>> 81db5470 t __pfx_cmos_irq_enable.constprop.0
>>> 81db5480 t cmos_irq_enable.constprop.0
>>> 822dec6e t cmos_irq_enable.constprop.0.cold
>>> 
>>> [root@(none)]# grep cmos_irq_enable 
>>> /sys/kernel/tracing/available_filter_functions
>>> cmos_irq_enable
>> 
>> Strange, I don't see that:
>> 
>> ~# grep cmos_irq_enable /proc/kallsyms 
>> 8f4b2500 t __pfx_cmos_irq_enable.constprop.0
>> 8f4b2510 t cmos_irq_enable.constprop.0
>> 
>> ~# grep cmos_irq_enable /sys/kernel/tracing/available_filter_functions
>> cmos_irq_enable.constprop.0
> 
> Ah, this is caused by my change. Let me fix that in the next version.

PS: The current code still removes the .llvm. suffix from
available_filter_functions.

I will change kallsyms_lookup_buildid() to not do the cleanup. 

Thanks,
Song



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-07 Thread Song Liu


> On Aug 7, 2024, at 1:08 PM, Steven Rostedt  wrote:
> 
> On Wed, 7 Aug 2024 19:41:11 +0000
> Song Liu  wrote:
> 
> 
>> It appears there are multiple APIs that may need change. For example, on gcc
>> built kernel, /sys/kernel/tracing/available_filter_functions does not show 
>> the suffix: 
>> 
>>  [root@(none)]# grep cmos_irq_enable /proc/kallsyms
>>  81db5470 t __pfx_cmos_irq_enable.constprop.0
>>  81db5480 t cmos_irq_enable.constprop.0
>>  822dec6e t cmos_irq_enable.constprop.0.cold
>> 
>>  [root@(none)]# grep cmos_irq_enable 
>> /sys/kernel/tracing/available_filter_functions
>>  cmos_irq_enable
> 
> Strange, I don't see that:
> 
>  ~# grep cmos_irq_enable /proc/kallsyms 
>  8f4b2500 t __pfx_cmos_irq_enable.constprop.0
>  8f4b2510 t cmos_irq_enable.constprop.0
> 
>  ~# grep cmos_irq_enable /sys/kernel/tracing/available_filter_functions
>  cmos_irq_enable.constprop.0

Ah, this is caused by my change. Let me fix that in the next version. 

Thanks,
Song




Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-07 Thread Song Liu


> On Aug 7, 2024, at 7:58 AM, zhang warden  wrote:
> 
> 
>> In my GCC built, we have suffixes like ".constprop.0", ".part.0", ".isra.0", 
>> and ".isra.0.cold".
> 
> With a fresher's eyes, I sometimes ran into mistakes caused by the
> ".constprop.0" and ".part.0" suffixes generated by GCC when trying to
> build a livepatch module.
> 
> These sections with such suffixes are special, and sometimes the symbol
> st_value is quite different. What are these kinds of sections (or symbols)
> used for?


IIUC, constprop means constant propagation. For example, a function
"foo(int a, int b)" that is only called as "foo(a, 10)" will become
"foo(int a)" with a hard-coded b = 10 inside.

.part.0 is part of the function, as the other part is inlined in
the caller.

With a binary-diff based toolchain (kpatch-build), I think these will be
handled automatically. However, if we write the livepatch manually, we
need to understand the behavior of .constprop and .part.
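
As an illustration, here is a hand-written sketch of what constant
propagation effectively does (this is not real GCC output, just the idea):

#include <stdio.h>

static int foo(int a, int b)
{
	return a * b;
}

int bar(int a)
{
	return foo(a, 10);	/* b is always 10 at this call site */
}

/*
 * What GCC's constant propagation effectively produces: a specialized
 * clone with b folded in.  In the symbol table it shows up with a name
 * like foo.constprop.0 (spelled foo_constprop_0 here only because '.'
 * is not valid in a C identifier).
 */
static int foo_constprop_0(int a)
{
	return a * 10;
}

int main(void)
{
	/* Both compute the same thing; GCC would call the clone from bar(). */
	printf("%d %d\n", bar(3), foo_constprop_0(3));
	return 0;
}

Similarly, when GCC inlines the hot part of a function into its caller and
keeps only the remainder out of line, the leftover piece shows up with a
.part.0 suffix.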

Thanks,
Song



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-07 Thread Song Liu


> On Aug 7, 2024, at 3:08 AM, Masami Hiramatsu  wrote:
> 
> On Wed, 7 Aug 2024 00:19:20 +0000
> Song Liu  wrote:
> 
>> 
>> 
>>> On Aug 6, 2024, at 5:01 PM, Masami Hiramatsu  wrote:
>>> 
>>> On Tue, 6 Aug 2024 20:12:55 +
>>> Song Liu  wrote:
>>> 
>>>> 
>>>> 
>>>>> On Aug 6, 2024, at 1:01 PM, Steven Rostedt  wrote:
>>>>> 
>>>>> On Tue, 6 Aug 2024 16:00:49 -0400
>>>>> Steven Rostedt  wrote:
>>>>> 
>>>>>>>>> + if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
>>>>>>>>> + addr = kallsyms_lookup_name_without_suffix(trace_kprobe_symbol(tk));
>>>>>>>>> +
>>>>>>>> 
>>>>>>>> So you do the lookup twice if this is enabled?
>>>>>>>> 
>>>>>>>> Why not just use "kallsyms_lookup_name_without_suffix()" the entire 
>>>>>>>> time,
>>>>>>>> and it should work just the same as "kallsyms_lookup_name()" if it's 
>>>>>>>> not
>>>>>>>> needed?
>>>>>>> 
>>>>>>> We still want to give priority to full match. For example, we have:
>>>>>>> 
>>>>>>> [root@~]# grep c_next /proc/kallsyms
>>>>>>> 81419dc0 t c_next.llvm.7567888411731313343
>>>>>>> 81680600 t c_next
>>>>>>> 81854380 t c_next.llvm.14337844803752139461
>>>>>>> 
>>>>>>> If the goal is to explicitly trace c_next.llvm.7567888411731313343, the
>>>>>>> user can provide the full name. If we always match _without_suffix, all
>>>>>>> of the 3 will match to the first one. 
>>>>>>> 
>>>>>>> Does this make sense?  
>>>>>> 
>>>>>> Yes. Sorry, I missed the "&& !addr)" after the "IS_ENABLED()", which 
>>>>>> looked
>>>>>> like you did the command twice.
>>>>> 
>>>>> But that said, does this only have to be for llvm? Or should we do this 
>>>>> for
>>>>> even gcc? As I believe gcc can give strange symbols too.
>>>> 
>>>> I think most of the issue comes with LTO, as LTO promotes local static
>>>> functions to global functions. IIUC, we don't have GCC built, LTO enabled
>>>> kernel yet.
>>>> 
>>>> In my GCC built, we have suffixes like ".constprop.0", ".part.0", 
>>>> ".isra.0", 
>>>> and ".isra.0.cold". We didn't do anything about these before this set. So 
>>>> I 
>>>> think we are OK not handling them now. We sure can enable it for GCC built
>>>> kernel in the future.
>>> 
>>> Hmm, I think it should be handled as it is. This means it should do as
>>> livepatch does. Since I expected user will check kallsyms if gets error,
>>> we should keep this as it is. (if a symbol has suffix, it should accept
>>> symbol with suffix, or user will get confused because they can not find
>>> which symbol is kprobed.)
>>> 
>>> Sorry about the conclusion (so I NAK this), but this is a good discussion.
>> 
>> Do you mean we do not want patch 3/3, but would like to keep 1/3 and part 
>> of 2/3 (remove the _without_suffix APIs)? If this is the case, we are 
>> undoing the change by Sami in [1], and thus may break some tracing tools.
> 
> What tracing tools may break, and why?
> 
> For this suffix problem, I would like to add another patch to allow probing on
> suffixed symbols. (It seems suffixed symbols are not available at this point.)
> 
> The problem is that the suffixed symbols may be a "part" of the original
> function, thus the user has to use it carefully.

It appears there are multiple APIs that may need changes. For example, on a
gcc-built kernel, /sys/kernel/tracing/available_filter_functions does not show
the suffix:

  [root@(none)]# grep cmos_irq_enable /proc/kallsyms
  81db5470 t __pfx_cmos_irq_enable.constprop.0
  81db5480 t cmos_irq_enable.constprop.0
  822dec6e t cmos_irq_enable.constprop.0.cold

   [root@(none)]# grep cmos_irq_enable /sys/kernel/tracing/available_filter_functions
   cmos_irq_enable

perf-probe uses _text+<offset> for such cases:

  [root@(none)]# cat /sys/kernel/tracing/kprobe_events
  p:probe/cmos_irq_enable _text+14374016 

I am not sure which APIs we need to change here.

Thanks,
Song




Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-06 Thread Song Liu


> On Aug 6, 2024, at 5:01 PM, Masami Hiramatsu  wrote:
> 
> On Tue, 6 Aug 2024 20:12:55 +0000
> Song Liu  wrote:
> 
>> 
>> 
>>> On Aug 6, 2024, at 1:01 PM, Steven Rostedt  wrote:
>>> 
>>> On Tue, 6 Aug 2024 16:00:49 -0400
>>> Steven Rostedt  wrote:
>>> 
>>>>>>> + if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
>>>>>>> + addr = kallsyms_lookup_name_without_suffix(trace_kprobe_symbol(tk));
>>>>>>> +
>>>>>> 
>>>>>> So you do the lookup twice if this is enabled?
>>>>>> 
>>>>>> Why not just use "kallsyms_lookup_name_without_suffix()" the entire time,
>>>>>> and it should work just the same as "kallsyms_lookup_name()" if it's not
>>>>>> needed?
>>>>> 
>>>>> We still want to give priority to full match. For example, we have:
>>>>> 
>>>>> [root@~]# grep c_next /proc/kallsyms
>>>>> 81419dc0 t c_next.llvm.7567888411731313343
>>>>> 81680600 t c_next
>>>>> 81854380 t c_next.llvm.14337844803752139461
>>>>> 
>>>>> If the goal is to explicitly trace c_next.llvm.7567888411731313343, the
>>>>> user can provide the full name. If we always match _without_suffix, all
>>>>> of the 3 will match to the first one. 
>>>>> 
>>>>> Does this make sense?  
>>>> 
>>>> Yes. Sorry, I missed the "&& !addr)" after the "IS_ENABLED()", which looked
>>>> like you did the command twice.
>>> 
>>> But that said, does this only have to be for llvm? Or should we do this for
>>> even gcc? As I believe gcc can give strange symbols too.
>> 
>> I think most of the issue comes with LTO, as LTO promotes local static
>> functions to global functions. IIUC, we don't have GCC built, LTO enabled
>> kernel yet.
>> 
>> In my GCC built, we have suffixes like ".constprop.0", ".part.0", ".isra.0", 
>> and ".isra.0.cold". We didn't do anything about these before this set. So I 
>> think we are OK not handling them now. We sure can enable it for GCC built
>> kernel in the future.
> 
> Hmm, I think it should be handled as it is. This means it should do as
> livepatch does. Since I expected user will check kallsyms if gets error,
> we should keep this as it is. (if a symbol has suffix, it should accept
> symbol with suffix, or user will get confused because they can not find
> which symbol is kprobed.)
> 
> Sorry about the conclusion (so I NAK this), but this is a good discussion. 

Do you mean we do not want patch 3/3, but would like to keep 1/3 and part 
of 2/3 (remove the _without_suffix APIs)? If this is the case, we are 
undoing the change by Sami in [1], and thus may break some tracing tools. 

Sami, could you please share your thoughts on this? 

If this works, I will send next version with 1/3 and part of 2/3. 

Thanks,
Song

[1] 
https://lore.kernel.org/all/20210408182843.1754385-8-samitolva...@google.com/



Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-06 Thread Song Liu


> On Aug 6, 2024, at 1:01 PM, Steven Rostedt  wrote:
> 
> On Tue, 6 Aug 2024 16:00:49 -0400
> Steven Rostedt  wrote:
> 
> + if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
> + addr = kallsyms_lookup_name_without_suffix(trace_kprobe_symbol(tk));
> +
 
 So you do the lookup twice if this is enabled?
 
 Why not just use "kallsyms_lookup_name_without_suffix()" the entire time,
 and it should work just the same as "kallsyms_lookup_name()" if it's not
 needed?
>>> 
>>> We still want to give priority to full match. For example, we have:
>>> 
>>> [root@~]# grep c_next /proc/kallsyms
>>> 81419dc0 t c_next.llvm.7567888411731313343
>>> 81680600 t c_next
>>> 81854380 t c_next.llvm.14337844803752139461
>>> 
>>> If the goal is to explicitly trace c_next.llvm.7567888411731313343, the
>>> user can provide the full name. If we always match _without_suffix, all
>>> of the 3 will match to the first one. 
>>> 
>>> Does this make sense?  
>> 
>> Yes. Sorry, I missed the "&& !addr)" after the "IS_ENABLED()", which looked
>> like you did the command twice.
> 
> But that said, does this only have to be for llvm? Or should we do this for
> even gcc? As I believe gcc can give strange symbols too.

I think most of the issue comes with LTO, as LTO promotes local static
functions to global functions. IIUC, we don't have a GCC-built, LTO-enabled
kernel yet.

In my GCC build, we have suffixes like ".constprop.0", ".part.0", ".isra.0", 
and ".isra.0.cold". We didn't do anything about these before this set. So I 
think we are OK not handling them now. We can surely enable it for GCC-built
kernels in the future. 

Thanks,
Song





Re: [PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-06 Thread Song Liu
Hi Steven,

> On Aug 6, 2024, at 11:44 AM, Steven Rostedt  wrote:
> 
> On Fri,  2 Aug 2024 14:08:35 -0700
> Song Liu  wrote:
> 
>> Use the new kallsyms APIs that matches symbols name with .XXX
>> suffix. This allows userspace tools to get kprobes on the expected
>> function name, while the actual symbol has a .llvm. suffix.
>> 
>> This only effects kernel compile with CONFIG_LTO_CLANG.
>> 
>> Signed-off-by: Song Liu 
>> ---
>> kernel/kprobes.c|  6 +-
>> kernel/trace/trace_kprobe.c | 11 ++-
>> 2 files changed, 15 insertions(+), 2 deletions(-)
>> 
>> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
>> index e85de37d9e1e..99102283b076 100644
>> --- a/kernel/kprobes.c
>> +++ b/kernel/kprobes.c
>> @@ -70,7 +70,11 @@ static DEFINE_PER_CPU(struct kprobe *, kprobe_instance);
>> kprobe_opcode_t * __weak kprobe_lookup_name(const char *name,
>> unsigned int __unused)
>> {
>> - return ((kprobe_opcode_t *)(kallsyms_lookup_name(name)));
>> + unsigned long addr = kallsyms_lookup_name(name);
>> +
>> + if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
>> + addr = kallsyms_lookup_name_without_suffix(name);
>> + return ((kprobe_opcode_t *)(addr));
>> }
>> 
>> /*
>> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
>> index 61a6da808203..d2ad0c561c83 100644
>> --- a/kernel/trace/trace_kprobe.c
>> +++ b/kernel/trace/trace_kprobe.c
>> @@ -203,6 +203,10 @@ unsigned long trace_kprobe_address(struct trace_kprobe 
>> *tk)
>> if (tk->symbol) {
>> addr = (unsigned long)
>> kallsyms_lookup_name(trace_kprobe_symbol(tk));
>> +
>> + if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
>> + addr = kallsyms_lookup_name_without_suffix(trace_kprobe_symbol(tk));
>> +
> 
> So you do the lookup twice if this is enabled?
> 
> Why not just use "kallsyms_lookup_name_without_suffix()" the entire time,
> and it should work just the same as "kallsyms_lookup_name()" if it's not
> needed?

We still want to give priority to full match. For example, we have:

[root@~]# grep c_next /proc/kallsyms
81419dc0 t c_next.llvm.7567888411731313343
81680600 t c_next
81854380 t c_next.llvm.14337844803752139461

If the goal is to explicitly trace c_next.llvm.7567888411731313343, the
user can provide the full name. If we always match _without_suffix, all
three of them will match the first one. 
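To spell out the intended lookup semantics with these three entries (a
sketch of the behavior, not actual output):

  kallsyms_lookup_name("c_next")
      -> c_next                               (exact match only)
  kallsyms_lookup_name("c_next.llvm.7567888411731313343")
      -> c_next.llvm.7567888411731313343      (exact match only)
  kallsyms_lookup_name_without_suffix("c_next")
      -> matches all three; the lookup resolves to the first one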

Does this make sense?

Thanks,
Song




Re: [PATCH v2 2/3] kallsyms: Add APIs to match symbol without .XXXX suffix.

2024-08-05 Thread Song Liu
Hi Masami,

On Mon, Aug 5, 2024 at 6:45 AM Masami Hiramatsu  wrote:
>
> On Fri,  2 Aug 2024 14:08:34 -0700
> Song Liu  wrote:
>
> > With CONFIG_LTO_CLANG=y, the compiler may add suffix to function names
> > to avoid duplication. This causes confusion with users of kallsyms.
> > On one hand, users like livepatch are required to match the symbols
> > exactly. On the other hand, users like kprobe would like to match to
> > original function names.
> >
> > Solve this by splitting kallsyms APIs. Specifically, existing APIs now
> > should match the symbols exactly. Add two APIs that match only the part
> > without . suffix. Specifically, the following two APIs are added.
> >
> > 1. kallsyms_lookup_name_without_suffix()
> > 2. kallsyms_on_each_match_symbol_without_suffix()
> >
> > These APIs will be used by kprobe.
> >
> > Also cleanup some code and update kallsyms_selftests accordingly.
> >
> > Signed-off-by: Song Liu 
>
> Looks good to me, but I have a nitpick.
>
>
> > --- a/kernel/kallsyms.c
> > +++ b/kernel/kallsyms.c
> > @@ -164,30 +164,27 @@ static void cleanup_symbol_name(char *s)
> >  {
> >   char *res;
> >
> > - if (!IS_ENABLED(CONFIG_LTO_CLANG))
> > - return;
> > -
> >   /*
> >* LLVM appends various suffixes for local functions and variables 
> > that
> >* must be promoted to global scope as part of LTO.  This can break
> >* hooking of static functions with kprobes. '.' is not a valid
> > -  * character in an identifier in C. Suffixes only in LLVM LTO 
> > observed:
> > -  * - foo.llvm.[0-9a-f]+
> > +  * character in an identifier in C, so we can just remove the
> > +  * suffix.
> >*/
> > - res = strstr(s, ".llvm.");
> > + res = strstr(s, ".");
>
> nit: "strchr(s, '.')" ?
>
> Anyway,
>
> Reviewed-by: Masami Hiramatsu (Google) 

Thanks for your kind review!

If we have another version, I will fold this change (and the one in
kallsyms_selftest.c) in.

Song



[PATCH v2 3/3] tracing/kprobes: Use APIs that matches symbols without .XXX suffix

2024-08-02 Thread Song Liu
Use the new kallsyms APIs that matches symbols name with .XXX
suffix. This allows userspace tools to get kprobes on the expected
function name, while the actual symbol has a .llvm. suffix.

This only affects kernels compiled with CONFIG_LTO_CLANG.

Signed-off-by: Song Liu 
---
 kernel/kprobes.c|  6 +-
 kernel/trace/trace_kprobe.c | 11 ++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index e85de37d9e1e..99102283b076 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -70,7 +70,11 @@ static DEFINE_PER_CPU(struct kprobe *, kprobe_instance);
 kprobe_opcode_t * __weak kprobe_lookup_name(const char *name,
unsigned int __unused)
 {
-   return ((kprobe_opcode_t *)(kallsyms_lookup_name(name)));
+   unsigned long addr = kallsyms_lookup_name(name);
+
+   if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
+   addr = kallsyms_lookup_name_without_suffix(name);
+   return ((kprobe_opcode_t *)(addr));
 }
 
 /*
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 61a6da808203..d2ad0c561c83 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -203,6 +203,10 @@ unsigned long trace_kprobe_address(struct trace_kprobe *tk)
if (tk->symbol) {
addr = (unsigned long)
kallsyms_lookup_name(trace_kprobe_symbol(tk));
+
+   if (IS_ENABLED(CONFIG_LTO_CLANG) && !addr)
+   addr = 
kallsyms_lookup_name_without_suffix(trace_kprobe_symbol(tk));
+
if (addr)
addr += tk->rp.kp.offset;
} else {
@@ -766,8 +770,13 @@ static unsigned int number_of_same_symbols(const char 
*mod, const char *func_nam
 {
struct sym_count_ctx ctx = { .count = 0, .name = func_name };
 
-   if (!mod)
+   if (!mod) {
kallsyms_on_each_match_symbol(count_symbols, func_name, 
&ctx.count);
+   if (IS_ENABLED(CONFIG_LTO_CLANG) && !ctx.count) {
+   kallsyms_on_each_match_symbol_without_suffix(
+   count_symbols, func_name, &ctx.count);
+   }
+   }
 
module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
 
-- 
2.43.0




[PATCH v2 2/3] kallsyms: Add APIs to match symbol without .XXXX suffix.

2024-08-02 Thread Song Liu
With CONFIG_LTO_CLANG=y, the compiler may add suffix to function names
to avoid duplication. This causes confusion with users of kallsyms.
On one hand, users like livepatch are required to match the symbols
exactly. On the other hand, users like kprobe would like to match to
original function names.

Solve this by splitting kallsyms APIs. Specifically, existing APIs now
should match the symbols exactly. Add two APIs that match only the part
without the .XXXX suffix. Specifically, the following two APIs are added.

1. kallsyms_lookup_name_without_suffix()
2. kallsyms_on_each_match_symbol_without_suffix()

These APIs will be used by kprobe.

Also cleanup some code and update kallsyms_selftests accordingly.

Signed-off-by: Song Liu 
---
 include/linux/kallsyms.h   | 14 ++
 kernel/kallsyms.c  | 88 ++
 kernel/kallsyms_selftest.c | 75 +++-
 3 files changed, 128 insertions(+), 49 deletions(-)

diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h
index c3f075e8f60c..9ef2a944a993 100644
--- a/include/linux/kallsyms.h
+++ b/include/linux/kallsyms.h
@@ -74,9 +74,12 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, 
unsigned long),
void *data);
 int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
  const char *name, void *data);
+int kallsyms_on_each_match_symbol_without_suffix(int (*fn)(void *, unsigned 
long),
+const char *name, void *data);
 
 /* Lookup the address for a symbol. Returns 0 if not found. */
 unsigned long kallsyms_lookup_name(const char *name);
+unsigned long kallsyms_lookup_name_without_suffix(const char *name);
 
 extern int kallsyms_lookup_size_offset(unsigned long addr,
  unsigned long *symbolsize,
@@ -104,6 +107,11 @@ static inline unsigned long kallsyms_lookup_name(const 
char *name)
return 0;
 }
 
+static inline unsigned long kallsyms_lookup_name_without_suffix(const char 
*name)
+{
+   return 0;
+}
+
 static inline int kallsyms_lookup_size_offset(unsigned long addr,
  unsigned long *symbolsize,
  unsigned long *offset)
@@ -165,6 +173,12 @@ static inline int kallsyms_on_each_match_symbol(int 
(*fn)(void *, unsigned long)
 {
return -EOPNOTSUPP;
 }
+
+static inline int kallsyms_on_each_match_symbol_without_suffix(int (*fn)(void 
*, unsigned long),
+  const char 
*name, void *data)
+{
+   return -EOPNOTSUPP;
+}
 #endif /*CONFIG_KALLSYMS*/
 
 static inline void print_ip_sym(const char *loglvl, unsigned long ip)
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index fb2c77368d18..64fdff6cde85 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -164,30 +164,27 @@ static void cleanup_symbol_name(char *s)
 {
char *res;
 
-   if (!IS_ENABLED(CONFIG_LTO_CLANG))
-   return;
-
/*
 * LLVM appends various suffixes for local functions and variables that
 * must be promoted to global scope as part of LTO.  This can break
 * hooking of static functions with kprobes. '.' is not a valid
-* character in an identifier in C. Suffixes only in LLVM LTO observed:
-* - foo.llvm.[0-9a-f]+
+* character in an identifier in C, so we can just remove the
+* suffix.
 */
-   res = strstr(s, ".llvm.");
+   res = strstr(s, ".");
if (res)
*res = '\0';
 
return;
 }
 
-static int compare_symbol_name(const char *name, char *namebuf)
+static int compare_symbol_name(const char *name, char *namebuf, bool 
exact_match)
 {
-   /* The kallsyms_seqs_of_names is sorted based on names after
-* cleanup_symbol_name() (see scripts/kallsyms.c) if clang lto is 
enabled.
-* To ensure correct bisection in kallsyms_lookup_names(), do
-* cleanup_symbol_name(namebuf) before comparing name and namebuf.
-*/
+   int ret = strcmp(name, namebuf);
+
+   if (exact_match || !ret)
+   return ret;
+
cleanup_symbol_name(namebuf);
return strcmp(name, namebuf);
 }
@@ -204,13 +201,17 @@ static unsigned int get_symbol_seq(int index)
 
 static int kallsyms_lookup_names(const char *name,
 unsigned int *start,
-unsigned int *end)
+unsigned int *end,
+bool exact_match)
 {
int ret;
int low, mid, high;
unsigned int seq, off;
char namebuf[KSYM_NAME_LEN];
 
+   if (!IS_ENABLED(CONFIG_LTO_CLANG))
+   exact_match = true;
+
low = 0;
high = kallsyms_num_syms - 1;
 
@@ -219,7 +220,7 @@ static int kallsyms_lookup_nam

[PATCH v2 1/3] kallsyms: Do not cleanup .llvm. suffix before sorting symbols

2024-08-02 Thread Song Liu
Cleaning up the symbols causes various issues afterwards. Let's sort
the list based on original name and handle suffix later during the
symbol lookup.

Signed-off-by: Song Liu 
---
 scripts/kallsyms.c  | 31 ++-
 scripts/link-vmlinux.sh |  4 
 2 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 0ed873491bf5..123dab0572f8 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -5,8 +5,7 @@
  * This software may be used and distributed according to the terms
  * of the GNU General Public License, incorporated herein by reference.
  *
- * Usage: kallsyms [--all-symbols] [--absolute-percpu]
- * [--lto-clang] in.map > out.S
+ * Usage: kallsyms [--all-symbols] [--absolute-percpu]  in.map > out.S
  *
  *  Table compression uses all the unused char codes on the symbols and
  *  maps these to the most used substrings (tokens). For instance, it might
@@ -62,7 +61,6 @@ static struct sym_entry **table;
 static unsigned int table_size, table_cnt;
 static int all_symbols;
 static int absolute_percpu;
-static int lto_clang;
 
 static int token_profit[0x1];
 
@@ -73,8 +71,7 @@ static unsigned char best_table_len[256];
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] "
-   "[--lto-clang] in.map > out.S\n");
+   fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] 
in.map > out.S\n");
exit(1);
 }
 
@@ -344,25 +341,6 @@ static bool symbol_absolute(const struct sym_entry *s)
return s->percpu_absolute;
 }
 
-static void cleanup_symbol_name(char *s)
-{
-   char *p;
-
-   /*
-* ASCII[.]   = 2e
-* ASCII[0-9] = 30,39
-* ASCII[A-Z] = 41,5a
-* ASCII[_]   = 5f
-* ASCII[a-z] = 61,7a
-*
-* As above, replacing the first '.' in ".llvm." with '\0' does not
-* affect the main sorting, but it helps us with subsorting.
-*/
-   p = strstr(s, ".llvm.");
-   if (p)
-   *p = '\0';
-}
-
 static int compare_names(const void *a, const void *b)
 {
int ret;
@@ -526,10 +504,6 @@ static void write_src(void)
output_address(relative_base);
printf("\n");
 
-   if (lto_clang)
-   for (i = 0; i < table_cnt; i++)
-   cleanup_symbol_name((char *)table[i]->sym);
-
sort_symbols_by_name();
output_label("kallsyms_seqs_of_names");
for (i = 0; i < table_cnt; i++)
@@ -807,7 +781,6 @@ int main(int argc, char **argv)
static const struct option long_options[] = {
{"all-symbols", no_argument, &all_symbols, 1},
{"absolute-percpu", no_argument, &absolute_percpu, 1},
-   {"lto-clang",   no_argument, <o_clang,   1},
{},
};
 
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index f7b2503cdba9..22d0bc843986 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -156,10 +156,6 @@ kallsyms()
kallsymopt="${kallsymopt} --absolute-percpu"
fi
 
-   if is_enabled CONFIG_LTO_CLANG; then
-   kallsymopt="${kallsymopt} --lto-clang"
-   fi
-
info KSYMS "${2}.S"
scripts/kallsyms ${kallsymopt} "${1}" > "${2}.S"
 
-- 
2.43.0




[PATCH v2 0/3] Fix kallsyms with CONFIG_LTO_CLANG

2024-08-02 Thread Song Liu
With CONFIG_LTO_CLANG, the compiler/linker adds .llvm. suffix to
local symbols to avoid duplications. Existing scripts/kallsyms sorts
symbols without .llvm. suffix. However, this causes quite some
issues later on. Some users of kallsyms, such as livepatch, have to match
symbols exactly; while other users, such as kprobe, would match symbols
without the suffix.

Address this by sorting full symbols at build time, and splitting kallsyms
APIs to explicitly match full symbols or symbols without the suffix.
Specifically, existing APIs will match symbols exactly. Two new APIs are added
that also match symbols with a suffix. Use the new APIs in tracing/kprobes.


Changes v1 => v2:
1. Update the APIs to remove all .XXX suffixes (v1 only removes .llvm.*).
2. Rename the APIs as *_without_suffix. (Masami Hiramatsu)
3. Fix another user from kprobe. (Masami Hiramatsu)
4. Add tests for the new APIs in kallsyms_selftests.

v1: 
https://lore.kernel.org/live-patching/20240730005433.3559731-1-s...@kernel.org/T/#u

Song Liu (3):
  kallsyms: Do not cleanup .llvm. suffix before sorting symbols
  kallsyms: Add APIs to match symbol without .XXXX suffix.
  tracing/kprobes: Use APIs that matches symbols without .XXX suffix

 include/linux/kallsyms.h| 14 ++
 kernel/kallsyms.c   | 88 +
 kernel/kallsyms_selftest.c  | 75 ++-
 kernel/kprobes.c|  6 ++-
 kernel/trace/trace_kprobe.c | 11 -
 scripts/kallsyms.c  | 31 +
 scripts/link-vmlinux.sh |  4 --
 7 files changed, 145 insertions(+), 84 deletions(-)

--
2.43.0



Re: [PATCH 2/3] kallsyms: Add APIs to match symbol without .llmv. suffix.

2024-08-02 Thread Song Liu


> On Jul 29, 2024, at 5:54 PM, Song Liu  wrote:
> 
> With CONFIG_LTO_CLANG=y, the compiler may add suffix to function names
> to avoid duplication. This causes confusion with users of kallsyms.
> On one hand, users like livepatch are required to match the symbols
> exactly. On the other hand, users like kprobe would like to match to
> original function names.
> 
> Solve this by splitting kallsyms APIs. Specifically, existing APIs now
> should match the symbols exactly. Add two APIs that matches the full
> symbol, or only the part without .llvm.suffix. Specifically, the following
> two APIs are added:
> 
> 1. kallsyms_lookup_name_or_prefix()
> 2. kallsyms_on_each_match_symbol_or_prefix()
> 
> These APIs will be used by kprobe.
> 
> Also cleanup some code and adjust kallsyms_selftests accordingly.
> 
> Signed-off-by: Song Liu 

Actually, if we only remove .llvm. suffix, but keep other .XXX
suffix, the *_without_suffix APIs will have the same issue Yonghong 
tried to fix in commit 33f0467fe06934d5e4ea6e24ce2b9c65ce618e26: 
binary search with symbols - .llvm. suffix is not correct. 
(Please see the commit log of 33f0467fe06934d5e4ea6e24ce2b9c65ce618e26 
for more details.)

I am updating the code to remove all .XXX suffix. This design will 
not have this issue. 

Thanks,
Song



Re: [PATCH 2/3] kallsyms: Add APIs to match symbol without .llmv. suffix.

2024-08-02 Thread Song Liu
Hi Petr, 

> On Aug 2, 2024, at 8:45 AM, Petr Mladek  wrote:

[...]

> 
> IMHO, it depends on the use case. Let's keep "ping_table/"
> as an example. Why people would want to trace this function?
> There might be various reasons, for example:
> 
>  1. ping_table.llvm.15394922576589127018 appeared in a backtrace
> 
>  2. ping_table.llvm.15394922576589127018 appeared in a histogram
> 
>  3. ping_table looks interesting when reading code sources
> 
>  4. ping_table need to be monitored on all systems because
> of security/performance.
> 
> The full name "ping_table.llvm.15394922576589127018" is perfectly
> fine in the 1st and 2nd scenario. People knew this name already
> before they start thinking about tracing.
> 
> The short name is more practical in 3rd and 4th scenario. Especially,
> when there is only one static symbol with this short name. Otherwise,
> the user would need an extra step to find the full name.
> 
> The full name is even more problematic for system monitors. These
> applications might need to probe particular symbols. They might
> have hard times when the symbol is:
> 
>.
> 
> They will have to deal with it. But it means that every such tool
> would need an extra (non-trivial) code for this. Every tool would
> try its own approach => a lot of problems.
> 
> IMHO, the two APIs could make the life easier.
> 
> Well, even kprobe might need two APIs to allow probing by
> full name or without the suffix.

The problem is, with potential partial inlining by modern compilers, 
tracing "symbol name from sources" is not accurate. In our production 
kernels, we have to add some explicit "noline" to make sure we can 
trace these functions reliably. 
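For example, something like this (a made-up helper, just to show the kind
of annotation we add):

	/* keep an out-of-line copy so kprobes/ftrace can attach to it,
	 * even when the compiler would otherwise inline it into callers */
	static noinline u64 csum_step(u64 sum, u64 word)
	{
		return sum + word;
	}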

Of course, this issue exists without random suffix: any function 
could be partially inlined. However, allowing tracing without the 
suffix seems to hint that tracing with "symbol name from sources" 
is valid, which is not really true. 

At the moment, I have no objections to keep the _without_suffix
APIs. But for long term, I still think we need to set clear 
expectations for the users: tracing symbols from sources is not
reliable. 

Thanks,
Song



Re: [PATCH 2/3] kallsyms: Add APIs to match symbol without .llmv. suffix.

2024-08-01 Thread Song Liu


> On Aug 1, 2024, at 6:18 PM, Leizhen (ThunderTown) 
>  wrote:
> 
> On 2024/7/31 9:00, Song Liu wrote:
>> Hi Masami, 
>> 
>>> On Jul 30, 2024, at 6:03 AM, Masami Hiramatsu  wrote:
>>> 
>>> On Mon, 29 Jul 2024 17:54:32 -0700
>>> Song Liu  wrote:
>>> 
>>>> With CONFIG_LTO_CLANG=y, the compiler may add suffix to function names
>>>> to avoid duplication. This causes confusion with users of kallsyms.
>>>> On one hand, users like livepatch are required to match the symbols
>>>> exactly. On the other hand, users like kprobe would like to match to
>>>> original function names.
>>>> 
>>>> Solve this by splitting kallsyms APIs. Specifically, existing APIs now
>>>> should match the symbols exactly. Add two APIs that matches the full
>>>> symbol, or only the part without .llvm.suffix. Specifically, the following
>>>> two APIs are added:
>>>> 
>>>> 1. kallsyms_lookup_name_or_prefix()
>>>> 2. kallsyms_on_each_match_symbol_or_prefix()
>>> 
>>> Since this API only removes the suffix, "match prefix" is a bit confusing.
>>> (this sounds like matching "foo" with "foo" and "foo_bar", but in reality,
>>> it only matches "foo" and "foo.llvm.*")
>>> What about the name below?
>>> 
>>> kallsyms_lookup_name_without_suffix()
>>> kallsyms_on_each_match_symbol_without_suffix()
>> 
>> I am open to name suggestions. I named it as xx or prefix to highlight
>> that these two APIs will try match full name first, and they only match
>> the symbol without suffix when there is no full name match. 
>> 
>> Maybe we can call them: 
>> - kallsyms_lookup_name_or_without_suffix()
>> - kallsyms_on_each_match_symbol_or_without_suffix()
>> 
>> Again, I am open to any name selections here.
> 
> Only static functions have suffixes. In my opinion, explicitly marking static
> might be a little clearer.
> kallsyms_lookup_static_name()
> kallsyms_on_each_match_static_symbol()

While these names are shorter, I think they are more confusing. Not all
folks know that only static functions can have suffixes. 

Maybe we should not hide the "try match full name first first" in the
API, and let the users handle it. Then, we can safely call the new APIs
*_without_suffix(), as Masami suggested. 

If there is no objections, I will send v2 based on this direction. 

Thanks,
Song



Re: [PATCH 3/3] tracing/kprobes: Use APIs that matches symbols with .llvm. suffix

2024-07-30 Thread Song Liu
Hi Masami,

> On Jul 30, 2024, at 6:04 AM, Masami Hiramatsu  wrote:
> 
> On Mon, 29 Jul 2024 17:54:33 -0700
> Song Liu  wrote:
> 
>> Use the new kallsyms APIs that matches symbols name with .llvm.
>> suffix. This allows userspace tools to get kprobes on the expected
>> function name, while the actual symbol has a .llvm. suffix.
>> 
> 
> _kprobe_addr@kernel/kprobes.c may also fail with this change.
> 

Thanks for catching this! I will fix this in the next version. 

Song






Re: [PATCH 2/3] kallsyms: Add APIs to match symbol without .llmv. suffix.

2024-07-30 Thread Song Liu
Hi Masami, 

> On Jul 30, 2024, at 6:03 AM, Masami Hiramatsu  wrote:
> 
> On Mon, 29 Jul 2024 17:54:32 -0700
> Song Liu  wrote:
> 
>> With CONFIG_LTO_CLANG=y, the compiler may add suffix to function names
>> to avoid duplication. This causes confusion with users of kallsyms.
>> On one hand, users like livepatch are required to match the symbols
>> exactly. On the other hand, users like kprobe would like to match to
>> original function names.
>> 
>> Solve this by splitting kallsyms APIs. Specifically, existing APIs now
>> should match the symbols exactly. Add two APIs that matches the full
>> symbol, or only the part without .llvm.suffix. Specifically, the following
>> two APIs are added:
>> 
>> 1. kallsyms_lookup_name_or_prefix()
>> 2. kallsyms_on_each_match_symbol_or_prefix()
> 
> Since this API only removes the suffix, "match prefix" is a bit confusing.
> (this sounds like matching "foo" with "foo" and "foo_bar", but in reality,
> it only matches "foo" and "foo.llvm.*")
> What about the name below?
> 
> kallsyms_lookup_name_without_suffix()
> kallsyms_on_each_match_symbol_without_suffix()

I am open to name suggestions. I named it as *_or_prefix to highlight
that these two APIs will try to match the full name first, and they only match
the symbol without the suffix when there is no full name match. 

Maybe we can call them: 
- kallsyms_lookup_name_or_without_suffix()
- kallsyms_on_each_match_symbol_or_without_suffix()

Again, I am open to any name selections here. 

> 
>> 
>> These APIs will be used by kprobe.
> 
> No other user need this?

AFAICT, kprobe is the only use case here. Sami, please correct 
me if I missed any users. 


More thoughts on this: 

I actually hope we don't need these two new APIs, as they are 
confusing. Modern compilers can do many things to the code 
(inlining, etc.). So when we are tracing a function, we are not 
really tracing "function in the source code". Instead, we are 
tracing "function in the binary". If a function is inlined, it 
will not show up in the binary. If a function is _partially_ 
inlined (inlined by some callers, but not by others), it will 
show up in the binary, but we won't be tracing it as it appears
in the source code. Therefore, tracing functions by their names 
in the source code only works under certain assumptions. And 
these assumptions may not hold with modern compilers. Ideally, 
I think we cannot promise the user can use name "ping_table" to
trace function "ping_table.llvm.15394922576589127018"

Does this make sense?

Thanks,
Song


[...]

Re: [PATCH 0/3] Fix kallsyms with CONFIG_LTO_CLANG

2024-07-29 Thread Song Liu
On Mon, Jul 29, 2024 at 5:54 PM Song Liu  wrote:
>
> With CONFIG_LTO_CLANG, the compiler/linker adds .llvm. suffix to
> local symbols to avoid duplications. Existing scripts/kallsyms sorts
> symbols without .llvm. suffix. However, this causes quite some
> issues later on. Some users of kallsyms, such as livepatch, have to match
> symbols exactly; while other users, such as kprobe, would match symbols
> without the suffix.
>
> Address this by sorting full symbols (with .llvm.) at build time, and
> split kallsyms APIs to explicitly match full symbols or without suffix.
> Specifically, exiting APIs will match symbols exactly. Two new APIs are
> added to match symbols with suffix. Use the new APIs in tracing/kprobes.

Forgot to mention: This is to follow up the discussions in this thread:

https://lore.kernel.org/live-patching/20240605032120.3179157-1-s...@kernel.org/T/#u

Thanks,
Song



[PATCH 3/3] tracing/kprobes: Use APIs that matches symbols with .llvm. suffix

2024-07-29 Thread Song Liu
Use the new kallsyms APIs that matches symbols name with .llvm.
suffix. This allows userspace tools to get kprobes on the expected
function name, while the actual symbol has a .llvm. suffix.

This only affects kernels compiled with CONFIG_LTO_CLANG.

Signed-off-by: Song Liu 
---
 kernel/trace/trace_kprobe.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 61a6da808203..c319382c1a09 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -202,7 +202,8 @@ unsigned long trace_kprobe_address(struct trace_kprobe *tk)
 
if (tk->symbol) {
addr = (unsigned long)
-   kallsyms_lookup_name(trace_kprobe_symbol(tk));
+   kallsyms_lookup_name_or_prefix(trace_kprobe_symbol(tk));
+
if (addr)
addr += tk->rp.kp.offset;
} else {
@@ -766,8 +767,13 @@ static unsigned int number_of_same_symbols(const char 
*mod, const char *func_nam
 {
struct sym_count_ctx ctx = { .count = 0, .name = func_name };
 
-   if (!mod)
+   if (!mod) {
kallsyms_on_each_match_symbol(count_symbols, func_name, 
&ctx.count);
+   if (IS_ENABLED(CONFIG_LTO_CLANG) && !ctx.count) {
+   kallsyms_on_each_match_symbol_or_prefix(
+   count_symbols, func_name, &ctx.count);
+   }
+   }
 
module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
 
-- 
2.43.0




[PATCH 2/3] kallsyms: Add APIs to match symbol without .llmv. suffix.

2024-07-29 Thread Song Liu
With CONFIG_LTO_CLANG=y, the compiler may add suffix to function names
to avoid duplication. This causes confusion with users of kallsyms.
On one hand, users like livepatch are required to match the symbols
exactly. On the other hand, users like kprobe would like to match to
original function names.

Solve this by splitting kallsyms APIs. Specifically, existing APIs now
should match the symbols exactly. Add two APIs that matches the full
symbol, or only the part without .llvm.suffix. Specifically, the following
two APIs are added:

1. kallsyms_lookup_name_or_prefix()
2. kallsyms_on_each_match_symbol_or_prefix()

These APIs will be used by kprobe.

Also cleanup some code and adjust kallsyms_selftests accordingly.

Signed-off-by: Song Liu 
---
 include/linux/kallsyms.h   | 14 +++
 kernel/kallsyms.c  | 83 ++
 kernel/kallsyms_selftest.c | 22 +-
 3 files changed, 73 insertions(+), 46 deletions(-)

diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h
index c3f075e8f60c..09b2d2099107 100644
--- a/include/linux/kallsyms.h
+++ b/include/linux/kallsyms.h
@@ -74,9 +74,12 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, 
unsigned long),
void *data);
 int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
  const char *name, void *data);
+int kallsyms_on_each_match_symbol_or_prefix(int (*fn)(void *, unsigned long),
+   const char *name, void *data);
 
 /* Lookup the address for a symbol. Returns 0 if not found. */
 unsigned long kallsyms_lookup_name(const char *name);
+unsigned long kallsyms_lookup_name_or_prefix(const char *name);
 
 extern int kallsyms_lookup_size_offset(unsigned long addr,
  unsigned long *symbolsize,
@@ -104,6 +107,11 @@ static inline unsigned long kallsyms_lookup_name(const 
char *name)
return 0;
 }
 
+static inline unsigned long kallsyms_lookup_name_or_prefix(const char *name)
+{
+   return 0;
+}
+
 static inline int kallsyms_lookup_size_offset(unsigned long addr,
  unsigned long *symbolsize,
  unsigned long *offset)
@@ -165,6 +173,12 @@ static inline int kallsyms_on_each_match_symbol(int 
(*fn)(void *, unsigned long)
 {
return -EOPNOTSUPP;
 }
+
+static inline int kallsyms_on_each_match_symbol_or_prefix(int (*fn)(void *, 
unsigned long),
+ const char *name, 
void *data)
+{
+   return -EOPNOTSUPP;
+}
 #endif /*CONFIG_KALLSYMS*/
 
 static inline void print_ip_sym(const char *loglvl, unsigned long ip)
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index fb2c77368d18..4285dd85d814 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -164,9 +164,6 @@ static void cleanup_symbol_name(char *s)
 {
char *res;
 
-   if (!IS_ENABLED(CONFIG_LTO_CLANG))
-   return;
-
/*
 * LLVM appends various suffixes for local functions and variables that
 * must be promoted to global scope as part of LTO.  This can break
@@ -181,13 +178,13 @@ static void cleanup_symbol_name(char *s)
return;
 }
 
-static int compare_symbol_name(const char *name, char *namebuf)
+static int compare_symbol_name(const char *name, char *namebuf, bool 
exact_match)
 {
-   /* The kallsyms_seqs_of_names is sorted based on names after
-* cleanup_symbol_name() (see scripts/kallsyms.c) if clang lto is 
enabled.
-* To ensure correct bisection in kallsyms_lookup_names(), do
-* cleanup_symbol_name(namebuf) before comparing name and namebuf.
-*/
+   int ret = strcmp(name, namebuf);
+
+   if (exact_match || !ret)
+   return ret;
+
cleanup_symbol_name(namebuf);
return strcmp(name, namebuf);
 }
@@ -204,13 +201,17 @@ static unsigned int get_symbol_seq(int index)
 
 static int kallsyms_lookup_names(const char *name,
 unsigned int *start,
-unsigned int *end)
+unsigned int *end,
+bool exact_match)
 {
int ret;
int low, mid, high;
unsigned int seq, off;
char namebuf[KSYM_NAME_LEN];
 
+   if (!IS_ENABLED(CONFIG_LTO_CLANG))
+   exact_match = true;
+
low = 0;
high = kallsyms_num_syms - 1;
 
@@ -219,7 +220,7 @@ static int kallsyms_lookup_names(const char *name,
seq = get_symbol_seq(mid);
off = get_symbol_offset(seq);
kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
-   ret = compare_symbol_name(name, namebuf);
+   ret = compare_symbol_name(name, namebuf, exact_match);
if (ret > 0)
low = mid + 1;
else if (ret < 0)
@@ 

[PATCH 1/3] kallsyms: Do not cleanup .llvm. suffix before sorting symbols

2024-07-29 Thread Song Liu
Cleaning up the symbols causes various issues afterwards. Let's sort
the list based on original name and handle suffix later during the
symbol lookup.

Signed-off-by: Song Liu 
---
 scripts/kallsyms.c  | 31 ++-
 scripts/link-vmlinux.sh |  4 
 2 files changed, 2 insertions(+), 33 deletions(-)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 0ed873491bf5..123dab0572f8 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -5,8 +5,7 @@
  * This software may be used and distributed according to the terms
  * of the GNU General Public License, incorporated herein by reference.
  *
- * Usage: kallsyms [--all-symbols] [--absolute-percpu]
- * [--lto-clang] in.map > out.S
+ * Usage: kallsyms [--all-symbols] [--absolute-percpu]  in.map > out.S
  *
  *  Table compression uses all the unused char codes on the symbols and
  *  maps these to the most used substrings (tokens). For instance, it might
@@ -62,7 +61,6 @@ static struct sym_entry **table;
 static unsigned int table_size, table_cnt;
 static int all_symbols;
 static int absolute_percpu;
-static int lto_clang;
 
 static int token_profit[0x1];
 
@@ -73,8 +71,7 @@ static unsigned char best_table_len[256];
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] "
-   "[--lto-clang] in.map > out.S\n");
+   fprintf(stderr, "Usage: kallsyms [--all-symbols] [--absolute-percpu] 
in.map > out.S\n");
exit(1);
 }
 
@@ -344,25 +341,6 @@ static bool symbol_absolute(const struct sym_entry *s)
return s->percpu_absolute;
 }
 
-static void cleanup_symbol_name(char *s)
-{
-   char *p;
-
-   /*
-* ASCII[.]   = 2e
-* ASCII[0-9] = 30,39
-* ASCII[A-Z] = 41,5a
-* ASCII[_]   = 5f
-* ASCII[a-z] = 61,7a
-*
-* As above, replacing the first '.' in ".llvm." with '\0' does not
-* affect the main sorting, but it helps us with subsorting.
-*/
-   p = strstr(s, ".llvm.");
-   if (p)
-   *p = '\0';
-}
-
 static int compare_names(const void *a, const void *b)
 {
int ret;
@@ -526,10 +504,6 @@ static void write_src(void)
output_address(relative_base);
printf("\n");
 
-   if (lto_clang)
-   for (i = 0; i < table_cnt; i++)
-   cleanup_symbol_name((char *)table[i]->sym);
-
sort_symbols_by_name();
output_label("kallsyms_seqs_of_names");
for (i = 0; i < table_cnt; i++)
@@ -807,7 +781,6 @@ int main(int argc, char **argv)
static const struct option long_options[] = {
{"all-symbols", no_argument, &all_symbols, 1},
{"absolute-percpu", no_argument, &absolute_percpu, 1},
-   {"lto-clang",   no_argument, <o_clang,   1},
{},
};
 
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index f7b2503cdba9..22d0bc843986 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -156,10 +156,6 @@ kallsyms()
kallsymopt="${kallsymopt} --absolute-percpu"
fi
 
-   if is_enabled CONFIG_LTO_CLANG; then
-   kallsymopt="${kallsymopt} --lto-clang"
-   fi
-
info KSYMS "${2}.S"
scripts/kallsyms ${kallsymopt} "${1}" > "${2}.S"
 
-- 
2.43.0




[PATCH 0/3] Fix kallsyms with CONFIG_LTO_CLANG

2024-07-29 Thread Song Liu
With CONFIG_LTO_CLANG, the compiler/linker adds .llvm. suffix to
local symbols to avoid duplications. Existing scripts/kallsyms sorts
symbols without .llvm. suffix. However, this causes quite some
issues later on. Some users of kallsyms, such as livepatch, have to match
symbols exactly; while other users, such as kprobe, would match symbols
without the suffix.

Address this by sorting full symbols (with .llvm.) at build time, and
splitting kallsyms APIs to explicitly match full symbols or symbols without
the suffix. Specifically, existing APIs will match symbols exactly. Two new
APIs are added that also match symbols with a suffix. Use the new APIs in
tracing/kprobes.

Song Liu (3):
  kallsyms: Do not cleanup .llvm. suffix before sorting symbols
  kallsyms: Add APIs to match symbol without .llmv. suffix.
  tracing/kprobes: Use APIs that matches symbols with .llvm.
suffix

 include/linux/kallsyms.h| 14 +++
 kernel/kallsyms.c   | 83 ++---
 kernel/kallsyms_selftest.c  | 22 +-
 kernel/trace/trace_kprobe.c | 10 -
 scripts/kallsyms.c  | 30 +-
 scripts/link-vmlinux.sh |  4 --
 6 files changed, 83 insertions(+), 80 deletions(-)

--
2.43.0



Re: [PATCH] kallsyms, livepatch: Fix livepatch with CONFIG_LTO_CLANG

2024-07-09 Thread Song Liu

> On Jul 9, 2024, at 8:07 AM, Sami Tolvanen  wrote:

[...]

> 
>>> I am a bit scared because using hashed symbol names in backtraces, gdb,
>>> ... would be a nightmare. Hashes are not human readable and
>>> they would complicate the life a lot. And using different names
>>> in different interfaces would complicate the life either.
>> 
>> All great points.
>> 
>> The scope of the Rust issue is self contained to modversion_info,
>> whereas for CONFIG_LTO_CLANG issue commit 8b8e6b5d3b013b0
>> ("kallsyms: strip ThinLTO hashes from static functions") describes
>> the issue with userspace tools (it doesn't explain which ones)
>> which don't expect the function name to change. This seems to happen
>> to static routines so I can only suspect this isn't an issue with
>> modversioning as the only symbols that would be used there wouldn't be
>> static.
>> 
>> Sami, what was the exact userspace issue with CONFIG_LTO_CLANG and these
>> long symbols?
> 
> The issue with LTO wasn't symbol length. IIRC the compiler renaming
> symbols with ThinLTO caused issues for folks using dynamic kprobes,
> and I seem to recall it also breaking systrace in Android, at which
> point we decided to strip the postfix in kallsyms to avoid breaking
> anything else.

Trying to understand all the requirements and constraints. IIUC, we
can mostly agree: 

(1) A way to match a symbol exactly is crucial for users like live 
patching. 
(2) Original symbol name is useful for backtrace, etc. (IOW hash 
alone is not enough)

With these two requirements/constraints, we need 

   original symbol name + something 

for duplicate symbols. "Something" here could be a path name 
(xxx_driver_xxx_yyy_c), or a hash, or sympos. 
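
For illustration, with the existing livepatch API, "original name +
position" is what sympos already gives us today (the function names below
are made up):

	static struct klp_func funcs[] = {
		{
			.old_name = "c_next",
			.old_sympos = 2,	/* the 2nd "c_next" in kallsyms order */
			.new_func = livepatch_c_next,
		},
		{ }
	};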

At the moment, (1) is not met with CONFIG_LTO_CLANG. The original
patch tries to fix this, but the solution does not seem optimal. I will 
send another version that allows kallsyms to match exactly or without
the suffix. 

This work shouldn't cause any problem for Rust, as Rust also needs 
original symbol name + "something". If we finally decide "something" 
should be some form of hash, we can change all users (live patch, 
etc.) to use the hash, which might be better than sympos. Note: I am
not trying to say "something" should be a hash.  

OTOH, there is also an open question: shall we allow tracing with
only the original symbol name (without specifying _something_)? I think
this is a separate question and we don't have to answer it here. 

Does this make sense?

Thanks,
Song



Re: [PATCH] livepatch: introduce klp_func called interface

2024-06-07 Thread Song Liu
Hi Miroslav,

On Fri, Jun 7, 2024 at 2:07 AM Miroslav Benes  wrote:
>
> Hi,
>
> On Tue, 4 Jun 2024, Song Liu wrote:
>
> > On Tue, May 21, 2024 at 1:04 AM Petr Mladek  wrote:
> > [...]
> > > >
> > > > Yes, but the information you get is limited compared to what is 
> > > > available
> > > > now. You would obtain the information that a patched function was called
> > > > but ftrace could also give you the context and more.
> > >
> > > Another motivation to use ftrace for testing is that it does not
> > > affect the performance in production.
> > >
> > > We should keep klp_ftrace_handler() as fast as possible so that we
> > > could livepatch also performance sensitive functions.
> >
> > At LPC last year, we discussed about adding a counter to each
> > klp_func, like:
> >
> > struct klp_func {
> > ...
> > u64 __percpu counter;
> > ...
> > };
> >
> > With some static_key (+ sysctl), this should give us a way to estimate
> > the overhead of livepatch. If we have the counter, this patch is not
> > needed any more. Does this (adding the counter) sound like
> > something we still want to pursue?
>
> It would be better than this patch but given what was mentioned in the
> thread I wonder if it is possible to use ftrace even for this. See
> /sys/kernel/tracing/trace_stat/function*. It already gathers the number of
> hits.

I didn't know about the trace_stat API until today. :) It somehow doesn't
exist on some older kernels. (I haven't debugged it.)

> Would it be sufficient for you? I guess it depends on what the intention
> is. If there is no time limit, klp_func.counter might be better to provide
> some kind of overall statistics (but I am not sure if it has any value)
> and to avoid having ftrace registered on a live patched function for
> infinite period of time. If the intention is to gather data for some
> limited period, trace_stat sounds like much better approach to me.

We don't have very urgent use for this. As we discussed, various tracing
tools are sufficient in most cases. I brought this up in the context of the
"called" entry: if we are really adding a new entry, let's do "counter"
instead of "called".

Thanks,
Song



Re: [PATCH] kallsyms, livepatch: Fix livepatch with CONFIG_LTO_CLANG

2024-06-07 Thread Song Liu
Hi Miroslav,

Thanks for reviewing the patch!

On Fri, Jun 7, 2024 at 6:06 AM Miroslav Benes  wrote:
>
> Hi,
>
> On Tue, 4 Jun 2024, Song Liu wrote:
>
> > With CONFIG_LTO_CLANG, the compiler may postfix symbols with .llvm.
> > to avoid symbol duplication. scripts/kallsyms.c sorted the symbols
> > without these postfixes. The default symbol lookup also removes these
> > postfixes before comparing symbols.
> >
> > On the other hand, livepatch need to look up symbols with the full names.
> > However, calling kallsyms_on_each_match_symbol with full name (with the
> > postfix) cannot find the symbol(s). As a result, we cannot livepatch
> > kernel functions with .llvm. postfix or kernel functions that use
> > relocation information to symbols with .llvm. postfixes.
> >
> > Fix this by calling kallsyms_on_each_match_symbol without the postfix;
> > and then match the full name (with postfix) in klp_match_callback.
> >
> > Signed-off-by: Song Liu 
> > ---
> >  include/linux/kallsyms.h | 13 +
> >  kernel/kallsyms.c| 21 -
> >  kernel/livepatch/core.c  | 32 +++-
> >  3 files changed, 60 insertions(+), 6 deletions(-)
>
> I do not like much that something which seems to be kallsyms-internal is
> leaked out. You need to export cleanup_symbol_name() and there is now a
> lot of code outside. I would feel much more comfortable if it is all
> hidden from kallsyms users and kept there. Would it be possible?

I think it is possible. Currently, kallsyms_on_each_match_symbol matches
symbols without the postfix. We can add a variation or a parameter, so
that it matches the full name with the postfix.
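
Something along these lines, as a rough sketch (not a final signature):

	/* either a new parameter ... */
	int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
					  const char *name, void *data,
					  bool match_full_name);

	/* ... or a separate variant that does not strip the postfix */
	int kallsyms_on_each_match_full_symbol(int (*fn)(void *, unsigned long),
					       const char *name, void *data);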

> Moreover, isn't there a similar problem for ftrace, kprobes, ebpf,...?

Yes, there is a similar problem with tracing use cases. But the requirements
are not the same:

For livepatch, we have to point to the exact symbol we want to patch or
relocate to. We have the sympos API defined to differentiate between symbols
with the same name.

For tracing, some discrepancy is acceptable. AFAICT, there isn't an API
similar to sympos yet. Also, we can play some tricks with tracing. For
example, we can use "uniq symbol + offset" to point a kprobe to one of
the duplicated symbols.

Given that livepatch has a well-defined API, while the APIs on the tracing
side may still change, we can change kallsyms to make sure the livepatch side
works. Work on the tracing side can wait.

Thanks,
Song



Re: [PATCH] livepatch: introduce klp_func called interface

2024-06-04 Thread Song Liu
On Tue, May 21, 2024 at 1:04 AM Petr Mladek  wrote:
[...]
> >
> > Yes, but the information you get is limited compared to what is available
> > now. You would obtain the information that a patched function was called
> > but ftrace could also give you the context and more.
>
> Another motivation to use ftrace for testing is that it does not
> affect the performance in production.
>
> We should keep klp_ftrace_handler() as fast as possible so that we
> could livepatch also performance sensitive functions.

At LPC last year, we discussed about adding a counter to each
klp_func, like:

struct klp_func {
...
u64 __percpu counter;
...
};

With some static_key (+ sysctl), this should give us a way to estimate
the overhead of livepatch. If we have the counter, this patch is not
needed any more. Does this (adding the counter) sound like
something we still want to pursue?
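
A rough sketch of how the counter could be consumed in the ftrace handler
(assuming the counter above is a percpu pointer allocated with
alloc_percpu(), and a new static key guards the accounting; this is not an
actual patch):

	DEFINE_STATIC_KEY_FALSE(klp_count_calls);

	/* in klp_ftrace_handler(), once the active func is known: */
	if (static_branch_unlikely(&klp_count_calls))
		this_cpu_inc(*func->counter);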

Thanks,
Song



[PATCH] kallsyms, livepatch: Fix livepatch with CONFIG_LTO_CLANG

2024-06-04 Thread Song Liu
With CONFIG_LTO_CLANG, the compiler may postfix symbols with .llvm.
to avoid symbol duplication. scripts/kallsyms.c sorted the symbols
without these postfixes. The default symbol lookup also removes these
postfixes before comparing symbols.

On the other hand, livepatch needs to look up symbols with the full names.
However, calling kallsyms_on_each_match_symbol with full name (with the
postfix) cannot find the symbol(s). As a result, we cannot livepatch
kernel functions with .llvm. postfix or kernel functions that use
relocation information to symbols with .llvm. postfixes.

Fix this by calling kallsyms_on_each_match_symbol without the postfix;
and then match the full name (with postfix) in klp_match_callback.

Signed-off-by: Song Liu 
---
 include/linux/kallsyms.h | 13 +
 kernel/kallsyms.c| 21 -
 kernel/livepatch/core.c  | 32 +++-
 3 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h
index c3f075e8f60c..d7d07a134ae4 100644
--- a/include/linux/kallsyms.h
+++ b/include/linux/kallsyms.h
@@ -97,6 +97,10 @@ extern int sprint_backtrace_build_id(char *buffer, unsigned 
long address);
 
 int lookup_symbol_name(unsigned long addr, char *symname);
 
+int kallsyms_lookup_symbol_full_name(unsigned long addr, char *symname);
+
+void kallsyms_cleanup_symbol_name(char *s);
+
 #else /* !CONFIG_KALLSYMS */
 
 static inline unsigned long kallsyms_lookup_name(const char *name)
@@ -165,6 +169,15 @@ static inline int kallsyms_on_each_match_symbol(int 
(*fn)(void *, unsigned long)
 {
return -EOPNOTSUPP;
 }
+
+static inline int kallsyms_lookup_symbol_full_name(unsigned long addr, char 
*symname)
+{
+   return -ERANGE;
+}
+
+static inline void kallsyms_cleanup_symbol_name(char *s)
+{
+}
 #endif /*CONFIG_KALLSYMS*/
 
 static inline void print_ip_sym(const char *loglvl, unsigned long ip)
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 22ea19a36e6e..f0d04fa1bbb4 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -163,7 +163,7 @@ unsigned long kallsyms_sym_address(int idx)
return kallsyms_relative_base - 1 - kallsyms_offsets[idx];
 }
 
-static void cleanup_symbol_name(char *s)
+void kallsyms_cleanup_symbol_name(char *s)
 {
char *res;
 
@@ -191,7 +191,7 @@ static int compare_symbol_name(const char *name, char 
*namebuf)
 * To ensure correct bisection in kallsyms_lookup_names(), do
 * cleanup_symbol_name(namebuf) before comparing name and namebuf.
 */
-   cleanup_symbol_name(namebuf);
+   kallsyms_cleanup_symbol_name(namebuf);
return strcmp(name, namebuf);
 }
 
@@ -426,7 +426,7 @@ static const char *kallsyms_lookup_buildid(unsigned long 
addr,
offset, modname, namebuf);
 
 found:
-   cleanup_symbol_name(namebuf);
+   kallsyms_cleanup_symbol_name(namebuf);
return ret;
 }
 
@@ -446,7 +446,7 @@ const char *kallsyms_lookup(unsigned long addr,
   NULL, namebuf);
 }
 
-int lookup_symbol_name(unsigned long addr, char *symname)
+static int __lookup_symbol_name(unsigned long addr, char *symname, bool 
cleanup)
 {
int res;
 
@@ -468,10 +468,21 @@ int lookup_symbol_name(unsigned long addr, char *symname)
return res;
 
 found:
-   cleanup_symbol_name(symname);
+   if (cleanup)
+   kallsyms_cleanup_symbol_name(symname);
return 0;
 }
 
+int lookup_symbol_name(unsigned long addr, char *symname)
+{
+   return __lookup_symbol_name(addr, symname, true);
+}
+
+int kallsyms_lookup_symbol_full_name(unsigned long addr, char *symname)
+{
+   return __lookup_symbol_name(addr, symname, false);
+}
+
 /* Look up a kernel symbol and return it in a text buffer. */
 static int __sprint_symbol(char *buffer, unsigned long address,
   int symbol_offset, int add_offset, int add_buildid)
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 52426665eecc..284220e04801 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -128,6 +128,19 @@ struct klp_find_arg {
 static int klp_match_callback(void *data, unsigned long addr)
 {
struct klp_find_arg *args = data;
+#ifdef CONFIG_LTO_CLANG
+   char full_name[KSYM_NAME_LEN];
+
+   /*
+* With CONFIG_LTO_CLANG, we need to compare the full name of the
+* symbol (with .llvm. postfix).
+*/
+   if (kallsyms_lookup_symbol_full_name(addr, full_name))
+   return 0;
+
+   if (strcmp(args->name, full_name))
+   return 0;
+#endif
 
args->addr = addr;
args->count++;
@@ -145,10 +158,12 @@ static int klp_match_callback(void *data, unsigned long 
addr)
 
 static int klp_find_callback(void *data, const char *name, unsigned long addr)
 {
+#ifndef CONFIG_LTO_CLANG
struct klp_find_arg *args = data;
 
if (strcmp(a

Re: [PATCH v6 08/16] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Extend execmem parameters to accommodate more complex overrides of
> module_alloc() by architectures.
>
> This includes specification of a fallback range required by arm, arm64
> and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
> allocation of KASAN shadow required by s390 and x86 and support for
> late initialization of execmem required by arm64.
>
> The core implementation of execmem_alloc() takes care of suppressing
> warnings when the initial allocation fails but there is a fallback range
> defined.
>
> Signed-off-by: Mike Rapoport (IBM) 
> Acked-by: Will Deacon 

nit: We should probably move the logic for ARCH_WANTS_EXECMEM_LATE
to a separate patch.

Otherwise,

Acked-by: Song Liu 



Re: [PATCH v6 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Several architectures override module_alloc() only to define address
> range for code allocations different than VMALLOC address space.
>
> Provide a generic implementation in execmem that uses the parameters for
> address space ranges, required alignment and page protections provided
> by architectures.
>
> The architectures must fill execmem_info structure and implement
> execmem_arch_setup() that returns a pointer to that structure. This way the
> execmem initialization won't be called from every architecture, but rather
> from a central place, namely a core_initcall() in execmem.
>
> The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
> with the parameters defined by the architectures.  If an architecture does
> not implement execmem_arch_setup(), execmem_alloc() will fall back to
> module_alloc().
>
> Signed-off-by: Mike Rapoport (IBM) 

Acked-by: Song Liu 



Re: [PATCH v6 06/16] mm: introduce execmem_alloc() and execmem_free()

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> module_alloc() is used everywhere as a mean to allocate memory for code.
>
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
>
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
>
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
>
> No functional changes.
>
> Signed-off-by: Mike Rapoport (IBM) 
> Acked-by: Masami Hiramatsu (Google) 

Acked-by: Song Liu 



Re: [PATCH v6 05/16] module: make module_memory_{alloc,free} more self-contained

2024-04-26 Thread Song Liu
On Fri, Apr 26, 2024 at 1:30 AM Mike Rapoport  wrote:
>
> From: "Mike Rapoport (IBM)" 
>
> Move the logic related to the memory allocation and freeing into
> module_memory_alloc() and module_memory_free().
>
> Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Song Liu 



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-22 Thread Song Liu
Hi Masami and Mike,

On Sat, Apr 20, 2024 at 2:11 AM Masami Hiramatsu  wrote:
[...]
> > >
> > > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
> > > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
> > > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing
> > > the "range" object, we will have to compare different range parameters to 
> > > check
> > > we can share cached pages between module text and kprobe, which is not
> > > efficient. Did I miss something?
>
> Song, thanks for trying to eplain. I think I need to explain why I used
> module_alloc() originally.
>
> This depends on how kprobe features are implemented on the architecture, and
> how much features are supported on kprobes.
>
> Because kprobe jump optimization and kprobe jump-back optimization need to
> use a jump instruction to jump into the trampoline and jump back from the
> trampoline directly, if the architecuture jmp instruction supports +-2GB range
> like x86, it needs to allocate the trampoline buffer inside such address 
> space.
> This requirement is similar to the modules (because module function needs to
> call other functions in the kernel etc.), at least kprobes on x86 used
> module_alloc().
>
> However, if an architecture only supports breakpoint/trap based kprobe,
> it does not need to consider whether the execmem is allocated.
>
> >
> > We can always share large ROX pages as long as they are within the correct
> > address space. The permissions for them are ROX and the alignment
> > differences are due to KASAN and this is handled during allocation of the
> > large page to refill the cache. __execmem_cache_alloc() only needs to limit
> > the search for the address space of the range.
>
> So I don't think EXECMEM_KPROBE always same as EXECMEM_MODULE_TEXT, it
> should be configured for each arch. Especially, if it is only used for
> searching parameter, it looks OK to me.

Thanks for the explanation!

I was thinking "we can have EXECMEM_KPROBE share the same parameters as
EXECMEM_MODULE_TEXT for all architectures". But that thought is built on
assumptions about future changes/improvements across multiple subsystems.
At this moment, I have no objections moving forward with current execmem APIs.

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-19 Thread Song Liu
On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport  wrote:
>
> On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote:
> > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport  wrote:
> > [...]
> > > > >
> > > > > [1] 
> > > > > https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
> > > >
> > > > For the ROX to work, we need different users (module text, kprobe, 
> > > > etc.) to have
> > > > the same execmem_range. From [1]:
> > > >
> > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t 
> > > > size)
> > > > {
> > > > ...
> > > >p = __execmem_cache_alloc(size);
> > > >if (p)
> > > >return p;
> > > >   err = execmem_cache_populate(range, size);
> > > > ...
> > > > }
> > > >
> > > > We are calling __execmem_cache_alloc() without range. For this to work,
> > > > we can only call execmem_cache_alloc() with one execmem_range.
> > >
> > > Actually, on x86 this will "just work" because everything shares the same
> > > address space :)
> > >
> > > The 2M pages in the cache will be in the modules space, so
> > > __execmem_cache_alloc() will always return memory from that address space.
> > >
> > > For other architectures this indeed needs to be fixed with passing the
> > > range to __execmem_cache_alloc() and limiting search in the cache for that
> > > range.
> >
> > I think we at least need the "map to" concept (initially proposed by Thomas)
> > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
> > maps to EXECMEM_MODULE_TEXT, so that all these actually share
> > the same range.
>
> Why?

IIUC, we need to update __execmem_cache_alloc() to take a range pointer as
input. Module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe
will use "range" for EXECMEM_KPROBES. Without the "map to" concept or sharing
the "range" object, we will have to compare different range parameters to check
whether we can share cached pages between module text and kprobe, which is not
efficient. Did I miss something?
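
To make it concrete, the cache lookup would have to look roughly like the
below (a sketch based on the snippet quoted above, not on any posted patch;
__execmem_cache_alloc() internals omitted):

static void *__execmem_cache_alloc(struct execmem_range *range, size_t size);

static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
{
	void *p;
	int err;

	/* only hand out cached memory inside range->start .. range->end */
	p = __execmem_cache_alloc(range, size);
	if (p)
		return p;

	err = execmem_cache_populate(range, size);
	if (err)
		return NULL;

	return __execmem_cache_alloc(range, size);
}

If module text and kprobes pass in two distinct execmem_range objects that
describe the same area, the lookup has to prove they are compatible field by
field, which is the comparison I would like to avoid.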

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-19 Thread Song Liu
On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport  wrote:
[...]
> > >
> > > [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org
> >
> > For the ROX to work, we need different users (module text, kprobe, etc.) to 
> > have
> > the same execmem_range. From [1]:
> >
> > static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
> > {
> > ...
> >p = __execmem_cache_alloc(size);
> >if (p)
> >return p;
> >   err = execmem_cache_populate(range, size);
> > ...
> > }
> >
> > We are calling __execmem_cache_alloc() without range. For this to work,
> > we can only call execmem_cache_alloc() with one execmem_range.
>
> Actually, on x86 this will "just work" because everything shares the same
> address space :)
>
> The 2M pages in the cache will be in the modules space, so
> __execmem_cache_alloc() will always return memory from that address space.
>
> For other architectures this indeed needs to be fixed with passing the
> range to __execmem_cache_alloc() and limiting search in the cache for that
> range.

I think we at least need the "map to" concept (initially proposed by Thomas)
to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE
maps to EXECMEM_MODULE_TEXT, so that all these actually share
the same range.

Does this make sense?

Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-19 Thread Song Liu
On Thu, Apr 18, 2024 at 11:56 PM Mike Rapoport  wrote:
>
> On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
> > >
> > > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > > > >
> > > > > > > I'm looking at execmem_types more as definition of the consumers, 
> > > > > > > maybe I
> > > > > > > should have named the enum execmem_consumer at the first place.
> > > > > >
> > > > > > I think looking at execmem_type from consumers' point of view adds
> > > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > > > kprobe,
> > > > > > and bpf (and maybe also module text) all have the same requirements.
> > > > > > Did I miss something?
> > > > >
> > > > > It's enough to have one architecture with different constrains for 
> > > > > kprobes
> > > > > and bpf to warrant a type for each.
> > > >
> > > > AFAICT, some of these constraints can be changed without too much work.
> > >
> > > But why?
> > > I honestly don't understand what are you trying to optimize here. A few
> > > lines of initialization in execmem_info?
> >
> > IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
> > harder for bpf and kprobe to share the same ROX page. In many use cases,
> > a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
> > module text. It is not efficient if we have to allocate separate pages for 
> > each
> > of these use cases. If this is not a problem, the current approach works.
>
> The caching of large ROX pages does not need to be per type.
>
> In the POC I've posted for caching of large ROX pages on x86 [1], the cache is
> global and to make kprobes and bpf use it it's enough to set a flag in
> execmem_info.
>
> [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org

For the ROX to work, we need different users (module text, kprobe, etc.) to have
the same execmem_range. From [1]:

static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
{
...
   p = __execmem_cache_alloc(size);
   if (p)
   return p;
  err = execmem_cache_populate(range, size);
...
}

We are calling __execmem_cache_alloc() without range. For this to work,
we can only call execmem_cache_alloc() with one execmem_range.

Did I miss something?

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport  wrote:
>
> On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote:
> > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
> > > > >
> > > > > I'm looking at execmem_types more as definition of the consumers, 
> > > > > maybe I
> > > > > should have named the enum execmem_consumer at the first place.
> > > >
> > > > I think looking at execmem_type from consumers' point of view adds
> > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, 
> > > > kprobe,
> > > > and bpf (and maybe also module text) all have the same requirements.
> > > > Did I miss something?
> > >
> > > It's enough to have one architecture with different constrains for kprobes
> > > and bpf to warrant a type for each.
> >
> > AFAICT, some of these constraints can be changed without too much work.
>
> But why?
> I honestly don't understand what are you trying to optimize here. A few
> lines of initialization in execmem_info?

IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it
harder for bpf and kprobe to share the same ROX page. In many use cases,
a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and
module text. It is not efficient if we have to allocate separate pages for each
of these use cases. If this is not a problem, the current approach works.

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-18 Thread Song Liu
On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport  wrote:
>
[...]
> >
> > Is +/- 2G enough for all realistic use cases? If so, I guess we don't
> > really need
> > EXECMEM_ANYWHERE below?
> >
> > > >
> > > > * I'm not sure about BPF's requirements; it seems happy doing the same 
> > > > as
> > > >   modules.
> > >
> > > BPF are happy with vmalloc().
> > >
> > > > So if we *must* use a common execmem allocator, what we'd reall want is 
> > > > our own
> > > > types, e.g.
> > > >
> > > >   EXECMEM_ANYWHERE
> > > >   EXECMEM_NOPLT
> > > >   EXECMEM_PREL32
> > > >
> > > > ... and then we use those in arch code to implement module_alloc() and 
> > > > friends.
> > >
> > > I'm looking at execmem_types more as definition of the consumers, maybe I
> > > should have named the enum execmem_consumer at the first place.
> >
> > I think looking at execmem_type from consumers' point of view adds
> > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
> > and bpf (and maybe also module text) all have the same requirements.
> > Did I miss something?
>
> It's enough to have one architecture with different constrains for kprobes
> and bpf to warrant a type for each.
>

AFAICT, some of these constraints can be changed without too much work.

> Where do you see unnecessary complexity?
>
> > IOW, we have
> >
> > enum execmem_type {
> > EXECMEM_DEFAULT,
> > EXECMEM_TEXT,
> > EXECMEM_KPROBES = EXECMEM_TEXT,
> > EXECMEM_FTRACE = EXECMEM_TEXT,
> > EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
> > _KPROBE, _FTRACE, _BPF */
> > EXECMEM_DATA,  /* rw */
> > EXECMEM_RO_DATA,
> > EXECMEM_RO_AFTER_INIT,
> > EXECMEM_TYPE_MAX,
> > };
> >
> > Does this make sense?
>
> How do you suggest to deal with e.g. riscv that has separate address spaces
> for modules, kprobes and bpf?

IIUC, modules and bpf use the same address space on riscv, while kprobes use
the vmalloc address space. I haven't tried this yet, but I think we can let
kprobes use the same space as modules and bpf, which is:

  ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF

Did I get this right?
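
If so, the riscv side could be as simple as something like this (illustrative
sketch only; the structure and field names follow the style of this series,
and the details would need checking):

static struct execmem_info execmem_info __ro_after_init;

struct execmem_info __init *execmem_arch_setup(void)
{
	execmem_info = (struct execmem_info){
		.ranges = {
			[EXECMEM_DEFAULT] = {		/* modules, BPF */
				.start		= MODULES_VADDR,
				.end		= MODULES_END,
				.pgprot		= PAGE_KERNEL,
				.alignment	= 1,
			},
			[EXECMEM_KPROBES] = {		/* same window, ROX */
				.start		= MODULES_VADDR,
				.end		= MODULES_END,
				.pgprot		= PAGE_KERNEL_READ_EXEC,
				.alignment	= 1,
			},
		},
	};

	return &execmem_info;
}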

Thanks,
Song



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-17 Thread Song Liu
On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport  wrote:
>
> On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > +/**
> > > > + * enum execmem_type - types of executable memory ranges
> > > > + *
> > > > + * There are several subsystems that allocate executable memory.
> > > > + * Architectures define different restrictions on placement,
> > > > + * permissions, alignment and other parameters for memory that can be 
> > > > used
> > > > + * by these subsystems.
> > > > + * Types in this enum identify subsystems that allocate executable 
> > > > memory
> > > > + * and let architectures define parameters for ranges suitable for
> > > > + * allocations by each subsystem.
> > > > + *
> > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types 
> > > > that
> > > > + * are not explcitly defined.
> > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > + * @EXECMEM_BPF: parameters for BPF
> > > > + * @EXECMEM_TYPE_MAX:
> > > > + */
> > > > +enum execmem_type {
> > > > + EXECMEM_DEFAULT,
> > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > + EXECMEM_KPROBES,
> > > > + EXECMEM_FTRACE,
> > > > + EXECMEM_BPF,
> > > > + EXECMEM_TYPE_MAX,
> > > > +};
> > >
> > > Can we please get a break-down of how all these types are actually
> > > different from one another?
> > >
> > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > mind) and has less strict placement constraints for some of them?
> >
> > Yeah, and really I'd *much* rather deal with that in arch code, as I have 
> > said
> > several times.
> >
> > For arm64 we have two bsaic restrictions:
> >
> > 1) Direct branches can go +/-128M
> >We can expand this range by having direct branches go to PLTs, at a
> >performance cost.
> >
> > 2) PREL32 relocations can go +/-2G
> >We cannot expand this further.
> >
> > * We don't need to allocate memory for ftrace. We do not use trampolines.
> >
> > * Kprobes XOL areas don't care about either of those; we don't place any
> >   PC-relative instructions in those. Maybe we want to in future.
> >
> > * Modules care about both; we'd *prefer* to place them within +/-128M of all
> >   other kernel/module code, but if there's no space we can use PLTs and 
> > expand
> >   that to +/-2G. Since modules can refreence other modules, that ends up
> >   actually being halved, and modules have to fit within some 2G window that
> >   also covers the kernel.

Is +/- 2G enough for all realistic use cases? If so, I guess we don't really
need EXECMEM_ANYWHERE below?

> >
> > * I'm not sure about BPF's requirements; it seems happy doing the same as
> >   modules.
>
> BPF are happy with vmalloc().
>
> > So if we *must* use a common execmem allocator, what we'd reall want is our 
> > own
> > types, e.g.
> >
> >   EXECMEM_ANYWHERE
> >   EXECMEM_NOPLT
> >   EXECMEM_PREL32
> >
> > ... and then we use those in arch code to implement module_alloc() and 
> > friends.
>
> I'm looking at execmem_types more as definition of the consumers, maybe I
> should have named the enum execmem_consumer at the first place.

I think looking at execmem_type from consumers' point of view adds
unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
and bpf (and maybe also module text) all have the same requirements.
Did I miss something?

IOW, we have

enum execmem_type {
EXECMEM_DEFAULT,
EXECMEM_TEXT,
EXECMEM_KPROBES = EXECMEM_TEXT,
EXECMEM_FTRACE = EXECMEM_TEXT,
EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
_KPROBE, _FTRACE, _BPF */
EXECMEM_DATA,  /* rw */
EXECMEM_RO_DATA,
EXECMEM_RO_AFTER_INIT,
EXECMEM_TYPE_MAX,
};

Does this make sense?

Thanks,
Song



Re: [RFC][PATCH 0/4] Make bpf_jit and kprobes work with CONFIG_MODULES=n

2024-03-06 Thread Song Liu
Hi Calvin,

It is great to hear from you! :)

On Wed, Mar 6, 2024 at 3:23 PM Calvin Owens  wrote:
>
> On Wednesday 03/06 at 13:34 -0800, Luis Chamberlain wrote:
> > On Wed, Mar 06, 2024 at 12:05:07PM -0800, Calvin Owens wrote:
> > > Hello all,
> > >
> > > This patchset makes it possible to use bpftrace with kprobes on kernels
> > > built without loadable module support.
> >
> > This is a step in the right direction for another reason: clearly the
> > module_alloc() is not about modules, and we have special reasons for it
> > now beyond modules. The effort to share a generalize a huge page for
> > these things is also another reason for some of this but that is more
> > long term.
> >
> > I'm all for minor changes here so to avoid regressions but it seems a
> > rename is in order -- if we're going to all this might as well do it
> > now. And for that I'd just like to ask you paint the bikeshed with
> > Song Liu as he's been the one slowly making way to help us get there
> > with the "module: replace module_layout with module_memory",
> > and Mike Rapoport as he's had some follow up attempts [0]. As I see it,
> > the EXECMEM stuff would be what we use instead then. Mike kept the
> > module_alloc() and the execmem was just a wrapper but your move of the
> > arch stuff makes sense as well and I think would complement his series
> > nicely.
>
> I apologize for missing that. I think these are the four most recent
> versions of the different series referenced from that LWN link:
>
>   a) https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org/
>   b) https://lore.kernel.org/all/20230526051529.3387103-1-s...@kernel.org/
>   c) https://lore.kernel.org/all/20221107223921.3451913-1-s...@kernel.org/
>   d) 
> https://lore.kernel.org/all/20201120202426.18009-1-rick.p.edgeco...@intel.com/
>
> Song and Mike, please correct me if I'm wrong, but I think what I've
> done here (see [1], sorry for not adding you initially) is compatible
> with everything both of you have recently proposed above. How do you
> feel about this as a first step?

I agree that the work here is compatible with other efforts. I have no
objection to making this the first step.

>
> For naming, execmem_alloc() seems reasonable to me? I have no strong
> feelings at all, I'll just use that going forward unless somebody else
> expresses an opinion.

I am not good at naming things. No objection from me to "execmem_alloc".

Thanks,
Song



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-23 Thread Song Liu
On Sat, Sep 23, 2023 at 8:39 AM Mike Rapoport  wrote:
>
> On Thu, Sep 21, 2023 at 03:34:18PM -0700, Song Liu wrote:
> > On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
> > >
> >
> > [...]
> >
> > > diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> > > index 42215f9404af..db5561d0c233 100644
> > > --- a/arch/s390/kernel/module.c
> > > +++ b/arch/s390/kernel/module.c
> > > @@ -21,6 +21,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
> > >  #ifdef CONFIG_FUNCTION_TRACER
> > >  void module_arch_cleanup(struct module *mod)
> > >  {
> > > -   module_memfree(mod->arch.trampolines_start);
> > > +   execmem_free(mod->arch.trampolines_start);
> > >  }
> > >  #endif
> > >
> > > @@ -510,7 +511,7 @@ static int 
> > > module_alloc_ftrace_hotpatch_trampolines(struct module *me,
> > >
> > > size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
> > > numpages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > -   start = module_alloc(numpages * PAGE_SIZE);
> > > +   start = execmem_text_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);
> >
> > This should be EXECMEM_MODULE_TEXT?
>
> This is an ftrace trampoline, so I think it should be FTRACE type of
> allocation.

Yeah, I was aware of the ftrace trampoline. My point was that the ftrace
trampoline doesn't seem to have any special requirements. Therefore, it is probably not
necessary to have a separate type just for it.

AFAICT, kprobe, ftrace, and BPF (JIT and trampoline) can share the same
execmem_type. We may need some work for some archs, but nothing is
fundamentally different among these.
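
In other words, in an arch's execmem_arch_params() nothing stops us from
pointing all of these types at the module text parameters, roughly (sketch
only, reusing the field names from this series):

	struct execmem_range text = {
		.start		= MODULES_VADDR,
		.end		= MODULES_END,
		.pgprot		= PAGE_KERNEL_ROX,
		.alignment	= 1,
	};

	execmem_params.ranges[EXECMEM_MODULE_TEXT]	= text;
	execmem_params.ranges[EXECMEM_KPROBES]		= text;
	execmem_params.ranges[EXECMEM_FTRACE]		= text;
	execmem_params.ranges[EXECMEM_BPF]		= text;

which is also what we get by leaving the non-module ranges unset and letting
them fall back to the default parameters.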

Thanks,
Song



Re: [PATCH v3 06/13] mm/execmem: introduce execmem_data_alloc()

2023-09-22 Thread Song Liu
On Fri, Sep 22, 2023 at 12:17 AM Christophe Leroy
 wrote:
>
>
>
> On 22/09/2023 at 00:52, Song Liu wrote:
> > On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
> >>
> > [...]
> >> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> >> index 519bdfdca595..09d45ac786e9 100644
> >> --- a/include/linux/execmem.h
> >> +++ b/include/linux/execmem.h
> >> @@ -29,6 +29,7 @@
> >>* @EXECMEM_KPROBES: parameters for kprobes
> >>* @EXECMEM_FTRACE: parameters for ftrace
> >>* @EXECMEM_BPF: parameters for BPF
> >> + * @EXECMEM_MODULE_DATA: parameters for module data sections
> >>* @EXECMEM_TYPE_MAX:
> >>*/
> >>   enum execmem_type {
> >> @@ -37,6 +38,7 @@ enum execmem_type {
> >>  EXECMEM_KPROBES,
> >>  EXECMEM_FTRACE,
> >
> > In longer term, I think we can improve the JITed code and merge
> > kprobe/ftrace/bpf. to use the same ranges. Also, do we need special
> > setting for FTRACE? If not, let's just remove it.
>
> How can we do that ? Some platforms like powerpc require executable
> memory for BPF and non-exec mem for KPROBE so it can't be in the same
> area/ranges.

Hmm... non-exec mem for kprobes?

   if (strict_module_rwx_enabled())
   execmem_params.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
   else
   execmem_params.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_EXEC;

Do you mean the latter case?

Thanks,
Song



Re: [PATCH v3 06/13] mm/execmem: introduce execmem_data_alloc()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
>
[...]
> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> index 519bdfdca595..09d45ac786e9 100644
> --- a/include/linux/execmem.h
> +++ b/include/linux/execmem.h
> @@ -29,6 +29,7 @@
>   * @EXECMEM_KPROBES: parameters for kprobes
>   * @EXECMEM_FTRACE: parameters for ftrace
>   * @EXECMEM_BPF: parameters for BPF
> + * @EXECMEM_MODULE_DATA: parameters for module data sections
>   * @EXECMEM_TYPE_MAX:
>   */
>  enum execmem_type {
> @@ -37,6 +38,7 @@ enum execmem_type {
> EXECMEM_KPROBES,
> EXECMEM_FTRACE,

In the longer term, I think we can improve the JITed code and merge
kprobe/ftrace/bpf to use the same ranges. Also, do we need a special
setting for FTRACE? If not, let's just remove it.

> EXECMEM_BPF,
> +   EXECMEM_MODULE_DATA,
> EXECMEM_TYPE_MAX,
>  };

Overall, it is great that kprobe/ftrace/bpf no longer depend on modules.

OTOH, I think we should merge execmem_type and existing mod_mem_type.
Otherwise, we still need to handle page permissions in multiple places.
What is our plan for that?
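
To make the concern concrete: the module loader already knows which
mod_mem_type is text and which is data, and with a separate execmem_type
enum that knowledge has to be repeated when picking the allocation type,
e.g. (hypothetical helper, just to show the duplication):

static enum execmem_type mod_mem_to_execmem(enum mod_mem_type type)
{
	switch (type) {
	case MOD_TEXT:
	case MOD_INIT_TEXT:
		return EXECMEM_MODULE_TEXT;
	case MOD_DATA:
	case MOD_INIT_DATA:
	case MOD_RODATA:
	case MOD_INIT_RODATA:
	case MOD_RO_AFTER_INIT:
	default:
		return EXECMEM_MODULE_DATA;
	}
}

and the ROX/RO/RW decision for the same sections is then made a second time
when the pgprot for each execmem_type is set up.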

Thanks,
Song


>
> @@ -107,6 +109,23 @@ struct execmem_params *execmem_arch_params(void);
>   */
>  void *execmem_text_alloc(enum execmem_type type, size_t size);
>
> +/**
> + * execmem_data_alloc - allocate memory for data coupled to code
> + * @type: type of the allocation
> + * @size: how many bytes of memory are required
> + *
> + * Allocates memory that will contain data coupled with executable code,
> + * like data sections in kernel modules.
> + *
> + * The memory will have protections defined by architecture.
> + *
> + * The allocated memory will reside in an area that does not impose
> + * restrictions on the addressing modes.
> + *
> + * Return: a pointer to the allocated memory or %NULL
> + */
> +void *execmem_data_alloc(enum execmem_type type, size_t size);
> +
>  /**
>   * execmem_free - free executable memory
>   * @ptr: pointer to the memory that should be freed
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index c4146bfcd0a7..2ae83a6abf66 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1188,25 +1188,16 @@ void __weak module_arch_freeing_init(struct module 
> *mod)
>  {
>  }
>
> -static bool mod_mem_use_vmalloc(enum mod_mem_type type)
> -{
> -   return IS_ENABLED(CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC) &&
> -   mod_mem_type_is_core_data(type);
> -}
> -
>  static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
>  {
> -   if (mod_mem_use_vmalloc(type))
> -   return vzalloc(size);
> +   if (mod_mem_type_is_data(type))
> +   return execmem_data_alloc(EXECMEM_MODULE_DATA, size);
> return execmem_text_alloc(EXECMEM_MODULE_TEXT, size);
>  }
>
>  static void module_memory_free(void *ptr, enum mod_mem_type type)
>  {
> -   if (mod_mem_use_vmalloc(type))
> -   vfree(ptr);
> -   else
> -   execmem_free(ptr);
> +   execmem_free(ptr);
>  }
>
>  static void free_mod_mem(struct module *mod)
> diff --git a/mm/execmem.c b/mm/execmem.c
> index abcbd07e05ac..aeff85261360 100644
> --- a/mm/execmem.c
> +++ b/mm/execmem.c
> @@ -53,11 +53,23 @@ static void *execmem_alloc(size_t size, struct 
> execmem_range *range)
> return kasan_reset_tag(p);
>  }
>
> +static inline bool execmem_range_is_data(enum execmem_type type)
> +{
> +   return type == EXECMEM_MODULE_DATA;
> +}
> +
>  void *execmem_text_alloc(enum execmem_type type, size_t size)
>  {
> return execmem_alloc(size, &execmem_params.ranges[type]);
>  }
>
> +void *execmem_data_alloc(enum execmem_type type, size_t size)
> +{
> +   WARN_ON_ONCE(!execmem_range_is_data(type));
> +
> +   return execmem_alloc(size, &execmem_params.ranges[type]);
> +}
> +
>  void execmem_free(void *ptr)
>  {
> /*
> @@ -93,7 +105,10 @@ static void execmem_init_missing(struct execmem_params *p)
> struct execmem_range *r = &p->ranges[i];
>
> if (!r->start) {
> -   r->pgprot = default_range->pgprot;
> +   if (execmem_range_is_data(i))
> +   r->pgprot = PAGE_KERNEL;
> +   else
> +   r->pgprot = default_range->pgprot;
> r->alignment = default_range->alignment;
> r->start = default_range->start;
> r->end = default_range->end;
> --
> 2.39.2
>



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
>

[...]

> diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> index 42215f9404af..db5561d0c233 100644
> --- a/arch/s390/kernel/module.c
> +++ b/arch/s390/kernel/module.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
>  #ifdef CONFIG_FUNCTION_TRACER
>  void module_arch_cleanup(struct module *mod)
>  {
> -   module_memfree(mod->arch.trampolines_start);
> +   execmem_free(mod->arch.trampolines_start);
>  }
>  #endif
>
> @@ -510,7 +511,7 @@ static int 
> module_alloc_ftrace_hotpatch_trampolines(struct module *me,
>
> size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
> numpages = DIV_ROUND_UP(size, PAGE_SIZE);
> -   start = module_alloc(numpages * PAGE_SIZE);
> +   start = execmem_text_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);

This should be EXECMEM_MODULE_TEXT?

Thanks,
Song

> if (!start)
> return -ENOMEM;
> set_memory_rox((unsigned long)start, numpages);
[...]



Re: [PATCH v3 09/13] powerpc: extend execmem_params for kprobes allocations

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:31 AM Mike Rapoport  wrote:
>
[...]
> @@ -135,5 +138,13 @@ struct execmem_params __init *execmem_arch_params(void)
>
> range->pgprot = prot;
>
> +   execmem_params.ranges[EXECMEM_KPROBES].start = VMALLOC_START;
> +   execmem_params.ranges[EXECMEM_KPROBES].start = VMALLOC_END;

.end = VMALLOC_END.

Thanks,
Song

> +
> +   if (strict_module_rwx_enabled())
> +   execmem_params.ranges[EXECMEM_KPROBES].pgprot = 
> PAGE_KERNEL_ROX;
> +   else
> +   execmem_params.ranges[EXECMEM_KPROBES].pgprot = 
> PAGE_KERNEL_EXEC;
> +
> return &execmem_params;
>  }
> --
> 2.39.2
>
>



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
>
[...]
> +
> +/**
> + * enum execmem_type - types of executable memory ranges
> + *
> + * There are several subsystems that allocate executable memory.
> + * Architectures define different restrictions on placement,
> + * permissions, alignment and other parameters for memory that can be used
> + * by these subsystems.
> + * Types in this enum identify subsystems that allocate executable memory
> + * and let architectures define parameters for ranges suitable for
> + * allocations by each subsystem.
> + *
> + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> + * are not explcitly defined.
> + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> + * @EXECMEM_KPROBES: parameters for kprobes
> + * @EXECMEM_FTRACE: parameters for ftrace
> + * @EXECMEM_BPF: parameters for BPF
> + * @EXECMEM_TYPE_MAX:
> + */
> +enum execmem_type {
> +   EXECMEM_DEFAULT,

I found EXECMEM_DEFAULT more confusing than helpful.

Song

> +   EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> +   EXECMEM_KPROBES,
> +   EXECMEM_FTRACE,
> +   EXECMEM_BPF,
> +   EXECMEM_TYPE_MAX,
> +};
> +
[...]



Re: [PATCH v3 02/13] mm: introduce execmem_text_alloc() and execmem_free()

2023-09-21 Thread Song Liu
On Mon, Sep 18, 2023 at 12:30 AM Mike Rapoport  wrote:
>
[...]
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static void *execmem_alloc(size_t size)
> +{
> +   return module_alloc(size);
> +}
> +
> +void *execmem_text_alloc(enum execmem_type type, size_t size)
> +{
> +   return execmem_alloc(size);
> +}

execmem_text_alloc (and later execmem_data_alloc) both take "type" as
input. I guess we can just use execmem_alloc(type, size) for everything?
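
i.e. a single entry point, roughly (sketch only, based on the code quoted
above):

void *execmem_alloc(enum execmem_type type, size_t size)
{
	/*
	 * "type" selects the per-subsystem parameters once they are
	 * wired up; at this point in the series it can still just
	 * wrap module_alloc().
	 */
	return module_alloc(size);
}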

Thanks,
Song

> +
> +void execmem_free(void *ptr)
> +{
> +   /*
> +* This memory may be RO, and freeing RO memory in an interrupt is not
> +* supported by vmalloc.
> +*/
> +   WARN_ON(in_interrupt());
> +   vfree(ptr);
> +}
> --
> 2.39.2
>



Re: [PATCH v4 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-20 Thread Song Liu



> On Apr 20, 2021, at 10:31 AM, Jiri Olsa  wrote:
> 
> On Mon, Apr 19, 2021 at 01:36:48PM -0700, Song Liu wrote:
> 
> SNIP
> 
>>  if (stat_config.initial_delay < 0) {
>> @@ -784,11 +790,11 @@ static int __run_perf_stat(int argc, const char 
>> **argv, int run_idx)
>>  if (affinity__setup(&affinity) < 0)
>>  return -1;
>> 
>> -if (target__has_bpf(&target)) {
>> -evlist__for_each_entry(evsel_list, counter) {
>> -if (bpf_counter__load(counter, &target))
>> -return -1;
>> -}
>> +evlist__for_each_entry(evsel_list, counter) {
>> +if (bpf_counter__load(counter, &target))
>> +return -1;
>> +if (!evsel__is_bpf(counter))
>> +all_counters_use_bpf = false;
> 
> could be done in bpf_counter__load, check below:
> 
>>  }
>> 
>>  evlist__for_each_cpu (evsel_list, i, cpu) {
>> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
>> index 5de991ab46af9..33b1888103dfa 100644
>> --- a/tools/perf/util/bpf_counter.c
>> +++ b/tools/perf/util/bpf_counter.c
>> @@ -790,7 +790,8 @@ int bpf_counter__load(struct evsel *evsel, struct target 
>> *target)
>> {
>>  if (target->bpf_str)
>>  evsel->bpf_counter_ops = &bpf_program_profiler_ops;
>> -else if (target->use_bpf)
>> +else if (target->use_bpf ||
>> + evsel__match_bpf_counter_events(evsel->name))
>>  evsel->bpf_counter_ops = &bperf_ops;
> 
> with:
>   else
>   all_counters_use_bpf = false;
> 
> I was also thinking of oving it to evlist, but it's sat specific,
> so I think it's good as static.. thanks for changing the implementation

Hmm... then we need to somehow make all_counters_use_bpf visible in
bpf_counter.c, which won't be very clean. Also, since this is stat 
specific, I guess it is better to keep it inside builtin-stat.c?
The runtime overhead should be minimal. 

Thanks,
Song



[PATCH v4 4/4] perf-stat: introduce ':b' modifier

2021-04-19 Thread Song Liu
Introduce 'b' modifier to event parser, which means use BPF program to
manage this event. This is the same as --bpf-counters option, but only
applies to this event. For example,

  perf stat -e cycles:b,cs   # use bpf for cycles, but not cs
  perf stat -e cycles,cs --bpf-counters  # use bpf for both cycles and cs

Suggested-by: Jiri Olsa 
Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c  | 2 +-
 tools/perf/util/evsel.h| 1 +
 tools/perf/util/parse-events.c | 8 +++-
 tools/perf/util/parse-events.l | 2 +-
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 33b1888103dfa..f179f57430253 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -790,7 +790,7 @@ int bpf_counter__load(struct evsel *evsel, struct target 
*target)
 {
if (target->bpf_str)
evsel->bpf_counter_ops = &bpf_program_profiler_ops;
-   else if (target->use_bpf ||
+   else if (target->use_bpf || evsel->bpf_counter ||
 evsel__match_bpf_counter_events(evsel->name))
evsel->bpf_counter_ops = &bperf_ops;
 
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ce4b629d659c2..8f66cdcb338d0 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -82,6 +82,7 @@ struct evsel {
boolauto_merge_stats;
boolcollect_stat;
boolweak_group;
+   boolbpf_counter;
int bpf_fd;
struct bpf_object   *bpf_obj;
};
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 8123d218ad17c..46ebd269a98d1 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1804,6 +1804,7 @@ struct event_modifier {
int pinned;
int weak;
int exclusive;
+   int bpf_counter;
 };
 
 static int get_event_modifier(struct event_modifier *mod, char *str,
@@ -1824,6 +1825,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
int exclude = eu | ek | eh;
int exclude_GH = evsel ? evsel->exclude_GH : 0;
int weak = 0;
+   int bpf_counter = 0;
 
memset(mod, 0, sizeof(*mod));
 
@@ -1867,6 +1869,8 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
exclusive = 1;
} else if (*str == 'W') {
weak = 1;
+   } else if (*str == 'b') {
+   bpf_counter = 1;
} else
break;
 
@@ -1898,6 +1902,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
mod->sample_read = sample_read;
mod->pinned = pinned;
mod->weak = weak;
+   mod->bpf_counter = bpf_counter;
mod->exclusive = exclusive;
 
return 0;
@@ -1912,7 +1917,7 @@ static int check_modifier(char *str)
char *p = str;
 
/* The sizeof includes 0 byte as well. */
-   if (strlen(str) > (sizeof("ukhGHpppPSDIWe") - 1))
+   if (strlen(str) > (sizeof("ukhGHpppPSDIWeb") - 1))
return -1;
 
while (*p) {
@@ -1953,6 +1958,7 @@ int parse_events__modifier_event(struct list_head *list, 
char *str, bool add)
evsel->sample_read = mod.sample_read;
evsel->precise_max = mod.precise_max;
evsel->weak_group  = mod.weak;
+   evsel->bpf_counter = mod.bpf_counter;
 
if (evsel__is_group_leader(evsel)) {
evsel->core.attr.pinned = mod.pinned;
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 0b36285a9435d..fb8646cc3e834 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -210,7 +210,7 @@ name_tag
[\'][a-zA-Z_*?\[\]][a-zA-Z0-9_*?\-,\.\[\]:=]*[\']
 name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?.:]*
 drv_cfg_term   [a-zA-Z0-9_\.]+(=[a-zA-Z0-9_*?\.:]+)?
 /* If you add a modifier you need to update check_modifier() */
-modifier_event [ukhpPGHSDIWe]+
+modifier_event [ukhpPGHSDIWeb]+
 modifier_bp[rwx]{1,3}
 
 %%
-- 
2.30.2



[PATCH v4 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-19 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. Events with name in the option will use
BPF.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 40 +++---
 tools/perf/util/bpf_counter.c  |  3 +-
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 ++
 tools/perf/util/evsel.h|  8 ++
 tools/perf/util/target.h   |  5 
 7 files changed, 61 insertions(+), 23 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 6ec5960b08c3d..f10e24da23e90 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2a2c15cac80a3..157105e792eaf 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -160,6 +160,7 @@ static const char *smi_cost_attrs = {
 };
 
 static struct evlist   *evsel_list;
+static bool all_counters_use_bpf = true;
 
 static struct target target = {
.uid= UINT_MAX,
@@ -399,6 +400,9 @@ static int read_affinity_counters(struct timespec *rs)
struct affinity affinity;
int i, ncpus, cpu;
 
+   if (all_counters_use_bpf)
+   return 0;
+
if (affinity__setup(&affinity) < 0)
return -1;
 
@@ -413,6 +417,8 @@ static int read_affinity_counters(struct timespec *rs)
evlist__for_each_entry(evsel_list, counter) {
if (evsel__cpu_iter_skip(counter, cpu))
continue;
+   if (evsel__is_bpf(counter))
+   continue;
if (!counter->err) {
counter->err = read_counter_cpu(counter, rs,

counter->cpu_iter - 1);
@@ -429,6 +435,9 @@ static int read_bpf_map_counters(void)
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!evsel__is_bpf(counter))
+   continue;
+
err = bpf_counter__read(counter);
if (err)
return err;
@@ -439,14 +448,10 @@ static int read_bpf_map_counters(void)
 static void read_counters(struct timespec *rs)
 {
struct evsel *counter;
-   int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf(&target))
-   err = read_bpf_map_counters();
-   else
-   err = read_affinity_counters(rs);
-   if (err < 0)
+   if (read_bpf_map_counters() ||
+   read_affinity_counters(rs))
return;
}
 
@@ -535,12 +540,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel__is_bpf(evsel))
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +790,11 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup(&affinity) < 0)
return -1;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, &target))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, &target))
+   return -1

[PATCH v4 2/4] perf bpf: check perf_attr_map is compatible with the perf binary

2021-04-19 Thread Song Liu
perf_attr_map could be shared among different versions of the perf binary. Add
bperf_attr_map_compatible() to check whether the existing attr_map is
compatible with current perf binary.

Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index be484ddbbd5be..5de991ab46af9 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -312,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, &map_info, &map_info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -346,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



[PATCH v4 1/4] perf util: move bpf_perf definitions to a libperf header

2021-04-19 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPF_PERF_DEFAULT_ATTR_MAP_PATH to
bpf_perf.h for other tools to use.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/util/bpf_counter.c  | 27 +++---
 2 files changed, 34 insertions(+), 24 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

diff --git a/tools/lib/perf/include/perf/bpf_perf.h 
b/tools/lib/perf/include/perf/bpf_perf.h
new file mode 100644
index 0..e7cf6ba7b674b
--- /dev/null
+++ b/tools/lib/perf/include/perf/bpf_perf.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPF_PERF_H
+#define __LIBPERF_BPF_PERF_H
+
+#include   /* for __u32 */
+
+/*
+ * bpf_perf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* default attr_map name */
+#define BPF_PERF_DEFAULT_ATTR_MAP_PATH "perf_attr_map"
+
+#endif /* __LIBPERF_BPF_PERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..be484ddbbd5be 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -341,8 +320,8 @@ static int bperf_lock_attr_map(struct target *target)
if (target->attr_map) {
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
-   scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+   scnprintf(path, PATH_MAX, "%s/fs/bpf/%s", sysfs__mountpoint(),
+ BPF_PERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
-- 
2.30.2



[PATCH v4 0/4] perf util: bpf perf improvements

2021-04-19 Thread Song Liu
This patch set improves bpf_perf (perf-stat --bpf-counters) by
 1) exposing key definitions to a libperf header;
 2) adding compatibility check for perf_attr_map;
 3) introducing config stat.bpf-counter-events;
 4) introducing the 'b' modifier to the event parser.

Changes v3 => v4:
1. Improve the logic that decides when to skip read_affinity_counters().
   (Jiri)
2. Clean up a condition in bpf_counters.c:read_counters(). (Jiri)

Changes v2 => v3:
1. Add 'b' modifier. (Jiri)
2. Allow configuring stat.bpf-counter-events with any event name (instead
  of limiting to hardware events). (Jiri)

Changes v1 => v2:
1. Separte 2/3 from 1/3. (Jiri)
2. Rename bperf.h to bpf_perf.h. (Jiri)
3. Other small fixes/optimizations. (Jiri)

Song Liu (4):
  perf util: move bpf_perf definitions to a libperf header
  perf bpf: check perf_attr_map is compatible with the perf binary
  perf-stat: introduce config stat.bpf-counter-events
  perf-stat: introduce ':b' modifier

 tools/lib/perf/include/perf/bpf_perf.h | 31 
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 40 -
 tools/perf/util/bpf_counter.c  | 49 +-
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 
 tools/perf/util/evsel.h|  9 +
 tools/perf/util/parse-events.c |  8 -
 tools/perf/util/parse-events.l |  2 +-
 tools/perf/util/target.h   |  5 ---
 10 files changed, 123 insertions(+), 49 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

--
2.30.2


Re: [PATCH v3 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-19 Thread Song Liu



> On Apr 19, 2021, at 7:26 AM, Jiri Olsa  wrote:
> 
> On Fri, Apr 16, 2021 at 03:13:24PM -0700, Song Liu wrote:
> 
> SNIP
> 
>> +/*
>> + * Returns:
>> + * 0   if all events use BPF;
>> + * 1   if some events do NOT use BPF;
>> + * < 0 on errors;
>> + */
>> static int read_bpf_map_counters(void)
>> {
>> +bool has_none_bpf_events = false;
>>  struct evsel *counter;
>>  int err;
>> 
>>  evlist__for_each_entry(evsel_list, counter) {
>> +if (!evsel__is_bpf(counter)) {
>> +has_none_bpf_events = true;
>> +continue;
>> +}
>>  err = bpf_counter__read(counter);
>>  if (err)
>>  return err;
>>  }
>> -return 0;
>> +return has_none_bpf_events ? 1 : 0;
>> }
>> 
>> static void read_counters(struct timespec *rs)
>> @@ -442,9 +455,10 @@ static void read_counters(struct timespec *rs)
>>  int err;
>> 
>>  if (!stat_config.stop_read_counter) {
>> -if (target__has_bpf(&target))
>> -err = read_bpf_map_counters();
>> -else
>> +err = read_bpf_map_counters();
>> +if (err < 0)
>> +return;
>> +if (err)
>>  err = read_affinity_counters(rs);
> 
> this part is confusing for me.. I understand we don't want to enter
> read_affinity_counters when there's no bpf counter, so we don't set
> affinities in vain.. but there must be better way ;-)
> 
>> diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
>> index 5de991ab46af9..3189b63714371 100644
>> --- a/tools/perf/util/bpf_counter.c
>> +++ b/tools/perf/util/bpf_counter.c
>> @@ -792,6 +792,8 @@ int bpf_counter__load(struct evsel *evsel, struct target 
>> *target)
>>  evsel->bpf_counter_ops = &bpf_program_profiler_ops;
>>  else if (target->use_bpf)
>>  evsel->bpf_counter_ops = &bperf_ops;
>> +else if (evsel__match_bpf_counter_events(evsel->name))
>> +evsel->bpf_counter_ops = &bperf_ops;
> 
> please put this with the target->use_bpf check,
> it seems like it's another thing

Thanks for the review. Fixing these in v4. 

Song



Re: [PATCH v3 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-19 Thread Song Liu



> On Apr 17, 2021, at 9:45 AM, Namhyung Kim  wrote:
> 
> Hi Song,
> 
> On Sat, Apr 17, 2021 at 7:13 AM Song Liu  wrote:
>> 
>> Currently, to use BPF to aggregate perf event counters, the user uses
>> --bpf-counters option. Enable "use bpf by default" events with a config
>> option, stat.bpf-counter-events. Events with name in the option will use
>> BPF.
>> 
>> This also enables mixed BPF event and regular event in the same sesssion.
>> For example:
>> 
>>   perf config stat.bpf-counter-events=instructions
>>   perf stat -e instructions,cs
>> 
>> The second command will use BPF for "instructions" but not "cs".
>> 
>> Signed-off-by: Song Liu 
>> ---
>> @@ -535,12 +549,13 @@ static int enable_counters(void)
>>struct evsel *evsel;
>>int err;
>> 
>> -   if (target__has_bpf(&target)) {
>> -   evlist__for_each_entry(evsel_list, evsel) {
>> -   err = bpf_counter__enable(evsel);
>> -   if (err)
>> -   return err;
>> -   }
>> +   evlist__for_each_entry(evsel_list, evsel) {
>> +   if (!evsel__is_bpf(evsel))
>> +   continue;
>> +
>> +   err = bpf_counter__enable(evsel);
>> +   if (err)
>> +   return err;
> 
> I just realized it doesn't have a disable counterpart.

I guess it is not really necessary for perf-stat? It is probably good
to have it though. How about we do it in a follow up patch?
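
Something like the below in builtin-stat.c should do for the follow up
(untested sketch; it assumes we add a bpf_counter__disable() that mirrors
bpf_counter__enable()):

static int disable_bpf_counters(void)
{
	struct evsel *evsel;
	int err;

	evlist__for_each_entry(evsel_list, evsel) {
		if (!evsel__is_bpf(evsel))
			continue;

		err = bpf_counter__disable(evsel);
		if (err)
			return err;
	}
	return 0;
}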

[...]

>> +   if (!evsel__bpf_counter_events)
>> +   return false;
>> +
>> +   ptr = strstr(evsel__bpf_counter_events, name);
>> +   name_len = strlen(name);
>> +
>> +   /* check name matches a full token in evsel__bpf_counter_events */
>> +   match = (ptr != NULL) &&
>> +   ((ptr == evsel__bpf_counter_events) || (*(ptr - 1) == ',')) 
>> &&
>> +   ((*(ptr + name_len) == ',') || (*(ptr + name_len) == '\0'));
> 
> I'm not sure we have an event name which is a substring of another.
> Maybe it can retry if it fails to match.

We have ref-cycles and cycles. And some raw events may be substrings of others?
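
That is why the check insists on a full token rather than a plain strstr()
hit. A tiny standalone program to illustrate (not perf code, just the same
matching idea):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* same idea as evsel__match_bpf_counter_events() above */
static bool match_token(const char *list, const char *name)
{
	const char *ptr = strstr(list, name);
	size_t len = strlen(name);

	return ptr &&
	       (ptr == list || ptr[-1] == ',') &&
	       (ptr[len] == ',' || ptr[len] == '\0');
}

int main(void)
{
	const char *list = "ref-cycles,instructions";

	printf("%d\n", strstr(list, "cycles") != NULL);		/* 1: false hit */
	printf("%d\n", match_token(list, "cycles"));		/* 0 */
	printf("%d\n", match_token(list, "ref-cycles"));	/* 1 */
	return 0;
}

Retrying after a failed token check (as you suggest) would only matter for a
list like "ref-cycles,cycles", where the first strstr() hit lands inside
"ref-cycles"; we can add that if such configs turn out to be common.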

Thanks,
Song

[...]



[PATCH v3 3/4] perf-stat: introduce config stat.bpf-counter-events

2021-04-16 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. Events with name in the option will use
BPF.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 43 +-
 tools/perf/util/bpf_counter.c  |  2 ++
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 +
 tools/perf/util/evsel.h|  8 +
 tools/perf/util/target.h   |  5 ---
 7 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 6ec5960b08c3d..78afe13cd7d47 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common hardware events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2a2c15cac80a3..0c76271f3ef53 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -413,6 +413,8 @@ static int read_affinity_counters(struct timespec *rs)
evlist__for_each_entry(evsel_list, counter) {
if (evsel__cpu_iter_skip(counter, cpu))
continue;
+   if (evsel__is_bpf(counter))
+   continue;
if (!counter->err) {
counter->err = read_counter_cpu(counter, rs,

counter->cpu_iter - 1);
@@ -423,17 +425,28 @@ static int read_affinity_counters(struct timespec *rs)
return 0;
 }
 
+/*
+ * Returns:
+ * 0   if all events use BPF;
+ * 1   if some events do NOT use BPF;
+ * < 0 on errors;
+ */
 static int read_bpf_map_counters(void)
 {
+   bool has_none_bpf_events = false;
struct evsel *counter;
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!evsel__is_bpf(counter)) {
+   has_none_bpf_events = true;
+   continue;
+   }
err = bpf_counter__read(counter);
if (err)
return err;
}
-   return 0;
+   return has_none_bpf_events ? 1 : 0;
 }
 
 static void read_counters(struct timespec *rs)
@@ -442,9 +455,10 @@ static void read_counters(struct timespec *rs)
int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf(&target))
-   err = read_bpf_map_counters();
-   else
+   err = read_bpf_map_counters();
+   if (err < 0)
+   return;
+   if (err)
err = read_affinity_counters(rs);
if (err < 0)
return;
@@ -535,12 +549,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel__is_bpf(evsel))
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +799,9 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup(&affinity) < 0)
return -1;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, &target))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, &target))
+   return -1;
}
 
evlist__for_each_cpu (evsel_list, i, cpu) {
diff --git a/tools

[PATCH v3 4/4] perf-stat: introduce ':b' modifier

2021-04-16 Thread Song Liu
Introduce 'b' modifier to event parser, which means use BPF program to
manage this event. This is the same as --bpf-counters option, but only
applies to this event. For example,

  perf stat -e cycles:b,cs   # use bpf for cycles, but not cs
  perf stat -e cycles,cs --bpf-counters  # use bpf for both cycles and cs

Suggested-by: Jiri Olsa 
Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c  | 2 +-
 tools/perf/util/evsel.h| 1 +
 tools/perf/util/parse-events.c | 8 +++-
 tools/perf/util/parse-events.l | 2 +-
 4 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 3189b63714371..7b4d511c7e6ec 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -790,7 +790,7 @@ int bpf_counter__load(struct evsel *evsel, struct target 
*target)
 {
if (target->bpf_str)
evsel->bpf_counter_ops = &bpf_program_profiler_ops;
-   else if (target->use_bpf)
+   else if (target->use_bpf || evsel->bpf_counter)
evsel->bpf_counter_ops = &bperf_ops;
else if (evsel__match_bpf_counter_events(evsel->name))
evsel->bpf_counter_ops = &bperf_ops;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index ce4b629d659c2..8f66cdcb338d0 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -82,6 +82,7 @@ struct evsel {
boolauto_merge_stats;
boolcollect_stat;
boolweak_group;
+   boolbpf_counter;
int bpf_fd;
struct bpf_object   *bpf_obj;
};
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 8123d218ad17c..46ebd269a98d1 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1804,6 +1804,7 @@ struct event_modifier {
int pinned;
int weak;
int exclusive;
+   int bpf_counter;
 };
 
 static int get_event_modifier(struct event_modifier *mod, char *str,
@@ -1824,6 +1825,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
int exclude = eu | ek | eh;
int exclude_GH = evsel ? evsel->exclude_GH : 0;
int weak = 0;
+   int bpf_counter = 0;
 
memset(mod, 0, sizeof(*mod));
 
@@ -1867,6 +1869,8 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
exclusive = 1;
} else if (*str == 'W') {
weak = 1;
+   } else if (*str == 'b') {
+   bpf_counter = 1;
} else
break;
 
@@ -1898,6 +1902,7 @@ static int get_event_modifier(struct event_modifier *mod, 
char *str,
mod->sample_read = sample_read;
mod->pinned = pinned;
mod->weak = weak;
+   mod->bpf_counter = bpf_counter;
mod->exclusive = exclusive;
 
return 0;
@@ -1912,7 +1917,7 @@ static int check_modifier(char *str)
char *p = str;
 
/* The sizeof includes 0 byte as well. */
-   if (strlen(str) > (sizeof("ukhGHpppPSDIWe") - 1))
+   if (strlen(str) > (sizeof("ukhGHpppPSDIWeb") - 1))
return -1;
 
while (*p) {
@@ -1953,6 +1958,7 @@ int parse_events__modifier_event(struct list_head *list, 
char *str, bool add)
evsel->sample_read = mod.sample_read;
evsel->precise_max = mod.precise_max;
evsel->weak_group  = mod.weak;
+   evsel->bpf_counter = mod.bpf_counter;
 
if (evsel__is_group_leader(evsel)) {
evsel->core.attr.pinned = mod.pinned;
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 0b36285a9435d..fb8646cc3e834 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -210,7 +210,7 @@ name_tag
[\'][a-zA-Z_*?\[\]][a-zA-Z0-9_*?\-,\.\[\]:=]*[\']
 name_minus [a-zA-Z_*?][a-zA-Z0-9\-_*?.:]*
 drv_cfg_term   [a-zA-Z0-9_\.]+(=[a-zA-Z0-9_*?\.:]+)?
 /* If you add a modifier you need to update check_modifier() */
-modifier_event [ukhpPGHSDIWe]+
+modifier_event [ukhpPGHSDIWeb]+
 modifier_bp[rwx]{1,3}
 
 %%
-- 
2.30.2



[PATCH v3 2/4] perf bpf: check perf_attr_map is compatible with the perf binary

2021-04-16 Thread Song Liu
perf_attr_map could be shared among different versions of the perf binary. Add
bperf_attr_map_compatible() to check whether the existing attr_map is
compatible with the current perf binary.

Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index be484ddbbd5be..5de991ab46af9 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -312,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, &map_info, &map_info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -346,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



[PATCH v3 1/4] perf util: move bpf_perf definitions to a libperf header

2021-04-16 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPF_PERF_DEFAULT_ATTR_MAP_PATH to
bpf_perf.h for other tools to use.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/util/bpf_counter.c  | 27 +++---
 2 files changed, 34 insertions(+), 24 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

diff --git a/tools/lib/perf/include/perf/bpf_perf.h 
b/tools/lib/perf/include/perf/bpf_perf.h
new file mode 100644
index 0..e7cf6ba7b674b
--- /dev/null
+++ b/tools/lib/perf/include/perf/bpf_perf.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPF_PERF_H
+#define __LIBPERF_BPF_PERF_H
+
+#include <linux/types.h>  /* for __u32 */
+
+/*
+ * bpf_perf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* default attr_map name */
+#define BPF_PERF_DEFAULT_ATTR_MAP_PATH "perf_attr_map"
+
+#endif /* __LIBPERF_BPF_PERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..be484ddbbd5be 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <perf/bpf_perf.h>
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -341,8 +320,8 @@ static int bperf_lock_attr_map(struct target *target)
if (target->attr_map) {
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
-   scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+   scnprintf(path, PATH_MAX, "%s/fs/bpf/%s", sysfs__mountpoint(),
+ BPF_PERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
-- 
2.30.2
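
For illustration, a minimal sketch (not part of this series) of how an external
tool could follow the protocol documented in bpf_perf.h: open the pinned
attr_map, take the flock() that serializes perf-stat sessions, and look up the
entry for a given perf_event_attr. The helper name and the pin path are
assumptions for the example; bpf_obj_get(), flock() and bpf_map_lookup_elem()
are the same calls bpf_counter.c uses.

#include <unistd.h>
#include <sys/file.h>
#include <linux/perf_event.h>
#include <bpf/bpf.h>
#include <perf/bpf_perf.h>

/* Look up the leader link/diff_map IDs registered for @attr, if any. */
static int lookup_leader(const char *pin_path, struct perf_event_attr *attr,
			 struct perf_event_attr_map_entry *entry)
{
	int fd, err;

	fd = bpf_obj_get(pin_path);	/* e.g. "/sys/fs/bpf/perf_attr_map" */
	if (fd < 0)
		return fd;

	if (flock(fd, LOCK_EX)) {	/* no concurrent access to the attr_map */
		close(fd);
		return -1;
	}

	err = bpf_map_lookup_elem(fd, attr, entry);

	flock(fd, LOCK_UN);
	close(fd);
	return err;
}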



[PATCH v3 0/4] perf util: bpf perf improvements

2021-04-16 Thread Song Liu
This patch set improves bpf_perf (perf-stat --bpf-counter) by
  1) exposing key definitions to a libperf header;
  2) adding compatibility check for perf_attr_map;
  3) introducing config stat.bpf-counter-events.
  4) introducing the 'b' modifier to the event parser.

Changes v2 => v3:
1. Add 'b' modifier. (Jiri)
2. Allow configuring stat.bpf-counter-events with any event name (instead
   of limiting to hardware events). (Jiri)

Changes v1 => v2:
1. Separate 2/3 from 1/3. (Jiri)
2. Rename bperf.h to bpf_perf.h. (Jiri)
3. Other small fixes/optimizations. (Jiri)

Song Liu (4):
  perf util: move bpf_perf definitions to a libperf header
  perf bpf: check perf_attr_map is compatible with the perf binary
  perf-stat: introduce config stat.bpf-counter-events
  perf-stat: introduce ':b' modifier

 tools/lib/perf/include/perf/bpf_perf.h | 31 
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 43 ++
 tools/perf/util/bpf_counter.c  | 50 +-
 tools/perf/util/config.c   |  4 +++
 tools/perf/util/evsel.c| 22 
 tools/perf/util/evsel.h|  9 +
 tools/perf/util/parse-events.c |  8 -
 tools/perf/util/parse-events.l |  2 +-
 tools/perf/util/target.h   |  5 ---
 10 files changed, 129 insertions(+), 47 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

--
2.30.2


Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 11:24 AM, Jiri Olsa  wrote:
> 
> On Thu, Apr 08, 2021 at 06:08:20PM +0000, Song Liu wrote:
>> 
>> 
>>> On Apr 8, 2021, at 10:45 AM, Jiri Olsa  wrote:
>>> 
>>> On Thu, Apr 08, 2021 at 05:28:10PM +, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
>>>>> 
>>>>> On Thu, Apr 08, 2021 at 04:39:33PM +, Song Liu wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>>>>>> 
>>>>>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>>>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>>>>>> --bpf-counters option. Enable "use bpf by default" events with a config
>>>>>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>>>>>> evsel__hw_names.
>>>>>>>> 
>>>>>>>> This also enables mixed BPF event and regular event in the same 
>>>>>>>> sesssion.
>>>>>>>> For example:
>>>>>>>> 
>>>>>>>> perf config stat.bpf-counter-events=instructions
>>>>>>>> perf stat -e instructions,cs
>>>>>>>> 
>>>>>>> 
>>>>>>> so if we are mixing events now, how about uing modifier for bpf 
>>>>>>> counters,
>>>>>>> instead of configuring .perfconfig list we could use:
>>>>>>> 
>>>>>>> perf stat -e instructions:b,cs
>>>>>>> 
>>>>>>> thoughts?
>>>>>>> 
>>>>>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>>>>>> feel free to use it
>>>>>> 
>>>>>> I think we will need both 'b' modifier and .perfconfig configuration. 
>>>>>> For systems with BPF-managed perf events running in the background, 
>>>>> 
>>>>> hum, I'm not sure I understand what that means.. you mean there
>>>>> are tools that run perf stat so you don't want to change them?
>>>> 
>>>> We have tools that do perf_event_open(). I will change them to use 
>>>> BPF managed perf events for "cycles" and "instructions". Since these 
>>>> tools are running 24/7, perf-stat on the system should use BPF managed
>>>> "cycles" and "instructions" by default. 
>>> 
>>> well if you are already changing the tools why not change them to add
>>> modifier.. but I don't mind adding that .perfconfig stuff if you need
>>> that
>> 
>> The tools I mentioned here don't use perf-stat, they just use 
>> perf_event_open() and read the perf events fds. We want a config to make
> 
> just curious, how those tools use perf_event_open?
> 
>> "cycles" to use BPF by default, so that when the user (not these tools)
>> runs perf-stat, it will share PMCs with those events by default. 
> 
> I'm sorry but I still don't see the usecase.. if you need to change both 
> tools,
> you can change them to use bpf-managed event, why bother with the list?
> 
>>> 
>>>> 
>>>>> 
>>>>>> .perfconfig makes sure perf-stat sessions will share PMCs with these 
>>>>>> background monitoring tools. 'b' modifier, on the other hand, is useful
>>>>>> when the user knows there is opportunity to share the PMCs. 
>>>>>> 
>>>>>> Does this make sense? 
>>>>> 
>>>>> if there's reason for that then sure.. but let's not limit that just
>>>>> on HARDWARE events only.. there are RAW events with the same demand
>>>>> for this feature.. why don't we let user define any event for this?
>>>> 
>>>> I haven't found a good way to config RAW events. I guess RAW events 
>>>> could use 'b' modifier? 
>>> any event uing the pmu notation like cpu/instructions/
>> 
>> Can we do something like "perf config stat.bpf-counter-events=cpu/*" means 
>> all "cpu/xx" events use BPF by default?
> 
> I think there's misundestanding, all I'm saying is that IIUC you check
> events stat.bpf-counter-events to be HARDWARE type, which I don't think
> is necessary and we can allow any event in there

From what I see, most of the opportunity for sharing comes from a few common
events, like cycles and instructions. The second reason is that the config
implementation is easy and straightforward. We can certainly extend the config
to other events; before that, the 'b' modifier should be good for these use
cases.

Thanks,
Song



Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 11:50 AM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Thu, Apr 08, 2021 at 08:24:47PM +0200, Jiri Olsa escreveu:
>> On Thu, Apr 08, 2021 at 06:08:20PM +, Song Liu wrote:
>>> 
>>> 
>>>> On Apr 8, 2021, at 10:45 AM, Jiri Olsa  wrote:
>>>> 
>>>> On Thu, Apr 08, 2021 at 05:28:10PM +, Song Liu wrote:
>>>>> 
>>>>> 
>>>>>> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
>>>>>> 
>>>>>> On Thu, Apr 08, 2021 at 04:39:33PM +0000, Song Liu wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>>>>>>> 
>>>>>>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>>>>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>>>>>>> --bpf-counters option. Enable "use bpf by default" events with a 
>>>>>>>>> config
>>>>>>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>>>>>>> evsel__hw_names.
>>>>>>>>> 
>>>>>>>>> This also enables mixed BPF event and regular event in the same 
>>>>>>>>> sesssion.
>>>>>>>>> For example:
>>>>>>>>> 
>>>>>>>>> perf config stat.bpf-counter-events=instructions
>>>>>>>>> perf stat -e instructions,cs
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> so if we are mixing events now, how about uing modifier for bpf 
>>>>>>>> counters,
>>>>>>>> instead of configuring .perfconfig list we could use:
>>>>>>>> 
>>>>>>>> perf stat -e instructions:b,cs
>>>>>>>> 
>>>>>>>> thoughts?
>>>>>>>> 
>>>>>>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>>>>>>> feel free to use it
>>>>>>> 
>>>>>>> I think we will need both 'b' modifier and .perfconfig configuration. 
>>>>>>> For systems with BPF-managed perf events running in the background, 
>>>>>> 
>>>>>> hum, I'm not sure I understand what that means.. you mean there
>>>>>> are tools that run perf stat so you don't want to change them?
>>>>> 
>>>>> We have tools that do perf_event_open(). I will change them to use 
>>>>> BPF managed perf events for "cycles" and "instructions". Since these 
>>>>> tools are running 24/7, perf-stat on the system should use BPF managed
>>>>> "cycles" and "instructions" by default. 
>>>> 
>>>> well if you are already changing the tools why not change them to add
>>>> modifier.. but I don't mind adding that .perfconfig stuff if you need
>>>> that
>>> 
>>> The tools I mentioned here don't use perf-stat, they just use 
>>> perf_event_open() and read the perf events fds. We want a config to make
>> 
>> just curious, how those tools use perf_event_open?
> 
> I.e. do they use tools/lib/perf/? :-)

Not right now. I do hope we can eventually let them use libperf. But I 
haven't figured out the best path forward. 

> 
> I guess they will use it now for getting that "struct 
> perf_event_attr_map_entry" and
> the map name define.
> 
>>> "cycles" to use BPF by default, so that when the user (not these tools)
>>> runs perf-stat, it will share PMCs with those events by default. 
> 
>> I'm sorry but I still don't see the usecase.. if you need to change both 
>> tools,
>> you can change them to use bpf-managed event, why bother with the list?
> 
> He wants users not to bother if they are using bpf based counters, this will 
> happen
> automagically after they set their ~/.perfconfig with some command line Song 
> provides.
> 
> Then they will be using bpf counters that won't get exclusive access to those
> scarce counters, the tooling they are using will use bpf-counters and all will
> be well.
> 
> Right Song?

Yes, exactly. The config automatically switches ad-hoc perf-stat runs (for
debug, performance tuning, etc.) to bpf managed counters.

Thanks,
Song
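
To make that concrete, the intended flow is the two command lines already shown
in the patches, next to the per-event modifier discussed in this thread (event
names are only examples):

   # one-off setup: ad-hoc "instructions" counting becomes BPF-managed by default
   perf config stat.bpf-counter-events=instructions

   # later perf-stat runs then share the PMC with the 24/7 monitoring tools
   perf stat -e instructions,cs

   # alternatively, opt in per event with the proposed 'b' modifier
   perf stat -e instructions:b,cs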



Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 10:45 AM, Jiri Olsa  wrote:
> 
> On Thu, Apr 08, 2021 at 05:28:10PM +0000, Song Liu wrote:
>> 
>> 
>>> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
>>> 
>>> On Thu, Apr 08, 2021 at 04:39:33PM +, Song Liu wrote:
>>>> 
>>>> 
>>>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>>>> 
>>>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>>>> --bpf-counters option. Enable "use bpf by default" events with a config
>>>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>>>> evsel__hw_names.
>>>>>> 
>>>>>> This also enables mixed BPF event and regular event in the same sesssion.
>>>>>> For example:
>>>>>> 
>>>>>> perf config stat.bpf-counter-events=instructions
>>>>>> perf stat -e instructions,cs
>>>>>> 
>>>>> 
>>>>> so if we are mixing events now, how about uing modifier for bpf counters,
>>>>> instead of configuring .perfconfig list we could use:
>>>>> 
>>>>> perf stat -e instructions:b,cs
>>>>> 
>>>>> thoughts?
>>>>> 
>>>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>>>> feel free to use it
>>>> 
>>>> I think we will need both 'b' modifier and .perfconfig configuration. 
>>>> For systems with BPF-managed perf events running in the background, 
>>> 
>>> hum, I'm not sure I understand what that means.. you mean there
>>> are tools that run perf stat so you don't want to change them?
>> 
>> We have tools that do perf_event_open(). I will change them to use 
>> BPF managed perf events for "cycles" and "instructions". Since these 
>> tools are running 24/7, perf-stat on the system should use BPF managed
>> "cycles" and "instructions" by default. 
> 
> well if you are already changing the tools why not change them to add
> modifier.. but I don't mind adding that .perfconfig stuff if you need
> that

The tools I mentioned here don't use perf-stat, they just use 
perf_event_open() and read the perf events fds. We want a config to make
"cycles" to use BPF by default, so that when the user (not these tools)
runs perf-stat, it will share PMCs with those events by default. 

> 
>> 
>>> 
>>>> .perfconfig makes sure perf-stat sessions will share PMCs with these 
>>>> background monitoring tools. 'b' modifier, on the other hand, is useful
>>>> when the user knows there is opportunity to share the PMCs. 
>>>> 
>>>> Does this make sense? 
>>> 
>>> if there's reason for that then sure.. but let's not limit that just
>>> on HARDWARE events only.. there are RAW events with the same demand
>>> for this feature.. why don't we let user define any event for this?
>> 
>> I haven't found a good way to config RAW events. I guess RAW events 
>> could use 'b' modifier? 
> any event uing the pmu notation like cpu/instructions/

Can we do something like "perf config stat.bpf-counter-events=cpu/*" to mean
that all "cpu/xx" events use BPF by default?

Thanks,
Song

> 
> we can allow any event to be BPF-managed, right? IIUC we don't care,
> the code will work with any event

Yes, the code works with any event. 
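
If that restriction is dropped, a pmu-style event would presumably take the new
modifier the same way the existing ones do, appended after the closing slash
(illustrative only, assuming the 'b' modifier patch is applied):

   perf stat -e cpu/instructions/b,cs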

Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 10:20 AM, Jiri Olsa  wrote:
> 
> On Thu, Apr 08, 2021 at 04:39:33PM +0000, Song Liu wrote:
>> 
>> 
>>> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
>>> 
>>> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>>>> Currently, to use BPF to aggregate perf event counters, the user uses
>>>> --bpf-counters option. Enable "use bpf by default" events with a config
>>>> option, stat.bpf-counter-events. This is limited to hardware events in
>>>> evsel__hw_names.
>>>> 
>>>> This also enables mixed BPF event and regular event in the same sesssion.
>>>> For example:
>>>> 
>>>>  perf config stat.bpf-counter-events=instructions
>>>>  perf stat -e instructions,cs
>>>> 
>>> 
>>> so if we are mixing events now, how about uing modifier for bpf counters,
>>> instead of configuring .perfconfig list we could use:
>>> 
>>> perf stat -e instructions:b,cs
>>> 
>>> thoughts?
>>> 
>>> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
>>> feel free to use it
>> 
>> I think we will need both 'b' modifier and .perfconfig configuration. 
>> For systems with BPF-managed perf events running in the background, 
> 
> hum, I'm not sure I understand what that means.. you mean there
> are tools that run perf stat so you don't want to change them?

We have tools that do perf_event_open(). I will change them to use 
BPF managed perf events for "cycles" and "instructions". Since these 
tools are running 24/7, perf-stat on the system should use BPF managed
"cycles" and "instructions" by default. 

> 
>> .perfconfig makes sure perf-stat sessions will share PMCs with these 
>> background monitoring tools. 'b' modifier, on the other hand, is useful
>> when the user knows there is opportunity to share the PMCs. 
>> 
>> Does this make sense? 
> 
> if there's reason for that then sure.. but let's not limit that just
> on HARDWARE events only.. there are RAW events with the same demand
> for this feature.. why don't we let user define any event for this?

I haven't found a good way to configure RAW events. I guess RAW events
could use the 'b' modifier?

Thanks,
Song

Re: [PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-08 Thread Song Liu



> On Apr 8, 2021, at 4:47 AM, Jiri Olsa  wrote:
> 
> On Tue, Apr 06, 2021 at 05:36:01PM -0700, Song Liu wrote:
>> Currently, to use BPF to aggregate perf event counters, the user uses
>> --bpf-counters option. Enable "use bpf by default" events with a config
>> option, stat.bpf-counter-events. This is limited to hardware events in
>> evsel__hw_names.
>> 
>> This also enables mixed BPF event and regular event in the same sesssion.
>> For example:
>> 
>>   perf config stat.bpf-counter-events=instructions
>>   perf stat -e instructions,cs
>> 
> 
> so if we are mixing events now, how about uing modifier for bpf counters,
> instead of configuring .perfconfig list we could use:
> 
>  perf stat -e instructions:b,cs
> 
> thoughts?
> 
> the change below adds 'b' modifier and sets 'evsel::bpf_counter',
> feel free to use it

I think we will need both 'b' modifier and .perfconfig configuration. 
For systems with BPF-managed perf events running in the background, 
.perfconfig makes sure perf-stat sessions will share PMCs with these 
background monitoring tools. 'b' modifier, on the other hand, is useful
when the user knows there is opportunity to share the PMCs. 

Does this make sense? 

Thanks,
Song

> 
> jirka
> 
> 
> ---
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index ca52581f1b17..c55e4e58d1dc 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -82,6 +82,7 @@ struct evsel {
>   boolauto_merge_stats;
>   boolcollect_stat;
>   boolweak_group;
> + boolbpf_counter;
>   int bpf_fd;
>   struct bpf_object   *bpf_obj;
>   };
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 9ecb45bea948..b5850f1ea90b 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -1801,6 +1801,7 @@ struct event_modifier {
>   int pinned;
>   int weak;
>   int exclusive;
> + int bpf_counter;
> };
> 
> static int get_event_modifier(struct event_modifier *mod, char *str,
> @@ -1821,6 +1822,7 @@ static int get_event_modifier(struct event_modifier 
> *mod, char *str,
>   int exclude = eu | ek | eh;
>   int exclude_GH = evsel ? evsel->exclude_GH : 0;
>   int weak = 0;
> + int bpf_counter = 0;
> 
>   memset(mod, 0, sizeof(*mod));
> 
> @@ -1864,6 +1866,8 @@ static int get_event_modifier(struct event_modifier 
> *mod, char *str,
>   exclusive = 1;
>   } else if (*str == 'W') {
>   weak = 1;
> + } else if (*str == 'b') {
> + bpf_counter = 1;
>   } else
>   break;
> 
> @@ -1895,6 +1899,7 @@ static int get_event_modifier(struct event_modifier 
> *mod, char *str,
>   mod->sample_read = sample_read;
>   mod->pinned = pinned;
>   mod->weak = weak;
> + mod->bpf_counter = bpf_counter;
>   mod->exclusive = exclusive;
> 
>   return 0;
> @@ -1909,7 +1914,7 @@ static int check_modifier(char *str)
>   char *p = str;
> 
>   /* The sizeof includes 0 byte as well. */
> - if (strlen(str) > (sizeof("ukhGHpppPSDIWe") - 1))
> + if (strlen(str) > (sizeof("ukhGHpppPSDIWeb") - 1))
>   return -1;
> 
>   while (*p) {
> @@ -1950,6 +1955,7 @@ int parse_events__modifier_event(struct list_head 
> *list, char *str, bool add)
>   evsel->sample_read = mod.sample_read;
>   evsel->precise_max = mod.precise_max;
>   evsel->weak_group  = mod.weak;
> + evsel->bpf_counter = mod.bpf_counter;
> 
>   if (evsel__is_group_leader(evsel)) {
>   evsel->core.attr.pinned = mod.pinned;
> diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
> index 0b36285a9435..fb8646cc3e83 100644
> --- a/tools/perf/util/parse-events.l
> +++ b/tools/perf/util/parse-events.l
> @@ -210,7 +210,7 @@ name_tag  
> [\'][a-zA-Z_*?\[\]][a-zA-Z0-9_*?\-,\.\[\]:=]*[\']
> name_minus[a-zA-Z_*?][a-zA-Z0-9\-_*?.:]*
> drv_cfg_term  [a-zA-Z0-9_\.]+(=[a-zA-Z0-9_*?\.:]+)?
> /* If you add a modifier you need to update check_modifier() */
> -modifier_event   [ukhpPGHSDIWe]+
> +modifier_event   [ukhpPGHSDIWeb]+
> modifier_bp   [rwx]{1,3}
> 
> %%
> 



[PATCH v2 3/3] perf-stat: introduce config stat.bpf-counter-events

2021-04-06 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. This is limited to hardware events in
evsel__hw_names.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 43 +-
 tools/perf/util/bpf_counter.c  | 11 +++
 tools/perf/util/config.c   | 32 +++
 tools/perf/util/evsel.c|  2 ++
 tools/perf/util/evsel.h|  6 
 tools/perf/util/target.h   |  5 ---
 7 files changed, 81 insertions(+), 20 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 744211fa8c186..6d4733eaac170 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common hardware events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 4bb48c6b66980..7c26e627db0ef 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -413,6 +413,8 @@ static int read_affinity_counters(struct timespec *rs)
evlist__for_each_entry(evsel_list, counter) {
if (evsel__cpu_iter_skip(counter, cpu))
continue;
+   if (evsel__is_bpf(counter))
+   continue;
if (!counter->err) {
counter->err = read_counter_cpu(counter, rs,

counter->cpu_iter - 1);
@@ -423,17 +425,28 @@ static int read_affinity_counters(struct timespec *rs)
return 0;
 }
 
+/*
+ * Returns:
+ * 0   if all events use BPF;
+ * 1   if some events do NOT use BPF;
+ * < 0 on errors;
+ */
 static int read_bpf_map_counters(void)
 {
+   bool has_none_bpf_events = false;
struct evsel *counter;
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!evsel__is_bpf(counter)) {
+   has_none_bpf_events = true;
+   continue;
+   }
err = bpf_counter__read(counter);
if (err)
return err;
}
-   return 0;
+   return has_none_bpf_events ? 1 : 0;
 }
 
 static void read_counters(struct timespec *rs)
@@ -442,9 +455,10 @@ static void read_counters(struct timespec *rs)
int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf(&target))
-   err = read_bpf_map_counters();
-   else
+   err = read_bpf_map_counters();
+   if (err < 0)
+   return;
+   if (err)
err = read_affinity_counters(rs);
if (err < 0)
return;
@@ -535,12 +549,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel__is_bpf(evsel))
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +799,9 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup(&affinity) < 0)
return -1;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, &target))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, &target))
+   return -1;
}
 
evlist__for_each_cpu (evsel_list, i, cpu

[PATCH v2 1/3] perf util: move bpf_perf definitions to a libperf header

2021-04-06 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPF_PERF_DEFAULT_ATTR_MAP_PATH to
bpf_perf.h for other tools to use.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/util/bpf_counter.c  | 27 +++---
 2 files changed, 34 insertions(+), 24 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

diff --git a/tools/lib/perf/include/perf/bpf_perf.h 
b/tools/lib/perf/include/perf/bpf_perf.h
new file mode 100644
index 0..e7cf6ba7b674b
--- /dev/null
+++ b/tools/lib/perf/include/perf/bpf_perf.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPF_PERF_H
+#define __LIBPERF_BPF_PERF_H
+
+#include <linux/types.h>  /* for __u32 */
+
+/*
+ * bpf_perf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* default attr_map name */
+#define BPF_PERF_DEFAULT_ATTR_MAP_PATH "perf_attr_map"
+
+#endif /* __LIBPERF_BPF_PERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..be484ddbbd5be 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <perf/bpf_perf.h>
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -341,8 +320,8 @@ static int bperf_lock_attr_map(struct target *target)
if (target->attr_map) {
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
-   scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+   scnprintf(path, PATH_MAX, "%s/fs/bpf/%s", sysfs__mountpoint(),
+ BPF_PERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
-- 
2.30.2



[PATCH v2 2/3] perf bpf: check perf_attr_map is compatible with the perf binary

2021-04-06 Thread Song Liu
perf_attr_map could be shared among different versions of the perf binary. Add
bperf_attr_map_compatible() to check whether the existing attr_map is
compatible with the current perf binary.

Signed-off-by: Song Liu 
---
 tools/perf/util/bpf_counter.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index be484ddbbd5be..5de991ab46af9 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -312,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, &map_info, &map_info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -346,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



[PATCH v2 0/3] perf util: bpf perf improvements

2021-04-06 Thread Song Liu
This patch set improves bpf_perf (perf-stat --bpf-counter) by
  1) exposing key definitions to a libperf header;
  2) adding compatibility check for perf_attr_map;
  3) introducing config stat.bpf-counter-events.

Changes v1 => v2:
1. Separate 2/3 from 1/3. (Jiri)
2. Rename bperf.h to bpf_perf.h. (Jiri)
3. Other small fixes/optimizations. (Jiri)

Song Liu (3):
  perf util: move bpf_perf definitions to a libperf header
  perf bpf: check perf_attr_map is compatible with the perf binary
  perf-stat: introduce config stat.bpf-counter-events

 tools/lib/perf/include/perf/bpf_perf.h | 31 ++
 tools/perf/Documentation/perf-stat.txt |  2 +
 tools/perf/builtin-stat.c  | 43 ---
 tools/perf/util/bpf_counter.c  | 57 +++---
 tools/perf/util/config.c   | 32 +++
 tools/perf/util/evsel.c|  2 +
 tools/perf/util/evsel.h|  6 +++
 tools/perf/util/target.h   |  5 ---
 8 files changed, 134 insertions(+), 44 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bpf_perf.h

--
2.30.2


Re: [PATCH 2/2] perf-stat: introduce config stat.bpf-counter-events

2021-04-06 Thread Song Liu



> On Apr 6, 2021, at 7:21 AM, Jiri Olsa  wrote:
> 
> On Fri, Apr 02, 2021 at 05:29:38PM -0700, Song Liu wrote:
>> Currently, to use BPF to aggregate perf event counters, the user uses
>> --bpf-counters option. Enable "use bpf by default" events with a config
>> option, stat.bpf-counter-events. This is limited to hardware events in
>> evsel__hw_names.
>> 
>> This also enables mixed BPF event and regular event in the same sesssion.
>> For example:
>> 
>>   perf config stat.bpf-counter-events=instructions
>>   perf stat -e instructions,cs
> 
> hum, so this will effectively allow to mix 'bpf-shared' counters
> with normals ones.. I don't think we're ready for that ;-)

I think we are ready. :) all bpf_counter stuff is within evsel, so mixing 
them doesn't need much work. 

> 
>> 
>> The second command will use BPF for "instructions" but not "cs".
>> 
>> Signed-off-by: Song Liu 
>> ---
>> tools/perf/Documentation/perf-stat.txt |  2 ++
>> tools/perf/builtin-stat.c  | 41 --
>> tools/perf/util/bpf_counter.c  | 11 +++
>> tools/perf/util/config.c   | 32 
>> tools/perf/util/evsel.c|  2 ++
>> tools/perf/util/evsel.h|  1 +
>> tools/perf/util/target.h   |  5 
>> 7 files changed, 74 insertions(+), 20 deletions(-)
>> 
>> diff --git a/tools/perf/Documentation/perf-stat.txt 
>> b/tools/perf/Documentation/perf-stat.txt
>> index 744211fa8c186..6d4733eaac170 100644
>> --- a/tools/perf/Documentation/perf-stat.txt
>> +++ b/tools/perf/Documentation/perf-stat.txt
>> @@ -97,6 +97,8 @@ report::
>>  Use BPF programs to aggregate readings from perf_events.  This
>>  allows multiple perf-stat sessions that are counting the same metric 
>> (cycles,
>>  instructions, etc.) to share hardware counters.
>> +To use BPF programs on common hardware events by default, use
>> +"perf config stat.bpf-counter-events=".
>> 
>> --bpf-attr-map::
>>  With option "--bpf-counters", different perf-stat sessions share
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 4bb48c6b66980..5adfa708ffe68 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -423,17 +423,28 @@ static int read_affinity_counters(struct timespec *rs)
>>  return 0;
>> }
>> 
>> +/*
>> + * Returns:
>> + * 0   if all events use BPF;
>> + * 1   if some events do NOT use BPF;
>> + * < 0 on errors;
>> + */
>> static int read_bpf_map_counters(void)
>> {
>> +bool has_none_bpf_events = false;
>>  struct evsel *counter;
>>  int err;
>> 
>>  evlist__for_each_entry(evsel_list, counter) {
>> +if (!counter->bpf_counter_ops) {
>> +has_none_bpf_events = true;
>> +continue;
>> +}
>>  err = bpf_counter__read(counter);
>>  if (err)
>>  return err;
>>  }
>> -return 0;
>> +return has_none_bpf_events ? 1 : 0;
>> }
>> 
>> static void read_counters(struct timespec *rs)
>> @@ -442,9 +453,10 @@ static void read_counters(struct timespec *rs)
>>  int err;
>> 
>>  if (!stat_config.stop_read_counter) {
>> -if (target__has_bpf(&target))
>> -err = read_bpf_map_counters();
>> -else
>> +err = read_bpf_map_counters();
>> +if (err < 0)
>> +return;
>> +if (err)
>>  err = read_affinity_counters(rs);
> 
> so read_affinity_counters will read also 'bpf-shared' counters no?
> as long as it was separated, I did not see a problem, now we have
> counters that either have bpf ops set or have not
> 
> it'd be great to do some generic separation.. I was thinking to move
> bpf_counter_ops into some generic counter ops and we would just fill
> in the proper ops for the counter.. buuut the affinity readings are
> not compatible with what we are doing in bperf_read and the profiler
> bpf read
> 
> so I think the solution will be just to skip those events in
> read_affinity_counters and all the other code, and have some
> helper like:
> 
>   bool evsel__is_bpf(evsel)
> 
> so it's clear why it's skipped

Yes, this will be better! Current version does have the problem of 
extra read in read_affinity_counters(). Will fix this. 

Thanks,
Song
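
For reference, a minimal sketch of the helper suggested above; the name is
Jiri's, the body and its placement (evsel.h) are assumptions based on how
bpf_counter_ops is already used in this series:

/* true if this counter is read through bpf_counter ops rather than read() */
static inline bool evsel__is_bpf(struct evsel *evsel)
{
	return evsel->bpf_counter_ops != NULL;
}

Callers such as read_affinity_counters() and enable_counters() can then skip
or select counters with a plain "if (evsel__is_bpf(counter))" test, which is
what the v3 patches earlier in this archive end up doing.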

Re: [PATCH 1/2] perf util: move bperf definitions to a libperf header

2021-04-06 Thread Song Liu



> On Apr 6, 2021, at 7:21 AM, Jiri Olsa  wrote:
> 
> On Fri, Apr 02, 2021 at 05:29:37PM -0700, Song Liu wrote:
>> By following the same protocol, other tools can share hardware PMCs with
>> perf. Move perf_event_attr_map_entry and BPERF_DEFAULT_ATTR_MAP_PATH to
>> bperf.h for other tools to use.
> 
> hi,
> so is this necessary for some other tool now?

We have monitoring tools that do perf_event_open(). I would like to migrate
these to bperf.

> 
>> 
>> Also add bperf_attr_map_compatible() to check whether existing attr_map
>> is compatible with current perf binary.
> 
> please separate this change

Will do. 

> 
>> 
>> Signed-off-by: Song Liu 
>> ---
>> tools/lib/perf/include/perf/bperf.h | 29 +++
>> tools/perf/util/bpf_counter.c   | 44 ++---
>> 2 files changed, 50 insertions(+), 23 deletions(-)
>> create mode 100644 tools/lib/perf/include/perf/bperf.h
>> 
>> diff --git a/tools/lib/perf/include/perf/bperf.h 
>> b/tools/lib/perf/include/perf/bperf.h
>> new file mode 100644
>> index 0..02b2fd5e50c75
>> --- /dev/null
>> +++ b/tools/lib/perf/include/perf/bperf.h
> 
> I wonder we want to call it bpf_perf.h to be more generic?
> or best just bpf.h ... but that might give us some conflict
> headache in future ;-)

I would rather avoid bpf.h... I am ok with bpf_perf.h or bperf.h

> 
>> @@ -0,0 +1,29 @@
>> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
>> +#ifndef __LIBPERF_BPERF_H
>> +#define __LIBPERF_BPERF_H
>> +
>> +/*
>> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
>> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
>> + * no concurrent access to the attr_map.  The key of attr_map is struct
>> + * perf_event_attr, and the value is struct perf_event_attr_map_entry.
>> + *
>> + * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
>> + * leader prog, and the diff_map. Each perf-stat session holds a reference
>> + * to the bpf_link to make sure the leader prog is attached to sched_switch
>> + * tracepoint.
>> + *
>> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
>> + * does not hold any references to the leader program. Once all perf-stat
>> + * sessions of these events exit, the leader prog, its maps, and the
>> + * perf_events will be freed.
>> + */
>> +struct perf_event_attr_map_entry {
>> +__u32 link_id;
>> +__u32 diff_map_id;
>> +};
> 
> this header file should be self contained,
> so you need __u32 definitions

Will add. 

> 
> 
>> +
>> +/* pin the map at sysfs__mountpoint()/BPERF_DEFAULT_ATTR_MAP_PATH */
>> +#define BPERF_DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
> 
> if we are going to expose this, I think we should expose just
> "perf_attr_map" ... without the 'fs/bpf' part, because that
> could be mounted anywhere

Will fix this. 

Thanks,
Song



[PATCH 2/2] perf-stat: introduce config stat.bpf-counter-events

2021-04-02 Thread Song Liu
Currently, to use BPF to aggregate perf event counters, the user uses
--bpf-counters option. Enable "use bpf by default" events with a config
option, stat.bpf-counter-events. This is limited to hardware events in
evsel__hw_names.

This also enables mixing BPF events and regular events in the same session.
For example:

   perf config stat.bpf-counter-events=instructions
   perf stat -e instructions,cs

The second command will use BPF for "instructions" but not "cs".

Signed-off-by: Song Liu 
---
 tools/perf/Documentation/perf-stat.txt |  2 ++
 tools/perf/builtin-stat.c  | 41 --
 tools/perf/util/bpf_counter.c  | 11 +++
 tools/perf/util/config.c   | 32 
 tools/perf/util/evsel.c|  2 ++
 tools/perf/util/evsel.h|  1 +
 tools/perf/util/target.h   |  5 
 7 files changed, 74 insertions(+), 20 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 744211fa8c186..6d4733eaac170 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -97,6 +97,8 @@ report::
Use BPF programs to aggregate readings from perf_events.  This
allows multiple perf-stat sessions that are counting the same metric 
(cycles,
instructions, etc.) to share hardware counters.
+   To use BPF programs on common hardware events by default, use
+   "perf config stat.bpf-counter-events=".
 
 --bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 4bb48c6b66980..5adfa708ffe68 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -423,17 +423,28 @@ static int read_affinity_counters(struct timespec *rs)
return 0;
 }
 
+/*
+ * Returns:
+ * 0   if all events use BPF;
+ * 1   if some events do NOT use BPF;
+ * < 0 on errors;
+ */
 static int read_bpf_map_counters(void)
 {
+   bool has_none_bpf_events = false;
struct evsel *counter;
int err;
 
evlist__for_each_entry(evsel_list, counter) {
+   if (!counter->bpf_counter_ops) {
+   has_none_bpf_events = true;
+   continue;
+   }
err = bpf_counter__read(counter);
if (err)
return err;
}
-   return 0;
+   return has_none_bpf_events ? 1 : 0;
 }
 
 static void read_counters(struct timespec *rs)
@@ -442,9 +453,10 @@ static void read_counters(struct timespec *rs)
int err;
 
if (!stat_config.stop_read_counter) {
-   if (target__has_bpf(&target))
-   err = read_bpf_map_counters();
-   else
+   err = read_bpf_map_counters();
+   if (err < 0)
+   return;
+   if (err)
err = read_affinity_counters(rs);
if (err < 0)
return;
@@ -535,12 +547,13 @@ static int enable_counters(void)
struct evsel *evsel;
int err;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, evsel) {
-   err = bpf_counter__enable(evsel);
-   if (err)
-   return err;
-   }
+   evlist__for_each_entry(evsel_list, evsel) {
+   if (!evsel->bpf_counter_ops)
+   continue;
+
+   err = bpf_counter__enable(evsel);
+   if (err)
+   return err;
}
 
if (stat_config.initial_delay < 0) {
@@ -784,11 +797,9 @@ static int __run_perf_stat(int argc, const char **argv, 
int run_idx)
if (affinity__setup(&affinity) < 0)
return -1;
 
-   if (target__has_bpf(&target)) {
-   evlist__for_each_entry(evsel_list, counter) {
-   if (bpf_counter__load(counter, &target))
-   return -1;
-   }
+   evlist__for_each_entry(evsel_list, counter) {
+   if (bpf_counter__load(counter, &target))
+   return -1;
}
 
evlist__for_each_cpu (evsel_list, i, cpu) {
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index df70c8dcf7850..ea45914af1693 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -792,6 +792,17 @@ int bpf_counter__load(struct evsel *evsel, struct target 
*target)
evsel->bpf_counter_ops = &bpf_program_profiler_ops;
else if (target->use_bpf)
evsel->bpf_counter_ops = &bperf_ops;
+   else {
+   int i;
+
+   for (i = 0; i < PERF_COUNT_HW_MAX

[PATCH 1/2] perf util: move bperf definitions to a libperf header

2021-04-02 Thread Song Liu
By following the same protocol, other tools can share hardware PMCs with
perf. Move perf_event_attr_map_entry and BPERF_DEFAULT_ATTR_MAP_PATH to
bperf.h for other tools to use.

Also add bperf_attr_map_compatible() to check whether existing attr_map
is compatible with current perf binary.

Signed-off-by: Song Liu 
---
 tools/lib/perf/include/perf/bperf.h | 29 +++
 tools/perf/util/bpf_counter.c   | 44 ++---
 2 files changed, 50 insertions(+), 23 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/bperf.h

diff --git a/tools/lib/perf/include/perf/bperf.h 
b/tools/lib/perf/include/perf/bperf.h
new file mode 100644
index 0..02b2fd5e50c75
--- /dev/null
+++ b/tools/lib/perf/include/perf/bperf.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __LIBPERF_BPERF_H
+#define __LIBPERF_BPERF_H
+
+/*
+ * bperf uses a hashmap, the attr_map, to track all the leader programs.
+ * The hashmap is pinned in bpffs. flock() on this file is used to ensure
+ * no concurrent access to the attr_map.  The key of attr_map is struct
+ * perf_event_attr, and the value is struct perf_event_attr_map_entry.
+ *
+ * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
+ * leader prog, and the diff_map. Each perf-stat session holds a reference
+ * to the bpf_link to make sure the leader prog is attached to sched_switch
+ * tracepoint.
+ *
+ * Since the hashmap only contains IDs of the bpf_link and diff_map, it
+ * does not hold any references to the leader program. Once all perf-stat
+ * sessions of these events exit, the leader prog, its maps, and the
+ * perf_events will be freed.
+ */
+struct perf_event_attr_map_entry {
+   __u32 link_id;
+   __u32 diff_map_id;
+};
+
+/* pin the map at sysfs__mountpoint()/BPERF_DEFAULT_ATTR_MAP_PATH */
+#define BPERF_DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
+
+#endif /* __LIBPERF_BPERF_H */
diff --git a/tools/perf/util/bpf_counter.c b/tools/perf/util/bpf_counter.c
index 81d1df3c4ec0e..df70c8dcf7850 100644
--- a/tools/perf/util/bpf_counter.c
+++ b/tools/perf/util/bpf_counter.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <perf/bperf.h>
 
 #include "bpf_counter.h"
 #include "counts.h"
@@ -29,28 +30,6 @@
 #include "bpf_skel/bperf_leader.skel.h"
 #include "bpf_skel/bperf_follower.skel.h"
 
-/*
- * bperf uses a hashmap, the attr_map, to track all the leader programs.
- * The hashmap is pinned in bpffs. flock() on this file is used to ensure
- * no concurrent access to the attr_map.  The key of attr_map is struct
- * perf_event_attr, and the value is struct perf_event_attr_map_entry.
- *
- * struct perf_event_attr_map_entry contains two __u32 IDs, bpf_link of the
- * leader prog, and the diff_map. Each perf-stat session holds a reference
- * to the bpf_link to make sure the leader prog is attached to sched_switch
- * tracepoint.
- *
- * Since the hashmap only contains IDs of the bpf_link and diff_map, it
- * does not hold any references to the leader program. Once all perf-stat
- * sessions of these events exit, the leader prog, its maps, and the
- * perf_events will be freed.
- */
-struct perf_event_attr_map_entry {
-   __u32 link_id;
-   __u32 diff_map_id;
-};
-
-#define DEFAULT_ATTR_MAP_PATH "fs/bpf/perf_attr_map"
 #define ATTR_MAP_SIZE 16
 
 static inline void *u64_to_ptr(__u64 ptr)
@@ -333,6 +312,20 @@ static __u32 bpf_map_get_id(int fd)
return map_info.id;
 }
 
+static bool bperf_attr_map_compatible(int attr_map_fd)
+{
+   struct bpf_map_info map_info = {0};
+   __u32 map_info_len = sizeof(map_info);
+   int err;
+
+   err = bpf_obj_get_info_by_fd(attr_map_fd, &map_info, &map_info_len);
+
+   if (err)
+   return false;
+   return (map_info.key_size == sizeof(struct perf_event_attr)) &&
+   (map_info.value_size == sizeof(struct 
perf_event_attr_map_entry));
+}
+
 static int bperf_lock_attr_map(struct target *target)
 {
char path[PATH_MAX];
@@ -342,7 +335,7 @@ static int bperf_lock_attr_map(struct target *target)
scnprintf(path, PATH_MAX, "%s", target->attr_map);
} else {
scnprintf(path, PATH_MAX, "%s/%s", sysfs__mountpoint(),
- DEFAULT_ATTR_MAP_PATH);
+ BPERF_DEFAULT_ATTR_MAP_PATH);
}
 
if (access(path, F_OK)) {
@@ -367,6 +360,11 @@ static int bperf_lock_attr_map(struct target *target)
return -1;
}
 
+   if (!bperf_attr_map_compatible(map_fd)) {
+   close(map_fd);
+   return -1;
+
+   }
err = flock(map_fd, LOCK_EX);
if (err) {
close(map_fd);
-- 
2.30.2



Re: [PATCH -next] libbpf: remove redundant semi-colon

2021-04-01 Thread Song Liu
On Thu, Apr 1, 2021 at 10:58 AM Yang Yingliang  wrote:
>
> Signed-off-by: Yang Yingliang 

Please add a short commit log.

Thanks,
Song

> ---
>  tools/lib/bpf/linker.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/linker.c b/tools/lib/bpf/linker.c
> index 46b16cbdcda3..4e08bc07e635 100644
> --- a/tools/lib/bpf/linker.c
> +++ b/tools/lib/bpf/linker.c
> @@ -1895,7 +1895,7 @@ static int finalize_btf_ext(struct bpf_linker *linker)
> hdr->func_info_len = funcs_sz;
> hdr->line_info_off = funcs_sz;
> hdr->line_info_len = lines_sz;
> -   hdr->core_relo_off = funcs_sz + lines_sz;;
> +   hdr->core_relo_off = funcs_sz + lines_sz;
> hdr->core_relo_len = core_relos_sz;
>
> if (funcs_sz) {
> --
> 2.25.1
>


Re: [PATCH] linux/bpf.h: Remove repeated struct declaration

2021-04-01 Thread Song Liu
On Thu, Apr 1, 2021 at 12:22 AM Wan Jiabing  wrote:
>
> struct btf_type is declared twice. One is declared at 35th line.
> The below one is not needed. Remove the duplicate.
>
> Signed-off-by: Wan Jiabing 

Acked-by: Song Liu 


Re: [PATCH] linux/bpf-cgroup.h: Delete repeated struct declaration

2021-03-31 Thread Song Liu



> On Mar 31, 2021, at 11:46 PM, Wan Jiabing  wrote:
> 
> struct bpf_prog is declared twice. There is one declaration
> which is independent on the MACRO at 18th line.
> So the below one is not needed though. Remove the duplicate.
> 
> Signed-off-by: Wan Jiabing 

Acked-by: Song Liu 

> ---
> include/linux/bpf-cgroup.h | 1 -
> 1 file changed, 1 deletion(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index c42e02b4d84b..57b4d4b980e7 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -418,7 +418,6 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr,
> union bpf_attr __user *uattr);
> #else
> 
> -struct bpf_prog;
> struct cgroup_bpf {};
> static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
> static inline void cgroup_bpf_offline(struct cgroup *cgrp) {}
> -- 
> 2.25.1
> 



Re: [PATCH 1/2] perf/core: Share an event with multiple cgroups

2021-03-31 Thread Song Liu



> On Mar 30, 2021, at 8:11 AM, Namhyung Kim  wrote:
> 
> On Tue, Mar 30, 2021 at 3:33 PM Song Liu  wrote:
>>> On Mar 29, 2021, at 4:33 AM, Namhyung Kim  wrote:
>>> 
>>> On Mon, Mar 29, 2021 at 2:17 AM Song Liu  wrote:
>>>>> On Mar 23, 2021, at 9:21 AM, Namhyung Kim  wrote:
>>>>> 
>>>>> As we can run many jobs (in container) on a big machine, we want to
>>>>> measure each job's performance during the run.  To do that, the
>>>>> perf_event can be associated to a cgroup to measure it only.
>>>>> 
>> 
>> [...]
>> 
>>>>> + return 0;
>>>>> +}
>>>> 
>>>> Could you please explain why we need this logic in can_attach?
>>> 
>>> IIUC the ss->attach() is called after a task's cgroup membership
>>> is changed.  But we want to collect the performance numbers for
>>> the old cgroup just before the change.  As the logic merely checks
>>> the current task's cgroup, it should be done in the can_attach()
>>> which is called before the cgroup change.
>> 
>> Thanks for the explanations.
>> 
>> Overall, I really like the core idea, especially that the overhead on
>> context switch is bounded (by the depth of cgroup tree).
> 
> Thanks!
> 
>> 
>> Is it possible to make PERF_EVENT_IOC_ATTACH_CGROUP more flexible?
>> Specifically, if we can have
>> 
>>  PERF_EVENT_IOC_ADD_CGROUP add a cgroup to the list
>>  PERF_EVENT_IOC_DEL_CGROUP delete a cgroup from the list
>> 
>> we can probably share these events among multiple processes, and
>> these processes don't need to know others' cgroup list. I think
>> this will be useful for users to build customized monitoring in
>> its own container.
>> 
>> Does this make sense?
> 
> Maybe we can add ADD/DEL interface for more flexible monitoring
> but I'm not sure which use cases it'll be used actually.
> 
> For your multi-process sharing case, the original events' file
> descriptors should be shared first.  

Yes, we will need some other work to make the ADD/DEL interface 
work properly. Let's worry about that later. 

For both patches:

Acked-by: Song Liu 

Thanks,
Song

[...]


Re: [syzbot] possible deadlock in register_for_each_vma

2021-03-31 Thread Song Liu



> On Mar 31, 2021, at 9:59 AM, Oleg Nesterov  wrote:
> 
> On 03/28, Hillf Danton wrote:
>> 
>> On Sat, 27 Mar 2021 18:53:08 Oleg Nesterov wrote:
>>> Hi Hillf,
>>> 
>>> it seems that you already understand the problem ;) I don't.
>> 
>> It is simpler than you thought - I always blindly believe what syzbot
>> reported is true before it turns out false as I am not smarter than it.
>> Feel free to laugh loud.
> 
> I am not going to laugh. I too think that lockdep is more clever than me.
> 
>>> Could you explain in details how double __register is possible ? and how
>> 
>> Taking another look at the report over five minutes may help more?
> 
> No. I spent much, much more time time and I still can't understand your
> patch which adds UPROBE_REGISTERING. Quite possibly your patch is fine,
> just I am not smart enough.
> 
> And I am a bit surprised you refused to help me.
> 
>>> it connects to this lockdep report?
>> 
>> Feel free to show the report is false and ignore my noise.
> 
> Well, this particular report looks correct but false-positive to me,
> _free_event() is not possible, but I can be easily wrong and we need
> to shut up lockdep anyway...
> 
> 
> ---
> Add more CC's. So, we have the following trace
> 
>   -> #0 (dup_mmap_sem){}-{0:0}:
>check_prev_add kernel/locking/lockdep.c:2936 [inline]
>check_prevs_add kernel/locking/lockdep.c:3059 [inline]
>validate_chain kernel/locking/lockdep.c:3674 [inline]
>__lock_acquire+0x2b14/0x54c0 kernel/locking/lockdep.c:4900
>lock_acquire kernel/locking/lockdep.c:5510 [inline]
>lock_acquire+0x1ab/0x740 kernel/locking/lockdep.c:5475
>percpu_down_write+0x95/0x440 kernel/locking/percpu-rwsem.c:217
>register_for_each_vma+0x2c/0xc10 kernel/events/uprobes.c:1040
>__uprobe_register+0x5c2/0x850 kernel/events/uprobes.c:1181
>trace_uprobe_enable kernel/trace/trace_uprobe.c:1065 [inline]
>probe_event_enable+0x357/0xa00 kernel/trace/trace_uprobe.c:1134
>trace_uprobe_register+0x443/0x880 kernel/trace/trace_uprobe.c:1461
>perf_trace_event_reg kernel/trace/trace_event_perf.c:129 [inline]
>perf_trace_event_init+0x549/0xa20 kernel/trace/trace_event_perf.c:204
>perf_uprobe_init+0x16f/0x210 kernel/trace/trace_event_perf.c:336
>perf_uprobe_event_init+0xff/0x1c0 kernel/events/core.c:9754
>perf_try_init_event+0x12a/0x560 kernel/events/core.c:11071
>perf_init_event kernel/events/core.c:11123 [inline]
>perf_event_alloc.part.0+0xe3b/0x3960 kernel/events/core.c:11403
>perf_event_alloc kernel/events/core.c:11785 [inline]
>__do_sys_perf_event_open+0x647/0x2e60 kernel/events/core.c:11883
>do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> 
> which shows that this path takes
> 
>   event_mutex -> uprobe.register_rwsem -> dup_mmap_sem -> mm.mmap_lock
> 
> Not good. If nothing else, perf_mmap_close() path can take event_mutex under
> mm.mmap_lock, so lockdep complains correctly.
> 
> But why does perf_uprobe_init() take event_mutex? The comment mentions
> uprobe_buffer_enable().
> 
> If this is the only reason, then why uprobe_buffer_enable/disable abuse
> event_mutex?
> 
> IOW, can something like the stupid patch below work? (Just in case... yes
> it is very suboptimal, I am just trying to understand the problem).
> 
> Song, Namhyung, Peter, what do you think?
> 
> Oleg.

I think the following patch works well. I haven't tested it though. 

Thanks,
Song


> 
> --- x/kernel/trace/trace_event_perf.c
> +++ x/kernel/trace/trace_event_perf.c
> @@ -327,16 +327,9 @@ int perf_uprobe_init(struct perf_event *p_event,
>   goto out;
>   }
> 
> - /*
> -  * local trace_uprobe need to hold event_mutex to call
> -  * uprobe_buffer_enable() and uprobe_buffer_disable().
> -  * event_mutex is not required for local trace_kprobes.
> -  */
> - mutex_lock(&event_mutex);
>   ret = perf_trace_event_init(tp_event, p_event);
>   if (ret)
>   destroy_local_trace_uprobe(tp_event);
> - mutex_unlock(&event_mutex);
> out:
>   kfree(path);
>   return ret;
> --- x/kernel/trace/trace_uprobe.c
> +++ x/kernel/trace/trace_uprobe.c
> @@ -857,6 +857,7 @@ struct uprobe_cpu_buffer {
> };
> static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
> static int uprobe_buffer_refcnt;
> +static DEFINE_MUTEX(uprobe_buffer_mutex);
> 
> static int uprobe_buffer_init(void)
> {
> @@ -894,13 +895,13 @@ static int uprobe_buffer_enable(void)
> {
>   int ret = 0;
> 
> - BUG_ON(!mutex_is_locked(&event_mutex));
> -
> + mutex_lock(&uprobe_buffer_mutex);
>   if (uprobe_buffer_refcnt++ == 0) {
>   ret = uprobe_buffer_init();
>   if (ret < 0)
>   uprobe_buffer_refcnt--;
>   }
> + mutex_unlock(&uprobe_bu
