Re: [PATCH v3 4/6] module: script to generate offset ranges for builtin modules

2024-06-06 Thread Kris Van Hees
On Tue, May 21, 2024 at 01:53:27AM +0900, Masahiro Yamada wrote:
> On Fri, May 17, 2024 at 1:31 PM Kris Van Hees 
> wrote:
> >
> > The offset range data for builtin modules is generated using:
> >  - modules.builtin.modinfo: associates object files with module names
> >  - vmlinux.map: provides load order of sections and offset of first member
> > per section
> >  - vmlinux.o.map: provides offset of object file content per section
> >  - .*.cmd: build cmd file with KBUILD_MODFILE and KBUILD_MODNAME
> >
> > The generated data will look like:
> >
> > .text 00000000-00000000 = _text
> > .text 0000baf0-0000cb10 amd_uncore
> > .text 0009bd10-0009c8e0 iosf_mbi
> > ...
> > .text 008e6660-008e9630 snd_soc_wcd_mbhc
> > .text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x
> > .text 008ea610-008ea780 snd_soc_wcd9335
> > ...
> > .data 00000000-00000000 = _sdata
> > .data 0000f020-0000f680 amd_uncore
> >
> > For each ELF section, it lists the offset of the first symbol.  This can
> > be used to determine the base address of the section at runtime.
> >
> > Next, it lists (in strict ascending order) offset ranges in that section
> > that cover the symbols of one or more builtin modules.  Multiple ranges
> > can apply to a single module, and ranges can be shared between modules.
> >
> > Signed-off-by: Kris Van Hees 
> > Reviewed-by: Nick Alcock 
> > ---
> > Changes since v2:
> >  - 1st arg to generate_builtin_ranges.awk is now modules.builtin.modinfo
> >  - Switched from using modules.builtin.objs to parsing .*.cmd files
> >  - Parse data from .*.cmd in generate_builtin_ranges.awk
> > ---
> >  scripts/generate_builtin_ranges.awk | 232 
> >  1 file changed, 232 insertions(+)
> >  create mode 100755 scripts/generate_builtin_ranges.awk
> >
> > diff --git a/scripts/generate_builtin_ranges.awk b/scripts/generate_builtin_ranges.awk
> > new file mode 100755
> > index 0..6975a9c7266d9
> > --- /dev/null
> > +++ b/scripts/generate_builtin_ranges.awk
> > @@ -0,0 +1,232 @@
> > +#!/usr/bin/gawk -f
> > +# SPDX-License-Identifier: GPL-2.0
> > +# generate_builtin_ranges.awk: Generate address range data for builtin modules
> > +# Written by Kris Van Hees 
> > +#
> > +# Usage: generate_builtin_ranges.awk modules.builtin.modinfo vmlinux.map \
> > +#  vmlinux.o.map > modules.builtin.ranges
> > +#
> > +
> > +BEGIN {
> > +   # modules.builtin.modinfo uses \0 as record separator
> > +   # All other files use \n.
> > +   RS = "[\n\0]";
> > +}
> 
> 
> Why do you want to parse modules.builtin.modinfo
> instead of modules.builtin?
> 
> modules.builtin uses \n separator.

Oh my, I completely overlooked modules.builtin.  Thank you!  That is indeed
much easier.
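
For reference, a minimal sketch of what the first-stage block could become
when reading modules.builtin instead (illustrative only, not the final
patch; it assumes lines of the form "kernel/drivers/i2c/i2c-core.ko"):

FC == 1 {
	mod = $1;
	sub(/([^/]*\/)+/, "", mod);	# basename of the .ko path
	sub(/\.ko$/, "", mod);		# strip the .ko suffix
	gsub(/-/, "_", mod);		# normalize to kallsyms-style names
	mods[mod] = 1;
	next;
}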

> > +
> > +# Return the module name(s) (if any) associated with the given object.
> > +#
> > +# If we have seen this object before, return information from the cache.
> > +# Otherwise, retrieve it from the corresponding .cmd file.
> > +#
> > +function get_module_info(fn, mod, obj, mfn, s) {
> 
> 
> There are 5 arguments, while the caller passes only 1 argument
> ( get_module_info($4) )

That is the way to be able to have local variables in an AWK function - every
variable mentioned in the function declaration is local to the function.  It
is an oddity in AWK.
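
For example, in this sketch only "s" is a real argument; "n", "i", and "w"
become locals because the caller omits them.  By convention, extra
whitespace in the declaration separates arguments from locals:

function count_fields(s,    n, i, w) {
	n = split(s, w, /[ \t]+/);	# n, i, w are local to this call
	for (i = 1; i <= n; i++)
		print i ": " w[i];
	return n;
}
BEGIN { count_fields("foo bar baz"); }	# called with a single argument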

> > +   if (fn in omod)
> > +   return omod[fn];
> > +
> > +   if (match(fn, /\/[^/]+$/) == 0)
> > +   return "";
> > +
> > +   obj = fn;
> > +   mod = "";
> > +   mfn = "";
> > +   fn = substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd";
> > +   if (getline s <fn == 1) {
> > +   if (match(s, /DKBUILD_MODNAME=[^ ]+/) > 0) {
> > +   mod = substr(s, RSTART + 17, RLENGTH - 17);
> > +   gsub(/['"]/, "", mod);
> > +   gsub(/:/, " ", mod);
> > +   }
> > +
> > +   if (match(s, /DKBUILD_MODFILE=[^ ]+/) > 0) {
> > +   mfn = substr(s, RSTART + 17, RLENGTH - 17);
> > +   gsub(/['"]/, "", mfn);
> > +   gsub(/:/, " ", mfn);
> > +   }
> > +   }
> > +   close(fn);
> > +
> > +# tmp = $0;
> > +# $0 = s;
> > +# print mod " " mfn " " obj " " $NF;

[PATCH v3 6/6] module: add install target for modules.builtin.ranges

2024-05-16 Thread Kris Van Hees
When CONFIG_BUILTIN_MODULE_RANGES is enabled, the modules.builtin.ranges
file should be installed in the module install location.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v2:
 - Include modules.builtin.ranges in modules install target
---
 scripts/Makefile.modinst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/scripts/Makefile.modinst b/scripts/Makefile.modinst
index 0afd75472679f..f5160ddd74239 100644
--- a/scripts/Makefile.modinst
+++ b/scripts/Makefile.modinst
@@ -30,10 +30,10 @@ $(MODLIB)/modules.order: modules.order FORCE
 quiet_cmd_install_modorder = INSTALL $@
   cmd_install_modorder = sed 's:^\(.*\)\.o$$:kernel/\1.ko:' $< > $@
 
-# Install modules.builtin(.modinfo) even when CONFIG_MODULES is disabled.
-install-y += $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo)
+# Install modules.builtin(.modinfo,.ranges) even when CONFIG_MODULES is disabled.
+install-y += $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo modules.builtin.ranges)
 
-$(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo): $(MODLIB)/%: % FORCE
+$(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo modules.builtin.ranges): $(MODLIB)/%: % FORCE
$(call cmd,install)
 
 endif
-- 
2.43.0




[PATCH v3 5/6] kbuild: generate modules.builtin.ranges when linking the kernel

2024-05-16 Thread Kris Van Hees
Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v2:
 - 1st arg to generate_builtin_ranges.awk is now modules.builtin.modinfo
 - Use $(real-prereqs) rather than $(filter-out ...)
---
 scripts/Makefile.vmlinux | 16 
 1 file changed, 16 insertions(+)

diff --git a/scripts/Makefile.vmlinux b/scripts/Makefile.vmlinux
index c9f3e03124d7f..afe8287e8dda0 100644
--- a/scripts/Makefile.vmlinux
+++ b/scripts/Makefile.vmlinux
@@ -36,6 +36,22 @@ targets += vmlinux
 vmlinux: scripts/link-vmlinux.sh vmlinux.o $(KBUILD_LDS) FORCE
+$(call if_changed_dep,link_vmlinux)
 
+# module.builtin.ranges
+# ---
+ifdef CONFIG_BUILTIN_MODULE_RANGES
+__default: modules.builtin.ranges
+
+quiet_cmd_modules_builtin_ranges = GEN $@
+  cmd_modules_builtin_ranges = \
+   $(srctree)/scripts/generate_builtin_ranges.awk $(real-prereqs) > $@
+
+vmlinux.map: vmlinux
+
+targets += modules.builtin.ranges
+modules.builtin.ranges: modules.builtin.modinfo vmlinux.map vmlinux.o.map FORCE
+   $(call if_changed,modules_builtin_ranges)
+endif
+
# Add FORCE to the prerequisites of a target to force it to be always rebuilt.
 # ---
 
-- 
2.43.0




[PATCH v3 4/6] module: script to generate offset ranges for builtin modules

2024-05-16 Thread Kris Van Hees
The offset range data for builtin modules is generated using:
 - modules.builtin.modinfo: associates object files with module names
 - vmlinux.map: provides load order of sections and offset of first member
per section
 - vmlinux.o.map: provides offset of object file content per section
 - .*.cmd: build cmd file with KBUILD_MODFILE and KBUILD_MODNAME

The generated data will look like:

.text 00000000-00000000 = _text
.text 0000baf0-0000cb10 amd_uncore
.text 0009bd10-0009c8e0 iosf_mbi
...
.text 008e6660-008e9630 snd_soc_wcd_mbhc
.text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x
.text 008ea610-008ea780 snd_soc_wcd9335
...
.data 00000000-00000000 = _sdata
.data 0000f020-0000f680 amd_uncore

For each ELF section, it lists the offset of the first symbol.  This can
be used to determine the base address of the section at runtime.

Next, it lists (in strict ascending order) offset ranges in that section
that cover the symbols of one or more builtin modules.  Multiple ranges
can apply to a single module, and ranges can be shared between modules.
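
For illustration, a consumer could map a section offset back to module
name(s) with a lookup along these lines (a minimal, hypothetical sketch,
not part of this series; it assumes the queried address has already been
rebased to a section offset using the anchor symbol):

# Usage: gawk -v sect=.text -v off=0x008e9630 -f lookup.awk modules.builtin.ranges
BEGIN {
	target = strtonum(off);
}
$1 == sect && $3 != "=" {			# skip the anchor record
	split($2, r, "-");
	if (target >= strtonum("0x" r[1]) && target < strtonum("0x" r[2]))
		for (i = 3; i <= NF; i++)
			print $i;		# module name(s) covering the offset
}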

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v2:
 - 1st arg to generate_builtin_ranges.awk is now modules.builtin.modinfo
 - Switched from using modules.builtin.objs to parsing .*.cmd files
 - Parse data from .*.cmd in generate_builtin_ranges.awk
---
 scripts/generate_builtin_ranges.awk | 232 
 1 file changed, 232 insertions(+)
 create mode 100755 scripts/generate_builtin_ranges.awk

diff --git a/scripts/generate_builtin_ranges.awk b/scripts/generate_builtin_ranges.awk
new file mode 100755
index 0..6975a9c7266d9
--- /dev/null
+++ b/scripts/generate_builtin_ranges.awk
@@ -0,0 +1,232 @@
+#!/usr/bin/gawk -f
+# SPDX-License-Identifier: GPL-2.0
+# generate_builtin_ranges.awk: Generate address range data for builtin modules
+# Written by Kris Van Hees 
+#
+# Usage: generate_builtin_ranges.awk modules.builtin.modinfo vmlinux.map \
+#  vmlinux.o.map > modules.builtin.ranges
+#
+
+BEGIN {
+   # modules.builtin.modinfo uses \0 as record separator
+   # All other files use \n.
+   RS = "[\n\0]";
+}
+
+# Return the module name(s) (if any) associated with the given object.
+#
+# If we have seen this object before, return information from the cache.
+# Otherwise, retrieve it from the corresponding .cmd file.
+#
+function get_module_info(fn, mod, obj, mfn, s) {
+   if (fn in omod)
+   return omod[fn];
+
+   if (match(fn, /\/[^/]+$/) == 0)
+   return "";
+
+   obj = fn;
+   mod = "";
+   mfn = "";
+   fn = substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd";
+   if (getline s <fn == 1) {
+   if (match(s, /DKBUILD_MODNAME=[^ ]+/) > 0) {
+   mod = substr(s, RSTART + 17, RLENGTH - 17);
+   gsub(/['"]/, "", mod);
+   gsub(/:/, " ", mod);
+   }
+
+   if (match(s, /DKBUILD_MODFILE=[^ ]+/) > 0) {
+   mfn = substr(s, RSTART + 17, RLENGTH - 17);
+   gsub(/['"]/, "", mfn);
+   gsub(/:/, " ", mfn);
+   }
+   }
+   close(fn);
+
+# tmp = $0;
+# $0 = s;
+# print mod " " mfn " " obj " " $NF;
+# $0 = tmp;
+
+   # A single module (common case) also reflects objects that are not part
+   # of a module.  Some of those objects have names that are also a module
+   # name (e.g. core).  We check the associated module file name, and if
+   # they do not match, the object is not part of a module.
+   if (mod !~ / /) {
+   if (!(mod in mods))
+   return "";
+   if (mods[mod] != mfn)
+   return "";
+   }
+
+   # At this point, mod is a single (valid) module name, or a list of
+   # module names (that do not need validation).
+   omod[obj] = mod;
+   close(fn);
+
+   return mod;
+}
+
+FNR == 1 {
+   FC++;
+}
+
+# (1-old) Build a mapping to associate object files with built-in module names.
+#
+# The first file argument is used as input (modules.builtin.objs).
+#
+FC == 1 && old_behaviour {
+   sub(/:/, "");
+   mod = $1;
+   sub(/([^/]*\/)+/, "", mod);
+   sub(/\.o$/, "", mod);
+   gsub(/-/, "_", mod);
+
+   if (NF > 1) {
+   for (i = 2; i <= NF; i++) {
+   if ($i in mods)
+   mods[$i] = mods[$i] " " mod;
+   else
+   mods[$i] = mod;
+   }
+   } else
+   mods[$1] = mod;
+
+   next;
+}
+
+# (1) Build a lookup map of built-in module names.
+#
+# The first file argument is used as input (modules.builtin.modinfo).
+#
+# We are interested in

[PATCH v3 3/6] kbuild: generate a linker map for vmlinux.o

2024-05-16 Thread Kris Van Hees
When CONFIG_BUILTIN_MODULE_RANGES is set, a linker map for vmlinux.o needs
to be generated.  The generation of offset range data for builtin modules
depends on that linker map to know what offsets in an ELF section belong
to an object file for a particular builtin module.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 scripts/Makefile.vmlinux_o | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/Makefile.vmlinux_o b/scripts/Makefile.vmlinux_o
index 6de297916ce68..252505505e0e3 100644
--- a/scripts/Makefile.vmlinux_o
+++ b/scripts/Makefile.vmlinux_o
@@ -45,9 +45,12 @@ objtool-args = $(vmlinux-objtool-args-y) --link
 # Link of vmlinux.o used for section mismatch analysis
 # ---
 
+vmlinux-o-ld-args-$(CONFIG_BUILTIN_MODULE_RANGES)  += -Map=$@.map
+
 quiet_cmd_ld_vmlinux.o = LD  $@
   cmd_ld_vmlinux.o = \
$(LD) ${KBUILD_LDFLAGS} -r -o $@ \
+   $(vmlinux-o-ld-args-y) \
$(addprefix -T , $(initcalls-lds)) \
--whole-archive vmlinux.a --no-whole-archive \
--start-group $(KBUILD_VMLINUX_LIBS) --end-group \
-- 
2.43.0




[PATCH v3 2/6] trace: add CONFIG_BUILTIN_MODULE_RANGES option

2024-05-16 Thread Kris Van Hees
The CONFIG_BUILTIN_MODULE_RANGES option controls whether offset range data
is generated for kernel modules that are built into the kernel image.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
Reviewed-by: Alan Maguire 
---
Changes since v2:
 - Add explicit dependency on FTRACE for CONFIG_BUILTIN_MODULE_RANGES
---
 kernel/trace/Kconfig | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 47345bf1d4a9f..d0c82b4b3a61e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -188,6 +188,24 @@ menuconfig FTRACE
 
 if FTRACE
 
+config BUILTIN_MODULE_RANGES
+   bool "Generate address range information for builtin modules"
+   depends on FTRACE
+   select VMLINUX_MAP
+   help
+ When modules are built into the kernel, there will be no module name
+ associated with its symbols in /proc/kallsyms.  Tracers may want to
+ identify symbols by module name and symbol name regardless of whether
+ the module is configured as loadable or not.
+
+ This option generates modules.builtin.ranges in the build tree with
+ offset ranges (per ELF section) for the module(s) they belong to.
+ It also records an anchor symbol to determine the load address of the
+ section.
+
+ It is fully compatible with CONFIG_RANDOMIZE_BASE and similar late-
+ address-modification options.
+
 config BOOTTIME_TRACING
bool "Boot-time Tracing support"
depends on TRACING
-- 
2.43.0




[PATCH v3 1/6] kbuild: add mod(name,file)_flags to assembler flags for module objects

2024-05-16 Thread Kris Van Hees
Module objects compiled from C source can be identified by the presence
of -DKBUILD_MODFILE and -DKBUILD_MODNAME on their compile command lines.
However, module objects built from assembler source do not have these
defines.

Add $(modfile_flags) to modkern_aflags (similar to modkern_cflags), and
add $(modname_flags) to a_flags (similar to c_flags).

Signed-off-by: Kris Van Hees 
---
 scripts/Makefile.lib | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 3179747cbd2cc..a2524ffd046f4 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -234,7 +234,7 @@ modkern_rustflags = \
 
 modkern_aflags = $(if $(part-of-module),   \
$(KBUILD_AFLAGS_MODULE) $(AFLAGS_MODULE),   \
-   $(KBUILD_AFLAGS_KERNEL) $(AFLAGS_KERNEL))
+   $(KBUILD_AFLAGS_KERNEL) $(AFLAGS_KERNEL) $(modfile_flags))
 
 c_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
 -include $(srctree)/include/linux/compiler_types.h   \
@@ -244,7 +244,7 @@ c_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
 rust_flags = $(_rust_flags) $(modkern_rustflags) @$(objtree)/include/generated/rustc_cfg
 
 a_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
-$(_a_flags) $(modkern_aflags)
+$(_a_flags) $(modkern_aflags) $(modname_flags)
 
 cpp_flags  = -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
 $(_cpp_flags)
-- 
2.43.0




[PATCH v3 0/6] Generate address range data for built-in modules

2024-05-16 Thread Kris Van Hees
Especially for tracing applications, it is convenient to be able to
refer to a symbol using a <module name, symbol name> pair and to be able
to translate an address into a <module name, symbol name> pair.  But
that does not work if the module is built into the kernel because the
object files that comprise the built-in module implementation are simply
linked into the kernel image along with all other kernel object files.

This is especially visible when providing tracing scripts for support
purposes, where the developer of the script targets a particular kernel
version, but does not have control over whether the target system has
a particular module as loadable module or built-in module.  When tracing
symbols within a module, referring to them by <module name, symbol name>
pairs is both convenient and aids symbol lookup.  But that naming will
not work if the module name information is lost because the module is
built into the kernel on the target system.

Earlier work addressing this loss of information for built-in modules
involved adding module name information to the kallsyms data, but that
required more invasive code in the kernel proper.  This work never did
get merged into the kernel tree.

All that is really needed is knowing whether a given address belongs to
a particular module (or multiple modules if they share an object file).
Or in other words, whether that address falls within an address range
that is associated with one or more modules.

Objects can be identified as belonging to a particular module (or
modules) based on defines that are passed as flags to their respective
compilation commands.  The data found in modules.builtin.modinfo is
used to determine what modules are built into the kernel proper.  Then,
vmlinux.o.map and vmlinux.map can be parsed in a single pass to generate
a modules.builtin.ranges file with offset range information (relative to
the base address of the associated section) for built-in modules.  This
file gets installed along with the other modules.builtin.* files.

The impact on the kernel build is minimal because everything is done
using a single-pass AWK script.  The generated data size is minimal as
well, usually in the range of 500-700 lines (depending on the exact
kernel configuration), with a file size of 20-40KB (if all modules are
built in, the file contains about 8000 lines, with a file size of about
285KB).

Changes since v2:
 - Switched from using modules.builtin.objs to parsing .*.cmd files
 - Add explicit dependency on FTRACE for CONFIG_BUILTIN_MODULE_RANGES
 - 1st arg to generate_builtin_ranges.awk is now modules.builtin.modinfo
 - Parse data from .*.cmd in generate_builtin_ranges.awk
 - Use $(real-prereqs) rather than $(filter-out ...)
 - Include modules.builtin.ranges in modules install target

Changes since v1:
 - Renamed CONFIG_BUILTIN_RANGES to CONFIG_BUILTIN_MODULE_RANGES
 - Moved the config option to the tracers section
 - 2nd arg to generate_builtin_ranges.awk should be vmlinux.map

Kris Van Hees (5):
  trace: add CONFIG_BUILTIN_MODULE_RANGES option
  kbuild: generate a linker map for vmlinux.o
  module: script to generate offset ranges for builtin modules
  kbuild: generate modules.builtin.ranges when linking the kernel
  module: add install target for modules.builtin.ranges

Luis Chamberlain (1):
  kbuild: add modules.builtin.objs

 .gitignore  |   2 +-
 Documentation/dontdiff  |   2 +-
 Documentation/kbuild/kbuild.rst |   5 ++
 Makefile|   8 +-
 include/linux/module.h  |   4 +-
 kernel/trace/Kconfig|  17 
 scripts/Makefile.lib|   5 +-
 scripts/Makefile.modinst|  11 ++-
 scripts/Makefile.vmlinux|  17 
 scripts/Makefile.vmlinux_o  |  18 -
 scripts/generate_builtin_ranges.awk | 149 
 11 files changed, 228 insertions(+), 10 deletions(-)
 create mode 100755 scripts/generate_builtin_ranges.awk


base-commit: dd5a440a31fae6e459c0d627162825505361
-- 
2.42.0




Re: [PATCH v2 0/6] Generate address range data for built-in modules

2024-05-15 Thread Kris Van Hees
On Mon, May 13, 2024 at 01:43:15PM +0900, Masahiro Yamada wrote:
> On Sun, May 12, 2024 at 7:42 AM Kris Van Hees 
> wrote:
> >
> > Especially for tracing applications, it is convenient to be able to
> > refer to a symbol using a <module name, symbol name> pair and to be able
> > to translate an address into a <module name, symbol name> pair.  But
> > that does not work if the module is built into the kernel because the
> > object files that comprise the built-in module implementation are simply
> > linked into the kernel image along with all other kernel object files.
> >
> > This is especially visible when providing tracing scripts for support
> > purposes, where the developer of the script targets a particular kernel
> > version, but does not have control over whether the target system has
> > a particular module as loadable module or built-in module.  When tracing
> > symbols within a module, referring to them by <module name, symbol name>
> > pairs is both convenient and aids symbol lookup.  But that naming will
> > not work if the module name information is lost because the module is
> > built into the kernel on the target system.
> >
> > Earlier work addressing this loss of information for built-in modules
> > involved adding module name information to the kallsyms data, but that
> > required more invasive code in the kernel proper.  This work never did
> > get merged into the kernel tree.
> >
> > All that is really needed is knowing whether a given address belongs to
> > a particular module (or multiple modules if they share an object file).
> > Or in other words, whether that address falls within an address range
> > that is associated with one or more modules.
> >
> > This patch series is based on Luis Chamberlain's patch to generate
> > modules.builtin.objs, associating built-in modules with their object
> > files.  Using this data, vmlinux.o.map and vmlinux.map can be parsed in
> > a single pass to generate a modules.builtin.ranges file with offset range
> > information (relative to the base address of the associated section) for
> > built-in modules.  The file gets installed along with the other
> > modules.builtin.* files.
> 
> 
> 
> I still do not want to see modules.builtin.objs.
> 
> 
> During the vmlinux.o.map parse, every time an object path
> is encountered, you can open the corresponding .cmd file.
> 
> 
> 
> Let's say, you have the following in vmlinux.o.map:
> 
> .text  0x007d4fe0 0x46c8 drivers/i2c/i2c-core-base.o
> 
> 
> 
> You can check drivers/i2c/.i2c-core-base.o.cmd
> 
> 
> $ cat drivers/i2c/.i2c-core-base.o.cmd | tr ' ' '\n' | grep KBUILD_MODFILE
> -DKBUILD_MODFILE='"drivers/i2c/i2c-core"'
> 
> 
> Now you know this object is part of drivers/i2c/i2c-core
> (that is, its modname is "i2c-core")
> 
> 
> 
> 
> Next, you will get the following:
> 
>  .text  0x007dc550 0x13c4 drivers/i2c/i2c-core-acpi.o
> 
> 
> $ cat drivers/i2c/.i2c-core-acpi.o.cmd | tr ' ' '\n' | grep KBUILD_MODFILE
> -DKBUILD_MODFILE='"drivers/i2c/i2c-core"'
> 
> 
> This one is also a part of drivers/i2c/i2c-core
>
> 
> You will get the address range of "i2c-core" without changing Makefiles.

Thank you for this suggestion.  I have this approach now implemented, making
use of both KBUILD_MODFILE and KBUILD_MODNAME (both are needed to conclusively
determine that an object belongs to a module).

However, this is not catching objects that are compiled from assembler source,
because modfile_flags and modname_flags are not added to the assembler flags,
and thus KBUILD_MODFILE and KBUILD_MODNAME are not present in the .cmd file
for those objects.

It would seem that it is harmless to add those flags to assembler flags, so
would that be an acceptable solution?  It definitely would provide consistency
with non-asm objects.  And we already pass modfile and modname flags to the
non-asm builds for objects that most certainly do not belong in modules anyway,
e.g.

$ cat arch/x86/boot/.cmdline.o.cmd | tr ' ' '\n' | grep -- -DKBUILD_MOD
-DKBUILD_MODFILE='"arch/x86/boot/cmdline"'
-DKBUILD_MODNAME='"cmdline"'

> You still need to modify scripts/Makefile.vmlinux(_o)
> but you can implement everything else in your script,
> although I did not fully understand the gawk script.
> 
> 
> Now, you can use Python if you like:
> 
>   
> https://lore.kernel.org/lkml/20240512-python-version-v2-1-382870a1f...@linaro.org/
> 
> Presumably, python code will be more readable for many people.
> 
> 
> GNU awk is not documented in Documentation/process/changes.rst
> If you insist on using gawk, you need to add it to the doc.
> 
> 
> 
> 
> 
&

[PATCH v2 6/6] module: add install target for modules.builtin.ranges

2024-05-11 Thread Kris Van Hees
When CONFIG_BUILTIN_MODULE_RANGES is enabled, the modules.builtin.ranges
file should be installed in the module install location.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v1:
 - Renamed CONFIG_BUILTIN_RANGES to CONFIG_BUILTIN_MODULE_RANGES
---
 scripts/Makefile.modinst | 5 +
 1 file changed, 5 insertions(+)

diff --git a/scripts/Makefile.modinst b/scripts/Makefile.modinst
index b45586aa1de49..e185dacae892a 100644
--- a/scripts/Makefile.modinst
+++ b/scripts/Makefile.modinst
@@ -36,6 +36,11 @@ install-y += $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo mod
 $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo modules.builtin.objs): $(MODLIB)/%: % FORCE
$(call cmd,install)
 
+install-$(CONFIG_BUILTIN_MODULE_RANGES) += $(MODLIB)/modules.builtin.ranges
+
+$(MODLIB)/modules.builtin.ranges: modules.builtin.ranges FORCE
+   $(call cmd,install)
+
 endif
 
 modules := $(call read-file, $(MODORDER))
-- 
2.43.0




[PATCH v2 5/6] kbuild: generate modules.builtin.ranges when linking the kernel

2024-05-11 Thread Kris Van Hees
Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v1:
 - Renamed CONFIG_BUILTIN_RANGES to CONFIG_BUILTIN_MODULE_RANGES
---
 scripts/Makefile.vmlinux | 17 +
 1 file changed, 17 insertions(+)

diff --git a/scripts/Makefile.vmlinux b/scripts/Makefile.vmlinux
index c9f3e03124d7f..54095d72f7fd7 100644
--- a/scripts/Makefile.vmlinux
+++ b/scripts/Makefile.vmlinux
@@ -36,6 +36,23 @@ targets += vmlinux
 vmlinux: scripts/link-vmlinux.sh vmlinux.o $(KBUILD_LDS) FORCE
+$(call if_changed_dep,link_vmlinux)
 
+# module.builtin.ranges
+# ---
+ifdef CONFIG_BUILTIN_MODULE_RANGES
+__default: modules.builtin.ranges
+
+quiet_cmd_modules_builtin_ranges = GEN $@
+  cmd_modules_builtin_ranges = \
+   $(srctree)/scripts/generate_builtin_ranges.awk \
+ $(filter-out FORCE,$+) > $@
+
+vmlinux.map: vmlinux
+
+targets += modules.builtin.ranges
+modules.builtin.ranges: modules.builtin.objs vmlinux.map vmlinux.o.map FORCE
+   $(call if_changed,modules_builtin_ranges)
+endif
+
# Add FORCE to the prerequisites of a target to force it to be always rebuilt.
 # ---
 
-- 
2.43.0




[PATCH v2 4/6] module: script to generate offset ranges for builtin modules

2024-05-11 Thread Kris Van Hees
The offset range data for builtin modules is generated using:
 - modules.builtin.objs: associates object files with module names
 - vmlinux.map: provides load order of sections and offset of first member
per section
 - vmlinux.o.map: provides offset of object file content per section

The generated data will look like:

.text 00000000-00000000 = _text
.text 0000baf0-0000cb10 amd_uncore
.text 0009bd10-0009c8e0 iosf_mbi
...
.text 008e6660-008e9630 snd_soc_wcd_mbhc
.text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x
.text 008ea610-008ea780 snd_soc_wcd9335
...
.data 00000000-00000000 = _sdata
.data 0000f020-0000f680 amd_uncore

For each ELF section, it lists the offset of the first symbol.  This can
be used to determine the base address of the section at runtime.

Next, it lists (in strict ascending order) offset ranges in that section
that cover the symbols of one or more builtin modules.  Multiple ranges
can apply to a single module, and ranges can be shared between modules.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v1:
 - Updated commit msg (vmlinux.o -> vmlinux.map)
---
 scripts/generate_builtin_ranges.awk | 150 
 1 file changed, 150 insertions(+)
 create mode 100755 scripts/generate_builtin_ranges.awk

diff --git a/scripts/generate_builtin_ranges.awk b/scripts/generate_builtin_ranges.awk
new file mode 100755
index 0..d5d668c97bd7f
--- /dev/null
+++ b/scripts/generate_builtin_ranges.awk
@@ -0,0 +1,150 @@
+#!/usr/bin/gawk -f
+# SPDX-License-Identifier: GPL-2.0
+
+FNR == 1 {
+   FC++;
+}
+
+# (1) Build a mapping to associate object files with built-in module names.
+#
+# The first file argument is used as input (modules.builtin.objs).
+#
+FC == 1 {
+   sub(/:/, "");
+   mod = $1;
+   sub(/([^/]*\/)+/, "", mod);
+   sub(/\.o$/, "", mod);
+   gsub(/-/, "_", mod);
+
+   if (NF > 1) {
+   for (i = 2; i <= NF; i++) {
+   if ($i in mods)
+   mods[$i] = mods[$i] " " mod;
+   else
+   mods[$i] = mod;
+   }
+   } else
+   mods[$1] = mod;
+
+   next;
+}
+
+# (2) Determine the load address for each section.
+#
+# The second file argument is used as input (vmlinux.map).
+# Since some AWK implementations cannot handle large integers, we strip off the
+# first 4 hex digits from the address.  This is safe because the kernel space
+# is not large enough for addresses to extend into those digits.
+#
+FC == 2 && /^\./ && NF > 2 {
+   if (type)
+   delete sect_addend[type];
+
+   if ($1 ~ /percpu/)
+   next;
+
+   raw_addr = $2;
+   addr_prefix = "^" substr($2, 1, 6);
+   sub(addr_prefix, "0x", $2);
+   base = strtonum($2);
+   type = $1;
+   anchor = 0;
+   sect_base[type] = base;
+
+   next;
+}
+
+!type {
+   next;
+}
+
+# (3) We need to determine the base address of the section so that ranges can
+# be expressed based on offsets from the base address.  This accommodates the
+# kernel sections getting loaded at different addresses than what is recorded
+# in vmlinux.map.
+#
+# At runtime, we will need to determine the base address of each section we are
+# interested in.  We do that by recording the offset of the first symbol in the
+# section.  Once we know the address of this symbol in the running kernel, we
+# can calculate the base address of the section.
+#
+# If possible, we use an explicit anchor symbol (sym = .) listed at the base
+# address (offset 0).
+#
+# If there is no such symbol, we record the first symbol in the section along
+# with its offset.
+#
+# We also determine the offset of the first member in the section in case the
+# final linking inserts some content between the start of the section and the
+# first member.  I.e. in that case, vmlinux.map will list the first member at
+# a non-zero offset whereas vmlinux.o.map will list it at offset 0.  We record
+# the addend so we can apply it when processing vmlinux.o.map (next).
+#
+FC == 2 && !anchor && raw_addr == $1 && $3 == "=" && $4 == "." {
+   anchor = sprintf("%s %08x-%08x = %s", type, 0, 0, $2);
+   sect_anchor[type] = anchor;
+
+   next;
+}
+
+FC == 2 && !anchor && $1 ~ /^0x/ && $2 !~ /^0x/ && NF <= 4 {
+   sub(addr_prefix, "0x", $1);
+   addr = strtonum($1) - base;
+   anchor = sprintf("%s %08x-%08x = %s", type, addr, addr, $2);
+   sect_anchor[type] = anchor;
+
+   next;
+}
+
+FC == 2 && base && /^ \./ && $1 == type && NF == 4 {
+   sub(addr_prefix, "0x", $2);
+   addr = strtonum($2);
+   sect_addend[type] = addr - base;

[PATCH v2 3/6] kbuild: generate a linker map for vmlinux.o

2024-05-11 Thread Kris Van Hees
When CONFIG_BUILTIN_MODULE_RANGES is set, a linker map for vmlinux.o needs
to be generated.  The generation of offset range data for builtin modules
depends on that linker map to know what offsets in an ELF section belong
to an object file for a particular builtin module.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v1:
 - Renamed CONFIG_BUILTIN_RANGES to CONFIG_BUILTIN_MODULE_RANGES
---
 scripts/Makefile.vmlinux_o | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/Makefile.vmlinux_o b/scripts/Makefile.vmlinux_o
index 508b3294e2cf1..e4a92838780d5 100644
--- a/scripts/Makefile.vmlinux_o
+++ b/scripts/Makefile.vmlinux_o
@@ -45,9 +45,12 @@ objtool-args = $(vmlinux-objtool-args-y) --link
 # Link of vmlinux.o used for section mismatch analysis
 # ---
 
+vmlinux-o-ld-args-$(CONFIG_BUILTIN_MODULE_RANGES)  += -Map=$@.map
+
 quiet_cmd_ld_vmlinux.o = LD  $@
   cmd_ld_vmlinux.o = \
$(LD) ${KBUILD_LDFLAGS} -r -o $@ \
+   $(vmlinux-o-ld-args-y) \
$(addprefix -T , $(initcalls-lds)) \
--whole-archive vmlinux.a --no-whole-archive \
--start-group $(KBUILD_VMLINUX_LIBS) --end-group \
-- 
2.43.0




[PATCH v2 2/6] trace: add CONFIG_BUILTIN_MODULE_RANGES option

2024-05-11 Thread Kris Van Hees
The CONFIG_BUILTIN_MODULE_RANGES option controls whether offset range data
is generated for kernel modules that are built into the kernel image.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
Reviewed-by: Alan Maguire 
---
Changes since v1:
 - Renamed CONFIG_BUILTIN_RANGES to CONFIG_BUILTIN_MODULE_RANGES
 - Moved the config option to the tracers section
---
 kernel/trace/Kconfig | 17 +
 1 file changed, 17 insertions(+)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 47345bf1d4a9f..839a56e971cc0 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -188,6 +188,23 @@ menuconfig FTRACE
 
 if FTRACE
 
+config BUILTIN_MODULE_RANGES
+   bool "Generate address range information for builtin modules"
+   select VMLINUX_MAP
+   help
+ When modules are built into the kernel, there will be no module name
+ associated with its symbols in /proc/kallsyms.  Tracers may want to
+ identify symbols by module name and symbol name regardless of whether
+ the module is configured as loadable or not.
+
+ This option generates modules.builtin.ranges in the build tree with
+ offset ranges (per ELF section) for the module(s) they belong to.
+ It also records an anchor symbol to determine the load address of the
+ section.
+
+ It is fully compatible with CONFIG_RANDOMIZE_BASE and similar late-
+ address-modification options.
+
 config BOOTTIME_TRACING
bool "Boot-time Tracing support"
depends on TRACING
-- 
2.43.0




[PATCH v2 1/6] kbuild: add modules.builtin.objs

2024-05-11 Thread Kris Van Hees
From: Luis Chamberlain 

The file modules.builtin names all modules that are built into the
kernel; this is checked by modprobe to not fail when trying to load
something built-in. But for tools which want to see which object files
make up each module, we want to help them with such a mapping as it is
not easy to get this otherwise.

We do this by just extending scripts/Makefile.lib with a new variable
and define to capture all object files included in this module, store it
in a new objs= modinfo stanza, then extract it just before linking into
a new file modules.builtin.objs with a layout roughly modelled on a
makefile:

path/to/module.o: path/to/constituent.o path/to/other-constituent.o

Single-file built-in modules get a line reading

path/to/module.o:

Note that the .modinfo section is discarded at the link stage, so the
kernel is not bloated at all (see include/asm-generic/vmlinux.lds.h).

Orabug: 29891866
Signed-off-by: Luis Chamberlain 
Signed-off-by: Nick Alcock 
Reviewed-by: Nick Alcock 
Reviewed-by: Kris Van Hees 
---
Changes since v1:
 - None
---
 .gitignore  |  2 +-
 Documentation/dontdiff  |  2 +-
 Documentation/kbuild/kbuild.rst |  5 +
 Makefile|  8 ++--
 include/linux/module.h  |  4 +++-
 scripts/Makefile.lib|  5 -
 scripts/Makefile.modinst|  6 +++---
 scripts/Makefile.vmlinux_o  | 15 ++-
 8 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/.gitignore b/.gitignore
index c59dc60ba62ef..62ede8565a2aa 100644
--- a/.gitignore
+++ b/.gitignore
@@ -69,7 +69,7 @@ modules.order
 /System.map
 /Module.markers
 /modules.builtin
-/modules.builtin.modinfo
+/modules.builtin.*
 /modules.nsdeps
 
 #
diff --git a/Documentation/dontdiff b/Documentation/dontdiff
index 3c399f132e2db..75b9655e57914 100644
--- a/Documentation/dontdiff
+++ b/Documentation/dontdiff
@@ -179,7 +179,7 @@ mkutf8data
 modpost
 modules-only.symvers
 modules.builtin
-modules.builtin.modinfo
+modules.builtin.*
 modules.nsdeps
 modules.order
 modversions.h*
diff --git a/Documentation/kbuild/kbuild.rst b/Documentation/kbuild/kbuild.rst
index 9c8d1d046ea56..79e104ffee715 100644
--- a/Documentation/kbuild/kbuild.rst
+++ b/Documentation/kbuild/kbuild.rst
@@ -17,6 +17,11 @@ modules.builtin
 This file lists all modules that are built into the kernel. This is used
 by modprobe to not fail when trying to load something builtin.
 
+modules.builtin.objs
+--------------------
+This file contains object mapping of modules that are built into the kernel
+to their corresponding object files used to build the module.
+
 modules.builtin.modinfo
 ---
 This file contains modinfo from all modules that are built into the kernel.
diff --git a/Makefile b/Makefile
index d51d411d44a82..cc979f9874f5a 100644
--- a/Makefile
+++ b/Makefile
@@ -1140,7 +1140,11 @@ PHONY += vmlinux_o
 vmlinux_o: vmlinux.a $(KBUILD_VMLINUX_LIBS)
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.vmlinux_o
 
-vmlinux.o modules.builtin.modinfo modules.builtin: vmlinux_o
+MODULES_BUILTIN := modules.builtin.modinfo
+MODULES_BUILTIN += modules.builtin
+MODULES_BUILTIN += modules.builtin.objs
+
+vmlinux.o $(MODULES_BUILTIN): vmlinux_o
@:
 
 PHONY += vmlinux
@@ -1476,7 +1480,7 @@ endif # CONFIG_MODULES
 
 # Directories & files removed with 'make clean'
 CLEAN_FILES += vmlinux.symvers modules-only.symvers \
-  modules.builtin modules.builtin.modinfo modules.nsdeps \
+  modules.builtin modules.builtin.* modules.nsdeps \
   compile_commands.json .thinlto-cache rust/test \
   rust-project.json .vmlinux.objs .vmlinux.export.c
 
diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a808..cbfff06e00cd6 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -180,7 +180,9 @@ extern void cleanup_module(void);
 #ifdef MODULE
 #define MODULE_FILE
 #else
-#define MODULE_FILE	MODULE_INFO(file, KBUILD_MODFILE);
+#define MODULE_FILE  \
+   MODULE_INFO(file, KBUILD_MODFILE);\
+   MODULE_INFO(objs, KBUILD_MODOBJS);
 #endif
 
 /*
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 3179747cbd2cc..3b3baa78d4fbd 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -118,6 +118,8 @@ modname-multi = $(sort $(foreach m,$(multi-obj-ym),\
 __modname = $(or $(modname-multi),$(basetarget))
 
 modname = $(subst $(space),:,$(__modname))
+modname-objs = $($(modname)-objs) $($(modname)-y) $($(modname)-Y)
+modname-objs-prefixed = $(sort $(strip $(addprefix $(obj)/, $(modname-objs))))
 modfile = $(addprefix $(obj)/,$(__modname))
 
 # target with $(obj)/ and its suffix stripped
@@ -131,7 +133,8 @@ name-fix = $(call stringify,$(call name-fix-token,$1))
 basename_flags = -DKBUILD_BASENAME=$(call name-fix,$(basetarget))
modname_flags  = -DKBUILD_MODNAME=$(call name-fix,$(modname))

[PATCH v2 0/6] Generate address range data for built-in modules

2024-05-11 Thread Kris Van Hees
Especially for tracing applications, it is convenient to be able to
refer to a symbol using a <module name, symbol name> pair and to be able
to translate an address into a <module name, symbol name> pair.  But
that does not work if the module is built into the kernel because the
object files that comprise the built-in module implementation are simply
linked into the kernel image along with all other kernel object files.

This is especially visible when providing tracing scripts for support
purposes, where the developer of the script targets a particular kernel
version, but does not have control over whether the target system has
a particular module as loadable module or built-in module.  When tracing
symbols within a module, referring to them by <module name, symbol name>
pairs is both convenient and aids symbol lookup.  But that naming will
not work if the module name information is lost because the module is
built into the kernel on the target system.

Earlier work addressing this loss of information for built-in modules
involved adding module name information to the kallsyms data, but that
required more invasive code in the kernel proper.  This work never did
get merged into the kernel tree.

All that is really needed is knowing whether a given address belongs to
a particular module (or multiple modules if they share an object file).
Or in other words, whether that address falls within an address range
that is associated with one or more modules.

This patch series is based on Luis Chamberlain's patch to generate
modules.builtin.objs, associating built-in modules with their object
files.  Using this data, vmlinux.o.map and vmlinux.map can be parsed in
a single pass to generate a modules.builtin.ranges file with offset range
information (relative to the base address of the associated section) for
built-in modules.  The file gets installed along with the other
modules.builtin.* files.

The impact on the kernel build is minimal because everything is done
using a single-pass AWK script.  The generated data size is minimal as
well, usually in the range of 500-700 lines (depending on the exact
kernel configuration), with a file size of 20-40KB.

Changes since v1:
 - Renamed CONFIG_BUILTIN_RANGES to CONFIG_BUILTIN_MODULE_RANGES
 - Moved the config option to the tracers section
 - 2nd arg to generate_builtin_ranges.awk should be vmlinux.map

Kris Van Hees (5):
  trace: add CONFIG_BUILTIN_MODULE_RANGES option
  kbuild: generate a linker map for vmlinux.o
  module: script to generate offset ranges for builtin modules
  kbuild: generate modules.builtin.ranges when linking the kernel
  module: add install target for modules.builtin.ranges

Luis Chamberlain (1):
  kbuild: add modules.builtin.objs

 .gitignore  |   2 +-
 Documentation/dontdiff  |   2 +-
 Documentation/kbuild/kbuild.rst |   5 ++
 Makefile|   8 +-
 include/linux/module.h  |   4 +-
 kernel/trace/Kconfig|  17 
 scripts/Makefile.lib|   5 +-
 scripts/Makefile.modinst|  11 ++-
 scripts/Makefile.vmlinux|  17 
 scripts/Makefile.vmlinux_o  |  18 -
 scripts/generate_builtin_ranges.awk | 149 
 11 files changed, 228 insertions(+), 10 deletions(-)
 create mode 100755 scripts/generate_builtin_ranges.awk


base-commit: dd5a440a31fae6e459c0d627162825505361
-- 
2.42.0




Re: [PATCH 2/6] module: add CONFIG_BUILTIN_RANGES option

2023-12-11 Thread Kris Van Hees
On Sat, Dec 09, 2023 at 07:59:17AM +0900, Masami Hiramatsu wrote:
> On Fri,  8 Dec 2023 00:07:48 -0500
> Kris Van Hees  wrote:
> 
> > The CONFIG_BUILTIN_RANGES option controls whether offset range data is
> > generated for kernel modules that are built into the kernel image.
> > 
> > Signed-off-by: Kris Van Hees 
> > Reviewed-by: Nick Alcock 
> > Reviewed-by: Alan Maguire 
> > ---
> >  kernel/module/Kconfig | 17 +
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
> > index 33a2e991f608..0798439b11ac 100644
> > --- a/kernel/module/Kconfig
> > +++ b/kernel/module/Kconfig
> > @@ -389,4 +389,21 @@ config MODULES_TREE_LOOKUP
> > def_bool y
> > depends on PERF_EVENTS || TRACING || CFI_CLANG
> >  
> > +config BUILTIN_RANGES
> 
> BUILTIN_MODULE_RANGES ?

Ah yes, thank you.  Will fix in v2.

> BTW, even if CONFIG_MODULES=n, we can embed the kernel module code.
> So should this visible if the CONFIG_MODULES=n ?

That is a very good point.  And in that case, the ranges information should
still be produced when this option is set.  I will move the option to a more
appropriate location, not depending on CONFIG_MODULES.

> Thank you,

Thank you for your feedback!

> > +   bool "Generate address range information for builtin modules"
> > +   depends on VMLINUX_MAP
> > +   help
> > + When modules are built into the kernel, there will be no module name
> > + associated with its symbols in /proc/kallsyms.  Tracers may want to
> > + identify symbols by module name and symbol name regardless of whether
> > + the module is configured as loadable or not.
> > +
> > + This option generates modules.builtin.ranges in the build tree with
> > + offset ranges (per ELF section) for the module(s) they belong to.
> > + It also records an anchor symbol to determine the load address of the
> > + section.
> > +
> > + It is fully compatible with CONFIG_RANDOMIZE_BASE and similar late-
> > + address-modification options.
> > +
> >  endif # MODULES
> > -- 
> > 2.42.0
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



[PATCH 6/6] module: add install target for modules.builtin.ranges

2023-12-07 Thread Kris Van Hees
When CONFIG_BUILTIN_RANGES is enabled, the modules.builtin.ranges file
should be installed in the module install location.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 scripts/Makefile.modinst | 5 +
 1 file changed, 5 insertions(+)

diff --git a/scripts/Makefile.modinst b/scripts/Makefile.modinst
index b45586aa1de4..f30f5ea04566 100644
--- a/scripts/Makefile.modinst
+++ b/scripts/Makefile.modinst
@@ -36,6 +36,11 @@ install-y += $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo mod
 $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo modules.builtin.objs): $(MODLIB)/%: % FORCE
$(call cmd,install)
 
+install-$(CONFIG_BUILTIN_RANGES) += $(MODLIB)/modules.builtin.ranges
+
+$(MODLIB)/modules.builtin.ranges: modules.builtin.ranges FORCE
+   $(call cmd,install)
+
 endif
 
 modules := $(call read-file, $(MODORDER))
-- 
2.42.0




[PATCH 5/6] kbuild: generate modules.builtin.ranges when linking the kernel

2023-12-07 Thread Kris Van Hees
Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 scripts/Makefile.vmlinux | 17 +
 1 file changed, 17 insertions(+)

diff --git a/scripts/Makefile.vmlinux b/scripts/Makefile.vmlinux
index c9f3e03124d7..c23d40b023ff 100644
--- a/scripts/Makefile.vmlinux
+++ b/scripts/Makefile.vmlinux
@@ -36,6 +36,23 @@ targets += vmlinux
 vmlinux: scripts/link-vmlinux.sh vmlinux.o $(KBUILD_LDS) FORCE
+$(call if_changed_dep,link_vmlinux)
 
+# module.builtin.ranges
+# ---
+ifdef CONFIG_BUILTIN_RANGES
+__default: modules.builtin.ranges
+
+quiet_cmd_modules_builtin_ranges = GEN $@
+  cmd_modules_builtin_ranges = \
+   $(srctree)/scripts/generate_builtin_ranges.awk \
+ $(filter-out FORCE,$+) > $@
+
+vmlinux.map: vmlinux
+
+targets += modules.builtin.ranges
+modules.builtin.ranges: modules.builtin.objs vmlinux.map vmlinux.o.map FORCE
+   $(call if_changed,modules_builtin_ranges)
+endif
+
# Add FORCE to the prerequisites of a target to force it to be always rebuilt.
 # ---
 
-- 
2.42.0




[PATCH 4/6] module: script to generate offset ranges for builtin modules

2023-12-07 Thread Kris Van Hees
The offset range data for builtin modules is generated using:
 - modules.builtin.objs: associates object files with module names
 - vmlinux.o: provides load order of sections and offset of first member per
section
 - vmlinux.o.map: provides offset of object file content per section

The generated data will look like:

.text 00000000-00000000 = _text
.text 0000baf0-0000cb10 amd_uncore
.text 0009bd10-0009c8e0 iosf_mbi
...
.text 008e6660-008e9630 snd_soc_wcd_mbhc
.text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x
.text 008ea610-008ea780 snd_soc_wcd9335
...
.data 00000000-00000000 = _sdata
.data 0000f020-0000f680 amd_uncore

For each ELF section, it lists the offset of the first symbol.  This can
be used to determine the base address of the section at runtime.

Next, it lists (in strict ascending order) offset ranges in that section
that cover the symbols of one or more builtin modules.  Multiple ranges
can apply to a single module, and ranges can be shared between modules.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 scripts/generate_builtin_ranges.awk | 149 
 1 file changed, 149 insertions(+)
 create mode 100755 scripts/generate_builtin_ranges.awk

diff --git a/scripts/generate_builtin_ranges.awk b/scripts/generate_builtin_ranges.awk
new file mode 100755
index ..d5d668c97bd7
--- /dev/null
+++ b/scripts/generate_builtin_ranges.awk
@@ -0,0 +1,149 @@
+#!/usr/bin/gawk -f
+
+FNR == 1 {
+   FC++;
+}
+
+# (1) Build a mapping to associate object files with built-in module names.
+#
+# The first file argument is used as input (modules.builtin.objs).
+#
+FC == 1 {
+   sub(/:/, "");
+   mod = $1;
+   sub(/([^/]*\/)+/, "", mod);
+   sub(/\.o$/, "", mod);
+   gsub(/-/, "_", mod);
+
+   if (NF > 1) {
+   for (i = 2; i <= NF; i++) {
+   if ($i in mods)
+   mods[$i] = mods[$i] " " mod;
+   else
+   mods[$i] = mod;
+   }
+   } else
+   mods[$1] = mod;
+
+   next;
+}
+
+# (2) Determine the load address for each section.
+#
+# The second file argument is used as input (vmlinux.map).
+# Since some AWK implementations cannot handle large integers, we strip off the
+# first 4 hex digits from the address.  This is safe because the kernel space
+# is not large enough for addresses to extend into those digits.
+#
+FC == 2 && /^\./ && NF > 2 {
+   if (type)
+   delete sect_addend[type];
+
+   if ($1 ~ /percpu/)
+   next;
+
+   raw_addr = $2;
+   addr_prefix = "^" substr($2, 1, 6);
+   sub(addr_prefix, "0x", $2);
+   base = strtonum($2);
+   type = $1;
+   anchor = 0;
+   sect_base[type] = base;
+
+   next;
+}
+
+!type {
+   next;
+}
+
+# (3) We need to determine the base address of the section so that ranges can
+# be expressed based on offsets from the base address.  This accommodates the
+# kernel sections getting loaded at different addresses than what is recorded
+# in vmlinux.map.
+#
+# At runtime, we will need to determine the base address of each section we are
+# interested in.  We do that by recording the offset of the first symbol in the
+# section.  Once we know the address of this symbol in the running kernel, we
+# can calculate the base address of the section.
+#
+# If possible, we use an explicit anchor symbol (sym = .) listed at the base
+# address (offset 0).
+#
+# If there is no such symbol, we record the first symbol in the section along
+# with its offset.
+#
+# We also determine the offset of the first member in the section in case the
+# final linking inserts some content between the start of the section and the
+# first member.  I.e. in that case, vmlinux.map will list the first member at
+# a non-zero offset whereas vmlinux.o.map will list it at offset 0.  We record
+# the addend so we can apply it when processing vmlinux.o.map (next).
+#
+FC == 2 && !anchor && raw_addr == $1 && $3 == "=" && $4 == "." {
+   anchor = sprintf("%s %08x-%08x = %s", type, 0, 0, $2);
+   sect_anchor[type] = anchor;
+
+   next;
+}
+
+FC == 2 && !anchor && $1 ~ /^0x/ && $2 !~ /^0x/ && NF <= 4 {
+   sub(addr_prefix, "0x", $1);
+   addr = strtonum($1) - base;
+   anchor = sprintf("%s %08x-%08x = %s", type, addr, addr, $2);
+   sect_anchor[type] = anchor;
+
+   next;
+}
+
+FC == 2 && base && /^ \./ && $1 == type && NF == 4 {
+   sub(addr_prefix, "0x", $2);
+   addr = strtonum($2);
+   sect_addend[type] = addr - base;
+
+   if (anchor) {
+   base = 0;
+   type = 0;
+   }
+
+   next;
+}
+
+

[PATCH 3/6] kbuild: generate a linker map for vmlinux.o

2023-12-07 Thread Kris Van Hees
When CONFIG_BUILTIN_RANGES is set, a linker map for vmlinux.o needs to
be generated.  The generation of offset range data for builtin modules
depends on that linker map to know what offsets in an ELF section belong
to an object file for a particular builtin module.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 scripts/Makefile.vmlinux_o | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/Makefile.vmlinux_o b/scripts/Makefile.vmlinux_o
index bfb84efcef39..9e35cb0ed862 100644
--- a/scripts/Makefile.vmlinux_o
+++ b/scripts/Makefile.vmlinux_o
@@ -45,9 +45,12 @@ objtool-args = $(vmlinux-objtool-args-y) --link
 # Link of vmlinux.o used for section mismatch analysis
 # ---
 
+vmlinux-o-ld-args-$(CONFIG_BUILTIN_RANGES) += -Map=$@.map
+
 quiet_cmd_ld_vmlinux.o = LD  $@
   cmd_ld_vmlinux.o = \
$(LD) ${KBUILD_LDFLAGS} -r -o $@ \
+   $(vmlinux-o-ld-args-y) \
$(addprefix -T , $(initcalls-lds)) \
--whole-archive vmlinux.a --no-whole-archive \
--start-group $(KBUILD_VMLINUX_LIBS) --end-group \
-- 
2.42.0




[PATCH 2/6] module: add CONFIG_BUILTIN_RANGES option

2023-12-07 Thread Kris Van Hees
The CONFIG_BUILTIN_RANGES option controls whether offset range data is
generated for kernel modules that are built into the kernel image.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
Reviewed-by: Alan Maguire 
---
 kernel/module/Kconfig | 17 +
 1 file changed, 17 insertions(+)

diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig
index 33a2e991f608..0798439b11ac 100644
--- a/kernel/module/Kconfig
+++ b/kernel/module/Kconfig
@@ -389,4 +389,21 @@ config MODULES_TREE_LOOKUP
def_bool y
depends on PERF_EVENTS || TRACING || CFI_CLANG
 
+config BUILTIN_RANGES
+   bool "Generate address range information for builtin modules"
+   depends on VMLINUX_MAP
+   help
+ When modules are built into the kernel, there will be no module name
+ associated with its symbols in /proc/kallsyms.  Tracers may want to
+ identify symbols by module name and symbol name regardless of whether
+ the module is configured as loadable or not.
+
+ This option generates modules.builtin.ranges in the build tree with
+ offset ranges (per ELF section) for the module(s) they belong to.
+ It also records an anchor symbol to determine the load address of the
+ section.
+
+ It is fully compatible with CONFIG_RANDOMIZE_BASE and similar late-
+ address-modification options.
+
 endif # MODULES
-- 
2.42.0




[PATCH 1/6] kbuild: add modules.builtin.objs

2023-12-07 Thread Kris Van Hees
From: Luis Chamberlain 

The file modules.builtin names all modules that are built into the
kernel; this is checked by modprobe to not fail when trying to load
something built-in. But for tools which want to see which object files
make up each module, we want to help them with such a mapping as it is
not easy to get this otherwise.

We do this by just extending scripts/Makefile.lib with a new variable
and define to capture all object files included in this module, store it
in a new objs= modinfo stanza, then extract it just before linking into
a new file modules.builtin.objs with a layout roughly modelled on a
makefile:

path/to/module.o: path/to/constituent.o path/to/other-constituent.o

Single-file built-in modules get a line reading

path/to/module.o:

Note that the .modinfo section is discarded at the link stage, so the
kernel is not bloated at all (see include/asm-generic/vmlinux.lds.h).

Orabug: 29891866
Signed-off-by: Luis Chamberlain 
Signed-off-by: Nick Alcock 
Reviewed-by: Nick Alcock 
Reviewed-by: Kris Van Hees 
---
 .gitignore  |  2 +-
 Documentation/dontdiff  |  2 +-
 Documentation/kbuild/kbuild.rst |  5 +
 Makefile|  8 ++--
 include/linux/module.h  |  4 +++-
 scripts/Makefile.lib|  5 -
 scripts/Makefile.modinst|  6 +++---
 scripts/Makefile.vmlinux_o  | 15 ++-
 8 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/.gitignore b/.gitignore
index 0bbae167bf93..7e3a0a1556a5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -68,7 +68,7 @@ modules.order
 /System.map
 /Module.markers
 /modules.builtin
-/modules.builtin.modinfo
+/modules.builtin.*
 /modules.nsdeps
 
 #
diff --git a/Documentation/dontdiff b/Documentation/dontdiff
index 3c399f132e2d..75b9655e5791 100644
--- a/Documentation/dontdiff
+++ b/Documentation/dontdiff
@@ -179,7 +179,7 @@ mkutf8data
 modpost
 modules-only.symvers
 modules.builtin
-modules.builtin.modinfo
+modules.builtin.*
 modules.nsdeps
 modules.order
 modversions.h*
diff --git a/Documentation/kbuild/kbuild.rst b/Documentation/kbuild/kbuild.rst
index bd906407e307..15d1b61d9454 100644
--- a/Documentation/kbuild/kbuild.rst
+++ b/Documentation/kbuild/kbuild.rst
@@ -17,6 +17,11 @@ modules.builtin
 This file lists all modules that are built into the kernel. This is used
 by modprobe to not fail when trying to load something builtin.
 
+modules.builtin.objs
+--------------------
+This file contains object mapping of modules that are built into the kernel
+to their corresponding object files used to build the module.
+
 modules.builtin.modinfo
 ---
 This file contains modinfo from all modules that are built into the kernel.
diff --git a/Makefile b/Makefile
index cbe63ba9126e..7e48618771dd 100644
--- a/Makefile
+++ b/Makefile
@@ -1145,7 +1145,11 @@ PHONY += vmlinux_o
 vmlinux_o: vmlinux.a $(KBUILD_VMLINUX_LIBS)
$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.vmlinux_o
 
-vmlinux.o modules.builtin.modinfo modules.builtin: vmlinux_o
+MODULES_BUILTIN := modules.builtin.modinfo
+MODULES_BUILTIN += modules.builtin
+MODULES_BUILTIN += modules.builtin.objs
+
+vmlinux.o $(MODULES_BUILTIN): vmlinux_o
@:
 
 PHONY += vmlinux
@@ -1473,7 +1477,7 @@ endif # CONFIG_MODULES
 
 # Directories & files removed with 'make clean'
 CLEAN_FILES += vmlinux.symvers modules-only.symvers \
-  modules.builtin modules.builtin.modinfo modules.nsdeps \
+  modules.builtin modules.builtin.* modules.nsdeps \
   compile_commands.json .thinlto-cache rust/test \
   rust-project.json .vmlinux.objs .vmlinux.export.c
 
diff --git a/include/linux/module.h b/include/linux/module.h
index a98e188cf37b..53323e94b96e 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -180,7 +180,9 @@ extern void cleanup_module(void);
 #ifdef MODULE
 #define MODULE_FILE
 #else
-#define MODULE_FILE	MODULE_INFO(file, KBUILD_MODFILE);
+#define MODULE_FILE  \
+   MODULE_INFO(file, KBUILD_MODFILE);\
+   MODULE_INFO(objs, KBUILD_MODOBJS);
 #endif
 
 /*
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 68d0134bdbf9..40803f8faa5e 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -112,6 +112,8 @@ modname-multi = $(sort $(foreach m,$(multi-obj-ym),\
 __modname = $(or $(modname-multi),$(basetarget))
 
 modname = $(subst $(space),:,$(__modname))
+modname-objs = $($(modname)-objs) $($(modname)-y) $($(modname)-Y)
+modname-objs-prefixed = $(sort $(strip $(addprefix $(obj)/, $(modname-objs))))
 modfile = $(addprefix $(obj)/,$(__modname))
 
 # target with $(obj)/ and its suffix stripped
@@ -125,7 +127,8 @@ name-fix = $(call stringify,$(call name-fix-token,$1))
 basename_flags = -DKBUILD_BASENAME=$(call name-fix,$(basetarget))
 modname_flags  = -DKBUILD_MODNAME=$(call name-fix,$(mod

[PATCH 0/6] Generate address range data for built-in modules

2023-12-07 Thread Kris Van Hees
Especially for tracing applications, it is convenient to be able to
refer to a symbol using a <module, symbol> pair and to be able
to translate an address into a <module, symbol> pair.  But
that does not work if the module is built into the kernel because the
object files that comprise the built-in module implementation are simply
linked into the kernel image along with all other kernel object files.

This is especially visible when providing tracing scripts for support
purposes, where the developer of the script targets a particular kernel
version, but does not have control over whether the target system has
a particular module as loadable module or built-in module.  When tracing
symbols within a module, referring to them by <module, symbol>
pairs is both convenient and aids symbol lookup.  But that naming will
not work if the module name information is lost if the module is built
into the kernel on the target system.

Earlier work addressing this loss of information for built-in modules
involved adding module name information to the kallsyms data, but that
required more invasive code in the kernel proper.  This work never did
get merged into the kernel tree.

All that is really needed is knowing whether a given address belongs to
a particular module (or multiple modules if they share an object file).
Or in other words, whether that address falls within an address range
that is associated with one or more modules.
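
In practice, a consumer of such range data only needs a binary search per
section, with the address converted to an offset relative to the section
base.  A minimal sketch follows; the struct layout and function name are
illustrative only and not part of this series:

    /* Illustrative only: find the module name(s), if any, that cover a
     * given offset within one ELF section, using ranges sorted by
     * ascending start offset.
     */
    struct mod_range {
            unsigned long   start;  /* range start offset (inclusive) */
            unsigned long   end;    /* range end offset (exclusive) */
            const char      *mods;  /* module name(s), space separated */
    };

    static const char *lookup_mods(const struct mod_range *r, int n,
                                   unsigned long off)
    {
            int lo = 0, hi = n - 1;

            while (lo <= hi) {
                    int mid = lo + (hi - lo) / 2;

                    if (off < r[mid].start)
                            hi = mid - 1;
                    else if (off >= r[mid].end)
                            lo = mid + 1;
                    else
                            return r[mid].mods;
            }

            return NULL;    /* not covered by any built-in module */
    }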

This patch series is based on Luis Chamberlain's patch to generate
modules.builtin.objs, associating built-in modules with their object
files.  Using this data, vmlinux.o.map and vmlinux.map can be parsed in
a single pass to generate a modules.builtin.ranges file with offset range
information (relative to the base address of the associated section) for
built-in modules.  The file gets installed along with the other
modules.builtin.* files.

The impact on the kernel build is minimal because everything is done
using a single-pass AWK script.  The generated data size is minimal as
well: depending on the exact kernel configuration, it is typically in the
range of 500-600 lines, with a file size of 20-30KB.

Kris Van Hees (5):
  module: add CONFIG_BUILTIN_RANGES option
  kbuild: generate a linker map for vmlinux.o
  module: script to generate offset ranges for builtin modules
  kbuild: generate modules.builtin.ranges when linking the kernel
  module: add install target for modules.builtin.ranges

Luis Chamberlain (1):
  kbuild: add modules.builtin.objs

 .gitignore  |   2 +-
 Documentation/dontdiff  |   2 +-
 Documentation/kbuild/kbuild.rst |   5 +
 Makefile|   8 +-
 include/linux/module.h  |   4 +-
 kernel/module/Kconfig   |  14 +++
 scripts/Makefile.lib|   5 +-
 scripts/Makefile.modinst|  11 +-
 scripts/Makefile.vmlinux|  17 
 scripts/Makefile.vmlinux_o  |  18 +++-
 scripts/generate_builtin_ranges.awk | 149 
 11 files changed, 225 insertions(+), 10 deletions(-)
 create mode 100755 scripts/generate_builtin_ranges.awk


base-commit: 8f34f6b7b6b3260eb6312a19ececcc97908d15b7
-- 
2.42.0




Re: [PATCH v5] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms

2023-10-10 Thread Kris Van Hees
On Tue, Oct 10, 2023 at 01:33:09PM +0200, Petr Mladek wrote:
> On Mon 2023-10-09 15:14:28, Alessandro Carminati wrote:
> > Hello Kris,
> > 
> > Thank you for your contribution and for having your thought shared with me.
> > 
> > Allow me to begin this conversation by explaining what came to mind when
> > I decided to propose a patch that creates aliases.
> > 
> > The objective was to address a specific problem I was facing while
> > minimizing any potential impact on other aspects.
> > My initial consideration was the existence of numerous tools, both in the
> > kernel and in userspace, that rely on the current kallsyms implementation.
> > Both Nick and I shared the concern that making changes to elements upon
> > which these tools depend could have significant consequences.
> > 
> > To the best of my knowledge, Nick's strategy has been to duplicate kallsyms
> > with something new - a new, improved kallsyms file.
> > 
> > However, even if Nick's patch were to be accepted, it wouldn't fully meet
> > my personal requirements.
> > This is because my goal was to utilize kprobe on a symbol that shares its
> > name with others. Nick's work wouldn't allow me to do this, and that's why,
> > I proposed an alternative.
> > 
> > As a result, my strategy was more modest and focused solely on creating
> > aliases for duplicate symbols.
> > By adding these aliases, existing tools would remain unaffected, and the
> > current system state and ecosystem would be preserved.
> > For instance, mechanisms like live patching could continue to use the
> > symbol hit count.
> > 
> > On the flip side, introducing these new symbols would enable tracers to
> > directly employ the new names without any modifications, and humans could
> > easily identify the symbol they are dealing with just by examining the
> > name.
> > These are the fundamental principles behind my patch - introducing aliases.
> > 
> > Il giorno gio 5 ott 2023 alle ore 18:25 Kris Van Hees
> >  ha scritto:
> > >
> > > On Wed, Sep 27, 2023 at 05:35:16PM +, Alessandro Carminati (Red Hat) 
> > > wrote:
> > > > It is not uncommon for drivers or modules related to similar peripherals
> > > > to have symbols with the exact same name.
> > > > While this is not a problem for the kernel's binary itself, it becomes 
> > > > an
> > > > issue when attempting to trace or probe specific functions using
> > > > infrastructure like ftrace or kprobe.
> > > >
> > > > The tracing subsystem relies on the `nm -n vmlinux` output, which 
> > > > provides
> > > > symbol information from the kernel's ELF binary. However, when multiple
> > > > symbols share the same name, the standard nm output does not 
> > > > differentiate
> > > > between them. This can lead to confusion and difficulty when trying to
> > > > probe the intended symbol.
> > > >
> > > >  ~ # cat /proc/kallsyms | grep " name_show"
> > > >  8c4f76d0 t name_show
> > > >  8c9cccb0 t name_show
> > > >  8cb0ac20 t name_show
> > > >  8cc728c0 t name_show
> > > >  8ce0efd0 t name_show
> > > >  8ce126c0 t name_show
> > > >  8ce1dd20 t name_show
> > > >  8ce24e70 t name_show
> > > >  8d1104c0 t name_show
> > > >  8d1fe480 t name_show
> > >
> > > One problem that remains as far as I can see is that this approach does 
> > > not
> > > take into consideration that there can be duplicate symbols in the core
> > > kernel, but also between core kernel and loadable modules, and even 
> > > between
> > > loadable modules.  So, I think more is needed to also ensure that this
> > > approach of adding alias symbols is also done for loadable modules.
> > 
> > To identify which symbols are duplicated, including those contained in
> > modules, it requires exploring all the objects. If I were to propose a
> > complementary tool to kas_alias that operates on modules, it would need to
> > run on all objects to ascertain the state of the names.
> > Only after this assessment could it produce its output.
> > This would entail postponing the second kallsyms pass until after all
> > modules have been processed.
> > Additionally, modules would need to be processed twice: once to assess the
> > names and a second time to generate aliases for duplicated symbols.
> &

Re: [PATCH v5] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms

2023-10-05 Thread Kris Van Hees
On Thu, Oct 05, 2023 at 12:24:03PM -0400, Kris Van Hees wrote:
> On Wed, Sep 27, 2023 at 05:35:16PM +, Alessandro Carminati (Red Hat) 
> wrote:
> > It is not uncommon for drivers or modules related to similar peripherals
> > to have symbols with the exact same name.
> > While this is not a problem for the kernel's binary itself, it becomes an
> > issue when attempting to trace or probe specific functions using
> > infrastructure like ftrace or kprobe.
> > 
> > The tracing subsystem relies on the `nm -n vmlinux` output, which provides
> > symbol information from the kernel's ELF binary. However, when multiple
> > symbols share the same name, the standard nm output does not differentiate
> > between them. This can lead to confusion and difficulty when trying to
> > probe the intended symbol.
> > 
> >  ~ # cat /proc/kallsyms | grep " name_show"
> >  8c4f76d0 t name_show
> >  8c9cccb0 t name_show
> >  8cb0ac20 t name_show
> >  8cc728c0 t name_show
> >  8ce0efd0 t name_show
> >  8ce126c0 t name_show
> >  8ce1dd20 t name_show
> >  8ce24e70 t name_show
> >  8d1104c0 t name_show
> >  8d1fe480 t name_show
> 
> One problem that remains as far as I can see is that this approach does not
> take into consideration that there can be duplicate symbols in the core
> kernel, but also between core kernel and loadable modules, and even between
> loadable modules.  So, I think more is needed to also ensure that this
> approach of adding alias symbols is also done for loadable modules.

Re-reading my email a bit more, I realize that I did not accurately demonstrate
the issue I refer to.  Obviously, the current patch provides aliases for
duplicate symbols in the core kernel.  And loadable modules are annotated with
the module name.  So, all is well unless modules have duplicate symbols
themselves.  And although it is rare, it does happen:

# grep ' metadata_show' /proc/kallsyms 
c05659c0 t metadata_show[md_mod]
c05739f0 t metadata_show[md_mod]

This again shows a case where there are duplicate symbols that cannot be
distinguished for tracing purposes without additional information.  So, the
proposed patch unfortunately does not cover all cases because of loadable
modules.

> Earlier work that covers all symbols (core kernel and loadable modules) was
> posted quite a while ago by Nick Alcock:
> 
> https://lore.kernel.org/all/20221205163157.269335-1-nick.alc...@oracle.com/
> 
> It takes a different approach and adds in other info that is very useful for
> tracing, but unfortunately it has been dormant for a long time now.
> 
> While module symbols are handled quite differently (for kallsyms) from the
> core kernel symbols, I think that a similar approach tied in with modpost
> ought to be quite possible.  It will add to the size of modules because the
> data needs to be stored in the .ko but that is unavoidable.  But not doing it
> unfortunately would mean that the duplicate symbol issue remains unresolved
> in the presence of loadable modules.
> 
> > kas_alias addresses this challenge by enhancing symbol names with
> > meaningful suffixes generated from the source file and line number
> > during the kernel build process.
> > These newly generated aliases provide tracers with the ability to
> > comprehend the symbols they are interacting with when utilizing the
> > ftracefs interface.
> > This approach may also allow for the probing by name of previously
> > inaccessible symbols.
> > 
> >  ~ # cat /proc/kallsyms | grep gic_mask_irq
> >  d15671e505ac t gic_mask_irq
> >  d15671e505ac t gic_mask_irq@drivers_irqchip_irq_gic_c_167
> >  d15671e532a4 t gic_mask_irq
> >  d15671e532a4 t gic_mask_irq@drivers_irqchip_irq_gic_v3_c_407
> >  ~ #
> 
> In the same context as mentioned above (module symbols), I am hoping that the
> alias you generate might also be able to contain a module identifier name,
> much like the aforementioned patch series by Nick Alcock added.  We have it
> for loadable modules already of course, but as has been discussed in relation
> to the earlier work, being able to associate a module name with a symbol
> regardless of whether that module is configured to be built into the kernel
> or whether it is configured to be a loadable module is helpful for tracing
> purposes.  Especially for tracers that use tracing scripts that might get
> deployed on diverse systems where it cannot always be known at the time of
> developing the tracer scripts whether a kernel module is configured to be
> loadable or built-in.
> 
> I'd be happy to work o

Re: [PATCH v5] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms

2023-10-05 Thread Kris Van Hees
On Wed, Sep 27, 2023 at 05:35:16PM +, Alessandro Carminati (Red Hat) wrote:
> It is not uncommon for drivers or modules related to similar peripherals
> to have symbols with the exact same name.
> While this is not a problem for the kernel's binary itself, it becomes an
> issue when attempting to trace or probe specific functions using
> infrastructure like ftrace or kprobe.
> 
> The tracing subsystem relies on the `nm -n vmlinux` output, which provides
> symbol information from the kernel's ELF binary. However, when multiple
> symbols share the same name, the standard nm output does not differentiate
> between them. This can lead to confusion and difficulty when trying to
> probe the intended symbol.
> 
>  ~ # cat /proc/kallsyms | grep " name_show"
>  8c4f76d0 t name_show
>  8c9cccb0 t name_show
>  8cb0ac20 t name_show
>  8cc728c0 t name_show
>  8ce0efd0 t name_show
>  8ce126c0 t name_show
>  8ce1dd20 t name_show
>  8ce24e70 t name_show
>  8d1104c0 t name_show
>  8d1fe480 t name_show

One problem that remains as far as I can see is that this approach does not
take into consideration that there can be duplicate symbols in the core
kernel, but also between core kernel and loadable modules, and even between
loadable modules.  So, I think more is needed to also ensure that this
approach of adding alias symbols is also done for loadable modules.

Earlier work that covers all symbols (core kernel and loadable modules) was
posted quite a while ago by Nick Alcock:

https://lore.kernel.org/all/20221205163157.269335-1-nick.alc...@oracle.com/

It takes a different approach and adds in other info that is very useful for
tracing, but unfortunately it has been dormant for a long time now.

While module symbols are handled quite differently (for kallsyms) from the
core kernel symbols, I think that a similar approach tied in with modpost
ought to be quite possible.  It will add to the size of modules because the
data needs to be stored in the .ko but that is unavoidable.  But not doing it
unfortunately would mean that the duplicate symbol issue remains unresolved
in the presence of loadable modules.

> kas_alias addresses this challenge by enhancing symbol names with
> meaningful suffixes generated from the source file and line number
> during the kernel build process.
> These newly generated aliases provide tracers with the ability to
> comprehend the symbols they are interacting with when utilizing the
> ftracefs interface.
> This approach may also allow for the probing by name of previously
> inaccessible symbols.
> 
>  ~ # cat /proc/kallsyms | grep gic_mask_irq
>  d15671e505ac t gic_mask_irq
>  d15671e505ac t gic_mask_irq@drivers_irqchip_irq_gic_c_167
>  d15671e532a4 t gic_mask_irq
>  d15671e532a4 t gic_mask_irq@drivers_irqchip_irq_gic_v3_c_407
>  ~ #

In the same context as mentioned above (module symbols), I am hoping that the
alias you generate might also be able to contain a module identifier name,
much like the aforementioned patch series by Nick Alcock added.  We have it
for loadable modules already of course, but as has been discussed in relation
to the earlier work, being able to associate a module name with a symbol
regardless of whether that module is configured to be built into the kernel
or whether it is configured to be a loadable module is helpful for tracing
purposes.  Especially for tracers that use tracing scripts that might get
deployed on diverse systems where it cannot always be known at the time of
developing the tracer scripts whether a kernel module is configured to be
loadable or built-in.

I'd be happy to work on something like this as a contribution to your work.
I would envision the alias entry not needing to have the typical [module] added
to it because that will already be annotated on the actual symbol entry.  So,
the alias could be extended to be something like:

c0533720 t floppy_open  [floppy]
c0533720 t floppy_open@floppy:drivers_block_floppy_c_3988

(the absence of a 'name:' prefix to the path would indicate that the symbol
 is not associated with any module)

Doing this is more realistic now as a result of the clean-up patches that
Nick introduced, e.g.

https://lore.kernel.org/lkml/20230302211759.30135-1-nick.alc...@oracle.com/

> Changes from v1:
> - Integrated changes requested by Masami to exclude symbols with prefixes
>   "_cfi" and "_pfx".
> - Introduced a small framework to handle patterns that need to be excluded
>   from the alias production.
> - Excluded other symbols using the framework.
> - Introduced the ability to discriminate between text and data symbols.
> - Added two new config symbols in this version: CONFIG_KALLSYMS_ALIAS_DATA,
>   which allows aliases for data, and CONFIG_KALLSYMS_ALIAS_DATA_ALL, which
>   excludes all filters and provides an alias for each duplicated symbol.
> 
> 

Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace

2019-07-10 Thread Kris Van Hees
On Wed, Jul 10, 2019 at 11:19:43PM +0200, Daniel Borkmann wrote:
> On 07/10/2019 10:30 PM, Jonathan Corbet wrote:
> > On Wed, 10 Jul 2019 21:32:25 +0200
> > Daniel Borkmann  wrote:
> > 
> >> Looks like you missed Brendan Gregg's prior feedback from v1 [0]. I haven't
> >> seen a strong compelling argument for why this needs to reside in the 
> >> kernel
> >> tree given we also have all the other tracing tools and many of which also
> >> rely on BPF such as bcc, bpftrace, ply, systemtap, sysdig, lttng to just 
> >> name
> >> a few.
> > 
> > So I'm just watching from the sidelines here, but I do feel the need to
> > point out that Kris appears to be trying to follow the previous feedback
> > he got from Alexei, where creating tools/dtrace is exactly what he was
> > told to do:
> > 
> >   
> > https://lwn.net/ml/netdev/20190521175617.ipry6ue7o24a2...@ast-mbp.dhcp.thefacebook.com/
> > 
> > Now he's being told the exact opposite.  Not the best experience for
> > somebody who is trying to make the kernel better.
> 
> Ugh, agree, sorry for the misleading direction. Alexei is currently offgrid
> this week, he might comment later.
> 
> It has nothing to do with making the _kernel_ better, it's a /user space/ 
> front
> end for the existing kernel infrastructure like many of the other tracers out
> there. Don't get me wrong, adding the missing /kernel parts/ for it is a 
> totally
> different subject [and _that_ is what is making the kernel better, not the 
> former].

I disagree.  Yes, the current patch obviously isn't making the kernel better
because it doesn't touch the kernel.  But DTrace as a whole is not just a
/front end/ to the existing kernel infrastructure, and I did make that point
at LPC 2018 and in my emails.  Some of its more advanced features will lead
to contributions to the kernel that (by virtue of being developed as part of
this DTrace re-implementation) will more often than not be able to benefit
other tracers as well.  I do think that aspect qualifies as working towards
making the kernel better.

> Hypothetical question: does it make the _kernel_ better if we suddenly add a 
> huge
> and complex project like tools/mysql/ to the kernel tree? Nope.
> 
> > There are still people interested in DTrace out there.  How would you
> > recommend that Kris proceed at this point?
> 
> My recommendation to proceed is to maintain the dtrace user space tooling in
> its own separate project like the vast majority of all the other tracing 
> projects
> (see also the other advantages that Steven pointed out from his experience), 
> and
> extend the kernel bits whenever needed.

I wish that would have been the initial recommendation because it certainly
would have avoided me going down a path that was going to lead to rejection.

Either way, I do hope that as work progresses and contributions to the kernel
code are submitted in support of advancing tracing on Linux, those patches
will receive a fair review and consideration.  I can appreciate that some
people do not like DTrace or feel that it is not necessary, but personal
opinions about tools should not be a deciding factor in whether a contribution
has merit or not.

Kris


Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace

2019-07-10 Thread Kris Van Hees
This patch's subject should of course be [PATCH V2 1/1] rather than 0/1.
Sorry about that.

On Wed, Jul 10, 2019 at 08:42:24AM -0700, Kris Van Hees wrote:
> This initial implementation of a tiny subset of DTrace functionality
> provides the following options:
> 
>   dtrace [-lvV] [-b bufsz] -s script
>   -b  set trace buffer size
>   -l  list probes (only works with '-s script' for now)
>   -s  enable or list probes for the specified BPF program
>   -V  report DTrace API version
> 
> The patch comprises quite a bit of code due to DTrace requiring a few
> crucial components, even in its most basic form.
> 
> The code is structured around the command line interface implemented in
> dtrace.c.  It provides option parsing and drives the three modes of
> operation that are currently implemented:
> 
> 1. Report DTrace API version information.
>   Report the version information and terminate.
> 
> 2. List probes in BPF programs.
>   Initialize the list of probes that DTrace recognizes, load BPF
>   programs, parse all BPF ELF section names, resolve them into
>   known probes, and emit the probe names.  Then terminate.
> 
> 3. Load BPF programs and collect tracing data.
>   Initialize the list of probes that DTrace recognizes, load BPF
>   programs and attach them to their corresponding probes, set up
>   perf event output buffers, and start processing tracing data.
> 
> This implementation makes extensive use of BPF (handled by dt_bpf.c) and
> the perf event output ring buffer (handled by dt_buffer.c).  DTrace-style
> probe handling (dt_probe.c) offers an interface to probes that hides the
> implementation details of the individual probe types by provider (dt_fbt.c
> and dt_syscall.c).  Probe lookup by name uses a hashtable implementation
> (dt_hash.c).  The dt_utils.c code populates a list of online CPU ids, so
> we know what CPUs we can obtain tracing data from.
> 
> Building the tool is trivial because its only dependency (libbpf) is in
> the kernel tree under tools/lib/bpf.  A simple 'make' in the tools/dtrace
> directory suffices.
> 
> The 'dtrace' executable needs to run as root because BPF programs cannot
> be loaded by non-root users.
> 
> Signed-off-by: Kris Van Hees 
> Reviewed-by: David Mc Lean 
> Reviewed-by: Eugene Loh 
> ---
> Changes in v2:
> - Use ring_buffer_read_head() and ring_buffer_write_tail() to
>   avoid use of volatile.
> - Handle perf events that wrap around the ring buffer boundary.
> - Remove unnecessary PERF_EVENT_IOC_ENABLE.
> - Remove -I$(srctree)/tools/perf from KBUILD_HOSTCFLAGS since it
>   is not actually used.
> - Use PT_REGS_PARM1(x), etc instead of my own macros.  Adding 
>   PT_REGS_PARM6(x) in bpf_sample.c because we need to be able to
>   support up to 6 arguments passed by registers.
> ---
>  MAINTAINERS|   6 +
>  tools/dtrace/Makefile  |  87 ++
>  tools/dtrace/bpf_sample.c  | 146 
>  tools/dtrace/dt_bpf.c  | 185 
>  tools/dtrace/dt_buffer.c   | 338 +
>  tools/dtrace/dt_fbt.c  | 201 ++
>  tools/dtrace/dt_hash.c | 211 +++
>  tools/dtrace/dt_probe.c| 230 +
>  tools/dtrace/dt_syscall.c  | 179 
>  tools/dtrace/dt_utils.c| 132 +++
>  tools/dtrace/dtrace.c  | 249 +++
>  tools/dtrace/dtrace.h  |  13 ++
>  tools/dtrace/dtrace_impl.h | 101 +++
>  13 files changed, 2078 insertions(+)
>  create mode 100644 tools/dtrace/Makefile
>  create mode 100644 tools/dtrace/bpf_sample.c
>  create mode 100644 tools/dtrace/dt_bpf.c
>  create mode 100644 tools/dtrace/dt_buffer.c
>  create mode 100644 tools/dtrace/dt_fbt.c
>  create mode 100644 tools/dtrace/dt_hash.c
>  create mode 100644 tools/dtrace/dt_probe.c
>  create mode 100644 tools/dtrace/dt_syscall.c
>  create mode 100644 tools/dtrace/dt_utils.c
>  create mode 100644 tools/dtrace/dtrace.c
>  create mode 100644 tools/dtrace/dtrace.h
>  create mode 100644 tools/dtrace/dtrace_impl.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfa9ed89c031..410240732d55 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5485,6 +5485,12 @@ W: https://linuxtv.org
>  S:   Odd Fixes
>  F:   drivers/media/pci/dt3155/
>  
> +DTRACE
> +M:   Kris Van Hees 
> +L:   dtrace-de...@oss.oracle.com
> +S:   Maintained
> +F:   tools/dtrace/
> +
>  DVB_USB_AF9015 MEDIA DRIVER
>  M:   Antti Palosaari 
>  L:   linux-me...@vger.kernel.org
> diff --git a/tools/dtrace/Makefile b

[PATCH V2 0/1] tools/dtrace: initial implementation of DTrace

2019-07-10 Thread Kris Van Hees
This initial implementation of a tiny subset of DTrace functionality
provides the following options:

dtrace [-lvV] [-b bufsz] -s script
-b  set trace buffer size
-l  list probes (only works with '-s script' for now)
-s  enable or list probes for the specified BPF program
-V  report DTrace API version

The patch comprises quite a bit of code due to DTrace requiring a few
crucial components, even in its most basic form.

The code is structured around the command line interface implemented in
dtrace.c.  It provides option parsing and drives the three modes of
operation that are currently implemented:

1. Report DTrace API version information.
Report the version information and terminate.

2. List probes in BPF programs.
Initialize the list of probes that DTrace recognizes, load BPF
programs, parse all BPF ELF section names, resolve them into
known probes, and emit the probe names.  Then terminate.

3. Load BPF programs and collect tracing data.
Initialize the list of probes that DTrace recognizes, load BPF
programs and attach them to their corresponding probes, set up
perf event output buffers, and start processing tracing data.

This implementation makes extensive use of BPF (handled by dt_bpf.c) and
the perf event output ring buffer (handled by dt_buffer.c).  DTrace-style
probe handling (dt_probe.c) offers an interface to probes that hides the
implementation details of the individual probe types by provider (dt_fbt.c
and dt_syscall.c).  Probe lookup by name uses a hashtable implementation
(dt_hash.c).  The dt_utils.c code populates a list of online CPU ids, so
we know what CPUs we can obtain tracing data from.
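
(As an aside: building that CPU list mostly amounts to parsing the range
syntax used by sysfs.  A minimal sketch, with a made-up function name and
no relation to the actual dt_utils.c code:

    #include <stdio.h>

    /* Parse /sys/devices/system/cpu/online, which lists CPUs as
     * comma-separated ranges such as "0-3,5,7-8".
     */
    static int read_online_cpus(int *cpus, int max)
    {
            FILE *f = fopen("/sys/devices/system/cpu/online", "r");
            int n = 0, lo, hi, c;

            if (f == NULL)
                    return -1;

            while (n < max && fscanf(f, "%d", &lo) == 1) {
                    hi = lo;
                    c = fgetc(f);
                    if (c == '-') {
                            if (fscanf(f, "%d", &hi) != 1)
                                    break;
                            c = fgetc(f);
                    }
                    while (lo <= hi && n < max)
                            cpus[n++] = lo++;
                    if (c != ',')
                            break;
            }

            fclose(f);
            return n;
    }
)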

Building the tool is trivial because its only dependency (libbpf) is in
the kernel tree under tools/lib/bpf.  A simple 'make' in the tools/dtrace
directory suffices.

The 'dtrace' executable needs to run as root because BPF programs cannot
be loaded by non-root users.

Signed-off-by: Kris Van Hees 
Reviewed-by: David Mc Lean 
Reviewed-by: Eugene Loh 
---
Changes in v2:
- Use ring_buffer_read_head() and ring_buffer_write_tail() to
  avoid use of volatile.
- Handle perf events that wrap around the ring buffer boundary.
- Remove unnecessary PERF_EVENT_IOC_ENABLE.
- Remove -I$(srctree)/tools/perf from KBUILD_HOSTCFLAGS since it
  is not actually used.
- Use PT_REGS_PARM1(x), etc instead of my own macros.  Adding 
  PT_REGS_PARM6(x) in bpf_sample.c because we need to be able to
  support up to 6 arguments passed by registers.
---
 MAINTAINERS|   6 +
 tools/dtrace/Makefile  |  87 ++
 tools/dtrace/bpf_sample.c  | 146 
 tools/dtrace/dt_bpf.c  | 185 
 tools/dtrace/dt_buffer.c   | 338 +
 tools/dtrace/dt_fbt.c  | 201 ++
 tools/dtrace/dt_hash.c | 211 +++
 tools/dtrace/dt_probe.c| 230 +
 tools/dtrace/dt_syscall.c  | 179 
 tools/dtrace/dt_utils.c| 132 +++
 tools/dtrace/dtrace.c  | 249 +++
 tools/dtrace/dtrace.h  |  13 ++
 tools/dtrace/dtrace_impl.h | 101 +++
 13 files changed, 2078 insertions(+)
 create mode 100644 tools/dtrace/Makefile
 create mode 100644 tools/dtrace/bpf_sample.c
 create mode 100644 tools/dtrace/dt_bpf.c
 create mode 100644 tools/dtrace/dt_buffer.c
 create mode 100644 tools/dtrace/dt_fbt.c
 create mode 100644 tools/dtrace/dt_hash.c
 create mode 100644 tools/dtrace/dt_probe.c
 create mode 100644 tools/dtrace/dt_syscall.c
 create mode 100644 tools/dtrace/dt_utils.c
 create mode 100644 tools/dtrace/dtrace.c
 create mode 100644 tools/dtrace/dtrace.h
 create mode 100644 tools/dtrace/dtrace_impl.h

diff --git a/MAINTAINERS b/MAINTAINERS
index cfa9ed89c031..410240732d55 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5485,6 +5485,12 @@ W:   https://linuxtv.org
 S: Odd Fixes
 F: drivers/media/pci/dt3155/
 
+DTRACE
+M: Kris Van Hees 
+L: dtrace-de...@oss.oracle.com
+S: Maintained
+F: tools/dtrace/
+
 DVB_USB_AF9015 MEDIA DRIVER
 M: Antti Palosaari 
 L: linux-me...@vger.kernel.org
diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
new file mode 100644
index ..03ae498d1429
--- /dev/null
+++ b/tools/dtrace/Makefile
@@ -0,0 +1,87 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# This Makefile is based on samples/bpf.
+#
+# Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+
+DT_VERSION := 2.0.0
+DT_GIT_VERSION := $(shell git rev-parse HEAD 2>/dev/null || \
+  echo Unknown)
+
DTRACE_PATH    ?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(DTRACE_PATH)/..
+SAMPLES_PATH   := $(DTRACE_PATH)/../../samples
+
+hostprog

[PATCH V2 0/1] tools/dtrace: initial implementation of DTrace

2019-07-10 Thread Kris Van Hees
This is version 2 of the patch, incorporating feedback from Peter Zijlstra and
Arnaldo Carvalho de Melo.

Changes in Makefile:
- Remove -I$(srctree)/tools/perf from KBUILD_HOSTCFLAGS since it
  is not actually used.

Changes in dt_bpf.c:
- Remove unnecessary PERF_EVENT_IOC_ENABLE.

Changes in dt_buffer.c:
- Use ring_buffer_read_head() and ring_buffer_write_tail() to
  avoid use of volatile.
- Handle perf events that wrap around the ring buffer boundary.
- Remove unnecessary PERF_EVENT_IOC_ENABLE.

Changes in bpf_sample.c:
- Use PT_REGS_PARM1(x), etc instead of my own macros.  Adding
  PT_REGS_PARM6(x) in bpf_sample.c because we need to be able to
  support up to 6 arguments passed by registers.

This patch is also available, applied to bpf-next, at the following URL:

https://github.com/oracle/dtrace-linux-kernel/tree/dtrace-bpf

As suggested in feedback to my earlier patch submissions, this code takes an
approach to avoid kernel code changes as much as possible.  The current patch
does not involve any kernel code changes.  Further development of this code
will continue with this approach, incrementally adding features to this first
minimal implementation.  The goal is a fully featured and functional DTrace
implementation involving kernel changes only when strictly necessary.

The code presented here supports two very basic functions:

1. Listing probes that are used in BPF programs

   # dtrace -l -s bpf_sample.o
  ID   PROVIDERMODULE  FUNCTION NAME
   18876fbt   vmlinuxksys_write entry
   70423syscall   vmlinux write entry

2. Loading BPF tracing programs and collecting data that they generate

   # dtrace -s bpf_sample.o
   CPU ID
15  70423 0x8c0968bf8ec0 0x01 0x0055e019eb3f60 
0x2c
15  18876 0x8c0968bf8ec0 0x01 0x0055e019eb3f60 
0x2c
   ...

Only kprobes and syscall tracepoints are supported since this is an initial
patch.  It does show the use of a generic BPF function to implement the actual
probe action, called from two distinct probe types.  Follow-up patches will
add more probe types, add more tracing features from the D language, add
support for D script compilation to BPF, etc.

The implementation makes use of libbpf for handling BPF ELF objects, and uses
the perf event output ring buffer (supported through BPF) to retrieve the
tracing data.  The next step in development will be adding support to libbpf
for programs using shared functions from a collection of functions included in
the BPF ELF object (as suggested by Alexei).  

The code is structured as follows:
 tools/dtrace/dtrace.c  = command line utility
 tools/dtrace/dt_bpf.c  = interface to libbpf
 tools/dtrace/dt_buffer.c   = perf event output buffer handling
 tools/dtrace/dt_fbt.c  = kprobes probe provider
 tools/dtrace/dt_syscall.c  = syscall tracepoint probe provider
 tools/dtrace/dt_probe.c= generic probe and probe provider handling code
  This implements a generic interface to the actual
  probe providers (dt_fbt and dt_syscall).
 tools/dtrace/dt_hash.c = general probe hashing implementation
 tools/dtrace/dt_utils.c= support code (manage list of online CPUs)
 tools/dtrace/dtrace.h  = API header file (used by BPF program source code)
 tools/dtrace/dtrace_impl.h = implementation header file
 tools/dtrace/bpf_sample.c  = sample BPF program using two probe types

I included an entry for the MAINTAINERS file.  I offer to actively maintain
this code, and to keep advancing its development.

Cheers,
Kris Van Hees


Re: [PATCH 1/1] tools/dtrace: initial implementation of DTrace

2019-07-08 Thread Kris Van Hees
On Mon, Jul 08, 2019 at 02:15:37PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jul 03, 2019 at 08:14:30PM -0700, Kris Van Hees escreveu:
> > This initial implementation of a tiny subset of DTrace functionality
> > provides the following options:
> > 
> > dtrace [-lvV] [-b bufsz] -s script
> > -b  set trace buffer size
> > -l  list probes (only works with '-s script' for now)
> > -s  enable or list probes for the specified BPF program
> > -V  report DTrace API version
> > 
> > The patch comprises quite a bit of code due to DTrace requiring a few
> > crucial components, even in its most basic form.
> > 
> > The code is structured around the command line interface implemented in
> > dtrace.c.  It provides option parsing and drives the three modes of
> > operation that are currently implemented:
> > 
> > 1. Report DTrace API version information.
> > Report the version information and terminate.
> > 
> > 2. List probes in BPF programs.
> > Initialize the list of probes that DTrace recognizes, load BPF
> > programs, parse all BPF ELF section names, resolve them into
> > known probes, and emit the probe names.  Then terminate.
> > 
> > 3. Load BPF programs and collect tracing data.
> > Initialize the list of probes that DTrace recognizes, load BPF
> > programs and attach them to their corresponding probes, set up
> > perf event output buffers, and start processing tracing data.
> > 
> > This implementation makes extensive use of BPF (handled by dt_bpf.c) and
> > the perf event output ring buffer (handled by dt_buffer.c).  DTrace-style
> > probe handling (dt_probe.c) offers an interface to probes that hides the
> > implementation details of the individual probe types by provider (dt_fbt.c
> > and dt_syscall.c).  Probe lookup by name uses a hashtable implementation
> > (dt_hash.c).  The dt_utils.c code populates a list of online CPU ids, so
> > we know what CPUs we can obtain tracing data from.
> > 
> > Building the tool is trivial because its only dependency (libbpf) is in
> > the kernel tree under tools/lib/bpf.  A simple 'make' in the tools/dtrace
> > directory suffices.
> > 
> > The 'dtrace' executable needs to run as root because BPF programs cannot
> > be loaded by non-root users.
> > 
> > Signed-off-by: Kris Van Hees 
> > Reviewed-by: David Mc Lean 
> > Reviewed-by: Eugene Loh 
> > ---
> >  MAINTAINERS|   6 +
> >  tools/dtrace/Makefile  |  88 ++
> >  tools/dtrace/bpf_sample.c  | 145 
> >  tools/dtrace/dt_bpf.c  | 188 +
> >  tools/dtrace/dt_buffer.c   | 331 +
> >  tools/dtrace/dt_fbt.c  | 201 ++
> >  tools/dtrace/dt_hash.c | 211 +++
> >  tools/dtrace/dt_probe.c| 230 ++
> >  tools/dtrace/dt_syscall.c  | 179 
> >  tools/dtrace/dt_utils.c| 132 +++
> >  tools/dtrace/dtrace.c  | 249 
> >  tools/dtrace/dtrace.h  |  13 ++
> >  tools/dtrace/dtrace_impl.h | 101 +++
> >  13 files changed, 2074 insertions(+)
> >  create mode 100644 tools/dtrace/Makefile
> >  create mode 100644 tools/dtrace/bpf_sample.c
> >  create mode 100644 tools/dtrace/dt_bpf.c
> >  create mode 100644 tools/dtrace/dt_buffer.c
> >  create mode 100644 tools/dtrace/dt_fbt.c
> >  create mode 100644 tools/dtrace/dt_hash.c
> >  create mode 100644 tools/dtrace/dt_probe.c
> >  create mode 100644 tools/dtrace/dt_syscall.c
> >  create mode 100644 tools/dtrace/dt_utils.c
> >  create mode 100644 tools/dtrace/dtrace.c
> >  create mode 100644 tools/dtrace/dtrace.h
> >  create mode 100644 tools/dtrace/dtrace_impl.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 606d1f80bc49..668468834865 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -5474,6 +5474,12 @@ W:   https://linuxtv.org
> >  S: Odd Fixes
> >  F: drivers/media/pci/dt3155/
> >  
> > +DTRACE
> > +M: Kris Van Hees 
> > +L: dtrace-de...@oss.oracle.com
> > +S: Maintained
> > +F: tools/dtrace/
> > +
> >  DVB_USB_AF9015 MEDIA DRIVER
> >  M: Antti Palosaari 
> >  L: linux-me...@vger.kernel.org
> > diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
> > new file mode 100644
> > index ..99fd0f9dd1d6
> > --- /dev/null
> > +++ b/tools/dtrace/Makefile
> > @@ -0,0 +1,88 @

Re: [PATCH 1/1] tools/dtrace: initial implementation of DTrace

2019-07-08 Thread Kris Van Hees
On Thu, Jul 04, 2019 at 03:03:36PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 08:14:30PM -0700, Kris Van Hees wrote:
> > +/*
> > + * Read the data_head offset from the header page of the ring buffer.  The
> > + * argument is declared 'volatile' because it references a memory mapped 
> > page
> > + * that the kernel may be writing to while we access it here.
> > + */
> > +static u64 read_rb_head(volatile struct perf_event_mmap_page *rb_page)
> > +{
> > +   u64 head = rb_page->data_head;
> > +
> > +   asm volatile("" ::: "memory");
> > +
> > +   return head;
> > +}
> > +
> > +/*
> > + * Write the data_tail offset in the header page of the ring buffer.  The
> > + * argument is declared 'volatile' because it references a memory mapped 
> > page
> > + * that the kernel may be writing to while we access it here.
> 
> s/writing/reading/

Thanks!

> > + */
> > +static void write_rb_tail(volatile struct perf_event_mmap_page *rb_page,
> > + u64 tail)
> > +{
> > +   asm volatile("" ::: "memory");
> > +
> > +   rb_page->data_tail = tail;
> > +}
> 
> That volatile usage is atrocious (kernel style would have you use
> {READ,WRITE}_ONCE()). Also your comments fail to mark these as
> load_acquire and store_release. And by only using a compiler barrier
> you're hard assuming TSO, which is somewhat fragile at best.
> 
> Alternatively, you can use the C11 bits and write:
> 
>   return __atomic_load_n(&rb_page->data_head, __ATOMIC_ACQUIRE);
> 
>   __atomic_store_n(&rb_page->data_tail, tail, __ATOMIC_RELEASE);

Perhaps I should just use ring_buffer_read_head() and ring_buffer_write_tail()
since they are provided in tools/include/linux/ring_buffer.h?  I expect that
would be even more preferable over __atomic_load_n() and __atomic_store_n()?
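
(For reference, those helpers do live in tools/include/linux/ring_buffer.h
and wrap the load-acquire/store-release semantics discussed above.  A
minimal consumer loop using them could look like the sketch below; the
function name and the rb_data/rb_size parameters are illustrative, and
event parsing and wrap-around handling are elided:

    #include <linux/perf_event.h>
    #include <linux/ring_buffer.h>  /* the tools/include variant */

    static void process_events(struct perf_event_mmap_page *rb_page,
                               char *rb_data, __u64 rb_size)
    {
            __u64 head = ring_buffer_read_head(rb_page);    /* acquire */
            __u64 tail = rb_page->data_tail;

            while (tail < head) {
                    struct perf_event_header *hdr;

                    hdr = (struct perf_event_header *)
                            (rb_data + (tail & (rb_size - 1)));
                    /* ... consume the event; note a record may wrap
                     * past the end of the data area ...
                     */
                    tail += hdr->size;
            }

            ring_buffer_write_tail(rb_page, tail);  /* release */
    }
)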

> > +/*
> > + * Process and output the probe data at the supplied address.
> > + */
> > +static int output_event(int cpu, u64 *buf)
> > +{
> > +   u8  *data = (u8 *)buf;
> > +   struct perf_event_header*hdr;
> > +
> > +   hdr = (struct perf_event_header *)data;
> > +   data += sizeof(struct perf_event_header);
> > +
> > +   if (hdr->type == PERF_RECORD_SAMPLE) {
> > +   u8  *ptr = data;
> > +   u32 i, size, probe_id;
> > +
> > +   /*
> > +* struct {
> > +*  struct perf_event_headerheader;
> > +*  u32 size;
> > +*  u32 probe_id;
> > +*  u32 gap;
> > +*  u64 data[n];
> > +* }
> > +* and data points to the 'size' member at this point.
> > +*/
> > +   if (ptr > (u8 *)buf + hdr->size) {
> > +   fprintf(stderr, "BAD: corrupted sample header\n");
> > +   goto out;
> > +   }
> > +
> > +   size = *(u32 *)data;
> > +   data += sizeof(size);
> > +   ptr += sizeof(size) + size;
> > +   if (ptr != (u8 *)buf + hdr->size) {
> > +   fprintf(stderr, "BAD: invalid sample size\n");
> > +   goto out;
> > +   }
> > +
> > +   probe_id = *(u32 *)data;
> > +   data += sizeof(probe_id);
> > +   size -= sizeof(probe_id);
> > +   data += sizeof(u32);/* skip 32-bit gap */
> > +   size -= sizeof(u32);
> > +   buf = (u64 *)data;
> > +
> > +   printf("%3d %6d ", cpu, probe_id);
> > +   for (i = 0, size /= sizeof(u64); i < size; i++)
> > +   printf("%#016lx ", buf[i]);
> > +   printf("\n");
> > +   } else if (hdr->type == PERF_RECORD_LOST) {
> > +   u64 lost;
> > +
> > +   /*
> > +* struct {
> > +*  struct perf_event_headerheader;
> > +*  u64 id;
> > +*  u64 lost;
> > +* }
> > +* and data points to the 'id' member at this point.
> > +*/
> > +   lost = *(u64 *)(data + sizeof(u64));
> > +
> > +   printf("[%ld probes dropped]\n", lost);
> > 

Re: [PATCH 1/1] tools/dtrace: initial implementation of DTrace

2019-07-08 Thread Kris Van Hees
On Thu, Jul 04, 2019 at 03:05:09PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 08:14:30PM -0700, Kris Van Hees wrote:
> > +int dt_bpf_attach(int event_id, int bpf_fd)
> > +{
> > +   int event_fd;
> > +   int rc;
> > +   struct perf_event_attr  attr = {};
> > +
> > +   attr.type = PERF_TYPE_TRACEPOINT;
> > +   attr.sample_type = PERF_SAMPLE_RAW;
> > +   attr.sample_period = 1;
> > +   attr.wakeup_events = 1;
> > +   attr.config = event_id;
> > +
> > +   /* Register the event (based on its id), and obtain a fd. */
> > +   event_fd = perf_event_open(&attr, -1, 0, -1, 0);
> > +   if (event_fd < 0) {
> > +   perror("sys_perf_event_open");
> > +   return -1;
> > +   }
> > +
> > +   /* Enable the probe. */
> > +   rc = ioctl(event_fd, PERF_EVENT_IOC_ENABLE, 0);
> 
> AFAICT you didn't use attr.disabled = 1, so this IOC_ENABLE is
> completely superfluous.

Oh yes, good point (and the same applies to the dt_buffer.c code where I set
up the events that own each buffer - no point in doing an explicit enable there
either).

Thanks for catching this!

> > +   if (rc < 0) {
> > +   perror("PERF_EVENT_IOC_ENABLE");
> > +   return -1;
> > +   }
> > +
> > +   /* Associate the BPF program with the event. */
> > +   rc = ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, bpf_fd);
> > +   if (rc < 0) {
> > +   perror("PERF_EVENT_IOC_SET_BPF");
> > +   return -1;
> > +   }
> > +
> > +   return 0;
> > +}
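
(For contrast, a sketch of the pattern where the explicit enable does earn
its keep; attach_probe() is a made-up name and perf_event_open() is assumed
to be the same syscall wrapper used elsewhere in the patch.  With a zeroed
attr, attr.disabled is 0 and the event is live as soon as perf_event_open()
returns, which is what makes the IOC_ENABLE above redundant:

    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/perf_event.h>

    static int attach_probe(int event_id, int bpf_fd)
    {
            struct perf_event_attr  attr = {};
            int fd;

            attr.type = PERF_TYPE_TRACEPOINT;
            attr.config = event_id;
            attr.disabled = 1;      /* start the event switched off */

            fd = perf_event_open(&attr, -1, 0, -1, 0);
            if (fd < 0)
                    return -1;

            /* Attach the BPF program first, then turn the event on. */
            if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_fd) < 0 ||
                ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
                    close(fd);
                    return -1;
            }

            return fd;
    }
)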


[PATCH 1/1] tools/dtrace: initial implementation of DTrace

2019-07-03 Thread Kris Van Hees
This initial implementation of a tiny subset of DTrace functionality
provides the following options:

dtrace [-lvV] [-b bufsz] -s script
-b  set trace buffer size
-l  list probes (only works with '-s script' for now)
-s  enable or list probes for the specified BPF program
-V  report DTrace API version

The patch comprises quite a bit of code due to DTrace requiring a few
crucial components, even in its most basic form.

The code is structured around the command line interface implemented in
dtrace.c.  It provides option parsing and drives the three modes of
operation that are currently implemented:

1. Report DTrace API version information.
Report the version information and terminate.

2. List probes in BPF programs.
Initialize the list of probes that DTrace recognizes, load BPF
programs, parse all BPF ELF section names, resolve them into
known probes, and emit the probe names.  Then terminate.

3. Load BPF programs and collect tracing data.
Initialize the list of probes that DTrace recognizes, load BPF
programs and attach them to their corresponding probes, set up
perf event output buffers, and start processing tracing data.

This implementation makes extensive use of BPF (handled by dt_bpf.c) and
the perf event output ring buffer (handled by dt_buffer.c).  DTrace-style
probe handling (dt_probe.c) offers an interface to probes that hides the
implementation details of the individual probe types by provider (dt_fbt.c
and dt_syscall.c).  Probe lookup by name uses a hashtable implementation
(dt_hash.c).  The dt_utils.c code populates a list of online CPU ids, so
we know what CPUs we can obtain tracing data from.

Building the tool is trivial because its only dependency (libbpf) is in
the kernel tree under tools/lib/bpf.  A simple 'make' in the tools/dtrace
directory suffices.

The 'dtrace' executable needs to run as root because BPF programs cannot
be loaded by non-root users.

Signed-off-by: Kris Van Hees 
Reviewed-by: David Mc Lean 
Reviewed-by: Eugene Loh 
---
 MAINTAINERS|   6 +
 tools/dtrace/Makefile  |  88 ++
 tools/dtrace/bpf_sample.c  | 145 
 tools/dtrace/dt_bpf.c  | 188 +
 tools/dtrace/dt_buffer.c   | 331 +
 tools/dtrace/dt_fbt.c  | 201 ++
 tools/dtrace/dt_hash.c | 211 +++
 tools/dtrace/dt_probe.c| 230 ++
 tools/dtrace/dt_syscall.c  | 179 
 tools/dtrace/dt_utils.c| 132 +++
 tools/dtrace/dtrace.c  | 249 
 tools/dtrace/dtrace.h  |  13 ++
 tools/dtrace/dtrace_impl.h | 101 +++
 13 files changed, 2074 insertions(+)
 create mode 100644 tools/dtrace/Makefile
 create mode 100644 tools/dtrace/bpf_sample.c
 create mode 100644 tools/dtrace/dt_bpf.c
 create mode 100644 tools/dtrace/dt_buffer.c
 create mode 100644 tools/dtrace/dt_fbt.c
 create mode 100644 tools/dtrace/dt_hash.c
 create mode 100644 tools/dtrace/dt_probe.c
 create mode 100644 tools/dtrace/dt_syscall.c
 create mode 100644 tools/dtrace/dt_utils.c
 create mode 100644 tools/dtrace/dtrace.c
 create mode 100644 tools/dtrace/dtrace.h
 create mode 100644 tools/dtrace/dtrace_impl.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 606d1f80bc49..668468834865 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5474,6 +5474,12 @@ W:   https://linuxtv.org
 S: Odd Fixes
 F: drivers/media/pci/dt3155/
 
+DTRACE
+M: Kris Van Hees 
+L: dtrace-de...@oss.oracle.com
+S: Maintained
+F: tools/dtrace/
+
 DVB_USB_AF9015 MEDIA DRIVER
 M: Antti Palosaari 
 L: linux-me...@vger.kernel.org
diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
new file mode 100644
index ..99fd0f9dd1d6
--- /dev/null
+++ b/tools/dtrace/Makefile
@@ -0,0 +1,88 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# This Makefile is based on samples/bpf.
+#
+# Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+
+DT_VERSION := 2.0.0
+DT_GIT_VERSION := $(shell git rev-parse HEAD 2>/dev/null || \
+  echo Unknown)
+
DTRACE_PATH    ?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(DTRACE_PATH)/..
+SAMPLES_PATH   := $(DTRACE_PATH)/../../samples
+
+hostprogs-y:= dtrace
+
+LIBBPF := $(TOOLS_PATH)/lib/bpf/libbpf.a
+OBJS   := dt_bpf.o dt_buffer.o dt_utils.o dt_probe.o \
+  dt_hash.o \
+  dt_fbt.o dt_syscall.o
+
+dtrace-objs:= $(OBJS) dtrace.o
+
+always := $(hostprogs-y)
+always += bpf_sample.o
+
+KBUILD_HOSTCFLAGS  += -DDT_VERSION=\"$(DT_VERSION)\"
+KBUILD_HOSTCFLAGS  += -DDT_GIT_VERSION=\"$(DT_GIT_VERSION)\"
+KBUILD_HOSTCFLAGS  +=

[PATCH 0/1] tools/dtrace: initial implementation of DTrace

2019-07-03 Thread Kris Van Hees
This patch is also available, applied to bpf-next, at the following URL:

https://github.com/oracle/dtrace-linux-kernel/tree/dtrace-bpf

As suggested in feedback to my earlier patch submissions, this code takes an
approach to avoid kernel code changes as much as possible.  The current patch
does not involve any kernel code changes.  Further development of this code
will continue with this approach, incrementally adding features to this first
minimal implementation.  The goal is a fully featured and functional DTrace
implementation involving kernel changes only when strictly necessary.

The code presented here supports two very basic functions:

1. Listing probes that are used in BPF programs

   # dtrace -l -s bpf_sample.o
  ID   PROVIDERMODULE  FUNCTION NAME
   18876fbt   vmlinuxksys_write entry
   70423syscall   vmlinux write entry

2. Loading BPF tracing programs and collecting data that they generate

   # dtrace -s bpf_sample.o
   CPU ID
15  70423 0x8c0968bf8ec0 0x01 0x0055e019eb3f60 
0x2c
15  18876 0x8c0968bf8ec0 0x01 0x0055e019eb3f60 
0x2c
   ...

Only kprobes and syscall tracepoints are supported since this is an initial
patch.  It does show the use of a generic BPF function to implement the actual
probe action, called from two distinct probe types.  Follow-up patches will
add more probe types, add more tracing features from the D language, add
support for D script compilation to BPF, etc.

The implementation makes use of libbpf for handling BPF ELF objects, and uses
the perf event output ring buffer (supported through BPF) to retrieve the
tracing data.  The next step in development will be adding support to libbpf
for programs using shared functions from a collection of functions included in
the BPF ELF object (as suggested by Alexei).  

The code is structured as follows:
 tools/dtrace/dtrace.c  = command line utility
 tools/dtrace/dt_bpf.c  = interface to libbpf
 tools/dtrace/dt_buffer.c   = perf event output buffer handling
 tools/dtrace/dt_fbt.c  = kprobes probe provider
 tools/dtrace/dt_syscall.c  = syscall tracepoint probe provider
 tools/dtrace/dt_probe.c= generic probe and probe provider handling code
  This implements a generic interface to the actual
  probe providers (dt_fbt and dt_syscall).
 tools/dtrace/dt_hash.c = general probe hashing implementation
 tools/dtrace/dt_utils.c= support code (manage list of online CPUs)
 tools/dtrace/dtrace.h  = API header file (used by BPF program source code)
 tools/dtrace/dtrace_impl.h = implementation header file
 tools/dtrace/bpf_sample.c  = sample BPF program using two probe types

I included an entry for the MAINTAINERS file.  I offer to actively maintain
this code, and to keep advancing its development.

Cheers,
Kris Van Hees


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-06-17 Thread Kris Van Hees
On Mon, Jun 17, 2019 at 08:01:52PM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 17, 2019 at 6:54 PM Kris Van Hees  
> wrote:
> >
> > It is not hypothetical.  The following example works fine:
> >
> > static int noinline bpf_action(void *ctx, long fd, long buf, long count)
> > {
> > int cpu = bpf_get_smp_processor_id();
> > struct data {
> > u64 arg0;
> > u64 arg1;
> > u64 arg2;
> > }   rec;
> >
> > memset(&rec, 0, sizeof(rec));
> >
> > rec.arg0 = fd;
> > rec.arg1 = buf;
> > rec.arg2 = count;
> >
> > bpf_perf_event_output(ctx, , cpu, &rec, sizeof(rec));
> >
> > return 0;
> > }
> >
> > SEC("kprobe/ksys_write")
> > int bpf_kprobe(struct pt_regs *ctx)
> > {
> > return bpf_action(ctx, ctx->di, ctx->si, ctx->dx);
> > }
> >
> > SEC("tracepoint/syscalls/sys_enter_write")
> > int bpf_tp(struct syscalls_enter_write_args *ctx)
> > {
> > return bpf_action(ctx, ctx->fd, ctx->buf, ctx->count);
> > }
> >
> > char _license[] SEC("license") = "GPL";
> > u32 _version SEC("version") = LINUX_VERSION_CODE;
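
(The example relies on a perf event array map being declared alongside the
programs; its declaration, the second argument to bpf_perf_event_output()
above, is not shown.  A minimal version in the libbpf conventions of that
era, with "buffers" as an assumed map name, would be roughly:

    /* Assumed perf event array map; bpf_perf_event_output() writes each
     * record into the per-CPU buffer selected by the third argument.
     */
    struct bpf_map_def SEC("maps") buffers = {
            .type           = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
            .key_size       = sizeof(int),
            .value_size     = sizeof(__u32),
            .max_entries    = 64,   /* >= number of possible CPUs */
    };
)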
> 
> Great. Then you're all set to proceed with user space dtrace tooling, right?

I can indeed proceed with the initial basics, yes, and have started.  I hope
to have a first bare bones patch for review sometime next week.

> What you'll discover thought that it works only for simplest things
> like above. libbpf assumes that everything in single elf will be used
> and passes the whole thing to the kernel.
> The verifer removes dead code only from single program.
> It disallows unused functions. Hence libbpf needs to start doing
> more "linker work" than it does today.
> When it loads .o it needs to pass to the kernel only the functions
> that are used by the program.
> This work should be straightforward to implement.
> Unfortunately no one had time to do it.

Ah yes, I see what you mean.  I'll work on that next since I will definitely
be needing that.

> It's also going to be the first step to multi-elf support.
> libbpf would need to do the same "linker work" across .o-s.


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-06-17 Thread Kris Van Hees
On Mon, Jun 17, 2019 at 06:32:22PM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 17, 2019 at 6:25 PM Kris Van Hees  
> wrote:
> >
> > On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
> >
> > << stuff skipped because it is not relevant to the technical discussion... 
> > >>
> >
> > > > > In particular you brought up a good point that there is a use case
> > > > > for sharing a piece of bpf program between kprobe and tracepoint 
> > > > > events.
> > > > > The better way to do that is via bpf2bpf call.
> > > > > Example:
> > > > > void bpf_subprog(arbitrary args)
> > > > > {
> > > > > }
> > > > >
> > > > > SEC("kprobe/__set_task_comm")
> > > > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > > > {
> > > > >   bpf_subprog(...);
> > > > > }
> > > > >
> > > > > SEC("tracepoint/sched/sched_switch")
> > > > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > > > {
> > > > >   bpf_subprog(...);
> > > > > }
> > > > >
> > > > > Such configuration is not supported by the verifier yet.
> > > > > We've been discussing it for some time, but no work has started,
> > > > > since there was no concrete use case.
> > > > > If you can work on adding support for it everyone will benefit.
> > > > >
> > > > > Could you please consider doing that as a step forward?
> > > >
> > > > This definitely looks to be an interesting addition and I am happy to 
> > > > look into
> > > > that further.  I have a few questions that I hope you can shed light 
> > > > on...
> > > >
> > > > 1. What context would bpf_subprog execute with?  If it can be called 
> > > > from
> > > >multiple different prog types, would it see whichever context the 
> > > > caller
> > > >is executing with?  Or would you envision bpf_subprog to not be 
> > > > allowed to
> > > >access the execution context because it cannot know which one is in 
> > > > use?
> > >
> > > bpf_subprog() won't be able to access 'ctx' pointer _if_ it's ambiguous.
> > > The verifier already smart enough to track all the data flow, so it's 
> > > fine to
> > > pass 'struct pt_regs *ctx' as long as it's accessed safely.
> > > For example:
> > > void bpf_subprog(int kind, struct pt_regs *ctx1, struct sched_switch_args 
> > > *ctx2)
> > > {
> > >   if (kind == 1)
> > >  bpf_printk("%d", ctx1->pc);
> > >   if (kind == 2)
> > >  bpf_printk("%d", ctx2->next_pid);
> > > }
> > >
> > > SEC("kprobe/__set_task_comm")
> > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > {
> > >   bpf_subprog(1, ctx, NULL);
> > > }
> > >
> > > SEC("tracepoint/sched/sched_switch")
> > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > {
> > >   bpf_subprog(2, NULL, ctx);
> > > }
> > >
> > > The verifier should be able to prove that the above is correct.
> > > It can do so already if s/ctx1/map_value1/, s/ctx2/map_value2/
> > > What's missing is an ability to have more than one 'starting' or 'root 
> > > caller'
> > > program.
> > >
> > > Now replace SEC("tracepoint/sched/sched_switch") with 
> > > SEC("cgroup/ingress")
> > > and it's becoming clear that BPF_PROG_TYPE_PROBE approach is not good 
> > > enough, right?
> > > Folks are already sharing the bpf progs between kprobe and networking.
> > > Currently it's done via code duplication and actual sharing happens via 
> > > maps.
> > > That's not ideal, hence we've been discussing 'shared library' approach 
> > > for
> > > quite some time. We need a way to support common bpf functions that can 
> > > be called
> > > from networking and from tracing programs.
> > >
> > > > 2. Given that BPF programs are loaded with a specification of the prog 
> > > > type,
> > > >how would one load a code construct as the one you outline above?  
> > > > How can
> > > >you load a BPF function and have it be used as subprog from programs 
> > &g

Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-06-17 Thread Kris Van Hees
On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:

<< stuff skipped because it is not relevant to the technical discussion... >>

> > > In particular you brought up a good point that there is a use case
> > > for sharing a piece of bpf program between kprobe and tracepoint events.
> > > The better way to do that is via bpf2bpf call.
> > > Example:
> > > void bpf_subprog(arbitrary args)
> > > {
> > > }
> > > 
> > > SEC("kprobe/__set_task_comm")
> > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > SEC("tracepoint/sched/sched_switch")
> > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > Such configuration is not supported by the verifier yet.
> > > We've been discussing it for some time, but no work has started,
> > > since there was no concrete use case.
> > > If you can work on adding support for it everyone will benefit.
> > > 
> > > Could you please consider doing that as a step forward?
> > 
> > This definitely looks to be an interesting addition and I am happy to look 
> > into
> > that further.  I have a few questions that I hope you can shed light on...
> > 
> > 1. What context would bpf_subprog execute with?  If it can be called from
> >   multiple different prog types, would it see whichever context the caller
> >   is executing with?  Or would you envision bpf_subprog to not be allowed 
> > to
> >   access the execution context because it cannot know which one is in use?
> 
> bpf_subprog() won't be able to access 'ctx' pointer _if_ it's ambiguous.
> The verifier is already smart enough to track all the data flow, so it's fine to
> pass 'struct pt_regs *ctx' as long as it's accessed safely.
> For example:
> void bpf_subprog(int kind, struct pt_regs *ctx1, struct sched_switch_args 
> *ctx2)
> {
>   if (kind == 1)
>  bpf_printk("%d", ctx1->pc);
>   if (kind == 2)
>  bpf_printk("%d", ctx2->next_pid);
> }
> 
> SEC("kprobe/__set_task_comm")
> int bpf_prog_kprobe(struct pt_regs *ctx)
> {
>   bpf_subprog(1, ctx, NULL);
> }
> 
> SEC("tracepoint/sched/sched_switch")
> int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> {
>   bpf_subprog(2, NULL, ctx);
> }
> 
> The verifier should be able to prove that the above is correct.
> It can do so already if s/ctx1/map_value1/, s/ctx2/map_value2/
> What's missing is an ability to have more than one 'starting' or 'root caller'
> program.
> 
> Now replace SEC("tracepoint/sched/sched_switch") with SEC("cgroup/ingress")
> and it's becoming clear that BPF_PROG_TYPE_PROBE approach is not good enough, 
> right?
> Folks are already sharing the bpf progs between kprobe and networking.
> Currently it's done via code duplication and actual sharing happens via maps.
> That's not ideal, hence we've been discussing 'shared library' approach for
> quite some time. We need a way to support common bpf functions that can be 
> called
> from networking and from tracing programs.
> 
> > 2. Given that BPF programs are loaded with a specification of the prog 
> > type, 
> >   how would one load a code construct as the one you outline above?  How 
> > can
> >   you load a BPF function and have it be used as subprog from programs that
> >   are loaded separately?  I.e. in the sample above, if bpf_subprog is 
> > loaded
> >   as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
> >   referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
> >   loaded separately?
> 
> The api to support shared libraries was discussed, but not yet implemented.
> We've discussed 'FD + name' approach.
> FD identifies a loaded program (which is root program + a set of subprogs)
> and other programs can be loaded at any time later. The BPF_CALL instructions
> in such later program would refer to older subprogs via FD + name.
> Note that both tracing and networking progs can be part of single elf file.
> libbpf has to be smart to load progs into kernel step by step
> and reusing subprogs that are already loaded.
> 
> Note that libbpf work for such feature can begin _without_ kernel changes.
> libbpf can pass bpf_prog_kprobe+bpf_subprog as a single program first,
> then pass bpf_prog_tracepoint+bpf_subprog second (as a separate program).
> The bpf_subprog will be duplicated and JITed twice, but sharing will happen
> because data structures (maps, global and static data) will be shared.
> This way the support for 'pseudo shared libraries' can begin.
> (later accompanied by FD+name kernel support)

As far as I can determine, the current libbpf implementation is already able
to do the duplication of the called function, even when the ELF object contains
programs of different program types.  I.e. the example you give at the top
of the email actually seems to work already.  Right?

In that case, I am a bit unsure what more can be done on the side of libbpf
without needing kernel changes?
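
For concreteness, here is a minimal sketch of such an ELF object (written
against current libbpf conventions; all names are made up for illustration).
libbpf loads each entry point as a separate program, duplicating the subprog,
while the map is shared between them:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} hits SEC(".maps");

/* __noinline forces a bpf2bpf call instead of inlining */
static __noinline void bump(void)
{
	__u32 key = 0;
	__u64 *val = bpf_map_lookup_elem(&hits, &key);

	if (val)
		__sync_fetch_and_add(val, 1);
}

SEC("kprobe/__set_task_comm")
int bpf_prog_kprobe(struct pt_regs *ctx)
{
	bump();
	return 0;
}

SEC("tracepoint/sched/sched_switch")
int bpf_prog_tracepoint(void *ctx)
{
	bump();
	return 0;
}

char _license[] SEC("license") = "GPL";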

> There are other things we discussed. Ideally the 

Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-06-06 Thread Kris Van Hees
On Fri, May 31, 2019 at 03:25:25PM +, Chris Mason wrote:
> 
> I'm being pretty liberal with chopping down quoted material to help 
> emphasize a particular opinion about how to bootstrap existing 
> out-of-tree projects into the kernel.  My goal here is to talk more 
> about the process and less about the technical details, so please 
> forgive me if I've ignored or changed the technical meaning of anything 
> below.
> 
> On 30 May 2019, at 12:15, Kris Van Hees wrote:
> 
> > On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
> >
> > ... I believe that the discussion that has been going on in other
> > emails has shown that while introducing a program type that provides a
> > generic (abstracted) context is a different approach from what has 
> > been done
> > so far, it is a new use case that provides for additional ways in 
> > which BPF
> > can be used.
> >
> 
> [ ... ]
> 
> >
> > Yes and no.  It depends on what you are trying to do with the BPF 
> > program that
> > is attached to the different events.  From a tracing perspective, 
> > providing a
> > single BPF program with an abstract context would ...
> 
> [ ... ]
> 
> >
> > In this model kprobe/ksys_write and 
> > tracepoint/syscalls/sys_enter_write are
> > equivalent for most tracing purposes ...
> 
> [ ... ]
> 
> >
> > I agree with what you are saying but I am presenting an additional use 
> > case
> 
> [ ... ]
> 
> >>
> >> All that aside the kernel support for shared libraries is an awesome
> >> feature to have and a bunch of folks want to see it happen, but
> >> it's not a blocker for 'dtrace to bpf' user space work.
> >> libbpf can be taught to do this 'pseudo shared library' feature
> >> while 'dtrace to bpf' side doesn't need to do anything special.
> 
> [ ... ]
> 
> This thread intermixes some abstract conceptual changes with smaller 
> technical improvements, and in general it follows a familiar pattern 
> other out-of-tree projects have hit while trying to adapt the kernel to 
> their existing code.  Just from this one email, I quoted the abstract 
> models with use cases etc, and this is often where the discussions side 
> track into less productive areas.
> 
> >
> > So you are basically saying that I should redesign DTrace?
> 
> In your place, I would have removed features and adapted dtrace as much 
> as possible to require the absolute minimum of kernel patches, or even 
> better, no patches at all.  I'd document all of the features that worked 
> as expected, and underline anything either missing or suboptimal that 
> needed additional kernel changes.  Then I'd focus on expanding the 
> community of people using dtrace against the mainline kernel, and work 
> through the series features and improvements one by one upstream over 
> time.

Well, that is actually what I am doing in the sense that the proposed patches
are quite minimal and lie at the core of the style of tracing that we need to
support.  So I definitely agree with your statement.  The code I posted
implements a minimal set of features (hardly any at all), although as Peter
pointed out, some more can be stripped from it and I have done that already
in a revision of the patchset I was preparing.

> Your current approach relies on an all-or-nothing landing of patches 
> upstream, and this consistently leads to conflict every time a project 
> tries it.  A more incremental approach will require bigger changes on 
> the dtrace application side, but over time it'll be much easier to 
> justify your kernel changes.  You won't have to talk in abstract models, 
> and you'll have many more concrete examples of people asking for dtrace 
> features against mainline.  Most importantly, you'll make dtrace 
> available on more kernels than just the absolute latest mainline, and 
> removing dependencies makes the project much easier for new users to 
> try.

I am not sure where I gave the impression that my approach relies on an
all-or-nothing landing of patches.  My intent (and the content of the patches
reflects that I think) was to work from a minimal base and build on that,
adding things as needed.  Granted, it depends on a rather crucial feature in
the design that apparently should be avoided for now as well, and I can
definitely work on avoiding that for now.  But I hope that it is clear from
the patch set I posted that an incremental approach is indeed what I intend
to do.

Thank you for putting it in clear terms and explaining pitfalls that have
been observed in the past with projects.  I will proceed with an even more
minimalist approach.

To that end, could you advise on whom patches should be Cc'd to in order to have the
first minimal code submitted to a tools/dtrace directory in the kernel tree?

Kris


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-30 Thread Kris Van Hees
On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
> On Thu, May 23, 2019 at 01:16:08AM -0400, Kris Van Hees wrote:
> > On Wed, May 22, 2019 at 01:16:25PM -0700, Alexei Starovoitov wrote:
> > > On Wed, May 22, 2019 at 12:12:53AM -0400, Kris Van Hees wrote:
> > > > 
> > > > Could you elaborate on why you believe my patches are not adding generic
> > > > features?  I can certainly agree that the DTrace-specific portions are 
> > > > less
> > > > generic (although they are certainly available for anyone to use), but I
> > > > don't quite understand why the new features are deemed non-generic and 
> > > > why
> > > > you believe no one else can use this?
> > > 
> > > And once again your statement above contradicts your own patches.
> > > The patch 2 adds new prog type BPF_PROG_TYPE_DTRACE and the rest of the 
> > > patches
> > > are tying everything to it.
> > > This approach contradicts bpf philosophy of being generic execution engine
> > > and not favoriting one program type vs another.
> > 
> > I am not sure I understand where you see a contradiction.  What I posted is
> > a generic feature, and sample code that demonstrates how it can be used 
> > based
> > on the use-case that I am currently working on.  So yes, the sample code is
> > very specific but it does not restrict the use of the cross-prog-type 
> > tail-call
> > feature.  That feature is designed to be generic.
> > 
> > Probes come in different types (kprobe, tracepoint, perf event, ...) and 
> > they
> > each have their own very specific data associated with them.  I agree 100%
> > with you on that.  And sometimes tracing makes use of those specifics.  But
> > even from looking at the implementation of the various probe related prog
> > types (and e.g. the list of helpers they each support) it shows that there 
> > is
> > a lot of commonality as well.  That common functionality is common to all 
> > the
> > probe program types, and that is where I suggest introducing a program type
> > that captures the common concept of a probe, so perhaps a better name would
> > be BPF_PROG_TYPE_PROBE.
> > 
> > The principle remains the same though...  I am proposing adding support for
> > program types that provide common functionality so that programs for various
> > program types can make use of the more generic programs stored in prog 
> > arrays.
> 
> Except that prog array is indirect call based and got awfully slow due
> to retpoline and we're trying to redesign the whole tail_call approach.
> So more extensions to tail_call facility is the opposite of that direction.

OK, I see the point of retpoline having slowed down tail_call.  Do you have
any suggestions on how to accomplish the concept that I am proposing in a
different way?  I believe that the discussion that has been going on in other
emails has shown that while introducing a program type that provides a
generic (abstracted) context is a different approach from what has been done
so far, it is a new use case that provides for additional ways in which BPF
can be used.

> > > I have nothing against dtrace language and dtrace scripts.
> > > Go ahead and compile them into bpf.
> > > All patches to improve bpf infrastructure are very welcomed.
> > > 
> > > In particular you brought up a good point that there is a use case
> > > for sharing a piece of bpf program between kprobe and tracepoint events.
> > > The better way to do that is via bpf2bpf call.
> > > Example:
> > > void bpf_subprog(arbitrary args)
> > > {
> > > }
> > > 
> > > SEC("kprobe/__set_task_comm")
> > > int bpf_prog_kprobe(struct pt_regs *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > SEC("tracepoint/sched/sched_switch")
> > > int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> > > {
> > >   bpf_subprog(...);
> > > }
> > > 
> > > Such configuration is not supported by the verifier yet.
> > > We've been discussing it for some time, but no work has started,
> > > since there was no concrete use case.
> > > If you can work on adding support for it everyone will benefit.
> > > 
> > > Could you please consider doing that as a step forward?
> > 
> > This definitely looks to be an interesting addition and I am happy to look 
> > into
> > that further.  I have a few questions that I hope you can shed light on...
> > 
> > 1. What context would bpf_

Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-23 Thread Kris Van Hees
On Thu, May 23, 2019 at 07:08:51PM -0700, Alexei Starovoitov wrote:
> On Thu, May 23, 2019 at 09:57:37PM -0400, Steven Rostedt wrote:
> > On Thu, 23 May 2019 17:31:50 -0700
> > Alexei Starovoitov  wrote:
> > 
> > 
> > > > Now from what I'm reading, it seems that the Dtrace layer may be
> > > > abstracting out fields from the kernel. This is actually something I
> > > > have been thinking about to solve the "tracepoint abi" issue. There's
> > > > usually basic ideas that happen. An interrupt goes off, there's a
> > > > handler, etc. We could abstract that out that we trace when an
> > > > interrupt goes off and the handler happens, and record the vector
> > > > number, and/or what device it was for. We have tracepoints in the
> > > > kernel that do this, but they do depend a bit on the implementation.
> > > > Now, if we could get a layer that abstracts this information away from
> > > > the implementation, then I think that's a *good* thing.  
> > > 
> > > I don't like this deferred irq idea at all.
> > 
> > What do you mean deferred?
> 
> that's how I interpreted your proposal: 
> "interrupt goes off and the handler happens, and record the vector number"
> It's not a good thing to tell about irq later.
> Just like saying lets record perf counter event and report it later.

The abstraction I mentioned does not defer anything - it merely provides a way
for all probe events to be processed as a generic probe with a set of values
associated with it (e.g. syscall arguments for a syscall entry probe).  The
program that implements what needs to happen when that probe fires still does
whatever is necessary to collect information, and dump data in the output
buffers before execution continues.

I could trace entry into a syscall by using a syscall entry tracepoint or by
putting a kprobe on the syscall function itself.  I am usually interested in
whether the syscall was called, what the arguments were, and perhaps I need to
collect some other data related to it.  More often than not, both probes would
get the job done.  With an abstraction that hides the implementation details
of the probe mechanism itself, both cases are essentially the same.

> > > Abstracting details from the users is _never_ a good idea.
> > 
> > Really? Most everything we do is to abstract details from the user. The
> > key is to make the abstraction more meaningful than the raw data.
> > 
> > > A ton of people use bcc scripts and bpftrace because they want those 
> > > details.
> > > They need to know what kernel is doing to make better decisions.
> > > Delaying irq record is the opposite.
> > 
> > I never said anything about delaying the record. Just getting the
> > information that is needed.
> > 
> > > > 
> > > > I wish that was totally true, but tracepoints *can* be an abi. I had
> > > > code reverted because powertop required one to be a specific
> > > > format. To this day, the wakeup event has a "success" field that
> > > > writes in a hardcoded "1", because there's tools that depend on it,
> > > > and they only work if there's a success field and the value is 1.  
> > > 
> > > I really think that you should put powertop nightmares to rest.
> > > That was long ago. The kernel is different now.
> > 
> > Is it?
> > 
> > > Linus made it clear several times that it is ok to change _all_
> > > tracepoints. Period. Some maintainers somehow still don't believe
> > > that they can do it.
> > 
> > From what I remember him saying several times, is that you can change
> > all tracepoints, but if it breaks a tool that is useful, then that
> > change will get reverted. He will allow you to go and fix that tool and
> > bring back the change (which was the solution to powertop).
> 
> my interpretation is different.
> We changed tracepoints. It broke scripts. People changed scripts.

In my world, the sequence is more like: tracepoints get changed, scripts
break, I fix the provider (abstraction), scripts work again.  Users really
appreciate that aspect because many of our users are not kernel experts.

> > > Some tracepoints are used more than others and more people will
> > > complain: "ohh I need to change my script" when that tracepoint
> > > changes. But the kernel development is not going to be hampered by a
> > > tracepoint. No matter how widespread its usage in scripts.
> > 
> > That's because we'll treat bpf (and Dtrace) scripts like modules (no
> > abi), at least we better. But if there's a tool that doesn't use the
> > script and reads the tracepoint directly via perf, then that's a
> > different story.
> 
> absolutely not.
> tracepoint is a tracepoint. It can change regardless of what
> and how is using it.


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-23 Thread Kris Van Hees
On Thu, May 23, 2019 at 07:02:43PM -0400, Steven Rostedt wrote:
> On Thu, 23 May 2019 14:13:31 -0700
> Alexei Starovoitov  wrote:
> 
> > > In DTrace, people write scripts based on UAPI-style interfaces and they 
> > > don't
> > > have to concern themselves with e.g. knowing how to get the value of the 
> > > 3rd
> > > argument that was passed by the firing probe.  All they need to know is 
> > > that
> > > the probe will have a 3rd argument, and that the 3rd argument to *any* 
> > > probe
> > > can be accessed as 'arg2' (or args[2] for typed arguments, if the 
> > > provider is
> > > capable of providing that).  Different probes have different ways of 
> > > passing
> > > arguments, and only the provider code for each probe type needs to know 
> > > how
> > > to retrieve the argument values.
> > > 
> > > Does this help bring clarity to the reasons why an abstract (generic) 
> > > probe
> > > concept is part of DTrace's design?  
> > 
> > It actually sounds worse than I thought.
> > If dtrace script reads some kernel field it's considered to be uapi?! ouch.
> > It means dtrace development philosophy is incompatible with the linux 
> > kernel.
> > There is no way kernel is going to bend itself to make dtrace scripts
> > runnable if that means that all dtrace accessible fields become uapi.
> 
> Now from what I'm reading, it seems that the Dtrace layer may be
> abstracting out fields from the kernel. This is actually something I
> have been thinking about to solve the "tracepoint abi" issue. There's
> usually basic ideas that happen. An interrupt goes off, there's a
> handler, etc. We could abstract that out that we trace when an
> interrupt goes off and the handler happens, and record the vector
> number, and/or what device it was for. We have tracepoints in the
> kernel that do this, but they do depend a bit on the implementation.
> Now, if we could get a layer that abstracts this information away from
> the implementation, then I think that's a *good* thing.

This is indeed what DTrace uses.  When a probe triggers (be it kprobe, network
event, tracepoint, etc), the core execution component is invoked with a probe
id, and a set of data items.  In its current implementation (not BPF based),
the probe triggers which causes a probe type specific handler to be called in
the provider module for that probe type.  The handler determines the probe id
(e.g. for a kprobe that might be based on the program counter value), and it
also prepares the list of data items (which we call arguments to the probe).
It then calls the execution component with the probe id and arguments.

All probe types are handled by a provider, and each provider has a handler
that determines the probe id and arguments, and then calls the execution
component.  So, at the level of the execution component all probes look the
same.
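
As an illustration of that flow (a sketch with hypothetical names, not the
actual DTrace sources), a provider handler reduces its probe-specific input
to the generic (probe id, arguments) form that the execution component sees:

#include <linux/ptrace.h>
#include <linux/types.h>

#define DT_MAXARGS 10

/* generic entry point of the execution component (hypothetical) */
void dt_probe_exec(u32 probe_id, u64 argv[DT_MAXARGS]);

/* hypothetical lookup: map a program counter value to a probe id */
u32 fbt_probe_id(unsigned long pc);

/* FBT-style handler: the probe id is derived from the program counter,
 * the arguments from the registers (x86-64 register names shown) */
void fbt_handler(struct pt_regs *regs)
{
	u64 argv[DT_MAXARGS] = { regs->di, regs->si, regs->dx };

	dt_probe_exec(fbt_probe_id(regs->ip), argv);
}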

Scripts commonly operate on the abstract probe, but script writers can opt
to do more fancy things that do depend on probe implementation details.  In
that case, there is of course no guarantee that the script will keep working
as kernel releases change.

> > In stark contrast to dtrace all of bpf tracing scripts (bcc scripts
> > and bpftrace scripts) are written for specific kernel with intimate
> > knowledge of kernel details. They do break all the time when kernel changes.
> > kprobe and tracepoints are NOT uapi. All of them can change.
> > tracepoints are a bit more stable than kprobes, but they are not uapi.
> 
> I wish that was totally true, but tracepoints *can* be an abi. I had
> code reverted because powertop required one to be a specific format. To
> this day, the wakeup event has a "success" field that writes in a
> hardcoded "1", because there's tools that depend on it, and they only
> work if there's a success field and the value is 1.
> 
> I do definitely agree with you that the Dtrace code shall *never* keep
> the kernel from changing. That is, if Dtrace depends on something that
> changes (let's say we record priority of a task, but someday priority
> is replaced by something else), then Dtrace must cope with it. It must
> not be a blocker like user space applications can be.

I fully agree that DTrace or any other tool should never prevent changes from
happening at the kernel level.  Even in its current (non-BPF) implementation
it has had to cope with changes.  The abstraction through the providers has
been a real benefit for that because changes to probe mechanisms can be dealt
with at the level of the providers, and everything else can remain the same
because the abstraction "hides" the implementation details.

Kris


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-23 Thread Kris Van Hees
On Thu, May 23, 2019 at 02:13:31PM -0700, Alexei Starovoitov wrote:
> On Thu, May 23, 2019 at 01:46:10AM -0400, Kris Van Hees wrote:
> > 
> > I think there is a difference between a solution and a good solution.  
> > Adding
> > a lot of knowledge in the userspace component about how things are 
> > implemented
> > at the kernel level makes for a more fragile infrastructure and involves
> > breaking down well established boundaries in DTrace that are part of the 
> > design
> > specifically to ensure that userspace doesn't need to depend on such 
> > intimate
> > knowledge.
> 
> argh. see more below. This is fundamental disagreement.
> 
> > > > Another advantage of being able to operate on a more abstract probe 
> > > > concept
> > > > that is not tied to a specific probe type is that the userspace 
> > > > component does
> > > > not need to know about the implementation details of the specific 
> > > > probes.
> > > 
> > > If that is indeed the case then dtrace is broken _by design_
> > > and nothing on the kernel side can fix it.
> > > 
> > > bpf prog attached to NMI is running in NMI.
> > > That is very different execution context vs kprobe.
> > > kprobe execution context is also different from syscall.
> > > 
> > > The user writing the script has to be aware in what context
> > > that script will be executing.
> > 
> > The design behind DTrace definitely recognizes that different types of 
> > probes
> > operate in different ways and have different data associated with them.  
> > That
> > is why probes (in legacy DTrace) are managed by providers, one for each type
> > of probe.  The providers handle the specifics of a probe type, and provide a
> > generic probe API to the processing component of DTrace:
> > 
> > SDT probes -> SDT provider ------------+
> >                                        |
> > FBT probes -> FBT provider ------------+--> DTrace engine
> >                                        |
> > syscall probes -> systrace provider ---+
> > 
> > This means that the DTrace processing component can be implemented based on 
> > a
> > generic probe concept, and the providers will take care of the specifics.  
> > In
> > that sense, it is similar to so many other parts of the kernel where a 
> > generic
> > API is exposed so that higher level components don't need to know 
> > implementation
> > details.
> > 
> > In DTrace, people write scripts based on UAPI-style interfaces and they 
> > don't
> > have to concern themselves with e.g. knowing how to get the value of the 3rd
> > argument that was passed by the firing probe.  All they need to know is that
> > the probe will have a 3rd argument, and that the 3rd argument to *any* probe
> > can be accessed as 'arg2' (or args[2] for typed arguments, if the provider 
> > is
> > capable of providing that).  Different probes have different ways of passing
> > arguments, and only the provider code for each probe type needs to know how
> > to retrieve the argument values.
> > 
> > Does this help bring clarity to the reasons why an abstract (generic) probe
> > concept is part of DTrace's design?
> 
> It actually sounds worse than I thought.
> If dtrace script reads some kernel field it's considered to be uapi?! ouch.
> It means dtrace development philosophy is incompatible with the linux kernel.
> There is no way kernel is going to bend itself to make dtrace scripts
> runnable if that means that all dtrace accessible fields become uapi.

No, no, that is not at all what I am saying.  In DTrace, the particulars of
how you get to e.g. probe arguments or current task information are not
something that script writers need to concern themselves about.  Similar to
how BPF contexts have a public (uapi) declaration and a kernel-level context
declaration that is used to actually implement accessing the data (using the
is_valid_access and convert_ctx_access functions that prog types implement).
DTrace exposes an abstract probe entity to script writers where they can
access probe arguments as arg0 through arg9.  Nothing in the userspace needs
to know how you obtain the value of those arguments.  So, scripts can be
written for any kind of probe, and the only information that is used to
verify programs is obtained from the abstract probe description (things like
its unique id, number of arguments, and possible type information for each
argument).  The knowledge of how to get to the value of the probe arguments
is only known at the level of the kernel, so that when 

Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-22 Thread Kris Van Hees
On Wed, May 22, 2019 at 01:53:31PM -0700, Alexei Starovoitov wrote:
> On Wed, May 22, 2019 at 01:23:27AM -0400, Kris Van Hees wrote:
> > 
> > Userspace aside, there are various features that are not currently available
> > such as retrieving the ppid of the current task, and various other data 
> > items
> > that relate to the current task that triggered a probe.  There are ways to
> > work around it (using the bpf_probe_read() helper, which actually performs a
> > probe_kernel_read()) but that is rather clunky
> 
> Sounds like you're admitting that the access to all kernel data structures
> is actually available, but you don't want to change user space to use it?

I of course agree that access to all kernel structures can be done using the
bpf_probe_read() helper.  But I hope you agree that the availability of that
helper doesn't mean that there is no room for more elegant ways to access
information.  There are already helpers (e.g. bpf_get_current_pid_tgid) that
could be replaced by BPF code that uses bpf_probe_read to accomplish the same
thing.
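
For illustration, this is roughly what the clunky variant looks like (a
sketch using current libbpf headers, with the kernel types coming from a
bpftool-generated vmlinux.h; a dedicated helper would reduce this to a
single call):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("kprobe/ksys_write")
int show_ppid(struct pt_regs *ctx)
{
	struct task_struct *task = (void *)bpf_get_current_task();
	struct task_struct *parent;
	int ppid = 0;

	/* two probe reads just to reach task->real_parent->tgid */
	bpf_probe_read(&parent, sizeof(parent), &task->real_parent);
	if (parent)
		bpf_probe_read(&ppid, sizeof(ppid), &parent->tgid);
	bpf_printk("ppid=%d", ppid);
	return 0;
}

char _license[] SEC("license") = "GPL";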

> > triggered the execution.  Often, a single DTrace clause is associated with
> > multiple probes, of different types.  Probes in the kernel (kprobe, perf 
> > event,
> > tracepoint, ...) are associated with their own BPF program type, so it is 
> > not
> > possible to load the DTrace clause (translated into BPF code) once and
> > associate it with probes of different types.  Instead, I'd have to load it
> > as a BPF_PROG_TYPE_KPROBE program to associate it with a kprobe, and I'd 
> > have
> > to load it as a BPF_PROG_TYPE_TRACEPOINT program to associate it with a
> > tracepoint, and so on.  This also means that I suddenly have to add code to
> > the userspace component to know about the different program types with more
> > detail, like what helpers are available to specific program types.
> 
> That also sounds like there is a solution, but you don't want to change user 
> space?

I think there is a difference between a solution and a good solution.  Adding
a lot of knowledge in the userspace component about how things are implemented
at the kernel level makes for a more fragile infrastructure and involves
breaking down well established boundaries in DTrace that are part of the design
specifically to ensure that userspace doesn't need to depend on such intimate
knowledge.

> > Another advantage of being able to operate on a more abstract probe concept
> > that is not tied to a specific probe type is that the userspace component 
> > does
> > not need to know about the implementation details of the specific probes.
> 
> If that is indeed the case then dtrace is broken _by design_
> and nothing on the kernel side can fix it.
> 
> bpf prog attached to NMI is running in NMI.
> That is very different execution context vs kprobe.
> kprobe execution context is also different from syscall.
> 
> The user writing the script has to be aware in what context
> that script will be executing.

The design behind DTrace definitely recognizes that different types of probes
operate in different ways and have different data associated with them.  That
is why probes (in legacy DTrace) are managed by providers, one for each type
of probe.  The providers handle the specifics of a probe type, and provide a
generic probe API to the processing component of DTrace:

SDT probes -> SDT provider ------------+
                                       |
FBT probes -> FBT provider ------------+--> DTrace engine
                                       |
syscall probes -> systrace provider ---+

This means that the DTrace processing component can be implemented based on a
generic probe concept, and the providers will take care of the specifics.  In
that sense, it is similar to so many other parts of the kernel where a generic
API is exposed so that higher level components don't need to know implementation
details.

In DTrace, people write scripts based on UAPI-style interfaces and they don't
have to concern themselves with e.g. knowing how to get the value of the 3rd
argument that was passed by the firing probe.  All they need to know is that
the probe will have a 3rd argument, and that the 3rd argument to *any* probe
can be accessed as 'arg2' (or args[2] for typed arguments, if the provider is
capable of providing that).  Different probes have different ways of passing
arguments, and only the provider code for each probe type needs to know how
to retrieve the argument values.
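
A sketch of how that division of labour could look (hypothetical interface,
for illustration only): the script-level 'arg2' becomes a call through a
per-provider getter, so only the provider knows the probe's calling
convention:

#include <linux/types.h>

struct dt_provider {
	const char *name;
	/* fetch argument n for a probe of this provider's type */
	u64 (*get_arg)(void *probe_ctx, int n);
};

/* what the compiled script does for 'arg2', whatever the probe type */
static u64 dt_arg2(const struct dt_provider *prov, void *probe_ctx)
{
	return prov->get_arg(probe_ctx, 2);
}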

Does this help bring clarity to the reasons why an abstract (generic) probe
concept is part of DTrace's design?


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-22 Thread Kris Van Hees
On Wed, May 22, 2019 at 12:55:27PM -0700, Alexei Starovoitov wrote:
> On Wed, May 22, 2019 at 02:22:15PM -0400, Kris Van Hees wrote:
> > On Wed, May 22, 2019 at 04:25:32PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > > 
> > > > and no changes are necessary in kernel/events/ring_buffer.c either.
> > > 
> > > Let me just NAK them on the principle that I don't see them in my inbox.
> > 
> > My apologies for failing to include you on the Cc for the patches.  That was
> > an oversight on my end and certainly not intentional.
> > 
> > > Let me further NAK it for adding all sorts of garbage to the code --
> > > we're not going to do gaps and stay_in_page nonsense.
> > 
> > Could you give some guidance in terms of an alternative?  The ring buffer 
> > code
> > provides both non-contiguous page allocation support and a vmalloc-based
> > allocation, and the vmalloc version certainly would avoid the entire gap and
> > page boundary stuff.  But since the allocator is chosen at build time based 
> > on
> > the arch capabilities, there is no way to select a specific memory 
> > allocator.
> > I'd be happy to use an alternative approach that allows direct writing into
> > the ring buffer.
> 
> You do not _need_ direct write from bpf prog.
> dtrace language doesn't mandate direct write.
> 'direct write into ring buffer from bpf prog' is an interesting idea and
> may be a nice performance optimization, but in no way is it a blocker for dtrace 
> scripts.
> Also it's far from clear that it actually brings performance benefits.
> Letting bpf progs write directly into ring buffer comes with
> a lot of corner cases. It's something to carefully analyze.

I agree that doing direct writes is something that can be deferred right now,
especially because there are more fundamental things to focus on.  Thank you
for your acknowledgement of the idea, and I certainly look forward to exploring
this further at a later time.

> I suggest to proceed with user space dtrace conversion to bpf
> without introducing kernel changes.


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-22 Thread Kris Van Hees
On Wed, May 22, 2019 at 01:16:25PM -0700, Alexei Starovoitov wrote:
> On Wed, May 22, 2019 at 12:12:53AM -0400, Kris Van Hees wrote:
> > 
> > Could you elaborate on why you believe my patches are not adding generic
> > features?  I can certainly agree that the DTrace-specific portions are less
> > generic (although they are certainly available for anyone to use), but I
> > don't quite understand why the new features are deemed non-generic and why
> > you believe no one else can use this?
> 
> And once again your statement above contradicts your own patches.
> The patch 2 adds new prog type BPF_PROG_TYPE_DTRACE and the rest of the 
> patches
> are tying everything to it.
> This approach contradicts bpf philosophy of being generic execution engine
> and not favoriting one program type vs another.

I am not sure I understand where you see a contradiction.  What I posted is
a generic feature, and sample code that demonstrates how it can be used based
on the use-case that I am currently working on.  So yes, the sample code is
very specific but it does not restrict the use of the cross-prog-type tail-call
feature.  That feature is designed to be generic.

Probes come in different types (kprobe, tracepoint, perf event, ...) and they
each have their own very specific data associated with them.  I agree 100%
with you on that.  And sometimes tracing makes use of those specifics.  But
even from looking at the implementation of the various probe related prog
types (and e.g. the list of helpers they each support) it shows that there is
a lot of commonality as well.  That common functionality is common to all the
probe program types, and that is where I suggest introducing a program type
that captures the common concept of a probe, so perhaps a better name would
be BPF_PROG_TYPE_PROBE.

The principle remains the same though...  I am proposing adding support for
program types that provide common functionality so that programs for various
program types can make use of the more generic programs stored in prog arrays.
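
For reference, this is the prog array mechanism being referred to (a minimal
sketch; today all entries of the map must have the same program type as the
caller, which is exactly the restriction at issue):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 4);
	__type(key, __u32);
	__type(value, __u32);
} progs SEC(".maps");

SEC("kprobe/ksys_write")
int dispatch(struct pt_regs *ctx)
{
	/* on success this jumps to progs[0] and does not return here */
	bpf_tail_call(ctx, &progs, 0);
	return 0;	/* reached only if slot 0 is empty */
}

char _license[] SEC("license") = "GPL";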

> I have nothing against dtrace language and dtrace scripts.
> Go ahead and compile them into bpf.
> All patches to improve bpf infrastructure are very welcomed.
> 
> In particular you brought up a good point that there is a use case
> for sharing a piece of bpf program between kprobe and tracepoint events.
> The better way to do that is via bpf2bpf call.
> Example:
> void bpf_subprog(arbitrary args)
> {
> }
> 
> SEC("kprobe/__set_task_comm")
> int bpf_prog_kprobe(struct pt_regs *ctx)
> {
>   bpf_subprog(...);
> }
> 
> SEC("tracepoint/sched/sched_switch")
> int bpf_prog_tracepoint(struct sched_switch_args *ctx)
> {
>   bpf_subprog(...);
> }
> 
> Such configuration is not supported by the verifier yet.
> We've been discussing it for some time, but no work has started,
> since there was no concrete use case.
> If you can work on adding support for it everyone will benefit.
> 
> Could you please consider doing that as a step forward?

This definitely looks to be an interesting addition and I am happy to look into
that further.  I have a few questions that I hope you can shed light on...

1. What context would bpf_subprog execute with?  If it can be called from
   multiple different prog types, would it see whichever context the caller
   is executing with?  Or would you envision bpf_subprog to not be allowed to
   access the execution context because it cannot know which one is in use?

2. Given that BPF programs are loaded with a specification of the prog type, 
   how would one load a code construct as the one you outline above?  How can
   you load a BPF function and have it be used as subprog from programs that
   are loaded separately?  I.e. in the sample above, if bpf_subprog is loaded
   as part of loading bpf_prog_kprobe (prog type KPROBE), how can it be
   referenced from bpf_prog_tracepoint (prog type TRACEPOINT) which would be
   loaded separately?

Cheers,
Kris


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-22 Thread Kris Van Hees
On Wed, May 22, 2019 at 04:25:32PM +0200, Peter Zijlstra wrote:
> On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> 
> > and no changes are necessary in kernel/events/ring_buffer.c either.
> 
> Let me just NAK them on the principle that I don't see them in my inbox.

My apologies for failing to include you on the Cc for the patches.  That was
an oversight on my end and certainly not intentional.

> Let me further NAK it for adding all sorts of garbage to the code --
> we're not going to do gaps and stay_in_page nonsense.

Could you give some guidance in terms of an alternative?  The ring buffer code
provides both non-contiguous page allocation support and a vmalloc-based
allocation, and the vmalloc version certainly would avoid the entire gap and
page boundary stuff.  But since the allocator is chosen at build time based on
the arch capabilities, there is no way to select a specific memory allocator.
I'd be happy to use an alternative approach that allows direct writing into
the ring buffer.

Thanks,
Kris


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-21 Thread Kris Van Hees
On Tue, May 21, 2019 at 05:48:11PM -0400, Steven Rostedt wrote:
> On Tue, 21 May 2019 14:43:26 -0700
> Alexei Starovoitov  wrote:
> 
> > Steve,
> > sounds like you've missed all prior threads.
> 
> I probably have missed them ;-)
> 
> > The feedback was given to Kris it was very clear:
> > implement dtrace the same way as bpftrace is working with bpf.
> > No changes are necessary to dtrace scripts
> > and no kernel changes are necessary.
> 
> Kris, I haven't been keeping up on all the discussions. But what
> exactly is the issue where Dtrace can't be done the same way as the
> bpftrace is done?

There are several issues (and I keep finding new ones as I move forward) but
the biggest one is that I am not trying to re-design (and re-implement) DTrace
from the ground up.  We have an existing userspace component that is getting
modified to work with a new kernel implementation (based on BPF and various
other kernel features that are thankfully available these days).  But we need
to ensure that the userspace component continues to function exactly as one
would expect.  There should be no need to modify DTrace scripts.  Perhaps
bpftrace could be taught to parse DTrace scripts (i.e. implement the D script
language with all its bells and whistles) but it currently cannot and DTrace
obviously can.  It seems to be a better use of resources to focus on the
kernel component, where we can really provide a much cleaner implementation
for DTrace probe execution because BPF is available and very powerful.

Userspace aside, there are various features that are not currently available
such as retrieving the ppid of the current task, and various other data items
that relate to the current task that triggered a probe.  There are ways to
work around it (using the bpf_probe_read() helper, which actually performs a
probe_kernel_read()) but that is rather clunky and definitely shouldn't be
something that can be done from a BPF program if we're doing unprivileged
tracing (which is a goal that is important for us).  New helpers can be added
for things like this, but the list grows large very quickly once you look at
what information DTrace scripts tend to use.

One of the benefits of DTrace is that probes are largely abstracted entities
when you get to the script level.  While different probes provide different
data, they are all represented as probe arguments and they are accessed in a
very consistent manner that is independent from the actual kind of probe that
triggered the execution.  Often, a single DTrace clause is associated with
multiple probes, of different types.  Probes in the kernel (kprobe, perf event,
tracepoint, ...) are associated with their own BPF program type, so it is not
possible to load the DTrace clause (translated into BPF code) once and
associate it with probes of different types.  Instead, I'd have to load it
as a BPF_PROG_TYPE_KPROBE program to associate it with a kprobe, and I'd have
to load it as a BPF_PROG_TYPE_TRACEPOINT program to associate it with a
tracepoint, and so on.  This also means that I suddenly have to add code to
the userspace component to know about the different program types with more
detail, like what helpers are available to specific program types.
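
With today's libbpf API that duplication looks roughly like this (a sketch;
insns/insn_cnt stand for the already-translated D clause, and "d_clause" is
an arbitrary name):

#include <linux/bpf.h>
#include <bpf/bpf.h>

/* load the same instructions once per program type */
static int load_clause_twice(const struct bpf_insn *insns, size_t insn_cnt,
			     int *fd_kprobe, int *fd_tracepoint)
{
	*fd_kprobe = bpf_prog_load(BPF_PROG_TYPE_KPROBE, "d_clause", "GPL",
				   insns, insn_cnt, NULL);
	*fd_tracepoint = bpf_prog_load(BPF_PROG_TYPE_TRACEPOINT, "d_clause",
				       "GPL", insns, insn_cnt, NULL);
	return (*fd_kprobe < 0 || *fd_tracepoint < 0) ? -1 : 0;
}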

Another advantage of being able to operate on a more abstract probe concept
that is not tied to a specific probe type is that the userspace component does
not need to know about the implementation details of the specific probes.
This avoids a tight coupling between the userspace component and the kernel
implementation.

Another feature that is currently not supported is speculative tracing.  This
is a feature that is not as commonly used (although I personally have found it
to be very useful in the past couple of years) but it is quite powerful because
it allows for probe data to be recorded, and have the decision on whether it
is to be made available to userspace postponed to a later event.  At that time,
the data can be discarded or committed.
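
A toy model of those semantics (plain C, purely illustrative; the real
mechanism lives in the tracing buffers): records go to a per-speculation
staging buffer and only reach the consumer on commit:

#include <stdio.h>
#include <string.h>

struct spec {
	char data[4096];
	size_t len;
};

static void spec_record(struct spec *s, const void *rec, size_t n)
{
	if (s->len + n <= sizeof(s->data)) {	/* drop on overflow */
		memcpy(s->data + s->len, rec, n);
		s->len += n;
	}
}

static void spec_commit(struct spec *s, FILE *out)
{
	fwrite(s->data, 1, s->len, out);	/* now visible to the consumer */
	s->len = 0;
}

static void spec_discard(struct spec *s)
{
	s->len = 0;				/* the data is never seen */
}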

These are just some examples of issues I have been working on.  I spent quite
a bit of time looking for ways to implement what we need for DTrace with a
minimal amount of patches to the kernel because there really isn't any point
in doing unnecessary work.  I do not doubt that there are ways to get around
some of these issues with clever hacks and workarounds, but I am not trying
to hack something together that hopefully will be close enough to the expected
functionality.

DTrace has proven itself to be quite useful and dependable as a tracing
solution, and I am working on continuing to deliver on that while recognizing
the significant work that others have put into advancing the tracing
infrastructure in Linux in recent years.  So many people have contributed
excellent features - and I am making use of those features as much as I can.
But as is often the case, not everything that I need is currently implemented.
As I expressed during last year's Plumbers in Vancouver, I am 

Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-21 Thread Kris Van Hees
On Tue, May 21, 2019 at 04:26:19PM -0700, Alexei Starovoitov wrote:
> On Tue, May 21, 2019 at 05:36:49PM -0400, Kris Van Hees wrote:
> > On Tue, May 21, 2019 at 01:55:34PM -0700, Alexei Starovoitov wrote:
> > > On Tue, May 21, 2019 at 02:41:37PM -0400, Kris Van Hees wrote:
> > > > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > > > > On Mon, May 20, 2019 at 11:47:00PM +, Kris Van Hees wrote:
> > > > > > 
> > > > > > 2. bpf: add BPF_PROG_TYPE_DTRACE
> > > > > > 
> > > > > > This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, 
> > > > > > without
> > > > > > actually providing an implementation.  The actual 
> > > > > > implementation is
> > > > > > added in patch 4 (see below).  We do it this way because the
> > > > > > implementation is being added to the tracing subsystem as a 
> > > > > > component
> > > > > > that I would be happy to maintain (if merged) whereas the 
> > > > > > declaration
> > > > > > of the program type must be in the bpf subsystem.  Since the two
> > > > > > subsystems are maintained by different people, we split the
> > > > > > implementing patches across maintainer boundaries while 
> > > > > > ensuring that
> > > > > > the kernel remains buildable between patches.
> > > > > 
> > > > > None of these kernel patches are necessary for what you want to 
> > > > > achieve.
> > > > 
> > > > I disagree.  The current support for BPF programs for probes associates 
> > > > a
> > > > specific BPF program type with a specific set of probes, which means 
> > > > that I
> > > > cannot write BPF programs based on a more general concept of a 'DTrace 
> > > > probe'
> > > > and provide functionality based on that.  It also means that if I have 
> > > > a D
> > > > clause (DTrace probe action code associated with probes) that is to be 
> > > > executed
> > > > for a list of probes of different types, I need to duplicate the program
> > > > because I cannot cross program type boundaries.
> > > 
> > > tracepoint vs kprobe vs raw_tracepoint vs perf event work on different 
> > > input.
> > > There is no common denominator to them that can serve as a single 'generic' 
> > > context.
> > > We're working on the concept of bpf libraries where different bpf program
> > > with different types can call single bpf function with arbitrary 
> > > arguments.
> > > This concept already works in bpf2bpf calls. We're working on extending it
> > > to different program types.
> > > You're more than welcome to help in that direction,
> > > but type casting of tracepoint into kprobe is no go.
> > 
> > I am happy to hear about the direction you are going in adding 
> > functionality.
> > Please note though that I am not type casting tracepoint into kprobe or
> > anything like that.  I am making it possible to transfer execution from
> > tracepoint, kprobe, raw-tracepoint, perf event, etc into a BPF program of
> > a different type (BPF_PROG_TYPE_DTRACE) which operates as a general probe
> > action execution program type.  It provides functionality that is used to
> > implement actions to be executed when a probe fires, independent of the
> > actual probe type that fired.
> > 
> > What you describe seems to me to be rather equivalent to what I already
> > implement in my patch.
> 
> except they're not.
> you're converting to one new prog type only that no one else can use.
> Whereas bpf infra is aiming to be as generic as possible and
> fit networking, tracing, security use case all at once.

Two points here...  the patch that implements cross-prog type tail-call support
is not tied to *any* specific prog type.  Each prog type can specify which
(if any) prog types it can receive calls from (and it can implement context
conversion code to carry any relevant info from the caller context into the
context for the callee).  There is nothing in that patch that is specific to
DTrace or any other prog type.

Then I also introduce a new prog type (not tied to any specific probe type) to
provide the ability to execute programs in a probe type independent context,
and it makes use of the cross-prog-type tail-call support in order to be able
to invoke programs in that probe-independent context from probe-specific

Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-21 Thread Kris Van Hees
On Tue, May 21, 2019 at 01:55:34PM -0700, Alexei Starovoitov wrote:
> On Tue, May 21, 2019 at 02:41:37PM -0400, Kris Van Hees wrote:
> > On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> > > On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> > > > 
> > > > 2. bpf: add BPF_PROG_TYPE_DTRACE
> > > > 
> > > > This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, 
> > > > without
> > > > actually providing an implementation.  The actual 
> > > > implementation is
> > > > added in patch 4 (see below).  We do it this way because the
> > > > implementation is being added to the tracing subsystem as a 
> > > > component
> > > > that I would be happy to maintain (if merged) whereas the 
> > > > declaration
> > > > of the program type must be in the bpf subsystem.  Since the two
> > > > subsystems are maintained by different people, we split the
> > > > implementing patches across maintainer boundaries while 
> > > > ensuring that
> > > > the kernel remains buildable between patches.
> > > 
> > > None of these kernel patches are necessary for what you want to achieve.
> > 
> > I disagree.  The current support for BPF programs for probes associates a
> > specific BPF program type with a specific set of probes, which means that I
> > cannot write BPF programs based on a more general concept of a 'DTrace 
> > probe'
> > and provide functionality based on that.  It also means that if I have a D
> > clause (DTrace probe action code associated with probes) that is to be 
> > executed
> > for a list of probes of different types, I need to duplicate the program
> > because I cannot cross program type boundaries.
> 
> tracepoint vs kprobe vs raw_tracepoint vs perf event work on different input.
> There is no common denominator to them that can serve as a single 'generic' 
> context.
> We're working on the concept of bpf libraries where different bpf program
> with different types can call single bpf function with arbitrary arguments.
> This concept already works in bpf2bpf calls. We're working on extending it
> to different program types.
> You're more than welcome to help in that direction,
> but type casting of tracepoint into kprobe is no go.

I am happy to hear about the direction you are going in adding functionality.
Please note though that I am not type casting tracepoint into kprobe or
anything like that.  I am making it possible to transfer execution from
tracepoint, kprobe, raw-tracepoint, perf event, etc into a BPF program of
a different type (BPF_PROG_TYPE_DTRACE) which operates as a general probe
action execution program type.  It provides functionality that is used to
implement actions to be executed when a probe fires, independent of the
actual probe type that fired.

What you describe seems to me to be rather equivalent to what I already
implement in my patch.

> > The reasons for these patches is because I cannot do the same with the 
> > existing
> > implementation.  Yes, I can do some of it or use some workarounds to 
> > accomplish
> > kind of the same thing, but at the expense of not being able to do what I 
> > need
> > to do but rather do some kind of best effort alternative.  That is not the 
> > goal
> > here.
> 
> what you call 'workaround' other people call 'feature'.
> The kernel community doesn't accept extra code into the kernel
> when user space can do the same.

Sure, but userspace cannot do the same because in the case of DTrace much
of this needs to execute at the kernel level within the context of the probe
firing, because once you get back to userspace, the system has moved on.  We
need to capture information and perform processing of that information at the
time of probe firing.  I am spending quite a lot of my time in the design of
DTrace based on BPF and other kernel features to avoid adding more to the
kernel than is really needed, and certainly also to avoid duplicating code.

But I am not designing and implementing a new tracer - I am making an
existing one available based on existing features (as much as possible).  So,
something that comes close but doesn't quite do what we need is not a
solution.

> > > Feel free to add tools/dtrace/ directory and maintain it though.
> > 
> > Thank you.
> > 
> > > The new dtrace_buffer doesn't need to replicate existing bpf+kernel 
> > > functionality
> > > and no changes are necessary in kernel/events/ring_buffer.c either.
> > > tools/dtrace/ user space component can use either

Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-21 Thread Kris Van Hees
As suggested, I resent the patch set as replies to the cover letter post
to support threaded access to the patches.


[RFC PATCH 05/11] trace: update Kconfig and Makefile to include DTrace

2019-05-21 Thread Kris Van Hees
This commit adds the dtrace implementation in kernel/trace/dtrace to
the trace Kconfig and Makefile.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 kernel/trace/Kconfig  | 2 ++
 kernel/trace/Makefile | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 5d965cef6c77..59c3bdfbaffc 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -786,6 +786,8 @@ config GCOV_PROFILE_FTRACE
  Note that on a kernel compiled with this config, ftrace will
  run significantly slower.
 
+source "kernel/trace/dtrace/Kconfig"
+
 endif # FTRACE
 
 endif # TRACING_SUPPORT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c2b2148bb1d2..e643c4eac8f6 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -82,6 +82,7 @@ endif
 obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o
 obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o
+obj-$(CONFIG_DTRACE) += dtrace/
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 
-- 
2.20.1



[RFC PATCH 01/11] bpf: context casting for tail call

2019-05-21 Thread Kris Van Hees
Currently BPF programs are executed with a context that is provided by
code that initiates the execution.  Tracing tools that want to make use
of existing probes and events that allow BPF programs to be attached to
them are thus limited to the context information provided by the probe
or event source.  Often, more context is needed to allow tracing tools
the ability to implement more complex constructs (e.g. more stateful
tracing).

This patch extends the tail-call mechanism to allow a BPF program of
one type to call a BPF program of another type.

BPF program types can specify two new operations in struct bpf_prog_ops:
- bool is_valid_tail_call(enum bpf_prog_type stype)
This function is called from bpf_prog_array_valid_tail_call()
which is called from bpf_check_tail_call()
which is called from bpf_prog_select_runtime()
which is called from bpf_prog_load() right after the
verifier finishes processing the program.  It is called for every
map of type BPF_MAP_TYPE_PROG_ARRAY, and is passed the type of the
program that is being loaded and therefore will be the origin of
tail calls.  It returns true if tail calls from the source BPF
program type to the implementing program type are allowed.

- void *convert_ctx(enum bpf_prog_type stype, void *ctx)
This function is called during the execution of a BPF tail-call.
It returns a valid context for the implementing BPF program type,
based on the passed context pointer (ctx) for BPF program type
stype.

The program array holding BPF programs that you can tail-call into
continues to require that all programs are of the same type.  But when
a compatibility check is made in a program that performs a tail-call,
the is_valid_tail_call() function is called (if available) to allow
the target type to determine whether it can handle the conversion of
a context from the source type to the target type.  If the function is
not implemented by the program type, casting is denied.

During execution, the convert_ctx() function is called (if available)
to perform the conversion of the current context to the context that the
target type expects.  Since the program type of the executing BPF program
is not explicitly known during execution, the verifier inserts an
instruction right before the tail-call to assign the current BPF program
type to R4.

The interpreter calls convert_ctx() using the program type in R4 as
source program type, the program type associated with the program array
as target program type, and the context as provided in R1.

A helper (finalize_context) is added to allow tail-called programs to
perform context setup based on information that is passed in from the
calling program by means of a map that is indexed by CPU id.  The actual
content of the map is defined by the BPF program type implementation
for the program type that is being called.

The bpf_prog_types array is now being exposed to the rest of the BPF
code (where before it was local to just the syscall handling) because
the is_valid_tail_call() and convert_ctx() operations need to be
accessible.

There is no noticeable effect on BPF program types that do not implement
this new feature.

A JIT implementation is not available yet in this first iteration.
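
For illustration only (this sketch is not part of the patch), a program
type that accepts tail-calls from kprobe programs might implement the two
operations along these lines; the my_ctx type and its per-CPU storage are
made up for the example:

struct my_ctx {
        struct pt_regs          *regs;
        struct task_struct      *task;
};
static DEFINE_PER_CPU(struct my_ctx, my_ctx_storage);

static bool my_is_valid_tail_call(enum bpf_prog_type stype)
{
        /* Only allow tail-calls that originate from kprobe programs. */
        return stype == BPF_PROG_TYPE_KPROBE;
}

static void *my_convert_ctx(enum bpf_prog_type stype, void *ctx)
{
        /* For a kprobe caller, ctx is the struct pt_regs pointer. */
        struct my_ctx *mctx = this_cpu_ptr(&my_ctx_storage);

        mctx->regs = ctx;
        mctx->task = current;
        return mctx;
}

const struct bpf_prog_ops my_prog_ops = {
        .is_valid_tail_call     = my_is_valid_tail_call,
        .convert_ctx            = my_convert_ctx,
};

Patch 4 in this series implements the actual DTrace variant of this
pattern.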

v2: Fixed compilation when CONFIG_BPF_SYSCALL=n.
Fixed casting issue on platforms with 32-bit pointers.

v3: Renamed the new program type operations to be more descriptive.
Added finalize_context() helper.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h   |  3 +++
 include/uapi/linux/bpf.h  | 11 -
 kernel/bpf/core.c | 29 ++-
 kernel/bpf/syscall.c  |  2 +-
 kernel/bpf/verifier.c | 16 +
 tools/include/uapi/linux/bpf.h| 11 -
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 7 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 59631dd0777c..7a40a3cd7ff2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -294,6 +294,8 @@ bpf_ctx_record_field_size(struct bpf_insn_access_aux *aux, 
u32 size)
 struct bpf_prog_ops {
int (*test_run)(struct bpf_prog *prog, const union bpf_attr *kattr,
union bpf_attr __user *uattr);
+   bool (*is_valid_tail_call)(enum bpf_prog_type stype);
+   void *(*convert_ctx)(enum bpf_prog_type stype, void *ctx);
 };
 
 struct bpf_verifier_ops {
@@ -571,6 +573,7 @@ extern const struct file_operations bpf_prog_fops;
 #undef BPF_PROG_TYPE
 #undef BPF_MAP_TYPE
 
+extern const struct bpf_prog_ops * const bpf_prog_types[];
 extern const struct bpf_prog_ops bpf_offload_prog_ops;
 extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
 extern const struct bpf_verifier_ops xdp_analyzer_ops;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index

[RFC PATCH 02/11] bpf: add BPF_PROG_TYPE_DTRACE

2019-05-21 Thread Kris Van Hees
Add a new BPF program type for DTrace.  The program type is not compiled
right now because the CONFIG_DTRACE option does not exist yet.  It will
be added in a following commit.

Three commits are involved here:

1. add the BPF program type (conditional on a to-be-added option)
2. add the BPF_PROG_TYPE_DTRACE implementation (building not enabled)
3. add the CONFIG_DTRACE option and enable compilation of the prog type
   implementation

The reason for this sequence is to ensure that the kernel tree remains
buildable between these commits.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf_types.h  |  3 +++
 include/uapi/linux/bpf.h   |  1 +
 samples/bpf/bpf_load.c | 10 +++---
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c |  2 ++
 tools/lib/bpf/libbpf_probes.c  |  1 +
 6 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5a9975678d6f..908f2e4f597e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -26,6 +26,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
+#ifdef CONFIG_DTRACE
+BPF_PROG_TYPE(BPF_PROG_TYPE_DTRACE, dtrace)
+#endif
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 61abe6b56948..7bcb707539d1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+   BPF_PROG_TYPE_DTRACE,
 };
 
 enum bpf_attach_type {
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index eae7b635343d..4812295484a1 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -87,6 +87,7 @@ static int load_and_attach(const char *event, struct bpf_insn 
*prog, int size)
bool is_sockops = strncmp(event, "sockops", 7) == 0;
bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+   bool is_dtrace = strncmp(event, "dtrace", 6) == 0;
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
@@ -120,6 +121,8 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_SK_SKB;
} else if (is_sk_msg) {
prog_type = BPF_PROG_TYPE_SK_MSG;
+   } else if (is_dtrace) {
+   prog_type = BPF_PROG_TYPE_DTRACE;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -140,8 +143,8 @@ static int load_and_attach(const char *event, struct 
bpf_insn *prog, int size)
if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
return 0;
 
-   if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-   if (is_socket)
+   if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_dtrace) {
+   if (is_socket || is_dtrace)
event += 6;
else
event += 7;
@@ -643,7 +646,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb 
fixup_map)
memcmp(shname, "cgroup/", 7) == 0 ||
memcmp(shname, "sockops", 7) == 0 ||
memcmp(shname, "sk_skb", 6) == 0 ||
-   memcmp(shname, "sk_msg", 6) == 0) {
+   memcmp(shname, "sk_msg", 6) == 0 ||
+   memcmp(shname, "dtrace", 6) == 0) {
ret = load_and_attach(shname, data->d_buf,
  data->d_size);
if (ret != 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 61abe6b56948..7bcb707539d1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+   BPF_PROG_TYPE_DTRACE,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7e3b79d7c25f..44704a7d395d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2269,6 +2269,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type 
type)
case BPF_PROG_TYPE_CGROUP_SYSCTL:
return false;
case BPF_PROG_TYPE_KPROBE:
+   case BPF_PROG_TYPE_DTRACE:
default:
return true;
}

[RFC PATCH 10/11] bpf: add bpf_buffer_reserve and bpf_buffer_commit helpers

2019-05-21 Thread Kris Van Hees
Add two helpers that are primarily used in combination with the
writable-buffer support.  The bpf_buffer_reserve() helper sets aside
a chunk of buffer space that can be written to, and once all data
has been written, the bpf_buffer_commit() helper is used to make the
data in the ring buffer visible to userspace.
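
As an illustration of the intended usage pattern (the context members and
the buffer-id convention are taken from the writable-buffer patches later
in this series; the map name is made up):

struct bpf_map_def SEC("maps") buffers = {
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(u32),
        .max_entries = NR_CPUS,
};

SEC("dtrace/1") int dt_prog(struct dtrace_bpf_context *ctx)
{
        int buf_id = offsetof(struct dtrace_bpf_context, buf);

        /* Reserve 16 bytes of ring-buffer space. */
        if (bpf_buffer_reserve(ctx, buf_id, &buffers, 16) < 0)
                return 0;

        /* Range test so the verifier can bound the stores below. */
        if (ctx->buf + 16 > ctx->buf_end)
                return 0;

        /* Store directly into the reserved chunk. */
        *(u64 *)&ctx->buf[0] = ctx->probe_id;
        *(u64 *)&ctx->buf[8] = 42;

        /* Make the reserved data visible to userspace. */
        return bpf_buffer_commit(ctx, buf_id, &buffers);
}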

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/uapi/linux/bpf.h  | 39 ++-
 kernel/bpf/verifier.c |  6 +++-
 tools/include/uapi/linux/bpf.h| 39 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +++
 4 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7bcb707539d1..2b7772aa00b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2681,6 +2681,41 @@ union bpf_attr {
  * the implementing program type.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_reserve(void *ctx, int id, struct bpf_map *map, int size)
+ * Description
+ * Reserve *size* bytes in the output buffer for the special
+ * BPF perf event referenced by *map*, a BPF map of type
+ * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must have
+ * the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ * **PERF_TYPE_SOFTWARE** as **type**, and
+ * **PERF_COUNT_SW_BPF_OUTPUT** as **config**.  The reserved space
+ * will be available as the writable buffer identified with
+ * numeric ID **id** in the context.
+ *
+ * The number of reserved bytes cannot exceed the page size.
+ * The chunk of buffer space will be reserved within a single
+ * page, and if this results in unused space at the end of the
+ * previous page in the ring-buffer, that unused space will be
+ * filled with zeros.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_commit(void *ctx, int id, struct bpf_map *map)
+ * Description
+ * Finalize the previously reserved space in the output buffer
+ * for the special BPF perf event referenced by *map*, a BPF map
+ * of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must
+ * have the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ * **PERF_TYPE_SOFTWARE** as **type**, and
+ * **PERF_COUNT_SW_BPF_OUTPUT** as **config**.
+ *
+ * The writable buffer identified with numeric ID **id** in the
+ * context will be invalidated, and can no longer be used to
+ * write data to until a new **bpf_buffer_reserve**\ () has been
+ * invoked.
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2792,7 +2827,9 @@ union bpf_attr {
FN(strtoul),\
FN(sk_storage_get), \
FN(sk_storage_delete),  \
-   FN(finalize_context),
+   FN(finalize_context),   \
+   FN(buffer_reserve), \
+   FN(buffer_commit),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 90ae04b4d5c7..ff73ed743a58 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2763,7 +2763,9 @@ static int check_map_func_compatibility(struct 
bpf_verifier_env *env,
case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
if (func_id != BPF_FUNC_perf_event_read &&
func_id != BPF_FUNC_perf_event_output &&
-   func_id != BPF_FUNC_perf_event_read_value)
+   func_id != BPF_FUNC_perf_event_read_value &&
+   func_id != BPF_FUNC_buffer_reserve &&
+   func_id != BPF_FUNC_buffer_commit)
goto error;
break;
case BPF_MAP_TYPE_STACK_TRACE:
@@ -2848,6 +2850,8 @@ static int check_map_func_compatibility(struct 
bpf_verifier_env *env,
case BPF_FUNC_perf_event_read:
case BPF_FUNC_perf_event_output:
case BPF_FUNC_perf_event_read_value:
+   case BPF_FUNC_buffer_reserve:
+   case BPF_FUNC_buffer_commit:
if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
goto error;
break;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7bcb707539d1..2b7772aa00b6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2681,6 +2681,41 @@ union bpf_attr {
  * the implementing program type.
  * Return
  * 0 on s

[RFC PATCH 11/11] dtrace: make use of writable buffers in BPF

2019-05-21 Thread Kris Van Hees
This commit modifies the tiny proof-of-concept DTrace utility to use
the writable-buffer support in BPF along with the new helpers for
buffer reservation and commit.  The dtrace_finalize_context() helper
is updated and is now marked with ctx_update because it sets the
buffer pointer to NULL (and size 0).

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/uapi/linux/dtrace.h |   4 +
 kernel/trace/dtrace/bpf.c   | 150 
 tools/dtrace/dt_buffer.c|  54 +
 tools/dtrace/probe1_bpf.c   |  47 ++-
 4 files changed, 198 insertions(+), 57 deletions(-)

diff --git a/include/uapi/linux/dtrace.h b/include/uapi/linux/dtrace.h
index bbe2562c11f2..3fcc075a429f 100644
--- a/include/uapi/linux/dtrace.h
+++ b/include/uapi/linux/dtrace.h
@@ -33,6 +33,10 @@ struct dtrace_bpf_context {
u32 gid;/* from_kgid(&init_user_ns, current_real_cred()->gid) */
u32 euid;   /* from_kuid(&init_user_ns, current_real_cred()->euid) */
u32 egid;   /* from_kgid(&init_user_ns, current_real_cred()->egid) */
+
+   /* General output buffer */
+   __bpf_md_ptr(u8 *, buf);
+   __bpf_md_ptr(u8 *, buf_end);
 };
 
 /*
diff --git a/kernel/trace/dtrace/bpf.c b/kernel/trace/dtrace/bpf.c
index 95f4103d749e..93bd2f0319cc 100644
--- a/kernel/trace/dtrace/bpf.c
+++ b/kernel/trace/dtrace/bpf.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Actual kernel definition of the DTrace BPF context.
@@ -16,6 +17,9 @@ struct dtrace_bpf_ctx {
u32 ecb_id;
u32 probe_id;
struct task_struct  *task;
+   struct perf_output_handle   handle;
+   u64 buf_len;
+   u8  *buf;
 };
 
 /*
@@ -55,6 +59,8 @@ BPF_CALL_2(dtrace_finalize_context, struct dtrace_bpf_ctx *, 
ctx,
 
ctx->ecb_id = ecb->id;
ctx->probe_id = ecb->probe_id;
+   ctx->buf_len = 0;
+   ctx->buf = NULL;
 
return 0;
 }
@@ -62,17 +68,119 @@ BPF_CALL_2(dtrace_finalize_context, struct dtrace_bpf_ctx 
*, ctx,
 static const struct bpf_func_proto dtrace_finalize_context_proto = {
.func   = dtrace_finalize_context,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,   /* ctx */
.arg2_type  = ARG_CONST_MAP_PTR,/* map */
 };
 
+BPF_CALL_4(dtrace_buffer_reserve, struct dtrace_bpf_ctx *, ctx,
+ int, id, struct bpf_map *, map, int, size)
+{
+   struct bpf_array*arr = container_of(map, struct bpf_array, map);
+   int cpu = smp_processor_id();
+   struct bpf_event_entry  *ee;
+   struct perf_event   *ev;
+   int err;
+
+   /*
+* Make sure the writable-buffer id is valid.  We use the default which
+* is the offset of the start-of-buffer pointer in the public context.
+*/
+   if (id != offsetof(struct dtrace_bpf_context, buf))
+   return -EINVAL;
+
+   /*
+* Verify whether we have an uncommitted reserve.  If so, we deny this
+* request.
+*/
+   if (ctx->handle.rb)
+   return -EBUSY;
+
+   /*
+* Perform sanity checks.
+*/
+   if (cpu >= arr->map.max_entries)
+   return -E2BIG;
+   ee = READ_ONCE(arr->ptrs[cpu]);
+   if (!ee)
+   return -ENOENT;
+   ev = ee->event;
+   if (unlikely(ev->attr.type != PERF_TYPE_SOFTWARE ||
+ev->attr.config != PERF_COUNT_SW_BPF_OUTPUT))
+   return -EINVAL;
+   if (unlikely(ev->oncpu != cpu))
+   return -EOPNOTSUPP;
+
+   size = round_up(size, sizeof(u64));
+
+   err = perf_output_begin_forward_in_page(&ctx->handle, ev, size);
+   if (err < 0)
+   return err;
+
+   ctx->buf_len = size;
+   ctx->buf = ctx->handle.addr;
+
+   return 0;
+}
+
+static const struct bpf_func_proto dtrace_buffer_reserve_proto = {
+   .func   = dtrace_buffer_reserve,
+   .gpl_only   = false,
+   .ctx_update = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,   /* ctx */
+   .arg2_type  = ARG_ANYTHING, /* id */
+   .arg3_type  = ARG_CONST_MAP_PTR,/* map */
+   .arg4_type  = ARG_ANYTHING, /* size */
+};
+
+BPF_CALL_3(dtrace_buffer_commit, struct dtrace_bpf_ctx *, ctx,
+int, id, struct bpf_map *, map)
+{
+   /*
+* Make sure the writable-buffer id is valid.  We use the default which
+* is the offset of the start-of-buffer pointer in the public context.
+*/
+   if (id != offset

[RFC PATCH 09/11] bpf: mark helpers explicitly whether they may change the context

2019-05-21 Thread Kris Van Hees
Some helpers may update the context.  Right now, various network filter
helpers may make changes to the packet data.  This is verified by calling
the bpf_helper_changes_pkt_data() function with the function pointer.

This function resides in net/core/filter.c and needs to be updated for any
helper function that modifies packet data.  To allow for other helpers
(possibly not part of the network filter code) to do the same, this patch
changes the code from using a central function to list all helpers that
have this feature to marking each individual helper that may change the
context data.  This way, whenever a new helper is added that may change
the content of the context, there is no need to update a hardcoded list of
functions.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h|  1 +
 include/linux/filter.h |  1 -
 kernel/bpf/core.c  |  5 
 kernel/bpf/verifier.c  |  2 +-
 net/core/filter.c  | 59 ++
 5 files changed, 27 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fc3eda0192fb..9e255d5b1062 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -226,6 +226,7 @@ enum bpf_return_type {
 struct bpf_func_proto {
u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
bool gpl_only;
+   bool ctx_update;
bool pkt_access;
enum bpf_return_type ret_type;
enum bpf_arg_type arg1_type;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7148bab96943..9dacca7d3ef6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -811,7 +811,6 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
-bool bpf_helper_changes_pkt_data(void *func);
 
 static inline bool bpf_dump_raw_ok(void)
 {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 225b1be766b0..8e9accf90c37 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2112,11 +2112,6 @@ void __weak bpf_jit_compile(struct bpf_prog *prog)
 {
 }
 
-bool __weak bpf_helper_changes_pkt_data(void *func)
-{
-   return false;
-}
-
 /* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
  * skb_copy_bits(), so provide a weak definition of it for NET-less config.
  */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fba4e6f5424..90ae04b4d5c7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3283,7 +3283,7 @@ static int check_helper_call(struct bpf_verifier_env 
*env, int func_id, int insn
}
 
/* With LD_ABS/IND some JITs save/restore skb from r1. */
-   changes_data = bpf_helper_changes_pkt_data(fn->func);
+   changes_data = fn->ctx_update;
if (changes_data && fn->arg1_type != ARG_PTR_TO_CTX) {
verbose(env, "kernel subsystem misconfigured func %s#%d: r1 != 
ctx\n",
func_id_name(func_id), func_id);
diff --git a/net/core/filter.c b/net/core/filter.c
index 55bfc941d17a..a9e7d3174d36 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1693,6 +1693,7 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, 
u32, offset,
 static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
.func   = bpf_skb_store_bytes,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1825,6 +1826,7 @@ BPF_CALL_2(bpf_skb_pull_data, struct sk_buff *, skb, u32, 
len)
 static const struct bpf_func_proto bpf_skb_pull_data_proto = {
.func   = bpf_skb_pull_data,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1868,6 +1870,7 @@ BPF_CALL_2(sk_skb_pull_data, struct sk_buff *, skb, u32, 
len)
 static const struct bpf_func_proto sk_skb_pull_data_proto = {
.func   = sk_skb_pull_data,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1909,6 +1912,7 @@ BPF_CALL_5(bpf_l3_csum_replace, struct sk_buff *, skb, 
u32, offset,
 static const struct bpf_func_proto bpf_l3_csum_replace_proto = {
.func   = bpf_l3_csum_replace,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1962,6 +1966,7 @@ BPF_CALL_5(bpf_l4_csum_replace, struct sk_buff *, skb, 
u32, offset,
 static const struct bpf_func_proto bpf_l4_csum_replace_proto = {
.func   = bpf_l4_csum_replace,
.gpl_only   = false,
+   .ctx_update   

[RFC PATCH 08/11] perf: add perf_output_begin_forward_in_page

2019-05-21 Thread Kris Van Hees
Right now, BPF programs can only write to a perf event ring buffer by
constructing a sample (as an arbitrary chunk of memory of a given size),
and calling perf_event_output() to have it written to the ring buffer.

A new implementation of DTrace (based on BPF) avoids constructing the
data sample prior to writing it to the ring buffer.  Instead, it expects
to be able to reserve a block of memory of a given size, write to that
memory region as it sees fit, and then finalize the written data (making
it available for reading from userspace).

This can (in part) be accomplished as follows:
1. reserve buffer space
Call perf_output_begin_forward_in_page(&handle, event, size), passing
in a handle to be used for this data output session, an event that
identifies the output buffer, and the size (in bytes) to set aside.

2. write data
Perform store operations to the buffer space that was set aside.
The buffer is a writable buffer in the BPF program context, which
means that operations like *(u32 *)&buf[offset] = val can be used.

3. finalize the output session
Call perf_output_end() to finalize the output and make the
new data available for reading from userspace by updating the head
of the ring buffer.

The one caveat is that ring buffers may be allocated from non-contiguous
pages in kernel memory.  This means that a reserved block of memory could
be spread across two non-consecutive pages, and accessing the buffer
space using buf[offset] is no longer safe.  Forcing the ring buffer to be
allocated using vmalloc would avoid this problem, but that would impose
a limitation on all perf event output buffers which is not an acceptable
cost.

The solution implemented here adds a flag to the __perf_output_begin()
function that performs the reserving of buffer space.  The new flag
(stay_in_page) indicates whether the requested chunk of memory must be
on a single page.  In this case, the requested size cannot exceed the
page size.  If the request cannot be satisfied within the current page,
the unused portion of the current page is filled with 0s.

A new function perf_output_begin_forward_in_page() is to be used to
commence output that cannot cross page boundaries.
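
Purely as a sketch of the calling sequence (this runs inside some kernel
function where 'event' is the target perf_event; the payload values are
placeholders):

struct perf_output_handle handle;
u64 *buf;
int err;

/* 1. Reserve 16 bytes, guaranteed to live within a single page. */
err = perf_output_begin_forward_in_page(&handle, event, 16);
if (err < 0)
        return err;

/* 2. Write directly to the reserved buffer space. */
buf = handle.addr;
buf[0] = 1;     /* placeholder payload */
buf[1] = 2;

/* 3. Finalize, advancing the ring-buffer head for userspace. */
perf_output_end(&handle);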

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/perf_event.h  |  3 ++
 kernel/events/ring_buffer.c | 65 -
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 15a82ff0aefe..2b35d1ce61f8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1291,6 +1291,9 @@ extern int perf_output_begin(struct perf_output_handle 
*handle,
 extern int perf_output_begin_forward(struct perf_output_handle *handle,
struct perf_event *event,
unsigned int size);
+extern int perf_output_begin_forward_in_page(struct perf_output_handle *handle,
+struct perf_event *event,
+unsigned int size);
 extern int perf_output_begin_backward(struct perf_output_handle *handle,
  struct perf_event *event,
  unsigned int size);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 674b35383491..01ba540e3ee0 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -116,9 +116,11 @@ ring_buffer_has_space(unsigned long head, unsigned long 
tail,
 static __always_inline int
 __perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size,
-   bool backward)
+   bool backward, bool stay_in_page)
 {
struct ring_buffer *rb;
+   unsigned int adj_size;
+   unsigned int gap_size;
unsigned long tail, offset, head;
int have_lost, page_shift;
struct {
@@ -144,6 +146,13 @@ __perf_output_begin(struct perf_output_handle *handle,
goto out;
}
 
+   page_shift = PAGE_SHIFT + page_order(rb);
+
+   if (unlikely(stay_in_page)) {
+   if (size > (1UL << page_shift))
+   goto out;
+   }
+
handle->rb= rb;
handle->event = event;
 
@@ -156,13 +165,24 @@ __perf_output_begin(struct perf_output_handle *handle,
 
perf_output_get_handle(handle);
 
+   gap_size = 0;
+   adj_size = size;
do {
tail = READ_ONCE(rb->user_page->data_tail);
offset = head = local_read(&rb->head);
+
+   if (unlikely(stay_in_page)) {
+   gap_size = (1UL << page_shift) -
+  (offset & ((1UL << page_shift) - 1));
+   if (gap_size < size)
+   adj_size += gap_size;
+  

[RFC PATCH 06/11] dtrace: tiny userspace tool to exercise DTrace support

2019-05-21 Thread Kris Van Hees
This commit provides a small tool that makes use of the following new
features in BPF, as a sample of how the DTrace userspace code will be
interacting with the kernel:

- It uses the ability to tail-call into a BPF program of a different
  type (as long as the proper context conversion is implemented).  This
  is used to attach a BPF kprobe program to a kprobe and have it
  tail-call a BPF dtrace program.  We do this so the probe action can
  execute in a tracer-specific context rather than in a probe context.
  This way, probes of different types can all execute the probe actions
  in the same probe-independent context.

- It uses the new bpf_finalize_context() helper to retrieve data that is
  set in the BPF kprobe program attached to the probe, and use that data
  to further populate the tracer context for the BPF dtrace program.

Output is generated using the bpf_perf_event_output() helper.  This tiny
proof of concept tool demonstrates that the tail-call mechanism into a
different BPF program type works correctly, and that it is possible to
create a new context that contains information that is currently not
available to BPF programs from either a context or by means of a helper
(aside from using bpf_probe_read() on an address that is derived from
the current task).

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 MAINTAINERS   |   1 +
 tools/dtrace/.gitignore   |   1 +
 tools/dtrace/Makefile |  79 
 tools/dtrace/dt_bpf.c |  15 ++
 tools/dtrace/dt_buffer.c  | 386 ++
 tools/dtrace/dt_utils.c   | 132 +
 tools/dtrace/dtrace.c |  38 
 tools/dtrace/dtrace.h |  44 +
 tools/dtrace/probe1_bpf.c | 100 ++
 9 files changed, 796 insertions(+)
 create mode 100644 tools/dtrace/.gitignore
 create mode 100644 tools/dtrace/Makefile
 create mode 100644 tools/dtrace/dt_bpf.c
 create mode 100644 tools/dtrace/dt_buffer.c
 create mode 100644 tools/dtrace/dt_utils.c
 create mode 100644 tools/dtrace/dtrace.c
 create mode 100644 tools/dtrace/dtrace.h
 create mode 100644 tools/dtrace/probe1_bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 07da7cc69f23..6d934c9f5f93 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5456,6 +5456,7 @@ L:dtrace-de...@oss.oracle.com
 S: Maintained
 F: include/uapi/linux/dtrace.h
 F: kernel/trace/dtrace
+F: tools/dtrace
 
 DVB_USB_AF9015 MEDIA DRIVER
 M: Antti Palosaari 
diff --git a/tools/dtrace/.gitignore b/tools/dtrace/.gitignore
new file mode 100644
index ..d60e73526296
--- /dev/null
+++ b/tools/dtrace/.gitignore
@@ -0,0 +1 @@
+dtrace
diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
new file mode 100644
index ..c2ee3fb2576f
--- /dev/null
+++ b/tools/dtrace/Makefile
@@ -0,0 +1,79 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# This Makefile is shamelessly copied from samples/bpf and modified to support
+# building this prototype tracing tool.
+
+DTRACE_PATH?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(DTRACE_PATH)/..
+SAMPLES_PATH   := $(DTRACE_PATH)/../../samples
+
+hostprogs-y:= dtrace
+
+LIBBPF := $(TOOLS_PATH)/lib/bpf/libbpf.a
+OBJS   := ../../samples/bpf/bpf_load.o dt_bpf.o dt_buffer.o 
dt_utils.o
+
+dtrace-objs:= $(OBJS) dtrace.o
+
+always := $(hostprogs-y)
+always += probe1_bpf.o
+
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/lib
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/perf
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/include/uapi
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/include/
+KBUILD_HOSTCFLAGS  += -I$(srctree)/usr/include
+
+KBUILD_HOSTLDLIBS  := $(LIBBPF) -lelf
+
+LLC?= llc
+CLANG  ?= clang
+LLVM_OBJCOPY   ?= llvm-objcopy
+
+ifdef CROSS_COMPILE
+HOSTCC = $(CROSS_COMPILE)gcc
+CLANG_ARCH_ARGS= -target $(ARCH)
+endif
+
+all:
+   $(MAKE) -C ../../ $(CURDIR)/ DTRACE_PATH=$(CURDIR)
+
+clean:
+   $(MAKE) -C ../../ M=$(CURDIR) clean
+   @rm -f *~
+
+$(LIBBPF): FORCE
+   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(DTRACE_PATH)/../../ 
O=
+
+FORCE:
+
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
+
+verify_cmds: $(CLANG) $(LLC)
+   @for TOOL in $^ ; do \
+   if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
+   echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+   exit 1; \
+   else true; fi; \
+   done
+
+verify_target_bpf: verify_cmds
+   @if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+   echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+   echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+   exit 2; \
+   else true; fi
+
+$(DTRACE_PATH)/*.c: verify_target_

[RFC PATCH 04/11] trace: initial implementation of DTrace based on kernel

2019-05-21 Thread Kris Van Hees
This patch adds an implementation for BPF_PROG_TYPE_DTRACE, making use of
the tail-call mechanism to allow tail-calls between programs of a
different BPF program type.  A new config option (DTRACE) is added to
control whether to include this new feature.

The DTRACE BPF program type provides an environment for executing probe
actions within the generalized trace context, independent from the type of
probe that triggered the program.  Probe types support specific BPF program
types, and this implementation uses the tail-call mechanism to call into
the DTRACE BPF program type from probe BPF program types.  This initial
implementation provides support for the KPROBE type only - more will be
added in the near future.

The implementation provides:
 - dtrace_get_func_proto() as helper validator
 - dtrace_is_valid_access() as context access validator
 - dtrace_convert_ctx_access() as context access rewriter
 - dtrace_is_valid_tail_call() to validate the calling program type
 - dtrace_convert_ctx() to convert the context of the calling program into
   a DTRACE BPF program type context
 - dtrace_finalize_context() as bpf_finalize_context() helper for the
   DTRACE BPF program type

The dtrace_bpf_ctx struct defines the DTRACE BPF program type context at
the kernel level, and stores the following members:

struct pt_regs *regs- register state when probe fired
u32 ecb_id  - probe enabling ID
u32 probe_id- probe ID
struct task_struct *task- executing task when probe fired

The regs and task members are populated from dtrace_convert_ctx() which is
called during the tail-call processing.  The ecb_id and probe_id are
populated from dtrace_finalize_context().

Sample use:

#include 

/*
 * Map to store DTRACE BPF programs that can be called using
 * the tail-call mechanism from other BPF program types.
 */
struct bpf_map_def SEC("maps") progs = {
.type = BPF_MAP_TYPE_PROG_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(u32),
.max_entries = 8192,
};

/*
 * Map to store DTrace probe specific information and share
 * it across program boundaries.  This makes it possible for
 * DTRACE BPF program to know what probe caused them to get
 * called.
 */
struct bpf_map_def SEC("maps") probemap = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(u32),
.value_size = sizeof(struct dtrace_ecb),
.max_entries = NR_CPUS,
};

SEC("dtrace/1") int dt_probe1(struct dtrace_bpf_context *ctx)
{
struct dtrace_ecb   *ecb;
charfmt[] = "EPID %d PROBE %d\n";

bpf_finalize_context(ctx, &probemap);
bpf_trace_printk(fmt, sizeof(fmt),
 ctx->ecb_id, ctx->probe_id);

return 0;
}

SEC("kprobe/sys_write") int bpf_prog1(struct pt_regs *ctx)
{
struct dtrace_ecb   ecb;
int cpu;

cpu = bpf_get_smp_processor_id();
ecb.id = 3;
ecb.probe_id = 123;

/* Store the ECB. */
bpf_map_update_elem(&probemap, &cpu, &ecb, BPF_ANY);

/* Issue tail-call into DTRACE BPF program. */
bpf_tail_call(ctx, &progs, 1);

/* fall through -> program not found or call failed */
return 0;
}

This patch also adds DTrace as a new subsystem in the MAINTAINERS file,
with me as current maintainer and our development mailing list for
specific development discussions.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 MAINTAINERS  |   7 +
 include/uapi/linux/dtrace.h  |  50 ++
 kernel/trace/dtrace/Kconfig  |   7 +
 kernel/trace/dtrace/Makefile |   3 +
 kernel/trace/dtrace/bpf.c| 321 +++
 5 files changed, 388 insertions(+)
 create mode 100644 include/uapi/linux/dtrace.h
 create mode 100644 kernel/trace/dtrace/Kconfig
 create mode 100644 kernel/trace/dtrace/Makefile
 create mode 100644 kernel/trace/dtrace/bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ce573aaa04df..07da7cc69f23 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5450,6 +5450,13 @@ W:   https://linuxtv.org
 S: Odd Fixes
 F:     drivers/media/pci/dt3155/
 
+DTRACE
+M: Kris Van Hees 
+L: dtrace-de...@oss.oracle.com
+S: Maintained
+F: include/uapi/linux/dtrace.h
+F: kernel/trace/dtrace
+
 DVB_USB_AF9015 MEDIA DRIVER
 M: Antti Palosaari 
 L: linux-me...@vger.kernel.org
diff --git a/include/uapi/linux/dtrace.h b/include/uapi/linux/dtrace.h
new file mode 100644
index ..bbe2562c1

[RFC PATCH 07/11] bpf: implement writable buffers in contexts

2019-05-21 Thread Kris Van Hees
Currently, BPF supports writes to packet data in very specific cases.
The implementation can be of more general use and can be extended to any
number of writable buffers in a context.  The implementation adds two new
register types: PTR_TO_BUFFER and PTR_TO_BUFFER_END, similar to the types
PTR_TO_PACKET and PTR_TO_PACKET_END.  In addition, a field 'buf_id' is
added to the reg_state structure as a way to distinguish between different
buffers in a single context.

Buffers are specified in the context by a pair of members:
- a pointer to the start of the buffer (type PTR_TO_BUFFER)
- a pointer to the first byte beyond the buffer (type PTR_TO_BUFFER_END)

A context can contain multiple buffers.  Each buffer/buffer_end pair is
identified by a unique id (buf_id).  The start-of-buffer member offset is
usually a good unique identifier.

The semantics for using a writable buffer are the same as for packet data.
The BPF program must contain a range test (buf + num > buf_end) to ensure
that the verifier can verify that offsets are within the allowed range.

Whenever a helper is called that might update the content of the context
all range information for registers that hold pointers to a buffer is
cleared, just as it is done for packet pointers.
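
A minimal sketch of the required pattern, using the buf/buf_end pair that
the DTrace context gains later in this series as the example:

SEC("dtrace/1") int prog(struct dtrace_bpf_context *ctx)
{
        u8 *buf = ctx->buf;             /* PTR_TO_BUFFER */
        u8 *end = ctx->buf_end;         /* PTR_TO_BUFFER_END */

        /* Range test: mandatory before any access to buf[0..7]. */
        if (buf + 8 > end)
                return 0;

        *(u32 *)&buf[0] = 1;            /* now provably in range */
        *(u32 *)&buf[4] = 2;
        return 0;
}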

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h  |   3 +
 include/linux/bpf_verifier.h |   4 +-
 kernel/bpf/verifier.c| 198 ---
 3 files changed, 145 insertions(+), 60 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e4bcb79656c4..fc3eda0192fb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -275,6 +275,8 @@ enum bpf_reg_type {
PTR_TO_TCP_SOCK, /* reg points to struct tcp_sock */
PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */
PTR_TO_TP_BUFFER,/* reg points to a writable raw tp's buffer */
+   PTR_TO_BUFFER,   /* reg points to ctx buffer */
+   PTR_TO_BUFFER_END,   /* reg points to ctx buffer end */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -283,6 +285,7 @@ enum bpf_reg_type {
 struct bpf_insn_access_aux {
enum bpf_reg_type reg_type;
int ctx_field_size;
+   u32 buf_id;
 };
 
 static inline void
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 1305ccbd8fe6..3538382184f3 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -45,7 +45,7 @@ struct bpf_reg_state {
/* Ordering of fields matters.  See states_equal() */
enum bpf_reg_type type;
union {
-   /* valid when type == PTR_TO_PACKET */
+   /* valid when type == PTR_TO_PACKET | PTR_TO_BUFFER */
u16 range;
 
/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
@@ -132,6 +132,8 @@ struct bpf_reg_state {
 */
u32 frameno;
enum bpf_reg_liveness live;
+   /* For PTR_TO_BUFFER, to identify distinct buffers in a context. */
+   u32 buf_id;
 };
 
 enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f9e5536fd1af..5fba4e6f5424 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -406,6 +406,8 @@ static const char * const reg_type_str[] = {
[PTR_TO_TCP_SOCK]   = "tcp_sock",
[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
[PTR_TO_TP_BUFFER]  = "tp_buffer",
+   [PTR_TO_BUFFER] = "buf",
+   [PTR_TO_BUFFER_END] = "buf_end",
 };
 
 static char slot_type_char[] = {
@@ -467,6 +469,9 @@ static void print_verifier_state(struct bpf_verifier_env 
*env,
verbose(env, ",off=%d", reg->off);
if (type_is_pkt_pointer(t))
verbose(env, ",r=%d", reg->range);
+   else if (t == PTR_TO_BUFFER)
+   verbose(env, ",r=%d,bid=%d", reg->range,
+   reg->buf_id);
else if (t == CONST_PTR_TO_MAP ||
 t == PTR_TO_MAP_VALUE ||
 t == PTR_TO_MAP_VALUE_OR_NULL)
@@ -855,6 +860,12 @@ static bool reg_is_pkt_pointer_any(const struct 
bpf_reg_state *reg)
   reg->type == PTR_TO_PACKET_END;
 }
 
+static bool reg_is_buf_pointer_any(const struct bpf_reg_state *reg)
+{
+   return reg_is_pkt_pointer_any(reg) ||
+  reg->type == PTR_TO_BUFFER || reg->type == PTR_TO_BUFFER_END;
+}
+
 /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */
 static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
enum bpf_reg_type which)
@@ -1550,7 +1561,7 @@ static int check_map_access(struct bpf_verifier_env *env, 
u32 regn

[RFC PATCH 03/11] bpf: export proto for bpf_perf_event_output helper

2019-05-21 Thread Kris Van Hees
The bpf_perf_event_output helper is used by various tracer BPF program
types, but it was not visible outside of bpf_trace.c.  In order to make
it available to tracer BPF program types that are implemented elsewhere,
a function is added similar to bpf_get_trace_printk_proto() to query the
prototype (bpf_get_perf_event_output_proto()).

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h  | 1 +
 kernel/trace/bpf_trace.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7a40a3cd7ff2..e4bcb79656c4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -479,6 +479,7 @@ bool bpf_prog_array_compatible(struct bpf_array *array, 
const struct bpf_prog *f
 int bpf_prog_calc_tag(struct bpf_prog *fp);
 
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
+const struct bpf_func_proto *bpf_get_perf_event_output_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
unsigned long off, unsigned long len);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index b496ffdf5f36..3d812238bc40 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -473,6 +473,12 @@ static const struct bpf_func_proto 
bpf_perf_event_output_proto = {
.arg5_type  = ARG_CONST_SIZE_OR_ZERO,
 };
 
+const struct bpf_func_proto *bpf_get_perf_event_output_proto(void)
+{
+   return &bpf_perf_event_output_proto;
+}
+
+
 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
 static DEFINE_PER_CPU(struct perf_sample_data, bpf_misc_sd);
 
-- 
2.20.1



Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-21 Thread Kris Van Hees
On Tue, May 21, 2019 at 10:56:18AM -0700, Alexei Starovoitov wrote:
> On Mon, May 20, 2019 at 11:47:00PM +0000, Kris Van Hees wrote:
> > 
> > 2. bpf: add BPF_PROG_TYPE_DTRACE
> > 
> > This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
> > actually providing an implementation.  The actual implementation is
> > added in patch 4 (see below).  We do it this way because the
> > implementation is being added to the tracing subsystem as a component
> > that I would be happy to maintain (if merged) whereas the declaration
> > of the program type must be in the bpf subsystem.  Since the two
> > subsystems are maintained by different people, we split the
> > implementing patches across maintainer boundaries while ensuring that
> > the kernel remains buildable between patches.
> 
> None of these kernel patches are necessary for what you want to achieve.

I disagree.  The current support for BPF programs for probes associates a
specific BPF program type with a specific set of probes, which means that I
cannot write BPF programs based on a more general concept of a 'DTrace probe'
and provide functionality based on that.  It also means that if I have a D
clause (DTrace probe action code associated with probes) that is to be executed
for a list of probes of different types, I need to duplicate the program
because I cannot cross program type boundaries.

By implementing a program type for DTrace, and making it possible for
tail-calls to be made from various probe-specific program types to the DTrace
program type, I can accomplish what I described above.  More details are in
the cover letter and the commit messages of the individual patches.

The reason for these patches is that I cannot do the same with the existing
implementation.  Yes, I can do some of it, or use workarounds to accomplish
roughly the same thing, but at the expense of not being able to do what I
need to do, settling instead for some kind of best-effort alternative.  That
is not the goal here.

> Feel free to add tools/dtrace/ directory and maintain it though.

Thank you.

> The new dtrace_buffer doesn't need to replicate existing bpf+kernel 
> functionality
> and no changes are necessary in kernel/events/ring_buffer.c either.
> tools/dtrace/ user space component can use either per-cpu array map
> or hash map as a buffer to store arbitrary data into and use
> existing bpf_perf_event_output() to send it to user space via perf ring 
> buffer.
> 
> See, for example, how bpftrace does that.

When using bpf_perf_event_output() you need to construct the sample first,
and then send it off to user space using the perf ring-buffer.  That is extra
work that is unnecessary.  Also, storing arbitrary data from userspace in maps
is not relevant here because this is about data that is generated at the level
of the kernel and sent to userspace as part of the probe action that is
executed when the probe fires.

Bpftrace indeed uses maps to construct the sample and then uses the
perf ring-buffer to pass data to userspace.  That is not the way DTrace
works, and it is not the mechanism that we need here.  So, while this may
be satisfactory for bpftrace, it is not for DTrace.  We need more fine-grained
control over how we write data to the buffer (doing direct stores from BPF
code) and without the overhead of constructing a complete sample that can just
be handed over to bpf_perf_event_output().
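
To make the difference concrete, this is roughly what the existing helper
requires (a sketch, not actual DTrace code; 'buffers' is assumed to be a
BPF_MAP_TYPE_PERF_EVENT_ARRAY map and the values are placeholders):

struct sample {
        u64 probe_id;
        u64 value;
} s;

s.probe_id = ctx->probe_id;
s.value = 42;
/* Copies the fully assembled sample into the perf ring-buffer. */
bpf_perf_event_output(ctx, &buffers, BPF_F_CURRENT_CPU, &s, sizeof(s));

With the reserve/commit helpers from this series, those stores land
directly in the ring-buffer page and the intermediate copy disappears.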

Also, please note that I am not duplicating any kernel functionality when it
comes to buffer handling, and in fact, I found it very easy to be able to
tap into the perf event ring-buffer implementation and add a feature that I
need for DTrace.  That was a very pleasant experience for sure!

Kris


Re: [RFC PATCH 07/11] bpf: implement writable buffers in contexts

2019-05-20 Thread Kris Van Hees
On Mon, May 20, 2019 at 09:21:34PM -0400, Steven Rostedt wrote:
> Hi Kris,
> 
> Note, it's best to thread patches. Otherwise they get spread out in
> mail boxes and hard to manage. That is, every patch should be a reply
> to the 00/11 header patch.

Thanks for that advice - I will make sure to do that for future postings.

> Also, Peter Zijlstra (Cc'd) is the maintainer of perf on the kernel
> side. Please include him on Ccing perf changes that are done inside the
> kernel.

Ah, my apologies for missing Peter in the list of Cc's.  Thank you for adding
him.  I will update my list.

Kris

> On Mon, 20 May 2019 23:52:24 + (UTC)
> Kris Van Hees  wrote:
> 
> > Currently, BPF supports writes to packet data in very specific cases.
> > The implementation can be of more general use and can be extended to any
> > number of writable buffers in a context.  The implementation adds two new
> > register types: PTR_TO_BUFFER and PTR_TO_BUFFER_END, similar to the types
> > PTR_TO_PACKET and PTR_TO_PACKET_END.  In addition, a field 'buf_id' is
> > added to the reg_state structure as a way to distinguish between different
> > buffers in a single context.
> > 
> > Buffers are specified in the context by a pair of members:
> > - a pointer to the start of the buffer (type PTR_TO_BUFFER)
> > - a pointer to the first byte beyond the buffer (type PTR_TO_BUFFER_END)
> > 
> > A context can contain multiple buffers.  Each buffer/buffer_end pair is
> > identified by a unique id (buf_id).  The start-of-buffer member offset is
> > usually a good unique identifier.
> > 
> > The semantics for using a writable buffer are the same as for packet data.
> > The BPF program must contain a range test (buf + num > buf_end) to ensure
> > that the verifier can verify that offsets are within the allowed range.
> > 
> > Whenever a helper is called that might update the content of the context
> > all range information for registers that hold pointers to a buffer is
> > cleared, just as it is done for packet pointers.
> > 
> > Signed-off-by: Kris Van Hees 
> > Reviewed-by: Nick Alcock 
> > ---
> >  include/linux/bpf.h  |   3 +
> >  include/linux/bpf_verifier.h |   4 +-
> >  kernel/bpf/verifier.c| 198 ---
> >  3 files changed, 145 insertions(+), 60 deletions(-)
> > 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index e4bcb79656c4..fc3eda0192fb 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -275,6 +275,8 @@ enum bpf_reg_type {
> > PTR_TO_TCP_SOCK, /* reg points to struct tcp_sock */
> > PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */
> > PTR_TO_TP_BUFFER,/* reg points to a writable raw tp's buffer */
> > +   PTR_TO_BUFFER,   /* reg points to ctx buffer */
> > +   PTR_TO_BUFFER_END,   /* reg points to ctx buffer end */
> >  };
> >  
> >  /* The information passed from prog-specific *_is_valid_access
> > @@ -283,6 +285,7 @@ enum bpf_reg_type {
> >  struct bpf_insn_access_aux {
> > enum bpf_reg_type reg_type;
> > int ctx_field_size;
> > +   u32 buf_id;
> >  };
> >  
> >  static inline void
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 1305ccbd8fe6..3538382184f3 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -45,7 +45,7 @@ struct bpf_reg_state {
> > /* Ordering of fields matters.  See states_equal() */
> > enum bpf_reg_type type;
> > union {
> > -   /* valid when type == PTR_TO_PACKET */
> > +   /* valid when type == PTR_TO_PACKET | PTR_TO_BUFFER */
> > u16 range;
> >  
> > /* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
> > @@ -132,6 +132,8 @@ struct bpf_reg_state {
> >  */
> > u32 frameno;
> > enum bpf_reg_liveness live;
> > +   /* For PTR_TO_BUFFER, to identify distinct buffers in a context. */
> > +   u32 buf_id;
> >  };
> >  
> >  enum bpf_stack_slot_type {
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index f9e5536fd1af..5fba4e6f5424 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -406,6 +406,8 @@ static const char * const reg_type_str[] = {
> > [PTR_TO_TCP_SOCK]   = "tcp_sock",
> > [PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
> > [PTR_TO_TP_BUFFER]  = "tp_buffer",
> > +   [PTR_TO_BUFFER]

[RFC PATCH 05/11] trace: update Kconfig and Makefile to include DTrace

2019-05-20 Thread Kris Van Hees
This commit adds the dtrace implementation in kernel/trace/dtrace to
the trace Kconfig and Makefile.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 kernel/trace/Kconfig  | 2 ++
 kernel/trace/Makefile | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 5d965cef6c77..59c3bdfbaffc 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -786,6 +786,8 @@ config GCOV_PROFILE_FTRACE
  Note that on a kernel compiled with this config, ftrace will
  run significantly slower.
 
+source "kernel/trace/dtrace/Kconfig"
+
 endif # FTRACE
 
 endif # TRACING_SUPPORT
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c2b2148bb1d2..e643c4eac8f6 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -82,6 +82,7 @@ endif
 obj-$(CONFIG_DYNAMIC_EVENTS) += trace_dynevent.o
 obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 obj-$(CONFIG_UPROBE_EVENTS) += trace_uprobe.o
+obj-$(CONFIG_DTRACE) += dtrace/
 
 obj-$(CONFIG_TRACEPOINT_BENCHMARK) += trace_benchmark.o
 
-- 
2.20.1



[RFC PATCH 01/11] bpf: context casting for tail call

2019-05-20 Thread Kris Van Hees
Currently BPF programs are executed with a context that is provided by
code that initiates the execution.  Tracing tools that want to make use
of existing probes and events that allow BPF programs to be attached to
them are thus limited to the context information provided by the probe
or event source.  Often, more context is needed to allow tracing tools
the ablity to implement more complex constructs (e.g. more state-full
tracing).

This patch extends the tail-call mechanism to allow a BPF program of
one type to call a BPF program of another type.

BPF program types can specify two new operations in struct bpf_prog_ops:
- bool is_valid_tail_call(enum bpf_prog_type stype)
This function is called from bpf_prog_array_valid_tail_call()
which is called from bpf_check_tail_call()
which is called from bpf_prog_select_runtime()
which is called from bpf_prog_load() right after the
verifier finishes processing the program.  It is called for every
map of type BPF_MAP_TYPE_PROG_ARRAY, and is passed the type of the
program that is being loaded and therefore will be the origin of
tail calls.  It returns true if tail calls from the source BPF
program type to the implementing program type are allowed.

- void *convert_ctx(enum bpf_prog_type stype, void *ctx)
This function is called during the execution of a BPF tail-call.
It returns a valid context for the implementing BPF program type,
based on the passed context pointer (ctx) for BPF program type
stype.

The program array holding BPF programs that you can tail-call into
continues to require that all programs are of the same type.  But when
a compatibility check is made in a program that performs a tail-call,
the is_valid_tail_call() function is called (if available) to allow
the target type to determine whether it can handle the conversion of
a context from the source type to the target type.  If the function is
not implemented by the program type, casting is denied.

During execution, the convert_ctx() function is called (if available)
to perform the conversion of the current context to the context that the
target type expects.  Since the program type of the executing BPF program
is not explicitly known during execution, the verifier inserts an
instruction right before the tail-call to assign the current BPF program
type to R4.

The interpreter calls convert_ctx() using the program type in R4 as
source program type, the program type associated with the program array
as target program type, and the context as provided in R1.

A helper (finalize_context) is added to allow tail called programs to
perform context setup based on information that is passed in from the
calling program by means of a map that is indexed by CPU id.  The actual
content of the map is defined by the BPF program type implementation
for the program type that is being called.

The bpf_prog_types array is now being exposed to the rest of the BPF
code (where before it was local to just the syscall handling) because
the is_valid_tail_call() and convert_ctx() operations need to be
accessible.

There is no noticeable effect on BPF program types that do not implement
this new feature.

A JIT implementation is not available yet in this first iteration.

v2: Fixed compilation when CONFIG_BPF_SYSCALL=n.
Fixed casting issue on platforms with 32-bit pointers.

v3: Renamed the new program type operations to be more descriptive.
Added finalize_context() helper.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h   |  3 +++
 include/uapi/linux/bpf.h  | 11 -
 kernel/bpf/core.c | 29 ++-
 kernel/bpf/syscall.c  |  2 +-
 kernel/bpf/verifier.c | 16 +
 tools/include/uapi/linux/bpf.h| 11 -
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 7 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 59631dd0777c..7a40a3cd7ff2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -294,6 +294,8 @@ bpf_ctx_record_field_size(struct bpf_insn_access_aux *aux, 
u32 size)
 struct bpf_prog_ops {
int (*test_run)(struct bpf_prog *prog, const union bpf_attr *kattr,
union bpf_attr __user *uattr);
+   bool (*is_valid_tail_call)(enum bpf_prog_type stype);
+   void *(*convert_ctx)(enum bpf_prog_type stype, void *ctx);
 };
 
 struct bpf_verifier_ops {
@@ -571,6 +573,7 @@ extern const struct file_operations bpf_prog_fops;
 #undef BPF_PROG_TYPE
 #undef BPF_MAP_TYPE
 
+extern const struct bpf_prog_ops * const bpf_prog_types[];
 extern const struct bpf_prog_ops bpf_offload_prog_ops;
 extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops;
 extern const struct bpf_verifier_ops xdp_analyzer_ops;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index

[RFC PATCH 03/11] bpf: export proto for bpf_perf_event_output helper

2019-05-20 Thread Kris Van Hees
The bpf_perf_event_output helper is used by various tracer BPF program
types, but it was not visible outside of bpf_trace.c.  In order to make
it available to tracer BPF program types that are implemented elsewhere,
a function is added similar to bpf_get_trace_printk_proto() to query the
prototype (bpf_get_perf_event_output_proto()).

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h  | 1 +
 kernel/trace/bpf_trace.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7a40a3cd7ff2..e4bcb79656c4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -479,6 +479,7 @@ bool bpf_prog_array_compatible(struct bpf_array *array, 
const struct bpf_prog *f
 int bpf_prog_calc_tag(struct bpf_prog *fp);
 
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
+const struct bpf_func_proto *bpf_get_perf_event_output_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
unsigned long off, unsigned long len);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index b496ffdf5f36..3d812238bc40 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -473,6 +473,12 @@ static const struct bpf_func_proto 
bpf_perf_event_output_proto = {
.arg5_type  = ARG_CONST_SIZE_OR_ZERO,
 };
 
+const struct bpf_func_proto *bpf_get_perf_event_output_proto(void)
+{
+   return _perf_event_output_proto;
+}
+
+
 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
 static DEFINE_PER_CPU(struct perf_sample_data, bpf_misc_sd);
 
-- 
2.20.1



[RFC PATCH 11/11] dtrace: make use of writable buffers in BPF

2019-05-20 Thread Kris Van Hees
This commit modifies the tiny proof-of-concept DTrace utility to use
the writable-buffer support in BPF along with the new helpers for
buffer reservation and commit.  The dtrace_finalize_context() helper
is updated and is now marked with ctx_update because it sets the
buffer pointer to NULL (and size 0).

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/uapi/linux/dtrace.h |   4 +
 kernel/trace/dtrace/bpf.c   | 150 
 tools/dtrace/dt_buffer.c|  54 +
 tools/dtrace/probe1_bpf.c   |  47 ++-
 4 files changed, 198 insertions(+), 57 deletions(-)

diff --git a/include/uapi/linux/dtrace.h b/include/uapi/linux/dtrace.h
index bbe2562c11f2..3fcc075a429f 100644
--- a/include/uapi/linux/dtrace.h
+++ b/include/uapi/linux/dtrace.h
@@ -33,6 +33,10 @@ struct dtrace_bpf_context {
u32 gid;/* from_kgid(_user_ns, current_real_cred()->gid */
u32 euid;   /* from_kuid(_user_ns, current_real_cred()->euid */
u32 egid;   /* from_kgid(_user_ns, current_real_cred()->egid */
+
+   /* General output buffer */
+   __bpf_md_ptr(u8 *, buf);
+   __bpf_md_ptr(u8 *, buf_end);
 };
 
 /*
diff --git a/kernel/trace/dtrace/bpf.c b/kernel/trace/dtrace/bpf.c
index 95f4103d749e..93bd2f0319cc 100644
--- a/kernel/trace/dtrace/bpf.c
+++ b/kernel/trace/dtrace/bpf.c
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Actual kernel definition of the DTrace BPF context.
@@ -16,6 +17,9 @@ struct dtrace_bpf_ctx {
u32 ecb_id;
u32 probe_id;
struct task_struct  *task;
+   struct perf_output_handle   handle;
+   u64 buf_len;
+   u8  *buf;
 };
 
 /*
@@ -55,6 +59,8 @@ BPF_CALL_2(dtrace_finalize_context, struct dtrace_bpf_ctx *, 
ctx,
 
ctx->ecb_id = ecb->id;
ctx->probe_id = ecb->probe_id;
+   ctx->buf_len = 0;
+   ctx->buf = NULL;
 
return 0;
 }
@@ -62,17 +68,119 @@ BPF_CALL_2(dtrace_finalize_context, struct dtrace_bpf_ctx 
*, ctx,
 static const struct bpf_func_proto dtrace_finalize_context_proto = {
.func   = dtrace_finalize_context,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,   /* ctx */
.arg2_type  = ARG_CONST_MAP_PTR,/* map */
 };
 
+BPF_CALL_4(dtrace_buffer_reserve, struct dtrace_bpf_ctx *, ctx,
+ int, id, struct bpf_map *, map, int, size)
+{
+   struct bpf_array*arr = container_of(map, struct bpf_array, map);
+   int cpu = smp_processor_id();
+   struct bpf_event_entry  *ee;
+   struct perf_event   *ev;
+   int err;
+
+   /*
+* Make sure the writable-buffer id is valid.  We use the default which
+* is the offset of the start-of-buffer pointer in the public context.
+*/
+   if (id != offsetof(struct dtrace_bpf_context, buf))
+   return -EINVAL;
+
+   /*
+* Verify whether we have an uncommitted reserve.  If so, we deny this
+* request.
+*/
+   if (ctx->handle.rb)
+   return -EBUSY;
+
+   /*
+* Perform sanity checks.
+*/
+   if (cpu >= arr->map.max_entries)
+   return -E2BIG;
+   ee = READ_ONCE(arr->ptrs[cpu]);
+   if (!ee)
+   return -ENOENT;
+   ev = ee->event;
+   if (unlikely(ev->attr.type != PERF_TYPE_SOFTWARE ||
+ev->attr.config != PERF_COUNT_SW_BPF_OUTPUT))
+   return -EINVAL;
+   if (unlikely(ev->oncpu != cpu))
+   return -EOPNOTSUPP;
+
+   size = round_up(size, sizeof(u64));
+
+   err = perf_output_begin_forward_in_page(&ctx->handle, ev, size);
+   if (err < 0)
+   return err;
+
+   ctx->buf_len = size;
+   ctx->buf = ctx->handle.addr;
+
+   return 0;
+}
+
+static const struct bpf_func_proto dtrace_buffer_reserve_proto = {
+   .func   = dtrace_buffer_reserve,
+   .gpl_only   = false,
+   .ctx_update = true,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,   /* ctx */
+   .arg2_type  = ARG_ANYTHING, /* id */
+   .arg3_type  = ARG_CONST_MAP_PTR,/* map */
+   .arg4_type  = ARG_ANYTHING, /* size */
+};
+
+BPF_CALL_3(dtrace_buffer_commit, struct dtrace_bpf_ctx *, ctx,
+int, id, struct bpf_map *, map)
+{
+   /*
+* Make sure the writable-buffer id is valid.  We use the default which
+* is the offset of the start-of-buffer pointer in the public context.
+*/
+   if (id != offset
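
The message is truncated here.  Mirroring the id and outstanding-reserve
checks in dtrace_buffer_reserve() above, and the perf_output_end() flow
described in the perf patch of this series, a plausible reconstruction of
the commit helper (not the author's verbatim code) is:

BPF_CALL_3(dtrace_buffer_commit, struct dtrace_bpf_ctx *, ctx,
	   int, id, struct bpf_map *, map)
{
	/* Same writable-buffer id check as in dtrace_buffer_reserve(). */
	if (id != offsetof(struct dtrace_bpf_context, buf))
		return -EINVAL;

	/* Nothing to commit without an outstanding reserve. */
	if (!ctx->handle.rb)
		return -EINVAL;

	/* Publish the data and invalidate the buffer in the context. */
	perf_output_end(&ctx->handle);
	ctx->handle.rb = NULL;
	ctx->buf_len = 0;
	ctx->buf = NULL;

	return 0;
}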

[RFC PATCH 10/11] bpf: add bpf_buffer_reserve and bpf_buffer_commit helpers

2019-05-20 Thread Kris Van Hees
Add two helpers that are primarily used in combination with the
writable-buffer support.  The bpf_buffer_reserve() helper sets aside
a chunk of buffer space that can be written to, and once all data
has been written, the bpf_buffer_commit() helper is used to make the
data in the ring buffer visible to userspace.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/uapi/linux/bpf.h  | 39 ++-
 kernel/bpf/verifier.c |  6 +++-
 tools/include/uapi/linux/bpf.h| 39 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +++
 4 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7bcb707539d1..2b7772aa00b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2681,6 +2681,41 @@ union bpf_attr {
  * the implementing program type.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_reserve(void *ctx, int id, struct bpf_map *map, int size)
+ * Description
+ * Reserve *size* bytes in the output buffer for the special BPF
+ * perf event referenced by *map*, a BPF map of type
+ * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must have
+ * the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ * **PERF_TYPE_SOFTWARE** as **type**, and
+ * **PERF_COUNT_SW_BPF_OUTPUT** as **config**.  The reserved space
+ * will be available as the writable buffer identified with
+ * numeric ID **id** in the context.
+ *
+ * The number of reserved bytes cannot exceed the page size.
+ * The chunk of buffer space will be reserved within a single
+ * page, and if this results in unused space at the end of the
+ * previous page in the ring-buffer, that unused space will be
+ * filled with zeros.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_buffer_commit(void *ctx, int id, struct bpf_map *map)
+ * Description
+ * Finalize the previously reserved space in the output buffer
+ * for the special BPF perf event referenced by *map*, a BPF map
+ * of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The perf event must
+ * have the attributes: **PERF_SAMPLE_RAW** as **sample_type**,
+ * **PERF_TYPE_SOFTWARE** as **type**, and
+ * **PERF_COUNT_SW_BPF_OUTPUT** as **config**.
+ *
+ * The writable buffer identified with numeric ID **id** in the
+ * context will be invalidated, and can no longer be used to
+ * write data until a new **bpf_buffer_reserve**\ () has been
+ * invoked.
+ * Return
+ * 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2792,7 +2827,9 @@ union bpf_attr {
FN(strtoul),\
FN(sk_storage_get), \
FN(sk_storage_delete),  \
-   FN(finalize_context),
+   FN(finalize_context),   \
+   FN(buffer_reserve), \
+   FN(buffer_commit),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 90ae04b4d5c7..ff73ed743a58 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2763,7 +2763,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
case BPF_MAP_TYPE_PERF_EVENT_ARRAY:
if (func_id != BPF_FUNC_perf_event_read &&
func_id != BPF_FUNC_perf_event_output &&
-   func_id != BPF_FUNC_perf_event_read_value)
+   func_id != BPF_FUNC_perf_event_read_value &&
+   func_id != BPF_FUNC_buffer_reserve &&
+   func_id != BPF_FUNC_buffer_commit)
goto error;
break;
case BPF_MAP_TYPE_STACK_TRACE:
@@ -2848,6 +2850,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
case BPF_FUNC_perf_event_read:
case BPF_FUNC_perf_event_output:
case BPF_FUNC_perf_event_read_value:
+   case BPF_FUNC_buffer_reserve:
+   case BPF_FUNC_buffer_commit:
if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
goto error;
break;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7bcb707539d1..2b7772aa00b6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2681,6 +2681,41 @@ union bpf_attr {
  * the implementing program type.
  * Return
  * 0 on s
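
A minimal sketch of how a program would drive the two helpers, assuming a
BPF_MAP_TYPE_PERF_EVENT_ARRAY map named buffermap and the DTrace context
from this series (map and program names are illustrative):

struct bpf_map_def SEC("maps") buffermap = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = 64,		/* e.g. one entry per CPU */
};

SEC("dtrace/2") int dt_probe2(struct dtrace_bpf_context *ctx)
{
	int id = offsetof(struct dtrace_bpf_context, buf);
	u8 *buf, *end;

	if (bpf_buffer_reserve(ctx, id, &buffermap, 16) != 0)
		return 0;

	buf = ctx->buf;
	end = ctx->buf_end;
	if (buf + 16 > end)		/* range test the verifier requires */
		return 0;

	*(u32 *)&buf[0] = ctx->probe_id;
	*(u32 *)&buf[4] = ctx->ecb_id;

	return bpf_buffer_commit(ctx, id, &buffermap);
}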

[RFC PATCH 02/11] bpf: add BPF_PROG_TYPE_DTRACE

2019-05-20 Thread Kris Van Hees
Add a new BPF program type for DTrace.  The program type is not compiled
right now because the CONFIG_DTRACE option does not exist yet.  It will
be added in a following commit.

Three commits are involved here:

1. add the BPF program type (conditional on a to-be-added option)
2. add the BPF_PROG_TYPE_DTRACE implementation (building not enabled)
3. add the CONFIG_DTRACE option and enable compilation of the prog type
   implementation

The reason for this sequence is to ensure that the kernel tree remains
buildable between these commits.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf_types.h  |  3 +++
 include/uapi/linux/bpf.h   |  1 +
 samples/bpf/bpf_load.c | 10 +++---
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c |  2 ++
 tools/lib/bpf/libbpf_probes.c  |  1 +
 6 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5a9975678d6f..908f2e4f597e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -26,6 +26,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_TRACEPOINT, tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, raw_tracepoint_writable)
+#ifdef CONFIG_DTRACE
+BPF_PROG_TYPE(BPF_PROG_TYPE_DTRACE, dtrace)
+#endif
 #endif
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 61abe6b56948..7bcb707539d1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+   BPF_PROG_TYPE_DTRACE,
 };
 
 enum bpf_attach_type {
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index eae7b635343d..4812295484a1 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -87,6 +87,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_sockops = strncmp(event, "sockops", 7) == 0;
bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+   bool is_dtrace = strncmp(event, "dtrace", 6) == 0;
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
@@ -120,6 +121,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
prog_type = BPF_PROG_TYPE_SK_SKB;
} else if (is_sk_msg) {
prog_type = BPF_PROG_TYPE_SK_MSG;
+   } else if (is_dtrace) {
+   prog_type = BPF_PROG_TYPE_DTRACE;
} else {
printf("Unknown event '%s'\n", event);
return -1;
@@ -140,8 +143,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
return 0;
 
-   if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-   if (is_socket)
+   if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_dtrace) {
+   if (is_socket || is_dtrace)
event += 6;
else
event += 7;
@@ -643,7 +646,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
memcmp(shname, "cgroup/", 7) == 0 ||
memcmp(shname, "sockops", 7) == 0 ||
memcmp(shname, "sk_skb", 6) == 0 ||
-   memcmp(shname, "sk_msg", 6) == 0) {
+   memcmp(shname, "sk_msg", 6) == 0 ||
+   memcmp(shname, "dtrace", 6) == 0) {
ret = load_and_attach(shname, data->d_buf,
  data->d_size);
if (ret != 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 61abe6b56948..7bcb707539d1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_FLOW_DISSECTOR,
BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+   BPF_PROG_TYPE_DTRACE,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7e3b79d7c25f..44704a7d395d 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2269,6 +2269,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
case BPF_PROG_TYPE_CGROUP_SYSCTL:
return false;
case BPF_PROG_TYPE_KPROBE:
+   case BPF_PROG_TYPE_DTRACE:
default:
return true;
}

[RFC PATCH 04/11] trace: initial implementation of DTrace based on kernel facilities

2019-05-20 Thread Kris Van Hees
This patch adds an implementation for BPF_PROG_TYPE_DTRACE, making use of
the tail-call mechanism to allow tail-calls between programs of a
different BPF program type.  A new config option (DTRACE) is added to
control whether to include this new feature.

The DTRACE BPF program type provides an environment for executing probe
actions within the generalized trace context, independent of the type of
probe that triggered the program.  Probe types support specific BPF program
types, and this implementation uses the tail-call mechanism to call into
the DTRACE BPF program type from probe BPF program types.  This initial
implementation provides support for the KPROBE type only - more will be
added in the near future.

The implementation provides:
 - dtrace_get_func_proto() as helper validator
 - dtrace_is_valid_access() as context access validator
 - dtrace_convert_ctx_access() as context access rewriter
 - dtrace_is_valid_tail_call() to validate the calling program type
 - dtrace_convert_ctx() to convert the context of the calling program into
   a DTRACE BPF program type context
 - dtrace_finalize_context() as bpf_finalize_context() helper for the
   DTRACE BPF program type

The dtrace_bpf_ctx struct defines the DTRACE BPF program type context at
the kernel level, and stores the following members:

struct pt_reg *regs - register state when probe fired
u32 ecb_id  - probe enabling ID
u32 probe_id- probe ID
struct task_struct *task- executing task when probe fired

The regs and task members are populated from dtrace_convert_ctx() which is
called during the tail-call processing.  The ecb_id and probe_id are
populated from dtrace_finalize_context().

Sample use:

#include 

/*
 * Map to store DTRACE BPF programs that can be called using
 * the tail-call mechanism from other BPF program types.
 */
struct bpf_map_def SEC("maps") progs = {
.type = BPF_MAP_TYPE_PROG_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(u32),
.max_entries = 8192,
};

/*
 * Map to store DTrace probe specific information and share
 * it across program boundaries.  This makes it possible for
 * DTRACE BPF programs to know what probe caused them to get
 * called.
 */
struct bpf_map_def SEC("maps") probemap = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(u32),
.value_size = sizeof(struct dtrace_ecb),
.max_entries = NR_CPUS,
};

SEC("dtrace/1") int dt_probe1(struct dtrace_bpf_context *ctx)
{
struct dtrace_ecb   *ecb;
charfmt[] = "EPID %d PROBE %d\n";

	bpf_finalize_context(ctx, &probemap);
bpf_trace_printk(fmt, sizeof(fmt),
 ctx->ecb_id, ctx->probe_id);

return 0;
}

SEC("kprobe/sys_write") int bpf_prog1(struct pt_regs *ctx)
{
struct dtrace_ecb   ecb;
int cpu;

cpu = bpf_get_smp_processor_id();
ecb.id = 3;
ecb.probe_id = 123;

/* Store the ECB. */
	bpf_map_update_elem(&probemap, &cpu, &ecb, BPF_ANY);

/* Issue tail-call into DTRACE BPF program. */
	bpf_tail_call(ctx, &progs, 1);

/* fall through -> program not found or call failed */
return 0;
}

This patch also adds DTrace as a new subsystem in the MAINTAINERS file,
with me as current maintainer and our development mailing list for
specific development discussions.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 MAINTAINERS  |   7 +
 include/uapi/linux/dtrace.h  |  50 ++
 kernel/trace/dtrace/Kconfig  |   7 +
 kernel/trace/dtrace/Makefile |   3 +
 kernel/trace/dtrace/bpf.c| 321 +++
 5 files changed, 388 insertions(+)
 create mode 100644 include/uapi/linux/dtrace.h
 create mode 100644 kernel/trace/dtrace/Kconfig
 create mode 100644 kernel/trace/dtrace/Makefile
 create mode 100644 kernel/trace/dtrace/bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ce573aaa04df..07da7cc69f23 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5450,6 +5450,13 @@ W:   https://linuxtv.org
 S: Odd Fixes
 F:     drivers/media/pci/dt3155/
 
+DTRACE
+M: Kris Van Hees 
+L: dtrace-de...@oss.oracle.com
+S: Maintained
+F: include/uapi/linux/dtrace.h
+F: kernel/trace/dtrace
+
 DVB_USB_AF9015 MEDIA DRIVER
 M: Antti Palosaari 
 L: linux-me...@vger.kernel.org
diff --git a/include/uapi/linux/dtrace.h b/include/uapi/linux/dtrace.h
new file mode 100644
index ..bbe2562c1

[RFC PATCH 06/11] dtrace: tiny userspace tool to exercise DTrace support features

2019-05-20 Thread Kris Van Hees
This commit provides a small tool that makes use of the following new
features in BPF, as a sample of how the DTrace userspace code will be
interacting with the kernel:

- It uses the ability to tail-call into a BPF program of a different
  type (as long as the proper context conversion is implemented).  This
  is used to attach a BPF kprobe program to a kprobe and have it
  tail-call a BPF dtrace program.  We do this so the probe action can
  execute in a tracer-specific context rather than in a probe context.
  This way, probes of different types can all execute the probe actions
  in the same probe-independent context.

- It uses the new bpf_finalize_context() helper to retrieve data that is
  set in the BPF kprobe program attached to the probe, and uses that data
  to further populate the tracer context for the BPF dtrace program.

Output is generated using the bpf_perf_event_output() helper.  This tiny
proof of concept tool demonstrates that the tail-call mechanism into a
different BPF program type works correctly, and that it is possible to
create a new context that contains information that is currently not
available to BPF programs from either a context or by means of a helper
(aside from using bpf_probe_read() on an address that is derived from
the current task).
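
For reference, a condensed sketch of the draining pattern a consumer like
this typically uses on a mmap'ed perf ring buffer (this is the general perf
mmap protocol, not the verbatim dt_buffer.c code; wrap-around handling of
individual records is omitted):

#include <linux/perf_event.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Process all new records in a ring buffer of 'npages' data pages;
 * 'pg' points at the mmap'ed perf_event_mmap_page header page.
 */
static void drain(struct perf_event_mmap_page *pg, size_t page_size,
		  uint64_t npages)
{
	uint64_t head = __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
	uint64_t tail = pg->data_tail;
	char *base = (char *)pg + page_size;

	while (tail < head) {
		struct perf_event_header *hdr;

		hdr = (void *)(base + (tail & (npages * page_size - 1)));
		if (hdr->type == PERF_RECORD_SAMPLE)
			;	/* process the sample payload here */
		tail += hdr->size;
	}

	/* Tell the kernel the data has been consumed. */
	__atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);
}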

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 MAINTAINERS   |   1 +
 tools/dtrace/.gitignore   |   1 +
 tools/dtrace/Makefile |  79 
 tools/dtrace/dt_bpf.c |  15 ++
 tools/dtrace/dt_buffer.c  | 386 ++
 tools/dtrace/dt_utils.c   | 132 +
 tools/dtrace/dtrace.c |  38 
 tools/dtrace/dtrace.h |  44 +
 tools/dtrace/probe1_bpf.c | 100 ++
 9 files changed, 796 insertions(+)
 create mode 100644 tools/dtrace/.gitignore
 create mode 100644 tools/dtrace/Makefile
 create mode 100644 tools/dtrace/dt_bpf.c
 create mode 100644 tools/dtrace/dt_buffer.c
 create mode 100644 tools/dtrace/dt_utils.c
 create mode 100644 tools/dtrace/dtrace.c
 create mode 100644 tools/dtrace/dtrace.h
 create mode 100644 tools/dtrace/probe1_bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 07da7cc69f23..6d934c9f5f93 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5456,6 +5456,7 @@ L:dtrace-de...@oss.oracle.com
 S: Maintained
 F: include/uapi/linux/dtrace.h
 F: kernel/trace/dtrace
+F: tools/dtrace
 
 DVB_USB_AF9015 MEDIA DRIVER
 M: Antti Palosaari 
diff --git a/tools/dtrace/.gitignore b/tools/dtrace/.gitignore
new file mode 100644
index ..d60e73526296
--- /dev/null
+++ b/tools/dtrace/.gitignore
@@ -0,0 +1 @@
+dtrace
diff --git a/tools/dtrace/Makefile b/tools/dtrace/Makefile
new file mode 100644
index ..c2ee3fb2576f
--- /dev/null
+++ b/tools/dtrace/Makefile
@@ -0,0 +1,79 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# This Makefile is shamelessly copied from samples/bpf and modified to support
+# building this prototype tracing tool.
+
+DTRACE_PATH?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(DTRACE_PATH)/..
+SAMPLES_PATH   := $(DTRACE_PATH)/../../samples
+
+hostprogs-y:= dtrace
+
+LIBBPF := $(TOOLS_PATH)/lib/bpf/libbpf.a
+OBJS   := ../../samples/bpf/bpf_load.o dt_bpf.o dt_buffer.o dt_utils.o
+
+dtrace-objs:= $(OBJS) dtrace.o
+
+always := $(hostprogs-y)
+always += probe1_bpf.o
+
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/lib
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/perf
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/include/uapi
+KBUILD_HOSTCFLAGS  += -I$(srctree)/tools/include/
+KBUILD_HOSTCFLAGS  += -I$(srctree)/usr/include
+
+KBUILD_HOSTLDLIBS  := $(LIBBPF) -lelf
+
+LLC?= llc
+CLANG  ?= clang
+LLVM_OBJCOPY   ?= llvm-objcopy
+
+ifdef CROSS_COMPILE
+HOSTCC = $(CROSS_COMPILE)gcc
+CLANG_ARCH_ARGS= -target $(ARCH)
+endif
+
+all:
+   $(MAKE) -C ../../ $(CURDIR)/ DTRACE_PATH=$(CURDIR)
+
+clean:
+   $(MAKE) -C ../../ M=$(CURDIR) clean
+   @rm -f *~
+
+$(LIBBPF): FORCE
+   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(DTRACE_PATH)/../../ O=
+
+FORCE:
+
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
+
+verify_cmds: $(CLANG) $(LLC)
+   @for TOOL in $^ ; do \
+   if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
+   echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+   exit 1; \
+   else true; fi; \
+   done
+
+verify_target_bpf: verify_cmds
+   @if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+   echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+   echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+   exit 2; \
+   else true; fi
+
+$(DTRACE_PATH)/*.c: verify_target_

[RFC PATCH 08/11] perf: add perf_output_begin_forward_in_page

2019-05-20 Thread Kris Van Hees
Right now, BPF programs can only write to a perf event ring buffer by
constructing a sample (as an arbitrary chunk of memory of a given size),
and calling perf_event_output() to have it written to the ring buffer.

A new implementation of DTrace (based on BPF) avoids constructing the
data sample prior to writing it to the ring buffer.  Instead, it expects
to be able to reserve a block of memory of a given size, write to that
memory region as it sees fit, and then finalize the written data (making
it available for reading from userspace).

This can (in part) be accomplished as follows:
1. reserve buffer space
Call perf_output_begin_forward_in_page(&handle, event, size), passing
in a handle to be used for this data output session, an event that
identifies the output buffer, and the size (in bytes) to set aside.

2. write data
Perform store operations to the buffer space that was set aside.
The buffer is a writable buffer in the BPF program context, which
means that operations like *(u32 *)&buf[offset] = val can be used.

3. finalize the output session
Call perf_output_end() to finalize the output and make the
new data available for reading from userspace by updating the head
of the ring buffer.
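
Put together, the kernel-side session described by these three steps looks
roughly as follows (a sketch; 'event' is assumed to be a resolved
struct perf_event pointer and the 16-byte payload is arbitrary):

struct perf_output_handle handle;
int err;

/* 1. reserve 16 bytes that will not cross a page boundary */
err = perf_output_begin_forward_in_page(&handle, event, 16);
if (err < 0)
	return err;

/* 2. write directly into the reserved block */
*(u32 *)handle.addr = 42;

/* 3. publish the data to userspace */
perf_output_end(&handle);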

The one caveat is that ring buffers may be allocated from non-contiguous
pages in kernel memory.  This means that a reserved block of memory could
be spread across two non-consecutive pages, and accessing the buffer
space using buf[offset] is no longer safe.  Forcing the ring buffer to be
allocated using vmalloc would avoid this problem, but that would impose
a limitation on all perf event output buffers which is not an acceptable
cost.

The solution implemented here adds a flag to the __perf_output_begin()
function that performs the reserving of buffer space.  The new flag
(stay_in_page) indicates whether the requested chunk of memory must be
on a single page.  In this case, the requested size cannot exceed the
page size.  If the request cannot be satisfied within the current page,
the unused portion of the current page is filled with 0s.
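For example, with 4096-byte pages, a 128-byte reservation made when only
96 bytes remain in the current page zero-fills those 96 bytes and places
the whole 128-byte chunk at the start of the next page.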

A new function perf_output_begin_forward_in_page() is to be used to
commence output that cannot cross page boundaries.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/perf_event.h  |  3 ++
 kernel/events/ring_buffer.c | 65 -
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 15a82ff0aefe..2b35d1ce61f8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1291,6 +1291,9 @@ extern int perf_output_begin(struct perf_output_handle *handle,
 extern int perf_output_begin_forward(struct perf_output_handle *handle,
struct perf_event *event,
unsigned int size);
+extern int perf_output_begin_forward_in_page(struct perf_output_handle *handle,
+struct perf_event *event,
+unsigned int size);
 extern int perf_output_begin_backward(struct perf_output_handle *handle,
  struct perf_event *event,
  unsigned int size);
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 674b35383491..01ba540e3ee0 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -116,9 +116,11 @@ ring_buffer_has_space(unsigned long head, unsigned long tail,
 static __always_inline int
 __perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size,
-   bool backward)
+   bool backward, bool stay_in_page)
 {
struct ring_buffer *rb;
+   unsigned int adj_size;
+   unsigned int gap_size;
unsigned long tail, offset, head;
int have_lost, page_shift;
struct {
@@ -144,6 +146,13 @@ __perf_output_begin(struct perf_output_handle *handle,
goto out;
}
 
+   page_shift = PAGE_SHIFT + page_order(rb);
+
+   if (unlikely(stay_in_page)) {
+   if (size > (1UL << page_shift))
+   goto out;
+   }
+
	handle->rb    = rb;
handle->event = event;
 
@@ -156,13 +165,24 @@ __perf_output_begin(struct perf_output_handle *handle,
 
perf_output_get_handle(handle);
 
+   gap_size = 0;
+   adj_size = size;
do {
tail = READ_ONCE(rb->user_page->data_tail);
		offset = head = local_read(&rb->head);
+
+   if (unlikely(stay_in_page)) {
+   gap_size = (1UL << page_shift) -
+  (offset & ((1UL << page_shift) - 1));
+   if (gap_size < size)
+   adj_size += gap_size;
+  

[RFC PATCH 09/11] bpf: mark helpers explicitly whether they may change the context

2019-05-20 Thread Kris Van Hees
Some helpers may update the context.  Right now, various network filter
helpers may make changes to the packet data.  This is verified by calling
the bpf_helper_changes_pkt_data() function with the function pointer.

This function resides in net/core/filter.c and needs to be updated for any
helper function that modifies packet data.  To allow other helpers
(possibly not part of the network filter code) to do the same, this patch
replaces the central list of such helpers with a flag on each individual
helper proto, marking every helper that may change the context data.  This
way, whenever a new helper is added that may change the content of the
context, there is no need to update a hardcoded list of functions.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h|  1 +
 include/linux/filter.h |  1 -
 kernel/bpf/core.c  |  5 
 kernel/bpf/verifier.c  |  2 +-
 net/core/filter.c  | 59 ++
 5 files changed, 27 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fc3eda0192fb..9e255d5b1062 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -226,6 +226,7 @@ enum bpf_return_type {
 struct bpf_func_proto {
u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
bool gpl_only;
+   bool ctx_update;
bool pkt_access;
enum bpf_return_type ret_type;
enum bpf_arg_type arg1_type;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7148bab96943..9dacca7d3ef6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -811,7 +811,6 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
-bool bpf_helper_changes_pkt_data(void *func);
 
 static inline bool bpf_dump_raw_ok(void)
 {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 225b1be766b0..8e9accf90c37 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2112,11 +2112,6 @@ void __weak bpf_jit_compile(struct bpf_prog *prog)
 {
 }
 
-bool __weak bpf_helper_changes_pkt_data(void *func)
-{
-   return false;
-}
-
 /* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
  * skb_copy_bits(), so provide a weak definition of it for NET-less config.
  */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5fba4e6f5424..90ae04b4d5c7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3283,7 +3283,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
}
 
/* With LD_ABS/IND some JITs save/restore skb from r1. */
-   changes_data = bpf_helper_changes_pkt_data(fn->func);
+   changes_data = fn->ctx_update;
if (changes_data && fn->arg1_type != ARG_PTR_TO_CTX) {
verbose(env, "kernel subsystem misconfigured func %s#%d: r1 != 
ctx\n",
func_id_name(func_id), func_id);
diff --git a/net/core/filter.c b/net/core/filter.c
index 55bfc941d17a..a9e7d3174d36 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1693,6 +1693,7 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
.func   = bpf_skb_store_bytes,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1825,6 +1826,7 @@ BPF_CALL_2(bpf_skb_pull_data, struct sk_buff *, skb, u32, len)
 static const struct bpf_func_proto bpf_skb_pull_data_proto = {
.func   = bpf_skb_pull_data,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1868,6 +1870,7 @@ BPF_CALL_2(sk_skb_pull_data, struct sk_buff *, skb, u32, len)
 static const struct bpf_func_proto sk_skb_pull_data_proto = {
.func   = sk_skb_pull_data,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1909,6 +1912,7 @@ BPF_CALL_5(bpf_l3_csum_replace, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_l3_csum_replace_proto = {
.func   = bpf_l3_csum_replace,
.gpl_only   = false,
+   .ctx_update = true,
.ret_type   = RET_INTEGER,
.arg1_type  = ARG_PTR_TO_CTX,
.arg2_type  = ARG_ANYTHING,
@@ -1962,6 +1966,7 @@ BPF_CALL_5(bpf_l4_csum_replace, struct sk_buff *, skb, u32, offset,
 static const struct bpf_func_proto bpf_l4_csum_replace_proto = {
.func   = bpf_l4_csum_replace,
.gpl_only   = false,
+   .ctx_update   

[RFC PATCH 07/11] bpf: implement writable buffers in contexts

2019-05-20 Thread Kris Van Hees
Currently, BPF supports writes to packet data in very specific cases.
The implementation can be of more general use and can be extended to any
number of writable buffers in a context.  The implementation adds two new
register types: PTR_TO_BUFFER and PTR_TO_BUFFER_END, similar to the types
PTR_TO_PACKET and PTR_TO_PACKET_END.  In addition, a field 'buf_id' is
added to the reg_state structure as a way to distinguish between different
buffers in a single context.

Buffers are specified in the context by a pair of members:
- a pointer to the start of the buffer (type PTR_TO_BUFFER)
- a pointer to the first byte beyond the buffer (type PTR_TO_BUFFER_END)

A context can contain multiple buffers.  Each buffer/buffer_end pair is
identified by a unique id (buf_id).  The start-of-buffer member offset is
usually a good unique identifier.

The semantics for using a writable buffer are the same as for packet data.
The BPF program must contain a range test (buf + num > buf_end) to ensure
that the verifier can verify that offsets are within the allowed range.
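
Concretely, the pattern a program must follow for a context that exposes
such a pair (shown here with the buf/buf_end members this series adds to
struct dtrace_bpf_context) is, as a sketch:

u8 *buf = ctx->buf;		/* PTR_TO_BUFFER, buf_id = offsetof(ctx, buf) */
u8 *end = ctx->buf_end;		/* PTR_TO_BUFFER_END */

if (buf + 8 > end)		/* the range test the verifier requires */
	return 0;

*(u64 *)buf = 0;		/* store is now provably in bounds */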

Whenever a helper is called that might update the content of the context,
all range information for registers that hold pointers to a buffer is
cleared, just as is done for packet pointers.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
 include/linux/bpf.h  |   3 +
 include/linux/bpf_verifier.h |   4 +-
 kernel/bpf/verifier.c| 198 ---
 3 files changed, 145 insertions(+), 60 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e4bcb79656c4..fc3eda0192fb 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -275,6 +275,8 @@ enum bpf_reg_type {
PTR_TO_TCP_SOCK, /* reg points to struct tcp_sock */
PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */
PTR_TO_TP_BUFFER,/* reg points to a writable raw tp's buffer */
+   PTR_TO_BUFFER,   /* reg points to ctx buffer */
+   PTR_TO_BUFFER_END,   /* reg points to ctx buffer end */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -283,6 +285,7 @@ enum bpf_reg_type {
 struct bpf_insn_access_aux {
enum bpf_reg_type reg_type;
int ctx_field_size;
+   u32 buf_id;
 };
 
 static inline void
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 1305ccbd8fe6..3538382184f3 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -45,7 +45,7 @@ struct bpf_reg_state {
/* Ordering of fields matters.  See states_equal() */
enum bpf_reg_type type;
union {
-   /* valid when type == PTR_TO_PACKET */
+   /* valid when type == PTR_TO_PACKET | PTR_TO_BUFFER */
u16 range;
 
/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
@@ -132,6 +132,8 @@ struct bpf_reg_state {
 */
u32 frameno;
enum bpf_reg_liveness live;
+   /* For PTR_TO_BUFFER, to identify distinct buffers in a context. */
+   u32 buf_id;
 };
 
 enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f9e5536fd1af..5fba4e6f5424 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -406,6 +406,8 @@ static const char * const reg_type_str[] = {
[PTR_TO_TCP_SOCK]   = "tcp_sock",
[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
[PTR_TO_TP_BUFFER]  = "tp_buffer",
+   [PTR_TO_BUFFER] = "buf",
+   [PTR_TO_BUFFER_END] = "buf_end",
 };
 
 static char slot_type_char[] = {
@@ -467,6 +469,9 @@ static void print_verifier_state(struct bpf_verifier_env *env,
verbose(env, ",off=%d", reg->off);
if (type_is_pkt_pointer(t))
verbose(env, ",r=%d", reg->range);
+   else if (t == PTR_TO_BUFFER)
+   verbose(env, ",r=%d,bid=%d", reg->range,
+   reg->buf_id);
else if (t == CONST_PTR_TO_MAP ||
 t == PTR_TO_MAP_VALUE ||
 t == PTR_TO_MAP_VALUE_OR_NULL)
@@ -855,6 +860,12 @@ static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg)
   reg->type == PTR_TO_PACKET_END;
 }
 
+static bool reg_is_buf_pointer_any(const struct bpf_reg_state *reg)
+{
+   return reg_is_pkt_pointer_any(reg) ||
+  reg->type == PTR_TO_BUFFER || reg->type == PTR_TO_BUFFER_END;
+}
+
 /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */
 static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
enum bpf_reg_type which)
@@ -1550,7 +1561,7 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regn

[RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-20 Thread Kris Van Hees
This patch set is also available, applied to bpf-next, at the following URL:

https://github.com/oracle/dtrace-linux-kernel/tree/dtrace-bpf

The patches in this set are part of a larger effort to re-implement DTrace
based on existing Linux kernel features wherever possible.  This allows
existing DTrace scripts to run without modification on Linux, and lets users
write new scripts using a tracing tool they may already be familiar with.
This set of patches is posted as an RFC.  I am soliciting feedback on the
patches, especially because they cross boundaries between tracing and BPF.
Some of the features might be combined with existing more specialized forms
of similar functionality, and perhaps some functionality could be moved to
other parts of the code.

This set of patches provides the initial core to make it possible to execute
DTrace BPF programs as probe actions, triggered from existing probes in the
kernel (right now just kprobe, but more will be added in followup patches).
The DTrace BPF programs run in a specific DTrace context that is independent
from the probe-specific BPF program context because DTrace actions are
implemented based on a general probe concept (an abstraction of the various
specific probe types).

It also provides a mechanism to store probe data in output buffers directly
from BPF programs, using direct store instructions.  Finally, it provides a
simple sample userspace tool to load programs, collect data, and print out the
data.  This little tool is currently hardcoded to process a single test case,
to show how the BPF program is to be constructed and to show how to retrieve
data from the output buffers.

The work presented here would not be possible without the effort many people
have put into tracing features on Linux.  BPF in particular is instrumental in
being able to do this project because it provides a safe and fast virtual
execution engine that can be leveraged to execute probe actions in a more
elegant manner.  The perf_event ring-buffer output mechanism has also proven
to be very beneficial to starting a re-implementation of DTrace on Linux,
especially because it avoids needing to add yet another buffer implementation
to the kernel.  It really helped with being able to re-use functionality.

The patch set provides the following patches:

1. bpf: context casting for tail call

This patch adds the ability to tail-call into a BPF program of a
different type than the one initiating the call.  It provides two
program type specific operations: is_valid_tail_call (to validate
whether the tail-call between the source type and target type is
allowed) and convert_ctx (to create a context for the target type
based on the context of the source type).  It also provides a
bpf_finalize_context() helper function prototype.  BPF program types
should implement this helper to perform any final context setup that
may need to be done within the execution context of the program type.
This helper is typically invoked as the first statement in an eBPF
program that can be tail-called from another type.

2. bpf: add BPF_PROG_TYPE_DTRACE

This patch adds BPF_PROG_TYPE_DTRACE as a new BPF program type, without
actually providing an implementation.  The actual implementation is
added in patch 4 (see below).  We do it this way because the
implementation is being added to the tracing subsystem as a component
that I would be happy to maintain (if merged) whereas the declaration
of the program type must be in the bpf subsystem.  Since the two
subsystems are maintained by different people, we split the
implementing patches across maintainer boundaries while ensuring that
the kernel remains buildable between patches.

3. bpf: export proto for bpf_perf_event_output helper

This patch makes a prototype available for the bpf_perf_event_output
helper so that program types outside of the base tracing eBPF code can
make use of it.

4. trace: initial implementation of DTrace based on kernel facilities

This patch provides the most basic implementation of the DTrace
execution core based on eBPF and other kernel facilities.  This
version only supports kprobes.  It makes use of the cross-program-type
tail-call support added in patch 1 (see above).

5. trace: update Kconfig and Makefile to include DTrace

This patch adds DTrace to the kernel config system and it ensures that
if CONFIG_DTRACE is set, the implementation of the DTrace core is
compiled into the kernel.

6. dtrace: tiny userspace tool to exercise DTrace support features

This patch provides a tiny userspace DTrace consumer as a
proof-of-concept and to test the DTrace eBPF program type and its use
by the DTrace core.

7. bpf: implement writable buffers in