On 06-08-2025 13:43, Steven Rostedt wrote:
On Wed, 6 Aug 2025 11:50:06 +0100
Douglas Raillard <douglas.raill...@arm.com> wrote:

On 05-08-2025 20:26, Steven Rostedt wrote:
From: Steven Rostedt <rost...@goodmis.org>

When a system call that reads user space addresses copy it to the ring
buffer, it can copy up to 511 bytes of data. This can waste precious ring
buffer space if the user isn't interested in the output. Add a new file
"syscall_user_buf_size" that gets initialized to a new config
CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.

Have you considered dynamically removing some event fields ? We routinely hit
the same problem with some of our events that have rarely-used large fields.

We do that already with eprobes. Note, syscall events are pseudo events
hooked on the raw_syscall events. Thus modifying what is displayed is
trivial as it's done manually anyway. For normal events, it's all in
the TRACE_EVENT() macro which defines the fields at boot. Trying to
modify it later is very difficult.

I was thinking at a filtering step between assigning to an event struct
with TP_fast_assign and actually writing it to the buffer. An array of (offset, 
size)
would allow selecting which field is to be copied to the buffer, the rest would
be left out (a bit like in some parts of the synthetic event API). The format
file would be impacted to remove some fields, but hopefully not too many other
corners of ftrace.

The advantage of that over eprobe would be:
1. full support of all field types
2. probably lower overhead than the fetch_op interpreter, but maybe not by much.
3. less moving pieces for the user (e.g. no need to have BTF for by-name field 
access,
   no new event name to come up with etc.)



If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
that allowed selecting what field is needed that would be amazing. I had plans
to build something like that in our kernel module based on the synthetic events 
API,
but did not proceed as that API is not exported in a useful way.

Take a look at eprobes. You can make a new event based from an existing
event (including other dynamic events and syscalls).
I finally got around to adding documentation about it:

   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/eprobetrace.rst


That's very interesting, I did not realize that you could access the actual 
event fields
and not just the tracepoint args. With your recent BTF patch, there is now 
little limits
on how deep you can drill down in the structs which is great (and actually more 
powerful
than the original event itself).

Before userspace tooling could make use of that as a field filtering system, a 
few friction
points would need to be addressed:

1. Getting the field list programmatically is currently not really possible as 
dealing with
   the format file is very tricky. We could just pass on the user-requested 
field
   to the kernel but that would prevent userspace validation with usable error 
reporting
   (the 6.15 kernel I tried it on gave me EINVAL and not even a dmesg error 
when trying to use
   a field that does not exist)

2. The type of the field is not inferred, e.g. an explicit ":string" is needed 
here:
e:my/sched_switch sched.sched_switch prev_comm=$prev_comm:string The only place a tool can get this info from is the format file, which means you have to
   parse it and apply some conversions (e.g. "__data_loc char[]" becomes 
"string").

3. Only a restricted subset of field types is supported, e.g. no cpumask, 
buffers other
   than strings etc. In practice, this means the userspace tooling will have to 
either:
     * pass on the restriction to the users (can easily lead to a terrible UX 
by misleading
       the user to think filtering is generally available when in fact it's 
not).
     * or only treat that as a hint and use the unfiltered original event if 
the user asks
       for a field with an unsupported type.

On the bright side, creating a new event like "e:my/sched_switch" gives the event name 
"sched_switch" but
trace-cmd start -e my/sched_switch will only enable the new event which is 
exactly what we need.
This way, the trace can look like a normal one except less fields, so 
downstream data processing
is not impacted and only the data-gathering step needs to know about it.

Depending on whether we want/can deal with those friction point, it could 
either become a high-level
layer usable like the base event system with extra low-level abilities, or stay 
as a tool only suitable for
hand-crafted use cases where the user has deeper knowledge of layout on all 
involved kernels.


On a related note, if we wanted to make something that allowed reducing the 
amount of stored data and
that could deeply integrate with the userspace tooling in charge of collecting 
the data to run a user-defined query,
the best bet is to target SQL-like systems. That family is very established and 
virtually all trace-processing system
will use it as first stage (e.g. Perfetto with sqlite, or LISA with Polars 
dataframes).
In those systems, some important information can typically be extracted from 
the user query [1]:

1. Projection: which tables and columns the query needs. In ftrace, that's the 
list of events and what fields
   are needed. Other events/fields can be discarded as they won't be read by 
the query.

2. Row limit: how many rows the query will read (not always available 
obviously). In ftrace, that would allow
   automatically stopping the tracing when the event count reaches a limit, or 
set the buffer size based on
   the event size for a flight-recorder approach. Additional event occurrences 
would be discarded by the query
   anyway.

3. Predicate filtering: If the query contains a filter to only select rows with 
a column equal to a specific
   value. Other rows don't need to be collected as the query will discard them 
anyway.

Currently:
1. is partially implemented as you can select specific events, but not what 
field you want.
2. is partially implemented (buffer size, but AFAIK there is no way of telling 
ftrace to stop tracing after N events).
3. is fully implemented with /sys/kernel/debug/tracing/events/*/*/filter

If all those are implemented, ftrace would be able to make use of the most 
important implicit info available
in the user query to limit the collected data size, without the user having to 
tune anything manually
and without turning the kernel into a full-blown SQL interpreter.

[1] In the Polars dataframe library, data sources such as a parquet file served over HTTP 
are called "scans".
    When Polars executes an expression, it will get the data from the scans the 
expression refers to,
    and will pass the 3 pieces of info to the scan implementation so that 
processed data size can be minimized
    as early as possible in the pipeline. This is referred to as "projection pushdown", "slice 
pushdown" and "predicate pushdown":
    https://docs.pola.rs/user-guide/lazy/optimizations/
    If some filtering condition is too complex to express in the limited scan 
predicate language, filtering will happen
    later in the pipeline. If the scan does not have a smart way to apply the 
filter (e.g. projection pushdown for a row-oriented file format
    will probably not bring massive speed improvements) then more data than 
necessary will be fetched and filtering will happen
    later in the pipeline.

-- Steve

--
Douglas


Reply via email to