Re: [PATCH 7/7] tracing: Add syscall_user_buf_size to limit amount written

Douglas Raillard Wed, 06 Aug 2025 09:21:28 -0700

On 06-08-2025 13:43, Steven Rostedt wrote:

On Wed, 6 Aug 2025 11:50:06 +0100
Douglas Raillard <[email protected]> wrote:

On 05-08-2025 20:26, Steven Rostedt wrote:

From: Steven Rostedt <[email protected]>

When a system call that reads user space addresses copy it to the ring
buffer, it can copy up to 511 bytes of data. This can waste precious ring
buffer space if the user isn't interested in the output. Add a new file
"syscall_user_buf_size" that gets initialized to a new config
CONFIG_SYSCALL_BUF_SIZE_DEFAULT that defaults to 128.


Have you considered dynamically removing some event fields? We routinely hit
the same problem with some of our events that have rarely-used large fields.


We do that already with eprobes. Note, syscall events are pseudo events
hooked on the raw_syscall events. Thus modifying what is displayed is
trivial as it's done manually anyway. For normal events, it's all in
the TRACE_EVENT() macro which defines the fields at boot. Trying to
modify it later is very difficult.


I was thinking of a filtering step between assigning to an event struct
with TP_fast_assign and actually writing it to the buffer. An array of
(offset, size) pairs would allow selecting which fields are copied to the
buffer; the rest would be left out (a bit like in some parts of the
synthetic event API). The format file would have to change to drop the
removed fields, but hopefully not too many other corners of ftrace would
be affected.
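
As a rough illustration of the (offset, size) idea, here is a userspace
Python sketch, not kernel code. The function name project_record and the
record layout are hypothetical and only show the copy step:

```python
from typing import List, Tuple


def project_record(record: bytes, spans: List[Tuple[int, int]]) -> bytes:
    """Copy only the selected (offset, size) spans of `record`, in order.

    `record` stands in for the fully assigned event struct, and the
    returned bytes stand in for what would be written to the ring buffer.
    """
    out = bytearray()
    for offset, size in spans:
        # Reject spans that fall outside the source record.
        if offset < 0 or size < 0 or offset + size > len(record):
            raise ValueError(f"span ({offset}, {size}) is out of bounds")
        out += record[offset:offset + size]
    return bytes(out)


# Keep the first 2 bytes and the last 2 bytes of an 8-byte record.
filtered = project_record(b"abcdefgh", [(0, 2), (6, 2)])
```

The offsets in the format file would then have to describe the filtered
layout rather than the original struct.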

The advantages of that over eprobes would be:
1. full support of all field types
2. probably lower overhead than the fetch_op interpreter, but maybe not by much
3. fewer moving pieces for the user (e.g. no need to have BTF for by-name
   field access, no new event name to come up with etc.)


If we could have a "fields" file in /sys/kernel/tracing/events/*/*/fields
that allowed selecting which fields are needed, that would be amazing. I had
plans to build something like that in our kernel module based on the
synthetic events API, but did not proceed as that API is not exported in a
useful way.


Take a look at eprobes. You can make a new event based from an existing
event (including other dynamic events and syscalls).
I finally got around to adding documentation about it:

   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/eprobetrace.rst


That's very interesting, I did not realize that you could access the actual
event fields and not just the tracepoint args. With your recent BTF patch,
there are now few limits on how deep you can drill down into the structs,
which is great (and actually more powerful than the original event itself).

Before userspace tooling could make use of that as a field filtering system,
a few friction points would need to be addressed:

1. Getting the field list programmatically is currently not really possible,
   as dealing with the format file is very tricky. We could just pass the
   user-requested field on to the kernel, but that would prevent userspace
   validation with usable error reporting (the 6.15 kernel I tried it on gave
   me EINVAL and not even a dmesg error when trying to use a field that does
   not exist).

2. The type of the field is not inferred, e.g. an explicit ":string" is
   needed here:

     e:my/sched_switch sched.sched_switch prev_comm=$prev_comm:string

   The only place a tool can get this info from is the format file, which
   means you have to parse it and apply some conversions (e.g.
   "__data_loc char[]" becomes "string").

3. Only a restricted subset of field types is supported, e.g. no cpumask,
   no buffers other than strings etc. In practice, this means the userspace
   tooling will have to either:
   * pass on the restriction to the users (which can easily lead to a
     terrible UX by misleading the user into thinking filtering is generally
     available when in fact it's not),
   * or only treat that as a hint and use the unfiltered original event if
     the user asks for a field with an unsupported type.
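
To make points 1 and 2 concrete, here is a minimal Python sketch of what a
tool would have to do today: parse the format file, validate the requested
fields in userspace, and add ":string" where needed. The helper names, the
sample format text, and the type mapping are assumptions for illustration;
the mapping in particular only covers the string cases mentioned above.

```python
import re
from typing import Dict, List

# Illustrative excerpt in the style of a tracefs "format" file
# (hand-written for this example, not copied from a real kernel).
SAMPLE_FORMAT = (
    "name: sched_switch\n"
    "format:\n"
    "\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n"
    "\tfield:char prev_comm[16];\toffset:8;\tsize:16;\tsigned:0;\n"
    "\tfield:pid_t prev_pid;\toffset:24;\tsize:4;\tsigned:1;\n"
)

FIELD_RE = re.compile(
    r"^\s*field:(?P<decl>[^;]+);\s*offset:\d+;\s*size:\d+;\s*signed:\d+;",
    re.MULTILINE,
)


def parse_format(text: str) -> Dict[str, str]:
    """Map each field name to its C type as declared in the format text."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        ctype, _, name = m.group("decl").strip().rpartition(" ")
        array = re.search(r"\[\d*\]$", name)
        if array:
            # "char prev_comm[16]" -> name "prev_comm", type "char[16]"
            name = name[:array.start()]
            ctype += array.group(0)
        fields[name] = ctype
    return fields


def eprobe_type_suffix(ctype: str) -> str:
    """Hypothetical, incomplete mapping: only string-like types get a suffix."""
    if ctype == "__data_loc char[]" or re.fullmatch(r"char\[\d+\]", ctype):
        return ":string"
    return ""


def build_eprobe(group: str, system: str, event: str,
                 wanted: List[str], format_text: str) -> str:
    """Build an eprobe definition, failing early on unknown fields."""
    known = parse_format(format_text)
    missing = [f for f in wanted if f not in known]
    if missing:
        raise ValueError(
            f"unknown field(s) for {system}/{event}: {', '.join(missing)}")
    args = " ".join(f"{f}=${f}{eprobe_type_suffix(known[f])}" for f in wanted)
    return f"e:{group}/{event} {system}.{event} {args}"


definition = build_eprobe("my", "sched", "sched_switch",
                          ["prev_comm", "prev_pid"], SAMPLE_FORMAT)
```

The resulting string is what would be written to the dynamic_events file.
Field types outside the mapping (point 3) would still need one of the two
fallbacks listed above.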

On the bright side, creating a new event like "e:my/sched_switch" gives the
event name "sched_switch", but trace-cmd start -e my/sched_switch will only
enable the new event, which is exactly what we need. This way, the trace can
look like a normal one except with fewer fields, so downstream data
processing is not impacted and only the data-gathering step needs to know
about it.

Depending on whether we want to and can deal with those friction points, it
could either become a high-level layer usable like the base event system with
extra low-level abilities, or stay a tool only suitable for hand-crafted use
cases where the user has deeper knowledge of the layout on all involved
kernels.

On a related note, if we wanted to make something that allowed reducing the
amount of stored data and that could deeply integrate with the userspace
tooling in charge of collecting the data to run a user-defined query, the
best bet is to target SQL-like systems. That family is very well established
and virtually all trace-processing systems will use one as a first stage
(e.g. Perfetto with sqlite, or LISA with Polars dataframes). In those
systems, some important information can typically be extracted from the user
query [1]:

1. Projection: which tables and columns the query needs. In ftrace, that's
   the list of events and what fields are needed. Other events/fields can be
   discarded as they won't be read by the query.

2. Row limit: how many rows the query will read (not always available
   obviously). In ftrace, that would allow automatically stopping the tracing
   when the event count reaches a limit, or setting the buffer size based on
   the event size for a flight-recorder approach. Additional event
   occurrences would be discarded by the query anyway.

3. Predicate filtering: whether the query contains a filter that only selects
   rows with a column equal to a specific value. Other rows don't need to be
   collected as the query will discard them anyway.

Currently:
1. is partially implemented, as you can select specific events but not which
   fields you want.
2. is partially implemented (buffer size, but AFAIK there is no way of
   telling ftrace to stop tracing after N events).
3. is fully implemented with /sys/kernel/debug/tracing/events/*/*/filter
3. is fully implemented with /sys/kernel/debug/tracing/events/*/*/filter

If all those are implemented, ftrace would be able to make use of the most
important implicit info available in the user query to limit the collected
data size, without the user having to tune anything manually and without
turning the kernel into a full-blown SQL interpreter.
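
As a sketch of what the collection side could do with that information,
the Python below turns a query description into tracefs writes. The classes
and the function are hypothetical; only the per-event "enable" and "filter"
files exist today, so the projection and the row limit are merely reported
as not pushed down. The tracefs mount point is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

TRACEFS = "/sys/kernel/tracing"  # assumed mount point


@dataclass
class EventRequest:
    system: str
    name: str
    fields: List[str]                # projection: fields the query reads
    predicate: Optional[str] = None  # expression in ftrace filter syntax


@dataclass
class QueryPlan:
    events: List[EventRequest]
    row_limit: Optional[int] = None  # max number of rows the query reads


def plan_to_tracefs_writes(plan: QueryPlan) -> Tuple[Dict[str, str], List[str]]:
    """Return (path -> content to write, list of info that cannot be pushed down)."""
    writes: Dict[str, str] = {}
    not_pushed: List[str] = []
    for ev in plan.events:
        base = f"{TRACEFS}/events/{ev.system}/{ev.name}"
        if ev.predicate is not None:
            # Set the filter before enabling so no unfiltered event is recorded.
            writes[f"{base}/filter"] = ev.predicate
        writes[f"{base}/enable"] = "1"
        # No per-event field selection exists, so all fields are recorded.
        not_pushed.append(
            f"projection of {ev.system}/{ev.name}: {', '.join(ev.fields)}")
    if plan.row_limit is not None:
        not_pushed.append(f"row limit: {plan.row_limit}")
    return writes, not_pushed


plan = QueryPlan(
    events=[EventRequest("sched", "sched_switch", ["prev_comm"],
                         predicate="prev_pid == 42")],
    row_limit=1000,
)
writes, not_pushed = plan_to_tracefs_writes(plan)
```

Everything left in not_pushed is data the trace collects and the query then
throws away.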

[1] In the Polars dataframe library, data sources such as a parquet file
served over HTTP are called "scans". When Polars executes an expression, it
will get the data from the scans the expression refers to, and will pass the
3 pieces of info to the scan implementation so that processed data size can
be minimized as early as possible in the pipeline. This is referred to as
"projection pushdown", "slice pushdown" and "predicate pushdown":
https://docs.pola.rs/user-guide/lazy/optimizations/
If some filtering condition is too complex to express in the limited scan
predicate language, filtering will happen later in the pipeline. If the scan
does not have a smart way to apply the filter (e.g. projection pushdown for a
row-oriented file format will probably not bring massive speed improvements)
then more data than necessary will be fetched and filtering will happen later
in the pipeline.
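
A toy scan in plain Python (not the Polars API; the scan function and its
parameters are made up for illustration) shows how the three pieces of info
let the source drop data before it reaches the rest of the pipeline:

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


def scan(rows: Iterable[Row],
         columns: Optional[List[str]] = None,
         predicate: Optional[Callable[[Row], bool]] = None,
         n_rows: Optional[int] = None) -> Iterator[Row]:
    """Yield rows from `rows`, applying the pushed-down info at the source."""
    if n_rows is not None and n_rows <= 0:
        return
    emitted = 0
    for row in rows:
        # Predicate pushdown: skip rows the query would discard anyway.
        # It runs on the full row so it may use columns that are not projected.
        if predicate is not None and not predicate(row):
            continue
        # Projection pushdown: keep only the columns the query reads.
        yield row if columns is None else {c: row[c] for c in columns}
        emitted += 1
        # Slice pushdown: stop reading the source once enough rows are out.
        if n_rows is not None and emitted >= n_rows:
            return


events = [
    {"pid": 1, "comm": "a", "cpu": 0},
    {"pid": 2, "comm": "b", "cpu": 1},
    {"pid": 1, "comm": "c", "cpu": 1},
    {"pid": 1, "comm": "d", "cpu": 0},
]
result = list(scan(events, columns=["comm"],
                   predicate=lambda r: r["pid"] == 1, n_rows=2))
```

Here the fourth event is never read from the source at all, which is the
effect a stop-after-N-events facility would have on a trace.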

-- Steve


--
Douglas
