Re: [PATCH 06/12] uprobes: add batch uprobe register/unregister APIs

2024-07-02 Thread Steven Rostedt
On Wed, 3 Jul 2024 00:19:05 +0900
Masami Hiramatsu (Google)  wrote:

> > BTW, is this (batched register/unregister APIs) something you'd like
> > to use from the tracefs-based (or whatever it's called, I mean non-BPF
> > ones) uprobes as well? Or there is just no way to even specify a batch
> > of uprobes? Just curious if you had any plans for this.  
> 
> No, because the current tracefs dynamic event interface is not designed for
> batched registration. I think we can expand it to pass wildcard symbols
> (for kprobe and fprobe) or a list of addresses (for uprobes).
> Um, that may be another good idea.

I don't see why not. The wild cards were added to the kernel
specifically for the tracefs interface (set_ftrace_filter).

-- Steve



Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-07-02 Thread Steven Rostedt
On Tue, 2 Jul 2024 11:32:53 -0400
Mathieu Desnoyers  wrote:

> If we use '*' for user events already, perhaps we'd want to consider
> using the same range for the ring buffer ioctls ? Arguably one is
> about instrumentation and the other is about ring buffer interaction
> (data transport), but those are both related to tracing.

Yeah, but I still rather keep them separate.

Beau, care to send a patch adding an entry into that ioctl document for
user events?

-- Steve



Re: [PATCH v2] filemap: add trace events for get_pages, map_pages, and fault

2024-07-02 Thread Steven Rostedt
On Tue, 2 Jul 2024 19:27:16 +0900
Takaya Saeki  wrote:

> Hello all, and thank you so much for the review, Steven and Masami.
> 
> I'm currently considering replacing the `max_ofs` output with
> `length`. Please let me know your thoughts.
> With the current design, a memory range of an event is an inclusive
> range of [ofs, max_ofs + 4096]. I found the `+4096` part confusing
> during ureadahead's upstreaming work. Replacing `max_ofs` with
> `length` makes the range specified by an event much more concise.

This makes sense to me.

Matthew, have any comments on this?

Thanks,

-- Steve



Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-07-02 Thread Steven Rostedt
On Tue, 2 Jul 2024 10:36:03 -0400
Mathieu Desnoyers  wrote:

> > I can send a patch this week to update it. Or feel free to send a patch
> > yourself.  
> 
> You need to reserve an unused ioctl Code and Seq# range within:
> 
> Documentation/userspace-api/ioctl/ioctl-number.rst

Ug, it's been so long, I completely forgot about that file.

Thanks for catching this.

> 
> Otherwise this duplicate will confuse all system call instrumentation
> tooling.

Agreed, what if we did this then:

-- Steve

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a141e8e65c5d..9a97030c6c8d 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -186,6 +186,7 @@ Code  Seq#    Include File                                 Comments
 'Q'   all    linux/soundcard.h
 'R'   00-1F  linux/random.h                               conflict!
 'R'   01     linux/rfkill.h                               conflict!
+'R'   20-2F  linux/trace_mmap.h
 'R'   C0-DF  net/bluetooth/rfcomm.h
 'R'   E0     uapi/linux/fsl_mc.h
 'S'   all    linux/cdrom.h                                conflict!
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index bd1066754220..c102ef35d11e 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,6 +43,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
-#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('R', 0x20)
 
 #endif /* _TRACE_MMAP_H_ */
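For anyone unfamiliar with how these numbers are packed, here is a small userspace sketch (mine, not part of the patch) showing why the original value collided with TCGETS and why the new one falls inside the 'R' 20-2F range reserved above:

	/*
	 * Sketch only: _IO(type, nr) is _IOC(_IOC_NONE, type, nr, 0), i.e. the
	 * direction, size, type character and number packed into one word.
	 * On x86 (and most architectures) TCGETS is 0x5401, which is exactly
	 * _IO('T', 0x1).  _IO('R', 0x20) encodes to 0x5220 instead.
	 */
	#include <stdio.h>
	#include <linux/ioctl.h>

	int main(void)
	{
		printf("_IO('T', 0x1)  = %#x\n", _IO('T', 0x1));	/* 0x5401, same as TCGETS on x86 */
		printf("_IO('R', 0x20) = %#x\n", _IO('R', 0x20));	/* 0x5220 */
		return 0;
	}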



Re: [PATCH 2/2] hugetlbfs: use tracepoints in hugetlbfs functions.

2024-07-01 Thread Steven Rostedt
On Wed, 12 Jun 2024 09:11:56 +0800
Hongbo Li  wrote:

> @@ -934,6 +943,12 @@ static int hugetlbfs_setattr(struct mnt_idmap *idmap,
>   if (error)
>   return error;
>  
> +	trace_hugetlbfs_setattr(inode, dentry->d_name.len, dentry->d_name.name,
> +				attr->ia_valid, attr->ia_mode,
> +				from_kuid(&init_user_ns, attr->ia_uid),
> +				from_kgid(&init_user_ns, attr->ia_gid),
> +				inode->i_size, attr->ia_size);
> +

That's a lot of parameters to pass to a tracepoint. Why not just pass the
dentry and attr and do the above in the TP_fast_assign() logic? That would
put less pressure on the icache for the code part.
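Roughly what that would look like (a sketch of the suggestion, not the submitted patch; the field selection is illustrative and would live in the hugetlbfs trace header):

	/*
	 * Sketch: pass the pointers and unpack them in TP_fast_assign(), which
	 * only runs when the event is enabled, so the call site stays one
	 * small call with three arguments.
	 */
	TRACE_EVENT(hugetlbfs_setattr,

		TP_PROTO(struct inode *inode, struct dentry *dentry, struct iattr *attr),

		TP_ARGS(inode, dentry, attr),

		TP_STRUCT__entry(
			__field(loff_t,		old_size)
			__field(loff_t,		new_size)
			__field(unsigned int,	ia_valid)
			__field(umode_t,	mode)
		),

		TP_fast_assign(
			__entry->old_size	= inode->i_size;
			__entry->new_size	= attr->ia_size;
			__entry->ia_valid	= attr->ia_valid;
			__entry->mode		= attr->ia_mode;
		),

		TP_printk("old_size=%lld new_size=%lld ia_valid=0x%x mode=0%o",
			  __entry->old_size, __entry->new_size,
			  __entry->ia_valid, __entry->mode)
	);

The call site then shrinks to trace_hugetlbfs_setattr(inode, dentry, attr);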

-- Steve



Re: [PATCH v23 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-06-30 Thread Steven Rostedt
On Sun, 30 Jun 2024 13:53:23 +0300
"Dmitry V. Levin"  wrote:

> On Fri, May 10, 2024 at 03:04:32PM +0100, Vincent Donnefort wrote:
> [...]
> > diff --git a/include/uapi/linux/trace_mmap.h 
> > b/include/uapi/linux/trace_mmap.h
> > index b682e9925539..bd1066754220 100644
> > --- a/include/uapi/linux/trace_mmap.h
> > +++ b/include/uapi/linux/trace_mmap.h
> > @@ -43,4 +43,6 @@ struct trace_buffer_meta {
> > __u64   Reserved2;
> >  };
> >  
> > +#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
> > +  
> 
> I'm sorry but among all the numbers this one was probably the least
> fortunate choice because it collides with TCGETS on most of architectures.

Hmm, that is unfortunate.

> 
> For example, this is how strace output would look like when
> TRACE_MMAP_IOCTL_GET_READER support is added:
> 
> $ strace -e ioctl stty
> ioctl(0, TCGETS or TRACE_MMAP_IOCTL_GET_READER, {c_iflag=ICRNL|IXON, 
> c_oflag=NL0|CR0|TAB0|BS0|VT0|FF0|OPOST|ONLCR, c_cflag=B38400|CS8|CREAD, 
> c_lflag=ISIG|ICANON|ECHO|ECHOE|ECHOK|IEXTEN|ECHOCTL|ECHOKE, ...}) = 0
> 
> Even though ioctl numbers are inherently not unique, TCGETS is
> a very traditional one, so it would be great if you could change
> TRACE_MMAP_IOCTL_GET_READER to avoid this collision.
> 
> Given that _IO('T', 0x1) is _IOC(_IOC_NONE, 'T', 0x1, 0),
> something like _IOC(_IOC_NONE, 'T', 0x1, 0x1) should be OK.

Well, it may not be too late to update this as it hasn't been
officially released in 6.10 yet. It's still only in the -rc and the
library doesn't rely on this yet (I've been holding off until 6.10 was
officially released before releasing the library that uses it).

I can send a patch this week to update it. Or feel free to send a patch
yourself.

Thanks,

-- Steve



Re: [PATCH 13/14] tracefs: Convert to new uid/gid option parsing helpers

2024-06-28 Thread Steven Rostedt
On Thu, 27 Jun 2024 19:40:44 -0500
Eric Sandeen  wrote:

> Convert to new uid/gid option parsing helpers
> 
> Signed-off-by: Eric Sandeen 

Acked-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v1 4/4] trace: platform/x86/intel/ifs: Add SBAF trace support

2024-06-27 Thread Steven Rostedt
On Thu, 27 Jun 2024 02:35:16 +
Kuppuswamy Sathyanarayanan  wrote:

> From: Jithu Joseph 
> 
> Add tracing support for the SBAF IFS tests, which may be useful for
> debugging systems that fail these tests. Log details like test content
> batch number, SBAF bundle ID, program index and the exact errors or
> warnings encountered by each HT thread during the test.
> 
> Reviewed-by: Ashok Raj 
> Reviewed-by: Tony Luck 
> Signed-off-by: Jithu Joseph 
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> ---
>  include/trace/events/intel_ifs.h | 27 
>  drivers/platform/x86/intel/ifs/runtest.c |  1 +
>  2 files changed, 28 insertions(+)
> 
> diff --git a/include/trace/events/intel_ifs.h 
> b/include/trace/events/intel_ifs.h
> index 0d88ebf2c980..9c7413de432b 100644
> --- a/include/trace/events/intel_ifs.h
> +++ b/include/trace/events/intel_ifs.h
> @@ -35,6 +35,33 @@ TRACE_EVENT(ifs_status,
>   __entry->status)
>  );
>  
> +TRACE_EVENT(ifs_sbaf,
> +
> + TP_PROTO(int batch, union ifs_sbaf activate, union ifs_sbaf_status 
> status),
> +
> + TP_ARGS(batch, activate, status),
> +
> + TP_STRUCT__entry(
> + __field(int,batch   )
> + __field(u64,status  )

Please put the 64 bit status field before the 32 bit batch field,
otherwise this will likely create a 4 byte hole between the two fields.
Space on the ring buffer is expensive real-estate.
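In other words, something along these lines (a sketch of the reordering, not the submitted patch):

	TP_STRUCT__entry(
		__field(u64,	status	)	/* 8 bytes, 8-byte aligned      */
		__field(int,	batch	)	/* 4 bytes                      */
		__field(u16,	bundle	)	/* 2 bytes                      */
		__field(u16,	pgm	)	/* 2 bytes -> 16 bytes, no hole */
	),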

-- Steve

> + __field(u16,bundle  )
> + __field(u16,pgm )
> + ),
> +
> + TP_fast_assign(
> + __entry->batch  = batch;
> + __entry->bundle = activate.bundle_idx;
> + __entry->pgm= activate.pgm_idx;
> + __entry->status = status.data;
> + ),
> +
> + TP_printk("batch: 0x%.2x, bundle_idx: 0x%.4x, pgm_idx: 0x%.4x, status: 
> 0x%.16llx",
> + __entry->batch,
> + __entry->bundle,
> + __entry->pgm,
> + __entry->status)
> +);
> +
>  #endif /* _TRACE_IFS_H */
>  
>  /* This part must be outside protection */
> diff --git a/drivers/platform/x86/intel/ifs/runtest.c 
> b/drivers/platform/x86/intel/ifs/runtest.c
> index bdb31b2f45b4..69ee0eb72025 100644
> --- a/drivers/platform/x86/intel/ifs/runtest.c
> +++ b/drivers/platform/x86/intel/ifs/runtest.c
> @@ -530,6 +530,7 @@ static int dosbaf(void *data)
>*/
>   wrmsrl(MSR_ACTIVATE_SBAF, run_params->activate->data);
>   rdmsrl(MSR_SBAF_STATUS, status.data);
> + trace_ifs_sbaf(ifsd->cur_batch, *run_params->activate, status);
>  
>   /* Pass back the result of the test */
>   if (cpu == first)




Re: [PATCH v3 2/2] rust: add tracepoint support

2024-06-26 Thread Steven Rostedt
On Wed, 26 Jun 2024 10:48:23 +0200
Alice Ryhl  wrote:

> >
> > Because your hooks/rust_binder.h and events/rust_binder.h use the same
> > TRACE_SYSTEM name? Could you try something like:
> >
> > #define TRACE_SYSTEM rust_binder_hook
> >
> > in your hooks/rust_binder.h?  
> 
> I was able to get it to work by moving the includes into two different
> .c files. I don't think changing TRACE_SYSTEM works because it must
> match the filename.

Try to use:

 #define TRACE_SYSTEM_VAR rust_binder_hook_other_name

in one. Then that is used as the variable for that file.
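Concretely, an untested sketch of what the top of hooks/rust_binder.h could look like (the "_other_name" suffix is just a placeholder):

	#undef TRACE_SYSTEM
	#define TRACE_SYSTEM rust_binder

	/*
	 * Give this header its own variable name so the generated C variables
	 * do not clash with the ones from events/rust_binder.h, which keeps
	 * the default TRACE_SYSTEM_VAR (== TRACE_SYSTEM).
	 */
	#undef TRACE_SYSTEM_VAR
	#define TRACE_SYSTEM_VAR rust_binder_hook_other_name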

-- Steve



Re: [PATCH v2] filemap: add trace events for get_pages, map_pages, and fault

2024-06-26 Thread Steven Rostedt
On Wed, 26 Jun 2024 21:31:57 +0900
Masami Hiramatsu (Google)  wrote:

> On Thu, 20 Jun 2024 16:19:03 +
> Takaya Saeki  wrote:
> 
> > To allow precise tracking of page caches accessed, add new tracepoints
> > that trigger when a process actually accesses them.
> > 
> > The ureadahead program used by ChromeOS traces the disk access of
> > programs as they start up at boot up. It uses mincore(2) or the
> > 'mm_filemap_add_to_page_cache' trace event to accomplish this. It stores
> > this information in a "pack" file and on subsequent boots, it will read
> > the pack file and call readahead(2) on the information so that disk
> > storage can be loaded into RAM before the applications actually need it.
> > 
> > A problem we see is that due to the kernel's readahead algorithm that
> > can aggressively pull in more data than needed (to try and accomplish
> > the same goal) and this data is also recorded. The end result is that
> > the pack file contains a lot of pages on disk that are never actually
> > used. Calling readahead(2) on these unused pages can slow down the
> > system boot up times.
> > 
> > To solve this, add 3 new trace events, get_pages, map_pages, and fault.
> > These will be used to trace the pages are not only pulled in from disk,
> > but are actually used by the application. Only those pages will be
> > stored in the pack file, and this helps out the performance of boot up.
> > 
> > With the combination of these 3 new trace events and
> > mm_filemap_add_to_page_cache, we observed a reduction in the pack file
> > by 7.3% - 20% on ChromeOS varying by device.
> >   
> 
> This looks good to me from the trace-event point of view.
> 
> Reviewed-by: Masami Hiramatsu (Google) 

I added my reviewed-by on the last patch, you could have added it on
this one as it didn't change as much. But anyway, here it is again:

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH] qdisc: fix NULL pointer dereference in perf_trace_qdisc_reset()

2024-06-25 Thread Steven Rostedt
On Fri, 21 Jun 2024 20:45:49 +0900
ysk...@gmail.com wrote:

> diff --git a/include/trace/events/qdisc.h b/include/trace/events/qdisc.h
> index f1b5e816e7e5..170b51fbe47a 100644
> --- a/include/trace/events/qdisc.h
> +++ b/include/trace/events/qdisc.h
> @@ -81,7 +81,7 @@ TRACE_EVENT(qdisc_reset,
>   TP_ARGS(q),
>  
>   TP_STRUCT__entry(
> - __string(   dev,qdisc_dev(q)->name  )
> + __string(dev, qdisc_dev(q) ? qdisc_dev(q)->name : "noop_queue")

From a tracing point of view:

Reviewed-by: Steven Rostedt (Google) 

-- Steve

>   __string(   kind,   q->ops->id  )
>   __field(u32,parent  )
>   __field(u32,handle  )



Re: [PATCH v2 2/7] error-injection: support static keys around injectable functions

2024-06-25 Thread Steven Rostedt
On Thu, 20 Jun 2024 00:48:56 +0200
Vlastimil Babka  wrote:

> @@ -86,6 +104,7 @@ static void populate_error_injection_list(struct 
> error_injection_entry *start,
>   ent->start_addr = entry;
>   ent->end_addr = entry + size;
>   ent->etype = iter->etype;
> + ent->key = (struct static_key *) iter->static_key_addr;

Nit, should there be a space between the typecast and the "iter"?

>   ent->priv = priv;
> 	INIT_LIST_HEAD(&ent->list);
> 	list_add_tail(&ent->list, &error_injection_list);

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v2 1/7] fault-inject: add support for static keys around fault injection sites

2024-06-25 Thread Steven Rostedt
On Thu, 20 Jun 2024 00:48:55 +0200
Vlastimil Babka  wrote:

> +static int debugfs_prob_set(void *data, u64 val)
> +{
> + struct fault_attr *attr = data;
> +
> + mutex_lock(_mutex);
> +
> + if (attr->active) {
> + if (attr->probability != 0 && val == 0) {
> + static_key_slow_dec(attr->active);
> + } else if (attr->probability == 0 && val != 0) {
> + static_key_slow_inc(attr->active);
> + }
> + }

So basically the above is testing if val to probability is going from
zero or non-zero. For such cases, I find it less confusing to have:

	if (attr->active) {
		if (!!attr->probability != !!val) {
			if (val)
				static_key_slow_inc(attr->active);
			else
				static_key_slow_dec(attr->active);
		}
	}

This does add a layer of nested ifs, but IMO it's a bit more clear in
what is happening, and it gets rid of the "else if".

Not saying you need to change it. This is more of an FYI.

-- Steve


> +
> + attr->probability = val;
> +
> + mutex_unlock(_mutex);
> +
> + return 0;
> +}



Re: [PATCH] filemap: add trace events for get_pages, map_pages, and fault

2024-06-18 Thread Steven Rostedt
 ofs=20480 max_ofs=81920
 less-901 [005] .72.534611: mm_filemap_map_pages: dev 253:3 ino 13f730 ofs=126976 max_ofs=172032
 less-901 [005] .72.534630: mm_filemap_map_pages: dev 253:3 ino 13f730 ofs=32768 max_ofs=94208
 less-901 [005] .72.534715: mm_filemap_map_pages: dev 253:3 ino 13f730 ofs=98304 max_ofs=122880

Just to get some numbers:

 # trace-cmd show | grep less-901 | grep 13f730 | cut -d: -f2- | grep add_to_page | sort -u | wc -l
 49

 # trace-cmd show | grep less-901 | grep 13f730 | cut -d: -f2- | grep fault | sort -u | wc -l
 3

 # trace-cmd show | grep less-901 | grep 13f730 | cut -d: -f2- | grep get_pages | sort -u | wc -l
 1

 # trace-cmd show | grep less-901 | grep 13f730 | cut -d: -f2- | grep map_pages | sort -u | wc -l
 6

Note, ureadahead ignores duplicate pages.

If these new events really do only show what is used and not what is
just pulled in, and doesn't miss anything, I can see it bringing down
the number of pages needed to be saved dramatically.

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-18 Thread Steven Rostedt
On Tue, 18 Jun 2024 13:47:49 +0200
Alexander Graf  wrote:

> IMHO the big fat disclaimer should be in the argument name. 
> "reserve_mem" to me sounds like it actually guarantees a reservation - 
> which it doesn't. Can we name it more along the lines of "debug" (to 
> indicate it's not for production data) or "phoenix" (usually gets reborn 
> out of ashes, but you can never know for sure): "debug_mem", / 
> "phoenix_mem"?

I don't see any reason it will not reserve memory. That doesn't seem to
be the issue.  What is not guaranteed is that it will be in the same
location as last time. So the name is correct. It's not
"reserve_consistent_memory" ;-)

-- Steve



Re: [PATCH v6 1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-18 Thread Steven Rostedt
On Tue, 18 Jun 2024 20:55:17 +0800
Zhengyejian  wrote:

> > +   start = memblock_phys_alloc(size, align);
> > +   if (!start)
> > +   return -ENOMEM;
> > +
> > +   reserved_mem_add(start, size, name);
> > +
> > +   return 0;  
> 
> Hi Steve, should this return 1?  Other __setup functions
> return 1 on success.

Ug, you're correct.

Mike, want me to send a v7?

-- Steve
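For reference, the convention being pointed out (a minimal sketch, not the reserve_mem patch; the handler name is made up):

	#include <linux/init.h>

	/*
	 * __setup() handlers return 1 when they consumed the option, and 0 to
	 * have it treated as an unknown parameter (and passed along to init).
	 * Error codes carry no meaning here; only zero vs non-zero matters.
	 */
	static int __init example_mem_setup(char *str)
	{
		if (!str || !*str)
			return 0;	/* nothing to parse, not handled */

		/* parse the string and reserve memory here */

		return 1;		/* handled */
	}
	__setup("example_mem=", example_mem_setup);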



Re: Bug in Kernel 6.8.x, 6.9.x Causing Trace/Panic During Shutdown/Reboot

2024-06-18 Thread Steven Rostedt
On Thu, 13 Jun 2024 10:32:24 +0300
Ilkka Naulapää  wrote:

> ok, so if you don't have any idea where this bug is after those debug
> patches, I'll try to find some time to bisect it as a last resort.
> Stay tuned.

FYI,

I just debugged a strange crash that was caused by my config having
something leftover from your config. Specifically, that was:

CONFIG_FORCE_NR_CPUS

Do you get any warning about nr cpus not matching at boot up?

Regardless, can you disable that and see if you still get the same
crash.

Thanks,

-- Steve



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Steven Rostedt
On Mon, 17 Jun 2024 23:01:12 +0200
Alexander Graf  wrote:
> > This could be an added feature, but it is very architecture specific,
> > and would likely need architecture specific updates.  
> 
> 
> It definitely would be an added feature, yes. But one that allows you to 
> ensure persistence a lot more safely :).

Sure.

> 
> Thinking about it again: What if you run the allocation super early (see 
> arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to 
> allocating only from top, you're effectively kernel version independent 
> for your allocations because none of the kernel code ran yet and 
> definitely KASLR independent because you're running deterministically 
> before KASLR even gets allocated.
> 
> > As this code relies on memblock_phys_alloc() being consistent, if
> > something gets allocated before it differently depending on where the
> > kernel is, it can also move the location. A plugin to UEFI would mean
> > that it would need to reserve the memory, and the code here will need
> > to know where it is. We could always make the function reserve_mem()
> > global and weak so that architectures can override it.  
> 
> 
> Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new 
> type of memblock with the respective reservations and you later call 
> memblock_find_in_range_node() instead of memblock_phys_alloc() to pass 
> in flags that you want to allocate only from the new 
> MEMBLOCK_RESERVE_MEM type. The same model would work for BIOS boots 
> through the handle_mem_options() path above. In fact, if the BIOS way 
> works fine, we don't even need UEFI variables: The same way allocations 
> will be identical during BIOS execution, they should stay identical 
> across UEFI launches.
> 
> As cherry on top, kexec also works seamlessly with the special memblock 
> approach because kexec (at least on x86) hands memblocks as is to the 
> next kernel. So the new kernel will also automatically use the same 
> ranges for its allocations.

I'm all for expanding this. But I would just want to get this in for
now as is. It theoretically works on all architectures. If someone
wants to make it more robust and accurate on a specific architecture,
I'm all for it. Like I said, we could make the reserve_mem() function
global and weak, and then if an architecture has a better way to handle
this, it could use that.

Hmm, x86 could do this with the e820 code like I did in my first
versions. Like I said, it didn't fail at all with that.

And we can have an UEFI version as well.

-- Steve



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Steven Rostedt
On Mon, 17 Jun 2024 09:07:29 +0200
Alexander Graf  wrote:

> Hey Steve,
> 
> 
> I believe we're talking about 2 different things :). Let me rephrase a 
> bit and make a concrete example.
> 
> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command 
> line option. The kernel now comes up and allocates a random chunk of 
> memory that - by (admittedly good) chance - may be at the same physical 
> location as before. Let's assume it deemed 0x100 as a good offset.

Note, it's not random. It picks from the top of available memory every
time. But things can mess with it (see below).

> 
> Let's now assume you're running on a UEFI system. There, you always have 
> non-volatile storage available to you even in the pre-boot phase. That 
> means the kernel could create a UEFI variable that says "12M:4096:trace 
> -> 0x100". The pre-boot phase takes all these UEFI variables and   
> marks them as reserved. When you finally reach your command line parsing 
> logic for reserve_mem=, you can flip all reservations that were not on 
> the command line back to normal memory.
> 
> That way you have pretty much guaranteed persistent memory regions, even 
> with KASLR changing your memory layout across boots.
> 
> The nice thing is that the above is an extension of what you've already 
> built: Systems with UEFI simply get better guarantees that their regions 
> persist.

This could be an added feature, but it is very architecture specific,
and would likely need architecture specific updates.

> 
> 
> >  
> >>  
> >>> Requirement:
> >>>
> >>> Need a way to reserve memory that will be at a consistent location for
> >>> every boot, if the kernel and system are the same. Does not need to work
> >>> if rebooting to a different kernel, or if the system can change the
> >>> memory layout between boots.
> >>>
> >>> The reserved memory can not be an hard coded address, as the same kernel /
> >>> command line needs to run on several different machines. The picked memory
> >>> reservation just needs to be the same for a given machine, but may be  
> >>
> >> With KASLR is enabled, doesn't this approach break too often to be
> >> reliable enough for the data you want to extract?
> >>
> >> Picking up the idea above, with a persistent variable we could even make
> >> KASLR avoid that reserved pstore region in its search for a viable KASLR
> >> offset.  
> > I think I was hit by it once in all my testing. For our use case, the
> > few times it fails to map is not going to affect what we need this for
> > at all.  
> 
> 
> Once is pretty good. Do you know why? Also once out of how many runs? Is 
> the randomness source not as random as it should be or are the number of 
> bits for KASLR maybe so few on your target architecture that the odds of 
> hitting anything become low? Do these same constraints hold true outside 
> of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print
the location of "_text". The loop was this:

  for i in `seq 100`; do
    ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot"
    sleep 25
  done

It searches dmesg for my added printk as well as the print of were the
ring buffer was loaded in physical memory.

It takes about 15 seconds to reboot, so I waited 25. The results are
attached. I found that it was consistent 76 times, which means 1 out of
4 it's not. Funny enough, it broke whenever it loaded the kernel below
0x1. And then it would be off by a little.

It was consistently at:

  0x27d00

And when it failed, it was at 0x27ce0.

Note, when I used the e820 tables to do this, I never saw a failure. My
assumption is that when it is below 0x1, something else gets
allocated causing this to get pushed down.

As this code relies on memblock_phys_alloc() being consistent, if
something gets allocated before it differently depending on where the
kernel is, it can also move the location. A plugin to UEFI would mean
that it would need to reserve the memory, and the code here will need
to know where it is. We could always make the function reserve_mem()
global and weak so that architectures can override it.

-- Steve


text
Description: Binary data


Re: [PATCH v4 1/3] kbuild: add mod(name,file)_flags to assembler flags for module objects

2024-06-14 Thread Steven Rostedt
On Fri, 14 Jun 2024 14:10:58 -0400
Kris Van Hees  wrote:

> On Fri, Jun 14, 2024 at 01:46:51PM -0400, Steven Rostedt wrote:
> > On Fri, 14 Jun 2024 13:14:26 -0400
> > Kris Van Hees  wrote:
> >   
> > > Module objects compiled from C source can be identified by the presence
> > > of -DKBUILD_MODFILE and -DKBUILD_MODNAME on their compile command lines.
> > > However, module objects built from assembler source do not have these defines.
> > > 
> > > Add $(modfile_flags) to modkern_aflags (similar to modkern_cflags), and
> > > add $(modname_flags) to a_flags (similar to c_flags).  
> > 
> > You explain what this does but not why it does it.  
> 
> The first paragraph is meant to estabish the "why" (being able to identify
> what objects are module objects, even if they are compiled from assembler
> source).

Perhaps there's a lack of context. Sure, the cover letter can help in
this regard, but I always look at each commit as a stand alone.

> 
> As I mention, for objects compiled from C source code, those defines being
> present identifies those objects as belonging to a module.  For objects
> compiled from assembler source code, those defines are not present.  Passing
> them on the compile command line for assembler source code files for objects
> that are part of one or more modules allows us to identify all objects that
> are part of modules with a single consistent mechanism.

Sure, but why do we care? Again, if this was the only patch you sent,
it should explain why it is being done.

Perhaps something like: "In order to be able to identify what code is
from a module, even if it is built in, ..."

But what you are saying is just "C code has these flags, make
assembly have them too". Which is meaningless.

The other patches could use some more explanation too.

-- Steve



Re: [PATCH v4 1/3] kbuild: add mod(name,file)_flags to assembler flags for module objects

2024-06-14 Thread Steven Rostedt
On Fri, 14 Jun 2024 13:14:26 -0400
Kris Van Hees  wrote:

> Module objects compiled from C source can be identified by the presence
> of -DKBUILD_MODFILE and -DKBUILD_MODNAME on their compile command lines.
> However, module objects built from assembler source do not have these defines.
> 
> Add $(modfile_flags) to modkern_aflags (similar to modkern_cflags), and
> add $(modname_flags) to a_flags (similar to c_flags).

You explain what this does but not why it does it.

-- Steve



[PATCH v7 1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In order to allow for requesting a memory region that can be used for
things like pstore on multiple machines where the memory layout is not the
same, add a new option to the kernel command line called "reserve_mem".

The format is:  reserve_mem=nn:align:name

Where it will find nn amount of memory at the given alignment of align.
The name field is to allow another subsystem to retrieve where the memory
was found. For example:

  reserve_mem=12M:4096:oops ramoops.mem_name=oops

Where ramoops.mem_name will tell ramoops that memory was reserved for it
via the reserve_mem option and it can find it by calling:

  if (reserve_mem_find_by_name("oops", , )) {
// start holds the start address and size holds the size given

This is typically used for systems that do not wipe the RAM, and this
command line will try to reserve the same physical memory on soft reboots.
Note, it is not guaranteed to be the same location. For example, if KASLR
places the kernel at the location of where the RAM reservation was from a
previous boot, the new reservation will be at a different location.  Any
subsystem using this feature must add a way to verify that the contents of
the physical memory is from a previous boot, as there may be cases where
the memory will not be located at the same location.

Not all systems may work either. There could be bit flips if the reboot
goes through the BIOS. Using kexec to reboot the machine is likely to
have better results in such cases.

Link: https://lore.kernel.org/all/zjjvnzux3nzig...@kernel.org/

Suggested-by: Mike Rapoport 
Tested-by: Guilherme G. Piccoli 
Signed-off-by: Steven Rostedt (Google) 
---
 .../admin-guide/kernel-parameters.txt |  22 
 include/linux/mm.h|   2 +
 mm/memblock.c | 117 ++
 3 files changed, 141 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..56e18b1a520d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5710,6 +5710,28 @@
them.  If  is less than 0x1, the region
is assumed to be I/O ports; otherwise it is memory.
 
+	reserve_mem=	[RAM]
+			Format: nn[KNG]:<align>:<label>
+			Reserve physical memory and label it with a name that
+			other subsystems can use to access it. This is typically
+			used for systems that do not wipe the RAM, and this command
+			line will try to reserve the same physical memory on
+			soft reboots. Note, it is not guaranteed to be the same
+			location. For example, if anything about the system changes
+			or if booting a different kernel. It can also fail if KASLR
+			places the kernel at the location of where the RAM reservation
+			was from a previous boot, the new reservation will be at a
+			different location.
+			Any subsystem using this feature must add a way to verify
+			that the contents of the physical memory is from a previous
+			boot, as there may be cases where the memory will not be
+			located at the same location.
+
+			The format is size:align:label for example, to request
+			12 megabytes of 4096 alignment for ramoops:
+
+			reserve_mem=12M:4096:oops ramoops.mem_name=oops
+
reservetop= [X86-32,EARLY]
Format: nn[KMG]
Reserves a hole at the top of the kernel virtual
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..077fb589b88a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4263,4 +4263,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
 void vma_pgtable_walk_begin(struct vm_area_struct *vma);
 void vma_pgtable_walk_end(struct vm_area_struct *vma);
 
+int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index d09136e040d3..b7b0e8c3868d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2244,6 +2244,123 @@ void __init memblock_free_all(void)
totalram_pages_add(pages);
 }
 
+/* Keep a table to reserve named memory */
+#define RESERVE_MEM_MAX_ENTRIES	8
+#define RESERVE_MEM_NAME_SIZE	16
+struct reserve_mem_table {
+	char			name[RESERVE_MEM_NAME_SIZE];
+   phys_addr_t start;
+   phys_addr_t size;
+};
+static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES];
+static int reserved

[PATCH v7 2/2] pstore/ramoops: Add ramoops.mem_name= command line option

2024-06-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a method to find a region specified by reserve_mem=nn:align:name for
ramoops. Adding a kernel command line parameter:

  reserve_mem=12M:4096:oops ramoops.mem_name=oops

Will use the size and location defined by the reserve_mem parameter where it
finds the memory and labels it "oops". The "oops" in the ramoops option
is used to search for it.

This allows for arbitrary RAM to be used for ramoops if it is known that
the memory is not cleared on kernel crashes or soft reboots.

Tested-by: Guilherme G. Piccoli 
Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/admin-guide/ramoops.rst | 13 +
 fs/pstore/ram.c   | 14 ++
 2 files changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/ramoops.rst 
b/Documentation/admin-guide/ramoops.rst
index e9f85142182d..2eabef31220d 100644
--- a/Documentation/admin-guide/ramoops.rst
+++ b/Documentation/admin-guide/ramoops.rst
@@ -23,6 +23,8 @@ and type of the memory area are set using three variables:
   * ``mem_size`` for the size. The memory size will be rounded down to a
     power of two.
   * ``mem_type`` to specify if the memory type (default is pgprot_writecombine).
+  * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command
+    line parameter.
 
 Typically the default value of ``mem_type=0`` should be used as that sets the pstore
 mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use
@@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners:
return ret;
   }
 
+ D. Using a region of memory reserved via ``reserve_mem`` command line
+parameter. The address and size will be defined by the ``reserve_mem``
+parameter. Note, that ``reserve_mem`` may not always allocate memory
+in the same location, and cannot be relied upon. Testing will need
+to be done, and it may not work on every machine, nor every kernel.
+Consider this a "best effort" approach. The ``reserve_mem`` option
+takes a size, alignment and name as arguments. The name is used
+to map the memory to a label that can be retrieved by ramoops.
+
+   reserve_mem=2M:4096:oops  ramoops.mem_name=oops
+
 You can specify either RAM memory or peripheral devices' memory. However, when
 specifying RAM, be sure to reserve the memory by issuing memblock_reserve()
 very early in the architecture code, e.g.::
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index b1a455f42e93..4311fcbc84f2 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -50,6 +50,10 @@ module_param_hw(mem_address, ullong, other, 0400);
 MODULE_PARM_DESC(mem_address,
"start of reserved RAM used to store oops/panic logs");
 
+static char *mem_name;
+module_param_named(mem_name, mem_name, charp, 0400);
+MODULE_PARM_DESC(mem_name, "name of kernel param that holds addr");
+
 static ulong mem_size;
 module_param(mem_size, ulong, 0400);
 MODULE_PARM_DESC(mem_size,
@@ -914,6 +918,16 @@ static void __init ramoops_register_dummy(void)
 {
struct ramoops_platform_data pdata;
 
+   if (mem_name) {
+   phys_addr_t start;
+   phys_addr_t size;
+
+   if (reserve_mem_find_by_name(mem_name, &start, &size)) {
+   mem_address = start;
+   mem_size = size;
+   }
+   }
+
/*
 * Prepare a dummy platform data structure to carry the module
 * parameters. If mem_size isn't set, then there are no module
-- 
2.43.0





[PATCH v7 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Steven Rostedt
Reserve unspecified location of physical memory from kernel command line

Background:

In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
dmesg output and some other information when a crash happens in the field.
(This is only done when the user selects "Allow Google to collect data for
 improving the system"). But there are cases when there's a bug that
requires more data to be retrieved to figure out what is happening. We would
like to increase the pstore size, either temporarily, or maybe even
permanently. The pstore on these devices are at a fixed location in RAM (as
the RAM is not cleared on soft reboots nor crashes). The location is chosen
by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
There's a driver that queries for this to initialize the pstore for
ChromeOS:

  See drivers/platform/chrome/chromeos_pstore.c

Problem:

The problem is that, even though there's a process to change the kernel on
these systems, and is done regularly to install updates, the firmware is
updated much less frequently. Choosing the place in RAM also takes special
care, and may be in a different address for different boards. Updating the
size via firmware is a large effort and not something that many are willing
to do for a temporary pstore size change.

Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory cannot be a hard-coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be
different for different machines.

Solution:

The solution I have come up with is to introduce a new "reserve_mem=" kernel
command line. This parameter takes the following format:

  reserve_mem=nn:align:label

Where nn is the size of memory to reserve, the align is the alignment of
that memory, and label is the way for other sub-systems to find that memory.
This way the kernel command line could have:

  reserve_mem=12M:4096:oops   ramoops.mem_name=oops

At boot up, the kernel will search for 12 megabytes in usable memory regions
with an alignment of 4096. It will start at the highest regions and work its
way down (for those old devices that want access to lower address DMA). When
it finds a region, it will save it off in a small table and mark it with the
"oops" label. Then the pstore ramoops sub-system could ask for that memory
and location, and it will map itself there.
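As an illustration of that consumer side (my sketch, not part of the series), a subsystem handed the "oops" label could do something like:

	#include <linux/mm.h>		/* reserve_mem_find_by_name() is declared here by patch 1 */
	#include <linux/printk.h>

	static int __init use_reserved_region(void)
	{
		phys_addr_t start, size;

		/* Look up the region labeled by reserve_mem=12M:4096:oops */
		if (!reserve_mem_find_by_name("oops", &start, &size))
			return -ENODEV;	/* label missing or reservation failed */

		pr_info("oops region at %pa, %pa bytes\n", &start, &size);
		/* map/use the [start, start + size) physical range here */
		return 0;
	}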

This prototype allows for 8 different mappings (which may be overkill, 4 is
probably plenty) with 16 byte size to store the label.

I have tested this and it works for us to solve the above problem. We can
update the kernel and command line and increase the size of pstore without
needing to update the firmware, or knowing every memory layout of each
board. I only tested this locally, it has not been tested in the field.

Changes since v6: 
https://lore.kernel.org/all/20240613155506.811013...@goodmis.org/

- Fixed typo of s/reserver/reserve/

Changes since v5: 
https://lore.kernel.org/all/20240613003435.401549...@goodmis.org/


- Stressed more that this is a best effort use case

- Updated ramoops.rst to document this new feature

- Used a new variable "tmp" to use in reserve_mem_find_by_name() instead
  of using "size" and possibly corrupting it.

Changes since v4: 
https://lore.kernel.org/all/20240611215610.548854...@goodmis.org/

- Add all checks about reserve_mem before allocation.
  This means reserved_mem_add() is now a void function.

- Check for name duplications.

- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=")

Changes since v3: 
https://lore.kernel.org/all/20240611144911.327227...@goodmis.org/

- Changed table type of start and size from unsigned long to phys_addr_t
  (as well as the parameters to the functions that use them)

- Changed old reference to "early_reserve_mem" to "reserve_mem"

- Check before reserving memory:
  o Size is non-zero
  o name has text in it

- If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES

- Remove the silly check of testing *p == '\0' after a p += strlen(p)

Changes since v2: 
https://lore.kernel.org/all/20240606150143.876469...@goodmis.org/

- Fixed typo of "reserver"

- Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name()

- Removed "built-in" from module description that was changed from v1.


Changes since v1: 
https://lore.kernel.org/all/2024060320.801075...@goodmis.org/

- Updated the change log of the first patch as well as added an entry
  into kernel-parameters.txt about how reserve_mem is for soft reboots
  and may not be reliable. 


Steven Rostedt (Google) (2):
  mm/memblock: A

Re: [PATCH] linux++: delete some forward declarations

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 14:21:10 -0700
Andrew Morton  wrote:

> On Thu, 13 Jun 2024 16:10:12 -0400 Steven Rostedt  wrote:
> 
> > > And...  I'm a bit surprised that forward declarations are allowed in C.
> > > A billion years ago I used a C compiler which would use 16 bits for
> > > an enum if the enumerated values would fit in 16 bits.  And it would use 32
> > > bits otherwise.  So the enumerated values were *required* for the
> > > compiler to be able to figure out the sizeof.  But it was a billion
> > > years ago.  
> > 
> > Well, I only looked at the one change in ftrace.h which has a
> > "struct seq_file;" that is not used anywhere else in the file, so that
> > one definitely can go.  
> 
> The risk is that something which includes ftrace.h is depending upon
> the enum declaration.

You mean forward struct declaration. And if so, good! it needs to be fixed ;-)

-- Steve



Re: [PATCH] linux++: delete some forward declarations

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 22:22:18 +0300
Alexey Dobriyan  wrote:

> g++ doesn't like forward enum declarations:
> 
>   error: use of enum ‘E’ without previous declaration
>  64 | enum E;

But we don't care about g++. Do we?

I would make that a separate patch.

> 
> Delete those which aren't used.
> 
> Delete some unused/unnecessary forward struct declarations for a change.

This is a clean up, but should have a better change log. Just something
simple like:

  Delete unnecessary forward struct declarations.

Thanks,

-- Steve


> 
> Signed-off-by: Alexey Dobriyan 
> ---
> 
>  fs/ramfs/inode.c |1 -
>  include/linux/console.h  |2 --
>  include/linux/device.h   |3 ---
>  include/linux/ftrace.h   |4 
>  include/linux/security.h |6 --
>  include/linux/signal.h   |2 --
>  include/linux/syscalls.h |7 ---
>  include/linux/sysfs.h|2 --
>  mm/internal.h|4 
>  mm/shmem.c   |1 -
>  10 files changed, 32 deletions(-)
> 
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -51,7 +51,6 @@ struct ramfs_fs_info {
>  
>  #define RAMFS_DEFAULT_MODE   0755
>  
> -static const struct super_operations ramfs_ops;
>  static const struct inode_operations ramfs_dir_inode_operations;
>  
>  struct inode *ramfs_get_inode(struct super_block *sb,
> --- a/include/linux/console.h
> +++ b/include/linux/console.h
> @@ -21,10 +21,8 @@
>  #include 
>  
>  struct vc_data;
> -struct console_font_op;
>  struct console_font;
>  struct module;
> -struct tty_struct;
>  struct notifier_block;
>  
>  enum con_scroll {
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -36,10 +36,7 @@
>  struct device;
>  struct device_private;
>  struct device_driver;
> -struct driver_private;
>  struct module;
> -struct class;
> -struct subsys_private;
>  struct device_node;
>  struct fwnode_handle;
>  struct iommu_group;
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -531,8 +531,6 @@ extern const void *ftrace_expected;
>  
>  void ftrace_bug(int err, struct dyn_ftrace *rec);
>  
> -struct seq_file;
> -
>  extern int ftrace_text_reserved(const void *start, const void *end);
>  
>  struct ftrace_ops *ftrace_ops_trampoline(unsigned long addr);
> @@ -1147,8 +1145,6 @@ static inline void unpause_graph_tracing(void) { }
>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
>  
>  #ifdef CONFIG_TRACING
> -enum ftrace_dump_mode;
> -
>  #define MAX_TRACER_SIZE  100
>  extern char ftrace_dump_on_oops[];
>  extern int ftrace_dump_on_oops_enabled(void);
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -41,7 +41,6 @@ struct rlimit;
>  struct kernel_siginfo;
>  struct sembuf;
>  struct kern_ipc_perm;
> -struct audit_context;
>  struct super_block;
>  struct inode;
>  struct dentry;
> @@ -59,8 +58,6 @@ struct xfrm_sec_ctx;
>  struct mm_struct;
>  struct fs_context;
>  struct fs_parameter;
> -enum fs_value_type;
> -struct watch;
>  struct watch_notification;
>  struct lsm_ctx;
>  
> @@ -183,8 +180,6 @@ struct sock;
>  struct sockaddr;
>  struct socket;
>  struct flowi_common;
> -struct dst_entry;
> -struct xfrm_selector;
>  struct xfrm_policy;
>  struct xfrm_state;
>  struct xfrm_user_sec_ctx;
> @@ -219,7 +214,6 @@ extern unsigned long dac_mmap_min_addr;
>  #define LSM_PRLIMIT_WRITE 2
>  
>  /* forward declares to avoid warnings */
> -struct sched_param;
>  struct request_sock;
>  
>  /* bprm->unsafe reasons */
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -274,8 +274,6 @@ static inline int valid_signal(unsigned long sig)
>   return sig <= _NSIG ? 1 : 0;
>  }
>  
> -struct timespec;
> -struct pt_regs;
>  enum pid_type;
>  
>  extern int next_signal(struct sigpending *pending, sigset_t *mask);
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -11,8 +11,6 @@
>  
>  struct __aio_sigset;
>  struct epoll_event;
> -struct iattr;
> -struct inode;
>  struct iocb;
>  struct io_event;
>  struct iovec;
> @@ -20,14 +18,12 @@ struct __kernel_old_itimerval;
>  struct kexec_segment;
>  struct linux_dirent;
>  struct linux_dirent64;
> -struct list_head;
>  struct mmap_arg_struct;
>  struct msgbuf;
>  struct user_msghdr;
>  struct mmsghdr;
>  struct msqid_ds;
>  struct new_utsname;
> -struct nfsctl_arg;
>  struct __old_kernel_stat;
>  struct oldold_utsname;
>  struct old_utsname;
> @@ -38,7 +34,6 @@ struct rusage;
>  struct sched_param;
>  struct sched_attr;
>  struct sel_arg_struct;
> -struct semaphore;
>  struct sembuf;
>  struct shmid_ds;
>  struct sockaddr;
> @@ -48,14 +43,12 @@ struct statfs;
>  struct statfs64;
>  struct statx;
>  struct sysinfo;
> -struct timespec;
>  struct __kernel_old_timeval;
>  struct __kernel_timex;
>  struct timezone;
>  struct tms;
>  struct utimbuf;
>  struct mq_attr;
> -struct compat_stat;
>  struct old_timeval32;
>  struct robust_list_head;
>  struct futex_waitv;
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -23,9 +23,7 @@
>  #include 
>  

Re: [PATCH] linux++: delete some forward declarations

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 13:04:20 -0700
Andrew Morton  wrote:

> On Thu, 13 Jun 2024 15:34:02 -0400 Steven Rostedt  wrote:
> 
> > On Thu, 13 Jun 2024 22:22:18 +0300
> > Alexey Dobriyan  wrote:
> >   
> > > g++ doesn't like forward enum declarations:
> > > 
> > >   error: use of enum ‘E’ without previous declaration
> > >  64 | enum E;  
> > 
> > But we don't care about g++. Do we?  
> 
> It appears that g++ is a useful enum declaration detector.
> 
> I'm curious to know how even the above warning was generated.  Does g++
> work at all on Linux?
> 
> > I would make that a separate patch.  
> 
> What are you referring to here?

The enum change should be separate from the struct changes.

> 
> > > 
> > > Delete those which aren't used.
> > > 
> > > Delete some unused/unnecessary forward struct declarations for a change.  
> > 
> > This is a clean up, but should have a better change log. Just something
> > simple like:
> > 
> >   Delete unnecessary forward struct declarations.  
> 
> Alexey specializes in cute changelogs.

eh

> 
> I do have a concern about the patch: has it been tested with all
> possible Kconfigs?  No.  There may be some configs in which the forward
> declaration is required.
> 
> And...  I'm a bit surprised that forward declarations are allowed in C.
> A billion years ago I used a C compiler which would use 16 bits for
> an enum if the enumted values would fit in 16 bits.  And it would use 32
> bits otherwise.  So the enumerated values were *required* for the
> compiler to be able to figure out the sizeof.  But it was a billion
> years ago.

Well, I only looked at the one change in ftrace.h which has a
"struct seq_file;" that is not used anywhere else in the file, so that
one definitely can go.

-- Steve



Re: [PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 15:19:50 -0300
"Guilherme G. Piccoli"  wrote:

> > +
> > +   reserver_mem=2M:4096:oops  ramoops.mem_name=oops
> > +  
> 
> Likely this could be fixed on merge, to avoid another version, but...
> 
> s/reserver_mem/reserve_mem

That 'r' is my nemesis! Almost every time I type "reserve" I write it
as "reserver" and have to go back and delete it. :-p

Just to make patchwork work nicely, I'll send a v7. But I'll wait for
other comments or acks and reviews.

-- Steve



[PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Steven Rostedt
Reserve unspecified location of physical memory from kernel command line

Background:

In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
dmesg output and some other information when a crash happens in the field.
(This is only done when the user selects "Allow Google to collect data for
 improving the system"). But there are cases when there's a bug that
requires more data to be retrieved to figure out what is happening. We would
like to increase the pstore size, either temporarily, or maybe even
permanently. The pstore on these devices are at a fixed location in RAM (as
the RAM is not cleared on soft reboots nor crashes). The location is chosen
by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
There's a driver that queries for this to initialize the pstore for
ChromeOS:

  See drivers/platform/chrome/chromeos_pstore.c

Problem:

The problem is that, even though there's a process to change the kernel on
these systems, and is done regularly to install updates, the firmware is
updated much less frequently. Choosing the place in RAM also takes special
care, and may be in a different address for different boards. Updating the
size via firmware is a large effort and not something that many are willing
to do for a temporary pstore size change.

Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory cannot be a hard-coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be
different for different machines.

Solution:

The solution I have come up with is to introduce a new "reserve_mem=" kernel
command line. This parameter takes the following format:

  reserve_mem=nn:align:label

Where nn is the size of memory to reserve, the align is the alignment of
that memory, and label is the way for other sub-systems to find that memory.
This way the kernel command line could have:

  reserve_mem=12M:4096:oops   ramoops.mem_name=oops

At boot up, the kernel will search for 12 megabytes in usable memory regions
with an alignment of 4096. It will start at the highest regions and work its
way down (for those old devices that want access to lower address DMA). When
it finds a region, it will save it off in a small table and mark it with the
"oops" label. Then the pstore ramoops sub-system could ask for that memory
and location, and it will map itself there.

This prototype allows for 8 different mappings (which may be overkill, 4 is
probably plenty) with 16 byte size to store the label.

I have tested this and it works for us to solve the above problem. We can
update the kernel and command line and increase the size of pstore without
needing to update the firmware, or knowing every memory layout of each
board. I only tested this locally, it has not been tested in the field.

Changes since v5: 
https://lore.kernel.org/all/20240613003435.401549...@goodmis.org/

[ patch at bottom showing differences ]

- Stressed more that this is a best effort use case

- Updated ramoops.rst to document this new feature

- Used a new variable "tmp" to use in reserve_mem_find_by_name() instead
  of using "size" and possibly corrupting it.

Changes since v4: 
https://lore.kernel.org/all/20240611215610.548854...@goodmis.org/

- Add all checks about reserve_mem before allocation.
  This means reserved_mem_add() is now a void function.

- Check for name duplications.

- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=")

Changes since v3: 
https://lore.kernel.org/all/20240611144911.327227...@goodmis.org/

- Changed table type of start and size from unsigned long to phys_addr_t
  (as well as the parameters to the functions that use them)

- Changed old reference to "early_reserve_mem" to "reserve_mem"

- Check before reserving memory:
  o Size is non-zero
  o name has text in it

- If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES

- Remove the silly check of testing *p == '\0' after a p += strlen(p)

Changes since v2: 
https://lore.kernel.org/all/20240606150143.876469...@goodmis.org/

- Fixed typo of "reserver"

- Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name()

- Removed "built-in" from module description that was changed from v1.


Changes since v1: 
https://lore.kernel.org/all/2024060320.801075...@goodmis.org/

- Updated the change log of the first patch as well as added an entry
  into kernel-parameters.txt about how reserve_mem is for soft reboots
  and may not be reliable. 


Steven Rostedt (Google) (2):
  mm/memblock: Add "reserve_mem" to reserved named memory at boot up
  pstore/ramoop

Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 18:54:12 +0200
Alexander Graf  wrote:

> 
> Do you have a "real" pstore on these systems that you could store 
> non-volatile variables in, such as persistent UEFI variables? If so, you 
> could create an actually persistent mapping for your trace pstore even 
> across kernel version updates as a general mechanism to create reserved 
> memblocks at fixed offsets.

After implementing all this, I don't think I can use pstore for my
purpose. pstore is a generic interface for persistent storage, and
requires an interface to access it. From what I understand, it's not
the place to just ask for an area of RAM.

For this, I have a single patch that allows the tracing instance to use
an area reserved by reserve_mem.

  reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace

I've already tested this on qemu and a couple of chromebooks. It works
well.

> 
> 
> > Requirement:
> >
> > Need a way to reserve memory that will be at a consistent location for
> > every boot, if the kernel and system are the same. Does not need to work
> > if rebooting to a different kernel, or if the system can change the
> > memory layout between boots.
> >
> > The reserved memory can not be an hard coded address, as the same kernel /
> > command line needs to run on several different machines. The picked memory
> > reservation just needs to be the same for a given machine, but may be  
> 
> 
> With KASLR is enabled, doesn't this approach break too often to be 
> reliable enough for the data you want to extract?
> 
> Picking up the idea above, with a persistent variable we could even make 
> KASLR avoid that reserved pstore region in its search for a viable KASLR 
> offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.

-- Steve



Re: [PATCH v4 01/35] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 17:34:03 +0200
Ilya Leoshkevich  wrote:

> Architectures use assembly code to initialize ftrace_regs and call
> ftrace_ops_list_func(). Therefore, from the KMSAN's point of view,
> ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes
> KMSAN warnings when running the ftrace testsuite.
> 
> Fix by trusting the architecture-specific assembly code and always
> unpoisoning ftrace_regs in ftrace_ops_list_func.
> 
> The issue was not encountered on x86_64 so far only by accident:
> assembly-allocated ftrace_regs was overlapping a stale partially
> unpoisoned stack frame. Poisoning stack frames before returns [1]
> makes the issue appear on x86_64 as well.
> 
> [1] 
> https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/
> 
> Reviewed-by: Alexander Potapenko 
> Signed-off-by: Ilya Leoshkevich 
> ---

Acked-by: Steven Rostedt (Google) 

-- Steve

>  kernel/trace/ftrace.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 65208d3b5ed9..c35ad4362d71 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -7407,6 +7407,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long 
> parent_ip,
>  void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>  struct ftrace_ops *op, struct ftrace_regs *fregs)
>  {
> + kmsan_unpoison_memory(fregs, sizeof(*fregs));
>   __ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
>  }
>  #else




[PATCH v6 1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In order to allow for requesting a memory region that can be used for
things like pstore on multiple machines where the memory layout is not the
same, add a new option to the kernel command line called "reserve_mem".

The format is:  reserve_mem=nn:align:name

Where it will find nn amount of memory at the given alignment of align.
The name field is to allow another subsystem to retrieve where the memory
was found. For example:

  reserve_mem=12M:4096:oops ramoops.mem_name=oops

Where ramoops.mem_name will tell ramoops that memory was reserved for it
via the reserve_mem option and it can find it by calling:

  if (reserve_mem_find_by_name("oops", , )) {
// start holds the start address and size holds the size given

This is typically used for systems that do not wipe the RAM, and this
command line will try to reserve the same physical memory on soft reboots.
Note, it is not guaranteed to be the same location. For example, if KASLR
places the kernel at the location of where the RAM reservation was from a
previous boot, the new reservation will be at a different location.  Any
subsystem using this feature must add a way to verify that the contents of
the physical memory is from a previous boot, as there may be cases where
the memory will not be located at the same location.

Not all systems may work either. There could be bit flips if the reboot
goes through the BIOS. Using kexec to reboot the machine is likely to
have better results in such cases.
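
As a rough sketch of a consumer (the init function and the fallback behavior
below are only illustrative, not part of this patch):

  static phys_addr_t oops_start, oops_size;

  static int __init oops_mem_init(void)
  {
	/* Look up the region reserved by reserve_mem=nn:align:oops */
	if (!reserve_mem_find_by_name("oops", &oops_start, &oops_size))
		return -ENODEV;	/* nothing was reserved on this boot */

	pr_info("oops region at %pa, size %pa\n", &oops_start, &oops_size);
	return 0;
  }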

Link: https://lore.kernel.org/all/zjjvnzux3nzig...@kernel.org/

Suggested-by: Mike Rapoport 
Tested-by: Guilherme G. Piccoli 
Signed-off-by: Steven Rostedt (Google) 
---
 .../admin-guide/kernel-parameters.txt |  22 
 include/linux/mm.h|   2 +
 mm/memblock.c | 117 ++
 3 files changed, 141 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..56e18b1a520d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5710,6 +5710,28 @@
them.  If  is less than 0x1, the region
is assumed to be I/O ports; otherwise it is memory.
 
+   reserve_mem=[RAM]
+   Format: nn[KNG]::
+   Reserve physical memory and label it with a name that
+   other subsystems can use to access it. This is typically
+   used for systems that do not wipe the RAM, and this 
command
+   line will try to reserve the same physical memory on
+   soft reboots. Note, it is not guaranteed to be the same
+   location. For example, if anything about the system 
changes
+   or if booting a different kernel. It can also fail if 
KASLR
+   places the kernel at the location of where the RAM 
reservation
+   was from a previous boot, the new reservation will be 
at a
+   different location.
+   Any subsystem using this feature must add a way to 
verify
+   that the contents of the physical memory is from a 
previous
+   boot, as there may be cases where the memory will not be
+   located at the same location.
+
+   The format is size:align:label for example, to request
+   12 megabytes of 4096 alignment for ramoops:
+
+   reserve_mem=12M:4096:oops ramoops.mem_name=oops
+
reservetop= [X86-32,EARLY]
Format: nn[KMG]
Reserves a hole at the top of the kernel virtual
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..077fb589b88a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4263,4 +4263,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long 
pfn)
 void vma_pgtable_walk_begin(struct vm_area_struct *vma);
 void vma_pgtable_walk_end(struct vm_area_struct *vma);
 
+int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t 
*size);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index d09136e040d3..b7b0e8c3868d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2244,6 +2244,123 @@ void __init memblock_free_all(void)
totalram_pages_add(pages);
 }
 
+/* Keep a table to reserve named memory */
+#define RESERVE_MEM_MAX_ENTRIES8
+#define RESERVE_MEM_NAME_SIZE  16
+struct reserve_mem_table {
+   charname[RESERVE_MEM_NAME_SIZE];
+   phys_addr_t start;
+   phys_addr_t size;
+};
+static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES];
+static int reserved

[PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option

2024-06-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a method to find a region specified by reserve_mem=nn:align:name for
ramoops. Adding a kernel command line parameter:

  reserve_mem=12M:4096:oops ramoops.mem_name=oops

Will use the size and location defined by the memmap parameter where it
finds the memory and labels it "oops". The "oops" in the ramoops option
is used to search for it.

This allows for arbitrary RAM to be used for ramoops if it is known that
the memory is not cleared on kernel crashes or soft reboots.

Tested-by: Guilherme G. Piccoli 
Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/admin-guide/ramoops.rst | 13 +
 fs/pstore/ram.c   | 14 ++
 2 files changed, 27 insertions(+)

diff --git a/Documentation/admin-guide/ramoops.rst 
b/Documentation/admin-guide/ramoops.rst
index e9f85142182d..6f534a707b2a 100644
--- a/Documentation/admin-guide/ramoops.rst
+++ b/Documentation/admin-guide/ramoops.rst
@@ -23,6 +23,8 @@ and type of the memory area are set using three variables:
   * ``mem_size`` for the size. The memory size will be rounded down to a
 power of two.
   * ``mem_type`` to specify if the memory type (default is 
pgprot_writecombine).
+  * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command
+line parameter.
 
 Typically the default value of ``mem_type=0`` should be used as that sets the 
pstore
 mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use
@@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several 
different manners:
return ret;
   }
 
+ D. Using a region of memory reserved via ``reserve_mem`` command line
+parameter. The address and size will be defined by the ``reserve_mem``
+parameter. Note, that ``reserve_mem`` may not always allocate memory
+in the same location, and cannot be relied upon. Testing will need
+to be done, and it may not work on every machine, nor every kernel.
+Consider this a "best effort" approach. The ``reserve_mem`` option
+takes a size, alignment and name as arguments. The name is used
+to map the memory to a label that can be retrieved by ramoops.
+
+   reserve_mem=2M:4096:oops  ramoops.mem_name=oops
+
 You can specify either RAM memory or peripheral devices' memory. However, when
 specifying RAM, be sure to reserve the memory by issuing memblock_reserve()
 very early in the architecture code, e.g.::
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index b1a455f42e93..4311fcbc84f2 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -50,6 +50,10 @@ module_param_hw(mem_address, ullong, other, 0400);
 MODULE_PARM_DESC(mem_address,
"start of reserved RAM used to store oops/panic logs");
 
+static char *mem_name;
+module_param_named(mem_name, mem_name, charp, 0400);
+MODULE_PARM_DESC(mem_name, "name of kernel param that holds addr");
+
 static ulong mem_size;
 module_param(mem_size, ulong, 0400);
 MODULE_PARM_DESC(mem_size,
@@ -914,6 +918,16 @@ static void __init ramoops_register_dummy(void)
 {
struct ramoops_platform_data pdata;
 
+   if (mem_name) {
+   phys_addr_t start;
+   phys_addr_t size;
+
+   if (reserve_mem_find_by_name(mem_name, &start, &size)) {
+   mem_address = start;
+   mem_size = size;
+   }
+   }
+
/*
 * Prepare a dummy platform data structure to carry the module
 * parameters. If mem_size isn't set, then there are no module
-- 
2.43.0





[PATCH] function_graph: Add READ_ONCE() when accessing fgraph_array[]

2024-06-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In function_graph_enter() there's a loop that looks at fgraph_array[]
elements which are fgraph_ops. It first tests if it is a fgraph_stub op,
and if so skips it, as that's just there as a place holder. Then it checks
the fgraph_ops filters to see if the ops wants to trace the current
function.

But if the compiler reloads the fgraph_array[] after the check against
fgraph_stub, it could race with the fgraph_array[] being updated with the
fgraph_stub. That would cause the stub to be processed. But the stub has a
null "func_hash" field which will cause a NULL pointer dereference.

Add a READ_ONCE() so that the gops that is compared against the
fgraph_stub is also the gops that is processed later.
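
A simplified illustration of the pattern (do_trace() here stands in for the
real handling of the gops filters and handlers):

	struct fgraph_ops *gops = fgraph_array[i];

	if (gops == &fgraph_stub)
		continue;
	/*
	 * Without READ_ONCE(), the compiler may re-read fgraph_array[i]
	 * below instead of reusing the value checked above.  If the entry
	 * was just replaced with &fgraph_stub, the stub (whose func_hash
	 * is NULL) is what gets dereferenced.
	 */
	do_trace(gops);

Using READ_ONCE(fgraph_array[i]) for the initial load forces a single read,
so the pointer that passed the stub check is the same pointer that is
processed afterwards.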

Link: 
https://lore.kernel.org/all/CA+G9fYsSVJQZH=nM=1cjtc94pgsnmf9y65bnov6xsocg_b6...@mail.gmail.com/

Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize 
entry function")
Reported-by: Naresh Kamboju 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/fgraph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 8317d1a7f43a..fc205ad167a9 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -641,7 +641,7 @@ int function_graph_enter(unsigned long ret, unsigned long 
func,
{
	for_each_set_bit(i, &fgraph_array_bitmask,
 sizeof(fgraph_array_bitmask) * 
BITS_PER_BYTE) {
-   struct fgraph_ops *gops = fgraph_array[i];
+   struct fgraph_ops *gops = READ_ONCE(fgraph_array[i]);
int save_curr_ret_stack;
 
		if (gops == &fgraph_stub)
-- 
2.43.0




Re: [PATCH 2/8] tracing: do not trace kernel_text_address()

2024-06-13 Thread Steven Rostedt
On Thu, 13 Jun 2024 15:11:07 +0800
Andy Chiu  wrote:

> kernel_text_address() and __kernel_text_address() are called in
> arch_stack_walk() of riscv. This results in excess amount of un-related
> traces when the kernel is compiled with CONFIG_TRACE_IRQFLAGS. The
> situation worsens when function_graph is active, as it calls
> local_irq_save/restore in each function's entry/exit. This patch adds
> both functions to notrace, so they won't show up on the trace records.

I rather not add notrace just because something is noisy.

You can always just add:

 echo '*kernel_text_address' > /sys/kernel/tracing/set_ftrace_notrace

and achieve the same result.

-- Steve



[PATCH v6 13/13] tracing: Add last boot delta offset for stack traces

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The addresses of a stack trace event are relative to the kallsyms. As that
can change between boots, when printing the stack trace from a buffer that
was from the last boot, it needs all the addresses to be added to the
"text_delta" that gives the delta between the addresses of the functions
for the current boot compared to the address of the last boot. Then it can
be passed to kallsyms to find the function name, otherwise it just shows a
useless list of addresses.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index b9d2c64c0648..48de93598897 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1233,6 +1233,7 @@ static enum print_line_t trace_stack_print(struct 
trace_iterator *iter,
	struct trace_seq *s = &iter->seq;
unsigned long *p;
unsigned long *end;
+   long delta = iter->tr->text_delta;
 
trace_assign_type(field, iter->ent);
end = (unsigned long *)((long)iter->ent + iter->ent_size);
@@ -1245,7 +1246,7 @@ static enum print_line_t trace_stack_print(struct 
trace_iterator *iter,
break;
 
trace_seq_puts(s, " => ");
-   seq_print_ip_sym(s, *p, flags);
+   seq_print_ip_sym(s, (*p) + delta, flags);
trace_seq_putc(s, '\n');
}
 
-- 
2.43.0





[PATCH v6 12/13] tracing: Update function tracing output for previous boot buffer

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of the saved function of where it is located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH v6 11/13] tracing: Handle old buffer mappings for event strings and functions

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Use the saved text_delta and data_delta of a persistent memory mapped ring
buffer that was saved from a previous boot, and use the delta in the trace
event print output so that strings and functions show up normally.

That is, for an event like trace_kmalloc() that prints the callsite via
"%pS", if it used the address saved in the ring buffer it will not match
the function that was saved in the previous boot if the kernel remaps
itself between boots.

For RCU events that point to saved static strings where only the address
of the string is saved in the ring buffer, it too will be adjusted to
point to where the string is on the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dc4eee33d920..71cca10581d6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3671,8 +3671,11 @@ static void test_can_verify(void)
 void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
 va_list ap)
 {
+   long text_delta = iter->tr->text_delta;
+   long data_delta = iter->tr->data_delta;
const char *p = fmt;
const char *str;
+   bool good;
int i, j;
 
if (WARN_ON_ONCE(!fmt))
@@ -3691,7 +3694,10 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 
j = 0;
 
-   /* We only care about %s and variants */
+   /*
+* We only care about %s and variants
+* as well as %p[sS] if delta is non-zero
+*/
for (i = 0; p[i]; i++) {
if (i + 1 >= iter->fmt_size) {
/*
@@ -3720,6 +3726,11 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
}
if (p[i+j] == 's')
break;
+
+   if (text_delta && p[i+1] == 'p' &&
+   ((p[i+2] == 's' || p[i+2] == 'S')))
+   break;
+
star = false;
}
j = 0;
@@ -3733,6 +3744,24 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
iter->fmt[i] = '\0';
		trace_seq_vprintf(&iter->seq, iter->fmt, ap);
 
+   /* Add delta to %pS pointers */
+   if (p[i+1] == 'p') {
+   unsigned long addr;
+   char fmt[4];
+
+   fmt[0] = '%';
+   fmt[1] = 'p';
+   fmt[2] = p[i+2]; /* Either %ps or %pS */
+   fmt[3] = '\0';
+
+   addr = va_arg(ap, unsigned long);
+   addr += text_delta;
+   trace_seq_printf(&iter->seq, fmt, (void *)addr);
+
+   p += i + 3;
+   continue;
+   }
+
/*
 * If iter->seq is full, the above call no longer guarantees
 * that ap is in sync with fmt processing, and further calls
@@ -3751,6 +3780,14 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
/* The ap now points to the string data of the %s */
str = va_arg(ap, const char *);
 
+   good = trace_safe_str(iter, str, star, len);
+
+   /* Could be from the last boot */
+   if (data_delta && !good) {
+   str += data_delta;
+   good = trace_safe_str(iter, str, star, len);
+   }
+
/*
 * If you hit this warning, it is likely that the
 * trace event in question used %s on a string that
@@ -3760,8 +3797,7 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 * instead. See samples/trace_events/trace-events-sample.h
 * for reference.
 */
-   if (WARN_ONCE(!trace_safe_str(iter, str, star, len),
- "fmt: '%s' current_buffer: '%s'",
+   if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
  fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
 
-- 
2.43.0





[PATCH v6 10/13] tracing/ring-buffer: Add last_boot_info file to boot instance

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If an instance is mapped to memory on boot up, create a new file called
"last_boot_info" that will hold information that can be used to properly
parse the raw data in the ring buffer.

It will export the delta of the addresses for text and data from what it
was from the last boot. It does not expose actual addresses (unless you
knew what the actual address was from the last boot).

The output will look like:

 # cat last_boot_info
 text delta:-268435456
 data delta:-268435456

The text and data are kept separate in case they are ever made different.
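
For example, a post-processing tool could read the file and add the delta to
the saved addresses before doing a kallsyms lookup (a rough user space
sketch; the instance path and the parsing are simplified):

  #include <stdio.h>

  int main(void)
  {
	const char *path =
		"/sys/kernel/tracing/instances/boot_mapped/last_boot_info";
	long text_delta, data_delta;
	FILE *f = fopen(path, "r");

	if (!f)
		return 1;
	if (fscanf(f, "text delta: %ld data delta: %ld",
		   &text_delta, &data_delta) != 2) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* current address = address saved in the old buffer + text_delta */
	printf("text delta %ld, data delta %ld\n", text_delta, data_delta);
	return 0;
  }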

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  3 +++
 kernel/trace/ring_buffer.c  | 23 ++
 kernel/trace/trace.c| 47 -
 kernel/trace/trace.h|  2 ++
 4 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index a50b0223b1d3..55de3798a9b9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -94,6 +94,9 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long 
size, unsigned flag
   unsigned long range_size,
   struct lock_class_key *key);
 
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8c287430411d..f3d772461a60 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2396,6 +2396,29 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned 
long size, unsigned flag
return alloc_buffer(size, flags, order, start, start + range_size, key);
 }
 
+/**
+ * ring_buffer_last_boot_delta - return the delta offset from last boot
+ * @buffer: The buffer to return the delta from
+ * @text: Return text delta
+ * @data: Return data delta
+ *
+ * Returns: true if the delta is non zero
+ */
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data)
+{
+   if (!buffer)
+   return false;
+
+   if (!buffer->last_text_delta)
+   return false;
+
+   *text = buffer->last_text_delta;
+   *data = buffer->last_data_delta;
+
+   return true;
+}
+
 /**
  * ring_buffer_free - free a ring buffer.
  * @buffer: the buffer to free.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dfde26aa3211..dc4eee33d920 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6041,6 +6041,18 @@ ssize_t tracing_resize_ring_buffer(struct trace_array 
*tr,
return ret;
 }
 
+static void update_last_data(struct trace_array *tr)
+{
+   if (!tr->text_delta && !tr->data_delta)
+   return;
+
+   /* Clear old data */
+   tracing_reset_online_cpus(&tr->array_buffer);
+
+   /* Using current data now */
+   tr->text_delta = 0;
+   tr->data_delta = 0;
+}
 
 /**
  * tracing_update_buffers - used by tracing facility to expand ring buffers
@@ -6058,6 +6070,9 @@ int tracing_update_buffers(struct trace_array *tr)
int ret = 0;
 
	mutex_lock(&trace_types_lock);
+
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded)
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6113,6 +6128,8 @@ int tracing_set_tracer(struct trace_array *tr, const char 
*buf)
 
	mutex_lock(&trace_types_lock);
 
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded) {
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6860,6 +6877,21 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static ssize_t
+tracing_last_boot_read(struct file *filp, char __user *ubuf, size_t cnt, 
loff_t *ppos)
+{
+   struct trace_array *tr = filp->private_data;
+   struct seq_buf seq;
+   char buf[64];
+
+   seq_buf_init(&seq, buf, 64);
+
+   seq_buf_printf(&seq, "text delta:\t%ld\n", tr->text_delta);
+   seq_buf_printf(&seq, "data delta:\t%ld\n", tr->data_delta);
+
+   return simple_read_from_buffer(ubuf, cnt, ppos, buf, 
seq_buf_used(&seq));
+}
+
 static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
 {
struct trace_array *tr = inode->i_private;
@@ -7499,6 +7531,13 @@ static const struct file_operations 
trace_time_stamp_mode_fops = {
.release= tracing_single_release_tr,
 };
 
+static const s

[PATCH v6 08/13] tracing: Add option to use memmapped memory for trace boot instance

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add an option to the trace_instance kernel command line parameter that
allows it to use the reserved memory from memmap boot parameter.

  memmap=12M$0x28450 trace_instance=boot_mapped@0x28450:12M

The above will reserve 12 megs at the physical address 0x28450.
The second parameter will create a "boot_mapped" instance and use the
memory reserved as the memory for the ring buffer.

That will create an instance called "boot_mapped":

  /sys/kernel/tracing/instances/boot_mapped

Note, because the ring buffer is using a defined memory range, it will
act just like a memory mapped ring buffer. It will not have a snapshot
buffer, as it can't swap out the buffer. The snapshot files as well as any
tracers that uses a snapshot will not be present in the boot_mapped
instance.

Cc: linux...@kvack.org
Signed-off-by: Steven Rostedt (Google) 
---
 .../admin-guide/kernel-parameters.txt |  9 +++
 kernel/trace/trace.c  | 75 +--
 2 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..ff26b6094e79 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6754,6 +6754,15 @@
the same thing would happen if it was left off). The 
irq_handler_entry
event, and all events under the "initcall" system.
 
+   If memory has been reserved (see memmap for x86), the 
instance
+   can use that memory:
+
+   memmap=12M$0x28450 
trace_instance=boot_map@0x28450:12M
+
+   The above will create a "boot_map" instance that uses 
the physical
+   memory at 0x28450 that is 12Megs. The per CPU 
buffers of that
+   instance will be split up accordingly.
+
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 622fe670949d..dfde26aa3211 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9504,6 +9504,31 @@ static int instance_mkdir(const char *name)
return ret;
 }
 
+static u64 map_pages(u64 start, u64 size)
+{
+   struct page **pages;
+   phys_addr_t page_start;
+   unsigned int page_count;
+   unsigned int i;
+   void *vaddr;
+
+   page_count = DIV_ROUND_UP(size, PAGE_SIZE);
+
+   page_start = start;
+   pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
+   if (!pages)
+   return 0;
+
+   for (i = 0; i < page_count; i++) {
+   phys_addr_t addr = page_start + i * PAGE_SIZE;
+   pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
+   }
+   vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
+   kfree(pages);
+
+   return (u64)(unsigned long)vaddr;
+}
+
 /**
  * trace_array_get_by_name - Create/Lookup a trace array, given its name.
  * @name: The name of the trace array to be looked up/created.
@@ -10350,6 +10375,7 @@ __init static void enable_instances(void)
 {
struct trace_array *tr;
char *curr_str;
+   char *name;
char *str;
char *tok;
 
@@ -10358,19 +10384,56 @@ __init static void enable_instances(void)
str = boot_instance_info;
 
while ((curr_str = strsep(, "\t"))) {
+   unsigned long start = 0;
+   unsigned long size = 0;
+   unsigned long addr = 0;
 
tok = strsep(_str, ",");
+   name = strsep(, "@");
+   if (tok) {
+   start = memparse(tok, );
+   if (!start) {
+   pr_warn("Tracing: Invalid boot instance address 
for %s\n",
+   name);
+   continue;
+   }
+   }
 
-   if (IS_ENABLED(CONFIG_TRACER_MAX_TRACE))
-   do_allocate_snapshot(tok);
+   if (start) {
+   if (*tok != ':') {
+   pr_warn("Tracing: No size specified for 
instance %s\n", name);
+   continue;
+   }
+   tok++;
+   size = memparse(tok, &tok);
+   if (!size) {
+   pr_warn("Tracing: Invalid boot instance size 
for %s\n",
+   name);
+   continue;
+   }
+   addr = map_pages(star

[PATCH v6 09/13] ring-buffer: Save text and data locations in mapped meta data

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When a ring buffer is mapped to a specific address, save the address of a
text function and some data. This will be used to determine the delta
between the last boot and the current boot for pointers to functions as
well as to data.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 804dfbdeef84..8c287430411d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,8 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   text_addr;
+   unsigned long   data_addr;
unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
@@ -542,6 +544,9 @@ struct trace_buffer {
unsigned long   range_addr_start;
unsigned long   range_addr_end;
 
+   longlast_text_delta;
+   longlast_data_delta;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1821,10 +1826,15 @@ static void rb_meta_validate_events(struct 
ring_buffer_per_cpu *cpu_buffer)
}
 }
 
+/* Used to calculate data delta */
+static char rb_data_ptr[] = "";
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
unsigned long delta;
+   unsigned long this_text = (unsigned long)rb_range_meta_init;
+   unsigned long this_data = (unsigned long)rb_data_ptr;
void *subbuf;
int cpu;
int i;
@@ -1841,6 +1851,10 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
meta->first_buffer += delta;
meta->head_buffer += delta;
meta->commit_buffer += delta;
+   buffer->last_text_delta = this_text - meta->text_addr;
+   buffer->last_data_delta = this_data - meta->data_addr;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
continue;
}
 
@@ -1857,6 +1871,8 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
subbuf = rb_subbufs_from_meta(meta);
 
meta->first_buffer = (unsigned long)subbuf;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
 
/*
 * The buffers[] array holds the order of the sub-buffers
-- 
2.43.0





[PATCH v6 06/13] ring-buffer: Add test if range of boot buffer is valid

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a test against the ring buffer memory range to see if it has valid
data. The ring_buffer_meta structure is given a new field called
"first_buffer" which holds the address of the first sub-buffer. This is
used to both determine if the other fields are valid as well as finding
the offset between the old addresses of the sub-buffer from the previous
boot to the new addresses of the current boot.

Since the values for nr_subbufs and subbuf_size are to be the same, check
if the values in the meta page match the values calculated.

Take the range of the first_buffer and the total size of all the buffers
and make sure the saved head_buffer and commit_buffer fall in the range.

Iterate through all the sub-buffers to make sure that the values in the
sub-buffer "commit" field (the field that holds the amount of data on the
sub-buffer) is within the end of the sub-buffer. Also check the index
array to make sure that all the indexes are within nr_subbufs.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 143 ++---
 1 file changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25b0e61e8c76..588bc057bad7 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,7 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
__u32   subbuf_size;
@@ -1618,21 +1619,103 @@ static void *rb_range_buffer(struct 
ring_buffer_per_cpu *cpu_buffer, int idx)
return (void *)ptr;
 }
 
+/*
+ * See if the existing memory contains valid ring buffer data.
+ * As the previous kernel must be the same as this kernel, all
+ * the calculations (size of buffers and number of buffers)
+ * must be the same.
+ */
+static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
+ struct trace_buffer *buffer, int nr_pages)
+{
+   int subbuf_size = PAGE_SIZE;
+   struct buffer_data_page *subbuf;
+   unsigned long buffers_start;
+   unsigned long buffers_end;
+   int i;
+
+   /* The subbuffer's size and number of subbuffers must match */
+   if (meta->subbuf_size != subbuf_size ||
+   meta->nr_subbufs != nr_pages + 1) {
+   pr_info("Ring buffer boot meta [%d] mismatch of 
subbuf_size/nr_pages\n", cpu);
+   return false;
+   }
+
+   buffers_start = meta->first_buffer;
+   buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
+
+   /* Is the head and commit buffers within the range of buffers? */
+   if (meta->head_buffer < buffers_start ||
+   meta->head_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] head buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   if (meta->commit_buffer < buffers_start ||
+   meta->commit_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] commit buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   subbuf = rb_subbufs_from_meta(meta);
+
+   /* Do the meta buffers and the subbufs themselves have correct data? */
+   for (i = 0; i < meta->nr_subbufs; i++) {
+   if (meta->buffers[i] < 0 ||
+   meta->buffers[i] >= meta->nr_subbufs) {
+   pr_info("Ring buffer boot meta [%d] array out of 
range\n", cpu);
+   return false;
+   }
+
+   if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
+   pr_info("Ring buffer boot meta [%d] buffer invalid 
commit\n", cpu);
+   return false;
+   }
+
+   subbuf = (void *)subbuf + subbuf_size;
+   }
+
+   pr_info("Ring buffer meta is from previous boot!\n");
+   return true;
+}
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
+   unsigned long delta;
void *subbuf;
int cpu;
int i;
 
for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
+   void *next_meta;
+
meta = rb_range_meta(buffer, nr_pages, cpu);
 
+   if (rb_meta_valid(meta, cpu, buffer, nr_pages)) {
+   /* Make the mappings match the current address */
+   subbuf = rb_subbufs_from_meta(meta);
+   delta = (unsigned long)subbuf - meta->first_buffer;
+   meta->first_buffer += delta;
+   meta->head_buffer += delta;
+   meta->commit_buffer += delta;

[PATCH v6 07/13] ring-buffer: Validate boot range memory events

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Make sure all the events in each of the sub-buffers that were mapped in a
memory region are valid. This moves the code that walks the buffers for
time-stamp validation out of the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS
ifdef block and is used to validate the content. Only the ring buffer
event meta data and time stamps are checked and not the data load.

This also has a second purpose. The buffer_page structure that points to
the data sub-buffers has accounting that keeps track of the number of
events that are on the sub-buffer. This updates that counter as well. That
counter is used in reading the buffer and knowing if the ring buffer is
empty or not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 190 +
 1 file changed, 152 insertions(+), 38 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 588bc057bad7..804dfbdeef84 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1675,10 +1675,152 @@ static bool rb_meta_valid(struct ring_buffer_meta 
*meta, int cpu,
subbuf = (void *)subbuf + subbuf_size;
}
 
-   pr_info("Ring buffer meta is from previous boot!\n");
return true;
 }
 
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int 
cpu,
+  unsigned long long *timestamp, u64 *delta_ptr)
+{
+   struct ring_buffer_event *event;
+   u64 ts, delta;
+   int events = 0;
+   int e;
+
+   *delta_ptr = 0;
+   *timestamp = 0;
+
+   ts = dpage->time_stamp;
+
+   for (e = 0; e < tail; e += rb_event_length(event)) {
+
+   event = (struct ring_buffer_event *)(dpage->data + e);
+
+   switch (event->type_len) {
+
+   case RINGBUF_TYPE_TIME_EXTEND:
+   delta = rb_event_time_stamp(event);
+   ts += delta;
+   break;
+
+   case RINGBUF_TYPE_TIME_STAMP:
+   delta = rb_event_time_stamp(event);
+   delta = rb_fix_abs_ts(delta, ts);
+   if (delta < ts) {
+   *delta_ptr = delta;
+   *timestamp = ts;
+   return -1;
+   }
+   ts = delta;
+   break;
+
+   case RINGBUF_TYPE_PADDING:
+   if (event->time_delta == 1)
+   break;
+   fallthrough;
+   case RINGBUF_TYPE_DATA:
+   events++;
+   ts += event->time_delta;
+   break;
+
+   default:
+   return -1;
+   }
+   }
+   *timestamp = ts;
+   return events;
+}
+
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+{
+   unsigned long long ts;
+   u64 delta;
+   int tail;
+
+   tail = local_read(&dpage->commit);
+   return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+}
+
+/* If the meta data has been validated, now validate the events */
+static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   struct buffer_page *head_page;
+   unsigned long entry_bytes = 0;
+   unsigned long entries = 0;
+   int ret;
+   int i;
+
+   if (!meta || !meta->head_buffer)
+   return;
+
+   /* Do the reader page first */
+   ret = rb_validate_buffer(cpu_buffer->reader_page->page, 
cpu_buffer->cpu);
+   if (ret < 0) {
+   pr_info("Ring buffer reader page is invalid\n");
+   goto invalid;
+   }
+   entries += ret;
+   entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
+   local_set(&cpu_buffer->reader_page->entries, ret);
+
+   head_page = cpu_buffer->head_page;
+
+   /* If both the head and commit are on the reader_page then we are done. 
*/
+   if (head_page == cpu_buffer->reader_page &&
+   head_page == cpu_buffer->commit_page)
+   goto done;
+
+   /* Iterate until finding the commit page */
+   for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
+
+   /* Reader page has already been done */
+   if (head_page == cpu_buffer->reader_page)
+   continue;
+
+   ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+   if (ret < 0) {
+   pr_info("Ring buffer meta [%d] invalid buffer page\n",
+   cpu_buffer->cpu);
+ 

[PATCH v6 05/13] ring-buffer: Add output of ring buffer meta page

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a buffer_meta per-cpu file for the trace instance that is mapped to
boot memory. This shows the current meta-data and can be used by user
space tools to record off the current mappings to help reconstruct the
ring buffer after a reboot.

It does not expose any virtual addresses, just indexes into the sub-buffer
pages.
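
The output looks something like this (the path and values below are made up
for illustration; the indexes refer to sub-buffers within the mapped range):

  # cat /sys/kernel/tracing/instances/boot_mapped/per_cpu/cpu0/buffer_meta
  head_buffer:   5
  commit_buffer: 9
  subbuf_size:   4096
  nr_subbufs:    12
  buffer[0]:     11
  buffer[1]:     3
  ...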

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 77 ++
 kernel/trace/trace.c   | 30 ++-
 kernel/trace/trace.h   |  2 +
 3 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 0e3ed7f1cc5d..25b0e61e8c76 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -32,6 +32,8 @@
 #include 
 #include 
 
+#include "trace.h"
+
 /*
  * The "absolute" timestamp in the buffer is only 59 bits.
  * If a clock has the 5 MSBs set, it needs to be saved and
@@ -1647,6 +1649,81 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
}
 }
 
+static void *rbm_start(struct seq_file *m, loff_t *pos)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val;
+
+   if (!meta)
+   return NULL;
+
+   if (*pos > meta->nr_subbufs)
+   return NULL;
+
+   val = *pos;
+   val++;
+
+   return (void *)val;
+}
+
+static void *rbm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   (*pos)++;
+
+   return rbm_start(m, pos);
+}
+
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rbm_show(struct seq_file *m, void *v)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val = (unsigned long)v;
+
+   if (val == 1) {
+   seq_printf(m, "head_buffer:   %d\n",
+  rb_meta_subbuf_idx(meta, (void *)meta->head_buffer));
+   seq_printf(m, "commit_buffer: %d\n",
+  rb_meta_subbuf_idx(meta, (void 
*)meta->commit_buffer));
+   seq_printf(m, "subbuf_size:   %d\n", meta->subbuf_size);
+   seq_printf(m, "nr_subbufs:%d\n", meta->nr_subbufs);
+   return 0;
+   }
+
+   val -= 2;
+   seq_printf(m, "buffer[%ld]:%d\n", val, meta->buffers[val]);
+
+   return 0;
+}
+
+static void rbm_stop(struct seq_file *m, void *p)
+{
+}
+
+static const struct seq_operations rb_meta_seq_ops = {
+   .start  = rbm_start,
+   .next   = rbm_next,
+   .show   = rbm_show,
+   .stop   = rbm_stop,
+};
+
+int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, 
int cpu)
+{
+   struct seq_file *m;
+   int ret;
+
+   ret = seq_open(file, &rb_meta_seq_ops);
+   if (ret)
+   return ret;
+
+   m = file->private_data;
+   m->private = buffer->buffers[cpu];
+
+   return 0;
+}
+
 static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
long nr_pages, struct list_head *pages)
 {
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ff2b504fbe00..622fe670949d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5018,7 +5018,7 @@ static int show_traces_open(struct inode *inode, struct 
file *file)
return 0;
 }
 
-static int show_traces_release(struct inode *inode, struct file *file)
+static int tracing_seq_release(struct inode *inode, struct file *file)
 {
struct trace_array *tr = inode->i_private;
 
@@ -5059,7 +5059,7 @@ static const struct file_operations show_traces_fops = {
.open   = show_traces_open,
.read   = seq_read,
.llseek = seq_lseek,
-   .release= show_traces_release,
+   .release= tracing_seq_release,
 };
 
 static ssize_t
@@ -6860,6 +6860,22 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
+{
+   struct trace_array *tr = inode->i_private;
+   int cpu = tracing_get_cpu(inode);
+   int ret;
+
+   ret = tracing_check_open_get_tr(tr);
+   if (ret)
+   return ret;
+
+   ret = ring_buffer_meta_seq_init(filp, tr->array_buffer.buffer, cpu);
+   if (ret < 0)
+   __trace_array_put(tr);
+   return ret;
+}
+
 static ssize_t
 tracing_free_buffer_write(struct file *filp, const char __user *ubuf,
  size_t cnt, loff_t *ppos)
@@ -7436,6 +7452,13 @@ static const struct file_operations tracing_entries_fops 
= {
 

[PATCH v6 03/13] ring-buffer: Add ring_buffer_meta data

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Populate the ring_buffer_meta array. It holds the pointer to the
head_buffer (next to read), the commit_buffer (next to write) the size of
the sub-buffers, number of sub-buffers and an array that keeps track of
the order of the sub-buffers.

This information will be stored in the persistent memory to help on reboot
to reconstruct the ring buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 209 -
 1 file changed, 184 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6edd70e45807..0e3ed7f1cc5d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -43,6 +43,11 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   head_buffer;
+   unsigned long   commit_buffer;
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+   int buffers[];
 };
 
 /*
@@ -501,6 +506,7 @@ struct ring_buffer_per_cpu {
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
+   struct ring_buffer_meta *ring_meta;
 
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
@@ -1261,6 +1267,11 @@ static void rb_head_page_activate(struct 
ring_buffer_per_cpu *cpu_buffer)
 * Set the previous list pointer to have the HEAD flag.
 */
rb_set_list_to_head(head->list.prev);
+
+   if (cpu_buffer->ring_meta) {
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   meta->head_buffer = (unsigned long)head->page;
+   }
 }
 
 static void rb_list_head_clear(struct list_head *list)
@@ -1515,51 +1526,127 @@ rb_range_align_subbuf(unsigned long addr, int 
subbuf_size, int nr_subbufs)
 }
 
 /*
- * Return a specific sub-buffer for a given @cpu defined by @idx.
+ * Return the ring_buffer_meta for a given @cpu.
  */
-static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu)
 {
-   unsigned long ptr;
int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   unsigned long ptr = buffer->range_addr_start;
+   struct ring_buffer_meta *meta;
int nr_subbufs;
 
-   /* Include the reader page */
-   nr_subbufs = nr_pages + 1;
+   if (!ptr)
+   return NULL;
+
+   /* When nr_pages passed in is zero, the first meta has already been 
initialized */
+   if (!nr_pages) {
+   meta = (struct ring_buffer_meta *)ptr;
+   nr_subbufs = meta->nr_subbufs;
+   } else {
+   meta = NULL;
+   /* Include the reader page */
+   nr_subbufs = nr_pages + 1;
+   }
 
/*
 * The first chunk may not be subbuffer aligned, where as
 * the rest of the chunks are.
 */
-   ptr = buffer->range_addr_start;
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
if (cpu) {
-   unsigned long p;
-
-   ptr += subbuf_size * nr_subbufs;
-
-   /* Save the beginning of this CPU chunk */
-   p = ptr;
-
ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
+   ptr += subbuf_size * nr_subbufs;
 
/* We can use multiplication to find chunks greater than 1 */
if (cpu > 1) {
unsigned long size;
+   unsigned long p;
 
+   /* Save the beginning of this CPU chunk */
+   p = ptr;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
ptr += subbuf_size * nr_subbufs;
 
/* Now all chunks after this are the same size */
size = ptr - p;
ptr += size * (cpu - 2);
-
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
}
}
-   if (ptr + subbuf_size * nr_subbufs > buffer->range_addr_end)
+   return (void *)ptr;
+}
+
+/* Return the start of subbufs given the meta pointer */
+static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta)
+{
+   int subbuf_size = meta->subbuf_size;
+   unsigned long ptr;
+
+   ptr = (unsigned long)meta;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs);
+
+   return (void *)ptr;
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx)
+{
+   struct ring_buffer_meta *meta;
+   

[PATCH v6 04/13] tracing: Implement creating an instance based on a given memory region

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Allow for creating a new instance by passing in an address and size to map
the ring buffer for the instance to.

This will allow features like a pstore memory mapped region to be used for
a tracing instance ring buffer that can be retrieved from one boot to the
next.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 50 +++-
 kernel/trace/trace.h |  6 +-
 2 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 578a49ff5c32..ff2b504fbe00 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4921,6 +4921,11 @@ static int tracing_open(struct inode *inode, struct file 
*file)
 static bool
 trace_ok_for_array(struct tracer *t, struct trace_array *tr)
 {
+#ifdef CONFIG_TRACER_SNAPSHOT
+   /* arrays with mapped buffer range do not have snapshots */
+   if (tr->range_addr_start && t->use_max_tr)
+   return false;
+#endif
return (tr->flags & TRACE_ARRAY_FL_GLOBAL) || t->allow_instances;
 }
 
@@ -8664,11 +8669,13 @@ tracing_init_tracefs_percpu(struct trace_array *tr, 
long cpu)
	tr, cpu, &tracing_entries_fops);
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-   trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
-   tr, cpu, &snapshot_fops);
+   if (!tr->range_addr_start) {
+   trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
+ tr, cpu, &snapshot_fops);
 
-   trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
-   tr, cpu, &snapshot_raw_fops);
+   trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
+ tr, cpu, &snapshot_raw_fops);
+   }
 #endif
 }
 
@@ -9205,7 +9212,18 @@ allocate_trace_buffer(struct trace_array *tr, struct 
array_buffer *buf, int size
 
buf->tr = tr;
 
-   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   if (tr->range_addr_start && tr->range_addr_size) {
+   buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
+ tr->range_addr_start,
+ tr->range_addr_size);
+   /*
+* This is basically the same as a mapped buffer,
+* with the same restrictions.
+*/
+   tr->mapped++;
+   } else {
+   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   }
if (!buf->buffer)
return -ENOMEM;
 
@@ -9242,6 +9260,10 @@ static int allocate_trace_buffers(struct trace_array 
*tr, int size)
return ret;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
+   /* Fix mapped buffer trace arrays do not have snapshot buffers */
+   if (tr->range_addr_start)
+   return 0;
+
+   ret = allocate_trace_buffer(tr, &tr->max_buffer,
allocate_snapshot ? size : 1);
if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
@@ -9342,7 +9364,9 @@ static int trace_array_create_dir(struct trace_array *tr)
 }
 
 static struct trace_array *
-trace_array_create_systems(const char *name, const char *systems)
+trace_array_create_systems(const char *name, const char *systems,
+  unsigned long range_addr_start,
+  unsigned long range_addr_size)
 {
struct trace_array *tr;
int ret;
@@ -9368,6 +9392,10 @@ trace_array_create_systems(const char *name, const char 
*systems)
goto out_free_tr;
}
 
+   /* Only for boot up memory mapped ring buffers */
+   tr->range_addr_start = range_addr_start;
+   tr->range_addr_size = range_addr_size;
+
tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS;
 
cpumask_copy(tr->tracing_cpumask, cpu_all_mask);
@@ -9425,7 +9453,7 @@ trace_array_create_systems(const char *name, const char 
*systems)
 
 static struct trace_array *trace_array_create(const char *name)
 {
-   return trace_array_create_systems(name, NULL);
+   return trace_array_create_systems(name, NULL, 0, 0);
 }
 
 static int instance_mkdir(const char *name)
@@ -9479,7 +9507,7 @@ struct trace_array *trace_array_get_by_name(const char 
*name, const char *system
goto out_unlock;
}
 
-   tr = trace_array_create_systems(name, systems);
+   tr = trace_array_create_systems(name, systems, 0, 0);
 
if (IS_ERR(tr))
tr = NULL;
@@ -9672,8 +9700,10 @@ init_tracer_tracefs(struct trace_array *tr, struct 
dentry *d_tracer)
MEM_FAIL(1, "Could not allocate function filter files");
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-   trace_cre

[PATCH v6 02/13] ring-buffer: Add ring_buffer_alloc_range()

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation to allowing the trace ring buffer to be allocated in a
range of memory that is persistent across reboots, add
ring_buffer_alloc_range(). It takes a contiguous range of memory and will
split it up evenly for the per CPU ring buffers.

If there's not enough memory to handle all CPUs with the minimum size, it
will fail to allocate the ring buffer.
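
As a back-of-the-envelope example of the split (this ignores the per-CPU
meta data and the sub-buffer alignment the real code performs):

	range_size = 12 << 20;			/* 12M reserved range	    */
	nr_cpus    = 8;
	per_cpu    = range_size / nr_cpus;	/* ~1.5M per CPU	    */
	nr_subbufs = per_cpu / 4096;		/* ~384 PAGE_SIZE sub-bufs,
						   one is the reader page   */

If a CPU's share ends up below the minimum number of sub-buffers, the whole
allocation fails, as described above.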

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  17 +++
 kernel/trace/ring_buffer.c  | 239 ++--
 2 files changed, 220 insertions(+), 36 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 96d2140b471e..a50b0223b1d3 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -89,6 +89,11 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
 struct trace_buffer *
 __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key 
*key);
 
+struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned 
flags,
+  int order, unsigned long start,
+  unsigned long range_size,
+  struct lock_class_key *key);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
@@ -100,6 +105,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+/*
+ * Because the ring buffer is generic, if other users of the ring buffer get
+ * traced by ftrace, it can produce lockdep warnings. We need to keep each
+ * ring buffer's lock class separate.
+ */
+#define ring_buffer_alloc_range(size, flags, order, start, range_size) \
+({ \
+   static struct lock_class_key __key; \
+   __ring_buffer_alloc_range((size), (flags), (order), (start),\
+ (range_size), &__key);\
+})
+
 typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
 ring_buffer_cond_fn cond, void *data);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a240bdc0f2d8..6edd70e45807 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -42,6 +42,9 @@
 
 static void update_pages_handler(struct work_struct *work);
 
+struct ring_buffer_meta {
+};
+
 /*
  * The ring buffer header is special. We must manually up keep it.
  */
@@ -342,7 +345,8 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
-   u32  id;/* ID for external mapping */
+   u32  id:30; /* ID for external mapping */
+   u32  range:1;   /* Mapped via a range */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -373,7 +377,9 @@ static __always_inline unsigned int rb_page_commit(struct 
buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-   free_pages((unsigned long)bpage->page, bpage->order);
+   /* Range pages are not to be freed */
+   if (!bpage->range)
+   free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
 }
 
@@ -524,6 +530,9 @@ struct trace_buffer {
struct rb_irq_work  irq_work;
booltime_stamp_abs;
 
+   unsigned long   range_addr_start;
+   unsigned long   range_addr_end;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1491,9 +1500,70 @@ static void rb_check_pages(struct ring_buffer_per_cpu 
*cpu_buffer)
}
 }
 
+/*
+ * Take an address, add the meta data size as well as the array of
+ * array subbuffer indexes, then align it to a subbuffer size.
+ *
+ * This is used to help find the next per cpu subbuffer within a mapped range.
+ */
+static unsigned long
+rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs)
+{
+   addr += sizeof(struct ring_buffer_meta) +
+   sizeof(int) * nr_subbufs;
+   return ALIGN(addr, subbuf_size);
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+{
+   unsigned long ptr;
+   int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   int nr_subbufs;
+
+   /* Include t

[PATCH v6 01/13] ring-buffer: Allow mapped field to be set without mapping

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation for having the ring buffer mapped to a dedicated location,
which will have the same restrictions as user space memory mapped buffers,
allow it to use the "mapped" field of the ring_buffer_per_cpu structure
without having the user space meta page mapping.

When this starts using the mapped field, it will need to handle adding a
user space mapping (and removing it) from a ring buffer that is using a
dedicated memory range.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 37 +++--
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 28853966aa9a..a240bdc0f2d8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -491,6 +491,7 @@ struct ring_buffer_per_cpu {
unsigned long   pages_removed;
 
unsigned intmapped;
+   unsigned intuser_mapped;/* user space mapping */
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
@@ -5224,6 +5225,9 @@ static void rb_update_meta_page(struct 
ring_buffer_per_cpu *cpu_buffer)
 {
struct trace_buffer_meta *meta = cpu_buffer->meta_page;
 
+   if (!meta)
+   return;
+
meta->reader.read = cpu_buffer->reader_page->read;
meta->reader.id = cpu_buffer->reader_page->id;
meta->reader.lost_events = cpu_buffer->lost_events;
@@ -5280,7 +5284,7 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
cpu_buffer->lost_events = 0;
cpu_buffer->last_overrun = 0;
 
-   if (cpu_buffer->mapped)
+   if (cpu_buffer->user_mapped)
rb_update_meta_page(cpu_buffer);
 
rb_head_page_activate(cpu_buffer);
@@ -6167,7 +6171,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
 
	mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (!cpu_buffer->mapped) {
+   if (!cpu_buffer->user_mapped) {
		mutex_unlock(&cpu_buffer->mapping_lock);
return ERR_PTR(-ENODEV);
}
@@ -6191,19 +6195,26 @@ static int __rb_inc_dec_mapped(struct 
ring_buffer_per_cpu *cpu_buffer,
 
	lockdep_assert_held(&cpu_buffer->mapping_lock);
 
+   /* mapped is always greater or equal to user_mapped */
+   if (WARN_ON(cpu_buffer->mapped < cpu_buffer->user_mapped))
+   return -EINVAL;
+
if (inc && cpu_buffer->mapped == UINT_MAX)
return -EBUSY;
 
-   if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+   if (WARN_ON(!inc && cpu_buffer->user_mapped == 0))
return -EINVAL;
 
	mutex_lock(&cpu_buffer->buffer->mutex);
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
 
-   if (inc)
+   if (inc) {
+   cpu_buffer->user_mapped++;
cpu_buffer->mapped++;
-   else
+   } else {
+   cpu_buffer->user_mapped--;
cpu_buffer->mapped--;
+   }
 
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
	mutex_unlock(&cpu_buffer->buffer->mutex);
@@ -6328,7 +6339,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 
	mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (cpu_buffer->mapped) {
+   if (cpu_buffer->user_mapped) {
err = __rb_map_vma(cpu_buffer, vma);
if (!err)
err = __rb_inc_dec_mapped(cpu_buffer, true);
@@ -6359,12 +6370,15 @@ int ring_buffer_map(struct trace_buffer *buffer, int 
cpu,
 */
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
+
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
err = __rb_map_vma(cpu_buffer, vma);
if (!err) {
		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   cpu_buffer->mapped = 1;
+   /* This is the first time it is mapped by user */
+   cpu_buffer->mapped++;
+   cpu_buffer->user_mapped = 1;
		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
} else {
kfree(cpu_buffer->subbuf_ids);
@@ -6392,10 +6406,10 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int 
cpu)
 
	mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (!cpu_buffer->mapped) {
+   if (!cpu_buffer->user_mapped) {
err = -ENODEV;
goto out;
-   } else if (cpu_buffer->mapped > 1) {
+   } else if (cpu_buffer->user_mapped > 1) {
__rb_inc_dec_mapped(cpu_buffer, false);
goto out;
}
@@ -

[PATCH v6 00/13] tracing: Persistent traces across a reboot or crash

2024-06-12 Thread Steven Rostedt
acpi_reset <-acpi_reboot
   swapper/0-1   [000] d..1.63.479791: acpi_os_write_port 
<-acpi_reboot

Enjoy...

Changes since v5: 
https://lore.kernel.org/all/20240612021642.941740...@goodmis.org/

[ Diff between v5 at bottom ]

- Have user_mapped be a counter and not just a boolean flag
  This keeps track of mappings from user space. It will always be less than
  or equal to the mapped field.

- Moved "mapped" field of trace_array out of CONFIG_TRACER_MAX_TRACE protection.

Changes since v4: 
https://lore.kernel.org/all/20240611192828.691638...@goodmis.org/

- Fixed "mapped" variable for per cpu ring buffers for when the first user space
  mapping happens. Added "user_mapped" that gets set the first time the ring
  buffer is mapped to user space, and this is used to compare the mapped 
counter.

- Added back the delta = rb_fix_abs_ts(delta, ts) in the validate time stamp 
code.
  This was accidentally removed when copying the validate time stamp debug code
  to the validate persistent buffer code.

Changes since v3: 
https://lore.kernel.org/all/20240606211735.684785...@goodmis.org/

- Removed an unused variable

- Fixed enable_instances() as the path without memory using the memory location
  in the command line parameter passed in "tok" where it now needs to be
  "name" for creating the snapshot buffer, otherwise it would pass in NULL
  which could crash the kernel on boot.

Changes since v2: 
https://lore.kernel.org/all/20240411012541.285904...@goodmis.org/

- Rebased on top of 6.10-rc2 that has the memory mapped ring buffer code.

- Added hard coded address to map to (from memmap=nn$ss), instead of relying
  on using reserve_mem (which I still want to add).

- Updated comments

- Restructured the validate code as the previous version broke the ring
  buffer timestamp validation code.


Steven Rostedt (Google) (13):
  ring-buffer: Allow mapped field to be set without mapping
  ring-buffer: Add ring_buffer_alloc_range()
  ring-buffer: Add ring_buffer_meta data
  tracing: Implement creating an instance based on a given memory region
  ring-buffer: Add output of ring buffer meta page
  ring-buffer: Add test if range of boot buffer is valid
  ring-buffer: Validate boot range memory events
  tracing: Add option to use memmapped memory for trace boot instance
  ring-buffer: Save text and data locations in mapped meta data
  tracing/ring-buffer: Add last_boot_info file to boot instance
  tracing: Handle old buffer mappings for event strings and functions
  tracing: Update function tracing output for previous boot buffer
  tracing: Add last boot delta offset for stack traces


 Documentation/admin-guide/kernel-parameters.txt |   9 +
 include/linux/ring_buffer.h |  20 +
 kernel/trace/ring_buffer.c  | 878 +---
 kernel/trace/trace.c| 242 ++-
 kernel/trace/trace.h|  10 +-
 kernel/trace/trace_output.c |  12 +-
 6 files changed, 1056 insertions(+), 115 deletions(-)


diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ccb2101a2e38..f3d772461a60 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -507,7 +507,7 @@ struct ring_buffer_per_cpu {
unsigned long   pages_removed;
 
unsigned intmapped;
-   unsigned intuser_mapped;/* first user space 
mapping */
+   unsigned intuser_mapped;/* user space mapping */
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
@@ -6854,7 +6854,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
 
mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (!cpu_buffer->mapped || !cpu_buffer->meta_page) {
+   if (!cpu_buffer->user_mapped) {
mutex_unlock(&cpu_buffer->mapping_lock);
return ERR_PTR(-ENODEV);
}
@@ -6878,19 +6878,26 @@ static int __rb_inc_dec_mapped(struct 
ring_buffer_per_cpu *cpu_buffer,
 
lockdep_assert_held(&cpu_buffer->mapping_lock);
 
+   /* mapped is always greater or equal to user_mapped */
+   if (WARN_ON(cpu_buffer->mapped < cpu_buffer->user_mapped))
+   return -EINVAL;
+
if (inc && cpu_buffer->mapped == UINT_MAX)
return -EBUSY;
 
-   if (WARN_ON(!inc && cpu_buffer->mapped < cpu_buffer->user_mapped))
+   if (WARN_ON(!inc && cpu_buffer->user_mapped == 0))
return -EINVAL;
 
mutex_lock(&cpu_buffer->buffer->mutex);
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
 
-   if (inc)
+   if (inc) {
+   cpu_buffer->user_mappe

Re: [PATCH 05/14] tracefs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback

2024-06-12 Thread Steven Rostedt
On Wed, 12 Jun 2024 16:09:40 +0200
"Jason A. Donenfeld"  wrote:
> > 
> > I think "Depends-on" is the way to go, as it is *not* a stable thing, and
> > what is in stable rules is only about stable patches.  
> 
> How does "Depends-on" not spiral out of control? There's a *lot* of
> "Depends-on" relations one could express in commit series and such. Of
> course a lot of git itself is designed to show some subset of these
> relationships.

If a change occurs because a recent change happened that allows the
current change to work, then I think a Depends-on is appropriate.

Like in this example. I thought this change was broken, and it would
have been except for a recent change. Having the dependency listed is
useful, especially if the dependency is subtle (doesn't break the build
and may not show the bug immediately).

> 
> It seems like in most cases, the "Cc: stable@v.g.o # x.y.z+" notation
> expresses the backporting safety correctly. What is the purpose of
> saying, "if you need this patch for any reason, you also need patch X"?
> Who is the intended audience, and are you sure they need this?

The intended audience is someone backporting features and not fixes.

> 
> I ask these questions because I wind up doing a lot of work backporting
> patches to stable and marking things properly for that or submitting
> manually backported stable patches and so forth, and in general, patch
> applicability for stable things is something I wind up devoting a lot of
> time to. If I have to *additionally* start caring about the theoretical
> possibility that somebody in the future, outside of the stable flow,
> might not understand the context of a given patch and blindly apply it
> to some random tree here or there, that sounds like a lot of extra brain
> cycles to consider.
> 
> So, is this actually necessary, and how does it not spiral out of
> control?

How would you see it going out of control? And "Depends-on" would only
be used for non stable relationships. If stable backports, we can keep
with the current method.

-- Steve



Re: Bug in Kernel 6.8.x, 6.9.x Causing Trace/Panic During Shutdown/Reboot

2024-06-12 Thread Steven Rostedt
On Wed, 12 Jun 2024 15:36:22 +0200
"Linux regression tracking (Thorsten Leemhuis)"  
wrote:

> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
> 
> Ilkka or Steven, what happened to this? This thread looks stalled. I
> also was unsuccessful when looking for other threads related to this
> report or the culprit. Did it fall through the cracks or am I missing
> something here?

Honestly, I have no idea where the bug is. I can't reproduce it. These
patches I sent would check all the places that add to the list to make
sure the proper trace_inode was being added, and the output shows that
they are all correct. Then suddenly, something that came from the
inode cache is calling the tracefs inode cache to free it, and that's
where the bug is happening.

This really looks like another bug that the recent changes have made
more prominent.

-- Steve


> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> #regzbot poke
> 
> On 02.06.24 09:32, Ilkka Naulapää wrote:
> > sorry longer delay, been a bit busy but here is the result from that
> > new patch. Only applied this patch so if the previous one is needed
> > also, let me know and I'll rerun it.
> > 
> > --Ilkka
> > 
> > On Thu, May 30, 2024 at 5:00 PM Steven Rostedt  wrote: 
> >  
> >>
> >> On Thu, 30 May 2024 16:02:37 +0300
> >> Ilkka Naulapää  wrote:
> >>  
> >>> applied your patch and here's the output.
> >>>  
> >>
> >> Unfortunately, it doesn't give me any new information. I added one more
> >> BUG on, want to try this? Otherwise, I'm pretty much at a lost. :-/
> >>
> >> -- Steve
> >>
> >> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> >> index de5b72216b1a..a090495e78c9 100644
> >> --- a/fs/tracefs/inode.c
> >> +++ b/fs/tracefs/inode.c
> >> @@ -39,13 +39,17 @@ static struct inode *tracefs_alloc_inode(struct 
> >> super_block *sb)
> >> return NULL;
> >>
> >> ti->flags = 0;
> >> +   ti->magic = 20240823;
> >>
> >> return &ti->vfs_inode;
> >>  }
> >>
> >>  static void tracefs_free_inode(struct inode *inode)
> >>  {
> >> -   kmem_cache_free(tracefs_inode_cachep, get_tracefs(inode));
> >> +   struct tracefs_inode *ti = get_tracefs(inode);
> >> +
> >> +   BUG_ON(ti->magic != 20240823);
> >> +   kmem_cache_free(tracefs_inode_cachep, ti);
> >>  }
> >>
> >>  static ssize_t default_read_file(struct file *file, char __user *buf,
> >> @@ -147,16 +151,6 @@ static const struct inode_operations 
> >> tracefs_dir_inode_operations = {
> >> .rmdir  = tracefs_syscall_rmdir,
> >>  };
> >>
> >> -struct inode *tracefs_get_inode(struct super_block *sb)
> >> -{
> >> -   struct inode *inode = new_inode(sb);
> >> -   if (inode) {
> >> -   inode->i_ino = get_next_ino();
> >> -   inode->i_atime = inode->i_mtime = 
> >> inode_set_ctime_current(inode);
> >> -   }
> >> -   return inode;
> >> -}
> >> -
> >>  struct tracefs_mount_opts {
> >> kuid_t uid;
> >> kgid_t gid;
> >> @@ -384,6 +378,7 @@ static void tracefs_dentry_iput(struct dentry *dentry, 
> >> struct inode *inode)
> >> return;
> >>
> >> ti = get_tracefs(inode);
> >> +   BUG_ON(ti->magic != 20240823);
> >> if (ti && ti->flags & TRACEFS_EVENT_INODE)
> >> eventfs_set_ef_status_free(dentry);
> >> iput(inode);
> >> @@ -568,6 +563,18 @@ struct dentry *eventfs_end_creating(struct dentry 
> >> *dentry)
> >> return dentry;
> >>  }
> >>
> >> +struct inode *tracefs_get_inode(struct super_block *sb)
> >> +{
> >> +   struct inode *inode = new_inode(sb);
> >> +
> >> +   BUG_ON(sb->s_op != &tracefs_super_operations);
> >> +   if (inode) {
> >> +   inode->i_ino = get_next_ino();
> >> +   inode->i_atime = inode->i_mtime = 
> >> inode_set_ctime_current(inode);
> >&g

Re: [PATCH v5 01/13] ring-buffer: Allow mapped field to be set without mapping

2024-06-12 Thread Steven Rostedt
On Tue, 11 Jun 2024 22:16:43 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> In preparation for having the ring buffer mapped to a dedicated location,
> which will have the same restrictions as user space memory mapped buffers,
> allow it to use the "mapped" field of the ring_buffer_per_cpu structure
> without having the user space meta page mapping.
> 
> When this starts using the mapped field, it will need to handle adding a
> user space mapping (and removing it) from a ring buffer that is using a
> dedicated memory range.
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 24 +---
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 28853966aa9a..aa8eb878e0d4 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -491,6 +491,7 @@ struct ring_buffer_per_cpu {
>   unsigned long   pages_removed;
>  
>   unsigned intmapped;
> + unsigned intuser_mapped;/* first user space 
> mapping */

This actually needs to be a counter and not just save the mapped value
when it first gets set :-p

As the mappings could technically be incremented and decremented
differently.

I'll send a new patch. But since my long test keeps failing for subtle
things, I'm not going to post another series until the full test passes.

-- Steve


>   struct mutexmapping_lock;
>   unsigned long   *subbuf_ids;/* ID to subbuf VA */
>   struct trace_buffer_meta*meta_page;
> @@ -5224,6 +5225,9 @@ static void rb_update_meta_page(struct 
> ring_buffer_per_cpu *cpu_buffer)
>  {
>   struct trace_buffer_meta *meta = cpu_buffer->meta_page;
>  
> + if (!meta)
> + return;
> +
>   meta->reader.read = cpu_buffer->reader_page->read;
>   meta->reader.id = cpu_buffer->reader_page->id;
>   meta->reader.lost_events = cpu_buffer->lost_events;
> @@ -6167,7 +6171,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int 
> cpu)
>  
>   mutex_lock(_buffer->mapping_lock);
>  
> - if (!cpu_buffer->mapped) {
> + if (!cpu_buffer->mapped || !cpu_buffer->meta_page) {
>   mutex_unlock(_buffer->mapping_lock);
>   return ERR_PTR(-ENODEV);
>   }
> @@ -6194,7 +6198,7 @@ static int __rb_inc_dec_mapped(struct 
> ring_buffer_per_cpu *cpu_buffer,
>   if (inc && cpu_buffer->mapped == UINT_MAX)
>   return -EBUSY;
>  
> - if (WARN_ON(!inc && cpu_buffer->mapped == 0))
> + if (WARN_ON(!inc && cpu_buffer->mapped < cpu_buffer->user_mapped))
>   return -EINVAL;
>  
>   mutex_lock(_buffer->buffer->mutex);
> @@ -6328,7 +6332,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int 
> cpu,
>  
>   mutex_lock(_buffer->mapping_lock);
>  
> - if (cpu_buffer->mapped) {
> + if (cpu_buffer->user_mapped) {
>   err = __rb_map_vma(cpu_buffer, vma);
>   if (!err)
>   err = __rb_inc_dec_mapped(cpu_buffer, true);
> @@ -6359,12 +6363,15 @@ int ring_buffer_map(struct trace_buffer *buffer, int 
> cpu,
>*/
>   raw_spin_lock_irqsave(_buffer->reader_lock, flags);
>   rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
> +
>   raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
>  
>   err = __rb_map_vma(cpu_buffer, vma);
>   if (!err) {
>   raw_spin_lock_irqsave(_buffer->reader_lock, flags);
> - cpu_buffer->mapped = 1;
> + /* This is the first time it is mapped externally */
> + cpu_buffer->mapped++;
> + cpu_buffer->user_mapped = cpu_buffer->mapped;
>   raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
>   } else {
>   kfree(cpu_buffer->subbuf_ids);
> @@ -6392,10 +6399,10 @@ int ring_buffer_unmap(struct trace_buffer *buffer, 
> int cpu)
>  
>   mutex_lock(_buffer->mapping_lock);
>  
> - if (!cpu_buffer->mapped) {
> + if (!cpu_buffer->user_mapped) {
>   err = -ENODEV;
>   goto out;
> - } else if (cpu_buffer->mapped > 1) {
> + } else if (cpu_buffer->mapped > cpu_buffer->user_mapped) {
>   __rb_inc_dec_mapped(cpu_buffer, false);
>   goto out;
>   }
> @@ -6403,7 +6410,10 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int 
> cpu)
>   mutex_lock(>mutex);
>   raw_spin_lock_irqsave(_buffer->reader_lock, flags);
>  
> - cpu_buffer->mapped = 0;
> + /* This is the last user space mapping */
> + if (!WARN_ON_ONCE(cpu_buffer->mapped != cpu_buffer->user_mapped))
> + cpu_buffer->mapped--;
> + cpu_buffer->user_mapped = 0;
>  
>   raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
>  




[PATCH v5 13/13] tracing: Add last boot delta offset for stack traces

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The addresses of a stack trace event are relative to the kallsyms. As that
can change between boots, when printing the stack trace from a buffer that
was from the last boot, it needs all the addresses to be added to the
"text_delta" that gives the delta between the addresses of the functions
for the current boot compared to the address of the last boot. Then it can
be passed to kallsyms to find the function name, otherwise it just shows a
useless list of addresses.
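
As a rough illustration of the arithmetic (not part of the patch, addresses
made up), this is what a user-space tool would otherwise have to do by hand
before asking kallsyms about a saved stack entry:

  #include <stdio.h>

  int main(void)
  {
          /* stack address as recorded by the previous boot */
          unsigned long saved_ip = 0xffffffff91234560UL;
          /* "text delta" reported for the boot_mapped instance */
          long text_delta = -268435456;

          /* only this sum means anything to the current boot's kallsyms */
          printf("lookup %#lx in kallsyms\n", saved_ip + text_delta);
          return 0;
  }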

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index b9d2c64c0648..48de93598897 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1233,6 +1233,7 @@ static enum print_line_t trace_stack_print(struct 
trace_iterator *iter,
struct trace_seq *s = &iter->seq;
unsigned long *p;
unsigned long *end;
+   long delta = iter->tr->text_delta;
 
trace_assign_type(field, iter->ent);
end = (unsigned long *)((long)iter->ent + iter->ent_size);
@@ -1245,7 +1246,7 @@ static enum print_line_t trace_stack_print(struct 
trace_iterator *iter,
break;
 
trace_seq_puts(s, " => ");
-   seq_print_ip_sym(s, *p, flags);
+   seq_print_ip_sym(s, (*p) + delta, flags);
trace_seq_putc(s, '\n');
}
 
-- 
2.43.0





[PATCH v5 10/13] tracing/ring-buffer: Add last_boot_info file to boot instance

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If an instance is mapped to memory on boot up, create a new file called
"last_boot_info" that will hold information that can be used to properly
parse the raw data in the ring buffer.

It will export the delta of the addresses for text and data from what it
was from the last boot. It does not expose actually addresses (unless you
knew what the actual address was from the last boot).

The output will look like:

 # cat last_boot_info
 text delta:-268435456
 data delta:-268435456

The text and data are kept separate in case they are ever made different.
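
A minimal sketch of a user-space consumer (not part of this series; the
"boot_mapped" instance name is just the example used elsewhere in the
series):

  #include <stdio.h>

  int main(void)
  {
          long text = 0, data = 0;
          FILE *f = fopen("/sys/kernel/tracing/instances/boot_mapped"
                          "/last_boot_info", "r");

          if (!f)
                  return 1;
          /* the file holds "text delta:<tab>N" and "data delta:<tab>N" */
          if (fscanf(f, " text delta: %ld data delta: %ld", &text, &data) == 2)
                  printf("text %ld data %ld\n", text, data);
          fclose(f);
          return 0;
  }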

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  3 +++
 kernel/trace/ring_buffer.c  | 23 ++
 kernel/trace/trace.c| 47 -
 kernel/trace/trace.h|  2 ++
 4 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index a50b0223b1d3..55de3798a9b9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -94,6 +94,9 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long 
size, unsigned flag
   unsigned long range_size,
   struct lock_class_key *key);
 
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 195e47ef730d..ccb2101a2e38 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2396,6 +2396,29 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned 
long size, unsigned flag
return alloc_buffer(size, flags, order, start, start + range_size, key);
 }
 
+/**
+ * ring_buffer_last_boot_delta - return the delta offset from last boot
+ * @buffer: The buffer to return the delta from
+ * @text: Return text delta
+ * @data: Return data delta
+ *
+ * Returns: true if the delta is non-zero
+ */
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data)
+{
+   if (!buffer)
+   return false;
+
+   if (!buffer->last_text_delta)
+   return false;
+
+   *text = buffer->last_text_delta;
+   *data = buffer->last_data_delta;
+
+   return true;
+}
+
 /**
  * ring_buffer_free - free a ring buffer.
  * @buffer: the buffer to free.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dfde26aa3211..dc4eee33d920 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6041,6 +6041,18 @@ ssize_t tracing_resize_ring_buffer(struct trace_array 
*tr,
return ret;
 }
 
+static void update_last_data(struct trace_array *tr)
+{
+   if (!tr->text_delta && !tr->data_delta)
+   return;
+
+   /* Clear old data */
+   tracing_reset_online_cpus(&tr->array_buffer);
+
+   /* Using current data now */
+   tr->text_delta = 0;
+   tr->data_delta = 0;
+}
 
 /**
  * tracing_update_buffers - used by tracing facility to expand ring buffers
@@ -6058,6 +6070,9 @@ int tracing_update_buffers(struct trace_array *tr)
int ret = 0;
 
mutex_lock(_types_lock);
+
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded)
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6113,6 +6128,8 @@ int tracing_set_tracer(struct trace_array *tr, const char 
*buf)
 
mutex_lock(_types_lock);
 
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded) {
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6860,6 +6877,21 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static ssize_t
+tracing_last_boot_read(struct file *filp, char __user *ubuf, size_t cnt, 
loff_t *ppos)
+{
+   struct trace_array *tr = filp->private_data;
+   struct seq_buf seq;
+   char buf[64];
+
+   seq_buf_init(&seq, buf, 64);
+
+   seq_buf_printf(, "text delta:\t%ld\n", tr->text_delta);
+   seq_buf_printf(, "data delta:\t%ld\n", tr->data_delta);
+
+   return simple_read_from_buffer(ubuf, cnt, ppos, buf, 
seq_buf_used(&seq));
+}
+
 static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
 {
struct trace_array *tr = inode->i_private;
@@ -7499,6 +7531,13 @@ static const struct file_operations 
trace_time_stamp_mode_fops = {
.release= tracing_single_release_tr,
 };
 
+static const s

[PATCH v5 12/13] tracing: Update function tracing output for previous boot buffer

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of where the saved function is located in the current boot.
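
As a made-up illustration of the effect, a function event from the previous
boot would otherwise print only raw (and now meaningless) addresses:

   swapper/0-1   [000] d..1.    63.479790: 0xffffffff91234560 <-0xffffffff91234890

while with the text delta applied it resolves as usual:

   swapper/0-1   [000] d..1.    63.479790: acpi_reset <-acpi_reboot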

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH v5 11/13] tracing: Handle old buffer mappings for event strings and functions

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Use the saved text_delta and data_delta of a persistent memory mapped ring
buffer that was saved from a previous boot, and use the delta in the trace
event print output so that strings and functions show up normally.

That is, for an event like trace_kmalloc() that prints the callsite via
"%pS", if it used the address saved in the ring buffer it will not match
the function that was saved in the previous boot if the kernel remaps
itself between boots.

For RCU events that point to saved static strings where only the address
of the string is saved in the ring buffer, it too will be adjusted to
point to where the string is on the current boot.
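
A toy user-space model of the relocation (not kernel code; the delta here is
simulated rather than read from the boot meta data):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          static const char str[] = "rcu_preempt";      /* current boot's copy */

          /* pretend this is the address the previous boot stored in the event */
          uintptr_t old_addr = (uintptr_t)str - 0x10000000;
          long data_delta    = 0x10000000;               /* new - old */

          /* only after adding the delta does the pointer hit the live string */
          printf("%s\n", (const char *)(old_addr + data_delta));
          return 0;
  }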

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dc4eee33d920..71cca10581d6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3671,8 +3671,11 @@ static void test_can_verify(void)
 void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
 va_list ap)
 {
+   long text_delta = iter->tr->text_delta;
+   long data_delta = iter->tr->data_delta;
const char *p = fmt;
const char *str;
+   bool good;
int i, j;
 
if (WARN_ON_ONCE(!fmt))
@@ -3691,7 +3694,10 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 
j = 0;
 
-   /* We only care about %s and variants */
+   /*
+* We only care about %s and variants
+* as well as %p[sS] if delta is non-zero
+*/
for (i = 0; p[i]; i++) {
if (i + 1 >= iter->fmt_size) {
/*
@@ -3720,6 +3726,11 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
}
if (p[i+j] == 's')
break;
+
+   if (text_delta && p[i+1] == 'p' &&
+   ((p[i+2] == 's' || p[i+2] == 'S')))
+   break;
+
star = false;
}
j = 0;
@@ -3733,6 +3744,24 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
iter->fmt[i] = '\0';
trace_seq_vprintf(&iter->seq, iter->fmt, ap);
 
+   /* Add delta to %pS pointers */
+   if (p[i+1] == 'p') {
+   unsigned long addr;
+   char fmt[4];
+
+   fmt[0] = '%';
+   fmt[1] = 'p';
+   fmt[2] = p[i+2]; /* Either %ps or %pS */
+   fmt[3] = '\0';
+
+   addr = va_arg(ap, unsigned long);
+   addr += text_delta;
+   trace_seq_printf(&iter->seq, fmt, (void *)addr);
+
+   p += i + 3;
+   continue;
+   }
+
/*
 * If iter->seq is full, the above call no longer guarantees
 * that ap is in sync with fmt processing, and further calls
@@ -3751,6 +3780,14 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
/* The ap now points to the string data of the %s */
str = va_arg(ap, const char *);
 
+   good = trace_safe_str(iter, str, star, len);
+
+   /* Could be from the last boot */
+   if (data_delta && !good) {
+   str += data_delta;
+   good = trace_safe_str(iter, str, star, len);
+   }
+
/*
 * If you hit this warning, it is likely that the
 * trace event in question used %s on a string that
@@ -3760,8 +3797,7 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 * instead. See samples/trace_events/trace-events-sample.h
 * for reference.
 */
-   if (WARN_ONCE(!trace_safe_str(iter, str, star, len),
- "fmt: '%s' current_buffer: '%s'",
+   if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
  fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
 
-- 
2.43.0





[PATCH v5 09/13] ring-buffer: Save text and data locations in mapped meta data

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When a ring buffer is mapped to a specific address, save the address of a
text function and some data. This will be used to determine the delta
between the last boot and the current boot for pointers to functions as
well as to data.
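
The delta itself is then a simple subtraction; a sketch with invented
addresses of what the boot-time computation amounts to:

  #include <stdio.h>

  int main(void)
  {
          /* meta->text_addr written by the previous boot */
          unsigned long saved_text = 0xffffffff91000000UL;
          /* address of the same reference function in this boot */
          unsigned long this_text  = 0xffffffff81000000UL;

          long text_delta = (long)(this_text - saved_text);
          printf("text delta:\t%ld\n", text_delta);
          return 0;
  }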

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2118c478e42b..195e47ef730d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,8 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   text_addr;
+   unsigned long   data_addr;
unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
@@ -542,6 +544,9 @@ struct trace_buffer {
unsigned long   range_addr_start;
unsigned long   range_addr_end;
 
+   longlast_text_delta;
+   longlast_data_delta;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1821,10 +1826,15 @@ static void rb_meta_validate_events(struct 
ring_buffer_per_cpu *cpu_buffer)
}
 }
 
+/* Used to calculate data delta */
+static char rb_data_ptr[] = "";
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
unsigned long delta;
+   unsigned long this_text = (unsigned long)rb_range_meta_init;
+   unsigned long this_data = (unsigned long)rb_data_ptr;
void *subbuf;
int cpu;
int i;
@@ -1841,6 +1851,10 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
meta->first_buffer += delta;
meta->head_buffer += delta;
meta->commit_buffer += delta;
+   buffer->last_text_delta = this_text - meta->text_addr;
+   buffer->last_data_delta = this_data - meta->data_addr;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
continue;
}
 
@@ -1857,6 +1871,8 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
subbuf = rb_subbufs_from_meta(meta);
 
meta->first_buffer = (unsigned long)subbuf;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
 
/*
 * The buffers[] array holds the order of the sub-buffers
-- 
2.43.0





[PATCH v5 08/13] tracing: Add option to use memmapped memory for trace boot instance

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add an option to the trace_instance kernel command line parameter that
allows it to use the reserved memory from memmap boot parameter.

  memmap=12M$0x28450 trace_instance=boot_mapped@0x28450:12M

The above reserves 12 megs at the physical address 0x28450.
The second parameter will create a "boot_mapped" instance and use the
memory reserved as the memory for the ring buffer.

That will create an instance called "boot_mapped":

  /sys/kernel/tracing/instances/boot_mapped

Note, because the ring buffer is using a defined memory range, it will
act just like a memory mapped ring buffer. It will not have a snapshot
buffer, as it can't swap out the buffer. The snapshot files as well as any
tracers that uses a snapshot will not be present in the boot_mapped
instance.

Cc: linux...@kvack.org
Signed-off-by: Steven Rostedt (Google) 
---
 .../admin-guide/kernel-parameters.txt |  9 +++
 kernel/trace/trace.c  | 75 +--
 2 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..ff26b6094e79 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6754,6 +6754,15 @@
the same thing would happen if it was left off). The 
irq_handler_entry
event, and all events under the "initcall" system.
 
+   If memory has been reserved (see memmap for x86), the 
instance
+   can use that memory:
+
+   memmap=12M$0x28450 
trace_instance=boot_map@0x28450:12M
+
+   The above will create a "boot_map" instance that uses 
the physical
+   memory at 0x28450 that is 12Megs. The per CPU 
buffers of that
+   instance will be split up accordingly.
+
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 622fe670949d..dfde26aa3211 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9504,6 +9504,31 @@ static int instance_mkdir(const char *name)
return ret;
 }
 
+static u64 map_pages(u64 start, u64 size)
+{
+   struct page **pages;
+   phys_addr_t page_start;
+   unsigned int page_count;
+   unsigned int i;
+   void *vaddr;
+
+   page_count = DIV_ROUND_UP(size, PAGE_SIZE);
+
+   page_start = start;
+   pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
+   if (!pages)
+   return 0;
+
+   for (i = 0; i < page_count; i++) {
+   phys_addr_t addr = page_start + i * PAGE_SIZE;
+   pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
+   }
+   vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
+   kfree(pages);
+
+   return (u64)(unsigned long)vaddr;
+}
+
 /**
  * trace_array_get_by_name - Create/Lookup a trace array, given its name.
  * @name: The name of the trace array to be looked up/created.
@@ -10350,6 +10375,7 @@ __init static void enable_instances(void)
 {
struct trace_array *tr;
char *curr_str;
+   char *name;
char *str;
char *tok;
 
@@ -10358,19 +10384,56 @@ __init static void enable_instances(void)
str = boot_instance_info;
 
while ((curr_str = strsep(&str, "\t"))) {
+   unsigned long start = 0;
+   unsigned long size = 0;
+   unsigned long addr = 0;
 
tok = strsep(&curr_str, ",");
+   name = strsep(&tok, "@");
+   if (tok) {
+   start = memparse(tok, &tok);
+   if (!start) {
+   pr_warn("Tracing: Invalid boot instance address 
for %s\n",
+   name);
+   continue;
+   }
+   }
 
-   if (IS_ENABLED(CONFIG_TRACER_MAX_TRACE))
-   do_allocate_snapshot(tok);
+   if (start) {
+   if (*tok != ':') {
+   pr_warn("Tracing: No size specified for 
instance %s\n", name);
+   continue;
+   }
+   tok++;
+   size = memparse(tok, &tok);
+   if (!size) {
+   pr_warn("Tracing: Invalid boot instance size 
for %s\n",
+   name);
+   continue;
+   }
+   addr = map_pages(star

[PATCH v5 07/13] ring-buffer: Validate boot range memory events

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Make sure all the events in each of the sub-buffers that were mapped in a
memory region are valid. This moves the code that walks the buffers for
time-stamp validation out of the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS
ifdef block and is used to validate the content. Only the ring buffer
event meta data and time stamps are checked and not the data load.

This also has a second purpose. The buffer_page structure that points to
the data sub-buffers has accounting that keeps track of the number of
events that are on the sub-buffer. This updates that counter as well. That
counter is used in reading the buffer and knowing if the ring buffer is
empty or not.
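
A toy model (plain user-space C, not the kernel code) of what the walk over
one sub-buffer does: every event advances the running timestamp, and only
data events are counted:

  #include <stdio.h>

  enum { T_DATA, T_TIME_EXTEND };

  struct ev { int type; unsigned long long delta; };

  int main(void)
  {
          struct ev page[] = {
                  { T_DATA, 100 }, { T_TIME_EXTEND, 1ULL << 27 }, { T_DATA, 5 },
          };
          unsigned long long ts = 1000;   /* sub-buffer's base time_stamp */
          int events = 0;

          for (unsigned int i = 0; i < sizeof(page) / sizeof(page[0]); i++) {
                  ts += page[i].delta;            /* both types move the clock */
                  if (page[i].type == T_DATA)
                          events++;               /* only data events count */
          }
          printf("%d events, last ts %llu\n", events, ts);
          return 0;
  }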

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 190 +
 1 file changed, 152 insertions(+), 38 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c746ec12b7cd..2118c478e42b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1675,10 +1675,152 @@ static bool rb_meta_valid(struct ring_buffer_meta 
*meta, int cpu,
subbuf = (void *)subbuf + subbuf_size;
}
 
-   pr_info("Ring buffer meta is from previous boot!\n");
return true;
 }
 
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int 
cpu,
+  unsigned long long *timestamp, u64 *delta_ptr)
+{
+   struct ring_buffer_event *event;
+   u64 ts, delta;
+   int events = 0;
+   int e;
+
+   *delta_ptr = 0;
+   *timestamp = 0;
+
+   ts = dpage->time_stamp;
+
+   for (e = 0; e < tail; e += rb_event_length(event)) {
+
+   event = (struct ring_buffer_event *)(dpage->data + e);
+
+   switch (event->type_len) {
+
+   case RINGBUF_TYPE_TIME_EXTEND:
+   delta = rb_event_time_stamp(event);
+   ts += delta;
+   break;
+
+   case RINGBUF_TYPE_TIME_STAMP:
+   delta = rb_event_time_stamp(event);
+   delta = rb_fix_abs_ts(delta, ts);
+   if (delta < ts) {
+   *delta_ptr = delta;
+   *timestamp = ts;
+   return -1;
+   }
+   ts = delta;
+   break;
+
+   case RINGBUF_TYPE_PADDING:
+   if (event->time_delta == 1)
+   break;
+   fallthrough;
+   case RINGBUF_TYPE_DATA:
+   events++;
+   ts += event->time_delta;
+   break;
+
+   default:
+   return -1;
+   }
+   }
+   *timestamp = ts;
+   return events;
+}
+
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+{
+   unsigned long long ts;
+   u64 delta;
+   int tail;
+
+   tail = local_read(&dpage->commit);
+   return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+}
+
+/* If the meta data has been validated, now validate the events */
+static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   struct buffer_page *head_page;
+   unsigned long entry_bytes = 0;
+   unsigned long entries = 0;
+   int ret;
+   int i;
+
+   if (!meta || !meta->head_buffer)
+   return;
+
+   /* Do the reader page first */
+   ret = rb_validate_buffer(cpu_buffer->reader_page->page, 
cpu_buffer->cpu);
+   if (ret < 0) {
+   pr_info("Ring buffer reader page is invalid\n");
+   goto invalid;
+   }
+   entries += ret;
+   entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
+   local_set(&cpu_buffer->reader_page->entries, ret);
+
+   head_page = cpu_buffer->head_page;
+
+   /* If both the head and commit are on the reader_page then we are done. 
*/
+   if (head_page == cpu_buffer->reader_page &&
+   head_page == cpu_buffer->commit_page)
+   goto done;
+
+   /* Iterate until finding the commit page */
+   for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
+
+   /* Reader page has already been done */
+   if (head_page == cpu_buffer->reader_page)
+   continue;
+
+   ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+   if (ret < 0) {
+   pr_info("Ring buffer meta [%d] invalid buffer page\n",
+   cpu_buffer->cpu);
+ 

[PATCH v5 06/13] ring-buffer: Add test if range of boot buffer is valid

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a test against the ring buffer memory range to see if it has valid
data. The ring_buffer_meta structure is given a new field called
"first_buffer" which holds the address of the first sub-buffer. This is
used to both determine if the other fields are valid as well as finding
the offset between the old addresses of the sub-buffer from the previous
boot to the new addresses of the current boot.

Since the values for nr_subbufs and subbuf_size is to be the same, check
if the values in the meta page match the values calculated.

Take the range of the first_buffer and the total size of all the buffers
and make sure the saved head_buffer and commit_buffer fall in the range.

Iterate through all the sub-buffers to make sure that the values in the
sub-buffer "commit" field (the field that holds the amount of data on the
sub-buffer) is within the end of the sub-buffer. Also check the index
array to make sure that all the indexes are within nr_subbufs.
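
A compact user-space sketch (not the kernel code) of the containment checks
listed above, using made-up numbers:

  #include <stdio.h>

  int main(void)
  {
          unsigned long start = 0x1000, subbuf_size = 0x100;
          int nr_subbufs = 8;
          unsigned long end = start + subbuf_size * nr_subbufs;

          unsigned long head = 0x1300, commit = 0x1400;   /* saved pointers */
          int buffers[8] = { 1, 2, 3, 4, 5, 6, 7, 0 };    /* saved index array */

          int ok = head >= start && head < end &&
                   commit >= start && commit < end;
          for (int i = 0; ok && i < nr_subbufs; i++)
                  ok = buffers[i] >= 0 && buffers[i] < nr_subbufs;

          printf("meta %s\n", ok ? "valid" : "rejected");
          return 0;
  }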

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 143 ++---
 1 file changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 076e7135b9ef..c746ec12b7cd 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,7 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
__u32   subbuf_size;
@@ -1618,21 +1619,103 @@ static void *rb_range_buffer(struct 
ring_buffer_per_cpu *cpu_buffer, int idx)
return (void *)ptr;
 }
 
+/*
+ * See if the existing memory contains valid ring buffer data.
+ * As the previous kernel must be the same as this kernel, all
+ * the calculations (size of buffers and number of buffers)
+ * must be the same.
+ */
+static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
+ struct trace_buffer *buffer, int nr_pages)
+{
+   int subbuf_size = PAGE_SIZE;
+   struct buffer_data_page *subbuf;
+   unsigned long buffers_start;
+   unsigned long buffers_end;
+   int i;
+
+   /* The subbuffer's size and number of subbuffers must match */
+   if (meta->subbuf_size != subbuf_size ||
+   meta->nr_subbufs != nr_pages + 1) {
+   pr_info("Ring buffer boot meta [%d] mismatch of 
subbuf_size/nr_pages\n", cpu);
+   return false;
+   }
+
+   buffers_start = meta->first_buffer;
+   buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
+
+   /* Is the head and commit buffers within the range of buffers? */
+   if (meta->head_buffer < buffers_start ||
+   meta->head_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] head buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   if (meta->commit_buffer < buffers_start ||
+   meta->commit_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] commit buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   subbuf = rb_subbufs_from_meta(meta);
+
+   /* Do the meta buffers and the subbufs themselves have correct data? */
+   for (i = 0; i < meta->nr_subbufs; i++) {
+   if (meta->buffers[i] < 0 ||
+   meta->buffers[i] >= meta->nr_subbufs) {
+   pr_info("Ring buffer boot meta [%d] array out of 
range\n", cpu);
+   return false;
+   }
+
+   if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
+   pr_info("Ring buffer boot meta [%d] buffer invalid 
commit\n", cpu);
+   return false;
+   }
+
+   subbuf = (void *)subbuf + subbuf_size;
+   }
+
+   pr_info("Ring buffer meta is from previous boot!\n");
+   return true;
+}
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
+   unsigned long delta;
void *subbuf;
int cpu;
int i;
 
for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
+   void *next_meta;
+
meta = rb_range_meta(buffer, nr_pages, cpu);
 
+   if (rb_meta_valid(meta, cpu, buffer, nr_pages)) {
+   /* Make the mappings match the current address */
+   subbuf = rb_subbufs_from_meta(meta);
+   delta = (unsigned long)subbuf - meta->first_buffer;
+   meta->first_buffer += delta;
+   meta->head_buffer += delta;
+   meta->commit_buffer += delta;

[PATCH v5 05/13] ring-buffer: Add output of ring buffer meta page

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a buffer_meta per-cpu file for the trace instance that is mapped to
boot memory. This shows the current meta-data and can be used by user
space tools to record off the current mappings to help reconstruct the
ring buffer after a reboot.

It does not expose any virtual addresses, just indexes into the sub-buffer
pages.
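
A read of the new per-cpu file would look something like this (values
invented; the path assumes the "boot_mapped" instance used as the example in
this series):

  # cat /sys/kernel/tracing/instances/boot_mapped/per_cpu/cpu0/buffer_meta
  head_buffer:   4
  commit_buffer: 4
  subbuf_size:   4096
  nr_subbufs:    9
  buffer[0]:     6
  buffer[1]:     3
  ...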

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 77 ++
 kernel/trace/trace.c   | 30 ++-
 kernel/trace/trace.h   |  2 +
 3 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 17818b9eab2a..076e7135b9ef 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -32,6 +32,8 @@
 #include 
 #include 
 
+#include "trace.h"
+
 /*
  * The "absolute" timestamp in the buffer is only 59 bits.
  * If a clock has the 5 MSBs set, it needs to be saved and
@@ -1647,6 +1649,81 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
}
 }
 
+static void *rbm_start(struct seq_file *m, loff_t *pos)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val;
+
+   if (!meta)
+   return NULL;
+
+   if (*pos > meta->nr_subbufs)
+   return NULL;
+
+   val = *pos;
+   val++;
+
+   return (void *)val;
+}
+
+static void *rbm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   (*pos)++;
+
+   return rbm_start(m, pos);
+}
+
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rbm_show(struct seq_file *m, void *v)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val = (unsigned long)v;
+
+   if (val == 1) {
+   seq_printf(m, "head_buffer:   %d\n",
+  rb_meta_subbuf_idx(meta, (void *)meta->head_buffer));
+   seq_printf(m, "commit_buffer: %d\n",
+  rb_meta_subbuf_idx(meta, (void 
*)meta->commit_buffer));
+   seq_printf(m, "subbuf_size:   %d\n", meta->subbuf_size);
+   seq_printf(m, "nr_subbufs:%d\n", meta->nr_subbufs);
+   return 0;
+   }
+
+   val -= 2;
+   seq_printf(m, "buffer[%ld]:%d\n", val, meta->buffers[val]);
+
+   return 0;
+}
+
+static void rbm_stop(struct seq_file *m, void *p)
+{
+}
+
+static const struct seq_operations rb_meta_seq_ops = {
+   .start  = rbm_start,
+   .next   = rbm_next,
+   .show   = rbm_show,
+   .stop   = rbm_stop,
+};
+
+int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, 
int cpu)
+{
+   struct seq_file *m;
+   int ret;
+
+   ret = seq_open(file, &rb_meta_seq_ops);
+   if (ret)
+   return ret;
+
+   m = file->private_data;
+   m->private = buffer->buffers[cpu];
+
+   return 0;
+}
+
 static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
long nr_pages, struct list_head *pages)
 {
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ff2b504fbe00..622fe670949d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5018,7 +5018,7 @@ static int show_traces_open(struct inode *inode, struct 
file *file)
return 0;
 }
 
-static int show_traces_release(struct inode *inode, struct file *file)
+static int tracing_seq_release(struct inode *inode, struct file *file)
 {
struct trace_array *tr = inode->i_private;
 
@@ -5059,7 +5059,7 @@ static const struct file_operations show_traces_fops = {
.open   = show_traces_open,
.read   = seq_read,
.llseek = seq_lseek,
-   .release= show_traces_release,
+   .release= tracing_seq_release,
 };
 
 static ssize_t
@@ -6860,6 +6860,22 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
+{
+   struct trace_array *tr = inode->i_private;
+   int cpu = tracing_get_cpu(inode);
+   int ret;
+
+   ret = tracing_check_open_get_tr(tr);
+   if (ret)
+   return ret;
+
+   ret = ring_buffer_meta_seq_init(filp, tr->array_buffer.buffer, cpu);
+   if (ret < 0)
+   __trace_array_put(tr);
+   return ret;
+}
+
 static ssize_t
 tracing_free_buffer_write(struct file *filp, const char __user *ubuf,
  size_t cnt, loff_t *ppos)
@@ -7436,6 +7452,13 @@ static const struct file_operations tracing_entries_fops 
= {
 

[PATCH v5 04/13] tracing: Implement creating an instance based on a given memory region

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Allow for creating a new instance by passing in an address and size to map
the ring buffer for the instance to.

This will allow features like a pstore memory mapped region to be used for
a tracing instance ring buffer that can be retrieved from one boot to the
next.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 50 +++-
 kernel/trace/trace.h |  4 
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 578a49ff5c32..ff2b504fbe00 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4921,6 +4921,11 @@ static int tracing_open(struct inode *inode, struct file 
*file)
 static bool
 trace_ok_for_array(struct tracer *t, struct trace_array *tr)
 {
+#ifdef CONFIG_TRACER_SNAPSHOT
+   /* arrays with mapped buffer range do not have snapshots */
+   if (tr->range_addr_start && t->use_max_tr)
+   return false;
+#endif
return (tr->flags & TRACE_ARRAY_FL_GLOBAL) || t->allow_instances;
 }
 
@@ -8664,11 +8669,13 @@ tracing_init_tracefs_percpu(struct trace_array *tr, 
long cpu)
tr, cpu, &tracing_entries_fops);
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-   trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
-   tr, cpu, &snapshot_fops);
+   if (!tr->range_addr_start) {
+   trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
+ tr, cpu, &snapshot_fops);
 
-   trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
-   tr, cpu, &snapshot_raw_fops);
+   trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
+ tr, cpu, &snapshot_raw_fops);
+   }
 #endif
 }
 
@@ -9205,7 +9212,18 @@ allocate_trace_buffer(struct trace_array *tr, struct 
array_buffer *buf, int size
 
buf->tr = tr;
 
-   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   if (tr->range_addr_start && tr->range_addr_size) {
+   buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
+ tr->range_addr_start,
+ tr->range_addr_size);
+   /*
+* This is basically the same as a mapped buffer,
+* with the same restrictions.
+*/
+   tr->mapped++;
+   } else {
+   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   }
if (!buf->buffer)
return -ENOMEM;
 
@@ -9242,6 +9260,10 @@ static int allocate_trace_buffers(struct trace_array 
*tr, int size)
return ret;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
+   /* Fix mapped buffer trace arrays do not have snapshot buffers */
+   if (tr->range_addr_start)
+   return 0;
+
ret = allocate_trace_buffer(tr, &tr->max_buffer,
allocate_snapshot ? size : 1);
if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
@@ -9342,7 +9364,9 @@ static int trace_array_create_dir(struct trace_array *tr)
 }
 
 static struct trace_array *
-trace_array_create_systems(const char *name, const char *systems)
+trace_array_create_systems(const char *name, const char *systems,
+  unsigned long range_addr_start,
+  unsigned long range_addr_size)
 {
struct trace_array *tr;
int ret;
@@ -9368,6 +9392,10 @@ trace_array_create_systems(const char *name, const char 
*systems)
goto out_free_tr;
}
 
+   /* Only for boot up memory mapped ring buffers */
+   tr->range_addr_start = range_addr_start;
+   tr->range_addr_size = range_addr_size;
+
tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS;
 
cpumask_copy(tr->tracing_cpumask, cpu_all_mask);
@@ -9425,7 +9453,7 @@ trace_array_create_systems(const char *name, const char 
*systems)
 
 static struct trace_array *trace_array_create(const char *name)
 {
-   return trace_array_create_systems(name, NULL);
+   return trace_array_create_systems(name, NULL, 0, 0);
 }
 
 static int instance_mkdir(const char *name)
@@ -9479,7 +9507,7 @@ struct trace_array *trace_array_get_by_name(const char 
*name, const char *system
goto out_unlock;
}
 
-   tr = trace_array_create_systems(name, systems);
+   tr = trace_array_create_systems(name, systems, 0, 0);
 
if (IS_ERR(tr))
tr = NULL;
@@ -9672,8 +9700,10 @@ init_tracer_tracefs(struct trace_array *tr, struct 
dentry *d_tracer)
MEM_FAIL(1, "Could not allocate function filter files");
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-   trace_cre

[PATCH v5 03/13] ring-buffer: Add ring_buffer_meta data

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Populate the ring_buffer_meta array. It holds the pointer to the
head_buffer (next to read), the commit_buffer (next to write), the size of
the sub-buffers, number of sub-buffers and an array that keeps track of
the order of the sub-buffers.

This information will be stored in the persistent memory to help on reboot
to reconstruct the ring buffer.
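
Roughly, each CPU's chunk of the reserved range then looks like this (sketch
only):

  [ ring_buffer_meta: head_buffer, commit_buffer, subbuf_size,
    nr_subbufs, buffers[] ][ subbuf 0 ][ subbuf 1 ] ... [ subbuf N ]

with head_buffer/commit_buffer pointing at one of the sub-buffers and
buffers[] recording their logical order.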

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 209 -
 1 file changed, 184 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fd52cad34b0f..17818b9eab2a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -43,6 +43,11 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   head_buffer;
+   unsigned long   commit_buffer;
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+   int buffers[];
 };
 
 /*
@@ -501,6 +506,7 @@ struct ring_buffer_per_cpu {
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
+   struct ring_buffer_meta *ring_meta;
 
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
@@ -1261,6 +1267,11 @@ static void rb_head_page_activate(struct 
ring_buffer_per_cpu *cpu_buffer)
 * Set the previous list pointer to have the HEAD flag.
 */
rb_set_list_to_head(head->list.prev);
+
+   if (cpu_buffer->ring_meta) {
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   meta->head_buffer = (unsigned long)head->page;
+   }
 }
 
 static void rb_list_head_clear(struct list_head *list)
@@ -1515,51 +1526,127 @@ rb_range_align_subbuf(unsigned long addr, int 
subbuf_size, int nr_subbufs)
 }
 
 /*
- * Return a specific sub-buffer for a given @cpu defined by @idx.
+ * Return the ring_buffer_meta for a given @cpu.
  */
-static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu)
 {
-   unsigned long ptr;
int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   unsigned long ptr = buffer->range_addr_start;
+   struct ring_buffer_meta *meta;
int nr_subbufs;
 
-   /* Include the reader page */
-   nr_subbufs = nr_pages + 1;
+   if (!ptr)
+   return NULL;
+
+   /* When nr_pages passed in is zero, the first meta has already been 
initialized */
+   if (!nr_pages) {
+   meta = (struct ring_buffer_meta *)ptr;
+   nr_subbufs = meta->nr_subbufs;
+   } else {
+   meta = NULL;
+   /* Include the reader page */
+   nr_subbufs = nr_pages + 1;
+   }
 
/*
 * The first chunk may not be subbuffer aligned, where as
 * the rest of the chunks are.
 */
-   ptr = buffer->range_addr_start;
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
if (cpu) {
-   unsigned long p;
-
-   ptr += subbuf_size * nr_subbufs;
-
-   /* Save the beginning of this CPU chunk */
-   p = ptr;
-
ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
+   ptr += subbuf_size * nr_subbufs;
 
/* We can use multiplication to find chunks greater than 1 */
if (cpu > 1) {
unsigned long size;
+   unsigned long p;
 
+   /* Save the beginning of this CPU chunk */
+   p = ptr;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
ptr += subbuf_size * nr_subbufs;
 
/* Now all chunks after this are the same size */
size = ptr - p;
ptr += size * (cpu - 2);
-
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
}
}
-   if (ptr + subbuf_size * nr_subbufs > buffer->range_addr_end)
+   return (void *)ptr;
+}
+
+/* Return the start of subbufs given the meta pointer */
+static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta)
+{
+   int subbuf_size = meta->subbuf_size;
+   unsigned long ptr;
+
+   ptr = (unsigned long)meta;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs);
+
+   return (void *)ptr;
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx)
+{
+   struct ring_buffer_meta *meta;
+   

[PATCH v5 01/13] ring-buffer: Allow mapped field to be set without mapping

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation for having the ring buffer mapped to a dedicated location,
which will have the same restrictions as user space memory mapped buffers,
allow it to use the "mapped" field of the ring_buffer_per_cpu structure
without having the user space meta page mapping.

When this starts using the mapped field, it will need to handle adding a
user space mapping (and removing it) from a ring buffer that is using a
dedicated memory range.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 28853966aa9a..aa8eb878e0d4 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -491,6 +491,7 @@ struct ring_buffer_per_cpu {
unsigned long   pages_removed;
 
unsigned intmapped;
+   unsigned intuser_mapped;/* first user space 
mapping */
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
@@ -5224,6 +5225,9 @@ static void rb_update_meta_page(struct 
ring_buffer_per_cpu *cpu_buffer)
 {
struct trace_buffer_meta *meta = cpu_buffer->meta_page;
 
+   if (!meta)
+   return;
+
meta->reader.read = cpu_buffer->reader_page->read;
meta->reader.id = cpu_buffer->reader_page->id;
meta->reader.lost_events = cpu_buffer->lost_events;
@@ -6167,7 +6171,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
 
mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (!cpu_buffer->mapped) {
+   if (!cpu_buffer->mapped || !cpu_buffer->meta_page) {
mutex_unlock(&cpu_buffer->mapping_lock);
return ERR_PTR(-ENODEV);
}
@@ -6194,7 +6198,7 @@ static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu 
*cpu_buffer,
if (inc && cpu_buffer->mapped == UINT_MAX)
return -EBUSY;
 
-   if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+   if (WARN_ON(!inc && cpu_buffer->mapped < cpu_buffer->user_mapped))
return -EINVAL;
 
mutex_lock(&cpu_buffer->buffer->mutex);
@@ -6328,7 +6332,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 
mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (cpu_buffer->mapped) {
+   if (cpu_buffer->user_mapped) {
err = __rb_map_vma(cpu_buffer, vma);
if (!err)
err = __rb_inc_dec_mapped(cpu_buffer, true);
@@ -6359,12 +6363,15 @@ int ring_buffer_map(struct trace_buffer *buffer, int 
cpu,
 */
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
+
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
err = __rb_map_vma(cpu_buffer, vma);
if (!err) {
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   cpu_buffer->mapped = 1;
+   /* This is the first time it is mapped externally */
+   cpu_buffer->mapped++;
+   cpu_buffer->user_mapped = cpu_buffer->mapped;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
} else {
kfree(cpu_buffer->subbuf_ids);
@@ -6392,10 +6399,10 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int 
cpu)
 
	mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (!cpu_buffer->mapped) {
+   if (!cpu_buffer->user_mapped) {
err = -ENODEV;
goto out;
-   } else if (cpu_buffer->mapped > 1) {
+   } else if (cpu_buffer->mapped > cpu_buffer->user_mapped) {
__rb_inc_dec_mapped(cpu_buffer, false);
goto out;
}
@@ -6403,7 +6410,10 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int 
cpu)
	mutex_lock(&buffer->mutex);
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
 
-   cpu_buffer->mapped = 0;
+   /* This is the last user space mapping */
+   if (!WARN_ON_ONCE(cpu_buffer->mapped != cpu_buffer->user_mapped))
+   cpu_buffer->mapped--;
+   cpu_buffer->user_mapped = 0;
 
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
-- 
2.43.0





[PATCH v5 00/13] tracing: Persistent traces across a reboot or crash

2024-06-11 Thread Steven Rostedt
acpi_reset <-acpi_reboot
   swapper/0-1   [000] d..1.63.479791: acpi_os_write_port 
<-acpi_reboot

Enjoy...

Changes since v4: 
https://lore.kernel.org/all/20240611192828.691638...@goodmis.org/

[ Diff between v3 at bottom ]

- Fixed "mapped" variable for per cpu ring buffers for when the first user space
  mapping happens. Added "user_mapped" that gets set the first time the ring
  buffer is mapped to user space, and this is used to compare the mapped 
counter.

- Added back the delta = rb_fix_abs_ts(delta, ts) in the validate time stamp 
code.
  This was accidentally removed when copying the validate time stamp debug code
  to the validate persistent buffer code.

Changes since v3: 
https://lore.kernel.org/all/20240606211735.684785...@goodmis.org/

- Removed an unused variable

- Fixed enable_instances(): the path without a memory mapping was still passing
  the command line parameter "tok" when creating the snapshot buffer, where it
  now needs to pass "name"; otherwise it would pass in NULL, which could crash
  the kernel on boot.

Changes since v2: 
https://lore.kernel.org/all/20240411012541.285904...@goodmis.org/

- Rebased on top of 6.10-rc2 that has the memory mapped ring buffer code.

- Added hard coded address to map to (from memmap=nn$ss), instead of relying
  on using reserve_mem (which I still want to add).

- Updated comments

- Restructured the validate code as the previous version broke the ring
  buffer timestamp validation code.


Steven Rostedt (Google) (13):
  ring-buffer: Allow mapped field to be set without mapping
  ring-buffer: Add ring_buffer_alloc_range()
  ring-buffer: Add ring_buffer_meta data
  tracing: Implement creating an instance based on a given memory region
  ring-buffer: Add output of ring buffer meta page
  ring-buffer: Add test if range of boot buffer is valid
  ring-buffer: Validate boot range memory events
  tracing: Add option to use memmapped memory for trace boot instance
  ring-buffer: Save text and data locations in mapped meta data
  tracing/ring-buffer: Add last_boot_info file to boot instance
  tracing: Handle old buffer mappings for event strings and functions
  tracing: Update function tracing output for previous boot buffer
  tracing: Add last boot delta offset for stack traces


 Documentation/admin-guide/kernel-parameters.txt |   9 +
 include/linux/ring_buffer.h |  20 +
 kernel/trace/ring_buffer.c  | 867 +---
 kernel/trace/trace.c| 242 ++-
 kernel/trace/trace.h|   8 +
 kernel/trace/trace_output.c |  12 +-
 6 files changed, 1046 insertions(+), 112 deletions(-)


diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8c1d7ea01e6f..ccb2101a2e38 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -507,6 +507,7 @@ struct ring_buffer_per_cpu {
unsigned long   pages_removed;
 
	unsigned int			mapped;
+	unsigned int			user_mapped;	/* first user space mapping */
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
@@ -1710,6 +1711,7 @@ static int rb_read_data_buffer(struct buffer_data_page 
*dpage, int tail, int cpu
 
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
+   delta = rb_fix_abs_ts(delta, ts);
if (delta < ts) {
*delta_ptr = delta;
*timestamp = ts;
@@ -6879,7 +6881,7 @@ static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu 
*cpu_buffer,
if (inc && cpu_buffer->mapped == UINT_MAX)
return -EBUSY;
 
-   if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+   if (WARN_ON(!inc && cpu_buffer->mapped < cpu_buffer->user_mapped))
return -EINVAL;
 
	mutex_lock(&cpu_buffer->buffer->mutex);
@@ -7013,7 +7015,7 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
 
	mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (cpu_buffer->mapped) {
+   if (cpu_buffer->user_mapped) {
err = __rb_map_vma(cpu_buffer, vma);
if (!err)
err = __rb_inc_dec_mapped(cpu_buffer, true);
@@ -7050,7 +7052,9 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
err = __rb_map_vma(cpu_buffer, vma);
if (!err) {
		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+   /* This is the first time it is mapped externally */
cpu_buffer->mapped++;
+   cpu_buffer->user_mapped = cpu_buffer->m

[PATCH v5 02/13] ring-buffer: Add ring_buffer_alloc_range()

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation for allowing the trace ring buffer to be allocated in a
range of memory that is persistent across reboots, add
ring_buffer_alloc_range(). It takes a contiguous range of memory and will
split it up evenly for the per CPU ring buffers.

If there's not enough memory to handle all CPUs with the minimum size, it
will fail to allocate the ring buffer.
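
For context, a later patch in this series ("tracing: Implement creating an
instance based on a given memory region") is expected to wire the allocator
up roughly like this condensed sketch:

	if (tr->range_addr_start && tr->range_addr_size) {
		/* carve the reserved, already-mapped region into per-CPU buffers */
		buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
						      tr->range_addr_start,
						      tr->range_addr_size);
	} else {
		buf->buffer = ring_buffer_alloc(size, rb_flags);
	}
	if (!buf->buffer)
		return -ENOMEM;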

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  17 +++
 kernel/trace/ring_buffer.c  | 239 ++--
 2 files changed, 220 insertions(+), 36 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 96d2140b471e..a50b0223b1d3 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -89,6 +89,11 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
 struct trace_buffer *
 __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key 
*key);
 
+struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned 
flags,
+  int order, unsigned long start,
+  unsigned long range_size,
+  struct lock_class_key *key);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
@@ -100,6 +105,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+/*
+ * Because the ring buffer is generic, if other users of the ring buffer get
+ * traced by ftrace, it can produce lockdep warnings. We need to keep each
+ * ring buffer's lock class separate.
+ */
+#define ring_buffer_alloc_range(size, flags, order, start, range_size) \
+({ \
+   static struct lock_class_key __key; \
+   __ring_buffer_alloc_range((size), (flags), (order), (start),\
+ (range_size), &__key);\
+})
+
 typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
 ring_buffer_cond_fn cond, void *data);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa8eb878e0d4..fd52cad34b0f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -42,6 +42,9 @@
 
 static void update_pages_handler(struct work_struct *work);
 
+struct ring_buffer_meta {
+};
+
 /*
  * The ring buffer header is special. We must manually up keep it.
  */
@@ -342,7 +345,8 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
-   u32  id;/* ID for external mapping */
+   u32  id:30; /* ID for external mapping */
+   u32  range:1;   /* Mapped via a range */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -373,7 +377,9 @@ static __always_inline unsigned int rb_page_commit(struct 
buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-   free_pages((unsigned long)bpage->page, bpage->order);
+   /* Range pages are not to be freed */
+   if (!bpage->range)
+   free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
 }
 
@@ -524,6 +530,9 @@ struct trace_buffer {
struct rb_irq_work  irq_work;
booltime_stamp_abs;
 
+   unsigned long   range_addr_start;
+   unsigned long   range_addr_end;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1491,9 +1500,70 @@ static void rb_check_pages(struct ring_buffer_per_cpu 
*cpu_buffer)
}
 }
 
+/*
+ * Take an address, add the meta data size as well as the array of
+ * array subbuffer indexes, then align it to a subbuffer size.
+ *
+ * This is used to help find the next per cpu subbuffer within a mapped range.
+ */
+static unsigned long
+rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs)
+{
+   addr += sizeof(struct ring_buffer_meta) +
+   sizeof(int) * nr_subbufs;
+   return ALIGN(addr, subbuf_size);
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+{
+   unsigned long ptr;
+   int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   int nr_subbufs;
+
+   /* Include t

Re: [PATCH v4 01/13] ring-buffer: Allow mapped field to be set without mapping

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 21:39:37 -0400
Steven Rostedt  wrote:

> > 
> > Maybe explain why sometimes __rb_inc_dec_mapped() is called to
> > increment or decrement ->mapped, and sometimes it id done directly ?
> > I can see that the function also acquires the buffer mutex, which
> > isn't needed at the places where mapped is incremented/decremented
> > directly, but common code would still be nice, and it is odd to see
> > over/underflows handled sometimes but not always.  
> 
> Sure. I'll add comments explaining more.

And I found a bug with this code. It assumes that mapped will be equal
to 1 if it's the last mapping. That will no longer be the case.

-- Steve



Re: [PATCH v4 01/13] ring-buffer: Allow mapped field to be set without mapping

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 16:53:43 -0700
Guenter Roeck  wrote:

> >>> @@ -6403,7 +6407,8 @@ int ring_buffer_unmap(struct trace_buffer *buffer, 
> >>> int cpu)
> >>>   mutex_lock(&buffer->mutex);
> >>>   raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> >>>
> >>> - cpu_buffer->mapped = 0;
> >>> + WARN_ON_ONCE(!cpu_buffer->mapped);
> >>> + cpu_buffer->mapped--;  
> >>
> >> This will wrap to UINT_MAX if it was 0. Is that intentional ?  
> > 
> > If mapped is non zero, it limits what it can do. If it enters here as zero,
> > we are really in an unknown state, so yeah, wrapping will just keep it
> > limited. Which is a good thing.
> > 
> > Do you want me to add a comment there?
> >   
> 
> Maybe. I just wondered if something like
>   if (!WARN_ON_ONCE(!cpu_buffer->mapped))
>   cpu_buffer->mapped--;
> 
> would be better than wrapping because 'mapped' is used as flag elsewhere,
> but then I can see that it is also manipulated in __rb_inc_dec_mapped(),
> and that it is checked against UINT_MAX there (and not decremented if it is 
> 0).

Yeah, the __rb_inc_dec_mapped() is used as it is called when external
sources map the ring buffer. 

This is incremented and decremented internally. That is, we increment
it the first time the ring buffer is mapped, and decrement it again the
last time it is mapped.

I could add the above logic as well. I hit a bug in my more vigorous
testing so I need to make another revision anyway.

> 
> Maybe explain why sometimes __rb_inc_dec_mapped() is called to
> increment or decrement ->mapped, and sometimes it id done directly ?
> I can see that the function also acquires the buffer mutex, which
> isn't needed at the places where mapped is incremented/decremented
> directly, but common code would still be nice, and it is odd to see
> over/underflows handled sometimes but not always.

Sure. I'll add comments explaining more.

Thanks,

-- Steve



Re: [PATCH 3/3] tracing/kprobe: Remove cleanup code unrelated to selftest

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 06:26:53 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> This cleanup of all kprobe events is not related to the selftest
> itself, and it can fail for reasons unrelated to this test.
> If the test is successful, the generated events are cleaned up.
> And if not, we cannot guarantee that the kprobe events will work
> correctly. So, anyway, there is no need to clean it up.
> 
> Signed-off-by: Masami Hiramatsu (Google) 

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v4 01/13] ring-buffer: Allow mapped field to be set without mapping

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 15:43:59 -0700
Guenter Roeck  wrote:

> On 6/11/24 12:28, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > In preparation for having the ring buffer mapped to a dedicated location,
> > which will have the same restrictions as user space memory mapped buffers,
> > allow it to use the "mapped" field of the ring_buffer_per_cpu structure
> > without having the user space meta page mapping.
> > 
> > When this starts using the mapped field, it will need to handle adding a
> > user space mapping (and removing it) from a ring buffer that is using a
> > dedicated memory range.
> > 
> > Signed-off-by: Steven Rostedt (Google) 
> > ---
> >   kernel/trace/ring_buffer.c | 11 ---
> >   1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index 28853966aa9a..78beaccf9c8c 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -5224,6 +5224,9 @@ static void rb_update_meta_page(struct 
> > ring_buffer_per_cpu *cpu_buffer)
> >   {
> > struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> >   
> > +   if (!meta)
> > +   return;
> > +
> > meta->reader.read = cpu_buffer->reader_page->read;
> > meta->reader.id = cpu_buffer->reader_page->id;
> > meta->reader.lost_events = cpu_buffer->lost_events;
> > @@ -6167,7 +6170,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int 
> > cpu)
> >   
> > 	mutex_lock(&cpu_buffer->mapping_lock);
> >   
> > -   if (!cpu_buffer->mapped) {
> > +   if (!cpu_buffer->mapped || !cpu_buffer->meta_page) {
> > 		mutex_unlock(&cpu_buffer->mapping_lock);
> > return ERR_PTR(-ENODEV);
> > }
> > @@ -6359,12 +6362,13 @@ int ring_buffer_map(struct trace_buffer *buffer, 
> > int cpu,
> >  */
> > 	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
> > +  
> 
> Picky again. Is that a leftover from something ? I don't see an immediate 
> reason
> for the added newline.

Hmm, I could remove it.

> 
> > 	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> >   
> > err = __rb_map_vma(cpu_buffer, vma);
> > if (!err) {
> > 		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > -   cpu_buffer->mapped = 1;
> > +   cpu_buffer->mapped++;
> > 		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > } else {
> > kfree(cpu_buffer->subbuf_ids);
> > @@ -6403,7 +6407,8 @@ int ring_buffer_unmap(struct trace_buffer *buffer, 
> > int cpu)
> > 	mutex_lock(&buffer->mutex);
> > 	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> >   
> > -   cpu_buffer->mapped = 0;
> > +   WARN_ON_ONCE(!cpu_buffer->mapped);
> > +   cpu_buffer->mapped--;  
> 
> This will wrap to UINT_MAX if it was 0. Is that intentional ?

If mapped is non zero, it limits what it can do. If it enters here as zero,
we are really in an unknown state, so yeah, wrapping will just keep it
limited. Which is a good thing.

Do you want me to add a comment there?

-- Steve


> 
> >   
> > 	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > 




[PATCH v4 13/13] tracing: Add last boot delta offset for stack traces

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The addresses of a stack trace event are relative to the kallsyms. As that
can change between boots, when printing the stack trace from a buffer that
was from the last boot, it needs all the addresses to be added to the
"text_delta" that gives the delta between the addresses of the functions
for the current boot compared to the address of the last boot. Then it can
be passed to kallsyms to find the function name, otherwise it just shows a
useless list of addresses.
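
A minimal restatement of the fix-up with made-up numbers (the delta value is
the one shown in the last_boot_info example elsewhere in this series):

	/* address recorded by the previous boot's stack trace event */
	unsigned long saved_ip   = 0xffffffff81234560UL;
	/* delta reported for this boot (see last_boot_info) */
	long          text_delta = -268435456L;

	/* only the adjusted address is meaningful to kallsyms in this boot */
	seq_print_ip_sym(s, saved_ip + text_delta, flags);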

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index b9d2c64c0648..48de93598897 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1233,6 +1233,7 @@ static enum print_line_t trace_stack_print(struct 
trace_iterator *iter,
	struct trace_seq *s = &iter->seq;
unsigned long *p;
unsigned long *end;
+   long delta = iter->tr->text_delta;
 
trace_assign_type(field, iter->ent);
end = (unsigned long *)((long)iter->ent + iter->ent_size);
@@ -1245,7 +1246,7 @@ static enum print_line_t trace_stack_print(struct 
trace_iterator *iter,
break;
 
trace_seq_puts(s, " => ");
-   seq_print_ip_sym(s, *p, flags);
+   seq_print_ip_sym(s, (*p) + delta, flags);
trace_seq_putc(s, '\n');
}
 
-- 
2.43.0





[PATCH v4 11/13] tracing: Handle old buffer mappings for event strings and functions

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Use the saved text_delta and data_delta of a persistent memory mapped ring
buffer that was saved from a previous boot, and use the delta in the trace
event print output so that strings and functions show up normally.

That is, for an event like trace_kmalloc() that prints the callsite via
"%pS", if it used the address saved in the ring buffer it will not match
the function that was saved in the previous boot if the kernel remaps
itself between boots.

For RCU events that point to saved static strings where only the address
of the string is saved in the ring buffer, it too will be adjusted to
point to where the string is on the current boot.
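
A small user-space model of why a recorded string pointer needs the same
shift (sketch only; the delta here is invented):

#include <stdint.h>
#include <stdio.h>

static const char boot_msg[] = "rcu_preempt";	/* stands in for a kernel static string */

int main(void)
{
	long data_delta = 0x1000;			/* pretend the kernel image moved */
	uintptr_t saved = (uintptr_t)boot_msg - 0x1000;	/* address as recorded last boot */

	/* Dereferencing "saved" as-is would point at the wrong memory;
	 * only saved + data_delta is a valid string in this boot. */
	printf("%s\n", (const char *)(saved + data_delta));
	return 0;
}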

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 42 +++---
 1 file changed, 39 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dc4eee33d920..71cca10581d6 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3671,8 +3671,11 @@ static void test_can_verify(void)
 void trace_check_vprintf(struct trace_iterator *iter, const char *fmt,
 va_list ap)
 {
+   long text_delta = iter->tr->text_delta;
+   long data_delta = iter->tr->data_delta;
const char *p = fmt;
const char *str;
+   bool good;
int i, j;
 
if (WARN_ON_ONCE(!fmt))
@@ -3691,7 +3694,10 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 
j = 0;
 
-   /* We only care about %s and variants */
+   /*
+* We only care about %s and variants
+* as well as %p[sS] if delta is non-zero
+*/
for (i = 0; p[i]; i++) {
if (i + 1 >= iter->fmt_size) {
/*
@@ -3720,6 +3726,11 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
}
if (p[i+j] == 's')
break;
+
+   if (text_delta && p[i+1] == 'p' &&
+   ((p[i+2] == 's' || p[i+2] == 'S')))
+   break;
+
star = false;
}
j = 0;
@@ -3733,6 +3744,24 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
iter->fmt[i] = '\0';
		trace_seq_vprintf(&iter->seq, iter->fmt, ap);
 
+   /* Add delta to %pS pointers */
+   if (p[i+1] == 'p') {
+   unsigned long addr;
+   char fmt[4];
+
+   fmt[0] = '%';
+   fmt[1] = 'p';
+   fmt[2] = p[i+2]; /* Either %ps or %pS */
+   fmt[3] = '\0';
+
+   addr = va_arg(ap, unsigned long);
+   addr += text_delta;
			trace_seq_printf(&iter->seq, fmt, (void *)addr);
+
+   p += i + 3;
+   continue;
+   }
+
/*
 * If iter->seq is full, the above call no longer guarantees
 * that ap is in sync with fmt processing, and further calls
@@ -3751,6 +3780,14 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
/* The ap now points to the string data of the %s */
str = va_arg(ap, const char *);
 
+   good = trace_safe_str(iter, str, star, len);
+
+   /* Could be from the last boot */
+   if (data_delta && !good) {
+   str += data_delta;
+   good = trace_safe_str(iter, str, star, len);
+   }
+
/*
 * If you hit this warning, it is likely that the
 * trace event in question used %s on a string that
@@ -3760,8 +3797,7 @@ void trace_check_vprintf(struct trace_iterator *iter, 
const char *fmt,
 * instead. See samples/trace_events/trace-events-sample.h
 * for reference.
 */
-   if (WARN_ONCE(!trace_safe_str(iter, str, star, len),
- "fmt: '%s' current_buffer: '%s'",
+   if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'",
		  fmt, seq_buf_str(&iter->seq.seq))) {
int ret;
 
-- 
2.43.0





[PATCH v4 12/13] tracing: Update function tracing output for previous boot buffer

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of the saved function of where it is located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH v4 10/13] tracing/ring-buffer: Add last_boot_info file to boot instance

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If an instance is mapped to memory on boot up, create a new file called
"last_boot_info" that will hold information that can be used to properly
parse the raw data in the ring buffer.

It will export the delta of the addresses for text and data from what it
was from the last boot. It does not expose actual addresses (unless you
knew what the actual address was from the last boot).

The output will look like:

 # cat last_boot_info
 text delta:-268435456
 data delta:-268435456

The text and data are kept separate in case they are ever made different.
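
A sketch of what a user-space consumer might do with this file (hypothetical
tool; the instance path assumes the "boot_mapped" instance used as the example
in this series):

#include <stdio.h>

int main(void)
{
	long text_delta = 0, data_delta = 0;
	FILE *f = fopen("/sys/kernel/tracing/instances/boot_mapped/last_boot_info", "r");

	if (!f)
		return 1;
	/* matches the seq_buf_printf() format in the patch below */
	if (fscanf(f, "text delta:\t%ld\ndata delta:\t%ld", &text_delta, &data_delta) != 2)
		return 1;
	fclose(f);

	/* raw_ip would come from the mapped buffer of the previous boot */
	unsigned long raw_ip = 0xffffffff81000000UL;
	printf("current address: %#lx\n", raw_ip + text_delta);
	return 0;
}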

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  3 +++
 kernel/trace/ring_buffer.c  | 23 ++
 kernel/trace/trace.c| 47 -
 kernel/trace/trace.h|  2 ++
 4 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index a50b0223b1d3..55de3798a9b9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -94,6 +94,9 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long 
size, unsigned flag
   unsigned long range_size,
   struct lock_class_key *key);
 
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ab6b8a0ee8e1..8c1d7ea01e6f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2394,6 +2394,29 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned 
long size, unsigned flag
return alloc_buffer(size, flags, order, start, start + range_size, key);
 }
 
+/**
+ * ring_buffer_last_boot_delta - return the delta offset from last boot
+ * @buffer: The buffer to return the delta from
+ * @text: Return text delta
+ * @data: Return data delta
+ *
+ * Returns: true if the delta is non-zero
+ */
+bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text,
+long *data)
+{
+   if (!buffer)
+   return false;
+
+   if (!buffer->last_text_delta)
+   return false;
+
+   *text = buffer->last_text_delta;
+   *data = buffer->last_data_delta;
+
+   return true;
+}
+
 /**
  * ring_buffer_free - free a ring buffer.
  * @buffer: the buffer to free.
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dfde26aa3211..dc4eee33d920 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6041,6 +6041,18 @@ ssize_t tracing_resize_ring_buffer(struct trace_array 
*tr,
return ret;
 }
 
+static void update_last_data(struct trace_array *tr)
+{
+   if (!tr->text_delta && !tr->data_delta)
+   return;
+
+   /* Clear old data */
+	tracing_reset_online_cpus(&tr->array_buffer);
+
+   /* Using current data now */
+   tr->text_delta = 0;
+   tr->data_delta = 0;
+}
 
 /**
  * tracing_update_buffers - used by tracing facility to expand ring buffers
@@ -6058,6 +6070,9 @@ int tracing_update_buffers(struct trace_array *tr)
int ret = 0;
 
	mutex_lock(&trace_types_lock);
+
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded)
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6113,6 +6128,8 @@ int tracing_set_tracer(struct trace_array *tr, const char 
*buf)
 
	mutex_lock(&trace_types_lock);
 
+   update_last_data(tr);
+
if (!tr->ring_buffer_expanded) {
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
@@ -6860,6 +6877,21 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static ssize_t
+tracing_last_boot_read(struct file *filp, char __user *ubuf, size_t cnt, 
loff_t *ppos)
+{
+   struct trace_array *tr = filp->private_data;
+   struct seq_buf seq;
+   char buf[64];
+
+	seq_buf_init(&seq, buf, 64);
+
+	seq_buf_printf(&seq, "text delta:\t%ld\n", tr->text_delta);
+	seq_buf_printf(&seq, "data delta:\t%ld\n", tr->data_delta);
+
+   return simple_read_from_buffer(ubuf, cnt, ppos, buf, 
seq_buf_used(&seq));
+}
+
 static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
 {
struct trace_array *tr = inode->i_private;
@@ -7499,6 +7531,13 @@ static const struct file_operations 
trace_time_stamp_mode_fops = {
.release= tracing_single_release_tr,
 };
 
+static const s

[PATCH v4 08/13] tracing: Add option to use memmapped memory for trace boot instance

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add an option to the trace_instance kernel command line parameter that
allows it to use the reserved memory from memmap boot parameter.

  memmap=12M$0x28450 trace_instance=boot_mapped@0x28450:12M

The above will reserve 12 megs at the physical address 0x28450.
The second parameter will create a "boot_mapped" instance and use the
memory reserved as the memory for the ring buffer.

That will create an instance called "boot_mapped":

  /sys/kernel/tracing/instances/boot_mapped

Note, because the ring buffer is using a defined memory range, it will
act just like a memory mapped ring buffer. It will not have a snapshot
buffer, as it can't swap out the buffer. The snapshot files as well as any
tracers that uses a snapshot will not be present in the boot_mapped
instance.

Cc: linux...@kvack.org
Signed-off-by: Steven Rostedt (Google) 
---
 .../admin-guide/kernel-parameters.txt |  9 +++
 kernel/trace/trace.c  | 75 +--
 2 files changed, 78 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..ff26b6094e79 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6754,6 +6754,15 @@
the same thing would happen if it was left off). The 
irq_handler_entry
event, and all events under the "initcall" system.
 
+   If memory has been reserved (see memmap for x86), the 
instance
+   can use that memory:
+
+   memmap=12M$0x28450 
trace_instance=boot_map@0x28450:12M
+
+   The above will create a "boot_map" instance that uses 
the physical
+   memory at 0x28450 that is 12Megs. The per CPU 
buffers of that
+   instance will be split up accordingly.
+
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 622fe670949d..dfde26aa3211 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9504,6 +9504,31 @@ static int instance_mkdir(const char *name)
return ret;
 }
 
+static u64 map_pages(u64 start, u64 size)
+{
+   struct page **pages;
+   phys_addr_t page_start;
+   unsigned int page_count;
+   unsigned int i;
+   void *vaddr;
+
+   page_count = DIV_ROUND_UP(size, PAGE_SIZE);
+
+   page_start = start;
+   pages = kmalloc_array(page_count, sizeof(struct page *), GFP_KERNEL);
+   if (!pages)
+   return 0;
+
+   for (i = 0; i < page_count; i++) {
+   phys_addr_t addr = page_start + i * PAGE_SIZE;
+   pages[i] = pfn_to_page(addr >> PAGE_SHIFT);
+   }
+   vaddr = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);
+   kfree(pages);
+
+   return (u64)(unsigned long)vaddr;
+}
+
 /**
  * trace_array_get_by_name - Create/Lookup a trace array, given its name.
  * @name: The name of the trace array to be looked up/created.
@@ -10350,6 +10375,7 @@ __init static void enable_instances(void)
 {
struct trace_array *tr;
char *curr_str;
+   char *name;
char *str;
char *tok;
 
@@ -10358,19 +10384,56 @@ __init static void enable_instances(void)
str = boot_instance_info;
 
	while ((curr_str = strsep(&str, "\t"))) {
+   unsigned long start = 0;
+   unsigned long size = 0;
+   unsigned long addr = 0;
 
		tok = strsep(&curr_str, ",");
+		name = strsep(&tok, "@");
+		if (tok) {
+			start = memparse(tok, &tok);
+   if (!start) {
+   pr_warn("Tracing: Invalid boot instance address 
for %s\n",
+   name);
+   continue;
+   }
+   }
 
-   if (IS_ENABLED(CONFIG_TRACER_MAX_TRACE))
-   do_allocate_snapshot(tok);
+   if (start) {
+   if (*tok != ':') {
+   pr_warn("Tracing: No size specified for 
instance %s\n", name);
+   continue;
+   }
+   tok++;
+			size = memparse(tok, &tok);
+   if (!size) {
+   pr_warn("Tracing: Invalid boot instance size 
for %s\n",
+   name);
+   continue;
+   }
+   addr = map_pages(star

[PATCH v4 09/13] ring-buffer: Save text and data locations in mapped meta data

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When a ring buffer is mapped to a specific address, save the address of a
text function and some data. This will be used to determine the delta
between the last boot and the current boot for pointers to functions as
well as to data.
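
A user-space model of the delta bookkeeping this enables (sketch; the
addresses are invented, but the arithmetic matches what rb_range_meta_init()
does in the patch below):

#include <stdio.h>

struct meta { unsigned long text_addr, data_addr; };

int main(void)
{
	struct meta saved = { 0xffffffff81200000UL, 0xffffffff82a00000UL }; /* last boot */
	unsigned long this_text = 0xffffffff71200000UL;	/* same function, this boot */
	unsigned long this_data = 0xffffffff72a00000UL;	/* same data object, this boot */

	long text_delta = this_text - saved.text_addr;
	long data_delta = this_data - saved.data_addr;

	printf("text delta:\t%ld\ndata delta:\t%ld\n", text_delta, data_delta);
	return 0;
}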

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 86f03c0ba4c0..ab6b8a0ee8e1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,8 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   text_addr;
+   unsigned long   data_addr;
unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
@@ -541,6 +543,9 @@ struct trace_buffer {
unsigned long   range_addr_start;
unsigned long   range_addr_end;
 
+   longlast_text_delta;
+   longlast_data_delta;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1819,10 +1824,15 @@ static void rb_meta_validate_events(struct 
ring_buffer_per_cpu *cpu_buffer)
}
 }
 
+/* Used to calculate data delta */
+static char rb_data_ptr[] = "";
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
unsigned long delta;
+   unsigned long this_text = (unsigned long)rb_range_meta_init;
+   unsigned long this_data = (unsigned long)rb_data_ptr;
void *subbuf;
int cpu;
int i;
@@ -1839,6 +1849,10 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
meta->first_buffer += delta;
meta->head_buffer += delta;
meta->commit_buffer += delta;
+   buffer->last_text_delta = this_text - meta->text_addr;
+   buffer->last_data_delta = this_data - meta->data_addr;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
continue;
}
 
@@ -1855,6 +1869,8 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
subbuf = rb_subbufs_from_meta(meta);
 
meta->first_buffer = (unsigned long)subbuf;
+   meta->text_addr = this_text;
+   meta->data_addr = this_data;
 
/*
 * The buffers[] array holds the order of the sub-buffers
-- 
2.43.0





[PATCH v4 07/13] ring-buffer: Validate boot range memory events

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Make sure all the events in each of the sub-buffers that were mapped in a
memory region are valid. This moves the code that walks the buffers for
time-stamp validation out of the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS
ifdef block and is used to validate the content. Only the ring buffer
event meta data and time stamps are checked and not the data load.

This also has a second purpose. The buffer_page structure that points to
the data sub-buffers has accounting that keeps track of the number of
events that are on the sub-buffer. This updates that counter as well. That
counter is used in reading the buffer and knowing if the ring buffer is
empty or not.
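
Condensed, the per-sub-buffer walk amounts to the following user-space sketch
(it mirrors rb_read_data_buffer() in the patch below, with the real event
encoding stripped out):

enum { EV_DATA, EV_TIME_EXTEND, EV_TIME_STAMP, EV_PAD };

struct ev { int type; unsigned long long val; };

static int validate(const struct ev *e, int n, unsigned long long ts)
{
	int events = 0;

	for (int i = 0; i < n; i++) {
		switch (e[i].type) {
		case EV_TIME_EXTEND:
			ts += e[i].val;
			break;
		case EV_TIME_STAMP:
			if (e[i].val < ts)
				return -1;	/* absolute stamp went backwards */
			ts = e[i].val;
			break;
		case EV_PAD:
			break;
		case EV_DATA:
			events++;
			ts += e[i].val;
			break;
		default:
			return -1;
		}
	}
	return events;		/* becomes the sub-buffer's entry count */
}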

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 189 +
 1 file changed, 151 insertions(+), 38 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aecd4a7d62be..86f03c0ba4c0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1674,10 +1674,151 @@ static bool rb_meta_valid(struct ring_buffer_meta 
*meta, int cpu,
subbuf = (void *)subbuf + subbuf_size;
}
 
-   pr_info("Ring buffer meta is from previous boot!\n");
return true;
 }
 
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int 
cpu,
+  unsigned long long *timestamp, u64 *delta_ptr)
+{
+   struct ring_buffer_event *event;
+   u64 ts, delta;
+   int events = 0;
+   int e;
+
+   *delta_ptr = 0;
+   *timestamp = 0;
+
+   ts = dpage->time_stamp;
+
+   for (e = 0; e < tail; e += rb_event_length(event)) {
+
+   event = (struct ring_buffer_event *)(dpage->data + e);
+
+   switch (event->type_len) {
+
+   case RINGBUF_TYPE_TIME_EXTEND:
+   delta = rb_event_time_stamp(event);
+   ts += delta;
+   break;
+
+   case RINGBUF_TYPE_TIME_STAMP:
+   delta = rb_event_time_stamp(event);
+   if (delta < ts) {
+   *delta_ptr = delta;
+   *timestamp = ts;
+   return -1;
+   }
+   ts = delta;
+   break;
+
+   case RINGBUF_TYPE_PADDING:
+   if (event->time_delta == 1)
+   break;
+   fallthrough;
+   case RINGBUF_TYPE_DATA:
+   events++;
+   ts += event->time_delta;
+   break;
+
+   default:
+   return -1;
+   }
+   }
+   *timestamp = ts;
+   return events;
+}
+
+static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu)
+{
+   unsigned long long ts;
+   u64 delta;
+   int tail;
+
+	tail = local_read(&dpage->commit);
+	return rb_read_data_buffer(dpage, tail, cpu, &ts, &delta);
+}
+
+/* If the meta data has been validated, now validate the events */
+static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   struct buffer_page *head_page;
+   unsigned long entry_bytes = 0;
+   unsigned long entries = 0;
+   int ret;
+   int i;
+
+   if (!meta || !meta->head_buffer)
+   return;
+
+   /* Do the reader page first */
+   ret = rb_validate_buffer(cpu_buffer->reader_page->page, 
cpu_buffer->cpu);
+   if (ret < 0) {
+   pr_info("Ring buffer reader page is invalid\n");
+   goto invalid;
+   }
+   entries += ret;
+	entry_bytes += local_read(&cpu_buffer->reader_page->page->commit);
+	local_set(&cpu_buffer->reader_page->entries, ret);
+
+   head_page = cpu_buffer->head_page;
+
+   /* If both the head and commit are on the reader_page then we are done. 
*/
+   if (head_page == cpu_buffer->reader_page &&
+   head_page == cpu_buffer->commit_page)
+   goto done;
+
+   /* Iterate until finding the commit page */
+	for (i = 0; i < meta->nr_subbufs + 1; i++, rb_inc_page(&head_page)) {
+
+   /* Reader page has already been done */
+   if (head_page == cpu_buffer->reader_page)
+   continue;
+
+   ret = rb_validate_buffer(head_page->page, cpu_buffer->cpu);
+   if (ret < 0) {
+   pr_info("Ring buffer meta [%d] invalid buffer page\n",
+   cpu_buffer->cpu);
+   goto invalid;
+   }
+   ent

[PATCH v4 06/13] ring-buffer: Add test if range of boot buffer is valid

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a test against the ring buffer memory range to see if it has valid
data. The ring_buffer_meta structure is given a new field called
"first_buffer" which holds the address of the first sub-buffer. This is
used to both determine if the other fields are valid as well as finding
the offset between the old addresses of the sub-buffer from the previous
boot to the new addresses of the current boot.

Since the values for nr_subbufs and subbuf_size are expected to be the same, check
if the values in the meta page match the values calculated.

Take the range of the first_buffer and the total size of all the buffers
and make sure the saved head_buffer and commit_buffer fall in the range.

Iterate through all the sub-buffers to make sure that the values in the
sub-buffer "commit" field (the field that holds the amount of data on the
sub-buffer) is within the end of the sub-buffer. Also check the index
array to make sure that all the indexes are within nr_subbufs.
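
Boiled down to user-space C, the checks look roughly like this sketch (the
per-sub-buffer commit check is omitted; see rb_meta_valid() in the patch for
the real thing):

#include <stdbool.h>

struct meta_sketch {
	unsigned long first_buffer, head_buffer, commit_buffer;
	unsigned int  subbuf_size, nr_subbufs;
	int           buffers[];
};

static bool meta_looks_valid(const struct meta_sketch *m,
			     unsigned int expect_size, unsigned int expect_subbufs)
{
	unsigned long start = m->first_buffer;
	unsigned long end   = start + (unsigned long)m->subbuf_size * m->nr_subbufs;

	if (m->subbuf_size != expect_size || m->nr_subbufs != expect_subbufs)
		return false;			/* geometry changed between boots */
	if (m->head_buffer < start || m->head_buffer >= end)
		return false;
	if (m->commit_buffer < start || m->commit_buffer >= end)
		return false;
	for (unsigned int i = 0; i < m->nr_subbufs; i++)
		if (m->buffers[i] < 0 || (unsigned int)m->buffers[i] >= m->nr_subbufs)
			return false;		/* index array out of range */
	return true;
}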

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 143 ++---
 1 file changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 093c73c617cc..aecd4a7d62be 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -45,6 +45,7 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   first_buffer;
unsigned long   head_buffer;
unsigned long   commit_buffer;
__u32   subbuf_size;
@@ -1617,21 +1618,103 @@ static void *rb_range_buffer(struct 
ring_buffer_per_cpu *cpu_buffer, int idx)
return (void *)ptr;
 }
 
+/*
+ * See if the existing memory contains valid ring buffer data.
+ * As the previous kernel must be the same as this kernel, all
+ * the calculations (size of buffers and number of buffers)
+ * must be the same.
+ */
+static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
+ struct trace_buffer *buffer, int nr_pages)
+{
+   int subbuf_size = PAGE_SIZE;
+   struct buffer_data_page *subbuf;
+   unsigned long buffers_start;
+   unsigned long buffers_end;
+   int i;
+
+   /* The subbuffer's size and number of subbuffers must match */
+   if (meta->subbuf_size != subbuf_size ||
+   meta->nr_subbufs != nr_pages + 1) {
+   pr_info("Ring buffer boot meta [%d] mismatch of 
subbuf_size/nr_pages\n", cpu);
+   return false;
+   }
+
+   buffers_start = meta->first_buffer;
+   buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs);
+
+   /* Is the head and commit buffers within the range of buffers? */
+   if (meta->head_buffer < buffers_start ||
+   meta->head_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] head buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   if (meta->commit_buffer < buffers_start ||
+   meta->commit_buffer >= buffers_end) {
+   pr_info("Ring buffer boot meta [%d] commit buffer out of 
range\n", cpu);
+   return false;
+   }
+
+   subbuf = rb_subbufs_from_meta(meta);
+
+   /* Is the meta buffers and the subbufs themselves have correct data? */
+   for (i = 0; i < meta->nr_subbufs; i++) {
+   if (meta->buffers[i] < 0 ||
+   meta->buffers[i] >= meta->nr_subbufs) {
+   pr_info("Ring buffer boot meta [%d] array out of 
range\n", cpu);
+   return false;
+   }
+
+		if ((unsigned)local_read(&subbuf->commit) > subbuf_size) {
+   pr_info("Ring buffer boot meta [%d] buffer invalid 
commit\n", cpu);
+   return false;
+   }
+
+   subbuf = (void *)subbuf + subbuf_size;
+   }
+
+   pr_info("Ring buffer meta is from previous boot!\n");
+   return true;
+}
+
 static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 {
struct ring_buffer_meta *meta;
+   unsigned long delta;
void *subbuf;
int cpu;
int i;
 
for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
+   void *next_meta;
+
meta = rb_range_meta(buffer, nr_pages, cpu);
 
+   if (rb_meta_valid(meta, cpu, buffer, nr_pages)) {
+   /* Make the mappings match the current address */
+   subbuf = rb_subbufs_from_meta(meta);
+   delta = (unsigned long)subbuf - meta->first_buffer;
+   meta->first_buffer += delta;
+   meta->head_buffer += delta;
+   meta->commit_buffer += delta;

[PATCH v4 05/13] ring-buffer: Add output of ring buffer meta page

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a buffer_meta per-cpu file for the trace instance that is mapped to
boot memory. This shows the current meta-data and can be used by user
space tools to record off the current mappings to help reconstruct the
ring buffer after a reboot.

It does not expose any virtual addresses, just indexes into the sub-buffer
pages.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 77 ++
 kernel/trace/trace.c   | 30 ++-
 kernel/trace/trace.h   |  2 +
 3 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 385dc1750fc7..093c73c617cc 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -32,6 +32,8 @@
 #include 
 #include 
 
+#include "trace.h"
+
 /*
  * The "absolute" timestamp in the buffer is only 59 bits.
  * If a clock has the 5 MSBs set, it needs to be saved and
@@ -1646,6 +1648,81 @@ static void rb_range_meta_init(struct trace_buffer 
*buffer, int nr_pages)
}
 }
 
+static void *rbm_start(struct seq_file *m, loff_t *pos)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val;
+
+   if (!meta)
+   return NULL;
+
+   if (*pos > meta->nr_subbufs)
+   return NULL;
+
+   val = *pos;
+   val++;
+
+   return (void *)val;
+}
+
+static void *rbm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   (*pos)++;
+
+   return rbm_start(m, pos);
+}
+
+static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf);
+
+static int rbm_show(struct seq_file *m, void *v)
+{
+   struct ring_buffer_per_cpu *cpu_buffer = m->private;
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   unsigned long val = (unsigned long)v;
+
+   if (val == 1) {
+   seq_printf(m, "head_buffer:   %d\n",
+  rb_meta_subbuf_idx(meta, (void *)meta->head_buffer));
+   seq_printf(m, "commit_buffer: %d\n",
+  rb_meta_subbuf_idx(meta, (void 
*)meta->commit_buffer));
+   seq_printf(m, "subbuf_size:   %d\n", meta->subbuf_size);
+   seq_printf(m, "nr_subbufs:%d\n", meta->nr_subbufs);
+   return 0;
+   }
+
+   val -= 2;
+   seq_printf(m, "buffer[%ld]:%d\n", val, meta->buffers[val]);
+
+   return 0;
+}
+
+static void rbm_stop(struct seq_file *m, void *p)
+{
+}
+
+static const struct seq_operations rb_meta_seq_ops = {
+   .start  = rbm_start,
+   .next   = rbm_next,
+   .show   = rbm_show,
+   .stop   = rbm_stop,
+};
+
+int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, 
int cpu)
+{
+   struct seq_file *m;
+   int ret;
+
+	ret = seq_open(file, &rb_meta_seq_ops);
+   if (ret)
+   return ret;
+
+   m = file->private_data;
+   m->private = buffer->buffers[cpu];
+
+   return 0;
+}
+
 static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
long nr_pages, struct list_head *pages)
 {
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ff2b504fbe00..622fe670949d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5018,7 +5018,7 @@ static int show_traces_open(struct inode *inode, struct 
file *file)
return 0;
 }
 
-static int show_traces_release(struct inode *inode, struct file *file)
+static int tracing_seq_release(struct inode *inode, struct file *file)
 {
struct trace_array *tr = inode->i_private;
 
@@ -5059,7 +5059,7 @@ static const struct file_operations show_traces_fops = {
.open   = show_traces_open,
.read   = seq_read,
.llseek = seq_lseek,
-   .release= show_traces_release,
+   .release= tracing_seq_release,
 };
 
 static ssize_t
@@ -6860,6 +6860,22 @@ tracing_total_entries_read(struct file *filp, char 
__user *ubuf,
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
+static int tracing_buffer_meta_open(struct inode *inode, struct file *filp)
+{
+   struct trace_array *tr = inode->i_private;
+   int cpu = tracing_get_cpu(inode);
+   int ret;
+
+   ret = tracing_check_open_get_tr(tr);
+   if (ret)
+   return ret;
+
+   ret = ring_buffer_meta_seq_init(filp, tr->array_buffer.buffer, cpu);
+   if (ret < 0)
+   __trace_array_put(tr);
+   return ret;
+}
+
 static ssize_t
 tracing_free_buffer_write(struct file *filp, const char __user *ubuf,
  size_t cnt, loff_t *ppos)
@@ -7436,6 +7452,13 @@ static const struct file_operations tracing_entries_fops 
= {
 

[PATCH v4 03/13] ring-buffer: Add ring_buffer_meta data

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Populate the ring_buffer_meta array. It holds the pointer to the
head_buffer (next to read), the commit_buffer (next to write), the size of
the sub-buffers, number of sub-buffers and an array that keeps track of
the order of the sub-buffers.

This information will be stored in the persistent memory to help on reboot
to reconstruct the ring buffer.
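
The index array is what lets the next boot find a sub-buffer by its ring
position; as a condensed restatement of the lookup (sketch only, see
rb_range_buffer() in the patch for the real code):

	/* meta->buffers[] maps ring position -> physical sub-buffer slot */
	static void *subbuf_at(void *first_subbuf, int subbuf_size,
			       const int *buffers, int logical_idx)
	{
		return (char *)first_subbuf + (long)buffers[logical_idx] * subbuf_size;
	}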

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 209 -
 1 file changed, 184 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 53abe7916f2b..385dc1750fc7 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -43,6 +43,11 @@
 static void update_pages_handler(struct work_struct *work);
 
 struct ring_buffer_meta {
+   unsigned long   head_buffer;
+   unsigned long   commit_buffer;
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+   int buffers[];
 };
 
 /*
@@ -500,6 +505,7 @@ struct ring_buffer_per_cpu {
struct mutexmapping_lock;
unsigned long   *subbuf_ids;/* ID to subbuf VA */
struct trace_buffer_meta*meta_page;
+   struct ring_buffer_meta *ring_meta;
 
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
@@ -1260,6 +1266,11 @@ static void rb_head_page_activate(struct 
ring_buffer_per_cpu *cpu_buffer)
 * Set the previous list pointer to have the HEAD flag.
 */
rb_set_list_to_head(head->list.prev);
+
+   if (cpu_buffer->ring_meta) {
+   struct ring_buffer_meta *meta = cpu_buffer->ring_meta;
+   meta->head_buffer = (unsigned long)head->page;
+   }
 }
 
 static void rb_list_head_clear(struct list_head *list)
@@ -1514,51 +1525,127 @@ rb_range_align_subbuf(unsigned long addr, int 
subbuf_size, int nr_subbufs)
 }
 
 /*
- * Return a specific sub-buffer for a given @cpu defined by @idx.
+ * Return the ring_buffer_meta for a given @cpu.
  */
-static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu)
 {
-   unsigned long ptr;
int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   unsigned long ptr = buffer->range_addr_start;
+   struct ring_buffer_meta *meta;
int nr_subbufs;
 
-   /* Include the reader page */
-   nr_subbufs = nr_pages + 1;
+   if (!ptr)
+   return NULL;
+
+   /* When nr_pages passed in is zero, the first meta has already been 
initialized */
+   if (!nr_pages) {
+   meta = (struct ring_buffer_meta *)ptr;
+   nr_subbufs = meta->nr_subbufs;
+   } else {
+   meta = NULL;
+   /* Include the reader page */
+   nr_subbufs = nr_pages + 1;
+   }
 
/*
 * The first chunk may not be subbuffer aligned, where as
 * the rest of the chunks are.
 */
-   ptr = buffer->range_addr_start;
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
if (cpu) {
-   unsigned long p;
-
-   ptr += subbuf_size * nr_subbufs;
-
-   /* Save the beginning of this CPU chunk */
-   p = ptr;
-
ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs);
+   ptr += subbuf_size * nr_subbufs;
 
/* We can use multiplication to find chunks greater than 1 */
if (cpu > 1) {
unsigned long size;
+   unsigned long p;
 
+   /* Save the beginning of this CPU chunk */
+   p = ptr;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
ptr += subbuf_size * nr_subbufs;
 
/* Now all chunks after this are the same size */
size = ptr - p;
ptr += size * (cpu - 2);
-
-   ptr = rb_range_align_subbuf(ptr, subbuf_size, 
nr_subbufs);
}
}
-   if (ptr + subbuf_size * nr_subbufs > buffer->range_addr_end)
+   return (void *)ptr;
+}
+
+/* Return the start of subbufs given the meta pointer */
+static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta)
+{
+   int subbuf_size = meta->subbuf_size;
+   unsigned long ptr;
+
+   ptr = (unsigned long)meta;
+   ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs);
+
+   return (void *)ptr;
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx)
+{
+   struct ring_buffer_meta *meta;
+   

[PATCH v4 02/13] ring-buffer: Add ring_buffer_alloc_range()

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation for allowing the trace ring buffer to be allocated in a
range of memory that is persistent across reboots, add
ring_buffer_alloc_range(). It takes a contiguous range of memory and will
split it up evenly for the per CPU ring buffers.

If there's not enough memory to handle all CPUs with the minimum size, it
will fail to allocate the ring buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  17 +++
 kernel/trace/ring_buffer.c  | 239 ++--
 2 files changed, 220 insertions(+), 36 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 96d2140b471e..a50b0223b1d3 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -89,6 +89,11 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer,
 struct trace_buffer *
 __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key 
*key);
 
+struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned 
flags,
+  int order, unsigned long start,
+  unsigned long range_size,
+  struct lock_class_key *key);
+
 /*
  * Because the ring buffer is generic, if other users of the ring buffer get
  * traced by ftrace, it can produce lockdep warnings. We need to keep each
@@ -100,6 +105,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+/*
+ * Because the ring buffer is generic, if other users of the ring buffer get
+ * traced by ftrace, it can produce lockdep warnings. We need to keep each
+ * ring buffer's lock class separate.
+ */
+#define ring_buffer_alloc_range(size, flags, order, start, range_size) \
+({ \
+   static struct lock_class_key __key; \
+   __ring_buffer_alloc_range((size), (flags), (order), (start),\
+ (range_size), &__key);\
+})
+
 typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
 ring_buffer_cond_fn cond, void *data);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 78beaccf9c8c..53abe7916f2b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -42,6 +42,9 @@
 
 static void update_pages_handler(struct work_struct *work);
 
+struct ring_buffer_meta {
+};
+
 /*
  * The ring buffer header is special. We must manually up keep it.
  */
@@ -342,7 +345,8 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
-   u32  id;/* ID for external mapping */
+   u32  id:30; /* ID for external mapping */
+   u32  range:1;   /* Mapped via a range */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -373,7 +377,9 @@ static __always_inline unsigned int rb_page_commit(struct 
buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-   free_pages((unsigned long)bpage->page, bpage->order);
+   /* Range pages are not to be freed */
+   if (!bpage->range)
+   free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
 }
 
@@ -523,6 +529,9 @@ struct trace_buffer {
struct rb_irq_work  irq_work;
booltime_stamp_abs;
 
+   unsigned long   range_addr_start;
+   unsigned long   range_addr_end;
+
unsigned intsubbuf_size;
unsigned intsubbuf_order;
unsigned intmax_data_size;
@@ -1490,9 +1499,70 @@ static void rb_check_pages(struct ring_buffer_per_cpu 
*cpu_buffer)
}
 }
 
+/*
+ * Take an address, add the meta data size as well as the array of
+ * array subbuffer indexes, then align it to a subbuffer size.
+ *
+ * This is used to help find the next per cpu subbuffer within a mapped range.
+ */
+static unsigned long
+rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs)
+{
+   addr += sizeof(struct ring_buffer_meta) +
+   sizeof(int) * nr_subbufs;
+   return ALIGN(addr, subbuf_size);
+}
+
+/*
+ * Return a specific sub-buffer for a given @cpu defined by @idx.
+ */
+static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int 
nr_pages, int idx)
+{
+   unsigned long ptr;
+   int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   int nr_subbufs;
+
+   /* Include t

[PATCH v4 04/13] tracing: Implement creating an instance based on a given memory region

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Allow for creating a new instance by passing in an address and size to map
the ring buffer for the instance to.

This will allow features like a pstore memory mapped region to be used for
a tracing instance ring buffer that can be retrieved from one boot to the
next.
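
Roughly, the boot-parameter patch later in this series is expected to end up
calling it like the sketch below (parameter names assumed; "addr" would be the
vmapped address of the reserved region and "size" its length):

	tr = trace_array_create_systems("boot_mapped", NULL, addr, size);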

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 50 +++-
 kernel/trace/trace.h |  4 
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 578a49ff5c32..ff2b504fbe00 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4921,6 +4921,11 @@ static int tracing_open(struct inode *inode, struct file 
*file)
 static bool
 trace_ok_for_array(struct tracer *t, struct trace_array *tr)
 {
+#ifdef CONFIG_TRACER_SNAPSHOT
+   /* arrays with mapped buffer range do not have snapshots */
+   if (tr->range_addr_start && t->use_max_tr)
+   return false;
+#endif
return (tr->flags & TRACE_ARRAY_FL_GLOBAL) || t->allow_instances;
 }
 
@@ -8664,11 +8669,13 @@ tracing_init_tracefs_percpu(struct trace_array *tr, 
long cpu)
				tr, cpu, &tracing_entries_fops);
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-	trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
-				tr, cpu, &snapshot_fops);
+	if (!tr->range_addr_start) {
+		trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu,
+				      tr, cpu, &snapshot_fops);
 
-	trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
-				tr, cpu, &snapshot_raw_fops);
+		trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu,
+				      tr, cpu, &snapshot_raw_fops);
+   }
 #endif
 }
 
@@ -9205,7 +9212,18 @@ allocate_trace_buffer(struct trace_array *tr, struct 
array_buffer *buf, int size
 
buf->tr = tr;
 
-   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   if (tr->range_addr_start && tr->range_addr_size) {
+   buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0,
+ tr->range_addr_start,
+ tr->range_addr_size);
+   /*
+* This is basically the same as a mapped buffer,
+* with the same restrictions.
+*/
+   tr->mapped++;
+   } else {
+   buf->buffer = ring_buffer_alloc(size, rb_flags);
+   }
if (!buf->buffer)
return -ENOMEM;
 
@@ -9242,6 +9260,10 @@ static int allocate_trace_buffers(struct trace_array 
*tr, int size)
return ret;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
+   /* Fix mapped buffer trace arrays do not have snapshot buffers */
+   if (tr->range_addr_start)
+   return 0;
+
	ret = allocate_trace_buffer(tr, &tr->max_buffer,
allocate_snapshot ? size : 1);
if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) {
@@ -9342,7 +9364,9 @@ static int trace_array_create_dir(struct trace_array *tr)
 }
 
 static struct trace_array *
-trace_array_create_systems(const char *name, const char *systems)
+trace_array_create_systems(const char *name, const char *systems,
+  unsigned long range_addr_start,
+  unsigned long range_addr_size)
 {
struct trace_array *tr;
int ret;
@@ -9368,6 +9392,10 @@ trace_array_create_systems(const char *name, const char 
*systems)
goto out_free_tr;
}
 
+   /* Only for boot up memory mapped ring buffers */
+   tr->range_addr_start = range_addr_start;
+   tr->range_addr_size = range_addr_size;
+
tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS;
 
cpumask_copy(tr->tracing_cpumask, cpu_all_mask);
@@ -9425,7 +9453,7 @@ trace_array_create_systems(const char *name, const char 
*systems)
 
 static struct trace_array *trace_array_create(const char *name)
 {
-   return trace_array_create_systems(name, NULL);
+   return trace_array_create_systems(name, NULL, 0, 0);
 }
 
 static int instance_mkdir(const char *name)
@@ -9479,7 +9507,7 @@ struct trace_array *trace_array_get_by_name(const char 
*name, const char *system
goto out_unlock;
}
 
-   tr = trace_array_create_systems(name, systems);
+   tr = trace_array_create_systems(name, systems, 0, 0);
 
if (IS_ERR(tr))
tr = NULL;
@@ -9672,8 +9700,10 @@ init_tracer_tracefs(struct trace_array *tr, struct 
dentry *d_tracer)
MEM_FAIL(1, "Could not allocate function filter files");
 
 #ifdef CONFIG_TRACER_SNAPSHOT
-   trace_cre

[PATCH v4 01/13] ring-buffer: Allow mapped field to be set without mapping

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation for having the ring buffer mapped to a dedicated location,
which will have the same restrictions as user space memory mapped buffers,
allow it to use the "mapped" field of the ring_buffer_per_cpu structure
without having the user space meta page mapping.

When this starts using the mapped field, it will need to handle adding a
user space mapping (and removing it) from a ring buffer that is using a
dedicated memory range.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 28853966aa9a..78beaccf9c8c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5224,6 +5224,9 @@ static void rb_update_meta_page(struct 
ring_buffer_per_cpu *cpu_buffer)
 {
struct trace_buffer_meta *meta = cpu_buffer->meta_page;
 
+   if (!meta)
+   return;
+
meta->reader.read = cpu_buffer->reader_page->read;
meta->reader.id = cpu_buffer->reader_page->id;
meta->reader.lost_events = cpu_buffer->lost_events;
@@ -6167,7 +6170,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
 
	mutex_lock(&cpu_buffer->mapping_lock);
 
-   if (!cpu_buffer->mapped) {
+   if (!cpu_buffer->mapped || !cpu_buffer->meta_page) {
	mutex_unlock(&cpu_buffer->mapping_lock);
return ERR_PTR(-ENODEV);
}
@@ -6359,12 +6362,13 @@ int ring_buffer_map(struct trace_buffer *buffer, int 
cpu,
 */
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
rb_setup_ids_meta_page(cpu_buffer, subbuf_ids);
+
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
err = __rb_map_vma(cpu_buffer, vma);
if (!err) {
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   cpu_buffer->mapped = 1;
+   cpu_buffer->mapped++;
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
} else {
kfree(cpu_buffer->subbuf_ids);
@@ -6403,7 +6407,8 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int 
cpu)
	mutex_lock(&buffer->mutex);
	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
 
-   cpu_buffer->mapped = 0;
+   WARN_ON_ONCE(!cpu_buffer->mapped);
+   cpu_buffer->mapped--;
 
	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
 
-- 
2.43.0





[PATCH v4 00/13] tracing: Persistent traces across a reboot or crash

2024-06-11 Thread Steven Rostedt
acpi_reset <-acpi_reboot
   swapper/0-1   [000] d..1.63.479791: acpi_os_write_port 
<-acpi_reboot

Enjoy...

Changes since v3: 
https://lore.kernel.org/all/20240606211735.684785...@goodmis.org/

- Removed an unused variable

- Fixed enable_instances(): the path that does not use a memory location
  from the command line parameter was passing in "tok" where it now needs to
  be "name" for creating the snapshot buffer; otherwise it would pass in
  NULL, which could crash the kernel on boot.

Changes since v2: 
https://lore.kernel.org/all/20240411012541.285904...@goodmis.org/

- Rebased on top of 6.10-rc2 that has the memory mapped ring buffer code.

- Added hard coded address to map to (from memmap=nn$ss), instead of relying
  on using reserve_mem (which I still want to add).

- Updated comments

- Restructured the validate code as the previous version broke the ring
  buffer timestamp validation code.


Steven Rostedt (Google) (13):
  ring-buffer: Allow mapped field to be set without mapping
  ring-buffer: Add ring_buffer_alloc_range()
  ring-buffer: Add ring_buffer_meta data
  tracing: Implement creating an instance based on a given memory region
  ring-buffer: Add output of ring buffer meta page
  ring-buffer: Add test if range of boot buffer is valid
  ring-buffer: Validate boot range memory events
  tracing: Add option to use memmapped memory for trace boot instance
  ring-buffer: Save text and data locations in mapped meta data
  tracing/ring-buffer: Add last_boot_info file to boot instance
  tracing: Handle old buffer mappings for event strings and functions
  tracing: Update function tracing output for previous boot buffer
  tracing: Add last boot delta offset for stack traces


 Documentation/admin-guide/kernel-parameters.txt |   9 +
 include/linux/ring_buffer.h |  20 +
 kernel/trace/ring_buffer.c  | 853 +---
 kernel/trace/trace.c| 242 ++-
 kernel/trace/trace.h|   8 +
 kernel/trace/trace_output.c |  12 +-
 6 files changed, 1036 insertions(+), 108 deletions(-)



Re: [PATCH v3 3/3] tracing/kprobe: Remove cleanup code unrelated to selftest

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 22:30:56 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> This cleanup of all kprobe events is not related to the selftest
> itself, and it can fail for reasons unrelated to this test.
> If the test is successful, the generated events are cleaned up.
> And if not, we cannot guarantee that the kprobe events will work
> correctly. So, either way, there is no need to clean them up.
> 
> Signed-off-by: Masami Hiramatsu (Google) 

Reviewed-by: Steven Rostedt (Google) 

-- Steve

> ---
>  kernel/trace/trace_kprobe.c |4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 8c5816c04bd2..7fd0f8576e4c 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -2114,10 +2114,6 @@ static __init int kprobe_trace_self_tests_init(void)
>  
>  
>  end:
> -	ret = dyn_events_release_all(&trace_kprobe_ops);
> - if (WARN_ONCE(ret, "error on cleaning up probes."))
> - warn++;
> -
>   /*
>* Wait for the optimizer work to finish. Otherwise it might fiddle
>* with probes in already freed __init text.




Re: [PATCH v3 2/3] tracing/kprobe: Integrate test warnings into WARN_ONCE

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 22:30:46 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Clean up the redundant WARN_ON_ONCE(cond) + pr_warn(msg) into
> WARN_ONCE(cond, msg). Also add some WARN_ONCE() for the hitcount check.
> These WARN_ONCE() errors make it easy to handle errors from ktest.
> 
> Suggested-by: Steven Rostedt 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v3:
>   - integrate WARN_ON_ONCE() and pr_warn() instead of remove
> WARN_ONCE().

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v6 3/3] sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 11:03:25 +0200
Daniel Bristot de Oliveira  wrote:

> On 6/10/24 21:20, Qais Yousef wrote:
> > -   if (realtime_prio(p->prio)) /* includes deadline */
> > +   if (rt_or_dl_prio(p->prio))  
> 
> that is it... no thinking, no recall, no comment, no confusion...

How about "not_normal_prio(p->prio)" ?

/me runs!

-- Steve



Re: [PATCH 05/14] tracefs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 10:42:28 +0200
Vlastimil Babka  wrote:

> AFAICS that documented way is for a different situation? I assume you mean
> this part:
> 
> * Specify any additional patch prerequisites for cherry picking::
> 
> Cc:  # 3.3.x: a1f84a3: sched: Check for idle
> 
> But that would assume we actively want to backport this cleanup patch in the
> first place. But as I understand Steven's intention, we want just to make
> sure that if in the future this patch is backported (i.e. as a dependency of
> something else) it won't be forgotten to also backport c9929f0e344a
> ("mm/slob: remove CONFIG_SLOB"). How to express that without actively
> marking this patch for backport at the same time?

Exactly! This isn't to be tagged as stable. It's just a way to say "if you
need this patch for any reason, you also need patch X".

I think "Depends-on" is the way to go, as it is *not* a stable thing, and
what is in stable rules is only about stable patches.

-- Steve



Re: [PATCH 05/14] tracefs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback

2024-06-11 Thread Steven Rostedt
On Tue, 11 Jun 2024 08:23:11 +0200
Greg KH  wrote:

> > Depends-on: c9929f0e344a ("mm/slob: remove CONFIG_SLOB")  
> 
> Ick, no, use the documented way of handling this as described in the
> stable kernel rules file.

You mentioned this before, I guess you mean this:

> To send additional instructions to the stable team, use a shell-style inline
> comment to pass arbitrary or predefined notes:
> 
> * Specify any additional patch prerequisites for cherry picking::
> 
> Cc:  # 3.3.x: a1f84a3: sched: Check for idle
> Cc:  # 3.3.x: 1b9508f: sched: Rate-limit newidle
> Cc:  # 3.3.x: fd21073: sched: Fix affinity logic
> Cc:  # 3.3.x
> Signed-off-by: Ingo Molnar 
> 
>   The tag sequence has the meaning of::
> 
> git cherry-pick a1f84a3
> git cherry-pick 1b9508f
> git cherry-pick fd21073
> git cherry-pick 
> 
>   Note that for a patch series, you do not have to list as prerequisites the
>   patches present in the series itself. For example, if you have the following
>   patch series::
> 
> patch1
> patch2
> 
>   where patch2 depends on patch1, you do not have to list patch1 as
>   prerequisite of patch2 if you have already marked patch1 for stable
>   inclusion.

What's with the "3.3.x"? Isn't that obsolete? And honestly, I find the
above much more "ick" than "Depends-on:". That's because I like to read
human readable tags and not machine processing tags. I'm a human, not a machine.

-- Steve



Re: [PATCH 2/3] tracing/kprobe: Remove unneeded WARN_ON_ONCE() in selftests

2024-06-10 Thread Steven Rostedt
On Tue, 11 Jun 2024 06:26:44 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Since the kprobe-events selftest shows OK or NG with the reason, the
> WARN_ON_ONCE()s for each place are redundant. Let's remove them.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  kernel/trace/trace_kprobe.c |   26 +-
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 16383247bdbf..4abed36544d0 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -2023,18 +2023,18 @@ static __init int kprobe_trace_self_tests_init(void)
>   pr_info("Testing kprobe tracing: ");
>  
>   ret = create_or_delete_trace_kprobe("p:testprobe 
> kprobe_trace_selftest_target $stack $stack0 +0($stack)");
> - if (WARN_ON_ONCE(ret)) {
> + if (ret) {
>   pr_warn("error on probing function entry.\n");

Actually, you can consolidate this to:

if (WARN_ONCE(ret, "error on probing function entry."))

>   warn++;
>   } else {
>   /* Enable trace point */
>   tk = find_trace_kprobe("testprobe", KPROBE_EVENT_SYSTEM);
> - if (WARN_ON_ONCE(tk == NULL)) {
> + if (tk == NULL) {
>   pr_warn("error on getting new probe.\n");

And this to:

if (WARN_ONCE(tk == NULL, "error on getting new probe."))

end so on.

-- Steve

>   warn++;
>   } else {
>   file = find_trace_probe_file(tk, top_trace_array());
> - if (WARN_ON_ONCE(file == NULL)) {
> + if (file == NULL) {
>   pr_warn("error on getting probe file.\n");
>   warn++;
>   } else
> @@ -2044,18 +2044,18 @@ static __init int kprobe_trace_self_tests_init(void)
>   }
>  
>   ret = create_or_delete_trace_kprobe("r:testprobe2 
> kprobe_trace_selftest_target $retval");
> - if (WARN_ON_ONCE(ret)) {
> + if (ret) {
>   pr_warn("error on probing function return.\n");
>   warn++;
>   } else {
>   /* Enable trace point */
>   tk = find_trace_kprobe("testprobe2", KPROBE_EVENT_SYSTEM);
> - if (WARN_ON_ONCE(tk == NULL)) {
> + if (tk == NULL) {
>   pr_warn("error on getting 2nd new probe.\n");
>   warn++;
>   } else {
>   file = find_trace_probe_file(tk, top_trace_array());
> - if (WARN_ON_ONCE(file == NULL)) {
> + if (file == NULL) {
>   pr_warn("error on getting probe file.\n");
>   warn++;
>   } else
> @@ -2079,7 +2079,7 @@ static __init int kprobe_trace_self_tests_init(void)
>  
>   /* Disable trace points before removing it */
>   tk = find_trace_kprobe("testprobe", KPROBE_EVENT_SYSTEM);
> - if (WARN_ON_ONCE(tk == NULL)) {
> + if (tk == NULL) {
>   pr_warn("error on getting test probe.\n");
>   warn++;
>   } else {
> @@ -2089,7 +2089,7 @@ static __init int kprobe_trace_self_tests_init(void)
>   }
>  
>   file = find_trace_probe_file(tk, top_trace_array());
> - if (WARN_ON_ONCE(file == NULL)) {
> + if (file == NULL) {
>   pr_warn("error on getting probe file.\n");
>   warn++;
>   } else
> @@ -2098,7 +2098,7 @@ static __init int kprobe_trace_self_tests_init(void)
>   }
>  
>   tk = find_trace_kprobe("testprobe2", KPROBE_EVENT_SYSTEM);
> - if (WARN_ON_ONCE(tk == NULL)) {
> + if (tk == NULL) {
>   pr_warn("error on getting 2nd test probe.\n");
>   warn++;
>   } else {
> @@ -2108,7 +2108,7 @@ static __init int kprobe_trace_self_tests_init(void)
>   }
>  
>   file = find_trace_probe_file(tk, top_trace_array());
> - if (WARN_ON_ONCE(file == NULL)) {
> + if (file == NULL) {
>   pr_warn("error on getting probe file.\n");
>   warn++;
>   } else
> @@ -2117,20 +2117,20 @@ static __init int kprobe_trace_self_tests_init(void)
>   }
>  
>   ret = create_or_delete_trace_kprobe("-:testprobe");
> - if (WARN_ON_ONCE(ret)) {
> + if (ret) {
>   pr_warn("error on deleting a probe.\n");
>   warn++;
>   }
>  
>   ret = create_or_delete_trace_kprobe("-:testprobe2");
> - if (WARN_ON_ONCE(ret)) {
> + if (ret) {
>   pr_warn("error on deleting a probe.\n");
>   warn++;
>   }
>  
>  end:
>   ret = dyn_events_release_all(_kprobe_ops);
>	ret = dyn_events_release_all(&trace_kprobe_ops);
> + if (ret) {
>   pr_warn("error on cleaning up probes.\n");

Re: [PATCH -next 2/2] ftrace: Add kernel-doc comments for unregister_ftrace_direct() function

2024-06-10 Thread Steven Rostedt
On Fri,  7 Jun 2024 16:49:57 +0800
Yang Li  wrote:

> Added kernel-doc comments for the unregister_ftrace_direct() function to
> improve code documentation and readability.
> 

Someone else beat you to this.

-- Steve

> Reported-by: Abaci Robot 
> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9300
> Signed-off-by: Yang Li 
> ---
>  kernel/trace/ftrace.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 4aeb1183ea9f..3b0dbd55cc05 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5988,6 +5988,8 @@ EXPORT_SYMBOL_GPL(register_ftrace_direct);
>   * unregister_ftrace_direct - Remove calls to custom trampoline
>   * previously registered by register_ftrace_direct for @ops object.
>   * @ops: The address of the struct ftrace_ops object
> + * @addr: The address of the direct call to remove
> + * @free_filters: Boolean indicating whether to free the filters
>   *
>   * This is used to remove a direct calls to @addr from the nop locations
>   * of the functions registered in @ops (with by ftrace_set_filter_ip




Re: [PATCH -next 1/2] function_graph: Add kernel-doc comments for ftrace_graph_ret_addr() function

2024-06-10 Thread Steven Rostedt
On Fri,  7 Jun 2024 16:49:56 +0800
Yang Li  wrote:

> Added kernel-doc comments for the ftrace_graph_ret_addr() function to
> improve code documentation and readability.
> 
> Reported-by: Abaci Robot 
> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9299
> Signed-off-by: Yang Li 
> ---
>  kernel/trace/fgraph.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index a13551a023aa..4ad33e4cb8da 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -872,6 +872,12 @@ ftrace_graph_get_ret_stack(struct task_struct *task, int 
> idx)
>  /**
>   * ftrace_graph_ret_addr - convert a potentially modified stack return 
> address
>   *  to its original value
> + * @task: pointer to the task_struct of the task being examined
> + * @idx: pointer to a state variable, should be initialized to zero
> + *before the first call

parameter descriptions should not go across more than one line. At least
not in my code. Also, you don't need to add that it needs to be initialized
here. That belongs in the body.

And it's not a state variable. The description you got that from is wrong.

I'll go update it and give you a reported by, as the entire description
needs a rewrite.
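
Just to show the style I mean (placeholder wording only, not what I'll
actually commit), each parameter description gets exactly one line:

 /**
  * ftrace_graph_ret_addr - convert a potentially modified stack return address
  *                         to its original value
  * @task: The task the unwinder is running on
  * @idx:  Index into the task's shadow ret_stack, updated across calls
  * @ret:  The return address found on the stack to convert
  * @retp: The location on the stack of that return address
  *
  * [ the note about initializing *idx to zero goes in this body text ]
  */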

-- Steve


> + * @ret: the current return address found on the stack
> + * @retp: pointer to the return address on the stack, ignored if
> + * HAVE_FUNCTION_GRAPH_RET_ADDR_PTR is not defined
>   *
>   * This function can be called by stack unwinding code to convert a found 
> stack
>   * return address ('ret') to its original value, in case the function graph




Re: [PATCH 1/3] tracing: Build event generation tests only as modules

2024-06-10 Thread Steven Rostedt
On Tue, 11 Jun 2024 06:26:34 +0900
"Masami Hiramatsu (Google)"  wrote:

> The kprobes and synth event generation test modules add events and take
> a lock (a reference) on those event files in their module init function,
> and unlock and delete them in their module exit function. This is because
> they are designed to be used as modules.
> 
> If we build those modules in, those events are left locked in the kernel
> and can never be removed. This causes the kprobe event self-test failure
> below.

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH 2/3] tracing/kprobe: Remove unneeded WARN_ON_ONCE() in selftests

2024-06-10 Thread Steven Rostedt
On Tue, 11 Jun 2024 06:26:44 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Since the kprobe-events selftest shows OK or NG with the reason, the
> WARN_ON_ONCE()s for each place are redundant. Let's remove them.

Note, the ktest runs we use to validate commits fail when they detect that
a WARN() triggered.

If this selftest fails in any configuration, ktest will not detect that it
failed.

-- Steve


> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  kernel/trace/trace_kprobe.c |   26 +-
>  1 file changed, 13 insertions(+), 13 deletions(-)



Re: [PATCH 05/14] tracefs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback

2024-06-10 Thread Steven Rostedt
On Mon, 10 Jun 2024 22:42:30 +0200
Vlastimil Babka  wrote:

> On 6/10/24 5:46 PM, Paul E. McKenney wrote:
> > On Mon, Jun 10, 2024 at 11:22:23AM -0400, Steven Rostedt wrote:  
> >> On Sun,  9 Jun 2024 10:27:17 +0200
> >> Julia Lawall  wrote:
> >>   
> >> > diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> >> > index 7c29f4afc23d..338c52168e61 100644
> >> > --- a/fs/tracefs/inode.c
> >> > +++ b/fs/tracefs/inode.c
> >> > @@ -53,14 +53,6 @@ static struct inode *tracefs_alloc_inode(struct 
> >> > super_block *sb)
> >> >  	return &ti->vfs_inode;
> >> >  }
> >> >  
> >> > -static void tracefs_free_inode_rcu(struct rcu_head *rcu)
> >> > -{
> >> > -struct tracefs_inode *ti;
> >> > -
> >> > -ti = container_of(rcu, struct tracefs_inode, rcu);
> >> > -kmem_cache_free(tracefs_inode_cachep, ti);  
> >> 
> >> Does this work?
> >> 
> >> tracefs needs to be freed via the tracefs_inode_cachep. Does
> >> kfree_rcu() handle specific frees for objects that were not allocated
> >> via kmalloc()?  
> > 
> > A recent change to kfree() allows it to correctly handle memory allocated
> > via kmem_cache_alloc().  News to me as of a few weeks ago.  ;-)  
> 
> Hey, I did try not to keep that a secret :)
> https://lore.kernel.org/all/20230310103210.22372-8-vba...@suse.cz/
> 

Heh, I didn't look at that patch very deeply.

-- Steve



Re: [PATCH 05/14] tracefs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback

2024-06-10 Thread Steven Rostedt
On Mon, 10 Jun 2024 08:46:42 -0700
"Paul E. McKenney"  wrote:

> > > index 7c29f4afc23d..338c52168e61 100644
> > > --- a/fs/tracefs/inode.c
> > > +++ b/fs/tracefs/inode.c
> > > @@ -53,14 +53,6 @@ static struct inode *tracefs_alloc_inode(struct 
> > > super_block *sb)
> > >   	return &ti->vfs_inode;
> > >  }
> > >  
> > > -static void tracefs_free_inode_rcu(struct rcu_head *rcu)
> > > -{
> > > - struct tracefs_inode *ti;
> > > -
> > > - ti = container_of(rcu, struct tracefs_inode, rcu);
> > > - kmem_cache_free(tracefs_inode_cachep, ti);  
> > 
> > Does this work?
> > 
> > tracefs needs to be freed via the tracefs_inode_cachep. Does
> > kfree_rcu() handle specific frees for objects that were not allocated
> > via kmalloc()?  
> 
> A recent change to kfree() allows it to correctly handle memory allocated
> via kmem_cache_alloc().  News to me as of a few weeks ago.  ;-)

If that's the case then:

Acked-by: Steven Rostedt (Google) 

Do we have a way to add a "Depends-on" tag so that anyone backporting this
will know that it requires the change to whatever allowed that to happen?

Or we need to update the change log to explicitly state that this should
*not* be backported.

-- Steve



Re: [PATCH 0/3] tracing: Fix some selftest issues

2024-06-10 Thread Steven Rostedt
On Mon, 10 Jun 2024 11:10:01 +0900
Masami Hiramatsu (Google)  wrote:

> > But you don't explain what exactly the conflict is. What about those
> > events causes kprobe selftests to fail?  
> 
> I also found another problem with these modules. These modules take trace
> event file references to prevent the probes from being removed. Therefore,
> if we build these modules in, we can never remove these events!
> 
> /*
>  * Now get the gen_kprobe_test event file.  We need to prevent
>  * the instance and event from disappearing from underneath
>  * us, which trace_get_event_file() does (though in this case
>  * we're using the top-level instance which never goes away).
>  */
> gen_kprobe_test = trace_get_event_file(NULL, "kprobes",
>"gen_kprobe_test");
> if (IS_ERR(gen_kprobe_test)) {
> ret = PTR_ERR(gen_kprobe_test);
> goto delete;
> }
> 
> This means most ftracetest cases fail, because we cannot clean up the
> tracing state by removing dynamic events, which are installed while
> booting.
> Note that these references (locks) will be removed when the module
> is unloaded.

I'm fine if you want to force these as modules. Just make sure the change
log lists all the issues of them not being modules.

-- Steve



Re: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

2024-06-10 Thread Steven Rostedt
On Thu, 6 Jun 2024 10:37:46 -0500
Yan Zhai  wrote:

> > name: kfree_skb
> > ID: 1799
> > format:
> > field:unsigned short common_type;   offset:0;   size:2; 
> > signed:0;
> > field:unsigned char common_flags;   offset:2;   size:1; 
> > signed:0;
> > field:unsigned char common_preempt_count;   offset:3;   
> > size:1; signed:0;
> > field:int common_pid;   offset:4;   size:4; signed:1;
> >
> > field:void * skbaddr;   offset:8;   size:8; signed:0;
> > field:void * location;  offset:16;  size:8; signed:0;
> > field:unsigned short protocol;  offset:24;  size:2; signed:0;
> > field:enum skb_drop_reason reason;  offset:28;  size:4; 
> > signed:0;
> >
> > Notice that "protocol" is 2 bytes in size at offset 24, but "reason" starts
> > at offset 28. This means at offset 26, there's a 2 byte hole.
> >  
> The reason I added the pointer as the last argument was to minimize
> the surprise to existing TP users: for common ABIs it's fine to omit
> later arguments when defining a function, but changing the order of
> the arguments requires changes and recompilation.

Nothing should be hard coding the offsets of the fields. This is
exported to user space so that tools can see where the fields are.
That's the purpose of libtraceevent. The fields should be movable and
not affect anything. There should be no need to recompile.
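
For example, with libtraceevent the tool pulls the field out by name, so it
does not care where the field lands (rough sketch only, error handling
trimmed; "rx_skaddr" is just the name from your format output above, and the
header location can vary by distro):

	#include <stdio.h>
	#include <traceevent/event-parse.h>

	static void print_rx_skaddr(struct tep_handle *tep, struct tep_record *record)
	{
		struct tep_event *event;
		unsigned long long val;

		/* Find the event description for this record */
		event = tep_find_event_by_record(tep, record);
		if (!event)
			return;

		/* Look the field up by name; offset and size come from the format file */
		if (tep_get_field_val(NULL, event, "rx_skaddr", record, &val, 0))
			return;	/* field not present on this kernel */

		printf("rx_skaddr=0x%llx\n", val);
	}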

> 
> Looking at the actual format after the change, it does not add a new
> hole since protocol and reason are already packed into the same 8-byte
> block, so rx_skaddr starts at 8-byte aligned offset:
> 
> # cat /sys/kernel/debug/tracing/events/skb/kfree_skb/format
> name: kfree_skb
> ID: 2260
> format:
> field:unsigned short common_type;   offset:0;
> size:2; signed:0;
> field:unsigned char common_flags;   offset:2;
> size:1; signed:0;
> field:unsigned char common_preempt_count;   offset:3;
>  size:1; signed:0;
> field:int common_pid;   offset:4;   size:4; signed:1;
> 
> field:void * skbaddr;   offset:8;   size:8; signed:0;
> field:void * location;  offset:16;  size:8; signed:0;
> field:unsigned short protocol;  offset:24;  size:2; signed:0;
> field:enum skb_drop_reason reason;  offset:28;
> size:4; signed:0;
> field:void * rx_skaddr; offset:32;  size:8; signed:0;
> 
> Do you think we still need to change the order?

Up to you, just wanted to point it out.

-- Steve




Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks

2024-06-10 Thread Steven Rostedt
On Fri, 7 Jun 2024 10:29:03 +0200
Petr Pavlu  wrote:

> Another option could be to try traversing the whole list in smaller
> parts and give up the reader_lock in between them. This would need some
> care to make sure that the operation completes, e.g. the code would need
> to bail out if it detects a change on cpu_buffer->pages_read.

I think I like this approach the most. Perhaps even have a counter that
gets incremented every time a new reader page is taken. And if the check
detects that it changed, it restarts?

To prevent a DOS, we restart 3 times at most, and then just say "the
list is OK" and exit.

So basically, we release the lock within the loop for each sub-buffer,
and then check if a reader touched it when reacquiring the lock.
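
Something like this is what I'm picturing (only a sketch, not even compile
tested; the "8 sub-buffers per lock hold" and using pages_read as the "did a
reader come in" signal are just placeholders, and I moved the locking into
the function itself):

	static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
	{
		struct list_head *head, *tmp;
		unsigned long flags;
		long reads;
		int restarts = 3;
		int cnt;

	 again:
		raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);

		reads = local_read(&cpu_buffer->pages_read);
		head = rb_list_head(cpu_buffer->pages);
		cnt = 0;

		for (tmp = rb_list_head(head->next); tmp != head;
		     tmp = rb_list_head(tmp->next)) {

			if (RB_WARN_ON(cpu_buffer,
				       rb_list_head(rb_list_head(tmp->next)->prev) != tmp))
				break;

			if (RB_WARN_ON(cpu_buffer,
				       rb_list_head(rb_list_head(tmp->prev)->next) != tmp))
				break;

			/* Let readers (and resizes) in every so often */
			if (++cnt % 8)
				continue;

			raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
			raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);

			/* Did a reader swap a page in while the lock was dropped? */
			if (local_read(&cpu_buffer->pages_read) != reads) {
				raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
				if (restarts--)
					goto again;
				/* Give up and just say the list is OK */
				return;
			}
		}

		raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
	}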

-- Steve



Re: [PATCH 05/14] tracefs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback

2024-06-10 Thread Steven Rostedt
On Sun,  9 Jun 2024 10:27:17 +0200
Julia Lawall  wrote:

> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index 7c29f4afc23d..338c52168e61 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -53,14 +53,6 @@ static struct inode *tracefs_alloc_inode(struct 
> super_block *sb)
>   return >vfs_inode;
>  }
>  
> -static void tracefs_free_inode_rcu(struct rcu_head *rcu)
> -{
> - struct tracefs_inode *ti;
> -
> - ti = container_of(rcu, struct tracefs_inode, rcu);
> - kmem_cache_free(tracefs_inode_cachep, ti);

Does this work?

tracefs needs to be freed via the tracefs_inode_cachep. Does
kfree_rcu() handle specific frees for objects that were not allocated
via kmalloc()?

-- Steve


> -}
> -
>  static void tracefs_free_inode(struct inode *inode)
>  {
>   struct tracefs_inode *ti = get_tracefs(inode);
> @@ -70,7 +62,7 @@ static void tracefs_free_inode(struct inode *inode)
>	list_del_rcu(&ti->list);
>	spin_unlock_irqrestore(&tracefs_inode_lock, flags);
>  
> -	call_rcu(&ti->rcu, tracefs_free_inode_rcu);
> + kfree_rcu(ti, rcu);
>  }
>  
>  static ssize_t default_read_file(struct file *file, char __user *buf,



