[PATCH v3 0/2] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt


[ subject is still wrong, but is kept to match v2, see patch 2 for correct subject ]


Changes since v2: https://lore.kernel.org/all/20240116211217.968123...@goodmis.org/

Implemented Linus's suggestion to just change the iterate_shared to
use the hard coded inodes.


Steven Rostedt (Google) (2):
  eventfs: Have the inodes all for files and directories all be the same
  eventfs: Do not create dentries nor inodes in iterate_shared


 fs/tracefs/event_inode.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)



[PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-16 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that this is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.

Instead use a single unique inode number for all files and one for all
directories.

Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Christian Brauner 
Cc: Al Viro 
Cc: Ajay Kaher 
Suggested-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..5edf0b96758b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,6 +32,10 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Choose something "unique" ;-) */
+#define EVENTFS_FILE_INODE_INO 0x12c4e37
+#define EVENTFS_DIR_INODE_INO  0x134b2f5
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
inode->i_fop = fop;
inode->i_private = data;
 
+   /* All files will have the same inode number */
+   inode->i_ino = EVENTFS_FILE_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
d_instantiate(dentry, inode);
@@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
inode->i_op = _root_dir_inode_operations;
inode->i_fop = _file_operations;
 
+   /* All directories will have the same inode number */
+   inode->i_ino = EVENTFS_DIR_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
 
-- 
2.43.0





Re: [PATCH v2 2/2] eventfs: Create list of files and directories at dir open

2024-01-16 Thread Steven Rostedt
On Tue, 16 Jan 2024 13:39:38 -0800
Linus Torvalds  wrote:

> I don't understand why you still insist on this pointless open wrapper.

I just liked the consistency of it.

> 
> Just do this all at iterate time. No open wrappers. No nothing. Just
> iterate over the data structures you have.
> 
> IOW, instead of iterating in open to create the array, just iterate in -
> look, it's in the *name* for chrissake - iterate_shared.
> 
> No array. No random allocation for said array.
> 
> If you can iterate at open time, you can iterate at iterate_shared time.
> Stop creating a list that you already have.
> 
> And nobody cares if you do a readdir at the same time as modifying the
> directory. This isn't a real filesystem with strict POSIX semantics.

OK, I can do that.

-- Steve




[PATCH v2 2/2] eventfs: Create list of files and directories at dir open

2024-01-16 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The original eventfs code added a wrapper around the dcache_readdir open
callback that created all the dentries and inodes at open and incremented
their ref counts. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1]: when a kprobe was created during
a dir read, it would create a dentry between the open and the release, and
because the release would decrement the ref counts of all files and
directories, that would include the kprobe directory that was not there to
have its ref count incremented in open. This would cause the ref count to
go negative and later crash the kernel.

To solve this, the dentries and inodes that were created and had their ref
count upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open.

Unfortunately, the dcache_readdir logic was already using the
file->private_data, which is the only field that can be used to pass
information from the open to the release. To work around that, eventfs
created another descriptor with a void pointer to save the dcache_readdir
pointer, and it wrapped all the callbacks so that it could save the list of
entries that had their ref counts incremented in the open and pass it to
the release. The wrapped callbacks would just put back the dcache_readdir
pointer and call the functions it used, so it could still use its data[2].

But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so no public link).
We finally agreed on doing everything within the iterate_shared callback
and leaving dcache_readdir out of it[3]. All the information needed for
getdents() could be created then.

But this ended up being buggy too[4]. The iterate_shared callback was not
the right place to create the dentries and inodes. Even Christian Brauner
had issues with that[5].

An attempt was made to go back to creating the inodes and dentries at the
open, creating an array to store the information in the
file->private_data, and passing that information to the other callbacks.[6]

The difference between that and the original method is that it does not
use dcache_readdir. It also does not up the ref counts of the dentries and
pass them. Instead, it creates an array of a structure that saves the
dentry's name and inode number. That information is used in the
iterate_shared callback, and the array is freed in the dir release. The
dentries and inodes created in the open are not used for the iterate_shared
or release callbacks, just their names and inode numbers.

Linus did not like creating the dentries for the sole purpose of getting
their inode numbers. Instead, he suggested just hard coding them, even
though they would not be unique. With the hard coded inode numbers, there
is no need to create the dentries for readdir().

Now the list created at the dir open can just get the names from the meta
data and use the hard coded inode numbers.

This means that the state of the eventfs at the dir open remains the same
from the point of view of the getdents() function, until the dir is closed.
This also means that the getdents() will not fail. If there's an issue, it
fails at the dir open.

[1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/
[2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/
[3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/
[6] https://lore.kernel.org/all/20240116114711.7e863...@gandalf.local.home/

Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 213 +++
 1 file changed, 149 insertions(+), 64 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5edf0b96758b..05cd825d4cc9 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -57,7 +57,9 @@ enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
  struct dentry *dentry,
  unsigned int flags);
+static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
+static int eventfs_dir_release(struct inode *inode, struct file *file);
 
 static void update_attr(struct eventfs_attr *attr, s

[PATCH v2 0/2] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt
[ subject is wrong, but is kept to match v1, see patch 2 for correct subject ]

Fix reading dir again, this time without creating dentries and inodes.

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240116114711.7e863...@gandalf.local.home


Steven Rostedt (Google) (2):
  eventfs: Have the inodes all for files and directories all be the same
  eventfs: Create list of files and directories at dir open


 fs/tracefs/event_inode.c | 223 +--
 1 file changed, 159 insertions(+), 64 deletions(-)



[PATCH v2 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-16 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that this is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.

Instead use a single unique inode number for all files and one for all
directories.

Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/

Suggested-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..5edf0b96758b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,6 +32,10 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Choose something "unique" ;-) */
+#define EVENTFS_FILE_INODE_INO 0x12c4e37
+#define EVENTFS_DIR_INODE_INO  0x134b2f5
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
inode->i_fop = fop;
inode->i_private = data;
 
+   /* All files will have the same inode number */
+   inode->i_ino = EVENTFS_FILE_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
d_instantiate(dentry, inode);
@@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
inode->i_op = _root_dir_inode_operations;
inode->i_fop = _file_operations;
 
+   /* All directories will have the same inode number */
+   inode->i_ino = EVENTFS_DIR_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
 
-- 
2.43.0





Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt
On Tue, 16 Jan 2024 10:53:36 -0800
Linus Torvalds  wrote:

> Let's at least *try* to move towards a better and simpler world, in other
> words.

OK, but I did just finish the below. I'll save it for another time if the
single inode becomes an issue.

I cropped it to just 31 bits, so I'm not sure how much that would still
expose kernel addresses.

But as you prefer a single inode number, I'll submit that instead, and
store this as one of my many git branches that I hoard.

This patch would give me:

~# ls -i /sys/kernel/tracing/events
2024739183 9p   801480813 iocost281269778 qdisc
 401505785 alarmtimer   641794539 iomap1206040657 ras
1930290274 avc 1239754069 iommu 275816503 raw_syscalls
1108006611 block138722947 io_uring 1758073266 rcu
1528091656 bpf_test_run 694421747 ipi  1384703124 regmap
 557364529 bpf_trace  2765888 irq30507018 regulator
1351362737 cfg80211 369905250 irq_matrix   2078862821 resctrl
 886445085 cgroup  1115008967 irq_vectors   324090967 rpm
 796209754 clk 1448778784 jbd2 1141318882 rseq
 478739129 compaction99410253 kmem 1274783780 rtc
1714397712 cpuhp   2001779594 ksm  1409072807 sched
 720701943 csd   51728677 kyber 347239934 scsi
1824588347 dev 1507974437 libata   1768671172 sd
1640041299 devfreq 1421146927 lock 1167562598 signal
 121399632 dma_fence956825721 mac802114116590 skb
 975908383 drm  738868061 maple_tree   1435501164 smbus
1227060804 e1000e_trace 969175760 mce  1664441095 sock
1770307058 enable  1225375766 mdio 1634697993 spi
1372107864 error_report 744198394 migrate   556841757 swiotlb
1356665351 exceptions   602669807 mmap  400337480 syscalls
[..]

~# ls -i /sys/kernel/tracing/events/sched
1906992193 enable1210369853 sched_process_wait
 592505447 filter1443020725 sched_stat_blocked
1742488280 sched_kthread_stop1213185672 sched_stat_iowait
1674576234 sched_kthread_stop_ret1908510325 sched_stat_runtime
1999376743 sched_kthread_work_execute_end1570203600 sched_stat_sleep
1041533096 sched_kthread_work_execute_start   608113040 sched_stat_wait
 166824308 sched_kthread_work_queue_work 2033295833 sched_stick_numa
 492462616 sched_migrate_task1617500395 sched_swap_numa
1786868196 sched_move_numa   1865216679 sched_switch
 691687388 sched_pi_setprio  1465729401 sched_wait_task
1610977146 sched_process_exec 969941227 sched_wake_idle_without_ipi
 610128037 sched_process_exit  84398030 sched_wakeup
2139284699 sched_process_fork 750220489 sched_wakeup_new
 169172804 sched_process_free1024064201 sched_waking

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a0598eb3e26e..57448bf4acb4 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +33,9 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Used for making inode numbers */
+static siphash_key_t inode_key;
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -57,6 +61,28 @@ static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
 static int eventfs_dir_release(struct inode *inode, struct file *file);
 
+/* Copied from scripts/kconfig/symbol.c */
+static unsigned strhash(const char *s)
+{
+   /* fnv32 hash */
+   unsigned hash = 2166136261U;
+   for (; *s; s++)
+   hash = (hash ^ *s) * 0x01000193;
+   return hash;
+}
+
+/* Just try to make something consistent and unique */
+static int eventfs_inode_ino(struct eventfs_inode *ei, const char *name)
+{
+   unsigned long sip = (unsigned long)ei;
+
+   sip += strhash(name);
+   sip = siphash_1u32((int)sip, _key);
+
+   /* keep it positive */
+   return sip & ((1U << 31) - 1);
+}
+
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
unsigned int ia_valid = iattr->ia_valid;
@@ -491,6 +517,8 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
mutex_unlock(_mutex);
 
dentry = create_file(name, mode, attr, parent, data, fops);
+   if (!IS_ERR_OR_NULL(dentry))
+   dentry->d_inode->i_ino = eventfs_inode_ino(ei, name);
 
mutex_lock(_mutex);
 
@@ -585,6 +613,8 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
mutex_unlock(_mutex);
 
dentry = create_dir(ei, parent);
+   if (!IS_ERR_OR_NULL(dentry))
+ 

Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt
On Tue, 16 Jan 2024 10:21:49 -0800
Linus Torvalds  wrote:

> Here's a clue: just fix your inode numbers.
> 
> I can think of many ways to do it. Here's a couple:
> 
>  - use a fixed inode number for all inodes. It's fine. Really. You might
> confuse some programs that still do getcwd() the legacy way, but hey,
> nobody cares
> 
>  - just put the inode number in the same data structure everything else is in
>
> 
>  - make the inode number be a hash of the address of your data structure.
> That's actually the closest to a traditional "real" inode number, which is
> just an index to some on-disk thing
> 
> I'm sure there are other *trivial* solutions.
> 
> None of this is an excuse to misuse dentries.
> 
> Try the first one - one single inode number - first. You shouldn't be doing
> iget() anyway, so why do you care so deeply about a number that makes no
> sense and nobody should care about?

It was me being paranoid that using the same inode number would break user
space. If that is not a concern, then I'm happy to just make it either the
same, or maybe just hash the ei and the name it is associated with.

If I do not fully understand how something is used, I try hard to make it
act the same as it does for other use cases. That is, I did all this to
keep inodes unique and consistent because I did not know if it would break
something if I didn't.

Removing that requirement does make it much easier to implement readdir.

I think I'll do the hashing, just because I'm still paranoid that something
might still break if they are all the same.

-- Steve



Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt
On Tue, 16 Jan 2024 13:12:28 -0500
Steven Rostedt  wrote:

> Maybe I can just use a hash to generate the inode numbers from the name?
> Hopefully there will be no collisions. Then I don't need the dentry
> creation at all.

Maybe I could use a hash of the address of the meta data to create the
inode number? That may be unique enough.

-- Steve



Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt
On Tue, 16 Jan 2024 09:55:15 -0800
Linus Torvalds  wrote:

> [ html crud because I still don't have power or real Internet, just trying
> to keep an eye on things on my phone. Mailing lists removed to avoid
> bounces, please put them back in replies that don't have my horrible
> formatting ]
> 
> No.
> 
> Christ, you're making things worse again
> 
> The reason for the bug is that you're still messing with the dentries at
> readdir() time.

I may have deleted the comment, but the only reason I created the
inodes/dentries was to keep a consistent inode number, as it is assigned
at the time the inodes and dentries are created.

Ah I did delete the comment, but it is still applicable :-/

   /*
-* Need to create the dentries and inodes to have a consistent
-* inode number.
+* Need to make a struct eventfs_dent array, start by
+* allocating enough for just the files, which is a fixed
+* array. Then use realloc via add_entry() for the directories
+* which is stored in a linked list.
 */

So if for some reason, user space did a getdents() and used the inode
numbers to match what it found, they would likely be the same.

> 
> Just stop it. Readdir should not be creating dentries. Readdir should not
> be even *looking* at dentries. You're not a virtual filesystem, the
> dentries are just caches for filename lookup, and have *nothing* to do with
> readdir.

Actually, it's not looking at them. I did it as a way to just have the
inode numbers be consistent.

dents[cnt].ino = d->d_inode->i_ino;

Yes, that's the only reason I create them. The dentry/inode is not used for
anything outside of that.

> 
> So just iterate over your own internal data structures in readdir. DO NOT
> CREATE DENTRIES OR INODES FOR READDIR.
> 
> I've told you before, and I'll tell you again: either you are a real and
> proper virtual filesystem and you let the vfs layer manage *everything*,
> and the dentries and inodes are all you have. Or you are a *real*
> filesystem and you maintain your own data structures and the dentries and
> inodes are just the in-memory caches.
> 
> This "do both" is UNACCEPTABLE.
> 

The dentries were created for the inode numbers so that I did not need to
add them to meta data. They are generated at creation time.

I don't know how important inode numbers are. If they can all just have
random inode numbers and it doesn't break user space, where an inode number
will be one value at one read and another shortly after, is that going to
cause a problem?

Maybe I can just use a hash to generate the inode numbers from the name?
Hopefully there will be no collisions. Then I don't need the dentry
creation at all.

I do realize that if the dentries get freed due to reclaim and recreated,
their inode numbers will be different. But for a virtual file system, I
don't know how important having consistent inode numbers is.

-- Steve




[PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The original eventfs code added a wrapper around the dcache_readdir open
callback that created all the dentries and inodes at open and incremented
their ref counts. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1]: when a kprobe was created during
a dir read, it would create a dentry between the open and the release, and
because the release would decrement the ref counts of all files and
directories, that would include the kprobe directory that was not there to
have its ref count incremented in open. This would cause the ref count to
go negative and later crash the kernel.

To solve this, the dentries and inodes that were created and had their ref
count upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open.

Unfortunately, the dcache_readdir logic was already using the
file->private_data, which is the only field that can be used to pass
information from the open to the release. To work around that, eventfs
created another descriptor with a void pointer to save the dcache_readdir
pointer, and it wrapped all the callbacks so that it could save the list of
entries that had their ref counts incremented in the open and pass it to
the release. The wrapped callbacks would just put back the dcache_readdir
pointer and call the functions it used, so it could still use its data[2].

But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so no public link).
We finally agreed on doing everything within the iterate_shared callback
and leaving dcache_readdir out of it[3]. All the information needed for
getdents() could be created then.

But this ended up being buggy too[4]. The iterate_shared callback was not
the right place to create the dentries and inodes. Even Christian Brauner
had issues with that[5].

The real fix should be to go back to creating the inodes and dentries at
the open, create an array to store the information in the
file->private_data, and pass that information to the other callbacks.

The difference between this and the original method is that it does not
use dcache_readdir. It also does not up the ref counts of the dentries and
pass them. Instead, it creates an array of a structure that saves the
dentry's name and inode number. That information is used in the
iterate_shared callback, and the array is freed in the dir release. The
dentries and inodes created in the open are not used for the iterate_shared
or release callbacks, just their names and inode numbers.

This means that the state of the eventfs at the dir open remains the same
from the point of view of the getdents() function, until the dir is closed.
This also means that the getdents() will not fail. If there's an issue, it
fails at the dir open.

[1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/
[2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/
[3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/

Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot 
Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 209 ---
 1 file changed, 153 insertions(+), 56 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..50b37f31d835 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -53,7 +53,9 @@ enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
  struct dentry *dentry,
  unsigned int flags);
+static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
+static int eventfs_dir_release(struct inode *inode, struct file *file);
 
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
@@ -212,7 +214,9 @@ static const struct inode_operations eventfs_file_inode_operations = {
 
 static const struct file_operations eventfs_file_operations = {
.read   = generic_read_dir,
+   .open   = eventfs_dir_open,
.iterate_shared = eventfs_iterate,
+   .release= eventfs_dir_release,
.llseek = generic_file_llseek,
 };
 
@@ -706,34 +710,71 @@ static struct dentry *eventfs_root_lo

Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-15 Thread Steven Rostedt
On Mon, 15 Jan 2024 17:29:09 +
Vincent Donnefort  wrote:

> > 
> > What needs to be done, and feel free to add this as a separate patch,
> > is to have checks where snapshot is used.
> > 
> >   (All errors return -EBUSY)
> > 
> > Before allowing mapping, check to see if:
> > 
> >  1) the current tracer has "use_max_tr" set.
> >  2) any event has a "snapshot" trigger set
> >  3) Any function has a "snapshot" command set  
> 
> Could we sum this up with a single check of allocate_snapshot? If that is
> allocated, it's probably because we'll be using it?

Not always. It can be allocated at any time and never used.

I'd like to keep the allocation of the snapshot buffer agnostic to if
the buffer is mapped or not, especially since it can be allocated at
boot up and never used. Several of my boxes have "alloc_snapshot" on
the command line even though I don't always use it. But we could update
the output of reading the "snapshot" file to:

 ~# cat /sys/kernel/tracing/snapshot 
# tracer: nop
#
#
# ** Snapshot disabled due to the buffer being memory mapped **
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
#  Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
#  (Doesn't have to be '2' works with any number that
#   is not a '0' or '1')

If it is allocated:

 ~# cat /sys/kernel/tracing/snapshot
# tracer: nop
#
# ** Snapshot disabled due to the buffer being memory mapped **
#
# * Snapshot is allocated *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
#  Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
#  (Doesn't have to be '2' works with any number that
#   is not a '0' or '1')


> 
> That would simply add the requirement to echo 0 > snapshot before starting the
> memory map?

I'd rather not depend on that. It just makes it harder to understand why
mapping failed: if you get -EBUSY, it will be hard to know why.

> 
> The opposite could be to let tracing_alloc_snapshot_instance() fail whenever a
> mapping is in place?

Yes, that would fail if mapping is in place.

Because the snapshot being allocated can be something people want, as
it allows the snapshot to happen immediately when needed, I don't want
the fact that it is allocated to prevent mapping. As mapping may just
happen for a small period of time while an application is running.

-- Steve



Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-15 Thread Steven Rostedt
On Mon, 15 Jan 2024 11:09:38 -0500
Steven Rostedt  wrote:

> No. The ring buffer logic should not care if the user of it is swapping
> the entire ring buffer or not. It only cares if parts of the ring
> buffer is being swapped or not. That's not the level of scope it should
> care about. If we do not want a swap to happen in update_max_tr()
> that's not ring_buffer.c's problem. The code to prevent that from
> happening should be 100% in trace.c.

What needs to be done, and feel free to add this as a separate patch,
is to have checks where snapshot is used.

  (All errors return -EBUSY)

Before allowing mapping, check to see if:

 1) the current tracer has "use_max_tr" set.
 2) any event has a "snapshot" trigger set
 3) Any function has a "snapshot" command set

Fail if any of the above is true.

Also in reverse, if the buffer is mapped, then fail:

 1) a tracer being set that has "use_max_tr" set.
 2) a "snapshot" command being set on a function
 3) a "snapshot" trigger being set on an event.

For the last two, we may be able to get away with just the below as well,
adding the tr->flags bit. We could also add a tr->snapshot count to
keep track of everything that is using a snapshot, and if that count is
non-zero, mapping fails.

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..f534f74ae80f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr,
return;
}
 
+   if (tr->flags & TRACE_ARRAY_FL_MAPPED) {
+   trace_array_puts(tr, "*** BUFFER IS MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);


-- Steve



Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-15 Thread Steven Rostedt
On Mon, 15 Jan 2024 15:37:31 +
Vincent Donnefort  wrote:

> > > @@ -5418,6 +5446,11 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
> > >   cpu_buffer_a = buffer_a->buffers[cpu];
> > >   cpu_buffer_b = buffer_b->buffers[cpu];
> > >  
> > > + if (READ_ONCE(cpu_buffer_a->mapped) || READ_ONCE(cpu_buffer_b->mapped)) {
> > > + ret = -EBUSY;
> > > + goto out;
> > > + }  
> > 
> > Ah, this is not enough to stop snapshot.
> > update_max_tr()@kernel/trace/trace.c is used for swapping all CPU buffers
> > and it does not use this function. (But that should use this instead of
> > accessing the buffer directly...)
> > 
> > Maybe we need a ring_buffer_swap() that checks that no cpu_buffer has its
> > mapping bit set.
> > 
> > Thank you,  
> 
> How about, instead of having ring_buffer_swap_cpu() return -EBUSY when
> mapped, we have two functions to block any mapping that trace.c could call?
> 

No. The ring buffer logic should not care if the user of it is swapping
the entire ring buffer or not. It only cares if parts of the ring
buffer are being swapped or not. That's not the level of scope it should
care about. If we do not want a swap to happen in update_max_tr()
that's not ring_buffer.c's problem. The code to prevent that from
happening should be 100% in trace.c.

The trace.c code knows what's being mapped or not. The accounting to
keep a mapped buffer from being swapped should stay in trace.c, and not
add unnecessary implementation coupling between that and ring_buffer.c.

-- Steve



Re: [PATCH v11 4/5] Documentation: tracing: Add ring-buffer mapping

2024-01-14 Thread Steven Rostedt
On Sun, 14 Jan 2024 23:26:43 +0900
Masami Hiramatsu (Google)  wrote:

> Hi Vincent,
> 
> On Thu, 11 Jan 2024 16:17:11 +
> Vincent Donnefort  wrote:
> 
> > It is now possible to mmap() a ring-buffer to stream its content. Add
> > some documentation and a code example.
> > 
> > Signed-off-by: Vincent Donnefort 
> > 
> > diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
> > index 5092d6c13af5..0b300901fd75 100644
> > --- a/Documentation/trace/index.rst
> > +++ b/Documentation/trace/index.rst
> > @@ -29,6 +29,7 @@ Linux Tracing Technologies
> > timerlat-tracer
> > intel_th
> > ring-buffer-design
> > +   ring-buffer-map
> > stm
> > sys-t
> > coresight/index
> > diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
> > new file mode 100644
> > index ..2ba7b5339178
> > --- /dev/null
> > +++ b/Documentation/trace/ring-buffer-map.rst
> > @@ -0,0 +1,105 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==================================
> > +Tracefs ring-buffer memory mapping
> > +==================================
> > +
> > +:Author: Vincent Donnefort 
> > +
> > +Overview
> > +========
> > +Tracefs ring-buffer memory map provides an efficient method to stream data
> > +as no memory copy is necessary. The application mapping the ring-buffer 
> > becomes
> > +then a consumer for that ring-buffer, in a similar fashion to trace_pipe.
> > +
> > +Memory mapping setup
> > +====================
> > +The mapping works with a mmap() of the trace_pipe_raw interface.
> > +
> > +The first system page of the mapping contains ring-buffer statistics and
> > +description. It is referred to as the meta-page. One of the most important fields of
> > +the meta-page is the reader. It contains the subbuf ID which can be safely read
> > +by the mapper (see ring-buffer-design.rst).
> > +
> > +The meta-page is followed by all the subbufs, ordered by ascending ID. It is
> > +therefore effortless to know where the reader starts in the mapping:
> > +
> > +.. code-block:: c
> > +
> > +reader_id = meta->reader->id;
> > +reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
> > +
> > +When the application is done with the current reader, it can get a new one using
> > +the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
> > +the meta-page fields.
> > +
> > +Limitations
> > +===========
> > +When a mapping is in place on a Tracefs ring-buffer, it is not possible to
> > +resize it (either by increasing the entire size of the ring-buffer or of
> > +each subbuf). It is also not possible to use snapshot or splice.  
> 
> I've played with the sample code.
> 
> - "free_buffer" just doesn't work when the process has mmapped the ring buffer.
> - After mmapping the buffers, taking a snapshot makes the IOCTL return an error.
> 
> OK, but I rather like to fail snapshot with -EBUSY when the buffer is mmaped.
> 
> > +
> > +Concurrent readers (either another application mapping that ring-buffer or the
> > +kernel with trace_pipe) are allowed but not recommended. They will compete for
> > +the ring-buffer and the output is unpredictable.
> > +
> > +Example
> > +=======
> > +
> > +.. code-block:: c
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include 
> > +
> > +#include 
> > +#include 
> > +
> > +#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
> > +
> > +int main(void)
> > +{
> > +int page_size = getpagesize(), fd, reader_id;
> > +unsigned long meta_len, data_len;
> > +struct trace_buffer_meta *meta;
> > +void *map, *reader, *data;
> > +
> > +fd = open(TRACE_PIPE_RAW, O_RDONLY);
> > +if (fd < 0)
> > +exit(EXIT_FAILURE);
> > +
> > +map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
> > +if (map == MAP_FAILED)
> > +exit(EXIT_FAILURE);
> > +
> > +meta = (struct trace_buffer_meta *)map;
> > +meta_len = meta->meta_page_size;
> > +
> > +printf("entries:%lu\n", meta->entries);
> > +printf("overrun:%lu\n", meta->overrun);
> > +printf("read:   %lu\n", meta->read);
> > +printf("subbufs_touched:%lu\n", meta->subbufs_touched);
> > +printf("subbufs_lost:   %lu\n", meta->subbufs_lost);
> > +printf("subbufs_read:   %lu\n", meta->subbufs_read);
> > +printf("nr_subbufs: %u\n", meta->nr_subbufs);
> > +
> > +data_len = meta->subbuf_size * meta->nr_subbufs;
> > +data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, data_len);

The above is buggy. It 

Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-12 Thread Steven Rostedt
On Fri, 12 Jan 2024 10:06:41 -0500
Steven Rostedt  wrote:

> I'm thinking both may be good, as the number of dropped events is not
> added if there's no room to put it at the end of the ring buffer. When
> there's no room, it just sets a flag that there were missed events but
> doesn't give how many events were missed.
> 
> If anything, it should be in both locations: in the sub-buffer header, to
> be consistent with the read/splice version, and in the meta page, where
> it can be accessed if there's no room to add the count.

I think the meta-data page should look like this:

struct trace_buffer_meta {
__u32   meta_page_size;
__u32   meta_struct_len;
 
__u32   subbuf_size;
__u32   nr_subbufs;
 
struct {
__u64   lost_events;
__u32   id;
__u32   read;
} reader;
 
__u64   entries;
__u64   overrun;
__u64   read;

};

1) meta_page_size

  The size of this meta page. It's the first thing the application
  needs in order to know how much to mmap.

2) meta_struct_len

  The actual length of the structure. It's the second thing the
  application will look at to know what version it is dealing with.

3) subbuf_size
4) nr_subbufs

  The next two are the next things the application needs: that is, to
  memory map in all the sub-buffers. With 1) and these, it knows how to
  do that.

The first four are needed for setup, and never change once mapped. The
rest gets updated during the trace.

5) reader
  5a) lost_events
  5b) id
  5c) read

  This is actively needed during tracing. It is what is used to know
  where and how to read the reader sub-buffer.

6) entries
7) overrun
8) read

  These are useful statistics, but probably seldom really looked at.
  They should be at the end.

Also notice that I converted all "unsigned long" to __u64. This is because
it is possible to have a 32 bit userspace running this on top of a 64 bit
kernel. If we were to use "long" it would be inconsistent and buggy.

Now suppose you need subbufs_touched and subbufs_lost. When that gets
opened, it would update the meta_struct_len accordingly, and the structure
would look like:

struct trace_buffer_meta {
__u32   meta_page_size;
__u32   meta_struct_len;
 
__u32   subbuf_size;
__u32   nr_subbufs;
 
struct {
__u64   lost_events;
__u32   id;
__u32   read;
} reader;
 
__u64   entries;
__u64   overrun;
__u64   read;

__u64   subbufs_touched;
__u64   subbufs_lost;
};


Make sense?

-- Steve



Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-12 Thread Steven Rostedt
On Fri, 12 Jan 2024 09:13:02 +
Vincent Donnefort  wrote:

> > > +
> > > + unsigned long   subbufs_touched;
> > > + unsigned long   subbufs_lost;
> > > + unsigned long   subbufs_read;  
> > 
> > Now I'm thinking we may not want this exported, as I'm not sure it's useful.  
> 
> touched and lost are not useful now, but it'll be for my support of the
> hypervisor tracing, that's why I added them already.
> 
> subbufs_read could probably go away though, as even in that case I can
> track that in the reader.

Yeah, but I think we can leave it off for now.

This is something we could extend the structure with. In fact, it can be
different for different buffers!

That is, since they are pretty meaningless for the normal ring buffer, I
don't think we need to export it. But if it's useful for your hypervisor
buffer, it can export it. It would be a good test to know if the extension
works. Of course, that means if the normal ring buffer needs more info, it
must also include this as well, as it will already be part of the extended
structure.


> 
> > 
> > Vincent, talking with Mathieu, he was suggesting that we only export what
> > we really need, and I don't think we need the above. Especially since they
> > are not exposed in the stats file.
> > 
> >   
> > > +
> > > + struct {
> > > + unsigned long   lost_events;
> > > + __u32   id;
> > > + __u32   read;
> > > + } reader;  
> > 
> > The above is definitely needed, as all of it is used to read the
> > reader-page of the sub-buffer.
> > 
> > lost_events is the number of lost events that happened before this
> > sub-buffer was swapped out.
> > 
> > Hmm, Vincent, I just notice that you update the lost_events as:
> >   
> > > +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> > > +{
> > > + struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> > > +
> > > + WRITE_ONCE(meta->entries, local_read(&cpu_buffer->entries));
> > > + WRITE_ONCE(meta->overrun, local_read(&cpu_buffer->overrun));
> > > + WRITE_ONCE(meta->read, cpu_buffer->read);
> > > +
> > > + WRITE_ONCE(meta->subbufs_touched, local_read(&cpu_buffer->pages_touched));
> > > + WRITE_ONCE(meta->subbufs_lost, local_read(&cpu_buffer->pages_lost));
> > > + WRITE_ONCE(meta->subbufs_read, local_read(&cpu_buffer->pages_read));
> > > +
> > > + WRITE_ONCE(meta->reader.read, cpu_buffer->reader_page->read);
> > > + WRITE_ONCE(meta->reader.id, cpu_buffer->reader_page->id);
> > > + WRITE_ONCE(meta->reader.lost_events, cpu_buffer->lost_events);
> > > +}  
> > 
> > The lost_events may need to be handled differently, as it doesn't always
> > get cleared. So it may be stale data.  
> 
> My idea was to check this value after the ioctl(). If > 0 then events were
> lost between that ioctl() and the previous swap.
> 
> But now if with "[PATCH] ring-buffer: Have mmapped ring buffer keep track of
> missed events" we put this information in the page itself, we can get rid of
> this field.
> 

I'm thinking both may be good, as the number of dropped events is not
added if there's no room to put it at the end of the ring buffer. When
there's no room, it just sets a flag that there were missed events but
doesn't give how many events were missed.

If anything, it should be in both locations: in the sub-buffer header, to
be consistent with the read/splice version, and in the meta page, where
it can be accessed if there's no room to add the count.

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Steven Rostedt
On Fri, 12 Jan 2024 08:53:44 -0500
Steven Rostedt  wrote:

> > // We managed to open the directory so we have permission to list
> > // directory entries in "xfs".
> > fd = open("/sys/kernel/tracing/events/xfs");
> > 
> > // Remove ownership so we can't open the directory anymore
> > chown("/sys/kernel/tracing/events/xfs", 0, 0);
> > 
> > // Or just remove exec bit for the group and restrict to owner
> > chmod("/sys/kernel/tracing/events/xfs", 700);
> > 
> > // Drop caches to force an eventfs_root_lookup() on everything
> > write("/proc/sys/vm/drop_caches", "3", 1);  
> 
This requires opening the directory, then having its permissions
changed, and then immediately dropping the caches.
> 
> > 
> > // Returns 0 even though directory has a lot of entries and we should be
> > // able to list them
> > getdents64(fd, ...);  
> 
> And do we care?

Hmm, maybe the issue you have is with the inconsistency of the action?

For this to fail, it would require the admin to both change the
permission and flush the caches. If you don't flush the caches then the
task with the dir open can still read it regardless, provided the
dentries were already created.

In that case I'm fine if we change the creation of the dentries to not
check the permission.

But for now, it's just a weird side effect that I don't really see how
it would affect any user's workflow.

-- Steve




Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Steven Rostedt
On Fri, 12 Jan 2024 09:27:24 +0100
Christian Brauner  wrote:

> On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> > On Thu, 11 Jan 2024 22:01:32 +0100
> > Christian Brauner  wrote:
> >   
> > > What I'm pointing out in the current logic is that the caller is
> > > taxed twice:
> > > 
> > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> > > _because_ the permission check in lookup_one_len() is the exact
> > > same permission check again that the vfs has done
> > > inode_permission(MAY_EXEC, "xfs").  
> > 
> > As I described in: 
> > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/
> > 
> > The eventfs files below "events" doesn't need the .permissions callback at
> > all. It's only there because the "events" inode uses it.
> > 
> > The .permissions call for eventfs has:  
> 
> It doesn't matter whether there's a ->permission handler. If you don't
> add one explicitly the VFS will simply call generic_permission():
> 
> inode_permission()
> -> do_inode_permission()
>    {
>        if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
>                if (likely(inode->i_op->permission))
>                        return inode->i_op->permission(idmap, inode, mask);
>
>                /* This gets set once for the inode lifetime */
>                spin_lock(&inode->i_lock);
>                inode->i_opflags |= IOP_FASTPERM;
>                spin_unlock(&inode->i_lock);
>        }
>        return generic_permission(idmap, inode, mask);
>    }

Yes I know that, because that's where I knew what to call in the non
"events" dir case.

> 
> > Anyway, the issue is with "events" directory and remounting, because like
> > the tracefs system, the inode and dentry for "events" is created at boot
> > up, before the mount happens. The VFS layer is going to check the
> > permissions of its inode and dentry, which will be incorrect if the mount
> > was mounted with a "gid" option.  
> 
> The gid option has nothing to do with this and it is just handled fine
> if you remove the second permission checking in (2).

I guess I'm confused to what you are having an issue with. Is it just
that the permission check gets called twice?

> 
> You need to remove the inode_permission() code from
> eventfs_start_creating(). It is just an internal lookup and the fact
> that you have it in there allows userspace to break readdir on the
> eventfs portions of tracefs as I've shown in the parts of the mail that
> you cut off.

That's because I didn't see how it was related to the way I fixed the
mount=gid issue. Are you only concerned because of the check in
eventfs_start_creating()?

Yes, you posted code that would make things act funny for some code
that I see no real world use case for. Yeah, it may not act "properly"
but I'm not sure that's bad.

Here, I'll paste it back:

> // We managed to open the directory so we have permission to list
> // directory entries in "xfs".
> fd = open("/sys/kernel/tracing/events/xfs");
> 
> // Remove ownership so we can't open the directory anymore
> chown("/sys/kernel/tracing/events/xfs", 0, 0);
> 
> // Or just remove exec bit for the group and restrict to owner
> chmod("/sys/kernel/tracing/events/xfs", 700);
> 
> // Drop caches to force an eventfs_root_lookup() on everything
> write("/proc/sys/vm/drop_caches", "3", 1);

This requires opening the directory, then having its permissions
changed, and then immediately dropping the caches.

> 
> // Returns 0 even though directory has a lot of entries and we should be
> // able to list them
> getdents64(fd, ...);

And do we care?

Since tracing exposes internal kernel information, perhaps this is a
feature and not a bug. If someone had access to the tracing system and
you wanted to stop them, and they had a directory open that they no
longer have access to, you don't want them to see what's left in the
directory.

In other words, I like the idea that the getdents64(fd, ...) will fail!

If there's a file underneath that wasn't changed, and the admin thought
that just keeping the top directory permissions off is good enough,
then an attacker having that directory open before the directory had
its permissions changed is a way to still have access to the files
below it.

> 
> And the failure is in the inode_permission(MAY_EXEC, "xfs") call in
> lookup_one_len() in eventfs_start_creating() which now fails.

And I think is a good thing!

Again, tracefs is special. It gives you access and possibly control to
the kernel behavior. I like the fact that as soon as someone loses
permission to a directory, they immediately lose it.

-- Steve



Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-11 Thread Steven Rostedt
On Thu, 11 Jan 2024 11:34:58 -0500
Mathieu Desnoyers  wrote:


> The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1],
> and has a lot of similarities with this patch series.
> 
> LTTng has the notion of "subbuffer id" to allow atomically exchanging a
> "reader" extra subbuffer with the subbuffer to be read. It implements
> "get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader)
> to move the currently read subbuffer position. [2]
> 
> It would not hurt to compare your approach to LTTng and highlight
> similarities/differences, and the rationale for the differences.
> 
> Especially when it comes to designing kernel ABIs, it's good to make sure
> that all bases are covered, because those choices will have lasting impacts.
> 
> Thanks,
> 
> Mathieu
> 
> [1] 
> https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c
> [2] 
> https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c
> 

Hi Mathieu,

Thanks for sharing!

As we discussed a little bit in the tracing meeting, we do some things
differently but very similarly too ;-)

The similarity is that all the sub-buffers are mapped. You have a
reader-sub-buffer as well.

The main difference is that you use an ioctl() that returns where to find
the reader-sub-buffer, where our ioctl() is just "I'm done, get me a new
reader-sub-buffer". Ours will update the meta page.

Our meta page looks like this:

> +struct trace_buffer_meta {
> + unsigned long   entries;
> + unsigned long   overrun;
> + unsigned long   read;

If you start tracing with: trace-cmd start -e sched_switch and do:

 ~ cat /sys/kernel/tracing/per_cpu/cpu0/stats 
entries: 14
overrun: 0
commit overrun: 0
bytes: 992
oldest event ts: 84844.825372
now ts: 84847.102075
dropped events: 0
read events: 0

You'll see similar to the above.

 entries = entries
 overrun = overrun
 read = read

The above "read" is total number of events read.

Pretty straightforward ;-)


> +
> + unsigned long   subbufs_touched;
> + unsigned long   subbufs_lost;
> + unsigned long   subbufs_read;

Now I'm thinking we may not want this exported, as I'm not sure it's useful.

Vincent, talking with Mathieu, he was suggesting that we only export what
we really need, and I don't think we need the above. Especially since they
are not exposed in the stats file.


> +
> + struct {
> + unsigned long   lost_events;
> + __u32   id;
> + __u32   read;
> + } reader;

The above is definitely needed, as all of it is used to read the
reader-page of the sub-buffer.

lost_events is the number of lost events that happened before this
sub-buffer was swapped out.

Hmm, Vincent, I just notice that you update the lost_events as:

> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> +
> + WRITE_ONCE(meta->entries, local_read(&cpu_buffer->entries));
> + WRITE_ONCE(meta->overrun, local_read(&cpu_buffer->overrun));
> + WRITE_ONCE(meta->read, cpu_buffer->read);
> +
> + WRITE_ONCE(meta->subbufs_touched, local_read(&cpu_buffer->pages_touched));
> + WRITE_ONCE(meta->subbufs_lost, local_read(&cpu_buffer->pages_lost));
> + WRITE_ONCE(meta->subbufs_read, local_read(&cpu_buffer->pages_read));
> +
> + WRITE_ONCE(meta->reader.read, cpu_buffer->reader_page->read);
> + WRITE_ONCE(meta->reader.id, cpu_buffer->reader_page->id);
> + WRITE_ONCE(meta->reader.lost_events, cpu_buffer->lost_events);
> +}

The lost_events may need to be handled differently, as it doesn't always
get cleared. So it may be stale data.


> +
> + __u32   subbuf_size;
> + __u32   nr_subbufs;

This is the information needed to read the mapped ring buffer.

> +
> + __u32   meta_page_size;
> + __u32   meta_struct_len;

The ring buffer gets mapped after "meta_page_size" bytes, and this
structure's length is "meta_struct_len", which will change if we add new
data to it.

> +};

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-11 Thread Steven Rostedt
On Thu, 11 Jan 2024 22:01:32 +0100
Christian Brauner  wrote:

> What I'm pointing out in the current logic is that the caller is
> taxed twice:
> 
> (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> (2) And again when you call lookup_one_len() in eventfs_start_creating()
> _because_ the permission check in lookup_one_len() is the exact
> same permission check again that the vfs has done
> inode_permission(MAY_EXEC, "xfs").

As I described in: 
https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/

The eventfs files below "events" doesn't need the .permissions callback at
all. It's only there because the "events" inode uses it.

The .permissions call for eventfs has:

static int eventfs_permission(struct mnt_idmap *idmap,
  struct inode *inode, int mask)
{
set_top_events_ownership(inode);
return generic_permission(idmap, inode, mask);
}

Where set_top_events_ownership() is a nop for everything but the
"events" directory.

I guess I could have two ops:

static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
.permission = eventfs_permission,
};

static const struct inode_operations eventfs_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
};

And use the second one for all dentries below the root, but I figured it's
not that big of a deal if I called the permissions on all. Perhaps I should
do it with two?

Anyway, the issue is with "events" directory and remounting, because like
the tracefs system, the inode and dentry for "events" is created at boot
up, before the mount happens. The VFS layer is going to check the
permissions of its inode and dentry, which will be incorrect if the mount
was mounted with a "gid" option.


-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt  wrote:

> On Wed, 10 Jan 2024 08:07:46 -0500
> Steven Rostedt  wrote:
> 
> > Or are you saying that I don't need the ".permission" callback, because
> > eventfs does it when it creates the inodes? But for eventfs to know what
> > the permissions changes are, it uses .getattr and .setattr.  
> 
> OK, if your main argument is that we do not need .permission, I agree with
> you. But that's a trivial change and doesn't affect the complexity that
> eventfs is doing. In fact, removing the "permission" check is simply this
> patch:
> 
> --
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index fdff53d5a1f8..f2af07a857e2 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
>   return 0;
>  }
>  
> -static int eventfs_permission(struct mnt_idmap *idmap,
> -   struct inode *inode, int mask)
> -{
> - set_top_events_ownership(inode);
> - return generic_permission(idmap, inode, mask);
> -}
> -
>  static const struct inode_operations eventfs_root_dir_inode_operations = {
>   .lookup = eventfs_root_lookup,
>   .setattr= eventfs_set_attr,
>   .getattr= eventfs_get_attr,
> - .permission = eventfs_permission,
>  };
>  
>  static const struct inode_operations eventfs_file_inode_operations = {
> --
> 
> I only did that because Linus mentioned it, and I thought it was needed.
> I'll apply this patch too, as it appears to work with this code.

Oh, eventfs files and directories don't need the .permissions because its
inodes and dentries are not created until accessed. But the "events"
directory itself has its dentry and inode created at boot up, but still
uses the eventfs_root_dir_inode_operations. So the .permissions is still
needed!

If you look at the "set_top_events_ownership()" function, it has:

/* The top events directory doesn't get automatically updated */
if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
return;

That is, it does nothing if the entry is not the "events" directory. It
falls back to the default "->permission()" function for everything but the
top level "events" directory.

But this and .getattr are still needed for the events directory, because it
suffers the same issue as the other tracefs entries. That is, its inodes
and dentries are created at boot up before it is mounted. So if the mount
has gid=1000, it will be ignored.

The .getattr is called by "stat" which ls does. So after boot up if you
just do:

 # chmod 0750 /sys/kernel/tracing/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ls /sys/kernel/tracing/events/
9pext4iomapmodule  raw_syscalls  thermal
alarmtimerfib iommumsr rcu   thp
avc   fib6io_uring napiregmaptimer
block filelockipi  neigh   regulator tlb
bpf_test_run  filemap irq  net resctrl   udp
bpf_trace ftrace  irq_matrix   netfs   rpm   virtio_gpu
[...]

The above works because "ls" does a stat() on the directory first, which
does a .getattr() call that updates the permissions of the existing "events"
directory inode.

  BUT!

If I had used my own getents() program that has:

fd = openat(AT_FDCWD, argv[1], O_RDONLY);
if (fd < 0)
perror("openat");

n = getdents64(fd, buf, BUF_SIZE);
if (n < 0)
perror("getdents64");

Where it calls the openat() without doing a stat first, and after boot, had done:

 # chmod 0750 /sys/kernel/tracing/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ./getdents /sys/kernel/tracing/events
openat: Permission denied
getdents64: Bad file descriptor

It errors because the "events" inode permission hasn't been updated yet.
Now after getting the above error, if I do the "ls" and then run it again:

 $ ls /sys/kernel/tracing/events > /dev/null
 $ ./getdents /sys/kernel/tracing/events
enable
header_page
header_event
initcall
vsyscall
syscalls

it works!

so no, I can't remove that .permissions callback from eventfs.

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt  wrote:

> I'll apply this patch too, as it appears to work with this code.

I meant "appears to work without this code".

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 08:07:46 -0500
Steven Rostedt  wrote:

> Or are you saying that I don't need the ".permission" callback, because
> eventfs does it when it creates the inodes? But for eventfs to know what
> the permissions changes are, it uses .getattr and .setattr.

OK, if your main argument is that we do not need .permission, I agree with
you. But that's a trivial change and doesn't affect the complexity that
eventfs is doing. In fact, removing the "permission" check is simply this
patch:

--
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..f2af07a857e2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
return 0;
 }
 
-static int eventfs_permission(struct mnt_idmap *idmap,
- struct inode *inode, int mask)
-{
-   set_top_events_ownership(inode);
-   return generic_permission(idmap, inode, mask);
-}
-
 static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
-   .permission = eventfs_permission,
 };
 
 static const struct inode_operations eventfs_file_inode_operations = {
--

I only did that because Linus mentioned it, and I thought it was needed.
I'll apply this patch too, as it appears to work with this code.

Thanks!

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt
On Wed, 10 Jan 2024 12:45:36 +0100
Christian Brauner  wrote:

> So say you do:
> 
> mkdir /sys/kernel/tracing/instances/foo
> 
> After this has returned we know everything we need to know about the new
> tracefs instance including the ownership and the mode of all inodes in
> /sys/kernel/tracing/instances/foo/events/* and below precisely because
> ownership is always inherited from the parent dentry and recorded in the
> metadata struct eventfs_inode.
> 
> So say someone does:
> 
> open("/sys/kernel/tracing/instances/foo/events/xfs");
> 
> and say this is the first time that someone accesses that events/
> directory.
> 
> When the open pathwalk is done, the vfs will determine via
> 
> [1] may_lookup(inode_of(events))
> 
> whether you are able to list entries such as "xfs" in that directory.
> The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> it ends up calling i_op->eventfs_root_lookup(events).
> 
> At this point tracefs/eventfs adds the inodes for all entries in that
> "events" directory including "xfs" based on the metadata it recorded
> during the mkdir. Since now someone is actually interested in them. And
> it initializes the inodes with ownership and everything and adds the
> dentries that belong into that directory.
> 
> Nothing here depends on the permissions of the caller. The only
> permission that mattered was done in the VFS in [1]. If the caller has
> permissions to enter a directory they can lookup and list its contents.
> And its contents were determined/fixed etc. when mkdir was called.
> 
> So we just need to add the required objects into the caches (inode,
> dentry) whose addition we intentionally deferred until someone actually
> needed them.
> 
> So, eventfs_root_lookup() now initializes the inodes with the ownership
> from the stored metadata or from the parent dentry and splices in inodes
> and dentries. No permission checking is needed for this because it is
> always a recheck of what the vfs did in [1].
> 
> We now return to the vfs and path walk continues to the final component
> that you actually want to open which is that "xfs" directory in this
> example. We check the permissions on that inode via may_open("xfs") and
> we open that directory returning an fd to userspace ultimately.
> 
> (I'm going by memory since I need to step out the door.)

So, let's say we do:

 chgrp -R rostedt /sys/kernel/tracing/

But I don't want rostedt to have access to xfs

 chgrp -R root /sys/kernel/tracing/events/xfs

Both actions will create the inodes and dentries of all files and
directories (because of "-R"). But once that is done, the ref counts go to
zero. They stay around until reclaim. But then I open Chrome ;-) and it
reclaims all the dentries and inodes, so we are back to where we were on
boot.

Now as rostedt I do:

 ls /sys/kernel/tracing/events/xfs

The VFS layer doesn't know if I have permission to that or not, because all
the inodes and dentries have been freed. It has to call back to eventfs to
find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
recreate the inodes with the proper permissions.

Or are you saying that I don't need the ".permission" callback, because
eventfs does it when it creates the inodes? But for eventfs to know what
the permissions changes are, it uses .getattr and .setattr.

-- Steve



[PATCH v3] ring-buffer: Have mmapped ring buffer keep track of missed events

2024-01-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

While testing libtracefs on the mmapped ring buffer, the test that checks
if missed events are accounted for failed when using the mapped buffer.
This is because the mapped page does not update the missed events that
were dropped because the writer filled up the ring buffer before the
reader could catch it.

Add the missed events to the reader page/sub-buffer when the IOCTL is done
and a new reader page is acquired.

Note that all accesses to the reader_page via rb_page_commit() had to be
switched to rb_page_size(). rb_page_size() was just a copy of
rb_page_commit(), but it now masks out the RB_MISSED bits. This is needed
because the mapped reader page is still active in the ring buffer code, and
where it reads the commit field of the bpage for the size, it must now
mask it, otherwise the missed bits that are now set would corrupt the
size returned.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v2: 
https://lore.kernel.org/linux-trace-kernel/20240109173552.0b40d...@gandalf.local.home/

- Remove left-over "dump_stack()" that was used for debugging

 kernel/trace/ring_buffer.c | 60 ++
 1 file changed, 54 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 07dae67424a9..603956bc1e59 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -312,6 +312,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event 
*event)
 /* Missed count stored at end */
 #define RB_MISSED_STORED   (1 << 30)
 
+#define RB_MISSED_MASK (3 << 30)
+
 struct buffer_data_page {
u64  time_stamp;/* page time stamp */
local_t  commit;/* write committed index */
@@ -2303,7 +2305,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 /* Size is determined by what has been committed */
 static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
 {
-   return rb_page_commit(bpage);
+   return rb_page_commit(bpage) & ~RB_MISSED_MASK;
 }
 
 static __always_inline unsigned
@@ -3930,7 +3932,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu 
*cpu_buffer)
return true;
 
/* Reader should exhaust content in reader page */
-   if (reader->read != rb_page_commit(reader))
+   if (reader->read != rb_page_size(reader))
return false;
 
/*
@@ -4401,7 +4403,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
return ((iter->head_page == commit_page && iter->head >= commit) ||
(iter->head_page == reader && commit_page == head_page &&
 head_page->read == commit &&
-iter->head == rb_page_commit(cpu_buffer->reader_page)));
+iter->head == rb_page_size(cpu_buffer->reader_page)));
 }
 EXPORT_SYMBOL_GPL(ring_buffer_iter_empty);
 
@@ -5745,7 +5747,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
event = rb_reader_event(cpu_buffer);
 
read = reader->read;
-   commit = rb_page_commit(reader);
+   commit = rb_page_size(reader);
 
/* Check if any events were dropped */
missed_events = cpu_buffer->lost_events;
@@ -5822,7 +5824,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
-   cpu_buffer->read_bytes += rb_page_commit(reader);
+   cpu_buffer->read_bytes += rb_page_size(reader);
 
/* swap the pages */
rb_init_page(bpage);
@@ -6349,6 +6351,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer 
*buffer, int cpu,
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *reader;
+   unsigned long missed_events;
unsigned long reader_size;
unsigned long flags;
 
@@ -6374,9 +6378,53 @@ int ring_buffer_map_get_reader(struct trace_buffer 
*buffer, int cpu)
goto out;
}
 
-   if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
+   reader = rb_get_reader_page(cpu_buffer);
+   if (WARN_ON(!reader))
goto out;
 
+   /* Check if any events were dropped */
+   missed_events = cpu_buffer->lost_events;
+
+   if (cpu_buffer->reader_page != cpu_buffer->commit_page) {
+   if (missed_events) {
+   struct buffer_data_page *bpage = reader->page;
+   unsigned int commit;
+   /*
+* Use the real_end for the data size,
+* This gives us a chance to store the lost events
+* on the page.
+
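The hunk above is cut off in the archive, but the mechanism it describes can be sketched in plain userspace C: the two high bits of the commit word flag that events were missed and whether the count was stored after the data, and the reader masks those bits off to recover the real size. The macro names mirror kernel/trace/ring_buffer.c, but everything below (SUBBUF_SIZE, the helper names) is an illustrative model, not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model of the RB_MISSED bits from kernel/trace/ring_buffer.c */
#define RB_MISSED_EVENTS (1u << 31)  /* events were dropped */
#define RB_MISSED_STORED (1u << 30)  /* the count is stored after the data */
#define RB_MISSED_MASK   (3u << 30)
#define SUBBUF_SIZE      64u         /* toy sub-buffer size (assumption) */

/* Model of rb_page_size(): mask the flag bits out of the commit word. */
static uint32_t page_size_model(uint32_t commit)
{
	return commit & ~RB_MISSED_MASK;
}

/* Store the missed count after the real end of the data if it fits,
 * and set the flag bits in the commit word. */
static uint32_t store_missed(uint8_t *page, uint32_t real_end, uint32_t missed)
{
	uint32_t commit = real_end | RB_MISSED_EVENTS;

	if (real_end + sizeof(missed) <= SUBBUF_SIZE) {
		memcpy(page + real_end, &missed, sizeof(missed));
		commit |= RB_MISSED_STORED;
	}
	return commit;
}

/* Reader side: recover the count using the masked size as the offset. */
static uint32_t read_missed(const uint8_t *page, uint32_t commit)
{
	uint32_t missed = 0;

	if (commit & RB_MISSED_STORED)
		memcpy(&missed, page + page_size_model(commit), sizeof(missed));
	return missed;
}
```

This also shows why every reader-page size check had to move to the masking helper: once the flag bits are set, the raw commit word no longer equals the data size.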

Re: [PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions

2024-01-09 Thread Steven Rostedt
On Wed, 10 Jan 2024 08:42:05 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 9 Jan 2024 15:13:51 +
> Vincent Donnefort  wrote:
> 
> > > > @@ -388,6 +389,7 @@ struct rb_irq_work {
> > > > boolwaiters_pending;
> > > > boolfull_waiters_pending;
> > > > boolwakeup_full;
> > > > +   boolis_cpu_buffer;  
> > > 
> > > I think 'is_cpu_buffer' is a bit unclear (or generic),
> > > what about 'meta_page_update'?  
> > 
> > Hum not sure about that change. This was really to identify if parent of
> > rb_irq_work is a cpu_buffer or a trace_buffer. It can be a cpu_buffer 
> > regardless
> > of the need to update the meta-page.  
> 
> Yeah, I just meant that is "for_cpu_buffer", not "rb_irq_work is_cpu_buffer".
> So when reading the code, I just felt uncomfortable.
> 

How about "in_cpu_buffer" as that is what it is.

struct ring_buffer_per_cpu {
struct rb_irq_work {
boolin_cpu_buffer;
}
}

Would that make you feel more comfortable? ;-)

-- Steve




[PATCH v2] ring-buffer: Have mmapped ring buffer keep track of missed events

2024-01-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

While testing libtracefs on the mmapped ring buffer, the test that checks
if missed events are accounted for failed when using the mapped buffer.
This is because the mapped page does not update the missed events that
were dropped because the writer filled up the ring buffer before the
reader could catch it.

Add the missed events to the reader page/sub-buffer when the IOCTL is done
and a new reader page is acquired.

Note that all accesses to the reader_page via rb_page_commit() had to be
switched to rb_page_size(). rb_page_size() was just a copy of
rb_page_commit(), but it now masks out the RB_MISSED bits. This is needed
because the mapped reader page is still active in the ring buffer code, and
where it reads the commit field of the bpage for the size, it must now
mask it, otherwise the missed bits that are now set would corrupt the
size returned.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/all/20240109205112.74225-5-rost...@goodmis.org/

- Forgot to update the "real end" so that it has room to add the number of
  missed events. Without that update, it would just say "missed events" but not
  how many events were missed.

 kernel/trace/ring_buffer.c | 61 ++
 1 file changed, 55 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 07dae67424a9..b111f0694805 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -312,6 +312,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event 
*event)
 /* Missed count stored at end */
 #define RB_MISSED_STORED   (1 << 30)
 
+#define RB_MISSED_MASK (3 << 30)
+
 struct buffer_data_page {
u64  time_stamp;/* page time stamp */
local_t  commit;/* write committed index */
@@ -2303,7 +2305,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 /* Size is determined by what has been committed */
 static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
 {
-   return rb_page_commit(bpage);
+   return rb_page_commit(bpage) & ~RB_MISSED_MASK;
 }
 
 static __always_inline unsigned
@@ -2769,6 +2771,7 @@ static void rb_add_timestamp(struct ring_buffer_per_cpu 
*cpu_buffer,
once++;
pr_warn("Ring buffer clock went backwards: %llu 
-> %llu\n",
info->before, info->ts);
+   dump_stack();
}
} else
rb_check_timestamp(cpu_buffer, info);
@@ -3930,7 +3933,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu 
*cpu_buffer)
return true;
 
/* Reader should exhaust content in reader page */
-   if (reader->read != rb_page_commit(reader))
+   if (reader->read != rb_page_size(reader))
return false;
 
/*
@@ -4401,7 +4404,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
return ((iter->head_page == commit_page && iter->head >= commit) ||
(iter->head_page == reader && commit_page == head_page &&
 head_page->read == commit &&
-iter->head == rb_page_commit(cpu_buffer->reader_page)));
+iter->head == rb_page_size(cpu_buffer->reader_page)));
 }
 EXPORT_SYMBOL_GPL(ring_buffer_iter_empty);
 
@@ -5745,7 +5748,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
event = rb_reader_event(cpu_buffer);
 
read = reader->read;
-   commit = rb_page_commit(reader);
+   commit = rb_page_size(reader);
 
/* Check if any events were dropped */
missed_events = cpu_buffer->lost_events;
@@ -5822,7 +5825,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
-   cpu_buffer->read_bytes += rb_page_commit(reader);
+   cpu_buffer->read_bytes += rb_page_size(reader);
 
/* swap the pages */
rb_init_page(bpage);
@@ -6349,6 +6352,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer 
*buffer, int cpu,
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *reader;
+   unsigned long missed_events;
unsigned long reader_size;
unsigned long flags;
 
@@ -6374,9 +6379,53 @@ int ring_buffer_map_get_reader(struct trace_buffer 
*buffer, int cpu)
goto out;
}
 
-   if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
+   reader = rb_get_reader_page(cpu_buffer);
+   if (WARN_ON(!reader))
goto out;
 
+ 

[PATCH] ring-buffer: Have mmapped ring buffer keep track of missed events

2024-01-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

While testing libtracefs on the mmapped ring buffer, the test that checks
if missed events are accounted for failed when using the mapped buffer.
This is because the mapped page does not update the missed events that
were dropped because the writer filled up the ring buffer before the
reader could catch it.

Add the missed events to the reader page/sub-buffer when the IOCTL is done
and a new reader page is acquired.

Note that all accesses to the reader_page via rb_page_commit() had to be
switched to rb_page_size(). rb_page_size() was just a copy of
rb_page_commit(), but it now masks out the RB_MISSED bits. This is needed
because the mapped reader page is still active in the ring buffer code, and
where it reads the commit field of the bpage for the size, it must now
mask it, otherwise the missed bits that are now set would corrupt the
size returned.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 54 +-
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 07dae67424a9..90af4f28671f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -312,6 +312,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event 
*event)
 /* Missed count stored at end */
 #define RB_MISSED_STORED   (1 << 30)
 
+#define RB_MISSED_MASK (3 << 30)
+
 struct buffer_data_page {
u64  time_stamp;/* page time stamp */
local_t  commit;/* write committed index */
@@ -2303,7 +2305,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 /* Size is determined by what has been committed */
 static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
 {
-   return rb_page_commit(bpage);
+   return rb_page_commit(bpage) & ~RB_MISSED_MASK;
 }
 
 static __always_inline unsigned
@@ -2769,6 +2771,7 @@ static void rb_add_timestamp(struct ring_buffer_per_cpu 
*cpu_buffer,
once++;
pr_warn("Ring buffer clock went backwards: %llu 
-> %llu\n",
info->before, info->ts);
+   dump_stack();
}
} else
rb_check_timestamp(cpu_buffer, info);
@@ -3930,7 +3933,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu 
*cpu_buffer)
return true;
 
/* Reader should exhaust content in reader page */
-   if (reader->read != rb_page_commit(reader))
+   if (reader->read != rb_page_size(reader))
return false;
 
/*
@@ -4401,7 +4404,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
return ((iter->head_page == commit_page && iter->head >= commit) ||
(iter->head_page == reader && commit_page == head_page &&
 head_page->read == commit &&
-iter->head == rb_page_commit(cpu_buffer->reader_page)));
+iter->head == rb_page_size(cpu_buffer->reader_page)));
 }
 EXPORT_SYMBOL_GPL(ring_buffer_iter_empty);
 
@@ -5745,7 +5748,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
event = rb_reader_event(cpu_buffer);
 
read = reader->read;
-   commit = rb_page_commit(reader);
+   commit = rb_page_size(reader);
 
/* Check if any events were dropped */
missed_events = cpu_buffer->lost_events;
@@ -5822,7 +5825,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
-   cpu_buffer->read_bytes += rb_page_commit(reader);
+   cpu_buffer->read_bytes += rb_page_size(reader);
 
/* swap the pages */
rb_init_page(bpage);
@@ -6349,6 +6352,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer 
*buffer, int cpu,
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *reader;
+   unsigned long missed_events;
unsigned long reader_size;
unsigned long flags;
 
@@ -6374,9 +6379,46 @@ int ring_buffer_map_get_reader(struct trace_buffer 
*buffer, int cpu)
goto out;
}
 
-   if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
+   reader = rb_get_reader_page(cpu_buffer);
+   if (WARN_ON(!reader))
goto out;
 
+   /* Check if any events were dropped */
+   missed_events = cpu_buffer->lost_events;
+
+   if (cpu_buffer->reader_page != cpu_buffer->commit_page) {
+   if (missed_events) {
+   struct buffer_data_page *bpage = reader->page;
+

Re: [PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions

2024-01-09 Thread Steven Rostedt
On Tue, 9 Jan 2024 15:13:51 +
Vincent Donnefort  wrote:

> > > @@ -388,6 +389,7 @@ struct rb_irq_work {
> > >   boolwaiters_pending;
> > >   boolfull_waiters_pending;
> > >   boolwakeup_full;
> > > + boolis_cpu_buffer;  
> > 
> > I think 'is_cpu_buffer' is a bit unclear (or generic),
> > what about 'meta_page_update'?  
> 
> Hum not sure about that change. This was really to identify if parent of
> rb_irq_work is a cpu_buffer or a trace_buffer. It can be a cpu_buffer 
> regardless
> of the need to update the meta-page.

Yeah, this was added because the irq_work is called with the rb_work for
both the cpu_buffer and the global struct tracing_buffer object. The
meta_page is only available to the cpu_buffer and does not exist on the
struct trace_buffer, so this is checked before doing a "container_of()" on
the wrong structure. Both the cpu_buffer and the global buffer call the
same function with the rb_work structure.

So "is_cpu_buffer" is the right terminology as it's unrelated to the meta page:

struct trace_buffer {
[..]
struct rb_irq_work  irq_work;
[..]
};

struct trace_buffer *buffer;
buffer->irq_work.is_cpu_buffer = false;

struct ring_buffer_per_cpu {
[..]
struct rb_irq_work  irq_work;
[..]
}

struct ring_buffer_per_cpu *cpu_buffer;
cpu_buffer->irq_work.is_cpu_buffer = true;


[..]
init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
[..]
init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);

// both the buffer and cpu_buffer call rb_wake_up_waiters()

static void rb_wake_up_waiters(struct irq_work *work)
{
struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, 
work);

// This container_of() gets rbwork which is the rb_irq_work structure that
// both buffer and cpu_buffer have.

if (rbwork->is_cpu_buffer) {
struct ring_buffer_per_cpu *cpu_buffer;

cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu,
  irq_work);

// The above crashes if done by the buffer and not the cpu_buffer.
// The "is_cpu_buffer" is there to differentiate the two rb_work entities.
// It is a way to say this is safe to do the above "container_of()".

/*
 * If the waiter is a cpu_buffer, this might be due to a
 * userspace mapping. Let's update the meta-page.
 */
rb_update_meta_page(cpu_buffer);
}
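The walkthrough above can be modeled in userspace with an ordinary container_of() macro: two different structures embed the same work struct, and a flag in the embedded struct tells the shared callback which container_of() is safe. The struct and field names below are illustrative stand-ins for the kernel's, a sketch of the pattern rather than the real code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace version of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct work { bool is_cpu_buffer; };

/* Two distinct containers embedding the same work struct. */
struct global_buffer  { int id;  struct work irq_work; };
struct per_cpu_buffer { int cpu; struct work irq_work; };

/* Shared callback: the flag decides which container_of() is valid. */
static int callback(struct work *w)
{
	if (w->is_cpu_buffer) {
		struct per_cpu_buffer *c =
			container_of(w, struct per_cpu_buffer, irq_work);
		return c->cpu;
	}
	struct global_buffer *g =
		container_of(w, struct global_buffer, irq_work);
	return g->id;
}
```

Without the flag, calling the wrong container_of() would compute a pointer into the wrong structure, which is exactly the crash the is_cpu_buffer field prevents.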


-- Steve



Re: [PATCH v10 0/2] Introducing trace buffer mapping by user-space

2024-01-09 Thread Steven Rostedt


Hi Masami, thanks for looking at this.

On Tue, 9 Jan 2024 22:04:45 +0900
Masami Hiramatsu (Google)  wrote:

> > The tracing ring-buffers can be stored on disk or sent to network
> > without any copy via splice. However the later doesn't allow real time
> > processing of the traces. A solution is to give userspace direct access
> > to the ring-buffer pages via a mapping. An application can now become a
> > consumer of the ring-buffer, in a similar fashion to what trace_pipe
> > offers.  
> 
> I think this is very nice feature. But this series seems just a feature,
> no document and no example code. Can you add 2 patches to add those?
> I know libtracefs already provide a support code, but I think it is
> better to have a test code under tools/testing/selftests/ring_buffer.

Yeah, we should have sample code and a test.

> 
> I also wonder what happen if other operation (e.g. taking snapshot) happens
> while mmaping the ring buffer.

Hmm, good point. We should disable snapshots when mapped, and also prevent
mapping with latency tracer if we are not already doing that.

-- Steve



Re: possible deadlock in __perf_install_in_context

2024-01-08 Thread Steven Rostedt


This is related to perf not tracing.

-- Steve


On Tue, 09 Jan 2024 08:34:15 +0800
"Ubisectech Sirius"  wrote:

> Dear concerned.
> Greetings!
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> Recently, our team has discovered an issue in Linux kernel 6.7.0-g0dd3ee311255.
> technical details:
> 1. Issue Description: possible deadlock in __perf_install_in_context
> 2. Stack Dump: 
> [ 158.488994][ T8029] Call Trace:
> [ 158.489411][ T8029] 
> arch/x86/events/intel/../perf_event.h:1166 arch/x86/events/intel/core.c:2799)
> [ 158.498427][ T8029] x86_pmu_start (arch/x86/events/core.c:1516)
> [ 158.499034][ T8029] x86_pmu_enable (arch/x86/events/core.c:1331 
> (discriminator 2))
> [ 158.499601][ T8029] perf_ctx_enable (kernel/events/core.c:703 
> (discriminator 2))
> [ 158.500171][ T8029] ctx_resched (kernel/events/core.c:2741)
> [ 158.500733][ T8029] __perf_install_in_context (kernel/events/core.c:2807)
> [ 158.502106][ T8029] remote_function (kernel/events/core.c:92 
> kernel/events/core.c:72)
> [ 158.503364][ T8029] generic_exec_single (kernel/smp.c:134 (discriminator 3) 
> kernel/smp.c:404 (discriminator 3))
> [ 158.503995][ T8029] smp_call_function_single (kernel/smp.c:647)
> (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:77 
> ./arch/x86/include/asm/irqflags.h:135 lib/percpu_counter.c:102)
> [ 158.512958][ T8029] perf_install_in_context (kernel/events/core.c:2909 
> (discriminator 1))
> [ 158.515717][ T8029] __do_sys_perf_event_open (kernel/events/core.c:1443 
> kernel/events/core.c:12747)
> [ 158.518483][ T8029] do_syscall_64 (arch/x86/entry/common.c:52 
> arch/x86/entry/common.c:83)
> [ 158.519281][ T8029] entry_SYSCALL_64_after_hwframe 
> (arch/x86/entry/entry_64.S:129)
> [ 158.519991][ T8029] RIP: 0033:0x7f04a0c9cf29
> [ 158.520536][ T8029] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
> 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48
> All code
> 
>  0: 00 c3 add %al,%bl
>  2: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
>  9: 00 00 00
>  c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>  11: 48 89 f8 mov %rdi,%rax
>  14: 48 89 f7 mov %rsi,%rdi
>  17: 48 89 d6 mov %rdx,%rsi
>  1a: 48 89 ca mov %rcx,%rdx
>  1d: 4d 89 c2 mov %r8,%r10
>  20: 4d 89 c8 mov %r9,%r8
>  23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
>  28: 0f 05 syscall
>  2a:* 48 3d 01 f0 ff ff cmp $0xf001,%rax <-- trapping instruction
>  30: 73 01 jae 0x33
>  32: c3 ret
>  33: 48 8b 0d 37 8f 0d 00 mov 0xd8f37(%rip),%rcx # 0xd8f71
>  3a: f7 d8 neg %eax
>  3c: 64 89 01 mov %eax,%fs:(%rcx)
>  3f: 48 rex.W
> Code starting with the faulting instruction
> ===
>  0: 48 3d 01 f0 ff ff cmp $0xf001,%rax
>  6: 73 01 jae 0x9
>  8: c3 ret
>  9: 48 8b 0d 37 8f 0d 00 mov 0xd8f37(%rip),%rcx # 0xd8f47
>  10: f7 d8 neg %eax
>  12: 64 89 01 mov %eax,%fs:(%rcx)
>  15: 48 rex.W
> [ 158.522837][ T8029] RSP: 002b:7ffe5f1174b8 EFLAGS: 0246 ORIG_RAX: 
> 012a
> [ 158.523848][ T8029] RAX: ffda RBX:  RCX: 
> 7f04a0c9cf29
> [ 158.524797][ T8029] RDX:  RSI:  RDI: 
> 20004740
> [ 158.525738][ T8029] RBP: 7ffe5f1174c0 R08:  R09: 
> 7ffe5f1174f0
> [ 158.526717][ T8029] R10:  R11: 0246 R12: 
> 5597d067d180
> [ 158.527661][ T8029] R13:  R14:  R15: 
> 
> [ 158.528611][ T8029] 
> [ 158.530059][ T8029]
> [ 158.530364][ T8029] ==
> [ 158.531146][ T8029] WARNING: possible circular locking dependency detected
> [ 158.531881][ T8029] 6.7.0-g0dd3ee311255 #6 Not tainted
> [ 158.532457][ T8029] --
> [ 158.533256][ T8029] poc/8029 is trying to acquire lock:
> [ 158.533880][ T8029] 88801ca53018 (>lock){}-{2:2}, at: 
> __perf_event_task_sched_out (kernel/events/core.c:3573 
> kernel/events/core.c:3676)
> [ 158.535067][ T8029]
> [ 158.535067][ T8029] but task is already holding lock:
> [ 158.535925][ T8029] 88802d23c758 (&rq->__lock){-.-.}-{2:2}, at: 
> raw_spin_rq_lock_nested (kernel/sched/core.c:574)
> [ 158.537001][ T8029]
> [ 158.537001][ T8029] which lock already depends on the new lock.
> [ 158.537001][ T8029]
> [ 158.538196][ T8029]
> [ 158.538196][ T8029] the existing dependency chain (in reverse order) is:
> [ 158.539200][ T8029]
> [ 158.539200][ T8029] -> #3 (&rq->__lock){-.-.}-{2:2}:
> [ 158.540081][ T8029] _raw_spin_lock_nested (kernel/locking/spinlock.c:379)
> [ 158.540772][ T8029] raw_spin_rq_lock_nested (kernel/sched/core.c:574)
> [ 158.541471][ T8029] task_fork_fair (kernel/sched/sched.h:1222 
> kernel/sched/sched.h:1581 kernel/sched/sched.h:1664 kernel/sched/fair.c:12586)
> [ 158.542092][ T8029] sched_cgroup_fork (kernel/sched/core.c:4814)
> [ 158.542772][ 

Re: [PATCH] tracing user_events: Simplify user_event_parse_field() parsing

2024-01-08 Thread Steven Rostedt
On Mon, 8 Jan 2024 17:13:12 -0500
Steven Rostedt  wrote:

> On Mon, 8 Jan 2024 21:47:44 +
> Beau Belgrave  wrote:
> 
> > > - len = str_has_prefix(field, "__rel_loc ");
> > > - if (len)
> > > - goto skip_next;
> > > + if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
> > > + !(len = str_has_prefix(field, "__data_loc ")) &&
> > > + !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
> > > + !(len = str_has_prefix(field, "__rel_loc "))) {
> > > + goto parse;
> > > + }
> > 
> > This now triggers a checkpatch error:
> > ERROR: do not use assignment in if condition  
> 
> What a horrible message.
> 
> > #1184: FILE: kernel/trace/trace_events_user.c:1184:
> > +   if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
> > 
> > I personally prefer to keep these files fully checkpatch clean.  
> 
> I stopped using checkpatch years ago because I disagreed with so much of
> it :-p
>   (Including this message)

Note that checkpatch is a guideline and not a rule. The general rule is, if
the code looks worse when applying the checkpatch rule, don't do it.

- Steve



Re: [PATCH] tracing user_events: Simplify user_event_parse_field() parsing

2024-01-08 Thread Steven Rostedt
On Mon, 8 Jan 2024 21:47:44 +
Beau Belgrave  wrote:

> > -   len = str_has_prefix(field, "__rel_loc ");
> > -   if (len)
> > -   goto skip_next;
> > +   if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
> > +   !(len = str_has_prefix(field, "__data_loc ")) &&
> > +   !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
> > +   !(len = str_has_prefix(field, "__rel_loc "))) {
> > +   goto parse;
> > +   }  
> 
> This now triggers a checkpatch error:
> ERROR: do not use assignment in if condition

What a horrible message.

> #1184: FILE: kernel/trace/trace_events_user.c:1184:
> +   if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
> 
> I personally prefer to keep these files fully checkpatch clean.

I stopped using checkpatch years ago because I disagreed with so much of it :-p
  (Including this message)

> However, I did test these changes under the self-tests and it passed.
> 
> Do they bug you that much? :)

No big deal if you prefer the other way. I was just doing an audit of
str_has_prefix() to see what code could be cleaned up that uses it, and I
found this code.

If you prefer to limit your code to "checkpatch clean", I'll leave it alone.

-- Steve



[PATCH] tracing user_events: Simplify user_event_parse_field() parsing

2024-01-08 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Instead of having a bunch of if statements with:

   len = str_has_prefix(field, "__data_loc unsigned ");
   if (len)
   goto skip_next;

   len = str_has_prefix(field, "__data_loc ");
   if (len)
   goto skip_next;

   len = str_has_prefix(field, "__rel_loc unsigned ");
   if (len)
   goto skip_next;

   len = str_has_prefix(field, "__rel_loc ");
   if (len)
   goto skip_next;

goto parse;

 skip_next:

Consolidate them into a negative check and jump to parse if all the
str_has_prefix() calls fail. If one succeeds, execution just continues with
len equal to the length of the matched prefix:

   if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
   !(len = str_has_prefix(field, "__data_loc ")) &&
   !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
   !(len = str_has_prefix(field, "__rel_loc "))) {
   goto parse;
   }

 skip_next:
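The consolidated check can be sketched in userspace C. str_has_prefix() is re-implemented here only for illustration (the kernel's version is a static inline in include/linux/string.h), and match_loc_prefix() is an invented wrapper standing in for the surrounding parse logic:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal model of the kernel's str_has_prefix(): returns the prefix
 * length on a match, 0 otherwise. */
static size_t str_has_prefix(const char *str, const char *prefix)
{
	size_t len = strlen(prefix);

	return strncmp(str, prefix, len) == 0 ? len : 0;
}

/* The consolidated pattern from the patch: try each prefix in order
 * (longer variants first, so "__data_loc unsigned " wins over
 * "__data_loc "); len keeps the length of whichever one matched. */
static size_t match_loc_prefix(const char *field)
{
	size_t len;

	if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
	    !(len = str_has_prefix(field, "__data_loc ")) &&
	    !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
	    !(len = str_has_prefix(field, "__rel_loc ")))
		return 0;	/* would "goto parse" in the original */

	return len;		/* would "goto skip_next" with this len */
}
```

Note that the ordering matters: the "unsigned" variants must be tested before their shorter counterparts, since str_has_prefix() on the shorter string would also match.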

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_events_user.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 9365ce407426..ce0c5f1ded48 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1175,23 +1175,13 @@ static int user_event_parse_field(char *field, struct 
user_event *user,
goto skip_next;
}
 
-   len = str_has_prefix(field, "__data_loc unsigned ");
-   if (len)
-   goto skip_next;
-
-   len = str_has_prefix(field, "__data_loc ");
-   if (len)
-   goto skip_next;
-
-   len = str_has_prefix(field, "__rel_loc unsigned ");
-   if (len)
-   goto skip_next;
-
-   len = str_has_prefix(field, "__rel_loc ");
-   if (len)
-   goto skip_next;
+   if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
+   !(len = str_has_prefix(field, "__data_loc ")) &&
+   !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
+   !(len = str_has_prefix(field, "__rel_loc "))) {
+   goto parse;
+   }
 
-   goto parse;
 skip_next:
type = field;
field = strpbrk(field + len, " ");
-- 
2.43.0




Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Steven Rostedt
On Mon, 8 Jan 2024 12:32:46 +0100
Christian Brauner  wrote:

> On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> > On Sun, 7 Jan 2024 13:29:12 -0500
> > Steven Rostedt  wrote:
> >   
> > > > 
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.
> > > 
> > > I don't think so.  
> > 
> > Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> > It exists without that. Although one rationale to do eventfs was so  
> 
> Every instance/foo/ tracefs instances also contains an events directory
> and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's
> not a separate filesystem. Both eventfs and tracefs are on the same
> single, system wide superblock.
> 
> > that the instance directories wouldn't recreate the same 10thousands
> > event inodes and dentries for every mkdir done.  
> 
> I know but that's irrelevant to what I'm trying to tell you.
> 
> A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
> instance. With or without the on-demand dentry and inode creation for
> the eventfs portion that tracefs "instance" has now been created in its
> entirety including all the required information for someone to later
> come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.
> 
> All you've done is to defer the addition of the dentries and inodes when
> someone does actually look at the events directory of the tracefs
> instance.
> 
> Whether you choose to splice in the dentries and inodes for the eventfs
> portion during lookup and readdir or if you had chosen to not do the
> on-demand thing at all and the entries were created at the same time as
> the mkdir call are equivalent from the perspective of permission
> checking.
> 
> If you have the required permissions to look at the events directory
> then there's no reason why listing the directory entries in there should
> fail. This can't even happen right now.

Ah, I think I know where the confusion lies. The tracing information in
kernel/trace/*.c doesn't keep track of permissions. It relies totally on
fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with
'gid=xxx' then it's up to tracefs to maintain that information and not the
tracing subsystem. The tracing subsystem only gives the "default"
permissions (before boot finishes).

The difference between normal file systems and pseudo file systems like
debugfs and tracefs is that normal file systems keep the permission
information stored on the external device. That is, when the inodes and
dentries are created, the information is retrieved from the stored file
system.

I think this may actually be a failure of debugfs (and tracefs as it was
based on debugfs), in that the inodes and dentries are created at the same
time the "files" backing them are. Which is normally at boot up and before
the file system is mounted.

That is, inodes and dentries are actually coupled with the data they
represent. It's not a cache for a back store like a hard drive partition.

To create a file in debugfs you do:

struct dentry *debugfs_create_file(const char *name, umode_t mode,
   struct dentry *parent, void *data,
   const struct file_operations *fops)

That is, you pass the name, the mode, the parent dentry, the data, and the
fops, and that will create an inode and dentry (which is returned).

This happens at boot up before user space is running and before debugfs is
even mounted.

Because debugfs is mostly for debugging, people don't care about how it's
mounted. It is usually restricted to root only access. Especially since
there's a lot of sensitive information that shouldn't be exposed to
non-privileged users.

The reason tracefs came about is that people asked me to be able to have
access to tracing without needing to even enable debugfs. They also want to
easily make it accessible to non root users and talking with Kees Cook, he
recommended using ACL. But because it inherited a lot from debugfs, I
started doing these tricks like walking the dentry tree to make it work a
bit better. Because the dentries and inodes were created before mount, I
had to play these tricks.

But as Linus pointed out, that was the wrong way to do that. The right way
was to use .getattr and .permission callbacks to figure out what the
permissions to the files are.

This has nothing to do with the creation of the files, it's about who has
access to the files that the inodes point to.

This sounds like another topic to bring up at LSFMM ;-)  "Can we
standardize pseudo file systems like debugfs and tracefs to act more like
real file systems, and have inodes and dentries act as cache and not be so
coupled to the data?"

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Steven Rostedt
On Mon, 8 Jan 2024 12:04:54 +0100
Christian Brauner  wrote:

> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.  
> > 
> > I don't think so.  
> 
> I'm very well aware that the dentries and inodes aren't created during
> mkdir but the complete directory layout is determined. You're just
> splicing in dentries and inodes during lookup and readdir.
> 
> If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> do a lookup/readdir on
> 
> ls -al /sys/kernel/tracing/instances/foo/events
> 
> Why should the creation of the dentries and inodes ever fail due to a
> permission failure?

They shouldn't.

> The vfs did already verify that you had the required
> permissions to list entries in that directory. Why should filling up
> /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't.
> That tracefs instance would be half-functional. And again, right now
> that inode_permission() check cannot even fail.

And it shouldn't. But without dentries and inodes, how does the VFS know
who is allowed to open the files?

-- Steve



Re: [PATCH] tracing histograms: Simplify parse_actions() function

2024-01-08 Thread Steven Rostedt
On Mon, 8 Jan 2024 10:32:14 +0200
Andy Shevchenko  wrote:

> On Mon, Jan 8, 2024 at 3:31 AM Steven Rostedt  wrote:
> >
> > From: "Steven Rostedt (Google)" 
> >
> > The parse_actions() function uses 'len = str_has_prefix()' to test which
> > action is in the string being parsed. But then it goes and repeats the
> > logic for each different action. This logic can be simplified and
> > duplicate code can be removed as 'len' contains the length of the found
> > prefix which should be used for all actions.  
> 
> > Link: https://lore.kernel.org/all/20240107112044.6702c...@gandalf.local.home/
> >
> > Signed-off-by: Steven Rostedt (Google)   
> 
> If you want Link to be formally a tag, you should drop the following
> blank line.

The link is for humans not for parsers.

> 
> 
> > +   if ((len = str_has_prefix(str, "onmatch(")))
> > +   hid = HANDLER_ONMATCH;
> > +   else if ((len = str_has_prefix(str, "onmax(")))
> > +   hid = HANDLER_ONMAX;
> > +   else if ((len = str_has_prefix(str, "onchange(")))
> > +   hid = HANDLER_ONCHANGE;  
> 
> The repeating check for ( might be moved out as well after this like
> 
>   if (str[len] != '(') {
> // not sure if you need data to be assigned here as well
> ret = -EINVAL;
> ...
>   }
> 

Not sure how that makes it any better. It adds more code. I could start
with checking the "on" before checking for "match", "max" and "change", but
that just makes it more complex.

-- Steve




[PATCH] tracing histograms: Simplify parse_actions() function

2024-01-07 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The parse_actions() function uses 'len = str_has_prefix()' to test which
action is in the string being parsed. But then it goes and repeats the
logic for each different action. This logic can be simplified and
duplicate code can be removed as 'len' contains the length of the found
prefix which should be used for all actions.

Link: https://lore.kernel.org/all/20240107112044.6702c...@gandalf.local.home/

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_events_hist.c | 49 
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 5ecf3c8bde20..6ece1308d36a 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -4805,36 +4805,35 @@ static int parse_actions(struct hist_trigger_data *hist_data)
int len;
 
for (i = 0; i < hist_data->attrs->n_actions; i++) {
+   enum handler_id hid = 0;
+   char *action_str;
+
str = hist_data->attrs->action_str[i];
 
-   if ((len = str_has_prefix(str, "onmatch("))) {
-   char *action_str = str + len;
+   if ((len = str_has_prefix(str, "onmatch(")))
+   hid = HANDLER_ONMATCH;
+   else if ((len = str_has_prefix(str, "onmax(")))
+   hid = HANDLER_ONMAX;
+   else if ((len = str_has_prefix(str, "onchange(")))
+   hid = HANDLER_ONCHANGE;
 
-   data = onmatch_parse(tr, action_str);
-   if (IS_ERR(data)) {
-   ret = PTR_ERR(data);
-   break;
-   }
-   } else if ((len = str_has_prefix(str, "onmax("))) {
-   char *action_str = str + len;
+   action_str = str + len;
 
-   data = track_data_parse(hist_data, action_str,
-   HANDLER_ONMAX);
-   if (IS_ERR(data)) {
-   ret = PTR_ERR(data);
-   break;
-   }
-   } else if ((len = str_has_prefix(str, "onchange("))) {
-   char *action_str = str + len;
+   switch (hid) {
+   case HANDLER_ONMATCH:
+   data = onmatch_parse(tr, action_str);
+   break;
+   case HANDLER_ONMAX:
+   case HANDLER_ONCHANGE:
+   data = track_data_parse(hist_data, action_str, hid);
+   break;
+   default:
+   data = ERR_PTR(-EINVAL);
+   break;
+   }
 
-   data = track_data_parse(hist_data, action_str,
-   HANDLER_ONCHANGE);
-   if (IS_ERR(data)) {
-   ret = PTR_ERR(data);
-   break;
-   }
-   } else {
-   ret = -EINVAL;
+   if (IS_ERR(data)) {
+   ret = PTR_ERR(data);
break;
}
 
-- 
2.42.0




Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Steven Rostedt
On Sun, 7 Jan 2024 13:29:12 -0500
Steven Rostedt  wrote:

> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.  
> 
> I don't think so.

Just to make it clear: eventfs has nothing to do with mkdir instance/foo.
It exists without that. Although one rationale for doing eventfs was so
that the instance directories wouldn't recreate the same tens of thousands
of event inodes and dentries for every mkdir done.

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Steven Rostedt
On Sun, 7 Jan 2024 13:42:39 +0100
Christian Brauner  wrote:
> 
> So, I tried to do an exploratory patch even though I promised myself not
> to do it. But hey...
> 
> Some notes:
> 
> * Permission handling for idmapped mounts is done completely in the
>   VFS. That's the case for all filesystems that don't have a custom
>   ->permission() handler. So there's nothing to do for us here.  
> 
> * Idmapped mount handling for ->getattr() is done completely by the VFS
>   if the filesystem doesn't have a custom ->getattr() handler. So we're
>   done here.
> 
> * Tracefs doesn't support attribute changes via ->setattr() (chown(),
>   chmod() etc.). So there's nothing to do here.
> 
> * Eventfs does support attribute changes via ->setattr(). But it relies
>   on simple_setattr() which is already idmapped mount aware. So there's
>   nothing for us to do.
> 
> * Ownership is inherited from the parent inode (tracefs) or optionally
>   from stashed ownership information (eventfs). That means the idmapping
>   is irrelevant here. It's similar to the "inherit gid from parent
>   directory" logic we have in some circumstances. TL;DR nothing to do
>   here as well.

The reason ownership is inherited from the parent is because the inodes
are created at boot up before user space starts.

eventfs inodes are created on demand after user space so it needs to
either check the "default" ownership and permissions, or if the user
changed a specific file/directory, it must save it and use that again
if the inode/dentry are reclaimed and then referenced again and
recreated.

> 
> * Tracefs supports the creation of instances from userspace via mkdir.
>   For example,
> 
>   mkdir /sys/kernel/tracing/instances/foo
> 
>   And here the idmapping is relevant so we need to make the helpers
>   aware of the idmapping.
> 
>   I just went and plumbed this through to most helpers.
> 
> There's some subtlety in eventfs. Afaict, the directories and files for
> the individual events are created on-demand during lookup or readdir.
> 
> The ownership of these events is again inherited from the parent inode
> or recovered from stored state. In both cases the actual idmapping is
> irrelevant.
> 
> The callchain here is:
> 
> eventfs_root_lookup("xfs", "events")
> -> create_{dir,file}_dentry("xfs", "events")
>-> create_{dir,file}("xfs", "events")
>   -> eventfs_start_creating("xfs", "events")
>  -> lookup_one_len("xfs", "events")  
> 
> And the subtlety is that lookup_one_len() does permission checking on
> the parent inode (IOW, if you want a dentry for "blech" under "events"
> it'll do a permission check on events->d_inode) for exec permissions
> and then goes on to give you a new dentry.
> 
> Usually this call would have to be changed to lookup_one() and the
> idmapping be handed down to it. But I think that's irrelevant here.
> 
> Lookup generally doesn't need to be aware of idmappings at all. The
> permission checking is done purely in the vfs via may_lookup() and the
> idmapping is irrelevant because we always initialize inodes with the
> filesystem level ownership (see the idmappings.rst) documentation if
> you're interested in excessive details (otherwise you get inode aliases
> which you really don't want).
> 
> For tracefs it would not matter for lookup per se but only because
> tracefs seemingly creates inodes/dentries during lookup (and readdir()).

tracefs creates the inodes/dentries at boot up, it's eventfs that does
it on demand during lookup.

For inodes/dentries:

 /sys/kernel/tracing/* is all created at boot up, except for "events".
 /sys/kernel/tracing/events/* is created on demand.

> 
> But imho the permission checking done in current eventfs_root_lookup()
> via lookup_one_len() is meaningless in any way; possibly even
> (conceptually) wrong.
> 
> Because, the actual permission checking for the creation of the eventfs
> entries isn't really done during lookup or readdir, it's done when mkdir
> is called:
> 
> mkdir /sys/kernel/tracing/instances/foo

No. That creates an entire new tracefs instance, which happens to
include another eventfs directory.

The eventfs directories are in "/sys/kernel/tracing/events" and
 "/sys/kernel/tracing/instances/*/events"

eventfs has 10s of thousands of files and directories which is why I
changed it to be on-demand inode/dentry creation and also reclaiming
when no longer accessed.

 # ls /sys/kernel/tracing/events/

will create the inodes and dentries, and a memory stress program will
free those created inodes and dentries.

> 
> Here, all possible entries beneath foo including "events" and further
> below are recorded and stored. So once mkdir returns it basically means
> that it succeeded with the creation of all the necessary directories and
> files. For all purposes the foo/events/ directory and below have all the
> entries that matter. They have been created. It's comparable to them not
> being in the {d,i}cache, I guess.

No. Only the meta data is created for the eventfs 

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-05 Thread Steven Rostedt
On Fri, 5 Jan 2024 15:26:28 +0100
Christian Brauner  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > Instead of walking the dentries on mount/remount to update the gid values of
> > all the dentries if a gid option is specified on mount, just update the root
> > inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> > operations to update the permissions of the files and directories.
> > 
> > For all files and directories in the top level instance:
> > 
> >  /sys/kernel/tracing/*
> > 
> > It will use the root inode as the default permissions. The inode that
> > represents: /sys/kernel/tracing (or wherever it is mounted).
> > 
> > When an instance is created:
> > 
> >  mkdir /sys/kernel/tracing/instance/foo
> > 
> > The directory "foo" and all its files and directories underneath will use
> > the default of what foo is when it was created. A remount of tracefs will
> > not affect it.  
> 
> That kinda sounds like eventfs should actually be a separate filesystem.
> But I don't know enough about the relationship between the two concepts.

It may someday become one, as eventfs is used by perf where the rest of the
tracefs system is not.

> 
> > 
> > If a user were to modify the permissions of any file or directory in
> > tracefs, it will also no longer be modified by a change in ownership of a
> > remount.  
> 
> Very odd semantics and I would recommend to avoid that. It's just plain
> weird imo.
> 
> > 
> > The events directory, if it is in the top level instance, will use the
> > tracefs root inode as the default ownership for itself and all the files and
> > directories below it.
> > 
> > For the events directory in an instance ("foo"), it will keep the ownership
> > of what it was when it was created, and that will be used as the default
> > ownership for the files and directories beneath it.
> > 
> > Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/
> > 
> > Signed-off-by: Steven Rostedt (Google) 
> > ---  
> 
> So tracefs supports remounting with different uid/gid mount options and
> then actually wades through _all_ of the inodes and changes their
> ownership internally? What's the use-case for this? Containers?

No, in fact tracing doesn't work well with containers as tracing is global
to the entire machine. It can work with privileged containers though.

The reason for this is that tracefs was based off of debugfs, where the
files and directories are created at boot up and mounted later. The reason
to do this was to allow users to mount with gid=GID to allow a given group
to have access to tracing. Without this update, tracefs would ignore it
like debugfs and proc do today.

I think it's time I explained the purpose of tracefs and how it came to be.

The tracing system required a way to control tracing and read the traces.
It could have just used a new system like perf (although
/sys/kernel/debug/tracing predates perf), where it created a single ioctl()
like system call do do everything.

As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
tracer, which both have an embedded background, I chose an interface that
could work with just an unmodified version of busybox. That is, I wanted it
to work with just cat and echo.

The main difference with tracefs compared to other file systems is that it
is a control interface, where writes happen as much as reads. The data read
is controlled. The closest thing I can think of is how cgroups work.

As tracing is a privileged operation, but something that could be changed
to allow a group to have access to, I wanted to make it easy for an admin
to decide who gets to do what at boot up via the /etc/fstab file.
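As a concrete illustration of that intent, here is a hypothetical
/etc/fstab line (the group ID 4711 and the mode value are made up for the
example; tracefs accepts uid=, gid= and mode= mount options):

```
# Hypothetical: let members of group 4711 read and traverse the tracing files
tracefs  /sys/kernel/tracing  tracefs  nosuid,nodev,noexec,gid=4711,mode=750  0  0
```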

> 
> Aside from optimizing this and the special semantics for this eventfs
> stuff that you really should think twice of doing, here's one idea for
> an extension that might alleviate some of the pain:
> 
> If you need flexible dynamic ownership change to e.g., be able to
> delegate (all, a directory, a single file of) tracefs to
> unprivileged/containers/whatever then you might want to consider
> supporting idmapped mounts for tracefs. Because then you can do stuff
> like:

I'm a novice here and have no idea on how id maps work ;-)

> 
> user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt
> user1@localhost:~/data/scripts$ ls -ln /run/
> total 12
> drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
> drwx--  2 0  0   40 Jan  5 11:57 cryptsetup
> drwxr-xr-x  2 0  0   60 Jan

[PATCH 3/4] eventfs: Read ei->entries before ei->children in eventfs_iterate()

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In order to apply a shortcut that skips immediately to the current
ctx->pos by using the ei->entries array, the reading of that array
should come first. Moving the array reading before the linked list reading
will make the shortcut change's diff nicer to read.

Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 46 
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index c73fb1f7ddbc..a1934e0eea3b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -752,8 +752,8 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 * Need to create the dentries and inodes to have a consistent
 * inode number.
 */
-   list_for_each_entry_srcu(ei_child, &ei->children, list,
-            srcu_read_lock_held(&eventfs_srcu)) {
+   for (i = 0; i < ei->nr_entries; i++) {
+   void *cdata = ei->data;
 
if (c > 0) {
c--;
@@ -762,23 +762,32 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 
ctx->pos++;
 
-   if (ei_child->is_freed)
-   continue;
+   entry = &ei->entries[i];
+   name = entry->name;
 
-   name = ei_child->name;
+   mutex_lock(&eventfs_mutex);
+   /* If ei->is_freed then just bail here, nothing more to do */
+   if (ei->is_freed) {
+           mutex_unlock(&eventfs_mutex);
+           goto out_dec;
+   }
+   r = entry->callback(name, &mode, &cdata, &fops);
+   mutex_unlock(&eventfs_mutex);
+   if (r <= 0)
+   continue;
 
-   dentry = create_dir_dentry(ei, ei_child, ei_dentry);
dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
if (!dentry)
goto out_dec;
ino = dentry->d_inode->i_ino;
dput(dentry);
 
-   if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
+   if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
goto out_dec;
}
 
-   for (i = 0; i < ei->nr_entries; i++) {
-   void *cdata = ei->data;
+   list_for_each_entry_srcu(ei_child, &ei->children, list,
+            srcu_read_lock_held(&eventfs_srcu)) {
 
if (c > 0) {
c--;
@@ -787,27 +796,18 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 
ctx->pos++;
 
-   entry = &ei->entries[i];
-   name = entry->name;
-
-   mutex_lock(&eventfs_mutex);
-   /* If ei->is_freed then just bail here, nothing more to do */
-   if (ei->is_freed) {
-           mutex_unlock(&eventfs_mutex);
-           goto out_dec;
-   }
-   r = entry->callback(name, &mode, &cdata, &fops);
-   mutex_unlock(&eventfs_mutex);
-   if (r <= 0)
+   if (ei_child->is_freed)
continue;
 
-   dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
+   name = ei_child->name;
+
+   dentry = create_dir_dentry(ei, ei_child, ei_dentry);
if (!dentry)
goto out_dec;
ino = dentry->d_inode->i_ino;
dput(dentry);
 
-   if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
+   if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
goto out_dec;
}
ret = 1;
-- 
2.42.0





[PATCH 0/4] eventfs: More updates to eventfs_iterate()

2024-01-04 Thread Steven Rostedt
With the ongoing discussion of the eventfs iterator, a few more changes
are required, and some of the changes are just enhancements.

- Stop immediately in the loop if the ei is found to be in the process
  of being freed.

- Make the ctx->pos update consistent with the skipped previous read index.
  This fixes a bug with duplicate files being shown by 'ls'.

- Swap reading ei->entries with ei->children to make the next change
  easier to read

- Add a "shortcut" in the ei->entries array to skip over already read
  entries.


Steven Rostedt (Google) (4):
  eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set
  eventfs: Do ctx->pos update for all iterations in  eventfs_iterate()
  eventfs: Read ei->entries before ei->children in eventfs_iterate()
  eventfs: Shortcut eventfs_iterate() by skipping entries already read


 fs/tracefs/event_inode.c | 67 ++--
 1 file changed, 36 insertions(+), 31 deletions(-)



[PATCH 4/4] eventfs: Shortcut eventfs_iterate() by skipping entries already read

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As the ei->entries array is fixed for the duration of the eventfs_inode,
it can be used to skip over already read entries in eventfs_iterate().

That is, if ctx->pos is greater than zero, there's no reason in doing the
loop across the ei->entries array for the entries less than ctx->pos.
Instead, start the lookup of the entries at the current ctx->pos.

Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/

Suggested-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 23 ++-
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a1934e0eea3b..fdff53d5a1f8 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -746,21 +746,15 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
if (!ei || !ei_dentry)
goto out;
 
-   ret = 0;
-
/*
 * Need to create the dentries and inodes to have a consistent
 * inode number.
 */
-   for (i = 0; i < ei->nr_entries; i++) {
-   void *cdata = ei->data;
-
-   if (c > 0) {
-   c--;
-   continue;
-   }
+   ret = 0;
 
-   ctx->pos++;
+   /* Start at 'c' to jump over already read entries */
+   for (i = c; i < ei->nr_entries; i++, ctx->pos++) {
+   void *cdata = ei->data;
 
entry = &ei->entries[i];
name = entry->name;
@@ -769,7 +763,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
/* If ei->is_freed then just bail here, nothing more to do */
if (ei->is_freed) {
mutex_unlock(&eventfs_mutex);
-   goto out_dec;
+   goto out;
}
r = entry->callback(name, &mode, &cdata, &fops);
mutex_unlock(&eventfs_mutex);
@@ -778,14 +772,17 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 
dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops);
if (!dentry)
-   goto out_dec;
+   goto out;
ino = dentry->d_inode->i_ino;
dput(dentry);
 
if (!dir_emit(ctx, name, strlen(name), ino, DT_REG))
-   goto out_dec;
+   goto out;
}
 
+   /* Subtract the skipped entries above */
+   c -= min((unsigned int)c, (unsigned int)ei->nr_entries);
+
list_for_each_entry_srcu(ei_child, &ei->children, list,
         srcu_read_lock_held(&eventfs_srcu)) {
 
-- 
2.42.0





[PATCH 2/4] eventfs: Do ctx->pos update for all iterations in eventfs_iterate()

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The ctx->pos was only updated when it added an entry, but the "skip to
current pos" check (c--) happened for every loop regardless of if the
entry was added or not. This inconsistency caused readdir to be incorrect.

It was due to:

for (i = 0; i < ei->nr_entries; i++) {

if (c > 0) {
c--;
continue;
}

mutex_lock(&eventfs_mutex);
/* If ei->is_freed then just bail here, nothing more to do */
if (ei->is_freed) {
        mutex_unlock(&eventfs_mutex);
        goto out;
}
r = entry->callback(name, &mode, &cdata, &fops);
mutex_unlock(&eventfs_mutex);

[..]
ctx->pos++;
}

But this can cause the iterator to return a file that was already read.
That's because of the way the callback() works. Some events may not have
all files, and the callback can return 0 to tell eventfs to skip the file
for this directory.

For instance, we have:

 # ls /sys/kernel/tracing/events/ftrace/function
format  hist  hist_debug  id  inject

and

 # ls /sys/kernel/tracing/events/sched/sched_switch/
enable  filter  format  hist  hist_debug  id  inject  trigger

Where the function directory is missing "enable", "filter" and
"trigger". That's because the callback() for events has:

static int event_callback(const char *name, umode_t *mode, void **data,
  const struct file_operations **fops)
{
struct trace_event_file *file = *data;
struct trace_event_call *call = file->event_call;

[..]

/*
 * Only event directories that can be enabled should have
 * triggers or filters, with the exception of the "print"
 * event that can have a "trigger" file.
 */
if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) {
if (call->class->reg && strcmp(name, "enable") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &ftrace_enable_fops;
return 1;
}

if (strcmp(name, "filter") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &ftrace_event_filter_fops;
return 1;
}
}

if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE) ||
strcmp(trace_event_name(call), "print") == 0) {
if (strcmp(name, "trigger") == 0) {
*mode = TRACE_MODE_WRITE;
*fops = &event_trigger_fops;
return 1;
}
}
[..]
return 0;
}

Where the function event has the TRACE_EVENT_FL_IGNORE_ENABLE set.

This means that the entries array elements for "enable", "filter" and
"trigger" when called on the function event will have the callback return
0 and not 1, to tell eventfs to skip these files for it.

Because the "skip to current ctx->pos" check happened for all entries, but
the ctx->pos++ only happened to entries that exist, it would confuse the
reading of a directory. Which would cause:

 # ls /sys/kernel/tracing/events/ftrace/function/
format  hist  hist  hist_debug  hist_debug  id  inject  inject

The missing "enable", "filter" and "trigger" caused ls to show "hist",
"hist_debug" and "inject" twice.

Update the ctx->pos for every iteration to keep its update and the "skip"
update consistent. This also means that on error, the ctx->pos needs to be
decremented if it was incremented without adding something.

Link: https://lore.kernel.org/all/20240104150500.38b15...@gandalf.local.home/

Fixes: 493ec81a8fb8e ("eventfs: Stop using dcache_readdir() for getdents()")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 0aca6910efb3..c73fb1f7ddbc 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -760,6 +760,8 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
continue;
}
 
+   ctx->pos++;
+
if (ei_child->is_freed)
continue;
 
@@ -767,13 +769,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 
dentry = create_dir_dentry(ei, ei_child, ei_dentry);
if (!dentry)
-   goto out;
+   goto out_dec;
ino = dentry->d_inode->i_ino;
dput(dentry);
 

[PATCH 1/4] eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set

2024-01-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If ei->is_freed is set in eventfs_iterate(), it means that the directory
that is being iterated on is in the process of being freed. Just exit the
loop immediately when that is ever detected, and separate out the return
of the entry->callback() from ei->is_freed.

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 72912b5f9a90..0aca6910efb3 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -788,11 +788,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
name = entry->name;
 
mutex_lock(&eventfs_mutex);
-   /* If ei->is_freed, then the event itself may be too */
-   if (!ei->is_freed)
-           r = entry->callback(name, &mode, &cdata, &fops);
-   else
-           r = -1;
+   /* If ei->is_freed then just bail here, nothing more to do */
+   if (ei->is_freed) {
+           mutex_unlock(&eventfs_mutex);
+           goto out;
+   }
+   r = entry->callback(name, &mode, &cdata, &fops);
mutex_unlock(&eventfs_mutex);
if (r <= 0)
continue;
-- 
2.42.0





Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions

2024-01-04 Thread Steven Rostedt
On Thu, 21 Dec 2023 12:58:13 -0500
Steven Rostedt  wrote:

> On Thu, 21 Dec 2023 17:35:22 +
> Vincent Donnefort  wrote:
> 
> > @@ -5999,6 +6078,307 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
> >  }
> >  EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
> >
> 
> The kernel developers have agreed to allow loop variables to be declared in
> loops. This will simplify these macros:
> 
> 
> 
> > +#define subbuf_page(off, start) \
> > +   virt_to_page((void *)(start + (off << PAGE_SHIFT)))
> > +
> > +#define foreach_subbuf_page(off, sub_order, start, page)   \
> > +   for (off = 0, page = subbuf_page(0, start); \
> > +off < (1 << sub_order);\
> > +off++, page = subbuf_page(off, start))  
> 
> #define foreach_subbuf_page(sub_order, start, page)   \
>   for (int __off = 0, page = subbuf_page(0, (start)); \
>__off < (1 << (sub_order));\
>__off++, page = subbuf_page(__off, (start)))

So it seems that you can't declare "int __off" with page there, but we
could have:

#define foreach_subbuf_page(sub_order, start, page) \
page = subbuf_page(0, (start)); \
for (int __off = 0; __off < (1 << (sub_order)); \
 __off++, page = subbuf_page(__off, (start)))


And that would work.

-- Steve



Re: [PATCH v8] bus: mhi: host: Add tracing support

2024-01-04 Thread Steven Rostedt
On Thu, 7 Dec 2023 10:00:47 +0530
Krishna chaitanya chundru  wrote:

> This change adds ftrace support for the following functions, which
> helps in debugging issues when there is a channel state or MHI
> state change and also when we receive data and control events:
> 1. mhi_intvec_mhi_states
> 2. mhi_process_data_event_ring
> 3. mhi_process_ctrl_ev_ring
> 4. mhi_gen_tre
> 5. mhi_update_channel_state
> 6. mhi_tryset_pm_state
> 7. mhi_pm_st_worker
> 
> Wherever the trace events are added, debug messages are removed.
> 
> Signed-off-by: Krishna chaitanya chundru 
> ---
> Changes in v8:
> - Pass the structure and derefernce the variables in TP_fast_assign as 
> suggested by steve
> - Link to v7: 
> https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com

So this looks good from a tracing POV.

Reviewed-by: Steven Rostedt (Google) 

But I do have some more comments that could be done by new patches.



> +TRACE_EVENT(mhi_intvec_states,
> +
> + TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state),
> +
> + TP_ARGS(mhi_cntrl, dev_ee, dev_state),
> +
> + TP_STRUCT__entry(
> + __string(name, mhi_cntrl->mhi_dev->name)
> + __field(int, local_ee)
> + __field(int, state)
> + __field(int, dev_ee)
> + __field(int, dev_state)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(name, mhi_cntrl->mhi_dev->name);
> + __entry->local_ee = mhi_cntrl->ee;
> + __entry->state = mhi_cntrl->dev_state;
> + __entry->dev_ee = dev_ee;
> + __entry->dev_state = dev_state;
> + ),
> +
> + TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
> +   __get_str(name),
> +   TO_MHI_EXEC_STR(__entry->local_ee),
> +   mhi_state_str(__entry->state),
> +   TO_MHI_EXEC_STR(__entry->dev_ee),
> +   mhi_state_str(__entry->dev_state))

So the above may have issues with user space parsing.

For one, that mhi_state_str() is:

static inline const char *mhi_state_str(enum mhi_state state)
{
switch (state) {
case MHI_STATE_RESET:
return "RESET";
case MHI_STATE_READY:
return "READY";
case MHI_STATE_M0:
return "M0";
case MHI_STATE_M1:
return "M1";
case MHI_STATE_M2:
return "M2";
case MHI_STATE_M3:
return "M3";
case MHI_STATE_M3_FAST:
return "M3 FAST";
case MHI_STATE_BHI:
return "BHI";
case MHI_STATE_SYS_ERR:
return "SYS ERROR";
default:
return "Unknown state";
}
};
 
If this could be changed into:

#define MHI_STATE_LIST  \
EM(RESET,"RESET")   \
EM(READY,"READY")   \
EM(M0,   "M0")  \
EM(M1,   "M1")  \
EM(M2,   "M2")  \
EM(M3,   "M3")  \
EM(M3_FAST,  "M3_FAST") \
EM(BHI,  "BHI") \
EMe(SYS_ERR, "SYS ERROR")

#undef EM
#undef EMe

#define EM(a, b)  case MHI_STATE_##a: return b;
#define EMe(a, b) case MHI_STATE_##a: return b;

static inline const char *mhi_state_str(enum mhi_state state)
{
switch (state) {
MHI_STATE_LIST
default:
return "Unknown state";
}

Then you could use that macro in the trace header:

#undef EM
#undef EMe

#define EM(a, b)TRACE_DEFINE_ENUM(MHI_STATE_##a);
#define EMe(a, b)   TRACE_DEFINE_ENUM(MHI_STATE_##a);

MHI_STATE_LIST

And in the print fmts:

#undef EM
#undef EMe

#define EM(a, b)   { MHI_STATE_##a, b },
#define EMe(a, b)  { MHI_STATE_##a, b }


TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n",
  __get_str(name),
  TO_MHI_EXEC_STR(__entry->local_ee),
  __print_symbolic(__entry->state, MHI_STATE_LIST),
  TO_MHI_EXEC_STR(__entry->dev_ee),
  __print_symbolic(__entry->dev_state, MHI_STATE_LIST))


And that will be exported to user space in the
/sys/kernel/tracing/events/*/*/format file, as:

__print_symbolic(REC->state, { 
{ MHI_STATE_RESET, "RESET"},
{ MHI_STATE_READY, "READY"},
{ MHI_STATE_M0, "M0"},
{ MHI_STATE_M1, "M1"},
{ MHI_STA

[PATCH v2] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Instead of walking the dentries on mount/remount to update the gid values of
all the dentries if a gid option is specified on mount, just update the root
inode. Add .getattr, .setattr, and .permissions on the tracefs inode
operations to update the permissions of the files and directories.

For all files and directories in the top level instance:

 /sys/kernel/tracing/*

The root inode, which represents /sys/kernel/tracing (or wherever tracefs is
mounted), will supply the default permissions.

When an instance is created:

 mkdir /sys/kernel/tracing/instances/foo

The directory "foo" and all its files and directories underneath will default
to the ownership that "foo" had when it was created. A remount of tracefs will
not affect them.

If a user modifies the permissions of any file or directory in tracefs, it
will no longer be affected by a change of ownership at remount.

The events directory, if it is in the top level instance, will use the
tracefs root inode as the default ownership for itself and all the files and
directories below it.

For the events directory in an instance ("foo"), it will keep the ownership
it had when it was created, and that will be used as the default ownership
for the files and directories beneath it.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240103203246.11573...@gandalf.local.home

- Use dentry->d_sb->s_root directly than inode parent (Al Viro)

- Simplify while loop to find instance inode or root inode (Al Viro)

 fs/tracefs/event_inode.c |  79 +++-
 fs/tracefs/inode.c   | 198 ++-
 fs/tracefs/internal.h|   3 +
 3 files changed, 190 insertions(+), 90 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 41af56f44f0a..72912b5f9a90 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -45,6 +45,7 @@ enum {
EVENTFS_SAVE_MODE   = BIT(16),
EVENTFS_SAVE_UID= BIT(17),
EVENTFS_SAVE_GID= BIT(18),
+   EVENTFS_TOPLEVEL= BIT(19),
 };
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
@@ -115,10 +116,17 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * The events directory dentry is never freed, unless its
 * part of an instance that is deleted. It's attr is the
 * default for its child files and directories.
-* Do not update it. It's not used for its own mode or ownership
+* Do not update it. It's not used for its own mode or ownership.
 */
-   if (!ei->is_events)
+   if (ei->is_events) {
+   /* But it still needs to know if it was modified */
+   if (iattr->ia_valid & ATTR_UID)
+   ei->attr.mode |= EVENTFS_SAVE_UID;
+   if (iattr->ia_valid & ATTR_GID)
+   ei->attr.mode |= EVENTFS_SAVE_GID;
+   } else {
update_attr(&ei->attr, iattr);
+   }
 
} else {
name = dentry->d_name.name;
@@ -136,9 +144,66 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
return ret;
 }
 
+static void update_top_events_attr(struct eventfs_inode *ei, struct dentry 
*dentry)
+{
+   struct inode *inode;
+
+   /* Only update if the "events" was on the top level */
+   if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   return;
+
+   /* Get the tracefs root inode. */
+   inode = d_inode(dentry->d_sb->s_root);
+   ei->attr.uid = inode->i_uid;
+   ei->attr.gid = inode->i_gid;
+}
+
+static void set_top_events_ownership(struct inode *inode)
+{
+   struct tracefs_inode *ti = get_tracefs(inode);
+   struct eventfs_inode *ei = ti->private;
+   struct dentry *dentry;
+
+   /* The top events directory doesn't get automatically updated */
+   if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   return;
+
+   dentry = ei->dentry;
+
+   update_top_events_attr(ei, dentry);
+
+   if (!(ei->attr.mode & EVENTFS_SAVE_UID))
+   inode->i_uid = ei->attr.uid;
+
+   if (!(ei->attr.mode & EVENTFS_SAVE_GID))
+   inode->i_gid = ei->attr.gid;
+}
+
+static int eventfs_get_attr(struct mnt_idmap *idmap,
+   const struct path *path, struct kstat *stat,
+   u32 request_mask, unsigned int flags)
+{
+   struct dentry *dentry = path->dentry;
+   s

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt
On Thu, 4 Jan 2024 01:48:37 +
Al Viro  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +   /* Get the tracefs root from the parent */
> > +   inode = d_inode(dentry->d_parent);
> > +   inode = d_inode(inode->i_sb->s_root);  
> 
> That makes no sense.  First of all, for any positive dentry we have
> dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> dentries on given superblock.  So what's the point of that dance?
> If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> and be done with that...

That came from thinking of the dentry and dentry->d_parent as different: the
dentry is part of eventfs while dentry->d_parent is part of tracefs.
Currently they both have the same superblock, so yes, I could write it that
way and it would work. But in my head they behave differently, and if eventfs
were ever to get its own superblock, that shortcut would no longer work.

To explain this better:

  /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events

But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
Everything in /sys/kernel/tracing/events is part of eventfs.

That was my thought process. But as both tracefs and eventfs still use
tracefs_get_inode(), it would work as you state.

I'll update that, as I don't foresee that eventfs will become its own file
system.

Thanks,

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt
On Thu, 4 Jan 2024 01:59:10 +
Al Viro  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +static struct inode *instance_inode(struct dentry *parent, struct inode 
> > *inode)
> > +{
> > +   struct tracefs_inode *ti;
> > +   struct inode *root_inode;
> > +
> > +   root_inode = d_inode(inode->i_sb->s_root);
> > +
> > +   /* If parent is NULL then use root inode */
> > +   if (!parent)
> > +   return root_inode;
> > +
> > +   /* Find the inode that is flagged as an instance or the root inode */
> > +   do {
> > +   inode = d_inode(parent);
> > +   if (inode == root_inode)
> > +   return root_inode;
> > +
> > +   ti = get_tracefs(inode);
> > +
> > +   if (ti->flags & TRACEFS_INSTANCE_INODE)
> > +   return inode;
> > +   } while ((parent = parent->d_parent));  
> 
> *blink*
> 
> This is equivalent to
>   ...
>   parent = parent->d_parent;
>   } while (true);

Yeah, that loop went through a few iterations, as I first thought that root
was a tracefs_inode and that get_tracefs() would work on it. It was not, and
that caused a crash. But I didn't rewrite the loop well after fixing that.

I was also not sure if parent could be NULL, and wanted to protect against it.

> 
> ->d_parent is *never* NULL.  And what the hell is that loop supposed to do,  
> anyway?  Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE?
> 
> If root is not marked that way, I would suggest
>   if (!parent)
>   parent = inode->i_sb->s_root;
>   while (!IS_ROOT(parent)) {
>   struct tracefs_inode *ti = get_tracefs(parent->d_inode);
>   if (ti->flags & TRACEFS_INSTANCE_INODE)
>   break;
>   parent = parent->d_parent;
>   }
>   return parent->d_inode;

Sure, I could rewrite it that way too.

Thanks,

-- Steve



[PATCH v2 2/2] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The eventfs creates dynamically allocated dentries and inodes. Using the
dcache_readdir() logic for its own directory lookups requires hiding the
cursor of the dcache logic and playing games to allow the dcache_readdir()
to still have access to the cursor while the eventfs saved what it created
and what it needs to release.

Instead, just have eventfs have its own iterate_shared callback function
that will fill in the dent entries. This simplifies the code quite a bit.

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 194 +--
 1 file changed, 64 insertions(+), 130 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index c360300fb866..41af56f44f0a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -52,9 +52,7 @@ enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
  struct dentry *dentry,
  unsigned int flags);
-static int dcache_dir_open_wrapper(struct inode *inode, struct file *file);
-static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx);
-static int eventfs_release(struct inode *inode, struct file *file);
+static int eventfs_iterate(struct file *file, struct dir_context *ctx);
 
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
@@ -148,11 +146,9 @@ static const struct inode_operations 
eventfs_file_inode_operations = {
 };
 
 static const struct file_operations eventfs_file_operations = {
-   .open   = dcache_dir_open_wrapper,
.read   = generic_read_dir,
-   .iterate_shared = dcache_readdir_wrapper,
+   .iterate_shared = eventfs_iterate,
.llseek = generic_file_llseek,
-   .release= eventfs_release,
 };
 
 /* Return the eventfs_inode of the "events" directory */
@@ -643,128 +639,87 @@ static struct dentry *eventfs_root_lookup(struct inode 
*dir,
return ret;
 }
 
-struct dentry_list {
-   void*cursor;
-   struct dentry   **dentries;
-};
-
-/**
- * eventfs_release - called to release eventfs file/dir
- * @inode: inode to be released
- * @file: file to be released (not used)
- */
-static int eventfs_release(struct inode *inode, struct file *file)
-{
-   struct tracefs_inode *ti;
-   struct dentry_list *dlist = file->private_data;
-   void *cursor;
-   int i;
-
-   ti = get_tracefs(inode);
-   if (!(ti->flags & TRACEFS_EVENT_INODE))
-   return -EINVAL;
-
-   if (WARN_ON_ONCE(!dlist))
-   return -EINVAL;
-
-   for (i = 0; dlist->dentries && dlist->dentries[i]; i++) {
-   dput(dlist->dentries[i]);
-   }
-
-   cursor = dlist->cursor;
-   kfree(dlist->dentries);
-   kfree(dlist);
-   file->private_data = cursor;
-   return dcache_dir_close(inode, file);
-}
-
-static int add_dentries(struct dentry ***dentries, struct dentry *d, int cnt)
-{
-   struct dentry **tmp;
-
-   tmp = krealloc(*dentries, sizeof(d) * (cnt + 2), GFP_NOFS);
-   if (!tmp)
-   return -1;
-   tmp[cnt] = d;
-   tmp[cnt + 1] = NULL;
-   *dentries = tmp;
-   return 0;
-}
-
-/**
- * dcache_dir_open_wrapper - eventfs open wrapper
- * @inode: not used
- * @file: dir to be opened (to create it's children)
- *
- * Used to dynamic create file/dir with-in @file, all the
- * file/dir will be created. If already created then references
- * will be increased
+/*
+ * Walk the children of a eventfs_inode to fill in getdents().
  */
-static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
+static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 {
const struct file_operations *fops;
+   struct inode *f_inode = file_inode(file);
const struct eventfs_entry *entry;
struct eventfs_inode *ei_child;
struct tracefs_inode *ti;
struct eventfs_inode *ei;
-   struct dentry_list *dlist;
-   struct dentry **dentries = NULL;
-   struct dentry *parent = file_dentry(file);
-   struct dentry *d;
-   struct inode *f_inode = file_inode(file);
-   const char *name = parent->d_name.name;
+   struct dentry *ei_dentry = NULL;
+   struct dentry *dentry;
+   const char *name;
umode_t mode;
-   void *data;
-   int cnt = 0;
int idx;
-   int ret;
-   int i;
-   int r;
+   int ret = -EINVAL;
+   int ino;
+   int i, r, c;
+
+   if (!dir_emit_dots(file, ctx))
+   return 0;
 
ti = get_tracefs(f_inode);
if (!(ti->flags & TRACEFS_EVENT_INODE))
return -EINVAL;
 
-   if (WARN_ON_ONCE(file->private_data))
-   return -EINVAL;
+   c = ctx->pos - 2;
 
idx = srcu_read_lock(

[PATCH v2 1/2] eventfs: Remove "lookup" parameter from create_dir/file_dentry()

2024-01-03 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The "lookup" parameter is a way to differentiate the call to
create_file/dir_dentry() from when it's just a lookup (no need to up the
dentry refcount) and accessed via a readdir (need to up the refcount).

But in reality, it just makes the code more complex. Just up the refcount and
let the caller decide whether to dput() the result.

Link: 
https://lore.kernel.org/linux-trace-kernel/20240103102553.17a19...@gandalf.local.home

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 55 +++-
 1 file changed, 20 insertions(+), 35 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f0677ea0ec24..c360300fb866 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -390,16 +390,14 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, 
struct dentry *dentry)
  * @mode: The mode of the file.
  * @data: The data to use to set the inode of the file with on open()
  * @fops: The fops of the file to be created.
- * @lookup: If called by the lookup routine, in which case, dput() the created dentry.
  *
  * Create a dentry for a file of an eventfs_inode @ei and place it into the
- * address located at @e_dentry. If the @e_dentry already has a dentry, then
- * just do a dget() on it and return. Otherwise create the dentry and attach it.
+ * address located at @e_dentry.
  */
 static struct dentry *
 create_file_dentry(struct eventfs_inode *ei, int idx,
   struct dentry *parent, const char *name, umode_t mode, void *data,
-  const struct file_operations *fops, bool lookup)
+  const struct file_operations *fops)
 {
struct eventfs_attr *attr = NULL;
struct dentry **e_dentry = &ei->d_children[idx];
@@ -414,9 +412,7 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
}
/* If the e_dentry already has a dentry, use it */
if (*e_dentry) {
-   /* lookup does not need to up the ref count */
-   if (!lookup)
-   dget(*e_dentry);
+   dget(*e_dentry);
mutex_unlock(&eventfs_mutex);
return *e_dentry;
}
@@ -441,13 +437,12 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 * way to being freed, don't return it. If e_dentry is NULL
 * it means it was already freed.
 */
-   if (ei->is_freed)
+   if (ei->is_freed) {
dentry = NULL;
-   else
+   } else {
dentry = *e_dentry;
-   /* The lookup does not need to up the dentry refcount */
-   if (dentry && !lookup)
dget(dentry);
+   }
mutex_unlock(&eventfs_mutex);
return dentry;
}
@@ -465,9 +460,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
}
mutex_unlock(&eventfs_mutex);
 
-   if (lookup)
-   dput(dentry);
-
return dentry;
 }
 
@@ -500,13 +492,12 @@ static void eventfs_post_create_dir(struct eventfs_inode 
*ei)
  * @pei: The eventfs_inode parent of ei.
  * @ei: The eventfs_inode to create the directory for
  * @parent: The dentry of the parent of this directory
- * @lookup: True if this is called by the lookup code
  *
  * This creates and attaches a directory dentry to the eventfs_inode @ei.
  */
 static struct dentry *
 create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
- struct dentry *parent, bool lookup)
+ struct dentry *parent)
 {
struct dentry *dentry = NULL;
 
@@ -518,11 +509,9 @@ create_dir_dentry(struct eventfs_inode *pei, struct 
eventfs_inode *ei,
return NULL;
}
if (ei->dentry) {
-   /* If the dentry already has a dentry, use it */
+   /* If the eventfs_inode already has a dentry, use it */
dentry = ei->dentry;
-   /* lookup does not need to up the ref count */
-   if (!lookup)
-   dget(dentry);
+   dget(dentry);
mutex_unlock(&eventfs_mutex);
return dentry;
}
@@ -542,7 +531,7 @@ create_dir_dentry(struct eventfs_inode *pei, struct 
eventfs_inode *ei,
 * way to being freed.
 */
dentry = ei->dentry;
-   if (dentry && !lookup)
+   if (dentry)
dget(dentry);
mutex_unlock(&eventfs_mutex);
return dentry;
@@ -562,9 +551,6 @@ create_dir_dentry(struct eventfs_inode *pei, struct 
eventfs_inode *ei,
}
mutex_unlock(&eventfs_mutex);
 
-   if (lookup)
-   dput(dentry);
-
return dentry;
 }
 
@@ -589,8 +575,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
stru

[PATCH v2 0/2] eventfs: Don't use dcache_readdir() for getdents()

2024-01-03 Thread Steven Rostedt


After having a "pleasant" conversation with Linus over on the security
mailing list, we came to the conclusion that eventfs should not be
using the dcache_readdir() routine for iterating the entries of a
directory (getdents()).

Instead, the .open and .release callbacks of the directory file
operations were removed and all the work is now done in .iterate_shared.

The function dcache_readdir_wrapper() was renamed to eventfs_iterate().

As the files and directories of eventfs are all known within its metadata,
it can easily supply getdents() with the information it needs without
traversing dentries.

Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240103102553.17a19...@gandalf.local.home/

- Broke up into two patches, one to fix the lookup parameter and the
  other to do the meat of the change.

- Moved the ctx->pos count up to skip creating dentries in those cases.


Steven Rostedt (Google) (2):
  eventfs: Remove "lookup" parameter from create_dir/file_dentry()
  eventfs: Stop using dcache_readdir() for getdents()


 fs/tracefs/event_inode.c | 241 ---
 1 file changed, 80 insertions(+), 161 deletions(-)



[PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Instead of walking the dentries on mount/remount to update the gid values of
all the dentries if a gid option is specified on mount, just update the root
inode. Add .getattr, .setattr, and .permissions on the tracefs inode
operations to update the permissions of the files and directories.

For all files and directories in the top level instance:

 /sys/kernel/tracing/*

The root inode, which represents /sys/kernel/tracing (or wherever tracefs is
mounted), will supply the default permissions.

When an instance is created:

 mkdir /sys/kernel/tracing/instances/foo

The directory "foo" and all its files and directories underneath will default
to the ownership that "foo" had when it was created. A remount of tracefs will
not affect them.

If a user modifies the permissions of any file or directory in tracefs, it
will no longer be affected by a change of ownership at remount.

The events directory, if it is in the top level instance, will use the
tracefs root inode as the default ownership for itself and all the files and
directories below it.

For the events directory in an instance ("foo"), it will keep the ownership
it had when it was created, and that will be used as the default ownership
for the files and directories beneath it.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c |  80 ++-
 fs/tracefs/inode.c   | 205 ++-
 fs/tracefs/internal.h|   3 +
 3 files changed, 198 insertions(+), 90 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 53d34a4b5a2b..641bffa0f139 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -45,6 +45,7 @@ enum {
EVENTFS_SAVE_MODE   = BIT(16),
EVENTFS_SAVE_UID= BIT(17),
EVENTFS_SAVE_GID= BIT(18),
+   EVENTFS_TOPLEVEL= BIT(19),
 };
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
@@ -115,10 +116,17 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * The events directory dentry is never freed, unless its
 * part of an instance that is deleted. It's attr is the
 * default for its child files and directories.
-* Do not update it. It's not used for its own mode or ownership
+* Do not update it. It's not used for its own mode or ownership.
 */
-   if (!ei->is_events)
+   if (ei->is_events) {
+   /* But it still needs to know if it was modified */
+   if (iattr->ia_valid & ATTR_UID)
+   ei->attr.mode |= EVENTFS_SAVE_UID;
+   if (iattr->ia_valid & ATTR_GID)
+   ei->attr.mode |= EVENTFS_SAVE_GID;
+   } else {
update_attr(&ei->attr, iattr);
+   }
 
} else {
name = dentry->d_name.name;
@@ -136,9 +144,67 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
return ret;
 }
 
+static void update_top_events_attr(struct eventfs_inode *ei, struct dentry 
*dentry)
+{
+   struct inode *inode;
+
+   /* Only update if the "events" was on the top level */
+   if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   return;
+
+   /* Get the tracefs root from the parent */
+   inode = d_inode(dentry->d_parent);
+   inode = d_inode(inode->i_sb->s_root);
+   ei->attr.uid = inode->i_uid;
+   ei->attr.gid = inode->i_gid;
+}
+
+static void set_top_events_ownership(struct inode *inode)
+{
+   struct tracefs_inode *ti = get_tracefs(inode);
+   struct eventfs_inode *ei = ti->private;
+   struct dentry *dentry;
+
+   /* The top events directory doesn't get automatically updated */
+   if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   return;
+
+   dentry = ei->dentry;
+
+   update_top_events_attr(ei, dentry);
+
+   if (!(ei->attr.mode & EVENTFS_SAVE_UID))
+   inode->i_uid = ei->attr.uid;
+
+   if (!(ei->attr.mode & EVENTFS_SAVE_GID))
+   inode->i_gid = ei->attr.gid;
+}
+
+static int eventfs_get_attr(struct mnt_idmap *idmap,
+   const struct path *path, struct kstat *stat,
+   u32 request_mask, unsigned int flags)
+{
+   struct dentry *dentry = path->dentry;
+   struct inode *inode = d_backing_inode(dentry);
+
+   set_top_events_ownership(inode);
+
+   generic_fillattr(idmap, request_mask, inode, stat);
+   return 0;
+}
+
+static int eventfs_permission

Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Steven Rostedt
On Wed, 3 Jan 2024 13:54:36 -0800
Linus Torvalds  wrote:

> On Wed, 3 Jan 2024 at 11:57, Linus Torvalds
>  wrote:
> >
> > Or, you know, you could do what I've told you to do at least TEN TIMES
> > already, which is to not mess with any of this, and just implement the
> > '->permission()' callback (and getattr() to just make 'ls' look sane
> > too, rather than silently saying "we'll act as if gid is set right,
> > but not show it").
> 
> Actually, an even simpler option might be to just do this all at
> d_revalidate() time.
> 
> Here's an updated patch that builds, and is PURELY AN EXAMPLE. I think
> it "works", but it currently always resets the inode mode/uid/gid
> unconditionally, which is wrong - it should not do so if the inode has
> been manually set.
> 
> So take this as a "this might work", but it probably needs a bit more
> work - eventfs_set_attr() should set some bit in the inode to say
> "these have been set manually", and then revalidate would say "I'll
> not touch inodes that have that bit set".
> 
> Or something.
> 
> Anyway, this patch is nwo relative to your latest pull request, so it
> has the check for dentry->d_inode in set_gid() (and still removes the
> whole function).
> 
> Again: UNTESTED, and meant as a "this is another way to avoid messing
> with the dentry tree manually, and just using the VFS interfaces we
> already have"
> 

I actually have something almost working too. Here's the WIP. It only works
for tracefs, and now eventfs needs to be updated as the "events" directory
no longer has the right ownership. So I need a way to link the eventfs
entries to use the tracefs default conditionally.

The issue I'm currently working on is to make the files of:

  /sys/kernel/tracing/events

default to the root of tracefs, but

 /sys/kernel/tracing/instances/foo/events

to default to what events was when it was created by "mkdir instances/foo".

But other than that, the tracefs part seems to work.

-- Steve

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index ae648deed019..9c55dc903d7d 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -141,10 +141,76 @@ static int tracefs_syscall_rmdir(struct inode *inode, 
struct dentry *dentry)
return ret;
 }
 
-static const struct inode_operations tracefs_dir_inode_operations = {
+static void set_tracefs_inode_owner(struct inode *inode)
+{
+   struct inode *root_inode = d_inode(inode->i_sb->s_root);
+   struct tracefs_inode *ti = get_tracefs(inode);
+
+   /*
+* If this inode has never been referenced, then update
+* the permissions to the superblock.
+*/
+   if (!(ti->flags & TRACEFS_EVENT_UID_PERM_SET))
+   inode->i_uid = root_inode->i_uid;
+
+   if (!(ti->flags & TRACEFS_EVENT_GID_PERM_SET))
+   inode->i_gid = root_inode->i_gid;
+}
+
+static int tracefs_permission(struct mnt_idmap *idmap,
+ struct inode *inode, int mask)
+{
+   set_tracefs_inode_owner(inode);
+   return generic_permission(idmap, inode, mask);
+}
+
+static int tracefs_getattr(struct mnt_idmap *idmap,
+  const struct path *path, struct kstat *stat,
+  u32 request_mask, unsigned int flags)
+{
+   struct inode *inode = d_backing_inode(path->dentry);
+
+   set_tracefs_inode_owner(inode);
+   generic_fillattr(idmap, request_mask, inode, stat);
+   return 0;
+}
+
+static int tracefs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+  struct iattr *attr)
+{
+   unsigned int ia_valid = attr->ia_valid;
+   struct inode *inode = d_inode(dentry);
+   struct tracefs_inode *ti = get_tracefs(inode);
+
+   if (ia_valid & ATTR_UID)
+   ti->flags |= TRACEFS_EVENT_UID_PERM_SET;
+
+   if (ia_valid & ATTR_GID)
+   ti->flags |= TRACEFS_EVENT_GID_PERM_SET;
+
+   return simple_setattr(idmap, dentry, attr);
+}
+
+static const struct inode_operations tracefs_instance_dir_inode_operations = {
.lookup = simple_lookup,
.mkdir  = tracefs_syscall_mkdir,
.rmdir  = tracefs_syscall_rmdir,
+   .permission = tracefs_permission,
+   .getattr= tracefs_getattr,
+   .setattr= tracefs_setattr,
+};
+
+static const struct inode_operations tracefs_dir_inode_operations = {
+   .lookup = simple_lookup,
+   .permission = tracefs_permission,
+   .getattr= tracefs_getattr,
+   .setattr= tracefs_setattr,
+};
+
+static const struct inode_operations tracefs_file_inode_operations = {
+   .permission = tracefs_permission,
+   .getattr= tracefs_getattr,
+   .setattr= tracefs_setattr,
 };
 
 struct inode *tracefs_get_inode(struct super_block *sb)
@@ -183,77 +249,6 @@ struct tracefs_fs_info {
struct tracefs_mount_opts mount_opts;
 };
 
-static void change_gid(struct dentry *dentry, kgid_t gid)
-{
-   if 

Re: [syzbot] [fs?] [trace?] BUG: unable to handle kernel paging request in tracefs_apply_options

2024-01-03 Thread Steven Rostedt
On Wed, 03 Jan 2024 13:41:31 -0800
syzbot  wrote:

> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:453f5db0619e Merge tag 'trace-v6.7-rc7' of git://git.kerne..
> git tree:   upstream
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=10ec3829e8
> kernel config:  https://syzkaller.appspot.com/x/.config?x=f8e72bae38c079e4
> dashboard link: https://syzkaller.appspot.com/bug?extid=f8a023e0c6beabe2371a
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1414af31e8
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15e52409e8
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/38b92a7149e8/disk-453f5db0.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/4f872267133f/vmlinux-453f5db0.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/587572061791/bzImage-453f5db0.xz
> 
> The issue was bisected to:
> 
> commit 7e8358edf503e87236c8d07f69ef0ed846dd5112
> Author: Steven Rostedt (Google) 
> Date:   Fri Dec 22 00:07:57 2023 +
> 
> eventfs: Fix file and directory uid and gid ownership
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=108cd519e8
> final oops: https://syzkaller.appspot.com/x/report.txt?x=128cd519e8
> console output: https://syzkaller.appspot.com/x/log.txt?x=148cd519e8
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+f8a023e0c6beabe23...@syzkaller.appspotmail.com
> Fixes: 7e8358edf503 ("eventfs: Fix file and directory uid and gid ownership")
> 
> BUG: unable to handle page fault for address: fff0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x) - not-present page
> PGD d734067 P4D d734067 PUD d736067 PMD 0 
> Oops:  [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 5056 Comm: syz-executor170 Not tainted 
> 6.7.0-rc7-syzkaller-00049-g453f5db0619e #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 11/17/2023
> RIP: 0010:set_gid fs/tracefs/inode.c:224 [inline]
> RIP: 0010:tracefs_apply_options+0x4d0/0xa40 fs/tracefs/inode.c:337
> Code: 24 10 49 8b 1e 48 83 c3 f0 74 3d 48 89 d8 48 c1 e8 03 48 bd 00 00 00 00 
> 00 fc ff df 80 3c 28 00 74 08 48 89 df e8 70 ff 88 fe <48> 8b 1b 48 89 de 48 
> 83 e6 02 31 ff e8 bf fe 2c fe 48 83 e3 02 75
> RSP: 0018:c900040ffca8 EFLAGS: 00010246
> RAX: 1ffe RBX: fff0 RCX: 888014bf5940
> RDX:  RSI: 0004 RDI: c900040ffc20
> RBP: dc00 R08: 0003 R09: f5200081ff84
> R10: dc00 R11: f5200081ff84 R12: 88801d743888
> R13: 88801b0c3710 R14: 88801d7437e8 R15: 88801d743810
> FS:  557dd480() GS:8880b980() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: fff0 CR3: 1ec48000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  tracefs_remount+0x78/0x80 fs/tracefs/inode.c:353
>  reconfigure_super+0x440/0x870 fs/super.c:1143
>  do_remount fs/namespace.c:2884 [inline]

This is the same bug that was fixed by:

   
https://lore.kernel.org/linux-trace-kernel/20240102151249.05da2...@gandalf.local.home/

And just waiting to be applied:

   https://lore.kernel.org/all/20240102210731.1f1c5...@gandalf.local.home/

Thanks,

-- Steve

>  path_mount+0xc24/0xfa0 fs/namespace.c:3656
>  do_mount fs/namespace.c:3677 [inline]
>  __do_sys_mount fs/namespace.c:3886 [inline]
>  __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3863
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0x45/0x110 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x63/0x6b
> RIP: 0033:0x7fec326e8d99
> Code: 48 83 c4 28 c3 e8 67 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7ffc8103ddf8 EFLAGS: 0246 ORIG_RAX: 00a5
> RAX: ffda RBX: 7ffc8103de00 RCX: 7fec326e8d99
> RDX:  RSI: 20c0 RDI: 
> RBP: 7ffc8103de08 R08: 2140 R09: 7fec326b5b80
> R10: 02200022 R11: 0246 R12: 
> R13: 7ffc8103e068 R14: 0001 R15: 0001
>  
> Modules linked in:
> CR2: fff0
> ---[ end trace  ]---
> RIP: 0010:set_gid fs/tracefs/inode.c:224

Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Steven Rostedt
On Wed, 3 Jan 2024 10:38:09 -0800
Linus Torvalds  wrote:

> @@ -332,10 +255,8 @@ static int tracefs_apply_options(struct super_block *sb, 
> bool remount)
>   if (!remount || opts->opts & BIT(Opt_uid))
>   inode->i_uid = opts->uid;
>  
> - if (!remount || opts->opts & BIT(Opt_gid)) {
> - /* Set all the group ids to the mount option */
> - set_gid(sb->s_root, opts->gid);
> - }
> + if (!remount || opts->opts & BIT(Opt_gid))
> + inode->i_gid = opts->gid;
>  
>   return 0;
>  }

This doesn't work because for tracefs (not eventfs) the dentries are
created at boot up and before the file system is mounted. This means you
can't even set a gid in /etc/fstab. This will cause a regression.

tracefs was designed after debugfs, which also ignores gid. But because
there's users out there that want non-root accounts to have access to
tracing, it is documented to set the gid to a group that you can then add
users to. And that's the reason behind the set_gid() walk.

Reverting that one commit won't fix things either, because it only blocked
OTH from reading, but the modes passed when creating the files were changed
to block OTH as well, so all of those would need to be changed too. And I
don't think making the trace files open to OTH is a good solution, even if
the tracefs top level directory itself blocks other access. The issue was
that users used to just mount the top level to allow a group access to the
files below, which allowed all users access. But this is weak control of the
file system.

Even my non-test machines have me in the tracing group so my user account
has access to tracefs.

On boot up, all the tracefs files are created via tracefs_create_file() and
directories by tracefs_create_dir() which was copied from
debugfs_create_file/dir(). At this moment, the dentry is created with the
permissions set. There's no looking at the super block.

So we need a way to change the permissions at mount time.

The only solution I can think of that doesn't include walking the current
dentries, is to convert all of tracefs to be more like eventfs, and have
the dentries created on demand. But perhaps, different than eventfs, they
do not need to be freed when they are no longer referenced, which should
make it easier to implement. And there's not nearly as many files and
directories, so keeping meta data around isn't as much of an issue.

Instead of creating the inode and dentry in the tracefs_create_file/dir(),
it could just create a descriptor that holds the fops, data and mode. Then
on lookup, it would create the inodes and dentries similar to eventfs.

It would need its own iterate_shared as well.

-- Steve



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Steven Rostedt
On Wed, 3 Jan 2024 10:38:09 -0800
Linus Torvalds  wrote:

> On Wed, 3 Jan 2024 at 10:12, Linus Torvalds
>  wrote:
> >
> > Much better. Now eventfs looks more like a real filesystem, and less
> > like an eldritch horror monster that is parts of dcache tacked onto a
> > pseudo-filesystem.  
> 
> Oh, except I think you still need to just remove the 'set_gid()' mess.
> 
> It's disgusting and it's wrong, and it's not even what the 'uid'
> option does (it only sets the root inode uid).
> 
> If you remount the filesystem with different gid values, you get to
> keep both broken pieces. And if it isn't a remount, then setting the
> root uid is sufficient.
> 
> I think the whole thing was triggered by commit 49d67e445742, and
> maybe the fix is to just revert that commit.
> 
> That commit makes no sense in general, since the default mounting
> position for tracefs that the kernel sets up is only accessible to
> root anyway.
> 
> Alternatively, just do the ->permissions() thing, and allow access to
> the group in the mount options.
> 
> Getting rid of set_gid() would be this attached lovely patch:
> 
>  fs/tracefs/inode.c | 83 
> ++
>  1 file changed, 2 insertions(+), 81 deletions(-)
> 
> and would get rid of the final (?) piece of disgusting dcache hackery
> that tracefs most definitely should not have.
>

I'll look at that and play with it. I understand VFS much better now that I
spent so much time with eventfs. That commit had to do with allowing OTH
read access, which is a security issue as the trace files expose a lot of
the kernel internals.

I think these changes are a bit much for -rc8, don't you? Or do you want
all this in before v6.7 is released? I'd be more comfortable with adding
these changes in the upcoming merge window, where I can have more time
playing with them.

-- Steve



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Steven Rostedt
On Wed, 3 Jan 2024 10:12:08 -0800
Linus Torvalds  wrote:

> On Wed, 3 Jan 2024 at 07:24, Steven Rostedt  wrote:
> >
> > Instead, just have eventfs have its own iterate_shared callback function
> > that will fill in the dent entries. This simplifies the code quite a bit.  
> 
> Much better. Now eventfs looks more like a real filesystem, and less
> like an eldritch horror monster that is parts of dcache tacked onto a
> pseudo-filesystem.

Thanks.

> 
> However, one request, and one nit:
> 
> > Also, remove the "lookup" parameter to the create_file/dir_dentry() and
> > always have it return a dentry that has its ref count incremented, and
> > have the caller call the dput. This simplifies that code as well.  
> 
> Can you please do that as a separate patch, where the first patch just
> cleans up the directory iteration, and the second patch then goes "now
> there are no more callers that have the 'lookup' argument set to
> false".

Yeah, I was thinking of doing it as two patches and figured I'd merge them
into one because I deleted one of the users of it. As I was on the fence
about doing two patches, I'm happy to change that.

> 
> Because as-is, the patch is kind of two things mixed up.
> 
> The small nit is this:
> 
> > +static int eventfs_iterate(struct file *file, struct dir_context *ctx)
> >  {
> > +   /*
> > +* Need to create the dentries and inodes to have a consistent
> > +* inode number.
> > +*/
> > list_for_each_entry_srcu(ei_child, >children, list,
> >  srcu_read_lock_held(_srcu)) {
> > -   d = create_dir_dentry(ei, ei_child, parent, false);
> > -   if (d) {
> > -   ret = add_dentries(, d, cnt);
> > -   if (ret < 0)
> > -   break;
> > -   cnt++;
> > +
> > +   if (ei_child->is_freed)
> > +   continue;
> > +
> > +   name = ei_child->name;
> > +
> > +   dentry = create_dir_dentry(ei, ei_child, ei_dentry);
> > +   if (!dentry)
> > +   goto out;
> > +   ino = dentry->d_inode->i_ino;
> > +   dput(dentry);
> > +
> > +   if (c > 0) {
> > +   c--;
> > +   continue;
> > }  
> 
> Just move this "is the position before this name" up to the top of the
> loop. Even above the "is_freed" part.
> 
> Let's just always count all the entries in the child list.
> 
> And same for the ei->nr_entries loop:
> 
> > for (i = 0; i < ei->nr_entries; i++) {  
> 
> where there's no point in creating that dentry just to look up the
> inode number, only to then decide "Oh, we already iterated past this
> part, so let's not do anything with it".
> 
> This wouldn't seem to matter much with a big enough getdents buffer
> (which is the normal user level behavior), but it actually does,
> because we don't keep track of "we have read to the end of the
> directory".
> 
> So every readdir ends up effectively doing getdents at least twice:
> once to read the directory entries, and then once to just be told
> "that was all".
> 
> End result: you should strive very hard to *not* waste time on the
> directory entries that have already been read, and are less than
> 'ctx->pos'.

My patch originally did that, but then I was worried about counting something
that doesn't exist.

If it is done twice, there's a good chance the dentry will still be around
anyway, so it doesn't slow it down that much. The dput() only decrements
the refcount and doesn't free the dentry. I added back my "show_events_dentries" file
to test this. They sit with refcount equal to zero waiting to be reclaimed.
But if they get referenced again, the refcount goes up again.

That is, the first time it is called, where ctx->pos is likely zero, it
creates the dentry, but that is also added to the list. The next time, with
ctx->pos greater than zero, the create_dir_dentry() starts with:

static struct dentry *
create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
  struct dentry *parent)
{
struct dentry *dentry = NULL;

WARN_ON_ONCE(!inode_is_locked(parent->d_inode));

mutex_lock(_mutex);
if (pei->is_freed || ei->is_freed) {
mutex_unlock(_mutex);
return NULL;
}
if (ei->dentry) {
/* If the dentry already has a dentr

[PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The eventfs creates dynamically allocated dentries and inodes. Using the
dcache_readdir() logic for its own directory lookups requires hiding the
cursor of the dcache logic and playing games to allow the dcache_readdir()
to still have access to the cursor while the eventfs saved what it created
and what it needs to release.

Instead, just have eventfs have its own iterate_shared callback function
that will fill in the dent entries. This simplifies the code quite a bit.

Also, remove the "lookup" parameter to the create_file/dir_dentry() and
always have it return a dentry that has its ref count incremented, and
have the caller call the dput. This simplifies that code as well.

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 237 +--
 1 file changed, 78 insertions(+), 159 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f0677ea0ec24..53d34a4b5a2b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -52,9 +52,7 @@ enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
  struct dentry *dentry,
  unsigned int flags);
-static int dcache_dir_open_wrapper(struct inode *inode, struct file *file);
-static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx);
-static int eventfs_release(struct inode *inode, struct file *file);
+static int eventfs_iterate(struct file *file, struct dir_context *ctx);
 
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
@@ -148,11 +146,9 @@ static const struct inode_operations 
eventfs_file_inode_operations = {
 };
 
 static const struct file_operations eventfs_file_operations = {
-   .open   = dcache_dir_open_wrapper,
.read   = generic_read_dir,
-   .iterate_shared = dcache_readdir_wrapper,
+   .iterate_shared = eventfs_iterate,
.llseek = generic_file_llseek,
-   .release= eventfs_release,
 };
 
 /* Return the evenfs_inode of the "events" directory */
@@ -390,16 +386,14 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, 
struct dentry *dentry)
  * @mode: The mode of the file.
  * @data: The data to use to set the inode of the file with on open()
  * @fops: The fops of the file to be created.
- * @lookup: If called by the lookup routine, in which case, dput() the created 
dentry.
  *
  * Create a dentry for a file of an eventfs_inode @ei and place it into the
- * address located at @e_dentry. If the @e_dentry already has a dentry, then
- * just do a dget() on it and return. Otherwise create the dentry and attach 
it.
+ * address located at @e_dentry.
  */
 static struct dentry *
 create_file_dentry(struct eventfs_inode *ei, int idx,
   struct dentry *parent, const char *name, umode_t mode, void 
*data,
-  const struct file_operations *fops, bool lookup)
+  const struct file_operations *fops)
 {
struct eventfs_attr *attr = NULL;
struct dentry **e_dentry = >d_children[idx];
@@ -414,9 +408,7 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
}
/* If the e_dentry already has a dentry, use it */
if (*e_dentry) {
-   /* lookup does not need to up the ref count */
-   if (!lookup)
-   dget(*e_dentry);
+   dget(*e_dentry);
mutex_unlock(_mutex);
return *e_dentry;
}
@@ -441,13 +433,12 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 * way to being freed, don't return it. If e_dentry is NULL
 * it means it was already freed.
 */
-   if (ei->is_freed)
+   if (ei->is_freed) {
dentry = NULL;
-   else
+   } else {
dentry = *e_dentry;
-   /* The lookup does not need to up the dentry refcount */
-   if (dentry && !lookup)
dget(dentry);
+   }
mutex_unlock(_mutex);
return dentry;
}
@@ -465,9 +456,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
}
mutex_unlock(_mutex);
 
-   if (lookup)
-   dput(dentry);
-
return dentry;
 }
 
@@ -500,13 +488,12 @@ static void eventfs_post_create_dir(struct eventfs_inode 
*ei)
  * @pei: The eventfs_inode parent of ei.
  * @ei: The eventfs_inode to create the directory for
  * @parent: The dentry of the parent of this directory
- * @lookup: True if this is called by the lookup code
  *
  * This creates and attaches a directory dentry to the eventfs_inode @ei.
  */
 static struct dentry *
 create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
- struct dentry

Re: [PATCH 1/1] fs/proc: remove redudant comments from /proc/bootconfig

2024-01-02 Thread Steven Rostedt
On Tue, 2 Jan 2024 18:19:37 +0800
Zhenhua Huang  wrote:

> commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to
> /proc/bootconfig") adds bootloader argument comments into /proc/bootconfig.
> 
> /proc/bootconfig shows boot_command_line[] multiple times following
> every xbc key value pair, that's duplicated and not necessary.
> Remove redundant ones.
> 
> Output before and after the fix is like:
> key1 = value1
> *bootloader argument comments*
> key2 = value2
> *bootloader argument comments*
> key3 = value3
> *bootloader argument comments*
> ...
> 
> key1 = value1
> key2 = value2
> key3 = value3
> *bootloader argument comments*
> ...
> 
> Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to 
> /proc/bootconfig")
> Signed-off-by: Zhenhua Huang 

Nice catch.

Reviewed-by: Steven Rostedt (Google) 

-- Steve

> ---
>  fs/proc/bootconfig.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
> index 902b326..e5635a6 100644
> --- a/fs/proc/bootconfig.c
> +++ b/fs/proc/bootconfig.c
> @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, 
> size_t size)
>   break;
>   dst += ret;
>   }
> - if (ret >= 0 && boot_command_line[0]) {
> - ret = snprintf(dst, rest(dst, end), "# Parameters from 
> bootloader:\n# %s\n",
> -boot_command_line);
> - if (ret > 0)
> - dst += ret;
> - }
> + }
> + if (ret >= 0 && boot_command_line[0]) {
> + ret = snprintf(dst, rest(dst, end), "# Parameters from 
> bootloader:\n# %s\n",
> +boot_command_line);
> + if (ret > 0)
> + dst += ret;
>   }
>  out:
>   kfree(key);




[PATCH] eventfs: Fix bitwise fields for "is_events"

2024-01-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

A flag was needed to denote which eventfs_inode was the "events"
directory, so a bit was taken from the "nr_entries" field, as there are not
that many entries and 2^30 is plenty. But the bit width of nr_entries
was not updated to reflect the bit taken from it, and the extra bit would
add an unnecessary integer to the structure.

Cc: sta...@vger.kernel.org
Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 899e447778ac..42bdeb471a07 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -63,7 +63,7 @@ struct eventfs_inode {
};
unsigned intis_freed:1;
unsigned intis_events:1;
-   unsigned intnr_entries:31;
+   unsigned intnr_entries:30;
 };
 
 static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
-- 
2.42.0




[PATCH v2] tracefs: Check for dentry->d_inode exists in set_gid()

2024-01-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If a getdents() is called on the tracefs directory but does not get all
the files, it can leave a "cursor" dentry in the d_subdirs list of tracefs
dentry. This cursor dentry does not have a d_inode for it. Before
referencing the tracefs_inode from the dentry, it must first be checked
that the d_inode exists. If it does not, then it's not a tracefs_inode and
can be ignored.

The following caused a crash:

 #define getdents64(fd, dirp, count) syscall(SYS_getdents64, fd, dirp, count)
 #define BUF_SIZE 256
 #define TDIR "/tmp/file0"

 int main(void)
 {
char buf[BUF_SIZE];
int fd;
int n;
int ret;

mkdir(TDIR, 0777);
mount(NULL, TDIR, "tracefs", 0, NULL);
fd = openat(AT_FDCWD, TDIR, O_RDONLY);
n = getdents64(fd, buf, BUF_SIZE);
ret = mount(NULL, TDIR, NULL, 
MS_NOSUID|MS_REMOUNT|MS_RELATIME|MS_LAZYTIME,
"gid=1000");
return 0;
 }

That's because the 256 BUF_SIZE was not big enough to read all the
dentries of the tracefs file system and it left a "cursor" dentry in the
subdirs of the tracefs root inode. Then on remounting with "gid=1000",
it would cause an iteration of all dentries which hit:

ti = get_tracefs(dentry->d_inode);
if (ti && (ti->flags & TRACEFS_EVENT_INODE))
eventfs_update_gid(dentry, gid);

Which crashed because of the dereference of the cursor dentry which had a NULL
d_inode.

In the subdir loop of the dentry lookup of set_gid(), if a child has a
NULL d_inode, simply skip it.

Link: https://lore.kernel.org/all/20240102135637.3a21f...@gandalf.local.home/

Cc: sta...@vger.kernel.org
Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership")
Reported-by: "Ubisectech Sirius" 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240102142311.56708...@gandalf.local.home

- Simplify the logic to just continue the loop if the cursor dentry is hit.

 fs/tracefs/inode.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 62524b20964e..bc86ffdb103b 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -215,6 +215,10 @@ static void set_gid(struct dentry *parent, kgid_t gid)
struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
next = tmp->next;
 
+   /* Note, getdents() can add a cursor dentry with no inode */
+   if (!dentry->d_inode)
+   continue;
+
spin_lock_nested(>d_lock, DENTRY_D_LOCK_NESTED);
 
change_gid(dentry, gid);
-- 
2.42.0




[PATCH] tracefs: Check for dentry->d_inode exists in set_gid()

2024-01-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If a getdents() is called on the tracefs directory but does not get all
the files, it can leave a "cursor" dentry in the d_subdirs list of tracefs
dentry. This cursor dentry does not have a d_inode for it. Before
referencing the tracefs_inode from the dentry, it must first be checked
that the d_inode exists. If it does not, then it's not a tracefs_inode and
can be ignored.

The following caused a crash:

 #define getdents64(fd, dirp, count) syscall(SYS_getdents64, fd, dirp, count)
 #define BUF_SIZE 256
 #define TDIR "/tmp/file0"

 int main(void)
 {
char buf[BUF_SIZE];
int fd;
int n;
int ret;

mkdir(TDIR, 0777);
mount(NULL, TDIR, "tracefs", 0, NULL);
fd = openat(AT_FDCWD, TDIR, O_RDONLY);
n = getdents64(fd, buf, BUF_SIZE);
ret = mount(NULL, TDIR, NULL, 
MS_NOSUID|MS_REMOUNT|MS_RELATIME|MS_LAZYTIME,
"gid=1000");
return 0;
 }

That's because the 256 BUF_SIZE was not big enough to read all the
dentries of the tracefs file system and it left a "cursor" dentry in the
subdirs of the tracefs root inode. Then on remounting with "gid=1000",
it would cause an iteration of all dentries which hit:

ti = get_tracefs(dentry->d_inode);
if (ti && (ti->flags & TRACEFS_EVENT_INODE))
eventfs_update_gid(dentry, gid);

Which crashed because of the dereference of the cursor dentry which had a NULL
d_inode.

Link: https://lore.kernel.org/all/20240102135637.3a21f...@gandalf.local.home/

Cc: sta...@vger.kernel.org
Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership")
Reported-by: "Ubisectech Sirius" 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/inode.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 62524b20964e..c29387a36bc8 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -219,10 +219,13 @@ static void set_gid(struct dentry *parent, kgid_t gid)
 
change_gid(dentry, gid);
 
-   /* If this is the events directory, update that too */
-   ti = get_tracefs(dentry->d_inode);
-   if (ti && (ti->flags & TRACEFS_EVENT_INODE))
-   eventfs_update_gid(dentry, gid);
+   /* Note, getdents() can add a cursor dentry with no inode */
+   if (dentry->d_inode) {
+   /* If this is the events directory, update that too */
+   ti = get_tracefs(dentry->d_inode);
+   if (ti && (ti->flags & TRACEFS_EVENT_INODE))
+   eventfs_update_gid(dentry, gid);
+   }
 
if (!list_empty(>d_subdirs)) {
spin_unlock(_parent->d_lock);
-- 
2.42.0




Re: BUG: unable to handle page fault for address: fffffffffffffff0 in tracefs_apply_options

2024-01-02 Thread Steven Rostedt
On Tue, 02 Jan 2024 18:54:26 +0800
"Ubisectech Sirius"  wrote:

> Dear concerned.
> Greetings!
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
> Recently, our team has discovered an issue in Linux kernel 6.7.
> technical details:
> 1. Vulnerability Description: BUG: unable to handle page fault for address: 
> fffffffffffffff0 in tracefs_apply_options

This is not a vulnerability, as it requires root privilege to trigger.

Taking off security list and adding LKML and linux-trace-kernel.


> 2. stack dump: 
> [ 1938.699609][ T8046] BUG: unable to handle page fault for address: 
> fffffffffffffff0
> [ 1938.702696][ T8046] #PF: supervisor read access in kernel mode
> [ 1938.704647][ T8046] #PF: error_code(0x) - not-present page
> [ 1938.706699][ T8046] PGD 8f7a067 P4D 8f7a067 PUD 8f7c067 PMD 0
> [ 1938.708939][ T8046] Oops:  [#1] PREEMPT SMP KASAN
> [ 1938.710819][ T8046] CPU: 0 PID: 8046 Comm: poc Not tainted 
> 6.7.0-rc8-g610a9b8f49fb #4
> [ 1938.713729][ T8046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [ 1938.717282][ T8046] RIP: 0010:tracefs_apply_options.isra.0 
> (fs/tracefs/inode.c:224 fs/tracefs/inode.c:337)
> [ 1938.720778][ T8046] Code: 24 1c 41 89 47 08 4c 8b bb 60 ff ff ff 4c 89 ff 
> 48 83 ef 10 74 31 48 89 7c 24 08 e8 c7 61 f8 fe 48 8b 7c 24 08 e8 2d b5 30 ff 
> <49> 8b 47 f0 31 ff 83 e0 02 49 89 c7 48 89 c6 e8 e9 5d f8 fe 4d 85
> All code
> 
>  0: 24 1c and $0x1c,%al
>  2: 41 89 47 08 mov %eax,0x8(%r15)
>  6: 4c 8b bb 60 ff ff ff mov -0xa0(%rbx),%r15
>  d: 4c 89 ff mov %r15,%rdi
>  10: 48 83 ef 10 sub $0x10,%rdi
>  14: 74 31 je 0x47
>  16: 48 89 7c 24 08 mov %rdi,0x8(%rsp)
>  1b: e8 c7 61 f8 fe call 0xfef861e7
>  20: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi
>  25: e8 2d b5 30 ff call 0xff30b557
>  2a:* 49 8b 47 f0 mov -0x10(%r15),%rax <-- trapping instruction
>  2e: 31 ff xor %edi,%edi
>  30: 83 e0 02 and $0x2,%eax
>  33: 49 89 c7 mov %rax,%r15
>  36: 48 89 c6 mov %rax,%rsi
>  39: e8 e9 5d f8 fe call 0xfef85e27
>  3e: 4d rex.WRB
>  3f: 85 .byte 0x85
> Code starting with the faulting instruction
> ===
>  0: 49 8b 47 f0 mov -0x10(%r15),%rax
>  4: 31 ff xor %edi,%edi
>  6: 83 e0 02 and $0x2,%eax
>  9: 49 89 c7 mov %rax,%r15
>  c: 48 89 c6 mov %rax,%rsi
>  f: e8 e9 5d f8 fe call 0xfef85dfd
>  14: 4d rex.WRB
>  15: 85 .byte 0x85
> [ 1938.728026][ T8046] RSP: 0018:8880228dfca0 EFLAGS: 00010246
> [ 1938.730241][ T8046] RAX:  RBX: 888020c4cf00 RCX: 
> 8261a573
> [ 1938.733108][ T8046] RDX: dc00 RSI: 8261a569 RDI: 
> fff0
> [ 1938.736001][ T8046] RBP: 88801140d6a0 R08: 81457a74 R09: 
> ed100451bf86
> [ 1938.738918][ T8046] R10: 0003 R11: 0002 R12: 
> 88801140d7b8
> [ 1938.741677][ T8046] R13: 888020c4ce88 R14: 88801140d730 R15: 
> 
> [ 1938.75][ T8046] FS: 7f35440c2640() GS:88802d20() 
> knlGS:
> [ 1938.747704][ T8046] CS: 0010 DS:  ES:  CR0: 80050033
> [ 1938.750058][ T8046] CR2: fff0 CR3: 1885e000 CR4: 
> 00750ef0
> [ 1938.753025][ T8046] DR0:  DR1:  DR2: 
> 
> [ 1938.755888][ T8046] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 1938.758848][ T8046] PKRU: 5554
> [ 1938.760125][ T8046] Call Trace:
> [ 1938.761327][ T8046] 
> [ 1938.762430][ T8046] ? show_regs (arch/x86/kernel/dumpstack.c:479)
> [ 1938.764058][ T8046] ? __die (arch/x86/kernel/dumpstack.c:421 
> arch/x86/kernel/dumpstack.c:434)
> [ 1938.765504][ T8046] ? page_fault_oops (arch/x86/mm/fault.c:703)
> [ 1938.767316][ T8046] ? search_extable (lib/extable.c:115)
> [ 1938.770094][ T8046] ? dump_pagetable (arch/x86/mm/fault.c:635)
> [ 1938.771836][ T8046] ? pgtable_bad (arch/x86/mm/fault.c:122)
> [ 1938.773468][ T8046] ? preempt_count_add (kernel/sched/core.c:5841)
> [ 1938.775301][ T8046] ? preempt_count_sub (kernel/sched/core.c:5865)
> [ 1938.777117][ T8046] ? search_module_extables (kernel/module/main.c:3243 
> (discriminator 3))
> [ 1938.779095][ T8046] ? fixup_exception (arch/x86/mm/extable.c:305)
> [ 1938.780930][ T8046] ? kernelmode_fixup_or_oops (arch/x86/mm/fault.c:761)
> [ 1938.783041][ T8046] ? __bad_area_nosemaphore (arch/x86/mm/fault.c:819)
> [ 1938.784770][ T8046] ? do_kern_addr_fault (arch/x86/mm/fault.c:1227)
> [ 1938.785242][ T8046] ? exc_page_fault (arch/x86/mm/fault.c:1503 
> arch/x86/mm/fault.c:1561)
> [ 1938.786979][ T8046] ? asm_exc_page_fault 
> (./arch/x86/include/asm/idtentry.h:570)
> [ 1938.788858][ T8046] ? do_raw_spin_lock (kernel/locking/spinlock_debug.c:93 
> kernel/locking/spinlock_debug.c:117)
> [ 1938.789530][ T8046] ? tracefs_apply_options.isra.0 (fs/tracefs/inode.c:224 
> fs/tracefs/inode.c:337)
> [ 1938.790145][ T8046] ? tracefs_apply_options.isra.0 (fs/tracefs/inode.c:224 

Re: [PATCH v3 01/34] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()

2024-01-02 Thread Steven Rostedt
On Thu, 14 Dec 2023 00:24:21 +0100
Ilya Leoshkevich  wrote:

> Architectures use assembly code to initialize ftrace_regs and call
> ftrace_ops_list_func(). Therefore, from the KMSAN's point of view,
> ftrace_regs is poisoned on ftrace_ops_list_func entry(). This causes
> KMSAN warnings when running the ftrace testsuite.

BTW, why is this only a problem for s390 and no other architectures?

If it is only a s390 thing, then we should do this instead:

in include/linux/ftrace.h:

/* Add a comment here to why this is needed */
#ifndef ftrace_list_func_unpoison
# define ftrace_list_func_unpoison(fregs) do { } while(0)
#endif

In arch/s390/include/asm/ftrace.h:

/* Add a comment to why s390 is special */
# define ftrace_list_func_unpoison(fregs) kmsan_unpoison_memory(fregs, 
sizeof(*fregs))

> 
> Fix by trusting the architecture-specific assembly code and always
> unpoisoning ftrace_regs in ftrace_ops_list_func.
> 
> Acked-by: Steven Rostedt (Google) 

I'm taking my ack away for this change in favor of what I'm suggesting now.

> Reviewed-by: Alexander Potapenko 
> Signed-off-by: Ilya Leoshkevich 
> ---
>  kernel/trace/ftrace.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8de8bec5f366..dfb8b26966aa 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -7399,6 +7399,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long 
> parent_ip,
>  void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>  struct ftrace_ops *op, struct ftrace_regs *fregs)
>  {
> + kmsan_unpoison_memory(fregs, sizeof(*fregs));

And here have:

ftrace_list_func_unpoison(fregs);

That way we only do it for archs that really need it, and do not affect
archs that do not.


I want to know why this only affects s390, because if we are just doing
this because "it works", it could be just covering up a symptom of
something else and not actually doing the "right thing".


-- Steve


>   __ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
>  }
>  #else




Re: Unable to trace nf_nat function using kprobe with latest kernel 6.1.66-1

2024-01-02 Thread Steven Rostedt
On Tue, 2 Jan 2024 11:23:54 +0530
P K  wrote:

> Hi,
> 
> I am unable to trace nf_nat functions using kprobe with the latest kernel.
> Previously in kernel 6.1.55-1 it was working fine.
> Can someone please check if it's broken with the latest commit or  i
> have to use it differently?
> 

Note, attaching to kernel functions is never considered stable API and may
break at any kernel release.

Also, if the compiler decides to inline the function and it is no
longer visible in /proc/kallsyms, then that too will cause this to break.

-- Steve


> Mentioned below are output:
> Kernel - 6.1.66-1
> / # bpftrace -e 'kprobe:nf_nat_ipv4_manip_pkt { printf("func called\n"); }'
> Attaching 1 probe...
> cannot attach kprobe, probe entry may not exist
> ERROR: Error attaching probe: 'kprobe:nf_nat_ipv4_manip_pkt'
> 
> 
> 
> Kernel 6.1.55-1
> / # bpftrace -e 'kprobe:nf_nat_ipv4_manip_pkt { printf("func called\n"); }'
> Attaching 1 probe...
> func called
> func called




Re: [PATCH] ftrace: Fix modification of direct_function hash while in use

2023-12-29 Thread Steven Rostedt


Masami and Jiri,

This patch made it through all my tests. If I can get an Acked-by by
Sunday, I'll include it in my push to Linus (I have a couple of other fixes
to send him).

-- Steve


On Fri, 29 Dec 2023 11:51:34 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Masami Hiramatsu reported a memory leak in register_ftrace_direct() where,
> if the number of new entries added is large enough, two allocations can
> occur in the loop:
> 
> for (i = 0; i < size; i++) {
> hlist_for_each_entry(entry, >buckets[i], hlist) {
> new = ftrace_add_rec_direct(entry->ip, addr, 
> _hash);
> if (!new)
> goto out_remove;
> entry->direct = addr;
> }
> }
> 
> Where ftrace_add_rec_direct() has:
> 
> if (ftrace_hash_empty(direct_functions) ||
> direct_functions->count > 2 * (1 << direct_functions->size_bits)) 
> {
> struct ftrace_hash *new_hash;
> int size = ftrace_hash_empty(direct_functions) ? 0 :
> direct_functions->count + 1;
> 
> if (size < 32)
> size = 32;
> 
> new_hash = dup_hash(direct_functions, size);
> if (!new_hash)
> return NULL;
> 
> *free_hash = direct_functions;
> direct_functions = new_hash;
> }
> 
> The "*free_hash = direct_functions;" can happen twice, losing the previous
> allocation of direct_functions.
> 
> But this also exposed a more serious bug.
> 
> The modification of direct_functions above is not safe. As
> direct_functions can be referenced at any time to find what direct caller
> it should call, the time between:
> 
> new_hash = dup_hash(direct_functions, size);
>  and
> direct_functions = new_hash;
> 
> can have a race with another CPU (or even this one if it gets interrupted),
> and the entries being moved to the new hash are not referenced.
> 
> That's because "dup_hash()" is misnamed: it is really a "move_hash()".
> It moves the entries from the old hash to the new one.
> 
> Now even if that was changed, this code is not proper as direct_functions
> should not be updated until the end. That is the best way to handle
> function reference changes, and is the way other parts of ftrace handle
> this.
> 
> The following is done:
> 
>  1. Change add_hash_entry() to return the entry it created and inserted
> into the hash, and not just return success or not.
> 
>  2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
> the former.
> 
>  3. Allocate a "new_hash" at the start that is made for holding both the
> new hash entries as well as the existing entries in direct_functions.
> 
>  4. Copy (not move) the direct_function entries over to the new_hash.
> 
>  5. Copy the entries of the added hash to the new_hash.
> 
>  6. If everything succeeds, then use rcu_assign_pointer() to update the
> direct_functions with the new_hash.
> 
> This simplifies the code and fixes both the memory leak and the race
> condition mentioned above.
> 
> Link: 
> https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/
> 
> Cc: sta...@vger.kernel.org
> Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ftrace.c | 100 --
>  1 file changed, 47 insertions(+), 53 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8de8bec5f366..b01ae7d36021 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash,
>   hash->count++;
>  }
>  
> -static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
> +static struct ftrace_func_entry *
> +add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
>  {
>   struct ftrace_func_entry *entry;
>  
>   entry = kmalloc(sizeof(*entry), GFP_KERNEL);
>   if (!entry)
> - return -ENOMEM;
> + return NULL;
>  
>   entry->ip = ip;
>   __add_hash_entry(hash, entry);
>  
> - return 0;
> + return entry;
>  }
>  
>  static void
> @@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct 
> ftrace_hash *hash)
>   struct ftra

Re: [RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping

2023-12-29 Thread Steven Rostedt
On Fri, 29 Dec 2023 13:40:50 -0500
Steven Rostedt  wrote:

> I'm sending this to a wider audience, as I want to hear more
> feedback on this before I accept it.
>

I forgot to mention that this can be applied on top of:


   git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git  for-next

-- Steve



[RFC][PATCH 2/2] tracing: Allow user-space mapping of the ring-buffer

2023-12-29 Thread Steven Rostedt
From: Vincent Donnefort 

Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Link: 
https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-3-vdonnef...@google.com

Signed-off-by: Vincent Donnefort 
Signed-off-by: Steven Rostedt (Google) 
---
 include/uapi/linux/trace_mmap.h |  2 +
 kernel/trace/trace.c| 79 +++--
 2 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index f950648b0ba9..8c49489c5867 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -26,4 +26,6 @@ struct trace_buffer_meta {
__u32   meta_struct_len;/* Len of this struct */
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 46dbe22121e9..cfeaf2cd204e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
+
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
struct trace_iterator *iter = &info->iter;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
struct trace_iterator *iter = &info->iter;
+
+   WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+   .open   = tracing_buffers_mmap_open,
+   .close  = tracing_buffers_mmap_close,
+   .fault  = tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = filp->private_da

[RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping

2023-12-29 Thread Steven Rostedt


I'm sending this to a wider audience, as I want to hear more
feedback on this before I accept it.

Vincent has been working on allowing the ftrace ring buffer to be
memory mapped into user space. This has been going on since
last year, when we talked at the 2022 Tracing Summit in London.

Vincent's last series can be found here:

   
https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-1-vdonnef...@google.com/

But I'm posting these as these are what I now have in my queue.

I've tested these patches pretty thoroughly and they look good.
I even have a libtracefs API patch ready to implement this.

For testing, you can install:

 git://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git (version 1.8.1)

 git://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git
  (latest branch of libtracefs)

Then apply:

  https://lore.kernel.org/all/20231228201100.78aae...@rorschach.local.home/

to libtracefs. And then build the samples:

  make samples

Which will create: bin/cpu-map

That you can use with:

  # trace-cmd start -e sched
  # bin/cpu-map 0

Which will output all the events in CPU 0 via the memory mapping if it is
supported by the kernel; otherwise it will print at the start:

  "Was not able to map, falling back to buffered read"

If you want to see the source code for cpu-map, you have to look
at the man page ;-) The "make samples" will extract the example code
from the man pages and build them.

  Documentation/libtracefs-cpu-map.txt

The library API is rather simple, it just has:

 tracefs_cpu_open_mapped()
 tracefs_cpu_is_mapped()
 tracefs_cpu_map()
 tracefs_cpu_unmap()

Which will create a tracefs_cpu handle for reading. If it fails to
map, it just does a normal read.

Anyway, this email is not about the library interface, but more of
the kernel interface. And that is this:

First there's a "meta page" that can mapped via mmap() on the
trace_pipe_raw file (there's one trace_pipe_raw file per CPU):

page_size = getpagesize();
meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);

The meta will have a layout of:

struct trace_buffer_meta {
unsigned long   entries;
unsigned long   overrun;
unsigned long   read;

unsigned long   subbufs_touched;
unsigned long   subbufs_lost;
unsigned long   subbufs_read;

struct {
unsigned long   lost_events;/* Events lost at the time of 
the reader swap */
__u32   id; /* Reader subbuf ID from 0 to 
nr_subbufs - 1 */
__u32   read;   /* Number of bytes read on the 
reader subbuf */
} reader;

__u32   subbuf_size;/* Size of each subbuf 
including the header */
__u32   nr_subbufs; /* Number of subbufs in the 
ring-buffer */

__u32   meta_page_size; /* Size of the meta-page */
__u32   meta_struct_len;/* Len of this struct */
};

The meta_page_size can grow if we need to (and then we can extend this
API if we need to). If the meta_page_size is greater than a page, user
space should remap it:

if (meta->meta_page_size > page_size) {
int new_size = meta->meta_page_size;
munmap(meta, page_size);
meta = mmap(NULL, new_size, PROT_READ, MAP_SHARED, fd, 0);  
}

Now the sub-buffers of the ring buffer are mapped right after
the meta page. They can be added to the mapping with:


data_len = meta->subbuf_size * meta->nr_subbufs;
data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
  fd, meta->meta_page_size);

This maps all the ring buffer sub-buffers as well as the reader page.
The way this works is that the reader page is free to be read from
user space, and the writer will only write to the other pages.

To get the reader page:

subbuf = data + meta->subbuf_size * meta->reader.id;

Then you can load that into the libtraceevent kbuffer API:

struct tep_handle *tep = tracefs_local_events(NULL);
kbuf = tep_kbuffer(tep);
kbuffer_load_subbuffer(kbuf, subbuf);

And use the kbuf descriptor to iterate the events.

When done with the reader page, the application needs to make an
ioctl() call:

ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);


This will swap the reader page with the head of the other pages,
and the old reader page is now going to be in the writable portion
of the ring buffer where the writer can write to it, but the
page that was swapped out becomes the new reader page. If there is
no data, then user space can check the meta->reader.id to see if
it changed or not. If it did not change, then there's no new data.

If the writer is still on that page, it acts the same as we do in
the kernel. It can still update that page, and the beginning of the
sub-buffer has the index of where the writer currently is on that
page.

All the mapped pages are read-only.

When the ring buffer is mapped, it 

[RFC][PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-29 Thread Steven Rostedt
From: Vincent Donnefort 

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Link: 
https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-2-vdonnef...@google.com

Signed-off-by: Vincent Donnefort 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |   7 +
 include/uapi/linux/trace_mmap.h |  29 +++
 kernel/trace/ring_buffer.c  | 382 +++-
 3 files changed, 417 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/trace_mmap.h

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..f950648b0ba9
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_TRACE_MMAP_H_
+#define _UAPI_TRACE_MMAP_H_
+
+#include 
+
+struct trace_buffer_meta {
+   unsigned long   entries;
+   unsigned long   overrun;
+   unsigned long   read;
+
+   unsigned long   subbufs_touched;
+   unsigned long   subbufs_lost;
+   unsigned long   subbufs_read;
+
+   struct {
+   unsigned long   lost_events;/* Events lost at the time of 
the reader swap */
+   __u32   id; /* Reader subbuf ID from 0 to 
nr_subbufs - 1 */
+   __u32   read;   /* Number of bytes read on the 
reader subbuf */
+   } reader;
+
+   __u32   subbuf_size;/* Size of each subbuf 
including the header */
+   __u32   nr_subbufs; /* Number of subbufs in the 
ring-buffer */
+
+   __u32   meta_page_size; /* Size of the meta-page */
+   __u32   meta_struct_len;/* Len of this struct */
+};
+
+#endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 173d2595ce2d..2f3e0260db88 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -388,6 +389,7 @@ struct rb_irq_work {
boolwaiters_pending;
boolfull_waiters_pending;
boolwakeup_full;
+   boolis_cpu_buffer;
 };
 
 /*
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   int mapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -739,6 +747,22 @@ static __always_inline bool full_hit(struct trace_buffer 
*buffer, int cpu, int f
return (dirty * 100) &

[PATCH] ftrace: Fix modification of direct_function hash while in use

2023-12-29 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Masami Hiramatsu reported a memory leak in register_ftrace_direct() where
if the number of new entries are added is large enough to cause two
allocations in the loop:

for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
new = ftrace_add_rec_direct(entry->ip, addr, 
&free_hash);
if (!new)
goto out_remove;
entry->direct = addr;
}
}

Where ftrace_add_rec_direct() has:

if (ftrace_hash_empty(direct_functions) ||
direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
struct ftrace_hash *new_hash;
int size = ftrace_hash_empty(direct_functions) ? 0 :
direct_functions->count + 1;

if (size < 32)
size = 32;

new_hash = dup_hash(direct_functions, size);
if (!new_hash)
return NULL;

*free_hash = direct_functions;
direct_functions = new_hash;
}

The "*free_hash = direct_functions;" can happen twice, losing the previous
allocation of direct_functions.

But this also exposed a more serious bug.

The modification of direct_functions above is not safe. As
direct_functions can be referenced at any time to find what direct caller
it should call, the time between:

new_hash = dup_hash(direct_functions, size);
 and
direct_functions = new_hash;

can have a race with another CPU (or even this one if it gets interrupted),
and the entries being moved to the new hash are not referenced.

That's because the "dup_hash()" is really misnamed and is really a
"move_hash()". It moves the entries from the old hash to the new one.

Now even if that was changed, this code is not proper as direct_functions
should not be updated until the end. That is the best way to handle
function reference changes, and is the way other parts of ftrace handle
this.

The following is done:

 1. Change add_hash_entry() to return the entry it created and inserted
into the hash, and not just return success or not.

 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove
the former.

 3. Allocate a "new_hash" at the start that is made for holding both the
new hash entries as well as the existing entries in direct_functions.

 4. Copy (not move) the direct_function entries over to the new_hash.

 5. Copy the entries of the added hash to the new_hash.

 6. If everything succeeds, then use rcu_assign_pointer() to update the
direct_functions with the new_hash.

This simplifies the code and fixes both the memory leak as well as the
race condition mentioned above.

Link: 
https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/

Cc: sta...@vger.kernel.org
Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ftrace.c | 100 --
 1 file changed, 47 insertions(+), 53 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..b01ae7d36021 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash,
hash->count++;
 }
 
-static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
+static struct ftrace_func_entry *
+add_hash_entry(struct ftrace_hash *hash, unsigned long ip)
 {
struct ftrace_func_entry *entry;
 
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
if (!entry)
-   return -ENOMEM;
+   return NULL;
 
entry->ip = ip;
__add_hash_entry(hash, entry);
 
-   return 0;
+   return entry;
 }
 
 static void
@@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct 
ftrace_hash *hash)
struct ftrace_func_entry *entry;
struct ftrace_hash *new_hash;
int size;
-   int ret;
int i;
 
new_hash = alloc_ftrace_hash(size_bits);
@@ -1366,8 +1366,7 @@ alloc_and_copy_ftrace_hash(int size_bits, struct 
ftrace_hash *hash)
size = 1 << hash->size_bits;
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
-   ret = add_hash_entry(new_hash, entry->ip);
-   if (ret < 0)
+   if (add_hash_entry(new_hash, entry->ip) == NULL)
goto free_hash;
}
}
@@ -2536,7 +2535,7 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec)
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
 /* Protected by rcu_tasks for reading, and direct_mutex for writing */

Re: [PATCH] tracing: Fix possible memory leak in ftrace_regsiter_direct()

2023-12-29 Thread Steven Rostedt
On Wed, 27 Dec 2023 21:38:25 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> If ftrace_register_direct() called with a large number of target

There's no function called "ftrace_register_direct()", I guess you meant
register_ftrace_direct()?

> functions (e.g. 65), the free_hash can be updated twice or more
> in the ftrace_add_rec_direct() without freeing the previous hash
> memory. Thus this can cause a memory leak.
> 
> Fix this issue by expanding the direct_hash at once before
> adding the new entries.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> Fixes: f64dd4627ec6 ("ftrace: Add multi direct register/unregister interface")
> Cc: sta...@vger.kernel.org
> ---
>  kernel/trace/ftrace.c |   49 
> +++--
>  1 file changed, 31 insertions(+), 18 deletions(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8de8bec5f366..9269c2c3e595 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2555,28 +2555,33 @@ unsigned long ftrace_find_rec_direct(unsigned long ip)
>   return entry->direct;
>  }
>  
> -static struct ftrace_func_entry*
> -ftrace_add_rec_direct(unsigned long ip, unsigned long addr,
> -   struct ftrace_hash **free_hash)
> +static struct ftrace_hash *ftrace_expand_direct(int inc_count)
>  {
> - struct ftrace_func_entry *entry;
> + struct ftrace_hash *new_hash, *free_hash;
> + int size = ftrace_hash_empty(direct_functions) ? 0 :
> + direct_functions->count + inc_count;
>  
> - if (ftrace_hash_empty(direct_functions) ||
> - direct_functions->count > 2 * (1 << direct_functions->size_bits)) {
> - struct ftrace_hash *new_hash;
> - int size = ftrace_hash_empty(direct_functions) ? 0 :
> - direct_functions->count + 1;
> + if (!ftrace_hash_empty(direct_functions) &&
> + size <= 2 * (1 << direct_functions->size_bits))
> + return NULL;
>  
> - if (size < 32)
> - size = 32;
> + if (size < 32)
> + size = 32;

Hmm, why the limit of 32? I know this was there before this patch, but this
patch made me notice it.

In dup_hash() we have:

bits = fls(size / 2);

/* Don't allocate too much */
if (bits > FTRACE_HASH_MAX_BITS)
bits = FTRACE_HASH_MAX_BITS;

Where bits will determine the number of buckets.

If size = 32, then bits = fls(32/2) = fls(16) = 5

So the buckets will be 2^5 = 32. Thus, you will get 32 buckets even with 64
entries, which will cause a minimum of two loops to find a bucket.

Is this a concern?


>  
> - new_hash = dup_hash(direct_functions, size);
> - if (!new_hash)
> - return NULL;
> + new_hash = dup_hash(direct_functions, size);
> + if (!new_hash)
> + return ERR_PTR(-ENOMEM);
>  
> - *free_hash = direct_functions;
> - direct_functions = new_hash;
> - }
> + free_hash = direct_functions;
> + direct_functions = new_hash;
> +
> + return free_hash;
> +}
> +
> +static struct ftrace_func_entry*
> +ftrace_add_rec_direct(unsigned long ip, unsigned long addr)
> +{
> + struct ftrace_func_entry *entry;
>  
>   entry = kmalloc(sizeof(*entry), GFP_KERNEL);
>   if (!entry)
> @@ -5436,11 +5441,19 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
> unsigned long addr)
>   }
>   }
>  
> + /* ... and prepare the insertion */
> + free_hash = ftrace_expand_direct(hash->count);

Why does the direct_functions need to be expanded before new items are
entered? Can't we do it the way ftrace does it, and that is just to fill
the hash, and then expand if necessary.

Hmm, also, I think there's a bug here too. That is, dup_hash() does not do
what it says. It doesn't copy, it actually moves. So when you use dup_hash()
with the source being direct_functions, it's removing the entries from the
direct functions, and not copying them :-/

That is, during the resize, the check against direct_functions will not
find them and things will be missed!

What I think we need to do, is what ftrace does, and that is to add
everything to a temp hash first and then do an RCU assign to the
direct_functions.

> + if (IS_ERR(free_hash)) {
> + err = PTR_ERR(free_hash);
> + free_hash = NULL;
> + goto out_unlock;
> + }
> +
>   /* ... and insert them to direct_functions hash. */
>   err = -ENOMEM;
>   for (i = 0; i < size; i++) {
>   hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
> - new = ftrace_add_rec_direct(entry->ip, addr, 
> &free_hash);
> + new = ftrace_add_rec_direct(entry->ip, addr);
>   if (!new)
>   goto out_remove;
>   entry->direct = addr;

This should fix both the leak and the fact that 

Re: [PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100

2023-12-28 Thread Steven Rostedt
On Wed, 27 Dec 2023 07:57:08 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 26 Dec 2023 12:59:02 -0500
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > The tracefs file "buffer_percent" is to allow user space to set a
> > water-mark on how much of the tracing ring buffer needs to be filled in
> > order to wake up a blocked reader.
> > 
> >  0 - is to wait until any data is in the buffer
> >  1 - is to wait for 1% of the sub buffers to be filled
> >  50 - would be half of the sub buffers are filled with data
> >  100 - is not to wake the waiter until the ring buffer is completely full
> > 
> > Unfortunately the test for being full was:
> > 
> > dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> > return (dirty * 100) > (full * nr_pages);
> > 
> > Where "full" is the value for "buffer_percent".
> > 
> > There are two issues with the above when full == 100.
> > 
> > 1. dirty * 100 > 100 * nr_pages will never be true
> >That is, the above is basically saying that if the user sets
> >buffer_percent to 100, more pages need to be dirty than exist in the
> >ring buffer!
> > 
> > 2. The page that the writer is on is never considered dirty, as dirty
> >pages are only those that are full. When the writer goes to a new
> >sub-buffer, it clears the contents of that sub-buffer.
> > 
> > That is, even if the check was ">=" it would still not be equal as the
> > most pages that can be considered "dirty" is nr_pages - 1.
> > 
> > To fix this, add one to dirty and use ">=" in the compare.
> > 
> > Cc: sta...@vger.kernel.org
> > Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
> > Signed-off-by: Steven Rostedt (Google) 
> > ---
> >  kernel/trace/ring_buffer.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> > index 83eab547f1d1..32c0dd2fd1c3 100644
> > --- a/kernel/trace/ring_buffer.c
> > +++ b/kernel/trace/ring_buffer.c
> > @@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct 
> > trace_buffer *buffer, int cpu, int f
> > if (!nr_pages || !full)
> > return true;
> >  
> > -   dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> > +   /*
> > +* Add one as dirty will never equal nr_pages, as the sub-buffer
> > +* that the writer is on is not counted as dirty.
> > +* This is needed if "buffer_percent" is set to 100.
> > +*/
> > +   dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;  
> 
> Is this "+ 1" required? If we have 200 pages and 1 buffer is dirty,
> it is 0.5% dirty. Consider @full = 1%.

Yes it is required, as the comment above it states. dirty will never
equal nr_pages. Without it, buffer_percent == 100 will never wake up.

The +1 is to add the page the writer is on, which is never considered
"dirty".

> 
> @dirty = 1 + 1 = 2 and @dirty * 100 == 200. but 
> @full * @nr_pages = 1 * 200 = 200.
> Thus it hits (200 >= 200 is true) even if dirty pages are 0.5%.

Do we care?

What's the difference if it wakes up on 2 dirty pages or 1? It would be
very hard to measure the difference.

But if you say 100, which means "I want to wake up when full", it will
never wake up, because dirty will always be at most nr_pages - 1.

We could also say the +1 is the reader page too, because that's not
counted as well.

In other words, we can bike shed this to make 1% accurate (which
honestly, I have no idea what the use case for that would be) or we can
fix the bug that has 100% which just means, wake me up if the buffer is
full, and when the writer is on the last page, it is considered full.

-- Steve



[PATCH] tracing: Fix blocked reader of snapshot buffer

2023-12-28 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If an application blocks on the snapshot or snapshot_raw files, expecting
to be woken up when a snapshot occurs, it will not happen. Or it may
happen with an unexpected result.

That result is that the application will be reading the main buffer
instead of the snapshot buffer. That is because when the snapshot occurs,
the main and snapshot buffers are swapped. But the reader has a descriptor
still pointing to the buffer that it originally connected to.

This is fine for the main buffer readers, as they may be blocked waiting
for a watermark to be hit, and when a snapshot occurs, the data that the
main readers want is now on the snapshot buffer.

But for waiters of the snapshot buffer, they are waiting for an event to
occur that will trigger the snapshot and they can then consume it quickly
to save the snapshot before the next snapshot occurs. But to do this, they
need to read the new snapshot buffer, not the old one that is now
receiving new data.

Also, it does not make sense to have a watermark "buffer_percent" on the
snapshot buffer, as the snapshot buffer is static and does not receive new
data except all at once.

Cc: sta...@vger.kernel.org
Fixes: debdd57f5145f ("tracing: Make a snapshot feature available from 
userspace")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c |  3 ++-
 kernel/trace/trace.c   | 20 +---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index e476d42c84c5..07dae67424a9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -838,7 +838,8 @@ void ring_buffer_wake_waiters(struct trace_buffer *buffer, 
int cpu)
/* make sure the waiters see the new index */
smp_wmb();
 
-   rb_wake_up_waiters(&rbwork->work);
+   /* This can be called in any context */
+   irq_work_queue(&rbwork->work);
 }
 
 /**
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index cfeaf2cd204e..606787edae8c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1902,6 +1902,9 @@ update_max_tr(struct trace_array *tr, struct task_struct 
*tsk, int cpu,
__update_max_tr(tr, tsk, cpu);
 
arch_spin_unlock(&tr->max_lock);
+
+   /* Any waiters on the old snapshot buffer need to wake up */
+   ring_buffer_wake_waiters(tr->array_buffer.buffer, RING_BUFFER_ALL_CPUS);
 }
 
 /**
@@ -1953,12 +1956,23 @@ update_max_tr_single(struct trace_array *tr, struct 
task_struct *tsk, int cpu)
 
 static int wait_on_pipe(struct trace_iterator *iter, int full)
 {
+   int ret;
+
/* Iterators are static, they should be filled or empty */
if (trace_buffer_iter(iter, iter->cpu_file))
return 0;
 
-   return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file,
-   full);
+   ret = ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file, 
full);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+   /*
+* Make sure this is still the snapshot buffer, as if a snapshot were
+* to happen, this would now be the main buffer.
+*/
+   if (iter->snapshot)
iter->array_buffer = &iter->tr->max_buffer;
+#endif
+   return ret;
 }
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
@@ -8560,7 +8574,7 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
 
wait_index = READ_ONCE(iter->wait_index);
 
-   ret = wait_on_pipe(iter, iter->tr->buffer_percent);
+   ret = wait_on_pipe(iter, iter->snapshot ? 0 : 
iter->tr->buffer_percent);
if (ret)
goto out;
 
-- 
2.42.0




[PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100

2023-12-26 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.

 0 - is to wait until any data is in the buffer
 1 - is to wait for 1% of the sub buffers to be filled
 50 - would be half of the sub buffers are filled with data
 100 - is not to wake the waiter until the ring buffer is completely full

Unfortunately the test for being full was:

dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
return (dirty * 100) > (full * nr_pages);

Where "full" is the value for "buffer_percent".

There are two issues with the above when full == 100.

1. dirty * 100 > 100 * nr_pages will never be true
   That is, the above is basically saying that if the user sets
   buffer_percent to 100, more pages need to be dirty than exist in the
   ring buffer!

2. The page that the writer is on is never considered dirty, as dirty
   pages are only those that are full. When the writer goes to a new
   sub-buffer, it clears the contents of that sub-buffer.

That is, even if the check was ">=" it would still not be equal as the
most pages that can be considered "dirty" is nr_pages - 1.

To fix this, add one to dirty and use ">=" in the compare.

Cc: sta...@vger.kernel.org
Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 83eab547f1d1..32c0dd2fd1c3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer 
*buffer, int cpu, int f
if (!nr_pages || !full)
return true;
 
-   dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
+   /*
+* Add one as dirty will never equal nr_pages, as the sub-buffer
+* that the writer is on is not counted as dirty.
+* This is needed if "buffer_percent" is set to 100.
+*/
+   dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;
 
-   return (dirty * 100) > (full * nr_pages);
+   return (dirty * 100) >= (full * nr_pages);
 }
 
 /*
-- 
2.42.0




[PATCH v2] eventfs: Fix file and directory uid and gid ownership

2023-12-21 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

It was reported that when mounting the tracefs file system with a gid
other than root, the ownership did not carry down to the eventfs directory
due to the dynamic nature of it.

A fix was done to solve this, but it had two issues.

(a) if the attr passed into update_inode_attr() was NULL, it didn't do
anything. This is true for files that have not had a chown or chgrp
done to them or any of their sibling files, as the attr is allocated
for all children when any one needs it.

 # umount /sys/kernel/tracing
 # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt

 # ls -ld /mnt/events/sched
drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/

 # ls -ld /mnt/events/sched/sched_switch
drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/

But when checking the files:

 # ls -l /mnt/events/sched/sched_switch
total 0
-rw-r- 1 root root 0 Dec 21 13:12 enable
-rw-r- 1 root root 0 Dec 21 13:12 filter
-r--r- 1 root root 0 Dec 21 13:12 format
-r--r- 1 root root 0 Dec 21 13:12 hist
-r--r- 1 root root 0 Dec 21 13:12 id
-rw-r- 1 root root 0 Dec 21 13:12 trigger

(b) When the attr does not denote the UID or GID, it defaulted to using
the parent uid or gid. This is incorrect as changing the parent
uid or gid will automatically change all its children.

 # chgrp tracing /mnt/events/timer

 # ls -ld /mnt/events/timer
drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer

 # ls -l /mnt/events/timer
total 0
-rw-r- 1 root root0 Dec 21 14:35 enable
-rw-r- 1 root root0 Dec 21 14:35 filter
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start

At first it was thought that this could be easily fixed by just using the
ownership the superblock had when it was mounted as the default. But this
does not handle the case of:

 # chgrp tracing instances
 # mkdir instances/foo

If the superblock was used, then the group ownership would be whatever it
was when it was mounted, when it should instead be "tracing".

Instead, set a flag on the eventfs_inode of the top level eventfs
directory ("events") to mark which eventfs_inode belongs to it.

Since the "events" directory's dentry and inode are never freed, it does
not need to use its attr field to restore its mode and ownership. Use
this eventfs_inode's attr as the default ownership for all the files and
directories underneath it.

When the events eventfs_inode is created, it sets its ownership to its
parent uid and gid. As the events directory is created at boot up before
it gets mounted, this will always be uid=0 and gid=0. If it's created via
an instance, then it will take the ownership of the instance directory.

When the file system is mounted, it will update all the gids if one is
specified. This will have a callback to update the events eventfs_inode's
default entries.

When a file or directory is created under the events directory, it will
walk the ei->dentry parents until it finds the eventfs_inode that belongs
to the events directory to retrieve the default uid and gid values.

Link: 
https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=gtw+anwscylhpfswrq84pjho...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to 
parent uid and gid")
Reported-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20231221173812.51fe6...@gandalf.local.home

- While writing kselftests to test ownership, I found a bug. The remount
  option allows for an update of the gid. If it is specified, then all
  dentries are traversed in tracefs and eventfs and the gid is updated.

  The bug is that if an eventfs_inode does not have a dentry but had its
  gid updated, the old gid is still stored, and when the dentry is created,
  it will use the stored gid.

  Instead, check all eventfs_inodes that do not have a dentry, and traverse
  their children to make sure their gids are updated as well.

 fs/tracefs/event_inode.c | 105 +++
 fs/tracefs/inode.c   |   6 +++
 fs/tracefs/internal.h|   2 +
 3 files changed, 103 insertions(+), 10 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
ind

[PATCH] eventfs: Fix file and directory uid and gid ownership

2023-12-21 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

It was reported that when mounting the tracefs file system with a gid
other than root, the ownership did not carry down to the eventfs directory
due to the dynamic nature of it.

A fix was done to solve this, but it had two issues.

(a) if the attr passed into update_inode_attr() was NULL, it didn't do
anything. This is true for files that have not had a chown or chgrp
done to them or any of their sibling files, as the attr is allocated
for all children when any one needs it.

 # umount /sys/kernel/tracing
 # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt

 # ls -ld /mnt/events/sched
drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/

 # ls -ld /mnt/events/sched/sched_switch
drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/

But when checking the files:

 # ls -l /mnt/events/sched/sched_switch
total 0
-rw-r- 1 root root 0 Dec 21 13:12 enable
-rw-r- 1 root root 0 Dec 21 13:12 filter
-r--r- 1 root root 0 Dec 21 13:12 format
-r--r- 1 root root 0 Dec 21 13:12 hist
-r--r- 1 root root 0 Dec 21 13:12 id
-rw-r- 1 root root 0 Dec 21 13:12 trigger

(b) When the attr does not denote the UID or GID, it defaulted to using
the parent uid or gid. This is incorrect as changing the parent
uid or gid will automatically change all its children.

 # chgrp tracing /mnt/events/timer

 # ls -ld /mnt/events/timer
drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer

 # ls -l /mnt/events/timer
total 0
-rw-r- 1 root root0 Dec 21 14:35 enable
-rw-r- 1 root root0 Dec 21 14:35 filter
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start

At first it was thought that this could be easily fixed by just using the
ownership the superblock had when it was mounted as the default. But this
does not handle the case of:

 # chgrp tracing instances
 # mkdir instances/foo

If the superblock was used, then the group ownership would be whatever it
was when it was mounted, when it should instead be "tracing".

Instead, set a flag on the eventfs_inode of the top level eventfs
directory ("events") to mark which eventfs_inode belongs to it.

Since the "events" directory's dentry and inode are never freed, it does
not need to use its attr field to restore its mode and ownership. Use
this eventfs_inode's attr as the default ownership for all the files and
directories underneath it.

When the events eventfs_inode is created, it sets its ownership to its
parent uid and gid. As the events directory is created at boot up before
it gets mounted, this will always be uid=0 and gid=0. If it's created via
an instance, then it will take the ownership of the instance directory.

When the file system is mounted, it will update all the gids if one is
specified. This will have a callback to update the events eventfs_inode's
default entries.

When a file or directory is created under the events directory, it will
walk the ei->dentry parents until it finds the eventfs_inode that belongs
to the events directory to retrieve the default uid and gid values.

Link: 
https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=gtw+anwscylhpfswrq84pjho...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to 
parent uid and gid")
Reported-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 74 ++--
 fs/tracefs/inode.c   |  6 
 fs/tracefs/internal.h|  2 ++
 3 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2ccc849a5bda..2d6b11195420 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -113,7 +113,14 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * determined by the parent directory.
 */
if (dentry->d_inode->i_mode & S_IFDIR) {
-   update_attr(&ei->attr, iattr);
+   /*
+* The events directory dentry is never freed, unless it's
+* part of an instance that is deleted. Its attr is the
+* default for its child files and directories.
+ 

Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 17:35:22 +
Vincent Donnefort  wrote:

> @@ -5999,6 +6078,307 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
> *buffer, int order)
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
>  

The kernel developers have agreed to allow loop variables to be declared in
loops. This will simplify these macros:



> +#define subbuf_page(off, start) \
> + virt_to_page((void *)(start + (off << PAGE_SHIFT)))
> +
> +#define foreach_subbuf_page(off, sub_order, start, page) \
> + for (off = 0, page = subbuf_page(0, start); \
> +  off < (1 << sub_order);\
> +  off++, page = subbuf_page(off, start))

#define foreach_subbuf_page(sub_order, start, page) \
	for (int __off = 0; \
	     __off < (1 << (sub_order)) && \
		((page) = subbuf_page(__off, (start)), 1); \
	     __off++)

And macro parameters used as r-values should always be wrapped in
parentheses.

> +
> +static inline void subbuf_map_prepare(unsigned long subbuf_start, int order)
> +{
> + struct page *page;
> + int subbuf_off;

Then you can remove the "subbuf_off".

> +
> + /*
> +  * When allocating order > 0 pages, only the first struct page has a
> +  * refcount > 1. Increasing the refcount here ensures none of the struct
> +  * page composing the sub-buffer is freed when the mapping is closed.

Nice catch by the way ;-)

-- Steve

> +  */
> + foreach_subbuf_page(subbuf_off, order, subbuf_start, page)
> + page_ref_inc(page);
> +}
> +
> +static inline void subbuf_unmap(unsigned long subbuf_start, int order)
> +{
> + struct page *page;
> + int subbuf_off;
> +
> + foreach_subbuf_page(subbuf_off, order, subbuf_start, page) {
> + page_ref_dec(page);
> + page->mapping = NULL;
> + }
> +}
> +



Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 17:35:22 +
Vincent Donnefort  wrote:

> @@ -739,6 +747,22 @@ static __always_inline bool full_hit(struct trace_buffer 
> *buffer, int cpu, int f
>   return (dirty * 100) > (full * nr_pages);
>  }
>  
> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + if (unlikely(READ_ONCE(cpu_buffer->mapped))) {
> + /* Ensure the meta_page is ready */
> + smp_rmb();
> + WRITE_ONCE(cpu_buffer->meta_page->entries,
> +local_read(&cpu_buffer->entries));
> + WRITE_ONCE(cpu_buffer->meta_page->overrun,
> +local_read(&cpu_buffer->overrun));
> + WRITE_ONCE(cpu_buffer->meta_page->subbufs_touched,
> +local_read(&cpu_buffer->pages_touched));
> + WRITE_ONCE(cpu_buffer->meta_page->subbufs_lost,
> +local_read(&cpu_buffer->pages_lost));
> + }
> +}
> +
>  /*
>   * rb_wake_up_waiters - wake up tasks waiting for ring buffer input
>   *
> @@ -749,6 +773,18 @@ static void rb_wake_up_waiters(struct irq_work *work)
>  {
>   struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, 
> work);
>  
> + if (rbwork->is_cpu_buffer) {
> + struct ring_buffer_per_cpu *cpu_buffer;
> +
> + cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu,
> +   irq_work);
> + /*
> +  * If the waiter is a cpu_buffer, this might be due to a
> +  * userspace mapping. Let's update the meta-page.
> +  */
> + rb_update_meta_page(cpu_buffer);
> + }
> +
>   wake_up_all(&rbwork->waiters);
>   if (rbwork->full_waiters_pending || rbwork->wakeup_full) {
>   rbwork->wakeup_full = false;


I think this code would be cleaner if we did:

static void rb_update_meta_page(struct rb_irq_work *rbwork)
{
struct ring_buffer_per_cpu *cpu_buffer;

if (!rbwork->is_cpu_buffer)
return;

/*
 * If the waiter is a cpu_buffer, this might be due to a
 * userspace mapping. Let's update the meta-page.
 */
cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu,
  irq_work);

if (unlikely(READ_ONCE(cpu_buffer->mapped))) {

// I don't think we need the "unlikely"

/* Ensure the meta_page is ready */
smp_rmb();
		WRITE_ONCE(cpu_buffer->meta_page->entries,
			   local_read(&cpu_buffer->entries));
		WRITE_ONCE(cpu_buffer->meta_page->overrun,
			   local_read(&cpu_buffer->overrun));
		WRITE_ONCE(cpu_buffer->meta_page->subbufs_touched,
			   local_read(&cpu_buffer->pages_touched));
		WRITE_ONCE(cpu_buffer->meta_page->subbufs_lost,
			   local_read(&cpu_buffer->pages_lost));
}
}

/*
 * rb_wake_up_waiters - wake up tasks waiting for ring buffer input
 *
 * Schedules a delayed work to wake up any task that is blocked on the
 * ring buffer waiters queue.
 */
static void rb_wake_up_waiters(struct irq_work *work)
{
struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, 
work);

	rb_update_meta_page(rbwork);

	wake_up_all(&rbwork->waiters);
if (rbwork->full_waiters_pending || rbwork->wakeup_full) {
rbwork->wakeup_full = false;
rbwork->full_waiters_pending = false;
		wake_up_all(&rbwork->full_waiters);
}
}


-- Steve



Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 14:51:29 +
David Laight  wrote:

> > I think 1kb units is perfectly fine (patch 15 changes to kb units). The
> > interface says it's to define the minimal size of the sub-buffer, not the
> > actual size.  
> 
> I didn't read that far through :-(
> 

Well, this isn't a normal patch series, as I took the work from Tzvetomir
back from 2021 and started massaging it. I wanted to keep Tzvetomir's
original authorship even though he's not working on it anymore, so I just
applied his patches as-is and then added patches on top of them, to denote
what I did and what he did.

This is why I never modified the first 5 patches of this series, except for
subject lines and change logs.

-- Steve



Re: [PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 11:06:38 +0100
Alexander Graf  wrote:

> Thanks a bunch for the super quick turnaround time for the fix! I can 
> confirm that I'm no longer seeing the warning :)
> 
> Tested-by: Alexander Graf 

Thanks Alex,

> 
> 
> Do we need another similar patch for the kprobe self tests? The below is 
> with 55cb5f43689d7 plus an unrelated initrd patch plus this patch and 
> the following .config: http://csgraf.de/tmp2/config-ftrace.xz

Sure, now you tell me ;-)

I just finished all my tests and will be sending Linus a pull request soon.

I'll let Masami handle this one, as it's under his department.

-- Steve

> 
> [  919.717134] Testing all events: OK
> [  924.418194] Testing ftrace filter: OK
> [  924.418887] trace_kprobe: Testing kprobe tracing:
> [  924.424244] [ cut here ]
> [  924.424952] WARNING: CPU: 2 PID: 1 at 
> kernel/trace/trace_kprobe.c:2073 kprobe_trace_self_tests_init+0x192/0x540
> [  924.425659] Modules linked in:
> [  924.425886] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
> 6.7.0-rc6-00024-gc10698ad3c9a #298
> [  924.426448] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [  924.427230] RIP: 0010:kprobe_trace_self_tests_init+0x192/0x540
> [  924.427639] Code: 7e 10 74 3b 48 8b 36 48 39 d6 75 f2 0f 0b 48 c7 c7 
> 58 71 79 a5 e8 ee 3e 5a fe 48 c7 c7 20 38 b7 a5 e8 a2 51 68 fe 85 c0 74 
> 33 <0f> 0b 48 c7 c7 38 73 79 a5 e8 d0 3e 5a fe e8 4b 64 62 fe eb 23 48
> [  924.428922] RSP: :ab508001be58 EFLAGS: 00010286
> [  924.429288] RAX: fff0 RBX: 005a RCX: 
> 0202
> [  924.429800] RDX:  RSI: 0002e970 RDI: 
> a5b708a0
> [  924.430295] RBP:  R08: 0001 R09: 
> a4855a90
> [  924.430794] R10: 0007 R11: 028a R12: 
> 0001
> [  924.431286] R13: a5cc9a00 R14: 8cec81bebe00 R15: 
> a619f0f0
> [  924.431785] FS:  () GS:8cecf910() 
> knlGS:
> [  924.432346] CS:  0010 DS:  ES:  CR0: 80050033
> [  924.432748] CR2:  CR3: 4883e001 CR4: 
> 003706f0
> [  924.433246] DR0:  DR1:  DR2: 
> 
> [  924.433739] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  924.434233] Call Trace:
> [  924.434418]  
> [  924.434569]  ? __warn+0x7d/0x140
> [  924.434807]  ? kprobe_trace_self_tests_init+0x192/0x540
> [  924.435172]  ? report_bug+0xf8/0x1e0
> [  924.435430]  ? handle_bug+0x3f/0x70
> [  924.435677]  ? exc_invalid_op+0x13/0x60
> [  924.435954]  ? asm_exc_invalid_op+0x16/0x20
> [  924.436249]  ? rdinit_setup+0x40/0x40
> [  924.436509]  ? trace_kprobe_release+0x70/0xb0
> [  924.436822]  ? kprobe_trace_self_tests_init+0x192/0x540
> [  924.437189]  ? kprobe_trace_self_tests_init+0x421/0x540
> [  924.437551]  ? init_kprobe_trace+0x1b0/0x1b0
> [  924.437855]  do_one_initcall+0x44/0x200
> [  924.438131]  kernel_init_freeable+0x1e8/0x330
> [  924.438439]  ? rest_init+0xd0/0xd0
> [  924.438682]  kernel_init+0x16/0x130
> [  924.438943]  ret_from_fork+0x30/0x50
> [  924.439202]  ? rest_init+0xd0/0xd0
> [  924.439445]  ret_from_fork_asm+0x11/0x20
> [  924.439734]  
> [  924.439893] ---[ end trace  ]---
> [  924.440217] trace_kprobe: error on cleaning up probes.
> [  924.440575] NG: Some tests are failed. Please check them.




Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 09:17:55 +
David Laight  wrote:

> > Unfortunately, it has to be PAGE_SIZE (and for now it's a power of 2 to
> > make masking easy). It's used for splice and will also be used for memory
> > mapping with user space.  
> 
> Perhaps then the sysctl to set the size should be powers of 4k

It's not a sysctl but a file in tracefs.

> with a minimum size of PAGE_SIZE.
> Then you don't have to know the page size when setting things up.

The user shouldn't need to know either. But the size of the sub-buffer
limits the biggest size of an event, so the user only needs to make sure
the sub-buffer is bigger than their biggest event.

> 
> I'm also guessing that no Linux kernels have a PAGE_SIZE of 2k?
> IIRC some old mmu (maybe 68020 era) used 2k pages.

I think 1kb units is perfectly fine (patch 15 changes to kb units). The
interface says it's to define the minimal size of the sub-buffer, not the
actual size.

-- Steve



Re: [PATCH v5 15/15] tracing: Update subbuffer with kilobytes not page order

2023-12-20 Thread Steven Rostedt
On Thu, 21 Dec 2023 09:26:21 +0900
Masami Hiramatsu (Google)  wrote:

> > If the user specifies 3 via:
> > 
> >   echo 3 > buffer_subbuf_size_kb
> > 
> > Then the sub-buffer size will round up to 4kb (on a 4kb page size system).
> > 
> > If they specify:
> > 
> >   echo 6 > buffer_subbuf_size_kb
> > 
> > The sub-buffer size will become 8kb.  
> 
> I think this is better interface. Can we apply this earlier in the series
> to avoid rewriting the document and test code?

I kept it separate for testing purposes.

Throughout all this, it was a good way to make sure the two approaches
were compatible. I still like to keep them separate, as that's the way it
was developed. It's good to keep that history.

-- Steve



Re: [PATCH -next v4 2/2] mm: vmscan: add new event to trace shrink lru

2023-12-20 Thread Steven Rostedt
On Tue, 19 Dec 2023 17:21:23 -0800
Bixuan Cui  wrote:

> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index b99cd28c9815..02868bdc5999 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -395,7 +395,24 @@ TRACE_EVENT(mm_vmscan_write_folio,
>   show_reclaim_flags(__entry->reclaim_flags))
>  );
>  
> -TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
> +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_start,
> +
> + TP_PROTO(int nid),
> +
> + TP_ARGS(nid),
> +
> + TP_STRUCT__entry(
> + __field(int, nid)
> + ),
> +
> + TP_fast_assign(
> + __entry->nid = nid;
> + ),
> +
> + TP_printk("nid=%d", __entry->nid)
> +);
> +
> +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_end,
>  
>   TP_PROTO(int nid,
>   unsigned long nr_scanned, unsigned long nr_reclaimed,
> @@ -446,7 +463,24 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
>   show_reclaim_flags(__entry->reclaim_flags))
>  );
>  
> -TRACE_EVENT(mm_vmscan_lru_shrink_active,
> +TRACE_EVENT(mm_vmscan_lru_shrink_active_start,
> +
> + TP_PROTO(int nid),
> +
> + TP_ARGS(nid),
> +
> + TP_STRUCT__entry(
> + __field(int, nid)
> + ),
> +
> + TP_fast_assign(
> + __entry->nid = nid;
> + ),
> +
> + TP_printk("nid=%d", __entry->nid)
> +);
> +
> +TRACE_EVENT(mm_vmscan_lru_shrink_active_end,
>  

These two events are identical, please use DECLARE_EVENT_CLASS and
DEFINE_EVENT macros:

DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template,

TP_PROTO(int nid),

TP_ARGS(nid),

TP_STRUCT__entry(
__field(int, nid)
),

TP_fast_assign(
__entry->nid = nid;
),

TP_printk("nid=%d", __entry->nid)
);

DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, 
mm_vmscan_lru_shrink_inactive_start,
TP_PROTO(int nid),
TP_ARGS(nid)
);

DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, 
mm_vmscan_lru_shrink_active_start,
TP_PROTO(int nid),
TP_ARGS(nid)
);

Using an event class saves a bit of memory footprint.

-- Steve



Re: [PATCH v5 04/15] ring-buffer: Set new size of the ring buffer sub page

2023-12-20 Thread Steven Rostedt
On Thu, 21 Dec 2023 01:34:56 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 19 Dec 2023 13:54:18 -0500
> Steven Rostedt  wrote:
> 
> > From: "Tzvetomir Stoyanov (VMware)" 
> > 
> > There are two approaches when changing the size of the ring buffer
> > sub page:
> >  1. Destroying all pages and allocating new pages with the new size.
> >  2. Allocating new pages, copying the content of the old pages before
> > destroying them.
> > The first approach is easier, it is selected in the proposed
> > implementation. Changing the ring buffer sub page size is supposed to
> > not happen frequently. Usually, that size should be set only once,
> > when the buffer is not in use yet and is supposed to be empty.
> > 
> > Link: 
> > https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoya...@gmail.com
> >   
> 
> OK, this actually reallocate the sub buffers when a new order is set.
> BTW, with this change, if we set a new order, the total buffer size will be
> changed too? Or reserve the total size? I think either is OK but it should
> be described in the document. (e.g. if it is changed, user should set the
> order first and set the total size later.)
> 

Patch 11 keeps the same size of the buffer, as I would think that is what
the user would expect. And not only that, it breaks the latency
tracers if it doesn't keep the same size.

-- Steve



Re: [PATCH RERESEND 04/11] tracing: tracing_buffers_splice_read: behave as-if non-blocking I/O

2023-12-20 Thread Steven Rostedt
On Thu, 14 Dec 2023 19:45:02 +0100
Ahelenia Ziemiańska  wrote:

> Otherwise we risk sleeping with the pipe locked for indeterminate
> lengths of time.

This change log is really lacking.

Why is this an issue?

 
> Link: 
> https://lore.kernel.org/linux-fsdevel/qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw/t/#u

The change log should not rely on any external links. Those are for
reference only. At a bare minimum, the change log should have a tl;dr;
summary of the issue. The change log itself should have enough information
to understand why this change is happening without the need to look at any
links.

-- Steve



Re: [PATCH v5 06/15] ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure

2023-12-20 Thread Steven Rostedt
On Thu, 21 Dec 2023 01:23:14 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 19 Dec 2023 13:54:20 -0500
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > On failure to allocate ring buffer pages, the pointer to the CPU buffer
> > pages is freed, but the pages that were allocated previously were not.
> > Make sure they are freed too.
> > 
> > Fixes: TBD ("tracing: Set new size of the ring buffer sub page")  
> 
> Do you merge this fix to the original one in the same series later?
> I think it is better to merge it for git bisect.

It only fixes the new functionality. If a git bisect gets here for that,
then we know the issue already. But it shouldn't break bisect for things
that happened before this change.

The reason I'm not merging this with the other patch is because the
original patch is from Tzvetomir, who isn't working on this anymore. I want
to keep his patches untouched, and anything I do is on top of it.

-- Steve



[PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()

2023-12-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The synth_event_gen_test module can be built in, if someone wants to run
the tests at boot up and not have to load them.

The synth_event_gen_test_init() function creates and enables the synthetic
events and runs its tests.

The synth_event_gen_test_exit() disables the events it created and
destroys the events.

If the module is builtin, the events are never disabled. The issue is, the
events should be disabled after the tests are run. This could be an issue
if the rest of the boot up tests are enabled, as they expect the events to
be in a known state before testing. That known state happens to be
disabled.

When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
a warning will trigger:

 Running tests on trace events:
 Testing event create_synth_test:
 Enabled event during self test!
 [ cut here ]
 WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 
event_trace_self_tests+0x1c2/0x480
 Modules linked in:
 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.2-debian-1.16.2-1 04/01/2014
 RIP: 0010:event_trace_self_tests+0x1c2/0x480
 Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 
84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 
3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
 RSP: :c901fdc0 EFLAGS: 00010246
 RAX: 0029 RBX: 88810399ca80 RCX: 
 RDX:  RSI: b9f19478 RDI: 88823c734e64
 RBP: 88810399f300 R08:  R09: fbfff79eb32a
 R10: bcf59957 R11: 0001 R12: 888104068090
 R13: bc89f0a0 R14: bc8a0f08 R15: 0078
 FS:  () GS:88823c70() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 0001f6282001 CR4: 00170ef0
 Call Trace:
  
  ? __warn+0xa5/0x200
  ? event_trace_self_tests+0x1c2/0x480
  ? report_bug+0x1f6/0x220
  ? handle_bug+0x6f/0x90
  ? exc_invalid_op+0x17/0x50
  ? asm_exc_invalid_op+0x1a/0x20
  ? tracer_preempt_on+0x78/0x1c0
  ? event_trace_self_tests+0x1c2/0x480
  ? __pfx_event_trace_self_tests_init+0x10/0x10
  event_trace_self_tests_init+0x27/0xe0
  do_one_initcall+0xd6/0x3c0
  ? __pfx_do_one_initcall+0x10/0x10
  ? kasan_set_track+0x25/0x30
  ? rcu_is_watching+0x38/0x60
  kernel_init_freeable+0x324/0x450
  ? __pfx_kernel_init+0x10/0x10
  kernel_init+0x1f/0x1e0
  ? _raw_spin_unlock_irq+0x33/0x50
  ret_from_fork+0x34/0x60
  ? __pfx_kernel_init+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  

This is because the synth_event_gen_test_init() left the synthetic events
that it created enabled. By having it disable them after testing, the
other selftests will run fine.

Cc: sta...@vger.kernel.org
Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module")
Reported-by: Alexander Graf 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/synth_event_gen_test.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/trace/synth_event_gen_test.c 
b/kernel/trace/synth_event_gen_test.c
index 8dfe85499d4a..354c2117be43 100644
--- a/kernel/trace/synth_event_gen_test.c
+++ b/kernel/trace/synth_event_gen_test.c
@@ -477,6 +477,17 @@ static int __init synth_event_gen_test_init(void)
 
ret = test_trace_synth_event();
WARN_ON(ret);
+
+   /* Disable when done */
+   trace_array_set_clr_event(gen_synth_test->tr,
+ "synthetic",
+ "gen_synth_test", false);
+   trace_array_set_clr_event(empty_synth_test->tr,
+ "synthetic",
+ "empty_synth_test", false);
+   trace_array_set_clr_event(create_synth_test->tr,
+ "synthetic",
+ "create_synth_test", false);
  out:
return ret;
 }
-- 
2.42.0




Re: [PATCH] eventfs: Have event files and directories default to parent uid and gid

2023-12-20 Thread Steven Rostedt
On Wed, 20 Dec 2023 10:50:17 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Dongliang reported:
> 
>   I found that in the latest version, the nodes of tracefs have been
>   changed to dynamically created.
> 
>   This has caused me to encounter a problem where the gid I specified in
>   the mounting parameters cannot apply to all files, as in the following
>   situation:
> 
>   /data/tmp/events # mount | grep tracefs
>   tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012)
> 
>   gid 3012 = readtracefs
> 
>   /data/tmp # ls -lh
>   total 0
>   -r--r-   1 root readtracefs 0 1970-01-01 08:00 README
>   -r--r-   1 root readtracefs 0 1970-01-01 08:00 available_events
> 
>   ums9621_1h10:/data/tmp/events # ls -lh
>   total 0
>   drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer
>   drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc
> 
>   It will prevent certain applications from accessing tracefs properly, I
>   try to avoid this issue by making the following modifications.
> 
> To fix this, have the files created default to taking the ownership of
> the parent dentry unless the ownership was previously set by the user.
> 
> Link: 
> https://lore.kernel.org/linux-trace-kernel/1703063706-30539-1-git-send-email-dongliang@unisoc.com/
> 

I forgot to add:

  Cc: sta...@vger.kernel.org
  Fixes: 28e12c09f5aa0 ("eventfs: Save ownership and mode")


-- Steve

> Reported-by: Dongliang Cui 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/tracefs/event_inode.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 43e237864a42..2ccc849a5bda 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -148,7 +148,8 @@ static const struct file_operations eventfs_file_operations = {
>  	.release	= eventfs_release,
>  };
>  
> -static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, umode_t mode)
> +static void update_inode_attr(struct dentry *dentry, struct inode *inode,
> +			      struct eventfs_attr *attr, umode_t mode)
>  {
>  	if (!attr) {
>  		inode->i_mode = mode;
> @@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, um
>  
>  	if (attr->mode & EVENTFS_SAVE_UID)
>  		inode->i_uid = attr->uid;
> +	else
> +		inode->i_uid = d_inode(dentry->d_parent)->i_uid;
>  
>  	if (attr->mode & EVENTFS_SAVE_GID)
>  		inode->i_gid = attr->gid;
> +	else
> +		inode->i_gid = d_inode(dentry->d_parent)->i_gid;
>  }
>  
>  /**
> @@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, umode_t mode,
>  		return eventfs_failed_creating(dentry);
>  
>  	/* If the user updated the directory's attributes, use them */
> -	update_inode_attr(inode, attr, mode);
> +	update_inode_attr(dentry, inode, attr, mode);
>  
>  	inode->i_op = &eventfs_file_inode_operations;
>  	inode->i_fop = fop;
> @@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
>  		return eventfs_failed_creating(dentry);
>  
>  	/* If the user updated the directory's attributes, use them */
> -	update_inode_attr(inode, &ei->attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
> +	update_inode_attr(dentry, inode, &ei->attr,
> +			  S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
>  
>  	inode->i_op = &eventfs_root_dir_inode_operations;
>  	inode->i_fop = &eventfs_file_operations;




Re: [PATCH v8 0/2] Introducing trace buffer mapping by user-space

2023-12-20 Thread Steven Rostedt
On Wed, 20 Dec 2023 13:49:30 +
Vincent Donnefort  wrote:

> I meant, to only do in rb_wake_up_waiters()
> 
>   if (rbwork->is_cpu_buffer) 
>   rb_update_meta_page(cpu_buffer)
> 
> And skip the meta-page update for the !is_cpu_buffer case?

Ah yeah, that works.

-- Steve



[PATCH] eventfs: Have event files and directories default to parent uid and gid

2023-12-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Dongliang reported:

  I found that in the latest version, the nodes of tracefs have been
  changed to dynamically created.

  This has caused me to encounter a problem where the gid I specified in
  the mounting parameters cannot apply to all files, as in the following
  situation:

  /data/tmp/events # mount | grep tracefs
  tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012)

  gid 3012 = readtracefs

  /data/tmp # ls -lh
  total 0
  -r--r-   1 root readtracefs 0 1970-01-01 08:00 README
  -r--r-   1 root readtracefs 0 1970-01-01 08:00 available_events

  ums9621_1h10:/data/tmp/events # ls -lh
  total 0
  drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer
  drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc

  It will prevent certain applications from accessing tracefs properly, I
  try to avoid this issue by making the following modifications.

To fix this, have the files created default to taking the ownership of
the parent dentry unless the ownership was previously set by the user.

Link: https://lore.kernel.org/linux-trace-kernel/1703063706-30539-1-git-send-email-dongliang@unisoc.com/

Reported-by: Dongliang Cui 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 43e237864a42..2ccc849a5bda 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -148,7 +148,8 @@ static const struct file_operations eventfs_file_operations = {
 	.release	= eventfs_release,
 };
 
-static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, umode_t mode)
+static void update_inode_attr(struct dentry *dentry, struct inode *inode,
+			      struct eventfs_attr *attr, umode_t mode)
 {
 	if (!attr) {
 		inode->i_mode = mode;
@@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, um
 
 	if (attr->mode & EVENTFS_SAVE_UID)
 		inode->i_uid = attr->uid;
+	else
+		inode->i_uid = d_inode(dentry->d_parent)->i_uid;
 
 	if (attr->mode & EVENTFS_SAVE_GID)
 		inode->i_gid = attr->gid;
+	else
+		inode->i_gid = d_inode(dentry->d_parent)->i_gid;
 }
 
 /**
@@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, umode_t mode,
 		return eventfs_failed_creating(dentry);
 
 	/* If the user updated the directory's attributes, use them */
-	update_inode_attr(inode, attr, mode);
+	update_inode_attr(dentry, inode, attr, mode);
 
 	inode->i_op = &eventfs_file_inode_operations;
 	inode->i_fop = fop;
@@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 		return eventfs_failed_creating(dentry);
 
 	/* If the user updated the directory's attributes, use them */
-	update_inode_attr(inode, &ei->attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
+	update_inode_attr(dentry, inode, &ei->attr,
+			  S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
 
 	inode->i_op = &eventfs_root_dir_inode_operations;
 	inode->i_fop = &eventfs_file_operations;
-- 
2.42.0




Re: [PATCH v5 03/15] ring-buffer: Add interface for configuring trace sub buffer size

2023-12-20 Thread Steven Rostedt
On Wed, 20 Dec 2023 23:26:19 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 19 Dec 2023 13:54:17 -0500
> Steven Rostedt  wrote:
> 
> > +/**
> > + * ring_buffer_subbuf_order_set - set the size of ring buffer sub page.
> > + * @buffer: The ring_buffer to set the new page size.
> > + * @order: Order of the system pages in one sub buffer page
> > + *
> > + * By default, one ring buffer pages equals to one system page. This API can be
> > + * used to set new size of the ring buffer page. The size must be order of
> > + * system page size, that's why the input parameter @order is the order of
> > + * system pages that are allocated for one ring buffer page:
> > + *  0 - 1 system page
> > + *  1 - 2 system pages
> > + *  3 - 4 system pages
> > + *  ...  
> 
> Don't we have the max order of the pages?

Actually there is. I think it's 7?

Honestly, anything over 5 is probably too much. But hey.

> 
> > + *
> > + * Returns 0 on success or < 0 in case of an error.
> > + */
> > +int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
> > +{
> > +   int psize;
> > +
> > +   if (!buffer || order < 0)
> > +   return -EINVAL;
> > +
> > +   if (buffer->subbuf_order == order)
> > +   return 0;
> > +
> > +   psize = (1 << order) * PAGE_SIZE;
> > +   if (psize <= BUF_PAGE_HDR_SIZE)
> > +   return -EINVAL;
> > +
> > +   buffer->subbuf_order = order;
> > +   buffer->subbuf_size = psize - BUF_PAGE_HDR_SIZE;
> > +
> > +   /* Todo: reset the buffer with the new page size */
> > +  
> 
> I just wonder why there is no reallocate the sub buffers here.
> Is it OK to change the sub buffer page size and order while
> using the ring buffer?

Currently we disable the ring buffer to do the update.

-- Steve




Re: [PATCH]eventfs: Apply the gid in the mounting parameters to all files

2023-12-20 Thread Steven Rostedt
On Wed, 20 Dec 2023 17:15:06 +0800
Dongliang Cui  wrote:

> I found that in the latest version, the nodes of 
> tracefs have been changed to dynamically created.
> 
> This has caused me to encounter a problem where 
> the gid I specified in the mounting parameters 
> cannot apply to all files, as in the following
> situation:
> 
> /data/tmp/events # mount | grep tracefs
> tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012)
> 
> gid 3012 = readtracefs
> 
> /data/tmp # ls -lh
> total 0
> -r--r-   1 root readtracefs 0 1970-01-01 08:00 README
> -r--r-   1 root readtracefs 0 1970-01-01 08:00 available_events
> 
> ums9621_1h10:/data/tmp/events # ls -lh
> total 0
> drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer
> drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc
> 
> It will prevent certain applications from accessing
> tracefs properly, I try to avoid this issue by making 
> the following modifications.
> 
> Signed-off-by: Dongliang Cui 


Thanks for the report, can you try this fix instead?

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 43e237864a42..2ccc849a5bda 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -148,7 +148,8 @@ static const struct file_operations eventfs_file_operations = {
 	.release	= eventfs_release,
 };
 
-static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, umode_t mode)
+static void update_inode_attr(struct dentry *dentry, struct inode *inode,
+			      struct eventfs_attr *attr, umode_t mode)
 {
 	if (!attr) {
 		inode->i_mode = mode;
@@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, um
 
 	if (attr->mode & EVENTFS_SAVE_UID)
 		inode->i_uid = attr->uid;
+	else
+		inode->i_uid = d_inode(dentry->d_parent)->i_uid;
 
 	if (attr->mode & EVENTFS_SAVE_GID)
 		inode->i_gid = attr->gid;
+	else
+		inode->i_gid = d_inode(dentry->d_parent)->i_gid;
 }
 
 /**
@@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, umode_t mode,
 		return eventfs_failed_creating(dentry);
 
 	/* If the user updated the directory's attributes, use them */
-	update_inode_attr(inode, attr, mode);
+	update_inode_attr(dentry, inode, attr, mode);
 
 	inode->i_op = &eventfs_file_inode_operations;
 	inode->i_fop = fop;
@@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 		return eventfs_failed_creating(dentry);
 
 	/* If the user updated the directory's attributes, use them */
-	update_inode_attr(inode, &ei->attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
+	update_inode_attr(dentry, inode, &ei->attr,
+			  S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
 
 	inode->i_op = &eventfs_root_dir_inode_operations;
 	inode->i_fop = &eventfs_file_operations;



Re: [PATCH v8 0/2] Introducing trace buffer mapping by user-space

2023-12-20 Thread Steven Rostedt
On Wed, 20 Dec 2023 13:06:06 +
Vincent Donnefort  wrote:

> > @@ -771,10 +772,20 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> >  static void rb_wake_up_waiters(struct irq_work *work)
> >  {
> >  	struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, work);
> > -	struct ring_buffer_per_cpu *cpu_buffer =
> > -		container_of(rbwork, struct ring_buffer_per_cpu, irq_work);
> > +	struct ring_buffer_per_cpu *cpu_buffer;
> > +	struct trace_buffer *buffer;
> > +	int cpu;
> >  
> > -	rb_update_meta_page(cpu_buffer);
> > +	if (rbwork->is_cpu_buffer) {
> > +		cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu, irq_work);
> > +		rb_update_meta_page(cpu_buffer);
> > +	} else {
> > +		buffer = container_of(rbwork, struct trace_buffer, irq_work);
> > +		for_each_buffer_cpu(buffer, cpu) {
> > +			cpu_buffer = buffer->buffers[cpu];
> > +			rb_update_meta_page(cpu_buffer);
> > +		}
> > +	}
> 
> Arg, somehow never reproduced the problem :-\. I suppose you need to cat
> trace/trace_pipe and mmap(trace/cpuX/trace_pipe) at the same time?

It triggered as soon as I ran "trace-cmd start -e sched_switch"

In other words, it broke the non-mmap case. This function gets called for
both the buffer and cpu_buffer irq_work entries. You added the
container_of() to get access to cpu_buffer, but the rbwork can also belong
to the main buffer. The main buffer has no meta page, and this triggered
a NULL pointer dereference: "cpu_buffer->mapped" returned true (because
the bogus pointer landed on a nonzero part of the buffer structure), and
then here:
if (cpu_buffer->mapped) {
WRITE_ONCE(cpu_buffer->meta_page->reader.read, 0);

It dereferenced cpu_buffer->meta_page->reader

which is only God knows what!

> 
> Updating the meta-page is only useful if the reader we are waking up is a
> user-space one, which would only happen with the cpu_buffer version of this
> function. We could limit the update of the meta_page only to this case?

I rather not add another irq_work entry. This workaround should be good
enough.

Thanks,

-- Steve



[PATCH] ring-buffer: Remove stale comment from ring_buffer_size()

2023-12-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

It's been 11 years since the ring_buffer_size() function was updated to
use the nr_pages from the buffer->buffers[cpu] structure instead of using
the buffer->nr_pages that no longer exists.

The comment in the code is more of what a change log should contain and is
pretty much useless for development. It describes how things worked back in
2012, which has no bearing on today's code. Remove it.

Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535...@acums.aculab.com/

Reported-by: David Laight 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 173d2595ce2d..7887d61d5b56 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5122,12 +5122,6 @@ EXPORT_SYMBOL_GPL(ring_buffer_iter_advance);
  */
 unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu)
 {
-   /*
-* Earlier, this method returned
-*  buffer->subbuf_size * buffer->nr_pages
-* Since the nr_pages field is now removed, we have converted this to
-* return the per cpu buffer value.
-*/
if (!cpumask_test_cpu(cpu, buffer->cpumask))
return 0;
 
-- 
2.42.0




Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-20 Thread Steven Rostedt
On Wed, 20 Dec 2023 08:48:02 +
David Laight  wrote:

> From: Steven Rostedt
> > Sent: 19 December 2023 18:54
> > From: "Tzvetomir Stoyanov (VMware)" 
> > 
> > Currently the size of one sub buffer page is global for all buffers and
> > it is hard coded to one system page. In order to introduce configurable
> > ring buffer sub page size, the internal logic should be refactored to
> > work with sub page size per ring buffer.
> >   
> ...
> > -   nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
> > +   /* Default buffer page size - one system page */
> > +   buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
> > +
> > +   /* Max payload is buffer page size - header (8bytes) */
> > +   buffer->max_data_size = buffer->subbuf_size - (sizeof(u32) * 2);
> > +
> > +   nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);  
> 
> While not new, does this really make any sense for systems with 64k pages?
> Wouldn't it be better to have units of 4k?

Unfortunately, it has to be PAGE_SIZE (and for now it's a power of 2 to
make masking easy). It's used for splice and will also be used for memory
mapping with user space.

> 
> ...
> > @@ -5102,14 +5110,14 @@ unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu)
> >  {
> > /*
> >  * Earlier, this method returned
> > -*  BUF_PAGE_SIZE * buffer->nr_pages
> > +*  buffer->subbuf_size * buffer->nr_pages
> >  * Since the nr_pages field is now removed, we have converted this to
> >  * return the per cpu buffer value.  
> 
> Overenthusiastic global replace...

Possibly, but the comment still applies, and should probably be removed, as
it's rather old (2012). It's basically just saying that the size used to be
calculated from buffer->nr_pages and now it's calculated by
buffer->buffers[cpu]->nr_pages.

I think I'll just add a patch to remove that comment.

Thanks,

-- Steve


