[PATCH v3 0/2] eventfs: Create dentries and inodes at dir open
[ subject is still wrong, but is to match v2, see patch 2 for correct subject ]

Changes since v2: https://lore.kernel.org/all/20240116211217.968123...@goodmis.org/

Implemented Linus's suggestion to just change the iterate_shared to use the
hard coded inodes.

Steven Rostedt (Google) (2):
      eventfs: Have the inodes all for files and directories all be the same
      eventfs: Do not create dentries nor inodes in iterate_shared

 fs/tracefs/event_inode.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)
[PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same
From: "Steven Rostedt (Google)"

The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.

Instead use a single unique inode number for all files and one for all
directories.

Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180...@goodmis.org

Cc: Masami Hiramatsu
Cc: Mark Rutland
Cc: Mathieu Desnoyers
Cc: Christian Brauner
Cc: Al Viro
Cc: Ajay Kaher
Suggested-by: Linus Torvalds
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..5edf0b96758b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,6 +32,10 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Choose something "unique" ;-) */
+#define EVENTFS_FILE_INODE_INO		0x12c4e37
+#define EVENTFS_DIR_INODE_INO		0x134b2f5
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_fop = fop;
 	inode->i_private = data;
 
+	/* All files will have the same inode number */
+	inode->i_ino = EVENTFS_FILE_INODE_INO;
+
 	ti = get_tracefs(inode);
 	ti->flags |= TRACEFS_EVENT_INODE;
 	d_instantiate(dentry, inode);
@@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_op = &eventfs_root_dir_inode_operations;
 	inode->i_fop = &eventfs_file_operations;
 
+	/* All directories will have the same inode number */
+	inode->i_ino = EVENTFS_DIR_INODE_INO;
+
 	ti = get_tracefs(inode);
 	ti->flags |= TRACEFS_EVENT_INODE;
-- 
2.43.0
Re: [PATCH v2 2/2] eventfs: Create list of files and directories at dir open
On Tue, 16 Jan 2024 13:39:38 -0800
Linus Torvalds wrote:

> I don't understand why you still insist on this pointless open wrapper.

I just liked the consistency of it.

> Just do this all at iterate time. No open wrappers. No nothing. Just
> iterate over the data structures you have.
>
> IOW, instead of iterating in open to create the array, just iterate in -
> look, it's in the *name* for chrissake - iterate_shared.
>
> No array. No random allocation for said array.
>
> If you can iterate at open time, you can iterate at iterate_shared time.
> Stop creating a list that you already have.
>
> And nobody cares if you do a readdir at the same time as modifying the
> directory. This isn't a real filesystem with strict POSIX semantics.

OK, I can do that.

-- Steve
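The approach Linus describes, walking the existing entry structures directly at iterate time instead of materializing an array at open, can be sketched in userspace C. All names here are hypothetical illustrations; the kernel version walks the eventfs_inode children and uses dir_context->pos for the resume position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A stand-in for the internal entry list eventfs already maintains. */
struct entry {
	const char *name;
	struct entry *next;
};

/*
 * Emit up to `max` names starting at position *pos, skipping entries a
 * previous call already handed out (the role ctx->pos plays in the kernel),
 * then advance *pos. No array is allocated; the list itself is iterated.
 */
static int iterate_entries(struct entry *head, long *pos,
			   const char **out, int max)
{
	long idx = 0;
	int emitted = 0;

	for (struct entry *e = head; e && emitted < max; e = e->next, idx++) {
		if (idx < *pos)
			continue;	/* already emitted on an earlier call */
		out[emitted++] = e->name;
		(*pos)++;
	}
	return emitted;
}
```

The position counter is what makes repeated getdents()-style calls on the same open file pick up where the last one stopped, without any per-open allocation.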
[PATCH v2 2/2] eventfs: Create list of files and directories at dir open
From: "Steven Rostedt (Google)"

The original eventfs code added a wrapper around the dcache_readdir open
callback and created all the dentries and inodes at open, and incremented
their ref counts. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1], for when a kprobe was created
during a dir read, it would create a dentry between the open and the
release. Because the release would decrement the ref counts of all files
and directories, that would include the kprobe directory that was not there
to have its ref count incremented in open. This would cause the ref count
to go negative and later crash the kernel.

To solve this, the dentries and inodes that were created and had their ref
counts upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open. Unfortunately, the
dcache_readdir logic was already using the file->private_data, which is the
only field that can be used to pass information from the open to the
release.

What was done was that eventfs created another descriptor that had a void
pointer to save the dcache_readdir pointer, and it wrapped all the
callbacks so that it could save the list of entries that had their ref
counts incremented in the open, and pass it to the release. The wrapped
callbacks would just put back the dcache_readdir pointer and call the
functions it used, so dcache_readdir could still use its data[2].

But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so there is no
public link). We finally agreed on doing everything within the
iterate_shared callback and leaving dcache_readdir out of it[3]. All the
information needed for the getdents() could be created then. But this ended
up being buggy too[4].
The iterate_shared callback was not the right place to create the dentries
and inodes. Even Christian Brauner had issues with that[5].

An attempt was made to go back to creating the inodes and dentries at the
open, create an array to store the information in the file->private_data,
and pass that information to the other callbacks[6]. The difference between
that and the original method is that it does not use dcache_readdir. It
also does not up the ref counts of the dentries and pass them. Instead, it
creates an array of a structure that saves each dentry's name and inode
number. That information is used in the iterate_shared callback, and the
array is freed in the dir release. The dentries and inodes created in the
open are not used for the iterate_shared or release callbacks, just their
names and inode numbers.

Linus did not like the creation of the dentries for the sole purpose of
getting their inode numbers, and instead suggested to just hard code them,
even though they would not be unique. With the hard coding of the inode
numbers, there's no need to create the dentries for readdir(). Now the list
created at the dir open can just get the names from the meta data and use
the hard coded inode numbers.

This means that the state of the eventfs at the dir open remains the same
from the point of view of the getdents() function, until the dir is closed.
This also means that the getdents() will not fail. If there's an issue, it
fails at the dir open.
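The name-plus-inode-number snapshot described above can be sketched in userspace C. The struct name `eventfs_dent` and the inode constant come from this series; the helper name, fixed-size name field, and file-only scope are illustrative assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One entry of the per-open snapshot: just a name and an inode number. */
struct eventfs_dent {
	char name[64];		/* entry name copied at dir open time */
	unsigned long ino;	/* hard coded inode number */
};

/* Constant taken from patch 1 of the series. */
#define EVENTFS_FILE_INODE_INO 0x12c4e37

/*
 * Build the snapshot array at open from the meta data names. Because only
 * names and hard coded inode numbers are stored, later getdents() calls see
 * a stable view even if entries are created or removed in between.
 */
static struct eventfs_dent *snapshot_files(const char *const *names, int cnt)
{
	struct eventfs_dent *dents = calloc(cnt, sizeof(*dents));

	if (!dents)
		return NULL;
	for (int i = 0; i < cnt; i++) {
		strncpy(dents[i].name, names[i], sizeof(dents[i].name) - 1);
		dents[i].ino = EVENTFS_FILE_INODE_INO; /* all files share one */
	}
	return dents;
}
```

The array would live in file->private_data from open until the dir release frees it, which is what makes getdents() itself unable to fail.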
[1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/
[2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/
[3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/
[6] https://lore.kernel.org/all/20240116114711.7e863...@gandalf.local.home/

Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 213 +++
 1 file changed, 149 insertions(+), 64 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5edf0b96758b..05cd825d4cc9 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -57,7 +57,9 @@ enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags);
+static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
+static int eventfs_dir_release(struct inode *inode, struct file *file);
 
 static void update_attr(struct eventfs_attr *attr, s
[PATCH v2 0/2] eventfs: Create dentries and inodes at dir open
[ subject is wrong, but is to match v1, see patch 2 for correct subject ]

Fix reading dir again, this time without creating dentries and inodes.

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240116114711.7e863...@gandalf.local.home

Steven Rostedt (Google) (2):
      eventfs: Have the inodes all for files and directories all be the same
      eventfs: Create list of files and directories at dir open

 fs/tracefs/event_inode.c | 223 +--
 1 file changed, 159 insertions(+), 64 deletions(-)
[PATCH v2 1/2] eventfs: Have the inodes all for files and directories all be the same
From: "Steven Rostedt (Google)"

The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.

Instead use a single unique inode number for all files and one for all
directories.

Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/

Suggested-by: Linus Torvalds
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..5edf0b96758b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,6 +32,10 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Choose something "unique" ;-) */
+#define EVENTFS_FILE_INODE_INO		0x12c4e37
+#define EVENTFS_DIR_INODE_INO		0x134b2f5
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_fop = fop;
 	inode->i_private = data;
 
+	/* All files will have the same inode number */
+	inode->i_ino = EVENTFS_FILE_INODE_INO;
+
 	ti = get_tracefs(inode);
 	ti->flags |= TRACEFS_EVENT_INODE;
 	d_instantiate(dentry, inode);
@@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_op = &eventfs_root_dir_inode_operations;
 	inode->i_fop = &eventfs_file_operations;
 
+	/* All directories will have the same inode number */
+	inode->i_ino = EVENTFS_DIR_INODE_INO;
+
 	ti = get_tracefs(inode);
 	ti->flags |= TRACEFS_EVENT_INODE;
-- 
2.43.0
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 10:53:36 -0800
Linus Torvalds wrote:

> Let's at least *try* to move towards a better and simpler world, in other
> words.

OK, but I did just finish the below. I'll save it for another time if the
single inode becomes an issue.

I cropped it to just 31 bits, so I'm not sure how much that would still
expose kernel addresses. But as you prefer a single inode number, I'll
submit that instead, and store this as one of my many git branches that I
hoard.

This patch would give me:

 ~# ls -i /sys/kernel/tracing/events
 2024739183 9p              801480813 iocost        281269778 qdisc
  401505785 alarmtimer      641794539 iomap        1206040657 ras
 1930290274 avc            1239754069 iommu         275816503 raw_syscalls
 1108006611 block           138722947 io_uring     1758073266 rcu
 1528091656 bpf_test_run    694421747 ipi          1384703124 regmap
  557364529 bpf_trace         2765888 irq            30507018 regulator
 1351362737 cfg80211        369905250 irq_matrix   2078862821 resctrl
  886445085 cgroup         1115008967 irq_vectors   324090967 rpm
  796209754 clk            1448778784 jbd2         1141318882 rseq
  478739129 compaction       99410253 kmem         1274783780 rtc
 1714397712 cpuhp          2001779594 ksm          1409072807 sched
  720701943 csd              51728677 kyber         347239934 scsi
 1824588347 dev            1507974437 libata       1768671172 sd
 1640041299 devfreq        1421146927 lock         1167562598 signal
  121399632 dma_fence       956825721 mac80211        4116590 skb
  975908383 drm             738868061 maple_tree   1435501164 smbus
 1227060804 e1000e_trace    969175760 mce          1664441095 sock
 1770307058 enable         1225375766 mdio         1634697993 spi
 1372107864 error_report    744198394 migrate       556841757 swiotlb
 1356665351 exceptions      602669807 mmap          400337480 syscalls
 [..]
 ~# ls -i /sys/kernel/tracing/events/sched
 1906992193 enable                             1210369853 sched_process_wait
  592505447 filter                             1443020725 sched_stat_blocked
 1742488280 sched_kthread_stop                 1213185672 sched_stat_iowait
 1674576234 sched_kthread_stop_ret             1908510325 sched_stat_runtime
 1999376743 sched_kthread_work_execute_end     1570203600 sched_stat_sleep
 1041533096 sched_kthread_work_execute_start    608113040 sched_stat_wait
  166824308 sched_kthread_work_queue_work      2033295833 sched_stick_numa
  492462616 sched_migrate_task                 1617500395 sched_swap_numa
 1786868196 sched_move_numa                    1865216679 sched_switch
  691687388 sched_pi_setprio                   1465729401 sched_wait_task
 1610977146 sched_process_exec                  969941227 sched_wake_idle_without_ipi
  610128037 sched_process_exit                   84398030 sched_wakeup
 2139284699 sched_process_fork                  750220489 sched_wakeup_new
  169172804 sched_process_free                 1024064201 sched_waking

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a0598eb3e26e..57448bf4acb4 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include <linux/siphash.h>
 #include
 #include
 #include
@@ -32,6 +33,9 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Used for making inode numbers */
+static siphash_key_t inode_key;
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -57,6 +61,28 @@
 static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
 static int eventfs_dir_release(struct inode *inode, struct file *file);
 
+/* Copied from scripts/kconfig/symbol.c */
+static unsigned strhash(const char *s)
+{
+	/* fnv32 hash */
+	unsigned hash = 2166136261U;
+	for (; *s; s++)
+		hash = (hash ^ *s) * 0x01000193;
+	return hash;
+}
+
+/* Just try to make something consistent and unique */
+static int eventfs_inode_ino(struct eventfs_inode *ei, const char *name)
+{
+	unsigned long sip = (unsigned long)ei;
+
+	sip += strhash(name);
+	sip = siphash_1u32((int)sip, &inode_key);
+
+	/* keep it positive */
+	return sip & ((1U << 31) - 1);
+}
+
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
 	unsigned int ia_valid = iattr->ia_valid;
@@ -491,6 +517,8 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 	mutex_unlock(&eventfs_mutex);
 
 	dentry = create_file(name, mode, attr, parent, data, fops);
+	if (!IS_ERR_OR_NULL(dentry))
+		dentry->d_inode->i_ino = eventfs_inode_ino(ei, name);
 
 	mutex_lock(&eventfs_mutex);
@@ -585,6 +613,8 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei,
 	mutex_unlock(&eventfs_mutex);
 
 	dentry = create_dir(ei, parent);
+	if (!IS_ERR_OR_NULL(dentry))
+
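The hashing scheme in the patch above can be exercised standalone in userspace. The fnv32 `strhash` and the 31-bit mask are taken from the patch; the kernel's `siphash_1u32` step is deliberately omitted here (an assumption-simplification for illustration), so `make_ino` is not the kernel function, just a demonstration that the result is deterministic and always fits in a positive 31-bit inode number.

```c
#include <assert.h>

/* fnv32 hash, as in the patch (itself copied from scripts/kconfig/symbol.c) */
static unsigned strhash(const char *s)
{
	unsigned hash = 2166136261U;	/* FNV offset basis */

	for (; *s; s++)
		hash = (hash ^ *s) * 0x01000193;	/* FNV prime */
	return hash;
}

/*
 * Illustrative stand-in for eventfs_inode_ino(): seed (the ei address in the
 * kernel) plus the name hash, masked to 31 bits so the result stays positive.
 * The kernel version additionally runs the sum through siphash_1u32().
 */
static unsigned long make_ino(unsigned long seed, const char *name)
{
	unsigned long sip = seed + strhash(name);

	return sip & ((1U << 31) - 1);	/* keep it positive */
}
```

The mask is what produces inode numbers like those in the `ls -i` listings earlier in the thread: all below 2^31, so they never read as negative.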
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 10:21:49 -0800
Linus Torvalds wrote:

> Here's a clue: just fix your inode numbers.
>
> I can think of many ways to do it. Here's a couple:
>
>  - use a fixed inode number for all inodes. It's fine. Really. You might
> confuse some programs that still do getpwd() the legacy way, but hey,
> nobody cares
>
>  - just put the inode number in the same data structure everything else is
>
>  - make the inode number be a hash of the address of your data structure.
> That's actually the closest to a traditional "real" inode number, which is
> just an index to some on-disk thing
>
> I'm sure there are other *trivial* solutions.
>
> None of this is an excuse to misuse dentries.
>
> Try the first one - one single inode number - first. You shouldn't be doing
> iget() anyway, so why do you care so deeply about a number that makes no
> sense and nobody should care about?

It was me being paranoid that using the same inode number would break user
space. If that is not a concern, then I'm happy to just make it either the
same, or maybe just hash the ei and the name that it is associated with.

If I do not fully understand how something is used, I try hard to make it
act the same as it does for other use cases. That is, I did all this to
keep inodes unique and consistent because I did not know if it would break
something if I didn't. Removing that requirement does make it much easier
to implement readdir.

I think I'll do the hashing, just because I'm still paranoid that
something might still break if they are all the same.

-- Steve
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 13:12:28 -0500
Steven Rostedt wrote:

> Maybe I can just use a hash to generate the inode numbers from the name?
> Hopefully there will be no collisions. Then I don't need the dentry
> creation at all.

Maybe I could use a hash of the address of the meta data to create the
inode number? That may be unique enough.

-- Steve
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 09:55:15 -0800
Linus Torvalds wrote:

> [ html crud because I still don't have power or real Internet, just trying
> to keep an eye on things on my phone. Mailing lists removed to avoid
> bounces, please put them back in replies that don't have my horrible
> formatting ]
>
> No.
>
> Christ, you're making things worse again
>
> The reason for the bug is that you're still messing with the dentries at
> readdir() time.

I may have deleted the comment, but the only reason I created the
inodes/dentries is to keep the inode number consistent, as it is created
at the same time the inodes and dentries are.

Ah, I did delete the comment, but it is still applicable :-/

	/*
-	 * Need to create the dentries and inodes to have a consistent
-	 * inode number.
+	 * Need to make a struct eventfs_dent array, start by
+	 * allocating enough for just the files, which is a fixed
+	 * array. Then use realloc via add_entry() for the directories
+	 * which is stored in a linked list.
	 */

So if for some reason user space did a getdents() and used the inode
numbers to match what it found, they would likely be the same.

> Just stop it. Readdir should not be creating dentries. Readdir should not
> be even *looking* at dentries. You're not a virtual filesystem, the
> dentries are just caches for filename lookup, and have *nothing* to do
> with readdir.

Actually, it's not looking at them. I did it as a way to just have the
inode numbers be consistent.

	dents[cnt].ino = d->d_inode->i_ino;

Yes, that's the only reason I create them. The dentry/inode is not used
for anything outside of that.

> So just iterate over your own internal data structures in readdir. DO NOT
> CREATE DENTRIES OR INODES FOR READDIR.
>
> I've told you before, and I'll tell you again: either you are a real and
> proper virtual filesystem and you let the vfs layer manage *everything*,
> and the dentries and inodes are all you have. Or you are a *real*
> filesystem and you maintain your own data structures and the dentries and
> inodes are just the in-memory caches.
>
> This "do both" is UNACCEPTABLE.

The dentries were created for the inode numbers so that I did not need to
add them to the meta data. They are generated at creation time.

I don't know how important inode numbers are. If they can all just have
random inode numbers and it doesn't break user space, where an inode
number will be one value at one read and another shortly after, is that
going to cause a problem?

Maybe I can just use a hash to generate the inode numbers from the name?
Hopefully there will be no collisions. Then I don't need the dentry
creation at all.

I do realize that if the dentries get freed due to reclaim and recreated,
their inode numbers will be different. But for a virtual file system, I
don't know how important having consistent inode numbers is.

-- Steve
[PATCH] eventfs: Create dentries and inodes at dir open
From: "Steven Rostedt (Google)"

The original eventfs code added a wrapper around the dcache_readdir open
callback and created all the dentries and inodes at open, and incremented
their ref counts. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1], for when a kprobe was created
during a dir read, it would create a dentry between the open and the
release. Because the release would decrement the ref counts of all files
and directories, that would include the kprobe directory that was not there
to have its ref count incremented in open. This would cause the ref count
to go negative and later crash the kernel.

To solve this, the dentries and inodes that were created and had their ref
counts upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open. Unfortunately, the
dcache_readdir logic was already using the file->private_data, which is the
only field that can be used to pass information from the open to the
release.

What was done was that eventfs created another descriptor that had a void
pointer to save the dcache_readdir pointer, and it wrapped all the
callbacks so that it could save the list of entries that had their ref
counts incremented in the open, and pass it to the release. The wrapped
callbacks would just put back the dcache_readdir pointer and call the
functions it used, so dcache_readdir could still use its data[2].

But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so there is no
public link). We finally agreed on doing everything within the
iterate_shared callback and leaving dcache_readdir out of it[3]. All the
information needed for the getdents() could be created then. But this ended
up being buggy too[4].
The iterate_shared callback was not the right place to create the dentries
and inodes. Even Christian Brauner had issues with that[5].

The real fix should be to go back to creating the inodes and dentries at
the open, create an array to store the information in the
file->private_data, and pass that information to the other callbacks. The
difference between this and the original method is that it does not use
dcache_readdir. It also does not up the ref counts of the dentries and pass
them. Instead, it creates an array of a structure that saves each dentry's
name and inode number. That information is used in the iterate_shared
callback, and the array is freed in the dir release. The dentries and
inodes created in the open are not used for the iterate_shared or release
callbacks, just their names and inode numbers.

This means that the state of the eventfs at the dir open remains the same
from the point of view of the getdents() function, until the dir is closed.
This also means that the getdents() will not fail. If there's an issue, it
fails at the dir open.
[1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/
[2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/
[3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/

Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 209 ---
 1 file changed, 153 insertions(+), 56 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..50b37f31d835 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -53,7 +53,9 @@ enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 					  struct dentry *dentry,
 					  unsigned int flags);
+static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
+static int eventfs_dir_release(struct inode *inode, struct file *file);
 
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
@@ -212,7 +214,9 @@ static const struct inode_operations eventfs_file_inode_operations = {
 static const struct file_operations eventfs_file_operations = {
 	.read		= generic_read_dir,
+	.open		= eventfs_dir_open,
 	.iterate_shared	= eventfs_iterate,
+	.release	= eventfs_dir_release,
 	.llseek		= generic_file_llseek,
 };
 
@@ -706,34 +710,71 @@ static struct dentry *eventfs_root_lo
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Mon, 15 Jan 2024 17:29:09 +0000
Vincent Donnefort wrote:

> > What needs to be done, and feel free to add this as a separate patch,
> > is to have checks where snapshot is used.
> >
> > (All errors return -EBUSY)
> >
> > Before allowing mapping, check to see if:
> >
> > 1) the current tracer has "use_max_tr" set.
> > 2) any event has a "snapshot" trigger set
> > 3) Any function has a "snapshot" command set
>
> Could we sum-up this with a single check to allocate_snapshot? If that is
> allocated it's probably because we'll be using it?

Not always. It can be allocated at any time and never used. I'd like to
keep the allocation of the snapshot buffer agnostic to whether the buffer
is mapped or not, especially since it can be allocated at boot up and never
used. Several of my boxes have "alloc_snapshot" on the command line even
though I don't always use it.

But we could update the output of reading the "snapshot" file to:

 ~# cat /sys/kernel/tracing/snapshot
 # tracer: nop
 #
 #
 # ** Snapshot disabled due to the buffer being memory mapped **
 #
 # * Snapshot is freed *
 #
 # Snapshot commands:
 # echo 0 > snapshot : Clears and frees snapshot buffer
 # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
 #                      Takes a snapshot of the main buffer.
 # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
 #                      (Doesn't have to be '2' works with any number that
 #                       is not a '0' or '1')

If it is allocated:

 ~# cat /sys/kernel/tracing/snapshot
 # tracer: nop
 #
 # ** Snapshot disabled due to the buffer being memory mapped **
 #
 # * Snapshot is allocated *
 #
 # Snapshot commands:
 # echo 0 > snapshot : Clears and frees snapshot buffer
 # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
 #                      Takes a snapshot of the main buffer.
 # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
 #                      (Doesn't have to be '2' works with any number that
 #                       is not a '0' or '1')

> That would simply add the requirement to echo 0 > snapshot before
> starting the memory map?

I'd rather not depend on that. It just makes it more complex for why
mapping failed. If you get -EBUSY, it will be hard to know why.

> The opposite could be to let tracing_alloc_snapshot_instance() fail
> whenever a mapping is in place?

Yes, that would fail if mapping is in place. Because the snapshot being
allocated can be something people want, as it allows the snapshot to happen
immediately when needed, I don't want the fact that it is allocated to
prevent mapping. As mapping may just happen for a small period of time
while an application is running.

-- Steve
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Mon, 15 Jan 2024 11:09:38 -0500
Steven Rostedt wrote:

> No. The ring buffer logic should not care if the user of it is swapping
> the entire ring buffer or not. It only cares if parts of the ring
> buffer are being swapped or not. That's not the level of scope it should
> care about. If we do not want a swap to happen in update_max_tr(),
> that's not ring_buffer.c's problem. The code to prevent that from
> happening should be 100% in trace.c.

What needs to be done, and feel free to add this as a separate patch, is
to have checks where snapshot is used.

(All errors return -EBUSY)

Before allowing mapping, check to see if:

 1) the current tracer has "use_max_tr" set.
 2) any event has a "snapshot" trigger set
 3) any function has a "snapshot" command set

Fail if any of the above is true.

Also in reverse, if the buffer is mapped, then fail:

 1) a tracer being set that has "use_max_tr" set.
 2) a "snapshot" command being set on a function
 3) a "snapshot" trigger being set on an event.

For the last two, we may be able to get away with just the below as well,
adding the tr->flags bit. We could also add a tr->snapshot count to keep
track of everything that is using a snapshot, and if that count is
non-zero, mapping fails.

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..f534f74ae80f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr,
 		return;
 	}
 
+	if (tr->flags & TRACE_ARRAY_FL_MAPPED) {
+		trace_array_puts(tr, "*** BUFFER IS MEMORY MAPPED ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
 	local_irq_save(flags);
 	update_max_tr(tr, current, smp_processor_id(), cond_data);
 	local_irq_restore(flags);

-- Steve
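The mutual exclusion proposed above, a snapshot-user count that blocks mapping and a mapped flag that blocks snapshots, can be sketched in userspace C. The struct and function names here are hypothetical stand-ins, not the kernel's trace.c code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-instance state mirroring the tr->snapshot count idea. */
struct trace_instance {
	int snapshot_users;	/* tracers/triggers/commands that may snapshot */
	int mapped;		/* non-zero while user space has the buffer mmapped */
};

/* Mapping fails while anything that may take a snapshot is registered. */
static int instance_map(struct trace_instance *tr)
{
	if (tr->snapshot_users)
		return -EBUSY;
	tr->mapped++;
	return 0;
}

/* Taking a snapshot fails while the buffer is memory mapped. */
static int instance_snapshot(struct trace_instance *tr)
{
	if (tr->mapped)
		return -EBUSY;
	return 0;
}
```

Keeping both checks on the same structure is what keeps the accounting entirely in trace.c, with ring_buffer.c never needing to know about whole-buffer swaps.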
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Mon, 15 Jan 2024 15:37:31 +0000
Vincent Donnefort wrote:

> > > @@ -5418,6 +5446,11 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
> > >  	cpu_buffer_a = buffer_a->buffers[cpu];
> > >  	cpu_buffer_b = buffer_b->buffers[cpu];
> > >
> > > +	if (READ_ONCE(cpu_buffer_a->mapped) || READ_ONCE(cpu_buffer_b->mapped)) {
> > > +		ret = -EBUSY;
> > > +		goto out;
> > > +	}
> >
> > Ah, this is not enough to stop snapshot.
> > update_max_tr()@kernel/trace/trace.c
> > is used for swapping all CPU buffers and it does not use this function.
> > (but that should use this instead of accessing buffer directly...)
> >
> > Maybe we need ring_buffer_swap() and it checks all cpu_buffers do not
> > set mapping bits.
> >
> > Thank you,
>
> How about instead of having ring_buffer_swap_cpu() returning -EBUSY when
> mapped we have two functions to block any mapping that trace.c could call?

No. The ring buffer logic should not care if the user of it is swapping the
entire ring buffer or not. It only cares if parts of the ring buffer are
being swapped or not. That's not the level of scope it should care about.
If we do not want a swap to happen in update_max_tr(), that's not
ring_buffer.c's problem. The code to prevent that from happening should be
100% in trace.c.

The trace.c code knows what's being mapped or not. The accounting to keep a
mapped buffer from being swapped should stay in trace.c, and not add
unnecessary implementation coupling between that and ring_buffer.c.

-- Steve
Re: [PATCH v11 4/5] Documentation: tracing: Add ring-buffer mapping
On Sun, 14 Jan 2024 23:26:43 +0900
Masami Hiramatsu (Google) wrote:

> Hi Vincent,
>
> On Thu, 11 Jan 2024 16:17:11 +0000
> Vincent Donnefort wrote:
>
> > It is now possible to mmap() a ring-buffer to stream its content. Add
> > some documentation and a code example.
> >
> > Signed-off-by: Vincent Donnefort
> >
> > diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
> > index 5092d6c13af5..0b300901fd75 100644
> > --- a/Documentation/trace/index.rst
> > +++ b/Documentation/trace/index.rst
> > @@ -29,6 +29,7 @@ Linux Tracing Technologies
> >    timerlat-tracer
> >    intel_th
> >    ring-buffer-design
> > +   ring-buffer-map
> >    stm
> >    sys-t
> >    coresight/index
> > diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
> > new file mode 100644
> > index ..2ba7b5339178
> > --- /dev/null
> > +++ b/Documentation/trace/ring-buffer-map.rst
> > @@ -0,0 +1,105 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==================================
> > +Tracefs ring-buffer memory mapping
> > +==================================
> > +
> > +:Author: Vincent Donnefort
> > +
> > +Overview
> > +========
> > +Tracefs ring-buffer memory map provides an efficient method to stream data
> > +as no memory copy is necessary. The application mapping the ring-buffer then
> > +becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
> > +
> > +Memory mapping setup
> > +====================
> > +The mapping works with a mmap() of the trace_pipe_raw interface.
> > +
> > +The first system page of the mapping contains ring-buffer statistics and
> > +description. It is referred to as the meta-page. One of the most important
> > +fields of the meta-page is the reader. It contains the subbuf ID which can
> > +be safely read by the mapper (see ring-buffer-design.rst).
> > +
> > +The meta-page is followed by all the subbufs, ordered by ascending ID. It is
> > +therefore effortless to know where the reader starts in the mapping:
> > +
> > +.. code-block:: c
> > +
> > +        reader_id = meta->reader->id;
> > +        reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
> > +
> > +When the application is done with the current reader, it can get a new one
> > +using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl
> > +also updates the meta-page fields.
> > +
> > +Limitations
> > +===========
> > +When a mapping is in place on a Tracefs ring-buffer, it is not possible to
> > +resize it (either by increasing the entire size of the ring-buffer or
> > +each subbuf). It is also not possible to use snapshot or splice.

> I've played with the sample code.
>
> - "free_buffer" just doesn't work while the process mmaps the ring buffer.
> - After mmapping the buffers, when a snapshot is taken, the IOCTL returns
>   an error.
>
> OK, but I'd rather like to fail snapshot with -EBUSY when the buffer is
> mmapped.

> > +
> > +Concurrent readers (either another application mapping that ring-buffer or
> > +the kernel with trace_pipe) are allowed but not recommended. They will
> > +compete for the ring-buffer and the output is unpredictable.
> > +
> > +Example
> > +=======
> > +.. code-block:: c
> > +
> > +    #include <fcntl.h>
> > +    #include <stdio.h>
> > +    #include <stdlib.h>
> > +    #include <unistd.h>
> > +
> > +    #include <linux/trace_mmap.h>
> > +
> > +    #include <sys/mman.h>
> > +    #include <sys/ioctl.h>
> > +
> > +    #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
> > +
> > +    int main(void)
> > +    {
> > +            int page_size = getpagesize(), fd, reader_id;
> > +            unsigned long meta_len, data_len;
> > +            struct trace_buffer_meta *meta;
> > +            void *map, *reader, *data;
> > +
> > +            fd = open(TRACE_PIPE_RAW, O_RDONLY);
> > +            if (fd < 0)
> > +                    exit(EXIT_FAILURE);
> > +
> > +            map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
> > +            if (map == MAP_FAILED)
> > +                    exit(EXIT_FAILURE);
> > +
> > +            meta = (struct trace_buffer_meta *)map;
> > +            meta_len = meta->meta_page_size;
> > +
> > +            printf("entries:        %lu\n", meta->entries);
> > +            printf("overrun:        %lu\n", meta->overrun);
> > +            printf("read:           %lu\n", meta->read);
> > +            printf("subbufs_touched:%lu\n", meta->subbufs_touched);
> > +            printf("subbufs_lost:   %lu\n", meta->subbufs_lost);
> > +            printf("subbufs_read:   %lu\n", meta->subbufs_read);
> > +            printf("nr_subbufs:     %u\n", meta->nr_subbufs);
> > +
> > +            data_len = meta->subbuf_size * meta->nr_subbufs;
> > +            data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd,
> > +                        data_len);

The above is buggy. It
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Fri, 12 Jan 2024 10:06:41 -0500 Steven Rostedt wrote: > I'm thinking both may be good, as the number of dropped events are not > added if there's no room to put it at the end of the ring buffer. When > there's no room, it just sets a flag that there were missed events but > doesn't give how many events were missed. > > If anything, it should be in both locations. In the sub-buffer header, to > be consistent with the read/splice version, and in the meta page where, if > there's no room to add the count, it can be accessed in the meta page. I think the meta data page should look like this: struct trace_buffer_meta { __u32 meta_page_size; __u32 meta_struct_len; __u32 subbuf_size; __u32 nr_subbufs; struct { __u64 lost_events; __u32 id; __u32 read; } reader; __u64 entries; __u64 overrun; __u64 read; }; 1) meta_page_size The size of this meta page. It's the first thing the application needs in order to know how much to mmap. 2) meta_struct_len The actual length of the structure. It's the second thing the application will look at to know what version it is dealing with. 3) subbuf_size 4) nr_subbufs The next two are the next thing the application should do. That is to memory map in all the sub-buffers. With 1) and this, it knows how to do that. The first four are needed for setup, and never change once mapped. The rest gets updated during the trace. 5) reader 5a) lost_events 5b) id 5c) read This is actively needed during tracing. It is what is used to know where and how to read the reader sub-buffer. 6) entries 7) overrun 8) read These are useful statistics, but probably seldom really looked at. They should be at the end. Also notice that I converted all "unsigned long" to __u64. This is because it is possible to have a 32 bit userspace running this on top of a 64 bit kernel. If we were to use "long" it would be inconsistent and buggy. Now if you need subbufs_touched and subbufs_lost.
When that gets opened, it would update the meta_struct_len accordingly, and the structure would look like: struct trace_buffer_meta { __u32 meta_page_size; __u32 meta_struct_len; __u32 subbuf_size; __u32 nr_subbufs; struct { __u64 lost_events; __u32 id; __u32 read; } reader; __u64 entries; __u64 overrun; __u64 read; __u64 subbufs_touched; __u64 subbufs_lost; }; Make sense? -- Steve
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Fri, 12 Jan 2024 09:13:02 + Vincent Donnefort wrote: > > > + > > > + unsigned long subbufs_touched; > > > + unsigned long subbufs_lost; > > > + unsigned long subbufs_read; > > > > Now I'm thinking we may not want this exported, as I'm not sure it's > > useful. > > touched and lost are not useful now, but it'll be for my support of the > hypervisor tracing, that's why I added them already. > > subbufs_read could probably go away though as even in that case I can track > that > in the reader. Yeah, but I think we can leave it off for now. This is something we could extend the structure with. In fact, it can be different for different buffers! That is, since they are pretty meaningless for the normal ring buffer, I don't think we need to export it. But if it's useful for your hypervisor buffer, it can export it. It would be a good test to know if the extension works. Of course, that means if the normal ring buffer needs more info, it must also include this as well, as it will already be part of the extended structure. > > > > > Vincent, talking with Mathieu, he was suggesting that we only export what > > we really need, and I don't think we need the above. Especially since they > > are not exposed in the stats file. > > > > > > > + > > > + struct { > > > + unsigned long lost_events; > > > + __u32 id; > > > + __u32 read; > > > + } reader; > > > > The above is definitely needed, as all of it is used to read the > > reader-page of the sub-buffer. > > > > lost_events is the number of lost events that happened before this > > sub-buffer was swapped out. 
> > > > Hmm, Vincent, I just noticed that you update the lost_events as: > > > > > +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer) > > > +{ > > > + struct trace_buffer_meta *meta = cpu_buffer->meta_page; > > > + > > > + WRITE_ONCE(meta->entries, local_read(&cpu_buffer->entries)); > > > + WRITE_ONCE(meta->overrun, local_read(&cpu_buffer->overrun)); > > > + WRITE_ONCE(meta->read, cpu_buffer->read); > > > + > > > + WRITE_ONCE(meta->subbufs_touched, > > > local_read(&cpu_buffer->pages_touched)); > > > + WRITE_ONCE(meta->subbufs_lost, local_read(&cpu_buffer->pages_lost)); > > > + WRITE_ONCE(meta->subbufs_read, local_read(&cpu_buffer->pages_read)); > > > + > > > + WRITE_ONCE(meta->reader.read, cpu_buffer->reader_page->read); > > > + WRITE_ONCE(meta->reader.id, cpu_buffer->reader_page->id); > > > + WRITE_ONCE(meta->reader.lost_events, cpu_buffer->lost_events); > > > +} > > > > The lost_events may need to be handled differently, as it doesn't always > > get cleared. So it may be stale data. > > My idea was to check this value after the ioctl(). If > 0 then events were > lost > between that ioctl() and the previous swap. > > But now if with "[PATCH] ring-buffer: Have mmapped ring buffer keep track of > missed events" we put this information in the page itself, we can get rid of > this field. > I'm thinking both may be good, as the number of dropped events are not added if there's no room to put it at the end of the ring buffer. When there's no room, it just sets a flag that there were missed events but doesn't give how many events were missed. If anything, it should be in both locations. In the sub-buffer header, to be consistent with the read/splice version, and in the meta page where, if there's no room to add the count, it can be accessed in the meta page. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 12 Jan 2024 08:53:44 -0500 Steven Rostedt wrote: > > // We managed to open the directory so we have permission to list > > // directory entries in "xfs". > > fd = open("/sys/kernel/tracing/events/xfs"); > > > > // Remove ownership so we can't open the directory anymore > > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > > > // Or just remove exec bit for the group and restrict to owner > > chmod("/sys/kernel/tracing/events/xfs", 700); > > > > // Drop caches to force an eventfs_root_lookup() on everything > > write("/proc/sys/vm/drop_caches", "3", 1); > > This requires opening the directory, then having its permissions > change, and then immediately dropping the caches. > > > > > // Returns 0 even though directory has a lot of entries and we should be > > // able to list them > > getdents64(fd, ...); > > And do we care? Hmm, maybe the issue you have is with the inconsistency of the action? For this to fail, it would require the admin to both change the permission and flush the caches. If you don't flush the caches then the task with the dir open can still read it regardless, if the dentries were already created. In that case I'm fine if we change the creation of the dentries to not check the permission. But for now, it's just a weird side effect that I don't really see how it would affect any user's workflow. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 12 Jan 2024 09:27:24 +0100 Christian Brauner wrote: > On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote: > > On Thu, 11 Jan 2024 22:01:32 +0100 > > Christian Brauner wrote: > > > > > What I'm pointing out in the current logic is that the caller is > > > taxed twice: > > > > > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > > > (2) And again when you call lookup_one_len() in eventfs_start_creating() > > > _because_ the permission check in lookup_one_len() is the exact > > > same permission check again that the vfs has done > > > inode_permission(MAY_EXEC, "xfs"). > > > > As I described in: > > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ > > > > The eventfs files below "events" don't need the .permissions callback at > > all. It's only there because the "events" inode uses it. > > > > The .permissions call for eventfs has: > > It doesn't matter whether there's a ->permission handler. If you don't > add one explicitly the VFS will simply call generic_permission(): > > inode_permission() > -> do_inode_permission() >{ > if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { > if (likely(inode->i_op->permission)) > return inode->i_op->permission(idmap, inode, mask); > > /* This gets set once for the inode lifetime */ > spin_lock(&inode->i_lock); > inode->i_opflags |= IOP_FASTPERM; > spin_unlock(&inode->i_lock); > } > return generic_permission(idmap, inode, mask); >} Yes I know that, because that's where I knew what to call in the non "events" dir case. > > > Anyway, the issue is with the "events" directory and remounting, because like > > the tracefs system, the inode and dentry for "events" are created at boot > > up, before the mount happens. The VFS layer is going to check the > > permissions of its inode and dentry, which will be incorrect if the mount > > was mounted with a "gid" option. > > The gid option has nothing to do with this and it is just handled fine > if you remove the second permission checking in (2).
I guess I'm confused as to what you are having an issue with. Is it just that the permission check gets called twice? > > You need to remove the inode_permission() code from > eventfs_start_creating(). It is just an internal lookup and the fact > that you have it in there allows userspace to break readdir on the > eventfs portions of tracefs as I've shown in the parts of the mail that > you cut off. That's because I didn't see how it was related to the way I fixed the mount=gid issue. Are you only concerned because of the check in eventfs_start_creating()? Yes, you posted code that would make things act funny for some code that I see no real world use case for. Yeah, it may not act "properly" but I'm not sure that's bad. Here, I'll paste it back: > // We managed to open the directory so we have permission to list > // directory entries in "xfs". > fd = open("/sys/kernel/tracing/events/xfs"); > > // Remove ownership so we can't open the directory anymore > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > // Or just remove exec bit for the group and restrict to owner > chmod("/sys/kernel/tracing/events/xfs", 700); > > // Drop caches to force an eventfs_root_lookup() on everything > write("/proc/sys/vm/drop_caches", "3", 1); This requires opening the directory, then having its permissions change, and then immediately dropping the caches. > > // Returns 0 even though directory has a lot of entries and we should be > // able to list them > getdents64(fd, ...); And do we care? Since tracing exposes internal kernel information, perhaps this is a feature and not a bug. If someone had access to the tracing system and you wanted to stop them, and they had a directory open that they no longer have access to, you don't want them to see what's left in the directory. In other words, I like the idea that the getdents64(fd, ...) will fail!
If there's a file underneath that wasn't changed, and the admin thought that just keeping the top directory permissions off is good enough, then that attacker having that directory open before the directory had its file permissions changed is a way to still have access to the files below it. > > And the failure is in the inode_permission(MAY_EXEC, "xfs") call in > lookup_one_len() in eventfs_start_creating() which now fails. And I think that is a good thing! Again, tracefs is special. It gives you access to, and possibly control of, kernel behavior. I like the fact that as soon as someone loses permission to a directory, they immediately lose it. -- Steve
Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 11 Jan 2024 11:34:58 -0500 Mathieu Desnoyers wrote: > The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1], > and has a lot of similarities with this patch series. > > LTTng has the notion of "subbuffer id" to allow atomically exchanging a > "reader" extra subbuffer with the subbuffer to be read. It implements > "get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader) > to move the currently read subbuffer position. [2] > > It would not hurt to compare your approach to LTTng and highlight > similarities/differences, and the rationale for the differences. > > Especially when it comes to designing kernel ABIs, it's good to make sure > that all bases are covered, because those choices will have lasting impacts. > > Thanks, > > Mathieu > > [1] > https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c > [2] > https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c > Hi Mathieu, Thanks for sharing! As we discussed a little bit in the tracing meeting, we do some things differently but very similar too ;-) The similarities are that all the sub-buffers are mapped. You have a reader-sub-buffer as well. The main difference is that you use an ioctl() that returns where to find the reader-sub-buffer, where our ioctl() is just "I'm done, get me a new reader-sub-buffer". Ours will update the meta page. Our meta page looks like this: > +struct trace_buffer_meta { > + unsigned long entries; > + unsigned long overrun; > + unsigned long read; If you start tracing: trace-cmd start -e sched_switch and do: ~ cat /sys/kernel/tracing/per_cpu/cpu0/stats entries: 14 overrun: 0 commit overrun: 0 bytes: 992 oldest event ts: 84844.825372 now ts: 84847.102075 dropped events: 0 read events: 0 You'll see something similar to the above. entries = entries overrun = overrun read = read The above "read" is the total number of events read.
Pretty straightforward ;-) > + > + unsigned long subbufs_touched; > + unsigned long subbufs_lost; > + unsigned long subbufs_read; Now I'm thinking we may not want this exported, as I'm not sure it's useful. Vincent, talking with Mathieu, he was suggesting that we only export what we really need, and I don't think we need the above. Especially since they are not exposed in the stats file. > + > + struct { > + unsigned long lost_events; > + __u32 id; > + __u32 read; > + } reader; The above is definitely needed, as all of it is used to read the reader-page of the sub-buffer. lost_events is the number of lost events that happened before this sub-buffer was swapped out. Hmm, Vincent, I just noticed that you update the lost_events as: > +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer) > +{ > + struct trace_buffer_meta *meta = cpu_buffer->meta_page; > + > + WRITE_ONCE(meta->entries, local_read(&cpu_buffer->entries)); > + WRITE_ONCE(meta->overrun, local_read(&cpu_buffer->overrun)); > + WRITE_ONCE(meta->read, cpu_buffer->read); > + > + WRITE_ONCE(meta->subbufs_touched, > local_read(&cpu_buffer->pages_touched)); > + WRITE_ONCE(meta->subbufs_lost, local_read(&cpu_buffer->pages_lost)); > + WRITE_ONCE(meta->subbufs_read, local_read(&cpu_buffer->pages_read)); > + > + WRITE_ONCE(meta->reader.read, cpu_buffer->reader_page->read); > + WRITE_ONCE(meta->reader.id, cpu_buffer->reader_page->id); > + WRITE_ONCE(meta->reader.lost_events, cpu_buffer->lost_events); > +} The lost_events may need to be handled differently, as it doesn't always get cleared. So it may be stale data. > + > + __u32 subbuf_size; > + __u32 nr_subbufs; This is the information needed to read the mapped ring buffer. > + > + __u32 meta_page_size; > + __u32 meta_struct_len; The ring buffer gets mapped after "meta_page_size" and this structure is "meta_struct_len" which will change if we add new data to it. > +}; -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 11 Jan 2024 22:01:32 +0100 Christian Brauner wrote: > What I'm pointing out in the current logic is that the caller is > taxed twice: > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > (2) And again when you call lookup_one_len() in eventfs_start_creating() > _because_ the permission check in lookup_one_len() is the exact > same permission check again that the vfs has done > inode_permission(MAY_EXEC, "xfs"). As I described in: https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ The eventfs files below "events" don't need the .permissions callback at all. It's only there because the "events" inode uses it. The .permissions call for eventfs has: static int eventfs_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { set_top_events_ownership(inode); return generic_permission(idmap, inode, mask); } Where the set_top_events_ownership() is a nop for everything but the "events" directory. I guess I could have two ops: static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, .permission = eventfs_permission, }; static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, }; And use the second one for all dentries below the root, but I figured it's not that big of a deal if I called the permissions on all. Perhaps I should do it with two? Anyway, the issue is with the "events" directory and remounting, because, like the rest of the tracefs system, the inode and dentry for "events" are created at boot up, before the mount happens. The VFS layer is going to check the permissions of its inode and dentry, which will be incorrect if the mount was mounted with a "gid" option. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 10:52:51 -0500 Steven Rostedt wrote: > On Wed, 10 Jan 2024 08:07:46 -0500 > Steven Rostedt wrote: > > > Or are you saying that I don't need the ".permission" callback, because > > eventfs does it when it creates the inodes? But for eventfs to know what > > the permissions changes are, it uses .getattr and .setattr. > > OK, if your main argument is that we do not need .permission, I agree with > you. But that's a trivial change and doesn't affect the complexity that > eventfs is doing. In fact, removing the "permission" check is simply this > patch: > > -- > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index fdff53d5a1f8..f2af07a857e2 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, > return 0; > } > > -static int eventfs_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > -{ > - set_top_events_ownership(inode); > - return generic_permission(idmap, inode, mask); > -} > - > static const struct inode_operations eventfs_root_dir_inode_operations = { > .lookup = eventfs_root_lookup, > .setattr= eventfs_set_attr, > .getattr= eventfs_get_attr, > - .permission = eventfs_permission, > }; > > static const struct inode_operations eventfs_file_inode_operations = { > -- > > I only did that because Linus mentioned it, and I thought it was needed. > I'll apply this patch too, as it appears to work with this code. Oh, eventfs files and directories don't need the .permissions because its inodes and dentries are not created until accessed. But the "events" directory itself has its dentry and inode created at boot up, but still uses the eventfs_root_dir_inode_operations. So the .permissions is still needed! 
If you look at the "set_top_events_ownership()" function, it has: /* The top events directory doesn't get automatically updated */ if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; That is, it does nothing if the entry is not the "events" directory. It falls back to the default "->permissions()" function for everything but the top level "events" directory. But this and .getattr are still needed for the events directory, because it suffers the same issue as the other tracefs entries. That is, its inodes and dentries are created at boot up before it is mounted. So if the mount has gid=1000, it will be ignored. The .getattr is called by "stat" which ls does. So after boot up if you just do: # chmod 0750 /sys/kernel/tracing/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ls /sys/kernel/tracing/events/ 9p ext4 iomap module raw_syscalls thermal alarmtimer fib iommu msr rcu thp avc fib6 io_uring napi regmap timer block filelock ipi neigh regulator tlb bpf_test_run filemap irq net resctrl udp bpf_trace ftrace irq_matrix netfs rpm virtio_gpu [ ...] The above works because "ls" does a stat() on the directory first, which does a .getattr() call that updates the permissions of the existing "events" directory inode. BUT! If I had used my own getdents() program that has: fd = openat(AT_FDCWD, argv[1], O_RDONLY); if (fd < 0) perror("openat"); n = getdents64(fd, buf, BUF_SIZE); if (n < 0) perror("getdents64"); Where it calls the openat() without doing a stat first, and after boot, had done: # chmod 0750 /sys/kernel/tracing/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ./getdents /sys/kernel/tracing/events openat: Permission denied getdents64: Bad file descriptor It errors because the "events" inode permission hasn't been updated yet.
Now after getting the above error, if I do the "ls" and then run it again: $ ls /sys/kernel/tracing/events > /dev/null $ ./getdents /sys/kernel/tracing/events enable header_page header_event initcall vsyscall syscalls It works! So no, I can't remove that .permissions callback from eventfs. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 10:52:51 -0500 Steven Rostedt wrote: > I'll apply this patch too, as it appears to work with this code. I meant "appears to work without this code". -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 08:07:46 -0500 Steven Rostedt wrote: > Or are you saying that I don't need the ".permission" callback, because > eventfs does it when it creates the inodes? But for eventfs to know what > the permissions changes are, it uses .getattr and .setattr. OK, if your main argument is that we do not need .permission, I agree with you. But that's a trivial change and doesn't affect the complexity that eventfs is doing. In fact, removing the "permission" check is simply this patch: -- diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..f2af07a857e2 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, return 0; } -static int eventfs_permission(struct mnt_idmap *idmap, - struct inode *inode, int mask) -{ - set_top_events_ownership(inode); - return generic_permission(idmap, inode, mask); -} - static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, - .permission = eventfs_permission, }; static const struct inode_operations eventfs_file_inode_operations = { -- I only did that because Linus mentioned it, and I thought it was needed. I'll apply this patch too, as it appears to work with this code. Thanks! -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 12:45:36 +0100 Christian Brauner wrote: > So say you do: > > mkdir /sys/kernel/tracing/instances/foo > > After this has returned we know everything we need to know about the new > tracefs instance including the ownership and the mode of all inodes in > /sys/kernel/tracing/instances/foo/events/* and below precisely because > ownership is always inherited from the parent dentry and recorded in the > metadata struct eventfs_inode. > > So say someone does: > > open("/sys/kernel/tracing/instances/foo/events/xfs"); > > and say this is the first time that someone accesses that events/ > directory. > > When the open pathwalk is done, the vfs will determine via > > [1] may_lookup(inode_of(events)) > > whether you are able to list entries such as "xfs" in that directory. > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds > it ends up calling i_op->eventfs_root_lookup(events). > > At this point tracefs/eventfs adds the inodes for all entries in that > "events" directory including "xfs" based on the metadata it recorded > during the mkdir. Since now someone is actually interested in them. And > it initializes the inodes with ownership and everything and adds the > dentries that belong into that directory. > > Nothing here depends on the permissions of the caller. The only > permission that mattered was done in the VFS in [1]. If the caller has > permissions to enter a directory they can lookup and list its contents. > And its contents were determined/fixed etc when mkdir was called. > > So we just need to add the required objects into the caches (inode, > dentry) whose addition we intentionally deferred until someone actually > needed them. > > So, eventfs_root_lookup() now initializes the inodes with the ownership > from the stored metadata or from the parent dentry and splices in inodes > and dentries. No permission checking is needed for this because it is > always a recheck of what the vfs did in [1].
> > We now return to the vfs and path walk continues to the final component > that you actually want to open which is that "xfs" directory in this > example. We check the permissions on that inode via may_open("xfs") and > we open that directory returning an fd to userspace ultimately. > > (I'm going by memory since I need to step out the door.) So, let's say we do: chgrp -R rostedt /sys/kernel/tracing/ But I don't want rostedt to have access to xfs: chgrp -R root /sys/kernel/tracing/events/xfs Both actions will create the inodes and dentries of all files and directories (because of "-R"). But once that is done, the ref counts go to zero. They stay around until reclaim. But then I open Chrome ;-) and it reclaims all the dentries and inodes, so we are back to where we were on boot. Now as rostedt I do: ls /sys/kernel/tracing/events/xfs The VFS layer doesn't know if I have permission to that or not, because all the inodes and dentries have been freed. It has to call back to eventfs to find out, and the eventfs_root_lookup() and eventfs_iterate_shared() will recreate the inodes with the proper permissions. Or are you saying that I don't need the ".permission" callback, because eventfs does it when it creates the inodes? But for eventfs to know what the permission changes are, it uses .getattr and .setattr. -- Steve
[PATCH v3] ring-buffer: Have mmapped ring buffer keep track of missed events
From: "Steven Rostedt (Google)" While testing libtracefs on the mmapped ring buffer, the test that checks if missed events are accounted for failed when using the mapped buffer. This is because the mapped page does not update the missed events that were dropped because the writer filled up the ring buffer before the reader could catch it. Add the missed events to the reader page/sub-buffer when the IOCTL is done and a new reader page is acquired. Note that all accesses to the reader_page via rb_page_commit() had to be switched to rb_page_size(), and rb_page_size() which was just a copy of rb_page_commit() but now it masks out the RB_MISSED bits. This is needed as the mapped reader page is still active in the ring buffer code and where it reads the commit field of the bpage for the size, it now must mask it otherwise the missed bits that are now set will corrupt the size returned. Signed-off-by: Steven Rostedt (Google) --- Changes since v2: https://lore.kernel.org/linux-trace-kernel/20240109173552.0b40d...@gandalf.local.home/ - Remove left-over "dump_stack()" that was used for debugging kernel/trace/ring_buffer.c | 60 ++ 1 file changed, 54 insertions(+), 6 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 07dae67424a9..603956bc1e59 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -312,6 +312,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event *event) /* Missed count stored at end */ #define RB_MISSED_STORED (1 << 30) +#define RB_MISSED_MASK (3 << 30) + struct buffer_data_page { u64 time_stamp;/* page time stamp */ local_t commit;/* write committed index */ @@ -2303,7 +2305,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter) /* Size is determined by what has been committed */ static __always_inline unsigned rb_page_size(struct buffer_page *bpage) { - return rb_page_commit(bpage); + return rb_page_commit(bpage) & ~RB_MISSED_MASK; } static __always_inline unsigned @@ -3930,7 +3932,7 @@ static 
bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) return true; /* Reader should exhaust content in reader page */ - if (reader->read != rb_page_commit(reader)) + if (reader->read != rb_page_size(reader)) return false; /* @@ -4401,7 +4403,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter) return ((iter->head_page == commit_page && iter->head >= commit) || (iter->head_page == reader && commit_page == head_page && head_page->read == commit && -iter->head == rb_page_commit(cpu_buffer->reader_page))); +iter->head == rb_page_size(cpu_buffer->reader_page))); } EXPORT_SYMBOL_GPL(ring_buffer_iter_empty); @@ -5745,7 +5747,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, event = rb_reader_event(cpu_buffer); read = reader->read; - commit = rb_page_commit(reader); + commit = rb_page_size(reader); /* Check if any events were dropped */ missed_events = cpu_buffer->lost_events; @@ -5822,7 +5824,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, } else { /* update the entry counter */ cpu_buffer->read += rb_page_entries(reader); - cpu_buffer->read_bytes += rb_page_commit(reader); + cpu_buffer->read_bytes += rb_page_size(reader); /* swap the pages */ rb_init_page(bpage); @@ -6349,6 +6351,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) { struct ring_buffer_per_cpu *cpu_buffer; + struct buffer_page *reader; + unsigned long missed_events; unsigned long reader_size; unsigned long flags; @@ -6374,9 +6378,53 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) goto out; } - if (WARN_ON(!rb_get_reader_page(cpu_buffer))) + reader = rb_get_reader_page(cpu_buffer); + if (WARN_ON(!reader)) goto out; + /* Check if any events were dropped */ + missed_events = cpu_buffer->lost_events; + + if (cpu_buffer->reader_page != cpu_buffer->commit_page) { + if (missed_events) { + struct buffer_data_page *bpage = reader->page; + unsigned int 
commit; + /* +* Use the real_end for the data size, +* This gives us a chance to store the lost events +* on the page. +
Re: [PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions
On Wed, 10 Jan 2024 08:42:05 +0900 Masami Hiramatsu (Google) wrote:

> On Tue, 9 Jan 2024 15:13:51 + Vincent Donnefort wrote:
> > > > @@ -388,6 +389,7 @@ struct rb_irq_work {
> > > > 	bool	waiters_pending;
> > > > 	bool	full_waiters_pending;
> > > > 	bool	wakeup_full;
> > > > +	bool	is_cpu_buffer;
> > >
> > > I think 'is_cpu_buffer' is a bit unclear (or generic),
> > > what about 'meta_page_update'?
> >
> > Hum not sure about that change. This was really to identify if parent of
> > rb_irq_work is a cpu_buffer or a trace_buffer. It can be a cpu_buffer
> > regardless of the need to update the meta-page.
>
> Yeah, I just meant that it is "for_cpu_buffer", not "rb_irq_work is_cpu_buffer".
> So when reading the code, I just felt uncomfortable.

How about "in_cpu_buffer" as that is what it is.

struct ring_buffer_per_cpu {
	struct rb_irq_work {
		bool	in_cpu_buffer;
	}
}

Would that make you feel more comfortable? ;-)

-- Steve
[PATCH v2] ring-buffer: Have mmapped ring buffer keep track of missed events
From: "Steven Rostedt (Google)" While testing libtracefs on the mmapped ring buffer, the test that checks if missed events are accounted for failed when using the mapped buffer. This is because the mapped page does not update the missed events that were dropped because the writer filled up the ring buffer before the reader could catch it. Add the missed events to the reader page/sub-buffer when the IOCTL is done and a new reader page is acquired. Note that all accesses to the reader_page via rb_page_commit() had to be switched to rb_page_size(), and rb_page_size() which was just a copy of rb_page_commit() but now it masks out the RB_MISSED bits. This is needed as the mapped reader page is still active in the ring buffer code and where it reads the commit field of the bpage for the size, it now must mask it otherwise the missed bits that are now set will corrupt the size returned. Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/all/20240109205112.74225-5-rost...@goodmis.org/ - Forgot to update the "real end" so that it has room to add the number of missed events. Without that update, it would just say "missed events" but not how many events were missed. 
kernel/trace/ring_buffer.c | 61 ++ 1 file changed, 55 insertions(+), 6 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 07dae67424a9..b111f0694805 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -312,6 +312,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event *event) /* Missed count stored at end */ #define RB_MISSED_STORED (1 << 30) +#define RB_MISSED_MASK (3 << 30) + struct buffer_data_page { u64 time_stamp;/* page time stamp */ local_t commit;/* write committed index */ @@ -2303,7 +2305,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter) /* Size is determined by what has been committed */ static __always_inline unsigned rb_page_size(struct buffer_page *bpage) { - return rb_page_commit(bpage); + return rb_page_commit(bpage) & ~RB_MISSED_MASK; } static __always_inline unsigned @@ -2769,6 +2771,7 @@ static void rb_add_timestamp(struct ring_buffer_per_cpu *cpu_buffer, once++; pr_warn("Ring buffer clock went backwards: %llu -> %llu\n", info->before, info->ts); + dump_stack(); } } else rb_check_timestamp(cpu_buffer, info); @@ -3930,7 +3933,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) return true; /* Reader should exhaust content in reader page */ - if (reader->read != rb_page_commit(reader)) + if (reader->read != rb_page_size(reader)) return false; /* @@ -4401,7 +4404,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter) return ((iter->head_page == commit_page && iter->head >= commit) || (iter->head_page == reader && commit_page == head_page && head_page->read == commit && -iter->head == rb_page_commit(cpu_buffer->reader_page))); +iter->head == rb_page_size(cpu_buffer->reader_page))); } EXPORT_SYMBOL_GPL(ring_buffer_iter_empty); @@ -5745,7 +5748,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, event = rb_reader_event(cpu_buffer); read = reader->read; - commit = rb_page_commit(reader); + commit = rb_page_size(reader); /* Check if any events 
were dropped */ missed_events = cpu_buffer->lost_events; @@ -5822,7 +5825,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, } else { /* update the entry counter */ cpu_buffer->read += rb_page_entries(reader); - cpu_buffer->read_bytes += rb_page_commit(reader); + cpu_buffer->read_bytes += rb_page_size(reader); /* swap the pages */ rb_init_page(bpage); @@ -6349,6 +6352,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) { struct ring_buffer_per_cpu *cpu_buffer; + struct buffer_page *reader; + unsigned long missed_events; unsigned long reader_size; unsigned long flags; @@ -6374,9 +6379,53 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) goto out; } - if (WARN_ON(!rb_get_reader_page(cpu_buffer))) + reader = rb_get_reader_page(cpu_buffer); + if (WARN_ON(!reader)) goto out; +
[PATCH] ring-buffer: Have mmapped ring buffer keep track of missed events
From: "Steven Rostedt (Google)" While testing libtracefs on the mmapped ring buffer, the test that checks if missed events are accounted for failed when using the mapped buffer. This is because the mapped page does not update the missed events that were dropped because the writer filled up the ring buffer before the reader could catch it. Add the missed events to the reader page/sub-buffer when the IOCTL is done and a new reader page is acquired. Note that all accesses to the reader_page via rb_page_commit() had to be switched to rb_page_size(), and rb_page_size() which was just a copy of rb_page_commit() but now it masks out the RB_MISSED bits. This is needed as the mapped reader page is still active in the ring buffer code and where it reads the commit field of the bpage for the size, it now must mask it otherwise the missed bits that are now set will corrupt the size returned. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 54 +- 1 file changed, 48 insertions(+), 6 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 07dae67424a9..90af4f28671f 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -312,6 +312,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event *event) /* Missed count stored at end */ #define RB_MISSED_STORED (1 << 30) +#define RB_MISSED_MASK (3 << 30) + struct buffer_data_page { u64 time_stamp;/* page time stamp */ local_t commit;/* write committed index */ @@ -2303,7 +2305,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter) /* Size is determined by what has been committed */ static __always_inline unsigned rb_page_size(struct buffer_page *bpage) { - return rb_page_commit(bpage); + return rb_page_commit(bpage) & ~RB_MISSED_MASK; } static __always_inline unsigned @@ -2769,6 +2771,7 @@ static void rb_add_timestamp(struct ring_buffer_per_cpu *cpu_buffer, once++; pr_warn("Ring buffer clock went backwards: %llu -> %llu\n", info->before, info->ts); + 
dump_stack(); } } else rb_check_timestamp(cpu_buffer, info); @@ -3930,7 +3933,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) return true; /* Reader should exhaust content in reader page */ - if (reader->read != rb_page_commit(reader)) + if (reader->read != rb_page_size(reader)) return false; /* @@ -4401,7 +4404,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter) return ((iter->head_page == commit_page && iter->head >= commit) || (iter->head_page == reader && commit_page == head_page && head_page->read == commit && -iter->head == rb_page_commit(cpu_buffer->reader_page))); +iter->head == rb_page_size(cpu_buffer->reader_page))); } EXPORT_SYMBOL_GPL(ring_buffer_iter_empty); @@ -5745,7 +5748,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, event = rb_reader_event(cpu_buffer); read = reader->read; - commit = rb_page_commit(reader); + commit = rb_page_size(reader); /* Check if any events were dropped */ missed_events = cpu_buffer->lost_events; @@ -5822,7 +5825,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, } else { /* update the entry counter */ cpu_buffer->read += rb_page_entries(reader); - cpu_buffer->read_bytes += rb_page_commit(reader); + cpu_buffer->read_bytes += rb_page_size(reader); /* swap the pages */ rb_init_page(bpage); @@ -6349,6 +6352,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) { struct ring_buffer_per_cpu *cpu_buffer; + struct buffer_page *reader; + unsigned long missed_events; unsigned long reader_size; unsigned long flags; @@ -6374,9 +6379,46 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) goto out; } - if (WARN_ON(!rb_get_reader_page(cpu_buffer))) + reader = rb_get_reader_page(cpu_buffer); + if (WARN_ON(!reader)) goto out; + /* Check if any events were dropped */ + missed_events = cpu_buffer->lost_events; + + if (cpu_buffer->reader_page != cpu_buffer->commit_page) { + 
if (missed_events) { + struct buffer_data_page *bpage = reader->page; +
Re: [PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions
On Tue, 9 Jan 2024 15:13:51 + Vincent Donnefort wrote:

> > > @@ -388,6 +389,7 @@ struct rb_irq_work {
> > > 	bool	waiters_pending;
> > > 	bool	full_waiters_pending;
> > > 	bool	wakeup_full;
> > > +	bool	is_cpu_buffer;
> >
> > I think 'is_cpu_buffer' is a bit unclear (or generic),
> > what about 'meta_page_update'?
>
> Hum not sure about that change. This was really to identify if parent of
> rb_irq_work is a cpu_buffer or a trace_buffer. It can be a cpu_buffer
> regardless of the need to update the meta-page.

Yeah, this was added because the irq_work is called with the rb_work for both the cpu_buffer and the global struct trace_buffer object. The meta_page is only available to the cpu_buffer and does not exist on the struct trace_buffer, so this is checked before doing a "container_of()" on the wrong structure.

Both the cpu_buffer and the global buffer call the same function with the rb_work structure. So "is_cpu_buffer" is the right terminology as it's unrelated to the meta page:

struct trace_buffer {
	[..]
	struct rb_irq_work	irq_work;
	[..]
};

struct trace_buffer *buffer;

	buffer->irq_work.is_cpu_buffer = false;

struct ring_buffer_per_cpu {
	[..]
	struct rb_irq_work	irq_work;
	[..]
}

struct ring_buffer_per_cpu *cpu_buffer;

	cpu_buffer->irq_work.is_cpu_buffer = true;

[..]
	init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
[..]
	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);

// both the buffer and cpu_buffer call rb_wake_up_waiters()

static void rb_wake_up_waiters(struct irq_work *work)
{
	struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, work);

// This container_of() gets rbwork which is the rb_irq_work structure that
// both buffer and cpu_buffer have.

	if (rbwork->is_cpu_buffer) {
		struct ring_buffer_per_cpu *cpu_buffer;

		cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu, irq_work);

// The above crashes if done by the buffer and not the cpu_buffer.
// The "is_cpu_buffer" is there to differentiate the two rb_work entities.
// It is a way to say this is safe to do the above "container_of()".

		/*
		 * If the waiter is a cpu_buffer, this might be due to a
		 * userspace mapping. Let's update the meta-page.
		 */
		rb_update_meta_page(cpu_buffer);
	}

-- Steve
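The pattern described above — two different containers embedding the same work struct, with a flag inside it telling the shared callback which container_of() is safe — can be reduced to a small self-contained sketch. This is userspace C with simplified stand-in structs (the struct members and the return-the-cpu "effect" are illustrative; only the shape of the dispatch matches the kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins: both containers embed the same work struct. */
struct rb_irq_work { int is_cpu_buffer; };
struct trace_buffer { struct rb_irq_work irq_work; };
struct ring_buffer_per_cpu { int cpu; struct rb_irq_work irq_work; };

/* Shared callback: only do the per-CPU container_of() when the flag says
 * the work is embedded in a ring_buffer_per_cpu — otherwise the cast
 * would point into the wrong structure. Returns the cpu as a visible
 * stand-in for rb_update_meta_page(). */
static int rb_wake_up_waiters_sketch(struct rb_irq_work *rbwork)
{
	if (rbwork->is_cpu_buffer) {
		struct ring_buffer_per_cpu *cpu_buffer =
			container_of(rbwork, struct ring_buffer_per_cpu, irq_work);
		return cpu_buffer->cpu;	/* would update the meta-page here */
	}
	return -1;			/* global buffer: no meta-page */
}

/* Helpers exercising both callers of the shared callback. */
static int demo_cpu_buffer(void)
{
	struct ring_buffer_per_cpu cb = { .cpu = 3, .irq_work = { .is_cpu_buffer = 1 } };
	return rb_wake_up_waiters_sketch(&cb.irq_work);
}

static int demo_global_buffer(void)
{
	struct trace_buffer tb = { .irq_work = { .is_cpu_buffer = 0 } };
	return rb_wake_up_waiters_sketch(&tb.irq_work);
}
```

The flag is what makes the downcast safe: without it, the callback has no way to know which of the two embedding structures the rb_irq_work lives in.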
Re: [PATCH v10 0/2] Introducing trace buffer mapping by user-space
Hi Masami, thanks for looking at this.

On Tue, 9 Jan 2024 22:04:45 +0900 Masami Hiramatsu (Google) wrote:

> > The tracing ring-buffers can be stored on disk or sent to network
> > without any copy via splice. However the latter doesn't allow real time
> > processing of the traces. A solution is to give userspace direct access
> > to the ring-buffer pages via a mapping. An application can now become a
> > consumer of the ring-buffer, in a similar fashion to what trace_pipe
> > offers.
>
> I think this is a very nice feature. But this series seems to be just the
> feature, with no documentation and no example code. Can you add 2 patches
> to add those? I know libtracefs already provides support code, but I think
> it is better to have test code under tools/testing/selftests/ring_buffer.

Yeah, we should have sample code and a test.

> I also wonder what happens if another operation (e.g. taking a snapshot)
> happens while mmapping the ring buffer.

Hmm, good point. We should disable snapshots when mapped, and also prevent mapping with the latency tracers if we are not already doing that.

-- Steve
Re: possible deadlock in __perf_install_in_context
This is related to perf not tracing. -- Steve On Tue, 09 Jan 2024 08:34:15 +0800 "Ubisectech Sirius" wrote: > Dear concerned. > Greetings! > We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. > Recently, our team has discovered a issue in Linux kernel 6.7.0-g0dd3ee311255. > technical details: > 1. Issue Description: possible deadlock in __perf_install_in_context > 2. Stack Dump: > [ 158.488994][ T8029] Call Trace: > [ 158.489411][ T8029] > arch/x86/events/intel/../perf_event.h:1166 arch/x86/events/intel/core.c:2799) > [ 158.498427][ T8029] x86_pmu_start (arch/x86/events/core.c:1516) > [ 158.499034][ T8029] x86_pmu_enable (arch/x86/events/core.c:1331 > (discriminator 2)) > [ 158.499601][ T8029] perf_ctx_enable (kernel/events/core.c:703 > (discriminator 2)) > [ 158.500171][ T8029] ctx_resched (kernel/events/core.c:2741) > [ 158.500733][ T8029] __perf_install_in_context (kernel/events/core.c:2807) > [ 158.502106][ T8029] remote_function (kernel/events/core.c:92 > kernel/events/core.c:72) > [ 158.503364][ T8029] generic_exec_single (kernel/smp.c:134 (discriminator 3) > kernel/smp.c:404 (discriminator 3)) > [ 158.503995][ T8029] smp_call_function_single (kernel/smp.c:647) > (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:77 > ./arch/x86/include/asm/irqflags.h:135 lib/percpu_counter.c:102) > [ 158.512958][ T8029] perf_install_in_context (kernel/events/core.c:2909 > (discriminator 1)) > [ 158.515717][ T8029] __do_sys_perf_event_open (kernel/events/core.c:1443 > kernel/events/core.c:12747) > [ 158.518483][ T8029] do_syscall_64 (arch/x86/entry/common.c:52 > arch/x86/entry/common.c:83) > [ 158.519281][ T8029] entry_SYSCALL_64_after_hwframe > (arch/x86/entry/entry_64.S:129) > [ 158.519991][ T8029] RIP: 0033:0x7f04a0c9cf29 > [ 158.520536][ T8029] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 > 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 > <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 
f7 d8 64 89 01 48 > All code > > 0: 00 c3 add %al,%bl > 2: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) > 9: 00 00 00 > c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) > 11: 48 89 f8 mov %rdi,%rax > 14: 48 89 f7 mov %rsi,%rdi > 17: 48 89 d6 mov %rdx,%rsi > 1a: 48 89 ca mov %rcx,%rdx > 1d: 4d 89 c2 mov %r8,%r10 > 20: 4d 89 c8 mov %r9,%r8 > 23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9 > 28: 0f 05 syscall > 2a:* 48 3d 01 f0 ff ff cmp $0xf001,%rax <-- trapping instruction > 30: 73 01 jae 0x33 > 32: c3 ret > 33: 48 8b 0d 37 8f 0d 00 mov 0xd8f37(%rip),%rcx # 0xd8f71 > 3a: f7 d8 neg %eax > 3c: 64 89 01 mov %eax,%fs:(%rcx) > 3f: 48 rex.W > Code starting with the faulting instruction > === > 0: 48 3d 01 f0 ff ff cmp $0xf001,%rax > 6: 73 01 jae 0x9 > 8: c3 ret > 9: 48 8b 0d 37 8f 0d 00 mov 0xd8f37(%rip),%rcx # 0xd8f47 > 10: f7 d8 neg %eax > 12: 64 89 01 mov %eax,%fs:(%rcx) > 15: 48 rex.W > [ 158.522837][ T8029] RSP: 002b:7ffe5f1174b8 EFLAGS: 0246 ORIG_RAX: > 012a > [ 158.523848][ T8029] RAX: ffda RBX: RCX: > 7f04a0c9cf29 > [ 158.524797][ T8029] RDX: RSI: RDI: > 20004740 > [ 158.525738][ T8029] RBP: 7ffe5f1174c0 R08: R09: > 7ffe5f1174f0 > [ 158.526717][ T8029] R10: R11: 0246 R12: > 5597d067d180 > [ 158.527661][ T8029] R13: R14: R15: > > [ 158.528611][ T8029] > [ 158.530059][ T8029] > [ 158.530364][ T8029] == > [ 158.531146][ T8029] WARNING: possible circular locking dependency detected > [ 158.531881][ T8029] 6.7.0-g0dd3ee311255 #6 Not tainted > [ 158.532457][ T8029] -- > [ 158.533256][ T8029] poc/8029 is trying to acquire lock: > [ 158.533880][ T8029] 88801ca53018 (>lock){}-{2:2}, at: > __perf_event_task_sched_out (kernel/events/core.c:3573 > kernel/events/core.c:3676) > [ 158.535067][ T8029] > [ 158.535067][ T8029] but task is already holding lock: > [ 158.535925][ T8029] 88802d23c758 (>__lock){-.-.}-{2:2}, at: > raw_spin_rq_lock_nested (kernel/sched/core.c:574) > [ 158.537001][ T8029] > [ 158.537001][ T8029] which lock already depends on the new lock. 
> [ 158.537001][ T8029] > [ 158.538196][ T8029] > [ 158.538196][ T8029] the existing dependency chain (in reverse order) is: > [ 158.539200][ T8029] > [ 158.539200][ T8029] -> #3 (>__lock){-.-.}-{2:2}: > [ 158.540081][ T8029] _raw_spin_lock_nested (kernel/locking/spinlock.c:379) > [ 158.540772][ T8029] raw_spin_rq_lock_nested (kernel/sched/core.c:574) > [ 158.541471][ T8029] task_fork_fair (kernel/sched/sched.h:1222 > kernel/sched/sched.h:1581 kernel/sched/sched.h:1664 kernel/sched/fair.c:12586) > [ 158.542092][ T8029] sched_cgroup_fork (kernel/sched/core.c:4814) > [ 158.542772][
Re: [PATCH] tracing user_events: Simplify user_event_parse_field() parsing
On Mon, 8 Jan 2024 17:13:12 -0500 Steven Rostedt wrote:

> On Mon, 8 Jan 2024 21:47:44 + Beau Belgrave wrote:
>
> > > - len = str_has_prefix(field, "__rel_loc ");
> > > - if (len)
> > > - goto skip_next;
> > > + if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
> > > +     !(len = str_has_prefix(field, "__data_loc ")) &&
> > > +     !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
> > > +     !(len = str_has_prefix(field, "__rel_loc "))) {
> > > + goto parse;
> > > + }
> >
> > This now triggers a checkpatch error:
> > ERROR: do not use assignment in if condition
>
> What a horrible message.
>
> > #1184: FILE: kernel/trace/trace_events_user.c:1184:
> > + if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
> >
> > I personally prefer to keep these files fully checkpatch clean.
>
> I stopped using checkpatch years ago because I disagreed with so much of it :-p
> (Including this message)

Note that checkpatch is a guideline and not a rule. The general rule is: if the code looks worse when applying the checkpatch rule, don't do it.

- Steve
Re: [PATCH] tracing user_events: Simplify user_event_parse_field() parsing
On Mon, 8 Jan 2024 21:47:44 + Beau Belgrave wrote:

> > - len = str_has_prefix(field, "__rel_loc ");
> > - if (len)
> > - goto skip_next;
> > + if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
> > +     !(len = str_has_prefix(field, "__data_loc ")) &&
> > +     !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
> > +     !(len = str_has_prefix(field, "__rel_loc "))) {
> > + goto parse;
> > + }
>
> This now triggers a checkpatch error:
> ERROR: do not use assignment in if condition

What a horrible message.

> #1184: FILE: kernel/trace/trace_events_user.c:1184:
> + if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
>
> I personally prefer to keep these files fully checkpatch clean.

I stopped using checkpatch years ago because I disagreed with so much of it :-p
(Including this message)

> However, I did test these changes under the self-tests and it passed.
>
> Do they bug you that much? :)

No big deal if you prefer the other way. I was just doing an audit of str_has_prefix() to see what code that uses it could be cleaned up, and I found this code. If you prefer to limit your code to "checkpatch clean", I'll leave it alone.

-- Steve
[PATCH] tracing user_events: Simplify user_event_parse_field() parsing
From: "Steven Rostedt (Google)" Instead of having a bunch of if statements with: len = str_has_prefix(field, "__data_loc unsigned "); if (len) goto skip_next; len = str_has_prefix(field, "__data_loc "); if (len) goto skip_next; len = str_has_prefix(field, "__rel_loc unsigned "); if (len) goto skip_next; len = str_has_prefix(field, "__rel_loc "); if (len) goto skip_next; goto parse; skip_next: Consolidate it into a negative check and jump to parse if all the str_has_prefix() calls fail. If one succeeds, it will just continue with len equal to the proper string: if (!(len = str_has_prefix(field, "__data_loc unsigned ")) && !(len = str_has_prefix(field, "__data_loc ")) && !(len = str_has_prefix(field, "__rel_loc unsigned ")) && !(len = str_has_prefix(field, "__rel_loc "))) { goto parse; } skip_next: Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events_user.c | 22 ++ 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index 9365ce407426..ce0c5f1ded48 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -1175,23 +1175,13 @@ static int user_event_parse_field(char *field, struct user_event *user, goto skip_next; } - len = str_has_prefix(field, "__data_loc unsigned "); - if (len) - goto skip_next; - - len = str_has_prefix(field, "__data_loc "); - if (len) - goto skip_next; - - len = str_has_prefix(field, "__rel_loc unsigned "); - if (len) - goto skip_next; - - len = str_has_prefix(field, "__rel_loc "); - if (len) - goto skip_next; + if (!(len = str_has_prefix(field, "__data_loc unsigned ")) && + !(len = str_has_prefix(field, "__data_loc ")) && + !(len = str_has_prefix(field, "__rel_loc unsigned ")) && + !(len = str_has_prefix(field, "__rel_loc "))) { + goto parse; + } - goto parse; skip_next: type = field; field = strpbrk(field + len, " "); -- 2.43.0
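The consolidation above depends on str_has_prefix() returning the matched prefix length rather than a plain boolean — that is what lets a single 'len' be reused for all four cases. A userspace sketch of the same pattern (str_has_prefix() is re-implemented here for illustration; the kernel's version in include/linux/string.h has the same return convention):

```c
#include <assert.h>
#include <string.h>

/* Userspace stand-in for the kernel's str_has_prefix(): returns the
 * length of @prefix if @str starts with it, 0 otherwise. */
static size_t str_has_prefix(const char *str, const char *prefix)
{
	size_t len = strlen(prefix);
	return strncmp(str, prefix, len) == 0 ? len : 0;
}

/* The consolidated pattern from the patch: chain the checks so that on
 * success 'len' holds the length of whichever prefix matched, and a
 * return of 0 corresponds to the 'goto parse' path. Note the longer
 * "... unsigned " variants must be tested before their shorter forms. */
static size_t match_loc_prefix(const char *field)
{
	size_t len;

	if (!(len = str_has_prefix(field, "__data_loc unsigned ")) &&
	    !(len = str_has_prefix(field, "__data_loc ")) &&
	    !(len = str_has_prefix(field, "__rel_loc unsigned ")) &&
	    !(len = str_has_prefix(field, "__rel_loc ")))
		return 0;	/* no match: caller would 'goto parse' */

	return len;
}
```

After a match, 'field + len' points past the prefix, which is why the original code's `strpbrk(field + len, " ")` continues parsing from the right place.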
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, 8 Jan 2024 12:32:46 +0100 Christian Brauner wrote:

> On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> > On Sun, 7 Jan 2024 13:29:12 -0500 Steven Rostedt wrote:
> >
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.
> > >
> > > I don't think so.
> >
> > Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> > It exists without that. Although one rationale to do eventfs was so
>
> Every instance/foo/ tracefs instance also contains an events directory
> and thus an eventfs portion. Eventfs is just a subtree of tracefs. It's
> not a separate filesystem. Both eventfs and tracefs are on the same
> single, system-wide superblock.
>
> > that the instance directories wouldn't recreate the same 10 thousand
> > event inodes and dentries for every mkdir done.
>
> I know, but that's irrelevant to what I'm trying to tell you.
>
> A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
> instance. With or without the on-demand dentry and inode creation for
> the eventfs portion, that tracefs "instance" has now been created in its
> entirety, including all the required information for someone to later
> come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.
>
> All you've done is to defer the addition of the dentries and inodes to when
> someone does actually look at the events directory of the tracefs
> instance.
>
> Whether you choose to splice in the dentries and inodes for the eventfs
> portion during lookup and readdir, or if you had chosen to not do the
> on-demand thing at all and the entries were created at the same time as
> the mkdir call, the two are equivalent from the perspective of permission
> checking.
>
> If you have the required permissions to look at the events directory
> then there's no reason why listing the directory entries in there should
> fail. This can't even happen right now.

Ah, I think I know where the confusion lies.
The tracing information in kernel/trace/*.c doesn't keep track of permissions. It relies totally on fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with 'gid=xxx', then it's up to tracefs to maintain that information, not the tracing subsystem. The tracing subsystem only gives the "default" permissions (before boot finishes).

The difference between normal file systems and pseudo file systems like debugfs and tracefs is that normal file systems keep the permission information stored on the external device. That is, when the inodes and dentries are created, the information is retrieved from the stored file system.

I think this may actually be a failure of debugfs (and tracefs, as it was based on debugfs), in that the inodes and dentries are created at the same time the "files" backing them are, which is normally at boot up and before the file system is mounted. That is, inodes and dentries are actually coupled with the data they represent. It's not a cache for a back store like a hard drive partition.

To create a file in debugfs you do:

  struct dentry *debugfs_create_file(const char *name, umode_t mode,
				     struct dentry *parent, void *data,
				     const struct file_operations *fops)

That is, you pass the name, the mode, the parent dentry, data, and the fops, and that will create an inode and dentry (which is returned). This happens at boot up before user space is running and before debugfs is even mounted.

Because debugfs is mostly for debugging, people don't care about how it's mounted. It is usually restricted to root-only access, especially since there's a lot of sensitive information that shouldn't be exposed to non-privileged users.

The reason tracefs came about is that people asked me to be able to have access to tracing without needing to even enable debugfs. They also wanted to easily make it accessible to non-root users, and talking with Kees Cook, he recommended using ACL.
But because it inherited a lot from debugfs, I started doing these tricks like walking the dentry tree to make it work a bit better. Because the dentries and inodes were created before mount, I had to play these tricks. But as Linus pointed out, that was the wrong way to do it. The right way was to use .getattr and .permission callbacks to figure out what the permissions to the files are. This has nothing to do with the creation of the files; it's about who has access to the files that the inodes point to.

This sounds like another topic to bring up at LSFMM ;-)

"Can we standardize pseudo file systems like debugfs and tracefs to act more like real file systems, and have inodes and dentries act as a cache and not be so coupled to the data?"

-- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, 8 Jan 2024 12:04:54 +0100 Christian Brauner wrote:

> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.
> >
> > I don't think so.
>
> I'm very well aware that the dentries and inodes aren't created during
> mkdir, but the complete directory layout is determined. You're just
> splicing in dentries and inodes during lookup and readdir.
>
> If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> do a lookup/readdir on
>
>   ls -al /sys/kernel/tracing/instances/foo/events
>
> why should the creation of the dentries and inodes ever fail due to a
> permission failure?

They shouldn't.

> The vfs did already verify that you had the required
> permissions to list entries in that directory. Why should filling up
> /sys/kernel/tracing/instances/foo/events ever fail then?

It shouldn't.

> That tracefs instance would be half-functional. And again, right now
> that inode_permission() check cannot even fail.

And it shouldn't. But without dentries and inodes, how does the VFS know who is allowed to open the files?

-- Steve
Re: [PATCH] tracing histograms: Simplify parse_actions() function
On Mon, 8 Jan 2024 10:32:14 +0200 Andy Shevchenko wrote: > On Mon, Jan 8, 2024 at 3:31 AM Steven Rostedt wrote: > > > > From: "Steven Rostedt (Google)" > > > > The parse_actions() function uses 'len = str_has_prefix()' to test which > > action is in the string being parsed. But then it goes and repeats the > > logic for each different action. This logic can be simplified and > > duplicate code can be removed as 'len' contains the length of the found > > prefix which should be used for all actions. > > > Link: > > https://lore.kernel.org/all/20240107112044.6702c...@gandalf.local.home/ > > > > Signed-off-by: Steven Rostedt (Google) > > If you want Link to be formally a tag, you should drop the following > blank line. The link is for humans not for parsers. > > > > + if ((len = str_has_prefix(str, "onmatch("))) > > + hid = HANDLER_ONMATCH; > > + else if ((len = str_has_prefix(str, "onmax("))) > > + hid = HANDLER_ONMAX; > > + else if ((len = str_has_prefix(str, "onchange("))) > > + hid = HANDLER_ONCHANGE; > > The repeating check for ( might be moved out as well after this like > > if (str[len] != '(') { > // not sure if you need data to be assigned here as well > ret = -EINVAL; > ... > } > Not sure how that makes it any better. It adds more code. I could start with checking the "on" before checking for "match", "max" and "change", but that just makes it more complex. -- Steve
[PATCH] tracing histograms: Simplify parse_actions() function
From: "Steven Rostedt (Google)" The parse_actions() function uses 'len = str_has_prefix()' to test which action is in the string being parsed. But then it goes and repeats the logic for each different action. This logic can be simplified and duplicate code can be removed as 'len' contains the length of the found prefix which should be used for all actions. Link: https://lore.kernel.org/all/20240107112044.6702c...@gandalf.local.home/ Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_events_hist.c | 49 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 5ecf3c8bde20..6ece1308d36a 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -4805,36 +4805,35 @@ static int parse_actions(struct hist_trigger_data *hist_data) int len; for (i = 0; i < hist_data->attrs->n_actions; i++) { + enum handler_id hid = 0; + char *action_str; + str = hist_data->attrs->action_str[i]; - if ((len = str_has_prefix(str, "onmatch("))) { - char *action_str = str + len; + if ((len = str_has_prefix(str, "onmatch("))) + hid = HANDLER_ONMATCH; + else if ((len = str_has_prefix(str, "onmax("))) + hid = HANDLER_ONMAX; + else if ((len = str_has_prefix(str, "onchange("))) + hid = HANDLER_ONCHANGE; - data = onmatch_parse(tr, action_str); - if (IS_ERR(data)) { - ret = PTR_ERR(data); - break; - } - } else if ((len = str_has_prefix(str, "onmax("))) { - char *action_str = str + len; + action_str = str + len; - data = track_data_parse(hist_data, action_str, - HANDLER_ONMAX); - if (IS_ERR(data)) { - ret = PTR_ERR(data); - break; - } - } else if ((len = str_has_prefix(str, "onchange("))) { - char *action_str = str + len; + switch (hid) { + case HANDLER_ONMATCH: + data = onmatch_parse(tr, action_str); + break; + case HANDLER_ONMAX: + case HANDLER_ONCHANGE: + data = track_data_parse(hist_data, action_str, hid); + break; + default: + data = ERR_PTR(-EINVAL); + break; + } - data = 
track_data_parse(hist_data, action_str, - HANDLER_ONCHANGE); - if (IS_ERR(data)) { - ret = PTR_ERR(data); - break; - } - } else { - ret = -EINVAL; + if (IS_ERR(data)) { + ret = PTR_ERR(data); break; } -- 2.42.0
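The same str_has_prefix() idiom drives the rewritten parse_actions(): the chained assignments map an action prefix onto a handler id, and the switch then lets ONMAX and ONCHANGE share the track_data_parse() call. A userspace sketch of just the prefix-to-handler mapping (str_has_prefix() re-implemented for illustration; the enum values and the helper names are assumptions for the sketch, not the kernel's exact definitions):

```c
#include <assert.h>
#include <string.h>

/* Userspace stand-in for the kernel's str_has_prefix(). */
static size_t str_has_prefix(const char *str, const char *prefix)
{
	size_t len = strlen(prefix);
	return strncmp(str, prefix, len) == 0 ? len : 0;
}

/* HANDLER_NONE = 0 plays the role of the patch's 'hid = 0' sentinel. */
enum handler_id {
	HANDLER_NONE = 0,
	HANDLER_ONMATCH,
	HANDLER_ONMAX,
	HANDLER_ONCHANGE,
};

/* Mirrors the dispatch in the patch: on a match, *lenp holds the prefix
 * length, so str + *lenp is the action body — what the patch computes
 * as 'action_str = str + len'. */
static enum handler_id parse_action_prefix(const char *str, size_t *lenp)
{
	size_t len = 0;
	enum handler_id hid = HANDLER_NONE;

	if ((len = str_has_prefix(str, "onmatch(")))
		hid = HANDLER_ONMATCH;
	else if ((len = str_has_prefix(str, "onmax(")))
		hid = HANDLER_ONMAX;
	else if ((len = str_has_prefix(str, "onchange(")))
		hid = HANDLER_ONCHANGE;

	*lenp = len;
	return hid;
}

/* Small wrappers so the dispatch is easy to exercise. */
static enum handler_id demo_hid(const char *s)
{
	size_t len;
	return parse_action_prefix(s, &len);
}

static size_t demo_len(const char *s)
{
	size_t len;
	parse_action_prefix(s, &len);
	return len;
}
```

This is the same simplification the commit message describes: one prefix scan sets both the handler id and the offset of the action body, instead of duplicating that logic per action.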
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, 7 Jan 2024 13:29:12 -0500 Steven Rostedt wrote:

> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.
> >
> > I don't think so.

Just to make it clear: eventfs has nothing to do with mkdir instance/foo. It exists without that. Although one rationale to do eventfs was so that the instance directories wouldn't recreate the same 10 thousand event inodes and dentries for every mkdir done.

-- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, 7 Jan 2024 13:42:39 +0100 Christian Brauner wrote: > > So, I tried to do an exploratory patch even though I promised myself not > to do it. But hey... > > Some notes: > > * Permission handling for idmapped mounts is done completely in the > VFS. That's the case for all filesytems that don't have a custom > ->permission() handler. So there's nothing to do for us here. > > * Idmapped mount handling for ->getattr() is done completely by the VFS > if the filesystem doesn't have a custom ->getattr() handler. So we're > done here. > > * Tracefs doesn't support attribute changes via ->setattr() (chown(), > chmod etc.). So there's nothing to here. > > * Eventfs does support attribute changes via ->setattr(). But it relies > on simple_setattr() which is already idmapped mount aware. So there's > nothing for us to do. > > * Ownership is inherited from the parent inode (tracefs) or optionally > from stashed ownership information (eventfs). That means the idmapping > is irrelevant here. It's similar to the "inherit gid from parent > directory" logic we have in some circumstances. TL;DR nothing to do > here as well. The reason ownership is inherited from the parent is because the inodes are created at boot up before user space starts. eventfs inodes are created on demand after user space so it needs to either check the "default" ownership and permissions, or if the user changed a specific file/directory, it must save it and use that again if the inode/dentry are reclaimed and then referenced again and recreated. > > * Tracefs supports the creation of instances from userspace via mkdir. > For example, > > mkdir /sys/kernel/tracing/instances/foo > > And here the idmapping is relevant so we need to make the helpers > aware of the idmapping. > > I just went and plumbed this through to most helpers. > > There's some subtlety in eventfs. Afaict, the directories and files for > the individual events are created on-demand during lookup or readdir. 
> > The ownership of these events is again inherited from the parent inode > or recovered from stored state. In both cases the actual idmapping is > irrelevant. > > The callchain here is: > > eventfs_root_lookup("xfs", "events") > -> create_{dir,file}_dentry("xfs", "events") >-> create_{dir,file}("xfs", "events") > -> eventfs_start_creating("xfs", "events") > -> lookup_one_len("xfs", "events") > > And the subtlety is that lookup_one_len() does permission checking on > the parent inode (IOW, if you want a dentry for "blech" under "events" > it'll do a permission check on events->d_inode) for exec permissions > and then goes on to give you a new dentry. > > Usually this call would have to be changed to lookup_one() and the > idmapping be handed down to it. But I think that's irrelevant here. > > Lookup generally doesn't need to be aware of idmappings at all. The > permission checking is done purely in the vfs via may_lookup() and the > idmapping is irrelevant because we always initialize inodes with the > filesystem level ownership (see the idmappings.rst) documentation if > you're interested in excessive details (otherwise you get inode aliases > which you really don't want). > > For tracefs it would not matter for lookup per se but only because > tracefs seemingly creates inodes/dentries during lookup (and readdir()). tracefs creates the inodes/dentries at boot up, it's eventfs that does it on demand during lookup. For inodes/dentries: /sys/kernel/tracing/* is all created at boot up, except for "events". /sys/kernel/tracing/events/* is created on demand. > > But imho the permission checking done in current eventfs_root_lookup() > via lookup_one_len() is meaningless in any way; possibly even > (conceptually) wrong. > > Because, the actual permission checking for the creation of the eventfs > entries isn't really done during lookup or readdir, it's done when mkdir > is called: > > mkdir /sys/kernel/tracing/instances/foo No. 
that creates an entire new tracefs instance, which happens to include another eventfs directory. The eventfs directory is in "/sys/kernel/tracing/events" and "/sys/kernel/tracing/instances/*/events". eventfs has 10s of thousands of files and directories, which is why I changed it to be on-demand inode/dentry creation and also reclaiming when no longer accessed. # ls /sys/kernel/tracing/events/ will create the inodes and dentries, and a memory stress program will free those created inodes and dentries. > > Here, all possible entries beneath foo including "events" and further > below are recorded and stored. So once mkdir returns it basically means > that it succeeded with the creation of all the necessary directories and > files. For all purposes the foo/events/ directory and below have all the > entries that matter. They have been created. It's comparable to them not > being in the {d,i}cache, I guess. No. Only the meta data is created for the eventfs
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 5 Jan 2024 15:26:28 +0100 Christian Brauner wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > Instead of walking the dentries on mount/remount to update the gid values of > > all the dentries if a gid option is specified on mount, just update the root > > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > > operations to update the permissions of the files and directories. > > > > For all files and directories in the top level instance: > > > > /sys/kernel/tracing/* > > > > It will use the root inode as the default permissions. The inode that > > represents: /sys/kernel/tracing (or wherever it is mounted). > > > > When an instance is created: > > > > mkdir /sys/kernel/tracing/instance/foo > > > > The directory "foo" and all its files and directories underneath will use > > the default of what foo is when it was created. A remount of tracefs will > > not affect it. > > That kinda sounds like eventfs should actually be a separate filesystem. > But I don't know enough about the relationship between the two concepts. It may someday become one, as eventfs is used by perf where the rest of the tracefs system is not. > > > > > If a user were to modify the permissions of any file or directory in > > tracefs, it will also no longer be modified by a change in ownership of a > > remount. > > Very odd semantics and I would recommend to avoid that. It's just plain > weird imo. > > > > > The events directory, if it is in the top level instance, will use the > > tracefs root inode as the default ownership for itself and all the files and > > directories below it. > > > > For the events directory in an instance ("foo"), it will keep the ownership > > of what it was when it was created, and that will be used as the default > > ownership for the files and directories beneath it. 
> > > > Link: > > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ > > > > Signed-off-by: Steven Rostedt (Google) > > --- > > So tracefs supports remounting with different uid/gid mount options and > then actually wades through _all_ of the inodes and changes their > ownership internally? What's the use-case for this? Containers? No, in fact tracing doesn't work well with containers as tracing is global to the entire machine. It can work with privileged containers though. The reason for this is because tracefs was based off of debugfs, where the files and directories are created at boot up and mounted later. The reason to do this was to allow users to mount with gid=GID to allow a given group to have access to tracing. Without this update, tracefs would ignore it like debugfs and proc do today. I think it's time I explain the purpose of tracefs and how it came to be. The tracing system required a way to control tracing and read the traces. It could have just used a new system like perf (although /sys/kernel/debug/tracing predates perf), where it created a single ioctl()-like system call to do everything. As the ftrace tracing came from the PREEMPT_RT latency tracer and my own logdev tracer, which both have an embedded background, I chose an interface that could work with just an unmodified version of busybox. That is, I wanted it to work with just cat and echo. The main difference with tracefs compared to other file systems is that it is a control interface, where writes happen as much as reads. The data read is controlled. The closest thing I can think of is how cgroups work. As tracing is a privileged operation, but something that could be changed to allow a group to have access to, I wanted to make it easy for an admin to decide who gets to do what at boot up via the /etc/fstab file. 
> > Aside from optimizing this and the special semantics for this eventfs > stuff that you really should think twice of doing, here's one idea for > an extension that might alleviate some of the pain: > > If you need flexible dynamic ownership change to e.g., be able to > delegate (all, a directory, a single file of) tracefs to > unprivileged/containers/whatever then you might want to consider > supporting idmapped mounts for tracefs. Because then you can do stuff > like: I'm a novice here and have no idea on how id maps work ;-) > > user1@localhost:~/data/scripts$ sudo mount --bind -o > X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt > user1@localhost:~/data/scripts$ ls -ln /run/ > total 12 > drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials > drwx-- 2 0 0 40 Jan 5 11:57 cryptsetup > drwxr-xr-x 2 0 0 60 Jan
[PATCH 3/4] eventfs: Read ei->entries before ei->children in eventfs_iterate()
From: "Steven Rostedt (Google)" In order to apply a shortcut to skip over the current ctx->pos immediately, by using the ei->entries array, the reading of that array should be first. Moving the array reading before the linked list reading will make the shortcut change diff nicer to read. Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/ Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 46 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index c73fb1f7ddbc..a1934e0eea3b 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -752,8 +752,8 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) * Need to create the dentries and inodes to have a consistent * inode number. */ - list_for_each_entry_srcu(ei_child, &ei->children, list, - srcu_read_lock_held(&eventfs_srcu)) { + for (i = 0; i < ei->nr_entries; i++) { + void *cdata = ei->data; if (c > 0) { c--; @@ -762,23 +762,32 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) ctx->pos++; - if (ei_child->is_freed) - continue; + entry = &ei->entries[i]; + name = entry->name; - name = ei_child->name; + mutex_lock(&eventfs_mutex); + /* If ei->is_freed then just bail here, nothing more to do */ + if (ei->is_freed) { + mutex_unlock(&eventfs_mutex); + goto out_dec; + } + r = entry->callback(name, &mode, &cdata, &fops); + mutex_unlock(&eventfs_mutex); + if (r <= 0) + continue; - dentry = create_dir_dentry(ei, ei_child, ei_dentry); + dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); if (!dentry) goto out_dec; ino = dentry->d_inode->i_ino; dput(dentry); - if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) + if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) goto out_dec; } - for (i = 0; i < ei->nr_entries; i++) { - void *cdata = ei->data; + list_for_each_entry_srcu(ei_child, &ei->children, list, + srcu_read_lock_held(&eventfs_srcu)) { if (c > 0) { c--; @@ -787,27 +796,18 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) ctx->pos++; - entry = &ei->entries[i]; - name = entry->name; - - mutex_lock(&eventfs_mutex); - /* If ei->is_freed then just bail here, nothing more to do */ - if (ei->is_freed) { - mutex_unlock(&eventfs_mutex); - goto out_dec; - } - r = entry->callback(name, &mode, &cdata, &fops); - mutex_unlock(&eventfs_mutex); - if (r <= 0) + if (ei_child->is_freed) continue; - dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); + name = ei_child->name; + + dentry = create_dir_dentry(ei, ei_child, ei_dentry); if (!dentry) goto out_dec; ino = dentry->d_inode->i_ino; dput(dentry); - if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) + if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; } ret = 1; -- 2.42.0
[PATCH 0/4] eventfs: More updates to eventfs_iterate()
With the ongoing discussion of the eventfs iterator, a few more changes are required and some changes are just enhancements. - Stop immediately in the loop if the ei is found to be in the process of being freed. - Make the ctx->pos update consistent with the skipped previous read index. This fixes a bug with duplicate files being shown by 'ls'. - Swap reading ei->entries with ei->children to make the next change easier to read - Add a "shortcut" in the ei->entries array to skip over already read entries. Steven Rostedt (Google) (4): eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set eventfs: Do ctx->pos update for all iterations in eventfs_iterate() eventfs: Read ei->entries before ei->children in eventfs_iterate() eventfs: Shortcut eventfs_iterate() by skipping entries already read fs/tracefs/event_inode.c | 67 ++-- 1 file changed, 36 insertions(+), 31 deletions(-)
[PATCH 4/4] eventfs: Shortcut eventfs_iterate() by skipping entries already read
From: "Steven Rostedt (Google)" As the ei->entries array is fixed for the duration of the eventfs_inode, it can be used to skip over already read entries in eventfs_iterate(). That is, if ctx->pos is greater than zero, there's no reason in doing the loop across the ei->entries array for the entries less than ctx->pos. Instead, start the lookup of the entries at the current ctx->pos. Link: https://lore.kernel.org/all/CAHk-=wiKwDUDv3+jCsv-uacDcHDVTYsXtBR9=6sgm5mqx+d...@mail.gmail.com/ Suggested-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a1934e0eea3b..fdff53d5a1f8 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -746,21 +746,15 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) if (!ei || !ei_dentry) goto out; - ret = 0; - /* * Need to create the dentries and inodes to have a consistent * inode number. 
*/ - for (i = 0; i < ei->nr_entries; i++) { - void *cdata = ei->data; - - if (c > 0) { - c--; - continue; - } + ret = 0; - ctx->pos++; + /* Start at 'c' to jump over already read entries */ + for (i = c; i < ei->nr_entries; i++, ctx->pos++) { + void *cdata = ei->data; entry = &ei->entries[i]; name = entry->name; @@ -769,7 +763,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) /* If ei->is_freed then just bail here, nothing more to do */ if (ei->is_freed) { mutex_unlock(&eventfs_mutex); - goto out_dec; + goto out; } r = entry->callback(name, &mode, &cdata, &fops); mutex_unlock(&eventfs_mutex); @@ -778,14 +772,17 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) dentry = create_file_dentry(ei, i, ei_dentry, name, mode, cdata, fops); if (!dentry) - goto out_dec; + goto out; ino = dentry->d_inode->i_ino; dput(dentry); if (!dir_emit(ctx, name, strlen(name), ino, DT_REG)) - goto out_dec; + goto out; } + /* Subtract the skipped entries above */ + c -= min((unsigned int)c, (unsigned int)ei->nr_entries); + list_for_each_entry_srcu(ei_child, &ei->children, list, srcu_read_lock_held(&eventfs_srcu)) { -- 2.42.0
[PATCH 2/4] eventfs: Do ctx->pos update for all iterations in eventfs_iterate()
From: "Steven Rostedt (Google)" The ctx->pos was only updated when it added an entry, but the "skip to current pos" check (c--) happened for every loop regardless of whether the entry was added or not. This inconsistency caused readdir to be incorrect. It was due to: for (i = 0; i < ei->nr_entries; i++) { if (c > 0) { c--; continue; } mutex_lock(&eventfs_mutex); /* If ei->is_freed then just bail here, nothing more to do */ if (ei->is_freed) { mutex_unlock(&eventfs_mutex); goto out; } r = entry->callback(name, &mode, &cdata, &fops); mutex_unlock(&eventfs_mutex); [..] ctx->pos++; } But this can cause the iterator to return a file that was already read. That's because of the way the callback() works. Some events may not have all files, and the callback can return 0 to tell eventfs to skip the file for this directory. For instance, we have: # ls /sys/kernel/tracing/events/ftrace/function format hist hist_debug id inject and # ls /sys/kernel/tracing/events/sched/sched_switch/ enable filter format hist hist_debug id inject trigger Where the function directory is missing "enable", "filter" and "trigger". That's because the callback() for events has: static int event_callback(const char *name, umode_t *mode, void **data, const struct file_operations **fops) { struct trace_event_file *file = *data; struct trace_event_call *call = file->event_call; [..] /* * Only event directories that can be enabled should have * triggers or filters, with the exception of the "print" * event that can have a "trigger" file. */ if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) { if (call->class->reg && strcmp(name, "enable") == 0) { *mode = TRACE_MODE_WRITE; *fops = &ftrace_enable_fops; return 1; } if (strcmp(name, "filter") == 0) { *mode = TRACE_MODE_WRITE; *fops = &ftrace_event_filter_fops; return 1; } } if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE) || strcmp(trace_event_name(call), "print") == 0) { if (strcmp(name, "trigger") == 0) { *mode = TRACE_MODE_WRITE; *fops = &event_trigger_fops; return 1; } } [..] 
return 0; } Where the function event has the TRACE_EVENT_FL_IGNORE_ENABLE set. This means that the entries array elements for "enable", "filter" and "trigger" when called on the function event will have the callback return 0 and not 1, to tell eventfs to skip these files for it. Because the "skip to current ctx->pos" check happened for all entries, but the ctx->pos++ only happened to entries that exist, it would confuse the reading of a directory. Which would cause: # ls /sys/kernel/tracing/events/ftrace/function/ format hist hist hist_debug hist_debug id inject inject The missing "enable", "filter" and "trigger" caused ls to show "hist", "hist_debug" and "inject" twice. Update the ctx->pos for every iteration to keep its update and the "skip" update consistent. This also means that on error, the ctx->pos needs to be decremented if it was incremented without adding something. Link: https://lore.kernel.org/all/20240104150500.38b15...@gandalf.local.home/ Fixes: 493ec81a8fb8e ("eventfs: Stop using dcache_readdir() for getdents()") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 0aca6910efb3..c73fb1f7ddbc 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -760,6 +760,8 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) continue; } + ctx->pos++; + if (ei_child->is_freed) continue; @@ -767,13 +769,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) dentry = create_dir_dentry(ei, ei_child, ei_dentry); if (!dentry) - goto out; + goto out_dec; ino = dentry->d_inode->i_ino; dput(dentry);
[PATCH 1/4] eventfs: Have eventfs_iterate() stop immediately if ei->is_freed is set
From: "Steven Rostedt (Google)" If ei->is_freed is set in eventfs_iterate(), it means that the directory that is being iterated on is in the process of being freed. Just exit the loop immediately when that is ever detected, and separate out the return of the entry->callback() from ei->is_freed. Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 72912b5f9a90..0aca6910efb3 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -788,11 +788,12 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = entry->name; mutex_lock(&eventfs_mutex); - /* If ei->is_freed, then the event itself may be too */ - if (!ei->is_freed) - r = entry->callback(name, &mode, &cdata, &fops); - else - r = -1; + /* If ei->is_freed then just bail here, nothing more to do */ + if (ei->is_freed) { + mutex_unlock(&eventfs_mutex); + goto out; + } + r = entry->callback(name, &mode, &cdata, &fops); mutex_unlock(&eventfs_mutex); if (r <= 0) continue; -- 2.42.0
Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 21 Dec 2023 12:58:13 -0500 Steven Rostedt wrote: > On Thu, 21 Dec 2023 17:35:22 + > Vincent Donnefort wrote: > > > @@ -5999,6 +6078,307 @@ int ring_buffer_subbuf_order_set(struct > > trace_buffer *buffer, int order) > > } > > EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set); > > > > The kernel developers have agreed to allow loop variables to be declared in > loops. This will simplify these macros: > > > > > +#define subbuf_page(off, start) \ > > + virt_to_page((void *)(start + (off << PAGE_SHIFT))) > > + > > +#define foreach_subbuf_page(off, sub_order, start, page) \ > > + for (off = 0, page = subbuf_page(0, start); \ > > +off < (1 << sub_order);\ > > +off++, page = subbuf_page(off, start)) > > #define foreach_subbuf_page(sub_order, start, page) \ > for (int __off = 0, page = subbuf_page(0, (start)); \ >__off < (1 << (sub_order));\ >__off++, page = subbuf_page(__off, (start))) So it seems that you can't declare "int __off" with page there, but we could have: #define foreach_subbuf_page(sub_order, start, page) \ page = subbuf_page(0, (start)); \ for (int __off = 0; __off < (1 << (sub_order)); \ __off++, page = subbuf_page(__off, (start))) And that would work. -- Steve
Re: [PATCH v8] bus: mhi: host: Add tracing support
On Thu, 7 Dec 2023 10:00:47 +0530 Krishna chaitanya chundru wrote: > This change adds ftrace support for following functions which > helps in debugging the issues when there is Channel state & MHI > state change and also when we receive data and control events: > 1. mhi_intvec_mhi_states > 2. mhi_process_data_event_ring > 3. mhi_process_ctrl_ev_ring > 4. mhi_gen_tre > 5. mhi_update_channel_state > 6. mhi_tryset_pm_state > 7. mhi_pm_st_worker > > Wherever the trace events are added, debug messages are removed. > > Signed-off-by: Krishna chaitanya chundru > --- > Changes in v8: > - Pass the structure and dereference the variables in TP_fast_assign as > suggested by Steve > - Link to v7: > https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com So this looks good from a tracing POV. Reviewed-by: Steven Rostedt (Google) But I do have some more comments that could be done by new patches. > +TRACE_EVENT(mhi_intvec_states, > + > + TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state), > + > + TP_ARGS(mhi_cntrl, dev_ee, dev_state), > + > + TP_STRUCT__entry( > + __string(name, mhi_cntrl->mhi_dev->name) > + __field(int, local_ee) > + __field(int, state) > + __field(int, dev_ee) > + __field(int, dev_state) > + ), > + > + TP_fast_assign( > + __assign_str(name, mhi_cntrl->mhi_dev->name); > + __entry->local_ee = mhi_cntrl->ee; > + __entry->state = mhi_cntrl->dev_state; > + __entry->dev_ee = dev_ee; > + __entry->dev_state = dev_state; > + ), > + > + TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n", > + __get_str(name), > + TO_MHI_EXEC_STR(__entry->local_ee), > + mhi_state_str(__entry->state), > + TO_MHI_EXEC_STR(__entry->dev_ee), > + mhi_state_str(__entry->dev_state)) So the above may have issues with user space parsing. 
For one, that mhi_state_str() is: static inline const char *mhi_state_str(enum mhi_state state) { switch (state) { case MHI_STATE_RESET: return "RESET"; case MHI_STATE_READY: return "READY"; case MHI_STATE_M0: return "M0"; case MHI_STATE_M1: return "M1"; case MHI_STATE_M2: return "M2"; case MHI_STATE_M3: return "M3"; case MHI_STATE_M3_FAST: return "M3 FAST"; case MHI_STATE_BHI: return "BHI"; case MHI_STATE_SYS_ERR: return "SYS ERROR"; default: return "Unknown state"; } }; If this could be changed into: #define MHI_STATE_LIST \ EM(RESET, "RESET") \ EM(READY, "READY") \ EM(M0, "M0") \ EM(M1, "M1") \ EM(M2, "M2") \ EM(M3, "M3") \ EM(M3_FAST, "M3_FAST") \ EM(BHI, "BHI") \ EMe(SYS_ERR, "SYS ERROR") #undef EM #undef EMe #define EM(a, b) case MHI_STATE_##a: return b; #define EMe(a, b) case MHI_STATE_##a: return b; static inline const char *mhi_state_str(enum mhi_state state) { switch (state) { MHI_STATE_LIST default: return "Unknown state"; } } Then you could use that macro in the trace header: #undef EM #undef EMe #define EM(a, b) TRACE_DEFINE_ENUM(MHI_STATE_##a); #define EMe(a, b) TRACE_DEFINE_ENUM(MHI_STATE_##a); MHI_STATE_LIST And in the print fmts: #undef EM #undef EMe #define EM(a, b) { MHI_STATE_##a, b }, #define EMe(a, b) { MHI_STATE_##a, b } TP_printk("%s: local ee: %s state: %s device ee: %s state: %s\n", __get_str(name), TO_MHI_EXEC_STR(__entry->local_ee), __print_symbolic(__entry->state, MHI_STATE_LIST), TO_MHI_EXEC_STR(__entry->dev_ee), __print_symbolic(__entry->dev_state, MHI_STATE_LIST)) And that will be exported to user space in the /sys/kernel/tracing/events/*/*/format file, as: __print_symbolic(REC->state, { { MHI_STATE_RESET, "RESET"}, { MHI_STATE_READY, "READY"}, { MHI_STATE_M0, "M0"}, { MHI_STATE_M1, "M1"}, { MHI_STA
[PATCH v2] tracefs/eventfs: Use root and instance inodes as default ownership
From: "Steven Rostedt (Google)" Instead of walking the dentries on mount/remount to update the gid values of all the dentries if a gid option is specified on mount, just update the root inode. Add .getattr, .setattr, and .permissions on the tracefs inode operations to update the permissions of the files and directories. For all files and directories in the top level instance: /sys/kernel/tracing/* It will use the root inode as the default permissions. The inode that represents: /sys/kernel/tracing (or wherever it is mounted). When an instance is created: mkdir /sys/kernel/tracing/instance/foo The directory "foo" and all its files and directories underneath will use the default of what foo is when it was created. A remount of tracefs will not affect it. If a user were to modify the permissions of any file or directory in tracefs, it will also no longer be modified by a change in ownership of a remount. The events directory, if it is in the top level instance, will use the tracefs root inode as the default ownership for itself and all the files and directories below it. For the events directory in an instance ("foo"), it will keep the ownership of what it was when it was created, and that will be used as the default ownership for the files and directories beneath it. 
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240103203246.11573...@gandalf.local.home - Use dentry->d_sb->s_root directly rather than the inode parent (Al Viro) - Simplify while loop to find instance inode or root inode (Al Viro) fs/tracefs/event_inode.c | 79 +++- fs/tracefs/inode.c | 198 ++- fs/tracefs/internal.h| 3 + 3 files changed, 190 insertions(+), 90 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 41af56f44f0a..72912b5f9a90 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -45,6 +45,7 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), + EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -115,10 +116,17 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, * The events directory dentry is never freed, unless its * part of an instance that is deleted. It's attr is the * default for its child files and directories. -* Do not update it. It's not used for its own mode or ownership +* Do not update it. It's not used for its own mode or ownership. */ - if (!ei->is_events) + if (ei->is_events) { + /* But it still needs to know if it was modified */ + if (iattr->ia_valid & ATTR_UID) + ei->attr.mode |= EVENTFS_SAVE_UID; + if (iattr->ia_valid & ATTR_GID) + ei->attr.mode |= EVENTFS_SAVE_GID; + } else { update_attr(&ei->attr, iattr); + } } else { name = dentry->d_name.name; @@ -136,9 +144,66 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } +static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry) +{ + struct inode *inode; + + /* Only update if the "events" was on the top level */ + if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + /* Get the tracefs root inode. 
*/ + inode = d_inode(dentry->d_sb->s_root); + ei->attr.uid = inode->i_uid; + ei->attr.gid = inode->i_gid; +} + +static void set_top_events_ownership(struct inode *inode) +{ + struct tracefs_inode *ti = get_tracefs(inode); + struct eventfs_inode *ei = ti->private; + struct dentry *dentry; + + /* The top events directory doesn't get automatically updated */ + if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + dentry = ei->dentry; + + update_top_events_attr(ei, dentry); + + if (!(ei->attr.mode & EVENTFS_SAVE_UID)) + inode->i_uid = ei->attr.uid; + + if (!(ei->attr.mode & EVENTFS_SAVE_GID)) + inode->i_gid = ei->attr.gid; +} + +static int eventfs_get_attr(struct mnt_idmap *idmap, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags) +{ + struct dentry *dentry = path->dentry; + s
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 4 Jan 2024 01:48:37 + Al Viro wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > + /* Get the tracefs root from the parent */ > > + inode = d_inode(dentry->d_parent); > > + inode = d_inode(inode->i_sb->s_root); > > That makes no sense. First of all, for any positive dentry we have > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > dentries on given superblock. So what's the point of that dance? > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > and be done with that... That was more of thinking that the dentry and dentry->d_parent are different. As dentry is part of eventfs and dentry->d_parent is part of tracefs. Currently they both have the same superblock so yeah, I could just write it that way too and it would work. But in my head, I was thinking that they behave differently and maybe one day eventfs would get its own superblock which would not work. To explain this better: /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events But everything but "events" in /sys/kernel/tracing/* is part of tracefs. Everything in /sys/kernel/tracing/events is part of eventfs. That was my thought process. But as both tracefs and eventfs still use tracefs_get_inode(), it would work as you state. I'll update that, as I don't foresee that eventfs will become its own file system. Thanks, -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 4 Jan 2024 01:59:10 +0000 Al Viro wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > +static struct inode *instance_inode(struct dentry *parent, struct inode > > *inode) > > +{ > > + struct tracefs_inode *ti; > > + struct inode *root_inode; > > + > > + root_inode = d_inode(inode->i_sb->s_root); > > + > > + /* If parent is NULL then use root inode */ > > + if (!parent) > > + return root_inode; > > + > > + /* Find the inode that is flagged as an instance or the root inode */ > > + do { > > + inode = d_inode(parent); > > + if (inode == root_inode) > > + return root_inode; > > + > > + ti = get_tracefs(inode); > > + > > + if (ti->flags & TRACEFS_INSTANCE_INODE) > > + return inode; > > + } while ((parent = parent->d_parent)); > > *blink* > > This is equivalent to > ... > parent = parent->d_parent; > } while (true); Yeah, that loop went through a few iterations as I first thought that root was a tracefs_inode and the get_tracefs() would work on it. No, it was not, and it caused a crash. But I didn't rewrite the loop well after fixing that. I was also not sure if parent could be NULL, and wanted to protect against it. > > ->d_parent is *never* NULL. And what the hell is that loop supposed to do, > anyway? Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE? > > If root is not marked that way, I would suggest > if (!parent) > parent = inode->i_sb->s_root; > while (!IS_ROOT(parent)) { > struct tracefs_inode *ti = get_tracefs(parent->d_inode); > if (ti->flags & TRACEFS_INSTANCE_INODE) > break; > parent = parent->d_parent; > } > return parent->d_inode; Sure, I could rewrite it that way too. Thanks, -- Steve
[PATCH v2 2/2] eventfs: Stop using dcache_readdir() for getdents()
From: "Steven Rostedt (Google)" The eventfs creates dynamically allocated dentries and inodes. Using the dcache_readdir() logic for its own directory lookups requires hiding the cursor of the dcache logic and playing games to allow the dcache_readdir() to still have access to the cursor while the eventfs saved what it created and what it needs to release. Instead, just have eventfs have its own iterate_shared callback function that will fill in the dent entries. This simplifies the code quite a bit. Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 194 +-- 1 file changed, 64 insertions(+), 130 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index c360300fb866..41af56f44f0a 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -52,9 +52,7 @@ enum { static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); -static int dcache_dir_open_wrapper(struct inode *inode, struct file *file); -static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx); -static int eventfs_release(struct inode *inode, struct file *file); +static int eventfs_iterate(struct file *file, struct dir_context *ctx); static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) { @@ -148,11 +146,9 @@ static const struct inode_operations eventfs_file_inode_operations = { }; static const struct file_operations eventfs_file_operations = { - .open = dcache_dir_open_wrapper, .read = generic_read_dir, - .iterate_shared = dcache_readdir_wrapper, + .iterate_shared = eventfs_iterate, .llseek = generic_file_llseek, - .release= eventfs_release, }; /* Return the evenfs_inode of the "events" directory */ @@ -643,128 +639,87 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, return ret; } -struct dentry_list { - void*cursor; - struct dentry **dentries; -}; - -/** - * eventfs_release - called to release eventfs file/dir - * @inode: inode to be released - * @file: 
file to be released (not used) - */ -static int eventfs_release(struct inode *inode, struct file *file) -{ - struct tracefs_inode *ti; - struct dentry_list *dlist = file->private_data; - void *cursor; - int i; - - ti = get_tracefs(inode); - if (!(ti->flags & TRACEFS_EVENT_INODE)) - return -EINVAL; - - if (WARN_ON_ONCE(!dlist)) - return -EINVAL; - - for (i = 0; dlist->dentries && dlist->dentries[i]; i++) { - dput(dlist->dentries[i]); - } - - cursor = dlist->cursor; - kfree(dlist->dentries); - kfree(dlist); - file->private_data = cursor; - return dcache_dir_close(inode, file); -} - -static int add_dentries(struct dentry ***dentries, struct dentry *d, int cnt) -{ - struct dentry **tmp; - - tmp = krealloc(*dentries, sizeof(d) * (cnt + 2), GFP_NOFS); - if (!tmp) - return -1; - tmp[cnt] = d; - tmp[cnt + 1] = NULL; - *dentries = tmp; - return 0; -} - -/** - * dcache_dir_open_wrapper - eventfs open wrapper - * @inode: not used - * @file: dir to be opened (to create it's children) - * - * Used to dynamic create file/dir with-in @file, all the - * file/dir will be created. If already created then references - * will be increased +/* + * Walk the children of a eventfs_inode to fill in getdents(). 
*/ -static int dcache_dir_open_wrapper(struct inode *inode, struct file *file) +static int eventfs_iterate(struct file *file, struct dir_context *ctx) { const struct file_operations *fops; + struct inode *f_inode = file_inode(file); const struct eventfs_entry *entry; struct eventfs_inode *ei_child; struct tracefs_inode *ti; struct eventfs_inode *ei; - struct dentry_list *dlist; - struct dentry **dentries = NULL; - struct dentry *parent = file_dentry(file); - struct dentry *d; - struct inode *f_inode = file_inode(file); - const char *name = parent->d_name.name; + struct dentry *ei_dentry = NULL; + struct dentry *dentry; + const char *name; umode_t mode; - void *data; - int cnt = 0; int idx; - int ret; - int i; - int r; + int ret = -EINVAL; + int ino; + int i, r, c; + + if (!dir_emit_dots(file, ctx)) + return 0; ti = get_tracefs(f_inode); if (!(ti->flags & TRACEFS_EVENT_INODE)) return -EINVAL; - if (WARN_ON_ONCE(file->private_data)) - return -EINVAL; + c = ctx->pos - 2; idx = srcu_read_lock(
[PATCH v2 1/2] eventfs: Remove "lookup" parameter from create_dir/file_dentry()
From: "Steven Rostedt (Google)" The "lookup" parameter is a way to differentiate the call to create_file/dir_dentry() from when it's just a lookup (no need to up the dentry refcount) and accessed via a readdir (need to up the refcount). But in reality, it just makes the code more complex. Just up the refcount and let the caller decide to dput() the result or not. Link: https://lore.kernel.org/linux-trace-kernel/20240103102553.17a19...@gandalf.local.home Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 55 +++- 1 file changed, 20 insertions(+), 35 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index f0677ea0ec24..c360300fb866 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -390,16 +390,14 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) * @mode: The mode of the file. * @data: The data to use to set the inode of the file with on open() * @fops: The fops of the file to be created. - * @lookup: If called by the lookup routine, in which case, dput() the created dentry. * * Create a dentry for a file of an eventfs_inode @ei and place it into the - * address located at @e_dentry. If the @e_dentry already has a dentry, then - * just do a dget() on it and return. Otherwise create the dentry and attach it. + * address located at @e_dentry. 
*/ static struct dentry * create_file_dentry(struct eventfs_inode *ei, int idx, struct dentry *parent, const char *name, umode_t mode, void *data, - const struct file_operations *fops, bool lookup) + const struct file_operations *fops) { struct eventfs_attr *attr = NULL; struct dentry **e_dentry = &ei->d_children[idx]; @@ -414,9 +412,7 @@ create_file_dentry(struct eventfs_inode *ei, int idx, } /* If the e_dentry already has a dentry, use it */ if (*e_dentry) { - /* lookup does not need to up the ref count */ - if (!lookup) - dget(*e_dentry); + dget(*e_dentry); mutex_unlock(&eventfs_mutex); return *e_dentry; } @@ -441,13 +437,12 @@ create_file_dentry(struct eventfs_inode *ei, int idx, * way to being freed, don't return it. If e_dentry is NULL * it means it was already freed. */ - if (ei->is_freed) + if (ei->is_freed) { dentry = NULL; - else + } else { dentry = *e_dentry; - /* The lookup does not need to up the dentry refcount */ - if (dentry && !lookup) dget(dentry); + } mutex_unlock(&eventfs_mutex); return dentry; } @@ -465,9 +460,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, } mutex_unlock(&eventfs_mutex); - if (lookup) - dput(dentry); - return dentry; } @@ -500,13 +492,12 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) * @pei: The eventfs_inode parent of ei. * @ei: The eventfs_inode to create the directory for * @parent: The dentry of the parent of this directory - * @lookup: True if this is called by the lookup code * * This creates and attaches a directory dentry to the eventfs_inode @ei. 
*/ static struct dentry * create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, - struct dentry *parent, bool lookup) + struct dentry *parent) { struct dentry *dentry = NULL; @@ -518,11 +509,9 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, return NULL; } if (ei->dentry) { - /* If the dentry already has a dentry, use it */ + /* If the eventfs_inode already has a dentry, use it */ dentry = ei->dentry; - /* lookup does not need to up the ref count */ - if (!lookup) - dget(dentry); + dget(dentry); mutex_unlock(&eventfs_mutex); return dentry; } @@ -542,7 +531,7 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, * way to being freed. */ dentry = ei->dentry; - if (dentry && !lookup) + if (dentry) dget(dentry); mutex_unlock(&eventfs_mutex); return dentry; @@ -562,9 +551,6 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, } mutex_unlock(&eventfs_mutex); - if (lookup) - dput(dentry); - return dentry; } @@ -589,8 +575,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, stru
[PATCH v2 0/2] eventfs: Don't use dcache_readdir() for getdents()
After having a "pleasant" conversation with Linus over on the security mailing list, we came to the conclusion that eventfs should not be using the dcache_readdir() routine for iterating the entries of a directory (getdents()). Instead, the .open and .release callbacks of the directory file operations were removed and all the work is now done in the .iterate_shared. The function dcache_readdir_wrapper() was renamed to eventfs_iterate(). As the files and directories of eventfs are all known within the meta data, it can easily supply getdents() with the information it needs without traversing the dentry. Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240103102553.17a19...@gandalf.local.home/ - Broke up into two patches, one to fix the lookup parameter and the other to do the meat of the change. - Moved the ctx->pos count up to skip creating of dentries in those cases. Steven Rostedt (Google) (2): eventfs: Remove "lookup" parameter from create_dir/file_dentry() eventfs: Stop using dcache_readdir() for getdents() fs/tracefs/event_inode.c | 241 --- 1 file changed, 80 insertions(+), 161 deletions(-)
[PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
From: "Steven Rostedt (Google)" Instead of walking the dentries on mount/remount to update the gid values of all the dentries if a gid option is specified on mount, just update the root inode. Add .getattr, .setattr, and .permissions on the tracefs inode operations to update the permissions of the files and directories. For all files and directories in the top level instance: /sys/kernel/tracing/* It will use the root inode as the default permissions. The inode that represents: /sys/kernel/tracing (or wherever it is mounted). When an instance is created: mkdir /sys/kernel/tracing/instance/foo The directory "foo" and all its files and directories underneath will use the default of what foo is when it was created. A remount of tracefs will not affect it. If a user were to modify the permissions of any file or directory in tracefs, it will also no longer be modified by a change in ownership of a remount. The events directory, if it is in the top level instance, will use the tracefs root inode as the default ownership for itself and all the files and directories below it. For the events directory in an instance ("foo"), it will keep the ownership of what it was when it was created, and that will be used as the default ownership for the files and directories beneath it. 
Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 80 ++- fs/tracefs/inode.c | 205 ++- fs/tracefs/internal.h| 3 + 3 files changed, 198 insertions(+), 90 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 53d34a4b5a2b..641bffa0f139 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -45,6 +45,7 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), + EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -115,10 +116,17 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, * The events directory dentry is never freed, unless its * part of an instance that is deleted. It's attr is the * default for its child files and directories. -* Do not update it. It's not used for its own mode or ownership +* Do not update it. It's not used for its own mode or ownership. 
*/ - if (!ei->is_events) + if (ei->is_events) { + /* But it still needs to know if it was modified */ + if (iattr->ia_valid & ATTR_UID) + ei->attr.mode |= EVENTFS_SAVE_UID; + if (iattr->ia_valid & ATTR_GID) + ei->attr.mode |= EVENTFS_SAVE_GID; + } else { update_attr(&ei->attr, iattr); + } } else { name = dentry->d_name.name; @@ -136,9 +144,67 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } +static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry) +{ + struct inode *inode; + + /* Only update if the "events" was on the top level */ + if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + /* Get the tracefs root from the parent */ + inode = d_inode(dentry->d_parent); + inode = d_inode(inode->i_sb->s_root); + ei->attr.uid = inode->i_uid; + ei->attr.gid = inode->i_gid; +} + +static void set_top_events_ownership(struct inode *inode) +{ + struct tracefs_inode *ti = get_tracefs(inode); + struct eventfs_inode *ei = ti->private; + struct dentry *dentry; + + /* The top events directory doesn't get automatically updated */ + if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + return; + + dentry = ei->dentry; + + update_top_events_attr(ei, dentry); + + if (!(ei->attr.mode & EVENTFS_SAVE_UID)) + inode->i_uid = ei->attr.uid; + + if (!(ei->attr.mode & EVENTFS_SAVE_GID)) + inode->i_gid = ei->attr.gid; + } + +static int eventfs_get_attr(struct mnt_idmap *idmap, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags) +{ + struct dentry *dentry = path->dentry; + struct inode *inode = d_backing_inode(dentry); + + set_top_events_ownership(inode); + + generic_fillattr(idmap, request_mask, inode, stat); + return 0; +} + +static int eventfs_permission
Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()
On Wed, 3 Jan 2024 13:54:36 -0800 Linus Torvalds wrote: > On Wed, 3 Jan 2024 at 11:57, Linus Torvalds > wrote: > > > > Or, you know, you could do what I've told you to do at least TEN TIMES > > already, which is to not mess with any of this, and just implement the > > '->permission()' callback (and getattr() to just make 'ls' look sane > > too, rather than silently saying "we'll act as if gid is set right, > > but not show it"). > > Actually, an even simpler option might be to just do this all at > d_revalidate() time. > > Here's an updated patch that builds, and is PURELY AN EXAMPLE. I think > it "works", but it currently always resets the inode mode/uid/gid > unconditionally, which is wrong - it should not do so if the inode has > been manually set. > > So take this as a "this might work", but it probably needs a bit more > work - eventfs_set_attr() should set some bit in the inode to say > "these have been set manually", and then revalidate would say "I'll > not touch inodes that have that bit set". > > Or something. > > Anyway, this patch is now relative to your latest pull request, so it > has the check for dentry->d_inode in set_gid() (and still removes the > whole function). > > Again: UNTESTED, and meant as a "this is another way to avoid messing > with the dentry tree manually, and just using the VFS interfaces we > already have" > I actually have something almost working too. Here's the WIP. It only works for tracefs, and now eventfs needs to be updated as the "events" directory no longer has the right ownership. So I need a way to link the eventfs entries to use the tracefs default conditionally. The issue I'm currently working on is to make the files of: /sys/kernel/tracing/events default to the root of tracefs, but /sys/kernel/tracing/instances/foo/events to default to what events was when it was created by "mkdir instances/foo". But other than that, the tracefs part seems to work. 
-- Steve diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index ae648deed019..9c55dc903d7d 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -141,10 +141,76 @@ static int tracefs_syscall_rmdir(struct inode *inode, struct dentry *dentry) return ret; } -static const struct inode_operations tracefs_dir_inode_operations = { +static void set_tracefs_inode_owner(struct inode *inode) +{ + struct inode *root_inode = d_inode(inode->i_sb->s_root); + struct tracefs_inode *ti = get_tracefs(inode); + + /* +* If this inode has never been referenced, then update +* the permissions to the superblock. +*/ + if (!(ti->flags & TRACEFS_EVENT_UID_PERM_SET)) + inode->i_uid = root_inode->i_uid; + + if (!(ti->flags & TRACEFS_EVENT_GID_PERM_SET)) + inode->i_gid = root_inode->i_gid; +} + +static int tracefs_permission(struct mnt_idmap *idmap, + struct inode *inode, int mask) +{ + set_tracefs_inode_owner(inode); + return generic_permission(idmap, inode, mask); +} + +static int tracefs_getattr(struct mnt_idmap *idmap, + const struct path *path, struct kstat *stat, + u32 request_mask, unsigned int flags) +{ + struct inode *inode = d_backing_inode(path->dentry); + + set_tracefs_inode_owner(inode); + generic_fillattr(idmap, request_mask, inode, stat); + return 0; +} + +static int tracefs_setattr(struct mnt_idmap *idmap, struct dentry *dentry, + struct iattr *attr) +{ + unsigned int ia_valid = attr->ia_valid; + struct inode *inode = d_inode(dentry); + struct tracefs_inode *ti = get_tracefs(inode); + + if (ia_valid & ATTR_UID) + ti->flags |= TRACEFS_EVENT_UID_PERM_SET; + + if (ia_valid & ATTR_GID) + ti->flags |= TRACEFS_EVENT_GID_PERM_SET; + + return simple_setattr(idmap, dentry, attr); +} + +static const struct inode_operations tracefs_instance_dir_inode_operations = { .lookup = simple_lookup, .mkdir = tracefs_syscall_mkdir, .rmdir = tracefs_syscall_rmdir, + .permission = tracefs_permission, + .getattr= tracefs_getattr, + .setattr= tracefs_setattr, +}; + +static const struct 
inode_operations tracefs_dir_inode_operations = { + .lookup = simple_lookup, + .permission = tracefs_permission, + .getattr= tracefs_getattr, + .setattr= tracefs_setattr, +}; + +static const struct inode_operations tracefs_file_inode_operations = { + .permission = tracefs_permission, + .getattr= tracefs_getattr, + .setattr= tracefs_setattr, }; struct inode *tracefs_get_inode(struct super_block *sb) @@ -183,77 +249,6 @@ struct tracefs_fs_info { struct tracefs_mount_opts mount_opts; }; -static void change_gid(struct dentry *dentry, kgid_t gid) -{ - if
Re: [syzbot] [fs?] [trace?] BUG: unable to handle kernel paging request in tracefs_apply_options
On Wed, 03 Jan 2024 13:41:31 -0800 syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:453f5db0619e Merge tag 'trace-v6.7-rc7' of git://git.kerne.. > git tree: upstream > console+strace: https://syzkaller.appspot.com/x/log.txt?x=10ec3829e8 > kernel config: https://syzkaller.appspot.com/x/.config?x=f8e72bae38c079e4 > dashboard link: https://syzkaller.appspot.com/bug?extid=f8a023e0c6beabe2371a > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) > 2.40 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1414af31e8 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15e52409e8 > > Downloadable assets: > disk image: > https://storage.googleapis.com/syzbot-assets/38b92a7149e8/disk-453f5db0.raw.xz > vmlinux: > https://storage.googleapis.com/syzbot-assets/4f872267133f/vmlinux-453f5db0.xz > kernel image: > https://storage.googleapis.com/syzbot-assets/587572061791/bzImage-453f5db0.xz > > The issue was bisected to: > > commit 7e8358edf503e87236c8d07f69ef0ed846dd5112 > Author: Steven Rostedt (Google) > Date: Fri Dec 22 00:07:57 2023 + > > eventfs: Fix file and directory uid and gid ownership > > bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=108cd519e8 > final oops: https://syzkaller.appspot.com/x/report.txt?x=128cd519e8 > console output: https://syzkaller.appspot.com/x/log.txt?x=148cd519e8 > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+f8a023e0c6beabe23...@syzkaller.appspotmail.com > Fixes: 7e8358edf503 ("eventfs: Fix file and directory uid and gid ownership") > > BUG: unable to handle page fault for address: fff0 > #PF: supervisor read access in kernel mode > #PF: error_code(0x) - not-present page > PGD d734067 P4D d734067 PUD d736067 PMD 0 > Oops: [#1] PREEMPT SMP KASAN > CPU: 0 PID: 5056 Comm: syz-executor170 Not tainted > 6.7.0-rc7-syzkaller-00049-g453f5db0619e #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > 
Google 11/17/2023 > RIP: 0010:set_gid fs/tracefs/inode.c:224 [inline] > RIP: 0010:tracefs_apply_options+0x4d0/0xa40 fs/tracefs/inode.c:337 > Code: 24 10 49 8b 1e 48 83 c3 f0 74 3d 48 89 d8 48 c1 e8 03 48 bd 00 00 00 00 > 00 fc ff df 80 3c 28 00 74 08 48 89 df e8 70 ff 88 fe <48> 8b 1b 48 89 de 48 > 83 e6 02 31 ff e8 bf fe 2c fe 48 83 e3 02 75 > RSP: 0018:c900040ffca8 EFLAGS: 00010246 > RAX: 1ffe RBX: fff0 RCX: 888014bf5940 > RDX: RSI: 0004 RDI: c900040ffc20 > RBP: dc00 R08: 0003 R09: f5200081ff84 > R10: dc00 R11: f5200081ff84 R12: 88801d743888 > R13: 88801b0c3710 R14: 88801d7437e8 R15: 88801d743810 > FS: 557dd480() GS:8880b980() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: fff0 CR3: 1ec48000 CR4: 003506f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > > tracefs_remount+0x78/0x80 fs/tracefs/inode.c:353 > reconfigure_super+0x440/0x870 fs/super.c:1143 > do_remount fs/namespace.c:2884 [inline] This is the same bug that was fixed by: https://lore.kernel.org/linux-trace-kernel/20240102151249.05da2...@gandalf.local.home/ And just waiting to be applied: https://lore.kernel.org/all/20240102210731.1f1c5...@gandalf.local.home/ Thanks, -- Steve > path_mount+0xc24/0xfa0 fs/namespace.c:3656 > do_mount fs/namespace.c:3677 [inline] > __do_sys_mount fs/namespace.c:3886 [inline] > __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3863 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0x45/0x110 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x63/0x6b > RIP: 0033:0x7fec326e8d99 > Code: 48 83 c4 28 c3 e8 67 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 > 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:7ffc8103ddf8 EFLAGS: 0246 ORIG_RAX: 00a5 > RAX: ffda RBX: 7ffc8103de00 RCX: 7fec326e8d99 > RDX: RSI: 20c0 RDI: > RBP: 7ffc8103de08 R08: 2140 R09: 7fec326b5b80 > R10: 02200022 R11: 0246 R12: > R13: 7ffc8103e068 R14: 0001 R15: 0001 > > Modules linked in: > 
CR2: fff0 > ---[ end trace ]--- > RIP: 0010:set_gid fs/tracefs/inode.c:224
Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()
On Wed, 3 Jan 2024 10:38:09 -0800 Linus Torvalds wrote: > @@ -332,10 +255,8 @@ static int tracefs_apply_options(struct super_block *sb, > bool remount) > if (!remount || opts->opts & BIT(Opt_uid)) > inode->i_uid = opts->uid; > > - if (!remount || opts->opts & BIT(Opt_gid)) { > - /* Set all the group ids to the mount option */ > - set_gid(sb->s_root, opts->gid); > - } > + if (!remount || opts->opts & BIT(Opt_gid)) > + inode->i_gid = opts->gid; > > return 0; > } This doesn't work because for tracefs (not eventfs) the dentries are created at boot up and before the file system is mounted. This means you can't even set a gid in /etc/fstab. This will cause a regression. tracefs was designed after debugfs, which also ignores gid. But because there's users out there that want non-root accounts to have access to tracing, it is documented to set the gid to a group that you can then add users to. And that's the reason behind the set_gid() walk. Reverting that one commit won't fix things either, because it only blocked OTH to be read, but the creation of the files changed their modes passed to block OTH as well, so all those would need to be changed too. And I don't think making the trace files open to OTH is a good solution, even if the tracefs top level directory itself blocks other. The issue was that the user used to just mount the top level to allow the group access to the files below, which allowed all users access. But this is weak control of the file system. Even my non-test machines have me in the tracing group so my user account has access to tracefs. On boot up, all the tracefs files are created via tracefs_create_file() and directories by tracefs_create_dir() which was copied from debugfs_create_file/dir(). At this moment, the dentry is created with the permissions set. There's no looking at the super block. So we need a way to change the permissions at mount time. 
The only solution I can think of that doesn't include walking the current dentries, is to convert all of tracefs to be more like eventfs, and have the dentries created on demand. But perhaps, different than eventfs, they do not need to be freed when they are no longer referenced, which should make it easier to implement. And there's not nearly as many files and directories, so keeping meta data around isn't as much of an issue. Instead of creating the inode and dentry in the tracefs_create_file/dir(), it could just create a descriptor that holds the fops, data and mode. Then on lookup, it would create the inodes and dentries similar to eventfs. It would need its own iterate_shared as well. -- Steve
Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()
On Wed, 3 Jan 2024 10:38:09 -0800 Linus Torvalds wrote: > On Wed, 3 Jan 2024 at 10:12, Linus Torvalds > wrote: > > > > Much better. Now eventfs looks more like a real filesystem, and less > > like an eldritch horror monster that is parts of dcache tackled onto a > > pseudo-filesystem. > > Oh, except I think you still need to just remove the 'set_gid()' mess. > > It's disgusting and it's wrong, and it's not even what the 'uid' > option does (it only sets the root inode uid). > > If you remount the filesystem with different gid values, you get to > keep both broken pieces. And if it isn't a remount, then setting the > root uid is sufficient. > > I think the whole thing was triggered by commit 49d67e445742, and > maybe the fix is to just revert that commit. > > That commit makes no sense in general, since the default mounting > position for tracefs that the kernel sets up is only accessible to > root anyway. > > Alternatively, just do the ->permissions() thing, and allow access to > the group in the mount options. > > Getting rid of set_gid() would be this attached lovely patch: > > fs/tracefs/inode.c | 83 > ++ > 1 file changed, 2 insertions(+), 81 deletions(-) > > and would get rid of the final (?) piece of disgusting dcache hackery > that tracefs most definitely should not have. > I'll look at that and play with it. I understand VFS much better now that I spent so much time with eventfs. That commit had to do with allowing OTH read access, which is a security issue as the trace files expose a lot of the kernel internals. I think these changes are a bit much for -rc8, don't you? Or do you want all this in before v6.7 is released. I'd be more comfortable with adding these changes in the upcoming merge window, where I can have more time playing with them. -- Steve
Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()
On Wed, 3 Jan 2024 10:12:08 -0800 Linus Torvalds wrote: > On Wed, 3 Jan 2024 at 07:24, Steven Rostedt wrote: > > > > Instead, just have eventfs have its own iterate_shared callback function > > that will fill in the dent entries. This simplifies the code quite a bit. > > Much better. Now eventfs looks more like a real filesystem, and less > like an eldritch horror monster that is parts of dcache tackled onto a > pseudo-filesystem. Thanks. > > However, one request, and one nit: > > > Also, remove the "lookup" parameter to the create_file/dir_dentry() and > > always have it return a dentry that has its ref count incremented, and > > have the caller call the dput. This simplifies that code as well. > > Can you please do that as a separate patch, where the first patch just > cleans up the directory iteration, and the second patch then goes "now > there are no more callers that have the 'lookup' argument set to > false". Yeah, I was thinking of doing it as two patches and figured I'd merge them into one because I deleted one of the users of it. As I was on the fence with doing two patches, I'm happy to change that. > > Because as-is, the patch is kind of two things mixed up. > > The small nit is this: > > > +static int eventfs_iterate(struct file *file, struct dir_context *ctx) > > { > > + /* > > +* Need to create the dentries and inodes to have a consistent > > +* inode number. 
> > +*/ > > list_for_each_entry_srcu(ei_child, &ei->children, list, > > srcu_read_lock_held(&eventfs_srcu)) { > > - d = create_dir_dentry(ei, ei_child, parent, false); > > - if (d) { > > - ret = add_dentries(&dentries, d, cnt); > > - if (ret < 0) > > - break; > > - cnt++; > > + > > + if (ei_child->is_freed) > > + continue; > > + > > + name = ei_child->name; > > + > > + dentry = create_dir_dentry(ei, ei_child, ei_dentry); > > + if (!dentry) > > + goto out; > > + ino = dentry->d_inode->i_ino; > > + dput(dentry); > > + > > + if (c > 0) { > > + c--; > > + continue; > > } Just move this "is the position before this name" up to the top of the > loop. Even above the "is_freed" part. > > Let's just always count all the entries in the child list. > > And same for the ei->nr_entries loop: > > > for (i = 0; i < ei->nr_entries; i++) { > > where there's no point in creating that dentry just to look up the > inode number, only to then decide "Oh, we already iterated past this > part, so let's not do anything with it". > > This wouldn't seem to matter much with a big enough getdents buffer > (which is the normal user level behavior), but it actually does, > because we don't keep track of "we have read to the end of the > directory". > > So every readdir ends up effectively doing getdents at least twice: > once to read the directory entries, and then once to just be told > "that was all". > > End result: you should strive very hard to *not* waste time on the > directory entries that have already been read, and are less than > 'ctx->pos'. My patch originally did that, but then I was worried about counting something that doesn't exist. If it is done twice, there's a good chance the dentry will still be around anyway, so it doesn't slow it down that much. The dput() only decrements the refcount and doesn't free the dentry. I added back my "show_events_dentries" file to test this. They sit with refcount equal to zero waiting to be reclaimed. But if they get referenced again, the refcount goes up again. 
That is, the first time it is called, where ctx->pos is likely zero, it creates the dentry, but that is also added to the list. The next time, with ctx->pos greater than zero, the create_dir_dentry() starts with: static struct dentry * create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, struct dentry *parent) { struct dentry *dentry = NULL; WARN_ON_ONCE(!inode_is_locked(parent->d_inode)); mutex_lock(&eventfs_mutex); if (pei->is_freed || ei->is_freed) { mutex_unlock(&eventfs_mutex); return NULL; } if (ei->dentry) { /* If the dentry already has a dentr
[PATCH] eventfs: Stop using dcache_readdir() for getdents()
From: "Steven Rostedt (Google)" The eventfs creates dynamically allocated dentries and inodes. Using the dcache_readdir() logic for its own directory lookups requires hiding the cursor of the dcache logic and playing games to allow the dcache_readdir() to still have access to the cursor while the eventfs saved what it created and what it needs to release. Instead, just have eventfs have its own iterate_shared callback function that will fill in the dent entries. This simplifies the code quite a bit. Also, remove the "lookup" parameter to the create_file/dir_dentry() and always have it return a dentry that has its ref count incremented, and have the caller call the dput. This simplifies that code as well. Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 237 +-- 1 file changed, 78 insertions(+), 159 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index f0677ea0ec24..53d34a4b5a2b 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -52,9 +52,7 @@ enum { static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); -static int dcache_dir_open_wrapper(struct inode *inode, struct file *file); -static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx); -static int eventfs_release(struct inode *inode, struct file *file); +static int eventfs_iterate(struct file *file, struct dir_context *ctx); static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) { @@ -148,11 +146,9 @@ static const struct inode_operations eventfs_file_inode_operations = { }; static const struct file_operations eventfs_file_operations = { - .open = dcache_dir_open_wrapper, .read = generic_read_dir, - .iterate_shared = dcache_readdir_wrapper, + .iterate_shared = eventfs_iterate, .llseek = generic_file_llseek, - .release= eventfs_release, }; /* Return the evenfs_inode of the "events" directory */ @@ -390,16 +386,14 @@ void eventfs_set_ei_status_free(struct 
tracefs_inode *ti, struct dentry *dentry) * @mode: The mode of the file. * @data: The data to use to set the inode of the file with on open() * @fops: The fops of the file to be created. - * @lookup: If called by the lookup routine, in which case, dput() the created dentry. * * Create a dentry for a file of an eventfs_inode @ei and place it into the - * address located at @e_dentry. If the @e_dentry already has a dentry, then - * just do a dget() on it and return. Otherwise create the dentry and attach it. + * address located at @e_dentry. */ static struct dentry * create_file_dentry(struct eventfs_inode *ei, int idx, struct dentry *parent, const char *name, umode_t mode, void *data, - const struct file_operations *fops, bool lookup) + const struct file_operations *fops) { struct eventfs_attr *attr = NULL; struct dentry **e_dentry = >d_children[idx]; @@ -414,9 +408,7 @@ create_file_dentry(struct eventfs_inode *ei, int idx, } /* If the e_dentry already has a dentry, use it */ if (*e_dentry) { - /* lookup does not need to up the ref count */ - if (!lookup) - dget(*e_dentry); + dget(*e_dentry); mutex_unlock(_mutex); return *e_dentry; } @@ -441,13 +433,12 @@ create_file_dentry(struct eventfs_inode *ei, int idx, * way to being freed, don't return it. If e_dentry is NULL * it means it was already freed. */ - if (ei->is_freed) + if (ei->is_freed) { dentry = NULL; - else + } else { dentry = *e_dentry; - /* The lookup does not need to up the dentry refcount */ - if (dentry && !lookup) dget(dentry); + } mutex_unlock(_mutex); return dentry; } @@ -465,9 +456,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, } mutex_unlock(_mutex); - if (lookup) - dput(dentry); - return dentry; } @@ -500,13 +488,12 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) * @pei: The eventfs_inode parent of ei. 
* @ei: The eventfs_inode to create the directory for * @parent: The dentry of the parent of this directory - * @lookup: True if this is called by the lookup code * * This creates and attaches a directory dentry to the eventfs_inode @ei. */ static struct dentry * create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, - struct dentry
Re: [PATCH 1/1] fs/proc: remove redundant comments from /proc/bootconfig
On Tue, 2 Jan 2024 18:19:37 +0800 Zhenhua Huang wrote: > commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to > /proc/bootconfig") adds bootloader argument comments into /proc/bootconfig. > > /proc/bootconfig shows boot_command_line[] multiple times following > every xbc key value pair, that's duplicated and not necessary. > Remove redundant ones. > > Output before and after the fix is like: > key1 = value1 > *bootloader argument comments* > key2 = value2 > *bootloader argument comments* > key3 = value3 > *bootloader argument comments* > ... > > key1 = value1 > key2 = value2 > key3 = value3 > *bootloader argument comments* > ... > > Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment to > /proc/bootconfig") > Signed-off-by: Zhenhua Huang Nice catch. Reviewed-by: Steven Rostedt (Google) -- Steve > --- > fs/proc/bootconfig.c | 12 ++-- > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c > index 902b326..e5635a6 100644 > --- a/fs/proc/bootconfig.c > +++ b/fs/proc/bootconfig.c > @@ -62,12 +62,12 @@ static int __init copy_xbc_key_value_list(char *dst, > size_t size) > break; > dst += ret; > } > - if (ret >= 0 && boot_command_line[0]) { > - ret = snprintf(dst, rest(dst, end), "# Parameters from > bootloader:\n# %s\n", > -boot_command_line); > - if (ret > 0) > - dst += ret; > - } > + } > + if (ret >= 0 && boot_command_line[0]) { > + ret = snprintf(dst, rest(dst, end), "# Parameters from > bootloader:\n# %s\n", > +boot_command_line); > + if (ret > 0) > + dst += ret; > } > out: > kfree(key);
[PATCH] eventfs: Fix bitwise fields for "is_events"
From: "Steven Rostedt (Google)" A flag was needed to denote which eventfs_inode was the "events" directory, so a bit was taken from the "nr_entries" field, as there's not that many entries, and 2^30 is plenty. But the bit number for nr_entries was not updated to reflect the bit taken from it, which would add an unnecessary integer to the structure. Cc: sta...@vger.kernel.org Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 899e447778ac..42bdeb471a07 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -63,7 +63,7 @@ struct eventfs_inode { }; unsigned intis_freed:1; unsigned intis_events:1; - unsigned intnr_entries:31; + unsigned intnr_entries:30; }; static inline struct tracefs_inode *get_tracefs(const struct inode *inode) -- 2.42.0
[PATCH v2] tracefs: Check for dentry->d_inode exists in set_gid()
From: "Steven Rostedt (Google)" If a getdents() is called on the tracefs directory but does not get all the files, it can leave a "cursor" dentry in the d_subdirs list of tracefs dentry. This cursor dentry does not have a d_inode for it. Before referencing tracefs_inode from the dentry, the d_inode must first be checked if it has content. If not, then it's not a tracefs_inode and can be ignored. The following caused a crash: #define getdents64(fd, dirp, count) syscall(SYS_getdents64, fd, dirp, count) #define BUF_SIZE 256 #define TDIR "/tmp/file0" int main(void) { char buf[BUF_SIZE]; int fd; int n; mkdir(TDIR, 0777); mount(NULL, TDIR, "tracefs", 0, NULL); fd = openat(AT_FDCWD, TDIR, O_RDONLY); n = getdents64(fd, buf, BUF_SIZE); ret = mount(NULL, TDIR, NULL, MS_NOSUID|MS_REMOUNT|MS_RELATIME|MS_LAZYTIME, "gid=1000"); return 0; } That's because the 256 BUF_SIZE was not big enough to read all the dentries of the tracefs file system and it left a "cursor" dentry in the subdirs of the tracefs root inode. Then on remounting with "gid=1000", it would cause an iteration of all dentries which hit: ti = get_tracefs(dentry->d_inode); if (ti && (ti->flags & TRACEFS_EVENT_INODE)) eventfs_update_gid(dentry, gid); Which crashed because of the dereference of the cursor dentry which had a NULL d_inode. In the subdir loop of the dentry lookup of set_gid(), if a child has a NULL d_inode, simply skip it. Link: https://lore.kernel.org/all/20240102135637.3a21f...@gandalf.local.home/ Cc: sta...@vger.kernel.org Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership") Reported-by: "Ubisectech Sirius" Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240102142311.56708...@gandalf.local.home - Simplify the logic to just continue the loop if the cursor dentry is hit. 
fs/tracefs/inode.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 62524b20964e..bc86ffdb103b 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -215,6 +215,10 @@ static void set_gid(struct dentry *parent, kgid_t gid) struct dentry *dentry = list_entry(tmp, struct dentry, d_child); next = tmp->next; + /* Note, getdents() can add a cursor dentry with no inode */ + if (!dentry->d_inode) + continue; + spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); change_gid(dentry, gid); -- 2.42.0
[PATCH] tracefs: Check for dentry->d_inode exists in set_gid()
From: "Steven Rostedt (Google)" If a getdents() is called on the tracefs directory but does not get all the files, it can leave a "cursor" dentry in the d_subdirs list of tracefs dentry. This cursor dentry does not have a d_inode for it. Before referencing tracefs_inode from the dentry, the d_inode must first be checked if it has content. If not, then it's not a tracefs_inode and can be ignored. The following caused a crash: #define getdents64(fd, dirp, count) syscall(SYS_getdents64, fd, dirp, count) #define BUF_SIZE 256 #define TDIR "/tmp/file0" int main(void) { char buf[BUF_SIZE]; int fd; int n; mkdir(TDIR, 0777); mount(NULL, TDIR, "tracefs", 0, NULL); fd = openat(AT_FDCWD, TDIR, O_RDONLY); n = getdents64(fd, buf, BUF_SIZE); ret = mount(NULL, TDIR, NULL, MS_NOSUID|MS_REMOUNT|MS_RELATIME|MS_LAZYTIME, "gid=1000"); return 0; } That's because the 256 BUF_SIZE was not big enough to read all the dentries of the tracefs file system and it left a "cursor" dentry in the subdirs of the tracefs root inode. Then on remounting with "gid=1000", it would cause an iteration of all dentries which hit: ti = get_tracefs(dentry->d_inode); if (ti && (ti->flags & TRACEFS_EVENT_INODE)) eventfs_update_gid(dentry, gid); Which crashed because of the dereference of the cursor dentry which had a NULL d_inode. 
Link: https://lore.kernel.org/all/20240102135637.3a21f...@gandalf.local.home/ Cc: sta...@vger.kernel.org Fixes: 7e8358edf503e ("eventfs: Fix file and directory uid and gid ownership") Reported-by: "Ubisectech Sirius" Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 62524b20964e..c29387a36bc8 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -219,10 +219,13 @@ static void set_gid(struct dentry *parent, kgid_t gid) change_gid(dentry, gid); - /* If this is the events directory, update that too */ - ti = get_tracefs(dentry->d_inode); - if (ti && (ti->flags & TRACEFS_EVENT_INODE)) - eventfs_update_gid(dentry, gid); + /* Note, getdents() can add a cursor dentry with no inode */ + if (dentry->d_inode) { + /* If this is the events directory, update that too */ + ti = get_tracefs(dentry->d_inode); + if (ti && (ti->flags & TRACEFS_EVENT_INODE)) + eventfs_update_gid(dentry, gid); + } if (!list_empty(>d_subdirs)) { spin_unlock(_parent->d_lock); -- 2.42.0
Re: BUG: unable to handle page fault for address: fffffffffffffff0 in tracefs_apply_options
On Tue, 02 Jan 2024 18:54:26 +0800 "Ubisectech Sirius" wrote: > Dear concerned. > Greetings! > We are Ubisectech Sirius Team, the vulnerability lab of China > ValiantSec.Recently, our team has discovered a issue in Linux kernel 6.7. > technical details: > 1. Vulnerability Description: BUG: unable to handle page fault for address: > fff0 in tracefs_apply_options This is not a vulnerability, as it requires root privilege to trigger. Taking off security list and adding LKML and linux-trace-kernel. > 2. stack dump: > [ 1938.699609][ T8046] BUG: unable to handle page fault for address: > fff0 > [ 1938.702696][ T8046] #PF: supervisor read access in kernel mode > [ 1938.704647][ T8046] #PF: error_code(0x) - not-present page > [ 1938.706699][ T8046] PGD 8f7a067 P4D 8f7a067 PUD 8f7c067 PMD 0 > [ 1938.708939][ T8046] Oops: [#1] PREEMPT SMP KASAN > [ 1938.710819][ T8046] CPU: 0 PID: 8046 Comm: poc Not tainted > 6.7.0-rc8-g610a9b8f49fb #4 > [ 1938.713729][ T8046] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.15.0-1 04/01/2014 > [ 1938.717282][ T8046] RIP: 0010:tracefs_apply_options.isra.0 > (fs/tracefs/inode.c:224 fs/tracefs/inode.c:337) > [ 1938.720778][ T8046] Code: 24 1c 41 89 47 08 4c 8b bb 60 ff ff ff 4c 89 ff > 48 83 ef 10 74 31 48 89 7c 24 08 e8 c7 61 f8 fe 48 8b 7c 24 08 e8 2d b5 30 ff > <49> 8b 47 f0 31 ff 83 e0 02 49 89 c7 48 89 c6 e8 e9 5d f8 fe 4d 85 > All code > > 0: 24 1c and $0x1c,%al > 2: 41 89 47 08 mov %eax,0x8(%r15) > 6: 4c 8b bb 60 ff ff ff mov -0xa0(%rbx),%r15 > d: 4c 89 ff mov %r15,%rdi > 10: 48 83 ef 10 sub $0x10,%rdi > 14: 74 31 je 0x47 > 16: 48 89 7c 24 08 mov %rdi,0x8(%rsp) > 1b: e8 c7 61 f8 fe call 0xfef861e7 > 20: 48 8b 7c 24 08 mov 0x8(%rsp),%rdi > 25: e8 2d b5 30 ff call 0xff30b557 > 2a:* 49 8b 47 f0 mov -0x10(%r15),%rax <-- trapping instruction > 2e: 31 ff xor %edi,%edi > 30: 83 e0 02 and $0x2,%eax > 33: 49 89 c7 mov %rax,%r15 > 36: 48 89 c6 mov %rax,%rsi > 39: e8 e9 5d f8 fe call 0xfef85e27 > 3e: 4d rex.WRB > 3f: 85 .byte 0x85 
> Code starting with the faulting instruction > === > 0: 49 8b 47 f0 mov -0x10(%r15),%rax > 4: 31 ff xor %edi,%edi > 6: 83 e0 02 and $0x2,%eax > 9: 49 89 c7 mov %rax,%r15 > c: 48 89 c6 mov %rax,%rsi > f: e8 e9 5d f8 fe call 0xfef85dfd > 14: 4d rex.WRB > 15: 85 .byte 0x85 > [ 1938.728026][ T8046] RSP: 0018:8880228dfca0 EFLAGS: 00010246 > [ 1938.730241][ T8046] RAX: RBX: 888020c4cf00 RCX: > 8261a573 > [ 1938.733108][ T8046] RDX: dc00 RSI: 8261a569 RDI: > fff0 > [ 1938.736001][ T8046] RBP: 88801140d6a0 R08: 81457a74 R09: > ed100451bf86 > [ 1938.738918][ T8046] R10: 0003 R11: 0002 R12: > 88801140d7b8 > [ 1938.741677][ T8046] R13: 888020c4ce88 R14: 88801140d730 R15: > > [ 1938.75][ T8046] FS: 7f35440c2640() GS:88802d20() > knlGS: > [ 1938.747704][ T8046] CS: 0010 DS: ES: CR0: 80050033 > [ 1938.750058][ T8046] CR2: fff0 CR3: 1885e000 CR4: > 00750ef0 > [ 1938.753025][ T8046] DR0: DR1: DR2: > > [ 1938.755888][ T8046] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 1938.758848][ T8046] PKRU: 5554 > [ 1938.760125][ T8046] Call Trace: > [ 1938.761327][ T8046] > [ 1938.762430][ T8046] ? show_regs (arch/x86/kernel/dumpstack.c:479) > [ 1938.764058][ T8046] ? __die (arch/x86/kernel/dumpstack.c:421 > arch/x86/kernel/dumpstack.c:434) > [ 1938.765504][ T8046] ? page_fault_oops (arch/x86/mm/fault.c:703) > [ 1938.767316][ T8046] ? search_extable (lib/extable.c:115) > [ 1938.770094][ T8046] ? dump_pagetable (arch/x86/mm/fault.c:635) > [ 1938.771836][ T8046] ? pgtable_bad (arch/x86/mm/fault.c:122) > [ 1938.773468][ T8046] ? preempt_count_add (kernel/sched/core.c:5841) > [ 1938.775301][ T8046] ? preempt_count_sub (kernel/sched/core.c:5865) > [ 1938.777117][ T8046] ? search_module_extables (kernel/module/main.c:3243 > (discriminator 3)) > [ 1938.779095][ T8046] ? fixup_exception (arch/x86/mm/extable.c:305) > [ 1938.780930][ T8046] ? kernelmode_fixup_or_oops (arch/x86/mm/fault.c:761) > [ 1938.783041][ T8046] ? __bad_area_nosemaphore (arch/x86/mm/fault.c:819) > [ 1938.784770][ T8046] ? 
do_kern_addr_fault (arch/x86/mm/fault.c:1227) > [ 1938.785242][ T8046] ? exc_page_fault (arch/x86/mm/fault.c:1503 > arch/x86/mm/fault.c:1561) > [ 1938.786979][ T8046] ? asm_exc_page_fault > (./arch/x86/include/asm/idtentry.h:570) > [ 1938.788858][ T8046] ? do_raw_spin_lock (kernel/locking/spinlock_debug.c:93 > kernel/locking/spinlock_debug.c:117) > [ 1938.789530][ T8046] ? tracefs_apply_options.isra.0 (fs/tracefs/inode.c:224 > fs/tracefs/inode.c:337) > [ 1938.790145][ T8046] ? tracefs_apply_options.isra.0 (fs/tracefs/inode.c:224
Re: [PATCH v3 01/34] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
On Thu, 14 Dec 2023 00:24:21 +0100 Ilya Leoshkevich wrote: > Architectures use assembly code to initialize ftrace_regs and call > ftrace_ops_list_func(). Therefore, from the KMSAN's point of view, > ftrace_regs is poisoned on ftrace_ops_list_func entry(). This causes > KMSAN warnings when running the ftrace testsuite. BTW, why is this only a problem for s390 and no other architectures? If it is only a s390 thing, then we should do this instead: in include/linux/ftrace.h: /* Add a comment here to why this is needed */ #ifndef ftrace_list_func_unpoison # define ftrace_list_func_unpoison(fregs) do { } while(0) #endif In arch/s390/include/asm/ftrace.h: /* Add a comment to why s390 is special */ # define ftrace_list_func_unpoison(fregs) kmsan_unpoison_memory(fregs, sizeof(*fregs)) > > Fix by trusting the architecture-specific assembly code and always > unpoisoning ftrace_regs in ftrace_ops_list_func. > > Acked-by: Steven Rostedt (Google) I'm taking my ack away for this change in favor of what I'm suggesting now. > Reviewed-by: Alexander Potapenko > Signed-off-by: Ilya Leoshkevich > --- > kernel/trace/ftrace.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 8de8bec5f366..dfb8b26966aa 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -7399,6 +7399,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long > parent_ip, > void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, > struct ftrace_ops *op, struct ftrace_regs *fregs) > { > + kmsan_unpoison_memory(fregs, sizeof(*fregs)); And here have: ftrace_list_func_unpoison(fregs); That way we only do it for archs that really need it, and do not affect archs that do not. I want to know why this only affects s390, because if we are just doing this because "it works", it could be just covering up a symptom of something else and not actually doing the "right thing". 
-- Steve > __ftrace_ops_list_func(ip, parent_ip, NULL, fregs); > } > #else
Re: Unable to trace nf_nat function using kprobe with latest kernel 6.1.66-1
On Tue, 2 Jan 2024 11:23:54 +0530 P K wrote: > Hi, > > I am unable to trace nf_nat functions using kprobe with the latest kernel. > Previously in kernel 6.1.55-1 it was working fine. > Can someone please check if it's broken with the latest commit or i > have to use it differently? > Note, attaching to kernel functions is never considered stable API and may break at any kernel release. Also, if the compiler decides to inline the function, and makes it no longer visible in /proc/kallsyms then that too will cause this to break. -- Steve > Mentioned below are output: > Kernel - 6.1.55-1 > / # bpftrace -e 'kprobe:nf_nat_ipv4_manip_pkt { printf("func called\n"); }' > Attaching 1 probe... > cannot attach kprobe, probe entry may not exist > ERROR: Error attaching probe: 'kprobe:nf_nat_ipv4_manip_pkt' > > > > Kernel 6.1.55-1 > / # bpftrace -e 'kprobe:nf_nat_ipv4_manip_pkt { printf("func called\n"); }' > Attaching 1 probe... > func called > func called
Re: [PATCH] ftrace: Fix modification of direct_function hash while in use
Masami and Jiri, This patch made it through all my tests. If I can get an Acked-by by Sunday, I'll include it in my push to Linus (I have a couple of other fixes to send him). -- Steve On Fri, 29 Dec 2023 11:51:34 -0500 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > Masami Hiramatsu reported a memory leak in register_ftrace_direct() where > if the number of new entries are added is large enough to cause two > allocations in the loop: > > for (i = 0; i < size; i++) { > hlist_for_each_entry(entry, >buckets[i], hlist) { > new = ftrace_add_rec_direct(entry->ip, addr, > _hash); > if (!new) > goto out_remove; > entry->direct = addr; > } > } > > Where ftrace_add_rec_direct() has: > > if (ftrace_hash_empty(direct_functions) || > direct_functions->count > 2 * (1 << direct_functions->size_bits)) > { > struct ftrace_hash *new_hash; > int size = ftrace_hash_empty(direct_functions) ? 0 : > direct_functions->count + 1; > > if (size < 32) > size = 32; > > new_hash = dup_hash(direct_functions, size); > if (!new_hash) > return NULL; > > *free_hash = direct_functions; > direct_functions = new_hash; > } > > The "*free_hash = direct_functions;" can happen twice, losing the previous > allocation of direct_functions. > > But this also exposed a more serious bug. > > The modification of direct_functions above is not safe. As > direct_functions can be referenced at any time to find what direct caller > it should call, the time between: > > new_hash = dup_hash(direct_functions, size); > and > direct_functions = new_hash; > > can have a race with another CPU (or even this one if it gets interrupted), > and the entries being moved to the new hash are not referenced. > > That's because the "dup_hash()" is really misnamed and is really a > "move_hash()". It moves the entries from the old hash to the new one. > > Now even if that was changed, this code is not proper as direct_functions > should not be updated until the end. 
That is the best way to handle > function reference changes, and is the way other parts of ftrace handles > this. > > The following is done: > > 1. Change add_hash_entry() to return the entry it created and inserted > into the hash, and not just return success or not. > > 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove > the former. > > 3. Allocate a "new_hash" at the start that is made for holding both the > new hash entries as well as the existing entries in direct_functions. > > 4. Copy (not move) the direct_function entries over to the new_hash. > > 5. Copy the entries of the added hash to the new_hash. > > 6. If everything succeeds, then use rcu_pointer_assign() to update the > direct_functions with the new_hash. > > This simplifies the code and fixes both the memory leak as well as the > race condition mentioned above. > > Link: > https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/ > > Cc: sta...@vger.kernel.org > Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()") > Signed-off-by: Steven Rostedt (Google) > --- > kernel/trace/ftrace.c | 100 -- > 1 file changed, 47 insertions(+), 53 deletions(-) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 8de8bec5f366..b01ae7d36021 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash, > hash->count++; > } > > -static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip) > +static struct ftrace_func_entry * > +add_hash_entry(struct ftrace_hash *hash, unsigned long ip) > { > struct ftrace_func_entry *entry; > > entry = kmalloc(sizeof(*entry), GFP_KERNEL); > if (!entry) > - return -ENOMEM; > + return NULL; > > entry->ip = ip; > __add_hash_entry(hash, entry); > > - return 0; > + return entry; > } > > static void > @@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct > ftrace_hash *hash) > struct ftra
Re: [RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping
On Fri, 29 Dec 2023 13:40:50 -0500 Steven Rostedt wrote: > I'm sending this to a wider audience, as I want to hear more > feedback on this before I accept it. > I forgot to mention that this can be applied on top of: git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git for-next -- Steve
[RFC][PATCH 2/2] tracing: Allow user-space mapping of the ring-buffer
From: Vincent Donnefort Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is impossible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Link: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-3-vdonnef...@google.com Signed-off-by: Vincent Donnefort Signed-off-by: Steven Rostedt (Google) --- include/uapi/linux/trace_mmap.h | 2 + kernel/trace/trace.c| 79 +++-- 2 files changed, 78 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index f950648b0ba9..8c49489c5867 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -26,4 +26,6 @@ struct trace_buffer_meta { __u32 meta_struct_len;/* Len of this struct */ }; +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _UAPI_TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 46dbe22121e9..cfeaf2cd204e 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static
long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = >iter; + int err; - if (cmd) - return -ENOIOCTLCMD; + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent); + if (err) + return err; + } + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + + /* +* An ioctl call with cmd 0 to the ring buffer file will wake up all +* waiters +*/ mutex_lock(_types_lock); iter->wait_index++; @@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) +{ + struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data; + struct trace_iterator *iter = >iter; + vm_fault_t ret = VM_FAULT_SIGBUS; + struct page *page; + + page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, +vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = >iter; + + ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file); +} + +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = >iter; + + WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file)); +} + +static const struct vm_operations_struct tracing_buffers_vmops = { + .open = tracing_buffers_mmap_open, + .close = tracing_buffers_mmap_close, + .fault = tracing_buffers_mmap_fault, +}; + 
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = filp->private_da
[RFC][PATCH 0/2] ring-buffer: Allow user space memory mapping
I'm sending this to a wider audience, as I want to hear more feedback on this before I accept it. Vincent has been working on allowing the ftrace ring buffer to be memory mapped into user space. This has been going on since last year, where we talked at the 2022 Tracing Summit in London. Vincent's last series can be found here: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-1-vdonnef...@google.com/ But I'm posting these as these are what I now have in my queue. I've tested these patches pretty thoroughly and they look good. I even have a libtracefs API patch ready to implement this. For testing, you can install: git://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git (version 1.8.1) git://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git (latest branch of libtracefs) Then apply: https://lore.kernel.org/all/20231228201100.78aae...@rorschach.local.home/ to libtracefs. And then build the samples: make samples Which will create: bin/cpu-map That you can use with: # trace-cmd start -e sched # bin/cpu-map 0 Which will output all the events in CPU 0 via the memory mapping if it is supported by the kernel or else it will print at the start: "Was not able to map, falling back to buffered read" If you want to see the source code for cpu-map, you have to look at the man page ;-) The "make samples" will extract the example code from the man pages and build them. Documentation/libtracefs-cpu-map.txt The library API is rather simple, it just has: tracefs_cpu_open_mapped() tracefs_cpu_is_mapped() tracefs_cpu_map() tracefs_cpu_unmap() Which will create a tracefs_cpu handle for reading. If it fails to map, it just does a normal read. Anyway, this email is not about the library interface, but more of the kernel interface. 
And that is this: First there's a "meta page" that can be mapped via mmap() on the trace_pipe_raw file (there's one trace_pipe_raw file per CPU): page_size = getpagesize(); meta = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); The meta will have a layout of: struct trace_buffer_meta { unsigned long entries; unsigned long overrun; unsigned long read; unsigned long subbufs_touched; unsigned long subbufs_lost; unsigned long subbufs_read; struct { unsigned long lost_events;/* Events lost at the time of the reader swap */ __u32 id; /* Reader subbuf ID from 0 to nr_subbufs - 1 */ __u32 read; /* Number of bytes read on the reader subbuf */ } reader; __u32 subbuf_size;/* Size of each subbuf including the header */ __u32 nr_subbufs; /* Number of subbufs in the ring-buffer */ __u32 meta_page_size; /* Size of the meta-page */ __u32 meta_struct_len;/* Len of this struct */ }; The meta_page_size can grow if we need to (and then we can extend this API if we need to). If the meta_page_size is greater than a page, it should remap it: if (meta->meta_page_size > page_size) { int new_size = meta->meta_page_size; munmap(meta, page_size); meta = mmap(NULL, new_size, PROT_READ, MAP_SHARED, fd, 0); } Now the sub buffers of the ring buffer are mapped right after the meta page. They can be added with: data_len = meta->subbuf_size * meta->nr_subbufs; data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta->meta_page_size); This maps all the ring buffer sub-buffers as well as the reader page. The way this works is that the reader page is free to read from user space and writer will only write to the other pages. To get the reader page: subbuf = data + meta->subbuf_size * meta->reader.id; Then you can load that into the libtraceevent kbuffer API: struct tep_handle *tep = tracefs_local_events(NULL); kbuf = tep_kbuffer(tep); kbuffer_load_subbuffer(kbuf, subbuf); And use the kbuf descriptor to iterate the events. 
When done with the reader page, the application needs to make an ioctl() call: ioctl(fd, TRACE_MMAP_IOCTL_GET_READER); This will swap the reader page with the head of the other pages, and the old reader page is now going to be in the writable portion of the ring buffer where the writer can write to it, but the page that was swapped out becomes the new reader page. If there is no data, then user space can check the meta->reader.id to see if it changed or not. If it did not change, then there's no new data. If the writer is still on that page, it acts the same as we do in the kernel. It can still update that page and the beginning of the sub-buffer has the index of where the writer currently is on that page. All the mapped pages are read-only. When the ring buffer is mapped, it
[RFC][PATCH 1/2] ring-buffer: Introducing ring-buffer mapping functions
From: Vincent Donnefort In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() ring_buffer_map_fault() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. Link: https://lore.kernel.org/linux-trace-kernel/20231221173523.3015715-2-vdonnef...@google.com Signed-off-by: Vincent Donnefort Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 7 + include/uapi/linux/trace_mmap.h | 29 +++ kernel/trace/ring_buffer.c | 382 +++- 3 files changed, 417 insertions(+), 1 deletion(-) create mode 100644 include/uapi/linux/trace_mmap.h diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index fa802db216f9..0841ba8bab14 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, + unsigned long pgoff); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git 
a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..f950648b0ba9 --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_TRACE_MMAP_H_ +#define _UAPI_TRACE_MMAP_H_ + +#include + +struct trace_buffer_meta { + unsigned long entries; + unsigned long overrun; + unsigned long read; + + unsigned long subbufs_touched; + unsigned long subbufs_lost; + unsigned long subbufs_read; + + struct { + unsigned long lost_events;/* Events lost at the time of the reader swap */ + __u32 id; /* Reader subbuf ID from 0 to nr_subbufs - 1 */ + __u32 read; /* Number of bytes read on the reader subbuf */ + } reader; + + __u32 subbuf_size;/* Size of each subbuf including the header */ + __u32 nr_subbufs; /* Number of subbufs in the ring-buffer */ + + __u32 meta_page_size; /* Size of the meta-page */ + __u32 meta_struct_len;/* Len of this struct */ +}; + +#endif /* _UAPI_TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 173d2595ce2d..2f3e0260db88 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -338,6 +338,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -388,6 +389,7 @@ struct rb_irq_work { boolwaiters_pending; boolfull_waiters_pending; boolwakeup_full; + boolis_cpu_buffer; }; /* @@ -484,6 +486,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + int mapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to addr */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -739,6 +747,22 @@ 
static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int full) return (dirty * 100) > (full * nr_pages);
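Given the layout the changelog describes (the mmap()ed region starts with the meta-page, followed by each sub-buffer in ID order), the offset a user-space reader would use into the mapping reduces to simple arithmetic. The sketch below is illustrative user-space C; the struct and helper names are invented here and are not part of the proposed UAPI:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the layout described above: the mapped region
 * starts with the meta-page, followed by each sub-buffer in ID order. */
struct rb_map_layout {
	uint32_t meta_page_size;  /* size of the meta-page mapping */
	uint32_t subbuf_size;     /* size of one sub-buffer, header included */
	uint32_t nr_subbufs;      /* number of sub-buffers in the ring */
};

/* Byte offset of sub-buffer @id inside the mapping, or -1 if out of range. */
static int64_t subbuf_offset(const struct rb_map_layout *l, uint32_t id)
{
	if (id >= l->nr_subbufs)
		return -1;
	return (int64_t)l->meta_page_size + (int64_t)id * l->subbuf_size;
}
```

A reader would fill the layout from the real meta-page fields (subbuf_size, nr_subbufs, meta_page_size) after mmap()ing the file.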
[PATCH] ftrace: Fix modification of direct_function hash while in use
From: "Steven Rostedt (Google)" Masami Hiramatsu reported a memory leak in register_ftrace_direct() where, if the number of new entries added is large enough to cause two allocations in the loop: for (i = 0; i < size; i++) { hlist_for_each_entry(entry, &hash->buckets[i], hlist) { new = ftrace_add_rec_direct(entry->ip, addr, &free_hash); if (!new) goto out_remove; entry->direct = addr; } } Where ftrace_add_rec_direct() has: if (ftrace_hash_empty(direct_functions) || direct_functions->count > 2 * (1 << direct_functions->size_bits)) { struct ftrace_hash *new_hash; int size = ftrace_hash_empty(direct_functions) ? 0 : direct_functions->count + 1; if (size < 32) size = 32; new_hash = dup_hash(direct_functions, size); if (!new_hash) return NULL; *free_hash = direct_functions; direct_functions = new_hash; } The "*free_hash = direct_functions;" can happen twice, losing the previous allocation of direct_functions. But this also exposed a more serious bug. The modification of direct_functions above is not safe. As direct_functions can be referenced at any time to find what direct caller it should call, the time between: new_hash = dup_hash(direct_functions, size); and direct_functions = new_hash; can have a race with another CPU (or even this one if it gets interrupted), and the entries being moved to the new hash are not referenced. That's because the "dup_hash()" is really misnamed and is really a "move_hash()". It moves the entries from the old hash to the new one. Now even if that was changed, this code is not proper as direct_functions should not be updated until the end. That is the best way to handle function reference changes, and is the way other parts of ftrace handle this. The following is done: 1. Change add_hash_entry() to return the entry it created and inserted into the hash, and not just return success or not. 2. Replace ftrace_add_rec_direct() with add_hash_entry(), and remove the former. 3.
Allocate a "new_hash" at the start that is made for holding both the new hash entries as well as the existing entries in direct_functions. 4. Copy (not move) the direct_function entries over to the new_hash. 5. Copy the entries of the added hash to the new_hash. 6. If everything succeeds, then use rcu_assign_pointer() to update the direct_functions with the new_hash. This simplifies the code and fixes both the memory leak as well as the race condition mentioned above. Link: https://lore.kernel.org/all/170368070504.42064.8960569647118388081.stgit@devnote2/ Cc: sta...@vger.kernel.org Fixes: 763e34e74bb7d ("ftrace: Add register_ftrace_direct()") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ftrace.c | 100 -- 1 file changed, 47 insertions(+), 53 deletions(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 8de8bec5f366..b01ae7d36021 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1183,18 +1183,19 @@ static void __add_hash_entry(struct ftrace_hash *hash, hash->count++; } -static int add_hash_entry(struct ftrace_hash *hash, unsigned long ip) +static struct ftrace_func_entry * +add_hash_entry(struct ftrace_hash *hash, unsigned long ip) { struct ftrace_func_entry *entry; entry = kmalloc(sizeof(*entry), GFP_KERNEL); if (!entry) - return -ENOMEM; + return NULL; entry->ip = ip; __add_hash_entry(hash, entry); - return 0; + return entry; } static void @@ -1349,7 +1350,6 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash) struct ftrace_func_entry *entry; struct ftrace_hash *new_hash; int size; - int ret; int i; new_hash = alloc_ftrace_hash(size_bits); @@ -1366,8 +1366,7 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash) size = 1 << hash->size_bits; for (i = 0; i < size; i++) { hlist_for_each_entry(entry, &hash->buckets[i], hlist) { - ret = add_hash_entry(new_hash, entry->ip); - if (ret < 0) + if (add_hash_entry(new_hash, entry->ip) == NULL) goto free_hash; } } @@ -2536,7 +2535,7 @@
ftrace_find_unique_ops(struct dyn_ftrace *rec) #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS /* Protected by rcu_tasks for reading, and direct_mutex for writing */
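The copy-then-publish sequence of steps 3-6 above can be sketched in plain user-space C. The toy hash and all names below are illustrative only, with a C11 atomic store standing in for rcu_assign_pointer() and none of the kernel's grace-period handling:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy stand-in for the direct_functions hash: a fixed array of slots. */
struct toy_hash {
	unsigned long ips[64];
	int count;
};

static _Atomic(struct toy_hash *) live_hash;

/* Build a new table, copy (not move) the live entries, add the new ones,
 * and only then publish it.  Readers see either the old or the new table,
 * never a half-filled one -- the pattern the changelog describes. */
static int toy_hash_add(const unsigned long *new_ips, int n)
{
	struct toy_hash *old = atomic_load(&live_hash);
	struct toy_hash *new = calloc(1, sizeof(*new));

	if (!new)
		return -1;
	if (old)
		*new = *old;               /* step 4: copy existing entries */
	for (int i = 0; i < n; i++)        /* step 5: add the new entries  */
		new->ips[new->count++] = new_ips[i];
	atomic_store(&live_hash, new);     /* step 6: publish, like rcu_assign_pointer() */
	/* In the kernel the old table is only freed after a grace period. */
	free(old);
	return 0;
}
```

The key property is that the shared pointer is written exactly once, after the replacement table is fully built, so no reader can observe entries missing mid-update.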
Re: [PATCH] tracing: Fix possible memory leak in ftrace_regsiter_direct()
On Wed, 27 Dec 2023 21:38:25 +0900 "Masami Hiramatsu (Google)" wrote: > From: Masami Hiramatsu (Google) > > If ftrace_register_direct() called with a large number of target There's no function called "ftrace_register_direct()", I guess you meant register_ftrace_direct()? > functions (e.g. 65), the free_hash can be updated twice or more > in the ftrace_add_rec_direct() without freeing the previous hash > memory. Thus this can cause a memory leak. > > Fix this issue by expanding the direct_hash at once before > adding the new entries. > > Signed-off-by: Masami Hiramatsu (Google) > Fixes: f64dd4627ec6 ("ftrace: Add multi direct register/unregister interface") > Cc: sta...@vger.kernel.org > --- > kernel/trace/ftrace.c | 49 > +++-- > 1 file changed, 31 insertions(+), 18 deletions(-) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 8de8bec5f366..9269c2c3e595 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -2555,28 +2555,33 @@ unsigned long ftrace_find_rec_direct(unsigned long ip) > return entry->direct; > } > > -static struct ftrace_func_entry* > -ftrace_add_rec_direct(unsigned long ip, unsigned long addr, > - struct ftrace_hash **free_hash) > +static struct ftrace_hash *ftrace_expand_direct(int inc_count) > { > - struct ftrace_func_entry *entry; > + struct ftrace_hash *new_hash, *free_hash; > + int size = ftrace_hash_empty(direct_functions) ? 0 : > + direct_functions->count + inc_count; > > - if (ftrace_hash_empty(direct_functions) || > - direct_functions->count > 2 * (1 << direct_functions->size_bits)) { > - struct ftrace_hash *new_hash; > - int size = ftrace_hash_empty(direct_functions) ? 0 : > - direct_functions->count + 1; > + if (!ftrace_hash_empty(direct_functions) && > + size <= 2 * (1 << direct_functions->size_bits)) > + return NULL; > > - if (size < 32) > - size = 32; > + if (size < 32) > + size = 32; Hmm, why the limit of 32? I know this was there before this patch, but this patch made me notice it. 
In dup_hash() we have: bits = fls(size / 2); /* Don't allocate too much */ if (bits > FTRACE_HASH_MAX_BITS) bits = FTRACE_HASH_MAX_BITS; Where bits will determine the number of buckets. If size = 32, then bits = fls(32/2) = fls(16) = 5 So the buckets will be 2^5 = 32. Thus, you will get 32 buckets even with 64 entries, which will cause a minimum of two loops to find a bucket. Is this a concern? > > - new_hash = dup_hash(direct_functions, size); > - if (!new_hash) > - return NULL; > + new_hash = dup_hash(direct_functions, size); > + if (!new_hash) > + return ERR_PTR(-ENOMEM); > > - *free_hash = direct_functions; > - direct_functions = new_hash; > - } > + free_hash = direct_functions; > + direct_functions = new_hash; > + > + return free_hash; > +} > + > +static struct ftrace_func_entry* > +ftrace_add_rec_direct(unsigned long ip, unsigned long addr) > +{ > + struct ftrace_func_entry *entry; > > entry = kmalloc(sizeof(*entry), GFP_KERNEL); > if (!entry) > @@ -5436,11 +5441,19 @@ int register_ftrace_direct(struct ftrace_ops *ops, > unsigned long addr) > } > } > > + /* ... and prepare the insertion */ > + free_hash = ftrace_expand_direct(hash->count); Why does the direct_functions need to be expanded before new items are entered? Can't we do it the way ftrace does it, and that is just to fill the hash, and then expand if necessary. Hmm, also, I think there's a bug here too. That is, hash_dup() does not do what it says. It doesn't copy it actually moves. So when you use hash_dup() with the source being direct_functions, it's removing the entries from the direct functions, and not copying them :-/ That is, during the resize, the check against direct_functions will not find them and things will be missed! What I think we need to do, is what ftrace does, and that is to add everything to a temp hash first and then do an RCU assign to the direct_functions. > + if (IS_ERR(free_hash)) { > + err = PTR_ERR(free_hash); > + free_hash = NULL; > + goto out_unlock; > + } > + > /* ... 
and insert them to direct_functions hash. */ > err = -ENOMEM; > for (i = 0; i < size; i++) { > hlist_for_each_entry(entry, >buckets[i], hlist) { > - new = ftrace_add_rec_direct(entry->ip, addr, > _hash); > + new = ftrace_add_rec_direct(entry->ip, addr); > if (!new) > goto out_remove; > entry->direct = addr; This should fix both the leak and the fact that
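The bucket arithmetic questioned above is easy to check numerically. A minimal sketch, using a compiler builtin in place of the kernel's fls() and omitting the FTRACE_HASH_MAX_BITS cap for brevity:

```c
#include <assert.h>

/* fls(): find last (most significant) set bit, 1-based; fls(0) == 0.
 * Mirrors the kernel helper using a GCC/Clang builtin. */
static int fls_(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Bucket count dup_hash() would pick for a requested size, per the
 * "bits = fls(size / 2)" rule quoted above. */
static unsigned int dup_hash_buckets(unsigned int size)
{
	return 1u << fls_(size / 2);
}
```

For example, a requested size of 32 gives fls(16) = 5, i.e. 32 buckets, matching the arithmetic in the reply.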
Re: [PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100
On Wed, 27 Dec 2023 07:57:08 +0900 Masami Hiramatsu (Google) wrote: > On Tue, 26 Dec 2023 12:59:02 -0500 > Steven Rostedt wrote: > > > From: "Steven Rostedt (Google)" > > > > The tracefs file "buffer_percent" is to allow user space to set a > > water-mark on how much of the tracing ring buffer needs to be filled in > > order to wake up a blocked reader. > > > > 0 - is to wait until any data is in the buffer > > 1 - is to wait for 1% of the sub buffers to be filled > > 50 - would be half of the sub buffers are filled with data > > 100 - is not to wake the waiter until the ring buffer is completely full > > > > Unfortunately the test for being full was: > > > > dirty = ring_buffer_nr_dirty_pages(buffer, cpu); > > return (dirty * 100) > (full * nr_pages); > > > > Where "full" is the value for "buffer_percent". > > > > There is two issues with the above when full == 100. > > > > 1. dirty * 100 > 100 * nr_pages will never be true > >That is, the above is basically saying that if the user sets > >buffer_percent to 100, more pages need to be dirty than exist in the > >ring buffer! > > > > 2. The page that the writer is on is never considered dirty, as dirty > >pages are only those that are full. When the writer goes to a new > >sub-buffer, it clears the contents of that sub-buffer. > > > > That is, even if the check was ">=" it would still not be equal as the > > most pages that can be considered "dirty" is nr_pages - 1. > > > > To fix this, add one to dirty and use ">=" in the compare. 
> > > > Cc: sta...@vger.kernel.org > > Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage") > > Signed-off-by: Steven Rostedt (Google) > > --- > > kernel/trace/ring_buffer.c | 9 +++-- > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > > index 83eab547f1d1..32c0dd2fd1c3 100644 > > --- a/kernel/trace/ring_buffer.c > > +++ b/kernel/trace/ring_buffer.c > > @@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct > > trace_buffer *buffer, int cpu, int f > > if (!nr_pages || !full) > > return true; > > > > - dirty = ring_buffer_nr_dirty_pages(buffer, cpu); > > + /* > > +* Add one as dirty will never equal nr_pages, as the sub-buffer > > +* that the writer is on is not counted as dirty. > > +* This is needed if "buffer_percent" is set to 100. > > +*/ > > + dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1; > > Is this "+ 1" required? If we have 200 pages and 1 buffer is dirty, > it is 0.5% dirty. Consider @full = 1%. Yes it is required, as the comment above it states. dirty will never equal nr_pages. Without it, buffer_percent == 100 will never wake up. The +1 is to add the page the writer is on, which is never considered "dirty". > > @dirty = 1 + 1 = 2 and @dirty * 100 == 200. but > @full * @nr_pages = 1 * 200 = 200. > Thus it hits (200 >= 200 is true) even if dirty pages are 0.5%. Do we care? What's the difference if it wakes up on 2 dirty pages or 1? It would be very hard to measure the difference. But if you say 100, which means "I want to wake up when full" it will never wake up. Because it will always be nr_pages - 1. We could also say the +1 is the reader page too, because that's not counted as well. 
In other words, we can bike shed this to make 1% accurate (which honestly, I have no idea what the use case for that would be) or we can fix the bug that has 100% which just means, wake me up if the buffer is full, and when the writer is on the last page, it is considered full. -- Steve
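Extracting the old and new predicates into pure functions makes the full == 100 behavior discussed above easy to verify. A sketch (the kernel version also early-returns when nr_pages or full is zero):

```c
#include <assert.h>
#include <stdbool.h>

/* The watermark test before the fix: with full == 100 this can never be
 * true, since dirty never exceeds nr_pages - 1. */
static bool full_hit_old(unsigned long nr_pages, int full, unsigned long dirty)
{
	return dirty * 100 > (unsigned long)full * nr_pages;
}

/* After the fix: count the writer's page as dirty and compare with >=. */
static bool full_hit_new(unsigned long nr_pages, int full, unsigned long dirty)
{
	return (dirty + 1) * 100 >= (unsigned long)full * nr_pages;
}
```

With 200 pages and the maximum possible dirty count of 199, the old test never fires at full == 100 while the new one does; and as Masami notes, at full == 1 a single dirty page (0.5%) now also fires, which the thread concludes is acceptable.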
[PATCH] tracing: Fix blocked reader of snapshot buffer
From: "Steven Rostedt (Google)" If an application blocks on the snapshot or snapshot_raw files, expecting to be woken up when a snapshot occurs, it will not happen. Or it may happen with an unexpected result. That result is that the application will be reading the main buffer instead of the snapshot buffer. That is because when the snapshot occurs, the main and snapshot buffers are swapped. But the reader has a descriptor still pointing to the buffer that it originally connected to. This is fine for the main buffer readers, as they may be blocked waiting for a watermark to be hit, and when a snapshot occurs, the data that the main readers want is now on the snapshot buffer. But for waiters of the snapshot buffer, they are waiting for an event to occur that will trigger the snapshot and they can then consume it quickly to save the snapshot before the next snapshot occurs. But to do this, they need to read the new snapshot buffer, not the old one that is now receiving new data. Also, it does not make sense to have a watermark "buffer_percent" on the snapshot buffer, as the snapshot buffer is static and does not receive new data except all at once. 
Cc: sta...@vger.kernel.org Fixes: debdd57f5145f ("tracing: Make a snapshot feature available from userspace") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 3 ++- kernel/trace/trace.c | 20 +--- 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index e476d42c84c5..07dae67424a9 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -838,7 +838,8 @@ void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu) /* make sure the waiters see the new index */ smp_wmb(); - rb_wake_up_waiters(>work); + /* This can be called in any context */ + irq_work_queue(>work); } /** diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index cfeaf2cd204e..606787edae8c 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -1902,6 +1902,9 @@ update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu, __update_max_tr(tr, tsk, cpu); arch_spin_unlock(>max_lock); + + /* Any waiters on the old snapshot buffer need to wake up */ + ring_buffer_wake_waiters(tr->array_buffer.buffer, RING_BUFFER_ALL_CPUS); } /** @@ -1953,12 +1956,23 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu) static int wait_on_pipe(struct trace_iterator *iter, int full) { + int ret; + /* Iterators are static, they should be filled or empty */ if (trace_buffer_iter(iter, iter->cpu_file)) return 0; - return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file, - full); + ret = ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file, full); + +#ifdef CONFIG_TRACER_MAX_TRACE + /* +* Make sure this is still the snapshot buffer, as if a snapshot were +* to happen, this would now be the main buffer. 
+*/ + if (iter->snapshot) + iter->array_buffer = >tr->max_buffer; +#endif + return ret; } #ifdef CONFIG_FTRACE_STARTUP_TEST @@ -8560,7 +8574,7 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, wait_index = READ_ONCE(iter->wait_index); - ret = wait_on_pipe(iter, iter->tr->buffer_percent); + ret = wait_on_pipe(iter, iter->snapshot ? 0 : iter->tr->buffer_percent); if (ret) goto out; -- 2.42.0
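A toy model shows why the blocked reader must re-resolve its buffer after waking: a snapshot swaps the buffer pointers, so a pointer cached before the wait now names the other buffer. All names below are illustrative, not the kernel's structures:

```c
#include <assert.h>

struct toy_buffer { int events; };

struct toy_trace {
	struct toy_buffer *array_buffer; /* live buffer receiving events */
	struct toy_buffer *max_buffer;   /* snapshot buffer */
};

/* A snapshot swaps the two buffer pointers, as update_max_tr() does. */
static void toy_snapshot(struct toy_trace *tr)
{
	struct toy_buffer *tmp = tr->array_buffer;

	tr->array_buffer = tr->max_buffer;
	tr->max_buffer = tmp;
}

/* What the patch adds: a snapshot reader re-resolves which buffer is the
 * snapshot after waiting, instead of trusting a pointer cached earlier. */
static struct toy_buffer *snapshot_reader_buffer(struct toy_trace *tr)
{
	return tr->max_buffer;
}
```

The cached pointer is not wrong memory, it is simply the wrong role: after the swap it points at the buffer now receiving new data, which is exactly the bug described above.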
[PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100
From: "Steven Rostedt (Google)" The tracefs file "buffer_percent" is to allow user space to set a water-mark on how much of the tracing ring buffer needs to be filled in order to wake up a blocked reader. 0 - is to wait until any data is in the buffer 1 - is to wait for 1% of the sub buffers to be filled 50 - is to wait for half of the sub buffers to be filled with data 100 - is not to wake the waiter until the ring buffer is completely full Unfortunately the test for being full was: dirty = ring_buffer_nr_dirty_pages(buffer, cpu); return (dirty * 100) > (full * nr_pages); Where "full" is the value for "buffer_percent". There are two issues with the above when full == 100. 1. dirty * 100 > 100 * nr_pages will never be true That is, the above is basically saying that if the user sets buffer_percent to 100, more pages need to be dirty than exist in the ring buffer! 2. The page that the writer is on is never considered dirty, as dirty pages are only those that are full. When the writer goes to a new sub-buffer, it clears the contents of that sub-buffer. That is, even if the check was ">=" it would still not be equal as the most pages that can be considered "dirty" is nr_pages - 1. To fix this, add one to dirty and use ">=" in the compare. Cc: sta...@vger.kernel.org Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 83eab547f1d1..32c0dd2fd1c3 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int full) if (!nr_pages || !full) return true; - dirty = ring_buffer_nr_dirty_pages(buffer, cpu); + /* +* Add one as dirty will never equal nr_pages, as the sub-buffer +* that the writer is on is not counted as dirty.
+* This is needed if "buffer_percent" is set to 100. +*/ + dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1; - return (dirty * 100) > (full * nr_pages); + return (dirty * 100) >= (full * nr_pages); } /* -- 2.42.0
[PATCH v2] eventfs: Fix file and directory uid and gid ownership
From: "Steven Rostedt (Google)" It was reported that when mounting the tracefs file system with a gid other than root, the ownership did not carry down to the eventfs directory due to the dynamic nature of it. A fix was done to solve this, but it had two issues. (a) if the attr passed into update_inode_attr() was NULL, it didn't do anything. This is true for files that have not had a chown or chgrp done to itself or any of its sibling files, as the attr is allocated for all children when any one needs it. # umount /sys/kernel/tracing # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt # ls -ld /mnt/events/sched drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/ # ls -ld /mnt/events/sched/sched_switch drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/ But when checking the files: # ls -l /mnt/events/sched/sched_switch total 0 -rw-r- 1 root root 0 Dec 21 13:12 enable -rw-r- 1 root root 0 Dec 21 13:12 filter -r--r- 1 root root 0 Dec 21 13:12 format -r--r- 1 root root 0 Dec 21 13:12 hist -r--r- 1 root root 0 Dec 21 13:12 id -rw-r- 1 root root 0 Dec 21 13:12 trigger (b) When the attr does not denote the UID or GID, it defaulted to using the parent uid or gid. This is incorrect as changing the parent uid or gid will automatically change all its children. 
# chgrp tracing /mnt/events/timer # ls -ld /mnt/events/timer drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer # ls -l /mnt/events/timer total 0 -rw-r----- 1 root root 0 Dec 21 14:35 enable -rw-r----- 1 root root 0 Dec 21 14:35 filter drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start At first it was thought that this could be easily fixed by just defaulting the ownership to that of the superblock when it was mounted. But this does not handle the case of: # chgrp tracing instances # mkdir instances/foo If the superblock was used, then the group ownership would be that of what it was when it was mounted, when it should instead be "tracing". Instead, set a flag on the top level eventfs directory ("events") to mark which eventfs_inode belongs to it. Since the "events" directory's dentry and inode are never freed, it does not need to use its attr field to restore its mode and ownership. Use this eventfs_inode's attr as the default ownership for all the files and directories underneath it. When the events eventfs_inode is created, it sets its ownership to its parent uid and gid. As the events directory is created at boot up before it gets mounted, this will always be uid=0 and gid=0. If it's created via an instance, then it will take the ownership of the instance directory.
When the file system is mounted, it will update all the gids if one is specified. This will have a callback to update the events eventfs_inode's default entries. When a file or directory is created under the events directory, it will walk the ei->dentry parents until it finds the eventfs_inode that belongs to the events directory to retrieve the default uid and gid values. Link: https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=gtw+anwscylhpfswrq84pjho...@mail.gmail.com/ Cc: sta...@vger.kernel.org Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to parent uid and gid") Reported-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20231221173812.51fe6...@gandalf.local.home - While writing kselftests to test ownership, I found a bug. The remount option allows for an update of the gid. If it is specified, then all dentries are traversed in tracefs and eventfs and the gid is updated. The bug is that for an eventfs_inode that does not have a dentry but had its gid updated, the old gid is still stored, and when the dentry is created, it will use the stored gid. Instead, check all eventfs_inodes that do not have a dentry, and traverse their children to make sure their gids are updated as well. fs/tracefs/event_inode.c | 105 +++ fs/tracefs/inode.c | 6 +++ fs/tracefs/internal.h| 2 + 3 files changed, 103 insertions(+), 10 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c ind
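The default-ownership lookup described above, walking up the parents until the flagged "events" directory is found, can be sketched with a toy structure. This is illustrative only and does not mirror the kernel's eventfs_inode layout:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for an eventfs_inode with its parent link. */
struct toy_ei {
	struct toy_ei *parent;
	int is_events;          /* set only on the top-level "events" dir */
	unsigned int uid, gid;  /* on the events dir: the default ownership */
};

/* Walk toward the root until the events directory is found and return
 * its default gid -- the lookup the changelog describes. */
static int default_gid(const struct toy_ei *ei, unsigned int *gid)
{
	for (; ei; ei = ei->parent) {
		if (ei->is_events) {
			*gid = ei->gid;
			return 0;
		}
	}
	return -1; /* no events ancestor: caller falls back to root:root */
}
```

A chgrp on one subdirectory then never needs to touch its children: they resolve their default from the events directory, and explicit per-file attrs (not modeled here) override that default.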
[PATCH] eventfs: Fix file and directory uid and gid ownership
From: "Steven Rostedt (Google)" It was reported that when mounting the tracefs file system with a gid other than root, the ownership did not carry down to the eventfs directory due to the dynamic nature of it. A fix was done to solve this, but it had two issues. (a) if the attr passed into update_inode_attr() was NULL, it didn't do anything. This is true for files that have not had a chown or chgrp done to itself or any of its sibling files, as the attr is allocated for all children when any one needs it. # umount /sys/kernel/tracing # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt # ls -ld /mnt/events/sched drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/ # ls -ld /mnt/events/sched/sched_switch drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/ But when checking the files: # ls -l /mnt/events/sched/sched_switch total 0 -rw-r- 1 root root 0 Dec 21 13:12 enable -rw-r- 1 root root 0 Dec 21 13:12 filter -r--r- 1 root root 0 Dec 21 13:12 format -r--r- 1 root root 0 Dec 21 13:12 hist -r--r- 1 root root 0 Dec 21 13:12 id -rw-r- 1 root root 0 Dec 21 13:12 trigger (b) When the attr does not denote the UID or GID, it defaulted to using the parent uid or gid. This is incorrect as changing the parent uid or gid will automatically change all its children. 
# chgrp tracing /mnt/events/timer # ls -ld /mnt/events/timer drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer # ls -l /mnt/events/timer total 0 -rw-r- 1 root root0 Dec 21 14:35 enable -rw-r- 1 root root0 Dec 21 14:35 filter drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start At first it was thought that this could be easily fixed by just making the default ownership of the superblock when it was mounted. But this does not handle the case of: # chgrp tracing instances # mkdir instances/foo If the superblock was used, then the group ownership would be that of what it was when it was mounted, when it should instead be "tracing". Instead, set a flag for the top level eventfs directory ("events") to flag which eventfs_inode belongs to it. Since the "events" directory's dentry and inode are never freed, it does not need to use its attr field to restore its mode and ownership. Use the this eventfs_inode's attr as the default ownership for all the files and directories underneath it. When the events eventfs_inode is created, it sets its ownership to its parent uid and gid. As the events directory is created at boot up before it gets mounted, this will always be uid=0 and gid=0. If it's created via an instance, then it will take the ownership of the instance directory. 
When the file system is mounted, it will update all the gids if one is specified. This will have a callback to update the events evenfs_inode's default entries. When a file or directory is created under the events directory, it will walk the ei->dentry parents until it finds the evenfs_inode that belongs to the events directory to retrieve the default uid and gid values. Link: https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=gtw+anwscylhpfswrq84pjho...@mail.gmail.com/ Cc: sta...@vger.kernel.org Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to parent uid and gid") Reported-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 74 ++-- fs/tracefs/inode.c | 6 fs/tracefs/internal.h| 2 ++ 3 files changed, 72 insertions(+), 10 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 2ccc849a5bda..2d6b11195420 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -113,7 +113,14 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, * determined by the parent directory. */ if (dentry->d_inode->i_mode & S_IFDIR) { - update_attr(>attr, iattr); + /* +* The events directory dentry is never freed, unless its +* part of an instance that is deleted. It's attr is the +* default for its child files and directores. +
Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 21 Dec 2023 17:35:22 + Vincent Donnefort wrote: > @@ -5999,6 +6078,307 @@ int ring_buffer_subbuf_order_set(struct trace_buffer > *buffer, int order) > } > EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set); > The kernel developers have agreed to allow loop variables to be declared in loops. This will simplify these macros: > +#define subbuf_page(off, start) \ > + virt_to_page((void *)(start + (off << PAGE_SHIFT))) > + > +#define foreach_subbuf_page(off, sub_order, start, page) \ > + for (off = 0, page = subbuf_page(0, start); \ > + off < (1 << sub_order);\ > + off++, page = subbuf_page(off, start)) #define foreach_subbuf_page(sub_order, start, page) \ for (int __off = 0, page = subbuf_page(0, (start)); \ __off < (1 << (sub_order));\ __off++, page = subbuf_page(__off, (start))) And parameters as r-values should always be wrapped in parenthesis. > + > +static inline void subbuf_map_prepare(unsigned long subbuf_start, int order) > +{ > + struct page *page; > + int subbuf_off; Then you can remove the "subbuf_off". > + > + /* > + * When allocating order > 0 pages, only the first struct page has a > + * refcount > 1. Increasing the refcount here ensures none of the struct > + * page composing the sub-buffer is freeed when the mapping is closed. Nice catch by the way ;-) -- Steve > + */ > + foreach_subbuf_page(subbuf_off, order, subbuf_start, page) > + page_ref_inc(page); > +} > + > +static inline void subbuf_unmap(unsigned long subbuf_start, int order) > +{ > + struct page *page; > + int subbuf_off; > + > + foreach_subbuf_page(subbuf_off, order, subbuf_start, page) { > + page_ref_dec(page); > + page->mapping = NULL; > + } > +} > +
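A working variant of such a macro keeps only the index loop-scoped: the index and the page cursor have different types, so a single for-statement declaration cannot hold both, and the cursor stays a macro argument. The names below are hypothetical, with plain pointer arithmetic standing in for subbuf_page():

```c
#include <assert.h>

#define SUBBUF_AT(base, off)  ((base) + (off))

/* Iterate every element of an order-N block (1 << order entries), with the
 * index declared inside the for statement (C99 style, now allowed in the
 * kernel).  @ptr is re-pointed at each element before the body runs. */
#define foreach_subbuf(ptr, order, base)                                      \
	for (int __off = 0;                                                   \
	     __off < (1 << (order)) && ((ptr) = SUBBUF_AT((base), __off), 1); \
	     __off++)

static int sum_block(int *base, int order)
{
	int *p;
	int sum = 0;

	foreach_subbuf(p, order, base)
		sum += *p;
	return sum;
}
```

This removes the caller-declared index variable, which is the simplification the reply is after, while keeping the r-value parameters parenthesized.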
Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 21 Dec 2023 17:35:22 + Vincent Donnefort wrote: > @@ -739,6 +747,22 @@ static __always_inline bool full_hit(struct trace_buffer > *buffer, int cpu, int f > return (dirty * 100) > (full * nr_pages); > } > > +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer) > +{ > + if (unlikely(READ_ONCE(cpu_buffer->mapped))) { > + /* Ensure the meta_page is ready */ > + smp_rmb(); > + WRITE_ONCE(cpu_buffer->meta_page->entries, > +local_read(&cpu_buffer->entries)); > + WRITE_ONCE(cpu_buffer->meta_page->overrun, > +local_read(&cpu_buffer->overrun)); > + WRITE_ONCE(cpu_buffer->meta_page->subbufs_touched, > +local_read(&cpu_buffer->pages_touched)); > + WRITE_ONCE(cpu_buffer->meta_page->subbufs_lost, > +local_read(&cpu_buffer->pages_lost)); > + } > +} > + > /* > * rb_wake_up_waiters - wake up tasks waiting for ring buffer input > * > @@ -749,6 +773,18 @@ static void rb_wake_up_waiters(struct irq_work *work) > { > struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, > work); > > + if (rbwork->is_cpu_buffer) { > + struct ring_buffer_per_cpu *cpu_buffer; > + > + cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu, > + irq_work); > + /* > + * If the waiter is a cpu_buffer, this might be due to a > + * userspace mapping. Let's update the meta-page. > + */ > + rb_update_meta_page(cpu_buffer); > + } > + > wake_up_all(&rbwork->waiters); > if (rbwork->full_waiters_pending || rbwork->wakeup_full) { > rbwork->wakeup_full = false; I think this code would be cleaner if we did: static void rb_update_meta_page(struct rb_irq_work *rbwork) { struct ring_buffer_per_cpu *cpu_buffer; if (!rbwork->is_cpu_buffer) return; /* * If the waiter is a cpu_buffer, this might be due to a * userspace mapping. Let's update the meta-page.
*/ cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu, irq_work); if (unlikely(READ_ONCE(cpu_buffer->mapped))) { // I don't think we need the "unlikely" /* Ensure the meta_page is ready */ smp_rmb(); WRITE_ONCE(cpu_buffer->meta_page->entries, local_read(&cpu_buffer->entries)); WRITE_ONCE(cpu_buffer->meta_page->overrun, local_read(&cpu_buffer->overrun)); WRITE_ONCE(cpu_buffer->meta_page->subbufs_touched, local_read(&cpu_buffer->pages_touched)); WRITE_ONCE(cpu_buffer->meta_page->subbufs_lost, local_read(&cpu_buffer->pages_lost)); } } /* * rb_wake_up_waiters - wake up tasks waiting for ring buffer input * * Schedules a delayed work to wake up any task that is blocked on the * ring buffer waiters queue. */ static void rb_wake_up_waiters(struct irq_work *work) { struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, work); rb_update_meta_page(rbwork); wake_up_all(&rbwork->waiters); if (rbwork->full_waiters_pending || rbwork->wakeup_full) { rbwork->wakeup_full = false; rbwork->full_waiters_pending = false; wake_up_all(&rbwork->full_waiters); } } -- Steve
Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer
On Thu, 21 Dec 2023 14:51:29 + David Laight wrote: > > I think 1kb units is perfectly fine (patch 15 changes to kb units). The > > interface says its to define the minimal size of the sub-buffer, not the > > actual size. > > I didn't read that far through :-( > Well, this isn't a normal patch series as I took the work from Tzvetomir back from 2021, and started massaging them. I wanted to keep Tzvetomir's original authorship even though he's not working on it, so I just applied his patches as-is and then added patches on top of them, to denote what I did and what he did. This is why I never modified the first 5 patches of this series, except for subject lines and change logs. -- Steve
Re: [PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()
On Thu, 21 Dec 2023 11:06:38 +0100 Alexander Graf wrote: > Thanks a bunch for the super quick turnaround time for the fix! I can > confirm that I'm no longer seeing the warning :) > > Tested-by: Alexander Graf Thanks Alex, > > > Do we need another similar patch for the kprobe self tests? The below is > with 55cb5f43689d7 plus an unrelated initrd patch plus this patch and > the following .config: http://csgraf.de/tmp2/config-ftrace.xz Sure, now you tell me ;-) I just finished all my tests and will be sending Linus a pull request soon. I'll let Masami handle this one, as it's under his department. -- Steve > > [ 919.717134] Testing all events: OK > [ 924.418194] Testing ftrace filter: OK > [ 924.418887] trace_kprobe: Testing kprobe tracing: > [ 924.424244] [ cut here ] > [ 924.424952] WARNING: CPU: 2 PID: 1 at > kernel/trace/trace_kprobe.c:2073 kprobe_trace_self_tests_init+0x192/0x540 > [ 924.425659] Modules linked in: > [ 924.425886] CPU: 2 PID: 1 Comm: swapper/0 Not tainted > 6.7.0-rc6-00024-gc10698ad3c9a #298 > [ 924.426448] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 > [ 924.427230] RIP: 0010:kprobe_trace_self_tests_init+0x192/0x540 > [ 924.427639] Code: 7e 10 74 3b 48 8b 36 48 39 d6 75 f2 0f 0b 48 c7 c7 > 58 71 79 a5 e8 ee 3e 5a fe 48 c7 c7 20 38 b7 a5 e8 a2 51 68 fe 85 c0 74 > 33 <0f> 0b 48 c7 c7 38 73 79 a5 e8 d0 3e 5a fe e8 4b 64 62 fe eb 23 48 > [ 924.428922] RSP: :ab508001be58 EFLAGS: 00010286 > [ 924.429288] RAX: fff0 RBX: 005a RCX: > 0202 > [ 924.429800] RDX: RSI: 0002e970 RDI: > a5b708a0 > [ 924.430295] RBP: R08: 0001 R09: > a4855a90 > [ 924.430794] R10: 0007 R11: 028a R12: > 0001 > [ 924.431286] R13: a5cc9a00 R14: 8cec81bebe00 R15: > a619f0f0 > [ 924.431785] FS: () GS:8cecf910() > knlGS: > [ 924.432346] CS: 0010 DS: ES: CR0: 80050033 > [ 924.432748] CR2: CR3: 4883e001 CR4: > 003706f0 > [ 924.433246] DR0: DR1: DR2: > > [ 924.433739] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 924.434233] Call 
Trace: > [ 924.434418] > [ 924.434569] ? __warn+0x7d/0x140 > [ 924.434807] ? kprobe_trace_self_tests_init+0x192/0x540 > [ 924.435172] ? report_bug+0xf8/0x1e0 > [ 924.435430] ? handle_bug+0x3f/0x70 > [ 924.435677] ? exc_invalid_op+0x13/0x60 > [ 924.435954] ? asm_exc_invalid_op+0x16/0x20 > [ 924.436249] ? rdinit_setup+0x40/0x40 > [ 924.436509] ? trace_kprobe_release+0x70/0xb0 > [ 924.436822] ? kprobe_trace_self_tests_init+0x192/0x540 > [ 924.437189] ? kprobe_trace_self_tests_init+0x421/0x540 > [ 924.437551] ? init_kprobe_trace+0x1b0/0x1b0 > [ 924.437855] do_one_initcall+0x44/0x200 > [ 924.438131] kernel_init_freeable+0x1e8/0x330 > [ 924.438439] ? rest_init+0xd0/0xd0 > [ 924.438682] kernel_init+0x16/0x130 > [ 924.438943] ret_from_fork+0x30/0x50 > [ 924.439202] ? rest_init+0xd0/0xd0 > [ 924.439445] ret_from_fork_asm+0x11/0x20 > [ 924.439734] > [ 924.439893] ---[ end trace ]--- > [ 924.440217] trace_kprobe: error on cleaning up probes. > [ 924.440575] NG: Some tests are failed. Please check them.
Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer
On Thu, 21 Dec 2023 09:17:55 + David Laight wrote: > > Unfortunately, it has to be PAGE_SIZE (and for now it's a power of 2 to > > make masking easy). It's used for splice and will also be used for memory > > mapping with user space. > > Perhaps then the sysctl to set the size should be powers of 4k It's not a sysctl but a file in tracefs > with a minimum size of PAGE_SIZE. > Then you don't have to know the page size when setting things up. The user shouldn't need to know either. But the size of the sub-buffer limits the biggest size of an event, so the user only needs to make sure the sub-buffer is bigger than their biggest event. > > I'm also guessing that no Linux kernels have a PAGE_SIZE of 2k? > IIRC some old mmu (maybe 68020 era) used 2k pages. I think 1kb units is perfectly fine (patch 15 changes to kb units). The interface says it's to define the minimal size of the sub-buffer, not the actual size. -- Steve
Re: [PATCH v5 15/15] tracing: Update subbuffer with kilobytes not page order
On Thu, 21 Dec 2023 09:26:21 +0900 Masami Hiramatsu (Google) wrote: > > If the user specifies 3 via: > > > > echo 3 > buffer_subbuf_size_kb > > > > Then the sub-buffer size will round up to 4kb (on a 4kb page size system). > > > > If they specify: > > > > echo 6 > buffer_subbuf_size_kb > > > > The sub-buffer size will become 8kb. > > I think this is better interface. Can we apply this earlier in the series > to avoid rewriting the document and test code? I kept it separate for testing purposes. Throughout all this, it was a good way to make sure the two approaches were compatible. I still like to keep them separate as that's the way it was developed. It's good to keep that history. -- Steve
Re: [PATCH -next v4 2/2] mm: vmscan: add new event to trace shrink lru
On Tue, 19 Dec 2023 17:21:23 -0800 Bixuan Cui wrote: > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h > index b99cd28c9815..02868bdc5999 100644 > --- a/include/trace/events/vmscan.h > +++ b/include/trace/events/vmscan.h > @@ -395,7 +395,24 @@ TRACE_EVENT(mm_vmscan_write_folio, > show_reclaim_flags(__entry->reclaim_flags)) > ); > > -TRACE_EVENT(mm_vmscan_lru_shrink_inactive, > +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_start, > + > + TP_PROTO(int nid), > + > + TP_ARGS(nid), > + > + TP_STRUCT__entry( > + __field(int, nid) > + ), > + > + TP_fast_assign( > + __entry->nid = nid; > + ), > + > + TP_printk("nid=%d", __entry->nid) > +); > + > +TRACE_EVENT(mm_vmscan_lru_shrink_inactive_end, > > TP_PROTO(int nid, > unsigned long nr_scanned, unsigned long nr_reclaimed, > @@ -446,7 +463,24 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive, > show_reclaim_flags(__entry->reclaim_flags)) > ); > > -TRACE_EVENT(mm_vmscan_lru_shrink_active, > +TRACE_EVENT(mm_vmscan_lru_shrink_active_start, > + > + TP_PROTO(int nid), > + > + TP_ARGS(nid), > + > + TP_STRUCT__entry( > + __field(int, nid) > + ), > + > + TP_fast_assign( > + __entry->nid = nid; > + ), > + > + TP_printk("nid=%d", __entry->nid) > +); > + > +TRACE_EVENT(mm_vmscan_lru_shrink_active_end, > These two events are identical, please use DECLARE_EVENT_CLASS and DEFINE_EVENT macros: DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template, TP_PROTO(int nid), TP_ARGS(nid), TP_STRUCT__entry( __field(int, nid) ), TP_fast_assign( __entry->nid = nid; ), TP_printk("nid=%d", __entry->nid) ); DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, mm_vmscan_lru_shrink_inactive_start, TP_PROTO(int nid), TP_ARGS(nid) ); DEFINE_EVENT(mm_vmscan_lru_shrink_start_template, mm_vmscan_lru_shrink_active_start, TP_PROTO(int nid), TP_ARGS(nid) ); Using a single event class saves a bit of memory footprint. -- Steve
Re: [PATCH v5 04/15] ring-buffer: Set new size of the ring buffer sub page
On Thu, 21 Dec 2023 01:34:56 +0900 Masami Hiramatsu (Google) wrote: > On Tue, 19 Dec 2023 13:54:18 -0500 > Steven Rostedt wrote: > > > From: "Tzvetomir Stoyanov (VMware)" > > > > There are two approaches when changing the size of the ring buffer > > sub page: > > 1. Destroying all pages and allocating new pages with the new size. > > 2. Allocating new pages, copying the content of the old pages before > > destroying them. > > The first approach is easier, it is selected in the proposed > > implementation. Changing the ring buffer sub page size is supposed to > > not happen frequently. Usually, that size should be set only once, > > when the buffer is not in use yet and is supposed to be empty. > > > > Link: > > https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoya...@gmail.com > > > > OK, this actually reallocate the sub buffers when a new order is set. > BTW, with this change, if we set a new order, the total buffer size will be > changed too? Or reserve the total size? I think either is OK but it should > be described in the document. (e.g. if it is changed, user should set the > order first and set the total size later.) > Patch 11 keeps the same size of the buffer. As I would think that would be what the user would expect. And not only that, it breaks the latency tracers if it doesn't keep the same size. -- Steve
Re: [PATCH RERESEND 04/11] tracing: tracing_buffers_splice_read: behave as-if non-blocking I/O
On Thu, 14 Dec 2023 19:45:02 +0100 Ahelenia Ziemiańska wrote: > Otherwise we risk sleeping with the pipe locked for indeterminate > lengths of time. This change log is really lacking. Why is this an issue? > Link: > https://lore.kernel.org/linux-fsdevel/qk6hjuam54khlaikf2ssom6custxf5is2ekkaequf4hvode3ls@zgf7j5j4ubvw/t/#u The change log should not rely on any external links. Those are for reference only. At a bare minimum, the change log should have a tl;dr; summary of the issue. The change log itself should have enough information to understand why this change is happening without the need to look at any links. -- Steve
Re: [PATCH v5 06/15] ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure
On Thu, 21 Dec 2023 01:23:14 +0900 Masami Hiramatsu (Google) wrote: > On Tue, 19 Dec 2023 13:54:20 -0500 > Steven Rostedt wrote: > > > From: "Steven Rostedt (Google)" > > > > On failure to allocate ring buffer pages, the pointer to the CPU buffer > > pages is freed, but the pages that were allocated previously were not. > > Make sure they are freed too. > > > > Fixes: TBD ("tracing: Set new size of the ring buffer sub page") > > Do you merge this fix to the original one in the same series later? > I think it is better to merge it for git bisect. It only fixes the new functionality. If a git bisect gets here for that, then we know the issue already. But it shouldn't break bisect for things that happened before this change. The reason I'm not merging this with the other patch is because the original patch is from Tzvetomir, who isn't working on this anymore. I want to keep his patches untouched, and anything I do is on top of it. -- Steve
[PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()
From: "Steven Rostedt (Google)" The synth_event_gen_test module can be built in, if someone wants to run the tests at boot up and not have to load them. The synth_event_gen_test_init() function creates and enables the synthetic events and runs its tests. The synth_event_gen_test_exit() disables the events it created and destroys the events. If the module is builtin, the events are never disabled. The issue is, the events should be disabled after the tests are run. This could be an issue if the rest of the boot up tests are enabled, as they expect the events to be in a known state before testing. That known state happens to be disabled. When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y a warning will trigger: Running tests on trace events: Testing event create_synth_test: Enabled event during self test! [ cut here ] WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 event_trace_self_tests+0x1c2/0x480 Modules linked in: CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:event_trace_self_tests+0x1c2/0x480 Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff RSP: :c901fdc0 EFLAGS: 00010246 RAX: 0029 RBX: 88810399ca80 RCX: RDX: RSI: b9f19478 RDI: 88823c734e64 RBP: 88810399f300 R08: R09: fbfff79eb32a R10: bcf59957 R11: 0001 R12: 888104068090 R13: bc89f0a0 R14: bc8a0f08 R15: 0078 FS: () GS:88823c70() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 0001f6282001 CR4: 00170ef0 Call Trace: ? __warn+0xa5/0x200 ? event_trace_self_tests+0x1c2/0x480 ? report_bug+0x1f6/0x220 ? handle_bug+0x6f/0x90 ? exc_invalid_op+0x17/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? tracer_preempt_on+0x78/0x1c0 ? event_trace_self_tests+0x1c2/0x480 ?
__pfx_event_trace_self_tests_init+0x10/0x10 event_trace_self_tests_init+0x27/0xe0 do_one_initcall+0xd6/0x3c0 ? __pfx_do_one_initcall+0x10/0x10 ? kasan_set_track+0x25/0x30 ? rcu_is_watching+0x38/0x60 kernel_init_freeable+0x324/0x450 ? __pfx_kernel_init+0x10/0x10 kernel_init+0x1f/0x1e0 ? _raw_spin_unlock_irq+0x33/0x50 ret_from_fork+0x34/0x60 ? __pfx_kernel_init+0x10/0x10 ret_from_fork_asm+0x1b/0x30 This is because the synth_event_gen_test_init() left the synthetic events that it created enabled. By having it disable them after testing, the other selftests will run fine. Cc: sta...@vger.kernel.org Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module") Reported-by: Alexander Graf Signed-off-by: Steven Rostedt (Google) --- kernel/trace/synth_event_gen_test.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/trace/synth_event_gen_test.c b/kernel/trace/synth_event_gen_test.c index 8dfe85499d4a..354c2117be43 100644 --- a/kernel/trace/synth_event_gen_test.c +++ b/kernel/trace/synth_event_gen_test.c @@ -477,6 +477,17 @@ static int __init synth_event_gen_test_init(void) ret = test_trace_synth_event(); WARN_ON(ret); + + /* Disable when done */ + trace_array_set_clr_event(gen_synth_test->tr, + "synthetic", + "gen_synth_test", false); + trace_array_set_clr_event(empty_synth_test->tr, + "synthetic", + "empty_synth_test", false); + trace_array_set_clr_event(create_synth_test->tr, + "synthetic", + "create_synth_test", false); out: return ret; } -- 2.42.0
Re: [PATCH] eventfs: Have event files and directories default to parent uid and gid
On Wed, 20 Dec 2023 10:50:17 -0500 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > Dongliang reported: > > I found that in the latest version, the nodes of tracefs have been > changed to dynamically created. > > This has caused me to encounter a problem where the gid I specified in > the mounting parameters cannot apply to all files, as in the following > situation: > > /data/tmp/events # mount | grep tracefs > tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012) > > gid 3012 = readtracefs > > /data/tmp # ls -lh > total 0 > -r--r- 1 root readtracefs 0 1970-01-01 08:00 README > -r--r- 1 root readtracefs 0 1970-01-01 08:00 available_events > > ums9621_1h10:/data/tmp/events # ls -lh > total 0 > drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer > drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc > > It will prevent certain applications from accessing tracefs properly, I > try to avoid this issue by making the following modifications. > > To fix this, have the files created default to taking the ownership of > the parent dentry unless the ownership was previously set by the user. 
> > Link: > https://lore.kernel.org/linux-trace-kernel/1703063706-30539-1-git-send-email-dongliang@unisoc.com/ > I forgot to add: Cc: sta...@vger.kernel.org Fixes: 28e12c09f5aa0 ("eventfs: Save ownership and mode") -- Steve > Reported-by: Dongliang Cui > Signed-off-by: Steven Rostedt (Google) > --- > fs/tracefs/event_inode.c | 12 +--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 43e237864a42..2ccc849a5bda 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -148,7 +148,8 @@ static const struct file_operations > eventfs_file_operations = { > .release= eventfs_release, > }; > > -static void update_inode_attr(struct inode *inode, struct eventfs_attr > *attr, umode_t mode) > +static void update_inode_attr(struct dentry *dentry, struct inode *inode, > + struct eventfs_attr *attr, umode_t mode) > { > if (!attr) { > inode->i_mode = mode; > @@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, > struct eventfs_attr *attr, um > > if (attr->mode & EVENTFS_SAVE_UID) > inode->i_uid = attr->uid; > + else > + inode->i_uid = d_inode(dentry->d_parent)->i_uid; > > if (attr->mode & EVENTFS_SAVE_GID) > inode->i_gid = attr->gid; > + else > + inode->i_gid = d_inode(dentry->d_parent)->i_gid; > } > > /** > @@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, > umode_t mode, > return eventfs_failed_creating(dentry); > > /* If the user updated the directory's attributes, use them */ > - update_inode_attr(inode, attr, mode); > + update_inode_attr(dentry, inode, attr, mode); > > inode->i_op = &eventfs_file_inode_operations; > inode->i_fop = fop; > @@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode > *ei, struct dentry *parent > return eventfs_failed_creating(dentry); > > /* If the user updated the directory's attributes, use them */ > - update_inode_attr(inode, &ei->attr, S_IFDIR | S_IRWXU | S_IRUGO | > S_IXUGO); > + update_inode_attr(dentry, inode, &ei->attr, > + S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); > > inode->i_op = &eventfs_root_dir_inode_operations; > inode->i_fop = &eventfs_file_operations;
Re: [PATCH v8 0/2] Introducing trace buffer mapping by user-space
On Wed, 20 Dec 2023 13:49:30 + Vincent Donnefort wrote: > I meant, to only do in rb_wake_up_waiters() > > if (rbwork->is_cpu_buffer) > rb_update_meta_page(cpu_buffer) > > And skip the meta-page update for the !is_cpu_buffer case? Ah yeah, that works. -- Steve
[PATCH] eventfs: Have event files and directories default to parent uid and gid
From: "Steven Rostedt (Google)" Dongliang reported: I found that in the latest version, the nodes of tracefs have been changed to dynamically created. This has caused me to encounter a problem where the gid I specified in the mounting parameters cannot apply to all files, as in the following situation: /data/tmp/events # mount | grep tracefs tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012) gid 3012 = readtracefs /data/tmp # ls -lh total 0 -r--r- 1 root readtracefs 0 1970-01-01 08:00 README -r--r- 1 root readtracefs 0 1970-01-01 08:00 available_events ums9621_1h10:/data/tmp/events # ls -lh total 0 drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc It will prevent certain applications from accessing tracefs properly, I try to avoid this issue by making the following modifications. To fix this, have the files created default to taking the ownership of the parent dentry unless the ownership was previously set by the user. Link: https://lore.kernel.org/linux-trace-kernel/1703063706-30539-1-git-send-email-dongliang@unisoc.com/ Reported-by: Dongliang Cui Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 43e237864a42..2ccc849a5bda 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -148,7 +148,8 @@ static const struct file_operations eventfs_file_operations = { .release= eventfs_release, }; -static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, umode_t mode) +static void update_inode_attr(struct dentry *dentry, struct inode *inode, + struct eventfs_attr *attr, umode_t mode) { if (!attr) { inode->i_mode = mode; @@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, um if (attr->mode & EVENTFS_SAVE_UID) inode->i_uid = attr->uid; + else + inode->i_uid = 
d_inode(dentry->d_parent)->i_uid; if (attr->mode & EVENTFS_SAVE_GID) inode->i_gid = attr->gid; + else + inode->i_gid = d_inode(dentry->d_parent)->i_gid; } /** @@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, umode_t mode, return eventfs_failed_creating(dentry); /* If the user updated the directory's attributes, use them */ - update_inode_attr(inode, attr, mode); + update_inode_attr(dentry, inode, attr, mode); inode->i_op = _file_inode_operations; inode->i_fop = fop; @@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent return eventfs_failed_creating(dentry); /* If the user updated the directory's attributes, use them */ - update_inode_attr(inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); + update_inode_attr(dentry, inode, >attr, + S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); inode->i_op = _root_dir_inode_operations; inode->i_fop = _file_operations; -- 2.42.0
Re: [PATCH v5 03/15] ring-buffer: Add interface for configuring trace sub buffer size
On Wed, 20 Dec 2023 23:26:19 +0900 Masami Hiramatsu (Google) wrote: > On Tue, 19 Dec 2023 13:54:17 -0500 > Steven Rostedt wrote: > > > +/** > > + * ring_buffer_subbuf_order_set - set the size of ring buffer sub page. > > + * @buffer: The ring_buffer to set the new page size. > > + * @order: Order of the system pages in one sub buffer page > > + * > > + * By default, one ring buffer pages equals to one system page. This API > > can be > > + * used to set new size of the ring buffer page. The size must be order of > > + * system page size, that's why the input parameter @order is the order of > > + * system pages that are allocated for one ring buffer page: > > + * 0 - 1 system page > > + * 1 - 2 system pages > > + * 3 - 4 system pages > > + * ... > > Don't we have the max order of the pages? Actually there is. I think it's 7? Honestly, anything over 5 is probably too much. But hey. > > > + * > > + * Returns 0 on success or < 0 in case of an error. > > + */ > > +int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order) > > +{ > > + int psize; > > + > > + if (!buffer || order < 0) > > + return -EINVAL; > > + > > + if (buffer->subbuf_order == order) > > + return 0; > > + > > + psize = (1 << order) * PAGE_SIZE; > > + if (psize <= BUF_PAGE_HDR_SIZE) > > + return -EINVAL; > > + > > + buffer->subbuf_order = order; > > + buffer->subbuf_size = psize - BUF_PAGE_HDR_SIZE; > > + > > + /* Todo: reset the buffer with the new page size */ > > + > > I just wonder why there is no reallocate the sub buffers here. > Is it OK to change the sub buffer page size and order while > using the ring buffer? Currently we disable the ring buffer to do the update. -- Steve
Re: [PATCH]eventfs: Apply the gid in the mounting parameters to all files
On Wed, 20 Dec 2023 17:15:06 +0800 Dongliang Cui wrote: > I found that in the latest version, the nodes of > tracefs have been changed to dynamically created. > > This has caused me to encounter a problem where > the gid I specified in the mounting parameters > cannot apply to all files, as in the following > situation: > > /data/tmp/events # mount | grep tracefs > tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012) > > gid 3012 = readtracefs > > /data/tmp # ls -lh > total 0 > -r--r- 1 root readtracefs 0 1970-01-01 08:00 README > -r--r- 1 root readtracefs 0 1970-01-01 08:00 available_events > > ums9621_1h10:/data/tmp/events # ls -lh > total 0 > drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer > drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc > > It will prevent certain applications from accessing > tracefs properly, I try to avoid this issue by making > the following modifications. > > Signed-off-by: Dongliang Cui Thanks for the report, can you try this fix instead? -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 43e237864a42..2ccc849a5bda 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -148,7 +148,8 @@ static const struct file_operations eventfs_file_operations = { .release= eventfs_release, }; -static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, umode_t mode) +static void update_inode_attr(struct dentry *dentry, struct inode *inode, + struct eventfs_attr *attr, umode_t mode) { if (!attr) { inode->i_mode = mode; @@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, um if (attr->mode & EVENTFS_SAVE_UID) inode->i_uid = attr->uid; + else + inode->i_uid = d_inode(dentry->d_parent)->i_uid; if (attr->mode & EVENTFS_SAVE_GID) inode->i_gid = attr->gid; + else + inode->i_gid = d_inode(dentry->d_parent)->i_gid; } /** @@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, umode_t mode, return 
eventfs_failed_creating(dentry); /* If the user updated the directory's attributes, use them */ - update_inode_attr(inode, attr, mode); + update_inode_attr(dentry, inode, attr, mode); inode->i_op = _file_inode_operations; inode->i_fop = fop; @@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent return eventfs_failed_creating(dentry); /* If the user updated the directory's attributes, use them */ - update_inode_attr(inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); + update_inode_attr(dentry, inode, >attr, + S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); inode->i_op = _root_dir_inode_operations; inode->i_fop = _file_operations;
Re: [PATCH v8 0/2] Introducing trace buffer mapping by user-space
On Wed, 20 Dec 2023 13:06:06 + Vincent Donnefort wrote: > > @@ -771,10 +772,20 @@ static void rb_update_meta_page(struct > > ring_buffer_per_cpu *cpu_buffer) > > static void rb_wake_up_waiters(struct irq_work *work) > > { > > struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, > > work); > > - struct ring_buffer_per_cpu *cpu_buffer = > > - container_of(rbwork, struct ring_buffer_per_cpu, irq_work); > > + struct ring_buffer_per_cpu *cpu_buffer; > > + struct trace_buffer *buffer; > > + int cpu; > > > > - rb_update_meta_page(cpu_buffer); > > + if (rbwork->is_cpu_buffer) { > > + cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu, > > irq_work); > > + rb_update_meta_page(cpu_buffer); > > + } else { > > + buffer = container_of(rbwork, struct trace_buffer, irq_work); > > + for_each_buffer_cpu(buffer, cpu) { > > + cpu_buffer = buffer->buffers[cpu]; > > + rb_update_meta_page(cpu_buffer); > > + } > > + } > > Arg, somehow never reproduced the problem :-\. I suppose you need to cat > trace/trace_pipe and mmap(trace/cpuX/trace_pipe) at the same time? It triggered as soon as I ran "trace-cmd start -e sched_switch" In other words, it broke the non mmap case. This function gets called for both the buffer and cpu_buffer irq_work entries. You added the container_of() to get access to cpu_buffer, when the rbwork could also be for the main buffer too. The main buffer has no meta page, and it triggered a NULL pointer dereference, as "cpu_buffer->mapped" returned true (because it was on something of the buffer structure that wasn't zero), and then here: if (cpu_buffer->mapped) { WRITE_ONCE(cpu_buffer->meta_page->reader.read, 0); It dereferenced cpu_buffer->meta_page->reader which is only God knows what! > > Updating the meta-page is only useful if the reader we are waking up is a > user-space one, which would only happen with the cpu_buffer version of this > function. We could limit the update of the meta_page only to this case? 
I'd rather not add another irq_work entry. This workaround should be good enough. Thanks, -- Steve
[PATCH] ring-buffer: Remove stale comment from ring_buffer_size()
From: "Steven Rostedt (Google)" It's been 11 years since the ring_buffer_size() function was updated to use the nr_pages from the buffer->buffers[cpu] structure instead of using the buffer->nr_pages that no longer exists. The comment in the code is more of what a change log should have and is pretty much useless for development. It's saying how things worked back in 2012, which has no bearing on today's code. Remove it. Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535...@acums.aculab.com/ Reported-by: David Laight Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 173d2595ce2d..7887d61d5b56 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -5122,12 +5122,6 @@ EXPORT_SYMBOL_GPL(ring_buffer_iter_advance); */ unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu) { - /* -* Earlier, this method returned -* buffer->subbuf_size * buffer->nr_pages -* Since the nr_pages field is now removed, we have converted this to -* return the per cpu buffer value. -*/ if (!cpumask_test_cpu(cpu, buffer->cpumask)) return 0; -- 2.42.0
Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer
On Wed, 20 Dec 2023 08:48:02 + David Laight wrote: > From: Steven Rostedt > > Sent: 19 December 2023 18:54 > > From: "Tzvetomir Stoyanov (VMware)" > > > > Currently the size of one sub buffer page is global for all buffers and > > it is hard coded to one system page. In order to introduce configurable > > ring buffer sub page size, the internal logic should be refactored to > > work with sub page size per ring buffer. > > > ... > > - nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE); > > + /* Default buffer page size - one system page */ > > + buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE; > > + > > + /* Max payload is buffer page size - header (8bytes) */ > > + buffer->max_data_size = buffer->subbuf_size - (sizeof(u32) * 2); > > + > > + nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size); > > While not new, does this really make any sense for systems with 64k pages? > Wouldn't it be better to have units of 4k? Unfortunately, it has to be PAGE_SIZE (and for now it's a power of 2 to make masking easy). It's used for splice and will also be used for memory mapping with user space. > > ... > > @@ -5102,14 +5110,14 @@ unsigned long ring_buffer_size(struct trace_buffer > > *buffer, int cpu) > > { > > /* > > * Earlier, this method returned > > -* BUF_PAGE_SIZE * buffer->nr_pages > > +* buffer->subbuf_size * buffer->nr_pages > > * Since the nr_pages field is now removed, we have converted this to > > * return the per cpu buffer value. > > Overenthusiastic global replace... Possibly, but the comment still applies, and should probably be removed, as it's rather old (2012). It's basically just saying that the size used to be calculated from buffer->nr_pages and now it's calculated by buffer->buffers[cpu]->nr_pages. I think I'll just add a patch to remove that comment. Thanks, -- Steve