Re: [PATCH] tracing: Have trace_event_file have ref counters
On Wed, 1 Nov 2023 22:32:54 -0400 Steven Rostedt wrote:

> > Ouch! I thought the file descriptor was held by the opened process.
>
> Well, the struct *filp is, but not the filp->private that points to the
> struct trace_event_file *file.

That was supposed to be "struct file *filp"

-- Steve
Re: [PATCH] tracing: Have trace_event_file have ref counters
On Thu, 2 Nov 2023 11:14:33 +0900 Masami Hiramatsu (Google) wrote:

> > What happens here is that the kprobe event creates a trace_event_file
> > "file" descriptor that represents the file in tracefs to the event. It
> > maintains state of the event (is it enabled for the given instance?).
> > Opening the "enable" file gets a reference to the event "file" descriptor
> > via the open file descriptor. When the kprobe event is deleted, the file is
> > also deleted from the tracefs system which also frees the event "file"
> > descriptor.
>
> Ouch! I thought the file descriptor was held by the opened process.

Well, the struct *filp is, but not the filp->private that points to the
struct trace_event_file *file.

> > But as the tracefs file is still opened by user space, it will not be
> > totally removed until the final dput() is called on it. But this is not
> > true with the event "file" descriptor that is already freed. If the user
> > does a write to or simply closes the file descriptor it will reference the
> > event "file" descriptor that was just freed, causing a use-after-free bug.
> >
> > To solve this, add a ref count to the event "file" descriptor as well as a
> > new flag called "FREED". The "file" will not be freed until the last
> > reference is released. But the FREED flag will be set when the event is
> > removed to prevent any more modifications to that event from happening,
> > even if there's still a reference to the event "file" descriptor.
> >
> > Link:
> > https://lore.kernel.org/linux-trace-kernel/2023103131.1e705...@gandalf.local.home/
>
> Looks good to me.
>
> Reviewed-by: Masami Hiramatsu (Google)
>
> BTW, can we add some tracefs selftests to ftracetest or independent test?

Yes, I was thinking about this, but that will have to wait till after this
gets in mainline. I'm already way behind schedule.

Thanks,

-- Steve
Re: [PATCH] tracing: Have trace_event_file have ref counters
On Tue, 31 Oct 2023 12:24:53 -0400 Steven Rostedt wrote:

> From: "Steven Rostedt (Google)"
>
> The following can crash the kernel:
>
>  # cd /sys/kernel/tracing
>  # echo 'p:sched schedule' > kprobe_events
>  # exec 5>>events/kprobes/sched/enable
>  # > kprobe_events
>  # exec 5>&-
>
> The above commands:
>
> 1. Change directory to the tracefs directory
> 2. Create a kprobe event (doesn't matter what one)
> 3. Open bash file descriptor 5 on the enable file of the kprobe event
> 4. Delete the kprobe event (removes the files too)
> 5. Close the bash file descriptor 5
>
> The above causes a crash!
>
> BUG: kernel NULL pointer dereference, address: 0028
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x) - not-present page
> PGD 0 P4D 0
> Oops: [#1] PREEMPT SMP PTI
> CPU: 6 PID: 877 Comm: bash Not tainted 6.5.0-rc4-test-8-g2c6b6b1029d4-dirty #186
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:tracing_release_file_tr+0xc/0x50
>
> What happens here is that the kprobe event creates a trace_event_file
> "file" descriptor that represents the file in tracefs to the event. It
> maintains state of the event (is it enabled for the given instance?).
> Opening the "enable" file gets a reference to the event "file" descriptor
> via the open file descriptor. When the kprobe event is deleted, the file is
> also deleted from the tracefs system which also frees the event "file"
> descriptor.

Ouch! I thought the file descriptor was held by the opened process.

> But as the tracefs file is still opened by user space, it will not be
> totally removed until the final dput() is called on it. But this is not
> true with the event "file" descriptor that is already freed. If the user
> does a write to or simply closes the file descriptor it will reference the
> event "file" descriptor that was just freed, causing a use-after-free bug.
> To solve this, add a ref count to the event "file" descriptor as well as a
> new flag called "FREED". The "file" will not be freed until the last
> reference is released. But the FREED flag will be set when the event is
> removed to prevent any more modifications to that event from happening,
> even if there's still a reference to the event "file" descriptor.
>
> Link:
> https://lore.kernel.org/linux-trace-kernel/2023103131.1e705...@gandalf.local.home/

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google)

BTW, can we add some tracefs selftests to ftracetest or independent test?

Thanks,

> Cc: sta...@vger.kernel.org
> Fixes: f5ca233e2e66d ("tracing: Increase trace array ref count on enable and
> filter files")
> Reported-by: Beau Belgrave
> Signed-off-by: Steven Rostedt (Google)
> ---
>  include/linux/trace_events.h       |  4
>  kernel/trace/trace.c               | 15 +++
>  kernel/trace/trace.h               |  3 +++
>  kernel/trace/trace_events.c        | 31 ++
>  kernel/trace/trace_events_filter.c |  3 +++
>  5 files changed, 52 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 12207dc6722d..696f8dc4aa53 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -492,6 +492,7 @@ enum {
>  	EVENT_FILE_FL_TRIGGER_COND_BIT,
>  	EVENT_FILE_FL_PID_FILTER_BIT,
>  	EVENT_FILE_FL_WAS_ENABLED_BIT,
> +	EVENT_FILE_FL_FREED_BIT,
>  };
>
>  extern struct trace_event_file *trace_get_event_file(const char *instance,
> @@ -630,6 +631,7 @@ extern int __kprobe_event_add_fields(struct dynevent_cmd *cmd, ...);
>   * TRIGGER_COND  - When set, one or more triggers has an associated filter
>   * PID_FILTER    - When set, the event is filtered based on pid
>   * WAS_ENABLED   - Set when enabled to know to clear trace on module removal
>   * FREED         - File descriptor is freed, all fields should be considered invalid
>   */
>  enum {
>  	EVENT_FILE_FL_ENABLED		= (1 << EVENT_FILE_FL_ENABLED_BIT),
> @@ -643,6 +645,7 @@ enum {
>  	EVENT_FILE_FL_TRIGGER_COND	= (1 <<
>  					   EVENT_FILE_FL_TRIGGER_COND_BIT),
>  	EVENT_FILE_FL_PID_FILTER	= (1 << EVENT_FILE_FL_PID_FILTER_BIT),
>  	EVENT_FILE_FL_WAS_ENABLED	= (1 << EVENT_FILE_FL_WAS_ENABLED_BIT),
> +	EVENT_FILE_FL_FREED		= (1 << EVENT_FILE_FL_FREED_BIT),
>  };
>
>  struct trace_event_file {
> @@ -671,6 +674,7 @@ struct trace_event_file {
>  	 * caching and such. Which is mostly OK ;-)
>  	 */
>  	unsigned long		flags;
> +	atomic_t		ref;	/* ref count for opened files */
>  	atomic_t		sm_ref;	/* soft-mode reference counter */
>  	atomic_t		tm_ref;	/* trigger-mode reference counter */
>  };
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 2539cfc20a97..9aebf904ff97 100644
> --- a/kernel/trace/trace.c
> +++
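The lifetime rule the patch introduces (the object is only freed when the last reference drops, while a FREED flag blocks any further modification once the event is removed) can be sketched in plain userspace C. All names below are illustrative stand-ins, not the kernel's actual helpers:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct trace_event_file's lifetime rules:
 * the object stays allocated while refs > 0, but once 'freed' is set
 * (the event was deleted), operations must fail instead of touching
 * now-meaningless state. */
struct event_file {
	atomic_int ref;
	atomic_bool freed;	/* mirrors EVENT_FILE_FL_FREED */
	int enabled;
};

static struct event_file *event_file_create(void)
{
	struct event_file *f = calloc(1, sizeof(*f));

	atomic_init(&f->ref, 1);	/* creation reference */
	atomic_init(&f->freed, false);
	return f;
}

static void event_file_get(struct event_file *f)
{
	atomic_fetch_add(&f->ref, 1);	/* e.g. an open() of "enable" */
}

static void event_file_put(struct event_file *f)
{
	if (atomic_fetch_sub(&f->ref, 1) == 1)
		free(f);	/* only the last reference really frees */
}

/* Event deletion: mark the object dead and drop the creation ref, but
 * any still-open file descriptors keep the memory alive. */
static void event_file_remove(struct event_file *f)
{
	atomic_store(&f->freed, true);
	event_file_put(f);
}

/* A write() must refuse to modify a removed event. */
static int event_file_enable(struct event_file *f, int on)
{
	if (atomic_load(&f->freed))
		return -1;	/* stands in for -ENODEV */
	f->enabled = on;
	return 0;
}
```

With this shape, the crash reproducer above becomes: open takes a reference, deleting the event sets the flag and drops the creation reference, and the later close merely drops the last reference instead of touching freed memory.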
Re: [PATCH v6 7/8] eventfs: Remove special processing of dput() of events directory
On Wed, 01 Nov 2023 13:25:48 -0400 Steven Rostedt wrote:

> From: "Steven Rostedt (Google)"
>
> The top level events directory is no longer special with regards to how it
> should be deleted. Remove the extra processing for it in
> eventfs_set_ei_status_free().
>

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google)

> Cc: Ajay Kaher
> Signed-off-by: Steven Rostedt (Google)
> ---
> Changes since v5: https://lkml.kernel.org/r/20231031223421.185413...@goodmis.org
>
> - Resynced to this patch series
>
>  fs/tracefs/event_inode.c | 19 ++-
>  1 file changed, 2 insertions(+), 17 deletions(-)
>
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 0a04ae0ca8c8..0087a3f455f1 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -274,28 +274,11 @@ static void free_ei(struct eventfs_inode *ei)
>   */
>  void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
>  {
> -	struct tracefs_inode *ti_parent;
>  	struct eventfs_inode *ei;
>  	int i;
>
> -	/* The top level events directory may be freed by this */
> -	if (unlikely(ti->flags & TRACEFS_EVENT_TOP_INODE)) {
> -		mutex_lock(&eventfs_mutex);
> -		ei = ti->private;
> -		/* Nothing should access this, but just in case! */
> -		ti->private = NULL;
> -		mutex_unlock(&eventfs_mutex);
> -
> -		free_ei(ei);
> -		return;
> -	}
> -
>  	mutex_lock(&eventfs_mutex);
>
> -	ti_parent = get_tracefs(dentry->d_parent->d_inode);
> -	if (!ti_parent || !(ti_parent->flags & TRACEFS_EVENT_INODE))
> -		goto out;
> -
>  	ei = dentry->d_fsdata;
>  	if (!ei)
>  		goto out;
> @@ -920,6 +903,8 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
>  	inode->i_op = &eventfs_root_dir_inode_operations;
>  	inode->i_fop = &eventfs_file_operations;
>
> +	dentry->d_fsdata = ei;
> +
>  	/* directory inodes start off with i_nlink == 2 (for "." entry) */
>  	inc_nlink(inode);
>  	d_instantiate(dentry, inode);
> --
> 2.42.0

-- Masami Hiramatsu (Google)
Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On Thu, 2023-11-02 at 09:16 +0800, Huang, Ying wrote:
> Vishal Verma writes:
> [..]
> > +static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
> > +					    u64 start, u64 size)
> > +{
> > +	unsigned long memblock_size = memory_block_size_bytes();
> > +	u64 cur_start;
> > +	int ret;
> > +
> > +	for (cur_start = start; cur_start < start + size;
> > +	     cur_start += memblock_size) {
> > +		struct mhp_params params = { .pgprot =
> > +					     pgprot_mhp(PAGE_KERNEL) };
> > +		struct vmem_altmap mhp_altmap = {
> > +			.base_pfn = PHYS_PFN(cur_start),
> > +			.end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> > +		};
> > +
> > +		mhp_altmap.free = memory_block_memmap_on_memory_pages();
> > +		params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> > +					GFP_KERNEL);
> > +		if (!params.altmap)
> > +			return -ENOMEM;
>
> Use "goto out" here too?

Hm, yes I suppose we want to clean up previous iterations of the loop -
I'll make this change.

> > +
> > +		/* call arch's memory hotadd */
> > +		ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> > +		if (ret < 0) {
> > +			kfree(params.altmap);
> > +			goto out;
> > +		}
> > +
> > +		/* create memory block devices after memory was added */
> > +		ret = create_memory_block_devices(cur_start, memblock_size,
> > +						  params.altmap, group);
> > +		if (ret) {
> > +			arch_remove_memory(cur_start, memblock_size, NULL);
> > +			kfree(params.altmap);
>
> How about move arch_remove_memory() and kfree() to error path and use
> different label?

I thought of this, but it got slightly awkward because of the scope of
'params' (declared/allocated within the loop), just kfree'ing in that
scope looked cleaner..
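The trade-off being discussed, freeing the current iteration's allocation in-loop versus in a shared error label, plus unwinding the iterations that already succeeded, can be shown generically. This is an illustrative userspace model, not the kernel code:

```c
#include <stdlib.h>

/* Illustration of chunked setup with unwind-on-failure: each loop
 * iteration allocates a per-chunk object; on failure we free the
 * current iteration's allocation in-loop (its lifetime is local to
 * the iteration), then undo all previous iterations via one label. */
static int add_chunks(int **slots, int nchunks, int fail_at)
{
	int i;

	for (i = 0; i < nchunks; i++) {
		int *chunk = malloc(sizeof(*chunk));

		/* fail_at simulates e.g. arch_add_memory() failing */
		if (!chunk || i == fail_at) {
			free(chunk);	/* free(NULL) is a no-op */
			goto undo;
		}
		*chunk = i;
		slots[i] = chunk;
	}
	return 0;

undo:
	/* unwind only the iterations that already succeeded */
	while (--i >= 0) {
		free(slots[i]);
		slots[i] = NULL;
	}
	return -1;
}
```

Keeping the `free()` of the current chunk inside the loop while the unwind lives behind the label mirrors the shape Vishal describes: the per-iteration object's cleanup stays in the scope that allocated it.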
Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
Vishal Verma writes: > The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to > 'memblock_size' chunks of memory being added. Adding a larger span of > memory precludes memmap_on_memory semantics. > > For users of hotplug such as kmem, large amounts of memory might get > added from the CXL subsystem. In some cases, this amount may exceed the > available 'main memory' to store the memmap for the memory being added. > In this case, it is useful to have a way to place the memmap on the > memory being added, even if it means splitting the addition into > memblock-sized chunks. > > Change add_memory_resource() to loop over memblock-sized chunks of > memory if caller requested memmap_on_memory, and if other conditions for > it are met. Teach try_remove_memory() to also expect that a memory > range being removed might have been split up into memblock sized chunks, > and to loop through those as needed. > > This does preclude being able to use PUD mappings in the direct map; a > proposal to how this could be optimized in the future is laid out > here[1]. 
> > [1]: > https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Dan Williams > Cc: Dave Jiang > Cc: Dave Hansen > Cc: Huang Ying > Suggested-by: David Hildenbrand > Reviewed-by: Dan Williams > Signed-off-by: Vishal Verma > --- > mm/memory_hotplug.c | 213 > ++-- > 1 file changed, 138 insertions(+), 75 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 6be7de9efa55..d242e49d7f7b 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned > long size) > return arch_supports_memmap_on_memory(vmemmap_size); > } > > +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) > +{ > + unsigned long memblock_size = memory_block_size_bytes(); > + u64 cur_start; > + > + /* > + * For memmap_on_memory, the altmaps were added on a per-memblock > + * basis; we have to process each individual memory block. > + */ > + for (cur_start = start; cur_start < start + size; > + cur_start += memblock_size) { > + struct vmem_altmap *altmap = NULL; > + struct memory_block *mem; > + > + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start))); > + WARN_ON_ONCE(!mem); > + if (!mem) > + continue; > + > + altmap = mem->altmap; > + mem->altmap = NULL; > + > + remove_memory_block_devices(cur_start, memblock_size); > + > + arch_remove_memory(cur_start, memblock_size, altmap); > + > + /* Verify that all vmemmap pages have actually been freed. 
*/ > + WARN(altmap->alloc, "Altmap not fully unmapped"); > + kfree(altmap); > + } > +} > + > +static int create_altmaps_and_memory_blocks(int nid, struct memory_group > *group, > + u64 start, u64 size) > +{ > + unsigned long memblock_size = memory_block_size_bytes(); > + u64 cur_start; > + int ret; > + > + for (cur_start = start; cur_start < start + size; > + cur_start += memblock_size) { > + struct mhp_params params = { .pgprot = > + pgprot_mhp(PAGE_KERNEL) }; > + struct vmem_altmap mhp_altmap = { > + .base_pfn = PHYS_PFN(cur_start), > + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1), > + }; > + > + mhp_altmap.free = memory_block_memmap_on_memory_pages(); > + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), > + GFP_KERNEL); > + if (!params.altmap) > + return -ENOMEM; Use "goto out" here too? > + > + /* call arch's memory hotadd */ > + ret = arch_add_memory(nid, cur_start, memblock_size, ); > + if (ret < 0) { > + kfree(params.altmap); > + goto out; > + } > + > + /* create memory block devices after memory was added */ > + ret = create_memory_block_devices(cur_start, memblock_size, > + params.altmap, group); > + if (ret) { > + arch_remove_memory(cur_start, memblock_size, NULL); > + kfree(params.altmap); How about move arch_remove_memory() and kree() to error path and use different label? -- Best Regards, Huang, Ying > + goto out; > + } > + } > + > + return 0; > +out: > + if (ret && (cur_start != start)) > + remove_memory_blocks_and_altmaps(start, cur_start -
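A minimal model of the removal side quoted above: each memblock-sized chunk carries its own altmap-like metadata, which is looked up per chunk, detached before teardown, and sanity-checked before being freed. Sizes, the lookup table, and all names are illustrative, not the kernel's:

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SZ 128UL	/* stand-in for memory_block_size_bytes() */
#define NCHUNKS  8

struct toy_altmap {
	unsigned long alloc;	/* pages still handed out from this altmap */
};

/* toy "find_memory_block": per-chunk metadata indexed by start address */
static struct toy_altmap *blocks[NCHUNKS];

static void remove_range(unsigned long start, unsigned long size)
{
	for (unsigned long cur = start; cur < start + size; cur += CHUNK_SZ) {
		unsigned long idx = cur / CHUNK_SZ;
		struct toy_altmap *altmap = blocks[idx];

		if (!altmap)	/* plays the WARN_ON_ONCE(!mem) role */
			continue;
		blocks[idx] = NULL;	/* detach before tearing down */

		/* ... remove devices, arch-level removal with altmap ... */

		/* verify everything was actually handed back first */
		if (altmap->alloc)
			fprintf(stderr, "altmap not fully unmapped\n");
		free(altmap);
	}
}
```

The point of the per-chunk walk is the same as in the patch: because the altmaps were added one per memblock, removal cannot treat the whole range as a single unit and must detach and free each chunk's metadata individually.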
[GIT PULL] NVDIMM for 6.7
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git tags/libnvdimm-for-6.7

... to get updates to the nvdimm tree. They are a mix of bug fixes and
updates to interfaces used by nvdimm.

Bug fixes include:

- Fix a sleep during spinlock in PREEMPT_RT kernels when getting a BTT lane
- Check for kstrdup memory failure in region probe
- Use the correct variables in the nvdimm_badblocks_populate() kdoc block

Updates to interfaces include:

- Use devm_kstrdup to manage memory better
- Allow class structures to be declared at build time
- Start using __counted_by to help with bounds checking
- Remove use of the deprecated strncpy

They have all been in -next for more than a week with no reported issues.

---

The following changes since commit 6465e260f48790807eef06b583b38ca9789b6072:

  Linux 6.6-rc3 (2023-09-24 14:31:13 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git tags/libnvdimm-for-6.7

for you to fetch changes up to 9ea459e477dc09370cdd8ee13b61aefe8cd1f20a:

  libnvdimm: remove kernel-doc warnings: (2023-10-18 09:48:05 -0700)

libnvdimm updates for v6.7

- updates to deprecated and changed interfaces
- bug/kdoc fixes

Chen Ni (1):
      libnvdimm/of_pmem: Use devm_kstrdup instead of kstrdup and check its return value

Greg Kroah-Hartman (1):
      testing: nvdimm: make struct class structures constant

Justin Stitt (1):
      dax: refactor deprecated strncpy

Kees Cook (1):
      libnvdimm: Annotate struct nd_region with __counted_by

Tomas Glozar (1):
      nd_btt: Make BTT lanes preemptible

Zhu Wang (1):
      libnvdimm: remove kernel-doc warnings:

 drivers/dax/bus.c                  |  2 +-
 drivers/nvdimm/badrange.c          |  4 ++--
 drivers/nvdimm/nd.h                |  2 +-
 drivers/nvdimm/of_pmem.c           |  8 +++-
 drivers/nvdimm/region_devs.c       | 10 +-
 tools/testing/nvdimm/test/ndtest.c | 17 +
 tools/testing/nvdimm/test/nfit.c   | 14 +++---
 7 files changed, 32 insertions(+), 25 deletions(-)
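Of the interface updates above, `__counted_by` may be the least familiar. A hedged sketch of how such an annotation is typically applied to a flexible array member follows; the feature-detection macro here is illustrative, not the kernel's own definition, and on toolchains without the attribute it simply expands to nothing:

```c
#include <stdlib.h>

/* Define a local __counted_by() wrapper: on compilers that support the
 * counted_by attribute, it tells the compiler which struct member holds
 * the flexible array's element count, enabling bounds diagnostics and
 * fortified accesses; elsewhere it is a no-op. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define my_counted_by(m) __attribute__((counted_by(m)))
# endif
#endif
#ifndef my_counted_by
# define my_counted_by(m)
#endif

/* Toy analogue of annotating struct nd_region's mapping array */
struct region_sketch {
	unsigned int nr_mappings;		/* element count */
	int mapping[] my_counted_by(nr_mappings);
};

static struct region_sketch *region_alloc(unsigned int n)
{
	struct region_sketch *r;

	r = malloc(sizeof(*r) + n * sizeof(r->mapping[0]));
	if (r)
		r->nr_mappings = n;	/* keep the count in sync */
	return r;
}
```

The behavior is unchanged either way; the annotation only gives the compiler enough information to check array accesses against the stored count.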
Re: [PATCH v6 5/8] eventfs: Hold eventfs_mutex when calling callback functions
On Wed, 01 Nov 2023 13:25:46 -0400 Steven Rostedt wrote:

> From: "Steven Rostedt (Google)"
>
> The callback function that is used to create inodes and dentries is not
> protected by anything and the data that is passed to it could become
> stale. After eventfs_remove_dir() is called by the tracing system, it is
> free to remove the events that are associated to that directory.
> Unfortunately, that means the callbacks must not be called after that.
>
>  CPU0                                  CPU1
>  ----                                  ----
>  eventfs_root_lookup() {
>                                        eventfs_remove_dir() {
>                                          mutex_lock(&eventfs_mutex);
>                                          ei->is_freed = set;
>                                          mutex_unlock(&eventfs_mutex);
>                                        }
>                                        kfree(event_call);
>
>    for (...) {
>      entry = &ei->entries[i];
>      r = entry->callback() {
>              call = data;      // call == event_call above
>              if (call->flags ...)
>
>  [ USE AFTER FREE BUG ]
>
> The safest way to protect this is to wrap the callback with:
>
>    mutex_lock(&eventfs_mutex);
>    if (!ei->is_freed)
>            r = entry->callback();
>    else
>            r = -1;
>    mutex_unlock(&eventfs_mutex);
>
> This will make sure that the callback will not be called after it is
> freed. But now it needs to be known that the callback is called while
> holding internal eventfs locks, and that it must not call back into the
> eventfs / tracefs system. There's no reason it should anyway, but document
> that as well.
>
> Link:
> https://lore.kernel.org/all/CA+G9fYu9GOEbD=rR5eMR-=HJ8H6rMsbzDC2ZY5=y50wpwae...@mail.gmail.com/
>

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google)

Thanks!
> Cc: Ajay Kaher > Reported-by: Linux Kernel Functional Testing > Reported-by: Naresh Kamboju > Tested-by: Linux Kernel Functional Testing > Tested-by: Naresh Kamboju > Signed-off-by: Steven Rostedt (Google) > --- > Changes since v5: > https://lkml.kernel.org/r/20231031223420.778161...@goodmis.org > > - Resynced to this patch series > > fs/tracefs/event_inode.c | 22 ++-- > include/linux/tracefs.h | 43 > 2 files changed, 63 insertions(+), 2 deletions(-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 93d08e552483..8ac9abf7a3d5 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -615,7 +615,13 @@ static struct dentry *eventfs_root_lookup(struct inode > *dir, > entry = >entries[i]; > if (strcmp(name, entry->name) == 0) { > void *cdata = data; > - r = entry->callback(name, , , ); > + mutex_lock(_mutex); > + /* If ei->is_freed, then the event itself may be too */ > + if (!ei->is_freed) > + r = entry->callback(name, , , ); > + else > + r = -1; > + mutex_unlock(_mutex); > if (r <= 0) > continue; > ret = simple_lookup(dir, dentry, flags); > @@ -749,7 +755,13 @@ static int dcache_dir_open_wrapper(struct inode *inode, > struct file *file) > void *cdata = data; > entry = >entries[i]; > name = entry->name; > - r = entry->callback(name, , , ); > + mutex_lock(_mutex); > + /* If ei->is_freed, then the event itself may be too */ > + if (!ei->is_freed) > + r = entry->callback(name, , , ); > + else > + r = -1; > + mutex_unlock(_mutex); > if (r <= 0) > continue; > d = create_file_dentry(ei, i, parent, name, mode, cdata, fops, > false); > @@ -819,6 +831,10 @@ static int dcache_readdir_wrapper(struct file *file, > struct dir_context *ctx) > * data = A pointer to @data, and the callback may replace it, which will > * cause the file created to pass the new data to the open() call. > * fops = the fops to use for the created file. > + * > + * NB. @callback is called while holding internal locks of the eventfs > + * system. 
The callback must not call any code that might also call into > + * the tracefs or eventfs system or it will risk creating a deadlock. > */ > struct eventfs_inode *eventfs_create_dir(const char *name, struct > eventfs_inode *parent, >const struct eventfs_entry *entries, > @@ -878,6 +894,8 @@ struct eventfs_inode *eventfs_create_dir(const char > *name, struct eventfs_inode > * @data: The default data to pass to the files (an entry may override it). > * > * This function creates the top of the trace event directory. > + * > + * See eventfs_create_dir()
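The wrapped-callback rule in this patch (invoke the callback only under the subsystem mutex, and only while the backing object is not yet freed) can be modeled in userspace C with pthreads. All names here are hypothetical stand-ins for the eventfs structures:

```c
#include <pthread.h>

/* Hypothetical analogue of an eventfs directory entry: a callback
 * plus the data it will dereference. */
struct dir_entry {
	int (*callback)(void *data);
	void *data;
};

struct dir_inode {
	pthread_mutex_t lock;	/* plays the eventfs_mutex role */
	int is_freed;
	struct dir_entry entry;
};

/* The callback is only ever run under the lock and only while the
 * object is alive; because the lock is held, the callback must not
 * re-enter any code that takes the same lock (deadlock otherwise). */
static int invoke_callback(struct dir_inode *ei)
{
	int r;

	pthread_mutex_lock(&ei->lock);
	/* if ei->is_freed, the data the callback reads may be gone too */
	if (!ei->is_freed)
		r = ei->entry.callback(ei->entry.data);
	else
		r = -1;
	pthread_mutex_unlock(&ei->lock);
	return r;
}

/* Removal: once this returns, freeing ->data is safe because no
 * callback can be running or can start afterwards. */
static void remove_dir(struct dir_inode *ei)
{
	pthread_mutex_lock(&ei->lock);
	ei->is_freed = 1;
	pthread_mutex_unlock(&ei->lock);
}

/* sample callback used in the usage example */
static int sample_cb(void *data)
{
	return *(int *)data;
}
```

Taking the flag and the callback invocation under the same mutex is what closes the CPU0/CPU1 race in the commit message: the remove side cannot set `is_freed` (and then free the data) in the middle of a callback, and the lookup side cannot start a callback after the flag is set.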
Re: [PATCH v5 4/7] eventfs: Save ownership and mode
On Thu, 2 Nov 2023 08:43:32 +0900 Masami Hiramatsu (Google) wrote:
> On Tue, 31 Oct 2023 18:33:30 -0400
> Steven Rostedt wrote:
>
> > From: "Steven Rostedt (Google)"
> >
> > Now that inodes and dentries are created on the fly, they are also
> > reclaimed on memory pressure. Since the ownership and file mode are saved
> > in the inode, if they are freed, any changes to the ownership and mode
> > will be lost.
>
> Do we (need to) allow to change the ownership and mode of the eventfs files?
> I thought it was fixed on the files in tracefs...

Yes, it's the only way to allow non-root users access to the tracing
directories.

> Otherwise, the code itself looks good to me.
>
> Reviewed-by: Masami Hiramatsu (Google)
>

Thanks!

-- Steve
Re: [PATCH v5 4/7] eventfs: Save ownership and mode
On Tue, 31 Oct 2023 18:33:30 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > Now that inodes and dentries are created on the fly, they are also > reclaimed on memory pressure. Since the ownership and file mode are saved > in the inode, if they are freed, any changes to the ownership and mode > will be lost. Do we (need to) allow to change the ownership and mode of the eventfs files? I thought it was fixed on the files in tracefs... Otherwise, the code itself looks good to me. Reviewed-by: Masami Hiramatsu (Google) Thanks, > > To counter this, if the user changes the permissions or ownership, save > them, and when creating the inodes again, restore those changes. > > Signed-off-by: Steven Rostedt (Google) > --- > Changes since v4: > https://lkml.kernel.org/r/20231031193428.558586...@goodmis.org > > - Rebased to this series > > fs/tracefs/event_inode.c | 148 +++ > fs/tracefs/internal.h| 16 + > 2 files changed, 151 insertions(+), 13 deletions(-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index b28f240bbb6d..d1683bf6d316 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -40,6 +40,15 @@ static DEFINE_MUTEX(eventfs_mutex); > */ > DEFINE_STATIC_SRCU(eventfs_srcu); > > +/* Mode is unsigned short, use the upper bits for flags */ > +enum { > + EVENTFS_SAVE_MODE = BIT(16), > + EVENTFS_SAVE_UID= BIT(17), > + EVENTFS_SAVE_GID= BIT(18), > +}; > + > +#define EVENTFS_MODE_MASK(EVENTFS_SAVE_MODE - 1) > + > static struct dentry *eventfs_root_lookup(struct inode *dir, > struct dentry *dentry, > unsigned int flags); > @@ -47,8 +56,89 @@ static int dcache_dir_open_wrapper(struct inode *inode, > struct file *file); > static int dcache_readdir_wrapper(struct file *file, struct dir_context > *ctx); > static int eventfs_release(struct inode *inode, struct file *file); > > +static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) > +{ > + unsigned int ia_valid = iattr->ia_valid; > + > + if (ia_valid & 
ATTR_MODE) { > + attr->mode = (attr->mode & ~EVENTFS_MODE_MASK) | > + (iattr->ia_mode & EVENTFS_MODE_MASK) | > + EVENTFS_SAVE_MODE; > + } > + if (ia_valid & ATTR_UID) { > + attr->mode |= EVENTFS_SAVE_UID; > + attr->uid = iattr->ia_uid; > + } > + if (ia_valid & ATTR_GID) { > + attr->mode |= EVENTFS_SAVE_GID; > + attr->gid = iattr->ia_gid; > + } > +} > + > +static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, > + struct iattr *iattr) > +{ > + const struct eventfs_entry *entry; > + struct eventfs_inode *ei; > + const char *name; > + int ret; > + > + mutex_lock(_mutex); > + ei = dentry->d_fsdata; > + /* The LSB is set when the eventfs_inode is being freed */ > + if (((unsigned long)ei & 1UL) || ei->is_freed) { > + /* Do not allow changes if the event is about to be removed. */ > + mutex_unlock(_mutex); > + return -ENODEV; > + } > + > + /* Preallocate the children mode array if necessary */ > + if (!(dentry->d_inode->i_mode & S_IFDIR)) { > + if (!ei->entry_attrs) { > + ei->entry_attrs = kzalloc(sizeof(*ei->entry_attrs) * > ei->nr_entries, > + GFP_KERNEL); > + if (!ei->entry_attrs) { > + ret = -ENOMEM; > + goto out; > + } > + } > + } > + > + ret = simple_setattr(idmap, dentry, iattr); > + if (ret < 0) > + goto out; > + > + /* > + * If this is a dir, then update the ei cache, only the file > + * mode is saved in the ei->m_children, and the ownership is > + * determined by the parent directory. 
> + */ > + if (dentry->d_inode->i_mode & S_IFDIR) { > + update_attr(>attr, iattr); > + > + } else { > + name = dentry->d_name.name; > + > + for (int i = 0; i < ei->nr_entries; i++) { > + entry = >entries[i]; > + if (strcmp(name, entry->name) == 0) { > + update_attr(>entry_attrs[i], iattr); > + break; > + } > + } > + } > + out: > + mutex_unlock(_mutex); > + return ret; > +} > + > static const struct inode_operations eventfs_root_dir_inode_operations = { > .lookup = eventfs_root_lookup, > + .setattr= eventfs_set_attr, > +}; > + > +static const struct inode_operations eventfs_file_inode_operations = { > + .setattr= eventfs_set_attr, > }; > > static const struct
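The "upper bits of the mode word" trick in the patch above (a file mode fits in an unsigned short, so bits 16 and up of a wider word are free to record which attributes the user explicitly set) can be shown standalone. The constants mirror the patch; the helper functions are illustrative, not the kernel code:

```c
/* Mode is an unsigned short; use the upper bits of a wider word as
 * "this attribute was explicitly saved" flags, mirroring the
 * EVENTFS_SAVE_* enum in the patch. */
enum {
	SAVE_MODE = 1 << 16,
	SAVE_UID  = 1 << 17,
	SAVE_GID  = 1 << 18,
};

#define MODE_MASK (SAVE_MODE - 1)	/* low 16 bits: the real mode */

/* Record a user-set mode: replace the low bits and remember that the
 * user set them, so a later re-created inode can restore the value. */
static unsigned int save_mode(unsigned int attr, unsigned short new_mode)
{
	return (attr & ~MODE_MASK) | (new_mode & MODE_MASK) | SAVE_MODE;
}

static int mode_was_saved(unsigned int attr)
{
	return !!(attr & SAVE_MODE);
}

static unsigned short get_mode(unsigned int attr)
{
	return (unsigned short)(attr & MODE_MASK);
}
```

The payoff is compactness: one word per entry carries both the permissions and the "restore this after the inode is reclaimed and re-created" bookkeeping, with no extra structure members.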
[PATCH v8 0/3] mm: use memmap_on_memory semantics for dax/kmem
The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'. There is a chance there isn't enough regular system memory
available to fit the memmap for this new memory. It's therefore
desirable, if all other conditions are met, for the kmem managed memory
to place its memmap on the newly added memory itself.

The main hurdle for accomplishing this for kmem is that memmap_on_memory
can only be done if the memory being added is equal to the size of one
memblock. To overcome this, allow the hotplug code to split an
add_memory() request into memblock-sized chunks, and try_remove_memory()
to also expect and handle such a scenario.

Patch 1 replaces an open-coded kmemdup()

Patch 2 teaches the memory_hotplug code to allow for splitting
add_memory() and remove_memory() requests over memblock sized chunks.

Patch 3 allows the dax region drivers to request memmap_on_memory
semantics. CXL dax regions default this to 'on', all others default to
off to keep existing behavior unchanged.

Signed-off-by: Vishal Verma
---
Changes in v8:
- Fix unwinding in create_altmaps_and_memory_blocks() to remove
  partially added blocks and altmaps. (David, Ying)
- Simplify remove_memory_blocks_and_altmaps() since an altmap is
  assured. (David)
- Since we remove per memory-block altmaps, the walk through memory
  blocks for a larger range isn't needed. Instead we can lookup the
  memory block directly from the pfn, allowing the test_has_altmap_cb()
  callback to be dropped and removed. (David)
- Link to v7: https://lore.kernel.org/r/20231025-vv-kmem_memmap-v7-0-4a76d7652...@intel.com

Changes in v7:
- Make the add_memory_resource() flow symmetrical w.r.t.
  try_remove_memory() in terms of how the altmap path is taken (David
  Hildenbrand)
- Move a comment, clean up usage of 'memblock' vs. 'memory_block'
  (David Hildenbrand)
- Don't use the altmap path for the
  mhp_supports_memmap_on_memory(memblock_size) == false case (Huang Ying)
- Link to v6: https://lore.kernel.org/r/20231016-vv-kmem_memmap-v6-0-078f0d3c0...@intel.com

Changes in v6:
- Add a prep patch to replace an open coded kmemdup in
  add_memory_resource() (Dan Williams)
- Fix ordering of firmware_map_remove w.r.t taking the hotplug lock
  (David Hildenbrand)
- Remove unused 'nid' variable, and a stray whitespace (David
  Hildenbrand)
- Clean up and simplify the altmap vs non-altmap paths for
  try_remove_memory (David Hildenbrand)
- Add a note to the changelog in patch 1 linking to the PUD mappings
  proposal (David Hildenbrand)
- Remove the new sysfs ABI from the kmem/dax drivers until ABI
  documentation for /sys/bus/dax can be established (will split this
  out into a separate patchset) (Dan Williams)
- Link to v5: https://lore.kernel.org/r/20231005-vv-kmem_memmap-v5-0-a54d1981f...@intel.com

Changes in v5:
- Separate out per-memblock operations from per memory block operations
  in try_remove_memory(), and rename the inner function appropriately.
  This does expand the scope of the memory hotplug lock to include
  remove_memory_block_devices(), but the alternative was to drop the
  lock in the inner function separately for each iteration, and then
  re-acquire it in try_remove_memory() creating a small window where
  the lock isn't held. (David Hildenbrand)
- Remove unnecessary rc check from the memmap_on_memory_store sysfs
  helper in patch 2 (Dan Carpenter)
- Link to v4: https://lore.kernel.org/r/20230928-vv-kmem_memmap-v4-0-6ff73fec5...@intel.com

Changes in v4:
- Rebase to Aneesh's PPC64 memmap_on_memory series v8 [2].
- Tweak a goto / error path in add_memory_create_devices() (Jonathan)
- Retain the old behavior for dax devices, only default to
  memmap_on_memory for CXL (Jonathan)
- Link to v3: https://lore.kernel.org/r/20230801-vv-kmem_memmap-v3-0-406e9aaf5...@intel.com

[2]: https://lore.kernel.org/linux-mm/20230808091501.287660-1-aneesh.ku...@linux.ibm.com

Changes in v3:
- Rebase on Aneesh's patches [1]
- Drop Patch 1 - it is not needed since [1] allows for dynamic setting
  of the memmap_on_memory param (David)
- Link to v2: https://lore.kernel.org/r/20230720-vv-kmem_memmap-v2-0-88bdaab34...@intel.com

[1]: https://lore.kernel.org/r/20230801044116.10674-1-aneesh.ku...@linux.ibm.com

Changes in v2:
- Drop the patch to create an override path for the memmap_on_memory
  module param (David)
- Move the chunking into memory_hotplug.c so that any caller of
  add_memory() can request this behavior. (David)
- Handle remove_memory() too. (David, Ying)
- Add a sysfs control in the kmem driver for memmap_on_memory semantics
  (David, Jonathan)
- Add a #else case to define mhp_supports_memmap_on_memory() if
  CONFIG_MEMORY_HOTPLUG is unset. (0day report)
- Link to v1: https://lore.kernel.org/r/20230613-vv-kmem_memmap-v1-0-f6de9c6af...@intel.com

---
Vishal Verma (3):
      mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()
      mm/memory_hotplug:
[PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. [1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Reviewed-by: Dan Williams Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 213 ++-- 1 file changed, 138 insertions(+), 75 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6be7de9efa55..d242e49d7f7b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) +{ + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + + /* +* For memmap_on_memory, the altmaps were added on a per-memblock +* basis; we have to process each individual memory block. 
+*/ + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct vmem_altmap *altmap = NULL; + struct memory_block *mem; + + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start))); + WARN_ON_ONCE(!mem); + if (!mem) + continue; + + altmap = mem->altmap; + mem->altmap = NULL; + + remove_memory_block_devices(cur_start, memblock_size); + + arch_remove_memory(cur_start, memblock_size, altmap); + + /* Verify that all vmemmap pages have actually been freed. */ + WARN(altmap->alloc, "Altmap not fully unmapped"); + kfree(altmap); + } +} + +static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group, + u64 start, u64 size) +{ + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + int ret; + + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct mhp_params params = { .pgprot = +pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(cur_start), + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1), + }; + + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), + GFP_KERNEL); + if (!params.altmap) + return -ENOMEM; + + /* call arch's memory hotadd */ + ret = arch_add_memory(nid, cur_start, memblock_size, ); + if (ret < 0) { + kfree(params.altmap); + goto out; + } + + /* create memory block devices after memory was added */ + ret = create_memory_block_devices(cur_start, memblock_size, + params.altmap, group); + if (ret) { + arch_remove_memory(cur_start, memblock_size, NULL); + kfree(params.altmap); + goto out; + } + } + + return 0; +out: + if (ret && (cur_start != start)) + remove_memory_blocks_and_altmaps(start, cur_start - start); + return ret; +} + /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations (triggered e.g. by sysfs). 
@@ -1390,10 +1468,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { struct mhp_params
[PATCH v8 3/3] dax/kmem: allow kmem to add memory with memmap_on_memory
Large amounts of memory managed by the kmem driver may come in via CXL, and it is often desirable to have the memmap for this memory on the new memory itself. Enroll kmem-managed memory for memmap_on_memory semantics if the dax region originates via CXL. For non-CXL dax regions, retain the existing default behavior of hot adding without memmap_on_memory semantics. Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Reviewed-by: Jonathan Cameron Reviewed-by: David Hildenbrand Reviewed-by: Huang, Ying Signed-off-by: Vishal Verma --- drivers/dax/bus.h | 1 + drivers/dax/dax-private.h | 1 + drivers/dax/bus.c | 3 +++ drivers/dax/cxl.c | 1 + drivers/dax/hmem/hmem.c | 1 + drivers/dax/kmem.c| 8 +++- drivers/dax/pmem.c| 1 + 7 files changed, 15 insertions(+), 1 deletion(-) diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 1ccd23360124..cbbf64443098 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -23,6 +23,7 @@ struct dev_dax_data { struct dev_pagemap *pgmap; resource_size_t size; int id; + bool memmap_on_memory; }; struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data); diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 27cf2d79..446617b73aea 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -70,6 +70,7 @@ struct dev_dax { struct ida ida; struct device dev; struct dev_pagemap *pgmap; + bool memmap_on_memory; int nr_range; struct dev_dax_range { unsigned long pgoff; diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 0ee96e6fc426..ad9f821b8c78 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -367,6 +367,7 @@ static ssize_t create_store(struct device *dev, struct device_attribute *attr, .dax_region = dax_region, .size = 0, .id = -1, + .memmap_on_memory = false, }; struct dev_dax *dev_dax = devm_create_dev_dax(); @@ -1400,6 +1401,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data) 
dev_dax->align = dax_region->align; ida_init(_dax->ida); + dev_dax->memmap_on_memory = data->memmap_on_memory; + inode = dax_inode(dax_dev); dev->devt = inode->i_rdev; dev->bus = _bus_type; diff --git a/drivers/dax/cxl.c b/drivers/dax/cxl.c index 8bc9d04034d6..c696837ab23c 100644 --- a/drivers/dax/cxl.c +++ b/drivers/dax/cxl.c @@ -26,6 +26,7 @@ static int cxl_dax_region_probe(struct device *dev) .dax_region = dax_region, .id = -1, .size = range_len(_dax->hpa_range), + .memmap_on_memory = true, }; return PTR_ERR_OR_ZERO(devm_create_dev_dax()); diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c index 5d2ddef0f8f5..b9da69f92697 100644 --- a/drivers/dax/hmem/hmem.c +++ b/drivers/dax/hmem/hmem.c @@ -36,6 +36,7 @@ static int dax_hmem_probe(struct platform_device *pdev) .dax_region = dax_region, .id = -1, .size = region_idle ? 0 : range_len(>range), + .memmap_on_memory = false, }; return PTR_ERR_OR_ZERO(devm_create_dev_dax()); diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index c57acb73e3db..0aa6c45a4e5a 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -56,6 +57,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) unsigned long total_len = 0; struct dax_kmem_data *data; int i, rc, mapped = 0; + mhp_t mhp_flags; int numa_node; /* @@ -136,12 +138,16 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) */ res->flags = IORESOURCE_SYSTEM_RAM; + mhp_flags = MHP_NID_IS_MGID; + if (dev_dax->memmap_on_memory) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; + /* * Ensure that future kexec'd kernels will not treat * this as RAM automatically. 
*/ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(), kmem_name, MHP_NID_IS_MGID); + range_len(), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c index ae0cb113a5d3..f3c6c67b8412 100644 --- a/drivers/dax/pmem.c +++ b/drivers/dax/pmem.c @@ -63,6 +63,7 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev) .id = id, .pgmap = ,
[PATCH v8 1/3] mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()
A review of the memmap_on_memory modifications to add_memory_resource() revealed an instance of an open-coded kmemdup(). Replace it with kmemdup().

Cc: Andrew Morton
Cc: David Hildenbrand
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Dan Williams
Reported-by: Dan Williams
Reviewed-by: David Hildenbrand
Signed-off-by: Vishal Verma
---
 mm/memory_hotplug.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f8d3e7427e32..6be7de9efa55 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1439,11 +1439,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
 		if (mhp_supports_memmap_on_memory(size)) {
 			mhp_altmap.free = memory_block_memmap_on_memory_pages();
-			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
+			params.altmap = kmemdup(&mhp_altmap,
+						sizeof(struct vmem_altmap),
+						GFP_KERNEL);
 			if (!params.altmap)
 				goto error;
-
-			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
 		}
 		/* fallback to not using altmap */
 	}
-- 
2.41.0
[GIT PULL] Modules changes for v6.7-rc1
The following changes since commit 7d461b291e65938f15f56fe58da2303b07578a76:

  Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm (2023-11-01 06:28:35 -1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/modules-6.7-rc1

for you to fetch changes up to ea0b0bcef4917a2640ecc100c768b8e785784834:

  module: Annotate struct module_notes_attrs with __counted_by (2023-11-01 13:07:32 -0700)

Modules changes for v6.7-rc1

The only thing worth highlighting is that gzip moves to use vmalloc() instead of kmalloc(), just as we had a fix for this for zstd in v6.6-rc1. The rest is regular housekeeping: keeping things neat, tidy, and boring. Oh, and this has been in linux-next for over a month.

Andrea Righi (1):
      module/decompress: use vmalloc() for gzip decompression workspace

Kees Cook (2):
      module: Clarify documentation of module_param_call()
      module: Annotate struct module_notes_attrs with __counted_by

Luis Chamberlain (1):
      MAINTAINERS: add include/linux/module*.h to modules

Tiezhu Yang (2):
      module: Make is_mapping_symbol() return bool
      module: Make is_valid_name() return bool

Zhu Mao (1):
      module: Fix comment typo

 MAINTAINERS                   | 2 +-
 include/linux/module_symbol.h | 2 +-
 include/linux/moduleparam.h   | 6 +-
 kernel/module/decompress.c    | 4 ++--
 kernel/module/stats.c         | 2 +-
 kernel/module/sysfs.c         | 2 +-
 scripts/mod/modpost.c         | 4 ++--
 7 files changed, 13 insertions(+), 9 deletions(-)
Re: [PATCH] eventfs: Fix kerneldoc of eventfs_remove_rec()
On Mon, 30 Oct 2023 21:57:13 +0530 Mukesh Ojha wrote: > On 10/30/2023 9:45 PM, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > The eventfs_remove_rec() had some missing parameters in the kerneldoc > > comment above it. Also, rephrase the description a bit more to have a bit > > more correct grammar. > > > > Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use > > eventfs_inode"); > > Reported-by: kernel test robot > > Closes: > > https://lore.kernel.org/oe-kbuild-all/202310052216.4sgqaswo-...@intel.com/ > > Signed-off-by: Steven Rostedt (Google) > > Reviewed-by: Mukesh Ojha Hi Mukesh! First, I want to thank you for your reviews. We certainly need more reviewers. But I need to also state that "Reviewed-by" tags should not be sent so lightly. The only times a Reviewed-by tag should be sent is if you participated in the discussion of the code, you have authored some of the code that is being modified, or are marked as a reviewer of the code in the MAINTAINERS file. For example, you added to the discussion here: https://lore.kernel.org/all/65dcdd9c-a75b-4fe7-bdcf-471a5602d...@linaro.org/ And adding your Reviewed-by tag is appropriate. But when a maintainer receives a Reviewed-by from someone they don't know, without any discussion in the patch, it may make that maintainer think that the person sending the Reviewed-by is only out to get listed in the LWN "Reviewed-by" count. I review other developers' code all the time, and unless the code touches something I worked on or I'm marked as a reviewer in the MAINTAINERS file, I do not send a Reviewed-by tag unless I added some input to the patch in question. My advice to you is to keep up the reviewing, I appreciate it (I really do!), but don't send out Reviewed-by tags unless you are marked as a reviewer of the code, or participated in a discussion on that code. Thanks, -- Steve
Re: [QUESTION] ftrace_test_recursion_trylock behaviour
On Wed, 1 Nov 2023 18:32:14 +0100 Jiri Olsa wrote: > hi, > I'm doing some testing on top of fprobes and noticed that the > ftrace_test_recursion_trylock allows caller from the same context > going through twice. > > The change below adds extra fprobe on stack_trace_print, which is > called within the sample_entry_handler and I can see it being executed > with following trace output: > ><...>-457 [003] ...1.32.352554: sample_entry_handler: > Enter ip = 0x81177420 <...>-457 > [003] ...2.32.352578: sample_entry_handler_extra: Enter > ip = 0x8127ae70 > > IOW nested ftrace_test_recursion_trylock call in the same context > succeeded. > > It seems the reason is the TRACE_CTX_TRANSITION bit logic. > > Just making sure it's intentional.. we have kprobe_multi code on top of > fprobe with another re-entry logic and that might behave differently based > on ftrace_test_recursion_trylock logic. Yes it's intentional, as it's a work around for an issue that may be cleared up now with Peter Zijlstra's noinstr updates. The use case for that TRACE_CTX_TRANSITION is when a function is traced just after an interrupt was triggered but before the preempt count was updated to let us know that we are in an interrupt context. Daniel Bristot reported a regression after the trylock was first introduced where the interrupt entry function was traced sometimes but not always. That's because if the interrupt happened normally, it would be traced, but if the interrupt happened when another event was being traced, the recursion logic would see that the trace of the interrupt was happening in the same context as the event it interrupted and drop the interrupt trace. But after the preempt count was updated, the other functions in the interrupt would be seen. This led to very confusing trace output. The solution to that was this workaround hack, where the trace recursion logic would allow a single recursion (the interrupt preempting another trace before it set preempt count). 
But with noinstr, there should be no more instances of this problem and we can drop that extra bit. But the last I checked, there were a few places that still could be traced without the preempt_count set. I'll have to re-investigate. -- Steve
[QUESTION] ftrace_test_recursion_trylock behaviour
hi, I'm doing some testing on top of fprobes and noticed that the ftrace_test_recursion_trylock allows caller from the same context going through twice. The change below adds extra fprobe on stack_trace_print, which is called within the sample_entry_handler and I can see it being executed with following trace output: <...>-457 [003] ...1.32.352554: sample_entry_handler: Enter ip = 0x81177420 <...>-457 [003] ...2.32.352578: sample_entry_handler_extra: Enter ip = 0x8127ae70 IOW nested ftrace_test_recursion_trylock call in the same context succeeded. It seems the reason is the TRACE_CTX_TRANSITION bit logic. Just making sure it's intentional.. we have kprobe_multi code on top of fprobe with another re-entry logic and that might behave differently based on ftrace_test_recursion_trylock logic. thanks, jirka --- diff --git a/samples/fprobe/fprobe_example.c b/samples/fprobe/fprobe_example.c index 64e715e7ed11..531272af0682 100644 --- a/samples/fprobe/fprobe_example.c +++ b/samples/fprobe/fprobe_example.c @@ -23,6 +23,9 @@ static struct fprobe sample_probe; static unsigned long nhit; +static struct fprobe sample_probe_extra; +static char *symbol_extra = "stack_trace_print"; + static char symbol[MAX_SYMBOL_LEN] = "kernel_clone"; module_param_string(symbol, symbol, sizeof(symbol), 0644); MODULE_PARM_DESC(symbol, "Probed symbol(s), given by comma separated symbols or a wildcard pattern."); @@ -48,6 +51,15 @@ static void show_backtrace(void) stack_trace_print(stacks, len, 24); } +static int sample_entry_handler_extra(struct fprobe *fp, unsigned long ip, + unsigned long ret_ip, + struct pt_regs *regs, void *data) +{ + if (use_trace) + trace_printk("Enter <%pS> ip = 0x%p\n", (void *)ip, (void *)ip); + return 0; +} + static int sample_entry_handler(struct fprobe *fp, unsigned long ip, unsigned long ret_ip, struct pt_regs *regs, void *data) @@ -96,6 +108,8 @@ static int __init fprobe_init(void) sample_probe.entry_handler = sample_entry_handler; sample_probe.exit_handler = 
sample_exit_handler; + sample_probe_extra.entry_handler = sample_entry_handler_extra; + if (strchr(symbol, '*')) { /* filter based fprobe */ ret = register_fprobe(_probe, symbol, @@ -137,12 +151,19 @@ static int __init fprobe_init(void) else pr_info("Planted fprobe at %s\n", symbol); + ret = register_fprobe_syms(_probe_extra, (const char **) _extra, 1); + if (ret < 0) + pr_err("register_fprobe extra failed, returned %d\n", ret); + else + pr_info("Planted extra fprobe at %s\n", symbol_extra); + return ret; } static void __exit fprobe_exit(void) { unregister_fprobe(_probe); + unregister_fprobe(_probe_extra); pr_info("fprobe at %s unregistered. %ld times hit, %ld times missed\n", symbol, nhit, sample_probe.nmissed);
[PATCH v6 8/8] eventfs: Use simple_recursive_removal() to clean up dentries
From: "Steven Rostedt (Google)"

Looking at how a dentry is removed via the tracefs system, I found that eventfs does not do everything that it did under tracefs. The tracefs removal of a dentry calls simple_recursive_removal(), which does a lot more than a simple d_invalidate().

It is a requirement that if any eventfs_inode has a dentry, so does its parent. When removing an eventfs_inode, if it has a dentry, a call to simple_recursive_removal() on that dentry should clean up all the dentries underneath it.

Add WARN_ON_ONCE() to check for the parent having a dentry if any children do.

Link: https://lore.kernel.org/all/20231101022553.GE1957730@ZenIV/
Cc: sta...@vger.kernel.org
Cc: Al Viro
Fixes: 5bdcd5f5331a2 ("eventfs: Implement removal of meta data from eventfs")
Signed-off-by: Steven Rostedt (Google)
---
Changes since the last patch: https://lore.kernel.org/linux-trace-kernel/20231031144703.71eef...@gandalf.local.home

- Was originally called: eventfs: Process deletion of dentry more thoroughly

- Al Viro pointed out that I could use simple_recursive_removal() instead. I had originally thought that I could not, but looking deeper into it I realized that if a dentry exists on any eventfs_inode, then all the parent eventfs_inodes of it would also have a dentry. Hence, calling simple_recursive_removal() on the top dentry would clean up all the children dentries as well. Doing it this way cleans up the code quite a bit!
fs/tracefs/event_inode.c | 77 +++- fs/tracefs/internal.h| 2 -- 2 files changed, 44 insertions(+), 35 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 0087a3f455f1..f8a594a50ae6 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -967,30 +967,29 @@ static void unhook_dentry(struct dentry *dentry) { if (!dentry) return; - - /* Keep the dentry from being freed yet (see eventfs_workfn()) */ + /* +* Need to add a reference to the dentry that is expected by +* simple_recursive_removal(), which will include a dput(). +*/ dget(dentry); - dentry->d_fsdata = NULL; - d_invalidate(dentry); - mutex_lock(_mutex); - /* dentry should now have at least a single reference */ - WARN_ONCE((int)d_count(dentry) < 1, - "dentry %px (%s) less than one reference (%d) after invalidate\n", - dentry, dentry->d_name.name, d_count(dentry)); - mutex_unlock(_mutex); + /* +* Also add a reference for the dput() in eventfs_workfn(). +* That is required as that dput() will free the ei after +* the SRCU grace period is over. +*/ + dget(dentry); } /** * eventfs_remove_rec - remove eventfs dir or file from list * @ei: eventfs_inode to be removed. - * @head: the list head to place the deleted @ei and children * @level: prevent recursion from going more than 3 levels deep. * * This function recursively removes eventfs_inodes which * contains info of files and/or directories. 
*/ -static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head *head, int level) +static void eventfs_remove_rec(struct eventfs_inode *ei, int level) { struct eventfs_inode *ei_child; @@ -1009,13 +1008,26 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head *head, /* search for nested folders or files */ list_for_each_entry_srcu(ei_child, >children, list, lockdep_is_held(_mutex)) { - eventfs_remove_rec(ei_child, head, level + 1); + /* Children only have dentry if parent does */ + WARN_ON_ONCE(ei_child->dentry && !ei->dentry); + eventfs_remove_rec(ei_child, level + 1); } + ei->is_freed = 1; + for (int i = 0; i < ei->nr_entries; i++) { + if (ei->d_children[i]) { + /* Children only have dentry if parent does */ + WARN_ON_ONCE(!ei->dentry); + unhook_dentry(ei->d_children[i]); + } + } + + unhook_dentry(ei->dentry); + list_del_rcu(>list); - list_add_tail(>del_list, head); + call_srcu(_srcu, >rcu, free_rcu_ei); } /** @@ -1026,30 +1038,22 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head *head, */ void eventfs_remove_dir(struct eventfs_inode *ei) { - struct eventfs_inode *tmp; - LIST_HEAD(ei_del_list); + struct dentry *dentry; if (!ei) return; - /* -* Move the deleted eventfs_inodes onto the ei_del_list -* which will also set the is_freed value. Note, this has to be -* done under the eventfs_mutex, but the deletions of -* the dentries must be done outside the eventfs_mutex. -* Hence moving them to this temporary
[PATCH v6 4/8] eventfs: Save ownership and mode
From: "Steven Rostedt (Google)" Now that inodes and dentries are created on the fly, they are also reclaimed on memory pressure. Since the ownership and file mode are saved in the inode, if they are freed, any changes to the ownership and mode will be lost. To counter this, if the user changes the permissions or ownership, save them, and when creating the inodes again, restore those changes. Cc: Ajay Kaher Signed-off-by: Steven Rostedt (Google) --- Changes since v5: https://lkml.kernel.org/r/20231031223420.568912...@goodmis.org - Resynced to this patch series fs/tracefs/event_inode.c | 148 +++ fs/tracefs/internal.h| 16 + 2 files changed, 151 insertions(+), 13 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index e9625732c52d..93d08e552483 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -40,6 +40,15 @@ static DEFINE_MUTEX(eventfs_mutex); */ DEFINE_STATIC_SRCU(eventfs_srcu); +/* Mode is unsigned short, use the upper bits for flags */ +enum { + EVENTFS_SAVE_MODE = BIT(16), + EVENTFS_SAVE_UID= BIT(17), + EVENTFS_SAVE_GID= BIT(18), +}; + +#define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) + static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); @@ -47,8 +56,89 @@ static int dcache_dir_open_wrapper(struct inode *inode, struct file *file); static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx); static int eventfs_release(struct inode *inode, struct file *file); +static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) +{ + unsigned int ia_valid = iattr->ia_valid; + + if (ia_valid & ATTR_MODE) { + attr->mode = (attr->mode & ~EVENTFS_MODE_MASK) | + (iattr->ia_mode & EVENTFS_MODE_MASK) | + EVENTFS_SAVE_MODE; + } + if (ia_valid & ATTR_UID) { + attr->mode |= EVENTFS_SAVE_UID; + attr->uid = iattr->ia_uid; + } + if (ia_valid & ATTR_GID) { + attr->mode |= EVENTFS_SAVE_GID; + attr->gid = iattr->ia_gid; + } +} + +static int 
eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, + struct iattr *iattr) +{ + const struct eventfs_entry *entry; + struct eventfs_inode *ei; + const char *name; + int ret; + + mutex_lock(_mutex); + ei = dentry->d_fsdata; + /* The LSB is set when the eventfs_inode is being freed */ + if (((unsigned long)ei & 1UL) || ei->is_freed) { + /* Do not allow changes if the event is about to be removed. */ + mutex_unlock(_mutex); + return -ENODEV; + } + + /* Preallocate the children mode array if necessary */ + if (!(dentry->d_inode->i_mode & S_IFDIR)) { + if (!ei->entry_attrs) { + ei->entry_attrs = kzalloc(sizeof(*ei->entry_attrs) * ei->nr_entries, + GFP_KERNEL); + if (!ei->entry_attrs) { + ret = -ENOMEM; + goto out; + } + } + } + + ret = simple_setattr(idmap, dentry, iattr); + if (ret < 0) + goto out; + + /* +* If this is a dir, then update the ei cache, only the file +* mode is saved in the ei->m_children, and the ownership is +* determined by the parent directory. +*/ + if (dentry->d_inode->i_mode & S_IFDIR) { + update_attr(>attr, iattr); + + } else { + name = dentry->d_name.name; + + for (int i = 0; i < ei->nr_entries; i++) { + entry = >entries[i]; + if (strcmp(name, entry->name) == 0) { + update_attr(>entry_attrs[i], iattr); + break; + } + } + } + out: + mutex_unlock(_mutex); + return ret; +} + static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, + .setattr= eventfs_set_attr, +}; + +static const struct inode_operations eventfs_file_inode_operations = { + .setattr= eventfs_set_attr, }; static const struct file_operations eventfs_file_operations = { @@ -59,10 +149,30 @@ static const struct file_operations eventfs_file_operations = { .release= eventfs_release, }; +static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, umode_t mode) +{ + if (!attr) { + inode->i_mode = mode; + return; + } + + if (attr->mode &
[PATCH v6 5/8] eventfs: Hold eventfs_mutex when calling callback functions
From: "Steven Rostedt (Google)"

The callback function that is used to create inodes and dentries is not protected by anything, and the data that is passed to it could become stale. After eventfs_remove_dir() is called by the tracing system, it is free to remove the events that are associated to that directory. Unfortunately, that means the callbacks must not be called after that:

  CPU0                               CPU1
  ----                               ----
  eventfs_root_lookup() {
                                     eventfs_remove_dir() {
                                         mutex_lock(&eventfs_mutex);
                                         ei->is_freed = set;
                                         mutex_unlock(&eventfs_mutex);
                                     }
                                     kfree(event_call);
      for (...) {
          entry = &ei->entries[i];
          r = entry->callback() {
                  call = data;  // call == event_call above
                  if (call->flags ...)

  [ USE AFTER FREE BUG ]

The safest way to protect this is to wrap the callback with:

  mutex_lock(&eventfs_mutex);
  if (!ei->is_freed)
          r = entry->callback();
  else
          r = -1;
  mutex_unlock(&eventfs_mutex);

This will make sure that the callback will not be called after its data is freed. But now it needs to be known that the callback is called while holding internal eventfs locks, and that it must not call back into the eventfs / tracefs system. There's no reason it should anyway, but document that as well.
Link: https://lore.kernel.org/all/CA+G9fYu9GOEbD=rR5eMR-=HJ8H6rMsbzDC2ZY5=y50wpwae...@mail.gmail.com/ Cc: Ajay Kaher Reported-by: Linux Kernel Functional Testing Reported-by: Naresh Kamboju Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Signed-off-by: Steven Rostedt (Google) --- Changes since v5: https://lkml.kernel.org/r/20231031223420.778161...@goodmis.org - Resynced to this patch series fs/tracefs/event_inode.c | 22 ++-- include/linux/tracefs.h | 43 2 files changed, 63 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 93d08e552483..8ac9abf7a3d5 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -615,7 +615,13 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, entry = >entries[i]; if (strcmp(name, entry->name) == 0) { void *cdata = data; - r = entry->callback(name, , , ); + mutex_lock(_mutex); + /* If ei->is_freed, then the event itself may be too */ + if (!ei->is_freed) + r = entry->callback(name, , , ); + else + r = -1; + mutex_unlock(_mutex); if (r <= 0) continue; ret = simple_lookup(dir, dentry, flags); @@ -749,7 +755,13 @@ static int dcache_dir_open_wrapper(struct inode *inode, struct file *file) void *cdata = data; entry = >entries[i]; name = entry->name; - r = entry->callback(name, , , ); + mutex_lock(_mutex); + /* If ei->is_freed, then the event itself may be too */ + if (!ei->is_freed) + r = entry->callback(name, , , ); + else + r = -1; + mutex_unlock(_mutex); if (r <= 0) continue; d = create_file_dentry(ei, i, parent, name, mode, cdata, fops, false); @@ -819,6 +831,10 @@ static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx) * data = A pointer to @data, and the callback may replace it, which will * cause the file created to pass the new data to the open() call. * fops = the fops to use for the created file. + * + * NB. @callback is called while holding internal locks of the eventfs + * system. 
The callback must not call any code that might also call into + * the tracefs or eventfs system or it will risk creating a deadlock. */ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent, const struct eventfs_entry *entries, @@ -878,6 +894,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode * @data: The default data to pass to the files (an entry may override it). * * This function creates the top of the trace event directory. + * + * See eventfs_create_dir() for use of @entries. */ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent, const struct eventfs_entry *entries, diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 13359b1a35d1..7a5fe17b6bf9
[PATCH v6 6/8] eventfs: Delete eventfs_inode when the last dentry is freed
From: "Steven Rostedt (Google)" There exists a race between holding a reference of an eventfs_inode dentry and the freeing of the eventfs_inode. If user space has a dentry held long enough, it may still be able to access the dentry's eventfs_inode after it has been freed. To prevent this, have he eventfs_inode freed via the last dput() (or via RCU if the eventfs_inode does not have a dentry). This means reintroducing the eventfs_inode del_list field at a temporary place to put the eventfs_inode. It needs to mark it as freed (via the list) but also must invalidate the dentry immediately as the return from eventfs_remove_dir() expects that they are. But the dentry invalidation must not be called under the eventfs_mutex, so it must be done after the eventfs_inode is marked as free (put on a deletion list). Cc: sta...@vger.kernel.org Cc: Ajay Kaher Fixes: 5bdcd5f5331a2 ("eventfs: Implement removal of meta data from eventfs") Signed-off-by: Steven Rostedt (Google) --- Changes since v5: https://lkml.kernel.org/r/20231031223420.988874...@goodmis.org - Rebased for this patch series fs/tracefs/event_inode.c | 146 ++- fs/tracefs/internal.h| 2 + 2 files changed, 69 insertions(+), 79 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 8ac9abf7a3d5..0a04ae0ca8c8 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -85,8 +85,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, mutex_lock(_mutex); ei = dentry->d_fsdata; - /* The LSB is set when the eventfs_inode is being freed */ - if (((unsigned long)ei & 1UL) || ei->is_freed) { + if (ei->is_freed) { /* Do not allow changes if the event is about to be removed. 
*/ mutex_unlock(_mutex); return -ENODEV; @@ -276,35 +275,17 @@ static void free_ei(struct eventfs_inode *ei) void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) { struct tracefs_inode *ti_parent; - struct eventfs_inode *ei_child, *tmp; struct eventfs_inode *ei; int i; /* The top level events directory may be freed by this */ if (unlikely(ti->flags & TRACEFS_EVENT_TOP_INODE)) { - LIST_HEAD(ef_del_list); - mutex_lock(_mutex); - ei = ti->private; - - /* Record all the top level files */ - list_for_each_entry_srcu(ei_child, >children, list, -lockdep_is_held(_mutex)) { - list_add_tail(_child->del_list, _del_list); - } - /* Nothing should access this, but just in case! */ ti->private = NULL; - mutex_unlock(_mutex); - /* Now safely free the top level files and their children */ - list_for_each_entry_safe(ei_child, tmp, _del_list, del_list) { - list_del(_child->del_list); - eventfs_remove_dir(ei_child); - } - free_ei(ei); return; } @@ -319,14 +300,6 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) if (!ei) goto out; - /* -* If ei was freed, then the LSB bit is set for d_fsdata. -* But this should not happen, as it should still have a -* ref count that prevents it. Warn in case it does. 
-*/ - if (WARN_ON_ONCE((unsigned long)ei & 1)) - goto out; - /* This could belong to one of the files of the ei */ if (ei->dentry != dentry) { for (i = 0; i < ei->nr_entries; i++) { @@ -336,6 +309,8 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) if (WARN_ON_ONCE(i == ei->nr_entries)) goto out; ei->d_children[i] = NULL; + } else if (ei->is_freed) { + free_ei(ei); } else { ei->dentry = NULL; } @@ -962,13 +937,65 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry return ERR_PTR(-ENOMEM); } +static LLIST_HEAD(free_list); + +static void eventfs_workfn(struct work_struct *work) +{ +struct eventfs_inode *ei, *tmp; +struct llist_node *llnode; + + llnode = llist_del_all(_list); +llist_for_each_entry_safe(ei, tmp, llnode, llist) { + /* This dput() matches the dget() from unhook_dentry() */ + for (int i = 0; i < ei->nr_entries; i++) { + if (ei->d_children[i]) + dput(ei->d_children[i]); + } + /* This should only get here if it had a dentry */ + if (!WARN_ON_ONCE(!ei->dentry)) + dput(ei->dentry); +} +} + +static
[PATCH v6 3/8] eventfs: Test for ei->is_freed when accessing ei->dentry
From: "Steven Rostedt (Google)" The eventfs_inode (ei) is protected by SRCU, but the ei->dentry is not. It is protected by the eventfs_mutex. Anytime the eventfs_mutex is released, and access to the ei->dentry needs to be done, it should first check if ei->is_freed is set under the eventfs_mutex. If it is, then the ei->dentry is invalid and must not be used. The ei->dentry must only be accessed under the eventfs_mutex and after checking if ei->is_freed is set. When the ei is being freed, it will (under the eventfs_mutex) set is_freed and at the same time move the dentry to a free list to be cleared after the eventfs_mutex is released. This means that any access to the ei->dentry must check first if ei->is_freed is set, because if it is, then the dentry is on its way to be freed. Also add comments to describe this better. Link: https://lore.kernel.org/all/CA+G9fYt6pY+tMZEOg=soeywqoe19fgp3ur15sgowkdk+_x8...@mail.gmail.com/ Link: https://lore.kernel.org/all/CA+G9fYuDP3hVQ3t7FfrBAjd_WFVSurMgCepTxunSJf=MTe=6...@mail.gmail.com/ Cc: Ajay Kaher Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Linux Kernel Functional Testing Reported-by: Naresh Kamboju Reported-by: Beau Belgrave Reviewed-by: Masami Hiramatsu (Google) Tested-by: Linux Kernel Functional Testing Tested-by: Naresh Kamboju Tested-by: Beau Belgrave Signed-off-by: Steven Rostedt (Google) --- Changes since v5: https://lkml.kernel.org/r/20231031223420.347714...@goodmis.org - Resynced to this patch series fs/tracefs/event_inode.c | 45 ++-- fs/tracefs/internal.h| 3 ++- 2 files changed, 41 insertions(+), 7 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index dd5971855732..e9625732c52d 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -24,7 +24,20 @@ #include #include "internal.h" +/* + * eventfs_mutex protects the eventfs_inode (ei) dentry. 
Any access + * to the ei->dentry must be done under this mutex and after checking + * if ei->is_freed is not set. The ei->dentry is released under the + * mutex at the same time ei->is_freed is set. If ei->is_freed is set + * then the ei->dentry is invalid. + */ static DEFINE_MUTEX(eventfs_mutex); + +/* + * The eventfs_inode (ei) itself is protected by SRCU. It is released from + * its parent's list and will have is_freed set (under eventfs_mutex). + * After the SRCU grace period is over, the ei may be freed. + */ DEFINE_STATIC_SRCU(eventfs_srcu); static struct dentry *eventfs_root_lookup(struct inode *dir, @@ -239,6 +252,10 @@ create_file_dentry(struct eventfs_inode *ei, struct dentry **e_dentry, bool invalidate = false; mutex_lock(&eventfs_mutex); + if (ei->is_freed) { + mutex_unlock(&eventfs_mutex); + return NULL; + } /* If the e_dentry already has a dentry, use it */ if (*e_dentry) { /* lookup does not need to up the ref count */ @@ -312,6 +329,8 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) struct eventfs_inode *ei_child; struct tracefs_inode *ti; + lockdep_assert_held(&eventfs_mutex); + /* srcu lock already held */ /* fill parent-child relation */ list_for_each_entry_srcu(ei_child, &ei->children, list, @@ -325,6 +344,7 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) /** * create_dir_dentry - Create a directory dentry for the eventfs_inode + * @pei: The eventfs_inode parent of ei. * @ei: The eventfs_inode to create the directory for * @parent: The dentry of the parent of this directory * @lookup: True if this is called by the lookup code @@ -332,12 +352,17 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) * This creates and attaches a directory dentry to the eventfs_inode @ei. 
*/ static struct dentry * -create_dir_dentry(struct eventfs_inode *ei, struct dentry *parent, bool lookup) +create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, + struct dentry *parent, bool lookup) { bool invalidate = false; struct dentry *dentry = NULL; mutex_lock(&eventfs_mutex); + if (pei->is_freed || ei->is_freed) { + mutex_unlock(&eventfs_mutex); + return NULL; + } if (ei->dentry) { /* If the dentry already has a dentry, use it */ dentry = ei->dentry; @@ -440,7 +465,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, */ mutex_lock(&eventfs_mutex); ei = READ_ONCE(ti->private); - if (ei) + if (ei && !ei->is_freed) ei_dentry = READ_ONCE(ei->dentry); mutex_unlock(&eventfs_mutex); @@ -454,7 +479,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, if (strcmp(ei_child->name, name) != 0) continue; ret = simple_lookup(dir, dentry, flags); -
[PATCH v6 2/8] eventfs: Have a free_ei() that just frees the eventfs_inode
From: "Steven Rostedt (Google)" As the eventfs_inode is freed in two different locations, make a helper function free_ei() to make sure all the allocated fields of the eventfs_inode are freed. This requires renaming the existing free_ei() which is called by the srcu handler to free_rcu_ei() and have free_ei() just do the freeing, where free_rcu_ei() will call it. Cc: Ajay Kaher Reviewed-by: Masami Hiramatsu (Google) Signed-off-by: Steven Rostedt (Google) --- Changes since v5: https://lkml.kernel.org/r/20231031223420.141626...@goodmis.org - Resynced to this patch series fs/tracefs/event_inode.c | 19 +++ 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1ce73acf3df0..dd5971855732 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -129,6 +129,13 @@ static struct dentry *create_dir(const char *name, struct dentry *parent) return eventfs_end_creating(dentry); } +static void free_ei(struct eventfs_inode *ei) +{ + kfree_const(ei->name); + kfree(ei->d_children); + kfree(ei); +} + /** * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode * @ti: the tracefs_inode of the dentry @@ -168,9 +175,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) eventfs_remove_dir(ei_child); } - kfree_const(ei->name); - kfree(ei->d_children); - kfree(ei); + free_ei(ei); return; } @@ -784,13 +789,11 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry return ERR_PTR(-ENOMEM); } -static void free_ei(struct rcu_head *head) +static void free_rcu_ei(struct rcu_head *head) { struct eventfs_inode *ei = container_of(head, struct eventfs_inode, rcu); - kfree_const(ei->name); - kfree(ei->d_children); - kfree(ei); + free_ei(ei); } /** @@ -881,7 +884,7 @@ void eventfs_remove_dir(struct eventfs_inode *ei) for (i = 0; i < ei->nr_entries; i++) unhook_dentry(&ei->d_children[i], &dentry_list); unhook_dentry(&ei->dentry, &dentry_list); - call_srcu(&eventfs_srcu, 
&ei->rcu, free_ei); + call_srcu(&eventfs_srcu, &ei->rcu, free_rcu_ei); } mutex_unlock(&eventfs_mutex); -- 2.42.0
[PATCH v6 1/8] eventfs: Remove "is_freed" union with rcu head
From: "Steven Rostedt (Google)" The eventfs_inode->is_freed was a union with the rcu_head with the assumption that when it was on the srcu list the head would contain a pointer which would make "is_freed" true. But that was a wrong assumption as the rcu head is a singly linked list where the last element is NULL. Instead, split the nr_entries integer so that "is_freed" is one bit and the nr_entries is the next 31 bits. As there shouldn't be more than 10 (currently there's at most 5 to 7 depending on the config), this should not be a problem. Cc: sta...@vger.kernel.org Cc: Ajay Kaher Fixes: 63940449555e7 ("eventfs: Implement eventfs lookup, read, open functions") Reviewed-by: Masami Hiramatsu (Google) Signed-off-by: Steven Rostedt (Google) --- Changes since v5: https://lkml.kernel.org/r/20231031223419.935276...@goodmis.org - Resynced for this patch series fs/tracefs/event_inode.c | 2 ++ fs/tracefs/internal.h| 6 +++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 9f612a8f009d..1ce73acf3df0 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -824,6 +824,8 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head *head, eventfs_remove_rec(ei_child, head, level + 1); } + ei->is_freed = 1; + list_del_rcu(&ei->list); list_add_tail(&ei->del_list, head); } diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 64fde9490f52..c7d88aaa949f 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -23,6 +23,7 @@ struct tracefs_inode { * @d_parent: pointer to the parent's dentry * @d_children: The array of dentries to represent the files when created * @data: The private data to pass to the callbacks + * @is_freed: Flag set if the eventfs is on its way to be freed * @nr_entries: The number of items in @entries */ struct eventfs_inode { @@ -38,14 +39,13 @@ struct eventfs_inode { * Union - used for deletion * @del_list: list of eventfs_inode to delete * 
@rcu: eventfs_inode to delete in RCU -* @is_freed: node is freed if one of the above is set */ union { struct list_head del_list; struct rcu_head rcu; - unsigned long is_freed; }; - int nr_entries; + unsigned int is_freed:1; + unsigned int nr_entries:31; }; static inline struct tracefs_inode *get_tracefs(const struct inode *inode) -- 2.42.0
[PATCH v6 7/8] eventfs: Remove special processing of dput() of events directory
From: "Steven Rostedt (Google)" The top level events directory is no longer special with regards to how it should be deleted. Remove the extra processing for it in eventfs_set_ei_status_free(). Cc: Ajay Kaher Signed-off-by: Steven Rostedt (Google) --- Changes since v5: https://lkml.kernel.org/r/20231031223421.185413...@goodmis.org - Resynced to this patch series fs/tracefs/event_inode.c | 19 ++- 1 file changed, 2 insertions(+), 17 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 0a04ae0ca8c8..0087a3f455f1 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -274,28 +274,11 @@ static void free_ei(struct eventfs_inode *ei) */ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) { - struct tracefs_inode *ti_parent; struct eventfs_inode *ei; int i; - /* The top level events directory may be freed by this */ - if (unlikely(ti->flags & TRACEFS_EVENT_TOP_INODE)) { - mutex_lock(&eventfs_mutex); - ei = ti->private; - /* Nothing should access this, but just in case! */ - ti->private = NULL; - mutex_unlock(&eventfs_mutex); - - free_ei(ei); - return; - } - mutex_lock(&eventfs_mutex); - ti_parent = get_tracefs(dentry->d_parent->d_inode); - if (!ti_parent || !(ti_parent->flags & TRACEFS_EVENT_INODE)) - goto out; - ei = dentry->d_fsdata; if (!ei) goto out; @@ -920,6 +903,8 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry inode->i_op = &eventfs_root_dir_inode_operations; inode->i_fop = &eventfs_file_operations; + dentry->d_fsdata = ei; + /* directory inodes start off with i_nlink == 2 (for "." entry) */ inc_nlink(inode); d_instantiate(dentry, inode); -- 2.42.0
[PATCH v6 0/8] eventfs: Fixing dynamic creation
Changes since v5: https://lore.kernel.org/all/20231031223326.794680...@goodmis.org/ This was originally based on: [PATCH] eventfs: Process deletion of dentry more thoroughly https://lore.kernel.org/linux-trace-kernel/20231031144703.71eef...@gandalf.local.home/ But Al Viro asked why I did not use simple_recursive_removal()? At first I thought I could not, but after playing with it more I realized I could (with a little addition of dget()'s to balance the dput()s). This simplifies the code. But to do this, it really needs to be done *after* these changes. Hence, I moved that patch to after this series and, to handle this rebase, I'm reposting all the patches and including the above-mentioned patch at the end, renamed to: eventfs: Use simple_recursive_removal() to clean up dentries I hope this is the last series, as this *is* for this merge window! Steven Rostedt (Google) (8): eventfs: Remove "is_freed" union with rcu head eventfs: Have a free_ei() that just frees the eventfs_inode eventfs: Test for ei->is_freed when accessing ei->dentry eventfs: Save ownership and mode eventfs: Hold eventfs_mutex when calling callback functions eventfs: Delete eventfs_inode when the last dentry is freed eventfs: Remove special processing of dput() of events directory eventfs: Use simple_recursive_removal() to clean up dentries fs/tracefs/event_inode.c | 420 --- fs/tracefs/internal.h| 29 +++- include/linux/tracefs.h | 43 + 3 files changed, 357 insertions(+), 135 deletions(-)
stable-rc: 5.15 - all builds failed - ld.lld: error: undefined symbol: kallsyms_on_each_symbol
Hi Greg, I see the following build warning / errors everywhere on stable-rc 5.15 branch. ld.lld: error: undefined symbol: kallsyms_on_each_symbol >>> referenced by trace_kprobe.c >>> trace/trace_kprobe.o:(create_local_trace_kprobe) in archive >>> kernel/built-in.a >>> referenced by trace_kprobe.c >>> trace/trace_kprobe.o:(__trace_kprobe_create) in archive >>> kernel/built-in.a make[1]: *** [Makefile:1227: vmlinux] Error 1 Links, - https://storage.tuxsuite.com/public/linaro/lkft/builds/2XXALLRIZaXJVcqhff4ZmGTeZoQ/ - Naresh
Re: stable-rc: 5.4 - all builds failed - ld.lld: error: undefined symbol: kallsyms_on_each_symbol
On Wed, 1 Nov 2023 at 20:38, Naresh Kamboju wrote: > > Hi Greg, > > I see the following build warning / errors everywhere on stable-rc 5.4 branch. I am sorry, It is on stable-rc 5.15. > > ld.lld: error: undefined symbol: kallsyms_on_each_symbol > >>> referenced by trace_kprobe.c > >>> trace/trace_kprobe.o:(create_local_trace_kprobe) in archive > >>> kernel/built-in.a > >>> referenced by trace_kprobe.c > >>> trace/trace_kprobe.o:(__trace_kprobe_create) in archive > >>> kernel/built-in.a > make[1]: *** [Makefile:1227: vmlinux] Error 1 I see the following latest commit on stable-rc 5.15 branch tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols commit b022f0c7e404887a7c5229788fc99eff9f9a80d5 upstream. - Naresh > Links, > - > https://storage.tuxsuite.com/public/linaro/lkft/builds/2XXALLRIZaXJVcqhff4ZmGTeZoQ/ > > - Naresh
stable-rc: 5.4 - all builds failed - ld.lld: error: undefined symbol: kallsyms_on_each_symbol
Hi Greg, I see the following build warning / errors everywhere on stable-rc 5.4 branch. ld.lld: error: undefined symbol: kallsyms_on_each_symbol >>> referenced by trace_kprobe.c >>> trace/trace_kprobe.o:(create_local_trace_kprobe) in archive >>> kernel/built-in.a >>> referenced by trace_kprobe.c >>> trace/trace_kprobe.o:(__trace_kprobe_create) in archive >>> kernel/built-in.a make[1]: *** [Makefile:1227: vmlinux] Error 1 Links, - https://storage.tuxsuite.com/public/linaro/lkft/builds/2XXALLRIZaXJVcqhff4ZmGTeZoQ/ - Naresh
Re: [PATCH] eventfs: Process deletion of dentry more thoroughly
On Wed, 1 Nov 2023 00:16:59 -0400 Steven Rostedt wrote: > On Wed, 1 Nov 2023 02:25:53 + > Al Viro wrote: > > > Umm... Is there any reason not to use simple_recursive_removal() there? > > Hmm, I may be able to (I'm still a newbie with understanding of the vfs). > > I did it this way thinking that a dentry may exist in the children but not > at a higher level, but I don't think that can be the case. This creates > dentries and inodes dynamically when they are referenced. The eventfs_inode > maps to each directory (the files of a directory are created from the > information from the eventfs_inode). OK, as I tried to use simple_recursive_removal() and I failed miserably! I think I know why. What happened was the last child would get one more "dput" than it needed. That's because dentries exist without any reference on them and they don't disappear until a reclaim happens. What I mean is, when a file is "open()'d" a dentry is created on the fly so that the user can access the file. When it is "close()'d" the dentry count goes to zero. Then on memory reclaim, the dentries may be removed. If another open happens, the dentry is created again, or the one that is still cached can be reinstated. It looks like simple_recursive_removal() expects all dentries to have a ref count of at least 1 when entering, which is not the case here. But! Now what I could do is to do a dget() when removing the eventfs_inodes (ei) on any dentry that is attached to them. /me goes and tries that... OK, that actually seems to work. With the assumption that there will never be a dentry without a parent I think I can go with this approach. Thanks! -- Steve
Re: [PATCH v2 2/3] remoteproc: qcom: pas: make region assign more generic
On 10/31/2023 10:36 PM, Neil Armstrong wrote: Hi, On 30/10/2023 14:10, Mukesh Ojha wrote: On 10/30/2023 3:33 PM, Neil Armstrong wrote: The current memory region assign only supports a single memory region. But new platforms introduces more regions to make the memory requirements more flexible for various use cases. Those new platforms also shares the memory region between the DSP and HLOS. To handle this, make the region assign more generic in order to support more than a single memory region and also permit setting the regions permissions as shared. Signed-off-by: Neil Armstrong --- drivers/remoteproc/qcom_q6v5_pas.c | 102 - 1 file changed, 66 insertions(+), 36 deletions(-) diff --git a/drivers/remoteproc/qcom_q6v5_pas.c b/drivers/remoteproc/qcom_q6v5_pas.c index 913a5d2068e8..4829fd26e17d 100644 --- a/drivers/remoteproc/qcom_q6v5_pas.c +++ b/drivers/remoteproc/qcom_q6v5_pas.c @@ -33,6 +33,8 @@ #define ADSP_DECRYPT_SHUTDOWN_DELAY_MS 100 +#define MAX_ASSIGN_COUNT 2 + struct adsp_data { int crash_reason_smem; const char *firmware_name; @@ -51,6 +53,9 @@ struct adsp_data { int ssctl_id; int region_assign_idx; + int region_assign_count; + bool region_assign_shared; + int region_assign_vmid; }; struct qcom_adsp { @@ -87,15 +92,18 @@ struct qcom_adsp { phys_addr_t dtb_mem_phys; phys_addr_t mem_reloc; phys_addr_t dtb_mem_reloc; - phys_addr_t region_assign_phys; + phys_addr_t region_assign_phys[MAX_ASSIGN_COUNT]; void *mem_region; void *dtb_mem_region; size_t mem_size; size_t dtb_mem_size; - size_t region_assign_size; + size_t region_assign_size[MAX_ASSIGN_COUNT]; int region_assign_idx; - u64 region_assign_perms; + int region_assign_count; + bool region_assign_shared; + int region_assign_vmid; + u64 region_assign_perms[MAX_ASSIGN_COUNT]; struct qcom_rproc_glink glink_subdev; struct qcom_rproc_subdev smd_subdev; @@ -590,37 +598,52 @@ static int adsp_alloc_memory_region(struct qcom_adsp *adsp) static int adsp_assign_memory_region(struct qcom_adsp *adsp) { - struct 
reserved_mem *rmem = NULL; - struct qcom_scm_vmperm perm; + struct qcom_scm_vmperm perm[MAX_ASSIGN_COUNT]; + unsigned int perm_size = 1; AFAICS, not need of initialization. Indeed, removed struct device_node *node; - int ret; + int offset, ret; Nit: one variable per line. Done if (!adsp->region_assign_idx) Not related to this patch.. Should not this be valid only for > 1 ? I don't understand, only region_assign_idx > 1 triggers a memory_assign, and this check discards configurations with region_assign_idx == 0 as expected. Ah, you can ignore the comments, I got the intention after commenting here .. return 0; - node = of_parse_phandle(adsp->dev->of_node, "memory-region", adsp->region_assign_idx); - if (node) - rmem = of_reserved_mem_lookup(node); - of_node_put(node); - if (!rmem) { - dev_err(adsp->dev, "unable to resolve shareable memory-region\n"); - return -EINVAL; - } + for (offset = 0; offset < adsp->region_assign_count; ++offset) { + struct reserved_mem *rmem = NULL; + + node = of_parse_phandle(adsp->dev->of_node, "memory-region", + adsp->region_assign_idx + offset); + if (node) + rmem = of_reserved_mem_lookup(node); + of_node_put(node); + if (!rmem) { + dev_err(adsp->dev, "unable to resolve shareable memory-region index %d\n", + offset); + return -EINVAL; > + } - perm.vmid = QCOM_SCM_VMID_MSS_MSA; - perm.perm = QCOM_SCM_PERM_RW; + if (adsp->region_assign_shared) { + perm[0].vmid = QCOM_SCM_VMID_HLOS; + perm[0].perm = QCOM_SCM_PERM_RW; + perm[1].vmid = adsp->region_assign_vmid; + perm[1].perm = QCOM_SCM_PERM_RW; + perm_size = 2; + } else { + perm[0].vmid = adsp->region_assign_vmid; + perm[0].perm = QCOM_SCM_PERM_RW; + perm_size = 1; + } - adsp->region_assign_phys = rmem->base; - adsp->region_assign_size = rmem->size; - adsp->region_assign_perms = BIT(QCOM_SCM_VMID_HLOS); + adsp->region_assign_phys[offset] = rmem->base; + adsp->region_assign_size[offset] = rmem->size; + adsp->region_assign_perms[offset] = BIT(QCOM_SCM_VMID_HLOS); Do we need array for this, is 
this changing ? We need to keep region_assign_perms for unassign, but for the other 2 we would need to duplicate the code from adsp_assign_memory_region into adsp_unassign_memory_region. Thanks, got it. - ret = qcom_scm_assign_mem(adsp->region_assign_phys, - adsp->region_assign_size, - &adsp->region_assign_perms, - &perm, 1); - if (ret < 0) { - dev_err(adsp->dev, "assign
Re: [PATCH v5 2/7] eventfs: Have a free_ei() that just frees the eventfs_inode
On Tue, 31 Oct 2023 18:33:28 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > As the eventfs_inode is freed in two different locations, make a helper > function free_ei() to make sure all the allocated fields of the > eventfs_inode is freed. > > This requires renaming the existing free_ei() which is called by the srcu > handler to free_rcu_ei() and have free_ei() just do the freeing, where > free_rcu_ei() will call it. Looks good to me. Reviewed-by: Masami Hiramatsu (Google) Thanks, > > Signed-off-by: Steven Rostedt (Google) > --- > Changse since v4: > https://lore.kernel.org/all/20231031193428.133533...@goodmis.org/T/#u > > - Rebased to this patch series > > fs/tracefs/event_inode.c | 19 +++ > 1 file changed, 11 insertions(+), 8 deletions(-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 2c2c75b2ad73..0331d9bd568b 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -129,6 +129,13 @@ static struct dentry *create_dir(const char *name, > struct dentry *parent) > return eventfs_end_creating(dentry); > } > > +static void free_ei(struct eventfs_inode *ei) > +{ > + kfree_const(ei->name); > + kfree(ei->d_children); > + kfree(ei); > +} > + > /** > * eventfs_set_ei_status_free - remove the dentry reference from an > eventfs_inode > * @ti: the tracefs_inode of the dentry > @@ -168,9 +175,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, > struct dentry *dentry) > eventfs_remove_dir(ei_child); > } > > - kfree_const(ei->name); > - kfree(ei->d_children); > - kfree(ei); > + free_ei(ei); > return; > } > > @@ -784,13 +789,11 @@ struct eventfs_inode *eventfs_create_events_dir(const > char *name, struct dentry > return ERR_PTR(-ENOMEM); > } > > -static void free_ei(struct rcu_head *head) > +static void free_rcu_ei(struct rcu_head *head) > { > struct eventfs_inode *ei = container_of(head, struct eventfs_inode, > rcu); > > - kfree_const(ei->name); > - kfree(ei->d_children); > - kfree(ei); > + 
free_ei(ei); > } > > /** > @@ -883,7 +886,7 @@ void eventfs_remove_dir(struct eventfs_inode *ei) > for (i = 0; i < ei->nr_entries; i++) > unhook_dentry(&ei->d_children[i], &dentry_list); > unhook_dentry(&ei->dentry, &dentry_list); > - call_srcu(&eventfs_srcu, &ei->rcu, free_ei); > + call_srcu(&eventfs_srcu, &ei->rcu, free_rcu_ei); > } > mutex_unlock(&eventfs_mutex); > > -- > 2.42.0 -- Masami Hiramatsu (Google)
Re: [PATCH v5 1/7] eventfs: Remove "is_freed" union with rcu head
On Tue, 31 Oct 2023 18:33:27 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The eventfs_inode->is_freed was a union with the rcu_head with the > assumption that when it was on the srcu list the head would contain a > pointer which would make "is_freed" true. But that was a wrong assumption > as the rcu head is a single link list where the last element is NULL. > > Instead, split the nr_entries integer so that "is_freed" is one bit and > the nr_entries is the next 31 bits. As there shouldn't be more than 10 > (currently there's at most 5 to 7 depending on the config), this should > not be a problem. Yeah, even 16 bit nr_entries is enough.. (maybe the biggest user is syscall event group) Looks good to me. Reviewed-by: Masami Hiramatsu (Google) Thank you, > Cc: sta...@vger.kernel.org > Fixes: 63940449555e7 ("eventfs: Implement eventfs lookup, read, open > functions") > Signed-off-by: Steven Rostedt (Google) > --- > fs/tracefs/event_inode.c | 2 ++ > fs/tracefs/internal.h| 6 +++--- > 2 files changed, 5 insertions(+), 3 deletions(-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 754885dfe71c..2c2c75b2ad73 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -824,6 +824,8 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, > struct list_head *head, > eventfs_remove_rec(ei_child, head, level + 1); > } > > + ei->is_freed = 1; > + > list_del_rcu(>list); > list_add_tail(>del_list, head); > } > diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h > index 64fde9490f52..c7d88aaa949f 100644 > --- a/fs/tracefs/internal.h > +++ b/fs/tracefs/internal.h > @@ -23,6 +23,7 @@ struct tracefs_inode { > * @d_parent: pointer to the parent's dentry > * @d_children: The array of dentries to represent the files when created > * @data:The private data to pass to the callbacks > + * @is_freed:Flag set if the eventfs is on its way to be freed > * @nr_entries: The number of items in @entries > */ > struct 
eventfs_inode { > @@ -38,14 +39,13 @@ struct eventfs_inode { >* Union - used for deletion >* @del_list: list of eventfs_inode to delete >* @rcu: eventfs_inode to delete in RCU > - * @is_freed: node is freed if one of the above is set >*/ > union { > struct list_head del_list; > struct rcu_head rcu; > - unsigned long is_freed; > }; > - int nr_entries; > + unsigned int is_freed:1; > + unsigned int nr_entries:31; > }; > > static inline struct tracefs_inode *get_tracefs(const struct inode *inode) > -- > 2.42.0 -- Masami Hiramatsu (Google)
Re: [PATCH v3] seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()
Le 27/10/2023 à 17:56, Kees Cook a écrit : Solve two ergonomic issues with struct seq_buf; 1) Too much boilerplate is required to initialize: struct seq_buf s; char buf[32]; seq_buf_init(s, buf, sizeof(buf)); Instead, we can build this directly on the stack. Provide DECLARE_SEQ_BUF() macro to do this: DECLARE_SEQ_BUF(s, 32); 2) %NUL termination is fragile and requires 2 steps to get a valid C String (and is a layering violation exposing the "internals" of seq_buf): seq_buf_terminate(s); do_something(s->buffer); Instead, we can just return s->buffer directly after terminating it in the refactored seq_buf_terminate(), now known as seq_buf_str(): do_something(seq_buf_str(s)); Cc: Steven Rostedt Cc: "Matthew Wilcox (Oracle)" Cc: Christoph Hellwig Cc: Justin Stitt Cc: Kent Overstreet Cc: Petr Mladek Cc: Andy Shevchenko Cc: Rasmus Villemoes Cc: Sergey Senozhatsky Cc: Masami Hiramatsu Cc: Greg Kroah-Hartman Cc: Arnd Bergmann Cc: Jonathan Corbet Cc: Yun Zhou Cc: Jacob Keller Cc: Zhen Lei Cc: linux-trace-ker...@vger.kernel.org Link: https://lore.kernel.org/r/20231026194033.it.702-k...@kernel.org Signed-off-by: Kees Cook --- v3 - fix commit log typos - improve code style for DECLARE_SEQ_BUF (shevchenko) - const-ify seq_bug_str() return (rostedt) v2 - https://lore.kernel.org/lkml/20231026194033.it.702-k...@kernel.org v1 - https://lore.kernel.org/lkml/20231026170722.work.638-k...@kernel.org --- include/linux/seq_buf.h | 21 + kernel/trace/trace.c| 11 +-- lib/seq_buf.c | 4 +--- 3 files changed, 19 insertions(+), 17 deletions(-) diff --git a/include/linux/seq_buf.h b/include/linux/seq_buf.h index 8483e4b2d0d2..5fb1f12c33f9 100644 --- a/include/linux/seq_buf.h +++ b/include/linux/seq_buf.h @@ -21,9 +21,18 @@ struct seq_buf { size_t len; }; +#define DECLARE_SEQ_BUF(NAME, SIZE) \ + char __ ## NAME ## _buffer[SIZE] = "";\ + struct seq_buf NAME = { \ + .buffer = &__ ## NAME ## _buffer, \ ~~~ Is the & needed here? 
CJ + .size = SIZE, \ + } + static inline void seq_buf_clear(struct seq_buf *s) { s->len = 0; + if (s->size) + s->buffer[0] = '\0'; } static inline void @@ -69,8 +78,8 @@ static inline unsigned int seq_buf_used(struct seq_buf *s) } /** - * seq_buf_terminate - Make sure buffer is nul terminated - * @s: the seq_buf descriptor to terminate. + * seq_buf_str - get %NUL-terminated C string from seq_buf + * @s: the seq_buf handle * * This makes sure that the buffer in @s is nul terminated and * safe to read as a string. @@ -81,16 +90,20 @@ static inline unsigned int seq_buf_used(struct seq_buf *s) * * After this function is called, s->buffer is safe to use * in string operations. + * + * Returns @s->buf after making sure it is terminated. */ -static inline void seq_buf_terminate(struct seq_buf *s) +static inline const char *seq_buf_str(struct seq_buf *s) { if (WARN_ON(s->size == 0)) - return; + return ""; if (seq_buf_buffer_left(s)) s->buffer[s->len] = 0; else s->buffer[s->size - 1] = 0; + + return s->buffer; } /** diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d629065c2383..2539cfc20a97 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3828,15 +3828,6 @@ static bool trace_safe_str(struct trace_iterator *iter, const char *str, return false; } -static const char *show_buffer(struct trace_seq *s) -{ - struct seq_buf *seq = >seq; - - seq_buf_terminate(seq); - - return seq->buffer; -} - static DEFINE_STATIC_KEY_FALSE(trace_no_verify); static int test_can_verify_check(const char *fmt, ...) 
@@ -3976,7 +3967,7 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, */ if (WARN_ONCE(!trace_safe_str(iter, str, star, len), "fmt: '%s' current_buffer: '%s'", - fmt, show_buffer(&iter->seq))) { + fmt, seq_buf_str(&iter->seq.seq))) { int ret; /* Try to safely read the string */ diff --git a/lib/seq_buf.c b/lib/seq_buf.c index b7477aefff53..23518f77ea9c 100644 --- a/lib/seq_buf.c +++ b/lib/seq_buf.c @@ -109,9 +109,7 @@ void seq_buf_do_printk(struct seq_buf *s, const char *lvl) if (s->size == 0 || s->len == 0) return; - seq_buf_terminate(s); - - start = s->buffer; + start = seq_buf_str(s); while ((lf = strchr(start, '\n'))) { int len = lf - start + 1;