Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr

2024-05-01 Thread 吳澤南
On Wed, 2024-05-01 at 23:50 -0400, Steven Rostedt wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On Thu, 2 May 2024 03:10:24 +
> Tze-nan Wu (吳澤南)  wrote:
> 
> > >   
> > Sorry for my late reply, I'm testing the patch on my machine now. 
> > Test will be done in four hours.
> > 
> > There's something I'm worrying about in the patch,
> > what I'm worrying about is commented in the code below.
> > 
> > /kernel/trace/trace_events.c:
> >   static int
> >   event_create_dir(struct eventfs_inode *parent, 
> >   struct trace_event_file *file) 
> >   {
> > ...
> > ...
> > ...
> > nr_entries = ARRAY_SIZE(event_entries);
> > 
> > name = trace_event_name(call);
> > 
> > +event_file_get(file);// Line A
> > ^
> > // Should we move the "event_file_get" to here, instead  
> > // of calling it at line C?
> > // Due to Line B could eventually invoke "event_file_put".
> > //   eventfs_create_dir -> free_ei ->put_ei -> kref_put 
> > //  -> release_ei -> event_release -> event_file_put
> > // Not sure if this is a potential risk? If Line B do
> call   
> > // event_file_put,"event_file_put" will be called prior to
> > // "event_file_get", could corrupt the reference of the
> file.
> 
> No, but you do bring up a good point. The release should not be
> called on
> error, but it looks like it possibly can be.
> 
> > 
> > ei = eventfs_create_dir(name, e_events,// Line B 
> >  event_entries, nr_entries, file);
> > if (IS_ERR(ei)) {
> > pr_warn("Could not create tracefs '%s'
> directory\n", 
> > name);
> > return -1;
> > }
> > file->ei = ei;
> > 
> > ret = event_define_fields(call);
> > if (ret < 0) {
> > pr_warn("Could not initialize trace point
> events/%s\n",
> > name);
> > return ret;
> >^  
> >// Maybe we chould have similar concern if we return here.
> >// Due to the event_inode had been created, but we did not
> call 
> >// event_file_get. 
> >// Could it lead to some issues in the future while freeing 
> >// event_indoe?
> > }
> > 
> > 
> > -event_file_get(file);   //Line C
> > return 0;
> >   }
> 
> This prevents the release() function from being called on failure of
> creating the ei.
> 
> Can you try this patch instead?
> -- Steve
> 
Sure, will reply the mail again after the test finish ASAP.
-- Tze-nan

> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 894c6ca1e500..f5510e26f0f6 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -84,10 +84,17 @@ enum {
>  static void release_ei(struct kref *ref)
>  {
>  struct eventfs_inode *ei = container_of(ref, struct eventfs_inode,
> kref);
> +const struct eventfs_entry *entry;
>  struct eventfs_root_inode *rei;
>  
>  WARN_ON_ONCE(!ei->is_freed);
>  
> +for (int i = 0; i < ei->nr_entries; i++) {
> +entry = >entries[i];
> +if (entry->release)
> +entry->release(entry->name, ei->data);
> +}
> +
>  kfree(ei->entry_attrs);
>  kfree_const(ei->name);
>  if (ei->is_events) {
> @@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode
> *ei)
>  }
>  }
>  
> +/*
> + * Called when creation of an ei fails, do not call release()
> functions.
> + */
> +static inline void cleanup_ei(struct eventfs_inode *ei)
> +{
> +if (ei) {
> +/* Set nr_entries to 0 to prevent release() function being called */
> +ei->nr_entries = 0;
> +free_ei(ei);
> +}
> +}
> +
>  static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
>  {
>  if (ei)
> @@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const
> char *name, struct eventfs_inode
>  
>  /* Was the parent freed? */
>  if (list_empty(>list)) {
> -free_ei(ei);
> +cleanup_ei(ei);
>  ei = NULL;
>  }
>  return ei;
> @@ -835,7 +854,7 @@ struct eventfs_inode
> *eventfs_create_events_dir(const char *name, struct dentry
>  return ei;
>  
>   fail:
> -free_ei(ei);
> +cleanup_ei(ei);
>  tracefs_failed_creating(dentry);
>  return ERR_PTR(-ENOMEM);
>  }
> diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
> index 7a5fe17b6bf9..d03f74658716 100644
> --- a/include/linux/tracefs.h
> +++ b/include/linux/tracefs.h
> @@ -62,6 +62,8 @@ struct eventfs_file;
>  typedef int (*eventfs_callback)(const char *name, umode_t *mode,
> void **data,
>  const struct file_operations **fops);
>  
> +typedef void (*eventfs_release)(const char *name, void *data);
> +
>  /**
>   * struct eventfs_entry - dynamically created eventfs file call back
> handler
>   * @name:Then name of the dynamic file in an eventfs directory
> @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name,
> umode_t *mode, void **data,
>  

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-05-01 Thread Guillaume Morin
On 30 Apr 21:25, David Hildenbrand wrote:
> > I tried to get the hugepd stuff right but this was the first I heard
> > about it :-) Afaict follow_huge_pmd and friends were already DTRT
> 
> I'll have to have a closer look at some details (the hugepd writability
> check looks a bit odd), but it's mostly what I would have expected!

Ok in the meantime, here is the uprobe change on your current
uprobes_cow trying to address the comments you made in your previous
message. Some of them were not 100% clear to me, so it's a best effort
patch :-) Again lightly tested

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec 
hugetlb_fs_parameters[] = {
{}
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping) {
+   return mapping->a_ops == _aops;
+}
+
 /*
  * Mask used when checking the page offset value passed in via system
  * calls.  This value will be converted to a loff_t which is signed.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,6 +511,8 @@ struct hugetlbfs_sb_info {
umode_t mode;
 };
 
+bool hugetlbfs_mapping(struct address_space *mapping);
+
 static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 {
return sb->s_fs_info;
@@ -557,6 +559,8 @@ static inline struct hstate *hstate_inode(struct inode *i)
 {
return NULL;
 }
+
+static inline bool hugetlbfs_mapping(struct address_space *mapping) { return 
false; }
 #endif /* !CONFIG_HUGETLBFS */
 
 #ifdef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include  /* read_mapping_page */
 #include 
 #include 
@@ -120,7 +121,7 @@ struct xol_area {
  */
 static bool valid_vma(struct vm_area_struct *vma, bool is_register)
 {
-   vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
+   vm_flags_t flags = VM_MAYEXEC | VM_MAYSHARE;
 
if (is_register)
flags |= VM_WRITE;
@@ -177,6 +178,19 @@ static void copy_to_page(struct page *page, unsigned long 
vaddr, const void *src
kunmap_atomic(kaddr);
 }
 
+static bool compare_pages(struct page *page1, struct page *page2, unsigned 
long page_size)
+{
+   char *addr1, *addr2;
+   int ret;
+
+   addr1 = kmap_local_page(page1);
+   addr2 = kmap_local_page(page2);
+   ret = memcmp(addr1, addr2, page_size);
+   kunmap_local(addr2);
+   kunmap_local(addr1);
+   return ret == 0;
+}
+
 static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode)
 {
uprobe_opcode_t old_opcode;
@@ -366,7 +380,9 @@ static int update_ref_ctr(struct uprobe *uprobe, struct 
mm_struct *mm,
 }
 
 static bool orig_page_is_identical(struct vm_area_struct *vma,
-   unsigned long vaddr, struct page *page, bool *large)
+   unsigned long vaddr, struct page *page,
+   unsigned long page_size,
+   bool *large)
 {
const pgoff_t index = vaddr_to_offset(vma, vaddr) >> PAGE_SHIFT;
struct page *orig_page = find_get_page(vma->vm_file->f_inode->i_mapping,
@@ -380,7 +396,7 @@ static bool orig_page_is_identical(struct vm_area_struct 
*vma,
 
*large = folio_test_large(orig_folio);
identical = folio_test_uptodate(orig_folio) &&
-   pages_identical(page, orig_page);
+   compare_pages(page, orig_page, page_size);
folio_put(orig_folio);
return identical;
 }
@@ -396,6 +412,81 @@ struct uwo_data {
uprobe_opcode_t opcode;
 };
 
+static int __write_opcode_hugetlb(pte_t *ptep, unsigned long page_mask,
+ unsigned long vaddr,
+ unsigned long next, struct mm_walk *walk)
+{
+   struct uwo_data *data = walk->private;
+   const bool is_register = !!is_swbp_insn(>opcode);
+   pte_t pte = huge_ptep_get(ptep);
+   struct folio *folio;
+   struct page *page;
+   bool large;
+   struct hstate *h = hstate_vma(walk->vma);
+   unsigned subpage_index = (vaddr & (huge_page_size(h) - 1)) >>
+   PAGE_SHIFT;
+
+   if (!pte_present(pte))
+   return UWO_RETRY;
+   page = vm_normal_page(walk->vma, vaddr, pte);
+   if (!page)
+   return UWO_RETRY;
+   folio = page_folio(page);
+
+   /* When unregistering and there is no anon folio anymore, we're done. */
+   if (!folio_test_anon(folio))
+   return is_register ? UWO_RETRY_WRITE_FAULT : UWO_DONE;
+
+   /*
+* See can_follow_write_pte(): we'd actually prefer requiring a
+* writable PTE here, but when unregistering we might no longer have
+* VM_WRITE ...
+*/
+   if (!huge_pte_write(pte)) {
+   

Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr

2024-05-01 Thread Steven Rostedt
On Thu, 2 May 2024 03:10:24 +
Tze-nan Wu (吳澤南)  wrote:

> >   
> Sorry for my late reply, I'm testing the patch on my machine now. 
> Test will be done in four hours.
> 
> There's something I'm worrying about in the patch,
> what I'm worrying about is commented in the code below.
> 
> /kernel/trace/trace_events.c:
>   static int
>   event_create_dir(struct eventfs_inode *parent, 
>   struct trace_event_file *file) 
>   {
> ...
> ...
> ...
> nr_entries = ARRAY_SIZE(event_entries);
> 
> name = trace_event_name(call);
> 
> +event_file_get(file);// Line A
> ^
> // Should we move the "event_file_get" to here, instead  
> // of calling it at line C?
> // Due to Line B could eventually invoke "event_file_put".
> //   eventfs_create_dir -> free_ei ->put_ei -> kref_put 
> //  -> release_ei -> event_release -> event_file_put
> // Not sure if this is a potential risk? If Line B do call   
> // event_file_put,"event_file_put" will be called prior to
> // "event_file_get", could corrupt the reference of the file.

No, but you do bring up a good point. The release should not be called on
error, but it looks like it possibly can be.

> 
> ei = eventfs_create_dir(name, e_events,// Line B 
>  event_entries, nr_entries, file);
> if (IS_ERR(ei)) {
> pr_warn("Could not create tracefs '%s' directory\n", 
> name);
> return -1;
> }
> file->ei = ei;
> 
> ret = event_define_fields(call);
> if (ret < 0) {
> pr_warn("Could not initialize trace point events/%s\n",
> name);
> return ret;
>^  
>// Maybe we chould have similar concern if we return here.
>// Due to the event_inode had been created, but we did not call 
>// event_file_get. 
>// Could it lead to some issues in the future while freeing 
>// event_indoe?
> }
> 
> 
> -event_file_get(file);   //Line C
> return 0;
>   }

This prevents the release() function from being called on failure of
creating the ei.

Can you try this patch instead?

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..f5510e26f0f6 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
+   const struct eventfs_entry *entry;
struct eventfs_root_inode *rei;
 
WARN_ON_ONCE(!ei->is_freed);
 
+   for (int i = 0; i < ei->nr_entries; i++) {
+   entry = >entries[i];
+   if (entry->release)
+   entry->release(entry->name, ei->data);
+   }
+
kfree(ei->entry_attrs);
kfree_const(ei->name);
if (ei->is_events) {
@@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode *ei)
}
 }
 
+/*
+ * Called when creation of an ei fails, do not call release() functions.
+ */
+static inline void cleanup_ei(struct eventfs_inode *ei)
+{
+   if (ei) {
+   /* Set nr_entries to 0 to prevent release() function being 
called */
+   ei->nr_entries = 0;
+   free_ei(ei);
+   }
+}
+
 static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
 {
if (ei)
@@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, 
struct eventfs_inode
 
/* Was the parent freed? */
if (list_empty(>list)) {
-   free_ei(ei);
+   cleanup_ei(ei);
ei = NULL;
}
return ei;
@@ -835,7 +854,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
return ei;
 
  fail:
-   free_ei(ei);
+   cleanup_ei(ei);
tracefs_failed_creating(dentry);
return ERR_PTR(-ENOMEM);
 }
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);
 
+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:  Then name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t 
*mode, void **data,
 struct eventfs_entry {
const char  *name;
eventfs_callbackcallback;
+   eventfs_release release;
 };
 
 struct eventfs_inode;
diff --git a/kernel/trace/trace_events.c 

Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()

2024-05-01 Thread Paul E. McKenney
On Thu, May 02, 2024 at 11:05:01AM +0900, Masami Hiramatsu wrote:
> On Wed, 1 May 2024 16:12:37 -0700
> "Paul E. McKenney"  wrote:
> 
> > Note that the immediate pressure for this patch should be relieved by the
> > NAPI patch series [1], but this sort of problem could easily arise again.
> > 
> > When running heavy test workloads with KASAN enabled, RCU Tasks grace
> > periods can extend for many tens of seconds, significantly slowing
> > trace registration.  Therefore, make the registration-side RCU Tasks
> > grace period be asynchronous via call_rcu_tasks().
> 
> Good catch! AFAICS, there is no reason to wait for synchronization
> when adding a new direct trampoline.
> This looks good to me.
> 
> Reviewed-by: Masami Hiramatsu (Google) 

Thank you very much!  I will apply this on my next rebase.

Thanx, Paul

> Thank you,
> 
> > [1] https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/
> > 
> > Reported-by: Jakub Kicinski 
> > Reported-by: Alexei Starovoitov 
> > Reported-by: Chris Mason 
> > Signed-off-by: Paul E. McKenney 
> > Cc: Steven Rostedt 
> > Cc: Masami Hiramatsu 
> > Cc: Mark Rutland 
> > Cc: Mathieu Desnoyers 
> > Cc: 
> > 
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index 6c96b30f3d63b..32ea92934268c 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -5365,6 +5365,13 @@ static void remove_direct_functions_hash(struct 
> > ftrace_hash *hash, unsigned long
> > }
> >  }
> >  
> > +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> > +{
> > +   struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> > +
> > +   free_ftrace_hash(fhp);
> > +}
> > +
> >  /**
> >   * register_ftrace_direct - Call a custom trampoline directly
> >   * for multiple functions registered in @ops
> > @@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
> > unsigned long addr)
> >   out_unlock:
> > mutex_unlock(_mutex);
> >  
> > -   if (free_hash && free_hash != EMPTY_HASH) {
> > -   synchronize_rcu_tasks();
> > -   free_ftrace_hash(free_hash);
> > -   }
> > +   if (free_hash && free_hash != EMPTY_HASH)
> > +   call_rcu_tasks(_hash->rcu, register_ftrace_direct_cb);
> >  
> > if (new_hash)
> > free_ftrace_hash(new_hash);
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr

2024-05-01 Thread 吳澤南
On Mon, 2024-04-29 at 14:46 -0400, Steven Rostedt wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On Sun, 28 Apr 2024 20:28:37 -0400
> Steven Rostedt  wrote:
> 
> > > Looking for any suggestion or solution, appreciate.  
> > 
> > Yeah, I do not think eventfs should be involved in this. It needs
> to be
> > protected at a higher level (in the synthetic/dynamic event code).
> > 
> > I'm just coming back from Japan, and I'll need to take a deeper
> look at
> > this after I recover from my jetlag.
> 
> OK, so I guess the eventfs nodes need an optional release callback.
> Here's
> the right way to do that. I added a "release" function to the passed
> in
> entry array that allows for calling a release function when the
> eventfs_inode is freed. Then in code for creating events, I call
> event_file_get() on the file being assigned and have the freeing of
> the
> "enable" file have the release function that will call
> event_file_put() on
> that file structure.
> 
> Does this fix it for you?
> 
Sorry for my late reply, I'm testing the patch on my machine now. 
Test will be done in four hours.

There's something I'm worrying about in the patch,
what I'm worrying about is commented in the code below.

/kernel/trace/trace_events.c:
  static int
  event_create_dir(struct eventfs_inode *parent, 
  struct trace_event_file *file) 
  {
...
...
...
nr_entries = ARRAY_SIZE(event_entries);

name = trace_event_name(call);

+event_file_get(file);// Line A
^
// Should we move the "event_file_get" to here, instead  
// of calling it at line C?
// Due to Line B could eventually invoke "event_file_put".
//   eventfs_create_dir -> free_ei ->put_ei -> kref_put 
//  -> release_ei -> event_release -> event_file_put
// Not sure if this is a potential risk? If Line B do call   
// event_file_put,"event_file_put" will be called prior to
// "event_file_get", could corrupt the reference of the file.

ei = eventfs_create_dir(name, e_events,// Line B 
 event_entries, nr_entries, file);
if (IS_ERR(ei)) {
pr_warn("Could not create tracefs '%s' directory\n", 
name);
return -1;
}
file->ei = ei;

ret = event_define_fields(call);
if (ret < 0) {
pr_warn("Could not initialize trace point events/%s\n",
name);
return ret;
   ^  
   // Maybe we chould have similar concern if we return here.
   // Due to the event_inode had been created, but we did not call 
   // event_file_get. 
   // Could it lead to some issues in the future while freeing 
   // event_indoe?
}


-event_file_get(file);   //Line C
return 0;
  }

Thanks,
Tze-nan

> -- Steve
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 894c6ca1e500..dc97c19f9e0a 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -84,10 +84,17 @@ enum {
>  static void release_ei(struct kref *ref)
>  {
>  struct eventfs_inode *ei = container_of(ref, struct eventfs_inode,
> kref);
> +const struct eventfs_entry *entry;
>  struct eventfs_root_inode *rei;
>  
>  WARN_ON_ONCE(!ei->is_freed);
>  
> +for (int i = 0; i < ei->nr_entries; i++) {
> +entry = >entries[i];
> +if (entry->release)
> +entry->release(entry->name, ei->data);
> +}
> +
>  kfree(ei->entry_attrs);
>  kfree_const(ei->name);
>  if (ei->is_events) {
> diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
> index 7a5fe17b6bf9..d03f74658716 100644
> --- a/include/linux/tracefs.h
> +++ b/include/linux/tracefs.h
> @@ -62,6 +62,8 @@ struct eventfs_file;
>  typedef int (*eventfs_callback)(const char *name, umode_t *mode,
> void **data,
>  const struct file_operations **fops);
>  
> +typedef void (*eventfs_release)(const char *name, void *data);
> +
>  /**
>   * struct eventfs_entry - dynamically created eventfs file call back
> handler
>   * @name:Then name of the dynamic file in an eventfs directory
> @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name,
> umode_t *mode, void **data,
>  struct eventfs_entry {
>  const char*name;
>  eventfs_callbackcallback;
> +eventfs_releaserelease;
>  };
>  
>  struct eventfs_inode;
> diff --git a/kernel/trace/trace_events.c
> b/kernel/trace/trace_events.c
> index 52f75c36bbca..d14c84281f2b 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -2552,6 +2552,14 @@ static int event_callback(const char *name,
> umode_t *mode, void **data,
>  return 0;
>  }
>  
> +/* The file is incremented on creation and freeing the enable file
> decrements it */
> +static void event_release(const char *name, void *data)
> +{
> +struct trace_event_file *file = data;
> +
> 

[PATCH 5/5] eventfs: Have "events" directory get permissions from its parent

2024-05-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The events directory gets its permissions from the root inode. But this
can cause an inconsistency if the instances directory changes its
permissions, as the permissions of the created directories under it should
inherit the permissions of the instances directory when directories under
it are created.

Currently the behavior is:

 # cd /sys/kernel/tracing
 # chgrp 1002 instances
 # mkdir instances/foo
 # ls -l instances/foo
[..]
 -r--r-  1 root lkp  0 May  1 18:55 buffer_total_size_kb
 -rw-r-  1 root lkp  0 May  1 18:55 current_tracer
 -rw-r-  1 root lkp  0 May  1 18:55 error_log
 drwxr-xr-x  1 root root 0 May  1 18:55 events
 --w---  1 root lkp  0 May  1 18:55 free_buffer
 drwxr-x---  2 root lkp  0 May  1 18:55 options
 drwxr-x--- 10 root lkp  0 May  1 18:55 per_cpu
 -rw-r-  1 root lkp  0 May  1 18:55 set_event

All the files and directories under "foo" has the "lkp" group except the
"events" directory. That's because its getting its default value from the
mount point instead of its parent.

Have the "events" directory make its default value based on its parent's
permissions. That now gives:

 # ls -l instances/foo
[..]
 -rw-r-  1 root lkp 0 May  1 21:16 buffer_subbuf_size_kb
 -r--r-  1 root lkp 0 May  1 21:16 buffer_total_size_kb
 -rw-r-  1 root lkp 0 May  1 21:16 current_tracer
 -rw-r-  1 root lkp 0 May  1 21:16 error_log
 drwxr-xr-x  1 root lkp 0 May  1 21:16 events
 --w---  1 root lkp 0 May  1 21:16 free_buffer
 drwxr-x---  2 root lkp 0 May  1 21:16 options
 drwxr-x--- 10 root lkp 0 May  1 21:16 per_cpu
 -rw-r-  1 root lkp 0 May  1 21:16 set_event

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 624a0e4a8e29..f77800398a54 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex);
 
 struct eventfs_root_inode {
struct eventfs_inodeei;
+   struct inode*parent_inode;
struct dentry   *events_dir;
 };
 
@@ -207,12 +208,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 
 static void update_events_attr(struct eventfs_inode *ei, struct super_block 
*sb)
 {
-   struct inode *root;
+   struct eventfs_root_inode *rei;
+   struct inode *parent;
+
+   rei = get_root_inode(ei);
+
+   /* Use the parent inode permissions unless root set its permissions */
+   parent = rei->parent_inode;
 
-   /* Get the tracefs root inode. */
-   root = d_inode(sb->s_root);
-   ei->attr.uid = root->i_uid;
-   ei->attr.gid = root->i_gid;
+   if (rei->ei.attr.mode & EVENTFS_SAVE_UID)
+   ei->attr.uid = rei->ei.attr.uid;
+   else
+   ei->attr.uid = parent->i_uid;
+
+   if (rei->ei.attr.mode & EVENTFS_SAVE_GID)
+   ei->attr.gid = rei->ei.attr.gid;
+   else
+   ei->attr.gid = parent->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
@@ -798,6 +810,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
// Note: we have a ref to the dentry from tracefs_start_creating()
rei = get_root_inode(ei);
rei->events_dir = dentry;
+   rei->parent_inode = d_inode(dentry->d_sb->s_root);
 
ei->entries = entries;
ei->nr_entries = size;
@@ -807,10 +820,15 @@ struct eventfs_inode *eventfs_create_events_dir(const 
char *name, struct dentry
uid = d_inode(dentry->d_parent)->i_uid;
gid = d_inode(dentry->d_parent)->i_gid;
 
-   /* This is used as the default ownership of the files and directories */
ei->attr.uid = uid;
ei->attr.gid = gid;
 
+   /*
+* When the "events" directory is created, it takes on the
+* permissions of its parent. But can be reset on remount.
+*/
+   ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID;
+
INIT_LIST_HEAD(>children);
INIT_LIST_HEAD(>list);
 
-- 
2.43.0





[PATCH 4/5] eventfs: Do not treat events directory different than other directories

2024-05-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Treat the events directory the same as other directories when it comes to
permissions. The events directory was considered different because it's
dentry is persistent, whereas the other directory dentries are created
when accessed. But the way tracefs now does its ownership by using the
root dentry's permissions as the default permissions, the events directory
can get out of sync when a remount is performed setting the group and user
permissions.

Remove the special case for the events directory on setting the
attributes. This allows the updates caused by remount to work properly as
well as simplifies the code.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2c33262107c6..624a0e4a8e29 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -187,21 +187,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * determined by the parent directory.
 */
if (dentry->d_inode->i_mode & S_IFDIR) {
-   /*
-* The events directory dentry is never freed, unless its
-* part of an instance that is deleted. It's attr is the
-* default for its child files and directories.
-* Do not update it. It's not used for its own mode or 
ownership.
-*/
-   if (ei->is_events) {
-   /* But it still needs to know if it was modified */
-   if (iattr->ia_valid & ATTR_UID)
-   ei->attr.mode |= EVENTFS_SAVE_UID;
-   if (iattr->ia_valid & ATTR_GID)
-   ei->attr.mode |= EVENTFS_SAVE_GID;
-   } else {
-   update_attr(>attr, iattr);
-   }
+   update_attr(>attr, iattr);
 
} else {
name = dentry->d_name.name;
-- 
2.43.0





[PATCH 3/5] eventfs: Do not differentiate the toplevel events directory

2024-05-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The toplevel events directory is really no different than the events
directory of instances. Having the two be different caused
inconsistencies and made it harder to fix the permissions bugs.

Make all events directories act the same.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 29 -
 fs/tracefs/internal.h|  7 +++
 2 files changed, 11 insertions(+), 25 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index db71fed7057d..2c33262107c6 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -68,7 +68,6 @@ enum {
EVENTFS_SAVE_MODE   = BIT(16),
EVENTFS_SAVE_UID= BIT(17),
EVENTFS_SAVE_GID= BIT(18),
-   EVENTFS_TOPLEVEL= BIT(19),
 };
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
@@ -220,14 +219,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct 
super_block *sb)
+static void update_events_attr(struct eventfs_inode *ei, struct super_block 
*sb)
 {
struct inode *root;
 
-   /* Only update if the "events" was on the top level */
-   if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
-   return;
-
/* Get the tracefs root inode. */
root = d_inode(sb->s_root);
ei->attr.uid = root->i_uid;
@@ -240,10 +235,10 @@ static void set_top_events_ownership(struct inode *inode)
struct eventfs_inode *ei = ti->private;
 
/* The top events directory doesn't get automatically updated */
-   if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   if (!ei || !ei->is_events)
return;
 
-   update_top_events_attr(ei, inode->i_sb);
+   update_events_attr(ei, inode->i_sb);
 
if (!(ei->attr.mode & EVENTFS_SAVE_UID))
inode->i_uid = ei->attr.uid;
@@ -272,7 +267,7 @@ static int eventfs_permission(struct mnt_idmap *idmap,
return generic_permission(idmap, inode, mask);
 }
 
-static const struct inode_operations eventfs_root_dir_inode_operations = {
+static const struct inode_operations eventfs_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
@@ -340,7 +335,7 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
// Walk upwards until you find the events inode
} while (!ei->is_events);
 
-   update_top_events_attr(ei, dentry->d_sb);
+   update_events_attr(ei, dentry->d_sb);
 
return ei;
 }
@@ -446,7 +441,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
update_inode_attr(dentry, inode, >attr,
  S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
 
-   inode->i_op = _root_dir_inode_operations;
+   inode->i_op = _dir_inode_operations;
inode->i_fop = _file_operations;
 
/* All directories will have the same inode number */
@@ -826,14 +821,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
uid = d_inode(dentry->d_parent)->i_uid;
gid = d_inode(dentry->d_parent)->i_gid;
 
-   /*
-* If the events directory is of the top instance, then parent
-* is NULL. Set the attr.mode to reflect this and its permissions will
-* default to the tracefs root dentry.
-*/
-   if (!parent)
-   ei->attr.mode = EVENTFS_TOPLEVEL;
-
/* This is used as the default ownership of the files and directories */
ei->attr.uid = uid;
ei->attr.gid = gid;
@@ -842,13 +829,13 @@ struct eventfs_inode *eventfs_create_events_dir(const 
char *name, struct dentry
INIT_LIST_HEAD(>list);
 
ti = get_tracefs(inode);
-   ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+   ti->flags |= TRACEFS_EVENT_INODE;
ti->private = ei;
 
inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
inode->i_uid = uid;
inode->i_gid = gid;
-   inode->i_op = _root_dir_inode_operations;
+   inode->i_op = _dir_inode_operations;
inode->i_fop = _file_operations;
 
dentry->d_fsdata = get_ei(ei);
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 661ac13e2984..d83c2a25f288 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -4,10 +4,9 @@
 
 enum {
TRACEFS_EVENT_INODE = BIT(1),
-   TRACEFS_EVENT_TOP_INODE = BIT(2),
-   TRACEFS_GID_PERM_SET= BIT(3),
-   TRACEFS_UID_PERM_SET= BIT(4),
-   TRACEFS_INSTANCE_INODE  = BIT(5),
+   TRACEFS_GID_PERM_SET= BIT(2),
+   

[PATCH 2/5] tracefs: Still use mount point as default permissions for instances

2024-05-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If the instances directory's permissions were never change, then have it
and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the
instance directory's permissions itself. But if the tracefs file system is
remounted and changes the permissions, the instance directory and its
children should use the new permission.

But because both the instance directory and its children use the instance
directory's inode for permissions, it misses the update.

To demonstrate this:

  # cd /sys/kernel/tracing/
  # mkdir instances/foo
  # ls -ld instances/foo
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld current_tracer
 -rw-r- 1 root root 0 May  1 18:57 current_tracer

  # mount -o remount,gid=1002 .
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo/
  # ls -ld current_tracer
 -rw-r- 1 root lkp 0 May  1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the
instances directory nor its children. It should have been:

  # ls -ld current_tracer
 -rw-r- 1 root root 0 May  1 19:19 current_tracer
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:25 instances/foo/
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 19:19 instances

  # mount -o remount,gid=1002 .
  # ls -ld current_tracer
 -rw-r- 1 root lkp 0 May  1 19:19 current_tracer
  # ls -ld instances
 drwxr-x--- 3 root lkp 0 May  1 19:19 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root lkp 0 May  1 19:25 instances/foo/

Where all files were updated by the remount gid update.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/inode.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 2a1dc2b442d1..1130c0fe2426 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -170,16 +170,39 @@ static void set_tracefs_inode_owner(struct inode *inode)
 {
struct tracefs_inode *ti = get_tracefs(inode);
struct inode *root_inode = ti->private;
+   kuid_t uid;
+   kgid_t gid;
+
+   uid = root_inode->i_uid;
+   gid = root_inode->i_gid;
+
+   /*
+* If the root is not the mount point, then check the root's
+* permissions. If it was never set, then default to the
+* mount point.
+*/
+   if (root_inode != d_inode(root_inode->i_sb->s_root)) {
+   struct tracefs_inode *rti;
+
+   rti = get_tracefs(root_inode);
+   root_inode = d_inode(root_inode->i_sb->s_root);
+
+   if (!(rti->flags & TRACEFS_UID_PERM_SET))
+   uid = root_inode->i_uid;
+
+   if (!(rti->flags & TRACEFS_GID_PERM_SET))
+   gid = root_inode->i_gid;
+   }
 
/*
 * If this inode has never been referenced, then update
 * the permissions to the superblock.
 */
if (!(ti->flags & TRACEFS_UID_PERM_SET))
-   inode->i_uid = root_inode->i_uid;
+   inode->i_uid = uid;
 
if (!(ti->flags & TRACEFS_GID_PERM_SET))
-   inode->i_gid = root_inode->i_gid;
+   inode->i_gid = gid;
 }
 
 static int tracefs_permission(struct mnt_idmap *idmap,
-- 
2.43.0





[PATCH 0/5] tracefs/eventfs: Fix inconsistent permissions

2024-05-01 Thread Steven Rostedt


The tracefs and eventfs permissions are created dynamically based
on what the mount point inode has or the instances directory inode has.
But the way it worked had some inconsistencies that could lead to
security issues as the file system is not behaving like admins would
expect.

The files and directories could ignore the remount option that changes
the gid or uid ownerships, leaving files susceptable to access that
is not expected. This happens if a file had its value changed previously
and then a remount changed all the files permissions. The one that
was changed previously would not be affected.

This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test as it would pass on the
first time it is run, but fail on the second time, because of the
inconsistant state of the permissions. Now you can run that test
multiple times and it will always pass.

Steven Rostedt (Google) (5):
  tracefs: Reset permissions on remount if permissions are options
  tracefs: Still use mount point as default permissions for instances
  eventfs: Do not differentiate the toplevel events directory
  eventfs: Do not treat events directory different than other directories
  eventfs: Have "events" directory get permissions from its parent


 fs/tracefs/event_inode.c | 102 ---
 fs/tracefs/inode.c   |  67 +--
 fs/tracefs/internal.h|   9 +++--
 3 files changed, 130 insertions(+), 48 deletions(-)



[PATCH 1/5] tracefs: Reset permissions on remount if permissions are options

2024-05-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's an inconsistency with the way permissions are handled in tracefs.
Because the permissions are generated when accessed, they default to the
root inode's permission if they were never set by the user. If the user
sets the permissions, then a flag is set and the permissions are saved via
the inode (for tracefs files) or an internal attribute field (for
eventfs).

But if a remount happens that specify the permissions, all the files that
were not changed by the user gets updated, but the ones that were are not.
If the user were to remount the file system with a given permission, then
all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the
admin forgot about it. They could incorrectly think that remounting with
permissions set would update all files, but miss some.

For example:

 # cd /sys/kernel/tracing
 # chgrp 1002 current_tracer
 # ls -l
[..]
 -rw-r-  1 root root 0 May  1 21:25 buffer_size_kb
 -rw-r-  1 root root 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-  1 root root 0 May  1 21:25 buffer_total_size_kb
 -rw-r-  1 root lkp  0 May  1 21:25 current_tracer
 -rw-r-  1 root root 0 May  1 21:25 dynamic_events
 -r--r-  1 root root 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-  1 root root 0 May  1 21:25 enabled_functions

Where current_tracer now has group "lkp".

 # mount -o remount,gid=1001 .
 # ls -l
 -rw-r-  1 root tracing 0 May  1 21:25 buffer_size_kb
 -rw-r-  1 root tracing 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-  1 root tracing 0 May  1 21:25 buffer_total_size_kb
 -rw-r-  1 root lkp 0 May  1 21:25 current_tracer
 -rw-r-  1 root tracing 0 May  1 21:25 dynamic_events
 -r--r-  1 root tracing 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-  1 root tracing 0 May  1 21:25 enabled_functions

Everything changed but the "current_tracer".

Add a new link list that keeps track of all the tracefs_inodes which has
the permission flags that tell if the file/dir should use the root inode's
permission or not. Then on remount, clear all the flags so that the
default behavior of using the root inode's permission is done for all
files and directories.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 29 +
 fs/tracefs/inode.c   | 40 +++-
 fs/tracefs/internal.h|  2 ++
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index dc97c19f9e0a..db71fed7057d 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -289,6 +289,35 @@ static const struct file_operations 
eventfs_file_operations = {
.llseek = generic_file_llseek,
 };
 
+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool 
update_gid)
+{
+   struct eventfs_inode *ei = ti->private;
+
+   if (!ei)
+   return;
+
+   if (update_uid)
+   ei->attr.mode &= ~EVENTFS_SAVE_UID;
+
+   if (update_gid)
+   ei->attr.mode &= ~EVENTFS_SAVE_GID;
+
+   if (!ei->entry_attrs)
+   return;
+
+   for (int i = 0; i < ei->nr_entries; i++) {
+   if (update_uid)
+   ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
+   if (update_gid)
+   ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+   }
+}
+
 /* Return the evenfs_inode of the "events" directory */
 static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5545e6bf7d26..2a1dc2b442d1 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -30,6 +30,13 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
 
+/*
+ * Keep track of all tracefs_inodes in order to update their
+ * flags if necessary on a remount.
+ */
+static DEFINE_MUTEX(tracefs_inode_mutex);
+static LIST_HEAD(tracefs_inodes);
+
 static struct inode *tracefs_alloc_inode(struct super_block *sb)
 {
struct tracefs_inode *ti;
@@ -38,12 +45,22 @@ static struct inode *tracefs_alloc_inode(struct super_block 
*sb)
if (!ti)
return NULL;
 
+   mutex_lock(_inode_mutex);
+   list_add(>list, _inodes);
+   mutex_unlock(_inode_mutex);
+
return >vfs_inode;
 }
 
 static void tracefs_free_inode(struct inode *inode)
 {
-   kmem_cache_free(tracefs_inode_cachep, get_tracefs(inode));
+   struct tracefs_inode *ti = get_tracefs(inode);
+
+   mutex_lock(_inode_mutex);
+   list_del(>list);
+   

Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()

2024-05-01 Thread Google
On Wed, 1 May 2024 16:12:37 -0700
"Paul E. McKenney"  wrote:

> Note that the immediate pressure for this patch should be relieved by the
> NAPI patch series [1], but this sort of problem could easily arise again.
> 
> When running heavy test workloads with KASAN enabled, RCU Tasks grace
> periods can extend for many tens of seconds, significantly slowing
> trace registration.  Therefore, make the registration-side RCU Tasks
> grace period be asynchronous via call_rcu_tasks().
> 

Good catch! AFAICS, there is no reason to wait for synchronization
when adding a new direct trampoline.
This looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> [1] https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/
> 
> Reported-by: Jakub Kicinski 
> Reported-by: Alexei Starovoitov 
> Reported-by: Chris Mason 
> Signed-off-by: Paul E. McKenney 
> Cc: Steven Rostedt 
> Cc: Masami Hiramatsu 
> Cc: Mark Rutland 
> Cc: Mathieu Desnoyers 
> Cc: 
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 6c96b30f3d63b..32ea92934268c 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5365,6 +5365,13 @@ static void remove_direct_functions_hash(struct 
> ftrace_hash *hash, unsigned long
>   }
>  }
>  
> +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> +{
> + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> +
> + free_ftrace_hash(fhp);
> +}
> +
>  /**
>   * register_ftrace_direct - Call a custom trampoline directly
>   * for multiple functions registered in @ops
> @@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
> unsigned long addr)
>   out_unlock:
>   mutex_unlock(_mutex);
>  
> - if (free_hash && free_hash != EMPTY_HASH) {
> - synchronize_rcu_tasks();
> - free_ftrace_hash(free_hash);
> - }
> + if (free_hash && free_hash != EMPTY_HASH)
> + call_rcu_tasks(_hash->rcu, register_ftrace_direct_cb);
>  
>   if (new_hash)
>   free_ftrace_hash(new_hash);


-- 
Masami Hiramatsu (Google) 



Re: [v3] tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()

2024-05-01 Thread Google
On Mon, 29 Apr 2024 15:55:09 +0200
Markus Elfring  wrote:

> …
> > > it jumps to the label 'out' instead of 'fail' by mistake.In the result,
> …
> >
> > Looks good to me.
> 
> * Do you care for a typo in this change description?
> 
> * Would you like to read any improved (patch) version descriptions (or 
> changelogs)?

Thanks, but those are nitpicks and I don't mind it.

Thank you,

> 
> Regards,
> Markus


-- 
Masami Hiramatsu (Google) 



[PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()

2024-05-01 Thread Paul E. McKenney
Note that the immediate pressure for this patch should be relieved by the
NAPI patch series [1], but this sort of problem could easily arise again.

When running heavy test workloads with KASAN enabled, RCU Tasks grace
periods can extend for many tens of seconds, significantly slowing
trace registration.  Therefore, make the registration-side RCU Tasks
grace period be asynchronous via call_rcu_tasks().

[1] https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/

Reported-by: Jakub Kicinski 
Reported-by: Alexei Starovoitov 
Reported-by: Chris Mason 
Signed-off-by: Paul E. McKenney 
Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: 

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 6c96b30f3d63b..32ea92934268c 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5365,6 +5365,13 @@ static void remove_direct_functions_hash(struct 
ftrace_hash *hash, unsigned long
}
 }
 
+static void register_ftrace_direct_cb(struct rcu_head *rhp)
+{
+   struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
+
+   free_ftrace_hash(fhp);
+}
+
 /**
  * register_ftrace_direct - Call a custom trampoline directly
  * for multiple functions registered in @ops
@@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, 
unsigned long addr)
  out_unlock:
mutex_unlock(_mutex);
 
-   if (free_hash && free_hash != EMPTY_HASH) {
-   synchronize_rcu_tasks();
-   free_ftrace_hash(free_hash);
-   }
+   if (free_hash && free_hash != EMPTY_HASH)
+   call_rcu_tasks(_hash->rcu, register_ftrace_direct_cb);
 
if (new_hash)
free_ftrace_hash(new_hash);



Re: [syzbot] [net?] [virt?] [kvm?] KASAN: slab-use-after-free Read in vhost_task_fn

2024-05-01 Thread syzbot
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any 
issue:

Reported-and-tested-by: syzbot+98edc2df894917b34...@syzkaller.appspotmail.com

Tested on:

commit: f138e94c KASAN: slab-use-after-free Read in vhost_task..
git tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
console output: https://syzkaller.appspot.com/x/log.txt?x=10a152a718
kernel config:  https://syzkaller.appspot.com/x/.config?x=3714fc09f933e505
dashboard link: https://syzkaller.appspot.com/bug?extid=98edc2df894917b3431f
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40

Note: no patches were applied.
Note: testing is done by a robot and is best-effort only.



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 10:57:38AM -0500, Mike Christie wrote:
> On 5/1/24 2:50 AM, Hillf Danton wrote:
> > On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> >>
> >> and then it failed testing.
> >>
> > So did my patch [1] but then the reason was spotted [2,3]
> > 
> > [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> > [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> > [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/
> 
> Just to make sure I understand the conclusion.
> 
> Edward's patch that just swaps the order of the calls:
> 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> fixes the UAF. I tested the same in my setup. However, when you guys tested it
> with sysbot, it also triggered a softirq/RCU warning.
> 
> The softirq/RCU part of the issue is fixed with this commit:
> 
> https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/
> 
> commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
> Author: Zqiang 
> Date:   Sat Apr 27 18:28:08 2024 +0800
> 
> softirq: Fix suspicious RCU usage in __do_softirq()
> 
> The problem was that I was testing with -next master which has that patch.
> It looks like you guys were testing against bb7a2467e6be which didn't have
> the patch, and so that's why you guys still hit the softirq/RCU issue. Later
> when you added that patch to your patch, it worked with syzbot.
> 
> So is it safe to assume that the softirq/RCU patch above will be upstream
> when the vhost changes go in or is there a tag I need to add to my patches?

That patch is upstream now. I rebased and asked syzbot to test
https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
on top.

If that passes I will squash.




Re: [syzbot] [net?] [virt?] [kvm?] KASAN: slab-use-after-free Read in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
#syz test https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
f138e94c1f0dbeae721917694fb2203446a68ea9




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 10:57:38AM -0500, Mike Christie wrote:
> On 5/1/24 2:50 AM, Hillf Danton wrote:
> > On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> >>
> >> and then it failed testing.
> >>
> > So did my patch [1] but then the reason was spotted [2,3]
> > 
> > [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> > [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> > [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/
> 
> Just to make sure I understand the conclusion.
> 
> Edward's patch that just swaps the order of the calls:
> 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> fixes the UAF. I tested the same in my setup. However, when you guys tested it
> with sysbot, it also triggered a softirq/RCU warning.
> 
> The softirq/RCU part of the issue is fixed with this commit:
> 
> https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/
> 
> commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
> Author: Zqiang 
> Date:   Sat Apr 27 18:28:08 2024 +0800
> 
> softirq: Fix suspicious RCU usage in __do_softirq()
> 
> The problem was that I was testing with -next master which has that patch.
> It looks like you guys were testing against bb7a2467e6be which didn't have
> the patch, and so that's why you guys still hit the softirq/RCU issue. Later
> when you added that patch to your patch, it worked with syzbot.
> 
> So is it safe to assume that the softirq/RCU patch above will be upstream
> when the vhost changes go in or is there a tag I need to add to my patches?

Two points:
- I do not want bisect broken. If you depend on this patch either I pick
  it too before your patch, or we defer until 
1dd1eff161bd55968d3d46bc36def62d71fb4785
  is merged. You can also ask for that patch to be merged in this cycle.
- Do not assume - pls push somewhere a hash based on vhost that syzbot can test
  and confirm all is well. Thanks!




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Mike Christie
On 5/1/24 2:50 AM, Hillf Danton wrote:
> On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
>>
>> and then it failed testing.
>>
> So did my patch [1] but then the reason was spotted [2,3]
> 
> [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/

Just to make sure I understand the conclusion.

Edward's patch that just swaps the order of the calls:

https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

fixes the UAF. I tested the same in my setup. However, when you guys tested it
with sysbot, it also triggered a softirq/RCU warning.

The softirq/RCU part of the issue is fixed with this commit:

https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/

commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
Author: Zqiang 
Date:   Sat Apr 27 18:28:08 2024 +0800

softirq: Fix suspicious RCU usage in __do_softirq()

The problem was that I was testing with -next master which has that patch.
It looks like you guys were testing against bb7a2467e6be which didn't have
the patch, and so that's why you guys still hit the softirq/RCU issue. Later
when you added that patch to your patch, it worked with syzbot.

So is it safe to assume that the softirq/RCU patch above will be upstream
when the vhost changes go in or is there a tag I need to add to my patches?



Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
> > It took me a while, but I was able to figure out why EEVDF behaves 
> > different then CFS does. I'm still waiting for some official confirmation
> > of my assumptions but it all seems very plausible to me.
> > 
> > Leaving aside all the specifics of vhost and kworkers, a more general
> > description of the scenario would be as follows:
> > 
> > Assume that we have two tasks taking turns on a single CPU. 
> > Task 1 does something and wakes up Task 2.
> > Task 2 does something and goes to sleep.
> > And we're just repeating that.
> > Task 1 and task 2 only run for very short amounts of time, i.e. much 
> > shorter than a regular time slice (vhost = task1, kworker = task2).
> > 
> > Let's further assume, that task 1 runs longer than task 2. 
> > In CFS, this means, that vruntime of task 1 starts to outrun the vruntime
> > of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
> > always gets picked on wake up because it has the smaller vruntime. 
> > In EEVDF, this would translate to a permanent positive lag, which also 
> > causes task 2 to get consistently scheduled on wake up.
> > 
> > Let's now assume, that ocassionally, task 2 runs a little bit longer than
> > task 1. In CFS, this means, that task 2 can close the vruntime gap by a
> > bit, but, it can easily remain below the value of task 1. Task 2 would 
> > still get picked on wake up.
> > With EEVDF, in its current form, task 2 will now get a negative lag, which
> > in turn, will cause it not being picked on the next wake up.
> 
> Right, so I've been working on changes where tasks will be able to
> 'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
> runqueue will allow them to burn off negative lag. Once they get picked
> again they are guaranteed to have zero (or more) lag. If by that time
> they've not been woken up again, they get dequeued with 0-lag.
> 
> (placement with 0-lag will ensure eligibility doesn't inhibit the pick,
> but is not sufficient to ensure a pick)
> 
> However, this alone will not be sufficient to get the behaviour you
> want. Notably, even at 0-lag the virtual deadline will still be after
> the virtual deadline of the already running task -- assuming they have
> equal request sizes.
> 
> That is, IIUC, you want your task 2 (kworker) to always preempt task 1
> (vhost), right? So even if tsak 2 were to have 0-lag, placing it would
> be something like:
> 
> t1  |-<
> t2|-<
> V-|-
> 
> So t1 has started at | with a virtual deadline at <. Then a short
> while later -- V will have advanced a little -- it wakes t2 with 0-lag,
> but as you can observe, its virtual deadline will be later than t1's and
> as such it will never get picked, even though they're both eligible.
> 
> > So, it seems we have a change in the level of how far the both variants 
> > look 
> > into the past. CFS being willing to take more history into account, whereas
> > EEVDF does not (with update_entity_lag setting the lag value from scratch, 
> > and place_entity not taking the original vruntime into account).
> >
> > All of this can be seen as correct by design, a task consumes more time
> > than the others, so it has to give way to others. The big difference
> > is now, that CFS allowed a task to collect some bonus by constantly using 
> > less CPU time than others and trading that time against ocassionally taking
> > more CPU time. EEVDF could do the same thing, by allowing the accumulation
> > of positive lag, which can then be traded against the one time the task
> > would get negative lag. This might clash with other EEVDF assumptions 
> > though.
> 
> Right, so CFS was a pure virtual runtime based scheduler, while EEVDF
> considers both virtual runtime (for eligibility, which ties to fairness)
> but primarily virtual deadline (for timeliness).
> 
> If you want to make EEVDF force pick a task by modifying vruntime you
> have to place it with lag > request (slice) such that the virtual
> deadline of the newly placed task is before the already running task,
> yielding both eligibility and earliest deadline.
> 
> Consistently placing tasks with such large (positive) lag will affect
> fairness though, they're basically always runnable, so barring external
> throttling, they'll starve you.
> 
> > The patch below fixes the degredation, but is not at all aligned with what 
> > EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
> > correct.
> > 
> > So, what does this now mean for the vhost regression we were discussing?
> > 
> > 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> > 2. vhost in its current form relies on the way how CFS works by assuming 
> >that the kworker always gets scheduled.
> 
> How does it assume this? Also, this is a performance issue, not a

Re: [PATCH] eventfs/tracing: Add callback for release of an eventfs_inode

2024-05-01 Thread Google
On Tue, 30 Apr 2024 14:23:27 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Synthetic events create and destroy tracefs files when they are created
> and removed. The tracing subsystem has its own file descriptor
> representing the state of the events attached to the tracefs files.
> There's a race between the eventfs files and this file descriptor of the
> tracing system where the following can cause an issue:
> 
> With two scripts 'A' and 'B' doing:
> 
>   Script 'A':
> echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
> while :
> do
>   echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
> done
> 
>   Script 'B':
> echo > /sys/kernel/tracing/synthetic_events
> 
> Script 'A' creates a synthetic event "hello" and then just writes zero
> into its enable file.
> 
> Script 'B' removes all synthetic events (including the newly created
> "hello" event).
> 
> What happens is that the opening of the "enable" file has:
> 
>  {
>   struct trace_event_file *file = inode->i_private;
>   int ret;
> 
>   ret = tracing_check_open_get_tr(file->tr);
>  [..]
> 
> But deleting the events frees the "file" descriptor, and a "use after
> free" happens with the dereference at "file->tr".
> 
> The file descriptor does have a reference counter, but there needs to be a
> way to decrement it from the eventfs when the eventfs_inode is removed
> that represents this file descriptor.
> 
> Add an optional "release" callback to the eventfs_entry array structure,
> that gets called when the eventfs file is about to be removed. This allows
> for the creating on the eventfs file to increment the tracing file
> descriptor ref counter. When the eventfs file is deleted, it can call the
> release function that will call the put function for the tracing file
> descriptor.
> 
> This will protect the tracing file from being freed while a eventfs file
> that references it is being opened.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> Link: 
> https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/
> 
> Cc: sta...@vger.kernel.org
> Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use 
> eventfs_inode")
> Reported-by: Tze-nan wu 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/tracefs/event_inode.c|  7 +++
>  include/linux/tracefs.h |  3 +++
>  kernel/trace/trace_events.c | 12 
>  3 files changed, 22 insertions(+)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 894c6ca1e500..dc97c19f9e0a 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -84,10 +84,17 @@ enum {
>  static void release_ei(struct kref *ref)
>  {
>   struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
> kref);
> + const struct eventfs_entry *entry;
>   struct eventfs_root_inode *rei;
>  
>   WARN_ON_ONCE(!ei->is_freed);
>  
> + for (int i = 0; i < ei->nr_entries; i++) {
> + entry = >entries[i];
> + if (entry->release)
> + entry->release(entry->name, ei->data);
> + }
> +
>   kfree(ei->entry_attrs);
>   kfree_const(ei->name);
>   if (ei->is_events) {
> diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
> index 7a5fe17b6bf9..d03f74658716 100644
> --- a/include/linux/tracefs.h
> +++ b/include/linux/tracefs.h
> @@ -62,6 +62,8 @@ struct eventfs_file;
>  typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
>   const struct file_operations **fops);
>  
> +typedef void (*eventfs_release)(const char *name, void *data);
> +
>  /**
>   * struct eventfs_entry - dynamically created eventfs file call back handler
>   * @name:Then name of the dynamic file in an eventfs directory
> @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t 
> *mode, void **data,
>  struct eventfs_entry {
>   const char  *name;
>   eventfs_callbackcallback;
> + eventfs_release release;
>  };
>  
>  struct eventfs_inode;
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 52f75c36bbca..6ef29eba90ce 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t 
> *mode, void **data,
>   return 0;
>  }
>  
> +/* The file is incremented on creation and freeing the enable file 
> decrements it */
> +static void event_release(const char *name, void *data)
> +{
> + struct trace_event_file *file = data;
> +
> + event_file_put(file);
> +}
> +
>  static int
>  event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
>  {
> @@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct 
> trace_event_file *file)
>   {
>   .name   = "enable",
>

Re: [PATCH 20/20] [RFC] sh: dma: Remove unused functionality

2024-05-01 Thread John Paul Adrian Glaubitz
Hi Geert,

On Wed, 2024-05-01 at 11:12 +0200, John Paul Adrian Glaubitz wrote:
> Hi Geert,
> 
> On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> > dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> > request_dma_bycap() are unused.  Remove them, and all related code.
> > 
> > Signed-off-by: Geert Uytterhoeven 
> > ---
> >  arch/sh/drivers/dma/dma-api.c | 116 --
> >  arch/sh/include/asm/dma.h |   7 --
> >  2 files changed, 123 deletions(-)
> > 
> > diff --git a/arch/sh/drivers/dma/dma-api.c b/arch/sh/drivers/dma/dma-api.c
> > index f49097fa634c36d4..87e5a892887360f5 100644
> > --- a/arch/sh/drivers/dma/dma-api.c
> > +++ b/arch/sh/drivers/dma/dma-api.c
> > @@ -41,21 +41,6 @@ struct dma_info *get_dma_info(unsigned int chan)
> >  }
> >  EXPORT_SYMBOL(get_dma_info);
> >  
> > -struct dma_info *get_dma_info_by_name(const char *dmac_name)
> > -{
> > -   struct dma_info *info;
> > -
> > -   list_for_each_entry(info, _dmac_list, list) {
> > -   if (dmac_name && (strcmp(dmac_name, info->name) != 0))
> > -   continue;
> > -   else
> > -   return info;
> > -   }
> > -
> > -   return NULL;
> > -}
> > -EXPORT_SYMBOL(get_dma_info_by_name);
> > -
> >  static unsigned int get_nr_channels(void)
> >  {
> > struct dma_info *info;
> > @@ -101,66 +86,6 @@ int get_dma_residue(unsigned int chan)
> >  }
> >  EXPORT_SYMBOL(get_dma_residue);
> >  
> > -static int search_cap(const char **haystack, const char *needle)
> > -{
> > -   const char **p;
> > -
> > -   for (p = haystack; *p; p++)
> > -   if (strcmp(*p, needle) == 0)
> > -   return 1;
> > -
> > -   return 0;
> > -}
> > -
> > -/**
> > - * request_dma_bycap - Allocate a DMA channel based on its capabilities
> > - * @dmac: List of DMA controllers to search
> > - * @caps: List of capabilities
> > - *
> > - * Search all channels of all DMA controllers to find a channel which
> > - * matches the requested capabilities. The result is the channel
> > - * number if a match is found, or %-ENODEV if no match is found.
> > - *
> > - * Note that not all DMA controllers export capabilities, in which
> > - * case they can never be allocated using this API, and so
> > - * request_dma() must be used specifying the channel number.
> > - */
> > -int request_dma_bycap(const char **dmac, const char **caps, const char 
> > *dev_id)
> > -{
> > -   unsigned int found = 0;
> > -   struct dma_info *info;
> > -   const char **p;
> > -   int i;
> > -
> > -   BUG_ON(!dmac || !caps);
> > -
> > -   list_for_each_entry(info, _dmac_list, list)
> > -   if (strcmp(*dmac, info->name) == 0) {
> > -   found = 1;
> > -   break;
> > -   }
> > -
> > -   if (!found)
> > -   return -ENODEV;
> > -
> > -   for (i = 0; i < info->nr_channels; i++) {
> > -   struct dma_channel *channel = >channels[i];
> > -
> > -   if (unlikely(!channel->caps))
> > -   continue;
> > -
> > -   for (p = caps; *p; p++) {
> > -   if (!search_cap(channel->caps, *p))
> > -   break;
> > -   if (request_dma(channel->chan, dev_id) == 0)
> > -   return channel->chan;
> > -   }
> > -   }
> > -
> > -   return -EINVAL;
> > -}
> > -EXPORT_SYMBOL(request_dma_bycap);
> > -
> >  int request_dma(unsigned int chan, const char *dev_id)
> >  {
> > struct dma_channel *channel = { 0 };
> > @@ -213,35 +138,6 @@ void dma_wait_for_completion(unsigned int chan)
> >  }
> >  EXPORT_SYMBOL(dma_wait_for_completion);
> >  
> > -int register_chan_caps(const char *dmac, struct dma_chan_caps *caps)
> > -{
> > -   struct dma_info *info;
> > -   unsigned int found = 0;
> > -   int i;
> > -
> > -   list_for_each_entry(info, _dmac_list, list)
> > -   if (strcmp(dmac, info->name) == 0) {
> > -   found = 1;
> > -   break;
> > -   }
> > -
> > -   if (unlikely(!found))
> > -   return -ENODEV;
> > -
> > -   for (i = 0; i < info->nr_channels; i++, caps++) {
> > -   struct dma_channel *channel;
> > -
> > -   if ((info->first_channel_nr + i) != caps->ch_num)
> > -   return -EINVAL;
> > -
> > -   channel = >channels[i];
> > -   channel->caps = caps->caplist;
> > -   }
> > -
> > -   return 0;
> > -}
> > -EXPORT_SYMBOL(register_chan_caps);
> > -
> >  void dma_configure_channel(unsigned int chan, unsigned long flags)
> >  {
> > struct dma_info *info = get_dma_info(chan);
> > @@ -267,18 +163,6 @@ int dma_xfer(unsigned int chan, unsigned long from,
> >  }
> >  EXPORT_SYMBOL(dma_xfer);
> >  
> > -int dma_extend(unsigned int chan, unsigned long op, void *param)
> > -{
> > -   struct dma_info *info = get_dma_info(chan);
> > -   struct dma_channel *channel = get_dma_channel(chan);
> > -
> > -   if (info->ops->extend)
> > -   return info->ops->extend(channel, 

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-05-01 Thread Peter Zijlstra
On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
> It took me a while, but I was able to figure out why EEVDF behaves 
> different then CFS does. I'm still waiting for some official confirmation
> of my assumptions but it all seems very plausible to me.
> 
> Leaving aside all the specifics of vhost and kworkers, a more general
> description of the scenario would be as follows:
> 
> Assume that we have two tasks taking turns on a single CPU. 
> Task 1 does something and wakes up Task 2.
> Task 2 does something and goes to sleep.
> And we're just repeating that.
> Task 1 and task 2 only run for very short amounts of time, i.e. much 
> shorter than a regular time slice (vhost = task1, kworker = task2).
> 
> Let's further assume, that task 1 runs longer than task 2. 
> In CFS, this means, that vruntime of task 1 starts to outrun the vruntime
> of task 2. This means that vruntime(task2) < vruntime(task1). Hence, task 2
> always gets picked on wake up because it has the smaller vruntime. 
> In EEVDF, this would translate to a permanent positive lag, which also 
> causes task 2 to get consistently scheduled on wake up.
> 
> Let's now assume, that ocassionally, task 2 runs a little bit longer than
> task 1. In CFS, this means, that task 2 can close the vruntime gap by a
> bit, but, it can easily remain below the value of task 1. Task 2 would 
> still get picked on wake up.
> With EEVDF, in its current form, task 2 will now get a negative lag, which
> in turn, will cause it not being picked on the next wake up.

Right, so I've been working on changes where tasks will be able to
'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
runqueue will allow them to burn off negative lag. Once they get picked
again they are guaranteed to have zero (or more) lag. If by that time
they've not been woken up again, they get dequeued with 0-lag.

(placement with 0-lag will ensure eligibility doesn't inhibit the pick,
but is not sufficient to ensure a pick)

However, this alone will not be sufficient to get the behaviour you
want. Notably, even at 0-lag the virtual deadline will still be after
the virtual deadline of the already running task -- assuming they have
equal request sizes.

That is, IIUC, you want your task 2 (kworker) to always preempt task 1
(vhost), right? So even if tsak 2 were to have 0-lag, placing it would
be something like:

t1  |-<
t2|-<
V-|-

So t1 has started at | with a virtual deadline at <. Then a short
while later -- V will have advanced a little -- it wakes t2 with 0-lag,
but as you can observe, its virtual deadline will be later than t1's and
as such it will never get picked, even though they're both eligible.

> So, it seems we have a change in the level of how far the both variants look 
> into the past. CFS being willing to take more history into account, whereas
> EEVDF does not (with update_entity_lag setting the lag value from scratch, 
> and place_entity not taking the original vruntime into account).
>
> All of this can be seen as correct by design, a task consumes more time
> than the others, so it has to give way to others. The big difference
> is now, that CFS allowed a task to collect some bonus by constantly using 
> less CPU time than others and trading that time against ocassionally taking
> more CPU time. EEVDF could do the same thing, by allowing the accumulation
> of positive lag, which can then be traded against the one time the task
> would get negative lag. This might clash with other EEVDF assumptions though.

Right, so CFS was a pure virtual runtime based scheduler, while EEVDF
considers both virtual runtime (for eligibility, which ties to fairness)
but primarily virtual deadline (for timeliness).

If you want to make EEVDF force pick a task by modifying vruntime you
have to place it with lag > request (slice) such that the virtual
deadline of the newly placed task is before the already running task,
yielding both eligibility and earliest deadline.

Consistently placing tasks with such large (positive) lag will affect
fairness though, they're basically always runnable, so barring external
throttling, they'll starve you.

> The patch below fixes the degredation, but is not at all aligned with what 
> EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
> correct.
> 
> So, what does this now mean for the vhost regression we were discussing?
> 
> 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> 2. vhost in its current form relies on the way how CFS works by assuming 
>that the kworker always gets scheduled.

How does it assume this? Also, this is a performance issue, not a
correctness issue, right?

> I would like to argue that it therefore makes sense to reconsider the vhost
> implementation to make it less dependent on the internals of the scheduler.

I think I'll propose the opposite :-) Much of the problems we have are

Re: [PATCH 00/20] sh: Fix missing prototypes

2024-05-01 Thread John Paul Adrian Glaubitz
Hi Geert,

On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
>   Hi all,
> 
> This patch series fixes several "no previous prototype for "
> warnings when building a kernel for SuperH.
> 
> Known issues:
>   - The various warnings about cache functions are not yet fixed, but
> I didn't want to hold off the rest of this series,
>   - sdk7786_defconfig needs "[PATCH/RFC] locking/spinlocks: Make __raw_*
> lock ops static" [1],
>   - Probably there are more warnings to fix, I didn't build all
> defconfigs.
> 
> This has been boot-tested on landisk and on qemu/rts7751r2d.
> 
> Thanks for your comments!
> 
> [1] 
> https://lore.kernel.org/linux-sh/c395b02613572131568bc1fd1bc456d20d1a5426.1709325647.git.geert+rene...@glider.be
> 
> Geert Uytterhoeven (20):
>   sh: pgtable: Fix missing prototypes
>   sh: fpu: Add missing forward declarations
>   sh: syscall: Add missing forward declaration for sys_cacheflush()
>   sh: tlb: Add missing forward declaration for handle_tlbmiss()
>   sh: return_address: Add missing #include 
>   sh: traps: Add missing #include 
>   sh: hw_breakpoint: Add missing forward declaration for
> arch_bp_generic_fields()
>   sh: boot: Add proper forward declarations
>   sh: ftrace: Fix missing prototypes
>   sh: nommu: Add missing #include 
>   sh: math-emu: Add missing #include 
>   sh: dma: Remove unused dmac_search_free_channel()
>   sh: sh2a: Add missing #include 
>   sh: sh7786: Remove unused sh7786_usb_use_exclock()
>   sh: smp: Fix missing prototypes
>   sh: kprobes: Merge arch_copy_kprobe() into arch_prepare_kprobe()
>   sh: kprobes: Make trampoline_probe_handler() static
>   sh: kprobes: Remove unneeded kprobe_opcode_t casts
>   sh: dwarf: Make dwarf_lookup_fde() static
>   [RFC] sh: dma: Remove unused functionality
> 
>  arch/sh/boot/compressed/cache.c |   3 +
>  arch/sh/boot/compressed/cache.h |  10 ++
>  arch/sh/boot/compressed/misc.c  |   8 +-
>  arch/sh/boot/compressed/misc.h  |   9 ++
>  arch/sh/drivers/dma/dma-api.c   | 143 
>  arch/sh/include/asm/dma.h   |   7 --
>  arch/sh/include/asm/fpu.h   |   3 +
>  arch/sh/include/asm/ftrace.h|  10 ++
>  arch/sh/include/asm/hw_breakpoint.h |   2 +
>  arch/sh/include/asm/syscalls.h  |   1 +
>  arch/sh/include/asm/tlb.h   |   4 +
>  arch/sh/kernel/cpu/sh2a/opcode_helper.c |   2 +
>  arch/sh/kernel/cpu/sh4a/setup-sh7786.c  |  14 ---
>  arch/sh/kernel/dwarf.c  |   2 +-
>  arch/sh/kernel/kprobes.c|  13 +--
>  arch/sh/kernel/return_address.c |   2 +
>  arch/sh/kernel/smp.c|   4 +-
>  arch/sh/kernel/traps.c  |  10 +-
>  arch/sh/kernel/traps_32.c   |   1 +
>  arch/sh/math-emu/math.c |   2 +
>  arch/sh/mm/nommu.c  |   2 +
>  arch/sh/mm/pgtable.c|   4 +-
>  arch/sh/mm/tlbex_32.c   |   1 +
>  23 files changed, 68 insertions(+), 189 deletions(-)
>  create mode 100644 arch/sh/boot/compressed/cache.h
>  create mode 100644 arch/sh/boot/compressed/misc.h
> 

For the whole series:

Reviewed-by: John Paul Adrian Glaubitz 

I would still like to get feedback from Yoshinori on patch #20 though, i.e.

"sh: dma: Remove unused functionality".

On the other hand, we could just merge this series and re-add the functions
later if we decide they're still needed.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: [PATCH 20/20] [RFC] sh: dma: Remove unused functionality

2024-05-01 Thread John Paul Adrian Glaubitz
Hi Geert,

On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> request_dma_bycap() are unused.  Remove them, and all related code.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
>  arch/sh/drivers/dma/dma-api.c | 116 --
>  arch/sh/include/asm/dma.h |   7 --
>  2 files changed, 123 deletions(-)
> 
> diff --git a/arch/sh/drivers/dma/dma-api.c b/arch/sh/drivers/dma/dma-api.c
> index f49097fa634c36d4..87e5a892887360f5 100644
> --- a/arch/sh/drivers/dma/dma-api.c
> +++ b/arch/sh/drivers/dma/dma-api.c
> @@ -41,21 +41,6 @@ struct dma_info *get_dma_info(unsigned int chan)
>  }
>  EXPORT_SYMBOL(get_dma_info);
>  
> -struct dma_info *get_dma_info_by_name(const char *dmac_name)
> -{
> - struct dma_info *info;
> -
> - list_for_each_entry(info, _dmac_list, list) {
> - if (dmac_name && (strcmp(dmac_name, info->name) != 0))
> - continue;
> - else
> - return info;
> - }
> -
> - return NULL;
> -}
> -EXPORT_SYMBOL(get_dma_info_by_name);
> -
>  static unsigned int get_nr_channels(void)
>  {
>   struct dma_info *info;
> @@ -101,66 +86,6 @@ int get_dma_residue(unsigned int chan)
>  }
>  EXPORT_SYMBOL(get_dma_residue);
>  
> -static int search_cap(const char **haystack, const char *needle)
> -{
> - const char **p;
> -
> - for (p = haystack; *p; p++)
> - if (strcmp(*p, needle) == 0)
> - return 1;
> -
> - return 0;
> -}
> -
> -/**
> - * request_dma_bycap - Allocate a DMA channel based on its capabilities
> - * @dmac: List of DMA controllers to search
> - * @caps: List of capabilities
> - *
> - * Search all channels of all DMA controllers to find a channel which
> - * matches the requested capabilities. The result is the channel
> - * number if a match is found, or %-ENODEV if no match is found.
> - *
> - * Note that not all DMA controllers export capabilities, in which
> - * case they can never be allocated using this API, and so
> - * request_dma() must be used specifying the channel number.
> - */
> -int request_dma_bycap(const char **dmac, const char **caps, const char 
> *dev_id)
> -{
> - unsigned int found = 0;
> - struct dma_info *info;
> - const char **p;
> - int i;
> -
> - BUG_ON(!dmac || !caps);
> -
> - list_for_each_entry(info, _dmac_list, list)
> - if (strcmp(*dmac, info->name) == 0) {
> - found = 1;
> - break;
> - }
> -
> - if (!found)
> - return -ENODEV;
> -
> - for (i = 0; i < info->nr_channels; i++) {
> - struct dma_channel *channel = >channels[i];
> -
> - if (unlikely(!channel->caps))
> - continue;
> -
> - for (p = caps; *p; p++) {
> - if (!search_cap(channel->caps, *p))
> - break;
> - if (request_dma(channel->chan, dev_id) == 0)
> - return channel->chan;
> - }
> - }
> -
> - return -EINVAL;
> -}
> -EXPORT_SYMBOL(request_dma_bycap);
> -
>  int request_dma(unsigned int chan, const char *dev_id)
>  {
>   struct dma_channel *channel = { 0 };
> @@ -213,35 +138,6 @@ void dma_wait_for_completion(unsigned int chan)
>  }
>  EXPORT_SYMBOL(dma_wait_for_completion);
>  
> -int register_chan_caps(const char *dmac, struct dma_chan_caps *caps)
> -{
> - struct dma_info *info;
> - unsigned int found = 0;
> - int i;
> -
> - list_for_each_entry(info, _dmac_list, list)
> - if (strcmp(dmac, info->name) == 0) {
> - found = 1;
> - break;
> - }
> -
> - if (unlikely(!found))
> - return -ENODEV;
> -
> - for (i = 0; i < info->nr_channels; i++, caps++) {
> - struct dma_channel *channel;
> -
> - if ((info->first_channel_nr + i) != caps->ch_num)
> - return -EINVAL;
> -
> - channel = >channels[i];
> - channel->caps = caps->caplist;
> - }
> -
> - return 0;
> -}
> -EXPORT_SYMBOL(register_chan_caps);
> -
>  void dma_configure_channel(unsigned int chan, unsigned long flags)
>  {
>   struct dma_info *info = get_dma_info(chan);
> @@ -267,18 +163,6 @@ int dma_xfer(unsigned int chan, unsigned long from,
>  }
>  EXPORT_SYMBOL(dma_xfer);
>  
> -int dma_extend(unsigned int chan, unsigned long op, void *param)
> -{
> - struct dma_info *info = get_dma_info(chan);
> - struct dma_channel *channel = get_dma_channel(chan);
> -
> - if (info->ops->extend)
> - return info->ops->extend(channel, op, param);
> -
> - return -ENOSYS;
> -}
> -EXPORT_SYMBOL(dma_extend);
> -
>  static int dma_proc_show(struct seq_file *m, void *v)
>  {
>   struct dma_info *info = v;
> diff --git a/arch/sh/include/asm/dma.h b/arch/sh/include/asm/dma.h
> index 

Re: [PATCH 12/20] sh: dma: Remove unused dmac_search_free_channel()

2024-05-01 Thread John Paul Adrian Glaubitz
Hi Geert,

On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> arch/sh/drivers/dma/dma-api.c:164:5: warning: no previous prototype for 
> 'dmac_search_free_channel' [-Wmissing-prototypes]
> 
> dmac_search_free_channel() never had a user in upstream, remove it.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
> dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> request_dma_bycap() are also unused, but don't trigger warnings
> ---

I assume the other functions didn't trigger a warning because their symbols
were exported. Correct me if I'm wrong.

Adrian

>  arch/sh/drivers/dma/dma-api.c | 27 ---
>  1 file changed, 27 deletions(-)
> 
> diff --git a/arch/sh/drivers/dma/dma-api.c b/arch/sh/drivers/dma/dma-api.c
> index 89cd4a3b4ccafbe2..f49097fa634c36d4 100644
> --- a/arch/sh/drivers/dma/dma-api.c
> +++ b/arch/sh/drivers/dma/dma-api.c
> @@ -161,33 +161,6 @@ int request_dma_bycap(const char **dmac, const char 
> **caps, const char *dev_id)
>  }
>  EXPORT_SYMBOL(request_dma_bycap);
>  
> -int dmac_search_free_channel(const char *dev_id)
> -{
> - struct dma_channel *channel = { 0 };
> - struct dma_info *info = get_dma_info(0);
> - int i;
> -
> - for (i = 0; i < info->nr_channels; i++) {
> - channel = >channels[i];
> - if (unlikely(!channel))
> - return -ENODEV;
> -
> - if (atomic_read(>busy) == 0)
> - break;
> - }
> -
> - if (info->ops->request) {
> - int result = info->ops->request(channel);
> - if (result)
> - return result;
> -
> - atomic_set(>busy, 1);
> - return channel->chan;
> - }
> -
> - return -ENOSYS;
> -}
> -
>  int request_dma(unsigned int chan, const char *dev_id)
>  {
>   struct dma_channel *channel = { 0 };

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Hillf Danton
On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> 
> and then it failed testing.
> 
So did my patch [1] but then the reason was spotted [2,3]

[1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
[2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
[3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 08:15:44AM +0800, Hillf Danton wrote:
> On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> > On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> > >  static int vhost_task_fn(void *data)
> > >  {
> > >   struct vhost_task *vtsk = data;
> > > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > >   schedule();
> > >   }
> > >  
> > > - mutex_lock(>exit_mutex);
> > > + mutex_lock(_mutex);
> > >   /*
> > >* If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> > >* When the vhost layer has called vhost_task_stop it's already stopped
> > > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > >   vtsk->handle_sigkill(vtsk->data);
> > >   }
> > >   complete(>exited);
> > > - mutex_unlock(>exit_mutex);
> > > + mutex_unlock(_mutex);
> > >  
> > 
> > Edward, thanks for the patch. I think though I just needed to swap the
> > order of the calls above.
> > 
> > Instead of:
> > 
> > complete(>exited);
> > mutex_unlock(>exit_mutex);
> > 
> > it should have been:
> > 
> > mutex_unlock(>exit_mutex);
> > complete(>exited);
> 
> JFYI Edward did it [1]
> 
> [1] 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

and then it failed testing.

> > 
> > If my analysis is correct, then Michael do you want me to resubmit a
> > patch on top of your vhost branch or resubmit the entire patchset?