Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-28 Thread Tobias Huschle
On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

[...]

> > - At depth 4, the cgroup shows the observed vruntime value which is smaller
> >by a factor of 20, but depth 0 seems to be running with values of the
> >correct magnitude.
> 
> A child is running means its parent also being the cfs->curr, but
> not vice versa if there are more than one child.
> 
> > - cgroup at depth 0 has zero lag, with higher depth, there are large lag
> >values (as observed 606.338267 onwards)
> 
> These values of se->vlag means 'run this entity to parity' to avoid
> excess context switch, which is what RUN_TO_PARITY does, or nothing
> when !RUN_TO_PARITY. In short, se->vlag is not vlag when se->on_rq.
> 

Thanks for clarifying that. This makes things clearer to me.

> > 
> > Now the following occurs, triggered by the vhost:
> > - The kworker gets placed again with:
> >  vruntime  deadline
> > cgroup56117619190   57650477291 -> depth 0, last known value
> > kworker   56117885776   56120885776 -> lag of -725
> > - vhost continues executing and updates its vruntime accordingly, here
> >I would need to enhance the trace to also print the vruntimes of the
> >parent sched_entities to see the progress of their vruntime/deadline/lag
> >values as well
> > - It is a bit irritating that the EEVDF algorithm would not pick the kworker
> >over the cgroup as its deadline is smaller.
> >But, the kworker has negative lag, which might cause EEVDF to not pick
> >the kworker.
> >The cgroup at depth 0 has no lag, all deeper layers have a significantly
> >positive lag (last known values, might have changed in the meantime).
> >At this point I would see the option that the vhost task is stuck
> >somewhere or EEVDF just does not see the kworker as an eligible option.
> 
> IMHO such lag should not introduce that long delay. Can you run the
> test again with NEXT_BUDDY disabled?

I added a trace event to the next buddy path, it does not get triggered, so I'd 
assume that no buddies are selected.

> 
> > 
> > - Once the vhost is migrated off the cpu, the update_entity_lag function
> >works with the following values at 606.467022: sched_update
> >For the cgroup at depth 0
> >- vruntime = 5710415 --> this is in line with the amount of new 
> > timeslices
> > vhost got assigned while the kworker was 
> > waiting
> >- vlag =   -62439022 --> the scheduler knows that the cgroup was
> > overconsuming, but no runtime for the 
> > kworker
> >For the cfs_rq we have
> >- min_vruntime =  56117885776 --> this matches the vruntime of the 
> > kworker
> >- avg_vruntime = 161750065796 --> this is rather large in comparison, 
> > but I
> >  might access this value at a bad time
> 
> Use avg_vruntime() instead.

Fair.
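For context: the raw cfs_rq->avg_vruntime field is not an average but a
load-weighted sum of each queued entity's vruntime relative to min_vruntime,
which is why the value above looks so large. A minimal, self-contained sketch
of the normalization avg_vruntime() performs (simplified illustration with
made-up types, not the kernel code, which also folds in the currently running
entity and rounds carefully):

	/* Hypothetical simplified queue, not the kernel's cfs_rq. */
	struct cfs_rq_sketch {
		long long avg_vruntime;          /* sum of (vruntime - min_vruntime) * weight */
		long long avg_load;              /* sum of the queued entities' weights */
		unsigned long long min_vruntime;
	};

	static unsigned long long avg_vruntime_sketch(const struct cfs_rq_sketch *rq)
	{
		long long avg = rq->avg_vruntime;

		if (rq->avg_load)
			avg /= rq->avg_load;     /* normalize the weighted sum */

		return rq->min_vruntime + avg;   /* rebase onto min_vruntime */
	}

So the raw field only becomes comparable to an entity's vruntime after dividing
by the queued load and adding back min_vruntime, which is what avg_vruntime()
does.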

[...]

> > 
> > # full trace #
> > 
> > sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
> > --> during __pick_eevdf, prints values for best and the first node loop 
> > variable, second loop is never executed
> > 
> > sched_place/sched_update: 
> > sev=se->vruntime,sed=se->deadline,sev=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
> 
> It would be better replace cfs_rq->avg_vruntime with avg_vruntime().
> Although we can get real @avg by (vruntime + vlag), I am not sure
> vlag (@lag in trace) is se->vlag or the local variable in the place
> function which is scaled and no longer be the true vlag.
> 

Oh my bad, sev is the vlag value of the sched_entity, lag is the local variable.

[...]

> >  vhost-2931-2953[013] d   606.338313: sched_wakeup: 
> > comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> > --> kworker set to runnable, but vhost keeps on executing
> 
> What are the weights of the two entities?

I'll do another run and look at those values.

[...]



[PATCH v4] libtracecmd: Use an rbtree for mapping of cache pages

2023-11-28 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

I was loading a very large trace.dat file into kernelshark when it just
stopped near the end of the load and hung there for a very long time. I
ran it under gdb and hit Ctrl^C when it hit the hang to see what it was
doing and found that it was spinning in:

  libtracecmd trace-input.c: free_zpage

static void free_zpage(struct cpu_data *cpu_data, void *map)
{
struct zchunk_cache *cache;

	list_for_each_entry(cache, &cpu_data->compress.cache, list) {
		if (map <= cache->map && map > (cache->map + cache->chunk->size))
			goto found;
	}
	return;

found:

It seems that there can be over 10 thousand cache pages in that linked
list and that kernelshark is constantly hitting that. So it's doing a
linear search for the mapped page to free it.

I ran:  time kernelshark trace.dat

And exited out when it finished loading and the result was:

real	6m14.772s
user	6m0.649s
sys	0m12.718s

That's over 6 minutes to load the trace.dat file!!!

I ran perf record on it and it showed 77% of the time was in free_zpage().

I pulled out my old algorithms book and wrote up an rbtree for internal use
of libtracecmd. Then I switched the cache into a binary rbtree to do the
lookups. The lookups used either the address where the compressed page was
mapped in memory or the file offset, depending on how the search was done,
but I found that only one location actually searched by the memory address.
Luckily, that lookup also had access to the offset in the file that the
memory represented, which allowed me to make all lookups use the file offset.
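To illustrate the idea with a minimal, self-contained sketch (hypothetical
types, not the trace_rbtree API the patch below adds): keying the cache by
file offset turns the linear list walk into a logarithmic tree descent. The
real patch uses a red-black tree so the tree stays balanced.

	/* Hypothetical simplified cache node, keyed by the file offset it covers. */
	struct zcache_node {
		struct zcache_node *left, *right;
		long long offset;   /* offset in trace.dat covered by this chunk */
		long long size;     /* number of bytes the chunk expands to */
		void *map;          /* where the uncompressed data is mapped */
	};

	/* O(log n) lookup of the chunk containing @offset in an offset-ordered tree. */
	static struct zcache_node *zcache_find(struct zcache_node *root, long long offset)
	{
		while (root) {
			if (offset < root->offset)
				root = root->left;
			else if (offset >= root->offset + root->size)
				root = root->right;
			else
				return root;   /* offset falls inside this chunk */
		}
		return NULL;
	}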

After converting the cache to an rbtree lookup, I ran kernelshark again on
opening that file and exited out as soon as it finished loading and the
timing was:

real	1m22.356s
user	1m10.532s
sys	0m10.901s

Still a bit long, but it dropped from over 6 minutes to under 1 1/2
minutes. Also, free_zpage() was no longer in the perf record output.

Running the same file under trace-cmd produced:

Without this change:

  $ time trace-cmd report trace.dat > /dev/null
 real	9m20.390s
 user	9m16.391s
 sys	0m3.529s

With this change:

  $ time trace-cmd report trace.dat > /dev/null
 real	6m22.935s
 user	6m19.537s
 sys	0m3.139s

Not as drastic as the KernelShark speedup, but still brings it down by a
third.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v3: 
https://lore.kernel.org/all/20231019165205.371e8...@gandalf.local.home/

- Check for node not NULL before checking node->left

Changes since v2: 
https://lore.kernel.org/all/20231017104307.662aa...@gandalf.local.home/

 - Fixed a few more bugs
 - Added a nice little iterator (although not used)
 - Added a "check_tree()" routine in case I need to debug it again
   This is #if out but nice to keep in the code anyway.

Changes since v1: 
https://lore.kernel.org/linux-trace-devel/20231016230058.2f0e9...@gandalf.local.home/

- Removed some leftover debug
- Fix deletion in setting the color of the node that replaces the deleted
  node.


 lib/trace-cmd/Makefile   |   1 +
 lib/trace-cmd/include/private/trace-rbtree.h |  34 ++
 lib/trace-cmd/trace-input.c  |  81 +++-
 lib/trace-cmd/trace-rbtree.c | 440 +++
 4 files changed, 533 insertions(+), 23 deletions(-)
 create mode 100644 lib/trace-cmd/include/private/trace-rbtree.h
 create mode 100644 lib/trace-cmd/trace-rbtree.c

diff --git a/lib/trace-cmd/Makefile b/lib/trace-cmd/Makefile
index e9d26b2bb367..aba6fda5b0f6 100644
--- a/lib/trace-cmd/Makefile
+++ b/lib/trace-cmd/Makefile
@@ -9,6 +9,7 @@ DEFAULT_TARGET = $(LIBTRACECMD_STATIC)
 
 OBJS =
 OBJS += trace-hash.o
+OBJS += trace-rbtree.o
 OBJS += trace-hooks.o
 OBJS += trace-input.o
 OBJS += trace-output.o
diff --git a/lib/trace-cmd/include/private/trace-rbtree.h 
b/lib/trace-cmd/include/private/trace-rbtree.h
new file mode 100644
index ..822111998e7a
--- /dev/null
+++ b/lib/trace-cmd/include/private/trace-rbtree.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+/*
+ * Copyright (C) 2023 Google, Steven Rostedt 
+ *
+ */
+#ifndef _TRACE_RBTREE_H
+#define _TRACE_RBTREE_H
+
+struct trace_rbtree_node {
+	struct trace_rbtree_node	*parent;
+	struct trace_rbtree_node	*left;
+	struct trace_rbtree_node	*right;
+	int				color;
+};
+
+typedef int (*trace_rbtree_cmp_fn)(const struct trace_rbtree_node *A, const struct trace_rbtree_node *B);
+
+typedef int (*trace_rbtree_search_fn)(const struct trace_rbtree_node *n, const void *data);
+
+struct trace_rbtree {
+	struct trace_rbtree_node	*node;
+	trace_rbtree_search_fn		search;
+	trace_rbtree_cmp_fn		cmp;
+	size_t				nr_nodes;
+};
+
+void trace_rbtree_init(struct trace_rbtree *tree, trace_rbtree_cmp_fn cmp_fn,
+  

[PATCH AUTOSEL 6.6 35/40] eventfs: Do not allow NULL parent to eventfs_start_creating()

2023-11-28 Thread Sasha Levin
From: "Steven Rostedt (Google)" 

[ Upstream commit fc4561226feaad5fcdcb55646c348d77b8ee69c5 ]

The eventfs directory is dynamically created via the meta data supplied by
the existing trace events. All files and directories in eventfs have a
parent. Do not allow NULL to be passed into eventfs_start_creating() as
the parent because that should never happen. Warn if it does.

Link: https://lkml.kernel.org/r/20231121231112.693841...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Andrew Morton 
Reviewed-by: Josef Bacik 
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Sasha Levin 
---
 fs/tracefs/inode.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 891653ba9cf35..0292c6a2bed9f 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -509,20 +509,15 @@ struct dentry *eventfs_start_creating(const char *name, 
struct dentry *parent)
struct dentry *dentry;
int error;
 
+	/* Must always have a parent. */
+	if (WARN_ON_ONCE(!parent))
+		return ERR_PTR(-EINVAL);
+
	error = simple_pin_fs(&trace_fs_type, &tracefs_mount,
			      &tracefs_mount_count);
	if (error)
		return ERR_PTR(error);
 
-   /*
-* If the parent is not specified, we create it in the root.
-* We need the root dentry to do this, which is in the super
-* block. A pointer to that is in the struct vfsmount that we
-* have around.
-*/
-   if (!parent)
-   parent = tracefs_mount->mnt_root;
-
if (unlikely(IS_DEADDIR(parent->d_inode)))
dentry = ERR_PTR(-ENOENT);
else
-- 
2.42.0




Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)

2023-11-28 Thread David Hildenbrand

On 28.11.23 18:55, David Hildenbrand wrote:

On 19.11.23 17:57, Alexandru Elisei wrote:

To enable tagging on a memory range, userspace can use mprotect() with the
PROT_MTE access flag. Pages already mapped in the VMA don't have the
associated tag storage block reserved, so mark the PTEs as
PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
reserve the tag storage on the fault path.


That sounds a lot like fake PROT_NONE. Would there be a way to unify that
handling and simply reuse pte_protnone()? For example, could we special
case on VMA flags?

Like, don't do NUMA hinting in these special VMAs. Then, have something
like:

if (pte_protnone(vmf->orig_pte))
return handle_pte_protnone(vmf);



Thinking out loud: maybe there isn't even the need to special-case on the 
VMA. Arch code should know if there is something to do. If not, it 
surely was triggered by NUMA hinting. So maybe that could be handled in 
handle_pte_protnone() quite nicely.


--
Cheers,

David / dhildenb




Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)

2023-11-28 Thread David Hildenbrand

On 19.11.23 17:57, Alexandru Elisei wrote:

To enable tagging on a memory range, userspace can use mprotect() with the
PROT_MTE access flag. Pages already mapped in the VMA don't have the
associated tag storage block reserved, so mark the PTEs as
PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
reserve the tag storage on the fault path.


That sounds a lot like fake PROT_NONE. Would there be a way to unify that 
handling and simply reuse pte_protnone()? For example, could we special 
case on VMA flags?


Like, don't do NUMA hinting in these special VMAs. Then, have something 
like:


if (pte_protnone(vmf->orig_pte))
return handle_pte_protnone(vmf);

In there, special case on the VMA flags.
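Roughly, the shape being suggested (a sketch only -- handle_pte_protnone()
does not exist today, and the arch hooks below are made-up names used purely
for illustration):

	/* Sketch: one protnone fault handler that dispatches on the VMA. */
	static vm_fault_t handle_pte_protnone(struct vm_fault *vmf)
	{
		/*
		 * "Fake PROT_NONE" VMAs (e.g. arm64 MTE tag-storage
		 * management) get arch-specific handling; these two hook
		 * names are hypothetical.
		 */
		if (arch_fault_on_access_vma(vmf->vma))
			return arch_handle_fault_on_access(vmf);

		/* Everything else must be a NUMA hinting fault. */
		return do_numa_page(vmf);
	}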

I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some 
code from the prot_none handling path. At least the recovery path and 
writability handling look like they would be better shared in 
handle_pte_protnone() as well.


That might take some magic out of this patch.

--
Cheers,

David / dhildenb




Re: [PATCH RFC v2 04/27] mm: migrate/mempolicy: Add hook to modify migration target gfp

2023-11-28 Thread Alexandru Elisei
Hi,

On Tue, Nov 28, 2023 at 08:49:57AM +0200, Mike Rapoport wrote:
> On Mon, Nov 27, 2023 at 11:52:56AM +, Alexandru Elisei wrote:
> > Hi Mike,
> > 
> > I really appreciate you having a look!
> > 
> > On Sat, Nov 25, 2023 at 12:03:22PM +0200, Mike Rapoport wrote:
> > > On Sun, Nov 19, 2023 at 04:56:58PM +, Alexandru Elisei wrote:
> > > > It might be desirable for an architecture to modify the gfp flags used 
> > > > to
> > > > allocate the destination page for migration based on the page that is
> > > > being replaced. For example, if an architecture has metadata associated
> > > > with a page (like arm64, when the memory tagging extension is 
> > > > implemented),
> > > > it can request that the destination page similarly has storage for tags
> > > > already allocated.
> > > > 
> > > > No functional change.
> > > > 
> > > > Signed-off-by: Alexandru Elisei 
> > > > ---
> > > >  include/linux/migrate.h | 4 
> > > >  mm/mempolicy.c  | 2 ++
> > > >  mm/migrate.c| 3 +++
> > > >  3 files changed, 9 insertions(+)
> > > > 
> > > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > > index 2ce13e8a309b..0acef592043c 100644
> > > > --- a/include/linux/migrate.h
> > > > +++ b/include/linux/migrate.h
> > > > @@ -60,6 +60,10 @@ struct movable_operations {
> > > >  /* Defined in mm/debug.c: */
> > > >  extern const char *migrate_reason_names[MR_TYPES];
> > > >  
> > > > +#ifndef arch_migration_target_gfp
> > > > +#define arch_migration_target_gfp(src, gfp) 0
> > > > +#endif
> > > > +
> > > >  #ifdef CONFIG_MIGRATION
> > > >  
> > > >  void putback_movable_pages(struct list_head *l);
> > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > index 10a590ee1c89..50bc43ab50d6 100644
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -1182,6 +1182,7 @@ static struct folio 
> > > > *alloc_migration_target_by_mpol(struct folio *src,
> > > >  
> > > > h = folio_hstate(src);
> > > > gfp = htlb_alloc_mask(h);
> > > > +   gfp |= arch_migration_target_gfp(src, gfp);
> > > 
> > > I think it'll be more robust to have arch_migration_target_gfp() to modify
> > > the flags and return the new mask with added (or potentially removed)
> > > flags.
> > 
> > I did it this way so an arch won't be able to remove flags set by the MM 
> > code.
> > There's a similar pattern in do_mmap() -> calc_vm_flag_bits() ->
> > arch_calc_vm_flag_bits().
> 
> Ok, just add a sentence about it to the commit message.

Great, will do that!

Thanks,
Alex

>  
> > Thanks,
> > Alex
> > 
> > > 
> > > > nodemask = policy_nodemask(gfp, pol, ilx, &nid);
> > > > return alloc_hugetlb_folio_nodemask(h, nid, nodemask, 
> > > > gfp);
> > > > }
> > > > @@ -1190,6 +1191,7 @@ static struct folio 
> > > > *alloc_migration_target_by_mpol(struct folio *src,
> > > > gfp = GFP_TRANSHUGE;
> > > > else
> > > > gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | 
> > > > __GFP_COMP;
> > > > +   gfp |= arch_migration_target_gfp(src, gfp);
> > > >  
> > > > page = alloc_pages_mpol(gfp, order, pol, ilx, nid);
> > > > return page_rmappable_folio(page);
> > > 
> > > -- 
> > > Sincerely yours,
> > > Mike.
> > > 
> 
> -- 
> Sincerely yours,
> Mike.
> 



[PATCH] tracing: Allow creating instances with specified system events

2023-11-28 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

A trace instance may only need to enable specific events. As the eventfs
directory of an instance currently creates all events which adds overhead,
allow internal instances to be created with just the events in systems
that they care about. This currently only deals with systems and not
individual events, but this should bring down the overhead of creating
instances for specific use cases quite a bit.

The trace_array_get_by_name() now has another parameter "systems". This
parameter is a const string pointer of a comma/space separated list of
event systems that should be created by the trace_array. (Note if the
trace_array already exists, this parameter is ignored).

The list of systems is saved and if a module is loaded, its events will
not be added unless the system for those events also match the systems
string.

Note that all dynamic events are still added as they are created by the
user.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since RFC: 
https://lore.kernel.org/all/20231127174108.3c331...@gandalf.local.home/

- Record "systems" in the trace_array so that only events with the system
  name will be added. This includes modules that create events after the
  trace_array is created.

- Dynamic events had to be specified directly to still allow them to be
  created.

 drivers/scsi/qla2xxx/qla_os.c   |  2 +-
 include/linux/trace.h   |  4 +--
 include/linux/trace_events.h|  4 +++
 kernel/trace/trace.c| 23 ++---
 kernel/trace/trace.h|  1 +
 kernel/trace/trace_boot.c   |  2 +-
 kernel/trace/trace_events.c | 52 +++--
 samples/ftrace/sample-trace-array.c |  2 +-
 8 files changed, 78 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 03348f605c2e..dd674378f2f3 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -2889,7 +2889,7 @@ static void qla2x00_iocb_work_fn(struct work_struct *work)
 static void
 qla_trace_init(void)
 {
-   qla_trc_array = trace_array_get_by_name("qla2xxx");
+   qla_trc_array = trace_array_get_by_name("qla2xxx", NULL);
if (!qla_trc_array) {
ql_log(ql_log_fatal, NULL, 0x0001,
   "Unable to create qla2xxx trace instance, instance 
logging will be disabled.\n");
diff --git a/include/linux/trace.h b/include/linux/trace.h
index 2a70a447184c..fdcd76b7be83 100644
--- a/include/linux/trace.h
+++ b/include/linux/trace.h
@@ -51,7 +51,7 @@ int trace_array_printk(struct trace_array *tr, unsigned long 
ip,
   const char *fmt, ...);
 int trace_array_init_printk(struct trace_array *tr);
 void trace_array_put(struct trace_array *tr);
-struct trace_array *trace_array_get_by_name(const char *name);
+struct trace_array *trace_array_get_by_name(const char *name, const char 
*systems);
 int trace_array_destroy(struct trace_array *tr);
 
 /* For osnoise tracer */
@@ -84,7 +84,7 @@ static inline int trace_array_init_printk(struct trace_array 
*tr)
 static inline void trace_array_put(struct trace_array *tr)
 {
 }
-static inline struct trace_array *trace_array_get_by_name(const char *name)
+static inline struct trace_array *trace_array_get_by_name(const char *name, 
const char *systems)
 {
return NULL;
 }
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..ff818cab07b8 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -365,6 +365,10 @@ enum {
 
 #define TRACE_EVENT_FL_UKPROBE (TRACE_EVENT_FL_KPROBE | TRACE_EVENT_FL_UPROBE)
 
+#define TRACE_EVENT_FL_DYNAMIC_MASK\
+   (TRACE_EVENT_FL_DYNAMIC | TRACE_EVENT_FL_KPROBE |   \
+TRACE_EVENT_FL_UPROBE | TRACE_EVENT_FL_EPROBE | TRACE_EVENT_FL_FPROBE)
+
 struct trace_event_call {
struct list_headlist;
struct trace_event_class *class;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 9aebf904ff97..1c7129d07bfb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9520,7 +9520,8 @@ static int trace_array_create_dir(struct trace_array *tr)
return ret;
 }
 
-static struct trace_array *trace_array_create(const char *name)
+static struct trace_array *
+trace_array_create_systems(const char *name, const char *systems)
 {
struct trace_array *tr;
int ret;
@@ -9540,6 +9541,12 @@ static struct trace_array *trace_array_create(const char 
*name)
	if (!zalloc_cpumask_var(&tr->pipe_cpumask, GFP_KERNEL))
goto out_free_tr;
 
+   if (systems) {
+   tr->system_names = kstrdup_const(systems, GFP_KERNEL);
+   if (!tr->system_names)
+   goto out_free_tr;
+   }
+
tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS;
 
cpumask_copy(tr->tracing_cpumask, cpu_all_mask);
@@ -9586,12 +9593,18 

Re: [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail

2023-11-28 Thread Alexandru Elisei
Hi,

On Tue, Nov 28, 2023 at 05:57:31PM +0100, David Hildenbrand wrote:
> On 27.11.23 13:09, Alexandru Elisei wrote:
> > Hi,
> > 
> > Thank you so much for your comments, they are genuinely useful.
> > 
> > On Fri, Nov 24, 2023 at 08:35:47PM +0100, David Hildenbrand wrote:
> > > On 19.11.23 17:56, Alexandru Elisei wrote:
> > > > Introduce arch_prep_new_page(), which will be used by arm64 to reserve 
> > > > tag
> > > > storage for an allocated page. Reserving tag storage can fail, for 
> > > > example,
> > > > if the tag storage page has a short pin on it, so allow prep_new_page() 
> > > > ->
> > > > arch_prep_new_page() to similarly fail.
> > > 
> > > But what are the side-effects of this? How does the calling code recover?
> > > 
> > > E.g., what if we need to populate a page into user space, but that
> > > particular page we allocated fails to be prepared? So we inject a signal
> > > into that poor process?
> > 
> > When the page fails to be prepared, it is put back to the tail of the
> > freelist with __free_one_page(.., FPI_TO_TAIL). If all the allocation paths
> > are exhausted and no page has been found for which tag storage has been
> > reserved, then that's treated like an OOM situation.
> > 
> > I have been thinking about this, and I think I can simplify the code by
> > making tag reservation a best effort approach. The page can be allocated
> > even if reserving tag storage fails, but the page is marked as invalid in
> > set_pte_at() (PAGE_NONE + an extra bit to tell arm64 that it needs tag
> > storage) and next time it is accessed, arm64 will reserve tag storage in
> > the fault handling code (the mechanism for that is implemented in patch #19
> > of the series, "mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
> > mprotect(PROT_MTE)").
> > 
> > With this new approach, prep_new_page() stays the way it is, and no further
> > changes are required for the page allocator, as there are already arch
> > callbacks that can be used for that, for example tag_clear_highpage() and
> > arch_alloc_page(). The downside is extra page faults, which might impact
> > performance.
> > 
> > What do you think?
> 
> That sounds a lot more robust, compared to intermittent failures to allocate
> pages.

Great, thank you for the feedback, I will use this approach for the next
iteration of the series.

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 



Re: [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page

2023-11-28 Thread David Hildenbrand

On 19.11.23 17:57, Alexandru Elisei wrote:

On arm64, the zero page receives special treatment by having the tagged
flag set on MTE initialization, not when the page is mapped in a process
address space. Reserve the corresponding tag block when tag storage
management is being activated.


Out of curiosity: why does the shared zeropage require tagged storage? 
What about the huge zeropage?


--
Cheers,

David / dhildenb




Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled

2023-11-28 Thread David Hildenbrand

On 27.11.23 16:07, Alexandru Elisei wrote:

Hi,

On Fri, Nov 24, 2023 at 08:54:12PM +0100, David Hildenbrand wrote:

On 19.11.23 17:57, Alexandru Elisei wrote:

To be able to reserve the tag storage associated with a page requires that
the tag storage page can be migrated.

When HW KASAN is enabled, the kernel allocates pages, which are now tagged,
in non-preemptible contexts, which can make reserving the associated tag
storage impossible.


I assume that it's the only in-kernel user that actually requires tagged
memory (besides for user space), correct?


Indeed, this is the case. I'll expand the commit message to be more clear about
it.



Great, thanks!

--
Cheers,

David / dhildenb




Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype

2023-11-28 Thread David Hildenbrand

On 27.11.23 16:01, Alexandru Elisei wrote:

Hi David,

On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote:

On 19.11.23 17:57, Alexandru Elisei wrote:

Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows
the page allocator to manage them like regular pages.

This migratetype lends the pages some very desirable properties:

* They cannot be longterm pinned, meaning they will always be migratable.

* The pages can be allocated explicitly by using their PFN (with
alloc_contig_range()) when they are needed to store tags.

Signed-off-by: Alexandru Elisei 
---
   arch/arm64/Kconfig  |  1 +
   arch/arm64/kernel/mte_tag_storage.c | 68 +
   include/linux/mmzone.h  |  5 +++
   mm/internal.h   |  3 --
   4 files changed, 74 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe8276fdc7a8..047487046e8f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2065,6 +2065,7 @@ config ARM64_MTE
   if ARM64_MTE
   config ARM64_MTE_TAG_STORAGE
bool "Dynamic MTE tag storage management"
+   select CONFIG_CMA
help
  Adds support for dynamic management of the memory used by the hardware
  for storing MTE tags. This memory, unlike normal memory, cannot be
diff --git a/arch/arm64/kernel/mte_tag_storage.c 
b/arch/arm64/kernel/mte_tag_storage.c
index fa6267ef8392..427f4f1909f3 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -5,10 +5,12 @@
* Copyright (C) 2023 ARM Ltd.
*/
+#include 
   #include 
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
   #include 
@@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, 
const char *uname,
return ret;
}
+   /* Pages are managed in pageblock_nr_pages chunks */
+   if (!IS_ALIGNED(tag_range->start | range_len(tag_range), 
pageblock_nr_pages)) {
+   pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock 
size 0x%llx",
+  PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
+  PFN_PHYS(pageblock_nr_pages));
+   return -EINVAL;
+   }
+
ret = tag_storage_get_memory_node(node, _node);
if (ret)
return ret;
@@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void)
pr_info("MTE tag storage region management disabled");
}
   }
+
+static int __init mte_tag_storage_activate_regions(void)
+{
+   phys_addr_t dram_start, dram_end;
+   struct range *tag_range;
+   unsigned long pfn;
+   int i, ret;
+
+   if (num_tag_regions == 0)
+   return 0;
+
+   dram_start = memblock_start_of_DRAM();
+   dram_end = memblock_end_of_DRAM();
+
+   for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+   /*
+* Tag storage region was clipped by arm64_bootmem_init()
+* enforcing addressing limits.
+*/
+   if (PFN_PHYS(tag_range->start) < dram_start ||
+   PFN_PHYS(tag_range->end) >= dram_end) {
+   pr_err("Tag storage region 0x%llx-0x%llx outside addressable 
memory",
+  PFN_PHYS(tag_range->start), 
PFN_PHYS(tag_range->end));
+   ret = -EINVAL;
+   goto out_disabled;
+   }
+   }
+
+   /*
+* MTE disabled, tag storage pages can be used like any other pages. The
+* only restriction is that the pages cannot be used by kexec because
+* the memory remains marked as reserved in the memblock allocator.
+*/
+   if (!system_supports_mte()) {
+		for (i = 0; i < num_tag_regions; i++) {
+			tag_range = &tag_regions[i].tag_range;
+   for (pfn = tag_range->start; pfn <= tag_range->end; 
pfn++)
+   free_reserved_page(pfn_to_page(pfn));
+   }
+   ret = 0;
+   goto out_disabled;
+   }
+
+   for (i = 0; i < num_tag_regions; i++) {
+		tag_range = &tag_regions[i].tag_range;
+   for (pfn = tag_range->start; pfn <= tag_range->end; pfn += 
pageblock_nr_pages)
+   init_cma_reserved_pageblock(pfn_to_page(pfn));
+   totalcma_pages += range_len(tag_range);
+   }


You shouldn't be doing that manually in arm code. Likely you want some cma.c
helper for something like that.


If you are referring to the last loop (the one that does
init_cma_reserved_pageblock()), indeed, there's already a function which
does that, cma_init_reserved_areas() -> cma_activate_area().



But, can you elaborate on why you took this hacky (sorry) approach as
documented in the cover letter:


No worries, it is indeed a bit hacky 

Re: [RFC][PATCH] tracing: Allow creating instances with specified system events

2023-11-28 Thread Steven Rostedt
On Wed, 29 Nov 2023 00:17:38 +0900
Masami Hiramatsu (Google)  wrote:

> Hi Steve,
> 
> On Mon, 27 Nov 2023 17:41:08 -0500
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > A trace instance may only need to enable specific events. As the eventfs
> > directory of an instance currently creates all events which adds overhead,
> > allow internal instances to be created with just the events in systems
> > that they care about. This currently only deals with systems and not
> > individual events, but this should bring down the overhead of creating
> > instances for specific use cases quite a bit.  
> 
> This sounds good, but can the eventfs reduce such overhead because
> if the user doesn't touch the actual event, the event dentry will
> be released soon?

Yes, but this also removes the creation of the meta data behind it. Which
has a descriptor for every event. And since there are over a thousand
events, this is still quite a bit of savings. It also removes the
eventfs_inode that represents each directory.

> > Note, this may also be useful for creating instances in the eventfs, but
> > I'm not sure how to do this there. I could add a delimiter:  
> > 
> >   mkdir /sys/kernel/tracing/instances/foo::sched,timer  
> 
> Can we limit this after making an instance? In that case, we can use
> rmdir to remove unused "systems" directories. Or, maybe we can create
> it afterwards with mkdir or by writing a command to a pseudo file.
> 
> echo sched:* timer:* > instances/foo/available_events

That, or I even thought of allowing rmdir on event systems, and even mkdir.

 # cd instance/foo/events
 # rmdir *
 # mkdir sched timer

-- Steve



Re: [RFC][PATCH] tracing: Allow creating instances with specified system events

2023-11-28 Thread Google
Hi Steve,

On Mon, 27 Nov 2023 17:41:08 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> A trace instance may only need to enable specific events. As the eventfs
> directory of an instance currently creates all events which adds overhead,
> allow internal instances to be created with just the events in systems
> that they care about. This currently only deals with systems and not
> individual events, but this should bring down the overhead of creating
> instances for specific use cases quite a bit.

This sounds good, but can the eventfs reduce such overhead because
if the user doesn't touch the actual event, the event dentry will
be released soon?

> 
> The qla2xxx instance could just enable the systems it cares about, but that
> should be a separate patch.
> 
> Note that kprobes and synthetic events created after the creation of these
> instances, will be added to these instances, but those that are created
> before the creation of the instance will not be included.
> 
> Note, this may also be useful for creating instances in the eventfs, but
> I'm not sure how to do this there. I could add a delimiter:
> 
>   mkdir /sys/kernel/tracing/instances/foo::sched,timer

Can we limit this after making an instance? In that case, we can use
rmdir to remove unused "systems" directories. Or, maybe we can create
it afterwards with mkdir or by writing a command to a pseudo file.

echo sched:* timer:* > instances/foo/available_events

Thank you,

> 
> But if a tool already uses "::" as a delimiter, then this will break it.
> I could just have it work if all the events after the deliminator exist.
> 
>   Thoughts?
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  drivers/scsi/qla2xxx/qla_os.c   |  2 +-
>  include/linux/trace.h   |  4 ++--
>  kernel/trace/trace.c| 22 
>  kernel/trace/trace.h|  3 ++-
>  kernel/trace/trace_boot.c   |  2 +-
>  kernel/trace/trace_events.c | 31 ++---
>  samples/ftrace/sample-trace-array.c |  2 +-
>  7 files changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index 03348f605c2e..dd674378f2f3 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -2889,7 +2889,7 @@ static void qla2x00_iocb_work_fn(struct work_struct 
> *work)
>  static void
>  qla_trace_init(void)
>  {
> - qla_trc_array = trace_array_get_by_name("qla2xxx");
> + qla_trc_array = trace_array_get_by_name("qla2xxx", NULL);
>   if (!qla_trc_array) {
>   ql_log(ql_log_fatal, NULL, 0x0001,
>  "Unable to create qla2xxx trace instance, instance 
> logging will be disabled.\n");
> diff --git a/include/linux/trace.h b/include/linux/trace.h
> index 2a70a447184c..fdcd76b7be83 100644
> --- a/include/linux/trace.h
> +++ b/include/linux/trace.h
> @@ -51,7 +51,7 @@ int trace_array_printk(struct trace_array *tr, unsigned 
> long ip,
>  const char *fmt, ...);
>  int trace_array_init_printk(struct trace_array *tr);
>  void trace_array_put(struct trace_array *tr);
> -struct trace_array *trace_array_get_by_name(const char *name);
> +struct trace_array *trace_array_get_by_name(const char *name, const char 
> *systems);
>  int trace_array_destroy(struct trace_array *tr);
>  
>  /* For osnoise tracer */
> @@ -84,7 +84,7 @@ static inline int trace_array_init_printk(struct 
> trace_array *tr)
>  static inline void trace_array_put(struct trace_array *tr)
>  {
>  }
> -static inline struct trace_array *trace_array_get_by_name(const char *name)
> +static inline struct trace_array *trace_array_get_by_name(const char *name, 
> const char *systems)
>  {
>   return NULL;
>  }
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 9aebf904ff97..3a5637ccabfd 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -9500,7 +9500,7 @@ struct trace_array *trace_array_find_get(const char 
> *instance)
>   return tr;
>  }
>  
> -static int trace_array_create_dir(struct trace_array *tr)
> +static int trace_array_create_dir(struct trace_array *tr, const char 
> *systems)
>  {
>   int ret;
>  
> @@ -9508,7 +9508,7 @@ static int trace_array_create_dir(struct trace_array 
> *tr)
>   if (!tr->dir)
>   return -EINVAL;
>  
> - ret = event_trace_add_tracer(tr->dir, tr);
> + ret = event_trace_add_tracer(tr->dir, tr, systems);
>   if (ret) {
>   tracefs_remove(tr->dir);
>   return ret;
> @@ -9520,7 +9520,8 @@ static int trace_array_create_dir(struct trace_array 
> *tr)
>   return ret;
>  }
>  
> -static struct trace_array *trace_array_create(const char *name)
> +static struct trace_array *
> +trace_array_create_systems(const char *name, const char *systems)
>  {
>   struct trace_array *tr;
>   int ret;
> @@ -9569,7 +9570,7 @@ static struct trace_array *trace_array_create(const 
> char *name)

Re: [PATCH 1/2] tracing: Simplify and fix "buffered event" synchronization

2023-11-28 Thread Petr Pavlu
On 11/27/23 18:41, Steven Rostedt wrote:
> On Mon, 27 Nov 2023 16:12:47 +0100
> Petr Pavlu  wrote:
> 
>> The following warning appears when using buffered events:
>> [  203.556451] WARNING: CPU: 53 PID: 10220 at 
>> kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
> 
> Hmm, I don't have a waring on line 3912, do you have extra code (debugging)
> in your version?

The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
code and RB_WARN_ON():
  3895  void ring_buffer_discard_commit(struct trace_buffer *buffer,
  3896  struct ring_buffer_event *event)
  3897  {
[...]
  3908   * This must only be called if the event has not been
  3909   * committed yet. Thus we can assume that preemption
  3910   * is still disabled.
  3911   */
  3912  RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
  3913  
  3914  rb_decrement_entry(cpu_buffer, event);
  3915  if (rb_try_to_discard(cpu_buffer, event))
  3916  goto out;

https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912

>> [...]
>> [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G   
>>  E  6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
>> [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS 
>> GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>> [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>> [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 
>> 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 
>> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
>> [  203.735734] RSP: 0018:b4ae4f7b7d80 EFLAGS: 00010202
>> [  203.735745] RAX:  RBX: b4ae4f7b7de0 RCX: 
>> 8ac10662c000
>> [  203.735754] RDX: 8ac0c750be00 RSI: 8ac10662c000 RDI: 
>> 8ac0c004d400
>> [  203.781832] RBP: 8ac0c039cea0 R08:  R09: 
>> 
>> [  203.781839] R10:  R11:  R12: 
>> 
>> [  203.781842] R13: 8ac10662c000 R14: 8ac0c004d400 R15: 
>> 8ac10662c008
>> [  203.781846] FS:  7f4cd8a67740() GS:8ad79888() 
>> knlGS:
>> [  203.781851] CS:  0010 DS:  ES:  CR0: 80050033
>> [  203.781855] CR2: 559766a74028 CR3: 0001804c4000 CR4: 
>> 001506f0
>> [  203.781862] Call Trace:
>> [  203.781870]  
>> [  203.851949]  trace_event_buffer_commit+0x1ea/0x250
>> [  203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
>> [  203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
>> [  203.851990]  do_syscall_64+0x3a/0xe0
>> [  203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> [  203.852090] RIP: 0033:0x7f4cd870fa77
>> [  203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 
>> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 
>> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
>> [  203.982932] RSP: 002b:7fff99717dd8 EFLAGS: 0246 ORIG_RAX: 
>> 0089
>> [  203.982942] RAX: ffda RBX: 558ea1d7b6f0 RCX: 
>> 7f4cd870fa77
>> [  203.982948] RDX:  RSI: 7fff99717de0 RDI: 
>> 558ea1d7b6f0
>> [  203.982957] RBP: 7fff99717de0 R08: 7fff997180e0 R09: 
>> 7fff997180e0
>> [  203.982962] R10: 7fff997180e0 R11: 0246 R12: 
>> 7fff99717f40
>> [  204.049239] R13: 7fff99718590 R14: 558e9f2127a8 R15: 
>> 7fff997180b0
>> [  204.049256]  
>>
>> For instance, it can be triggered by running these two commands in
>> parallel:
>> $ while true; do
>> echo hist:key=id.syscall:val=hitcount > \
>>   /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
>>   done
>> $ stress-ng --sysinfo $(nproc)
>>
>> The warning indicates that the current ring_buffer_per_cpu is not in the
>> committing state. It happens because the active ring_buffer_event
>> doesn't actually come from the ring_buffer_per_cpu but is allocated from
>> trace_buffered_event.
>>
>> The bug is in function trace_buffered_event_disable() where the
>> following normally happens:
>> * The code invokes disable_trace_buffered_event() via
>>   smp_call_function_many() and follows it by synchronize_rcu(). This
>>   increments the per-CPU variable trace_buffered_event_cnt on each
>>   target CPU and grants trace_buffered_event_disable() the exclusive
>>   access to the per-CPU variable trace_buffered_event.
>> * Maintenance is performed on trace_buffered_event, all per-CPU event
>>   buffers get freed.
>> * The code invokes enable_trace_buffered_event() via
>>   smp_call_function_many(). This decrements trace_buffered_event_cnt and
>>   releases the access to trace_buffered_event.
>>
>> A problem is that smp_call_function_many() runs a given function on all
>> target CPUs except on 

Re: [RFC][PATCH] tracing: Allow creating instances with specified system events

2023-11-28 Thread Steven Rostedt
On Tue, 28 Nov 2023 09:20:29 -0500
Steven Rostedt  wrote:

> On Tue, 28 Nov 2023 14:14:29 +0100
> Dmytro Maluka  wrote:
> 
> 
> > This limitation will cause (unrelated) events created by modules that
> > are insmoded after creating the instance to be also added to the
> > instance. Why not filter those as well?  
> 
> I did think of that. But that would be a separate patch. Where I would save
> the string that is passed in, and whenever a new module is loaded, it would
> only add the events if the events' system matches in the string. This would
> also allow adding event systems that do not yet exist.

Now that I'm implementing this, it makes more sense to just do that as one
patch. Otherwise the check against the systems string is redundant, as the
creation of the events needs to check against the tr->systems too. I found
that I was deleting most of this patch to implement the change.

-- Steve



Re: [RFC][PATCH] tracing: Allow creating instances with specified system events

2023-11-28 Thread Steven Rostedt
On Tue, 28 Nov 2023 14:14:29 +0100
Dmytro Maluka  wrote:


> This limitation will cause (unrelated) events created by modules that
> are insmoded after creating the instance to be also added to the
> instance. Why not filter those as well?

I did think of that. But that would be a separate patch. Where I would save
the string that is passed in, and whenever a new module is loaded, it would
only add the events if the events' system matches in the string. This would
also allow adding event systems that do not yet exist.

> 
> Besides that, the change looks nice to me.

Thanks, I'll make it into a more formal patch set.

-- Steve




Re: [RFC][PATCH] tracing: Allow creating instances with specified system events

2023-11-28 Thread Dmytro Maluka
On Mon, Nov 27, 2023 at 05:41:08PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> A trace instance may only need to enable specific events. As the eventfs
> directory of an instance currently creates all events which adds overhead,
> allow internal instances to be created with just the events in systems
> that they care about. This currently only deals with systems and not
> individual events, but this should bring down the overhead of creating
> instances for specific use cases quite a bit.
> 
> The qla2xxx instance could just enable the systems it cares about, but that
> should be a separate patch.
> 
> Note that kprobes and synthetic events created after the creation of these
> instances, will be added to these instances, but those that are created
> before the creation of the instance will not be included.

This limitation will cause (unrelated) events created by modules that
are insmoded after creating the instance to be also added to the
instance. Why not filter those as well?

Besides that, the change looks nice to me.

> Note, this may also be useful for creating instances in the eventfs, but
> I'm not sure how to do this there. I could add a delimiter:
> 
>   mkdir /sys/kernel/tracing/instances/foo::sched,timer
> 
> But if a tool already uses "::" as a delimiter, then this will break it.
> I could just have it work if all the events after the deliminator exist.
> 
>   Thoughts?
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  drivers/scsi/qla2xxx/qla_os.c   |  2 +-
>  include/linux/trace.h   |  4 ++--
>  kernel/trace/trace.c| 22 
>  kernel/trace/trace.h|  3 ++-
>  kernel/trace/trace_boot.c   |  2 +-
>  kernel/trace/trace_events.c | 31 ++---
>  samples/ftrace/sample-trace-array.c |  2 +-
>  7 files changed, 49 insertions(+), 17 deletions(-)



Re: [PATCH v3 28/33] fprobe: Rewrite fprobe on function-graph tracer

2023-11-28 Thread Google
On Tue, 28 Nov 2023 11:53:19 +0100
Jiri Olsa  wrote:

> On Mon, Nov 27, 2023 at 10:58:40PM +0900, Masami Hiramatsu (Google) wrote:
> > From: Masami Hiramatsu (Google) 
> > 
> > Rewrite fprobe implementation on function-graph tracer.
> > Major API changes are:
> >  -  'nr_maxactive' field is deprecated.
> >  -  This depends on CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
> > !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS, and
> > CONFIG_HAVE_FUNCTION_GRAPH_FREGS. So currently works only
> > on x86_64.
> >  -  Currently the entry size is limited to 15 * sizeof(long).
> >  -  If there are too many fprobe exit handlers set on the same
> > function, it will fail to probe.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >  Changes in v3:
> >   - Update for new reserve_data/retrieve_data API.
> >   - Fix internal push/pop on fgraph data logic so that it can
> > correctly save/restore the returning fprobes.
> 
> hi,
> looks like this one conflicts with recent:
> 
>   4bbd93455659 kprobes: kretprobe scalability improvement

Thanks for reporting!

I also found some other patches conflicts with recent commits.
Let me rebase it on the recent branch.

Thanks!

> 
> jirka


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] dt-bindings: correct white-spaces in examples

2023-11-28 Thread Ulf Hansson
On Fri, 24 Nov 2023 at 10:21, Krzysztof Kozlowski
 wrote:
>
> Use only one and exactly one space around '=' in DTS example.
>
> Signed-off-by: Krzysztof Kozlowski 

Acked-by: Ulf Hansson  # For MMC

Kind regards
Uffe

>
> ---
>
> Merging idea: Rob's DT.
> Should apply cleanly on Rob's for-next.
> ---
>  .../devicetree/bindings/auxdisplay/hit,hd44780.yaml   | 2 +-
>  .../devicetree/bindings/clock/baikal,bt1-ccu-pll.yaml | 2 +-
>  Documentation/devicetree/bindings/iio/adc/adi,ad7780.yaml | 6 +++---
>  .../devicetree/bindings/iio/adc/qcom,spmi-iadc.yaml   | 2 +-
>  .../devicetree/bindings/iio/adc/qcom,spmi-rradc.yaml  | 2 +-
>  .../interrupt-controller/st,stih407-irq-syscfg.yaml   | 4 ++--
>  Documentation/devicetree/bindings/mmc/arm,pl18x.yaml  | 2 +-
>  Documentation/devicetree/bindings/net/sff,sfp.yaml| 2 +-
>  .../devicetree/bindings/pci/toshiba,visconti-pcie.yaml| 2 +-
>  .../bindings/pinctrl/renesas,rzg2l-pinctrl.yaml   | 6 +++---
>  .../devicetree/bindings/power/supply/richtek,rt9455.yaml  | 8 
>  .../devicetree/bindings/regulator/mps,mp5416.yaml | 4 ++--
>  .../devicetree/bindings/regulator/mps,mpq7920.yaml| 4 ++--
>  .../devicetree/bindings/remoteproc/fsl,imx-rproc.yaml | 8 
>  14 files changed, 27 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/auxdisplay/hit,hd44780.yaml 
> b/Documentation/devicetree/bindings/auxdisplay/hit,hd44780.yaml
> index fde07e4b119d..406a922a714e 100644
> --- a/Documentation/devicetree/bindings/auxdisplay/hit,hd44780.yaml
> +++ b/Documentation/devicetree/bindings/auxdisplay/hit,hd44780.yaml
> @@ -113,7 +113,7 @@ examples:
>  hd44780 {
>  compatible = "hit,hd44780";
>  display-height-chars = <2>;
> -display-width-chars  = <16>;
> +display-width-chars = <16>;
>  data-gpios = < 4 0>,
>   < 5 0>,
>   < 6 0>,
> diff --git a/Documentation/devicetree/bindings/clock/baikal,bt1-ccu-pll.yaml 
> b/Documentation/devicetree/bindings/clock/baikal,bt1-ccu-pll.yaml
> index 624984d51c10..7f8d98226437 100644
> --- a/Documentation/devicetree/bindings/clock/baikal,bt1-ccu-pll.yaml
> +++ b/Documentation/devicetree/bindings/clock/baikal,bt1-ccu-pll.yaml
> @@ -125,7 +125,7 @@ examples:
>  clk25m: clock-oscillator-25m {
>compatible = "fixed-clock";
>#clock-cells = <0>;
> -  clock-frequency  = <2500>;
> +  clock-frequency = <2500>;
>clock-output-names = "clk25m";
>  };
>  ...
> diff --git a/Documentation/devicetree/bindings/iio/adc/adi,ad7780.yaml 
> b/Documentation/devicetree/bindings/iio/adc/adi,ad7780.yaml
> index 5fcc8dd012f1..be2616ff9af6 100644
> --- a/Documentation/devicetree/bindings/iio/adc/adi,ad7780.yaml
> +++ b/Documentation/devicetree/bindings/iio/adc/adi,ad7780.yaml
> @@ -80,9 +80,9 @@ examples:
>  compatible = "adi,ad7780";
>  reg = <0>;
>
> -avdd-supply  = <_supply>;
> -powerdown-gpios  = < 12 GPIO_ACTIVE_HIGH>;
> -adi,gain-gpios   = <  5 GPIO_ACTIVE_LOW>;
> +avdd-supply = <_supply>;
> +powerdown-gpios = < 12 GPIO_ACTIVE_HIGH>;
> +adi,gain-gpios = <  5 GPIO_ACTIVE_LOW>;
>  adi,filter-gpios = < 15 GPIO_ACTIVE_LOW>;
>  };
>  };
> diff --git a/Documentation/devicetree/bindings/iio/adc/qcom,spmi-iadc.yaml 
> b/Documentation/devicetree/bindings/iio/adc/qcom,spmi-iadc.yaml
> index 73def67fbe01..b6a233cd5f6b 100644
> --- a/Documentation/devicetree/bindings/iio/adc/qcom,spmi-iadc.yaml
> +++ b/Documentation/devicetree/bindings/iio/adc/qcom,spmi-iadc.yaml
> @@ -58,7 +58,7 @@ examples:
>  reg = <0x3600>;
>  interrupts = <0x0 0x36 0x0 IRQ_TYPE_EDGE_RISING>;
>  qcom,external-resistor-micro-ohms = <1>;
> -#io-channel-cells  = <1>;
> +#io-channel-cells = <1>;
>  };
>  };
>  ...
> diff --git a/Documentation/devicetree/bindings/iio/adc/qcom,spmi-rradc.yaml 
> b/Documentation/devicetree/bindings/iio/adc/qcom,spmi-rradc.yaml
> index b3a626389870..64abe9a4cd9e 100644
> --- a/Documentation/devicetree/bindings/iio/adc/qcom,spmi-rradc.yaml
> +++ b/Documentation/devicetree/bindings/iio/adc/qcom,spmi-rradc.yaml
> @@ -46,6 +46,6 @@ examples:
>  pmic_rradc: adc@4500 {
>  compatible = "qcom,pmi8998-rradc";
>  reg = <0x4500>;
> -#io-channel-cells  = <1>;
> +#io-channel-cells = <1>;
>  };
>  };
> diff --git 
> a/Documentation/devicetree/bindings/interrupt-controller/st,stih407-irq-syscfg.yaml
>  
> b/Documentation/devicetree/bindings/interrupt-controller/st,stih407-irq-syscfg.yaml
> index 2b153d7c5421..e44e4e5708a7 100644
> --- 
> a/Documentation/devicetree/bindings/interrupt-controller/st,stih407-irq-syscfg.yaml
> +++ 
> 

Re: [PATCH v3 28/33] fprobe: Rewrite fprobe on function-graph tracer

2023-11-28 Thread Jiri Olsa
On Mon, Nov 27, 2023 at 10:58:40PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Rewrite fprobe implementation on function-graph tracer.
> Major API changes are:
>  -  'nr_maxactive' field is deprecated.
>  -  This depends on CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
> !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS, and
> CONFIG_HAVE_FUNCTION_GRAPH_FREGS. So currently works only
> on x86_64.
>  -  Currently the entry size is limited to 15 * sizeof(long).
>  -  If there are too many fprobe exit handlers set on the same
> function, it will fail to probe.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v3:
>   - Update for new reserve_data/retrieve_data API.
>   - Fix internal push/pop on fgraph data logic so that it can
> correctly save/restore the returning fprobes.

hi,
looks like this one conflicts with recent:

  4bbd93455659 kprobes: kretprobe scalability improvement

jirka



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-28 Thread Abel Wu

On 11/27/23 9:56 PM, Tobias Huschle Wrote:

On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:

On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

The below should also work for internal entities, but last time I poked
around with it I had some regressions elsewhere -- you know, fix one,
wreck another type of situations on hand.

But still, could you please give it a go -- it applies cleanly to linus'
master and -rc2.

---
Subject: sched/eevdf: Revenge of the Sith^WSleepers



Tried the patch, it does not help unfortunately.

It might also be possible that the long running vhost is stuck on something
during those phases where it just runs for a while. This might have
been a risk for a while; EEVDF might have just uncovered an unfortunate
sequence of events.
I'll look into this option.

I also added some more trace outputs in order to find the actual vruntimes
of the cgroup parents. The numbers look kind of reasonable, but I struggle
to judge this with certainty.

In one of the scenarios where the kworker sees an absurd wait time, the
following occurs (full trace below):

- The kworker ends its timeslice after 4941 ns
- __pick_eevdf finds the cgroup holding vhost as the next option to execute
- Last known values are:
 vruntime  deadline
cgroup56117619190   57650477291 -> depth 0
kworker   56117624131   56120619190
   This is fair, since the kworker is not runnable here.
- At depth 4, the cgroup shows the observed vruntime value which is smaller
   by a factor of 20, but depth 0 seems to be running with values of the
   correct magnitude.


A child is running means its parent also being the cfs->curr, but
not vice versa if there are more than one child.


- cgroup at depth 0 has zero lag, with higher depth, there are large lag
   values (as observed 606.338267 onwards)


These values of se->vlag means 'run this entity to parity' to avoid
excess context switch, which is what RUN_TO_PARITY does, or nothing
when !RUN_TO_PARITY. In short, se->vlag is not vlag when se->on_rq.



Now the following occurs, triggered by the vhost:
- The kworker gets placed again with:
 vruntime  deadline
cgroup56117619190   57650477291 -> depth 0, last known value
kworker   56117885776   56120885776 -> lag of -725
- vhost continues executing and updates its vruntime accordingly, here
   I would need to enhance the trace to also print the vruntimes of the
   parent sched_entities to see the progress of their vruntime/deadline/lag
   values as well
- It is a bit irritating that the EEVDF algorithm would not pick the kworker
   over the cgroup as its deadline is smaller.
   But, the kworker has negative lag, which might cause EEVDF to not pick
   the kworker.
   The cgroup at depth 0 has no lag, all deeper layers have a significantly
   positive lag (last known values, might have changed in the meantime).
   At this point I would see the option that the vhost task is stuck
   somewhere or EEVDF just does not see the kworker as an eligible option.
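For readers following the EEVDF logic: an entity is only considered by
__pick_eevdf() if it is eligible, i.e. its vruntime is not ahead of the
queue's (weighted) average vruntime -- roughly, lag >= 0 -- and only among the
eligible entities does the earliest virtual deadline win. A rough,
self-contained sketch of that eligibility test (simplified illustration, not
the kernel code):

	/* Hypothetical simplified entity, not the kernel's sched_entity. */
	struct se_sketch {
		unsigned long long vruntime;
		unsigned long long deadline;
	};

	/*
	 * Eligible means the entity has not run ahead of the queue's weighted
	 * average vruntime. An ineligible entity is skipped even if its
	 * deadline is the smallest.
	 */
	static int entity_eligible_sketch(const struct se_sketch *se,
					  unsigned long long avg_vruntime)
	{
		return se->vruntime <= avg_vruntime;
	}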


IMHO such lag should not introduce that long delay. Can you run the
test again with NEXT_BUDDY disabled?



- Once the vhost is migrated off the cpu, the update_entity_lag function
   works with the following values at 606.467022: sched_update
   For the cgroup at depth 0
   - vruntime = 5710415 --> this is in line with the amount of new 
timeslices
vhost got assigned while the kworker was waiting
   - vlag =   -62439022 --> the scheduler knows that the cgroup was
overconsuming, but no runtime for the kworker
   For the cfs_rq we have
   - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
   - avg_vruntime = 161750065796 --> this is rather large in comparison, but I
 might access this value at a bad time


Use avg_vruntime() instead.


   - nr_running   =2 --> at this point, both, cgroup and kworker are
 still on the queue, with the cgroup being
 in the migration process
--> It seems like the overconsumption accumulates at cgroup depth 0 and is not
 propagated downwards. This might be intended though.

- At 606.479979: sched_place, cgroup hosting the vhost is migrated back
   onto cpu 13 with a lag of -166821875 it gets scheduled right away as
   there is no other task (nr_running = 0)

- At 606.479996: sched_place, the kworker gets placed again, this time
   with no lag and get scheduled almost immediately, with a wait
   time of 1255 ns.

It shall be noted, that these scenarios also occur when the first placement
of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
when observing this issue.

# full trace #

sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
--> 

Re: [RFC][PATCH] tracing: Allow creating instances with specified system events

2023-11-28 Thread Daniel Wagner
On Mon, Nov 27, 2023 at 05:50:21PM -0500, Steven Rostedt wrote:
> On Mon, 27 Nov 2023 17:41:08 -0500
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > A trace instance may only need to enable specific events. As the eventfs
> > directory of an instance currently creates all events which adds overhead,
> > allow internal instances to be created with just the events in systems
> > that they care about. This currently only deals with systems and not
> > individual events, but this should bring down the overhead of creating
> > instances for specific use cases quite a bit.
> 
> This change log is not very good. I didn't actually state what was done :-p
> 
> Anyway, function trace_array_get_by_name() has an added parameter "systems"
> (and I forgot to update its kerneldoc). This parameter is a comma- and/or
> space-separated list of event system names.
> If it's not NULL, then it will only create the event system directories of
> those event systems that match the systems parameter.
> 
> That is:
> 
>  trace_array_get_by_name("qla2xxx", "qla");
> 
> Will create an instance called "qla2xxx" and only have "qla" system events
> in it.

Makes sense to me. The qla2xxx driver is interested only in its own
trace events. This should be fine, at least as far as I understand it.
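
For what it's worth, here is a sketch of how a driver could use this, based
on the description above. The second argument to trace_array_get_by_name()
is the parameter this patch adds; the other calls are the existing
trace_array_* helpers as far as I know, and the surrounding function names
are just illustrative:

#include <linux/errno.h>
#include <linux/trace.h>
#include <linux/types.h>

static struct trace_array *qla_tr;

static int qla_trace_init_sketch(void)
{
        /*
         * Create (or look up) the "qla2xxx" instance; with the proposed
         * "systems" parameter only the "qla" event system directories get
         * created instead of the full eventfs tree.
         */
        qla_tr = trace_array_get_by_name("qla2xxx", "qla");
        if (!qla_tr)
                return -ENOMEM;

        /* enable every event of the "qla" system within that instance */
        return trace_array_set_clr_event(qla_tr, "qla", NULL, true);
}

static void qla_trace_exit_sketch(void)
{
        /* drop the reference taken by trace_array_get_by_name() */
        trace_array_put(qla_tr);
}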



Re: [PATCH v2 0/5] MODVERSIONS + RUST Redux

2023-11-28 Thread Greg KH
On Tue, Nov 28, 2023 at 08:05:26AM +, Greg KH wrote:
> On Mon, Nov 27, 2023 at 11:27:07AM -0800, Matthew Maurer wrote:
> > > > >
> > > > > > With regards to future directions that likely won't work for 
> > > > > > loosening it:
> > > > > > Unfortunately, the .rmeta format itself is not stable, so I 
> > > > > > wouldn't want to
> > > > > > teach genksyms to open it up and split out the pieces for specific 
> > > > > > functions.
> > > > > > Extending genksyms to parse Rust would also not solve the situation 
> > > > > > -
> > > > > > layouts are allowed to differ across compiler versions or even (in 
> > > > > > rare
> > > > > > cases) seemingly unrelated code changes.
> > > > >
> > > > > What do you mean by "layout" here?  Yes, the crcs can be different
> > > > > across compiler versions and seemingly unrelated code changes 
> > > > > (genksyms
> > > > > is VERY fragile) but that's ok, that's not what you are checking here.
> > > > > You want to know if the rust function signature changes or not from 
> > > > > the
> > > > > last time you built the code, with the same compiler and options, 
> > > > > that's
> > > > > all you are verifying.
> > What I mean by layout here is that if you write in Rust:
> > struct Foo {
> >   x: i32,
> >   y: i32,
> > }
> > it is not guaranteed to have the same layout across different compilations, 
> > even
> > within the same compiler. See
> > https://doc.rust-lang.org/reference/type-layout.html#the-rust-representation
> 
> Then you are going to have big problems, sorry.
> 
> > Specifically, the compiler is allowed to arbitrarily insert padding,
> > reorder fields, etc.
> > on the same code as long as the overall alignment of the struct and the
> > individual alignment of the fields remain correct and non-overlapping.
> > 
> > This means the compiler is *explicitly* allowed to, for example, permute x 
> > and y
> > as an optimization. In the above example this is unlikely, but if you
> > instead consider
> > struct Bar {
> >   x: i8,
> >   y: i64,
> >   z: i8,
> > }
> > It's easy to see why the compiler might decide to structure this as
> > y,x,z to reduce the
> > size of the struct. Those optimization decisions may be affected by
> > any other part of
> > the code, PGO, etc.
> 
> Then you all need to figure out some way to determine how the compiler
> laid out the structure after it compiled/optimized it and be able to
> compare it to previous builds (or just generate a crc based on the
> layout it chose.)
> 
> > > > > > Future directions that might work for loosening it:
> > > > > > * Generating crcs from debuginfo + compiler + flags
> > > > > > * Adding a feature to the rust compiler to dump this information. 
> > > > > > This
> > > > > > is likely to
> > > > > >   get pushback because Rust's current stance is that there is no 
> > > > > > ability to load
> > > > > >   object code built against a different library.
> > > > >
> > > > > Why not parse the function signature like we do for C?
> > Because the function signature is insufficient to check the ABI, see above.
> > > > >
> > > > > > Would setting up Rust symbols so that they have a crc built out of 
> > > > > > .rmeta be
> > > > > > sufficient for you to consider this useful? If not, can you help me 
> > > > > > understand
> > > > > > what level of precision would be required?
> > > > >
> > > > > What exactly does .rmeta have to do with the function signature?  
> > > > > That's
> > > > > all you care about here.
> > The .rmeta file contains the decisions the compiler made about layout
> > in the crate
> > you're interfacing with. For example, the choice to encode Bar
> > with a yxz field order would be written into the .rmeta file.
> 
> Ok, then yes, can you parse the .rmeta file to get that information?
> 
> > > > rmeta is generated per crate.
> > > >
> > > > CRC is computed per symbol.
> > > >
> > > > They have different granularity.
> > > > It is weird to refuse a module for incompatibility
> > > > of a symbol that it is not using at all.
> > >
> > > I agree, this should be on a per-symbol basis, so the Rust
> > > infrastructure in the kernel needs to be fixed up to support this
> > > properly, not just ignored like this patchset does.
> > I agree there is a divergence here, I tried to point it out so that it
> > wouldn't be
> > a surprise later. The .rmeta file itself (which is the only way we
> > could know that
> > the ABI actually matches, because layout decisions are in there) is an 
> > unstable
> > format, which is why I would be reluctant to try to parse it and find only 
> > the
> > relevant portions to hash. This isn't just a "technically unstable"
> > format, but one
> > in which the compiler essentially just serializes out relevant internal data
> > structures, so any parser for it will involve significant alterations
> > on compiler
> > updates, which doesn't seem like a good plan.
> > >
> > > thanks,
> > >
> > > greg k-h
> > Given the above additional information, would you be interested 

Re: [PATCH v2 0/5] MODVERSIONS + RUST Redux

2023-11-28 Thread Greg KH
On Mon, Nov 27, 2023 at 11:27:07AM -0800, Matthew Maurer wrote:
> > > >
> > > > > With regards to future directions that likely won't work for 
> > > > > loosening it:
> > > > > Unfortunately, the .rmeta format itself is not stable, so I wouldn't 
> > > > > want to
> > > > > teach genksyms to open it up and split out the pieces for specific 
> > > > > functions.
> > > > > Extending genksyms to parse Rust would also not solve the situation -
> > > > > layouts are allowed to differ across compiler versions or even (in 
> > > > > rare
> > > > > cases) seemingly unrelated code changes.
> > > >
> > > > What do you mean by "layout" here?  Yes, the crcs can be different
> > > > across compiler versions and seemingly unrelated code changes (genksyms
> > > > is VERY fragile) but that's ok, that's not what you are checking here.
> > > > You want to know if the rust function signature changes or not from the
> > > > last time you built the code, with the same compiler and options, that's
> > > > all you are verifying.
> What I mean by layout here is that if you write in Rust:
> struct Foo {
>   x: i32,
>   y: i32,
> }
> it is not guaranteed to have the same layout across different compilations, 
> even
> within the same compiler. See
> https://doc.rust-lang.org/reference/type-layout.html#the-rust-representation

Then you are going to have big problems, sorry.

> Specifically, the compiler is allowed to arbitrarily insert padding,
> reorder fields, etc.
> on the same code as long as the overall alignment of the struct and the
> individual alignment of the fields remain correct and non-overlapping.
> 
> This means the compiler is *explicitly* allowed to, for example, permute x 
> and y
> as an optimization. In the above example this is unlikely, but if you
> instead consider
> struct Bar {
>   x: i8,
>   y: i64,
>   z: i8,
> }
> It's easy to see why the compiler might decide to structure this as
> y,x,z to reduce the
> size of the struct. Those optimization decisions may be affected by
> any other part of
> the code, PGO, etc.

Then you all need to figure out some way to determine how the compiler
> laid out the structure after it compiled/optimized it and be able to
compare it to previous builds (or just generate a crc based on the
layout it chose.)
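
Just to spell out the contrast with C that makes a signature-level CRC
workable there (my own illustration, not part of the patchset): the C
standard requires struct members to sit at increasing addresses in
declaration order, so the declaration pins the layout up to ABI-defined
padding, while default-repr Rust is free to reorder fields as described
above.

#include <stddef.h>
#include <stdio.h>

struct bar {            /* C counterpart of the Rust `Bar` above */
        signed char x;
        long long   y;
        signed char z;
};

int main(void)
{
        /* offsets follow declaration order plus the platform ABI's padding,
         * not per-compilation optimization decisions */
        printf("x@%zu y@%zu z@%zu size=%zu\n",
               offsetof(struct bar, x), offsetof(struct bar, y),
               offsetof(struct bar, z), sizeof(struct bar));
        return 0;
}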

> > > > > Future directions that might work for loosening it:
> > > > > * Generating crcs from debuginfo + compiler + flags
> > > > > * Adding a feature to the rust compiler to dump this information. This
> > > > > is likely to
> > > > >   get pushback because Rust's current stance is that there is no 
> > > > > ability to load
> > > > >   object code built against a different library.
> > > >
> > > > Why not parse the function signature like we do for C?
> Because the function signature is insufficient to check the ABI, see above.
> > > >
> > > > > Would setting up Rust symbols so that they have a crc built out of 
> > > > > .rmeta be
> > > > > sufficient for you to consider this useful? If not, can you help me 
> > > > > understand
> > > > > what level of precision would be required?
> > > >
> > > > What exactly does .rmeta have to do with the function signature?  That's
> > > > all you care about here.
> The .rmeta file contains the decisions the compiler made about layout
> in the crate
> you're interfacing with. For example, the choice to encode Bar
> with a yxz field order would be written into the .rmeta file.

Ok, then yes, can you parse the .rmeta file to get that information?

> > > rmeta is generated per crate.
> > >
> > > CRC is computed per symbol.
> > >
> > > They have different granularity.
> > > It is weird to refuse a module for incompatibility
> > > of a symbol that it is not using at all.
> >
> > I agree, this should be on a per-symbol basis, so the Rust
> > infrastructure in the kernel needs to be fixed up to support this
> > properly, not just ignored like this patchset does.
> I agree there is a divergence here, I tried to point it out so that it
> wouldn't be
> a surprise later. The .rmeta file itself (which is the only way we
> could know that
> the ABI actually matches, because layout decisions are in there) is an 
> unstable
> format, which is why I would be reluctant to try to parse it and find only the
> relevant portions to hash. This isn't just a "technically unstable"
> format, but one
> in which the compiler essentially just serializes out relevant internal data
> structures, so any parser for it will involve significant alterations
> on compiler
> updates, which doesn't seem like a good plan.
> >
> > thanks,
> >
> > greg k-h
> Given the above additional information, would you be interested in a patchset
> which either:
> 
> A. Computes the CRC off the Rust type signature, knowing the compiler is
> allowed to change the ABI based on information not contained in the CRC.

No.

> B. Uses the CRC of the .rmeta file, knowing, as was pointed out, that this
> effectively contains the ABI of every symbol in the compilation unit, as well
> as inline