Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Fri, Apr 19, 2024 at 02:42:16PM -0700, Song Liu wrote: > On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport wrote: > > > > On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote: > > > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport wrote: > > > [...] > > > > > > > > > > > > [1] > > > > > > https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org > > > > > > > > > > For the ROX to work, we need different users (module text, kprobe, > > > > > etc.) to have > > > > > the same execmem_range. From [1]: > > > > > > > > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t > > > > > size) > > > > > { > > > > > ... > > > > >p = __execmem_cache_alloc(size); > > > > >if (p) > > > > >return p; > > > > > err = execmem_cache_populate(range, size); > > > > > ... > > > > > } > > > > > > > > > > We are calling __execmem_cache_alloc() without range. For this to > > > > > work, > > > > > we can only call execmem_cache_alloc() with one execmem_range. > > > > > > > > Actually, on x86 this will "just work" because everything shares the > > > > same > > > > address space :) > > > > > > > > The 2M pages in the cache will be in the modules space, so > > > > __execmem_cache_alloc() will always return memory from that address > > > > space. > > > > > > > > For other architectures this indeed needs to be fixed with passing the > > > > range to __execmem_cache_alloc() and limiting search in the cache for > > > > that > > > > range. > > > > > > I think we at least need the "map to" concept (initially proposed by > > > Thomas) > > > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE > > > maps to EXECMEM_MODULE_TEXT, so that all these actually share > > > the same range. > > > > Why? > > IIUC, we need to update __execmem_cache_alloc() to take a range pointer as > input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe > will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing > the "range" object, we will have to compare different range parameters to > check > we can share cached pages between module text and kprobe, which is not > efficient. Did I miss something? We can always share large ROX pages as long as they are within the correct address space. The permissions for them are ROX and the alignment differences are due to KASAN and this is handled during allocation of the large page to refill the cache. __execmem_cache_alloc() only needs to limit the search for the address space of the range. And regardless, they way we deal with sharing of the cache can be sorted out later. > Thanks, > Song -- Sincerely yours, Mike.
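To make the conclusion above concrete, limiting the cache lookup to the range's address space could look roughly like the sketch below. This is illustrative only and not the actual patch: execmem_cache_find_free() is a hypothetical helper, the start/end fields of struct execmem_range are assumed, and execmem_cache_populate() is the function quoted from [1].

/*
 * Illustrative sketch: pass the range down so the cache search is
 * limited to that range's address space.  execmem_cache_find_free()
 * is hypothetical.
 */
static void *__execmem_cache_alloc(struct execmem_range *range, size_t size)
{
	/* Only hand out cached ROX memory that lies inside the range. */
	return execmem_cache_find_free(range->start, range->end, size);
}

static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
{
	void *p;
	int err;

	p = __execmem_cache_alloc(range, size);
	if (p)
		return p;

	/* Refill the cache with a large ROX page for this range and retry. */
	err = execmem_cache_populate(range, size);
	if (err)
		return NULL;

	return __execmem_cache_alloc(range, size);
}

With the lookup bounded this way, module text and kprobes naturally share cached 2M ROX pages on x86 (single address space), while architectures with separate ranges only share within a compatible address space, without needing a "map to" table.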
Re: [PATCH v9 07/36] function_graph: Allow multiple users to attach to function graph
On Mon, 15 Apr 2024 21:50:20 +0900 "Masami Hiramatsu (Google)" wrote: > @@ -27,23 +28,157 @@ > > #define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack) > #define FGRAPH_RET_INDEX DIV_ROUND_UP(FGRAPH_RET_SIZE, sizeof(long)) > + > +/* > + * On entry to a function (via function_graph_enter()), a new > ftrace_ret_stack > + * is allocated on the task's ret_stack with indexes entry, then each > + * fgraph_ops on the fgraph_array[]'s entryfunc is called and if that returns > + * non-zero, the index into the fgraph_array[] for that fgraph_ops is > recorded > + * on the indexes entry as a bit flag. > + * As the associated ftrace_ret_stack saved for those fgraph_ops needs to > + * be found, the index to it is also added to the ret_stack along with the > + * index of the fgraph_array[] to each fgraph_ops that needs their retfunc > + * called. > + * > + * The top of the ret_stack (when not empty) will always have a reference > + * to the last ftrace_ret_stack saved. All references to the > + * ftrace_ret_stack has the format of: > + * > + * bits: 0 - 9 offset in words from the previous ftrace_ret_stack > + * (bitmap type should have FGRAPH_RET_INDEX always) > + * bits: 10 - 11 Type of storage > + * 0 - reserved > + * 1 - bitmap of fgraph_array index > + * > + * For bitmap of fgraph_array index > + * bits: 12 - 27The bitmap of fgraph_ops fgraph_array index I really hate the terminology I came up with here, and would love to get better terminology for describing what is going on. I looked it over but I'm constantly getting confused. And I wrote this code! Perhaps we should use: @frame : The data that represents a single function call. When a function is traced, all the data used for all the callbacks attached to it, is in a single frame. This would replace the FGRAPH_RET_SIZE as FGRAPH_FRAME_SIZE. @offset : This is the word size position on the stack. It would replace INDEX, as I think "index" is being used for more than one thing. Perhaps it should be "offset" when dealing with where it is on the shadow stack, and "pos" when dealing with which callback ops is being referenced. > + * > + * That is, at the end of function_graph_enter, if the first and forth > + * fgraph_ops on the fgraph_array[] (index 0 and 3) needs their retfunc > called > + * on the return of the function being traced, this is what will be on the > + * task's shadow ret_stack: (the stack grows upward) > + * > + * || <- task->curr_ret_stack > + * ++ > + * | bitmap_type(bitmap:(BIT(3)|BIT(0)),| > + * | offset:FGRAPH_RET_INDEX) | <- the offset is from here > + * ++ > + * | struct ftrace_ret_stack| > + * | (stores the saved ret pointer) | <- the offset points here > + * ++ > + * | (X) | (N) | ( N words away from > + * || previous ret_stack) > + * > + * If a backtrace is required, and the real return pointer needs to be > + * fetched, then it looks at the task's curr_ret_stack index, if it > + * is greater than zero (reserved, or right before poped), it would mask > + * the value by FGRAPH_RET_INDEX_MASK to get the offset index of the > + * ftrace_ret_stack structure stored on the shadow stack. > + */ > + > +#define FGRAPH_RET_INDEX_SIZE10 Replace SIZE with BITS. 
> +#define FGRAPH_RET_INDEX_MASK GENMASK(FGRAPH_RET_INDEX_SIZE - 1, 0) #define FGRAPH_FRAME_SIZE_BITS 10 #define FGRAPH_FRAME_SIZE_MASK GENMASK(FGRAPH_FRAME_SIZE_BITS - 1, 0) > + > +#define FGRAPH_TYPE_SIZE 2 > +#define FGRAPH_TYPE_MASK GENMASK(FGRAPH_TYPE_SIZE - 1, 0) #define FGRAPH_TYPE_BITS 2 #define FGRAPH_TYPE_MASK GENMASK(FGRAPH_TYPE_BITS - 1, 0) > +#define FGRAPH_TYPE_SHIFT FGRAPH_RET_INDEX_SIZE > + > +enum { > + FGRAPH_TYPE_RESERVED = 0, > + FGRAPH_TYPE_BITMAP = 1, > +}; > + > +#define FGRAPH_INDEX_SIZE 16 replace "INDEX" with "OPS" as it will be the indexes of ops in the array. #define FGRAPH_OPS_BITS 16 #define FGRAPH_OPS_MASK GENMASK(FGRAPH_OPS_BITS - 1, 0) > +#define FGRAPH_INDEX_MASK GENMASK(FGRAPH_INDEX_SIZE - 1, 0) > +#define FGRAPH_INDEX_SHIFT (FGRAPH_TYPE_SHIFT + FGRAPH_TYPE_SIZE) > + > +/* Currently the max stack index can't be more than register callers */ > +#define FGRAPH_MAX_INDEX (FGRAPH_INDEX_SIZE + FGRAPH_RET_INDEX) FGRAPH_MAX_INDEX isn't even used. Let's delete it. > + > +#define FGRAPH_ARRAY_SIZE FGRAPH_INDEX_SIZE #define FGRAPH_ARRAY_SIZE FGRAPH_INDEX_BITS > + > #define SHADOW_STACK_SIZE (PAGE_SIZE) > #define SHADOW_STACK_INDEX (SHADOW_STACK_SIZE / sizeof(long)) > /* Leave o
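To make the proposed renaming concrete, here is a rough sketch of the shadow-stack word layout and how it could be decoded with the BITS/OFFSET/OPS terminology suggested above. The helper functions are hypothetical illustrations, not part of the patch.

/*
 * Hypothetical sketch of the renamed encoding.  Each shadow-stack word
 * that references a frame is laid out as:
 *   bits  0-9  : offset (in words) back to the ftrace_ret_stack frame
 *   bits 10-11 : type of storage (reserved / bitmap)
 *   bits 12-27 : bitmap of fgraph_array (ops) indexes
 */
#define FGRAPH_FRAME_OFFSET_BITS	10
#define FGRAPH_FRAME_OFFSET_MASK	GENMASK(FGRAPH_FRAME_OFFSET_BITS - 1, 0)

#define FGRAPH_TYPE_BITS	2
#define FGRAPH_TYPE_MASK	GENMASK(FGRAPH_TYPE_BITS - 1, 0)
#define FGRAPH_TYPE_SHIFT	FGRAPH_FRAME_OFFSET_BITS

#define FGRAPH_OPS_BITS		16
#define FGRAPH_OPS_MASK		GENMASK(FGRAPH_OPS_BITS - 1, 0)
#define FGRAPH_OPS_SHIFT	(FGRAPH_TYPE_SHIFT + FGRAPH_TYPE_BITS)

/* Offset (in words) from this entry back to its ftrace_ret_stack frame. */
static inline int fgraph_frame_offset(unsigned long val)
{
	return val & FGRAPH_FRAME_OFFSET_MASK;
}

/* Storage type of this entry (FGRAPH_TYPE_RESERVED or FGRAPH_TYPE_BITMAP). */
static inline int fgraph_entry_type(unsigned long val)
{
	return (val >> FGRAPH_TYPE_SHIFT) & FGRAPH_TYPE_MASK;
}

/* Bitmap of fgraph_array (ops) indexes whose retfunc must be called. */
static inline unsigned long fgraph_entry_ops_bitmap(unsigned long val)
{
	return (val >> FGRAPH_OPS_SHIFT) & FGRAPH_OPS_MASK;
}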
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Fri, 19 Apr 2024 17:44:59 -0500, Huang, Kai wrote: On Fri, 2024-04-19 at 13:55 -0500, Haitao Huang wrote: On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai wrote: > > > On 16/04/2024 3:20 pm, Haitao Huang wrote: > > From: Kristen Carlson Accardi > > In cases EPC pages need be allocated during a page fault and the cgroup > > usage is near its limit, an asynchronous reclamation needs be triggered > > to avoid blocking the page fault handling. > > Create a workqueue, corresponding work item and function definitions > > for EPC cgroup to support the asynchronous reclamation. > > In case the workqueue allocation is failed during init, disable cgroup. > > It's fine and reasonable to disable (SGX EPC) cgroup. The problem is > "exactly what does this mean" isn't quite clear. > First, this is really some corner case most people don't care: during init, kernel can't even allocate a workqueue object. So I don't think we should write extra code to implement some sophisticated solution. Any solution we come up with may just not work as the way user want or solve the real issue due to the fact such allocation failure even happens at init time. I think for such boot time failure we can either choose directly BUG_ON(), or we try to handle it _nicely_, but not half-way. My experience is adding BUG_ON() should be avoided in general, but it might be acceptable during kernel boot. I will leave it to others. [...] > > ..., IIUC you choose a (third) solution that is even one more step back: > > It just makes try_charge() always succeed, but EPC pages are still > managed in the "per-cgroup" list. > > But this solution, AFAICT, doesn't work. The reason is when you fail to > allocate EPC page you will do the global reclaim, but now the global > list is empty. > > Am I missing anything? But when cgroups enabled in config, global reclamation starts from root and reclaim from the whole hierarchy if user may still be able to create. Just that we don't have async/sync per-cgroup reclaim triggered. OK. I missed this as it is in a later patch. > > So my thinking is, we have two options: > > 1) Modify the MISC cgroup core code to allow the kernel to disable one > particular resource. It shouldn't be hard, e.g., we can add a > 'disabled' flag to the 'struct misc_res'. > > Hmm.. wait, after checking, the MISC cgroup won't show any control files > if the "capacity" of the resource is 0: > > " > * Miscellaneous resources capacity for the entire machine. 0 capacity > * means resource is not initialized or not present in the host. > " > > So I really suppose we should go with this route, i.e., by just setting > the EPC capacity to 0? > > Note misc_cg_try_charge() will fail if capacity is 0, but we can make it > return success by explicitly check whether SGX cgroup is disabled by > using a helper, e.g., sgx_cgroup_disabled(). > > And you always return the root SGX cgroup in sgx_get_current_cg() when > sgx_cgroup_disabled() is true. > > And in sgx_reclaim_pages_global(), you do something like: > >static void sgx_reclaim_pages_global(..) >{ >#ifdef CONFIG_CGROUP_MISC >if (sgx_cgroup_disabled()) >sgx_reclaim_pages(&sgx_root_cg.lru); >else >sgx_cgroup_reclaim_pages(misc_cg_root()); >#else >sgx_reclaim_pages(&sgx_global_list); >#endif >} > > I am perhaps missing some other spots too but you got the idea. > > At last, after typing those, I believe we should have a separate patch > to handle disable SGX cgroup at initialization time. 
And you can even > put this patch _somewhere_ after the patch > >"x86/sgx: Implement basic EPC misc cgroup functionality" > > and before this patch. > > It makes sense to have such patch anyway, because with it we can easily > to add a kernel command line 'sgx_cgroup=disabled" if the user wants it > disabled (when someone has such requirement in the future). > I think we can add support for "sgx_cgroup=disabled" in future if indeed needed. But just for init failure, no? It's not about the commandline, which we can add in the future when needed. It's about we need to have a way to handle SGX cgroup being disabled at boot time nicely, because we already have a case where we need to do so. Your approach looks half-way to me, and is not future extendible. If we choose to do it, do it right -- that is, we need a way to disable it completely in both kernel and userspace so that userspace won't be able to see it. That would need more changes in misc cgroup implementation to support sgx-disable. Right now misc does not have separate files for different resource types. So we can only block echo "sgx_epc..." to those interface files, can't really make files not visible. Anyway, I see this is really a configuration failure case. And we handle it without added complexity and do as much as we can gracefully until another fatal error
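For illustration only, the boot-time disable being discussed could be sketched roughly as below, combining the zero-capacity trick with an sgx_cgroup_disabled() helper. The names, the placement, and the simplified try_charge (reclaim-on-failure logic elided) are assumptions, not the actual series.

static struct workqueue_struct *sgx_cg_wq;
static bool sgx_cgroup_enabled __read_mostly = true;

static inline bool sgx_cgroup_disabled(void)
{
	return !sgx_cgroup_enabled;
}

static int __init sgx_cgroup_init(void)
{
	sgx_cg_wq = alloc_workqueue("sgx_cg_wq", WQ_UNBOUND | WQ_FREEZABLE,
				    WQ_UNBOUND_MAX_ACTIVE);
	if (!sgx_cg_wq) {
		/* Zero capacity hides the resource from the misc controller. */
		misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, 0);
		sgx_cgroup_enabled = false;
	}

	return 0;
}

int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
{
	/* Behave as if the cgroup does not exist when disabled. */
	if (sgx_cgroup_disabled())
		return 0;

	return __sgx_cgroup_try_charge(sgx_cg);
}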
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Fri, 2024-04-19 at 13:55 -0500, Haitao Huang wrote: > On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai wrote: > > > > > > > On 16/04/2024 3:20 pm, Haitao Huang wrote: > > > From: Kristen Carlson Accardi > > > In cases EPC pages need be allocated during a page fault and the cgroup > > > usage is near its limit, an asynchronous reclamation needs be triggered > > > to avoid blocking the page fault handling. > > > Create a workqueue, corresponding work item and function definitions > > > for EPC cgroup to support the asynchronous reclamation. > > > In case the workqueue allocation is failed during init, disable cgroup. > > > > It's fine and reasonable to disable (SGX EPC) cgroup. The problem is > > "exactly what does this mean" isn't quite clear. > > > First, this is really some corner case most people don't care: during > init, kernel can't even allocate a workqueue object. So I don't think we > should write extra code to implement some sophisticated solution. Any > solution we come up with may just not work as the way user want or solve > the real issue due to the fact such allocation failure even happens at > init time. I think for such boot time failure we can either choose directly BUG_ON(), or we try to handle it _nicely_, but not half-way. My experience is adding BUG_ON() should be avoided in general, but it might be acceptable during kernel boot. I will leave it to others. [...] > > > > ..., IIUC you choose a (third) solution that is even one more step back: > > > > It just makes try_charge() always succeed, but EPC pages are still > > managed in the "per-cgroup" list. > > > > But this solution, AFAICT, doesn't work. The reason is when you fail to > > allocate EPC page you will do the global reclaim, but now the global > > list is empty. > > > > Am I missing anything? > > But when cgroups enabled in config, global reclamation starts from root > and reclaim from the whole hierarchy if user may still be able to create. > Just that we don't have async/sync per-cgroup reclaim triggered. OK. I missed this as it is in a later patch. > > > > > So my thinking is, we have two options: > > > > 1) Modify the MISC cgroup core code to allow the kernel to disable one > > particular resource. It shouldn't be hard, e.g., we can add a > > 'disabled' flag to the 'struct misc_res'. > > > > Hmm.. wait, after checking, the MISC cgroup won't show any control files > > if the "capacity" of the resource is 0: > > > > " > > * Miscellaneous resources capacity for the entire machine. 0 capacity > > * means resource is not initialized or not present in the host. > > " > > > > So I really suppose we should go with this route, i.e., by just setting > > the EPC capacity to 0? > > > > Note misc_cg_try_charge() will fail if capacity is 0, but we can make it > > return success by explicitly check whether SGX cgroup is disabled by > > using a helper, e.g., sgx_cgroup_disabled(). > > > > And you always return the root SGX cgroup in sgx_get_current_cg() when > > sgx_cgroup_disabled() is true. > > > > And in sgx_reclaim_pages_global(), you do something like: > > > > static void sgx_reclaim_pages_global(..) > > { > > #ifdef CONFIG_CGROUP_MISC > > if (sgx_cgroup_disabled()) > > sgx_reclaim_pages(&sgx_root_cg.lru); > > else > > sgx_cgroup_reclaim_pages(misc_cg_root()); > > #else > > sgx_reclaim_pages(&sgx_global_list); > > #endif > > } > > > > I am perhaps missing some other spots too but you got the idea. 
> > > > At last, after typing those, I believe we should have a separate patch > > to handle disable SGX cgroup at initialization time. And you can even > > put this patch _somewhere_ after the patch > > > > "x86/sgx: Implement basic EPC misc cgroup functionality" > > > > and before this patch. > > > > It makes sense to have such patch anyway, because with it we can easily > > to add a kernel command line 'sgx_cgroup=disabled" if the user wants it > > disabled (when someone has such requirement in the future). > > > > I think we can add support for "sgx_cgroup=disabled" in future if indeed > needed. But just for init failure, no? > It's not about the commandline, which we can add in the future when needed. It's about we need to have a way to handle SGX cgroup being disabled at boot time nicely, because we already have a case where we need to do so. Your approach looks half-way to me, and is not future extendible. If we choose to do it, do it right -- that is, we need a way to disable it completely in both kernel and userspace so that userspace won't be able to see it.
Re: [PATCH v9 5/9] clk: mmp: Add Marvell PXA1908 clock driver
Quoting Duje Mihanović (2024-04-19 07:31:14) > On Friday, April 12, 2024 4:57:09 AM GMT+2 Stephen Boyd wrote: > > Quoting Duje Mihanović (2024-04-11 03:15:34) > > > > > On 4/11/2024 10:00 AM, Stephen Boyd wrote: > > > > Is there a reason this file can't be a platform driver? > > > > > > Not that I know of, I did it like this only because the other in-tree > > > MMP clk drivers do so. I guess the initialization should look like any > > > of the qcom GCC drivers then? > > > > Yes. > > With the entire clock driver code in one file this is quite messy as I also > needed to add module_init and module_exit functions to (un)register each > platform driver, presumably because the module_platform_driver macro doesn't > work with multiple platform drivers in one module. If I split up the driver > code for each clock controller block into its own file (such as clk-of- > pxa1908-apbc.c) as I believe is the best option, should the commits be split > up accordingly as well? Sure. Why is 'of' in the name? Maybe that is unnecessary? > > > > While at it, do you think the other MMP clk drivers could use a > conversion? > > > > I'm a little wary if the conversion cannot be tested though. > > I'd rather leave it to someone with the hardware then, especially since the > only reason I found out about the above is that the board I'm working on > failed to boot completely without the module_init function. > Ok, sounds fine.
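For reference, a single module can register several platform drivers without module_platform_driver() by using the platform_register_drivers()/platform_unregister_drivers() helpers, roughly as sketched below. The driver object names are made up for illustration, and the individual struct platform_driver definitions are assumed to exist elsewhere.

#include <linux/module.h>
#include <linux/platform_device.h>

static struct platform_driver * const pxa1908_clk_drivers[] = {
	&pxa1908_apbc_driver,
	&pxa1908_apbcp_driver,
	&pxa1908_mpmu_driver,
	&pxa1908_apmu_driver,
};

static int __init pxa1908_clk_init(void)
{
	/* Registers all drivers, unwinding already-registered ones on failure. */
	return platform_register_drivers(pxa1908_clk_drivers,
					 ARRAY_SIZE(pxa1908_clk_drivers));
}
module_init(pxa1908_clk_init);

static void __exit pxa1908_clk_exit(void)
{
	platform_unregister_drivers(pxa1908_clk_drivers,
				    ARRAY_SIZE(pxa1908_clk_drivers));
}
module_exit(pxa1908_clk_exit);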
Re: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting
On Fri, Apr 19, 2024 at 01:57:03PM -0700, James Houghton wrote: > On Fri, Apr 12, 2024 at 11:41 AM David Matlack wrote: > > > > On 2024-04-01 11:29 PM, James Houghton wrote: > > > This patchset adds a fast path in KVM to test and clear access bits on > > > sptes without taking the mmu_lock. It also adds support for using a > > > bitmap to (1) test the access bits for many sptes in a single call to > > > mmu_notifier_test_young, and to (2) clear the access bits for many ptes > > > in a single call to mmu_notifier_clear_young. > > > > How much improvement would we get if we _just_ made test/clear_young > > lockless on x86 and hold the read-lock on arm64? And then how much > > benefit does the bitmap look-around add on top of that? Thanks David for providing the suggestion. > I don't have these results right now. For the next version I will (1) > separate the series into the locking change and the bitmap change, and > I will (2) have performance data for each change separately. It is > conceivable that the bitmap change should just be considered as a > completely separate patchset. That'd be great. Having the performance numbers will make it even more compelling, but I'd be tempted to go for the lock improvement just because it doesn't add any new complexity and leverages existing patterns in the architectures that people seem to want improvements for. The bitmap interface, OTOH, is rather complex. At least the current implementation breaks some of the isolation we have between the MMU code and the page table walker library on arm64, which I'm not ecstatic about. It _could_ be justified by a massive performance uplift over locking, but it'd have to be a sizable win. -- Thanks, Oliver
Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
> Documentation of task_get_css() says it always > returns a valid css. This function is used by get_current_misc_cg() to get > the css reference. > > > /** > * task_get_css - find and get the css for (task, subsys) > * @task: the target task > * @subsys_id: the target subsystem ID > * > * Find the css for the (@task, @subsys_id) combination, increment a > * reference on and return it. This function is guaranteed to return a > * valid css. The returned css may already have been offlined. > */ > static inline struct cgroup_subsys_state * > task_get_css(struct task_struct *task, int subsys_id) Ah, I missed this comment. This confirms my code reading too. > > > If you look at the code of this function, you will see it does not check > NULL either for task_css(). > > So I think we are pretty sure here it's confirmed by this documentation > and testing. Yeah agreed. Thanks.
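For context, get_current_misc_cg() is essentially a thin wrapper over task_get_css(); the following is paraphrased from include/linux/misc_cgroup.h and may differ slightly between kernel versions, so check the tree you are working on.

static inline struct misc_cg *get_current_misc_cg(void)
{
	/*
	 * task_get_css() is documented to always return a valid css, so
	 * no NULL check is needed here.
	 */
	return css_misc(task_get_css(current, misc_cgrp_id));
}

static inline void put_misc_cg(struct misc_cg *cg)
{
	if (cg)
		css_put(&cg->css);
}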
Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging
On Fri, Apr 19, 2024 at 2:07 PM David Matlack wrote: > > On 2024-04-19 01:47 PM, James Houghton wrote: > > On Thu, Apr 11, 2024 at 10:28 AM David Matlack wrote: > > > On 2024-04-11 10:08 AM, David Matlack wrote: > > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > > > { > > > bool young = false; > > > > > > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm)) > > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > > > > > > if (tdp_mmu_enabled) > > > young |= kvm_tdp_mmu_age_gfn_range(kvm, range); > > > > > > return young; > > > } > > > > > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > > > { > > > bool young = false; > > > > > > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm)) > > > young = kvm_handle_gfn_range(kvm, range, > > > kvm_test_age_rmap); > > > > > > if (tdp_mmu_enabled) > > > young |= kvm_tdp_mmu_test_age_gfn(kvm, range); > > > > > > return young; > > > > > > Yeah I think this is the right thing to do. Given your other > > suggestions (on patch 3), I think this will look something like this > > -- let me know if I've misunderstood something: > > > > bool check_rmap = !bitmap && kvm_memslot_have_rmaps(kvm); > > > > if (check_rmap) > > KVM_MMU_LOCK(kvm); > > > > rcu_read_lock(); // perhaps only do this when we don't take the MMU lock? > > > > if (check_rmap) > > kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap) > > > > if (tdp_mmu_enabled) > > kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe > > > > rcu_read_unlock(); > > if (check_rmap) > > KVM_MMU_UNLOCK(kvm); > > I was thinking a little different. If you follow my suggestion to first > make the TDP MMU aging lockless, you'll end up with something like this > prior to adding bitmap support (note: the comments are just for > demonstrative purposes): > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > bool young = false; > > /* Shadow MMU aging holds write-lock. */ > if (kvm_memslots_have_rmaps(kvm)) { > write_lock(&kvm->mmu_lock); > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > write_unlock(&kvm->mmu_lock); > } > > /* TDM MMU aging is lockless. */ > if (tdp_mmu_enabled) > young |= kvm_tdp_mmu_age_gfn_range(kvm, range); > > return young; > } > > Then when you add bitmap support it would look something like this: > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > unsigned long *bitmap = range->arg.metadata->bitmap; > bool young = false; > > /* SHadow MMU aging holds write-lock and does not support bitmap. */ > if (kvm_memslots_have_rmaps(kvm) && !bitmap) { > write_lock(&kvm->mmu_lock); > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > write_unlock(&kvm->mmu_lock); > } > > /* TDM MMU aging is lockless and supports bitmap. */ > if (tdp_mmu_enabled) > young |= kvm_tdp_mmu_age_gfn_range(kvm, range); > > return young; > } > > rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range(). Oh yes this is a lot better. I hope I would have seen this when it came time to actually update this patch. Thanks. > > That brings up a question I've been wondering about. If KVM only > advertises support for the bitmap lookaround when shadow roots are not > allocated, does that mean MGLRU will be blind to accesses made by L2 > when nested virtualization is enabled? And does that mean the Linux MM > will think all L2 memory is cold (i.e. good candidate for swapping) > because it isn't seeing accesses made by L2? Yes, I think so (for both questions). 
That's better than KVM not participating in MGLRU aging at all, which is the case today (IIUC -- also ignoring the case where KVM accesses guest memory directly). We could have MGLRU always invoke the mmu notifiers, but frequently taking the MMU lock for writing might be worse than evicting when we shouldn't. Maybe Yu tried this at some point, but I can't find any results for this.
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Fri, Apr 19, 2024 at 1:00 PM Mike Rapoport wrote: > > On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote: > > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport wrote: > > [...] > > > > > > > > > > [1] > > > > > https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org > > > > > > > > For the ROX to work, we need different users (module text, kprobe, > > > > etc.) to have > > > > the same execmem_range. From [1]: > > > > > > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t > > > > size) > > > > { > > > > ... > > > >p = __execmem_cache_alloc(size); > > > >if (p) > > > >return p; > > > > err = execmem_cache_populate(range, size); > > > > ... > > > > } > > > > > > > > We are calling __execmem_cache_alloc() without range. For this to work, > > > > we can only call execmem_cache_alloc() with one execmem_range. > > > > > > Actually, on x86 this will "just work" because everything shares the same > > > address space :) > > > > > > The 2M pages in the cache will be in the modules space, so > > > __execmem_cache_alloc() will always return memory from that address space. > > > > > > For other architectures this indeed needs to be fixed with passing the > > > range to __execmem_cache_alloc() and limiting search in the cache for that > > > range. > > > > I think we at least need the "map to" concept (initially proposed by Thomas) > > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE > > maps to EXECMEM_MODULE_TEXT, so that all these actually share > > the same range. > > Why? IIUC, we need to update __execmem_cache_alloc() to take a range pointer as input. module text will use "range" for EXECMEM_MODULE_TEXT, while kprobe will use "range" for EXECMEM_KPROBE. Without "map to" concept or sharing the "range" object, we will have to compare different range parameters to check we can share cached pages between module text and kprobe, which is not efficient. Did I miss something? Thanks, Song
Re: [PATCH 1/2] tracing/user_events: Fix non-spaced field matching
On Fri, Apr 19, 2024 at 11:33:05AM +0900, Masami Hiramatsu wrote: > On Tue, 16 Apr 2024 22:41:01 + > Beau Belgrave wrote: > > > When the ABI was updated to prevent same name w/different args, it > > missed an important corner case when fields don't end with a space. > > Typically, space is used for fields to help separate them, like > > "u8 field1; u8 field2". If no spaces are used, like > > "u8 field1;u8 field2", then the parsing works for the first time. > > However, the match check fails on a subsequent register, leading to > > confusion. > > > > This is because the match check uses argv_split() and assumes that all > > fields will be split upon the space. When spaces are used, we get back > > { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }. > > This causes a mismatch, and the user program gets back -EADDRINUSE. > > > > Add a method to detect this case before calling argv_split(). If found > > force a space after the field separator character ';'. This ensures all > > cases work properly for matching. > > > > With this fix, the following are all treated as matching: > > u8 field1;u8 field2 > > u8 field1; u8 field2 > > u8 field1;\tu8 field2 > > u8 field1;\nu8 field2 > > Sounds good to me. I just have some nits. > > > > > Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different > > args event") > > Signed-off-by: Beau Belgrave > > --- > > kernel/trace/trace_events_user.c | 88 +++- > > 1 file changed, 87 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/trace/trace_events_user.c > > b/kernel/trace/trace_events_user.c > > index 70d428c394b6..9184d3962b2a 100644 > > --- a/kernel/trace/trace_events_user.c > > +++ b/kernel/trace/trace_events_user.c > > @@ -1989,6 +1989,92 @@ static int user_event_set_tp_name(struct user_event > > *user) > > return 0; > > } > > > > +/* > > + * Counts how many ';' without a trailing space are in the args. > > + */ > > +static int count_semis_no_space(char *args) > > +{ > > + int count = 0; > > + > > + while ((args = strchr(args, ';'))) { > > + args++; > > + > > + if (!isspace(*args)) > > + count++; > > + } > > + > > + return count; > > +} > > + > > +/* > > + * Copies the arguments while ensuring all ';' have a trailing space. > > + */ > > +static char *fix_semis_no_space(char *args, int count) > > nit: This name does not represent what it does. 'insert_space_after_semis()' > is more self-described. > Sure, will fix in a v2. > > +{ > > + char *fixed, *pos; > > + char c, last; > > + int len; > > + > > + len = strlen(args) + count; > > + fixed = kmalloc(len + 1, GFP_KERNEL); > > + > > + if (!fixed) > > + return NULL; > > + > > + pos = fixed; > > + last = '\0'; > > + > > + while (len > 0) { > > + c = *args++; > > + > > + if (last == ';' && !isspace(c)) { > > + *pos++ = ' '; > > + len--; > > + } > > + > > + if (len > 0) { > > + *pos++ = c; > > + len--; > > + } > > + > > + last = c; > > + } > > nit: This loop can be simpler, because we are sure fixed has enough length; > > /* insert a space after ';' if there is no space. */ > while(*args) { > *pos = *args++; > if (*pos++ == ';' && !isspace(*args)) > *pos++ = ' '; > } > I was worried that if count_semis_no_space() ever had different logic (maybe after this commit) that it could cause an overflow if the count was wrong, etc. I don't have an issue making it shorter, but I was trying to be more on the safe side, since this isn't a fast path (event register). > > + > > + /* > > +* len is the length of the copy excluding the null. 
> > +* This ensures we always have room for a null. > > +*/ > > + *pos = '\0'; > > + > > + return fixed; > > +} > > + > > +static char **user_event_argv_split(char *args, int *argc) > > +{ > > + /* Count how many ';' without a trailing space */ > > + int count = count_semis_no_space(args); > > + > > + if (count) { > > nit: it is better to exit fast, so > > if (!count) > return argv_split(GFP_KERNEL, args, argc); > > ... Sure, will fix in a v2. > > Thank you, > > OT: BTW, can this also simplify synthetic events? > I'm not sure, I'll check when I have some time. I want to get this fix in sooner rather than later. Thanks, -Beau > > + /* We must fixup 'field;field' to 'field; field' */ > > + char *fixed = fix_semis_no_space(args, count); > > + char **split; > > + > > + if (!fixed) > > + return NULL; > > + > > + /* We do a normal split afterwards */ > > + split = argv_split(GFP_KERNEL, fixed, argc); > > + > > + /* We can free since argv_split makes a copy */ > > + kfree(fixed); > > + > > + return split; > > + }
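Putting both nits together, the v2 helpers might end up looking roughly like the sketch below (illustrative only; the actual v2 patch may differ):

/* Counts how many ';' without a trailing space are in the args. */
static int count_semis_no_space(char *args)
{
	int count = 0;

	while ((args = strchr(args, ';'))) {
		args++;

		if (!isspace(*args))
			count++;
	}

	return count;
}

/* Copies the arguments, inserting a space after any ';' that lacks one. */
static char *insert_space_after_semis(char *args, int count)
{
	char *fixed, *pos;

	/* Original length plus one extra byte per fixed-up ';' plus the null. */
	fixed = kmalloc(strlen(args) + count + 1, GFP_KERNEL);
	if (!fixed)
		return NULL;

	pos = fixed;

	while (*args) {
		*pos = *args++;
		if (*pos++ == ';' && !isspace(*args))
			*pos++ = ' ';
	}
	*pos = '\0';

	return fixed;
}

static char **user_event_argv_split(char *args, int *argc)
{
	char **split;
	char *fixed;
	int count;

	/* Exit fast when every ';' already has a trailing space. */
	count = count_semis_no_space(args);
	if (!count)
		return argv_split(GFP_KERNEL, args, argc);

	/* Fix up 'field;field' to 'field; field' before splitting. */
	fixed = insert_space_after_semis(args, count);
	if (!fixed)
		return NULL;

	split = argv_split(GFP_KERNEL, fixed, argc);

	/* argv_split() makes its own copy, so the fixed buffer can be freed. */
	kfree(fixed);

	return split;
}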
Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging
On 2024-04-19 01:47 PM, James Houghton wrote: > On Thu, Apr 11, 2024 at 10:28 AM David Matlack wrote: > > On 2024-04-11 10:08 AM, David Matlack wrote: > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > > { > > bool young = false; > > > > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm)) > > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > > > > if (tdp_mmu_enabled) > > young |= kvm_tdp_mmu_age_gfn_range(kvm, range); > > > > return young; > > } > > > > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > > { > > bool young = false; > > > > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm)) > > young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap); > > > > if (tdp_mmu_enabled) > > young |= kvm_tdp_mmu_test_age_gfn(kvm, range); > > > > return young; > > > Yeah I think this is the right thing to do. Given your other > suggestions (on patch 3), I think this will look something like this > -- let me know if I've misunderstood something: > > bool check_rmap = !bitmap && kvm_memslot_have_rmaps(kvm); > > if (check_rmap) > KVM_MMU_LOCK(kvm); > > rcu_read_lock(); // perhaps only do this when we don't take the MMU lock? > > if (check_rmap) > kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap) > > if (tdp_mmu_enabled) > kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe > > rcu_read_unlock(); > if (check_rmap) > KVM_MMU_UNLOCK(kvm); I was thinking a little different. If you follow my suggestion to first make the TDP MMU aging lockless, you'll end up with something like this prior to adding bitmap support (note: the comments are just for demonstrative purposes): bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { bool young = false; /* Shadow MMU aging holds write-lock. */ if (kvm_memslots_have_rmaps(kvm)) { write_lock(&kvm->mmu_lock); young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); write_unlock(&kvm->mmu_lock); } /* TDM MMU aging is lockless. */ if (tdp_mmu_enabled) young |= kvm_tdp_mmu_age_gfn_range(kvm, range); return young; } Then when you add bitmap support it would look something like this: bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { unsigned long *bitmap = range->arg.metadata->bitmap; bool young = false; /* SHadow MMU aging holds write-lock and does not support bitmap. */ if (kvm_memslots_have_rmaps(kvm) && !bitmap) { write_lock(&kvm->mmu_lock); young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); write_unlock(&kvm->mmu_lock); } /* TDM MMU aging is lockless and supports bitmap. */ if (tdp_mmu_enabled) young |= kvm_tdp_mmu_age_gfn_range(kvm, range); return young; } rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range(). That brings up a question I've been wondering about. If KVM only advertises support for the bitmap lookaround when shadow roots are not allocated, does that mean MGLRU will be blind to accesses made by L2 when nested virtualization is enabled? And does that mean the Linux MM will think all L2 memory is cold (i.e. good candidate for swapping) because it isn't seeing accesses made by L2?
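For completeness, the kvm_test_age_gfn() side would presumably follow the same shape; this is a sketch only, reusing the names from the snippets above, not actual code from the series:

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	unsigned long *bitmap = range->arg.metadata->bitmap;
	bool young = false;

	/* Shadow MMU aging holds the write-lock and does not support bitmap. */
	if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}

	/* TDP MMU aging is lockless; rcu_read_lock() is taken inside. */
	if (tdp_mmu_enabled)
		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);

	return young;
}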
Re: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting
On Fri, Apr 12, 2024 at 11:41 AM David Matlack wrote: > > On 2024-04-01 11:29 PM, James Houghton wrote: > > This patchset adds a fast path in KVM to test and clear access bits on > > sptes without taking the mmu_lock. It also adds support for using a > > bitmap to (1) test the access bits for many sptes in a single call to > > mmu_notifier_test_young, and to (2) clear the access bits for many ptes > > in a single call to mmu_notifier_clear_young. > > How much improvement would we get if we _just_ made test/clear_young > lockless on x86 and hold the read-lock on arm64? And then how much > benefit does the bitmap look-around add on top of that? I don't have these results right now. For the next version I will (1) separate the series into the locking change and the bitmap change, and I will (2) have performance data for each change separately. It is conceivable that the bitmap change should just be considered as a completely separate patchset.
Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging
On Fri, Apr 12, 2024 at 1:44 PM David Matlack wrote: > > On 2024-04-01 11:29 PM, James Houghton wrote: > > Only handle the TDP MMU case for now. In other cases, if a bitmap was > > not provided, fallback to the slowpath that takes mmu_lock, or, if a > > bitmap was provided, inform the caller that the bitmap is unreliable. > > I think this patch will trigger a lockdep assert in > > kvm_tdp_mmu_age_gfn_range > kvm_tdp_mmu_handle_gfn > for_each_tdp_mmu_root > __for_each_tdp_mmu_root > kvm_lockdep_assert_mmu_lock_held > > ... because it walks tdp_mmu_roots without holding mmu_lock. Indeed, thanks. I'll make sure to build with CONFIG_LOCKDEP for the future versions and check for errors. > > Yu's patch[1] added a lockless walk to the TDP MMU. We'd need something > similar here and also update the comment above tdp_mmu_roots describing > how tdp_mmu_roots can be read locklessly. I'll add the macro / function to do the lockless walk of tdp_mmu_roots and explain why it is safe. Thanks for pointing out this big mistake. > [1] https://lore.kernel.org/kvmarm/zitx64bbx5vdj...@google.com/
Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging
On Thu, Apr 11, 2024 at 10:28 AM David Matlack wrote: > > On 2024-04-11 10:08 AM, David Matlack wrote: > > On 2024-04-01 11:29 PM, James Houghton wrote: > > > Only handle the TDP MMU case for now. In other cases, if a bitmap was > > > not provided, fallback to the slowpath that takes mmu_lock, or, if a > > > bitmap was provided, inform the caller that the bitmap is unreliable. > > > > > > Suggested-by: Yu Zhao > > > Signed-off-by: James Houghton > > > --- > > > arch/x86/include/asm/kvm_host.h | 14 ++ > > > arch/x86/kvm/mmu/mmu.c | 16 ++-- > > > arch/x86/kvm/mmu/tdp_mmu.c | 10 +- > > > 3 files changed, 37 insertions(+), 3 deletions(-) > > > > > > diff --git a/arch/x86/include/asm/kvm_host.h > > > b/arch/x86/include/asm/kvm_host.h > > > index 3b58e2306621..c30918d0887e 100644 > > > --- a/arch/x86/include/asm/kvm_host.h > > > +++ b/arch/x86/include/asm/kvm_host.h > > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot > > > *slot, unsigned long npages); > > > */ > > > #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1) > > > > > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age > > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn) > > > +{ > > > + /* > > > +* Indicate that we support bitmap-based aging when using the TDP MMU > > > +* and the accessed bit is available in the TDP page tables. > > > +* > > > +* We have no other preparatory work to do here, so we do not need to > > > +* redefine kvm_arch_finish_bitmap_age(). > > > +*/ > > > + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled > > > +&& shadow_accessed_mask; > > > +} > > > + > > > #endif /* _ASM_X86_KVM_HOST_H */ > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > > index 992e651540e8..fae1a75750bb 100644 > > > --- a/arch/x86/kvm/mmu/mmu.c > > > +++ b/arch/x86/kvm/mmu/mmu.c > > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct > > > kvm_gfn_range *range) > > > { > > > bool young = false; > > > > > > - if (kvm_memslots_have_rmaps(kvm)) > > > + if (kvm_memslots_have_rmaps(kvm)) { > > > + if (range->lockless) { > > > + kvm_age_set_unreliable(range); > > > + return false; > > > + } > > > > If a VM has TDP MMU enabled, supports A/D bits, and is using nested > > virtualization, MGLRU will effectively be blind to all accesses made by > > the VM. > > > > kvm_arch_prepare_bitmap_age() will return true indicating that the > > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will > > return false immediately and indicate the bitmap is unreliable because a > > shadow root is allocate. The notfier will then return > > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. > > > > Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never > > consumed or used. So I think MGLRU will assume all memory is > > unaccessed? > > > > One way to improve the situation would be to re-order the TDP MMU > > function first and return young instead of false, so that way MGLRU at > > least has visibility into accesses made by L1 (and L2 if EPT is disable > > in L2). But that still means MGLRU is blind to accesses made by L2. > > > > What about grabbing the mmu_lock if there's a shadow root allocated and > > get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether? > > > > if (kvm_memslots_have_rmaps(kvm)) { > > write_lock(&kvm->mmu_lock); > > young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > > write_unlock(&kvm->mmu_lock); > > } > > > > The TDP MMU walk would still be lockless. 
KVM only has to take the > > mmu_lock to collect accesses made by L2. > > > > kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware > > as well, but that seems relatively simple with the helper functions. > > Wait, even simpler, just check kvm_memslots_have_rmaps() in > kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a > bitmap request. > > i.e. > > static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm, struct > mmu_notifier *mn) > { > /* > * Indicate that we support bitmap-based aging when using the TDP MMU > * and the accessed bit is available in the TDP page tables. > * > * We have no other preparatory work to do here, so we do not need to > * redefine kvm_arch_finish_bitmap_age(). > */ > return IS_ENABLED(CONFIG_X86_64) > && tdp_mmu_enabled > && shadow_accessed_mask > && !kvm_memslots_have_rmaps(kvm); > } > > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) > { > bool young = false; > > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm)) > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); > > if
Re: [PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young
On Fri, Apr 12, 2024 at 1:28 PM David Matlack wrote: > > On 2024-04-01 11:29 PM, James Houghton wrote: > > Add kvm_arch_prepare_bitmap_age() for architectures to indiciate that > > they support bitmap-based aging in kvm_mmu_notifier_test_clear_young() > > and that they do not need KVM to grab the MMU lock for writing. This > > function allows architectures to do other locking or other preparatory > > work that it needs. > > There's a lot going on here. I know it's extra work but I think the > series would be easier to understand and simplify if you introduced the > KVM support for lockless test/clear_young() first, and then introduce > support for the bitmap-based look-around. Yeah I think this is the right thing to do. Thanks. > > Specifically: > > 1. Make all test/clear_young() notifiers lockless. i.e. Move the > mmu_lock into the architecture-specific code (kvm_age_gfn() and > kvm_test_age_gfn()). > > 2. Convert KVM/x86's kvm_{test,}_age_gfn() to be lockless for the TDP > MMU. > > 4. Convert KVM/arm64's kvm_{test,}_age_gfn() to hold the mmu_lock in > read-mode. > > 5. Add bitmap-based look-around support to KVM/x86 and KVM/arm64 > (probably 2-3 patches). This all sounds good to me. Thanks for laying it out for me -- this should be a lot simpler. > > > > > If an architecture does not implement kvm_arch_prepare_bitmap_age() or > > is unable to do bitmap-based aging at runtime (and marks the bitmap as > > unreliable): > > 1. If a bitmap was provided, we inform the caller that the bitmap is > > unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE). > > 2. If a bitmap was not provided, fall back to the old logic. > > > > Also add logic for architectures to easily use the provided bitmap if > > they are able. The expectation is that the architecture's implementation > > of kvm_gfn_test_age() will use kvm_gfn_record_young(), and > > kvm_gfn_age() will use kvm_gfn_should_age(). > > > > Suggested-by: Yu Zhao > > Signed-off-by: James Houghton > > --- > > include/linux/kvm_host.h | 60 ++ > > virt/kvm/kvm_main.c | 92 +--- > > 2 files changed, 127 insertions(+), 25 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 1800d03a06a9..5862fd7b5f9b 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc > > kvm_vm_stats_desc[]; > > extern const struct kvm_stats_header kvm_vcpu_stats_header; > > extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; > > > > +/* > > + * Architectures that support using bitmaps for kvm_age_gfn() and > > + * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age() > > + * and do any work they need to prepare. The subsequent walk will not > > + * automatically grab the KVM MMU lock, so some architectures may opt > > + * to grab it. > > + * > > + * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() > > is > > + * guaranteed. > > + */ > > +#ifndef kvm_arch_prepare_bitmap_age > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn) > > I find the name of these architecture callbacks misleading/confusing. > The lockless path is used even when a bitmap is not provided. i.e. > bitmap can be NULL in between kvm_arch_prepare/finish_bitmap_age(). Yes. I am really terrible at picking names I'm happy to just nix this, following your other suggestions. 
> > > +{ > > + return false; > > +} > > +#endif > > +#ifndef kvm_arch_finish_bitmap_age > > +static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {} > > +#endif > > kvm_arch_finish_bitmap_age() seems unnecessary. I think the KVM/arm64 > code could acquire/release the mmu_lock in read-mode in > kvm_test_age_gfn() and kvm_age_gfn() right? Yes you're right, except that the way it is now, we only lock/unlock once for the notifier instead of once for each overlapping memslot, but that's not an issue, as you mention below. > > > + > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > > static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) > > { > > @@ -2076,9 +2096,16 @@ static inline bool > > mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm, > > return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq; > > } > > > > +struct test_clear_young_metadata { > > + unsigned long *bitmap; > > + unsigned long bitmap_offset_end; > > bitmap_offset_end is unused. Indeed, sorry about that. > > > + unsigned long end; > > + bool unreliable; > > +}; > > union kvm_mmu_notifier_arg { > > pte_t pte; > > unsigned long attributes; > > + struct test_clear_young_metadata *metadata; > > nit: Maybe s/metadata/test_clear_young/ ? Yes, that's better. > > > }; > > > > struct kvm_gfn_range { > > @@ -2087,11 +2114,44 @@ struct kvm_gfn_range { > > gfn_t end; > > union kvm_mmu_notifier_arg arg; > > bool may_blo
Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
On Fri, Apr 12, 2024 at 11:45 AM David Matlack wrote: > > On 2024-04-01 11:29 PM, James Houghton wrote: > > The bitmap is provided for secondary MMUs to use if they support it. For > > test_young(), after it returns, the bitmap represents the pages that > > were young in the interval [start, end). For clear_young, it represents > > the pages that we wish the secondary MMU to clear the accessed/young bit > > for. > > > > If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API > > should be unchanged except that if young PTEs are found and the > > architecture supports passing in a bitmap, instead of returning 1, > > MMU_NOTIFIER_YOUNG_FAST is returned. > > > > This allows MGLRU's look-around logic to work faster, resulting in a 4% > > improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST > > to indicate to main mm that doing look-around is likely to be > > beneficial. > > > > If the secondary MMU doesn't support the bitmap, it must return > > an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. > > > > [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/ > > > > Suggested-by: Yu Zhao > > Signed-off-by: James Houghton > > --- > > include/linux/mmu_notifier.h | 93 +--- > > include/trace/events/kvm.h | 13 +++-- > > mm/mmu_notifier.c| 20 +--- > > virt/kvm/kvm_main.c | 19 ++-- > > 4 files changed, 123 insertions(+), 22 deletions(-) > > > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index f349e08a9dfe..daaa9db625d3 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -61,6 +61,10 @@ enum mmu_notifier_event { > > > > #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) > > > > +#define MMU_NOTIFIER_YOUNG (1 << 0) > > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) > > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE appears to be unused by all callers > of test/clear_young(). I would vote to remove it. Works for me. > > > +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2) > > Instead of MMU_NOTIFIER_YOUNG_FAST, how about > MMU_NOTIFIER_YOUNG_LOOK_AROUND? i.e. The secondary MMU is returning > saying it recommends doing a look-around and passing in a bitmap? > > That would avoid the whole "what does FAST really mean" confusion. I think MMU_NOTIFIER_YOUNG_LOOK_AROUND is fine. > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index fb49c2a60200..ca4b1ef9dfc2 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct > > mmu_notifier *mn, > > static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn, > > struct mm_struct *mm, > > unsigned long start, > > - unsigned long end) > > + unsigned long end, > > + unsigned long *bitmap) > > { > > trace_kvm_age_hva(start, end); > > > > + /* We don't support bitmaps. Don't test or clear anything. */ > > + if (bitmap) > > + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE; > > Wouldn't it be a bug to get a bitmap here? The main MM is only suppost > to pass in a bitmap if the secondary MMU returns > MMU_NOTIFIER_YOUNG_FAST, which KVM does not do at this point. > > Put another way, this check seems unneccessary. 
> > > + > > /* > >* Even though we do not flush TLB, this will still adversely > >* affect performance on pre-Haswell Intel EPT, where there is > > @@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct > > mmu_notifier *mn, > > > > static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn, > > struct mm_struct *mm, > > -unsigned long address) > > +unsigned long start, > > +unsigned long end, > > +unsigned long *bitmap) > > { > > - trace_kvm_test_age_hva(address); > > + trace_kvm_test_age_hva(start, end); > > + > > + /* We don't support bitmaps. Don't test or clear anything. */ > > + if (bitmap) > > + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE; > > Same thing here. I will remove them, they are indeed unnecessary, as it is just dead code.
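With those simplifications (dropping MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE and renaming MMU_NOTIFIER_YOUNG_FAST), the flag set might reduce to something like the sketch below; the bit numbering here is only an assumption and the next revision may differ:

#define MMU_NOTIFIER_YOUNG			(1 << 0)
/* Secondary MMU recommends a look-around and accepts a bitmap. */
#define MMU_NOTIFIER_YOUNG_LOOK_AROUND		(1 << 1)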
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Fri, Apr 19, 2024 at 10:32:39AM -0700, Song Liu wrote: > On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport wrote: > [...] > > > > > > > > [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org > > > > > > For the ROX to work, we need different users (module text, kprobe, etc.) > > > to have > > > the same execmem_range. From [1]: > > > > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t size) > > > { > > > ... > > >p = __execmem_cache_alloc(size); > > >if (p) > > >return p; > > > err = execmem_cache_populate(range, size); > > > ... > > > } > > > > > > We are calling __execmem_cache_alloc() without range. For this to work, > > > we can only call execmem_cache_alloc() with one execmem_range. > > > > Actually, on x86 this will "just work" because everything shares the same > > address space :) > > > > The 2M pages in the cache will be in the modules space, so > > __execmem_cache_alloc() will always return memory from that address space. > > > > For other architectures this indeed needs to be fixed with passing the > > range to __execmem_cache_alloc() and limiting search in the cache for that > > range. > > I think we at least need the "map to" concept (initially proposed by Thomas) > to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE > maps to EXECMEM_MODULE_TEXT, so that all these actually share > the same range. Why? > Does this make sense? > > Song -- Sincerely yours, Mike.
Re: [PATCH v12 09/14] x86/sgx: Implement async reclamation for cgroup
On Thu, 18 Apr 2024 20:32:14 -0500, Huang, Kai wrote: On 16/04/2024 3:20 pm, Haitao Huang wrote: From: Kristen Carlson Accardi In cases EPC pages need be allocated during a page fault and the cgroup usage is near its limit, an asynchronous reclamation needs be triggered to avoid blocking the page fault handling. Create a workqueue, corresponding work item and function definitions for EPC cgroup to support the asynchronous reclamation. In case the workqueue allocation is failed during init, disable cgroup. It's fine and reasonable to disable (SGX EPC) cgroup. The problem is "exactly what does this mean" isn't quite clear. First, this is really some corner case most people don't care: during init, kernel can't even allocate a workqueue object. So I don't think we should write extra code to implement some sophisticated solution. Any solution we come up with may just not work as the way user want or solve the real issue due to the fact such allocation failure even happens at init time. So IMHO the current solution should be fine and I'll answer some of your detailed questions below. Given SGX EPC is just one type of MISC cgroup resources, we cannot just disable MISC cgroup as a whole. So, the first interpretation is we treat the entire MISC_CG_RES_SGX resource type doesn't exist, that is, we just don't show control files in the file system, and all EPC pages are tracked in the global list. But it might be not straightforward to implement in the SGX driver, i.e., we might need to do more MISC cgroup core code change to make it being able to support disable particular resource at runtime -- I need to double check. So if that is not something worth to do, we will still need to live with the fact that, the user is still able to create SGX cgroup in the hierarchy and see those control files, and being able to read/write them. Can not reliably predict what will happen. Most likely the ENOMEM will be returned by sgx_cgroup_alloc() if reached or other error in the stack if not reached to sgx_cgroup_alloc() and user fails on creating anything. But if they do end up creating some cgroups (sgx_cgroup_alloc() and everything else on the call stack passed without failure), everything still kind of works for the reason answered below. The second interpretation I suppose is, although the SGX cgroup is still seen as supported in userspace, in kernel we just treat it doesn't exist. Specifically, that means: 1) we always return the root SGX cgroup for any EPC page when allocating a new one; 2) as a result, we still track all EPC pages in a single global list. Current code has similar behavior without extra code. But from the code below ... static int __sgx_cgroup_try_charge(struct sgx_cgroup *epc_cg) { if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE)) @@ -117,19 +226,28 @@ int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim) { int ret; + /* cgroup disabled due to wq allocation failure during sgx_cgroup_init(). */ + if (!sgx_cg_wq) + return 0; + ..., IIUC you choose a (third) solution that is even one more step back: It just makes try_charge() always succeed, but EPC pages are still managed in the "per-cgroup" list. But this solution, AFAICT, doesn't work. The reason is when you fail to allocate EPC page you will do the global reclaim, but now the global list is empty. Am I missing anything? But when cgroups enabled in config, global reclamation starts from root and reclaim from the whole hierarchy if user may still be able to create. 
Just that we don't have async/sync per-cgroup reclaim triggered. So my thinking is, we have two options: 1) Modify the MISC cgroup core code to allow the kernel to disable one particular resource. It shouldn't be hard, e.g., we can add a 'disabled' flag to the 'struct misc_res'. Hmm.. wait, after checking, the MISC cgroup won't show any control files if the "capacity" of the resource is 0: " * Miscellaneous resources capacity for the entire machine. 0 capacity * means resource is not initialized or not present in the host. " So I really suppose we should go with this route, i.e., by just setting the EPC capacity to 0? Note misc_cg_try_charge() will fail if capacity is 0, but we can make it return success by explicitly check whether SGX cgroup is disabled by using a helper, e.g., sgx_cgroup_disabled(). And you always return the root SGX cgroup in sgx_get_current_cg() when sgx_cgroup_disabled() is true. And in sgx_reclaim_pages_global(), you do something like: static void sgx_reclaim_pages_global(..) { #ifdef CONFIG_CGROUP_MISC if (sgx_cgroup_disabled()) sgx_reclaim_pages(&sgx_root_cg.lru); else sgx_cgroup_reclaim_pages(misc_cg_root()); #else
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. 
+ */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(&cpu_buffer->irq_work.waiters
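As a quick illustration of how user space could consume the meta-page defined above, here is a sketch; the assumption that the mapping is obtained by mmap()ing the per-CPU trace_pipe_raw file and that the meta-page is the first page of that mapping is mine, so check it against the rest of the series before relying on it:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/trace_mmap.h>	/* the new UAPI header added by this patch */

int main(void)
{
	/* Assumed tracefs path; adjust to the actual mount point and CPU. */
	int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw", O_RDONLY);
	struct trace_buffer_meta *meta;
	long psize;
	void *map;

	if (fd < 0)
		return 1;

	psize = sysconf(_SC_PAGESIZE);
	map = mmap(NULL, psize, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	meta = map;
	printf("subbuf_size=%u nr_subbufs=%u reader.id=%u reader.read=%u\n",
	       meta->subbuf_size, meta->nr_subbufs, meta->reader.id, meta->reader.read);

	munmap(map, psize);
	close(fd);
	return 0;
}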
Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
On Thu, 18 Apr 2024 18:29:53 -0500, Huang, Kai wrote: --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -0,0 +1,72 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _SGX_EPC_CGROUP_H_ +#define _SGX_EPC_CGROUP_H_ + +#include I don't see why you need here. Also, ... +#include +#include + +#include "sgx.h" ... "sgx.h" already includes [...] right +static inline struct sgx_cgroup *sgx_get_current_cg(void) +{ +/* get_current_misc_cg() never returns NULL when Kconfig enabled */ +return sgx_cgroup_from_misc_cg(get_current_misc_cg()); +} I spent some time looking into this. And yes if I was reading code correctly the get_current_misc_cg() should never return NULL when Kconfig is on. I typed my analysis below in [*]. And it would be helpful if any cgroup expert can have a second eye on this. [...] Thanks for checking this and I did similar and agree with the conclusion. I think this is confirmed also by Michal's description AFAICT: " The current implementation creates root css object (see cgroup_init(), cgroup_ssid_enabled() check is after cgroup_init_subsys()). I.e. it will look like all tasks are members of root cgroup wrt given controller permanently and controller attribute files won't exist." After looking I believe we can even disable MISC cgroup at runtime for a particular cgroup (haven't actually verified on real machine, though): # echo "-misc" > /sys/fs/cgroup/my_group/cgroup.subtree_control My test confirms this does not cause NULL cgroup for the tasks. It actually works the same way as commandline disable except that this only disables misc in subtree and does not show any misc.* files or allow creating such files in the subtree. And if you look at the MISC cgroup core code, many functions actually handle a NULL css, e.g., misc_cg_try_charge(): int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) { ... if (!(valid_type(type) && cg && READ_ONCE(misc_res_capacity[type]))) return -EINVAL; ... } That's why I am still a little bit worried about this. And it's better to have cgroup expert(s) to confirm here. I think it's just being defensive as this function is public API called by other parts of kernel. Documentation of task_get_css() says it always returns a valid css. This function is used by get_current_misc_cg() to get the css reference. /** * task_get_css - find and get the css for (task, subsys) * @task: the target task * @subsys_id: the target subsystem ID * * Find the css for the (@task, @subsys_id) combination, increment a * reference on and return it. This function is guaranteed to return a * valid css. The returned css may already have been offlined. */ static inline struct cgroup_subsys_state * task_get_css(struct task_struct *task, int subsys_id) If you look at the code of this function, you will see it does not check NULL either for task_css(). So I think we are pretty sure here it's confirmed by this documentation and testing. Thanks Haitao
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Fri, Apr 19, 2024 at 10:03 AM Mike Rapoport wrote: [...] > > > > > > [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org > > > > For the ROX to work, we need different users (module text, kprobe, etc.) to > > have > > the same execmem_range. From [1]: > > > > static void *execmem_cache_alloc(struct execmem_range *range, size_t size) > > { > > ... > >p = __execmem_cache_alloc(size); > >if (p) > >return p; > > err = execmem_cache_populate(range, size); > > ... > > } > > > > We are calling __execmem_cache_alloc() without range. For this to work, > > we can only call execmem_cache_alloc() with one execmem_range. > > Actually, on x86 this will "just work" because everything shares the same > address space :) > > The 2M pages in the cache will be in the modules space, so > __execmem_cache_alloc() will always return memory from that address space. > > For other architectures this indeed needs to be fixed with passing the > range to __execmem_cache_alloc() and limiting search in the cache for that > range. I think we at least need the "map to" concept (initially proposed by Thomas) to get this work. For example, EXECMEM_BPF and EXECMEM_KPROBE maps to EXECMEM_MODULE_TEXT, so that all these actually share the same range. Does this make sense? Song
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Fri, Apr 19, 2024 at 08:54:40AM -0700, Song Liu wrote: > On Thu, Apr 18, 2024 at 11:56 PM Mike Rapoport wrote: > > > > On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote: > > > On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport wrote: > > > > > > > > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote: > > > > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport wrote: > > > > > > > > > > > > > > > > I'm looking at execmem_types more as definition of the > > > > > > > > consumers, maybe I > > > > > > > > should have named the enum execmem_consumer at the first place. > > > > > > > > > > > > > > I think looking at execmem_type from consumers' point of view adds > > > > > > > unnecessary complexity. IIUC, for most (if not all) archs, > > > > > > > ftrace, kprobe, > > > > > > > and bpf (and maybe also module text) all have the same > > > > > > > requirements. > > > > > > > Did I miss something? > > > > > > > > > > > > It's enough to have one architecture with different constrains for > > > > > > kprobes > > > > > > and bpf to warrant a type for each. > > > > > > > > > > AFAICT, some of these constraints can be changed without too much > > > > > work. > > > > > > > > But why? > > > > I honestly don't understand what are you trying to optimize here. A few > > > > lines of initialization in execmem_info? > > > > > > IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it > > > harder for bpf and kprobe to share the same ROX page. In many use cases, > > > a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and > > > module text. It is not efficient if we have to allocate separate pages > > > for each > > > of these use cases. If this is not a problem, the current approach works. > > > > The caching of large ROX pages does not need to be per type. > > > > In the POC I've posted for caching of large ROX pages on x86 [1], the cache > > is > > global and to make kprobes and bpf use it it's enough to set a flag in > > execmem_info. > > > > [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org > > For the ROX to work, we need different users (module text, kprobe, etc.) to > have > the same execmem_range. From [1]: > > static void *execmem_cache_alloc(struct execmem_range *range, size_t size) > { > ... >p = __execmem_cache_alloc(size); >if (p) >return p; > err = execmem_cache_populate(range, size); > ... > } > > We are calling __execmem_cache_alloc() without range. For this to work, > we can only call execmem_cache_alloc() with one execmem_range. Actually, on x86 this will "just work" because everything shares the same address space :) The 2M pages in the cache will be in the modules space, so __execmem_cache_alloc() will always return memory from that address space. For other architectures this indeed needs to be fixed with passing the range to __execmem_cache_alloc() and limiting search in the cache for that range. > Did I miss something? > > Thanks, > Song -- Sincerely yours, Mike.
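To make that concrete, the fix under discussion could look roughly like the sketch below; it only rearranges the snippet quoted in this thread, and the range argument to __execmem_cache_alloc() is the hypothetical part:

static void *execmem_cache_alloc(struct execmem_range *range, size_t size)
{
	void *p;
	int err;

	/* Limit the search to cached large pages within the range's address space. */
	p = __execmem_cache_alloc(range, size);
	if (p)
		return p;

	err = execmem_cache_populate(range, size);
	if (err)
		return NULL;

	return __execmem_cache_alloc(range, size);
}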
Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES
On 19/04/2024 at 17:49, Mike Rapoport wrote: > Hi Masami, > > On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote: >> Hi Mike, >> >> On Thu, 11 Apr 2024 19:00:50 +0300 >> Mike Rapoport wrote: >> >>> From: "Mike Rapoport (IBM)" >>> >>> kprobes depended on CONFIG_MODULES because it has to allocate memory for >>> code. >>> >>> Since code allocations are now implemented with execmem, kprobes can be >>> enabled in non-modular kernels. >>> >>> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside >>> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the >>> dependency of CONFIG_KPROBES on CONFIG_MODULES. >> >> Thanks for this work, but this conflicts with the latest fix in v6.9-rc4. >> Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in >> function body? We have enough dummy functions for that, so it should >> not make a problem. > > The code in check_kprobe_address_safe() that gets the module and checks for > __init functions does not compile with IS_ENABLED(CONFIG_MODULES). > I can pull it out to a helper or leave #ifdef in the function body, > whichever you prefer. As far as I can see, the only problem is MODULE_STATE_COMING. Can we move 'enum module_state' out of #ifdef CONFIG_MODULES in module.h ? > >> -- >> Masami Hiramatsu >
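One possible shape of such a helper, sketched purely for illustration (the helper name is made up, and the body mirrors what check_kprobe_address_safe() does today as far as I can tell):

#ifdef CONFIG_MODULES
static int check_kprobe_module_text(struct kprobe *p, struct module **probed_mod)
{
	*probed_mod = __module_text_address((unsigned long)p->addr);
	if (!*probed_mod)
		return 0;

	/* Hold a reference so the module cannot be unloaded while we patch it. */
	if (unlikely(!try_module_get(*probed_mod)))
		return -ENOENT;

	/* Probing freed .init.text of an already-live module is not allowed. */
	if (within_module_init((unsigned long)p->addr, *probed_mod) &&
	    (*probed_mod)->state != MODULE_STATE_COMING) {
		module_put(*probed_mod);
		*probed_mod = NULL;
		return -ENOENT;
	}

	return 0;
}
#else
static inline int check_kprobe_module_text(struct kprobe *p, struct module **probed_mod)
{
	*probed_mod = NULL;
	return 0;
}
#endif /* CONFIG_MODULES */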
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Thu, Apr 18, 2024 at 11:56 PM Mike Rapoport wrote: > > On Thu, Apr 18, 2024 at 02:01:22PM -0700, Song Liu wrote: > > On Thu, Apr 18, 2024 at 10:54 AM Mike Rapoport wrote: > > > > > > On Thu, Apr 18, 2024 at 09:13:27AM -0700, Song Liu wrote: > > > > On Thu, Apr 18, 2024 at 8:37 AM Mike Rapoport wrote: > > > > > > > > > > > > > > I'm looking at execmem_types more as definition of the consumers, > > > > > > > maybe I > > > > > > > should have named the enum execmem_consumer at the first place. > > > > > > > > > > > > I think looking at execmem_type from consumers' point of view adds > > > > > > unnecessary complexity. IIUC, for most (if not all) archs, ftrace, > > > > > > kprobe, > > > > > > and bpf (and maybe also module text) all have the same requirements. > > > > > > Did I miss something? > > > > > > > > > > It's enough to have one architecture with different constrains for > > > > > kprobes > > > > > and bpf to warrant a type for each. > > > > > > > > AFAICT, some of these constraints can be changed without too much work. > > > > > > But why? > > > I honestly don't understand what are you trying to optimize here. A few > > > lines of initialization in execmem_info? > > > > IIUC, having separate EXECMEM_BPF and EXECMEM_KPROBE makes it > > harder for bpf and kprobe to share the same ROX page. In many use cases, > > a 2MiB page (assuming x86_64) is enough for all BPF, kprobe, ftrace, and > > module text. It is not efficient if we have to allocate separate pages for > > each > > of these use cases. If this is not a problem, the current approach works. > > The caching of large ROX pages does not need to be per type. > > In the POC I've posted for caching of large ROX pages on x86 [1], the cache is > global and to make kprobes and bpf use it it's enough to set a flag in > execmem_info. > > [1] https://lore.kernel.org/all/20240411160526.2093408-1-r...@kernel.org For the ROX to work, we need different users (module text, kprobe, etc.) to have the same execmem_range. From [1]: static void *execmem_cache_alloc(struct execmem_range *range, size_t size) { ... p = __execmem_cache_alloc(size); if (p) return p; err = execmem_cache_populate(range, size); ... } We are calling __execmem_cache_alloc() without range. For this to work, we can only call execmem_cache_alloc() with one execmem_range. Did I miss something? Thanks, Song
Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES
Hi Masami, On Thu, Apr 18, 2024 at 06:16:15AM +0900, Masami Hiramatsu wrote: > Hi Mike, > > On Thu, 11 Apr 2024 19:00:50 +0300 > Mike Rapoport wrote: > > > From: "Mike Rapoport (IBM)" > > > > kprobes depended on CONFIG_MODULES because it has to allocate memory for > > code. > > > > Since code allocations are now implemented with execmem, kprobes can be > > enabled in non-modular kernels. > > > > Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside > > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the > > dependency of CONFIG_KPROBES on CONFIG_MODULES. > > Thanks for this work, but this conflicts with the latest fix in v6.9-rc4. > Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in > function body? We have enough dummy functions for that, so it should > not make a problem. The code in check_kprobe_address_safe() that gets the module and checks for __init functions does not compile with IS_ENABLED(CONFIG_MODULES). I can pull it out to a helper or leave #ifdef in the function body, whichever you prefer. > -- > Masami Hiramatsu -- Sincerely yours, Mike.
Re: [ANNOUNCE] 5.10.214-rt106
Hi! > The tag is there in the repository: > > > https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v5.10.214-rt106-rebase > > When I released 5.10.215-rt107 (and its -rebase counterpart), > 5.10.214-rt106-rebase > was no longer pointing to a commit inside that branch, probably why your git > update didn't get the tag. You could try a > > git fetch --tags Aha, thanks for the pointer and sorry for the noise. Best regards, Pavel -- DENX Software Engineering GmbH,Managing Director: Erika Unter HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Re: [PATCH v9 5/9] clk: mmp: Add Marvell PXA1908 clock driver
On Friday, April 12, 2024 4:57:09 AM GMT+2 Stephen Boyd wrote: > Quoting Duje Mihanović (2024-04-11 03:15:34) > > > On 4/11/2024 10:00 AM, Stephen Boyd wrote: > > > Is there a reason this file can't be a platform driver? > > > > Not that I know of, I did it like this only because the other in-tree > > MMP clk drivers do so. I guess the initialization should look like any > > of the qcom GCC drivers then? > > Yes. With the entire clock driver code in one file this is quite messy as I also needed to add module_init and module_exit functions to (un)register each platform driver, presumably because the module_platform_driver macro doesn't work with multiple platform drivers in one module. If I split up the driver code for each clock controller block into its own file (such as clk-of-pxa1908-apbc.c) as I believe is the best option, should the commits be split up accordingly as well? > > While at it, do you think the other MMP clk drivers could use a conversion? > > I'm a little wary if the conversion cannot be tested though. I'd rather leave it to someone with the hardware then, especially since the only reason I found out about the above is that the board I'm working on failed to boot completely without the module_init function. Regards, -- Duje
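For what it's worth, one way to avoid hand-rolled module_init()/module_exit() pairs when a single module carries several platform drivers is platform_register_drivers(); the driver names below are placeholders for the per-block drivers discussed above:

static struct platform_driver * const pxa1908_clk_drivers[] = {
	&pxa1908_apbc_driver,	/* hypothetical per-block drivers */
	&pxa1908_apbcp_driver,
	&pxa1908_mpmu_driver,
	&pxa1908_apmu_driver,
};

static int __init pxa1908_clk_init(void)
{
	return platform_register_drivers(pxa1908_clk_drivers,
					 ARRAY_SIZE(pxa1908_clk_drivers));
}
module_init(pxa1908_clk_init);

static void __exit pxa1908_clk_exit(void)
{
	platform_unregister_drivers(pxa1908_clk_drivers,
				    ARRAY_SIZE(pxa1908_clk_drivers));
}
module_exit(pxa1908_clk_exit);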
Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Fri, 5 Apr 2024 00:07:06 + "Ho-Ren (Jack) Chuang" wrote: > The current implementation treats emulated memory devices, such as > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory > (E820_TYPE_RAM). However, these emulated devices have different > characteristics than traditional DRAM, making it important to > distinguish them. Thus, we modify the tiered memory initialization process > to introduce a delay specifically for CPUless NUMA nodes. This delay > ensures that the memory tier initialization for these nodes is deferred > until HMAT information is obtained during the boot process. Finally, > demotion tables are recalculated at the end. > > * late_initcall(memory_tier_late_init); > Some device drivers may have initialized memory tiers between > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > online memory nodes and configuring memory tiers. They should be excluded > in the late init. > > * Handle cases where there is no HMAT when creating memory tiers > There is a scenario where a CPUless node does not provide HMAT information. > If no HMAT is specified, it falls back to using the default DRAM tier. > > * Introduce another new lock `default_dram_perf_lock` for adist calculation > In the current implementation, iterating through CPUlist nodes requires > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up > trying to acquire the same lock, leading to a potential deadlock. > Therefore, we propose introducing a standalone `default_dram_perf_lock` to > protect `default_dram_perf_*`. This approach not only avoids deadlock > but also prevents holding a large lock simultaneously. > > * Upgrade `set_node_memory_tier` to support additional cases, including > default DRAM, late CPUless, and hot-plugged initializations. > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > handle cases where memtype is not initialized and where HMAT information is > available. > > * Introduce `default_memory_types` for those memory types that are not > initialized by device drivers. > Because late initialized memory and default DRAM memory need to be managed, > a default memory type is created for storing all memory types that are > not initialized by device drivers and as a fallback. > > Signed-off-by: Ho-Ren (Jack) Chuang > Signed-off-by: Hao Xiang > Reviewed-by: "Huang, Ying" Reviewed-by: Jonathan Cameron
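A rough sketch of the deferred path described in the changelog, as I read it (the loop body and the names node_memory_types and establish_demotion_targets reflect my understanding of the existing memory-tiers code, not the patch itself):

static int __init memory_tier_late_init(void)
{
	int nid;

	mutex_lock(&memory_tier_lock);
	for_each_node_state(nid, N_MEMORY) {
		/*
		 * Skip nodes that a device driver already placed into a
		 * memory tier between memory_tier_init() and this point.
		 */
		if (node_memory_types[nid].memtype)
			continue;

		set_node_memory_tier(nid);
	}
	establish_demotion_targets();
	mutex_unlock(&memory_tier_lock);

	return 0;
}
late_initcall(memory_tier_late_init);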
Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
On Fri, Apr 19, 2024, Will Deacon wrote: > > @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t > > __kvm_handle_hva_range(struct kvm *kvm, > > break; > > } > > r.ret |= range->handler(kvm, &gfn_range); > > + > > + /* > > + * Use a precise gfn-based TLB flush when possible, as > > + * most mmu_notifier events affect a small-ish range. > > + * Fall back to a full TLB flush if the gfn-based flush > > + * fails, and don't bother trying the gfn-based flush > > + * if a full flush is already pending. > > + */ > > + if (range->flush_on_ret && !need_flush && r.ret && > > + kvm_arch_flush_remote_tlbs_range(kvm, > > gfn_range.start, > > + gfn_range.end - > > gfn_range.start + 1)) > > What's that '+ 1' needed for here? (a) To see if you're paying attention. (b) Because more is always better. (c) Because math is hard. (d) Because I haven't tested this. (e) Both (c) and (d).
Re: [PATCH v2 1/2] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
On 19/04/2024 10:42, Olivia Wen wrote: > From: "olivia.wen" > > Under different applications, the MT8188 SCP can be used as single-core > or dual-core. > > Signed-off-by: olivia.wen Acked-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
On Thu, Apr 18, 2024 at 12:53:26PM -0700, Sean Christopherson wrote: > On Thu, Apr 18, 2024, Will Deacon wrote: > > On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote: > > > On Sat, Apr 13, 2024, Marc Zyngier wrote: > > > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson > > > > wrote: > > > > > > > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote: > > > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon > > > > > > wrote: > > > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote: > > > > > > > Also, if you're in the business of hacking the MMU notifier code, > > > > > > > it > > > > > > > would be really great to change the .clear_flush_young() callback > > > > > > > so > > > > > > > that the architecture could handle the TLB invalidation. At the > > > > > > > moment, > > > > > > > the core KVM code invalidates the whole VMID courtesy of > > > > > > > 'flush_on_ret' > > > > > > > being set by kvm_handle_hva_range(), whereas we could do a much > > > > > > > lighter-weight and targetted TLBI in the architecture page-table > > > > > > > code > > > > > > > when we actually update the ptes for small ranges. > > > > > > > > > > > > Indeed, and I was looking at this earlier this week as it has a > > > > > > pretty > > > > > > devastating effect with NV (it blows the shadow S2 for that VMID, > > > > > > with > > > > > > costly consequences). > > > > > > > > > > > > In general, it feels like the TLB invalidation should stay with the > > > > > > code that deals with the page tables, as it has a pretty good idea > > > > > > of > > > > > > what needs to be invalidated and how -- specially on architectures > > > > > > that have a HW-broadcast facility like arm64. > > > > > > > > > > Would this be roughly on par with an in-line flush on arm64? The > > > > > simpler, more > > > > > straightforward solution would be to let architectures override > > > > > flush_on_ret, > > > > > but I would prefer something like the below as x86 can also utilize a > > > > > range-based > > > > > flush when running as a nested hypervisor. > > > > > > ... > > > > > > > I think this works for us on HW that has range invalidation, which > > > > would already be a positive move. > > > > > > > > For the lesser HW that isn't range capable, it also gives the > > > > opportunity to perform the iteration ourselves or go for the nuclear > > > > option if the range is larger than some arbitrary constant (though > > > > this is additional work). > > > > > > > > But this still considers the whole range as being affected by > > > > range->handler(). It'd be interesting to try and see whether more > > > > precise tracking is (or isn't) generally beneficial. > > > > > > I assume the idea would be to let arch code do single-page invalidations > > > of > > > stage-2 entries for each gfn? > > > > Right, as it's the only code which knows which ptes actually ended up > > being aged. > > > > > Unless I'm having a brain fart, x86 can't make use of that functionality. > > > Intel > > > doesn't provide any way to do targeted invalidation of stage-2 mappings. > > > AMD > > > provides an instruction to do broadcast invalidations, but it takes a > > > virtual > > > address, i.e. a stage-1 address. 
I can't tell if it's a host virtual > > > address or > > > a guest virtual address, but it's a moot point because KVM doen't have > > > the guest > > > virtual address, and if it's a host virtual address, there would need to > > > be valid > > > mappings in the host page tables for it to work, which KVM can't > > > guarantee. > > > > Ah, so it sounds like it would need to be an arch opt-in then. > > Even if x86 (or some other arch code) could use the precise tracking, I think > it > would make sense to have the behavior be arch specific. Adding infrastructure > to get information from arch code, only to turn around and give it back to > arch > code would be odd. Sorry, yes, that's what I had in mind. Basically, a way for the arch code to say "I've handled the TLBI, don't worry about it." > Unless arm64 can't do the invalidation immediately after aging the stage-2 > PTE, > the best/easiest solution would be to let arm64 opt out of the common TLB > flush > when a SPTE is made young. > > With the range-based flushing bundled in, this? > > --- > include/linux/kvm_host.h | 2 ++ > virt/kvm/kvm_main.c | 40 +--- > 2 files changed, 27 insertions(+), 15 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index afbc99264ffa..8fe5f5e16919 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -2010,6 +2010,8 @@ extern const struct kvm_stats_header > kvm_vcpu_stats_header; > extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; > > #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER > +int kvm_arch_flush_tlb_if_young(void); > + > static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned lon
Re: [PATCH] Revert "tracing/trigger: Fix to return error if failed to alloc snapshot"
On Thu, Apr 18, 2024 at 06:58:06PM +0530, Siddh Raman Pant wrote: > This reverts commit b5085b5ac1d96ea2a8a6240f869655176ce44197. > > The change has an incorrect assumption about the return value because > in the current stable trees for versions 5.15 and before, the following > commit responsible for making 0 a success value is not present: > b8cc44a4d3c1 ("tracing: Remove logic for registering multiple event triggers > at a time") > > The return value should be 0 on failure in the current tree, because in > the functions event_trigger_callback() and event_enable_trigger_func(), > we have: > > ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file); > /* >* The above returns on success the # of functions enabled, >* but if it didn't find any functions it returns zero. >* Consider no functions a failure too. >*/ > if (!ret) { > ret = -ENOENT; > > Cc: sta...@kernel.org # 5.15, 5.10, 5.4, 4.19 > Signed-off-by: Siddh Raman Pant > --- > kernel/trace/trace_events_trigger.c | 6 ++ > 1 file changed, 2 insertions(+), 4 deletions(-) Now queued up everywhere, thanks. greg k-h
Re: [ANNOUNCE] 5.10.214-rt106
On Fri, Apr 19, 2024 at 11:23:42AM +0200, Pavel Machek wrote: > Hi! > > > I'm pleased to announce the 5.10.214-rt106 stable release. > > > > This release is simply an update to the new stable 5.10.214 version and no > > RT-specific changes have been performed. > > > > You can get this release via the git tree at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git > > > > branch: v5.10-rt > > Head SHA1: 3d208061796d4addeb543c78e0a4ec769b6ce6b2 > > Thank you. > > Could you also push v5.10.214-rt106-rebase tag to the repository for > consistency? Hi Pavel! The tag is there in the repository: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v5.10.214-rt106-rebase When I released 5.10.215-rt107 (and its -rebase counterpart), 5.10.214-rt106-rebase was no longer pointing to a commit inside that branch, probably why your git update didn't get the tag. You could try a git fetch --tags Best regards, Luis > Best regards, > Pavel > -- > DENX Software Engineering GmbH,Managing Director: Erika Unter > HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
[syzbot] [virt?] [net?] KMSAN: uninit-value in vsock_assign_transport (2)
Hello, syzbot found the following issue on: HEAD commit:8cd26fd90c1a Merge tag 'for-6.9-rc4-tag' of git://git.kern.. git tree: upstream console+strace: https://syzkaller.appspot.com/x/log.txt?x=102d27cd18 kernel config: https://syzkaller.appspot.com/x/.config?x=87a805e655619c64 dashboard link: https://syzkaller.appspot.com/bug?extid=6c21aeb59d0e82eb2782 compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=16e38c3b18 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=10e62fed18 Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/488822aee24a/disk-8cd26fd9.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/ba40e322ba00/vmlinux-8cd26fd9.xz kernel image: https://storage.googleapis.com/syzbot-assets/f30af1dfbc30/bzImage-8cd26fd9.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com = BUG: KMSAN: uninit-value in vsock_assign_transport+0xb2a/0xb90 net/vmw_vsock/af_vsock.c:500 vsock_assign_transport+0xb2a/0xb90 net/vmw_vsock/af_vsock.c:500 vsock_connect+0x544/0x1560 net/vmw_vsock/af_vsock.c:1393 __sys_connect_file net/socket.c:2048 [inline] __sys_connect+0x606/0x690 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x91/0xe0 net/socket.c:2072 x64_sys_call+0x3356/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:43 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: __kmalloc_large_node+0x231/0x370 mm/slub.c:3921 __do_kmalloc_node mm/slub.c:3954 [inline] __kmalloc_node+0xb07/0x1060 mm/slub.c:3973 kmalloc_node include/linux/slab.h:648 [inline] kvmalloc_node+0xc0/0x2d0 mm/util.c:634 kvmalloc include/linux/slab.h:766 [inline] vhost_vsock_dev_open+0x44/0x510 drivers/vhost/vsock.c:659 misc_open+0x66b/0x760 drivers/char/misc.c:165 chrdev_open+0xa5f/0xb80 fs/char_dev.c:414 do_dentry_open+0x11f1/0x2120 fs/open.c:955 vfs_open+0x7e/0xa0 fs/open.c:1089 do_open fs/namei.c:3642 [inline] path_openat+0x4a3c/0x5b00 fs/namei.c:3799 do_filp_open+0x20e/0x590 fs/namei.c:3826 do_sys_openat2+0x1bf/0x2f0 fs/open.c:1406 do_sys_open fs/open.c:1421 [inline] __do_sys_openat fs/open.c:1437 [inline] __se_sys_openat fs/open.c:1432 [inline] __x64_sys_openat+0x2a1/0x310 fs/open.c:1432 x64_sys_call+0x3a64/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:258 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f CPU: 1 PID: 5021 Comm: syz-executor390 Not tainted 6.9.0-rc4-syzkaller-00038-g8cd26fd90c1a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 = --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. If the report is already addressed, let syzbot know by replying with: #syz fix: exact-commit-title If you want syzbot to run the reproducer, reply with: #syz test: git://repo/address.git branch-or-commit-hash If you attach or paste a git patch, syzbot will apply it before testing. 
If you want to overwrite report's subsystems, reply with: #syz set subsystems: new-subsystem (See the list of subsystem names on the web dashboard) If the report is a duplicate of another one, reply with: #syz dup: exact-subject-of-another-report If you want to undo deduplication, reply with: #syz undup
Re: [ANNOUNCE] 5.10.214-rt106
Hi! > I'm pleased to announce the 5.10.214-rt106 stable release. > > This release is simply an update to the new stable 5.10.214 version and no > RT-specific changes have been performed. > > You can get this release via the git tree at: > > git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git > > branch: v5.10-rt > Head SHA1: 3d208061796d4addeb543c78e0a4ec769b6ce6b2 Thank you. Could you also push v5.10.214-rt106-rebase tag to the repository for consistency? Best regards, Pavel -- DENX Software Engineering GmbH,Managing Director: Erika Unter HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
[PATCH v2 2/2] remoteproc: mediatek: Support MT8188 SCP core 1
From: "olivia.wen" There are three primary modifications. 1. The struct mtk_scp_of_data usage on MT8188 MT8192 functions are unsuitable for the dual-core MT8188 SCP, which has two RISC-V cores similar to MT8195 but without L1TCM. We've added MT8188-specific functions to configure L1TCM in multicore setups. 2. SCP_IPI_IMGSYS_CMD feature This version also adds SCP_IPI_IMGSYS_CMD to facilitate communication between the imgsys kernel and the backend driver. 3. Different code sizes and IPI share buffer sizes Each SCP necessitates different code and IPI share buffer sizes. Introducing a structure mtk_scp_sizes_data to handle them. Signed-off-by: olivia.wen --- drivers/remoteproc/mtk_common.h| 11 +- drivers/remoteproc/mtk_scp.c | 230 + drivers/remoteproc/mtk_scp_ipi.c | 7 +- include/linux/remoteproc/mtk_scp.h | 1 + 4 files changed, 223 insertions(+), 26 deletions(-) diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h index 6d7736a..fd5c539 100644 --- a/drivers/remoteproc/mtk_common.h +++ b/drivers/remoteproc/mtk_common.h @@ -78,7 +78,6 @@ #define MT8195_L2TCM_OFFSET0x850d0 #define SCP_FW_VER_LEN 32 -#define SCP_SHARE_BUFFER_SIZE 288 struct scp_run { u32 signaled; @@ -97,6 +96,11 @@ struct scp_ipi_desc { struct mtk_scp; +struct mtk_scp_sizes_data { + size_t max_dram_size; + size_t ipi_share_buffer_size; +}; + struct mtk_scp_of_data { int (*scp_clk_get)(struct mtk_scp *scp); int (*scp_before_load)(struct mtk_scp *scp); @@ -110,6 +114,7 @@ struct mtk_scp_of_data { u32 host_to_scp_int_bit; size_t ipi_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; }; struct mtk_scp_of_cluster { @@ -141,10 +146,10 @@ struct mtk_scp { struct scp_ipi_desc ipi_desc[SCP_IPI_MAX]; bool ipi_id_ack[SCP_IPI_MAX]; wait_queue_head_t ack_wq; + u8 *share_buf; void *cpu_addr; dma_addr_t dma_addr; - size_t dram_size; struct rproc_subdev *rpmsg_subdev; @@ -162,7 +167,7 @@ struct mtk_scp { struct mtk_share_obj { u32 id; u32 len; - u8 share_buf[SCP_SHARE_BUFFER_SIZE]; + u8 *share_buf; }; void scp_memcpy_aligned(void __iomem *dst, const void *src, unsigned int len); diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c index 6751829..e281d28 100644 --- a/drivers/remoteproc/mtk_scp.c +++ b/drivers/remoteproc/mtk_scp.c @@ -20,7 +20,6 @@ #include "mtk_common.h" #include "remoteproc_internal.h" -#define MAX_CODE_SIZE 0x50 #define SECTION_NAME_IPI_BUFFER ".ipi_buffer" /** @@ -94,14 +93,15 @@ static void scp_ipi_handler(struct mtk_scp *scp) { struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf; struct scp_ipi_desc *ipi_desc = scp->ipi_desc; - u8 tmp_data[SCP_SHARE_BUFFER_SIZE]; scp_ipi_handler_t handler; u32 id = readl(&rcv_obj->id); u32 len = readl(&rcv_obj->len); + const struct mtk_scp_sizes_data *scp_sizes; - if (len > SCP_SHARE_BUFFER_SIZE) { - dev_err(scp->dev, "ipi message too long (len %d, max %d)", len, - SCP_SHARE_BUFFER_SIZE); + scp_sizes = scp->data->scp_sizes; + if (len > scp_sizes->ipi_share_buffer_size) { + dev_err(scp->dev, "ipi message too long (len %d, max %zd)", len, + scp_sizes->ipi_share_buffer_size); return; } if (id >= SCP_IPI_MAX) { @@ -117,8 +117,9 @@ static void scp_ipi_handler(struct mtk_scp *scp) return; } - memcpy_fromio(tmp_data, &rcv_obj->share_buf, len); - handler(tmp_data, len, ipi_desc[id].priv); + memset(scp->share_buf, 0, scp_sizes->ipi_share_buffer_size); + memcpy_fromio(scp->share_buf, &rcv_obj->share_buf, len); + handler(scp->share_buf, len, ipi_desc[id].priv); scp_ipi_unlock(scp, id); scp->ipi_id_ack[id] = true; @@ -133,6 +134,8 @@ static int 
scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) { int ret; size_t buf_sz, offset; + size_t share_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; /* read the ipi buf addr from FW itself first */ ret = scp_elf_read_ipi_buf_addr(scp, fw, &offset); @@ -152,12 +155,15 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) return -EOVERFLOW; } + scp_sizes = scp->data->scp_sizes; scp->recv_buf = (struct mtk_share_obj __iomem *) (scp->sram_base + offset); + share_buf_offset = sizeof(scp->recv_buf->id) + + sizeof(scp->recv_buf->len) + scp_sizes->ipi_share_buffer_size; scp->send_buf = (struct mtk_share_obj __iomem *) - (scp->sram_base + offset + sizeof(*scp->recv_buf)); - memset_io(scp->
[PATCH v2 0/2] Support MT8188 SCP core 1
From: "olivia.wen" Change in v2: - change the order of mt8188-scp-dual - modify the struct mtk_scp_of_data on MT8188 - add MT8188-specific functions - add structure mtk_scp_sizes_data to manage sizes - change tmp_data allocation to share_buf olivia.wen (2): dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP remoteproc: mediatek: Support MT8188 SCP core 1 .../devicetree/bindings/remoteproc/mtk,scp.yaml| 2 + drivers/remoteproc/mtk_common.h| 11 +- drivers/remoteproc/mtk_scp.c | 230 +++-- drivers/remoteproc/mtk_scp_ipi.c | 7 +- include/linux/remoteproc/mtk_scp.h | 1 + 5 files changed, 225 insertions(+), 26 deletions(-) -- 2.6.4
[PATCH v2 1/2] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
From: "olivia.wen" Under different applications, the MT8188 SCP can be used as single-core or dual-core. Signed-off-by: olivia.wen --- Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml index 507f98f..c5dc3c2 100644 --- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml +++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml @@ -19,6 +19,7 @@ properties: - mediatek,mt8183-scp - mediatek,mt8186-scp - mediatek,mt8188-scp + - mediatek,mt8188-scp-dual - mediatek,mt8192-scp - mediatek,mt8195-scp - mediatek,mt8195-scp-dual @@ -194,6 +195,7 @@ allOf: properties: compatible: enum: +- mediatek,mt8188-scp-dual - mediatek,mt8195-scp-dual then: properties: -- 2.6.4
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Fri, 19 Apr 2024 14:36:18 +0900 Masami Hiramatsu (Google) wrote: > Hi Steve, > > Can you review this series? Especially, [07/36] and [12/36] has been changed > a lot from your original patch. I haven't forgotten (just been a bit hectic). Worse comes to worse, I'll review it tomorrow. -- Steve > > Thank you, > > On Mon, 15 Apr 2024 21:48:59 +0900 > "Masami Hiramatsu (Google)" wrote: > > > Hi, > > > > Here is the 9th version of the series to re-implement the fprobe on > > function-graph tracer. The previous version is; > > > > https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/ > > > > This version is ported on the latest kernel (v6.9-rc3 + probes/for-next) > > and fixed some bugs + performance optimization patch[36/36]. > > - [12/36] Fix to clear fgraph_array entry in registration failure, also > >return -ENOSPC when fgraph_array is full. > > - [28/36] Add new store_fprobe_entry_data() for fprobe. > > - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation. > > - [36/36] Add new flag to skip timestamp recording. > > > > Overview > > > > This series does major 2 changes, enable multiple function-graphs on > > the ftrace (e.g. allow function-graph on sub instances) and rewrite the > > fprobe on this function-graph. > > > > The former changes had been sent from Steven Rostedt 4 years ago (*), > > which allows users to set different setting function-graph tracer (and > > other tracers based on function-graph) in each trace-instances at the > > same time. > > > > (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/ > > > > The purpose of latter change are; > > > > 1) Remove dependency of the rethook from fprobe so that we can reduce > >the return hook code and shadow stack. > > > > 2) Make 'ftrace_regs' the common trace interface for the function > >boundary. > > > > 1) Currently we have 2(or 3) different function return hook codes, > > the function-graph tracer and rethook (and legacy kretprobe). > > But since this is redundant and needs double maintenance cost, > > I would like to unify those. From the user's viewpoint, function- > > graph tracer is very useful to grasp the execution path. For this > > purpose, it is hard to use the rethook in the function-graph > > tracer, but the opposite is possible. (Strictly speaking, kretprobe > > can not use it because it requires 'pt_regs' for historical reasons.) > > > > 2) Now the fprobe provides the 'pt_regs' for its handler, but that is > > wrong for the function entry and exit. Moreover, depending on the > > architecture, there is no way to accurately reproduce 'pt_regs' > > outside of interrupt or exception handlers. This means fprobe should > > not use 'pt_regs' because it does not use such exceptions. > > (Conversely, kprobe should use 'pt_regs' because it is an abstract > > interface of the software breakpoint exception.) > > > > This series changes fprobe to use function-graph tracer for tracing > > function entry and exit, instead of mixture of ftrace and rethook. > > Unlike the rethook which is a per-task list of system-wide allocated > > nodes, the function graph's ret_stack is a per-task shadow stack. > > Thus it does not need to set 'nr_maxactive' (which is the number of > > pre-allocated nodes). > > Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'. > > Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as > > their register interface, this changes it to convert 'ftrace_regs' to > > 'pt_regs'. 
Of course this conversion makes an incomplete 'pt_regs', > > so users must access only registers for function parameters or > > return value. > > > > Design > > -- > > Instead of using ftrace's function entry hook directly, the new fprobe > > is built on top of the function-graph's entry and return callbacks > > with 'ftrace_regs'. > > > > Since the fprobe requires access to 'ftrace_regs', the architecture > > must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and > > CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph > > entry callback with 'ftrace_regs', and also > > CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to > > return_to_handler. > > > > All fprobes share a single function-graph ops (means shares a common > > ftrace filter) similar to the kprobe-on-ftrace. This needs another > > layer to find corresponding fprobe in the common function-graph > > callbacks, but has much better scalability, since the number of > > registered function-graph ops is limited. > > > > In the entry callback, the fprobe runs its entry_handler and saves the > > address of 'fprobe' on the function-graph's shadow stack as data. The > > return callback decodes the data to get the 'fprobe' address, and runs > > the exit_handler. > > > > The fprobe introduces two hash-tables, one is for entry callback which > > searches fprobes related to the given function address passed by