Re: [PATCH] media: dt-bindings: qcom,sc7280-venus: Allow one IOMMU entry

2024-01-29 Thread Luca Weiss
On Mon Jan 29, 2024 at 6:37 PM CET, Conor Dooley wrote:
> On Mon, Jan 29, 2024 at 08:48:54AM +0100, Luca Weiss wrote:
> > Some SC7280-based boards crash when providing the "secure_non_pixel"
> > context bank, so allow only one iommu in the bindings also.
> > 
> > Signed-off-by: Luca Weiss 
>
> Do we have any idea why this happens? How is someone supposed to know
> whether or not their system requires you to only provide one iommu?
> Yes, a crash might be the obvious answer, but is there a way of knowing
> without the crashes?

+CC Vikash Garodia

Unfortunately I don't really have much more information than this
message here:
https://lore.kernel.org/linux-arm-msm/ff021f49-f81b-0fd1-bd2c-895dbbb03...@quicinc.com/

And see also the following replies for a bit more context, like this
one:
https://lore.kernel.org/linux-arm-msm/a4e8b531-49f9-f4a1-51cb-e422c5628...@quicinc.com/

Maybe Vikash can add some more info regarding this.

Regards
Luca

>
> Cheers,
> Conor.
>
> > ---
> > Reference:
> > https://lore.kernel.org/linux-arm-msm/20231201-sc7280-venus-pas-v3-2-bc132dc5f...@fairphone.com/
> > ---
> >  Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml 
> > b/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml
> > index 8f9b6433aeb8..10c334e6b3dc 100644
> > --- a/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml
> > +++ b/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml
> > @@ -43,6 +43,7 @@ properties:
> >- const: vcodec_bus
> >  
> >iommus:
> > +minItems: 1
> >  maxItems: 2
> >  
> >interconnects:
> > 
> > ---
> > base-commit: 596764183be8ebb13352b281a442a1f1151c9b06
> > change-id: 20240129-sc7280-venus-bindings-6e62a99620de
> > 
> > Best regards,
> > -- 
> > Luca Weiss 
> > 




Re: [PATCH RFC v3 09/35] mm: cma: Introduce cma_remove_mem()

2024-01-29 Thread Anshuman Khandual



On 1/25/24 22:12, Alexandru Elisei wrote:
> Memory is added to CMA with cma_declare_contiguous_nid() and
> cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in
> cma_init_reserved_areas(), where the page allocator can make use of it.

cma_declare_contiguous_nid() reserves memory in memblock and marks it
for subsequent CMA usage, whereas cma_init_reserved_areas() activates
these memory areas through init_cma_reserved_pageblock(). The standard
page allocator only receives this memory via free_reserved_page(), and
only if the page block activation fails.

> 
> If a device manages multiple CMA areas, and there's an error when one of
> the areas is added to CMA, there is no mechanism for the device to prevent

What kind of error ? init_cma_reserved_pageblock() fails ? But that will
not happen until cma_init_reserved_areas().

> the rest of the areas, which were added before the error occurred, from
> being later added to the MIGRATE_CMA list.

Why is this mechanism required? cma_init_reserved_areas() scans over all
CMA areas and tries to activate each of them sequentially. Why is this not
sufficient?

> 
> Add cma_remove_mem() which allows a previously reserved CMA area to be
> removed and thus it cannot be used by the page allocator.

Successfully activated CMA areas do not get used by the buddy allocator.

> 
> Signed-off-by: Alexandru Elisei 
> ---
> 
> Changes since rfc v2:
> 
> * New patch.
> 
>  include/linux/cma.h |  1 +
>  mm/cma.c| 30 +-
>  2 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index e32559da6942..787cbec1702e 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, 
> phys_addr_t size,
>   unsigned int order_per_bit,
>   const char *name,
>   struct cma **res_cma);
> +extern void cma_remove_mem(struct cma **res_cma);
>  extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned 
> int align,
> bool no_warn);
>  extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned 
> long count,
> diff --git a/mm/cma.c b/mm/cma.c
> index 4a0f68b9443b..2881bab12b01 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void)
>  {
>   int i;
>  
> - for (i = 0; i < cma_area_count; i++)
> + for (i = 0; i < cma_area_count; i++) {
> + /* Region was removed. */
> + if (!cma_areas[i].count)
> + continue;

Skip a previously added CMA area that has since been zeroed out (removed)?

>   cma_activate_area(&cma_areas[i]);
> + }
>  
>   return 0;
>  }

cma_init_reserved_areas() gets called via core_initcall(). Somehow the
platform/device needs to call cma_remove_mem() before core_initcall()
gets called? This might be time sensitive.
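For illustration, the window being discussed is roughly the following
(a hedged sketch; tag_storage_usable() and tag_cma are made-up names,
not taken from the series):

/*
 * Has to run before core_initcall() level, where
 * cma_init_reserved_areas() activates the areas - e.g. from
 * setup_arch() or an early_initcall().
 */
static int __init tag_storage_sanity_check(void)
{
        if (!tag_storage_usable())              /* hypothetical check */
                cma_remove_mem(&tag_cma);       /* tag_cma: struct cma * */
        return 0;
}
early_initcall(tag_storage_sanity_check);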

> @@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, 
> phys_addr_t size,
>   return 0;
>  }
>  
> +/**
> + * cma_remove_mem() - remove cma area
> + * @res_cma: Pointer to the cma region.
> + *
> + * This function removes a cma region created with cma_init_reserved_mem(). 
> The
> + * ->count is set to 0.
> + */
> +void __init cma_remove_mem(struct cma **res_cma)
> +{
> + struct cma *cma;
> +
> + if (WARN_ON_ONCE(!res_cma || !(*res_cma)))
> + return;
> +
> + cma = *res_cma;
> + if (WARN_ON_ONCE(!cma->count))
> + return;
> +
> + totalcma_pages -= cma->count;
> + cma->count = 0;
> +
> + *res_cma = NULL;
> +}
> +
>  /**
>   * cma_declare_contiguous_nid() - reserve custom contiguous area
>   * @base: Base address of the reserved area optional, use 0 for any

But first please do explain what errors the device or platform might
see on a previously marked CMA area, such that removing them becomes
necessary to prevent their activation via cma_init_reserved_areas().



Re: [PATCH RFC v3 08/35] mm: cma: Introduce cma_alloc_range()

2024-01-29 Thread Anshuman Khandual



On 1/25/24 22:12, Alexandru Elisei wrote:
> Today, cma_alloc() is used to allocate a contiguous memory region. The
> function allows the caller to specify the number of pages to allocate, but
> not the starting address. cma_alloc() will walk over the entire CMA region
> trying to allocate the first available range of the specified size.
> 
> Introduce cma_alloc_range(), which makes CMA more versatile by allowing the
> caller to specify a particular range in the CMA region, defined by the
> start pfn and the size.
> 
> arm64 will make use of this function when tag storage management will be
> implemented: cma_alloc_range() will be used to reserve the tag storage
> associated with a tagged page.

Basically, you would like to pass on a preferred start address, and the
allocation could just fail if a contiguous range is not available from
that starting address?

Then why not just change cma_alloc() to take a new argument 'start_pfn'.
Why create a new but almost similar allocator ?

But then I am wondering why this could not be done in the arm64 platform
code itself operating on a CMA area reserved just for tag storage. Unless
this new allocator has other usage beyond MTE, this could be implemented
in the platform itself.
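For comparison, the suggestion above would amount to something like the
following (a hedged sketch of a possible signature, not an existing
kernel API; start_pfn == 0 could preserve today's "search the whole
area" behaviour, and the caller fragment uses made-up variable names):

struct page *cma_alloc(struct cma *cma, unsigned long start_pfn,
                       unsigned long count, unsigned int align,
                       bool no_warn);

/* hypothetical tag storage caller */
page = cma_alloc(tag_cma, tag_pfn, nr_tag_pages, 0, false);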

> 
> Signed-off-by: Alexandru Elisei 
> ---
> 
> Changes since rfc v2:
> 
> * New patch.
> 
>  include/linux/cma.h|  2 +
>  include/trace/events/cma.h | 59 ++
>  mm/cma.c   | 86 ++
>  3 files changed, 147 insertions(+)
> 
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index 63873b93deaa..e32559da6942 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, 
> phys_addr_t size,
>   struct cma **res_cma);
>  extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned 
> int align,
> bool no_warn);
> +extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned 
> long count,
> +unsigned tries, gfp_t gfp);
>  extern bool cma_pages_valid(struct cma *cma, const struct page *pages, 
> unsigned long count);
>  extern bool cma_release(struct cma *cma, const struct page *pages, unsigned 
> long count);
>  
> diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h
> index 25103e67737c..a89af313a572 100644
> --- a/include/trace/events/cma.h
> +++ b/include/trace/events/cma.h
> @@ -36,6 +36,65 @@ TRACE_EVENT(cma_release,
> __entry->count)
>  );
>  
> +TRACE_EVENT(cma_alloc_range_start,
> +
> + TP_PROTO(const char *name, unsigned long start, unsigned long count,
> +  unsigned tries),
> +
> + TP_ARGS(name, start, count, tries),
> +
> + TP_STRUCT__entry(
> + __string(name, name)
> + __field(unsigned long, start)
> + __field(unsigned long, count)
> + __field(unsigned, tries)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(name, name);
> + __entry->start = start;
> + __entry->count = count;
> + __entry->tries = tries;
> + ),
> +
> + TP_printk("name=%s start=%lx count=%lu tries=%u",
> +   __get_str(name),
> +   __entry->start,
> +   __entry->count,
> +   __entry->tries)
> +);
> +
> +TRACE_EVENT(cma_alloc_range_finish,
> +
> + TP_PROTO(const char *name, unsigned long start, unsigned long count,
> +  unsigned attempts, int err),
> +
> + TP_ARGS(name, start, count, attempts, err),
> +
> + TP_STRUCT__entry(
> + __string(name, name)
> + __field(unsigned long, start)
> + __field(unsigned long, count)
> + __field(unsigned, attempts)
> + __field(int, err)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(name, name);
> + __entry->start = start;
> + __entry->count = count;
> + __entry->attempts = attempts;
> + __entry->err = err;
> + ),
> +
> + TP_printk("name=%s start=%lx count=%lu attempts=%u err=%d",
> +   __get_str(name),
> +   __entry->start,
> +   __entry->count,
> +   __entry->attempts,
> +   __entry->err)
> +);
> +
>  TRACE_EVENT(cma_alloc_start,
>  
>   TP_PROTO(const char *name, unsigned long count, unsigned int align),
> diff --git a/mm/cma.c b/mm/cma.c
> index 543bb6b3be8e..4a0f68b9443b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -416,6 +416,92 @@ static void cma_debug_show_areas(struct cma *cma)
>  static inline void cma_debug_show_areas(struct cma *cma) { }
>  #endif
>  
> +/**
> + * cma_alloc_range() - allocate pages in a specific range
> + * @cma:   Contiguous memory region for which the allocation is performed.
> + * @start: Starting pfn of the allocation.
> + * @count: Requested number 

Re: [PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages

2024-01-29 Thread Anshuman Khandual



On 1/29/24 17:21, Alexandru Elisei wrote:
> Hi,
> 
> On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
>>> after each cma_alloc() function call. This is done even though cma_alloc()
>>> can allocate an arbitrary number of CMA pages. When looking at
>>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls
>>> doesn't tell much with regards to how many CMA pages were allocated via
>>> cma_alloc() versus via the page allocator (regular allocation request or
>>> PCP lists refill).
>>>
>>> This can also be rather confusing to a user who isn't familiar with the
>>> code, since the unit of measurement for nr_free_cma is the number of pages,
>>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
>>> function calls.
>>>
>>> Let's make this consistent, and arguably more useful, by having
>>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
>>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
>>> allocate.
>>>
>>> For users that wish to track the number of cma_alloc() calls, there are
>>> tracepoints for that already implemented.
>>>
>>> Signed-off-by: Alexandru Elisei 
>>> ---
>>>  mm/cma.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/cma.c b/mm/cma.c
>>> index f49c95f8ee37..dbf7fe8cb1bd 100644
>>> --- a/mm/cma.c
>>> +++ b/mm/cma.c
>>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
>>> count,
>>> pr_debug("%s(): returned %p\n", __func__, page);
>>>  out:
>>> if (page) {
>>> -   count_vm_event(CMA_ALLOC_SUCCESS);
>>> +   count_vm_events(CMA_ALLOC_SUCCESS, count);
>>> cma_sysfs_account_success_pages(cma, count);
>>> } else {
>>> -   count_vm_event(CMA_ALLOC_FAIL);
>>> +   count_vm_events(CMA_ALLOC_FAIL, count);
>>> if (cma)
>>> cma_sysfs_account_fail_pages(cma, count);
>>> }
>>
>> Without getting into the merits of this patch - which is actually trying to 
>> do
>> semantics change to /proc/vmstat, wondering how is this even related to this
>> particular series? If required this could be debated on its own separately.
> 
> Having the number of CMA pages allocated and the number of CMA pages freed
> allows someone to infer how many tagged pages are in use at a given time:

That should not be done in CMA, which is a generic multi-purpose allocator.

> (allocated CMA pages - CMA pages allocated by drivers* - CMA pages
> released) * 32. That is valuable information for software and hardware
> designers.
> 
> Besides that, for every iteration of the series, this has proven invaluable
> for discovering bugs with freeing and/or reserving tag storage pages.

I am afraid that might not be enough justification for getting something
merged into mainline.

> 
> *that would require userspace reading cma_alloc_success and
> cma_release_success before any tagged allocations are performed.

While assuming that no other non-memory-tagged CMA based allocation and free
call happens in the meantime? That would be on really thin ice.

I suppose arm64 tagged-memory-specific allocation and free counters need
to be created on the caller side, including in arch_free_pages_prepare().
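Something along these lines, as a hedged sketch (MTE_TAG_FREE is a
made-up vm_event_item entry, the hook signature is approximated, and
only the counting half of such a hook is shown):

void arch_free_pages_prepare(struct page *page, int order)
{
        /* account tagged pages on the arm64 side, not in generic CMA */
        if (page_mte_tagged(page))
                count_vm_events(MTE_TAG_FREE, 1 << order);
}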



Re: [PATCH RFC v3 04/35] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"

2024-01-29 Thread Anshuman Khandual



On 1/29/24 17:16, Alexandru Elisei wrote:
> Hi,
> 
> On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
>>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
>>> because CMA is always allowed when __GFP_MOVABLE is set.
>>>
>>> With the introduction of the arch_alloc_cma() function, the above is not
>>> true anymore, so bring back the filter.
>>
>> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being
>> assigned to alloc_flags in gfp_to_alloc_flags_cma().
> 
> Can I add your Reviewed-by tag then?

I think all these changes need to be reviewed in their entirety
even though some patches do look good on their own. For example
this patch depends on whether [PATCH 03/35] is acceptable or not.

I would suggest separating out CMA patches which could be debated
and merged regardless of this series.



Re: [PATCH RFC v3 01/35] mm: page_alloc: Add gfp_flags parameter to arch_alloc_page()

2024-01-29 Thread Anshuman Khandual



On 1/29/24 17:11, Alexandru Elisei wrote:
> Hi,
> 
> On Mon, Jan 29, 2024 at 11:18:59AM +0530, Anshuman Khandual wrote:
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> Extend the usefulness of arch_alloc_page() by adding the gfp_flags
>>> parameter.
>> Although the change here is harmless in itself, it will definitely benefit
>> from some additional context explaining the rationale, taking into account
>> why and how arch_alloc_page() got added particularly for the s390 platform and how
>> it's going to be used in the present proposal.
> arm64 will use it to reserve tag storage if the caller requested a tagged
> page. Right now that means that __GFP_ZEROTAGS is set in the gfp mask, but
> I'll rename it to __GFP_TAGGED in patch #18 ("arm64: mte: Rename
> __GFP_ZEROTAGS to __GFP_TAGGED") [1].
> 
> [1] 
> https://lore.kernel.org/lkml/20240125164256.4147-19-alexandru.eli...@arm.com/

Makes sense, but please do update the commit message explaining how the
new gfp mask argument will be used to detect tagged page allocation
requests, which in turn require tag storage allocation.
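For example, the arm64 side later in the series boils down to something
like this (a hedged sketch; reserve_tag_storage() is a placeholder name,
and the exact hook signature/return type in the series may differ):

void arch_alloc_page(struct page *page, int order, gfp_t gfp)
{
        /* the new gfp argument is what flags the request as tagged */
        if (gfp & __GFP_TAGGED)
                reserve_tag_storage(page, order);
}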



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 17:50, Linus Torvalds
 wrote:
>
> So what I propose is that
>
>  - ei->dentry and ei->d_children[] need to die. Really. They are
> buggy. There is no way to save them. There never was.
>
>  - but we *can* introduce a new 'ei->events_dir' pointer that is
> *only* set by eventfs_create_events_dir(), and which is stable exactly
> because that function also does a dget() on it, so now the dentry will
> actually continue to exist reliably
>
> I think that works.

Well, it doesn't. I don't see where the bug is, but since Al is now
aware of the thread, maybe when he wakes up he will tell me where I've
gone wrong.

In the meantime I did do the pending tracefs pull, so the series has
changed a bit, and this is the rebased series on top of my current
public git tree.

It is still broken wrt 'events' directories. You don't even need the
"create, delete, create" sequence that Steven pointed out, just a
plain sequence of

 # cd /sys/kernel/tracing
 # ls events/kprobes/
 # echo 'p:sched schedule' >> kprobe_events

messes up - ie it's enough to just have 'lookup' create a negative
dentry by trying to look up 'events/kprobes/' before actually trying
to create that kprobe_events.

But I've been staring at this code for too long, so I'm posting this
just as a "it's broken, but _something_ like this", because I'm taking
a break from looking at this.

I'll get back to it tomorrow, but I hope that Al will show me the
error of my ways.

  Linus
From 6763ac4af7ccc0c97fb5f7c98d0c8ae1289ec0fe Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Mon, 29 Jan 2024 18:49:42 -0800
Subject: [PATCH 5/5] eventfs: get rid of dentry pointers without refcounts

The eventfs inode had pointers to dentries (and child dentries) without
actually holding a refcount on said pointer.  That is fundamentally
broken, and while eventfs tried to then maintain coherence with dentries
going away by hooking into the '.d_iput' callback, that doesn't actually
work since it's not ordered wrt lookups.

There were two reasons why eventfs tried to keep a pointer to a dentry:

 - the creation of a 'events' directory would actually have a stable
   dentry pointer that it created with tracefs_start_creating().

   And it needed that dentry when tearing it all down again in
   eventfs_remove_events_dir().

   This use is actually ok, because the special top-level events
   directory dentries are actually stable, not just a temporary cache of
   the eventfs data structures.

 - the 'eventfs_inode' (aka ei) needs to stay around as long as there
   are dentries that refer to it.

   It then used these dentry pointers as a replacement for doing
   reference counting: it would try to make sure that there was only
   ever one dentry associated with an event_inode, and keep a child
   dentry array around to see which dentries might still refer to the
   parent ei.

This gets rid of the invalid dentry pointer use, and renames the one
valid case to a different name to make it clear that it's not just any
random dentry.

The magic child dentry array that is kind of a "reverse reference list"
is simply replaced by having child dentries take a ref to the ei.  As
do the directory dentries.  That makes the broken use case go away.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 245 ---
 fs/tracefs/internal.h|   9 +-
 2 files changed, 80 insertions(+), 174 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1d0102bfd7da..a37db0dac302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -62,6 +62,34 @@ enum {
 
 #define EVENTFS_MODE_MASK	(EVENTFS_SAVE_MODE - 1)
 
+/*
+ * eventfs_inode reference count management.
+ *
+ * NOTE! We count only references from dentries, in the
+ * form 'dentry->d_fsdata'. There are also references from
+ * directory inodes ('ti->private'), but the dentry reference
+ * count is always a superset of the inode reference count.
+ */
+static void release_ei(struct kref *ref)
+{
+	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	kfree(ei->entry_attrs);
+	kfree(ei);
+}
+
+static inline void put_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_put(&ei->kref, release_ei);
+}
+
+static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_get(&ei->kref);
+	return ei;
+}
+
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 	  struct dentry *dentry,
 	  unsigned int flags);
@@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *lookup_file(struct dentry *dentry,
+static struct dentry *lookup_file(struct eventfs_inode *parent_ei,
+  struct dentry *dentry,
   umode_t mode,
   struct eventfs_attr *attr,
   void *data,
@@ -302,11 +331,11 @@ static struct dentry *lookup_file(struct dentry 
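The net effect on the lookup side is that every dentry handed out pins
its eventfs_inode and drops the reference when the dentry is freed,
roughly like this (a hedged sketch of the intent, fragments shown out of
context rather than the literal hunks of this patch):

	/* in lookup: the new dentry owns a reference on the ei */
	dentry->d_fsdata = get_ei(ei);

	/* matching drop when the dentry itself is finally released */
	static void eventfs_d_release(struct dentry *dentry)
	{
		put_ei(dentry->d_fsdata);
	}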

Re: [PATCH v3 5/6] LoongArch: KVM: Add physical cpuid map support

2024-01-29 Thread maobibo




On 2024/1/29 9:11 PM, Huacai Chen wrote:

Hi, Bibo,

Without this patch I can also create a SMP VM, so what problem does
this patch want to solve?
With the ipi irqchip, the physical cpuid is used for the dest cpu rather
than the logical cpuid. And if the ipi device is emulated on the qemu
side, find_cpu_by_archid() is used to get the dest vcpu in file
hw/intc/loongarch_ipi.c.


Here, with the hypercall method, ipi is emulated on the kvm kernel side,
so there should be the same physical cpuid searching logic. The function
kvm_get_vcpu_by_cpuid() is used by the pv_ipi backend.
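With the map added by this patch, that lookup is essentially an O(1)
index (a hedged sketch, not necessarily the exact function body in the
patch):

struct kvm_vcpu *kvm_get_vcpu_by_cpuid(struct kvm *kvm, int cpuid)
{
        struct kvm_phyid_map *map = kvm->arch.phyid_map;

        if (cpuid < 0 || cpuid >= KVM_MAX_PHYID)
                return NULL;

        if (map->phys_map[cpuid].enabled)
                return map->phys_map[cpuid].vcpu;

        return NULL;
}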


Regards
Bibo Mao



Huacai

On Mon, Jan 22, 2024 at 6:03 PM Bibo Mao  wrote:


Physical cpuid is used to irq routing for irqchips such as ipi/msi/
extioi interrupt controller. And physical cpuid is stored at CSR
register LOONGARCH_CSR_CPUID, it can not be changed once vcpu is
created. Since different irqchips have different size definition
about physical cpuid, KVM uses the smallest cpuid from extioi, and
the max cpuid size is defines as 256.

Signed-off-by: Bibo Mao 
---
  arch/loongarch/include/asm/kvm_host.h | 26 
  arch/loongarch/include/asm/kvm_vcpu.h |  1 +
  arch/loongarch/kvm/vcpu.c | 93 ++-
  arch/loongarch/kvm/vm.c   | 11 
  4 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 2d62f7b0d377..57399d7cf8b7 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -64,6 +64,30 @@ struct kvm_world_switch {

  #define MAX_PGTABLE_LEVELS 4

+/*
+ * Physical cpu id is used for interrupt routing, there are different
+ * definitions about physical cpuid on different hardwares.
+ *  For LOONGARCH_CSR_CPUID register, max cpuid size is 512
+ *  For IPI HW, max dest CPUID size 1024
+ *  For extioi interrupt controller, max dest CPUID size is 256
+ *  For MSI interrupt controller, max supported CPUID size is 65536
+ *
+ * Currently max CPUID is defined as 256 for KVM hypervisor, in future
+ * it will be expanded to 4096, including 16 packages at most. And every
+ * package supports at most 256 vcpus
+ */
+#define KVM_MAX_PHYID  256
+
+struct kvm_phyid_info {
+   struct kvm_vcpu *vcpu;
+   boolenabled;
+};
+
+struct kvm_phyid_map {
+   int max_phyid;
+   struct kvm_phyid_info phys_map[KVM_MAX_PHYID];
+};
+
  struct kvm_arch {
 /* Guest physical mm */
 kvm_pte_t *pgd;
@@ -71,6 +95,8 @@ struct kvm_arch {
 unsigned long invalid_ptes[MAX_PGTABLE_LEVELS];
 unsigned int  pte_shifts[MAX_PGTABLE_LEVELS];
 unsigned int  root_level;
+   struct mutex  phyid_map_lock;
+   struct kvm_phyid_map  *phyid_map;

 s64 time_offset;
 struct kvm_context __percpu *vmcs;
diff --git a/arch/loongarch/include/asm/kvm_vcpu.h 
b/arch/loongarch/include/asm/kvm_vcpu.h
index e71ceb88f29e..2402129ee955 100644
--- a/arch/loongarch/include/asm/kvm_vcpu.h
+++ b/arch/loongarch/include/asm/kvm_vcpu.h
@@ -81,6 +81,7 @@ void kvm_save_timer(struct kvm_vcpu *vcpu);
  void kvm_restore_timer(struct kvm_vcpu *vcpu);

  int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt 
*irq);
+struct kvm_vcpu *kvm_get_vcpu_by_cpuid(struct kvm *kvm, int cpuid);

  /*
   * Loongarch KVM guest interrupt handling
diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c
index 27701991886d..97ca9c7160e6 100644
--- a/arch/loongarch/kvm/vcpu.c
+++ b/arch/loongarch/kvm/vcpu.c
@@ -274,6 +274,95 @@ static int _kvm_getcsr(struct kvm_vcpu *vcpu, unsigned int 
id, u64 *val)
 return 0;
  }

+static inline int kvm_set_cpuid(struct kvm_vcpu *vcpu, u64 val)
+{
+   int cpuid;
+   struct loongarch_csrs *csr = vcpu->arch.csr;
+   struct kvm_phyid_map  *map;
+
+   if (val >= KVM_MAX_PHYID)
+   return -EINVAL;
+
+   cpuid = kvm_read_sw_gcsr(csr, LOONGARCH_CSR_ESTAT);
+   map = vcpu->kvm->arch.phyid_map;
+   mutex_lock(&vcpu->kvm->arch.phyid_map_lock);
+   if (map->phys_map[cpuid].enabled) {
+   /*
+* Cpuid is already set before
+* Forbid changing different cpuid at runtime
+*/
+   if (cpuid != val) {
+   /*
+* Cpuid 0 is initial value for vcpu, maybe invalid
+* unset value for vcpu
+*/
+   if (cpuid) {
+   mutex_unlock(&vcpu->kvm->arch.phyid_map_lock);
+   return -EINVAL;
+   }
+   } else {
+/* Discard duplicated cpuid set */
+   mutex_unlock(&vcpu->kvm->arch.phyid_map_lock);
+   return 0;
+   }
+   }
+
+   if (map->phys_map[val].enabled) {
+   /*
+* New cpuid is already set with other vcpu
+* Forbid sharing 

Re: [RFC PATCH 1/2] x86/kprobes: Prohibit kprobing on INT and UD

2024-01-29 Thread Jinghao Jia
On 1/29/24 19:44, Masami Hiramatsu (Google) wrote:
> On Sun, 28 Jan 2024 15:25:59 -0600
> Jinghao Jia  wrote:
> 
  /* Check if paddr is at an instruction boundary */
  static int can_probe(unsigned long paddr)
  {
 @@ -294,6 +310,16 @@ static int can_probe(unsigned long paddr)
  #endif
addr += insn.length;
}
 +  __addr = recover_probed_instruction(buf, addr);
 +  if (!__addr)
 +  return 0;
 +
 +  if (insn_decode_kernel(, (void *)__addr) < 0)
 +  return 0;
 +
 +  if (is_exception_insn())
 +  return 0;
 +
>>>
>>> Please don't put this outside of decoding loop. You should put these in
>>> the loop which decodes the instruction from the beginning of the function.
>>> Since the x86 instrcution is variable length, can_probe() needs to check
>>> whether that the address is instruction boundary and decodable.
>>>
>>> Thank you,
>>
>> If my understanding is correct then this is trying to decode the kprobe
>> target instruction, given that it is after the main decoding loop.  Here I
>> hoisted the decoding logic out of the if(IS_ENABLED(CONFIG_CFI_CLANG))
>> block so that we do not need to decode the same instruction twice.  I left
>> the main decoding loop unchanged so it is still decoding the function from
>> the start and should handle instruction boundaries. Are there any caveats
>> that I missed?
> 
> Ah, sorry I misread the patch. You're correct!
> This is a good place to do that.
> 
> But hmm, I think we should add another patch to check the addr == paddr
> soon after the loop so that we will avoid decoding.
> 
> Thank you,
> 

Yes, that makes sense to me. At the same time, I'm also thinking about
changing the return type of can_probe() to bool, since we are just using
int as bool in this context.
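For reference, the check Masami suggests would sit right after the
boundary-walk loop, roughly like this (a hedged sketch against the
current can_probe() structure):

        /* Right after can_probe()'s decode loop, before the exception-insn
         * and CFI checks: bail out early when paddr is not on a boundary.
         */
        if (addr != paddr)
                return 0;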

--Jinghao

>>
>> --Jinghao
>>
>>>
if (IS_ENABLED(CONFIG_CFI_CLANG)) {
/*
 * The compiler generates the following instruction sequence
 @@ -308,13 +334,6 @@ static int can_probe(unsigned long paddr)
 * Also, these movl and addl are used for showing expected
 * type. So those must not be touched.
 */
 -  __addr = recover_probed_instruction(buf, addr);
 -  if (!__addr)
 -  return 0;
 -
 -  if (insn_decode_kernel(, (void *)__addr) < 0)
 -  return 0;
 -
if (insn.opcode.value == 0xBA)
offset = 12;
else if (insn.opcode.value == 0x3)
 -- 
 2.43.0

>>>
>>>
> 
> 




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-29 Thread Jason Wang
On Mon, Jan 29, 2024 at 7:40 PM wangyunjian  wrote:
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Monday, January 29, 2024 11:03 AM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke 
> > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> >
> > On Thu, Jan 25, 2024 at 8:54 PM wangyunjian 
> > wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > > Sent: Thursday, January 25, 2024 12:49 PM
> > > > To: wangyunjian 
> > > > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com;
> > > > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com;
> > > > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > > > 
> > > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > > >
> > > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > > > 
> > > > wrote:
> > > > >
> > > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > > >
> > > > > This patch tries to address this by:
> > > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe 
> > > > > empty.
> > > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > > - add tun_put_user_desc function to copy the Rx data to VM
> > > >
> > > > Code explains themselves, let's explain why you need to do this.
> > > >
> > > > 1) why you want to use peek_len
> > > > 2) for "vq's array", what does it mean?
> > > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I
> > > > guess you meant TX zerocopy instead of RX (as I don't see codes for
> > > > RX?)
> > >
> > > OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX
> > > zerocopy from the view of vhost-net.
> >
> > Ok.
> >
> > >
> > > >
> > > > A big question is how could you handle GSO packets from
> > userspace/guests?
> > >
> > > Now by disabling VM's TSO and csum feature.
> >
> > Btw, how could you do that?
>
> By setting network backend-specific options:
> 
>  mrg_rxbuf='off'/>
> 
> 

This is the mgmt work, but the problem is what happens if GSO is not
disabled in the guest, or is there a way to:

1) forcing the guest GSO to be off
2) a graceful fallback

Thanks

>
> Thanks
>
> >
> > Thanks
> >
>




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-29 Thread Jason Wang
On Mon, Jan 29, 2024 at 7:10 PM wangyunjian  wrote:
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Monday, January 29, 2024 11:05 AM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke 
> > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> >
> > On Sat, Jan 27, 2024 at 5:34 PM wangyunjian 
> > wrote:
> > >
> > > > > -Original Message-
> > > > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > > > Sent: Thursday, January 25, 2024 12:49 PM
> > > > > To: wangyunjian 
> > > > > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com;
> > > > > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com;
> > > > > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > > > > 
> > > > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > > > >
> > > > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > > > 
> > > > > wrote:
> > > > > >
> > > > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > > > >
> > > > > > This patch tries to address this by:
> > > > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe
> > empty.
> > > > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > > > - add tun_put_user_desc function to copy the Rx data to VM
> > > > >
> > > > > Code explains themselves, let's explain why you need to do this.
> > > > >
> > > > > 1) why you want to use peek_len
> > > > > 2) for "vq's array", what does it mean?
> > > > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so
> > > > > I guess you meant TX zerocopy instead of RX (as I don't see codes
> > > > > for
> > > > > RX?)
> > > >
> > > > OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX
> > > > zerocopy from the view of vhost-net.
> > > >
> > > > >
> > > > > A big question is how could you handle GSO packets from
> > userspace/guests?
> > > >
> > > > Now by disabling VM's TSO and csum feature. XDP does not support GSO
> > > > packets.
> > > > However, this feature can be added once XDP supports it in the future.
> > > >
> > > > >
> > > > > >
> > > > > > Signed-off-by: Yunjian Wang 
> > > > > > ---
> > > > > >  drivers/net/tun.c   | 165
> > > > > +++-
> > > > > >  drivers/vhost/net.c |  18 +++--
> > > > > >  2 files changed, 176 insertions(+), 7 deletions(-)
> > >
> > > [...]
> > >
> > > > > >
> > > > > >  static int peek_head_len(struct vhost_net_virtqueue *rvq,
> > > > > > struct sock
> > > > > > *sk)  {
> > > > > > +   struct socket *sock = sk->sk_socket;
> > > > > > struct sk_buff *head;
> > > > > > int len = 0;
> > > > > > unsigned long flags;
> > > > > >
> > > > > > -   if (rvq->rx_ring)
> > > > > > -   return vhost_net_buf_peek(rvq);
> > > > > > +   if (rvq->rx_ring) {
> > > > > > +   len = vhost_net_buf_peek(rvq);
> > > > > > +   if (likely(len))
> > > > > > +   return len;
> > > > > > +   }
> > > > > > +
> > > > > > +   if (sock->ops->peek_len)
> > > > > > +   return sock->ops->peek_len(sock);
> > > > >
> > > > > What prevents you from reusing the ptr_ring here? Then you don't
> > > > > need the above tricks.
> > > >
> > > > Thank you for your suggestion. I will consider how to reuse the 
> > > > ptr_ring.
> > >
> > > If ptr_ring is used to transfer xdp_descs, there is a problem: After
> > > some xdp_descs are obtained through xsk_tx_peek_desc(), the descs may
> > > fail to be added to ptr_ring. However, no API is available to
> > > implement the rollback function.
> >
> > I don't understand, this issue seems to exist in the physical NIC as well?
> >
> > We get more descriptors than the free slots in the NIC ring.
> >
> > How did other NIC solve this issue?
>
> Currently, physical NICs such as i40e, ice, ixgbe, igc, and mlx5 obtain
> available NIC descriptors and then retrieve the same number of xsk
> descriptors for processing.

Any reason we can't do the same? ptr_ring should be much simpler than
NIC ring anyhow.
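For context, the pattern those drivers use is to size the peek by the
free TX slots up front, roughly like this (a hedged sketch modelled on
the i40e/ice zero-copy paths; ring_free_count() stands in for the
driver-specific free-slot helper):

        /* Only peek as many descriptors as we can actually post. */
        budget = min(budget, ring_free_count(ring));
        nb_pkts = xsk_tx_peek_release_desc_batch(xsk_pool, budget);
        /* the nb_pkts descriptors are consumed, so no rollback is needed */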

Thanks

>
> Thanks
>
> >
> > Thanks
> >
> > >
> > > Thanks
> > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > >
> > > > > > spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> > > > > > head = skb_peek(&sk->sk_receive_queue);
> > > > > > --
> > > > > > 2.33.0
> > > > > >
> > 

Re: [PATCH v3 6/6] LoongArch: Add pv ipi support on LoongArch system

2024-01-29 Thread maobibo




On 2024/1/29 9:10 PM, Huacai Chen wrote:

Hi, Bibo,

On Mon, Jan 22, 2024 at 6:03 PM Bibo Mao  wrote:


On LoongArch systems, the ipi hw uses iocsr registers: there is one iocsr
register access on the ipi sender side and two iocsr accesses in the ipi
receiver, which is the ipi interrupt handler. In VM mode, all iocsr
register accesses trap into the hypervisor, so one ipi hw notification
results in three traps.

This patch adds pv ipi support for the VM: the hypercall instruction is
used on the ipi sender side, and the hypervisor injects an SWI into the
VM. In the SWI interrupt handler, only the estat CSR register is written
to clear the irq, and estat CSR register accesses do not trap into the
hypervisor. So with pv ipi supported, the pv ipi sender traps into the
hypervisor once, the pv ipi receiver does not trap at all, and there is
only one trap in total.

Also this patch adds ipi multicast support; the method is similar to
x86. With ipi multicast support, an ipi notification can be sent to at
most 128 vcpus at one time, which greatly reduces traps into the
hypervisor.

Signed-off-by: Bibo Mao 
---
  arch/loongarch/include/asm/hardirq.h   |   1 +
  arch/loongarch/include/asm/kvm_host.h  |   1 +
  arch/loongarch/include/asm/kvm_para.h  | 124 +
  arch/loongarch/include/asm/loongarch.h |   1 +
  arch/loongarch/kernel/irq.c|   2 +-
  arch/loongarch/kernel/paravirt.c   | 113 ++
  arch/loongarch/kernel/smp.c|   2 +-
  arch/loongarch/kvm/exit.c  |  73 ++-
  arch/loongarch/kvm/vcpu.c  |   1 +
  9 files changed, 314 insertions(+), 4 deletions(-)

diff --git a/arch/loongarch/include/asm/hardirq.h 
b/arch/loongarch/include/asm/hardirq.h
index 9f0038e19c7f..8a611843c1f0 100644
--- a/arch/loongarch/include/asm/hardirq.h
+++ b/arch/loongarch/include/asm/hardirq.h
@@ -21,6 +21,7 @@ enum ipi_msg_type {
  typedef struct {
 unsigned int ipi_irqs[NR_IPI];
 unsigned int __softirq_pending;
+   atomic_t messages cacheline_aligned_in_smp;
  } cacheline_aligned irq_cpustat_t;

  DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 57399d7cf8b7..1bf927e2bfac 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -43,6 +43,7 @@ struct kvm_vcpu_stat {
 u64 idle_exits;
 u64 cpucfg_exits;
 u64 signal_exits;
+   u64 hvcl_exits;
  };

  #define KVM_MEM_HUGEPAGE_CAPABLE   (1UL << 0)
diff --git a/arch/loongarch/include/asm/kvm_para.h 
b/arch/loongarch/include/asm/kvm_para.h
index 41200e922a82..a25a84e372b9 100644
--- a/arch/loongarch/include/asm/kvm_para.h
+++ b/arch/loongarch/include/asm/kvm_para.h
@@ -9,6 +9,10 @@
  #define HYPERVISOR_VENDOR_SHIFT8
  #define HYPERCALL_CODE(vendor, code)   ((vendor << HYPERVISOR_VENDOR_SHIFT) + 
code)

+#define KVM_HC_CODE_SERVICE0
+#define KVM_HC_SERVICE HYPERCALL_CODE(HYPERVISOR_KVM, 
KVM_HC_CODE_SERVICE)
+#define  KVM_HC_FUNC_IPI   1
+
  /*
   * LoongArch hypcall return code
   */
@@ -16,6 +20,126 @@
  #define KVM_HC_INVALID_CODE-1UL
  #define KVM_HC_INVALID_PARAMETER   -2UL

+/*
+ * Hypercalls interface for KVM hypervisor
+ *
+ * a0: function identifier
+ * a1-a6: args
+ * Return value will be placed in v0.
+ * Up to 6 arguments are passed in a1, a2, a3, a4, a5, a6.
+ */
+static __always_inline long kvm_hypercall(u64 fid)
+{
+   register long ret asm("v0");
+   register unsigned long fun asm("a0") = fid;
+
+   __asm__ __volatile__(
+   "hvcl "__stringify(KVM_HC_SERVICE)
+   : "=r" (ret)
+   : "r" (fun)
+   : "memory"
+   );
+
+   return ret;
+}
+
+static __always_inline long kvm_hypercall1(u64 fid, unsigned long arg0)
+{
+   register long ret asm("v0");
+   register unsigned long fun asm("a0") = fid;
+   register unsigned long a1  asm("a1") = arg0;
+
+   __asm__ __volatile__(
+   "hvcl "__stringify(KVM_HC_SERVICE)
+   : "=r" (ret)
+   : "r" (fun), "r" (a1)
+   : "memory"
+   );
+
+   return ret;
+}
+
+static __always_inline long kvm_hypercall2(u64 fid,
+   unsigned long arg0, unsigned long arg1)
+{
+   register long ret asm("v0");
+   register unsigned long fun asm("a0") = fid;
+   register unsigned long a1  asm("a1") = arg0;
+   register unsigned long a2  asm("a2") = arg1;
+
+   __asm__ __volatile__(
+   "hvcl "__stringify(KVM_HC_SERVICE)
+   : "=r" (ret)
+   : "r" (fun), "r" (a1), "r" (a2)
+   : "memory"
+   );
+
+   return ret;
+}
+
+static __always_inline long kvm_hypercall3(u64 fid,
+   unsigned long arg0, unsigned long arg1, unsigned long arg2)
+{
+   register long ret asm("v0");
+   
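For illustration, the multicast send path described in the changelog
would use the helpers above roughly like this (a hedged sketch, not the
hunk from this patch; it assumes all targets fit in one 128-cpu window
and reuses KVM_HC_FUNC_IPI and the per-cpu "messages" field quoted
earlier):

static void pv_send_ipi_mask(const struct cpumask *mask, unsigned int action)
{
        unsigned long bitmap[2] = { 0, 0 };
        int cpu, min = -1;

        for_each_cpu(cpu, mask) {
                /* post the action for the receiver side first */
                atomic_or(BIT(action), &per_cpu(irq_stat, cpu).messages);
                if (min == -1)
                        min = cpu;
                __set_bit(cpu - min, bitmap);
        }

        /* one trap for the whole mask: low 64 cpus, high 64 cpus, base */
        kvm_hypercall3(KVM_HC_FUNC_IPI, bitmap[0], bitmap[1], min);
}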

Re: [PATCH v3 1/6] LoongArch/smp: Refine ipi ops on LoongArch platform

2024-01-29 Thread maobibo




On 2024/1/29 8:38 PM, Huacai Chen wrote:

Hi, Bibo,

On Mon, Jan 22, 2024 at 6:03 PM Bibo Mao  wrote:


This patch refines ipi handling on the LoongArch platform; there are
three changes in this patch.
1. Add a generic get_percpu_irq() api, replacing percpu irq functions
such as get_ipi_irq/get_pmc_irq/get_timer_irq with get_percpu_irq().

2. Change the definition of the action parameter of
loongson_send_ipi_single() and loongson_send_ipi_mask(). Code encoding is
used here rather than bitmap encoding for the ipi action: the ipi hw
sender uses the action code, and the ipi receiver will get the action in
bitmap encoding, because the ipi hw converts it into a bitmap in the ipi
message buffer.

3. Add smp_ops on the LoongArch platform so that pv ipi can be used later.

Signed-off-by: Bibo Mao 
---
  arch/loongarch/include/asm/hardirq.h |  4 ++
  arch/loongarch/include/asm/irq.h | 10 -
  arch/loongarch/include/asm/smp.h | 31 +++
  arch/loongarch/kernel/irq.c  | 22 +--
  arch/loongarch/kernel/perf_event.c   | 14 +--
  arch/loongarch/kernel/smp.c  | 58 +++-
  arch/loongarch/kernel/time.c | 12 +-
  7 files changed, 71 insertions(+), 80 deletions(-)

diff --git a/arch/loongarch/include/asm/hardirq.h 
b/arch/loongarch/include/asm/hardirq.h
index 0ef3b18f8980..9f0038e19c7f 100644
--- a/arch/loongarch/include/asm/hardirq.h
+++ b/arch/loongarch/include/asm/hardirq.h
@@ -12,6 +12,10 @@
  extern void ack_bad_irq(unsigned int irq);
  #define ack_bad_irq ack_bad_irq

+enum ipi_msg_type {
+   IPI_RESCHEDULE,
+   IPI_CALL_FUNCTION,
+};
  #define NR_IPI 2

  typedef struct {
diff --git a/arch/loongarch/include/asm/irq.h b/arch/loongarch/include/asm/irq.h
index 218b4da0ea90..00101b6d601e 100644
--- a/arch/loongarch/include/asm/irq.h
+++ b/arch/loongarch/include/asm/irq.h
@@ -117,8 +117,16 @@ extern struct fwnode_handle *liointc_handle;
  extern struct fwnode_handle *pch_lpc_handle;
  extern struct fwnode_handle *pch_pic_handle[MAX_IO_PICS];

-extern irqreturn_t loongson_ipi_interrupt(int irq, void *dev);
+static inline int get_percpu_irq(int vector)
+{
+   struct irq_domain *d;
+
+   d = irq_find_matching_fwnode(cpuintc_handle, DOMAIN_BUS_ANY);
+   if (d)
+   return irq_create_mapping(d, vector);

+   return -EINVAL;
+}
  #include 

  #endif /* _ASM_IRQ_H */
diff --git a/arch/loongarch/include/asm/smp.h b/arch/loongarch/include/asm/smp.h
index f81e5f01d619..330f1cb3741c 100644
--- a/arch/loongarch/include/asm/smp.h
+++ b/arch/loongarch/include/asm/smp.h
@@ -12,6 +12,13 @@
  #include 
  #include 

+struct smp_ops {
+   void (*call_func_ipi)(const struct cpumask *mask, unsigned int action);
+   void (*call_func_single_ipi)(int cpu, unsigned int action);

To keep consistency, it is better to use call_func_ipi_single and
call_func_ipi_mask.
Yes, how about using send_ipi_single/send_ipi_mask here? Both
arch_smp_send_reschedule() and arch_send_call_function_single_ipi()
use smp_ops.
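I.e. the ops struct quoted above would become something like this (a
hedged sketch of the proposed naming only; the members are otherwise
unchanged):

struct smp_ops {
        void (*send_ipi_mask)(const struct cpumask *mask, unsigned int action);
        void (*send_ipi_single)(int cpu, unsigned int action);
        void (*ipi_init)(void);
};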





+   void (*ipi_init)(void);
+};
+
+extern struct smp_ops smp_ops;
  extern int smp_num_siblings;
  extern int num_processors;
  extern int disabled_cpus;
@@ -24,8 +31,6 @@ void loongson_prepare_cpus(unsigned int max_cpus);
  void loongson_boot_secondary(int cpu, struct task_struct *idle);
  void loongson_init_secondary(void);
  void loongson_smp_finish(void);
-void loongson_send_ipi_single(int cpu, unsigned int action);
-void loongson_send_ipi_mask(const struct cpumask *mask, unsigned int action);
  #ifdef CONFIG_HOTPLUG_CPU
  int loongson_cpu_disable(void);
  void loongson_cpu_die(unsigned int cpu);
@@ -59,9 +64,12 @@ extern int __cpu_logical_map[NR_CPUS];

  #define cpu_physical_id(cpu)   cpu_logical_map(cpu)

-#define SMP_BOOT_CPU   0x1
-#define SMP_RESCHEDULE 0x2
-#define SMP_CALL_FUNCTION  0x4
+#define ACTTION_BOOT_CPU   0
+#define ACTTION_RESCHEDULE 1
+#define ACTTION_CALL_FUNCTION  2
+#define SMP_BOOT_CPU   BIT(ACTTION_BOOT_CPU)
+#define SMP_RESCHEDULE BIT(ACTTION_RESCHEDULE)
+#define SMP_CALL_FUNCTION  BIT(ACTTION_CALL_FUNCTION)

  struct secondary_data {
 unsigned long stack;
@@ -71,7 +79,8 @@ extern struct secondary_data cpuboot_data;

  extern asmlinkage void smpboot_entry(void);
  extern asmlinkage void start_secondary(void);
-
+extern void arch_send_call_function_single_ipi(int cpu);
+extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);

Similarly, to keep consistency, it is better to use
arch_send_function_ipi_single and arch_send_function_ipi_mask.
These two functions are used by all architectures and are called from
common code such as send_call_function_single_ipi(). It is the same as
the removed static inline functions below:


 -static inline void arch_send_call_function_single_ipi(int cpu)
 -{
 -   loongson_send_ipi_single(cpu, SMP_CALL_FUNCTION);
 -}
 -
 -static inline void arch_send_call_function_ipi_mask(const struct 
cpumask *mask)

 -{
 -   

Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()

2024-01-29 Thread Dave Chinner
On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote:
> Use dax_is_supported() to validate whether the architecture has
> virtually aliased caches at mount time.
> 
> This is relevant for architectures which require a dynamic check
> to validate whether they have virtually aliased data caches
> (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y).

Where's the rest of this patchset? I have no idea what
dax_is_supported() actually does, how it interacts with
CONFIG_FS_DAX, etc.

If you are changing anything to do with FSDAX, the cc-ing the
-entire- patchset to linux-fsdevel is absolutely necessary so the
entire patchset lands in our inboxes and not just a random patch
from the middle of a bigger change.

> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Chandan Babu R 
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: linux...@kvack.org
> Cc: linux-a...@vger.kernel.org
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: nvd...@lists.linux.dev
> Cc: linux-...@vger.kernel.org
> ---
>  fs/xfs/xfs_super.c | 20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 764304595e8b..b27ecb11db66 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1376,14 +1376,22 @@ xfs_fs_parse_param(
>   case Opt_nodiscard:
>   parsing_mp->m_features &= ~XFS_FEAT_DISCARD;
>   return 0;
> -#ifdef CONFIG_FS_DAX
>   case Opt_dax:
> - xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS);
> - return 0;
> + if (dax_is_supported()) {
> + xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS);
> + return 0;
> + } else {
> + xfs_warn(parsing_mp, "dax option not supported.");
> + return -EINVAL;
> + }
>   case Opt_dax_enum:
> - xfs_mount_set_dax_mode(parsing_mp, result.uint_32);
> - return 0;
> -#endif
> + if (dax_is_supported()) {
> + xfs_mount_set_dax_mode(parsing_mp, result.uint_32);
> + return 0;
> + } else {
> + xfs_warn(parsing_mp, "dax option not supported.");
> + return -EINVAL;
> + }

Assuming that I understand what dax_is_supported() is doing, this
change isn't right.  We're just setting the DAX configuration flags
from the mount options here, we don't validate them until 
we've parsed all options and eliminated conflicts and rejected
conflicting options. We validate whether the options are
appropriate for the underlying hardware configuration later in the
mount process.

dax=always suitability is check in xfs_setup_dax_always() called
later in the mount process when we have enough context and support
to open storage devices and check them for DAX support. If the
hardware does not support DAX then we simply we turn off DAX
support, we do not reject the mount as this change does.

dax=inode and dax=never are valid options on all configurations,
even those without FSDAX support or with hardware that is not
capable of using DAX. dax=inode only affects how an inode is
instantiated in cache - if the inode has a flag that says "use DAX"
and dax is supportable by the hardware, then we turn on DAX for
that inode. Otherwise we just use the normal non-dax IO paths.

Again, we don't error out the filesystem if DAX is not supported,
we just don't turn it on. This check is done in
xfs_inode_should_enable_dax() and I think all you need to do is
replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported()
call...
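i.e. something like this (a hedged sketch against the current
xfs_inode_should_enable_dax() helper; only the first check changes,
and the remaining checks are reproduced from memory):

static bool
xfs_inode_should_enable_dax(
	struct xfs_inode *ip)
{
	if (!dax_is_supported())	/* was IS_ENABLED(CONFIG_FS_DAX) */
		return false;
	if (xfs_has_dax_never(ip->i_mount))
		return false;
	if (!xfs_inode_supports_dax(ip))
		return false;
	if (xfs_has_dax_always(ip->i_mount))
		return true;
	if (ip->i_diflags2 & XFS_DIFLAG2_DAX)
		return true;
	return false;
}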

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 09:29:07 -0800
Beau Belgrave  wrote:

> Thanks, yeah ideally we wouldn't use special characters.
> 
> I'm not picky about this. However, I did want something that clearly
> allowed a glob pattern to find all versions of a given register name of
> user_events by user programs that record. The dot notation will pull in
> more than expected if dotted namespace style names are used.
> 
> An example is "Asserts" and "Asserts.Verbose" from different programs.
> If we tried to find all versions of "Asserts" via glob of "Asserts.*" it
> will pull in "Asserts.Verbose.1" in addition to "Asserts.0".

Do you prevent brackets in names?

> 
> While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it
> doesn't work if the number is higher, like 128. If we ever decide to
> change the ID from an integer to say hex to save space, these globs
> would break.
> 
> Is there some scheme that fits the C-variable name that addresses the
> above scenarios? Brackets gave me a simple glob that seemed to prevent a
> lot of this ("Asserts.\[*\]" in this case).

Prevent a lot of what? I'm not sure what your example here is.

> 
> Are we confident that we always want to represent the ID as a base-10
> integer vs a base-16 integer? The suffix will be ABI to ensure recording
> programs can find their events easily.

Is there a difference to what we choose?

-- Steve



[PATCH v8 14/15] Docs/x86/sgx: Add description for cgroup support

2024-01-29 Thread Haitao Huang
From: Sean Christopherson 

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Cc: Sean Christopherson 
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call.  It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. 
must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running.  The EPC is managed by the SGX subsystem 
and
+is not accounted by the memory controller.  Note that this is true only for EPC
+memory itself, i.e.  normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing 
store
+(normal system memory allocated via shmem).  The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+---
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+  misc.capacity
+A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+resource will show the total amount of EPC memory available on the
+platform.
+
+  misc.current
+A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+resource will show the current active EPC memory usage of the cgroup 
and
+its descendants. EPC pages that are swapped out to backing RAM are not
+included in the current count.
+
+  misc.max
+A read-write single value file which exists on non-root cgroups. The
+sgx_epc resource will show the EPC usage hard limit. The default is
+"max".
+
+If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+page fault handling, will be blocked until EPC can be reclaimed from 
the
+cgroup. If there are no pages left that are reclaimable within the same
+group, the kernel returns ENOMEM.
+
+The EPC pages allocated for a guest VM by the virtual EPC driver are 
not
+reclaimable by the host kernel. In case the guest cgroup's limit is
+reached and no reclaimable pages left in the same cgroup, the virtual
+EPC driver returns SIGBUS to the user space process to indicate failure
+on new EPC allocation requests.
+
+The misc.max limit is non-preemptive. If a user writes a limit lower
+than the current usage to this file, the cgroup will not preemptively
+deallocate pages currently in use, and will only start blocking the 
next
+allocation and reclaiming EPC at that time.
+
+  misc.events
+A read-only flat-keyed file which exists on non-root cgroups.
+A value change in this file generates a file modified event.
+
+  max
+The number of times 

[PATCH v8 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-01-29 Thread Haitao Huang
The scripts rely on cgroup-tools package from libcgroup [1].

To run selftests for epc cgroup:

sudo ./run_epc_cg_selftests.sh

To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:

./watch_misc_for_tests.sh current

With different cgroups, the script starts one or multiple concurrent SGX
selftests, each to run one unclobbered_vdso_oversubscribed test.  Each
of such test tries to load an enclave of EPC size equal to the EPC
capacity available on the platform. The script checks results against
the expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) LARGER - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all process exit.

The script also includes a test with low mem_cg limit and LARGE sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg.

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang 
---
V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
 .../selftests/sgx/run_epc_cg_selftests.sh | 246 ++
 .../selftests/sgx/watch_misc_for_tests.sh |  13 +
 2 files changed, 259 insertions(+)
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh 
b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index ..e027bf39f005
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,246 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG
+if [ $? -ne 0 ]; then
+echo "# Please make sure cgroup-tools is installed, and misc cgroup is 
mounted."
+exit 1
+fi
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+# We will only set limit in test1 and run tests in test3
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1
+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_MISC_ROOT=/sys/fs/cgroup
+CG_MEM_ROOT=/sys/fs/cgroup
+CG_V1=0
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+echo "# cgroup V2 is in use."
+else
+echo "# cgroup V1 is in use."
+CG_MISC_ROOT=/sys/fs/cgroup/misc
+CG_MEM_ROOT=/sys/fs/cgroup/memory
+CG_V1=1
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
+# This is below the number of VA pages needed for an enclave of capacity size,
+# so the oversubscribed cases should fail.
+SMALL=$(( CAPACITY / 512 ))
+
+# At least load one enclave of capacity size successfully, maybe up to 4.
+# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
+echo "sgx_epc $LARGE" >  $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
+echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+wait_check_process_status() {
+local pid=$1
+local check_for_success=$2  # If 1, check for success;
+# If 0, check for failure
+wait "$pid"
+local status=$?
+
+if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
+echo "# Process $pid succeeded."
+return 0
+elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
+echo "# Process $pid returned failure."
+return 0
+fi
+return 1
+}
+
+wait_and_detect_for_any() {
+local pids=("$@")
+local check_for_success=$1  # If 1, check for success;
+# If 0, check for failure
+local detected=1 # 0 for success detection
+
+for pid in "${pids[@]:1}"; do
+if wait_check_process_status "$pid" "$check_for_success"; then
+detected=0
+# Wait for other processes to exit
+fi
+done
+
+return $detected
+}
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting 
failure..."
+# Always use leaf node of misc 

[PATCH v8 13/15] x86/sgx: Turn on per-cgroup EPC reclamation

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns hard
coded reference to the global LRU.

Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.

With this change all EPC pages are tracked in per-cgroup LRUs and the
global reclaimer (ksgxd) can no longer reclaim any pages from the global
LRU. However, when EPC is over-committed, i.e., the sum of cgroup limits
exceeds the total capacity, individual cgroups may never reach their
limits and thus never reclaim, while total usage can still approach the
capacity. Global reclamation is therefore still needed in those cases,
and it should reclaim from the root cgroup.

Modify sgx_reclaim_pages_global() to reclaim from the root EPC cgroup
when the EPC cgroup is enabled, otherwise from the global LRU.

Similarly, modify sgx_can_reclaim() to check emptiness of the LRUs of all
cgroups when the EPC cgroup is enabled, otherwise only check the global LRU.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 6b0c26cac621..d4265a390ba9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
 
 static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page 
*epc_page)
 {
+#ifdef CONFIG_CGROUP_SGX_EPC
+   if (epc_page->epc_cg)
+   return &epc_page->epc_cg->lru;
+
+   /* This should not happen if kernel is configured correctly */
+   WARN_ON_ONCE(1);
+#endif
 return &sgx_global_lru;
 }
 
 static inline bool sgx_can_reclaim(void)
 {
+#ifdef CONFIG_CGROUP_SGX_EPC
+   return !sgx_epc_cgroup_lru_empty(misc_cg_root());
+#else
 return !list_empty(&sgx_global_lru.reclaimable);
+#endif
 }
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -410,7 +421,10 @@ static void sgx_reclaim_pages_global(bool indirect)
 {
unsigned int nr_to_scan = SGX_NR_TO_SCAN;
 
-   sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
+   if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+   sgx_epc_cgroup_reclaim_pages(misc_cg_root(), indirect);
+   else
+   sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
 }
 
 /*
-- 
2.25.1




[PATCH v8 12/15] x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

When cgroup is enabled, all reclaimable pages will be tracked in cgroup
LRUs. The global reclaimer needs to start reclamation from the root
cgroup. Expose the top level cgroup reclamation function so the global
reclaimer can reuse it.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V8:
- Remove unneeded breaks in function declarations. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 127f515ffccf..e08425b1faa5 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -88,7 +88,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
  * @indirect:   In ksgxd or EPC cgroup work queue context.
  * Return: Number of pages reclaimed.
  */
-static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool 
indirect)
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
 {
/*
 * Attempting to reclaim only a few pages will often fail and is
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index d061cd807b45..5b3e8e1b8630 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct 
sgx_epc_cgroup *epc_cg, bool
 static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
 
 static inline void sgx_epc_cgroup_init(void) { }
+
+static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, 
bool indirect)
+{
+   return 0;
+}
 #else
 struct sgx_epc_cgroup {
struct misc_cg *cg;
@@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup 
*epc_cg)
 int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
 void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
 bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect);
+
 void sgx_epc_cgroup_init(void);
 
 #endif
-- 
2.25.1




[PATCH v8 11/15] x86/sgx: Abstract check for global reclaimable pages

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

To determine if any page is available for reclamation at the global
level, checking only for emptiness of the global LRU is not adequate when
pages are tracked in multiple LRUs, one per cgroup. For this purpose,
create a new helper, sgx_can_reclaim(), which currently only checks the
global LRU and will later check emptiness of the LRUs of all cgroups when
per-cgroup tracking is turned on. Replace all the checks of the global LRU,
list_empty(_global_lru.reclaimable), with calls to
sgx_can_reclaim().

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
v7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 2279ae967707..6b0c26cac621 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct 
sgx_epc_page *epc_pag
 return &sgx_global_lru;
 }
 
+static inline bool sgx_can_reclaim(void)
+{
+   return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -398,7 +403,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list 
*lru, unsigned int *nr_to
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_global_lru.reclaimable);
+   sgx_can_reclaim();
 }
 
 static void sgx_reclaim_pages_global(bool indirect)
@@ -601,7 +606,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool 
reclaim)
break;
}
 
-   if (list_empty(&sgx_global_lru.reclaimable)) {
+   if (!sgx_can_reclaim()) {
page = ERR_PTR(-ENOMEM);
break;
}
-- 
2.25.1




[PATCH v8 09/15] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-01-29 Thread Haitao Huang
Enclave Page Cache (EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(). But it only considers two contexts in which
the swapping can be done: normal tasks and the ksgxd kthread.
With the new EPC cgroup implementation, the swapping can also happen in
EPC cgroup work-queue threads. In those cases, it improperly selects the
root mem_cgroup to charge for the RAM usage.

Change sgx_encl_get_mem_cgroup() to handle non-task contexts only and
return the mem_cgroup of an mm_struct associated with the enclave. The
return is used to charge for EPC backing pages in all kthread cases.
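
The mm_list search itself is not visible in the hunks quoted below, so
here is a minimal sketch of the pattern described above, based on the
pre-existing encl.c code (the function name is illustrative only):

static struct mem_cgroup *example_encl_mem_cgroup(struct sgx_encl *encl)
{
	struct mem_cgroup *memcg = NULL;
	struct sgx_encl_mm *encl_mm;
	int idx;

	/* Pick a memcg from any live mm on the enclave's mm_list. */
	idx = srcu_read_lock(&encl->srcu);
	list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
		if (!mmget_not_zero(encl_mm->mm))
			continue;
		memcg = get_mem_cgroup_from_mm(encl_mm->mm);
		mmput_async(encl_mm->mm);
		break;
	}
	srcu_read_unlock(&encl->srcu, idx);

	/* No live mm found: fall back to a default memcg. */
	return memcg ? memcg : get_mem_cgroup_from_mm(NULL);
}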

Pass a flag into the top level reclamation function,
sgx_reclaim_pages(), to explicitly indicate whether it is called from a
background kthread. Internally, if the flag is true, switch the active
mem_cgroup to the one returned from sgx_encl_get_mem_cgroup(), prior to
any backing page allocation, in order to ensure that shmem page
allocations are charged to the enclave's cgroup.

Removed current_is_ksgxd() as it is no longer needed.

Signed-off-by: Haitao Huang 
Reported-by: Mikko Ylinen 
---
V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c   | 43 ++--
 arch/x86/kernel/cpu/sgx/encl.h   |  3 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  7 +++--
 arch/x86/kernel/cpu/sgx/main.c   | 27 -
 arch/x86/kernel/cpu/sgx/sgx.h|  3 +-
 5 files changed, 40 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..348e8b58abeb 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,9 +993,7 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, 
unsigned long page_inde
 }
 
 /*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Returns the mem_cgroup of a struct mm stored in the enclave's mm_list.
  */
 static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
 {
@@ -1003,14 +1001,6 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct 
sgx_encl *encl)
struct sgx_encl_mm *encl_mm;
int idx;
 
-   /*
-* If called from normal task context, return the mem_cgroup
-* of the current task's mm. The remainder of the handling is for
-* ksgxd.
-*/
-   if (!current_is_ksgxd())
-   return get_mem_cgroup_from_mm(current->mm);
-
/*
 * Search the enclave's mm_list to find an mm associated with
 * this enclave to charge the allocation to.
@@ -1047,29 +1037,38 @@ static struct mem_cgroup 
*sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
  * @encl:  an enclave pointer
  * @page_index:enclave page index
  * @backing:   data for accessing backing storage for the page
+ * @indirect:  in ksgxd or EPC cgroup work queue context
+ *
+ * Create a backing page for loading data back into an EPC page with ELDU. This
+ * function takes a reference on a new backing page which must be dropped with 
a
+ * corresponding call to sgx_encl_put_backing().
  *
- * When called from ksgxd, sets the active memcg from one of the
- * mms in the enclave's mm_list prior to any backing page allocation,
- * in order to ensure that shmem page allocations are charged to the
- * enclave.  Create a backing page for loading data back into an EPC page with
- * ELDU.  This function takes a reference on a new backing page which
- * must be dropped with a corresponding call to sgx_encl_put_backing().
+ * When @indirect is true, sets the active memcg from one of the mms in the
+ * enclave's mm_list prior to any backing page allocation, in order to ensure
+ * that shmem page allocations are charged to the enclave.
  *
  * Return:
  *   0 on success,
  *   -errno otherwise.
  */
 int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
-  struct sgx_backing *backing)
+  struct sgx_backing *backing, bool indirect)
 {
-   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
-   struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
+   struct mem_cgroup *encl_memcg;
+   struct mem_cgroup *memcg;
int ret;
 
+   if (indirect) {
+   encl_memcg = sgx_encl_get_mem_cgroup(encl);
+   memcg = set_active_memcg(encl_memcg);
+   }
+
ret = __sgx_encl_get_backing(encl, page_index, backing);
 
-   set_active_memcg(memcg);
-   mem_cgroup_put(encl_memcg);
+   if (indirect) {
+   set_active_memcg(memcg);
+   mem_cgroup_put(encl_memcg);
+   }
 
return ret;
 }
diff --git a/arch/x86/kernel/cpu/sgx/encl.h 

[PATCH v8 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

When the EPC usage of a cgroup is near its limit, the cgroup needs to
reclaim pages used in the same cgroup to make room for new allocations.
This is analogous to the behavior that the global reclaimer is triggered
when the global usage is close to total available EPC.

Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
whether synchronous reclaim is allowed or not. And trigger the
synchronous/asynchronous reclamation flow accordingly.

Note at this point, all reclaimable EPC pages are still tracked in the
global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
is activated yet.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 --
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  4 ++--
 arch/x86/kernel/cpu/sgx/main.c   |  2 +-
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index cbcb7b0de3fe..127f515ffccf 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -176,13 +176,35 @@ static void sgx_epc_cgroup_reclaim_work_func(struct 
work_struct *work)
 /**
  * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
  * @epc_cg:The EPC cgroup to be charged for the page.
+ * @reclaim:   Whether or not synchronous reclaim is allowed
  * Return:
  * * %0 - If successfully charged.
  * * -errno - for failures.
  */
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
 {
-   return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+   for (;;) {
+   if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+   PAGE_SIZE))
+   break;
+
+   if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+   return -ENOMEM;
+
+   if (signal_pending(current))
+   return -ERESTARTSYS;
+
+   if (!reclaim) {
+   queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+   return -EBUSY;
+   }
+
+   if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+   /* All pages were too young to reclaim, try again a 
little later */
+   schedule();
+   }
+
+   return 0;
 }
 
 /**
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index e3c6a08f0ee8..d061cd807b45 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -23,7 +23,7 @@ static inline struct sgx_epc_cgroup 
*sgx_get_current_epc_cg(void)
 
 static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { }
 
-static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, 
bool reclaim)
 {
return 0;
 }
@@ -66,7 +66,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup 
*epc_cg)
put_misc_cg(epc_cg->cg);
 }
 
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
 void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
 bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
 void sgx_epc_cgroup_init(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 51904f191b97..2279ae967707 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -588,7 +588,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool 
reclaim)
int ret;
 
epc_cg = sgx_get_current_epc_cg();
-   ret = sgx_epc_cgroup_try_charge(epc_cg);
+   ret = sgx_epc_cgroup_try_charge(epc_cg, reclaim);
if (ret) {
sgx_put_epc_cg(epc_cg);
return ERR_PTR(ret);
-- 
2.25.1




[PATCH v8 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

Implement the reclamation flow for cgroup, encapsulated in the top-level
function sgx_epc_cgroup_reclaim_pages(). It does a pre-order walk on its
subtree and calls sgx_reclaim_pages() at each node, passing in
the LRU of that node. It keeps track of total reclaimed pages and pages
left to attempt, and stops the walk once the desired number of pages has
been attempted.

In some contexts, e.g. page fault handling, only asynchronous
reclamation is allowed. Create a work-queue, corresponding work item and
function definitions to support the asynchronous reclamation. Both
synchronous and asynchronous flows invoke the same top level reclaim
function, and will be triggered later by sgx_epc_cgroup_try_charge()
when usage of the cgroup is at or near its limit.
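
The work item side of this is easy to lose in the quoted hunks, so here
is a hedged outline of the asynchronous path, using the helpers
introduced in this patch (sgx_epc_cgroup_max_pages_to_root(),
sgx_epc_cgroup_page_counter_read(), sgx_epc_cgroup_lru_empty(),
sgx_epc_cgroup_reclaim_pages()); the function name below is illustrative,
not the literal hunk:

static void example_epc_cg_reclaim_work_func(struct work_struct *work)
{
	struct sgx_epc_cgroup *epc_cg =
		container_of(work, struct sgx_epc_cgroup, reclaim_work);

	for (;;) {
		u64 max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
		u64 cur = sgx_epc_cgroup_page_counter_read(epc_cg);

		/* Done once usage is back under the lowest limit up to the root. */
		if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
			break;

		/* Walk the subtree; stop if nothing could be reclaimed right now. */
		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg))
			break;
	}
}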

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 174 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   3 +
 2 files changed, 176 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index eac8548164de..8858a0850f8a 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -7,9 +7,173 @@
 
 static struct sgx_epc_cgroup epc_cg_root;
 
+static struct workqueue_struct *sgx_epc_cg_wq;
+
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup 
*epc_cg)
+{
+   return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+   return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors.  Used in
+ * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is
+ * over its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup 
*epc_cg)
+{
+   struct misc_cg *i = epc_cg->cg;
+   u64 m = U64_MAX;
+
+   while (i) {
+   m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+   i = misc_cg_parent(i);
+   }
+
+   return m / PAGE_SIZE;
+}
+
 /**
- * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root:  Root of the tree to check
  *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
+{
+   struct cgroup_subsys_state *css_root;
+   struct cgroup_subsys_state *pos;
+   struct sgx_epc_cgroup *epc_cg;
+   bool ret = true;
+
+   /*
+* Caller ensure css_root ref acquired
+*/
+   css_root = &root->css;
+
+   rcu_read_lock();
+   css_for_each_descendant_pre(pos, css_root) {
+   if (!css_tryget(pos))
+   break;
+
+   rcu_read_unlock();
+
+   epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+   spin_lock(&epc_cg->lru.lock);
+   ret = list_empty(&epc_cg->lru.reclaimable);
+   spin_unlock(&epc_cg->lru.lock);
+
+   rcu_read_lock();
+   css_put(pos);
+   if (!ret)
+   break;
+   }
+
+   rcu_read_unlock();
+
+   return ret;
+}
+
+/**
+ * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to 
reclaim pages
+ * @root:  Root of the tree to start walking
+ * Return: Number of pages reclaimed.
+ */
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
+{
+   /*
+* Attempting to reclaim only a few pages will often fail and is
+* inefficient, while reclaiming a huge number of pages can result in
+* soft lockups due to holding various locks for an extended duration.
+*/
+   unsigned int nr_to_scan = SGX_NR_TO_SCAN;
+   struct cgroup_subsys_state *css_root;
+   struct cgroup_subsys_state *pos;
+   struct sgx_epc_cgroup *epc_cg;
+   unsigned int cnt;
+
+/* Caller ensure css_root ref acquired */
+   css_root = &root->css;
+
+   cnt = 0;
+   rcu_read_lock();
+   css_for_each_descendant_pre(pos, css_root) {
+   if (!css_tryget(pos))
+   break;
+   rcu_read_unlock();
+
+   epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+   cnt += 

[PATCH v8 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup

2024-01-29 Thread Haitao Huang
From: Sean Christopherson 

Each EPC cgroup will have an LRU structure to track reclaimable EPC pages.
When a cgroup usage reaches its limit, the cgroup needs to reclaim pages
from its LRU or LRUs of its descendants to make room for any new
allocations.

To prepare for reclamation per cgroup, expose the top level reclamation
function, sgx_reclaim_pages(), in header file for reuse. Add a parameter
to the function to pass in an LRU so cgroups can pass in different
tracking LRUs later.  Add another parameter for passing in the number of
pages to scan and make the function return the number of pages reclaimed
as a cgroup reclaimer may need to track reclamation progress from its
descendants, change number of pages to scan in subsequent calls.

Create a wrapper for the global reclaimer, sgx_reclaim_pages_global(),
to just call this function with the global LRU passed in. When
per-cgroup LRU is added later, the wrapper will perform global
reclamation from the root cgroup.
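
The tail of the quoted hunk is cut off before the wrapper itself appears,
so for readability this is roughly what the wrapper described above
amounts to, given the new signature (a sketch, not the literal hunk):

static void sgx_reclaim_pages_global(void)
{
	unsigned int nr_to_scan = SGX_NR_TO_SCAN;

	sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan);
}

When per-cgroup LRUs are switched on later in the series, only this
wrapper needs to change, not its callers.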

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V8:
- Use width of 80 characters in text paragraphs. (Jarkko)

V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
---
 arch/x86/kernel/cpu/sgx/main.c | 53 +++---
 arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
 2 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index a131aa985c95..4f5824c4751d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -286,11 +286,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page 
*epc_page,
mutex_unlock(>lock);
 }
 
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which 
have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU
+ *
+ * Take a fixed number of pages from the head of a given LRU and reclaim them 
to
+ * the enclave's private shmem files. Skip the pages, which have been accessed
+ * since the last scan. Move those pages to the tail of the list so that the
+ * pages get scanned in LRU like fashion.
  *
  * Batch process a chunk of pages (at the moment 16) in order to degrade amount
  * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a 
bit
@@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page 
*epc_page,
  * + EWB) but not sufficiently. Reclaiming one page at a time would also be
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
+ *
+ * @lru:   The LRU from which pages are reclaimed.
+ * @nr_to_scan: Pointer to the target number of pages to scan, must be less 
than
+ * SGX_NR_TO_SCAN.
+ * Return: Number of pages reclaimed.
  */
-static void sgx_reclaim_pages(void)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int 
*nr_to_scan)
 {
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-   spin_lock(&sgx_global_lru.lock);
-   for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-   epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-   struct sgx_epc_page, list);
+   spin_lock(&lru->lock);
+
+   for (; *nr_to_scan > 0; --(*nr_to_scan)) {
+   epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
if (!epc_page)
break;
 
@@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void)
 */
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_global_lru.lock);
+
+   spin_unlock(&lru->lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void)
continue;
 
 skip:
-   spin_lock(&sgx_global_lru.lock);
-   list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
-   spin_unlock(&sgx_global_lru.lock);
+   spin_lock(&lru->lock);
+   list_add_tail(&epc_page->list, &lru->reclaimable);
+   spin_unlock(&lru->lock);
 
 kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void)
sgx_reclaimer_block(epc_page);
}
 
+   ret = 0;
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
if 

[PATCH v8 05/15] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2024-01-29 Thread Haitao Huang
From: Sean Christopherson 

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.

Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.
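
The sgx.h hunk that adds the structure is not part of the quote below, so
here is a sketch of what the commit message describes, matching how the
fields and the init helper are used in main.c (treat it as illustrative):

struct sgx_epc_lru_list {
	spinlock_t lock;
	struct list_head reclaimable;
};

static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
{
	spin_lock_init(&lru->lock);
	INIT_LIST_HEAD(&lru->reclaimable);
}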

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
---
V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +-
 arch/x86/kernel/cpu/sgx/sgx.h  | 15 +
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c32f18b70c73..912959c7ecc9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-   if (list_empty(&sgx_active_page_list))
+   epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+   struct sgx_epc_page, list);
+   if (!epc_page)
break;
 
-   epc_page = list_first_entry(&sgx_active_page_list,
-   struct sgx_epc_page, list);
 list_del_init(&epc_page->list);
encl_page = epc_page->owner;
 
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
 */
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_unlock(&sgx_global_lru.lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
 
 skip:
-   spin_lock(&sgx_reclaimer_lock);
-   list_add_tail(&epc_page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
+   list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);
 
 kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
 return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_active_page_list);
+  !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
ksgxd_tsk = tsk;
 
+   sgx_lru_init(&sgx_global_lru);
+
return true;
 }
 
@@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
 page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
 if (list_empty(&page->list)) {
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
 
 list_del(&page->list);
page->flags &= 

[PATCH v8 06/15] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
allocated page into the global LRU list while
sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
global LRU references in these functions to make them reusable when
pages are tracked in per-cgroup LRUs.

Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
EPC page. It simply returns the global LRU now, and will later return
the LRU of the cgroup within which the EPC page was allocated. Replace
the hard coded global LRU with a call to this helper.

The next patches will first get the cgroup reclamation flow ready while
pages remain tracked in the global LRU and reclaimed by ksgxd; only at
the end of the series does sgx_lru_list() switch to returning the
per-cgroup LRU.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 912959c7ecc9..a131aa985c95 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page 
*epc_page)
+{
+   return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a 
LRU.
  * @page:  EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+   spin_lock(&lru->lock);
 page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-   spin_unlock(&sgx_global_lru.lock);
+   list_add_tail(&page->list, &lru->reclaimable);
+   spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:  EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+   spin_lock(&lru->lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
 if (list_empty(&page->list)) {
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
return -EBUSY;
}
 
 list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
 
return 0;
 }
-- 
2.25.1




[PATCH v8 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments.  For
example, in a Kubernetes environment, a user can request a certain EPC
quota for a pod, but without an EPC cgroup controller the orchestrator
cannot enforce that quota to limit the pod's runtime EPC usage.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables user to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.
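
The main.c hunk with those allocation-path hooks is not included in the
quote here, so below is a condensed outline (not the literal hunk) of
where the charge and uncharge described above sit, using the helpers
named in this patch (sgx_get_current_epc_cg(), sgx_epc_cgroup_try_charge(),
sgx_epc_cgroup_uncharge(), sgx_put_epc_cg()):

struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
	struct sgx_epc_cgroup *epc_cg = sgx_get_current_epc_cg();
	struct sgx_epc_page *page;
	int ret;

	ret = sgx_epc_cgroup_try_charge(epc_cg);
	if (ret) {
		sgx_put_epc_cg(epc_cg);
		return ERR_PTR(ret);
	}

	/* ... the pre-existing allocate-or-reclaim loop runs here ... */
	page = __sgx_alloc_epc_page();

	if (IS_ERR(page)) {
		sgx_epc_cgroup_uncharge(epc_cg);
		sgx_put_epc_cg(epc_cg);
	} else {
		page->owner = owner;
		page->epc_cg = epc_cg;	/* uncharged again when the page is freed */
	}

	return page;
}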

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/Kconfig | 13 +
 arch/x86/kernel/cpu/sgx/Makefile |  1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 73 
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 
 arch/x86/kernel/cpu/sgx/main.c   | 52 +++-
 arch/x86/kernel/cpu/sgx/sgx.h|  5 ++
 include/linux/misc_cgroup.h  |  2 +
 7 files changed, 217 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5edec175b9bf..10c3d1d099b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1947,6 +1947,19 @@ config X86_SGX
 
  If unsure, say N.
 
+config CGROUP_SGX_EPC
+   bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for 
Intel SGX"
+   depends on X86_SGX && CGROUP_MISC
+   help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
 config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
 obj-$(CONFIG_X86_SGX_KVM)  += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)  += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index ..eac8548164de
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include 
+#include 
+#include "epc_cgroup.h"
+
+static struct sgx_epc_cgroup epc_cg_root;
+
+/**
+ * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @epc_cg:The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+{
+   return 

[PATCH v8 02/15] cgroup/misc: Export APIs for SGX driver

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its own or an ancestor's limit. This requires a walk from the current
cgroup up to the root similar to misc_cg_try_charge(). Export
misc_cg_parent() to enable this walk.

The SGX driver may also need to start a global-level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.
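
As an example of the walk these exports enable (the EPC cgroup code adds
a helper along these lines later in the series; MISC_CG_RES_SGX_EPC is
the resource type introduced in the next patch, and the function name
here is illustrative):

static u64 example_epc_max_to_root(struct misc_cg *cg)
{
	u64 max = U64_MAX;

	/* Lowest "sgx_epc" limit on the path from cg up to the root. */
	while (cg) {
		max = min(max, READ_ONCE(cg->res[MISC_CG_RES_SGX_EPC].max));
		cg = misc_cg_parent(cg);
	}

	return max;
}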

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
 include/linux/misc_cgroup.h | 24 
 kernel/cgroup/misc.c| 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 0806d4436208..541a5611c597 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -64,6 +64,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -84,6 +85,20 @@ static inline struct misc_cg *css_misc(struct 
cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -108,6 +123,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+   return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+   return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 14ab13ef3bc7..1f0d8e05b36c 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+   return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -183,7 +178,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!amount)
return 0;
 
-   for (i = cg; i; i = parent_misc(i)) {
+   for (i = cg; i; i = misc_cg_parent(i)) {
 res = &i->res[type];
 
 new_usage = atomic64_add_return(amount, &res->usage);
@@ -196,12 +191,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
return 0;
 
 err_charge:
-   for (j = i; j; j = parent_misc(j)) {
+   for (j = i; j; j = misc_cg_parent(j)) {
 atomic64_inc(&j->res[type].events);
 cgroup_file_notify(&j->events_file);
}
 
-   for (j = cg; j != i; j = parent_misc(j))
+   for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -223,7 +218,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
 
-   for (i = cg; i; i = parent_misc(i))
+   for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1




[PATCH v8 03/15] cgroup/misc: Add SGX EPC resource type

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
---
V6:
- Split the original patch into this and the preceding one (Kai)
---
 include/linux/misc_cgroup.h | 4 
 kernel/cgroup/misc.c| 4 
 2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 541a5611c597..2f6cc3a0ad23 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+   /* SGX EPC memory resource */
+   MISC_CG_RES_SGX_EPC,
 #endif
MISC_CG_RES_TYPES
 };
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 1f0d8e05b36c..e51d6a45007f 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
 #endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+   /* Intel SGX EPC memory bytes */
+   "sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
-- 
2.25.1




[PATCH v8 01/15] cgroup/misc: Add per resource callbacks for CSS events

2024-01-29 Thread Haitao Huang
From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.
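
For illustration, a resource provider would use the new hooks roughly
like this (the "example_" names are placeholders; the SGX driver
registers its own callbacks in a later patch of this series):

static int example_res_alloc(struct misc_cg *cg)
{
	/* Allocate/initialize per-cgroup state for this resource type. */
	return 0;
}

static void example_res_free(struct misc_cg *cg)
{
	/* Tear down the per-cgroup state. */
}

static const struct misc_res_ops example_res_ops = {
	.alloc = example_res_alloc,
	.free = example_res_free,
};

static int example_register(enum misc_res_type type)
{
	/* Both callbacks are mandatory; misc_cg_set_ops() rejects partial ops. */
	return misc_cg_set_ops(type, &example_res_ops);
}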

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)
V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of 
required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
 include/linux/misc_cgroup.h | 11 +
 kernel/cgroup/misc.c| 84 +
 2 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;
 
 #include 
 
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is 
allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+};
+
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {
 
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 
amount);
 void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..14ab13ef3bc7 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
  */
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
  * @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 
capacity)
 }
 EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
 
+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required 
callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+   if (!valid_type(type))
+   return -EINVAL;
+
+   if (!ops->alloc) {
+   pr_err("%s: alloc missing\n", __func__);
+   return -EINVAL;
+   }
+
+   if (!ops->free) {
+   pr_err("%s: free missing\n", __func__);
+   return -EINVAL;
+   }
+
+   misc_res_ops[type] = ops;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
 /**
  * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
  * @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
 };
 
+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+   enum misc_res_type i;
+   int ret;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+   WRITE_ONCE(cg->res[i].max, MAX_NUM);
+   atomic64_set(&cg->res[i].usage, 0);
+   if (misc_res_ops[i]) {
+   ret = misc_res_ops[i]->alloc(cg);
+   if (ret)
+   return ret;
+   }
+   }
+
+   return 0;
+}
+
+static inline void _misc_cg_res_free(struct misc_cg *cg)
+{
+   enum misc_res_type i;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++)
+   if (misc_res_ops[i])
+   misc_res_ops[i]->free(cg);
+}
+
 /**
  * misc_cg_alloc() - Allocate misc cgroup.
  * @parent_css: Parent cgroup.
@@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
-   enum misc_res_type i;
-   struct misc_cg *cg;
+   struct misc_cg *parent_cg, *cg;
+ 

[PATCH v8 00/15] Add Cgroup support for SGX EPC memory

2024-01-29 Thread Haitao Huang
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod-level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].
 
This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
 
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which is normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.

For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design. 

Track EPC pages in per-cgroup LRUs (from Dave)
--

tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes the existing LRU scanning algorithm, or to add new LRUs.
The result: Add a new LRU for each cgroup and scans those instead. Replace
the existing global cgroup with the root cgroup's LRU (only when this new
support is compiled in, obviously).

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
 
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
 
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
 
Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
 
Unreclaimable Enclave Pages
---

There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata for reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach and does
not track those in per-cgroup LRUs and considers them as unreclaimable
pages. The allocation of these pages is counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.

Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for the misc-controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed,
and the limit is only enforced at the time of new EPC allocation request.
When a cgroup hits its limit but nothing left in the LRUs of the subtree,
i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
within that cgroup will result in an 'ENOMEM'.

Unreclaimable Guest VM EPC Pages


The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable left within the cgroup. The virtual EPC driver translates 

Re: [PATCH 0/4] tracing/user_events: Introduce multi-format events

2024-01-29 Thread Google
Hi Beau,

On Tue, 23 Jan 2024 22:08:40 +
Beau Belgrave  wrote:

> Currently user_events supports 1 event with the same name and must have
> the exact same format when referenced by multiple programs. This opens
> an opportunity for malicious or poorly thought through programs to
> create events that others use with different formats. Another scenario
> is user programs wishing to use the same event name but add more fields
> later when the software updates. Various versions of a program may be
> running side-by-side, which is prevented by the current single format
> requirement.
> 
> Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates
> the user program wishes to use the same user_event name, but may have
> several different formats of the event in the future. When this flag is
> used, create the underlying tracepoint backing the user_event with a
> unique name per-version of the format. It's important that existing ABI
> users do not get this logic automatically, even if one of the multi
> format events matches the format. This ensures existing programs that
> create events and assume the tracepoint name will match exactly continue
> to work as expected. Add logic to only check multi-format events with
> other multi-format events and single-format events to only check
> single-format events during find.

Thanks for this work! This will allow many instances to use the same
user-events at the same time.

BTW, can we force this flag to be set by default? My concern is the case
where a user program uses this user-event interface inside a container
(maybe that is possible if we bind-mount it). In that case, if this flag
is not set, the user program can detect that another program is using the
event. Moreover, a malicious program running in the container could
prevent other programs from using the event name, even though it is
isolated by the namespace.

Steve suggested that if a user program running in a namespace uses
user-events without this flag, we could reject that by default.

What do you think?

Thank you,


> 
> Add a register_name (reg_name) to the user_event struct which allows for
> split naming of events. We now have the name that was used to register
> within user_events as well as the unique name for the tracepoint. Upon
> registering events ensure matches based on first the reg_name, followed
> by the fields and format of the event. This allows for multiple events
> with the same registered name to have different formats. The underlying
> tracepoint will have a unique name in the format of {reg_name}:[unique_id].
> The unique_id is the time, in nanoseconds, of the event creation converted
> to hex. Since this is done under the register mutex, it is extremely
> unlikely for these IDs to ever match. It's also very unlikely a malicious
> program could consistently guess what the name would be and attempt to
> squat on it via the single format ABI.
> 
> For example, if both "test u32 value" and "test u64 value" are used with
> the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique
> tracepoints. The dynamic_events file would then show the following:
>   u:test u64 count
>   u:test u32 count
> 
> The actual tracepoint names look like this:
>   test:[d5874fdac44]
>   test:[d5914662cd4]
> 
> Deleting events via "!u:test u64 count" would only delete the first
> tracepoint that matched that format. When the delete ABI is used all
> events with the same name will be attempted to be deleted. If
> per-version deletion is required, user programs should either not use
> persistent events or delete them via dynamic_events.
> 
> Beau Belgrave (4):
>   tracing/user_events: Prepare find/delete for same name events
>   tracing/user_events: Introduce multi-format events
>   selftests/user_events: Test multi-format events
>   tracing/user_events: Document multi-format flag
> 
>  Documentation/trace/user_events.rst   |  23 +-
>  include/uapi/linux/user_events.h  |   6 +-
>  kernel/trace/trace_events_user.c  | 224 +-
>  .../testing/selftests/user_events/abi_test.c  | 134 +++
>  4 files changed, 325 insertions(+), 62 deletions(-)
> 
> 
> base-commit: 610a9b8f49fbcf1100716370d3b5f6f884a2835a
> -- 
> 2.34.1
> 


-- 
Masami Hiramatsu (Google) 
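
For reference, a sketch of the per-version tracepoint naming the quoted
cover letter describes ("{reg_name}:[unique_id]", the id being the creation
time in nanoseconds rendered as hex); the helper name here is made up:

#include <linux/ktime.h>
#include <linux/slab.h>

static char *user_event_multi_name(const char *reg_name)
{
	/*
	 * Taken under the register mutex in the real code, so two events are
	 * extremely unlikely to get the same nanosecond timestamp.
	 */
	u64 id = ktime_get_ns();

	return kasprintf(GFP_KERNEL, "%s:[%llx]", reg_name,
			 (unsigned long long)id);
}

Registering "test" twice with different formats would then yield tracepoint
names like "test:[d5874fdac44]" and "test:[d5914662cd4]", as shown above.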



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Al Viro
On Sun, Jan 28, 2024 at 08:36:12PM -0800, Linus Torvalds wrote:

[snip]

apologies for being MIA on that, will post tomorrow morning once I get some
sleep...



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 16:35, Steven Rostedt  wrote:
>
>  # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
>  # ls -l events/kprobes/
> ls: cannot access 'events/kprobes/': No such file or directory
>
> Where it should now exist but doesn't. But the lookup code never triggered.
>
> If the lookup fails, does it cache the result?

I think you end up still having negative dentries around.

The old code then tried to compensate for that by trying to remember
the old dentry with 'ei->dentry' and 'ei->d_children[]', and would at
lookup time try to use the *old* dentry instead of the new one.

And because dentries are just caches and can go away, it then had that
odd dance with '.d_iput', so that when a dentry was removed, it would
be removed from the 'ei->dentry' and 'ei->d_children[]' array too.

Except that d_iput() of an old dentry isn't actually serialized with
->d_lookup() in any way, so you end up with the whole race that I
already talked about earlier, where you could still have an
'ei->dentry' that pointed to something that had already been unhashed,
but d_iput() hadn't been called *yet*, so d_lookup() is called with a
new dentry, but the tracefs code then desperately tries to use the old
dentry pointer that just isn't _valid_ any more, but it doesn't know
that because d_iput() hasn't been called yet...

And as I *also* pointed out when I described that originally, you'll
practically never hit this race, because you just need to be *very*
unlucky with the whole "dentry is freed due to memory pressure".

But basically, this is why I absolutely *HATE* that "ei->dentry"
backpointer. It's truly fundamentally broken.

You can't reference-count it, since the whole point of your current
tracefs scheme is to *not* keep dentries and inodes around forever,
and doing a "dget()" on that 'ei->dentry' would thus fundamentally
screw that up.

But you also cannot keep it in sync with dentries being released due
to memory pressure, because of the above thing.

See why I've tried to tell you that the back-pointer is basically a
100% sign of a bug.

The *only* time you can have a valid dentry pointer is when you have
also taken a ref to it with dget(), and you can't do that.

So then you have all that completely broken code that _tries_ to
maintain consistency with ->d_children[] etc, and it works 99.9% in
practice, because the race is just so hard to hit because dentries
only normally get evicted either synchronously (which you do under the
eventfs_mutex) or under memory pressure (which is basically never
going to be something you can test).

And yes, my lookup patch removed all the band-aids for "if I have an
ei->dentry, I'll reuse it". So I think it ends up exposing all the
previous bugs that the old "let's reuse the old dentry" code tried to
hide.

But, as mentioned, that ei->dentry pointer really REALLY is broken.

Now, having looked at this a lot, I think I have a way forward.
Because there is actually *one* case where you actually *do* do the
whole "dget()" to get a stable dentry pointer. And that's exactly the
"events" directory creation (ie eventfs_create_events_dir()).

So what I propose is that

 - ei->dentry and ei->d_children[] need to die. Really. They are
buggy. There is no way to save them. There never was.

 - but we *can* introduce a new 'ei->events_dir' pointer that is
*only* set by eventfs_create_events_dir(), and which is stable exactly
because that function also does a dget() on it, so now the dentry will
actually continue to exist reliably

I think that works. The only thing that actually *needs* the existing
'ei->dentry' is literally the eventfs_remove_events_dir() that gets
rid of the stable events directory. It's undoing
eventfs_create_events_dir(), and it will do the final dput() too.

I will try to make a patch for this. I do think it means that every
time we do that

dentry->d_fsdata = ei;

we need to also do proper reference counting of said 'ei'. Because we
can't release 'ei' early when we have dentries that point to it.

Let me see how painful this will be.

 Linus
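
To make the proposal concrete, here is a rough, untested sketch (names
approximate) of keeping only a pinned events-directory dentry:

#include <linux/dcache.h>

struct eventfs_inode {
	/* existing fields elided; ei->dentry and ei->d_children[] go away */
	struct dentry *events_dir;	/* only set for the "events" directory */
};

/*
 * eventfs_create_events_dir(): the one place a dentry pointer is kept, and
 * it is stable because a reference is held on it.
 */
static void eventfs_pin_events_dir(struct eventfs_inode *ei, struct dentry *dentry)
{
	ei->events_dir = dget(dentry);
}

/* eventfs_remove_events_dir(): undo the above and drop the final ref. */
static void eventfs_unpin_events_dir(struct eventfs_inode *ei)
{
	dput(ei->events_dir);
	ei->events_dir = NULL;
}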



Re: [RFC PATCH 2/2] x86/kprobes: boost more instructions from grp2/3/4/5

2024-01-29 Thread Google
On Sun, 28 Jan 2024 15:30:50 -0600
Jinghao Jia  wrote:

> 
> 
> On 1/27/24 20:22, Masami Hiramatsu (Google) wrote:
> > On Fri, 26 Jan 2024 22:41:24 -0600
> > Jinghao Jia  wrote:
> > 
> >> With the instruction decoder, we are now able to decode and recognize
> >> instructions with opcode extensions. There are more instructions in
> >> these groups that can be boosted:
> >>
> >> Group 2: ROL, ROR, RCL, RCR, SHL/SAL, SHR, SAR
> >> Group 3: TEST, NOT, NEG, MUL, IMUL, DIV, IDIV
> >> Group 4: INC, DEC (byte operation)
> >> Group 5: INC, DEC (word/doubleword/quadword operation)
> >>
> >> These instructions are not boosted previously because there are reserved
> >> opcodes within the groups, e.g., group 2 with ModR/M.nnn == 110 is
> >> unmapped. As a result, kprobes attached to them requires two int3 traps
> >> as being non-boostable also prevents jump-optimization.
> >>
> >> Some simple tests on QEMU show that after boosting and jump-optimization
> >> a single kprobe on these instructions with an empty pre-handler runs 10x
> >> faster (~1000 cycles vs. ~100 cycles).
> >>
> >> Since these instructions are mostly ALU operations and do not touch
> >> special registers like RIP, let's boost them so that we get the
> >> performance benefit.
> >>
> > 
> > As far as we check the ModR/M byte, I think we can safely run these
> > instructions on trampoline buffer without adjusting results (this
> > means it can be "boosted").
> > I just have a minor comment, but basically this looks good to me.
> > 
> > Reviewed-by: Masami Hiramatsu (Google) 
> > 
> >> Signed-off-by: Jinghao Jia 
> >> ---
> >>  arch/x86/kernel/kprobes/core.c | 21 +++--
> >>  1 file changed, 15 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/kprobes/core.c 
> >> b/arch/x86/kernel/kprobes/core.c
> >> index 792b38d22126..f847bd9cc91b 100644
> >> --- a/arch/x86/kernel/kprobes/core.c
> >> +++ b/arch/x86/kernel/kprobes/core.c
> >> @@ -169,22 +169,31 @@ int can_boost(struct insn *insn, void *addr)
> >>case 0x62:  /* bound */
> >>case 0x70 ... 0x7f: /* Conditional jumps */
> >>case 0x9a:  /* Call far */
> >> -  case 0xc0 ... 0xc1: /* Grp2 */
> >>case 0xcc ... 0xce: /* software exceptions */
> >> -  case 0xd0 ... 0xd3: /* Grp2 */
> >>case 0xd6:  /* (UD) */
> >>case 0xd8 ... 0xdf: /* ESC */
> >>case 0xe0 ... 0xe3: /* LOOP*, JCXZ */
> >>case 0xe8 ... 0xe9: /* near Call, JMP */
> >>case 0xeb:  /* Short JMP */
> >>case 0xf0 ... 0xf4: /* LOCK/REP, HLT */
> >> -  case 0xf6 ... 0xf7: /* Grp3 */
> >> -  case 0xfe:  /* Grp4 */
> >>/* ... are not boostable */
> >>return 0;
> >> +  case 0xc0 ... 0xc1: /* Grp2 */
> >> +  case 0xd0 ... 0xd3: /* Grp2 */
> >> +  /* ModR/M nnn == 110 is reserved */
> >> +  return X86_MODRM_REG(insn->modrm.bytes[0]) != 6;
> >> +  case 0xf6 ... 0xf7: /* Grp3 */
> >> +  /* ModR/M nnn == 001 is reserved */
> > 
> > /* AMD uses nnn == 001 as TEST, but Intel makes it reserved. */
> > 
> 
> I will incorporate this into the v2. Since nnn == 001 is still considered
> reserved by Intel, we still need to prevent it from being boosted, don't
> we?
> 
> --Jinghao
> 
> >> +  return X86_MODRM_REG(insn->modrm.bytes[0]) != 1;
> >> +  case 0xfe:  /* Grp4 */
> >> +  /* Only inc and dec are boostable */
> >> +  return X86_MODRM_REG(insn->modrm.bytes[0]) == 0 ||
> >> + X86_MODRM_REG(insn->modrm.bytes[0]) == 1;
> >>case 0xff:  /* Grp5 */
> >> -  /* Only indirect jmp is boostable */
> >> -  return X86_MODRM_REG(insn->modrm.bytes[0]) == 4;
> >> +  /* Only inc, dec, and indirect jmp are boostable */
> >> +  return X86_MODRM_REG(insn->modrm.bytes[0]) == 0 ||
> >> + X86_MODRM_REG(insn->modrm.bytes[0]) == 1 ||
> >> + X86_MODRM_REG(insn->modrm.bytes[0]) == 4;
> >>default:
> >>return 1;
> >>}
> >> -- 
> >> 2.43.0
> >>
> > 
> > Thank you,
> > 


-- 
Masami Hiramatsu (Google) 
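
For readers following the ModR/M discussion above: the "reg" field the
patch tests is simply bits 5:3 of the ModR/M byte (the value X86_MODRM_REG()
from <asm/insn.h> extracts), and it selects the operation within an opcode
group; for 0xf6/0xf7 (Grp3), reg == 1 is the slot Intel leaves reserved
while AMD uses it as TEST. A trivial stand-alone illustration:

static inline unsigned int modrm_reg_field(unsigned char modrm)
{
	return (modrm >> 3) & 0x7;	/* same value as X86_MODRM_REG(modrm) */
}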



Re: [RFC PATCH 1/2] x86/kprobes: Prohibit kprobing on INT and UD

2024-01-29 Thread Google
On Sun, 28 Jan 2024 15:25:59 -0600
Jinghao Jia  wrote:

> >>  /* Check if paddr is at an instruction boundary */
> >>  static int can_probe(unsigned long paddr)
> >>  {
> >> @@ -294,6 +310,16 @@ static int can_probe(unsigned long paddr)
> >>  #endif
> >>addr += insn.length;
> >>}
> >> +  __addr = recover_probed_instruction(buf, addr);
> >> +  if (!__addr)
> >> +  return 0;
> >> +
> >> +  if (insn_decode_kernel(&insn, (void *)__addr) < 0)
> >> +  return 0;
> >> +
> >> +  if (is_exception_insn(&insn))
> >> +  return 0;
> >> +
> > 
> > Please don't put this outside of decoding loop. You should put these in
> > the loop which decodes the instruction from the beginning of the function.
> > Since the x86 instrcution is variable length, can_probe() needs to check
> > whether that the address is instruction boundary and decodable.
> > 
> > Thank you,
> 
> If my understanding is correct then this is trying to decode the kprobe
> target instruction, given that it is after the main decoding loop.  Here I
> hoisted the decoding logic out of the if(IS_ENABLED(CONFIG_CFI_CLANG))
> block so that we do not need to decode the same instruction twice.  I left
> the main decoding loop unchanged so it is still decoding the function from
> the start and should handle instruction boundaries. Are there any caveats
> that I missed?

Ah, sorry I misread the patch. You're correct!
This is a good place to do that.

But hmm, I think we should add another patch to check addr == paddr
right after the loop so that we can avoid decoding it unnecessarily.

Thank you,

> 
> --Jinghao
> 
> > 
> >>if (IS_ENABLED(CONFIG_CFI_CLANG)) {
> >>/*
> >> * The compiler generates the following instruction sequence
> >> @@ -308,13 +334,6 @@ static int can_probe(unsigned long paddr)
> >> * Also, these movl and addl are used for showing expected
> >> * type. So those must not be touched.
> >> */
> >> -  __addr = recover_probed_instruction(buf, addr);
> >> -  if (!__addr)
> >> -  return 0;
> >> -
> >> -  if (insn_decode_kernel(&insn, (void *)__addr) < 0)
> >> -  return 0;
> >> -
> >>if (insn.opcode.value == 0xBA)
> >>offset = 12;
> >>else if (insn.opcode.value == 0x3)
> >> -- 
> >> 2.43.0
> >>
> > 
> > 


-- 
Masami Hiramatsu (Google) 
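
A sketch of the early exit suggested above (illustrative, not the actual
patch): bail out of can_probe() right after the boundary-scan loop when
paddr is not on an instruction boundary, so the probed instruction is only
recovered and decoded when the new exception-insn check actually needs it:

static bool stopped_on_probe_addr(unsigned long addr, unsigned long paddr)
{
	/*
	 * 'addr' is where the decode loop stopped; only when it landed
	 * exactly on paddr is paddr a boundary worth decoding further.
	 */
	return addr == paddr;
}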



Re: [RESEND PATCH v2] modules: wait do_free_init correctly

2024-01-29 Thread Changbin Du
On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote:
> On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > do_free_init() into a global workqueue instead of call_rcu(). So now
> > rcu_barrier() cannot ensure that do_free_init() has completed. We should
> > wait for it via flush_work().
> > 
> > Without this fix, we still could encounter false positive reports in
> > W+X checking, and rcu synchronization is unnecessary.
> 
> You didn't answer my question, which should be documented in the commit log.
> 
> Does this mean we never freed modules init because of this? If so then
> your commit log should clearly explain that. It should also explain that
> if true (you have to verify) then it means we were no longer saving
> the memory we wished to save, and that is important for distributions
> which do want to save anything on memory. You may want to do a general
> estimate on how much that means these days on any desktop / server.
>
Actually, I explained it in the commit message. It's not about saving memory.
The synchronization here is just to ensure the module init sections have been
freed before doing the W+X checking. The problem is that the current
implementation is wrong: rcu_barrier() cannot guarantee that, so we can
encounter false-positive reports. The module init memory will still be freed
either way; it's just a timing-related issue.

>   Luis

-- 
Cheers,
Changbin Du
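
For context, a sketch of the fix under discussion (the work-item name is
illustrative; the point is that the W+X check must flush the work item
rather than wait for an RCU grace period):

#include <linux/workqueue.h>

static void do_free_init(struct work_struct *w);	/* frees module init memory */
static DECLARE_WORK(init_free_wq, do_free_init);

static void wait_for_init_free(void)
{
	/*
	 * rcu_barrier() no longer covers do_free_init() since it runs from a
	 * workqueue; flush the work so the init sections are really gone.
	 */
	flush_work(&init_free_wq);
}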



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 16:01:25 -0800
Linus Torvalds  wrote:

> I'll go see what's up with the "create it again" case - I don't
> immediately see what's wrong.

Interesting. I added a printk in the lookup, and just did this:

 # cd /sys/kernel/tracing
 # ls events/kprobes

And it showed that it tried to see if "kprobes" existed in the lookup.
Which it did not because I haven't created any kprobes yet.

Then I did:

 # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
 # ls -l events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory

Where it should now exist but doesn't. But the lookup code never triggered.

If the lookup fails, does it cache the result?

-- Steve



Re: [PATCH RFC v3 23/35] arm64: mte: Try to reserve tag storage in arch_alloc_page()

2024-01-29 Thread Peter Collingbourne
On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
 wrote:
>
> Reserve tag storage for a page that is being allocated as tagged. This
> is a best effort approach, and failing to reserve tag storage is
> allowed.
>
> When all the associated tagged pages have been freed, return the tag
> storage pages back to the page allocator, where they can be used again for
> data allocations.
>
> Signed-off-by: Alexandru Elisei 
> ---
>
> Changes since rfc v2:
>
> * Based on rfc v2 patch #16 ("arm64: mte: Manage tag storage on page
> allocation").
> * Fixed calculation of the number of associated tag storage blocks (Hyesoo
> Yu).
> * Tag storage is reserved in arch_alloc_page() instead of
> arch_prep_new_page().
>
>  arch/arm64/include/asm/mte.h |  16 +-
>  arch/arm64/include/asm/mte_tag_storage.h |  31 +++
>  arch/arm64/include/asm/page.h|   5 +
>  arch/arm64/include/asm/pgtable.h |  19 ++
>  arch/arm64/kernel/mte_tag_storage.c  | 234 +++
>  arch/arm64/mm/fault.c|   7 +
>  fs/proc/page.c   |   1 +
>  include/linux/kernel-page-flags.h|   1 +
>  include/linux/page-flags.h   |   1 +
>  include/trace/events/mmflags.h   |   3 +-
>  mm/huge_memory.c |   1 +
>  11 files changed, 316 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 8034695b3dd7..6457b7899207 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf);
>  #ifdef CONFIG_ARM64_MTE
>
>  /* track which pages have valid allocation tags */
> -#define PG_mte_tagged  PG_arch_2
> +#define PG_mte_tagged  PG_arch_2
>  /* simple lock to avoid multiple threads tagging the same page */
> -#define PG_mte_lockPG_arch_3
> +#define PG_mte_lockPG_arch_3
> +/* Track if a tagged page has tag storage reserved */
> +#define PG_tag_storage_reservedPG_arch_4
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> +extern bool page_tag_storage_reserved(struct page *page);
> +#endif
>
>  static inline void set_page_mte_tagged(struct page *page)
>  {
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +   /* Open code mte_tag_storage_enabled() */
> +   WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) &&
> +!page_tag_storage_reserved(page));
> +#endif
> /*
>  * Ensure that the tags written prior to this function are visible
>  * before the page flags update.
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h 
> b/arch/arm64/include/asm/mte_tag_storage.h
> index 7b3f6bff8e6f..09f1318d924e 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -5,6 +5,12 @@
>  #ifndef __ASM_MTE_TAG_STORAGE_H
>  #define __ASM_MTE_TAG_STORAGE_H
>
> +#ifndef __ASSEMBLY__
> +
> +#include 
> +
> +#include 
> +
>  #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
>
>  DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> @@ -15,6 +21,15 @@ static inline bool tag_storage_enabled(void)
>  }
>
>  void mte_init_tag_storage(void);
> +
> +static inline bool alloc_requires_tag_storage(gfp_t gfp)
> +{
> +   return gfp & __GFP_TAGGED;
> +}
> +int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> +void free_tag_storage(struct page *page, int order);
> +
> +bool page_tag_storage_reserved(struct page *page);
>  #else
>  static inline bool tag_storage_enabled(void)
>  {
> @@ -23,6 +38,22 @@ static inline bool tag_storage_enabled(void)
>  static inline void mte_init_tag_storage(void)
>  {
>  }
> +static inline bool alloc_requires_tag_storage(struct page *page)

This function should take a gfp_t to match the
CONFIG_ARM64_MTE_TAG_STORAGE case.

Peter

> +{
> +   return false;
> +}
> +static inline int reserve_tag_storage(struct page *page, int order, gfp_t 
> gfp)
> +{
> +   return 0;
> +}
> +static inline void free_tag_storage(struct page *page, int order)
> +{
> +}
> +static inline bool page_tag_storage_reserved(struct page *page)
> +{
> +   return true;
> +}
>  #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
>
> +#endif /* !__ASSEMBLY__ */
>  #endif /* __ASM_MTE_TAG_STORAGE_H  */
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index 88bab032a493..3a656492f34a 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -35,6 +35,11 @@ void copy_highpage(struct page *to, struct page *from);
>  void tag_clear_highpage(struct page *to);
>  #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
>
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +void arch_alloc_page(struct page *, int order, gfp_t gfp);
> +#define HAVE_ARCH_ALLOC_PAGE
> +#endif
> +
>  #define clear_user_page(page, vaddr, pg)   clear_page(page)
>  #define copy_user_page(to, from, vaddr, pg)copy_page(to, from)
>
> diff --git 
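
In other words, the !CONFIG_ARM64_MTE_TAG_STORAGE stub presumably just
needs its parameter type changed so that both variants take a gfp_t, e.g.:

static inline bool alloc_requires_tag_storage(gfp_t gfp)
{
	return false;
}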

Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 14:49, Steven Rostedt  wrote:
>
> Now I didn't change this last d_instantiate, because this is not called
> through the lookup code. This is the root events directory and acts more
> like debugfs. It's not "dynamically" added.

Ahh, yes, I see, the dentry was created (as a negative one) with
tracefs_start_creating() -> lookup_one_len().

So  yes, there d_instantiate() is correct, as it's exactly that "turn
negative dentry into a positive one" case.

I'll go see what's up with the "create it again" case - I don't
immediately see what's wrong.

  Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 17:47:32 -0500
Steven Rostedt  wrote:

> > And I hope there aren't any other stupid things I missed like that.  
> 
> Well the preliminary tests pass with this added to your patch:

Spoke too soon. The later tests started failing.

It fails on creating a kprobe, deleting it, and then recreating it. Even
though the directory is there, it can't be accessed.

 # cd /sys/kernel/tracing

// Create a kprobe

 # echo 'p:sched schedule' >> kprobe_events 
 # ls events/kprobes/
enable  filter  sched

// Now delete the kprobe

 # echo '-:sched schedule' >> kprobe_events

// Make sure it's gone

 # ls events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory

// Recreate it

 # echo 'p:sched schedule' >> kprobe_events 
 # ls events/kprobes/
ls: cannot access 'events/kprobes/': No such file or directory
 # ls events | grep kprobes
kprobes

No longer able to access it.

 # ls -l events | grep kprobes
ls: cannot access 'events/kprobes': No such file or directory
d? ? ???? kprobes


-- Steve



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 14:42:47 -0800
Linus Torvalds  wrote:

> @@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
>   ti->flags = TRACEFS_EVENT_INODE;
>   ti->private = NULL; // Directories have 'ei', files 
> not
>  
> - d_instantiate(dentry, inode);
> + d_add(dentry, inode);
>   fsnotify_create(dentry->d_parent->d_inode, dentry);
>   return eventfs_end_creating(dentry);
>  };
> @@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
> *dentry,
>  ei->dentry = dentry; // Remove me!
>  
>   inc_nlink(inode);
> - d_instantiate(dentry, inode);
> + d_add(dentry, inode);
>   inc_nlink(dentry->d_parent->d_inode);
>   fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
>   return eventfs_end_creating(dentry);
> @@ -786,7 +784,7 @@ struct eventfs_inode *eventfs_create_events_dir(const 
> char *name, struct dentry
>  
>   /* directory inodes start off with i_nlink == 2 (for "." entry) */
>   inc_nlink(inode);
> - d_instantiate(dentry, inode);
> + d_add(dentry, inode);

Now I didn't change this last d_instantiate, because this is not called
through the lookup code. This is the root events directory and acts more
like debugfs. It's not "dynamically" added.

-- Steve


>   inc_nlink(dentry->d_parent->d_inode);
>   fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
>   tracefs_end_creating(dentry);




Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 14:35:37 -0800
Linus Torvalds  wrote:

> And I hope there aren't any other stupid things I missed like that.

Well the preliminary tests pass with this added to your patch:

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de322..ad11063bdd53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
 {
struct eventfs_inode *ei;
 
-   mutex_lock(&eventfs_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
-   mutex_unlock(&eventfs_mutex);
 
update_top_events_attr(ei, dentry->d_sb);
 
@@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
ti->flags = TRACEFS_EVENT_INODE;
ti->private = NULL; // Directories have 'ei', files 
not
 
-   d_instantiate(dentry, inode);
+   d_add(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);
 };
@@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
 ei->dentry = dentry;   // Remove me!
 
inc_nlink(inode);
-   d_instantiate(dentry, inode);
+   d_add(dentry, inode);
inc_nlink(dentry->d_parent->d_inode);
fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);

-- Steve



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 14:35, Linus Torvalds
 wrote:
>
> So just replace all the d_instantiate() calls there with "d_add()"
> instead. I think that will fix it.

I can confirm that with the mutex deadlock removed and the d_add()
fix, at least things *look* superficially ok.

I didn't actually do anything with it. So it might be leaking dentry
refs like mad or something like that, but at least the obvious cases
look fine.

just for completeness, here's the fixup diff I used.

  Linus
 fs/tracefs/event_inode.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de322..5b307bb64f8f 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 
-	mutex_lock(&eventfs_mutex);
 	do {
 		// The parent is stable because we do not do renames
 		dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 		}
 		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
-	mutex_unlock(&eventfs_mutex);
 
 	update_top_events_attr(ei, dentry->d_sb);
 
@@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
 };
@@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 ei->dentry = dentry;	// Remove me!
 
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -786,7 +784,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 
 	/* directory inodes start off with i_nlink == 2 (for "." entry) */
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	tracefs_end_creating(dentry);


Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 14:21, Steven Rostedt  wrote:
>
> But crashes with just a:
>
>  # ls /sys/kernel/tracing/events
>
> [   66.423983] [ cut here ]
> [   66.426447] kernel BUG at fs/dcache.c:1876!

Duh.

That's a bit too much copy-and-paste by me.

So what is going on is that a ->lookup() function should *not* call
d_instantiate() at all, and the only reason it actually used to work
here was due to the incorrect "simple_lookup()", which basically did
all the preliminaries.

A ->lookup() should do 'd_add()' on the dentry.

So just replace all the d_instantiate() calls there with "d_add()"
instead. I think that will fix it.

Basically the "simple_lookup()" had done the "d_add(dentry, NULL)",
and at that point the "d_instantiate()" just exposed the inode and
turned the negative dentry into a positive one.

So "d_add()" is "I'm adding the inode to a new dentry under lookup".
And "d_instantiate()" is "I'm adding this inode to an existing dentry
that used to be negative"

And so the old "d_add(NULL)+d_instantiate(inode)" _kind_ of worked,
except it made that negative dentry visible for a short while.

And when I did the cleanup, I didn't think of this thing, so I left
the d_instantiate() calls as such, even though they now really need to
be d_add().

Hope that explains it.

And I hope there aren't any other stupid things I missed like that.

 Linus
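
A minimal, illustrative shape of a ->lookup() following that rule (not the
tracefs code): build the inode and d_add() it to the in-lookup dentry,
while d_instantiate() stays reserved for dentries that already exist as
negative entries, e.g. after lookup_one_len() in a create path:

#include <linux/dcache.h>
#include <linux/fs.h>

static struct dentry *example_lookup(struct inode *dir, struct dentry *dentry,
				     unsigned int flags)
{
	struct inode *inode = new_inode(dir->i_sb);

	if (!inode)
		return ERR_PTR(-ENOMEM);
	inode->i_mode = S_IFREG | 0444;
	d_add(dentry, inode);	/* attach the inode to the dentry under lookup */
	return NULL;
}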



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds  wrote:

> [0004-tracefs-dentry-lookup-crapectomy.patch  text/x-patch (11761 bytes)] 

I had to add:

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de322..89897d934302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
 {
struct eventfs_inode *ei;
 
-   mutex_lock(&eventfs_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
-   mutex_unlock(&eventfs_mutex);
 
update_top_events_attr(ei, dentry->d_sb);
 

As eventfs_find_events() is only called by update_inode_attr(), which is
only called by lookup_file() and lookup_dir_entry(), which are called by
eventfs_root_lookup(), where eventfs_mutex is already held.

But crashes with just a:

 # ls /sys/kernel/tracing/events

[   66.423983] [ cut here ]
[   66.426447] kernel BUG at fs/dcache.c:1876!
[   66.428363] invalid opcode:  [#1] PREEMPT SMP KASAN PTI
[   66.430320] CPU: 4 PID: 863 Comm: ls Not tainted 
6.8.0-rc1-test-9-gcaff43732484-dirty #463
[   66.433192] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.3-debian-1.16.3-2 04/01/2014
[   66.436122] RIP: 0010:d_instantiate+0x69/0x70
[   66.437537] Code: 00 4c 89 e7 e8 18 f0 49 01 48 89 df 48 89 ee e8 0d fe ff 
ff 4c 89 e7 5b 5d 41 5c e9 e1 f3 49 01 5b 5d 41 5c c3 cc cc cc cc 90 <0f> 0b 90 
0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[   66.442759] RSP: 0018:c900015f7830 EFLAGS: 00010282
[   66.444315] RAX:  RBX: 8881006aec40 RCX: b77fcbff
[   66.446237] RDX: dc00 RSI: 888127c46ec0 RDI: 8881006aed70
[   66.448170] RBP: 888127c46ec0 R08: 0001 R09: fbfff7766c4f
[   66.450094] R10: bbb3627f R11:  R12: 81a0
[   66.452007] R13: 88810f85 R14:  R15: 
[   66.455224] FS:  7fe1e56d9800() GS:88823c80() 
knlGS:
[   66.457484] CS:  0010 DS:  ES:  CR0: 80050033
[   66.459139] CR2: 556b5293b000 CR3: 00010f056001 CR4: 00170ef0
[   66.461192] Call Trace:
[   66.462094]  
[   66.462915]  ? die+0x36/0x90
[   66.463928]  ? do_trap+0x133/0x230
[   66.465089]  ? d_instantiate+0x69/0x70
[   66.466325]  ? d_instantiate+0x69/0x70
[   66.467532]  ? do_error_trap+0x90/0x130
[   66.468787]  ? d_instantiate+0x69/0x70
[   66.470020]  ? handle_invalid_op+0x2c/0x40
[   66.471340]  ? d_instantiate+0x69/0x70
[   66.472559]  ? exc_invalid_op+0x2e/0x50
[   66.473807]  ? asm_exc_invalid_op+0x1a/0x20
[   66.475154]  ? d_instantiate+0x1f/0x70
[   66.476396]  ? d_instantiate+0x69/0x70
[   66.477629]  eventfs_root_lookup+0x366/0x660
[   66.479021]  ? __pfx_eventfs_root_lookup+0x10/0x10
[   66.480567]  ? print_circular_bug_entry+0x170/0x170
[   66.482107]  ? lockdep_init_map_type+0xd3/0x3a0
[   66.485243]  __lookup_slow+0x194/0x2a0
[   66.486410]  ? __pfx___lookup_slow+0x10/0x10
[   66.487694]  ? rwsem_read_trylock+0x118/0x1b0
[   66.489057]  ? i915_ttm_backup+0x2a0/0x5e0
[   66.490358]  ? down_read+0xbb/0x240
[   66.491506]  ? down_read+0xbb/0x240
[   66.492674]  ? trace_preempt_on+0xc8/0xe0
[   66.493962]  ? i915_ttm_backup+0x2a0/0x5e0
[   66.495291]  walk_component+0x166/0x220
[   66.496564]  path_lookupat+0xa9/0x2e0
[   66.497766]  ? __pfx_mark_lock+0x10/0x10
[   66.499026]  filename_lookup+0x19c/0x350
[   66.500335]  ? __pfx_filename_lookup+0x10/0x10
[   66.501688]  ? __pfx___lock_acquire+0x10/0x10
[   66.502996]  ? __pfx___lock_acquire+0x10/0x10
[   66.504305]  ? _raw_read_unlock_irqrestore+0x40/0x80
[   66.505765]  ? stack_depot_save_flags+0x1f0/0x790
[   66.507137]  vfs_statx+0xe1/0x270
[   66.508196]  ? __pfx_vfs_statx+0x10/0x10
[   66.509385]  ? __virt_addr_valid+0x155/0x330
[   66.510667]  ? __pfx_lock_release+0x10/0x10
[   66.511924]  do_statx+0xac/0x110
[   66.515394]  ? __pfx_do_statx+0x10/0x10
[   66.516631]  ? getname_flags.part.0+0xd6/0x260
[   66.517956]  __x64_sys_statx+0xa0/0xc0
[   66.519079]  do_syscall_64+0xca/0x1e0
[   66.520206]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   66.521674] RIP: 0033:0x7fe1e586e2ea
[   66.522780] Code: 48 8b 05 31 bb 0d 00 ba ff ff ff ff 64 c7 00 16 00 00 00 
e9 a5 fd ff ff e8 03 10 02 00 0f 1f 00 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 2e 89 c1 85 c0 74 0f 48 8b 05 f9 ba 0d 00 64
[   66.527728] RSP: 002b:7fff06f5ef98 EFLAGS: 0246 ORIG_RAX: 
014c
[   66.529922] RAX: ffda RBX: 556b52935d18 RCX: 7fe1e586e2ea
[   66.531863] RDX: 0900 RSI: 7fff06f5f0d0 RDI: ff9c
[   

Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 13:45, Steven Rostedt  wrote:
> >  1 file changed, 50 insertions(+), 219 deletions(-)
>
> Thanks, much appreciated.

Well, I decided I might as well give it a test-run, and there's an
immediate deadlock on eventfs_mutex, because I missed removing it from
eventfs_find_events() when the callers now already hold it.

So at a minimum, it will require this patch on top:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -230,7 +230,6 @@ static struct eventfs_inode
*eventfs_find_events(
   {
struct eventfs_inode *ei;

  - mutex_lock(&eventfs_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
  @@ -247,7 +246,6 @@
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
  - mutex_unlock(&eventfs_mutex);

update_top_events_attr(ei, dentry->d_sb);

to not deadlock immediately on the first lookup.

And honestly, there might be other such obvious "I missed that when
reading the code".

Let me reboot into a fixed system and do some more basic smoke-testing.

Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds  wrote:

> [0001-tracefs-remove-stale-update_gid-code.patch  text/x-patch (2612 bytes)] 

Oh, I already applied this and even sent you a pull request with it.

  https://lore.kernel.org/all/20240128223151.2dad6...@rorschach.local.home/

-- Steve



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 12:51:59 -0800
Linus Torvalds  wrote:

> End result: what simple_lookup() does is say "oh, you didn't have the
> file, so it's by definition a negative dentry", and thus all it does
> is to do "d_add(dentry, NULL)".
> 
> Anyway, removing this was painful. I initially thought "I'll just
> remove the calls". But it all ended up cascading into "that's also
> wrong".
> 
> So now I have a patch that tries to fix this all up, and it looks like this:
> 
>  1 file changed, 50 insertions(+), 219 deletions(-)

Thanks, much appreciated.

> 
> because it basically removed all the old code, and replaced it with
> much simpler code.
> 
> I'm including the patch here as an attachment, but I want to note very
> clearly that this *builds* for me, and it looks a *lot* more obvious
> and correct than the old code did, but I haven't tested it. AT ALL.

I'm going to stare at them as I test them. Because I want to understand
them. I may come back with questions.

> 
> Also note that it depends on my previous patches, so I guess I'll
> include them here again just to make it unambiguous.
> 
> Finally - this does *not* fix up the refcounting. I still think the
> SRCU stuff is completely broken. But that's another headache. But at
> least now the *lookup* parts look like they DTRT wrt eventfs_mutex.
> 
> The SRCU logic from the directory iteration parts still needs crapectomy.

I think not dropping the mutex lock lets me get rid of the SRCU. I
added the SRCU when I was hitting the deadlocks with the iput code which
I'm not hitting anymore. So getting rid of the SRCU shouldn't be hard.

> 
> AGAIN: these patches (ie particularly that last one - 0004) were all
> done entirely "blindly" - I've looked at the code, and fixed the bugs
> and problems I've seen by pure code inspection.
> 
> That's great, but it really means that it's all untested. It *looks*
> better than the old code, but there may be some silly gotcha that I
> have missed.

I'll let you know. 

Oh, does b4 handle attachments? Because this breaks the patchwork flow.
I haven't used b4 yet.

Thanks,

-- Steve



Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression

2024-01-29 Thread Dan Williams
Mathieu Desnoyers wrote:
> This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM,
> even on ARMv7 which does not have virtually aliased dcaches:
> 
> commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> 
> It used to work fine before: I have customers using dax over pmem on
> ARMv7, but this regression will likely prevent them from upgrading their
> kernel.
> 
> The root of the issue here is the fact that DAX was never designed to
> handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It
> touches the pages through their linear mapping, which is not consistent
> with the userspace mappings on virtually aliased dcaches. 
> 
> This patch series introduces cache_is_aliasing() with new Kconfig
> options:
> 
>   * ARCH_HAS_CACHE_ALIASING
>   * ARCH_HAS_CACHE_ALIASING_DYNAMIC
> 
> and implements it for all architectures. The "DYNAMIC" implementation
> implements cache_is_aliasing() as a runtime check, which is what is
> needed on architectures like 32-bit ARMV6 and ARMV6K.
> 
> With this we can basically narrow down the list of architectures which
> are unsupported by DAX to those which are really affected.
> 
> Feedback is welcome,

Hi Mathieu, this looks good overall, just some quibbling about the
ordering.

I would introduce dax_is_supported() with the current overly broad
interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then
fixup the filesystems to use the new helper, and finally go back and
convert dax_is_supported() to use cache_is_aliasing() internally.

Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC
needs to exist. As long as all paths that care are calling
cache_is_aliasing() then whether it is dynamic or not is something only
the compiler cares about. If those dynamic archs do not want to pay the
.text size increase they can always do CONFIG_FS_DAX=n, right?
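
i.e. something like the following as the interim helper, before the
cache_is_aliasing() plumbing lands (a sketch of the suggestion above, not
code from the series):

static inline bool dax_is_supported(void)
{
	/*
	 * Today's overly broad check, kept behind one helper so the
	 * filesystems can be converted first.
	 */
	return !IS_ENABLED(CONFIG_ARM) &&
	       !IS_ENABLED(CONFIG_MIPS) &&
	       !IS_ENABLED(CONFIG_SPARC);
}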



[RFC PATCH 5/7] ext4: Use dax_is_supported()

2024-01-29 Thread Mathieu Desnoyers
Use dax_is_supported() to validate whether the architecture has
virtually aliased caches at mount time.

This is relevant for architectures which require a dynamic check
to validate whether they have virtually aliased data caches
(ARCH_HAS_CACHE_ALIASING_DYNAMIC=y).

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: "Theodore Ts'o" 
Cc: Andreas Dilger 
Cc: linux-e...@vger.kernel.org
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/ext4/super.c | 52 -
 1 file changed, 25 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c5fcf377ab1f..9e0606289239 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2359,34 +2359,32 @@ static int ext4_parse_param(struct fs_context *fc, 
struct fs_parameter *param)
return ext4_parse_test_dummy_encryption(param, ctx);
case Opt_dax:
case Opt_dax_type:
-#ifdef CONFIG_FS_DAX
-   {
-   int type = (token == Opt_dax) ?
-  Opt_dax : result.uint_32;
-
-   switch (type) {
-   case Opt_dax:
-   case Opt_dax_always:
-   ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS);
-   ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER);
-   break;
-   case Opt_dax_never:
-   ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER);
-   ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS);
-   break;
-   case Opt_dax_inode:
-   ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS);
-   ctx_clear_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER);
-   /* Strictly for printing options */
-   ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE);
-   break;
+   if (dax_is_supported()) {
+   int type = (token == Opt_dax) ?
+  Opt_dax : result.uint_32;
+
+   switch (type) {
+   case Opt_dax:
+   case Opt_dax_always:
+   ctx_set_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS);
+   ctx_clear_mount_opt2(ctx, 
EXT4_MOUNT2_DAX_NEVER);
+   break;
+   case Opt_dax_never:
+   ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_NEVER);
+   ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS);
+   break;
+   case Opt_dax_inode:
+   ctx_clear_mount_opt(ctx, EXT4_MOUNT_DAX_ALWAYS);
+   ctx_clear_mount_opt2(ctx, 
EXT4_MOUNT2_DAX_NEVER);
+   /* Strictly for printing options */
+   ctx_set_mount_opt2(ctx, EXT4_MOUNT2_DAX_INODE);
+   break;
+   }
+   return 0;
+   } else {
+   ext4_msg(NULL, KERN_INFO, "dax option not supported");
+   return -EINVAL;
}
-   return 0;
-   }
-#else
-   ext4_msg(NULL, KERN_INFO, "dax option not supported");
-   return -EINVAL;
-#endif
case Opt_data_err:
if (result.uint_32 == Opt_data_err_abort)
ctx_set_mount_opt(ctx, m->mount_opt);
-- 
2.39.2




[RFC PATCH 7/7] xfs: Use dax_is_supported()

2024-01-29 Thread Mathieu Desnoyers
Use dax_is_supported() to validate whether the architecture has
virtually aliased caches at mount time.

This is relevant for architectures which require a dynamic check
to validate whether they have virtually aliased data caches
(ARCH_HAS_CACHE_ALIASING_DYNAMIC=y).

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Chandan Babu R 
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/xfs/xfs_super.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 764304595e8b..b27ecb11db66 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1376,14 +1376,22 @@ xfs_fs_parse_param(
case Opt_nodiscard:
parsing_mp->m_features &= ~XFS_FEAT_DISCARD;
return 0;
-#ifdef CONFIG_FS_DAX
case Opt_dax:
-   xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS);
-   return 0;
+   if (dax_is_supported()) {
+   xfs_mount_set_dax_mode(parsing_mp, XFS_DAX_ALWAYS);
+   return 0;
+   } else {
+   xfs_warn(parsing_mp, "dax option not supported.");
+   return -EINVAL;
+   }
case Opt_dax_enum:
-   xfs_mount_set_dax_mode(parsing_mp, result.uint_32);
-   return 0;
-#endif
+   if (dax_is_supported()) {
+   xfs_mount_set_dax_mode(parsing_mp, result.uint_32);
+   return 0;
+   } else {
+   xfs_warn(parsing_mp, "dax option not supported.");
+   return -EINVAL;
+   }
/* Following mount options will be removed in September 2025 */
case Opt_ikeep:
xfs_fs_warn_deprecated(fc, param, XFS_FEAT_IKEEP, true);
-- 
2.39.2




[RFC PATCH 6/7] fuse: Introduce fuse_dax_is_supported()

2024-01-29 Thread Mathieu Desnoyers
Use dax_is_supported() in addition to IS_ENABLED(CONFIG_FUSE_DAX) to
validate whether CONFIG_FUSE_DAX is enabled and the architecture does
not have virtually aliased caches.

This is relevant for architectures which require a dynamic check
to validate whether they have virtually aliased data caches
(ARCH_HAS_CACHE_ALIASING_DYNAMIC=y).

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Miklos Szeredi 
Cc: linux-fsde...@vger.kernel.org
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/fuse/file.c  |  2 +-
 fs/fuse/fuse_i.h| 36 +-
 fs/fuse/inode.c | 47 +++--
 fs/fuse/virtio_fs.c |  4 ++--
 4 files changed, 62 insertions(+), 27 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a660f1f21540..133ac8524064 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3247,6 +3247,6 @@ void fuse_init_file_inode(struct inode *inode, unsigned 
int flags)
	init_waitqueue_head(&fi->page_waitq);
fi->writepages = RB_ROOT;
 
-   if (IS_ENABLED(CONFIG_FUSE_DAX))
+   if (fuse_dax_is_supported())
fuse_dax_inode_init(inode, flags);
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1df83eebda92..1cbe37106669 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /** Default max number of pages that can be used in a single read request */
 #define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
@@ -979,6 +980,38 @@ static inline void fuse_sync_bucket_dec(struct 
fuse_sync_bucket *bucket)
rcu_read_unlock();
 }
 
+#ifdef CONFIG_FUSE_DAX
+static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode 
*inode)
+{
+   return inode->dax;
+}
+
+static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc)
+{
+   return fc->dax_mode;
+}
+
+static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn *fc)
+{
+   return fc->dax;
+}
+#else
+static inline struct fuse_inode_dax *fuse_inode_get_dax(struct fuse_inode 
*inode)
+{
+   return NULL;
+}
+
+static inline enum fuse_dax_mode fuse_conn_get_dax_mode(struct fuse_conn *fc)
+{
+   return FUSE_DAX_INODE_DEFAULT;
+}
+
+static inline struct fuse_conn_dax *fuse_conn_get_dax(struct fuse_conn *fc)
+{
+   return NULL;
+}
+#endif
+
 /** Device operations */
 extern const struct file_operations fuse_dev_operations;
 
@@ -1324,7 +1357,8 @@ void fuse_free_conn(struct fuse_conn *fc);
 
 /* dax.c */
 
-#define FUSE_IS_DAX(inode) (IS_ENABLED(CONFIG_FUSE_DAX) && IS_DAX(inode))
+#define fuse_dax_is_supported()(IS_ENABLED(CONFIG_FUSE_DAX) && 
dax_is_supported())
+#define FUSE_IS_DAX(inode) (fuse_dax_is_supported() && IS_DAX(inode))
 
 ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to);
 ssize_t fuse_dax_write_iter(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2a6d44f91729..030e6ce5486d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -108,7 +108,7 @@ static struct inode *fuse_alloc_inode(struct super_block 
*sb)
if (!fi->forget)
goto out_free;
 
-   if (IS_ENABLED(CONFIG_FUSE_DAX) && !fuse_dax_inode_alloc(sb, fi))
+   if (fuse_dax_is_supported() && !fuse_dax_inode_alloc(sb, fi))
goto out_free_forget;
 
	return &fi->inode;
@@ -126,9 +126,8 @@ static void fuse_free_inode(struct inode *inode)
 
	mutex_destroy(&fi->mutex);
kfree(fi->forget);
-#ifdef CONFIG_FUSE_DAX
-   kfree(fi->dax);
-#endif
+   if (fuse_dax_is_supported())
+   kfree(fuse_inode_get_dax(fi));
kmem_cache_free(fuse_inode_cachep, fi);
 }
 
@@ -361,7 +360,7 @@ void fuse_change_attributes(struct inode *inode, struct 
fuse_attr *attr,
invalidate_inode_pages2(inode->i_mapping);
}
 
-   if (IS_ENABLED(CONFIG_FUSE_DAX))
+   if (fuse_dax_is_supported())
fuse_dax_dontcache(inode, attr->flags);
 }
 
@@ -856,14 +855,16 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
}
-#ifdef CONFIG_FUSE_DAX
-   if (fc->dax_mode == FUSE_DAX_ALWAYS)
-   seq_puts(m, ",dax=always");
-   else if (fc->dax_mode == FUSE_DAX_NEVER)
-   seq_puts(m, ",dax=never");
-   else if (fc->dax_mode == FUSE_DAX_INODE_USER)
-   seq_puts(m, ",dax=inode");
-#endif
+   if (fuse_dax_is_supported()) {
+   enum fuse_dax_mode dax_mode = fuse_conn_get_dax_mode(fc);
+
+   if (dax_mode == FUSE_DAX_ALWAYS)
+   

[RFC PATCH 2/7] dax: Fix incorrect list of cache aliasing architectures

2024-01-29 Thread Mathieu Desnoyers
fs/Kconfig:FS_DAX prevents DAX from building on architectures with
virtually aliased dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased dcache.

This is a regression introduced in the v5.13 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no dcache
aliasing, and therefore should work fine with FS_DAX.

Use this instead in Kconfig to prevent FS_DAX from being built on
architectures with virtually aliased dcache:

  depends on !ARCH_HAS_CACHE_ALIASING

For architectures which detect dcache aliasing at runtime, introduce
a new dax_is_supported() static inline which uses "cache_is_aliasing()"
to figure out whether the environment has aliasing dcaches.

This new dax_is_supported() helper will be used in each filesystem
supporting the dax mount option to validate whether dax is indeed
supported.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/Kconfig  | 2 +-
 include/linux/dax.h | 9 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..6746fe403761 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,7 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
+   depends on !ARCH_HAS_CACHE_ALIASING
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..8c595b04deeb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 typedef unsigned long dax_entry_t;
 
@@ -78,6 +79,10 @@ static inline bool daxdev_mapping_supported(struct 
vm_area_struct *vma,
return false;
return dax_synchronous(dax_dev);
 }
+static inline bool dax_is_supported(void)
+{
+   return !cache_is_aliasing();
+}
 #else
 static inline void *dax_holder(struct dax_device *dax_dev)
 {
@@ -122,6 +127,10 @@ static inline size_t dax_recovery_write(struct dax_device 
*dax_dev,
 {
return 0;
 }
+static inline bool dax_is_supported(void)
+{
+   return false;
+}
 #endif
 
 void set_dax_nocache(struct dax_device *dax_dev);
-- 
2.39.2




[RFC PATCH 4/7] ext2: Use dax_is_supported()

2024-01-29 Thread Mathieu Desnoyers
Use dax_is_supported() to validate whether the architecture has
virtually aliased caches at mount time.

This is relevant for architectures which require a dynamic check
to validate whether they have virtually aliased data caches
(ARCH_HAS_CACHE_ALIASING_DYNAMIC=y).

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Jan Kara 
Cc: linux-e...@vger.kernel.org
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 fs/ext2/super.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 01f9addc8b1f..0398e7a90eb6 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -585,13 +585,13 @@ static int parse_options(char *options, struct 
super_block *sb,
set_opt(opts->s_mount_opt, XIP);
fallthrough;
case Opt_dax:
-#ifdef CONFIG_FS_DAX
-   ext2_msg(sb, KERN_WARNING,
-   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
-   set_opt(opts->s_mount_opt, DAX);
-#else
-   ext2_msg(sb, KERN_INFO, "dax option not supported");
-#endif
+   if (dax_is_supported()) {
+   ext2_msg(sb, KERN_WARNING,
+"DAX enabled. Warning: EXPERIMENTAL, 
use at your own risk");
+   set_opt(opts->s_mount_opt, DAX);
+   } else {
+   ext2_msg(sb, KERN_INFO, "dax option not 
supported");
+   }
break;
 
 #if defined(CONFIG_QUOTA)
-- 
2.39.2




[RFC PATCH 1/7] Introduce cache_is_aliasing() across all architectures

2024-01-29 Thread Mathieu Desnoyers
Introduce a generic way to query whether the dcache is virtually aliased
on all architectures. Its purpose is to ensure that subsystems which
are incompatible with virtually aliased caches (e.g. FS_DAX) can
reliably query this.

For dcache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:
   (ARCH_HAS_CACHE_ALIASING=y)

* arm V4, V5 (CPU_CACHE_VIVT)
* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The dcache aliasing depends on querying CPU state at runtime:
   (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y)

* arm V6, V6K (CPU_CACHE_VIPT) (cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

* arm V7, V7M (unless ARM V6 or V6K are also present) (CPU_CACHE_VIPT)
* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
 commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
---
 arch/arc/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  3 +++
 arch/arm/mm/Kconfig |  2 ++
 arch/csky/Kconfig   |  1 +
 arch/m68k/Kconfig   |  1 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/sh/Kconfig |  1 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h   |  8 
 mm/Kconfig  | 10 ++
 17 files changed, 75 insertions(+)
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..969e6740bcf7 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
def_bool y
select ARC_TIMERS
+   select ARCH_HAS_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h
index e8c30430be33..b03054b35c74 100644
--- a/arch/arm/include/asm/cachetype.h
+++ b/arch/arm/include/asm/cachetype.h
@@ -16,6 +16,9 @@ extern unsigned int cacheid;
 #define cache_is_vipt()cacheid_is(CACHEID_VIPT)
 #define cache_is_vipt_nonaliasing()cacheid_is(CACHEID_VIPT_NONALIASING)
 #define cache_is_vipt_aliasing()   cacheid_is(CACHEID_VIPT_ALIASING)
+#ifdef CONFIG_ARCH_HAS_CACHE_ALIASING_DYNAMIC
+#define cache_is_aliasing()cache_is_vipt_aliasing()
+#endif
 #define icache_is_vivt_asid_tagged()   cacheid_is(CACHEID_ASID_TAGGED)
 #define icache_is_vipt_aliasing()  cacheid_is(CACHEID_VIPT_I_ALIASING)
 #define icache_is_pipt()   cacheid_is(CACHEID_PIPT)
diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig
index c164cde50243..23af93cdc03d 100644
--- a/arch/arm/mm/Kconfig
+++ b/arch/arm/mm/Kconfig
@@ -539,9 +539,11 @@ config CPU_CACHE_NOP
bool
 
 config CPU_CACHE_VIVT
+   select ARCH_HAS_CACHE_ALIASING
bool
 
 config CPU_CACHE_VIPT
+   select ARCH_HAS_CACHE_ALIASING_DYNAMIC if CPU_V6 || CPU_V6K
bool
 
 config CPU_CACHE_FA
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index cf2a6fd7dff8..439d7640deb8 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -2,6 +2,7 @@
 config CSKY
def_bool y
select ARCH_32BIT_OFF_T
+   select ARCH_HAS_CACHE_ALIASING
select ARCH_HAS_DMA_PREP_COHERENT
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_SYNC_DMA_FOR_CPU
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 4b3e93cac723..216338704f0a 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -3,6 +3,7 @@ config M68K
bool
default y
select ARCH_32BIT_OFF_T
+   select ARCH_HAS_CACHE_ALIASING
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_CPU_FINALIZE_INIT if MMU

[RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression

2024-01-29 Thread Mathieu Desnoyers
This commit, introduced in v5.13, prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased dcaches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")

It used to work fine before: I have customers using dax over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It
touches the pages through their linear mapping, which is not consistent
with the userspace mappings on virtually aliased dcaches. 

This patch series introduces cache_is_aliasing() with new Kconfig
options:

  * ARCH_HAS_CACHE_ALIASING
  * ARCH_HAS_CACHE_ALIASING_DYNAMIC

and implements it for all architectures. The "DYNAMIC" variant implements
cache_is_aliasing() as a runtime check, which is what is needed on
architectures like 32-bit ARMV6 and ARMV6K.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.
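
To make the intent of the two options concrete, the generic fallback can
reduce to compile-time constants, and only the "DYNAMIC" architectures pay a
runtime check. A rough sketch of the idea (the real hunk lives in
include/linux/cacheinfo.h in patch 1 and may differ in detail):

  #ifndef CONFIG_ARCH_HAS_CACHE_ALIASING_DYNAMIC
  # ifdef CONFIG_ARCH_HAS_CACHE_ALIASING
  #  define cache_is_aliasing()   true
  # else
  #  define cache_is_aliasing()   false
  # endif
  #endif
  /* DYNAMIC architectures provide their own cache_is_aliasing() in
   * asm/cachetype.h, as the ARM hunk in patch 1 does. */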

Feedback is welcome,

Thanks,

Mathieu

Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev

Mathieu Desnoyers (7):
  Introduce cache_is_aliasing() across all architectures
  dax: Fix incorrect list of cache aliasing architectures
  erofs: Use dax_is_supported()
  ext2: Use dax_is_supported()
  ext4: Use dax_is_supported()
  fuse: Introduce fuse_dax_is_supported()
  xfs: Use dax_is_supported()

 arch/arc/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  3 ++
 arch/arm/mm/Kconfig |  2 ++
 arch/csky/Kconfig   |  1 +
 arch/m68k/Kconfig   |  1 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/sh/Kconfig |  1 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 fs/Kconfig  |  2 +-
 fs/erofs/super.c| 10 +++---
 fs/ext2/super.c | 14 
 fs/ext4/super.c | 52 ++---
 fs/fuse/file.c  |  2 +-
 fs/fuse/fuse_i.h| 36 +++-
 fs/fuse/inode.c | 47 +-
 fs/fuse/virtio_fs.c |  4 +--
 fs/xfs/xfs_super.c  | 20 +++
 include/linux/cacheinfo.h   |  8 +
 include/linux/dax.h |  9 +
 mm/Kconfig  | 10 ++
 27 files changed, 198 insertions(+), 73 deletions(-)
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

-- 
2.39.2




Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 12:25, Steven Rostedt  wrote:
>
> > So the fundamental bug I now find is that eventfs_root_lookup() gets a
> > target dentry, and for some unfathomable reason it then does
> >
> > ret = simple_lookup(dir, dentry, flags);
> >
> > on it. Which is *completely* broken, because what "simple_lookup()"
> > does is just say "oh, you didn't have a dentry of this kind before, so
> > clearly a lookup must be a non-existent file". Remember: this is for
> > 'tmpfs' kinds of filesystems where the dentry cache contains *ALL*
> > files.
>
> Sorry, I don't really understand what you mean by "ALL files"? You mean
> that all files in the pseudo file system has a dentry to it (like debugfs,
> and the rest of tracefs)?

Yes.

So the whole - and *ONLY* - point of 'simple_lookup()' is for
filesystems like tmpfs, or like debugfs or other filesystems like
that, which never actually *need* to look anything up, because
everything is already cached in the dentry tree.

That's what the "simple" part of the simple functions mean. They are
simple from a dcache standpoint, because the dcache is all there is.

End result: what simple_lookup() does is say "oh, you didn't have the
file, so it's by definition a negative dentry", and thus all it does
is to do "d_add(dentry, NULL)".

Anyway, removing this was painful. I initially thought "I'll just
remove the calls". But it all ended up cascading into "that's also
wrong".

So now I have a patch that tries to fix this all up, and it looks like this:

 1 file changed, 50 insertions(+), 219 deletions(-)

because it basically removed all the old code, and replaced it with
much simpler code.

I'm including the patch here as an attachment, but I want to note very
clearly that this *builds* for me, and it looks a *lot* more obvious
and correct than the old code did, but I haven't tested it. AT ALL.

Also note that it depends on my previous patches, so I guess I'll
include them here again just to make it unambiguous.

Finally - this does *not* fix up the refcounting. I still think the
SRCU stuff is completely broken. But that's another headache. But at
least now the *lookup* parts look like they DTRT wrt eventfs_mutex.

The SRCU logic from the directory iteration parts still needs crapectomy.

AGAIN: these patches (ie particularly that last one - 0004) were all
done entirely "blindly" - I've looked at the code, and fixed the bugs
and problems I've seen by pure code inspection.

That's great, but it really means that it's all untested. It *looks*
better than the old code, but there may be some silly gotcha that I
have missed.

Linus
From b1f487acf6f4e9093d8b0fa00f864a6d07a3c4c2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Sat, 27 Jan 2024 13:27:01 -0800
Subject: [PATCH 2/4] tracefs: avoid using the ei->dentry pointer unnecessarily

The eventfs_find_events() code tries to walk up the tree to find the
event directory that a dentry belongs to, in order to then find the
eventfs inode that is associated with that event directory.

However, it uses an odd combination of walking the dentry parent,
looking up the eventfs inode associated with that, and then looking up
the dentry from there.  Repeat.

But the code shouldn't have back-pointers to dentries in the first
place, and it should just walk the dentry parenthood chain directly.

Similarly, 'set_top_events_ownership()' looks up the dentry from the
eventfs inode, but the only reason it wants a dentry is to look up the
superblock in order to look up the root dentry.

But it already has the real filesystem inode, which has that same
superblock pointer.  So just pass in the superblock pointer using the
information that's already there, instead of looking up extraneous data
that is irrelevant.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 26 --
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1c3dd0ad4660..2d128bedd654 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry)
+static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *root;
 
 	/* Only update if the "events" was on the top level */
 	if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
 	/* Get the tracefs root inode. */
-	inode = d_inode(dentry->d_sb->s_root);
-	ei->attr.uid = inode->i_uid;
-	ei->attr.gid = inode->i_gid;
+	root = d_inode(sb->s_root);
+	ei->attr.uid = root->i_uid;
+	ei->attr.gid = root->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
 {
 	struct tracefs_inode *ti = get_tracefs(inode);
 	struct eventfs_inode *ei = ti->private;
-	struct dentry *dentry;
 
 	/* The top events directory doesn't get 

Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 11:51:52 -0800
Linus Torvalds  wrote:

> On Mon, 29 Jan 2024 at 11:24, Linus Torvalds
>  wrote:
> >
> > So the patch was completely broken. Here's the one that should
> > actually compile (although still not actually *tested*).  
> 
> Note that this fixes the d_instantiate() ordering wrt initializing the inode.
> 
> But as I look up the call chain, I see many more fundamental mistakes.
> 
> Steven - the reason you think that the VFS doesn't have documentation
> is that we *do* have tons of documentation, but it's of the kind "Here
> is what you should do".
> 
> It is *not* of the kind that says "You messed up and did something
> else, and how do you recover from it?".

When I first worked on this, I did read all the VFS documentation, and it
was difficult to understand what I needed and what I didn't. It is focused
on a real file system and not a pseudo ones. Another mistake I made was
thinking that debugfs was doing things the "right" way as well. And having
a good idea of how debugfs worked, and thinking it was correct, just made
reading VFS documentation even more difficult as I couldn't relate what I
knew (about debugfs) with what was being explained. I thought it was
perfectly fine to use dentry as a handle for the file system. I did until
you told me it wasn't. That made a profound change in my understanding of
how things are supposed to work.

> 
> So the fundamental bug I now find is that eventfs_root_lookup() gets a
> target dentry, and for some unfathomable reason it then does
> 
> ret = simple_lookup(dir, dentry, flags);
> 
> on it. Which is *completely* broken, because what "simple_lookup()"
> does is just say "oh, you didn't have a dentry of this kind before, so
> clearly a lookup must be a non-existent file". Remember: this is for
> 'tmpfs' kinds of filesystems where the dentry cache contains *ALL*
> files.

Sorry, I don't really understand what you mean by "ALL files"? You mean
that all files in the pseudo file system has a dentry to it (like debugfs,
and the rest of tracefs)?

> 
> For the tracefs kind of filesystem, it's TOTALLY BOGUS. What the
> "simple_lookup()" will do is just a plain
> 
> d_add(dentry, NULL);
> 
> and nothing else. And guess what *that* does? It basically
> instantiates a negative dentry, telling all other lookups that the
> path does not exist.
> 
> So if you have two concurrent lookups, one will do that
> simple_lookup(), and the other will then - depending on timing -
> either see the negative dentry and return -ENOENT, or - if it comes in
> a bit later - see the new inode that then later gets added by the
> first lookup with d_instantiate().
> 
> See? That simple_lookup() is not just unnecessary, but it's also
> actively completely WRONG. Because it instantiates a NULL pointer,
> other processes that race with the lookup may now end up saying "that
> file doesn't exist", even though it should.
> 
> Basically, you can't use *any* of the "simple" filesystem helpers.
> Because they are all designed for that "the dentry tree is all there
> is" case.

Yeah, the above code did come about with me not fully understanding the
above. It's not that it wasn't documented, but I admit, when I read the VFS
documentation, I had a lot of trouble trying to make sense of things like
negative dentries and how they relate.

I now have a much better understanding of most of this, thanks to our
discussion here, and also knowing that using dentry as the main handle to a
file is *not* how to do it. When thinking it was, it made things much more
difficult to comprehend.

-- Steve



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2024-01-29 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
> Declaring rodata_enabled and mark_rodata_ro() at all time
> helps removing related #ifdefery in C files.
> 
> Signed-off-by: Christophe Leroy 

Very nice cleanup, thanks!, applied and pushed

  Luis



Re: [PATCH 1/3] module: Use set_memory_rox()

2024-01-29 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 08:24:23AM +0100, Christophe Leroy wrote:
> A couple of architectures seem concerned about calling set_memory_ro()
> and set_memory_x() too frequently and have implemented a version of
> set_memory_rox(), see commit 60463628c9e0 ("x86/mm: Implement native
> set_memory_rox()") and commit 22e99fa56443 ("s390/mm: implement
> set_memory_rox()")
> 
> Use set_memory_rox() in modules when STRICT_MODULES_RWX is set.
> 
> Signed-off-by: Christophe Leroy 

Nice simplification. I applied all 3 patches and pushed!

  Luis



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 11:24, Linus Torvalds
 wrote:
>
> So the patch was completely broken. Here's the one that should
> actually compile (although still not actually *tested*).

Note that this fixes the d_instantiate() ordering wrt initializing the inode.

But as I look up the call chain, I see many more fundamental mistakes.

Steven - the reason you think that the VFS doesn't have documentation
is that we *do* have tons of documentation, but it's of the kind "Here
is what you should do".

It is *not* of the kind that says "You messed up and did something
else, and how do you recover from it?".

So the fundamental bug I now find is that eventfs_root_lookup() gets a
target dentry, and for some unfathomable reason it then does

ret = simple_lookup(dir, dentry, flags);

on it. Which is *completely* broken, because what "simple_lookup()"
does is just say "oh, you didn't have a dentry of this kind before, so
clearly a lookup must be a non-existent file". Remember: this is for
'tmpfs' kinds of filesystems where the dentry cache contains *ALL*
files.

For the tracefs kind of filesystem, it's TOTALLY BOGUS. What the
"simple_lookup()" will do is just a plain

d_add(dentry, NULL);

and nothing else. And guess what *that* does? It basically
instantiates a negative dentry, telling all other lookups that the
path does not exist.

So if you have two concurrent lookups, one will do that
simple_lookup(), and the other will then - depending on timing -
either see the negative dentry and return -ENOENT, or - if it comes in
a bit later - see the new inode that then later gets added by the
first lookup with d_instantiate().

See? That simple_lookup() is not just unnecessary, but it's also
actively completely WRONG. Because it instantiates a NULL pointer,
other processes that race with the lookup may now end up saying "that
file doesn't exist", even though it should.

Basically, you can't use *any* of the "simple" filesystem helpers.
Because they are all designed for that "the dentry tree is all there
is" case.

Linus



[PATCH v2 1/4] selftests: add new kallsyms selftests

2024-01-29 Thread Luis Chamberlain
We lack find_symbol() selftests, so add one. This lets us easily stress
test improvements or optimizations on find_symbol(). It also inherently
allows us to test the limits of kallsyms on Linux today.

We test a pathological use case for kallsyms by introducing modules
which are automatically written for us with a larger number of symbols.
We have 4 kallsyms test modules:

A: has KALLSYSMS_NUMSYMS exported symbols
B: uses one of A's symbols
C: adds KALLSYMS_SCALE_FACTOR * KALLSYSMS_NUMSYMS exported
D: adds 2 * the symbols than C

By using anything much larger than a KALLSYSMS_NUMSYMS of 10,000 and a
KALLSYMS_SCALE_FACTOR of 8 we segfault today. So we're capped at
around 16 symbols somehow today. We can inspect that issue at
our leisure later, but for now the real value of this test is that
it will easily allow us to test improvements on find_symbol().

On x86_64 we can use perf; for other architectures we just use 'time'
and allow for customizations. For example, a future enhancement could
be done for parisc to check for unaligned accesses, which trigger
special exception handler assembler code inside the kernel.
The negative impact on performance is so large on parisc that it
keeps track of these accesses in /proc/cpuinfo as UAH:

IRQ:   CPU0   CPU1
3:   1332  0 SuperIO  ttyS0
7:1270013  0 SuperIO  pata_ns87415
64:  320023012  320021431 CPU  timer
65:   17080507   20624423 CPU  IPI
UAH:   10948640  58104   Unaligned access handler traps

While at it, this tidies up lib/ test modules to allow us to have
a new directory for them. The amount of test modules under lib/
is insane.

This should also hopefully showcase how to start doing basic automated
test module generation, which may be more useful for more complex
cases in the future.
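
To give an idea of what the generator produces, a generated test module
boils down to something like the following (a sketch only; the symbol names
are made up and gen_test_kallsyms.sh is the authoritative version):

  #include <linux/module.h>

  #define DEFINE_TEST_SYM(n) \
          int test_kallsyms_sym_##n; \
          EXPORT_SYMBOL(test_kallsyms_sym_##n)

  DEFINE_TEST_SYM(0);
  DEFINE_TEST_SYM(1);
  /* ... repeated for as many symbols as the module is configured with ... */

  MODULE_LICENSE("GPL");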

Signed-off-by: Luis Chamberlain 
---
 lib/Kconfig.debug | 103 ++
 lib/Makefile  |   1 +
 lib/tests/Makefile|   1 +
 lib/tests/module/.gitignore   |   4 +
 lib/tests/module/Makefile |  15 ++
 lib/tests/module/gen_test_kallsyms.sh | 128 ++
 tools/testing/selftests/module/Makefile   |  12 ++
 tools/testing/selftests/module/config |   3 +
 tools/testing/selftests/module/find_symbol.sh |  81 +++
 9 files changed, 348 insertions(+)
 create mode 100644 lib/tests/Makefile
 create mode 100644 lib/tests/module/.gitignore
 create mode 100644 lib/tests/module/Makefile
 create mode 100755 lib/tests/module/gen_test_kallsyms.sh
 create mode 100644 tools/testing/selftests/module/Makefile
 create mode 100644 tools/testing/selftests/module/config
 create mode 100755 tools/testing/selftests/module/find_symbol.sh

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 97ce28f4d154..29db47ca251f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2835,6 +2835,109 @@ config TEST_KMOD
 
  If unsure, say N.
 
+config TEST_RUNTIME
+   bool
+
+config TEST_RUNTIME_MODULE
+   bool
+
+config TEST_KALLSYMS
+   tristate "module kallsyms find_symbol() test"
+   depends on m
+   select TEST_RUNTIME
+   select TEST_RUNTIME_MODULE
+   select TEST_KALLSYMS_A
+   select TEST_KALLSYMS_B
+   select TEST_KALLSYMS_C
+   select TEST_KALLSYMS_D
+   help
+ This allows us to stress test find_symbol() through the kallsyms
+ used to place symbols on the kernel ELF kallsyms and modules kallsyms
+ where we place kernel symbols such as exported symbols.
+
+ We have four test modules:
+
+ A: has KALLSYSMS_NUMSYMS exported symbols
+ B: uses one of A's symbols
+ C: adds KALLSYMS_SCALE_FACTOR * KALLSYSMS_NUMSYMS exported
+ D: adds 2 * the symbols than C
+
+ We stress test find_symbol() through two means:
+
+ 1) Upon load of B it will trigger simplify_symbols() to look for the
+ one symbol it uses from the module A with tons of symbols. This is an
+ indirect way for us to have B call resolve_symbol_wait() upon module
+ load. This will eventually call find_symbol() which will eventually
+ try to find the symbols used with find_exported_symbol_in_section().
+ find_exported_symbol_in_section() uses bsearch() so a binary search
+ for each symbol. Binary search will at worst be O(log(n)) so the
+ larger TEST_MODULE_KALLSYSMS the worse the search.
+
+ 2) The selftests should load C first, before B. Upon B's load towards
+ the end right before we call module B's init routine we get
+ complete_formation() called on the module. That will first check
+ for duplicate symbols with the call to verify_exported_symbols().
+ That is when we'll force iteration on module C's insane symbol list.
+ Since it has 10 * KALLSYMS_NUMSYMS it means we can 

[PATCH v2 4/4] modules: Add missing entry for __ex_table

2024-01-29 Thread Luis Chamberlain
From: Helge Deller 

The entry for __ex_table was missing, which may make __ex_table
become 1- or 2-byte aligned in modules.
Add the entry to ensure it gets 32-bit aligned.

Signed-off-by: Helge Deller 
Signed-off-by: Luis Chamberlain 
---
 scripts/module.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index b00415a9ff27..488f61b156b2 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -26,6 +26,7 @@ SECTIONS {
.altinstructions0 : ALIGN(8) { KEEP(*(.altinstructions)) }
__bug_table 0 : ALIGN(8) { KEEP(*(__bug_table)) }
__jump_table0 : ALIGN(8) { KEEP(*(__jump_table)) }
+   __ex_table  0 : ALIGN(4) { KEEP(*(__ex_table)) }
 
__patchable_function_entries : { *(__patchable_function_entries) }
 
-- 
2.42.0




[PATCH v2 0/3] modules: few of alignment fixes

2024-01-29 Thread Luis Chamberlain
Helge Deller had written a series of patches to fix a few
misalignment annotations in the kernel [0]. Three of these
patches were tagged as being stable candidates. Because of these
annotations I suggested proof of the impact, however we did not
easily have a way to verify the value / impact of fixing a few
misalignment changes in the kernel for modules.

For the more hotpath alignment fix Helge had, I suggested we could
easily test this by stress testing find_symbol() with a few sets of
modules. This adds such tests, allowing us both to test the impact
of such a possible fix and, while at it, to test future improvements
on the find_symbol() path.

Changes on this v2:

 - Adds new selftest for kallsyms
 - Drops patch #1 as Masahiro Yamada already applied it to linux-kbuild/fixes
 - Removes stable tags
 - Drops patch #3 as it was not needed
 - Adds a new patch with the issues noted by Helge as a fix
   to commit f3304ecd7f06 ("linux/export: use inline assembler to
   populate symbol CRCs") as noted by Masahiro Yamada
 - Adds selftest impact on x86_64 for patch #2; this really should
   be tested on parisc though, as that's an example architecture
   where we could perhaps see more improvement

[0] https://lkml.kernel.org/r/2023111814.139916-1-del...@kernel.org

Masahiro, if there are no issues feel free to take this, or I can take them
in too via the modules-next tree. Lemme know!

Helge Deller (2):
  modules: Ensure 64-bit alignment on __ksymtab_* sections
  modules: Add missing entry for __ex_table

Luis Chamberlain (2):
  selftests: add new kallsyms selftests
  vmlinux.lds.h: add missing alignment for symbol CRCs

 include/linux/export-internal.h   |   1 +
 lib/Kconfig.debug | 103 ++
 lib/Makefile  |   1 +
 lib/tests/Makefile|   1 +
 lib/tests/module/.gitignore   |   4 +
 lib/tests/module/Makefile |  15 ++
 lib/tests/module/gen_test_kallsyms.sh | 128 ++
 scripts/module.lds.S  |   9 +-
 tools/testing/selftests/module/Makefile   |  12 ++
 tools/testing/selftests/module/config |   3 +
 tools/testing/selftests/module/find_symbol.sh |  81 +++
 11 files changed, 354 insertions(+), 4 deletions(-)
 create mode 100644 lib/tests/Makefile
 create mode 100644 lib/tests/module/.gitignore
 create mode 100644 lib/tests/module/Makefile
 create mode 100755 lib/tests/module/gen_test_kallsyms.sh
 create mode 100644 tools/testing/selftests/module/Makefile
 create mode 100644 tools/testing/selftests/module/config
 create mode 100755 tools/testing/selftests/module/find_symbol.sh

-- 
2.42.0




[PATCH v2 3/4] vmlinux.lds.h: add missing alignment for symbol CRCs

2024-01-29 Thread Luis Chamberlain
Commit f3304ecd7f06 ("linux/export: use inline assembler to populate
symbol CRCs") fixed an issue with unexpected padding by adding
inline assembly to directly specify the desired data layout.
It however forgot to add the alignment; fix that.

Reported-by: Helge Deller 
Fixes: f3304ecd7f06 ("linux/export: use inline assembler to populate symbol 
CRCs")
Signed-off-by: Luis Chamberlain 
---
 include/linux/export-internal.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/export-internal.h b/include/linux/export-internal.h
index 69501e0ec239..51b8cf3f60ef 100644
--- a/include/linux/export-internal.h
+++ b/include/linux/export-internal.h
@@ -61,6 +61,7 @@
 
 #define SYMBOL_CRC(sym, crc, sec)   \
asm(".section \"___kcrctab" sec "+" #sym "\",\"a\"" "\n" \
+   ".balign 4" "\n" \
"__crc_" #sym ":"   "\n" \
".long " #crc   "\n" \
".previous" "\n")
-- 
2.42.0




[PATCH v2 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2024-01-29 Thread Luis Chamberlain
From: Helge Deller 

On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
(e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
64-bit pointers into the __ksymtab* sections.
Make sure that those sections will be correctly aligned at module link time,
otherwise unaligned memory accesses may happen at runtime.

The __kcrctab* sections store 32-bit entities, so use ALIGN(4) for those.

Testing with the kallsyms selftest on x86_64 we see savings of about
1,958,153 ns in the worst case, which may or may not be within noise.
Testing on parisc would be useful and welcome.

On x86_64 before:

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

86,430,119 ns   duration_time
84,407,000 ns   system_time
   213  page-faults

   0.086430119 seconds time elapsed

   0.0 seconds user
   0.084407000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

85,777,474 ns   duration_time
82,581,000 ns   system_time
   212  page-faults

   0.085777474 seconds time elapsed

   0.0 seconds user
   0.082581000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

87,906,053 ns   duration_time
87,939,000 ns   system_time
   212  page-faults

   0.087906053 seconds time elapsed

   0.0 seconds user
   0.087939000 seconds sys

After:

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

82,925,631 ns   duration_time
83,000,000 ns   system_time
   212  page-faults

   0.082925631 seconds time elapsed

   0.0 seconds user
   0.08300 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

87,776,380 ns   duration_time
86,678,000 ns   system_time
   213  page-faults

   0.087776380 seconds time elapsed

   0.0 seconds user
   0.086678000 seconds sys

 Performance counter stats for '/sbin/modprobe test_kallsyms_b':

85,947,900 ns   duration_time
82,006,000 ns   system_time
   212  page-faults

   0.085947900 seconds time elapsed

   0.0 seconds user
   0.082006000 seconds sys

Signed-off-by: Helge Deller 
[mcgrof: ran kallsyms selftests on x86_64]
Signed-off-by: Luis Chamberlain 
---
 scripts/module.lds.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..b00415a9ff27 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -15,10 +15,10 @@ SECTIONS {
*(.discard.*)
}
 
-   __ksymtab   0 : { *(SORT(___ksymtab+*)) }
-   __ksymtab_gpl   0 : { *(SORT(___ksymtab_gpl+*)) }
-   __kcrctab   0 : { *(SORT(___kcrctab+*)) }
-   __kcrctab_gpl   0 : { *(SORT(___kcrctab_gpl+*)) }
+   __ksymtab   0 : ALIGN(8) { *(SORT(___ksymtab+*)) }
+   __ksymtab_gpl   0 : ALIGN(8) { *(SORT(___ksymtab_gpl+*)) }
+   __kcrctab   0 : ALIGN(4) { *(SORT(___kcrctab+*)) }
+   __kcrctab_gpl   0 : ALIGN(4) { *(SORT(___kcrctab_gpl+*)) }
 
.ctors  0 : ALIGN(8) { *(SORT(.ctors.*)) *(.ctors) }
.init_array 0 : ALIGN(8) { *(SORT(.init_array.*)) 
*(.init_array) }
-- 
2.42.0




Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
 wrote:
>
> IOW, I think the right fix is really just this:

Oh, for some reason I sent out the original patch I had which didn't
fix the create_dir() case.

So that patch was missing the important hunk that added the

ti->flags = TRACEFS_EVENT_INODE;
ti->private = ei;

to create_dir() (to match the removal in eventfs_post_create_dir()).

I had incorrectly put it in the create_file() case, which should just
set ->private to NULL, afaik.

So the patch was completely broken. Here's the one that should
actually compile (although still not actually *tested*).

   Linus
From 6e5db10ebc96ebe6b9707c9938c450f51e9a3ae0 Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Mon, 29 Jan 2024 11:06:32 -0800
Subject: [PATCH] eventfsfs: initialize the tracefs inode properly

The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.

And the ->flags field was initialized incorrectly with a '|=', when the
old value was stale.  It should have just been a straight assignment.

Move the field initializations up to before the d_instantiate, and fix
the use of uninitialized data.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2d128bedd654..c0d977e6c0f2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_ino = EVENTFS_FILE_INODE_INO;
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = NULL;			// Directories have 'ei', files not
+
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = ei;
 
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
@@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 static void eventfs_post_create_dir(struct eventfs_inode *ei)
 {
 	struct eventfs_inode *ei_child;
-	struct tracefs_inode *ti;
 
 	lockdep_assert_held(&eventfs_mutex);
 
@@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei)
   srcu_read_lock_held(&eventfs_srcu)) {
 		ei_child->d_parent = ei->dentry;
 	}
-
-	ti = get_tracefs(ei->dentry->d_inode);
-	ti->private = ei;
 }
 
 /**
@@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
	INIT_LIST_HEAD(&ei->list);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
 	ti->private = ei;
 
 	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
-- 
2.43.0.5.g38fb137bdb



Re: [PATCH v2 1/4] remoteproc: Add TEE support

2024-01-29 Thread Mathieu Poirier
On Thu, Jan 18, 2024 at 11:04:30AM +0100, Arnaud Pouliquen wrote:
> From: Arnaud Pouliquen 
> 
> Add a remoteproc TEE (Trusted Execution Environment) device
> that will be probed by the TEE bus. If the associated Trusted
> application is supported on secure part this device offers a client
> interface to load a firmware in the secure part.
> This firmware could be authenticated and decrypted by the secure
> trusted application.
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
>  drivers/remoteproc/Kconfig  |   9 +
>  drivers/remoteproc/Makefile |   1 +
>  drivers/remoteproc/tee_remoteproc.c | 393 
>  include/linux/tee_remoteproc.h  |  99 +++
>  4 files changed, 502 insertions(+)
>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
>  create mode 100644 include/linux/tee_remoteproc.h
> 
> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
> index 48845dc8fa85..85299606806c 100644
> --- a/drivers/remoteproc/Kconfig
> +++ b/drivers/remoteproc/Kconfig
> @@ -365,6 +365,15 @@ config XLNX_R5_REMOTEPROC
>  
> It's safe to say N if not interested in using RPU r5f cores.
>  
> +
> +config TEE_REMOTEPROC
> + tristate "trusted firmware support by a TEE application"
> + depends on OPTEE
> + help
> +   Support for trusted remote processors firmware. The firmware
> +   authentication and/or decryption are managed by a trusted application.
> +   This can be either built-in or a loadable module.
> +
>  endif # REMOTEPROC
>  
>  endmenu
> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
> index 91314a9b43ce..fa8daebce277 100644
> --- a/drivers/remoteproc/Makefile
> +++ b/drivers/remoteproc/Makefile
> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)   += rcar_rproc.o
>  obj-$(CONFIG_ST_REMOTEPROC)  += st_remoteproc.o
>  obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o
>  obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o
> +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o
>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)   += ti_k3_dsp_remoteproc.o
>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC)+= ti_k3_r5_remoteproc.o
>  obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o
> diff --git a/drivers/remoteproc/tee_remoteproc.c 
> b/drivers/remoteproc/tee_remoteproc.c
> new file mode 100644
> index ..49e1e0caf889
> --- /dev/null
> +++ b/drivers/remoteproc/tee_remoteproc.c
> @@ -0,0 +1,393 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) STMicroelectronics 2023 - All Rights Reserved
> + * Author: Arnaud Pouliquen 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "remoteproc_internal.h"
> +
> +#define MAX_TEE_PARAM_ARRY_MEMBER4
> +
> +/*
> + * Authentication of the firmware and load in the remote processor memory
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [in]   params[1].memref:  buffer containing the image of the 
> buffer
> + */
> +#define TA_RPROC_FW_CMD_LOAD_FW  1
> +
> +/*
> + * Start the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_START_FW 2
> +
> +/*
> + * Stop the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_STOP_FW  3
> +
> +/*
> + * Return the address of the resource table, or 0 if not found
> + * No check is done to verify that the address returned is accessible by
> + * the non secure context. If the resource table is loaded in a protected
> + * memory the access by the non secure context will lead to a data abort.
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [out]  params[1].value.a: 32bit LSB resource table memory address
> + * [out]  params[1].value.b: 32bit MSB resource table memory address
> + * [out]  params[2].value.a: 32bit LSB resource table memory size
> + * [out]  params[2].value.b: 32bit MSB resource table memory size
> + */
> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE4
> +
> +/*
> + * Return the address of the core dump
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [out] params[1].memref:   address of the core dump image if exist,
> + *   else return Null
> + */
> +#define TA_RPROC_FW_CMD_GET_COREDUMP 5
> +
> +struct tee_rproc_mem {
> + char name[20];
> + void __iomem *cpu_addr;
> + phys_addr_t bus_addr;
> + u32 dev_addr;
> + size_t size;
> +};
> +
> +struct tee_rproc_context {
> + struct list_head sessions;
> + struct tee_context *tee_ctx;
> + struct device *dev;
> +};
> +
> +static struct tee_rproc_context *tee_rproc_ctx;
> +
> +static void prepare_args(struct tee_rproc *trproc, 

Re: [PATCH 4/4] modules: Add missing entry for __ex_table

2024-01-29 Thread Luis Chamberlain
On Wed, Nov 22, 2023 at 11:18:14PM +0100, del...@kernel.org wrote:
> From: Helge Deller 
> 
> The entry for __ex_table was missing, which may make __ex_table
> become 1- or 2-byte aligned in modules.
> Add the entry to ensure it gets 32-bit aligned.
> 
> Signed-off-by: Helge Deller 
> Cc:  # v6.0+

Cc'ing stable was overkill, I'll remove it.

  Luis

> ---
>  scripts/module.lds.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/scripts/module.lds.S b/scripts/module.lds.S
> index b00415a9ff27..488f61b156b2 100644
> --- a/scripts/module.lds.S
> +++ b/scripts/module.lds.S
> @@ -26,6 +26,7 @@ SECTIONS {
>   .altinstructions0 : ALIGN(8) { KEEP(*(.altinstructions)) }
>   __bug_table 0 : ALIGN(8) { KEEP(*(__bug_table)) }
>   __jump_table0 : ALIGN(8) { KEEP(*(__jump_table)) }
> + __ex_table  0 : ALIGN(4) { KEEP(*(__ex_table)) }
>  
>   __patchable_function_entries : { *(__patchable_function_entries) }
>  
> -- 
> 2.41.0
> 
> 



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:44, Steven Rostedt  wrote:
>
> On Mon, 29 Jan 2024 09:40:06 -0800
> Linus Torvalds  wrote:
>
> > Now, I do agree that your locking is strange, and that should be fixed
> > *too*, but I think the above is the "right" fix for this particular
> > issue.
>
> Would you be OK if I did both as a "fix"?

See my crossed email - not dropping the mutex *is* actually a fix for
another piece of data.

  Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
 wrote:
>
> eventfs_create_events_dir() seems to have the same bug with ti->flags,
> btw, but got the ti->private initialization right.
>
> Funnily enough, create_file() got this right. I don't even understand
> why create_dir() did what it did.
>
> IOW, I think the right fix is really just this:

Actually, I think you have another uninitialized field here too:
'dentry->d_fsdata'.

And it looks like both create_file and create_dir got that wrong, but
eventfs_create_events_dir() got it right.

So you *also* need to do that

dentry->d_fsdata = ei;

before you do the d_instantiate().

Now, from a quick look, all the d_fsdata accesses *do* seem to be
protected by the eventfs_mutex, except

 (a) eventfs_create_events_dir() doesn't seem to take the mutex, but
gets the ordering right, so is ok

 (b) create_dir_dentry() drops the mutex in the middle, so the mutex
doesn't actually serialize anything

Not dropping the mutex in create_dir_dentry() *will* fix this bug, but
honestly, I'd suggest that in addition to not dropping the mutex, you
just fix the ordering too.

IOW, just do that

dentry->d_fsdata = ei;

before you do d_instantiate(), and now accessing d_fsdata is just
always safe and doesn't even need the mutex.

The whole "initialize everything before exposing it to others" is
simply just a good idea.

Linus



Re: [RESEND PATCH v2] modules: wait do_free_init correctly

2024-01-29 Thread Luis Chamberlain
On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote:
> The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> do_free_init() into a global workqueue instead of call_rcu(). So now
> rcu_barrier() can not ensure that do_free_init() has completed. We should
> wait for it via flush_work().
> 
> Without this fix, we still could encounter false positive reports in
> W+X checking, and rcu synchronization is unnecessary.

You didn't answer my question, which should be documented in the commit log.

Does this mean we never freed modules init because of this? If so then
your commit log should clearly explain that. It should also explain that
if true (you have to verify) then it means we were no longer saving
the memory we wished to save, and that is important for distributions
which do want to save anything on memory. You may want to do a general
estimate on how much that means these days on any desktop / server.

  Luis



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 12:44:36 -0500
Steven Rostedt  wrote:

> On Mon, 29 Jan 2024 09:40:06 -0800
> Linus Torvalds  wrote:
> 
> > Now, I do agree that your locking is strange, and that should be fixed
> > *too*, but I think the above is the "right" fix for this particular
> > issue.  
> 
> Would you be OK if I did both as a "fix"?
> 

In separate patches of course. And I may even give the same tags to both
patches as well.

-- Steve



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Mon, 29 Jan 2024 09:40:06 -0800
Linus Torvalds  wrote:

> Now, I do agree that your locking is strange, and that should be fixed
> *too*, but I think the above is the "right" fix for this particular
> issue.

Would you be OK if I did both as a "fix"?

-- Steve



Re: [PATCH v4 2/2] remoteproc: enhance rproc_put() for clusters

2024-01-29 Thread Tanmay Shah


On 1/26/24 11:38 AM, Bjorn Andersson wrote:
> On Wed, Jan 03, 2024 at 02:11:25PM -0800, Tanmay Shah wrote:
> > This patch enhances rproc_put() to support remoteproc clusters
> > with multiple child nodes as in rproc_get_by_phandle().
> > 
> > Signed-off-by: Tarak Reddy 
> > Signed-off-by: Tanmay Shah 
>
> As described in the first patch, this documents that Tarak first
> certified the origin of this patch, then you certify the origin as you
> handle the patch.
>
> But according to From: you're the author, so how could Tarak have
> certified the origin before you authored the patch?
>
> Either correct the author, or add Co-developed-by, if that's what
> happened.
>
> > ---
> >  drivers/remoteproc/remoteproc_core.c | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/remoteproc/remoteproc_core.c 
> > b/drivers/remoteproc/remoteproc_core.c
> > index 0b3b34085e2f..f276956f2c5c 100644
> > --- a/drivers/remoteproc/remoteproc_core.c
> > +++ b/drivers/remoteproc/remoteproc_core.c
> > @@ -2554,7 +2554,11 @@ EXPORT_SYMBOL(rproc_free);
> >   */
> >  void rproc_put(struct rproc *rproc)
> >  {
> > -   module_put(rproc->dev.parent->driver->owner);
> > +   if (rproc->dev.parent->driver)
> > +   module_put(rproc->dev.parent->driver->owner);
> > +   else
> > +   module_put(rproc->dev.parent->parent->driver->owner);
> > +
>
> This does however highlight a bug that was introduced by patch 1, please
> avoid this by squashing the two patches together (and use
> Co-developed-by as needed).

Thanks Bjorn for catching this. This change was originally developed by Tarak,
but I sent it upstream based on his patch, so I missed updating his name as
the author. I should update the author name.

However, if we are going to squash this into the first patch, then I think the
first patch's author will stay as it is.

Following action items on me for v5:

1) Fix commit text in first patch.

2) Squash second patch in first.

3) Add my s-o-b signature after Mathieu's

4) Add Tarak's s-o-b as well, as he developed the second patch.

Hope I got it all.


Thanks,

Tanmay

>
> Regards,
> Bjorn
>
> > put_device(&rproc->dev);
> >  }
> >  EXPORT_SYMBOL(rproc_put);
> > -- 
> > 2.25.1
> > 



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:01, Steven Rostedt  wrote:
>
> Thanks for the analysis. I have a patch that removes the dropping of the
> mutex over the create_dir/file() calls, and lockdep hasn't complained about
> it.
>
> I was going to add that to my queue for the next merge window along with
> other clean ups but this looks like it actually fixes a real bug. I'll move
> that over to my urgent queue and start testing it.

Note that it is *not* enough to just keep the mutex.

All the *users* need to get the mutex too.

Otherwise you have this:

CPU1:

   create_dir_dentry():
  mutex locked the whole time..
dentry = create_dir(ei, parent);
   does d_instantiate(dentry, inode);
eventfs_post_create_dir(ei);
   dentry->d_fsdata = ei;
mutex dropped;

but CPU2 can still come in, see the dentry immediately after the
"d_instantiate()", and do an "open()" or "stat()" on the dentry (which
will *not* cause a 'lookup()', since it's in the dentry cache), and
that will then cause either

 ->permission() -> eventfs_permission() -> set_top_events_ownership()

or

 ->get_attr() -> eventfs_get_attr() -> set_top_events_ownership()

and both of those will now do the dentry->inode->ei lookup. And
neither of them takes the mutex.

So then it doesn't even matter that you didn't drop the mutex in the
middle, because the users simply won't be serializing with it anyway.

So you'd have to make set_top_events_ownership() take the mutex around
it all too.

In fact, pretty much *any* use of "ti->private" needs the mutex.

Which is obviously a bit painful.

Honestly, I think the right model is to just make sure that the inode
is fully initialized when you do 'd_instantiate()'

The patch looks obvious, and I think this actually fixes *another*
bug, namely that the old

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;

was buggy, because 'ti->flags' was uninitialized before.

eventfs_create_events_dir() seems to have the same bug with ti->flags,
btw, but got the ti->private initialization right.

Funnily enough, create_file() got this right. I don't even understand
why create_dir() did what it did.

IOW, I think the right fix is really just this:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -328,7 +328,8 @@
inode->i_ino = EVENTFS_FILE_INODE_INO;

ti = get_tracefs(inode);
  - ti->flags |= TRACEFS_EVENT_INODE;
  + ti->flags = TRACEFS_EVENT_INODE;
  + ti->private = ei;
d_instantiate(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);
  @@ -513,7 +514,6 @@
   static void eventfs_post_create_dir(struct eventfs_inode *ei)
   {
struct eventfs_inode *ei_child;
  - struct tracefs_inode *ti;

lockdep_assert_held(&eventfs_mutex);

  @@ -523,9 +523,6 @@
 srcu_read_lock_held(&eventfs_srcu)) {
ei_child->d_parent = ei->dentry;
}
  -
  - ti = get_tracefs(ei->dentry->d_inode);
  - ti->private = ei;
   }

   /**
  @@ -943,7 +940,7 @@
INIT_LIST_HEAD(&ei->list);

ti = get_tracefs(inode);
  - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
  + ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
ti->private = ei;

inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;

which fixes the initialization errors with the 'ti' fields.

Now, I do agree that your locking is strange, and that should be fixed
*too*, but I think the above is the "right" fix for this particular
issue.

Linus



Re: [PATCH] media: dt-bindings: qcom,sc7280-venus: Allow one IOMMU entry

2024-01-29 Thread Conor Dooley
On Mon, Jan 29, 2024 at 08:48:54AM +0100, Luca Weiss wrote:
> Some SC7280-based boards crash when providing the "secure_non_pixel"
> context bank, so allow only one iommu in the bindings also.
> 
> Signed-off-by: Luca Weiss 

Do we have any idea why this happens? How is someone supposed to know
whether or not their system requires you to only provide one iommu?
Yes, a crash might be the obvious answer, but is there a way of knowing
without the crashes?

Cheers,
Conor.

> ---
> Reference:
> https://lore.kernel.org/linux-arm-msm/20231201-sc7280-venus-pas-v3-2-bc132dc5f...@fairphone.com/
> ---
>  Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml 
> b/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml
> index 8f9b6433aeb8..10c334e6b3dc 100644
> --- a/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml
> +++ b/Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml
> @@ -43,6 +43,7 @@ properties:
>- const: vcodec_bus
>  
>iommus:
> +minItems: 1
>  maxItems: 2
>  
>interconnects:
> 
> ---
> base-commit: 596764183be8ebb13352b281a442a1f1151c9b06
> change-id: 20240129-sc7280-venus-bindings-6e62a99620de
> 
> Best regards,
> -- 
> Luca Weiss 
> 




Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events

2024-01-29 Thread Beau Belgrave
On Fri, Jan 26, 2024 at 03:04:45PM -0500, Steven Rostedt wrote:
> On Fri, 26 Jan 2024 11:10:07 -0800
> Beau Belgrave  wrote:
> 
> > > OK, so the each different event has suffixed name. But this will
> > > introduce non C-variable name.
> > > 
> > > Steve, do you think your library can handle these symbols? It will
> > > be something like "event:[1]" as the event name.
> > > Personally I like "event.1" style. (of course we need to ensure the
> > > user given event name is NOT including such suffix numbers)
> > >   
> > 
> > Just to clarify around events including a suffix number. This is why
> > multi-events use "user_events_multi" system name and the single-events
> > using just "user_events".
> > 
> > Even if a user program did include a suffix, the suffix would still get
> > appended. An example is "test" vs "test:[0]" using multi-format would
> > result in two tracepoints ("test:[0]" and "test:[0]:[1]" respectively
> > (assuming these are the first multi-events on the system).
> > 
> > I'm with you, we really don't want any spoofing or squatting possible.
> > By using different system names and always appending the suffix I
> > believe covers this.
> > 
> > Looking forward to hearing Steven's thoughts on this as well.
> 
> I'm leaning towards Masami's suggestion to use dots, as that won't conflict
> with special characters from bash, as '[' and ']' do.
> 

Thanks, yeah ideally we wouldn't use special characters.

I'm not picky about this. However, I did want something that clearly
allowed a glob pattern to find all versions of a given register name of
user_events by user programs that record. The dot notation will pull in
more than expected if dotted namespace style names are used.

An example is "Asserts" and "Asserts.Verbose" from different programs.
If we tried to find all versions of "Asserts" via glob of "Asserts.*" it
will pull in "Asserts.Verbose.1" in addition to "Asserts.0".

While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it
doesn't work if the number is higher, like 128. If we ever decide to
change the ID from an integer to say hex to save space, these globs
would break.

Is there some scheme that fits the C-variable name that addresses the
above scenarios? Brackets gave me a simple glob that seemed to prevent a
lot of this ("Asserts.\[*\]" in this case).

Are we confident that we always want to represent the ID as a base-10
integer vs a base-16 integer? The suffix will be ABI to ensure recording
programs can find their events easily.

Thanks,
-Beau

> -- Steve



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Steven Rostedt
On Sun, 28 Jan 2024 20:36:12 -0800
Linus Torvalds  wrote:


> End result: the d_instantiate() needs to be done *after* the inode has
> been fully filled in.
> 
> Alternatively, you could
> 
>  (a) not drop the eventfs_mutex around the create_dir() call
> 
>  (b) take the eventfs_mutex around all of set_top_events_ownership()
> 
> and just fix it by having the lock protect the lack of ordering.

Hi Linus,

Thanks for the analysis. I have a patch that removes the dropping of the
mutex over the create_dir/file() calls, and lockdep hasn't complained about
it.

I was going to add that to my queue for the next merge window along with
other clean ups but this looks like it actually fixes a real bug. I'll move
that over to my urgent queue and start testing it.

-- Steve




Re: Re: [RFC PATCH] rpmsg: glink: Add bounds check on tx path

2024-01-29 Thread Michal Koutný
On Mon, Jan 29, 2024 at 04:18:36PM +0530, Deepak Kumar Singh 
 wrote:
> There is already a patch posted for similar problem -
> https://lore.kernel.org/all/20231201110631.669085-1-quic_dee...@quicinc.com/

I was not aware, thanks for the pointer.

Do you plan to update your patch to "just" bail-out/zero instead of
using slightly random values (as pointed out by Bjorn)?

Michal




Re: [PATCH 2/3] remoteproc: qcom_q6v5_pas: Add support for X1E80100 ADSP/CDSP

2024-01-29 Thread Dmitry Baryshkov
On Mon, 29 Jan 2024 at 15:35, Abel Vesa  wrote:
>
> From: Sibi Sankar 
>
> Add support for PIL loading on ADSP and CDSP on X1E80100 SoCs.
>
> Signed-off-by: Sibi Sankar 
> Signed-off-by: Abel Vesa 
> ---
>  drivers/remoteproc/qcom_q6v5_pas.c | 41 
> ++
>  1 file changed, 41 insertions(+)
>

Reviewed-by: Dmitry Baryshkov 


-- 
With best wishes
Dmitry



Re: [PATCH 3/3] remoteproc: qcom_q6v5_pas: Unload lite firmware on ADSP

2024-01-29 Thread Dmitry Baryshkov
On Mon, 29 Jan 2024 at 15:35, Abel Vesa  wrote:
>
> From: Sibi Sankar 
>
> The UEFI loads a lite variant of the ADSP firmware to support charging
> use cases. The kernel needs to unload and reload it with the firmware
> that has full feature support for audio. This patch arbitrarily shuts down
> the lite firmware before loading the full firmware.
>
> Signed-off-by: Sibi Sankar 
> Signed-off-by: Abel Vesa 
> ---
>  drivers/remoteproc/qcom_q6v5_pas.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
> b/drivers/remoteproc/qcom_q6v5_pas.c
> index 083d71f80e5c..4f6940368eb4 100644
> --- a/drivers/remoteproc/qcom_q6v5_pas.c
> +++ b/drivers/remoteproc/qcom_q6v5_pas.c
> @@ -39,6 +39,7 @@ struct adsp_data {
> const char *dtb_firmware_name;
> int pas_id;
> int dtb_pas_id;
> +   int lite_pas_id;
> unsigned int minidump_id;
> bool auto_boot;
> bool decrypt_shutdown;
> @@ -72,6 +73,7 @@ struct qcom_adsp {
> const char *dtb_firmware_name;
> int pas_id;
> int dtb_pas_id;
> +   int lite_pas_id;
> unsigned int minidump_id;
> int crash_reason_smem;
> bool decrypt_shutdown;
> @@ -210,6 +212,10 @@ static int adsp_load(struct rproc *rproc, const struct 
> firmware *fw)
> /* Store firmware handle to be used in adsp_start() */
> adsp->firmware = fw;
>
> +   /* WIP: Shutdown the ADSP if it's running a lite version of the 
> firmware*/

Why is it still marked as WIP?

> +   if (adsp->lite_pas_id)
> +   ret = qcom_scm_pas_shutdown(adsp->lite_pas_id);
> +
> if (adsp->dtb_pas_id) {
> ret = request_firmware(>dtb_firmware, 
> adsp->dtb_firmware_name, adsp->dev);
> if (ret) {
> @@ -693,6 +699,7 @@ static int adsp_probe(struct platform_device *pdev)
> adsp->rproc = rproc;
> adsp->minidump_id = desc->minidump_id;
> adsp->pas_id = desc->pas_id;
> +   adsp->lite_pas_id = desc->lite_pas_id;
> adsp->info_name = desc->sysmon_name;
> adsp->decrypt_shutdown = desc->decrypt_shutdown;
> adsp->region_assign_idx = desc->region_assign_idx;
> @@ -990,6 +997,7 @@ static const struct adsp_data x1e80100_adsp_resource = {
> .dtb_firmware_name = "adsp_dtb.mdt",
> .pas_id = 1,
> .dtb_pas_id = 0x24,
> +   .lite_pas_id = 0x1f,
> .minidump_id = 5,
> .auto_boot = true,
> .proxy_pd_names = (char*[]){
>
> --
> 2.34.1
>
>


-- 
With best wishes
Dmitry



[PATCH v13 5/6] Documentation: tracing: Add ring-buffer mapping

2024-01-29 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst 
b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..628254e63830
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+Tracefs ring-buffer memory mapping
+==
+
+:Author: Vincent Donnefort 
+
+Overview
+
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important fields of
+the meta-page is the reader. It contains the sub-buffer ID which can be safely
+read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is
+therefore straightforward to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+reader_id = meta->reader->id;
+reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one using
+the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
+the meta-page fields.
+
+Limitations
+===
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or the size
+of each subbuf). It is also not possible to use snapshot or splice.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable.
+
+Example
+===
+
+.. code-block:: c
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/trace_mmap.h>
+
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+
+#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+int main(void)
+{
+int page_size = getpagesize(), fd, reader_id;
+unsigned long meta_len, data_len;
+struct trace_buffer_meta *meta;
+void *map, *reader, *data;
+
+fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+if (fd < 0)
+exit(EXIT_FAILURE);
+
+map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+if (map == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+meta = (struct trace_buffer_meta *)map;
+meta_len = meta->meta_page_size;
+
+printf("entries:%llu\n", meta->entries);
+printf("overrun:%llu\n", meta->overrun);
+printf("read:   %llu\n", meta->read);
+printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+data_len = meta->subbuf_size * meta->nr_subbufs;
+data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+if (data == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+exit(EXIT_FAILURE);
+
+reader_id = meta->reader.id;
+reader = data + meta->subbuf_size * reader_id;
+
+printf("Current reader address: %p\n", reader);
+
+munmap(data, data_len);
+munmap(meta, meta_len);
+close (fd);
+
+return 0;
+}
-- 
2.43.0.429.g432eaa2c6b-goog




[PATCH v13 4/6] tracing: Allow user-space mapping of the ring-buffer

2024-01-29 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- see include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.
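
As a rough illustration (not part of this patch), a consumer built on top of
this interface boils down to the loop below. It assumes fd, meta and data have
been set up with mmap() as in the documentation added later in this series, and
process_subbuf() is only a made-up placeholder for whatever parses the events:

    for (;;) {
            unsigned int id;

            /* Waits for data unless fd was opened with O_NONBLOCK */
            if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
                    break;

            id = meta->reader.id;
            process_subbuf(data + id * meta->subbuf_size, meta->subbuf_size);
    }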

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index d4bb67430719..47194c51a4ac 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -40,4 +40,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b6a0e506b3f9..b570f4519d87 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1175,6 +1175,12 @@ static void tracing_snapshot_instance_cond(struct 
trace_array *tr,
return;
}
 
+   if (tr->mapped) {
+   trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);
@@ -1309,6 +1315,9 @@ static int tracing_arm_snapshot_locked(struct trace_array 
*tr)
if (tr->snapshot == UINT_MAX)
return -EBUSY;
 
+   if (tr->mapped)
+   return -EBUSY;
+
ret = tracing_alloc_snapshot_instance(tr);
if (ret)
return ret;
@@ -6534,7 +6543,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
if (tr->current_trace == _trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8653,15 +8662,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = >iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
+
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
 
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(_types_lock);
 
iter->wait_index++;
@@ -8674,6 +8699,90 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = >iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = >iter;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+   mutex_lock(_types_lock);
+   if (!WARN_ON(!iter->tr->mapped))
+   iter->tr->mapped--;
+   

[PATCH v13 3/6] tracing: Add snapshot refcount

2024-01-29 Thread Vincent Donnefort
When a ring-buffer is memory mapped by user-space, no trace or
ring-buffer swap is possible. This means the snapshot feature is
mutually exclusive with the memory mapping. Having a refcount on
snapshot users will help to know if a mapping is possible or not.
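
The idea in one glance (illustrative only; in this series the actual checks
take trace_types_lock and live in the snapshot arming and mapping paths):

    if (tr->mapped)         /* ring-buffer is mapped by user-space */
            return -EBUSY;  /* refuse to arm a snapshot */

    if (tr->snapshot)       /* snapshot users hold a reference */
            return -EBUSY;  /* refuse to map the ring-buffer */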

Signed-off-by: Vincent Donnefort 

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..b6a0e506b3f9 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1300,6 +1300,52 @@ static void free_snapshot(struct trace_array *tr)
tr->allocated_snapshot = false;
 }
 
+static int tracing_arm_snapshot_locked(struct trace_array *tr)
+{
+   int ret;
+
+   lockdep_assert_held(_types_lock);
+
+   if (tr->snapshot == UINT_MAX)
+   return -EBUSY;
+
+   ret = tracing_alloc_snapshot_instance(tr);
+   if (ret)
+   return ret;
+
+   tr->snapshot++;
+
+   return 0;
+}
+
+static void tracing_disarm_snapshot_locked(struct trace_array *tr)
+{
+   lockdep_assert_held(_types_lock);
+
+   if (WARN_ON(!tr->snapshot))
+   return;
+
+   tr->snapshot--;
+}
+
+int tracing_arm_snapshot(struct trace_array *tr)
+{
+   int ret;
+
+   mutex_lock(_types_lock);
+   ret = tracing_arm_snapshot_locked(tr);
+   mutex_unlock(_types_lock);
+
+   return ret;
+}
+
+void tracing_disarm_snapshot(struct trace_array *tr)
+{
+   mutex_lock(_types_lock);
+   tracing_disarm_snapshot_locked(tr);
+   mutex_unlock(_types_lock);
+}
+
 /**
  * tracing_alloc_snapshot - allocate snapshot buffer.
  *
@@ -1373,10 +1419,6 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
 
mutex_lock(_types_lock);
 
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret)
-   goto fail_unlock;
-
if (tr->current_trace->use_max_tr) {
ret = -EBUSY;
goto fail_unlock;
@@ -1395,6 +1437,10 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
goto fail_unlock;
}
 
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
+   goto fail_unlock;
+
local_irq_disable();
arch_spin_lock(>max_lock);
tr->cond_snapshot = cond_snapshot;
@@ -1439,6 +1485,8 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
arch_spin_unlock(>max_lock);
local_irq_enable();
 
+   tracing_disarm_snapshot(tr);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
@@ -6593,11 +6641,12 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 */
synchronize_rcu();
free_snapshot(tr);
+   tracing_disarm_snapshot_locked(tr);
}
 
-   if (t->use_max_tr && !tr->allocated_snapshot) {
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+   if (t->use_max_tr) {
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
goto out;
}
 #else
@@ -6606,8 +6655,13 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 
if (t->init) {
ret = tracer_init(t, tr);
-   if (ret)
+   if (ret) {
+#ifdef CONFIG_TRACER_MAX_TRACE
+   if (t->use_max_tr)
+   tracing_disarm_snapshot_locked(tr);
+#endif
goto out;
+   }
}
 
tr->current_trace = t;
@@ -7709,10 +7763,11 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,
if (tr->allocated_snapshot)
ret = resize_buffer_duplicate_size(>max_buffer,
>array_buffer, iter->cpu_file);
-   else
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
break;
+
/* Now, we're going to swap */
if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
local_irq_disable();
@@ -7722,6 +,7 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,
smp_call_function_single(iter->cpu_file, 
tracing_swap_cpu_buffer,
 (void *)tr, 1);
}
+   tracing_disarm_snapshot_locked(tr);
break;
default:
if (tr->allocated_snapshot) {
@@ -8846,8 +8902,13 @@ ftrace_trace_snapshot_callback(struct trace_array *tr, 
struct ftrace_hash *hash,
 
ops = param ? _count_probe_ops :  _probe_ops;
 
-   if (glob[0] == '!')
-   return unregister_ftrace_function_probe_func(glob+1, tr, ops);
+   if (glob[0] == '!') {
+   ret = unregister_ftrace_function_probe_func(glob+1, tr, ops);

[PATCH v13 2/6] ring-buffer: Introducing ring-buffer mapping functions

2024-01-29 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer; currently subbufs are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored.

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered by their unique ID, which is assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice-enabling functions will simply
memcpy the data instead of swapping subbufs.
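
For illustration only, a mapper following this layout can locate any subbuf
from the meta-page alone. The sketch below assumes the user-space mapping and
struct trace_buffer_meta from this series; subbuf_addr() is just a helper name
made up for the example:

    /* Illustrative only: translate a subbuf ID into its address in the map */
    static void *subbuf_addr(void *map, const struct trace_buffer_meta *meta,
                             unsigned int id)
    {
            return (char *)map + meta->meta_page_size + id * meta->subbuf_size;
    }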

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..d4bb67430719
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include <linux/types.h>
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each subbuf, including the header.
+ * @nr_subbufs:Number of subbufs in the ring-buffer.
+ * @reader.lost_events:Number of events lost at the time of the reader swap.
+ * @reader.id: subbuf ID of the current reader. From 0 to @nr_subbufs - 1
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8179e0a8984e..081065e76d4a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +485,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   int mapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1548,6 +1555,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_irq_work(_buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(_buffer->irq_work.waiters);
init_waitqueue_head(_buffer->irq_work.full_waiters);
+   mutex_init(_buffer->mapping_lock);
 
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
GFP_KERNEL, cpu_to_node(cpu));
@@ -5160,6 +5168,19 @@ static void rb_clear_buffer_page(struct buffer_page 
*page)
page->read = 0;
 }
 

[PATCH v13 1/6] ring-buffer: Zero ring-buffer sub-buffers

2024-01-29 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping where each subbuf will
be accessible to user-space, zero all the page allocations.

Signed-off-by: Vincent Donnefort 
Reviewed-by: Masami Hiramatsu (Google) 

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 13aaf5e85b81..8179e0a8984e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1472,7 +1472,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
 
list_add(>list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
+   mflags | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1557,7 +1558,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
if (bpage->data)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   page = alloc_pages_node(cpu_to_node(cpu),
+   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.43.0.429.g432eaa2c6b-goog




[PATCH v13 0/6] Introducing trace buffer mapping by user-space

2024-01-29 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to the network
without any copy via splice. However, the latter doesn't allow real-time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent. (29142dc92c37d3259a33aef15b03e6ee25b0d188)
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (6):
  ring-buffer: Zero ring-buffer sub-buffers
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Add snapshot refcount
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst |   1 +
 Documentation/trace/ring-buffer-map.rst   | 104 ++
 include/linux/ring_buffer.h   |   7 +
 include/uapi/linux/trace_mmap.h   |  45 +++
 kernel/trace/ring_buffer.c| 335 +-
 kernel/trace/trace.c  | 210 ++-
 kernel/trace/trace.h  |   6 +
 kernel/trace/trace_events_trigger.c   |  58 ++-
 tools/testing/selftests/ring-buffer/Makefile  |   8 +
 tools/testing/selftests/ring-buffer/config|   1 +
 .../testing/selftests/ring-buffer/map_test.c  | 183 ++
 11 files changed, 918 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/trace/ring-buffer-map.rst
 create mode 100644 include/uapi/linux/trace_mmap.h
 create mode 100644 tools/testing/selftests/ring-buffer/Makefile
 create mode 100644 tools/testing/selftests/ring-buffer/config
 create mode 100644 tools/testing/selftests/ring-buffer/map_test.c


base-commit: 29142dc92c37d3259a33aef15b03e6ee25b0d188
-- 
2.43.0.429.g432eaa2c6b-goog




[PATCH 3/3] remoteproc: qcom_q6v5_pas: Unload lite firmware on ADSP

2024-01-29 Thread Abel Vesa
From: Sibi Sankar 

The UEFI loads a lite variant of the ADSP firmware to support charging
use cases. The kernel needs to unload and reload it with the firmware
that has full feature support for audio. This patch arbitrarily shuts down
the lite firmware before loading the full firmware.

Signed-off-by: Sibi Sankar 
Signed-off-by: Abel Vesa 
---
 drivers/remoteproc/qcom_q6v5_pas.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 083d71f80e5c..4f6940368eb4 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -39,6 +39,7 @@ struct adsp_data {
const char *dtb_firmware_name;
int pas_id;
int dtb_pas_id;
+   int lite_pas_id;
unsigned int minidump_id;
bool auto_boot;
bool decrypt_shutdown;
@@ -72,6 +73,7 @@ struct qcom_adsp {
const char *dtb_firmware_name;
int pas_id;
int dtb_pas_id;
+   int lite_pas_id;
unsigned int minidump_id;
int crash_reason_smem;
bool decrypt_shutdown;
@@ -210,6 +212,10 @@ static int adsp_load(struct rproc *rproc, const struct 
firmware *fw)
/* Store firmware handle to be used in adsp_start() */
adsp->firmware = fw;
 
+   /* WIP: Shutdown the ADSP if it's running a lite version of the 
firmware*/
+   if (adsp->lite_pas_id)
+   ret = qcom_scm_pas_shutdown(adsp->lite_pas_id);
+
if (adsp->dtb_pas_id) {
ret = request_firmware(>dtb_firmware, 
adsp->dtb_firmware_name, adsp->dev);
if (ret) {
@@ -693,6 +699,7 @@ static int adsp_probe(struct platform_device *pdev)
adsp->rproc = rproc;
adsp->minidump_id = desc->minidump_id;
adsp->pas_id = desc->pas_id;
+   adsp->lite_pas_id = desc->lite_pas_id;
adsp->info_name = desc->sysmon_name;
adsp->decrypt_shutdown = desc->decrypt_shutdown;
adsp->region_assign_idx = desc->region_assign_idx;
@@ -990,6 +997,7 @@ static const struct adsp_data x1e80100_adsp_resource = {
.dtb_firmware_name = "adsp_dtb.mdt",
.pas_id = 1,
.dtb_pas_id = 0x24,
+   .lite_pas_id = 0x1f,
.minidump_id = 5,
.auto_boot = true,
.proxy_pd_names = (char*[]){

-- 
2.34.1




[PATCH 2/3] remoteproc: qcom_q6v5_pas: Add support for X1E80100 ADSP/CDSP

2024-01-29 Thread Abel Vesa
From: Sibi Sankar 

Add support for PIL loading on ADSP and CDSP on X1E80100 SoCs.

Signed-off-by: Sibi Sankar 
Signed-off-by: Abel Vesa 
---
 drivers/remoteproc/qcom_q6v5_pas.c | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index a9dd58608052..083d71f80e5c 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -984,6 +984,45 @@ static const struct adsp_data sc8280xp_nsp1_resource = {
.ssctl_id = 0x20,
 };
 
+static const struct adsp_data x1e80100_adsp_resource = {
+   .crash_reason_smem = 423,
+   .firmware_name = "adsp.mdt",
+   .dtb_firmware_name = "adsp_dtb.mdt",
+   .pas_id = 1,
+   .dtb_pas_id = 0x24,
+   .minidump_id = 5,
+   .auto_boot = true,
+   .proxy_pd_names = (char*[]){
+   "lcx",
+   "lmx",
+   NULL
+   },
+   .load_state = "adsp",
+   .ssr_name = "lpass",
+   .sysmon_name = "adsp",
+   .ssctl_id = 0x14,
+};
+
+static const struct adsp_data x1e80100_cdsp_resource = {
+   .crash_reason_smem = 601,
+   .firmware_name = "cdsp.mdt",
+   .dtb_firmware_name = "cdsp_dtb.mdt",
+   .pas_id = 18,
+   .dtb_pas_id = 0x25,
+   .minidump_id = 7,
+   .auto_boot = true,
+   .proxy_pd_names = (char*[]){
+   "cx",
+   "mxc",
+   "nsp",
+   NULL
+   },
+   .load_state = "cdsp",
+   .ssr_name = "cdsp",
+   .sysmon_name = "cdsp",
+   .ssctl_id = 0x17,
+};
+
 static const struct adsp_data sm8350_cdsp_resource = {
.crash_reason_smem = 601,
.firmware_name = "cdsp.mdt",
@@ -1236,6 +1275,8 @@ static const struct of_device_id adsp_of_match[] = {
{ .compatible = "qcom,sm8550-adsp-pas", .data = _adsp_resource},
{ .compatible = "qcom,sm8550-cdsp-pas", .data = _cdsp_resource},
{ .compatible = "qcom,sm8550-mpss-pas", .data = _mpss_resource},
+   { .compatible = "qcom,x1e80100-adsp-pas", .data = 
_adsp_resource},
+   { .compatible = "qcom,x1e80100-cdsp-pas", .data = 
_cdsp_resource},
{ },
 };
 MODULE_DEVICE_TABLE(of, adsp_of_match);

-- 
2.34.1




[PATCH 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the X1E80100 aDSP & cDSP

2024-01-29 Thread Abel Vesa
Document the aDSP and cDSP Peripheral Authentication Service on the
X1E80100 Platform.

Signed-off-by: Abel Vesa 
---
 Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
index 58120829fb06..95ae32ea8a0a 100644
--- a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
@@ -19,6 +19,8 @@ properties:
   - qcom,sm8550-adsp-pas
   - qcom,sm8550-cdsp-pas
   - qcom,sm8550-mpss-pas
+  - qcom,x1e80100-adsp-pas
+  - qcom,x1e80100-cdsp-pas
 
   reg:
 maxItems: 1
@@ -63,6 +65,8 @@ allOf:
   enum:
 - qcom,sm8550-adsp-pas
 - qcom,sm8550-cdsp-pas
+- qcom,x1e80100-adsp-pas
+- qcom,x1e80100-cdsp-pas
 then:
   properties:
 interrupts:
@@ -85,6 +89,7 @@ allOf:
 compatible:
   enum:
 - qcom,sm8550-adsp-pas
+- qcom,x1e80100-adsp-pas
 then:
   properties:
 power-domains:
@@ -116,6 +121,7 @@ allOf:
 compatible:
   enum:
 - qcom,sm8550-cdsp-pas
+- qcom,x1e80100-cdsp-pas
 then:
   properties:
 power-domains:

-- 
2.34.1




[PATCH 0/3] remoteproc: qcom_q6v5_pas: Add aDSP and cDSP for X1E80100

2024-01-29 Thread Abel Vesa
Signed-off-by: Abel Vesa 
---
Abel Vesa (1):
  dt-bindings: remoteproc: qcom,sm8550-pas: document the X1E80100 aDSP & 
cDSP

Sibi Sankar (2):
  remoteproc: qcom_q6v5_pas: Add support for X1E80100 ADSP/CDSP
  remoteproc: qcom_q6v5_pas: Unload lite firmware on ADSP

 .../bindings/remoteproc/qcom,sm8550-pas.yaml   |  6 +++
 drivers/remoteproc/qcom_q6v5_pas.c | 49 ++
 2 files changed, 55 insertions(+)
---
base-commit: 01af33cc9894b4489fb68fa35c40e9fe85df63dc
change-id: 20231201-x1e80100-remoteproc-b27da583e8cc

Best regards,
-- 
Abel Vesa 




Re: [PATCH v3 5/6] LoongArch: KVM: Add physical cpuid map support

2024-01-29 Thread Huacai Chen
Hi, Bibo,

Without this patch I can also create an SMP VM, so what problem does
this patch want to solve?

Huacai

On Mon, Jan 22, 2024 at 6:03 PM Bibo Mao  wrote:
>
> Physical cpuid is used for irq routing for irqchips such as the ipi/msi/
> extioi interrupt controllers. The physical cpuid is stored in the CSR
> register LOONGARCH_CSR_CPUID and cannot be changed once the vcpu is
> created. Since different irqchips have different size definitions for the
> physical cpuid, KVM uses the smallest cpuid size, that of extioi, and
> the max cpuid size is defined as 256.
>
> Signed-off-by: Bibo Mao 
> ---
>  arch/loongarch/include/asm/kvm_host.h | 26 
>  arch/loongarch/include/asm/kvm_vcpu.h |  1 +
>  arch/loongarch/kvm/vcpu.c | 93 ++-
>  arch/loongarch/kvm/vm.c   | 11 
>  4 files changed, 130 insertions(+), 1 deletion(-)
>
> diff --git a/arch/loongarch/include/asm/kvm_host.h 
> b/arch/loongarch/include/asm/kvm_host.h
> index 2d62f7b0d377..57399d7cf8b7 100644
> --- a/arch/loongarch/include/asm/kvm_host.h
> +++ b/arch/loongarch/include/asm/kvm_host.h
> @@ -64,6 +64,30 @@ struct kvm_world_switch {
>
>  #define MAX_PGTABLE_LEVELS 4
>
> +/*
> + * Physical cpu id is used for interrupt routing, there are different
> + * definitions about physical cpuid on different hardwares.
> + *  For LOONGARCH_CSR_CPUID register, max cpuid size is 512
> + *  For IPI HW, max dest CPUID size 1024
> + *  For extioi interrupt controller, max dest CPUID size is 256
> + *  For MSI interrupt controller, max supported CPUID size is 65536
> + *
> + * Currently max CPUID is defined as 256 for KVM hypervisor, in future
> + * it will be expanded to 4096, including 16 packages at most. And every
> + * package supports at most 256 vcpus
> + */
> +#define KVM_MAX_PHYID  256
> +
> +struct kvm_phyid_info {
> +   struct kvm_vcpu *vcpu;
> +   boolenabled;
> +};
> +
> +struct kvm_phyid_map {
> +   int max_phyid;
> +   struct kvm_phyid_info phys_map[KVM_MAX_PHYID];
> +};
> +
>  struct kvm_arch {
> /* Guest physical mm */
> kvm_pte_t *pgd;
> @@ -71,6 +95,8 @@ struct kvm_arch {
> unsigned long invalid_ptes[MAX_PGTABLE_LEVELS];
> unsigned int  pte_shifts[MAX_PGTABLE_LEVELS];
> unsigned int  root_level;
> +   struct mutex  phyid_map_lock;
> +   struct kvm_phyid_map  *phyid_map;
>
> s64 time_offset;
> struct kvm_context __percpu *vmcs;
> diff --git a/arch/loongarch/include/asm/kvm_vcpu.h 
> b/arch/loongarch/include/asm/kvm_vcpu.h
> index e71ceb88f29e..2402129ee955 100644
> --- a/arch/loongarch/include/asm/kvm_vcpu.h
> +++ b/arch/loongarch/include/asm/kvm_vcpu.h
> @@ -81,6 +81,7 @@ void kvm_save_timer(struct kvm_vcpu *vcpu);
>  void kvm_restore_timer(struct kvm_vcpu *vcpu);
>
>  int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt 
> *irq);
> +struct kvm_vcpu *kvm_get_vcpu_by_cpuid(struct kvm *kvm, int cpuid);
>
>  /*
>   * Loongarch KVM guest interrupt handling
> diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c
> index 27701991886d..97ca9c7160e6 100644
> --- a/arch/loongarch/kvm/vcpu.c
> +++ b/arch/loongarch/kvm/vcpu.c
> @@ -274,6 +274,95 @@ static int _kvm_getcsr(struct kvm_vcpu *vcpu, unsigned 
> int id, u64 *val)
> return 0;
>  }
>
> +static inline int kvm_set_cpuid(struct kvm_vcpu *vcpu, u64 val)
> +{
> +   int cpuid;
> +   struct loongarch_csrs *csr = vcpu->arch.csr;
> +   struct kvm_phyid_map  *map;
> +
> +   if (val >= KVM_MAX_PHYID)
> +   return -EINVAL;
> +
> +   cpuid = kvm_read_sw_gcsr(csr, LOONGARCH_CSR_ESTAT);
> +   map = vcpu->kvm->arch.phyid_map;
> +   mutex_lock(>kvm->arch.phyid_map_lock);
> +   if (map->phys_map[cpuid].enabled) {
> +   /*
> +* Cpuid is already set before
> +* Forbid changing different cpuid at runtime
> +*/
> +   if (cpuid != val) {
> +   /*
> +* Cpuid 0 is initial value for vcpu, maybe invalid
> +* unset value for vcpu
> +*/
> +   if (cpuid) {
> +   mutex_unlock(>kvm->arch.phyid_map_lock);
> +   return -EINVAL;
> +   }
> +   } else {
> +/* Discard duplicated cpuid set */
> +   mutex_unlock(>kvm->arch.phyid_map_lock);
> +   return 0;
> +   }
> +   }
> +
> +   if (map->phys_map[val].enabled) {
> +   /*
> +* New cpuid is already set with other vcpu
> +* Forbid sharing the same cpuid between different vcpus
> +*/
> +   if (map->phys_map[val].vcpu != vcpu) {
> +   mutex_unlock(>kvm->arch.phyid_map_lock);
> +   return -EINVAL;
> +   }

Re: [PATCH v3 6/6] LoongArch: Add pv ipi support on LoongArch system

2024-01-29 Thread Huacai Chen
Hi, Bibo,

On Mon, Jan 22, 2024 at 6:03 PM Bibo Mao  wrote:
>
> On LoongArch systems, the ipi hardware uses iocsr registers: there is one
> iocsr register access on the ipi sender side and two iocsr accesses in the
> ipi interrupt handler on the receiver side. In VM mode, every iocsr register
> access traps into the hypervisor, so one hw ipi notification causes three
> traps.
>
> This patch adds pv ipi support for VMs: a hypercall instruction is used by
> the ipi sender, and the hypervisor injects an SWI into the VM. In the SWI
> interrupt handler, only the estat CSR register is written to clear the irq,
> and estat CSR accesses do not trap into the hypervisor. So with pv ipi, the
> sender traps into the hypervisor once and the receiver does not trap at all,
> giving a single trap per ipi.
>
> This patch also adds ipi multicast support, using a method similar to
> x86. With ipi multicast support, an ipi notification can be sent to at most
> 128 vcpus at one time, which greatly reduces traps into the hypervisor.
>
> Signed-off-by: Bibo Mao 
> ---
>  arch/loongarch/include/asm/hardirq.h   |   1 +
>  arch/loongarch/include/asm/kvm_host.h  |   1 +
>  arch/loongarch/include/asm/kvm_para.h  | 124 +
>  arch/loongarch/include/asm/loongarch.h |   1 +
>  arch/loongarch/kernel/irq.c|   2 +-
>  arch/loongarch/kernel/paravirt.c   | 113 ++
>  arch/loongarch/kernel/smp.c|   2 +-
>  arch/loongarch/kvm/exit.c  |  73 ++-
>  arch/loongarch/kvm/vcpu.c  |   1 +
>  9 files changed, 314 insertions(+), 4 deletions(-)
>
> diff --git a/arch/loongarch/include/asm/hardirq.h 
> b/arch/loongarch/include/asm/hardirq.h
> index 9f0038e19c7f..8a611843c1f0 100644
> --- a/arch/loongarch/include/asm/hardirq.h
> +++ b/arch/loongarch/include/asm/hardirq.h
> @@ -21,6 +21,7 @@ enum ipi_msg_type {
>  typedef struct {
> unsigned int ipi_irqs[NR_IPI];
> unsigned int __softirq_pending;
> +   atomic_t messages cacheline_aligned_in_smp;
>  } cacheline_aligned irq_cpustat_t;
>
>  DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
> diff --git a/arch/loongarch/include/asm/kvm_host.h 
> b/arch/loongarch/include/asm/kvm_host.h
> index 57399d7cf8b7..1bf927e2bfac 100644
> --- a/arch/loongarch/include/asm/kvm_host.h
> +++ b/arch/loongarch/include/asm/kvm_host.h
> @@ -43,6 +43,7 @@ struct kvm_vcpu_stat {
> u64 idle_exits;
> u64 cpucfg_exits;
> u64 signal_exits;
> +   u64 hvcl_exits;
>  };
>
>  #define KVM_MEM_HUGEPAGE_CAPABLE   (1UL << 0)
> diff --git a/arch/loongarch/include/asm/kvm_para.h 
> b/arch/loongarch/include/asm/kvm_para.h
> index 41200e922a82..a25a84e372b9 100644
> --- a/arch/loongarch/include/asm/kvm_para.h
> +++ b/arch/loongarch/include/asm/kvm_para.h
> @@ -9,6 +9,10 @@
>  #define HYPERVISOR_VENDOR_SHIFT8
>  #define HYPERCALL_CODE(vendor, code)   ((vendor << HYPERVISOR_VENDOR_SHIFT) 
> + code)
>
> +#define KVM_HC_CODE_SERVICE0
> +#define KVM_HC_SERVICE HYPERCALL_CODE(HYPERVISOR_KVM, 
> KVM_HC_CODE_SERVICE)
> +#define  KVM_HC_FUNC_IPI   1
> +
>  /*
>   * LoongArch hypcall return code
>   */
> @@ -16,6 +20,126 @@
>  #define KVM_HC_INVALID_CODE-1UL
>  #define KVM_HC_INVALID_PARAMETER   -2UL
>
> +/*
> + * Hypercalls interface for KVM hypervisor
> + *
> + * a0: function identifier
> + * a1-a6: args
> + * Return value will be placed in v0.
> + * Up to 6 arguments are passed in a1, a2, a3, a4, a5, a6.
> + */
> +static __always_inline long kvm_hypercall(u64 fid)
> +{
> +   register long ret asm("v0");
> +   register unsigned long fun asm("a0") = fid;
> +
> +   __asm__ __volatile__(
> +   "hvcl "__stringify(KVM_HC_SERVICE)
> +   : "=r" (ret)
> +   : "r" (fun)
> +   : "memory"
> +   );
> +
> +   return ret;
> +}
> +
> +static __always_inline long kvm_hypercall1(u64 fid, unsigned long arg0)
> +{
> +   register long ret asm("v0");
> +   register unsigned long fun asm("a0") = fid;
> +   register unsigned long a1  asm("a1") = arg0;
> +
> +   __asm__ __volatile__(
> +   "hvcl "__stringify(KVM_HC_SERVICE)
> +   : "=r" (ret)
> +   : "r" (fun), "r" (a1)
> +   : "memory"
> +   );
> +
> +   return ret;
> +}
> +
> +static __always_inline long kvm_hypercall2(u64 fid,
> +   unsigned long arg0, unsigned long arg1)
> +{
> +   register long ret asm("v0");
> +   register unsigned long fun asm("a0") = fid;
> +   register unsigned long a1  asm("a1") = arg0;
> +   register unsigned long a2  asm("a2") = arg1;
> +
> +   __asm__ __volatile__(
> +   "hvcl "__stringify(KVM_HC_SERVICE)
> +   : "=r" (ret)
> +   : "r" (fun), "r" (a1), "r" (a2)
> +   : "memory"
> +   );
> +
> +   

Re: [PATCH v3 1/6] LoongArch/smp: Refine ipi ops on LoongArch platform

2024-01-29 Thread Huacai Chen
Hi, Bibo,

On Mon, Jan 22, 2024 at 6:03 PM Bibo Mao  wrote:
>
> This patch refines ipi handling on the LoongArch platform; it makes
> three changes.
> 1. Add a generic get_percpu_irq api and replace the percpu irq functions
> get_ipi_irq/get_pmc_irq/get_timer_irq with get_percpu_irq.
>
> 2. Change the definition of the action parameter of
> loongson_send_ipi_single and loongson_send_ipi_mask. Action code encoding is
> used here rather than bitmap encoding for the ipi action: the ipi hw sender
> uses the action code, the ipi hw converts it into a bitmap in the ipi message
> buffer, and the ipi receiver gets the action bitmap encoding.
>
> 3. Add smp_ops on the LoongArch platform so that pv ipi can be used later.
>
> Signed-off-by: Bibo Mao 
> ---
>  arch/loongarch/include/asm/hardirq.h |  4 ++
>  arch/loongarch/include/asm/irq.h | 10 -
>  arch/loongarch/include/asm/smp.h | 31 +++
>  arch/loongarch/kernel/irq.c  | 22 +--
>  arch/loongarch/kernel/perf_event.c   | 14 +--
>  arch/loongarch/kernel/smp.c  | 58 +++-
>  arch/loongarch/kernel/time.c | 12 +-
>  7 files changed, 71 insertions(+), 80 deletions(-)
>
> diff --git a/arch/loongarch/include/asm/hardirq.h 
> b/arch/loongarch/include/asm/hardirq.h
> index 0ef3b18f8980..9f0038e19c7f 100644
> --- a/arch/loongarch/include/asm/hardirq.h
> +++ b/arch/loongarch/include/asm/hardirq.h
> @@ -12,6 +12,10 @@
>  extern void ack_bad_irq(unsigned int irq);
>  #define ack_bad_irq ack_bad_irq
>
> +enum ipi_msg_type {
> +   IPI_RESCHEDULE,
> +   IPI_CALL_FUNCTION,
> +};
>  #define NR_IPI 2
>
>  typedef struct {
> diff --git a/arch/loongarch/include/asm/irq.h 
> b/arch/loongarch/include/asm/irq.h
> index 218b4da0ea90..00101b6d601e 100644
> --- a/arch/loongarch/include/asm/irq.h
> +++ b/arch/loongarch/include/asm/irq.h
> @@ -117,8 +117,16 @@ extern struct fwnode_handle *liointc_handle;
>  extern struct fwnode_handle *pch_lpc_handle;
>  extern struct fwnode_handle *pch_pic_handle[MAX_IO_PICS];
>
> -extern irqreturn_t loongson_ipi_interrupt(int irq, void *dev);
> +static inline int get_percpu_irq(int vector)
> +{
> +   struct irq_domain *d;
> +
> +   d = irq_find_matching_fwnode(cpuintc_handle, DOMAIN_BUS_ANY);
> +   if (d)
> +   return irq_create_mapping(d, vector);
>
> +   return -EINVAL;
> +}
>  #include 
>
>  #endif /* _ASM_IRQ_H */
> diff --git a/arch/loongarch/include/asm/smp.h 
> b/arch/loongarch/include/asm/smp.h
> index f81e5f01d619..330f1cb3741c 100644
> --- a/arch/loongarch/include/asm/smp.h
> +++ b/arch/loongarch/include/asm/smp.h
> @@ -12,6 +12,13 @@
>  #include 
>  #include 
>
> +struct smp_ops {
> +   void (*call_func_ipi)(const struct cpumask *mask, unsigned int 
> action);
> +   void (*call_func_single_ipi)(int cpu, unsigned int action);
To keep consistency, it is better to use call_func_ipi_single and
call_func_ipi_mask.

> +   void (*ipi_init)(void);
> +};
> +
> +extern struct smp_ops smp_ops;
>  extern int smp_num_siblings;
>  extern int num_processors;
>  extern int disabled_cpus;
> @@ -24,8 +31,6 @@ void loongson_prepare_cpus(unsigned int max_cpus);
>  void loongson_boot_secondary(int cpu, struct task_struct *idle);
>  void loongson_init_secondary(void);
>  void loongson_smp_finish(void);
> -void loongson_send_ipi_single(int cpu, unsigned int action);
> -void loongson_send_ipi_mask(const struct cpumask *mask, unsigned int action);
>  #ifdef CONFIG_HOTPLUG_CPU
>  int loongson_cpu_disable(void);
>  void loongson_cpu_die(unsigned int cpu);
> @@ -59,9 +64,12 @@ extern int __cpu_logical_map[NR_CPUS];
>
>  #define cpu_physical_id(cpu)   cpu_logical_map(cpu)
>
> -#define SMP_BOOT_CPU   0x1
> -#define SMP_RESCHEDULE 0x2
> -#define SMP_CALL_FUNCTION  0x4
> +#define ACTTION_BOOT_CPU   0
> +#define ACTTION_RESCHEDULE 1
> +#define ACTTION_CALL_FUNCTION  2
> +#define SMP_BOOT_CPU   BIT(ACTTION_BOOT_CPU)
> +#define SMP_RESCHEDULE BIT(ACTTION_RESCHEDULE)
> +#define SMP_CALL_FUNCTION  BIT(ACTTION_CALL_FUNCTION)
>
>  struct secondary_data {
> unsigned long stack;
> @@ -71,7 +79,8 @@ extern struct secondary_data cpuboot_data;
>
>  extern asmlinkage void smpboot_entry(void);
>  extern asmlinkage void start_secondary(void);
> -
> +extern void arch_send_call_function_single_ipi(int cpu);
> +extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
Similarly, to keep consistency, it is better to use
arch_send_function_ipi_single and arch_send_function_ipi_mask.

Huacai

>  extern void calculate_cpu_foreign_map(void);
>
>  /*
> @@ -79,16 +88,6 @@ extern void calculate_cpu_foreign_map(void);
>   */
>  extern void show_ipi_list(struct seq_file *p, int prec);
>
> -static inline void arch_send_call_function_single_ipi(int cpu)
> -{
> -   loongson_send_ipi_single(cpu, SMP_CALL_FUNCTION);
> -}
> -
> -static inline void arch_send_call_function_ipi_mask(const struct cpumask 
> *mask)
> -{
> - 

Re: [PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known

2024-01-29 Thread Alexandru Elisei
Hi Peter,

On Fri, Jan 26, 2024 at 12:00:36PM -0800, Peter Collingbourne wrote:
> On Thu, Jan 25, 2024 at 8:43 AM Alexandru Elisei
>  wrote:
> >
> > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
> > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
> > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
> > set in vma_alloc_zeroed_movable_folio().
> >
> > Expand this to be more generic by adding an arch hook that modifes the gfp
> > flags for an allocation when the VMA is known.
> >
> > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
> > is also set; from that point of view, the current behaviour is unchanged,
> > even though the arm64 flag is set in more places.  When arm64 will have
> > support to reuse the tag storage for data allocation, the uses of the
> > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
> > to reserve the corresponding tag storage for the pages being allocated.
> >
> > The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by
> > the caller; this has been done to keep an architecture from modifying the
> > flags already set by the core memory management code; this is similar to
> > how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been
> > implemented. This can be revisited in the future if there's a need to do
> > so.
> >
> > Signed-off-by: Alexandru Elisei 
> 
> This patch also needs to update the non-CONFIG_NUMA definition of
> vma_alloc_folio in include/linux/gfp.h to call arch_calc_vma_gfp. See:
> https://r.android.com/2849146

Of course, you've already reported this to me; I cherry-picked the version of
the patch that doesn't have the fix for this series.

Will fix.

Thanks,
Alex

> 
> Peter



Re: [PATCH RFC v3 07/35] mm: cma: Add CMA_RELEASE_{SUCCESS,FAIL} events

2024-01-29 Thread Alexandru Elisei
Hi,

On Mon, Jan 29, 2024 at 03:01:24PM +0530, Anshuman Khandual wrote:
> 
> 
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > Similar to the two events that relate to CMA allocations, add the
> > CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages
> > are freed.
> 
> How is this going to be beneficial towards analyzing CMA alloc/release
> behaviour - particularly with respect to this series? Or is this just added
> for parity with the CMA alloc side counters? Regardless, this CMA change
> too could be discussed separately.

Added for parity and because it's useful for this series (see my reply to
the previous patch where I discuss how I've used the counters).

Thanks,
Alex

> 
> > 
> > Signed-off-by: Alexandru Elisei 
> > ---
> > 
> > Changes since rfc v2:
> > 
> > * New patch.
> > 
> >  include/linux/vm_event_item.h | 2 ++
> >  mm/cma.c  | 6 +-
> >  mm/vmstat.c   | 2 ++
> >  3 files changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 747943bc8cc2..aba5c5bf8127 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -83,6 +83,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  #ifdef CONFIG_CMA
> > CMA_ALLOC_SUCCESS,
> > CMA_ALLOC_FAIL,
> > +   CMA_RELEASE_SUCCESS,
> > +   CMA_RELEASE_FAIL,
> >  #endif
> > UNEVICTABLE_PGCULLED,   /* culled to noreclaim list */
> > UNEVICTABLE_PGSCANNED,  /* scanned for reclaimability */
> > diff --git a/mm/cma.c b/mm/cma.c
> > index dbf7fe8cb1bd..543bb6b3be8e 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -562,8 +562,10 @@ bool cma_release(struct cma *cma, const struct page 
> > *pages,
> >  {
> > unsigned long pfn;
> >  
> > -   if (!cma_pages_valid(cma, pages, count))
> > +   if (!cma_pages_valid(cma, pages, count)) {
> > +   count_vm_events(CMA_RELEASE_FAIL, count);
> > return false;
> > +   }
> >  
> > pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count);
> >  
> > @@ -575,6 +577,8 @@ bool cma_release(struct cma *cma, const struct page 
> > *pages,
> > cma_clear_bitmap(cma, pfn, count);
> > trace_cma_release(cma->name, pfn, pages, count);
> >  
> > +   count_vm_events(CMA_RELEASE_SUCCESS, count);
> > +
> > return true;
> >  }
> >  
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index db79935e4a54..eebfd5c6c723 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1340,6 +1340,8 @@ const char * const vmstat_text[] = {
> >  #ifdef CONFIG_CMA
> > "cma_alloc_success",
> > "cma_alloc_fail",
> > +   "cma_release_success",
> > +   "cma_release_fail",
> >  #endif
> > "unevictable_pgs_culled",
> > "unevictable_pgs_scanned",



Re: [PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages

2024-01-29 Thread Alexandru Elisei
Hi,

On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote:
> 
> 
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
> > after each cma_alloc() function call. This is done even though cma_alloc()
> > can allocate an arbitrary number of CMA pages. When looking at
> > /proc/vmstat, the number of successful (or failed) cma_alloc() calls
> > doesn't tell much with regards to how many CMA pages were allocated via
> > cma_alloc() versus via the page allocator (regular allocation request or
> > PCP lists refill).
> > 
> > This can also be rather confusing to a user who isn't familiar with the
> > code, since the unit of measurement for nr_free_cma is the number of pages,
> > but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
> > function calls.
> > 
> > Let's make this consistent, and arguably more useful, by having
> > CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
> > CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
> > allocate.
> > 
> > For users that wish to track the number of cma_alloc() calls, there are
> > tracepoints for that already implemented.
> > 
> > Signed-off-by: Alexandru Elisei 
> > ---
> >  mm/cma.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/cma.c b/mm/cma.c
> > index f49c95f8ee37..dbf7fe8cb1bd 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
> > count,
> > pr_debug("%s(): returned %p\n", __func__, page);
> >  out:
> > if (page) {
> > -   count_vm_event(CMA_ALLOC_SUCCESS);
> > +   count_vm_events(CMA_ALLOC_SUCCESS, count);
> > cma_sysfs_account_success_pages(cma, count);
> > } else {
> > -   count_vm_event(CMA_ALLOC_FAIL);
> > +   count_vm_events(CMA_ALLOC_FAIL, count);
> > if (cma)
> > cma_sysfs_account_fail_pages(cma, count);
> > }
> 
> Without getting into the merits of this patch - which is actually trying to
> change the semantics of /proc/vmstat - I am wondering how this is even related
> to this particular series? If required, this could be debated on its own
> separately.

Having the number of CMA pages allocated and the number of CMA pages freed
allows someone to infer how many tagged pages are in use at a given time:
(allocated CMA pages - CMA pages allocated by drivers* - CMA pages
released) * 32. That is valuable information for software and hardware
designers.
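
As a made-up example: with cma_alloc_success at 4096 pages, 1024 of those
taken by drivers and cma_release_success at 1024, that would be
(4096 - 1024 - 1024) * 32 = 65536 tagged pages in use.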

Besides that, for every iteration of the series, this has proven invaluable
for discovering bugs with freeing and/or reserving tag storage pages.

*that would require userspace reading cma_alloc_success and
cma_release_success before any tagged allocations are performed.

Thanks,
Alex



Re: [PATCH RFC v3 05/35] mm: cma: Don't append newline when generating CMA area name

2024-01-29 Thread Alexandru Elisei
Hi,

On Mon, Jan 29, 2024 at 02:43:08PM +0530, Anshuman Khandual wrote:
> 
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > cma->name is displayed in several CMA messages. When the name is generated
> > by the CMA code, don't append a newline to avoid breaking the text across
> > two lines.
> 
> An example of such mis-formatted CMA output from dmesg could be added
> here in the commit message to demonstrate the problem better.
> 
> > 
> > Signed-off-by: Alexandru Elisei 
> > ---
> 
> Regardless, LGTM.
> 
> Reviewed-by: Anshuman Khandual 

Thanks!

> 
> > 
> > Changes since rfc v2:
> > 
> > * New patch. This is a fix, and can be merged independently of the other
> > patches.
> 
> Right, need not be part of this series. Hence please send it separately to
> the MM list.

Will do!

Alex

> 
> > 
> >  mm/cma.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 7c09c47e530b..f49c95f8ee37 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -204,7 +204,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, 
> > phys_addr_t size,
> > if (name)
> > snprintf(cma->name, CMA_MAX_NAME, name);
> > else
> > -   snprintf(cma->name, CMA_MAX_NAME,  "cma%d\n", cma_area_count);
> > +   snprintf(cma->name, CMA_MAX_NAME,  "cma%d", cma_area_count);
> >  
> > cma->base_pfn = PFN_DOWN(base);
> > cma->count = size >> PAGE_SHIFT;



  1   2   >