date:20230925

RE: [PATCH v4] vmscan: add trace events for lru_gen

2023-09-25 Thread 김재원

>On Mon, Sep 25, 2023 at 10:20?PM Jaewon Kim  wrote:
>>
>> As the legacy lru provides, the lru_gen needs some trace events for
>> debugging.
>>
>> This commit introduces 2 trace events.
>>   trace_mm_vmscan_lru_gen_scan
>>   trace_mm_vmscan_lru_gen_evict
>>
>> Each event is similar to the following legacy events.
>>   trace_mm_vmscan_lru_isolate,
>>   trace_mm_vmscan_lru_shrink_[in]active
>
>We should just reuse trace_mm_vmscan_lru_isolate and
>trace_mm_vmscan_lru_shrink_inactive instead of adding new tracepoints.
>
>To reuse trace_mm_vmscan_lru_isolate, we'd just need to append two new
>names to LRU_NAMES.
>
>The naming of trace_mm_vmscan_lru_shrink_inactive might seem confusing
>but it's how MGLRU maintains the compatibility, e.g., the existing
>active/inactive counters in /proc/vmstat.


Hello

Actually I had tried to reuse them. But some value was not that compatible.
Let me try that way again.

>
>> Here's an example
>>   mm_vmscan_lru_gen_scan: classzone=2 order=0 nr_requested=4096 
>> nr_scanned=64 nr_skipped=0 nr_taken=64 lru=anon
>>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=64 nr_dirty=0 nr_writeback=0 
>> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 
>> nr_ref_keep=0 nr_unmap_fail=0 priority=2 
>> flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
>>   mm_vmscan_lru_gen_scan: classzone=1 order=0 nr_requested=4096 
>> nr_scanned=64 nr_skipped=0 nr_taken=64 lru=file
>>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=64 nr_dirty=0 nr_writeback=0 
>> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 
>> nr_ref_keep=0 nr_unmap_fail=0 priority=12 
>> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>>
>> Signed-off-by: Jaewon Kim 
>> ---
>> v4: wrap with #ifdef CONFIG_LRU_GEN
>> v3: change printk format
>> v2: use condition and make it aligned
>> v1: introduce trace events
>> ---
>>  include/trace/events/mmflags.h |  9 
>>  include/trace/events/vmscan.h  | 96 ++
>>  mm/vmscan.c| 20 +--
>>  3 files changed, 120 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
>> index 1478b9dd05fa..6dfe85bd4e81 100644
>> --- a/include/trace/events/mmflags.h
>> +++ b/include/trace/events/mmflags.h
>> @@ -274,6 +274,12 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" )
>>\
>> EM (LRU_ACTIVE_FILE, "active_file") \
>> EMe(LRU_UNEVICTABLE, "unevictable")
>>
>> +#ifdef CONFIG_LRU_GEN
>> +#define LRU_GEN_NAMES  \
>> +   EM (LRU_GEN_ANON, "anon") \
>> +   EMe(LRU_GEN_FILE, "file")
>> +#endif
>> +
>>  /*
>>   * First define the enums in the above macros to be exported to userspace
>>   * via TRACE_DEFINE_ENUM().
>> @@ -288,6 +294,9 @@ COMPACTION_PRIORITY
>>  /* COMPACTION_FEEDBACK are defines not enums. Not needed here. */
>>  ZONE_TYPE
>>  LRU_NAMES
>> +#ifdef CONFIG_LRU_GEN
>> +LRU_GEN_NAMES
>> +#endif
>>
>>  /*
>>   * Now redefine the EM() and EMe() macros to map the enums to the strings
>> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
>> index d2123dd960d5..2080ef742f89 100644
>> --- a/include/trace/events/vmscan.h
>> +++ b/include/trace/events/vmscan.h
>> @@ -327,6 +327,102 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
>> __print_symbolic(__entry->lru, LRU_NAMES))
>>  );
>>
>> +#ifdef CONFIG_LRU_GEN
>> +TRACE_EVENT_CONDITION(mm_vmscan_lru_gen_scan,
>> +   TP_PROTO(int highest_zoneidx,
>> +   int order,
>> +   unsigned long nr_requested,
>> +   unsigned long nr_scanned,
>> +   unsigned long nr_skipped,
>> +   unsigned long nr_taken,
>> +   int lru),
>> +
>> +   TP_ARGS(highest_zoneidx, order, nr_requested, nr_scanned, 
>> nr_skipped, nr_taken, lru),
>> +
>> +   TP_CONDITION(nr_scanned),
>> +
>> +   TP_STRUCT__entry(
>> +   __field(int, highest_zoneidx)
>> +   __field(int, order)
>> +   __field(unsigned long, nr_requested)
>> +   __field(unsigned long, nr_scanned)
>> +   __field(unsigned long, nr_skipped)
>> +   __field(unsigned long, nr_taken)
>> +   __field(int, lru)
>> +   ),
>> +
>> +   TP_fast_assign(
>> +   __entry->highest_zoneidx = highest_zoneidx;
>> +   __entry->order = order;
>> +   __entry->nr_requested = nr_requested;
>> +   __entry->nr_scanned = nr_scanned;
>> +   __entry->nr_skipped = nr_skipped;
>> +   __entry->nr_taken = nr_taken;
>> +   __entry->lru = lru;
>> +   ),
>> +
>> +   /*
>> +* classzone is previous name of the highest_zoneidx.
>> +* Reason not to change it is the ABI requirement of the tracepoint.
>> +*/
>> +   TP_printk("classzone=%d order=%d nr_requested=%lu nr_scanned=%lu 
>> nr_skipped=%lu nr_taken=%lu lru=%s",
>> +

Re: [PATCH v4] vmscan: add trace events for lru_gen

2023-09-25 Thread Yu Zhao

On Mon, Sep 25, 2023 at 10:20 PM Jaewon Kim  wrote:
>
> As the legacy lru provides, the lru_gen needs some trace events for
> debugging.
>
> This commit introduces 2 trace events.
>   trace_mm_vmscan_lru_gen_scan
>   trace_mm_vmscan_lru_gen_evict
>
> Each event is similar to the following legacy events.
>   trace_mm_vmscan_lru_isolate,
>   trace_mm_vmscan_lru_shrink_[in]active

We should just reuse trace_mm_vmscan_lru_isolate and
trace_mm_vmscan_lru_shrink_inactive instead of adding new tracepoints.

To reuse trace_mm_vmscan_lru_isolate, we'd just need to append two new
names to LRU_NAMES.

The naming of trace_mm_vmscan_lru_shrink_inactive might seem confusing
but it's how MGLRU maintains the compatibility, e.g., the existing
active/inactive counters in /proc/vmstat.

> Here's an example
>   mm_vmscan_lru_gen_scan: classzone=2 order=0 nr_requested=4096 nr_scanned=64 
> nr_skipped=0 nr_taken=64 lru=anon
>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=64 nr_dirty=0 nr_writeback=0 
> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 
> nr_ref_keep=0 nr_unmap_fail=0 priority=2 
> flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
>   mm_vmscan_lru_gen_scan: classzone=1 order=0 nr_requested=4096 nr_scanned=64 
> nr_skipped=0 nr_taken=64 lru=file
>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=64 nr_dirty=0 nr_writeback=0 
> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 
> nr_ref_keep=0 nr_unmap_fail=0 priority=12 
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>
> Signed-off-by: Jaewon Kim 
> ---
> v4: wrap with #ifdef CONFIG_LRU_GEN
> v3: change printk format
> v2: use condition and make it aligned
> v1: introduce trace events
> ---
>  include/trace/events/mmflags.h |  9 
>  include/trace/events/vmscan.h  | 96 ++
>  mm/vmscan.c| 20 +--
>  3 files changed, 120 insertions(+), 5 deletions(-)
>
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 1478b9dd05fa..6dfe85bd4e81 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -274,6 +274,12 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) 
>   \
> EM (LRU_ACTIVE_FILE, "active_file") \
> EMe(LRU_UNEVICTABLE, "unevictable")
>
> +#ifdef CONFIG_LRU_GEN
> +#define LRU_GEN_NAMES  \
> +   EM (LRU_GEN_ANON, "anon") \
> +   EMe(LRU_GEN_FILE, "file")
> +#endif
> +
>  /*
>   * First define the enums in the above macros to be exported to userspace
>   * via TRACE_DEFINE_ENUM().
> @@ -288,6 +294,9 @@ COMPACTION_PRIORITY
>  /* COMPACTION_FEEDBACK are defines not enums. Not needed here. */
>  ZONE_TYPE
>  LRU_NAMES
> +#ifdef CONFIG_LRU_GEN
> +LRU_GEN_NAMES
> +#endif
>
>  /*
>   * Now redefine the EM() and EMe() macros to map the enums to the strings
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index d2123dd960d5..2080ef742f89 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -327,6 +327,102 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
> __print_symbolic(__entry->lru, LRU_NAMES))
>  );
>
> +#ifdef CONFIG_LRU_GEN
> +TRACE_EVENT_CONDITION(mm_vmscan_lru_gen_scan,
> +   TP_PROTO(int highest_zoneidx,
> +   int order,
> +   unsigned long nr_requested,
> +   unsigned long nr_scanned,
> +   unsigned long nr_skipped,
> +   unsigned long nr_taken,
> +   int lru),
> +
> +   TP_ARGS(highest_zoneidx, order, nr_requested, nr_scanned, nr_skipped, 
> nr_taken, lru),
> +
> +   TP_CONDITION(nr_scanned),
> +
> +   TP_STRUCT__entry(
> +   __field(int, highest_zoneidx)
> +   __field(int, order)
> +   __field(unsigned long, nr_requested)
> +   __field(unsigned long, nr_scanned)
> +   __field(unsigned long, nr_skipped)
> +   __field(unsigned long, nr_taken)
> +   __field(int, lru)
> +   ),
> +
> +   TP_fast_assign(
> +   __entry->highest_zoneidx = highest_zoneidx;
> +   __entry->order = order;
> +   __entry->nr_requested = nr_requested;
> +   __entry->nr_scanned = nr_scanned;
> +   __entry->nr_skipped = nr_skipped;
> +   __entry->nr_taken = nr_taken;
> +   __entry->lru = lru;
> +   ),
> +
> +   /*
> +* classzone is previous name of the highest_zoneidx.
> +* Reason not to change it is the ABI requirement of the tracepoint.
> +*/
> +   TP_printk("classzone=%d order=%d nr_requested=%lu nr_scanned=%lu 
> nr_skipped=%lu nr_taken=%lu lru=%s",
> +   __entry->highest_zoneidx,
> +   __entry->order,
> +   __entry->nr_requested,
> +   __entry->nr_scanned,
> +   __entry->nr_skipped,
> +   __entry->nr_taken,
> +

[PATCH v4] vmscan: add trace events for lru_gen

2023-09-25 Thread Jaewon Kim

As the legacy lru provides, the lru_gen needs some trace events for
debugging.

This commit introduces 2 trace events.
  trace_mm_vmscan_lru_gen_scan
  trace_mm_vmscan_lru_gen_evict

Each event is similar to the following legacy events.
  trace_mm_vmscan_lru_isolate,
  trace_mm_vmscan_lru_shrink_[in]active

Here's an example
  mm_vmscan_lru_gen_scan: classzone=2 order=0 nr_requested=4096 nr_scanned=64 
nr_skipped=0 nr_taken=64 lru=anon
  mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=64 nr_dirty=0 nr_writeback=0 
nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 
nr_ref_keep=0 nr_unmap_fail=0 priority=2 flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
  mm_vmscan_lru_gen_scan: classzone=1 order=0 nr_requested=4096 nr_scanned=64 
nr_skipped=0 nr_taken=64 lru=file
  mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=64 nr_dirty=0 nr_writeback=0 
nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 
nr_ref_keep=0 nr_unmap_fail=0 priority=12 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Signed-off-by: Jaewon Kim 
---
v4: wrap with #ifdef CONFIG_LRU_GEN
v3: change printk format
v2: use condition and make it aligned
v1: introduce trace events
---
 include/trace/events/mmflags.h |  9 
 include/trace/events/vmscan.h  | 96 ++
 mm/vmscan.c| 20 +--
 3 files changed, 120 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 1478b9dd05fa..6dfe85bd4e81 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -274,6 +274,12 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" )   
\
EM (LRU_ACTIVE_FILE, "active_file") \
EMe(LRU_UNEVICTABLE, "unevictable")
 
+#ifdef CONFIG_LRU_GEN
+#define LRU_GEN_NAMES  \
+   EM (LRU_GEN_ANON, "anon") \
+   EMe(LRU_GEN_FILE, "file")
+#endif
+
 /*
  * First define the enums in the above macros to be exported to userspace
  * via TRACE_DEFINE_ENUM().
@@ -288,6 +294,9 @@ COMPACTION_PRIORITY
 /* COMPACTION_FEEDBACK are defines not enums. Not needed here. */
 ZONE_TYPE
 LRU_NAMES
+#ifdef CONFIG_LRU_GEN
+LRU_GEN_NAMES
+#endif
 
 /*
  * Now redefine the EM() and EMe() macros to map the enums to the strings
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d2123dd960d5..2080ef742f89 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -327,6 +327,102 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__print_symbolic(__entry->lru, LRU_NAMES))
 );
 
+#ifdef CONFIG_LRU_GEN
+TRACE_EVENT_CONDITION(mm_vmscan_lru_gen_scan,
+   TP_PROTO(int highest_zoneidx,
+   int order,
+   unsigned long nr_requested,
+   unsigned long nr_scanned,
+   unsigned long nr_skipped,
+   unsigned long nr_taken,
+   int lru),
+
+   TP_ARGS(highest_zoneidx, order, nr_requested, nr_scanned, nr_skipped, 
nr_taken, lru),
+
+   TP_CONDITION(nr_scanned),
+
+   TP_STRUCT__entry(
+   __field(int, highest_zoneidx)
+   __field(int, order)
+   __field(unsigned long, nr_requested)
+   __field(unsigned long, nr_scanned)
+   __field(unsigned long, nr_skipped)
+   __field(unsigned long, nr_taken)
+   __field(int, lru)
+   ),
+
+   TP_fast_assign(
+   __entry->highest_zoneidx = highest_zoneidx;
+   __entry->order = order;
+   __entry->nr_requested = nr_requested;
+   __entry->nr_scanned = nr_scanned;
+   __entry->nr_skipped = nr_skipped;
+   __entry->nr_taken = nr_taken;
+   __entry->lru = lru;
+   ),
+
+   /*
+* classzone is previous name of the highest_zoneidx.
+* Reason not to change it is the ABI requirement of the tracepoint.
+*/
+   TP_printk("classzone=%d order=%d nr_requested=%lu nr_scanned=%lu 
nr_skipped=%lu nr_taken=%lu lru=%s",
+   __entry->highest_zoneidx,
+   __entry->order,
+   __entry->nr_requested,
+   __entry->nr_scanned,
+   __entry->nr_skipped,
+   __entry->nr_taken,
+   __print_symbolic(__entry->lru, LRU_GEN_NAMES))
+);
+
+TRACE_EVENT(mm_vmscan_lru_gen_evict,
+
+   TP_PROTO(int nid, unsigned long nr_reclaimed,
+   struct reclaim_stat *stat, int priority, int file),
+
+   TP_ARGS(nid, nr_reclaimed, stat, priority, file),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, nr_reclaimed)
+   __field(unsigned long, nr_dirty)
+   __field(unsigned long, nr_writeback)
+   __field(unsigned long, nr_congested)
+   __field(unsigned long, nr_immediate)
+   __field(unsigned int, nr_activate0)
+   __field(unsigned int, nr_activate1)
+

[PATCH] [v2] ieee802154: ca8210: Fix a potential UAF in ca8210_probe

2023-09-25 Thread Dinghao Liu

If of_clk_add_provider() fails in ca8210_register_ext_clock(),
it calls clk_unregister() to release priv->clk and returns an
error. However, the caller ca8210_probe() then calls ca8210_remove(),
where priv->clk is freed again in ca8210_unregister_ext_clock(). In
this case, a use-after-free may happen in the second time we call
clk_unregister().

Fix this by removing the first clk_unregister(). Also, priv->clk could
be an error code on failure of clk_register_fixed_rate(). Use
IS_ERR_OR_NULL to catch this case in ca8210_unregister_ext_clock().

Fixes: ded845a781a5 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
Signed-off-by: Dinghao Liu 
---

Changelog:

v2: -Remove the first clk_unregister() instead of nulling priv->clk.
---
 drivers/net/ieee802154/ca8210.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ieee802154/ca8210.c b/drivers/net/ieee802154/ca8210.c
index aebb19f1b3a4..b35c6f59bd1a 100644
--- a/drivers/net/ieee802154/ca8210.c
+++ b/drivers/net/ieee802154/ca8210.c
@@ -2759,7 +2759,6 @@ static int ca8210_register_ext_clock(struct spi_device 
*spi)
}
ret = of_clk_add_provider(np, of_clk_src_simple_get, priv->clk);
if (ret) {
-   clk_unregister(priv->clk);
dev_crit(
>dev,
"Failed to register external clock as clock provider\n"
@@ -2780,7 +2779,7 @@ static void ca8210_unregister_ext_clock(struct spi_device 
*spi)
 {
struct ca8210_priv *priv = spi_get_drvdata(spi);
 
-   if (!priv->clk)
+   if (IS_ERR_OR_NULL(priv->clk))
return
 
of_clk_del_provider(spi->dev.of_node);
-- 
2.17.1

Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events

2023-09-25 Thread Haitao Huang


Hi Jarkko

On Mon, 25 Sep 2023 12:09:21 -0500, Jarkko Sakkinen   
wrote:



On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:

From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed, or in event of user writing the max value to
the misc.max file to set the usage limit of a specific resource
[admin-guide/cgroup-v2.rst, 5-9. Misc].

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver:
- On css_alloc, allocate and initialize necessary structures for EPC
reclaiming, e.g., LRU list, work queue, etc.
- On css_free, cleanup and free those structures created in alloc.
- On max_write, trigger EPC reclaiming if the new limit is at or below
current usage.

Signed-off-by: Kristen Carlson Accardi 
Signed-off-by: Haitao Huang 
---
V5:
- Remove prefixes from the callback names (tj)
- Update commit message (Jarkko)

V4:
- Moved this to the front of the series.
- Applies on cgroup/for-6.6 with the overflow fix for misc.

V3:
- Removed the released() callback
---
 include/linux/misc_cgroup.h |  5 +
 kernel/cgroup/misc.c| 32 +---
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..96a88822815a 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -37,6 +37,11 @@ struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
+
+   /* per resource callback ops */
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+   void (*max_write)(struct misc_cg *cg);
 };

 /**
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..62c9198dee21 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct  
kernfs_open_file *of, char *buf,


cg = css_misc(of_css(of));

-   if (READ_ONCE(misc_res_capacity[type]))
+   if (READ_ONCE(misc_res_capacity[type])) {
WRITE_ONCE(cg->res[type].max, max);
-   else
+   if (cg->res[type].max_write)
+   cg->res[type].max_write(cg);
+   } else {
ret = -EINVAL;
+   }

return ret ? ret : nbytes;
 }
@@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
+   struct misc_cg *parent_cg;
enum misc_res_type i;
struct misc_cg *cg;
+   int ret;

if (!parent_css) {
cg = _cg;
+   parent_cg = _cg;
} else {
cg = kzalloc(sizeof(*cg), GFP_KERNEL);
if (!cg)
return ERR_PTR(-ENOMEM);
+   parent_cg = css_misc(parent_css);
}

for (i = 0; i < MISC_CG_RES_TYPES; i++) {
WRITE_ONCE(cg->res[i].max, MAX_NUM);
atomic64_set(>res[i].usage, 0);
+   if (parent_cg->res[i].alloc) {
+   ret = parent_cg->res[i].alloc(cg);
+   if (ret)
+   goto alloc_err;
+   }
}

return >css;
+
+alloc_err:
+   for (i = 0; i < MISC_CG_RES_TYPES; i++)
+   if (parent_cg->res[i].free)
+   cg->res[i].free(cg);
+   kfree(cg);
+   return ERR_PTR(ret);
 }

 /**
@@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state  
*parent_css)

  */
 static void misc_cg_free(struct cgroup_subsys_state *css)
 {
-   kfree(css_misc(css));
+   struct misc_cg *cg = css_misc(css);
+   enum misc_res_type i;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++)
+   if (cg->res[i].free)
+   cg->res[i].free(cg);
+
+   kfree(cg);
 }

 /* Cgroup controller callbacks */
--
2.25.1


Since the only existing client feature requires all callbacks, should
this not have that as an invariant?

I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
catch issues in the kernel code).



These callbacks are chained from cgroup_subsys, and they are defined  
separately so it'd be better follow the same pattern.  Or change together  
with cgroup_subsys if we want to do that. Reasonable?


Thanks
Haitao

RE: [PATCH v3] vmscan: add trace events for lru_gen

2023-09-25 Thread 김재원

>Hello,
>
>On Sun, 24 Sep 2023 23:23:43 +0900 Jaewon Kim  wrote:
>
>> As the legacy lru provides, the lru_gen needs some trace events for
>> debugging.
>> 
>> This commit introduces 2 trace events.
>>   trace_mm_vmscan_lru_gen_scan
>>   trace_mm_vmscan_lru_gen_evict
>> 
>> Each event is similar to the following legacy events.
>>   trace_mm_vmscan_lru_isolate,
>>   trace_mm_vmscan_lru_shrink_[in]active
>> 
>> Here's an example
>>   mm_vmscan_lru_gen_scan: isolate_mode=0 classzone=1 order=9 
>> nr_requested=4096 nr_scanned=431 nr_skipped=0 nr_taken=55 lru=anon
>>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=42 nr_dirty=0 nr_writeback=0 
>> nr_congested=0 nr_immediate=0 nr_activate_anon=13 nr_activate_file=0 
>> nr_ref_keep=0 nr_unmap_fail=0 priority=2 
>> flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
>>   mm_vmscan_lru_gen_scan: isolate_mode=0 classzone=1 order=9 
>> nr_requested=4096 nr_scanned=66 nr_skipped=0 nr_taken=64 lru=file
>>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=62 nr_dirty=0 nr_writeback=0 
>> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=2 
>> nr_ref_keep=0 nr_unmap_fail=0 priority=2 
>> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
>> 
>> Signed-off-by: Jaewon Kim 
>> Reviewed-by: Steven Rostedt (Google) 
>> Reviewed-by: T.J. Mercier 
>> ---
>> v3: change printk format
>> v2: use condition and make it aligned
>> v1: introduce trace events
>> ---
>>  include/trace/events/mmflags.h |  5 ++
>>  include/trace/events/vmscan.h  | 98 ++
>>  mm/vmscan.c| 17 --
>>  3 files changed, 115 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
>> index 1478b9dd05fa..44e9b38f83e7 100644
>> --- a/include/trace/events/mmflags.h
>> +++ b/include/trace/events/mmflags.h
>> @@ -274,6 +274,10 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,  "softdirty" 
>> )   \
>>  EM (LRU_ACTIVE_FILE, "active_file") \
>>  EMe(LRU_UNEVICTABLE, "unevictable")
>>  
>> +#define LRU_GEN_NAMES   \
>> +EM (LRU_GEN_ANON, "anon") \
>> +EMe(LRU_GEN_FILE, "file")
>> +
>
>I found this patchset makes build fails when !CONFIG_LRU_GEN, like below:
>
>In file included from /linux/include/trace/trace_events.h:27,
> from /linux/include/trace/define_trace.h:102,
> from /linux/include/trace/events/oom.h:195,
> from /linux/mm/oom_kill.c:53:
>/linux/include/trace/events/mmflags.h:278:7: error: ‘LRU_GEN_ANON’ 
> undeclared here (not in a function); did you mean ‘LRU_GEN_PGOFF’?
>  278 |   EM (LRU_GEN_ANON, "anon") \
>  |   ^~~~
>
>Maybe some config checks are needed?

Sorry and thank you for your comment.
I think I need to wrap with CONFIG_LRU_GEN
Adding #ifdef CONFIG_LRU_GEN even to vmscan.c seems not nice though.

Additionally I had to remove isolate_mode to be compatible with
mm, vmscan: remove ISOLATE_UNMAPPED
https://lore.kernel.org/linux-mm/20230914131637.12204-4-vba...@suse.cz/

Here's what I changed on top of this v3 patch.
I'm trying to find a way not to use ifdef CONFIG_LRU_GEN in vmscan.c



--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -274,9 +274,11 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" )   
\
EM (LRU_ACTIVE_FILE, "active_file") \
EMe(LRU_UNEVICTABLE, "unevictable")
 
+#ifdef CONFIG_LRU_GEN
 #define LRU_GEN_NAMES  \
EM (LRU_GEN_ANON, "anon") \
EMe(LRU_GEN_FILE, "file")
+#endif
 
 /*
  * First define the enums in the above macros to be exported to userspace
@@ -292,7 +294,9 @@ COMPACTION_PRIORITY
 /* COMPACTION_FEEDBACK are defines not enums. Not needed here. */
 ZONE_TYPE
 LRU_NAMES
+#ifdef CONFIG_LRU_GEN
 LRU_GEN_NAMES
+#endif
 
 /*
  * Now redefine the EM() and EMe() macros to map the enums to the strings
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index e7230fa8bda1..ba99182d6558 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -323,6 +323,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__print_symbolic(__entry->lru, LRU_NAMES))
 );
 
+#ifdef CONFIG_LRU_GEN
 TRACE_EVENT_CONDITION(mm_vmscan_lru_gen_scan,
TP_PROTO(int highest_zoneidx,
int order,
@@ -330,10 +331,9 @@ TRACE_EVENT_CONDITION(mm_vmscan_lru_gen_scan,
unsigned long nr_scanned,
unsigned long nr_skipped,
unsigned long nr_taken,
-   isolate_mode_t isolate_mode,
int lru),
 
-   TP_ARGS(highest_zoneidx, order, nr_requested, nr_scanned, nr_skipped, 
nr_taken, isolate_mode, lru),
+   TP_ARGS(highest_zoneidx, order, nr_requested, nr_scanned, nr_skipped, 
nr_taken, lru),
 
TP_CONDITION(nr_scanned),
 
@@ -344,7 +344,6 @@ TRACE_EVENT_CONDITION(mm_vmscan_lru_gen_scan,

Re: [PATCH v5 04/12] fprobe: Use ftrace_regs in fprobe entry handler

2023-09-25 Thread Google

On Tue, 26 Sep 2023 00:14:33 +0200
Jiri Olsa  wrote:

> On Mon, Sep 25, 2023 at 09:15:15PM +0900, Masami Hiramatsu wrote:
> > Hi Jiri,
> > 
> > On Mon, 25 Sep 2023 12:41:59 +0200
> > Jiri Olsa  wrote:
> > 
> > > On Sun, Sep 24, 2023 at 10:36:36PM +0900, Masami Hiramatsu (Google) wrote:
> > > > From: Masami Hiramatsu (Google) 
> > > > 
> > > > This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
> > > > instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, then we can enable fprobe
> > > > on arm64.
> > > > 
> > > > Signed-off-by: Masami Hiramatsu (Google) 
> > > > Acked-by: Florent Revest 
> > > 
> > > I was getting bpf selftests failures with this patchset and when
> > > bisecting I'm getting crash when running on top of this change
> > 
> > Thanks for bisecting!
> > 
> > > 
> > > looks like it's missing some of the regs NULL checks added later?
> > 
> > yeah, if the RIP (arch_rethook_prepare+0x0/0x30) is correct, 
> > 
> > void arch_rethook_prepare(struct rethook_node *rh, struct ftrace_regs 
> > *fregs, bool mcount)
> > 
> > RSI (the 2nd argument) is NULL. This means fregs == NULL and caused the 
> > crash.
> > I think ftrace_get_regs(fregs) for the entry handler may return NULL.
> > 
> > Ah, 
> > 
> > @@ -182,7 +182,7 @@ static void fprobe_init(struct fprobe *fp)
> > fp->ops.func = fprobe_kprobe_handler;
> > else
> > fp->ops.func = fprobe_handler;
> > -   fp->ops.flags |= FTRACE_OPS_FL_SAVE_REGS;
> > +   fp->ops.flags |= FTRACE_OPS_FL_SAVE_ARGS;
> >  }
> >  
> >  static int fprobe_init_rethook(struct fprobe *fp, int num)
> > 
> > This may cause the issue, it should keep REGS at this point (this must be 
> > done in
> > [9/12]). But after applying [9/12], it shouldn't be a problem... 
> > 
> > Let me check it again.
> 
> that helped with the crash, I'll continue bisecting to find out
> where it breaks the tests

Can you share the configuration and the test?
I would like to reproduce it because I couldn't make it reproduced.

Thank you,

-- 
Masami Hiramatsu (Google)

Re: [PATCH v5 04/12] fprobe: Use ftrace_regs in fprobe entry handler

2023-09-25 Thread Jiri Olsa

On Mon, Sep 25, 2023 at 09:15:15PM +0900, Masami Hiramatsu wrote:
> Hi Jiri,
> 
> On Mon, 25 Sep 2023 12:41:59 +0200
> Jiri Olsa  wrote:
> 
> > On Sun, Sep 24, 2023 at 10:36:36PM +0900, Masami Hiramatsu (Google) wrote:
> > > From: Masami Hiramatsu (Google) 
> > > 
> > > This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
> > > instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, then we can enable fprobe
> > > on arm64.
> > > 
> > > Signed-off-by: Masami Hiramatsu (Google) 
> > > Acked-by: Florent Revest 
> > 
> > I was getting bpf selftests failures with this patchset and when
> > bisecting I'm getting crash when running on top of this change
> 
> Thanks for bisecting!
> 
> > 
> > looks like it's missing some of the regs NULL checks added later?
> 
> yeah, if the RIP (arch_rethook_prepare+0x0/0x30) is correct, 
> 
> void arch_rethook_prepare(struct rethook_node *rh, struct ftrace_regs *fregs, 
> bool mcount)
> 
> RSI (the 2nd argument) is NULL. This means fregs == NULL and caused the crash.
> I think ftrace_get_regs(fregs) for the entry handler may return NULL.
> 
> Ah, 
> 
> @@ -182,7 +182,7 @@ static void fprobe_init(struct fprobe *fp)
>   fp->ops.func = fprobe_kprobe_handler;
>   else
>   fp->ops.func = fprobe_handler;
> - fp->ops.flags |= FTRACE_OPS_FL_SAVE_REGS;
> + fp->ops.flags |= FTRACE_OPS_FL_SAVE_ARGS;
>  }
>  
>  static int fprobe_init_rethook(struct fprobe *fp, int num)
> 
> This may cause the issue, it should keep REGS at this point (this must be 
> done in
> [9/12]). But after applying [9/12], it shouldn't be a problem... 
> 
> Let me check it again.

that helped with the crash, I'll continue bisecting to find out
where it breaks the tests

thanks,
jirka

Re: [PATCH v3] vmscan: add trace events for lru_gen

2023-09-25 Thread SeongJae Park

Hello,

On Sun, 24 Sep 2023 23:23:43 +0900 Jaewon Kim  wrote:

> As the legacy lru provides, the lru_gen needs some trace events for
> debugging.
> 
> This commit introduces 2 trace events.
>   trace_mm_vmscan_lru_gen_scan
>   trace_mm_vmscan_lru_gen_evict
> 
> Each event is similar to the following legacy events.
>   trace_mm_vmscan_lru_isolate,
>   trace_mm_vmscan_lru_shrink_[in]active
> 
> Here's an example
>   mm_vmscan_lru_gen_scan: isolate_mode=0 classzone=1 order=9 
> nr_requested=4096 nr_scanned=431 nr_skipped=0 nr_taken=55 lru=anon
>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=42 nr_dirty=0 nr_writeback=0 
> nr_congested=0 nr_immediate=0 nr_activate_anon=13 nr_activate_file=0 
> nr_ref_keep=0 nr_unmap_fail=0 priority=2 
> flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
>   mm_vmscan_lru_gen_scan: isolate_mode=0 classzone=1 order=9 
> nr_requested=4096 nr_scanned=66 nr_skipped=0 nr_taken=64 lru=file
>   mm_vmscan_lru_gen_evict: nid=0 nr_reclaimed=62 nr_dirty=0 nr_writeback=0 
> nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=2 
> nr_ref_keep=0 nr_unmap_fail=0 priority=2 
> flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
> 
> Signed-off-by: Jaewon Kim 
> Reviewed-by: Steven Rostedt (Google) 
> Reviewed-by: T.J. Mercier 
> ---
> v3: change printk format
> v2: use condition and make it aligned
> v1: introduce trace events
> ---
>  include/trace/events/mmflags.h |  5 ++
>  include/trace/events/vmscan.h  | 98 ++
>  mm/vmscan.c| 17 --
>  3 files changed, 115 insertions(+), 5 deletions(-)
> 
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 1478b9dd05fa..44e9b38f83e7 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -274,6 +274,10 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,   "softdirty" 
> )   \
>   EM (LRU_ACTIVE_FILE, "active_file") \
>   EMe(LRU_UNEVICTABLE, "unevictable")
>  
> +#define LRU_GEN_NAMES\
> + EM (LRU_GEN_ANON, "anon") \
> + EMe(LRU_GEN_FILE, "file")
> +

I found this patchset makes build fails when !CONFIG_LRU_GEN, like below:

In file included from /linux/include/trace/trace_events.h:27,
 from /linux/include/trace/define_trace.h:102,
 from /linux/include/trace/events/oom.h:195,
 from /linux/mm/oom_kill.c:53:
/linux/include/trace/events/mmflags.h:278:7: error: ‘LRU_GEN_ANON’ 
undeclared here (not in a function); did you mean ‘LRU_GEN_PGOFF’?
  278 |   EM (LRU_GEN_ANON, "anon") \
  |   ^~~~

Maybe some config checks are needed?


Thanks,
SJ

[...]

Re: [PATCH v5 02/18] cgroup/misc: Add SGX EPC resource type and export APIs for SGX driver

2023-09-25 Thread Tejun Heo

On Fri, Sep 22, 2023 at 08:06:41PM -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
> for the misc controller.
> 
> Add per resource type private data so that SGX can store additional per
> cgroup data in misc_cg->misc_cg_res[MISC_CG_RES_SGX_EPC].
> 
> Export misc_cg_root() so the SGX driver can initialize and add those
> additional structures to the root misc cgroup as part of initialization
> for EPC cgroup support. This bootstraps the same additional
> initialization for non-root cgroups in the 'alloc()' callback added in the
> previous patch.
> 
> The SGX driver, as the EPC memory provider, will have a background
> worker to reclaim EPC pages to make room for new allocations in the same
> cgroup when its usage counter reaches near the limit controlled by the
> cgroup and its ancestors. Therefore it needs to do a walk from the
> current cgroup up to the root. To enable this walk, move parent_misc()
> into misc_cgroup.h and make inline to make this function available to
> SGX, rename it to misc_cg_parent(), and update kernel/cgroup/misc.c to
> use the new name.
> 
> Signed-off-by: Kristen Carlson Accardi 
> Signed-off-by: Haitao Huang 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun

Re: [PATCH] tracing/user_events: align uaddr on unsigned long alignment

2023-09-25 Thread Clément Léger




On 25/09/2023 20:04, Clément Léger wrote:
> 
> 
> On 25/09/2023 18:04, Beau Belgrave wrote:
>> On Mon, Sep 25, 2023 at 09:53:16AM +0200, Clément Léger wrote:
>>>
>>>
>>> On 22/09/2023 21:22, Beau Belgrave wrote:
 On Tue, Sep 19, 2023 at 02:59:12PM +0200, Clément Léger wrote:
>
>
> On 14/09/2023 19:29, Steven Rostedt wrote:
>> On Thu, 14 Sep 2023 13:17:00 -0400
>> Steven Rostedt  wrote:
>>
>>> Now lets look at big endian layout:
>>>
>>>  uaddr = 0xbeef0004
>>>  enabler = 1;
>>>
>>>  memory at 0xbeef:  00 00 00 00 00 00 00 02
>>> ^
>>> addr: 0xbeef0004
>>>
>>> (enabler is set )
>>>
>>> bitoffset = uaddr & (sizeof(unsigned long) - 1); bitoffset = 4
>>> bit_offset *= 8; bitoffset = 32
>>> uaddr &= ~(sizeof(unsigned long) - 1);   uaddr = 
>>> 0xbeef
>>>
>>> ptr = kaddr + (uaddr & ~PAGE_MASK);
>>>
>>> clear_bit(1 + 32, ptr);
>>>
>>>  memory at 0xbeef:  00 00 00 00 00 00 00 02
>>>   ^
>>> bit 33 of 0xbeef
>>>
>>> I don't think that's what you expected!
>>
>> I believe the above can be fixed with:
>>
>>  bit_offset = uaddr & (sizeof(unsigned long) - 1);
>>  if (bit_offset) {
>> #ifdef CONFIG_CPU_BIG_ENDIAN
>>  bit_offest = 0;
>> #else
>>  bit_offset *= BITS_PER_BYTE;
>> #endif
>>  uaddr &= ~(sizeof(unsigned long) - 1);
>>  }
>>
>> -- Steve
>
>
> Actually, after looking more in depth at that, it seems like there are
> actually 2 problems that can happen.
>
> First one is atomic access misalignment due to enable_size == 4 and
> uaddr not being aligned on a (long) boundary on 64 bits architecture.
> This can generate misaligned exceptions on various architectures. This
> can be fixed in a more general way according to Masami snippet.
>
> Second one that I can see is on 64 bits, big endian architectures with
> enable_size == 4. In that case, the bit provided by the userspace won't
> be correctly set since this code kind of assume that the atomic are done
> on 32bits value. Since the kernel assume long sized atomic operation, on
> big endian 64 bits architecture, the updated bit will actually be in the
> next 32 bits word.
>
> Can someone confirm my understanding ?
>

 I have a ppc 64bit BE VM I've been validating this on. If we do the
 shifting within user_events (vs a generic set_bit_aligned approach)
 64bit BE does not need additional bit manipulation. However, if we were
 to blindly pass the address and bit as is to set_bit_aligned() it
 assumes the bit number is for a long, not a 32 bit word. So for that
 approach we would need to offset the bit in the unaligned case.

 Here's a patch I have that I've validated on ppc64 BE, aarch64 LE, and
 x86_64 LE. I personally feel more comfortable with this approach than
 the generic set_bit_aligned() one.

 Thanks,
 -Beau

 diff --git a/kernel/trace/trace_events_user.c 
 b/kernel/trace/trace_events_user.c
 index e3f2b8d72e01..ae854374d0b7 100644
 --- a/kernel/trace/trace_events_user.c
 +++ b/kernel/trace/trace_events_user.c
 @@ -162,6 +162,23 @@ struct user_event_validator {
int flags;
  };
  
 +static inline void align_addr_bit(unsigned long *addr, int *bit)
 +{
 +  if (IS_ALIGNED(*addr, sizeof(long)))
 +  return;
 +
 +  *addr = ALIGN_DOWN(*addr, sizeof(long));
 +
 +  /*
 +   * We only support 32 and 64 bit values. The only time we need
 +   * to align is a 32 bit value on a 64 bit kernel, which on LE
 +   * is always 32 bits, and on BE requires no change.
 +   */
 +#ifdef __LITTLE_ENDIAN
 +  *bit += 32;
 +#endif
>>>
>>> Hi Beau, except the specific alignment that is basically what I ended up
>>> with for the BE 64bit case (ie just bit += 32). Regarding the generic
>>> alignment, depends on what the maintainers wishes (generic of user_event
>>> specific). I also feel like this shoulmd be handle specifically for
>>> user_events which uses set_bit in some non standard way. Any suggestion ?
>>>
>>
>> Looking at this deeper, user_events needs to track some of this
>> alignment requirements within each enabler. This is needed because when
>> enablers are disabled/removed they do not have the actual size. The size
>> is needed to know if we need to update the bits, etc. (IE: If originally
>> a 32-bit value was used and it's aligned and it's on BE, it needs the
>> bits shifted.)
>>
>> My final version that I played around with looked like this for just

Re: [PATCH] tracing/user_events: align uaddr on unsigned long alignment

2023-09-25 Thread Clément Léger




On 25/09/2023 18:04, Beau Belgrave wrote:
> On Mon, Sep 25, 2023 at 09:53:16AM +0200, Clément Léger wrote:
>>
>>
>> On 22/09/2023 21:22, Beau Belgrave wrote:
>>> On Tue, Sep 19, 2023 at 02:59:12PM +0200, Clément Léger wrote:


 On 14/09/2023 19:29, Steven Rostedt wrote:
> On Thu, 14 Sep 2023 13:17:00 -0400
> Steven Rostedt  wrote:
>
>> Now lets look at big endian layout:
>>
>>  uaddr = 0xbeef0004
>>  enabler = 1;
>>
>>  memory at 0xbeef:  00 00 00 00 00 00 00 02
>> ^
>> addr: 0xbeef0004
>>
>>  (enabler is set )
>>
>>  bitoffset = uaddr & (sizeof(unsigned long) - 1); bitoffset = 4
>>  bit_offset *= 8; bitoffset = 32
>>  uaddr &= ~(sizeof(unsigned long) - 1);   uaddr = 0xbeef
>>
>>  ptr = kaddr + (uaddr & ~PAGE_MASK);
>>
>>  clear_bit(1 + 32, ptr);
>>
>>  memory at 0xbeef:  00 00 00 00 00 00 00 02
>>   ^
>>  bit 33 of 0xbeef
>>
>> I don't think that's what you expected!
>
> I believe the above can be fixed with:
>
>   bit_offset = uaddr & (sizeof(unsigned long) - 1);
>   if (bit_offset) {
> #ifdef CONFIG_CPU_BIG_ENDIAN
>   bit_offest = 0;
> #else
>   bit_offset *= BITS_PER_BYTE;
> #endif
>   uaddr &= ~(sizeof(unsigned long) - 1);
>   }
>
> -- Steve


 Actually, after looking more in depth at that, it seems like there are
 actually 2 problems that can happen.

 First one is atomic access misalignment due to enable_size == 4 and
 uaddr not being aligned on a (long) boundary on 64 bits architecture.
 This can generate misaligned exceptions on various architectures. This
 can be fixed in a more general way according to Masami snippet.

 Second one that I can see is on 64 bits, big endian architectures with
 enable_size == 4. In that case, the bit provided by the userspace won't
 be correctly set since this code kind of assume that the atomic are done
 on 32bits value. Since the kernel assume long sized atomic operation, on
 big endian 64 bits architecture, the updated bit will actually be in the
 next 32 bits word.

 Can someone confirm my understanding ?

>>>
>>> I have a ppc 64bit BE VM I've been validating this on. If we do the
>>> shifting within user_events (vs a generic set_bit_aligned approach)
>>> 64bit BE does not need additional bit manipulation. However, if we were
>>> to blindly pass the address and bit as is to set_bit_aligned() it
>>> assumes the bit number is for a long, not a 32 bit word. So for that
>>> approach we would need to offset the bit in the unaligned case.
>>>
>>> Here's a patch I have that I've validated on ppc64 BE, aarch64 LE, and
>>> x86_64 LE. I personally feel more comfortable with this approach than
>>> the generic set_bit_aligned() one.
>>>
>>> Thanks,
>>> -Beau
>>>
>>> diff --git a/kernel/trace/trace_events_user.c 
>>> b/kernel/trace/trace_events_user.c
>>> index e3f2b8d72e01..ae854374d0b7 100644
>>> --- a/kernel/trace/trace_events_user.c
>>> +++ b/kernel/trace/trace_events_user.c
>>> @@ -162,6 +162,23 @@ struct user_event_validator {
>>> int flags;
>>>  };
>>>  
>>> +static inline void align_addr_bit(unsigned long *addr, int *bit)
>>> +{
>>> +   if (IS_ALIGNED(*addr, sizeof(long)))
>>> +   return;
>>> +
>>> +   *addr = ALIGN_DOWN(*addr, sizeof(long));
>>> +
>>> +   /*
>>> +* We only support 32 and 64 bit values. The only time we need
>>> +* to align is a 32 bit value on a 64 bit kernel, which on LE
>>> +* is always 32 bits, and on BE requires no change.
>>> +*/
>>> +#ifdef __LITTLE_ENDIAN
>>> +   *bit += 32;
>>> +#endif
>>
>> Hi Beau, except the specific alignment that is basically what I ended up
>> with for the BE 64bit case (ie just bit += 32). Regarding the generic
>> alignment, depends on what the maintainers wishes (generic of user_event
>> specific). I also feel like this shoulmd be handle specifically for
>> user_events which uses set_bit in some non standard way. Any suggestion ?
>>
> 
> Looking at this deeper, user_events needs to track some of this
> alignment requirements within each enabler. This is needed because when
> enablers are disabled/removed they do not have the actual size. The size
> is needed to know if we need to update the bits, etc. (IE: If originally
> a 32-bit value was used and it's aligned and it's on BE, it needs the
> bits shifted.)
> 
> My final version that I played around with looked like this for just the
> alignment requirements:
> +static inline void align_addr_bit(unsigned long *addr, int *bit,
> + unsigned long *flags)
> +{
> +   if (IS_ALIGNED(*addr, sizeof(long))) {
>

[PATCH] remoteproc: zynqmp: change tcm address translation method

2023-09-25 Thread Tanmay Shah

Introduce device address in hardcode TCM table.
Device address is used for address translation.
Also, previous method(hack) to mask few bits from address
to achieve address translation is removed

Signed-off-by: Tanmay Shah 
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 58 +
 1 file changed, 20 insertions(+), 38 deletions(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
b/drivers/remoteproc/xlnx_r5_remoteproc.c
index efd758c2f4ed..4395edea9a64 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -39,12 +39,14 @@ enum zynqmp_r5_cluster_mode {
  * struct mem_bank_data - Memory Bank description
  *
  * @addr: Start address of memory bank
+ * @da: device address
  * @size: Size of Memory bank
  * @pm_domain_id: Power-domains id of memory bank for firmware to turn on/off
  * @bank_name: name of the bank for remoteproc framework
  */
 struct mem_bank_data {
phys_addr_t addr;
+   u32 da;
size_t size;
u32 pm_domain_id;
char *bank_name;
@@ -76,18 +78,18 @@ struct mbox_info {
  * accepted for system-dt specifications and upstreamed in linux kernel
  */
 static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
-   {0xffe0UL, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each */
-   {0xffe2UL, 0x1UL, PD_R5_0_BTCM, "btcm0"},
-   {0xffe9UL, 0x1UL, PD_R5_1_ATCM, "atcm1"},
-   {0xffebUL, 0x1UL, PD_R5_1_BTCM, "btcm1"},
+   {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
*/
+   {0xffe2UL, 0x2, 0x1UL, PD_R5_0_BTCM, "btcm0"},
+   {0xffe9UL, 0x0, 0x1UL, PD_R5_1_ATCM, "atcm1"},
+   {0xffebUL, 0x2, 0x1UL, PD_R5_1_BTCM, "btcm1"},
 };
 
 /* In lockstep mode cluster combines each 64KB TCM and makes 128KB TCM */
 static const struct mem_bank_data zynqmp_tcm_banks_lockstep[] = {
-   {0xffe0UL, 0x2UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 128KB each */
-   {0xffe2UL, 0x2UL, PD_R5_0_BTCM, "btcm0"},
-   {0, 0, PD_R5_1_ATCM, ""},
-   {0, 0, PD_R5_1_BTCM, ""},
+   {0xffe0UL, 0x0, 0x2UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 128KB 
each */
+   {0xffe2UL, 0x2, 0x2UL, PD_R5_0_BTCM, "btcm0"},
+   {0, 0, 0, PD_R5_1_ATCM, ""},
+   {0, 0, 0, PD_R5_1_BTCM, ""},
 };
 
 /**
@@ -534,30 +536,6 @@ static int tcm_mem_map(struct rproc *rproc,
/* clear TCMs */
memset_io(va, 0, mem->len);
 
-   /*
-* The R5s expect their TCM banks to be at address 0x0 and 0x2000,
-* while on the Linux side they are at 0xffex.
-*
-* Zero out the high 12 bits of the address. This will give
-* expected values for TCM Banks 0A and 0B (0x0 and 0x2).
-*/
-   mem->da &= 0x000f;
-
-   /*
-* TCM Banks 1A and 1B still have to be translated.
-*
-* Below handle these two banks' absolute addresses (0xffe9 and
-* 0xffeb) and convert to the expected relative addresses
-* (0x0 and 0x2).
-*/
-   if (mem->da == 0x9 || mem->da == 0xB)
-   mem->da -= 0x9;
-
-   /* if translated TCM bank address is not valid report error */
-   if (mem->da != 0x0 && mem->da != 0x2) {
-   dev_err(>dev, "invalid TCM address: %x\n", mem->da);
-   return -EINVAL;
-   }
return 0;
 }
 
@@ -579,6 +557,7 @@ static int add_tcm_carveout_split_mode(struct rproc *rproc)
u32 pm_domain_id;
size_t bank_size;
char *bank_name;
+   u32 da;
 
r5_core = rproc->priv;
dev = r5_core->dev;
@@ -591,6 +570,7 @@ static int add_tcm_carveout_split_mode(struct rproc *rproc)
 */
for (i = 0; i < num_banks; i++) {
bank_addr = r5_core->tcm_banks[i]->addr;
+   da = r5_core->tcm_banks[i]->da;
bank_name = r5_core->tcm_banks[i]->bank_name;
bank_size = r5_core->tcm_banks[i]->size;
pm_domain_id = r5_core->tcm_banks[i]->pm_domain_id;
@@ -603,11 +583,11 @@ static int add_tcm_carveout_split_mode(struct rproc 
*rproc)
goto release_tcm_split;
}
 
-   dev_dbg(dev, "TCM carveout split mode %s addr=%llx, size=0x%lx",
-   bank_name, bank_addr, bank_size);
+   dev_dbg(dev, "TCM carveout split mode %s addr=%llx, da=0x%x, 
size=0x%lx",
+   bank_name, bank_addr, da, bank_size);
 
rproc_mem = rproc_mem_entry_init(dev, NULL, bank_addr,
-bank_size, bank_addr,
+bank_size, da,
 tcm_mem_map, tcm_mem_unmap,
 bank_name);
if (!rproc_mem) {
@@ -648,6 +628,7 @@ static int

Re: [PATCH v5 16/18] x86/sgx: Limit process EPC usage with misc cgroup controller

2023-09-25 Thread Jarkko Sakkinen

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
>
> This patch was modified from its original version to use the misc cgroup
> controller instead of a custom controller.
>
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Tested-by: Mikko Ylinen 
>
> Cc: Sean Christopherson 
> ---
> V5:
> - kernel-doc fixes (Jarkko)
>
> V4:
> - Fix a white space issue in Kconfig (Randy).
> - Update comments for LRU list as it can be owned by a cgroup.
> - Fix comments for sgx_reclaim_epc_pages() and use IS_ENABLED consistently 
> (Mikko)
>
> V3:
>
> 1) Use the same maximum number of reclaiming candidate pages to be
> processed, SGX_NR_TO_SCAN_MAX, for each reclaiming iteration in both
> cgroup worker function and ksgxd. This fixes an overflow in the
> backing store buffer with the same fixed size allocated on stack in
> sgx_reclaim_epc_pages().
>
> 2) Initialize max for root EPC cgroup. Otherwise, all
> misc_cg_try_charge() calls would fail as it checks for all limits of
> ancestors all the way to the root node.
>
> 3) Start reclaiming whenever misc_cg_try_charge fails. Removed all
> re-checks for limits and current usage. For all purposes and intent,
> when misc_try_charge() fails, reclaiming is needed. This also corrects
> an error of not reclaiming when the child limit is larger than one of
> its ancestors.
>
> 4) Handle failure on charging to the root EPC cgroup. Failure on charging
> to root means we are at or above capacity, so start reclaiming or return
> OOM error.
>
> 5) Removed the custom cgroup tree walking iterator with epoch tracking
> logic. Replaced it with just the plain css_for_each_descendant_pre
> iterator. The custom iterator implemented a rather complex epoch scheme
> I believe was intended to prevent extra reclaiming from multiple worker
> threads doing the same walk but it turned out not matter much as each
> thread would only reclaim when usage is above limit. Using the plain
> css_for_each_descendant_pre iterator simplified code a bit.
>
> 6) Do not reclaim synchronously in misc_max_write callback which would
> block the user. Instead queue an async work item to run the reclaiming
> loop.
>
> 7) Other minor refactoring:
> - Remove unused params in epc_cgroup APIs
> - centralize uncharge into sgx_free_epc_page()
> ---
>  arch/x86/Kconfig |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 415 +++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  59 
>  arch/x86/kernel/cpu/sgx/main.c   |  68 -
>  arch/x86/kernel/cpu/sgx/sgx.h|  17 +-
>  6 files changed, 556 insertions(+), 17 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
> If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for 
> Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> +   Provides control over the EPC footprint of tasks in a cgroup via
> +   the Miscellaneous cgroup controller.
> +
> +   EPC is a subset of regular memory that is usable only by SGX
> +   enclaves and is very limited in quantity, e.g. less than 1%
> +   of total DRAM.
> +
> +   Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>   bool "X86 userspace shadow stack"
>   depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile 
> b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4

Re: [PATCH v5 07/18] x86/sgx: Introduce RECLAIM_IN_PROGRESS state

2023-09-25 Thread Jarkko Sakkinen

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> From: Sean Christopherson 
>
> Add RECLAIM_IN_PROGRESS state to not rely on list_empty(_page->list)
> to determine if an EPC page is selected as a reclaiming candidate.
>
> When a page is being reclaimed from the page pool (sgx_global_lru),
> there is an intermediate stage where a page may have been identified as
> a candidate for reclaiming, but has not yet been reclaimed.  Currently
> such pages are list_del_init()'d from the global LRU list, and stored in
> a an array on stack. To prevent another thread from dropping the same
> page in the middle of reclaiming, sgx_drop_epc_page() checks for
> list_empty(_page->list).
>
> A later patch will replace the array on stack with a temporary list to
> store the candidate pages, so list_empty() should no longer be used for
> this purpose.
>
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Cc: Sean Christopherson 
> ---
> V4:
> - Fixed some typos.
> - Revised commit message.
>
> V3:
> - Extend the sgx_epc_page_state enum introduced earlier to replace the
> flag based approach.
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 21 ++---
>  arch/x86/kernel/cpu/sgx/sgx.h  | 16 
>  2 files changed, 26 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index b26860399402..c1ae19a154d0 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -312,13 +312,15 @@ static void sgx_reclaim_pages(void)
>   list_del_init(_page->list);
>   encl_page = epc_page->owner;
>  
> - if (kref_get_unless_zero(_page->encl->refcount) != 0)
> + if (kref_get_unless_zero(_page->encl->refcount) != 0) {
> + sgx_epc_page_set_state(epc_page, 
> SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
>   chunk[cnt++] = epc_page;
> - else
> + } else {
>   /* The owner is freeing the page. No need to add the
>* page back to the list of reclaimable pages.
>*/
>   sgx_epc_page_reset_state(epc_page);
> + }
>   }
>   spin_unlock(_global_lru.lock);
>  
> @@ -528,16 +530,13 @@ void sgx_record_epc_page(struct sgx_epc_page *page, 
> unsigned long flags)
>  int sgx_drop_epc_page(struct sgx_epc_page *page)
>  {
>   spin_lock(_global_lru.lock);
> - if (sgx_epc_page_reclaimable(page->flags)) {
> - /* The page is being reclaimed. */
> - if (list_empty(>list)) {
> - spin_unlock(_global_lru.lock);
> - return -EBUSY;
> - }
> -
> - list_del(>list);
> - sgx_epc_page_reset_state(page);
> + if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> + spin_unlock(_global_lru.lock);
> + return -EBUSY;
>   }
> +
> + list_del(>list);
> + sgx_epc_page_reset_state(page);
>   spin_unlock(_global_lru.lock);
>  
>   return 0;
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 2faeb40b345f..764cec23f4e5 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -40,6 +40,8 @@ enum sgx_epc_page_state {
>  
>   /* Page is in use and tracked in a reclaimable LRU list
>* Becomes NOT_TRACKED after sgx_drop_epc()
> +  * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
> +  * for reclaiming
>*/
>   SGX_EPC_PAGE_RECLAIMABLE = 2,
>  
> @@ -50,6 +52,14 @@ enum sgx_epc_page_state {
>*/
>   SGX_EPC_PAGE_UNRECLAIMABLE = 3,
>  
> + /* Page is being prepared for reclamation, tracked in a temporary
> +  * isolated list by the reclaimer.
> +  * Changes in sgx_reclaim_pages() back to RECLAIMABLE if preparation
> +  * fails for any reason.
> +  * Becomes NOT_TRACKED if reclaimed successfully in sgx_reclaim_pages()
> +  * and immediately sgx_free_epc() is called to make it FREE.
> +  */
> + SGX_EPC_PAGE_RECLAIM_IN_PROGRESS = 4,
>  };
>  
>  #define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)
> @@ -73,6 +83,12 @@ static inline void sgx_epc_page_set_state(struct 
> sgx_epc_page *page, unsigned lo
>   page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
>  }
>  
> +static inline bool sgx_epc_page_reclaim_in_progress(unsigned long flags)
> +{
> + return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags &
> + SGX_EPC_PAGE_STATE_MASK);
> +}

return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags & 
SGX_EPC_PAGE_STATE_MASK);

> +
>  static inline bool sgx_epc_page_reclaimable(unsigned long flags)
>  {
>   return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
> -- 
> 2.25.1


BR,

Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states

2023-09-25 Thread Jarkko Sakkinen

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> Use the lower 3 bits in the flags field of sgx_epc_page struct to
> track EPC states in its life cycle and define an enum for possible
> states. More state(s) will be added later.
>
> Signed-off-by: Haitao Huang 
> ---
> V4:
> - No changes other than required for patch reordering.
>
> V3:
> - This is new in V3 to replace the bit mask based approach (requested by 
> Jarkko)
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 14 +++---
>  arch/x86/kernel/cpu/sgx/ioctl.c |  7 +++--
>  arch/x86/kernel/cpu/sgx/main.c  | 19 +++--
>  arch/x86/kernel/cpu/sgx/sgx.h   | 49 ++---
>  4 files changed, 71 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 97a53e34a8b4..f5afc8d65e22 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -244,8 +244,12 @@ static struct sgx_epc_page *sgx_encl_load_secs(struct 
> sgx_encl *encl)
>  {
>   struct sgx_epc_page *epc_page = encl->secs.epc_page;
>  
> - if (!epc_page)
> + if (!epc_page) {
>   epc_page = sgx_encl_eldu(>secs, NULL);
> + if (!IS_ERR(epc_page))
> + sgx_record_epc_page(epc_page,
> + SGX_EPC_PAGE_UNRECLAIMABLE);

Can be a single line probably (less than 100 characters).

> + }
>  
>   return epc_page;
>  }
> @@ -272,7 +276,7 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct 
> sgx_encl *encl,
>   return ERR_CAST(epc_page);
>  
>   encl->secs_child_cnt++;
> - sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  
>   return entry;
>  }
> @@ -398,7 +402,7 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> vm_area_struct *vma,
>   encl_page->type = SGX_PAGE_TYPE_REG;
>   encl->secs_child_cnt++;
>  
> - sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>  
>   phys_addr = sgx_get_epc_phys_addr(epc_page);
>   /*
> @@ -1256,6 +1260,8 @@ struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
>   sgx_encl_free_epc_page(epc_page);
>   return ERR_PTR(-EFAULT);
>   }
> + sgx_record_epc_page(epc_page,
> + SGX_EPC_PAGE_UNRECLAIMABLE);

There is bunch of these apparently.

>  
>   return epc_page;
>  }
> @@ -1315,7 +1321,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
>  {
>   int ret;
>  
> - WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
>  
>   ret = __eremove(sgx_get_epc_virt_addr(page));
>   if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index a75eb44022a3..9a32bf5a1070 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
> sgx_secs *secs)
>   encl->attributes = secs->attributes;
>   encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
>  
> + sgx_record_epc_page(encl->secs.epc_page,
> + SGX_EPC_PAGE_UNRECLAIMABLE);
> +
>   /* Set only after completion, as encl->lock has not been taken. */
>   set_bit(SGX_ENCL_CREATED, >flags);
>  
> @@ -322,7 +325,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, 
> unsigned long src,
>   goto err_out;
>   }
>  
> - sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + sgx_record_epc_page(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>   mutex_unlock(>lock);
>   mmap_read_unlock(current->mm);
>   return ret;
> @@ -976,7 +979,7 @@ static long sgx_enclave_modify_types(struct sgx_encl 
> *encl,
>  
>   mutex_lock(>lock);
>  
> - sgx_record_epc_page(entry->epc_page, 
> SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + sgx_record_epc_page(entry->epc_page, 
> SGX_EPC_PAGE_RECLAIMABLE);
>   }
>  
>   /* Change EPC type */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index dec1d57cbff6..b26860399402 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -318,7 +318,7 @@ static void sgx_reclaim_pages(void)
>   /* The owner is freeing the page. No need to add the
>* page back to the list of reclaimable pages.
>*/
> - epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + sgx_epc_page_reset_state(epc_page);
>   }
>   spin_unlock(_global_lru.lock);
>  
> @@ -344,6 +344,7 @@ static void sgx_reclaim_pages(void)
>  
>  skip:
>

Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events

2023-09-25 Thread Jarkko Sakkinen

On Sat Sep 23, 2023 at 6:06 AM EEST, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed, or in event of user writing the max value to
> the misc.max file to set the usage limit of a specific resource
> [admin-guide/cgroup-v2.rst, 5-9. Misc].
>
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver:
> - On css_alloc, allocate and initialize necessary structures for EPC
> reclaiming, e.g., LRU list, work queue, etc.
> - On css_free, cleanup and free those structures created in alloc.
> - On max_write, trigger EPC reclaiming if the new limit is at or below
> current usage.
>
> Signed-off-by: Kristen Carlson Accardi 
> Signed-off-by: Haitao Huang 
> ---
> V5:
> - Remove prefixes from the callback names (tj)
> - Update commit message (Jarkko)
>
> V4:
> - Moved this to the front of the series.
> - Applies on cgroup/for-6.6 with the overflow fix for misc.
>
> V3:
> - Removed the released() callback
> ---
>  include/linux/misc_cgroup.h |  5 +
>  kernel/cgroup/misc.c| 32 +---
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index e799b1f8d05b..96a88822815a 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -37,6 +37,11 @@ struct misc_res {
>   u64 max;
>   atomic64_t usage;
>   atomic64_t events;
> +
> + /* per resource callback ops */
> + int (*alloc)(struct misc_cg *cg);
> + void (*free)(struct misc_cg *cg);
> + void (*max_write)(struct misc_cg *cg);
>  };
>  
>  /**
> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> index 79a3717a5803..62c9198dee21 100644
> --- a/kernel/cgroup/misc.c
> +++ b/kernel/cgroup/misc.c
> @@ -276,10 +276,13 @@ static ssize_t misc_cg_max_write(struct 
> kernfs_open_file *of, char *buf,
>  
>   cg = css_misc(of_css(of));
>  
> - if (READ_ONCE(misc_res_capacity[type]))
> + if (READ_ONCE(misc_res_capacity[type])) {
>   WRITE_ONCE(cg->res[type].max, max);
> - else
> + if (cg->res[type].max_write)
> + cg->res[type].max_write(cg);
> + } else {
>   ret = -EINVAL;
> + }
>  
>   return ret ? ret : nbytes;
>  }
> @@ -383,23 +386,39 @@ static struct cftype misc_cg_files[] = {
>  static struct cgroup_subsys_state *
>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>  {
> + struct misc_cg *parent_cg;
>   enum misc_res_type i;
>   struct misc_cg *cg;
> + int ret;
>  
>   if (!parent_css) {
>   cg = _cg;
> + parent_cg = _cg;
>   } else {
>   cg = kzalloc(sizeof(*cg), GFP_KERNEL);
>   if (!cg)
>   return ERR_PTR(-ENOMEM);
> + parent_cg = css_misc(parent_css);
>   }
>  
>   for (i = 0; i < MISC_CG_RES_TYPES; i++) {
>   WRITE_ONCE(cg->res[i].max, MAX_NUM);
>   atomic64_set(>res[i].usage, 0);
> + if (parent_cg->res[i].alloc) {
> + ret = parent_cg->res[i].alloc(cg);
> + if (ret)
> + goto alloc_err;
> + }
>   }
>  
>   return >css;
> +
> +alloc_err:
> + for (i = 0; i < MISC_CG_RES_TYPES; i++)
> + if (parent_cg->res[i].free)
> + cg->res[i].free(cg);
> + kfree(cg);
> + return ERR_PTR(ret);
>  }
>  
>  /**
> @@ -410,7 +429,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>   */
>  static void misc_cg_free(struct cgroup_subsys_state *css)
>  {
> - kfree(css_misc(css));
> + struct misc_cg *cg = css_misc(css);
> + enum misc_res_type i;
> +
> + for (i = 0; i < MISC_CG_RES_TYPES; i++)
> + if (cg->res[i].free)
> + cg->res[i].free(cg);
> +
> + kfree(cg);
>  }
>  
>  /* Cgroup controller callbacks */
> -- 
> 2.25.1

Since the only existing client feature requires all callbacks, should
this not have that as an invariant?

I.e. it might be better to fail unless *all* ops are non-nil (e.g. to
catch issues in the kernel code).

BR, Jarkko

Re: [PATCH v4 01/18] cgroup/misc: Add per resource callbacks for CSS events

2023-09-25 Thread Jarkko Sakkinen

On Sat Sep 16, 2023 at 7:11 AM EEST, Haitao Huang wrote:
> Hi Jarkko
>
> On Wed, 13 Sep 2023 04:39:06 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Wed Sep 13, 2023 at 7:06 AM EEST, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi 
> >>
> >> Consumers of the misc cgroup controller might need to perform separate
> >> actions for Cgroups Subsystem State(CSS) events: cgroup alloc and free.
> >
> > nit: s/State(CSS)/State (CSS)/
> >
> > "cgroup alloc" and "cgroup free" mean absolutely nothing.
> >
> >
> >> In addition, writes to the max value may also need separate action. Add
> >
> > What "the max value"?
> >
> >> the ability to allow downstream users to setup callbacks for these
> >> operations, and call the corresponding per-resource-type callback when
> >> appropriate.
> >
> > Who are "the downstream users" and what sort of callbacks they setup?
>
> How about this?
>
> The misc cgroup controller (subsystem) currently does not perform resource  
> type specific action for Cgroups Subsystem State (CSS) events: the  
> 'css_alloc' event when a cgroup is created and the 'css_free' event when a  
> cgroup is destroyed, or in event of user writing the max value to the  
> misc.max file to set the consumption limit of a specific resource  
> [admin-guide/cgroup-v2.rst, 5-9. Misc].
>
> Define callbacks for those events and allow resource providers to register  
> the callbacks per resource type as needed. This will be utilized later by  
> the EPC misc cgroup support implemented in the SGX driver:
> - On cgroup alloc, allocate and initialize necessary structures for EPC  
> reclaiming, e.g., LRU list, work queue, etc.
> - On cgroup free, cleanup and free those structures created in alloc.
> - On max write, trigger EPC reclaiming if the new limit is at or below  
> current consumption.

Yeah, this is much better (I was on holiday, thus the delay on
response).

> Thanks
> Haitao

BR, Jarkko

Re: [PATCH v6 06/13] selftests/sgx: Remove redundant enclave base address save/restore

2023-09-25 Thread Jarkko Sakkinen

On Thu Sep 21, 2023 at 5:35 PM EEST, Jo Van Bulck wrote:
> Remove redundant push/pop pair that stores and restores the enclave base
> address in the test enclave, as it is never used after the pop and can
> anyway be easily retrieved via the __encl_base symbol.
>
> Signed-off-by: Jo Van Bulck 
> Acked-by: Kai Huang 
> ---
>  tools/testing/selftests/sgx/test_encl_bootstrap.S | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/tools/testing/selftests/sgx/test_encl_bootstrap.S 
> b/tools/testing/selftests/sgx/test_encl_bootstrap.S
> index 03ae0f57e29d..e0ce993d3f2c 100644
> --- a/tools/testing/selftests/sgx/test_encl_bootstrap.S
> +++ b/tools/testing/selftests/sgx/test_encl_bootstrap.S
> @@ -55,12 +55,9 @@ encl_entry_core:
>   push%rax
>  
>   push%rcx # push the address after EENTER
> - push%rbx # push the enclave base address
>  
>   callencl_body
>  
> - pop %rbx # pop the enclave base address
> -
>   /* Clear volatile GPRs, except RAX (EEXIT function). */
>   xor %rcx, %rcx
>   xor %rdx, %rdx
> -- 
> 2.25.1

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko

Re: [PATCH v4 3/3] remoteproc: zynqmp: get TCM from device-tree

2023-09-25 Thread Tanmay Shah



On 9/7/23 1:08 PM, Mathieu Poirier wrote:
> On Wed, Sep 06, 2023 at 05:02:40PM -0500, Tanmay Shah wrote:
> > HI Mathieu,
> > 
> > Thanks for reviews. Please find my comments below.
> >
>
> I took another look after reading your comment and found more problems...
>
> > 
> > On 9/6/23 2:47 PM, Mathieu Poirier wrote:
> > > Hi Tanmay,
> > >
> > > On Tue, Aug 29, 2023 at 11:19:00AM -0700, Tanmay Shah wrote:
> > > > Use new dt bindings to get TCM address and size
> > > > information. Also make sure that driver stays
> > > > compatible with previous device-tree bindings.
> > > > So, if TCM information isn't available in device-tree
> > > > for zynqmp platform, hard-coded address of TCM will
> > > > be used.
> > > > 
> > > > New platforms that are compatible with this
> > > > driver must add TCM support in device-tree as per new
> > > > bindings.
> > > > 
> > > > Signed-off-by: Tanmay Shah 
> > > > ---
> > > >  drivers/remoteproc/xlnx_r5_remoteproc.c | 279 +++-
> > > >  1 file changed, 221 insertions(+), 58 deletions(-)
> > > > 
> > > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > > > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > index feca6de68da2..4eb62eb545c2 100644
> > > > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > > > @@ -39,15 +39,19 @@ enum zynqmp_r5_cluster_mode {
> > > >   * struct mem_bank_data - Memory Bank description
> > > >   *
> > > >   * @addr: Start address of memory bank
> > > > + * @da: device address for this tcm bank
> > > >   * @size: Size of Memory bank
> > > >   * @pm_domain_id: Power-domains id of memory bank for firmware to turn 
> > > > on/off
> > > > + * @pm_domain_id2: second core's corresponding TCM's pm_domain_id
> > > >   * @bank_name: name of the bank for remoteproc framework
> > > >   */
> > > >  struct mem_bank_data {
> > > > -   phys_addr_t addr;
> > > > -   size_t size;
> > > > +   u32 addr;
> > > > +   u32 da;
> > > > +   u32 size;
> > >
> > > Why are the types of @addr and @size changed?
> > 
> > So, R5 can access 32-bit address range only. Before I had missed this.
> > 
> > In Devce-tree bindings I am keeping address-cells and size-cells as 2.
> > 
> > So, out of 64-bits only 32-bits will be used to get address of TCM. Same 
> > for size.
> > 
> > This motivated me to change the type of @addr and @size fields. It doesn't 
> > have any side effects.
>
> It doesn't have an effect but it also doesn't need to be in this patch,
> especially since it is not documented. 
>
>
> This patch needs to be broken in 3 parts:
>
> 1) One patch that deals with the addition of the static mem_bank_data for
> lockstep mode.
>
> 2) One patch that deals with the addition of ->pm_domain_id2 and the potential
> bug I may have highlighted below.

Hi Mathieu,

Just heads up. There is change in this plan. I found out that pm domain 
framework can be used to power-on/off devices

with pm domains in device-tree. So, I am developing patches accordingly.

I will still split patches but it won't be same as what was posted here. There 
will be patch that is using

pm domain (genpd) framework to power-on/off TCM.

Tanmay

> 3) One patch that deals with extracting the TCM information from the DT.
> Everything else needs to be in another patch.
>
> > 
> > 
> > >
> > > > u32 pm_domain_id;
> > > > -   char *bank_name;
> > > > +   u32 pm_domain_id2;
> > > > +   char bank_name[32];
> > >
> > > Same
> > 
> > Now we have "reg-names" property in dts so, when that is available, I try 
> > to use it.
> > 
> > So, instead of keeping simple pointer, I copy name from "struct resources". 
> > So, I changed bank_name
> > 
> > from pointer to array.
> >
>
> I'll look at that part again when the rest of may comments are addressed.
>
> > 
> > >
> > > >  };
> > > >  
> > > >  /**
> > > > @@ -75,11 +79,17 @@ struct mbox_info {
> > > >   * Hardcoded TCM bank values. This will be removed once TCM bindings 
> > > > are
> > > >   * accepted for system-dt specifications and upstreamed in linux kernel
> > > >   */
> > > > -static const struct mem_bank_data zynqmp_tcm_banks[] = {
> > > > -   {0xffe0UL, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB 
> > > > each */
> > > > -   {0xffe2UL, 0x1UL, PD_R5_0_BTCM, "btcm0"},
> > > > -   {0xffe9UL, 0x1UL, PD_R5_1_ATCM, "atcm1"},
> > > > -   {0xffebUL, 0x1UL, PD_R5_1_BTCM, "btcm1"},
> > > > +static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
> > > > +   {0xffe0, 0x0, 0x1, PD_R5_0_ATCM, 0, "atcm0"}, /* TCM 
> > > > 64KB each */
> > > > +   {0xffe2, 0x2, 0x1, PD_R5_0_BTCM, 0, "btcm0"},
> > >
> > > Here the device address for btcm0 is 0x2 while in the cover letter it 
> > > is
> > > 0x2000.
> > 
> > Thanks for catching this. This is actually typo in cover-letter. It should 
> > be 0x2 in cover-letter.
> > 
> > >
> > > > +   {0xffe9, 0x0, 0x1, PD_R5_1_ATCM,

Re: [PATCH v2] nd_btt: Make BTT lanes preemptible

2023-09-25 Thread Ira Weiny

Tomas Glozar wrote:
> nd_region_acquire_lane uses get_cpu, which disables preemption. This is
> an issue on PREEMPT_RT kernels, since btt_write_pg and also
> nd_region_acquire_lane itself take a spin lock, resulting in BUG:
> sleeping function called from invalid context.
> 
> Fix the issue by replacing get_cpu with smp_process_id and
> migrate_disable when needed. This makes BTT operations preemptible, thus
> permitting the use of spin_lock.
> 
> BUG example occurring when running ndctl tests on PREEMPT_RT kernel:
> 
> BUG: sleeping function called from invalid context at
> kernel/locking/spinlock_rt.c:48
> in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 4903, name:
> libndctl
> preempt_count: 1, expected: 0
> RCU nest depth: 0, expected: 0
> Preemption disabled at:
> [] nd_region_acquire_lane+0x15/0x90 [libnvdimm]
> Call Trace:
>  
>  dump_stack_lvl+0x8e/0xb0
>  __might_resched+0x19b/0x250
>  rt_spin_lock+0x4c/0x100
>  ? btt_write_pg+0x2d7/0x500 [nd_btt]
>  btt_write_pg+0x2d7/0x500 [nd_btt]
>  ? local_clock_noinstr+0x9/0xc0
>  btt_submit_bio+0x16d/0x270 [nd_btt]
>  __submit_bio+0x48/0x80
>  __submit_bio_noacct+0x7e/0x1e0
>  submit_bio_wait+0x58/0xb0
>  __blkdev_direct_IO_simple+0x107/0x240
>  ? inode_set_ctime_current+0x51/0x110
>  ? __pfx_submit_bio_wait_endio+0x10/0x10
>  blkdev_write_iter+0x1d8/0x290
>  vfs_write+0x237/0x330
>  ...
>  
> 
> Fixes: 5212e11fde4d ("nd_btt: atomic sector updates")
> Signed-off-by: Tomas Glozar 

Thanks for the clarification.

Reviewed-by: Ira Weiny

Re: [PATCH v1 9/9] ACPI: NFIT: Don't use KBUILD_MODNAME for driver name

2023-09-25 Thread Dan Williams

Andy Shevchenko wrote:
> On Mon, Sep 25, 2023 at 05:48:42PM +0300, Michal Wilczynski wrote:
> > Driver name is part of the ABI, so it should be hard-coded, as ABI
> > should be always kept backward compatible. Prevent ABI from changing
> > accidentally in case KBUILD_MODNAME change.
> 
> This is up to maintainers, probably we won't have any users outside of 
> existing
> model (instantiating via ACPI ID). All the above is "strictly speaking"...

...right, more than 8 years for this "risk" to materialize indicates
it's a non-issue to me.

Re: [PATCH] tracing/user_events: align uaddr on unsigned long alignment

2023-09-25 Thread Beau Belgrave

On Mon, Sep 25, 2023 at 09:53:16AM +0200, Clément Léger wrote:
> 
> 
> On 22/09/2023 21:22, Beau Belgrave wrote:
> > On Tue, Sep 19, 2023 at 02:59:12PM +0200, Clément Léger wrote:
> >>
> >>
> >> On 14/09/2023 19:29, Steven Rostedt wrote:
> >>> On Thu, 14 Sep 2023 13:17:00 -0400
> >>> Steven Rostedt  wrote:
> >>>
>  Now lets look at big endian layout:
> 
>   uaddr = 0xbeef0004
>   enabler = 1;
> 
>   memory at 0xbeef:  00 00 00 00 00 00 00 02
>  ^
>  addr: 0xbeef0004
> 
>   (enabler is set )
> 
>   bitoffset = uaddr & (sizeof(unsigned long) - 1); bitoffset = 4
>   bit_offset *= 8; bitoffset = 32
>   uaddr &= ~(sizeof(unsigned long) - 1);   uaddr = 0xbeef
> 
>   ptr = kaddr + (uaddr & ~PAGE_MASK);
> 
>   clear_bit(1 + 32, ptr);
> 
>   memory at 0xbeef:  00 00 00 00 00 00 00 02
>    ^
>   bit 33 of 0xbeef
> 
>  I don't think that's what you expected!
> >>>
> >>> I believe the above can be fixed with:
> >>>
> >>>   bit_offset = uaddr & (sizeof(unsigned long) - 1);
> >>>   if (bit_offset) {
> >>> #ifdef CONFIG_CPU_BIG_ENDIAN
> >>>   bit_offest = 0;
> >>> #else
> >>>   bit_offset *= BITS_PER_BYTE;
> >>> #endif
> >>>   uaddr &= ~(sizeof(unsigned long) - 1);
> >>>   }
> >>>
> >>> -- Steve
> >>
> >>
> >> Actually, after looking more in depth at that, it seems like there are
> >> actually 2 problems that can happen.
> >>
> >> First one is atomic access misalignment due to enable_size == 4 and
> >> uaddr not being aligned on a (long) boundary on 64 bits architecture.
> >> This can generate misaligned exceptions on various architectures. This
> >> can be fixed in a more general way according to Masami snippet.
> >>
> >> Second one that I can see is on 64 bits, big endian architectures with
> >> enable_size == 4. In that case, the bit provided by the userspace won't
> >> be correctly set since this code kind of assume that the atomic are done
> >> on 32bits value. Since the kernel assume long sized atomic operation, on
> >> big endian 64 bits architecture, the updated bit will actually be in the
> >> next 32 bits word.
> >>
> >> Can someone confirm my understanding ?
> >>
> > 
> > I have a ppc 64bit BE VM I've been validating this on. If we do the
> > shifting within user_events (vs a generic set_bit_aligned approach)
> > 64bit BE does not need additional bit manipulation. However, if we were
> > to blindly pass the address and bit as is to set_bit_aligned() it
> > assumes the bit number is for a long, not a 32 bit word. So for that
> > approach we would need to offset the bit in the unaligned case.
> > 
> > Here's a patch I have that I've validated on ppc64 BE, aarch64 LE, and
> > x86_64 LE. I personally feel more comfortable with this approach than
> > the generic set_bit_aligned() one.
> > 
> > Thanks,
> > -Beau
> > 
> > diff --git a/kernel/trace/trace_events_user.c 
> > b/kernel/trace/trace_events_user.c
> > index e3f2b8d72e01..ae854374d0b7 100644
> > --- a/kernel/trace/trace_events_user.c
> > +++ b/kernel/trace/trace_events_user.c
> > @@ -162,6 +162,23 @@ struct user_event_validator {
> > int flags;
> >  };
> >  
> > +static inline void align_addr_bit(unsigned long *addr, int *bit)
> > +{
> > +   if (IS_ALIGNED(*addr, sizeof(long)))
> > +   return;
> > +
> > +   *addr = ALIGN_DOWN(*addr, sizeof(long));
> > +
> > +   /*
> > +* We only support 32 and 64 bit values. The only time we need
> > +* to align is a 32 bit value on a 64 bit kernel, which on LE
> > +* is always 32 bits, and on BE requires no change.
> > +*/
> > +#ifdef __LITTLE_ENDIAN
> > +   *bit += 32;
> > +#endif
> 
> Hi Beau, except the specific alignment that is basically what I ended up
> with for the BE 64bit case (ie just bit += 32). Regarding the generic
> alignment, depends on what the maintainers wishes (generic of user_event
> specific). I also feel like this shoulmd be handle specifically for
> user_events which uses set_bit in some non standard way. Any suggestion ?
> 

Looking at this deeper, user_events needs to track some of this
alignment requirements within each enabler. This is needed because when
enablers are disabled/removed they do not have the actual size. The size
is needed to know if we need to update the bits, etc. (IE: If originally
a 32-bit value was used and it's aligned and it's on BE, it needs the
bits shifted.)

My final version that I played around with looked like this for just the
alignment requirements:
+static inline void align_addr_bit(unsigned long *addr, int *bit,
+ unsigned long *flags)
+{
+   if (IS_ALIGNED(*addr, sizeof(long))) {
+#ifdef __BIG_ENDIAN
+   if (test_bit(ENABLE_VAL_32_BIT, flags))
+

Re: [PATCH v1 9/9] ACPI: NFIT: Don't use KBUILD_MODNAME for driver name

2023-09-25 Thread Andy Shevchenko

On Mon, Sep 25, 2023 at 05:48:42PM +0300, Michal Wilczynski wrote:
> Driver name is part of the ABI, so it should be hard-coded, as ABI
> should be always kept backward compatible. Prevent ABI from changing
> accidentally in case KBUILD_MODNAME change.

This is up to maintainers, probably we won't have any users outside of existing
model (instantiating via ACPI ID). All the above is "strictly speaking"...

-- 
With Best Regards,
Andy Shevchenko

[PATCH v1 9/9] ACPI: NFIT: Don't use KBUILD_MODNAME for driver name

2023-09-25 Thread Michal Wilczynski

Driver name is part of the ABI, so it should be hard-coded, as ABI
should be always kept backward compatible. Prevent ABI from changing
accidentally in case KBUILD_MODNAME change.

Suggested-by: Andy Shevchenko 
Signed-off-by: Michal Wilczynski 
---
 drivers/acpi/nfit/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index f09530d2520a..987eb5567036 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -3478,7 +3478,7 @@ MODULE_DEVICE_TABLE(acpi, acpi_nfit_ids);
 static struct platform_driver acpi_nfit_driver = {
.probe = acpi_nfit_probe,
.driver = {
-   .name = KBUILD_MODNAME,
+   .name = "nfit",
.acpi_match_table = acpi_nfit_ids,
},
 };
-- 
2.41.0

[PATCH v1 8/9] ACPI: NFIT: Remove redundant call to to_acpi_dev()

2023-09-25 Thread Michal Wilczynski

In acpi_nfit_ctl() ACPI handle is extracted using to_acpi_dev()
function and accessing handle field in ACPI device. After transformation
from ACPI driver to platform driver this is not optimal anymore. To get
a handle it's enough to just use ACPI_HANDLE() macro to extract the
handle.

Suggested-by: Andy Shevchenko 
Signed-off-by: Michal Wilczynski 
---
 drivers/acpi/nfit/core.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 2e50b1334a69..f09530d2520a 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -475,8 +475,6 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
struct nvdimm *nvdimm,
guid = to_nfit_uuid(nfit_mem->family);
handle = adev->handle;
} else {
-   struct acpi_device *adev = to_acpi_dev(acpi_desc);
-
cmd_name = nvdimm_bus_cmd_name(cmd);
cmd_mask = nd_desc->cmd_mask;
if (cmd == ND_CMD_CALL && call_pkg->nd_family) {
@@ -493,7 +491,7 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
struct nvdimm *nvdimm,
guid = to_nfit_uuid(NFIT_DEV_BUS);
}
desc = nd_cmd_bus_desc(cmd);
-   handle = adev->handle;
+   handle = ACPI_HANDLE(dev);
dimm_name = "bus";
}
 
-- 
2.41.0

[PATCH v1 7/9] ACPI: NFIT: Replace acpi_driver with platform_driver

2023-09-25 Thread Michal Wilczynski

NFIT driver uses struct acpi_driver incorrectly to register itself.
This is wrong as the instances of the ACPI devices are not meant
to be literal devices, they're supposed to describe ACPI entry of a
particular device.

Use platform_driver instead of acpi_driver. In relevant places call
platform devices instances pdev to make a distinction with ACPI
devices instances.

NFIT driver uses devm_*() family of functions extensively. This change
has no impact on correct functioning of the whole devm_*() family of
functions, since the lifecycle of the device stays the same. It is still
being created during the enumeration, and destroyed on platform device
removal.

Suggested-by: Rafael J. Wysocki 
Signed-off-by: Michal Wilczynski 
---
 drivers/acpi/nfit/core.c | 34 ++
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 821870f57862..2e50b1334a69 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "intel.h"
@@ -98,7 +99,7 @@ static struct acpi_device *to_acpi_dev(struct acpi_nfit_desc 
*acpi_desc)
|| strcmp(nd_desc->provider_name, "ACPI.NFIT") != 0)
return NULL;
 
-   return to_acpi_device(acpi_desc->dev);
+   return ACPI_COMPANION(acpi_desc->dev);
 }
 
 static int xlat_bus_status(void *buf, unsigned int cmd, u32 status)
@@ -3284,11 +3285,11 @@ static void acpi_nfit_put_table(void *table)
 
 static void acpi_nfit_notify(acpi_handle handle, u32 event, void *data)
 {
-   struct acpi_device *adev = data;
+   struct device *dev = data;
 
-   device_lock(>dev);
-   __acpi_nfit_notify(>dev, handle, event);
-   device_unlock(>dev);
+   device_lock(dev);
+   __acpi_nfit_notify(dev, handle, event);
+   device_unlock(dev);
 }
 
 static void acpi_nfit_remove_notify_handler(void *data)
@@ -3329,11 +3330,12 @@ void acpi_nfit_shutdown(void *data)
 }
 EXPORT_SYMBOL_GPL(acpi_nfit_shutdown);
 
-static int acpi_nfit_add(struct acpi_device *adev)
+static int acpi_nfit_probe(struct platform_device *pdev)
 {
struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL };
struct acpi_nfit_desc *acpi_desc;
-   struct device *dev = >dev;
+   struct device *dev = >dev;
+   struct acpi_device *adev = ACPI_COMPANION(dev);
struct acpi_table_header *tbl;
acpi_status status = AE_OK;
acpi_size sz;
@@ -3360,7 +3362,7 @@ static int acpi_nfit_add(struct acpi_device *adev)
acpi_desc = devm_kzalloc(dev, sizeof(*acpi_desc), GFP_KERNEL);
if (!acpi_desc)
return -ENOMEM;
-   acpi_nfit_desc_init(acpi_desc, >dev);
+   acpi_nfit_desc_init(acpi_desc, dev);
 
/* Save the acpi header for exporting the revision via sysfs */
acpi_desc->acpi_header = *tbl;
@@ -3391,7 +3393,7 @@ static int acpi_nfit_add(struct acpi_device *adev)
return rc;
 
rc = acpi_dev_install_notify_handler(adev->handle, ACPI_DEVICE_NOTIFY,
-acpi_nfit_notify, adev);
+acpi_nfit_notify, dev);
if (rc)
return rc;
 
@@ -3475,11 +3477,11 @@ static const struct acpi_device_id acpi_nfit_ids[] = {
 };
 MODULE_DEVICE_TABLE(acpi, acpi_nfit_ids);
 
-static struct acpi_driver acpi_nfit_driver = {
-   .name = KBUILD_MODNAME,
-   .ids = acpi_nfit_ids,
-   .ops = {
-   .add = acpi_nfit_add,
+static struct platform_driver acpi_nfit_driver = {
+   .probe = acpi_nfit_probe,
+   .driver = {
+   .name = KBUILD_MODNAME,
+   .acpi_match_table = acpi_nfit_ids,
},
 };
 
@@ -3517,7 +3519,7 @@ static __init int nfit_init(void)
return -ENOMEM;
 
nfit_mce_register();
-   ret = acpi_bus_register_driver(_nfit_driver);
+   ret = platform_driver_register(_nfit_driver);
if (ret) {
nfit_mce_unregister();
destroy_workqueue(nfit_wq);
@@ -3530,7 +3532,7 @@ static __init int nfit_init(void)
 static __exit void nfit_exit(void)
 {
nfit_mce_unregister();
-   acpi_bus_unregister_driver(_nfit_driver);
+   platform_driver_unregister(_nfit_driver);
destroy_workqueue(nfit_wq);
WARN_ON(!list_empty(_descs));
 }
-- 
2.41.0

[PATCH v1 6/9] ACPI: AC: Rename ACPI device from device to adev

2023-09-25 Thread Michal Wilczynski

Since transformation from ACPI driver to platform driver there are two
devices on which the driver operates - ACPI device and platform device.
For the sake of reader this calls for the distinction in their naming,
to avoid confusion. Rename device to adev, as corresponding
platform device is called pdev.

Signed-off-by: Michal Wilczynski 
---
 drivers/acpi/ac.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
index 78e53d0fdece..270a6919ec12 100644
--- a/drivers/acpi/ac.c
+++ b/drivers/acpi/ac.c
@@ -120,7 +120,7 @@ static enum power_supply_property ac_props[] = {
 static void acpi_ac_notify(acpi_handle handle, u32 event, void *data)
 {
struct acpi_ac *ac = data;
-   struct acpi_device *device = ACPI_COMPANION(ac->dev);
+   struct acpi_device *adev = ACPI_COMPANION(ac->dev);
 
switch (event) {
default:
@@ -141,11 +141,11 @@ static void acpi_ac_notify(acpi_handle handle, u32 event, 
void *data)
msleep(ac_sleep_before_get_state_ms);
 
acpi_ac_get_state(ac);
-   acpi_bus_generate_netlink_event(device->pnp.device_class,
+   acpi_bus_generate_netlink_event(adev->pnp.device_class,
dev_name(ac->dev),
event,
ac->state);
-   acpi_notifier_call_chain(device, event, ac->state);
+   acpi_notifier_call_chain(adev, event, ac->state);
kobject_uevent(>charger->dev.kobj, KOBJ_CHANGE);
}
 }
@@ -204,7 +204,7 @@ static const struct dmi_system_id ac_dmi_table[]  
__initconst = {
 
 static int acpi_ac_probe(struct platform_device *pdev)
 {
-   struct acpi_device *device = ACPI_COMPANION(>dev);
+   struct acpi_device *adev = ACPI_COMPANION(>dev);
struct power_supply_config psy_cfg = {};
struct acpi_ac *ac;
int result;
@@ -214,8 +214,8 @@ static int acpi_ac_probe(struct platform_device *pdev)
return -ENOMEM;
 
ac->dev = >dev;
-   strcpy(acpi_device_name(device), ACPI_AC_DEVICE_NAME);
-   strcpy(acpi_device_class(device), ACPI_AC_CLASS);
+   strcpy(acpi_device_name(adev), ACPI_AC_DEVICE_NAME);
+   strcpy(acpi_device_class(adev), ACPI_AC_CLASS);
 
platform_set_drvdata(pdev, ac);
 
@@ -225,7 +225,7 @@ static int acpi_ac_probe(struct platform_device *pdev)
 
psy_cfg.drv_data = ac;
 
-   ac->charger_desc.name = acpi_device_bid(device);
+   ac->charger_desc.name = acpi_device_bid(adev);
ac->charger_desc.type = POWER_SUPPLY_TYPE_MAINS;
ac->charger_desc.properties = ac_props;
ac->charger_desc.num_properties = ARRAY_SIZE(ac_props);
@@ -237,13 +237,13 @@ static int acpi_ac_probe(struct platform_device *pdev)
goto err_release_ac;
}
 
-   pr_info("%s [%s] (%s-line)\n", acpi_device_name(device),
-   acpi_device_bid(device), str_on_off(ac->state));
+   pr_info("%s [%s] (%s-line)\n", acpi_device_name(adev),
+   acpi_device_bid(adev), str_on_off(ac->state));
 
ac->battery_nb.notifier_call = acpi_ac_battery_notify;
register_acpi_notifier(>battery_nb);
 
-   result = acpi_dev_install_notify_handler(device->handle, 
ACPI_ALL_NOTIFY,
+   result = acpi_dev_install_notify_handler(adev->handle, ACPI_ALL_NOTIFY,
 acpi_ac_notify, ac);
if (result)
goto err_unregister;
-- 
2.41.0

[PATCH v1 5/9] ACPI: AC: Replace acpi_driver with platform_driver

2023-09-25 Thread Michal Wilczynski

AC driver uses struct acpi_driver incorrectly to register itself. This
is wrong as the instances of the ACPI devices are not meant to
be literal devices, they're supposed to describe ACPI entry of a
particular device.

Use platform_driver instead of acpi_driver. In relevant places call
platform devices instances pdev to make a distinction with ACPI
devices instances.

Drop unnecessary casts from acpi_bus_generate_netlink_event() and
acpi_notifier_call_chain().

Add a blank line to distinguish pdev API vs local ACPI notify function.

Suggested-by: Rafael J. Wysocki 
Signed-off-by: Michal Wilczynski 
---
 drivers/acpi/ac.c | 70 +--
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
index fd8b392c19f4..78e53d0fdece 100644
--- a/drivers/acpi/ac.c
+++ b/drivers/acpi/ac.c
@@ -33,8 +33,9 @@ MODULE_AUTHOR("Paul Diefenbaugh");
 MODULE_DESCRIPTION("ACPI AC Adapter Driver");
 MODULE_LICENSE("GPL");
 
-static int acpi_ac_add(struct acpi_device *device);
-static void acpi_ac_remove(struct acpi_device *device);
+static int acpi_ac_probe(struct platform_device *pdev);
+static void acpi_ac_remove(struct platform_device *pdev);
+
 static void acpi_ac_notify(acpi_handle handle, u32 event, void *data);
 
 static const struct acpi_device_id ac_device_ids[] = {
@@ -51,21 +52,10 @@ static SIMPLE_DEV_PM_OPS(acpi_ac_pm, NULL, acpi_ac_resume);
 static int ac_sleep_before_get_state_ms;
 static int ac_only;
 
-static struct acpi_driver acpi_ac_driver = {
-   .name = "ac",
-   .class = ACPI_AC_CLASS,
-   .ids = ac_device_ids,
-   .ops = {
-   .add = acpi_ac_add,
-   .remove = acpi_ac_remove,
-   },
-   .drv.pm = _ac_pm,
-};
-
 struct acpi_ac {
struct power_supply *charger;
struct power_supply_desc charger_desc;
-   struct acpi_device *device;
+   struct device *dev;
unsigned long long state;
struct notifier_block battery_nb;
 };
@@ -85,10 +75,10 @@ static int acpi_ac_get_state(struct acpi_ac *ac)
return 0;
}
 
-   status = acpi_evaluate_integer(ac->device->handle, "_PSR", NULL,
+   status = acpi_evaluate_integer(ACPI_HANDLE(ac->dev), "_PSR", NULL,
   >state);
if (ACPI_FAILURE(status)) {
-   acpi_handle_info(ac->device->handle,
+   acpi_handle_info(ACPI_HANDLE(ac->dev),
"Error reading AC Adapter state: %s\n",
acpi_format_exception(status));
ac->state = ACPI_AC_STATUS_UNKNOWN;
@@ -129,12 +119,12 @@ static enum power_supply_property ac_props[] = {
 /* Driver Model */
 static void acpi_ac_notify(acpi_handle handle, u32 event, void *data)
 {
-   struct acpi_device *device = data;
-   struct acpi_ac *ac = acpi_driver_data(device);
+   struct acpi_ac *ac = data;
+   struct acpi_device *device = ACPI_COMPANION(ac->dev);
 
switch (event) {
default:
-   acpi_handle_debug(device->handle, "Unsupported event [0x%x]\n",
+   acpi_handle_debug(ACPI_HANDLE(ac->dev), "Unsupported event 
[0x%x]\n",
  event);
fallthrough;
case ACPI_AC_NOTIFY_STATUS:
@@ -152,9 +142,10 @@ static void acpi_ac_notify(acpi_handle handle, u32 event, 
void *data)
 
acpi_ac_get_state(ac);
acpi_bus_generate_netlink_event(device->pnp.device_class,
- dev_name(>dev), event,
- (u32) ac->state);
-   acpi_notifier_call_chain(device, event, (u32) ac->state);
+   dev_name(ac->dev),
+   event,
+   ac->state);
+   acpi_notifier_call_chain(device, event, ac->state);
kobject_uevent(>charger->dev.kobj, KOBJ_CHANGE);
}
 }
@@ -211,8 +202,9 @@ static const struct dmi_system_id ac_dmi_table[]  
__initconst = {
{},
 };
 
-static int acpi_ac_add(struct acpi_device *device)
+static int acpi_ac_probe(struct platform_device *pdev)
 {
+   struct acpi_device *device = ACPI_COMPANION(>dev);
struct power_supply_config psy_cfg = {};
struct acpi_ac *ac;
int result;
@@ -221,10 +213,11 @@ static int acpi_ac_add(struct acpi_device *device)
if (!ac)
return -ENOMEM;
 
-   ac->device = device;
+   ac->dev = >dev;
strcpy(acpi_device_name(device), ACPI_AC_DEVICE_NAME);
strcpy(acpi_device_class(device), ACPI_AC_CLASS);
-   device->driver_data = ac;
+
+   platform_set_drvdata(pdev, ac);
 
result = acpi_ac_get_state(ac);
if (result)
@@ -237,7 +230,7 @@ static int acpi_ac_add(struct acpi_device *device)

[PATCH v1 4/9] ACPI: AC: Use string_choices API instead of ternary operator

2023-09-25 Thread Michal Wilczynski

Use modern string_choices API instead of manually determining the
output using ternary operator.

Suggested-by: Andy Shevchenko 
Reviewed-by: Andy Shevchenko 
Signed-off-by: Michal Wilczynski 
---
 drivers/acpi/ac.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
index dd04809a787c..fd8b392c19f4 100644
--- a/drivers/acpi/ac.c
+++ b/drivers/acpi/ac.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -243,8 +244,8 @@ static int acpi_ac_add(struct acpi_device *device)
goto err_release_ac;
}
 
-   pr_info("%s [%s] (%s)\n", acpi_device_name(device),
-   acpi_device_bid(device), ac->state ? "on-line" : "off-line");
+   pr_info("%s [%s] (%s-line)\n", acpi_device_name(device),
+   acpi_device_bid(device), str_on_off(ac->state));
 
ac->battery_nb.notifier_call = acpi_ac_battery_notify;
register_acpi_notifier(>battery_nb);
-- 
2.41.0

[PATCH v1 1/9] ACPI: bus: Make notify wrappers more generic

2023-09-25 Thread Michal Wilczynski

acpi_dev_install_notify_handler() and acpi_dev_remove_notify_handler()
are wrappers around ACPICA installers. They are meant to save some
duplicated code from drivers. However as we're moving towards drivers
operating on platform_device they become a bit inconvenient to use as
inside the driver code we mostly want to use driver data of platform
device instead of ACPI device.

Make notify handlers installer wrappers more generic, while still
saving some code that would be duplicated otherwise.

Reviewed-by: Andy Shevchenko 
Signed-off-by: Michal Wilczynski 
---

Notes:
So one solution could be to just replace acpi_device with
platform_device as an argument in those functions. However I don't
believe this is a correct solution, as it is very often the case that
drivers declare their own private structures which gets allocated during
the .probe() callback, and become the heart of the driver. When drivers
do that it makes much more sense to just pass the private structure
to the notify handler instead of forcing user to dance with the
platform_device or acpi_device.

 drivers/acpi/ac.c |  6 +++---
 drivers/acpi/acpi_video.c |  6 +++---
 drivers/acpi/battery.c|  6 +++---
 drivers/acpi/bus.c| 14 ++
 drivers/acpi/hed.c|  6 +++---
 drivers/acpi/nfit/core.c  |  6 +++---
 drivers/acpi/thermal.c|  6 +++---
 include/acpi/acpi_bus.h   |  9 -
 8 files changed, 28 insertions(+), 31 deletions(-)

diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
index 225dc6818751..0b245f9f7ec8 100644
--- a/drivers/acpi/ac.c
+++ b/drivers/acpi/ac.c
@@ -256,8 +256,8 @@ static int acpi_ac_add(struct acpi_device *device)
ac->battery_nb.notifier_call = acpi_ac_battery_notify;
register_acpi_notifier(>battery_nb);
 
-   result = acpi_dev_install_notify_handler(device, ACPI_ALL_NOTIFY,
-acpi_ac_notify);
+   result = acpi_dev_install_notify_handler(device->handle, 
ACPI_ALL_NOTIFY,
+acpi_ac_notify, device);
if (result)
goto err_unregister;
 
@@ -306,7 +306,7 @@ static void acpi_ac_remove(struct acpi_device *device)
 
ac = acpi_driver_data(device);
 
-   acpi_dev_remove_notify_handler(device, ACPI_ALL_NOTIFY,
+   acpi_dev_remove_notify_handler(device->handle, ACPI_ALL_NOTIFY,
   acpi_ac_notify);
power_supply_unregister(ac->charger);
unregister_acpi_notifier(>battery_nb);
diff --git a/drivers/acpi/acpi_video.c b/drivers/acpi/acpi_video.c
index 948e31f7ce6e..025c17890127 100644
--- a/drivers/acpi/acpi_video.c
+++ b/drivers/acpi/acpi_video.c
@@ -2059,8 +2059,8 @@ static int acpi_video_bus_add(struct acpi_device *device)
 
acpi_video_bus_add_notify_handler(video);
 
-   error = acpi_dev_install_notify_handler(device, ACPI_DEVICE_NOTIFY,
-   acpi_video_bus_notify);
+   error = acpi_dev_install_notify_handler(device->handle, 
ACPI_DEVICE_NOTIFY,
+   acpi_video_bus_notify, device);
if (error)
goto err_remove;
 
@@ -2092,7 +2092,7 @@ static void acpi_video_bus_remove(struct acpi_device 
*device)
 
video = acpi_driver_data(device);
 
-   acpi_dev_remove_notify_handler(device, ACPI_DEVICE_NOTIFY,
+   acpi_dev_remove_notify_handler(device->handle, ACPI_DEVICE_NOTIFY,
   acpi_video_bus_notify);
 
mutex_lock(_list_lock);
diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
index 969bf81e8d54..45dae32a8646 100644
--- a/drivers/acpi/battery.c
+++ b/drivers/acpi/battery.c
@@ -1213,8 +1213,8 @@ static int acpi_battery_add(struct acpi_device *device)
 
device_init_wakeup(>dev, 1);
 
-   result = acpi_dev_install_notify_handler(device, ACPI_ALL_NOTIFY,
-acpi_battery_notify);
+   result = acpi_dev_install_notify_handler(device->handle, 
ACPI_ALL_NOTIFY,
+acpi_battery_notify, device);
if (result)
goto fail_pm;
 
@@ -1241,7 +1241,7 @@ static void acpi_battery_remove(struct acpi_device 
*device)
 
battery = acpi_driver_data(device);
 
-   acpi_dev_remove_notify_handler(device, ACPI_ALL_NOTIFY,
+   acpi_dev_remove_notify_handler(device->handle, ACPI_ALL_NOTIFY,
   acpi_battery_notify);
 
device_init_wakeup(>dev, 0);
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index f41dda2d3493..479fe888d629 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -554,14 +554,13 @@ static void acpi_device_remove_notify_handler(struct 
acpi_device *device,
acpi_os_wait_events_complete();
 }
 
-int acpi_dev_install_notify_handler(struct acpi_device *adev,
-

[PATCH v1 3/9] ACPI: AC: Remove unnecessary checks

2023-09-25 Thread Michal Wilczynski

Remove unnecessary checks for NULL for variables that can't be NULL at
the point they're checked for it. Defensive programming is discouraged
in the kernel.

Signed-off-by: Michal Wilczynski 
---
 drivers/acpi/ac.c | 27 ---
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
index 0b245f9f7ec8..dd04809a787c 100644
--- a/drivers/acpi/ac.c
+++ b/drivers/acpi/ac.c
@@ -131,9 +131,6 @@ static void acpi_ac_notify(acpi_handle handle, u32 event, 
void *data)
struct acpi_device *device = data;
struct acpi_ac *ac = acpi_driver_data(device);
 
-   if (!ac)
-   return;
-
switch (event) {
default:
acpi_handle_debug(device->handle, "Unsupported event [0x%x]\n",
@@ -216,12 +213,8 @@ static const struct dmi_system_id ac_dmi_table[]  
__initconst = {
 static int acpi_ac_add(struct acpi_device *device)
 {
struct power_supply_config psy_cfg = {};
-   int result = 0;
-   struct acpi_ac *ac = NULL;
-
-
-   if (!device)
-   return -EINVAL;
+   struct acpi_ac *ac;
+   int result;
 
ac = kzalloc(sizeof(struct acpi_ac), GFP_KERNEL);
if (!ac)
@@ -275,16 +268,9 @@ static int acpi_ac_add(struct acpi_device *device)
 #ifdef CONFIG_PM_SLEEP
 static int acpi_ac_resume(struct device *dev)
 {
-   struct acpi_ac *ac;
+   struct acpi_ac *ac = acpi_driver_data(to_acpi_device(dev));
unsigned int old_state;
 
-   if (!dev)
-   return -EINVAL;
-
-   ac = acpi_driver_data(to_acpi_device(dev));
-   if (!ac)
-   return -EINVAL;
-
old_state = ac->state;
if (acpi_ac_get_state(ac))
return 0;
@@ -299,12 +285,7 @@ static int acpi_ac_resume(struct device *dev)
 
 static void acpi_ac_remove(struct acpi_device *device)
 {
-   struct acpi_ac *ac = NULL;
-
-   if (!device || !acpi_driver_data(device))
-   return;
-
-   ac = acpi_driver_data(device);
+   struct acpi_ac *ac = acpi_driver_data(device);
 
acpi_dev_remove_notify_handler(device->handle, ACPI_ALL_NOTIFY,
   acpi_ac_notify);
-- 
2.41.0

[PATCH v1 2/9] docs: firmware-guide: ACPI: Clarify ACPI bus concepts

2023-09-25 Thread Michal Wilczynski

Some devices implement ACPI driver as a way to manage devices
enumerated by the ACPI. This might be confusing as a preferred way to
implement a driver for devices not connected to any bus is a platform
driver, as stated in the documentation. Clarify relationships between
ACPI device, platform device and ACPI entries.

Suggested-by: Elena Reshetova 
Signed-off-by: Michal Wilczynski 
---
 Documentation/firmware-guide/acpi/enumeration.rst | 13 +
 1 file changed, 13 insertions(+)

diff --git a/Documentation/firmware-guide/acpi/enumeration.rst 
b/Documentation/firmware-guide/acpi/enumeration.rst
index 56d9913a3370..f56cc79a9e83 100644
--- a/Documentation/firmware-guide/acpi/enumeration.rst
+++ b/Documentation/firmware-guide/acpi/enumeration.rst
@@ -64,6 +64,19 @@ If the driver needs to perform more complex initialization 
like getting and
 configuring GPIOs it can get its ACPI handle and extract this information
 from ACPI tables.
 
+ACPI bus
+
+
+Historically some devices not connected to any bus were represented as ACPI
+devices, and had to implement ACPI driver. This is not a preferred way for new
+drivers. As explained above devices not connected to any bus should implement
+platform driver. ACPI device would be created during enumeration nonetheless,
+and would be accessible through ACPI_COMPANION() macro, and the ACPI handle 
would
+be accessible through ACPI_HANDLE() macro. ACPI device is meant to describe
+information related to ACPI entry e.g. handle of the ACPI entry. Think -
+ACPI device interfaces with the FW, and the platform device with the rest of
+the system.
+
 DMA support
 ===
 
-- 
2.41.0

[PATCH v1 0/9] Replace acpi_driver with platform_driver

2023-09-25 Thread Michal Wilczynski

Currently there is a number of drivers that somewhat incorrectly
register themselves as acpi_driver, while they should be registering as
platform_drivers. acpi_device was never meant to represent actual
device, it only represents an entry in the ACPI table and ACPI driver
should be treated as a driver for the ACPI entry of the particular
device, not the device itself. Drivers of the devices itself should
register as platform_drivers. Replace acpi_driver with platform_driver
for all relevant drivers in drivers/acpi directory. This is just the
first part of the change, as many acpi_drivers are present in many files
outside of drivers/acpi directory.

Additionally during internal review we've decided that it's best to send
the relevant patches sequentially, in order not to overwhelm the reviewers.
So I'm starting with NFIT and AC drivers.

This change is possible thanks to recent .notify callback removal in
drivers/acpi [1]. So flow for the remaining acpi_drivers will look like
this:

1) Remove .notify callback with separate patchset for drivers/platform
   and other outstanding places like drivers/hwmon, drivers/iio etc.
2) Replace acpi_driver with platform_driver, as aside for .notify
   callback they're mostly functionally compatible.

Most of the patches in this series could be considered independent, and
could be merged separately, with a notable exception of the very first
patch in this series that changes notify installer wrappers to be more
generic. I decided it would be best not to force the user of this
wrappers to pass any specific structure, like acpi_device or
platform_device. It makes sense as it is very often the case that
drivers declare their own private structure which gets allocated during
the .probe() callback, and become the heart of the driver. In that case
it makes much more sense to pass this structure to notify handler.
Additionally some drivers, like acpi_video that already have custom
notify handlers do that already.

Link: 
https://lore.kernel.org/linux-acpi/20230703080252.2899090-1-michal.wilczyn...@intel.com/
 [1]

Michal Wilczynski (9):
  ACPI: bus: Make notify wrappers more generic
  docs: firmware-guide: ACPI: Clarify ACPI bus concepts
  ACPI: AC: Remove unnecessary checks
  ACPI: AC: Use string_choices API instead of ternary operator
  ACPI: AC: Replace acpi_driver with platform_driver
  ACPI: AC: Rename ACPI device from device to adev
  ACPI: NFIT: Replace acpi_driver with platform_driver
  ACPI: NFIT: Remove redundant call to to_acpi_dev()
  ACPI: NFIT: Don't use KBUILD_MODNAME for driver name

 .../firmware-guide/acpi/enumeration.rst   |  13 +++
 drivers/acpi/ac.c | 108 --
 drivers/acpi/acpi_video.c |   6 +-
 drivers/acpi/battery.c|   6 +-
 drivers/acpi/bus.c|  14 +--
 drivers/acpi/hed.c|   6 +-
 drivers/acpi/nfit/core.c  |  42 +++
 drivers/acpi/thermal.c|   6 +-
 include/acpi/acpi_bus.h   |   9 +-
 9 files changed, 103 insertions(+), 107 deletions(-)

-- 
2.41.0

droid4 -- weird behaviour when attempting to use usb host

2023-09-25 Thread Pavel Machek

Hi!

I'm having some fun with usb host. Good news is it works with
externally powered hub... after a while. I get some error messages
about inability to go host mode, but with enough patience it
eventually does enter host mode and I see my keyboard/mouse.

And usually in that process, one of my cpu cores disappear. top no
longer shows 2 cores, and I was wondering for a while if d4 is
single-core system. It is not, my two cores are back after reboot.

That's with 6.1.9 kernel from leste. Ideas how to debug this would be
welcome. (Do you use usb host?)

Best regards,
Pavel
--

Re: [PATCH 2/2] arm64: dts: qcom: msm8939-huawei-kiwi: Add initial device tree

2023-09-25 Thread Konrad Dybcio

On 25.09.2023 16:28, lukas walter wrote:
> Date: Wed, 20 Sep 2023 16:47:30 +0200
> 
>>> +
>>> +   reserved-memory {
>>> +   reserved@84a0 {
>>> +   reg = <0x0 0x84a0 0x0 0x160>;
>>> +   no-map;
>>> +   };
>> Do we know what this is for?
> 
> This seems to be some QSEE/TrustZone memory required to boot.
> I would name it `qseecom_mem: qseecom@84a0` like other phones
> currently have it.
> 
> `[1.162115] QSEECOM: qseecom_probe: secure app region
> addr=0x84a0 size=0x190`
Sounds good!

> 
>>> +   };
>>> +
>>> +   gpio-hall-sensor {
>>> +   compatible = "gpio-keys";
>>> +
>>> +   pinctrl-0 = <_hall_sensor_default>;
>>> +   pinctrl-names = "default";
>>> +
>>> +   label = "GPIO Hall Effect Sensor";
>> I think we can have both hall sensor and V+ under gpio-keys
>>
>> And then I am not sure how useful the label is for the container
>> node, maybe you or somebody else can tell me whether it's used
>> anywhere
>>> +
>>> +   event-hall-sensor {
>>> +   label = "Hall Effect Sensor";
>>> +   gpios = < 69 GPIO_ACTIVE_LOW>;
>>> +   linux,input-type = ;
>>> +   linux,code = ;
>>> +   linux,can-disable;
>> Should this not be a wakeup-source btw?
> 
> I am not sure how to change this. I would like to leave this as many
> other hall sensors seem to be configured identically.
Krzysztof, opinions?

> 
> Is this fine?
> Should I send a V2 with the signoff and reserved-memory changes?
I don't quite get it, what signoff?

Konrad

Re: [PATCH 2/2] arm64: dts: qcom: msm8939-huawei-kiwi: Add initial device tree

2023-09-25 Thread lukas walter

Date: Wed, 20 Sep 2023 16:47:30 +0200

>> +
>> +reserved-memory {
>> +reserved@84a0 {
>> +reg = <0x0 0x84a0 0x0 0x160>;
>> +no-map;
>> +};
>Do we know what this is for?

This seems to be some QSEE/TrustZone memory required to boot.
I would name it `qseecom_mem: qseecom@84a0` like other phones
currently have it.

`[1.162115] QSEECOM: qseecom_probe: secure app region
addr=0x84a0 size=0x190`

>> +};
>> +
>> +gpio-hall-sensor {
>> +compatible = "gpio-keys";
>> +
>> +pinctrl-0 = <_hall_sensor_default>;
>> +pinctrl-names = "default";
>> +
>> +label = "GPIO Hall Effect Sensor";
>I think we can have both hall sensor and V+ under gpio-keys
>
>And then I am not sure how useful the label is for the container
>node, maybe you or somebody else can tell me whether it's used
>anywhere
>> +
>> +event-hall-sensor {
>> +label = "Hall Effect Sensor";
>> +gpios = < 69 GPIO_ACTIVE_LOW>;
>> +linux,input-type = ;
>> +linux,code = ;
>> +linux,can-disable;
>Should this not be a wakeup-source btw?

I am not sure how to change this. I would like to leave this as many
other hall sensors seem to be configured identically.

Is this fine?
Should I send a V2 with the signoff and reserved-memory changes?

Re: [PATCH v5 04/12] fprobe: Use ftrace_regs in fprobe entry handler

2023-09-25 Thread Google

Hi Jiri,

On Mon, 25 Sep 2023 12:41:59 +0200
Jiri Olsa  wrote:

> On Sun, Sep 24, 2023 at 10:36:36PM +0900, Masami Hiramatsu (Google) wrote:
> > From: Masami Hiramatsu (Google) 
> > 
> > This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
> > instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, then we can enable fprobe
> > on arm64.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > Acked-by: Florent Revest 
> 
> I was getting bpf selftests failures with this patchset and when
> bisecting I'm getting crash when running on top of this change

Thanks for bisecting!

> 
> looks like it's missing some of the regs NULL checks added later?

yeah, if the RIP (arch_rethook_prepare+0x0/0x30) is correct, 

void arch_rethook_prepare(struct rethook_node *rh, struct ftrace_regs *fregs, 
bool mcount)

RSI (the 2nd argument) is NULL. This means fregs == NULL and caused the crash.
I think ftrace_get_regs(fregs) for the entry handler may return NULL.

Ah, 

@@ -182,7 +182,7 @@ static void fprobe_init(struct fprobe *fp)
fp->ops.func = fprobe_kprobe_handler;
else
fp->ops.func = fprobe_handler;
-   fp->ops.flags |= FTRACE_OPS_FL_SAVE_REGS;
+   fp->ops.flags |= FTRACE_OPS_FL_SAVE_ARGS;
 }
 
 static int fprobe_init_rethook(struct fprobe *fp, int num)

This may cause the issue, it should keep REGS at this point (this must be done 
in
[9/12]). But after applying [9/12], it shouldn't be a problem... 

Let me check it again.

Thank you!

> 
> jirka
> 
> 
> ---
> [  124.089449][  T677] BUG: kernel NULL pointer dereference, address: 
> 0098
> [  124.090102][  T677] #PF: supervisor read access in kernel mode
> [  124.090568][  T677] #PF: error_code(0x) - not-present page
> [  124.091039][  T677] PGD 158fd8067 P4D 158fd8067 PUD 10896a067 PMD 0 
> [  124.091482][  T677] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
> [  124.091986][  T677] CPU: 1 PID: 677 Comm: test_progs Tainted: G   
> OE  6.6.0-rc2+ #768 1c8a8990289f2615e36dadd01915b80e8da29bf5
> [  124.092898][  T677] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
> BIOS 1.16.2-1.fc38 04/01/2014
> [  124.093613][  T677] RIP: 0010:arch_rethook_prepare+0x0/0x30
> [  124.094060][  T677] Code: 90 90 90 90 90 90 90 90 90 90 48 89 b7 a8 00 00 
> 00 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 
> <48> 8b 86 98 00 00 00 48 8b 10 48 89 57 20 48 8b 96 98 00 00 00 48
> [  124.096239][  T677] RSP: 0018:c90003d3bc98 EFLAGS: 00010286
> [  124.096708][  T677] RAX:  RBX: 88815d9fbe50 RCX: 
> 88815d9fbe40
> [  124.097332][  T677] RDX: 0001 RSI:  RDI: 
> 88815d9fbe40
> [  124.097946][  T677] RBP:  R08:  R09: 
> 0004
> [  124.098554][  T677] R10: 0001 R11:  R12: 
> 81fbb7b0
> [  124.099108][  T677] R13: 81fbd47b R14: fff7 R15: 
> 88815d9fbe40
> [  124.099720][  T677] FS:  7f9c2063ed00() GS:88846d20() 
> knlGS:
> [  124.100403][  T677] CS:  0010 DS:  ES:  CR0: 80050033
> [  124.100908][  T677] CR2: 0098 CR3: 000108e02002 CR4: 
> 00770ee0
> [  124.101537][  T677] DR0:  DR1:  DR2: 
> 
> [  124.102096][  T677] DR3:  DR6: 4ff0 DR7: 
> 0400
> [  124.102689][  T677] PKRU: 5554
> [  124.102988][  T677] Call Trace:
> [  124.103303][  T677]  
> [  124.103580][  T677]  ? __die+0x1f/0x70
> [  124.103928][  T677]  ? page_fault_oops+0x176/0x4d0
> [  124.104348][  T677]  ? __pte_offset_map_lock+0xa5/0x190
> [  124.104818][  T677]  ? do_user_addr_fault+0x73/0x870
> [  124.105277][  T677]  ? exc_page_fault+0x81/0x250
> [  124.105709][  T677]  ? asm_exc_page_fault+0x22/0x30
> [  124.106171][  T677]  ? bpf_prog_test_run_tracing+0x14b/0x2c0
> [  124.106675][  T677]  ? __pfx_bpf_fentry_test1+0x10/0x10
> [  124.107135][  T677]  ? __pfx_arch_rethook_prepare+0x10/0x10
> [  124.107598][  T677]  rethook_hook+0x10/0x30
> [  124.107966][  T677]  fprobe_handler+0x129/0x210
> [  124.108351][  T677]  ? __pfx_bpf_fentry_test1+0x10/0x10
> [  124.108796][  T677]  ? bpf_prog_test_run_tracing+0x14b/0x2c0
> [  124.109276][  T677]  arch_ftrace_ops_list_func+0xf2/0x200
> [  124.109708][  T677]  ftrace_call+0x5/0x44
> [  124.110060][  T677]  ? bpf_fentry_test1+0x5/0x10
> [  124.110463][  T677]  bpf_fentry_test1+0x5/0x10
> [  124.110881][  T677]  bpf_prog_test_run_tracing+0x14b/0x2c0
> [  124.111337][  T677]  __sys_bpf+0x305/0x2820
> [  124.111705][  T677]  __x64_sys_bpf+0x1a/0x30
> [  124.112053][  T677]  do_syscall_64+0x38/0x90
> [  124.116245][  T677]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  124.116688][  T677] RIP: 0033:0x7f9c20806b5d
> [  124.117078][  T677] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e 
> fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c

Re: (subset) [PATCH v2 0/7] Initial support for the Fairphone 5 smartphone

2023-09-25 Thread Srinivas Kandagatla



On Tue, 19 Sep 2023 14:45:54 +0200, Luca Weiss wrote:
> Add support to boot up mainline kernel on the QCM6490-based Fairphone 5
> smartphone.
> 
> These patches only cover a part of the functionality brought up on
> mainline so far, with the rest needing larger dts and driver changes or
> depend on patches that are not yet merged. I will work on sending those
> once these base patches here have settled.
> 
> [...]

Applied, thanks!

[2/7] nvmem: qfprom: Mark core clk as optional
  commit: 844ac302b2aa81c47a4323fc34a0a454cc749dbc

Best regards,
-- 
Srinivas Kandagatla

Re: [PATCH v5 04/12] fprobe: Use ftrace_regs in fprobe entry handler

2023-09-25 Thread Jiri Olsa

On Sun, Sep 24, 2023 at 10:36:36PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
> instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, then we can enable fprobe
> on arm64.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> Acked-by: Florent Revest 

I was getting bpf selftests failures with this patchset and when
bisecting I'm getting crash when running on top of this change

looks like it's missing some of the regs NULL checks added later?

jirka


---
[  124.089449][  T677] BUG: kernel NULL pointer dereference, address: 
0098
[  124.090102][  T677] #PF: supervisor read access in kernel mode
[  124.090568][  T677] #PF: error_code(0x) - not-present page
[  124.091039][  T677] PGD 158fd8067 P4D 158fd8067 PUD 10896a067 PMD 0 
[  124.091482][  T677] Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
[  124.091986][  T677] CPU: 1 PID: 677 Comm: test_progs Tainted: G   OE 
 6.6.0-rc2+ #768 1c8a8990289f2615e36dadd01915b80e8da29bf5
[  124.092898][  T677] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.2-1.fc38 04/01/2014
[  124.093613][  T677] RIP: 0010:arch_rethook_prepare+0x0/0x30
[  124.094060][  T677] Code: 90 90 90 90 90 90 90 90 90 90 48 89 b7 a8 00 00 00 
c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <48> 
8b 86 98 00 00 00 48 8b 10 48 89 57 20 48 8b 96 98 00 00 00 48
[  124.096239][  T677] RSP: 0018:c90003d3bc98 EFLAGS: 00010286
[  124.096708][  T677] RAX:  RBX: 88815d9fbe50 RCX: 
88815d9fbe40
[  124.097332][  T677] RDX: 0001 RSI:  RDI: 
88815d9fbe40
[  124.097946][  T677] RBP:  R08:  R09: 
0004
[  124.098554][  T677] R10: 0001 R11:  R12: 
81fbb7b0
[  124.099108][  T677] R13: 81fbd47b R14: fff7 R15: 
88815d9fbe40
[  124.099720][  T677] FS:  7f9c2063ed00() GS:88846d20() 
knlGS:
[  124.100403][  T677] CS:  0010 DS:  ES:  CR0: 80050033
[  124.100908][  T677] CR2: 0098 CR3: 000108e02002 CR4: 
00770ee0
[  124.101537][  T677] DR0:  DR1:  DR2: 

[  124.102096][  T677] DR3:  DR6: 4ff0 DR7: 
0400
[  124.102689][  T677] PKRU: 5554
[  124.102988][  T677] Call Trace:
[  124.103303][  T677]  
[  124.103580][  T677]  ? __die+0x1f/0x70
[  124.103928][  T677]  ? page_fault_oops+0x176/0x4d0
[  124.104348][  T677]  ? __pte_offset_map_lock+0xa5/0x190
[  124.104818][  T677]  ? do_user_addr_fault+0x73/0x870
[  124.105277][  T677]  ? exc_page_fault+0x81/0x250
[  124.105709][  T677]  ? asm_exc_page_fault+0x22/0x30
[  124.106171][  T677]  ? bpf_prog_test_run_tracing+0x14b/0x2c0
[  124.106675][  T677]  ? __pfx_bpf_fentry_test1+0x10/0x10
[  124.107135][  T677]  ? __pfx_arch_rethook_prepare+0x10/0x10
[  124.107598][  T677]  rethook_hook+0x10/0x30
[  124.107966][  T677]  fprobe_handler+0x129/0x210
[  124.108351][  T677]  ? __pfx_bpf_fentry_test1+0x10/0x10
[  124.108796][  T677]  ? bpf_prog_test_run_tracing+0x14b/0x2c0
[  124.109276][  T677]  arch_ftrace_ops_list_func+0xf2/0x200
[  124.109708][  T677]  ftrace_call+0x5/0x44
[  124.110060][  T677]  ? bpf_fentry_test1+0x5/0x10
[  124.110463][  T677]  bpf_fentry_test1+0x5/0x10
[  124.110881][  T677]  bpf_prog_test_run_tracing+0x14b/0x2c0
[  124.111337][  T677]  __sys_bpf+0x305/0x2820
[  124.111705][  T677]  __x64_sys_bpf+0x1a/0x30
[  124.112053][  T677]  do_syscall_64+0x38/0x90
[  124.116245][  T677]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  124.116688][  T677] RIP: 0033:0x7f9c20806b5d
[  124.117078][  T677] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 
48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 
3d 01 f0 ff ff 73 01 c3 48 8b 0d 7b 92 0c 00 f7 d8 64 89 01 48
[  124.118487][  T677] RSP: 002b:7ffc615e5228 EFLAGS: 0206 ORIG_RAX: 
0141
[  124.119090][  T677] RAX: ffda RBX: 7f9c20918000 RCX: 
7f9c20806b5d
[  124.119698][  T677] RDX: 0050 RSI: 7ffc615e5260 RDI: 
000a
[  124.120298][  T677] RBP: 7ffc615e5240 R08:  R09: 
7ffc615e5260
[  124.120893][  T677] R10: 0064 R11: 0206 R12: 
0001
[  124.121486][  T677] R13:  R14: 7f9c2094d000 R15: 
011a8db0
[  124.122156][  T677]  
[  124.122421][  T677] Modules linked in: bpf_testmod(OE) intel_rapl_msr 
intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel 
ghash_clmulni_intel kvm_intel rapl iTCO_wdt iTCO_vendor_support i2c_i801 
i2c_smbus lpc_ich drm loop drm_panel_orientation_quirks zram
[  124.125064][  T677] CR2: 0098
[  124.125431][  T677] ---[ end trace  ]---
[  124.125861][  T677] RIP:

Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

2023-09-25 Thread Google

Hi Wuqiang,

On Tue,  5 Sep 2023 09:52:51 +0800
"wuqiang.matt"  wrote:

> The object pool is a scalable implementaion of high performance queue
> for object allocation and reclamation, such as kretprobe instances.
> 
> With leveraging percpu ring-array to mitigate the hot spot of memory
> contention, it could deliver near-linear scalability for high parallel
> scenarios. The objpool is best suited for following cases:
> 1) Memory allocation or reclamation are prohibited or too expensive
> 2) Consumers are of different priorities, such as irqs and threads
> 
> Limitations:
> 1) Maximum objects (capacity) is determined during pool initializing
>and can't be modified (extended) after objpool creation
> 2) The memory of objects won't be freed until objpool is finalized
> 3) Object allocation (pop) may fail after trying all cpu slots

I made a simplifying patch on this by (mainly) removing ages array.
I also rename local variable to use more readable names, like slot,
pool, and obj.

Here the results which I run the test_objpool.ko.

Original:
[   50.500235] Summary of testcases:
[   50.503139] duration: 1027135us  hits:   30628293miss:  
0sync: percpu objpool
[   50.510416] duration: 1047977us  hits:   30453023miss:  
0sync: percpu objpool from vmalloc
[   50.517421] duration: 1047098us  hits:   31145034miss:  
0sync & hrtimer: percpu objpool
[   50.524671] duration: 1053233us  hits:   30919640miss:  
0sync & hrtimer: percpu objpool from vmalloc
[   50.531382] duration: 1055822us  hits:3407221miss: 
830219sync overrun: percpu objpool
[   50.538135] duration: 1055998us  hits:3404624miss: 
854160sync overrun: percpu objpool from vmalloc
[   50.546686] duration: 1046104us  hits:   19464798miss:  
0async: percpu objpool
[   50.552989] duration: 1033054us  hits:   18957836miss:  
0async: percpu objpool from vmalloc
[   50.560289] duration: 1041778us  hits:   33806868miss:  
0async & hrtimer: percpu objpool
[   50.567425] duration: 1048901us  hits:   34211860miss:  
0async & hrtimer: percpu objpool from vmalloc

Simplified:
[   48.393236] Summary of testcases:
[   48.395384] duration: 1013002us  hits:   29661448miss:  
0sync: percpu objpool
[   48.400351] duration: 1057231us  hits:   31035578miss:  
0sync: percpu objpool from vmalloc
[   48.405660] duration: 1043196us  hits:   30546652miss:  
0sync & hrtimer: percpu objpool
[   48.411216] duration: 1047128us  hits:   30601306miss:  
0sync & hrtimer: percpu objpool from vmalloc
[   48.417231] duration: 1051695us  hits:3468287miss: 
892881sync overrun: percpu objpool
[   48.422405] duration: 1054003us  hits:3549491miss: 
898141sync overrun: percpu objpool from vmalloc
[   48.428425] duration: 1052946us  hits:   19005228miss:  
0async: percpu objpool
[   48.433597] duration: 1051757us  hits:   19670866miss:  
0async: percpu objpool from vmalloc
[   48.439280] duration: 1042951us  hits:   37135332miss:  
0async & hrtimer: percpu objpool
[   48.445085] duration: 1029803us  hits:   37093248miss:  
0async & hrtimer: percpu objpool from vmalloc

Can you test it too?

Thanks,

>From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" 
Date: Mon, 25 Sep 2023 16:07:12 +0900
Subject: [PATCH] objpool: Simplify objpool by removing ages array

Simplify the objpool code by removing ages array from per-cpu slot.
It chooses enough big capacity (which is a rounded up power of 2 value
of nr_objects + 1) for the entries array, the tail never catch up to
the head in per-cpu slot. Thus tail == head means the slot is empty.

This also uses consistent local variable names for pool, slot and obj.

Signed-off-by: Masami Hiramatsu (Google) 
---
 include/linux/objpool.h |  61 
 lib/objpool.c   | 310 
 2 files changed, 147 insertions(+), 224 deletions(-)

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index 33c832216b98..ecd5ecaffcd3 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -38,33 +38,23 @@
  * struct objpool_slot - percpu ring array of objpool
  * @head: head of the local ring array (to retrieve at)
  * @tail: tail of the local ring array (to append at)
- * @bits: log2 of capacity (for bitwise operations)
- * @mask: capacity - 1
+ * @mask: capacity of entries - 1
+ * @entries: object entries on this slot.
  *
  * Represents a cpu-local array-based ring buffer, its size is specialized
  * during

Re: [PATCH] ieee802154: ca8210: Fix a potential UAF in ca8210_probe

2023-09-25 Thread dinghao . liu

Hi Miquèl,

> > index aebb19f1b3a4..1d545879c000 100644
> > --- a/drivers/net/ieee802154/ca8210.c
> > +++ b/drivers/net/ieee802154/ca8210.c
> > @@ -2760,6 +2760,7 @@ static int ca8210_register_ext_clock(struct 
> > spi_device *spi)
> > ret = of_clk_add_provider(np, of_clk_src_simple_get, priv->clk);
> > if (ret) {
> > clk_unregister(priv->clk);
> > +   priv->clk = NULL;
> 
> This function is a bit convoluted. You could just return the result of
> of_clk_add_provider() (keep the printk's if you want, they don't seem
> very useful) and let ca8210_unregister_ext_clock() do the cleanup.

Thanks for your advice! I will resend a new patch as suggested.

> 
> > dev_crit(
> > >dev,
> > "Failed to register external clock as clock provider\n"
> > @@ -2780,7 +2781,7 @@ static void ca8210_unregister_ext_clock(struct 
> > spi_device *spi)
> >  {
> > struct ca8210_priv *priv = spi_get_drvdata(spi);
> >  
> > -   if (!priv->clk)
> > +   if (IS_ERR_OR_NULL(priv->clk))
> 
> Does not look useful as you are enforcing priv->clock to be valid or
> NULL, it cannot be an error code.

I find that ca8210_register_ext_clock() uses IS_ERR to check priv->clk
after calling clk_register_fixed_rate(). So I think priv->clk could be
a non-null pointer even on failure. And a null pointer check may miss
this case in ca8210_unregister_ext_clock(). 

Regards,
Dinghao

Re: [PATCH] ieee802154: ca8210: Fix a potential UAF in ca8210_probe

2023-09-25 Thread Miquel Raynal

Hi Dinghao,

dinghao@zju.edu.cn wrote on Mon, 25 Sep 2023 15:24:22 +0800:

> If of_clk_add_provider() fails in ca8210_register_ext_clock(),
> it calls clk_unregister() to release priv->clk and returns an
> error. However, the caller ca8210_probe() then calls ca8210_remove(),
> where priv->clk is freed again in ca8210_unregister_ext_clock(). In
> this case, a use-after-free may happen in the second time we call
> clk_unregister().
> 
> Fix this by nulling priv->clk after the first clk_unregister(). Also
> refine the pointer checking in ca8210_unregister_ext_clock().
> 
> Fixes: ded845a781a5 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
> Signed-off-by: Dinghao Liu 
> ---
>  drivers/net/ieee802154/ca8210.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ieee802154/ca8210.c b/drivers/net/ieee802154/ca8210.c
> index aebb19f1b3a4..1d545879c000 100644
> --- a/drivers/net/ieee802154/ca8210.c
> +++ b/drivers/net/ieee802154/ca8210.c
> @@ -2760,6 +2760,7 @@ static int ca8210_register_ext_clock(struct spi_device 
> *spi)
>   ret = of_clk_add_provider(np, of_clk_src_simple_get, priv->clk);
>   if (ret) {
>   clk_unregister(priv->clk);
> + priv->clk = NULL;

This function is a bit convoluted. You could just return the result of
of_clk_add_provider() (keep the printk's if you want, they don't seem
very useful) and let ca8210_unregister_ext_clock() do the cleanup.

>   dev_crit(
>   >dev,
>   "Failed to register external clock as clock provider\n"
> @@ -2780,7 +2781,7 @@ static void ca8210_unregister_ext_clock(struct 
> spi_device *spi)
>  {
>   struct ca8210_priv *priv = spi_get_drvdata(spi);
>  
> - if (!priv->clk)
> + if (IS_ERR_OR_NULL(priv->clk))

Does not look useful as you are enforcing priv->clock to be valid or
NULL, it cannot be an error code.

Thanks,
Miquèl

Re: [PATCH] tracing/user_events: align uaddr on unsigned long alignment

2023-09-25 Thread Clément Léger




On 22/09/2023 22:00, Beau Belgrave wrote:
> On Tue, Sep 19, 2023 at 02:59:12PM +0200, Clément Léger wrote:
>>
>>
>> On 14/09/2023 19:29, Steven Rostedt wrote:
>>> On Thu, 14 Sep 2023 13:17:00 -0400
>>> Steven Rostedt  wrote:
>>>
 Now lets look at big endian layout:

  uaddr = 0xbeef0004
  enabler = 1;

  memory at 0xbeef:  00 00 00 00 00 00 00 02
 ^
 addr: 0xbeef0004

(enabler is set )

bitoffset = uaddr & (sizeof(unsigned long) - 1); bitoffset = 4
bit_offset *= 8; bitoffset = 32
uaddr &= ~(sizeof(unsigned long) - 1);   uaddr = 0xbeef

ptr = kaddr + (uaddr & ~PAGE_MASK);

clear_bit(1 + 32, ptr);

  memory at 0xbeef:  00 00 00 00 00 00 00 02
   ^
bit 33 of 0xbeef

 I don't think that's what you expected!
>>>
>>> I believe the above can be fixed with:
>>>
>>> bit_offset = uaddr & (sizeof(unsigned long) - 1);
>>> if (bit_offset) {
>>> #ifdef CONFIG_CPU_BIG_ENDIAN
>>> bit_offest = 0;
>>> #else
>>> bit_offset *= BITS_PER_BYTE;
>>> #endif
>>> uaddr &= ~(sizeof(unsigned long) - 1);
>>> }
>>>
>>> -- Steve
>>
>>
>> Actually, after looking more in depth at that, it seems like there are
>> actually 2 problems that can happen.
>>
>> First one is atomic access misalignment due to enable_size == 4 and
>> uaddr not being aligned on a (long) boundary on 64 bits architecture.
>> This can generate misaligned exceptions on various architectures. This
>> can be fixed in a more general way according to Masami snippet.
>>
>> Second one that I can see is on 64 bits, big endian architectures with
>> enable_size == 4. In that case, the bit provided by the userspace won't
>> be correctly set since this code kind of assume that the atomic are done
>> on 32bits value. Since the kernel assume long sized atomic operation, on
>> big endian 64 bits architecture, the updated bit will actually be in the
>> next 32 bits word.
>>
>> Can someone confirm my understanding ?
>>
> 
> Actually, I take back some of what I said [1]. If a 32-bit on a 64-bit
> kernel comes in on BE, and is aligned, we do need to offset the bits as
> well (just verified on my ppc64 BE VM).

Yes, that is what I meant in my previous comment. I'll resend my series
which handles that properly (and which includes generic
set_bit_unaligned()).

Thanks,

Clément

> 
> You should be able to use that patch as a base though and add a flag to
> struct user_event_enabler when this case occurs. Then in the
> align_addr_bit() adjust the bits as well upon aligned cases.
> 
> Thanks,
> -Beau
> 
> 1. 
> https://lore.kernel.org/linux-trace-kernel/20230922192231.ga1828-be...@linux.microsoft.com/
> 
>> Clément

Re: [PATCH] tracing/user_events: align uaddr on unsigned long alignment

2023-09-25 Thread Clément Léger




On 22/09/2023 21:22, Beau Belgrave wrote:
> On Tue, Sep 19, 2023 at 02:59:12PM +0200, Clément Léger wrote:
>>
>>
>> On 14/09/2023 19:29, Steven Rostedt wrote:
>>> On Thu, 14 Sep 2023 13:17:00 -0400
>>> Steven Rostedt  wrote:
>>>
 Now lets look at big endian layout:

  uaddr = 0xbeef0004
  enabler = 1;

  memory at 0xbeef:  00 00 00 00 00 00 00 02
 ^
 addr: 0xbeef0004

(enabler is set )

bitoffset = uaddr & (sizeof(unsigned long) - 1); bitoffset = 4
bit_offset *= 8; bitoffset = 32
uaddr &= ~(sizeof(unsigned long) - 1);   uaddr = 0xbeef

ptr = kaddr + (uaddr & ~PAGE_MASK);

clear_bit(1 + 32, ptr);

  memory at 0xbeef:  00 00 00 00 00 00 00 02
   ^
bit 33 of 0xbeef

 I don't think that's what you expected!
>>>
>>> I believe the above can be fixed with:
>>>
>>> bit_offset = uaddr & (sizeof(unsigned long) - 1);
>>> if (bit_offset) {
>>> #ifdef CONFIG_CPU_BIG_ENDIAN
>>> bit_offest = 0;
>>> #else
>>> bit_offset *= BITS_PER_BYTE;
>>> #endif
>>> uaddr &= ~(sizeof(unsigned long) - 1);
>>> }
>>>
>>> -- Steve
>>
>>
>> Actually, after looking more in depth at that, it seems like there are
>> actually 2 problems that can happen.
>>
>> First one is atomic access misalignment due to enable_size == 4 and
>> uaddr not being aligned on a (long) boundary on 64 bits architecture.
>> This can generate misaligned exceptions on various architectures. This
>> can be fixed in a more general way according to Masami snippet.
>>
>> Second one that I can see is on 64 bits, big endian architectures with
>> enable_size == 4. In that case, the bit provided by the userspace won't
>> be correctly set since this code kind of assume that the atomic are done
>> on 32bits value. Since the kernel assume long sized atomic operation, on
>> big endian 64 bits architecture, the updated bit will actually be in the
>> next 32 bits word.
>>
>> Can someone confirm my understanding ?
>>
> 
> I have a ppc 64bit BE VM I've been validating this on. If we do the
> shifting within user_events (vs a generic set_bit_aligned approach)
> 64bit BE does not need additional bit manipulation. However, if we were
> to blindly pass the address and bit as is to set_bit_aligned() it
> assumes the bit number is for a long, not a 32 bit word. So for that
> approach we would need to offset the bit in the unaligned case.
> 
> Here's a patch I have that I've validated on ppc64 BE, aarch64 LE, and
> x86_64 LE. I personally feel more comfortable with this approach than
> the generic set_bit_aligned() one.
> 
> Thanks,
> -Beau
> 
> diff --git a/kernel/trace/trace_events_user.c 
> b/kernel/trace/trace_events_user.c
> index e3f2b8d72e01..ae854374d0b7 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -162,6 +162,23 @@ struct user_event_validator {
>   int flags;
>  };
>  
> +static inline void align_addr_bit(unsigned long *addr, int *bit)
> +{
> + if (IS_ALIGNED(*addr, sizeof(long)))
> + return;
> +
> + *addr = ALIGN_DOWN(*addr, sizeof(long));
> +
> + /*
> +  * We only support 32 and 64 bit values. The only time we need
> +  * to align is a 32 bit value on a 64 bit kernel, which on LE
> +  * is always 32 bits, and on BE requires no change.
> +  */
> +#ifdef __LITTLE_ENDIAN
> + *bit += 32;
> +#endif

Hi Beau, except the specific alignment that is basically what I ended up
with for the BE 64bit case (ie just bit += 32). Regarding the generic
alignment, depends on what the maintainers wishes (generic of user_event
specific). I also feel like this shoulmd be handle specifically for
user_events which uses set_bit in some non standard way. Any suggestion ?

Thanks,

Clément

> +}
> +
>  typedef void (*user_event_func_t) (struct user_event *user, struct iov_iter 
> *i,
>  void *tpdata, bool *faulted);
>  
> @@ -481,6 +498,7 @@ static int user_event_enabler_write(struct user_event_mm 
> *mm,
>   unsigned long *ptr;
>   struct page *page;
>   void *kaddr;
> + int bit = ENABLE_BIT(enabler);
>   int ret;
>  
>   lockdep_assert_held(_mutex);
> @@ -496,6 +514,8 @@ static int user_event_enabler_write(struct user_event_mm 
> *mm,
>test_bit(ENABLE_VAL_FREEING_BIT, ENABLE_BITOPS(enabler
>   return -EBUSY;
>  
> + align_addr_bit(, );
> +
>   ret = pin_user_pages_remote(mm->mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
>   , NULL);
>  
> @@ -514,9 +534,9 @@ static int user_event_enabler_write(struct user_event_mm 
> *mm,
>  
>   /* Update bit atomically, user tracers must be atomic as

[PATCH] ieee802154: ca8210: Fix a potential UAF in ca8210_probe

2023-09-25 Thread Dinghao Liu

If of_clk_add_provider() fails in ca8210_register_ext_clock(),
it calls clk_unregister() to release priv->clk and returns an
error. However, the caller ca8210_probe() then calls ca8210_remove(),
where priv->clk is freed again in ca8210_unregister_ext_clock(). In
this case, a use-after-free may happen in the second time we call
clk_unregister().

Fix this by nulling priv->clk after the first clk_unregister(). Also
refine the pointer checking in ca8210_unregister_ext_clock().

Fixes: ded845a781a5 ("ieee802154: Add CA8210 IEEE 802.15.4 device driver")
Signed-off-by: Dinghao Liu 
---
 drivers/net/ieee802154/ca8210.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ieee802154/ca8210.c b/drivers/net/ieee802154/ca8210.c
index aebb19f1b3a4..1d545879c000 100644
--- a/drivers/net/ieee802154/ca8210.c
+++ b/drivers/net/ieee802154/ca8210.c
@@ -2760,6 +2760,7 @@ static int ca8210_register_ext_clock(struct spi_device 
*spi)
ret = of_clk_add_provider(np, of_clk_src_simple_get, priv->clk);
if (ret) {
clk_unregister(priv->clk);
+   priv->clk = NULL;
dev_crit(
>dev,
"Failed to register external clock as clock provider\n"
@@ -2780,7 +2781,7 @@ static void ca8210_unregister_ext_clock(struct spi_device 
*spi)
 {
struct ca8210_priv *priv = spi_get_drvdata(spi);
 
-   if (!priv->clk)
+   if (IS_ERR_OR_NULL(priv->clk))
return
 
of_clk_del_provider(spi->dev.of_node);
-- 
2.17.1

[PATCH RT 0/1] Linux v4.19.295-rt129-rc1

2023-09-25 Thread Daniel Wagner

Dear RT Folks,

This is the RT stable review cycle of patch 4.19.295-rt129-rc1.

Please scream at me if I messed something up. Please test the patches
too.

The -rc release is also available on kernel.org

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

on the v4.19-rt-next branch.

If all goes well, this patch will be converted to the next main
release on 2023-10-02.

Signing key fingerprint:

  5BF6 7BC5 0826 72CA BB45  ACAE 587C 5ECA 5D0A 306C

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Daniel

Changes from v4.19.292-rt128:


Daniel Wagner (1):
  Linux 4.19.295-rt129

 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.41.0

[PATCH RT 1/1] Linux 4.19.295-rt129

2023-09-25 Thread Daniel Wagner

v4.19.295-rt129-rc1 stable review patch.
If anyone has any objections, please let me know.

---


Signed-off-by: Daniel Wagner 
---
 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/localversion-rt b/localversion-rt
index 6d2a676e2033..90303f5aabcf 100644
--- a/localversion-rt
+++ b/localversion-rt
@@ -1 +1 @@
--rt128
+-rt129
-- 
2.41.0

48 matches

Mail list logo