Re: [PATCH v5 0/8] DAMON based tiered memory management for CXL memory
Hi SeongJae,

On Thu, 13 Jun 2024 10:46:04 -0700 SeongJae Park wrote:
> Hi Honggyu,
>
> On Thu, 13 Jun 2024 22:20:47 +0900 Honggyu Kim wrote:
> >
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> >
> > It said no implementation of the demote/promote DAMOS actions had been
> > made.  This patch series implements them for the physical address space
> > so that this scheme can be applied at the system-wide level.
> >
> > Changes from RFC v4:
> > https://lore.kernel.org/20240512175447.75943-1...@kernel.org
> >   1. Add usage and design documents
> >   2. Rename alloc_demote_folio to alloc_migrate_folio
> >   3. Add evaluation results with "demotion_enabled" true
> >   4. Rebase based on v6.10-rc3
>
> I left comments on the new patches for the documentation.
>
> [...]
>
> > Evaluation Results
> > ==================
> >
> > All the result values are normalized to the DRAM-only execution time
> > because the workload cannot be faster than DRAM-only unless it hits the
> > peak bandwidth, and our redis test doesn't go beyond the bandwidth
> > limit.
> >
> > So the DRAM-only execution time is the ideal result, unaffected by the
> > performance gap between DRAM and CXL.  The NUMA node environment is as
> > follows.
> >
> >   node0 - local DRAM, 512GB with a CPU socket (fast tier)
> >   node1 - disabled
> >   node2 - CXL DRAM, 96GB, no CPU attached (slow tier)
> >
> > The following is the result of generating a zipfian distribution to
> > redis-server; the numbers are averaged over 50 executions.
> >
> > 1. YCSB zipfian distribution read only workload
> >    memory pressure with cold memory on node0 with 512GB of local DRAM.
> >
> >   +-------------+------------------------------------------------+---------+
> >   |             |    cold memory occupied by mmap and memset     |         |
> >   |             |  0G   440G  450G  460G  470G  480G  490G  500G |         |
> >   +-------------+------------------------------------------------+---------+
> >   | Execution time normalized to DRAM-only values                | GEOMEAN |
> >   +-------------+------------------------------------------------+---------+
> >   | DRAM-only   | 1.00   -     -     -     -     -     -     -   |  1.00   |
> >   | CXL-only    | 1.19   -     -     -     -     -     -     -   |  1.19   |
> >   | default     |  -    1.00  1.05  1.08  1.12  1.14  1.18  1.18 |  1.11   |
> >   | DAMON tiered|  -    1.03  1.03  1.03  1.03  1.03  1.07 *1.05 |  1.04   |
> >   | DAMON lazy  |  -    1.04  1.03  1.04  1.05  1.06  1.06 *1.06 |  1.05   |
> >   +-------------+------------------------------------------------+---------+
> >   | CXL usage of redis-server in GB                              | AVERAGE |
> >   +-------------+------------------------------------------------+---------+
> >   | DRAM-only   |  0.0   -     -     -     -     -     -     -   |   0.0   |
> >   | CXL-only    | 51.4   -     -     -     -     -     -     -   |  51.4   |
> >   | default     |  -     0.6  10.6  20.5  30.5  40.5  47.6  50.4 |  28.7   |
> >   | DAMON tiered|  -     0.6   0.5   0.4   0.7   0.8   7.1   5.6 |   2.2   |
> >   | DAMON lazy  |  -     0.5   3.0   4.5   5.4   6.4   9.4   9.1 |   5.5   |
> >   +-------------+------------------------------------------------+---------+
> >
> > Each test result is based on the exeuction environment as follows.
>
> Nit.  s/exeuction/execution/

Thanks.  Fixed it.

> [...]
>
> > In summary, the evaluation results show that DAMON memory management
> > with DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown
> > compared to the "default" memory policy from 11% to 3~5% when the
> > system runs with high memory pressure on its fast tier DRAM nodes.
> >
> > Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
> > tiered memory systems run more efficiently under high memory pressure.
>
> Thank you very much for continuing this great work.
>
> Other than trivial comments on documentation patches and the above typo,
> I have no particular concern on this patchset.  I'm looking forward to
> the next version.

I have addressed all your comments and sent v6.  Please have a look:

https://lore.kernel.org/20240614030010.751-1-honggyu@sk.com

Thanks very much for your review!

Thanks,
Honggyu

> Thanks,
> SJ
> [...]
[PATCH v6 7/7] Docs/damon: document damos_migrate_{hot,cold}
This patch adds a DAMON description of the "migrate_hot" and "migrate_cold"
actions to both the usage and design documents, along with the new
"target_nid" knob that sets the migration target node.

Signed-off-by: Honggyu Kim
---
 Documentation/admin-guide/mm/damon/usage.rst | 4 ++++
 Documentation/mm/damon/design.rst            | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index e58ceb89ea2a..98804e34448b 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -300,6 +300,10 @@ from the file and their meaning are same to those of the list on
 The ``apply_interval_us`` file is for setting and getting the scheme's
 :ref:`apply_interval ` in microseconds.
 
+The ``target_nid`` file is for setting the migration target node, which is
+only meaningful when the ``action`` is either ``migrate_hot`` or
+``migrate_cold``.
+
 .. _sysfs_access_pattern:
 
 schemes/<N>/access_pattern/
diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 3df387249937..3f12c884eb3a 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -325,6 +325,10 @@ that supports each action are as below.
   Supported by ``paddr`` operations set.
 - ``lru_deprio``: Deprioritize the region on its LRU lists.
   Supported by ``paddr`` operations set.
+- ``migrate_hot``: Migrate the regions prioritizing warmer regions.
+  Supported by ``paddr`` operations set.
+- ``migrate_cold``: Migrate the regions prioritizing colder regions.
+  Supported by ``paddr`` operations set.
 - ``stat``: Do nothing but count the statistics.
   Supported by all operations sets.
-- 
2.34.1
[PATCH v6 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
This patch introduces the DAMOS_MIGRATE_COLD action, which is similar to
DAMOS_PAGEOUT but migrates folios to the 'target_nid' given via sysfs
instead of swapping them out.  The 'target_nid' sysfs knob informs the
migration target node ID.

Here is an example usage of the 'migrate_cold' action:

  $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
  $ cat contexts/<N>/schemes/<N>/action
  migrate_cold
  $ echo 2 > contexts/<N>/schemes/<N>/target_nid
  $ echo commit > state
  $ numactl -p 0 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID             Node 0 Node 1 Node 2 Total
  --------------  ------ ------ ------ -----
  701 (hot_cold)     501      0    601  1101

Since there are some common routines with pageout, many functions share
similar logic between pageout and migrate cold.
damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list().

Signed-off-by: Honggyu Kim
Signed-off-by: Hyeongtak Ji
Signed-off-by: SeongJae Park
---
 include/linux/damon.h    |   2 +
 mm/damon/paddr.c         | 154 +++++++++++++++++++++++++++++++++++++++
 mm/damon/sysfs-schemes.c |   1 +
 3 files changed, 157 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 21d6b69a015c..56714b6eb0d7 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:	Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_COLD:	Migrate the regions prioritizing colder regions.
  * @DAMOS_STAT:		Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:	Total number of DAMOS actions
  *
@@ -122,6 +123,7 @@ enum damos_action {
 	DAMOS_NOHUGEPAGE,
 	DAMOS_LRU_PRIO,
 	DAMOS_LRU_DEPRIO,
+	DAMOS_MIGRATE_COLD,
 	DAMOS_STAT,		/* Do nothing but only record the stat */
 	NR_DAMOS_ACTIONS,
 };
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 18797c1b419b..882ae54af829 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -12,6 +12,9 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/memory-tiers.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -325,6 +328,153 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r,
 	return damon_pa_mark_accessed_or_deactivate(r, s, false);
 }
 
+static unsigned int __damon_pa_migrate_folio_list(
+		struct list_head *migrate_folios, struct pglist_data *pgdat,
+		int target_nid)
+{
+	unsigned int nr_succeeded;
+	nodemask_t allowed_mask = NODE_MASK_NONE;
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
+			__GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
+
+	if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
+		return 0;
+
+	if (list_empty(migrate_folios))
+		return 0;
+
+	/* Migration ignores all cpuset and mempolicy settings */
+	migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
+		      &nr_succeeded);
+
+	return nr_succeeded;
+}
+
+static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
+						struct pglist_data *pgdat,
+						int target_nid)
+{
+	unsigned int nr_migrated = 0;
+	struct folio *folio;
+	LIST_HEAD(ret_folios);
+	LIST_HEAD(migrate_folios);
+
+	while (!list_empty(folio_list)) {
+		struct folio *folio;
+
+		cond_resched();
+
+		folio = lru_to_folio(folio_list);
+		list_del(&folio->lru);
+
+		if (!folio_trylock(folio))
+			goto keep;
+
+		/* Relocate its contents to another node. */
+		list_add(&folio->lru, &migrate_folios);
+		folio_unlock(folio);
+		continue;
+keep:
+		list_add(&folio->lru, &ret_folios);
+	}
+	/* 'folio_list' is always empty here */
+
+	/* Migrate folios selected for migration */
+	nr_migrated += __damon_pa_migrate_folio_list(
+			&migrate_folios, pgdat, target_nid);
+	/*
+	 * Folios that could not be migrated are still in @migrate_folios.  Add
+	 * those back on @folio_list
+	 */
+	if (!list_empty(&migrate_folios))
+		list_splice_init(&migrate_folios, folio_list);
+
+	try_to_unmap_flush();
+
[PATCH v6 6/7] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion
From: Hyeongtak Ji

This patch introduces the DAMOS_MIGRATE_HOT action, which is similar to
DAMOS_MIGRATE_COLD but prioritizes hot pages.  It migrates pages inside
the given region to the 'target_nid' NUMA node given via sysfs.

Here is an example usage of the 'migrate_hot' action:

  $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
  $ cat contexts/<N>/schemes/<N>/action
  migrate_hot
  $ echo 0 > contexts/<N>/schemes/<N>/target_nid
  $ echo commit > state
  $ numactl -p 2 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID             Node 0 Node 1 Node 2 Total
  --------------  ------ ------ ------ -----
  701 (hot_cold)     501      0    601  1101

Signed-off-by: Hyeongtak Ji
Signed-off-by: Honggyu Kim
Signed-off-by: SeongJae Park
---
 include/linux/damon.h    | 2 ++
 mm/damon/paddr.c         | 3 +++
 mm/damon/sysfs-schemes.c | 1 +
 3 files changed, 6 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 56714b6eb0d7..3d62d98d6359 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:	Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:	Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_HOT:	Migrate the regions prioritizing warmer regions.
  * @DAMOS_MIGRATE_COLD:	Migrate the regions prioritizing colder regions.
  * @DAMOS_STAT:		Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:	Total number of DAMOS actions
@@ -123,6 +124,7 @@ enum damos_action {
 	DAMOS_NOHUGEPAGE,
 	DAMOS_LRU_PRIO,
 	DAMOS_LRU_DEPRIO,
+	DAMOS_MIGRATE_HOT,
 	DAMOS_MIGRATE_COLD,
 	DAMOS_STAT,		/* Do nothing but only record the stat */
 	NR_DAMOS_ACTIONS,
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 882ae54af829..af6aac388a43 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -486,6 +486,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx,
 		return damon_pa_mark_accessed(r, scheme);
 	case DAMOS_LRU_DEPRIO:
 		return damon_pa_deactivate_pages(r, scheme);
+	case DAMOS_MIGRATE_HOT:
 	case DAMOS_MIGRATE_COLD:
 		return damon_pa_migrate(r, scheme);
 	case DAMOS_STAT:
@@ -508,6 +509,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
 		return damon_hot_score(context, r, scheme);
 	case DAMOS_LRU_DEPRIO:
 		return damon_cold_score(context, r, scheme);
+	case DAMOS_MIGRATE_HOT:
+		return damon_hot_score(context, r, scheme);
 	case DAMOS_MIGRATE_COLD:
 		return damon_cold_score(context, r, scheme);
 	default:
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 880015d5b5ea..66fccfa776d7 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1458,6 +1458,7 @@ static const char * const damon_sysfs_damos_action_strs[] = {
 	"nohugepage",
 	"lru_prio",
 	"lru_deprio",
+	"migrate_hot",
 	"migrate_cold",
 	"stat",
 };
-- 
2.34.1
[PATCH v6 4/7] mm/migrate: add MR_DAMON to migrate_reason
The current patch series introduces DAMON based migration across NUMA
nodes, so it'd be better to have a new migrate_reason in trace events.

Signed-off-by: Honggyu Kim
Reviewed-by: SeongJae Park
Signed-off-by: SeongJae Park
---
 include/linux/migrate_mode.h   | 1 +
 include/trace/events/migrate.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..cec36b7e7ced 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
 	MR_CONTIG_RANGE,
 	MR_LONGTERM_PIN,
 	MR_DEMOTION,
+	MR_DAMON,
 	MR_TYPES
 };
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 0190ef725b43..cd01dd7b3640 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -22,7 +22,8 @@
 	EM( MR_NUMA_MISPLACED,	"numa_misplaced")		\
 	EM( MR_CONTIG_RANGE,	"contig_range")			\
 	EM( MR_LONGTERM_PIN,	"longterm_pin")			\
-	EMe(MR_DEMOTION,	"demotion")
+	EM( MR_DEMOTION,	"demotion")			\
+	EMe(MR_DAMON,		"damon")
 
 /*
  * First define the enums in the above macros to be exported to userspace
-- 
2.34.1
[PATCH v6 2/7] mm: rename alloc_demote_folio to alloc_migrate_folio
The alloc_demote_folio can also be used for general migration including
both demotion and promotion, so it'd be better to rename it from
alloc_demote_folio to alloc_migrate_folio.

Signed-off-by: Honggyu Kim
Reviewed-by: SeongJae Park
---
 mm/internal.h | 2 +-
 mm/vmscan.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b3ca996a4efc..9f967842f636 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1052,7 +1052,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
-struct folio *alloc_demote_folio(struct folio *src, unsigned long private);
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					   struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2f4406872f43..f5414b101909 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,7 +916,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-struct folio *alloc_demote_folio(struct folio *src, unsigned long private)
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
 {
 	struct folio *dst;
 	nodemask_t *allowed_mask;
@@ -979,7 +979,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 	node_get_allowed_targets(pgdat, &allowed_mask);
 
 	/* Demotion ignores all cpuset and mempolicy settings */
-	migrate_pages(demote_folios, alloc_demote_folio, NULL,
+	migrate_pages(demote_folios, alloc_migrate_folio, NULL,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
-- 
2.34.1
[PATCH v6 3/7] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
From: Hyeongtak Ji

This patch adds target_nid under

  /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/

The 'target_nid' can be used as the destination node for DAMOS actions
such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches.

Signed-off-by: Hyeongtak Ji
Signed-off-by: Honggyu Kim
Signed-off-by: SeongJae Park
---
 include/linux/damon.h    | 11 ++++++++++-
 mm/damon/core.c          |  5 ++++-
 mm/damon/dbgfs.c         |  2 +-
 mm/damon/lru_sort.c      |  3 ++-
 mm/damon/reclaim.c       |  3 ++-
 mm/damon/sysfs-schemes.c | 33 ++++++++++++++++++++++++++++++++-
 6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index f7da65e1ac04..21d6b69a015c 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -374,6 +374,7 @@ struct damos_access_pattern {
  * @apply_interval_us:	The time between applying the @action.
  * @quota:		Control the aggressiveness of this scheme.
  * @wmarks:		Watermarks for automated (in)activation of this scheme.
+ * @target_nid:		Destination node if @action is "migrate_{hot,cold}".
  * @filters:		Additional set of &struct damos_filter for &action.
  * @stat:		Statistics of this scheme.
  * @list:		List head for siblings.
@@ -389,6 +390,10 @@ struct damos_access_pattern {
  * monitoring context are inactive, DAMON stops monitoring either, and just
  * repeatedly checks the watermarks.
  *
+ * @target_nid is used to set the migration target node for migrate_hot or
+ * migrate_cold actions, which means it's only meaningful when @action is
+ * either "migrate_hot" or "migrate_cold".
+ *
  * Before applying the &action to a memory region, &struct damon_operations
  * implementation could check pages of the region and skip &action to respect
  * &filters
@@ -410,6 +415,9 @@ struct damos {
 /* public: */
 	struct damos_quota quota;
 	struct damos_watermarks wmarks;
+	union {
+		int target_nid;
+	};
 	struct list_head filters;
 	struct damos_stat stat;
 	struct list_head list;
@@ -726,7 +734,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 			enum damos_action action,
 			unsigned long apply_interval_us,
 			struct damos_quota *quota,
-			struct damos_watermarks *wmarks);
+			struct damos_watermarks *wmarks,
+			int target_nid);
 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
 void damon_destroy_scheme(struct damos *s);
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 6392f1cc97a3..c0ec5be4f56e 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -354,7 +354,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 			enum damos_action action,
 			unsigned long apply_interval_us,
 			struct damos_quota *quota,
-			struct damos_watermarks *wmarks)
+			struct damos_watermarks *wmarks,
+			int target_nid)
 {
 	struct damos *scheme;
@@ -381,6 +382,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern,
 	scheme->wmarks = *wmarks;
 	scheme->wmarks.activated = true;
 
+	scheme->target_nid = target_nid;
+
 	return scheme;
 }
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index 2461cfe2e968..51a6f1cac385 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -281,7 +281,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len,
 		pos += parsed;
 		scheme = damon_new_scheme(&pattern, action, 0, &quota,
-				&wmarks);
+				&wmarks, NUMA_NO_NODE);
 		if (!scheme)
 			goto fail;
diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 3de2916a65c3..3775f0f2743d 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme(
 			/* under the quota. */
 			&damon_lru_sort_quota,
 			/* (De)activate this according to the watermarks. */
-			&damon_lru_sort_wmarks);
+			&damon_lru_sort_wmarks,
+			NUMA_NO_NODE);
 }
 
 /* Create a DAMON-based operation scheme for hot memory regions */
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 9bd341d62b4c..a05ccb41749b 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -177,7 +177,8 @@ static struct damos *damon_reclaim_new_scheme(void)
 			/* under the quota. */
 			&damon_reclaim_quota,
 			/* (De)activate this according to the watermarks. */
-			&damon_reclaim_wmarks);
+			&damon_reclaim_wmarks,
+			NUMA_NO_NODE);
 }
 
 static void damon_reclaim_copy_quota_status(struct
[PATCH v6 1/7] mm: make alloc_demote_folio externally invokable for migration
The alloc_demote_folio can be used out of vmscan.c, so it'd be better to
remove the static keyword from it.

Signed-off-by: Honggyu Kim
Reviewed-by: SeongJae Park
Signed-off-by: SeongJae Park
---
 mm/internal.h | 1 +
 mm/vmscan.c   | 3 +--
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b2c75b12014e..b3ca996a4efc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1052,6 +1052,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
         unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+struct folio *alloc_demote_folio(struct folio *src, unsigned long private);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 					   struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..2f4406872f43 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,8 +916,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct folio *alloc_demote_folio(struct folio *src,
-		unsigned long private)
+struct folio *alloc_demote_folio(struct folio *src, unsigned long private)
 {
 	struct folio *dst;
 	nodemask_t *allowed_mask;
-- 
2.34.1
[PATCH v6 0/7] DAMON based tiered memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It said no implementation of the demote/promote DAMOS actions had been
made.  This patch series implements them for the physical address space
so that this scheme can be applied at the system-wide level.

Changes from v5:
https://lore.kernel.org/20240613132056.608-1-honggyu@sk.com
  1. Remove new actions from the usage document as it's for debugfs
  2. Apply minor fixes on the cover letter

Changes from RFC v4:
https://lore.kernel.org/20240512175447.75943-1...@kernel.org
  1. Add usage and design documents
  2. Rename alloc_demote_folio to alloc_migrate_folio
  3. Add evaluation results with "demotion_enabled" true
  4. Rebase based on v6.10-rc3

Changes from RFC v3:
https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com
  0. Updated from v3 and posted by SJ on behalf of Honggyu under his
     approval.
  1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode'
  2. Drop vmstat change
  3. Drop unnecessary page reference check

Changes from RFC v2:
https://lore.kernel.org/20240226140555.1615-1-honggyu@sk.com
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
     depending on node distance based information.
  3. Instead of having page level access check in this patch series,
     delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
https://lore.kernel.org/20240115045253.1775-1-honggyu@sk.com
  1. Move most of the implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c and use them in paddr.c, but they
     need to be reviewed more in depth.
  3. Refactor most functions for common usage for both promote and demote
     actions and introduce an enum migration_mode for its control.
  4. Add "target_nid" sysfs knob as the migration destination node for
     both promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move them even above
     DAMOS_STAT.

Introduction
============

With the advent of CXL/PCIe attached DRAM, which will be called simply as
CXL memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and bandwidth
characteristics.  They are usually handled as different NUMA nodes in
separate memory tiers, and CXL memory is used as a slow tier because of
its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on the memory access frequency.  Otherwise,
some frequently accessed pages might reside on slow tiers, which causes
unexpected performance degradation.  Moreover, the memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature.  The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) can be useful features for monitoring and migrating
pages.  DAMOS provides multiple actions based on DAMON monitoring
results, and it can be used for proactive reclaim, which means swapping
cold pages out with the DAMOS_PAGEOUT action, but it doesn't support
migration actions such as demotion and promotion between tiered memory
nodes.

This series supports two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
tiers.  This prevents hot pages from being stuck on slow tiers, which
would degrade performance, and cold pages can be proactively demoted to
slow tiers so that the system can increase the chance to allocate more
hot pages to fast tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 11% to 3~5% when the system
runs under high memory pressure on its fast tier DRAM nodes.

DAMON configuration
===================

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but some rough idea is better shared to explain the
evaluation results.

DAMON provides many knobs for fine tuning, but its configuration file is
generated by HMSDK[3].  It includes a gen_config.py script that generates
a json file with the full config of the DAMON knobs, and it creates
multiple kdamonds for each NUMA node when DAMON is enabled so that it can
run hot/cold based migration for tiered memory.

Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5].  We have measured two different workloads with zipfian and latest
Re: [PATCH v4 12/35] kmsan: Support SLAB_POISON
On Thu, 2024-06-13 at 16:30 -0700, SeongJae Park wrote:
> Hi Ilya,
> 
> On Thu, 13 Jun 2024 17:34:14 +0200 Ilya Leoshkevich wrote:
> 
> > Avoid false KMSAN negatives with SLUB_DEBUG by allowing
> > kmsan_slab_free() to poison the freed memory, and by preventing
> > init_object() from unpoisoning new allocations by using __memset().
> > 
> > There are two alternatives to this approach.  First, init_object()
> > can be marked with __no_sanitize_memory.  This annotation should be
> > used with great care, because it drops all instrumentation from the
> > function, and any shadow writes will be lost.  Even though this is
> > not a concern with the current init_object() implementation, this
> > may change in the future.
> > 
> > Second, kmsan_poison_memory() calls may be added after memset()
> > calls.  The downside is that init_object() is called from
> > free_debug_processing(), in which case poisoning will erase the
> > distinction between simply uninitialized memory and UAF.
> > 
> > Signed-off-by: Ilya Leoshkevich
> > ---
> >  mm/kmsan/hooks.c |  2 +-
> >  mm/slub.c        | 13 +++++++++----
> >  2 files changed, 10 insertions(+), 5 deletions(-)
> > 
> [...]
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1139,7 +1139,12 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
> >  	unsigned int poison_size = s->object_size;
> >  
> >  	if (s->flags & SLAB_RED_ZONE) {
> > -		memset(p - s->red_left_pad, val, s->red_left_pad);
> > +		/*
> > +		 * Use __memset() here and below in order to avoid overwriting
> > +		 * the KMSAN shadow.  Keeping the shadow makes it possible to
> > +		 * distinguish uninit-value from use-after-free.
> > +		 */
> > +		__memset(p - s->red_left_pad, val, s->red_left_pad);
> 
> I found my build test[1] fails with below error on latest mm-unstable
> branch.  'git bisect' points me to this patch.
> 
>       CC      mm/slub.o
>     /mm/slub.c: In function 'init_object':
>     /mm/slub.c:1147:17: error: implicit declaration of function
>     '__memset'; did you mean 'memset'?
>     [-Werror=implicit-function-declaration]
>      1147 |                 __memset(p - s->red_left_pad, val, s->red_left_pad);
>           |                 ^~~~
>           |                 memset
>     cc1: some warnings being treated as errors
> 
> I haven't looked in deep, but reporting first.  Do you have any idea?
> 
> [1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh
> 
> 
> Thanks,
> SJ
> 
> [...]

Thanks for the report.  Apparently not all architectures have __memset().

We should probably go back to memset_no_sanitize_memory() [1], but this
time mark it with noinline __maybe_unused __no_sanitize_memory, like it's
done in, e.g., 32/35.

Alexander, what do you think?

[1] https://lore.kernel.org/lkml/20231121220155.1217090-14-...@linux.ibm.com/
[PATCH v7 1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
From: "Steven Rostedt (Google)"

In order to allow requesting a memory region that can be used for things
like pstore on multiple machines where the memory layout is not the same,
add a new option to the kernel command line called "reserve_mem".

The format is:

  reserve_mem=nn:align:name

Where it will find nn amount of memory at the given alignment of align.
The name field is to allow another subsystem to retrieve where the memory
was found.  For example:

  reserve_mem=12M:4096:oops ramoops.mem_name=oops

Where ramoops.mem_name will tell ramoops that memory was reserved for it
via the reserve_mem option and it can find it by calling:

  if (reserve_mem_find_by_name("oops", &start, &size)) {
	// start holds the start address and size holds the size given

This is typically used for systems that do not wipe the RAM, and this
command line will try to reserve the same physical memory on soft
reboots.  Note, it is not guaranteed to be the same location.  For
example, if KASLR places the kernel at the location of the RAM
reservation from a previous boot, the new reservation will be at a
different location.  Any subsystem using this feature must add a way to
verify that the contents of the physical memory is from a previous boot,
as there may be cases where the memory will not be located at the same
location.

Not all systems may work either.  There could be bit flips if the reboot
goes through the BIOS.  Using kexec to reboot the machine is likely to
have better results in such cases.

Link: https://lore.kernel.org/all/zjjvnzux3nzig...@kernel.org/

Suggested-by: Mike Rapoport
Tested-by: Guilherme G. Piccoli
Signed-off-by: Steven Rostedt (Google)
---
 .../admin-guide/kernel-parameters.txt |  22 ++++
 include/linux/mm.h                    |   2 +
 mm/memblock.c                         | 117 ++++++++++++++++++
 3 files changed, 141 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b600df82669d..56e18b1a520d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5710,6 +5710,28 @@
 			them.  If <base> is less than 0x10000, the region
 			is assumed to be I/O ports; otherwise it is memory.
 
+	reserve_mem=	[RAM]
+			Format: nn[KMG]:<align>:<name>
+			Reserve physical memory and label it with a name that
+			other subsystems can use to access it.  This is
+			typically used for systems that do not wipe the RAM,
+			and this command line will try to reserve the same
+			physical memory on soft reboots.  Note, it is not
+			guaranteed to be the same location.  For example, if
+			anything about the system changes or if booting a
+			different kernel.  It can also fail if KASLR places
+			the kernel at the location of the RAM reservation
+			from a previous boot; the new reservation will then
+			be at a different location.
+			Any subsystem using this feature must add a way to
+			verify that the contents of the physical memory is
+			from a previous boot, as there may be cases where the
+			memory will not be located at the same location.
+
+			The format is size:align:label, for example, to
+			request 12 megabytes of 4096 alignment for ramoops:
+
+			reserve_mem=12M:4096:oops ramoops.mem_name=oops
+
 	reservetop=	[X86-32,EARLY]
 			Format: nn[KMG]
 			Reserves a hole at the top of the kernel virtual
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..077fb589b88a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4263,4 +4263,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
 void vma_pgtable_walk_begin(struct vm_area_struct *vma);
 void vma_pgtable_walk_end(struct vm_area_struct *vma);
 
+int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index d09136e040d3..b7b0e8c3868d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2244,6 +2244,123 @@ void __init memblock_free_all(void)
 	totalram_pages_add(pages);
 }
 
+/* Keep a table to reserve named memory */
+#define RESERVE_MEM_MAX_ENTRIES	8
+#define RESERVE_MEM_NAME_SIZE	16
+struct reserve_mem_table {
+	char			name[RESERVE_MEM_NAME_SIZE];
+	phys_addr_t		start;
+	phys_addr_t		size;
+};
+static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES];
+static int reserved_mem_count;
+
+/* Add wildcard
[PATCH v7 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
From: "Steven Rostedt (Google)" Add a method to find a region specified by reserve_mem=nn:align:name for ramoops. Adding a kernel command line parameter: reserve_mem=12M:4096:oops ramoops.mem_name=oops Will use the size and location defined by the reserve_mem parameter where it finds the memory and labels it "oops". The "oops" in the ramoops option is used to search for it. This allows for arbitrary RAM to be used for ramoops if it is known that the memory is not cleared on kernel crashes or soft reboots. Tested-by: Guilherme G. Piccoli Signed-off-by: Steven Rostedt (Google) --- Documentation/admin-guide/ramoops.rst | 13 + fs/pstore/ram.c | 14 ++ 2 files changed, 27 insertions(+) diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index e9f85142182d..2eabef31220d 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -23,6 +23,8 @@ and type of the memory area are set using three variables: * ``mem_size`` for the size. The memory size will be rounded down to a power of two. * ``mem_type`` to specify the memory type (default is pgprot_writecombine). + * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command +line parameter. Typically the default value of ``mem_type=0`` should be used as that sets the pstore mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use @@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners: return ret; } + D. Using a region of memory reserved via ``reserve_mem`` command line +parameter. The address and size will be defined by the ``reserve_mem`` +parameter. Note that ``reserve_mem`` may not always allocate memory +in the same location, and cannot be relied upon. Testing will need +to be done, and it may not work on every machine, nor every kernel. +Consider this a "best effort" approach. The ``reserve_mem`` option +takes a size, alignment and name as arguments. 
The name is used +to map the memory to a label that can be retrieved by ramoops. + + reserve_mem=2M:4096:oops ramoops.mem_name=oops + You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() very early in the architecture code, e.g.:: diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index b1a455f42e93..4311fcbc84f2 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -50,6 +50,10 @@ module_param_hw(mem_address, ullong, other, 0400); MODULE_PARM_DESC(mem_address, "start of reserved RAM used to store oops/panic logs"); +static char *mem_name; +module_param_named(mem_name, mem_name, charp, 0400); +MODULE_PARM_DESC(mem_name, "name of kernel param that holds addr"); + static ulong mem_size; module_param(mem_size, ulong, 0400); MODULE_PARM_DESC(mem_size, @@ -914,6 +918,16 @@ static void __init ramoops_register_dummy(void) { struct ramoops_platform_data pdata; + if (mem_name) { + phys_addr_t start; + phys_addr_t size; + + if (reserve_mem_find_by_name(mem_name, &start, &size)) { + mem_address = start; + mem_size = size; + } + } + /* * Prepare a dummy platform data structure to carry the module * parameters. If mem_size isn't set, then there are no module -- 2.43.0
[PATCH v7 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
Reserve unspecified location of physical memory from kernel command line Background: In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract dmesg output and some other information when a crash happens in the field. (This is only done when the user selects "Allow Google to collect data for improving the system".) But there are cases when there's a bug that requires more data to be retrieved to figure out what is happening. We would like to increase the pstore size, either temporarily, or maybe even permanently. The pstore on these devices is at a fixed location in RAM (as the RAM is not cleared on soft reboots or crashes). The location is chosen by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. There's a driver that queries for this to initialize the pstore for ChromeOS: see drivers/platform/chrome/chromeos_pstore.c Problem: The problem is that, even though there's a process to change the kernel on these systems, and it is done regularly to install updates, the firmware is updated much less frequently. Choosing the place in RAM also takes special care, and it may be at a different address for different boards. Updating the size via firmware is a large effort and not something that many are willing to do for a temporary pstore size change. Requirement: Need a way to reserve memory that will be at a consistent location for every boot, if the kernel and system are the same. Does not need to work if rebooting to a different kernel, or if the system can change the memory layout between boots. The reserved memory cannot be a hard-coded address, as the same kernel / command line needs to run on several different machines. The picked memory reservation just needs to be the same for a given machine, but may be different for different machines. Solution: The solution I have come up with is to introduce a new "reserve_mem=" kernel command line parameter. 
This parameter takes the following format: reserve_mem=nn:align:label where nn is the size of memory to reserve, align is the alignment of that memory, and label is the way for other sub-systems to find that memory. This way the kernel command line could have: reserve_mem=12M:4096:oops ramoops.mem_name=oops At boot up, the kernel will search for 12 megabytes in usable memory regions with an alignment of 4096. It will start at the highest regions and work its way down (for those old devices that want access to lower address DMA). When it finds a region, it will save it off in a small table and mark it with the "oops" label. Then the pstore ramoops sub-system could ask for that memory and its location, and it will map itself there. This prototype allows for 8 different mappings (which may be overkill; 4 is probably plenty) with a 16-byte buffer to store the label. I have tested this and it works for us to solve the above problem. We can update the kernel and command line and increase the size of pstore without needing to update the firmware, or knowing every memory layout of each board. I have only tested this locally; it has not been tested in the field. Changes since v6: https://lore.kernel.org/all/20240613155506.811013...@goodmis.org/ - Fixed typo of s/reserver/reserve/ Changes since v5: https://lore.kernel.org/all/20240613003435.401549...@goodmis.org/ - Stressed more that this is a best effort use case - Updated ramoops.rst to document this new feature - Used a new variable "tmp" in reserve_mem_find_by_name() instead of using "size" and possibly corrupting it. Changes since v4: https://lore.kernel.org/all/20240611215610.548854...@goodmis.org/ - Add all checks about reserve_mem before allocation. This means reserved_mem_add() is now a void function. - Check for name duplications. 
- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=") Changes since v3: https://lore.kernel.org/all/20240611144911.327227...@goodmis.org/ - Changed table type of start and size from unsigned long to phys_addr_t (as well as the parameters to the functions that use them) - Changed old reference to "early_reserve_mem" to "reserve_mem" - Check before reserving memory: o size is non-zero o name has text in it - If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES - Remove the silly check of testing *p == '\0' after a p += strlen(p) Changes since v2: https://lore.kernel.org/all/20240606150143.876469...@goodmis.org/ - Fixed typo of "reserver" - Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name() - Removed "built-in" from module description that was changed from v1. Changes since v1: https://lore.kernel.org/all/2024060320.801075...@goodmis.org/ - Updated the change log of the first patch as well as added an entry into kernel-parameters.txt about how reserve_mem is for soft reboots and may not be reliable. Steven Rostedt (Google) (2): mm/memblock: Add "reserve_mem" to reserved named memory at boot up pstore/ramoops: Add ramoops.mem_name= command line option
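The validation steps listed in the changelog above (size is non-zero, the name has text, no duplicate labels, at most RESERVE_MEM_MAX_ENTRIES entries) can be folded into one user-space sketch. In the actual patch these checks happen before reserved_mem_add() is called and that function is void; here the checks live inside the function and it returns 1/0 so they can be exercised directly, and the phys_addr_t typedef is a stand-in.

```c
#include <string.h>

typedef unsigned long long phys_addr_t;

#define RESERVE_MEM_MAX_ENTRIES 8
#define RESERVE_MEM_NAME_SIZE   16

struct reserve_mem_table {
	char        name[RESERVE_MEM_NAME_SIZE];
	phys_addr_t start;
	phys_addr_t size;
};

static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES];
static int reserved_mem_count;

/* Hypothetical model: register a named region, refusing the cases the
 * kernel's command-line parsing rejects before calling reserved_mem_add(). */
int reserved_mem_add(phys_addr_t start, phys_addr_t size, const char *name)
{
	struct reserve_mem_table *map;
	int i;

	/* size must be non-zero and the label must have text and fit. */
	if (!size || !name[0] || strlen(name) >= RESERVE_MEM_NAME_SIZE)
		return 0;

	/* the table only holds RESERVE_MEM_MAX_ENTRIES mappings. */
	if (reserved_mem_count >= RESERVE_MEM_MAX_ENTRIES)
		return 0;

	/* reject duplicate labels (the v5 "name duplications" check). */
	for (i = 0; i < reserved_mem_count; i++)
		if (strcmp(reserved_mem_table[i].name, name) == 0)
			return 0;

	map = &reserved_mem_table[reserved_mem_count++];
	strcpy(map->name, name);
	map->start = start;
	map->size = size;
	return 1;
}
```

A second "oops" registration or a zero-sized request is rejected without touching the table, which is the behavior the changelog describes the kernel enforcing up front.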
Re: [PATCH v4 12/35] kmsan: Support SLAB_POISON
Hi Ilya, On Thu, 13 Jun 2024 17:34:14 +0200 Ilya Leoshkevich wrote: > Avoid false KMSAN negatives with SLUB_DEBUG by allowing > kmsan_slab_free() to poison the freed memory, and by preventing > init_object() from unpoisoning new allocations by using __memset(). > > There are two alternatives to this approach. First, init_object() > can be marked with __no_sanitize_memory. This annotation should be used > with great care, because it drops all instrumentation from the > function, and any shadow writes will be lost. Even though this is not a > concern with the current init_object() implementation, this may change > in the future. > > Second, kmsan_poison_memory() calls may be added after memset() calls. > The downside is that init_object() is called from > free_debug_processing(), in which case poisoning will erase the > distinction between simply uninitialized memory and UAF. > > Signed-off-by: Ilya Leoshkevich > --- > mm/kmsan/hooks.c | 2 +- > mm/slub.c| 13 + > 2 files changed, 10 insertions(+), 5 deletions(-) > [...] > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1139,7 +1139,12 @@ static void init_object(struct kmem_cache *s, void > *object, u8 val) > unsigned int poison_size = s->object_size; > > if (s->flags & SLAB_RED_ZONE) { > - memset(p - s->red_left_pad, val, s->red_left_pad); > + /* > + * Use __memset() here and below in order to avoid overwriting > + * the KMSAN shadow. Keeping the shadow makes it possible to > + * distinguish uninit-value from use-after-free. > + */ > + __memset(p - s->red_left_pad, val, s->red_left_pad); I found my build test[1] fails with below error on latest mm-unstable branch. 'git bisect' points me this patch. CC mm/slub.o /mm/slub.c: In function 'init_object': /mm/slub.c:1147:17: error: implicit declaration of function '__memset'; did you mean 'memset'? 
[-Werror=implicit-function-declaration] 1147 | __memset(p - s->red_left_pad, val, s->red_left_pad); | ^~~~ | memset cc1: some warnings being treated as errors I haven't looked into it deeply, but am reporting it first. Do you have any idea? [1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh Thanks, SJ [...]
Re: [PATCH 3/8] riscv: ftrace: support fastcc in Clang for WITH_ARGS
On Thu, Jun 13, 2024 at 03:11:08PM +0800, Andy Chiu wrote: > Some caller-saved registers which are not defined as function arguments > in the ABI can still be passed as arguments when the kernel is compiled > with Clang. As a result, we must save and restore those registers to > prevent ftrace from clobbering them. > > - [1]: https://reviews.llvm.org/D68559 > Reported-by: Evgenii Shatokhin > Closes: > https://lore.kernel.org/linux-riscv/7e7c7914-445d-426d-89a0-59a9199c4...@yadro.com/ > Signed-off-by: Andy Chiu Acked-by: Nathan Chancellor > --- > arch/riscv/include/asm/ftrace.h | 7 +++ > arch/riscv/kernel/asm-offsets.c | 7 +++ > arch/riscv/kernel/mcount-dyn.S | 16 ++-- > 3 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h > index 9eb31a7ea0aa..5f81c53dbfd9 100644 > --- a/arch/riscv/include/asm/ftrace.h > +++ b/arch/riscv/include/asm/ftrace.h > @@ -144,6 +144,13 @@ struct ftrace_regs { > unsigned long a5; > unsigned long a6; > unsigned long a7; > +#ifdef CONFIG_CC_IS_CLANG > + unsigned long t2; > + unsigned long t3; > + unsigned long t4; > + unsigned long t5; > + unsigned long t6; > +#endif > }; > }; > }; > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c > index b09ca5f944f7..db5a26fcc9ae 100644 > --- a/arch/riscv/kernel/asm-offsets.c > +++ b/arch/riscv/kernel/asm-offsets.c > @@ -497,6 +497,13 @@ void asm_offsets(void) > DEFINE(FREGS_SP,offsetof(struct ftrace_regs, sp)); > DEFINE(FREGS_S0,offsetof(struct ftrace_regs, s0)); > DEFINE(FREGS_T1,offsetof(struct ftrace_regs, t1)); > +#ifdef CONFIG_CC_IS_CLANG > + DEFINE(FREGS_T2,offsetof(struct ftrace_regs, t2)); > + DEFINE(FREGS_T3,offsetof(struct ftrace_regs, t3)); > + DEFINE(FREGS_T4,offsetof(struct ftrace_regs, t4)); > + DEFINE(FREGS_T5,offsetof(struct ftrace_regs, t5)); > + DEFINE(FREGS_T6,offsetof(struct ftrace_regs, t6)); > +#endif > DEFINE(FREGS_A0,offsetof(struct ftrace_regs, a0)); > 
DEFINE(FREGS_A1,offsetof(struct ftrace_regs, a1)); > DEFINE(FREGS_A2,offsetof(struct ftrace_regs, a2)); > diff --git a/arch/riscv/kernel/mcount-dyn.S b/arch/riscv/kernel/mcount-dyn.S > index 745dd4c4a69c..e988bd26b28b 100644 > --- a/arch/riscv/kernel/mcount-dyn.S > +++ b/arch/riscv/kernel/mcount-dyn.S > @@ -96,7 +96,13 @@ > REG_S x8, FREGS_S0(sp) > #endif > REG_S x6, FREGS_T1(sp) > - > +#ifdef CONFIG_CC_IS_CLANG > + REG_S x7, FREGS_T2(sp) > + REG_S x28, FREGS_T3(sp) > + REG_S x29, FREGS_T4(sp) > + REG_S x30, FREGS_T5(sp) > + REG_S x31, FREGS_T6(sp) > +#endif > // save the arguments > REG_S x10, FREGS_A0(sp) > REG_S x11, FREGS_A1(sp) > @@ -115,7 +121,13 @@ > REG_L x8, FREGS_S0(sp) > #endif > REG_L x6, FREGS_T1(sp) > - > +#ifdef CONFIG_CC_IS_CLANG > + REG_L x7, FREGS_T2(sp) > + REG_L x28, FREGS_T3(sp) > + REG_L x29, FREGS_T4(sp) > + REG_L x30, FREGS_T5(sp) > + REG_L x31, FREGS_T6(sp) > +#endif > // restore the arguments > REG_L x10, FREGS_A0(sp) > REG_L x11, FREGS_A1(sp) > > -- > 2.43.0 >
Re: [PATCH 1/1] s390/virtio_ccw: fix config change notifications
On Thu, Jun 13, 2024 at 03:21:15PM +0200, Halil Pasic wrote: > On Wed, 12 Jun 2024 16:04:15 +0200 > Thomas Huth wrote: > > > On 11/06/2024 23.47, Halil Pasic wrote: > > > Commit e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > > > broke configuration change notifications for virtio-ccw by putting the > > > DMA address of *indicatorp directly into ccw->cda disregarding the fact > > > that if !!(vcdev->is_thinint) then the function > > > virtio_ccw_register_adapter_ind() will overwrite that ccw->cda value > > > with the address of the virtio_thinint_area so it can actually set up > > > the adapter interrupts via CCW_CMD_SET_IND_ADAPTER. Thus we end up > > > pointing to the wrong object for both CCW_CMD_SET_IND if setting up the > > > adapter interrupts fails, and for CCW_CMD_SET_CONF_IND regardless > > > whether it succeeds or fails. > > > > > > To fix this, let us save away the dma address of *indicatorp in a local > > > variable, and copy it to ccw->cda after the "vcdev->is_thinint" branch. > > > > > > Reported-by: Boqiao Fu > > > Reported-by: Sebastian Mitterle > > > Fixes: e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > > > Signed-off-by: Halil Pasic > > > --- > > > I know that checkpatch.pl complains about a missing 'Closes' tag. > > > Unfortunately I don't have an appropriate URL at hand. @Sebastian, > > > @Boqiao: do you have any suggestions? > > > > Closes: https://issues.redhat.com/browse/RHEL-39983 > > ? > > Yep! That is a public bug tracker bug. Qualifies! > @Vasily: Can you guys pick that one up when picking the patch? Sure, applied. Thanks!
Re: [PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
On 13/06/2024 15:42, Steven Rostedt wrote: > On Thu, 13 Jun 2024 15:19:50 -0300 > "Guilherme G. Piccoli" wrote: > >>> + >>> + reserver_mem=2M:4096:oops ramoops.mem_name=oops >>> + >> >> Likely this could be fixed on merge, to avoid another version, but... >> >> s/reserver_mem/reserve_mem > > That 'r' is my nemesis! Almost every time I type "reserve" I write it > as "reserver" and have to go back and delete it. :-p LOL, I thought the same! And I even wondered if you're not a vim user that makes use of 'r' for replacing text and double typed that - I do that all the time, with r and i. Anyway, thanks for fixing that. Cheers!
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 14:21:10 -0700 Andrew Morton wrote: > On Thu, 13 Jun 2024 16:10:12 -0400 Steven Rostedt wrote: > > > > And... I'm a bit surprised that forward declarations are allowed in C. > > > A billion years ago I used a C compiler which would use 16 bits for > > > an enum if the enumted values would fit in 16 bits. And it would use 32 > > > bits otherwise. So the enumerated values were *required* for the > > > compiler to be able to figure out the sizeof. But it was a billion > > > years ago. > > > > Well, I only looked at the one change in ftrace.h which has a > > "struct seq_file;" that is not used anywhere else in the file, so that > > one definitely can go. > > The risk is that something which includes ftrace.h is depending upon > the enum declaration. You mean forward struct declaration. And if so, good! it needs to be fixed ;-) -- Steve
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 16:10:12 -0400 Steven Rostedt wrote: > > And... I'm a bit surprised that forward declarations are allowed in C. > > A billion years ago I used a C compiler which would use 16 bits for > > an enum if the enumted values would fit in 16 bits. And it would use 32 > > bits otherwise. So the enumerated values were *required* for the > > compiler to be able to figure out the sizeof. But it was a billion > > years ago. > > Well, I only looked at the one change in ftrace.h which has a > "struct seq_file;" that is not used anywhere else in the file, so that > one definitely can go. The risk is that something which includes ftrace.h is depending upon the enum declaration.
Re: [PATCH] ARM: dts: qcom: motorola-falcon: add accelerometer, magnetometer
On Thu, Jun 13, 2024 at 09:48:26AM +0200, Konrad Dybcio wrote: > > > On 6/9/24 13:05, Stanislav Jakubek wrote: > > Add the accelerometer and magnetometer that are present on the Motorola > > Moto G (2013) device. > > > > Signed-off-by: Stanislav Jakubek > > --- > > [...] > > > +_i2c2 { > > Consider setting a clock-frequency = <> Hi Konrad, checking downstream [1], qcom,i2c-bus-freq for I2C2 is 10, which seems to be the default. Should I specify it anyway? Though, now that I've checked, it seems that I missed the clock-frequency for I2C3, for which qcom,i2c-bus-freq is 40... The currently supported HW on it seems to work fine though ¯\_(ツ)_/¯ [1] https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/arch/arm/boot/dts/msm8226.dtsi#L983 Stanislav > > With that: > > Reviewed-by: Konrad Dybcio > > Konrad
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 22:22:18 +0300 Alexey Dobriyan wrote: > g++ doesn't like forward enum declarations: > > error: use of enum ‘E’ without previous declaration > 64 | enum E; But we don't care about g++. Do we? I would make that a separate patch. > > Delete those which aren't used. > > Delete some unused/unnecessary forward struct declarations for a change. This is a clean up, but should have a better change log. Just something simple like: Delete unnecessary forward struct declarations. Thanks, -- Steve > > Signed-off-by: Alexey Dobriyan > --- > > fs/ramfs/inode.c |1 - > include/linux/console.h |2 -- > include/linux/device.h |3 --- > include/linux/ftrace.h |4 > include/linux/security.h |6 -- > include/linux/signal.h |2 -- > include/linux/syscalls.h |7 --- > include/linux/sysfs.h|2 -- > mm/internal.h|4 > mm/shmem.c |1 - > 10 files changed, 32 deletions(-) > > --- a/fs/ramfs/inode.c > +++ b/fs/ramfs/inode.c > @@ -51,7 +51,6 @@ struct ramfs_fs_info { > > #define RAMFS_DEFAULT_MODE 0755 > > -static const struct super_operations ramfs_ops; > static const struct inode_operations ramfs_dir_inode_operations; > > struct inode *ramfs_get_inode(struct super_block *sb, > --- a/include/linux/console.h > +++ b/include/linux/console.h > @@ -21,10 +21,8 @@ > #include > > struct vc_data; > -struct console_font_op; > struct console_font; > struct module; > -struct tty_struct; > struct notifier_block; > > enum con_scroll { > --- a/include/linux/device.h > +++ b/include/linux/device.h > @@ -36,10 +36,7 @@ > struct device; > struct device_private; > struct device_driver; > -struct driver_private; > struct module; > -struct class; > -struct subsys_private; > struct device_node; > struct fwnode_handle; > struct iommu_group; > --- a/include/linux/ftrace.h > +++ b/include/linux/ftrace.h > @@ -531,8 +531,6 @@ extern const void *ftrace_expected; > > void ftrace_bug(int err, struct dyn_ftrace *rec); > > -struct seq_file; > - > extern int ftrace_text_reserved(const void *start, const void 
*end); > > struct ftrace_ops *ftrace_ops_trampoline(unsigned long addr); > @@ -1147,8 +1145,6 @@ static inline void unpause_graph_tracing(void) { } > #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ > > #ifdef CONFIG_TRACING > -enum ftrace_dump_mode; > - > #define MAX_TRACER_SIZE 100 > extern char ftrace_dump_on_oops[]; > extern int ftrace_dump_on_oops_enabled(void); > --- a/include/linux/security.h > +++ b/include/linux/security.h > @@ -41,7 +41,6 @@ struct rlimit; > struct kernel_siginfo; > struct sembuf; > struct kern_ipc_perm; > -struct audit_context; > struct super_block; > struct inode; > struct dentry; > @@ -59,8 +58,6 @@ struct xfrm_sec_ctx; > struct mm_struct; > struct fs_context; > struct fs_parameter; > -enum fs_value_type; > -struct watch; > struct watch_notification; > struct lsm_ctx; > > @@ -183,8 +180,6 @@ struct sock; > struct sockaddr; > struct socket; > struct flowi_common; > -struct dst_entry; > -struct xfrm_selector; > struct xfrm_policy; > struct xfrm_state; > struct xfrm_user_sec_ctx; > @@ -219,7 +214,6 @@ extern unsigned long dac_mmap_min_addr; > #define LSM_PRLIMIT_WRITE 2 > > /* forward declares to avoid warnings */ > -struct sched_param; > struct request_sock; > > /* bprm->unsafe reasons */ > --- a/include/linux/signal.h > +++ b/include/linux/signal.h > @@ -274,8 +274,6 @@ static inline int valid_signal(unsigned long sig) > return sig <= _NSIG ? 
1 : 0; > } > > -struct timespec; > -struct pt_regs; > enum pid_type; > > extern int next_signal(struct sigpending *pending, sigset_t *mask); > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -11,8 +11,6 @@ > > struct __aio_sigset; > struct epoll_event; > -struct iattr; > -struct inode; > struct iocb; > struct io_event; > struct iovec; > @@ -20,14 +18,12 @@ struct __kernel_old_itimerval; > struct kexec_segment; > struct linux_dirent; > struct linux_dirent64; > -struct list_head; > struct mmap_arg_struct; > struct msgbuf; > struct user_msghdr; > struct mmsghdr; > struct msqid_ds; > struct new_utsname; > -struct nfsctl_arg; > struct __old_kernel_stat; > struct oldold_utsname; > struct old_utsname; > @@ -38,7 +34,6 @@ struct rusage; > struct sched_param; > struct sched_attr; > struct sel_arg_struct; > -struct semaphore; > struct sembuf; > struct shmid_ds; > struct sockaddr; > @@ -48,14 +43,12 @@ struct statfs; > struct statfs64; > struct statx; > struct sysinfo; > -struct timespec; > struct __kernel_old_timeval; > struct __kernel_timex; > struct timezone; > struct tms; > struct utimbuf; > struct mq_attr; > -struct compat_stat; > struct old_timeval32; > struct robust_list_head; > struct futex_waitv; > --- a/include/linux/sysfs.h > +++ b/include/linux/sysfs.h > @@ -23,9 +23,7 @@ > #include >
Re: [PATCH] ACPI: NFIT: add missing MODULE_DESCRIPTION() macro
On Thu, Jun 6, 2024 at 6:10 PM Dave Jiang wrote: > > > > On 6/3/24 6:30 AM, Jeff Johnson wrote: > > make allmodconfig && make W=1 C=1 reports: > > WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/acpi/nfit/nfit.o > > > > Add the missing invocation of the MODULE_DESCRIPTION() macro. > > > > Signed-off-by: Jeff Johnson > > Reviewed-by: Dave Jiang > > --- > > drivers/acpi/nfit/core.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > > index d4595d1985b1..e8520fb8af4f 100644 > > --- a/drivers/acpi/nfit/core.c > > +++ b/drivers/acpi/nfit/core.c > > @@ -3531,5 +3531,6 @@ static __exit void nfit_exit(void) > > > > module_init(nfit_init); > > module_exit(nfit_exit); > > +MODULE_DESCRIPTION("ACPI NVDIMM Firmware Interface Table (NFIT) module"); > > MODULE_LICENSE("GPL v2"); > > MODULE_AUTHOR("Intel Corporation"); > > > > --- Applied as 6.11 material, thanks!
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 13:04:20 -0700 Andrew Morton wrote: > On Thu, 13 Jun 2024 15:34:02 -0400 Steven Rostedt wrote: > > > On Thu, 13 Jun 2024 22:22:18 +0300 > > Alexey Dobriyan wrote: > > > > > g++ doesn't like forward enum declarations: > > > > > > error: use of enum ‘E’ without previous declaration > > > 64 | enum E; > > > > But we don't care about g++. Do we? > > It appears that g++ is a useful enum declaration detector. > > I'm curious to know how even the above warning was generated. Does g++ > work at all on Linux? > > > I would make that a separate patch. > > What are you referring to here? The enum change should be separate from the struct changes. > > > > > > > Delete those which aren't used. > > > > > > Delete some unused/unnecessary forward struct declarations for a change. > > > > This is a clean up, but should have a better change log. Just something > > simple like: > > > > Delete unnecessary forward struct declarations. > > Alexey specializes in cute changelogs. eh > > I do have a concern about the patch: has it been tested with all > possible Kconfigs? No. There may be some configs in which the forward > declaration is required. > > And... I'm a bit surprised that forward declarations are allowed in C. > A billion years ago I used a C compiler which would use 16 bits for > an enum if the enumerated values would fit in 16 bits. And it would use 32 > bits otherwise. So the enumerated values were *required* for the > compiler to be able to figure out the sizeof. But it was a billion > years ago. Well, I only looked at the one change in ftrace.h which has a "struct seq_file;" that is not used anywhere else in the file, so that one definitely can go. -- Steve
Re: [PATCH] linux++: delete some forward declarations
On Thu, 13 Jun 2024 15:34:02 -0400 Steven Rostedt wrote: > On Thu, 13 Jun 2024 22:22:18 +0300 > Alexey Dobriyan wrote: > > > g++ doesn't like forward enum declarations: > > > > error: use of enum ‘E’ without previous declaration > >64 | enum E; > > But we don't care about g++. Do we? It appears that g++ is a useful enum declaration detector. I'm curious to know how even the above warning was generated. Does g++ work at all on Linux? > I would make that a separate patch. What are you referring to here? > > > > Delete those which aren't used. > > > > Delete some unused/unnecessary forward struct declarations for a change. > > This is a clean up, but should have a better change log. Just something > simple like: > > Delete unnecessary forward struct declarations. Alexey specializes in cute changelogs. I do have a concern about the patch: has it been tested with all possible Kconfigs? No. There may be some configs in which the forward declaration is required. And... I'm a bit surprised that forward declarations are allowed in C. A billion years ago I used a C compiler which would use 16 bits for an enum if the enumerated values would fit in 16 bits. And it would use 32 bits otherwise. So the enumerated values were *required* for the compiler to be able to figure out the sizeof. But it was a billion years ago.
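The reason an unused "struct seq_file;" can simply be deleted, and the reason such declarations are cheap in the first place, is that C only needs the complete type where members are accessed or the size is taken; a pointer to an incomplete type compiles fine. A minimal illustration (the type names here are made up):

```c
#include <stddef.h>

/* Forward declaration: the layout of struct opaque is unknown in this
 * translation unit, just as "struct seq_file;" leaves it unknown in a
 * header that only passes pointers around. */
struct opaque;

struct holder {
	struct opaque *p;   /* a pointer member needs no layout */
};

/* sizeof(struct holder) is computable even though struct opaque is
 * incomplete, because only the pointer's size matters. */
size_t holder_size(void)
{
	return sizeof(struct holder);
}
```

Deleting a forward declaration is therefore safe exactly when no later line in the header mentions the type, which is the check Steve describes doing for ftrace.h. An enum is different: its size can depend on its enumerators (Andrew's 16-bit-compiler story), which is why C++ rejects a bare forward `enum E;`.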
[PATCH] linux++: delete some forward declarations
g++ doesn't like forward enum declarations: error: use of enum ‘E’ without previous declaration 64 | enum E; Delete those which aren't used. Delete some unused/unnecessary forward struct declarations for a change. Signed-off-by: Alexey Dobriyan --- fs/ramfs/inode.c |1 - include/linux/console.h |2 -- include/linux/device.h |3 --- include/linux/ftrace.h |4 include/linux/security.h |6 -- include/linux/signal.h |2 -- include/linux/syscalls.h |7 --- include/linux/sysfs.h|2 -- mm/internal.h|4 mm/shmem.c |1 - 10 files changed, 32 deletions(-) --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -51,7 +51,6 @@ struct ramfs_fs_info { #define RAMFS_DEFAULT_MODE 0755 -static const struct super_operations ramfs_ops; static const struct inode_operations ramfs_dir_inode_operations; struct inode *ramfs_get_inode(struct super_block *sb, --- a/include/linux/console.h +++ b/include/linux/console.h @@ -21,10 +21,8 @@ #include struct vc_data; -struct console_font_op; struct console_font; struct module; -struct tty_struct; struct notifier_block; enum con_scroll { --- a/include/linux/device.h +++ b/include/linux/device.h @@ -36,10 +36,7 @@ struct device; struct device_private; struct device_driver; -struct driver_private; struct module; -struct class; -struct subsys_private; struct device_node; struct fwnode_handle; struct iommu_group; --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -531,8 +531,6 @@ extern const void *ftrace_expected; void ftrace_bug(int err, struct dyn_ftrace *rec); -struct seq_file; - extern int ftrace_text_reserved(const void *start, const void *end); struct ftrace_ops *ftrace_ops_trampoline(unsigned long addr); @@ -1147,8 +1145,6 @@ static inline void unpause_graph_tracing(void) { } #endif /* CONFIG_FUNCTION_GRAPH_TRACER */ #ifdef CONFIG_TRACING -enum ftrace_dump_mode; - #define MAX_TRACER_SIZE100 extern char ftrace_dump_on_oops[]; extern int ftrace_dump_on_oops_enabled(void); --- a/include/linux/security.h +++ b/include/linux/security.h @@ -41,7 +41,6 @@ 
struct rlimit; struct kernel_siginfo; struct sembuf; struct kern_ipc_perm; -struct audit_context; struct super_block; struct inode; struct dentry; @@ -59,8 +58,6 @@ struct xfrm_sec_ctx; struct mm_struct; struct fs_context; struct fs_parameter; -enum fs_value_type; -struct watch; struct watch_notification; struct lsm_ctx; @@ -183,8 +180,6 @@ struct sock; struct sockaddr; struct socket; struct flowi_common; -struct dst_entry; -struct xfrm_selector; struct xfrm_policy; struct xfrm_state; struct xfrm_user_sec_ctx; @@ -219,7 +214,6 @@ extern unsigned long dac_mmap_min_addr; #define LSM_PRLIMIT_WRITE 2 /* forward declares to avoid warnings */ -struct sched_param; struct request_sock; /* bprm->unsafe reasons */ --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -274,8 +274,6 @@ static inline int valid_signal(unsigned long sig) return sig <= _NSIG ? 1 : 0; } -struct timespec; -struct pt_regs; enum pid_type; extern int next_signal(struct sigpending *pending, sigset_t *mask); --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -11,8 +11,6 @@ struct __aio_sigset; struct epoll_event; -struct iattr; -struct inode; struct iocb; struct io_event; struct iovec; @@ -20,14 +18,12 @@ struct __kernel_old_itimerval; struct kexec_segment; struct linux_dirent; struct linux_dirent64; -struct list_head; struct mmap_arg_struct; struct msgbuf; struct user_msghdr; struct mmsghdr; struct msqid_ds; struct new_utsname; -struct nfsctl_arg; struct __old_kernel_stat; struct oldold_utsname; struct old_utsname; @@ -38,7 +34,6 @@ struct rusage; struct sched_param; struct sched_attr; struct sel_arg_struct; -struct semaphore; struct sembuf; struct shmid_ds; struct sockaddr; @@ -48,14 +43,12 @@ struct statfs; struct statfs64; struct statx; struct sysinfo; -struct timespec; struct __kernel_old_timeval; struct __kernel_timex; struct timezone; struct tms; struct utimbuf; struct mq_attr; -struct compat_stat; struct old_timeval32; struct robust_list_head; struct futex_waitv; --- 
a/include/linux/sysfs.h +++ b/include/linux/sysfs.h @@ -23,9 +23,7 @@ #include struct kobject; -struct module; struct bin_attribute; -enum kobj_ns_type; struct attribute { const char *name; --- a/mm/internal.h +++ b/mm/internal.h @@ -1095,10 +1095,6 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, /* Flags that allow allocations below the min watermark. */ #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM) -enum ttu_flags; -struct tlbflush_unmap_batch; - - /* * only for MM internal work items which do not depend on * any allocations or locks which might depend on allocations --- a/mm/shmem.c +++ b/mm/shmem.c @@ -261,7 +261,6 @@
Re: [PATCH 4/8] riscv: ftrace: align patchable functions to 4 Byte boundary
Hi Andy, On Thu, Jun 13, 2024 at 03:11:09PM +0800, Andy Chiu wrote: > We are changing ftrace code patching in order to remove the dependency on > stop_machine() and enable kernel preemption. This requires us to align > function entries at a 4-byte aligned address. > > However, -falign-functions on older versions of GCC alone was not strong > enough to align all functions. In fact, cold functions are not aligned > after turning on optimizations. We consider this a bug in GCC and > turn off guess-branch-probability as a workaround to align all functions. > > GCC bug id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 > > The option -fmin-function-alignment is able to align all functions > properly on newer versions of gcc. So, we add a cc-option to test if > the toolchain supports it. > > Suggested-by: Evgenii Shatokhin > Signed-off-by: Andy Chiu > --- > arch/riscv/Kconfig | 1 + > arch/riscv/Makefile | 7 ++- > 2 files changed, 7 insertions(+), 1 deletion(-) > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > index b94176e25be1..80b8d48e1e46 100644 > --- a/arch/riscv/Kconfig > +++ b/arch/riscv/Kconfig > @@ -203,6 +203,7 @@ config CLANG_SUPPORTS_DYNAMIC_FTRACE > config GCC_SUPPORTS_DYNAMIC_FTRACE > def_bool CC_IS_GCC > depends on $(cc-option,-fpatchable-function-entry=8) > + depends on $(cc-option,-fmin-function-alignment=4) || !RISCV_ISA_C Please use CC_HAS_MIN_FUNCTION_ALIGNMENT (from arch/Kconfig), which already checks for support for this option. 
> config HAVE_SHADOW_CALL_STACK > def_bool $(cc-option,-fsanitize=shadow-call-stack) > diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile > index 06de9d365088..74628ad8dcf8 100644 > --- a/arch/riscv/Makefile > +++ b/arch/riscv/Makefile > @@ -14,8 +14,13 @@ endif > ifeq ($(CONFIG_DYNAMIC_FTRACE),y) > LDFLAGS_vmlinux += --no-relax > KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY > +ifeq ($(CONFIG_CC_IS_CLANG),y) Same here, please invert this and use ifdef CONFIG_CC_HAS_MIN_FUNCTION_ALIGNMENT like the main Makefile does. > + cflags_ftrace_align := -falign-functions=4 > +else > + cflags_ftrace_align := -fmin-function-alignment=4 > +endif > ifeq ($(CONFIG_RISCV_ISA_C),y) > - CC_FLAGS_FTRACE := -fpatchable-function-entry=4 > + CC_FLAGS_FTRACE := -fpatchable-function-entry=4 $(cflags_ftrace_align) > else > CC_FLAGS_FTRACE := -fpatchable-function-entry=2 > endif > > -- > 2.43.0 > >
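For illustration, the flag selection being discussed (with the reviewer's suggested CONFIG_CC_HAS_MIN_FUNCTION_ALIGNMENT check in place of CC_IS_CLANG) can be modeled outside the Makefile. This is only a sketch; the helper name riscv_ftrace_cflags is made up for this example and is not kernel code:

```python
def riscv_ftrace_cflags(cc_has_min_function_alignment: bool, isa_c: bool) -> str:
    """Model of the proposed arch/riscv/Makefile flag selection (illustrative)."""
    # Newer compilers align every function, including cold ones, with
    # -fmin-function-alignment; older ones fall back to -falign-functions,
    # which is known not to be strong enough on its own (GCC PR 88345).
    if cc_has_min_function_alignment:
        align = "-fmin-function-alignment=4"
    else:
        align = "-falign-functions=4"
    # With compressed instructions (RISCV_ISA_C) the patchable entry needs
    # the alignment flag appended; without RVC the 2-slot entry suffices.
    if isa_c:
        return "-fpatchable-function-entry=4 " + align
    return "-fpatchable-function-entry=2"
```

Keying on the compiler capability rather than on CC_IS_CLANG, as suggested, means a sufficiently new GCC also gets -fmin-function-alignment.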
[PATCH v4 24/35] s390/cpacf: Unpoison the results of cpacf_trng()
Prevent KMSAN from complaining about buffers filled by cpacf_trng() being uninitialized. Tested-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/cpacf.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/s390/include/asm/cpacf.h b/arch/s390/include/asm/cpacf.h index c786538e397c..dae8843b164f 100644 --- a/arch/s390/include/asm/cpacf.h +++ b/arch/s390/include/asm/cpacf.h @@ -12,6 +12,7 @@ #define _ASM_S390_CPACF_H #include +#include /* * Instruction opcodes for the CPACF instructions @@ -542,6 +543,8 @@ static inline void cpacf_trng(u8 *ucbuf, unsigned long ucbuf_len, : [ucbuf] "+" (u.pair), [cbuf] "+" (c.pair) : [fc] "K" (CPACF_PRNO_TRNG), [opc] "i" (CPACF_PRNO) : "cc", "memory", "0"); + kmsan_unpoison_memory(ucbuf, ucbuf_len); + kmsan_unpoison_memory(cbuf, cbuf_len); } /** -- 2.45.1
[PATCH v4 35/35] kmsan: Enable on s390
Now that everything else is in place, enable KMSAN in Kconfig. Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c59d2b54df49..3cba4993d7c7 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -158,6 +158,7 @@ config S390 select HAVE_ARCH_KASAN select HAVE_ARCH_KASAN_VMALLOC select HAVE_ARCH_KCSAN + select HAVE_ARCH_KMSAN select HAVE_ARCH_KFENCE select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET select HAVE_ARCH_SECCOMP_FILTER -- 2.45.1
[PATCH v4 29/35] s390/mm: Define KMSAN metadata for vmalloc and modules
The pages for the KMSAN metadata associated with most kernel mappings are taken from memblock by the common code. However, vmalloc and module metadata needs to be defined by the architectures. Be a little bit more careful than x86: allocate exactly MODULES_LEN for the module shadow and origins, and then take 2/3 of vmalloc for the vmalloc shadow and origins. This ensures that users passing small vmalloc= values on the command line do not cause module metadata collisions. Reviewed-by: Alexander Potapenko Acked-by: Alexander Gordeev Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/startup.c| 7 +++ arch/s390/include/asm/pgtable.h | 8 2 files changed, 15 insertions(+) diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c index 182aac6a0f77..93775142322d 100644 --- a/arch/s390/boot/startup.c +++ b/arch/s390/boot/startup.c @@ -301,11 +301,18 @@ static unsigned long setup_kernel_memory_layout(unsigned long kernel_size) MODULES_END = round_down(kernel_start, _SEGMENT_SIZE); MODULES_VADDR = MODULES_END - MODULES_LEN; VMALLOC_END = MODULES_VADDR; + if (IS_ENABLED(CONFIG_KMSAN)) + VMALLOC_END -= MODULES_LEN * 2; /* allow vmalloc area to occupy up to about 1/2 of the rest virtual space left */ vsize = (VMALLOC_END - FIXMAP_SIZE) / 2; vsize = round_down(vsize, _SEGMENT_SIZE); vmalloc_size = min(vmalloc_size, vsize); + if (IS_ENABLED(CONFIG_KMSAN)) { + /* take 2/3 of vmalloc area for KMSAN shadow and origins */ + vmalloc_size = round_down(vmalloc_size / 3, _SEGMENT_SIZE); + VMALLOC_END -= vmalloc_size * 2; + } VMALLOC_START = VMALLOC_END - vmalloc_size; __memcpy_real_area = round_down(VMALLOC_START - MEMCPY_REAL_SIZE, PAGE_SIZE); diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 70b6ee557eb2..2f44c23efec0 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -107,6 +107,14 @@ static inline int is_module_addr(void *addr) return 1; } +#ifdef CONFIG_KMSAN +#define 
KMSAN_VMALLOC_SIZE (VMALLOC_END - VMALLOC_START) +#define KMSAN_VMALLOC_SHADOW_START VMALLOC_END +#define KMSAN_VMALLOC_ORIGIN_START (KMSAN_VMALLOC_SHADOW_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_MODULES_SHADOW_START (KMSAN_VMALLOC_ORIGIN_START + KMSAN_VMALLOC_SIZE) +#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN) +#endif + #ifdef CONFIG_RANDOMIZE_BASE #define KASLR_LEN (1UL << 31) #else -- 2.45.1
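The carve-out arithmetic above can be sanity-checked with a few lines of Python. This is an illustrative model with made-up sizes — SEG, MODULES_LEN, the base address and the initial vmalloc= size are assumptions, not the real s390 values, and the FIXMAP/vsize clamping step is omitted:

```python
def round_down(x, a):        # kernel-style round_down for power-of-two a
    return x & ~(a - 1)

SEG = 1 << 20                # assumed _SEGMENT_SIZE (1 MiB)
MODULES_LEN = 512 << 20      # assumed module area size
MODULES_VADDR = 1 << 40      # arbitrary demo base; VMALLOC_END starts here
vmalloc_size = 128 << 30     # assumed requested vmalloc= size

VMALLOC_END = MODULES_VADDR
VMALLOC_END -= MODULES_LEN * 2                      # module shadow + origins
vmalloc_size = round_down(vmalloc_size // 3, SEG)   # 2/3 go to KMSAN metadata
VMALLOC_END -= vmalloc_size * 2                     # vmalloc shadow + origins
VMALLOC_START = VMALLOC_END - vmalloc_size

# The macros from the patch:
KMSAN_VMALLOC_SIZE = VMALLOC_END - VMALLOC_START
KMSAN_VMALLOC_SHADOW_START = VMALLOC_END
KMSAN_VMALLOC_ORIGIN_START = KMSAN_VMALLOC_SHADOW_START + KMSAN_VMALLOC_SIZE
KMSAN_MODULES_SHADOW_START = KMSAN_VMALLOC_ORIGIN_START + KMSAN_VMALLOC_SIZE
KMSAN_MODULES_ORIGIN_START = KMSAN_MODULES_SHADOW_START + MODULES_LEN

# The metadata regions exactly fill the space carved out below MODULES_VADDR,
# so a small vmalloc= value cannot collide with the module metadata.
assert KMSAN_MODULES_ORIGIN_START + MODULES_LEN == MODULES_VADDR
```

The final assert is the point of the patch description: because exactly MODULES_LEN is allocated for each module metadata region, the module shadow/origins sit at fixed offsets regardless of how small vmalloc= is.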
Re: [PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
On Thu, 13 Jun 2024 15:19:50 -0300 "Guilherme G. Piccoli" wrote: > > + > > + reserver_mem=2M:4096:oops ramoops.mem_name=oops > > + > > Likely this could be fixed on merge, to avoid another version, but... > > s/reserver_mem/reserve_mem That 'r' is my nemesis! Almost every time I type "reserve" I write it as "reserver" and have to go back and delete it. :-p Just to make patchwork work nicely, I'll send a v7. But I'll wait for other comments or acks and reviews. -- Steve
[PATCH v4 05/35] kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces
Comparing pointers with TASK_SIZE does not make sense when kernel and userspace overlap. Skip the comparison when this is the case. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/instrumentation.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index 470b0b4afcc4..8a1bbbc723ab 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -20,7 +20,8 @@ static inline bool is_bad_asm_addr(void *addr, uintptr_t size, bool is_store) { - if ((u64)addr < TASK_SIZE) + if (IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) && + (u64)addr < TASK_SIZE) return true; if (!kmsan_get_metadata(addr, KMSAN_META_SHADOW)) return true; -- 2.45.1
[PATCH v4 21/35] s390: Use a larger stack for KMSAN
Adjust the stack size for the KMSAN-enabled kernel like it was done for the KASAN-enabled one in commit 7fef92ccadd7 ("s390/kasan: double the stack size"). Both tools have similar requirements. Reviewed-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/Makefile | 2 +- arch/s390/include/asm/thread_info.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/s390/Makefile b/arch/s390/Makefile index f2b21c7a70ef..7fd57398221e 100644 --- a/arch/s390/Makefile +++ b/arch/s390/Makefile @@ -36,7 +36,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds) UTS_MACHINE:= s390x -STACK_SIZE := $(if $(CONFIG_KASAN),65536,16384) +STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384)) CHECKFLAGS += -D__s390__ -D__s390x__ export LD_BFD diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h index a674c7d25da5..d02a709717b8 100644 --- a/arch/s390/include/asm/thread_info.h +++ b/arch/s390/include/asm/thread_info.h @@ -16,7 +16,7 @@ /* * General size of kernel stacks */ -#ifdef CONFIG_KASAN +#if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN) #define THREAD_SIZE_ORDER 4 #else #define THREAD_SIZE_ORDER 2 -- 2.45.1
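The Makefile's STACK_SIZE and thread_info's THREAD_SIZE_ORDER must agree (THREAD_SIZE is PAGE_SIZE << THREAD_SIZE_ORDER). A quick model of the two hunks above, checking their consistency for every KASAN/KMSAN combination — the function names here are invented for the sketch:

```python
PAGE_SIZE = 4096  # s390 uses 4 KiB pages

def thread_size_order(kasan: bool, kmsan: bool) -> int:
    # Mirrors: #if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN) -> 4, else 2
    return 4 if (kasan or kmsan) else 2

def stack_size(kasan: bool, kmsan: bool) -> int:
    # Mirrors: $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384))
    return 65536 if (kasan or kmsan) else 16384

# Both definitions describe the same stack size in every configuration.
for kasan in (False, True):
    for kmsan in (False, True):
        assert PAGE_SIZE << thread_size_order(kasan, kmsan) == stack_size(kasan, kmsan)
```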
[PATCH v4 08/35] kmsan: Remove an x86-specific #include from kmsan.h
Replace the x86-specific asm/pgtable_64_types.h #include with the linux/pgtable.h one, which all architectures have. While at it, sort the headers alphabetically for the sake of consistency with other KMSAN code. Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core") Suggested-by: Heiko Carstens Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/kmsan.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index a14744205435..adf443bcffe8 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -10,14 +10,14 @@ #ifndef __MM_KMSAN_KMSAN_H #define __MM_KMSAN_KMSAN_H -#include #include +#include +#include +#include +#include #include #include #include -#include -#include -#include #define KMSAN_ALLOCA_MAGIC_ORIGIN 0xabcd0100 #define KMSAN_CHAIN_MAGIC_ORIGIN 0xabcd0200 -- 2.45.1
[PATCH v2 07/14] openrisc/thread_info: Introduce TIF_NOTIFY_IPI flag
Add support for TIF_NOTIFY_IPI on OpenRISC. With TIF_NOTIFY_IPI, a sender sending an IPI to an idle CPU in TIF_POLLING mode will set the TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This avoids spurious calls to schedule_idle() in cases where an IPI does not necessarily wake up a task on the idle CPU. Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: "Rafael J. Wysocki" Cc: Daniel Lezcano Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Ben Segall Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Valentin Schneider Cc: K Prateek Nayak Cc: linux-openr...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux...@vger.kernel.org Signed-off-by: K Prateek Nayak --- v1..v2: o No changes. --- arch/openrisc/include/asm/thread_info.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h index 4af3049c34c2..6a386703bc43 100644 --- a/arch/openrisc/include/asm/thread_info.h +++ b/arch/openrisc/include/asm/thread_info.h @@ -92,6 +92,7 @@ register struct thread_info *current_thread_info_reg asm("r10"); * mode */ #define TIF_NOTIFY_SIGNAL 5 /* signal notifications exist */ +#define TIF_NOTIFY_IPI 6 /* Pending IPI on TIF_POLLING idle CPU */ #define TIF_SYSCALL_TRACEPOINT 8 /* for ftrace syscall instrumentation */ #define TIF_RESTORE_SIGMASK 9 #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling * TIF_NEED_RESCHED @@ -104,6 +105,7 @@ register struct thread_info *current_thread_info_reg asm("r10"); #define _TIF_NEED_RESCHED (1<
[PATCH v4 04/35] kmsan: Increase the maximum store size to 4096
The inline assembly block in s390's chsc() stores that much. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/instrumentation.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index cc3907a9c33a..470b0b4afcc4 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -110,11 +110,10 @@ void __msan_instrument_asm_store(void *addr, uintptr_t size) ua_flags = user_access_save(); /* -* Most of the accesses are below 32 bytes. The two exceptions so far -* are clwb() (64 bytes) and FPU state (512 bytes). -* It's unlikely that the assembly will touch more than 512 bytes. +* Most of the accesses are below 32 bytes. The exceptions so far are +* clwb() (64 bytes), FPU state (512 bytes) and chsc() (4096 bytes). */ - if (size > 512) { + if (size > 4096) { WARN_ONCE(1, "assembly store size too big: %ld\n", size); size = 8; } -- 2.45.1
Re: [PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
Thanks Steve, re-tested in my VM and it's working fine. Just a minor below... On 13/06/2024 12:55, Steven Rostedt wrote: > [...] > + D. Using a region of memory reserved via ``reserve_mem`` command line > +parameter. The address and size will be defined by the ``reserve_mem`` > +parameter. Note, that ``reserve_mem`` may not always allocate memory > +in the same location, and cannot be relied upon. Testing will need > +to be done, and it may not work on every machine, nor every kernel. > +Consider this a "best effort" approach. The ``reserve_mem`` option > +takes a size, alignment and name as arguments. The name is used > +to map the memory to a label that can be retrieved by ramoops. > + > + reserver_mem=2M:4096:oops ramoops.mem_name=oops > + Likely this could be fixed on merge, to avoid another version, but... s/reserver_mem/reserve_mem Cheers!
[PATCH v4 16/35] mm: slub: Unpoison the memchr_inv() return value
Even though the KMSAN warnings generated by memchr_inv() are suppressed by metadata_access_enable(), its return value may still be poisoned. The reason is that the last iteration of memchr_inv() returns `*start != value ? start : NULL`, where *start is poisoned. Because of this, somewhat counterintuitively, the shadow value computed by visitSelectInst() is equal to `(uintptr_t)start`. The intention behind guarding memchr_inv() behind metadata_access_enable() is to touch poisoned metadata without triggering KMSAN, so unpoison its return value. Acked-by: Vlastimil Babka Signed-off-by: Ilya Leoshkevich --- mm/slub.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/slub.c b/mm/slub.c index a290f6c63e7b..b9101b2dc9aa 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1185,6 +1185,7 @@ static int check_bytes_and_report(struct kmem_cache *s, struct slab *slab, metadata_access_enable(); fault = memchr_inv(kasan_reset_tag(start), value, bytes); metadata_access_disable(); + kmsan_unpoison_memory(&fault, sizeof(fault)); if (!fault) return 1; @@ -1291,6 +1292,7 @@ static void slab_pad_check(struct kmem_cache *s, struct slab *slab) metadata_access_enable(); fault = memchr_inv(kasan_reset_tag(pad), POISON_INUSE, remainder); metadata_access_disable(); + kmsan_unpoison_memory(&fault, sizeof(fault)); if (!fault) return; while (end > fault && end[-1] == POISON_INUSE) -- 2.45.1
[PATCH v4 19/35] kmsan: Accept ranges starting with 0 on s390
On s390 the virtual address 0 is valid (current CPU's lowcore is mapped there), therefore KMSAN should not complain about it. Disable the respective check on s390. There doesn't seem to be a Kconfig option to describe this situation, so explicitly check for s390. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kmsan/init.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c index 9de76ac7062c..3f8b1bbb9060 100644 --- a/mm/kmsan/init.c +++ b/mm/kmsan/init.c @@ -33,7 +33,10 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end) bool merged = false; KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES); - KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend); + KMSAN_WARN_ON((nstart >= nend) || + /* Virtual address 0 is valid on s390. */ + (!IS_ENABLED(CONFIG_S390) && !nstart) || + !nend); nstart = ALIGN_DOWN(nstart, PAGE_SIZE); nend = ALIGN(nend, PAGE_SIZE); -- 2.45.1
Re: [PATCH v5 0/8] DAMON based tiered memory management for CXL memory
Hi Honggyu, On Thu, 13 Jun 2024 22:20:47 +0900 Honggyu Kim wrote: > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > posted at [1]. > > It says there is no implementation of the demote/promote DAMOS action > are made. This patch series is about its implementation for physical > address space so that this scheme can be applied in system wide level. > > Changes from RFC v4: > https://lore.kernel.org/20240512175447.75943-1...@kernel.org > 1. Add usage and design documents > 2. Rename alloc_demote_folio to alloc_migrate_folio > 3. Add evaluation results with "demotion_enabled" true > 4. Rebase based on v6.10-rc3 I left comments on the new patches for the documentation. [...] > > Evaluation Results > == > > All the result values are normalized to DRAM-only execution time because > the workload cannot be faster than DRAM-only unless the workload hits > the peak bandwidth but our redis test doesn't go beyond the bandwidth > limit. > > So the DRAM-only execution time is the ideal result without affected by > the gap between DRAM and CXL performance difference. The NUMA node > environment is as follows. > > node0 - local DRAM, 512GB with a CPU socket (fast tier) > node1 - disabled > node2 - CXL DRAM, 96GB, no CPU attached (slow tier) > > The following is the result of generating zipfian distribution to > redis-server and the numbers are averaged by 50 times of execution. > > 1. YCSB zipfian distribution read only workload > memory pressure with cold memory on node0 with 512GB of local DRAM. 
> > ++= > | cold memory occupied by mmap and memset | > | 0G 440G 450G 460G 470G 480G 490G 500G | > > ++= > Execution time normalized to DRAM-only values| > GEOMEAN > > ++- > DRAM-only | 1.00 - - - - - - - | 1.00 > CXL-only| 1.19 - - - - - - - | 1.19 > default |- 1.00 1.05 1.08 1.12 1.14 1.18 1.18 | 1.11 > DAMON tiered|- 1.03 1.03 1.03 1.03 1.03 1.07 *1.05 | 1.04 > DAMON lazy |- 1.04 1.03 1.04 1.05 1.06 1.06 *1.06 | 1.05 > > ++= > CXL usage of redis-server in GB | > AVERAGE > > ++- > DRAM-only | 0.0 - - - - - - - | 0.0 > CXL-only| 51.4 - - - - - - - | 51.4 > default |- 0.6 10.6 20.5 30.5 40.5 47.6 50.4 | 28.7 > DAMON tiered|- 0.6 0.5 0.4 0.7 0.8 7.1 5.6 | 2.2 > DAMON lazy |- 0.5 3.0 4.5 5.4 6.4 9.4 9.1 | 5.5 > > ++= > > Each test result is based on the exeuction environment as follows. Nit. s/exeuction/execution/ [...] > In summary, the evaluation results show that DAMON memory management > with DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown > compared to the "default" memory policy from 11% to 3~5% when the system > runs with high memory pressure on its fast tier DRAM nodes. > > Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make > tiered memory systems run more efficiently under high memory pressures. Thank you very much for continuing this great work. Other than trivial comments on documentation patches and the above typo, I have no particular concern on this patchset. I'm looking forward to the next version. Thanks, SJ [...]
Re: [PATCH v5 8/8] Docs/.../mm/damon: add more damos actions
On Thu, 13 Jun 2024 07:07:29 -0700 SeongJae Park wrote: > Hi Honggyu, > > On Thu, 13 Jun 2024 22:20:55 +0900 Honggyu Kim wrote: > > > This patch adds damon description for "migrate_hot" and "migrate_cold" > > actions for both usage and design documents, as well as a new > > "target_nid" knob to set the migration target node. [...] > Except the DAMON debugfs interface section, this patch looks good to me. One more thing. I'd prefer making the subject more specific. What about, "Docs/damon: document damos_migrate_{hot,cold}"? Thanks, SJ [...]
[PATCH v4 23/35] s390/checksum: Add a KMSAN check
Add a KMSAN check to the CKSM inline assembly, similar to how it was done for ASAN in commit e42ac7789df6 ("s390/checksum: always use cksm instruction"). Acked-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/checksum.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/include/asm/checksum.h b/arch/s390/include/asm/checksum.h index b89159591ca0..46f5c9660616 100644 --- a/arch/s390/include/asm/checksum.h +++ b/arch/s390/include/asm/checksum.h @@ -13,6 +13,7 @@ #define _S390_CHECKSUM_H #include +#include #include static inline __wsum cksm(const void *buff, int len, __wsum sum) @@ -23,6 +24,7 @@ static inline __wsum cksm(const void *buff, int len, __wsum sum) }; instrument_read(buff, len); + kmsan_check_memory(buff, len); asm volatile("\n" "0: cksm%[sum],%[rp]\n" " jo 0b\n" -- 2.45.1
Re: [PATCH] ARM: dts: qcom: motorola-falcon: add accelerometer, magnetometer
On 6/13/24 19:18, Stanislav Jakubek wrote: On Thu, Jun 13, 2024 at 09:48:26AM +0200, Konrad Dybcio wrote: On 6/9/24 13:05, Stanislav Jakubek wrote: Add the accelerometer and magnetometer that are present on the Motorola Moto G (2013) device. Signed-off-by: Stanislav Jakubek --- [...] +_i2c2 { Consider setting a clock-frequency = <> Hi Konrad, checking downstream [1], qcom,i2c-bus-freq for I2C2 is 10, which seems to be the default. Should I specify it anyway? Yes, at the very least to silence the dmesg warning ;) Konrad
Re: [PATCH v5 2/8] mm: rename alloc_demote_folio to alloc_migrate_folio
On Thu, 13 Jun 2024 22:20:49 +0900 Honggyu Kim wrote: > The alloc_demote_folio can also be used for general migration including > both demotion and promotion so it'd be better to rename it from > alloc_demote_folio to alloc_migrate_folio. > > Signed-off-by: Honggyu Kim Reviewed-by: SeongJae Park Thanks, SJ [...]
[PATCH v4 01/35] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
Architectures use assembly code to initialize ftrace_regs and call ftrace_ops_list_func(). Therefore, from the KMSAN's point of view, ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes KMSAN warnings when running the ftrace testsuite. Fix by trusting the architecture-specific assembly code and always unpoisoning ftrace_regs in ftrace_ops_list_func. So far the issue was not encountered on x86_64, but only by accident: the assembly-allocated ftrace_regs was overlapping a stale partially unpoisoned stack frame. Poisoning stack frames before returns [1] makes the issue appear on x86_64 as well. [1] https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/ Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- kernel/trace/ftrace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 65208d3b5ed9..c35ad4362d71 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -7407,6 +7407,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *op, struct ftrace_regs *fregs) { + kmsan_unpoison_memory(fregs, sizeof(*fregs)); __ftrace_ops_list_func(ip, parent_ip, NULL, fregs); } #else -- 2.45.1
[PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
Reserve unspecified location of physical memory from kernel command line Background: In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract dmesg output and some other information when a crash happens in the field. (This is only done when the user selects "Allow Google to collect data for improving the system"). But there are cases when there's a bug that requires more data to be retrieved to figure out what is happening. We would like to increase the pstore size, either temporarily, or maybe even permanently. The pstore on these devices is at a fixed location in RAM (as the RAM is not cleared on soft reboots nor crashes). The location is chosen by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. There's a driver that queries for this to initialize the pstore for ChromeOS: See drivers/platform/chrome/chromeos_pstore.c Problem: The problem is that, even though there's a process to change the kernel on these systems, and it is done regularly to install updates, the firmware is updated much less frequently. Choosing the place in RAM also takes special care, and may be at a different address for different boards. Updating the size via firmware is a large effort and not something that many are willing to do for a temporary pstore size change. Requirement: Need a way to reserve memory that will be at a consistent location for every boot, if the kernel and system are the same. Does not need to work if rebooting to a different kernel, or if the system can change the memory layout between boots. The reserved memory cannot be a hard-coded address, as the same kernel / command line needs to run on several different machines. The picked memory reservation just needs to be the same for a given machine, but may be different for different machines. Solution: The solution I have come up with is to introduce a new "reserve_mem=" kernel command line parameter. 
This parameter takes the following format: reserve_mem=nn:align:label Where nn is the size of memory to reserve, the align is the alignment of that memory, and label is the way for other sub-systems to find that memory. This way the kernel command line could have: reserve_mem=12M:4096:oops ramoops.mem_name=oops At boot up, the kernel will search for 12 megabytes in usable memory regions with an alignment of 4096. It will start at the highest regions and work its way down (for those old devices that want access to lower address DMA). When it finds a region, it will save it off in a small table and mark it with the "oops" label. Then the pstore ramoops sub-system could ask for that memory and location, and it will map itself there. This prototype allows for 8 different mappings (which may be overkill, 4 is probably plenty) with 16 byte size to store the label. I have tested this and it works for us to solve the above problem. We can update the kernel and command line and increase the size of pstore without needing to update the firmware, or knowing every memory layout of each board. I only tested this locally, it has not been tested in the field. Changes since v5: https://lore.kernel.org/all/20240613003435.401549...@goodmis.org/ [ patch at bottom showing differences ] - Stressed more that this is a best effort use case - Updated ramoops.rst to document this new feature - Used a new variable "tmp" to use in reserve_mem_find_by_name() instead of using "size" and possibly corrupting it. Changes since v4: https://lore.kernel.org/all/20240611215610.548854...@goodmis.org/ - Add all checks about reserve_mem before allocation. This means reserved_mem_add() is now a void function. - Check for name duplications. 
- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=") Changes since v3: https://lore.kernel.org/all/20240611144911.327227...@goodmis.org/ - Changed table type of start and size from unsigned long to phys_addr_t (as well as the parameters to the functions that use them) - Changed old reference to "early_reserve_mem" to "reserve_mem" - Check before reserving memory: o Size is non-zero o name has text in it - If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES - Remove the silly check of testing *p == '\0' after a p += strlen(p) Changes since v2: https://lore.kernel.org/all/20240606150143.876469...@goodmis.org/ - Fixed typo of "reserver" - Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name() - Removed "built-in" from module description that was changed from v1. Changes since v1: https://lore.kernel.org/all/2024060320.801075...@goodmis.org/ - Updated the change log of the first patch as well as added an entry into kernel-parameters.txt about how reserve_mem is for soft reboots and may not be reliable. Steven Rostedt (Google) (2): mm/memblock: Add "reserve_mem" to reserved named memory at boot up pstore/ramoops: Add ramoops.mem_name= command line option Documentation/admin-guide/kernel-parameters.txt | 22 +
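The reserve_mem=nn:align:label format described in the cover letter can be sketched as a small parser. This is illustrative only — parse_reserve_mem is an invented name, and the K/M/G suffix handling is an assumption modeled on the kernel's usual memparse() convention for command-line sizes:

```python
def parse_reserve_mem(arg: str):
    """Parse 'nn:align:label' into (size_bytes, align_bytes, label)."""
    size_s, align_s, label = arg.split(":", 2)
    mult = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

    def memparse(s: str) -> int:
        # Accept an optional K/M/G suffix, as kernel size parameters do
        if s and s[-1].upper() in mult:
            return int(s[:-1]) * mult[s[-1].upper()]
        return int(s, 0)

    return memparse(size_s), memparse(align_s), label

# The example from the cover letter: 12 MiB, 4096-byte aligned, label "oops"
assert parse_reserve_mem("12M:4096:oops") == (12 << 20, 4096, "oops")
```

A consumer such as ramoops would then look the label up (ramoops.mem_name=oops) to receive the address and size that were reserved at boot.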
Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
On Thu, 13 Jun 2024 18:54:12 +0200 Alexander Graf wrote: > > Do you have a "real" pstore on these systems that you could store > non-volatile variables in, such as persistent UEFI variables? If so, you > could create an actually persistent mapping for your trace pstore even > across kernel version updates as a general mechanism to create reserved > memblocks at fixed offsets. After implementing all this, I don't think I can use pstore for my purpose. pstore is a generic interface for persistent storage, and requires an interface to access it. From what I understand, it's not the place to just ask for an area of RAM. For this, I have a single patch that allows the tracing instance to use an area reserved by reserve_mem. reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace I've already tested this on qemu and a couple of chromebooks. It works well. > > > > Requirement: > > > > Need a way to reserve memory that will be at a consistent location for > > every boot, if the kernel and system are the same. Does not need to work > > if rebooting to a different kernel, or if the system can change the > > memory layout between boots. > > > > The reserved memory can not be an hard coded address, as the same kernel / > > command line needs to run on several different machines. The picked memory > > reservation just needs to be the same for a given machine, but may be > > > With KASLR is enabled, doesn't this approach break too often to be > reliable enough for the data you want to extract? > > Picking up the idea above, with a persistent variable we could even make > KASLR avoid that reserved pstore region in its search for a viable KASLR > offset. I think I was hit by it once in all my testing. For our use case, the few times it fails to map is not going to affect what we need this for at all. -- Steve
[PATCH v4 17/35] mm: kfence: Disable KMSAN when checking the canary
KMSAN warns about check_canary() accessing the canary. The reason is that, even though set_canary() is properly instrumented and sets shadow, slub explicitly poisons the canary's address range afterwards. Unpoisoning the canary is not the right thing to do: only check_canary() is supposed to ever touch it. Instead, disable KMSAN checks around canary read accesses. Reviewed-by: Alexander Potapenko Tested-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- mm/kfence/core.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 964b8482275b..cce330d5b797 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -305,8 +305,14 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state nex WRITE_ONCE(meta->state, next); } +#ifdef CONFIG_KMSAN +#define CHECK_CANARY_ATTRIBUTES noinline __no_kmsan_checks +#else +#define CHECK_CANARY_ATTRIBUTES inline +#endif + /* Check canary byte at @addr. */ -static inline bool check_canary_byte(u8 *addr) +static CHECK_CANARY_ATTRIBUTES bool check_canary_byte(u8 *addr) { struct kfence_metadata *meta; unsigned long flags; @@ -341,7 +347,8 @@ static inline void set_canary(const struct kfence_metadata *meta) *((u64 *)addr) = KFENCE_CANARY_PATTERN_U64; } -static inline void check_canary(const struct kfence_metadata *meta) +static CHECK_CANARY_ATTRIBUTES void +check_canary(const struct kfence_metadata *meta) { const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE); unsigned long addr = pageaddr; -- 2.45.1
[PATCH v4 22/35] s390/boot: Add the KMSAN runtime stub
It should be possible to have inline functions in the s390 header files, which call kmsan_unpoison_memory(). The problem is that these header files might be included by the decompressor, which does not contain the KMSAN runtime, causing linker errors. Not compiling these calls if __SANITIZE_MEMORY__ is not defined - either by changing kmsan-checks.h or at the call sites - may cause unintended side effects, since calling these functions from uninstrumented code that is linked into the kernel is a valid use case. One might want to explicitly distinguish between the kernel and the decompressor. Checking for a decompressor-specific #define is quite heavy-handed, and will have to be done at all call sites. A more generic approach is to provide a dummy kmsan_unpoison_memory() definition. This produces some runtime overhead, but only when building with CONFIG_KMSAN. The benefit is that it does not disturb the existing KMSAN build logic and call sites don't need to be changed. Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/Makefile | 1 + arch/s390/boot/kmsan.c | 6 ++ 2 files changed, 7 insertions(+) create mode 100644 arch/s390/boot/kmsan.c diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile index 526ed20b9d31..e7658997452b 100644 --- a/arch/s390/boot/Makefile +++ b/arch/s390/boot/Makefile @@ -44,6 +44,7 @@ obj-$(findstring y, $(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) $(CONFIG_PGSTE)) += obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o obj-y += $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) info.o obj-$(CONFIG_KERNEL_ZSTD) += clz_ctz.o +obj-$(CONFIG_KMSAN) += kmsan.o obj-all := $(obj-y) piggy.o syms.o targets:= bzImage section_cmp.boot.data section_cmp.boot.preserved.data $(obj-y) diff --git a/arch/s390/boot/kmsan.c b/arch/s390/boot/kmsan.c new file mode 100644 index ..e7b3ac48143e --- /dev/null +++ b/arch/s390/boot/kmsan.c @@ -0,0 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0 +#include + +void kmsan_unpoison_memory(const void 
*address, size_t size) +{ +} -- 2.45.1
Re: [PATCH v3 1/1] dt-bindings: remoteproc: imx_rproc: add minItems for power-domain
On Mon, 10 Jun 2024 11:17:21 -0400, Frank Li wrote: > "fsl,imx8qxp-cm4" and "fsl,imx8qm-cm4" need minimum 2 power domains. Other > platform doesn't require 'power-domain'. > > Signed-off-by: Frank Li > --- > > Notes: > Change from v2 to v3 > - only imx8qxp and imx8qm need power-domain, other platform don't need it. > - update commit message. > > Change from v1 to v2 > - set minitem to 2 at top > - Add imx8qm compatible string also > - use not logic to handle difference compatible string restriction > - update commit message. > > pass dt_binding_check. > > make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- -j8 dt_binding_check > DT_SCHEMA_FILES=fsl,imx-rproc.yaml > SCHEMA Documentation/devicetree/bindings/processed-schema.json > CHKDT Documentation/devicetree/bindings > LINTDocumentation/devicetree/bindings > DTEX > Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.example.dts > DTC_CHK > Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.example.dtb > > .../bindings/remoteproc/fsl,imx-rproc.yaml| 15 +++ > 1 file changed, 15 insertions(+) > Reviewed-by: Rob Herring (Arm)
Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
Hey Steve, On 13.06.24 17:55, Steven Rostedt wrote: Reserve unspecified location of physical memory from kernel command line Background: In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract dmesg output and some other information when a crash happens in the field. (This is only done when the user selects "Allow Google to collect data for improving the system"). But there are cases when there's a bug that requires more data to be retrieved to figure out what is happening. We would like to increase the pstore size, either temporarily, or maybe even permanently. The pstore on these devices are at a fixed location in RAM (as the RAM is not cleared on soft reboots nor crashes). The location is chosen by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86. There's a driver that queries for this to initialize the pstore for ChromeOS: See drivers/platform/chrome/chromeos_pstore.c Problem: The problem is that, even though there's a process to change the kernel on these systems, and is done regularly to install updates, the firmware is updated much less frequently. Choosing the place in RAM also takes special care, and may be in a different address for different boards. Updating the size via firmware is a large effort and not something that many are willing to do for a temporary pstore size change. (sorry for not commenting on earlier versions, I didn't see v1-v5 in my inbox) Do you have a "real" pstore on these systems that you could store non-volatile variables in, such as persistent UEFI variables? If so, you could create an actually persistent mapping for your trace pstore even across kernel version updates as a general mechanism to create reserved memblocks at fixed offsets. Requirement: Need a way to reserve memory that will be at a consistent location for every boot, if the kernel and system are the same. Does not need to work if rebooting to a different kernel, or if the system can change the memory layout between boots. 
The reserved memory can not be an hard coded address, as the same kernel / command line needs to run on several different machines. The picked memory reservation just needs to be the same for a given machine, but may be With KASLR is enabled, doesn't this approach break too often to be reliable enough for the data you want to extract? Picking up the idea above, with a persistent variable we could even make KASLR avoid that reserved pstore region in its search for a viable KASLR offset. Alex Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
[PATCH v4 18/35] lib/zlib: Unpoison DFLTCC output buffers
The constraints of the DFLTCC inline assembly are not precise: they do not communicate the size of the output buffers to the compiler, so it cannot automatically instrument it. Add the manual kmsan_unpoison_memory() calls for the output buffers. The logic is the same as in [1]. [1] https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5 Reported-by: Alexander Gordeev Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- lib/zlib_dfltcc/dfltcc.h | 1 + lib/zlib_dfltcc/dfltcc_util.h | 28 2 files changed, 29 insertions(+) diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h index b96232bdd44d..0f2a16d7a48a 100644 --- a/lib/zlib_dfltcc/dfltcc.h +++ b/lib/zlib_dfltcc/dfltcc.h @@ -80,6 +80,7 @@ struct dfltcc_param_v0 { uint8_t csb[1152]; }; +static_assert(offsetof(struct dfltcc_param_v0, csb) == 384); static_assert(sizeof(struct dfltcc_param_v0) == 1536); #define CVT_CRC32 0 diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h index 4a46b5009f0d..10509270d822 100644 --- a/lib/zlib_dfltcc/dfltcc_util.h +++ b/lib/zlib_dfltcc/dfltcc_util.h @@ -2,6 +2,8 @@ #ifndef DFLTCC_UTIL_H #define DFLTCC_UTIL_H +#include "dfltcc.h" +#include #include /* @@ -20,6 +22,7 @@ typedef enum { #define DFLTCC_CMPR 2 #define DFLTCC_XPND 4 #define HBT_CIRCULAR (1 << 7) +#define DFLTCC_FN_MASK ((1 << 7) - 1) #define HB_BITS 15 #define HB_SIZE (1 << HB_BITS) @@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc( ) { Byte *t2 = op1 ? *op1 : NULL; +unsigned char *orig_t2 = t2; size_t t3 = len1 ? *len1 : 0; const Byte *t4 = op2 ? *op2 : NULL; size_t t5 = len2 ? *len2 : 0; @@ -59,6 +63,30 @@ static inline dfltcc_cc dfltcc( : "cc", "memory"); t2 = r2; t3 = r3; t4 = r4; t5 = r5; +/* + * Unpoison the parameter block and the output buffer. + * This is a no-op in non-KMSAN builds. 
+ */ +switch (fn & DFLTCC_FN_MASK) { +case DFLTCC_QAF: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_qaf_param)); +break; +case DFLTCC_GDHT: +kmsan_unpoison_memory(param, offsetof(struct dfltcc_param_v0, csb)); +break; +case DFLTCC_CMPR: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); +kmsan_unpoison_memory( +orig_t2, +t2 - orig_t2 + +(((struct dfltcc_param_v0 *)param)->sbb == 0 ? 0 : 1)); +break; +case DFLTCC_XPND: +kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0)); +kmsan_unpoison_memory(orig_t2, t2 - orig_t2); +break; +} + if (op1) *op1 = t2; if (len1) -- 2.45.1
Re: [PATCH v4 01/35] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
On Thu, 13 Jun 2024 17:34:03 +0200 Ilya Leoshkevich wrote: > Architectures use assembly code to initialize ftrace_regs and call > ftrace_ops_list_func(). Therefore, from the KMSAN's point of view, > ftrace_regs is poisoned on ftrace_ops_list_func entry(). This causes > KMSAN warnings when running the ftrace testsuite. > > Fix by trusting the architecture-specific assembly code and always > unpoisoning ftrace_regs in ftrace_ops_list_func. > > The issue was not encountered on x86_64 so far only by accident: > assembly-allocated ftrace_regs was overlapping a stale partially > unpoisoned stack frame. Poisoning stack frames before returns [1] > makes the issue appear on x86_64 as well. > > [1] > https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/ > > Reviewed-by: Alexander Potapenko > Signed-off-by: Ilya Leoshkevich > --- Acked-by: Steven Rostedt (Google) -- Steve > kernel/trace/ftrace.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index 65208d3b5ed9..c35ad4362d71 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -7407,6 +7407,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long > parent_ip, > void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, > struct ftrace_ops *op, struct ftrace_regs *fregs) > { > + kmsan_unpoison_memory(fregs, sizeof(*fregs)); > __ftrace_ops_list_func(ip, parent_ip, NULL, fregs); > } > #else
[PATCH v4 34/35] s390: Implement the architecture-specific KMSAN functions
arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the prefix and calling kmsan_get_metadata() again. kmsan_virt_addr_valid() delegates to virt_addr_valid(). Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/kmsan.h | 43 +++ 1 file changed, 43 insertions(+) create mode 100644 arch/s390/include/asm/kmsan.h diff --git a/arch/s390/include/asm/kmsan.h b/arch/s390/include/asm/kmsan.h new file mode 100644 index ..e572686d340c --- /dev/null +++ b/arch/s390/include/asm/kmsan.h @@ -0,0 +1,43 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_S390_KMSAN_H +#define _ASM_S390_KMSAN_H + +#include +#include +#include +#include +#include + +#ifndef MODULE + +static inline bool is_lowcore_addr(void *addr) +{ + return addr >= (void *)_lowcore && + addr < (void *)(_lowcore + 1); +} + +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin) +{ + if (is_lowcore_addr(addr)) { + /* +* Different lowcores accessed via S390_lowcore are described +* by the same struct page. Resolve the prefix manually in +* order to get a distinct struct page. +*/ + addr += (void *)lowcore_ptr[raw_smp_processor_id()] - + (void *)_lowcore; + if (WARN_ON_ONCE(is_lowcore_addr(addr))) + return NULL; + return kmsan_get_metadata(addr, is_origin); + } + return NULL; +} + +static inline bool kmsan_virt_addr_valid(void *addr) +{ + return virt_addr_valid(addr); +} + +#endif /* !MODULE */ + +#endif /* _ASM_S390_KMSAN_H */ -- 2.45.1
[PATCH v4 31/35] s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs
This is normally done by the generic entry code, but the kernel_stack_overflow() flow bypasses it. Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/traps.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c index 52578b5cecbd..dde69d2a64f0 100644 --- a/arch/s390/kernel/traps.c +++ b/arch/s390/kernel/traps.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -262,6 +263,11 @@ static void monitor_event_exception(struct pt_regs *regs) void kernel_stack_overflow(struct pt_regs *regs) { + /* +* Normally regs are unpoisoned by the generic entry code, but +* kernel_stack_overflow() is a rare case that is called bypassing it. +*/ + kmsan_unpoison_entry_regs(regs); bust_spinlocks(1); printk("Kernel stack overflow.\n"); show_regs(regs); -- 2.45.1
[PATCH v4 27/35] s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()
s390 uses assembly code to initialize ftrace_regs and call kprobe_ftrace_handler(). Therefore, from the KMSAN's point of view, ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes KMSAN warnings when running the ftrace testsuite. Fix by trusting the assembly code and always unpoisoning ftrace_regs in kprobe_ftrace_handler(). Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/kernel/ftrace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index ddf2ee47cb87..0bd6adc40a34 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -303,6 +304,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip, if (bit < 0) return; + kmsan_unpoison_memory(fregs, sizeof(*fregs)); regs = ftrace_get_regs(fregs); p = get_kprobe((kprobe_opcode_t *)ip); if (!regs || unlikely(!p) || kprobe_disabled(p)) -- 2.45.1
[PATCH v6 1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
From: "Steven Rostedt (Google)" In order to allow for requesting a memory region that can be used for things like pstore on multiple machines where the memory layout is not the same, add a new option to the kernel command line called "reserve_mem". The format is: reserve_mem=nn:align:name Where it will find nn amount of memory at the given alignment of align. The name field is to allow another subsystem to retrieve where the memory was found. For example: reserve_mem=12M:4096:oops ramoops.mem_name=oops Where ramoops.mem_name will tell ramoops that memory was reserved for it via the reserve_mem option and it can find it by calling: if (reserve_mem_find_by_name("oops", , )) { // start holds the start address and size holds the size given This is typically used for systems that do not wipe the RAM, and this command line will try to reserve the same physical memory on soft reboots. Note, it is not guaranteed to be the same location. For example, if KASLR places the kernel at the location of where the RAM reservation was from a previous boot, the new reservation will be at a different location. Any subsystem using this feature must add a way to verify that the contents of the physical memory is from a previous boot, as there may be cases where the memory will not be located at the same location. Not all systems may work either. There could be bit flips if the reboot goes through the BIOS. Using kexec to reboot the machine is likely to have better results in such cases. Link: https://lore.kernel.org/all/zjjvnzux3nzig...@kernel.org/ Suggested-by: Mike Rapoport Tested-by: Guilherme G. 
Piccoli Signed-off-by: Steven Rostedt (Google) --- .../admin-guide/kernel-parameters.txt | 22 include/linux/mm.h| 2 + mm/memblock.c | 117 ++ 3 files changed, 141 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b600df82669d..56e18b1a520d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5710,6 +5710,28 @@ them. If is less than 0x1, the region is assumed to be I/O ports; otherwise it is memory. + reserve_mem=[RAM] + Format: nn[KNG]:: + Reserve physical memory and label it with a name that + other subsystems can use to access it. This is typically + used for systems that do not wipe the RAM, and this command + line will try to reserve the same physical memory on + soft reboots. Note, it is not guaranteed to be the same + location. For example, if anything about the system changes + or if booting a different kernel. It can also fail if KASLR + places the kernel at the location of where the RAM reservation + was from a previous boot, the new reservation will be at a + different location. + Any subsystem using this feature must add a way to verify + that the contents of the physical memory is from a previous + boot, as there may be cases where the memory will not be + located at the same location. 
+ + The format is size:align:label for example, to request + 12 megabytes of 4096 alignment for ramoops: + + reserve_mem=12M:4096:oops ramoops.mem_name=oops + reservetop= [X86-32,EARLY] Format: nn[KMG] Reserves a hole at the top of the kernel virtual diff --git a/include/linux/mm.h b/include/linux/mm.h index 9849dfda44d4..077fb589b88a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4263,4 +4263,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn) void vma_pgtable_walk_begin(struct vm_area_struct *vma); void vma_pgtable_walk_end(struct vm_area_struct *vma); +int reserve_mem_find_by_name(const char *name, phys_addr_t *start, phys_addr_t *size); + #endif /* _LINUX_MM_H */ diff --git a/mm/memblock.c b/mm/memblock.c index d09136e040d3..b7b0e8c3868d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -2244,6 +2244,123 @@ void __init memblock_free_all(void) totalram_pages_add(pages); } +/* Keep a table to reserve named memory */ +#define RESERVE_MEM_MAX_ENTRIES8 +#define RESERVE_MEM_NAME_SIZE 16 +struct reserve_mem_table { + charname[RESERVE_MEM_NAME_SIZE]; + phys_addr_t start; + phys_addr_t size; +}; +static struct reserve_mem_table reserved_mem_table[RESERVE_MEM_MAX_ENTRIES]; +static int reserved_mem_count; + +/* Add wildcard
[PATCH v6 2/2] pstore/ramoops: Add ramoops.mem_name= command line option
From: "Steven Rostedt (Google)" Add a method to find a region specified by reserve_mem=nn:align:name for ramoops. Adding a kernel command line parameter: reserve_mem=12M:4096:oops ramoops.mem_name=oops Will use the size and location defined by the memmap parameter where it finds the memory and labels it "oops". The "oops" in the ramoops option is used to search for it. This allows for arbitrary RAM to be used for ramoops if it is known that the memory is not cleared on kernel crashes or soft reboots. Tested-by: Guilherme G. Piccoli Signed-off-by: Steven Rostedt (Google) --- Documentation/admin-guide/ramoops.rst | 13 + fs/pstore/ram.c | 14 ++ 2 files changed, 27 insertions(+) diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index e9f85142182d..6f534a707b2a 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -23,6 +23,8 @@ and type of the memory area are set using three variables: * ``mem_size`` for the size. The memory size will be rounded down to a power of two. * ``mem_type`` to specify if the memory type (default is pgprot_writecombine). + * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command +line parameter. Typically the default value of ``mem_type=0`` should be used as that sets the pstore mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use @@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners: return ret; } + D. Using a region of memory reserved via ``reserve_mem`` command line +parameter. The address and size will be defined by the ``reserve_mem`` +parameter. Note, that ``reserve_mem`` may not always allocate memory +in the same location, and cannot be relied upon. Testing will need +to be done, and it may not work on every machine, nor every kernel. +Consider this a "best effort" approach. The ``reserve_mem`` option +takes a size, alignment and name as arguments. 
The name is used +to map the memory to a label that can be retrieved by ramoops. + + reserver_mem=2M:4096:oops ramoops.mem_name=oops + You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() very early in the architecture code, e.g.:: diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index b1a455f42e93..4311fcbc84f2 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -50,6 +50,10 @@ module_param_hw(mem_address, ullong, other, 0400); MODULE_PARM_DESC(mem_address, "start of reserved RAM used to store oops/panic logs"); +static char *mem_name; +module_param_named(mem_name, mem_name, charp, 0400); +MODULE_PARM_DESC(mem_name, "name of kernel param that holds addr"); + static ulong mem_size; module_param(mem_size, ulong, 0400); MODULE_PARM_DESC(mem_size, @@ -914,6 +918,16 @@ static void __init ramoops_register_dummy(void) { struct ramoops_platform_data pdata; + if (mem_name) { + phys_addr_t start; + phys_addr_t size; + + if (reserve_mem_find_by_name(mem_name, , )) { + mem_address = start; + mem_size = size; + } + } + /* * Prepare a dummy platform data structure to carry the module * parameters. If mem_size isn't set, then there are no module -- 2.43.0
[PATCH v4 32/35] s390/uaccess: Add KMSAN support to put_user() and get_user()
put_user() uses inline assembly with precise constraints, so Clang is in principle capable of instrumenting it automatically. Unfortunately, one of the constraints contains a dereferenced user pointer, and Clang does not currently distinguish user and kernel pointers. Therefore KMSAN attempts to access shadow for user pointers, which is not a right thing to do. An obvious fix to add __no_sanitize_memory to __put_user_fn() does not work, since it's __always_inline. And __always_inline cannot be removed due to the __put_user_bad() trick. A different obvious fix of using the "a" instead of the "+Q" constraint degrades the code quality, which is very important here, since it's a hot path. Instead, repurpose the __put_user_asm() macro to define __put_user_{char,short,int,long}_noinstr() functions and mark them with __no_sanitize_memory. For the non-KMSAN builds make them __always_inline in order to keep the generated code quality. Also define __put_user_{char,short,int,long}() functions, which call the aforementioned ones and which *are* instrumented, because they call KMSAN hooks, which may be implemented as macros. The same applies to get_user() as well. 
Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/uaccess.h | 111 +++- 1 file changed, 79 insertions(+), 32 deletions(-) diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index 81ae8a98e7ec..c3c26dd1fc04 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -78,13 +78,24 @@ union oac { int __noreturn __put_user_bad(void); -#define __put_user_asm(to, from, size) \ -({ \ +#ifdef CONFIG_KMSAN +#define GET_PUT_USER_NOINSTR_ATTRIBUTES \ + noinline __maybe_unused __no_sanitize_memory +#else +#define GET_PUT_USER_NOINSTR_ATTRIBUTES __always_inline +#endif + +#define DEFINE_PUT_USER(type) \ +static GET_PUT_USER_NOINSTR_ATTRIBUTES int \ +__put_user_##type##_noinstr(unsigned type __user *to, \ + unsigned type *from,\ + unsigned long size) \ +{ \ union oac __oac_spec = {\ .oac1.as = PSW_BITS_AS_SECONDARY, \ .oac1.a = 1,\ }; \ - int __rc; \ + int rc; \ \ asm volatile( \ " lr 0,%[spec]\n"\ @@ -93,12 +104,28 @@ int __noreturn __put_user_bad(void); "2:\n" \ EX_TABLE_UA_STORE(0b, 2b, %[rc])\ EX_TABLE_UA_STORE(1b, 2b, %[rc])\ - : [rc] "=" (__rc), [_to] "+Q" (*(to)) \ + : [rc] "=" (rc), [_to] "+Q" (*(to)) \ : [_size] "d" (size), [_from] "Q" (*(from)),\ [spec] "d" (__oac_spec.val) \ : "cc", "0"); \ - __rc; \ -}) + return rc; \ +} \ + \ +static __always_inline int \ +__put_user_##type(unsigned type __user *to, unsigned type *from, \ + unsigned long size) \ +{ \ + int rc; \ + \ + rc = __put_user_##type##_noinstr(to, from, size); \ + instrument_put_user(*from, to, size); \ + return rc; \ +} + +DEFINE_PUT_USER(char); +DEFINE_PUT_USER(short); +DEFINE_PUT_USER(int); +DEFINE_PUT_USER(long); static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned long size) { @@ -106,24 +133,24 @@ static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned lon
[PATCH v4 30/35] s390/string: Add KMSAN support
Add KMSAN support for the s390 implementations of the string functions. Do this similar to how it's already done for KASAN, except that the optimized memset{16,32,64}() functions need to be disabled: it's important for KMSAN to know that they initialized something. The way boot code is built with regard to string functions is problematic, since most files think it's configured with sanitizers, but boot/string.c doesn't. This creates various problems with the memset64() definitions, depending on whether the code is built with sanitizers or fortify. This should probably be streamlined, but in the meantime resolve the issues by introducing the IN_BOOT_STRING_C macro, similar to the existing IN_ARCH_STRING_C macro. Reviewed-by: Alexander Potapenko Acked-by: Heiko Carstens Signed-off-by: Ilya Leoshkevich --- arch/s390/boot/string.c| 16 arch/s390/include/asm/string.h | 20 +++- 2 files changed, 31 insertions(+), 5 deletions(-) diff --git a/arch/s390/boot/string.c b/arch/s390/boot/string.c index faccb33b462c..f6b9b1df48a8 100644 --- a/arch/s390/boot/string.c +++ b/arch/s390/boot/string.c @@ -1,11 +1,18 @@ // SPDX-License-Identifier: GPL-2.0 +#define IN_BOOT_STRING_C 1 #include #include #include #undef CONFIG_KASAN #undef CONFIG_KASAN_GENERIC +#undef CONFIG_KMSAN #include "../lib/string.c" +/* + * Duplicate some functions from the common lib/string.c + * instead of fully including it. 
+ */ + int strncmp(const char *cs, const char *ct, size_t count) { unsigned char c1, c2; @@ -22,6 +29,15 @@ int strncmp(const char *cs, const char *ct, size_t count) return 0; } +void *memset64(uint64_t *s, uint64_t v, size_t count) +{ + uint64_t *xs = s; + + while (count--) + *xs++ = v; + return s; +} + char *skip_spaces(const char *str) { while (isspace(*str)) diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h index 351685de53d2..2ab868cbae6c 100644 --- a/arch/s390/include/asm/string.h +++ b/arch/s390/include/asm/string.h @@ -15,15 +15,12 @@ #define __HAVE_ARCH_MEMCPY /* gcc builtin & arch function */ #define __HAVE_ARCH_MEMMOVE/* gcc builtin & arch function */ #define __HAVE_ARCH_MEMSET /* gcc builtin & arch function */ -#define __HAVE_ARCH_MEMSET16 /* arch function */ -#define __HAVE_ARCH_MEMSET32 /* arch function */ -#define __HAVE_ARCH_MEMSET64 /* arch function */ void *memcpy(void *dest, const void *src, size_t n); void *memset(void *s, int c, size_t n); void *memmove(void *dest, const void *src, size_t n); -#ifndef CONFIG_KASAN +#if !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) #define __HAVE_ARCH_MEMCHR /* inline & arch function */ #define __HAVE_ARCH_MEMCMP /* arch function */ #define __HAVE_ARCH_MEMSCAN/* inline & arch function */ @@ -36,6 +33,9 @@ void *memmove(void *dest, const void *src, size_t n); #define __HAVE_ARCH_STRNCPY/* arch function */ #define __HAVE_ARCH_STRNLEN/* inline & arch function */ #define __HAVE_ARCH_STRSTR /* arch function */ +#define __HAVE_ARCH_MEMSET16 /* arch function */ +#define __HAVE_ARCH_MEMSET32 /* arch function */ +#define __HAVE_ARCH_MEMSET64 /* arch function */ /* Prototypes for non-inlined arch strings functions. 
*/ int memcmp(const void *s1, const void *s2, size_t n); @@ -44,7 +44,7 @@ size_t strlcat(char *dest, const char *src, size_t n); char *strncat(char *dest, const char *src, size_t n); char *strncpy(char *dest, const char *src, size_t n); char *strstr(const char *s1, const char *s2); -#endif /* !CONFIG_KASAN */ +#endif /* !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) */ #undef __HAVE_ARCH_STRCHR #undef __HAVE_ARCH_STRNCHR @@ -74,20 +74,30 @@ void *__memset16(uint16_t *s, uint16_t v, size_t count); void *__memset32(uint32_t *s, uint32_t v, size_t count); void *__memset64(uint64_t *s, uint64_t v, size_t count); +#ifdef __HAVE_ARCH_MEMSET16 static inline void *memset16(uint16_t *s, uint16_t v, size_t count) { return __memset16(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET32 static inline void *memset32(uint32_t *s, uint32_t v, size_t count) { return __memset32(s, v, count * sizeof(v)); } +#endif +#ifdef __HAVE_ARCH_MEMSET64 +#ifdef IN_BOOT_STRING_C +void *memset64(uint64_t *s, uint64_t v, size_t count); +#else static inline void *memset64(uint64_t *s, uint64_t v, size_t count) { return __memset64(s, v, count * sizeof(v)); } +#endif +#endif #if !defined(IN_ARCH_STRING_C) && (!defined(CONFIG_FORTIFY_SOURCE) || defined(__NO_FORTIFY)) -- 2.45.1
[PATCH v4 09/35] kmsan: Expose kmsan_get_metadata()
Each s390 CPU has lowcore pages associated with it. Each CPU sees its own lowcore at virtual address 0 through a hardware mechanism called prefixing. Additionally, all lowcores are mapped to non-0 virtual addresses stored in the lowcore_ptr[] array. When lowcore is accessed through virtual address 0, one needs to resolve metadata for lowcore_ptr[raw_smp_processor_id()]. Expose kmsan_get_metadata() to make it possible to do this from the arch code. Signed-off-by: Ilya Leoshkevich --- include/linux/kmsan.h | 9 + mm/kmsan/instrumentation.c | 1 + mm/kmsan/kmsan.h | 1 - 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index e0c23a32cdf0..fe6c2212bdb1 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -230,6 +230,15 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out); */ void kmsan_unpoison_entry_regs(const struct pt_regs *regs); +/** + * kmsan_get_metadata() - Return a pointer to KMSAN shadow or origins. + * @addr: kernel address. + * @is_origin: whether to return origins or shadow. + * + * Return NULL if metadata cannot be found. + */ +void *kmsan_get_metadata(void *addr, bool is_origin); + #else static inline void kmsan_init_shadow(void) diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c index 8a1bbbc723ab..94b49fac9d8b 100644 --- a/mm/kmsan/instrumentation.c +++ b/mm/kmsan/instrumentation.c @@ -14,6 +14,7 @@ #include "kmsan.h" #include +#include #include #include #include diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h index adf443bcffe8..34b83c301d57 100644 --- a/mm/kmsan/kmsan.h +++ b/mm/kmsan/kmsan.h @@ -66,7 +66,6 @@ struct shadow_origin_ptr { struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size, bool store); -void *kmsan_get_metadata(void *addr, bool is_origin); void __init kmsan_init_alloc_meta_for_range(void *start, void *end); enum kmsan_bug_reason { -- 2.45.1
[PATCH v4 28/35] s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN
Lockdep generates the following false positives with KMSAN on s390x: [6.063666] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled()) [ ...] [6.577050] Call Trace: [6.619637] [<0690d2de>] check_flags+0x1fe/0x210 [6.665411] ([<0690d2da>] check_flags+0x1fa/0x210) [6.707478] [<006cec1a>] lock_acquire+0x2ca/0xce0 [6.749959] [<069820ea>] _raw_spin_lock_irqsave+0xea/0x190 [6.794912] [<041fc988>] __stack_depot_save+0x218/0x5b0 [6.838420] [<0197affe>] __msan_poison_alloca+0xfe/0x1a0 [6.882985] [<07c5827c>] start_kernel+0x70c/0xd50 [6.927454] [<00100036>] startup_continue+0x36/0x40 Between trace_hardirqs_on() and `stosm __mask, 3` lockdep thinks that interrupts are on, but on the CPU they are still off. KMSAN instrumentation takes spinlocks, giving lockdep a chance to see and complain about this discrepancy. KMSAN instrumentation is inserted in order to poison the __mask variable. Disable instrumentation in the respective functions. They are very small and it's easy to see that no important metadata updates are lost because of this. 
Reviewed-by: Alexander Potapenko Signed-off-by: Ilya Leoshkevich --- arch/s390/include/asm/irqflags.h | 17 ++--- drivers/s390/char/sclp.c | 2 +- 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/arch/s390/include/asm/irqflags.h b/arch/s390/include/asm/irqflags.h index 02427b205c11..bcab456dfb80 100644 --- a/arch/s390/include/asm/irqflags.h +++ b/arch/s390/include/asm/irqflags.h @@ -37,12 +37,18 @@ static __always_inline void __arch_local_irq_ssm(unsigned long flags) asm volatile("ssm %0" : : "Q" (flags) : "memory"); } -static __always_inline unsigned long arch_local_save_flags(void) +#ifdef CONFIG_KMSAN +#define arch_local_irq_attributes noinline notrace __no_sanitize_memory __maybe_unused +#else +#define arch_local_irq_attributes __always_inline +#endif + +static arch_local_irq_attributes unsigned long arch_local_save_flags(void) { return __arch_local_irq_stnsm(0xff); } -static __always_inline unsigned long arch_local_irq_save(void) +static arch_local_irq_attributes unsigned long arch_local_irq_save(void) { return __arch_local_irq_stnsm(0xfc); } @@ -52,7 +58,12 @@ static __always_inline void arch_local_irq_disable(void) arch_local_irq_save(); } -static __always_inline void arch_local_irq_enable(void) +static arch_local_irq_attributes void arch_local_irq_enable_external(void) +{ + __arch_local_irq_stosm(0x01); +} + +static arch_local_irq_attributes void arch_local_irq_enable(void) { __arch_local_irq_stosm(0x03); } diff --git a/drivers/s390/char/sclp.c b/drivers/s390/char/sclp.c index d53ee34d398f..fb1d9949adca 100644 --- a/drivers/s390/char/sclp.c +++ b/drivers/s390/char/sclp.c @@ -736,7 +736,7 @@ sclp_sync_wait(void) cr0_sync.val = cr0.val & ~CR0_IRQ_SUBCLASS_MASK; cr0_sync.val |= 1UL << (63 - 54); local_ctl_load(0, _sync); - __arch_local_irq_stosm(0x01); + arch_local_irq_enable_external(); /* Loop until driver state indicates finished request */ while (sclp_running_state != sclp_running_state_idle) { /* Check for expired request timer */ -- 2.45.1
[PATCH v4 25/35] s390/cpumf: Unpoison STCCTM output buffer
stcctm() uses the "Q" constraint for dest, therefore KMSAN does not
understand that it fills multiple doublewords pointed to by dest, not
just one. This results in false positives.

Unpoison the whole dest manually with kmsan_unpoison_memory().

Reported-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Acked-by: Heiko Carstens
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/include/asm/cpu_mf.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/s390/include/asm/cpu_mf.h b/arch/s390/include/asm/cpu_mf.h
index a0de5b9b02ea..9e4bbc3e53f8 100644
--- a/arch/s390/include/asm/cpu_mf.h
+++ b/arch/s390/include/asm/cpu_mf.h
@@ -10,6 +10,7 @@
 #define _ASM_S390_CPU_MF_H

 #include
+#include
 #include
 #include

@@ -239,6 +240,11 @@ static __always_inline int stcctm(enum stcctm_ctr_set set, u64 range, u64 *dest)
		: "=d" (cc)
		: "Q" (*dest), "d" (range), "i" (set)
		: "cc", "memory");
+	/*
+	 * If cc == 2, less than RANGE counters are stored, but it's not easy
+	 * to tell how many. Always unpoison the whole range for simplicity.
+	 */
+	kmsan_unpoison_memory(dest, range * sizeof(u64));
	return cc;
}
-- 
2.45.1
[PATCH v4 33/35] s390/unwind: Disable KMSAN checks
The unwind code can read uninitialized frames. Furthermore, even in the
good case, KMSAN does not emit shadow for backchains. Therefore disable
it for the unwinding functions.

Reviewed-by: Alexander Potapenko
Acked-by: Heiko Carstens
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/kernel/unwind_bc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c
index 0ece156fdd7c..cd44be2b6ce8 100644
--- a/arch/s390/kernel/unwind_bc.c
+++ b/arch/s390/kernel/unwind_bc.c
@@ -49,6 +49,8 @@ static inline bool is_final_pt_regs(struct unwind_state *state,
	       READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE;
}

+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 bool unwind_next_frame(struct unwind_state *state)
 {
	struct stack_info *info = &state->stack_info;
@@ -118,6 +120,8 @@ bool unwind_next_frame(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_next_frame);

+/* Avoid KMSAN false positives from touching uninitialized frames. */
+__no_kmsan_checks
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
		    struct pt_regs *regs, unsigned long first_frame)
 {
-- 
2.45.1
[PATCH v4 11/35] kmsan: Allow disabling KMSAN checks for the current task
Like for KASAN, it's useful to temporarily disable KMSAN checks around, e.g., redzone accesses. Introduce kmsan_disable_current() and kmsan_enable_current(), which are similar to their KASAN counterparts. Make them reentrant in order to handle memory allocations in interrupt context. Repurpose the allow_reporting field for this. Signed-off-by: Ilya Leoshkevich --- Documentation/dev-tools/kmsan.rst | 4 ++-- include/linux/kmsan.h | 24 include/linux/kmsan_types.h | 2 +- mm/kmsan/core.c | 1 - mm/kmsan/hooks.c | 18 +++--- mm/kmsan/report.c | 7 --- tools/objtool/check.c | 2 ++ 7 files changed, 48 insertions(+), 10 deletions(-) diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst index 323eedad53cd..022a823f5f1b 100644 --- a/Documentation/dev-tools/kmsan.rst +++ b/Documentation/dev-tools/kmsan.rst @@ -338,11 +338,11 @@ Per-task KMSAN state Every task_struct has an associated KMSAN task state that holds the KMSAN -context (see above) and a per-task flag disallowing KMSAN reports:: +context (see above) and a per-task counter disallowing KMSAN reports:: struct kmsan_context { ... -bool allow_reporting; +unsigned int depth; struct kmsan_context_state cstate; ... } diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h index fe6c2212bdb1..23de1b3d6aee 100644 --- a/include/linux/kmsan.h +++ b/include/linux/kmsan.h @@ -239,6 +239,22 @@ void kmsan_unpoison_entry_regs(const struct pt_regs *regs); */ void *kmsan_get_metadata(void *addr, bool is_origin); +/* + * kmsan_enable_current(): Enable KMSAN for the current task. + * + * Each kmsan_enable_current() current call must be preceded by a + * kmsan_disable_current() call. These call pairs may be nested. + */ +void kmsan_enable_current(void); + +/* + * kmsan_disable_current(): Disable KMSAN for the current task. + * + * Each kmsan_disable_current() current call must be followed by a + * kmsan_enable_current() call. These call pairs may be nested. 
+ */ +void kmsan_disable_current(void); + #else static inline void kmsan_init_shadow(void) @@ -338,6 +354,14 @@ static inline void kmsan_unpoison_entry_regs(const struct pt_regs *regs) { } +static inline void kmsan_enable_current(void) +{ +} + +static inline void kmsan_disable_current(void) +{ +} + #endif #endif /* _LINUX_KMSAN_H */ diff --git a/include/linux/kmsan_types.h b/include/linux/kmsan_types.h index 929287981afe..dfc59918b3c0 100644 --- a/include/linux/kmsan_types.h +++ b/include/linux/kmsan_types.h @@ -31,7 +31,7 @@ struct kmsan_context_state { struct kmsan_ctx { struct kmsan_context_state cstate; int kmsan_in_runtime; - bool allow_reporting; + unsigned int depth; }; #endif /* _LINUX_KMSAN_TYPES_H */ diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c index 95f859e38c53..81b0711a 100644 --- a/mm/kmsan/core.c +++ b/mm/kmsan/core.c @@ -43,7 +43,6 @@ void kmsan_internal_task_create(struct task_struct *task) struct thread_info *info = current_thread_info(); __memset(ctx, 0, sizeof(*ctx)); - ctx->allow_reporting = true; kmsan_internal_unpoison_memory(info, sizeof(*info), false); } diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index b408714f9ba3..267d0afa2e8b 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -39,12 +39,10 @@ void kmsan_task_create(struct task_struct *task) void kmsan_task_exit(struct task_struct *task) { - struct kmsan_ctx *ctx = >kmsan_ctx; - if (!kmsan_enabled || kmsan_in_runtime()) return; - ctx->allow_reporting = false; + kmsan_disable_current(); } void kmsan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags) @@ -424,3 +422,17 @@ void kmsan_check_memory(const void *addr, size_t size) REASON_ANY); } EXPORT_SYMBOL(kmsan_check_memory); + +void kmsan_enable_current(void) +{ + KMSAN_WARN_ON(current->kmsan_ctx.depth == 0); + current->kmsan_ctx.depth--; +} +EXPORT_SYMBOL(kmsan_enable_current); + +void kmsan_disable_current(void) +{ + current->kmsan_ctx.depth++; + KMSAN_WARN_ON(current->kmsan_ctx.depth == 0); +} 
+EXPORT_SYMBOL(kmsan_disable_current); diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c index c79d3b0d2d0d..92e73ec61435 100644 --- a/mm/kmsan/report.c +++ b/mm/kmsan/report.c @@ -8,6 +8,7 @@ */ #include +#include #include #include #include @@ -158,12 +159,12 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size, if (!kmsan_enabled) return; - if (!current->kmsan_ctx.allow_reporting) + if (current->kmsan_ctx.depth) return; if (!origin) return; - current->kmsan_ctx.allow_reporting = false; + kmsan_disable_current(); ua_flags =
[PATCH v4 10/35] kmsan: Export panic_on_kmsan
When building the kmsan test as a module, modpost fails with the
following error message:

ERROR: modpost: "panic_on_kmsan" [mm/kmsan/kmsan_test.ko] undefined!

Export panic_on_kmsan in order to improve the KMSAN usability for
modules.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/report.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c
index 02736ec757f2..c79d3b0d2d0d 100644
--- a/mm/kmsan/report.c
+++ b/mm/kmsan/report.c
@@ -20,6 +20,7 @@ static DEFINE_RAW_SPINLOCK(kmsan_report_lock);
 /* Protected by kmsan_report_lock */
 static char report_local_descr[DESCR_SIZE];
 int panic_on_kmsan __read_mostly;
+EXPORT_SYMBOL_GPL(panic_on_kmsan);

 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
-- 
2.45.1
[PATCH v4 26/35] s390/diag: Unpoison diag224() output buffer
Diagnose 224 stores 4k bytes, which currently cannot be deduced from the
inline assembly constraints. This leads to KMSAN false positives.

Fix the constraints by using a 4k-sized struct instead of a raw pointer.
While at it, prettify them too.

Suggested-by: Heiko Carstens
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/kernel/diag.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kernel/diag.c b/arch/s390/kernel/diag.c
index 8dee9aa0ec95..8a7009618ba7 100644
--- a/arch/s390/kernel/diag.c
+++ b/arch/s390/kernel/diag.c
@@ -278,12 +278,14 @@ int diag224(void *ptr)
	int rc = -EOPNOTSUPP;

	diag_stat_inc(DIAG_STAT_X224);
-	asm volatile(
-		"	diag	%1,%2,0x224\n"
-		"0:	lhi	%0,0x0\n"
+	asm volatile("\n"
+		"	diag	%[type],%[addr],0x224\n"
+		"0:	lhi	%[rc],0\n"
		"1:\n"
		EX_TABLE(0b,1b)
-		: "+d" (rc) :"d" (0), "d" (addr) : "memory");
+		: [rc] "+d" (rc)
+		, "=m" (*(struct { char buf[PAGE_SIZE]; } *)ptr)
+		: [type] "d" (0), [addr] "d" (addr));
	return rc;
}
EXPORT_SYMBOL(diag224);
-- 
2.45.1
[PATCH v4 20/35] s390/boot: Turn off KMSAN
All other sanitizers are disabled for boot as well. While at it, add a
comment explaining why we need this.

Reviewed-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 arch/s390/boot/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index 070c9b2e905f..526ed20b9d31 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
@@ -3,11 +3,13 @@
 # Makefile for the linux s390-specific parts of the memory manager.
 #

+# Tooling runtimes are unavailable and cannot be linked for early boot code
 KCOV_INSTRUMENT := n
 GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n

 KBUILD_AFLAGS := $(KBUILD_AFLAGS_DECOMPRESSOR)
 KBUILD_CFLAGS := $(KBUILD_CFLAGS_DECOMPRESSOR)
-- 
2.45.1
[PATCH v4 00/35] kmsan: Enable on s390
v3: https://lore.kernel.org/lkml/20231213233605.661251-1-...@linux.ibm.com/ v3 -> v4: Rebase. Elaborate why ftrace_ops_list_func() change is needed on x64_64 (Steven). Add a comment to the DFLTCC patch (Alexander P.). Simplify diag224(); Improve __arch_local_irq_attributes style; Use IS_ENABLED(CONFIG_KMSAN) for vmalloc area (Heiko). Align vmalloc area on _SEGMENT_SIZE (Alexander G.). v2: https://lore.kernel.org/lkml/20231121220155.1217090-1-...@linux.ibm.com/ v2 -> v3: Drop kmsan_memmove_metadata() and strlcpy() patches; Remove kmsan_get_metadata() stub; Move kmsan_enable_current() and kmsan_disable_current() to include/linux/kmsan.h, explain why a counter is needed; Drop the memset_no_sanitize_memory() patch; Use __memset() in the SLAB_POISON patch; Add kmsan-checks.h to the DFLTCC patch; Add recursion check to the arch_kmsan_get_meta_or_null() patch (Alexander P.). Fix inline + __no_kmsan_checks issues. New patch for s390/irqflags, that resolves a lockdep warning. New patch for s390/diag, that resolves a false positive when running on an LPAR. New patch for STCCTM, same as above. New patch for check_bytes_and_report() that resolves a false positive that occurs even on Intel. v1: https://lore.kernel.org/lkml/20231115203401.2495875-1-...@linux.ibm.com/ v1 -> v2: Add comments, sort #includes, introduce memset_no_sanitize_memory() and use it to avoid unpoisoning of redzones, change vmalloc alignment to _REGION3_SIZE, add R-bs (Alexander P.). Fix building [PATCH 28/33] s390/string: Add KMSAN support with FORTIFY_SOURCE. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202311170550.bsbo44ix-...@intel.com/ Hi, This series provides the minimal support for Kernel Memory Sanitizer on s390. Kernel Memory Sanitizer is clang-only instrumentation for finding accesses to uninitialized memory. The clang support for s390 has already been merged [1]. With this series, I can successfully boot s390 defconfig and debug_defconfig with kmsan.panic=1. 
The tool found one real s390-specific bug (fixed in master). Best regards, Ilya [1] https://reviews.llvm.org/D148596 Ilya Leoshkevich (35): ftrace: Unpoison ftrace_regs in ftrace_ops_list_func() kmsan: Make the tests compatible with kmsan.panic=1 kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled kmsan: Increase the maximum store size to 4096 kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush() kmsan: Remove an x86-specific #include from kmsan.h kmsan: Expose kmsan_get_metadata() kmsan: Export panic_on_kmsan kmsan: Allow disabling KMSAN checks for the current task kmsan: Support SLAB_POISON kmsan: Use ALIGN_DOWN() in kmsan_get_metadata() kmsan: Do not round up pg_data_t size mm: slub: Let KMSAN access metadata mm: slub: Unpoison the memchr_inv() return value mm: kfence: Disable KMSAN when checking the canary lib/zlib: Unpoison DFLTCC output buffers kmsan: Accept ranges starting with 0 on s390 s390/boot: Turn off KMSAN s390: Use a larger stack for KMSAN s390/boot: Add the KMSAN runtime stub s390/checksum: Add a KMSAN check s390/cpacf: Unpoison the results of cpacf_trng() s390/cpumf: Unpoison STCCTM output buffer s390/diag: Unpoison diag224() output buffer s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler() s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN s390/mm: Define KMSAN metadata for vmalloc and modules s390/string: Add KMSAN support s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs s390/uaccess: Add KMSAN support to put_user() and get_user() s390/unwind: Disable KMSAN checks s390: Implement the architecture-specific KMSAN functions kmsan: Enable on s390 Documentation/dev-tools/kmsan.rst | 4 +- arch/s390/Kconfig | 1 + arch/s390/Makefile | 2 +- arch/s390/boot/Makefile | 3 + arch/s390/boot/kmsan.c | 6 ++ arch/s390/boot/startup.c| 7 ++ arch/s390/boot/string.c 
| 16 arch/s390/include/asm/checksum.h| 2 + arch/s390/include/asm/cpacf.h | 3 + arch/s390/include/asm/cpu_mf.h | 6 ++ arch/s390/include/asm/irqflags.h| 17 - arch/s390/include/asm/kmsan.h | 43 +++ arch/s390/include/asm/pgtable.h | 8 ++ arch/s390/include/asm/string.h | 20 +++-- arch/s390/include/asm/thread_info.h | 2 +- arch/s390/include/asm/uaccess.h | 111 arch/s390/kernel/diag.c | 10
[PATCH v4 13/35] kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()
Improve the readability by replacing the custom aligning logic with
ALIGN_DOWN(). Unlike other places where a similar sequence is used,
there is no size parameter that needs to be adjusted, so the standard
macro fits.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/shadow.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 2d57408c78ae..9c58f081d84f 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -123,14 +123,12 @@ struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *address, u64 size,
  */
 void *kmsan_get_metadata(void *address, bool is_origin)
 {
-	u64 addr = (u64)address, pad, off;
+	u64 addr = (u64)address, off;
	struct page *page;
	void *ret;

-	if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) {
-		pad = addr % KMSAN_ORIGIN_SIZE;
-		addr -= pad;
-	}
+	if (is_origin)
+		addr = ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE);
	address = (void *)addr;
	if (kmsan_internal_is_vmalloc_addr(address) ||
	    kmsan_internal_is_module_addr(address))
-- 
2.45.1
[PATCH v4 12/35] kmsan: Support SLAB_POISON
Avoid false KMSAN negatives with SLUB_DEBUG by allowing kmsan_slab_free() to poison the freed memory, and by preventing init_object() from unpoisoning new allocations by using __memset(). There are two alternatives to this approach. First, init_object() can be marked with __no_sanitize_memory. This annotation should be used with great care, because it drops all instrumentation from the function, and any shadow writes will be lost. Even though this is not a concern with the current init_object() implementation, this may change in the future. Second, kmsan_poison_memory() calls may be added after memset() calls. The downside is that init_object() is called from free_debug_processing(), in which case poisoning will erase the distinction between simply uninitialized memory and UAF. Signed-off-by: Ilya Leoshkevich --- mm/kmsan/hooks.c | 2 +- mm/slub.c| 13 + 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c index 267d0afa2e8b..26d86dfdc819 100644 --- a/mm/kmsan/hooks.c +++ b/mm/kmsan/hooks.c @@ -74,7 +74,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object) return; /* RCU slabs could be legally used after free within the RCU period */ - if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON))) + if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) return; /* * If there's a constructor, freed memory must remain in the same state diff --git a/mm/slub.c b/mm/slub.c index 1373ac365a46..4dd55cabe701 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1139,7 +1139,12 @@ static void init_object(struct kmem_cache *s, void *object, u8 val) unsigned int poison_size = s->object_size; if (s->flags & SLAB_RED_ZONE) { - memset(p - s->red_left_pad, val, s->red_left_pad); + /* +* Use __memset() here and below in order to avoid overwriting +* the KMSAN shadow. Keeping the shadow makes it possible to +* distinguish uninit-value from use-after-free. 
+	 */
+	__memset(p - s->red_left_pad, val, s->red_left_pad);

		if (slub_debug_orig_size(s) && val == SLUB_RED_ACTIVE) {
			/*
@@ -1152,12 +1157,12 @@ static void init_object(struct kmem_cache *s, void *object, u8 val)
	}

	if (s->flags & __OBJECT_POISON) {
-		memset(p, POISON_FREE, poison_size - 1);
-		p[poison_size - 1] = POISON_END;
+		__memset(p, POISON_FREE, poison_size - 1);
+		__memset(p + poison_size - 1, POISON_END, 1);
	}

	if (s->flags & SLAB_RED_ZONE)
-		memset(p + poison_size, val, s->inuse - poison_size);
+		__memset(p + poison_size, val, s->inuse - poison_size);
 }

 static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
-- 
2.45.1
[PATCH v4 14/35] kmsan: Do not round up pg_data_t size
x86's alloc_node_data() rounds up node data size to PAGE_SIZE. It's not
explained why it's needed, but it's most likely for performance
reasons, since the padding bytes are not used anywhere. Some other
architectures do it as well, e.g., mips rounds it up to the cache line
size.

kmsan_init_shadow() initializes metadata for each node data and assumes
the x86 rounding, which does not match other architectures. This may
cause the range end to overshoot the end of available memory, in turn
causing virt_to_page_or_null() in kmsan_init_alloc_meta_for_range() to
return NULL, which leads to kernel panic shortly after.

Since the padding bytes are not used, drop the rounding.

Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index 3ac3b8921d36..9de76ac7062c 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -72,7 +72,7 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end)
  */
 void __init kmsan_init_shadow(void)
 {
-	const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+	const size_t nd_size = sizeof(pg_data_t);
	phys_addr_t p_start, p_end;
	u64 loop;
	int nid;
-- 
2.45.1
[PATCH v4 15/35] mm: slub: Let KMSAN access metadata
Building the kernel with CONFIG_SLUB_DEBUG and CONFIG_KMSAN causes
KMSAN to complain about touching redzones in kfree().

Fix by extending the existing KASAN-related metadata_access_enable()
and metadata_access_disable() functions to KMSAN.

Acked-by: Vlastimil Babka
Signed-off-by: Ilya Leoshkevich
---
 mm/slub.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 4dd55cabe701..a290f6c63e7b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -829,10 +829,12 @@ static int disable_higher_order_debug;
 static inline void metadata_access_enable(void)
 {
	kasan_disable_current();
+	kmsan_disable_current();
 }

 static inline void metadata_access_disable(void)
 {
+	kmsan_enable_current();
	kasan_enable_current();
 }
-- 
2.45.1
[PATCH v4 07/35] kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush()
The value assigned to prot is immediately overwritten on the next line
with PAGE_KERNEL. The right hand side of the assignment has no
side-effects.

Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Suggested-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/shadow.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index b9d05aff313e..2d57408c78ae 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -243,7 +243,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end,
		s_pages[i] = shadow_page_for(pages[i]);
		o_pages[i] = origin_page_for(pages[i]);
	}
-	prot = __pgprot(pgprot_val(prot) | _PAGE_NX);
	prot = PAGE_KERNEL;

	origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN);
-- 
2.45.1
[PATCH v4 03/35] kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled
KMSAN relies on memblock returning all available pages to it (see
kmsan_memblock_free_pages()). It partitions these pages into 3
categories: pages available to the buddy allocator, shadow pages and
origin pages. This partitioning is static.

If new pages appear after kmsan_init_runtime(), it is considered an
error. DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as
incompatible with KMSAN.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..9791fce5d0a7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -946,6 +946,7 @@ config DEFERRED_STRUCT_PAGE_INIT
	depends on SPARSEMEM
	depends on !NEED_PER_CPU_KM
	depends on 64BIT
+	depends on !KMSAN
	select PADATA
	help
	  Ordinarily all struct pages are initialised during early boot in a
-- 
2.45.1
[PATCH v4 06/35] kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Assume that we are handling user memory access in
this case.

Reported-by: Alexander Gordeev
Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 22e8657800ef..b408714f9ba3 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -267,7 +267,8 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
		return;

	ua_flags = user_access_save();
-	if ((u64)to < TASK_SIZE) {
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) ||
+	    (u64)to < TASK_SIZE) {
		/* This is a user memory access, check it. */
		kmsan_internal_check_memory((void *)from, to_copy - left, to,
					    REASON_COPY_TO_USER);
-- 
2.45.1
[PATCH v4 02/35] kmsan: Make the tests compatible with kmsan.panic=1
It's useful to have both tests and kmsan.panic=1 during development,
but right now the warnings that the tests cause lead to kernel panics.
Temporarily set kmsan.panic=0 for the duration of the KMSAN testing.

Reviewed-by: Alexander Potapenko
Signed-off-by: Ilya Leoshkevich
---
 mm/kmsan/kmsan_test.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c
index 07d3a3a5a9c5..9bfd11674fe3 100644
--- a/mm/kmsan/kmsan_test.c
+++ b/mm/kmsan/kmsan_test.c
@@ -659,9 +659,13 @@ static void test_exit(struct kunit *test)
 {
 }

+static int orig_panic_on_kmsan;
+
 static int kmsan_suite_init(struct kunit_suite *suite)
 {
	register_trace_console(probe_console, NULL);
+	orig_panic_on_kmsan = panic_on_kmsan;
+	panic_on_kmsan = 0;
	return 0;
 }

@@ -669,6 +673,7 @@ static void kmsan_suite_exit(struct kunit_suite *suite)
 {
	unregister_trace_console(probe_console, NULL);
	tracepoint_synchronize_unregister();
+	panic_on_kmsan = orig_panic_on_kmsan;
 }

 static struct kunit_suite kmsan_test_suite = {
-- 
2.45.1
Re: [PATCH v2] net: missing check virtio
Thu, Jun 13, 2024 at 11:54:48AM CEST, are...@swemel.ru wrote: >Two missing check in virtio_net_hdr_to_skb() allowed syzbot >to crash kernels again > >1. After the skb_segment function the buffer may become non-linear >(nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere >the __skb_linearize function will not be executed, then the buffer will >remain non-linear. Then the condition (offset >= skb_headlen(skb)) >becomes true, which causes WARN_ON_ONCE in skb_checksum_help. > >2. The struct sk_buff and struct virtio_net_hdr members must be >mathematically related. >(gso_size) must be greater than (needed) otherwise WARN_ON_ONCE. >(remainder) must be greater than (needed) otherwise WARN_ON_ONCE. >(remainder) may be 0 if division is without remainder. > >offset+2 (4191) > skb_headlen() (1116) >WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 >net/core/dev.c:3303 >Modules linked in: >CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted >6.7.0-rc3-syzkaller-00014-gdf60cee26a2e #0 >Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google >11/10/2023 >RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303 >Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b >53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe >ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef >RSP: 0018:c90003a9f338 EFLAGS: 00010286 >RAX: RBX: 888025125780 RCX: 814db209 >RDX: 888015393b80 RSI: 814db216 RDI: 0001 >RBP: 8880251257f4 R08: 0001 R09: >R10: R11: 0001 R12: 045c >R13: 105f R14: 8880251257f0 R15: 105d >FS: 55c24380() GS:8880b990() knlGS: >CS: 0010 DS: ES: CR0: 80050033 >CR2: 2000f000 CR3: 23151000 CR4: 003506f0 >DR0: DR1: DR2: >DR3: DR6: fffe0ff0 DR7: 0400 >Call Trace: > > ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777 > ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584 > ip_finish_output_gso net/ipv4/ip_output.c:286 [inline] > __ip_finish_output net/ipv4/ip_output.c:308 
[inline]
> __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295
> ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
> NF_HOOK_COND include/linux/netfilter.h:303 [inline]
> ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433
> dst_output include/net/dst.h:451 [inline]
> ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129
> iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
> ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline]
> sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076
> __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
> netdev_start_xmit include/linux/netdevice.h:4954 [inline]
> xmit_one net/core/dev.c:3545 [inline]
> dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561
> __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346
> dev_queue_xmit include/linux/netdevice.h:3134 [inline]
> packet_xmit+0x257/0x380 net/packet/af_packet.c:276
> packet_snd net/packet/af_packet.c:3087 [inline]
> packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119
> sock_sendmsg_nosec net/socket.c:730 [inline]
> __sock_sendmsg+0xd5/0x180 net/socket.c:745
> __sys_sendto+0x255/0x340 net/socket.c:2190
> __do_sys_sendto net/socket.c:2202 [inline]
> __se_sys_sendto net/socket.c:2198 [inline]
> __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
> do_syscall_x64 arch/x86/entry/common.c:51 [inline]
> do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
> entry_SYSCALL_64_after_hwframe+0x63/0x6b
>
>Found by Linux Verification Center (linuxtesting.org) with Syzkaller
>
>Signed-off-by: Denis Arefev

Could you please provide a "Fixes" tag blaming the commit which
introduced the bug?
>--- > V1 -> V2: incorrect type in argument 2 > include/linux/virtio_net.h | 11 +++ > 1 file changed, 11 insertions(+) > >diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h >index 4dfa9b69ca8d..d1d7825318c3 100644 >--- a/include/linux/virtio_net.h >+++ b/include/linux/virtio_net.h >@@ -56,6 +56,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, > unsigned int thlen = 0; > unsigned int p_off = 0; > unsigned int ip_proto; >+ u64 ret, remainder, gso_size; > > if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { > switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) { >@@ -98,6 +99,16 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, > u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset); > u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16)); > >+ if (hdr->gso_size) { >+ gso_size = __virtio16_to_cpu(little_endian, >hdr->gso_size); >+ ret = div64_u64_rem(skb->len, gso_size, ); >+
Re: [PATCH v5 8/8] Docs/.../mm/damon: add more damos actions
Hi Honggyu, On Thu, 13 Jun 2024 22:20:55 +0900 Honggyu Kim wrote: > This patch adds damon description for "migrate_hot" and "migrate_cold" > actions for both usage and design documents as long as a new > "target_nid" knob to set the migration target node. > > Signed-off-by: Honggyu Kim > --- > Documentation/admin-guide/mm/damon/usage.rst | 8 +++- > Documentation/mm/damon/design.rst| 4 > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/mm/damon/usage.rst > b/Documentation/admin-guide/mm/damon/usage.rst > index 7bff54963975..84d62d16c9f9 100644 > --- a/Documentation/admin-guide/mm/damon/usage.rst > +++ b/Documentation/admin-guide/mm/damon/usage.rst > @@ -300,6 +300,10 @@ from the file and their meaning are same to those of the > list on > The ``apply_interval_us`` file is for setting and getting the scheme's > :ref:`apply_interval ` in microseconds. > > +The ``target_nid`` file is for setting the migration target node, which is > +only meaningful when the ``action`` is either ``migrate_hot`` or > +``migrate_cold``. > + > .. _sysfs_access_pattern: > > schemes//access_pattern/ > @@ -759,7 +763,9 @@ list on :ref:`design doc `. > - 4: ``nohugepage`` > - 5: ``lru_prio`` > - 6: ``lru_deprio`` > - - 7: ``stat`` > + - 7: ``migrate_hot`` > + - 8: ``migrate_cold`` > + - 9: ``stat`` This section is for DAMON debugfs interface. And to my understanding, this patchset is not adding support of migrate_{hot,cold} DAMOS actions on DAMON debugfs interface. So I think this part should be removed. > > Quota > ~ > diff --git a/Documentation/mm/damon/design.rst > b/Documentation/mm/damon/design.rst > index 3df387249937..3f12c884eb3a 100644 > --- a/Documentation/mm/damon/design.rst > +++ b/Documentation/mm/damon/design.rst > @@ -325,6 +325,10 @@ that supports each action are as below. > Supported by ``paddr`` operations set. > - ``lru_deprio``: Deprioritize the region on its LRU lists. > Supported by ``paddr`` operations set. 
> + - ``migrate_hot``: Migrate the regions prioritizing warmer regions.
> +   Supported by ``paddr`` operations set.
> + - ``migrate_cold``: Migrate the regions prioritizing colder regions.
> +   Supported by ``paddr`` operations set.
>  - ``stat``: Do nothing but count the statistics.
>    Supported by all operations sets.

Except the DAMON debugfs interface section, this patch looks good to me.

Thanks,
SJ

[...]
Re: [PATCH 1/2] vdpa: support set mac address from vdpa tool
Thu, Jun 13, 2024 at 09:50:54AM CEST, m...@redhat.com wrote: >On Thu, Jun 13, 2024 at 09:21:07AM +0200, Jiri Pirko wrote: >> Thu, Jun 13, 2024 at 08:49:25AM CEST, m...@redhat.com wrote: >> >On Wed, Jun 12, 2024 at 09:22:32AM +0200, Jiri Pirko wrote: >> >> Wed, Jun 12, 2024 at 09:15:44AM CEST, m...@redhat.com wrote: >> >> >On Wed, Jun 12, 2024 at 08:29:53AM +0200, Jiri Pirko wrote: >> >> >> Wed, Jun 12, 2024 at 03:58:10AM CEST, k...@kernel.org wrote: >> >> >> >On Tue, 11 Jun 2024 13:32:32 +0800 Cindy Lu wrote: >> >> >> >> Add new UAPI to support the mac address from vdpa tool >> >> >> >> Function vdpa_nl_cmd_dev_config_set_doit() will get the >> >> >> >> MAC address from the vdpa tool and then set it to the device. >> >> >> >> >> >> >> >> The usage is: vdpa dev set name vdpa_name mac **:**:**:**:**:** >> >> >> > >> >> >> >Why don't you use devlink? >> >> >> >> >> >> Fair question. Why does vdpa-specific uapi even exist? To have >> >> >> driver-specific uapi Does not make any sense to me :/ >> >> > >> >> >I am not sure which uapi do you refer to? The one this patch proposes or >> >> >the existing one? >> >> >> >> Sure, I'm sure pointing out, that devlink should have been the answer >> >> instead of vdpa netlink introduction. That ship is sailed, >> > >> >> now we have >> >> unfortunate api duplication which leads to questions like Jakub's one. >> >> That's all :/ >> > >> > >> > >> >Yea there's no point to argue now, there were arguments this and that >> >way. I don't think we currently have a lot >> >of duplication, do we? >> >> True. I think it would be good to establish guidelines for api >> extensions in this area. >> >> > >> >-- >> >MST >> > > > >Guidelines are good, are there existing examples of such guidelines in >Linux to follow though? Specifically after reviewing this some more, I Documentation directory in general. >think what Cindy is trying to do is actually provisioning as opposed to >programming. > >-- >MST >
Re: [PATCH] function_graph: Add READ_ONCE() when accessing fgraph_array[]
On Thu, 13 Jun 2024 09:52:23 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > In function_graph_enter() there's a loop that looks at fgraph_array[] > elements which are fgraph_ops. It first tests if it is a fgraph_stub op, > and if so skips it, as that's just there as a place holder. Then it checks > the fgraph_ops filters to see if the ops wants to trace the current > function. > > But if the compiler reloads the fgraph_array[] after the check against > fgraph_stub, it could race with the fgraph_array[] being updated with the > fgraph_stub. That would cause the stub to be processed. But the stub has a > null "func_hash" field which will cause a NULL pointer dereference. > > Add a READ_ONCE() so that the gops that is compared against the > fgraph_stub is also the gops that is processed later. > > Link: > https://lore.kernel.org/all/CA+G9fYsSVJQZH=nM=1cjtc94pgsnmf9y65bnov6xsocg_b6...@mail.gmail.com/ > > Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize > entry function") > Reported-by: Naresh Kamboju > Signed-off-by: Steven Rostedt (Google) Looks good to me. Acked-by: Masami Hiramatsu (Google) Thanks. > --- > kernel/trace/fgraph.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c > index 8317d1a7f43a..fc205ad167a9 100644 > --- a/kernel/trace/fgraph.c > +++ b/kernel/trace/fgraph.c > @@ -641,7 +641,7 @@ int function_graph_enter(unsigned long ret, unsigned long > func, > { > for_each_set_bit(i, &fgraph_array_bitmask, > sizeof(fgraph_array_bitmask) * > BITS_PER_BYTE) { > - struct fgraph_ops *gops = fgraph_array[i]; > + struct fgraph_ops *gops = READ_ONCE(fgraph_array[i]); > int save_curr_ret_stack; > > if (gops == &fgraph_stub) > -- > 2.43.0 > -- Masami Hiramatsu (Google)
Re: [PATCH 1/1] s390/virtio_ccw: fix config change notifications
On Tue, 2024-06-11 at 23:47 +0200, Halil Pasic wrote: > Commit e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > broke configuration change notifications for virtio-ccw by putting > the > DMA address of *indicatorp directly into ccw->cda disregarding the > fact > that if !!(vcdev->is_thinint) then the function > virtio_ccw_register_adapter_ind() will overwrite that ccw->cda value > with the address of the virtio_thinint_area so it can actually set up > the adapter interrupts via CCW_CMD_SET_IND_ADAPTER. Thus we end up > pointing to the wrong object for both CCW_CMD_SET_IND if setting up > the > adapter interrupts fails, and for CCW_CMD_SET_CONF_IND regardless > whether it succeeds or fails. > > To fix this, let us save away the dma address of *indicatorp in a > local > variable, and copy it to ccw->cda after the "vcdev->is_thinint" > branch. > > Reported-by: Boqiao Fu > Reported-by: Sebastian Mitterle > Fixes: e3e9bda38e6d ("s390/virtio_ccw: use DMA handle from DMA API") > Signed-off-by: Halil Pasic > --- > I know that checkpatch.pl complains about a missing 'Closes' tag. > Unfortunately I don't have an appropriate URL at hand. @Sebastian, > @Boqiao: do you have any suggetions? > --- > drivers/s390/virtio/virtio_ccw.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) Reviewed-by: Eric Farman
Re: [PATCH v5 7/8] Docs/admin-guide/mm/damon/usage: add missing actions
Hi Honggyu, On Thu, 13 Jun 2024 22:20:54 +0900 Honggyu Kim wrote: > "lru_prio" and "lru_deprio" are missing in the damos action list so they > have to be added properly at damon usage document. > > Signed-off-by: Honggyu Kim > --- > Documentation/admin-guide/mm/damon/usage.rst | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/mm/damon/usage.rst > b/Documentation/admin-guide/mm/damon/usage.rst > index e58ceb89ea2a..7bff54963975 100644 > --- a/Documentation/admin-guide/mm/damon/usage.rst > +++ b/Documentation/admin-guide/mm/damon/usage.rst > @@ -757,7 +757,9 @@ list on :ref:`design doc `. > - 2: ``pageout`` > - 3: ``hugepage`` > - 4: ``nohugepage`` > - - 5: ``stat`` > + - 5: ``lru_prio`` > + - 6: ``lru_deprio`` > + - 7: ``stat`` This section is for DAMON debugfs interface, which is deprecated. After the deprecation, we stopped supporting new DAMOS actions. This section is not documenting lru_[de]prio for the reason. I think this patch is not needed. Please let me know if I'm missing something, though. Thanks, SJ [...]
Re: [PATCH 0/8] DAMON based tiered memory management for CXL memory
On Thu, 13 Jun 2024 22:27:26 +0900 Honggyu Kim wrote: > On Thu, 13 Jun 2024 22:17:31 +0900 Honggyu Kim wrote: > > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > > posted at [1]. > > > > It says there is no implementation of the demote/promote DAMOS action > > are made. This patch series is about its implementation for physical > > address space so that this scheme can be applied in system wide level. > > > > Changes from RFC v4: > > https://lore.kernel.org/20240512175447.75943-1...@kernel.org > > 1. Add usage and design documents > > 2. Rename alloc_demote_folio to alloc_migrate_folio > > 3. Add evaluation results with "demotion_enabled" true > > 4. Rebase based on v6.10-rc3 > > Sorry for causing confusion, I didn't add "PATCH v5" tag for this patch > series so please ignore and see the resent one. Thank you for clarifying this. Nonetheless, I don't mind resetting the version number of a patchset after dropping RFC. Actually, I personally rather prefer resetting the version number. Anyway, I don't care that much. Please use any way that you feel more comfortable :) Please just keep the number monotonically increasing. Thanks, SJ > > Honggyu
[PATCH] function_graph: Add READ_ONCE() when accessing fgraph_array[]
From: "Steven Rostedt (Google)" In function_graph_enter() there's a loop that looks at fgraph_array[] elements which are fgraph_ops. It first tests if it is a fgraph_stub op, and if so skips it, as that's just there as a place holder. Then it checks the fgraph_ops filters to see if the ops wants to trace the current function. But if the compiler reloads the fgraph_array[] after the check against fgraph_stub, it could race with the fgraph_array[] being updated with the fgraph_stub. That would cause the stub to be processed. But the stub has a null "func_hash" field which will cause a NULL pointer dereference. Add a READ_ONCE() so that the gops that is compared against the fgraph_stub is also the gops that is processed later. Link: https://lore.kernel.org/all/CA+G9fYsSVJQZH=nM=1cjtc94pgsnmf9y65bnov6xsocg_b6...@mail.gmail.com/ Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Reported-by: Naresh Kamboju Signed-off-by: Steven Rostedt (Google) --- kernel/trace/fgraph.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index 8317d1a7f43a..fc205ad167a9 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -641,7 +641,7 @@ int function_graph_enter(unsigned long ret, unsigned long func, { for_each_set_bit(i, &fgraph_array_bitmask, sizeof(fgraph_array_bitmask) * BITS_PER_BYTE) { - struct fgraph_ops *gops = fgraph_array[i]; + struct fgraph_ops *gops = READ_ONCE(fgraph_array[i]); int save_curr_ret_stack; if (gops == &fgraph_stub) -- 2.43.0
[PATCH 6/6] ftrace: Revert the FTRACE_MCOUNT_MAX_OFFSET workaround
After patch titled "ftrace: Skip invalid __fentry__ in ftrace_process_locs()", __fentry__ locations in overridden weak function have been checked and skipped, then all records in ftrace_pages are valid, the FTRACE_MCOUNT_MAX_OFFSET workaround can be reverted, include: 1. commit b39181f7c690 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") 2. commit 7af82ff90a2b ("powerpc/ftrace: Ignore weak functions") 3. commit f6834c8c59a8 ("powerpc/ftrace: Fix dropping weak symbols with older toolchains") Signed-off-by: Zheng Yejian --- arch/powerpc/include/asm/ftrace.h | 7 -- arch/x86/include/asm/ftrace.h | 7 -- kernel/trace/ftrace.c | 141 +- 3 files changed, 2 insertions(+), 153 deletions(-) diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h index 107fc5a48456..d6ed058c8041 100644 --- a/arch/powerpc/include/asm/ftrace.h +++ b/arch/powerpc/include/asm/ftrace.h @@ -10,13 +10,6 @@ #define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR -/* Ignore unused weak functions which will have larger offsets */ -#if defined(CONFIG_MPROFILE_KERNEL) || defined(CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY) -#define FTRACE_MCOUNT_MAX_OFFSET 16 -#elif defined(CONFIG_PPC32) -#define FTRACE_MCOUNT_MAX_OFFSET 8 -#endif - #ifndef __ASSEMBLY__ extern void _mcount(void); diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h index 897cf02c20b1..7a147c9da08d 100644 --- a/arch/x86/include/asm/ftrace.h +++ b/arch/x86/include/asm/ftrace.h @@ -9,13 +9,6 @@ # define MCOUNT_ADDR ((unsigned long)(__fentry__)) #define MCOUNT_INSN_SIZE 5 /* sizeof mcount call */ -/* Ignore unused weak functions which will have non zero offsets */ -#ifdef CONFIG_HAVE_FENTRY -# include -/* Add offset for endbr64 if IBT enabled */ -# define FTRACE_MCOUNT_MAX_OFFSET ENDBR_INSN_SIZE -#endif - #ifdef CONFIG_DYNAMIC_FTRACE #define ARCH_SUPPORTS_FTRACE_OPS 1 #endif diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index c46c35ac9b42..1d60dc9a850b 100644 --- 
a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -49,8 +49,6 @@ #define FTRACE_NOCLEAR_FLAGS (FTRACE_FL_DISABLED | FTRACE_FL_TOUCHED | \ FTRACE_FL_MODIFIED) -#define FTRACE_INVALID_FUNCTION"__ftrace_invalid_address__" - #define FTRACE_WARN_ON(cond) \ ({ \ int ___r = cond;\ @@ -3709,105 +3707,6 @@ static void add_trampoline_func(struct seq_file *m, struct ftrace_ops *ops, seq_printf(m, " ->%pS", ptr); } -#ifdef FTRACE_MCOUNT_MAX_OFFSET -/* - * Weak functions can still have an mcount/fentry that is saved in - * the __mcount_loc section. These can be detected by having a - * symbol offset of greater than FTRACE_MCOUNT_MAX_OFFSET, as the - * symbol found by kallsyms is not the function that the mcount/fentry - * is part of. The offset is much greater in these cases. - * - * Test the record to make sure that the ip points to a valid kallsyms - * and if not, mark it disabled. - */ -static int test_for_valid_rec(struct dyn_ftrace *rec) -{ - char str[KSYM_SYMBOL_LEN]; - unsigned long offset; - const char *ret; - - ret = kallsyms_lookup(rec->ip, NULL, , NULL, str); - - /* Weak functions can cause invalid addresses */ - if (!ret || offset > FTRACE_MCOUNT_MAX_OFFSET) { - rec->flags |= FTRACE_FL_DISABLED; - return 0; - } - return 1; -} - -static struct workqueue_struct *ftrace_check_wq __initdata; -static struct work_struct ftrace_check_work __initdata; - -/* - * Scan all the mcount/fentry entries to make sure they are valid. 
- */ -static __init void ftrace_check_work_func(struct work_struct *work) -{ - struct ftrace_page *pg; - struct dyn_ftrace *rec; - - mutex_lock(_lock); - do_for_each_ftrace_rec(pg, rec) { - test_for_valid_rec(rec); - } while_for_each_ftrace_rec(); - mutex_unlock(_lock); -} - -static int __init ftrace_check_for_weak_functions(void) -{ - INIT_WORK(_check_work, ftrace_check_work_func); - - ftrace_check_wq = alloc_workqueue("ftrace_check_wq", WQ_UNBOUND, 0); - - queue_work(ftrace_check_wq, _check_work); - return 0; -} - -static int __init ftrace_check_sync(void) -{ - /* Make sure the ftrace_check updates are finished */ - if (ftrace_check_wq) - destroy_workqueue(ftrace_check_wq); - return 0; -} - -late_initcall_sync(ftrace_check_sync); -subsys_initcall(ftrace_check_for_weak_functions); - -static int print_rec(struct seq_file *m, unsigned long ip) -{ - unsigned long offset; - char str[KSYM_SYMBOL_LEN]; - char *modname; - const char *ret; - - ret = kallsyms_lookup(ip, NULL, , , str); - /* Weak functions can cause invalid addresses */ - if (!ret
[PATCH 3/6] module: kallsyms: Determine exact function size
When a weak type function is overridden, its symbol will be removed from the symbol table, but its code will not been removed. It will cause find_kallsyms_symbol() to compute a larger function size than it actually is, just because symbol of its following weak function is removed. To fix this issue, check that an given address is within the size of the function found. Signed-off-by: Zheng Yejian --- include/linux/module.h | 7 +++ kernel/module/kallsyms.c | 19 +-- 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index ffa1c603163c..13518f464d3f 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -597,6 +597,13 @@ static inline unsigned long kallsyms_symbol_value(const Elf_Sym *sym) } #endif +#ifndef HAVE_ARCH_KALLSYMS_SYMBOL_TYPE +static inline unsigned int kallsyms_symbol_type(const Elf_Sym *sym) +{ + return ELF_ST_TYPE(sym->st_info); +} +#endif + /* FIXME: It'd be nice to isolate modules during init, too, so they aren't used before they (may) fail. But presently too much code (IDE & SCSI) require entry into the module during init.*/ diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c index 62fb57bb9f16..092ae6f43dad 100644 --- a/kernel/module/kallsyms.c +++ b/kernel/module/kallsyms.c @@ -262,6 +262,7 @@ static const char *find_kallsyms_symbol(struct module *mod, unsigned long nextval, bestval; struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms); struct module_memory *mod_mem; + const Elf_Sym *sym; /* At worse, next value is at end of module */ if (within_module_init(addr, mod)) @@ -278,9 +279,10 @@ static const char *find_kallsyms_symbol(struct module *mod, * starts real symbols at 1). 
*/ for (i = 1; i < kallsyms->num_symtab; i++) { - const Elf_Sym *sym = >symtab[i]; - unsigned long thisval = kallsyms_symbol_value(sym); + unsigned long thisval; + sym = >symtab[i]; + thisval = kallsyms_symbol_value(sym); if (sym->st_shndx == SHN_UNDEF) continue; @@ -292,6 +294,13 @@ static const char *find_kallsyms_symbol(struct module *mod, is_mapping_symbol(kallsyms_symbol_name(kallsyms, i))) continue; + if (kallsyms_symbol_type(sym) == STT_FUNC && + addr >= thisval && addr < thisval + sym->st_size) { + best = i; + bestval = thisval; + nextval = thisval + sym->st_size; + goto find; + } if (thisval <= addr && thisval > bestval) { best = i; bestval = thisval; @@ -303,6 +312,12 @@ static const char *find_kallsyms_symbol(struct module *mod, if (!best) return NULL; + sym = >symtab[best]; + if (kallsyms_symbol_type(sym) == STT_FUNC && sym->st_size && + addr >= kallsyms_symbol_value(sym) + sym->st_size) + return NULL; + +find: if (size) *size = nextval - bestval; if (offset) -- 2.25.1
Re: [PATCH v7 3/5] regulator: add regulators driver for Marvell 88PM886 PMIC
On Fri, 31 May 2024, Karel Balej wrote: > Support the LDO and buck regulators of the Marvell 88PM886 PMIC. > > Signed-off-by: Karel Balej > --- > > Notes: > v7: > - Address Mark's feedback: > - Drop get_current_limit op, max_uA values and thus unneeded struct > pm886_regulator and adapt the code accordingly. > v6: > - Remove all definitions (now present in the header). > v5: > - Add remaining regulators. > - Clean up includes. > - Address Mark's feedback: > - Use dedicated regmap config. > RFC v4: > - Initialize regulators regmap in the regulators driver. > - Register all regulators at once. > - Drop regulator IDs. > - Add missing '\n' to dev_err_probe message. > - Fix includes. > - Add ID table. > RFC v3: > - Do not have a variable for each regulator -- define them all in the > pm886_regulators array. > - Use new regulators regmap index name. > - Use dev_err_probe. > RFC v2: > - Drop of_compatible and related code. > - Drop unused include. > - Remove some abstraction: use only one regmap for all regulators and > only mention 88PM886 in Kconfig description. > - Reword commit message. > > drivers/regulator/88pm886-regulator.c | 392 ++ > drivers/regulator/Kconfig | 6 + > drivers/regulator/Makefile| 1 + > 3 files changed, 399 insertions(+) > create mode 100644 drivers/regulator/88pm886-regulator.c I'm fine with this set - just waiting for Mark to review. -- Lee Jones [李琼斯]
Re: [PATCH 0/8] DAMON based tiered memory management for CXL memory
On Thu, 13 Jun 2024 22:17:31 +0900 Honggyu Kim wrote: > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously > posted at [1]. > > It says there is no implementation of the demote/promote DAMOS action > are made. This patch series is about its implementation for physical > address space so that this scheme can be applied in system wide level. > > Changes from RFC v4: > https://lore.kernel.org/20240512175447.75943-1...@kernel.org > 1. Add usage and design documents > 2. Rename alloc_demote_folio to alloc_migrate_folio > 3. Add evaluation results with "demotion_enabled" true > 4. Rebase based on v6.10-rc3 Sorry for causing confusion, I didn't add "PATCH v5" tag for this patch series so please ignore and see the resent one. Honggyu
[PATCH 5/6] ftrace: Fix possible out-of-bound issue in ftrace_process_locs()
In ftrace_process_locs(), a series of pages are prepared and linked in start_pg, then fentry records are skipped or added, then unused pages are freed. However, if all records are skipped, the start_pg will currently still be in the list of ftrace_pages_start but without any record. Then in ftrace_free_mem(), indexing a record by (pg->index - 1) will be out of bounds. To fix this issue, properly handle the unused start_pg and add WARN_ON_ONCE() where the records need to be indexed. Fixes: 26efd79c4624 ("ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()") Signed-off-by: Zheng Yejian --- kernel/trace/ftrace.c | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 0e8628e4d296..c46c35ac9b42 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -6575,10 +6575,22 @@ static int ftrace_process_locs(struct module *mod, rec->ip = addr; } - if (pg->next) { + if (pg->index == 0) { + /* No record is added on the current page, so it's unused */ + pg_unuse = pg; + } else if (pg->next) { + /* Current page has records, so it's next page is unused */ pg_unuse = pg->next; pg->next = NULL; } + /* +* Even the start_pg hasn't been used, that means, no record has +* been added, so restore state of ftrace_pages and just go out. 
+*/ + if (pg_unuse == start_pg) { + ftrace_pages->next = NULL; + goto out; + } /* Assign the last page to ftrace_pages */ ftrace_pages = pg; @@ -6794,6 +6806,8 @@ void ftrace_release_mod(struct module *mod) */ last_pg = &ftrace_pages_start; for (pg = ftrace_pages_start; pg; pg = *last_pg) { + /* The page should have at lease one record */ + WARN_ON_ONCE(!pg->index); rec = &pg->records[0]; if (within_module(rec->ip, mod)) { /* @@ -7176,6 +7190,8 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, void *end_ptr) mod_map = allocate_ftrace_mod_map(mod, start, end); for (pg = ftrace_pages_start; pg; last_pg = &pg->next, pg = *last_pg) { + /* The page should have at lease one record */ + WARN_ON_ONCE(!pg->index); if (end < pg->records[0].ip || start >= (pg->records[pg->index - 1].ip + MCOUNT_INSN_SIZE)) continue; -- 2.25.1
[PATCH 1/6] kallsyms: Optimize multiple times of realloc() to one time of malloc()
Array 'table' is used to store pointers of symbols that read from in.map file, and its size depends on the number of symbols. Currently 'table' is expanded by calling realloc() every 1 symbols read. However, there generally are around 10+ symbols, which means that the expansion is generally 10+ times. As an optimization, introduce linked list 'sym_list' to associate and count all symbols, then store them into 'table' at one time. Signed-off-by: Zheng Yejian --- scripts/kallsyms.c | 33 - 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c index 47978efe4797..6559a9802f6e 100644 --- a/scripts/kallsyms.c +++ b/scripts/kallsyms.c @@ -33,6 +33,7 @@ #define KSYM_NAME_LEN 512 struct sym_entry { + struct sym_entry *next; unsigned long long addr; unsigned int len; unsigned int seq; @@ -60,7 +61,8 @@ static struct addr_range percpu_range = { }; static struct sym_entry **table; -static unsigned int table_size, table_cnt; +static struct sym_entry *sym_list; +static unsigned int table_cnt; static int all_symbols; static int absolute_percpu; static int base_relative; @@ -273,6 +275,7 @@ static void read_map(const char *in) struct sym_entry *sym; char *buf = NULL; size_t buflen = 0; + int i; fp = fopen(in, "r"); if (!fp) { @@ -286,18 +289,22 @@ static void read_map(const char *in) continue; sym->start_pos = table_cnt; - - if (table_cnt >= table_size) { - table_size += 1; - table = realloc(table, sizeof(*table) * table_size); - if (!table) { - fprintf(stderr, "out of memory\n"); - fclose(fp); - exit (1); - } - } - - table[table_cnt++] = sym; + table_cnt++; + sym->next = sym_list; + sym_list = sym; + } + table = malloc(sizeof(*table) * table_cnt); + if (!table) { + fprintf(stderr, "unable to allocate memory for table\n"); + free(buf); + fclose(fp); + exit(EXIT_FAILURE); + } + sym = sym_list; + i = table_cnt - 1; + while (sym) { + table[i--] = sym; + sym = sym->next; } free(buf); -- 2.25.1
[PATCH 0/6] kallsyms: Emit symbol for holes in text and fix weak function issue
ftrace_location() was changed to not only return the __fentry__ location when called for the __fentry__ location, but also when called for the sym+0 location after commit aebfd12521d9 ("x86/ibt,ftrace: Search for __fentry__ location"). That is, if the sym+0 location is not __fentry__, ftrace_location() would find one over the entire size of the sym. However, there is a case where more than one __fentry__ exists in the sym range (described below), and ftrace_location() would find the wrong __fentry__ location by binary searching, which would cause its users like livepatch/kprobe/bpf to not work properly on this sym! The case is that, based on current compiler behavior, suppose: - function A is followed by weak function B1 in the same binary file; - weak function B1 is overridden by function B2; Then in the final binary file: - symbol B1 will be removed from the symbol table while its instructions are not removed; - __fentry__ of B1 will still be in the __mcount_loc table; - the function size of A is computed by subtracting the symbol address of A from its next symbol address (see kallsyms_lookup_size_offset()), but because the symbol info of B1 is removed, the next symbol of A is originally the next symbol of B1. In the following example, the function size of A will be (symbol_address_C - symbol_address_A): symbol_address_A symbol_address_B1 (Not in symbol table) symbol_address_C The weak function issue was discovered in commit b39181f7c690 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") but it didn't resolve the issue in ftrace_location(). Peter suggested using the entry size for FUNC type objects to find holes in the text and fill them with a symbol, then checking the mcount locations against the symbol table for every one that falls in a hole [1] [2]. What the patch set does is described as follows: - Patch 1: Do an optimization for scripts/kallsyms.c regarding memory allocation when reading symbols from a file. 
This patch has little to do with the above issue, but since I changed this script, so it also can be reviewed here; - Patch 2: Change scripts/kallsyms.c to emit a symbol where there is a hole in the text, the symbol name is temporarily named "__hole_symbol_X"; - Patch 3: When lookup symbols in module, use entry size info to determine the exact boundaries of a function symbol; - Patch 4: Holes in text have been found in previous patches, now check __fentry__ in mcount table and skip those locate in the holes; - Patch 5: Accidentally found a out-of-bound issue when all __fentry__ are skipped, so fix it; - Patch 6: Revert Steve's patch about the FTRACE_MCOUNT_MAX_OFFSET solution, also two related definition for powerpc. [1] https://lore.kernel.org/all/20240607150228.gr8...@noisy.programming.kicks-ass.net/ [2] https://lore.kernel.org/all/20240611092157.gu40...@noisy.programming.kicks-ass.net/ Zheng Yejian (6): kallsyms: Optimize multiple times of realloc() to one time of malloc() kallsyms: Emit symbol at the holes in the text module: kallsyms: Determine exact function size ftrace: Skip invalid __fentry__ in ftrace_process_locs() ftrace: Fix possible out-of-bound issue in ftrace_process_locs() ftrace: Revert the FTRACE_MCOUNT_MAX_OFFSET workaround arch/powerpc/include/asm/ftrace.h | 7 -- arch/x86/include/asm/ftrace.h | 7 -- include/linux/kallsyms.h | 13 +++ include/linux/module.h| 14 +++ kernel/module/kallsyms.c | 42 ++-- kernel/trace/ftrace.c | 174 ++ scripts/kallsyms.c| 134 --- scripts/link-vmlinux.sh | 4 +- scripts/mksysmap | 2 +- 9 files changed, 216 insertions(+), 181 deletions(-) -- 2.25.1
[PATCH 4/6] ftrace: Skip invalid __fentry__ in ftrace_process_locs()
ftrace_location() was changed to not only return the __fentry__ location when called for the __fentry__ location, but also when called for the sym+0 location after commit aebfd12521d9 ("x86/ibt,ftrace: Search for __fentry__ location"). That is, if sym+0 location is not __fentry__, ftrace_location() would find one over the entire size of the sym. However, there is case that more than one __fentry__ exist in the sym range (described below) and ftrace_location() would find wrong __fentry__ location by binary searching, which would cause its users like livepatch/ kprobe/bpf to not work properly on this sym! The case is that, based on current compiler behavior, suppose: - function A is followed by weak function B1 in same binary file; - weak function B1 is overridden by function B2; Then in the final binary file: - symbol B1 will be removed from symbol table while its instructions are not removed; - __fentry__ of B1 will be still in __mcount_loc table; - function size of A is computed by substracting the symbol address of A from its next symbol address (see kallsyms_lookup_size_offset()), but because symbol info of B1 is removed, the next symbol of A is originally the next symbol of B1. See following example, function sizeof A will be (symbol_address_C - symbol_address_A): symbol_address_A symbol_address_B1 (Not in symbol table) symbol_address_C The weak function issue has been discovered in commit b39181f7c690 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") but it didn't resolve the issue in ftrace_location(). To solve the issue, with Peter's suggestions, in previous patches, all holes in the text have been found and filled with specail symbols, also the same case with module weak function has been handled. Then check and skip __fentry__ that locate in the holes. Also in this patch, introduce following helper functions: 1. kallsyms_is_hole_symbol() is used to check if a function name is of a hole symbol; 2. 
module_kallsyms_find_symbol() is used to check if a __fentry__ locate in a valid function of the given module. It is needed because other symbol lookup functions like module_address_lookup() will find module of the passed address, but as ftrace_process_locs() is called, the module has not been fully loaded, so those lookup functions can not work. Fixes: aebfd12521d9 ("x86/ibt,ftrace: Search for __fentry__ location") Signed-off-by: Zheng Yejian --- include/linux/kallsyms.h | 13 + include/linux/module.h | 7 +++ kernel/module/kallsyms.c | 23 +-- kernel/trace/ftrace.c| 15 ++- 4 files changed, 51 insertions(+), 7 deletions(-) diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index c3f075e8f60c..0bf0d595f244 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -88,6 +88,14 @@ const char *kallsyms_lookup(unsigned long addr, unsigned long *offset, char **modname, char *namebuf); +/* + * Check if the name is of a hole symbol + */ +static inline int kallsyms_is_hole_symbol(const char *name) +{ + return !strcmp(name, "__hole_symbol_X"); +} + /* Look up a kernel symbol and return it in a text buffer. 
*/ extern int sprint_symbol(char *buffer, unsigned long address); extern int sprint_symbol_build_id(char *buffer, unsigned long address); @@ -119,6 +127,11 @@ static inline const char *kallsyms_lookup(unsigned long addr, return NULL; } +static inline int kallsyms_is_hole_symbol(const char *name) +{ + return 0; +} + static inline int sprint_symbol(char *buffer, unsigned long addr) { *buffer = '\0'; diff --git a/include/linux/module.h b/include/linux/module.h index 13518f464d3f..a3fd077ef2a8 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -960,6 +960,8 @@ int module_get_kallsym(unsigned int symnum, unsigned long *value, char *type, unsigned long module_kallsyms_lookup_name(const char *name); unsigned long find_kallsyms_symbol_value(struct module *mod, const char *name); +int module_kallsyms_find_symbol(struct module *mod, unsigned long addr, + unsigned long *size, unsigned long *offset); #else /* CONFIG_MODULES && CONFIG_KALLSYMS */ @@ -1004,6 +1006,11 @@ static inline unsigned long find_kallsyms_symbol_value(struct module *mod, return 0; } +static inline int module_kallsyms_find_symbol(struct module *mod, unsigned long addr, + unsigned long *size, unsigned long *offset) +{ + return 0; +} #endif /* CONFIG_MODULES && CONFIG_KALLSYMS */ #endif /* _LINUX_MODULE_H */ diff --git a/kernel/module/kallsyms.c b/kernel/module/kallsyms.c index 092ae6f43dad..e9c439d81708 100644 --- a/kernel/module/kallsyms.c +++ b/kernel/module/kallsyms.c @@ -253,10 +253,10 @@ static const char *kallsyms_symbol_name(struct mod_kallsyms
[PATCH 2/6] kallsyms: Emit symbol at the holes in the text
When a weak function is overridden, its symbol is removed from the symbol table, but its code is not removed. Besides, since kallsyms carries no size information, the kernel computes a function's size by subtracting its symbol address from the next symbol address (see kallsyms_lookup_size_offset()). Together these cause the size of some functions to be computed larger than they actually are, just because the symbol of the following weak function was removed. This issue also causes multiple __fentry__ locations to be counted within one function's scope, and eventually causes ftrace_location() to find the wrong __fentry__ location. It was reported in Link: https://lore.kernel.org/all/20240607115211.734845-1-zhengyeji...@huawei.com/ Peter suggested changing scripts/kallsyms.c to emit readily identifiable symbol names for all the weak junk, e.g. __weak_junk_N. The name of this kind of symbol needs some discussion, but it is temporarily called "__hole_symbol_X" in this patch: 1. Pass size info to scripts/kallsyms (see mksysmap()); 2. Traverse the sorted function symbols; if one function's address plus its size is less than the next function's address, there is a hole, so emit a symbol "__hole_symbol_X" there with type 't'. 
After this patch, the effect is as follows: $ cat /proc/kallsyms | grep -A 3 do_one_initcall 810021e0 T do_one_initcall 8100245e t __hole_symbol_X 810024a0 t __pfx_rootfs_init_fs_context Suggested-by: Peter Zijlstra Signed-off-by: Zheng Yejian --- scripts/kallsyms.c | 101 +++- scripts/link-vmlinux.sh | 4 +- scripts/mksysmap| 2 +- 3 files changed, 102 insertions(+), 5 deletions(-) diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c index 6559a9802f6e..5c4cde864a04 100644 --- a/scripts/kallsyms.c +++ b/scripts/kallsyms.c @@ -35,6 +35,7 @@ struct sym_entry { struct sym_entry *next; unsigned long long addr; + unsigned long long size; unsigned int len; unsigned int seq; unsigned int start_pos; @@ -74,6 +75,7 @@ static int token_profit[0x1]; static unsigned char best_table[256][2]; static unsigned char best_table_len[256]; +static const char hole_symbol[] = "__hole_symbol_X"; static void usage(void) { @@ -130,8 +132,16 @@ static struct sym_entry *read_symbol(FILE *in, char **buf, size_t *buf_len) size_t len; ssize_t readlen; struct sym_entry *sym; + unsigned long long size = 0; errno = 0; + /* +* Example of expected symbol format: +* 1. symbol with size info: +*8170 01d7 T __startup_64 +* 2. 
symbol without size info: +*02a0 A text_size +*/ readlen = getline(buf, buf_len, in); if (readlen < 0) { if (errno) { @@ -145,9 +155,24 @@ static struct sym_entry *read_symbol(FILE *in, char **buf, size_t *buf_len) (*buf)[readlen - 1] = 0; addr = strtoull(*buf, , 16); + if (*buf == p || *p++ != ' ') { + fprintf(stderr, "line format error: unable to parse address\n"); + exit(EXIT_FAILURE); + } + + if (*p == '0') { + char *str = p; - if (*buf == p || *p++ != ' ' || !isascii((type = *p++)) || *p++ != ' ') { - fprintf(stderr, "line format error\n"); + size = strtoull(str, , 16); + if (str == p || *p++ != ' ') { + fprintf(stderr, "line format error: unable to parse size\n"); + exit(EXIT_FAILURE); + } + } + + type = *p++; + if (!isascii(type) || *p++ != ' ') { + fprintf(stderr, "line format error: unable to parse type\n"); exit(EXIT_FAILURE); } @@ -182,6 +207,7 @@ static struct sym_entry *read_symbol(FILE *in, char **buf, size_t *buf_len) exit(EXIT_FAILURE); } sym->addr = addr; + sym->size = size; sym->len = len; sym->sym[0] = type; strcpy(sym_name(sym), name); @@ -795,6 +821,76 @@ static void sort_symbols(void) qsort(table, table_cnt, sizeof(table[0]), compare_symbols); } +static int may_exist_hole_after_symbol(const struct sym_entry *se) +{ + char type = se->sym[0]; + + /* Only check text symbol or weak symbol */ + if (type != 't' && type != 'T' && + type != 'w' && type != 'W') + return 0; + /* Symbol without size has no hole */ + return se->size != 0; +} + +static struct sym_entry *gen_hole_symbol(unsigned long long addr) +{ + struct sym_entry *sym; + static size_t len = sizeof(hole_symbol); + + /* include type field */ + sym = malloc(sizeof(*sym) + len + 1); + if (!sym) { + fprintf(stderr, "unable to allocate memory for hole symbol\n"); +
[PATCH v5 5/8] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
This patch introduces DAMOS_MIGRATE_COLD action, which is similar to DAMOS_PAGEOUT, but migrate folios to the given 'target_nid' in the sysfs instead of swapping them out. The 'target_nid' sysfs knob informs the migration target node ID. Here is one of the example usage of this 'migrate_cold' action. $ cd /sys/kernel/mm/damon/admin/kdamonds/ $ cat contexts//schemes//action migrate_cold $ echo 2 > contexts//schemes//target_nid $ echo commit > state $ numactl -p 0 ./hot_cold 500M 600M & $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -- -- -- -- - 701 (hot_cold) 501 0601 1101 Since there are some common routines with pageout, many functions have similar logics between pageout and migrate cold. damon_pa_migrate_folio_list() is a minimized version of shrink_folio_list(). Signed-off-by: Honggyu Kim Signed-off-by: Hyeongtak Ji Signed-off-by: SeongJae Park --- include/linux/damon.h| 2 + mm/damon/paddr.c | 154 +++ mm/damon/sysfs-schemes.c | 1 + 3 files changed, 157 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 21d6b69a015c..56714b6eb0d7 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -105,6 +105,7 @@ struct damon_target { * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists. * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists. + * @DAMOS_MIGRATE_COLD:Migrate the regions prioritizing colder regions. * @DAMOS_STAT:Do nothing but count the stat. 
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions * @@ -122,6 +123,7 @@ enum damos_action { DAMOS_NOHUGEPAGE, DAMOS_LRU_PRIO, DAMOS_LRU_DEPRIO, + DAMOS_MIGRATE_COLD, DAMOS_STAT, /* Do nothing but only record the stat */ NR_DAMOS_ACTIONS, }; diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 18797c1b419b..882ae54af829 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -12,6 +12,9 @@ #include #include #include +#include +#include +#include #include "../internal.h" #include "ops-common.h" @@ -325,6 +328,153 @@ static unsigned long damon_pa_deactivate_pages(struct damon_region *r, return damon_pa_mark_accessed_or_deactivate(r, s, false); } +static unsigned int __damon_pa_migrate_folio_list( + struct list_head *migrate_folios, struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_succeeded; + nodemask_t allowed_mask = NODE_MASK_NONE; + struct migration_target_control mtc = { + /* +* Allocate from 'node', or fail quickly and quietly. +* When this happens, 'page' will likely just be discarded +* instead of migrated. 
+*/ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | + __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = _mask + }; + + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE) + return 0; + + if (list_empty(migrate_folios)) + return 0; + + /* Migration ignores all cpuset and mempolicy settings */ + migrate_pages(migrate_folios, alloc_migrate_folio, NULL, + (unsigned long), MIGRATE_ASYNC, MR_DAMON, + _succeeded); + + return nr_succeeded; +} + +static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list, + struct pglist_data *pgdat, + int target_nid) +{ + unsigned int nr_migrated = 0; + struct folio *folio; + LIST_HEAD(ret_folios); + LIST_HEAD(migrate_folios); + + while (!list_empty(folio_list)) { + struct folio *folio; + + cond_resched(); + + folio = lru_to_folio(folio_list); + list_del(>lru); + + if (!folio_trylock(folio)) + goto keep; + + /* Relocate its contents to another node. */ + list_add(>lru, _folios); + folio_unlock(folio); + continue; +keep: + list_add(>lru, _folios); + } + /* 'folio_list' is always empty here */ + + /* Migrate folios selected for migration */ + nr_migrated += __damon_pa_migrate_folio_list( + _folios, pgdat, target_nid); + /* +* Folios that could not be migrated are still in @migrate_folios. Add +* those back on @folio_list +*/ + if (!list_empty(_folios)) + list_splice_init(_folios, folio_list); + + try_to_unmap_flush(); + +
[PATCH v5 4/8] mm/migrate: add MR_DAMON to migrate_reason
The current patch series introduces DAMON based migration across NUMA nodes so it'd be better to have a new migrate_reason in trace events. Signed-off-by: Honggyu Kim Reviewed-by: SeongJae Park Signed-off-by: SeongJae Park --- include/linux/migrate_mode.h | 1 + include/trace/events/migrate.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index f37cc03f9369..cec36b7e7ced 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -29,6 +29,7 @@ enum migrate_reason { MR_CONTIG_RANGE, MR_LONGTERM_PIN, MR_DEMOTION, + MR_DAMON, MR_TYPES }; diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h index 0190ef725b43..cd01dd7b3640 100644 --- a/include/trace/events/migrate.h +++ b/include/trace/events/migrate.h @@ -22,7 +22,8 @@ EM( MR_NUMA_MISPLACED, "numa_misplaced") \ EM( MR_CONTIG_RANGE,"contig_range") \ EM( MR_LONGTERM_PIN,"longterm_pin") \ - EMe(MR_DEMOTION,"demotion") + EM( MR_DEMOTION,"demotion") \ + EMe(MR_DAMON, "damon") /* * First define the enums in the above macros to be exported to userspace -- 2.34.1
Re: [PATCH 2/8] tracing: do not trace kernel_text_address()
On Thu, 13 Jun 2024 15:11:07 +0800 Andy Chiu wrote: > kernel_text_address() and __kernel_text_address() are called in > arch_stack_walk() of riscv. This results in excess amount of un-related > traces when the kernel is compiled with CONFIG_TRACE_IRQFLAGS. The > situation worsens when function_graph is active, as it calls > local_irq_save/restore in each function's entry/exit. This patch adds > both functions to notrace, so they won't show up on the trace records. I rather not add notrace just because something is noisy. You can always just add: echo '*kernel_text_address' > /sys/kernel/tracing/set_ftrace_notrace and achieve the same result. -- Steve
[PATCH v5 6/8] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion
From: Hyeongtak Ji This patch introduces DAMOS_MIGRATE_HOT action, which is similar to DAMOS_MIGRATE_COLD, but proritizes hot pages. It migrates pages inside the given region to the 'target_nid' NUMA node in the sysfs. Here is one of the example usage of this 'migrate_hot' action. $ cd /sys/kernel/mm/damon/admin/kdamonds/ $ cat contexts//schemes//action migrate_hot $ echo 0 > contexts//schemes//target_nid $ echo commit > state $ numactl -p 2 ./hot_cold 500M 600M & $ numastat -c -p hot_cold Per-node process memory usage (in MBs) PID Node 0 Node 1 Node 2 Total -- -- -- -- - 701 (hot_cold) 501 0601 1101 Signed-off-by: Hyeongtak Ji Signed-off-by: Honggyu Kim Signed-off-by: SeongJae Park --- include/linux/damon.h| 2 ++ mm/damon/paddr.c | 3 +++ mm/damon/sysfs-schemes.c | 1 + 3 files changed, 6 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 56714b6eb0d7..3d62d98d6359 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -105,6 +105,7 @@ struct damon_target { * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists. * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists. + * @DAMOS_MIGRATE_HOT: Migrate the regions prioritizing warmer regions. * @DAMOS_MIGRATE_COLD:Migrate the regions prioritizing colder regions. * @DAMOS_STAT:Do nothing but count the stat. 
* @NR_DAMOS_ACTIONS: Total number of DAMOS actions @@ -123,6 +124,7 @@ enum damos_action { DAMOS_NOHUGEPAGE, DAMOS_LRU_PRIO, DAMOS_LRU_DEPRIO, + DAMOS_MIGRATE_HOT, DAMOS_MIGRATE_COLD, DAMOS_STAT, /* Do nothing but only record the stat */ NR_DAMOS_ACTIONS, diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index 882ae54af829..af6aac388a43 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -486,6 +486,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx, return damon_pa_mark_accessed(r, scheme); case DAMOS_LRU_DEPRIO: return damon_pa_deactivate_pages(r, scheme); + case DAMOS_MIGRATE_HOT: case DAMOS_MIGRATE_COLD: return damon_pa_migrate(r, scheme); case DAMOS_STAT: @@ -508,6 +509,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context, return damon_hot_score(context, r, scheme); case DAMOS_LRU_DEPRIO: return damon_cold_score(context, r, scheme); + case DAMOS_MIGRATE_HOT: + return damon_hot_score(context, r, scheme); case DAMOS_MIGRATE_COLD: return damon_cold_score(context, r, scheme); default: diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 880015d5b5ea..66fccfa776d7 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -1458,6 +1458,7 @@ static const char * const damon_sysfs_damos_action_strs[] = { "nohugepage", "lru_prio", "lru_deprio", + "migrate_hot", "migrate_cold", "stat", }; -- 2.34.1
[PATCH v5 7/8] Docs/admin-guide/mm/damon/usage: add missing actions
"lru_prio" and "lru_deprio" are missing in the damos action list so they have to be added properly at damon usage document. Signed-off-by: Honggyu Kim --- Documentation/admin-guide/mm/damon/usage.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index e58ceb89ea2a..7bff54963975 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -757,7 +757,9 @@ list on :ref:`design doc `. - 2: ``pageout`` - 3: ``hugepage`` - 4: ``nohugepage`` - - 5: ``stat`` + - 5: ``lru_prio`` + - 6: ``lru_deprio`` + - 7: ``stat`` Quota ~ -- 2.34.1
[PATCH v5 0/8] DAMON based tiered memory management for CXL memory
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously posted at [1]. It says there is no implementation of the demote/promote DAMOS action are made. This patch series is about its implementation for physical address space so that this scheme can be applied in system wide level. Changes from RFC v4: https://lore.kernel.org/20240512175447.75943-1...@kernel.org 1. Add usage and design documents 2. Rename alloc_demote_folio to alloc_migrate_folio 3. Add evaluation results with "demotion_enabled" true 4. Rebase based on v6.10-rc3 Changes from RFC v3: https://lore.kernel.org/20240405060858.2818-1-honggyu@sk.com 0. updated from v3 and posted by SJ on behalf of Honggyu under his approval. 1. Do not reuse damon_pa_pageout() and drop 'enum migration_mode' 2. Drop vmstat change 3. Drop unnecessary page reference check Changes from RFC v2: https://lore.kernel.org/20240226140555.1615-1-honggyu@sk.com 1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}. 2. Create 'target_nid' to set the migration target node instead of depending on node distance based information. 3. Instead of having page level access check in this patch series, delegate the job to a new DAMOS filter type YOUNG[2]. 4. Introduce vmstat counters "damon_migrate_{hot,cold}". 5. Rebase from v6.7 to v6.8. Changes from RFC: https://lore.kernel.org/20240115045253.1775-1-honggyu@sk.com 1. Move most of implementation from mm/vmscan.c to mm/damon/paddr.c. 2. Simplify some functions of vmscan.c and used in paddr.c, but need to be reviewed more in depth. 3. Refactor most functions for common usage for both promote and demote actions and introduce an enum migration_mode for its control. 4. Add "target_nid" sysfs knob for migration destination node for both promote and demote actions. 5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move then even above DAMOS_STAT. 
Introduction With the advent of CXL/PCIe attached DRAM, which will be called simply as CXL memory in this cover letter, some systems are becoming more heterogeneous having memory systems with different latency and bandwidth characteristics. They are usually handled as different NUMA nodes in separate memory tiers and CXL memory is used as slow tiers because of its protocol overhead compared to local DRAM. In this kind of systems, we need to be careful placing memory pages on proper NUMA nodes based on the memory access frequency. Otherwise, some frequently accessed pages might reside on slow tiers and it makes performance degradation unexpectedly. Moreover, the memory access patterns can be changed at runtime. To handle this problem, we need a way to monitor the memory access patterns and migrate pages based on their access temperature. The DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation Schemes) can be useful features for monitoring and migrating pages. DAMOS provides multiple actions based on DAMON monitoring results and it can be used for proactive reclaim, which means swapping cold pages out with DAMOS_PAGEOUT action, but it doesn't support migration actions such as demotion and promotion between tiered memory nodes. This series supports two new DAMOS actions; DAMOS_MIGRATE_HOT for promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast tiers. This prevents hot pages from being stuck on slow tiers, which makes performance degradation and cold pages can be proactively demoted to slow tiers so that the system can increase the chance to allocate more hot pages to fast tiers. The DAMON provides various tuning knobs but we found that the proactive demotion for cold pages is especially useful when the system is running out of memory on its fast tier nodes. 
Our evaluation result shows that it reduces the performance slowdown compared to the default memory policy from 17~18% to 4~5% when the system runs under high memory pressure on its fast tier DRAM nodes. DAMON configuration === The specific DAMON configuration doesn't have to be in the scope of this patch series, but some rough idea is better to be shared to explain the evaluation result. The DAMON provides many knobs for fine tuning but its configuration file is generated by HMSDK[3]. It includes gen_config.py script that generates a json file with the full config of DAMON knobs and it creates multiple kdamonds for each NUMA node when the DAMON is enabled so that it can run hot/cold based migration for tiered memory. Evaluation Workload === The performance evaluation is done with redis[4], which is a widely used in-memory database and the memory access patterns are generated via YCSB[5]. We have measured two different workloads with zipfian and latest distributions but their configs are slightly modified to make memory usage higher and execution time longer for better evaluation. The idea of evaluation using these
[PATCH v5 3/8] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes
From: Hyeongtak Ji This patch adds target_nid under /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes// The 'target_nid' can be used as the destination node for DAMOS actions such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches. Signed-off-by: Hyeongtak Ji Signed-off-by: Honggyu Kim Signed-off-by: SeongJae Park --- include/linux/damon.h| 11 ++- mm/damon/core.c | 5 - mm/damon/dbgfs.c | 2 +- mm/damon/lru_sort.c | 3 ++- mm/damon/reclaim.c | 3 ++- mm/damon/sysfs-schemes.c | 33 - 6 files changed, 51 insertions(+), 6 deletions(-) diff --git a/include/linux/damon.h b/include/linux/damon.h index f7da65e1ac04..21d6b69a015c 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -374,6 +374,7 @@ struct damos_access_pattern { * @apply_interval_us: The time between applying the @action. * @quota: Control the aggressiveness of this scheme. * @wmarks:Watermarks for automated (in)activation of this scheme. + * @target_nid:Destination node if @action is "migrate_{hot,cold}". * @filters: Additional set of damos_filter for * @stat: Statistics of this scheme. * @list: List head for siblings. @@ -389,6 +390,10 @@ struct damos_access_pattern { * monitoring context are inactive, DAMON stops monitoring either, and just * repeatedly checks the watermarks. * + * @target_nid is used to set the migration target node for migrate_hot or + * migrate_cold actions, which means it's only meaningful when @action is either + * "migrate_hot" or "migrate_cold". 
+ * * Before applying the to a memory region, damon_operations * implementation could check pages of the region and skip to respect * @@ -410,6 +415,9 @@ struct damos { /* public: */ struct damos_quota quota; struct damos_watermarks wmarks; + union { + int target_nid; + }; struct list_head filters; struct damos_stat stat; struct list_head list; @@ -726,7 +734,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, enum damos_action action, unsigned long apply_interval_us, struct damos_quota *quota, - struct damos_watermarks *wmarks); + struct damos_watermarks *wmarks, + int target_nid); void damon_add_scheme(struct damon_ctx *ctx, struct damos *s); void damon_destroy_scheme(struct damos *s); diff --git a/mm/damon/core.c b/mm/damon/core.c index 6392f1cc97a3..c0ec5be4f56e 100644 --- a/mm/damon/core.c +++ b/mm/damon/core.c @@ -354,7 +354,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, enum damos_action action, unsigned long apply_interval_us, struct damos_quota *quota, - struct damos_watermarks *wmarks) + struct damos_watermarks *wmarks, + int target_nid) { struct damos *scheme; @@ -381,6 +382,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern *pattern, scheme->wmarks = *wmarks; scheme->wmarks.activated = true; + scheme->target_nid = target_nid; + return scheme; } diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index 2461cfe2e968..51a6f1cac385 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -281,7 +281,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, pos += parsed; scheme = damon_new_scheme(, action, 0, , - ); + , NUMA_NO_NODE); if (!scheme) goto fail; diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c index 3de2916a65c3..3775f0f2743d 100644 --- a/mm/damon/lru_sort.c +++ b/mm/damon/lru_sort.c @@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme( /* under the quota. */ , /* (De)activate this according to the watermarks. 
*/ - _lru_sort_wmarks); + _lru_sort_wmarks, + NUMA_NO_NODE); } /* Create a DAMON-based operation scheme for hot memory regions */ diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c index 9bd341d62b4c..a05ccb41749b 100644 --- a/mm/damon/reclaim.c +++ b/mm/damon/reclaim.c @@ -177,7 +177,8 @@ static struct damos *damon_reclaim_new_scheme(void) /* under the quota. */ _reclaim_quota, /* (De)activate this according to the watermarks. */ - _reclaim_wmarks); + _reclaim_wmarks, + NUMA_NO_NODE); } static void damon_reclaim_copy_quota_status(struct